Paper Title
Quasi-orthonormal Encoding for Machine Learning Applications
Paper Authors
Paper Abstract
Most machine learning models, especially artificial neural networks, require numerical, not categorical data. We briefly describe the advantages and disadvantages of common encoding schemes. For example, one-hot encoding is commonly used for attributes with a few unrelated categories, and word embeddings for attributes with many related categories (e.g., words). Neither is suitable for encoding attributes with many unrelated categories, such as diagnosis codes in healthcare applications. Applying one-hot encoding to diagnosis codes, for example, can result in extreme high-dimension, low-sample-size problems or artificially induce machine learning artifacts, not to mention the explosion of computing resources needed. Quasi-orthonormal encoding (QOE) fills the gap. We briefly show how QOE compares to one-hot encoding. We provide example code showing how to implement QOE using popular ML libraries such as TensorFlow and PyTorch, along with a demonstration of QOE applied to MNIST handwriting samples.
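The abstract describes QOE as replacing the K mutually orthogonal one-hot vectors with K nearly orthonormal unit vectors in a space of much lower dimension, then using those fixed vectors as the numerical representation of each category. The following is a minimal PyTorch sketch of that idea, not the paper's implementation: it assumes random unit vectors as a stand-in for a properly constructed quasi-orthonormal set, and the names `make_qo_vectors` and `QOEmbedding` are illustrative, not taken from the paper or any library.

```python
# Sketch of quasi-orthonormal encoding (QOE) as a fixed, non-trainable embedding.
# Assumption: random unit vectors approximate a quasi-orthonormal set; a real
# application might substitute vectors from a purpose-built spherical code.
import torch
import torch.nn as nn


def make_qo_vectors(num_categories: int, dim: int, seed: int = 0) -> torch.Tensor:
    """Return `num_categories` unit vectors in R^dim.

    For sufficiently large `dim`, independent random unit vectors are nearly
    orthogonal with high probability, giving a quasi-orthonormal set.
    """
    g = torch.Generator().manual_seed(seed)
    vecs = torch.randn(num_categories, dim, generator=g)
    return vecs / vecs.norm(dim=1, keepdim=True)


class QOEmbedding(nn.Module):
    """Fixed (frozen) embedding whose rows are quasi-orthonormal code vectors."""

    def __init__(self, num_categories: int, dim: int):
        super().__init__()
        codes = make_qo_vectors(num_categories, dim)
        self.embedding = nn.Embedding.from_pretrained(codes, freeze=True)

    def forward(self, category_ids: torch.Tensor) -> torch.Tensor:
        return self.embedding(category_ids)


if __name__ == "__main__":
    # Encode 1000 unrelated categories (e.g., diagnosis codes) in 64 dimensions
    # instead of the 1000 dimensions that one-hot encoding would require.
    enc = QOEmbedding(num_categories=1000, dim=64)
    ids = torch.tensor([3, 17, 999])
    x = enc(ids)                    # shape: (3, 64)
    print(x.shape, x.norm(dim=1))   # each code is (approximately) unit length
```

The encoded vectors can be concatenated with other numerical features and fed to any downstream model; because the codes are frozen, no embedding parameters need to be learned, which is the point of contrast with trainable word embeddings.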