The Data Representativeness Criterion: Predicting the Performance of Supervised Classification Based on Data Set Similarity

Paper Title

The Data Representativeness Criterion: Predicting the Performance of Supervised Classification Based on Data Set Similarity

Authors

Evelien Schat, Rens van de Schoot, Wouter M. Kouw, Duco Veen, Adriënne M. Mendrik

Abstract

In a broad range of fields, it may be desirable to reuse a supervised classification algorithm and apply it to a new data set. However, such an algorithm generalizes, and thus achieves similar classification performance, only when the training data used to build it is similar to the new unseen data one wishes to apply it to. It is often unknown in advance how an algorithm will perform on new unseen data, which is a crucial reason for not deploying an algorithm at all. Therefore, tools are needed to measure the similarity of data sets. In this paper, we propose the Data Representativeness Criterion (DRC) to determine how representative a training data set is of a new unseen data set. We present a proof of principle to see whether the DRC can quantify the similarity of data sets and whether the DRC relates to the performance of a supervised classification algorithm. We compared a number of magnetic resonance imaging (MRI) data sets whose differences in acquisition parameters range from subtle to severe. Results indicate that, based on the similarity of data sets, the DRC is able to give an indication as to when the performance of a supervised classifier decreases. The strictness of the DRC can be set by the user, depending on what one considers to be an acceptable underperformance.
