Does the dataset meet your expectations? Explaining sample representation in image data

Paper title

Does the dataset meet your expectations? Explaining sample representation in image data

Authors

Parthasarathy, Dhasarathy; Johansson, Anton

Abstract

Since the behavior of a neural network model is adversely affected by a lack of diversity in training data, we present a method that identifies and explains such deficiencies. When a dataset is labeled, we note that annotations alone are capable of providing a human-interpretable summary of sample diversity. This allows explaining any lack of diversity as the mismatch found when comparing the actual distribution of annotations in the dataset with an expected distribution of annotations, specified manually to capture essential label diversity. While, in many practical cases, labeling (samples → annotations) is expensive, its inverse, simulation (annotations → samples), can be cheaper. By mapping the expected distribution of annotations into test samples using parametric simulation, we present a method that explains sample representation using the mismatch in diversity between simulated and collected data. We then apply the method to examine a dataset of geometric shapes to qualitatively and quantitatively explain sample representation in terms of comprehensible aspects such as size, position, and pixel brightness.
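
The core comparison described in the abstract, actual versus expected annotation distributions, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name a mismatch measure, so total variation distance over per-aspect histograms is assumed here, and the aspect names (`size`, `x`, `brightness`), their ranges, the uniform expected distribution, and the helper names are all hypothetical. The simulation step (annotations → samples) is not covered.

```python
import random


def histogram(values, lo, hi, bins):
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp the index so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[max(idx, 0)] += 1
    total = len(values)
    return [c / total for c in counts]


def total_variation(p, q):
    """Total variation distance: 0 for identical, 1 for disjoint distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def explain_mismatch(actual, expected, ranges, bins=10):
    """Per-aspect mismatch between actual and expected annotation distributions."""
    report = {}
    for aspect, (lo, hi) in ranges.items():
        p = histogram([a[aspect] for a in actual], lo, hi, bins)
        q = histogram([e[aspect] for e in expected], lo, hi, bins)
        report[aspect] = total_variation(p, q)
    return report


rng = random.Random(0)

# Hypothetical annotation aspects, each normalized to [0, 1].
ranges = {"size": (0.0, 1.0), "x": (0.0, 1.0), "brightness": (0.0, 1.0)}

# Expected annotations: every aspect covered uniformly (assumed specification).
expected = [
    {k: rng.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
    for _ in range(5000)
]

# Actual annotations: a hypothetical collected dataset containing only small shapes.
actual = [
    {
        "size": rng.uniform(0.0, 0.3),
        "x": rng.uniform(0.0, 1.0),
        "brightness": rng.uniform(0.0, 1.0),
    }
    for _ in range(5000)
]

report = explain_mismatch(actual, expected, ranges)
# List aspects from most to least under-represented.
for aspect, tv in sorted(report.items(), key=lambda kv: -kv[1]):
    print(f"{aspect}: {tv:.2f}")
```

In this toy setup the `size` aspect should stand out with a much larger distance than `x` or `brightness`, which is the kind of human-readable explanation ("large shapes are missing") that annotation-level comparison makes possible.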
