Paper title
Wasserstein t-SNE
Paper authors
Paper abstract
Scientific datasets often have a hierarchical structure: for example, in surveys, individual participants (samples) might be grouped at a higher level (units), such as their geographical region. In these settings, the interest is often in exploring the structure on the unit level rather than on the sample level. Units can be compared based on the distance between their means, but this ignores the within-unit distribution of samples. Here we develop an approach for exploratory analysis of hierarchical datasets using the Wasserstein distance metric, which takes into account the shapes of within-unit distributions. We use t-SNE to construct 2D embeddings of the units, based on the matrix of pairwise Wasserstein distances between them. The distance matrix can be efficiently computed by approximating each unit with a Gaussian distribution, but we also provide a scalable method to compute exact Wasserstein distances. We use synthetic data to demonstrate the effectiveness of our Wasserstein t-SNE, and apply it to data from the 2017 German parliamentary election, considering polling stations as samples and voting districts as units. The resulting embedding uncovers meaningful structure in the data.
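The Gaussian approximation mentioned in the abstract can be illustrated with the standard closed-form 2-Wasserstein distance between two Gaussians, W2^2 = ||m_a - m_b||^2 + tr(S_a + S_b - 2 (S_a^{1/2} S_b S_a^{1/2})^{1/2}). The following is a minimal sketch under that assumption, not the authors' implementation; the function names `psd_sqrt`, `gaussian_w2`, and `unit_distance_matrix` are hypothetical, and the exact (non-Gaussian) Wasserstein computation is not shown.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues from round-off
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2(mean_a, cov_a, mean_b, cov_b):
    """2-Wasserstein distance between N(mean_a, cov_a) and N(mean_b, cov_b)."""
    root_a = psd_sqrt(cov_a)
    cross = psd_sqrt(root_a @ cov_b @ root_a)
    sq = np.sum((mean_a - mean_b) ** 2) + np.trace(cov_a + cov_b - 2.0 * cross)
    return float(np.sqrt(max(sq, 0.0)))  # guard against a slightly negative value


def unit_distance_matrix(units):
    """Pairwise Gaussian W2 distances between units.

    `units` is a list of arrays, each of shape (n_samples, n_features);
    every unit is summarized by its sample mean and sample covariance.
    """
    stats = [(u.mean(axis=0), np.atleast_2d(np.cov(u, rowvar=False))) for u in units]
    n = len(stats)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = gaussian_w2(*stats[i], *stats[j])
    return dist
```

The resulting symmetric matrix could then be handed to any t-SNE implementation that accepts precomputed distances, for example scikit-learn's `TSNE(metric="precomputed", init="random")`, to obtain a 2D embedding of the units. Because each unit is reduced to a mean and a covariance matrix, the cost per pair depends only on the number of features and not on the number of samples in each unit.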