Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections

Paper title

Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections

Authors

Yi-An Lai, Xuan Zhu, Yi Zhang, Mona Diab

Abstract

Summarizing data samples by quantitative measures has a long history, with descriptive statistics being a case in point. However, as natural language processing methods flourish, there are still insufficient characteristic metrics to describe a collection of texts in terms of the words, sentences, or paragraphs they comprise. In this work, we propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds desired properties and resonates with human intuitions. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with text classification performance of a renowned model, BERT, which could inspire future applications.
