Paper Title
Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections
Paper Authors
Paper Abstract
Summarizing data samples by quantitative measures has a long history, with descriptive statistics being a case in point. However, as natural language processing methods flourish, there are still insufficient characteristic metrics to describe a collection of texts in terms of the words, sentences, or paragraphs they comprise. In this work, we propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds desired properties and resonates with human intuitions. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with text classification performance of a renowned model, BERT, which could inspire future applications.