Comparative analysis of word embeddings in assessing semantic similarity of complex sentences

Paper title

Comparative analysis of word embeddings in assessing semantic similarity of complex sentences

Authors

Dhivya Chandrasekaran, Vijay Mago

Abstract

Semantic textual similarity is one of the open research challenges in the field of Natural Language Processing. Extensive research has been carried out in this field, and recent transformer-based models achieve near-perfect results on existing benchmark datasets such as the STS dataset and the SICK dataset. In this paper, we study the sentences in these datasets and analyze the sensitivity of various word embeddings with respect to the complexity of the sentences. We build a complex sentences dataset comprising 50 sentence pairs with associated semantic similarity values provided by 15 human annotators. Readability analysis is performed to highlight the increase in complexity of the sentences in the existing benchmark datasets and those in the proposed dataset. Further, we perform a comparative analysis of the performance of various word embeddings and language models on the existing benchmark datasets and the proposed dataset. The results show that the increase in complexity of the sentences has a significant impact on the performance of the embedding models, resulting in a 10-20% decrease in Pearson's and Spearman's correlation.
