Utilizing Out-Domain Datasets to Enhance Multi-Task Citation Analysis

Paper Title

Utilizing Out-Domain Datasets to Enhance Multi-Task Citation Analysis

Authors

Dominique Mercier, Syed Tahseen Raza Rizvi, Vikas Rajashekar, Sheraz Ahmed, Andreas Dengel

Abstract

Citations are generally analyzed using only quantitative measures, excluding qualitative aspects such as sentiment and intent. However, qualitative aspects provide deeper insight into the impact of a scientific research artifact and make it possible to focus on relevant literature free from the bias associated with quantitative measures. Papers can therefore be ranked and categorized based on their sentiment and intent. For this purpose, larger citation sentiment datasets are required. From a time and cost perspective, however, curating a large citation sentiment dataset is a challenging task. In particular, citation sentiment analysis suffers from both data scarcity and the tremendous cost of dataset annotation. To overcome the bottleneck of data scarcity in the citation analysis domain, we explore the impact of out-domain data during training on model performance. Our results emphasize that the choice of scheduling method should depend on the use case. We empirically found that a model trained using sequential data scheduling is more suitable for domain-specific use cases. Conversely, shuffled data feeding achieves better performance on a cross-domain task. Based on our findings, we propose an end-to-end trainable multi-task model that covers both sentiment and intent analysis and utilizes out-domain datasets to overcome data scarcity.
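The two data scheduling strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not define them precisely, so the following is an assumption: "sequential" is taken to mean feeding the datasets one after another (for example out-domain first, then in-domain), and "shuffled" is taken to mean mixing all examples into a single randomly ordered stream. The function names, the `Example` type, and the toy data are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical example type: (citation or review text, label).
Example = Tuple[str, str]


def sequential_schedule(datasets: Sequence[List[Example]]) -> List[Example]:
    """Assumed 'sequential' scheduling: keep datasets in the given order,
    so the model sees all of one dataset before the next begins."""
    ordered: List[Example] = []
    for dataset in datasets:
        ordered.extend(dataset)
    return ordered


def shuffled_schedule(datasets: Sequence[List[Example]], seed: int = 0) -> List[Example]:
    """Assumed 'shuffled' scheduling: pool every example from every dataset
    and shuffle the pool, so domains are interleaved during training."""
    mixed: List[Example] = [example for dataset in datasets for example in dataset]
    random.Random(seed).shuffle(mixed)  # seeded for reproducibility
    return mixed


# Toy data, invented purely for illustration.
out_domain = [("great movie", "positive"), ("dull plot", "negative")]
in_domain = [("builds on the strong results of [3]", "positive"),
             ("the approach in [7] fails to generalize", "negative")]

sequential_stream = sequential_schedule([out_domain, in_domain])
shuffled_stream = shuffled_schedule([out_domain, in_domain], seed=42)
```

Under these assumptions, the sequential stream ends with in-domain examples, which is one plausible reason such a schedule could favor domain-specific use, while the shuffled stream exposes the model to both domains throughout training. The abstract reports the empirical outcome only; this explanation is a guess rather than a claim made by the authors.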
