Informational Space of Meaning for Scientific Texts

Paper title

Informational Space of Meaning for Scientific Texts

Authors

Suzen, Neslihan; Mirkes, Evgeny M.; Gorban, Alexander N.

Abstract

In Natural Language Processing, automatically extracting the meaning of texts is an important problem. Our focus is the computational analysis of the meaning of short scientific texts (abstracts or brief reports). In this paper, a vector space model is developed for quantifying the meaning of words and texts. We introduce the Meaning Space, in which the meaning of a word is represented by a vector of Relative Information Gain (RIG) about the subject categories that the text belongs to, which can be obtained from observing the word in the text. This new approach is applied to construct the Meaning Space based on the Leicester Scientific Corpus (LSC) and the Leicester Scientific Dictionary-Core (LScDC). The LSC is a scientific corpus of 1,673,350 abstracts, and the LScDC is a scientific dictionary whose words are extracted from the LSC. Each text in the LSC belongs to at least one of 252 subject categories of Web of Science (WoS). These categories are used to construct the vectors of information gains. The Meaning Space is described and statistically analysed for the LSC with the LScDC. The usefulness of the proposed representation model is evaluated through the top-ranked words in each category, where the n most informative words are ranked. We demonstrate that RIG-based word ranking is much more useful than ranking based on raw word frequency in determining the science-specific meaning and importance of a word. The proposed RIG-based model is shown to be able to highlight topic-specific words in categories. The most informative words are presented for all 252 categories. The new scientific dictionary and the 103,998 × 252 Word-Category RIG Matrix are available online. Analysis of the Meaning Space provides a tool for further exploring how to quantify the meaning of a text using more complex and context-dependent meaning models that use the co-occurrence of words and their combinations.
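
To make the idea of a word's RIG vector concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state the formula, so this assumes the common definition RIG = (H(C) − H(C | W)) / H(C), where C is binary membership of a text in one category and W is binary presence of the word in the text. The function names and the toy corpus are hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a binary variable with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def relative_information_gain(texts, word, category):
    """RIG about membership in `category` gained by observing `word`.

    texts: list of (set_of_words, set_of_categories) pairs.
    Assumed definition: (H(C) - H(C | W)) / H(C).
    """
    n = len(texts)
    in_cat = [category in cats for _, cats in texts]
    has_word = [word in words for words, _ in texts]

    h_c = binary_entropy(sum(in_cat) / n)
    if h_c == 0.0:
        # Category membership is already certain; nothing to gain.
        return 0.0

    # Conditional entropy H(C | W), weighted over word present / absent.
    h_c_given_w = 0.0
    for present in (True, False):
        subset = [c for c, w in zip(in_cat, has_word) if w == present]
        if subset:
            h_c_given_w += (len(subset) / n) * binary_entropy(sum(subset) / len(subset))

    return (h_c - h_c_given_w) / h_c


def meaning_vector(texts, word, categories):
    """One RIG value per category: the word's point in the Meaning Space."""
    return [relative_information_gain(texts, word, c) for c in categories]


# Hypothetical toy corpus: four "abstracts" in two categories.
toy_texts = [
    ({"quantum", "energy"}, {"Physics"}),
    ({"quantum", "field"}, {"Physics"}),
    ({"cell", "energy"}, {"Biology"}),
    ({"cell", "protein"}, {"Biology"}),
]
toy_categories = ["Physics", "Biology"]

quantum_vec = meaning_vector(toy_texts, "quantum", toy_categories)
energy_vec = meaning_vector(toy_texts, "energy", toy_categories)
```

In this toy corpus, "quantum" fully determines category membership while "energy" appears equally often in both categories, so under the assumed definition the first word is maximally informative and the second carries no information, even though both have the same raw frequency.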
