Paper title

An Inclusive Notion of Text

Authors

Ilia Kuznetsov, Iryna Gurevych

Abstract

Natural language processing (NLP) researchers develop models of grammar, meaning and communication based on written text. Due to task and data differences, what is considered text can vary substantially across studies. A conceptual framework for systematically capturing these differences is lacking. We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP. Towards that goal, we propose common terminology to discuss the production and transformation of textual data, and introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling. We apply this taxonomy to survey existing work that extends the notion of text beyond the conservative language-centered view. We outline key desiderata and challenges of the emerging inclusive approach to text in NLP, and suggest community-level reporting as a crucial next step to consolidate the discussion.
