Towards Lexical Gender Inference: A Scalable Methodology using Online Databases

Paper title

Towards Lexical Gender Inference: A Scalable Methodology using Online Databases

Authors

Marion Bartl, Susan Leavy

Abstract

This paper presents a new method for automatically detecting words with lexical gender in large-scale language datasets. Currently, the evaluation of gender bias in natural language processing relies on manually compiled lexicons of gendered expressions, such as pronouns ('he', 'she', etc.) and nouns with lexical gender ('mother', 'boyfriend', 'policewoman', etc.). However, manually compiled lists become static if they are not periodically updated, and they often involve value judgments by individual annotators and researchers. Moreover, terms not included in a list fall outside the scope of analysis. To address these issues, we devised a scalable, dictionary-based method to automatically detect lexical gender that can provide a dynamic, up-to-date analysis with high coverage. Our approach reaches over 80% accuracy in determining the lexical gender of words, both for nouns retrieved randomly from a Wikipedia sample and for a list of gendered words used in previous research.
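The abstract does not spell out the detection procedure, so the following is only an illustrative sketch of how a dictionary-based approach might work: look up a word's definition and check which gendered indicator words it contains. The seed lists, the `TOY_DICTIONARY` contents, and the function names are all hypothetical and are not taken from the paper; the in-memory dictionary stands in for the online databases the authors query.

```python
import re

# Hypothetical seed lists of gendered indicator words (an assumption,
# not the lists used in the paper).
MASCULINE = {"man", "men", "male", "boy", "he", "him", "his",
             "father", "son", "brother", "husband"}
FEMININE = {"woman", "women", "female", "girl", "she", "her", "hers",
            "mother", "daughter", "sister", "wife"}

# Toy stand-in for an online dictionary; the definitions are made up
# for illustration.
TOY_DICTIONARY = {
    "policewoman": "a woman who is a member of a police force",
    "boyfriend": "a male friend or regular male companion in a romantic relationship",
    "teacher": "a person whose occupation is to instruct",
}


def infer_lexical_gender(definition: str) -> str:
    """Label a definition by counting gendered indicator words in it."""
    # Tokenize into whole lowercase words so "female" never matches "male".
    tokens = re.findall(r"[a-z]+", definition.lower())
    masculine_hits = sum(token in MASCULINE for token in tokens)
    feminine_hits = sum(token in FEMININE for token in tokens)
    if masculine_hits > feminine_hits:
        return "masculine"
    if feminine_hits > masculine_hits:
        return "feminine"
    return "neutral"


def classify(word: str, dictionary: dict = TOY_DICTIONARY) -> str:
    """Look up a word and infer its lexical gender from the definition."""
    definition = dictionary.get(word)
    if definition is None:
        return "unknown"  # word not covered by the dictionary
    return infer_lexical_gender(definition)
```

Because the label comes from the definition text instead of a fixed word list, swapping in a regularly updated dictionary source would extend coverage to new terms without manual recompilation, which is the property the abstract emphasizes.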

Paper title