Paper Title

SocioProbe: What, When, and Where Language Models Learn about Sociodemographics

Paper Authors

Lauscher, Anne, Bianchi, Federico, Bowman, Samuel, Hovy, Dirk

Paper Abstract

Pre-trained language models (PLMs) have outperformed other NLP models on a wide range of tasks. Opting for a more thorough understanding of their capabilities and inner workings, researchers have established the extent to which they capture lower-level knowledge like grammaticality, and mid-level semantic knowledge like factual understanding. However, there is still little understanding of their knowledge of higher-level aspects of language. In particular, despite the importance of sociodemographic aspects in shaping our language, the questions of whether, where, and how PLMs encode these aspects, e.g., gender or age, are still unexplored. We address this research gap by probing the sociodemographic knowledge of different single-GPU PLMs on multiple English data sets via traditional classifier probing and information-theoretic minimum description length probing. Our results show that PLMs do encode these sociodemographics, and that this knowledge is sometimes spread across the layers of some of the tested PLMs. We further conduct a multilingual analysis and investigate the effect of supplementary training to further explore to what extent, where, and with what amount of pre-training data the knowledge is encoded. Our overall results indicate that sociodemographic knowledge is still a major challenge for NLP. PLMs require large amounts of pre-training data to acquire the knowledge, and models that excel in general language understanding do not seem to possess more knowledge about these aspects.
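The classifier probing the abstract refers to can be illustrated with a minimal sketch: a logistic-regression probe is trained on frozen layer representations, and its accuracy indicates how linearly decodable an attribute is at that layer. The data below is a synthetic stand-in (random features with an injected label signal), not the paper's actual PLM embeddings or data sets, and the in-sample evaluation is a simplification for brevity.

```python
# Minimal sketch of classifier probing on frozen representations.
# Synthetic stand-in data: each "layer" is random features whose first
# dimension carries a controllable amount of label signal.
import numpy as np

rng = np.random.default_rng(0)

def make_layer(n=400, d=16, signal=0.5):
    # y: binary sociodemographic attribute (e.g., a binary age group)
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, d))
    X[:, 0] += signal * (2 * y - 1)  # stronger signal = attribute more encoded
    return X, y

def probe_accuracy(X, y, epochs=200, lr=0.1):
    # A plain logistic-regression probe trained by gradient descent;
    # the representation X stays frozen, only w and b are learned.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    # In-sample accuracy, kept simple for this sketch
    return float(((X @ w + b > 0) == y).mean())

# Hypothetical layers carrying increasing amounts of sociodemographic signal
accs = [probe_accuracy(*make_layer(signal=s)) for s in (0.1, 0.5, 1.0)]
print([round(a, 2) for a in accs])
```

Comparing probe accuracies across layers is what lets one say where in the network an attribute is encoded; the MDL variant mentioned in the abstract additionally accounts for probe complexity rather than relying on raw accuracy alone.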
