Paper Title
Code and Named Entity Recognition in StackOverflow
Paper Authors
Paper Abstract
There is an increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming-related questions written by 8.5 million users. Meanwhile, there is still a lack of fundamental NLP techniques for identifying code tokens or software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We trained in-domain BERT representations (BERTOverflow) on 152 million sentences from StackOverflow, yielding an absolute improvement of +10 F$_1$ over off-the-shelf BERT. We also present the SoftNER model, which achieves an overall 79.10 F$_1$ score for code and named entity recognition on StackOverflow data. Our SoftNER model combines a context-independent code token classifier with corpus-level features to improve the BERT-based tagging model. Our code and data are available at: https://github.com/jeniyat/StackOverflowNER/
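To illustrate the idea of a context-independent code token classifier mentioned in the abstract, the following is a minimal sketch that flags likely code tokens using surface-level character patterns (snake_case, camelCase, function-call parentheses, dotted names). This is an illustrative heuristic only, not the paper's actual learned classifier; all function names here are made up for the example.

```python
import re

# Illustrative heuristic: decide from the token's surface form alone
# (no sentence context) whether it looks like a code token.
# Assumption: whitespace tokenization; the paper's classifier is learned,
# this rule-based version only demonstrates the concept.
CODE_PATTERNS = [
    r".*_.*",                             # snake_case identifiers
    r".*[a-z][A-Z].*",                    # camelCase identifiers
    r".*\(\)",                            # bare function calls, e.g. foo()
    r"[A-Za-z_][\w.]*\.[A-Za-z_]\w*",     # dotted names, e.g. os.path.join
]

def looks_like_code(token: str) -> bool:
    """Return True if the token matches any code-like surface pattern."""
    return any(re.fullmatch(p, token) for p in CODE_PATTERNS)

sentence = "Use os.path.join or the joinPaths() helper instead of string concat"
code_tokens = [t for t in sentence.split() if looks_like_code(t)]
print(code_tokens)  # → ['os.path.join', 'joinPaths()']
```

In the full SoftNER model, the output of such a context-independent classifier is used as an additional feature alongside corpus-level features to strengthen the BERT-based sequence tagger.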