Paper Title
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
Paper Authors
Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, Andrea Janes
Paper Abstract
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.
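The open-vocabulary approach described in the abstract rests on subword segmentation: rare or unseen identifiers are split into smaller, reusable units, so the model's vocabulary stays bounded even as new names proliferate. Below is a minimal, illustrative sketch of the classic byte-pair encoding (BPE) algorithm that this line of work builds on; the function names (learn_bpe, segment), the toy corpus, and the merge count are assumptions for exposition, not the authors' released code.

```python
# A toy BPE learner/segmenter sketch, assuming a plain list of code tokens.
# It is meant only to illustrate why subword units yield an open vocabulary.
from collections import Counter

def learn_bpe(tokens, num_merges):
    """Learn BPE merge operations from a list of code tokens."""
    # Each token starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(t) + ("</w>",) for t in tokens)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by token frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary, fusing every occurrence of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(token, merges):
    """Split a (possibly never-seen) identifier using the learned merges."""
    word = list(token) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(word) - 1:
            if word[i] == a and word[i + 1] == b:
                word[i:i + 2] = [a + b]
            else:
                i += 1
    return word

corpus = ["getValue", "setValue", "getValueOrDefault", "value", "values"]
merges = learn_bpe(corpus, num_merges=30)
# A brand-new identifier still decomposes into known subwords, so the
# model's vocabulary stays fixed even as identifier names proliferate.
print(segment("getValueList", merges))
```

Because any new identifier can be decomposed into learned subwords (in the worst case, single characters), the vocabulary is effectively open to unseen names while remaining fixed in size, which is what lets such a model scale to a corpus of 13,362 projects.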