Paper Title

Learning and Evaluating Contextual Embedding of Source Code

Paper Authors

Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

Paper Abstract

Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.
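The abstract describes fine-tuning the pre-trained CuBERT encoder on downstream classification tasks, such as deciding whether a code snippet contains a bug. Below is a minimal sketch of that workflow using the Hugging Face transformers API. This is an assumption about the general fine-tuning recipe, not the authors' exact pipeline: the released CuBERT checkpoints are TensorFlow-based with a custom Python subword vocabulary, and the checkpoint path, labels, and example snippet here are all illustrative placeholders.

```python
# Minimal fine-tuning sketch, assuming a BERT-style checkpoint converted to
# the Hugging Face format. "path/to/cubert" is a placeholder, not a published
# model name; the snippet and labels below are illustrative only.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("path/to/cubert")
model = BertForSequenceClassification.from_pretrained("path/to/cubert", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# The benchmark frames several tasks as classification over code; here,
# a toy binary example: a function whose body contradicts its name.
snippet = "def add(a, b):\n    return a - b"
inputs = tokenizer(snippet, truncation=True, max_length=512, return_tensors="pt")
labels = torch.tensor([1])  # 1 = buggy, 0 = correct (illustrative labeling)

model.train()
outputs = model(**inputs, labels=labels)  # cross-entropy loss over 2 classes
outputs.loss.backward()
optimizer.step()
```

In practice this single step would run over many labeled examples; the paper's point is that starting from a pre-trained contextual encoder reaches better accuracy than BiLSTM or from-scratch Transformer baselines with fewer such examples and less training.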
