Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning

Paper title

Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning

Authors

Wei Ye, Rui Xie, Jinglei Zhang, Tianxiang Hu, Xiaoyin Wang, Shikun Zhang

Abstract

Code summarization generates a brief natural language description of a given source code snippet, while code retrieval fetches relevant source code for a given natural language query. Since both tasks aim to model the association between natural language and programming language, recent studies have combined them to improve their performance. However, researchers have not yet been able to effectively leverage the intrinsic connection between the two tasks, because they train them separately or in a pipeline, which means their performance cannot be well balanced. In this paper, we propose a novel end-to-end model for the two tasks by introducing an additional code generation task. More specifically, we explicitly exploit the probabilistic correlation between code summarization and code generation with dual learning, and utilize the two encoders for code summarization and code generation to train the code retrieval task via multi-task learning. We have carried out extensive experiments on an existing dataset of SQL and Python, and the results show that our model significantly improves the results of the code retrieval task over state-of-the-art models, and achieves competitive performance in terms of BLEU score for the code summarization task.
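
As a rough illustration of the "probabilistic correlation" the abstract refers to: dual learning approaches generally rely on the fact that the joint probability of a code snippet and its summary can be factored in two directions, one through the summarization model and one through the generation model. The notation below (x for code, y for summary, θ_cs and θ_cg for the two models' parameters, and the estimated marginals with hats) is assumed here for illustration and does not come from the abstract; the regularizer the paper actually uses may take a different form.

```latex
% Assumed notation: x = code snippet, y = natural language summary,
% \theta_{cs} = code summarization model, \theta_{cg} = code generation model.

% Both factorizations of the joint probability must agree:
P(x)\,P(y \mid x;\theta_{cs}) \;=\; P(x, y) \;=\; P(y)\,P(x \mid y;\theta_{cg})

% One common way to encourage this agreement during training is a squared
% penalty on the gap between the two log-factorizations, with the marginals
% \hat{P}(x) and \hat{P}(y) estimated by separate language models:
\ell_{\mathrm{dual}} =
  \bigl(\log \hat{P}(x) + \log P(y \mid x;\theta_{cs})
      - \log \hat{P}(y) - \log P(x \mid y;\theta_{cg})\bigr)^{2}
```

Under this sketch, the penalty is zero exactly when the two models assign consistent joint probabilities to the pair (x, y), which is how training one task can constrain the other.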
