Paper Title
Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations
Paper Authors
Paper Abstract
We propose Corder, a self-supervised contrastive learning framework for source code models. Corder is designed to alleviate the need for labeled data in code retrieval and code summarization tasks. The pre-trained Corder model can be used in two ways: (1) it can produce vector representations of code that can be applied to code retrieval tasks with no labeled data; (2) it can be fine-tuned for tasks that may still require labeled data, such as code summarization. The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets through a contrastive learning objective. To do so, we use a set of semantic-preserving transformation operators to generate code snippets that are syntactically diverse but semantically equivalent. Through extensive experiments, we show that code models pre-trained by Corder substantially outperform other baselines on code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks.
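To make the training signal concrete, below is a minimal sketch, assuming a SimCLR-style NT-Xent loss and a toy regex-based variable-renaming operator. The loss choice, the function names, and the renaming operator are illustrative assumptions, not the authors' implementation; the paper's transformation operators produce richer, syntactically diverse rewrites.

```python
# Minimal sketch of a Corder-style contrastive objective (assumptions:
# SimCLR-style NT-Xent loss, toy whole-word variable renaming; the
# paper's semantic-preserving operators are more varied than this).
import re
import torch
import torch.nn.functional as F

def rename_variables(code: str, mapping: dict) -> str:
    """Toy semantic-preserving transform: whole-word variable renaming.

    The renamed snippet is syntactically different from, but semantically
    equivalent to, the original -- a positive pair for contrastive learning.
    """
    for old, new in mapping.items():
        code = re.sub(rf"\b{re.escape(old)}\b", new, code)
    return code

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """NT-Xent loss: (z1[i], z2[i]) are two views of snippet i (positives);
    every other pairing in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)            # 2N x d stacked embeddings
    sim = (z @ z.t()) / tau                   # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))         # a view is not its own negative
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage: embed a snippet and its transformed view with any code encoder,
# then minimize nt_xent_loss over the batch of embedding pairs.
original = "int add(int a, int b) { return a + b; }"
view = rename_variables(original, {"a": "lhs", "b": "rhs"})
```

After pre-training, the encoder's embeddings can be used directly for retrieval without labels, matching use (1) in the abstract, or the encoder can be fine-tuned for summarization, matching use (2).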