Paper Title


CodeDSI: Differentiable Code Search

Paper Authors

Usama Nadeem, Noah Ziems, Shaoen Wu

Paper Abstract


Reimplementing solutions to previously solved software engineering problems is not only inefficient but also introduces inadequate and error-prone code. Many existing methods achieve impressive performance on this issue by using autoregressive text-generation models trained on code. However, these methods are not without their flaws. The generated code from these models can be buggy, lack documentation, and introduce vulnerabilities that may go unnoticed by developers. An alternative to code generation -- neural code search -- is a field of machine learning where a model takes natural language queries as input and, in turn, returns relevant code samples from a database. Due to the nature of this pre-existing database, code samples can be documented, tested, licensed, and checked for vulnerabilities before being used by developers in production. In this work, we present CodeDSI, an end-to-end unified approach to code search. CodeDSI is trained to directly map natural language queries to their respective code samples, which can be retrieved later. In an effort to improve the performance of code search, we have investigated docid representation strategies, the impact of tokenization on docid structure, and the effect of dataset size on overall code search performance. Our results demonstrate CodeDSI's strong performance, exceeding conventional robust baselines by 2-6% across varying dataset sizes.
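The core idea the abstract describes, mapping a query directly to a document identifier (docid) rather than ranking the whole corpus at query time, can be illustrated with a minimal toy sketch. The bag-of-words matcher below is only a stand-in for the trained seq2seq model in the paper, and all names (`CODE_DB`, `predict_docid`, `search`) are illustrative, not from CodeDSI itself:

```python
# Toy analogue of DSI-style code search: a "model" maps a natural
# language query directly to the docid of a code sample, and the
# pre-existing database is consulted only to retrieve the sample.
# A bag-of-words overlap score stands in for the learned mapping.

from collections import Counter

# Pre-existing database: docid -> (documentation text, code sample)
CODE_DB = {
    "001": ("reverse a string", "def rev(s): return s[::-1]"),
    "002": ("sum a list of numbers", "def total(xs): return sum(xs)"),
    "003": ("read a file into a string", "def read(p): return open(p).read()"),
}

def _bow(text):
    # bag-of-words representation of a text
    return Counter(text.lower().split())

def _sim(a, b):
    # overlap between two bag-of-words vectors
    return sum((a & b).values())

def predict_docid(query):
    """Map a query directly to a docid (stand-in for the trained model)."""
    q = _bow(query)
    return max(CODE_DB, key=lambda d: _sim(q, _bow(CODE_DB[d][0])))

def search(query):
    """End-to-end: query -> docid -> retrieved code sample."""
    docid = predict_docid(query)
    return docid, CODE_DB[docid][1]

docid, code = search("how do I reverse a string")
print(docid, code)
```

Because the retrieved samples come from a curated database rather than being generated, each one can be documented, tested, and audited before use, which is the advantage the abstract emphasizes over code-generation models.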
