Paper Title
Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree
Authors
Abstract
Code clones are pairs of semantically similar code fragments that may be syntactically similar or different. Detecting code clones can help reduce the cost of software maintenance and prevent bugs. Numerous approaches to detecting code clones have been proposed, but most of them focus on syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have adopted deep learning for code clone detection in order to automatically learn latent semantic features from data. In particular, to leverage grammar information, several approaches use abstract syntax trees (ASTs) as input and have achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still cannot fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper we build a graph representation of programs called the flow-augmented abstract syntax tree (FA-AST). We construct an FA-AST by augmenting the original AST with explicit control and data flow edges. We then apply two different types of graph neural networks (GNNs) to FA-ASTs to measure the similarity of code pairs. To the best of our knowledge, we are the first to apply graph neural networks to code clone detection. We evaluate FA-AST and graph neural networks on two Java datasets, Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both the Google Code Jam and BigCloneBench tasks.
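To make the idea of augmenting an AST with flow edges concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Node` class, the `build_flow_augmented_graph` function, the label scheme (`Name:x`, `Block`, `While`), and the edge type names (`child`, `parent`, `next_use`, `next_stmt`, `while_exec`, `while_next`) are all hypothetical choices made for illustration. The abstract only states that explicit control and data flow edges are added to the original AST. The sketch stops at graph construction and does not include a graph neural network.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy AST node (hypothetical): a label such as "While" or "Name:x" plus children."""
    label: str
    children: list = field(default_factory=list)


def build_flow_augmented_graph(root):
    """Return (labels, edges); edges are (src_id, dst_id, edge_type) triples.

    Node ids are assigned in pre-order. All edge type names are assumptions.
    """
    labels, edges = [], []
    last_use = {}  # variable name -> node id of its previous occurrence

    def visit(node):
        node_id = len(labels)
        labels.append(node.label)

        # Data-flow-style edge: link each variable occurrence to the previous one.
        if node.label.startswith("Name:"):
            var = node.label[len("Name:"):]
            if var in last_use:
                edges.append((last_use[var], node_id, "next_use"))
            last_use[var] = node_id

        # Plain AST structure, stored in both directions.
        child_ids = []
        for child in node.children:
            child_id = visit(child)
            edges.append((node_id, child_id, "child"))
            edges.append((child_id, node_id, "parent"))
            child_ids.append(child_id)

        # Control-flow-style edges for two simple constructs.
        if node.label == "Block":
            for a, b in zip(child_ids, child_ids[1:]):
                edges.append((a, b, "next_stmt"))
        elif node.label == "While" and len(child_ids) == 2:
            cond, body = child_ids
            edges.append((cond, body, "while_exec"))  # condition holds: run the body
            edges.append((body, cond, "while_next"))  # after the body: re-check condition

        return node_id

    visit(root)
    return labels, edges


# Toy AST for: while (i < n) { s = s + i; i = i + 1; }
tree = Node("While", [
    Node("Less", [Node("Name:i"), Node("Name:n")]),
    Node("Block", [
        Node("Assign", [Node("Name:s"),
                        Node("Add", [Node("Name:s"), Node("Name:i")])]),
        Node("Assign", [Node("Name:i"),
                        Node("Add", [Node("Name:i"), Node("Literal:1")])]),
    ]),
])

labels, edges = build_flow_augmented_graph(tree)
flow_edges = [e for e in edges if e[2] not in ("child", "parent")]
```

In this sketch, `flow_edges` holds only the edges added on top of the tree structure. A typed edge list of this kind is the sort of input a graph neural network could consume, with one such graph built per code fragment in a pair.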