Paper title
Dependency-Based Neural Representations for Classifying Lines of Programs
Paper authors
Paper abstract
We investigate the problem of using machine learning to classify a line of a program as containing a vulnerability or not. Such a line-level classification task calls for a program representation that goes beyond reasoning from the tokens present in the line. We seek a distributed representation in a latent feature space that captures the control and data dependencies of the tokens appearing on a line, while also ensuring that lines with similar meanings have similar features. We present a neural architecture, Vulcan, that meets both of these requirements. It extracts contextual information about the tokens in a line and feeds it, as Abstract Syntax Tree (AST) paths, to a bi-directional LSTM with an attention mechanism. It concurrently represents the meanings of the tokens in a line by recursively embedding the lines where they were most recently defined. In our experiments, Vulcan compares favorably with a state-of-the-art classifier, which requires significant preprocessing of programs, suggesting the utility of deep learning for modeling program dependence information.
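To make the notion of "the lines where they were most recently defined" concrete, the following is a minimal sketch and not Vulcan's actual extraction procedure, which the abstract does not describe. It assumes Python source parsed with the standard `ast` module and straight-line code only, so "most recent definition" simply means the closest preceding assignment in textual order. The function names `latest_definitions` and `dependency_closure` and the sample program are hypothetical, chosen for illustration.

```python
import ast
from collections import defaultdict


def latest_definitions(source):
    """Map each line to {variable read on it: line of its most recent assignment}.

    Assumption of this sketch: straight-line code, so the most recent
    definition is the closest preceding assignment in textual order.
    """
    tree = ast.parse(source)
    loads = defaultdict(set)   # line number -> names read on that line
    stores = defaultdict(set)  # line number -> names assigned on that line
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                stores[node.lineno].add(node.id)
            elif isinstance(node.ctx, ast.Load):
                loads[node.lineno].add(node.id)

    last_def = {}  # name -> line of the latest assignment seen so far
    deps = {}
    for lineno in sorted(set(loads) | set(stores)):
        # Resolve reads first, so `x = x + 1` points at the earlier definition of x.
        deps[lineno] = {
            name: last_def[name]
            for name in sorted(loads[lineno])
            if name in last_def
        }
        for name in stores[lineno]:
            last_def[name] = lineno
    return deps


def dependency_closure(deps, lineno):
    """Follow definition links transitively; return every line reached, sorted."""
    seen = set()
    stack = [lineno]
    while stack:
        current = stack.pop()
        for def_line in deps.get(current, {}).values():
            if def_line not in seen:
                seen.add(def_line)
                stack.append(def_line)
    return sorted(seen)


# Hypothetical four-line program used only for illustration.
SAMPLE = """\
size = 16
buf = [0] * size
idx = size + 4
buf[idx] = 1
"""

DEPS = latest_definitions(SAMPLE)
print(DEPS[4])                      # definitions directly used by line 4
print(dependency_closure(DEPS, 4))  # all lines line 4 transitively depends on
```

In a model along the lines the abstract describes, the embedding of line 4 would be built from the embeddings of the lines returned by the closure; here the sketch stops at recovering those line numbers and leaves the neural components out.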