Paper title
When Are Names Similar Or the Same? Introducing the Code Names Matcher Library
Paper authors
Abstract
Program code contains functions, variables, and data structures that are represented by names. To promote human understanding, these names should describe the role and use of the code elements they represent. But the names given by developers show high variability, reflecting the tastes of each developer, with different words used for the same meaning or the same words used for different meanings. This makes comparing names hard. A precise comparison should be based on matching identical words, but should also take into account possible variations in the words (including spelling and typing errors), reordering of the words, matching between synonyms, and so on. To facilitate this, we developed a library of comparison functions specifically targeted at comparing names in code. The different functions calculate the similarity between names in different ways, so researchers can choose the one appropriate for their specific needs. All of them share an attempt to reflect human perceptions of similarity, at the possible expense of lexical matching.
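The kind of comparison the abstract describes can be illustrated with a minimal Python sketch. This is not the library's API: the function names, the tiny `SYNONYMS` table, and the `threshold` of 0.8 are all assumptions made for illustration, and the standard-library `difflib.SequenceMatcher` stands in for whatever word-level similarity the library actually uses. The sketch splits each name into words, maps synonyms to a shared form, tolerates small spelling differences, and ignores word order.

```python
import re
from difflib import SequenceMatcher

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"count": "num", "number": "num", "amount": "num"}


def split_name(name):
    """Split snake_case and camelCase names into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def word_similarity(a, b):
    """Similarity of two words in [0, 1], after mapping synonyms."""
    a = SYNONYMS.get(a, a)
    b = SYNONYMS.get(b, b)
    return SequenceMatcher(None, a, b).ratio()


def name_similarity(name1, name2, threshold=0.8):
    """Order-insensitive fraction of words that match between two names.

    Each word of name1 is greedily paired with its most similar unused
    word of name2; a pair counts as a match when its similarity reaches
    the (assumed) threshold.
    """
    words1 = split_name(name1)
    words2 = split_name(name2)
    if not words1 or not words2:
        return 0.0
    remaining = list(words2)
    matched = 0
    for w in words1:
        if not remaining:
            break
        best = max(remaining, key=lambda r: word_similarity(w, r))
        if word_similarity(w, best) >= threshold:
            matched += 1
            remaining.remove(best)
    # Dice-style score: 1.0 when every word on both sides is matched.
    return 2 * matched / (len(words1) + len(words2))


if __name__ == "__main__":
    print(name_similarity("userCount", "count_user"))      # reordered words
    print(name_similarity("recieve_data", "receiveData"))  # spelling error
    print(name_similarity("numUsers", "userCount"))        # synonym + plural
    print(name_similarity("fileName", "totalSize"))        # unrelated names
```

The greedy pairing is a simplification chosen for brevity; an optimal assignment between the two word lists could give different scores for longer names.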