Paper Title
Traceability Support for Multi-Lingual Software Projects
Authors
Abstract
Software traceability establishes associations between diverse software artifacts such as requirements, design, code, and test cases. Because manually creating and maintaining trace links is costly, many researchers have proposed automated approaches based on information retrieval techniques. However, many globally distributed software projects produce artifacts written in two or more languages, and this intermingling of languages reduces the efficacy of automated tracing solutions. In this paper, we first analyze and discuss patterns of intermingled language use across multiple projects, and then evaluate several tracing algorithms, including the Vector Space Model (VSM), Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and various models that combine mono- and cross-lingual word embeddings with the Generative Vector Space Model (GVSM). Based on an analysis of 14 Chinese-English projects, our results show that the best performance is achieved using mono-lingual word embeddings integrated into GVSM, with machine translation as a preprocessing step.
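To make the information-retrieval setting concrete, the following is a minimal sketch of VSM-style trace link ranking using TF-IDF weights and cosine similarity. It is not the paper's implementation: the function names (`tfidf_vectors`, `cosine`, `rank_links`), the whitespace tokenizer, the smoothed IDF formula, and the sample artifacts are all assumptions chosen for illustration. It also assumes the artifacts are already in a single language, for example after a translation preprocessing step of the kind the abstract mentions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an illustrative choice) keeps every weight positive.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_links(source_text, targets):
    """Rank target artifacts (id -> text) by similarity to one source artifact."""
    ids = list(targets)
    # Naive whitespace tokenization; unsuitable for unsegmented Chinese text.
    docs = [source_text.lower().split()] + [targets[i].lower().split() for i in ids]
    vectors = tfidf_vectors(docs)
    scores = [(i, cosine(vectors[0], v)) for i, v in zip(ids, vectors[1:])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical requirement and code summaries, for illustration only.
    requirement = "user login with password"
    code_artifacts = {
        "auth.py": "check user password on login",
        "report.py": "render monthly sales report",
    }
    for artifact_id, score in rank_links(requirement, code_artifacts):
        print(f"{artifact_id}\t{score:.3f}")
```

Pure term matching of this kind finds nothing in common between a Chinese requirement and an English identifier, which is the limitation that translation and word-embedding approaches aim to address.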