Paper title
A Large-scale Dataset of (Open Source) License Text Variants
Paper authors
Paper abstract
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it, we collected from the Software Heritage archive (the largest publicly available archive of FOSS source code with accompanying development history) all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, train automated license classifiers, perform natural language processing (NLP) analyses of legal texts, and carry out historical and phylogenetic studies on FOSS licensing. Additional metadata about the shipped license files is also provided, making the dataset ready to use in various contexts. The metadata includes file length measures, detected MIME type, detected SPDX license (using ScanCode), an example origin (e.g., a GitHub repository), and the oldest public commit in which the license appeared. The dataset is released as open data in the form of an archive file containing all deduplicated license files, plus several portable CSV files for metadata that reference the files via cryptographic checksums.
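The layout described above (deduplicated files plus CSV metadata keyed by checksum) can be illustrated with a minimal Python sketch. It is not the dataset's actual tooling: the choice of SHA-1, the column names `sha1`, `mime_type`, and `size`, and the helper functions are all assumptions made for illustration, and the released files may use different names and hash functions.

```python
import csv
import hashlib
import io


def sha1_hex(data: bytes) -> str:
    """Return the hex SHA-1 digest of a blob (hash choice is an assumption)."""
    return hashlib.sha1(data).hexdigest()


def deduplicate(blobs):
    """Map checksum -> content, keeping one copy per distinct content."""
    unique = {}
    for data in blobs:
        unique.setdefault(sha1_hex(data), data)
    return unique


def load_metadata(csv_text):
    """Index metadata rows by the hypothetical 'sha1' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["sha1"]: row for row in reader}


# Three toy "license files", two of which are byte-identical.
blobs = [b"MIT License\n", b"Apache License 2.0\n", b"MIT License\n"]
store = deduplicate(blobs)

# Build a toy metadata CSV that references each file by its checksum.
csv_text = "sha1,mime_type,size\n" + "".join(
    f"{digest},text/plain,{len(data)}\n" for digest, data in store.items()
)
metadata = load_metadata(csv_text)

# Join each stored file with its metadata row through the checksum.
for digest, data in store.items():
    row = metadata[digest]
    print(digest[:8], row["mime_type"], row["size"])
```

Keying both the file store and the metadata on a content checksum means identical license texts collapse to a single entry, and any metadata table can be joined to the files without relying on file paths.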