Paper Title

Synthetic vs. Real Reference Strings for Citation Parsing, and the Importance of Re-training and Out-Of-Sample Data for Meaningful Evaluations: Experiments with GROBID, GIANT and Cora

Paper Authors

Mark Grennan, Joeran Beel

Paper Abstract

Citation parsing, particularly with deep neural networks, suffers from a lack of training data as available datasets typically contain only a few thousand training instances. Manually labelling citation strings is very time-consuming, hence synthetically created training data could be a solution. However, as of now, it is unknown if synthetically created reference-strings are suitable to train machine learning algorithms for citation parsing. To find out, we train Grobid, which uses Conditional Random Fields, with a) human-labelled reference strings from 'real' bibliographies and b) synthetically created reference strings from the GIANT dataset. We find that both synthetic and organic reference strings are equally suited for training Grobid (F1 = 0.74). We additionally find that retraining Grobid has a notable impact on its performance, for both synthetic and real data (+30% in F1). Having as many types of labelled fields as possible during training also improves effectiveness, even if these fields are not available in the evaluation data (+13.5% F1). We conclude that synthetic data is suitable for training (deep) citation parsing models. We further suggest that in future evaluations of reference parsers both evaluation data similar and dissimilar to the training data should be used for more meaningful evaluations.
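The abstract describes citation parsing as a sequence-labelling task solved with Conditional Random Fields trained on labelled reference strings. The sketch below illustrates that general approach only; it is an assumption-laden toy example using the sklearn-crfsuite library with made-up features, labels, and training strings, not GROBID's actual Java/Wapiti pipeline or feature set.

```python
# Minimal sketch of CRF-based reference-string field labelling.
# NOTE: illustrative only; GROBID uses a much richer hand-crafted
# feature set and label scheme than the toy one shown here.
# Requires: pip install sklearn-crfsuite
import sklearn_crfsuite


def token_features(tokens, i):
    """Shallow lexical features for one token plus its neighbours."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "has_period": "." in tok,
        "suffix2": tok[-2:],
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    if i + 1 < len(tokens):
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats


# One toy labelled reference string; real training data would be thousands of
# strings, either hand-labelled bibliographies or a synthetic set such as GIANT.
tokens = ["Grennan", ",", "M.", "Synthetic", "vs.", "Real", "Reference",
          "Strings", ".", "2019", "."]
labels = ["author", "author", "author", "title", "title", "title", "title",
          "title", "title", "date", "date"]

X_train = [[token_features(tokens, i) for i in range(len(tokens))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

# Tag an unseen reference string with the trained model.
test = ["Beel", ",", "J.", "Citation", "Parsing", ".", "2020", "."]
X_test = [[token_features(test, i) for i in range(len(test))]]
print(list(zip(test, crf.predict(X_test)[0])))
```

Field-level precision, recall, and F1 (as reported in the abstract) can then be computed by comparing the predicted label sequences against the gold labels of a held-out evaluation set.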
