Paper title
Jointly Learning Span Extraction and Sequence Labeling for Information Extraction from Business Documents
Paper authors
Abstract
This paper introduces a new information extraction model for business documents. Unlike prior studies, which rely on either span extraction or sequence labeling alone, the model combines the advantages of both. This combination allows the model to handle long documents with sparse information (documents in which only a small amount of information needs to be extracted). The model is trained end-to-end to jointly optimize the two tasks in a unified manner. Experimental results on four business datasets in English and Japanese show that the model achieves promising results and is significantly faster than the standard span-based extraction method. The code is also available.