Paper Title
Multimodal Pre-training Based on Graph Attention Network for Document Understanding
Paper Authors
Paper Abstract
Document intelligence, as a relatively new research topic, supports many business applications. Its main task is to automatically read, understand, and analyze documents. However, due to the diversity of formats (invoices, reports, forms, etc.) and layouts in documents, it is difficult for machines to understand documents. In this paper, we present GraphDoc, a multimodal graph-attention-based model for various document understanding tasks. GraphDoc is pre-trained in a multimodal framework by utilizing text, layout, and image information simultaneously. In a document, a text block relies heavily on its surrounding context; accordingly, we inject the graph structure into the attention mechanism to form a graph attention layer, so that each input node can only attend to its neighborhood. The input nodes of each graph attention layer are composed of textual, visual, and positional features from semantically meaningful regions in a document image. Multimodal feature fusion for each node is performed by a gate fusion layer, and the contextualization between nodes is modeled by the graph attention layers. GraphDoc learns a generic representation from only 320k unlabeled documents via the Masked Sentence Modeling task. Extensive experimental results on publicly available datasets show that GraphDoc achieves state-of-the-art performance, which demonstrates the effectiveness of our proposed method. The code is available at https://github.com/ZZR8066/GraphDoc.
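The abstract names two mechanisms: a graph attention layer in which each node may only attend to its graph neighbors, and a gate fusion layer that merges the modalities of each node. The sketch below is only an illustrative approximation of those two ideas in plain PyTorch, under my own assumptions; the class names, tensor shapes, and the use of `nn.MultiheadAttention` with a boolean neighborhood mask are not taken from the paper, and the authors' actual implementation is available at the GitHub link above.

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Hypothetical gate fusion: blend per-node textual and visual features
    with a learned sigmoid gate (names and shapes are assumptions, not the paper's code)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_feat, visual_feat):
        g = torch.sigmoid(self.gate(torch.cat([text_feat, visual_feat], dim=-1)))
        return g * text_feat + (1.0 - g) * visual_feat

class GraphAttentionLayer(nn.Module):
    """Self-attention restricted to graph neighbors: non-neighbor positions are
    masked out of the attention logits via a boolean mask built from the adjacency."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nodes, adjacency):
        # adjacency: (batch, N, N), 1 where an edge exists between two text blocks.
        blocked = adjacency == 0                                   # True = position a node may NOT attend to
        attn_mask = blocked.repeat_interleave(self.attn.num_heads, dim=0)  # (batch * heads, N, N)
        out, _ = self.attn(nodes, nodes, nodes, attn_mask=attn_mask)
        return self.norm(nodes + out)                              # residual + layer norm

# Toy usage: 2 documents, 4 text blocks each, 256-d features, fully connected toy graph.
B, N, D = 2, 4, 256
text_feat, visual_feat = torch.randn(B, N, D), torch.randn(B, N, D)
adjacency = torch.ones(B, N, N)
nodes = GateFusion(D)(text_feat, visual_feat)      # per-node multimodal fusion
out = GraphAttentionLayer(D)(nodes, adjacency)     # neighborhood-restricted contextualization
print(out.shape)                                   # torch.Size([2, 4, 256])
```

In the full model described by the abstract, positional (layout) features would also enter each node's representation and the adjacency would encode each block's spatial neighborhood rather than a fully connected graph; both are simplified away here to keep the sketch short.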