Medical Scientific Table-to-Text Generation with Human-in-the-Loop under the Data Sparsity Constraint

Title

Medical Scientific Table-to-Text Generation with Human-in-the-Loop under the Data Sparsity Constraint

Authors

Wu, Heng-Yi; Zhang, Jingqing; Ive, Julia; Li, Tong; Gupta, Vibhor; Chen, Bingyuan; Guo, Yike

Abstract

Structured (tabular) data in the preclinical and clinical domains contains valuable information about individuals, and an efficient table-to-text summarization system can drastically reduce the manual effort needed to condense this data into reports. In practice, however, progress is heavily impeded by data paucity, data sparsity, and the inability of state-of-the-art natural language generation models (including T5, PEGASUS and GPT-Neo) to produce accurate and reliable outputs. In this paper, we propose a novel table-to-text approach that tackles these problems with a two-step architecture enhanced by auto-correction, a copy mechanism, and synthetic data augmentation. The study shows that the proposed approach selects salient biomedical entities and values from structured data and copies tabular values with improved precision (up to 0.13 absolute increase), generating coherent and accurate text for assay validation reports and toxicology reports. Moreover, we demonstrate a lightweight adaptation of the proposed system to new datasets by fine-tuning with as little as 40% of the training examples. The outputs of our model are validated by human experts in a Human-in-the-Loop scenario.
