Topical Segmentation of Spoken Narratives: A Test Case on Holocaust Survivor Testimonies

Paper title

Topical Segmentation of Spoken Narratives: A Test Case on Holocaust Survivor Testimonies

Authors

Eitan Wagner, Renana Keydar, Amit Pinchevski, Omri Abend

Abstract

The task of topical segmentation is well studied, but previous work has mostly addressed it in the context of structured, well-defined segments, such as segmentation into paragraphs or chapters, or segmentation of text that originated from multiple sources. We tackle the task of segmenting running (spoken) narratives, which poses hitherto unaddressed challenges. As a test case, we address Holocaust survivor testimonies, given in English. Beyond the importance of studying these testimonies for Holocaust research, we argue that they provide an interesting test case for topical segmentation, due to their unstructured surface level, relative abundance (tens of thousands of such testimonies were collected), and the relatively confined domain that they cover. We hypothesize that boundary points between segments correspond to low mutual information between the sentences preceding and following the boundary. Based on this hypothesis, we explore a range of algorithmic approaches to the task, building on previous work on segmentation that uses generative Bayesian modeling and state-of-the-art neural machinery. Compared to manually annotated references, we find that the developed approaches show considerable improvements over previous work.
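
The hypothesis stated in the abstract (a boundary sits where the sentences on either side share little information) can be illustrated with a toy sketch. This is not the authors' method: word overlap (Jaccard similarity) between adjacent sentences is used here as a crude, assumed stand-in for mutual information, and the function names `tokens`, `gap_scores`, and `segment` are hypothetical.

```python
import re


def tokens(sentence):
    """Return the set of lowercased words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def gap_scores(sentences):
    """Score each gap between adjacent sentences by Jaccard word overlap.

    Assumption: overlap is only a rough proxy for mutual information;
    a low score marks a gap that is a plausible topic boundary.
    """
    scores = []
    for left, right in zip(sentences, sentences[1:]):
        a, b = tokens(left), tokens(right)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return scores


def segment(sentences, num_boundaries):
    """Pick the `num_boundaries` lowest-scoring gaps as boundaries.

    Returns sorted indices; index i means a new segment starts at
    sentences[i]. Ties are broken in favour of the earlier gap.
    """
    scores = gap_scores(sentences)
    ranked = sorted(range(len(scores)), key=lambda i: (scores[i], i))
    return sorted(i + 1 for i in ranked[:num_boundaries])
```

Calling `segment(sentences, k)` on a list of sentence strings returns the positions at which the k weakest links between neighbouring sentences occur. A more faithful variant would replace `gap_scores` with a model-based estimate of how much the preceding sentences help predict the following ones.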
