Paper Title
Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora
Paper Authors
Paper Abstract
Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles by inserting subtitle breaks compliant with specific display guidelines. As in speech translation (ST), model training requires parallel data comprising audio inputs paired with their textual translations. In SubST, however, the text must also be annotated with subtitle breaks. So far, this requirement has represented a bottleneck for system development, as confirmed by the dearth of publicly available SubST corpora. To fill this gap, we propose a method to convert existing ST corpora into SubST resources without human intervention. We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion, achieving high segmentation quality in zero-shot conditions. Comparative experiments with SubST systems trained on manual and automatic segmentations, respectively, yield similar performance, showing the effectiveness of our approach.
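To illustrate what "inserting subtitle breaks compliant with display guidelines" means in practice, the sketch below greedily packs a translation into subtitle lines under a maximum line length. Note this is only a naive, text-only illustration under an assumed 42-characters-per-line guideline (a common industry convention); it is not the paper's segmenter, which exploits audio and text multimodally.

```python
def naive_subtitle_segment(text: str, max_chars: int = 42) -> list[str]:
    """Greedily pack words into subtitle lines of at most max_chars characters.

    Naive text-only baseline for illustration; the 42-character limit is an
    assumed display guideline, not taken from the paper.
    """
    lines: list[str] = []
    current: list[str] = []
    for word in text.split():
        # Flush the current line if adding this word would exceed the limit.
        if current and len(" ".join(current + [word])) > max_chars:
            lines.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        lines.append(" ".join(current))
    return lines


segments = naive_subtitle_segment(
    "Speech translation for subtitling is the task of automatically "
    "translating speech data into well-formed subtitles"
)
for line in segments:
    print(line)
```

A real segmenter must additionally respect timing, reading speed, and syntactic coherence, which is why the paper's model conditions on the audio as well as the text.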