End-to-end Speech-to-Punctuated-Text Recognition

Paper title

End-to-end Speech-to-Punctuated-Text Recognition

Authors

Jumon Nozaki, Tatsuya Kawahara, Kenkichi Ishizuka, Taiichi Hashimoto

Abstract

Conventional automatic speech recognition systems do not produce punctuation marks, which are important for the readability of speech recognition results. Punctuation marks are also needed for subsequent natural language processing tasks such as machine translation. There has been much work on punctuation prediction models that insert punctuation marks into speech recognition results as post-processing. However, these studies do not utilize acoustic information for punctuation prediction and are directly affected by speech recognition errors. In this study, we propose an end-to-end model that takes speech as input and outputs punctuated text. This model is expected to predict punctuation robustly against speech recognition errors while using acoustic information. We also propose incorporating an auxiliary loss to train the model using the output of an intermediate layer and unpunctuated text. Through experiments, we compare the performance of the proposed model to that of a cascaded system. The proposed model achieves higher punctuation prediction accuracy than the cascaded system without sacrificing the speech recognition error rate. We also demonstrate that multi-task learning using the intermediate output against the unpunctuated text is effective. Moreover, the proposed model has only about 1/7 of the parameters of the cascaded system.
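
The multi-task idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the punctuation inventory, the function names `strip_punctuation` and `multitask_loss`, the interpolation form, and the `aux_weight` value are all assumptions made for illustration, and the abstract does not state which loss functions or weights the paper uses. The sketch only shows how an unpunctuated auxiliary target could be derived from a punctuated transcript, and how a loss on the final output might be combined with a loss on an intermediate layer.

```python
# Hypothetical punctuation inventory; the paper's actual symbol set is not given in the abstract.
PUNCTUATION = set(".,?!")


def strip_punctuation(text: str) -> str:
    """Derive the unpunctuated auxiliary target from a punctuated transcript."""
    kept = "".join(ch for ch in text if ch not in PUNCTUATION)
    # Collapse any doubled spaces left behind by removed symbols.
    return " ".join(kept.split())


def multitask_loss(final_loss: float, intermediate_loss: float,
                   aux_weight: float = 0.5) -> float:
    """Interpolate the main loss with the auxiliary loss.

    final_loss:        loss of the final output against the punctuated text
    intermediate_loss: loss of the intermediate output against the unpunctuated text
    aux_weight:        assumed interpolation weight (hypothetical value)
    """
    if not 0.0 <= aux_weight <= 1.0:
        raise ValueError("aux_weight must lie in [0, 1]")
    return (1.0 - aux_weight) * final_loss + aux_weight * intermediate_loss


if __name__ == "__main__":
    punctuated = "Hello, how are you?"
    print(strip_punctuation(punctuated))
    print(multitask_loss(final_loss=2.0, intermediate_loss=4.0, aux_weight=0.25))
```

In a real training loop the two loss values would come from the model's final and intermediate outputs rather than from fixed numbers; the scalar version here only makes the weighting explicit.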
