Paper Title

Toward Cross-Domain Speech Recognition with End-to-End Models

Authors

Thai-Son Nguyen, Sebastian Stüker, Alex Waibel

Abstract

In the area of multi-domain speech recognition, past research focused on hybrid acoustic models to build cross-domain and domain-invariant speech recognition systems. In this paper, we empirically examine the difference in behavior between hybrid acoustic models and neural end-to-end systems when mixing acoustic training data from several domains. For these experiments, we composed a multi-domain dataset from public sources, with the different domains in the corpus covering a wide variety of topics and acoustic conditions, such as telephone conversations, lectures, read speech, and broadcast news. We show that for the hybrid models, supplying additional training data from other domains with mismatched acoustic conditions does not increase performance on specific domains. However, our end-to-end models optimized with a sequence-based criterion generalize better than the hybrid models across diverse domains. In terms of word-error-rate performance, our experimental acoustic-to-word and attention-based models trained on the multi-domain dataset reach the performance of domain-specific long short-term memory (LSTM) hybrid models, resulting in multi-domain speech recognition systems whose performance does not suffer relative to domain-specific ones. Moreover, the use of neural end-to-end models eliminates the need for domain-adapted language models during recognition, which is a great advantage when the input domain is unknown.
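As background for the word-error-rate comparisons the abstract refers to, below is a minimal sketch of how WER is typically computed from a reference and a hypothesis transcript via word-level edit distance; the function name and example strings are illustrative assumptions, not taken from the paper.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] holds the word-level edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: one deleted word against a six-word reference -> WER of about 16.7%.
print(wer("the cat sat on the mat", "the cat sat on mat"))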
