Paper Title
Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion
Paper Authors
Paper Abstract
Multilingual speech recognition has drawn significant attention as an effective way to compensate for data scarcity in low-resource languages. End-to-end (e2e) modelling is often preferred over conventional hybrid systems, mainly because it does not require a lexicon. However, hybrid DNN-HMMs still outperform e2e models in limited-data scenarios. Furthermore, the problem of manual lexicon creation has been alleviated by publicly available trained models for grapheme-to-phoneme (G2P) conversion and text-to-IPA transliteration covering many languages. In this paper, a novel approach to hybrid DNN-HMM acoustic model fusion is proposed in a multilingual setup for low-resource languages. Posterior distributions from different monolingual acoustic models, computed on a target-language speech signal, are fused together. A separate regression neural network is trained for each source-target language pair to transform posteriors from the source acoustic model to the target language. These networks require very limited data compared to ASR training. Posterior fusion yields relative gains of 14.65% and 6.5% over the multilingual and monolingual baselines, respectively. Cross-lingual model fusion shows that comparable results can be achieved without using posteriors from the language-dependent ASR.
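The fusion idea described above can be sketched minimally in NumPy: a small regression network (here with untrained placeholder weights, since the paper trains one per source-target pair) maps a source model's frame posteriors into the target senone space, and the mapped posteriors are then averaged with the target model's own posteriors. The network shape, the mean-fusion rule, and all dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def map_posteriors(src_post, W1, b1, W2, b2):
    """One-hidden-layer regression network mapping source-language
    frame posteriors to the target senone space (weights are
    placeholders here; the paper trains one such net per pair)."""
    h = np.tanh(src_post @ W1 + b1)
    return softmax(h @ W2 + b2)

def fuse_posteriors(posterior_list):
    """Simple frame-level mean fusion of per-model posteriors
    (an assumed fusion rule; the paper's scheme may differ)."""
    return np.mean(np.stack(posterior_list), axis=0)

# Toy example: 10 frames, 32 source senones, 24 target senones.
rng = np.random.default_rng(0)
T, S, H, TGT = 10, 32, 16, 24
target_post = softmax(rng.standard_normal((T, TGT)))
mapped = [
    map_posteriors(softmax(rng.standard_normal((T, S))),
                   rng.standard_normal((S, H)), np.zeros(H),
                   rng.standard_normal((H, TGT)), np.zeros(TGT))
    for _ in range(2)  # two source-language models
]
fused = fuse_posteriors([target_post] + mapped)
```

Because each input is a valid per-frame distribution and the fusion is a convex combination, `fused` remains a valid distribution over target senones per frame, ready to be fed to the HMM decoder in place of the single-model posteriors.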