Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network

Paper Title

Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network

Authors

Da-rong Liu, Po-chun Hsu, Yi-chen Chen, Sung-feng Huang, Shun-po Chuang, Da-yi Wu, Hung-yi Lee

Abstract

Automatic speech recognition (ASR) systems have recently achieved strong performance. However, most of them rely on massive amounts of paired data, which are not available for low-resource languages worldwide. This paper investigates how to learn phone recognition directly from unpaired phone sequences and speech utterances. We design a two-stage iterative framework. In the first stage, GAN training is used to find the mapping between unpaired speech and phone sequences. In the second stage, an HMM is trained on the generator's output, which boosts performance and provides a better segmentation for the next iteration. In the experiments, we first investigate different choices of model design. We then compare the framework with three types of baselines: (i) supervised methods, (ii) methods based on acoustic unit discovery, and (iii) methods that learn from unpaired data. On the TIMIT dataset, our framework performs consistently better than all acoustic unit discovery methods and all previous methods that learn from unpaired data.
