Paper Title
That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages
Paper Authors
Paper Abstract
Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies. One way to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus on gaining a deeper understanding of how general these representations might be, and how individual phones are improved in a multilingual setting. To that end, we select a phonetically diverse set of languages and perform a series of monolingual, multilingual, and crosslingual (zero-shot) experiments. The ASR is trained to recognize International Phonetic Alphabet (IPA) token sequences. We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting, where the model, among other errors, treats Javanese as a tonal language. Notably, as little as 10 hours of target-language training data tremendously reduces ASR error rates. Our analysis reveals that even phones unique to a single language benefit greatly from the addition of training data in other languages, an encouraging result for the low-resource speech community.
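To make the setup concrete, below is a minimal Python sketch (not the authors' code) of what "recognizing IPA token sequences" implies on the data side: transcripts from several languages are mapped into a single shared inventory of IPA phone tokens, so a phone that occurs in more than one language corresponds to the same output unit. The tokenizer, the example words, and their IPA forms are illustrative assumptions, not the paper's actual preprocessing.

```python
# A minimal sketch of mapping multilingual IPA transcripts into one shared
# token inventory, as would be used for multilingual ASR targets.
# Illustrative only: the tokenizer and the example transcripts below are
# assumptions, not the paper's actual pipeline.
import unicodedata

def tokenize_ipa(ipa_string):
    """Split an IPA transcription into phone tokens, attaching diacritics,
    length marks, and tone letters to the preceding base symbol."""
    tokens = []
    for ch in ipa_string:
        if ch.isspace():
            continue
        # Combining marks (Mn), modifier symbols (Sk, e.g. tone bars) and
        # modifier letters (Lm, e.g. the length mark) extend the previous token.
        if tokens and unicodedata.category(ch) in ("Mn", "Sk", "Lm"):
            tokens[-1] += ch
        else:
            tokens.append(ch)
    return tokens

# Hypothetical IPA transcripts from a few typologically different languages.
corpus = {
    "spanish":  ["peɾo", "kasa"],
    "mandarin": ["ma˥", "ma˧˥"],   # tone letters ride on the vowel token
    "javanese": ["aku", "ɔmah"],
}

# One vocabulary shared across all languages: a phone seen in several
# languages (e.g. /a/, /k/, /m/) maps to a single output unit, which is
# what lets multilingual training data improve that phone everywhere.
phone_to_id = {}
for lang, utterances in corpus.items():
    for utt in utterances:
        for phone in tokenize_ipa(utt):
            phone_to_id.setdefault(phone, len(phone_to_id))

# Training targets for one utterance are simply its IPA token ids.
print([phone_to_id[p] for p in tokenize_ipa("kasa")])
```

Sharing the output inventory in this way is what allows acoustic evidence from any language to update the representation of a given phone, including phones that appear in only one language's training data.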