Paper Title


Semi-supervised acoustic and language model training for English-isiZulu code-switched speech recognition

Authors

Biswas, A., de Wet, F., van der Westhuizen, E., Niesler, T. R.

Abstract


We present an analysis of semi-supervised acoustic and language model training for English-isiZulu code-switched ASR using soap opera speech. Approximately 11 hours of untranscribed multilingual speech was transcribed automatically using four bilingual code-switching transcription systems operating in English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. These transcriptions were incorporated into the acoustic and language model training sets. Results showed that the TDNN-F acoustic models benefit from the additional semi-supervised data and that even better performance could be achieved by including additional CNN layers. Using these CNN-TDNN-F acoustic models, a first iteration of semi-supervised training achieved an absolute mixed-language WER reduction of 3.4%, and a further 2.2% after a second iteration. Although the languages in the untranscribed data were unknown, the best results were obtained when all automatically transcribed data was used for training and not just the utterances classified as English-isiZulu. Despite reducing perplexity, the semi-supervised language model was not able to improve the ASR performance.
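The semi-supervised procedure described above can be sketched as a self-training loop: untranscribed utterances are decoded by the four bilingual code-switching systems, a hypothesis is selected, and the automatically transcribed data is pooled with the manually transcribed data for retraining. The sketch below is illustrative only; the function names, the confidence-based selection, and the decoder interface are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the semi-supervised self-training loop from the
# abstract. Each "system" is modelled as a function mapping an utterance
# to a (transcript, confidence) pair; real systems would be the four
# bilingual code-switching ASR decoders.

def best_transcription(utterance, systems):
    """Decode with every bilingual system; keep the highest-confidence hypothesis."""
    hypotheses = [decode(utterance) for decode in systems]
    return max(hypotheses, key=lambda h: h[1])

def semi_supervised_iterations(labelled, unlabelled, systems, iterations=2):
    """Grow the training pool over several self-training iterations.

    The abstract reports two iterations, and that using ALL automatically
    transcribed data outperformed keeping only utterances classified as
    English-isiZulu, so no language filtering is applied here.
    """
    training_set = list(labelled)
    for _ in range(iterations):
        auto = [(u, best_transcription(u, systems)[0]) for u in unlabelled]
        training_set = list(labelled) + auto
        # ... in the real pipeline: retrain the (CNN-)TDNN-F acoustic model
        # on training_set and update `systems` before the next iteration ...
    return training_set
```

In the paper's setting the retraining step inside the loop is what drives the reported WER reductions (3.4% absolute after the first iteration, a further 2.2% after the second); the sketch omits it since it depends on the acoustic model toolkit.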
