End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training

Paper Title

End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training

Authors

Heng-Jui Chang, Alexander H. Liu, Hung-yi Lee, Lin-shan Lee

Abstract

Whispering is an important mode of human speech, but no end-to-end recognition results for it have been reported so far, probably because of the scarcity of available whispered speech data. In this paper, we present several approaches for end-to-end (E2E) recognition of whispered speech that take into account its special characteristics and the scarcity of data. These include a frequency-weighted SpecAugment policy and a frequency-divided CNN feature extractor for better capturing the high-frequency structures of whispered speech, and a layer-wise transfer learning approach that pre-trains a model with normal or normal-to-whispered converted speech and then fine-tunes it with whispered speech to bridge the gap between whispered and normal speech. We achieve an overall relative reduction of 19.8% in phoneme error rate (PER) and 44.4% in character error rate (CER) on a relatively small whispered TIMIT corpus. The results indicate that, as long as we have a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good E2E whispered speech recognizer.
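
To give a concrete picture of what a frequency-weighted masking step could look like, here is a minimal sketch in plain Python. The abstract does not specify the policy, so everything below is an assumption for illustration only: the function name frequency_weighted_mask, the weights parameter, the max_width parameter, and the choice of weight values are hypothetical and are not taken from the paper. The sketch masks one band of frequency bins in a spectrogram, drawing the band's starting bin from a weighted distribution rather than a uniform one.

```python
import random


def frequency_weighted_mask(spec, weights, max_width=8, rng=None):
    """Zero out one band of frequency bins, with a weighted choice of start bin.

    spec:      list of rows, one row of frame values per frequency bin
    weights:   one non-negative weight per bin; a larger weight makes that bin
               more likely to start the masked band (hypothetical parameter,
               not the paper's actual policy)
    max_width: largest number of consecutive bins that may be masked
    """
    if rng is None:
        rng = random.Random()
    num_bins = len(spec)
    if len(weights) != num_bins:
        raise ValueError("weights must have one entry per frequency bin")

    width = rng.randint(0, max_width)  # inclusive on both ends
    start = rng.choices(range(num_bins), weights=weights, k=1)[0]
    end = min(start + width, num_bins)  # clip the band at the top bin

    masked = [row[:] for row in spec]  # copy so the input is left untouched
    for b in range(start, end):
        masked[b] = [0.0] * len(masked[b])
    return masked, (start, end)


# Illustrative call: 40 bins x 100 frames, with arbitrary weights that fall
# linearly from 2.0 at the lowest bin to 1.0 at the highest bin.
example_spec = [[1.0] * 100 for _ in range(40)]
example_weights = [2.0 - i / 39 for i in range(40)]
example_masked, example_band = frequency_weighted_mask(
    example_spec, example_weights, max_width=8, rng=random.Random(0)
)
```

Setting all weights equal reduces this to ordinary uniform frequency masking, which is one way to compare a weighted policy against an unweighted baseline.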
