Neural Representations for Modeling Variation in Speech

Paper title

Neural Representations for Modeling Variation in Speech

Authors

Martijn Bartelds, Wietse de Vries, Faraz Sanal, Caitlin Richter, Mark Liberman, Martijn Wieling

Abstract

Variation in speech is often quantified by comparing phonetic transcriptions of the same utterance. However, manually transcribing speech is time-consuming and error-prone. As an alternative, we therefore investigate the extraction of acoustic embeddings from several self-supervised neural models. We use these representations to compute word-based pronunciation differences between non-native and native speakers of English, and between Norwegian dialect speakers. For comparison with several earlier studies, we evaluate how well these differences match human perception by comparing them with available human judgements of similarity. We show that speech representations extracted from a specific type of neural model (i.e., Transformers) lead to a better match with human perception than two earlier approaches based on phonetic transcriptions and MFCC-based acoustic features. We furthermore find that features from the neural models are generally better extracted from one of the middle hidden layers than from the final layer. We also demonstrate that neural speech representations capture not only segmental differences, but also intonational and durational differences that cannot adequately be represented by the set of discrete symbols used in phonetic transcriptions.
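The abstract does not say how two sequences of frame-level embeddings are turned into a single word-based pronunciation difference. As an illustration only, the sketch below assumes dynamic time warping (DTW) with Euclidean frame distances and a length-normalized cost. The function names, the normalization by `n + m`, and the toy vectors are hypothetical choices, not details taken from the text above. The inputs stand in for embeddings that have already been extracted from some hidden layer of a model.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Length-normalized DTW cost between two sequences of feature vectors.

    Each sequence is a list of frames; each frame is a list of floats.
    Dividing by n + m is an assumed normalization so that longer words
    do not automatically receive larger differences.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    # cost[i][j] = minimal accumulated cost of aligning the first i frames
    # of seq_a with the first j frames of seq_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # skip ahead in seq_a only
                cost[i][j - 1],      # skip ahead in seq_b only
                cost[i - 1][j - 1],  # advance in both sequences
            )
    return cost[n][m] / (n + m)


# Toy stand-ins for two pronunciations of the same word
# (hypothetical 2-dimensional "embeddings", two frames each).
native = [[0.0, 0.0], [1.0, 1.0]]
learner = [[0.0, 0.0], [1.0, 0.0]]
print(dtw_distance(native, learner))
```

In a setup like the one the abstract describes, such per-word differences could then be averaged per speaker and correlated with human similarity judgements. That aggregation step is likewise an assumption here.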
