Paper Title

Speech Emotion: Investigating Model Representations, Multi-Task Learning and Knowledge Distillation

Paper Authors

Vikramjit Mitra, Hsiang-Yun Sherry Chien, Vasudha Kowtha, Joseph Yitan Cheng, Erdrin Azemi

Paper Abstract

Estimating dimensional emotions, such as activation, valence and dominance, from acoustic speech signals has been widely explored over the past few years. While activation and dominance can be estimated from speech with reasonable accuracy, valence estimation remains challenging. Previous research has shown that the use of lexical information can improve valence estimation performance. Lexical information can be obtained from pre-trained acoustic models, where the learned representations can improve valence estimation from speech. We investigate the use of pre-trained model representations to improve valence estimation from the acoustic speech signal. We also explore fusion of representations to improve emotion estimation across all three emotion dimensions: activation, valence and dominance. Additionally, we investigate whether representations from pre-trained models can be distilled into models trained with low-level features, resulting in models with fewer parameters. We show that fusion of pre-trained model embeddings results in a 79% relative improvement in concordance correlation coefficient (CCC) on valence estimation compared to a standard acoustic feature baseline (mel-filterbank energies), while distillation from pre-trained model embeddings to lower-dimensional representations yields a 12% relative improvement. These performance gains were observed on two evaluation sets, indicating that our proposed architecture generalizes across them. We report new state-of-the-art "text-free" acoustic-only dimensional emotion estimation CCC values on two MSP-Podcast evaluation sets.
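
All of the reported gains are in the concordance correlation coefficient (CCC), the standard agreement metric for dimensional emotion regression. For reference, here is a minimal NumPy sketch of CCC; the function name `ccc` and the use of population (biased) moments are our conventions, not taken from the paper.

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient between two 1-D arrays.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    computed with population (biased) moments.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()               # population variances
    cov = np.mean((x - mx) * (y - my))      # population covariance
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Example: imperfect predictions score below 1.0
labels = np.array([0.1, 0.4, 0.5, 0.9])
preds  = np.array([0.2, 0.3, 0.6, 0.8])
print(ccc(labels, preds))  # ~0.93
```

Unlike Pearson correlation, CCC reaches 1.0 only when predictions match the labels exactly, so it penalizes scale and offset errors as well as poor correlation; this is why it is the preferred metric for valence, activation and dominance regression.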
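
The distillation result (a 12% relative CCC improvement for a student trained on low-level features) implies a loss that pulls the student's embeddings toward frozen pre-trained teacher embeddings while still fitting the emotion labels. The abstract does not specify the architecture, pooling, or loss weighting, so everything in the PyTorch sketch below, including `StudentEncoder`, `distill_loss`, the layer sizes, and the balance weight `alpha`, is illustrative rather than the authors' actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentEncoder(nn.Module):
    """Hypothetical student: maps low-level features (e.g., 40-dim
    mel-filterbank energies) to an embedding that is matched against
    a frozen pre-trained teacher embedding of the same dimension."""

    def __init__(self, n_mels=40, embed_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 512),
            nn.ReLU(),
            nn.Linear(512, embed_dim),
        )
        # Regression head for the three emotion dimensions:
        # activation, valence, dominance.
        self.head = nn.Linear(embed_dim, 3)

    def forward(self, feats):             # feats: (batch, time, n_mels)
        emb = self.encoder(feats)          # (batch, time, embed_dim)
        pooled = emb.mean(dim=1)           # simple average pooling over time
        return emb, self.head(pooled)      # embeddings + emotion predictions

def distill_loss(student_emb, teacher_emb, preds, labels, alpha=0.5):
    """Weighted sum of an embedding-matching (distillation) term and a
    label regression term; alpha is an illustrative balance weight."""
    embed_term = F.mse_loss(student_emb, teacher_emb)  # match frozen teacher
    label_term = F.mse_loss(preds, labels)             # fit emotion labels
    return alpha * embed_term + (1.0 - alpha) * label_term
```

In practice the regression term could be replaced by 1 - CCC computed per dimension, since CCC is the metric being optimized; the abstract does not say which loss the authors used.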
