Paper Title
Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning
Paper Authors
Paper Abstract
We propose a novel transfer learning method for speech emotion recognition that allows us to obtain promising results when only a small amount of training data is available. With as few as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on 8 times more data. Our method leverages the knowledge contained in pre-trained speech representations extracted from models trained on a more general self-supervised task that does not require human annotations, such as the wav2vec model. We provide detailed insights into the benefits of our approach by varying the training data size, which can help labeling teams work more efficiently. We compare performance with other popular methods on the IEMOCAP dataset, a well-benchmarked dataset in the Speech Emotion Recognition (SER) research community. Furthermore, we demonstrate that results can be greatly improved by combining acoustic and linguistic knowledge from transfer learning. We align pre-trained acoustic representations with semantic representations from the BERT model through an attention-based recurrent neural network. Performance improves significantly when both modalities are combined, and the improvement scales with the amount of data. When trained on the full IEMOCAP dataset, we reach a new state of the art of 73.9% unweighted accuracy (UA).
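The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the fusion idea it describes: frozen pre-trained acoustic features (wav2vec-style) and BERT token embeddings encoded by recurrent networks and aligned through attention before classification. All module choices, feature dimensions (512 for wav2vec features, 768 for BERT-base), and the 4-class setup (common in IEMOCAP experiments) are illustrative assumptions, not the authors' architecture.

```python
# Hypothetical sketch of attention-based acoustic/linguistic fusion for SER.
# Not the paper's code; dimensions and modules are assumptions.
import torch
import torch.nn as nn

class AcousticLinguisticSER(nn.Module):
    def __init__(self, acoustic_dim=512, text_dim=768, hidden=128, n_classes=4):
        super().__init__()
        # Recurrent encoders over the (frozen) pre-trained feature sequences.
        self.audio_rnn = nn.GRU(acoustic_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.text_rnn = nn.GRU(text_dim, hidden, batch_first=True,
                               bidirectional=True)
        # Cross-modal attention: acoustic states attend over semantic states.
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=4,
                                          batch_first=True)
        self.classifier = nn.Linear(4 * hidden, n_classes)

    def forward(self, wav2vec_feats, bert_feats):
        # wav2vec_feats: (batch, audio_frames, acoustic_dim)
        # bert_feats:    (batch, tokens, text_dim)
        a, _ = self.audio_rnn(wav2vec_feats)
        t, _ = self.text_rnn(bert_feats)
        # Align each acoustic state with the text sequence via attention.
        aligned, _ = self.attn(query=a, key=t, value=t)
        # Mean-pool both streams and classify their concatenation.
        pooled = torch.cat([a.mean(dim=1), aligned.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

# Smoke test with random stand-ins for pre-extracted features.
model = AcousticLinguisticSER()
logits = model(torch.randn(2, 300, 512), torch.randn(2, 40, 768))
print(logits.shape)  # torch.Size([2, 4])
```

In this sketch the pre-trained models act purely as feature extractors, which matches the abstract's claim that the self-supervised representations carry the knowledge; only the small recurrent fusion head would need to be trained on the labeled emotion data.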