Emotion Recognition in Audio and Video Using Deep Neural Networks

Paper Title

Emotion Recognition in Audio and Video Using Deep Neural Networks

Authors

Mandeep Singh, Yuan Fang

Abstract

Humans are able to comprehend information from multiple domains, e.g., speech, text, and vision. With the advancement of deep learning technology, speech recognition has improved significantly. Recognizing emotion from speech is an important aspect of this, and with deep learning, emotion recognition has improved in both accuracy and latency. Improving accuracy still poses many challenges. In this work, we explore different neural networks to improve the accuracy of emotion recognition. Among the architectures explored, we find that a (CNN+RNN) + 3DCNN multi-model architecture, which processes audio spectrograms and the corresponding video frames, gives an emotion prediction accuracy of 54.0% among 4 emotions and 71.75% among 3 emotions on the IEMOCAP [2] dataset.
