Paper Title

A Topic-Attentive Transformer-based Model For Multimodal Depression Detection

Paper Authors

Yanrong Guo, Chenyang Zhu, Shijie Hao, Richang Hong

Paper Abstract

Depression is one of the most common mental disorders and imposes heavy negative impacts on daily life. Diagnosing depression through an interview usually takes the form of questions and answers. In this process, a subject's audio signals and their text transcripts are correlated with depression cues and are easy to record, so it is feasible in practice to build an Automatic Depression Detection (ADD) model from the data of these modalities. However, two major challenges must be addressed to construct an effective ADD model. The first is the organization of the textual and audio data, which can vary in content and length across subjects. The second is the lack of training samples due to privacy concerns. Targeting these two challenges, we propose the TOpic ATtentive transformer-based ADD model, abbreviated as TOAT. To address the first challenge, the TOAT model takes the topic as the basic unit of the textual and audio data, following the question-answer form of a typical interview. Based on that, a topic attention module is designed to learn the importance of each topic, which helps the model better retrieve depressed samples. To address data scarcity, we introduce large pre-trained models and adopt a fine-tuning strategy on the small-scale ADD training data. We also design a two-branch architecture with a late-fusion strategy for the TOAT model, in which the textual and audio data are encoded independently. We evaluate our model on the multimodal DAIC-WOZ dataset, which is specifically designed for the ADD task. Experimental results show the superiority of our method. More importantly, the ablation studies demonstrate the effectiveness of the key elements in the TOAT model.
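
The abstract describes the architecture only at a high level: per-topic features from two independently encoded branches (text and audio), a topic attention module that weights topics, and late fusion for classification. The following is a minimal sketch of that idea, not the authors' implementation; the encoder placeholders, feature dimensions, and the scalar-score attention formulation are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the published TOAT code) of a two-branch,
# topic-attentive, late-fusion classifier as outlined in the abstract.
import torch
import torch.nn as nn


class TopicAttention(nn.Module):
    """Learns a scalar importance weight per topic and pools topic features."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, topic_feats: torch.Tensor) -> torch.Tensor:
        # topic_feats: (batch, n_topics, dim)
        weights = torch.softmax(self.score(topic_feats), dim=1)  # (batch, n_topics, 1)
        return (weights * topic_feats).sum(dim=1)                # (batch, dim)


class TOATSketch(nn.Module):
    """Two branches (text / audio) encoded independently, fused late."""

    def __init__(self, text_dim: int = 768, audio_dim: int = 768, hidden: int = 256):
        super().__init__()
        # In the paper, large pre-trained encoders would produce per-topic
        # features and be fine-tuned on the small-scale ADD data; simple
        # projections stand in for them here.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.text_topic_attn = TopicAttention(hidden)
        self.audio_topic_attn = TopicAttention(hidden)
        # Late fusion: concatenate the pooled branch features, then classify.
        self.classifier = nn.Linear(2 * hidden, 2)

    def forward(self, text_topics: torch.Tensor, audio_topics: torch.Tensor) -> torch.Tensor:
        # Both inputs: per-topic features of shape (batch, n_topics, feat_dim).
        t = self.text_topic_attn(self.text_proj(text_topics))
        a = self.audio_topic_attn(self.audio_proj(audio_topics))
        return self.classifier(torch.cat([t, a], dim=-1))  # (batch, 2) logits
```

A forward pass with dummy tensors of shape (batch, n_topics, 768) for each branch yields per-sample depression logits, illustrating how the topic dimension is collapsed by the attention module before the two modalities are fused.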
