DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style Word Generator

Paper title

DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style Word Generator

Authors

Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Kyomin Jung

Abstract

Audio Visual Scene-aware Dialog (AVSD) is the task of generating a response to a question given a scene, video, audio, and the history of previous turns in the dialog. Existing systems for this task employ transformer-based or recurrent neural network-based architectures within the encoder-decoder framework. Even though these techniques show superior performance on this task, they have two significant limitations: the model easily overfits by merely memorizing grammatical patterns, and the model follows the prior distribution of the vocabulary in the dataset. To alleviate these problems, we propose a Multimodal Semantic Transformer Network. It employs a transformer-based architecture with an attention-based word embedding layer that generates words by querying word embeddings. With this design, our model keeps considering the meaning of the words at the generation stage. The empirical results demonstrate the superiority of our proposed model, which outperforms most previous work on the AVSD task.
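
The idea of generating words by querying word embeddings can be illustrated with a small sketch. This is not the authors' implementation: the dot-product scoring, the function names, and the toy vocabulary and vectors below are all assumptions chosen for illustration. The point it shows is that the output distribution is obtained by comparing a decoder query vector with each word's embedding, rather than through a separate output projection unrelated to the embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieval_style_word_probs(query, embeddings):
    """Score every vocabulary word by the dot product between the decoder
    query vector and that word's embedding, then normalize with softmax.
    (Assumed scoring rule; the paper's exact attention form may differ.)"""
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in embeddings]
    return softmax(scores)


# Hypothetical toy vocabulary and 2-dimensional embeddings.
vocab = ["yes", "no", "kitchen"]
embeddings = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]

# Hypothetical decoder state acting as the query at one generation step.
query = [2.0, 0.1]

probs = retrieval_style_word_probs(query, embeddings)
best_word = vocab[probs.index(max(probs))]
```

In this toy setup the query lies closest in direction to the embedding of "yes", so that word receives the highest probability. Words with similar embeddings would receive similar scores, which is one way to read the abstract's claim that the model keeps considering word meaning at the generation stage.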
