Paper title
Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge
Authors
Abstract
Audio-Visual Scene-Aware Dialog (AVSD) is an extension of Video Question Answering (QA) in which the dialogue agent must generate natural language responses that address user queries and carry on a conversation. The task is challenging because the video input comprises features of multiple modalities, including text, visual, and audio features. The agent also needs to learn semantic dependencies among user utterances and system responses in order to hold coherent conversations with humans. In this work, we describe our submission to the AVSD track of the 8th Dialogue System Technology Challenge (DSTC8). We adopt dot-product attention to combine text and non-text features of the input video. We further enhance the generation capability of the dialogue agent by adopting pointer networks to point to tokens from multiple source sequences in each generation step. Our systems achieve high performance in automatic metrics and obtain 5th and 6th place in human evaluation among all submissions.
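The abstract names two mechanisms, dot-product attention and pointing to tokens from multiple source sequences, without giving their details. The sketch below is only a rough illustration of the general techniques, not the authors' implementation. The function names, the single-query formulation, the 1/sqrt(d) scaling, and the explicit gate vector that mixes the vocabulary distribution with the copy distributions are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns (context_vector, attention_weights).
    The 1/sqrt(d) scaling is an assumption; the abstract only says
    "dot-product attention".
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


def pointer_mixture(vocab_probs, sources, gates):
    """Mix a vocabulary distribution with copy distributions over several sources.

    vocab_probs: dict mapping token -> probability from the generator.
    sources: list of (tokens, attention_weights) pairs, one per source sequence.
    gates: hypothetical mixture weights; gates[0] scales the vocabulary
           distribution and gates[1:] scale each source. Assumed to sum to 1.
    """
    final = {tok: gates[0] * p for tok, p in vocab_probs.items()}
    for gate, (tokens, weights) in zip(gates[1:], sources):
        for tok, w in zip(tokens, weights):
            # A token that appears in a source gains copy probability,
            # even if it is missing from the generator vocabulary.
            final[tok] = final.get(tok, 0.0) + gate * w
    return final
```

In this sketch the attention weights computed over each source sequence (for example, the dialogue history or a video caption) would be passed to `pointer_mixture` as the copy distribution for that source, so that a token present in any source can be produced even when the generator assigns it low probability.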