Paper Title

Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

Paper Authors

Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima

Paper Abstract

There have been many attempts to build multimodal dialog systems that can respond to a question about given audio-visual information, and the representative task for such systems is the Audio Visual Scene-Aware Dialog (AVSD). Most conventional AVSD models adopt the Convolutional Neural Network (CNN)-based video feature extractor to understand visual information. While a CNN tends to obtain both temporally and spatially local information, global information is also crucial for boosting video understanding because AVSD requires long-term temporal visual dependency and whole visual information. In this study, we apply the Transformer-based video feature that can capture both temporally and spatially global representations more efficiently than the CNN-based feature. Our AVSD model with its Transformer-based feature attains higher objective performance scores for answer generation. In addition, our model achieves a subjective score close to that of human answers in DSTC10. We observed that the Transformer-based visual feature is beneficial for the AVSD task because our model tends to correctly answer the questions that need a temporally and spatially broad range of visual information.
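The abstract's key contrast is between CNN features, whose receptive fields are local in time and space, and Transformer features, where self-attention relates every spatio-temporal position to every other. The following is a minimal, hypothetical sketch (not the authors' actual model) of a Transformer encoder over spatio-temporal video patch tokens in PyTorch; the dimensions and module names are illustrative assumptions.

```python
# Hedged sketch: a toy Transformer encoder over video patch tokens.
# Each token is one spatial patch from one frame; self-attention lets
# every token attend to every other, capturing temporally and spatially
# global context in a single layer, unlike a local CNN receptive field.
# All sizes below are illustrative, not from the paper.
import torch
import torch.nn as nn


class VideoTransformerFeature(nn.Module):
    def __init__(self, patch_dim=768, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)  # patch embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches):
        # patches: (batch, T*H*W tokens, patch_dim)
        tokens = self.proj(patches)
        encoded = self.encoder(tokens)      # global attention over all tokens
        return encoded.mean(dim=1)          # pooled clip-level feature


# Toy clip: 8 frames, 4x4 patch grid per frame -> 128 tokens.
clip = torch.randn(2, 8 * 4 * 4, 768)
feat = VideoTransformerFeature()(clip)
print(feat.shape)  # torch.Size([2, 256])
```

A feature vector like `feat` would then condition the dialog decoder when generating an answer; the point of the sketch is only that attention spans the whole clip, which is what the abstract credits for the improved answers.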
