Paper Title
DramaQA: Character-Centered Video Story Understanding with Hierarchical QA
Paper Authors
Abstract
Despite recent progress in computer vision and natural language processing, building a machine that can understand video stories remains difficult because of the intrinsic complexity of such stories. Moreover, research on how to evaluate the degree of video understanding on the basis of human cognitive processes has made little progress so far. In this paper, we propose a novel video question answering (Video QA) task, DramaQA, for a comprehensive understanding of video stories. DramaQA focuses on two perspectives: 1) hierarchical QAs as an evaluation metric based on the cognitive developmental stages of human intelligence, and 2) character-centered video annotations to model the local coherence of the story. Our dataset is built upon the TV drama "Another Miss Oh" and contains 17,983 QA pairs from 23,928 video clips of various lengths, with each QA pair belonging to one of four difficulty levels. We provide 217,308 annotated images with rich character-centered annotations, including visual bounding boxes, behaviors and emotions of the main characters, and coreference-resolved scripts. Additionally, we propose a Multi-level Context Matching model that hierarchically understands character-centered representations of video to answer questions. We release our dataset and model publicly for research purposes, and we expect our work to provide a new perspective on video story understanding research.