A Comprehensive Review on Recent Methods and Challenges of Video Description

Paper Title

A Comprehensive Review on Recent Methods and Challenges of Video Description

Authors

Singh, Alok, Singh, Thoudam Doren, Bandyopadhyay, Sivaji

Abstract

Video description involves generating natural language descriptions of the actions, events, and objects in a video. Its applications include bridging the gap between language and vision for visually impaired people, suggesting titles automatically based on content, browsing videos by content, and video-guided machine translation [86]. Over the past decade, several works have been carried out in this field in terms of approaches/methods for video description, evaluation metrics, and datasets. To analyze progress in the video description task, a comprehensive survey is needed that covers all the phases of video description approaches, with a special focus on recent deep learning approaches. In this work, we report a comprehensive survey of the phases of video description approaches, the datasets for video description, evaluation metrics, open competitions that motivate research on video description, open challenges in this field, and future research directions. In this survey, we cover the state-of-the-art approaches proposed for each dataset, along with their pros and cons. For this research domain to grow, the availability of numerous benchmark datasets is a basic need. Further, we categorize all the datasets into two classes: open-domain datasets and domain-specific datasets. From our survey, we observe that work in this field is developing at a fast pace, since the task of video description falls at the intersection of computer vision and natural language processing. Still, work on video description is far from the saturation stage owing to various challenges, such as the redundancy caused by similar frames, which affects the quality of visual features; the availability of datasets containing more diverse content; and the availability of an effective evaluation metric.
