CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation

Paper title

CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation

Authors

Ziqi Zhang, Yuxin Chen, Zongyang Ma, Zhongang Qi, Chunfeng Yuan, Bing Li, Ying Shan, Weiming Hu

Abstract

Prior work on video captioning aims to describe a video's actual content objectively and therefore lacks subjective, attractive expression, which limits its practical application scenarios. Video titling is intended to fill this gap, but it lacks a proper benchmark. In this paper, we propose CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration benchmark, to facilitate research and application in video titling and video retrieval in Chinese. CREATE consists of a high-quality labeled 210K dataset and two large-scale 3M/10M pre-training datasets, covering 51 categories, 50K+ tags, 537K manually annotated titles and captions, and 10M+ short videos. Based on CREATE, we propose a novel model, ALWIG, which combines the video retrieval and video titling tasks to achieve multi-modal ALignment WIth Generation with the help of video tags and a GPT pre-trained model. CREATE opens new directions for future research and applications on video titling and video retrieval in the field of Chinese short videos.
