Paper Title

End-to-End Video Text Spotting with Transformer

Paper Authors

Weijia Wu, Yuanqiang Cai, Chunhua Shen, Debing Zhang, Ying Fu, Hong Zhou, Ping Luo

Paper Abstract

Recent video text spotting methods usually require a three-stage pipeline, i.e., detecting text in individual images, recognizing the localized text, and tracking text streams with post-processing to generate the final results. These methods typically follow a tracking-by-matching paradigm and develop sophisticated pipelines. In this paper, rooted in Transformer sequence modeling, we propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR). TransDETR offers two main advantages: 1) Unlike the explicit matching paradigm between adjacent frames, TransDETR tracks and recognizes each text instance implicitly via a dedicated query, termed the text query, over a long-range temporal sequence (more than 7 frames). 2) TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (i.e., text detection, tracking, and recognition). Extensive experiments on four video text datasets (i.e., ICDAR2013 Video, ICDAR2015 Video, Minetto, and YouTube Video Text) demonstrate that TransDETR achieves state-of-the-art performance, with up to around 8.0% improvement on video text spotting tasks. The code of TransDETR can be found at https://github.com/weijiawu/TransDETR.
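The central idea of the abstract, implicit tracking through persistent text queries, can be illustrated with a minimal sketch. The snippet below is not the official TransDETR implementation (that is at the GitHub link above); it is a hedged PyTorch illustration, assuming a DETR-style decoder, of how a fixed set of learned queries can be re-fed frame after frame so each query keeps following the same text instance with no explicit cross-frame matching step. All names and hyper-parameters here (`TextQueryTracker`, `d_model=256`, 100 queries, the toy box/score heads) are illustrative assumptions, not the paper's actual design.

```python
# Minimal sketch (NOT the official TransDETR code) of query-based
# implicit tracking: learned text queries decode each frame's features,
# and the updated query embeddings are fed back in at the next frame,
# so query identity carries the track. All shapes/names are assumptions.
import torch
import torch.nn as nn

class TextQueryTracker(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_layers=3):
        super().__init__()
        # Initial learned text queries, shared across videos.
        self.query_embed = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.box_head = nn.Linear(d_model, 4)    # toy per-query box regressor
        self.score_head = nn.Linear(d_model, 1)  # text / no-text confidence

    def forward(self, frame_feats):
        """frame_feats: (T, B, N, C) flattened per-frame backbone features."""
        T, B, N, C = frame_feats.shape
        queries = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)
        boxes, scores = [], []
        for t in range(T):
            # Each query attends to the current frame's features; feeding
            # the updated queries back in at t+1 is what makes tracking
            # implicit -- no cross-frame matching is performed.
            queries = self.decoder(queries, frame_feats[t])
            boxes.append(self.box_head(queries))
            scores.append(self.score_head(queries))
        return torch.stack(boxes), torch.stack(scores)

# Toy usage: 8 frames (the paper reports tracking over more than 7 frames),
# batch of 2, 400 feature tokens of width 256 per frame.
feats = torch.randn(8, 2, 400, 256)
boxes, scores = TextQueryTracker()(feats)
print(boxes.shape, scores.shape)  # (8, 2, 100, 4) (8, 2, 100, 1)
```

A full spotting system would also attach a recognition head so each text query jointly yields a transcription alongside its box and score; that part is omitted here to keep the tracking idea isolated.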
