Paper Title
RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution
Paper Authors
Paper Abstract
Space-Time Video Super-Resolution (STVSR) is the task of interpolating videos with both Low Frame Rate (LFR) and Low Resolution (LR) to produce High Frame Rate (HFR) and High Resolution (HR) counterparts. Existing methods based on Convolutional Neural Networks~(CNNs) achieve visually satisfying results but suffer from slow inference speed due to their heavy architectures. We propose to resolve this issue with a spatial-temporal transformer that naturally incorporates the spatial and temporal super-resolution modules into a single model. Unlike CNN-based methods, we do not explicitly use separate building blocks for temporal interpolation and spatial super-resolution; instead, we use a single end-to-end transformer architecture. Specifically, a reusable dictionary is built by encoders from the input LFR and LR frames, and is then utilized by the decoders to synthesize the HFR and HR frames. Compared with the state-of-the-art TMNet \cite{xu2021temporal}, our network is $60\%$ smaller (4.5M vs. 12.3M parameters) and $80\%$ faster (26.2 fps vs. 14.3 fps on $720\times576$ frames) without sacrificing much performance. The source code is available at https://github.com/llmpass/RSTT.
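To make the encode-once, decode-many idea from the abstract concrete, below is a minimal PyTorch sketch: the encoder turns the input clip into a reusable memory (the "dictionary"), and the decoder queries that same memory once per requested output timestamp. All design choices here are illustrative assumptions (per-pixel temporal token sequences, timestamp-embedding queries, a pixel-shuffle upsampling head, and the hypothetical class name TinySTVSR); this is not the actual RSTT architecture, for which see https://github.com/llmpass/RSTT.

```python
import torch
import torch.nn as nn

class TinySTVSR(nn.Module):
    """Toy encoder-decoder STVSR: encode the input clip once into a reusable
    memory (the "dictionary"), then decode one query per output timestamp."""

    def __init__(self, dim=64, heads=4, depth=2, scale=4, max_t=8):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, 3, padding=1)              # per-frame features
        self.pos = nn.Parameter(torch.zeros(1, max_t, dim))       # temporal positions
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True), depth)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, 4 * dim, batch_first=True), depth)
        self.time_embed = nn.Sequential(                          # timestamp -> query
            nn.Linear(1, dim), nn.GELU(), nn.Linear(dim, dim))
        self.to_rgb = nn.Sequential(                              # x`scale` upsampling
            nn.Conv2d(dim, 3 * scale ** 2, 3, padding=1), nn.PixelShuffle(scale))
        self.scale = scale

    def forward(self, frames, times):
        # frames: (B, T, 3, H, W) LFR+LR clip; times: (N,) timestamps in [0, 1]
        b, t, _, h, w = frames.shape
        f = self.embed(frames.flatten(0, 1))                      # (B*T, C, H, W)
        c = f.shape[1]
        # One temporal token sequence per spatial location: (B*H*W, T, C).
        tokens = f.view(b, t, c, h * w).permute(0, 3, 1, 2).reshape(b * h * w, t, c)
        memory = self.encoder(tokens + self.pos[:, :t])           # built once, reused
        # N time queries all attend to the same memory: (B*H*W, N, C).
        q = self.time_embed(times.view(-1, 1))
        q = q.unsqueeze(0).expand(b * h * w, -1, -1).contiguous()
        out = self.decoder(q, memory)
        n = times.numel()
        out = out.view(b, h * w, n, c).permute(0, 2, 3, 1).reshape(b * n, c, h, w)
        hr = self.to_rgb(out)                                     # (B*N, 3, sH, sW)
        return hr.view(b, n, 3, self.scale * h, self.scale * w)

# Usage: 4 LR input frames in, 7 HR frames out (interpolated times included).
model = TinySTVSR()
clip = torch.randn(1, 4, 3, 32, 32)
hfr_hr = model(clip, torch.linspace(0, 1, 7))                     # (1, 7, 3, 128, 128)
```

The point of the sketch is the cost structure the abstract describes: the encoder memory is computed once per input clip, so synthesizing additional intermediate frames only re-runs the (cheap) decoder queries rather than a separate interpolation network.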