Paper Title

Intrinsic Temporal Regularization for High-resolution Human Video Synthesis

Authors

Lingbo Yang, Zhanning Gao, Peiran Ren, Siwei Ma, Wen Gao

Abstract

Temporal consistency is crucial for extending image processing pipelines to the video domain, and it is often enforced with a flow-based warping error over adjacent frames. Yet for human video synthesis, such a scheme is less reliable due to the misalignment between source and target videos and the difficulty of accurate flow estimation. In this paper, we propose an effective intrinsic temporal regularization scheme to mitigate these issues, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation through temporal loss modulation. This creates a shortcut for back-propagating temporal loss gradients directly to the front-end motion estimator, thus improving training stability and the temporal coherence of output videos. We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating $512\times512$ resolution human action videos with temporally coherent, realistic visual details. Extensive experiments demonstrate the superiority of the proposed INTERnet over several competitive baselines.
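The abstract's core mechanism, a flow-based warping error down-weighted by an estimated confidence map, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`warp`, `modulated_temporal_loss`), the nearest-neighbor warping, and the L1 error are assumptions chosen for brevity; the actual method estimates the confidence map inside the frame generator and back-propagates through a learned motion estimator.

```python
import numpy as np

def warp(frame, flow):
    """Backward-warp a frame (H, W) by a dense flow field (H, W, 2)
    using nearest-neighbor sampling (illustrative only)."""
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]

def modulated_temporal_loss(prev_out, cur_out, flow, confidence):
    """Confidence-modulated warping error: per-pixel L1 between the
    current frame and the warped previous frame, down-weighted where
    the confidence map flags the flow as unreliable."""
    warped = warp(prev_out, flow)
    per_pixel = np.abs(cur_out - warped)
    return float((confidence * per_pixel).mean())
```

Pixels with low confidence (e.g. occlusions or misaligned regions) contribute little to the loss, which is what makes the temporal gradient signal more stable than a plain warping error.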
