Paper Title

HRVGAN: High Resolution Video Generation using Spatio-Temporal GAN

Paper Authors

Sagar, Abhinav

Paper Abstract

High-resolution video generation has emerged as a crucial task in computer vision, with wide-ranging applications in entertainment, simulation, and data augmentation. However, generating temporally coherent and visually realistic videos remains a significant challenge due to the high dimensionality and complex dynamics of video data. In this paper, we propose a novel deep generative network architecture designed specifically for high-resolution video synthesis. Our approach integrates key concepts from Wasserstein Generative Adversarial Networks (WGANs), enforcing a k-Lipschitz continuity constraint on the discriminator to stabilize training and enhance convergence. We further leverage Conditional GAN (cGAN) techniques by incorporating class labels during both training and inference, enabling class-specific video generation with improved semantic consistency. We provide a detailed layer-wise description of the Generator and Discriminator networks, highlighting architectural design choices that promote temporal coherence and spatial detail. The overall combined architecture, training algorithm, and optimization strategy are thoroughly presented. Our training objective combines a pixel-wise mean squared error loss with an adversarial loss to balance frame-level accuracy and video realism. We validate our approach on benchmark datasets including UCF101, Golf, and Aeroplane, encompassing diverse motion patterns and scene contexts. Quantitative evaluations using Inception Score (IS) and Fréchet Inception Distance (FID) demonstrate that our model significantly outperforms previous state-of-the-art unsupervised video generation methods in terms of both quality and diversity.
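The training objective described in the abstract, a pixel-wise mean squared error term combined with a WGAN-style adversarial term under a k-Lipschitz critic constraint and class-label conditioning, can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the module interfaces (`critic` taking a `labels` argument), the gradient-penalty mechanism used here to approximate the Lipschitz constraint, and the coefficients `lambda_pix` and `lambda_gp` are all assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of the training objective
# described in the abstract: pixel-wise MSE + WGAN-style adversarial loss with
# class-conditional inputs. The k-Lipschitz constraint on the critic is
# approximated here with a gradient penalty; the paper may instead use weight
# clipping or another mechanism.
import torch
import torch.nn.functional as F

def gradient_penalty(critic, real, fake, labels):
    """Penalize critic gradient norms away from 1 on interpolated videos
    (tensors shaped [batch, channels, time, height, width])."""
    alpha = torch.rand(real.size(0), 1, 1, 1, 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interp, labels)
    grads = torch.autograd.grad(scores.sum(), interp, create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

def critic_loss(critic, real, fake, labels, lambda_gp=10.0):
    """Wasserstein critic loss plus gradient penalty (lambda_gp assumed)."""
    w_gap = critic(fake, labels).mean() - critic(real, labels).mean()
    return w_gap + lambda_gp * gradient_penalty(critic, real, fake, labels)

def generator_loss(critic, real, fake, labels, lambda_pix=1.0):
    """Pixel-wise MSE for frame-level accuracy plus the WGAN generator term
    for video realism; lambda_pix balances the two (value assumed)."""
    return lambda_pix * F.mse_loss(fake, real) + (-critic(fake, labels).mean())
```

In this sketch, `labels` stands for the class labels the abstract says are supplied to both networks during training and inference, matching the cGAN-style conditioning it describes.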
