Paper Title

Audio-visual video face hallucination with frequency supervision and cross modality support by speech based lip reading loss

Paper Authors

Shailza Sharma, Abhinav Dhall, Vinay Kumar, Vivek Singh Bawa

Paper Abstract

Recently, there have been numerous breakthroughs in face hallucination. However, the task remains challenging for videos in comparison to images due to inherent temporal-consistency issues. The extra temporal dimension in video face hallucination makes it non-trivial to learn facial motion throughout a sequence. To learn these fine spatio-temporal motion details, we propose a novel cross-modal audio-visual Video Face Hallucination Generative Adversarial Network (VFH-GAN). The architecture exploits the semantic correlation between the movement of the facial structure and the associated speech signal. Another major issue in current video-based approaches is blurriness around key facial regions such as the mouth and lips, where spatial displacement is much higher than in other areas. The proposed approach explicitly defines a lip-reading loss to learn the fine-grained motion in these facial regions. During training, GANs tend to fit frequencies from low to high, which causes hard-to-synthesize frequencies to be missed. Therefore, to add salient frequency features to the network, we add a frequency-based loss function. Visual and quantitative comparisons with the state of the art show a significant improvement in performance and efficacy.
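The abstract only names the two auxiliary objectives without giving their form. As an illustrative sketch only, not the authors' implementation, the snippet below shows one plausible PyTorch rendering of an FFT-based frequency loss and a feature-matching lip-reading loss; `lipread_net`, `mouth_box`, and all function names here are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def frequency_loss(sr_frames, hr_frames):
    """Sketch of a frequency-domain loss: compare hallucinated and
    ground-truth frames as 2-D spectra so hard-to-synthesize
    frequencies contribute explicitly to the objective.

    sr_frames, hr_frames: tensors of shape (B, C, H, W).
    """
    # 2-D FFT of each frame with orthonormal scaling (complex output).
    sr_freq = torch.fft.fft2(sr_frames, norm="ortho")
    hr_freq = torch.fft.fft2(hr_frames, norm="ortho")
    # Mean magnitude of the spectral difference penalizes the
    # frequency components the generator fails to reproduce.
    return (sr_freq - hr_freq).abs().mean()

def lip_reading_loss(lipread_net, sr_frames, hr_frames, mouth_box):
    """Sketch of a lip-reading loss: match features of a frozen,
    pre-trained lip-reading network on mouth crops.

    mouth_box: (top, left, height, width) of the mouth region.
    """
    t, l, h, w = mouth_box
    sr_mouth = sr_frames[..., t:t + h, l:l + w]
    hr_mouth = hr_frames[..., t:t + h, l:l + w]
    # Target features carry no gradient; gradients flow only
    # through the hallucinated branch into the generator.
    with torch.no_grad():
        target_feats = lipread_net(hr_mouth)
    pred_feats = lipread_net(sr_mouth)
    return F.l1_loss(pred_feats, target_feats)
```

Both terms would typically be weighted and added to the usual adversarial and reconstruction losses during generator updates.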
