Paper Title

Modality Dropout for Improved Performance-driven Talking Faces

Paper Authors

Ahmed Hussen Abdelaziz, Barry-John Theobald, Paul Dixon, Reinhard Knothe, Nicholas Apostoloff, Sachin Kajareker

Paper Abstract

We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information. To ensure that our model exploits both modalities during training, batches are generated that contain audio-only, video-only, and audiovisual input features. The probability of dropping a modality allows control over the degree to which the model exploits audio and visual information during training. Our trained model runs in real time on resource-limited hardware (e.g., a smartphone), it is user-agnostic, and it does not depend on a potentially error-prone transcription of the speech. We use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout. Before introducing dropout, viewers prefer audiovisual-driven animation in 51% of the test sequences compared with only 18% for video-driven animation. After introducing dropout, viewer preference for audiovisual-driven animation increases to 74%, while preference for video-only drops to 8%.
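The abstract describes modality dropout only at a high level (batches containing audio-only, video-only, and audiovisual features, with controllable drop probabilities). The snippet below is a minimal illustrative sketch, not the authors' implementation: the function name, the specific drop probabilities, and the choice of zeroing out the dropped modality's features are all assumptions made for illustration.

```python
import torch

def apply_modality_dropout(audio_feats, video_feats,
                           p_drop_audio=0.25, p_drop_video=0.25):
    """Randomly drop one input modality for an entire training batch.

    With probability p_drop_audio the audio features are zeroed, giving a
    video-only batch; with probability p_drop_video the video features are
    zeroed, giving an audio-only batch; otherwise both modalities are kept
    (audiovisual batch). The two drop events are mutually exclusive, so at
    least one modality is always present.
    """
    r = torch.rand(1).item()
    if r < p_drop_audio:
        audio_feats = torch.zeros_like(audio_feats)   # video-only batch
    elif r < p_drop_audio + p_drop_video:
        video_feats = torch.zeros_like(video_feats)   # audio-only batch
    return audio_feats, video_feats
```

Raising or lowering the two probabilities controls how strongly the model is pushed to rely on each modality during training, which is the mechanism the abstract credits for the improved preference scores.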
