Paper title
FaceFilter: Audio-visual speech separation using still images
Paper authors
Paper abstract
The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network. Unlike previous works that use lip movements from video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker. In this task, the conditional feature is obtained from facial appearance via a cross-modal biometric task, in which audio and visual identity representations are shared in a latent space. The identity learnt from the face image forces the network to isolate the matched speaker and extract that speaker's voice from the mixed speech. This resolves the permutation problem caused by swapped channel outputs, which frequently occurs in speech separation tasks. The proposed method is far more practical than video-based speech separation since user profile images are readily available on many platforms. Also, unlike speaker-aware separation methods, it is applicable to separation with unseen speakers who have never been enrolled. We show strong qualitative and quantitative results on challenging real-world examples.
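A minimal PyTorch sketch of the conditioning idea described in the abstract: a still face image is encoded into an identity embedding, which is broadcast along time and fused with the mixture spectrogram so that the separator predicts a mask for the matched speaker only. All module names, layer sizes, and the concatenation-based fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch of face-conditioned speech separation.
# Dimensions and fusion strategy are assumptions, not FaceFilter's exact design.
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    """Maps a single face image to an identity embedding (assumed 512-d)."""
    def __init__(self, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, img):                   # img: (B, 3, H, W)
        return self.net(img)                  # identity embedding: (B, emb_dim)

class ConditionalSeparator(nn.Module):
    """Predicts a soft mask for the target speaker, conditioned on identity."""
    def __init__(self, n_freq=257, emb_dim=512, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_freq + emb_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, identity):    # mix_spec: (B, T, F)
        # Tile the identity embedding across every time frame, then
        # concatenate it with the mixture features before the recurrent stack.
        cond = identity.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        h, _ = self.blstm(torch.cat([mix_spec, cond], dim=-1))
        return self.mask(h) * mix_spec        # masked target-speaker spectrogram

# Toy forward pass on random data.
face = torch.randn(2, 3, 112, 112)            # target speaker's still image
mix = torch.rand(2, 100, 257)                 # mixture magnitude spectrogram
enc, sep = FaceEncoder(), ConditionalSeparator()
target_spec = sep(mix, enc(face))
print(target_spec.shape)                      # torch.Size([2, 100, 257])
```

Because the network always emits the channel matching the conditioning identity, there is no ambiguity about which output slot holds which speaker, which is how this style of conditioning sidesteps the permutation problem mentioned above.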