Paper Title
Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast
Paper Authors
Paper Abstract
We present an approach to learning voice-face representations from talking-face videos without any identity labels. Previous works employ cross-modal instance discrimination tasks to establish the correlation between voices and faces. These methods neglect the semantic content shared across different videos, introducing false-negative pairs as training noise. Furthermore, the positive pairs are constructed from the natural correlation between audio clips and visual frames. However, this correlation can be weak or inaccurate in a large amount of real-world data, which injects deviated positives into the contrastive paradigm. To address these issues, we propose cross-modal prototype contrastive learning (CMPC), which retains the advantages of contrastive methods while resisting the adverse effects of false negatives and deviated positives. On the one hand, CMPC learns intra-class invariance by constructing semantic-wise positives via unsupervised clustering in each modality. On the other hand, by comparing the similarities of cross-modal instances with those of cross-modal prototypes, we dynamically recalibrate the unlearnable instances' contribution to the overall loss. Experiments show that the proposed approach outperforms state-of-the-art unsupervised methods on various voice-face association evaluation protocols. Additionally, in the low-shot supervision setting, our method also improves significantly over previous instance-wise contrastive learning.
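The abstract sketches two mechanisms: semantic-wise positives obtained by clustering each modality into prototypes, and a recalibration that compares instance-level with prototype-level cross-modal similarity to down-weight deviated positives. As a rough illustration only (the paper's exact loss is not reproduced here), the following minimal PyTorch sketch implements one plausible reading; the function name cmpc_loss_sketch, the k-means prototypes, the sigmoid re-weighting, and all hyperparameters are hypothetical assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans  # assumption: k-means as the unsupervised clusterer

def cmpc_loss_sketch(voice_emb, face_emb, n_clusters=8, tau=0.07):
    """Hypothetical sketch of a cross-modal prototype contrastive loss."""
    # L2-normalize instance embeddings of N paired (voice, face) clips.
    v = F.normalize(voice_emb, dim=1)
    f = F.normalize(face_emb, dim=1)
    N = v.size(0)

    # 1) Unsupervised clustering within each modality yields pseudo-labels;
    #    prototypes are the normalized centroids of each cluster.
    v_lab = torch.as_tensor(KMeans(n_clusters, n_init=10)
                            .fit_predict(v.detach().cpu().numpy()))
    f_lab = torch.as_tensor(KMeans(n_clusters, n_init=10)
                            .fit_predict(f.detach().cpu().numpy()))
    v_proto = F.normalize(torch.stack(
        [v[v_lab == k].mean(0) for k in range(n_clusters)]), dim=1)
    f_proto = F.normalize(torch.stack(
        [f[f_lab == k].mean(0) for k in range(n_clusters)]), dim=1)

    # 2) Cross-modal instance-to-prototype contrast: a voice embedding should
    #    match the prototype of its paired face cluster, and vice versa.
    proto_loss = (F.cross_entropy(v @ f_proto.t() / tau, f_lab) +
                  F.cross_entropy(f @ v_proto.t() / tau, v_lab)) / 2

    # 3) Recalibration: pairs whose instance-level cross-modal similarity falls
    #    well below the prototype-level similarity are treated as likely
    #    deviated positives and down-weighted (sigmoid gate is an assumption).
    inst_sim = (v * f).sum(1)                          # (N,) instance similarity
    proto_sim = (v_proto[v_lab] * f_proto[f_lab]).sum(1)
    w = torch.sigmoid((inst_sim - proto_sim) / tau).detach()

    inst_loss = F.cross_entropy(v @ f.t() / tau, torch.arange(N),
                                reduction='none')
    return proto_loss + (w * inst_loss).mean()
```

In practice the prototypes would presumably be recomputed periodically over the whole dataset rather than per mini-batch; the sketch clusters a single batch only to stay self-contained.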