Paper Title
Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization
Paper Authors
Paper Abstract
Although audio-visual representation learning has proven applicable to many downstream tasks, the representation of dance videos, which is more specific and always accompanied by music with complex auditory content, remains challenging and underexplored. Considering the intrinsic alignment between the cadenced movements of dancers and the music rhythm, we introduce MuDaR, a novel Music-Dance Representation learning framework that synchronizes music and dance rhythms in both explicit and implicit ways. Specifically, we derive dance rhythms from visual appearance and motion cues, inspired by music rhythm analysis. The visual rhythms are then temporally aligned with their music counterparts, which are extracted from the amplitude of the sound intensity. Meanwhile, we exploit the implicit rhythmic coherence between the audio and visual streams via contrastive learning: the model learns a joint embedding by predicting the temporal consistency of audio-visual pairs. The music-dance representation, together with the ability to detect audio and visual rhythms, can further be applied to three downstream tasks: (a) dance classification, (b) music-dance retrieval, and (c) music-dance retargeting. Extensive experiments demonstrate that our proposed framework outperforms other self-supervised methods by a large margin.
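To give a concrete sense of the amplitude-based music rhythm extraction described in the abstract, the sketch below marks rhythm points as prominent local peaks of a short-time amplitude envelope. It is an illustrative toy implementation using only NumPy; the frame length, hop size, threshold factor `k`, and the peak-picking rule are assumptions for demonstration, not the paper's exact procedure.

```python
import numpy as np


def music_rhythm_from_amplitude(audio, sr, frame_len=1024, hop=512, k=1.5):
    """Return rhythm-point times (seconds) as peaks of the amplitude envelope.

    Illustrative sketch of amplitude-based rhythm extraction; the paper's
    actual pipeline may use a different envelope and peak-picking scheme.
    """
    # Short-time amplitude envelope: RMS of each frame.
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop)
    env = np.array([
        np.sqrt(np.mean(audio[i * hop:i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])
    # A frame is a rhythm point if it is a local maximum of the envelope
    # and exceeds k times the mean envelope (simple global threshold).
    thresh = k * env.mean()
    peaks = [
        i for i in range(1, n_frames - 1)
        if env[i] > env[i - 1] and env[i] >= env[i + 1] and env[i] > thresh
    ]
    # Convert frame indices to timestamps in seconds.
    return [i * hop / sr for i in peaks]
```

Running this on a quiet signal with two short loud bursts recovers two rhythm points near the burst onsets; a real system would refine this with onset-strength curves and adaptive thresholding before aligning the detected times with the visual rhythm.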