Paper Title
MM-ALT: A Multimodal Automatic Lyric Transcription System
Paper Authors
Paper Abstract
Automatic lyric transcription (ALT) is a nascent field of study attracting increasing interest from both the speech and music information retrieval communities, given its significant application potential. However, ALT with audio data alone is a notoriously difficult task due to instrumental accompaniment and musical constraints resulting in degradation of both the phonetic cues and the intelligibility of sung lyrics. To tackle this challenge, we propose the MultiModal Automatic Lyric Transcription system (MM-ALT), together with a new dataset, N20EM, which consists of audio recordings, videos of lip movements, and inertial measurement unit (IMU) data from an earbud worn by the performing singer. We first adapt the wav2vec 2.0 framework from automatic speech recognition (ASR) to the ALT task. We then propose a video-based ALT method and an IMU-based voice activity detection (VAD) method. In addition, we put forward the Residual Cross Attention (RCA) mechanism to fuse data from the three modalities (i.e., audio, video, and IMU). Experiments show the effectiveness of our proposed MM-ALT system, especially in terms of noise robustness. The project page is at https://n20em.github.io.
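The abstract names Residual Cross Attention (RCA) as the fusion mechanism but does not spell out its form. As a rough illustration only, the sketch below shows one plausible reading: single-head scaled dot-product cross-attention from one modality's features (e.g., audio) into another's (e.g., video), with the attended result added back to the query stream through a residual connection. All function names, dimensions, and the single-head simplification are assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of a residual cross-attention fusion step.
# Assumed: single-head scaled dot-product attention, no learned
# projections or layer norm; the paper's exact RCA design may differ.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_cross_attention(query_feats, kv_feats):
    """Attend from one modality (queries) to another (keys/values),
    then add the attended result back to the query stream."""
    d = query_feats.shape[-1]
    scores = query_feats @ kv_feats.T / np.sqrt(d)    # (Tq, Tkv)
    attended = softmax(scores, axis=-1) @ kv_feats    # (Tq, d)
    return query_feats + attended                     # residual connection

# Toy features: 5 audio frames and 4 video frames, 8-dim each.
rng = np.random.default_rng(0)
audio = rng.normal(size=(5, 8))
video = rng.normal(size=(4, 8))
fused = residual_cross_attention(audio, video)
print(fused.shape)  # (5, 8): fused stream keeps the query time axis
```

Note that the residual path lets the query modality pass through unchanged when the other modality contributes nothing: if the key/value features are all zeros, the attended term vanishes and the output equals the input, which is one intuition for why such fusion can remain robust when a modality is noisy or absent.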