Paper Title

Content based singing voice source separation via strong conditioning using aligned phonemes

Authors

Gabriel Meseguer-Brocal, Geoffroy Peeters

Abstract

Informed source separation has recently gained renewed interest with the introduction of neural networks and the availability of large multitrack datasets containing both the mixture and the separated sources. These approaches use prior information about the target source to improve separation. Historically, Music Information Retrieval researchers have focused primarily on score-informed source separation, but more recent approaches explore lyrics-informed source separation. However, because of the lack of multitrack datasets with time-aligned lyrics, models use weak conditioning with non-aligned lyrics. In this paper, we present a multimodal multitrack dataset with lyrics aligned in time at the word level with phonetic information, and explore strong conditioning using the aligned phonemes. Our model follows a U-Net architecture and takes as input both the magnitude spectrogram of a musical mixture and a matrix with aligned phonetic information. The phoneme matrix is embedded to obtain the parameters that control Feature-wise Linear Modulation (FiLM) layers. These layers condition the U-Net feature maps to adapt the separation process to the presence of different phonemes via affine transformations. We show that phoneme conditioning can be successfully applied to improve singing voice source separation.
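The abstract describes embedding a time-aligned phoneme matrix to produce the parameters of FiLM layers, which then apply per-channel affine transformations to the U-Net feature maps. The following is a minimal PyTorch sketch of that conditioning idea, not the authors' implementation: the layer sizes, tensor shapes, and the assumption that the phoneme matrix shares the feature map's time resolution are all illustrative.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a per-channel affine transform."""
    def forward(self, features, gamma, beta):
        # features: (batch, channels, freq, time)
        # gamma, beta: (batch, channels, time) -- broadcast over the frequency axis
        return gamma[:, :, None, :] * features + beta[:, :, None, :]

class PhonemeConditioning(nn.Module):
    """Embeds an aligned phoneme matrix and predicts time-varying FiLM parameters.

    The sizes (n_phonemes, embed_dim, n_channels) are illustrative assumptions,
    not values from the paper.
    """
    def __init__(self, n_phonemes=40, embed_dim=64, n_channels=32):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(n_phonemes, embed_dim), nn.ReLU())
        self.to_gamma = nn.Linear(embed_dim, n_channels)
        self.to_beta = nn.Linear(embed_dim, n_channels)

    def forward(self, phoneme_matrix):
        # phoneme_matrix: (batch, time, n_phonemes), aligned with the mixture spectrogram
        h = self.embed(phoneme_matrix)               # (batch, time, embed_dim)
        gamma = self.to_gamma(h).transpose(1, 2)     # (batch, n_channels, time)
        beta = self.to_beta(h).transpose(1, 2)       # (batch, n_channels, time)
        return gamma, beta

# Usage: condition one U-Net feature map on the aligned phonemes.
# The feature map and phoneme matrix are assumed to share the same time
# resolution; a real U-Net encoder would need the conditioning resampled to
# match each layer's downsampled time axis.
batch, channels, freq, frames = 2, 32, 256, 128
features = torch.randn(batch, channels, freq, frames)   # e.g. an encoder feature map
phonemes = torch.rand(batch, frames, 40)                # aligned phoneme activations
conditioner = PhonemeConditioning(n_phonemes=40, n_channels=channels)
gamma, beta = conditioner(phonemes)
conditioned = FiLM()(features, gamma, beta)
print(conditioned.shape)                                # torch.Size([2, 32, 256, 128])
```

Because gamma and beta vary with the phoneme activations at each frame, the affine modulation changes over time, which is what distinguishes this strong, aligned conditioning from weak conditioning with a single song-level lyrics vector.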
