Paper Title

Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

Authors

Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

Abstract

Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and the words/tags associated with it do not employ text processing models capable of generalizing to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends to the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of the AAE using a contrastive loss. We jointly optimize the AAE and MHA, and we evaluate the audio representations (i.e., the output of the encoder of the AAE) by utilizing them in three different downstream tasks, namely sound, music genre, and music instrument classification. Our results show that employing self-attention with multiple heads in the tag-based network yields better audio representations.
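
The abstract outlines the overall pipeline: pretrained word embeddings of the tags are contextualized by multi-head self-attention and aligned with the AAE encoder's audio embedding through a contrastive loss. Below is a minimal PyTorch sketch of that alignment step only; the layer sizes, the mean pooling over tags, and the NT-Xent-style loss are illustrative assumptions rather than the paper's actual configuration, and the AAE decoder with its reconstruction loss (optimized jointly in the paper) is omitted.

```python
# Minimal sketch of aligning contextualized tag embeddings with audio embeddings.
# All module sizes, the pooling strategy, and the exact contrastive loss are
# assumptions for illustration; they are not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TagEncoder(nn.Module):
    """Contextualizes pre-computed word/tag embeddings with multi-head self-attention."""
    def __init__(self, emb_dim=300, num_heads=4, out_dim=128):
        super().__init__()
        self.mha = nn.MultiheadAttention(emb_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(emb_dim, out_dim)

    def forward(self, tag_embs):                          # (batch, n_tags, emb_dim)
        ctx, _ = self.mha(tag_embs, tag_embs, tag_embs)   # self-attention over the tags
        pooled = ctx.mean(dim=1)                          # mean pooling (an assumption)
        return self.proj(pooled)                          # (batch, out_dim)

class AudioEncoder(nn.Module):
    """Stand-in for the encoder of the audio autoencoder (AAE)."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, out_dim),
        )

    def forward(self, spec):                              # (batch, 1, n_mels, frames)
        return self.net(spec)

def contrastive_alignment_loss(audio_z, tag_z, temperature=0.1):
    """NT-Xent-style loss pulling matching audio/tag pairs together within a batch."""
    audio_z = F.normalize(audio_z, dim=-1)
    tag_z = F.normalize(tag_z, dim=-1)
    logits = audio_z @ tag_z.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(audio_z.size(0), device=audio_z.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    tags = torch.randn(8, 5, 300)       # 8 clips, 5 tags each, 300-d word embeddings
    audio = torch.randn(8, 1, 96, 128)  # 8 mel-spectrogram excerpts
    loss = contrastive_alignment_loss(AudioEncoder()(audio), TagEncoder()(tags))
    print(loss.item())
```

At downstream-evaluation time, only the audio encoder is kept and its output embedding is fed to a task-specific classifier (sound, music genre, or music instrument classification in the paper).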
