Paper Title

MuLan: A Joint Embedding of Music Audio and Natural Language

Paper Authors

Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, Daniel P. W. Ellis

Paper Abstract

Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
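
The abstract names the architecture (a two-tower joint audio-text embedding trained on weakly paired recordings and free-form text) without giving implementation details. The following is a minimal PyTorch sketch of how such a model can be trained and used for zero-shot tagging, assuming a CLIP-style symmetric contrastive (InfoNCE) objective; the MLP towers, feature dimensions, and temperature are illustrative placeholders, not MuLan's actual encoders or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    """Toy two-tower model: separate encoders map audio features and
    text features into a shared embedding space of dimension `dim`.
    (MuLan's real towers are an audio network and a text encoder;
    simple MLPs keep this sketch self-contained and runnable.)"""
    def __init__(self, audio_dim=128, text_dim=768, dim=128):
        super().__init__()
        self.audio_tower = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(), nn.Linear(512, dim))
        self.text_tower = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, audio_feats, text_feats):
        # L2-normalize so dot products are cosine similarities.
        a = F.normalize(self.audio_tower(audio_feats), dim=-1)
        t = F.normalize(self.text_tower(text_feats), dim=-1)
        return a, t

def contrastive_loss(a, t, temperature=0.07):
    """Symmetric InfoNCE-style loss over in-batch pairs: each audio clip
    should score highest against its own (weakly associated) text."""
    logits = a @ t.T / temperature       # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0))     # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Placeholder batch: random stand-ins for real audio/text features.
model = TwoTowerModel()
audio = torch.randn(4, 128)
texts = torch.randn(4, 768)
a, t = model(audio, texts)
loss = contrastive_loss(a, t)
scores = a @ t.T                         # cross-modal retrieval scores
```

At inference time, zero-shot tagging reduces to embedding arbitrary candidate tag strings with the text tower and ranking them by cosine similarity against an audio embedding, which is what frees the system from a fixed tag ontology.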
