Paper Title

MuLan: A Joint Embedding of Music Audio and Natural Language

Paper Authors

Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, Daniel P. W. Ellis

Paper Abstract

Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
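
The abstract names the architecture (a two-tower joint audio-text embedding trained on weakly paired recordings and free-form text) without giving implementation details. The following is a minimal PyTorch sketch of how such a model can be trained and used for zero-shot tagging, assuming a CLIP-style symmetric contrastive (InfoNCE) objective; the MLP towers, feature dimensions, and temperature are illustrative placeholders, not MuLan's actual encoders or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    """Toy two-tower model: separate encoders map audio features and
    text features into a shared embedding space of dimension `dim`.
    (MuLan's real towers are an audio network and a text encoder;
    simple MLPs keep this sketch self-contained and runnable.)"""
    def __init__(self, audio_dim=128, text_dim=768, dim=128):
        super().__init__()
        self.audio_tower = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(), nn.Linear(512, dim))
        self.text_tower = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, audio_feats, text_feats):
        # L2-normalize so dot products are cosine similarities.
        a = F.normalize(self.audio_tower(audio_feats), dim=-1)
        t = F.normalize(self.text_tower(text_feats), dim=-1)
        return a, t

def contrastive_loss(a, t, temperature=0.07):
    """Symmetric InfoNCE-style loss over in-batch pairs: each audio clip
    should score highest against its own (weakly associated) text."""
    logits = a @ t.T / temperature       # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0))     # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Placeholder batch: random stand-ins for real audio/text features.
model = TwoTowerModel()
audio = torch.randn(4, 128)
texts = torch.randn(4, 768)
a, t = model(audio, texts)
loss = contrastive_loss(a, t)
scores = a @ t.T                         # cross-modal retrieval scores
```

At inference time, zero-shot tagging reduces to embedding arbitrary candidate tag strings with the text tower and ranking them by cosine similarity against an audio embedding, which is what frees the system from a fixed tag ontology.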
