Paper Title
On Metric Learning for Audio-Text Cross-Modal Retrieval
Paper Authors
Paper Abstract
Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates, given a query in the other modality. Solving such a cross-modal retrieval task is challenging because it requires not only learning robust feature representations for both modalities, but also capturing the fine-grained alignment between them. Existing cross-modal retrieval models are mostly optimized with metric learning objectives, which attempt to map data into an embedding space where similar data are close together and dissimilar data are far apart. Unlike other cross-modal retrieval tasks such as image-text and video-text retrieval, audio-text retrieval remains an under-explored task. In this work, we aim to study the impact of different metric learning objectives on the audio-text retrieval task. We present an extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets. We demonstrate that the NT-Xent loss, adapted from self-supervised learning, shows stable performance across different datasets and training settings, and outperforms the popular triplet-based losses. Our code is available at https://github.com/XinhaoMei/audio-text_retrieval.
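For readers unfamiliar with the NT-Xent objective mentioned in the abstract, the snippet below is a minimal PyTorch sketch of a symmetric, batch-wise contrastive loss over paired audio and caption embeddings. The function name, the temperature value, and the symmetric (audio-to-text plus text-to-audio) formulation are illustrative assumptions, not the authors' exact implementation; see the linked repository for the code used in the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(audio_embs: torch.Tensor,
                 text_embs: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Sketch of a bidirectional NT-Xent (InfoNCE-style) loss.

    audio_embs, text_embs: (batch, dim) embeddings where row i of each
    tensor comes from the same audio-caption pair.
    """
    # L2-normalize so dot products are cosine similarities.
    audio_embs = F.normalize(audio_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)

    # Pairwise similarity matrix scaled by temperature: (batch, batch).
    logits = audio_embs @ text_embs.t() / temperature

    # Matched pairs lie on the diagonal; all other entries act as negatives.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both retrieval directions and average.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2
```

In contrast to triplet-based losses, which compare each anchor against a single sampled positive and negative, this objective treats every other item in the batch as a negative, which is one commonly cited reason for its stability across batch sizes and datasets.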