A Study on Robustness to Perturbations for Representations of Environmental Sound

Paper Title

A Study on Robustness to Perturbations for Representations of Environmental Sound

Authors

Sangeeta Srivastava, Ho-Hsiang Wu, Joao Rulff, Magdalena Fuentes, Mark Cartwright, Claudio Silva, Anish Arora, Juan Pablo Bello

Abstract

Audio applications involving environmental sound analysis increasingly use general-purpose audio representations, also known as embeddings, for transfer learning. Recently, Holistic Evaluation of Audio Representations (HEAR) evaluated twenty-nine embedding models on nineteen diverse tasks. However, the evaluation's effectiveness depends on the variation already captured within a given dataset. Therefore, for a given data domain, it is unclear how the representations would be affected by the variations caused by the range of microphones and acoustic conditions, commonly known as channel effects. In this work, we aim to extend HEAR to evaluate invariance to channel effects. To accomplish this, we imitate channel effects by injecting perturbations into the audio signal and measure the shift in the new (perturbed) embeddings with three distance measures, making the evaluation domain-dependent but not task-dependent. Combined with the downstream performance, this helps us make a more informed prediction of how robust the embeddings are to channel effects. We evaluate two embeddings, YAMNet and OpenL3, on monophonic (UrbanSound8K) and polyphonic (SONYC-UST) urban datasets. We show that one distance measure does not suffice in such a task-independent evaluation. Although Fréchet Audio Distance (FAD) correlates most accurately with the trend of the performance drop in the downstream task, we show that FAD must be studied in conjunction with the other distances to get a clear understanding of the overall effect of the perturbation. In terms of embedding performance, we find OpenL3 to be more robust than YAMNet, which aligns with the HEAR evaluation.
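The perturb-then-measure idea in the abstract can be illustrated with a minimal sketch. It computes a Fréchet distance between Gaussians fitted to clean and perturbed embedding sets, which is the quantity underlying FAD. Everything beyond that formula is an assumption made for illustration: `add_white_noise` is a hypothetical perturbation, `toy_embedding` is a stand-in for a real model such as YAMNet or OpenL3, and the abstract does not specify the actual perturbations or the other two distance measures.

```python
import numpy as np


def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (n_samples, dim) embedding sets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(emb_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(emb_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (clipping guards against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def add_white_noise(signal: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical perturbation: additive white Gaussian noise at a target SNR (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)


def toy_embedding(signal: np.ndarray, frame: int = 256, n_bands: int = 8) -> np.ndarray:
    """Stand-in for a real embedding model: log band energies per frame."""
    n_frames = len(signal) // frame
    frames = signal[: n_frames * frame].reshape(n_frames, frame)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(power, n_bands, axis=1)
    return np.log1p(np.stack([band.sum(axis=1) for band in bands], axis=1))


rng = np.random.default_rng(0)
t = np.arange(2 * 16000) / 16000.0  # two seconds at an assumed 16 kHz sample rate
clean = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

# Measure the embedding shift for increasingly strong perturbations.
for snr_db in (30, 10, 0):
    perturbed = add_white_noise(clean, snr_db, rng)
    shift = frechet_distance(toy_embedding(clean), toy_embedding(perturbed))
    print(f"SNR {snr_db:>2} dB: Frechet distance = {shift:.4f}")
```

Because the distance is computed on embeddings alone, no labels or downstream classifier are needed, which is what makes this kind of evaluation domain-dependent but task-independent.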
