Paper Title
ESResNet: Environmental Sound Classification Based on Visual Domain Models
Paper Authors
Paper Abstract
Environmental Sound Classification (ESC) is an active research area in the audio domain and has seen a lot of progress in the past years. However, many of the existing approaches achieve high accuracy by relying on domain-specific features and architectures, making it harder to benefit from advances in other fields (e.g., the image domain). Additionally, some of the past successes have been attributed to discrepancies in how results are evaluated (i.e., on unofficial splits of the UrbanSound8K (US8K) dataset), distorting the overall progression of the field. The contribution of this paper is twofold. First, we present a model that is inherently compatible with mono and stereo sound inputs. Our model is based on simple log-power Short-Time Fourier Transform (STFT) spectrograms and combines them with several well-known approaches from the image domain (i.e., ResNet, Siamese-like networks, and attention). We investigate the influence of cross-domain pre-training and architectural changes, and evaluate our model on standard datasets. We find that, in a fair comparison, our model outperforms all previously known approaches by achieving accuracies of 97.0% (ESC-10), 91.5% (ESC-50), and 84.2% / 85.4% (US8K mono / stereo). Second, we provide a comprehensive overview of the actual state of the field by differentiating several previously reported results on the US8K dataset between official and unofficial splits. For better reproducibility, our code (including any re-implementations) is made available.
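To make the described front end concrete, below is a minimal sketch, not the authors' released code, of the kind of pipeline the abstract outlines: per-channel log-power STFT spectrograms fed to an ImageNet-style ResNet backbone. The function name log_power_stft, the window/hop parameters, and the single-input-channel conv1 replacement are illustrative assumptions; the attention mechanism and the exact Siamese-like fusion of channel embeddings from the paper are omitted.

    # Sketch of a log-power STFT front end feeding a ResNet (assumed setup).
    import torch
    import torchvision

    def log_power_stft(waveform: torch.Tensor,
                       n_fft: int = 2048,
                       hop_length: int = 512,
                       eps: float = 1e-10) -> torch.Tensor:
        """waveform: (channels, samples) -> (channels, freq_bins, frames)."""
        window = torch.hann_window(n_fft)
        spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                          window=window, return_complex=True)
        power = spec.abs() ** 2          # power spectrogram per channel
        return torch.log(power + eps)    # log-power; eps avoids log(0)

    # A stereo clip yields one spectrogram "image" per channel; treating the
    # channel axis as a batch lets a single shared backbone process both
    # channels with tied weights (the Siamese-like idea mentioned above).
    waveform = torch.randn(2, 44100 * 5)     # 5 s of dummy stereo audio, 44.1 kHz
    spec = log_power_stft(waveform)          # (2, 1025, frames)

    backbone = torchvision.models.resnet50(weights=None)
    backbone.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                     padding=3, bias=False)  # 1-channel input
    features = backbone(spec.unsqueeze(1))   # (2, 1000): one embedding per channel

Under these assumptions, mono audio is simply the single-channel special case of the same code path, which is one way to read the claim that the model is "inherently compatible with mono and stereo sound inputs"; cross-domain pre-training would correspond to initializing the backbone with ImageNet weights instead of weights=None.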