Paper Title
Learning Temporal Resolution in Spectrogram for Audio Classification
Authors
Abstract
The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of its key attributes is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume that the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not always optimal for different types of sound. The temporal resolution affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods that use a fixed temporal resolution, the DiffRes-based method achieves equivalent or better classification accuracy with at least 25% computational cost reduction. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of the input acoustic features without adding to the computational cost.
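To make the relationship between hop size, frame count, and frame merging concrete, the following is a minimal Python sketch. It is not the DiffRes algorithm from the paper: the 16 kHz sample rate, the centered-STFT frame count, the function names `num_frames` and `merge_frames`, and the hand-supplied importance scores are all assumptions made for illustration, and the merging step here is neither learned nor differentiable.

```python
import numpy as np

def num_frames(num_samples: int, hop: int) -> int:
    """Frame count of a centered STFT: one frame every `hop` samples, plus one."""
    return 1 + num_samples // hop

def merge_frames(spec: np.ndarray, importance: np.ndarray, out_len: int) -> np.ndarray:
    """Toy stand-in for frame merging (hypothetical, not the paper's method).

    spec:       (T, F) spectrogram computed with a fixed hop size.
    importance: (T,) positive scores; a learned module would predict these,
                here they are simply given.
    out_len:    number of output frames (out_len <= T).

    Each input frame is assigned to an output slot according to its position
    in cumulative importance, so runs of low-importance frames share a slot
    and are averaged, while high-importance frames tend to keep their own.
    Slots that receive no frame are left as zeros.
    """
    cum = np.cumsum(importance)
    centre = (cum - importance / 2.0) / cum[-1]          # in (0, 1)
    slots = np.minimum(centre * out_len, out_len - 1).astype(int)

    out = np.zeros((out_len, spec.shape[1]))
    counts = np.zeros(out_len)
    np.add.at(out, slots, spec)      # sum the frames that fall in each slot
    np.add.at(counts, slots, 1)      # count how many frames each slot received
    filled = counts > 0
    out[filled] /= counts[filled, None]
    return out

# One second of audio at an assumed 16 kHz sample rate.
frames_10ms = num_frames(16000, hop=160)   # 10 ms hop
frames_5ms = num_frames(16000, hop=80)     # 5 ms hop: twice the temporal resolution

# Reduce a 100-frame, 64-bin spectrogram to 75 frames (25% fewer frames).
rng = np.random.default_rng(0)
spec = rng.random((100, 64))
importance = rng.random(100) + 0.1
reduced = merge_frames(spec, importance, out_len=75)
```

Under these assumptions, halving the hop size roughly doubles the number of frames the classifier must process, which is why a finer input resolution only stays affordable if less important frames are merged before classification.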