Paper Title
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection
Paper Authors
Paper Abstract
Audio classification is an important task of mapping audio samples to their corresponding labels. Recently, transformer models with self-attention mechanisms have been adopted in this field. However, existing audio transformers require large GPU memory and long training times, while relying on pretrained vision models to achieve high performance, which limits the models' scalability in audio tasks. To address these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure that reduces the model size and training time. It is further combined with a token-semantic module that maps the final outputs into class feature maps, enabling the model to perform audio event detection (i.e., localization in time). We evaluate HTS-AT on three audio classification datasets, where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50 and matches the SOTA on Speech Command V2. It also achieves better performance in event localization than previous CNN-based models. Moreover, HTS-AT requires only 35% of the parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.
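To make the token-semantic idea concrete, below is a minimal PyTorch sketch of a head that turns a transformer's final token grid into per-class feature maps, yielding both clip-level labels and a time-localized event curve from the same output. This is an illustrative assumption, not the paper's exact module: the class name `TokenSemanticHead`, the kernel size, the pooling order, and the tensor shapes are all hypothetical, and it presumes the final-stage tokens have already been reshaped back onto a downsampled time-frequency grid.

```python
import torch
import torch.nn as nn

class TokenSemanticHead(nn.Module):
    """Hypothetical token-semantic head: a convolution maps the final
    (time, frequency) token grid to one feature map per class, so the
    same tensor supports clip-level classification and event
    localization along the time axis."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        # Project token embeddings to class-wise heatmaps.
        self.conv = nn.Conv2d(embed_dim, num_classes, kernel_size=3, padding=1)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, embed_dim, T', F'), the final-stage tokens
        # reshaped onto the downsampled time-frequency grid.
        heatmap = self.conv(tokens)        # (batch, num_classes, T', F')
        framewise = heatmap.mean(dim=3)    # pool over frequency -> (batch, num_classes, T')
        clipwise = framewise.mean(dim=2)   # pool over time -> (batch, num_classes)
        return torch.sigmoid(clipwise), torch.sigmoid(framewise)

# Usage sketch with made-up shapes (527 is the AudioSet label count).
head = TokenSemanticHead(embed_dim=768, num_classes=527)
tokens = torch.randn(2, 768, 32, 8)       # hypothetical final token grid
clip_probs, frame_probs = head(tokens)    # labels + per-frame event presence
```

The design choice worth noting is that localization comes for free: nothing time-specific is trained separately, and thresholding `frame_probs` along its last axis gives event onsets and offsets in (downsampled) time.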