Paper Title
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
Paper Authors
Paper Abstract
In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating the encoder-decoder architecture from Masked Autoencoders Are Scalable Vision Learners (MAE) into the SSAST, where a deep encoder operates only on unmasked input and a shallow decoder operates on encoder outputs and mask tokens. We find that MAE-like pretraining can provide a 3x speedup and a 2x reduction in memory usage over the vanilla SSAST under current audio pretraining strategies with ordinary model and input sizes. When fine-tuning on downstream tasks, which uses only the encoder, we find that our approach outperforms the SSAST across a variety of tasks. We further conduct comprehensive evaluations of different pretraining strategies and explore how MAE-style pretraining differs between the visual and audio domains.
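To make the encoder-decoder split described in the abstract concrete, the following is a minimal PyTorch-style sketch of MAE-like masked pretraining on spectrogram patches: a deep encoder attends only over the ~25% of patches that remain visible, and a shallow decoder receives the projected encoder outputs plus learned mask tokens and reconstructs the masked patches. The class name, module sizes, plain MSE objective, and the omission of positional embeddings and patching details are our own simplifying assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MAEStyleAudioPretrainer(nn.Module):
    """Illustrative MAE-style pretrainer: encoder sees only unmasked patches."""

    def __init__(self, patch_dim=256, enc_dim=768, dec_dim=384,
                 enc_layers=12, dec_layers=2, n_heads=12, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        # Deep encoder: runs only on visible (unmasked) patches.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, n_heads, batch_first=True),
            num_layers=enc_layers)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        # Shallow decoder: a small stack of transformer blocks over the
        # full-length sequence (encoder outputs + mask tokens).
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, n_heads // 2, batch_first=True),
            num_layers=dec_layers)
        self.reconstruct = nn.Linear(dec_dim, patch_dim)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) flattened spectrogram patches
        B, N, D = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))

        # Randomly select which patches stay visible (75% are masked out).
        noise = torch.rand(B, N, device=patches.device)
        shuffle = noise.argsort(dim=1)
        restore = shuffle.argsort(dim=1)
        keep_idx = shuffle[:, :num_keep]
        visible = torch.gather(
            patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

        # Encoder self-attention is computed over visible patches only,
        # which is where the compute and memory savings come from.
        enc_out = self.encoder(self.patch_embed(visible))

        # Decoder input: projected encoder outputs plus mask tokens,
        # unshuffled back into the original patch order.
        dec_in = torch.cat(
            [self.enc_to_dec(enc_out),
             self.mask_token.expand(B, N - num_keep, -1)], dim=1)
        dec_in = torch.gather(
            dec_in, 1, restore.unsqueeze(-1).expand(-1, -1, dec_in.size(-1)))
        pred = self.reconstruct(self.decoder(dec_in))

        # Reconstruction loss over masked positions only (illustrative MSE).
        mask = torch.ones(B, N, device=patches.device)
        mask.scatter_(1, keep_idx, 0.0)
        loss = (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
        return loss
```

For downstream fine-tuning, only the encoder would be kept and run on the full, unmasked patch sequence, which is consistent with the abstract's note that fine-tuning uses the encoder alone.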