Paper Title
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
Paper Authors
Paper Abstract
In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating the encoder-decoder architecture from Masked Autoencoders Are Scalable Vision Learners (MAE) into the SSAST, where a deep encoder operates only on unmasked input and a shallow decoder operates on encoder outputs and mask tokens. We find that MAE-like pretraining can provide a 3x speedup and a 2x reduction in memory usage over the vanilla SSAST under current audio pretraining strategies with ordinary model and input sizes. When fine-tuning on downstream tasks, which uses only the encoder, we find that our approach outperforms the SSAST across a variety of tasks. We further conduct comprehensive evaluations of different pretraining strategies and explore how MAE-style pretraining differs between the visual and audio domains.
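To make the encoder-decoder split described in the abstract concrete, the following is a minimal PyTorch-style sketch of MAE-like masked pretraining on spectrogram patches: a deep encoder attends only over the ~25% of patches that remain visible, and a shallow decoder receives the projected encoder outputs plus learned mask tokens and reconstructs the masked patches. The class name, module sizes, plain MSE objective, and the omission of positional embeddings and patching details are our own simplifying assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MAEStyleAudioPretrainer(nn.Module):
    """Illustrative MAE-style pretrainer: encoder sees only unmasked patches."""

    def __init__(self, patch_dim=256, enc_dim=768, dec_dim=384,
                 enc_layers=12, dec_layers=2, n_heads=12, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        # Deep encoder: runs only on visible (unmasked) patches.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, n_heads, batch_first=True),
            num_layers=enc_layers)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        # Shallow decoder: a small stack of transformer blocks over the
        # full-length sequence (encoder outputs + mask tokens).
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, n_heads // 2, batch_first=True),
            num_layers=dec_layers)
        self.reconstruct = nn.Linear(dec_dim, patch_dim)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) flattened spectrogram patches
        B, N, D = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))

        # Randomly select which patches stay visible (75% are masked out).
        noise = torch.rand(B, N, device=patches.device)
        shuffle = noise.argsort(dim=1)
        restore = shuffle.argsort(dim=1)
        keep_idx = shuffle[:, :num_keep]
        visible = torch.gather(
            patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

        # Encoder self-attention is computed over visible patches only,
        # which is where the compute and memory savings come from.
        enc_out = self.encoder(self.patch_embed(visible))

        # Decoder input: projected encoder outputs plus mask tokens,
        # unshuffled back into the original patch order.
        dec_in = torch.cat(
            [self.enc_to_dec(enc_out),
             self.mask_token.expand(B, N - num_keep, -1)], dim=1)
        dec_in = torch.gather(
            dec_in, 1, restore.unsqueeze(-1).expand(-1, -1, dec_in.size(-1)))
        pred = self.reconstruct(self.decoder(dec_in))

        # Reconstruction loss over masked positions only (illustrative MSE).
        mask = torch.ones(B, N, device=patches.device)
        mask.scatter_(1, keep_idx, 0.0)
        loss = (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
        return loss
```

For downstream fine-tuning, only the encoder would be kept and run on the full, unmasked patch sequence, which is consistent with the abstract's note that fine-tuning uses the encoder alone.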