Paper Title
Learning neural audio features without supervision
Paper Authors
Paper Abstract
Deep audio classification, traditionally cast as training a deep neural network on top of mel-filterbanks in a supervised fashion, has recently benefited from two independent lines of work. The first one explores "learnable frontends", i.e., neural modules that produce a learnable time-frequency representation, to overcome the limitations of fixed features. The second one uses self-supervised learning to leverage unprecedented scales of pre-training data. In this work, we study the feasibility of combining both approaches, i.e., pre-training learnable frontends jointly with the main architecture for downstream classification. First, we show that pre-training two previously proposed frontends (SincNet and LEAF) on AudioSet drastically improves linear-probe performance over fixed mel-filterbanks, suggesting that learnable time-frequency representations can benefit self-supervised pre-training even more than supervised training. Surprisingly, randomly initialized learnable filterbanks outperform mel-scaled initialization in the self-supervised setting, a counter-intuitive result that questions the appropriateness of strong priors when designing learnable filters. Through exploratory analysis of the learned frontend components, we uncover crucial differences in the properties of these frontends when used in supervised and self-supervised settings, especially the tendency of self-supervised filters to diverge significantly from the mel scale to model a broader range of frequencies.
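As a purely illustrative sketch of the setup the abstract describes, the snippet below builds a SincNet-style learnable filterbank whose band-pass cutoff frequencies are trainable parameters, with an `init` flag switching between the mel-scaled and random initializations compared above, followed by a frozen stand-in encoder and a linear probe. The class name `LearnableSincFilterbank`, the placeholder encoder, and all hyperparameters (40 filters, 401-tap kernels, 160-sample hop) are assumptions for illustration, not the paper's actual implementation or training code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: torch.Tensor) -> torch.Tensor:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


class LearnableSincFilterbank(nn.Module):
    """SincNet-style frontend: band-pass filters whose low/high cutoff
    frequencies are trainable parameters (illustrative sketch only)."""

    def __init__(self, n_filters=40, kernel_size=401, sample_rate=16000, init="mel"):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate

        if init == "mel":
            # Mel-scaled initialization: band edges equally spaced on the mel scale.
            mel_edges = torch.linspace(hz_to_mel(30.0),
                                       hz_to_mel(sample_rate / 2 - 100.0),
                                       n_filters + 1)
            hz_edges = mel_to_hz(mel_edges)
            low, high = hz_edges[:-1], hz_edges[1:]
        else:
            # Random initialization: band edges drawn uniformly over the spectrum.
            edges, _ = torch.sort(
                torch.rand(n_filters, 2) * (sample_rate / 2 - 130.0) + 30.0, dim=1)
            low, high = edges[:, 0], edges[:, 1]

        # Cutoff frequencies (in Hz) are the learnable parameters of the frontend.
        self.low_hz = nn.Parameter(low)
        self.band_hz = nn.Parameter(high - low)

        n = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
        self.register_buffer("n", n)                               # filter time axis
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):
        # x: (batch, 1, samples) raw waveform.
        low = torch.clamp(torch.abs(self.low_hz), 1.0, self.sample_rate / 2 - 3.0)
        high = torch.clamp(low + torch.abs(self.band_hz), 2.0, self.sample_rate / 2 - 1.0)

        def lowpass(cutoff_hz):
            # Windowed-sinc low-pass filters, one per row: (n_filters, kernel_size).
            f = (cutoff_hz / self.sample_rate).unsqueeze(1)        # cycles per sample
            return 2.0 * f * torch.sinc(2.0 * f * self.n.unsqueeze(0))

        band_pass = (lowpass(high) - lowpass(low)) * self.window
        band_pass = band_pass / (band_pass.abs().sum(dim=1, keepdim=True) + 1e-8)
        filters = band_pass.unsqueeze(1)                           # (n_filters, 1, k)
        # Stride of 160 samples = 10 ms hop at 16 kHz.
        return F.conv1d(x, filters, stride=160, padding=self.kernel_size // 2)


# Linear-probe wiring (sketch): freeze the frontend and a stand-in encoder,
# then train only a linear classifier on pooled embeddings.
frontend = LearnableSincFilterbank(init="random")                  # or init="mel"
encoder = nn.Sequential(nn.Conv1d(40, 256, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten())     # placeholder encoder
for p in list(frontend.parameters()) + list(encoder.parameters()):
    p.requires_grad = False                                        # frozen for probing
probe = nn.Linear(256, 527)                                        # e.g. AudioSet's 527 classes

waveforms = torch.randn(8, 1, 16000)                               # batch of 1-second clips
with torch.no_grad():
    embeddings = encoder(frontend(waveforms))                      # (8, 256)
logits = probe(embeddings)                                         # (8, 527)
```

In the paper's setting, the frontend and encoder would first be pre-trained with a self-supervised objective on AudioSet (and the frontend could be SincNet or LEAF rather than this sketch) before being frozen; here both are left untrained purely to show how the linear probe is wired on top of frozen features.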