Paper Title
Application of Knowledge Distillation to Multi-task Speech Representation Learning
Paper Authors
Paper Abstract
Model architectures such as wav2vec 2.0 and HuBERT have been proposed to learn speech representations from audio waveforms in a self-supervised manner. When they are combined with downstream tasks such as keyword spotting and speaker verification, they provide state-of-the-art performance. However, these models use a large number of parameters, the smallest version of which has 95 million parameters. This constitutes a challenge for edge AI device deployments. In this paper, we investigate the application of knowledge distillation to speech representation learning (SRL) models, followed by joint fine-tuning with multiple downstream voice-activated tasks. In our experiments on two such tasks, our approach results in a nearly 75% reduction in model size while suffering only 0.1% accuracy degradation and 0.9% equal error rate degradation compared to the full-size model. In addition, we show that fine-tuning the SRL models results in a significant performance boost compared to using frozen SRL models.
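To make the recipe described in the abstract concrete, below is a minimal PyTorch sketch of its two stages: distilling a large SRL teacher (such as wav2vec 2.0 or HuBERT) into a compact student encoder, and then jointly fine-tuning that student with two downstream heads (keyword spotting and speaker verification). The encoder architecture, loss terms, pooling, and head designs here are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of (1) knowledge distillation from a large SRL teacher into a small
# student encoder and (2) joint multi-task fine-tuning of that student.
# Sizes, losses, and heads are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallSRLEncoder(nn.Module):
    """Compact student encoder: conv front-end + shallow Transformer."""
    def __init__(self, hidden=384, layers=4):
        super().__init__()
        # 25 ms window / 20 ms hop at 16 kHz
        self.frontend = nn.Conv1d(1, hidden, kernel_size=400, stride=320)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, wav):                      # wav: (batch, samples)
        x = self.frontend(wav.unsqueeze(1))      # (batch, hidden, frames)
        x = x.transpose(1, 2)                    # (batch, frames, hidden)
        return self.encoder(x)                   # frame-level representations


def distillation_loss(student_feats, teacher_feats, proj):
    """Match student frames to projected teacher frames with an L1 + cosine loss."""
    target = proj(teacher_feats)                 # map teacher dim -> student dim
    l1 = F.l1_loss(student_feats, target)
    cos = 1.0 - F.cosine_similarity(student_feats, target, dim=-1).mean()
    return l1 + cos


class MultiTaskModel(nn.Module):
    """Distilled encoder shared by two downstream heads, fine-tuned jointly."""
    def __init__(self, encoder, hidden=384, num_keywords=12, num_speakers=100):
        super().__init__()
        self.encoder = encoder                                  # trainable, not frozen
        self.kws_head = nn.Linear(hidden, num_keywords)         # keyword spotting
        self.spk_head = nn.Linear(hidden, num_speakers)         # speaker classification proxy

    def forward(self, wav):
        pooled = self.encoder(wav).mean(dim=1)   # simple mean pooling over frames
        return self.kws_head(pooled), self.spk_head(pooled)


if __name__ == "__main__":
    wav = torch.randn(2, 16000)                  # two 1-second dummy waveforms at 16 kHz
    student = SmallSRLEncoder()
    student_feats = student(wav)

    # Stage 1: distillation. Teacher features are faked here; in practice they
    # would come from a frozen wav2vec 2.0 / HuBERT forward pass on the same audio.
    teacher_feats = torch.randn(2, student_feats.shape[1], 768)
    proj = nn.Linear(768, 384)
    print("distillation loss:", distillation_loss(student_feats, teacher_feats, proj).item())

    # Stage 2: joint multi-task fine-tuning on dummy labels for both tasks.
    model = MultiTaskModel(student)
    kws_logits, spk_logits = model(wav)
    loss = (F.cross_entropy(kws_logits, torch.tensor([3, 7]))
            + F.cross_entropy(spk_logits, torch.tensor([10, 42])))
    loss.backward()
    print("joint fine-tuning loss:", loss.item())
```

In line with the abstract's finding that fine-tuning the SRL model outperforms keeping it frozen, the encoder parameters in this sketch remain trainable during the multi-task stage.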