Paper Title
Application of Knowledge Distillation to Multi-task Speech Representation Learning
Paper Authors
Paper Abstract
Model architectures such as wav2vec 2.0 and HuBERT have been proposed to learn speech representations from audio waveforms in a self-supervised manner. When they are combined with downstream tasks such as keyword spotting and speaker verification, they provide state-of-the-art performance. However, these models use a large number of parameters, the smallest version of which has 95 million parameters. This constitutes a challenge for edge AI device deployments. In this paper, we investigate the application of knowledge distillation to speech representation learning (SRL) models, followed by joint fine-tuning with multiple downstream voice-activated tasks. In our experiments on two such tasks, our approach results in a nearly 75% reduction in model size while suffering only 0.1% accuracy degradation and 0.9% equal error rate degradation compared to the full-size model. In addition, we show that fine-tuning the SRL models results in a significant performance boost compared to using frozen SRL models.
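To make the recipe described in the abstract concrete, below is a minimal PyTorch sketch of its two stages: distilling a large SRL teacher (such as wav2vec 2.0 or HuBERT) into a compact student encoder, and then jointly fine-tuning that student with two downstream heads (keyword spotting and speaker verification). The encoder architecture, loss terms, pooling, and head designs here are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of (1) knowledge distillation from a large SRL teacher into a small
# student encoder and (2) joint multi-task fine-tuning of that student.
# Sizes, losses, and heads are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallSRLEncoder(nn.Module):
    """Compact student encoder: conv front-end + shallow Transformer."""
    def __init__(self, hidden=384, layers=4):
        super().__init__()
        # 25 ms window / 20 ms hop at 16 kHz
        self.frontend = nn.Conv1d(1, hidden, kernel_size=400, stride=320)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, wav):                      # wav: (batch, samples)
        x = self.frontend(wav.unsqueeze(1))      # (batch, hidden, frames)
        x = x.transpose(1, 2)                    # (batch, frames, hidden)
        return self.encoder(x)                   # frame-level representations


def distillation_loss(student_feats, teacher_feats, proj):
    """Match student frames to projected teacher frames with an L1 + cosine loss."""
    target = proj(teacher_feats)                 # map teacher dim -> student dim
    l1 = F.l1_loss(student_feats, target)
    cos = 1.0 - F.cosine_similarity(student_feats, target, dim=-1).mean()
    return l1 + cos


class MultiTaskModel(nn.Module):
    """Distilled encoder shared by two downstream heads, fine-tuned jointly."""
    def __init__(self, encoder, hidden=384, num_keywords=12, num_speakers=100):
        super().__init__()
        self.encoder = encoder                                  # trainable, not frozen
        self.kws_head = nn.Linear(hidden, num_keywords)         # keyword spotting
        self.spk_head = nn.Linear(hidden, num_speakers)         # speaker classification proxy

    def forward(self, wav):
        pooled = self.encoder(wav).mean(dim=1)   # simple mean pooling over frames
        return self.kws_head(pooled), self.spk_head(pooled)


if __name__ == "__main__":
    wav = torch.randn(2, 16000)                  # two 1-second dummy waveforms at 16 kHz
    student = SmallSRLEncoder()
    student_feats = student(wav)

    # Stage 1: distillation. Teacher features are faked here; in practice they
    # would come from a frozen wav2vec 2.0 / HuBERT forward pass on the same audio.
    teacher_feats = torch.randn(2, student_feats.shape[1], 768)
    proj = nn.Linear(768, 384)
    print("distillation loss:", distillation_loss(student_feats, teacher_feats, proj).item())

    # Stage 2: joint multi-task fine-tuning on dummy labels for both tasks.
    model = MultiTaskModel(student)
    kws_logits, spk_logits = model(wav)
    loss = (F.cross_entropy(kws_logits, torch.tensor([3, 7]))
            + F.cross_entropy(spk_logits, torch.tensor([10, 42])))
    loss.backward()
    print("joint fine-tuning loss:", loss.item())
```

In line with the abstract's finding that fine-tuning the SRL model outperforms keeping it frozen, the encoder parameters in this sketch remain trainable during the multi-task stage.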