Paper Title
On-demand compute reduction with stochastic wav2vec 2.0
Paper Authors
Paper Abstract
Squeeze and Efficient Wav2vec (SEW) is a recently proposed architecture that squeezes the input to the transformer encoder for compute-efficient pre-training and inference with wav2vec 2.0 (W2V2) models. In this work, we propose stochastic compression for on-demand compute reduction in W2V2 models. As opposed to using a fixed squeeze factor, we sample it uniformly during training. We further introduce query and key-value pooling mechanisms that can be applied to each transformer layer for further compression. Our results for models pre-trained on the 960h Librispeech dataset and fine-tuned on 10h of transcribed data show that, using the same stochastic model, we get a smooth trade-off between word error rate (WER) and inference time with only marginal WER degradation compared to the W2V2 and SEW models trained for a specific setting. We further show that we can fine-tune the same stochastically pre-trained model to a specific configuration to recover the WER difference, resulting in significant computational savings compared to pre-training models from scratch.
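The abstract describes two mechanisms: a squeeze factor that is sampled uniformly during training (and chosen on demand at inference), and query / key-value pooling applied inside the transformer layers. The PyTorch sketch below is only an illustration of these two ideas under assumed shapes and defaults; it is not the authors' implementation, and the module names (StochasticSqueeze, PooledSelfAttention), the candidate factor set, and the pooling strides are hypothetical choices.

```python
# Minimal sketch (not the paper's code) of stochastic squeeze-factor sampling
# and query / key-value pooling for a transformer encoder over audio frames.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F


class StochasticSqueeze(nn.Module):
    """Downsample the frame sequence by a squeeze factor sampled per batch."""

    def __init__(self, factors=(1, 2, 3, 4)):  # candidate factors: an assumption
        super().__init__()
        self.factors = factors

    def forward(self, x, factor=None):  # x: (batch, time, dim)
        # Sample uniformly during training; at inference the caller picks the
        # factor on demand (defaulting to no squeezing here).
        f = factor if factor is not None else (
            random.choice(self.factors) if self.training else 1
        )
        if f == 1:
            return x, f
        # Average-pool along time to shorten the sequence by factor f.
        pooled = F.avg_pool1d(x.transpose(1, 2), kernel_size=f, stride=f)
        return pooled.transpose(1, 2), f


class PooledSelfAttention(nn.Module):
    """Self-attention with optional pooling of queries and of keys/values."""

    def __init__(self, dim, heads, q_stride=1, kv_stride=2):  # strides assumed
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_stride, self.kv_stride = q_stride, kv_stride

    @staticmethod
    def _pool(x, stride):  # x: (batch, time, dim)
        if stride == 1:
            return x
        return F.avg_pool1d(x.transpose(1, 2), stride, stride).transpose(1, 2)

    def forward(self, x):
        q = self._pool(x, self.q_stride)    # query pooling shortens the output sequence
        kv = self._pool(x, self.kv_stride)  # key-value pooling cuts attention cost
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return out


# Usage example with dummy features (shapes are illustrative only).
if __name__ == "__main__":
    frames = torch.randn(2, 100, 256)               # (batch, time, dim)
    squeeze = StochasticSqueeze()
    layer = PooledSelfAttention(dim=256, heads=4)
    squeezed, f = squeeze(frames)                   # factor sampled during training
    print(f, layer(squeezed).shape)
```

In this sketch the compute reduction comes from two places: squeezing shortens the sequence every layer sees, and key-value pooling reduces the quadratic attention cost within a layer, which together give the kind of WER-versus-inference-time trade-off the abstract describes.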