Paper Title

WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models

Paper Authors

Heting Gao, Junrui Ni, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

Paper Abstract

Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks with only a few text examples, without the need for fine-tuning. Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings functioning like the text embeddings of the language model. Interested in exploring the possibility of transferring the few-shot learning ability to the audio-text setting, we propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model. We show that WavPrompt is a few-shot learner that can perform speech understanding tasks better than a naive text baseline. We conduct detailed ablation studies on different components and hyperparameters to empirically identify the best model configuration. In addition, we conduct a non-speech understanding experiment to show WavPrompt can extract more information than just the transcriptions. Code is available at https://github.com/Hertin/WavPrompt
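The abstract describes the core idea: a finetuned wav2vec encoder produces audio embeddings that are prepended to the text embeddings of a frozen language model, so the model treats the audio as part of the prompt. Below is a minimal sketch of that idea using Hugging Face's transformers library with Wav2Vec2 and GPT-2; the projection layer and all names and dimensions here are illustrative assumptions, not the paper's exact configuration (see the linked repository for the actual implementation).

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, GPT2LMHeadModel

class WavPromptSketch(nn.Module):
    """Sketch of the WavPrompt idea: audio embeddings from a trainable
    wav2vec encoder are prepended to the token embeddings of a frozen LM.
    Hypothetical configuration for illustration only."""

    def __init__(self):
        super().__init__()
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2")
        for p in self.lm.parameters():  # language model stays frozen
            p.requires_grad = False
        # Hypothetical linear map from wav2vec hidden size to GPT-2 embedding size,
        # so audio frames can act like token embeddings.
        self.proj = nn.Linear(
            self.audio_encoder.config.hidden_size, self.lm.config.n_embd
        )

    def forward(self, waveform, text_ids):
        # waveform: (batch, samples); text_ids: (batch, prompt_length)
        audio_feats = self.audio_encoder(waveform).last_hidden_state  # (B, T, H)
        audio_embeds = self.proj(audio_feats)                # audio "prompt tokens"
        text_embeds = self.lm.transformer.wte(text_ids)      # ordinary token embeddings
        inputs = torch.cat([audio_embeds, text_embeds], dim=1)
        return self.lm(inputs_embeds=inputs)
```

Under this setup, only the audio encoder (and projection) receive gradients during training; at evaluation time, new speech understanding tasks are specified purely through the text portion of the prompt, which is what makes few-shot behavior possible.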
