Paper Title

token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text

Paper Authors

Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li

Paper Abstract

Self-supervised pre-training has been successful in both text and speech processing. Speech and text offer different but complementary information. The question is whether we can perform speech-text joint pre-training on unpaired speech and text. In this paper, we take the idea of self-supervised pre-training one step further and propose token2vec, a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech. First, because speech is continuous while text is discrete, we discretize speech into a sequence of discrete speech tokens to resolve the modality mismatch. Second, to resolve the length mismatch, where a speech sequence is usually much longer than the corresponding text sequence, we convert the words of the text into phoneme sequences and randomly repeat each phoneme in the sequences. Finally, we feed the discrete speech and text tokens into a modality-agnostic Transformer encoder and pre-train with token-level masked language modeling (tMLM). Experiments show that token2vec significantly outperforms various speech-only pre-training baselines, with up to a 17.7% relative WER reduction. The token2vec model is also validated on a non-ASR task, i.e., spoken intent classification, and shows good transferability.
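
Since the abstract describes the length-matching and masking steps only in prose, the following is a minimal Python sketch of the two ideas: random phoneme repetition to narrow the speech/text length gap, and token-level masking as in tMLM. The function names, the 1-3x repetition range, the 15% masking rate, and the MASK_ID placeholder are illustrative assumptions, not settings taken from the paper.

```python
import random

MASK_ID = "[MASK]"  # hypothetical symbol reserved for masked positions

def upsample_phonemes(phonemes, min_rep=1, max_rep=3, rng=None):
    """Randomly repeat each phoneme so the text token sequence
    approaches typical discrete-speech-token lengths.
    The 1-3x repeat range is an illustrative assumption."""
    rng = rng or random.Random()
    out = []
    for p in phonemes:
        out.extend([p] * rng.randint(min_rep, max_rep))
    return out

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Token-level masking for tMLM: replace a random subset of tokens
    with MASK_ID; the encoder is trained to predict the originals at
    the returned positions. The 15% rate is an assumed default."""
    rng = rng or random.Random()
    masked, positions = [], []
    for i, t in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_ID)
            positions.append(i)
        else:
            masked.append(t)
    return masked, positions

rng = random.Random(0)
# the word "speech" as ARPAbet phonemes (illustrative input)
ups = upsample_phonemes(["S", "P", "IY", "CH"], rng=rng)
masked, positions = mask_tokens(ups, rng=rng)
print(ups)     # e.g. ['S', 'S', 'P', 'IY', 'IY', 'IY', 'CH']
print(masked)  # the same sequence with ~15% of tokens replaced by [MASK]
```

In the framework as the abstract describes it, both the discrete speech tokens and the upsampled phoneme tokens would pass through the same masking step before entering the shared, modality-agnostic Transformer encoder.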
