Paper Title


Toward a realistic model of speech processing in the brain with self-supervised learning

Paper Authors

Millet, Juliette, Caucheteux, Charlotte, Orhan, Pierre, Boubenec, Yves, Gramfort, Alexandre, Dunbar, Ewan, Pallier, Christophe, King, Jean-Remi

Paper Abstract


Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible: they require (1) extraordinarily large amounts of data, (2) unobtainable supervised labels, (3) textual rather than raw sensory input, and / or (4) implausibly large memory (e.g. thousands of contextual words). These elements highlight the need to identify algorithms that, under these limitations, would suffice to account for both behavioral and brain responses. Focusing on the issue of speech processing, we here hypothesize that self-supervised algorithms trained on the raw waveform constitute a promising candidate. Specifically, we compare a recent self-supervised architecture, Wav2Vec 2.0, to the brain activity of 412 English, French, and Mandarin individuals recorded with functional Magnetic Resonance Imaging (fMRI), while they listened to ~1h of audio books. Our results are four-fold. First, we show that this algorithm learns brain-like representations with as little as 600 hours of unlabelled speech -- a quantity comparable to what infants can be exposed to during language acquisition. Second, its functional hierarchy aligns with the cortical hierarchy of speech processing. Third, different training regimes reveal a functional specialization akin to the cortex: Wav2Vec 2.0 learns sound-generic, speech-specific and language-specific representations similar to those of the prefrontal and temporal cortices. Fourth, we confirm the similarity of this specialization with the behavior of 386 additional participants. These elements, resulting from the largest neuroimaging benchmark to date, show how self-supervised learning can account for a rich organization of speech processing in the brain, and thus delineate a path to identify the laws of language acquisition which shape the human brain.
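The model-to-brain comparison described in the abstract is commonly implemented as an encoding analysis: a linear mapping is fit from a network layer's activations to each voxel's fMRI response, and the held-out prediction correlation serves as a "brain score". The sketch below illustrates this idea with synthetic data standing in for Wav2Vec 2.0 activations and fMRI recordings; the shapes, the `brain_score` function, and the ridge penalty are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: rows are time samples, columns are
# model features (e.g., one Wav2Vec 2.0 layer) and fMRI voxels.
n_samples, n_features, n_voxels = 200, 32, 10
X = rng.standard_normal((n_samples, n_features))
true_w = rng.standard_normal((n_features, n_voxels))
Y = X @ true_w + 0.5 * rng.standard_normal((n_samples, n_voxels))

def brain_score(X, Y, alpha=1.0, train_frac=0.8):
    """Fit ridge regression from activations X to voxel responses Y
    on a training split; return the mean held-out Pearson r."""
    n_train = int(len(X) * train_frac)
    X_tr, X_te = X[:n_train], X[n_train:]
    Y_tr, Y_te = Y[:n_train], Y[n_train:]
    # Closed-form ridge solution: W = (X'X + alpha*I)^-1 X'Y
    d = X_tr.shape[1]
    W = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(d), X_tr.T @ Y_tr)
    pred = X_te @ W
    # Pearson correlation per voxel, then averaged across voxels.
    pred_c = pred - pred.mean(axis=0)
    Y_c = Y_te - Y_te.mean(axis=0)
    r = (pred_c * Y_c).sum(axis=0) / (
        np.linalg.norm(pred_c, axis=0) * np.linalg.norm(Y_c, axis=0)
    )
    return float(r.mean())

score = brain_score(X, Y)
print(f"mean brain score: {score:.3f}")
```

In the real analysis, `X` would hold activations extracted from a given Wav2Vec 2.0 layer while the network processes the same audio-book waveform the participants heard, allowing the layer-by-layer comparison to the cortical hierarchy that the abstract reports.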
