Paper title
Modelling word learning and recognition using visually grounded speech
Paper authors
Abstract
Background: Computational models of speech recognition often assume that the set of target words is already given. This implies that these models do not learn to recognise speech from scratch without prior knowledge and explicit supervision. Visually grounded speech models learn to recognise speech without prior knowledge by exploiting statistical dependencies between spoken and visual input. While it has previously been shown that visually grounded speech models learn to recognise the presence of words in the input, we explicitly investigate such a model as a model of human speech recognition.
Methods: We investigate the time-course of word recognition as simulated by the model using a gating paradigm to test whether its recognition is affected by well-known word-competition effects in human speech processing. We furthermore investigate whether vector quantisation, a technique for discrete representation learning, aids the model in the discovery and recognition of words.
Results/Conclusion: Our experiments show that the model is able to recognise nouns in isolation and even learns to properly differentiate between plural and singular nouns. We also find that recognition is influenced by word competition from the word-initial cohort and neighbourhood density, mirroring word competition effects in human speech comprehension. Lastly, we find no evidence that vector quantisation is helpful in discovering and recognising words. Our gating experiments even show that the vector quantised model requires more of the input sequence for correct recognition.
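The quantities named in the abstract (gating, word-initial cohort, neighbourhood density) can be illustrated with a minimal toy sketch. It is not the paper's method: the paper's model operates on speech and visual input, whereas this sketch uses letter strings as a stand-in for sound sequences. The lexicon, the function names, the uniqueness-based recogniser, and the choice to define the recognition point as the first correct gate are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


def cohort(prefix: str, lexicon: Sequence[str]) -> list[str]:
    """Word-initial cohort: all lexicon words that start with `prefix`."""
    return [w for w in lexicon if w.startswith(prefix)]


def _one_edit_apart(a: str, b: str) -> bool:
    """True if a and b differ by exactly one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # single substitution at position i
    return a[i:] == b[i + 1:]  # single extra symbol in b at position i


def neighbourhood_density(word: str, lexicon: Sequence[str]) -> int:
    """Number of lexicon words exactly one edit away from `word`."""
    return sum(_one_edit_apart(word, w) for w in lexicon)


def recognition_point(
    word: str, recognise: Callable[[str], Optional[str]]
) -> Optional[int]:
    """Gating: present ever longer prefixes and return the first gate
    (prefix length) at which the recogniser outputs the target word."""
    for gate in range(1, len(word) + 1):
        if recognise(word[:gate]) == word:
            return gate
    return None


# Hypothetical toy lexicon and recogniser (assumptions, not from the paper).
LEXICON = ["cat", "cap", "candle", "dog", "door"]


def toy_recogniser(prefix: str) -> Optional[str]:
    """Commit to a word only once the cohort has shrunk to one candidate."""
    candidates = cohort(prefix, LEXICON)
    return candidates[0] if len(candidates) == 1 else None


if __name__ == "__main__":
    for target in LEXICON:
        print(
            target,
            "gate:", recognition_point(target, toy_recogniser),
            "neighbours:", neighbourhood_density(target, LEXICON),
        )
```

In this toy setting a word with a larger word-initial cohort needs more gates before it becomes unique, which is the kind of competition effect the abstract refers to. A real gating experiment would truncate the speech signal and query the trained model instead of matching string prefixes.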