Paper Title

How direct is the link between words and images?

Paper Authors

Hassan Shahmohammadi, Maria Heitmeier, Elnaz Shafaei-Bajestan, Hendrik P. A. Lensch, Harald Baayen

Paper Abstract

Current word embedding models, despite their success, still suffer from a lack of grounding in the real world. In this line of research, Gunther et al. (2022) proposed a behavioral experiment to investigate the relationship between words and images. In their setup, participants were presented with a target noun and a pair of images, one chosen by their model and the other chosen randomly. Participants were asked to select the image that best matched the target noun. In most cases, participants preferred the image selected by the model. Gunther et al. therefore concluded that a direct link between words and embodied experience is possible. We took their experiment as a point of departure and addressed the following questions. 1. Apart from visually embodied simulation of the given images, what other strategies might subjects have used to solve this task? To what extent does this setup rely on visual information from images? Can it be solved using purely textual representations? 2. Do current visually grounded embeddings explain subjects' selection behavior better than textual embeddings? 3. Does visual grounding improve the semantic representations of both concrete and abstract words? To address these questions, we designed novel experiments using pre-trained textual and visually grounded word embeddings. Our experiments reveal that subjects' selection behavior is explained to a large extent by purely text-based embeddings and word-based similarities, suggesting only a minor involvement of active embodied experience. Visually grounded embeddings offered modest advantages over textual embeddings only in certain cases. These findings indicate that the experiment by Gunther et al. may not be well suited for tapping into the perceptual experience of participants, and therefore the extent to which it measures visually grounded knowledge is unclear.
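To make the text-based account concrete, the minimal sketch below (not code from the paper; the vector names, dimensionality, and random toy embeddings are illustrative assumptions) shows how a purely textual model could predict a participant's choice: the selected image is simply the candidate whose associated word embedding is most cosine-similar to the target noun's embedding, with no visual information involved.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict_choice(target_vec, image_a_vec, image_b_vec):
    """Predict which of two candidate images a participant would pick for
    the target noun, using only embedding similarity (no visual features)."""
    sim_a = cosine(target_vec, image_a_vec)
    sim_b = cosine(target_vec, image_b_vec)
    return ("A", sim_a, sim_b) if sim_a >= sim_b else ("B", sim_a, sim_b)

# Toy example: random vectors stand in for pre-trained embeddings
# (e.g., the target noun's text embedding and embeddings of the nouns
# naming the two candidate images). These values are hypothetical.
rng = np.random.default_rng(0)
target = rng.normal(size=300)
image_a = target + 0.5 * rng.normal(size=300)  # semantically closer candidate
image_b = rng.normal(size=300)                 # unrelated candidate
print(predict_choice(target, image_a, image_b))
```

Under such a scheme, high agreement with participants' choices can be reached without any appeal to perceptual experience, which is the core of the paper's argument about what the original experiment actually measures.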
