Title
Towards the Human Global Context: Does the Vision-Language Model Really Judge Like a Human Being?
Authors
Abstract
As computer vision and NLP advance, Vision-Language (VL) modeling is becoming an important area of research. Despite its importance, evaluation metrics for the domain are still at a preliminary stage of development. In this paper, we propose a quantitative metric, the "Equivariance Score," and an evaluation dataset, "Human Puzzle," to assess whether a VL model understands an image the way a human does. We observe that VL models do not interpret the overall context of an input image but instead show biases toward specific objects or shapes that form the local context. We aim to quantitatively measure a model's performance in understanding context. To probe the capabilities of existing VL models, we slice the original input image into pieces and place them at random positions, distorting the image's global context. Our paper discusses each VL model's level of interpretation of global context and addresses how structural characteristics influenced the results.
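The image distortion described in the abstract (slicing the input into pieces and placing them randomly) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual dataset pipeline: the grid size, random seed, and even-crop handling are assumptions for the example.

```python
import numpy as np

def shuffle_tiles(image: np.ndarray, grid: int = 3, seed: int = 0) -> np.ndarray:
    """Cut an (H, W, C) image into a grid x grid puzzle and shuffle the tiles,
    destroying the global context while leaving each local patch intact.
    Grid size and seed are illustrative choices, not the paper's settings."""
    h, w = image.shape[:2]
    th, tw = h // grid, w // grid
    # Crop so the image divides evenly into tiles.
    image = image[: th * grid, : tw * grid]
    tiles = [
        image[r * th : (r + 1) * th, c * tw : (c + 1) * tw]
        for r in range(grid)
        for c in range(grid)
    ]
    rng = np.random.default_rng(seed)
    rng.shuffle(tiles)  # random placement of the pieces
    rows = [
        np.concatenate(tiles[r * grid : (r + 1) * grid], axis=1)
        for r in range(grid)
    ]
    return np.concatenate(rows, axis=0)

# Toy demo: a 6x6 single-channel "image"; the output keeps the same pixels
# (local content) but rearranges them, breaking the global layout.
img = np.arange(36, dtype=np.uint8).reshape(6, 6, 1)
shuffled = shuffle_tiles(img, grid=3, seed=42)
print(shuffled.shape)
```

A VL model that relies on global context should change its prediction on the shuffled image, whereas a model biased toward local objects or shapes may not.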