Paper Title

ZEST: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization

Paper Authors

Tzuf Paz-Argaman, Yuval Atzmon, Gal Chechik, Reut Tsarfaty

Paper Abstract

We study the problem of recognizing visual entities from the textual descriptions of their classes. Specifically, given birds' images with free-text descriptions of their species, we learn to classify images of previously-unseen species based on species descriptions. This setup has been studied in the vision community under the name zero-shot learning from text, focusing on learning to transfer knowledge about visual aspects of birds from seen classes to previously-unseen ones. Here, we suggest focusing on the textual description and distilling from the description the most relevant information to effectively match visual features to the parts of the text that discuss them. Specifically, (1) we propose to leverage the similarity between species, reflected in the similarity between text descriptions of the species; (2) we derive visual summaries of the texts, i.e., extractive summaries that focus on the visual features that tend to be reflected in images. We propose a simple attention-based model augmented with the similarity and visual-summaries components. Our empirical results consistently and significantly outperform the state-of-the-art on the largest benchmarks for text-based zero-shot learning, illustrating the critical importance of texts for zero-shot image recognition.
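To make point (1) concrete: the abstract proposes exploiting how similar species have similar text descriptions. The sketch below is not the paper's model; it only illustrates the underlying intuition with a hypothetical bag-of-words cosine similarity between invented description snippets, where descriptions of visually similar birds score closer than those of dissimilar ones.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity between two free-text descriptions."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical, invented description snippets (not from the CUB benchmark).
cardinal = "a red bird with a red crest and a black face"
tanager = "a red bird with black wings and a black tail"
sparrow = "a small brown bird with streaked brown plumage"

# The two red species' descriptions are more similar than red vs. brown.
print(cosine_similarity(cardinal, tanager) > cosine_similarity(cardinal, sparrow))  # True
```

A model can use such description-level similarity as a prior: an unseen species whose text is close to a seen species' text likely shares visual features with it.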
