Paper Title
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Paper Authors
Paper Abstract
Image cropping has progressed tremendously under the data-driven paradigm. However, current approaches do not account for the intentions of the user, which is an issue especially when the composition of the input image is complex. Moreover, labeling of cropping data is costly and hence the amount of data is limited, leading to poor generalization performance of current algorithms in the wild. In this work, we take advantage of vision-language models as a foundation for creating robust and user-intentional cropping algorithms. By adapting a transformer decoder with a pre-trained CLIP-based detection model, OWL-ViT, we develop a method to perform cropping with a text or image query that reflects the user's intention as guidance. In addition, our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small cropping dataset, while inheriting the open-vocabulary ability acquired from millions of text-image pairs. We validate our model through extensive experiments on existing datasets as well as a new cropping test set we compiled that is characterized by content ambiguity.
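Below is a minimal sketch, not the authors' pipeline, of the query-conditioned detection step the abstract builds on: using the pre-trained OWL-ViT model with a text query to locate the region a user cares about, via the Hugging Face `transformers` interface. The image path and the text query are hypothetical; ClipCrop adapts a transformer decoder on top of such a backbone to turn the queried region into an aesthetic crop, which this sketch does not implement.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Pre-trained CLIP-based open-vocabulary detector (OWL-ViT).
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg")           # hypothetical input image
text_query = [["a person playing guitar"]]  # hypothetical user intention as text

inputs = processor(text=text_query, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in image coordinates and keep the best match.
# A conditioned cropping head would further refine this box for composition.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
if len(results["scores"]) > 0:
    best = results["scores"].argmax()
    box = results["boxes"][best].tolist()   # [x_min, y_min, x_max, y_max]
    crop = image.crop(tuple(box))           # naive crop around the queried content
```

This only illustrates how a text query can condition region selection; the open-vocabulary ability inherited from CLIP-style pre-training is what lets such a query generalize beyond the labels in a small cropping dataset.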