Paper Title

Modeling Image Composition for Complex Scene Generation

Authors

Zuopeng Yang, Daqing Liu, Chaoyue Wang, Jie Yang, Dacheng Tao

Abstract

We present a method that achieves state-of-the-art results on challenging (few-shot) layout-to-image generation tasks by accurately modeling the textures, structures, and relationships contained in a complex scene. After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and patch-to-patch dependencies. In contrast to existing CNN-based and Transformer-based generation models, which entangle modeling at the pixel & patch level and the object & patch level respectively, the proposed focal attention predicts the current patch token by focusing only on the highly related tokens specified by the spatial layout, thereby achieving disambiguation during training. Furthermore, the proposed TwFA largely improves data efficiency during training, which allows us to propose the first few-shot complex scene generation strategy based on a well-trained TwFA. Comprehensive experiments show the superiority of our method, which significantly improves both quantitative metrics and qualitative visual realism over state-of-the-art CNN-based and Transformer-based methods. Code is available at https://github.com/JohnDreamer/TwFA.
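
To make the focal-attention idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of how a layout-driven attention mask might be constructed for an autoregressive Transformer over patch tokens: each patch token attends only to the object token and the previously generated patch tokens that the spatial layout associates with it. The function name, tensor shapes, patch-to-object assignment, and token ordering are all assumptions made for illustration; the actual connectivity used by TwFA is defined in the paper and the linked repository.

```python
import torch

def focal_attention_mask(patch_to_object: torch.Tensor, num_objects: int) -> torch.Tensor:
    """Build a boolean attention mask of shape (T, T), with T = num_objects + N.

    patch_to_object: (N,) long tensor giving the object index (0..num_objects-1)
    covering each patch in raster order, or -1 for background patches.
    True entries mean "this query token may attend to this key token".
    """
    N = patch_to_object.numel()
    T = num_objects + N          # object (layout) tokens first, patch tokens after
    allow = torch.zeros(T, T, dtype=torch.bool)

    # Object-to-object: layout tokens see each other to model object relationships.
    allow[:num_objects, :num_objects] = True

    # Patch-to-patch: a patch attends only to previously generated patches that the
    # layout assigns to the same object (causal + same-object constraint); background
    # patches, labeled -1, likewise attend to earlier background patches.
    same_object = patch_to_object.unsqueeze(1).eq(patch_to_object.unsqueeze(0))
    causal = torch.tril(torch.ones(N, N, dtype=torch.bool))
    allow[num_objects:, num_objects:] = same_object & causal

    # Object-to-patch: each foreground patch also attends to its own object token.
    patch_rows = torch.arange(num_objects, T)
    fg = patch_to_object.ge(0)
    allow[patch_rows[fg], patch_to_object[fg]] = True
    return allow

# Example: 3 objects and a 4-patch toy layout; in practice the mask is applied by
# setting disallowed entries of the attention logits to -inf before the softmax.
mask = focal_attention_mask(torch.tensor([0, 0, 1, -1]), num_objects=3)
```

In a full model this mask would be broadcast across attention heads and batch entries inside each self-attention layer; the sketch only illustrates the "focus on highly related tokens" constraint described in the abstract.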
