Paper Title

Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork

Authors

Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, John Collomosse

Abstract

We develop an approach for text-to-image generation that embraces additional retrieval images, driven by a combination of an implicit visual guidance loss and generative objectives. Unlike most existing text-to-image generation methods, which merely take text as input, our method dynamically feeds cross-modal search results into a unified training stage, improving the quality, controllability, and diversity of the generated results. We propose a novel hypernetwork-modulated visual-text encoding scheme that predicts the weight update of the encoding layer, enabling effective transfer of visual information (e.g., layout, content) into the corresponding latent domain. Experimental results show that our model, guided by additional retrieved visual data, outperforms existing GAN-based models. On the COCO dataset, we achieve a better FID of $9.13$ with up to $3.5\times$ fewer generator parameters compared with the state-of-the-art method.
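
To make the hypernetwork-modulated encoding idea concrete, below is a minimal PyTorch sketch: a small hypernetwork takes features of a retrieved image and predicts a per-sample weight update for a base text-encoding layer, so visual cues (layout, content) can steer the text latent. The module names, dimensions, and the low-rank parameterization of the update are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class HyperModulatedEncoder(nn.Module):
    """Sketch of a hypernetwork-modulated visual-text encoding layer.

    A hypernetwork maps retrieved-image features to a low-rank weight
    update (delta W) applied to a shared linear text-encoding layer.
    All names and sizes here are assumptions for illustration.
    """

    def __init__(self, text_dim=256, latent_dim=256, vis_dim=512, rank=8):
        super().__init__()
        self.base = nn.Linear(text_dim, latent_dim)   # shared encoding layer
        # Hypernetwork heads: visual feature -> low-rank factors of delta W
        self.to_a = nn.Linear(vis_dim, latent_dim * rank)
        self.to_b = nn.Linear(vis_dim, rank * text_dim)
        self.rank = rank

    def forward(self, text_feat, vis_feat):
        # text_feat: (B, text_dim), vis_feat: (B, vis_dim) from retrieval
        B = text_feat.size(0)
        a = self.to_a(vis_feat).view(B, -1, self.rank)   # (B, latent_dim, r)
        b = self.to_b(vis_feat).view(B, self.rank, -1)   # (B, r, text_dim)
        delta_w = torch.bmm(a, b)                        # per-sample weight update
        w = self.base.weight.unsqueeze(0) + delta_w      # modulated weights
        out = torch.bmm(w, text_feat.unsqueeze(-1)).squeeze(-1) + self.base.bias
        return out                                       # (B, latent_dim)

# Usage (hypothetical feature tensors):
#   enc = HyperModulatedEncoder()
#   latent = enc(text_feat, vis_feat)
```

The low-rank factorization keeps the hypernetwork's output size manageable: predicting a full $256 \times 256$ update per sample would be far more expensive than predicting two rank-$8$ factors.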
