Paper Title

Autoregressive Image Generation using Residual Quantization

Authors

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, Wook-Shin Han

Abstract


For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence length is important for an AR model to reduce its computational costs to consider long-range interactions of codes. However, we postulate that previous VQ cannot shorten the code sequence and generate high-fidelity images together in terms of the rate-distortion trade-off. In this study, we propose the two-stage framework, which consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer, to effectively generate high-resolution images. Given a fixed codebook size, RQ-VAE can precisely approximate a feature map of an image and represent the image as a stacked map of discrete codes. Then, RQ-Transformer learns to predict the quantized feature vector at the next position by predicting the next stack of codes. Thanks to the precise approximation of RQ-VAE, we can represent a 256$\times$256 image as 8$\times$8 resolution of the feature map, and RQ-Transformer can efficiently reduce the computational costs. Consequently, our framework outperforms the existing AR models on various benchmarks of unconditional and conditional image generation. Our approach also has a significantly faster sampling speed than previous AR models to generate high-quality images.
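The core idea the abstract describes is residual quantization: instead of one code per position, a feature vector is approximated by a stack of codes, each chosen from a shared codebook to quantize the residual left by the previous step. Below is a minimal NumPy sketch of this idea under stated assumptions; it is not the authors' RQ-VAE implementation, and the codebook size, vector dimension, and quantization depth are illustrative only.

```python
import numpy as np

def residual_quantize(z, codebook, depth):
    """Approximate vector `z` as a sum of `depth` codebook entries,
    greedily quantizing the remaining residual at each step."""
    codes = []
    residual = z.astype(np.float64).copy()
    quantized = np.zeros_like(residual)
    for _ in range(depth):
        # pick the codebook entry nearest to the current residual
        dists = np.linalg.norm(codebook - residual, axis=1)
        k = int(np.argmin(dists))
        codes.append(k)
        quantized += codebook[k]
        residual -= codebook[k]
    return codes, quantized

# Illustrative sizes: a 256-entry codebook of 4-dim vectors, depth 4.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 4))
z = rng.normal(size=4)
codes, z_hat = residual_quantize(z, codebook, depth=4)
```

Each position in the feature map thus emits `depth` codes rather than one, which is why RQ-VAE can keep the spatial resolution as small as 8×8 while still approximating the feature map precisely; the AR model then predicts the next stack of codes at each position.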
