Paper Title

Autoregressive Image Generation using Residual Quantization

Authors

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, Wook-Shin Han

Abstract


For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence length is important for an AR model to reduce its computational costs to consider long-range interactions of codes. However, we postulate that previous VQ cannot shorten the code sequence and generate high-fidelity images together in terms of the rate-distortion trade-off. In this study, we propose the two-stage framework, which consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer, to effectively generate high-resolution images. Given a fixed codebook size, RQ-VAE can precisely approximate a feature map of an image and represent the image as a stacked map of discrete codes. Then, RQ-Transformer learns to predict the quantized feature vector at the next position by predicting the next stack of codes. Thanks to the precise approximation of RQ-VAE, we can represent a 256$\times$256 image as 8$\times$8 resolution of the feature map, and RQ-Transformer can efficiently reduce the computational costs. Consequently, our framework outperforms the existing AR models on various benchmarks of unconditional and conditional image generation. Our approach also has a significantly faster sampling speed than previous AR models to generate high-quality images.
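The core idea the abstract describes is residual quantization: instead of one code per position, a feature vector is approximated by a stack of codes, each chosen from a shared codebook to quantize the residual left by the previous step. Below is a minimal NumPy sketch of this idea under stated assumptions; it is not the authors' RQ-VAE implementation, and the codebook size, vector dimension, and quantization depth are illustrative only.

```python
import numpy as np

def residual_quantize(z, codebook, depth):
    """Approximate vector `z` as a sum of `depth` codebook entries,
    greedily quantizing the remaining residual at each step."""
    codes = []
    residual = z.astype(np.float64).copy()
    quantized = np.zeros_like(residual)
    for _ in range(depth):
        # pick the codebook entry nearest to the current residual
        dists = np.linalg.norm(codebook - residual, axis=1)
        k = int(np.argmin(dists))
        codes.append(k)
        quantized += codebook[k]
        residual -= codebook[k]
    return codes, quantized

# Illustrative sizes: a 256-entry codebook of 4-dim vectors, depth 4.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 4))
z = rng.normal(size=4)
codes, z_hat = residual_quantize(z, codebook, depth=4)
```

Each position in the feature map thus emits `depth` codes rather than one, which is why RQ-VAE can keep the spatial resolution as small as 8×8 while still approximating the feature map precisely; the AR model then predicts the next stack of codes at each position.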
