Paper Title
OCR-VQGAN: Taming Text-within-Image Generation
Paper Authors
Paper Abstract
Synthetic image generation has recently experienced significant improvements in domains such as natural image or art generation. However, the problem of figure and diagram generation remains unexplored. A challenging aspect of generating figures and diagrams is effectively rendering readable text within the images. To alleviate this problem, we present OCR-VQGAN, an image encoder and decoder that leverages OCR pre-trained features to optimize a text perceptual loss, encouraging the architecture to preserve high-fidelity text and diagram structure. To explore our approach, we introduce the Paper2Fig100k dataset, with over 100k images of figures and texts from research papers. The figures show architecture diagrams and methodologies of articles available at arXiv.org from fields like artificial intelligence and computer vision. Figures usually include text and discrete objects, e.g., boxes in a diagram, with lines and arrows that connect them. We demonstrate the effectiveness of OCR-VQGAN by conducting several experiments on the task of figure reconstruction. Additionally, we explore the qualitative and quantitative impact of weighting different perceptual metrics in the overall loss function. We release code, models, and dataset at https://github.com/joanrod/ocr-vqgan.
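The abstract describes a text perceptual loss computed from OCR pre-trained features and weighted against other perceptual metrics in the overall objective. The sketch below illustrates the general idea of a feature-based perceptual loss and a weighted combination of loss terms; it is a minimal NumPy illustration, not the paper's implementation, and the function names, layer weights, and `lambda_*` coefficients are hypothetical.

```python
import numpy as np

def feature_perceptual_loss(feats_target, feats_recon, layer_weights):
    """Weighted sum of mean-squared distances between feature maps
    extracted at several layers of a frozen feature extractor
    (e.g. an OCR backbone for the text-aware term). Illustrative only."""
    total = 0.0
    for w, ft, fr in zip(layer_weights, feats_target, feats_recon):
        total += w * float(np.mean((ft - fr) ** 2))
    return total

def combined_loss(l_recon, l_img_perc, l_ocr_perc,
                  lambda_img=1.0, lambda_ocr=1.0):
    # Hypothetical weighting of the reconstruction term and the two
    # perceptual terms; the paper studies the effect of such weights.
    return l_recon + lambda_img * l_img_perc + lambda_ocr * l_ocr_perc
```

In practice the feature maps would come from fixed intermediate activations of a pre-trained network, so the loss penalizes reconstructions whose text regions look different to an OCR model even when pixel error is small.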