Paper Title
Region-of-Interest Based Neural Video Compression
Paper Authors
Abstract
Humans do not perceive all parts of a scene with the same resolution, but rather focus on a few regions of interest (ROIs). Traditional Object-Based codecs take advantage of this biological intuition and are capable of non-uniform bit allocation in favor of salient regions, at the expense of increased distortion in the remaining areas: such a strategy allows a boost in perceptual quality under low-rate constraints. Recently, several neural codecs have been introduced for video compression, yet they operate uniformly over all spatial locations, lacking the capability of ROI-based processing. In this paper, we introduce two models for ROI-based neural video coding. First, we propose an implicit model that is fed a binary ROI mask and trained by de-emphasizing the distortion of the background. Second, we design an explicit latent scaling method that allows control over the quantization binwidth for different spatial regions of the latent variables, conditioned on the ROI mask. Through extensive experiments, we show that our methods outperform all our baselines in terms of Rate-Distortion (R-D) performance in the ROI. Moreover, they generalize to different datasets and to any arbitrary ROI at inference time. Finally, they do not require expensive pixel-level annotations during training, as synthetic ROI masks can be used with little to no degradation in performance. To the best of our knowledge, our proposals are the first solutions that integrate ROI-based capabilities into neural video compression models.
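The two ideas in the abstract can be illustrated with a minimal sketch. The first function shows an ROI-weighted distortion of the kind an implicit model could be trained with (full weight inside the ROI, a reduced weight on the background); the second shows explicit latent scaling, where latents are quantized with a wider binwidth outside the ROI so fewer bits are spent there. Both functions, their names, and the specific weights/binwidths are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def roi_weighted_distortion(x, x_hat, roi_mask, bg_weight=0.1):
    """Illustrative ROI-weighted MSE.

    roi_mask is 1 inside the ROI and 0 outside; background errors are
    de-emphasized by bg_weight (a hypothetical choice, not from the paper).
    """
    w = roi_mask + bg_weight * (1.0 - roi_mask)
    return float(np.mean(w * (x - x_hat) ** 2))

def latent_scaling_quantize(y, roi_mask, bin_roi=1.0, bin_bg=4.0):
    """Illustrative explicit latent scaling.

    Each spatial location of the latent y is quantized with its own
    binwidth: narrow bins (fine quantization, more bits) inside the ROI,
    wide bins (coarse quantization, fewer bits) in the background.
    """
    s = np.where(roi_mask > 0, bin_roi, bin_bg)
    return np.round(y / s) * s
```

With a mask of all ones, `roi_weighted_distortion` reduces to plain MSE; making the background binwidth larger in `latent_scaling_quantize` trades background fidelity for rate, which is the R-D trade-off the abstract describes.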