Paper title
Robust and Controllable Object-Centric Learning through Energy-based Models
Paper authors
Paper abstract
Humans are remarkably good at understanding and reasoning about complex visual scenes. The capability to decompose low-level observations into discrete objects allows us to build a grounded abstract representation and identify the compositional structure of the world. Accordingly, it is a crucial step for machine learning models to be capable of inferring objects and their properties from visual scenes without explicit supervision. However, existing works on object-centric representation learning rely either on tailor-made neural network modules or on strong probabilistic assumptions in the underlying generative and inference processes. In this work, we present \ours, a conceptually simple and general approach to learning object-centric representations through an energy-based model. By forming a permutation-invariant energy function from the vanilla attention blocks readily available in Transformers, we can infer object-centric latent variables via gradient-based MCMC methods, for which permutation equivariance is automatically guaranteed. We show that \ours can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations, leading to better segmentation accuracy and competitive downstream task performance. Further, empirical evaluations show that the representations learned by \ours are robust against distribution shift. Finally, we demonstrate the effectiveness of \ours in systematic compositional generalization by re-composing learned energy functions for novel scene generation and manipulation.
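The inference idea in the abstract (a permutation-invariant energy over latent slots, sampled with a gradient-based MCMC method) can be illustrated with a small sketch. This is not the paper's model: the energy below is a toy mixture-style stand-in with a softmax over slots instead of Transformer attention blocks, the sampler is plain unadjusted Langevin dynamics, and all names (`energy`, `grad_energy`, `langevin_infer`, `feats`, `slots`) are hypothetical. It only shows why an energy that is invariant to slot order yields slot updates that are equivariant to slot order.

```python
import math
import random


def _slot_logits(x, slots):
    # Negative half squared distance from feature x to every slot.
    return [-0.5 * sum((xd - zd) ** 2 for xd, zd in zip(x, z)) for z in slots]


def energy(slots, feats):
    """Toy energy (an assumption, not the paper's): -sum_i log sum_k exp(logit_ik).

    The inner sum runs over all slots, so reordering the slots leaves it unchanged.
    """
    total = 0.0
    for x in feats:
        logits = _slot_logits(x, slots)
        m = max(logits)
        total -= m + math.log(sum(math.exp(l - m) for l in logits))
    return total


def grad_energy(slots, feats):
    """Analytic gradient of `energy` with respect to each slot."""
    grads = [[0.0] * len(z) for z in slots]
    for x in feats:
        logits = _slot_logits(x, slots)
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        norm = sum(weights)
        for k, z in enumerate(slots):
            w = weights[k] / norm  # softmax weight of feature x on slot k
            for d in range(len(z)):
                grads[k][d] += w * (z[d] - x[d])
    return grads


def langevin_infer(slots, feats, noises, step=0.05):
    """Unadjusted Langevin dynamics: z <- z - step * grad E + sqrt(2 * step) * noise."""
    scale = math.sqrt(2.0 * step)
    for noise in noises:  # one list of per-slot noise vectors per step
        grads = grad_energy(slots, feats)
        slots = [
            [z[d] - step * g[d] + scale * e[d] for d in range(len(z))]
            for z, g, e in zip(slots, grads, noise)
        ]
    return slots


# Usage: two clusters of 2-D "features", three randomly initialised slots.
rng = random.Random(0)
feats = [[2.0, 2.1], [1.9, 2.0], [2.1, 1.8], [-2.0, -1.9], [-2.1, -2.0], [-1.8, -2.2]]
slots0 = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(3)]
noises = [[[rng.gauss(0, 1) for _ in range(2)] for _ in range(3)] for _ in range(20)]

perm = [2, 0, 1]
out = langevin_infer(slots0, feats, noises)
out_perm = langevin_infer(
    [slots0[p] for p in perm], feats, [[n[p] for p in perm] for n in noises]
)
# Compare each slot of the permuted run with the matching slot of the original run.
equivariant = all(
    math.isclose(a, b, abs_tol=1e-6)
    for p, row in zip(perm, out_perm)
    for a, b in zip(out[p], row)
)
print("slot order only relabels the result:", equivariant)
```

Because the gradient of a permutation-invariant function is permutation-equivariant, the sampler needs no special handling of slot order. Permuting the initial slots together with their noise simply permutes the inferred slots, up to floating-point rounding.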