Paper Title
$β$-Multivariational Autoencoder for Entangled Representation Learning in Video Frames
Paper Authors
Paper Abstract
It is crucial to choose actions from an appropriate distribution while learning a sequential decision-making process, in which a set of actions is expected given the states and previous rewards. However, if there are more than two latent variables and every pair of variables has a covariance value, learning a known prior from the data becomes challenging, because when the data are large and diverse, many posterior-estimation methods suffer from posterior collapse. In this paper, we propose the $β$-Multivariational Autoencoder ($β$MVAE) to learn a multivariate Gaussian prior from video frames for use as part of single object tracking framed as a decision-making process. We present a novel formulation of object motion in videos with a set of dependent parameters to address the single-object-tracking task. The true values of the motion parameters are obtained through data analysis on the training set, and the parameter population is then assumed to follow a multivariate Gaussian distribution. The $β$MVAE is developed to learn this entangled prior $p = N(μ, Σ)$ directly from frame patches, where the output is the object mask of the frame patch. We devise a bottleneck to estimate the posterior's parameters, i.e. $μ', Σ'$. Via a new reparameterization trick, we learn the likelihood $p(\hat{x}|z)$ as the object mask of the input. Furthermore, we replace the neural network of $β$MVAE with the U-Net architecture and name the new network $β$-Multivariational U-Net ($β$MVUnet). Our networks are trained from scratch on over 85k video frames for 24 million ($β$MVUnet) and 78 million ($β$MVAE) steps. We show that $β$MVUnet improves both posterior estimation and segmentation performance on the test set. Our code and the trained networks are publicly released.
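The abstract's key ingredient is sampling from a posterior $N(μ', Σ')$ with a full (entangled) covariance matrix rather than the usual diagonal one. A minimal NumPy sketch of the standard full-covariance reparameterization, $z = μ' + L'ε$ with $Σ' = L'L'^{\top}$ (Cholesky) and $ε \sim N(0, I)$, is shown below; the paper's own "new reparameterization trick" may differ in detail, and the names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, Sigma, rng):
    """Draw z ~ N(mu, Sigma) via z = mu + L @ eps, where L is the
    Cholesky factor of Sigma and eps ~ N(0, I). Because the sample is a
    deterministic function of (mu, Sigma) given eps, gradients can flow
    through mu and Sigma in a learned model."""
    L = np.linalg.cholesky(Sigma)        # Sigma = L @ L.T, L lower-triangular
    eps = rng.standard_normal(mu.shape)  # noise independent of the parameters
    return mu + L @ eps

# Toy posterior with correlated latents (non-zero off-diagonal covariance)
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

samples = np.stack([reparameterize(mu, Sigma, rng) for _ in range(50_000)])
print(samples.mean(axis=0))  # close to mu
print(np.cov(samples.T))     # close to Sigma, including the 0.6 off-diagonals
```

The empirical mean and covariance of the samples recover $μ$ and $Σ$, confirming that the correlations between latent variables survive the sampling step, which is exactly what a diagonal-covariance reparameterization would discard.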