Paper Title

Egocentric Scene Understanding via Multimodal Spatial Rectifier

Authors

Tien Do, Khiem Vuong, Hyun Soo Park

Abstract

In this paper, we study the problem of egocentric scene understanding, i.e., predicting depths and surface normals from an egocentric image. Egocentric scene understanding poses unprecedented challenges: (1) due to large head movements, the images are taken from non-canonical viewpoints (i.e., tilted images) where existing models of geometry prediction do not apply; (2) dynamic foreground objects, including hands, constitute a large proportion of visual scenes. These challenges limit the performance of existing models learned from large indoor datasets, such as ScanNet and NYUv2, which comprise predominantly upright images of static scenes. We present a multimodal spatial rectifier that stabilizes egocentric images to a set of reference directions, which allows learning a coherent visual representation. Unlike a unimodal spatial rectifier, which often produces excessive perspective warp for egocentric images, the multimodal spatial rectifier learns from multiple directions, which minimizes the impact of the perspective warp. To learn visual representations of the dynamic foreground objects, we present a new dataset called EDINA (Egocentric Depth on everyday INdoor Activities) that comprises more than 500K synchronized RGBD frames and gravity directions. Equipped with the multimodal spatial rectifier and the EDINA dataset, our proposed method for single-view depth and surface normal estimation significantly outperforms the baselines not only on our EDINA dataset, but also on other popular egocentric datasets, such as First Person Hand Action (FPHA) and EPIC-KITCHENS.
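To make the rectification idea concrete: for a pure camera rotation R, an image can be warped by the homography H = K R K⁻¹, where K is the camera intrinsics; a multimodal rectifier then simply picks, among several reference directions, the one closest to the measured gravity so that R (and hence the warp) stays small. Below is a minimal sketch of this gravity-aligned warping, assuming a calibrated camera, an IMU-measured gravity vector expressed in the camera frame, and two illustrative reference modes; the helper names, intrinsics, and mode directions are hypothetical, and this is not the authors' implementation.

```python
import numpy as np
import cv2  # OpenCV, assumed available for the perspective warp

def rotation_between(a, b):
    """Rotation matrix taking unit vector a onto unit vector b
    (Rodrigues' formula; assumes a and b are not anti-parallel)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, 1.0):  # already aligned
        return np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx * (1.0 / (1.0 + c))

def rectify(image, K, gravity, reference_dirs):
    """Warp `image` so the measured gravity aligns with the closest
    reference direction -- the 'multimodal' step is the nearest-mode
    selection, which keeps the perspective warp small."""
    ref = max(reference_dirs, key=lambda r: np.dot(gravity, r))
    R = rotation_between(gravity, ref)
    H = K @ R @ np.linalg.inv(K)  # pure-rotation homography
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h))

# Hypothetical usage: intrinsics, gravity, and the two reference modes
# ("upright" and "looking down") are illustrative values only.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
gravity = np.array([0.1, 0.95, 0.3])
gravity /= np.linalg.norm(gravity)
modes = [np.array([0.0, 1.0, 0.0]),                      # upright viewing
         np.array([0.0, 0.5, 0.5]) / np.sqrt(0.5)]       # tilted-down viewing
image = np.zeros((480, 640, 3), dtype=np.uint8)
rectified = rectify(image, K, gravity, modes)
```

With a single reference mode (a unimodal rectifier), a camera looking steeply downward would need a large rotation to become "upright," producing the excessive warp the abstract mentions; adding a second, tilted-down mode keeps each rotation, and thus each warp, small.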
