Paper Title
Learning Stereo from Single Images
Paper Authors
Abstract
Supervised deep networks are among the best methods for finding correspondences in stereo image pairs. Like all supervised approaches, these networks require ground truth data during training. However, collecting large quantities of accurate dense correspondence data is very challenging. We propose that it is unnecessary to have such a high reliance on ground truth depths or even corresponding stereo pairs. Inspired by recent progress in monocular depth estimation, we generate plausible disparity maps from single images. In turn, we use those flawed disparity maps in a carefully designed pipeline to generate stereo training pairs. Training in this manner makes it possible to convert any collection of single RGB images into stereo training data. This results in a significant reduction in human effort, with no need to collect real depths or to hand-design synthetic data. We can consequently train a stereo matching network from scratch on datasets like COCO, which were previously hard to exploit for stereo. Through extensive experiments we show that our approach outperforms stereo networks trained with standard synthetic datasets, when evaluated on KITTI, ETH3D, and Middlebury.
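The core idea of the pipeline — turning a single image plus a predicted disparity map into a stereo training pair — rests on forward warping: each left-image pixel is shifted horizontally by its disparity to synthesize a plausible right view. The sketch below is a minimal, hypothetical illustration of that warping step (it is not the paper's implementation, which additionally handles occlusion hole-filling and other artifacts); the function name and interface are assumptions for illustration.

```python
import numpy as np

def synthesize_right_view(left, disparity):
    """Forward-warp a left image into a plausible right view.

    left:      (H, W, 3) float image
    disparity: (H, W) non-negative disparities in pixels

    Returns the warped right image and a boolean mask of filled
    pixels; disoccluded regions remain holes (mask == False).
    """
    H, W, _ = left.shape
    right = np.zeros_like(left)
    # Track the largest disparity written to each target pixel so that
    # nearer surfaces (larger disparity) correctly occlude farther ones.
    best_disp = np.full((H, W), -1.0)

    ys, xs = np.mgrid[0:H, 0:W]
    target_x = np.round(xs - disparity).astype(int)  # shift left by disparity
    valid = (target_x >= 0) & (target_x < W)

    for y, x, tx, d in zip(ys[valid], xs[valid], target_x[valid], disparity[valid]):
        if d > best_disp[y, tx]:
            best_disp[y, tx] = d
            right[y, tx] = left[y, x]

    mask = best_disp >= 0
    return right, mask
```

With zero disparity the right view equals the left; with a constant disparity of d pixels the image is shifted left by d, leaving a d-pixel hole at the right edge — exactly the kind of imperfection the paper's training pipeline is designed to tolerate.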