Humans have an amazing ability to instantly grasp the overall 3D structure of a scene – ground orientation, relative positions of major landmarks, and so on – even from a single image. This ability is almost entirely missing from popular recognition algorithms, which pretend that the world is flat and/or view it through a patch-sized peephole. Yet it seems very likely that a grasp of this "geometric context" of a scene would be of great assistance for many tasks, including recognition, navigation, and novel view synthesis.
In this talk, I will describe our first steps toward the goal of estimating 3D scene context from a single image. We propose to estimate the coarse geometric properties of a scene by learning appearance-based models of geometric classes, where a geometric class describes the 3D orientation of an image region with respect to the camera. We present a multiple-hypothesis segmentation framework for robustly estimating scene structure from a single image and obtaining confidences for each geometric label. These confidences can then (hopefully) be used to improve the performance of many other applications. We provide a quantitative evaluation of our algorithm on a dataset of challenging outdoor images, and we demonstrate its usefulness in two applications: 1) improving object detection (preliminary results), and 2) automatic qualitative single-view reconstruction ("Automatic Photo Pop-up", SIGGRAPH'05).
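To make the multiple-hypothesis idea concrete, here is a minimal illustrative sketch (not the actual implementation): each segmentation hypothesis assigns every segment a probability vector over geometric classes, along with a weight reflecting how likely the segment is to be geometrically homogeneous; per-pixel label confidences are then obtained by averaging over hypotheses. The class names, function, and data layout are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical geometric classes, loosely following the coarse
# ground / vertical / sky distinction described in the talk.
CLASSES = ["support", "vertical", "sky"]

def pixel_confidences(hypotheses):
    """Combine several segmentation hypotheses into per-pixel confidences.

    hypotheses: list of (seg_map, seg_label_probs, seg_weights), where
      seg_map         -- HxW int array of segment ids,
      seg_label_probs -- dict: segment id -> probability vector over CLASSES,
      seg_weights     -- dict: segment id -> homogeneity weight in [0, 1].
    Returns an HxWx|CLASSES| array of weighted-average label confidences.
    """
    h, w = hypotheses[0][0].shape
    conf = np.zeros((h, w, len(CLASSES)))
    total = np.zeros((h, w, 1))
    for seg_map, probs, weights in hypotheses:
        for seg_id, p in probs.items():
            mask = seg_map == seg_id  # pixels belonging to this segment
            conf[mask] += weights[seg_id] * np.asarray(p)
            total[mask] += weights[seg_id]
    return conf / np.maximum(total, 1e-9)

# Toy usage: two hypotheses on a 2x2 image that disagree about labels.
seg1 = np.zeros((2, 2), dtype=int)                      # one big segment
hyp1 = (seg1, {0: [1.0, 0.0, 0.0]}, {0: 1.0})
seg2 = np.array([[0, 0], [1, 1]])                       # split top/bottom
hyp2 = (seg2, {0: [0.0, 1.0, 0.0], 1: [0.0, 0.0, 1.0]}, {0: 1.0, 1: 1.0})
conf = pixel_confidences([hyp1, hyp2])
```

Averaging over several plausible segmentations is what makes the estimate robust: no single (possibly wrong) partition of the image gets to decide the labels on its own.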
Joint work with Derek Hoiem and Martial Hebert at CMU.