Holistic Scene Understanding for Robots

One of the original goals of computer vision was to fully understand a natural scene. This requires solving several sub-tasks simultaneously, including object detection, labeling of meaningful regions, and 3D reconstruction. In the past, researchers have developed strong classifiers for tackling each of these sub-tasks in isolation. However, these sub-tasks can help each other. For example, if we know the 3D structure of the scene, we can make a better guess at the location of a car (cars usually don't fly in the air). Composing different related sub-tasks is not easy. In our work, we have developed machine learning techniques that combine the sub-tasks without needing to know the inner workings of each classifier. That is, our method treats each vision module as a "black box", allowing us to use very sophisticated, state-of-the-art classifiers without having to look under the hood.
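As an illustration only, the black-box idea can be sketched as a two-layer cascade in Python. This is a minimal sketch under stated assumptions, not the implementation from the publications below: the function `run_cascade`, the toy modules, the feature layout, and the 1.0 height threshold are all hypothetical, and the feedback step of FeCCM is not modeled. Each module is just a callable from a feature vector to a list of scores, so the cascade never inspects how a module works internally.

```python
from typing import Callable, Dict, List, Sequence

# A black-box vision module: feature vector in, list of output scores out.
Module = Callable[[Sequence[float]], List[float]]


def run_cascade(
    features: Sequence[float],
    first_layer: Dict[str, Module],
    second_layer: Dict[str, Module],
) -> Dict[str, List[float]]:
    """Run a two-layer cascade over black-box modules (illustrative only).

    first_layer:  task name -> module that sees the raw features only.
    second_layer: task name -> module that sees the raw features followed
                  by every first-layer output, ordered by task name.
    """
    # Layer 1: every task is solved in isolation.
    first_outputs = {task: module(features) for task, module in first_layer.items()}

    # Concatenate first-layer outputs in a fixed (sorted) order as context.
    context = [s for task in sorted(first_outputs) for s in first_outputs[task]]
    augmented = list(features) + context

    # Layer 2: each task is re-solved with the other tasks' outputs as context.
    return {task: module(augmented) for task, module in second_layer.items()}


# --- Hypothetical toy modules (assumptions, not real detectors) ---
def car_appearance(x: Sequence[float]) -> List[float]:
    # x[0] is an assumed appearance-based car score.
    return [x[0]]


def height_estimate(x: Sequence[float]) -> List[float]:
    # x[1] is an assumed height-above-ground cue from 3D reconstruction.
    return [x[1]]


def car_with_context(x: Sequence[float]) -> List[float]:
    # Layout: [appearance, height cue, first-layer car score, first-layer height].
    car_score, height = x[2], x[3]
    # Down-weight detections far above the ground (cars rarely fly).
    return [car_score * (1.0 if height < 1.0 else 0.1)]


FIRST = {"car": car_appearance, "depth": height_estimate}
SECOND = {"car": car_with_context}

if __name__ == "__main__":
    print(run_cascade([0.9, 0.0], FIRST, SECOND))  # car-like patch on the ground
    print(run_cascade([0.9, 5.0], FIRST, SECOND))  # car-like patch high in the air
```

In this toy setup the second-layer car module keeps a confident detection that sits on the ground and suppresses the same detection when the depth module places it well above the ground. Any of the modules could be swapped for a more sophisticated classifier without changing `run_cascade`.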

Contact: Ashutosh Saxena
Related projects: Make3D, Personal Robots

Publications

Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models, Congcong Li, Adarsh Kowdle, Ashutosh Saxena, Tsuhan Chen.
IEEE Trans Pattern Analysis and Machine Intelligence (PAMI), July 2012. (Online First: 2011) [PDF]

FeCCM for Scene Understanding: Helping the Robot to Learn Multiple Tasks, Congcong Li, TP Wong, Norris Xu, Ashutosh Saxena.
Video contribution in International Conference on Robotics and Automation (ICRA), 2011. [pdf, mp4, youtube]

Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models, Congcong Li, Adarsh Kowdle, Ashutosh Saxena, Tsuhan Chen. In Neural Information Processing Systems (NIPS), 2010. [pdf, pdf-full version]

Cascaded Classification Models: Combining Models for Holistic Scene Understanding, Geremy Heitz, Stephen Gould, Ashutosh Saxena, Daphne Koller. In Neural Information Processing Systems (NIPS), 2008. (full oral) [pdf]

A generic model to compose vision modules for holistic scene understanding, Adarsh Kowdle, Congcong Li, Ashutosh Saxena and Tsuhan Chen. In European Conference on Computer Vision Workshop on Parts and Attributes (ECCV '10), 2010. [pdf, slides]

θ-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding, Congcong Li, Ashutosh Saxena, Tsuhan Chen.
To appear in Neural Information Processing Systems (NIPS), 2011. [pdf coming soon]

More publications

The video shows our robot using the FeCCM algorithm to combine object detectors with scene categorization and 3D reconstruction algorithms to build a reliable shoe detector. The robot finds the shoe on request.