To be useful to downstream applications, visual recognition systems have to solve a diverse array of tasks: they need to recognize a large number of categories, localize instances of these categories precisely in the visual field, estimate their pose accurately and so on. This set of requirements is also not fixed a priori and can change over time, requiring recognition systems to learn new tasks quickly and with minimal training. In contrast, visual recognition systems today only produce a shallow understanding of images, restricted to recognizing categories in an image, and expanding this shallow understanding requires the expensive collection of training data.

In this talk I will describe my work on removing this shortcoming. I will show we can build recognition systems that produce richer outputs, such as pixel-precise localization of detected objects, and how we can make progress towards making these systems capable of visual reasoning. Building models for these new goals requires a lot of training data. To reduce this requirement, I will present ways of leveraging past visual experience to learn new tasks, such as recognizing unseen categories, from very little data.

I am currently a post-doctoral scholar in Facebook AI Research (FAIR). Before joining FAIR, I did my PhD with Prof. Jitendra Malik at the University of California, Berkeley, where I was awarded the Berkeley Fellowship and the Microsoft Research Fellowship. My interests are in object recognition in computer vision and machine learning.