Intelligent systems must make inferences about the objects in their world, such as the vehicles on a road monitored by video cameras, or the people and events mentioned in a set of documents. This set of objects is seldom known in advance; instead, the system must hypothesize objects that explain its observations. Such tasks can be formalized naturally as probabilistic inference problems. However, they pose challenges for standard modeling formalisms such as Bayesian networks, which assume a fixed set of random variables with fixed dependencies.

This talk will describe a set of techniques for probabilistic reasoning about initially unknown objects. I will present a modeling language called Bayesian logic, or BLOG, for defining prior distributions over "possible worlds" with varying sets of objects. Every BLOG model that satisfies certain conditions is guaranteed to fully specify a distribution, even if it defines infinitely many random variables.

I will then describe a Markov chain Monte Carlo algorithm for performing inference on BLOG models. This algorithm is novel in that it does a random walk not over fully specified possible worlds, but over partial world descriptions that instantiate only the relevant variables. I will describe an application of this algorithm to citation matching: identifying the distinct publications referred to by a set of citation strings extracted from online papers. I will conclude with an application of similar ideas to Bayesian model learning, where the structures we are reasoning about are themselves probabilistic models with varying numbers of dependency rules.
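To make the idea of a random walk over worlds with an unknown number of objects concrete, here is a minimal toy sketch in the spirit of the citation-matching application. This is not the talk's actual algorithm or the BLOG engine: the symbolic "citations", the vocabulary size, the copy-noise model, and the Poisson prior on the number of publications are all invented for illustration, and publication attributes are marginalized out analytically rather than lazily instantiated, so the sampler's state is only the citation-to-publication assignment (a partial description of the world).

```python
import math
import random
from collections import defaultdict

random.seed(1)

# Observed citation "strings", coded as symbols 0..9 (assumption for the
# sketch). The first three look like one publication, the last two like
# another, but the number of publications is not known in advance.
citations = [3, 3, 3, 7, 7]
V = 10          # vocabulary size (assumed)
P_COPY = 0.9    # probability a citation copies the true title faithfully
LAMBDA = 1.5    # Poisson prior mean on the number of publications

def log_poisson(k, lam):
    # log P(K = k) under a Poisson(lam) prior on the number of publications
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def cluster_loglik(values):
    # Marginalize the publication's unknown true title t over the vocabulary,
    # so titles never have to be instantiated in the sampler's state.
    total = 0.0
    for t in range(V):
        lik = 1.0
        for v in values:
            lik *= P_COPY if v == t else (1 - P_COPY) / (V - 1)
        total += lik / V
    return math.log(total)

def log_joint(assign):
    # Unnormalized log probability of a partial world: a partition of the
    # citations into publications (labels are arbitrary bookkeeping).
    clusters = defaultdict(list)
    for c, a in zip(citations, assign):
        clusters[a].append(c)
    return log_poisson(len(clusters), LAMBDA) + sum(
        cluster_loglik(vs) for vs in clusters.values())

def mh_step(assign):
    # Metropolis-Hastings move: reassign one citation to an existing
    # publication or to a brand-new one. The option set depends only on the
    # OTHER citations' clusters, which the move does not change, so the
    # partition-level proposal is symmetric and needs no correction term.
    i = random.randrange(len(citations))
    others = {a for j, a in enumerate(assign) if j != i}
    options = sorted(others) + [max(assign) + 1]   # existing pubs + a fresh one
    proposal = list(assign)
    proposal[i] = random.choice(options)
    if math.log(random.random()) < log_joint(proposal) - log_joint(assign):
        return proposal
    return assign

assign = list(range(len(citations)))   # start: every citation its own publication
ks = []
for _ in range(3000):
    assign = mh_step(assign)
    ks.append(len(set(assign)))

# Posterior mode of the number of distinct publications (after burn-in).
n_publications = max(set(ks[500:]), key=ks[500:].count)
```

On this toy data the sampler concentrates on worlds with two publications, one explaining the three citations of symbol 3 and one explaining the two of symbol 7. The key design point it illustrates is that the chain never commits to a fixed set of random variables: publications come into and go out of existence as citations are reassigned.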