Thursday, September 21, 2006
4:15 pm
B17 Upson Hall

Computer Science
Colloquium
Fall 2006

Hector Garcia-Molina
Stanford University

Generic Entity Resolution

Entity resolution (ER) is a problem that arises in many information integration scenarios: We have two or more sources containing records on the same set of real-world entities (e.g., customers). However, there are no unique identifiers that tell us what records from one source correspond to those in the other sources. Furthermore, the records representing the same entity may have differing information, e.g., one record may have the address misspelled, another record may be missing some fields. An ER algorithm attempts to identify the matching records from multiple sources (i.e., those corresponding to the same real-world entity), and merges the matching records as best it can.

In this talk I will describe a "generic" ER approach where the functions for comparing and merging records are black-boxes, invoked on pairs of records. I will describe a set of important properties of the black-boxes that enable efficient ER. I will also introduce three algorithms for ER: one for the general case, one for the case the properties hold, and one when the computations can be distributed across multiple processors. If time permits, I will show some experimental comparisons of the algorithms, based on comparison shopping data provided by Yahoo.