Mike Cafarella

University of Washington

Extracting and Managing Structured Web Data

Most models of the Web treat it as a graph of linked pages, in which each page is a simple unstructured text document.  But in fact, Web pages often contain structured data embedded in natural language text, HTML tables, and other forms.  (For example, a Web page about cars might contain a small list of car makes, models, and years.)  This structured data exists in plain sight for Web users, but is ignored by traditional search engines and by database systems.  Thus, even though the Web is one of the largest and most interesting datasets available, a huge portion of it cannot be queried or manipulated in any practical way.

In this talk, I will discuss three systems that extract and manage Web data.  The TextRunner system operates over natural language text and generates n-ary facts.  (For example, a biographical page about Albert Einstein might yield the 3-element fact [Einstein/was-born-in/1879].)  The WebTables system extracts relational databases from raw HTML tables (finding more than 125 million high-quality databases from a single Web crawl).  WebTables also offers several novel applications built on top of the extracted data, such as a structured-data search engine and an autocomplete tool for database schema designers.  Finally, the Octopus system allows a user to construct a clean novel database out of data embedded in potentially dozens of different source pages, using just two or three simple commands.  The research agenda embodied in these three projects aims toward an automatically constructed "database of everything" that offers the breadth of the Web while enabling novel structured-data applications.

**********

Michael Cafarella is a Ph.D. candidate in Computer Science at the University of Washington, under the supervision of Dan Suciu and Oren Etzioni.  His research focuses on information extraction from the Web, and draws on techniques from databases and artificial intelligence.  In addition to his Ph.D. studies, Mike has worked as an intern at Google and as an engineer at two successful startups.  He is also the co-creator of the Hadoop open-source project, which is deployed widely in both academia and industry.

4:15pm

B17 Upson Hall

Tuesday, March 31, 2009

Refreshments at 3:45pm in the Upson 4th Floor Atrium

Computer Science Colloquium, Spring 2009

www.cs.cornell.edu/events/colloquium