Quark: Unifying Database Systems and Information Retrieval Systems


The data stored in most enterprises is a mix of structured and unstructured data. Traditionally, structured data has been queried using (relational) database systems, while unstructured data has been queried using information retrieval systems. Since users often need to query across both forms of data, there have been many attempts to integrate the functionality of the two kinds of systems. Most of these approaches, however, provide only a “loose integration”. Specifically, they assume that database-style queries (i.e., complex structured queries) can be evaluated only over structured data, while information-retrieval-style queries (i.e., ranked keyword search queries) can be evaluated only over unstructured data. This “loose integration” results in a significant loss of functionality for users.

In the Quark project, we are exploring a much tighter integration (or unification) of database systems and information retrieval systems. Specifically, we are developing a novel system architecture that allows users to issue both complex structured queries and ranked keyword search queries over any mix of structured, unstructured, and semi-structured data. In keeping with our goal of a unified data management system, we are using XML (eXtensible Markup Language) as the underlying data model because it is flexible enough to represent structured, unstructured, and semi-structured data. As part of this work, we have developed TeXQuery, a full-text search extension to XQuery.
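To make the idea of a single query over mixed data concrete, the following is a minimal Python sketch, not Quark's architecture and not the TeXQuery or XRANK algorithms. It assumes a small hypothetical XML catalog in which each book has structured fields (a price) and unstructured text (a review), and it combines a structured predicate with a toy keyword score. The sample data and the names `keyword_score` and `search` are illustrative assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data: structured fields (year, price) and
# unstructured text (review) live in the same XML document.
DOC = """
<catalog>
  <book year="2001">
    <title>XML Data Management</title>
    <price>45</price>
    <review>A thorough look at XML storage and XML query processing.</review>
  </book>
  <book year="1998">
    <title>Classic Retrieval</title>
    <price>30</price>
    <review>Covers ranking and keyword search over text collections.</review>
  </book>
  <book year="2003">
    <title>Web Databases</title>
    <price>60</price>
    <review>Keyword search over XML and relational data on the web.</review>
  </book>
</catalog>
"""


def keyword_score(element, keywords):
    """Toy score: total occurrences of the keywords in all text under element."""
    words = re.findall(r"\w+", " ".join(element.itertext()).lower())
    return sum(words.count(k.lower()) for k in keywords)


def search(root, keywords, max_price):
    """Apply a structured filter (price) and rank the survivors by keyword score."""
    hits = []
    for book in root.iter("book"):
        # Structured predicate, evaluated on a typed field.
        if float(book.findtext("price")) > max_price:
            continue
        # Ranked keyword search, evaluated over the same element's text.
        score = keyword_score(book, keywords)
        if score > 0:
            hits.append((score, book.findtext("title")))
    hits.sort(key=lambda hit: -hit[0])  # highest score first
    return [title for _, title in hits]


results = search(ET.fromstring(DOC), ["xml", "search"], max_price=50)
print(results)  # ['XML Data Management', 'Classic Retrieval']
```

The point of the sketch is only that both kinds of conditions apply to the same elements in one pass. A real system would rely on indexes and a principled ranking function rather than scanning and counting words.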

We are also applying Quark to the problem of querying the web. Most current web search engines can only crawl, index, and query static (unstructured) web pages, also referred to as the “surface web”. A large fraction of the data on the Internet, however, is stored in (structured) Internet-attached databases, or the “deep web”. For example, the data about auctions on ebay.com is stored in an Internet-attached database and is not visible to current web search engines. Some studies estimate that the deep web is 400-500 times the size of the surface web. As part of the Deep Glue component of Quark, we are building a system for querying both surface web and deep web data sources, focusing on key issues such as ranking, indexing, and query processing.

People

Publications

C. Botev, S. Amer-Yahia, J. Shanmugasundaram, "On the Completeness of Full-Text Search Languages for XML", Cornell University Technical Report, December 2003.

S. Amer-Yahia, C. Botev, J. Shanmugasundaram, "TeXQuery: A Full-Text Search Extension to XQuery", WWW Conference, May 2004.

L. Guo, F. Shao, C. Botev, J. Shanmugasundaram, "XRANK: Ranked Keyword Search over XML Documents", SIGMOD Conference, June 2003.

J. Qiu, F. Shao, M. Zatsman, J. Shanmugasundaram, "Index Structures for Querying the Deep Web", WebDB Workshop, June 2003.

Related Projects

TeXQuery: A Full-Text Search Extension to XQuery