Cornell Database Group

Quark: Unifying Database Systems and Information Retrieval Systems

The data stored in most enterprises is a mix of structured and unstructured data. Traditionally, structured data has been queried using (relational) database systems, while unstructured data has been queried using information retrieval systems. Since users often need to query across both forms of data, there have been many attempts to integrate the functionality of database systems and information retrieval systems. Most of these approaches, however, provide only a “loose integration” of the two systems. Specifically, they assume that database-style queries (i.e., complex structured queries) can be evaluated only over structured data, while information-retrieval-style queries (i.e., ranked keyword search queries) can be evaluated only over unstructured data. As a result, users cannot, for example, rank structured records by keyword relevance or apply structured predicates to text documents, which is a significant loss of functionality.

In the Quark project, we are exploring a much tighter integration (or unification) of database systems and information retrieval systems. Specifically, we are developing a novel system architecture that allows users to issue complex structured queries and ranked keyword search queries over any mix of structured, unstructured, and semi-structured data. In keeping with our goal of a unified data management system, we are using XML (eXtensible Markup Language) as the underlying data model because it is flexible enough to represent structured, unstructured, and semi-structured data. We have developed TeXQuery, a full-text search extension to XQuery. TeXQuery is the precursor to the XQuery 1.0 and XPath 2.0 Full-Text language currently being developed by the World Wide Web Consortium. Quark tightly integrates TeXQuery, including support for scoring and ranking, with regular XQuery processing.
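
To make the idea of a unified query concrete, the following is a minimal Python sketch of a single query that combines a structured predicate with ranked keyword search over one XML document. It is an illustration only: it is neither TeXQuery nor Quark's architecture, the element and attribute names (`library`, `book`, `year`, `title`, `abstract`) are hypothetical, and the scoring function is a toy term-frequency measure chosen for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: each <book> mixes structured fields (year)
# with unstructured text (abstract).
DOC = """
<library>
  <book year="2003">
    <title>XML Query Processing</title>
    <abstract>Ranked keyword search over XML data and query processing.</abstract>
  </book>
  <book year="1998">
    <title>Relational Systems</title>
    <abstract>Query optimization for relational data.</abstract>
  </book>
  <book year="2004">
    <title>Text Retrieval</title>
    <abstract>Keyword search, scoring, and ranking of text. Search engines rank text.</abstract>
  </book>
</library>
"""

ROOT = ET.fromstring(DOC)


def keyword_score(text, keywords):
    """Toy relevance score: fraction of tokens that match a keyword."""
    tokens = [t.strip(".,;:").lower() for t in text.split()]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in keywords)
    return hits / len(tokens)


def search(root, min_year, keywords):
    """Return (score, title) pairs for books published in or after
    min_year, ranked by keyword relevance, highest score first."""
    wanted = {k.lower() for k in keywords}
    results = []
    for book in root.iter("book"):
        # Structured predicate, as in a database-style query.
        if int(book.get("year")) < min_year:
            continue
        # Keyword scoring over all text in the element, as in an
        # information-retrieval-style query.
        text = " ".join(book.itertext())
        score = keyword_score(text, wanted)
        if score > 0:
            results.append((score, book.findtext("title")))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


if __name__ == "__main__":
    for score, title in search(ROOT, 2000, ["keyword", "search"]):
        print(f"{score:.3f}  {title}")
```

The point of the sketch is that the year filter and the relevance ranking are applied to the same elements in one pass, rather than sending structured data to one system and text to another.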