Quark: Unifying Database Systems and Information Retrieval Systems
The data stored in most enterprises is a mix of
structured and unstructured data. Traditionally, structured data has been
queried using (relational) database systems, while unstructured data has been
queried using information retrieval systems. Since users often need to query
across both forms of data, there have been many attempts to integrate the
functionality of database systems and information retrieval systems. Most of
these approaches, however, only provide a “loose integration” of the two
systems. Specifically, they assume that database-style queries (i.e., complex
structured queries) can be evaluated only over structured data, while
information-retrieval-style queries (i.e., ranked keyword search queries) can be
evaluated only over unstructured data. This “loose integration” results in a
significant loss of functionality for users.
In the Quark
project, we are exploring a much tighter integration (or unification) of
database systems and information retrieval systems. Specifically, we are
developing a novel system architecture that allows users to issue complex
structured queries and ranked keyword search queries over any mix of
structured, unstructured, and semi-structured data. In keeping with our goal of
a unified data management system, we are using XML (eXtensible Markup Language)
as the underlying data model because it is flexible enough to represent
structured, unstructured, and semi-structured data. We have developed TeXQuery,
a full-text search extension to XQuery. TeXQuery is the precursor to the
XQuery 1.0 and XPath 2.0 Full-Text specification currently being developed by the World Wide Web Consortium (W3C).
Quark tightly integrates TeXQuery, including support for scoring and ranking, with regular XQuery processing.
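
To make the idea of a unified query concrete, the following is a minimal Python sketch of a single query that combines a structured predicate with ranked keyword search over the same XML data. It is an illustration only: it uses neither TeXQuery syntax nor Quark's architecture, and the sample document, the `score` and `search` functions, and the term-count scoring are all assumptions introduced for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data mixing structured fields (year) and free text (abstract).
DOC = """
<catalog>
  <book><title>XML Databases</title><year>2003</year>
    <abstract>Querying XML data with XQuery and keyword search over XML.</abstract></book>
  <book><title>Relational Systems</title><year>1998</year>
    <abstract>Classic query processing for structured data.</abstract></book>
  <book><title>Search Engines</title><year>2004</year>
    <abstract>Ranked keyword search for unstructured text.</abstract></book>
</catalog>
"""


def score(text, keywords):
    """Toy relevance score: total number of keyword occurrences in the text."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(tokens.count(k.lower()) for k in keywords)


def search(root, min_year, keywords):
    """Return (score, title) pairs for books passing the structured filter,
    ranked by descending keyword score."""
    results = []
    for book in root.iter("book"):
        # Structured predicate, as a database-style query would express it.
        if int(book.findtext("year")) < min_year:
            continue
        # Keyword scoring over all text inside the element, as an
        # information-retrieval-style query would.
        text = " ".join(book.itertext())
        s = score(text, keywords)
        if s > 0:
            results.append((s, book.findtext("title")))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


root = ET.fromstring(DOC)
ranked = search(root, 2000, ["xml", "search"])
print(ranked)
```

Here the year filter and the keyword ranking apply to the same elements within one query, which is the kind of combination that a loose integration of the two systems would not allow.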