CS Colloquium

Gerhard Weikum
Max Planck Institute of Computer Science

Intelligent and Efficient Search on Semistructured Data

We are witnessing an information explosion with rapidly growing amounts of semistructured data in XML and other formats, within large intranets, digital libraries, scientific data repositories, and the Web. When searching multiple data sources, one often faces the fundamental problem that there is no unified database schema and that the data is highly diverse in its structure, annotations (e.g., XML tags), and terminology. In this situation, traditional database querying as supported by XQuery or XPath often yields unsatisfactory results: either too few or far too many answers. What is called for instead is ranked retrieval based on relevance and similarity measures for query results. To this end, database technology must be integrated with techniques from information retrieval and statistical learning.

The talk presents the XXL search engine (Flexible XML Search Language), which has been developed in Saarbruecken in pursuit of this research direction. The system leverages statistically quantified ontological relationships for ranked retrieval on heterogeneous XML data. For efficiency, the system uses specialized index structures, particularly the XML connection index HOPI (Two-Hop-Based Path Index), and approximate algorithms for top-k similarity queries with probabilistic score predictions. The key concepts of XXL also apply to conventional Web data (in HTML or PDF), which is automatically converted into XML, and to searching Deep-Web portals. These generalized capabilities are demonstrated by the COMPASS Web search engine.
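
As a rough illustration of the idea behind a two-hop connection index, the sketch below labels each node with the "center" nodes it can reach and the centers that can reach it; two nodes are then connected if their labels share a center. This is the generic two-hop labeling technique under simplifying assumptions, not the HOPI implementation itself: the function names (build_labels, reaches), the sample graph, and the way centers are chosen are all hypothetical.

```python
from collections import defaultdict, deque


def _closure(start, adj):
    """Return every node reachable from start via adj, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def build_labels(edges, centers=None):
    """Build two-hop labels (l_out, l_in) for a directed graph.

    Assumption: with centers=None every node acts as a center, which is
    always correct but as large as the transitive closure. A real index
    would pick a small set of centers that still covers all connections.
    """
    fwd, bwd = defaultdict(set), defaultdict(set)
    nodes = set()
    for src, dst in edges:
        fwd[src].add(dst)
        bwd[dst].add(src)
        nodes.update((src, dst))
    if centers is None:
        centers = nodes
    l_out, l_in = defaultdict(set), defaultdict(set)
    for c in centers:
        for v in _closure(c, fwd):   # c reaches v
            l_in[v].add(c)
        for u in _closure(c, bwd):   # u reaches c
            l_out[u].add(c)
    return l_out, l_in


def reaches(u, v, l_out, l_in):
    """u is connected to v if some center lies on a path from u to v."""
    return not l_out[u].isdisjoint(l_in[v])


# Hypothetical element graph: two documents linking into a shared hub.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("c", "e")]
l_out, l_in = build_labels(edges, centers={"c"})
print(reaches("a", "e", l_out, l_in))
print(reaches("d", "a", l_out, l_in))
```

In this toy graph the single center "c" already covers every connection between distinct nodes, which hints at why such labels can be far smaller than a materialized transitive closure. Note that with a restricted center set, a node that is not a center is not reported as connected to itself.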