Back to the Saarland Database Group's Stream Project Page  ]

GCX: A Streaming XQuery Engine with Static and Dynamic Buffer Minimization

News:

Summary:

The GCX in-memory XQuery engine is the first streaming XQuery engine which implements active garbage collection, a novel buffer management strategy in which both static and dynamic analysis are exploited. This technique actively purges main memory buffers at runtime based on the current status of query evaluation. In our paper, we show the various stages in evaluating composition-free XQuery with GCX. Our technique has a significant impact on reducing both main memory consumption and query execution time, as can be seen in our experiments.

GCX Resources:

On Active Garbage Collection in XQuery Engines:

Garbage collection is a well-understood technique for automatic memory management in programming languages. The basic principle of any garbage collector is to determine which data objects in a program will not be accessed in the future, and consequently, to reclaim the storage used by these objects.
A simple yet effective garbage collection strategy is reference counting where every object counts the number of references to it. When a reference is created to an object, its reference count is incremented. Likewise, the reference count is decremented when a reference is removed. Once the count reaches zero, the object is deleted and its memory is reclaimed. A major advantage of this approach is that the memory overhead is small.

The approach underlying active garbage collection for XQuery engines is strongly related insofar as each single node in the buffer keeps track whether it is still relevant to the remaining XQuery evaluation. Instead of counting references, we employ the concept of roles which are assigned to nodes. Intuitively, a role serves as a metaphor for the future relevance of a given node.
Roles are statically derived from the path expressions in the query. While reading the input stream, the XML nodes are matched against the set of possible roles. A node can be assigned several roles when it is used in the query in several different contexts. Moreover, a role can be assigned to a node several times; this can happen if queries involve XPath expressions with descendant-axes.

An Example

We demonstrate the two key principles in the GCX query engine, namely document projection and active garbage collection by an example. Consider the following query:

<x>
    {for $book in //book
     return ($book/author, $book/editor)}
</x>

In static analysis, we derive a projection tree where each node in this tree is associated with a "role". Intuitively,

The projection tree for our example is shown below. All nodes matched by the path expressions //book are query-relevant. They will be assigned role "r1" and stored in the main memory buffer. However, not their complete subtrees are query-relevant - only the author and editor children of a book node are buffered (with roles "r2" and "r3" respectively), and so are the nodes in their subtrees (with roles "r4" and "r5" respectively), because these nodes will have to be output. For instance, the XML document tree shown below is projected into the buffer as the tree shown on the right. The buffered nodes carry the roles that they have been assigned. Note that inner nodes, such as the "publications" node, can also be projected away.

While the idea of XML document projection has been proposed before (e.g. Marian and Simeon in their VLDB 2003 paper), our novel concept of assigning roles to buffered nodes allows us to dynamically purge nodes from the main memory buffer. At runtime, roles are removed from the buffered nodes as the query is evaluated. We say that roles are "signed off". The query evaluator signals this to the buffer by issuing "signOff"-statements. To this aim, the query is statically rewritten as follows:

<x>
  {for $book in //book
   return 
       (for $a in $book/author
        return ($a, signOff($a,r2), signOff($a/descendant-or-self::node(), r4)),

        for $e in $book/editor
        return ($e, signOff($e, r3), signOff($e/descendant-or-self::node(), r5),           

        signOff($book, r1))}  
</x>
For instance, after the author node (bound by variable $a) has been output, it loses role "r2". Moreover, the author node and all of its descendants lose role "r4". As these nodes are then without roles, they can be safely purged from the buffer by the garbage collector. The garbage collector monitors the buffer during query evaluation and removes those nodes that have lost all their roles (provided their descendants do not carry any roles anymore either).

By an interleaved interplay between document projection, query evaluation, and garbage collection, a streaming XQuery evaluation is achieved where buffers are continuously filled and purged. Our performance results show the effectiveness of this approach.

Project Members:


Last updated: October 06 2007 16:01:57.