The GCX in-memory XQuery engine is the first streaming XQuery engine which implements active garbage collection, a novel buffer management strategy in which both static and dynamic analysis are exploited. This technique actively purges main memory buffers at runtime based on the current status of query evaluation. In our paper, we show the various stages in evaluating composition-free XQuery with GCX. Our technique has a significant impact on reducing both main memory consumption and query execution time, as can be seen in our experiments.
Garbage collection is a well-understood technique for
automatic memory management in programming languages. The basic
principle of any garbage collector is to determine which data
objects in a program will not be accessed in the future, and
consequently, to reclaim the storage used by these objects.
A simple yet effective garbage collection strategy is reference
counting where every object counts the number of references to
it. When a reference is created to an object, its reference count is
incremented. Likewise, the reference count is decremented when a
reference is removed. Once the count reaches zero, the object is
deleted and its memory is reclaimed. A major advantage of this
approach is that the memory overhead is small.
The approach underlying active garbage collection for XQuery
engines is strongly related insofar
as each single node in the buffer keeps track whether it is still
relevant to the remaining XQuery evaluation. Instead of counting
references, we employ the concept of roles which are
assigned to nodes. Intuitively, a role serves as a metaphor for the
future relevance of a given node.
Roles are statically derived from the path expressions in the query.
While reading the input stream, the XML nodes are matched against
the set of possible roles. A node can be assigned several roles when
it is used in the query in several different contexts. Moreover, a
role can be assigned to a node several times; this can happen if
queries involve XPath expressions with descendant-axes.
We demonstrate the two key principles in the GCX query engine, namely document projection and active garbage collection by an example. Consider the following query:
<x>
{for $book in //book
return ($book/author, $book/editor)}
</x>
In static analysis, we derive a projection tree where each node in this tree is associated with a "role". Intuitively,
The projection tree for our example is shown below. All nodes matched by the path expressions //book are query-relevant. They will be assigned role "r1" and stored in the main memory buffer. However, not their complete subtrees are query-relevant - only the author and editor children of a book node are buffered (with roles "r2" and "r3" respectively), and so are the nodes in their subtrees (with roles "r4" and "r5" respectively), because these nodes will have to be output. For instance, the XML document tree shown below is projected into the buffer as the tree shown on the right. The buffered nodes carry the roles that they have been assigned. Note that inner nodes, such as the "publications" node, can also be projected away.
While the idea of XML document projection has been proposed before (e.g. Marian and Simeon in their VLDB 2003 paper), our novel concept of assigning roles to buffered nodes allows us to dynamically purge nodes from the main memory buffer. At runtime, roles are removed from the buffered nodes as the query is evaluated. We say that roles are "signed off". The query evaluator signals this to the buffer by issuing "signOff"-statements. To this aim, the query is statically rewritten as follows:
<x>
{for $book in //book
return
(for $a in $book/author
return ($a, signOff($a,r2), signOff($a/descendant-or-self::node(), r4)),
for $e in $book/editor
return ($e, signOff($e, r3), signOff($e/descendant-or-self::node(), r5),
signOff($book, r1))}
</x>
For instance, after the author node (bound by variable $a) has been output, it loses
role "r2". Moreover, the author node and all of its descendants lose role "r4". As
these nodes are then without roles, they can be safely purged from the buffer by the garbage collector.
The garbage collector monitors the buffer during query evaluation and removes those nodes that
have lost all their roles (provided their descendants do not carry any roles anymore either).
By an interleaved interplay between document projection, query evaluation, and garbage collection, a streaming XQuery evaluation is achieved where buffers are continuously filled and purged. Our performance results show the effectiveness of this approach.