Chapter 29

Object-Oriented Query Optimization

Úlfar Erlingsson


Disclaimer: This survey is not finished work, and its discussion may not even reflect my current beliefs. It has been put online as a service to the web community, and should be useful as at least the references are correct!

Back to main Table of Contents
  1. Introduction
  2. Background
    1. Problem Makers
      1. Additional Data Types
      2. Complex Objects
      3. Methods and Encapsulation
      4. Identity and Equivalence
    2. An Optimization Methodology
    3. Algebras and Calculi
    4. Algebraic Optimization
    5. Extensible Optimizers
  3. Current Status
  4. Future Directions
    1. Candidate Research Problems
    2. Restrictions and Prospects
  5. Annotated References


1. Introduction

One of the biggest problems in Object-Oriented Database Management Systems (OODBMS) is the optimization of declarative queries (see chapter 17 on query optimization, and chapter 28 on OODBMS query languages). There are several issues that are specific to the object-oriented data model which complicate the optimization of OODBMS queries. In this chapter we will give an overview of these issues, discuss the proposed methods that deal with them, and comment on past and present research. Finally we examine some of the possible future research topics in this area.


2. Background

As discussed in chapters 27 and 28 OODBMSs provide a much richer data model than conventional databases. This additional complexity of the data model greatly increases the difficulty in the optimizations of declarative queries. This was quickly realized once research began in declarative query languages for OODBMSs.

2.1. Problem Makers

Query optimization for relational databases were well understood at the time when OODBMS became popular in the mid-eighties. Researchers quickly pinpointed some aspects of OODBMSs as adding complexity to the problem of optimization. We address these aspects in the following subsections.
2.1.1. Additional Data Types
The user definition of new types and classes through inheritance can both assist and thwart optimization of queries. An example where it helps could be a query involving the intersection of Employees and Supervisors. If Employee is a superclass of Supervisor the optimizer can assume that Supervisors are a proper subset of Employees and simplify the join to the set of Supervisors.

An example where it deters optimization could involve the union of Students and Employees, with Person being a superclass of both. If we wanted to find all supervisors of students and employees we can not perform the union first and then apply a supervisor() to the result, since it may not be defined for Persons, the type of the union. We may be forced to perform the method on each set by itself, and then form the union of the results, which can obviously be very inefficient.

We can see that the union of Students and Employees is guaranteed to have supervisor() defined, and safely apply it to the union, but the requires the optimizer to provide powerful type-inference mechanisms.

2.1.2. Complex Objects
The path expressions and query closure of OODBMS query languages (see chapter 28) complicate the processing of queries in several ways. One of these difficulties is the building of indices for path expressions, especially in face of arbitrary methods in the path. This is in general a very hard problem, and near unsolvable if methods can have side-effects.

Another problem with path expressions is that they suggest an execution order of the path methods, which may well be a very inefficient order. As an example the path Orders.part.name, may well be best evaluated right-to-left if there are many orders but few parts. When methods are involved in queries this is further complicated since the optimizer may have no idea what the method execution times and return values are.

The closure properties of OODBMS query languages support the use of nested queries, and their use is often very desirable in order to concisely and express complex queries. Their use, however, greatly complicates the optimization process, turning it from a local problem to a global one, i.e. requiring global knowledge entire query expression.

2.1.3. Methods and Encapsulation
Optimizations in the face of methods and encapsulation is self-evidently a very dubious process, if there is no way of determining the cost of evaluating methods, and the size of their results. To reinvoke an example from last section, the expression july.specialorders().part.name may best be evaluated left-to-right or right-to-left depending on the size of the return value of specialorders().

A simple example where optimization is hopeless without knowing the method cost is the expression a.x() AND a.t == b.t. If we don't know the cost of evaluating method x() we cannot hope to know whether performing the join first will optimize the query or make it worse.

There are several proposals for the solution of this dilemma in the literature. Some systems allow the optimizer to break the encapsulation of methods and examine them to determine cost, which forces the methods to be written in a language understood by the optimizer. Other systems declare the cost of methods as a part of the definition. Yet other systems optimize queries under several assumptions, and then determine at query run-time what assumptions are most valid, based on current cost statistics, and execute the version optimized for those assumptions.

Optimizing with regard to methods is even further complicated when dealing with late binding of method calls. In that case it may not be know until at run-time which method will get called, and thereby what the method cost is. In this case the run-time selection of one of several pre-optimized queries, based on the actual method type, might be the best solution.

2.1.4. Identity and Equivalence
Query optimization is complicated by the notion of object identity and equivalence, as mentioned in section 2.3 of chapter 28. When predicating on equivalence it isn't always clear what semantics the equivalence has, and this the optimizer has to know in order to perform its task.

A major factor in query optimization is whether the query language allows the creation of new objects, and, if so, whether those objects have object identities (OIDs). This influences the query optimization in that it can introduce the need for duplicate elimination, deep equality comparisons where OID compares would otherwise be sufficient, or the need to deal with objects without OIDs. Another approach is to enforce a values versus objects dichotomy, as is discussed in chapter 28.

2.2. An Optimization Methodology

In a triad of papers, [Str91], [Özs94a] and [Özs94b], M.T. Özsu defines, with co-authors, the query processing methodology presented in figure 1, and discusses current query-optimization research in its light.

See reference for figure.
Figure 1:
Straube and Özsu's [Str91] Query Processing Methodology

In this methodology, the query-rewrite optimization is a high-level process where general-purpose heuristics drive the application of rewrite rules, and plan optimization is a lower level process which generates execution plans based on knowledge of relative costs, statistics and physical structure. A declarative query is thus optimized as follows:

Obviously the simplest implementation of this approach to query optimization would be to consider all equivalent algebra expressions from the next-to-last step, evaluate the cost of each of them in the plan generator, and choose the one with the least cost. However, this is unfeasible since it is requires the generation of a combinatorial number of alternatives. One solution here are to use heuristics, combined with memoization of optimal subexpressions, to generate a single top-level query expression as final input to the plan generator. Another solution is to combine the last two steps into one, which is the approach most often taken in cost-based optimization.

2.3. Algebras and Calculi

As discussed in chapter 28 there are several different formal query languages of algebras and calculi which have been proposed for OODBMSs, e.g. [Ala89], [Str91], [Ber92], [Fer93], [Leu93], [Fer95] and [Cer95]. These algebras and calculi differ in several respects in expressibility and support for optimizing rewrite rules. Most of these algebras are variable based, i.e. use variables for temporary results, whereas KOLA, the most recent algebra, is purely functional and variable free. In [che95] the author argues, quite successfully, that the KOLA algebra allows more powerful rule systems to be build, due to its variable freeness.

The lack of a standard algebra and calculi has hampered research into OODBMS query optimization, prohibiting generalized conclusions to be drawn from research results. Hopefully KOLA, or some other algebra, will soon remedy this like ODMG-93 OQL did for user-level query languages.

2.4. Algebraic Optimization

Algebraic optimization has the advantage that a query can be transformed using well-defined operators having properties such as transitivity and commutativity. If the algebra is equivalence-preserving, which it should be, each query has a large number of equivalent queries making up the optimization search space.

As mentioned before, simple enumeration and evaluation is usually unfeasible, because of the large size of the search space. Even so, dynamic programming methods, such as memoization and branch-and-bound can make this approach feasible. Methods using heuristics are another alternative, examples of which are randomized search, simulated annealing and hill climbing. These methods only examine a part of the search space, trying to evaluate those parts of it which look hopeful in one way or another. These methods therefore cannot guarantee an optimal solution, but, as discussed later, that may not always matter since the search is based on incomplete information anyway.

When evaluating positions in the search space, the optimizer uses a cost function, which is based on such things as cost in IO and CPU time. The cost of a query is defined by recursive addition of subquery cost. As discussed before the cost may not be available for OODBMS methods, and a critical issue is whether to allow the optimizer to break object encapsulation to gather this information. An alternative is to declare the cost of a method in its visible specification.

Compile time optimization is static, in that it cannot use information only available at run time, such as current statistics or the actual type of an polymorphic method. As discussed before one solution to this is to generate several optimized version of the query and select at run time the fastest for the current conditions. This however can involve very high overhead in run-time dispatch of queries, and may not be the best solution. Other alternatives include periodically reoptimizing queries with new statistics, perhaps based on the pace at which the relevant statistics change. Path expressions perhaps the most important aspect to inspect when trying achieve speedup by query optimization, involving the entire optimization process. Their optimization involves all the mechanisms described above, with semantic algebraic rewrite rules playing a large part. There has been considerable research into building path indices for these expressions, since indices are critical for the speed of real databases. The construction and maintenance of such indices is of course much complicated by the presence of arbitrary methods on the path.

2.5. Extensible Optimizers

Since OODBMSs have been, and still are, very much a moving target, with no standard data model, and only recently with a standard query language (the ODMG-93 OQL of chapter 28), any optimizer system needs to be able to change with time. Hence a lot of work in query optimization, e.g. [Fer93], has been in building ``Optimizer Generators'' which build optimizers based on information on the data model, query language, algebra, calculi and cost model. The systems going furthest in this direction define all aspects of the OODBMS and optimizer as objects and allow extensions through inheritance and method overloading.


3. Current Status

OODBMS query optimization is today very much an area of active research. Current research OODBMS, such as Thor, make use of cost-based algebraic optimization methods, using the latest and greatest object algebras, e.g. KOLA. As these research databases continue to be implemented and mature they will provide an excellent testing bed for the optimization techniques developed so far, especially when combined with the extensible optimizer generator platforms favored in current research.

The optimization of OODBMS query languages is inherently much more complicated than the optimization of conventional query languages. There are numerous issues involving types and method evaluation which are much harder to optimize due to the amount of inference the optimizer needs to make in order to be able to efficiently optimize the query.

Fortunately there has been tremendous work done on these issues already in the contexts of conventional programming languages, such as ML, and of general-purpose theorem proving and program verification. Researchers of OODBMS query optimization are beginning to acknowledge this, and when work such as [Mor95] begins to affect OODBMS query optimization to a larger extent we may expect to see some significantly improved results.


4. Future Directions

In this section we briefly look at some future research topics in OODBMS query optimization. This area is currently the scene for very active research, with numerous powerful research groups worldwide spending most their efforts on this problem. Due to the well defined nature of the problem, and the great light past research in programming-language and RDBMS-query optimization sheds on the problem, it is hard to find neglected topics within this area. Therefore we will in this section focus on the problems that are already recognized today and speculate on their future development and success.

OODBMS query optimization is still very much an academic research topic. The OODBMS vendors are currently struggling to flesh out their products into real DBMSs, i.e., still in the stage where they are adding good query capabilities, not where they are trying to optimize the queries. In the academic community, however, there is little interest in query languages, per se (see chapter 28), but a great interest in query optimization. Therefore we will look at future research in this area from an academic point of view.

4.1. Candidate Research Problems

As stated above the area of OODBMS query optimization is currently in very active academic research, and, given the extensive prior research in the field of optimization, it is not likely that there are ground-breaking paradigm-shifting major discoveries to be made in the area. In fact it seems probable, at least to an outsider, that all the major problems areas have already been identified and are being worked on today.

Some of these problem areas likely to be worked on in the present, and in the future, include the following:

  1. A Standard Extensible Optimizer Framework: Since much of the work in OODBMS query optimization involves continuous experimentation and construction of throwaway prototypes it would be highly beneficial if the community decided on a standard framework for building optimizers.
  2. Better Cost Models: There need to be better ways of judging the expected cost of evaluating an expression, as well as better ways of finding queries of minimal cost within the set of equivalent queries, e.g. using statistical methods. These problems relate to several areas in artificial intelligence where minimum-cost solutions are required, and synergy between fields may well be realized here.
  3. Better Algebras and Rewrite Systems: The most popular optimization methods today involve algebras and rewrite rule systems. The systems available today operate using heuristics which are know to be sub-optimal, and are of provably limited power, i.e., queries can be constructed which are very hard to optimize given the algebras. Needed improvements include better placement heuristics for predicates, e.g. joins, and a solid theoretical foundation for the possible expressibility (or computability) of such rewrite systems.
  4. Eliminating Method Evaluation: As a rule of thumb we would like to do without evaluating expensive methods, e.g. through selective short-circuiting of conditionals. The identification of which methods to eliminate and the directed rewriting of the query to try and achieve this is a worthy research problem. This might be said to fall under the hat of algebras and rule systems, but is specifically directed towards method evaluation and thus presented separately here.
  5. Indexing of Path Expressions: The evaluation of path expressions can be very time consuming and is a very good candidate for optimization. Applying indexing methods to these expressions seems likely for success, at least to some degree, and work on this issue should be done.
  6. Precomputation and Caching: This is a major candidate for massive speedup in query processing time. It is highly likely that the intermediate results of queries, subqueries and path expressions will be the same for many queries. The identification, precomputation and/or caching of these results, especially when combined with better indexing methods, would speed up queries by orders of magnitude. Here again statistical methods and methods from AI may apply.
  7. Heterogeneous Collections: Working with possibly heterogeneous results in queries is mostly an open research topic. Here methods from programming languages such as type inference should be applicable and may provide massive speedup by removing the need for the abstraction layers dealing with polymorphism and allowing static method resolution and optimization.
  8. Querying Un- and Semistructured Data: This is perhaps the most major research topic in this list. Data in OODBMSs may well consist of loosely structured data collected from such places as the World Wide Web. We would like to be able to query this data in ways differing from conventional queries, e.g. by complex queries involving keywords in preferred relationships to each other (and only approximate spelling) for full text searches. The optimization of these, and other queries on wildly heterogeneous data is a very hard problem, requiring methods very different from conventional join-type queries, but also one of the most important ones we face today, due to the success of the World Wide Web.

4.2. Restrictions and Prospects

There are few restrictions on the above list from the viewpoint of academic research. There is already a committed community with highly talented researchers currently working on these problems. There is strong motivation from the user community, and great need for faster query processing. All of the above candidates for research topics are likely to lead to interesting results, and are likely to be funded. The areas above are also likely to generate many papers, since their nature is incremental and rapid dissemination of both positive and negative results is encouraged in the community. Hence none of the candidates above can be eliminated as a future research topic.

It is hard to lay down a schedule for the research above. The results in the area tend to build upon prior work, with many attempts at the same goal usually taking place in parallel, often with close working knowledge of one another, proceeding partially by trial-and-error as various different methodologies are brought to bear on the problems. It is very likely that there will still be active research in this area in 5 to 10 years, just as there still is active research in RDBMSs query optimization, although the emphasis will probably have shifted to specific subproblems at that time.

It is hard to provide a scheme for evaluating the success of research into query optimization, the only real criterion being faster average execution time of queries. We can however say that for the last proposed research topic, candidate 8, that the research has been successful if powerful and complex queries can be evaluated quickly on large pools of unstructured data in five years in OODMBSs and on the World Wide Web.


4. Annotated References

Cherniack, M., `` Form(ers) over Function(s): The KOLA Query Algebra'', Technical Report, Brown University, December, 1995.

This paper describes KOLA, (arguably) the most advanced query algebra and rewrite rule system for OODBMSs, in a reference manual form. The algebras variable-free rule-based rewrite calculus is based on the author's experience with the AQUA algebra [Leu93]. This algebra is used in the MIT Thor research OODBMS.

Fegaras, L. and Maier, D., ``Towards an Effective Calculus for Object Query Languages'', ACM SIGMOD International Conference on Management of Data, San Jose, California, May 1995.

This paper describes a query language calculus based on monoid comprehensions, which claims to be more effective than other calculi, and describes how to map ODMG-93 OQL, [Cat96], into this calculus.

Fegaras, L., ``Efficient Optimization of Iterative Queries'', Proceedings of the Fourth International Workshop on Database Programming Languages, Manhattan, NY, August 1993.

This paper describes a object-oriented query algebra based on fold iterations, and supplies a normalization algorithm for the algebra reducing queries into canonical forms.

Fegaras, L., Maier, D. and Sheard, T., ``Specifying Rule-based Query Optimizers in a Reflective Framework'', Proceedings of the 3rd International Conference on Deductive and Object-Oriented Databases, Phoenix, Arizona, December 1993.

This paper describes a rule-based query optimizer framework based on syntax-directed query annotation and rewriting. The paper's contribution is directed at the last two steps of figure 1, algebra optimization and plan generation.

Leung, T.W., Mitchell, G., Subramanian, B., Vance, B., Vandenberg S.L. and Zdonik, S.B., `` The Aqua Data Model And Algebra'', Technical Report CS-93-09, Brown University, March, 1993.

This paper describes AQUA, an object-oriented query algebra that served as the precursor to the KOLA algebra, [Che95]. This algebra has a rule-based calculus, like KOLA, but is not variable-free.

Mitchell, G., Zdonik, S.B. and Dayal, U., `` Object-Oriented Query Optimization: What's the Problem?'', Technical Report CS-91-41, Brown University, June 1991.

This paper is an excellent early summary of the issues in optimizing object-oriented query languages. The paper does not focus on any particular solution or methodology for optimization, but rather tries to give a clear statement of the problems involved.

Morrisett, G., `` Compiling with Types'', Technical Report CMU-CS-95-226, Carnegie Mellon University, December 1995.

This thesis advocates the use of a strongly typed intermediate language for the compilation of programming languages. Type-directed compilation with intermediate language dynamic-type dispatch, as discussed in the thesis, were used to develop an ML compiler which generated code twice as fast as that of the excellent SML/NJ compiler.

Özsu, M.T. and Blakeley, J.A., `` Query Processing in Object-Oriented Database Systems'', Technical Report, Computer Science Department, University of Alberta, Canada, 1994.

Özsu, M.T., Straube, D.D. and Peters, R., `` Query Processing Issues in Object-Oriented Knowledge Base Systems'', Technical Report, Computer Science Department, University of Alberta, Canada, 1994.

Straube, D.D. and Özsu, M.T., `` Execution Plan Generation for an Object-Oriented Data Model'', Proceedings of the 2nd International Conference on Deductive and Object-Oriented Databases, pp. 43, Munich, Germany, December 1991.

The three papers above together form a clear overview and summary of the issues and research in OODBMS query optimization. The last paper is the earliest one, and presents the query processing methodology of figure 1, while the first two present the state of 1994 query optimization research in great detail, focusing on cost-based methods.

Su, S.Y.W., Guo, M. and Lam, H., ``Association Algebra: A Mathematical Foundation for Object-Oriented Databases'', IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 5, pp. 775, October 1993.

This paper gives a very complete mathematical specification of a query algebra for the OQL of [Ala89], with proofs of completeness.