Statement of Present and Future Research Interests
Zhiyuan Chen

Current Work

My research interests lie in the area of database systems. In my dissertation research, I have explored new directions for improving the performance of database systems given current trends in hardware development and the requirements of new applications. I have addressed the following three issues: database compression, result size estimation for queries on string and hierarchical data (e.g., XML and LDAP), and automatic physical database design.

Database Compression

Over the last decades, improvements in CPU speed have outpaced improvements in main memory and disk access rates by orders of magnitude, enabling the use of data compression techniques to trade reduced I/O and storage space against additional CPU overhead for compressing and decompressing data. The goal of my research is to build compressed database systems that can store and query compressed data and achieve better query performance than conventional database systems. Considering today's storage prices, I mainly target improvement of query performance rather than savings in storage space, which is the goal of traditional data compression. The main technical challenge is therefore to balance the benefits (I/O savings) and the overheads of compression. I addressed the following two issues:

* Selection of appropriate compression methods. Since my goal is improved performance rather than saved space, several novel issues arise in choosing suitable compression methods. First, compression and decompression are CPU-intensive operations, and their cost matters when choosing compression methods. Second, database applications often need random access to data in small pieces (tuples, attributes, etc.). This makes some popular compression methods such as Lempel-Ziv inappropriate, because they must decompress a large chunk of data to access a small piece of it.
Third, data stored in databases is well structured, and values of the same attribute usually share many similarities. I have therefore proposed strategies that intelligently select the most appropriate combination of compression methods to balance the benefits and overheads of compression.

* Query optimization in compressed database systems. Query optimization is crucial in compressed database systems because, during query execution, data can either be eagerly decompressed when it is read into main memory, or lazily stay compressed in main memory and be decompressed on demand; in some cases one strategy leads to orders of magnitude better performance than the other. I have addressed this compression-aware query optimization problem and proposed two algorithms: one is provably optimal, and the other is very fast and almost always generates optimal execution strategies.

All of the above work has been implemented in a full-fledged compressed database system, which achieves up to an order of magnitude performance speedup over conventional database systems.

Result Size Estimation for Queries on String and Hierarchical Data

String and hierarchical data such as XML and LDAP have become ubiquitous. Estimating the number of data items (tuples, documents, etc.) matching a query on string and hierarchical data is useful for optimizing the execution of the query and for providing quick feedback for query refinement. However, traditional estimation techniques such as histograms are not sufficient, for two reasons: (1) they only work for numerical data, and (2) they can only estimate result sizes of simple range queries, while we are interested in estimating result sizes of complex queries with multiple conditions (e.g., find web pages containing ''CS Department'' and ''Opening''). I proposed a solution that uses a compact summary structure to store the result sizes of pieces of queries that are frequent in the data.
This approach relies on two techniques: (1) a Set Hashing technique used in the summary structure to succinctly capture correlations in the data, and (2) algorithms based on conditional probability to reconstruct the result size of the original query from the result sizes of pieces of that query stored in the summary structure. Experimental results demonstrate that our approach achieves accurate and robust estimates (typically within a factor of 0.5 to 2 of the real value) using limited space (typically less than 1% of the data size).

Automatic Physical Database Design

The automation of physical database design, such as selecting indexes and materialized views, has become increasingly important because physical design is not only crucial for database performance, but manual physical design is also very costly (e.g., the salary of a database administrator can often buy a fleet of decent computers). Physical database design typically involves hard optimization problems, and current approaches rely on often ad hoc heuristics to find good solutions. I proposed a new approach that starts with an empirical study of the structure of the search space for a particular physical design problem to identify the generic ''shape'' of that space. Insights drawn from the shape of the search space then allow us to design much improved algorithms for finding good solutions. I have applied this approach to two physical design problems: (1) selecting indexes, and (2) deciding which attributes to compress in a compressed database system, given a query workload. As a result, I have proposed index selection and compressed database design algorithms that achieve several orders of magnitude speedup over existing algorithms while producing results of the same good quality.
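As a toy illustration of the conditional-probability reconstruction idea from the estimation work above: if the summary structure stores result sizes for small pieces of a query, the size of a larger conjunctive query can be estimated with a chain-rule approximation. The counts, predicate names, and the particular independence assumption below are illustrative assumptions, not the actual algorithm.

```python
# Toy sketch: estimate the result size of a conjunctive query from
# stored result sizes of smaller query pieces, via conditional
# probability. All counts and predicate names are made up.

# Result sizes kept in a (hypothetical) summary structure:
single = {"A": 400, "B": 200, "C": 100}   # |A|, |B|, |C|
pair = {("A", "B"): 80, ("B", "C"): 50}   # |A and B|, |B and C|

def estimate_triple(a, b, c):
    """Estimate |a AND b AND c| as |a and b| * P(c | b), i.e.
    assuming c is conditionally independent of a given b."""
    return pair[(a, b)] * pair[(b, c)] / single[b]

print(estimate_triple("A", "B", "C"))  # 80 * 50 / 200 = 20.0
```

The same idea extends to longer conjunctions by chaining further conditional factors, at the cost of compounding the independence assumption.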
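To make the index selection search space concrete, here is a generic greedy baseline of the kind my improved algorithms compete against: pick candidate indexes by estimated benefit per unit of storage until a space budget is exhausted. The candidates, cost numbers, and budget are hypothetical; this is a standard heuristic sketch, not the algorithms described above.

```python
# Toy sketch of index selection as a search problem: greedily choose
# candidate indexes with the best estimated workload-cost reduction
# per unit of storage, subject to a storage budget. All numbers are
# made up; this is a generic greedy baseline, not my algorithms.

# candidate index -> (estimated workload cost reduction, storage size)
candidates = {"idx_a": (50, 30), "idx_b": (40, 10), "idx_c": (30, 25)}

def greedy_select(candidates, budget):
    chosen, used = [], 0
    remaining = dict(candidates)
    while remaining:
        # pick the index with the best benefit/storage ratio
        name = max(remaining, key=lambda n: remaining[n][0] / remaining[n][1])
        benefit, size = remaining.pop(name)
        if used + size <= budget:
            chosen.append(name)
            used += size
    return chosen

print(greedy_select(candidates, budget=40))  # ['idx_b', 'idx_a']
```

Such heuristics ignore the global shape of the search space, which is exactly what the empirical-study approach above exploits to do better.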
Future Work

After my graduation, I would like to continue working on three topics related to my current research: (1) efficient handling of new types of data in emerging applications, (2) building the next generation of automatic physical database design tools, and (3) improving the performance of main memory database systems.

* Efficient Handling of New Types of Data in Emerging Applications. Conventional database systems are good at handling numerical data, but not at handling the new types of data (e.g., documents, hierarchical data such as XML, and biological data) found in emerging applications. Several challenging research issues need to be addressed. First, although redundant structures such as indexes and materialized views have been widely used in conventional database systems, little has been done on redundant structures to speed up queries in these new applications. Such redundant structures are not only crucial for performance, but also raise novel research issues because of the complexity of the new types of data compared to structured numerical data. Second, these new applications often require new types of operations and hence new processing techniques. For example, data cleaning and integration are two important operations in E-Commerce, and it remains open how to execute them efficiently and effectively. Third, given the vast volume of data available (often generated by programs), it is frequently sufficient to get quick, approximate answers to a query. However, little has been done on approximate query answering for new types of data. I believe progress on these three issues would make valuable contributions to both the research community and industry.

* Building the Next Generation of Automatic Physical Database Design Tools. Current automatic physical database design tools have several shortcomings. First, although query workloads typically change over time, these tools design physical databases in a static way.
Hence, I would like to design tools that can monitor the performance of database systems and dynamically suggest necessary design changes. Second, although different design options often interact, current design tools usually decide them individually. For example, the decisions on how to partition the data and how to index it are strongly correlated, yet current tools make them separately. Hence, I plan to design algorithms that consider physical design options in a joint and efficient way.

* Improving the Performance of Main Memory Database Systems. Over the last decades, the price of memory has dropped quickly, and today many applications can fit their databases in main memory. However, improvements in CPU speed have outpaced improvements in memory access speed by orders of magnitude over the same period. Hence, it is crucial to optimize cache performance for main memory database systems, and many research problems remain open. For example, most existing research deals with numerical data, and it is unclear how to improve cache performance for operations on string data. Also, compression is potentially useful to trade improved cache performance against the additional CPU overhead of compressing and decompressing the data. However, the gap between memory access speed and CPU speed is much smaller than the gap between disk access speed and CPU speed, and hence we may need compression methods and query optimization techniques different from those I proposed for disk-based systems.

I am very excited about these three research directions because they have both high potential impact and challenging technical and practical issues. For example, building new physical design tools has immediate commercial impact and involves solving challenging optimization problems.