Database Management Systems

Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information? -- T.S. Elliot

Logistics

Overview

Large datasets such as historical snapshots of the World Wide Web stored by the Internet Archive, or the data hosted by amazon.com ("earth's biggest selection") bring about interesting data management challenges that lie at the intersection of systems and algorithmic questions for managing and mining large datasets.

The course has two parts. The first part covers a basic background on system abstractions and algorithmic tools for managing large datasets. The second part covers current research topics in this area such as the management of structured and unstructured data, search, and data mining.

The course prerequisites include basic backgrounds in algorithms and probability (such as CS482) and in systems (such as CS414). The main work for the course consists of a paper summary for each class, the presentation of one paper in the course, and a course project.

In the project, you are expected to try to do original research. The project encompasses the following steps:

Course Outline (Draft)

Note: Write your summary about the paper marked with a (*).

Part I: The Classics

Programming abstractions and their implementations (1/31 and 2/2)

The relational model

XML

MapReduce

Auxiliary data structures (2/7)

Data reduction (2/9)

Dealing with performance mismatches and different access characteristics (2/14, 2/16, and 2/21)

2/14

Answer the following question about the Bohannon et al. paper: Given that Dali is a main-memory database system, some design decisions are strongly influenced by this constraint. Describe the one design decision that in your opinion is the most non-obvious, but still has a lot of impact on the overall system performance.

2/16

2/21

Data reduction (Continued) (2/23)

Concurrency (2/28 and 3/2)

2/28

3/2

Handling Failure (3/7 and 3/9)

3/7, 3/9

Distributed Transaction Management and Replication (3/14)

Please do not write any summaries for this class. Instead, please answer the following questions about the Mohan et al paper:

  1. The paper states that, in order to maintain atomicity, each subordinate S must make sure (by forcing log records) never to ask the coordinator about a decision that S has already ACKed. For each protocol (2P, PA, PC), give a scenario where failure to follow this principle breaks atomicity. (Assume that the coordinator and all other subordinates follow the protocol strictly.)
  2. In Presumed Commit (PC), the root’s commit record is forced even though all other nodes do not force their commit records. Why?
Benchmarks (3/16)

Please do not write any summaries for this class. Instead, please outline the design of a performance benchmark that you could use to benchmark Intranet (not Internet) search engines (for an example product in this space, look here: http://www.google.com/enterprise/).

Part II: The Nouveaux

Decision Support (3/27 and 3/29)

Data Mining

Association Rules (4/4)

Privacy-preserving association rule mining (4/6)

Frequent Structure Discovery (4/11 and 4/13)

Questions (due 4/11, 6am via email to Johannes):

  1. We read the DualMiner paper last week, where the idea was to incorporate constraints into itemset mining. Can you think of at least two interesting constraints that you would want to incorporate into either frequent tree mining or frequent graph mining? Please motivate your constraints by application examples.
  2. Given your constraints, please outline a high-level idea how you could modify either the TreeMiner Algorithm or CloseGraph (or another algorithm) to incorporate your constraints.
  3. Describe one other research direction that you can think of after having read these two papers.

Managing semi-structured and unstructured data

Native vs. Relational XML Stores (4/18)

Internet Query Engines (4/20)

Continuous Query Systems (4/25)

Questions about the paper (due 4/25, 6am via email to Johannes)

  1. The paper mentioned that XML filtering was the inverse problem of traditional query processing. Motivated by that assumption, they took the idea of indexing over the data to come up with indexing over queries. Can you think of another idea that can be borrowed from traditional query processing of databases, which can be applied to XML filtering?
  2. In Section 5 of the paper, there is a discussion of how to handle value-based predicate evaluation. Two methods are described - "Inline" and "Selection Posponed (SP)". Discuss one main problem associated with each method. Can you think of a unified way of addressing both of those issues?

Optional reading:

Text Search Over XML (4/27)

Questions about the papers (due 4/27, 6am via email to Johannes)

  1. Kaushik et al. paper: The paper shows a simple way to link two data structures to address the hybrid nature of simple XML path queries with keywords. Can you think of any other examples of a hybrid kind of query that can be answered by combining two data structures?
  2. Kaushik et al. paper: The power of a structural index is it roughly captures a sort of schema for the current instance of data even with all the irregularities that may be present. A simple version is shown in Figure 2 of the paper. How could you drive the construction of a good structural index if you had the XML schema, which explicitly states which elements/attributes are required and which ones are optional (which helps you decide what queries can be covered and therefore when inverted lists can be used)? Is there a better way to solve the problem if you have the schema?
  3. Amer-Yahia et al. paper: Section 3.4 discusses other classes of relaxations such as replacing a type with a supertype, relaxing value-based predicates like $i.price < 98 to $i.price < 100, or relaxing the contains predicate via standard IR methods. Including the structural relaxations described in this paper, which of these relaxations do you think are really the most useful to the end user? Is it desirable to incorporate several classes of relaxations at once to get a single ranking?
  4. Amer-Yahia et al. paper: In Example 1 (after Figure 6), the calculated structural score for query Q1 is shown to be 3. However, the listed predicate penalty for query Q5 consists of 4 terms, each of which is in [0,1]. Could the resulting structural score of Q5 then be negative? If so, does this have any negative implications?

Optional reading:

Potpourri

Parallel Database Systems (5/2)

Questions about the papers (due 5/2, 6am, via email to Johannes):

The Gamma paper:

The Alphasort paper:

Data Stream Systems (5/4)