CS 432 - Sample Questions

These are some sample questions. The exam may also include questions where you will have to actually draw the changes in data structures (like indexes, buffer pool, etc). There will (of course) be fewer questions than this in the exam.

Storage Management
You are hired by the Mega Software Of French Tahiti corporation as a storage management specialist. Since their database software is already delayed by many months, they badly need your expert advice on the following problems:
- Their algorithms involve many random I/Os for individual tuples, and they want to buy appropriate disks for better performance. (a) Are disks that spin faster going to help them? (b) Should they buy disks with several small platters, or with a few large platters? (c) Should they invest in an ultra-fast I/O bus? Explain each answer with 1 sentence.
- They expect data tuples of schema . What record structure should they use? For the record <'a', ``catch'', 22>, how many bytes will the record require with your design? Write pseudo-code to extract the third attribute from such a record.
- To improve the performance of their random I/Os for individual tuples, the chief technical lead has suggested the standard database technique of doubling the page size. How do you think this will affect performance? Why?
- They run queries which look for matches from a relation based on equality of the variable_string field. While they would like to use a B+-tree index, they have been told that B+-trees can only be used on integer attributes. What do you tell them?
Files
Consider a relation R with 10 attributes, some of variable size.
- Suggest a record structure for tuples of R.
- Suggest a page structure for pages of R.
- Instead of storing an entire tuple as a record ("horizontal partitioning" of a relation), another alternative is store each column value for all tuples contiguously ("vertical partitioning"). Why is vertical partitioning usually a bad idea?
- The records on a page usually belong to the same relation. Why is it a bad idea to store records of multiple relations on the same page?
- Find one counter example to each of the last two questions (i.e., one query for which vertical partitioning is good, and one for which it is good to store multiple relations' records on the same page).
Tree Indexes
Consider a B+-tree index.
- Write pseudo code for the insertion algorithm for a leaf page, involving splitting in case of overflow.
- Change the pseudo-code to consider redistribution with a neighbor.
- In what cases does each approach make sense: consider many vs few concurrent users and the desire to minimize disk usage, etc.
- If the DBMS is planning to sort a relation, why might it want to also build a B+-tree later on the same search/sort key? (i.e. do sorted files give good search performance?). If the DBMS is planning to build a B+-tree, why might it want to first sort the relation on the search key?
- How many I/Os are used in sorting a relation of 50 pages using 5 buffers? Distinguish these are reads and writes, and identify the phases in which they occur (make any reasnable assumption about ``chunking'' and state it).
Relational Query Languages
Consider the following schema used by the Campus Book Store:
- TOY(toyno, toyname, toytype, color, price)
- BOOK(bookno, bookname, booktype, price)
- MANUF(mname, address, phone)
- STOCK(itemno, mname,, quantity)
In the above, primary keys are in boldface. You can assume that each STOCK.itemno value refers either to a toy or a book (but not to both). Each manufacturer can make toys or books or both. Express the following queries in the relational algebra, the relational calculus and in SQL:
- Print the phone numbers of manufacturers who make a toy of type "bicycle" and a book of type "animal story"
- Print the toy types where every known manufacturer makes at least one toy of that type.
- Print the names of manufacturers who make both a toy of type "truck" and a book of type "bedtime story"
- Print the names of purple toys made by "Acme Toy Company" where the store has at least 100 of them in stock
- Print the phone numbers of the manufacturers who make the most expensive book known to the store.
Join Algorithms
Your manager being an old-timer likes the sort-merge join and it is the standard join algorithm being used, while you are convinced that the hash join is cool stuff. Your company's million dollar query involves the join of relations X and Y. X is a static relation with 500,000 tuples each of size 40 bytes. Y is a relation that is slowly growing -- currently it is at 1,000,000 tuples of 40 bytes each. The page size in your system is 2K, and you have enough main-memory for a buffer of size 400K (i.e. 200 pages). Assume that pages are full of tuples -- i.e., they have no free space.
- What category of join algorithms do both hash and sort-merge fall into? What is the other broad category of joins?
- Can you justify to your manager that you should be adding hash join to the system based on the current requirements? Why, or why not?
- If you could justify it, at what point, if ever, would you change your mind as Y grows? If you could not justify it, how much larger must Y grow to until you can justify the use of hash join?
- The disk drive people convince the company to switch to an 8K page size instead of 2K. Does this strengthen or weaken your argument in favor of switching rightaway to hash join?
- Your manager raises the option of firing you and using your salary to instead buy more memory to stay with sort-merge for the next year. How much more memory would she need to buy to make this a viable option as long as Y has fewer than 2,000,000 tuples?
Query Optimization
Consider the System-R optimizer commonly implemented in commercial database systems.
- What is the complexity of the optimization of a query involving the join of N relations?
- A common heuristic is to avoid the enumeration of cross-products. How is the effected? Show a query on which this heuristic significantly reduces the optimization complexity.
- What statistics are needed to evaluate the cost of a page-at-a-time NL join between Rel1 and Rel2 with condition Rel1.X1 = Rel2.X2 ?
- What assumptions are used in selectivity (reduction factor) estimation? Give examples of queries where the assumptions break down.
- How does the available buffer size affect the result of query optimization?
Transaction Processing
The airlines have been plagued with overbookings, and so want a foolproof system that keeps their bookings totally consistent. You are hired as the system architect. You are told that the only operations on the system are individual ticket reservations (small and short transactions).
- What degree of consistency will you recommend? What does it mean logically? If locking were used, what does this imply with respect to when the locks are obtained/released?
- The previous architect (now fired) claims that the old system did indeed have these properties -- yet the flights still got mysteriously overbooked. Luckily, the airlines keep diagnostic traces of all database actions in the actual time order that they happened. How will you use these to make your case?
- Much impressed by your skills, the airlines now want to run data mining queries over the same data concurrently with the ticket reservations. What degree of concurrency will you recommend for these queries? Why?
Crash Recovery
You are building an ACID database system, starting from a simple design and gradually making it more complex. Being well aware of current technology, your final goal is to provide the full ARIES recovery support.
- You start with no log and no recovery mechanism. What two policies will you use in your buffer manager to enforce atomicity and durability? (hint : do not think of replacement policies like LRU!)
- Why might this be inefficient? (hint: there is one problem with throughput and one with response time)
- To improve throughput, what change do you make? What is the recovery strategy that must now be implemented ?
- To improve response time, what change do you make? What is the recovery strategy you must implement now?
- You enhance the system with crash recovery code. A system crash occurs and your code has to be invoked upon restart. What are its three stages, (in the correct order) and in what direction does each read the log?
- How much UNDO activity must Aries do? What should one do during normal execution to limit the amount of UNDO work during restart?
- How much REDO activity must Aries do? What should one do during normal execution to limit the amount of REDO work during restart?
Short Takes
- How many primary indexes can I build on a relation with three attributes? How many secondary indexes?
- There is a B+-tree unique index with a node fanout of 1000 and about 1000,000,000 entries. How many nodes need to be read for a unique equality search?
- In buffer management, what is ``pinning'' and when is it needed?
- Using the familiar Sailors-Reserves-Boats example, write a query either in SQL or in the relational algebra to find the unique names of those sailors who have reserved a green boat sometime and who have never reserved a red boat.
- What are the relational algebra equivalences used in pushing selections ahead of joins and projections? What are used in join reordering?
- Why does every DBMS implement the nested-loops join algorithm?
- How is the evaluation of correlated sub-queries similar to a tuple-at-a-time NL join?
- How would you apply the ideas of page-at-a-time NL join to the execution of correlated sub-queries?
- In query optimization, what are "access paths"? What are "interesting orders"?
- What is a "left-deep" tree? What is the significance of such a tree with respect to join pipelining?
- Can you suggest a query for which the best execution plan is probably a bushy tree, not a left-deep tree?
- Why do query optimizers try to apply selections early? Can projections be applied early too?
- Explain Wound-Wait and Wait-Die.
- Show an example each of unrepeatable read and cascading abort.

CS 432 - Sample Questions

Storage Management

Files

Tree Indexes

Relational Query Languages

Join Algorithms

Query Optimization

Transaction Processing

Crash Recovery

Short Takes