CS 432 - Sample Questions
These are some sample questions. The exam may also include questions
where you will have to actually draw the changes in
data structures (like indexes, buffer pool, etc). There will (of course)
be fewer questions than this in the exam.
-
Storage Management
You are hired by the Mega Software Of French Tahiti corporation
as a storage management specialist. Since their
database software is already delayed by many months,
they badly need your expert advice on the following problems:
- Their algorithms involve many random I/Os for
individual tuples, and they
want to buy appropriate disks for better performance.
(a) Are disks that spin faster going to help them?
(b) Should they buy disks
with several small platters, or with a few large platters?
(c) Should they invest in an ultra-fast I/O bus?
Explain each answer with 1 sentence.
- They expect data tuples of
schema . What record structure
should they use? For the record <'a', ``catch'', 22>, how
many bytes will the record require with your design?
Write pseudo-code to extract the third attribute from
such a record.
- To improve the performance of their random I/Os
for individual tuples, the chief technical lead
has suggested the standard database technique
of doubling the page size. How do you think this
will affect performance? Why?
- They run queries which look for matches from
a relation based on equality of the variable_string field.
While they would like to use a B+-tree index, they have
been told that B+-trees can only be used on integer
attributes. What do you tell them?
-
Files
Consider a relation R with 10 attributes, some of variable size.
- Suggest a record structure for tuples of R.
- Suggest a page structure for pages of R.
- Instead of storing an entire tuple as a record ("horizontal partitioning"
of a relation), another alternative is store each column value for
all tuples contiguously ("vertical partitioning"). Why is vertical
partitioning usually a bad idea?
- The records on a page usually belong to the same relation. Why is it
a bad idea to store records of multiple relations on the same page?
- Find one counter example to each of the last two questions (i.e.,
one query for which vertical partitioning is good, and one for which
it is good to store multiple relations' records on the same page).
-
Tree Indexes
Consider a B+-tree index.
- Write pseudo code for the insertion algorithm for a leaf page,
involving splitting in case of overflow.
- Change the pseudo-code to consider redistribution with a neighbor.
- In what cases does each approach make sense: consider many vs few
concurrent users and the desire to minimize disk usage, etc.
- If the DBMS is planning to sort a relation, why might it want to
also build a B+-tree later on the same search/sort key? (i.e. do sorted
files give good search performance?). If the DBMS is planning to build
a B+-tree, why might it want to first sort the relation on the search key?
- How many I/Os are used in sorting a relation of 50 pages using 5 buffers?
Distinguish these are reads and writes, and identify the phases in which
they occur (make any reasnable assumption about ``chunking'' and state it).
-
Relational Query Languages
Consider the following schema used by the Campus Book Store:
- TOY(toyno, toyname, toytype, color, price)
- BOOK(bookno, bookname, booktype, price)
- MANUF(mname, address, phone)
- STOCK(itemno, mname,, quantity)
In the above, primary keys are in boldface. You can assume that each
STOCK.itemno value refers either to a toy or a book (but not to both).
Each manufacturer can make toys or books or both. Express the following
queries in the relational algebra, the relational calculus and in SQL:
- Print the phone numbers of manufacturers who make a toy of type
"bicycle" and a book of type "animal story"
- Print the toy types where every known manufacturer makes at least one toy of that type.
- Print the names of manufacturers who make both a toy of type "truck" and a book of type "bedtime story"
- Print the names of purple toys made by "Acme Toy Company" where the store has at least 100 of them in stock
- Print the phone numbers of the manufacturers who make the most expensive book known to the store.
-
Join Algorithms
Your manager being an old-timer likes
the sort-merge join and it is the standard join algorithm
being used, while you are convinced that
the hash join is cool stuff. Your company's million
dollar query involves the join of relations X and Y.
X is a static relation with 500,000 tuples
each of size 40 bytes. Y is a
relation that is slowly growing -- currently it is
at 1,000,000 tuples of 40 bytes each.
The page size in your system is 2K, and you have
enough main-memory for a buffer of size 400K (i.e. 200 pages).
Assume that pages are full of tuples -- i.e., they have
no free space.
- What category of join algorithms do both hash and
sort-merge fall into? What is the other broad category
of joins?
- Can you justify to your manager that you should be
adding hash join to the system based on the current
requirements? Why, or why not?
- If you could justify it, at what point, if ever, would
you change your mind as Y grows?
If you could not justify it, how much larger must Y grow
to until you can justify the use of hash join?
- The disk drive people convince the company to switch
to an 8K page size instead of 2K. Does this strengthen or
weaken your argument in favor of switching rightaway to
hash join?
- Your manager raises the option of firing you and using
your salary to instead buy more memory to stay with
sort-merge for the next year. How much more memory would she
need to buy to make this a viable option as long as Y has fewer than
2,000,000 tuples?
-
Query Optimization
Consider the System-R optimizer commonly implemented in commercial
database systems.
- What is the complexity of the optimization of a query involving
the join of N relations?
- A common heuristic is to avoid the enumeration of
cross-products. How is the effected? Show a query on which
this heuristic significantly reduces the optimization complexity.
- What statistics are needed to evaluate the cost of a
page-at-a-time NL join between Rel1 and Rel2 with condition
Rel1.X1 = Rel2.X2 ?
- What assumptions are used in selectivity (reduction factor)
estimation? Give examples of queries where the assumptions
break down.
- How does the available buffer size affect the result of
query optimization?
-
Transaction Processing
The airlines have been
plagued with overbookings, and so want a foolproof
system that keeps their bookings totally consistent.
You are hired as the system architect. You are told that
the only operations on the system are individual ticket
reservations (small and short transactions).
- What degree of consistency will you recommend?
What does it mean logically? If locking were used, what does this imply
with respect to when the locks are obtained/released?
- The previous architect (now fired) claims that the
old system did indeed have these properties -- yet the flights
still got mysteriously overbooked. Luckily, the airlines keep
diagnostic traces of all database actions in the actual time order
that they happened. How will you use these to make your case?
- Much impressed by your skills, the airlines now want
to run data mining queries over the same data concurrently with the ticket
reservations. What degree of concurrency will you recommend for these
queries? Why?
-
Crash Recovery
You are building an ACID database system, starting from
a simple design and gradually making it more complex.
Being well aware of current technology,
your final goal is to provide the full ARIES recovery
support.
-
You start with no log and no recovery mechanism. What two
policies will you use in your buffer manager to enforce
atomicity and durability? (hint : do not think of
replacement policies like LRU!)
- Why might this be inefficient?
(hint: there is one problem with throughput and one with
response time)
- To improve throughput, what change do
you make? What is the recovery strategy that must now
be implemented ?
- To improve response time, what change
do you make? What is the recovery strategy you must implement
now?
- You enhance the system with crash recovery
code. A system crash occurs and your code
has to be invoked upon restart.
What are its three stages, (in the correct order) and in
what direction does each read the log?
- How much UNDO activity must Aries do? What
should one do during normal execution to limit the amount of
UNDO work during restart?
- How much REDO activity must Aries do? What
should one do during normal execution to limit the amount of
REDO work during restart?
-
Short Takes
- How many primary indexes can I build on a relation with
three attributes? How many secondary indexes?
- There is a B+-tree unique index with a node fanout of
1000 and about 1000,000,000 entries. How many nodes need to
be read for a unique equality search?
- In buffer management, what is ``pinning'' and when is it needed?
- Using the familiar Sailors-Reserves-Boats example,
write a query either in SQL or in the relational algebra
to find the unique names of those sailors who
have reserved a green boat sometime and who have never
reserved a red boat.
- What are the relational algebra equivalences used in
pushing selections ahead of joins and projections? What
are used in join reordering?
- Why does every DBMS implement the nested-loops join
algorithm?
- How is the evaluation of correlated sub-queries
similar to a tuple-at-a-time NL join?
- How would you apply the ideas of page-at-a-time
NL join to the execution of correlated sub-queries?
- In query optimization, what are "access paths"? What
are "interesting orders"?
- What is a "left-deep" tree? What is the significance
of such a tree with respect to join pipelining?
- Can you suggest a query for which the best execution
plan is probably a bushy tree, not a left-deep tree?
- Why do query optimizers try to apply selections
early? Can projections be applied early too?
- Explain Wound-Wait and Wait-Die.
- Show an example each of unrepeatable read and
cascading abort.