CS432
Project 3
Joins
Deadline : Nov 5, 1997
Introduction
In this assignment you are required to implement 4 different join algorithms
and evaluate their performance.
Background
A relation is implemented as a heap file in Minibase. A heap file is a
collection of records. Records can be inserted or deleted from a heap file.
Each record is uniquely identified by a record id. A scan is an interface
used to access the records in a heap file. By using scan, we can access
the content of the heap file, record by record.
An index provides fast access to records in the heap file. Currently
Minibase has only B+ - tree index. Each entry in the index is a (key, record
id) pair. Entries can be inserted or deleted from an index. An index search
scan provides interface for accessing the records in the index. We can
specify the range to be scanned by specifying the low key and high key
parameter.
Four classes are provided for you : (i.e. you only need to call methods
in these classses)
-
HeapFile which implements a heap file
-
Scan which is a scan interface to HeapFile
-
BTreeFile which implements a B+-tree index
-
BTreeFileScan which implements the scan
interface to a B+-tree.
See the Reference section for the details of these classes and examples
on how to use them.
You will also need the globaldef library and the buffer manager that
you used and implemented in assignment 1.
Requirement
Statistic Collection
Augment the buffer manager class you have written during the first assignment
to collect statistics. In particular, you should be able to tell how many
PinPage requests have been made and how many result in a miss. We suggest
you write two functions : one to reset the statistics and one to report
the statistics.
You can use the function clock() to find out the running time of your
join algorithm. Clock() returns the current time. Remember to include time.h
when you use this function. If you have other questions refer to the C++
help.
You should run your algorithm a few times (the more the better) and
report the average running time. Avoid running other CPU intensive stuff
(WinWord, Netscape etc.) while collecting statistics since clock() reports
the actual time rather than CPU time.
You should print out the average running time and the number of misses
for every join algorithm you implemented.
Tuple-At-A-Time Nested Loop Join
You should start by implementing this since this is the simplest algorithm.
Block Nested Loop Join
This can be found in page 294 in the textbook. As Minibase does not provide
page-by-page access into a relation stored in a heap file, we will simulate
a "block" with an array inside the memory which stores records. The join
function will take in an integer B which is the "block" size (size of the
array) of outer relation. You don't have to implement the double buffering
techniques or the hash table.
The pseudocode for this join is as follows :
For each block b in R
For each tuple s in S
For each tuple r in b
Match r with s
if Match then
Insert (r,s) into the result relation
Compare the performance of Block Nested Loop Join when you change the block
size.
Index Nested Loop Join
This can be found in page 296 of the textbook. You should first build a
unclustered B+-Tree index on the inner relation.
Find out if it is worth building the B+-Tree index for the purpose of
performing a single join operation, by comparing the cost of this join
with other joins.
Sort Merge Join
This can be found in pg 298 of the textbook. You should first sort both
relations using the SortFile function provided. You don't have
to implement the refinement discussed in page 301 and 302 of the textbook.
Find out if it is worth sorting the file just for the purpose of performing
a single join operation, by comparing the cost of this join with other
joins.
Performance Comparison
-
Compare the performance of the four algorithms (use the best Block Nested
Loop Join algorithm to compare with others). Collect the time taken for
each algorithm to run, and their respective number of buffer pool misses.
-
Study the effects of all 4 join algorithms when you change the size of
buffer pool(which is defined in main.cpp as NUM_BUF_PAGES).
-
Submit your documentation tables and graphs of these statistics, and analyze
the result.
Simplification
-
You can assume that all records are fixed length.
-
You can assume that we only join on integer fields.
Source Code Provided
The source code for the project is located in the directory \\goose\courses\cs432\proj4.
The directory contains:
-
join.h/join.cpp - utilities function that are useful for writing join algorithms.
-
relation.cpp/relation.h- functions that create the test relations.
-
(join method).cpp - each file cantains a skeleton for implementing a particular
join method.
-
main.cpp - main program.
-
spacemgr - a subdirectory containing source codes for space manager. (HeapFile
class, DB class etc.)
-
btree - a subdirectory containing source codes for B+-Tree index.
-
bufmgr - a subdirectory containing source codes for buffer manager.
-
include - a subdirectory containing all .h files needed. If you use
your own B+ tree or buffer manager code, you need overwrite the .h files
in /include directory with your own .h files.
You should write your source code in (join method).cpp and main.cpp. Look
through the code provided to understand how the test program works. The
functions in join.cpp and relation.cpp should be useful when writing your
join methods and debugging. The function SortFile is particularly
useful as an example on how to use HeapFile, Scan, BTreeFile and BTreeFileScan.
Extra Credit
-
(5%) Compare the performance of Block Nested Loop Join with an in-memory
hash tables, versus no hash tables. Build a hash table whose entry consists
of (key, array index in block) for a block of records from R. Then for
each record in S use the hash table to find the matching record.
-
(5%) Compare the performance of Index Nested Loop Join with clusted index
versus unclustered index. (To create a clustered index, sort the file using
SortFile ordered by the join attribute, then create new B+-Tree
using the join attribute as the key.) Note: don't include the time taken
to sort the file ("cluster the index") in your calculation for the total
time taken to join the relations.
-
(5%) Use your own B+ Tree (You don't have to use your own buffmgr).
Hints
-
Study Scan class first. Learn how to access records in heapfile. Then study
SortFile function in join.cpp.
-
Use smaller number of records (by changing the constants in join.h) for
testing.
-
Each join algorithm should be around 100 lines of codes.
-
Since this time we are not building a library for Minibase, you are free
to change interface of the join functions given.
-
You are also free to modify anything in the main function, but please do
not change the functions that create relations so that we can compare your
results with our own.
-
For a user interface you can either :
-
produce a few executables, one for each join
-
provide a menu interface to select the join algorithm to run
-
run all of the join algorithm sequentially, or
-
if you are really "on the ball", produce a window gui for your program.
User interface will not affect the grade.
Reference
Those in bold are classes and types that are particularly useful in this
assignment. You don't need to know the rest to complete the assignment,
but if you like to study the code given, they might be helpful.
Coding Convention
You need to follow certain coding conventions for all your assignments.
-
Name for all classes/methods/types should start with a caps. Break multiple
words with caps. For example AddFileEntry.
-
Name for all members/variables should start with small letters. Break multiple
words with caps. For example numOfPages;
-
Name for all enum/macro should be all caps. Break multiple words with underscore.
For example MINIBASE_DB.
Submission Procedure
Copy your files and documents into submission folder.
This assignment is due on Nov. 5, 1997, 11:00 pm. Late submission
will lose 5 to 10 percent of your grade..
Marking Criteria
We will mark your programs based on the following criteria :
-
Correctness (50%)
-
You will get full marks if your implementation is correct. Partial credit
will be given to a partially correct submission.
-
Coding Style (10%)
-
We expect you to write neat code. Code should be nicely indented and commented.
We also expect you to follow the coding conventions.
-
Documentation (40%)
-
Your documentation should contains results of your experiments in table
and graph format. You should include brief analysis, explaining the results
that you have submitted. (You should not just reproduce the formulas in
the textbook.)
Miscellaneous
Please let us know if there is any ambiguity in the assignment and please
report any possible bugs as soon as possible. You can mail to cs432ta@cs.cornell.edu.
There are two newsgroups for the class : cornell.class.cs432
and cornell.class.cs433 for general discussions.
For the discussion of this poject please use cornell.class.cs432
Please check the newsgroup frequently for any
updates or clarification.