CS432

Project 3

Joins

Deadline : Nov 5, 1997

Introduction

In this assignment you are required to implement 4 different join algorithms and evaluate their performance. 

Background

A relation is implemented as a heap file in Minibase. A heap file is a collection of records. Records can be inserted or deleted from a heap file. Each record is uniquely identified by a record id. A scan is an interface used to access the records in a heap file. By using scan, we can access the content of the heap file, record by record.

An index provides fast access to records in the heap file. Currently Minibase has only B+ - tree index. Each entry in the index is a (key, record id) pair. Entries can be inserted or deleted from an index. An index search scan provides interface for accessing the records in the index. We can specify the range to be scanned by specifying the low key and high key parameter.

Four classes are provided for you : (i.e. you only need to call methods in these classses)

See the Reference section for the details of these classes and examples on how to use them.

You will also need the globaldef library and the buffer manager that you used and implemented in assignment 1. 


Requirement

Statistic Collection

Augment the buffer manager class you have written during the first assignment to collect statistics. In particular, you should be able to tell how many PinPage requests have been made and how many result in a miss. We suggest you write two functions : one to reset the statistics and one to report the statistics.

You can use the function clock() to find out the running time of your join algorithm. Clock() returns the current time. Remember to include time.h when you use this function. If you have other questions refer to the C++ help.

You should run your algorithm a few times (the more the better) and report the average running time. Avoid running other CPU intensive stuff (WinWord, Netscape etc.) while collecting statistics since clock() reports the actual time rather than CPU time.

You should print out the average running time and the number of misses for every join algorithm you implemented.

Tuple-At-A-Time Nested Loop Join

You should start by implementing this since this is the simplest algorithm.

Block Nested Loop Join

This can be found in page 294 in the textbook. As Minibase does not provide page-by-page access into a relation stored in a heap file, we will simulate a "block" with an array inside the memory which stores records. The join function will take in an integer B which is the "block" size (size of the array) of outer relation. You don't have to implement the double buffering techniques or the hash table.

The pseudocode for this join is as follows :

For each block b in R
  For each tuple s in S
    For each tuple r in b
      Match r with s
      if Match then
        Insert (r,s) into the result relation
Compare the performance of Block Nested Loop Join when you change the block size.

Index Nested Loop Join

This can be found in page 296 of the textbook. You should first build a unclustered B+-Tree index on the inner relation.

Find out if it is worth building the B+-Tree index for the purpose of performing a single join operation, by comparing the cost of this join with other joins.

Sort Merge Join

This can be found in pg 298 of the textbook. You should first sort both relations using the SortFile function provided. You don't have to implement the refinement discussed in page 301 and 302 of the textbook.

Find out if it is worth sorting the file just for the purpose of performing a single join operation, by comparing the cost of this join with other joins.

Performance Comparison

Simplification


Source Code Provided

The source code for the project is located in the directory \\goose\courses\cs432\proj4. The directory contains: You should write your source code in (join method).cpp and main.cpp. Look through the code provided to understand how the test program works. The functions in join.cpp and relation.cpp should be useful when writing your join methods and debugging. The function SortFile is particularly useful as an example on how to use HeapFile, Scan, BTreeFile and BTreeFileScan. 

Extra Credit


Hints

Reference

Those in bold are classes and types that are particularly useful in this assignment. You don't need to know the rest to complete the assignment, but if you like to study the code given, they might be helpful.

Coding Convention

You need to follow certain coding conventions for all your assignments.

Submission Procedure

Copy your files and documents into submission folder.

This assignment is due on  Nov. 5, 1997, 11:00 pm. Late submission will lose 5 to 10 percent of your grade.. 


Marking Criteria

We will mark your programs based on the following criteria :
Correctness (50%)
You will get full marks if your implementation is correct. Partial credit will be given to a partially correct submission.
Coding Style (10%)
We expect you to write neat code. Code should be nicely indented and commented. We also expect you to follow the coding conventions.
Documentation (40%)
Your documentation should contains results of your experiments in table and graph format. You should include brief analysis, explaining the results that you have submitted. (You should not just reproduce the formulas in the textbook.)

Miscellaneous

Please let us know if there is any ambiguity in the assignment and please report any possible bugs as soon as possible. You can mail to cs432ta@cs.cornell.edu.

There are two newsgroups for the class : cornell.class.cs432 and cornell.class.cs433 for general discussions. For the discussion of this poject please use cornell.class.cs432

Please check the newsgroup frequently for any updates or clarification.