CS432

Assignment 3

B+ Tree

Deadline: Oct.15 1997

You can find quick help of how to start this project now, click here

Introduction

In this assignment you will need to implement a B+-tree indexing scheme.  An indexing scheme facilitates the retrieval of a desired record from a relation. (In MINIBASE, all records in a relation are stored in a heap file.) The B+-tree index is one very useful indexing scheme.

A B+-tree index consists of a collection of records of the form (key, rid), where key is a value for the search key of the index, and rid is the record id of the record being indexed. (It is an index based on what is described as Alternative 2 in the textbook.)

The nodes in a B+-tree can be divided into two categories: internal index node, and leaf node. The leaf nodes store the actual (key, rid) pair, while the internal nodes store key values and the page ids of its children.

Description

The class IndexFile is an abstract class for indices. BTreeFile is a subclass of IndexFile, and implements the B+-tree in MINIBASE. The class BTreeFile maintains two types of nodes: BTLeafPage and BTIndexPage, which correspond to the leaf nodes and internal nodes of a B+-tree. Only one type of key will be supported: integer. Key values need not be unique. You can assume that all of the duplicates fit into one page.

The BTreeFile contains in it the functions for inserting, deleting, and searching the tree. It does not, however, contain the tree itself. It will however have a pointer to the root of the tree. Each node in the tree will be an instance of the class BTIndexPage, or if it is a leaf node than the BTLeafPage. These two classes are described below. The BTreeFile class is responsible for maintaining the integrity of the tree, i.e. when an insertion is made it has to ensure that the new record is inserted in the correct place or if a new entry is added to an index node that there is sufficient room in the index node to hold the new entry. The BTLeafPage and the BTIndexPage are both responsible for inserting into and deleting from a leaf node and index node, respectively. These classes also contain the data within the individual nodes. They do not however check to ensure the integrity of the tree. For example if an index node is exactly half full, and BTIndexPage::DeleteRecord is called, then a record will be deleted even though this will mean that the index node is now less than half full (which is an invalid state). Basically, these two classes provide the interface for the BTreeFile class to the individual nodes.

Both BTLeafPage and BTIndexPage are subclasses of SortedPage, which in turn is derived from HeapPage. The class SortedPage maintains the records in HeapPage in sorted order, and is implemented for you. Understanding how BTLeafPage uses the methods it inherited from SortedPage will be very useful for you in doing this assignment, and we  recommend reading both btleaf.cpp and sortedpage.cpp.

The BTLeafPage maintains records of type (key, rid) pair as describe above. The BTIndexPage contains <pid1, key1, pid2, key2, ..., keyk, pidk+1> (the semantics of this sequence are those that are describe in the textbook). The prevPage member (also referred to as the leftlink or the leftmost child page) is used to store pid1 while all (keyi, pidi+1) are stored as records in the BTIndexPage.

A set of functions for manipulating the (key, data) records (where data can be rid or pid) have been written for you . You can take a look at the functions BTIndexPage::InsertKey and BTIndexPage::GetPageID for examples on how to use these functions. A specification of the functions can be found in key.cpp.

Finally, the BTreeFileScan class implements a scan interface on a BTreeFile.

Also included in the source code is a buggy version of PrintTree, which is a member of the BTreeFile class. This function prints out the whole B+Tree to a file using a Depth First Search. Also included _PrintNode which prints out a single node in the B+Tree; this function is also buggy. You should debug these functions and then use them to debug your other code. There are two bugs in the version being given to you. The first bug is a logical bug, and the second involves a pointer. The logical bug is only in PrintTree, but the pointer bug is repeated in both functions. You should actually try and debug this using the debugger in VC 5.0. Groups are not allowed to discuss how to debug this code with any other group.

The definition of public interface for the BTreeFile,  BTIndexPage, BTreeFileScan classes are available, but not fully implemented. BTLeafPage and SortedPage are implemented  while IndexFile and IndexFileScan are defined but no implementation is needed.

You can find out more about the classes and types here (those in bold are newly added) :

The source code for the project is located in the directory J:\cs433\proj2. The directory contains:

User Interface:

If you run your project from the DOS prompt you will find that the first thing that appears on your screen is a menu of options. This is the user interface we have designed for your project. The user interface for this project is relatively simple, but there are a few things that should be pointed out: There are ten options in the menu. In order to run the tests and such the following things need to be kept in mind:
          1)For all the options (except 7- 9) you must first, create an instance of the tree using option1. Otherwise you will not be allowed to proceed. Moreover, once a tree is created, unless a tree is explicitly destroyed using option6 or at the end of options 8 and 9, you will run the options on the same tree that you have already created. Note: once you create a tree, you cannot create a new tree until the old tree is destroyed.
          2)Sometimes you are asked for a range (hikey-lokey) when inserting.
                   That  range specifies the range of data entries, but it also specifies how many insertions are carried out. For example, if you input 100 as the hikey, and 20 as the lokey, then there will be (100-20+1)=81 data entries inserted into the tree, each with a key ranging between 20 and 100. Note: that there can still be duplicates amongst these keys.
          3)When you are asked for a range while deleting, the tree deletes all the data entries, whose keys fall within that range.
          4)Whenever an insertion takes place, every new key value is assigned a unique RecordID. These RecordID's are junk (They are meant to represent the RecordID (pageID and slot no.) of the data on disk that the key is supposed to represent. However, for this project, the RecordID's in the leaf nodes do not refer to any actual place on any disk.)
          5) All the options except option 1, 3 and 6. print out the tree, and dump statistics on an output file. The statistics that are dumped out are explained below in the Requirements section.
          6)Options7-9 should be run sequentially i.e. one after the other.

Now here is a synopsis of what each option actually does:
        Option1)     Creates an instance of a B+Tree. (Basically creates an instance of BTFile class. There are no index pages associated with this BTFile)
        Option2)     Asks the user to input a range of keys to be inserted. Inserts the keys in the range specified.
        Option3)     Asks the user to input a range of keys. Searches for all the keys in this range (using a index-search for the lowest key within this range, and then a sequential search). Prints out to the screen the entries are within this range.
        Option4)    Asks the user to input a range of keys to be inserted. Inserts the keys specified. Prints out statistics and the tree at this time. Then it deletes  half of the entries inserted, and prints out the statistics and the tree after the deletion.
        Option5)    Asks the user for a range. Then prints out on to the screen, first a full scan of the tree, second a scan from the smallest entry to the input hikey, third
a scan from hikey to hikey, and finally a scan from lokey to hikey. In the last scan it deletes all the records between the lokey and hikey. After this deletion, it prints out the tree and the statistics to a file.
        Option6)    Destroys the Tree.
        Option7)    Creates a tree. Inserts 4000 entries (1-4000) into the tree. Deletes 10% of the tree. Repeats Option5 using the range 2000-3000.
        Option8)    Runs some abnormal scans. Using a lokey>hikey. lokey>the rightmost key in the tree. hikey<leftmost key. Finally, destroys the tree.
        Option9)    Creates a B+Tree. Insert 2000 entries(1-2000). Deletes 1950 entries. Repeats Option5 asking for a user input for the ranges. Destroys the tree.
 
 
 

Requirements:

Dumping statistics

 Once again you will be required to collect statistics which reflect the performance of your B+ Tree implementation. DumpStatistics will have collect the following statistics:
                1) Total number of nodes
                2) Total number of data entries
                3) Total number of index entries
                4) Fill factor of leaf nodes avg., min, max.
                5) Fill factor of index nodes
                6) Height of tree
 Note: by total number we mean the total number in the tree. By fill factor we mean (used space)/(total space) in a node. (therefore all fill factors will range between 0 and 1).
These statistics should serve you in making sure that your code executes correctly. For example, the fill factor of each node (except the root) should always be greater than 0.5, and the height of the tree should be less than 5 or so. (this will vary, of course). You should make sure that DumpStatistics dumps out these statistics on to a file.
 
 

 Coding Convention

You need to follow certain coding convention for all your assignments.

Submission Procedure

     Create a  folder with your NetID (or your partner) as its name in your local directory.
     Into this folder, drop your project file, source code, executable, and documentation. Make
     sure that  while the executable is  ready to run, it can be recompiled as needed.
     Insert this folder into the Project 1 Submissions network directory in
     \\Goose\courses\Cs432\Proj2\Submissions.
     Set the permissions on your folder (on the network drive) so that only the TA's (cs432stf)
     and yourself  have 'Full Control' access permission. Nobody else should have any
     permission to access this directory.

You should remember to keep a copy of the project in your own account.

This assignment is due on Oct. 8 1997, 5:00 PM. No late submission will be accepted.

Marking Criteria

We will mark your programs based on the following criteria :
Correctness (70%)
You will get full marks if your implementation is correct. Partial credit will be given to a partially correct submission.
Coding Style (15%)
We expect you to write neat code. Code should be nicely indented and commented. We also expect you to follow the coding conventions.
Documentation (15%)
You should also submit on-line documentation using any format you like, in it you should explain the code that you have written. This should include assumptions that you made, description of any new class that you have added, and any other special feature we should take note of. As a guideline, the document should be 2-3 pages long (with normal fonts and spacing), or more, if you feel necessary.
You can choose to use your implementation of the buffer manager, or take our implementation. For not using your own implementation of the library you will lose 5 out of 100 points.

Miscellaneous

Please let us know of any ambiguity in the assignment and please report any bugs as soon as possible. You can e-mail us at cs432ta@cs.cornell.edu.

There are now two newsgroups for the class : cornell.class.cs432 and cornell.class.cs433, for general discussions.

Please check the newsgroup frequently for any updates or clarification.