CS433 Assignment 2: B+-Trees

Goal

In this assignment, you will implement a B+-tree index file. Your implementation does not have to deal with duplicate key values.

Extra credit: You can get an extra 10%  (4% of your overall grade since the assignment has a weight of 40%) if you implement a B+-Tree that can deal with duplicate key values. You should handle the case when there are more duplicates than can fit on one page.

Introduction

A B+-tree index consists of a collection of records of the form (key, rid), where key is a value for the search key of the index, and rid is the record id of the record being indexed. (It is an index based on what is described as Alternative 2 in the textbook.)

The nodes in a B+-tree can be divided into two categories: internal index nodes and leaf nodes. The leaf nodes store the actual (key, rid) pairs, while the internal nodes store key values and the page identifiers (pageids) of their children.

Read the textbook carefully and familiarize yourself with the B+ Tree indexing structure and algorithm.

The space manager

In the last assignment, you were introduced to the Database (DB), Page and Buffer Manager (BufMgr) abstractions. As you may recall, the DB abstraction is responsible for maintaining files, allocation/deallocation of pages and performing I/O. The Page abstraction is just a dummy class of size MAX_SIZE, and the BufMgr is responsible for maintaining "frequently" used pages in memory. In this assignment you will have to build a B+ Tree on top of these abstractions.

The rest of this section explains the space manager: an abstraction that is responsible for keeping records in a page. Since you are going to use this abstraction in this assignment and the next assignment, we recommend that you study it carefully.

HeapPage, Page ID, and Record ID

A heap page maintains records using a slot directory located at the beginning of the page. Each slot contains two attributes: offset and length. Records starts from the end of the page. Offset refers to the offset of the record within the page, and length refers to the length of the record.

Each page is uniquely identified by a page id. A record id is a pair (page id, slot number) that uniquely identifies a record.

In MINIBASE, a heap page is implemented by the class HeapPage, which is a subclass of Page. The identifier of a page, the page id, is implemented as type PageID, which is just typedef as int. The identifier of a record, the record id, is implemented as a structure RecordID, containing two fields : a PageID pageNo, and an integer slotNo. Comparison operators == and != are also implemented on RecordID.

The class HeapPage also has a member called type. The type of a page is used in this assignment to indicate the category of the current HeapPage object (whether it is a leaf node or an index node of the B+ Tree).

HeapPage class provides interface to insert and delete records from the page, via InsertRecord and DeleteRecord(). It also provides a function to scan the page. This is done using FirstRecord() and NextRecord().

SortedPage

A SortedPage is a special type of HeapPage that maintains records in a page in increasing sorted order based on a key. It assumes that the first 4 bytes in a record are a key of integer type. Insertion sort is used to insert the records, so the records in a HeapPage may be moved whenever insertion occurs. This may change the RecordID of the records. For this reason, the RecordID of records in a SortedPage should never be exposed to users.

Buffer Manager

You should be very familiar with the Buffer Manager right now. In this assignment you will have a chance to use what you have implemented in the previous assignment. You can also use the buffer manager code supplied to you with the B+-tree assignment.

You will use many Pin and Unpin operations in this assignment. Use Pin() when you have a PageID, and you want to read in the page from disk into memory. After you use the page, you must remember to Unpin() it with the correct dirty flag.

The buffer manager will also give you a new page (using NewPage()) when ever you need one (e.g. when you create a B+ Tree). Remember to free the page when you do not need it anymore (e.g. where you want to delete the B+ Tree, or if the page is deleted from the B+ Tree), by calling FreePage().

Implementation of the B+ tree

A B+ Tree index is stored in MINIBASE as a database file.

The class IndexFile is an abstract class for indices. BTreeFile is a subclass of IndexFile, and implements the B+-tree in MINIBASE. The class BTreeFile maintains two types of nodes: BTLeafPage and BTIndexPage, which correspond to the leaf nodes and index (internal) nodes in a B+-tree.

The BTreeFile contains provides methods for inserting, deleting, and searching the tree. It does not contain the tree itself. It maintains the tree by keeping a pointer to the root of the tree. The BTreeFile class is responsible for maintaining the integrity of the tree. (i.e. when an insertion is made, it has to ensure that the new record is inserted in the correct place, or if a new entry is added to an index node, it has to ensure that there is sufficient room in the index node to hold the new entry)

The leaf and index nodes in a B+ Tree are implemented by the classes BTLeafPage and  BTIndexPage, respectively. They are both subclasses of SortedPage and are responsible for inserting into and deleting from leaf node and index node, respectively. These two classes are not responsible for the integrity of the tree.

For this assignment, the BTLeafPage and BTIndexPage are provided for you. We recommend that you study the source code for these two classes carefully. Their implementation can be found in btleaf.cpp and btindex.cpp. You are free to add any methods to these classes that can help with your assignment (try to keep the methods private wherever possible). For example, a method that finds out which child link to follow when searching for a record, and a method that finds the sibling of a node could be useful. Do not add any member variables to these classes, since they have to map carefully onto the structure of a page.

BTLeafPage

Each indexing entry that we insert or delete from a BTreeFile is a tuple (key, record id). These tuples are stored as records of type LeafEntry in BTLeafPage. To facilitate a scan through the leaf nodes, the leaf nodes are linked together as a doubly linked list. These "pointers" in the link list are actually PageID of the previous page and next page. The two members corresponding to these pointers are called nextPage and prevPage. They can be set and retreived with SetNextPage(), SetPrevPage(), GetNextPage() and GetPrevPage().

Other functions provided for BTLeafPage are Insert and Delete, which (as the name suggest) insert and delete a LeafEntry to/from the page. GetNext(), GetFirst() and GetCurrent() are also provided to scan through the records in a BTLeafPage. A typical use of these functions is shown below.

    int key;
    RecordID rid, outRid;
    Status s = page->GetFirst(key, rid, outRid);
    while (s != DONE) 
    {
        // do something with key and rid
        s = page->GetNext(key, rid, outRid);
    }

BTIndexPage

The BTIndexPage contains a sequence <pid1, key1, pid2, key2, ..., keyk, pidk+1> (the semantics of this sequence are as described in the textbook). The prevPage member in BTIndexPage (also referred to as the leftlink or the leftmost child page) is used to store pid1 while (keyi, pidi+1) are stored as records of type IndexEntry in the BTIndexPage.

Just like in BTLeafPage, Insert(), Delete(), GetFirst() and GetNext() are provided for this class.

Scanning a B+ Tree

We can also open a scan on a B+ Tree, and retrieve all records within a range of key values. The BTreeFilescan class is a data structure to keep track of the current cursor in the scan. It provides a GetNext() method for retrieving the next record in the BTreeFile We can also delete the current record at the cursor while we are scanning, using the DeleteCurrent() method.

Assignment requirements

In this assignment, you are required to implement two classes: BTreeFile and BTFileScan, link them with the main test program, and make sure that all test programs run successfully. Note that a successful run does not mean that your program is correct. You should also ensure that your program will work for all possible test cases.

To simplify the assignment, you can assume that a BTreeFile only handles integer keys.

The details of the functions that you have to implement are given below. We have also provided some sample code that illustrates the use of the classes we have provided. However these samples are extremely simple and do not reflect the actual coding that we expect from you. For example, we do not show any error checking in these samples. You are expected to write robust programs by checking the return code and signal errors when necessary.

BTreeFile::BTreeFile

The constructor for the BTreeFile takes in a filename, and checks if a file with that name already exists in the database. If the file exists, we "open" the file. Otherwise we create a new file with that name.

You can use MINIBASE_DB::GetFileEntry() to check if the file exists. A file entry is a pair (filename, pid), where pid is the page id of the "first" page of the file. GetFileEntry() takes in a filename and returns the page id of the first page. In our case, the page id returns should corresponds to the page id of the root note. If no such file is found in the database, GetFileEntry() returns FAIL.

Once we get the page id of the root, we "open" the file by pinning the first page, this will caused the root to be read from disk into memory. The following example shows how to read an existing file and pin the first page.

    // filename contains the name of the BTreeFile to be opened
    Page *page;
    MINIBASE_DB->GetFileEntry(filename, pid);
    MINIBASE_BM->PinPage(pid, page);

To create a new B+ tree index, you should first get a new page, and initialize it. You should also use MINIBASE::AddFileEntry to add a new file entry into the database. The following example shows how to do this :

    Page *page;
    MINIBASE_BM->NewPage(pid, page);
    MINIBASE_DB->AddFileEntry(filename, pid);
    ((SortedPage *)page)->Init(pid);

After the above, you should also initialize the type of the page to the appropiate type (is it LEAF_NODE or INDEX_NODE ?).

    ((SortedPage *)page)->SetType(???);

BTreeFile::~BTreeFile

The destructor of BTreeFile just "closes" the index. This includes unpinning any pages that are being pinned. Note that it does not delete the file.

BTreeFile::DestroyFile

This method deletes the entire index file from the database. You need to free all pages allocated for this index file. The file entry in the database also needs to be removed (using MINIBASE_DB::DeleteFileEntry(filename)).

BTreeFile::Insert

This method inserts a pair (key, rid) into the B+ Tree Index. The actual pair (key, rid) is inserted into a leaf node. But this insertion may cause one or more (key, pid) pair to be inserted into index nodes. You should always check to see if the current node has enough space before you insert. If you don't have enough space, you have to split the current node by creating a new node, and copy some of the data over from the current node to the new node. Splitting will cause a new entry to be added in the parent node.

Splitting of the root node should be considered separately, since if we have a new root, we need to update the file entry to reflect the changes. Splitting of a leaf node should also be considered separately since the leaf nodes are linked as a link list.

Due to the complexity of this function, we recommend that you write separate functions for different cases. For example, it is a good idea to write a function to insert into a leaf node, and a function to insert into an index node. The following shows some simplified code fragment that may be helpful.

Checking if there is enough space to insert a record into a leaf node :

    if (page->AvailableSpace() ...

Inserting a pair (key, rid) into a leaf node with page ID pid can be done with the following code :

    BTLeafPage *page;
    RecordID outRid;
    MINIBASE_BM->PinPage(pid, (Page *&)page);
    page->Insert(key, rid, outRid);
    MINIBASE_BM->UnpinPage(pid, DIRTY);

BTreeFile::Delete

This method deletes an entry (key, rid) from a leaf node. Deletion from a leaf node may cause one or more entries in the index node to be deleted. You should always check if a node underflows (less than 50% full) after deletion. If a node becomes underflows, merging or redistribution will occur (read and implement the algorithm in the textbook).

You should consider different scenarios separately (maybe write separate functions for them). You should consider deletion from a leaf node and index node separately. Deletion from the root should also be consider separately (what happens if the root becomes empty after some deletion, but there are still some child nodes?)

The following code fragment may be helpful :

Checking if a node is half full :

    if (page->AvailableSpace() > HEAPPAGE_DATA_SIZE/2) 
    {
        // check if merge can occur
        // merge
    }

BTreeFile::OpenScan

This method should new a BTreeFileScan object and initialize it based on the search range specified. It is useful to find out which leaf node the first record to scan is in.

BTreeFile::DumpStatistics

In this assignment, you are required to collect statistics to reflect the performance of your B+ Tree implementation. This method should print out the following statistics of your B+ Tree.

  1. Total number of nodes in the tree
  2. Total number of data entries in the tree
  3. Total number of index entries in the tree
  4. Average, minimum and maximum fill factor (used space/total space) of leaf nodes and index nodes.
  5. Height of tree

These statistics should serve you in making sure that your code executes correctly. For example, the fill factor of each node should be greater than 0.5. You should make sure that DumpStatistics performs this operation.

BTreeFile::Print, BTreeFile::PrintTree, BTreeFile::PrintNode

These are helper functions that should help you debug, by showing the tree contents. However, make sure that you study the code provided carefully, since we have introduced a bug. Fix the bug and use it to view your B+ Tree. Feel free to modify this function to suit your own debugging requirements.

BTreeFileScan::~BTreeFileScan

First, note that BTreeFileScan does not have a constructor since it is created inside BTreeFile::OpenScan. The destructor of BTreeFileScan should clean up the necessary things (such as unpinning any pinned pages).

BTreeFileScan::GetNext

GetNext returns the next record in the B+ Tree in increasing order of key. Basically, GetNext traverses through the link list of leaf nodes, and returns all the records in them that are within the scan range, one at a time.

For example, if the B+ Tree contains records with keys 1,2,4,5,6,7,8, the following code should print out "2 4 5 ".

    int low = 2;
    int high = 5;
    RecordID rid;
    int key;
    scan = (BTreeFileScan *)tree->OpenScan(&low, &high);
    scan->GetNext(rid, key);
    cout << key << " "; scan->GetNext(rid, key);
    cout << key << " "; scan->GetNext(rid, key);
    cout << key << " "; 

BTreeFileScan::DeleteCurrent

DeleteCurrent should delete the current record, i.e, the record returned by previous call to GetNext(). For example, if the B+ Tree contains records with keys 1,2,4,5,6,7,8, the following code should print out "2 4", and the remaining B+ Tree should contain records with keys 1,5,6,7,8.

    int low = 2;
    int high = 5;
    RecordID rid;
    int key;
    scan = (BTreeFileScan *)tree->OpenScan(&low, &high);
    scan->GetNext(rid, key);
    cout << key << " ";         // should output 2
    scan->DeleteCurrent();      // record with key 2 is deleted
    scan->GetNext(rid, key);
    cout << key << " ";         // should print 4
    scan->DeleteCurrent();      // record with key 4 is deleted
    scan->DeleteCurrent();      // error since record with key 4 is already deleted

Documentation

You should also submit a document describing the code that you have written. This should include assumptions that you made, descriptions of any new classes that you have added, and any other special feature we should take note of. As a guideline, the document should be 2-3 pages long (with normal fonts and spacing), or more, if you feel necessary. If you did not complete the assignment, explain the part that you completed so that we can give partial credits (e.g. our deletion works, except that it will cause your program to fail if an index entry is deleted from the root node.) Show sample test runs that you have used to ensure your code is correct, and explain why they are meaningful.

Assignment Details

The zip file for the project can be downloaded using the course management system. Once you unzip the file, the directory contains the following files (among others):

btfile.cpp/btfile.h code skeleton for the BTreeFile class
btleaf.cpp/btleaf.h code for BTLeafPage class
btindex.cpp/btindex.h code for the BTIndexPage class
btfilescan.cpp/btfilescan.h code skeleton for BTreeFileScan class
btreetest.cpp/btreetest.h code for test-script interface to the B+-tree
sortedpage.cpp/sortedpage.h code for SortedPage class
bt.h general declaration of types
main.cpp contains main(), runs tests

The definition of public interface for the BTreeFile, BTreeFileScan classes are available, but not fully implemented. BTLeafPage, BTIndexPage and SortedPage are implemented while IndexFile and IndexFileScan are defined, but no implementation is needed. The files you need to modify are: BTreeFile.cpp and BTreeFileScan.cpp.

You may choose to use your implementation of the buffer manager, or take our implementation.

Class Description

You can find out more about the classes and types here, but the code is always your best source. Use the Visual code browser to look at the code.

User Interface

You can run the program either interactively, or using a script. For a list of accepted commands, execute "btree ?".

A file named rim.txt has been provided as a sample, but it is far from a complete test set.

Below are the commands accepted. Note that when <low> and/or <high> equals -1, they mean min and max, respectively.

insert <low> <high> <duplicate>

if <duplicate> = 1 then allow duplicate (random) records in the range [low, high]

<duplicate> = 0  then all the records from [low, high) are inserted sequentially

scan <low> <high> Scan for <high> <low> records.
delete <low> <high> Delete records between <high> <low>
deletescan <low> <high> Scan and deleteCurrent records between <high> <low>.
print Print B+ tree.
stats Show stats
quit (or EOF) Termination

Coding conventions

You need to follow certain coding conventions for all your assignments.

Assignment Files, Submission and Grading

Files that you can modify

You should write most of your code in BTFile (.h and .cpp) and BTFileScan (.h and .cpp). You can add any member variables and methods to these two classes.

You can also add new methods to the BTLeafPage (.h and .cpp) and BTIndexPage (.h and .cpp) classes (keep these methods private wherever possible). You can also modify the isAtLeastHalfFull function if needed. You should not modify any of the other existing methods in BTLeafPage and BTIndexPage. You also should not add any member variables to these classes, since they have to map carefully onto the structure of a page.

Creating and Uploading the Zip File

You should submit the following files. You should not submit any other files (for you should not have modified any other files).

Zip only the above files in your directory and upload them into the course management system by the deadline. Keep a copy of the code just in case something goes wrong. No late submissions will be accepted.

Grading criteria

We will grade your program based on the following criteria :

Very Important Advice