CS432
Assignment 3
B+ Tree
Deadline: Oct.15 1997
You can find quick help of how to start this project now, click
here
Introduction
In this assignment you will need to implement a B+-tree indexing scheme.
An indexing scheme facilitates the retrieval of a desired record from a
relation. (In MINIBASE, all records in a relation are stored in a heap
file.) The B+-tree index is one very useful indexing scheme.
A B+-tree index consists of a collection of records of the form (key,
rid), where key is a value for the search key of the index,
and rid is the record id of the record being indexed. (It is an
index based on what is described as Alternative 2 in the textbook.)
The nodes in a B+-tree can be divided into two categories: internal
index node, and leaf node. The leaf nodes store the actual (key, rid)
pair, while the internal nodes store key values and the page ids of its
children.
Description
The class IndexFile is an abstract class for indices. BTreeFile is a subclass
of IndexFile, and implements the B+-tree in MINIBASE. The class BTreeFile
maintains two types of nodes: BTLeafPage and BTIndexPage, which correspond
to the leaf nodes and internal nodes of a B+-tree. Only one type of key
will be supported: integer. Key values need not be unique. You can assume
that all of the duplicates fit into one page.
The BTreeFile contains in it the functions for inserting, deleting,
and searching the tree. It does not, however, contain the tree itself.
It will however have a pointer to the root of the tree. Each node in the
tree will be an instance of the class BTIndexPage, or if it is a leaf node
than the BTLeafPage. These two classes are described below. The BTreeFile
class is responsible for maintaining the integrity of the tree, i.e. when
an insertion is made it has to ensure that the new record is inserted in
the correct place or if a new entry is added to an index node that there
is sufficient room in the index node to hold the new entry. The BTLeafPage
and the BTIndexPage are both responsible for inserting into and deleting
from a leaf node and index node, respectively. These classes also contain
the data within the individual nodes. They do not however check to ensure
the integrity of the tree. For example if an index node is exactly half
full, and BTIndexPage::DeleteRecord is called, then a record will be deleted
even though this will mean that the index node is now less than half full
(which is an invalid state). Basically, these two classes provide the interface
for the BTreeFile class to the individual nodes.
Both BTLeafPage and BTIndexPage are subclasses of SortedPage, which
in turn is derived from HeapPage. The class SortedPage maintains the records
in HeapPage in sorted order, and is implemented for you. Understanding
how BTLeafPage uses the methods it inherited from SortedPage will be very
useful for you in doing this assignment, and we recommend reading
both btleaf.cpp and sortedpage.cpp.
The BTLeafPage maintains records of type (key, rid) pair as describe
above. The BTIndexPage contains <pid1, key1, pid2,
key2, ..., keyk, pidk+1> (the semantics
of this sequence are those that are describe in the textbook). The prevPage
member (also referred to as the leftlink or the leftmost child page) is
used to store pid1 while all (keyi, pidi+1)
are stored as records in the BTIndexPage.
A set of functions for manipulating the (key, data) records (where data
can be rid or pid) have been written for you . You can take a look at the
functions BTIndexPage::InsertKey and BTIndexPage::GetPageID for examples
on how to use these functions. A specification of the functions can be
found in key.cpp.
Finally, the BTreeFileScan class implements a scan interface on a BTreeFile.
Also included in the source code is a buggy version of PrintTree, which
is a member of the BTreeFile class. This function prints out the whole
B+Tree to a file using a Depth First Search. Also included _PrintNode which
prints out a single node in the B+Tree; this function is also buggy. You
should debug these functions and then use them to debug your other code.
There are two bugs in the version being given to you. The first bug is
a logical bug, and the second involves a pointer. The logical bug is only
in PrintTree, but the pointer bug is repeated in both functions. You should
actually try and debug this using the debugger in VC 5.0. Groups are
not allowed to discuss how to debug this code with any other group.
The definition of public interface for the BTreeFile, BTIndexPage,
BTreeFileScan classes are available, but not fully implemented. BTLeafPage
and SortedPage are implemented while IndexFile and IndexFileScan
are defined but no implementation is needed.
You can find out more about the classes and types here (those in bold
are newly added) :
The source code for the project is located in the directory J:\cs433\proj2.
The directory contains:
-
btfile.cpp/btfile.h - the code skeleton for the BTreeFile
class.
-
btleaf.cpp/btleaf.h - the code for BTLeafPage class.
-
btindex.cpp/btindex.h - the code skeleton for the BTIndexPage class.
-
btfilescan.cpp/btfilescan.h - the code skeleton for BTreeFileScan
class.
-
btreetest.cpp/btreetest.h - the source code for testing the B+-tree
classes.
-
sortedpage.cpp/sortedpage.h - the source code for SortedPage class.
-
bt.h - general declaration of types and functions in key.cpp
-
key.cpp - functions to manipulate the (key, data) pair abstraction.
-
main.cpp - the main routine that initialized the MINIBASE and run
the test.
User Interface:
If you run your project from the DOS prompt you will find that the first
thing that appears on your screen is a menu of options. This is the user
interface we have designed for your project. The user interface for this
project is relatively simple, but there are a few things that should be
pointed out: There are ten options in the menu. In order to run the tests
and such the following things need to be kept in mind:
1)For all the
options (except 7- 9) you must first, create an instance of the tree using
option1. Otherwise you will not be allowed to proceed. Moreover, once a
tree is created, unless a tree is explicitly destroyed using option6 or
at the end of options 8 and 9, you will run the options on the same tree
that you have already created. Note: once you create a tree, you cannot
create a new tree until the old tree is destroyed.
2)Sometimes
you are asked for a range (hikey-lokey) when inserting.
That range specifies the range of data entries, but it also specifies
how many insertions are carried out. For example, if you input 100 as the
hikey, and 20 as the lokey, then there will be (100-20+1)=81 data entries
inserted into the tree, each with a key ranging between 20 and 100. Note:
that there can still be duplicates amongst these keys.
3)When you are
asked for a range while deleting, the tree deletes all the data entries,
whose keys fall within that range.
4)Whenever an
insertion takes place, every new key value is assigned a unique RecordID.
These RecordID's are junk (They are meant to represent the RecordID (pageID
and slot no.) of the data on disk that the key is supposed to represent.
However, for this project, the RecordID's in the leaf nodes do not refer
to any actual place on any disk.)
5) All the options
except option 1, 3 and 6. print out the tree, and dump statistics on an
output file. The statistics that are dumped out are explained below in
the Requirements section.
6)Options7-9
should be run sequentially i.e. one after the other.
Now here is a synopsis of what each option actually does:
Option1)
Creates an instance of a B+Tree. (Basically creates an instance of BTFile
class. There are no index pages associated with this BTFile)
Option2)
Asks the user to input a range of keys to be inserted. Inserts the keys
in the range specified.
Option3)
Asks the user to input a range of keys. Searches for all the keys in this
range (using a index-search for the lowest key within this range, and then
a sequential search). Prints out to the screen the entries are within this
range.
Option4)
Asks the user to input a range of keys to be inserted. Inserts the keys
specified. Prints out statistics and the tree at this time. Then it deletes
half of the entries inserted, and prints out the statistics and the tree
after the deletion.
Option5)
Asks the user for a range. Then prints out on to the screen, first a full
scan of the tree, second a scan from the smallest entry to the input hikey,
third
a scan from hikey to hikey, and finally a scan from lokey to hikey.
In the last scan it deletes all the records between the lokey and hikey.
After this deletion, it prints out the tree and the statistics to a file.
Option6)
Destroys the Tree.
Option7)
Creates a tree. Inserts 4000 entries (1-4000) into the tree. Deletes 10%
of the tree. Repeats Option5 using the range 2000-3000.
Option8)
Runs some abnormal scans. Using a lokey>hikey. lokey>the rightmost key
in the tree. hikey<leftmost key. Finally, destroys the tree.
Option9)
Creates a B+Tree. Insert 2000 entries(1-2000). Deletes 1950 entries. Repeats
Option5 asking for a user input for the ranges. Destroys the tree.
Requirements:
-
Study the functions provided in key.cpp and types defined in bt.h. The
web pages linked from above may help. Examples on how to use these functions
can be found in btindex.cpp.
-
Implement the methods listed in btfile.cpp, btindex.cpp, btfilescan.cpp
based on specifications described. You can skip those that have been implemented
for you. You should not modify any other files. You may, however, read
and understand what is going on inside these files. You are free to add
any private methods or members (unless otherwise stated in the .h file)
or make a function inline.
-
We recommend that you implement your code in this order:
-
implement BTreeFile Constructors, Destructor
-
implement the methods in BTIndexPage
-
implement BTreeFile::Insert and BTFile::Search
-
implement BTreeFileScan.cpp
-
implement BTreeFile::Delete
-
implement BTreeFile::DestroyFile
-
implement BTreeFile::DumpStatistics
-
Note: you are required to implement the BTreeFile::Insert and BTreeFile::Delete
functions using the split and merge algorithms. You are not required
to implement redistribution.
-
You may assume that the keys and records are of a fixed size.
-
To produce the executables,
-
You need a Buffer Manager library, which contains of the buffer manager
that you implemented in the first project.
-
You need a Space Manager library, which contains of db.cpp, page.cpp andheappage.cpp.
-
You also need the Global Definition library, which was distributed along
with the Buffer Manager project.
-
Compile the files in J:\cs433\proj2 to produce the executable.
-
You need to make sure that the tests run successfully. It is important
to note, however, that a successful test run does not guarantee that your
code is error free.
-
Answer the questions listed in the section below and include the answers
in the documentation.
Dumping statistics
Once again you will be required to collect statistics which reflect
the performance of your B+ Tree implementation. DumpStatistics will have
collect the following statistics:
1) Total number of nodes
2) Total number of data entries
3) Total number of index entries
4) Fill factor of leaf nodes avg., min, max.
5) Fill factor of index nodes
6) Height of tree
Note: by total number we mean the total number in the tree. By
fill factor we mean (used space)/(total space) in a node. (therefore all
fill factors will range between 0 and 1).
These statistics should serve you in making sure that your code executes
correctly. For example, the fill factor of each node (except the root)
should always be greater than 0.5, and the height of the tree should be
less than 5 or so. (this will vary, of course). You should make sure that
DumpStatistics dumps out these statistics on to a file.
Coding Convention
You need to follow certain coding convention for all your assignments.
-
Name for all classes/methods/types should start with a capital letter.
Break multiple words with caps. For example AddFileEntry.
-
Name for all members/variables should start with small letters. Break multiple
words with caps. For example numOfPages;
-
Name for all enum/macro should be all caps. Break multiple words with underscore.
For example MINIBASE_DB.
Submission Procedure
Create a folder with your NetID (or your
partner) as its name in your local directory.
Into this folder, drop your project file,
source code, executable, and documentation. Make
sure that while the executable is
ready to run, it can be recompiled as needed.
Insert this folder into the Project 1 Submissions
network directory in
\\Goose\courses\Cs432\Proj2\Submissions.
Set the permissions on your folder (on the
network drive) so that only the TA's (cs432stf)
and yourself have 'Full Control' access
permission. Nobody else should have any
permission to access this directory.
You should remember to keep a copy of the project in your own account.
This assignment is due on Oct. 8 1997, 5:00 PM. No late submission will
be accepted.
Marking Criteria
We will mark your programs based on the following criteria :
-
Correctness (70%)
-
You will get full marks if your implementation is correct. Partial credit
will be given to a partially correct submission.
-
Coding Style (15%)
-
We expect you to write neat code. Code should be nicely indented and commented.
We also expect you to follow the coding conventions.
-
Documentation (15%)
-
You should also submit on-line documentation using any format you like,
in it you should explain the code that you have written. This should include
assumptions that you made, description of any new class that you have added,
and any other special feature we should take note of. As a guideline, the
document should be 2-3 pages long (with normal fonts and spacing), or more,
if you feel necessary.
You can choose to use your implementation of the buffer manager, or take
our implementation. For not using your own implementation of the library
you will lose 5 out of 100 points.
Miscellaneous
Please let us know of any ambiguity in the assignment and please report
any bugs as soon as possible. You can e-mail us at cs432ta@cs.cornell.edu.
There are now two newsgroups for the class : cornell.class.cs432
and cornell.class.cs433, for general
discussions.
Please check the newsgroup frequently for any updates or clarification.