CS432 Proj 4 : Join Algorithms
Sample Report
This is a union of several good reports that I received, minus
the tables, the graphs, and explanation of how the
statistics are collected. Numbers mentioned below are based on
buffer size 50 and 10000 records per relation.
The purpose of this sample report is to give an idea of what
is expected. You don't have to mention everything to get full
marks (40).
1. Effect of Block Size in Block NLJ
- When block size increases, number of misses and
pin requests decrease.
- We store more records from R in memory, and
making fewer passes through S.
- When block size increases, running time
decreases, until the block size reaches about 1000
tuples, then the running time increases again, slightly.
- The more tuples we store in the block, the longer
the time we have to search through the block to
find the matching tuple. Increment of running
time is due to increment in computational time
rather than disk I/O time. (Since the number of
misses actually decreases.)
- When block size equals to 1, Block NLJ is the
same as Tuple-At-A-Time NLJ. But when we increment the
block size by 1, (change from 1 to 2), the performance
improves significantly.
2. Comparisons of the four algorithms
- Tuple-At-A-Time NLJ is bloody slow.
- We scan through S for each tuple in R. It results in
high number of pin requests (about 2 millions).
It does not make use of locality in S. (A tuple
that is read will not be use again in the near
future). This result in high number of misses as
well (about 2 millions) and extremely low hit
rate. (2%)
- Block NLJ is a lot faster.
- It has the least number of misses among the four
algorithm (< 5000) . But running time is not
the smallest as expected, due to the computational
time spent in scanning the arrays.
- Index NLJ is the fastest.
- Despite the amount of time spent in creating the
index, index NLJ is still the fastest. For each
tuple in R, at most 3-5 pin requests is needed to
retrieve a matching tuple in S. Duplicate
matching tuples can be retrieved by at most 1 pin
request, since the leaf nodes are doubly linked.
- So it's worth to build an index just for the
purpose of performing one join operation.
- Sort Merge Join is pretty good.
- Sort Merge is slightly slower than index NLJ.
This is particular to the sorting method used in
SortFile( ) function. Since it creates a B+-Tree
first, and then scans the B+-Tree to get the
records in sorted order.
- Without considering the sorting, Sort Merge has
the least number of misses (900+). The reason is
that, Sort Merge is partition based and require
far fewer number of passes.
- Sorting contributes to most of the running time
of SortMerge( ) join. It is not worth to sort the
file just for the purposes of one join operation.
(But it may be different if a better Sorting
methods is used).
- If a file is already sorted, SortMerge should be
the first choice.
3. Using Hash Table In Block NLJ
- Using hash table significantly improves the running time,
and makes Block NLJ the fastest algorithm of all. This
shows that a lot of time is spent in scanning the block
in the original Block NLJ.
- There is no significant effect on number of pin and
number of misses, as expected. However this is particular
to our implementation, since the hash table does not
resides in the buffer pool as it should.
4. Using Clustered Index for Index NLJ
- If we include the time to sort the relation, before
building the index, clustered index performed worst than
unclustered index. This maybe particular to the SortFile(
) function we used as it is not efficient.
- Excluding the time to sort the relation, clustered index
performs better. The I/O requests are sequential for
clustered index, but are random for unclustered index.
- If we exclude the statistics for sorting, both clustered
and unclustered version have roughly the same amount of
pin requests (130000+). But the random nature of the
unclustered index causes more misses (25000+) than the
clustered index (21000+).
5. Effect of Buffer Size
- In general, increment in buffer size causes the number of
misses to decrease, but has no effect on the number of
pin requests. More pages can fit into the buffers now so
the chances of hitting a page are now higher.
- Decrement in number of misses causes the running time to
decrease.
- Index NLJ, Block NLJ and Sort Merge Join are not sensitive
to changes in buffer size.
- Tuple-At-A-Time NLJ has a dramatic improvement in
performances when the buffer size reaches a point where the
whole inner relation can fit into the buffer pool (around
buffer size of 210-215). This contribute to a sharp
decrease in number of misses and running time for
Tuple-At-A-Time NLJ.
- When the buffer pool is large enough to hold both
relations , further increases in buffer pool size will not
affect the performances.