Balanced Binary Trees

Sets and maps are important and useful abstractions. We've seen various ways to implement an abstract data type for sets and maps, since data structures that implement sets can be used to implement maps as well. It's time to look at an implementation of sets that is asymptotically efficient and useful in practice: balanced binary trees.

Binary trees have two advantages over the asymptotically more efficient hash table: first, they support nondestructive update with the same asymptotic efficiency. Second, they store their values (or keys, in the case of a map) in order, which makes range queries and in-order iteration possible.

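For simplicity, we will implement sets of some type we will call `value`. We assume we have a comparison function `compare : value -> value -> int` whose result is negative, zero, or positive when its first argument is respectively less than, equal to, or greater than its second.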

The signature that we will work with is a little different from that in Lecture 8:

    module type SET = sig
      (* a "set" is an immutable set of values. *)
      type set
      type value
      (* empty is the empty set, ∅ *)
      val empty : set
      (* add s x is {x} ∪ s. *)
      val add : set -> value -> set
      (* union x y is x ∪ y. *)
      val union : set -> set -> set
      (* contains s x is whether x is a member of s (i.e., x ∈ s) *)
      val contains : set -> value -> bool
      (* size s is the number of elements in s *)
      val size : set -> int
      (* fold over the elements of the set *)
      val fold : ((value * 'b) -> 'b) -> 'b -> set -> 'b
    end

An important property of a search tree is that it can be used to implement an
**ordered set** or ordered map easily:
a set (map) that abstractly keeps its elements in sorted order.
Although the signature above doesn't show them, ordered sets
generally provide operations for finding the minimum and maximum elements of the
set, for iterating over all the elements in order, and for
extracting (or iterating over) the ordered subset of elements that lie within a range:

    (* min_elt s is the least element of s *)
    val min_elt : set -> value
    (* max_elt s is the greatest element of s *)
    val max_elt : set -> value
    (* fold over the elements of the set, in order. *)
    val fold : ((value * 'b) -> 'b) -> 'b -> set -> 'b
    (* fold_range f i s lo hi folds f over the elements of s between
       two values lo and hi, in ascending order. *)
    val fold_range : ((value * 'b) -> 'b) -> 'b -> set -> value -> value -> 'b

We represent a set as a binary tree in which each node holds one value:

    type tree = Empty | Node of node
    and node = {value : value; left : tree; right : tree}
    type set = tree

A **binary search tree** is a binary tree with the following representation invariant: for any node `n`, every node in `n.left` has a value less than that of `n`, and every node in `n.right` has a value greater than that of `n`. The entire left and right subtrees satisfy the same invariant.
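The invariant can be checked mechanically. The following is a minimal sketch, assuming integer values ordered by the standard `compare`; the names `is_bst`, `leaf`, `good`, and `bad` are illustrative and not part of the lecture code. It passes down the exclusive bounds that every value in a subtree must respect, which is what "every node in the left subtree" (rather than just the left child) requires.

```ocaml
(* Assumption for this sketch: values are ints, ordered by compare. *)
type value = int
type tree = Empty | Node of node
and node = { value : value; left : tree; right : tree }

(* is_bst t checks the representation invariant: every value in a left
   subtree is smaller than its node's value, and every value in a right
   subtree is larger. The optional bounds lo and hi are exclusive. *)
let is_bst (t : tree) : bool =
  let rec check lo hi t =
    match t with
    | Empty -> true
    | Node { value = v; left; right } ->
      let above = match lo with None -> true | Some l -> compare l v < 0 in
      let below = match hi with None -> true | Some h -> compare v h < 0 in
      above && below && check lo (Some v) left && check (Some v) hi right
  in
  check None None t

let leaf v = Node { value = v; left = Empty; right = Empty }

let good = Node { value = 5; left = leaf 2; right = leaf 8 }

(* 7 sits in the left subtree of 5, so the invariant fails even though
   each node is ordered correctly with respect to its own children. *)
let bad =
  Node { value = 5;
         left = Node { value = 2; left = Empty; right = leaf 7 };
         right = leaf 8 }
```

Comparing each node only with its immediate children would wrongly accept `bad`, which is why the bounds are threaded through the recursion.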

Given such a tree, how do you perform a lookup operation? Start from the root, and at every node, if the value of the node is what you are looking for, you are done; otherwise, recursively look up in the left or right subtree depending on the value stored at the node. In code:

    let rec contains (t : tree) (x : value) : bool =
      match t with
      | Empty -> false
      | Node {value = x'; left = l; right = r} ->
        let c = compare x x' in
        if c = 0 then true
        else if c < 0 then contains l x  (* x < x': search the left subtree *)
        else contains r x                (* x > x': search the right subtree *)
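Lookup is not the only operation that exploits the ordering: the ordered-set operations mentioned earlier follow the same pattern. Here is a hedged sketch of `min_elt` and `fold_range`, assuming integer values, an inclusive range, and `Not_found` for the empty tree (the signature specifies none of these details).

```ocaml
(* Assumption for this sketch: values are ints, ordered by compare. *)
type value = int
type tree = Empty | Node of node
and node = { value : value; left : tree; right : tree }

(* min_elt t is the least element of t: keep following left children.
   Raises Not_found on the empty tree (an assumption of this sketch). *)
let rec min_elt (t : tree) : value =
  match t with
  | Empty -> raise Not_found
  | Node { value; left = Empty; _ } -> value
  | Node { left; _ } -> min_elt left

(* fold_range f init t lo hi folds f over the elements v of t with
   lo <= v <= hi, in ascending order. Subtrees that cannot contain
   values in the range are never visited. *)
let rec fold_range f init t lo hi =
  match t with
  | Empty -> init
  | Node { value = v; left; right } ->
    let acc =
      if compare lo v < 0 then fold_range f init left lo hi else init in
    let acc =
      if compare lo v <= 0 && compare v hi <= 0 then f (v, acc) else acc in
    if compare v hi < 0 then fold_range f acc right lo hi else acc

let leaf v = Node { value = v; left = Empty; right = Empty }

let t =
  Node { value = 50;
         left = Node { value = 25; left = leaf 10; right = leaf 30 };
         right = Node { value = 75; left = leaf 60; right = leaf 90 } }

(* Collect the elements in [25, 60]; consing reverses the visiting order,
   so the result is reversed back into ascending order. *)
let in_range = List.rev (fold_range (fun (v, acc) -> v :: acc) [] t 25 60)
```

Because whole subtrees outside the range are pruned, the cost is proportional to the height of the tree plus the number of elements reported.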

Adding an element is similar: you perform a lookup until you find the empty node that should contain the value. This is a nondestructive update, so as the recursion completes, a new tree is constructed that is just like the old one except that it has a new node (if needed). In code:

    let rec add t x =
      match t with
      | Empty -> Node {value = x; left = Empty; right = Empty}
      | Node {value = x'; left = l; right = r} ->
        let c = compare x x' in
        if c = 0 then t
        else if c > 0 then Node {value = x'; left = l; right = add r x}
        else (* c < 0 *) Node {value = x'; left = add l x; right = r}
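To make the nondestructive update concrete, here is a small self-contained sketch, assuming integer values. The names `t1`, `t2`, and `shares_left` are illustrative. It shows that the old tree is unchanged after an `add`, and that the new tree shares every subtree off the insertion path with the old one.

```ocaml
(* Assumption for this sketch: values are ints, ordered by compare. *)
type value = int
type tree = Empty | Node of node
and node = { value : value; left : tree; right : tree }

let rec add (t : tree) (x : value) : tree =
  match t with
  | Empty -> Node { value = x; left = Empty; right = Empty }
  | Node { value = v; left; right } ->
    let c = compare x v in
    if c = 0 then t
    else if c > 0 then Node { value = v; left; right = add right x }
    else Node { value = v; left = add left x; right }

let rec contains (t : tree) (x : value) : bool =
  match t with
  | Empty -> false
  | Node { value = v; left; right } ->
    let c = compare x v in
    if c = 0 then true
    else if c < 0 then contains left x
    else contains right x

let t1 = List.fold_left add Empty [50; 25; 75]
let t2 = add t1 60   (* t1 itself is left untouched *)

(* Physical equality (==) shows that the left subtree of t2 is the very
   same heap object as the left subtree of t1: only the nodes on the
   path from the root to the new node were rebuilt. *)
let shares_left =
  match t1, t2 with
  | Node n1, Node n2 -> n1.left == n2.left
  | _ -> false
```

Only the nodes on the search path are allocated anew, so an `add` costs time and space proportional to the length of that path.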

What is the running time of those operations? Since `add` is just a lookup with an extra node creation, we focus on the lookup operation. Clearly, the running time of lookup is *O*(*h*), where *h* is the height of the tree. What's the worst-case height of a tree? It occurs when all *n* nodes lie on a single long branch (imagine adding the numbers 1,2,3,4,5,6,7 in order into a binary search tree), which gives *h* = *n* − 1. So the worst-case running time of lookup is still *O*(*n*) (for *n* the number of nodes in the tree).

Some useful code resources:

- An implementation of binary search trees that includes both add and remove operations.
- Support for printing out trees

What is a good shape for a tree that would allow for fast lookup? A
**perfect binary tree** has the largest number of nodes n for
a given height h: n = 2^{h+1}-1. Therefore h = lg(n+1)-1 = O(lg n).

             ^                   50
             |               /        \
             |           25              75
    height=3 |         /    \          /    \
        n=15 |       10      30      60      90
             |      /  \    /  \    /  \    /  \
             V     4   12  27  40  55  65  80  99

If a tree with *n* nodes is kept balanced,
its height is *O*(lg *n*), which leads
to a lookup operation running in time *O*(lg *n*).
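As a quick check of these numbers, the sketch below builds a balanced search tree over the integers `lo..hi` by always placing the middle value at the root, then measures its size and height (in edges). The function `build` is a hypothetical helper for illustration only; it is not how a balanced tree is maintained under insertion.

```ocaml
type tree = Empty | Node of node
and node = { value : int; left : tree; right : tree }

(* build lo hi is a BST holding the integers lo..hi (inclusive), formed by
   placing the middle value at the root and recursing on each half. *)
let rec build (lo : int) (hi : int) : tree =
  if lo > hi then Empty
  else
    let mid = lo + (hi - lo) / 2 in
    Node { value = mid;
           left = build lo (mid - 1);
           right = build (mid + 1) hi }

let rec size (t : tree) : int =
  match t with
  | Empty -> 0
  | Node { left; right; _ } -> 1 + size left + size right

(* Height in edges; the empty tree has height -1. *)
let rec height (t : tree) : int =
  match t with
  | Empty -> -1
  | Node { left; right; _ } -> 1 + max (height left) (height right)

let t15 = build 1 15       (* a perfect tree: n = 2^(3+1) - 1 *)
let t1000 = build 1 1000   (* balanced, though not perfect *)
```

With 15 values the tree is perfect and has height 3, matching n = 2^{h+1}-1; with 1000 values the height is 9, far below the 999 of a single long branch.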

How can we keep a tree balanced? It can become unbalanced during element addition or deletion. Most approaches add an element just as in a normal binary search tree and then perform some kind of tree surgery to rebalance the tree. Similarly, element deletion proceeds as in a binary search tree, followed by some corrective rebalancing action. Examples of balanced binary search tree data structures include

- AVL (or height-balanced) trees (1962)
- 2-3 trees (1970's)
- Red-black trees

In each of these, we ensure asymptotic complexity of O(lg n) by enforcing a stronger invariant on the data structure than just the binary search tree invariant.

Red-black trees are a fairly simple and very efficient data structure for
maintaining a balanced binary tree. The idea is to strengthen the rep invariant
so a tree has height logarithmic in n. To help enforce the invariant, we color
each node of the tree either *red* or *black*. Where it matters, we
consider the color of an empty tree to be black.

    type color = Red | Black
    type tree = Empty | Node of node
    and node = {value : value; left : tree; right : tree; color : color}
    type set = tree

Here are the new conditions we add to the binary search tree rep invariant:

- No red node has a red parent.
- Every path from the root to an empty node has the same number of black
nodes: the **black height** of the tree. Call this BH.

If a tree satisfies these two conditions, it must also be the case that every subtree of the tree also satisfies the conditions. If a subtree violated either of the conditions, the whole tree would also.
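Both conditions can be checked in a single recursive pass, which also shows why they hold for every subtree. This is a minimal sketch assuming integer values; `black_height`, `color_of`, and the sample trees are illustrative names, and the ordering invariant is deliberately not checked here.

```ocaml
type color = Red | Black
type tree = Empty | Node of node
and node = { value : int; left : tree; right : tree; color : color }

(* The empty tree counts as black. *)
let color_of (t : tree) : color =
  match t with
  | Empty -> Black
  | Node n -> n.color

(* black_height t is Some bh if t satisfies both red-black conditions and
   has black height bh, and None if either condition is violated. *)
let rec black_height (t : tree) : int option =
  match t with
  | Empty -> Some 0
  | Node { color; left; right; _ } ->
    let clash =
      color = Red && (color_of left = Red || color_of right = Red) in
    (match black_height left, black_height right with
     | Some l, Some r when l = r && not clash ->
       Some (if color = Black then l + 1 else l)
     | _ -> None)

let node color left value right = Node { color; left; value; right }

(* A black root with two red children: black height 1. *)
let ok = node Black (node Red Empty 1 Empty) 2 (node Red Empty 3 Empty)

(* Violates condition 1: red node 1 has a red child 0. *)
let bad_red = node Black (node Red (node Red Empty 0 Empty) 1 Empty) 2 Empty

(* Violates condition 2: the left paths contain more black nodes. *)
let uneven = node Black (node Black Empty 1 Empty) 2 Empty
```

A checker like this is handy as an assertion while developing the insertion code, since a broken rotation usually shows up as a `None`.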

With this invariant, the longest possible path from the root to an empty
node would alternately contain red and black nodes; therefore it is at most
twice as long as the shortest possible path, which only contains black nodes. If
*n* is the number of nodes in the tree, the shortest path can be no longer than
the paths in a perfect binary tree, about lg *n*, so the longest
path cannot have a length greater than twice that: 2 lg *n*,
which is *O*(lg *n*). Therefore,
the tree has height *O*(lg *n*) and
the operations are all as asymptotically efficient as we could expect.

Another way to see this is to think about just the black nodes in the tree.
Suppose we snip all the red nodes out of the tree by connecting black nodes
to their closest black descendants. Then we have a tree whose leaves are all
at depth BH, and whose branching factor ranges between 2 and 4. Such a tree
must contain Ω(2^{BH}) nodes, and so must the whole
tree when we add the red nodes back in. If the number of nodes N is Ω(2^{BH}),
then the black height BH is O(lg N). But invariant 1 says that the longest
path has length at most h = 2BH. So h is O(lg N) too.
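Written out with explicit constants, the argument runs as follows. This is a sketch that counts h as the number of nodes on the longest root-to-empty path and uses the fact that a tree with branching factor at least 2 and all leaves at depth BH has at least 2^{BH} - 1 internal nodes:

```latex
\begin{align*}
N &\ge 2^{BH} - 1
  && \text{(the black nodes alone already number this many)} \\
BH &\le \lg (N + 1)
  && \text{(solving for } BH \text{)} \\
h &\le 2\,BH \le 2 \lg (N + 1) = O(\lg N)
  && \text{(no two consecutive red nodes on any path)}
\end{align*}
```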

How do we check for membership in red-black trees? Exactly the same way as for general binary trees.

More interesting is the `add` operation. We add by replacing the same empty node that a standard `add` into a binary search tree would replace. We also color the new node red to ensure that invariant #2 is preserved. However, we may destroy invariant #1 in doing so, by producing two red nodes, one the parent of the other. In order to restore this invariant we will need to consider not only the two red nodes, but also their parent (the grandparent of the lower red node). Otherwise, the red-red conflict cannot be fixed while preserving black height. The next figure shows all the possible cases that may arise:

            1             2             3             4

            Bz            Bz            Bx            Bx
           /  \          /  \          /  \          /  \
          Ry   d        Rx   d        a    Rz       a    Ry
         /  \          /  \               /  \          /  \
        Rx   c        a    Ry            Ry   d        b    Rz
       /  \               /  \          /  \               /  \
      a    b             b    c        b    c             c    d

Notice that in each of these trees, the values of the nodes in a,b,c,d must
have the same relative ordering with respect to x, y, and z:
a<x<b<y<c<z<d. Therefore, we can perform a local **tree
rotation** to restore the invariant locally, while possibly breaking
invariant 1 one level up in the tree:

             Ry
            /  \
          Bx    Bz
         /  \  /  \
        a   b  c   d

By rebalancing the tree at that level and at every level above it, we can restore invariant #1 locally and incrementally. In the end, we may end up with two red nodes, one of them the root and the other a child of the root; this we can easily correct by coloring the root black. The OCaml code (which really shows the power of pattern matching!) is as follows:

    let add (t : tree) (x : value) : tree =
      (* Definition: a tree t satisfies the "reconstruction invariant" if it is
       * black and satisfies the rep invariant, or if it is red and its children
       * satisfy the rep invariant and have the same black height. *)

      (* makeBlack t is a tree that satisfies the rep invariant.
       * Requires: t satisfies the reconstruction invariant.
       * Algorithm: make a tree identical to t but with a black root. *)
      let makeBlack (t : tree) : tree =
        match t with
        | Empty -> Empty
        | Node {color = _; value = v; left = l; right = r} ->
          Node {color = Black; value = v; left = l; right = r}
      in
      (* Construct the result of a red-black tree rotation. *)
      let rotate x y z a b c d : tree =
        Node {color = Red; value = y;
              left = Node {color = Black; value = x; left = a; right = b};
              right = Node {color = Black; value = z; left = c; right = d}}
      in
      (* balance t is a tree that satisfies the reconstruction invariant and
       * contains all the same values as t.
       * Requires: one of the children of t satisfies the rep invariant and
       * the other satisfies the reconstruction invariant. Both children
       * have the same black height. *)
      let balance (t : tree) : tree =
        match t with
        | (*1*) Node {color = Black; value = z;
                      left = Node {color = Red; value = y;
                                   left = Node {color = Red; value = x;
                                                left = a; right = b};
                                   right = c};
                      right = d} -> rotate x y z a b c d
        | (*2*) Node {color = Black; value = z;
                      left = Node {color = Red; value = x;
                                   left = a;
                                   right = Node {color = Red; value = y;
                                                 left = b; right = c}};
                      right = d} -> rotate x y z a b c d
        | (*3*) Node {color = Black; value = x;
                      left = a;
                      right = Node {color = Red; value = z;
                                    left = Node {color = Red; value = y;
                                                 left = b; right = c};
                                    right = d}} -> rotate x y z a b c d
        | (*4*) Node {color = Black; value = x;
                      left = a;
                      right = Node {color = Red; value = y;
                                    left = b;
                                    right = Node {color = Red; value = z;
                                                  left = c; right = d}}} ->
          rotate x y z a b c d
        | _ -> t
      in
      (* walk t adds x into t, returning a tree that satisfies the
       * reconstruction invariant. *)
      let rec walk (t : tree) : tree =
        match t with
        | Empty -> Node {color = Red; value = x; left = Empty; right = Empty}
        | Node {color = c; value = v; left = l; right = r} ->
          let cmp = compare x v in
          if cmp = 0 then t
          else if cmp < 0 then
            balance (Node {color = c; value = v; left = walk l; right = r})
          else (* cmp > 0 *)
            balance (Node {color = c; value = v; left = l; right = walk r})
      in
      makeBlack (walk t)

This code walks back up the tree from the point of insertion, fixing the
invariants at every level. At red nodes we don't try to fix the invariant; we
let the recursive walk go back until a black node is found. When the walk
reaches the top, the color of the root node is restored to black, which is needed
if `balance` rotates the root.
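To see the rebalancing pay off, the sketch below inserts 1 through 1000 in ascending order, the input that degrades a plain binary search tree into a single branch. It is a compact variant and not the code above: it assumes integer values, uses a tuple constructor instead of the record type, and folds the four cases into one or-pattern (a formulation commonly attributed to Okasaki).

```ocaml
(* Assumption: a compact tuple representation, chosen only to keep the
   sketch short; the four cases and the rotation are those of the figure. *)
type color = Red | Black
type tree = Empty | Node of color * tree * int * tree

(* balance rewrites a black node with a red child and red grandchild into
   a red node with two black children; any other node is rebuilt as is. *)
let balance = function
  | Black, Node (Red, Node (Red, a, x, b), y, c), z, d
  | Black, Node (Red, a, x, Node (Red, b, y, c)), z, d
  | Black, a, x, Node (Red, Node (Red, b, y, c), z, d)
  | Black, a, x, Node (Red, b, y, Node (Red, c, z, d)) ->
    Node (Red, Node (Black, a, x, b), y, Node (Black, c, z, d))
  | color, l, v, r -> Node (color, l, v, r)

let add (t : tree) (x : int) : tree =
  let rec walk = function
    | Empty -> Node (Red, Empty, x, Empty)
    | Node (color, l, v, r) as n ->
      if x < v then balance (color, walk l, v, r)
      else if x > v then balance (color, l, v, walk r)
      else n
  in
  (* Recolor the root black, as makeBlack does. *)
  match walk t with
  | Node (_, l, v, r) -> Node (Black, l, v, r)
  | Empty -> Empty

(* Height in edges; the empty tree has height -1. *)
let rec height = function
  | Empty -> -1
  | Node (_, l, _, r) -> 1 + max (height l) (height r)

let rec size = function
  | Empty -> 0
  | Node (_, l, _, r) -> 1 + size l + size r

let t = List.fold_left add Empty (List.init 1000 (fun i -> i + 1))
```

The same input gives a plain binary search tree a height of 999. Here the height must stay within the 2 lg(*n*+1) bound derived earlier, that is, below 20 for *n* = 1000.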

Removing an element from a red-black tree works analogously. We start with BST element removal and then do rebalancing. Here is code to remove elements from a binary search tree. The key is that when a node is removed, we simply splice it out if it has zero or one children; if it has two children, we replace its value with the next value in the tree, which must be found inside its right child, and remove that value from the right child.

    (* remove t x is a new BST like t but with value x removed, if it is there. *)
    let rec remove t x =
      match t with
      | Empty -> Empty (* not found *)
      | Node {value = x'; left = l; right = r} ->
        let c = compare x x' in
        if c > 0 then Node {value = x'; left = l; right = remove r x}
        else if c < 0 then Node {value = x'; left = remove l x; right = r}
        else begin
          match (l, r) with
          | (_, Empty) -> l
          | (Empty, _) -> r
          | _ ->
            let (nx, nr) = remove_first r in
            Node {value = nx; left = l; right = nr}
        end

    (* remove_first n is (x, n') where x is the lowest value in n, and n' is
       n with x removed. Requires: n is not Empty *)
    and remove_first n =
      match n with
      | Empty -> raise (Failure "remove_first precondition")
      | Node {value = x; left = Empty; right = r} -> (x, r)
      | Node {value = x; left = l; right = r} ->
        let (nx, nl) = remove_first l in
        (nx, Node {value = x; left = nl; right = r})

Balancing the tree during removal from a red-black tree requires considering more cases. Deleting a black element from the tree creates the possibility that some path in the tree has too few black nodes, breaking the black-height invariant (2); the solution is to consider that path to contain a "doubly-black" node. A series of tree rotations can then eliminate the doubly-black node by propagating the "blackness" up until a red node can be converted to a black node, or until the root is reached and it can be changed from doubly-black to black without breaking the invariant.