Balanced BSTs. Red-Black Trees.

We argued last time that a binary search tree can degenerate into a simple list when the input is ordered. Here is the example we discussed:

    1
     \          A degenerate tree (essentially a list) results when we insert
      2         an ordered sequence (inserted sequence: 1, 2, 3, 4, 5).
       \
        3
         \
          4
           \
            5

As it happens, long runs of ordered values are not uncommon (think, for example, of a system that records transactions and assigns them monotonically increasing transaction identifiers). Can we protect against such sequences? If we know the input is (mostly) sorted, and all the input is available at the same time (or if we can buffer it before inserting it into the tree) there are several things we can do:

The first possibility is to apply the following algorithm recursively: find the middle of the input sequence and insert the corresponding value into the tree; then repeat the procedure with the left half of the sequence (from the first element up to the element immediately preceding the middle element that we just inserted); then apply the procedure to the right half of the sequence. There are a few details to work out (e.g. what is the "middle" of a sequence containing an even number of values?), but overall this is a very simple algorithm. Here is the tree that results when we insert the sequence 1, 2, 3, 4, 5:

        3
       / \        Note: If the "middle" index falls between two integers,
      2   4       we always round down.
     /     \
    1       5

This looks much better - and would look even better for a longer input sequence. Try it on an example!
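The recursive procedure above can be sketched in SML as a function that turns a sorted list into an insertion order. This is only an illustration: the name `balancedOrder` is made up for this example, and the rounding convention (middle index rounded down, counting from zero) is an assumption, so the exact shape of the resulting tree may differ from a drawing that rounds differently.

```sml
(* Sketch: given an already sorted list, produce an order of insertion that
 * yields a balanced BST. balancedOrder is a hypothetical helper; the middle
 * index is rounded down (an assumption of this example). *)
fun balancedOrder [] = []
  | balancedOrder xs =
      let
        val mid    = (List.length xs - 1) div 2   (* round down *)
        val left   = List.take (xs, mid)          (* elements before the middle *)
        val middle = List.nth (xs, mid)
        val right  = List.drop (xs, mid + 1)      (* elements after the middle *)
      in
        (* insert the middle first, then the left half, then the right half *)
        middle :: (balancedOrder left @ balancedOrder right)
      end

val insertionOrder = balancedOrder [1, 2, 3, 4, 5]   (* [3, 1, 2, 4, 5] *)
```

Inserting the values in the order `3, 1, 2, 4, 5` with the ordinary BST insertion gives a tree in which no node is more than two edges away from the root.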

Another possible approach is to randomly permute the input sequence, then insert the values in the final (permuted) order. This reduces the probability of ending up with a very disadvantageous insertion order. In particular, if an "adversary" feeds our program a perfectly ordered sequence, a random permutation makes it unlikely that this ordering is preserved when inserting into the tree. If the input sequence has length `n`, contains distinct values, and each permutation is equally likely, then the probability of obtaining one of the two perfectly ordered sequences (ascending or descending) after permutation is `2/n!` - a very small number indeed. We have ignored some issues in this argument (e.g. the fact that a "long" ordered run within the permuted sequence is already enough to produce a poorly balanced tree), but it is easy to accept that randomly permuting the input sequence "mixes it up" with high probability and leads to the creation of bushy trees.

        4         Input sequence 1, 2, 3, 4, 5 was randomly permuted to
       / \        4, 1, 2, 5, 3, then inserted into the tree.
      1   5
       \
        2
         \
          3

Now, it is often impractical or impossible to wait and buffer the entire input. For example, the input sequence can be very large and impossible to keep in memory, or the input values might accumulate slowly over time, while intermediate results must be available continuously.

What is a good shape for a tree that would allow for fast lookup? A balanced, "bushy" binary search tree keeps as many values as possible close to the root; for example:

             ^               ----- 50 -----
             |              /              \
             |            25                75
    height=4 |           /  \              /  \
             |         10    30          60    90
             |        /  \  /  \        /  \  /  \
             V       4  12 27  40      55  65 80  99

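A full binary tree of height `h` has `2^{h+1}-1` nodes, where `h` counts the edges on the longest path from the root to a leaf (the tree above, drawn with four levels, has `h = 3` and `2^{4}-1 = 15` nodes). Thus if a tree has height `h` and has `n` nodes, we must have `n <= 2^{h+1}-1`, and therefore `h >= log_{2}(n+1) - 1`.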

Ideally, we would like to insert elements into a binary search tree in any order they come, while keeping the tree balanced. How can we keep a tree balanced? Many techniques involve inserting an element just like in a normal binary search tree, followed by some kind of tree surgery to rebalance the tree. For example:

- AVL (or height-balanced) trees (1962);
- 2-3 trees (1970's);
- Red-black trees;
- Splay trees.

Put simply, a red-black tree is a binary search tree in which each node is colored red or black. Carefully chosen restrictions are imposed on the distribution of colors, which then implicitly limit the amount of "imbalance" that can occur (see below).

    datatype color = Red | Black
    datatype 'a rbtree = Empty
                       | Node of {color: color, value: 'a,
                                  left: 'a rbtree, right: 'a rbtree}

Here are the new conditions we add to the binary search tree invariant:

- No red node has a red parent;
- Every path from the root to an empty node has the same number of black nodes;
- The root of a red-black tree is always black.

Note that empty nodes are always considered to be black. If a tree satisfies the first two conditions, then every subtree of the tree also satisfies them: if a subtree violated either of these conditions, the whole tree would as well. (The third condition need not hold for subtrees, whose roots may be red.)

It is clear that if a subtree of the entire tree contains a red node whose parent is also red, then condition (1) is violated both for the subtree and for the entire tree (the converse statement is also true). Consider now a node `N` in the tree, and the subtree rooted at `N`. If the entire tree satisfies condition (2), then any path from the root that reaches `N` and continues on to a leaf of the subtree rooted at `N` has the same number of black nodes. All these paths share a common "prefix" (the path from the root to `N`); removing it subtracts the same number of black nodes from each of them, so all paths from `N` to a leaf in the subtree rooted at `N` contain the same number of black nodes. Try to establish the converse property: if all subtrees of a tree have the property that the paths from the root of the subtree to one of its leaves contain the same number of black nodes ("same" within each subtree separately, not globally), then property (2) will hold for the entire tree.
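The conditions can also be checked mechanically, which is handy when testing an implementation. The following is a minimal sketch, not part of the original code: the helper names `isRed`, `noRedRed`, `blackHeight` and `isRedBlack` are invented for this example, and the BST ordering itself is not checked.

```sml
datatype color = Red | Black
datatype 'a rbtree = Empty
                   | Node of {color: color, value: 'a,
                              left: 'a rbtree, right: 'a rbtree}

(* Empty nodes count as black. *)
fun isRed (Node {color=Red, ...}) = true
  | isRed _ = false

(* Condition (1): no red node has a red child (equivalently, a red parent). *)
fun noRedRed Empty = true
  | noRedRed (t as Node {left, right, ...}) =
      not (isRed t andalso (isRed left orelse isRed right))
      andalso noRedRed left andalso noRedRed right

(* Condition (2): SOME k if every path from this node down to an empty node
 * passes through exactly k black nodes, NONE otherwise. *)
fun blackHeight Empty = SOME 0
  | blackHeight (Node {color, left, right, ...}) =
      case (blackHeight left, blackHeight right) of
        (SOME l, SOME r) =>
          if l = r then SOME (l + (if color = Black then 1 else 0)) else NONE
      | _ => NONE

(* Conditions (1), (2) and (3) together. *)
fun isRedBlack t =
  not (isRed t) andalso noRedRed t andalso isSome (blackHeight t)

(* A valid tree: black root with two red children. *)
val ok  = Node {color=Black, value=2,
                left=Node {color=Red, value=1, left=Empty, right=Empty},
                right=Node {color=Red, value=3, left=Empty, right=Empty}}

(* An invalid tree: the right path has one more black node than the left. *)
val bad = Node {color=Black, value=1, left=Empty,
                right=Node {color=Black, value=2, left=Empty, right=Empty}}
```

Because `blackHeight` is computed bottom-up for every subtree, it mirrors the argument above: the whole tree satisfies condition (2) exactly when each subtree does and the two children of every node agree.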

With these invariants, the only way to lengthen one path from the root to a leaf with respect to other similar paths is to insert red nodes (we cannot insert black nodes on a single path, as their number is fixed by condition (2) - the only possibility is to simultaneously increase the number of black nodes on `all` paths). The longest possible path from the root to a leaf would start with a black node (the root itself), and it would alternate between black and red nodes (we can't have two successive red nodes on a path due to restriction (1)). The shortest possible path consists of black nodes only. Hence the longest path in a red-black tree cannot be more than twice as long as the shortest path - such trees cannot get very imbalanced. We will return to this issue later, probably in section.
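The observation above can be turned into a concrete bound on the height. The following is a sketch of the standard argument; the symbols $b$ (the number of black nodes on every path from the root to an empty node) and $\ell$ (the number of nodes on the longest such path) are introduced here only for illustration.

```latex
% Every root-to-empty path contains at least b nodes, so the top b levels
% of the tree are completely filled:
n \;\ge\; 2^{b} - 1 .
% The root is black and red nodes never appear consecutively, so a path
% contains at most as many red nodes as black nodes:
\ell \;\le\; 2b .
% Combining the two inequalities:
n \;\ge\; 2^{\ell/2} - 1
\quad\Longrightarrow\quad
\ell \;\le\; 2\log_{2}(n+1) .
```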

Now, every red-black tree is a BST, hence checking for the existence of a key (value) in the tree involves the same procedure as for regular BSTs. The function below is identical with the BST search function we discussed previously, except for the fact that we now represent a tree node using a record, not a tuple, and that we have added coloring to the nodes. Strictly speaking, we can ignore coloring altogether during search:

    fun contains (n: 'a, t: 'a rbtree, cmp: 'a * 'a -> order): bool =
      case t of
        Empty => false
      | Node {color=_, value, left, right} =>
          case cmp (value, n) of
            EQUAL => true
          | GREATER => contains (n, left, cmp)
          | LESS => contains (n, right, cmp)
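As a small usage sketch, the search function can be tried on a hand-built tree. The tree `small` and the result names below are made up for this example; `Int.compare` from the Basis Library plays the role of `cmp`.

```sml
datatype color = Red | Black
datatype 'a rbtree = Empty
                   | Node of {color: color, value: 'a,
                              left: 'a rbtree, right: 'a rbtree}

(* Same search as above; the color field is ignored. *)
fun contains (n: 'a, t: 'a rbtree, cmp: 'a * 'a -> order): bool =
  case t of
    Empty => false
  | Node {color=_, value, left, right} =>
      case cmp (value, n) of
        EQUAL => true
      | GREATER => contains (n, left, cmp)
      | LESS => contains (n, right, cmp)

(* A three-node red-black tree: black 2 with red children 1 and 3. *)
val small = Node {color=Black, value=2,
                  left=Node {color=Red, value=1, left=Empty, right=Empty},
                  right=Node {color=Red, value=3, left=Empty, right=Empty}}

val found   = contains (3, small, Int.compare)   (* true *)
val missing = contains (7, small, Int.compare)   (* false *)
```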

Inserting nodes in a red-black tree is more complicated - not only do we have to find the right place of the new data element in the BST that the red-black tree "contains," but we must also make sure that the restrictions imposed on node coloring are not violated.

We proceed in two steps:

- First, we use the regular BST insert procedure to insert the new value in the tree. We color the new node red.
- Second, if needed, we perform a sequence of transformations on the tree such that the restrictions on node coloring are enforced.

It is clear that by adding a red node at the "bottom" of the tree we might end up violating color constraint (1). To see this, think of a long sequence of insertions - even if we started with an all-black tree, successive insertions will add more and more red leaves. At some point we will run out of black leaves, and we will end up "attaching" a red leaf to an already existing red node.

To reestablish the node-coloring invariants we will apply a series of structural transformations to the tree whose purpose at every step is to locally eliminate the "two red nodes" conflict. We replace the two red nodes with a single red node that is "pushed up" to a higher level in the tree (i.e. closer to the root). At this higher level, the "two red nodes" violation might reoccur (one red node might have been already present, and we have just "pushed up" a second node). Should such a violation occur again, we repeat our restructuring procedure. At worst, the conflict might percolate up to the root of the entire tree - should this be the case, we extinguish the "two red nodes" violation by changing the color of the root (which has now become red) to black. This eliminates the conflict related to the red nodes, and it simultaneously increases the "black length" of all paths in the tree, preserving invariant number (2). Invariant number (3) is also trivially reestablished.

We illustrate below all the cases that might occur. In each diagram the top node is black, and the two labeled nodes below it are red:

           1            2            3            4

           Z            Z            X            X
          / \          / \          / \          / \
         Y   d        X   d        a   Z        a   Y
        / \          / \              / \          / \
       X   c        a   Y            Y   d        b   Z
      / \              / \          / \              / \
     a   b            b   c        b   c            c   d

The diagrams above are easy to remember and reproduce if you keep in mind the following:

- Assuming that we started with a red-black tree, the insertion of a single new node creates at most one violation of the "no parent of a red node is a red node" restriction. Thus we can have at most one occurrence of two successive red nodes on paths from the root to a leaf. We will see that the structural changes that we impose on the tree preserve this property at every intermediate step.
- Unless we have already reached the root, there always exists a black node that is the parent of the top red node. With respect to this black node and its two red descendants there are only four tree structures that can occur; moreover, diagrams (3) and (4) are the mirror images of (2) and (1), respectively (the symmetry is true with respect to structure, not notation!).
- We use lower-case letters `a`, `b`, `c`, and `d` to represent possibly empty subtrees.
- We chose the labeling of the nodes and subtrees so that in all four cases the following inequalities hold: `a<X<b<Y<c<Z<d`. Here `X`, `Y`, and `Z` represent the values in the respective nodes, while `a`, `b`, `c`, `d` represent all the values in the respective (possibly empty) subtrees (in turn, each of these subtrees is a BST).

Keeping in mind the observations above, the `local` transformation represented below can be applied in all four cases. This transformation eliminates the "two red nodes" problem locally by restructuring the tree and by pushing the resulting red node higher in the tree. If there is a red node at the level immediately above the levels represented in diagrams (1) to (4) above, the conflict reoccurs, and the procedure must be repeated there as well.

           Y
          / \
         X   Z
        / \ / \
       a  b c  d

Here `Y` is red, while `X` and `Z` are black.

It is obvious that the transformation we propose has the potential of solving the invariant violation induced by the presence of too many red nodes, but what about the other invariants a red-black tree must obey? By reasoning carefully, you can convince yourselves that no violations of the invariant related to the number of black nodes on a path will occur.

Consider, for example, the case of a path that starts at the root of the tree, passes through node `Z` of diagram `1` above, and ends up in a leaf in subtree `a`. Such a path will go through `k` black nodes from the root of the whole tree until just before reaching node `Z`. Then the path will touch nodes `Z`, `Y`, and `X`, out of which only one node (specifically, `Z`) is black, then it will follow a path in subtree `a`; this path will contain a further `r` black nodes. The path considered will have a total of `k + r + 1` black nodes.

Consider now the path from the root of the whole tree to the same leaf in subtree `a` after we applied the transformation. The number of black nodes from the root until just before node `Y` has not changed, as the transformation we made did not affect this part of the path at all; this number is still `k`. The number of black nodes touched within subtree `a` has not changed, either: it is still `r`. On the new path `Y - X` through the changed part of the tree we still encounter exactly one black node (this time `X`). We thus reach the conclusion that the number of black nodes has not changed on any path that starts at the root of the tree and ends up at a leaf in subtree `a`. Similar considerations can be applied to all leaf nodes in subtrees `b`, `c`, and `d`, respectively.

The only case in which we do not have a black node above the two red nodes that violate the invariant is when we have already reached the root. In that case there is no need for restructuring - we can just change the color of the root to black, thereby increasing the number of black nodes on all paths from the root to a leaf by one (note that there is no other way to increase this number in a red-black tree).

The `SML` code (which really shows the power of pattern matching!) is as follows:

    fun insert (n: 'a, t: 'a rbtree, cmp: 'a * 'a -> order): 'a rbtree =
      let
        (* Definition: a tree t satisfies the "reconstruction invariant" if it is
         * black and satisfies the rep invariant, or if it is red and its children
         * satisfy the rep invariant. *)

        (* makeBlack(t) is a tree that satisfies the rep invariant.
         * Requires: t satisfies the reconstruction invariant.
         * Algorithm: Make a tree identical to t but with a black root. *)
        fun makeBlack (t: 'a rbtree): 'a rbtree =
          case t of
            Empty => Empty
          | Node {color=_, value, left, right} =>
              Node {color=Black, value=value, left=left, right=right}

        (* Construct the result of a red-black tree rotation. *)
        fun rotate (x: 'a, y: 'a, z: 'a,
                    a: 'a rbtree, b: 'a rbtree, c: 'a rbtree, d: 'a rbtree): 'a rbtree =
          Node {color=Red, value=y,
                left=Node {color=Black, value=x, left=a, right=b},
                right=Node {color=Black, value=z, left=c, right=d}}

        (* balance(t) is a tree that satisfies the reconstruction invariant and
         * contains all the same values as t.
         * Requires: the children of t satisfy the reconstruction invariant. *)
        fun balance (t: 'a rbtree): 'a rbtree =
          case t of
            (*1*) Node {color=Black, value=z,
                        left=Node {color=Red, value=y,
                                   left=Node {color=Red, value=x, left=a, right=b},
                                   right=c},
                        right=d} => rotate (x,y,z,a,b,c,d)
          | (*2*) Node {color=Black, value=z,
                        left=Node {color=Red, value=x,
                                   left=a,
                                   right=Node {color=Red, value=y, left=b, right=c}},
                        right=d} => rotate (x,y,z,a,b,c,d)
          | (*3*) Node {color=Black, value=x,
                        left=a,
                        right=Node {color=Red, value=z,
                                    left=Node {color=Red, value=y, left=b, right=c},
                                    right=d}} => rotate (x,y,z,a,b,c,d)
          | (*4*) Node {color=Black, value=x,
                        left=a,
                        right=Node {color=Red, value=y,
                                    left=b,
                                    right=Node {color=Red, value=z, left=c, right=d}}} =>
                    rotate (x,y,z,a,b,c,d)
          | _ => t (* no violation of invariants *)

        (* Insert n into t; returns a tree that satisfies the reconstruction
         * invariant. *)
        fun walk (t: 'a rbtree): 'a rbtree =
          case t of
            Empty => Node {color=Red, value=n, left=Empty, right=Empty}
          | Node {color, value, left, right} =>
              case cmp (value, n) of
                EQUAL => t
              | GREATER => balance (Node {color=color, value=value,
                                          left=walk left, right=right})
              | LESS => balance (Node {color=color, value=value,
                                       left=left, right=walk right})
      in
        makeBlack (walk (t))
      end

This code walks back up the tree from the point of insertion, fixing the invariants at every level. At red nodes we don't try to fix the invariant; we let the recursive walk go back until a black node is found. When the walk reaches the top, the color of the root node is restored to black, which is needed if `balance` rotates the root.
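To see the effect of rebalancing, one can insert a long ordered sequence and measure the resulting tree. The block below is a condensed restatement of the insertion code so that it can be run on its own; the helpers `node` and `height` (which counts levels, not edges) and the test value `sorted` are assumptions of this sketch, not part of the original code.

```sml
datatype color = Red | Black
datatype 'a rbtree = Empty
                   | Node of {color: color, value: 'a,
                              left: 'a rbtree, right: 'a rbtree}

(* Hypothetical shorthand for building a node. *)
fun node (c, l, v, r) = Node {color=c, value=v, left=l, right=r}

(* Common result of the four cases: red y above black x and black z. *)
fun rotate (x, y, z, a, b, c, d) =
  node (Red, node (Black, a, x, b), y, node (Black, c, z, d))

fun balance t =
  case t of
    Node {color=Black, value=z, right=d,
          left=Node {color=Red, value=y, right=c,
                     left=Node {color=Red, value=x, left=a, right=b}}} =>
      rotate (x, y, z, a, b, c, d)
  | Node {color=Black, value=z, right=d,
          left=Node {color=Red, value=x, left=a,
                     right=Node {color=Red, value=y, left=b, right=c}}} =>
      rotate (x, y, z, a, b, c, d)
  | Node {color=Black, value=x, left=a,
          right=Node {color=Red, value=z, right=d,
                      left=Node {color=Red, value=y, left=b, right=c}}} =>
      rotate (x, y, z, a, b, c, d)
  | Node {color=Black, value=x, left=a,
          right=Node {color=Red, value=y, left=b,
                      right=Node {color=Red, value=z, left=c, right=d}}} =>
      rotate (x, y, z, a, b, c, d)
  | _ => t

fun insert (n, t, cmp) =
  let
    fun walk Empty = node (Red, Empty, n, Empty)
      | walk (t as Node {color, value, left, right}) =
          (case cmp (value, n) of
             EQUAL => t
           | GREATER => balance (node (color, walk left, value, right))
           | LESS => balance (node (color, left, value, walk right)))
  in
    (* recolor the root black *)
    case walk t of
      Empty => Empty
    | Node {value, left, right, ...} => node (Black, left, value, right)
  end

(* Number of levels on the longest root-to-leaf path. *)
fun height Empty = 0
  | height (Node {left, right, ...}) = 1 + Int.max (height left, height right)

(* Insert 1, 2, ..., 1000 in increasing order. *)
val sorted = List.foldl (fn (k, acc) => insert (k, acc, Int.compare))
                        Empty (List.tabulate (1000, fn i => i + 1))
val levels = height sorted
```

An unbalanced BST would need 1000 levels for this input. Here `levels` can be at most 19, since `2log_{2}(1001)` is slightly below 20.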

It can be proven that a red-black tree with `n` nodes has a height of **at most** `2log_{2}(n+1)`. As red-black trees are binary search trees, we can combine this upper bound with the lower bound that follows from `n <= 2^{h+1}-1`: `log_{2}(n+1) - 1 <= h <= 2log_{2}(n+1)`.

We can break up the problem of node deletion into three subproblems:

- Given the identifying information (the key `k`) of the node that must be deleted, find the respective node. This is easy, and it relies on the standard `BST` search algorithm. Let's call this node `Z`.
- Eliminate one node from the tree. As we will see below, it is sometimes more sensible not to eliminate the actual node that holds the information we want to get rid of - it is often simpler to physically eliminate a node different from `Z`, say `Y`, and transfer the contents of `Y` into `Z`. This is not a completely novel idea - a somewhat similar issue arises when we eliminate a node from a regular `BST` (you have addressed this problem in section).
- If the red-black tree invariants don't hold after step (2), apply local transformations to the tree until the invariants are reestablished. Besides reestablishing the invariants, these transformations will also rebalance the tree.

Before we continue, let us note that the physical elimination of a red node will never break any red-black tree invariant (do you understand why?). Thus we only need to worry about rebalancing if the node we eliminated was black.

Assume that we have performed the search, and we found node `Z` which holds the key we are looking for. In the following, we will use capital letters to denote nodes and lowercase letters to denote subtrees. Thus `Z` is a node, while `a`, `b`, `c` are subtrees. We denote the parent of node `Z` with `pZ`. We use `[]` to denote a subtree that is known to be empty.

Here are some of the cases that might arise (here we assume that `b` and `c`, if shown, are not empty):

       A1            A2            A3            A4
       |             |             |             |
       pZ            pZ            pZ            pZ
      /  \          /  \          /  \          /  \
     a    Z        a    Z        a    Z        a    Z
         / \           / \           / \           / \
        [] []         []  c         b  []         b   c

Cases `A1` to `A4` represent some of the cases that can arise after we have identified `Z` (the node that must be deleted), but before we have actually removed its associated key from the tree. Note that not all possible cases have been represented - there are four analogous cases when `Z` is in the left subtree of `pZ`.

Eliminating `Z` is trivial in case `A1`, and very simple in cases `A2` and `A3`. Here is the outcome of the elimination:

       B1          B2          B3
       |           |           |
       pZ          pZ          pZ
      /           /  \        /  \
     a           a    c      a    b

Now, if `Z` has two non-empty subtrees (case `A4`), the situation is more complicated. If we just eliminate `Z`, we are left with one location to attach a subtree to (the right subtree of `pZ`), but with two subtrees (`b`, `c`) that have to be reattached to the tree. We can avoid this complication if we don't physically eliminate `Z`, but instead find a node somehow related to `Z`, let's call it `Y`, which can be more easily eliminated. We can't just drop node `Y`, because it is actually the information in `Z` that we don't need anymore. We also know that it is easy to cut out nodes that have at most one non-empty subtree (see cases `A1-A3` above).

It turns out that if we choose `Y` carefully, then we can preserve the ordering properties of the tree when we transfer the information from node `Y` to node `Z`. Node `Z` is "greater" than `pZ` and all nodes in `b`, but "smaller" than any node in `c`. The node we eliminate in place of `Z` should also satisfy these restrictions. One good choice is to pick the node with minimum key from subtree `c` (alternatively, we can pick the node with maximum key from subtree `b`). Since `c` is not empty, we know that such a node exists, and moreover, we know that it has an empty left subtree (if the left subtree were non-empty, we could find a node with a smaller key). This minimum node in subtree `c` is `Y`, the node we can eliminate instead of `Z`.

The diagram below shows a subtree from which we eliminate node `Z`. Assuming that `a` is non-empty, we need to find the minimum node in the subtree rooted at 70; this is node `Y` (60). We transfer the data from `Y` into `Z`, then we remove `Y`. The left subtree of `Y` is empty (otherwise `Y` would not be a minimum node), thus we can "glue" the right subtree of `Y` into the place that node `Y` occupied with respect to `Y`'s parent (node 70).

             |                                |
        ___50(Z)____                     ___60(Z)____
       /            \                   /            \
      a           ___70___             a           ___70___
                 /        \                       /        \
              60(Y)        80                  65(X)        80
              /  \        /  \                 /  \        /  \
             []  65(X)   []  []               []  []      []  []
                 /  \
                []  []
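The structural part of this procedure (ignoring colors entirely) can be sketched on a plain BST. The tuple-based type `bst` with constructors `Leaf` and `Br`, and the functions `removeMin`, `delete` and `leaf`, are hypothetical names used only for this illustration; rebalancing is not attempted here.

```sml
(* A plain BST without colors; Leaf is the empty tree. *)
datatype 'a bst = Leaf | Br of 'a bst * 'a * 'a bst

(* removeMin t = (m, t'), where m is the smallest key in t and t' is t
 * without it. The minimum node has an empty left subtree, so its right
 * subtree is glued into the place the node occupied. *)
fun removeMin Leaf = raise Fail "removeMin: empty tree"
  | removeMin (Br (Leaf, v, r)) = (v, r)
  | removeMin (Br (l, v, r)) =
      let val (m, l') = removeMin l
      in (m, Br (l', v, r)) end

fun delete (k, t, cmp) =
  case t of
    Leaf => Leaf
  | Br (l, v, r) =>
      (case cmp (v, k) of
         GREATER => Br (delete (k, l, cmp), v, r)
       | LESS => Br (l, v, delete (k, r, cmp))
       | EQUAL =>
           (case (l, r) of
              (Leaf, _) => r        (* cases A1 and A2: splice in the right subtree *)
            | (_, Leaf) => l        (* case A3: splice in the left subtree *)
            | _ =>                  (* case A4: move the minimum Y of r into Z *)
                let val (y, r') = removeMin r
                in Br (l, y, r') end))

(* The example from the diagram, with a single node 40 standing in for a. *)
fun leaf v = Br (Leaf, v, Leaf)
val t0 = Br (leaf 40, 50, Br (Br (Leaf, 60, leaf 65), 70, leaf 80))
val t1 = delete (50, t0, Int.compare)
(* t1 = Br (leaf 40, 60, Br (leaf 65, 70, leaf 80)) *)
```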

The procedure outlined above chooses a node and patches up the tree such that the structural changes needed to preserve ordering of the leftover nodes are minimal. This, however, is not sufficient. Note that up to now, the line of reasoning that we followed is very close to that applicable to the elimination of a node from a regular `BST`.

Physically eliminating a red node is never a problem; however, the elimination of a black node can break the red-black tree invariants. Note, however, that the invariant is not always destroyed when we remove a black node (think of eliminating the last node in the tree - this is black, but the empty tree that results trivially satisfies the RBT invariants). In the typical case, however, the removal of a black node creates a "black node deficit" on some paths of the red-black tree.

Look at the specific example we provided above! The paths that suffer from the black-node deficit are those that pass through node `X` (the root of old `Y`'s right subtree). Now, if `X` used to be red, we can just change its color to black, and the deficit goes away. But what if `X` is already black? Well, then we have a problem and we need to do rebalancing.

One way to think about both cases (i.e. the case when `X` could have originally been red, and the case when it could have been black) is to consider that we add a unit of black color to node `X`, irrespective of what the prior color of `X` was. If `X` was red, a unit of black color makes it black; if it was black, then it becomes doubly black. The purpose of rebalancing is to relieve node `X` from its double black load by appropriately restructuring the tree and/or by recoloring its nodes.
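The "unit of black" bookkeeping can be made explicit with a three-valued color. This is only a sketch of the idea: the type `dcolor`, its constructors `R`, `B`, `BB`, and the function `addBlack` are hypothetical and do not appear in the code above.

```sml
(* Hypothetical color type used only while deleting:
 * R = red, B = black, BB = doubly black. *)
datatype dcolor = R | B | BB

(* Add one unit of black to a node's color. *)
fun addBlack R = B     (* a red node simply becomes black: the deficit is gone *)
  | addBlack B = BB    (* a black node becomes doubly black: rebalancing is needed *)
  | addBlack BB = raise Fail "addBlack: node is already doubly black"

(* Rebalancing is required exactly when the result is doubly black. *)
fun needsRebalancing c = (addBlack c = BB)
```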

Note that in case `A1` the elimination of `Z` does not lead to a black-node deficit, irrespective of the color of `Z` (in this case an entire path from the root to leaf `Z` is eliminated).

By going back to cases `A2`, and `A3` above, we can restate them in terms similar to case `A4`:

        A2               A3
        |                |
        pZ               pZ
       /  \             /  \
      a   Z=Y          a   Z=Y
          / \              / \
         []  c(X)       b(X)  []

Here the node that we physically eliminate (`Y`) is the same as the node that we "logically" want to delete (`Z`). Node `X` is the root of subtrees `c`, and `b`, respectively. Note that such a node `X` exists - otherwise `Z` would have no descendants, and we would be in case `A1`. Again, if `Z=Y` is red, we don't have a problem. If, however `Z` was black, then we have a black node deficit. If `X` is red, we change the color of `X` to black (and we stop); if `X` is already black we make it doubly black, and we restructure the tree to redistribute its double load.

We can thus treat all non-trivial cases in a similar manner for the purposes of rebalancing: if rebalancing is needed, we have a "doubly black" node, and we have to use tree restructurings and/or node recolorings to get rid of it.

The red-black tree deletion algorithm is notorious for the large number of cases that one has to consider when doing rebalancing. With the proper technique, however, we only need to consider four cases (and their mirror images).

The crucial idea is to consider that logically empty nodes actually consist of a regular black node that does not carry useful data, and that the left and right subtree of this node are regular empty nodes.

If we are given the following "logical" tree

         __50[B]__            [B] = black node
        /         \           [R] = red node
     40[R]        *[B]        *   = non-data carrying node
     /  \         /  \
    []   []      []   []

then we will consider that its actual representation is the one below:

            ___50[B]___
           /           \
        40[R]          *[B]
        /    \         /  \
     *[B]    *[B]     []   []
     /  \    /  \
    []  []  []  []

This convention is needed to reduce the number of distinct cases that we need to consider.

The four cases we are interested in, and the corresponding restructurings are the following:

          __B[B]__                                  __D[B]__
         /        \                                /        \
      A[x]       _D[R]_         case C1         _B[R]_      E[B]
      /  \      /      \    -------------->    /      \     /  \
     a    b   C[B]    E[B]                   A[x]    C[B]  e    f
              /  \    /  \                   /  \    /  \
             c    d  e    f                 a    b  c    d

          __B[i]__                                  __B[W]__
         /        \                                /        \
      A[x]       _D[B]_         case C2         A[B]       _D[R]_
      /  \      /      \    -------------->     /  \      /      \
     a    b   C[B]    E[B]                     a    b   C[B]    E[B]
              /  \    /  \                              /  \    /  \
             c    d  e    f                            c    d  e    f

          __B[i]__                                  __B[i]__
         /        \                                /        \
      A[x]       _D[B]_         case C3         A[x]        C[B]
      /  \      /      \    -------------->     /  \        /  \
     a    b   C[R]    E[B]                     a    b      c   D[R]
              /  \    /  \                                     /  \
             c    d  e    f                                   d   E[B]
                                                                  /  \
                                                                 e    f

          __B[i]__                                  __D[i]__
         /        \                                /        \
      A[x]       _D[B]_         case C4         _B[B]_      E[B]
      /  \      /      \    -------------->    /      \     /  \
     a    b   C[i']   E[R]                   A[B]    C[i'] e    f
              /  \    /  \                   /  \    /  \
             c    d  e    f                 a    b  c    d

As before, capital letters denote nodes. Regular node colors are indicated by letters `B` and `R` in brackets. The color of certain nodes is indifferent, in which case we indicate the color by `i` or `i'`. If a node is doubly black, its "color" is marked with `x` (by analogy with node `X` in the discussion of the algorithm). The color `W` of node `B` in case `C2` depends on the initial color `i` of the same node. If `B` was red, we make `W=B`, hence the double black color of node `A` is "spread out." If `B` was black, then node `B` becomes doubly black (`W=x`), and we need to continue with the restructuring at a higher level. The same letter denotes the same color whenever it occurs (except for `W`), even if it denotes an "indifferent" color.

Note that case `C1` restructures the tree so that one of cases `C2`, `C3`, or `C4` applies for sure. Also, note that case `C1` is the only case when the sibling of the doubly black node is red. The following three cases are characterized by the fact that the sibling of the doubly black node is black, and are distinguished by the color distribution of the sibling's children.

Case `C2` assures that the algorithm terminates (if `W = B`), or that it continues at a higher level (if `W=x`). Case `C3` is an intermediate step that reduces it to case `C4`. Finally, case `C4` redistributes the double color load of node `A` so that no further restructuring is needed.
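The way the four cases are told apart can be written down as a small decision function. This is a sketch under assumptions: the type `fixCase` and the function `classify` are invented for this example, and the three arguments are the colors of the sibling `D`, of its child nearer to the doubly black node (`C`), and of its child farther from it (`E`), with empty children counted as black.

```sml
datatype color = Red | Black
datatype fixCase = C1 | C2 | C3 | C4

(* classify (sibling, near, far) names the case that applies when the doubly
 * black node is a left child, as in the diagrams; the mirror images are
 * obtained by swapping the roles of near and far. *)
fun classify (Red, _, _)           = C1   (* red sibling *)
  | classify (Black, _, Red)       = C4   (* black sibling, far child red *)
  | classify (Black, Red, Black)   = C3   (* black sibling, only near child red *)
  | classify (Black, Black, Black) = C2   (* black sibling, both children black *)

val example = classify (Black, Red, Black)   (* C3 *)
```

Checking the far child before the near child reflects the fact that in case `C4` the color of `C` is indifferent.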