Given a set of elements *S*, a partition of *S* is a set of
nonempty subsets of *S* such that every element of *S* is in
exactly one of the subsets. In other words, the subsets making up the
partition are pairwise disjoint, and together contain all the elements
of *S* (cover *S*). A *disjoint set data structure* is
an efficient way of keeping track of such a partition. We are interested in
two operations on disjoint sets:

- union - merge two sets of the partition into one, changing the partition structure
- find - determine which set of the partition contains a given element *e*, returning a canonical element of that set

Sometimes a disjoint set is also referred to as a *union-find data
structure* because it supports these two operations. In addition, the
create operation makes a partition where each element *e* is in
its own set (all the subsets in the partition are singletons).
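These three operations can be summarized as a module signature. The following is an illustrative sketch (the names match those used in the implementations later in these notes, but the signature itself is not part of any standard library):

```ocaml
(* An illustrative signature for union-find over elements 0 .. n-1. *)
module type UnionFind = sig
  type universe

  (* [createUniverse n] makes a partition of {0, ..., n-1} into singletons. *)
  val createUniverse : int -> universe

  (* [find u e] returns the canonical element of the set containing [e]. *)
  val find : universe -> int -> int

  (* [union u e1 e2] merges the sets containing [e1] and [e2]. *)
  val union : universe -> int -> int -> unit
end
```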

Efficient implementations of the union and find operations rely on the ability to change the values of variables, so we make use of the refs and arrays introduced in recitation.

Disjoint sets are commonly used in graph algorithms. For
instance, consider the problem of finding the *connected components* in
an undirected graph (sets of nodes that are reachable from one another
by some path). The following algorithm will label all the nodes in each
component with the same identifier and nodes in different components with different
identifiers:

- Create a new partition with one element corresponding to each node *v* in the graph.
- For each edge (*u*, *v*) in the graph, call the `union` operation with *u* and *v*.
- For each vertex *v* in the graph, call the `find` operation, which returns the component label for that vertex.
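As a concrete sketch of this algorithm, here is a small self-contained example using the naive array-based `createUniverse`, `find`, and `union` developed later in these notes (inlined here so the snippet stands alone; the `components` helper and the sample edge list are illustrative):

```ocaml
(* Naive array-based union-find, as developed below. *)
let createUniverse size = Array.init size (fun i -> i)
let rec find s e = let p = s.(e) in if p = e then e else find s p
let union s e1 e2 = let r1 = find s e1 and r2 = find s e2 in s.(r1) <- r2

(* Label each of the n vertices with a canonical element of its component. *)
let components (edges : (int * int) list) (n : int) : int array =
  let s = createUniverse n in
  List.iter (fun (u, v) -> union s u v) edges;
  Array.init n (fun v -> find s v)

(* A 5-node graph with edges (0,1), (1,2), (3,4):
   components {0, 1, 2} and {3, 4}. *)
let labels = components [(0, 1); (1, 2); (3, 4)] 5
```

Vertices in the same component receive the same label, and vertices in different components receive different labels.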

A common way of representing a disjoint set is as a *forest*,
or collection of trees, with one tree corresponding to each set of the
partition. When the nodes of the trees are labeled by consecutive
natural numbers 0 through *n* − 1, it is straightforward to implement a forest using an
array of length *n*, where the array index corresponds to the node and
the array entry at that index specifies the node's parent. The root
of a tree specifies itself as parent.

For instance, the forest:

      0         6
    / | \      / \
   1  2  3    7   8
        / \
       4   5

would be represented by the array

  0   1   2   3   4   5   6   7   8    (index)
+---+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 3 | 3 | 6 | 6 | 6 |  (parent)
+---+---+---+---+---+---+---+---+---+
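To check the correspondence, the forest above can be written down directly as a parent array:

```ocaml
(* The forest above, encoded as a parent array: index = node, entry = parent. *)
let forest = [|0; 0; 0; 0; 3; 3; 6; 6; 6|]

(* Nodes 4 and 5 have parent 3; node 3's parent is the root 0;
   and the roots 0 and 6 are their own parents. *)
```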

With this representation of a disjoint set, the universe is just an array of parent indices, and a new partition is simply:

type universe = int array

let createUniverse (size : int) : universe =
  Array.init size (fun i -> i)

Using this representation, the `find` operation checks the specified index of the array. If the value is equal to the index, it returns the index; the root has been found. Otherwise, it recursively calls `find` with the value in the array, which is the index of the parent. This searches from a node to the root of its tree in the forest:

let rec find (s : universe) (e : int) : int =
  let p = s.(e) in
  if p = e then e else find s p

The `union` operation finds the roots of the trees for each of the two elements, then assigns one of the two roots to have the other as parent:

let union (s : universe) (e1 : int) (e2 : int) : unit =
  let r1 = find s e1 and r2 = find s e2 in
  s.(r1) <- r2
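As a small worked example (the definitions above are repeated so the snippet stands alone), a sequence of unions can chain all the elements under a single root:

```ocaml
(* Naive find and union, as defined above. *)
let rec find s e = let p = s.(e) in if p = e then e else find s p
let union s e1 e2 = let r1 = find s e1 and r2 = find s e2 in s.(r1) <- r2

(* Start with four singletons {0}, {1}, {2}, {3} and merge them all. *)
let s = Array.init 4 (fun i -> i)
let () = union s 0 1; union s 1 2; union s 2 3

(* All four elements now share the same root. *)
```

Note that this particular sequence of unions produces a single path 0 → 1 → 2 → 3, which is exactly the degenerate shape discussed next.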

Thus `union` simply does two `find`s and a pointer update. The asymptotic running time of `union` is thus the same as that of `find`.

In the worst case, `find` can take *O*(*n*) time for an array of *n* elements, because the forest could consist of a single tree with a single path of depth *n*. In that case, starting at the leaf, `find` would visit every element of the array before reaching the root. Thus, as with balanced binary tree schemes such as red-black trees, we need a balancing scheme to keep the height small, preferably at most logarithmic in the number of nodes.

With this representation using parent pointers, trees are relatively easy to balance, because they are not necessarily binary trees; there is no bound on the branching factor, and we can exploit this fact to keep the height down. The key trick is to make the root of the shorter tree a child of the root of the taller one in the `union` operation. If tree *t*_{1} is strictly shorter than *t*_{2}, and the root of *t*_{1} is made a child of the root of *t*_{2}, then the overall height of the resulting tree does not change. If the two trees are the same height, then the height increases by 1, but this is the only way the height can increase. Rather than using the actual height of the trees, we use a quantity referred to as the *rank*, which is an upper bound on the height. This balancing scheme is known as *union by rank*.

Our data structure now needs to store a rank for each node in addition to a parent pointer.

type node = {mutable parent : int; mutable rank : int}
type universe = node array

let createUniverse (size : int) : universe =
  Array.init size (fun i -> {parent = i; rank = 0})

Now `union` finds the roots of the trees for both elements as before (with `find` updated to follow the `parent` fields of the nodes; see the path-compressing version below), except now we may also need to adjust the ranks. If the two roots are the same, there is nothing to do. If they are different, then the one with smaller rank is made a child of the root of the one with larger rank. If the ranks are equal, it does not matter which one is made a child of the other, but the rank of the new root is incremented by 1.

let union (s : universe) (e1 : int) (e2 : int) : unit =
  let r1 = find s e1 and r2 = find s e2 in
  let n1 = s.(r1) and n2 = s.(r2) in
  if r1 <> r2 then
    if n1.rank < n2.rank then n1.parent <- r2
    else begin
      n2.parent <- r1;
      if n1.rank = n2.rank then n1.rank <- n1.rank + 1
    end
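The following self-contained sketch puts union by rank together with a plain `find` adapted to the node representation (path compression is treated separately below). The small driver at the end is illustrative:

```ocaml
type node = {mutable parent : int; mutable rank : int}
type universe = node array

let createUniverse size = Array.init size (fun i -> {parent = i; rank = 0})

(* Plain find over the node representation (no path compression yet). *)
let rec find (s : universe) (e : int) : int =
  let n = s.(e) in
  if n.parent = e then e else find s n.parent

(* Union by rank: the root of smaller rank becomes a child of the other. *)
let union (s : universe) (e1 : int) (e2 : int) : unit =
  let r1 = find s e1 and r2 = find s e2 in
  let n1 = s.(r1) and n2 = s.(r2) in
  if r1 <> r2 then
    if n1.rank < n2.rank then n1.parent <- r2
    else begin
      n2.parent <- r1;
      if n1.rank = n2.rank then n1.rank <- n1.rank + 1
    end

(* Merge two rank-1 trees; the root's rank grows to 2 (and 2^2 = 4 nodes). *)
let s = createUniverse 4
let () = union s 0 1; union s 2 3; union s 0 2
```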

This process for constructing trees results in ranks that are logarithmic in the number of nodes, thus the running time of the `union` and `find` operations is *O*(log *n*) for *n* nodes. This follows from the fact that if a node has rank *k*, then the subtree rooted at that node has at least 2^{k} nodes. We can prove this by induction on rank. Base case: a node of rank 0 is the root of a subtree that contains at least itself, thus has size at least 2^{0} = 1. Inductive step: we wish to show that a subtree whose root has rank *k* + 1 has at least 2^{k+1} nodes. A node *u* can have rank *k* + 1 only if, at some point in the past, it had rank *k* and it was joined with another tree whose root also had rank *k*, and *u* became the root of the union of the two trees. By the induction hypothesis, each tree had size at least 2^{k}, so *u* is the root of a tree of size at least 2^{k} + 2^{k} = 2^{k+1}.

The `find` operation can also be improved using a technique known as *path compression*. After doing a `find` starting from a node *u*, we can retrace the path from *u* up to the root and change all the parent pointers along the way to point directly to the root. This will pay off in subsequent `find`s starting at any node along that path. In effect, this makes the tree flatter and bushier.

let rec find (s : universe) (e : int) : int =
  let n = s.(e) in
  if n.parent = e then e
  else begin
    n.parent <- find s n.parent;
    n.parent
  end
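To see the effect, one can hand-build a chain and watch a single `find` flatten it. This is a small sketch with the definitions above repeated so it stands alone; building the chain by direct assignment is purely for illustration:

```ocaml
type node = {mutable parent : int; mutable rank : int}

let createUniverse size = Array.init size (fun i -> {parent = i; rank = 0})

(* find with path compression, as above. *)
let rec find (s : node array) (e : int) : int =
  let n = s.(e) in
  if n.parent = e then e
  else begin
    n.parent <- find s n.parent;
    n.parent
  end

(* Hand-build the chain 3 -> 2 -> 1 -> 0. *)
let s = createUniverse 4
let () =
  s.(3).parent <- 2;
  s.(2).parent <- 1;
  s.(1).parent <- 0

let root = find s 3
(* After the call, every node on the path points directly at the root. *)
```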

A more involved analysis can establish that with union by rank and path compression, any sequence of *m* union and find operations on a set with *n* elements takes at most *O*((*m* + *n*) log* *n*) steps. The function log* *n* is the inverse of the tower function 2^{2^{...^{2}}}, where the stack of 2's is of height *n*. This is an extremely fast-growing function: with a stack of 2's of height 5, the value is a decimal number with 19729 digits. Its inverse, the function log* *n*, is the number of times you have to apply the log function to *n* before you get a number less than or equal to 1, and is no more than 5 for all practical values of *n*. Such more detailed analyses are covered in CS 4820.
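As an illustration of how slowly log* grows, here is a small sketch (the helper `log_star` is not a standard library function; it is defined here just to compute the iterated logarithm):

```ocaml
(* log* x: the number of times log2 must be applied to x
   before the value drops to <= 1. *)
let rec log_star (x : float) : int =
  if x <= 1.0 then 0
  else 1 + log_star (log x /. log 2.0)

(* log* 2 = 1, log* 4 = 2, log* 16 = 3, log* 65536 = 4,
   and log* of a tower of height 5 (i.e., 2^65536) is 5. *)
```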