CS410, Summer 1998
Lecture 15 Outline
Dan Grossman

* Goals: Union-Find (CLR, Chapter 22)

In the union-find (or disjoint sets) problem, we assume we have n objects
numbered 1, 2, ..., n. If our objects are not these small integers, then we
use some other method (such as hashing) to efficiently map them to small
integers. Initially, each object is in its own set, so we have n sets:
{1}, {2}, ..., {n}. We have two operations:

* Find(x) takes an object and returns the set it belongs to. Our sets don't
  have names, so we just return a representative element of the set. The
  only rule is that if x and y are in the same set, then Find(x) and Find(y)
  return the same representative.

* Union(x,y) combines the (distinct) sets that x and y are in into one set.
  This is destructive; the previous two sets no longer exist -- all their
  elements are now in one larger set.

We note immediately:

* There can be at most (n-1) union operations.
* There can be any number, say m, of find operations.

Motivation: Union-find corresponds nicely to (at least) two groups of
problems:

1. Aliasing. Suppose we want to know if x and y refer to the same thing. We
   might assume initially they don't, but we might subsequently be told that
   they do. Then later we might be told y and z refer to the same thing. We
   want to efficiently determine that, in fact, x and z refer to the same
   thing.

2. Connectivity. Imagine we want to know if you can drive from one street to
   another using only the streets and intersections you know about. For each
   intersection, union the two streets it joins. This works well if you want
   to see whether two streets are connected while periodically adding
   intersections as you go. It doesn't work at all if streets are ever
   removed, though.

We'll see examples of the second kind of problem when we study graphs the
last week of class.

Fast-find implementation:

We maintain two data structures with these invariants:

* An array arr of size n where arr[i] is the (representative of the) set to
  which object i belongs.
* An array lists where lists[i] is the list of all elements in set i if i is
  the representative element. Otherwise lists[i] is undefined.

Maintaining the invariants during the operations is straightforward:

* initialization: arr[i] = i and lists[i] has the one element i.
* find(i): return arr[i].
* union(x,y): Let rx = find(x) and ry = find(y). For all elements j in
  lists[ry], set arr[j] to rx. Append lists[ry] to lists[rx].

Clearly, initialization is O(n) and find is O(1). Union is naively O(n)
since its running time is bounded by the size of y's set, which could be as
large as n-1. So the total for m finds and n-1 unions is
O(n + m + n^2) = O(m + n^2).

A small addition can improve the running time of unions: always append the
shorter list to the longer one (re-assigning the representatives of the
shorter list's elements).

Claim: The total time for n-1 unions is now O(n log n).

Proof: We need to total up all re-assignments. Rather than take the number
of re-assignments per union and sum over all unions, let's count the other
way: take the number of times object x could be re-assigned, and sum over
all objects. If x is re-assigned, then by our addition to the algorithm, it
becomes a member of a set that is at least twice as large as its old set. So
the size of x's set at least doubles each time. Since no set can be larger
than n, x is re-assigned at most log n times. There are n elements, so the
total number of re-assignments is less than n log n.

So the running time for this implementation is O(m + n log n).

Tree implementation:

We maintain each set as a tree (only parent pointers needed) and let the
root be the representative element. Our operations are simply:

Find(x): follow parent pointers to the root
Union(x,y): Link(Find(x), Find(y)) where Link makes one root point to the
other.
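As a concrete sketch of the fast-find scheme above, here is a minimal
Python translation (the class name FastFind is my own; objects are numbered
0, ..., n-1 rather than 1, ..., n to match Python indexing):

```python
class FastFind:
    """Fast-find: arr[i] is i's representative; lists[r] holds r's set."""

    def __init__(self, n):
        self.arr = list(range(n))              # arr[i] = i initially
        self.lists = [[i] for i in range(n)]   # each object in its own set

    def find(self, x):
        return self.arr[x]                     # O(1)

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return                             # already in the same set
        # Improvement: always append the shorter list to the longer one.
        if len(self.lists[rx]) < len(self.lists[ry]):
            rx, ry = ry, rx
        for j in self.lists[ry]:               # re-assign the smaller set
            self.arr[j] = rx
        self.lists[rx].extend(self.lists[ry])
        self.lists[ry] = []                    # lists[ry] is now undefined
```

Note that each union touches only the elements of the smaller set, which is
exactly what the O(n log n) counting argument above charges for.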
In fact, these conceptual trees can be implemented with an array where
arr[i] holds the parent of object i (and a root is its own parent):

Initialize: set arr[i] = i
Find(x): i = x; while (i != arr[i]) { i = arr[i]; } return i;
Union(x,y): arr[Find(x)] = Find(y)

In this formulation, let m be the total number of Finds, including the ones
done during unions. (So this is the old m plus at most 2(n-1).) Since
initialization is O(n) and linking is O(1), the running time is determined
by the time taken for Find. Naively, our trees might be long chains, so m
finds could take total time O(mn). We will employ two heuristics to improve
this bound:

1. Union-by-rank

As before, we can improve the running time by always linking the shorter
tree to the taller one. With just this heuristic, we claim no tree is taller
than O(log n). Hence m Finds take time O(m log n).

Claim: Using union-by-rank, a tree of height h has >= 2^h nodes.

Proof: By induction on h.
Base: h = 0 -- every tree has at least one node.
Inductive: h > 0. When the root of the tree got height h, it was the root of
a tree of height h-1 and it was linked with another tree of height h-1.
(Using our heuristic, this is the only way a tree ever gets height h.) By
the I.H., each of the two smaller trees has >= 2^(h-1) nodes. So together
they have >= 2*2^(h-1) = 2^h nodes.

2. Path compression

During Find(x), we follow a path from x to its (current) root. We then
follow this path a second time, setting the parent of each node on the path
_directly_ to the root. E.g. Find(7):

      1                        1
     / \                    / / \ \
    2   3      ====>       2 3   4 7
       / \                   |   |
      4   5                  5   6
     / \
    6   7

Notice all future finds on 4, 6, and 7 will be able to skip some nodes.

This technique _together with union-by-rank_ is very powerful and is very
easy to implement. It will take a lot of work to actually prove the improved
bound, however.

Running time of union-find with union-by-rank and path compression:

The bound we will prove is O((m+n) log* n), where log* n is the inverse of
the tower function.
It is how many times you have to hit log on your calculator to get down to
1. The tower function F(i) is 2^(2^(2^...^2)) where the pile of 2's is i
high. The following table convinces us that log* n grows _very_ slowly:

        n       log* n
    ==================
        1         0
        2         1
        4         2
       16         3
    65536         4
  2^65536         5

It should be noted that 2^65536 has roughly 20,000 digits. So for all
practical (not theoretical) purposes, log* n <= 5.

In fact, the bound for union-find with union-by-rank and path compression is
even better. It is O((m+n) alpha(m,n)), where alpha(m,n) is the inverse
Ackermann's function. Trust me -- it grows even slower than log*. The proof
of the log* bound is hard enough, however, and should convince us that the
algorithm is very, very close to linear time.

Union-by-rank as defined above doesn't quite work in the presence of path
compression because path compression can change the height of a tree. We
therefore come up with a different definition of rank. The intuition is
that rank is "the height a node would have if we did not do path
compression". This is good intuition, but we will have to use the real
definition in our proof. It is:

* Initially, rank(x) = 0.
* If x is the root after a union operation, then:
  * if rank(x) == the rank of the root of the other tree, add one to
    rank(x);
  * if rank(x) > the rank of the root of the other tree, do nothing.
* The rule is that we always make the root of greater rank the root of the
  new tree.

In the ensuing proof (tomorrow's lecture), we will make heavy use of the
following easy-to-prove facts (they follow from the definition of rank and
the implementation of the operations):

1. Rank(x) only increases over time.
2. a. Once x is not a root, it is never again a root.
   b. Once x is not a root, its rank never changes.
3. If x is a root, then it has >= 2^(rank(x)) nodes in its subtree.
   [Proof is basically the same as our earlier proof. It now applies only to
   roots, because path compression can remove nodes from other nodes'
   subtrees.
   But path compression never takes nodes out of a root's subtree -- it just
   moves them.]
4. Rank(x) < Rank(parent(x)).
   [Proof: it's true the first time x gets a parent, because of the
   union-by-rank rule. Then path compression can only re-assign x's parent
   to things with _even larger_ rank. An important corollary is that as we
   follow a path to the root, the rank increases at every step.]
5. The max rank of any tree is <= log n.
   [Follows immediately from fact 3, since there are only n objects.]
6. For any rank r, at most n/(2^r) nodes ever have rank r.
   [Proof: Assume x gets rank r at some point. At that moment, it has
   >= 2^r nodes in its subtree by fact 3. By fact 4, all the other nodes in
   that subtree have smaller rank, and by fact 2 their ranks never change,
   so none of them will ever have rank r. Hence each node that ever gets
   rank r accounts for >= 2^r distinct nodes containing no other rank-r
   node. Since there are only n nodes in total, there can be at most
   n/(2^r) such x's.]

We will use these facts next time.
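To tie the two heuristics together, here is a minimal Python sketch of the
tree implementation with union-by-rank and path compression (the class name
UnionFind is my own; the two-pass find and the rank rules follow the
descriptions above):

```python
class UnionFind:
    """Tree implementation: arr[i] is i's parent; a root is its own parent."""

    def __init__(self, n):
        self.arr = list(range(n))   # arr[i] = i: each node is its own root
        self.rank = [0] * n         # rank as defined above, initially 0

    def find(self, x):
        i = x
        while i != self.arr[i]:     # first pass: follow parents to the root
            i = self.arr[i]
        while x != i:               # second pass: path compression
            parent = self.arr[x]
            self.arr[x] = i         # point x directly at the root
            x = parent
        return i

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return                  # already in the same set
        # Union-by-rank: the root of greater rank becomes the new root.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.arr[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1      # equal ranks: the new root gains one
```

Note that find only ever re-assigns a node's parent to its current root,
whose rank is strictly larger than the old parent's, which is exactly what
fact 4 above requires.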