CS410, Summer 1998
Lecture 15 Outline
Dan Grossman

* Goals: Union-Find (CLR, Chapter 22)

In the union-find (or disjoint sets) problem, we assume we have n objects
numbered 1, 2, ..., n. If our objects are not these small integers, then we
use some other method (such as hashing) to efficiently map them to small
integers. Initially, each object is in its own set, so we have n sets:
{1}, {2}, ..., {n}. We have two operations:

* Find(x) takes an object and returns the set it belongs to. Our sets don't
  have names, so we just return a representative element of the set. The
  only rule is that if x and y are in the same set, then Find(x) and Find(y)
  return the same representative.

* Union(x,y) combines the (distinct) sets that x and y are in into one set.
  This is destructive; the previous two sets no longer exist -- all their
  elements are now in one larger set.

We note immediately:

* There can be at most (n-1) union operations.
* There can be any number, say m, of find operations.

Motivation: Union-find corresponds nicely to (at least) two groups of
problems:

1. Aliasing. Suppose we want to know if x and y refer to the same thing. We
   might assume initially they don't, but we might subsequently be told that
   they do. Then later we might be told y and z refer to the same thing. We
   want to efficiently determine that, in fact, x and z refer to the same
   thing.

2. Connectivity. Imagine we want to know if you can drive from one street to
   another using only the streets and intersections you know about. For each
   intersection, union the two streets it joins. This works well if you want
   to see whether two streets are connected while periodically adding
   intersections as you go. It doesn't work at all if streets are ever
   removed, though.

We'll see examples of the second kind of problem when we study graphs the
last week of class.

Fast-find implementation:

We maintain two data structures with these invariants:

* An array arr of size n where arr[i] is the (representative of the) set to
  which object i belongs.
* An array lists where lists[i] is the list of all elements in set i if i is
  the representative element. Otherwise lists[i] is undefined.

Maintaining the invariants during the operations is straightforward:

* initialization: arr[i] = i and lists[i] has the one element i.
* find(i): return arr[i].
* union(x,y): Let rx = find(x) and ry = find(y). For all elements j in
  lists[ry], set arr[j] to rx. Append lists[ry] to lists[rx].

Clearly, initialization is O(n) and find is O(1). Union is naively O(n)
since its running time is bounded by the size of y's set, which could be as
large as n-1. So the total for m finds and n-1 unions is
O(n + m + n^2) = O(m + n^2).

A small addition can improve the running time of unions: always append the
shorter list to the longer one (re-assigning the representatives of the
shorter list's elements).

Claim: The total time for n-1 unions is now O(n log n).

Proof: We need to total up all re-assignments. Rather than take the number
of re-assignments per union and sum over all unions, let's count the other
way: take the number of times object x could be re-assigned, and sum over
all objects. If x is re-assigned, then by our addition to the algorithm, it
becomes a member of a set that is at least twice as large as its old set. So
the size of x's set at least doubles each time. Since no set can be larger
than n, x is re-assigned at most log n times. There are n elements, so the
total number of re-assignments is less than n log n.

So the running time for this implementation is O(m + n log n).

Tree implementation:

We maintain each set as a tree (only parent pointers needed) and let the
root be the representative element. Our operations are simply:

Find(x): follow parent pointers to the root
Union(x,y): Link(Find(x), Find(y)) where Link makes one root point to the
other.
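As a concrete sketch of the fast-find scheme above, here is a minimal
Python translation (the class name FastFind is my own; objects are numbered
0, ..., n-1 rather than 1, ..., n to match Python indexing):

```python
class FastFind:
    """Fast-find: arr[i] is i's representative; lists[r] holds r's set."""

    def __init__(self, n):
        self.arr = list(range(n))              # arr[i] = i initially
        self.lists = [[i] for i in range(n)]   # each object in its own set

    def find(self, x):
        return self.arr[x]                     # O(1)

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return                             # already in the same set
        # Improvement: always append the shorter list to the longer one.
        if len(self.lists[rx]) < len(self.lists[ry]):
            rx, ry = ry, rx
        for j in self.lists[ry]:               # re-assign the smaller set
            self.arr[j] = rx
        self.lists[rx].extend(self.lists[ry])
        self.lists[ry] = []                    # lists[ry] is now undefined
```

Note that each union touches only the elements of the smaller set, which is
exactly what the O(n log n) counting argument above charges for.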
In fact, these conceptual trees can be implemented with an array where
arr[i] holds the parent of object i (and a root is its own parent):

Initialize: set arr[i] = i
Find(x): i = x; while (i != arr[i]) { i = arr[i]; } return i;
Union(x,y): arr[Find(x)] = Find(y)

In this formulation, let m be the total number of Finds, including the ones
done during unions. (So this is the old m plus at most 2(n-1).) Since
initialization is O(n) and linking is O(1), the running time is determined
by the time taken for Find. Naively, our trees might be long chains, so m
finds could take total time O(mn). We will employ two heuristics to improve
this bound:

1. Union-by-rank

As before, we can improve the running time by always linking the shorter
tree to the taller one. With just this heuristic, we claim no tree is taller
than O(log n). Hence m Finds take time O(m log n).

Claim: Using union-by-rank, a tree of height h has >= 2^h nodes.

Proof: By induction on h.
Base: h = 0 -- every tree has at least one node.
Inductive: h > 0. When the root of the tree got height h, it was the root of
a tree of height h-1 and it was linked with another tree of height h-1.
(Using our heuristic, this is the only way a tree ever gets height h.) By
the I.H., each of the two smaller trees has >= 2^(h-1) nodes. So together
they have >= 2*2^(h-1) = 2^h nodes.

2. Path compression

During Find(x), we follow a path from x to its (current) root. We then
follow this path a second time, setting the parent of each node on the path
_directly_ to the root. E.g. Find(7):

      1                        1
     / \                    / / \ \
    2   3      ====>       2 3   4 7
       / \                   |   |
      4   5                  5   6
     / \
    6   7

Notice all future finds on 4, 6, and 7 will be able to skip some nodes.

This technique _together with union-by-rank_ is very powerful and is very
easy to implement. It will take a lot of work to actually prove the improved
bound, however.

Running time of union-find with union-by-rank and path compression:

The bound we will prove is O((m+n) log* n), where log* n is the inverse of
the tower function.
It is how many times you have to hit log on your calculator to get down to
1. The tower function F(i) is 2^(2^(2^...^2)) where the pile of 2's is i
high. The following table convinces us that log* n grows _very_ slowly:

        n       log* n
    ==================
        1         0
        2         1
        4         2
       16         3
    65536         4
  2^65536         5

It should be noted that 2^65536 has roughly 20,000 digits. So for all
practical (not theoretical) purposes, log* n <= 5.

In fact, the bound for union-find with union-by-rank and path compression is
even better. It is O((m+n) alpha(m,n)), where alpha(m,n) is the inverse
Ackermann's function. Trust me -- it grows even slower than log*. The proof
of the log* bound is hard enough, however, and should convince us that the
algorithm is very, very close to linear time.

Union-by-rank as defined above doesn't quite work in the presence of path
compression because path compression can change the height of a tree. We
therefore come up with a different definition of rank. The intuition is
that rank is "the height a node would have if we did not do path
compression". This is good intuition, but we will have to use the real
definition in our proof. It is:

* Initially, rank(x) = 0.
* If x is the root after a union operation, then:
  * if rank(x) == the rank of the root of the other tree, add one to
    rank(x);
  * if rank(x) > the rank of the root of the other tree, do nothing.
* The rule is that we always make the root of greater rank the root of the
  new tree.

In the ensuing proof (tomorrow's lecture), we will make heavy use of the
following easy-to-prove facts (they follow from the definition of rank and
the implementation of the operations):

1. Rank(x) only increases over time.
2. a. Once x is not a root, it is never again a root.
   b. Once x is not a root, its rank never changes.
3. If x is a root, then it has >= 2^(rank(x)) nodes in its subtree.
   [Proof is basically the same as our earlier proof. It now applies only to
   roots, because path compression can remove nodes from other nodes'
   subtrees.
   But path compression never takes nodes out of a root's subtree -- it just
   moves them.]
4. Rank(x) < Rank(parent(x)).
   [Proof: it's true the first time x gets a parent, because of the
   union-by-rank rule. Then path compression can only re-assign x's parent
   to things with _even larger_ rank. An important corollary is that as we
   follow a path to the root, the rank increases at every step.]
5. The max rank of any tree is <= log n.
   [Follows immediately from fact 3, since there are only n objects.]
6. For any rank r, at most n/(2^r) nodes ever have rank r.
   [Proof: Assume x gets rank r at some point. At that moment, it has
   >= 2^r nodes in its subtree by fact 3. By fact 4, all the other nodes in
   that subtree have smaller rank, and by fact 2 their ranks never change,
   so none of them will ever have rank r. Hence each node that ever gets
   rank r accounts for >= 2^r distinct nodes containing no other rank-r
   node. Since there are only n nodes in total, there can be at most
   n/(2^r) such x's.]

We will use these facts next time.
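To tie the two heuristics together, here is a minimal Python sketch of the
tree implementation with union-by-rank and path compression (the class name
UnionFind is my own; the two-pass find and the rank rules follow the
descriptions above):

```python
class UnionFind:
    """Tree implementation: arr[i] is i's parent; a root is its own parent."""

    def __init__(self, n):
        self.arr = list(range(n))   # arr[i] = i: each node is its own root
        self.rank = [0] * n         # rank as defined above, initially 0

    def find(self, x):
        i = x
        while i != self.arr[i]:     # first pass: follow parents to the root
            i = self.arr[i]
        while x != i:               # second pass: path compression
            parent = self.arr[x]
            self.arr[x] = i         # point x directly at the root
            x = parent
        return i

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return                  # already in the same set
        # Union-by-rank: the root of greater rank becomes the new root.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.arr[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1      # equal ranks: the new root gains one
```

Note that find only ever re-assigns a node's parent to its current root,
whose rank is strictly larger than the old parent's, which is exactly what
fact 4 above requires.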