CS410, Summer 1998
Lecture 16 Outline
Dan Grossman

Goals:

* Finish the proof of the running time for union-find with
  union-by-rank and path compression.
* Begin our study of sorting algorithms.

Theorem: The total running time of union-find with union-by-rank and
path compression over m finds and n-1 unions is O((m+n) log* n), where
log* is the inverse tower function.

Proof: Recall the implementation, definition of rank, and the 6 facts
from yesterday's lecture.

Total running time = initialization time + link time + find time
                     ^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^   ^^^^^^^^^
                            O(n)              O(n)        ???

O(n) for initialization is obvious. O(n) link time is just n-1 links
times O(1) per link. So it suffices to show the total find time is
O((m+n) log* n).

The total time for m finds is the total number of edges followed over
those finds. We will count them in a very weird way in order to prove
the bound.

First we divide the possible ranks into groups, for reasons that will
only become apparent later:

  If the rank is between ___ and ___    then we put it in group
           0                  1                    0
           2                  2                    1
           3                  4                    2
           5                 16                    3
          17              65536                    4
       65537          2^(65536)                    5
         ...                ...                  ...

(Although the ... is unnecessary here in reality.) So if rank(x) is i,
then rank(x) is in group log* i.

Now when following a path to the root during a find, we put each edge
followed in one of three categories:

* The last edge followed -- i.e., the one right below the root.
* Edges crossing group boundaries -- i.e., edges from x to y where
  rank(x) and rank(y) are in different groups.
* Edges within group boundaries -- i.e., all the others.

(For the last two categories, we mean edges "that are not last edges".
So every edge is in exactly one of the three categories.)
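Before counting edges, it may help to see a sketch of the implementation the proof analyzes. This is a minimal version in Python (not the lecture's own code; the class and method names here are illustrative): union-by-rank at link time, path compression at find time.

```python
# Sketch of union-find with union-by-rank and path compression.
# Names (UnionFind, find, union) are illustrative, not from the lecture.

class UnionFind:
    def __init__(self, n):
        # Initialization: n singleton trees, all ranks 0 -- the O(n) term.
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # First pass: follow parent edges to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every node on the
        # path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        # Link by rank: the root of smaller rank becomes a child of
        # the root of larger rank; on a tie, the winner's rank grows.
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```

Note that find never changes any rank, and union only ever increases the rank of a node that is currently a root -- the two facts the rest of the proof leans on.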
find time = sum over all finds of (all edges traversed)

Using our categories, we have:

find time = sum over all finds of (last edges traversed
                                   + boundary edges traversed
                                   + within group edges traversed)

We can distribute the summation across the plus:

find time = (sum over all finds of last edges traversed)
          + (sum over all finds of boundary edges traversed)
          + (sum over all finds of within group edges traversed)

          = m + O(m log* n)
              + (sum over all finds of within group edges traversed)

The first term is m because every find has exactly one last edge. The
second term is O(m log* n) because fact 4 guarantees that ranks
strictly increase along every find path, and fact 5 guarantees that
ranks can't get very big. So in fact, on each find we cannot cross
more than log* n group boundaries, and the total over m finds is
O(m log* n).

So it suffices to prove:

  (sum over all finds of within group edges traversed) is O(n log* n)

Here we will use another piece of accounting creativity. Instead of
summing the "within group edges" over finds, we will sum them over
nodes. That is,

  (sum over all finds of "within group edges" traversed)
  = (sum over all nodes x of the number of times, during all finds,
     that a "within group edge" from x to another node is traversed)

All we have done is rearrange our summation to sum over nodes instead
of finds -- we are counting the exact same set of things.

Fact A: Let x be a particular node in group g. The number of times
during all finds that a "within group edge" leaving x is traversed is
<= F(g), where F is the tower function.

Proof: An edge only leaves x if x is not a root. By Fact 2, x's rank
never changes once it is not a root. So it is legal to talk about its
group g -- g will not change. Now, the edges we are talking about are,
by definition, not last edges -- so every time one is traversed we
know PATH COMPRESSION will occur. So by Fact 4, x will get a new
parent with an even greater rank.
So once x's parent is in another group, it will always be in another
group, and no more of the "within group" edges will occur. So the
question is just how many times the rank of x's parent can increase
before the parent must be in another group. This is bounded by the
number of ranks in x's group, which is less than F(g).

Fact B: The number of nodes in group g is <= n/F(g), where F is the
tower function.

Proof:

  nodes in group g
  <= sum over all ranks r in the group of the
     maximum number of nodes with rank r            // by defn of group
  =  sum from r = F(g-1)+1 to F(g) of the
     maximum number of nodes with rank r            // by defn of group
  =  sum from r = F(g-1)+1 to F(g) of n/(2^r)       // by fact 6
  =  n/(2^F(g-1)) * (1/2 + 1/4 + 1/8 + ... + 1/a big number)  // by math
  <  n/(2^F(g-1)) * 1                               // by math
  =  n/F(g)                         // by defn of the tower function

Putting Facts A and B together, the total number of "within group"
edge traversals for group g is:

  (within group traversals per node) * (number of nodes in the group)
  <= F(g) * n/F(g)     // by facts A and B
  =  n                 // by math

Fact C: There are at most log* n groups.

Proof: Explained earlier.

Putting A, B, and C together, the grand total is at most log* n groups
times n "within group" edge traversals per group, i.e., n log* n in
total. As explained above, this is all we needed to prove the theorem.

===============

SORTING

Since we are going to spend several lectures on sorting, we ought to
specify what it means to sort and what other sorts of specifications
we are concerned with.

Let a_1, a_2, ..., a_n be a collection of objects such that for any i
and j, a_i <= a_j or a_i > a_j. Then to sort the collection means to
give a permutation of the objects such that if a_i < a_j then a_i
"appears before" a_j in the permutation.

Stability -- we call a sorting method stable if ties between elements
are always resolved by their original order; otherwise we call the
sort unstable. Formally, stability means that if a_i == a_j, then a_i
appears before a_j if and only if i < j.

We will learn stable and unstable methods.
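The stability definition above can be illustrated with a tiny example in Python (the records here are made up for illustration). Python's built-in sort happens to be stable, so equal keys keep their original relative order:

```python
# Illustration of stability: records carry (key, original index).
# A stable sort must output equal keys in order of original index.
records = [("b", 0), ("a", 1), ("b", 2), ("a", 3)]
by_key = sorted(records, key=lambda r: r[0])
# The two "a" records keep order 1, 3 and the two "b" records keep
# order 0, 2 -- exactly what the formal definition requires.
assert by_key == [("a", 1), ("a", 3), ("b", 0), ("b", 2)]
```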
The unstable methods may be faster, but sometimes we need stability
for our application. We can always make an unstable method stable by
putting an "original position" field on each object and using it to
resolve what would otherwise be ties. However, in practice this slows
down our originally faster unstable method enough that it is usually
wiser just to use a stable method in the first place.

We will also consider whether various sorting methods are appropriate
for linked lists. Of course, we could always:

* convert the linked list to an array in O(n) time, where n is the
  length of the list,
* sort the array, and
* convert the array back to a linked list in O(n) time.

This incurs overhead that could be avoided if we could simply sort the
linked list. The array methods may be faster, though; resolving the
trade-off may require empirical data.

Next time we will begin discussing particular methods.
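The "original position" trick described above can be sketched as follows (a minimal illustration in Python; the function name stable_sort and the use of sorted() as a stand-in for an arbitrary comparison sort are my own, not the lecture's):

```python
# Sketch of the "original position" trick: attach each object's index,
# then sort by (key, index). Since no two indices are equal, there are
# no ties left for the underlying sort to break arbitrarily, so the
# result is stable regardless of whether the underlying sort is.
def stable_sort(items, key, underlying_sort=sorted):
    # underlying_sort stands in for any (possibly unstable) routine
    # that accepts a key function; sorted() is just a placeholder.
    decorated = [(key(x), i, x) for i, x in enumerate(items)]
    ordered = underlying_sort(decorated, key=lambda t: (t[0], t[1]))
    return [x for _, _, x in ordered]
```

The extra tuple per element is the overhead the lecture mentions: it costs memory and comparison time, which is why a natively stable method is often preferable.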