CS410, Summer 1998
Lecture 19 Outline
Dan Grossman

Goals:
* Sorting Odds and Ends
* Lower Bound for Sorting -- The Comparison Model
* Order Statistics

Odds and Ends

There are a few possible improvements to quicksort. You will examine these in greater depth in homework 6:

1. Switch to insertion sort for small subproblems. This has the same theoretical advantage it does with merge sort.

2. Do the smaller subproblem first. When doing the second subproblem, the call stack does not need to remember the "parent" call because it has no work left to do. By doing the smaller subproblem first, then, we can ensure that the call stack never needs to be deeper than log n. This could affect running time if the call stack would otherwise become large.

3. To pick the pivot, choose k elements at random, find their median, and use it as the pivot. This increases the chance that the pivot is near the middle, but it takes extra time to choose the pivot.

In lecture 17, it was misleading to say that insertion sort is only doubly-linked-list friendly. It is also singly-linked-list friendly, if you do the insertion part from the front of the list. This flips which input is best-case and which is worst-case. The notes for that lecture have been changed to reflect this.

In lecture 18, the expected-case analysis was portrayed as less intuitive than it actually is. The notes for that lecture now give a bit more intuition for the sum of q log q.

==========

Lower Bound

A (time) lower bound for an _algorithm_ is a running time such that for all inputs, the algorithm takes at least that much time. That is, it is just a lower bound on the best-case running time of the algorithm.

A (time) lower bound for a _problem_ is a running time such that for all algorithms, there is some input on which the algorithm takes at least that much time. That is, no algorithm can beat the bound on all inputs.

Notice these two definitions say very different things.
The first really just discusses the best case for an algorithm. The second places a limit on how good any algorithm can be. The second is a very useful kind of result -- it tells us not to waste our time trying to come up with an algorithm whose worst-case behavior beats the bound. It would be a waste of time because we have proven it is impossible!

We will prove a lower bound for the sorting _problem_ in the comparison model. In this model, we wish to sort n distinct objects a_1, a_2, ..., a_n, and the only information we can gather from the objects is whether one object is less than another. Doing a comparison takes Theta(1) time. That is, we can ask "is a_i less than a_j", but we cannot ask "what is the value of a_i".

Claim: In the comparison model, every (correct) sorting algorithm must do Omega(n log n) comparisons in the worst case to sort n objects. Since each comparison takes Theta(1) time, this establishes a lower bound of Omega(n log n) time. [Saying "correct" is redundant since a procedure must be correct to be an algorithm.]

This shows that heapsort, mergesort, and quicksort are asymptotically optimal!

Proof: Given n objects, an algorithm must pick the unique correct permutation from the n! possible permutations by doing some sequence of comparisons. Choosing what comparison to do next can be based on the results of all previous comparisons. So we can think of any algorithm as a binary tree where:

* each node is a comparison
* the root is the first comparison
* the left subtree is what the algorithm does if the comparison answers "yes"
* the right subtree is what the algorithm does if the comparison answers "no"

This is called a decision tree. We showed the decision tree for insertion sort of an array of three elements. This example is also on page 173 of the text.

Fact: Every one of the n! permutations of the elements is the correct answer for some input. Therefore, the decision tree for any (correct) algorithm must have at least one leaf for each permutation.
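To make the counting above concrete, a few lines of Python (my illustration, not from the lecture) compare the minimum decision-tree height, ceil(log2(n!)), with n log2 n; the function name min_comparisons is mine:

```python
import math

def min_comparisons(n):
    """Minimum height of a binary tree with n! leaves: ceil(log2(n!)).

    This is the fewest comparisons any comparison sort can guarantee
    in the worst case when sorting n distinct objects.
    """
    return math.ceil(math.log2(math.factorial(n)))

for n in [3, 10, 100]:
    print(n, min_comparisons(n), round(n * math.log2(n), 1))
```

For n = 3 this gives 3 comparisons, matching the three-element decision tree from lecture: a tree of height 2 has at most 4 leaves, which cannot cover all 3! = 6 permutations.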
So the tree must have at least n! leaves. Furthermore, any root-to-leaf path in the decision tree corresponds to the sequence of comparisons the algorithm makes on some input. So the height of the tree is the worst-case number of comparisons.

So it suffices to show that a binary tree with n! leaves has height Omega(n log n). We know from our study of trees that a binary tree with n! leaves has height h >= log(n!). So we just need to show that log(n!) is Omega(n log n). The rest is math (assume n is even for simplicity):

  n! = n * (n-1) * (n-2) * ... * 1                          by definition of factorial
     = [n * 1] * [(n-1) * 2] * ... * [(n/2 + 1) * (n/2)]    by re-arranging terms
     = product for i = 0 to n/2 - 1 of (n-i)(i+1)           by "un-dot-dot-dotting"

Now we claim (n-i)(i+1) >= n/2 for each i in this range:

  (n-i)(i+1) = i(n-i) + (n-i)    by math
             >= n - i            since 0 <= i < n, so i(n-i) >= 0
             >= n/2              since i <= n/2 - 1

So,

  n! >= product for i = 0 to n/2 - 1 of (n/2)    by what we just showed
     =  (n/2)^(n/2)                              by definition of exponentiation

So,

  h >= log(n!) >= log((n/2)^(n/2)) = (n/2) log(n/2)    by math

which is Omega(n log n)    by math -- details left to you

=============

Order Statistics

Often we don't need all the elements sorted. Instead we just need the kth smallest (or largest) element of the group. Sorting and taking the kth certainly works, but we can do asymptotically better. This is another example of being able to do something faster when you need less.

We already know how to find min and max in O(n) time. More specifically, we can find min or max using n-1 comparisons. So trivially we can find min _and_ max using 2(n-1) comparisons. We can use a cute trick to reduce this to 3(n/2) - 2 comparisons:

* Pair the elements up and compare each pair. Put the lesser ones in one "pile" and the greater ones in another "pile".
* Find the min of the "lesser pile". This is the overall min.
* Find the max of the "greater pile". This is the overall max.

To find the kth smallest when k is nearly zero or nearly n, we can just keep track of the k smallest (or largest) seen so far.
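The min-and-max pairing trick described above can be sketched in Python; this is a minimal sketch assuming a nonempty input, and the name min_and_max is mine, not from the lecture (it interleaves the "pile" steps rather than building two explicit piles, which uses the same 3 comparisons per pair):

```python
def min_and_max(items):
    """Find min and max together using about 3(n/2) comparisons:
    one to order each pair, one against the running min, one against
    the running max."""
    it = list(items)
    n = len(it)
    if n == 0:
        raise ValueError("empty input")
    if n % 2 == 1:
        lo = hi = it[0]          # odd n: seed with the lone element
        start = 1
    else:
        lo, hi = (it[0], it[1]) if it[0] < it[1] else (it[1], it[0])
        start = 2
    for i in range(start, n, 2):
        a, b = it[i], it[i + 1]
        if a < b:                 # 1 comparison to order the pair...
            if a < lo: lo = a     # ...lesser element vs. running min
            if b > hi: hi = b     # ...greater element vs. running max
        else:
            if b < lo: lo = b
            if a > hi: hi = a
    return lo, hi
```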
Naively we can keep the k smallest sorted, for a worst-case total time of O(kn). This could be improved to O(n log k + k) using techniques we've seen before. But this doesn't look so good if we want something like the median -- where k == n/2.

If n is small, then basically doing insertion sort is probably the best way to go. But there's no sense in wasting time sorting the upper half of the elements -- just keep the first n/2 sorted. So for the first n/2 iterations, do the normal insertion sort. But for the second n/2 iterations, start "walking backwards" at the middle of the array. If the element being inserted is bigger than the current median, then we just forget about it and continue. Do this on your homework for finding the medians of small numbers of elements. But to find the median of one element, don't use an array at all -- just return the element.

But what if n is large and k is not near zero or n? Then let's use some tricks from sorting to produce an asymptotically faster algorithm. (We assume k == 0 for the min and generally k == i if we want the (i+1)th smallest. The code is obviously no more difficult if we resolve this off-by-one issue differently.)

  quick-select(array, k, l, r) {
    p = partition(array, l, r)
    if (p > k)
      return quick-select(array, k, l, p-1)
    else if (p == k)
      return array[p]
    else
      return quick-select(array, k, p+1, r)
  }

We will analyze this algorithm next time.
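As a sanity check, the pseudocode above can be made runnable in Python. This is a sketch under assumptions of mine: a Lomuto-style partition with a random pivot (the lecture's partition may choose its pivot differently), distinct elements, and k 0-based as above.

```python
import random

def partition(a, l, r):
    """Lomuto partition of a[l..r] around a randomly chosen pivot;
    returns the pivot's final index."""
    s = random.randrange(l, r + 1)
    a[s], a[r] = a[r], a[s]       # move random pivot to the end
    pivot = a[r]
    i = l                         # boundary of the "< pivot" region
    for j in range(l, r):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[r] = a[r], a[i]       # put pivot in its final spot
    return i

def quick_select(a, k, l=0, r=None):
    """Return the (k+1)th smallest element of a (k is 0-based)."""
    if r is None:
        r = len(a) - 1
    p = partition(a, l, r)
    if p > k:
        return quick_select(a, k, l, p - 1)
    elif p == k:
        return a[p]
    else:
        return quick_select(a, k, p + 1, r)
```

Note that, unlike quicksort, only one of the two subproblems is ever recursed on; that asymmetry is what the analysis next time will exploit.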