Analysis of quicksort

THEOREM. The expected running time of quicksort, assuming the elements of the input vector are distinct and all permutations are equally likely, is O(n log n).

PROOF. The expected running time on inputs of length n is the running time averaged over all inputs of length n. Let's denote this by E(T(n)) (read: the expected value of T(n)). To get the expected running time, we compute, for each input of length n,

 - the time required on that input, times
 - the probability that that input occurs.

Then we add all these up. (That's the definition of average.)

Observe that if all the elements of the input vector are distinct, then the running time does not depend on the actual values, but only on their order. Thus there are essentially n! different inputs, one for each permutation, and each one is equally likely. Let's assume then for simplicity that p=1, r=n, and the numbers we are sorting are 1,2,3,...,n.

There are a couple of shortcuts we can take in computing the expected running time. First, for an input of length n, there are n possible choices for the pivot, and each one is equally likely. If the pivot is 1, the split is 1:n-1; if the pivot is m+1, the split is m:n-m. If we know the split is m:n-m, then the algorithm takes time cn for some constant c, plus the time for the recursive calls, one on a vector of length m, the other on a vector of length n-m.

Now here is a trick: if we somehow knew that the pivot would be 1, then the expected running time would be E(cn + T(1) + T(n-1)); if we somehow knew that the pivot would be m+1, it would be E(cn + T(m) + T(n-m)). We don't know what the pivot will be, but we do know the probabilities of each possible pivot, and we can average these conditional expected running times over all choices of pivot. This is done by summing, for each possible pivot, the expected running time given that pivot, times the probability of that pivot.
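For concreteness, the procedure under analysis can be sketched as follows. This is a minimal Python sketch, not the notes' own code: the function name, the random pivot choice, and the partition details are my assumptions, chosen to match the model in which each of the n pivots is equally likely.

```python
import random

def quicksort(a, p=0, r=None):
    """Sort a[p:r] in place; p and r follow the notes' convention."""
    if r is None:
        r = len(a)
    if r - p <= 1:
        return
    # Choose the pivot uniformly at random: each of the r-p candidates
    # is equally likely, matching the analysis above.
    i = random.randrange(p, r)
    a[p], a[i] = a[i], a[p]
    pivot = a[p]
    # Partition: elements < pivot end up left of the pivot's final slot.
    lo = p
    for j in range(p + 1, r):
        if a[j] < pivot:
            lo += 1
            a[lo], a[j] = a[j], a[lo]
    a[p], a[lo] = a[lo], a[p]
    # The pass above costs time linear in r-p (the "cn" term); the two
    # recursive calls realize the split m : n-m.
    quicksort(a, p, lo)
    quicksort(a, lo + 1, r)
```

The partition pass costs the cn term; the two recursive calls are the T(m) and T(n-m) terms in the averaging argument.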
In other words,

  E(T(n)) = E(cn + T(1) + T(n-1))/n + sum_{m=1}^{n-1} E(cn + T(m) + T(n-m))/n

Each term is divided by n because 1/n is the probability that any given number between 1 and n is the pivot. The expectation operator E() is always a linear function, thus we can write

  E(T(n)) = cn/n + E(T(1))/n + E(T(n-1))/n
            + sum_{m=1}^{n-1} cn/n + sum_{m=1}^{n-1} E(T(m))/n + sum_{m=1}^{n-1} E(T(n-m))/n

         <= c + 1 + E(T(n-1))/n
            + c(n-1) + sum_{m=1}^{n-1} E(T(m))/n + sum_{m=1}^{n-1} E(T(n-m))/n

(bounding the constant E(T(1))/n by 1)

          = cn + 1 + E(T(n-1))/n
            + sum_{m=1}^{n-1} E(T(m))/n + sum_{m=1}^{n-1} E(T(n-m))/n

Renumbering the last sum (replace m by n-m),

  E(T(n)) <= cn + 1 + E(T(n-1))/n + sum_{m=1}^{n-1} E(T(m))/n + sum_{m=1}^{n-1} E(T(m))/n

           = cn + 1 + E(T(n-1))/n + 2 sum_{m=1}^{n-1} E(T(m))/n

Now we would like to prove E(T(n)) <= dn ln n for some constant d. We're using ln = natural log (log base e) instead of log base 2 just to simplify some expressions; it's only a constant factor (log n = ln n / ln 2). We get to choose d as big as we like to make this work, as long as it is a constant independent of n. To figure out what the value of d should be, we can attempt an inductive proof parametrized by d, then figure out what value we need to make the proof go through.

For the basis, take n=2; surely d can be chosen big enough that E(T(2)) <= 2d ln 2.

Now we do our parametrized induction step. Suppose E(T(m)) <= dm ln m for all m < n. We would like to show E(T(n)) <= dn ln n. We have

  E(T(n)) <= cn + 1 + E(T(n-1))/n + 2 sum_{m=1}^{n-1} E(T(m))/n

          <= cn + 1 + d(n-1)ln(n-1)/n + 2 sum_{m=1}^{n-1} (dm ln m)/n

by the induction hypothesis. Since (n-1)ln(n-1)/n <= ln n, this is

          <= cn + 1 + d ln n + 2d (sum_{m=1}^{n-1} m ln m)/n

          <= cn + 1 + d ln n + 2d (Integral_1^n m ln m dm)/n

estimating the sum by a definite integral -- draw the graph! (m ln m is increasing, so the sum is a lower Riemann sum for the integral). Integrating by parts, Integral m ln m dm = (m^2 ln m)/2 - m^2/4, so this is

           = cn + 1 + d ln n + 2d((n^2 ln n)/2 - n^2/4 + 1/4)/n

           = cn + 1 + d ln n + dn ln n - dn/2 + d/(2n)

and we must show this is <= dn ln n. We still get to choose d as big as we like, as long as it is a constant independent of n.
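As an aside, the integral estimate used in the induction step can be spot-checked numerically. This is a quick illustrative sketch (the helper names lhs and rhs are mine), comparing the sum against the closed form obtained by integration by parts:

```python
import math

def lhs(n):
    # sum_{m=1}^{n-1} m ln m  (the m=1 term contributes 0)
    return sum(m * math.log(m) for m in range(1, n))

def rhs(n):
    # Integral_1^n m ln m dm = (n^2 ln n)/2 - n^2/4 + 1/4
    return (n * n * math.log(n)) / 2 - n * n / 4 + 0.25

# m ln m is increasing, so the sum is a lower Riemann sum for the integral.
for n in (2, 10, 100, 1000):
    assert lhs(n) <= rhs(n)
```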
Simplifying the inequality

  cn + 1 + d ln n + dn ln n - dn/2 + d/(2n) <= dn ln n,

we get

  1 + d ln n + d/(2n) <= (d/2 - c)n.

Picking d large enough that the coefficient (d/2 - c) on the right-hand side is positive, the right-hand side is Theta(n) and the left-hand side is o(n), so for sufficiently large n the right-hand side dominates.

------------------------------------------------------------------

Every comparison-based sorting algorithm must make at least n log n comparisons on inputs of length n in the worst case. That is because the algorithm must distinguish among n! ~ 2^(n log n) possible input permutations, and a binary-branching decision tree generated by a comparison-based algorithm must have depth at least n log n to have that many leaves.

------------------------------------------------------------------

Linear time sorts: counting sort, bucket sort, radix sort. Good treatment in CLR 9.2-4, pp. 175-183.

Bucket sort, which we did not go over in class. Say the key space is the integers [0,k].

 * Divide the key space into m equal-size regions, and make one "bucket" (a linked list) for each.
 * Go through the input array, putting each element x in the bucket for its region, namely bucket floor(xm/(k+1)). Link it into the list for that bucket.
 * Sort each bucket.
 * Concatenate the sorted buckets.

If we assume that the keys are uniformly distributed over the possible key values, then we expect each bucket to have n/m elements; having significantly more is very unlikely. We roughly expect time

  O(n) + m * O((n/m) log(n/m)) = O(n log(n/m)),

which is O(n) when m is a constant fraction of n. For example, if m = n/1000, we use an array of size m, and we would have expected time O(n log 1000) = O(n). Note that this does not depend on k.

Some important points:

--counting sort: Good when k is approximately n or less. After constructing the array of counts of the keys, it is tempting to just go through from left to right and write down the keys in order in the output array, since we know how many of each one there are from the count array.
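That tempting keys-only shortcut can be sketched as follows (a hypothetical helper of my own, assuming integer keys in [0,k]):

```python
def counting_sort_keys_only(a, k):
    """Sort integer keys in [0, k]; usable only when the keys alone matter."""
    count = [0] * (k + 1)
    for x in a:
        count[x] += 1          # tally how many times each key occurs
    out = []
    for key in range(k + 1):
        out.extend([key] * count[key])  # emit each key count[key] times
    return out
```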
If we are only interested in the keys, this is OK. But usually there is some other information associated with each element being sorted besides the key, and we have to get this from the original array.

--radix sort: sort on digit positions in order from least significant to most significant. You must use a stable sorting procedure, so that the sorting you do later on the higher-order digits does not mess up the sorting you already did on the lower-order digits.

--bucket sort: the keys do not have to be real numbers in the range [0,1] as in CLR. For example, they could be integers, as long as you know the bounds.
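As a concrete instance of the last point, here is a bucket sort sketch for integer keys in [0,k], following the steps listed above. The function name and the exact bucket-index formula floor(x*m/(k+1)) are my choices, and Python lists stand in for the linked-list buckets:

```python
def bucket_sort(a, k, m):
    """Sort a list of integers in [0, k] using m buckets."""
    # One bucket per equal-size region of the key space [0, k].
    buckets = [[] for _ in range(m)]
    for x in a:
        # Each region covers (k+1)/m key values, so x lands in
        # bucket floor(x*m/(k+1)), which is always in [0, m-1].
        buckets[x * m // (k + 1)].append(x)
    # Sort each bucket (expected size n/m for uniform keys),
    # then concatenate the sorted buckets in region order.
    out = []
    for b in buckets:
        b.sort()
        out.extend(b)
    return out
```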