CS410, Summer 1998
Lecture 20 Outline
Dan Grossman

Goals:
* Finish median algorithms
* Linear-Time Sorting
* Radix Sorting

At the end of last class, we were inspired by quicksort to come up with
the following algorithm for finding the element that would be kth were
the array sorted:

quick-select(array, k, l, r) {
  p = partition(array, l, r)
  if (p == k)
    return array[p]
  else if (p > k)
    return quick-select(array, k, l, p-1)
  else
    return quick-select(array, k, p+1, r)  // k is an absolute index, so it is unchanged
}

This is just quicksort where we only recur on the side that matters!
This lowers the expected running time from O(n log n) to O(n). The
rigorous proof is in the text. For intuition, compare T(n/2) + O(n)
with 2T(n/2) + O(n).

This is expected linear time, and with randomization that's probably
sufficient. There is a way to get guaranteed linear time which you
should see once. Basically, we partition around a "median of medians"
in order to guarantee a good split:

Select:
* Divide the elements into groups of 5 and put the median of each
  group in a pile.
* Find the median of the pile (recursively).
* Partition around this median of medians.
* Recur on the correct side.

Claim: The element picked by the above algorithm guarantees that the
split is at worst 3/10 : 7/10. That is, the element is at least the
(3n/10)th and at most the (7n/10)th.

Proof: Half the medians in the pile are greater than the median of
medians, and each such median has two elements in its own group of 5
that are greater still. So at least 3(1/2)(n/5) = 3n/10 elements are
greater. Similarly, 3n/10 are less.

Running time:

  T(n) = T(n/5) + T(7n/10) + O(n)

where T(n/5) is the recursive call to find the median of the pile,
T(7n/10) is recursing on the worst possible split, and O(n) covers
doing the partition and computing the n/5 medians (each O(1)).

This is O(n) -- see text. Note: a little sloppy; it's actually
T(7n/10 + 6). The constant factors are way too high -- use
quick-select in practice. In fact, quicksort can be modified to do
this, but I would call that slowsort. :-)

Linear-Time Sorting

The generic way to sort is with lessThan. And we proved yesterday that
with just this, you cannot sort in better than Omega(n log n) time.
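The two selection routines above can be sketched in runnable form. This is a minimal illustrative sketch with my own naming (`partition` here is a randomized Lomuto-style partition, and ranks are 0-based), not the text's exact presentation:

```python
import random

def partition(a, l, r):
    """Partition a[l..r] around a randomly chosen pivot; return the pivot's final index."""
    i = random.randint(l, r)
    a[i], a[r] = a[r], a[i]          # move pivot to the end
    pivot = a[r]
    store = l
    for j in range(l, r):
        if a[j] < pivot:
            a[store], a[j] = a[j], a[store]
            store += 1
    a[store], a[r] = a[r], a[store]  # pivot lands at its sorted position
    return store

def quick_select(a, k, l, r):
    """Return the rank-k element (0-based) of a[l..r]; expected O(n)."""
    p = partition(a, l, r)
    if p == k:
        return a[p]
    elif p > k:
        return quick_select(a, k, l, p - 1)
    else:
        return quick_select(a, k, p + 1, r)  # k is absolute, so it is unchanged

def select(a, k):
    """Deterministic O(n) selection: partition around the median of medians."""
    if len(a) <= 5:
        return sorted(a)[k]
    # Medians of groups of 5 go in a "pile"; recursively find the pile's median.
    medians = [sorted(a[i:i + 5])[len(a[i:i + 5]) // 2]
               for i in range(0, len(a), 5)]
    mom = select(medians, len(medians) // 2)
    # Partition into three parts and recur only on the side containing rank k.
    less = [x for x in a if x < mom]
    equal = [x for x in a if x == mom]
    if k < len(less):
        return select(less, k)
    elif k < len(less) + len(equal):
        return mom
    else:
        return select([x for x in a if x > mom], k - len(less) - len(equal))
```

The three-way split in `select` (rather than an in-place partition) keeps the sketch short; the guaranteed 3/10 : 7/10 split argument applies the same way.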
But when we do know things about the data, we can leave the comparison
model and do better. For the rest of today we'll examine a few such
methods.

Suppose we have n objects and there are k possible keys. When
convenient, we assume the possible keys are the integers between 0 and
k-1.

If k is much larger than n, then we're usually best off with a
comparison sort. (An exception is bucket sort, described below.) But k
being smaller than or near n is not uncommon. It can happen with a lot
of repeated values; I have run across this in my own programming. When
analyzing your data in homework 6, you may want to organize by fields
that have only a few values. (Probably a spreadsheet will do your
sorting, but it could do it efficiently if it determined that k is
small.) It also happens during radix sort.

If k is small, we might sort this way:
* Make an array of k linked lists (one bucket per key).       O(k)
* Put each element in the correct bucket.                     O(n)
* Walk through the buckets in order to get the final array.   O(n+k)
Total: O(n+k)

If we insert at the front of each list, we keep the sort stable by
putting the elements in in reverse order.

It turns out counting sort (below) has lower constant factors because
it doesn't build up linked lists. But this is the inspiration for
"bucket sort," where each bucket can actually hold a range of keys
which we then sort:

* Divide the key space into m equal-size regions and make one bucket
  for each.
* Put each element in the correct bucket. (The correct bucket is
  key / (k/m).)
* Sort each bucket.
* Concatenate the buckets.

If we assume that the keys are uniformly distributed over the possible
key values, then we expect each bucket to have n/m elements, and
having significantly more is very unlikely. We're using the same math
we did with hash tables, only we're not hashing; we're just supposing
a uniform distribution of the keys already. (We can't hash -- it mixes
things up, and we're trying to sort!) So we expect n/m elements in
each bucket, and it gets exponentially less likely that any bucket
holds more than a constant factor times n/m.
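Both bucket schemes above can be sketched as follows. This is an illustrative sketch with my own function names; it appends at the tail of each bucket (which achieves the same stability as the reverse-order front insertion described above), and for simplicity assumes m divides k:

```python
def key_bucket_sort(items, key, k):
    """Stable sort of items whose key(x) is in 0..k-1, one bucket per key.  O(n+k)."""
    buckets = [[] for _ in range(k)]          # O(k): array of k lists
    for x in items:                           # O(n): appending at the tail keeps
        buckets[key(x)].append(x)             #       equal keys in input order (stable)
    return [x for b in buckets for x in b]    # O(n+k): walk buckets in order

def bucket_sort(keys, k, m):
    """Sort integer keys in 0..k-1 using m equal-size key ranges.

    Expected O(n) when the keys are uniformly distributed."""
    buckets = [[] for _ in range(m)]
    width = k // m                            # each bucket covers `width` key values
    for x in keys:
        buckets[x // width].append(x)         # the correct bucket is key / (k/m)
    out = []
    for b in buckets:
        out.extend(sorted(b))                 # each bucket expects ~n/m elements
    return out
```

Using the built-in `sorted` inside each bucket stands in for any comparison sort; with uniformly distributed keys each bucket's sort is O((n/m) log(n/m)) as in the analysis below.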
So we roughly expect:

  O(n) + m * O((n/m) log(n/m))

which for a good "load factor" (a term not actually used in sorting,
to my knowledge) is O(n). This would do great on your homework,
because we're using random keys. To repeat: bucket sort works fine
with large k if we make the additional assumption of uniformly
distributed keys.

Counting sort: good when k is approximately n or less.

* Sort A, an array of size n with keys from 0 to k-1.
* Make B, an array of size n, to hold the output.
* Make C, an array of size k, initialized with zeros.
* For i from 0 to n-1:
    C[A[i]]++            // make C[i] the number of times key i appears
* For i from 1 to k-1:
    C[i] += C[i-1]       // make C[i] the number of elements with key <= i
* For i from n-1 down to 0:
    B[C[A[i]]-1] = A[i]  // copy each element directly to the right place
    C[A[i]]--            // walk backwards through A for stability

Total running time is O(n+k). And the constant factors are low -- just
array indexing, additions, and copying.

Radix Sort

Split keys into parts and stable sort on the less significant parts
first:

Radix Sort:
  for i = least significant part to most significant part
    stable sort all elements on part i

The intuition is that previous iterations resolve ties correctly, so
we use a stable sort so as not to mess up the previous work. An
example to keep in mind is sorting decimal numbers using one digit for
each part. Notice that for each stable sort in this example k=10, so a
counting sort works well.

The running time is O(d*m), where d is the number of parts (think
digits) and m is the time for one stable sort. If the number of
possible values per part is at most n (where n is the number of
elements), then m = O(n) by using counting sort, so we have O(dn).

Now d is really log k, since we can express k different keys using
log k bits or digits. If k is roughly n, then the running time is
really O(n log n). So radix sort is asymptotically the same as our
optimal comparison sorts.
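The counting sort steps and the radix sort loop above can be sketched together. This is a minimal sketch with my own function names, sorting non-negative integers one decimal digit (so k=10 per pass):

```python
def counting_sort(a, key, k):
    """Stable sort of a by key(x) in 0..k-1, following the steps above.  O(n+k)."""
    c = [0] * k
    for x in a:
        c[key(x)] += 1              # c[i] = number of times key i appears
    for i in range(1, k):
        c[i] += c[i - 1]            # c[i] = number of elements with key <= i
    b = [None] * len(a)
    for x in reversed(a):           # walk backwards for stability
        c[key(x)] -= 1
        b[c[key(x)]] = x            # copy each element directly to the right place
    return b

def radix_sort(a, digits=None):
    """Sort non-negative integers, one decimal digit per stable-sort pass.  O(d*n)."""
    if digits is None:
        digits = len(str(max(a))) if a else 1
    for d in range(digits):         # least significant digit first
        a = counting_sort(a, lambda x: (x // 10 ** d) % 10, 10)
    return a
```

Each pass is a stable counting sort with k=10, so earlier passes' work on less significant digits is preserved, exactly as the intuition above describes.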