Analysis of quicksort

THEOREM. The expected running time of quicksort, assuming the elements of the input vector are distinct and all permutations are equally likely, is O(n log n).

PROOF. The expected running time on inputs of length n is the running time averaged over all inputs of length n. Let's denote this by E(T(n)) (read: the expected value of T(n)). To get the expected running time, we compute, for each input of length n,

 - the time required on that input, times
 - the probability that that input occurs.

Then we add all these up. (That's the definition of average.)

Observe that if all the elements of the input vector are distinct, then the running time does not depend on the actual values, but only on their order. Thus there are essentially n! different inputs, one for each permutation, and each one is equally likely. Let's assume then for simplicity that p=1, r=n, and the numbers we are sorting are 1,2,3,...,n.

There are a couple of shortcuts we can take in computing the expected running time. First, for an input of length n, there are n possible choices for the pivot, and each one is equally likely. If the pivot is 1, the split is 1:n-1; if the pivot is m+1, the split is m:n-m. If we know the split is m:n-m, then the algorithm takes time cn for some constant c, plus the time for the recursive calls, one on a vector of length m, the other on a vector of length n-m.

Now here is a trick: if we somehow knew that the pivot would be 1, then the expected running time would be E(cn + T(1) + T(n-1)); if we somehow knew that the pivot would be m+1, it would be E(cn + T(m) + T(n-m)). We don't know what the pivot will be, but we do know the probabilities of each possible pivot, and we can average these conditional expected running times over all choices of pivot. This is done by summing, for each possible pivot, the expected running time given that pivot, times the probability of that pivot.
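For concreteness, the procedure under analysis can be sketched as follows. This is a minimal Python sketch, not the notes' own code: the function name, the random pivot choice, and the partition details are my assumptions, chosen to match the model in which each of the n pivots is equally likely.

```python
import random

def quicksort(a, p=0, r=None):
    """Sort a[p:r] in place; p and r follow the notes' convention."""
    if r is None:
        r = len(a)
    if r - p <= 1:
        return
    # Choose the pivot uniformly at random: each of the r-p candidates
    # is equally likely, matching the analysis above.
    i = random.randrange(p, r)
    a[p], a[i] = a[i], a[p]
    pivot = a[p]
    # Partition: elements < pivot end up left of the pivot's final slot.
    lo = p
    for j in range(p + 1, r):
        if a[j] < pivot:
            lo += 1
            a[lo], a[j] = a[j], a[lo]
    a[p], a[lo] = a[lo], a[p]
    # The pass above costs time linear in r-p (the "cn" term); the two
    # recursive calls realize the split m : n-m.
    quicksort(a, p, lo)
    quicksort(a, lo + 1, r)
```

The partition pass costs the cn term; the two recursive calls are the T(m) and T(n-m) terms in the averaging argument.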
In other words,

  E(T(n)) = E(cn + T(1) + T(n-1))/n + sum_{m=1}^{n-1} E(cn + T(m) + T(n-m))/n

Each term is divided by n because 1/n is the probability that any given number between 1 and n is the pivot. The expectation operator E() is always a linear function, thus we can write

  E(T(n)) = cn/n + E(T(1))/n + E(T(n-1))/n
            + sum_{m=1}^{n-1} cn/n + sum_{m=1}^{n-1} E(T(m))/n + sum_{m=1}^{n-1} E(T(n-m))/n

         <= c + 1 + E(T(n-1))/n
            + c(n-1) + sum_{m=1}^{n-1} E(T(m))/n + sum_{m=1}^{n-1} E(T(n-m))/n

(bounding the constant E(T(1))/n by 1)

          = cn + 1 + E(T(n-1))/n
            + sum_{m=1}^{n-1} E(T(m))/n + sum_{m=1}^{n-1} E(T(n-m))/n

Renumbering the last sum (replace m by n-m),

  E(T(n)) <= cn + 1 + E(T(n-1))/n + sum_{m=1}^{n-1} E(T(m))/n + sum_{m=1}^{n-1} E(T(m))/n

           = cn + 1 + E(T(n-1))/n + 2 sum_{m=1}^{n-1} E(T(m))/n

Now we would like to prove E(T(n)) <= dn ln n for some constant d. We're using ln = natural log (log base e) instead of log base 2 just to simplify some expressions; it's only a constant factor (log n = ln n / ln 2). We get to choose d as big as we like to make this work, as long as it is a constant independent of n. To figure out what the value of d should be, we can attempt an inductive proof parametrized by d, then figure out what value we need to make the proof go through.

For the basis, take n=2; surely d can be chosen big enough that E(T(2)) <= 2d ln 2.

Now we do our parametrized induction step. Suppose E(T(m)) <= dm ln m for all m < n. We would like to show E(T(n)) <= dn ln n. We have

  E(T(n)) <= cn + 1 + E(T(n-1))/n + 2 sum_{m=1}^{n-1} E(T(m))/n

          <= cn + 1 + d(n-1)ln(n-1)/n + 2 sum_{m=1}^{n-1} (dm ln m)/n

by the induction hypothesis. Since (n-1)ln(n-1)/n <= ln n, this is

          <= cn + 1 + d ln n + 2d (sum_{m=1}^{n-1} m ln m)/n

          <= cn + 1 + d ln n + 2d (Integral_1^n m ln m dm)/n

estimating the sum by a definite integral -- draw the graph! (m ln m is increasing, so the sum is a lower Riemann sum for the integral). Integrating by parts, Integral m ln m dm = (m^2 ln m)/2 - m^2/4, so this is

           = cn + 1 + d ln n + 2d((n^2 ln n)/2 - n^2/4 + 1/4)/n

           = cn + 1 + d ln n + dn ln n - dn/2 + d/(2n)

and we must show this is <= dn ln n. We still get to choose d as big as we like, as long as it is a constant independent of n.
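As an aside, the integral estimate used in the induction step can be spot-checked numerically. This is a quick illustrative sketch (the helper names lhs and rhs are mine), comparing the sum against the closed form obtained by integration by parts:

```python
import math

def lhs(n):
    # sum_{m=1}^{n-1} m ln m  (the m=1 term contributes 0)
    return sum(m * math.log(m) for m in range(1, n))

def rhs(n):
    # Integral_1^n m ln m dm = (n^2 ln n)/2 - n^2/4 + 1/4
    return (n * n * math.log(n)) / 2 - n * n / 4 + 0.25

# m ln m is increasing, so the sum is a lower Riemann sum for the integral.
for n in (2, 10, 100, 1000):
    assert lhs(n) <= rhs(n)
```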
Simplifying the inequality

  cn + 1 + d ln n + dn ln n - dn/2 + d/(2n) <= dn ln n,

we get

  1 + d ln n + d/(2n) <= (d/2 - c)n.

Picking d large enough that the coefficient (d/2 - c) on the right-hand side is positive, the right-hand side is Theta(n) and the left-hand side is o(n), so for sufficiently large n the right-hand side dominates.

------------------------------------------------------------------

Every comparison-based sorting algorithm must make at least n log n comparisons on inputs of length n in the worst case. That is because the algorithm must distinguish among n! ~ 2^(n log n) possible input permutations, and a binary-branching decision tree generated by a comparison-based algorithm must have depth at least n log n to have that many leaves.

------------------------------------------------------------------

Linear time sorts: counting sort, bucket sort, radix sort. Good treatment in CLR 9.2-4, pp. 175-183.

Bucket sort, which we did not go over in class. Say the key space is the integers [0,k].

 * Divide the key space into m equal-size regions, and make one "bucket" (a linked list) for each.
 * Go through the input array, putting each element x in the bucket for its region, namely bucket floor(xm/(k+1)). Link it into the list for that bucket.
 * Sort each bucket.
 * Concatenate the sorted buckets.

If we assume that the keys are uniformly distributed over the possible key values, then we expect each bucket to have n/m elements; having significantly more is very unlikely. We roughly expect time

  O(n) + m * O((n/m) log(n/m)) = O(n log(n/m)),

which is O(n) when m is a constant fraction of n. For example, if m = n/1000, we use an array of size m, and we would have expected time O(n log 1000) = O(n). Note that this does not depend on k.

Some important points:

--counting sort: Good when k is approximately n or less. After constructing the array of counts of the keys, it is tempting to just go through from left to right and write down the keys in order in the output array, since we know how many of each one there are from the count array.
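That tempting keys-only shortcut can be sketched as follows (a hypothetical helper of my own, assuming integer keys in [0,k]):

```python
def counting_sort_keys_only(a, k):
    """Sort integer keys in [0, k]; usable only when the keys alone matter."""
    count = [0] * (k + 1)
    for x in a:
        count[x] += 1          # tally how many times each key occurs
    out = []
    for key in range(k + 1):
        out.extend([key] * count[key])  # emit each key count[key] times
    return out
```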
If we are only interested in the keys, this is OK. But usually there is some other information associated with each element being sorted besides the key, and we have to get this from the original array.

--radix sort: sort on digit positions in order from least significant to most significant. You must use a stable sorting procedure, so that the sorting you do later on the higher-order digits does not mess up the sorting you already did on the lower-order digits.

--bucket sort: the keys do not have to be real numbers in the range [0,1] as in CLR. For example, they could be integers, as long as you know the bounds.
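As a concrete instance of the last point, here is a bucket sort sketch for integer keys in [0,k], following the steps listed above. The function name and the exact bucket-index formula floor(x*m/(k+1)) are my choices, and Python lists stand in for the linked-list buckets:

```python
def bucket_sort(a, k, m):
    """Sort a list of integers in [0, k] using m buckets."""
    # One bucket per equal-size region of the key space [0, k].
    buckets = [[] for _ in range(m)]
    for x in a:
        # Each region covers (k+1)/m key values, so x lands in
        # bucket floor(x*m/(k+1)), which is always in [0, m-1].
        buckets[x * m // (k + 1)].append(x)
    # Sort each bucket (expected size n/m for uniform keys),
    # then concatenate the sorted buckets in region order.
    out = []
    for b in buckets:
        b.sort()
        out.extend(b)
    return out
```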