CS410, Summer 1998
Lecture 13 Outline
Dan Grossman

Goals:
 * Open Address Hashing

Pointers for chaining take half of the table space or more. That space could instead go toward more buckets, a clear gain in efficiency. So let's not use chaining. Instead, if we have a collision, let's just go to a different bucket. This is called open addressing (I don't know why). As long as we use the same sequence of buckets when inserting and searching for the same key, everything should work fine.

We notice immediately:
 * alpha < 1, else there isn't room for all the elements.
 * The buckets we move to must be a permutation of all the buckets. That is, we must eventually try every bucket if necessary. Otherwise alpha < 1 doesn't guarantee the scheme will work.

Insert: Go to a starting bucket. While the bucket is full, go to the next bucket in the permutation. When an empty bucket is reached, put the data there.

Lookup: Follow the same sequence of buckets. Check keys and return the right one. If we reach an empty bucket, the data isn't in the table.

Delete: Uh-oh -- a naive delete will screw up future lookups. If we delete an element, then a lookup for a different element might see an empty bucket and stop too soon! The solution is either:
 * Use chaining. This is probably the best choice if deletes are common. In many applications, they really aren't.
 * Use a zombie. That is, mark the bucket "no value, but keep looking". Future inserts can reuse this bucket, and lookups will effectively skip over any such zombie bucket.

To pick the starting bucket, use hashing like we did yesterday. Choosing the bucket permutation is the rest of the lecture. There are m! possible permutations. Ideally, a key would get any of them with equal probability. This is the assumption of "uniform hashing". We will not get anywhere close to fulfilling this assumption, but reality will save the day.

For open addressing, let the hash function actually be h(k, i), where k is the key and i is how many full buckets have already been seen. So h(k, 0) is what we did yesterday.
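The insert/lookup/delete operations above can be sketched concretely in Python. This is just an illustrative sketch, not the lecture's code: the names OpenTable and ZOMBIE are made up, the table is fixed-size (no resizing), and the default probe sequence is a simple h'(k) + i just to have a concrete permutation.

```python
ZOMBIE = object()  # marker: "no value, but keep looking"

class OpenTable:
    def __init__(self, m, probe=None):
        self.m = m
        self.slots = [None] * m  # each slot: None, ZOMBIE, or (key, val)
        # probe(k, i) should enumerate a permutation of the buckets;
        # default is h'(k) + i, purely for illustration
        self.probe = probe or (lambda k, i: (hash(k) + i) % m)

    def insert(self, key, val):
        for i in range(self.m):
            j = self.probe(key, i)
            s = self.slots[j]
            # empty or zombie buckets are reusable; same key is an update
            if s is None or s is ZOMBIE or s[0] == key:
                self.slots[j] = (key, val)
                return
        raise RuntimeError("table full; a real table would resize")

    def lookup(self, key):
        for i in range(self.m):
            j = self.probe(key, i)
            s = self.slots[j]
            if s is None:                       # truly empty: not present
                return None
            if s is not ZOMBIE and s[0] == key: # skip zombies, compare keys
                return s[1]
        return None

    def delete(self, key):
        for i in range(self.m):
            j = self.probe(key, i)
            s = self.slots[j]
            if s is None:
                return
            if s is not ZOMBIE and s[0] == key:
                self.slots[j] = ZOMBIE  # keep future lookups probing past
                return
```

Note how delete writes ZOMBIE rather than None, so a later lookup for a key further along the same probe sequence doesn't stop too soon.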
We use h as follows:

    lookup(k):
        i = 0
        while (true)
            // assume we resized if the table got full, and that h(k, .)
            // really gives a permutation, so this loop terminates
            j = h(k, i)
            if array[j] == null
                return NOT_FOUND       // empty bucket: k isn't in the table
            if array[j].key == k       // must compare keys -- the first full
                return array[j].val    // bucket may hold a colliding key
            i++

Let h'(k) be a good function from yesterday. The simplest choice for h is to just keep cycling through buckets until an empty one is found. That is,

    h(k, i) = (h'(k) + i) mod m

This is called linear probing. It is a horrible idea... Of the m! permutations, we use only m of them -- the whole sequence is determined by the first bucket. So keys that initially collide always collide. But we had that problem with chaining too. The real problem is that all m permutations are really the same except for their starting points. As a result, "long chains get longer". To see why, suppose we currently have a chain of length l. Then the chain gets longer if _any_ of the buckets in it is the start bucket for the next insertion. Hence the probability of making a chain longer is at least l/m. That is, long chains get longer even when h'(k) is as good as possible. This is called "primary clustering".

It is not necessary that all m permutations have common subsequences. For example, with m == 4, I made up the following:

    0 -> 2 -> 1 -> 3
    1 -> 0 -> 2 -> 3
    2 -> 3 -> 0 -> 1
    3 -> 1 -> 2 -> 0

It still uses only m sequences, but primary clustering doesn't happen.

I made the sequences above up. A more general approach is "quadratic probing". It is defined by:

    h(k, i) = (h'(k) + c1*i + c2*i^2) mod m

where c1 and c2 are constants chosen such that h(k, i) defines a permutation of all m buckets. It should be clear that there are m permutations (still, all that changes is the start bucket -- h'(k)) and that the permutations are not all really the same. It isn't clear how to pick c1 and c2, though. Here's one choice that works: let m be a power of 2 and let c1 == c2 == 1/2. What this is really doing is going ahead i buckets on the ith iteration. So if we started at bucket 0, we would next check buckets 1, 3, 6, 10, 15 (all mod m, of course).
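We can sanity-check that c1 == c2 == 1/2 with m a power of 2 really yields a permutation. A quick sketch (the function name probe_sequence is made up; the offset c1*i + c2*i^2 simplifies to (i + i^2)/2, which is always an integer):

```python
def probe_sequence(start, m):
    """Buckets visited from `start` under h(k, i) = start + (i + i^2)/2 mod m."""
    return [(start + (i + i * i) // 2) % m for i in range(m)]

m = 8  # must be a power of 2 for this choice of c1, c2
for start in range(m):
    seq = probe_sequence(start, m)
    # every start bucket yields a permutation of all m buckets
    assert sorted(seq) == list(range(m)), "not a permutation!"

print(probe_sequence(0, m))  # -> [0, 1, 3, 6, 2, 7, 5, 4]
```

Note the offsets 0, 1, 3, 6, 10, 15, ... wrap around mod 8, matching "go ahead i buckets on the ith iteration".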
For m == 4, the set of permutations looks like:

    0 -> 1 -> 3 -> 2
    1 -> 2 -> 0 -> 3
    2 -> 3 -> 1 -> 0
    3 -> 0 -> 2 -> 1

We eliminated primary clustering, but we still have "secondary clustering". This refers to elements that collide initially always colliding. Chaining has this too, although for some reason it's not called that. We can avoid even that if we remember that we can choose the next bucket using both i _and_ k. Then different k's that started together wouldn't have to stay together...

Our last method is called "double hashing". It is defined by:

    h(k, i) = (h'(k) + i*h''(k)) mod m

where h' and h'' are two _different_ hash functions. (If they're the same, then we're basically back to linear probing.) There are now m^2 possible permutations -- any start bucket and any jump amount. That's much better than m, but still nowhere near m!. It works well in practice, though.

The catch is making sure any jump value induces a permutation. What we need is that the greatest common divisor of h''(k) and m is 1. Mathematicians can prove this is sufficient -- it should be fairly intuitive if you try a few examples. Here are two simple ways to ensure this property of h''(k) and m:

 * Make m a power of 2, and make sure h''(k) is always odd (perhaps by adding 1 if necessary).
 * Make m prime (since h''(k) produces a number < m).

Also, h''(k) cannot be zero (look at the definition of h(k, i) to see why).

None of the methods we looked at achieves uniform hashing. But hypothetically, assume all m! permutations are equally likely. Then how long do we expect chains to be? Remember alpha is the load factor. When doing an unsuccessful lookup, the first bucket we look at is full with probability alpha. The probability that the first two are full is at most alpha^2, the first three at most alpha^3, and so on. The expected number of buckets examined is therefore at most

    1 + alpha + alpha^2 + alpha^3 + ... = 1/(1 - alpha)

So for a smaller load factor, we expect chains to be shorter.
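The gcd condition is easy to see by example. A small sketch (double_hash_sequence is a made-up name): with m == 8, a jump of 3 (gcd 1) visits every bucket, while a jump of 2 (gcd 2) cycles through only the even buckets and never finds the odd ones.

```python
from math import gcd

def double_hash_sequence(start, jump, m):
    """Buckets visited by h(k, i) = (start + i*jump) mod m, for i = 0 .. m-1."""
    return [(start + i * jump) % m for i in range(m)]

m = 8
good = double_hash_sequence(0, 3, m)  # gcd(3, 8) == 1: a permutation
bad = double_hash_sequence(0, 2, m)   # gcd(2, 8) == 2: not a permutation

assert gcd(3, m) == 1 and sorted(good) == list(range(m))
assert gcd(2, m) != 1 and sorted(bad) != list(range(m))
print(bad)  # -> [0, 2, 4, 6, 0, 2, 4, 6]: only the even buckets, forever
```

This also shows why h''(k) == 0 is fatal: a jump of 0 never leaves the start bucket at all.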
This should be intuitive and reassuring. And it gives us a formula for making our time-space trade-off decision of picking alpha.
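As a sketch of that trade-off, plugging a few load factors into the 1/(1 - alpha) bound gives expected probe counts for an unsuccessful lookup:

```python
# Expected probes for an unsuccessful lookup under uniform hashing,
# bounded by 1 / (1 - alpha).
for alpha in (0.50, 0.75, 0.90):
    print(f"alpha = {alpha:.2f}: about {1 / (1 - alpha):.1f} probes")
# alpha = 0.50: about 2.0 probes
# alpha = 0.75: about 4.0 probes
# alpha = 0.90: about 10.0 probes
```

So a half-full table costs about 2 probes per miss, while a 90%-full table costs about 10 -- the space saved by a high alpha is paid for in probe time.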