CS410, Summer 1998
Lecture 12 Outline
Dan Grossman

Goals:
* Good ways to transform data to integers for purposes of hashing
* Using Java's Hashtable class
* Good hashing functions

Reading: CLR 12.3

Recall our framework: we transform a key to an integer and then hash to a bucket. The two stages are commonly conflated, with the first stage called hashing as well.

Transforming data to integers

Our running example will be strings. In fact, any kind of data can be handled the same way: just imagine putting the different fields together to make one long conceptual string. So we assume we have one field and treat it like a String; the whole discussion generalizes without difficulty.

Let's be clear about what we want:
* Different keys should transform to different integers.
* The transformation should be fast -- it could well be the bottleneck of our program.

Guidelines to follow:
* Use all of the key that is not constant, and use the different parts proportionally. For Strings, this means using all of the characters. Otherwise, data where everything collides is too common.
* Don't lose any more information than necessary.

Bad example: add up the characters in a String. That is, let a be 1, b be 2, c be 3, etc., and total up the string. This violates our second guideline horribly -- all ordering information is lost, because any permutation of the same letters goes to the same integer. It's also bad because lots of Strings collide; for example, abc, cba, f, daa, and caaa all total 6.

Good example: multiply by the size of the alphabet (say 26 or 52) before adding the next character. So acb becomes, for example, 26*(26*1 + 3) + 2. This avoids the earlier problems. If we make the multiplying factor a power of 2, this is equivalent to shifting the bits set by previous characters "over to the left" so that we do not change them when we add the next character.

Unfortunately, we will quickly have a number bigger than an int can hold. (In Java, this is 32 bits.
It is generally 16, 32, or 64 bits on current computers.) We could deal with big integers, but it is faster not to. If we just naively kept "shifting over," we would lose everything not contributed by the last few characters, violating our first guideline.

Here are a couple of tricks for using all of the characters to produce an int. We'll assume we're dealing with 32 bits, but the exact constant doesn't matter.

First, we could multiply by a large prime number after adding each character. Because multiplication of ints is implicitly mod 2^32, this really returns the low-order bits of the product. These should be well "mixed up" when multiplying by a prime. It takes number theory to show that primes are good, but you can see, for example, that a power of 2 would be bad -- it just puts zeros in the low-order bits!

Second, we could do something a tad more intuitive. Assume we have an alphabet of 26 letters, so the integer corresponding to a letter fits in 5 bits. We'll waste 2 of our 32 bits, so we can fit 6 characters' worth of information in an int by shifting over as before. For more than 6 characters, we can cycle around and xor (exclusive or) what is currently in a 5-bit segment with the next character. Think of the 32 bits in an int like this:

  XX ----- ----- ----- ----- ----- -----

  int key = 0, j = 0;
  for (int i = 0; i < s.length(); i++) {
    int newNum = letterToNum(s.charAt(i));
    newNum = newNum << 5*j;   // shift into the j-th 5-bit segment
    key = key ^ newNum;       // ^ is the bit-wise xor operator
    j = (j+1) % 6;            // cycle through the 6 segments
  }
  return key;

The intuition behind using exclusive or is that the number of 1's and 0's will remain roughly the same. If we used or, our key would turn into mostly 1's; if we used and, mostly 0's.

Using Java's Hashtables

The keys for Java's Hashtables are Objects. This would seem to cause problems:
* How can it transform an Object to an integer?
* How can it compare two keys when doing lookup?

The answer is that every object has methods hashCode() and equals(Object other).
The former takes no arguments and returns an integer -- so every object knows how to transform itself. The latter takes another Object and returns a boolean -- true if the two Objects should be considered the same key.

The default behavior is to use the address in memory for hashCode and to make equals true only for the same actual Object. Hence, if these methods are not overridden, no two keys will ever be the same. This is rarely what you want. For example, with Strings, we want equals to be true if the two Strings have the same characters in the same positions. Then we had better override hashCode too, or else two equal Strings could end up in different buckets -- clearly incorrect behavior. The library has already done the right thing for Strings, so you don't have to worry about this on your homework. But you do in applications where your keys are something that hasn't already overridden equals and hashCode.

You can consult the Java documentation for how to use the Hashtable class. Here's a brief summary:
* Call the constructor to create a new table.
* put(Object key, Object value) is what we called insert.
* get(Object key) is what we called lookup.
* containsKey(Object key) returns true iff the table has an element with that key.
* contains(Object value) returns true iff the table has an element with that value. It is not clear to me what efficiency we should expect for this operation -- use containsKey.
* Enumeration keys()
* Enumeration elements()

Enumerations are useful and provided by many Java data structures. The idea is that we want to walk through all the items in the table, but we can't do that directly because we don't have access to the actual table. An Enumeration is an object with two methods:
* hasMoreElements() returns false once you've seen every element.
* nextElement() returns the next element.

Be careful not to call elements() if you want keys() -- this cost me 45 minutes when debugging homework 4.
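To make the equals/hashCode contract concrete, here is a minimal sketch with a hypothetical two-field key class (the class and its field names are my invention, not from the lecture; it uses today's generic Hashtable syntax):

```java
import java.util.Hashtable;

// Hypothetical key class: two Names with the same characters in the same
// fields should be the same key, so we override equals AND hashCode.
class Name {
    final String first, last;
    Name(String first, String last) { this.first = first; this.last = last; }
    @Override public boolean equals(Object other) {
        if (!(other instanceof Name)) return false;
        Name n = (Name) other;
        return first.equals(n.first) && last.equals(n.last);
    }
    @Override public int hashCode() {
        // Combine both fields so both contribute to the bucket choice.
        return 31 * first.hashCode() + last.hashCode();
    }
}

public class TableDemo {
    public static void main(String[] args) {
        Hashtable<Name, Integer> table = new Hashtable<>();
        table.put(new Name("Dan", "Grossman"), 12);  // insert
        // Lookup succeeds with a *different* object that equals the key:
        System.out.println(table.get(new Name("Dan", "Grossman")));  // prints 12
        System.out.println(table.containsKey(new Name("X", "Y")));   // prints false
    }
}
```

If hashCode were left at its default, the second Name object would likely land in a different bucket and the lookup would miss -- exactly the incorrect behavior described above.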
Hashing

Doing a good job of transforming is wasted effort if we don't hash well. We'll discuss 3 methods: the division method, the multiplication method, and universal hashing.

The division method is just h(k) = k mod m, where k is the integer we are hashing. Mathematicians tell us that m (the table size) should be a prime number not near a power of 2. To see why primality helps, consider the worst case, which is when m is a power of 2. Then the division method just throws away the high-order bits of the int we worked so hard to create! In general, if k has factors in common with m, it loses more information when taken mod m. Prime numbers don't have many factors, so they're a good choice. It is less clear why being near a power of 2 is bad, but do what you're told. :-) Your text mentions that Strings differing by only a transposition of characters collide when m = 2^p - 1.

Having to choose these particular table sizes gets in the way of our standard method of doubling the size of the table when the number of elements gets too large. Instead of actually doubling, we should just approximately double: look up a bunch of good table sizes in a math book, "hard-wire" them into an array before we compile our program, and then when we resize, use the next larger element in this array.

The multiplication method is just h(k) = floor(m * (kA mod 1)), where A is a constant between 0 and 1. The idea is to take the part of the product kA to the right of the decimal point (that's what mod 1 means) and multiply by m to get a bucket. The advantage over the division method is that all the "magic" is in A, so m can go back to being a nice convenient power of 2. In fact, we'll see in a minute that we should insist that it is a power of 2. Your text says A = (sqrt(5) - 1)/2 "works pretty well".
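Both formulas can be sketched directly. This is a hedged illustration (the method names are mine), with the multiplication method written in floating point exactly as stated above:

```java
public class HashMethods {
    // Division method: h(k) = k mod m, where m should be a suitable prime.
    static int divisionHash(int k, int m) {
        return Math.floorMod(k, m);  // floorMod keeps the result in 0..m-1
    }

    // Multiplication method: h(k) = floor(m * (kA mod 1)).
    static int multiplicationHash(int k, int m) {
        final double A = (Math.sqrt(5) - 1) / 2;  // the constant the text suggests
        double kA = k * A;
        double frac = kA - Math.floor(kA);        // kA mod 1: the fractional part
        return (int) (m * frac);                  // the cast truncates, i.e. floor
    }

    public static void main(String[] args) {
        System.out.println(divisionHash(123456, 701));        // 701 is prime
        System.out.println(multiplicationHash(123456, 1024)); // m a power of 2
    }
}
```

Note that m appears only as the last step in multiplicationHash; changing the table size does not disturb the "mixing" done by A.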
We can avoid using floating point numbers by being clever and taking advantage of the fact that int multiplication is implicitly mod 2^32 (or 2^w, where w is the number of bits in an int). This only works if m is a power of 2, say m = 2^p. Instead of using A, use A', which by definition is floor(A * 2^w). (You don't actually calculate floor(A * 2^w) in your program; A' is just a hard-wired integer constant.) Now

  h(k) = (k * A') >> (w - p)

where >> means "shift bits to the right". What's happening is that k*A' holds the part that used to be to the right of the decimal point -- the other bits disappeared because we're working mod 2^w. Then we shift right so that we keep the p = log m high-order bits -- this is equivalent to what was multiplying by m and taking the floor. You just have to stare at the math a bit to see that this is exactly the multiplication method, just without floating point numbers.

Universal Hashing

Since there are more keys than buckets, for any hashing function there are possible sets of keys with horrible collisions. If an evil adversary knew our hashing function, he could always produce such a set of keys. We can beat this by having a bunch of different hash functions, all of them good, and choosing one at random when we make a new table. (Of course, we have to use the same function for the life of a table, or we won't be able to find the objects we've inserted.) Then the probability of bad collisions is provably low _regardless of the set of keys_. And even if you have bad collisions for one table, the same keys will probably do fine the next time, when a different hash function is used.

More formally, let H = {h_1, h_2, ..., h_t} be a set of hash functions and let U be the set of possible keys. We say that H is "universal" over U if for arbitrary distinct x, y in U, the number of h's in H for which x and y collide is |H|/m, where |H| is the size of H and m is the number of buckets. Intuitively, this means that if we pick h randomly from H, the probability that x and y collide (that is, h(x) = h(y)) is 1/m.
Since we don't have x and y in advance and there are m buckets, this is as good as we could expect. Furthermore, this is true for _all_ x and y, so we don't expect any set of keys to do worse than any other. Finally, if we have n things in our table, then the number we expect to collide with x is (n-1)/m. The "-1" is because x, by definition, does not collide with itself.

So we have defined universal hashing and convinced ourselves that it would be great, but we haven't actually exhibited an H that matches the definition, and it isn't at all obvious that such an H exists. In fact, one does...

Let x be a key, and assume it is w bits long. Break x into r+1 binary substrings, each b = w/(r+1) bits long. We require r to be large enough that 2^b < m, and we also make m prime. Call the substrings x_0, x_1, x_2, ..., x_r and interpret them as binary integers between 0 and (2^b)-1. Also pick r+1 random numbers between 0 and m-1. Call these numbers a_0, a_1, ..., a_r, and call the collection of them a. Finally define:

  h_a(x) = (sum from i=0 to r of a_i * x_i) mod m

Each choice of a defines a different hash function. There are m^(r+1) such functions; let the collection of all of them be H.

Claim: H is universal.

Proof: Let x != y. Then some x_i != y_i. Assume without loss of generality that x_0 != y_0. (The proof is similar if some other substring is different.) For a particular h_a, we have a collision iff h_a(x) == h_a(y), that is, iff

  (sum from i=0 to r of a_i * x_i) mod m == (sum from i=0 to r of a_i * y_i) mod m

Rearranging terms, this is true iff

  a_0 * (x_0 - y_0) mod m == (- sum from i=1 to r of a_i * (x_i - y_i)) mod m

The number on the right is something between 0 and m-1, and x_0 - y_0 is not zero mod m (since 0 < |x_0 - y_0| < 2^b < m). It follows from number theory (using the fact that m is prime) that exactly 1 of the m possible values for a_0 causes a collision. So for any fixed a_1, a_2, ..., a_r, exactly 1/m of the hash functions using those numbers cause a collision on x and y.
So, grouping the elements of H by common values of a_1, a_2, ..., a_r, we see that H meets the definition of universal.
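The construction can be sketched in Java. This is a hedged illustration with constants of my own choosing (the lecture fixes none): m = 101 buckets (prime), b = 6 bits per piece (2^6 = 64 < 101), and r+1 = 5 pieces covering a 30-bit key; the a_i are drawn from 0..m-1 once, when the function is created:

```java
import java.util.Random;

public class UniversalHash {
    static final int M = 101;     // number of buckets; prime, as required
    static final int B = 6;       // bits per piece; 2^6 = 64 < 101
    static final int PIECES = 5;  // r+1 pieces, covering a 30-bit key
    private final int[] a = new int[PIECES];

    // Picking the a_i at random selects one function h_a from the family H.
    UniversalHash(Random rnd) {
        for (int i = 0; i < PIECES; i++)
            a[i] = rnd.nextInt(M);  // each a_i uniform in 0..m-1
    }

    // h_a(x) = (sum of a_i * x_i) mod m, where x_i is the i-th 6-bit piece.
    int hash(int key) {
        int sum = 0;
        for (int i = 0; i < PIECES; i++) {
            int piece = (key >>> (B * i)) & ((1 << B) - 1);  // extract x_i
            sum = (sum + a[i] * piece) % M;  // partial sums stay small; no overflow
        }
        return sum;
    }
}
```

A new table would construct one UniversalHash and use that same object for the table's whole lifetime, as the notes require.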