CS410, Summer 1998
Lecture 11 Outline
Dan Grossman

Goals:
* Finish up heaps
* Introduction to hash tables

Reading: CLR Chapter 12

* Prelim information is available on the Web.
* HW4, a programming assignment, will be handed out tomorrow and due next Wednesday.
* HW5 will be handed out next Tuesday and be due the following Monday.

* Heaps:

Remember from last time:
* Inspired by the priority queue ADT, we developed the heap data structure. Notice it can't do lookup.
* Using the "heap shaped trees fit in arrays" trick, we replaced all pointers with array indices and got O(log n) time operations with very low constants.
* We even built a heap from an unsorted set of n elements in time O(n). Recall how we built it bottom up: take two heaps, add a root, and heapify to make a larger heap.

Making a heap out of two smaller heaps only works because:
* The two are the same height.
* The one on the left has a complete bottom level.
Otherwise the result does not have the heap shape. But "merging heaps" seems like a potentially useful operation; CS482 covers other kinds of heaps that support merging.

Other uses of heaps:
* Find the m largest objects out of a collection of n objects: keep a min heap of the m largest seen so far. Compare each object with the min; if it is bigger, replace the min and heapify. (You will do this on homework 4.) This finds the m largest in time O(n log m).
* Sorting: build a heap of n objects, then keep calling delete; the elements come out in sorted order. The total work is dominated by the n delete operations, for a running time of O(n log n).

* Hash Tables

Back to dictionaries: could we even beat O(log n) and get to O(1)? To do so, we will change our desires a bit:
* We will not be able to compute predecessor or successor in less than O(n) time.
* We will rely on luck and accept an _unlikely_ worst case of O(n).

The only way we know to get to an arbitrary amount of data in O(1) time is with an array.
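To make the array idea concrete before we generalize it: if the keys happen to be small nonnegative integers, the key itself can be the array index ("direct addressing"). This is a sketch, not course code; the class and method names are illustrative.

```java
// Sketch of direct addressing: the key itself is the array index.
// Every operation is O(1), but the array must be as large as the
// entire range of possible keys -- the space problem discussed next.
public class DirectAddress {
    String[] table;  // one slot per possible key

    DirectAddress(int keyRange) {
        table = new String[keyRange];
    }

    void insert(int key, String value) { table[key] = value; }  // O(1)
    String lookup(int key)             { return table[key]; }   // O(1)
    void delete(int key)               { table[key] = null; }   // O(1)
}
```

Note that with a key range of, say, all 32-bit integers, this table would need billions of slots no matter how few keys we actually store.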
So if we had a way to take a key and figure out which array index held the corresponding value, we'd be in great shape. This works great if the keys are integers close to each other; otherwise we end up wasting an unacceptable amount of space. We also have a problem if there are duplicate keys.

Alternatively, we could take the integer keys and use a mathematical function to determine which array index (hereafter which "bucket") holds the value. Such a function is called a "hash function". The most obvious function is

    h(key) = key mod m

where m is the size of the array (hereafter referred to as the table size). If the hash function is computable in constant time, then everything can be O(1) -- on insert, lookup, and delete, "hash the key" and you know exactly where to go.

The problem, of course, is "collisions". If the number of possible key values is greater than m, then collisions are unavoidable. And in the worst case, everything ends up in the same bucket! Our approach must be to:
* deal with collisions, since they may occur
* be clever, so that collisions are reasonably unlikely
* accept that empirically everything works like a charm

To deal with collisions, we can just put a linked list at each bucket holding all the items that hash to that bucket. This is called "chaining", and each list is called a "chain". Notice that items with the same key will always collide, but items with different keys might collide as well.

The average length of a chain is n/m, the number of objects in the table divided by the number of buckets. We define alpha = n/m and call alpha the "load factor". We keep alpha constant, and in practice it is something between, say, 0.5 and 3. Notice it can be less than one; in that case some buckets are guaranteed to be empty.

If everything hashes to the same bucket, then lookup and delete take time O(n). What we want is for every bucket to hold about alpha elements. If we just assume this is true, we make the "assumption of simple uniform hashing".
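The chaining scheme just described might look like this for integer keys, with h(key) = key mod m. This is a sketch under those assumptions; the names are illustrative, not from the course.

```java
import java.util.LinkedList;

// Sketch of a hash table with chaining: an array of m buckets, each
// holding a linked list (chain) of the entries that hash to it.
public class ChainedTable {
    static class Entry {
        int key;
        String value;
        Entry(int k, String v) { key = k; value = v; }
    }

    LinkedList<Entry>[] buckets;  // one chain per bucket
    int n = 0;                    // number of objects in the table

    @SuppressWarnings("unchecked")
    ChainedTable(int m) {
        buckets = new LinkedList[m];
        for (int i = 0; i < m; i++) buckets[i] = new LinkedList<>();
    }

    // h(key) = key mod m; floorMod keeps the index nonnegative.
    int hash(int key) { return Math.floorMod(key, buckets.length); }

    void insert(int key, String value) {
        buckets[hash(key)].add(new Entry(key, value));  // O(1)
        n++;
    }

    String lookup(int key) {
        // Walk the chain: expected length is alpha = n/m.
        for (Entry e : buckets[hash(key)])
            if (e.key == key) return e.value;
        return null;  // not present
    }

    double loadFactor() { return (double) n / buckets.length; }
}
```

With m = 4, the keys 3 and 7 both hash to bucket 3 and end up on the same chain, so lookup must still compare keys as it walks the list.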
Under this assumption, the operations are O(1) because alpha is a constant.

A slightly different assumption is that the bucket to which a key hashes is random, with every bucket equally likely. Under this assumption, the likelihood that any bucket has beta > alpha objects decreases exponentially as beta increases. (The math is way beyond this class.) So we can again expect O(1) operations. If we just want to say "assuming O(1) operations", then we can loosely say "assuming hashing behaves". Tomorrow's lecture is devoted to hopefully making this assumption a reality.

We need to keep the load factor alpha constant, so if enough objects are inserted, we will have to increase the number of buckets. As usual with array resizing, we should make twice as many buckets. After doing so, we will have to rehash all the elements in the table. Since we're doubling, this won't happen very often.

So far we have assumed the keys are integers. If we want to hash on something else (for example Strings, but really we could hash on any kind of Object), we will first transform the key into an integer. So now the process looks like:

        transform           hash
  key -----------> integer ------> bucket

Actually, a lot of people (including me) call the first arrow hashing too. I will try to avoid this ambiguity in class. Despite the text's scant coverage, choosing a good transform function is as crucial as choosing a good hash function. After all, if everything transforms to the same integer, then everything will collide.

We have to store the key in the table in order to do the operations. Whether we also store the transformed integer in the table or recompute it whenever we need it is a classic space vs. time trade-off. The integers take space, and they're only strictly needed when the table is being resized. However, they can also be used to speed up lookup operations if we assume (as is usually the case) that comparing integers is faster than comparing keys. (Again, consider Strings as a good example.)
To look up a key, walk down the chain:

    if the integer is equal to the integer of what you're looking for
        then if the keys are equal
            then return the value
            else continue
        else continue

Notice we still have to compare keys, since distinct keys may transform to the same integer.
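As a concrete sketch, here is how that chain walk might look in Java, with the transformed integer stored alongside each key. The transform shown is one common choice (the same polynomial idea Java's own String.hashCode uses); the class and method names are illustrative, not from the course.

```java
import java.util.LinkedList;

// Sketch of the chain walk above. Each entry stores transform(key) so
// a cheap int comparison can screen out most mismatches before the
// more expensive key comparison.
public class ChainWalk {
    static class Entry {
        int transformed;  // transform(key), stored at insert time
        String key;
        String value;
        Entry(int t, String k, String v) {
            transformed = t; key = k; value = v;
        }
    }

    static String lookup(LinkedList<Entry> chain, String key) {
        int t = transform(key);
        for (Entry e : chain) {
            if (e.transformed == t          // fast integer comparison first
                    && e.key.equals(key))   // then the real key comparison
                return e.value;
            // otherwise continue down the chain
        }
        return null;  // not in this chain
    }

    // One possible transform from String keys to integers; mixing every
    // character in keeps different keys from transforming alike too often.
    static int transform(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = 31 * h + key.charAt(i);
        return h;
    }
}
```

Two distinct keys can still share a transformed integer, which is exactly why the key comparison cannot be skipped.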