Hash Tables

We've seen several different implementations of dictionaries so far. First we had simple lists, which had O(n) access time. Then we saw a couple of ways of implementing binary search trees with O(log n) access time. What if we could do even better? We now have arrays in ML, which offer constant-time (O(1)) access to arbitrary elements.

Direct-Address Tables

Suppose we wanted to implement a reverse phone book: We want a dictionary that maps phone numbers of people in the class to their names. We could just use the phone number as an array index:

structure R = struct

  type rpb = string option array

  fun empty():rpb = Array.array(10000000, NONE)

  fun insert(dict:rpb, num:int, name:string):unit =
      Array.update(dict, num, SOME name)

  fun lookup(dict:rpb, num:int):string option =
      Array.sub(dict,num)

end

(Notice that this data structure is not functional, since insert returns unit, not a modified dictionary.) Now insert and lookup certainly are O(1)-time operations. Unfortunately, the whole data structure is a gross waste of memory, since we don't expect to have more than about 100 entries, but we allocated space to store 10 million entries! This data structure, called a direct-address table, may be ideal when the space of keys is small, but we need to do better here.
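
As a quick sketch of how this structure might be used (the number and name here are made up):

```sml
val d = R.empty()
val () = R.insert(d, 2551234, "Alice")
val r = R.lookup(d, 2551234)    (* SOME "Alice" *)
```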

Hash Tables

Consider the case of a normal phone book (names to numbers). We couldn't use this approach directly because the keys aren't even numbers; we'd have to compute some function h(k) of the key to use as the array index. We could express each character as a number and then string all the digits together, but the indices would be huge. So instead let's pick a hash function that maps the keys to a more reasonable set of buckets. Since there are about 100 students in the class, we'll use an array of size 400, so each key should hash to a value in the range 0-399. We could define a hash function like

fun hash(s:string):int =
    foldl (fn (c:char, n:int) => (n + ord c) mod 400) 0 (explode s)

Then our only problem is that, since the array is much smaller than the space of all keys, it's possible for position h(k) of the array to hold a value associated with a key other than k. (By the Pigeonhole Principle, we know that this is a problem with some indices.) So we need to store the key in the array, too, and figure out what to do with such collisions. Hopefully our hash function will distribute the names fairly evenly to the buckets, making collisions unlikely. The simplest solution is to store a list at each position in the array; this is called chaining. If collisions are unlikely, then the lists should hold at most a few elements. Then our dictionary is implemented as

structure P = struct

  type pb = (string*int) list array

  fun hash(s:string):int =
      foldl (fn (c:char, n:int) => (n + ord c) mod 400) 0 (explode s)

  fun empty():pb = Array.array(400, nil)

  fun insert(dict:pb, name:string, num:int):unit =
      Array.update(dict, hash(name), (name,num)::Array.sub(dict, hash(name)))

  fun lookup(dict:pb, name:string):int option =
      case List.find (fn (nam,_) => nam=name) (Array.sub(dict, hash(name))) of
          NONE => NONE
        | SOME (_,number) => SOME number

end
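
A quick sketch of using this structure (the entries are hypothetical):

```sml
val d = P.empty()
val () = P.insert(d, "Alice", 2551234)
val () = P.insert(d, "Bob", 2555678)
val r1 = P.lookup(d, "Alice")    (* SOME 2551234 *)
val r2 = P.lookup(d, "Carol")    (* NONE *)
```

Note that inserting the same name twice simply prepends a second pair to the bucket's list; since List.find returns the first match, lookup sees the most recent binding.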

Here we clearly have O(1) insertion time. If we assume that our hash function works well so that each element is equally and independently likely to be hashed into any particular bucket (called the simple uniform hashing assumption), then lookup time turns out to be O(1+a) where a is the load factor of the hash table. (The load factor is the average number of elements per bucket; with 400 buckets for 100 students, our hash table has a load factor of 1/4.) If the number of buckets is at least proportional to the number of elements, the load factor is O(1), so lookup is also O(1) on average.

Open Addressing

Our current implementation is simple, but we can make it more space-efficient. We chose chaining as our way to resolve collisions, but storing lists requires memory for extra pointers: every element needs a pointer field for the next element of its chain. We can fit more elements in the same amount of memory by storing all elements directly in the array. This is called open addressing, and it limits our load factor to at most 1. If we have a collision while inserting a new element, we compute another bucket and probe it; if that bucket is also full, we compute yet another bucket to probe, and so on.

With linear probing, we resolve a collision by inserting into the next empty bucket after h(k), wrapping around the end of the array if necessary. So if bucket h(k) is full, we try h(k)+1, h(k)+2, and so on until we find an empty bucket. This means that to look up an element, we have to keep searching along the probe sequence until we reach an empty bucket. Try inserting some random keys into a small hash table using linear probing: you'll find that linear probing suffers from clustering, in which long runs of full buckets build up, hurting lookup times.

Quadratic probing is similar except that instead of searching ahead i buckets on iteration i, we search ahead a quadratic function of i. This tends to work much better, but we have to determine a good quadratic function to use, and since the probe sequence depends only on h(k) and i, any two keys with the same hash value will have the same probe sequence, which again leads to clustering.
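
As a sketch, the probe sequences for linear and quadratic probing (with m buckets, primary hash value hk, and iteration i) might look like:

```sml
(* linear probing: try h(k), h(k)+1, h(k)+2, ... mod m *)
fun probeLinear (m:int) (hk:int, i:int):int = (hk + i) mod m

(* quadratic probing with the (hypothetical) quadratic i + i*i;
   choosing a quadratic that visits every bucket takes some care *)
fun probeQuad (m:int) (hk:int, i:int):int = (hk + i + i*i) mod m
```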

One of the best methods for open addressing is double hashing, which involves searching ahead ih'(k) buckets on iteration i, where h' is a secondary hash function. This approach has the advantage that two keys that hash to the same bucket under hash function h probably won't have the same probe sequence, because they probably won't hash to the same value under hash function h'. This defeats the clustering problem and works better than linear or quadratic probing because it uses O(m^2) probe sequences (where m is the number of buckets) rather than O(m). For the probe sequence to visit every bucket, h'(k) must be nonzero and relatively prime to m; an easy way to guarantee this is to make m prime.

So now our implementation looks like

structure P = struct

  type pb = (string*int) option array
  exception Full

  (* 401 is prime, so any nonzero step size is relatively prime to
     the table size and the probe sequence visits every bucket *)
  fun hash(s:string):int =
      foldl (fn (c:char, n:int) => (n + ord c) mod 401) 0 (explode s)

  (* the secondary hash must never be 0, or the probe sequence
     would never leave its first bucket *)
  fun hash'(s:string):int =
      1 + foldl (fn (c:char, n:int) => (n + 3 * ord c) mod 400) 0 (explode s)

  fun probeseq(s:string, i:int):int = (hash(s) + i*hash'(s)) mod 401

  fun empty():pb = Array.array(401, NONE)

  fun insert(dict:pb, name:string, num:int):unit =
      let fun ins(i:int):unit =
              if i = 401 then raise Full
              else case Array.sub(dict, probeseq(name,i)) of
                       NONE => Array.update(dict, probeseq(name,i), SOME(name,num))
                     | SOME _ => ins(i+1)
      in ins 0
      end

  fun lookup(dict:pb, name:string):int option =
      let fun lu(i:int):int option =
              if i = 401 then NONE
              else case Array.sub(dict, probeseq(name, i)) of
                       NONE => NONE
                     | SOME(nam,num) => if nam = name then SOME num
                                        else lu(i+1)
      in lu 0
      end

end

Hash Functions

The biggest remaining issue that affects our implementation is our choice of the hash function (and the number of buckets, which is of course determined by the hash function). A bad hash function can clearly destroy our attempts at a constant running time, since in the worst case we have to search O(n) buckets. If we're mapping names to phone numbers, then hashing each name to its length would be a very poor function, as would a hash function that used only the first name, or only the last name. Our hash functions above do use the whole name, but they still suffer from some problems.
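
For instance, because our hash function just sums character codes mod 400, any two anagrams collide, no matter how the rest of the keys are distributed:

```sml
fun hash(s:string):int =
    foldl (fn (c:char, n:int) => (n + ord c) mod 400) 0 (explode s)

val h1 = hash "listen"    (* 255 *)
val h2 = hash "silent"    (* also 255: anagrams always collide *)
```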

With the division method, the hash function is simply h(k) = k mod m, which is easy to compute quickly. Certain values of m produce poor results though: if m=2^p, then h(k) is just the p lowest-order bits of k. Generally we prefer a hash function that uses all the bits of the key. In practice, primes not too close to powers of 2 work well.
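
To see the problem with m = 2^p, note that k mod 2^p is exactly the p low-order bits of k:

```sml
(* with m = 256 = 2^8, h(k) keeps only the low 8 bits of k *)
val k1 = 0x12345
val k2 = 0x99945
val h1 = k1 mod 256    (* 0x45 = 69 *)
val h2 = k2 mod 256    (* also 69: the high bits never matter *)
```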

Ideally you should test your hash function to make sure it behaves well on real data. But for any fixed hash function, it is possible to generate data that cause it to behave poorly. The only way to prevent this is to choose a hash function randomly and independently of the keys to be used. Universal hashing involves choosing a hash function at runtime from a carefully chosen set of functions. This means that no set of keys will cause your hash table to perform poorly every time; in fact, chances are good that the hash table will perform well.
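
A sketch of one classic universal family for integer keys is h(k) = ((a*k + b) mod p) mod m, where p is a prime larger than any key and a, b are chosen at random when the table is created. The constants here are illustrative:

```sml
val p = 7919    (* a prime larger than any key we expect *)
val m = 400     (* number of buckets *)

(* picking a in [1, p-1] and b in [0, p-1] at random yields a hash
   function that no fixed set of keys can reliably defeat *)
fun makeHash (a:int, b:int) : int -> int =
    fn k => ((a * k + b) mod p) mod m
```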