CS 312 Lecture 20
Hash tables and amortized analysis

We've seen various implementations of functional sets. First we had simple lists, which had O(n) access time. Then we saw how to implement sets as balanced binary search trees with O(lg n) access time. Our current best results are these:

                    linked list, no duplicates    balanced binary trees
add (insert)        O(n)                          O(lg n)
delete (remove)     O(n)                          O(lg n)
member (contains)   O(n)                          O(lg n)

What if we could do even better? It turns out that we can implement mutable sets and maps more efficiently than the immutable (functional) sets and maps we've been looking at so far. In fact, we can turn an O(n) functional set implementation into an O(1) mutable set implementation, using hash tables. The idea is to exploit the power of arrays to read or update an arbitrary element in O(1) time.

We store each element of the mutable set in a simple functional set whose expected size is a small constant. Because the functional sets are small, linked lists without duplicates work fine. Instead of having just one functional set, we'll use a lot of them. In fact, for a mutable set containing n elements, we'll spread out its elements among O(n) smaller functional sets. If we spread the elements around evenly, each of the functional sets will contain O(1) elements and accesses to it will have O(1) performance!

                    hash table
add (insert)        O(1)
delete (remove)     O(1)
member (contains)   O(1)

This data structure (the hash table) is a big array of O(n) elements, called buckets. Each bucket is a functional (immutable) set containing O(1) elements, and the elements of the set as a whole are partitioned among all the buckets. (Properly speaking, what we are talking about here is open hashing, in which a single array element can store any number of elements.)

There is one key piece missing: in which bucket should a set element be stored? We provide a hash function h(e) that given a set element e returns the index of a bucket that element should be stored into. The hash table works well if each element is equally and independently likely to be hashed into any particular bucket; this condition is the simple uniform hashing assumption. Suppose we have n elements in the set and the bucket array is length m. Then we expect α = n/m elements per bucket. The quantity α is called the load factor of the hash table. If the set implementation used for the buckets has linear performance, then we expect to take O(1+α) time to do add, remove, and member. To make hash tables work well, we ensure that the load factor α never exceeds some constant αmax, so all operations are O(1) on average.
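
As a tiny concrete illustration of these definitions, here is how an element is mapped to a bucket index and how the load factor is computed. The function names are illustrative only and are not part of the implementation developed below.

fun bucketIndex (hash: 'a -> int) (e: 'a, m: int): int =
  hash e mod m                    (* h(e) reduced to a bucket index in [0, m) *)
fun loadFactor (n: int, m: int): real =
  real n / real m                 (* the load factor, alpha = n/m *)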

The worst-case performance of a hash table is the same as the underlying bucket data structure, (O(n) in the case of a linked list), because in the worst case all of the elements hash to the same bucket. If the hash function is chosen well, this will be extremely unlikely, so it's not worth using a more efficient bucket data structure. But if we want O(lg n) worst-case performance from our hash tables, we can use a balanced binary tree for each bucket.

Closed hashing

An alternative to hashing with buckets is closed hashing, also known (confusingly) as open addressing. Instead of storing a set at every array index, a single element is stored there. If an element is inserted in the hash table and collides with an element already stored at that index, a second possible location for it is computed; if that is full, the process repeats. There are various strategies for generating a sequence of hash values for a given element: e.g., linear probing, quadratic probing, double hashing. In practice closed hashing is slower than an array of buckets. The performance of closed hashing becomes very bad when the load factor approaches 1, because a long sequence of array indices may need to be tried for any given element -- possibly every index in the array! Therefore it is important to resize the array when the load factor exceeds 2/3 or so. Open hashing, by contrast, suffers gradually declining performance as the load factor grows, and there is no fixed point beyond which resizing is absolutely needed. Further, a sophisticated implementation can defer the O(n) cost of resizing its hash tables to a point in time when it is convenient to incur it: for example, when the system is idle.
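
To make the probing idea concrete, here is a sketch of membership testing with linear probing. It assumes a hypothetical structure Hash matching the HASHABLE signature introduced later, and a table represented as a Hash.t option array that is never completely full (so every probe sequence eventually reaches an empty slot); none of this is part of the open-hashing implementation developed in these notes.

fun probeMember (arr: Hash.t option array, e: Hash.t): bool =
  let
    val m = Array.length arr
    fun probe i =
      case Array.sub (arr, i) of
        NONE => false                 (* an empty slot: e is not in the table *)
      | SOME e' =>
          Hash.equal (e, e') orelse probe ((i + 1) mod m)   (* try the next index *)
  in
    probe (Hash.hash e mod m)
  end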

Representation

An SML representation of a hash table is then as follows:

type bucket
  (* A bucket is a (functional) set of elems *)
type set = {arr: bucket array, nelem: int ref}
  (* AF: the set represented by a "set" is the union of
   *   all of the bucket sets in the array "arr".
   * RI: nelem is the total number of elements in all the buckets in
   *   arr. In each bucket, every element e hashes via hash(e)
   *   to the index of that bucket modulo length(arr).
   *   The ratio of nelem to length(arr) is at most α_max.
   *)

Resizable hash tables and amortized analysis

The claim that hash tables give O(1) performance is based on the assumption that n = O(m). If a hash table has many elements inserted into it, n may become much larger than m and violate this assumption. The effect will be that the bucket sets will become large enough that their bad asymptotic performance will show through. The solution to this problem is relatively simple: when the load factor exceeds some constant αmax, the array must be increased in size and all the elements rehashed into the new buckets using an appropriate hash function. Because resizing is not visible to the client, it is a benign side effect. Each resizing operation takes O(n) time, where n is the number of elements in the hash table being resized. Therefore the O(1) performance of the hash table operations no longer holds in the case of add: its worst-case performance is O(n).

This isn't as much of a problem as it might sound, though it can be an issue for some real-time computing systems. If the bucket array is doubled in size every time it is needed, then the insertion of n elements in a row into an empty array takes only O(n) time, perhaps surprisingly. We say that add has O(1) amortized run time because the time required to insert an element is O(1) on the average even though some elements trigger a lengthy rehashing of all the elements of the hash table.

To see why this is, suppose we insert n elements into a hash table while doubling the number of buckets when the load factor crosses some threshold. A given element may be rehashed many times, but the total time to insert the n elements is still O(n). Consider inserting n = 2^k elements, and suppose that we hit the worst case, where a resizing occurs on the very last element. Since the bucket array is doubled at each rehashing, the rehashes occur when the number of elements reaches successive powers of two. The final rehash rehashes all n elements, the previous one rehashes n/2 elements, the one before that n/4 elements, and so on. So the total number of hashes computed is n hashes for the actual insertions of the elements, plus n + n/2 + n/4 + n/8 + ... = n(1 + 1/2 + 1/4 + 1/8 + ...) = 2n hashes, for a total of 3n hashing operations.

No matter how many elements we add to the hash table, there will be at most three hashing operations performed per element added. Therefore, add takes amortized O(1) time even if we start out with a bucket array containing just one bucket!
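
This arithmetic can be checked with a few lines of code. The sketch below is only illustrative: the function name and the exact resizing rule (double when the element count reaches the current number of buckets) are assumptions, not the lecture's implementation.

fun countHashes (n: int): int =
  let
    fun go (i, cap, hashes) =
      if i = n then hashes
      else if i = cap then go (i, cap * 2, hashes + i)   (* resize: rehash all i elements *)
      else go (i + 1, cap, hashes + 1)                   (* insert: one hash for the new element *)
  in
    go (0, 1, 0)
  end

For example, countHashes 1024 evaluates to 2047, comfortably below 3 * 1024; the count never exceeds 3n for any n.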

Another way to think about this is that the true cost of performing an add is about triple the cost observed on a typical call to add; the remaining 2/3 of the cost is paid as the array is resized later. It is useful to think about this in monetary terms. Suppose that a hashing operation costs $1 (that is, 1 unit of time). Then a call to add costs $3, but only $1 is required up front for the initial hash. The remaining $2 is placed on the hash table element just added and used to pay for future rehashing. Assume that each time the array is resized, all of the banked money gets used up. At the next resizing, there are n elements, and the n/2 of them added since the last resizing have $2 on them; this is exactly enough to pay for rehashing all n elements. This is really an argument by induction, so we'd better examine the base case: when the array is resized from one bucket to two, there is $2 available, which is $1 more than needed to pay for the resizing. That extra $1 will stick around indefinitely, so inserting n elements starting from a one-bucket array takes at most 3n-1 element hashes, which is O(n) time. This kind of analysis, in which we precharge an operation for some time that will be taken later, is the idea behind amortized analysis of run time.

Notice that it was crucial that the array size grows geometrically (doubling). It is tempting to grow the array by a fixed increment (e.g., 100 elements at a time), but this causes each element to be rehashed O(n) times on average, resulting in O(n^2) total insertion time!
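
For a concrete contrast, here is the same counting sketch with the array grown by a fixed increment of 100 buckets instead of doubled; as before, the name and the exact resizing rule are illustrative assumptions.

fun countHashesFixed (n: int): int =
  let
    fun go (i, cap, hashes) =
      if i = n then hashes
      else if i = cap then go (i, cap + 100, hashes + i)  (* resize: rehash all i elements *)
      else go (i + 1, cap, hashes + 1)                    (* insert: one hash *)
  in
    go (0, 100, 0)
  end

countHashesFixed 100000 performs roughly 50 million hash operations, while countHashes 100000 performs fewer than 250,000.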

Any fixed threshold load factor is equally good from the standpoint of asymptotic run time. If you are concerned about constant factors, it is a good idea to measure which value of αmax maximizes performance; typically it will be between 1 and 3. One might think that α=1 is the right place to rehash, but the best performance is often seen (for buckets implemented as linked lists) when load factors are in the 1–2 range. When α<1, the bucket array contains many empty entries, resulting in suboptimal use of the computer's memory system. There are many other tricks that matter for getting the very best performance out of hash tables; in practice it is important to use measurements to tune the hash function and the resizing threshold.

In fact, if the load factor becomes too low, it's a good idea to resize the hash table to make it smaller. Usually this is done when the load factor drops below αmax/4; at that point the hash table is halved in size and all of the elements are rehashed. It is important that the shrinking threshold be well below the growing threshold. With growing by doubling at αmax and shrinking by halving at αmax/4, a resize in either direction leaves the load factor near αmax/2, comfortably away from both thresholds. If the table were instead shrunk as soon as its load factor dropped to half the doubling point, a single resize could leave the table on the verge of resizing in the other direction, and time could be wasted repeatedly growing and shrinking the table, hurting asymptotic performance.

Amortized analysis

If we start from an empty hash table, any sequence of n operations will take O(n) time, even if we resize the hash table whenever the load factor goes outside the interval [αmax/4, αmax].

To see this we need to evaluate the amortized complexity of the hash table operations. This formalizes the reasoning we used earlier. To do this, we define a potential function that measures the precharged time for a given state of the data structure. The potential function saves up time that can be used by later operations.

We then define the amortized time taken by a single operation that changes the data structure from h to h' as the actual time plus the change in potential, Φ(h') − Φ(h). Now consider a sequence of n operations on a hash table, taking actual times t1, t2, t3, ..., tn and producing hash tables h0, h1, h2, ..., hn. The total amortized time is the sum of the actual times plus the sum of the changes in potential: t1 + t2 + ... + tn + (Φ(h1)−Φ(h0)) + (Φ(h2) − Φ(h1)) + ... + (Φ(hn) − Φ(hn-1)) = t1 + t2 + ... + tn + Φ(hn) − Φ(h0). Therefore the total amortized time differs from the total actual time by exactly the net change in potential, Φ(hn) − Φ(h0). If we can arrange that the potential never drops below its initial value, the total amortized time is always an upper bound on the actual time, which is what we want.

The key to amortized analysis is to define the right potential function. The potential function needs to save up enough time to be used later when it is needed. But it cannot save so much time that it causes the amortized time of the current operation to be too high.

Analyzing hash tables

Let us do an amortized analysis of hash tables with both resizing by doubling and by halving. For simplicity we assume αmax = 1. We define the potential function so that it stores up time as the load factor moves away from 1/2. Then there will be enough stored-up time to resize in either direction. The potential function is:

Φ(h) = 2|n - m/2|

Now we just have to consider all the cases that can occur for each of the possible operations (member, add, and remove, with and without a resize). Since α is O(1), we assume that looking for an element takes time 1.

In each case, the amortized time is O(1). For example, consider an add performed when the table is full, that is, when n = m, so the array must double. Just before the operation, Φ(h) = 2|m − m/2| = m. After the element is added and the array is doubled to 2m buckets, Φ(h') = 2|(m+1) − m| = 2. The actual time is about m + 1 (rehashing m elements plus the insertion itself), so the amortized time is about (m + 1) + (2 − m) = 3. Adds and removes that do not resize change the potential by at most 2, and member does not change it at all. If we start our hash table with a load factor of 1/2, then its initial potential is zero; since the potential is never negative, it can never drop below its initial value, so the total amortized time is an upper bound on the actual time. Therefore a sequence of n operations will take O(n) time.

A hash table implementation

We can use a functor to provide a generic implementation of the mutable set (MSET) signature too. In order to store elements in a hash table, we'll need a hash function for the element type, and an equality test just as for other sets. We can define an appropriate signature that groups the type and these two operations:

signature HASHABLE = sig
  type t
  (* hash is a function that maps a t to an integer. For
   * all e1, e2, if equal(e1,e2), then hash(e1) = hash(e2) *)
  val hash: t->int
  (* equal is an equivalence relation on t. *)
  val equal: t*t->bool
end

There is an additional invariant documented in the signature: for the hash table to function correctly, any two equal elements must have the same hash code.
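
For instance, here is one plausible HASHABLE structure for integers, written in the style of the IntHash structure used in the examples later in these notes (the lecture does not show IntHash's actual definition, so this is only a sketch):

structure IntHash: HASHABLE = struct
  type t = int
  fun hash (x: int): int = x          (* the identity function is a legal, if simple, hash *)
  fun equal (x: int, y: int): bool = x = y
end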

functor HashSet(structure Hash: HASHABLE and
                Set: SET where type elem = Hash.t)
= struct
  type elem = Hash.t
  type bucket = Set.set
  type set = {arr: bucket array, nelem: int ref}
  (* AF: the set represented by a "set" is the union of
   * all of the bucket sets in the array "arr".
   * RI: nelem is the total number of elements in all the buckets in
   * arr. In each bucket, every element e hashes via Hash.hash(e)
   * to the index of that bucket modulo length(arr). *)

  (* Find the appropriate bucket for e *)
  type 'a bucketHandler = bucket array*int*bucket*elem*int ref->'a
  fun findBucket({arr, nelem}, e) (f: 'a bucketHandler): 'a =
    let
      val i = Hash.hash(e) mod Array.length(arr)
      val b = Array.sub(arr, i)
    in
      f(arr, i, b, e, nelem)
    end
  fun member(s, e) =
    findBucket(s, e)
        (fn(_, _, b, e, _) => Set.member(b, e))
  fun add(s, e) =
    findBucket(s, e)
      (fn(arr, i, b, e, nelem) =>
        ( Array.update(arr, i, Set.add(b, e));
          nelem := !nelem + 1 ))
  fun remove(s, e) =
    findBucket(s, e)
      (fn(arr, i, b, e, nelem) =>
	( case Set.remove(b,e) of
	    (b2, NONE) => NONE
	  | (b2, SOME y) =>
	      ( Array.update(arr, i, b2);
		nelem := !nelem - 1;
		SOME y )))
  fun size({arr, nelem}) = !nelem
  fun fold f init {arr, nelem} =
    Array.foldl (fn (b, curr) => Set.fold f curr b) init arr
  fun create(size: int): set =
    { arr = Array.array(size, Set.empty), nelem = ref 0 }
  (* Copy all elements from s2 into s1. *)
  fun copy(s1:set, s2:set): unit =
    fold (fn(elem,_)=> add(s1,elem)) () s2
  fun fromList(lst) = let
      val s = create(Int.max(1, length lst))  (* at least one bucket, even for an empty list *)
    in
      List.foldl (fn(e, ()) => add(s,e)) () lst;
      s
    end
  fun toList({arr, nelem}) =
    Array.foldl (fn (b, lst) => Set.fold (fn(e, lst) => e::lst) lst b)
        [] arr
end

This hash table implementation almost implements the MSET signature, but not quite, because it doesn't implement the empty operation. Here is a complete implementation of mutable sets using a fixed-size hash table of nbucket buckets:

functor FixedHashSet(val nbucket: int;
                     structure Hash: HASHABLE and
                     Set: SET where type elem = Hash.t)
  :> MSET where type elem = Hash.t
= struct
  structure HS = HashSet(structure Hash = Hash and Set = Set)
  type elem = HS.elem
  type set = HS.set
  val equal = Hash.equal
  fun empty() = HS.create(nbucket)
  val member = HS.member
  val add = HS.add
  val remove = HS.remove
  val size = HS.size
  val fold = HS.fold
  val toList = HS.toList
  val fromList = HS.fromList
end


Here we create a hash table of 1000 buckets and insert the numbers one through ten into it.

- structure FHS = FixedHashSet(val nbucket = 1000; structure Hash = IntHash and Set = IntSet)
- open FHS;
- val s = empty();
val s = - : set
- foldl (fn(x,_) => add(s,x)) () [1,2,3,4,5,6,7,8,9,10];
val it = () : unit
- toList(s);
val it = [10,9,8,7,6,5,4,3,2,1] : elem list

The elements come back in descending order because element i is hashed into bucket i, and toList conses each bucket's contents onto the front of the result as it scans the buckets in index order.

The HashSet implementation of hash tables abstracts out the common operation of finding the correct bucket for a particular element into the findBucket function, thus keeping the rest of the code simpler. If the number of buckets is always a power of 2, the modulo operation can be performed using bit logic, which is much faster.
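
As an illustration of this trick, the following function (whose name is made up here) computes the bucket index using the Standard Basis Word operations; it is valid only when m is a power of two.

fun maskedIndex (hash: int, m: int): int =
  (* for m a power of two, hash mod m equals a bitwise AND with m-1 *)
  Word.toInt (Word.andb (Word.fromInt hash, Word.fromInt (m - 1)))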

Dynamic resizing

As long as we don't put more than a couple of thousand elements into a fixed-size 1000-bucket hash table, its performance will be excellent. However, the asymptotic performance of any fixed-size table is no better than that of a linked list. We can introduce another level of indirection to obtain a hash table that grows dynamically and rehashes its elements, thus achieving O(1) amortized performance as described above. The implementation proves to be quite simple because we can reuse all of our HashSet code:

functor DynHashSet(structure Hash: HASHABLE and
                   Set: SET where type elem = Hash.t)
  :> MSET where type elem=Hash.t =
struct
  structure HS = HashSet(structure Hash = Hash and Set = Set)
  type set = HS.set ref
  type elem = HS.elem
  val thresholdLoadFactor = 3
  (* AF: the set represented by x:set is !x.
   * RI: the load factor of the hash table !x never goes
   * above thresholdLoadFactor.
   *)

  fun empty():set = ref (HS.create(1))
  fun member(s, e) = HS.member(!s, e)
  fun remove(s, e) = HS.remove(!s, e)
  fun size(s) = HS.size(!s)
  fun add(s, e) =
    let val {arr, nelem} = !s
        val nbucket = Array.length(arr) 
    in
      if !nelem >= thresholdLoadFactor*nbucket then
        let val newset = HS.create(nbucket*2) in
          HS.copy(newset, !s);
          s := newset
        end
      else ();
      HS.add(!s, e)
    end
  fun fold f init s = HS.fold f init (!s)
  fun fromList(lst) = ref (HS.fromList(lst))
  fun toList(s) = HS.toList (!s)
end

There is hardly any new code required: mostly just the logic in add that creates a new, larger hash table  and copies all the elements across when the load factor is too high.
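
Shrinking the table when the load factor gets too low, as discussed earlier, fits the same pattern. The following is a sketch (not part of the lecture code) of a remove for DynHashSet that halves the bucket array when the load factor falls below thresholdLoadFactor/4; it would live inside the functor, where HS and thresholdLoadFactor are in scope.

  fun removeAndShrink (s, e) =
    let
      val result = HS.remove (!s, e)
      val {arr, nelem} = !s
      val nbucket = Array.length arr
    in
      if 4 * !nelem < thresholdLoadFactor * nbucket andalso nbucket > 1 then
        let val newset = HS.create (nbucket div 2) in
          HS.copy (newset, !s);         (* rehash everything into the smaller table *)
          s := newset
        end
      else ();
      result
    end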

This code requires access to the internals of the HashSet implementation, which is why those internals are not hidden behind a signature. An example of using these hash tables follows:

- structure DynIntHashSet = DynHashSet(structure Hash = IntHash and
						 Set = IntSet)
- open DynIntHashSet;
- val s = empty();
val s = - : set
- foldl (fn(x,_) => add(s,x)) () [1,2,3,4,5,6,7,8,9,10];
val it = () : unit
- toList(s);
val it = [7,3,10,6,2,9,5,1,8,4] : elem list

The MMAP signature for mutable maps can also be implemented using hash tables in a similar manner.
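
One simple way to do this (a sketch of one possible approach, not necessarily the one the MMAP design has in mind) is to store (key, value) bindings in a hash set whose HASHABLE structure hashes and compares the key only:

structure IntKeyBinding: HASHABLE = struct
  type t = int * string                        (* a (key, value) binding *)
  fun hash ((k, _): t): int = k                (* the hash depends only on the key *)
  fun equal ((k1, _): t, (k2, _): t): bool = k1 = k2
end

Looking up or removing a key then means probing with a pair whose value component is ignored; a dedicated MMAP implementation whose buckets are functional maps would avoid that awkwardness.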