CS 312 Lecture 16
Mutable data abstractions and hash tables

Mutable data abstractions are abstractions whose value can change over time. We have avoided using them until now because they are harder to reason about than immutable (functional) data abstractions. But for solving some problems they offer an advantage in efficiency.

Arrays

An important kind of mutable data structure that SML provides is the array. The type t array is in fact very similar to the Java array type t[]. Arrays generalize refs in that they are a sequence of mutable cells containing values. We can think of a ref cell as an array of size 1. Here's a partial signature for the builtin Array structure for SML.

  signature ARRAY =
    sig
      (* Overview: an 'a array is a mutable fixed-length sequence of
       * elements of type 'a. *)
      type 'a array

      (* array(n,x) is a new array of length n whose elements are
       * all equal to x. *)
      val array : int * 'a -> 'a array

      (* fromList(lst) is a new array containing the values in lst *)
      val fromList : 'a list -> 'a array

      exception Subscript (* indicates an out-of-bounds array index *)

      (* sub(a,i) is the ith element in a. If i is
       * out of bounds, raise Subscript *)
      val sub : 'a array * int -> 'a

      (* update(a,i,x)
       * Effects: Set the ith element of a to x
       * Raise Subscript if i is not a legal index into a *)
      val update : 'a array * int * 'a -> unit

      (* length(a) is the length of a *)
      val length : 'a array -> int

      ...
    end

See the SML documentation for more information on the operations available on arrays.

Notice that we have started using a new kind of clause in the specification, the effects clause. This clause specifies side effects that the operation has beyond the value it returns. When a routine has a side effect, it is useful to have the word "Effects:" in the specification to explicitly warn the user that a side effect may occur. For example, the update function returns no interesting value, but it does have a side effect.

An imperative update to a mutable data abstraction is also known as a destructive update, because it "destroys" the old value of the data structure. An assignment to an array element changes the array in place, destroying the old sequence of elements that formerly made up the array. Programming in an imperative style is trickier than in a functional style precisely because the programmer has to be sure that the old value of the mutable data is no longer needed at the time that a destructive update is performed.

Mutable sets and specifying side effects

Mutable collections such as sets and maps are another important kind of mutable data abstraction. We've seen several different implementations of sets thus far, but they have implemented an immutable set abstraction. A mutable set is a set that can be imperatively updated to include more elements, or to remove some elements.

Here is an example of a signature for a mutable set. These signatures show an important issue in writing effects clauses. To specify a side effect, sometimes we need to be able to talk about the state of a mutable value both before and after the routine is executed. Writing "_pre" or "_post" after the name of a variable is a compact way of talking about that the state of the value in that variable before and after the function executes, respectively.

signature MSET = sig
  (* Overview: a set is a mutable set of items of type elem.
   * For example, if elem is int, then a set might be
   * {1,-11,0}, {}, or {1001} *)
  type elem
  type set
  (* empty() creates a new empty set *)
  val empty : unit -> set
  (* Effects: add(s,x) adds the element x to s if it is
   * not there already: s_post = s_pre U {x})
  val add: set * elem -> unit
  (* remove(s,x) removes the element x from s it it is
   * there already *)
  val remove: set * elem -> unit
  (* member(s,x) is whether x is a member of s *)
  val member: set * elem -> bool
  (* size(s) is the number of elements in s *)
  val size: set -> int
  (* fold over the elements of the set *)
  val fold: ((elem*'b)->'b) -> 'b -> set -> 'b
  val fromList: elem list -> set
  val toList: set -> elem list
end

Classifying operations

When designing the interface to a mutable data abstraction, it is a good idea to select operations that fall into one of three broad categories:

Observers are operations that return information about an abstract data value but do not have any externally visible side effects.
Creators are operations that construct new abstract values. Unlike with purely functional data abstractions, two identical calls to a creator do not return indistinguishable values. For example, two calls to Array.fromList will produce two distinguishable arrays even if the same list is provided as input. The two arrays can be distinguished because updates to one array will be not be visible in the other.
Mutators are operations that change the mutable abstract value in some way. They sometimes return some information about what happened but their primary purpose is mutation.

This rule of thumb reduces the amount of reasoning that programmers need to do about side effects, because creators and observers usually do not have side effects; only mutators do.

The MSET signature contains examples of all three kinds of operations: empty and fromList are creators; member, size, fold, and toList are observers; and add and remove are mutators. Similarly, in ARRAY we have creators array and fromList, observers sub and length, and a mutator update.

Rep invariants

Mutable data abstractions need rep invariants, just like immutable abstractions do. However, mutation raises some new issues for maintaining rep invariants. For functional programming we had to make sure that any new abstract values constructed satisfied the rep invariant. For imperative programming we also need to make sure that the rep invariant is not broken for existing abstract values.

For example, consider an following implementation of the MSET signature, in which an underlying sorted array is used as the representation:

functor ArrayMSet(structure Key: ORD_KEY)
        :> MSET where type elem = Key.ord_key
  = struct
    open Array
    type elem = Key.ord_key
    type set = {elems: elem option array ref, size: int}
    (* ref {elems, size} represents a set containing the first size
     * elements in elems.
     * Rep invariant: the first size elements have the form SOME(e)
     * and they are in sorted order according to Key.compare.
     *)
    val initial_size = 10
    val empty () = {elems = array(10, NONE), size = 0}
    ...
    
  end

The idea is to create an array that is large enough to hold all the elements of the set. If too many add's are done, a new array is created. Only the first size elements of the array are actually used to store elements, and they are stored in sorted order. The member operation can be performed using binary search with O(lg n) time. However, add will not be as efficient, because insertion into the middle of the array will take O(n) time. Note that adding the element at the end of the array would take O(1) time but would break the rep invariant on the set that is being extended.

Exposing the rep

A common mistake when designing a mutable abstraction is exposing the rep -- that is, implementing operations in a way that allows users access to mutable pieces of the representation. The problem is that these mutable values may then be updated by a thoughtless or malicious client, causing invalid changes to the original abstract value.

For example, suppose that we add an operation toArray: set -> elem option array to the mutable set interface. It looks very easy to implement in the functor above:

fun toArray({elems,size}) = !elems

This implementation simply returns the array that is used as part of the representation of the set. The problem is that a client receiving the array of elements can change the array using update, and in doing so break the rep invariant of the set the array was received from.

In early versions of Java, there was actually a security hole based on improperly exposing the rep. An array was returned containing security-critical information in response to a query; modifying this array caused the system security policy to be changed in an arbitrary way!

Benevolent side effects

Not every side effect needs to be documented with an effects clause. If a side effect does not affect the abstract value that a mutable representation maps to, the side effect will be invisible to the user of the abstraction. Therefore, it need not be mentioned in the specification. Side effects of this sort as known as benevolent side effects, because they are not destructive. Benevolent side effects can be useful for building caches internal to data structures, or for data structure reorganizations that improve performance without affecting their abstract value.

Hash tables

We've seen various implementations of functional sets. First we had simple lists, which had O(n) access time. Then we saw how to implement sets as balanced binary search trees with O(lg n) access time. Our current best results are this:

	linked list, no duplicates	red-black trees
`add (insert)`	O(n)	O(lg n)
`delete (remove)`	O(n)	O(lg n)
`member (contains)`	O(n)	O(lg n)

What if we could do even better? It turns out that we can implement mutable sets and maps more efficiently than the immutable (functional) sets and maps we've been looking at so far. In fact, we can turn an O(n) functional set implementation into an O(1) mutable set implementation, using hash tables. The idea is to exploit the power of arrays to update a random element in O(1) time.

The idea is that we'll store each element of the mutable set in a simple functional set of constant size on average. Because the functional sets are small, linked lists without duplicates work fine. Instead of having just one functional set, we'll use a lot of them. In fact, for a mutable set containing n elements, we'll spread out its elements among O(n) smaller functional sets. If we spread the elements around evenly, each of the functional sets will contain O(1) elements and accesses to it will have O(1) performance!

	hash table
`add (insert)`	O(1)
`delete (remove)`	O(1)
`member (contains)`	O(1)

This data structure (the hash table) is a big array of O(n) elements, called buckets. Each bucket is a functional (immutable) set containing O(1) elements, and the elements of the set as a whole are partitioned among all the buckets.

There is one key piece missing: in which bucket should a set element be stored? We provide a hash function h(e) that given a set element e returns the index of a bucket that element should be stored into. The hash table works well if each element is equally and independently likely to be hashed into any particular bucket; this condition is the simple uniform hashing assumption. Suppose we have n elements in the set and the bucket array is length m. Then we expect a = n/m elements per bucket. The quantity a is called the load factor of the hash table. If the set implementation used for the buckets has linear performance, then we expect to take O(1+a) time to do add, remove, and member. If the number of buckets is at proportional to the number of elements, the load factor a is O(1), so all the operations are also O(1) on average. Notice that the worst case performance of a hash table is O(n), however, because in the worst case all of the elements hash to the same bucket. If the hash function is chosen well, this will be extremely unlikely.

Representation

The SML representation of a hash table is then as follows:

type bucket
  (* A bucket is a (functional) set of elems *)
type set = {arr: bucket array, nelem: int ref}
  (* AF: the set represented by a "set" is the union of
   *   all of the bucket sets in the array "arr".
   * RI: nelem is the total number of elements in all the buckets in
   *   arr. In each bucket, every element e hashes via hash(e)
   *   to the index of that bucket modulo length(arr). *)

Hash functions

One remaining issue that affects our implementation is our choice of the hash function (and the number of buckets, which is of course determined by the hash function). Clearly, a bad hash function can destroy our attempts at a constant running time, since in the worst case we have to search O(n) buckets. If we're mapping names to phone numbers, then hashing each name to its length would be a very poor function, as would a hash function that used only the first name, or only the last name. We want our hash function to use all of the information in the key.

With modular hashing, the hash function is simply h(k) = k mod m for some m (typically the number of buckets). This is easy to compute quickly when we consider the bit-level representation of the key k as representing a number. Certain values of m produce poor results though; in particular if m is a power of two (i.e., m=2^p), then h(k) is just the p lowest-order bits of k. Generally we prefer a hash function that uses all the bits of the key so that any change in the key it likely to change the bucket it maps to. In practice, primes not too close to powers of 2 work well.

Another alternative is multiplicative hashing, in which we compute (ka/2^q) mod m for appropriately chosen values of a, m, and q. This works well for the same reason that linear congruential multipliers generate apparently random numbers. The multiplier a should be large and its binary representation should be a "random" mix of 1's and 0's; q is chosen so that all of the high bits of the product are retained before computing the modulus. Multiplicative hashing is cheaper than modular hashing and it works well with a bucket array of size m=2^p, which is convenient.

Ideally you should test your hash function to make sure it behaves well with real data. With any hash function, it is possible to generate data that cause it to behave poorly, but a good hash function will make this unlikely. A good way to determine whether your hash function is working well is to measure the clustering of elements into buckets. If bucket i contains x_i elements, then the clustering is (S_i(x_i²)/n) - n/m. A uniform hash function produces clustering near 1.0 with high probability. A clustering factor of c means that the performance of the hash table is slowed down by a factor of c relative to its performance with a uniform hash function and the same array size. If clustering is less than 1.0, the hash function is doing better than a uniform random hash function ought to: this is rare. Note that clustering is independent of the load factor.

Open addressing

An alternative to hashing with buckets is open addressing. Instead of storing a set at every array index, a single element is stored there. If an element is inserted in the hash table and collides with an element already stored at that index, a second possible possible location for it is computed. If that is full, the process repeats. There are various strategies for generating a sequence of hash values for a given element: linear probing, quadratic probing, double hashing. We have chosen not talk about open addressing in detail because in practice it is slower than an array of buckets. The performance of open addressing becomes very bad when the load factor approaches 1, because a long sequence of array indices may need to be tried for any given element -- possibly every element in the array! Therefore it is important to resize the array when the load factor exceeds 2/3 or so. The bucket approach, by contrast, suffers gradually declining performance as the load factor grows, and no fixed point beyond which resizing is absolutely needed. With buckets, a sophisticated application can defer the O(n) cost of resizing its hash tables to a point in time when it is convenient to incur it: for example, when the user is idle.

Resizable hash tables and amortized analysis

The claim that hash tables give O(1) performance is based on the assumption that m = O(n). If a hash table has many elements inserted into it, n may become much larger than m and violate this assumption. The effect will be that the bucket sets will become large enough that their bad asymptotic performance will show through. The solution to this problem is relatively simple: the array must be increased in size and all the element rehashed into the new buckets using an appropriate hash function when the load factor exceeds some constant factor. Each resizing operation therefore takes O(n) time where n is the size of the hash table being resized. Therefore the O(1) performance of the hash table operations no longer holds in the case of add: its worst-case performance is O(n).

This isn't really as much of a problem as it might sound. If the bucket array is doubled in size every time it is needed, then the insertion of n elements in a row into an empty array takes only O(n) time, perhaps surprisingly. We say that add has O(1) amortized run time because the time required to insert an element is O(1) on the average even though some elements trigger a lengthy rehashing of all the elements of the hash table.

To see why this is, suppose we insert n elements into a hash table while doubling the number of buckets when the load factor crosses some threshold. A given element may be rehashed many times, but the total time to insert the n elements is still O(n). Consider inserting n = 2^k elements, and suppose that we hit the worst case, where the resizing occurs on the very last element. Since the bucket array is being doubled at each rehashing, the rehashes must all occur at powers of two. The final rehash rehashes all n elements, the previous one rehashes n/2 elements, the one previous to that n/4 elements, and so on. So the total number of hashes computed is n hashes for the actual insertions of the elements, plus n + n/2 + n/4 + n/8 + ... = n(1 + 1/2 + 1/4 + 1/8 + ...) = 2n hashes, for a total of 3n hashing operations.

No matter how many elements we add to the hash table, there will be at most three hashing operations performed per element added. Therefore, add takes amortized O(1) time even if we start out with a bucket array of one element!

Another way to think about this is that the true cost of performing an add is about triple the cost observed on a typical call to add. The remaining 2/3 of the cost is paid as the array is resized later. It is useful to think about this in monetary terms. Suppose that a hashing operation costs $1 (that is, 1 unit of time). Then a call to add costs $3, but only $1 is required up front for the initial hash. The remaining $2 is placed into the hash table element just added and used to pay for future rehashing. Assume each time the array is resized, all of the remaining money gets used up. At the next resizing, there are n elements and n/2 of them have $2 on them; this is exactly enough to pay for the resizing. This is a really an argument by induction, so we'd better examine the base case: when the array is resized from one bucket to two, there is $2 available, which is $1 more than needed to pay for the resizing. That extra $1 will stick around indefinitely, so inserting n elements starting from a 1-element array takes at most 3n-1 element hashes, which is O(n) time. This kind of analysis, in which we precharge an operation for some time that will be taken later, typifies amortized analysis of run time.

Notice that it was crucial that the array size grows geometrically (doubling). It is tempting to grow the array by a fixed increment (e.g., 100 elements at time), but this causes n elements to be rehashed O(n) times on average, resulting in O(n²) asymptotic insertion time!

Any fixed threshold load factor is equally good from the standpoint of asymptotic run time, but a good rule of thumb is that rehashing should take place at a=3. One might think that a=1 is the right place to rehash, but in fact the best performance is seen (for buckets implemented as linked lists) when load factors are in the 1-2 range. When a<1, the bucket array contains many empty entries, resulting in suboptimal performance of the computer's memory system. There are many other tricks that are important for getting the very best performance out of hash tables.

Parameterized sets and maps

We observed above that sets and maps only make sense when their element and key types, respectively, support a notion of equality. This makes it more difficult to write a single implementation of sets that can be used for any type we like; we need to use some features of SML that we haven't seen yet. Let's build an implementation of hash tables to see how it all works out. First of all, we can describe the types that make sense in these signatures by writing another signature that describes a type t and an operation equal for testing whether they are equal:

signature EQ = sig
  (* t is a type with a notion of equality *)
  type t
  val equal: t * t -> bool
end

To use a type (for example, int) as a set element type we construct a structure that bundles the type with the operations that its ADT implementation requires:

structure IntEq : EQ = struct
(* subtlety: must use : here, not :>, so that t is visible
   outside the structure *)
  type t = int
  fun equal(x:int,y:int) = x = y
end

Now, consider the following functional signature for a set ADT, which we will use for the hash table buckets:

signature SET = sig
  (* Overview: a set is a set of distinct items of type elem.
   * For example, if elem is int, then a set might be
   * {1,-11,0}, {}, or {1001} *)
  type elem
  type set
  
  (* test for equality of two elements *)
  val eq: elem * elem -> bool
  (* empty is the empty set *)
  val empty : set
  (* Effects: add(s,e) is s union {e} *)
  val add: set * elem -> set
  (* remove(s,x) is s - {x}  (set difference) *)
  val remove: set * elem -> set
  (* member(s,x) is whether x is a member of s *)
  val member: set * elem -> bool
  (* size(s) is the number of elements in s *)
  val size: set -> int
  (* fold over the elements of the set *)
  val fold: ((elem*'b)->'b) -> 'b -> set -> 'b
  (* fromList(lst) is the set of elements in lst.
   * Requires: lst contains no equal elements *)
  val fromList: elem list -> set
  val toList: set -> elem list
end

We can instantiate a signature like SET on a particular element type using a where clause. This gives us the effect of type parameterization, but on signatures:

- signature INTSET = SET where type elem = int
signature INTSET =
  sig
    type elem = int
    type set
    val empty : set
    ...
  end

We can use a functor to write an implementation of SET that works for all its possible instantiations like INTSET. For example, suppose we want the simple implementation in which the rep is a linked list of unique elements. The functor we write takes in a type t that has an equal operation (both t and equal are bundled together in a structure Eq) and produces a structure that meets the SET signature:

functor ListSet(structure Eq: EQ) :> SET where type elem = Eq.t
= struct
  type elem = Eq.t
  type set = elem list
  (* RI: the list contains no elements that are equal according to
     Eq.equal *)

  val empty: set = []
  val eq = Eq.eq
  fun member(s, e) =
    case s of
      [] => false
    | h::t => Eq.equal(e,h) orelse member(t,e)
  fun add(s, e) =
    case s of
      [] => [e]
    | h::t => if Eq.equal(e,h) then e::t else h::add(t,e)
  fun remove(s, e) =
    case s of
      [] => []
    | h::t => if Eq.equal(e,h) then t else h::remove(t,e)
  fun size(s) = length(s)
  fun fold f b s = foldl f b s
  fun fromList s = s (* ought to check for duplicates *)
  fun toList s = s
end

This is our usual implementation of sets as lists, but it works for almost any element type. Now we can make sets of any type we like by using a structure that provides the element type and its equality operation; for example, the IntEq structure defined earlier:

 signature INTSET = SET where type elem = int
 structure IntSet :> INTSET = ListSet(structure Eq = IntEq)
                              (* functor application ....*)

 - val s: IntSet.set = IntSet.fromList([3,4,5])
 val s = - : IntSet.set
 - IntSet.toList(IntSet.add(IntSet.add(s, 2), 3));
 val it = [2,3,4,5] : IntSet.elem list

It's a little bit awkward to have to define the IntEq structure in order to instantiate the ListSet functor, but the result is that SML is very expressive. Some languages (e.g., CLU, PolyJ) make this process of instantiating signatures on type parameters more convenient, but lose some expressive power.

The implementation above shows how to use functors to provide a generic implementation of sets using linked lists: generic in the sense that it works for any type that it makes sense to have a set of. Functional maps can be implemented similarly, resulting in generic association lists.

A hash table implementation

We can use a functor to provide a generic implementation of the mutable set (MSET) signature too. In order to store elements in a hash table, we'll need a hash function for the element type, and an equality test just as for other sets. We can define an appropriate signature that groups the type and these two operations:

signature HASHABLE = sig
  type t
  (* hash is a function that maps a t to an integer. For
   * all e1, e2, if equal(e1,e2), then hash(e1) = hash(e2) *)
  val hash: t->int
  (* equal is an equivalence relation on t. *)
  val equal: t*t->bool
end

There is an additional invariant documented in the signature: for the hash table to function correctly, any two equal elements must have the same hash code.

functor HashSet(structure Hash: HASHABLE and
                Set: SET where type elem = Hash.t)
= struct
  type elem = Hash.t
  type bucket = Set.set
  type set = {arr: bucket array, nelem: int ref}
  (* AF: the set represented by a "set" is the union of
   * all of the bucket sets in the array "arr".
   * RI: nelem is the total number of elements in all the buckets in
   * arr. In each bucket, every element e hashes via Hash.hash(e)
   * to the index of that bucket modulo length(arr). *)

  (* Find the appropriate bucket for e *)
  fun findBucket({arr, nelem}, e) (f:bucket array*int*bucket*elem*int ref->'a) =
    let
      val i = Hash.hash(e) mod Array.length(arr)
      val b = Array.sub(arr, i)
    in
      f(arr, i, b, e, nelem)
    end
  fun member(s, e) =
    findBucket(s, e)
        (fn(_, _, b, e, _) => Set.member(b, e))
  fun add(s, e) =
    findBucket(s, e)
      (fn(arr, i, b, e, nelem) =>
        ( Array.update(arr, i, Set.add(b, e));
          nelem := !nelem + 1 ))
  fun remove(s, e) =
    findBucket(s, e)
      (fn(arr, i, b, e, nelem) =>
	( case Set.remove(b,e) of
	    (b2, NONE) => NONE
	  | (b2, SOME y) =>
	      ( Array.update(arr, i, b2);
		nelem := !nelem - 1;
		SOME y )))
  fun size({arr, nelem}) = !nelem
  fun fold f init {arr, nelem} =
    Array.foldl (fn (b, curr) => Set.fold f curr b) init arr
  fun create(size: int): set =
    { arr = Array.array(size, Set.empty), nelem = ref 0 }
  (* Copy all elements from s2 into s1. *)
  fun copy(s1:set, s2:set): unit =
    fold (fn(elem,_)=> add(s1,elem)) () s2
  fun fromList(lst) = let
      val s = create(length lst)
    in
      List.foldl (fn(e, ()) => add(s,e)) () lst;
      s
    end
  fun toList({arr, nelem}) =
    Array.foldl (fn (b, lst) => Set.fold (fn(e, lst) => e::lst) lst b)
        [] arr
end

This hash table implementation almost implements the MSET signature, but not quite, because it doesn't implement the empty method. Here is complete implementation of mutable sets using a fixed-size hash table of nbucket buckets:

functor FixedHashSet(val nbucket: int;
                     structure Hash: HASHABLE and
                     Set: SET where type elem = Hash.t)
  :> MSET where type elem = Hash.t
= struct
  structure HS = HashSet(structure Hash = Hash and Set = Set)
  type elem = HS.elem
  type set = HS.set
  val eq = Hash.eq
  fun empty() = HS.create(nbucket)
  val member = HS.member
  val add = HS.add
  val remove = HS.remove
  val size = HS.size
  val fold = HS.fold
  val toList = HS.toList
  val fromList = HS.fromList
end

Here we create a hash table of 1000 buckets and insert the numbers one through ten into it.

- structure FHS = FixedHashSet(structure Hash = IntHash and Set = IntSet)
- open FHS;
- val s = empty();
val s = - : set
- foldl (fn(x,_) => add(s,x)) () [1,2,3,4,5,6,7,8,9,10];
val it = () : unit
- toList(s);
val it = [10,9,8,7,6,5,4,3,2,1] : elem list

The elements are in reverse order because they are hashed into buckets 1 through 10.

The HashSet implementation of hash tables abstracts out the common operation of finding the correct bucket for a particular element into the findBucket function, thus keeping the rest of the code simpler. If the number of buckets is always a power of 2, the modulo operation can be performed using bit logic, which is much faster.

Specifying hash functions

Hash tables are one of the most useful data structures ever invented. Unfortunately, they are also one of the most misused. Code built using hash tables often does not get anywhere near the possible performance, because of badly designed hash functions. The reason for this comes down to a common failure to adequately specify the requirements on the hash function.

Earlier we described hashing as a function that maps a key to a bucket index. Recall that hash tables work well when the hash function satisfies the simple uniform hashing assumption -- that the hash function should look random. If it is to look random, this means that any change to a key, even a small one, should change the bucket index in an apparently random way. If we imagine writing the bucket index as a binary number, a small change to the key should randomly flip the bits in the bucket index. This is a kind of information diffusion: a one-bit change to the key should randomly affect every bit in the index.

However, in this hash table implementation (and in many others), the job of producing the hash value has been split into two steps, one provided by the hash table user and one by the hash table implementation. The first step (the hash function) converts the key into an integer hash code, the second (in findBucket) converts the integer into a bucket index by taking the hash code modulo the number of buckets:

Thus, the hash function is really the composition of two functions, hash and findBucket. Which one of these two functions is responsible for providing information diffusion? In this implementation, findBucket does not provide information diffusion if the number of buckets is a power of two, because the modulo operation throws away the upper bits of the hash code. In this case it is up to the hash function to provide information diffusion by making every bit of the hash code depend randomly on every part of the key. If the user happens to use a hash function that does not diffuse the key information into the hash code, the hash table is more likely to exhibit poor performance because of high clustering.

High clustering happens much more often than you might think. For example, our hash function on integers is the identity function. This won't work well if the integers tend to be equal modulo the number of buckets -- they'll all be hashed into the same bucket, and we'll have a glorified linked list instead of a hash table! For a less artificial example, consider the standard Java hashCode implementation for objects -- it converts the memory address of the object into an integer. However, the memory addresses of objects tend to exhibit a great deal of regularity, resulting in high clustering with common hash table implementations.

The problem is that hash table implementers are often lax about specifying what they expect from user hash functions. It's fine for the user hash function not to provide information diffusion, but then the hash table should provide it when converting the hash code into a bucket index. Conversely, if the hash table simply computes the hash code modulo the number of buckets, it ought to say in the specification of hash (in the HASHABLE signature) that the hash function is expected to provide information diffusion.

Multiplicative hashing

There are a number of ways to obtain good information diffusion, such as multiplicative hashing, cyclic redundancy checks (CRC's), cryptographic hashing (e.g., MD5) . Here is an example of how we might build information diffusion into the hash table implementation, using multiplicative hashing of the hash code to obtain the bucket index. The following code assumes a word size of 32 bits:

  val multiplier: Word.word = 0wx678DDE6F (* following a recommendation by Knuth *)
  fun findBucket({arr, nelem}, e) (f:bucket array*int*bucket*elem->'a) =
    let
      val n = Word.fromInt(Array.length(arr))
      val d = (0wxFFFFFFF div n)+0w1
      val i = Word.toInt(Word.fromInt(Hash.hash(e)) * multiplier div d)
      val b = Array.sub(arr, i)
    in
      f(arr, i, b, e)
    end

If n is always power of two, the div operations can be replaced by bit shifting, which is much faster. This code will work well even with Java hash codes.

Dynamic resizing

As long as we don't put more than a couple of thousand elements into a fixed-size 1000-bucket hash table, its performance will be excellent. However, the asymptotic performance of any fixed-size table is no better than that of a linked list. We can introduce another level of indirection to obtain a hash table that grows dynamically and rehashes its elements, thus achieving O(1) amortized performance as described above. The implementation proves to be quite simple because we can reuse all of our HashSet code:

functor DynHashSet(structure Hash: HASHABLE and
                   Set: SET where type elem = Hash.t)
  :> MSET where type elem=Hash.t =
struct
  structure HS = HashSet(structure Hash = Hash and Set = Set)
  type set = HS.set ref
  type elem = HS.elem
  val thresholdLoadFactor = 3
  (* AF: the set represented by x:set is !x.
   * RI: the load factor of the hash table !x never goes
   * above thresholdLoadFactor.
   *)

  fun empty():set = ref (HS.create(1))
  fun member(s, e) = HS.member(!s, e)
  fun remove(s, e) = HS.remove(!s, e)
  fun size(s) = HS.size(!s)
  fun add(s, e) =
    let val {arr, nelem} = !s
        val nbucket = Array.length(arr) 
    in
      if !nelem >= thresholdLoadFactor*nbucket then
        let val newset = HS.create(nbucket*2) in
          HS.copy(newset, !s);
          s := newset
        end
      else ();
      HS.add(!s, e)
    end
  fun fold f init s = HS.fold f init (!s)
  fun fromList(lst) = ref (HS.fromList(lst))
  fun toList(s) = HS.toList (!s)
end

There is hardly any new code required: mostly just the logic in add that creates a new, larger hash table and copies all the elements across when the load factor is too high. Sometimes hash table implementations will also resize the hash table downwards when a call to remove makes the load factor too low. This trick usually improves performance in practice for hash tables that grow and shrink—though it does not improve (in fact, harms) theoretical asymptotic performance.

Notice that resizing the hash table has no visible external effect. Its side effects are benevolent.

This code requires access to the internals of the HashSet implementation, which is why those internals are not hidden behind a signature. An example of using these hash tables follows:

- structure DynIntHashSet = DynHashSet(structure Hash = IntHash and Set = IntSet)
- open DynIntHashSet;
- val s = empty();
val s = - : set
- foldl (fn(x,_) => add(s,x)) () [1,2,3,4,5,6,7,8,9,10];
val it = () : unit
- toList(s);
val it = [7,3,10,6,2,9,5,1,8,4] : elem list

The MMAP signature for mutable maps can also be implemented using hash tables in a similar manner.

CS 312 Lecture 16 Mutable data abstractions and hash tables