Hash tables and amortized analysis

The claim that hash tables give *O*(1)
performance is based on the assumption that *n*
= *O*(*m*). If a hash table has many elements inserted into
it, *n* may become much larger than *m* and violate
this assumption. The effect will be that the bucket sets will become large
enough that their bad asymptotic performance will show through. The solution to
this problem is relatively simple: the array must be increased in size and all
the elements **rehashed** into the new buckets using an appropriate hash
function when the load factor α = *n*/*m* exceeds some constant factor α_{max}.
Because resizing is not visible to the client, it is a
**benign side effect**. Each resizing
operation takes *O*(*n*) time
where *n* is the size of the hash table being
resized. Therefore the *O*(1) performance of
the hash table operations no longer holds in the case of `add`: its
worst-case performance is *O*(*n*).

This isn't as much of a problem as it might sound, though it can be an
issue for some real-time computing systems. If the bucket array
is doubled in size every time a resize is needed, then the insertion of *n*
elements in a row into an empty table takes only *O*(*n*)
time, perhaps surprisingly. We say that `add` has *O*(1)
**amortized run time** because the time required to insert an element is *O*(1)
on the average even though some elements trigger a lengthy rehashing of all the
elements of the hash table.

To see why this is, suppose we insert *n*
elements into a hash table while doubling the number of buckets when the load
factor crosses some threshold. A given element may be rehashed many times, but
the total time to insert the *n* elements is
still *O*(*n*). Consider inserting *n*
= 2^{k} elements, and suppose that we hit the worst case,
where the resizing occurs on the very last element. Since the bucket array is
being doubled at each rehashing, the rehashes must all occur at powers of two.
The final rehash rehashes all *n*
elements, the previous one rehashes *n*/2
elements, the one previous to that *n*/4
elements, and so on. So the total number of hashes computed is *n*
hashes for the actual insertions of the elements, plus at most *n*
+ *n*/2 + *n*/4 + *n*/8 + ... = *n*(1 + 1/2 + 1/4 + 1/8 +
...) = 2*n* hashes for rehashing, for a total of at most 3*n* hashing
operations.
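
Written out as a finite sum, under the same assumptions as above (*n* = 2^{k}, with rehashes of sizes *n*, *n*/2, ..., 1), the count is:

```latex
\underbrace{n}_{\text{insertions}}
\;+\; \underbrace{\sum_{i=0}^{k} \frac{n}{2^{i}}}_{\text{rehashes}}
\;=\; n + \left(2^{k+1} - 1\right)
\;=\; n + (2n - 1)
\;<\; 3n .
```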

No matter how many elements
we add to the hash table, there will be at most three hashing operations performed
per element added. Therefore, `add` takes amortized *O*(1)
time even if we start out with a bucket array of one element!

Another way to think about this is that the true cost of performing an `add`
is about triple the cost observed on a typical call to `add`. The remaining 2/3 of
the cost is paid as the array is resized later. It is useful to think about this
in monetary terms. Suppose that a hashing operation costs $1 (that is, 1 unit of
time). Then a call to `add` costs $3, but only $1 is required up
front for the initial hash. The remaining $2 is placed into the hash table
element just added and used to pay for future rehashing. Assume each time the
array is resized, all of the remaining money gets used up. At the next resizing,
there are *n* elements and *n*/2
of them have $2 on them; this is exactly enough to pay for the resizing. This is
really an argument by induction, so we'd better examine the base case: when
the array is resized from one bucket to two, there is $2 available, which is $1
more than needed to pay for the resizing. That extra $1 will stick around
indefinitely, so inserting *n* elements
starting from a 1-element array takes at most 3*n* − 1
element hashes, which is *O*(*n*) time.
This kind of analysis, in which we precharge an operation for some time that
will be taken later, is the idea behind **amortized analysis** of run time.

Notice that it was crucial that the array size grows geometrically
(doubling). It is tempting to grow the array by a fixed increment (e.g., 100
elements at a time), but this causes *n* elements to be rehashed *O*(*n*)
times on average, resulting in *O*(*n*^{2})
asymptotic insertion time!
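
The difference between the two growth policies can be checked with a small simulation. The sketch below is not part of the hash table code in these notes; `countHashes`, `doubling`, and `fixedStep` are hypothetical names, and the model assumes a table that starts with one bucket, uses α_{max} = 1, and rehashes every stored element on each resize.

```sml
(* Count the hash computations performed by n insertions into a table that
 * starts with one bucket and is resized, rehashing every stored element,
 * whenever it is full (alpha_max = 1). `grow` maps the old bucket count
 * to the new one. *)
fun countHashes (grow: int -> int) (n: int): int =
  let
    fun loop (inserted, buckets, hashes) =
      if inserted = n then hashes
      else if inserted >= buckets
      then loop (inserted, grow buckets, hashes + inserted) (* resize: rehash all *)
      else loop (inserted + 1, buckets, hashes + 1)         (* hash the new element *)
  in
    loop (0, 1, 0)
  end

(* Geometric growth versus growth by a fixed increment of 100 buckets. *)
val doubling  = countHashes (fn m => 2 * m)
val fixedStep = countHashes (fn m => m + 100)
```

In this model, `doubling 1000` counts 2023 hashes and `fixedStep 1000` counts 5510. At *n* = 10000 the counts are 26383 and 505100: ten times as many elements cost about thirteen times as much with doubling, but more than ninety times as much with the fixed increment.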

Any fixed threshold load factor is equally good from the standpoint of
asymptotic run time. If you are concerned about performance, it is a
good idea to measure the value of α_{max} that maximizes
performance. Typically it will be between 1 and 3. One might think that α = 1 is the right place to rehash, but the
best performance is often seen (for buckets implemented as linked lists)
when load factors are in the 1–2 range.
When α < 1, the
bucket array contains many empty entries, resulting in suboptimal performance
from the computer's memory system. There are many other tricks that are important
for getting the very best performance out of hash tables. For best
performance, tune the hash function and the resizing threshold against
measurements.

In fact, if the load factor becomes too low, it's a good idea to resize the hash
table to make it smaller. Usually this is done when the load factor drops below
α_{max}/4. At this point the hash table is
halved in size and all of the elements are rehashed.
It is important to shrink only once the load factor
gets sufficiently small. For example, if the hash table grows by doubling,
it should be shrunk only once its load factor is well below half of the point that would
cause doubling, which is why the threshold above is α_{max}/4 rather than α_{max}/2.
A table halved at exactly α_{max}/2 would immediately be full again, so time could be wasted alternately growing and shrinking
the table, hurting asymptotic performance.
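
The grow and shrink thresholds can be stated as a small decision function. This is only a sketch: the `resize` datatype and `resizePolicy` are hypothetical names, α_{max} is assumed to be an integer, and comparisons are done without division so that everything stays in integer arithmetic.

```sml
(* What to do with a table holding n elements in m buckets. *)
datatype resize = Grow of int | Shrink of int | Keep

(* Double when the load factor n/m exceeds alphaMax; halve when it drops
 * below alphaMax/4. A one-bucket table is never shrunk. *)
fun resizePolicy (alphaMax: int) (n: int, m: int): resize =
  if n > alphaMax * m then Grow (2 * m)
  else if m > 1 andalso 4 * n < alphaMax * m then Shrink (m div 2)
  else Keep
```

With α_{max} = 1, a table with 1 element in 8 buckets is halved to 4 buckets. Its new load factor of 1/4 sits between the two thresholds, so `resizePolicy` returns `Keep` for it and a single further operation cannot trigger another resize in the opposite direction.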

If we start from an empty hash table, any sequence of *n*
operations will take *O*(*n*) time, even if we resize the hash
table whenever the load factor goes outside the interval [α_{max}/4, α_{max}].

To see this we need to evaluate the amortized complexity of the hash table
operations. This formalizes the reasoning we used earlier.
To do this, we define a **potential function** that measures the
precharged time for a given state of the data structure. The potential function
saves up time that can be used by later operations.

We then define the **amortized time**
taken by a single operation that changes the data structure
from *h* to *h'* as the actual time plus
the change in potential,
Φ(*h*') − Φ(*h*).
Now consider
a sequence of *n* operations on a hash table, taking actual times *t*_{1}, *t*_{2},
*t*_{3}, ..., *t*_{n} and
producing hash tables *h*_{0}, *h*_{1}, *h*_{2}, ..., *h*_{n}.
The total amortized time taken by these operations is
the sum of the actual times for each operation plus the sum of the changes in
potential: *t*_{1} + *t*_{2} + ... + *t*_{n} + (Φ(*h*_{1}) − Φ(*h*_{0}))
+ (Φ(*h*_{2}) − Φ(*h*_{1})) + ... + (Φ(*h*_{n})
− Φ(*h*_{n-1}))
= *t*_{1} + *t*_{2} + ... + *t*_{n} + Φ(*h*_{n}) −
Φ(*h*_{0}).
Therefore the amortized time for a sequence of operations exceeds
the actual time by the net change in the potential function,
Φ(*h*_{n}) − Φ(*h*_{0}),
over the whole sequence of operations. If we can arrange that the potential
never drops below its initial value, so that Φ(*h*_{n}) ≥ Φ(*h*_{0}),
the total amortized time is always an upper bound on the actual time,
which is what we want.

The key to amortized analysis is to define the right potential function. The potential function needs to save up enough time to be used later when it is needed. But it cannot save so much time that it causes the amortized time of the current operation to be too high.

Let us do an amortized analysis of hash tables with both resizing by
doubling and by halving. For simplicity we assume α_{max} = 1. We define the potential function so that it stores
up time as the load factor moves away from 1/2. Then there will be enough
stored-up time to resize in either direction. The potential function is:

Φ(*h*) = 2|*n* − *m*/2|

Now we just have to consider all the cases that can occur with all the possible operations. Since α is O(1), we assume that looking for an element takes time 1.

**Adding an element.** Adding an element increases *n* by one. There are three cases to consider.

- **1/2 ≤ α < 1**. The potential increases by 2, so amortized time is 1 + 2 = 3.
- **α < 1/2**. The potential decreases by 2, so amortized time is 1 − 2 = −1.
- **α = 1**. The hash table is resized, so actual time is 1 + *m*. But the potential goes from *m* to 0, so amortized time is 1 + *m* − *m* = 1.

**Looking up an element.** The potential doesn't change, so this takes actual time and amortized time of 1.

**Removing an element.** This reduces *n* by one. There are again three cases.

- **1/2 < α ≤ 1**. The potential decreases by 2, so amortized time is 1 − 2 = −1.
- **1/4 < α ≤ 1/2**. The potential increases by 2, so amortized time is 1 + 2 = 3.
- **α = 1/4**. The hash table is resized, so actual time is 1 + *m*/4. The potential goes from *m*/2 to 0, so amortized time is 1 + *m*/4 − *m*/2 = 1 − *m*/4.

In each case, the amortized time is O(1). If we start our hash table with a
load factor of 1/2, then its initial potential will be zero. Since the potential
is never negative, it can never drop below its initial value, and amortized time will be an upper bound on
actual time. Therefore a sequence of *n* operations will take
O(*n*) time.
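
The case analysis can also be replayed mechanically. The following sketch models a table only by its element count *n* and bucket count *m*, charging one unit per element hashed. All names are hypothetical, α_{max} = 1 is assumed, and because the counts are exact integers the resizing cases come out a small constant higher than the rounded figures above (3 instead of 1 for a doubling `Add`, for instance).

```sml
(* A table state is (n, m): n elements stored in m buckets (m even). *)
type state = int * int

(* Potential 2|n - m/2|, written as |2n - m| to stay in integers. *)
fun potential ((n, m): state): int = Int.abs (2 * n - m)

datatype operation = Add | Remove

(* Perform one operation; return the new state and the actual cost.
 * Assumes a Remove is only issued when n > 0. *)
fun step ((n, m): state, Add) =
      if n = m then ((n + 1, 2 * m), 1 + n)   (* double: rehash n, then insert *)
      else ((n + 1, m), 1)
  | step ((n, m), Remove) =
      if m >= 4 andalso 4 * n = m
      then ((n - 1, m div 2), 1 + n)          (* halve: rehash n, then remove *)
      else ((n - 1, m), 1)

(* Amortized cost = actual cost + change in potential. *)
fun amortized (s: state, oper: operation): int =
  let val (s', actual) = step (s, oper)
  in actual + potential s' - potential s end

(* Total actual cost of a sequence of operations starting from state s. *)
fun totalActual (s: state, ops: operation list): int =
  #2 (List.foldl (fn (oper, (st, total)) =>
                    let val (st', c) = step (st, oper)
                    in (st', total + c) end)
                 (s, 0) ops)
```

Starting from the state `(1, 2)`, whose potential is zero, every operation in this model has amortized cost at most 3, so `totalActual` over any valid sequence of *k* operations is at most 3*k*.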

We can use a functor to provide a generic implementation of the mutable set (`MSET`)
signature too. In order to store elements in a hash table, we'll need a hash
function for the element type, and an equality test just as for other sets. We
can define an appropriate signature that groups the type and these two
operations:

    signature HASHABLE = sig
      type t
      (* hash is a function that maps a t to an integer. For
       * all e1, e2, if equal(e1,e2), then hash(e1) = hash(e2) *)
      val hash: t->int
      (* equal is an equivalence relation on t. *)
      val equal: t*t->bool
    end

There is an additional invariant documented in the signature: for the hash table to function correctly, any two equal elements must have the same hash code.
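
As an illustration, here are two structures that could satisfy this signature. Both are sketches rather than code from these notes: `IntHash` is used in later examples but its definition is not shown, so the identity hash below is only a guess at it, and `CIString` is a hypothetical case-insensitive string type chosen to show the invariant at work. The signature is repeated so that the block stands alone.

```sml
signature HASHABLE = sig
  type t
  val hash: t->int
  val equal: t*t->bool
end

(* Integers hashed to themselves (an assumed definition). *)
structure IntHash : HASHABLE = struct
  type t = int
  fun hash (x: int) = x
  fun equal (a: int, b: int) = (a = b)
end

(* Strings compared without regard to case. Because equal ignores case,
 * hash must ignore it too, or equal elements could land in different
 * buckets. *)
structure CIString : HASHABLE = struct
  type t = string
  fun lower s = String.map Char.toLower s
  fun hash s =
    CharVector.foldl (fn (c, h) => (h * 31 + Char.ord c) mod 1000003)
                     0 (lower s)
  fun equal (a, b) = (lower a = lower b)
end
```

Hashing the original string instead of `lower s` would break the invariant: `"Hello"` and `"hELLO"` would be equal yet could hash to different codes.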

    functor HashSet(structure Hash: HASHABLE
                    and Set: SET where type elem = Hash.t) =
    struct
      type elem = Hash.t
      type bucket = Set.set
      type set = {arr: bucket array, nelem: int ref}
      (* AF: the set represented by a "set" is the union of
       *   all of the bucket sets in the array "arr".
       * RI: nelem is the total number of elements in all the buckets in
       *   arr. In each bucket, every element e hashes via Hash.hash(e)
       *   to the index of that bucket modulo length(arr).
       *)

      (* Find the appropriate bucket for e *)
      type 'a bucketHandler = bucket array*int*bucket*elem*int ref->'a
      fun findBucket({arr, nelem}, e) (f: 'a bucketHandler): 'a =
        let
          val i = Hash.hash(e) mod Array.length(arr)
          val b = Array.sub(arr, i)
        in
          f(arr, i, b, e, nelem)
        end

      fun member(s, e) =
        findBucket(s, e) (fn(_, _, b, e, _) => Set.member(b, e))
      fun add(s, e) =
        findBucket(s, e)
          (fn(arr, i, b, e, nelem) =>
             ( Array.update(arr, i, Set.add(b, e));
               nelem := !nelem + 1 ))
      fun remove(s, e) =
        findBucket(s, e)
          (fn(arr, i, b, e, nelem) =>
             ( case Set.remove(b,e) of
                 (b2, NONE) => NONE
               | (b2, SOME y) => ( Array.update(arr, i, b2);
                                   nelem := !nelem - 1;
                                   SOME y )))
      fun size({arr, nelem}) = !nelem
      fun fold f init {arr, nelem} =
        Array.foldl (fn (b, curr) => Set.fold f curr b) init arr
      fun create(size: int): set =
        { arr = Array.array(size, Set.empty),
          nelem = ref 0 }
      (* Copy all elements from s2 into s1. *)
      fun copy(s1:set, s2:set): unit =
        fold (fn(elem,_)=> add(s1,elem)) () s2
      fun fromList(lst) =
        (* Use at least one bucket so that findBucket never computes mod 0. *)
        let val s = create(Int.max(1, length lst))
        in
          List.foldl (fn(e, ()) => add(s,e)) () lst;
          s
        end
      fun toList({arr, nelem}) =
        Array.foldl (fn (b, lst) => Set.fold (fn(e, lst) => e::lst) lst b) [] arr
    end

This hash table implementation almost implements the `MSET` signature, but not
quite, because it doesn't implement the `empty` operation. Here is a complete
implementation of mutable sets using a fixed-size hash table of `nbucket`
buckets:

    functor FixedHashSet(val nbucket: int;
                         structure Hash: HASHABLE
                         and Set: SET where type elem = Hash.t)
      :> MSET where type elem = Hash.t =
    struct
      structure HS = HashSet(structure Hash = Hash and Set = Set)
      type elem = HS.elem
      type set = HS.set
      (* HASHABLE names its equality test "equal" *)
      val eq = Hash.equal
      fun empty() = HS.create(nbucket)
      val member = HS.member
      val add = HS.add
      val remove = HS.remove
      val size = HS.size
      val fold = HS.fold
      val toList = HS.toList
      val fromList = HS.fromList
    end

Here we create a hash table of 1000 buckets and insert the numbers one through
ten into it.

    - structure FHS = FixedHashSet(val nbucket = 1000
                                   structure Hash = IntHash and Set = IntSet)
    - open FHS;
    - val s = empty();
    val s = - : set
    - foldl (fn(x,_) => add(s,x)) () [1,2,3,4,5,6,7,8,9,10];
    val it = () : unit
    - toList(s);
    val it = [10,9,8,7,6,5,4,3,2,1] : elem list

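The elements come out in reverse order because they are hashed into buckets 1 through 10, and `toList` visits the buckets in increasing index order, consing each element onto the front of the result.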

The `HashSet` implementation of hash tables abstracts out the
common operation of finding the correct bucket for a particular element into the
`findBucket` function, thus keeping the rest of the code simpler. If
the number of buckets is always a power of 2, the modulo operation can be
performed using bit logic, which is much faster.
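
A minimal sketch of that bit-logic trick, using the Basis Library `Word` structure. The function name `bucketIndex` is hypothetical, and the result is only meaningful under the stated assumption that the bucket count is a power of two.

```sml
(* Bucket index for hash code h in a table of nbuckets buckets.
 * Assumes nbuckets is a power of two, so nbuckets - 1 is a mask
 * selecting the low-order bits of h. *)
fun bucketIndex (h: int, nbuckets: int): int =
  Word.toInt (Word.andb (Word.fromInt h, Word.fromInt (nbuckets - 1)))
```

For a power-of-two bucket count this agrees with `h mod nbuckets`, including for negative hash codes, since SML's `mod` takes the sign of the divisor and `Word.fromInt` uses the two's-complement representation.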

As long as we don't put more than a couple of thousand elements into a
fixed-size 1000-bucket hash table, its performance will be excellent. However,
the *asymptotic* performance of any fixed-size table is no better than that
of a linked list. We can introduce another level of indirection to obtain a hash
table that grows dynamically and rehashes its elements, thus achieving *O*(1)
amortized performance as described above. The implementation proves to be quite
simple because we can reuse all of our `HashSet` code:

    functor DynHashSet(structure Hash: HASHABLE
                       and Set: SET where type elem = Hash.t)
      :> MSET where type elem=Hash.t =
    struct
      structure HS = HashSet(structure Hash = Hash and Set = Set)
      type set = HS.set ref
      type elem = HS.elem
      val thresholdLoadFactor = 3
      (* AF: the set represented by x:set is !x.
       * RI: the load factor of the hash table !x never goes
       *   above thresholdLoadFactor.
       *)
      fun empty():set = ref (HS.create(1))
      fun member(s, e) = HS.member(!s, e)
      fun remove(s, e) = HS.remove(!s, e)
      fun size(s) = HS.size(!s)
      fun add(s, e) =
        let
          val {arr, nelem} = !s
          val nbucket = Array.length(arr)
        in
          if !nelem >= thresholdLoadFactor*nbucket then
            let val newset = HS.create(nbucket*2)
            in
              HS.copy(newset, !s);
              s := newset
            end
          else ();
          HS.add(!s, e)
        end
      fun fold f init s = HS.fold f init (!s)
      fun fromList(lst) = ref (HS.fromList(lst))
      fun toList(s) = HS.toList (!s)
    end

There is hardly any new code required: mostly just the logic in `add`
that creates a new, larger hash table and copies all the elements across
when the load factor is too high.

This code requires access to the internals of the `HashSet` implementation,
which is why those internals are not hidden behind a signature. An example of
using these hash tables follows:

    - structure DynIntHashSet = DynHashSet(structure Hash = IntHash and Set = IntSet)
    - open DynIntHashSet;
    - val s = empty();
    val s = - : set
    - foldl (fn(x,_) => add(s,x)) () [1,2,3,4,5,6,7,8,9,10];
    val it = () : unit
    - toList(s);
    val it = [7,3,10,6,2,9,5,1,8,4] : elem list

The `MMAP` signature for mutable maps can also be implemented
using hash tables in a similar manner.