At this point we have seen several mutable data structures using refs and arrays. Today we will look at hash tables.
We've seen various implementations for building sets or map abstractions. First we had simple lists, which had O(n) access time. We can also also implement sets (or maps) using binary search trees, by storing values (or key-value pairs) in the node of the tree. If the tree is balance, then access operations in tree implementations have O(lg n) access time. Our current best results are this:
linked list, no duplicates | balanced trees (e.g., AVL) | |
add (insert) |
O(n) | O(lg n) |
delete (remove) |
O(n) | O(lg n) |
member (contains) |
O(n) | O(lg n) |
What if we could do even better? It turns out that we can implement mutable sets and maps more efficiently than the immutable (functional) sets and maps we've been looking at so far. In fact, we can perform these operations in O(1) time with hash tables, implemented using arrays. The idea is to exploit the power of arrays to update a random element in O(1) time.
In a hash table we store entries using an array of lists of elements. We ensure that each list is small, O(1) to be precise, so simple functional lists without duplicates work fine. This data structure (the hash table) is a big array of O(n) elements, called buckets. Each bucket is a functional (immutable) set containing O(1) elements, and the elements of the set as a whole are partitioned among all the buckets. We cannot really guarantee that the sets are O(1) in size, so hashing is expected case O(1) time, but in the worst case can be O(n) - if some fraction of the elements all hash to the same bucket. Thus it is important to ensure that the mapping of elements to buckets "spreads out" the elements across the buckets.
There is one key piece missing: in which bucket should a set element be
stored? We provide a hash function h(e)
that given a set element e returns the index of
a bucket that element should be stored into. The hash table works well if each
element is equally and independently likely to be hashed into any particular
bucket; this condition is the simple uniform hashing assumption. Suppose
we have n elements in the set and the
bucket array is length m. Then we expect
a = n/m
elements per bucket. The quantity a is called the
load factor of the hash table. If the set implementation used for the
buckets has linear performance, then we expect to take O(1+a)
time to do add
, remove
, and member
. If
the number of buckets is at proportional to the number of elements, the load
factor a is O(1), so all
the operations are also O(1) on average. Notice
that the worst case performance of a hash table is O(n),
however, because in the worst case all of the elements hash to the same bucket.
If the hash function is chosen well, this will be extremely unlikely.
Hash tables are generally implemented as maps, mapping from keys to values. It is easy to implement a set as a map, by simply using the presence of a key to indicate than an element is in a set.
Here is a simple signature for hash tables.
signature HASH_TABLE = sig type key type value type hashtable (* new(n) creates a hash table with n buckets *) val new : int -> hashtable (* insert(ht,key,value) updates a hash table to contain the * mapping from key to value. If key already is in the hash * table then this new value is stored in place of the old one.*) val insert : hashtable * key * value -> unit (* lookup(ht,key) returns the corresponding SOME value if key is * found in the hash table, otherwise NONE *) val lookup : hashtable * key -> value option (* delete(ht,key) removes the specified key and its value if present * otherwise has no effect *) val delete : hashtable * key -> unit (* The size of a hash table *) val size : hashtable -> int end
A key issue that affects our implementation is our choice of the hash function (and the number of buckets, which is of course determined by the hash function). Clearly, a bad hash function can destroy our attempts at a constant running time, since in the worst case we have to search O(n) buckets. If we're mapping names to phone numbers, then hashing each name to its length would be a very poor function, as would a hash function that used only the first name, or only the last name. We want our hash function to use all of the information in the key.
With modular hashing, the hash function is simply h(k) = k mod m for some m (typically the number of buckets). This is easy to compute quickly when we consider the bit-level representation of the key k as representing a number. Certain values of m produce poor results though; in particular if m is a power of two (i.e., m=2^{p}), then h(k) is just the p lowest-order bits of k. Generally we prefer a hash function that uses all the bits of the key so that any change in the key it likely to change the bucket it maps to. In practice, primes not too close to powers of 2 work well.
Another alternative is multiplicative hashing, in which we compute (ka/2^{q}) mod m for appropriately chosen values of a, m, and q. This works well for the same reason that linear congruential multipliers generate apparently random numbers. The multiplier a should be large and its binary representation should be a "random" mix of 1's and 0's; q is chosen so that all of the high bits of the product are retained before computing the modulus. Multiplicative hashing is cheaper than modular hashing and it works well with a bucket array of size m=2^{p}, which is convenient.
Ideally you should test your hash function to make sure it behaves well with real data. With any hash function, it is possible to generate data that cause it to behave poorly, but a good hash function will make this unlikely. A good way to determine whether your hash function is working well is to measure the clustering of elements into buckets. If bucket i contains x_{i} elements, then the clustering is (S_{i}(x_{i}^{2})/n) - n/m. A uniform hash function produces clustering near 1.0 with high probability. A clustering factor of c means that the performance of the hash table is slowed down by a factor of c relative to its performance with a uniform hash function and the same array size. If clustering is less than 1.0, the hash function is doing better than a uniform random hash function ought to: this is rare. Note that clustering is independent of the load factor.
An alternative to hashing with buckets is open addressing. Instead of storing a set at every array index, a single element is stored there. If an element is inserted in the hash table and collides with an element already stored at that index, a second possible possible location for it is computed. If that is full, the process repeats. There are various strategies for generating a sequence of hash values for a given element: linear probing, quadratic probing, double hashing. We have chosen not talk about open addressing in detail because in practice it is slower than an array of buckets. The performance of open addressing becomes very bad when the load factor approaches 1, because a long sequence of array indices may need to be tried for any given element -- possibly every element in the array! Therefore it is important to resize the array when the load factor exceeds 2/3 or so. The bucket approach, by contrast, suffers gradually declining performance as the load factor grows, and no fixed point beyond which resizing is absolutely needed. With buckets, a sophisticated application can defer the O(n) cost of resizing its hash tables to a point in time when it is convenient to incur it: for example, when the user is idle.
Here we develop an implementation of hash tables where the keys and values
are both strings. First we extend the hash table signature to specify that
the keys and values are strings, and then define an implementation of the
resulting abstraction.
Note: what would happen if we tried to define StringHT as an opaque
implementation of HASH_TABLE rather than STRING_HT?
This is a basic implementation, that simply stores a list without "duplicates" in each bucket. Each element of the list is a (key,value) pair, such that the key hashes to the index of that bucket in the array. No two elements in the list have the same key (i.e., duplicates are duplicates on keys not pairs).
signature STRING_HT = HASH_TABLE where type key = string and type value = string structure StringHT :> STRING_HT = struct type key = string type value = string type hashtable = ((key * value) list) array (* helper function to hash a string to an int. hashes to a word using * byte shift operations and then converts to an int using the algorithm * from SML's hashString structure*) fun hash (k: key): int = let fun charToWord c = Word.fromInt(Char.ord c) fun hashChar (c, h) = Word.<<(h, 0w5) + h + 0w720 + (charToWord c) in Int.abs(Word.toIntX(CharVector.foldl hashChar 0w0 k)) end fun new(n: int): hashtable = Array.array(n,nil) (* Helper function to remove an element from a list *) fun remove(k: key, l: (key*value) list) = List.filter (fn(k',_) => k' <> k) l fun insert(t: hashtable, k: key, v: value): unit = let val bucket = Int.mod(hash k, Array.length t) val entries = Array.sub(t, bucket) in if List.exists(fn (k',_) => k=k') entries then Array.update(t, bucket, (k,v)::remove(k,entries)) else Array.update(t, bucket, (k,v)::entries) end fun lookup(t: hashtable, k: key): value option = let val bucket = Int.mod(hash k, Array.length t) val entries = Array.sub(t, bucket) in List.find (fn (k',_) => k=k') entries end fun delete(t: hashtable, k: key): unit = let val bucket = Int.mod(hash k, Array.length t) val entries = Array.sub(t, bucket) in if List.exists(fn (k',_) => k=k') entries then () else Array.update(t, bucket, remove(k,entries)) end fun size(t: hashtable): int = Array.foldl (fn (x,y) => length(x) + y) 0 ht end
Hash tables are one of the most useful data structures ever invented. Unfortunately, they are also one of the most misused. Code built using hash tables often does not get anywhere near the possible performance, because of badly designed hash functions. The reason for this comes down to a common failure to adequately specify the requirements on the hash function.
Above we described hashing as a function that maps a key to a bucket index. Recall that hash tables work well when the hash function satisfies the simple uniform hashing assumption -- that the hash function should look random. If it is to look random, this means that any change to a key, even a small one, should change the bucket index in an apparently random way. If we imagine writing the bucket index as a binary number, a small change to the key should randomly flip the bits in the bucket index. This is a kind of information diffusion: a one-bit change to the key should randomly affect every bit in the index.
What happens if we try inserting "successive" strings into our hash table, such as "a", "b", "c", "d", "e" how many buckets are filled say for a table with 1000 elements?
The claim that hash tables give O(1)
performance is based on the assumption that m = O(n).
If a hash table has many elements inserted into it, n
may become much larger than m and violate this
assumption. The effect will be that the bucket sets will become large enough
that their bad asymptotic performance will show through. The solution to this
problem is relatively simple: the array must be increased in size and all the
element rehashed into the new buckets using an appropriate hash function
when the load factor exceeds some constant factor. Each resizing operation
therefore takes O(n) time where
n is the size of the hash table being resized.
Therefore the O(1) performance of the hash
table operations no longer holds in the case of add
: its worst-case
performance is O(n).
This isn't really as much of a problem as it might sound. If the bucket array is doubled in size every time it is needed, then the insertion of n elements in a row into an empty array takes only O(n) time, perhaps surprisingly. We say that add has O(1) amortized run time because the time required to insert an element is O(1) on the average even though some elements trigger a lengthy rehashing of all the elements of the hash table.
To see why this is, suppose we insert n elements into a hash table while doubling the number of buckets when the load factor crosses some threshold. A given element may be rehashed many times, but the total time to insert the n elements is still O(n). Consider inserting n = 2^{k} elements, and suppose that we hit the worst case, where the resizing occurs on the very last element. Since the bucket array is being doubled at each rehashing, the rehashes must all occur at powers of two. The final rehash rehashes all n elements, the previous one rehashes n/2 elements, the one previous to that n/4 elements, and so on. So the total number of hashes computed is n hashes for the actual insertions of the elements, plus n + n/2 + n/4 + n/8 + ... = n(1 + 1/2 + 1/4 + 1/8 + ...) = 2n hashes, for a total of 3n hashing operations.
No matter how many elements we add to the hash table, there will be at most
three hashing operations performed per element added. Therefore, add
takes amortized O(1) time even if we start out
with a bucket array of one element!
Another way to think about this is that the true cost of performing an
add
is about triple the cost observed on a typical call to add
.
The remaining 2/3 of the cost is paid as the array is resized later. It is
useful to think about this in monetary terms. Suppose that a hashing operation
costs $1 (that is, 1 unit of time). Then a call to add
costs $3,
but only $1 is required up front for the initial hash. The remaining $2 is
placed into the hash table element just added and used to pay for future
rehashing. Assume each time the array is resized, all of the remaining money
gets used up. At the next resizing, there are n
elements and n/2 of them have $2 on them; this
is exactly enough to pay for the resizing. This is a really an argument by
induction, so we'd better examine the base case: when the array is resized from
one bucket to two, there is $2 available, which is $1 more than needed to pay
for the resizing. That extra $1 will stick around indefinitely, so inserting
n elements starting from a 1-element array
takes at most 3n-1 element hashes,
which is O(n) time. This kind of
analysis, in which we precharge an operation for some time that will be taken
later, typifies amortized analysis of run time.
Notice that it was crucial that the array size grows geometrically (doubling). It is tempting to grow the array by a fixed increment (e.g., 100 elements at time), but this causes n elements to be rehashed O(n) times on average, resulting in O(n^{2}) asymptotic insertion time!
Any fixed threshold load factor is equally good from the standpoint of asymptotic run time, but a good rule of thumb is that rehashing should take place at a=3. One might think that a=1 is the right place to rehash, but in fact the best performance is seen (for buckets implemented as linked lists) when load factors are in the 1-2 range. When a<1, the bucket array contains many empty entries, resulting in suboptimal performance of the computer's memory system. There are many other tricks that are important for getting the very best performance out of hash tables.
How would one extend the above implementation to do resizing? Remember, we don't want to do anything expensive on every insertion, the whole goal is to have O(1) insert time. How about explicitly storing the number of entries in the table? Then grow the table (doubling the number of entries) and rehash all the elements when this gets to be larger than some multiple of the number of buckets.
How much of the above code would be the same if we defined a hash table where the keys and values are integers? How would you abstract out the commonalities?