CS 312 Lecture 19
Hash functions and memory management

Hash functions

Hash tables are one of the most useful data structures ever invented. Unfortunately, they are also one of the most misused. Code built using hash tables often does not get anywhere near the possible performance. There are two reasons for this:

Clients choose poor hash functions that do not act like random number generators, invalidating the simple uniform hashing assumption.
Hash table abstractions do not adequately specify what is required of the hash function, or make it difficult to provide a good hash function.

Clearly, a bad hash function can destroy our attempts at a constant running time. A lot of obvious hash function choices are bad. For example, if we're mapping names to phone numbers, then hashing each name to its length would be a very poor function, as would a hash function that used only the first name, or only the last name. We want our hash function to use all of the information in the key. This is a bit of an art. While hash tables are extremely effective when used well, all too often poor hash functions are used that sabotage performance.
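To make the point concrete, here is a minimal SML sketch of the name-length hash described above. The names are made-up sample data and lengthHash is a hypothetical function name.

```sml
(* A poor hash function: it looks only at the length of the name. *)
fun lengthHash (name : string) : int = String.size name

(* Three different (made-up) names, all 11 characters long. *)
val codes = map lengthHash ["Alice Smith", "Carol Jones", "David Brown"]
(* codes = [11, 11, 11]: every key lands in the same bucket *)
```

Because the function ignores everything about the key except its length, distinct keys of equal length always collide.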

Clustering

Ideally you should test your hash function to make sure it behaves well with real data. With any hash function, it is possible to generate data that cause it to behave poorly, but a good hash function will make this unlikely. A good way to determine whether your hash function is working well is to measure the clustering of elements into buckets. If bucket i contains x_i elements, then a good measure of clustering is (∑_i(x_i²)/n) - α, where n is the total number of elements, m is the number of buckets, and α = n/m is the load factor. A uniform hash function produces clustering near 1.0 with high probability. A clustering measure c greater than 1 means that the performance of the hash table is slowed down by clustering. If the clustering measure is less than 1.0, the hash function is spreading elements out more evenly than a random hash function would: this is rare!
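The clustering measure can be computed directly from the bucket sizes. The following is a minimal sketch, assuming the sizes are available as a list of integers; the function name clustering is ours and is not part of any hash table interface.

```sml
(* Clustering measure for bucket sizes x_i:
   (sum of x_i^2) / n - alpha, with n = sum of x_i and alpha = n / m. *)
fun clustering (sizes : int list) : real =
  let
    val m = real (length sizes)
    val n = real (foldl (op +) 0 sizes)
    val sumSq = real (foldl (fn (x, acc) => acc + x * x) 0 sizes)
  in
    sumSq / n - n / m
  end

val even   = clustering [2, 2, 2, 2]   (* 16/8 - 2 = 0.0 *)
val skewed = clustering [8, 0, 0, 0]   (* 64/8 - 2 = 6.0 *)
```

A perfectly even spread gives 0, which is below the value of about 1 expected from a random hash function, while putting all eight elements in one bucket gives n - α = 6.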

Unfortunately most hash table implementations do not give the client a way to measure clustering. This means the client can't directly tell whether the hash function is performing well or not. Hash table designers should provide some clustering estimation as part of the interface. Note that it's not necessary to compute the sum of squares of all bucket lengths; picking a few at random is cheaper and usually good enough.

The clustering measure works because it is based on an estimate of the variance of the distribution of bucket sizes. If clustering is occurring, some buckets will have more elements than they should, and some will have fewer. So there will be a wider range of bucket sizes than one would expect from a random hash function.

High clustering happens much more often than you might think. For example, suppose our hash function on integers is to take the integer modulo the number of buckets. This won't work well if the integers tend to be equal modulo the number of buckets -- they'll all be hashed into the same bucket, and we'll have a glorified linked list instead of a hash table! For a less artificial example, consider the standard Java hashCode implementation for objects -- it converts the memory address of the object into an integer. However, the memory addresses of objects tend to exhibit a great deal of regularity, resulting in high clustering with some hash table implementations.
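The effect of regular keys on modular hashing is easy to reproduce. This sketch assumes 16 buckets and 32 made-up "addresses" spaced 8 bytes apart; the numbers are illustrative only and do not come from any real allocator.

```sml
val m = 16   (* number of buckets *)

(* Made-up "addresses": 32 objects allocated 8 bytes apart. *)
val addrs = List.tabulate (32, fn j => 4096 + 8 * j)

(* Count how many addresses fall into each bucket under h(k) = k mod m. *)
val counts = Array.array (m, 0)
val () = List.app
           (fn a => let val i = a mod m
                    in Array.update (counts, i, Array.sub (counts, i) + 1) end)
           addrs

val sizes = Array.foldr (op ::) [] counts
(* sizes = [16,0,0,0,0,0,0,0,16,0,0,0,0,0,0,0]: only 2 of 16 buckets are used *)
```

Plugging these sizes into the clustering formula gives (256 + 256)/32 - 2 = 14, far above the value of about 1 expected from a uniform hash function.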

For those who have taken some probability theory: Consider bucket i containing x_i elements. For each of the n elements, we can imagine a random variable e_j, whose value is 1 if the element lands in bucket i (with probability 1/m), and 0 otherwise. The bucket size x_i is a random variable that is the sum of all these random variables:
x_i = ∑_j∈1..n e_j

Let's write ⟨x⟩ for the expected value of variable x, and Var(x) for the variance of x, which is equal to ⟨(x - ⟨x⟩)²⟩ = ⟨x²⟩ - ⟨x⟩². Then we have:

⟨e_j⟩ = 1/m
⟨e_j²⟩ = 1/m
Var(e_j) = 1/m - 1/m²
⟨x_i⟩ = n⟨e_j⟩ = n/m = α

The variance of the sum of independent random variables is the sum of their variances. If we assume that the e_j are independent random variables, then:

Var(x_i) = n Var(e_j) = α - α/m = ⟨x_i²⟩ - ⟨x_i⟩²
⟨x_i²⟩ = Var(x_i) + ⟨x_i⟩²
= α(1 - 1/m) + α²

Now, if we sum ⟨x_i²⟩ over all m buckets and divide by n, as in the formula, the effect is to multiply ⟨x_i²⟩ by m/n = 1/α:
(1/n) ⟨∑ x_i²⟩ = (1/α)⟨x_i²⟩ = 1 - 1/m + α

Subtracting α, we get 1 - 1/m, which is close to 1 if m is large, regardless of n or α.
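As a quick sanity check of this derivation, here is the arithmetic for assumed values n = 200 and m = 100, so that α = 2. The numbers are chosen only for illustration.

```latex
\begin{align*}
\alpha &= n/m = 200/100 = 2 \\
\mathrm{Var}(x_i) &= \alpha - \alpha/m = 2 - 0.02 = 1.98 \\
\langle x_i^2 \rangle &= \mathrm{Var}(x_i) + \langle x_i \rangle^2 = 1.98 + 4 = 5.98 \\
\frac{1}{n}\Big\langle \sum_i x_i^2 \Big\rangle - \alpha
  &= \frac{1}{\alpha}\langle x_i^2 \rangle - \alpha = 2.99 - 2 = 0.99 = 1 - \frac{1}{m}
\end{align*}
```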

Now, suppose instead we had a hash function that hit only one of every c buckets. In this case, for the non-empty buckets, we'd have
⟨e_j⟩ = ⟨e_j²⟩ = c/m
⟨x_i⟩ = αc
(1/n) ⟨∑ x_i²⟩ = (1/n)(m/c)(Var(x_i) + ⟨x_i⟩²) = 1 - c/m + αc
(1/n) ⟨∑ x_i²⟩ - α = 1 - c/m + α(c-1)

If the clustering measure gives a value significantly greater than one, it is like having a hash function that misses a substantial fraction of buckets.
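For a feel of the magnitudes, here is the same calculation with assumed values n = 200, m = 100 (so α = 2) and a hash function that hits only every second bucket (c = 2). The numbers are illustrative only.

```latex
\begin{align*}
\langle x_i \rangle &= \alpha c = 4 \\
\mathrm{Var}(x_i) &= n \cdot \frac{c}{m}\Big(1 - \frac{c}{m}\Big) = 200 \cdot 0.02 \cdot 0.98 = 3.92 \\
\langle x_i^2 \rangle &= 3.92 + 16 = 19.92 \\
\frac{1}{n}\Big\langle \sum_i x_i^2 \Big\rangle - \alpha
  &= \frac{1}{200} \cdot 50 \cdot 19.92 - 2 = 2.98 = 1 - \frac{c}{m} + \alpha(c-1)
\end{align*}
```

Missing half of the buckets roughly triples the clustering measure at this load factor.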

Designing a hash function

For a hash table to work well, we want every change to a key to affect every bit in the bucket index in an apparently random way. A hash function has good diffusion if every bit change to the key causes about half the bits in the index to flip, apparently at random. The easy way to accomplish this is to break the computation of the bucket index into three steps.

1. Serialization: Transform the key into a stream of bytes that contains all of the information in the original key. Two equal keys must result in the same byte stream. Two byte streams should be equal only if the keys are actually equal. How to do this depends on the form of the key. If the key is a string, then the stream of bytes would simply be the characters of the string.
2. Diffusion: Map the stream of bytes into a large integer x in a way that causes every change in the stream to affect the bits of x apparently randomly. There are a number of good off-the-shelf ways to accomplish this, with a tradeoff in performance versus randomness (and security).
3. Compute the hash bucket index as x mod m. This is particularly cheap if m is a power of two, but see the caveats below.
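The three steps above can be sketched in SML for string keys. This is only an illustration: the diffusion step uses the FNV-1a byte hash as a stand-in of our choosing (the notes do not prescribe it), and the names serialize, diffuse and bucketIndex are hypothetical.

```sml
(* Serialization: a string key becomes the list of its bytes. *)
fun serialize (key : string) : Word8.word list =
  map (Word8.fromInt o Char.ord) (String.explode key)

(* Diffusion: mix the bytes into a 32-bit integer.
   Stand-in technique: FNV-1a (offset basis 0x811C9DC5, prime 0x01000193). *)
fun diffuse (bytes : Word8.word list) : Word32.word =
  foldl (fn (b, h) => Word32.xorb (h, Word32.fromInt (Word8.toInt b)) * 0wx01000193)
        (0wx811C9DC5 : Word32.word)
        bytes

(* Final step: reduce the integer to a bucket index in 0..m-1. *)
fun bucketIndex (m : int) (key : string) : int =
  Word32.toInt (diffuse (serialize key) mod Word32.fromInt m)

val i = bucketIndex 16 "alice"
```

Keeping the steps separate means the diffusion function can be swapped for a stronger or faster one without touching the code that knows how keys are laid out.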

There are several different good ways to accomplish step 2: multiplicative hashing, modular hashing, cyclic redundancy checks, and secure hash functions such as MD5 and SHA-1.

Frequently, hash tables are designed in a way that doesn't let the client fully control the hash function. Instead, the client is expected to implement steps 1 and 2 to produce an integer hash code, as in Java. The implementation then uses the hash code and the value of m (usually not exposed to the client, unfortunately) to compute the bucket index.

Some hash table implementations expect the hash code to look completely random, because they directly use the low-order bits of the hash code as a bucket index, throwing away the information in the high-order bits. Other hash table implementations take a hash code and put it through an additional step of applying an integer hash function that provides additional diffusion. With these implementations, the client doesn't have to be as careful to produce a good hash code.
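As an illustration of such an extra integer hashing step, the sketch below folds the high 16 bits of a 32-bit hash code into the low 16 bits before the low-order bits are taken. The mixing step is one simple possibility chosen for illustration (similar in spirit to what some library hash tables do), and the names spread and lowBits are hypothetical.

```sml
(* One illustrative mixing step: fold the high 16 bits into the low 16 bits. *)
fun spread (h : Word32.word) : Word32.word =
  Word32.xorb (h, Word32.>> (h, 0w16))

(* Bucket index taken from the low-order p bits, for a table of 2^p buckets. *)
fun lowBits (p : Word.word) (h : Word32.word) : int =
  Word32.toInt (Word32.andb (h, Word32.<< (0w1, p) - 0w1))

val code : Word32.word = 0wx00010000   (* differs from 0 only in a high bit *)
val idxRaw   = lowBits 0w4 code            (* 0: the high bit is thrown away *)
val idxMixed = lowBits 0w4 (spread code)   (* 1: the high bit now matters *)
```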

Any hash table interface should specify whether the hash function is expected to look random. If the client can't tell from the interface whether this is the case, the safest thing is to compute a high-quality hash code by hashing into the space of all integers. This may duplicate work done on the implementation side, but it's better than having a lot of collisions.

Modular hashing

With modular hashing, the hash function is simply h(k) = k mod m for some m (usually, the number of buckets). The value k is an integer hash code generated from the key. If m is a power of two (i.e., m=2^p), then h(k) is just the p lowest-order bits of k. The SML/NJ implementation of hash tables does modular hashing with m equal to a power of two. This is very fast but the client needs to design the hash function carefully.
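The following sketch checks, for a few sample 32-bit hash codes, that k mod 2^p is the same as masking off the p low-order bits, and shows why the client must be careful. The value p = 4 and the sample keys are assumptions made for the example.

```sml
val p : Word.word = 0w4
val m : Word32.word = Word32.<< (0w1, p)      (* m = 2^4 = 16 buckets *)

fun hMod  (k : Word32.word) : Word32.word = k mod m
fun hMask (k : Word32.word) : Word32.word = Word32.andb (k, m - 0w1)

val keys : Word32.word list = [0w0, 0w15, 0w16, 0wx12345678, 0wxFFFFFFFF]
val same = List.all (fn k => hMod k = hMask k) keys             (* true *)

(* Hash codes that differ only above the low p bits all collide. *)
val collide = map (Word32.toInt o hMod) [0wx10, 0wx20, 0wx30]   (* [0, 0, 0] *)
```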

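Implementations that use modular hashing with m equal to a prime number are a little friendlier but also slower. (Among the Java collections, Hashtable reduces the hash code modulo the table length, whereas current versions of HashMap use a power-of-two table size and first mix the high-order bits of the hash code into the low-order bits.) Modulo operations can be accelerated by precomputing 1/m as a fixed-point number, e.g. ⌊2³¹/m⌋. A precomputed table of various primes and their fixed-point reciprocals is therefore useful with this approach, because the implementation can then use multiplication instead of division to implement the mod operation.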
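Here is a sketch of the reciprocal trick, written for clarity rather than speed: it uses IntInf so that the double-width product cannot overflow, whereas a real implementation would use machine words. It assumes 0 <= k < 2^31 and 1 <= m < 2^31; the names reciprocal and fastMod are hypothetical.

```sml
val two31 : IntInf.int = IntInf.pow (2, 31)

(* Fixed-point reciprocal floor(2^31 / m); precomputed once per modulus. *)
fun reciprocal (m : int) : IntInf.int = two31 div IntInf.fromInt m

(* k mod m using a multiplication, a shift-like division by 2^31,
   and at most one correcting subtraction. *)
fun fastMod (m : int, r : IntInf.int) (k : int) : int =
  let
    val k' = IntInf.fromInt k
    val m' = IntInf.fromInt m
    val q = (k' * r) div two31       (* estimate of k div m; may be one too small *)
    val remainder = k' - q * m'      (* therefore lies in 0 .. 2m-1 *)
  in
    IntInf.toInt (if remainder >= m' then remainder - m' else remainder)
  end

val r101 = reciprocal 101
val ok = List.all (fn k => fastMod (101, r101) k = k mod 101)
                  [0, 1, 100, 101, 102, 123456789]
```

The correction step is needed because the truncated reciprocal slightly underestimates 1/m, so the estimated quotient can be one less than the true quotient.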

Multiplicative hashing

A faster but often misused alternative is multiplicative hashing, in which the hash index is computed as ⌊m * frac(ka)⌋. Here k is again an integer hash code, a is a real number and frac is the function that returns the fractional part of a real number. Multiplicative hashing sets the hash index from the fractional part of multiplying k by a large real number. It's faster if this computation is done using fixed point rather than floating point, which is accomplished by computing (ka/2^q) mod m for appropriately chosen integer values of a, m, and q. So q determines the number of bits of precision in the fractional part of a.

Here is an example of multiplicative hashing code, written assuming a word size of 32 bits:

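(* Multiplier recommended by Knuth, held in a 32-bit word. *)
val multiplier: Word32.word = 0wx678DDE6F

(* Find the bucket for element e and pass it to the continuation f. *)
fun findBucket({arr, nelem}, e) (f:bucket array*int*bucket*elem->'a) =
  let
    val n = Word32.fromInt(Array.length(arr))
    (* d = ceiling(2^32 / n), so dividing a 32-bit value by d gives an index in 0..n-1 *)
    val d = (0wxFFFFFFFF div n) + 0w1
    (* the product wraps modulo 2^32; its high-order bits select the bucket *)
    val i = Word32.toInt(Word32.fromInt(Hash.hash(e)) * multiplier div d)
    val b = Array.sub(arr, i)
  in
    f(arr, i, b, e)
  end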

Multiplicative hashing works well for the same reason that linear congruential multipliers generate apparently random numbers—it's like generating a pseudo-random number with the hashcode as the seed. The multiplier a should be large and its binary representation should be a "random" mix of 1's and 0's. Multiplicative hashing is cheaper than modular hashing because multiplication is usually considerably faster than division (or mod). It also works well with a bucket array of size m=2^p, which is convenient.
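When m = 2^p, the fixed-point computation reduces to a multiplication and a shift: with 32-bit words, the bucket index is the top p bits of the low 32 bits of ka, i.e. q = 32 - p. The sketch below assumes 32-bit arithmetic via Word32 and reuses the multiplier from the code above; multHash is a hypothetical name.

```sml
val a : Word32.word = 0wx678DDE6F

(* Bucket index for a table of 2^p buckets (1 <= p <= 32):
   the product wraps modulo 2^32 and the top p bits are kept. *)
fun multHash (p : Word.word) (k : Word32.word) : int =
  Word32.toInt (Word32.>> (k * a, 0w32 - p))

val indices = map (multHash 0w4) [0w1, 0w2, 0w3, 0w4]
(* indices = [6, 12, 3, 9]: consecutive keys are scattered across the 16 buckets *)
```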

In the fixed-point version, the division by 2^q is crucial. The common mistake when doing multiplicative hashing is to forget to do it, and in fact you can find web pages highly ranked by Google that explain multiplicative hashing without this step. Without this division, there is little point to multiplying by a, because ka mod m = (k mod m) * (a mod m) mod m. This is no better than modular hashing with a modulus of m, and quite possibly worse.
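The difference is easy to see on keys that are all multiples of m. The sketch below assumes 32-bit words, m = 16 buckets (so q = 28) and the multiplier from the earlier code; the names wrong and right are ours.

```sml
val a : Word32.word = 0wx678DDE6F
val m : Word32.word = 0w16

(* Division by 2^q forgotten: only the low bits of k influence the result. *)
fun wrong (k : Word32.word) : int = Word32.toInt ((k * a) mod m)

(* With the division: the top 4 bits of the wrapped product are used. *)
fun right (k : Word32.word) : int = Word32.toInt (Word32.>> (k * a, 0w28))

val keys : Word32.word list = [0wx10, 0wx20, 0wx30, 0wx40]
val bad  = map wrong keys   (* [0, 0, 0, 0] *)
val good = map right keys   (* [7, 15, 6, 14] *)
```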

Cyclic redundancy checks (CRCs)

For a longer stream of serialized key data, a cyclic redundancy check (CRC) makes a good, reasonably fast hash function. A CRC of a data stream is the remainder after performing a long division of the data (treated as a large binary number), but using exclusive or instead of subtraction at each long division step. This corresponds to computing a remainder in the ring of polynomials with binary coefficients. CRCs can be computed very quickly in specialized hardware. Fast software CRC algorithms rely on accessing precomputed tables of data.
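As an illustration, here is a bit-at-a-time sketch of the widely used CRC-32 variant (reflected polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF). The choice of this particular CRC is our assumption; the text does not pick one, and a fast implementation would use a precomputed table instead of the inner loop.

```sml
fun crc32 (s : string) : Word32.word =
  let
    val poly : Word32.word = 0wxEDB88320
    (* Perform n division steps: shift right, XOR in the polynomial
       whenever the bit shifted out was 1. *)
    fun step (crc : Word32.word, 0) = crc
      | step (crc, n) =
          let val shifted = Word32.>> (crc, 0w1)
          in
            step (if Word32.andb (crc, 0w1) = 0w1
                  then Word32.xorb (shifted, poly)
                  else shifted,
                  n - 1)
          end
    (* Feed one byte into the running remainder. *)
    fun byte (c, crc) =
      step (Word32.xorb (crc, Word32.fromInt (Char.ord c)), 8)
  in
    Word32.notb (foldl byte (0wxFFFFFFFF : Word32.word) (String.explode s))
  end

val check = crc32 "123456789"   (* 0wxCBF43926, the standard check value *)
```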

Cryptographic hash functions

Sometimes software systems are used by adversaries who might try to pick keys that collide in the hash function, thereby making the system have poor performance. Cryptographic hash functions are hash functions that try to make it computationally infeasible to invert them: if you know h(x), there is no way to compute x that is asymptotically faster than just trying all possible values and seeing which one hashes to the right result. Usually these functions also try to make it hard to find different values of x that cause collisions. Examples of cryptographic hash functions are MD5 and SHA-1. Practical collision attacks are known on MD5, and by now on SHA-1 as well, so neither should be relied on where an adversary chooses the keys. MD5 is faster than SHA-1 and remains adequate for generating hash table indices when the keys are not adversarial.

Precomputing hash codes

High-quality hash functions can be expensive. If the same values are being hashed repeatedly, one trick is to precompute their hash codes and store them with the value. Hash tables can also store the full hash codes of values, which makes scanning down one bucket fast. In fact, if the hash code is long and the hash function is high-quality (e.g., 64+ bits of a properly constructed MD5 digest), two keys with the same hash code are almost certainly the same value. Your computer is then more likely to get a wrong answer from a cosmic ray hitting it than from a hash code collision.
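A minimal sketch of storing hash codes alongside values follows. The entry layout, the function names, and the simple string hash standing in for an expensive high-quality one are all assumptions made for the example.

```sml
(* An entry caches the full hash code next to its key. *)
type 'v entry = {hash : word, key : string, value : 'v}

(* Stand-in for an expensive, high-quality hash function. *)
fun hashString (s : string) : word =
  foldl (fn (c, h) => h * 0w31 + Word.fromInt (Char.ord c)) 0w0 (String.explode s)

(* The hash code is computed once, when the entry is created. *)
fun mkEntry (k : string, v : 'v) : 'v entry =
  {hash = hashString k, key = k, value = v}

(* Scan one bucket: compare the cheap cached hash codes first and
   fall back to the full key comparison only when they match. *)
fun lookup (bucket : 'v entry list) (k : string) : 'v option =
  let
    val h = hashString k
  in
    case List.find (fn (e : 'v entry) => #hash e = h andalso #key e = k) bucket of
        SOME e => SOME (#value e)
      | NONE => NONE
  end

val bucket = [mkEntry ("alice", 1), mkEntry ("bob", 2)]
val found = lookup bucket "bob"     (* SOME 2 *)
```

The cached codes also make resizing cheaper, since the table can redistribute entries without rehashing the keys.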

Memory management

(see slides)
