Suppose we want a data structure to implement either a mutable set of
elements (with operations like contains, add, and remove that take an element
as an argument) or a mutable map from keys to values (with operations like get,
put, and remove that take a key or a (key, value) pair as an argument). A mutable map is also known
as an associative array. We've now seen a few data structures that could
be used for both of these implementation tasks.
We consider the problem of implementing sets and maps together because most data structures that can implement a set can also implement a map, and vice versa. A set of key–value pairs can act as a map, provided we compare key–value pairs by comparing only the keys. Alternatively, we can view the transformation from a set to a map as starting with a data structure that implements a set of keys and then adding an associated value to each node that stores a key.
Here are the data structures we've seen so far, with the asymptotic complexities for each of their operations:
| Data structure | lookup (contains/get) | add/put | remove |
|---|---|---|---|
| Array | O(n) | O(1) | O(n) |
| Sorted array | O(lg n) | O(n) | O(n) |
| Linked list | O(n) | O(1) | O(n) |
| Search tree | O(lg n) | O(lg n) | O(lg n) |
Naturally, we might wonder if there is a data structure that can do better. And it turns out that there is: the hash table, one of the best and most useful data structures there is—when used correctly.
Many variations on hash tables have been developed. We'll explore the most common ones, building up in steps.
While arrays make a slow data structure when you don't know what index to look at, all of their operations are very fast when you do. This is the insight behind the direct address table. Suppose that for each element that we want to store in the data structure, we can determine a unique integer index in the range 0..m–1. That is, we need an injective function that maps elements (or keys) to integers in the range. Then we can use the indices produced by the function to decide at which index to store the elements in an array of size m.
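As a minimal sketch (the class name is made up for illustration, and keys are assumed to be integers already in the range 0..m−1), a direct address table can be as simple as a boolean array:

```java
// Direct-address table sketch: each key IS its own array index (keys in 0..m-1).
// Every operation is a single array access, so all of them run in O(1) time.
class DirectAddressTable {
    private final boolean[] present;

    DirectAddressTable(int m) {
        present = new boolean[m];
    }

    void add(int key)         { present[key] = true; }
    void remove(int key)      { present[key] = false; }
    boolean contains(int key) { return present[key]; }
}
```

To store elements rather than mere membership bits, the boolean array would become an array of element references, but the O(1) access pattern is the same.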
For example, suppose we are maintaining a collection of objects representing houses on the same street. We can use the street address as the index into a direct address table. Not every possible street address will be used, so some array entries will be empty. This is not a problem as long as there are not too many empty entries. However, it is often hard to come up with an injective function that does not require many empty entries.
For example, suppose we are maintaining a collection of employees along with their social security numbers, and we want to look them up by social security number. Using the nine-digit social security number as the index into a direct address table means we need an array of one billion elements, almost all of which are likely to be unused. Even assuming our computer has enough memory to store such a sparse array, it will be a waste of memory. Furthermore, on most computer hardware, the use of caches means that accesses to large arrays are actually significantly slower than accesses to small arrays, sometimes by two orders of magnitude.
Instead of requiring that each key be mapped to a unique index, we allow collisions, in which two keys map to the same index. To avoid having many collisions, this mapping is performed by a hash function that maps the key in a reproducible but seemingly random way to a hash value that is a legal array index. If the hash function is good, it will distribute the values fairly evenly, so that collisions occur infrequently, as if at random.
For example, suppose we are using an array with 13 entries and
our keys are social security numbers, expressed as long values.
Then we might use modular hashing, in which the array index is computed
as key % 13. This is not a very random hash function, but is
likely to be good enough unless there is an adversary purposely trying to produce
collisions.
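The modular-hashing computation from this example can be written directly (the class name is hypothetical):

```java
// Modular hashing sketch: the bucket index is the key mod 13 (the array size).
class ModularHash {
    static final int M = 13;  // number of array entries

    // Maps a social security number to an index in [0, M). Assumes ssn >= 0.
    static int bucket(long ssn) {
        return (int) (ssn % M);
    }
}
```

Note that any two keys that differ by a multiple of 13 collide, which is why this function is only adequate against non-adversarial inputs.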
There are two main ideas for how to deal with collisions. The best way is usually chaining: each array entry corresponds to a bucket containing a mutable set of elements. (Confusingly, this approach is sometimes known as closed addressing or open hashing.) Typically, the bucket is implemented as a linked list, and each array entry (if nonempty) contains a pointer to the head of the list. The list contains all keys that have been added to the hash table that hash to that array index.
To check whether an element is in the hash table, the key is first hashed to find the correct bucket to look in, then the linked list is scanned for the desired element. If the list is short, this scan is very quick.
An element is added or removed by hashing it to find the correct bucket. Then, the bucket is checked to see if the element is there, and finally the element is added or removed appropriately from the bucket in the usual way for linked lists.
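The chaining scheme described above can be sketched as follows, assuming String elements, a fixed bucket count, and Java's built-in hashCode(); this is illustrative, not production code:

```java
import java.util.LinkedList;

// A minimal chained hash set: each array entry is a bucket (linked list)
// holding all elements whose hash maps to that index.
class ChainedHashSet {
    private final LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedHashSet(int m) {
        buckets = new LinkedList[m];
        for (int i = 0; i < m; i++) buckets[i] = new LinkedList<>();
    }

    // Math.floorMod avoids a negative index when hashCode() is negative.
    private int bucketOf(String key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    boolean contains(String key) {
        return buckets[bucketOf(key)].contains(key);
    }

    void add(String key) {
        LinkedList<String> b = buckets[bucketOf(key)];
        if (!b.contains(key)) b.add(key);   // scan the bucket, then append
    }

    void remove(String key) {
        buckets[bucketOf(key)].remove(key);
    }
}
```

Each operation hashes once and then scans a single bucket, so the work is proportional to the bucket length, not to the total number of elements.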
Another approach to collision resolution is probing. (Equally confusingly, this technique is also known as open addressing or closed hashing.) Rather than put colliding elements in a linked list, all elements are stored in the array itself. When adding a new element to the hash table creates a collision, the hash table finds somewhere else in the array to put it. The simple way to find an empty index is to search ahead through the array indices with a fixed stride (usually 1) for the next unused array entry, wrapping modulo the length of the array if necessary. This strategy is called linear probing. It tends to produce a lot of clustering of elements, leading to poor performance. A better strategy is to use a second hash function to compute the probing interval; this strategy is called double hashing.
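A linear-probing sketch with stride 1 (class name hypothetical; removal is omitted because a correct remove needs tombstone markers, and add assumes the table is not full):

```java
// Open-addressing (probing) sketch: all elements live in the array itself.
// On collision, search ahead with stride 1, wrapping around the array.
class ProbingHashSet {
    private final String[] slots;

    ProbingHashSet(int m) { slots = new String[m]; }

    private int home(String key) {
        return Math.floorMod(key.hashCode(), slots.length);
    }

    // Assumes the table is not full; otherwise this loop would not terminate.
    void add(String key) {
        int i = home(key);
        while (slots[i] != null && !slots[i].equals(key))
            i = (i + 1) % slots.length;   // linear probe: next slot, wrapping
        slots[i] = key;
    }

    boolean contains(String key) {
        int i = home(key);
        while (slots[i] != null) {
            if (slots[i].equals(key)) return true;
            i = (i + 1) % slots.length;
        }
        return false;   // an empty slot ends the probe sequence
    }
}
```

Double hashing would replace the fixed stride of 1 with an interval computed by a second hash function of the key.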
With either approach, the time required for the hash table operations grows as the hash table fills up. However, the performance of chaining degrades more gracefully than that of probing, and chaining is usually faster than probing even when the hash table is not nearly full. Therefore chaining is usually preferred over probing.
A recently popular variant of closed hashing is cuckoo hashing, in which two hash functions are used. Each element is stored at one of the two locations computed by these hash functions, so at most two table locations must be consulted in order to determine whether the element is present. If both possible locations are occupied, the newly added element displaces the element that was there, and this element is then re-added to the table. In general, a chain of displacements occurs.
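A cuckoo-hashing sketch is below. The class name, the displacement limit, and the second hash function (an arbitrary multiplier constant scrambling the built-in hashCode()) are all illustrative choices, not part of any standard:

```java
// Cuckoo hashing sketch: two tables, two hash functions. An element lives at
// t1[h1(k)] or t2[h2(k)], so a lookup probes at most two slots.
class CuckooHashSet {
    private static final int MAX_DISPLACEMENTS = 32;
    private final String[] t1, t2;

    CuckooHashSet(int m) {
        t1 = new String[m];
        t2 = new String[m];
    }

    private int h1(String k) { return Math.floorMod(k.hashCode(), t1.length); }
    // Illustrative second hash: scramble the hash code with a multiplier.
    private int h2(String k) { return Math.floorMod(k.hashCode() * 0x9E3779B9, t2.length); }

    boolean contains(String k) {
        return k.equals(t1[h1(k)]) || k.equals(t2[h2(k)]);
    }

    void add(String k) {
        if (contains(k)) return;
        String cur = k;
        for (int i = 0; i < MAX_DISPLACEMENTS; i++) {
            String evicted = t1[h1(cur)];   // displace whatever occupies slot 1
            t1[h1(cur)] = cur;
            if (evicted == null) return;
            cur = evicted;                  // re-add the displaced element...
            evicted = t2[h2(cur)];          // ...to its slot in the other table
            t2[h2(cur)] = cur;
            if (evicted == null) return;
            cur = evicted;
        }
        // A real implementation would pick new hash functions and rehash here.
        throw new IllegalStateException("displacement cycle");
    }
}
```

If the chain of displacements runs too long (a cycle), real implementations rehash the whole table with fresh hash functions.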
Suppose we are using a chained hash table with m buckets, and the number of
elements in the hash table is n. The average number of elements per bucket
is n/m, which is called the load factor of the hash table, denoted α.
When we search for an element that is not in the hash table, the expected
length of the linked list traversed is α. Since there is always at least some
cost of hashing, the cost of hash table operations with a good hash
function is, on average, O(1 + α). If we can ensure that the load factor α never exceeds
some fixed value αmax, then all operations will take O(1 +
αmax) = O(1) time on average.
In practice, we will get the best performance out of hash tables when α is
within a narrow range, from approximately 1/2 to 2. If α is less than 1/2, the
bucket array is becoming sparse and a smaller array is likely to give better
performance. If α is greater than 2, the cost of traversing the linked lists
limits performance.
One way to hit the desired range for α is to allocate a bucket array of just
the right size for the number of elements that are being added to it. In general,
however, it's hard to know ahead of time what this size will be, and in any case,
the number of elements in the hash table may need to change over time.
Since we cannot predict how big to make the bucket array ahead of time, we can use a resizable array to dynamically adjust the size when necessary. Instead of representing the hash table as a bucket array, we introduce a hash table object that maintains a pointer to the current bucket array and the number of elements that are currently in the hash table.
If adding an element would cause α to exceed αmax,
the hash table generates a new bucket array whose size is a multiple of
the current size, usually twice the size. This means the hash function must change,
so all the elements must be rehashed into the new bucket array. Hash functions
are typically designed so they take the array size m as a parameter,
so this parameter just needs to be changed.
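The resize-and-rehash procedure can be sketched like this (hypothetical class, String elements, αmax = 2, doubling growth):

```java
import java.util.LinkedList;

// Chained hash set that doubles its bucket array when the load factor
// would exceed ALPHA_MAX, rehashing every element into the new array.
class ResizableHashSet {
    private static final double ALPHA_MAX = 2.0;
    private LinkedList<String>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    ResizableHashSet() {
        buckets = new LinkedList[4];
        for (int i = 0; i < buckets.length; i++) buckets[i] = new LinkedList<>();
    }

    // The hash function takes the current array size m as a parameter.
    private int bucketOf(String key, int m) {
        return Math.floorMod(key.hashCode(), m);
    }

    boolean contains(String key) {
        return buckets[bucketOf(key, buckets.length)].contains(key);
    }

    void add(String key) {
        if (contains(key)) return;
        if ((size + 1.0) / buckets.length > ALPHA_MAX)
            resize(2 * buckets.length);   // double: keeps adds amortized O(1)
        buckets[bucketOf(key, buckets.length)].add(key);
        size++;
    }

    // O(n): every element is rehashed with the new array size.
    @SuppressWarnings("unchecked")
    private void resize(int newM) {
        LinkedList<String>[] old = buckets;
        buckets = new LinkedList[newM];
        for (int i = 0; i < newM; i++) buckets[i] = new LinkedList<>();
        for (LinkedList<String> b : old)
            for (String key : b)
                buckets[bucketOf(key, newM)].add(key);
    }

    int size()     { return size; }
    int capacity() { return buckets.length; }
}
```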
With the above procedure, some add operations will cause all the
elements in the hash table to be rehashed. We have to search the
array for the nonempty buckets, hash all the elements, and add them
to the new table. This will take time O(n) (provided n is at least a
constant fraction of the size of the array). For a large hash table, this may take enough time that it causes problems for the
program. Perhaps surprisingly, however, the expected cost per operation is still
O(1). In particular, any sequence of n operations on the hash table always
takes expected time O(n), or O(1) per operation. Therefore we say that the amortized
asymptotic complexity of hash table operations is O(1).
To see why this is true, consider a hash table with αmax = 1. Starting
with a table of size 1, say we add a sequence of n = 2^j elements.
The hash table resizes after 1 add, then again after 2 more adds, then again after 4 more adds, etc.
Not counting the array resizing, the cost of adding the n elements is O(n) on average.
The cost of all the resizings is (a constant multiple of) 1 + 2 + 4 + 8 + ⋯ + 2^j = 2^(j+1) − 1 = 2n − 1, which is O(n).
Notice that it is crucial that the array size grow geometrically (doubling). It may be tempting to grow the array by a fixed increment (e.g., 100 elements at a time), but this causes elements to be rehashed O(n) times each on average, resulting in O(n²) total insertion time, or an amortized complexity of O(n) per operation.
The standard Java libraries offer multiple implementations of hash tables. The
class HashSet<T> implements a mutable set abstraction: a set
of elements of type T. The class HashMap<K,V>
implements a mutable map from keys of type K to values of type V. There is also
a second, older mutable map implementation, Hashtable<K,V>,
but it should be avoided; the HashMap class is faster and better
designed.
All three of these hash table implementations rely on objects having a
hashCode() method that is used to compute the hash of an object.
The hashCode() method as defined by Java is not a hash function.
As shown in the figure, it generates the input to an internal hash function
that is provided by the hash table and that
operates on integers. Therefore, the hash function being used in effect
is the composition of the two methods: h ○ hashCode.
The design of the Java collection classes is intended to relieve the client of
the burden of implementing a high-quality hash function. The use of an internal
hash function makes it easier to implement
hashCode() in such a way that the composed hash function
h ○ hashCode is good enough.
However, a poorly designed hashCode() method can still cause the hash table
implementation to fail to work correctly or to exhibit poor performance. There are
two main considerations:
For the hash table to work, the hashCode() method must be
consistent with the equals() method, because equals()
is used by the hash table to determine when it has found the right element or key.
In fact, it is a general class invariant of Java classes that if two objects
are equal according to equals(), then their hashCode() values
must be the same. Many classes in the Java system library do a quick check for equality
of objects by first comparing their hash values, returning false if the hash
values differ. If the invariant did not hold, this check would produce false negatives.
If you ever write a class that overrides the equals() method of Object,
be sure to do it in such a way that this invariant is maintained.
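A typical way to maintain the invariant is to compute the hash code from exactly the fields that equals() compares, for example with java.util.Objects.hash. The class below is a made-up illustration:

```java
import java.util.Objects;

// A class that overrides equals() and therefore must override hashCode()
// so that equal objects always have equal hash codes.
class Point {
    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    // Computed from the same fields equals() compares, so the
    // invariant equals(a, b) ⇒ a.hashCode() == b.hashCode() holds.
    @Override public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

Without the hashCode() override, two equal Points would usually hash to different buckets, and a HashSet would fail to find an element it contains.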
For good performance, hashCode() should distribute keys as uniformly
as possible to avoid collisions. Hash functions that have this property are said
to have good diffusion. This goal implies that the hash
code should be computed using all of the information in the object that
determines equality. If some of the information that distinguishes two objects
does not affect the hash code, objects will always collide when they differ
only with respect to that ignored information.
Java provides a default implementation of hashCode(), which
returns the memory address of the object. For mutable objects, this implementation
satisfies the two conditions above. It is usually the right choice,
because two mutable objects are only really equal if they are
the same object. On the other hand, immutable objects such as
Strings and Integers have a notion of equality that
ignores the object's memory address, so these classes override
hashCode().
Java's collection classes also override hashCode() to look at
the current contents of the collection. This way of computing the hash code is
dangerous, because mutating the collection used as the key will change its hash
code, breaking the class invariant of the hash table. Any collection being
used as a key must not be mutated.
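This failure mode is easy to demonstrate with the standard collections (the class and method names below are made up for the demonstration):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Shows why a collection used as a key must not be mutated: the mutation
// changes its hash code, so the hash table looks in the wrong bucket.
class MutableKeyDemo {
    static boolean lostAfterMutation() {
        HashSet<List<Integer>> set = new HashSet<>();
        List<Integer> key = new ArrayList<>();
        key.add(1);
        set.add(key);     // stored in the bucket for hashCode() of [1]
        key.add(2);       // mutation: hashCode() is now that of [1, 2]
        return !set.contains(key);   // lookup probes the wrong bucket
    }
}
```

The element is still physically in the table, but it is effectively lost: lookups, removes, and re-adds all hash to a different bucket.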
If the client provides hash codes that are easily distinguishable from random, the internal hash function may be needed to restore good diffusion with a final mixing step. The goal of diffusion is to make the hash look random and thus to avoid collisions and clustering. The internal hash function is “good enough” if the client computation that generates keys is not more likely than chance to cause collisions. For most client computations, the implementation doesn't have to work very hard to achieve this goal.
Assuming the Java design with integer hash codes, we can use an integer hash function to provide good diffusion. There are two standard approaches, modular hashing and multiplicative hashing.
With modular hashing, the hash function is simply h(k) = k mod m for some modulus m, which is typically the number of buckets. This hash function is easy to compute quickly when we have an integer hash code. Some values of m tend to produce poor results, though; in particular, if m is a power of two (that is, m = 2^j for some j), then h(k) is just the j lowest-order bits of k. Throwing away the rest of the bits works particularly poorly when the hash code of an object is its memory address, as is the case for Java. Because addresses are aligned to word boundaries, two or more of the lowest-order bits of an object address will be zero, with the result that most buckets are not used! More generally, we want a hash function that uses all the bits of the key so that any change in the key is likely to change the bucket it maps to. In practice, primes not too close to powers of 2 work well as moduli.
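The power-of-two pitfall is easy to see concretely. With m = 8, keys that are multiples of 8 (such as word-aligned addresses) all share the same three low-order bits and therefore all land in bucket 0:

```java
// With m a power of two, k mod m keeps only the low-order bits of k,
// so aligned "addresses" (multiples of 8) all collide in bucket 0.
class PowerOfTwoPitfall {
    static int bucket(int key, int m) {
        return Math.floorMod(key, m);
    }
}
```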
A better alternative is multiplicative hashing, which is defined as h(k) = ⌊m * frac(kA)⌋, where A is a constant between 0 and 1 (e.g., Knuth recommends φ−1 = 0.61803...), and the function frac gives the fractional part of a number (that is, frac(x) = x − ⌊x⌋). This formula uses the fractional part of the product kA to choose the bucket.
However, the formula above is not the best way to evaluate the hash function. If we choose m to be a power of two, m = 2^q, we can scale up the multiplier A by 2^31, and then evaluate the hash function as follows using 64-bit long values, obtaining a q-bit result in [0,m):
h(k) = (kA & 0x7FFFFFFF) >> (31-q)
Implemented properly, multiplicative hashing is faster and higher-quality than modular hashing. Intuitively, multiplying together two large numbers diffuses information from each of them into the product, especially around the middle bits of the product. The formula above picks out q bits from the middle of the product kA.
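A sketch of this evaluation in Java, using Knuth's suggested multiplier φ−1 scaled by 2^31 (the class name is made up):

```java
// Multiplicative hashing sketch for m = 2^q buckets.
// A = floor((phi - 1) * 2^31), i.e., Knuth's multiplier scaled to an integer.
class MultiplicativeHash {
    private static final long A = (long) ((Math.sqrt(5) - 1) / 2 * (1L << 31));

    // Returns a q-bit bucket index in [0, 2^q); valid for 1 <= q <= 31.
    static int hash(int k, int q) {
        long product = (long) k * A;                      // full 64-bit product
        return (int) ((product & 0x7FFFFFFFL) >>> (31 - q)); // middle q bits
    }
}
```

The mask keeps the low 31 bits of the product and the shift then selects the top q of those, matching the formula above.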
Unfortunately, multiplicative hashing is often implemented incorrectly and has unfairly acquired a bad reputation in some quarters because of it. The most common mistake is to implement it as (kA mod m). By the properties of modular arithmetic, kA mod m = ((k mod m) × (A mod m) mod m). Therefore, this broken implementation merely shuffles the buckets rather than providing real diffusion.
For good performance, the goal of the hash table is that collisions should occur as if at random. Therefore, whether collisions occur depends to some extent on the keys being generated by the client. If the client is an adversary trying to produce collisions, the hash table must work harder. Many early web sites implemented using the Perl programming language were subject to denial-of-service attacks that exploited the ability to cause rampant hash table collisions. Attackers used their knowledge of Perl's hash function on strings to craft strings that collided, effectively turning Perl's associative arrays into linked lists. Resizing the array of buckets didn't help, because the collisions happened in the space of hash codes.
An alternative way to design a hash table is to give the job of
providing a high-quality hash function entirely to the client code:
the hash codes themselves must look random. This approach puts more
of a burden on the client but avoids wasted computation when the
client is providing a high-quality hash function already. In the
presence of keys generated by an adversary, the client should already
be providing a hash code that appears random (and ideally one with at
least 64 bits), because otherwise the adversary can engineer hash code
collisions. For example, it is possible to choose strings such that
Java's String.hashCode() produces collisions.
To produce hashes resistant to an adversary, a cryptographic hash
function should be used. The message digest algorithms MD5, SHA-1,
and SHA-2 are good choices whose security increases (and performance
decreases) in that order. They are available in Java through the class
java.security.MessageDigest. Viewing the data to be hashed as
a string or byte array s, the value MD5(R + s) mod m is a
cryptographic hash function offering a good balance between security and
performance. MD5 generates 128 bits of output, so if m = 2^j,
this formula amounts to picking j bits from the MD5 output. The value
R is the initialization vector. It should be randomly generated
when the program starts using a high-entropy input source such as the
class java.security.SecureRandom. The initialization vector
prevents the adversary from testing possible values of s ahead of
time. For very long-running programs, it is also prudent to proactively
refresh R periodically, though this requires rehashing all hash tables
that depend on it.
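A sketch of this keyed construction using java.security.MessageDigest and SecureRandom (the class name and the choice of taking j bits from the first four digest bytes are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Sketch of a keyed cryptographic hash: j bits of MD5(R + s), where R is a
// random initialization vector generated once at startup.
class CryptoHash {
    private static final byte[] R = new byte[16];
    static { new SecureRandom().nextBytes(R); }

    // Hashes s into [0, 2^j); valid for 1 <= j <= 31.
    static int hash(String s, int j) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(R);                                   // prepend the IV
            byte[] d = md.digest(s.getBytes(StandardCharsets.UTF_8));
            // Pack the first four digest bytes into an int, keep the top j bits.
            int bits = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                     | ((d[2] & 0xFF) << 8)  |  (d[3] & 0xFF);
            return bits >>> (32 - j);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is required to be supported", e);
        }
    }
}
```

Because R is unknown to the adversary, collisions cannot be precomputed; refreshing R means all dependent hash tables must be rebuilt.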
High-quality hash functions can be expensive. If the same values are being hashed repeatedly, one trick is to precompute their hash codes and store them with the value. Hash tables can also store the full hash codes of values, which makes scanning down one bucket fast; there is no need to do a full equality test on the keys if their hash codes don't match. In fact, if the hash code is long and the hash function is cryptographically strong (e.g., 64+ bits of a properly constructed MD5 digest), two keys with the same hash code are almost certainly the same value. Your computer is then more likely to get a wrong answer from a cosmic ray hitting it than from a collision in random 64-bit data.
Precomputing and storing hash codes is an example of a space-time tradeoff, in which we speed up computation at the cost of using extra memory.
When the distribution of keys into buckets is not random, we say that the hash table exhibits clustering. If you care about performance, it's a good idea to test your hash function to make sure it does not exhibit clustering. With any hash function, it is possible to generate data that cause it to behave poorly, but a good hash function will make this unlikely.
A good way to determine whether your hash function is working well is to measure clustering. If bucket i contains xᵢ elements, then a good measure of clustering is the following:

C = (m/(n−1)) (∑ᵢ xᵢ²/n − 1)
A uniform hash function produces clustering C near 1.0 with high probability. A clustering measure C that is greater than one means that clustering will slow down the performance of the hash table by approximately a factor of C. For example, if m=n and all elements are hashed into one bucket, the clustering measure evaluates to n. If the hash function is perfect and every element lands in its own bucket, the clustering measure will be 0. If the clustering measure is less than 1.0, the hash function is spreading elements out more evenly than a random hash function would; not something to count on happening!
The reason the clustering measure works is because it is based on an estimate of the variance of the distribution of bucket sizes. If clustering is occurring, some buckets will have more elements than they should, and some will have fewer. So there will be a wider range of bucket sizes than one would expect from a random hash function.
Note that it's not necessary to compute the sum of squares of all bucket lengths; picking enough buckets so that enough keys are counted (say, at least 100) is good enough.
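A sketch of computing this estimate from observed bucket sizes, using the measure C = (m/(n−1))(∑ᵢ xᵢ²/n − 1), which has the properties described above (the class name is made up):

```java
// Clustering estimate from bucket sizes: C = (m/(n-1)) * (sum(x_i^2)/n - 1),
// where m is the number of buckets sampled and n the number of elements seen.
class Clustering {
    // Assumes n >= 2; pass the sizes of a sample of buckets.
    static double measure(int[] bucketSizes) {
        int m = bucketSizes.length;
        long n = 0, sumSquares = 0;
        for (int x : bucketSizes) {
            n += x;
            sumSquares += (long) x * x;
        }
        return ((double) m / (n - 1)) * ((double) sumSquares / n - 1);
    }
}
```

With m = n = 5 and all elements in one bucket the measure is 5 (that is, n); with every element in its own bucket it is 0, as the text predicts.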
Unfortunately, most hash table implementations, including those in the Java Collections Framework, do not give the client a way to measure clustering. Clients can't easily tell whether the hash function is performing well. Hopefully, future hash table designers will provide some clustering estimation as part of the interface.