CS410, Summer 1998
Lecture 13 Outline
Dan Grossman

Goals:
 * Open Address Hashing

Pointers for chaining take half of the table space or more. That space could instead go toward more buckets, a clear gain in efficiency. So let's not use chaining. Instead, if we have a collision, let's just go to a different bucket. This is called open addressing (I don't know why). As long as we use the same sequence of buckets when inserting and searching for the same key, everything should work fine.

We notice immediately:
 * alpha < 1, else there isn't room for all the elements.
 * The buckets we move to must be a permutation of all the buckets. That is, we must eventually try every bucket if necessary. Otherwise alpha < 1 doesn't guarantee the scheme will work.

Insert: Go to a starting bucket. While the bucket is full, go to the next bucket in the permutation. When an empty bucket is reached, put the data there.

Lookup: Follow the same sequence of buckets. Check keys and return the right one. If we reach an empty bucket, the data isn't in the table.

Delete: Uh-oh -- a naive delete will screw up future lookups. If we delete an element, then a lookup for a different element might see an empty bucket and stop too soon! The solution is either:
 * Use chaining. This is probably the best choice if deletes are common. In many applications, they really aren't.
 * Use a zombie. That is, mark the bucket "no value, but keep looking". Future inserts can reuse this bucket, and lookups will effectively skip over any such zombie bucket.

To pick the starting bucket, use hashing like we did yesterday. Choosing the bucket permutation is the rest of the lecture. There are m! possible permutations. Ideally, a key would get any of them with equal probability. This is the assumption of "uniform hashing". We will not get anywhere close to fulfilling this assumption, but reality will save the day.

For open addressing, let the hash function actually be h(k, i), where k is the key and i is how many full buckets have already been seen. So h(k, 0) is what we did yesterday.
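The insert/lookup/delete operations above can be sketched concretely in Python. This is just an illustrative sketch, not the lecture's code: the names OpenTable and ZOMBIE are made up, the table is fixed-size (no resizing), and the default probe sequence is a simple h'(k) + i just to have a concrete permutation.

```python
ZOMBIE = object()  # marker: "no value, but keep looking"

class OpenTable:
    def __init__(self, m, probe=None):
        self.m = m
        self.slots = [None] * m  # each slot: None, ZOMBIE, or (key, val)
        # probe(k, i) should enumerate a permutation of the buckets;
        # default is h'(k) + i, purely for illustration
        self.probe = probe or (lambda k, i: (hash(k) + i) % m)

    def insert(self, key, val):
        for i in range(self.m):
            j = self.probe(key, i)
            s = self.slots[j]
            # empty or zombie buckets are reusable; same key is an update
            if s is None or s is ZOMBIE or s[0] == key:
                self.slots[j] = (key, val)
                return
        raise RuntimeError("table full; a real table would resize")

    def lookup(self, key):
        for i in range(self.m):
            j = self.probe(key, i)
            s = self.slots[j]
            if s is None:                       # truly empty: not present
                return None
            if s is not ZOMBIE and s[0] == key: # skip zombies, compare keys
                return s[1]
        return None

    def delete(self, key):
        for i in range(self.m):
            j = self.probe(key, i)
            s = self.slots[j]
            if s is None:
                return
            if s is not ZOMBIE and s[0] == key:
                self.slots[j] = ZOMBIE  # keep future lookups probing past
                return
```

Note how delete writes ZOMBIE rather than None, so a later lookup for a key further along the same probe sequence doesn't stop too soon.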
We use h as follows:

    lookup(k):
        i = 0
        while (true)
            // assume we resized if the table got full, and that h(k, .)
            // really gives a permutation, so this loop terminates
            j = h(k, i)
            if array[j] == null
                return NOT_FOUND       // empty bucket: k isn't in the table
            if array[j].key == k       // must compare keys -- the first full
                return array[j].val    // bucket may hold a colliding key
            i++

Let h'(k) be a good function from yesterday. The simplest choice for h is to just keep cycling through buckets until an empty one is found. That is,

    h(k, i) = (h'(k) + i) mod m

This is called linear probing. It is a horrible idea... Of the m! permutations, we use only m of them -- the whole sequence is determined by the first bucket. So keys that initially collide always collide. But we had that problem with chaining too. The real problem is that all m permutations are really the same except for their starting points. As a result, "long chains get longer". To see why, suppose we currently have a chain of length l. Then the chain gets longer if _any_ of the buckets in it is the start bucket for the next insertion. Hence the probability of making a chain longer is at least l/m. That is, long chains get longer even when h'(k) is as good as possible. This is called "primary clustering".

It is not necessary that all m permutations have common subsequences. For example, with m == 4, I made up the following:

    0 -> 2 -> 1 -> 3
    1 -> 0 -> 2 -> 3
    2 -> 3 -> 0 -> 1
    3 -> 1 -> 2 -> 0

It still uses only m sequences, but primary clustering doesn't happen.

I made the sequences above up. A more general approach is "quadratic probing". It is defined by:

    h(k, i) = (h'(k) + c1*i + c2*i^2) mod m

where c1 and c2 are constants chosen such that h(k, i) defines a permutation of all m buckets. It should be clear that there are m permutations (still, all that changes is the start bucket -- h'(k)) and that the permutations are not all really the same. It isn't clear how to pick c1 and c2, though. Here's one choice that works: let m be a power of 2 and let c1 == c2 == 1/2. What this is really doing is going ahead i buckets on the ith iteration. So if we started at bucket 0, we would next check buckets 1, 3, 6, 10, 15 (all mod m, of course).
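We can sanity-check that c1 == c2 == 1/2 with m a power of 2 really yields a permutation. A quick sketch (the function name probe_sequence is made up; the offset c1*i + c2*i^2 simplifies to (i + i^2)/2, which is always an integer):

```python
def probe_sequence(start, m):
    """Buckets visited from `start` under h(k, i) = start + (i + i^2)/2 mod m."""
    return [(start + (i + i * i) // 2) % m for i in range(m)]

m = 8  # must be a power of 2 for this choice of c1, c2
for start in range(m):
    seq = probe_sequence(start, m)
    # every start bucket yields a permutation of all m buckets
    assert sorted(seq) == list(range(m)), "not a permutation!"

print(probe_sequence(0, m))  # -> [0, 1, 3, 6, 2, 7, 5, 4]
```

Note the offsets 0, 1, 3, 6, 10, 15, ... wrap around mod 8, matching "go ahead i buckets on the ith iteration".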
For m == 4, the set of permutations looks like:

    0 -> 1 -> 3 -> 2
    1 -> 2 -> 0 -> 3
    2 -> 3 -> 1 -> 0
    3 -> 0 -> 2 -> 1

We eliminated primary clustering, but we still have "secondary clustering". This refers to elements that collide initially always colliding. Chaining has this too, although for some reason it's not called that. We can avoid even that if we remember that we can choose the next bucket using both i _and_ k. Then different k's that started together wouldn't have to stay together...

Our last method is called "double hashing". It is defined by:

    h(k, i) = (h'(k) + i*h''(k)) mod m

where h' and h'' are two _different_ hash functions. (If they're the same, then we're basically back to linear probing.) There are now m^2 possible permutations -- any start bucket and any jump amount. That's much better than m, but still nowhere near m!. It works well in practice, though.

The catch is making sure any jump value induces a permutation. What we need is that the greatest common divisor of h''(k) and m is 1. Mathematicians can prove this is sufficient -- it should be fairly intuitive if you try a few examples. Here are two simple ways to ensure this property of h''(k) and m:

 * Make m a power of 2, and make sure h''(k) is always odd (perhaps by adding 1 if necessary).
 * Make m prime (since h''(k) produces a number < m).

Also, h''(k) cannot be zero (look at the definition of h(k, i) to see why).

None of the methods we looked at achieves uniform hashing. But hypothetically, assume all m! permutations are equally likely. Then how long do we expect chains to be? Remember alpha is the load factor. When doing an unsuccessful lookup, the first bucket we look at is full with probability alpha. The probability that the first two are full is at most alpha^2, the first three at most alpha^3, and so on. The expected number of buckets examined is therefore at most

    1 + alpha + alpha^2 + alpha^3 + ... = 1/(1 - alpha)

So for a smaller load factor, we expect chains to be shorter.
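The gcd condition is easy to see by example. A small sketch (double_hash_sequence is a made-up name): with m == 8, a jump of 3 (gcd 1) visits every bucket, while a jump of 2 (gcd 2) cycles through only the even buckets and never finds the odd ones.

```python
from math import gcd

def double_hash_sequence(start, jump, m):
    """Buckets visited by h(k, i) = (start + i*jump) mod m, for i = 0 .. m-1."""
    return [(start + i * jump) % m for i in range(m)]

m = 8
good = double_hash_sequence(0, 3, m)  # gcd(3, 8) == 1: a permutation
bad = double_hash_sequence(0, 2, m)   # gcd(2, 8) == 2: not a permutation

assert gcd(3, m) == 1 and sorted(good) == list(range(m))
assert gcd(2, m) != 1 and sorted(bad) != list(range(m))
print(bad)  # -> [0, 2, 4, 6, 0, 2, 4, 6]: only the even buckets, forever
```

This also shows why h''(k) == 0 is fatal: a jump of 0 never leaves the start bucket at all.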
This should be intuitive and reassuring. And it gives us a formula for making our time-space trade-off decision of picking alpha.
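As a sketch of that trade-off, plugging a few load factors into the 1/(1 - alpha) bound gives expected probe counts for an unsuccessful lookup:

```python
# Expected probes for an unsuccessful lookup under uniform hashing,
# bounded by 1 / (1 - alpha).
for alpha in (0.50, 0.75, 0.90):
    print(f"alpha = {alpha:.2f}: about {1 / (1 - alpha):.1f} probes")
# alpha = 0.50: about 2.0 probes
# alpha = 0.75: about 4.0 probes
# alpha = 0.90: about 10.0 probes
```

So a half-full table costs about 2 probes per miss, while a 90%-full table costs about 10 -- the space saved by a high alpha is paid for in probe time.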