CS410, Summer 1998
Lecture 12 Outline
Dan Grossman

Goals:
* Good ways to transform data to integers for purposes of hashing
* Using Java's Hashtable class
* Good hashing functions

Reading: CLR 12.3

Recall our framework: we transform a key to an integer and then hash to a bucket. The two stages are commonly conflated, with the first stage called hashing as well.

Transforming data to integers

Our running example will be strings. In fact, any kind of data can be handled the same way: just imagine putting the different fields together to make one long conceptual string. So we assume we have one field and treat it like a String; the whole discussion generalizes without difficulty.

Let's be clear about what we want:
* Different keys should transform to different integers.
* The transformation should be fast -- it could well be the bottleneck of our program.

Guidelines to follow:
* Use all of the key that is not constant, and use the different parts proportionally. For Strings, this means using all of the characters. Otherwise, data where everything collides is too common.
* Don't lose any more information than necessary.

Bad example: add up the characters in a String. That is, let a be 1, b be 2, c be 3, etc., and total up the string. This violates our second guideline horribly -- all ordering information is lost, because any permutation of the same letters goes to the same integer. It's also bad because lots of Strings collide; for example, abc, cba, f, daa, and caaa all total 6.

Good example: multiply by the size of the alphabet (say 26 or 52) before adding the next character. So acb becomes, for example, 26*(26*1 + 3) + 2. This avoids the earlier problems. If we make the multiplying factor a power of 2, this is equivalent to shifting the bits set by previous characters "over to the left" so that we do not change them when we add the next character.

Unfortunately, we will quickly have a number bigger than an int can hold. (In Java, this is 32 bits.
It is generally 16, 32, or 64 bits on current computers.) We could deal with big integers, but it is faster not to. If we just naively kept "shifting over," we would lose everything not contributed by the last few characters, violating our first guideline.

Here are a couple of tricks for using all of the characters to produce an int. We'll assume we're dealing with 32 bits, but the exact constant doesn't matter.

First, we could multiply by a large prime number after adding each character. Because multiplication of ints is implicitly mod 2^32, this really returns the low-order bits of the product. These should be well "mixed up" when multiplying by a prime. It takes number theory to show that primes are good, but you can see, for example, that a power of 2 would be bad -- it just puts zeros in the low-order bits!

Second, we could do something a tad more intuitive. Assume we have an alphabet of 26 letters, so the integer corresponding to a letter fits in 5 bits. We'll waste 2 of our 32 bits, so we can fit 6 characters' worth of information in an int by shifting over as before. For more than 6 characters, we can cycle around and xor (exclusive or) what is currently in a 5-bit segment with the next character. Think of the 32 bits in an int like this:

  XX ----- ----- ----- ----- ----- -----

  int key = 0, j = 0;
  for (int i = 0; i < s.length(); i++) {
    int newNum = letterToNum(s.charAt(i));
    newNum = newNum << 5*j;   // shift into the j-th 5-bit segment
    key = key ^ newNum;       // ^ is the bit-wise xor operator
    j = (j+1) % 6;            // cycle through the 6 segments
  }
  return key;

The intuition behind using exclusive or is that the number of 1's and 0's will remain roughly the same. If we used or, our key would turn into mostly 1's; if we used and, mostly 0's.

Using Java's Hashtables

The keys for Java's Hashtables are Objects. This would seem to cause problems:
* How can it transform an Object to an integer?
* How can it compare two keys when doing lookup?

The answer is that every object has methods hashCode() and equals(Object other).
The former takes no arguments and returns an integer -- so every object knows how to transform itself. The latter takes another Object and returns a boolean -- true if the two Objects should be considered the same key.

The default behavior is to use the address in memory for hashCode and to make equals true only for the same actual Object. Hence, if these methods are not overridden, no two keys will ever be the same. This is rarely what you want. For example, with Strings, we want equals to be true if the two Strings have the same characters in the same positions. Then we had better override hashCode too, or else two equal Strings could end up in different buckets -- clearly incorrect behavior. The library has already done the right thing for Strings, so you don't have to worry about this on your homework. But you do in applications where your keys are something that hasn't already overridden equals and hashCode.

You can consult the Java documentation for how to use the Hashtable class. Here's a brief summary:
* Call the constructor to create a new table.
* put(Object key, Object value) is what we called insert.
* get(Object key) is what we called lookup.
* containsKey(Object key) returns true iff the table has an element with that key.
* contains(Object value) returns true iff the table has an element with that value. It is not clear to me what efficiency we should expect for this operation -- use containsKey.
* Enumeration keys()
* Enumeration elements()

Enumerations are useful and provided by many Java data structures. The idea is that we want to walk through all the items in the table, but we can't do that directly because we don't have access to the actual table. An Enumeration is an object with two methods:
* hasMoreElements() returns false once you've seen every element.
* nextElement() returns the next element.

Be careful not to call elements() if you want keys() -- this cost me 45 minutes when debugging homework 4.
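To make the equals/hashCode contract concrete, here is a minimal sketch with a hypothetical two-field key class (the class and its field names are my invention, not from the lecture; it uses today's generic Hashtable syntax):

```java
import java.util.Hashtable;

// Hypothetical key class: two Names with the same characters in the same
// fields should be the same key, so we override equals AND hashCode.
class Name {
    final String first, last;
    Name(String first, String last) { this.first = first; this.last = last; }
    @Override public boolean equals(Object other) {
        if (!(other instanceof Name)) return false;
        Name n = (Name) other;
        return first.equals(n.first) && last.equals(n.last);
    }
    @Override public int hashCode() {
        // Combine both fields so both contribute to the bucket choice.
        return 31 * first.hashCode() + last.hashCode();
    }
}

public class TableDemo {
    public static void main(String[] args) {
        Hashtable<Name, Integer> table = new Hashtable<>();
        table.put(new Name("Dan", "Grossman"), 12);  // insert
        // Lookup succeeds with a *different* object that equals the key:
        System.out.println(table.get(new Name("Dan", "Grossman")));  // prints 12
        System.out.println(table.containsKey(new Name("X", "Y")));   // prints false
    }
}
```

If hashCode were left at its default, the second Name object would likely land in a different bucket and the lookup would miss -- exactly the incorrect behavior described above.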
Hashing

Doing a good job of transforming is wasted effort if we don't hash well. We'll discuss 3 methods: the division method, the multiplication method, and universal hashing.

The division method is just h(k) = k mod m, where k is the integer we are hashing. Mathematicians tell us that m (the table size) should be a prime number not near a power of 2. To see why primality helps, consider the worst case, which is when m is a power of 2. Then the division method just throws away the high-order bits of the int we worked so hard to create! In general, if k has factors in common with m, it loses more information when taken mod m. Prime numbers don't have many factors, so they're a good choice. It is less clear why being near a power of 2 is bad, but do what you're told. :-) Your text mentions that Strings differing by only a transposition of characters collide when m = 2^p - 1.

Having to choose these particular table sizes gets in the way of our standard method of doubling the size of the table when the number of elements gets too large. Instead of actually doubling, we should just approximately double: look up a bunch of good table sizes in a math book, "hard-wire" them into an array before we compile our program, and then when we resize, use the next larger element in this array.

The multiplication method is just h(k) = floor(m * (kA mod 1)), where A is a constant between 0 and 1. The idea is to take the part of the product kA to the right of the decimal point (that's what mod 1 means) and multiply by m to get a bucket. The advantage over the division method is that all the "magic" is in A, so m can go back to being a nice convenient power of 2. In fact, we'll see in a minute that we should insist that it is a power of 2. Your text says A = (sqrt(5) - 1)/2 "works pretty well".
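Both formulas can be sketched directly. This is a hedged illustration (the method names are mine), with the multiplication method written in floating point exactly as stated above:

```java
public class HashMethods {
    // Division method: h(k) = k mod m, where m should be a suitable prime.
    static int divisionHash(int k, int m) {
        return Math.floorMod(k, m);  // floorMod keeps the result in 0..m-1
    }

    // Multiplication method: h(k) = floor(m * (kA mod 1)).
    static int multiplicationHash(int k, int m) {
        final double A = (Math.sqrt(5) - 1) / 2;  // the constant the text suggests
        double kA = k * A;
        double frac = kA - Math.floor(kA);        // kA mod 1: the fractional part
        return (int) (m * frac);                  // the cast truncates, i.e. floor
    }

    public static void main(String[] args) {
        System.out.println(divisionHash(123456, 701));        // 701 is prime
        System.out.println(multiplicationHash(123456, 1024)); // m a power of 2
    }
}
```

Note that m appears only as the last step in multiplicationHash; changing the table size does not disturb the "mixing" done by A.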
We can avoid using floating point numbers by being clever and taking advantage of the fact that int multiplication is implicitly mod 2^32 (or 2^w, where w is the number of bits in an int). This only works if m is a power of 2, say m = 2^p. Instead of using A, use A', which by definition is floor(A * 2^w). (You don't actually calculate floor(A * 2^w) in your program; A' is just a hard-wired integer constant.) Now

  h(k) = (k * A') >> (w - p)

where >> means "shift bits to the right". What's happening is that k*A' holds the part that used to be to the right of the decimal point -- the other bits disappeared because we're working mod 2^w. Then we shift right so that we keep the p = log m high-order bits -- this is equivalent to what was multiplying by m and taking the floor. You just have to stare at the math a bit to see that this is exactly the multiplication method, just without floating point numbers.

Universal Hashing

Since there are more keys than buckets, for any hashing function there are possible sets of keys with horrible collisions. If an evil adversary knew our hashing function, he could always produce such a set of keys. We can beat this by having a bunch of different hash functions, all of them good, and choosing one at random when we make a new table. (Of course, we have to use the same function for the life of a table, or we won't be able to find the objects we've inserted.) Then the probability of bad collisions is provably low _regardless of the set of keys_. And even if you have bad collisions for one table, the same keys will probably do fine the next time, when a different hash function is used.

More formally, let H = {h_1, h_2, ..., h_t} be a set of hash functions and let U be the set of possible keys. We say that H is "universal" over U if for arbitrary distinct x, y in U, the number of h's in H for which x and y collide is |H|/m, where |H| is the size of H and m is the number of buckets. Intuitively, this means that if we pick h randomly from H, the probability that x and y collide (that is, h(x) = h(y)) is 1/m.
Since we don't have x and y in advance and there are m buckets, this is as good as we could expect. Furthermore, this is true for _all_ x and y, so we don't expect any set of keys to do worse than any other. Finally, if we have n things in our table, then the number we expect to collide with x is (n-1)/m. The "-1" is because x, by definition, does not collide with itself.

So we have defined universal hashing and convinced ourselves that it would be great, but we haven't actually exhibited an H that matches the definition, and it isn't at all obvious that such an H exists. In fact, one does...

Let x be a key, and assume it is w bits long. Break x into r+1 binary substrings, each b = w/(r+1) bits long. We require r to be large enough that 2^b < m, and we also make m prime. Call the substrings x_0, x_1, x_2, ..., x_r and interpret them as binary integers between 0 and (2^b)-1. Also pick r+1 random numbers between 0 and m-1. Call these numbers a_0, a_1, ..., a_r, and call the collection of them a. Finally define:

  h_a(x) = (sum from i=0 to r of a_i * x_i) mod m

Each choice of a defines a different hash function. There are m^(r+1) such functions; let the collection of all of them be H.

Claim: H is universal.

Proof: Let x != y. Then some x_i != y_i. Assume without loss of generality that x_0 != y_0. (The proof is similar if some other substring is different.) For a particular h_a, we have a collision iff h_a(x) == h_a(y), that is, iff

  (sum from i=0 to r of a_i * x_i) mod m == (sum from i=0 to r of a_i * y_i) mod m

Rearranging terms, this is true iff

  a_0 * (x_0 - y_0) mod m == (- sum from i=1 to r of a_i * (x_i - y_i)) mod m

The number on the right is something between 0 and m-1, and x_0 - y_0 is not zero mod m (since 0 < |x_0 - y_0| < 2^b < m). It follows from number theory (using the fact that m is prime) that exactly 1 of the m possible values for a_0 causes a collision. So for any fixed a_1, a_2, ..., a_r, exactly 1/m of the hash functions using those numbers cause a collision on x and y.
So, grouping the elements of H by common values of a_1, a_2, ..., a_r, we see that H meets the definition of universal.
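The construction can be sketched in Java. This is a hedged illustration with constants of my own choosing (the lecture fixes none): m = 101 buckets (prime), b = 6 bits per piece (2^6 = 64 < 101), and r+1 = 5 pieces covering a 30-bit key; the a_i are drawn from 0..m-1 once, when the function is created:

```java
import java.util.Random;

public class UniversalHash {
    static final int M = 101;     // number of buckets; prime, as required
    static final int B = 6;       // bits per piece; 2^6 = 64 < 101
    static final int PIECES = 5;  // r+1 pieces, covering a 30-bit key
    private final int[] a = new int[PIECES];

    // Picking the a_i at random selects one function h_a from the family H.
    UniversalHash(Random rnd) {
        for (int i = 0; i < PIECES; i++)
            a[i] = rnd.nextInt(M);  // each a_i uniform in 0..m-1
    }

    // h_a(x) = (sum of a_i * x_i) mod m, where x_i is the i-th 6-bit piece.
    int hash(int key) {
        int sum = 0;
        for (int i = 0; i < PIECES; i++) {
            int piece = (key >>> (B * i)) & ((1 << B) - 1);  // extract x_i
            sum = (sum + a[i] * piece) % M;  // partial sums stay small; no overflow
        }
        return sum;
    }
}
```

A new table would construct one UniversalHash and use that same object for the table's whole lifetime, as the notes require.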