CS410, Summer 1998
Lecture 11 Outline
Dan Grossman

Goals:
* Finish up heaps
* Introduction to hash tables

Reading: CLR Chapter 12

* Prelim information is available on the Web.
* HW4, a programming assignment, will be handed out tomorrow and due next Wednesday.
* HW5 will be handed out next Tuesday and be due the following Monday.

* Heaps:

Remember from last time:
* Inspired by the priority queue ADT, we developed the heap data structure. Notice it can't do lookup.
* Using the "heap shaped trees fit in arrays" trick, we replaced all pointers with array indices and got O(log n) time operations with very low constants.
* We even built a heap from an unsorted set of n elements in time O(n). Recall how we built it bottom up: take two heaps, add a root, and heapify to make a larger heap.

Making a heap out of two smaller heaps only works because:
* The two are the same height.
* The one on the left has a complete bottom level.
Otherwise the result does not have the heap shape. But "merging heaps" seems like a potentially useful operation; CS482 covers other kinds of heaps that support merging.

Other uses of heaps:
* Find the m largest objects out of a collection of n objects: keep a min heap of the m largest seen so far. Compare each object with the min; if it is bigger, replace the min and heapify. (You will do this on homework 4.) This finds the m largest in time O(n log m).
* Sorting: build a heap of n objects, then keep calling delete; the elements come out in sorted order. The total work is dominated by the n delete operations, for a running time of O(n log n).

* Hash Tables

Back to dictionaries: could we even beat O(log n) and get to O(1)? To do so, we will change our desires a bit:
* We will not be able to compute predecessor or successor in less than O(n) time.
* We will rely on luck and accept an _unlikely_ worst case of O(n).

The only way we know to get to an arbitrary amount of data in O(1) time is with an array.
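To make the array idea concrete before we generalize it: if the keys happen to be small nonnegative integers, the key itself can be the array index ("direct addressing"). This is a sketch, not course code; the class and method names are illustrative.

```java
// Sketch of direct addressing: the key itself is the array index.
// Every operation is O(1), but the array must be as large as the
// entire range of possible keys -- the space problem discussed next.
public class DirectAddress {
    String[] table;  // one slot per possible key

    DirectAddress(int keyRange) {
        table = new String[keyRange];
    }

    void insert(int key, String value) { table[key] = value; }  // O(1)
    String lookup(int key)             { return table[key]; }   // O(1)
    void delete(int key)               { table[key] = null; }   // O(1)
}
```

Note that with a key range of, say, all 32-bit integers, this table would need billions of slots no matter how few keys we actually store.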
So if we had a way to take a key and figure out which array index held the corresponding value, we'd be in great shape. This works great if the keys are integers close to each other; otherwise we end up wasting an unacceptable amount of space. We also have a problem if there are duplicate keys.

Alternatively, we could take the integer keys and use a mathematical function to determine which array index (hereafter which "bucket") holds the value. Such a function is called a "hash function". The most obvious function is

    h(key) = key mod m

where m is the size of the array (hereafter referred to as the table size). If the hash function is computable in constant time, then everything can be O(1) -- on insert, lookup, and delete, "hash the key" and you know exactly where to go.

The problem, of course, is "collisions". If the number of possible key values is greater than m, then collisions are unavoidable. And in the worst case, everything ends up in the same bucket! Our approach must be to:
* deal with collisions, since they may occur
* be clever, so that collisions are reasonably unlikely
* accept that empirically everything works like a charm

To deal with collisions, we can just put a linked list at each bucket holding all the items that hash to that bucket. This is called "chaining", and each list is called a "chain". Notice that items with the same key will always collide, but items with different keys might collide as well.

The average length of a chain is n/m, the number of objects in the table divided by the number of buckets. We define alpha = n/m and call alpha the "load factor". We keep alpha constant, and in practice it is something between, say, 0.5 and 3. Notice it can be less than one; in that case some buckets are guaranteed to be empty.

If everything hashes to the same bucket, then lookup and delete take time O(n). What we want is for every bucket to hold about alpha elements. If we just assume this is true, we make the "assumption of simple uniform hashing".
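The chaining scheme just described might look like this for integer keys, with h(key) = key mod m. This is a sketch under those assumptions; the names are illustrative, not from the course.

```java
import java.util.LinkedList;

// Sketch of a hash table with chaining: an array of m buckets, each
// holding a linked list (chain) of the entries that hash to it.
public class ChainedTable {
    static class Entry {
        int key;
        String value;
        Entry(int k, String v) { key = k; value = v; }
    }

    LinkedList<Entry>[] buckets;  // one chain per bucket
    int n = 0;                    // number of objects in the table

    @SuppressWarnings("unchecked")
    ChainedTable(int m) {
        buckets = new LinkedList[m];
        for (int i = 0; i < m; i++) buckets[i] = new LinkedList<>();
    }

    // h(key) = key mod m; floorMod keeps the index nonnegative.
    int hash(int key) { return Math.floorMod(key, buckets.length); }

    void insert(int key, String value) {
        buckets[hash(key)].add(new Entry(key, value));  // O(1)
        n++;
    }

    String lookup(int key) {
        // Walk the chain: expected length is alpha = n/m.
        for (Entry e : buckets[hash(key)])
            if (e.key == key) return e.value;
        return null;  // not present
    }

    double loadFactor() { return (double) n / buckets.length; }
}
```

With m = 4, the keys 3 and 7 both hash to bucket 3 and end up on the same chain, so lookup must still compare keys as it walks the list.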
Under this assumption, the operations are O(1) because alpha is a constant.

A slightly different assumption is that the bucket to which a key hashes is random, with every bucket equally likely. Under this assumption, the likelihood that any bucket has beta > alpha objects decreases exponentially as beta increases. (The math is way beyond this class.) So we can again expect O(1) operations. If we just want to say "assuming O(1) operations", then we can loosely say "assuming hashing behaves". Tomorrow's lecture is devoted to hopefully making this assumption a reality.

We need to keep the load factor alpha constant, so if enough objects are inserted, we will have to increase the number of buckets. As usual with array resizing, we should make twice as many buckets. After doing so, we will have to rehash all the elements in the table. Since we're doubling, this won't happen very often.

So far we have assumed the keys are integers. If we want to hash on something else (for example Strings, but really we could hash on any kind of Object), we will first transform the key into an integer. So now the process looks like:

        transform           hash
  key -----------> integer ------> bucket

Actually, a lot of people (including me) call the first arrow hashing too. I will try to avoid this ambiguity in class. Despite the text's scant coverage, choosing a good transform function is as crucial as choosing a good hash function. After all, if everything transforms to the same integer, then everything will collide.

We have to store the key in the table in order to do the operations. Whether we also store the transformed integer in the table or recompute it whenever we need it is a classic space vs. time trade-off. The integers take space, and they're only strictly needed when the table is being resized. However, they can also be used to speed up lookup operations if we assume (as is usually the case) that comparing integers is faster than comparing keys. (Again, consider Strings as a good example.)
To look up a key, walk down the chain:

    if the integer is equal to the integer of what you're looking for
        then if the keys are equal
            then return the value
            else continue
        else continue

Notice we still have to compare keys, since distinct keys may transform to the same integer.
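As a concrete sketch, here is how that chain walk might look in Java, with the transformed integer stored alongside each key. The transform shown is one common choice (the same polynomial idea Java's own String.hashCode uses); the class and method names are illustrative, not from the course.

```java
import java.util.LinkedList;

// Sketch of the chain walk above. Each entry stores transform(key) so
// a cheap int comparison can screen out most mismatches before the
// more expensive key comparison.
public class ChainWalk {
    static class Entry {
        int transformed;  // transform(key), stored at insert time
        String key;
        String value;
        Entry(int t, String k, String v) {
            transformed = t; key = k; value = v;
        }
    }

    static String lookup(LinkedList<Entry> chain, String key) {
        int t = transform(key);
        for (Entry e : chain) {
            if (e.transformed == t          // fast integer comparison first
                    && e.key.equals(key))   // then the real key comparison
                return e.value;
            // otherwise continue down the chain
        }
        return null;  // not in this chain
    }

    // One possible transform from String keys to integers; mixing every
    // character in keeps different keys from transforming alike too often.
    static int transform(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = 31 * h + key.charAt(i);
        return h;
    }
}
```

Two distinct keys can still share a transformed integer, which is exactly why the key comparison cannot be skipped.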