In recitation you saw that SML supports imperative programming through the
primitive parameterized ref type. A ref is like a box that can store a
single value. By using the := operator, the value in the box can be
changed as a side effect. It is important to distinguish between the value that
is stored in the box, and the box itself.
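For example, here is a short transcript illustrating the difference (the exact printed output may vary by compiler, but this is typical of SML/NJ):

- val r = ref 0;                (* create a box containing 0 *)
val r = ref 0 : int ref
- r := !r + 1;                  (* replace the contents, as a side effect *)
val it = () : unit
- !r;                           (* !r is the value currently in the box *)
val it = 1 : int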
Another important kind of mutable data structure that SML provides is
the array. Arrays generalize refs in that they provide a whole sequence of memory
locations, each containing its own value. We can think of
a ref cell as an array of size 1. The type t array is in
fact very similar to the Java array type t[]. Here's a partial
signature for the built-in Array structure in SML. Note that you have to
"open Array" explicitly to use the operations unqualified, or else
write Array.foo when you want to use the operation foo.
signature ARRAY = sig
  (* Overview: an 'a array is a mutable fixed-length sequence of
   * elements of type 'a. *)
  type 'a array

  (* array(n,x) is a new array of length n whose elements are
   * all equal to x. *)
  val array : int * 'a -> 'a array

  (* fromList(lst) is a new array containing the values in lst *)
  val fromList : 'a list -> 'a array

  exception Subscript (* indicates an out-of-bounds array index *)

  (* sub(a,i) is the ith element in a. If i is
   * out of bounds, raise Subscript *)
  val sub : 'a array * int -> 'a

  (* update(a,i,x)
   * Effects: Set the ith element of a to x.
   * Raise Subscript if i is not a legal index into a *)
  val update : 'a array * int * 'a -> unit

  (* length(a) is the length of a *)
  val length : 'a array -> int
  ...
end
See the SML documentation for more information on the operations available on arrays.
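For instance, here is a short transcript (our own illustration, not part of the signature above) exercising these operations:

- val a = Array.array(3, 0);
val a = [|0,0,0|] : int array
- Array.update(a, 1, 42);                   (* set element 1 as a side effect *)
val it = () : unit
- Array.sub(a, 1);
val it = 42 : int
- Array.sub(a, 3) handle Subscript => ~1;   (* index out of bounds *)
val it = ~1 : int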
Notice that we have started using a new kind of clause in the specification, the effects clause. This clause specifies side effects that the operation has beyond the value it returns. When a routine has a side effect, it is useful to have the word "Effects:" explicitly in the specification to warn the user of the side effect.
We've seen several different implementations of sets so far. One reason that
sets are worth worrying about is that the techniques that we use to implement
sets also apply to implementations of maps. A map (not to be confused
with the map function) is a set of (key,value) pairs
where each key only appears once: abstractly, a partial function from keys to
values. Dictionaries are a special case of maps in which the keys are strings.
A mutable set is a set that can be imperatively changed to include more elements, or to remove some elements. Here are examples of signatures for mutable sets and maps. These are generic signatures that can be used for many different element or (key, value) types. These signatures show an important issue in writing effects clauses. To specify a side effect, sometimes we need to be able to talk about the state of a mutable value both before and after the routine is executed. Writing "_pre" or "_post" after the name of a variable is a compact way of referring to the state of the value in that variable before and after the function executes, respectively.
signature MSET = sig
  (* Overview: a set is a mutable set of items of type elem.
   * For example, if elem is int, then a set might be
   * {1,-11,0}, {}, or {1001} *)
  type elem
  type set

  (* empty() creates a new empty set *)
  val empty : unit -> set

  (* add(s,x) adds the element x to s if it is not there already.
   * Effects: s_post = s_pre union {x} *)
  val add: set * elem -> unit

  (* remove(s,x) removes the element x from s if it is there. *)
  val remove: set * elem -> unit

  (* member(s,x) is whether x is a member of s *)
  val member: set * elem -> bool

  (* size(s) is the number of elements in s *)
  val size: set -> int

  (* fold over the elements of the set *)
  val fold: ((elem*'b)->'b) -> 'b -> set -> 'b

  val fromList: elem list -> set
  val toList: set -> elem list
end

signature MMAP = sig
  (* A 'value map is a mutable set of (key, 'value) pairs in
   * which no two keys are equal. Alternatively, it is
   * a partial function from keys to values. dom(m) is
   * the set of keys on which the map m is defined. *)
  type 'value map
  type key

  (* empty() creates a new empty mapping. *)
  val empty : unit -> 'value map

  (* add(m,k,v)
   * Effects: m_post = m_pre[k->v]. That is, the function
   * that is identical to m_pre everywhere but at k, which
   * it maps to v. *)
  val add: 'value map * key * 'value -> unit

  (* remove(m,k) is SOME of the value m_pre maps k to, or
   * NONE if there is no such mapping.
   * Effects: the mapping for k, if any, is removed from m. *)
  val remove: 'value map * key -> 'value option

  (* get(m,k) is SOME of the value m maps k to, or NONE
   * if there is no such mapping. *)
  val get: 'value map * key -> 'value option

  (* size(m) is the number of elements in dom(m) *)
  val size: 'value map -> int

  (* fold over the elements of the map *)
  val fold: ((key * 'value * 'b)->'b) -> 'b -> 'value map -> 'b
end
You might notice that we haven't used any type parameters in the signature
for sets, unlike in the signature for arrays. This is because we can only use
type parameters when we have a parameterized type that makes sense for all
possible type parameters. Sets only make sense for types whose values can be
compared for equality. Some SML types, such as function types, do not support
equality. Similarly, maps only make sense for key types that support equality.
They make sense for arbitrary value types, which is why we have parameterized
the map type with respect to the value type ('value).
We've seen various implementations of functional sets. First we had simple lists, with O(n) access time. Then we saw how to implement sets as balanced binary search trees with O(lg n) access time. Our current best results are these:
|                    | linked list, no duplicates | red-black trees |
|--------------------|----------------------------|-----------------|
| add (insert)       | O(n)                       | O(lg n)         |
| delete (remove)    | O(n)                       | O(lg n)         |
| member (contains)  | O(n)                       | O(lg n)         |
What if we could do even better? It turns out that we can implement mutable sets and maps more efficiently than the immutable (functional) sets and maps we've been looking at so far. In fact, we can turn an O(n) functional set implementation into an O(1) mutable set implementation, using hash tables. This works by exploiting the power of arrays to update an arbitrary element in O(1) time.
The idea is that we'll store each element of the mutable set in a functional set. Linked lists without duplicates work fine; having red-black trees is overkill. But instead of having just one functional set, we'll use a lot of them. In fact, for a mutable set containing n elements, we'll spread out its elements among O(n) smaller functional sets. If we spread the elements around evenly, each of the functional sets will contain O(1) elements and accesses to it will have O(1) performance!
|                    | hash table |
|--------------------|------------|
| add (insert)       | O(1)       |
| delete (remove)    | O(1)       |
| member (contains)  | O(1)       |
The idea then is that our data structure (the hash table) is a big array of O(n) elements, called buckets. Each bucket is a functional (immutable) set containing O(1) elements, and the elements of the set as a whole are partitioned among all the buckets.
There is one key piece missing: how do we decide which bucket to put a set
element into? We must provide a hash function h(e)
that given a set element e returns the
index of the bucket that element should be stored into. The hash table works well
if each element is equally and independently likely to be hashed into any
particular bucket; this is called the simple uniform hashing assumption. Suppose
we have n elements in the set and
the bucket array is length m. Then we
expect a = n/m
elements per bucket. The quantity a is called the load
factor of the hash table. If the set implementation used for the buckets has
linear performance, then we expect to take O(1+a)
time to do add, remove, and member. If
the number of buckets is proportional to the number of elements, the load
factor a is O(1),
so all the operations are also O(1) on
average.
One remaining issue that affects our implementation is the choice of the hash function (and the number of buckets, since the hash function must map elements into exactly that range). A bad hash function can clearly destroy our attempts at a constant running time: in the worst case, every element hashes to the same bucket, and we must search through O(n) elements in that bucket. If we're mapping names to phone numbers, then hashing each name to its length would be a very poor function, as would a hash function that used only the first name, or only the last name. We want our hash function to use all of the information in the key.
With modular hashing, the hash function is simply h(k) = k mod m, which is easy to compute quickly when we consider the bit-level representation of the key k as representing a number. Certain values of m produce poor results though: if m = 2^p, then h(k) is just the p lowest-order bits of k. Generally we prefer a hash function that uses all the bits of the key. In practice, primes not too close to powers of 2 work well. A slightly better alternative is multiplicative hashing, in which we compute ((k*m) div 2^p) mod 2^q for appropriately chosen values of m, p, and q. Here m should be nearly as large as the maximum integer, but its binary representation should be a random mix of 1's and 0's; p is chosen so that all of the high bits of the product are retained. Multiplicative hashing is very useful if you want your bucket arrays to be of size 2^q.
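For concreteness, here is a sketch of a multiplicative hash function along these lines (multHash and the particular multiplier are our own illustration; the code keeps the top q bits of the low-order word of the product, so p = wordSize - q):

(* A sketch of multiplicative hashing. q is the number of bucket-index
 * bits, so the table should have 2^q buckets. The multiplier is an
 * arbitrary odd constant whose bits are a random-looking mix of 1's
 * and 0's. Assumes q is small enough that the result fits in an int. *)
val multiplier : Word.word = 0wx678DDE6F

fun multHash (q: int) (k: int) : int =
  (* (k * multiplier) div 2^(wordSize-q) keeps the top q bits of the
   * product, i.e. ((k*m) div 2^p) mod 2^q with p = wordSize - q *)
  Word.toInt (Word.>> (Word.fromInt k * multiplier,
                       Word.fromInt (Word.wordSize - q)))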
Ideally you should test your hash function to make sure it behaves well under real data. With any hash function, it is possible to generate data that cause it to behave poorly, but a good hash function will make this unlikely. A good way to determine whether your hash function is working well is to measure the clustering of elements into buckets. If bucket i contains x_i elements, then the clustering is (Σ_i x_i^2)/n − n/m. A uniform hash function produces clustering near 1.0 with high probability. A clustering factor of c means that the performance of the hash table is slowed down by a factor of c relative to its performance with a uniform hash function and the same array size. If clustering is less than 1.0, the hash function is doing better than a uniform random hash function ought to: this is rare. Note that clustering is independent of the load factor.
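To make the formula concrete, here is a small helper (hypothetical, not from the notes) that computes this clustering measure from a list of bucket sizes:

(* clustering(sizes) computes (sum of x_i^2)/n - n/m for a table whose
 * ith bucket holds x_i elements, where n is the total element count
 * and m the number of buckets.
 * Requires: at least one bucket and at least one element. *)
fun clustering (sizes: int list) : real =
  let
    val m = real (length sizes)
    val n = real (List.foldl op+ 0 sizes)
    val sumSq = List.foldl (fn (x, s) => s + real x * real x) 0.0 sizes
  in
    sumSq / n - n / m
  end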
The claim that hash tables give O(1) performance
is based on the assumption that m = O(n).
If a hash table has many elements inserted into it, n
may become much larger than m and violate
this assumption. The effect will be that the bucket sets will become large
enough that their bad asymptotic performance will show through. The solution to
this problem is relatively simple: when the load factor exceeds some constant
threshold, the array must be increased in size and all the elements rehashed
into the new buckets using an appropriate hash function. Each resizing
operation therefore takes O(n) time
where n is the size of the hash table being
resized. Therefore the O(1) performance of
the hash table operations no longer holds in the case of add: its
worst-case performance is O(n).
This isn't really as much of a problem as it might sound. If the bucket array is doubled in size every time it is needed, then the insertion of n elements in a row into an empty array takes only O(n) time, perhaps surprisingly. We say that add has O(1) amortized run time because the time required to insert an element is O(1) on the average even though some elements trigger a lengthy rehashing of all the elements of the hash table.
To see why this is, suppose we insert n elements into a hash table while doubling the number of buckets when the load factor crosses some threshold. A given element may be rehashed many times, but the total time to insert the n elements is still O(n). Consider inserting n = 2^k elements, and suppose that we hit the worst case, where a resizing occurs on the very last element. Since the bucket array is doubled at each rehashing, the rehashes all occur at power-of-two points. The last half of the elements are hashed once on insertion and then rehashed once, at the 2^k point. Of the first half of the inserted elements, the last half are hashed once on insertion and then rehashed twice, at the 2^(k-1) and 2^k points. Of the first half of those, the last half are hashed once on insertion and then rehashed three times. And so on. The total number of hashing operations is therefore
(n/2 * 2 + n/4 * 3 + n/8 * 4 + n/16 * 5 + ...) = (1/2 * 2 + 1/4 * 3 + 1/8 * 4 + 1/16 * 5 + ...) * n
What is the limit of the series that multiplies n? The series is the
sum over all i >= 1 of (i+1)/2^i, which splits into
(sum of i/2^i) + (sum of 1/2^i) = 2 + 1 = 3. So no matter how many elements
we add to the hash table, about three hashing operations are performed
per element added. Therefore, add takes amortized O(1)
time even if we start out with a bucket array of one element!
Another way to think about this is that the true cost of performing an add
is about triple the cost observed on a typical call to add. The remaining 2/3 of
the cost is paid as the array is resized later. It is useful to think about this
in monetary terms. Suppose that a hashing operation costs $1 (that is, 1 unit of
time). Then a call to add costs $3, but only $1 is required up
front for the initial hash. The remaining $2 is placed into the hash table
element just added and used to pay for future rehashing. Assume each time the
array is resized, all of the remaining money gets used up. At the next resizing,
there are n elements and n/2
of them have $2 on them; this is exactly enough to pay for the resizing. This is
really an argument by induction, so we'd better examine the base case: when
the array is resized from one bucket to two, there is $2 available, which is $1
more than needed to pay for the resizing. That extra $1 will stick around
indefinitely, so inserting n elements
starting from a 1-element array takes at most 3n-1
element hashes, which is O(n) time.
This kind of analysis, in which we precharge an operation for some time that
will be taken later, typifies amortized analysis of run time.
Notice that it was crucial that the array size grows geometrically (doubling). It is tempting to grow the array by a fixed increment (e.g., 100 elements at a time), but this has quadratic cost: with a fixed increment c, inserting n elements triggers about n/c resizings, and the ith resizing rehashes about i*c elements, for a total of roughly c*(1 + 2 + ... + n/c) = O(n^2/c) = O(n^2) asymptotic insertion time!
Any fixed threshold load factor is equally good from the standpoint of asymptotic run time, but a good rule of thumb is that rehashing should take place at a=3. One might think that a=1 is the right place to rehash, but in fact the best performance is seen (for buckets implemented as linked lists) when load factors are in the 1-2 range. When a<1, the bucket array contains many empty entries, resulting in suboptimal performance of the computer's memory system.
An alternative to the approach outlined above is what is known as open addressing. Instead of storing a set at every array index, a single element is stored there. If an element is inserted in the hash table and collides with an element already stored at that index, a second possible location for it is computed. If that is full, the process repeats. There are various strategies for generating a sequence of hash values for a given element: linear probing, quadratic probing, double hashing. We have chosen not to talk about open addressing in detail because in practice it is slower than the simple array-of-linked-lists approach. The performance of open addressing becomes very bad when the load factor approaches 1, because a long sequence of array indices may need to be tried for any given element -- possibly every element in the array! Therefore it is important to resize the array when the load factor exceeds 2/3 or so. The bucket approach, by contrast, suffers gradually declining performance as the load factor grows, and has no fixed point beyond which resizing is absolutely needed. With buckets, a sophisticated application can defer the O(n) cost of resizing its hash tables to a point in time when it is convenient to incur it: for example, when the user is idle.
We observed above that sets and maps only make sense when their element and
key types, respectively, support a notion of equality. This makes it more
difficult to write a single implementation of sets that can be used for any type
we like; we need to use some features of SML that we haven't seen yet. Let's
build an implementation of hash tables to see how it all works out. First of
all, we can describe the types that make sense in these signatures by writing
another signature that describes a type t together with an operation equal
for testing whether two values of that type are equal:
signature EQ = sig
  (* t is a type with a notion of equality *)
  type t
  val equal: t * t -> bool
end
To use a type (for example, int) as a set element type we construct a structure that bundles the type with the operations that its ADT implementation requires:
structure IntEq : EQ = struct
  (* subtlety: must use : here, not :>, so that t is
   * visible outside the structure *)
  type t = int
  fun equal(x:int, y:int) = (x = y)
end
Now, consider the following functional signature for a set ADT, which we will use for the hash table buckets:
signature SET = sig
  (* Overview: a set is a set of distinct items of type elem.
   * For example, if elem is int, then a set might be
   * {1,-11,0}, {}, or {1001} *)
  type elem
  type set

  (* test for equality of two elements *)
  val eq: elem * elem -> bool

  (* empty is the empty set *)
  val empty : set

  (* add(s,e) is s union {e} *)
  val add: set * elem -> set

  (* remove(s,x) is s - {x} (set difference) *)
  val remove: set * elem -> set

  (* member(s,x) is whether x is a member of s *)
  val member: set * elem -> bool

  (* size(s) is the number of elements in s *)
  val size: set -> int

  (* fold over the elements of the set *)
  val fold: ((elem*'b)->'b) -> 'b -> set -> 'b

  (* fromList(lst) is the set of elements in lst.
   * Requires: lst contains no equal elements *)
  val fromList: elem list -> set
  val toList: set -> elem list
end
We can instantiate a signature like SET on a particular element
type using a where clause. This gives us the effect of type
parameterization, but on signatures:
- signature INTSET = SET where type elem = int;
signature INTSET = sig
  type elem = int
  type set
  val empty : set
  ...
end
We can use a functor to write an implementation of SET
that works for all its possible instantiations like INTSET. For
example, suppose we want the simple implementation in which the rep is a linked
list of unique elements. The functor we write takes in a type t
that has an equal operation (both t and equal
are bundled together in a structure Eq) and produces a structure
that meets the SET signature:
functor ListSet(structure Eq: EQ) :> SET where type elem = Eq.t =
struct
  type elem = Eq.t
  type set = elem list
  (* RI: the list contains no elements that are equal
   * according to Eq.equal *)

  val empty: set = []
  val eq = Eq.equal

  fun member(s, e) =
    case s of
      [] => false
    | h::t => Eq.equal(e,h) orelse member(t,e)

  fun add(s, e) =
    case s of
      [] => [e]
    | h::t => if Eq.equal(e,h) then e::t else h::add(t,e)

  fun remove(s, e) =
    case s of
      [] => []
    | h::t => if Eq.equal(e,h) then t else h::remove(t,e)

  fun size(s) = length(s)
  fun fold f b s = foldl f b s
  fun fromList s = s  (* ought to check for duplicates *)
  fun toList s = s
end
This is our usual implementation of sets as lists, but it works for almost
any element type. Now we can make sets of any type we like by using a structure
that provides the element type and its equality operation; for example, the IntEq
structure defined earlier:
signature INTSET = SET where type elem = int
structure IntSet :> INTSET = ListSet(structure Eq = IntEq)  (* functor application *)

- val s: IntSet.set = IntSet.fromList([3,4,5]);
val s = - : IntSet.set
- IntSet.toList(IntSet.add(IntSet.add(s, 2), 3));
val it = [2,3,4,5] : IntSet.elem list
It's a little bit awkward to have to define the IntEq structure
in order to instantiate the ListSet functor, but the result is that
SML is very expressive. Some languages (e.g., CLU, PolyJ) make this process of
instantiating signatures on type parameters more convenient, but lose some
expressive power.
The implementation above shows how to use functors to provide a generic implementation of sets using linked lists: generic in the sense that it works for any type that it makes sense to have a set of. Functional maps can be implemented similarly, resulting in generic association lists.
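For instance, a minimal sketch of such a generic association-list map might look like this (ListMap, get, and add are our own names, following the same pattern as ListSet above):

(* Functional maps as association lists, generic over the key's equality. *)
functor ListMap(structure Eq: EQ) =
struct
  type key = Eq.t
  type 'value map = (key * 'value) list
  (* RI: no two pairs in the list have equal keys *)

  val empty = []

  (* get(m,k) is SOME of the value bound to k in m, or NONE *)
  fun get (m: 'value map, k: key) =
    case List.find (fn (k', _) => Eq.equal (k, k')) m of
        NONE => NONE
      | SOME (_, v) => SOME v

  (* add(m,k,v) is m with k bound to v, replacing any existing binding *)
  fun add (m: 'value map, k: key, v: 'value) : 'value map =
    (k, v) :: List.filter (fn (k', _) => not (Eq.equal (k, k'))) m
end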
We can use a functor to provide a generic implementation of the mutable set (MSET)
signature too. In order to store elements in a hash table, we'll need a hash
function for the element type, and an equality test just as for other sets. We
can define an appropriate signature that groups the type and these two
operations:
signature HASHABLE = sig
  type t

  (* hash is a function that maps a t to an integer. For
   * all e1, e2, if equal(e1,e2), then hash(e1) = hash(e2) *)
  val hash: t -> int

  (* equal is an equivalence relation on t. *)
  val equal: t * t -> bool
end
There is an additional invariant documented in the signature: for the hash table to function correctly, any two equal elements must have the same hash code.
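For instance, a HASHABLE structure for strings (our own illustration) might combine all the character codes, so that equal strings always hash alike and most of the key's information is used:

structure StringHash : HASHABLE = struct
  type t = string
  (* combine every character code; Word arithmetic wraps around
   * rather than raising Overflow on long strings *)
  fun hash (s: string) : int =
    Word.toIntX (CharVector.foldl
      (fn (c, h) => h * 0w31 + Word.fromInt (Char.ord c)) 0w0 s)
  fun equal (s1: string, s2: string) = (s1 = s2)
end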
functor HashSet(structure Hash: HASHABLE
                and Set: SET where type elem = Hash.t) =
struct
  type elem = Hash.t
  type bucket = Set.set
  type set = {arr: bucket array, nelem: int ref}
  (* AF: the set represented by a "set" is the union of
   *     all of the bucket sets in the array "arr".
   * RI: nelem is the total number of elements in all the buckets in
   *     arr. In each bucket, every element e hashes via Hash.hash(e)
   *     to the index of that bucket modulo length(arr). *)

  (* Find the appropriate bucket for e and pass it, along with its
   * index and the element count, to f *)
  fun findBucket({arr, nelem}, e)
                (f: bucket array * int * bucket * elem * int ref -> 'a) =
    let
      val i = Hash.hash(e) mod Array.length(arr)
      val b = Array.sub(arr, i)
    in
      f(arr, i, b, e, nelem)
    end

  fun member(s, e) =
    findBucket(s, e) (fn (_, _, b, e, _) => Set.member(b, e))

  fun add(s, e) =
    findBucket(s, e)
      (fn (arr, i, b, e, nelem) =>
        (* only insert if e is absent, so that nelem stays accurate *)
        if Set.member(b, e) then ()
        else (Array.update(arr, i, Set.add(b, e));
              nelem := !nelem + 1))

  fun remove(s, e) =
    findBucket(s, e)
      (fn (arr, i, b, e, nelem) =>
        if Set.member(b, e) then
          (Array.update(arr, i, Set.remove(b, e));
           nelem := !nelem - 1)
        else ())

  fun size({arr, nelem}) = !nelem

  fun fold f init {arr, nelem} =
    Array.foldl (fn (b, curr) => Set.fold f curr b) init arr

  fun create(size: int): set =
    {arr = Array.array(size, Set.empty), nelem = ref 0}

  (* Copy all elements from s2 into s1. *)
  fun copy(s1: set, s2: set): unit =
    fold (fn (elem, _) => add(s1, elem)) () s2

  fun fromList(lst) =
    (* at least one bucket, so that add works even on a set built
     * from the empty list *)
    let val s = create(Int.max(1, length lst))
    in List.foldl (fn (e, ()) => add(s, e)) () lst; s end

  fun toList({arr, nelem}) =
    Array.foldl (fn (b, lst) => Set.fold (fn (e, lst) => e::lst) lst b)
                [] arr
end
This hash table implementation almost implements the MSET signature, but not
quite, because it doesn't implement the empty operation. Here is a complete
implementation of mutable sets using a fixed-size hash table of nbucket
buckets:
functor FixedHashSet(val nbucket: int;
                     structure Hash: HASHABLE
                     and Set: SET where type elem = Hash.t)
  :> MSET where type elem = Hash.t =
struct
  structure HS = HashSet(structure Hash = Hash and Set = Set)
  type elem = HS.elem
  type set = HS.set
  fun empty() = HS.create(nbucket)
  val member = HS.member
  val add = HS.add
  val remove = HS.remove
  val size = HS.size
  val fold = HS.fold
  val toList = HS.toList
  val fromList = HS.fromList
end
Here we create a hash table of 1000 buckets and insert the numbers one through
ten into it.
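The transcript below assumes a HASHABLE structure for integers. A minimal one, using the naive identity hash function (a choice criticized later in these notes), might be:

structure IntHash : HASHABLE = struct
  (* as with IntEq, use : rather than :> so that t = int is visible *)
  type t = int
  fun hash (x: int) = x   (* naive: use the integer itself as its hash *)
  fun equal (x: int, y: int) = (x = y)
end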
- structure FHS = FixedHashSet(val nbucket = 1000;
                               structure Hash = IntHash and Set = IntSet);
- open FHS;
- val s = empty();
val s = - : set
- foldl (fn(x,_) => add(s,x)) () [1,2,3,4,5,6,7,8,9,10];
val it = () : unit
- toList(s);
val it = [10,9,8,7,6,5,4,3,2,1] : elem list
The elements are in reverse order because they are hashed into buckets 1 through 10.
The HashSet implementation of hash tables abstracts out the
common operation of finding the correct bucket for a particular element into the
findBucket function, thus keeping the rest of the code simpler. If
the number of buckets is always a power of 2, the modulo operation can be
performed using bit logic, which is much faster.
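For instance, a hypothetical bucketIndex helper using this trick might be:

(* When nbuckets is a power of two, h mod nbuckets is a bitwise AND
 * with nbuckets - 1. (A sketch; assumes h's low bits are well mixed.) *)
fun bucketIndex (h: int, nbuckets: int) : int =
  Word.toInt (Word.andb (Word.fromInt h, Word.fromInt (nbuckets - 1)))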
Our use of the identity function when hashing integers is rather naive; it works well only if the integers are uniformly distributed modulo the array length. This is usually a bad assumption, and a frequent source of lost performance. A better hash function would use multiplicative hashing, cyclic redundancy checks (CRC's), cryptographic hashing (e.g., MD5), or some other information-diffusion mechanism. Here is an example of how we might fix the hashing function for integers, using multiplicative hashing. The following code assumes a word size of 32 bits:
val multiplier: Word.word = 0wx678DDE6F (* following a recommendation by Knuth *)

fun findBucket({arr, nelem}, e)
              (f: bucket array * int * bucket * elem * int ref -> 'a) =
  let
    val n = Word.fromInt(Array.length(arr))
    val d = (0wxFFFFFFFF div n) + 0w1
    val i = Word.toInt(Word.fromInt(Hash.hash(e)) * multiplier div d)
    val b = Array.sub(arr, i)
  in
    f(arr, i, b, e, nelem)
  end
If n is always a power of two, the div operations can be replaced
by bit shifting, which is much faster.
As long as we don't put more than a couple of thousand elements into a
fixed-size 1000-bucket hash table, its performance will be excellent. However,
the asymptotic performance of any fixed-size table is no better than that
of a linked list. We can introduce another level of indirection to obtain a hash
table that grows dynamically and rehashes its elements, thus achieving O(1)
amortized performance as described above. The implementation proves to be quite
simple because we can reuse all of our HashSet code:
functor DynHashSet(structure Hash: HASHABLE
                   and Set: SET where type elem = Hash.t)
  :> MSET where type elem = Hash.t =
struct
  structure HS = HashSet(structure Hash = Hash and Set = Set)
  type set = HS.set ref
  type elem = HS.elem
  val thresholdLoadFactor = 3
  (* AF: the set represented by x:set is !x.
   * RI: the load factor of the hash table !x never goes
   *     above thresholdLoadFactor. *)

  fun empty(): set = ref (HS.create(1))
  fun member(s, e) = HS.member(!s, e)
  fun remove(s, e) = HS.remove(!s, e)
  fun size(s) = HS.size(!s)

  fun add(s, e) =
    let
      val {arr, nelem} = !s
      val nbucket = Array.length(arr)
    in
      if !nelem >= thresholdLoadFactor * nbucket then
        let val newset = HS.create(nbucket * 2) in
          HS.copy(newset, !s);
          s := newset
        end
      else ();
      HS.add(!s, e)
    end

  fun fold f init s = HS.fold f init (!s)
  fun fromList(lst) = ref (HS.fromList(lst))
  fun toList(s) = HS.toList(!s)
end
There is hardly any new code required: mostly just the logic in add
that creates a new, larger hash table and copies all the elements across
when the load factor is too high. Sometimes hash table implementations will also
resize the hash table downwards when a call to remove makes the load factor too
low. This trick usually improves performance in practice for hash tables that
grow and shrink—though it does not improve (in fact, harms) theoretical
asymptotic performance.
This code requires access to the internals of the HashSet implementation,
which is why those internals are not hidden behind a signature. An example of
using these hash tables follows:
- structure DynIntHashSet = DynHashSet(structure Hash = IntHash and Set = IntSet);
- open DynIntHashSet;
- val s = empty();
val s = - : set
- foldl (fn(x,_) => add(s,x)) () [1,2,3,4,5,6,7,8,9,10];
val it = () : unit
- toList(s);
val it = [7,3,10,6,2,9,5,1,8,4] : elem list
The MMAP signature for mutable maps can also be implemented
using hash tables in a similar manner.
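For example, the core of such a hash-table map might look like this sketch (HashMap and its components are our own illustration, with association-list buckets; note that only the key is hashed):

functor HashMap(structure Hash: HASHABLE) =
struct
  type key = Hash.t
  (* each bucket is an association list of (key, value) pairs *)
  type 'value map = {arr: (key * 'value) list array, nelem: int ref}

  fun create (n: int) : 'value map =
    {arr = Array.array (n, []), nelem = ref 0}

  (* only the key is hashed, so equal keys land in the same bucket *)
  fun index ({arr, ...}: 'value map, k: key) =
    Hash.hash k mod Array.length arr

  fun get (m as {arr, ...}: 'value map, k: key) =
    case List.find (fn (k', _) => Hash.equal (k, k'))
                   (Array.sub (arr, index (m, k))) of
        NONE => NONE
      | SOME (_, v) => SOME v

  fun add (m as {arr, nelem}: 'value map, k: key, v: 'value) =
    let
      val i = index (m, k)
      val b = Array.sub (arr, i)
      (* drop any existing binding for k before adding the new one *)
      val b' = List.filter (fn (k', _) => not (Hash.equal (k, k'))) b
    in
      if length b' = length b then nelem := !nelem + 1 else ();
      Array.update (arr, i, (k, v) :: b')
    end
end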