# Hash tables
This lab explores hash tables, including their implementation in
OCaml's standard library.
## Hashing
##### Exercise: hash insert [✭✭]
Suppose we have a hash table on integer keys. The table currently has 7
empty buckets, and the hash function is simply `let hash k = k mod 7`.
Draw the hash table that results from inserting the keys 4, 8, 15, 16,
23, and 42 (with whatever values you like).
□
##### Exercise: relax bucket RI [✭✭]
In lecture, we required that hash table buckets must not contain
duplicates. What would happen if we relaxed this RI to allow
duplicates? Would the efficiency of any operations (insert, find, or
remove) change?
□
##### Exercise: strengthen bucket RI [✭✭]
What would happen if we strengthened the bucket RI to require
each bucket to be sorted by the key? Would the efficiency of
any operations (insert, find, or remove) change?
□
##### Exercise: hash values [✭✭]
`Hashtbl.hash : 'a -> int` is a hash function provided by
the standard library. This rather remarkable
function can transform a value of any type into an integer.
Use it to hash several values of different types.
Make sure to try at least `()`, `false`, `true`, `0`, `1`, `""`,
and `[]`, as well as several "larger" values of each type.
Try hashing lists of integers of increasing length (e.g.,
`[0]`, `[0;1]`, `[0;1;2]`, ...). How long can the list get
before you find a collision?
□
## Hashtbl
OCaml's `Hashtbl` module offers two kinds of hash tables. The
first (and simpler to use) is for clients who are happy to use
OCaml's built-in hash function. The second is for clients
who want to supply their own hash function; naturally, that is done
with a functor.
For the following exercises, consult the [documentation of
`Hashtbl`][hashtbl].
[hashtbl]: http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html
##### Exercise: hashtbl usage [✭✭]
Create a hash table `tab` with `Hashtbl.create` whose initial size is 16.
(You can safely ignore the optional argument to that function
for this lab.) Add 31 bindings to it with `Hashtbl.add`.
For example, you could add the numbers 1..31 as keys and
the strings "1".."31" as their values. Use `Hashtbl.find`
to look for keys that are in `tab`, as well as keys
that are not.
□
##### Exercise: hashtbl bindings [✭✭]
Define a function `bindings : ('a,'b) Hashtbl.t -> ('a*'b) list`,
such that `bindings h` returns a list of all bindings in `h`.
Use your function to see all the bindings in `tab`. *Hint: fold.*
□
##### Exercise: hashtbl stats [✭]
Use the `Hashtbl.stats` function to find out the statistics
of `tab`. How many buckets are in the table? How many buckets
have a single binding in them?
□
##### Exercise: hashtbl load factor [✭✭]
Define a function `load_factor : ('a,'b) Hashtbl.t -> float`,
such that `load_factor h` is the load factor of `h`.
What is the load factor of `tab`? *Hint: stats.*
□
##### Exercise: hashtbl load factor [✭]
Add one more binding to `tab`. Do the stats or load factor change?
Now add yet another binding. Now do the stats or load factor change?
*Hint: Hashtbl [resizes][resize] when the load factor goes strictly
above 2.*
[resize]: https://github.com/ocaml/ocaml/blob/trunk/stdlib/hashtbl.ml#L167
□
## Functorial Hashtbl
##### Exercise: functorial interface [✭✭✭]
Use the functorial interface (i.e., `Hashtbl.Make`) to
create a hash table whose keys are strings that are
case insensitive. Be careful to obey the specification of
`Hashtbl.HashedType.hash`:
> If two keys are equal according to `equal`, then they have
> identical hash values as computed by `hash`.
□
##### Exercise: equals and hash [✭✭]
The previous exercise quoted the specification of `Hashtbl.HashedType.hash`.
Compare that to Java's `Object.hashCode()` [specification][hashCode].
Why do they both have this similar requirement?
[hashCode]: https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#hashCode--
□
##### Exercise: bad hash [✭✭]
Use the functorial interface to create a hash table with a really
bad hash function (e.g., a constant function). Use the `stats`
function to see how bad the bucket distribution becomes.
□
## Challenge problem: Probing
In lecture we briefly mentioned *probing* as an alternative to
*chaining*. Probing can be effectively used in hardware implementations
of hash tables, as well as in databases. With probing, every bucket
contains exactly one binding. In case of a collision, we search forward
through the array, as described below.
**Find.** Suppose we are trying to find a binding in the table. We hash
the binding's key and look in the appropriate bucket. If there is
already a different key in that bucket, we start searching forward
through the array at the next bucket, then the next bucket, and so
forth, wrapping back around to the beginning of the array if necessary.
Eventually we will either
* find an empty bucket, in which case the key we're searching for is
not bound in the table;
* find the key before we reach an empty bucket, in which case we can
return the value; or
* never find the key or an empty bucket, instead wrapping back
around to the original bucket, in which case all buckets are full and
the key is not bound in the table. This case actually should
never occur, because we won't allow the load factor to get
high enough for all buckets to be filled.
**Insert.** Insertion follows the same algorithm as finding a key,
except that whenever we first find an empty bucket, we can insert the
binding there.
**Remove.** Removal is more difficult. Once the key is found,
we can't just make the bucket empty, because that would affect
future searches by causing them to stop early. Instead,
we can introduce a special "deleted" value into that bucket
to indicate that the bucket does not contain a binding but
the searches should not stop at it.
**Resizing.** Since we never want the array to become
completely full, we can keep the load factor near 1/4.
When the load factor exceeds 1/2, we can double the array,
bringing the load factor back to 1/4. When the load factor goes below
1/8, we can half the array, again bringing the load factor
back to 1/4. "Deleted" bindings complicate the definition
of load factor:
* When determining whether to double the table size,
we calculate the load factor as (# of bindings + # of deleted
bindings) / (# of buckets). That is, deleted bindings
contribute toward increasing the load factor.
* When determining whether the half the table size,
we calculate the load factor as (# of bindings) / (# buckets).
That is, deleted bindings do not count toward
increasing the load factor.
When rehashing the table, deleted bindings are of course
not re-inserted into the new table.
##### Exercise: linear probing [✭✭✭✭]
Implement a hash table that uses linear probing.
□