Reading: Blum, Hopcroft, Kannan, 6.2.1

We proved Chebyshev's inequality and the weak law of large numbers; these proofs have been added to the previous lecture's notes.

Hashing

Given a very large set of values, possibly with duplicates, how can you count the distinct elements without storing them all?

Uniformly distributed independent random variables behave nicely on average. Hashing is a way to transform data into a set of uniformly distributed independent random variables.

To use hashing, one first selects a random hash function from a family of hash functions. Applying that function to each data point yields a "hash value"; these hash values vary depending on the random hash function that was chosen, making them random variables.

More formally, a **family of hash functions** from \(A\) to \(B\) is just an \([A→B]\)-valued random variable. That is, it is a function \(H : S → [A → B]\). Here \(S\) is a sample space (in the context of hashing, outcomes in \(S\) are often called "seeds"). The data points to be hashed are elements of the set \(A\), and the hash values are elements of the set \(B\). \(B\) is often chosen to be \(\{0,1,\dots,m-1\}\) for some \(m\).

Given a family of hash functions \(H\) and a data point \(a \in A\), the hash of \(a\) is the random variable \(H_a : S → B\) given by \(H_a(s) := H(s)(a)\).

A family \(H\) of hash functions is **good** if

- For all \(a \in A\) and \(b \in B\), \(Pr(H_a = b) = 1/|B|\) (in other words, \(H_a\) is uniformly distributed for all \(a\)), and
- For all \(a_1 \neq a_2\), \(H_{a_1}\) and \(H_{a_2}\) are independent.

Informally, the first condition says that for a fixed data point, all hash values are equally likely, and the second condition says that knowing the hash of one data point tells you nothing about the hash of any other data point.
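To make the definition concrete, here is a minimal Python sketch of a hash family (not from the reading; `blake2b` mixed with a seed stands in for an idealized good family, and the function name `hash_family` is our own):

```python
import hashlib

def hash_family(seed: int, m: int):
    """Return the hash function h = H(seed) : A -> {0, 1, ..., m-1}.

    The seed plays the role of an outcome s in the sample space S;
    fixing it selects one concrete function from the family.
    """
    def h(a) -> int:
        # Mix the seed with the data point, then reduce modulo m.
        data = f"{seed}|{a}".encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % m
    return h

# Choosing a seed selects one function; the same seed always gives
# the same function, so hash values are reproducible.
h = hash_family(seed=42, m=1000)
```

Once the seed is fixed, \(h\) is an ordinary deterministic function; the randomness lives entirely in the choice of seed.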

**Note:** This lecture follows the reading very closely. We have changed variables to be consistent with our convention that constants are lower case and random variables are uppercase, and we have de-emphasized hashing:

Text variable | Lecture variable | Description |
---|---|---|
\(d\) | \(d\) | (constant) The number of distinct inputs; this is what we want to estimate |
\(M\) | \(m\) | (constant) The number of possible hash values |
\(h(b_i)\) | \(X_i\) | (random variable) The hash of the \(i\)th value |
\(min\) | \(MIN\) | (random variable) The minimum of the \(X_i\) |
\(\frac{M}{min}\) | \(\widetilde{D}\) | (random variable) The estimated value of \(d\) |

Suppose we have a very large set of values, possibly with duplicates. We wish to count the distinct elements of the set. We'd like to do so *without* storing the whole set.

We give a probabilistic algorithm. There are a few big ideas here:

- uniformly distributed random values act a lot like evenly spaced values
- hashing can take a set and distribute it uniformly
- we can trade exactness and guaranteed correctness for efficiency

Suppose there are \(d\) distinct values, all in the range \(\{0,1,\dots,m-1\}\) for some \(m\), and they are evenly spaced. Then the smallest value would be about \(min \sim m/d\). If we knew only the minimum value and \(m\), we could estimate \(d \sim m/min\). Notice also that adding duplicates doesn't change the minimum value, so this estimate counts exactly what we want: the size of the set, not counting duplicates.
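A tiny numerical check of the evenly spaced intuition (the specific values \(m = 1000\) and \(d = 8\) are just for illustration):

```python
m = 1000
d = 8

# Evenly spaced values with spacing m/d: 125, 250, ..., 1000.
values = [(i + 1) * m // d for i in range(d)]

minimum = min(values)     # m/d = 125
estimate = m // minimum   # recovers d = 8

# Duplicates don't change the minimum, so the estimate is unchanged:
assert min(values + values) == minimum
```

For evenly spaced data the estimate is exact; the point of the algorithm is that hashed data behaves like this on average.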

Of course, our input data may not be evenly spaced, but if we hash the input values, we get independent uniformly distributed random variables, and they act like evenly spaced data on average. This gives us the following algorithm:

- Select a random hash function \(h : A → \{0,1,2,\dots,m-1\}\)
- Apply \(h\) to each input value, and store the minimum \(min\)
- Return \(\tilde{d} = m/min\)
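The algorithm above can be sketched in a few lines of Python (a sketch, not the reading's implementation; `blake2b` keyed by a seed stands in for a good hash family, and only the running minimum is stored, so memory use is constant):

```python
import hashlib

def estimate_distinct(values, seed: int, m: int = 2**32) -> float:
    """Min-hash estimate of the number of distinct values, using O(1) memory."""
    def h(a) -> int:
        data = f"{seed}|{a}".encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % m

    # Guard against a hash of exactly 0 (probability ~ d/m), which
    # would otherwise divide by zero.
    minimum = max(1, min(h(a) for a in values))
    return m / minimum

# Duplicates don't change the minimum, so they don't change the estimate:
stream = list(range(1000)) * 5   # 5000 items, only 1000 distinct
d_tilde = estimate_distinct(stream, seed=7)
```

Note that the stream is consumed one value at a time; nothing is stored except the current minimum hash.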

Note that for a given set of data, the values of \(min\) and \(\tilde{d}\) are determined by the chosen hash function; since the hash function is chosen at random, \(MIN\) and \(\widetilde{D}\) are random variables.

Even though we might be unlucky, and \(\widetilde{D}\) might be far from \(d\), we will show that with "high probability", \(\widetilde{D}\) is "close" to \(d\). How close and how likely?

**Claim:** \(Pr(d/6 \leq \widetilde{D} \leq 6d) > 2/3 - d/m\).

**Note 1:** These numbers are not obvious; they are chosen because they work.

**Note 2:** This is a pretty weak bound: it only says our estimate is within a factor of 6 of the correct answer, and even that is only guaranteed with probability about 2/3. Better algorithms exist, but this algorithm has the benefit of simplicity.

**Note 3:** If we use a 64-bit hash function, \(m = 2^{64}\). Thus if \(d\) is a billion (\(\sim 2^{30}\)) or even a trillion (\(\sim 2^{40}\)), the \(d/m\) term is tiny.
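The claim can also be checked empirically. A simulation sketch (the choices of \(d = 1000\), 200 trials, seeds \(0,\dots,199\), and `blake2b` as a stand-in for a good hash family are all ours, not the reading's):

```python
import hashlib

def estimate(d: int, seed: int, m: int = 2**32) -> float:
    """One run of the estimator on d distinct inputs, with the given seed."""
    def h(a) -> int:
        data = f"{seed}|{a}".encode()
        return int.from_bytes(
            hashlib.blake2b(data, digest_size=8).digest(), "big") % m

    # Guard against a hash of 0 (probability ~ d/m).
    minimum = max(1, min(h(a) for a in range(d)))
    return m / minimum

d, trials = 1000, 200

# Fraction of trials in which the estimate lands within [d/6, 6d].
hits = sum(d / 6 <= estimate(d, seed) <= 6 * d for seed in range(trials))
success_rate = hits / trials
```

In practice the observed success rate comfortably exceeds the \(2/3 - d/m\) bound, which is consistent with the claim being conservative.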

On Wednesday I'll give an overview of the proof; the details are described in the reading, and you will be required to explain them in the homework.