Reading: Blum, Hopcroft, Kannan, 6.2.1

Given a very large set of values, possibly with duplicates, how can you count the distinct elements without storing them all?

We'll introduce and use indicator variables.

Often we want to count how many times something happens in an experiment. For example, we might roll 8 dice and wish to count the number of 3's rolled.

A useful tool for this is an **indicator variable**. An indicator variable is a variable having value 1 if the event of interest happens, and 0 otherwise. The number of times something happens can be written as a sum of indicator variables.

In the dice example, we could choose a sample space consisting of the \(6^8\) possible combinations of die rolls. We could define an indicator variable \(I_1\) that takes value 1 on all outcomes with the first die showing 3 (and 0 otherwise), \(I_2\) that is 1 if the second die is 3, \(I_3\) which is 1 if the third die is 3, and so on.

Then the total number of 3's is given by the random variable \(N = I_1 + I_2 + \cdots + I_8\). By linearity of expectation, \(E(N) = E(I_1) + \cdots + E(I_8)\). The expected values of the individual \(I_i\) may be much easier to calculate than the expectation of \(N\) as a whole. For example, if we are told that each die lands 3 with probability \(1/6\), then \(E(I_i) = \sum_x x Pr(I_i = x) = 1 \cdot 1/6 + 0 \cdot 5/6 = 1/6\). Thus \(E(N) = 8 \cdot 1/6 = 4/3\).
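As a check on the indicator-variable calculation, here is a minimal Python sketch. The function names are our own, chosen for illustration. It computes \(E(N)\) once by summing the \(E(I_i)\) and once by brute-force enumeration of the sample space; the enumeration uses 4 dice rather than 8 only to keep the loop small.

```python
from fractions import Fraction
from itertools import product


def expected_by_linearity(num_dice, p=Fraction(1, 6)):
    """Sum E(I_i) over the dice, where E(I_i) = 1*p + 0*(1 - p)."""
    return sum(1 * p + 0 * (1 - p) for _ in range(num_dice))


def expected_by_enumeration(num_dice, target=3):
    """Average the number of `target` faces over all 6**num_dice outcomes."""
    total = 0
    outcomes = 0
    for outcome in product(range(1, 7), repeat=num_dice):
        # Each term is an indicator: 1 if that die shows the target, else 0.
        total += sum(1 for face in outcome if face == target)
        outcomes += 1
    return Fraction(total, outcomes)


print(expected_by_linearity(8))
print(expected_by_enumeration(4), expected_by_linearity(4))
```

Both methods agree on the 4-dice case, but the linearity version never has to look at the \(6^n\) outcomes.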

**Note:** This lecture follows the reading very closely. We have changed variables to be consistent with our convention that constants are lower case and random variables are uppercase, and we have de-emphasized hashing:

Text variable | Lecture variable | Description |
---|---|---|
\(d\) | \(d\) | (constant) The number of distinct inputs; this is what we want to estimate |
\(M\) | \(m\) | (constant) The number of possible hash values |
\(h(b_i)\) | \(X_i\) | (random variable) The hash of the \(i\)th value |
\(min\) | \(MIN\) | (random variable) The minimum of the \(X_i\) |
\(\frac{M}{min}\) | \(\widetilde{D}\) | (random variable) The estimated value of \(d\) |

Suppose we have a very large set of values, possibly with duplicates. We wish to count the distinct elements of the set. We'd like to do so *without* storing the whole set.

We give a probabilistic algorithm. There are a few big ideas here:

- uniformly distributed random values act a lot like evenly spaced values
- hashing can take a set and distribute it uniformly
- we can trade exactness and guaranteed correctness for efficiency

Suppose there are \(d\) distinct values, all in the range \(\{0,1,\dots,m-1\}\) for some \(m\), and they are evenly spaced. Then the smallest value would be about \(min \sim m/d\). If we knew only the minimum value and \(m\), then we could estimate \(d \sim m/min\). Notice also that adding duplicates doesn't change the minimum value, so this estimate counts what we want: the size of the set not counting duplicates.
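To make the evenly spaced case concrete, here is a small Python sketch. The particular numbers (\(m = 1000\), \(d = 10\)) and the helper name `estimate_from_min` are assumptions picked for illustration.

```python
def estimate_from_min(values, m):
    """Estimate the number of distinct values as m / min(values)."""
    return m / min(values)


m = 1000
d = 10
spacing = m // d
# Evenly spaced values 99, 199, ..., 999; the smallest is close to m / d.
evenly_spaced = [spacing * k - 1 for k in range(1, d + 1)]

with_duplicates = evenly_spaced * 3  # every value now appears three times

print(estimate_from_min(evenly_spaced, m))
print(estimate_from_min(with_duplicates, m))
```

The two printed estimates are identical, because repeating values cannot change the minimum.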

Suppose we have \(d\) independent random variables \(X_1, \dots, X_d\) taking values in \(\{0,1,\dots,m-1\}\). Suppose also that for all \(i\) and \(x\), \(Pr(X_i = x) = 1/m\). We will describe below how we transform our \(d\) points into these \(d\) random variables using hashing.

We will look at each \(X_i\) one at a time, and keep track of the minimum value. Let \(MIN\) be the random variable representing this minimum. We will then use the formula above to compute an estimate of \(d\): Let \(\widetilde{D} ::= m/MIN\).
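The one-pass procedure can be sketched in Python as follows. This is an illustration under our own assumptions: the function name `estimate_distinct` is hypothetical, Python's pseudo-random generator stands in for the \(X_i\), and the case \(MIN = 0\), which the formula \(m/MIN\) leaves undefined, is reported as infinity.

```python
import random


def estimate_distinct(stream, m):
    """One pass over the stream, storing only the running minimum."""
    running_min = None
    for x in stream:
        if running_min is None or x < running_min:
            running_min = x
    if running_min is None:
        raise ValueError("empty stream")
    if running_min == 0:
        return float("inf")  # assumption: treat m / 0 as unboundedly large
    return m / running_min


rng = random.Random(0)  # fixed seed so the run is repeatable
m = 2**32
d = 1000
samples = (rng.randrange(m) for _ in range(d))  # generator: nothing is stored
print(estimate_distinct(samples, m))
```

Only one number is kept in memory regardless of how long the stream is.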

Even though we might be unlucky, and \(\widetilde{D}\) might be far from \(d\), we will show that with "high probability", \(\widetilde{D}\) is "close" to \(d\). How close and how likely?

**Claim:** \(Pr(d/6 \leq \widetilde{D} \leq 6d) > 2/3 - d/m\).

**Note 1:** These numbers are not obvious; they are chosen because they work.

**Note 2:** This is a pretty weak bound: it basically says our estimate is within half an order of magnitude of the correct answer. Better algorithms exist, but this algorithm has the benefit of simplicity.

**Note 3:** If we use a 64-bit hash function, \(m = 2^{64}\). Thus if \(d\) is a billion (\(\sim 2^{30}\)) or even a trillion (\(\sim 2^{40}\)), the \(d/m\) term is tiny.
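The claim can also be checked empirically. The following Python sketch is a simulation, not a proof: it draws \(d\) pseudo-random values many times and reports how often \(\widetilde{D}\) lands in \([d/6, 6d]\). The parameters (\(d = 100\), \(m = 2^{32}\), 2000 trials) and the function name are assumptions chosen so the run finishes quickly.

```python
import random


def fraction_within_factor_six(d, m, trials, seed=0):
    """Fraction of trials in which d/6 <= m/MIN <= 6d."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        minimum = min(rng.randrange(m) for _ in range(d))
        if minimum == 0:
            continue  # assumption: m / 0 is counted as a miss
        estimate = m / minimum
        if d / 6 <= estimate <= 6 * d:
            hits += 1
    return hits / trials


d, m = 100, 2**32
observed = fraction_within_factor_six(d, m, trials=2000)
print(observed, 2 / 3 - d / m)
```

The first printed number is the observed frequency and the second is the bound from the claim, so the two can be compared directly.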

**(Partial) Proof:** The claim is equivalent to \(Pr(d/6 > \widetilde{D} \text{ or } \widetilde{D} > 6d) \lt 1/3 + d/m\), because the event \((d/6 \gt \widetilde{D} \text{ or } \widetilde{D} > 6d)\) is the complement of the event \((d/6 \leq \widetilde{D} \leq 6d)\).

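This would be true if \(Pr(d/6 > \widetilde{D}) \lt 1/6\) and \(Pr(\widetilde{D} > 6d) \lt 1/6 + d/m\), because the probability that at least one of two events happens is at most the sum of their probabilities. We will show the second half of this; for the first half, see the reading and homework.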

\(Pr(\widetilde{D} \gt 6d) = Pr(m/MIN \gt 6d) = Pr(MIN \lt m/6d)\). Now, \(MIN \lt m/6d\) if and only if there is some \(X_i \lt m/6d\). For any given \(X_i\), we know that \(Pr(X_i = 0) = Pr(X_i = 1) = \cdots = Pr(X_i = m/6d) = 1/m\). Therefore, \(Pr(X_i \lt m/6d) = \sum_{x = 0}^{m/6d - 1} 1/m = 1/6d\). It's also possible that \(X_i = m/6d\), which happens with probability \(1/m\), so \(Pr(X_i \leq m/6d) = 1/6d + 1/m\).

Since this is true for every \(X_i\), the probability that *any* of the \(X_i\) is \(\leq m/6d\) is less than or equal to the sum of the probabilities that each of the \(X_i\) is \(\leq m/6d\) (this is the union bound):

\[Pr(MIN \lt m/6d) \leq Pr(MIN \leq m/6d) \leq \sum_{i = 1}^{d} Pr(X_i \leq m/6d) = \sum_{i=1}^d \left(\frac{1}{6d} + \frac{1}{m}\right) = \frac{1}{6} + \frac{d}{m}\]

which is what we were trying to show.
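For a sense of how loose this bound is, the following Python sketch compares it with the exact probability in one small case. It uses the independence of the \(X_i\) to write \(Pr(MIN \lt t) = 1 - (1 - t/m)^d\) for an integer threshold \(t\). The values \(d = 10\) and \(m = 6000\) are assumptions chosen so that \(m/6d\) is an integer, and the function names are our own.

```python
from fractions import Fraction


def exact_prob_min_below(threshold, d, m):
    """Pr(MIN < threshold) for d independent uniform draws from {0, ..., m-1}."""
    p_single_at_or_above = Fraction(m - threshold, m)
    return 1 - p_single_at_or_above**d


def union_bound(d, m):
    """The bound 1/6 + d/m derived above."""
    return Fraction(1, 6) + Fraction(d, m)


d, m = 10, 6000
threshold = m // (6 * d)  # equals 100 exactly for these values
exact = exact_prob_min_below(threshold, d, m)
print(float(exact), float(union_bound(d, m)))
```

The exact probability comes out below the bound, as expected: summing the individual probabilities counts outcomes where several \(X_i\) are small more than once.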

**Note:** This material was covered Monday, but fits better in these lecture notes.

We have \(d\) distinct values in our set to be counted. How do we turn this into \(d\) independent uniform random variables?

The answer is to randomly select a hash function from a family of hash functions, and then apply the selected hash function to our \(d\) values.

Slightly more formally, let \(V\) be the set of possible values, and let \(\{v_1, v_2, \dots, v_d\}\) be the set of (distinct) values that we are trying to count (for example, \(V\) might be the set of all credit card numbers, and the \(v_i\) the actual credit card numbers used in some set of transactions).

We let our sample space \(S\) be a set of functions with domain \(V\) and codomain \(\{0,1,\dots,m-1\}\). Given an outcome in \(S\) (that is, a function \(h : V \to \{0,\dots,m-1\}\)), we let \(X_i ::= h(v_i)\).

A space of hash functions is "good" if these random variables have the properties we want. In particular, for two different values \(v_1 \neq v_2\), the variables \(X_1\) and \(X_2\) should be independent. Moreover, for any \(i\) and \(x\), the probability that \(X_i = x\) should be \(1/m\).

Such families of hash functions exist, although we do not (yet) have the tools to prove it.
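Putting the pieces together, here is a hedged Python sketch of the whole estimator. As a stand-in for "randomly selecting a hash function from a family", it uses BLAKE2b from the standard library with a random key. This is our assumption for illustration only: nothing here proves that keyed BLAKE2b has the independence and uniformity properties required above. The names `make_hash` and `estimate_distinct_hashed`, and the example strings, are hypothetical.

```python
import hashlib
import secrets

M = 2**64  # number of possible hash values for an 8-byte digest


def make_hash(key):
    """Return the function h selected by `key`, mapping strings to {0, ..., M-1}."""
    def h(value):
        digest = hashlib.blake2b(
            value.encode("utf-8"), digest_size=8, key=key
        ).digest()
        return int.from_bytes(digest, "big")
    return h


def estimate_distinct_hashed(values, h, m=M):
    """Compute m / MIN, where MIN is the smallest hash seen."""
    minimum = min(h(v) for v in values)
    return float("inf") if minimum == 0 else m / minimum


key = secrets.token_bytes(16)  # the random choice of hash function
h = make_hash(key)
cards = [f"card-{i}" for i in range(1000)]
print(estimate_distinct_hashed(cards, h))
print(estimate_distinct_hashed(cards * 5, h))  # same values, each repeated 5 times
```

Because equal inputs hash to equal outputs, the two estimates match exactly. The estimate itself changes from run to run, since each run picks a new key.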