Reading: MCS 20.1-20.5

Markov's inequality, Chebychev's inequality, Weak law of large numbers

Markov's and Chebychev's inequalities and the weak law of large numbers are useful because they apply to *any* random variable (almost any: Markov's requires that the random variable only output non-negative numbers). They let you put bounds on the probabilities that certain surprising things occur. They all have the form

\[Pr(\text{surprising thing}) \leq \text{bound}\]

For Markov's, the surprising thing is that the variable gives a large answer:

\[Pr(X \geq a) \leq \text{bound}\]

while for Chebychev's, the surprising thing is that the variable gives an answer far from the expected value:

\[Pr(|X - E(X)| \geq a) \leq \text{bound}\]

The weak law of large numbers has a more complicated setup. There, we are trying to estimate the true expected value of a random variable by measuring that variable on independent samples.

If \(X_1, X_2, \dots, X_n\) are the values for the first, second, etc. samples, then the weak law of large numbers says that the average of the \(X_i\) is likely to be close to the "true" average. More formally, if \(E(X_i) = μ\) for all \(i\), and if \(Var(X_i) = σ^2\) for all \(i\), then the weak law of large numbers says

\[Pr\left(\left|\frac{X_1 + X_2 + \cdots + X_n}{n} - μ\right| \geq a\right) \leq \text{bound}\]

Here is how I remember/understand the bounds. For Markov, the bound depends on \(X\) and \(a\). If \(X\) returns very large values on average (i.e. if \(E(X)\) is large), then it is likely that \(X\) is large, while if \(E(X)\) is very small, then it is quite unlikely that \(X\) is large. Bigger \(E(X)\) leads to bigger probability, so \(E(X)\) is in the numerator.

\(a\) is our definition of "large". It is more likely that I am taller than 3' than it is that I am taller than 10'. \(Pr(X \geq 3) \geq Pr(X \geq 10)\). Increasing \(a\) decreases the probability, so \(a\) goes in the denominator. This gives:

**Claim (Markov's inequality):** If \(X \geq 0\) and \(a > 0\), then \[Pr(X \geq a) \leq \frac{E(X)}{a}\]

Proof is below.

Note that \(X \geq 0\) is shorthand for \(∀k \in S, X(k) \geq 0\).
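The bound is easy to test numerically. The sketch below (the exponential distribution is an arbitrary illustrative choice of non-negative random variable, picked because \(E(X) = 1\)) compares the empirical frequency of \(X \geq a\) against \(E(X)/a\):

```python
import random

random.seed(0)

# Sanity check of Markov's inequality on simulated data.
# X ~ Exponential(1) is an arbitrary non-negative distribution
# with E(X) = 1; any non-negative distribution would do.
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]
mean = sum(samples) / n  # empirical stand-in for E(X)

for a in [1.0, 2.0, 5.0]:
    prob = sum(1 for x in samples if x >= a) / n
    print(f"a={a}: Pr(X >= a) ~ {prob:.4f} <= E(X)/a ~ {mean / a:.4f}")
```

Note that the empirical frequency satisfies the bound exactly here, because Markov's inequality also applies to the empirical distribution of the sample.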

For Chebychev, the bound again depends on \(X\) and \(a\), but now through how spread out the distribution of \(X\) is. If \(X\) is very spread out (has a large variance) then I am likely to sample a point far away from the expected value. If the values of \(X\) are concentrated, then I would be surprised to sample a value far from the mean (the probability would be low). Higher variance leads to higher probability, so \(Var(X)\) is in the numerator.

As with Markov's, increasing my notion of "large" decreases the probability that I will cross the threshold. This tells me \(a\) is in the denominator. As discussed above, \(Var\) is in the units of \(X\) squared, while (since we are comparing \(a\) to \(X\)) \(a\) is in the units of \(X\). That reminds me that \(a\) should be squared in the denominator (since probabilities are unitless). This leads to:

**Claim (Chebychev's inequality):** For any \(X\) and any \(a > 0\), \[Pr\left(|X - E(X)| \geq a\right) \leq \frac{Var(X)}{a^2}\]

Proof is below.
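Chebychev's bound can be checked the same way; the normal distribution below is again just an arbitrary illustrative choice:

```python
import random

random.seed(1)

# Sanity check of Chebychev's inequality. X ~ Normal(0, 2) is an
# arbitrary distribution; the bound holds for any X.
n = 100_000
samples = [random.gauss(0.0, 2.0) for _ in range(n)]

mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n  # empirical Var(X)

for a in [2.0, 4.0, 6.0]:
    prob = sum(1 for x in samples if abs(x - mean) >= a) / n
    print(f"a={a}: Pr(|X - E(X)| >= a) ~ {prob:.4f} <= Var(X)/a^2 ~ {var / a**2:.4f}")
```

For a normal distribution the true probabilities are far below the bound, which is a reminder that Chebychev's is a worst-case guarantee over all distributions, not a tight estimate.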

For the weak law of large numbers, we see that increasing the variance of the samples makes it more likely that we mis-estimate the true average. As with Markov's and Chebychev's inequalities, increasing \(a\) decreases the probability that the sample average deviates from \(μ\) by at least \(a\). Moreover, taking more samples makes it less likely that the computed average will be far from the real value, so \(n\) is also in the denominator. This gives

**Claim (Weak law of large numbers):** If \(X_1, X_2, \dots, X_n\) are independent random variables satisfying \(E(X_i) = μ\) and \(Var(X_i) = σ^2\) then \[Pr\left(\left|\frac{X_1 + X_2 + \cdots + X_n}{n} - μ\right| ≥ a\right) \leq \frac{σ^2}{na^2}\]

Proof is below.
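A quick simulation makes the concentration visible. Here the samples are fair die rolls (so \(μ = 3.5\) and \(σ^2 = 35/12\); the die is just a convenient example), and we watch the deviation probability shrink as \(n\) grows:

```python
import random

random.seed(2)

# Average n fair die rolls. For one die, mu = 3.5 and sigma^2 = 35/12,
# so the WLLN bound on Pr(|avg - mu| >= 0.5) is (35/12) / (n * 0.25).
mu, var, a = 3.5, 35 / 12, 0.5
trials = 2000
results = {}

for n in [10, 100, 1000]:
    misses = 0
    for _ in range(trials):
        avg = sum(random.randint(1, 6) for _ in range(n)) / n
        if abs(avg - mu) >= a:
            misses += 1
    results[n] = misses / trials
    print(f"n={n}: Pr(|avg - mu| >= {a}) ~ {results[n]:.3f}, bound = {var / (n * a * a):.3f}")
```

For \(n = 10\) the bound exceeds 1 and says nothing; by \(n = 1000\) both the bound and the observed frequency are tiny.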

Suppose we want to build a door such that 90% of the population can walk through it without ducking. How tall should we build the door? Suppose all we know is that the average height is 5.5 feet.

We want to find a height \(a\) such that \(Pr(X \lt a) \geq 9/10\). This is the same as requiring that \(Pr(X \geq a) \leq 1/10\).

If we knew that \(E(X)/a \leq 1/10\), then Markov's inequality would tell us that \(Pr(X \geq a) \leq 1/10\). Solving for \(a\), we see that if \(a \geq (5.5)(10)\), then \(E(X)/a \leq 1/10\), so \(Pr(X \geq a) \leq E(X)/a \leq 1/10\).

Therefore, if we build the door 55 feet tall, we are guaranteed that at least 90% of the people can pass through it without ducking.
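The arithmetic is small enough to verify directly:

```python
# Markov: Pr(X >= a) <= E(X)/a. To push the bound down to 1/10,
# we need E(X)/a <= 1/10, i.e. a >= 10 * E(X).
mean_height = 5.5         # given: average height in feet
allowed_duckers = 1 / 10  # at most 10% of people may need to duck

door_height = mean_height / allowed_duckers
print(door_height)  # door height in feet guaranteed by Markov's bound
```

A 55-foot door is absurd in practice, which is the point: knowing only the mean, Markov's is the best we can do.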

**Exercise:** suppose you also knew that everyone was taller than 4'. Use Markov's to build a smaller door.

Next lecture, we will use Chebychev's to get a much tighter bound.

**Claim (Markov's inequality):** If \(X \geq 0\) and \(a > 0\), then \(Pr(X \geq a) \leq E(X)/a\).

**Proof:** We start by expanding \(E(X)\):

\[ \begin{aligned} E(X) &= \sum_{x \in \mathbb{R}} x Pr(X = x) && \text{by definition} \\ &= \sum_{x \lt a} x Pr(X = x) + \sum_{x \geq a} x Pr(X = x) && \text{rearranging terms} \\ &\geq \sum_{x \geq a} xPr(X = x) && \text{the first sum is non-negative since $X \geq 0$; dropping it cannot increase the total} \\ &\geq \sum_{x \geq a} aPr(X = x) && \text{since $x \geq a$ for all terms in the sum} \\ &= a\sum_{x \geq a} Pr(X = x) && \text{algebra} \\ &= aPr(X \geq a) && \text{by the third axiom, since the event $(X \geq a)$ is $\bigcup_{x \geq a} (X = x)$} \\ \end{aligned} \]

Dividing both sides by \(a\) gives the result.

**Claim (Chebychev's inequality):** For any \(X\) and any \(a > 0\), \[Pr\left(|X - E(X)| \geq a\right) \leq \frac{Var(X)}{a^2}\]

**Proof:** Note that \(|X - E(X)| \geq a\) if and only if \((X - E(X))^2 \geq a^2\). Therefore, \(Pr(|X - E(X)| \geq a) = Pr((X - E(X))^2 \geq a^2)\). Moreover, \((X - E(X))^2 \geq 0\), so we can apply Markov's inequality to conclude \(Pr(|X - E(X)| ≥ a) ≤ E((X - E(X))^2)/a^2\). By definition, this is equal to \(Var(X)/a^2\) as required.

**Claim (Weak law of large numbers):** If \(X_1, X_2, \dots, X_n\) are independent random variables satisfying \(E(X_i) = μ\) and \(Var(X_i) = σ^2\) then \[Pr\left(\left|\frac{X_1 + X_2 + \cdots + X_n}{n} - μ\right| ≥ a\right) \leq \frac{σ^2}{na^2}\]

**Proof:** We want to apply Chebychev's inequality to the random variable \(X = \sum X_i/n\). By linearity of expectation, we see that \(E(X) = \sum E(X_i)/n = \sum μ/n = μ\). Therefore we can apply Chebychev's inequality to conclude \[Pr\left(\left|\frac{X_1 + X_2 + \cdots + X_n}{n} - μ\right| ≥ a\right) \leq \frac{Var(X)}{a^2}\]

Now, we will use two properties of variance (proofs left as exercises):

- if \(X\) and \(Y\) are independent then \(Var(X + Y) = Var(X) + Var(Y)\)
- if \(c\) is a constant, then \(Var(cX) = c^2Var(X)\).

Applying these to \(X\), we see \[Var(X) = Var\left(\sum X_i/n\right) = \sum Var(X_i)/n^2 = nσ^2/n^2 = σ^2/n\] Plugging this in gives the result.
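Both variance facts are easy to sanity-check numerically using empirical variances of independent samples (the uniform and exponential distributions below are arbitrary choices):

```python
import random

random.seed(3)

# Empirical check of the two variance facts used above. The sample
# lists xs and ys are generated independently, as the additivity
# property requires.
def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

n = 200_000
xs = [random.uniform(0, 1) for _ in range(n)]
ys = [random.expovariate(2.0) for _ in range(n)]

# Var(X + Y) should be close to Var(X) + Var(Y) for independent X, Y.
print(var([x + y for x, y in zip(xs, ys)]), var(xs) + var(ys))

# Var(cX) = c^2 Var(X), here with c = 3.
print(var([3 * x for x in xs]), 9 * var(xs))
```

The first comparison is only approximate (the empirical covariance of independent samples is small but nonzero); the second holds up to floating-point rounding.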