Lecture 9: expectation

Equivalent definitions of expectation

Last lecture we gave two definitions of expectation.

Definition 1: \(E(X) := \sum_{s ∈ S} X(s)Pr(\{s\})\)

Definition 2: \(E(X) := \sum_{x ∈ ℝ} x Pr(X = x)\)

Claim: These two definitions are equivalent.

Proof: We can group together terms in the first sum having the same value of \(X\):

\[\sum_{s ∈ S} X(s) Pr(\{s\}) = \sum_{x ∈ ℝ} \sum_{s \mid X(s) = x} X(s)Pr(\{s\})\]

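Within each inner sum \(X(s) = x\), so we can factor \(x\) out. We then apply the third Kolmogorov axiom, using the fact that the events \(\{s\}\) with \(X(s) = x\) are disjoint and their union is the event \((X = x)\):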

\[\cdots = \sum_{x ∈ ℝ} x \sum_{s \mid X(s) = x} Pr(\{s\}) = \sum_{x ∈ ℝ} xPr(X = x)\]
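The grouping argument can be checked numerically on a small example. The following is a minimal sketch, assuming a sample space of two fair six-sided dice with \(X\) the sum of the faces; the sample space and the names `S`, `pr`, `pmf`, `e_def1` and `e_def2` are illustrative choices, not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair six-sided dice (illustrative choice).
S = list(product(range(1, 7), repeat=2))
pr = {s: Fraction(1, 36) for s in S}  # Pr({s}) for each outcome s


def X(s):
    """Random variable: the sum of the two dice."""
    return s[0] + s[1]


# Definition 1: sum over outcomes s of X(s) * Pr({s}).
e_def1 = sum(X(s) * pr[s] for s in S)

# Definition 2: sum over values x of x * Pr(X = x).
# Pr(X = x) is found by adding Pr({s}) over the outcomes with X(s) = x,
# which is exactly the grouping step used in the proof.
pmf = {}
for s in S:
    pmf[X(s)] = pmf.get(X(s), Fraction(0)) + pr[s]
e_def2 = sum(x * p for x, p in pmf.items())

print(e_def1, e_def2)  # prints: 7 7
```

Exact fractions are used instead of floats so that the two sums can be compared with `==` without rounding concerns.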

Aside: the expectation function; function spaces

Note that \(E\) by itself is a function; it takes in random variables and gives back numbers. So the domain of \(E\) is the set of all functions with domain \(S\) and codomain \(ℝ\).

Notation: In general, the set of functions with domain \(A\) and codomain \(B\) is written \([A → B]\).

Therefore, \(E : [S → ℝ] → ℝ\).
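Readers who program may recognize \(E\) as a higher-order function: a function whose argument is itself a function. Here is a rough sketch in Python, assuming a finite sample space stored as a dictionary of probabilities; the names `Outcome`, `RandomVariable`, `make_expectation` and `indicator_heads` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from typing import Callable, Dict

Outcome = str                                    # elements of S (illustrative choice)
RandomVariable = Callable[[Outcome], Fraction]   # an element of [S -> R]


def make_expectation(pr: Dict[Outcome, Fraction]) -> Callable[[RandomVariable], Fraction]:
    """Build E : [S -> R] -> R for the probability space described by pr."""
    def E(X: RandomVariable) -> Fraction:
        # Definition 1: sum over outcomes s of X(s) * Pr({s}).
        return sum((X(s) * p for s, p in pr.items()), Fraction(0))
    return E


# A single fair coin flip: S = {h, t}.
E = make_expectation({"h": Fraction(1, 2), "t": Fraction(1, 2)})


def indicator_heads(s: Outcome) -> Fraction:
    """Random variable that is 1 on heads and 0 on tails."""
    return Fraction(1) if s == "h" else Fraction(0)


print(E(indicator_heads))  # prints: 1/2
```

The type of `E` here, `Callable[[RandomVariable], Fraction]`, plays the role of \([S → ℝ] → ℝ\).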

Linearity of expectation

Claim: If \(X\), \(Y\) are RVs, then \(E(X+Y) = E(X) + E(Y)\).

Proof: We compute:

\[ \begin{aligned} E(X + Y) &= \sum_{s} (X+Y)(s)Pr(\{s\}) && \text{by definition of $E$} \\ &= \sum_{s} (X(s) + Y(s))Pr(\{s\}) && \text{by definition of $X+Y$} \\ &= \sum_{s} X(s)Pr(\{s\}) + \sum_{s} Y(s)Pr(\{s\}) && \text{algebra} \\ &= E(X) + E(Y) && \text{by definition of $E$} \\ \end{aligned} \]

Fact: If \(C\) is a constant RV with value \(c\) (that is, \(C(s) = c\) for all \(s\)), then \(E(CX) = cE(X)\).

Fact: If \(C\) is a constant RV with value \(c\), then \(E(C) = c\).

Proofs: left as review exercises.

Note: We usually don't make the distinction between the number \(c\) and the random variable \(C\); so the above are often written \(E(cX) = cE(X)\) and \(E(c) = c\).

Note: The facts that \(E(X + Y) = E(X)+E(Y)\) and \(E(cX) = cE(X)\) are summarized by saying that "expectation is linear".
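These identities are easy to sanity-check on a concrete space. The sketch below assumes a single fair six-sided die, with \(X\) the face value and \(Y\) the indicator that the face is odd; these variables and the constant `c = 3` are arbitrary illustrative choices. Sums and scalar multiples of random variables are built pointwise, as in the proof.

```python
from fractions import Fraction

S = range(1, 7)                       # one fair six-sided die (illustrative choice)
pr = {s: Fraction(1, 6) for s in S}   # Pr({s})


def E(X):
    """Expectation via Definition 1: sum of X(s) * Pr({s}) over outcomes s."""
    return sum(X(s) * pr[s] for s in S)


def X(s):
    return s          # face value


def Y(s):
    return s % 2      # indicator that the face is odd


c = 3


def X_plus_Y(s):
    return X(s) + Y(s)    # pointwise sum


def cX(s):
    return c * X(s)       # pointwise scalar multiple


def C(s):
    return c              # constant random variable


assert E(X_plus_Y) == E(X) + E(Y)
assert E(cX) == c * E(X)
assert E(C) == c
print(E(X), E(Y), E(X_plus_Y))  # prints: 7/2 1/2 4
```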

Independent random variables; expectation of the product

It is not generally the case that \(E(XY) = E(X)E(Y)\). For example, imagine a single fair coin flip, and let \(X\) be the indicator variable for the flip being heads. That is, \(S = \{h,t\}\), \(X(h) = 1\), and \(X(t) = 0\).

We see \(E(X) = 1/2\). Moreover, \(X\cdot X = X\), because \((X \cdot X)(h) = X(h)X(h) = 1\) and \((X \cdot X)(t) = X(t)X(t) = 0\).

Thus \(E(X\cdot X) = E(X) = 1/2\) but \(E(X)E(X) = 1/4\).

However, we have the following:

Definition: Two random variables \(X\) and \(Y\) are independent if the events \(X = x\) and \(Y = y\) are independent for all \(x\) and \(y\).

Claim: If \(X\) and \(Y\) are independent, then \(E(XY) = E(X)E(Y)\).

Proof: Well,

\[ \begin{aligned} E(X)E(Y) &= \left(\sum_{x} xPr(X = x)\right)\left(\sum_{y} yPr(Y = y)\right) \\ &= \sum_{x,y} xyPr(X=x)Pr(Y=y) \\ &= \sum_{x,y} xyPr(X=x \cap Y=y) && \text{since $X$ and $Y$ are independent} \\ &= \sum_{z} \sum_{x,y \text{ with } xy=z} xyPr(X = x \cap Y = y) && \text{grouping terms} \\ &= \sum_{z} z\sum_{x,y \text{ with } xy=z} Pr(X = x \cap Y = y) \\ \end{aligned} \]

Now, the union of the events \((X = x) \cap (Y = y)\) over all \(x\) and \(y\) with \(xy = z\) is just the event \(XY = z\). Moreover, these events are disjoint, so by the third Kolmogorov axiom we have

\[\sum_{x,y \text{ with } xy=z} Pr(X = x \cap Y = y) = Pr(XY = z)\]

Plugging this in gives

\[E(X)E(Y) = \cdots = \sum_z zPr(XY = z) = E(XY)\]

by definition.
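Both the claim and the earlier counterexample can be checked by brute force. This is a minimal sketch, assuming two fair coin flips with \(X\) and \(Y\) the indicators of heads on the first and second flip; the helper names `prob` and `independent` are hypothetical. The `independent` function tests the definition directly, over the values that \(X\) and \(Y\) actually take (for any other value both sides of the condition are zero).

```python
from fractions import Fraction
from itertools import product

S = list(product("ht", repeat=2))     # two fair coin flips (illustrative choice)
pr = {s: Fraction(1, 4) for s in S}   # Pr({s})


def E(X):
    """Expectation via Definition 1."""
    return sum(X(s) * pr[s] for s in S)


def prob(event):
    """Probability of the event {s in S : event(s) is true}."""
    return sum(pr[s] for s in S if event(s))


def independent(X, Y):
    """Check Pr(X = x and Y = y) == Pr(X = x) * Pr(Y = y) for all x, y."""
    xs = {X(s) for s in S}
    ys = {Y(s) for s in S}
    return all(
        prob(lambda s: X(s) == x and Y(s) == y)
        == prob(lambda s: X(s) == x) * prob(lambda s: Y(s) == y)
        for x in xs
        for y in ys
    )


def X(s):
    return 1 if s[0] == "h" else 0    # first flip is heads


def Y(s):
    return 1 if s[1] == "h" else 0    # second flip is heads


def XY(s):
    return X(s) * Y(s)                # pointwise product


def XX(s):
    return X(s) * X(s)


print(independent(X, Y), E(XY), E(X) * E(Y))  # prints: True 1/4 1/4
print(independent(X, X), E(XX), E(X) * E(X))  # prints: False 1/2 1/4
```

The second line reproduces the counterexample: \(X\) is not independent of itself, and \(E(X \cdot X) \neq E(X)E(X)\).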


Variance

Variance is a measure of how spread out a distribution is. You might ask "how far are the samples from the mean, on average?". This suggests finding the expectation of the random variable \(X - E(X)\) (this is the RV describing the signed deviation from the expected value). Unfortunately, \(E(X - E(X)) = 0\) (exercise), because the positive and negative values of \(X - E(X)\) cancel out on average. We could imagine taking the absolute value, but it turns out that squaring the deviation gives a quantity with nicer properties. This gives the definition of variance:

Definition: For a random variable \(X\), \(Var(X) = E\left((X - E(X))^2\right)\).

If \(X\) is measured in a unit (such as inches) then the variance is measured in units squared (e.g. inches squared). Thus, it is often more useful to work with the square root of the variance, which is called the standard deviation:

Definition: The standard deviation of \(X\) is \(\sqrt{Var(X)}\).
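As a worked instance of these definitions, here is a small sketch that assumes \(X\) is the face value of one fair six-sided die (an illustrative choice, as are the names `deviation` and `squared_deviation`). It also confirms that the plain deviation \(X - E(X)\) has expectation zero, which is why the square is needed.

```python
from fractions import Fraction
from math import sqrt

S = range(1, 7)                       # one fair six-sided die (illustrative choice)
pr = {s: Fraction(1, 6) for s in S}   # Pr({s})


def E(X):
    """Expectation via Definition 1."""
    return sum(X(s) * pr[s] for s in S)


def X(s):
    return s                          # face value


mu = E(X)                             # E(X) = 7/2


def deviation(s):
    return X(s) - mu                  # the random variable X - E(X)


def squared_deviation(s):
    return deviation(s) ** 2          # the random variable (X - E(X))^2


mean_deviation = E(deviation)         # positive and negative deviations cancel
variance = E(squared_deviation)       # Var(X) by definition
std_dev = sqrt(float(variance))       # standard deviation, about 1.708

print(mean_deviation, variance)  # prints: 0 35/12
```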

The following formula for the variance is often easier to compute in practice:

Claim: \(Var(X) = E(X^2) - (E(X))^2\).

Proof: Note that random variables satisfy the normal rules of arithmetic, because sums and products of random variables are evaluated pointwise. For example, we can show \(X(Y+Z) = XY + XZ\) as follows:

\[[X(Y+Z)](s) = (X(s))((Y+Z)(s)) = X(s)(Y(s) + Z(s)) = X(s)Y(s) + X(s)Z(s) = [XY + XZ](s)\]

Using this, the proof of the claim is just algebra:

\[ \begin{aligned} Var(X) &= E((X - E(X))^2) \\ &= E(X^2 - 2XE(X) + E(X)^2) \\ &= E(X^2) - 2E(XE(X)) + E(E(X)^2) && \text{by linearity of expectation} \\ &= E(X^2) - 2E(X)^2 + E(X)^2 && \text{because $E(X)$ and $E(X)^2$ are constants} \\ &= E(X^2) - E(X)^2 \end{aligned} \]
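The two formulas for the variance can be compared on a non-uniform example. This sketch assumes a made-up three-outcome space with probabilities \(1/2\), \(1/3\), \(1/6\) and values \(-1\), \(0\), \(4\); all of these numbers and the names `var_by_definition` and `var_by_formula` are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
from fractions import Fraction

# A made-up probability space and random variable (illustrative values only).
pr = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
values = {"a": -1, "b": 0, "c": 4}


def E(X):
    """Expectation via Definition 1."""
    return sum(X(s) * p for s, p in pr.items())


def X(s):
    return values[s]


def X_squared(s):
    return X(s) ** 2                  # the random variable X^2, evaluated pointwise


mu = E(X)                             # E(X) = 1/6

# Var(X) from the definition E((X - E(X))^2).
var_by_definition = E(lambda s: (X(s) - mu) ** 2)

# Var(X) from the claim E(X^2) - (E(X))^2.
var_by_formula = E(X_squared) - mu ** 2

print(var_by_definition, var_by_formula)  # prints: 113/36 113/36
```

The second form needs only \(E(X)\) and \(E(X^2)\), so it avoids building the deviation \(X - E(X)\) at all.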