Appendix A

A Primer on Probability

In this section, we provide a brief introduction to some basic probability theory; our goal is to provide the minimal preliminaries needed to understand the material in the book. Part of the text is taken almost verbatim from [PT10].

Originally motivated by gambling, the study of probability is now fundamental to a wide variety of subjects, including social behavior (e.g., economics and game theory) and physical laws (e.g., quantum mechanics and radioactive decay). But what is probability? What does it mean that a fair coin toss comes up heads with probability 50%? One interpretation is frequentist: “50%” means that if we toss the coin 10 million times, it will come up heads in roughly 5 million tosses. A different interpretation is Bayesian (or subjective): “50%” is a statement about our beliefs, and about how much we are willing to bet on a single coin toss. For the purposes of this book, the second interpretation will be the more relevant one.

A.1  Probability Spaces

In our treatment, we restrict our attention to discrete probability spaces:¹

Definition A.1 (Probability Space). A probability space is a pair (S, f) where S is a countable set called the sample space, and f : S → [0,1] is called the probability mass function.² Additionally, f satisfies the property ∑_{x∈S} f(x) = 1.

Intuitively, the sample space S corresponds to the set of possible states that the world could be in, and the probability mass function f assigns a probability from 0 to 1 to each of these states. To model our conventional notion of probability, we require that the total probability assigned by f to all possible states should sum up to 1.

Definition A.2 (Event). Given a probability space (S, f), an event is simply a subset of S. The probability of an event E, denoted by Pr_{(S,f)}[E] = Pr[E], is defined to be ∑_{x∈E} f(x). In particular, the event that includes “everything,” E = S, has probability Pr[S] = 1.

Even though events and probabilities are not well-defined without a probability space, by convention, we often omit S and f in our statements when they are clear from context.

Example A.1. Consider rolling a regular 6-sided die. The sample space is S = {1, 2, 3, 4, 5, 6}, and the probability mass function is constant: f(x) = 1/6 for all x ∈ S. We refer to such a probability mass function (i.e., one that is constant) as being equiprobable. The event of an even roll is E = {2, 4, 6}, and this occurs with probability

Pr[E] = ∑_{x∈{2,4,6}} f(x) = 1/2.
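As a sanity check, the example above can be encoded directly from Definitions A.1 and A.2; this is our own illustrative sketch (the names `f`, `pr`, and `even` are ours, not the text's).

```python
from fractions import Fraction

# Equiprobable mass function for a fair 6-sided die (Example A.1).
S = {1, 2, 3, 4, 5, 6}
f = {x: Fraction(1, 6) for x in S}

def pr(event):
    """Pr[E] = sum of f(x) over x in E (Definition A.2)."""
    return sum(f[x] for x in event)

even = {2, 4, 6}               # the event of an even roll
assert sum(f.values()) == 1    # f is a valid probability mass function
print(pr(even))                # 1/2
```

Using exact `Fraction` arithmetic avoids floating-point rounding, so the equality with 1/2 is exact.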

Some basic properties Now that probability spaces are defined, we give a few basic properties of probability:

Claim A.1. If A and B are disjoint events (A ∩ B = ∅), then Pr[A ∪ B] = Pr[A] + Pr[B].

Proof. By definition,

Pr[A ∪ B] = ∑_{x∈A∪B} f(x) = ∑_{x∈A} f(x) + ∑_{x∈B} f(x)   (since A and B are disjoint)
= Pr[A] + Pr[B],

which concludes the proof.

Corollary A.1. For any event E, Pr[Ē] = 1 − Pr[E].

Proof. This follows directly from Claim A.1, using Ē ∪ E = S and Ē ∩ E = ∅.

When events are not disjoint, we instead have the following:

Claim A.2. Given events A and B, Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B].

Proof. First observe that A ∪ B = (A ∖ B) ∪ (B ∖ A) ∪ (A ∩ B), and that all the terms on the RHS are disjoint. Therefore,

Pr[A ∪ B] = Pr[A ∖ B] + Pr[B ∖ A] + Pr[A ∩ B].     (A.1)

Similarly, we have

Pr[A] = Pr[A ∖ B] + Pr[A ∩ B]     (A.2)
Pr[B] = Pr[B ∖ A] + Pr[A ∩ B]     (A.3)

because, say, A is the disjoint union of A ∖ B and A ∩ B. Substituting (A.2) and (A.3) into (A.1) gives

Pr[A ∪ B] = Pr[A ∖ B] + Pr[B ∖ A] + Pr[A ∩ B]
= (Pr[A] − Pr[A ∩ B]) + (Pr[B] − Pr[A ∩ B]) + Pr[A ∩ B]
= Pr[A] + Pr[B] − Pr[A ∩ B],

which concludes the proof.

A useful corollary of Claim A.2 is the Union Bound.

Corollary A.2 (Union Bound). Given events A and B, Pr[A ∪ B] ≤ Pr[A] + Pr[B]. More generally, given events A1, …, An,

Pr[⋃_i A_i] ≤ ∑_i Pr[A_i].
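Both Claim A.2 and the union bound can be checked mechanically on a small example; the sketch below is our own, using the fair-die space from Example A.1 (the event names are ours).

```python
from fractions import Fraction

# One fair die, as in Example A.1.
S = range(1, 7)
f = {x: Fraction(1, 6) for x in S}
pr = lambda E: sum(f[x] for x in E)

A = {2, 4, 6}    # even roll
B = {4, 5, 6}    # roll of at least 4

# Claim A.2 (inclusion-exclusion):
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)
# Corollary A.2 (union bound); strict here since A and B overlap:
assert pr(A | B) <= pr(A) + pr(B)
print(pr(A | B))   # 2/3
```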

A.2  Conditional Probability

Suppose that after receiving a random 5-card hand dealt from a standard 52-card deck, we are told that the hand contains “at least a pair” (that is, at least two of the cards have the same rank). How do we calculate the probability of a full house given this extra information? Consider the following thought process: we restrict our attention to the outcomes consistent with the information we received (hands containing at least a pair), and we rescale the probabilities of these outcomes so that they once again sum to 1.

Motivated by this line of reasoning, we define conditional probability in the following way:

Definition A.3. Let A and B be events with Pr[B] ≠ 0. The conditional probability of A, conditioned on B, denoted by Pr[A | B], is defined as

Pr[A | B] = Pr[A ∩ B] / Pr[B].
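Definition A.3 translates directly into code. The sketch below is our own small example over two fair dice (the events chosen are illustrative, not from the text).

```python
from fractions import Fraction

# Two fair dice; the sample space is all ordered pairs.
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]
f = {s: Fraction(1, 36) for s in S}
pr = lambda E: sum(f[s] for s in E)

def pr_given(A, B):
    """Pr[A | B] = Pr[A ∩ B] / Pr[B] (Definition A.3)."""
    return pr(A & B) / pr(B)

A = {s for s in S if s[0] + s[1] == 12}   # the dice sum to 12
B = {s for s in S if s[0] == 6}           # the first die shows 6
print(pr_given(A, B))                      # 1/6
```

Unconditionally, Pr[A] = 1/36; learning that the first die is a 6 raises the probability of a total of 12 to 1/6, matching the renormalization intuition above.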

Independence By defining conditional probability, we have modeled how the occurrence of one event can affect the probability of another. An equally important concept is independence, where events do not affect one another.

Definition A.4 (Independence). A sequence of events A1, …, An is (mutually) independent if and only if for every subset of these events, A_{i1}, …, A_{ik},

Pr[A_{i1} ∩ A_{i2} ∩ ⋯ ∩ A_{ik}] = Pr[A_{i1}] · Pr[A_{i2}] ⋯ Pr[A_{ik}].

If there are just two events, A and B, then they are independent if and only if Pr[A ∩ B] = Pr[A] · Pr[B]. The following claim justifies the definition of independence.

Claim A.3. If A and B are independent events and Pr[B] ≠ 0, then Pr[A | B] = Pr[A].

Proof.

Pr[A | B] = Pr[A ∩ B] / Pr[B] = Pr[A] · Pr[B] / Pr[B] = Pr[A],

which concludes the proof.
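Independence can also be tested mechanically. The sketch below, ours and not the text's, again uses two fair dice: events about different dice are independent, while an event about the first die and an event about the sum need not be.

```python
from fractions import Fraction

S = [(i, j) for i in range(1, 7) for j in range(1, 7)]
f = {s: Fraction(1, 36) for s in S}
pr = lambda E: sum(f[s] for s in E)

A = {s for s in S if s[0] % 2 == 0}       # first die is even
B = {s for s in S if s[1] % 2 == 0}       # second die is even
C = {s for s in S if s[0] + s[1] >= 10}   # the dice sum to at least 10

# A and B concern different dice, and indeed Pr[A ∩ B] = Pr[A]·Pr[B] ...
assert pr(A & B) == pr(A) * pr(B)
# ... so by Claim A.3, conditioning on B does not change Pr[A].
assert pr(A & B) / pr(B) == pr(A)
# A and C are NOT independent: a large sum makes an even first die likelier.
assert pr(A & C) != pr(A) * pr(C)
```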

A.3  Bayes’ Rule

Suppose that we have a test against a rare disease that affects only 0.3% of the population, and that the test is 99% effective (i.e., if a person has the disease the test says YES with probability 0.99, and otherwise it says NO with probability 0.99). If a random person in the population tested positive, what is the probability that he has the disease? The answer is not 0.99. Indeed, this is an exercise in conditional probability: what are the chances that a random person has the rare disease, given the occurrence of the event that he tested positive?

We start with some preliminaries.

Claim A.4. Let A1, …, An be disjoint events with non-zero probability such that ⋃_i A_i = S (i.e., the events are exhaustive; they partition the sample space S). Let B be an event. Then Pr[B] = ∑_{i=1}^n Pr[B | A_i] · Pr[A_i].

Proof. By definition Pr[B | A_i] = Pr[B ∩ A_i] / Pr[A_i], and so the RHS equals

∑_{i=1}^n Pr[B ∩ A_i].

Since A1, …, An are disjoint, it follows that the events B ∩ A1, …, B ∩ An are also disjoint. Therefore,

∑_{i=1}^n Pr[B ∩ A_i] = Pr[⋃_{i=1}^n (B ∩ A_i)] = Pr[B ∩ ⋃_{i=1}^n A_i] = Pr[B ∩ S] = Pr[B],

which concludes the proof.

Theorem A.1 (Bayes’ Rule). Let A and B be events with non-zero probability. Then:

Pr[B | A] = Pr[A | B] · Pr[B] / Pr[A].

Proof. Multiply both sides by Pr[A]. Now by definition of conditional probabilities, both sides equal:

Pr[B | A] · Pr[A] = Pr[A ∩ B] = Pr[A | B] · Pr[B],

which concludes the proof.

Sometimes, it is useful to expand the statement of Bayes’ Rule with Claim A.4:

Corollary A.3 (Bayes’ Rule Expanded). Let A and B be events with non-zero probability. Then:

Pr[B | A] = (Pr[A | B] · Pr[B]) / (Pr[B] · Pr[A | B] + Pr[B̄] · Pr[A | B̄]).

Proof. The corollary follows directly by applying Claim A.4, using the facts that (a) B and B̄ are disjoint, and (b) B ∪ B̄ = S.

We return to our original question of testing for rare diseases. Let us consider the sample space S = {(t, d) | t ∈ {0,1}, d ∈ {0,1}}, where t represents the outcome of the test on a random person in the population, and d represents whether the same person carries the disease or not. Let D be the event that a randomly drawn person has the disease (d = 1), and T be the event that a randomly drawn person tests positive (t = 1).

We know that Pr[D] = 0.003 (because 0.3% of the population has the disease). We also know that Pr[T | D] = 0.99 and Pr[T | D̄] = 0.01 (because the test is 99% effective). Using Bayes’ rule, we can now calculate the probability that a random person who tested positive actually has the disease:

Pr[D | T] = Pr[T | D] · Pr[D] / (Pr[D] · Pr[T | D] + Pr[D̄] · Pr[T | D̄]) = (0.99 · 0.003) / (0.003 · 0.99 + 0.997 · 0.01) ≈ 0.23.

Notice that 23%, while significant, is a far cry from 99% (the effectiveness of the test). This final probability can vary if we have a different prior (initial belief). For example, if a random patient has other medical conditions that raise the probability of contracting the disease to 10%, then the final probability of having the disease, given a positive test, rises to about 92%.
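The calculation above is mechanical enough to script. The sketch below is our own; it implements Corollary A.3 directly, and the `posterior` helper name is ours.

```python
def posterior(prior, p_pos_given_d, p_pos_given_not_d):
    """Corollary A.3 applied to the disease test: returns Pr[D | T]."""
    num = p_pos_given_d * prior
    den = prior * p_pos_given_d + (1 - prior) * p_pos_given_not_d
    return num / den

# The numbers from the text: 0.3% base rate, 99% effective test.
print(round(posterior(0.003, 0.99, 0.01), 2))   # 0.23
# With a 10% prior (a patient with other risk factors), also from the text:
print(round(posterior(0.10, 0.99, 0.01), 2))    # 0.92
```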

Updating beliefs after multiple signals Our treatment so far discusses how to update our beliefs after receiving one signal (the outcome of the test). How should we update if we receive multiple signals? That is, how do we compute Pr[A | B1 ∩ B2]? To answer this question, we first need to define a notion of conditional independence.

Definition A.5 (Conditional Independence). A sequence of events B1, …, Bn is conditionally independent given an event A if and only if for every subset of the sequence of events, B_{i1}, …, B_{ik},

Pr[⋂_k B_{ik} | A] = ∏_k Pr[B_{ik} | A].

In other words, given that the event A has occurred, the events B1, …, Bn are independent.

When there are only two events, B1 and B2, they are conditionally independent given event A if and only if Pr[B1 ∩ B2 | A] = Pr[B1 | A] · Pr[B2 | A].

If the signals we receive are conditionally independent, we can still use Bayes’ rule to update our beliefs. More precisely, if we assume that the signals B1 and B2 are independent when conditioned on A, and also independent when conditioned on Ā, then:

Pr[A | B1 ∩ B2] = Pr[B1 ∩ B2 | A] · Pr[A] / (Pr[A] · Pr[B1 ∩ B2 | A] + Pr[Ā] · Pr[B1 ∩ B2 | Ā])
= Pr[B1 | A] · Pr[B2 | A] · Pr[A] / (Pr[A] · Pr[B1 | A] · Pr[B2 | A] + Pr[Ā] · Pr[B1 | Ā] · Pr[B2 | Ā]).

In general, given signals B1, …, Bn that are conditionally independent given A and conditionally independent given Ā, we have

Pr[A | ⋂_i B_i] = Pr[A] · ∏_i Pr[B_i | A] / (Pr[A] · ∏_i Pr[B_i | A] + Pr[Ā] · ∏_i Pr[B_i | Ā]).

Spam detection Using “training data” (e-mails classified as spam or not by hand), we can estimate the probability that a message contains a certain string conditioned on being spam (or not), for example Pr[ “viagra” | spam ], Pr[ “viagra” | not spam ]. We can also estimate the probability that a random e-mail is spam; that is, Pr[spam] (this is about 80% in real life, although most spam detectors are “unbiased” and assume Pr[spam] = 50% to make calculations nicer).

By choosing a diverse set of keywords, say W1, …, Wn, and assuming that the occurrences of these keywords are conditionally independent given a spam e-mail and conditionally independent given a non-spam e-mail, we can use Bayes’ rule to estimate the probability that an e-mail is spam based on the words it contains (the expression below is simplified under the assumption Pr[spam] = Pr[not spam] = 0.5):

Pr[spam | ⋂_i W_i] = ∏_i Pr[W_i | spam] / (∏_i Pr[W_i | spam] + ∏_i Pr[W_i | not spam]).
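This is the core of a naive Bayes spam filter. The sketch below is ours, with made-up per-keyword probabilities purely for illustration; real values would come from training data as described above.

```python
def spam_probability(p_w_spam, p_w_ham):
    """Pr[spam | all keywords present], assuming the keywords are
    conditionally independent and Pr[spam] = Pr[not spam] = 0.5,
    as in the simplified expression in the text."""
    prod_spam, prod_ham = 1.0, 1.0
    for ps, ph in zip(p_w_spam, p_w_ham):
        prod_spam *= ps   # Pr[W_i | spam]
        prod_ham *= ph    # Pr[W_i | not spam]
    return prod_spam / (prod_spam + prod_ham)

# Hypothetical per-keyword estimates (illustrative only, not real data):
p_w_spam = [0.8, 0.6, 0.7]
p_w_ham  = [0.1, 0.3, 0.2]
print(round(spam_probability(p_w_spam, p_w_ham), 3))   # ≈ 0.982
```

Even moderately spam-leaning keywords compound quickly under the independence assumption, which is why naive Bayes filters are effective despite their crude model.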

A.4  Random Variables

We use events to express whether a particular class of outcomes has occurred or not. Sometimes we want to express more: for example, after 100 fair coin tosses, we want to study how many of the tosses came up heads (instead of focusing on just one event, say, the event that exactly 50 tosses came up heads). This takes us to the definition of random variables.

Definition A.6. A random variable X on a probability space (S, f) is a function from the sample space to the real numbers, X : S → ℝ.

So, in the example of 100 coin tosses, given any outcome of the experiment s S, we would define X(s) to be the number of heads that occurred in that outcome.

Definition A.7. Given a random variable X on a probability space (S, f), we can consider a new probability space (S′, f_X) where the sample space is the range of X, S′ = {X(s) | s ∈ S}, and the probability mass function is derived from f by f_X(x) = Pr_{(S,f)}[{s ∈ S | X(s) = x}]. We call f_X the probability distribution or the probability density function of the random variable X.

Example A.2. Suppose we toss two 6-sided dice. The sample space consists of pairs of outcomes, S = {(i, j) | i, j ∈ {1, …, 6}}, and the probability mass function is equiprobable. Consider the random variables X1(i,j) = i, X2(i,j) = j, and X(i,j) = i + j. These random variables denote the outcome of the first die, the outcome of the second die, and the sum of the two dice, respectively. The probability density function of X takes the following values:

f_X(1) = 0
f_X(2) = Pr[{(1,1)}] = 1/36
f_X(3) = Pr[{(1,2), (2,1)}] = 2/36
f_X(6) = Pr[{(1,5), (2,4), (3,3), (4,2), (5,1)}] = 5/36
f_X(7) = Pr[{(1,6), (2,5), …, (6,1)}] = 6/36
f_X(8) = Pr[{(2,6), (3,5), …, (6,2)}] = 5/36 = f_X(6)
f_X(12) = Pr[{(6,6)}] = 1/36
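The full density f_X can be tabulated by iterating over the sample space; the short sketch below is our own check of Example A.2.

```python
from collections import Counter
from fractions import Fraction

# Probability density of X = X1 + X2 for two fair dice (Example A.2).
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]
fX = Counter()
for (i, j) in S:
    fX[i + j] += Fraction(1, 36)   # each outcome contributes mass 1/36

assert fX[2] == Fraction(1, 36)
assert fX[6] == fX[8] == Fraction(5, 36)
assert fX[7] == Fraction(6, 36)
assert sum(fX.values()) == 1   # f_X is itself a probability mass function
```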

Notation regarding random variables We can describe events by applying predicates to random variables (e.g., the event that X, the number of heads, is equal to 50). We often use a short-hand notation in which we treat random variables as if they were real numbers: if X is a random variable, we let, for example, “X = 50” denote the event {s ∈ S | X(s) = 50}. Using this notation, we may define the probability density function of a random variable X as f_X(x) = Pr[X = x].

In a similar vein, we can define new random variables from existing random variables. In Example A.2, we can write X = X1 + X2 to mean that for any s ∈ S, X(s) = X1(s) + X2(s) (again, the notation treats X, X1, and X2 as if they were real numbers).

Independent random variables The intuition behind independent random variables is just like that of events: the value of one random variable should not affect the value of another independent random variable.

Definition A.8. A sequence of random variables X1, X2, …, Xn is (mutually) independent if for every subset X_{i1}, …, X_{ik} and for any real numbers x1, x2, …, xk, the events X_{i1} = x1, X_{i2} = x2, …, X_{ik} = xk are (mutually) independent.

In the case of two random variables X and Y, they are independent if and only if for all real values x and y, Pr[(X = x) ∩ (Y = y)] = Pr[X = x] · Pr[Y = y].

A common use of independence is to model the outcome of consecutive coin tosses: Consider a biased coin that comes up heads with probability p. Define X = 1 if the coin comes up heads and X = 0 if the coin comes up tails; X is then called a Bernoulli random variable with probability p. Suppose now we toss this biased coin n times, and let Y be the random variable that denotes the total number of heads. We can view Y as a sum of independent random variables, Y = ∑_{i=1}^n X_i, where X_i is a Bernoulli random variable with probability p that represents the outcome of the i-th toss. We leave it as an exercise to show that the random variables X1, …, Xn are indeed independent.
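The biased-coin model is easy to simulate. The sketch below is our own frequentist sanity check; the seed and the tolerance 0.02 are arbitrary choices, not anything prescribed by the text.

```python
import random

random.seed(0)            # fixed seed so the run is reproducible
p, n = 0.3, 10_000        # bias of the coin and number of tosses

# Each X_i is a Bernoulli(p) sample; Y = X_1 + ... + X_n counts the heads.
tosses = [1 if random.random() < p else 0 for _ in range(n)]
Y = sum(tosses)

# With n large, the fraction of heads should be close to p.
print(Y / n)
assert abs(Y / n - p) < 0.02
```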

A.5  Expectation

Given a random variable defined on a probability space, what is its “average” value? Naturally, we need to weigh things according to the probability that the random variable takes on each value.

Definition A.9. Given a random variable X defined over a probability space (S,f), we define the expectation of X to be

𝔼[X] = ∑_{x∈Range(X)} Pr[X = x] · x = ∑_{x∈Range(X)} f_X(x) · x.

An alternative but equivalent definition is

𝔼[X] = ∑_{s∈S} f(s) · X(s).

These definitions are equivalent because:

∑_{x∈Range(X)} Pr[X = x] · x
= ∑_{x∈Range(X)} ∑_{s∈S : X(s)=x} f(s) · x
= ∑_{x∈Range(X)} ∑_{s∈S : X(s)=x} f(s) · X(s)
= ∑_{s∈S} f(s) · X(s).
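The equivalence of the two definitions can be verified on a concrete space; the sketch below is our own check using the two-dice example (Example A.2), where the expected sum works out to 7.

```python
from fractions import Fraction

# Two fair dice; X is the sum of the dice, as in Example A.2.
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]
f = {s: Fraction(1, 36) for s in S}
X = lambda s: s[0] + s[1]

# Definition over outcomes: E[X] = sum_s f(s) * X(s).
e_outcomes = sum(f[s] * X(s) for s in S)

# Definition over values: E[X] = sum_x Pr[X = x] * x.
values = {X(s) for s in S}
e_values = sum(sum(f[s] for s in S if X(s) == x) * x for x in values)

assert e_outcomes == e_values == 7
```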

The following fact can be shown with a similar argument:

Claim A.5. Given a random variable X and a function g : ℝ → ℝ,

𝔼[g(X)] = ∑_{x∈Range(X)} Pr[X = x] · g(x).

Proof.

∑_{x∈Range(X)} Pr[X = x] · g(x)
= ∑_{x∈Range(X)} ∑_{s∈S : X(s)=x} f(s) · g(x)
= ∑_{x∈Range(X)} ∑_{s∈S : X(s)=x} f(s) · g(X(s))
= ∑_{s∈S} f(s) · g(X(s))
= 𝔼[g(X)],

which concludes the proof.

Example A.3. Suppose in a game, with probability 1/10 we are paid $10, and with probability 9/10 we are paid $2. What is our expected payment? The answer is

(1/10) · $10 + (9/10) · $2 = $2.80.

An application to decision theory In decision theory, we assign a real number, called the utility, to each outcome in the sample space of a probabilistic game. We then assume that rational players make decisions that maximize their expected utility. For example, should we be willing to pay $2 to participate in the game in Example A.3? If we assume that our utility is exactly the amount of money that we earn, then

with probability 1/10 we get paid $10 and get utility 8

with probability 9/10 we get paid $2 and get utility 0

This gives a positive expected utility of (1/10) · 8 + (9/10) · 0 = 0.8, so we should play the game!

This reasoning of utility does not always explain human behavior though. Suppose there is a game that costs a thousand dollars to play. With one chance in a million, the reward is two billion dollars (!), but otherwise there is no reward. The expected utility is

(1/10⁶) · (2 × 10⁹ − 1000) + (1 − 1/10⁶) · (0 − 1000) = 1000.

One expects to earn a thousand dollars from the game on average. Would you play it? It turns out many people are risk-averse and would turn down the game. After all, except with one chance in a million, you simply lose a thousand dollars. This example shows how expectation does not capture all the important features of a random variable, such as how likely the random variable is to end up close to its expectation (in this case, the utility is either −1,000 dollars or roughly two billion dollars, neither of which is close to the expectation of 1,000 dollars).

In other instances, people are risk-seeking. Take yet another game, this one costing a dollar to play. This time, with one chance in a billion, the reward is a million dollars; otherwise there is no reward. The expected utility is

(1/10⁹) · (10⁶ − 1) + (1 − 1/10⁹) · (0 − 1) = −0.999.

Essentially, to play the game is to throw a dollar away. Would you play the game? It turns out many people do; this is called a lottery!

How can we justify these behaviors in the expected utility framework? The point is that utility may not always be linear in money received/spent. Non-linear utility functions may be used to reconcile observed behavior with expected utility theory (but doing so is outside the scope of this course).

Linearity of expectation One nice property of expectation is that the expectation of the sum of random variables is the sum of the expectations. This can often simplify the calculation of expectations.

Theorem A.2. Let X1,,Xn be random variables, and a1,,an be real constants. Then

𝔼[∑_{i=1}^n a_i X_i] = ∑_{i=1}^n a_i 𝔼[X_i].

Proof.

𝔼[∑_{i=1}^n a_i X_i]
= ∑_{s∈S} f(s) ∑_{i=1}^n a_i X_i(s)
= ∑_{s∈S} ∑_{i=1}^n a_i f(s) X_i(s)
= ∑_{i=1}^n a_i ∑_{s∈S} f(s) X_i(s)
= ∑_{i=1}^n a_i 𝔼[X_i],

which concludes the proof.

Example A.4. If we make n tosses of a biased coin that comes up heads with probability p, what is the expected number of heads? Let X_i = 1 if the i-th toss is heads, and X_i = 0 otherwise. Then the X_i are independent Bernoulli random variables with probability p, and each has expectation

𝔼[X_i] = p · 1 + (1 − p) · 0 = p.

The expected number of heads is then

𝔼[∑_{i=1}^n X_i] = ∑_{i=1}^n 𝔼[X_i] = np.

Thus if the coin were fair, we would expect n/2, half of the tosses, to be heads.
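Note how linearity lets us avoid summing over all 2ⁿ toss sequences. As a check, the sketch below (ours, with a small n chosen so the enumeration is feasible) computes the expectation directly from the definition and compares it with np.

```python
from fractions import Fraction
from itertools import product

p, n = Fraction(1, 2), 4   # a fair coin; n kept small so we can enumerate

# Direct computation of E[# heads]: enumerate all 2^n toss sequences,
# weight the head count of each by the sequence's exact probability.
e_heads = sum(
    sum(seq) * p**sum(seq) * (1 - p)**(n - sum(seq))
    for seq in product([0, 1], repeat=n)
)

assert e_heads == n * p   # matches Example A.4: E = np
```

The brute-force sum agrees with np for any p, but takes time 2ⁿ, whereas linearity gives the answer immediately.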

Conditional expectations We may also define a notion of an expectation of a random variable X conditioned on some event H by simply conditioning the probability space on H:

Definition A.10. Given a random variable X defined over a probability space (S,f), and some event H on S, we define the expectation of X conditioned on H to be

𝔼[X | H] = ∑_{x∈Range(X)} Pr[X = x | H] · x.

We end this section by showing an analog of Claim A.4 for the case of expectations.

Claim A.6. Let A1, …, An be disjoint events with non-zero probability such that ⋃_i A_i = S. Let X be a random variable over S. Then

𝔼[X | S] = ∑_{i=1}^n 𝔼[X | A_i] · Pr[A_i | S].

Proof. By definition 𝔼[X | A_i] = ∑_x Pr[X = x | A_i] · x, and so the right-hand side evaluates to

∑_{i=1}^n ∑_x Pr[X = x | A_i] · Pr[A_i | S] · x
= ∑_{i=1}^n ∑_x (Pr[(X = x) ∩ A_i] / Pr[A_i]) · (Pr[A_i ∩ S] / Pr[S]) · x
= ∑_{i=1}^n ∑_x (Pr[(X = x) ∩ A_i] / Pr[S]) · x
= ∑_x (Pr[(X = x) ∩ S] / Pr[S]) · x
= ∑_x Pr[X = x | S] · x,

which equals the left-hand side.
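Claim A.6 can be verified on a small partition; the sketch below is our own check, partitioning a fair die roll by parity (the event names are ours).

```python
from fractions import Fraction

# One fair die; partition S by parity and verify Claim A.6.
S = [1, 2, 3, 4, 5, 6]
f = {x: Fraction(1, 6) for x in S}
pr = lambda E: sum(f[x] for x in E)
X = lambda s: s            # X is simply the value of the roll

A1 = [2, 4, 6]             # even rolls
A2 = [1, 3, 5]             # odd rolls

def cond_exp(A):
    """E[X | A] = sum_x Pr[X = x | A] * x (Definition A.10)."""
    return sum(f[s] / pr(A) * X(s) for s in A)

# Claim A.6: E[X] = E[X | A1] Pr[A1] + E[X | A2] Pr[A2].
total = cond_exp(A1) * pr(A1) + cond_exp(A2) * pr(A2)
assert total == sum(f[s] * X(s) for s in S) == Fraction(7, 2)
```

Here 𝔼[X | even] = 4 and 𝔼[X | odd] = 3, and weighting each by 1/2 recovers 𝔼[X] = 7/2.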

¹ Without formally defining this term, we refer to random processes whose outcomes are discrete (such as dice rolls), as opposed to, say, picking a uniformly random real number from zero to one.

² By [0, 1] we mean the real interval {x ∈ ℝ | 0 ≤ x ≤ 1}.