Appendix A

A Primer on Probability

In this section, we provide a brief introduction to some basic probability theory; our goal is to provide the minimal preliminaries needed to understand the material in the book. Part of the text is taken almost verbatim from [PT10].

Originally motivated by gambling, the study of probability is now fundamental to a wide variety of subjects, including social behavior (e.g., economics and game theory) and physical laws (e.g., quantum mechanics and radioactive decay). But what is probability? What does it mean that a fair coin toss comes up heads with probability 50%? One interpretation is frequentist: “50%” means that if we toss the coin 10 million times, it will come up heads in roughly 5 million tosses. A different interpretation is Bayesian (or subjective): “50%” is a statement about our beliefs, and about how much we are willing to bet on a single coin toss. For the purposes of this book, the second interpretation will be the more relevant one.

A.1  Probability Spaces

In our treatment, we restrict our attention to discrete probability spaces:¹

Definition A.1 (Probability Space). A probability space is a pair (S, f) where S is a countable set called the sample space, and f : S → [0,1] is called the probability mass function.² Additionally, f satisfies the property ∑_{x∈S} f(x) = 1.

Intuitively, the sample space S corresponds to the set of possible states that the world could be in, and the probability mass function f assigns a probability from 0 to 1 to each of these states. To model our conventional notion of probability, we require that the total probability assigned by f to all possible states should sum up to 1.

Definition A.2 (Event). Given a probability space (S, f), an event is simply a subset of S. The probability of an event E, denoted by Pr_{(S,f)}[E] = Pr[E], is defined to be ∑_{x∈E} f(x). In particular, the event that includes “everything,” E = S, has probability Pr[S] = 1.

Even though events and probabilities are not well-defined without a probability space, by convention, we often omit S and f in our statements when they are clear from context.

Example A.1. Consider rolling a regular 6-sided die. The sample space is S = {1, 2, 3, 4, 5, 6}, and the probability mass function is constant: f(x) = 1/6 for all x ∈ S. We refer to such a probability mass function (i.e., one that is constant) as being equiprobable. The event of an even roll is E = {2, 4, 6}, and this occurs with probability

Pr[E] = ∑_{x∈{2,4,6}} f(x) = 1/2.
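As a sanity check, the example above can be encoded directly from Definitions A.1 and A.2; this is our own illustrative sketch (the names `f`, `pr`, and `even` are ours, not the text's).

```python
from fractions import Fraction

# Equiprobable mass function for a fair 6-sided die (Example A.1).
S = {1, 2, 3, 4, 5, 6}
f = {x: Fraction(1, 6) for x in S}

def pr(event):
    """Pr[E] = sum of f(x) over x in E (Definition A.2)."""
    return sum(f[x] for x in event)

even = {2, 4, 6}               # the event of an even roll
assert sum(f.values()) == 1    # f is a valid probability mass function
print(pr(even))                # 1/2
```

Using exact `Fraction` arithmetic avoids floating-point rounding, so the equality with 1/2 is exact.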

Some basic properties Now that probability spaces are defined, we give a few basic properties of probability:

Claim A.1. If A and B are disjoint events (A ∩ B = ∅), then Pr[A ∪ B] = Pr[A] + Pr[B].

Proof. By definition,

Pr[A ∪ B] = ∑_{x∈A∪B} f(x) = ∑_{x∈A} f(x) + ∑_{x∈B} f(x)   (since A and B are disjoint)
= Pr[A] + Pr[B],

which concludes the proof.

Corollary A.1. For any event E, Pr[Ē] = 1 − Pr[E].

Proof. This follows directly from Claim A.1, using Ē ∪ E = S and Ē ∩ E = ∅.

When events are not disjoint, we instead have the following:

Claim A.2. Given events A and B, Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B].

Proof. First observe that A ∪ B = (A ∖ B) ∪ (B ∖ A) ∪ (A ∩ B), and that all the terms on the RHS are disjoint. Therefore,

Pr[A ∪ B] = Pr[A ∖ B] + Pr[B ∖ A] + Pr[A ∩ B].     (A.1)

Similarly, we have

Pr[A] = Pr[A ∖ B] + Pr[A ∩ B]     (A.2)
Pr[B] = Pr[B ∖ A] + Pr[A ∩ B]     (A.3)

because, say, A is the disjoint union of A ∖ B and A ∩ B. Substituting (A.2) and (A.3) into (A.1) gives

Pr[A ∪ B] = Pr[A ∖ B] + Pr[B ∖ A] + Pr[A ∩ B]
= (Pr[A] − Pr[A ∩ B]) + (Pr[B] − Pr[A ∩ B]) + Pr[A ∩ B]
= Pr[A] + Pr[B] − Pr[A ∩ B],

which concludes the proof.

A useful corollary of Claim A.2 is the Union Bound.

Corollary A.2 (Union Bound). Given events A and B, Pr[A ∪ B] ≤ Pr[A] + Pr[B]. More generally, given events A1, …, An,

Pr[⋃_i A_i] ≤ ∑_i Pr[A_i].
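Both Claim A.2 and the union bound can be checked mechanically on a small example; the sketch below is our own, using the fair-die space from Example A.1 (the event names are ours).

```python
from fractions import Fraction

# One fair die, as in Example A.1.
S = range(1, 7)
f = {x: Fraction(1, 6) for x in S}
pr = lambda E: sum(f[x] for x in E)

A = {2, 4, 6}    # even roll
B = {4, 5, 6}    # roll of at least 4

# Claim A.2 (inclusion-exclusion):
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)
# Corollary A.2 (union bound); strict here since A and B overlap:
assert pr(A | B) <= pr(A) + pr(B)
print(pr(A | B))   # 2/3
```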

A.2  Conditional Probability

Suppose that after receiving a random 5-card hand dealt from a standard 52-card deck, we are told that the hand contains “at least a pair” (that is, at least two of the cards have the same rank). How do we calculate the probability of a full house given this extra information? Consider the following thought process: we restrict our attention to the outcomes consistent with the information we received (hands containing at least a pair), and we rescale the probabilities of these outcomes so that they once again sum to 1.

Motivated by this line of reasoning, we define conditional probability in the following way:

Definition A.3. Let A and B be events with Pr[B] ≠ 0. The conditional probability of A, conditioned on B, denoted by Pr[A | B], is defined as

Pr[A | B] = Pr[A ∩ B] / Pr[B].
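Definition A.3 translates directly into code. The sketch below is our own small example over two fair dice (the events chosen are illustrative, not from the text).

```python
from fractions import Fraction

# Two fair dice; the sample space is all ordered pairs.
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]
f = {s: Fraction(1, 36) for s in S}
pr = lambda E: sum(f[s] for s in E)

def pr_given(A, B):
    """Pr[A | B] = Pr[A ∩ B] / Pr[B] (Definition A.3)."""
    return pr(A & B) / pr(B)

A = {s for s in S if s[0] + s[1] == 12}   # the dice sum to 12
B = {s for s in S if s[0] == 6}           # the first die shows 6
print(pr_given(A, B))                      # 1/6
```

Unconditionally, Pr[A] = 1/36; learning that the first die is a 6 raises the probability of a total of 12 to 1/6, matching the renormalization intuition above.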

Independence By defining conditional probability, we have modeled how the occurrence of one event can affect the probability of another. An equally important concept is independence, where events do not affect one another.

Definition A.4 (Independence). A sequence of events A1, …, An is (mutually) independent if and only if for every subset of these events, A_{i1}, …, A_{ik},

Pr[A_{i1} ∩ A_{i2} ∩ ⋯ ∩ A_{ik}] = Pr[A_{i1}] · Pr[A_{i2}] ⋯ Pr[A_{ik}].

If there are just two events, A and B, then they are independent if and only if Pr[A ∩ B] = Pr[A] · Pr[B]. The following claim justifies the definition of independence.

Claim A.3. If A and B are independent events and Pr[B] ≠ 0, then Pr[A | B] = Pr[A].

Proof.

Pr[A | B] = Pr[A ∩ B] / Pr[B] = Pr[A] · Pr[B] / Pr[B] = Pr[A],

which concludes the proof.
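Independence can also be tested mechanically. The sketch below, ours and not the text's, again uses two fair dice: events about different dice are independent, while an event about the first die and an event about the sum need not be.

```python
from fractions import Fraction

S = [(i, j) for i in range(1, 7) for j in range(1, 7)]
f = {s: Fraction(1, 36) for s in S}
pr = lambda E: sum(f[s] for s in E)

A = {s for s in S if s[0] % 2 == 0}       # first die is even
B = {s for s in S if s[1] % 2 == 0}       # second die is even
C = {s for s in S if s[0] + s[1] >= 10}   # the dice sum to at least 10

# A and B concern different dice, and indeed Pr[A ∩ B] = Pr[A]·Pr[B] ...
assert pr(A & B) == pr(A) * pr(B)
# ... so by Claim A.3, conditioning on B does not change Pr[A].
assert pr(A & B) / pr(B) == pr(A)
# A and C are NOT independent: a large sum makes an even first die likelier.
assert pr(A & C) != pr(A) * pr(C)
```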

A.3  Bayes’ Rule

Suppose that we have a test against a rare disease that affects only 0.3% of the population, and that the test is 99% effective (i.e., if a person has the disease the test says YES with probability 0.99, and otherwise it says NO with probability 0.99). If a random person in the population tested positive, what is the probability that he has the disease? The answer is not 0.99. Indeed, this is an exercise in conditional probability: what are the chances that a random person has the rare disease, given the occurrence of the event that he tested positive?

We start with some preliminaries.

Claim A.4. Let A1, …, An be disjoint events with non-zero probability such that ⋃_i A_i = S (i.e., the events are exhaustive; they partition the sample space S). Let B be an event. Then Pr[B] = ∑_{i=1}^n Pr[B | A_i] · Pr[A_i].

Proof. By definition Pr[B | A_i] = Pr[B ∩ A_i] / Pr[A_i], and so the RHS equals

∑_{i=1}^n Pr[B ∩ A_i].

Since A1, …, An are disjoint, it follows that the events B ∩ A1, …, B ∩ An are also disjoint. Therefore,

∑_{i=1}^n Pr[B ∩ A_i] = Pr[⋃_{i=1}^n (B ∩ A_i)] = Pr[B ∩ ⋃_{i=1}^n A_i] = Pr[B ∩ S] = Pr[B],

which concludes the proof.

Theorem A.1 (Bayes’ Rule). Let A and B be events with non-zero probability. Then:

Pr[B | A] = Pr[A | B] · Pr[B] / Pr[A].

Proof. Multiply both sides by Pr[A]. Now by definition of conditional probabilities, both sides equal:

Pr[B | A] · Pr[A] = Pr[A ∩ B] = Pr[A | B] · Pr[B],

which concludes the proof.

Sometimes, it is useful to expand the statement of Bayes’ Rule with Claim A.4:

Corollary A.3 (Bayes’ Rule Expanded). Let A and B be events with non-zero probability. Then:

Pr[B | A] = (Pr[A | B] · Pr[B]) / (Pr[B] · Pr[A | B] + Pr[B̄] · Pr[A | B̄]).

Proof. The corollary follows directly by applying Claim A.4, using the facts that (a) B and B̄ are disjoint, and (b) B ∪ B̄ = S.

We return to our original question of testing for rare diseases. Let us consider the sample space S = {(t, d) | t ∈ {0,1}, d ∈ {0,1}}, where t represents the outcome of the test on a random person in the population, and d represents whether the same person carries the disease or not. Let D be the event that a randomly drawn person has the disease (d = 1), and T be the event that a randomly drawn person tests positive (t = 1).

We know that Pr[D] = 0.003 (because 0.3% of the population has the disease). We also know that Pr[T | D] = 0.99 and Pr[T | D̄] = 0.01 (because the test is 99% effective). Using Bayes’ rule, we can now calculate the probability that a random person who tested positive actually has the disease:

Pr[D | T] = Pr[T | D] · Pr[D] / (Pr[D] · Pr[T | D] + Pr[D̄] · Pr[T | D̄]) = (0.99 · 0.003) / (0.003 · 0.99 + 0.997 · 0.01) ≈ 0.23.

Notice that 23%, while significant, is a far cry from 99% (the effectiveness of the test). This final probability can vary if we have a different prior (initial belief). For example, if a random patient has other medical conditions that raise the probability of contracting the disease to 10%, then the final probability of having the disease, given a positive test, rises to about 92%.
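The calculation above is mechanical enough to script. The sketch below is our own; it implements Corollary A.3 directly, and the `posterior` helper name is ours.

```python
def posterior(prior, p_pos_given_d, p_pos_given_not_d):
    """Corollary A.3 applied to the disease test: returns Pr[D | T]."""
    num = p_pos_given_d * prior
    den = prior * p_pos_given_d + (1 - prior) * p_pos_given_not_d
    return num / den

# The numbers from the text: 0.3% base rate, 99% effective test.
print(round(posterior(0.003, 0.99, 0.01), 2))   # 0.23
# With a 10% prior (a patient with other risk factors), also from the text:
print(round(posterior(0.10, 0.99, 0.01), 2))    # 0.92
```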

Updating beliefs after multiple signals Our treatment so far discusses how to update our beliefs after receiving one signal (the outcome of the test). How should we update if we receive multiple signals? That is, how do we compute Pr[A | B1 ∩ B2]? To answer this question, we first need to define a notion of conditional independence.

Definition A.5 (Conditional Independence). A sequence of events B1, …, Bn is conditionally independent given an event A if and only if for every subset of the sequence of events, B_{i1}, …, B_{ik},

Pr[⋂_k B_{ik} | A] = ∏_k Pr[B_{ik} | A].

In other words, given that the event A has occurred, the events B1, …, Bn are independent.

When there are only two events, B1 and B2, they are conditionally independent given event A if and only if Pr[B1 ∩ B2 | A] = Pr[B1 | A] · Pr[B2 | A].

If the signals we receive are conditionally independent, we can still use Bayes’ rule to update our beliefs. More precisely, if we assume that the signals B1 and B2 are independent when conditioned on A, and also independent when conditioned on Ā, then:

Pr[A | B1 ∩ B2] = Pr[B1 ∩ B2 | A] · Pr[A] / (Pr[A] · Pr[B1 ∩ B2 | A] + Pr[Ā] · Pr[B1 ∩ B2 | Ā])
= Pr[B1 | A] · Pr[B2 | A] · Pr[A] / (Pr[A] · Pr[B1 | A] · Pr[B2 | A] + Pr[Ā] · Pr[B1 | Ā] · Pr[B2 | Ā]).

In general, given signals B1, …, Bn that are conditionally independent given A and conditionally independent given Ā, we have

Pr[A | ⋂_i B_i] = Pr[A] · ∏_i Pr[B_i | A] / (Pr[A] · ∏_i Pr[B_i | A] + Pr[Ā] · ∏_i Pr[B_i | Ā]).

Spam detection Using “training data” (e-mails classified as spam or not by hand), we can estimate the probability that a message contains a certain string conditioned on being spam (or not), for example Pr[ “viagra” | spam ], Pr[ “viagra” | not spam ]. We can also estimate the probability that a random e-mail is spam; that is, Pr[spam] (this is about 80% in real life, although most spam detectors are “unbiased” and assume Pr[spam] = 50% to make calculations nicer).

By choosing a diverse set of keywords, say W1, …, Wn, and assuming that the occurrences of these keywords are conditionally independent given a spam e-mail and conditionally independent given a non-spam e-mail, we can use Bayes’ rule to estimate the probability that an e-mail is spam based on the words it contains (the expression below is simplified under the assumption Pr[spam] = Pr[not spam] = 0.5):

Pr[spam | ⋂_i W_i] = ∏_i Pr[W_i | spam] / (∏_i Pr[W_i | spam] + ∏_i Pr[W_i | not spam]).
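This is the core of a naive Bayes spam filter. The sketch below is ours, with made-up per-keyword probabilities purely for illustration; real values would come from training data as described above.

```python
def spam_probability(p_w_spam, p_w_ham):
    """Pr[spam | all keywords present], assuming the keywords are
    conditionally independent and Pr[spam] = Pr[not spam] = 0.5,
    as in the simplified expression in the text."""
    prod_spam, prod_ham = 1.0, 1.0
    for ps, ph in zip(p_w_spam, p_w_ham):
        prod_spam *= ps   # Pr[W_i | spam]
        prod_ham *= ph    # Pr[W_i | not spam]
    return prod_spam / (prod_spam + prod_ham)

# Hypothetical per-keyword estimates (illustrative only, not real data):
p_w_spam = [0.8, 0.6, 0.7]
p_w_ham  = [0.1, 0.3, 0.2]
print(round(spam_probability(p_w_spam, p_w_ham), 3))   # ≈ 0.982
```

Even moderately spam-leaning keywords compound quickly under the independence assumption, which is why naive Bayes filters are effective despite their crude model.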

A.4  Random Variables

We use events to express whether a particular class of outcomes has occurred or not. Sometimes we want to express more: for example, after 100 fair coin tosses, we want to study how many of the tosses came up heads (instead of focusing on just one event, say, the event that exactly 50 tosses came up heads). This takes us to the definition of random variables.

Definition A.6. A random variable X on a probability space (S, f) is a function from the sample space to the real numbers, X : S → ℝ.

So, in the example of 100 coin tosses, given any outcome of the experiment s S, we would define X(s) to be the number of heads that occurred in that outcome.

Definition A.7. Given a random variable X on a probability space (S, f), we can consider a new probability space (S′, f_X) where the sample space is the range of X, S′ = {X(s) | s ∈ S}, and the probability mass function is derived from f by f_X(x) = Pr_{(S,f)}[{s ∈ S | X(s) = x}]. We call f_X the probability distribution or the probability density function of the random variable X.

Example A.2. Suppose we toss two 6-sided dice. The sample space consists of pairs of outcomes, S = {(i, j) | i, j ∈ {1, …, 6}}, and the probability mass function is equiprobable. Consider the random variables X1(i,j) = i, X2(i,j) = j, and X(i,j) = i + j. These random variables denote the outcome of the first die, the outcome of the second die, and the sum of the two dice, respectively. The probability density function of X takes the following values:

f_X(1) = 0
f_X(2) = Pr[{(1,1)}] = 1/36
f_X(3) = Pr[{(1,2), (2,1)}] = 2/36
f_X(6) = Pr[{(1,5), (2,4), (3,3), (4,2), (5,1)}] = 5/36
f_X(7) = Pr[{(1,6), (2,5), …, (6,1)}] = 6/36
f_X(8) = Pr[{(2,6), (3,5), …, (6,2)}] = 5/36 = f_X(6)
f_X(12) = Pr[{(6,6)}] = 1/36
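The full density f_X can be tabulated by iterating over the sample space; the short sketch below is our own check of Example A.2.

```python
from collections import Counter
from fractions import Fraction

# Probability density of X = X1 + X2 for two fair dice (Example A.2).
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]
fX = Counter()
for (i, j) in S:
    fX[i + j] += Fraction(1, 36)   # each outcome contributes mass 1/36

assert fX[2] == Fraction(1, 36)
assert fX[6] == fX[8] == Fraction(5, 36)
assert fX[7] == Fraction(6, 36)
assert sum(fX.values()) == 1   # f_X is itself a probability mass function
```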

Notation regarding random variables We can describe events by applying predicates to random variables (e.g., the event that X, the number of heads, is equal to 50). We often use a short-hand notation in which we treat random variables as if they were real numbers: if X is a random variable, we let, for example, “X = 50” denote the event {s ∈ S | X(s) = 50}. Using this notation, we may define the probability density function of a random variable X as f_X(x) = Pr[X = x].

In a similar vein, we can define new random variables from existing random variables. In Example A.2, we can write X = X1 + X2 to mean that for any s ∈ S, X(s) = X1(s) + X2(s) (again, the notation treats X, X1, and X2 as if they were real numbers).

Independent random variables The intuition behind independent random variables is just like that of events: the value of one random variable should not affect the value of another independent random variable.

Definition A.8. A sequence of random variables X1, X2, …, Xn is (mutually) independent if for every subset X_{i1}, …, X_{ik} and for any real numbers x1, x2, …, xk, the events X_{i1} = x1, X_{i2} = x2, …, X_{ik} = xk are (mutually) independent.

In the case of two random variables X and Y, they are independent if and only if for all real values x and y, Pr[(X = x) ∩ (Y = y)] = Pr[X = x] · Pr[Y = y].

A common use of independence is to model the outcome of consecutive coin tosses: Consider a biased coin that comes up heads with probability p. Define X = 1 if the coin comes up heads and X = 0 if the coin comes up tails; X is then called a Bernoulli random variable with probability p. Suppose now we toss this biased coin n times, and let Y be the random variable that denotes the total number of heads. We can view Y as a sum of independent random variables, Y = ∑_{i=1}^n X_i, where X_i is a Bernoulli random variable with probability p that represents the outcome of the i-th toss. We leave it as an exercise to show that the random variables X1, …, Xn are indeed independent.
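The biased-coin model is easy to simulate. The sketch below is our own frequentist sanity check; the seed and the tolerance 0.02 are arbitrary choices, not anything prescribed by the text.

```python
import random

random.seed(0)            # fixed seed so the run is reproducible
p, n = 0.3, 10_000        # bias of the coin and number of tosses

# Each X_i is a Bernoulli(p) sample; Y = X_1 + ... + X_n counts the heads.
tosses = [1 if random.random() < p else 0 for _ in range(n)]
Y = sum(tosses)

# With n large, the fraction of heads should be close to p.
print(Y / n)
assert abs(Y / n - p) < 0.02
```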

A.5  Expectation

Given a random variable defined on a probability space, what is its “average” value? Naturally, we need to weigh things according to the probability that the random variable takes on each value.

Definition A.9. Given a random variable X defined over a probability space (S,f), we define the expectation of X to be

𝔼[X] = ∑_{x∈Range(X)} Pr[X = x] · x = ∑_{x∈Range(X)} f_X(x) · x.

An alternative but equivalent definition is

𝔼[X] = ∑_{s∈S} f(s) · X(s).

These definitions are equivalent because:

∑_{x∈Range(X)} Pr[X = x] · x
= ∑_{x∈Range(X)} ∑_{s∈S : X(s)=x} f(s) · x
= ∑_{x∈Range(X)} ∑_{s∈S : X(s)=x} f(s) · X(s)
= ∑_{s∈S} f(s) · X(s).
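The equivalence of the two definitions can be verified on a concrete space; the sketch below is our own check using the two-dice example (Example A.2), where the expected sum works out to 7.

```python
from fractions import Fraction

# Two fair dice; X is the sum of the dice, as in Example A.2.
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]
f = {s: Fraction(1, 36) for s in S}
X = lambda s: s[0] + s[1]

# Definition over outcomes: E[X] = sum_s f(s) * X(s).
e_outcomes = sum(f[s] * X(s) for s in S)

# Definition over values: E[X] = sum_x Pr[X = x] * x.
values = {X(s) for s in S}
e_values = sum(sum(f[s] for s in S if X(s) == x) * x for x in values)

assert e_outcomes == e_values == 7
```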

The following fact can be shown with a similar argument:

Claim A.5. Given a random variable X and a function g : ℝ → ℝ,

𝔼[g(X)] = ∑_{x∈Range(X)} Pr[X = x] · g(x).

Proof.

∑_{x∈Range(X)} Pr[X = x] · g(x)
= ∑_{x∈Range(X)} ∑_{s∈S : X(s)=x} f(s) · g(x)
= ∑_{x∈Range(X)} ∑_{s∈S : X(s)=x} f(s) · g(X(s))
= ∑_{s∈S} f(s) · g(X(s))
= 𝔼[g(X)],

which concludes the proof.

Example A.3. Suppose in a game, with probability 1/10 we are paid $10, and with probability 9/10 we are paid $2. What is our expected payment? The answer is

(1/10) · $10 + (9/10) · $2 = $2.80.

An application to decision theory In decision theory, we assign a real number, called the utility, to each outcome in the sample space of a probabilistic game. We then assume that rational players make decisions that maximize their expected utility. For example, should we be willing to pay $2 to participate in the game in Example A.3? If we assume that our utility is exactly the amount of money that we earn, then

with probability 1/10 we get paid $10 and get utility 8

with probability 9/10 we get paid $2 and get utility 0

This gives a positive expected utility of (1/10) · 8 + (9/10) · 0 = 0.8, so we should play the game!

This reasoning of utility does not always explain human behavior though. Suppose there is a game that costs a thousand dollars to play. With one chance in a million, the reward is two billion dollars (!), but otherwise there is no reward. The expected utility is

(1/10⁶) · (2 × 10⁹ − 1000) + (1 − 1/10⁶) · (0 − 1000) = 1000.

One expects to earn a thousand dollars from the game on average. Would you play it? It turns out many people are risk-averse and would turn down the game. After all, except with one chance in a million, you simply lose a thousand dollars. This example shows how expectation does not capture all the important features of a random variable, such as how likely the random variable is to end up close to its expectation (in this case, the utility is either −1,000 dollars or roughly two billion dollars, neither of which is close to the expectation of 1,000 dollars).

In other instances, people are risk-seeking. Take yet another game, this one costing a dollar to play. This time, with one chance in a billion, the reward is a million dollars; otherwise there is no reward. The expected utility is

(1/10⁹) · (10⁶ − 1) + (1 − 1/10⁹) · (0 − 1) = −0.999.

Essentially, to play the game is to throw a dollar away. Would you play the game? It turns out many people do; this is called a lottery!

How can we justify these behaviors in the expected utility framework? The point is that utility may not always be linear in money received/spent. Non-linear utility functions may be used to reconcile observed behavior with expected utility theory (but doing so is outside the scope of this course).

Linearity of expectation One nice property of expectation is that the expectation of the sum of random variables is the sum of the expectations. This can often simplify the calculation of expectations.

Theorem A.2. Let X1,,Xn be random variables, and a1,,an be real constants. Then

𝔼[∑_{i=1}^n a_i X_i] = ∑_{i=1}^n a_i 𝔼[X_i].

Proof.

𝔼[∑_{i=1}^n a_i X_i]
= ∑_{s∈S} f(s) ∑_{i=1}^n a_i X_i(s)
= ∑_{s∈S} ∑_{i=1}^n a_i f(s) X_i(s)
= ∑_{i=1}^n a_i ∑_{s∈S} f(s) X_i(s)
= ∑_{i=1}^n a_i 𝔼[X_i],

which concludes the proof.

Example A.4. If we make n tosses of a biased coin that comes up heads with probability p, what is the expected number of heads? Let X_i = 1 if the i-th toss is heads, and X_i = 0 otherwise. Then the X_i are independent Bernoulli random variables with probability p, and each has expectation

𝔼[X_i] = p · 1 + (1 − p) · 0 = p.

The expected number of heads is then

𝔼[∑_{i=1}^n X_i] = ∑_{i=1}^n 𝔼[X_i] = np.

Thus if the coin were fair, we would expect n/2, half of the tosses, to be heads.
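Note how linearity lets us avoid summing over all 2ⁿ toss sequences. As a check, the sketch below (ours, with a small n chosen so the enumeration is feasible) computes the expectation directly from the definition and compares it with np.

```python
from fractions import Fraction
from itertools import product

p, n = Fraction(1, 2), 4   # a fair coin; n kept small so we can enumerate

# Direct computation of E[# heads]: enumerate all 2^n toss sequences,
# weight the head count of each by the sequence's exact probability.
e_heads = sum(
    sum(seq) * p**sum(seq) * (1 - p)**(n - sum(seq))
    for seq in product([0, 1], repeat=n)
)

assert e_heads == n * p   # matches Example A.4: E = np
```

The brute-force sum agrees with np for any p, but takes time 2ⁿ, whereas linearity gives the answer immediately.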

Conditional expectations We may also define a notion of an expectation of a random variable X conditioned on some event H by simply conditioning the probability space on H:

Definition A.10. Given a random variable X defined over a probability space (S,f), and some event H on S, we define the expectation of X conditioned on H to be

𝔼[X | H] = ∑_{x∈Range(X)} Pr[X = x | H] · x.

We end this section by showing an analog of Claim A.4 for the case of expectations.

Claim A.6. Let A1, …, An be disjoint events with non-zero probability such that ⋃_i A_i = S. Let X be a random variable over S. Then

𝔼[X | S] = ∑_{i=1}^n 𝔼[X | A_i] · Pr[A_i | S].

Proof. By definition 𝔼[X | A_i] = ∑_x Pr[X = x | A_i] · x, and so the right-hand side evaluates to

∑_{i=1}^n ∑_x Pr[X = x | A_i] · Pr[A_i | S] · x
= ∑_{i=1}^n ∑_x (Pr[(X = x) ∩ A_i] / Pr[A_i]) · (Pr[A_i ∩ S] / Pr[S]) · x
= ∑_{i=1}^n ∑_x (Pr[(X = x) ∩ A_i] / Pr[S]) · x
= ∑_x (Pr[(X = x) ∩ S] / Pr[S]) · x
= ∑_x Pr[X = x | S] · x,

which equals the left-hand side.
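Claim A.6 can be verified on a small partition; the sketch below is our own check, partitioning a fair die roll by parity (the event names are ours).

```python
from fractions import Fraction

# One fair die; partition S by parity and verify Claim A.6.
S = [1, 2, 3, 4, 5, 6]
f = {x: Fraction(1, 6) for x in S}
pr = lambda E: sum(f[x] for x in E)
X = lambda s: s            # X is simply the value of the roll

A1 = [2, 4, 6]             # even rolls
A2 = [1, 3, 5]             # odd rolls

def cond_exp(A):
    """E[X | A] = sum_x Pr[X = x | A] * x (Definition A.10)."""
    return sum(f[s] / pr(A) * X(s) for s in A)

# Claim A.6: E[X] = E[X | A1] Pr[A1] + E[X | A2] Pr[A2].
total = cond_exp(A1) * pr(A1) + cond_exp(A2) * pr(A2)
assert total == sum(f[s] * X(s) for s in S) == Fraction(7, 2)
```

Here 𝔼[X | even] = 4 and 𝔼[X | odd] = 3, and weighting each by 1/2 recovers 𝔼[X] = 7/2.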

¹ Without formally defining this term, we refer to random processes whose outcomes are discrete (such as dice rolls), as opposed to, say, picking a uniformly random real number from zero to one.

² By [0, 1] we mean the real interval {x ∈ ℝ | 0 ≤ x ≤ 1}.