Estimating Probabilities from Data

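Recall the Bayes optimal classifier: all we needed was $P(Y|X)$. Most of supervised learning can be viewed as estimating $P(X, Y)$.

There are two cases of supervised learning: generative learning, where we estimate $P(X, Y) = P(X|Y)P(Y)$, and discriminative learning, where we estimate $P(Y|X)$ directly. So where do these probabilities come from?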

There are many ways to estimate these probabilities from data.

Simple scenario: coin toss

Suppose you find a coin and it's ancient and very valuable. Naturally, you ask yourself, "What is the probability that it comes up heads when I toss it?" You toss it $n = 10$ times and get results: $H, T, T, H, H, H, T, T, T, T$.

Maximum Likelihood Estimation (MLE)

What is $P(H) = \theta$?

We observed $n_H$ heads and $n_T$ tails. So, intuitively, $$ \theta \approx \frac{n_H}{n_H + n_T} = 0.4 $$ Can we derive this?

MLE Principle: Choose the $\theta$ that maximizes the likelihood of the data, $P(Data; \theta)$, i.e. the probability of the observed data when the coin has heads probability $\theta$. For $n_H$ heads and $n_T$ tails this is the binomial likelihood \begin{align} P(Data; \theta) &= \binom{n_H + n_T}{n_H} \theta^{n_H} (1 - \theta)^{n_T} \end{align} i.e. \begin{align} \hat{\theta}_{MLE} &= \operatorname*{argmax}_{\theta} \binom{n_H + n_T}{n_H} \theta^{n_H} (1 - \theta)^{n_T} \\ &= \operatorname*{argmax}_{\theta} \log\binom{n_H + n_T}{n_H} + n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta) \\ &= \operatorname*{argmax}_{\theta} n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta) \end{align} The second line takes the logarithm, which is monotonic and therefore does not change the maximizer. The third line drops the binomial coefficient, which does not depend on $\theta$. We can now solve for $\theta$ by taking the derivative with respect to $\theta$ and setting it to zero. This results in \begin{align} \frac{n_H}{\theta} = \frac{n_T}{1 - \theta} \Longrightarrow n_H - n_H\theta = n_T\theta \Longrightarrow \hat{\theta}_{MLE} = \frac{n_H}{n_H + n_T} \end{align}

Check: the solution satisfies $0 \le \theta \le 1$, so it is a valid probability and no explicit constraints were necessary.
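The closed-form result can be checked numerically. The following is a minimal sketch, not part of the original notes: it counts heads and tails in the ten tosses from the coin example, applies $n_H / (n_H + n_T)$, and compares the result against a brute-force search over a grid of candidate values for $\theta$. The function names and the grid resolution are illustrative assumptions.

```python
import math

def mle_heads(tosses):
    """Closed-form MLE for P(H): the fraction of observed heads."""
    n_heads = tosses.count("H")
    n_tails = tosses.count("T")
    return n_heads / (n_heads + n_tails)

def log_likelihood(theta, n_heads, n_tails):
    """Log-likelihood without the binomial coefficient, which is constant in theta."""
    return n_heads * math.log(theta) + n_tails * math.log(1 - theta)

# The ten tosses from the coin example.
tosses = ["H", "T", "T", "H", "H", "H", "T", "T", "T", "T"]
n_heads = tosses.count("H")
n_tails = tosses.count("T")

theta_mle = mle_heads(tosses)

# Brute-force check: evaluate the log-likelihood on a grid over (0, 1).
# The grid spacing of 0.001 is an arbitrary choice for illustration.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, n_heads, n_tails))

print(theta_mle, theta_grid)
```

Both values should agree up to the grid resolution, which is what the derivation predicts.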

Maximum a Posteriori Probability Estimation (MAP)

Assume you have a hunch that $\theta$ is close to $0.5$. But your sample size is small, so you don't trust your estimate.

Simple fix: Add imaginary throws whose outcomes match your hunch $\theta'$ (e.g. $\theta' = 0.5$). For $\theta' = 0.5$, add $m$ heads and $m$ tails to your data. $$ \theta \leftarrow \frac{n_H + m}{n_H + n_T + 2m} $$ For large $n$, this is an insignificant change. For small $n$, it incorporates your "prior belief" about what $\theta$ should be.
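To see how the size of $n$ changes the effect of the imaginary throws, here is a small sketch of the smoothed estimate. The function name and the choice $m = 5$ are assumptions made only for illustration.

```python
def smoothed_estimate(n_heads, n_tails, m):
    """Estimate P(H) after adding m imaginary heads and m imaginary tails."""
    return (n_heads + m) / (n_heads + n_tails + 2 * m)

m = 5  # hypothetical number of imaginary heads (and tails)

# Small sample: the 4 heads and 6 tails from the coin example.
small_plain = 4 / (4 + 6)
small_smoothed = smoothed_estimate(4, 6, m)

# Large sample with the same ratio of heads to tails.
large_plain = 400 / (400 + 600)
large_smoothed = smoothed_estimate(400, 600, m)

print(small_plain, small_smoothed)
print(large_plain, large_smoothed)
```

With 10 real throws the estimate is pulled noticeably from 0.4 toward 0.5, while with 1000 real throws it barely moves.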

Can we derive this update formally?

Let $\theta$ be a random variable drawn from a Dirichlet distribution. A coin has only two outcomes, so we can express the prior $P(\theta)$ as \begin{align} P(\theta) = \frac{\theta^{\beta_1 - 1}(1 - \theta)^{\beta_0 - 1}}{B(\beta_1, \beta_0)} \end{align} where $B(\beta_1, \beta_0)$ is the normalization constant. Note that this is the density of the Beta distribution, which is the special case of the Dirichlet distribution with exactly two parameters, $\beta_1$ and $\beta_0$. The Dirichlet distribution is the multivariate generalization of the Beta distribution.

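For the MAP estimate, we pick the most likely $\theta$ given the data. \begin{align} \hat{\theta}_{MAP} &= \operatorname*{argmax}_{\theta} P(\theta | Data) \\ &= \operatorname*{argmax}_{\theta} \frac{P(Data | \theta)P(\theta)}{P(Data)} && \text{(By Bayes rule)} \\ &= \operatorname*{argmax}_{\theta} \log(P(Data | \theta)) + \log(P(\theta)) && \text{($P(Data)$ does not depend on $\theta$)} \\ &= \operatorname*{argmax}_{\theta} n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta) + (\beta_1 - 1)\cdot \log(\theta) + (\beta_0 - 1) \cdot \log(1 - \theta) && \text{(dropping terms constant in $\theta$)} \\ &= \operatorname*{argmax}_{\theta} (n_H + \beta_1 - 1) \cdot \log(\theta) + (n_T + \beta_0 - 1) \cdot \log(1 - \theta) \\ &\Longrightarrow \hat{\theta}_{MAP} = \frac{n_H + \beta_1 - 1}{n_H + n_T + \beta_0 + \beta_1 - 2} \end{align} The last step sets the derivative with respect to $\theta$ to zero, exactly as in the MLE derivation but with $n_H + \beta_1 - 1$ and $n_T + \beta_0 - 1$ in place of $n_H$ and $n_T$. Setting $\beta_1 = \beta_0 = m + 1$ recovers the update $\frac{n_H + m}{n_H + n_T + 2m}$: the prior acts like $m$ imaginary heads and $m$ imaginary tails.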
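As a quick check of the final formula, the sketch below evaluates the MAP estimate for the coin data ($n_H = 4$, $n_T = 6$) under two priors. The function name and the specific prior parameters are assumptions chosen for illustration: $\beta_1 = \beta_0 = 6$ corresponds to $m = 5$ imaginary heads and tails in the "simple fix" formula, and $\beta_1 = \beta_0 = 1$ is a flat prior.

```python
def map_estimate(n_heads, n_tails, beta_1, beta_0):
    """MAP estimate of P(H) under a Beta(beta_1, beta_0) prior.

    The closed form is the stationary point of the log-posterior; it is only
    used here when that point lies inside [0, 1].
    """
    numerator = n_heads + beta_1 - 1
    denominator = n_heads + n_tails + beta_1 + beta_0 - 2
    if denominator <= 0 or numerator < 0 or numerator > denominator:
        raise ValueError("closed-form MAP estimate is not valid for these parameters")
    return numerator / denominator

# Prior equivalent to m = 5 imaginary heads and 5 imaginary tails.
theta_map = map_estimate(4, 6, beta_1=6, beta_0=6)

# Flat prior: the prior terms vanish and the MAP estimate equals the MLE.
theta_flat = map_estimate(4, 6, beta_1=1, beta_0=1)

print(theta_map, theta_flat)
```

The first value matches $\frac{n_H + m}{n_H + n_T + 2m}$ with $m = 5$, and the second matches $\frac{n_H}{n_H + n_T}$.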

"True" Bayesian approach

Let $\theta$ be our parameter.

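We allow $\theta$ to be a random variable. To make a prediction, we do not commit to a single value of $\theta$. Instead we average over all values of $\theta$, weighting each by its posterior probability given the data $D$: $$ P(Y|X, D) = \int_{\theta}P(Y, \theta | X, D) d\theta = \int_{\theta} P(Y | X, \theta) P(\theta | D) d\theta $$ The second equality uses that, once $\theta$ is known, the prediction no longer depends on $D$, and that the posterior over $\theta$ does not depend on the test input $X$. This is called the posterior predictive distribution. Unfortunately, the integral above is generally intractable in closed form, and other techniques, such as Monte Carlo approximations, are used to approximate the distribution.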
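For the coin example the integral happens to be tractable, which makes it a convenient test case for a Monte Carlo approximation. The sketch below relies on a standard fact not derived in these notes: with a Beta$(\beta_1, \beta_0)$ prior and binomial data, the posterior $P(\theta | D)$ is Beta$(n_H + \beta_1, n_T + \beta_0)$, and the posterior predictive probability of heads is its mean. The function names, the prior parameters, the sample count, and the seed are all illustrative assumptions.

```python
import random

def predictive_heads_mc(n_heads, n_tails, beta_1, beta_0, n_samples=50_000, seed=0):
    """Monte Carlo estimate of P(Y = H | D): average P(H | theta) over posterior samples."""
    rng = random.Random(seed)
    a = n_heads + beta_1  # posterior Beta parameter for heads
    b = n_tails + beta_0  # posterior Beta parameter for tails
    total = 0.0
    for _ in range(n_samples):
        theta = rng.betavariate(a, b)  # theta ~ P(theta | D)
        total += theta                 # P(Y = H | theta) = theta
    return total / n_samples

def predictive_heads_exact(n_heads, n_tails, beta_1, beta_0):
    """Closed-form posterior predictive: the mean of the Beta posterior."""
    return (n_heads + beta_1) / (n_heads + n_tails + beta_1 + beta_0)

mc = predictive_heads_mc(4, 6, beta_1=6, beta_0=6)
exact = predictive_heads_exact(4, 6, beta_1=6, beta_0=6)

print(mc, exact)
```

The two numbers should agree to within sampling error. In models where no closed form exists, only the sampling route remains, and drawing from the posterior is itself usually the hard part.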