## Estimating Probabilities from data

Recall that for the Bayes optimal classifier, all we needed was $P(Y|X)$. Most of supervised learning can be viewed as estimating $P(X, Y)$.

There are two cases of supervised learning:
- When we estimate $P(Y|X)$ directly, we call it *discriminative learning*.
- When we estimate $P(X|Y)P(Y)$, we call it *generative learning*.

So where do these probabilities come from?

There are many ways to estimate these probabilities from data.

### Simple scenario: coin toss

Suppose you find a coin and it's ancient and very valuable. **Naturally**, you ask yourself, "What is the probability that it comes up heads when I toss it?"
You toss it $n = 10$ times and get results: $H, T, T, H, H, H, T, T, T, T$.

#### Maximum Likelihood Estimation (MLE)

What is $P(H) = \theta$?

We observed $n_H = 4$ heads and $n_T = 6$ tails. So, intuitively,
$$
\theta \approx \frac{n_H}{n_H + n_T} = 0.4
$$
Can we derive this?

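__MLE Principle:__ Choose the $\theta$ that maximizes the likelihood of the data, $P(Data; \theta)$, which for $n_H$ heads and $n_T$ tails is the binomial probability
\begin{align}
P(Data; \theta) &= \begin{pmatrix} n_H + n_T \\ n_H \end{pmatrix} \theta^{n_H} (1 - \theta)^{n_T}
\end{align}
i.e.
\begin{align}
\theta &= \operatorname{argmax}_{\theta} \begin{pmatrix} n_H + n_T \\ n_H \end{pmatrix} \theta^{n_H} (1 - \theta)^{n_T} \\
&= \operatorname{argmax}_{\theta} \log\begin{pmatrix} n_H + n_T \\ n_H \end{pmatrix} + n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta) \\
&= \operatorname{argmax}_{\theta} n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta)
\end{align}
The second line holds because the logarithm is monotonically increasing, so it does not change the maximizer; the third drops the binomial coefficient because it does not depend on $\theta$.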
We can now solve for $\theta$ by taking the derivative and equating it to zero. This results in
\begin{align}
\frac{n_H}{\theta} = \frac{n_T}{1 - \theta} \Longrightarrow n_H - n_H\theta = n_T\theta \Longrightarrow \theta = \frac{n_H}{n_H + n_T}
\end{align}

__Check:__ $1 \ge \theta \ge 0$ (no constraints necessary)
- MLE picks the $\theta$ that best explains the data you observed.
- But the MLE can overfit the data if $n$ is small. It works well when $n$ is large.
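
As a quick sanity check of the closed form derived above, the following sketch computes the MLE for the ten tosses and compares it against a brute-force search over a grid of candidate values. The helper name `log_likelihood` and the grid resolution are illustrative choices, not part of the notes.

```python
import math

# The ten tosses from the coin example.
tosses = ["H", "T", "T", "H", "H", "H", "T", "T", "T", "T"]
n_H = tosses.count("H")
n_T = tosses.count("T")

def log_likelihood(theta, n_H, n_T):
    """Log-likelihood without the binomial coefficient (it does not depend on theta)."""
    return n_H * math.log(theta) + n_T * math.log(1 - theta)

# Closed-form MLE: n_H / (n_H + n_T).
theta_mle = n_H / (n_H + n_T)

# Numerical check: pick the best theta on a grid strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, n_H, n_T))

print(theta_mle, theta_grid)
```

Both values should agree up to the grid spacing, which is what the derivation predicts.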

#### Maximum a Posteriori Probability Estimation (MAP)

Assume you have a hunch that $\theta$ is close to $0.5$. But your sample size is small, so you don't trust your estimate.

__Simple fix:__ Add $2m$ imaginary throws that would result in $\theta'$ (e.g. $\theta' = 0.5$): add $m$ Heads and $m$ Tails to your data.
$$
\theta \leftarrow \frac{n_H + m}{n_H + n_T + 2m}
$$
For large $n$, this is an insignificant change.
For small $n$, it incorporates your "prior belief" about what $\theta$ should be.
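
To make the small-$n$ versus large-$n$ behavior concrete, here is a minimal sketch of the smoothed estimate. The function name `smoothed_estimate`, the value $m = 5$, and the larger sample of 10,000 tosses are assumptions chosen only for illustration.

```python
def smoothed_estimate(n_H, n_T, m):
    """Estimate P(H) after adding m imaginary heads and m imaginary tails."""
    return (n_H + m) / (n_H + n_T + 2 * m)

# Small sample: the estimate is pulled noticeably from 0.4 toward 0.5.
small = smoothed_estimate(4, 6, m=5)         # (4 + 5) / (10 + 10)

# Large sample with the same 40% heads: the imaginary throws barely matter.
large = smoothed_estimate(4000, 6000, m=5)   # 4005 / 10010

print(small, large)
```

With no data at all (`n_H = n_T = 0`), the estimate is exactly the prior guess of $0.5$.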

Can we derive this update formally?

Let $\theta$ be a **random variable**, drawn from a Dirichlet distribution.
__Note:__ Here we cross over into __Bayesian__ statistics. $\theta$ is **not** a random variable associated with an event in a sample space.
In frequentist statistics, this is forbidden. In Bayesian statistics, this is allowed.
- In lecture, the Dirichlet distribution was briefly introduced as a probability distribution over probability distributions. All this really means is that
a Dirichlet distribution is a distribution over the $(k - 1)$-dimensional probability simplex: each sample from this distribution has all of its
components greater than or equal to 0, and the components sum to 1 (note that the sample itself has $k$ components, but the simplex is $(k - 1)$-dimensional because
of the sum-to-one constraint). To help understanding, imagine rolling a poorly made 6-sided die. But before rolling anything,
you must draw a die from a bag full of dice of different sizes. Drawing a die from this bag is just like sampling from a Dirichlet distribution. The Dirichlet distribution says which dice are more likely and which are less likely. For example, you could have a strong belief that dice with roughly even probabilities are likely, whereas dice with highly skewed probabilities (e.g. only "1" ever comes up) are extremely unlikely.
After drawing the die, your die represents yet another probability distribution, this time over its 6 faces.
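
The bag-of-dice picture can be sketched in a few lines. The example below uses only the standard library and relies on the standard fact that normalizing independent Gamma($\alpha_i$, 1) draws yields a Dirichlet($\alpha$) sample. The function name `sample_dirichlet` and the concentration values 50 and 0.1 are assumptions for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the run is reproducible

# Large, equal parameters: the bag mostly contains nearly fair dice.
fair_ish_die = sample_dirichlet([50.0] * 6, rng)

# Small parameters: the bag mostly contains heavily skewed dice.
skewed_die = sample_dirichlet([0.1] * 6, rng)

# The drawn die is itself a distribution over 6 faces, so we can roll it.
roll = rng.choices(range(1, 7), weights=fair_ish_die, k=1)[0]

print(fair_ish_die, skewed_die, roll)
```

Each sampled die has six non-negative components that sum to 1, i.e. it is a point on the 5-dimensional simplex.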

As $\theta$ is a **random variable** drawn from a Dirichlet distribution, we can express $P(\theta)$ as
\begin{align}
P(\theta) = \frac{\theta^{\beta_1 - 1}(1 - \theta)^{\beta_0 - 1}}{B(\beta_1, \beta_0)}
\end{align}
where $B(\beta_1, \beta_0)$ is the normalization constant. Note that this is also the density of the Beta distribution, which is the special case of
the Dirichlet distribution with exactly two outcomes (here heads and tails) and hence two parameters, $\beta_1$ and $\beta_0$. The Dirichlet distribution is the multivariate generalization of the Beta distribution.

For the MAP estimate, we pick the __most likely $\theta$ given the data__.
\begin{align}
\theta &= \operatorname{argmax}_{\theta} P(\theta | Data) \\
&= \operatorname{argmax}_{\theta} \frac{P(Data | \theta)P(\theta)}{P(Data)} && \text{(By Bayes rule)} \\
&= \operatorname{argmax}_{\theta} \log(P(Data | \theta)) + \log(P(\theta)) && \text{($P(Data)$ does not depend on $\theta$)} \\
&= \operatorname{argmax}_{\theta} n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta) + (\beta_1 - 1)\cdot \log(\theta) + (\beta_0 - 1) \cdot \log(1 - \theta) \\
&= \operatorname{argmax}_{\theta} (n_H + \beta_1 - 1) \cdot \log(\theta) + (n_T + \beta_0 - 1) \cdot \log(1 - \theta) \\
&\Longrightarrow \theta = \frac{n_H + \beta_1 - 1}{n_H + n_T + \beta_0 + \beta_1 - 2}
\end{align}
- MAP is a great estimator if prior belief exists and is accurate.
- It can be very wrong if prior belief is wrong!
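
The closed form in the last line of the derivation ties back to the simple fix: choosing $\beta_1 = \beta_0 = m + 1$ gives exactly $\frac{n_H + m}{n_H + n_T + 2m}$. A minimal sketch, where the function name `map_estimate` and the value $m = 5$ are illustrative assumptions:

```python
def map_estimate(n_H, n_T, beta_1, beta_0):
    """MAP estimate of theta from the closed form; assumes the denominator is positive."""
    return (n_H + beta_1 - 1) / (n_H + n_T + beta_1 + beta_0 - 2)

n_H, n_T = 4, 6

# beta_1 = beta_0 = 1 is a flat prior, so the MAP estimate equals the MLE.
theta_flat = map_estimate(n_H, n_T, 1, 1)          # 4 / 10

# beta_1 = beta_0 = m + 1 reproduces "add m heads and m tails".
m = 5
theta_map = map_estimate(n_H, n_T, m + 1, m + 1)   # (4 + 5) / (10 + 10)

print(theta_flat, theta_map)
```

In other words, the prior parameters act as pseudo-counts: $\beta_1 - 1$ imaginary heads and $\beta_0 - 1$ imaginary tails.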

### "True" Bayesian approach

Let $\theta$ be our parameter.

We allow $\theta$ to be a random variable. If we are to make a *prediction* using $\theta$, we formulate the prediction as
$$
P(Y|X, D) = \int_{\theta} P(Y, \theta | X, D) d\theta = \int_{\theta} P(Y | X, \theta) P(\theta | D) d\theta
$$
The second equality factors the joint as $P(Y | X, \theta, D) P(\theta | X, D)$ and uses that, given $\theta$, the label $Y$ does not depend on $D$, and that $\theta$ does not depend on the test input $X$.
This is called the **posterior predictive distribution**. Unfortunately, the integral above is generally *intractable* in closed form, and
various other techniques, such as Monte Carlo approximations, are used to approximate the distribution.
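
For the coin example, a Monte Carlo approximation can be sketched and checked against an exact answer. This sketch assumes the Beta prior from the MAP section, under which the posterior $P(\theta | D)$ is $\text{Beta}(n_H + \beta_1, n_T + \beta_0)$ (a standard conjugacy result), and there is no input $X$. The prior parameters $\beta_1 = \beta_0 = 6$ and the sample count are hypothetical choices for illustration.

```python
import random

n_H, n_T = 4, 6
beta_1, beta_0 = 6, 6   # hypothetical prior parameters, for illustration only

rng = random.Random(0)  # fixed seed so the run is reproducible
num_samples = 100_000

# Draw theta from the posterior Beta(n_H + beta_1, n_T + beta_0).
samples = [rng.betavariate(n_H + beta_1, n_T + beta_0) for _ in range(num_samples)]

# P(next toss = H | theta) = theta, so the predictive is the average sampled theta.
p_heads_mc = sum(samples) / num_samples

# In this conjugate case the integral has a closed form: the posterior mean.
p_heads_exact = (n_H + beta_1) / (n_H + n_T + beta_1 + beta_0)

print(p_heads_mc, p_heads_exact)
```

The coin is a special case where the integral is tractable, which is what makes the comparison possible; for most models only the sampling route is available, and drawing from $P(\theta | D)$ is itself the hard part.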