Recall the Bayes optimal classifier: all we needed was $P(Y|X)$. Most of supervised learning can be viewed as estimating $P(X, Y)$.

There are two cases of supervised learning:

- When we estimate $P(Y|X)$ directly, we call it *discriminative learning.*
- When we estimate $P(X|Y)P(Y)$, we call it *generative learning.*

Some machine learning algorithms (e.g. kNN) do not estimate $P(Y|X)$ but only the function $f(x)=\textrm{argmax}_y P(Y=y|x)$. This is also considered discriminative learning. Not estimating the probabilities can provide some flexibility in terms of what approach is used, but one loses the advantage of knowing probability estimates of the labels and may not have a good reading on the certainty of a prediction.

So, how can we estimate probabilities from data?

There are many ways to estimate probabilities from data.

Consider a simple scenario: we toss a coin several times and observe $n_H$ heads and $n_T$ tails. So, intuitively, $$ P(H) \approx \frac{n_H}{n_H + n_T} = 0.4 $$ Can we derive this formally?

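Maximum likelihood estimation (MLE) picks the parameter $\theta$ under which the observed data $D$ is most likely: \begin{align} \hat{\theta}_{MLE} = \textrm{argmax}_{\theta} \,P(D\mid \theta) \end{align}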

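For the sequence of coin flips we can use the binomial distribution to model $P(D\mid \theta)$: \begin{align} P(D\mid \theta) = \binom{n_H + n_T}{n_H} \theta^{n_H} (1 - \theta)^{n_T} \end{align}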
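As a sketch of how the intuitive estimate follows from the MLE definition, assume the flips are independent with $P(H)=\theta$, so that the likelihood is proportional to $\theta^{n_H}(1-\theta)^{n_T}$, and that the maximum lies in the interior $0 < \theta < 1$. Taking the logarithm does not change the maximizer, and the binomial coefficient does not depend on $\theta$:

\begin{align}
\hat{\theta}_{MLE} &= \textrm{argmax}_{\theta} \; \theta^{n_H}(1-\theta)^{n_T} \\
&= \textrm{argmax}_{\theta} \; n_H \log\theta + n_T \log(1-\theta)
\end{align}

Setting the derivative with respect to $\theta$ to zero gives

\begin{align}
\frac{n_H}{\theta} - \frac{n_T}{1-\theta} = 0 \quad\Longrightarrow\quad n_H (1-\theta) = n_T\, \theta \quad\Longrightarrow\quad \hat{\theta}_{MLE} = \frac{n_H}{n_H + n_T}
\end{align}

which matches the intuitive relative-frequency estimate.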

- MLE gives the explanation of the data you observed.
- If $n$ is large and your model/distribution is correct (that is $\mathcal{H}$ includes the true model), then MLE finds the **true** parameters.
- But the MLE can overfit the data if $n$ is small. It works well when $n$ is large.
- If you do not have the correct model (and $n$ is small) then MLE can be terribly wrong!
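The small-$n$ caveat can be illustrated with a minimal Python sketch. The function name `mle_heads` and both flip sequences are hypothetical; the first sequence (4 heads, 6 tails) is only chosen so that it reproduces the value $0.4$ used above.

```python
from collections import Counter


def mle_heads(flips):
    """Return the MLE of P(heads) for a sequence of 'H'/'T' outcomes."""
    counts = Counter(flips)
    n_heads, n_tails = counts["H"], counts["T"]
    if n_heads + n_tails == 0:
        raise ValueError("need at least one flip")
    # Relative frequency of heads: n_H / (n_H + n_T).
    return n_heads / (n_heads + n_tails)


# Hypothetical data: 4 heads and 6 tails.
print(mle_heads("HTTHHHTTTT"))  # 0.4

# With only two flips, both heads, the MLE claims tails is impossible.
print(mle_heads("HH"))  # 1.0
```

The second call shows the overfitting behaviour: two observations are enough for the estimate to assign probability zero to tails.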

Can we derive this formally?

Now, we can look at $P(\theta \mid D) = \frac{P(D\mid \theta) P(\theta)}{P(D)}$ (recall Bayes Rule!), where

- $P(D \mid \theta)$ is the **likelihood** of the data given the parameter(s) $\theta$,
- $P(\theta)$ is the **prior** distribution over the parameter(s) $\theta$, and
- $P(\theta \mid D)$ is the **posterior** distribution over the parameter(s) $\theta$.

Now, we can use the Beta distribution to model $P(\theta)$: \begin{align} P(\theta) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)} \end{align} where $B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is the normalization constant. Note that here we only need a distribution over a binary random variable. The multivariate generalization of the Beta distribution is the Dirichlet distribution.

Why use the Beta distribution?

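- it models probabilities: $\theta$ lives on $\left[0,1\right]$ (in the Dirichlet generalization, the components additionally satisfy $\sum_i \theta_i =1$)
- it is a **conjugate prior** for the binomial likelihood, i.e. the posterior is again a Beta distribution $\rightarrow$ the math will turn out nicely:

\begin{align} P(\theta \mid D) \propto P(D \mid \theta) P(\theta) \propto \theta^{n_H + \alpha - 1} (1 - \theta)^{n_T + \beta - 1} \end{align}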

So far, we have a distribution over $\theta$. How can we get an estimate for $\theta$?
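One option, sketched here under the Beta prior and binomial likelihood assumed in this section, is the maximum a posteriori (MAP) estimate, i.e. the mode of the posterior:

\begin{align}
\hat{\theta}_{MAP} &= \textrm{argmax}_{\theta} \; P(\theta \mid D) \\
&= \textrm{argmax}_{\theta} \; \log P(D \mid \theta) + \log P(\theta) \\
&= \textrm{argmax}_{\theta} \; n_H \log\theta + n_T \log(1-\theta) + (\alpha - 1)\log\theta + (\beta - 1)\log(1-\theta) \\
&= \frac{n_H + \alpha - 1}{n_H + n_T + \alpha + \beta - 2}
\end{align}

The last step repeats the MLE calculation with $n_H + \alpha - 1$ and $n_T + \beta - 1$ in place of the raw counts. It assumes $n_H + \alpha > 1$ and $n_T + \beta > 1$, so that the maximum lies in the interior of $\left[0,1\right]$. In this sense the prior acts like $\alpha - 1$ extra heads and $\beta - 1$ extra tails.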

- As $n \rightarrow \infty$, $\hat\theta_{MAP} \rightarrow \hat\theta_{MLE}$.
- MAP is a great estimator if prior belief exists and is accurate.
- If $n$ is small, it can be very wrong if prior belief is wrong!
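These points can be checked numerically with a small Python sketch. It assumes the Beta-binomial coin model of this section, for which the posterior mode is $(n_H + \alpha - 1)/(n_H + n_T + \alpha + \beta - 2)$. The prior parameters and the counts are hypothetical values chosen for illustration.

```python
def mle(n_heads, n_tails):
    """Maximum likelihood estimate of P(heads)."""
    return n_heads / (n_heads + n_tails)


def map_estimate(n_heads, n_tails, alpha, beta):
    """Mode of the Beta(n_heads + alpha, n_tails + beta) posterior.

    Assumes n_heads + alpha > 1 and n_tails + beta > 1.
    """
    return (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)


# Hypothetical prior: a belief that the coin is roughly fair.
alpha, beta = 5.0, 5.0

# Keep the heads/tails ratio fixed at 4:6 and grow the sample size.
for scale in (1, 10, 100, 1000):
    n_h, n_t = 4 * scale, 6 * scale
    theta_mle = mle(n_h, n_t)
    theta_map = map_estimate(n_h, n_t, alpha, beta)
    print(f"n={n_h + n_t:5d}  MLE={theta_mle:.4f}  MAP={theta_map:.4f}")
```

With few flips the prior pulls the MAP estimate toward $0.5$; as $n$ grows the pseudo-counts become negligible and the two estimates agree.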

In general, the posterior predictive distribution is $$ P(Y\mid D,X) = \int_{\theta}P(Y,\theta \mid D,X) d\theta = \int_{\theta} P(Y \mid \theta, D,X) P(\theta \mid D) d\theta $$ Unfortunately, the above is generally intractable in closed form.
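The coin-flip model is one case where the integral does have a closed form, which makes it a convenient check for a sampling-based approximation. The Python sketch below assumes the Beta prior and binomial likelihood of this section, so the posterior is $\textrm{Beta}(n_H + \alpha, n_T + \beta)$, and it ignores features $X$ because the coin has none. The function names, counts and prior parameters are hypothetical.

```python
import random


def posterior_predictive_mc(n_heads, n_tails, alpha, beta,
                            num_samples=100_000, seed=0):
    """Monte Carlo estimate of P(heads | D).

    Draws theta from the Beta posterior and averages P(heads | theta) = theta.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.betavariate(n_heads + alpha, n_tails + beta)
        total += theta
    return total / num_samples


def posterior_predictive_exact(n_heads, n_tails, alpha, beta):
    """Closed form for this model: the mean of the Beta posterior."""
    return (n_heads + alpha) / (n_heads + n_tails + alpha + beta)


# Hypothetical data and prior.
approx = posterior_predictive_mc(4, 6, alpha=2.0, beta=2.0)
exact = posterior_predictive_exact(4, 6, alpha=2.0, beta=2.0)
print(f"Monte Carlo: {approx:.4f}  exact: {exact:.4f}")
```

For models without a conjugate prior there is no such closed form to compare against, and drawing the posterior samples itself becomes the hard part.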

- **MLE** Prediction: $P(y|x_t;\theta)$. Learning: $\theta=\textrm{argmax}_\theta P(D;\theta)$. Here $\theta$ is purely a model parameter.
- **MAP** Prediction: $P(y|x_t,\theta)$. Learning: $\theta=\textrm{argmax}_\theta P(\theta|D) = \textrm{argmax}_\theta P(D \mid \theta) P(\theta)$. Here $\theta$ is a random variable.
- **"True Bayesian"** Prediction: $P(y|x_t,D)=\int_{\theta}P(y|x_t,\theta)P(\theta|D)d\theta$. Here $\theta$ is integrated out: our prediction takes all possible models into account.