So how can we estimate $\hat{P}(y | \vec{x})$?

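One way to do this would be to use maximum likelihood estimation (MLE). Assuming that $y$ is discrete, we count how often the label $y$ occurs among the training points that are identical to $\vec{x}$: $$ \hat{P}(y|\vec{x}) = \frac{\sum_{i=1}^{n} I(\vec{x}_i = \vec{x} \wedge y_i = y)}{ \sum_{i=1}^{n} I(\vec{x}_i = \vec{x})} $$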
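
The counting estimate above can be sketched in a few lines of Python. This is only an illustration: the function name `mle_conditional` and the toy data are hypothetical, and returning `None` when no training point matches the query is a choice made here to make the 0/0 case visible.

```python
def mle_conditional(X, y, x_query, label):
    """Estimate P(label | x_query) by counting exact matches of x_query."""
    # Labels of all training points that are identical to the query point.
    matches = [yi for xi, yi in zip(X, y) if tuple(xi) == tuple(x_query)]
    if not matches:
        return None  # denominator is zero: the estimate is undefined
    return sum(1 for yi in matches if yi == label) / len(matches)


# Hypothetical toy data set with two binary features.
X = [(1, 0), (1, 0), (0, 1), (1, 0)]
y = [+1, -1, +1, +1]

p_seen = mle_conditional(X, y, (1, 0), +1)    # 2 of the 3 matching points have label +1
p_unseen = mle_conditional(X, y, (1, 1), +1)  # (1, 1) never occurs in X
```

The second call shows the weakness of the approach: a query that was never observed exactly has no estimate at all.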

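In high-dimensional or continuous feature spaces, however, this estimate breaks down: we rarely observe any training point with $\vec{x}_i = \vec{x}$ exactly, so both sums are zero and the estimate is undefined.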

To get around this issue, we can make a 'naive' assumption.

Estimating $P(y)$ is easy. If $Y$ takes on discrete binary values, for example, this just becomes coin tossing. We simply need to count how many times we observe each outcome (in this case each class): $$P(y = c) = \frac{\sum_{i=1}^{n} I(y_i = c)}{n} = \hat\pi_c $$
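
As a small illustration of this counting rule, the sketch below computes $\hat\pi_c$ for every class that appears in a label list. The helper name `class_priors` and the labels are made up for the example.

```python
from collections import Counter


def class_priors(y):
    """Return {c: fraction of labels equal to c}, i.e. the estimate pi_hat_c."""
    n = len(y)
    counts = Counter(y)
    return {c: counts[c] / n for c in counts}


# Hypothetical binary labels.
y = [+1, -1, +1, +1]
pi_hat = class_priors(y)  # three of the four labels are +1
```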

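Estimating $P(\vec{x}|y)$, however, is not easy! The additional assumption that we make is the Naive Bayes assumption: the feature values are conditionally independent given the label, that is, $P(\vec{x} | y) = \prod_{\alpha=1}^{d} P(x_\alpha | y)$.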

So, for now, let's pretend the Naive Bayes assumption holds.

Then the Bayes Classifier can be defined as \begin{align} h(\vec{x}) &= \operatorname*{argmax}_y P(y | \vec{x}) \\ &= \operatorname*{argmax}_y \; \frac{P(\vec{x} | y)P(y)}{P(\vec{x})} \\ &= \operatorname*{argmax}_y \; P(\vec{x} | y) P(y) && \text{($P(\vec{x})$ does not depend on $y$)} \\ &= \operatorname*{argmax}_y \; P(y) \prod_{\alpha=1}^{d} P(x_\alpha | y) && \text{(by the naive assumption)}\\ &= \operatorname*{argmax}_y \; \sum_{\alpha = 1}^{d} \log(P(x_\alpha | y)) + \log(P(y)) && \text{(as log is a monotonically increasing function)} \end{align} Estimating $\log(P(x_\alpha | y))$ is easy as we only need to consider one dimension at a time. And estimating $P(y)$ is not affected by the assumption.
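
The last line of the derivation translates directly into code. The following is a minimal sketch, assuming categorical features and probability tables that are already given; the names `nb_predict`, `log_prior`, and `log_cpt` and all numbers are hypothetical. Working with sums of logs rather than products of probabilities also avoids numerical underflow when $d$ is large.

```python
import math


def nb_predict(x, log_prior, log_cpt):
    """Return argmax_y of sum_a log P(x_a | y) + log P(y).

    log_prior[c]      : log P(y = c)
    log_cpt[c][a][v]  : log P(x_a = v | y = c)
    """
    def score(c):
        return log_prior[c] + sum(log_cpt[c][a][v] for a, v in enumerate(x))

    return max(log_prior, key=score)


# Hypothetical tables for two binary features and labels +1 / -1.
log_prior = {+1: math.log(0.5), -1: math.log(0.5)}
log_cpt = {
    +1: [{0: math.log(0.2), 1: math.log(0.8)}, {0: math.log(0.7), 1: math.log(0.3)}],
    -1: [{0: math.log(0.6), 1: math.log(0.4)}, {0: math.log(0.1), 1: math.log(0.9)}],
}

pred_a = nb_predict((1, 0), log_prior, log_cpt)  # 0.5*0.8*0.7 beats 0.5*0.4*0.1
pred_b = nb_predict((0, 1), log_prior, log_cpt)  # 0.5*0.2*0.3 loses to 0.5*0.6*0.9
```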

Training the Naive Bayes classifier corresponds to estimating $\vec{\theta}_{jc}$ for all $j$ and $c$ and storing them in the respective conditional probability tables (CPTs). Also note that setting $l=0$ gives the MLE estimator, while $l>0$ leads to a MAP estimate. If we set $l= +1$ we get Laplace smoothing.
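
To make the role of $l$ concrete, here is a sketch of filling the conditional probability tables for categorical features. It assumes the usual additive form of the estimate, (count $+\, l$) divided by (class count $+\, l \cdot K_\alpha$), where $K_\alpha$ is the number of values feature $\alpha$ can take. The function name `fit_cpt`, the argument `n_values`, and the data are hypothetical.

```python
def fit_cpt(X, y, n_values, l=1.0):
    """Estimate theta[c][a][j], an estimate of P(x_a = j | y = c).

    n_values[a] is the number of distinct values of feature a (assumed known);
    l is the smoothing parameter (l = 0 reduces to plain counting).
    """
    theta = {}
    for c in sorted(set(y)):
        rows = [xi for xi, yi in zip(X, y) if yi == c]
        theta[c] = []
        for a, k in enumerate(n_values):
            counts = [sum(1 for xi in rows if xi[a] == j) for j in range(k)]
            theta[c].append([(cnt + l) / (len(rows) + l * k) for cnt in counts])
    return theta


# Hypothetical toy data set with two binary features.
X = [(1, 0), (1, 0), (0, 1), (1, 0)]
y = [+1, -1, +1, +1]

theta_mle = fit_cpt(X, y, n_values=[2, 2], l=0.0)
theta_smooth = fit_cpt(X, y, n_values=[2, 2], l=1.0)
```

With $l = 0$ the single example of class $-1$ forces a probability of exactly zero for the feature value it does not show, while $l = 1$ moves that estimate to $1/3$.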

1. For binary labels $y \in \{-1, +1\}$ and multinomial features, we can show that $$ h(\vec{x}) = \operatorname*{argmax}_y \; P(y) \prod_{\alpha = 1}^d P(x_\alpha \mid y) = \operatorname{sign}(\vec{w}^\top \vec{x} + b) $$ That is, $$ \vec{w}^\top \vec{x} + b > 0 \Longleftrightarrow h(\vec{x}) = +1. $$

As before, we define $P(x_\alpha|y=+1)\propto\theta_{\alpha+}^{x_\alpha}$ and $P(Y=+1)=\pi_+$, and we set \begin{align} [\vec{w}]_\alpha &= \log(\theta_{\alpha +}) - \log(\theta_{\alpha -}) \\ b &= \log(\pi_+) - \log(\pi_-) \end{align} To classify with these parameters, we evaluate $\vec{w}^\top \vec{x} + b$ and check its sign.

Expanding $\vec{w}^\top \vec{x} + b$ with these definitions leads to \begin{align} \vec{w}^\top \vec{x} + b > 0 &\Longleftrightarrow \sum_{\alpha = 1}^{d} [\vec{x}]_\alpha (\log(\theta_{\alpha +}) - \log(\theta_{\alpha -})) + \log(\pi_+) - \log(\pi_-) > 0 \\ &\Longleftrightarrow \frac{\prod_{\alpha = 1}^{d} P([\vec{x}]_\alpha | Y = +1)\pi_+}{\prod_{\alpha =1}^{d}P([\vec{x}]_\alpha | Y = -1)\pi_-} > 1 && \text{(Exponentiating both sides)} \\ &\Longleftrightarrow P(Y = +1 | \vec{x}) > P(Y = -1 | \vec{x}) && \text{(By our naive Bayes assumption)} \\ &\Longleftrightarrow h(\vec{x}) = +1 && \text{(By definition of $h(\vec{x})$)} \end{align}

2. In the case of continuous features (Gaussian Naive Bayes), we can show that $$ P(y \mid \vec{x}) = \frac{1}{1 + e^{-y (\vec{w}^\top \vec{x} +b) }} $$ This functional form is the same as that of logistic regression (LR). NB and LR produce asymptotically the same model if the Naive Bayes assumption holds.
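
The equivalence derived for the multinomial case can also be checked numerically. The sketch below compares the label obtained from the Naive Bayes log-scores with the sign of $\vec{w}^\top \vec{x} + b$ on random count vectors. All parameter values and function names are hypothetical, and the check is a sanity test under these assumptions, not a proof.

```python
import math
import random


def nb_label(x, theta_pos, theta_neg, pi_pos, pi_neg):
    """Label from the class log-scores log(pi_c) + sum_a x_a * log(theta_ac).

    The multinomial coefficient is identical for both classes and is omitted.
    """
    s_pos = math.log(pi_pos) + sum(xa * math.log(t) for xa, t in zip(x, theta_pos))
    s_neg = math.log(pi_neg) + sum(xa * math.log(t) for xa, t in zip(x, theta_neg))
    return +1 if s_pos > s_neg else -1


def linear_label(x, theta_pos, theta_neg, pi_pos, pi_neg):
    """Label from sign(w^T x + b) with w and b built from the log-parameters."""
    w = [math.log(tp) - math.log(tn) for tp, tn in zip(theta_pos, theta_neg)]
    b = math.log(pi_pos) - math.log(pi_neg)
    return +1 if sum(wa * xa for wa, xa in zip(w, x)) + b > 0 else -1


# Hypothetical parameters for d = 3 features.
theta_pos, theta_neg = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
pi_pos, pi_neg = 0.6, 0.4

rng = random.Random(0)
xs = [[rng.randint(0, 5) for _ in range(3)] for _ in range(200)]
agree = all(
    nb_label(x, theta_pos, theta_neg, pi_pos, pi_neg)
    == linear_label(x, theta_pos, theta_neg, pi_pos, pi_neg)
    for x in xs
)
```

The parameters are chosen so that no integer count vector lands exactly on the decision boundary, which keeps floating-point rounding from affecting the comparison.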