Lecture 6: Logistic Regression


Logistic Regression

In this lecture we will learn about the discriminative counterpart to Gaussian Naive Bayes (Naive Bayes for continuous features).

Machine learning algorithms can be (roughly) divided into two categories:

Generative algorithms, which estimate $P(\mathbf{x},y)$ (often they model $P(\mathbf{x}|y)$ and $P(y)$ separately).
Discriminative algorithms, which model $P(y|\mathbf{x})$.

The Naive Bayes algorithm is generative. It models $P(\mathbf{x}|y)$ and makes explicit assumptions on its distribution (e.g. multinomial, categorical, Gaussian, ...). The parameters of this distribution are estimated with MLE or MAP. We showed previously that for some of these distributions, e.g. Multinomial or Gaussian Naive Bayes, it is the case that $P(y|\mathbf{x})=\frac{1}{1+e^{-y(\mathbf{w}^\top\mathbf{x}+b)}}$ with labels $y\in\{+1,-1\}$, for a specific vector $\mathbf{w}$ and scalar $b$ that are uniquely determined through the particular choice of $P(\mathbf{x}|y)$.

Logistic Regression is often referred to as the discriminative counterpart of Naive Bayes. Here, we model $P(y|\mathbf{x})$ and assume that it takes on exactly this form: $P(y|\mathbf{x})=\frac{1}{1+e^{-y(\mathbf{w}^\top\mathbf{x}+b)}}$, with labels $y\in\{+1,-1\}$. We make few assumptions on $P(\mathbf{x}|y)$; e.g. it could be Gaussian or Multinomial. Ultimately it doesn't matter, because we estimate the vector $\mathbf{w}$ and the scalar $b$ directly with MLE or MAP.
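To make the model concrete, here is a minimal sketch in Python (NumPy) that evaluates $P(y|\mathbf{x})$ for both labels. The function name `prob_y_given_x` and the values of `w`, `b` and `x` are hypothetical, chosen only for illustration.

```python
import numpy as np

def prob_y_given_x(w, b, x, y):
    """P(y | x) = 1 / (1 + exp(-y (w^T x + b))) for a label y in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(-y * (np.dot(w, x) + b)))

# Hypothetical parameters and input, chosen only for illustration.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])

p_pos = prob_y_given_x(w, b, x, +1)
p_neg = prob_y_given_x(w, b, x, -1)

# Here w^T x + b = -0.5 is negative, so the label -1 is the more likely one.
print(p_pos, p_neg, p_pos + p_neg)
```

The two probabilities sum to one because $\frac{1}{1+e^{-z}}+\frac{1}{1+e^{z}}=1$ for any $z$, so the single formula defines a valid distribution over $y\in\{+1,-1\}$.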

For a lot more details, I strongly suggest that you read this excellent book chapter by Tom Mitchell.

Maximum Likelihood Estimate (MLE)

In MLE we choose parameters that maximize the conditional data likelihood. The conditional data likelihood is the probability of the observed values of $Y$ in the training data conditioned on the values of $\mathbf{X}$. We choose the parameters that maximize this function or, equivalently, its logarithm. $\begin{aligned} \log \bigg(\prod_{i=1}^n P(y_i|\mathbf{x}_i;\mathbf{w},b)\bigg) &= -\sum_{i=1}^n \log(1+e^{-y_i(\mathbf{w}^\top\mathbf{x}_i+b)})\\ \mathbf{w},b &= \operatorname*{argmax}_{\mathbf{w},b} -\sum_{i=1}^n \log(1+e^{-y_i(\mathbf{w}^\top\mathbf{x}_i+b)})\\ &=\operatorname*{argmin}_{\mathbf{w},b}\sum_{i=1}^n \log(1+e^{-y_i(\mathbf{w}^\top\mathbf{x}_i+b)}) \end{aligned}$

We need to estimate the parameters $\mathbf{w}, b$. To find the values of the parameters at the minimum, we can try to find solutions for $\nabla_{(\mathbf{w},b)} \sum_{i=1}^n \log(1+e^{-y_i(\mathbf{w}^\top\mathbf{x}_i+b)})=0$. This equation has no closed-form solution, so we will use Gradient Descent on the function $\ell(\mathbf{w},b)=\sum_{i=1}^n \log(1+e^{-y_i(\mathbf{w}^\top\mathbf{x}_i+b)})$.
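The gradient descent procedure mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the fixed step size, the iteration count, the function names and the toy data are all hypothetical choices. The gradient follows from the chain rule: each example contributes $-y_i\mathbf{x}_i/(1+e^{y_i(\mathbf{w}^\top\mathbf{x}_i+b)})$ to the gradient with respect to $\mathbf{w}$.

```python
import numpy as np

def logistic_loss(w, b, X, y):
    """Negative log-likelihood: sum_i log(1 + exp(-y_i (w^T x_i + b)))."""
    margins = y * (X @ w + b)
    # np.logaddexp(0, -m) evaluates log(1 + exp(-m)) without overflow.
    return np.sum(np.logaddexp(0.0, -margins))

def logistic_grad(w, b, X, y):
    """Gradient of the loss with respect to w and b."""
    margins = y * (X @ w + b)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)); the chain rule
    # multiplies by y_i x_i for w and by y_i for b.
    coef = -y / (1.0 + np.exp(margins))
    return X.T @ coef, np.sum(coef)

def gradient_descent(X, y, step=0.1, iters=1000):
    """Plain gradient descent with a fixed step size (an assumed choice)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        grad_w, grad_b = logistic_grad(w, b, X, y)
        w = w - step * grad_w
        b = b - step * grad_b
    return w, b

# Hypothetical, non-separable 1-D toy data with labels in {+1, -1}.
X = np.array([[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]])
y = np.array([-1.0, -1.0, 1.0, -1.0, 1.0, 1.0])

w, b = gradient_descent(X, y)
initial_loss = logistic_loss(np.zeros(1), 0.0, X, y)
final_loss = logistic_loss(w, b, X, y)
print(initial_loss, final_loss)
```

The toy data is deliberately not linearly separable. On separable data the loss has no finite minimizer: the weights keep growing as the loss approaches zero.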

Maximum a Posteriori (MAP) Estimate

Before we begin, let us absorb the parameter $b$ into $\mathbf{w}$ through an additional constant dimension (similar to the Perceptron). In the MAP estimate we treat $\mathbf{w}$ as a random variable and can specify a prior belief distribution over it. We may use: $\mathbf{w} \sim \mathcal{N}(\mathbf{0},\tau^2 I)$.
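Absorbing $b$ into $\mathbf{w}$ amounts to appending a constant feature of 1 to every input and appending $b$ to the weight vector. The short check below is a sketch with hypothetical values for `X`, `w` and `b`; it only verifies that both formulations give the same scores.

```python
import numpy as np

# Hypothetical inputs (one row per example) and parameters.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
w = np.array([0.5, -1.0])
b = 2.0

# Append a constant 1 to every input and append b to the weight vector.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)

scores_with_bias = X @ w + b
scores_absorbed = X_aug @ w_aug
print(np.allclose(scores_with_bias, scores_absorbed))
```

One consequence of this trick is that a prior placed on the augmented $\mathbf{w}$ also applies to the bias entry.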

Our goal in MAP is to find the most likely model parameters given the data. $\begin{aligned} P(\mathbf{w}|\text{Data}) &\propto P(\text{Data}|\mathbf{w})P(\mathbf{w})\\ \operatorname*{argmax}_{\mathbf{w}} \log\left[P(\text{Data}|\mathbf{w})P(\mathbf{w})\right] &= \operatorname*{argmin}_{\mathbf{w}} \sum_{i=1}^n \log(1+e^{-y_i\mathbf{w}^\top\mathbf{x}_i})+\lambda\mathbf{w}^\top\mathbf{w}, \end{aligned}$

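where $\lambda=\frac{1}{2\tau^2}$ for a zero-mean Gaussian prior with variance $\tau^2$ in every dimension (the negative log of the prior is $\frac{1}{2\tau^2}\mathbf{w}^\top\mathbf{w}$ plus a constant). Once again, this function has no closed-form solution, but we can use Gradient Descent on the loss function $\ell(\mathbf{w})=\sum_{i=1}^n \log(1+e^{-y_i\mathbf{w}^\top\mathbf{x}_i})+\lambda\mathbf{w}^\top\mathbf{w}$ to find the optimal parameters $\mathbf{w}$.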
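As a sketch of the MAP objective above, the code below runs gradient descent on the regularized loss. It assumes a prior with variance $\tau^2$ in every dimension, so that $\lambda=\frac{1}{2\tau^2}$. The function names, step size, iteration count and toy data are hypothetical. The only change from the MLE gradient is the extra term $2\lambda\mathbf{w}$.

```python
import numpy as np

def map_loss(w, X, y, lam):
    """sum_i log(1 + exp(-y_i w^T x_i)) + lam * w^T w."""
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins)) + lam * (w @ w)

def map_grad(w, X, y, lam):
    """Gradient of map_loss: the data term plus 2 * lam * w from the prior."""
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))
    return X.T @ coef + 2.0 * lam * w

def fit_map(X, y, tau, step=0.1, iters=1000):
    """Gradient descent on the MAP loss, assuming lam = 1 / (2 tau^2)."""
    lam = 1.0 / (2.0 * tau**2)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - step * map_grad(w, X, y, lam)
    return w

# Hypothetical toy data; the last column is the constant feature that
# absorbs the bias b into w.
X = np.array([[-2.5, 1.0], [-1.5, 1.0], [-0.5, 1.0],
              [0.5, 1.0], [1.5, 1.0], [2.5, 1.0]])
y = np.array([-1.0, -1.0, 1.0, -1.0, 1.0, 1.0])

w_loose = fit_map(X, y, tau=2.0)   # weak prior, small lambda
w_tight = fit_map(X, y, tau=0.5)   # strong prior, large lambda
print(np.linalg.norm(w_loose), np.linalg.norm(w_tight))
```

A smaller $\tau$ means a tighter prior around zero, hence a larger $\lambda$ and a solution with smaller norm.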

For a better understanding of the connection between Naive Bayes and Logistic Regression, you may refer to these notes.