## Logistic Regression

In this lecture we will learn about the discriminative counterpart to the Gaussian Naive Bayes (Naive Bayes for continuous features).

Machine learning algorithms can be (roughly) divided into two categories:

• Generative algorithms, that estimate $P(\mathbf{x}_i,y)$ (often they model $P(\mathbf{x}_i|y)$ and $P(y)$ separately).
• Discriminative algorithms, that model $P(y|\mathbf{x}_i)$ directly.

The Naive Bayes algorithm is generative. It models $P(\mathbf{x}_i|y)$ and makes explicit assumptions on its distribution (e.g. multinomial, categorical, Gaussian, ...). The parameters of these distributions are estimated with MLE or MAP. We showed previously that for the Gaussian Naive Bayes $P(y|\mathbf{x}_i)=\frac{1}{1+e^{-y(\mathbf{w}^T \mathbf{x}_i+b)}}$ for $y\in\{+1,-1\}$, for a specific vector $\mathbf{w}$ and scalar $b$ that are uniquely determined through the particular choice of $P(\mathbf{x}_i|y)$.

Logistic Regression is often referred to as the discriminative counterpart of Naive Bayes. Here, we model $P(y|\mathbf{x}_i)$ and assume that it takes on exactly this form $$P(y|\mathbf{x}_i)=\frac{1}{1+e^{-y(\mathbf{w}^T \mathbf{x}_i+b)}}.$$ We make few assumptions on $P(\mathbf{x}_i|y)$, e.g. it could be Gaussian or Multinomial. Ultimately it doesn't matter, because we estimate $\mathbf{w}$ and $b$ directly with MLE or MAP.
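To make the model concrete, here is a minimal Python sketch of how $P(y|\mathbf{x}_i)$ could be evaluated for a given $\mathbf{w}$ and $b$. The helper names `sigmoid` and `prob_label` and the numeric values are illustrative assumptions, not part of the lecture.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def prob_label(y, x, w, b):
    """P(y | x) for y in {+1, -1} under the logistic model."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b  # w^T x + b
    return sigmoid(y * score)

# Illustrative (made-up) parameters and input.
w, b = [2.0, -1.0], 0.5
x = [1.0, 3.0]

p_pos = prob_label(+1, x, w, b)
p_neg = prob_label(-1, x, w, b)
print(p_pos, p_neg, p_pos + p_neg)
```

Because $\frac{1}{1+e^{-z}} + \frac{1}{1+e^{z}} = 1$, the two probabilities always sum to one, so a single formula with $y\in\{+1,-1\}$ covers both classes.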

For a lot more details, I strongly suggest that you read this excellent book chapter by Tom Mitchell.

### Maximum likelihood estimate (MLE)

In MLE we choose parameters that maximize the conditional likelihood. The conditional data likelihood $P(\vec y \mid X, \mathbf{w})$ is the probability of the observed labels $\vec y \in \{+1,-1\}^n$ in the training data, conditioned on the feature values $\mathbf{x}_i$. Note that $X=\left[\mathbf{x}_1, \dots,\mathbf{x}_i, \dots, \mathbf{x}_n\right] \in \mathbb R^{d \times n}$. We choose the parameters that maximize this function, and we assume that the $y_i$'s are independent given the input features $\mathbf{x}_i$ and $\mathbf{w}$. So,

$$\begin{aligned} P(\vec y \mid X, \mathbf{w}) = \prod_{i=1}^n P(y_i \mid \mathbf{x}_i, \mathbf{w}). \end{aligned}$$

Now if we take the log, we obtain

$$\begin{aligned} \log \bigg(\prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})\bigg) &= -\sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i}). \end{aligned}$$

Note that we absorbed the parameter $b$ into $\mathbf{w}$ through an additional constant dimension (similar to the Perceptron). The maximum likelihood estimate is therefore

$$\begin{aligned} \hat{\mathbf{w}}_{MLE} &= \operatorname*{argmax}_{\mathbf{w}} -\sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i})\\ &=\operatorname*{argmin}_{\mathbf{w}}\sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i}). \end{aligned}$$
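As a rough illustration, the negative log likelihood can be evaluated as in the following Python sketch. It assumes the data is stored as a list of points (one row per $\mathbf{x}_i$, rather than the columns of $X$) with a constant 1 appended as the last coordinate to absorb $b$. The toy data and the function names `log1p_exp` and `neg_log_likelihood` are made up for this example.

```python
import math

def log1p_exp(t):
    """Stable log(1 + exp(t)); avoids overflow when t is large."""
    if t > 0:
        return t + math.log1p(math.exp(-t))
    return math.log1p(math.exp(t))

def neg_log_likelihood(w, X, y):
    """sum_i log(1 + exp(-y_i * w^T x_i)) over the training set."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(wj * xj for wj, xj in zip(w, x_i))
        total += log1p_exp(-margin)
    return total

# Toy 1-d data; the trailing 1.0 is the constant dimension that absorbs b.
X = [[1.0, 1.0], [2.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
y = [+1, +1, -1, -1]

nll_zero = neg_log_likelihood([0.0, 0.0], X, y)  # every point contributes log 2
nll_good = neg_log_likelihood([1.0, 0.0], X, y)  # classifies all points correctly
print(nll_zero, nll_good)
```

With $\mathbf{w}=\vec 0$ every label has probability $\frac{1}{2}$, so the loss is $n\log 2$. A $\mathbf{w}$ that gives every point a positive margin $y_i\mathbf{w}^T\mathbf{x}_i$ yields a smaller value.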

We need to estimate the parameters $\mathbf{w}$. To find the values of the parameters at the minimum, we can try to find solutions for $\nabla_{\mathbf{w}} \sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i}) =0$. This equation has no closed-form solution, so we will use Gradient Descent on the negative log likelihood $\ell(\mathbf{w})=\sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i})$.
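Differentiating $\ell(\mathbf{w})$ term by term gives $\nabla_{\mathbf{w}}\ell(\mathbf{w}) = -\sum_{i=1}^n \frac{y_i\mathbf{x}_i}{1+e^{y_i\mathbf{w}^T\mathbf{x}_i}}$. The sketch below plugs this gradient into plain gradient descent. The step size, the number of steps, the zero initialization, and the toy (deliberately non-separable) data are all assumptions chosen for illustration, not recommendations.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nll(w, X, y):
    """Negative log likelihood, using a stable form of log(1 + exp(t))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        t = -y_i * dot(w, x_i)
        total += max(t, 0.0) + math.log1p(math.exp(-abs(t)))
    return total

def nll_gradient(w, X, y):
    """Gradient: -sum_i y_i x_i / (1 + exp(y_i w^T x_i))."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        coeff = -y_i * sigmoid(-y_i * dot(w, x_i))
        for j, xj in enumerate(x_i):
            grad[j] += coeff * xj
    return grad

def gradient_descent(X, y, step_size=0.1, num_steps=500):
    """Fixed-step gradient descent started from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(num_steps):
        g = nll_gradient(w, X, y)
        w = [wj - step_size * gj for wj, gj in zip(w, g)]
    return w

# Toy data with overlapping classes; last coordinate is the constant 1 for b.
X = [[2.0, 1.0], [1.0, 1.0], [-0.5, 1.0], [0.5, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
y = [+1, +1, +1, -1, -1, -1]

w_hat = gradient_descent(X, y)
print(w_hat, nll(w_hat, X, y))
```

The toy classes overlap on purpose: if the data were linearly separable, the loss could always be reduced further by scaling $\mathbf{w}$ up, so the iterates would keep growing instead of settling at a finite minimizer.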

### Maximum a Posteriori (MAP) Estimate

In the MAP estimate we treat $\mathbf{w}$ as a random variable and can specify a prior belief distribution over it. We may use: $\mathbf{w} \sim \mathbf{\mathcal{N}}(\vec 0,\sigma^2 I)$. This places a zero-mean Gaussian prior on the parameters of LR.

Our goal in MAP is to find the most likely model parameters given the data, i.e., the parameters that maximize the posterior.

$$\begin{aligned} P(\mathbf{w} \mid D) = P(\mathbf{w} \mid X, \vec y) &\propto P(\vec y \mid X, \mathbf{w}) \; P(\mathbf{w})\\ \hat{\mathbf{w}}_{MAP} = \operatorname*{argmax}_{\mathbf{w}} \log \, \left(P(\vec y \mid X, \mathbf{w}) P(\mathbf{w})\right) &= \operatorname*{argmin}_{\mathbf{w}} \sum_{i=1}^n \log(1+e^{-y_i\mathbf{w}^T \mathbf{x}_i})+\lambda\mathbf{w}^\top\mathbf{w}, \end{aligned}$$

where $\lambda = \frac{1}{2\sigma^2}$. Once again, this minimization has no closed-form solution, but we can use Gradient Descent on the negative log posterior $\ell(\mathbf{w})=\sum_{i=1}^n \log(1+e^{-y_i\mathbf{w}^T \mathbf{x}_i})+\lambda\mathbf{w}^\top\mathbf{w}$ to find the optimal parameters $\mathbf{w}$.
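Relative to the MLE objective, the only change in the gradient is the extra term $2\lambda\mathbf{w}$ contributed by the prior. The following Python sketch shows one way this could look. The function names, step size, iteration count, and toy data are illustrative assumptions. Since $b$ is absorbed into $\mathbf{w}$, this sketch also penalizes the bias coordinate, exactly as the objective above is written.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def neg_log_posterior_gradient(w, X, y, lam):
    """Gradient of sum_i log(1 + exp(-y_i w^T x_i)) + lam * w^T w."""
    grad = [2.0 * lam * wj for wj in w]  # contribution of the Gaussian prior
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(wj * xj for wj, xj in zip(w, x_i))
        coeff = -y_i * sigmoid(-margin)
        for j, xj in enumerate(x_i):
            grad[j] += coeff * xj
    return grad

def fit_map(X, y, sigma2, step_size=0.1, num_steps=500):
    """Fixed-step gradient descent on the negative log posterior."""
    lam = 1.0 / (2.0 * sigma2)  # lambda = 1 / (2 sigma^2)
    w = [0.0] * len(X[0])
    for _ in range(num_steps):
        g = neg_log_posterior_gradient(w, X, y, lam)
        w = [wj - step_size * gj for wj, gj in zip(w, g)]
    return w

def norm(w):
    return math.sqrt(sum(wj * wj for wj in w))

# Toy data; last coordinate is the constant 1 that absorbs b.
X = [[2.0, 1.0], [1.0, 1.0], [-0.5, 1.0], [0.5, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
y = [+1, +1, +1, -1, -1, -1]

w_tight = fit_map(X, y, sigma2=0.5)   # narrow prior, lambda = 1
w_loose = fit_map(X, y, sigma2=50.0)  # broad prior, lambda = 0.01
print(norm(w_tight), norm(w_loose))
```

A smaller $\sigma^2$ means a larger $\lambda$, which pulls the solution more strongly toward $\vec 0$; a very large $\sigma^2$ makes the prior nearly flat, and the estimate approaches the MLE.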

For a better understanding of the connection between Naive Bayes and Logistic Regression, you may take a peek at these excellent notes.