Logistic Regression

Cornell CS 4/5780

Spring 2023


Video (Lecture 11)

In this lecture we will learn about the discriminative counterpart to the Gaussian Naive Bayes (Naive Bayes for continuous features).

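Machine learning algorithms can be (roughly) divided into two categories: generative algorithms, which model how the features are distributed for each label, i.e. \(P(\mathbf{x}_i|y)\), and discriminative algorithms, which model \(P(y|\mathbf{x}_i)\) directly.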

The Naive Bayes algorithm is generative. It models \(P(\mathbf{x}_i|y)\) and makes explicit assumptions on its distribution (e.g. multinomial, categorical, Gaussian, ...). The parameters of this distribution are estimated with MLE or MAP. We showed previously that for Gaussian Naive Bayes \(P(y|\mathbf{x}_i)=\frac{1}{1+e^{-y(\mathbf{w}^T \mathbf{x}_i+b)}}\) for \(y\in\{+1,-1\}\), for a specific vector \(\mathbf{w}\) and scalar \(b\) that are uniquely determined through the particular choice of \(P(\mathbf{x}_i|y)\).
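As a numerical sanity check of this claim, the sketch below compares the Bayes-rule posterior of a Gaussian Naive Bayes model with the logistic form. It assumes that each feature has a variance shared between the two classes (so that the quadratic terms cancel); the names `mu_pos`, `mu_neg`, `var`, and `prior_pos` and all numbers are illustrative and not taken from the lecture.

```python
import numpy as np

def gnb_posterior(x, mu_pos, mu_neg, var, prior_pos):
    """P(y=+1 | x) via Bayes rule with per-feature Gaussians (shared variances)."""
    def log_gauss(x, mu):
        # Sum of per-feature Gaussian log-densities (the Naive Bayes assumption).
        return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))
    log_joint_pos = log_gauss(x, mu_pos) + np.log(prior_pos)
    log_joint_neg = log_gauss(x, mu_neg) + np.log(1 - prior_pos)
    # Normalize the two joint probabilities in log space.
    return np.exp(log_joint_pos - np.logaddexp(log_joint_pos, log_joint_neg))

def gnb_to_linear(mu_pos, mu_neg, var, prior_pos):
    """The w and b implied by the Gaussian Naive Bayes parameters."""
    w = (mu_pos - mu_neg) / var
    b = np.log(prior_pos / (1 - prior_pos)) \
        + np.sum((mu_neg ** 2 - mu_pos ** 2) / (2 * var))
    return w, b

# Illustrative parameters (assumed, not estimated from data).
mu_pos = np.array([1.0, -0.5, 2.0])
mu_neg = np.array([-1.0, 0.5, 0.0])
var = np.array([1.0, 2.0, 0.5])
prior_pos = 0.3
x = np.array([0.2, 1.0, 0.7])

w, b = gnb_to_linear(mu_pos, mu_neg, var, prior_pos)
p_bayes = gnb_posterior(x, mu_pos, mu_neg, var, prior_pos)
p_sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + b)))
print(p_bayes, p_sigmoid)  # the two values should agree up to rounding
```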

Logistic Regression is often referred to as the discriminative counterpart of Naive Bayes. Here, we model \(P(y|\mathbf{x}_i)\) and assume that it takes on exactly this form $$P(y|\mathbf{x}_i)=\frac{1}{1+e^{-y(\mathbf{w}^T \mathbf{x}_i+b)}}.$$ We make few assumptions on \(P(\mathbf{x}_i|y)\), e.g. it could be Gaussian or Multinomial. Ultimately it doesn't matter, because we estimate the vector \(\mathbf{w}\) and the scalar \(b\) directly with MLE or MAP to maximize the conditional likelihood \(\prod_{i} P(y_i|\mathbf{x}_i;\mathbf{w},b )\). For a lot more details, I strongly suggest that you read this excellent book chapter by Tom Mitchell.
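To make the model concrete, here is a minimal sketch of the conditional probability for a single example. The weights, bias, and input are made-up numbers chosen for illustration. Note that the probabilities of the two labels sum to one, and that the predicted label flips where \(\mathbf{w}^T\mathbf{x}+b\) changes sign.

```python
import numpy as np

def cond_prob(y, x, w, b):
    """P(y | x) = 1 / (1 + exp(-y (w^T x + b))) for a label y in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(-y * (w @ x + b)))

w = np.array([2.0, -1.0])   # illustrative weights
b = 0.5                     # illustrative bias
x = np.array([1.0, 3.0])    # here w^T x + b = 2 - 3 + 0.5 = -0.5

p_pos = cond_prob(+1, x, w, b)
p_neg = cond_prob(-1, x, w, b)
prediction = 1 if p_pos >= 0.5 else -1   # predict the more probable label
print(p_pos, p_neg, prediction)
```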

Throughout this lecture we absorb the parameter \(b\) into \(\mathbf{w}\) through an additional constant dimension (similar to the Perceptron).
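A small sketch of this trick, assuming the examples are stored as the columns of a \(d \times n\) matrix (the layout and the numbers are chosen for illustration): we append a constant feature of 1 to every example and append \(b\) to \(\mathbf{w}\), which leaves every score \(\mathbf{w}^T\mathbf{x}_i+b\) unchanged.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # d x n, one column per example
w = np.array([0.5, -1.0])
b = 2.0

# Append a row of ones to X (the constant dimension) and append b to w.
X_aug = np.vstack([X, np.ones((1, X.shape[1]))])   # (d+1) x n
w_aug = np.append(w, b)                            # length d+1

scores_explicit = w @ X + b        # w^T x_i + b for every example
scores_absorbed = w_aug @ X_aug    # same scores, no separate bias term
print(scores_explicit, scores_absorbed)
```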

Maximum likelihood estimate (MLE)

In MLE we choose parameters that maximize the conditional likelihood. The conditional data likelihood \(P(\mathbf y \mid X, \mathbf{w})\) is the probability of the observed labels \(\mathbf y \in \{+1,-1\}^n\) in the training data conditioned on the feature values \(\mathbf{x}_i\). Note that \(X=\left[\mathbf{x}_1, \dots,\mathbf{x}_i, \dots, \mathbf{x}_n\right] \in \mathbb R^{d \times n}\). We choose the parameters that maximize this function, and we assume that the \(y_i\)'s are independent given the input features \(\mathbf{x}_i\) and \(\mathbf{w}\). So, $$\begin{aligned} \hat{\mathbf{w}}_{MLE} &= \operatorname*{argmax}_{\mathbf{w}} P(D|\mathbf{w})& \textrm{(Definition of MLE)}\\ &= \operatorname*{argmax}_{\mathbf{w}}P((y_1,\mathbf{x}_1),\dots,(y_n,\mathbf{x}_n) \mid \mathbf{w}) & \textrm{(Substituting in $D$)}\\ &=\operatorname*{argmax}_{\mathbf{w}}\prod_{i=1}^n P(y_i,\mathbf{x}_i \mid \mathbf{w}) & \textrm{(Data is i.i.d.)}\\ &=\operatorname*{argmax}_{\mathbf{w}}\prod_{i=1}^n P(y_i \mid \mathbf{x}_i,\mathbf{w})P(\mathbf{x}_i \mid \mathbf{w}) & \textrm{(Chain rule of probability)}\\ &=\operatorname*{argmax}_{\mathbf{w}}\prod_{i=1}^n P(y_i \mid \mathbf{x}_i,\mathbf{w})P(\mathbf{x}_i) & \textrm{($\mathbf{x}_i$ does not depend on $\mathbf{w}$)}\\ &=\operatorname*{argmax}_{\mathbf{w}}\prod_{i=1}^n P(y_i \mid \mathbf{x}_i,\mathbf{w}) & \textrm{($P(\mathbf{x}_i)$ does not affect the maximizer)}\\ &=\operatorname*{argmax}_{\mathbf{w}}\sum_{i=1}^n\log\left[ P(y_i \mid \mathbf{x}_i, \mathbf{w})\right] & \textrm{(Taking the $\log$, which is monotonic)}\\ &=\operatorname*{argmax}_{\mathbf{w}} -\sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i}) & \textrm{(Substituting in $P(y_i\mid \mathbf{x}_i,\mathbf{w})$)}\\ &=\operatorname*{argmin}_{\mathbf{w}}\sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i}) & \textrm{(We prefer minimization.)} \end{aligned}$$
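The final objective is easy to evaluate numerically. The sketch below is one possible implementation, assuming the \(d \times n\) layout of \(X\) from above and a tiny made-up data set; it uses `np.logaddexp` so that \(\log(1+e^{-m})\) does not overflow when the margin \(m = y_i\mathbf{w}^T\mathbf{x}_i\) is very negative.

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    """sum_i log(1 + exp(-y_i w^T x_i)) for X of shape (d, n), y in {+1, -1}^n."""
    margins = y * (w @ X)
    # logaddexp(0, -m) = log(exp(0) + exp(-m)) = log(1 + exp(-m)), computed stably.
    return np.sum(np.logaddexp(0.0, -margins))

# Tiny illustrative data set: three examples with two features each.
X = np.array([[1.0, -2.0, 0.5],
              [0.0, 1.0, -1.0]])
y = np.array([1.0, -1.0, 1.0])

# At w = 0 every example has probability 1/2, so the loss is n * log(2).
print(neg_log_likelihood(np.zeros(2), X, y))
```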

We need to estimate the parameters \(\mathbf{w}\). To find the values of the parameters at the minimum, we can try to find solutions for \(\nabla_{\mathbf{w}} \sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i}) =0\). This equation has no closed-form solution, so we will use Gradient Descent on the negative log likelihood \(\ell(\mathbf{w})=\sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i})\).
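Differentiating term by term gives \(\nabla_{\mathbf{w}}\ell(\mathbf{w}) = -\sum_{i=1}^n \frac{y_i\mathbf{x}_i}{1+e^{y_i\mathbf{w}^T\mathbf{x}_i}}\). The following is a minimal sketch of plain Gradient Descent with this gradient, not a tuned implementation. The synthetic data, the iteration count, and the fixed step size are assumptions made for illustration; the step size comes from the fact that the Hessian of \(\ell\) is bounded by \(\frac{1}{4}XX^T\).

```python
import numpy as np

def nll(w, X, y):
    """Negative log likelihood: sum_i log(1 + exp(-y_i w^T x_i))."""
    return np.sum(np.logaddexp(0.0, -y * (w @ X)))

def nll_grad(w, X, y):
    """Gradient: -sum_i y_i x_i / (1 + exp(y_i w^T x_i))."""
    margins = y * (w @ X)
    # 1 / (1 + exp(m)) computed stably as exp(-log(1 + exp(m))).
    coeff = np.exp(-np.logaddexp(0.0, margins))
    return -X @ (y * coeff)

def gradient_descent(X, y, iters=2000):
    # Conservative fixed step: the Hessian is bounded by (1/4) X X^T,
    # whose largest eigenvalue is at most the squared Frobenius norm of X.
    step = 4.0 / np.sum(X ** 2)
    w = np.zeros(X.shape[0])
    for _ in range(iters):
        w = w - step * nll_grad(w, X, y)
    return w

# Synthetic, noisy (hence not linearly separable) data for illustration.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(2, n))                      # d x n
true_scores = np.array([2.0, -1.0]) @ X
p_pos = 1.0 / (1.0 + np.exp(-true_scores))
y = np.where(rng.uniform(size=n) < p_pos, 1.0, -1.0)

w_hat = gradient_descent(X, y)
print(w_hat, nll(w_hat, X, y))
```

If the data were linearly separable, the minimum would not be attained at any finite \(\mathbf{w}\), which is one reason the example uses noisy labels.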

Maximum a Posteriori (MAP) Estimate

In the MAP estimate we treat \(\mathbf{w}\) as a random variable and can specify a prior belief distribution over it. We may use: \(\mathbf{w} \sim \mathbf{\mathcal{N}}(\mathbf 0,\sigma^2 I)\). This places a zero-mean Gaussian prior on the weights of Logistic Regression.
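It is worth writing out the log of this prior once. Writing \(d\) for the dimension of \(\mathbf{w}\), the density of \(\mathcal{N}(\mathbf 0,\sigma^2 I)\) gives $$\begin{aligned} P(\mathbf{w}) &= \frac{1}{(2\pi\sigma^2)^{d/2}}\, e^{-\frac{\mathbf{w}^\top\mathbf{w}}{2\sigma^2}}, \\ \log P(\mathbf{w}) &= -\frac{1}{2\sigma^2}\mathbf{w}^\top\mathbf{w} - \frac{d}{2}\log(2\pi\sigma^2). \end{aligned}$$ The second term does not depend on \(\mathbf{w}\), so it has no effect on where a maximum or minimum over \(\mathbf{w}\) is attained.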

Our goal in MAP is to find the most likely model parameters given the data, i.e., the parameters that maximize the posterior: \[ P(\mathbf{w} \mid D) = P(\mathbf{w} \mid X, \mathbf y) \propto P(\mathbf y \mid X, \mathbf{w}) \; P(\mathbf{w}). \]

We can solve for \(\hat{\mathbf{w}}_{MAP}\) just as we did for the MLE.

$$\begin{aligned} \hat{\mathbf{w}}_{MAP} &= \operatorname*{argmax}_{\mathbf{w}} P(D|\mathbf{w})P(\mathbf{w}) & \textrm{(Definition of MAP)}\\ &= \operatorname*{argmax}_{\mathbf{w}}P((y_1,\mathbf{x}_1),\dots,(y_n,\mathbf{x}_n) \mid \mathbf{w})P(\mathbf{w}) & \textrm{(Substituting in $D$)}\\ &=\operatorname*{argmax}_{\mathbf{w}}\left(\prod_{i=1}^n P(y_i \mid \mathbf{x}_i,\mathbf{w})\right)P(\mathbf{w}) & \textrm{(Data is i.i.d.; $P(\mathbf{x}_i)$ does not depend on $\mathbf{w}$)}\\ &=\operatorname*{argmax}_{\mathbf{w}}\sum_{i=1}^n\log\left[ P(y_i \mid \mathbf{x}_i, \mathbf{w})\right]+\log P(\mathbf{w}) &\textrm{(Taking the $\log$ and using the sum property)}\\ &=\operatorname*{argmin}_{\mathbf{w}}-\sum_{i=1}^n\log\left[ P(y_i \mid \mathbf{x}_i, \mathbf{w})\right]-\log P(\mathbf{w}) & \textrm{(We prefer minimization.)}\\ &=\operatorname*{argmin}_{\mathbf{w}}\sum_{i=1}^n\log\left[1+e^{-y_i \mathbf{w}^T \mathbf{x}_i}\right]+\frac{1}{2\sigma^2}\mathbf{w}^\top\mathbf{w} & \textrm{(Substituting $P(y_i \mid \mathbf{x}_i, \mathbf{w})$ and $P(\mathbf{w})$, dropping constants)}\\ &=\operatorname*{argmin}_{\mathbf{w}}\sum_{i=1}^n\log\left[1+e^{-y_i \mathbf{w}^T \mathbf{x}_i}\right]+\lambda\mathbf{w}^\top\mathbf{w} & \textrm{(Writing the prior term as L2 regularization)} \end{aligned}$$ where \(\lambda = \frac{1}{2\sigma^2}\). Once again, this objective has no closed-form minimizer, but we can use Gradient Descent on the negative log posterior \(\ell(\mathbf{w})=\sum_{i=1}^n \log(1+e^{-y_i\mathbf{w}^T \mathbf{x}_i})+\lambda\mathbf{w}^\top\mathbf{w}\) to find the optimal parameters \(\mathbf{w}\).
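Compared with the MLE objective, the only change for Gradient Descent is the extra gradient term \(2\lambda\mathbf{w}\) contributed by the regularizer. The sketch below fits both estimates on the same synthetic data; the data, the value of \(\sigma^2\), the iteration count, and the fixed step size are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def map_objective(w, X, y, lam):
    """Negative log posterior: sum_i log(1 + exp(-y_i w^T x_i)) + lam * w^T w."""
    return np.sum(np.logaddexp(0.0, -y * (w @ X))) + lam * (w @ w)

def map_gradient(w, X, y, lam):
    margins = y * (w @ X)
    coeff = np.exp(-np.logaddexp(0.0, margins))   # 1 / (1 + exp(margin))
    return -X @ (y * coeff) + 2.0 * lam * w       # regularizer adds 2 * lam * w

def fit(X, y, lam, iters=2000):
    # Conservative fixed step from a curvature bound: ||X||_F^2 / 4 + 2 * lam.
    step = 1.0 / (np.sum(X ** 2) / 4.0 + 2.0 * lam)
    w = np.zeros(X.shape[0])
    for _ in range(iters):
        w = w - step * map_gradient(w, X, y, lam)
    return w

# Synthetic, noisy data for illustration (X is d x n, labels in {+1, -1}).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(2, n))
scores = np.array([2.0, -1.0]) @ X
y = np.where(rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-scores)), 1.0, -1.0)

sigma2 = 0.5                  # assumed prior variance
lam = 1.0 / (2.0 * sigma2)    # lambda = 1 / (2 sigma^2)

w_mle = fit(X, y, lam=0.0)    # lam = 0 recovers the MLE objective
w_map = fit(X, y, lam=lam)
print(np.linalg.norm(w_mle), np.linalg.norm(w_map))
```

A smaller \(\sigma^2\) (a tighter prior) means a larger \(\lambda\) and pulls the MAP weights more strongly towards zero.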

For a better understanding of the connection between Naive Bayes and Logistic Regression, you may take a peek at these excellent notes.


Logistic Regression is the discriminative counterpart to Naive Bayes. In Naive Bayes, we first model \(P(\mathbf{x}|y)\) for each label \(y\), and then obtain the decision boundary that best discriminates between these two distributions. In Logistic Regression we do not attempt to model the data distribution \(P(\mathbf{x}|y)\); instead, we model \(P(y|\mathbf{x})\) directly. We assume the same probabilistic form \(P(y|\mathbf{x})=\frac{1}{1+e^{-y(\mathbf{w}^T \mathbf{x}+b)}}\), but we do not restrict ourselves in any way by making assumptions about \(P(\mathbf{x}|y)\) (in fact it can be any member of the Exponential Family). This allows Logistic Regression to be more flexible, but such flexibility also requires more data to avoid overfitting. Typically, in scenarios with little data and if the modeling assumption is appropriate, Naive Bayes tends to outperform Logistic Regression. However, as data sets become large, Logistic Regression often outperforms Naive Bayes, which suffers from the fact that the assumptions made on \(P(\mathbf{x}|y)\) are probably not exactly correct. If the assumptions hold exactly, i.e. the data is truly drawn from the distribution that we assumed in Naive Bayes, then Logistic Regression and Naive Bayes converge to the exact same result in the limit (but Naive Bayes will converge faster).