Linear Regression

Cornell CS 4/5780

Spring 2022


Assumptions

Data Assumption: \(y_{i} \in \mathbb{R}\)
Model Assumption: \(y_{i} = \mathbf{w}^T\mathbf{x}_i + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\)
\(\Rightarrow y_i|\mathbf{x}_i \sim N(\mathbf{w}^T\mathbf{x}_i, \sigma^2) \Rightarrow P(y_i|\mathbf{x}_i,\mathbf{w})=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(\mathbf{x}_i^T\mathbf{w}-y_i)^2}{2\sigma^2}}\)

In words, we assume that the data is drawn from a "line" \(\mathbf{w}^T \mathbf{x}\) through the origin (one can always add a bias / offset through an additional dimension, similar to the Perceptron). For each data point with features \(\mathbf{x}_i\), the label \(y_i\) is drawn from a Gaussian with mean \(\mathbf{w}^T \mathbf{x}_i\) and variance \(\sigma^2\). Our task is to estimate the slope \(\mathbf{w}\) from the data.
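To make the model assumption concrete, here is a minimal sketch (assuming NumPy; the dimensions, true weight vector, and noise level are made-up illustrative values, not from the notes) that generates data exactly as described above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 3, 100, 0.5            # illustrative choices

w_true = rng.standard_normal(d)      # the unknown "slope" w we hope to recover
X = rng.standard_normal((d, n))      # columns are the feature vectors x_i
eps = rng.normal(0.0, sigma, n)      # epsilon_i ~ N(0, sigma^2)
y = w_true @ X + eps                 # y_i = w^T x_i + epsilon_i
```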

Estimating with MLE

\[ \begin{aligned} \hat{\mathbf{w}}_{\text{MLE}} &= \operatorname*{argmax}_{\mathbf{w}} \; P(y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n|\mathbf{w})\\ &= \operatorname*{argmax}_{\mathbf{w}} \; \prod_{i=1}^n P(y_i,\mathbf{x}_i|\mathbf{w}) & \textrm{Because the data points are sampled independently}\\ &= \operatorname*{argmax}_{\mathbf{w}} \; \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})P(\mathbf{x}_i|\mathbf{w}) & \textrm{Chain rule of probability}\\ &= \operatorname*{argmax}_{\mathbf{w}} \; \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})P(\mathbf{x}_i) & \textrm{\(\mathbf{x}_i\) is independent of \(\mathbf{w}\); we only model \(P(y_i|\mathbf{x}_i)\)}\\ &= \operatorname*{argmax}_{\mathbf{w}} \; \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w}) & \textrm{\(P(\mathbf{x}_i)\) does not depend on \(\mathbf{w}\) and can be dropped}\\ &= \operatorname*{argmax}_{\mathbf{w}} \; \sum_{i=1}^n \log\left[P(y_i|\mathbf{x}_i,\mathbf{w})\right] & \textrm{log is a monotonically increasing function}\\ &= \operatorname*{argmax}_{\mathbf{w}} \; \sum_{i=1}^n \left[ \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \log\left(e^{-\frac{(\mathbf{x}_i^T\mathbf{w}-y_i)^2}{2\sigma^2}}\right)\right] & \textrm{Plugging in the Gaussian density}\\ &= \operatorname*{argmax}_{\mathbf{w}} \; -\frac{1}{2\sigma^2}\sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)^2 & \textrm{First term is a constant, and \(\log(e^z)=z\)}\\ &= \operatorname*{argmin}_{\mathbf{w}} \; \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)^2 & \textrm{Multiply by the constant \(\frac{2\sigma^2}{n}>0\) and negate, turning max into min}\\ \end{aligned} \]

We are minimizing a loss function, \(l(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)^2\). This particular loss function is also known as the squared loss or Ordinary Least Squares (OLS). In this form, it has a natural interpretation as the average squared error of the prediction over the training set. OLS can be optimized with gradient descent, Newton's method, or in closed form.
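As a hedged illustration of the gradient-descent route (reusing the \(d \times n\) matrix \(\mathbf{X}\) and label vector \(\mathbf{y}\) from the snippet above; the step size and iteration count are arbitrary, untuned choices):

```python
import numpy as np

def squared_loss(w, X, y):
    # l(w) = (1/n) * sum_i (x_i^T w - y_i)^2, with X of shape (d, n)
    return np.mean((X.T @ w - y) ** 2)

def ols_gradient_descent(X, y, step=0.01, iters=2000):
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = (2.0 / n) * X @ (X.T @ w - y)   # gradient of the average squared error
        w -= step * grad
    return w

# w_gd = ols_gradient_descent(X, y)
# squared_loss(w_gd, X, y) should be roughly sigma**2 on the synthetic data above.
```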

Closed Form Solution: if \( \mathbf{X} \mathbf{X}^T \) is invertible, then \[\hat{\mathbf{w}} = (\mathbf{X X}^T)^{-1}\mathbf{X}\mathbf{y}^T \text{ where } \mathbf{X}=\left[\mathbf{x}_1,\dots,\mathbf{x}_n\right] \in \mathbb{R}^{d \times n} \text{ and } \mathbf{y}=\left[y_1,\dots,y_n\right] \in \mathbb{R}^{1 \times n}.\] Otherwise, there is not a unique solution, and any \( \mathbf{w} \) that is a solution of the linear equation \[ \mathbf{X X}^T \hat{\mathbf{w}} = \mathbf{X}\mathbf{y}^T \] minimizes the objective.
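A sketch of the closed-form route in code (assuming the same \(\mathbf{X}\in\mathbb{R}^{d\times n}\), \(\mathbf{y}\) layout as above); solving the \(d \times d\) linear system is generally preferable to forming the explicit inverse:

```python
import numpy as np

def ols_closed_form(X, y):
    A = X @ X.T                   # the d x d matrix X X^T
    b = X @ y                     # the right-hand side X y^T
    return np.linalg.solve(A, b)  # raises LinAlgError if X X^T is singular
```

If \(\mathbf{X}\mathbf{X}^T\) is singular, `np.linalg.lstsq(X.T, y, rcond=None)[0]` returns the minimum-norm solution among the minimizers instead.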

Estimating with MAP

To use MAP, we will need to make an additional modeling assumption of a prior for the weight \( \mathbf{w} \). \[ P(\mathbf{w}) = \frac{1}{\sqrt{2\pi\tau^2}}e^{-\frac{\mathbf{w}^T\mathbf{w}}{2\tau^2}}.\] With this, our MAP estimator becomes \[ \begin{align} \hat{\mathbf{w}}_{\text{MAP}} &= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; P(\mathbf{w}|y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n)\\ &= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \frac{P(y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n|\mathbf{w})P(\mathbf{w})}{P(y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n)}\\ &= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; P(y_1,\mathbf{x}_1,...,y_n,\mathbf{x}_n|\mathbf{w})P(\mathbf{w})\\ &= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \left[\prod_{i=1}^nP(y_i,\mathbf{x}_i|\mathbf{w})\right]P(\mathbf{w})\\ &= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \left[\prod_{i=1}^nP(y_i|\mathbf{x}_i,\mathbf{w})P(\mathbf{x}_i|\mathbf{w})\right]P(\mathbf{w})\\ &= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \left[\prod_{i=1}^nP(y_i|\mathbf{x}_i,\mathbf{w})P(\mathbf{x}_i)\right]P(\mathbf{w})\\ &= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \left[\prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})\right]P(\mathbf{w})\\ &= \operatorname*{argmax}_{\mathbf{\mathbf{w}}} \; \sum_{i=1}^n \log P(y_i|\mathbf{x}_i,\mathbf{w})+ \log P(\mathbf{w})\\ &= \operatorname*{argmin}_{\mathbf{\mathbf{w}}} \; \frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)^2 + \frac{1}{2\tau^2}\mathbf{w}^T\mathbf{w}\\ &= \operatorname*{argmin}_{\mathbf{\mathbf{w}}} \; \frac{1}{n} \left( \sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)^2 + \lambda \| \mathbf{w} \|_2^2 \right) \tag*{\(\lambda=\frac{\sigma^2}{\tau^2}\)}\\ \end{align} \]
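Spelling out the last two steps: plugging the Gaussian densities into the log-probabilities and dropping terms that do not depend on \(\mathbf{w}\) gives
\[ \sum_{i=1}^n \log P(y_i|\mathbf{x}_i,\mathbf{w}) + \log P(\mathbf{w}) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)^2 - \frac{1}{2\tau^2}\mathbf{w}^T\mathbf{w} + \text{const}, \]
so maximizing is the same as minimizing \(\frac{1}{2\sigma^2}\sum_{i=1}^n(\mathbf{x}_i^T\mathbf{w}-y_i)^2 + \frac{1}{2\tau^2}\mathbf{w}^T\mathbf{w}\); multiplying by the positive constant \(\frac{2\sigma^2}{n}\) does not change the minimizer and yields the final line with \(\lambda=\frac{\sigma^2}{\tau^2}\).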

This objective is known as Ridge Regression. It has the closed-form solution \(\hat{\mathbf{w}} = (\mathbf{X} \mathbf{X}^T+\lambda \mathbf{I})^{-1}\mathbf{X}\mathbf{y}^T,\) where \(\mathbf{X}=\left[\mathbf{x}_1,\dots,\mathbf{x}_n\right]\) and \(\mathbf{y}=\left[y_1,\dots,y_n\right]\) as before. This solution always exists and is unique (why?).
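A corresponding sketch for ridge regression (same assumed data layout as before; `lam` stands for \(\lambda\)):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    d = X.shape[0]
    A = X @ X.T + lam * np.eye(d)   # X X^T + lambda I
    b = X @ y                       # X y^T
    return np.linalg.solve(A, b)

# w_ridge = ridge_closed_form(X, y, lam=0.1)
```

For \(\lambda > 0\) the matrix \(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I}\) is always invertible, so the solve cannot fail here.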

Summary

Ordinary Least Squares: \(\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)^2\). Squared loss, no regularization; closed-form solution \(\hat{\mathbf{w}} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{y}^T\) when \(\mathbf{X}\mathbf{X}^T\) is invertible.
Ridge Regression: \(\min_{\mathbf{w}} \frac{1}{n}\left(\sum_{i=1}^n (\mathbf{x}_i^T\mathbf{w}-y_i)^2 + \lambda\|\mathbf{w}\|_2^2\right)\). Squared loss with \(l_2\)-regularization; closed-form solution \(\hat{\mathbf{w}} = (\mathbf{X}\mathbf{X}^T+\lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^T\), which always exists and is unique for \(\lambda>0\).