Lecture 3: The Perceptron


Assumptions

  1. Binary classification (i.e. $y_i \in \{-1, +1\}$)
  2. Data is linearly separable

Classifier

$$ h(x_i) = \textrm{sign}(\vec{w} \cdot \vec{x_i} + b) $$
$b$ is the bias term (without the bias term, the hyperplane that $\vec{w}$ defines would always have to go through the origin). Dealing with $b$ can be a pain, so we 'absorb' it into the weight vector $\vec{w}$ by adding one additional constant dimension to the feature vector. Under this convention, $$ \vec{x_i} \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} \vec{x_i} \\ 1 \end{bmatrix} \\ \vec{w} \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} \vec{w} \\ b \end{bmatrix} \\ $$ We can verify that $$ \begin{bmatrix} \vec{x_i} \\ 1 \end{bmatrix} \cdot \begin{bmatrix} \vec{w} \\ b \end{bmatrix} = \vec{w} \cdot \vec{x_i} + b $$ Using this convention, we can simplify the above formulation of $h(x_i)$ to $$ h(x_i) = \textrm{sign}(\vec{w} \cdot \vec{x_i}) $$
(Left:) The original data is 1-dimensional (top row) or 2-dimensional (bottom row). There is no hyperplane that passes through the origin and separates the red and blue points. (Right:) After a constant dimension is added to all data points, such a hyperplane exists.
Observation: Note that $$ y_i(\vec{w} \cdot \vec{x_i}) > 0 \Longleftrightarrow x_i \hspace{0.1in} \text{is classified correctly} $$ where 'classified correctly' means that $x_i$ is on the correct side of the hyperplane defined by $\vec{w}$. Also, note that the left side depends on $y_i \in \{-1, +1\}$ (it wouldn't work if, for example, $y_i \in \{0, +1\}$).
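To make this concrete, here is a minimal numeric sketch of the bias trick and the correctness check (the data, labels, and weights below are made-up placeholders, not anything from the lecture):

```python
import numpy as np

# Made-up 2-dimensional data (one point per row) with labels in {-1, +1}.
X = np.array([[ 1.0,  2.0],
              [ 0.5, -1.0],
              [-2.0,  0.3]])
y = np.array([+1, -1, +1])
w = np.array([0.4, -0.7])   # weight vector
b = 0.1                     # bias term

# Absorb the bias: append a constant 1 to every x_i and append b to w.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)

# Both formulations of the classifier agree.
assert np.allclose(X @ w + b, X_aug @ w_aug)

# h(x_i) = sign(w . x_i); x_i is classified correctly iff y_i (w . x_i) > 0.
predictions = np.sign(X_aug @ w_aug)
correctly_classified = y * (X_aug @ w_aug) > 0
```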

Perceptron Algorithm

Now that we know what $\vec{w}$ is supposed to do (define a hyperplane that separates the data), let's look at how we can find such a $\vec{w}$.
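The algorithm itself is simple: start from $\vec{w} = \vec{0}$, cycle over the training points, and whenever a point is misclassified ($y_i(\vec{w} \cdot \vec{x_i}) \le 0$), update $\vec{w}$ to $\vec{w} + y_i\vec{x_i}$; stop once a full pass makes no mistakes. Below is a minimal Python sketch of this loop, assuming the constant dimension has already been appended to the data as above (the max_epochs cap is an added safeguard, not part of the algorithm itself):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Sketch of the Perceptron. X: points with the constant dimension
    appended (one per row); y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])                 # initialize w to the zero vector
    for _ in range(max_epochs):
        made_mistake = False
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:    # x_i is misclassified
                w = w + y_i * x_i            # update: w <- w + y_i * x_i
                made_mistake = True
        if not made_mistake:                 # a full pass with no mistakes
            break
    return w
```

On the toy arrays from the earlier sketch, which happen to be linearly separable, `perceptron(X_aug, y)` returns a $\vec{w}$ that classifies all three points correctly.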

Geometric Intuition

Illustration of a Perceptron update. (Left:) The hyperplane defined by $\mathbf{w}_t$ misclassifies one red (-1) and one blue (+1) point. (Middle:) The red point $\mathbf{x}$ is chosen and used for an update. Because its label is -1, we need to subtract $\mathbf{x}$ from $\mathbf{w}_t$. (Right:) The updated hyperplane $\mathbf{w}_{t+1}=\mathbf{w}_t-\mathbf{x}$ separates the two classes and the Perceptron algorithm has converged.

Quiz: How often can a Perceptron misclassify a point $\mathbf{x}$ repeatedly?

Perceptron Convergence

Suppose that $\exists \vec{w}^*$ such that $y_i(\vec{w}^* \cdot \vec{x_i}) > 0$ $\forall (\vec{x_i}, y_i) \in D$.

Now, suppose that we rescale each data point and $\vec{w}^*$ such that $$ ||\vec{w}^*|| = 1 \hspace{0.3in} \text{and} \hspace{0.3in} ||\vec{x_i}|| \le 1 \hspace{0.1in} \forall \vec{x_i} \in D $$ The margin $\gamma$ of the hyperplane is defined as $$ \gamma = \min_{(\vec{x_i}, y_i) \in D}|\vec{w}^* \cdot \vec{x_i} | $$ We can visualize this as follows: since $||\vec{w}^*|| = 1$, $\gamma$ is the distance from the hyperplane defined by $\vec{w}^*$ to the closest point in $D$.


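A minimal sketch of this margin computation (the separator `w_star` and the points below are made up and already rescaled so that $||\vec{w}^*|| = 1$ and $||\vec{x_i}|| \le 1$):

```python
import numpy as np

# Made-up, already-rescaled data: ||x_i|| <= 1 for every row.
X = np.array([[ 0.3,  0.8],
              [ 0.1, -0.6],
              [-0.7,  0.2]])
w_star = np.array([0.0, 1.0])     # a unit-norm separator, ||w*|| = 1

# gamma = min_i |w* . x_i|
gamma = np.min(np.abs(X @ w_star))
print(gamma)                      # 0.2 for this toy data
```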
Theorem: If all of the above holds, then the perceptron algorithm makes at most $1 / \gamma^2$ mistakes.
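To get a feel for the bound: if the conditions hold with $\gamma = 0.2$ (as in the toy sketch above), the Perceptron makes at most $1/\gamma^2 = 25$ mistakes, no matter how many points $D$ contains or how many passes the algorithm makes over them.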

Proof:
Keeping the definitions above in mind, consider the effect of an update ($\vec{w}$ becomes $\vec{w}+y\vec{x}$) on the two terms $\vec{w} \cdot \vec{w}^*$ and $\vec{w} \cdot \vec{w}$. We will use two facts: