Lecture 3: The Perceptron
Assumptions
- Binary classification (i.e. y_i \in \{-1, +1\})
- Data is linearly separable
Classifier
h(\vec{x}_i) = \textrm{sign}(\vec{w} \cdot \vec{x}_i + b)
b is the bias term (without the bias term, the hyperplane that \vec{w} defines would always have to go through the origin).
Dealing with b can be a pain, so we 'absorb' it into the weight vector \vec{w} by adding one additional constant dimension to each input \vec{x}_i.
Under this convention,
\vec{x}_i \text{ becomes } \begin{bmatrix} \vec{x}_i \\ 1 \end{bmatrix} \qquad \vec{w} \text{ becomes } \begin{bmatrix} \vec{w} \\ b \end{bmatrix}
We can verify that
\begin{bmatrix} \vec{x}_i \\ 1 \end{bmatrix} \cdot \begin{bmatrix} \vec{w} \\ b \end{bmatrix} = \vec{w} \cdot \vec{x}_i + b
Using this, we can simplify the above formulation of h(\vec{x}_i) to
h(\vec{x}_i) = \textrm{sign}(\vec{w} \cdot \vec{x}_i)
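As a quick sanity check, here is a minimal NumPy sketch of the bias-absorption trick (the specific numbers are made up for illustration):

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5])   # a feature vector x_i (illustrative values)
w = np.array([0.3,  0.8, -0.2])  # weight vector w
b = 0.7                          # bias term

# Absorb the bias: append a constant 1 to x_i and append b to w.
x_aug = np.append(x, 1.0)
w_aug = np.append(w, b)

# The augmented dot product reproduces the original affine score w·x_i + b.
assert np.isclose(w_aug @ x_aug, w @ x + b)
print(np.sign(w_aug @ x_aug))    # h(x_i)
```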
Observation: Note that
y_i (\vec{w} \cdot \vec{x}_i) > 0 \Longleftrightarrow \vec{x}_i \textrm{ is classified correctly}
where 'classified correctly' means that \vec{x}_i is on the correct side of the hyperplane defined by \vec{w}.
Also, note that the left side depends on y_i \in \{-1,+1\} (it wouldn't work if, for example, y_i \in \{0,+1\}).
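To see why both labels are handled by the same expression, here is a tiny numeric check (the vectors are made up): for a point on the positive side the score \vec{w}\cdot\vec{x} is positive and y = +1 keeps the product positive, while for a point on the negative side the score is negative and y = -1 flips the product back to positive.

```python
import numpy as np

w = np.array([1.0, -2.0])                  # some hyperplane (bias already absorbed)

x_pos, y_pos = np.array([3.0, 1.0]), +1    # w·x = 1.0 > 0, label +1: correct
x_neg, y_neg = np.array([0.0, 1.0]), -1    # w·x = -2.0 < 0, label -1: correct

print(y_pos * (w @ x_pos) > 0)             # True
print(y_neg * (w @ x_neg) > 0)             # True
```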
Perceptron Algorithm
Now that we know what \vec{w} is supposed to do (define a hyperplane that separates the data), let's look at how we can obtain such a \vec{w}.
Perceptron Algorithm
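Below is a minimal NumPy sketch of the Perceptron training loop, consistent with the update rule analyzed in the proof further down (initialize \vec{w} = 0 and add y\vec{x} whenever y(\vec{w}\cdot\vec{x}) \le 0); the function name, the max_epochs safeguard, and the toy data are illustrative additions, not part of the lecture.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Train a Perceptron. X is an (n, d) array with the constant-1 dimension
    already appended; y holds labels in {-1, +1}."""
    w = np.zeros(X.shape[1])              # initialize w to the zero vector
    for _ in range(max_epochs):           # safeguard in case the data is not separable
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # x_i is misclassified
                w = w + y_i * x_i         # update: w <- w + y_i * x_i
                mistakes += 1
        if mistakes == 0:                 # a full pass without mistakes: converged
            return w
    return w

# Toy usage: an AND-like separable data set (constant dimension already appended).
X = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([-1, -1, -1, +1])
w = perceptron(X, y)
print(np.sign(X @ w) == y)                # [ True  True  True  True]
```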
Geometric Intuition
Quiz#1: Can you draw a visualization of a Perceptron update?
Quiz#2: How often can a Perceptron misclassify a point \vec{x} repeatedly?
Perceptron Convergence
Suppose that \exists \vec{w}^* such that y_i (\vec{w}^* \cdot \vec{x}_i) > 0 \ \forall (\vec{x}_i, y_i) \in D.
Now, suppose that we rescale each data point and \vec{w}^* such that
\|\vec{w}^*\| = 1 \quad \text{and} \quad \|\vec{x}_i\| \le 1 \ \forall \vec{x}_i \in D
The margin of a hyperplane, \gamma, is defined as
\gamma = \min_{(\vec{x}_i, y_i) \in D} |\vec{x}_i \cdot \vec{w}^*|
We can visualize this as follows
- All inputs \vec{x_i} live within the unit sphere
- \gamma is the distance from the hyperplane (blue) to the closest data point
- \vec{w}^* lies on the unit sphere
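With the rescaling above, the margin is simply the smallest absolute score |\vec{x}_i \cdot \vec{w}^*| over the data set; here is a short NumPy sketch (the vectors are made-up examples satisfying the norm assumptions):

```python
import numpy as np

def margin(X, w_star):
    """gamma = min_i |x_i . w*|, assuming ||w*|| = 1 and ||x_i|| <= 1 for every row."""
    return np.min(np.abs(X @ w_star))

X = np.array([[0.5, 0.5],
              [-0.3, -0.8]])        # both rows have norm at most 1
w_star = np.array([1.0, 0.0])       # a unit-norm separator (for labels +1, -1)
print(margin(X, w_star))            # 0.3
```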
Theorem: If all of the above holds, then the perceptron algorithm makes at most 1 / \gamma^2 mistakes.
Proof:
Keeping what we defined above, consider the effect of an update (\vec{w} becomes \vec{w}+y\vec{x}) on the two terms \vec{w} \cdot \vec{w}^* and \vec{w} \cdot \vec{w}.
We will use two facts:
- y( \vec{x}\cdot \vec{w})\leq 0: This holds because \vec x is misclassified by \vec{w} - otherwise we wouldn't make the update.
- y( \vec{x}\cdot \vec{w}^*)>0: This holds because \vec{w}^* is a separating hyper-plane and classifies all points correctly.
1. Consider the effect of an update on \vec{w} \cdot \vec{w}^*:
(\vec{w} + y\vec{x}) \cdot \vec{w}^* = \vec{w} \cdot \vec{w}^* + y(\vec{x}\cdot \vec{w}^*) \ge \vec{w} \cdot \vec{w}^* + \gamma
The inequality follows from the fact that, for \vec{w}^*, the distance from the hyperplane defined by \vec{w}^* to \vec{x} must be at least \gamma (i.e. y (\vec{x}\cdot \vec{w}^*)=|\vec{x}\cdot\vec{w}^*|\geq \gamma).
This means that for each update, \vec{w} \cdot \vec{w}^* grows by at least \gamma.
2. Consider the effect of an update on \vec{w} \cdot \vec{w}:
(\vec{w} + y\vec{x})\cdot (\vec{w} + y\vec{x}) = \vec{w} \cdot \vec{w} + 2y(\vec{w} \cdot\vec{x}) + y^2(\vec{x}\cdot \vec{x}) \le \vec{w} \cdot \vec{w} + 1
The inequality follows from the fact that
- 2y(\vec{w}\cdot \vec{x}) \le 0 as we had to make an update, meaning \vec{x} was misclassified (the first fact above)
- y^2(\vec{x} \cdot \vec{x}) \le 1 as y^2 = 1 and all \vec{x}\cdot \vec{x}\leq 1 (because \|\vec x\|\leq 1).
This means that for each update, \vec{w} \cdot \vec{w} grows by at most 1.
3. Now we can put together the above findings. Suppose the Perceptron made M updates.
\begin{align}
M\gamma &\le \vec{w}\cdot\vec{w}^* &&\text{By (1)} \\
&= |\vec{w}\cdot\vec{w}^*| &&\text{By (1) again (the dot-product must be non-negative because the initialization is 0 and each update increases it by at least $\gamma$)} \\
&\le ||\vec{w}||\ ||\vec{w}^*|| &&\text{By the Cauchy-Schwarz inequality} \\
&= ||\vec{w}|| &&\text{As $||\vec{w}^*|| = 1$} \\
&= \sqrt{\vec{w} \cdot \vec{w}} && \text{by definition of $\|\vec{w}\|$} \\
&\le \sqrt{M} &&\text{By (2)} \\
&\Rightarrow M\gamma \le \sqrt{M} \\
&\Rightarrow M^2\gamma^2 \le M \\
&\Rightarrow M \le \frac{1}{\gamma^2}
\end{align}
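To get a feel for the bound (a hypothetical numeric illustration, not from the lecture): if the rescaled data has margin \gamma = 0.1, the Perceptron makes at most 1/\gamma^2 = 1/0.01 = 100 mistakes in total, regardless of how many points D contains or in which order they are presented.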
History
- Initially, huge wave of excitement ("Digital brains")
- Then, it contributed to the A.I. Winter. Famous counter-example: the XOR problem (Minsky 1969).
- If the data is not linearly separable, the Perceptron loops forever.