
Lecture 3: k-Nearest Neighbors

CS4780 — Introduction to Machine Learning

The k-Nearest Neighbors Algorithm

k-Nearest Neighbors Illustration

k-NN Formally

What should we do in case of a tie?

k-NN For Regression
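
As a concrete companion to the definitions above, here is a minimal Python sketch of k-NN for both classification and regression. It is illustrative rather than the lecture's reference implementation: it assumes Euclidean distance, breaks classification ties in favor of the label seen first among the closest neighbors (one of several reasonable rules), and the function name `knn_predict` is made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, classify=True):
    """Illustrative k-NN: Euclidean distance; majority vote or neighbor average."""
    # Distance from the query point to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest training points, closest first.
    nearest = np.argsort(dists)[:k]
    neighbor_labels = y_train[nearest]
    if classify:
        # Majority vote. Counter preserves first-seen order, and most_common's
        # sort is stable, so ties go to the label whose first occurrence is closest.
        return Counter(neighbor_labels.tolist()).most_common(1)[0][0]
    # k-NN regression: average the neighbors' real-valued labels.
    return neighbor_labels.mean()

# Toy usage: two 2-D classes.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))  # expected: 0
```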

What distance function should we use?

Most common distance: the Euclidean distance $\operatorname{dist}(u,v) = \|u - v\|_2 = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}$

Also popular: the taxicab norm (a.k.a. Manhattan norm) $\operatorname{dist}(u,v) = \|u - v\|_1 = \sum_{i=1}^{d} |u_i - v_i|$

The Minkowski distance (a.k.a. $\ell_p$ space)

For parameter $p \ge 1$,

$$\operatorname{dist}(u,v) = \|u - v\|_p = \left( \sum_{i=1}^{d} |u_i - v_i|^p \right)^{1/p}$$

Generalizes many other norms, including the popular $\ell_2$ (Euclidean), $\ell_1$ (taxicab), and $\ell_\infty$ (max norm).
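
As a quick check of these formulas, here is a small NumPy sketch of the Minkowski distance for a few values of $p$; the name `minkowski_dist` is just for illustration.

```python
import numpy as np

def minkowski_dist(u, v, p):
    """Minkowski (l_p) distance; p = np.inf gives the max norm."""
    diff = np.abs(u - v)
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

u, v = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(minkowski_dist(u, v, 1))       # taxicab:   3 + 2 + 0 = 5.0
print(minkowski_dist(u, v, 2))       # Euclidean: sqrt(9 + 4 + 0) ≈ 3.61
print(minkowski_dist(u, v, np.inf))  # max norm:  3.0
```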

Bayes Optimal Classifier

The Bayes Optimal Classifier is the hypothesis

$$h_{\text{opt}}(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \mathcal{P}(y | x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \mathcal{P}(x, y).$$

Bayes Optimal Classifier: A Silly Example

Bayes Optimal Classifier: Conditional Probability

Conditional probability of damage x conditioned on label y.

Damage Wolf (1d6+1) Werewolf (2d4)
2 1/6 1/16
3 1/6 1/8
4 1/6 3/16
5 1/6 1/4
6 1/6 3/16
7 1/6 1/8
8 0 1/16

Bayes Optimal Classifier: Joint Probability

Joint probability: $\mathcal{P}(x,y) = \mathcal{P}(x | y) \, \mathcal{P}(y) = \mathcal{P}(x | y) \cdot \frac{1}{2}$, since both labels are equally likely a priori.

Damage Wolf (1d6+1) Werewolf (2d4)
2 1/12 1/32
3 1/12 1/16
4 1/12 3/32
5 1/12 1/8
6 1/12 3/32
7 1/12 1/16
8 0 1/32

Bayes Optimal Classifier: The Hypothesis

For each damage value, guess the label with the highest joint probability.

Damage Wolf (1d6+1) Werewolf (2d4) Bayes Optimal Classifier Prediction
2 1/12 1/32 Wolf
3 1/12 1/16 Wolf
4 1/12 3/32 Werewolf
5 1/12 1/8 Werewolf
6 1/12 3/32 Werewolf
7 1/12 1/16 Wolf
8 0 1/32 Werewolf

Does this being "optimal" mean we get it right all the time?

Bayes Optimal Classifier: What's the Error Rate?

We get it wrong when the true label disagrees with our prediction: for each damage value, the contribution to the error is the joint probability of the label we did not predict.

Damage Wolf (1d6+1) Werewolf (2d4) Bayes Optimal Classifier Prediction
2 1/12 1/32 Wolf
3 1/12 1/16 Wolf
4 1/12 3/32 Werewolf
5 1/12 1/8 Werewolf
6 1/12 3/32 Werewolf
7 1/12 1/16 Wolf
8 0 1/32 Werewolf

$$\operatorname{error} = \frac{1}{32} + \frac{1}{16} + \frac{1}{12} + \frac{1}{12} + \frac{1}{12} + \frac{1}{16} + 0 = \frac{13}{32} \approx 41\%.$$
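
To double-check the tables and this sum, here is a short Python script (using the standard `fractions` module) that recomputes the joint probabilities, the Bayes optimal predictions, and the error rate; the dictionaries simply transcribe the conditional probabilities above.

```python
from fractions import Fraction as F

# Conditional probabilities P(damage | monster) from the tables above.
wolf     = {2: F(1,6), 3: F(1,6), 4: F(1,6), 5: F(1,6), 6: F(1,6), 7: F(1,6), 8: F(0)}
werewolf = {2: F(1,16), 3: F(1,8), 4: F(3,16), 5: F(1,4), 6: F(3,16), 7: F(1,8), 8: F(1,16)}
prior = F(1, 2)  # each monster is equally likely a priori

error = F(0)
for dmg in range(2, 9):
    joint = {"Wolf": wolf[dmg] * prior, "Werewolf": werewolf[dmg] * prior}
    pred = max(joint, key=joint.get)  # Bayes optimal prediction for this damage value
    error += sum(p for label, p in joint.items() if label != pred)
    print(dmg, pred)

print(error, float(error))  # 13/32 = 0.40625
```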

Bayes Optimal Classifier: Take-Away

Best Constant Predictor

Another important baseline is the Best Constant Predictor, which ignores the features entirely and always guesses the most common label.

$$h(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \mathcal{P}(y).$$
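
In the wolf/werewolf example, both labels have prior probability $\frac{1}{2}$, so the best constant predictor (always guessing, say, "Wolf") has an error rate of 50%, compared with roughly 41% for the Bayes Optimal Classifier.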

How does k-NN compare to the Bayes Optimal Classifier?

We can bound the error of 1-NN relative to the Bayes Optimal Classifier.

Lemma: Convergence of the Nearest Neighbor

Suppose that $(\mathcal{X}, \operatorname{dist})$ is a separable metric space.

Let $x_{\text{test}}$ and $x_1, x_2, \ldots$ be independent identically distributed random variables over $\mathcal{X}$. Then almost surely (i.e. with probability 1)

$$\lim_{n \rightarrow \infty} \; \operatorname*{argmin}_{x \in \{x_1, \ldots, x_n\}} \operatorname{dist}(x, x_{\text{test}}) = x_{\text{test}}.$$
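
The following is a quick empirical illustration of the lemma (not part of the proof), under the assumption that the points are drawn uniformly from the unit square: the distance from a fixed test point to its nearest neighbor shrinks toward 0 as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = rng.uniform(size=2)  # a fixed test point in the unit square

for n in [10, 100, 1_000, 10_000, 100_000]:
    X = rng.uniform(size=(n, 2))                        # i.i.d. training sample
    nn_dist = np.linalg.norm(X - x_test, axis=1).min()  # distance to the nearest neighbor
    print(f"n = {n:>6}:  nearest-neighbor distance = {nn_dist:.4f}")
```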

Proof Outline: Case 1

Consider the case where every ball of radius $r > 0$ centered at $x_{\text{test}}$ has positive probability.

Then, no matter how close the current nearest neighbor is to $x_{\text{test}}$, every time we draw a fresh sample $x_i$ from the source distribution, there is some positive probability that it lands closer to $x_{\text{test}}$ than the current nearest neighbor in the training set.

This implies that the nearest-neighbor distance converges to 0 with probability 1.

Proof Outline: Case 2

Consider the case where there is some ball of radius $r > 0$ centered at $x_{\text{test}}$ that has probability zero under the source distribution.

But the set of points surrounded by such a zero-probability ball is itself a set of probability zero (by separability it is covered by countably many zero-probability balls), so this case arises with zero probability in the random selection of $x_{\text{test}}$.

Bounding the Error of 1-NN

Let $x_{\text{test}}$ denote a test point randomly drawn from $\mathcal{P}$. Let $\hat{x}_n$ (also a random variable) denote the nearest neighbor to $x_{\text{test}}$ in an independent training dataset of size $n$.

The expected error of the 1-NN classifier is
$$\operatorname{error}_{\text{1-NN}} = \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | \hat{x}_n) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right].$$
This is the sum over all labels $y$ of the probability that the prediction (the label of the nearest neighbor) will be $y$ but the true label will not be $y$.

Bounding the Error of 1-NN in the Limit

Taking the limit as $n$ approaches infinity, and using the lemma that $\hat{x}_n \rightarrow x_{\text{test}}$ (together with the assumption that $\mathcal{P}(y | \cdot)$ is continuous), the expected error is
$$\lim_{n \rightarrow \infty} \operatorname{error}_{\text{1-NN}} = \lim_{n \rightarrow \infty} \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | \hat{x}_n) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right] = \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | x_{\text{test}}) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right]$$

Comparing to the Bayes Optimal Classifier

Let $\hat y$ denote the prediction of the Bayes Optimal Classifier on $x_{\text{test}}$.
\begin{align*}
\lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}} &= \mathbf{E}\left[ \mathcal{P}(\hat y | x_{\text{test}}) \left(1 - \mathcal{P}(\hat y | x_{\text{test}})\right) \right] \\
&\hspace{2em}+ \mathbf{E}\left[ \sum_{y \ne \hat y} \mathcal{P}(y | x_{\text{test}}) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right] \\
&\le \mathbf{E}\left[ 1 \cdot \left( 1 - \mathcal{P}(\hat y | x_{\text{test}}) \right) \right] + \mathbf{E}\left[ \sum_{y \ne \hat y} 1 \cdot \mathcal{P}(y | x_{\text{test}}) \right] \\
&= 2 \, \mathbf{E}\left[ 1 - \mathcal{P}(\hat y | x_{\text{test}}) \right]
= 2 \operatorname{error}_{\text{Bayes}}.
\end{align*}

Conclusion: 1-NN is no more than a factor of 2 worse than the Bayes Optimal Classifier

$$\lim_{n \rightarrow \infty} \operatorname{error}_{\text{1-NN}} \le 2 \operatorname{error}_{\text{Bayes}}.$$

...in the limit. And results in the limit can be suspicious, because they don't tell us how fast we converge.

k-NN is limited by The Curse of Dimensionality!

k-NN works by reasoning about how close together points are.

In high-dimensional space, points drawn from a distribution tend not to be close together.

First, let's look at some random points and a line in the unit square.

Consider what happens when we move to three dimensions. The points move further away from each other but stay equally close to the red hyperplane.
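
A small simulation can reproduce this effect numerically. Assumptions here: points drawn uniformly from the unit hypercube and, standing in for the red hyperplane of the figures, the hyperplane $x_1 = \frac{1}{2}$. The average distance between random pairs of points grows with the dimension $d$, while the average distance to the hyperplane stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # number of random pairs per dimension

for d in [2, 3, 10, 100, 1000]:
    A = rng.uniform(size=(n, d))
    B = rng.uniform(size=(n, d))
    mean_pair  = np.linalg.norm(A - B, axis=1).mean()  # distance between random point pairs
    mean_plane = np.abs(A[:, 0] - 0.5).mean()          # distance to the hyperplane x_1 = 1/2
    print(f"d = {d:>4}:  pairwise ≈ {mean_pair:.2f},  to hyperplane ≈ {mean_plane:.2f}")
```

(For uniform points, the squared pairwise distance has expectation $d/6$, so the typical pairwise distance grows like $\sqrt{d/6}$, while the expected distance to the hyperplane is always $1/4$.)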

Curse of Dimensionality

Low-Dimensional Structure to the Rescue

If the data lies in a low-dimensional submanifold, then we can still use low-dimensional methods even in higher dimensions.

Reminders