
Lecture 3: k-Nearest Neighbors

CS4780 — Introduction to Machine Learning

$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\norm}[1]{\left\|#1\right\|}$

The k-Nearest Neighbors Algorithm

k-Nearest Neighbors Illustration

k-NN Formally

What should we do in case of a tie?

k-NN For Regression

What distance function should we use?

Most common distance: the Euclidean distance $$\operatorname{dist}(u,v) = \| u - v \|_2 = \sqrt{ \sum_{i=1}^d (u_i - v_i)^2 }$$

Also popular: the taxicab norm (a.k.a. Manhattan norm) $$\operatorname{dist}(u,v) = \| u - v \|_1 = \sum_{i=1}^d |u_i - v_i|$$

The Minkowski distance (a.k.a. $\ell_p$ space)

For parameter $p \ge 1$

$$\operatorname{dist}(u,v) = \| u - v \|_p = \left( \sum_{i=1}^d |u_i - v_i|^p \right)^{1/p}$$

Generalizes many other norms, including the popular $\ell_2$ (Euclidean), $\ell_1$ (taxicab), and $\ell_{\infty}$ (max norm).
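With a distance function in hand, a minimal k-NN classifier is only a few lines. The sketch below is illustrative (the names `minkowski_dist` and `knn_predict` are not from the lecture): it classifies a test point by majority vote among its $k$ nearest training points under an $\ell_p$ distance, and ties in the vote are simply broken arbitrarily.

```python
import numpy as np
from collections import Counter

def minkowski_dist(u, v, p=2.0):
    """ell_p distance between vectors u and v; p = np.inf gives the max norm."""
    diff = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    if np.isinf(p):
        return diff.max()
    return float((diff ** p).sum() ** (1.0 / p))

def knn_predict(X_train, y_train, x_test, k=3, p=2.0):
    """Classify x_test by majority vote among its k nearest training points."""
    dists = [minkowski_dist(x, x_test, p) for x in X_train]
    nearest = np.argsort(dists)[:k]              # indices of the k closest training points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]            # ties are broken arbitrarily here

# toy usage: three points of class 0 near the origin, one point of class 1 far away
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 0, 1])
print(knn_predict(X, y, np.array([0.5, 0.5]), k=3, p=1.0))   # prints: 0
```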

Bayes Optimal Classifier

The Bayes Optimal Classifier is the hypothesis

$$h_{\operatorname{opt}}(x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(y | x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(x, y).$$

Bayes Optimal Classifier: A Silly Example

Bayes Optimal Classifier: Conditional Probability

Conditional probability $\mathcal{P}(x \mid y)$ of damage $x$ given label $y$.

| Damage | Wolf (1d6+1) | Werewolf (2d4) |
|--------|--------------|----------------|
| 2      | 1/6          | 1/16           |
| 3      | 1/6          | 1/8            |
| 4      | 1/6          | 3/16           |
| 5      | 1/6          | 1/4            |
| 6      | 1/6          | 3/16           |
| 7      | 1/6          | 1/8            |
| 8      | 0            | 1/16           |
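As a sanity check, both conditional PMFs can be enumerated directly. A small sketch using exact fractions (all names illustrative) that reproduces the table above:

```python
from fractions import Fraction
from itertools import product

# damage of a wolf: one six-sided die plus 1 (values 2..7, each with probability 1/6)
wolf = {d + 1: Fraction(1, 6) for d in range(1, 7)}

# damage of a werewolf: the sum of two four-sided dice (values 2..8)
werewolf = {}
for a, b in product(range(1, 5), repeat=2):
    werewolf[a + b] = werewolf.get(a + b, Fraction(0)) + Fraction(1, 16)

for damage in range(2, 9):
    print(damage, wolf.get(damage, Fraction(0)), werewolf.get(damage, Fraction(0)))
# e.g. P(5 | Werewolf) = 1/4 and P(8 | Wolf) = 0, matching the table
```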

Bayes Optimal Classifier: Joint Probability

Joint probability: $\mathcal{P}(x,y) = \mathcal{P}(x | y) \mathcal{P}(y) = \mathcal{P}(x | y) \cdot \frac{1}{2}$, where each label is assumed equally likely a priori.

| Damage | Wolf (1d6+1) | Werewolf (2d4) |
|--------|--------------|----------------|
| 2      | 1/12         | 1/32           |
| 3      | 1/12         | 1/16           |
| 4      | 1/12         | 3/32           |
| 5      | 1/12         | 1/8            |
| 6      | 1/12         | 3/32           |
| 7      | 1/12         | 1/16           |
| 8      | 0            | 1/32           |

Bayes Optimal Classifier: The Hypothesis

Always guess the label with the highest probability.

| Damage | Wolf (1d6+1) | Werewolf (2d4) | Bayes Optimal Classifier Prediction |
|--------|--------------|----------------|-------------------------------------|
| 2      | 1/12         | 1/32           | Wolf                                |
| 3      | 1/12         | 1/16           | Wolf                                |
| 4      | 1/12         | 3/32           | Werewolf                            |
| 5      | 1/12         | 1/8            | Werewolf                            |
| 6      | 1/12         | 3/32           | Werewolf                            |
| 7      | 1/12         | 1/16           | Wolf                                |
| 8      | 0            | 1/32           | Werewolf                            |

Does this being "optimal" mean we get it right all the time?

Bayes Optimal Classifier: What's the Error Rate?

We get it wrong when the true label disagrees with our prediction.

| Damage | Wolf (1d6+1) | Werewolf (2d4) | Bayes Optimal Classifier Prediction | P(error) |
|--------|--------------|----------------|-------------------------------------|----------|
| 2      | 1/12         | 1/32           | Wolf                                | 1/32     |
| 3      | 1/12         | 1/16           | Wolf                                | 1/16     |
| 4      | 1/12         | 3/32           | Werewolf                            | 1/12     |
| 5      | 1/12         | 1/8            | Werewolf                            | 1/12     |
| 6      | 1/12         | 3/32           | Werewolf                            | 1/12     |
| 7      | 1/12         | 1/16           | Wolf                                | 1/16     |
| 8      | 0            | 1/32           | Werewolf                            | 0        |

$\operatorname{error} = \frac{1}{32} + \frac{1}{16} + \frac{1}{12} + \frac{1}{12} + \frac{1}{12} + \frac{1}{16} + 0 = \frac{13}{32} \approx 41\%.$
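The predictions and the error rate above can be verified mechanically. A short sketch, assuming the equal priors used above (the dictionary names are just for illustration):

```python
from fractions import Fraction
from itertools import product

# conditional PMFs from the tables above
wolf = {d + 1: Fraction(1, 6) for d in range(1, 7)}            # 1d6+1
werewolf = {}
for a, b in product(range(1, 5), repeat=2):                    # 2d4
    werewolf[a + b] = werewolf.get(a + b, Fraction(0)) + Fraction(1, 16)

prior = Fraction(1, 2)                                         # equal prior on the two labels
error = Fraction(0)
for damage in range(2, 9):
    joint = {"Wolf": prior * wolf.get(damage, Fraction(0)),
             "Werewolf": prior * werewolf.get(damage, Fraction(0))}
    prediction = max(joint, key=joint.get)                     # Bayes-optimal: larger joint probability
    error += sum(p for label, p in joint.items() if label != prediction)
    print(damage, prediction)

print(error)                                                   # 13/32
```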

Bayes Optimal Classifier: Take-Away

Best Constant Predictor

Another important baseline is the Best Constant Predictor.

$$h(x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(y).$$
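A minimal sketch of this baseline, assuming the training labels come as a plain list (the function name is illustrative): it ignores $x$ entirely and returns the most common label. In the wolf/werewolf example both priors are $\frac{1}{2}$, so either constant guess is wrong half the time; compare that with the $\approx 41\%$ error of the Bayes Optimal Classifier above.

```python
from collections import Counter

def best_constant_predictor(y_train):
    """Return the single label that maximizes the empirical prior P(y)."""
    return Counter(y_train).most_common(1)[0][0]

print(best_constant_predictor(["Wolf", "Wolf", "Werewolf"]))   # prints: Wolf
```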

How does k-NN compare to the Bayes Optimal Classifier?

We can bound the error of 1-NN relative to the Bayes Optimal Classifier.

Lemma: Convergence of the Nearest Neighbor

Suppose that $(\mathcal{X}, \operatorname{dist})$ is a separable metric space.

Let $x_{\text{test}}$ and $x_1, x_2, \ldots$ be independent identically distributed random variables over $\mathcal{X}$. Then almost surely (i.e. with probability $1$)

$$\lim_{n \rightarrow \infty} \; \arg \min_{x \in \{x_1, \ldots, x_n\}} \operatorname{dist}(x, x_{\text{test}}) = x_{\text{test}}.$$

Proof Outline: Case 1

Consider the case where every ball of radius $r > 0$ centered at $x_{\text{test}}$ has positive probability under the source distribution.

Then, no matter how close the current nearest neighbor is to $x_{\text{test}}$, every fresh sample $x_i$ drawn from the source distribution has some positive probability of landing closer to $x_{\text{test}}$ than the current nearest neighbor.

This implies that the distance diminishes to $0$ with probability $1$.
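A quick simulation illustrates Case 1 (the distribution and sample sizes here are arbitrary choices, not from the lecture): as the training set grows, the distance from a fixed test point to its nearest neighbor shrinks toward $0$.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = rng.uniform(size=2)                    # a fixed test point in the unit square

for n in [10, 100, 1_000, 10_000, 100_000]:
    X = rng.uniform(size=(n, 2))                # n training points from the same distribution
    nn_dist = np.linalg.norm(X - x_test, axis=1).min()
    print(n, nn_dist)                           # the nearest-neighbor distance shrinks toward 0
```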

Proof Outline: Case 2

Consider the case where there is some ball of radius $r$ centered around $x_{\text{test}}$ that has probability zero in the source distribution.

Then the nearest neighbor may never approach $x_{\text{test}}$. But since $x_{\text{test}}$ is itself drawn from the source distribution, such a test point is selected with probability zero, so this case does not affect the almost-sure claim.

Bounding the Error of 1-NN

Let $x_{\text{test}}$ denote a test point randomly drawn from $\mathcal{P}$. Let $\hat x_n$ (also a random variable) denote the nearest neighbor to $x_{\text{test}}$ in an independent training dataset of size $n$.

The expected error of the 1-NN classifier is $$ \operatorname{error}_{\text{1-NN}} = \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | \hat x_n) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right]. $$ This is the sum over all labels $y$ of the probability that the prediction will be $y$ but the true label will not be $y$.

Bounding the Error of 1-NN in the limit

Taking the limit as $n$ approaches infinity and applying the convergence lemma (together with the assumption that $\mathcal{P}(y \mid x)$ is continuous in $x$, so that $\mathcal{P}(y \mid \hat x_n) \rightarrow \mathcal{P}(y \mid x_{\text{test}})$), the expected error is \begin{align*} \lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}} &= \lim_{n \rightarrow \infty} \; \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | \hat x_n) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right] \\&= \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | x_{\text{test}}) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right] \end{align*}

Comparing to the Bayes Optimal Classifier

Let $\hat y$ denote the prediction of the Bayes Optimal Classifier on $x_{\text{test}}$. \begin{align*} \lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}} &= \mathbf{E}\left[ \mathcal{P}(\hat y | x_{\text{test}}) \left(1 - \mathcal{P}(\hat y | x_{\text{test}})\right) \right] + \mathbf{E}\left[ \sum_{y \ne \hat y} \mathcal{P}(y | x_{\text{test}}) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right] \\&\le \mathbf{E}\left[ 1 \cdot \left( 1 - \mathcal{P}(\hat y | x_{\text{test}}) \right) \right] + \mathbf{E}\left[ \sum_{y \ne \hat y} 1 \cdot \mathcal{P}(y | x_{\text{test}}) \right] \\&= 2 \, \mathbf{E}\left[ 1 - \mathcal{P}(\hat y | x_{\text{test}}) \right] = 2 \operatorname{error}_{\text{Bayes}}. \end{align*}

Conclusion: 1-NN is no more than a factor-of-2 worse

$$\lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}} \le 2 \operatorname{error}_{\text{Bayes}}.$$

...in the limit... and results in the limit can be suspicious, because they don't tell us how fast we converge.
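For the wolf/werewolf example the bound can be checked exactly. The sketch below (names illustrative) evaluates the limiting 1-NN error from the formula above and compares it with the Bayes error of $13/32$.

```python
from fractions import Fraction
from itertools import product

# joint probabilities P(x, y) for the wolf/werewolf example (prior 1/2 on each label)
wolf = {d + 1: Fraction(1, 12) for d in range(1, 7)}           # 1d6+1, times prior 1/2
werewolf = {}
for a, b in product(range(1, 5), repeat=2):                    # 2d4, times prior 1/2
    werewolf[a + b] = werewolf.get(a + b, Fraction(0)) + Fraction(1, 32)

bayes_error = Fraction(0)
nn_error = Fraction(0)                                         # limiting 1-NN error
for x in range(2, 9):
    joint = [wolf.get(x, Fraction(0)), werewolf.get(x, Fraction(0))]
    p_x = sum(joint)
    posterior = [p / p_x for p in joint]                       # P(y | x)
    bayes_error += p_x * (1 - max(posterior))                  # probability the best guess is wrong at x
    nn_error += p_x * sum(p * (1 - p) for p in posterior)      # formula from the limit above

print(bayes_error, nn_error, 2 * bayes_error)                  # Bayes error 13/32; limiting 1-NN error sits between 13/32 and 13/16
print(bayes_error <= nn_error <= 2 * bayes_error)              # True
```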

k-NN is limited by The Curse of Dimensionality!

k-NN works by reasoning about how close together points are.

In high-dimensional space, points drawn from a distribution tend not to be close together.

First, let's look at some random points and a line in the unit square.

Consider what happens when we move to three dimensions. The points move further away from each other but stay equally close to the red hyperplane.
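A quick numerical illustration of this effect (the sample size and dimensions are arbitrary choices): the average pairwise Euclidean distance between uniform random points in the unit hypercube grows with the dimension $d$, roughly like $\sqrt{d/6}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
for d in [2, 3, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))                               # n points in the unit hypercube [0, 1]^d
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    avg = dists[np.triu_indices(n, k=1)].mean()                # average over distinct pairs
    print(d, round(avg, 2))                                    # grows roughly like sqrt(d / 6)
```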

Curse of Dimensionality

Low-Dimensional Structure to the Rescue

If the data lies on a low-dimensional submanifold, then we can still use methods that rely on low-dimensional structure, such as k-NN, even when the ambient dimension is high.

Reminders