Lecture 3: k-Nearest Neighbors

CS4780 — Introduction to Machine Learning

$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\norm}[1]{\left\|#1\right\|}$

The k-Nearest Neighbors Algorithm

k-Nearest Neighbors Illustration

k-NN Formally
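As a concrete companion to the formal definition, here is a minimal sketch of the classification rule in Python/NumPy (Euclidean distance and a majority vote over the $k$ nearest training points; the names and the NumPy implementation are illustrative choices, not something the lecture prescribes):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_test, k=3):
    """Predict the label of x_test by a majority vote among the labels
    of its k nearest training points (Euclidean distance)."""
    # Distance from x_test to every training point.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    # Majority vote; Counter breaks ties by the label seen first,
    # which is one (arbitrary) answer to the tie question below.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny made-up example.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.2, 0.1])))  # prints 0
```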

What should we do in case of a tie?

k-NN For Regression
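For regression, a common choice is to output the average of the $k$ nearest neighbors' real-valued labels; a minimal sketch under the same assumptions as above:

```python
import numpy as np

def knn_regress(X_train, y_train, x_test, k=3):
    """Predict the mean of the labels of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Tiny made-up example: noisy samples of y = x.
X_train = np.array([[0.0], [0.5], [1.0], [1.5]])
y_train = np.array([0.05, 0.45, 1.1, 1.4])
print(knn_regress(X_train, y_train, np.array([0.6])))  # mean of 0.45, 1.1, 0.05
```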

What distance function should we use?

Most common distance: the Euclidean distance $$\operatorname{dist}(u,v) = \| u - v \|_2 = \sqrt{ \sum_{i=1}^d (u_i - v_i)^2 }$$

Also popular: the taxicab norm (a.k.a. Manhattan norm) $$\operatorname{dist}(u,v) = \| u - v \|_1 = \sum_{i=1}^d |u_i - v_i|$$

The Minkowski distance (a.k.a. $\ell_p$ space)

For parameter $p \ge 1$

$$\operatorname{dist}(u,v) = \| u - v \|_p = \left( \sum_{i=1}^d |u_i - v_i|^p \right)^{1/p}$$

Generalizes many other norms, including the popular $\ell_2$ (Euclidean), $\ell_1$ (taxicab), and $\ell_{\infty}$ (max norm).
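A minimal numeric sanity check of these distances (a sketch in Python/NumPy; the language choice is an assumption, not something the lecture prescribes):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, 3.0])  # u - v = [-3, 2, 0]

print(np.linalg.norm(u - v, ord=2))       # Euclidean: sqrt(9 + 4 + 0) ~= 3.606
print(np.linalg.norm(u - v, ord=1))       # taxicab:   3 + 2 + 0 = 5
print(np.linalg.norm(u - v, ord=np.inf))  # max norm:  max(3, 2, 0) = 3

def minkowski(u, v, p):
    """General l_p distance for any p >= 1."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

print(minkowski(u, v, 3))  # ~= 3.27; approaches the max norm as p grows
```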

Bayes Optimal Classifier

The Bayes Optimal Classifier is the hypothesis

$$h_{\operatorname{opt}}(x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(y | x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(x, y).$$

Bayes Optimal Classifier: A Silly Example

Bayes Optimal Classifier: Conditional Probability

Conditional probability of damage $x$ conditioned on label $y$.

| Damage | Wolf (1d6+1) | Werewolf (2d4) |
|--------|--------------|----------------|
| 2      | 1/6          | 1/16           |
| 3      | 1/6          | 1/8            |
| 4      | 1/6          | 3/16           |
| 5      | 1/6          | 1/4            |
| 6      | 1/6          | 3/16           |
| 7      | 1/6          | 1/8            |
| 8      | 0            | 1/16           |
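These conditional probabilities can be reproduced by enumerating the dice outcomes; a minimal sketch using exact fractions:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# P(damage | Wolf): one six-sided die plus one, i.e. damage 2..7, each 1/6.
wolf = {face + 1: Fraction(1, 6) for face in range(1, 7)}

# P(damage | Werewolf): the sum of two four-sided dice (2d4), damage 2..8.
counts = Counter(a + b for a, b in product(range(1, 5), repeat=2))
werewolf = {dmg: Fraction(c, 16) for dmg, c in counts.items()}

print(wolf)      # each of 2..7 maps to 1/6
print(werewolf)  # 2..8 map to 1/16, 1/8, 3/16, 1/4, 3/16, 1/8, 1/16
```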

Bayes Optimal Classifier: Joint Probability

Joint probability: $\mathcal{P}(x,y) = \mathcal{P}(x | y)\, \mathcal{P}(y) = \mathcal{P}(x | y) \cdot \frac{1}{2}$, since each monster is equally likely a priori.

| Damage | Wolf (1d6+1) | Werewolf (2d4) |
|--------|--------------|----------------|
| 2      | 1/12         | 1/32           |
| 3      | 1/12         | 1/16           |
| 4      | 1/12         | 3/32           |
| 5      | 1/12         | 1/8            |
| 6      | 1/12         | 3/32           |
| 7      | 1/12         | 1/16           |
| 8      | 0            | 1/32           |

Bayes Optimal Classifier: The Hypothesis

Always guess the label with the highest probability.

| Damage | Wolf (1d6+1) | Werewolf (2d4) | Bayes Optimal Classifier Prediction |
|--------|--------------|----------------|-------------------------------------|
| 2      | 1/12         | 1/32           | Wolf                                |
| 3      | 1/12         | 1/16           | Wolf                                |
| 4      | 1/12         | 3/32           | Werewolf                            |
| 5      | 1/12         | 1/8            | Werewolf                            |
| 6      | 1/12         | 3/32           | Werewolf                            |
| 7      | 1/12         | 1/16           | Wolf                                |
| 8      | 0            | 1/32           | Werewolf                            |

Does this being "optimal" mean we get it right all the time?

Bayes Optimal Classifier: What's the Error Rate?

We get it wrong when the true label disagrees with our prediction.

| Damage | Wolf (1d6+1) | Werewolf (2d4) | Bayes Optimal Classifier Prediction |
|--------|--------------|----------------|-------------------------------------|
| 2      | 1/12         | 1/32           | Wolf                                |
| 3      | 1/12         | 1/16           | Wolf                                |
| 4      | 1/12         | 3/32           | Werewolf                            |
| 5      | 1/12         | 1/8            | Werewolf                            |
| 6      | 1/12         | 3/32           | Werewolf                            |
| 7      | 1/12         | 1/16           | Wolf                                |
| 8      | 0            | 1/32           | Werewolf                            |

$\operatorname{error} = \frac{1}{32} + \frac{1}{16} + \frac{1}{12} + \frac{1}{12} + \frac{1}{12} + \frac{1}{16} + 0 = \frac{13}{32} \approx 41\%.$
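The whole table, including the error rate, can be checked with a short script; the dictionaries below simply transcribe the joint probabilities from above (exact arithmetic via fractions):

```python
from fractions import Fraction as F

# Joint probabilities P(damage, label), i.e. the table above.
wolf     = {2: F(1, 12), 3: F(1, 12), 4: F(1, 12), 5: F(1, 12),
            6: F(1, 12), 7: F(1, 12), 8: F(0)}
werewolf = {2: F(1, 32), 3: F(1, 16), 4: F(3, 32), 5: F(1, 8),
            6: F(3, 32), 7: F(1, 16), 8: F(1, 32)}

error = F(0)
for dmg in range(2, 9):
    prediction = "Wolf" if wolf[dmg] > werewolf[dmg] else "Werewolf"
    # We are wrong exactly when the label we did NOT predict occurs.
    error += min(wolf[dmg], werewolf[dmg])
    print(dmg, prediction)

print(error, float(error))  # 13/32 0.40625
```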

Bayes Optimal Classifier: Take-Away

Best Constant Predictor

Another important baseline is the Best Constant Predictor.

$$h(x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(y).$$
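When the prior is unknown, an empirical stand-in is to predict the most common label in the training data; a minimal sketch (names illustrative):

```python
from collections import Counter

def best_constant_predictor(y_train):
    """Always predict the most common training label,
    an empirical estimate of arg max_y P(y)."""
    return Counter(y_train).most_common(1)[0][0]

print(best_constant_predictor(["spam", "ham", "spam", "spam", "ham"]))  # spam
```

In the wolf/werewolf example both labels have prior $\frac{1}{2}$, so the best constant predictor is wrong half the time, compared to $\frac{13}{32}$ for the Bayes Optimal Classifier.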

How does k-NN compare to the Bayes Optimal Classifier?

We can bound the error of 1-NN relative to the Bayes Optimal Classifier.

Lemma: Convergence of the Nearest Neighbor

Suppose that $(\mathcal{X}, \operatorname{dist})$ is a separable metric space.

Let $x_{\text{test}}$ and $x_1, x_2, \ldots$ be independent identically distributed random variables over $\mathcal{X}$. Then almost surely (i.e. with probability $1$)

$$\lim_{n \rightarrow \infty} \; \arg \min_{x \in \{x_1, \ldots, x_n\}} \operatorname{dist}(x, x_{\text{test}}) = x_{\text{test}}.$$

Proof Outline: Case 1

Consider the case where every ball of radius $r > 0$ centered at $x_{\text{test}}$ has positive probability under the source distribution.

Then, no matter how close the current nearest neighbor is to $x_{\text{test}}$, every time we draw a fresh sample $x_i$ from the source distribution there is some positive probability that it lands closer to $x_{\text{test}}$ than the nearest neighbor currently in the training set.

This implies that the nearest-neighbor distance diminishes to $0$ with probability $1$.

Proof Outline: Case 2

Consider the case where some ball of radius $r > 0$ centered at $x_{\text{test}}$ has probability zero under the source distribution.

But $x_{\text{test}}$ is itself drawn from that same distribution, so it lands in such a probability-zero region with probability zero (separability lets us cover all such points with countably many probability-zero balls).
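A quick numerical illustration of the lemma; a minimal sketch that assumes, just for concreteness, a uniform distribution on the unit square:

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = rng.uniform(size=2)  # a test point drawn from the same distribution

# Distance from x_test to its nearest neighbor among n i.i.d. training points.
for n in [10, 100, 1_000, 10_000, 100_000]:
    X = rng.uniform(size=(n, 2))
    print(n, np.linalg.norm(X - x_test, axis=1).min())
# The nearest-neighbor distance shrinks toward 0 as n grows.
```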

Bounding the Error of 1-NN

Let $x_{\text{test}}$ denote a test point randomly drawn from $\mathcal{P}$. Let $\hat x_n$ (also a random variable) denote the nearest neighbor to $x_{\text{test}}$ in an independent training dataset of size $n$.

The expected error of the 1-NN classifier is $$ \operatorname{error}_{\text{1-NN}} = \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | \hat x_n) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right]. $$ This is the sum over all labels $y$ of the probability that the prediction will be $y$ but the true label will not be $y$.

Bounding the Error of 1-NN in the limit

Taking the limit as $n$ approaches infinity, the expected error is \begin{align*} \lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}} &= \lim_{n \rightarrow \infty} \; \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | \hat x_n) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right] \\&= \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | x_{\text{test}}) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right], \end{align*} where the second equality uses the lemma ($\hat x_n \rightarrow x_{\text{test}}$ almost surely) together with the assumption that $\mathcal{P}(y | \cdot)$ is continuous.

Comparing to the Bayes Optimal Classifier

Let $\hat y$ denote the prediction of the Bayes Optimal Classifier on $x_{\text{test}}$.
\begin{align*}
\lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}}
&= \mathbf{E}\left[ \mathcal{P}(\hat y | x_{\text{test}}) \left(1 - \mathcal{P}(\hat y | x_{\text{test}})\right) \right]
+ \mathbf{E}\left[ \sum_{y \ne \hat y} \mathcal{P}(y | x_{\text{test}}) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right]
\\&\le \mathbf{E}\left[ 1 \cdot \left( 1 - \mathcal{P}(\hat y | x_{\text{test}}) \right) \right]
+ \mathbf{E}\left[ \sum_{y \ne \hat y} 1 \cdot \mathcal{P}(y | x_{\text{test}}) \right]
\\&= \mathbf{E}\left[ 1 - \mathcal{P}(\hat y | x_{\text{test}}) \right]
+ \mathbf{E}\left[ 1 - \mathcal{P}(\hat y | x_{\text{test}}) \right]
\\&= 2\, \mathbf{E}\left[ 1 - \mathcal{P}(\hat y | x_{\text{test}}) \right]
= 2 \operatorname{error}_{\text{Bayes}},
\end{align*}
where the second-to-last step uses $\sum_{y \ne \hat y} \mathcal{P}(y | x_{\text{test}}) = 1 - \mathcal{P}(\hat y | x_{\text{test}})$.

Conclusion: 1-NN is no more than a factor-of-2 worse

$$\lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}} \le 2 \operatorname{error}_{\text{Bayes}}.$$

...but only in the limit, and in-the-limit results should be treated with caution because they do not tell us how fast we converge.
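To get a feel for the finite-sample behavior, here is a minimal Monte Carlo sketch on a synthetic one-dimensional problem; the Gaussian class conditionals and equal priors are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Labels are 0/1 with prior 1/2; x | y ~ Normal(y, 1)."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=y.astype(float), scale=1.0)
    return x, y

x_tr, y_tr = sample(200_000)
x_te, y_te = sample(20_000)

# 1-NN in one dimension: sort the training points, then compare each test
# point against the training points just below and just above it.
order = np.argsort(x_tr)
xs, ys = x_tr[order], y_tr[order]
idx = np.clip(np.searchsorted(xs, x_te), 1, len(xs) - 1)
left_closer = (x_te - xs[idx - 1]) < (xs[idx] - x_te)
pred_1nn = np.where(left_closer, ys[idx - 1], ys[idx])
err_1nn = np.mean(pred_1nn != y_te)

# Bayes optimal rule for this particular distribution: predict 1 iff x > 1/2.
err_bayes = np.mean((x_te > 0.5) != y_te)

print(err_1nn, err_bayes, 2 * err_bayes)
# The empirical 1-NN error typically lands between the Bayes error and
# twice the Bayes error, consistent with the bound above.
```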

k-NN is limited by The Curse of Dimensionality!

k-NN works by reasoning about how close together points are.

In high dimensional space, points drawn from a distribution tend not to be close together.
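A quick numerical check of this claim; a minimal sketch assuming points drawn uniformly from the unit hypercube $[0,1]^d$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Average distance to the nearest neighbor among n uniform points in [0, 1]^d.
n = 1000
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))
    # Squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(d2, np.inf)
    nn = np.sqrt(np.maximum(d2, 0).min(axis=1))
    print(d, nn.mean())
# The typical nearest-neighbor distance grows rapidly with the dimension d.
```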

First, let's look at some random points and a line in the unit square.