Lecture 3: k-Nearest Neighbors¶

CS4780 — Introduction to Machine Learning¶

$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\norm}[1]{\left\|#1\right\|}$

The k-Nearest Neighbors Algorithm¶

  • Assumption: Nearby inputs have similar outputs.

  • Hypothesis: Given a test input $x$, output the most common label among its $k$ most-similar training inputs.

    • 1-nearest neighbors is simpler: just output the label of the most-similar training input.

k-Nearest Neighbors Illustration¶

k-NN Formally¶

  • Start with training set $(x_1, y_1), \ldots, (x_n, y_n)$
  • Given input $x_{\text{test}}$
  • Compute $d_i = \operatorname{dist}(x_{\text{test}}, x_i)$ for all $i \in \{1, \ldots, n\}$
  • Compute $S$, where $S \subset \{1,\ldots, n\}$, $|S| = k$, and $$i \in S, \; j \notin S \; \rightarrow \; d_i \le d_j$$
  • Output the most common label $$h(x_{\text{test}}) = \arg \max_{y \in \mathcal{Y}} \; \left|\left\{ i \in S \middle| y_i = y \right\} \right|$$

What should we do in case of a tie?

  • Choose at random/arbitrarily
  • Fall back to $k - 1$ neighbors, or go up to $k + 1$ (see the code sketch below)
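Putting the steps above into code: here is a minimal NumPy sketch of k-NN classification with random tie-breaking (the function name `knn_classify` and its exact interface are illustrative choices, not part of the lecture):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_test, k, rng=None):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    rng = rng or np.random.default_rng()
    d = np.linalg.norm(X_train - x_test, axis=1)        # d_i = dist(x_test, x_i) for all i
    S = np.argsort(d)[:k]                               # indices of the k smallest distances
    votes = Counter(y_train[i] for i in S)              # count labels among the neighbors
    top = max(votes.values())
    winners = [label for label, c in votes.items() if c == top]
    return rng.choice(winners)                          # break ties at random

# Toy usage: four labeled points in the plane, classify a new point with k = 3.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([0, 1, 1, 0])
print(knn_classify(X, y, np.array([0.9, 0.9]), k=3))    # -> 1
```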

k-NN For Regression¶

  • Start with training set $(x_1, y_1), \ldots, (x_n, y_n)$
  • Given input $x_{\text{test}}$
  • Compute $d_i = \operatorname{dist}(x_{\text{test}}, x_i)$ for all $i \in \{1, \ldots, n\}$
  • Compute $S$, where $S \subset \{1,\ldots, n\}$, $|S| = k$, and $$i \in S, \; j \notin S \; \rightarrow \; d_i \le d_j$$
  • Output the average of the labels $$h(x_{\text{test}}) = \frac{1}{k} \sum_{i \in S} y_i.$$
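The regression version only changes the final step (averaging instead of voting); a minimal sketch with an assumed helper name `knn_regress`:

```python
import numpy as np

def knn_regress(X_train, y_train, x_test, k):
    """Predict a real-valued label as the mean label of the k nearest training points."""
    d = np.linalg.norm(X_train - x_test, axis=1)   # d_i = dist(x_test, x_i) for all i
    S = np.argsort(d)[:k]                          # indices of the k nearest neighbors
    return y_train[S].mean()                       # h(x_test) = (1/k) * sum of their labels
```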

What distance function should we use?¶

Most common distance: the Euclidean distance $$\operatorname{dist}(u,v) = \| u - v \|_2 = \sqrt{ \sum_{i=1}^d (u_i - v_i)^2 }$$

Also popular: the taxicab norm (a.k.a. Manhattan norm) $$\operatorname{dist}(u,v) = \| u - v \|_1 = \sum_{i=1}^d |u_i - v_i|$$

The Minkowski distance (a.k.a. $\ell_p$ space)¶

For parameter $p \ge 1$

$$\operatorname{dist}(u,v) = \| u - v \|_p = \left( \sum_{i=1}^d |u_i - v_i|^p \right)^{1/p}$$

Generalizes many other norms, including the popular $\ell_2$ (Euclidean), $\ell_1$ (taxicab), and $\ell_{\infty}$ (max norm).

  • Many other norms are used in practice, including learned norms.
  • Which norm to use will depend on your task.
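The following sketch shows the $\ell_p$ family in code (libraries such as SciPy also provide Minkowski distances; this standalone version is only for illustration):

```python
import numpy as np

def minkowski(u, v, p):
    """l_p distance between vectors u and v, for p >= 1; p = inf gives the max norm."""
    if np.isinf(p):
        return np.max(np.abs(u - v))
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

u, v = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(minkowski(u, v, 2))        # Euclidean: sqrt(3^2 + 2^2) ≈ 3.61
print(minkowski(u, v, 1))        # taxicab:   3 + 2 = 5
print(minkowski(u, v, np.inf))   # max norm:  3
```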
In [11]:
%matplotlib notebook
knn_demo_1()
In [12]:
%matplotlib notebook
knn_demo_2()
Click on the images above to cycle through the test images.
In [16]:
knn_demo_3()
Interactive widget: a slider over odd values of $k$ from 1 to 19.

Bayes Optimal Classifier¶

The Bayes Optimal Classifier is the hypothesis

$$h_{\operatorname{opt}}(x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(y | x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(x, y).$$
  • It picks the label that is most likely in the source distribution $\mathcal{P}$.
  • This is the "best" hypothesis possible for the task, and its loss/error measures the best you could do.

Bayes Optimal Classifier: A Silly Example¶

  • Suppose you are playing your favorite RPG and you get attacked by a wolf-like creature.
  • You want to predict whether the creature is a wolf or a werewolf.
    • Wolves occur with probability $50\%$, and werewolves $50\%$.
  • You observe the amount of damage the creature did, and you know how much damage wolves and werewolves do.
    • A wolf deals damage equal to a roll of a six-sided die plus one.
    • A werewolf deals damage equal to the sum of the rolls on two four-sided dice.

Bayes Optimal Classifier: Conditional Probability¶

Conditional probability $\mathcal{P}(x \mid y)$ of damage $x$ given label $y$.

| Damage | Wolf (1d6+1) | Werewolf (2d4) |
|--------|--------------|----------------|
| 2      | 1/6          | 1/16           |
| 3      | 1/6          | 1/8            |
| 4      | 1/6          | 3/16           |
| 5      | 1/6          | 1/4            |
| 6      | 1/6          | 3/16           |
| 7      | 1/6          | 1/8            |
| 8      | 0            | 1/16           |

Bayes Optimal Classifier: Joint Probability¶

Joint probability: $\mathcal{P}(x,y) = \mathcal{P}(x | y) \, \mathcal{P}(y) = \mathcal{P}(x | y) \cdot \frac{1}{2}$

| Damage | Wolf (1d6+1) | Werewolf (2d4) |
|--------|--------------|----------------|
| 2      | 1/12         | 1/32           |
| 3      | 1/12         | 1/16           |
| 4      | 1/12         | 3/32           |
| 5      | 1/12         | 1/8            |
| 6      | 1/12         | 3/32           |
| 7      | 1/12         | 1/16           |
| 8      | 0            | 1/32           |

Bayes Optimal Classifier: The Hypothesis¶

Always guess the label with the highest probability.

| Damage | Wolf (1d6+1) | Werewolf (2d4) | Bayes Optimal Classifier Prediction |
|--------|--------------|----------------|-------------------------------------|
| 2      | 1/12         | 1/32           | Wolf                                |
| 3      | 1/12         | 1/16           | Wolf                                |
| 4      | 1/12         | 3/32           | Werewolf                            |
| 5      | 1/12         | 1/8            | Werewolf                            |
| 6      | 1/12         | 3/32           | Werewolf                            |
| 7      | 1/12         | 1/16           | Wolf                                |
| 8      | 0            | 1/32           | Werewolf                            |

Does this being "optimal" mean we get it right all the time?

Bayes Optimal Classifier: What's the Error Rate?¶

We get it wrong when the true label disagrees with our prediction, so at each damage value the error is the joint probability of the label we did not predict.

| Damage | Wolf (1d6+1) | Werewolf (2d4) | Bayes Optimal Classifier Prediction |
|--------|--------------|----------------|-------------------------------------|
| 2      | 1/12         | 1/32           | Wolf                                |
| 3      | 1/12         | 1/16           | Wolf                                |
| 4      | 1/12         | 3/32           | Werewolf                            |
| 5      | 1/12         | 1/8            | Werewolf                            |
| 6      | 1/12         | 3/32           | Werewolf                            |
| 7      | 1/12         | 1/16           | Wolf                                |
| 8      | 0            | 1/32           | Werewolf                            |

$\operatorname{error} = \frac{1}{32} + \frac{1}{16} + \frac{1}{12} + \frac{1}{12} + \frac{1}{12} + \frac{1}{16} + 0 = \frac{13}{32} \approx 41\%.$
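A short script that reproduces the tables and the error rate above using exact fractions (the variable names are mine):

```python
from fractions import Fraction as F

p_wolf = {x: (F(1, 6) if x <= 7 else F(0)) for x in range(2, 9)}   # P(x | wolf), 1d6+1
p_were = {2: F(1, 16), 3: F(1, 8), 4: F(3, 16), 5: F(1, 4),
          6: F(3, 16), 7: F(1, 8), 8: F(1, 16)}                    # P(x | werewolf), 2d4

error = F(0)
for x in range(2, 9):
    joint = {"Wolf": p_wolf[x] / 2, "Werewolf": p_were[x] / 2}     # P(x, y) = P(x | y) * 1/2
    prediction = max(joint, key=joint.get)                         # Bayes optimal prediction
    error += min(joint.values())                                   # mass of the label we did not pick
    print(x, prediction)

print(error, "=", float(error))                                    # 13/32 = 0.40625
```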

Bayes Optimal Classifier: Take-Away¶

  • We care about the Bayes Optimal Classifier because it gives us a "best-case" baseline against which we can compare our hypothesis' performance.
    • Although only in theory: in practice we don't know $\mathcal{P}$.
    • The expected error of the Bayes Optimal Classifier is the lowest possible expected error.
    • Is the loss of the Bayes Optimal Classifier always the lowest possible?

Best Constant Predictor¶

Another important baseline is the Best Constant Predictor.

$$h(x) = \arg \max_{y \in \mathcal{Y}} \; \mathcal{P}(y).$$
  • It just guesses the label that occurs most often.
  • This predicts completely independently of $x$.
  • Important for debugging: You should always be able to show that your classifier performs significantly better on the test set than the Best Constant Predictor.
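As a sketch of the debugging check this suggests (the labels below are hypothetical), the Best Constant Predictor's accuracy is simply the frequency of the majority class:

```python
from collections import Counter

y_test = [0, 0, 1, 0, 2, 0, 1, 0]                        # hypothetical test labels
majority_label, count = Counter(y_test).most_common(1)[0]
baseline_accuracy = count / len(y_test)                   # accuracy of always predicting the majority label
print(majority_label, baseline_accuracy)                  # 0 0.625 -- your classifier should beat this
```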

How does k-NN compare to the Bayes Optimal Classifier?¶

We can bound the error of 1-NN relative to the Bayes Optimal Classifier.

  • Here we'll follow the original proof in Cover and Hart (1967).

Lemma: Convergence of the Nearest Neighbor¶

Suppose that $(\mathcal{X}, \operatorname{dist})$ is a separable metric space.

  • A metric space is separable if there is a countable subset of $\mathcal{X}$ that is dense in the space, i.e. a sequence that comes arbitrarily close to every point in the space.
  • In particular, $\mathbb{R}^d$ is separable.

Let $x_{\text{test}}$ and $x_1, x_2, \ldots$ be independent identically distributed random variables over $\mathcal{X}$. Then almost surely (i.e. with probability $1$)

$$\lim_{n \rightarrow \infty} \; \arg \min_{x \in \{x_1, \ldots, x_n\}} \operatorname{dist}(x, x_{\text{test}}) = x_{\text{test}}.$$

Proof Outline: Case 1¶

Consider the case where every ball of radius $r > 0$ centered at $x_{\text{test}}$ has positive probability.

Then, no matter how close the current nearest neighbor is to $x_{\text{test}}$, every time we draw a fresh sample $x_i$ from the source distribution, there is a positive probability that it lands even closer than the current nearest neighbor in the training set.

This implies that the distance diminishes to $0$ with probability $1$.

Proof Outline: Case 2¶

Consider the case where there is some ball of radius $r$ centered around $x_{\text{test}}$ that has probability zero in the source distribution.

But this case itself occurs with probability zero over the random draw of $x_{\text{test}}$.

  • Why? Let $Z$ be the set of all points in $\mathcal{X}$ that are the center of some ball with zero probability. Because $\mathcal{X}$ is separable, we can cover $Z$ with a countable union of balls with zero probability, so $Z$ itself must have zero probability.

  • You definitely don't need to know this topology stuff for CS4/5780, but I think it's good to mention it so you have some intuition about when NN might fail on exotic spaces.

Bounding the Error of 1-NN¶

Let $x_{\text{test}}$ denote a test point randomly drawn from $\mathcal{P}$. Let $\hat x_n$ (also a random variable) denote the nearest neighbor to $x_{\text{test}}$ in an independent training dataset of size $n$.

The expected error of the 1-NN classifier is $$ \operatorname{error}_{\text{1-NN}} = \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | \hat x_n) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right]. $$ This is the sum over all labels $y$ of the probability that the prediction will be $y$ but the true label will not be $y$.

Bounding the Error of 1-NN in the limit¶

Taking the limit as $n$ approaches infinity, the expected error is \begin{align*} \lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}} &= \lim_{n \rightarrow \infty} \; \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | \hat x_n) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right] \\&= \mathbf{E}\left[ \sum_{y \in \mathcal{Y}} \mathcal{P}(y | x_{\text{test}}) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right], \end{align*} where the second equality uses the convergence lemma ($\hat x_n \rightarrow x_{\text{test}}$ almost surely) together with continuity of $\mathcal{P}(y \mid \cdot)$.

Comparing to the Bayes Optimal Classifier¶

Let $\hat y$ denote the prediction of the Bayes Optimal Classifier on $x_{\text{test}}$. \begin{align*} \lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}} &= \mathbf{E}\left[ \mathcal{P}(\hat y | x_{\text{test}}) \left(1 - \mathcal{P}(\hat y | x_{\text{test}})\right) \right] \\&\hspace{2em}+ \mathbf{E}\left[ \sum_{y \ne \hat y} \mathcal{P}(y | x_{\text{test}}) \left(1 - \mathcal{P}(y | x_{\text{test}})\right) \right] \\&\le \mathbf{E}\left[ 1 \cdot \left( 1 - \mathcal{P}(\hat y | x_{\text{test}}) \right) \right] + \mathbf{E}\left[ \sum_{y \ne \hat y} 1 \cdot \mathcal{P}(y | x_{\text{test}}) \right] \\&= 2\, \mathbf{E}\left[ 1 - \mathcal{P}(\hat y | x_{\text{test}})\right] = 2 \operatorname{error}_{\text{Bayes}}. \end{align*}

Conclusion: 1-NN is no more than a factor-of-2 worse¶

$$\lim_{n \rightarrow \infty} \; \operatorname{error}_{\text{1-NN}} \le 2 \operatorname{error}_{\text{Bayes}}.$$
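As a sanity check, we can evaluate the limiting 1-NN error formula exactly on the wolf/werewolf example and compare it to the Bayes error; on that example the 1-NN limit lands between the Bayes error and twice the Bayes error. The script below is a sketch with hand-entered tables:

```python
from fractions import Fraction as F

# Joint probabilities P(x, y) from the wolf/werewolf example (damage 2..8).
werewolf = [F(1, 32), F(1, 16), F(3, 32), F(1, 8), F(3, 32), F(1, 16), F(1, 32)]
joint = {x: {"Wolf": (F(1, 12) if x <= 7 else F(0)), "Werewolf": w}
         for x, w in zip(range(2, 9), werewolf)}

bayes_err, nn_err = F(0), F(0)
for x, probs in joint.items():
    p_x = sum(probs.values())                                   # marginal P(x)
    cond = {y: p / p_x for y, p in probs.items()}               # P(y | x)
    bayes_err += p_x * (1 - max(cond.values()))                 # Bayes picks the most likely label
    nn_err += p_x * sum(q * (1 - q) for q in cond.values())     # limiting 1-NN error at this x

print(float(bayes_err), float(nn_err), float(2 * bayes_err))    # ~0.41, ~0.46, ~0.81
```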

...in the limit... and in-the-limit results can be suspicious because they don't tell us how fast we converge¶

k-NN is limited by The Curse of Dimensionality!¶

k-NN works by reasoning about how close together points are.

In high-dimensional spaces, points drawn from a distribution tend not to be close together.

  • You need to draw a very large number of points before any two of them are particularly close to each other.

First, let's look at some random points and a line in the unit square.

In [17]:
import numpy as np; import matplotlib.pyplot as plt
fig = plt.figure(); X = np.random.rand(50, 2)                          # 50 random points in the unit square
plt.plot(X[:, 0], X[:, 1], 'b.'); plt.plot([0, 1], [0.5, 0.5], 'r-')   # points in blue, the line y = 0.5 in red
plt.axis('square'); plt.axis((0, 1, 0, 1));

Consider what happens when we move to three dimensions. The points move further away from each other but stay equally close to the red hyperplane.

In [18]:
from matplotlib.animation import FuncAnimation
from numpy import arange
(fig, animate) = curse_demo_1()                     # course demo helper (defined elsewhere)
ani = FuncAnimation(fig, animate, arange(1, 100, 1), interval=10);
In [19]:
curse_demo_2()
In [20]:
curse_demo_3()

Curse of Dimensionality¶

  • The typical pairwise distance between two points drawn from a unit cube (or unit sphere, or a unit Gaussian) increases with dimension.

    • Classifiers that rely on pairwise distances may then find no training points near a given test point.
    • Which point is nearest can come down to measurement noise.
  • In comparison, the distance to a fixed hyperplane does not increase with dimension (see the small experiment below).
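A small experiment that illustrates both points; the dimensions, sample size, and number of pairs below are arbitrary choices for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_pairs = 500, 200
for d in [2, 10, 100, 1000]:
    X = rng.random((n, d))                                    # n points uniform in the unit cube [0, 1]^d
    pairs = np.array([rng.choice(n, size=2, replace=False) for _ in range(n_pairs)])
    pair_dists = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1)
    plane_dists = np.abs(X[:, 0] - 0.5)                       # distance to the hyperplane x_1 = 0.5
    print(f"d={d:4d}  mean pairwise distance = {pair_dists.mean():6.2f}  "
          f"mean distance to hyperplane = {plane_dists.mean():.2f}")
```

The mean pairwise distance grows roughly like $\sqrt{d}$, while the mean distance to the hyperplane stays near $0.25$ for every $d$.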

Low-Dimensional Structure to the Rescue¶

If the data lies on a low-dimensional submanifold, then we can still use low-dimensional, distance-based methods, even when the ambient dimension is high.

  • E.g. human faces

Reminders¶

  • Please finish the prelim exam (placement quiz)!

  • First homework is out tonight!

    • Work in groups of up to two (2)!