This is much more open-ended than supervised learning!
Q: What does "structure" even mean?
Input: n data points \mathcal{D} = \{x_1, x_2, \ldots, x_n\}
Output: k clusters of the dataset
Partition data into k groups where all examples in a group are close in Euclidean distance.
A k-Means clustering (the analog of a hypothesis in this case) is a partition of \mathcal{D} into k sets (clusters) \mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k such that \mathcal{C}_i \cap \mathcal{C}_j = \emptyset for i \ne j and \mathcal{C}_1 \cup \mathcal{C}_2 \cup \cdots \cup \mathcal{C}_k = \mathcal{D}.
We measure how "good" a clustering is by
Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}) = \sum_{\ell=1}^{k} \frac{1}{2|\mathcal{C}_{\ell}|} \sum_{i,j\in\mathcal{C}_{\ell}} \| x_{i} - x_{j} \|^{2}_{2}.
This is the average squared distance between pairs of points in a cluster, weighted by cluster size.
If \mu_\ell = \frac{1}{|\mathcal{C}_{\ell}|} \sum_{i \in \mathcal{C}_{\ell}} x_i denotes the mean of \mathcal{C}_{\ell}, then another way to write the k-Means objective is
Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}.
This is the sum of the squares of the distances between each point and its cluster's centroid.
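As a quick sanity check of this equivalence, here is a small NumPy sketch (names and data are my own, not from the notes) that computes Z in both the pairwise form and the centroid form and confirms they agree on an arbitrary clustering:

```python
import numpy as np

def kmeans_objective_pairwise(X, labels, k):
    """Z as the 1/(2|C|)-weighted sum of squared pairwise distances within each cluster."""
    total = 0.0
    for ell in range(k):
        C = X[labels == ell]
        if len(C) == 0:
            continue
        diffs = C[:, None, :] - C[None, :, :]       # all pairwise differences within the cluster
        total += (diffs ** 2).sum() / (2 * len(C))
    return total

def kmeans_objective_centroid(X, labels, k):
    """Z as the sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for ell in range(k):
        C = X[labels == ell]
        if len(C) == 0:
            continue
        mu = C.mean(axis=0)
        total += ((C - mu) ** 2).sum()
    return total

# The two forms agree (up to floating-point error) on any clustering.
X = np.random.default_rng(0).normal(size=(100, 2))       # stand-in dataset
labels = np.random.default_rng(1).integers(0, 3, size=100)
print(kmeans_objective_pairwise(X, labels, 3), kmeans_objective_centroid(X, labels, 3))
```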
We won't necessarily be able to find the global optimum of this objective.
Now let the "centroids" \mu_\ell \in \R^d vary freely, and set
Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}.
Why is minimizing this objective equivalent to minimizing the original k-Means objective?
Minimizing Z is hard! In fact, it's NP-Hard.
But, it's easy to minimize with respect to any one parameter, leaving the others fixed.
Suppose the \mathcal{C} are fixed, and we want to minimize over \mu.
\begin{align*} 0 &= \nabla_{\mu_j} Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) \\&= \nabla_{\mu_j} \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2} \\&= 2 \sum_{i\in\mathcal{C}_j} (\mu_j - x_{i}) = 2 \abs{\mathcal{C}_j} \mu_j - 2 \sum_{i\in\mathcal{C}_j} x_i. \end{align*}
Solving gives \mu_j = \frac{1}{\abs{\mathcal{C}_j}} \sum_{i\in\mathcal{C}_j} x_i, which is just the mean of \mathcal{C}_j.
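A tiny numerical sketch of this fact (made-up data, not from the notes): moving away from the cluster mean can only increase the within-cluster loss.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(50, 3))              # points in one cluster
mu = C.mean(axis=0)                       # the closed-form minimizer
loss = lambda m: ((C - m) ** 2).sum()     # sum of squared distances to m

# Any perturbation of mu gives a strictly larger loss.
for _ in range(5):
    assert loss(mu) < loss(mu + 0.1 * rng.normal(size=3))
```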
Suppose the \mu are fixed, and we want to place x_i in the cluster that minimizes the loss
Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}.
By inspection, this happens when we place x_i in the cluster with the closest \mu_{\ell}, i.e. x_i \in \mathcal{C}_{\arg \min_{\ell \in \{1,\ldots,k\}} \| x_i - \mu_{\ell} \|_2}.
Idea: alternate minimizing Z over the centroids and the cluster assignments. Repeat until converged:
\mu_\ell := \frac{1}{\abs{\mathcal{C}_\ell}} \sum_{i\in\mathcal{C}_\ell} x_i \text{ for all }\ell \in \{1,\ldots,k\}.
\mathcal{C}_\ell := \left\{ i \in \{1,\ldots,n\} \middle| \ell = \arg \min_{l \in \{1,\ldots,k\}} \| x_i - \mu_l \|_2 \right\}.
If the cluster assignments didn't change, we converged.
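Here is a minimal NumPy sketch of Lloyd's algorithm as described above (function and variable names are mine; it initializes centroids by sampling data points and does the assignment step before the centroid update):

```python
import numpy as np

def lloyd(X, k, n_iters=100, seed=0):
    """A minimal sketch of Lloyd's algorithm: alternate cluster assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct points from the dataset.
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iters):
        # Assignment step: put each point in the cluster with the nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for ell in range(k):
            if np.any(labels == ell):
                mu[ell] = X[labels == ell].mean(axis=0)
    return mu, labels
```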
The loss Z is decreasing at each step. There are only a finite number of cluster assignments, and the algorithm can't loop back to a previous assignment because Z is decreasing. So it must terminate.
It doesn't necessarily converge to the global optimum of Z.
Computational cost is \mathcal{O}(ndk) per iteration, where n is the dataset size, d the dimension, and k the number of clusters.
How do we choose k? We could choose the k that minimizes Z. Problem: with k = n, each point gets its own centroid and the loss is Z = 0.
One heuristic: plot Z with respect to k and choose the k at which the loss stops significantly decreasing.
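A possible sketch of this heuristic, assuming scikit-learn is available (its KMeans exposes the objective Z as inertia_) and using a stand-in random dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # stand-in dataset

ks = range(1, 11)
losses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Look for the k where the curve stops dropping sharply.
plt.plot(list(ks), losses, marker="o")
plt.xlabel("k")
plt.ylabel("Z (within-cluster sum of squares)")
plt.show()
```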
Often we use k-Means as part of a larger system, where the cluster output is passed into some downstream task. Another heuristic: choose the k that results in the best performance on that downstream task.
Simple approach: assign each point to a cluster at random.
Generally better approach: assign the centroids \mu_\ell at random by sampling (without replacement) from the dataset.
Even better approach: k-means++
Assign the centroids \mu_\ell at random from the dataset, weighted so that the centroids are spread out.
We usually try many random initializations and pick the one that results in the lowest loss Z after running Lloyd's algorithm. This increases our chances of getting the globally optimal solution, although it still does not guarantee anything.
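For illustration, a rough sketch of the k-means++ weighting (function name mine): each new centroid is a data point sampled with probability proportional to its squared distance to the nearest centroid chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of k-means++ initialization: spread-out centroids sampled from the data."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]        # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centroid.
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                    # far-away points are more likely to be picked
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```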
A cluster occupies the space that is nearer to its centroid than to any other centroid. Say our cluster's centroid is \mu \in \R^d and another cluster's centroid is \nu \in \R^d. Then x will be in our cluster if
\| x - \mu \|^2 \le \| x - \nu \|^2
\| x \|^2 - 2 \langle x, \mu \rangle + \| \mu \|^2 \le \| x \|^2 - 2 \langle x, \nu \rangle + \| \nu \|^2
2 \langle x, \nu \rangle - 2 \langle x, \mu \rangle \le \| \nu \|^2 - \| \mu \|^2
\langle x, \nu - \mu \rangle \le \frac{ \| \nu \|^2 - \| \mu \|^2 }{2}.
This is just the equation for a half-space: \langle x, a \rangle \le b.
Conclusion: a k-Means cluster lies in the intersection of half-spaces.
The intersection of half-spaces is a convex polytope^*.
All these intersections together form a Voronoi diagram.
^* if we use an expanded definition of "polytope" that includes unbounded objects.
Boundaries between k-Means clusters are flat hyperplanes.
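As a small numerical check (a sketch, not from the notes), the nearest-centroid rule and the half-space inequality derived above make the same decision for random points:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, nu = rng.normal(size=3), rng.normal(size=3)   # two centroids in R^3

for _ in range(1000):
    x = rng.normal(size=3)
    nearer_mu = np.sum((x - mu) ** 2) <= np.sum((x - nu) ** 2)   # nearest-centroid rule
    halfspace = x @ (nu - mu) <= (nu @ nu - mu @ mu) / 2         # derived half-space condition
    assert nearer_mu == halfspace
```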