This is much more open-ended than supervised learning!
Q: What does "structure" even mean?
Input: n data points \mathcal{D} = \{x_1, x_2, \ldots, x_n\}
Output: k clusters of the dataset
Partition data into k groups where all examples in a group are close in Euclidean distance.
A k-Means clustering (the analog of a hypothesis in this case) is a partition of \mathcal{D} into k sets (clusters) \mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k such that \mathcal{C}_i \cap \mathcal{C}_j = \emptyset for i \ne j and \mathcal{C}_1 \cup \mathcal{C}_2 \cup \cdots \cup \mathcal{C}_k = \mathcal{D}.
We measure how "good" a clustering is by
Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}) = \sum_{\ell=1}^{k} \frac{1}{2|\mathcal{C}_{\ell}|} \sum_{i,j\in\mathcal{C}_{\ell}} \| x_{i} - x_{j} \|^{2}_{2}.
This is the average squared distance between pairs of points in a cluster, weighted by cluster size.
If \mu_\ell = \frac{1}{|\mathcal{C}_{\ell}|} \sum_{i \in \mathcal{C}_{\ell}} x_i denotes the mean of \mathcal{C}_{\ell}, then another way to write the k-Means objective is
Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}.
This is the sum of the squares of the distances between each point and its cluster's centroid.
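As a quick sanity check of this equivalence, here is a small NumPy sketch (names and data are my own, not from the notes) that computes Z in both the pairwise form and the centroid form and confirms they agree on an arbitrary clustering:

```python
import numpy as np

def kmeans_objective_pairwise(X, labels, k):
    """Z as the 1/(2|C|)-weighted sum of squared pairwise distances within each cluster."""
    total = 0.0
    for ell in range(k):
        C = X[labels == ell]
        if len(C) == 0:
            continue
        diffs = C[:, None, :] - C[None, :, :]       # all pairwise differences within the cluster
        total += (diffs ** 2).sum() / (2 * len(C))
    return total

def kmeans_objective_centroid(X, labels, k):
    """Z as the sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for ell in range(k):
        C = X[labels == ell]
        if len(C) == 0:
            continue
        mu = C.mean(axis=0)
        total += ((C - mu) ** 2).sum()
    return total

# The two forms agree (up to floating-point error) on any clustering.
X = np.random.default_rng(0).normal(size=(100, 2))       # stand-in dataset
labels = np.random.default_rng(1).integers(0, 3, size=100)
print(kmeans_objective_pairwise(X, labels, 3), kmeans_objective_centroid(X, labels, 3))
```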
We won't necessarily be able to find the global optimum of this objective.
Now let the "centroids" \mu_\ell \in \R^d vary freely, and set
Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}.
Why is minimizing this objective equivalent to minimizing the original k-Means objective?
Minimizing Z is hard! In fact, it's NP-Hard.
But, it's easy to minimize with respect to any one parameter, leaving the others fixed.
Suppose the \mathcal{C} are fixed, and we want to minimize over \mu.
\begin{align*} 0 &= \nabla_{\mu_j} Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) \\&= \nabla_{\mu_j} \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2} \\&= 2 \sum_{i\in\mathcal{C}_j} (\mu_j - x_{i}) = 2 \abs{\mathcal{C}_j} \mu_j - 2 \sum_{i\in\mathcal{C}_j} x_i. \end{align*}
Solving gives \mu_j = \frac{1}{\abs{\mathcal{C}_j}} \sum_{i\in\mathcal{C}_j} x_i, which is just the mean of \mathcal{C}_j.
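A tiny numerical sketch of this fact (made-up data, not from the notes): moving away from the cluster mean can only increase the within-cluster loss.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(50, 3))              # points in one cluster
mu = C.mean(axis=0)                       # the closed-form minimizer
loss = lambda m: ((C - m) ** 2).sum()     # sum of squared distances to m

# Any perturbation of mu gives a strictly larger loss.
for _ in range(5):
    assert loss(mu) < loss(mu + 0.1 * rng.normal(size=3))
```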
Suppose the \mu are fixed, and we want to place x_i in the cluster that minimizes the loss
Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}.
By inspection, this happens when we place x_i in the cluster with the closest \mu_{\ell}, i.e. x_i \in \mathcal{C}_{\arg \min_{\ell \in \{1,\ldots,k\}} \| x_i - \mu_{\ell} \|_2}.
Idea: alternate minimizing Z over the centroids and the cluster assignments. Repeat until converged:
\mu_\ell := \frac{1}{\abs{\mathcal{C}_\ell}} \sum_{i\in\mathcal{C}_\ell} x_i \text{ for all }\ell \in \{1,\ldots,k\}.
\mathcal{C}_\ell := \left\{ i \in \{1,\ldots,n\} \middle| \ell = \arg \min_{l \in \{1,\ldots,k\}} \| x_i - \mu_l \|_2 \right\}.
If the cluster assignments didn't change, we converged.
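Here is a minimal NumPy sketch of Lloyd's algorithm as described above (function and variable names are mine; it initializes centroids by sampling data points and does the assignment step before the centroid update):

```python
import numpy as np

def lloyd(X, k, n_iters=100, seed=0):
    """A minimal sketch of Lloyd's algorithm: alternate cluster assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct points from the dataset.
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iters):
        # Assignment step: put each point in the cluster with the nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for ell in range(k):
            if np.any(labels == ell):
                mu[ell] = X[labels == ell].mean(axis=0)
    return mu, labels
```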
The loss Z is decreasing at each step. There are only a finite number of cluster assignments, and the algorithm can't loop back to a previous assignment because Z is decreasing. So it must terminate.
It doesn't necessarily converge to the global optimum of Z.
Computational cost is \mathcal{O}(ndk) per iteration, where n is the dataset size, d the dimension, and k the number of clusters.
How do we choose k? We could choose the k that minimizes Z. Problem: with k = n, each point gets its own centroid and the loss is Z = 0.
One heuristic: plot Z with respect to k and choose the k at which the loss stops significantly decreasing.
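A possible sketch of this heuristic, assuming scikit-learn is available (its KMeans exposes the objective Z as inertia_) and using a stand-in random dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # stand-in dataset

ks = range(1, 11)
losses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Look for the k where the curve stops dropping sharply.
plt.plot(list(ks), losses, marker="o")
plt.xlabel("k")
plt.ylabel("Z (within-cluster sum of squares)")
plt.show()
```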
Often we use k-Means as part of a larger system, where the cluster output is passed into some downstream task. Another heuristic: choose the k that results in the best performance on that downstream task.
Simple approach: assign each point to a cluster at random.
Generally better approach: assign the centroids \mu_\ell at random by sampling (without replacement) from the dataset.
Even better approach: k-means++
Assign the centroids \mu_\ell at random from the dataset, weighted so that the centroids are spread out.
We usually try many random initializations and pick the one that results in the lowest loss Z after running Lloyd's algorithm. This increases our chances of getting the globally optimal solution, although it still does not guarantee anything.
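For illustration, a rough sketch of the k-means++ weighting (function name mine): each new centroid is a data point sampled with probability proportional to its squared distance to the nearest centroid chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of k-means++ initialization: spread-out centroids sampled from the data."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]        # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centroid.
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                    # far-away points are more likely to be picked
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```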
A cluster occupies the space that is nearer to its centroid than to any other centroid. Say our cluster's centroid is \mu \in \R^d and another cluster's centroid is \nu \in \R^d. Then x will be in our cluster if
\| x - \mu \|^2 \le \| x - \nu \|^2
\| x \|^2 - 2 \langle x, \mu \rangle + \| \mu \|^2 \le \| x \|^2 - 2 \langle x, \nu \rangle + \| \nu \|^2
2 \langle x, \nu \rangle - 2 \langle x, \mu \rangle \le \| \nu \|^2 - \| \mu \|^2
\langle x, \nu - \mu \rangle \le \frac{ \| \nu \|^2 - \| \mu \|^2 }{2}.
This is just the equation for a half-space: \langle x, a \rangle \le b.
Conclusion: a k-Means cluster lies in the intersection of half-spaces.
The intersection of half-spaces is a convex polytope^*.
All these intersections together form a Voronoi diagram.
^* if we use an expanded definition of "polytope" that includes unbounded objects.
Boundaries between k-Means clusters are flat hyperplanes.
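As a small numerical check (a sketch, not from the notes), the nearest-centroid rule and the half-space inequality derived above make the same decision for random points:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, nu = rng.normal(size=3), rng.normal(size=3)   # two centroids in R^3

for _ in range(1000):
    x = rng.normal(size=3)
    nearer_mu = np.sum((x - mu) ** 2) <= np.sum((x - nu) ** 2)   # nearest-centroid rule
    halfspace = x @ (nu - mu) <= (nu @ nu - mu @ mu) / 2         # derived half-space condition
    assert nearer_mu == halfspace
```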