Many clustering algorithms have been invented or modified over the past decade. Each new algorithm is designed to optimize some measure of cluster quality, usually a function that tries to formalize the notion of a "good clustering". Examples of such measures include the sum of squares, expansion, and conductance. Optimizing these measures is typically NP-hard, so approximation algorithms are needed in practice. Once the algorithm is implemented and tested on real data, the author typically claims that the clusters are good by giving examples of similarities between the items clustered together. The question to ask then is: "Do the clusters depend on the algorithm, or can we find the same clusters with very different algorithms, and if not, why not?"
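To make two of the measures above concrete, here is a minimal sketch in plain Python of conductance and expansion for a vertex subset S of an undirected graph. The adjacency-list representation and function names are our own illustration, not taken from any particular paper.

```python
# Sketch of two cut-quality measures: conductance and expansion.
# The graph is an undirected adjacency list; smaller values mean
# the subset S is better separated from the rest of the graph.

def cut_size(adj, S):
    """Number of edges with exactly one endpoint in S."""
    S = set(S)
    return sum(1 for u in S for v in adj[u] if v not in S)

def volume(adj, S):
    """Sum of the degrees of the vertices in S."""
    return sum(len(adj[u]) for u in S)

def conductance(adj, S):
    """cut(S, complement) / min(vol(S), vol(complement))."""
    comp = set(adj) - set(S)
    return cut_size(adj, S) / min(volume(adj, S), volume(adj, comp))

def expansion(adj, S):
    """cut(S, complement) / min(|S|, |complement|)."""
    comp = set(adj) - set(S)
    return cut_size(adj, S) / min(len(set(S)), len(comp))

# Toy example: two triangles joined by a single bridge edge (2-3).
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
S = {0, 1, 2}
print(conductance(adj, S))  # 1/7: one cut edge, volume 7 on each side
print(expansion(adj, S))    # 1/3: one cut edge, three vertices per side
```

On this toy graph the natural two-community split has low conductance, which is exactly what a conductance-minimizing algorithm would find.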

In the paper by Khan, Kulis, Hopcroft and Selman, the authors introduce the notion of a natural community: a community that keeps reappearing even when the data changes by a small amount. Their experiments involved clustering a random subset of the CiteSeer database and then looking for communities that kept appearing every time the clustering was performed. We wish to compare the clusterings produced by the agglomerative, spectral, and mincut algorithms. Spectral and mincut try to optimize the same objectives, conductance and expansion, so we hope they might produce similar clusterings. We have compared agglomerative clusterings with spectral clusterings and observed some promising results. The stability of an algorithm may also affect its clusterings, so we may want to explore the stability of the algorithms as well. Here are some considerations for the next few weeks.
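One concrete way to compare the clusterings that two algorithms produce on the same data is a pair-counting agreement score such as the Rand index: the fraction of item pairs that both clusterings treat the same way. The sketch below is pure Python, and the label vectors are made up for illustration; they are not results from our experiments.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree:
    either both place the pair in the same cluster, or both split it."""
    n = len(labels_a)
    agree = sum(
        1 for i, j in combinations(range(n), 2)
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
    )
    return agree / (n * (n - 1) / 2)

# Hypothetical cluster assignments of six items by two algorithms:
a = [0, 0, 0, 1, 1, 1]
b = [0, 0, 1, 1, 1, 1]
print(rand_index(a, b))  # 10 of the 15 pairs agree
```

Running the same comparison on clusterings of repeatedly perturbed subsets of the data would give one rough quantitative handle on stability in the sense discussed above.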