Meta Clustering

Meta Clustering

The people currently involved in this project are: Rich Caruana, Nam Nguyen, Casey Smith in the Computer Science Department at Cornell University.

The Meta Clustering Project is funded by NSF under CAREER Award # 0347318 (granted 2004).

Project Description:

Meta clustering is a new approach to the problem of clustering. Meta clustering aims at creating a new mode of interaction between users, the clustering system, and the data. Rather than finding one optimal clustering of the data, meta clustering finds many alternate good clusterings of the data and allows the user to select which of these clusterings is most useful, exploring the space of reasonable clusterings. To prevent the user from having to evaluate too many clusterings, the many base-level clusterings are organized into a meta clustering, a clustering of clusterings that groups similar base-level clusterings together. This meta clustering makes it easier for users to evaluate the clusterings and efficiently navigate to the clustering(s) useful for their purpose.

Meta clustering can be divided into three fundamental steps:

Generate many good, yet qualitatively different, base-level clusterings of the same data. (This is a very interesting problem all by itself!)
Measure the similarity between the base-level clusterings generated in the first step so that similar clusterings can be grouped together.
Organize the base-level clusterings at a meta level (either by clustering or by low-dimension projection) and present them to users.

Source Code:

Kmeans: generates clusterings with pre-specified number of clusters and sensitive to initialization.

Constraint Kmeans: generate clusterings with must-link and cannot-link constraints.

Kmeans Feature Weighting: generates diverse clusterings by applied weights on the features.

Zipf Generator: generates random numbers according the Zipf distribution with a specified shape parameter.

Diverse-Kmeans: generates diverse clusterings by incorporating clustering distance into objective function.

Clustering Distance: computes the Rand index and Variational Information (VI) distance between clusterings.

Warning: This code is under development. We make no guarantees that it is yet suitable for external use. If you try the code, please let us know if it was useful, and if you find any bugs or improvements. If you email us, we'll be happy to let you know when we update the code or find and fix bugs. Code and Tutorial

Data Sets:

We have been using the following data sets in most of our meta clustering experiments. The Bergmark data set was created by Donna Bergmark at Cornell University. The Protein data set was created by Rich Caruana, Paul Hodor, and John Rosenberg at CMU and the University of Pittsburgh. Email us if you want to use the Bergmark or Protein data sets. The others are already available on the web.

The Australia Coastal data is a subset of the data available from the Biogeoinformatics of Hexacorals environmental database, http://www.kgs.ku.edu/Hexacoral/.
The Bergmark data was collected from 25 focused web crawls, each with different keywords.
The Covertype data is from the UCI Machine Learning Repository. It contains cartographic variables sampled at 30x30 meter grid cells in wilderness areas.
The Letter data is the isolet spoken letter data set from the UCI Machine Learning Repository.
The Phoneme data is the from the UCI Machine Learning Repository. It records the 11 phonemes of 15 speakers.
The Protein data is the pairwise similarities between 639 proteins. It was created by crystallographers developing techniques to learn relationships between protein sequence and structure.
The LetterFont data set contains 100 images of 10 different letters and 10 different fonts.
The Yale Face data set contains 10x9x8 = 720 images of 10 different persons under 9 poses and 8 illumination conditions, http://cvc.yale.edu/projects/yalefacesB/yalefacesB.html.

Papers and Tech Reports:

Nam Nguyen, Rich Caruana, Consensus Clustering, to appear in the International Conference on Data Mining, 2007.

Nam Nguyen, Rich Caruana, Generating Diverse Clusterings, in submission to Neural Information Processing Systems Conference, 2007.

Rich Caruana, Mohamed Elhawary, Nam Nguyen, and Casey Smith, Meta Clustering, Proceedings of the International Conference on Data Mining, 2006.

Rich Caruana, Pedro Artigas, Ann Goldenberg and Anton Likhodedov, Meta Clustering, Technical Report TR2002-1884, Cornell University. This is an old, unpublished tech report that describes our first attempts at meta clustering

Interesting Results:

Zipf Weighting examples. Each row visualizes a Zipf distribution with a different shape parameter, α. Each row has 50 bars representing 50 random samples from the distribution, with the height of the bar proportional to the value of the sample. We use a Zipf power law distribution because there is empirical evidence that feature importance is Zipf-distributed in a number of real-world problems [1,2]. A Zipf distribution describes a range of integer values from 1 to some maximum value K. The frequency of each integer is proportional to 1/i^α where i is the integer value and α is the shape parameter. Thus, for α=0, the Zipf distribution becomes a uniform distribution from 1 to K. As α increases, the distribution becomes more biased toward smaller numbers, with only the occasional value approaching K. In the figure, random values from a Zipf distribution can be generated in the manner of [3].

References:

S. Cohen, F. Dror, and E. Ruppin. A feature selection method based on the Shapley value. In Proceedings of the International Joint Conference on Artificial Intelligence, 2005.
C. Furlanello, M. Serafini, S. Merler, and G. Jurman. Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics, 4:54, 2003.
K. Christensen, Statistical and Measurement Tools.

Clusterings that would be of most interest to users (best accuracy) often are not very compact clusterings: CoverType dataset. The x-axis is compactness which measures the average pairwise distances between points in the same clusters. In many traditional clustering algorithms such as Kmeans and agglomerative clustering, the optimization criterion is closely related to this measure of compactness. The y-axis is accuracy which is measured relative to the auxiliary labels for each data set. The auxiliary labels are only a proxy for what a specific user might want in a particular application. Each point in the figure represents a full clustering. Since there exist reasonable clusterings (clusterings locating in the top-right of the figure) which are more accurate than the most compact clustering (the left-most clustering), we conclude that the usual method of searching for the most compact clustering can be counterproductive. It is important to explore the space of reasonable clusterings more thoroughly.
Comparision among Kmeans, Kmeans Feature Weighting, and Diverse-Kmeans. In the figure, we compare the performance of Kmeans, Kmeans Feature Weighting, and Diverse-Kmeans in two difference performance metrics for the Phoneme data set. We observe that the Kmeans Feature Weighting explores a larger space of clusterings than the Kmeans. Moreover, the Diverse-Kmeans method is able to find the most interesting clusterings in both performance metrics. This result is consistent for other data sets that we experiment on. Diverse-Kmeans is able to find many quality clusterings that are missed by traditional clustering algorithms.

Related Works:

If you know of other papers or links we should add to this collection, please let us know.

Consensus Clustering:

Fern, X. Z. and Brodley, C. E., Random projection for high dimensional data clustering: A cluster ensemble approach. Proceedings of the 20th International Conference on Machine Learning, 2003.

Fred, A. L. and Jain, A. K., Data clustering using evidence accumulation. Proceedings of the 16th International Conference on Pattern Recognition, 2002.

Gionis, A., Mannila, H. and Tsaparas, P., Clustering aggregation. Proceedings of the 21st International Conference on Data Mining, 2005.

Strehl, A. and Ghosh, J., Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 2002.

Topchy, A., Jain, A. K. and Punch, W., Combining multiple weak clusterings. Proceedings IEEE International Conference on Data Mining, 2003.

Topchy, A., Jain, A. K. and Punch, W., A mixture model for clustering ensembles. Proceedings SIAM International Conference on Data Mining, 2004.

Clustering with Background Knowledge:

Cohn D., Caruana R. and McCallum A., Semi-supervised clustering with user feedback, Technical Report TR2003-1892, Cornell University.

Wagstaff K., Cardie C., Rogers S. and Schroedl S., Constrained k-means clustering with background knowledge, Proceedings of the Eighteenth International Conference on Machine Learning, 2001.

Basu S., Bilenko M. and Mooney R.J., A probabilistic framework for semi-supervised clustering, Proceedings of the tenth ACM SIGKDD, 2004.

Multiple Alternative Clusterings:

Martin H. C. Law, Alexander P. Topchy and Anil K. Jain, Multi-objective Data Clustering, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.

Eric Bae and James Bailey, COALA : A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity, Proceedings of the IEEE International Conference on Data Mining, 2006.

Clustering Distance:

Fowlkes, E. B. and Mallows, C. L., A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 1983.

Lawrence Hubert and Phipps Arabie, Comparing partitions, Journal of Classification, 1985.

Rand, W. M., Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 1971.

Marina Meila, Comparing clusterings, UW Statistics Technical Report 418, 2003.

Marina Meila, Comparing Clusterings - An Axiomatic View, Proceedings of the 22nd International Conference on Machine Learning, 2005.