**Meta Clustering**

**The people currently involved in this project are Rich Caruana, Nam Nguyen, and Casey Smith of the Computer Science Department.**

**The Meta Clustering Project is funded by NSF under CAREER Award #0347318 (granted 2004).**

__Project Description__:

**Meta clustering** is a new approach to the problem of clustering. It proceeds in three steps:

- Generate many good, yet qualitatively different, base-level clusterings of the same data. (This is a very interesting problem all by itself!)
- Measure the similarity between the base-level clusterings generated in the first step so that similar clusterings can be grouped together.
- Organize the base-level clusterings at a meta level (either by clustering or by low-dimension projection) and present them to users.
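A toy sketch of these three steps (pure Python, not the project's code: the base clusterings here are random label vectors standing in for runs of k-means from different initializations, the distance is one minus the Rand index, and the meta-level organization is a simple greedy threshold grouping):

```python
import random
from itertools import combinations

def rand_distance(a, b):
    """1 - Rand index: the fraction of point pairs on which the two
    clusterings disagree about 'same cluster' vs. 'different cluster'."""
    pairs = list(combinations(range(len(a)), 2))
    disagree = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return disagree / len(pairs)

random.seed(0)
n_points, k, n_clusterings = 20, 3, 8

# Step 1: many base-level clusterings (random labelings as stand-ins).
base = [[random.randrange(k) for _ in range(n_points)]
        for _ in range(n_clusterings)]

# Step 2: pairwise distances between the base-level clusterings.
D = [[rand_distance(a, b) for b in base] for a in base]

# Step 3: organize at the meta level -- here, greedily group clusterings
# whose distance to every current group member falls below a threshold.
groups = []
for i in range(n_clusterings):
    for g in groups:
        if all(D[i][j] < 0.4 for j in g):
            g.append(i)
            break
    else:
        groups.append([i])
```

In the project itself, the third step can instead be a low-dimension projection of the distance matrix for presentation to users.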

__Source Code__:

**Kmeans**: generates clusterings with a pre-specified number of clusters; the results are sensitive to initialization.

**Constraint Kmeans**: generates clusterings that respect must-link and cannot-link constraints.
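The page doesn't show how the constraints enter the algorithm, but a standard formulation is the violation test from COP-KMEANS (Wagstaff et al., cited under Related Work below); a sketch of that test, with the surrounding k-means loop omitted:

```python
def violates_constraints(point, cluster, assignment, must_link, cannot_link):
    """Return True if assigning `point` to `cluster` would break a
    constraint, given a (possibly partial) assignment {point: cluster}.
    This mirrors the violation check in the assignment step of COP-KMEANS
    (Wagstaff et al., 2001); the k-means loop around it is omitted."""
    for a, b in must_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and other in assignment and assignment[other] != cluster:
            return True  # must-link partner already sits in a different cluster
    for a, b in cannot_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and assignment.get(other) == cluster:
            return True  # cannot-link partner already sits in this cluster
    return False
```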

**Kmeans Feature Weighting**: generates diverse clusterings by applying weights to the features.
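A minimal sketch of the idea, with hypothetical variable names: draw one weight per feature and rescale the data before clustering, so each weight vector induces a different distance metric and hence a potentially different clustering. (The project draws the weights from a Zipf distribution, per the Zipf Generator below; uniform weights are used here for brevity.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # toy data: 100 points, 5 features

# One weight per feature; rescaling the columns changes which features
# dominate the distance computations inside any subsequent k-means run.
w = rng.uniform(0.0, 1.0, size=X.shape[1])
X_weighted = X * w
```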

**Zipf Generator**: generates random numbers according to the Zipf distribution with a specified shape parameter.
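One straightforward way to draw such values is inverse-CDF sampling over the integers 1..K with P(i) proportional to 1/i^α (a sketch; not necessarily the exact method the project's generator uses):

```python
import random

def zipf_sample(alpha, K, rng=random):
    """Draw one integer from {1, ..., K} with P(i) proportional to 1/i**alpha.
    alpha = 0 gives a uniform distribution on 1..K; larger alpha biases the
    draws toward small values."""
    weights = [1.0 / i ** alpha for i in range(1, K + 1)]
    u = rng.random() * sum(weights)       # uniform point on the total mass
    acc = 0.0
    for i, w in enumerate(weights, start=1):
        acc += w
        if u <= acc:                      # first i whose cumulative mass covers u
            return i
    return K                              # guard against floating-point rounding
```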

**Diverse-Kmeans**: generates diverse clusterings by incorporating a clustering distance into the objective function.

**Clustering Distance**: computes the distance between pairs of clusterings.
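The page doesn't state which distance is computed; one widely used clustering distance is Meila's variation of information (see the Clustering Distance references under Related Work below), sketched here:

```python
from math import log
from collections import Counter

def variation_of_information(a, b):
    """Meila's variation of information between two clusterings given as
    label lists over the same points: VI = H(A) + H(B) - 2 I(A; B).
    It is 0 exactly when the clusterings are identical up to relabeling."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))

    def h(counts):  # entropy of a cluster-size distribution
        return -sum(c / n * log(c / n) for c in counts.values())

    # mutual information between the two label assignments
    mi = sum(c / n * log((c / n) / ((pa[x] / n) * (pb[y] / n)))
             for (x, y), c in pab.items())
    return h(pa) + h(pb) - 2 * mi
```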

**Warning:** This code is under development. We make no guarantees that it is yet suitable for external use. If you try the code, please let us know whether it was useful and whether you find any bugs or possible improvements. If you email us, we'll be happy to let you know when we update the code or find and fix bugs.

Code and Tutorial

__Data Sets__:

We have been using the following data sets in most of our meta clustering experiments. The Bergmark data set was created by Donna Bergmark.

- The Australia Coastal data is a subset of the data available from the Biogeoinformatics of Hexacorals environmental database, http://www.kgs.ku.edu/Hexacoral/.
- The Bergmark data was collected from 25 focused web crawls, each with different keywords.
- The Covertype data is from the UCI Machine Learning Repository. It contains cartographic variables sampled at 30x30 meter grid cells in wilderness areas.
- The Letter data is the isolet spoken letter data set from the UCI Machine Learning Repository.
- The Phoneme data is from the UCI Machine Learning Repository. It records the **11 phonemes of 15 speakers**.
- The Protein data is the pairwise similarities between 639 proteins. It was created by crystallographers developing techniques to learn relationships between protein sequence and structure.
- The LetterFont data set contains 100 images of **10 different letters in 10 different fonts**.
- The Yale Face data set contains 10x9x8 = 720 images of **10 different persons under 9 poses and 8 illumination conditions**, http://cvc.yale.edu/projects/yalefacesB/yalefacesB.html.

**Papers and Tech Reports**:

Rich Caruana, Pedro Artigas, Ann Goldenberg and Anton Likhodedov, Meta Clustering, Technical Report TR2002-1884.

**Interesting Results:**

- Zipf Weighting examples. Each row visualizes a Zipf distribution with a different shape parameter, α. Each row has 50 bars representing 50 random samples from the distribution, with the height of each bar proportional to the value of the sample. We use a Zipf power-law distribution because there is empirical evidence that feature importance is Zipf-distributed in a number of real-world problems [1,2]. A Zipf distribution describes a range of integer values from 1 to some maximum value K. The frequency of each integer i is proportional to 1/i^{α}, where α is the shape parameter. Thus, for α=0, the Zipf distribution becomes a uniform distribution from 1 to K. As α increases, the distribution becomes more biased toward smaller numbers, with only the occasional value approaching K. The random values in the figure were generated in the manner of [3].

References:

- S. Cohen, F. Dror, and E. Ruppin. A feature selection method based on the Shapley value. In Proceedings of the International Joint Conference on Artificial Intelligence, 2005.
- C. Furlanello, M. Serafini, S. Merler, and G. Jurman. Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics, 4:54, 2003.
- K. Christensen, Statistical and Measurement Tools.
- Clusterings that would be of most interest to users (best **accuracy**) often are not very **compact** clusterings: CoverType data set. The x-axis is compactness, which measures the average pairwise distance between points in the same cluster. In many traditional clustering algorithms, such as Kmeans and agglomerative clustering, the optimization criterion is closely related to this measure of compactness. The y-axis is accuracy, measured relative to the auxiliary labels for each data set. The auxiliary labels are only a proxy for what a specific user might want in a particular application. Each point in the figure represents a full clustering. Since there exist reasonable clusterings (those located in the top right of the figure) that are more accurate than the most compact clustering (the left-most clustering), we conclude that the usual method of searching for the most compact clustering can be counterproductive. It is important to explore the space of reasonable clusterings more thoroughly.
- Comparison among Kmeans, Kmeans Feature Weighting, and Diverse-Kmeans. In the figure, we compare the performance of Kmeans, Kmeans Feature Weighting, and Diverse-Kmeans on two different performance metrics for the Phoneme data set. We observe that Kmeans Feature Weighting explores a larger space of clusterings than Kmeans. Moreover, the Diverse-Kmeans method is able to find the most interesting clusterings under both performance metrics. This result is consistent across the other data sets we experimented on: Diverse-Kmeans finds many quality clusterings that are missed by traditional clustering algorithms.
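The compactness measure used on the x-axis above (average pairwise distance between points in the same cluster) can be sketched as follows; the project's exact normalization may differ:

```python
import numpy as np

def compactness(X, labels):
    """Average pairwise Euclidean distance between points that share a
    cluster (lower = more compact).  Returns 0.0 if no within-cluster
    pairs exist (all clusters are singletons)."""
    total, pairs = 0.0, 0
    for c in np.unique(labels):
        P = X[labels == c]
        for i in range(len(P)):
            d = np.linalg.norm(P[i + 1:] - P[i], axis=1)
            total += d.sum()
            pairs += d.size
    return total / pairs if pairs else 0.0
```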

**Related Work:**

If you know of other papers or links we should add to this collection, please let us know.

**Consensus Clustering:**

Fern, X. Z. and Brodley, C. E., Random projection for high dimensional data clustering: A cluster ensemble approach. Proceedings of the 20th International Conference on Machine Learning, 2003.

Fred, A. L. and Jain, A. K., Data clustering using evidence accumulation. Proceedings of the 16th International Conference on Pattern Recognition, 2002.

Gionis, A., Mannila, H. and Tsaparas, P., Clustering aggregation. Proceedings of the 21st International Conference on Data Engineering, 2005.

Strehl, A. and Ghosh, J., Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 2002.

Topchy, A., Jain, A. K. and Punch, W., Combining multiple weak clusterings. Proceedings IEEE International Conference on Data Mining, 2003.

Topchy, A., Jain, A. K. and Punch, W., A mixture model for clustering ensembles. Proceedings

**Clustering with Background Knowledge:**

Cohn D., Caruana R. and McCallum A., Semi-supervised clustering with user feedback, Technical Report TR2003-1892.

Wagstaff K., Cardie C., Rogers S. and Schroedl S., Constrained k-means clustering with background knowledge, Proceedings of the Eighteenth International Conference on Machine Learning, 2001.

Basu S., Bilenko M. and Mooney R.J., A probabilistic framework for semi-supervised clustering, Proceedings of the tenth ACM SIGKDD, 2004.

**Multiple Alternative Clusterings:**

Martin H. C. Law, Alexander P. Topchy and Anil K. Jain, Multi-objective Data Clustering, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.

Eric Bae and James Bailey, COALA: A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity, Proceedings of the IEEE International Conference on Data Mining, 2006.

**Clustering Distance:**

Fowlkes, E. B. and Mallows, C. L., A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 1983.

Lawrence Hubert and Phipps Arabie, Comparing partitions, Journal of Classification, 1985.

Rand, W. M., Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 1971.

Marina Meila, Comparing clusterings, UW Statistics Technical Report 418, 2003.

Marina Meila, Comparing Clusterings - An Axiomatic View, Proceedings of the 22nd International Conference on Machine Learning, 2005.