The people currently involved in this project are: Rich Caruana, Nam Nguyen, Casey Smith in the Computer Science
Department at Cornell
The Meta Clustering Project is funded by
NSF under CAREER Award # 0347318 (granted 2004).
Meta clustering is a new approach
to the problem of clustering. Meta clustering
aims at creating a new mode of interaction between users, the clustering
system, and the data. Rather than finding one optimal clustering of the data,
meta clustering finds many alternate good clusterings of the data and allows
the user to select which of these clusterings is most useful, exploring the
space of reasonable clusterings. To prevent the user from having to evaluate
too many clusterings, the many base-level clusterings are organized into a meta
clustering, a clustering of clusterings that groups similar base-level
clusterings together. This meta clustering makes it easier for users to
evaluate the clusterings and efficiently navigate to the clustering(s) useful
for their purpose.
clustering can be divided into three fundamental steps:
- Generate many good, yet
qualitatively different, base-level clusterings of the same data.
(This is a very interesting problem all by itself!)
- Measure the similarity
between the base-level clusterings generated in the first step so that similar
clusterings can be grouped together.
- Organize the base-level
clusterings at a meta level (either by clustering or by low-dimension
projection) and present them to users.
generates clusterings with pre-specified number of clusters and sensitive to
generate clusterings with must-link and cannot-link constraints.
Feature Weighting: generates diverse clusterings by applied weights on the
Zipf Generator: generates
random numbers according the Zipf distribution with a specified shape
diverse clusterings by incorporating clustering distance into objective
computes the Rand index and Variational
Information (VI) distance between clusterings.
Warning: This code is under development. We make no guarantees
that it is yet suitable for external use. If you try the code, please let
us know if it was useful, and if you find any bugs or improvements. If
you email us, we'll be happy to let you know when we update the code or find
and fix bugs. Code and Tutorial
We have been using the following data sets in most of our meta clustering experiments.
The Bergmark data set was created by Donna Bergmark at Cornell University.
The Protein data set was created by Rich Caruana, Paul Hodor, and John
Rosenberg at CMU and the University
Email us if you want to use the Bergmark or Protein data sets. The others
are already available on the web.
- The Australia Coastal data is
a subset of the data available from the Biogeoinformatics of Hexacorals
environmental database, http://www.kgs.ku.edu/Hexacoral/.
- The Bergmark data was
collected from 25 focused web crawls, each with different keywords.
- The Covertype
data is from the UCI Machine Learning Repository. It contains cartographic
variables sampled at 30x30 meter grid cells in wilderness areas.
- The Letter
data is the isolet spoken letter data set from the UCI Machine Learning
- The Phoneme data is the from
the UCI Machine Learning Repository. It records the 11 phonemes of 15
- The Protein data is the
pairwise similarities between 639 proteins. It was created by
crystallographers developing techniques to learn relationships between
protein sequence and structure.
- The LetterFont
data set contains 100 images of 10 different letters and 10 different
- The Yale
Face data set contains 10x9x8 = 720 images of 10 different persons
under 9 poses and 8 illumination conditions, http://cvc.yale.edu/projects/yalefacesB/yalefacesB.html.
Papers and Tech Reports:
Nam Nguyen, Rich Caruana, Consensus Clustering, to appear in the
International Conference on Data Mining, 2007.
Nguyen, Rich Caruana, Generating
Diverse Clusterings, in submission to Neural Information Processing Systems
Rich Caruana, Mohamed
Nguyen, and Casey Smith, Meta
Clustering, Proceedings of the International Conference on Data Mining,
Rich Caruana, Pedro Artigas, Ann Goldenberg and Anton
Clustering, Technical Report TR2002-1884, Cornell University.
This is an old, unpublished tech report that describes our first attempts at
Weighting examples. Each row visualizes a Zipf distribution with a
different shape parameter, α. Each row has 50 bars representing 50
random samples from the distribution, with the height of the bar
proportional to the value of the sample. We use a Zipf power law
distribution because there is empirical evidence that feature importance
is Zipf-distributed in a number of real-world problems [1,2]. A Zipf
distribution describes a range of integer values from 1 to some maximum
value K. The frequency of each integer is proportional to 1/iα
where i is the integer value and α is the shape parameter. Thus, for α=0,
the Zipf distribution becomes a uniform distribution from 1 to K. As α
increases, the distribution becomes more biased toward smaller numbers,
with only the occasional value approaching K. In the figure, random values
from a Zipf distribution can be generated in the manner of .
Clusterings that would be of
most interest to users (best accuracy) often are not very compact
dataset. The x-axis is compactness which measures the average pairwise
distances between points in the same clusters. In many traditional
clustering algorithms such as Kmeans and agglomerative clustering, the
optimization criterion is closely related to this measure of compactness.
The y-axis is accuracy which is measured relative to the auxiliary labels
for each data set. The auxiliary labels are only a proxy for what a
specific user might want in a particular application. Each point in the
figure represents a full clustering. Since there exist reasonable
clusterings (clusterings locating in the top-right of the figure) which
are more accurate than the most compact clustering (the left-most
clustering), we conclude that the usual method of searching for the most
compact clustering can be counterproductive. It is important to explore
the space of reasonable clusterings more thoroughly.
among Kmeans, Kmeans Feature Weighting, and Diverse-Kmeans. In the figure,
we compare the performance of Kmeans, Kmeans Feature Weighting, and
Diverse-Kmeans in two difference performance metrics for the Phoneme data
set. We observe that the Kmeans Feature Weighting explores a larger space of clusterings than the
Kmeans. Moreover, the Diverse-Kmeans method is able to find the most
interesting clusterings in both performance metrics. This result is consistent
for other data sets that we experiment on. Diverse-Kmeans is able to find
many quality clusterings that are missed by traditional clustering
- S. Cohen, F. Dror, and
E. Ruppin. A feature selection method based on the Shapley value. In Proceedings
of the International Joint Conference on Artificial Intelligence, 2005.
- C. Furlanello, M.
Serafini, S. Merler, and G. Jurman.
Entropy-based gene ranking without selection bias for the predictive
classification of microarray data. BMC Bioinformatics, 4:54, 2003.
- K. Christensen,
Statistical and Measurement Tools.
If you know of other papers or links we should add to this collection,
please let us know.
Fern, X. Z. and Brodley, C. E., Random
projection for high dimensional data clustering: A cluster ensemble approach.
Proceedings of the 20th International Conference on Machine Learning, 2003.
Fred, A. L. and Jain, A. K., Data
clustering using evidence accumulation. Proceedings of the 16th
International Conference on Pattern Recognition, 2002.
A., Mannila, H. and Tsaparas,
aggregation. Proceedings of the 21st International Conference on Data
A. and Ghosh, J., Cluster
ensembles - a knowledge reuse framework for combining multiple partitions.
Journal of Machine Learning Research, 2002.
A., Jain, A. K. and Punch, W., Combining
multiple weak clusterings. Proceedings IEEE International Conference on
Data Mining, 2003.
A., Jain, A. K. and Punch, W., A
mixture model for clustering ensembles. Proceedings SIAM International Conference on
Data Mining, 2004.
Clustering with Background Knowledge:
Caruana R. and McCallum A., Semi-supervised
clustering with user feedback, Technical Report TR2003-1892, Cornell University.
Wagstaff K., Cardie C., Rogers S. and Schroedl
k-means clustering with background knowledge, Proceedings
of the Eighteenth International Conference on Machine Learning, 2001.
S., Bilenko M. and Mooney R.J., A
probabilistic framework for semi-supervised clustering, Proceedings of the tenth ACM SIGKDD, 2004.
Multiple Alternative Clusterings:
Martin H. C. Law,
Alexander P. Topchy and Anil K. Jain, Multi-objective
Data Clustering, IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, 2004.
Eric Bae and
James Bailey, COALA
: A Novel Approach for the Extraction of an Alternate Clustering of High
Quality and High Dissimilarity, Proceedings of the IEEE International
Conference on Data Mining, 2006.
E. B. and Mallows, C. L., A method for
comparing two hierarchical clusterings. Journal of the American Statistical
Lawrence Hubert and Phipps
partitions, Journal of Classification, 1985.
Rand, W. M., Objective
criteria for the evaluation of clustering methods. Journal of the American
Statistical Association, 1971.
Marina Meila, Comparing
clusterings, UW Statistics Technical Report 418, 2003.
Marina Meila, Comparing
Clusterings - An Axiomatic View, Proceedings of the 22nd International
Conference on Machine Learning, 2005.