The people currently involved in this project are Rich Caruana, Nam Nguyen, and Casey Smith in the Computer Science Department at
The
Project Description:
Source Code:
Kmeans: generates clusterings with a pre-specified number of clusters; the result is sensitive to initialization.
Constraint Kmeans: generates clusterings that satisfy must-link and cannot-link constraints.
Kmeans Feature Weighting: generates diverse clusterings by applying weights to the features.
Zipf Generator: generates random numbers according to the Zipf distribution with a specified shape parameter.
Diverse-Kmeans: generates diverse clusterings by incorporating clustering distance into the objective function.
Clustering Distance:
computes the
Warning: This code is under development. We make no guarantees that it is yet suitable for external use. If you try the code, please let us know whether it was useful and whether you found any bugs or have improvements to suggest. If you email us, we'll be happy to let you know when we update the code or find and fix bugs.
Code and Tutorial
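To illustrate how feature weighting can yield diverse clusterings, here is a minimal sketch in Python. It is not the project's code. It assumes that feature weights are drawn from a truncated Zipf distribution and applied by multiplying each feature by its weight before running plain k-means; the list above does not state how the two tools are combined. All function names and the `max_rank` cutoff are hypothetical.

```python
import random


def zipf_weights(n_features, shape, max_rank=100, rng=None):
    """Draw one integer weight per feature from a truncated Zipf distribution.

    P(weight = r) is proportional to r ** (-shape) for r in 1..max_rank.
    The truncation at max_rank is an assumption made to keep the sketch simple.
    """
    rng = rng or random.Random()
    ranks = list(range(1, max_rank + 1))
    probs = [r ** (-shape) for r in ranks]
    return rng.choices(ranks, weights=probs, k=n_features)


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, n_iter=100):
    """Plain Lloyd-style k-means; returns one cluster label per point."""
    # Initial centers are k distinct points chosen at random,
    # which is why the result depends on initialization.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def diverse_clusterings(points, k, n_clusterings, shape, seed=0):
    """Run k-means several times, each time under fresh random feature weights."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_clusterings):
        weights = zipf_weights(len(points[0]), shape, rng=rng)
        # Multiply each feature by its weight (assumed weighting scheme).
        scaled = [[w * x for w, x in zip(weights, p)] for p in points]
        results.append((weights, kmeans(scaled, k, rng)))
    return results


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
for weights, labels in diverse_clusterings(points, k=2, n_clusterings=3, shape=1.5):
    print(weights, labels)
```

A larger shape parameter concentrates the weights near 1, so the runs resemble ordinary k-means; a smaller one lets a few features dominate the distance, which tends to produce more varied clusterings.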
Data Sets:
We have been using the following data sets in most of our meta clustering experiments. The Bergmark data set was created by Donna Bergmark at
Papers and Tech Reports:
Rich Caruana,
Rich Caruana, Pedro Artigas, Ann Goldenberg and Anton Likhodedov, Meta Clustering, Technical Report TR2002-1884,
Interesting Results:
Related Works:
If you know of other papers or links we should add to this collection, please let us know.
Consensus Clustering:
Fern, X. Z. and Brodley, C. E., Random projection for high dimensional data clustering: A cluster ensemble approach. Proceedings of the 20th International Conference on Machine Learning, 2003.
Fred, A. L. and Jain, A. K., Data clustering using evidence accumulation. Proceedings of the 16th International Conference on Pattern Recognition, 2002.
Gionis, A., Mannila, H. and Tsaparas, P., Clustering aggregation. Proceedings of the 21st International Conference on Data Engineering, 2005.
Strehl, A. and Ghosh, J., Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 2002.
Topchy, A., Jain, A. K. and Punch, W., Combining multiple weak clusterings. Proceedings IEEE International Conference on Data Mining, 2003.
Topchy, A., Jain, A. K. and Punch, W., A mixture model for clustering ensembles. Proceedings
Clustering with Background Knowledge:
Cohn D., Caruana R. and McCallum A., Semi-supervised clustering with user feedback, Technical Report TR2003-1892,
Wagstaff K., Cardie C., Rogers S. and Schroedl S., Constrained k-means clustering with background knowledge, Proceedings of the Eighteenth International Conference on Machine Learning, 2001.
Basu S., Bilenko M. and Mooney R.J., A probabilistic framework for semi-supervised clustering, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
Multiple Alternative Clusterings:
Martin H. C. Law, Alexander P. Topchy and Anil K. Jain, Multi-objective Data Clustering, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
Eric Bae and James Bailey, COALA: A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity, Proceedings of the IEEE International Conference on Data Mining, 2006.
Clustering Distance:
Fowlkes, E. B. and Mallows, C. L., A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 1983.
Lawrence Hubert and Phipps Arabie, Comparing partitions, Journal of Classification, 1985.
Rand, W. M., Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 1971.
Marina Meila, Comparing clusterings, UW Statistics Technical Report 418, 2003.
Marina Meila, Comparing Clusterings - An Axiomatic View, Proceedings of the 22nd International Conference on Machine Learning, 2005.
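As a concrete example of the pair-counting idea behind Rand's criterion (Rand, 1971, listed above), the sketch below computes the fraction of point pairs on which two clusterings agree, and one minus that fraction as a distance. It is an illustration only, not the project's Clustering Distance code; the function names are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    A pair counts as an agreement if both clusterings put the two points
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("clusterings must label the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def rand_distance(labels_a, labels_b):
    """One minus the Rand index: 0 for identical partitions."""
    return 1.0 - rand_index(labels_a, labels_b)


# Relabeling the clusters does not change the partition, so the distance is 0.
print(rand_distance([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because the comparison uses only whether pairs are co-clustered, it is insensitive to how cluster labels are named, which is what makes it usable for comparing clusterings produced by independent runs.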