Supervised k-Means

This page hosts the supervised k-means clusterer as submitted to KDD. Because the paper is under review, the page is not yet linked from anywhere and is currently intended for reviewers only. Do not distribute this link. The code for the supervised k-means algorithm is in the following archive:

supervisedkmeans.tgz

The archive includes the license and a README describing the input data format. The datasets used in the paper are also available:

webkbn.tgz (768k) is the clusterings dataset for WebKB without link features.
webkbl.tgz (4400k) is the clusterings dataset for WebKB with link features.
news.tgz (5868k) is the clusterings dataset of news articles, grouped by which articles are about the same story.
synth.tgz (284k) is the clusterings dataset for the synthetic data.
timinggen.py (3k) is the program that generates the data for the timing experiments. The resulting datasets are multiple gigabytes, so it is easier to provide the generating script than the data itself.

Each of these archives contains a brief README explaining the data.

Thomas Finley, tomf@cs.cornell.edu