Welcome to the Wikipedia Clustering Competition:

Competition data files: Here in zipped format.


For this competition you are given 11039 Wikipedia articles, and you need to cluster them into 4 clusters that accurately coincide with the 4 categories the articles belong to: "Actor", "Movie", "Math", or "Mathematician".

You are given information about the articles in two forms.

1. First, the full text of the articles is given in the file "raw_data.txt". This file has one line per article, and each line contains all the text of the corresponding article. To make your life easier, I have written code that extracts the lexicon from these articles; the extracted lexicon can be found in the file "lexicon.txt". I have also extracted the bag-of-words features for each article. The resulting data matrix, in scipy sparse matrix format, can be found in the file "data.npz". In Vocareum, you will find the commented-out code for the lexicon extraction and the bag-of-words feature extraction. On my personal laptop, lexicon extraction took about 45 minutes to an hour; bag-of-words feature extraction takes far less time. You are free to redo or edit my code for the bag-of-words feature extraction, or to write your own code for whatever feature extraction you might want to use. However, I suggest that, at least to begin with, you simply load the data matrix from the file "data.npz"; in the current Vocareum code this step is not commented out, and the matrix is automatically loaded for you into the variable X in sparse matrix format (look up scipy.sparse).
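If you prefer to work outside the notebook, the precomputed matrix can be loaded with scipy directly. This is a minimal sketch, assuming "data.npz" was written with scipy.sparse.save_npz (if it was saved another way, inspect the file with numpy.load instead); the helper name `load_bow` is mine, not part of the starter code:

```python
import scipy.sparse as sp

def load_bow(path="data.npz"):
    # Load the precomputed bag-of-words matrix: one row per article,
    # one column per lexicon term. Assumes the file was written with
    # scipy.sparse.save_npz.
    return sp.load_npz(path)

# In the competition environment this would be:
# X = load_bow("data.npz")   # X.shape[0] should be 11039
```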

2. Second, you are also given a directed graph in the file "graph.csv", which stores the graph as a matrix with 174309 rows and two columns. Each row lists one edge of the graph. For instance, the row "100, 200" means there is a hyperlink from the Wikipedia page of article #100 to the page of article #200.
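The edge list can be turned into a sparse adjacency matrix for graph-based methods. A sketch, assuming the CSV contains 0-based article indices (subtract 1 from both columns if the file turns out to be 1-based); the helper name `load_graph` is illustrative:

```python
import numpy as np
import scipy.sparse as sp

def load_graph(path="graph.csv", n=11039):
    # Each row of the CSV is "src, dst": a hyperlink from article src
    # to article dst. Build an n-by-n sparse 0/1 adjacency matrix A
    # with A[src, dst] = 1.
    edges = np.loadtxt(path, delimiter=",", dtype=int)
    data = np.ones(edges.shape[0])
    return sp.csr_matrix((data, (edges[:, 0], edges[:, 1])), shape=(n, n))
```

For clustering, you may want to symmetrize the result (e.g. `A + A.T`) since a hyperlink in either direction is evidence the two articles are related.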

Using the text and the hyperlink information among the Wikipedia articles, your goal is to cluster the articles into the 4 categories. You can use any library you need and write your own method. You can work in groups of size at most 4.

Your final prediction should be returned by your function in the variable "prediction", a matrix of size $11039 \times 1$ with entries being one of 0, 1, 2, or 3.
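As a simple baseline producing a prediction of the required shape, you could reduce the bag-of-words matrix and run k-means. This is a sketch, not the intended solution; it assumes scikit-learn is available, and the function name and parameter choices are mine:

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

def cluster_articles(X, n_clusters=4, n_components=100, seed=0):
    # Project the sparse bag-of-words matrix to a low-dimensional dense
    # space (LSA), then run k-means. Labels are arbitrary cluster ids.
    Z = TruncatedSVD(n_components=n_components, random_state=seed).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)
    return labels.reshape(-1, 1)  # shape (n_articles, 1), entries in {0, 1, 2, 3}

# prediction = cluster_articles(X)
```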

We will evaluate how well your clustering predicts the actual categories of the articles and report your accuracy as a percentage; higher is better.
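Since cluster ids are arbitrary, accuracy for a clustering is typically scored under the best matching of cluster labels to categories. A sketch of that kind of scoring (the grader's actual implementation may differ):

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(pred, truth, n_clusters=4):
    # Try every relabeling of the predicted clusters and keep the best
    # agreement with the true categories, reported as a percentage.
    pred = np.asarray(pred).ravel()
    truth = np.asarray(truth).ravel()
    best = 0.0
    for perm in permutations(range(n_clusters)):
        mapped = np.array([perm[p] for p in pred])
        best = max(best, float(np.mean(mapped == truth)))
    return 100.0 * best
```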

The competition is hosted on Vocareum. You will be able to get up to 90% accuracy using just the bag-of-words features and the graph. On Vocareum, we have set up two versions of the competition, and you can use both. The first version is a Jupyter notebook that has the starter code and the commented-out feature extraction code I mentioned. In the second, you simply submit your prediction file, and we will evaluate it and report your accuracy on a random sub-sample of points. We have provided the latter option because it is easier to use if you want to do offline processing on your end without being slowed down by the Vocareum server.

Practice Prelim