Discovering Natural Communities in Large Linked Networks

Research by Omar Khan and Brian Kulis

 

Advisors: John Hopcroft and Bart Selman

CS 490: Fall 2002

 

Abstract:

We are interested in tracking changes in large-scale data by periodically creating an agglomerative clustering and examining the evolution of clusters (communities) over time. We examine a large real-world data set: the linked network of over 250,000 papers underlying the NEC CiteSeer database. In order to track changes over time, we require a clustering algorithm that is relatively stable under small perturbations of the input data. Agglomerative clustering techniques are known to be unstable on data in which the community structure is not strong. Using a comparison to random graphs with the same degree structure, we obtain a quantitative measure of the strength of intellectual communities in the CiteSeer data. We have developed an efficient, scalable agglomerative algorithm and analyzed its stability. We found that some communities are essentially random and thus unstable while others are natural and will appear in essentially every clustering. These natural communities will enable us to track the evolution of communities over time, which is the ultimate goal of our research.

A paper describing our results has been submitted to KDD 2003 and the Arthur M. Sackler Colloquium.