Summary

Spectral methods are frequently used to partition networks into large clusters that approximately minimize the number of edges between clusters or maximize quality measures such as modularity. However, these methods are less useful for finding relatively small, tight-knit communities. In contrast, recent diffusion methods explore the neighborhood around a pre-defined set of “seed” nodes in a network in order to find small communities that are “good” in some sense (typically using a measure like conductance).

Our work interpolates between spectral methods and diffusion-based methods by searching for communities using unconverged eigenvector estimates based on Rayleigh-Ritz approximation from a Krylov subspace started with a distribution over a seed vector. These eigenvector-like objects reflect the “shape” of the intermediate dynamics of diffusion processes in a manner analogous to how the extremal eigenvectors of a transition matrix reflect the shape of the asymptotic dynamics of convergence to the equilibrium distribution.

Papers

Y. Li, K. He, D. Bindel, and J. Hopcroft, “Local Spectral Clustering for Overlapping Community Detection,” ACM Transactions on Knowledge Discovery from Data, vol. PP, no. 99, pp. 1–1, 2017.
@article{2017-tkdd,
  author = {Li, Yixuan and He, Kun and Bindel, David and Hopcroft, John},
  title = {Local Spectral Clustering for Overlapping Community Detection},
  journal = {ACM Transactions on Knowledge Discovery from Data},
  volume = {PP},
  number = {99},
  pages = {1--1},
  year = {2017}
}
P. Shi, K. He, D. Bindel, and J. Hopcroft, “Local Lanczos Spectral Approximation for Community Detection,” in Proceedings of ECML-PKDD, 2017.
@inproceedings{2017-ecml-pkdd,
  author = {Shi, Pan and He, Kun and Bindel, David and Hopcroft, John},
  title = {Local Lanczos Spectral Approximation for Community Detection},
  booktitle = {Proceedings of ECML-PKDD},
  year = {2017},
  month = sep
}

Abstract:

We propose a novel approach called the Local Lanczos Spectral Approximation (LLSA) for identifying all latent members of a local community from very few seed members. To reduce the computation complexity, we first apply a fast heat kernel diffusing to sample a comparatively small subgraph covering almost all possible community members around the seeds. Then starting from a normalized indicator vector of the seeds and by a few steps of Lanczos iteration on the sampled subgraph, a local eigenvector is gained for approximating the eigenvector of the transition matrix with the largest eigenvalue. Elements of this local eigenvector is a relaxed indicator for the affiliation probability of the corresponding nodes to the target community. We conduct extensive experiments on real-world datasets in various domains as well as synthetic datasets. Results show that the proposed method outperforms state-of-the-art local community detection algorithms. To the best of our knowledge, this is the first work to adapt the Lanczos method for local community detection, which is natural and potentially effective. Also, we did the first attempt of using heat kernel as a sampling method instead of detecting communities directly, which is proved empirically to be very efficient and effective.

K. He, P. Shi, J. Hopcroft, and D. Bindel, “Local Spectral Diffusion for Robust Community Detection,” in KDD Workshop on Mining and Learning with Graphs, 2016.
@inproceedings{2016-losp-kdd,
  author = {He, Kun and Shi, Pan and Hopcroft, John and Bindel, David},
  booktitle = {KDD Workshop on Mining and Learning with Graphs},
  title = {Local Spectral Diffusion for Robust Community Detection},
  month = aug,
  year = {2016}
}

Abstract:

We address a semi-supervised learning problem of identifying all latent members of a local community from very few labeled seed members in large networks. By a simple and efficient sampling method, we conduct a comparatively small subgraph encompassing most of the latent members such that the follow-up membership identification could focus on an accurate local region instead of the whole network. Then we look for a sparse vector, a relaxed indicator vector representing the subordinative probability of the corresponding nodes, that lies in a local spectral subspace defined by an order-$d$ Krylov subspace. The subspace serves as a local proxy for the invariant subspace spanned by leading eigenvectors of the Laplacian matrices. Based on Rayleigh quotients, we relate the local membership identification task as a local RatioCut or local normalized cut optimization problem, and provide some theoretical justifications.

We thoroughly explore different probability diffusion methods for the subspace definition and evaluate our method on four groups with a total of 28 representative LFR benchmark datasets, and eight public available real-world networks with labeled ground truth communities across multiple domains. Experimental results exhibit the effectiveness and robustness of the proposed algorithm, and the local spectral communities perform better than those from the celebrated Heat Kernel diffusion and the PageRank diffusion.

K. He, Y. Sun, D. Bindel, J. Hopcroft, and Y. Li, “Detecting Overlapping Communities from Local Spectral Subspaces,” in Proceedings of ICDM 2015, 2015.
@inproceedings{2015-icdm,
  author = {He, Kun and Sun, Yiwei and Bindel, David and Hopcroft, John and Li, Yixuan},
  title = {Detecting Overlapping Communities from
             Local Spectral Subspaces},
  booktitle = {Proceedings of ICDM 2015},
  month = nov,
  year = {2015},
  doi = {10.1109/ICDM.2015.89},
  arxiv = {1509.08065}
}

Abstract:

Based on the definition of local spectral subspace, we propose a novel approach called LOSP for local overlapping community detection. Using the power method for a few steps, LOSP finds an approximate invariant subspace, which depicts the embedding of the local neighborhood structure around the seeds of interest. LOSP then identifies the local community expanded from the given seeds by seeking a sparse indicator vector in the subspace where the seeds are in its support. We provide a systematic investigation on LOSP, and thoroughly evaluate it on large real world networks across multiple domains. With the prior information of very few seed members, LOSP can detect the remaining members of a target community with high accuracy. Experiments demonstrate that LOSP outperforms the Heat Kernel and PageRank diffusions. Using LOSP as a subroutine, we further address the problem of multiple membership identification, which aims to find all the communities a single vertex belongs to. High F1 scores are achieved in detecting multiple local communities with respect to arbitrary single seed for various large real world networks.

Y. Li, K. He, D. Bindel, and J. Hopcroft, “Uncovering the Small Community Structure in Large Networks: A Local Spectral Approach,” in Proceedings of WWW 2015, 2015.
@inproceedings{2015-www,
  author = {Li, Yixuan and He, Kun and Bindel, David and Hopcroft, John},
  title = {Uncovering the Small Community Structure in Large Networks:
             A Local Spectral Approach},
  booktitle = {Proceedings of WWW 2015},
  month = may,
  year = {2015},
  doi = {10.1145/2736277.2741676},
  arxiv = {1509.07715}
}

Abstract:

Large graphs arise in a number of contexts and understanding their structure and extracting information from them is an important research area. Early algorithms on mining communities have focused on the global structure, and often run in time functional to the size of the entire graph. Nowadays, as we often explore networks with billions of vertices and find communities of size hundreds, it is crucial to shift our attention from macroscopic structure to microscopic structure when dealing with large networks. A growing body of work has been adopting local expansion methods in order to identify the community from a few exemplary seed members. Very few approaches can systematically demonstrate both high efficiency and effectiveness that significantly stands out amongst the divergent approaches in finding communities.

In this paper, we propose a novel approach for finding overlapping communities called LEMON (Local Expansion via Minimum One Norm). Different from PageRank-like diffusion methods, LEMON finds the community by seeking a sparse vector in the span of the local spectra such that the seeds are in its support. We show that LEMON can achieve the highest detection accuracy among state-of-the-art proposals. The running time depends on the size of the community rather than that of the entire graph. The algorithm is easy to implement, and is highly parallelizable.

Moreover, given that networks are not all similar in nature, a comprehensive analysis on how the local expansion approach is suited for uncovering communities in different networks is still lacking. We thoroughly evaluate our approach using both synthetic and real-world datasets across different domains, and analyze the empirical variations when applying our method to inherently different networks in practice. In addition, the heuristics on how the quality and quantity of the seed set would affect the performance are provided.

Talks

Communities, Spectral Clustering, and Random Walks

MMDS
clustering communitiesmeeting external invited

Communities, Spectral Clustering, and Random Walks

University of Chicago Scientific Computing Seminar
clustering communitiesseminar external invited

Communities, Spectral Clustering, and Random Walks

Cornell Brownbag Seminar
clustering communitiesseminar local

Communities, Spectral Clustering, and Random Walks

Cornell SCAN Seminar
clustering communitiesseminar local