Topic modeling is a statistical approach to understanding the semantic structure of a collection of documents. The basic idea is to decompose documents (or images, playlists, …) into distributions over words (or pixels, songs, …) that reflect common, coherent, semantically meaningful “topics.” Standard statistical models for how documents are generated from topics, such as Latent Dirichlet Allocation, provide a high-quality decomposition, but may require significant computational resources for training. In contrast, recent “spectral” inference algorithms based on separability assumptions are fast and provably correct when the data actually comes from the assumed generating process.

But real data sets often show properties that are inconsistent with the assumptions of the generative models used by spectral inference methods. We explore the structure of matrices of co-occurrence statistics that are implied (in expectation) by a particular generative model for topic inference, and show that explicitly enforcing this structure in sample co-occurrence statistics allows us to significantly improve the quality of inferred topics.


M. Lee, D. Bindel, and D. Mimno, “Robust Spectral Inference for Joint Stochastic Matrix Factorization,” in Proceedings of NIPS 2015, 2015, pp. 2710–2718.
@inproceedings{lee2015jsmf,
  author    = {Lee, Moontae and Bindel, David and Mimno, David},
  title     = {Robust Spectral Inference for
               Joint Stochastic Matrix Factorization},
  booktitle = {Proceedings of NIPS 2015},
  pages     = {2710--2718},
  month     = dec,
  year      = {2015}
}


Spectral inference provides fast algorithms and provable optimality for latent topic analysis. But for real data these algorithms require additional ad hoc heuristics, and even then often produce unusable results. We explain this poor performance by casting the problem of topic inference in the framework of Joint Stochastic Matrix Factorization (JSMF) and showing that previous methods violate the theoretical conditions necessary for a good solution to exist. We then propose a novel rectification method that learns high-quality topics and their interactions even on small, noisy data. This method achieves results comparable to those of probabilistic techniques in several domains while maintaining scalability and provable optimality.
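To make the rectification idea concrete: in expectation, a word co-occurrence matrix generated by a topic model is joint-stochastic (nonnegative, entries summing to one) and positive semidefinite with rank at most the number of topics K. A minimal sketch of enforcing these properties on a noisy sample matrix via alternating projections is below; the function name, iteration count, and projection ordering are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def rectify(C, K, n_iter=50):
    """Sketch: alternately project a sample co-occurrence matrix C onto
    the structural sets a K-topic model implies in expectation:
      (1) PSD with rank at most K,
      (2) nonnegative entries,
      (3) entries summing to one (joint-stochastic normalization).
    """
    C = (C + C.T) / 2.0  # work with the symmetric part
    for _ in range(n_iter):
        # (1) Truncated eigendecomposition: keep the K largest
        # eigenvalues, clipped at zero, giving a PSD rank-<=K matrix.
        w, V = np.linalg.eigh(C)        # eigenvalues in ascending order
        w[:-K] = 0.0
        w = np.maximum(w, 0.0)
        C = (V * w) @ V.T
        # (2) Clip negative entries.
        C = np.maximum(C, 0.0)
        # (3) Euclidean projection onto {sum of entries = 1}:
        # shift all entries uniformly.
        C += (1.0 - C.sum()) / C.size
    return (C + C.T) / 2.0

# Toy usage: a low-rank joint-stochastic matrix corrupted by noise.
rng = np.random.default_rng(0)
B = rng.random((10, 3))
true_C = B @ B.T / (B @ B.T).sum()
noisy_C = true_C + 0.01 * rng.standard_normal((10, 10))
R = rectify(noisy_C, K=3)
```

After rectification, R satisfies the normalization constraint exactly (it is the last projection applied) and the remaining constraints approximately; as the projections converge, all three hold simultaneously.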