Surprises in Topic Model Estimation and New Wasserstein Document-Distance Calculations (via Zoom)

Abstract: Topic models have been and continue to be an important modeling tool for an ensemble of independent multinomial samples with shared commonality. Although applications of topic models span many disciplines, the jargon used to define them stems from text analysis. In keeping with the standard terminology, one has access to a corpus of n independent documents, each utilizing words from a given dictionary of size p. One draws N words from each document and records their respective counts, thereby representing the corpus as a collection of n samples from independent, p-dimensional, multinomial distributions, each having a different, document-specific, true word probability vector. The topic model assumption is that each such vector is a mixture of K discrete distributions that are common to the corpus, with document-specific mixture weights. The corpus is assumed to cover K topics that are not directly observable, and each of the K mixture components corresponds to the conditional probabilities of words, given a topic. The vector of the K mixture weights, per document, is viewed as a document-specific topic distribution T, and is thus expected to be sparse, as most documents will only cover a few of the K topics of the corpus. Despite the large body of work on learning topic models, the estimation of sparse topic distributions, of unknown sparsity, especially when the mixture components are not known and are estimated from the same corpus, is not well understood and will be the focus of this talk. We provide estimators of T, with sharp theoretical guarantees, valid in many practically relevant situations, including the scenario p >> N (short documents, sparse data) and unknown K. Moreover, the results are valid when dimensions p and K are allowed to grow with the sample sizes N and n.
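
The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names A (a p x K matrix whose columns are the word distributions given each topic), Pi (the p x n matrix of document-specific word probability vectors), X (the count matrix), the function name simulate_corpus, and the Dirichlet draws used to generate A and the nonzero weights are all choices made here and are not taken from the talk.

```python
import numpy as np

def simulate_corpus(n, p, K, N, sparsity=2, seed=0):
    """Draw a toy corpus from a topic model (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    # Column k of A is the conditional distribution of the p words given topic k.
    A = rng.dirichlet(np.ones(p), size=K).T            # shape (p, K)
    # Column i of T holds the sparse topic weights of document i.
    T = np.zeros((K, n))
    for i in range(n):
        support = rng.choice(K, size=sparsity, replace=False)
        T[support, i] = rng.dirichlet(np.ones(sparsity))
    # Each document's word probability vector is a mixture of the columns of A.
    Pi = A @ T                                         # shape (p, n)
    Pi = Pi / Pi.sum(axis=0, keepdims=True)            # guard against rounding
    # Draw N words per document and record the word counts.
    X = np.stack([rng.multinomial(N, Pi[:, i]) for i in range(n)], axis=1)
    return A, T, Pi, X

# A short-document regime: dictionary size p much larger than document length N.
A, T, Pi, X = simulate_corpus(n=5, p=50, K=4, N=30)
```

With p = 50 and N = 30, most entries of each column of X are zero, which is the sparse-data regime (p >> N) mentioned in the abstract.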

When the mixture components are known, we propose maximum likelihood estimation (MLE) of the sparse vector T, the analysis of which has been open until now. The surprising result, and a remarkable property of the MLE in these models, is that, under appropriate conditions, and without further regularization, it can be exactly sparse, and contain the true zero pattern of the target. When the mixture components are not known, we exhibit computationally fast and rate-optimal estimators for them, and propose a quasi-MLE estimator of T, shown to retain the properties of the MLE. Our sharp, finite-sample rate analyses of the MLE and quasi-MLE have a practical implication: having short documents can be compensated for, in terms of estimation precision, by having a large corpus. Our main application is to the estimation of 1-Wasserstein distances between document-generating distributions. We propose, estimate and analyze new 1-Wasserstein distances between alternative probabilistic document representations, at the word and topic level, respectively. The effectiveness of the proposed 1-Wasserstein distances, and their contrast with the more commonly used WMD between empirical frequency estimates, is illustrated by an analysis of an IMDB movie reviews data set.
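
To make the known-components case concrete, here is a minimal sketch of computing the multinomial MLE of one document's topic weights by the standard EM iteration for mixture proportions. The abstract does not say which algorithm the speaker uses, so the EM scheme, the function name mle_topic_weights, and the matrix name A (p x K, columns are the known word distributions given each topic) are assumptions made for illustration.

```python
import numpy as np

def mle_topic_weights(x, A, n_iter=500, tol=1e-12):
    """EM iteration for the MLE of mixture weights T with known components A.

    x : length-p vector of word counts for one document.
    A : (p, K) matrix; column k is the word distribution given topic k.
    Assumes every observed word has positive probability under some topic.
    """
    x = np.asarray(x, dtype=float)
    A = np.asarray(A, dtype=float)
    freq = x / x.sum()
    # Words with zero count do not enter the log-likelihood, so drop them.
    seen = freq > 0
    A_seen, f_seen = A[seen], freq[seen]
    K = A.shape[1]
    T = np.full(K, 1.0 / K)                 # start from uniform weights
    for _ in range(n_iter):
        fitted = A_seen @ T                 # model probability of each seen word
        T_new = T * (A_seen.T @ (f_seen / fitted))
        converged = np.abs(T_new - T).sum() < tol
        T = T_new
        if converged:
            break
    return T

# Toy example: three topics with disjoint pairs of words; the document
# contains no word from the third topic.
A = np.array([[0.5, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.5]])
x = np.array([3, 3, 2, 2, 0, 0])
T_hat = mle_topic_weights(x, A)
```

In this toy example the estimate places no mass on the third topic, although no sparsity penalty was used. It only illustrates the phenomenon in a trivial setting and says nothing about the conditions under which the talk's result holds.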

Bio: Florentina is a faculty member of the Department of Statistics and Data Science, a member of the Center for Applied Mathematics (CAM) and a member of the Machine Learning Group in CIS. As a member of the CIS Diversity and Inclusion Council, she is committed to promoting the diversity of the workforce in data-science disciplines.

Her research is broadly centered on statistical machine learning theory and high-dimensional statistical inference. Florentina is interested in developing new methodology accompanied by sharp theory for solving a variety of problems in data science. Recent research projects include high-dimensional latent-space clustering, cluster-based inference, network modeling, inference in high dimensional models with hidden latent structure and topic models. Florentina continues to be interested in the general areas of model selection, sparsity and dimension reduction in high dimensions, and in applications to genetics, systems immunology, neuroscience, sociology, among other disciplines.

Her research is funded in part by the National Science Foundation (NSF-DMS). She is a Fellow of the Institute of Mathematical Statistics (IMS). Florentina has served or is currently serving as an Associate Editor for a number of journals (the Annals of Statistics, Bernoulli, JASA, JRSS-B, EJS, the Annals of Applied Statistics).