Computational Sustainability Social Networking Project

 

The Problem: 

We possess a set of seed papers written on topics in computational sustainability, and a corpus of papers in Computer Science (derived from the DBLP database) that contains the seed papers as a subset.  The problem is to develop a similarity measure between papers that will ultimately be used to a) find new papers "close" to the seed papers and b) build a computational sustainability citation network using the similarity measure. 

 

Available Techniques:

The fields of Natural Language Processing and Machine Learning provide three commonly used techniques for analyzing documents and producing measures of similarity:  Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA). 

 

LSA models documents as being generated by a bag-of-words model:  each word in the document is selected randomly from the available vocabulary according to some probability distribution.  LSA techniques focus on generating lower-rank approximations of the term-occurrence matrix (with one row for each vocabulary word and one column for each document in the corpus), which counts the occurrences of every term in each document.  These lower-rank approximations (typically generated using Singular Value Decomposition of the matrix) make the matrix more computationally tractable and eliminate noise in the sampling of documents.  Once generated, the lower-rank matrix provides a sampled distribution of the bag-of-words model for each document (in the form of the matrix columns).  Using standard measures from statistics (such as the KL divergence), document similarity can be measured as the difference between their bag-of-words distributions.    
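The LSA pipeline above can be sketched in a few lines. The toy term-occurrence matrix and vocabulary below are made up purely for illustration, and cosine similarity stands in for the divergence measure; only the truncated-SVD step itself is the technique the text describes.

```python
import numpy as np

# Toy term-occurrence matrix: one row per vocabulary word, one column
# per document.  The corpus and counts here are invented for illustration.
# vocabulary: ["disease", "epidemic", "solar", "energy"]
X = np.array([
    [3, 2, 0],   # disease
    [2, 3, 0],   # epidemic
    [0, 0, 4],   # solar
    [0, 1, 3],   # energy
], dtype=float)

# Lower-rank approximation via truncated Singular Value Decomposition.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each document is now a k-dimensional vector (its column of V scaled by
# the singular values); document similarity is measured in that space.
docs = (np.diag(s[:k]) @ Vt[:k, :]).T   # shape: (n_docs, k)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(docs[0], docs[1]))  # documents sharing vocabulary: near 1
print(cosine(docs[0], docs[2]))  # documents sharing no vocabulary: near 0
```

In practice the matrix is large and sparse, so a sparse truncated-SVD routine would replace the dense `np.linalg.svd` call, but the document representation and comparison work the same way.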

 

In contrast, pLSA and LDA model each document as a mixture of topics, with each topic represented as a bag-of-words model.  For example, a topic representing the field of epidemiology may have a high chance of generating words like epidemiology, disease, bacteria, epidemic, etc.  Each document in this model is then associated with a probability distribution that gives, for each topic, the probability of randomly selecting that topic.  For example, a document about controlling AIDS in Africa may have high probability of selecting topics like epidemiology, optimization, human health, and policy-making.  Documents are then generated one word at a time by 1) selecting a topic randomly according to the document's topic distribution and 2) selecting the word according to that topic's bag-of-words model. 
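The two-step generative process can be made concrete with a small simulation. The topics, word probabilities, and topic mixture below are invented for illustration; in the real models these are the unknowns to be inferred, not given.

```python
import random

random.seed(0)

# Illustrative (made-up) topics: each topic is a bag-of-words distribution.
topics = {
    "epidemiology": {"disease": 0.4, "epidemic": 0.3, "bacteria": 0.3},
    "optimization": {"objective": 0.5, "constraint": 0.3, "solver": 0.2},
}

# A made-up topic distribution for one document, e.g. a paper on
# controlling AIDS in Africa that leans toward epidemiology.
doc_topic_dist = {"epidemiology": 0.7, "optimization": 0.3}

def sample(dist):
    """Draw one item from a {item: probability} distribution."""
    r, total = random.random(), 0.0
    for item, p in dist.items():
        total += p
        if r < total:
            return item
    return item  # guard against floating-point rounding

def generate_document(n_words):
    words = []
    for _ in range(n_words):
        topic = sample(doc_topic_dist)        # 1) pick a topic
        words.append(sample(topics[topic]))   # 2) pick a word from it
    return words

print(generate_document(10))
```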

 

The task for both pLSA and LDA is, given a corpus's term-occurrence matrix, to infer the set of topics, the bag-of-words model for each topic, and the topic distribution for each document.  Having inferred the topic distribution for each document, we can compute a similarity measure between documents as the difference between their topic distributions. 
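One way to compare inferred topic distributions is a divergence measure; the sketch below uses the Jensen-Shannon divergence, a symmetric and bounded variant of the KL divergence mentioned earlier. The three distributions are made-up stand-ins for pLSA/LDA output.

```python
import math

# Hypothetical inferred topic distributions for three documents
# (numbers invented; in practice they come from pLSA/LDA inference).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, bounded KL."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Lower divergence means more similar topic mixtures.
print(js(doc_a, doc_b))  # small: similar mixtures
print(js(doc_a, doc_c))  # large: different mixtures
```

Plain KL would also work but is asymmetric and undefined when a topic has zero probability in the second argument, which is why a symmetrized variant is often preferred for document similarity.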

 

In practice, LDA has achieved better results than pLSA (while both tend to provide better results than LSA).  These results are attributed to the fact that LDA assumes a document's topic distribution has a Dirichlet prior, and that this simplifying assumption improves the Bayesian inference process.
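What the Dirichlet prior assumption means can be seen by sampling from it directly. The number of topics and the concentration parameter alpha below are arbitrary illustrative values; a small alpha yields documents dominated by a few topics, which matches the intuition that most papers draw on a handful of fields.

```python
import numpy as np

rng = np.random.default_rng(0)

# Under LDA's assumption, each document's topic distribution is itself
# drawn from a Dirichlet prior.  alpha (an illustrative value here)
# controls how concentrated the mixtures are: alpha < 1 favors
# documents dominated by a few topics.
n_topics = 4
alpha = 0.1

theta = rng.dirichlet([alpha] * n_topics, size=3)  # 3 sample documents
print(theta)                 # each row is one document's topic mixture
print(theta.sum(axis=1))     # each row sums to 1
```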

 

Another advantage of LDA is that it allows us to interpret each topic by looking at the words with a high probability of selection in that topic.  This may help reveal relevant key phrases in the field of computational sustainability that could be used for further analysis.  In addition, since the documents in our corpus will contain author names, topics may also reveal high-profile authors in the field. 

 

Implementations:

We are working with two natural language processing / machine learning packages:  LingPipe, a Java package that runs sequentially, and Mahout, a machine learning library developed on top of the Hadoop cloud-computing platform.  Both provide implementations of LDA (LingPipe uses Gibbs sampling to perform inference, while Mahout uses expectation-maximization).  Both packages provide additional tools we may use to analyze the corpus in the future. 

 

Project Accomplishments: 

To date, the following has been accomplished:

1.     Jason has obtained and parsed the DBLP corpus to produce input for LDA.  For each paper, a "document" is produced that contains the paper's title, authors, abstract, and reference string.  It should be noted that some of these fields (particularly the abstract and reference strings) are missing for some papers due to incompleteness in DBLP. 

2.     Karan has gotten LingPipe's LDA implementation running on the DBLP corpus, and Kiyan has done the same with Mahout's. 

 

Work In Progress:

1.     Bistra and Theo are developing an algorithm that uses the LDA similarity measure to construct a citation network.  Theo has provided this paper as a starting point: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.4376&rep=rep1&type=pdf

2.     Karan and Kiyan are tuning parameters for LDA on the corpus.  In particular, the number of topics must be provided as a fixed parameter to the algorithm:  a higher number of topics may increase accuracy but makes inference more computationally expensive. 

3.     Karan is exploring alternatives to running LDA on the entire corpus (instead restricting LDA to just the seed papers).  Such techniques are explored in this paper he has found: http://perso.telecom-paristech.fr/~cappe/papers/08-conll.pdf

4.     Kiyan is working on code to extract results for individual documents from Mahout, so a similarity measure can be computed. 

Some References:

Wikipedia Articles on LSA, pLSA, and LDA

http://en.wikipedia.org/wiki/Latent_semantic_analysis

http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

 

First paper on LDA: 

http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

 

Paper that explores building citation networks on DBLP:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.4376&rep=rep1&type=pdf

 

Alternative LDA techniques:

http://perso.telecom-paristech.fr/~cappe/papers/08-conll.pdf

 

LingPipe's tutorial on LDA: 

http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html

 

Mahout's tutorial on LDA:

https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation