Sample Results from a Burst Detection Algorithm

This page has links to sample results from the burst detection algorithm described in the paper. The high-level idea is to analyze a stream of documents and find features whose behavior is `bursty': they occur with high intensity over a limited period of time. The analysis uses a probabilistic automaton whose states correspond to the frequencies of individual words. For each word separately, one imagines a copy of the automaton generating occurrences of the word, and computes its most likely state sequence over the course of the stream. State transitions correspond to points in time around which the frequency of the word changes significantly -- that is, to the beginning or end of a `burst' in the usage of the word. Full details are given in the paper cited above.
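To make the idea of a "most likely state sequence" concrete, here is a minimal sketch of a two-state version of such an automaton, run over batches of documents (for example, one batch per year). It is an illustration under stated assumptions, not the implementation behind the results on this page: the restriction to two states, the binomial fitting cost, the parameter names `s` and `gamma`, and the `gamma * ln(n)` charge for entering the burst state are all choices made for this sketch.

```python
import math

def burst_states(r, d, s=2.0, gamma=1.0):
    """Return the cheapest state sequence (0 = baseline, 1 = burst) for one word.

    r[t] is the number of documents in batch t that contain the word,
    d[t] is the total number of documents in batch t (so r[t] <= d[t]).
    s and gamma are hypothetical tuning parameters of this sketch.
    """
    n = len(r)
    if n == 0 or sum(r) == 0:
        return [0] * n                      # word never occurs: no bursts

    eps = 1e-9
    clamp = lambda p: min(max(p, eps), 1.0 - eps)
    p0 = clamp(sum(r) / sum(d))             # baseline rate over the whole stream
    p1 = clamp(s * p0)                      # elevated rate in the burst state
    up_cost = gamma * math.log(n)           # assumed cost of entering a burst

    def fit_cost(p, rt, dt):
        # Negative binomial log-likelihood; the binomial coefficient is
        # dropped because it is the same for both states.
        return -(rt * math.log(p) + (dt - rt) * math.log(1.0 - p))

    cost = [0.0, math.inf]                  # the stream starts in the baseline state
    back = []                               # back[t][j] = best predecessor of state j
    for rt, dt in zip(r, d):
        new_cost, ptr = [], []
        for j, p in enumerate((p0, p1)):
            # Only upward transitions are charged; falling back is free.
            cands = [cost[i] + (up_cost if j > i else 0.0) for i in (0, 1)]
            best = 0 if cands[0] <= cands[1] else 1
            new_cost.append(cands[best] + fit_cost(p, rt, dt))
            ptr.append(best)
        back.append(ptr)
        cost = new_cost

    # Trace the cheapest path backwards from the last batch.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for ptr in reversed(back):
        states.append(state)
        state = ptr[state]
    states.reverse()
    return states

# Toy counts (not real data): the word is rare, then frequent, then rare again.
print(burst_states([1, 1, 1, 9, 9, 9, 1, 1, 1], [10] * 9))
# prints [0, 0, 0, 1, 1, 1, 0, 0, 0]
```

In this sketch a larger `gamma` makes the automaton more reluctant to enter the burst state, so only more sustained rises in frequency are reported.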

The upshot of this analysis is a ranked list of the most significant word bursts in the document stream, together with the intervals of time in which they occurred. This can serve as a means of identifying topics or concepts that rose to prominence over the course of the stream, were discussed actively for a period of time, and then faded away again.
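As a rough illustration of how a ranked list might be assembled, the sketch below turns a per-word state sequence into burst intervals and then sorts the bursts of all words by weight. The function names, the per-step `gains` input (a hypothetical score for how much better the burst state fits at each time step), and the use of the summed gain as the ranking weight are assumptions of this sketch rather than details taken from this page.

```python
def burst_intervals(states, gains):
    """Return (start, end, weight) for each maximal run of state 1.

    states[t] is 0 (baseline) or 1 (burst); gains[t] is a hypothetical
    per-step score. start and end are inclusive indices into the stream.
    """
    intervals = []
    start = None
    for t, s in enumerate(states):
        if s == 1 and start is None:
            start = t                                   # a burst begins
        elif s == 0 and start is not None:
            intervals.append((start, t - 1, sum(gains[start:t])))
            start = None                                # the burst ended at t - 1
    if start is not None:                               # burst runs to the end
        intervals.append((start, len(states) - 1, sum(gains[start:])))
    return intervals

def rank_bursts(per_word):
    """Flatten {word: [(start, end, weight), ...]} and sort by weight, heaviest first."""
    ranked = [(w, s, e, wt) for w, ivs in per_word.items() for (s, e, wt) in ivs]
    ranked.sort(key=lambda b: b[3], reverse=True)
    return ranked

# Toy values, made up purely for illustration.
print(burst_intervals([0, 1, 1, 0, 1], [0.5, 2.0, 3.0, 0.1, 1.0]))
# prints [(1, 2, 5.0), (4, 4, 1.0)]
```

Each entry of the ranked list then pairs a word with the interval in which it was bursty, which is the form of output described above.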

It is important to note that in all the examples below, essentially no pre-processing is performed on the documents (we simply eliminate all non-letter characters and convert all words to lower case). In particular, burst analysis is performed for all words (including stop-words such as `the').
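The pre-processing step is simple enough to sketch in a few lines. The version below is one reading of it, with two assumptions labeled here and in the code: "letter" is taken to mean an ASCII letter, and non-letter characters are deleted outright rather than replaced by spaces (so a word containing an apostrophe or hyphen collapses into a single token). The function name `tokenize` is hypothetical.

```python
import re

def tokenize(text):
    """Split text into lower-case words, keeping stop-words such as 'the'."""
    # Assumption: delete everything that is not an ASCII letter or whitespace.
    letters_only = re.sub(r"[^A-Za-z\s]", "", text)
    return letters_only.lower().split()

print(tokenize("The State of the Union, 1790-2002!"))
# prints ['the', 'state', 'of', 'the', 'union']
```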

Presidential State of the Union Addresses

A basic first example comes from the set of all U.S. Presidential State of the Union Addresses, 1790-2002. (For many years the addresses were given as written messages rather than speeches, though the overall formats were comparable.) One finds that many of the automatically extracted bursts seem to correspond naturally to national events and issues, while others appear to arise from a repeated emphasis of certain key words over a period of years.

Computer Science Conference Paper Titles

A second set of examples comes from the titles of papers from computer science conferences over the past few decades. Four areas are considered separately: theory, databases, networking, and AI. In this context, automatically extracted bursty terms suggest research topics that were in fashion for a period of time.