Sample Results from a Burst Detection Algorithm
This page has links to sample results from the burst
detection algorithm described in the paper
The high-level idea is to analyze a stream of documents
and find features whose behavior is `bursty': they occur
with high intensity over a limited period of time.
The analysis uses a probabilistic automaton whose
states correspond to the frequencies of individual words.
For each word separately, one imagines a copy of
the automaton generating occurrences of the word, and computes its
most likely state sequence over the course of the stream.
State transitions correspond to points in time around which
the frequency of the word changes significantly --
that is, to the beginning or end of a `burst'
in the usage of the word.
Full details are given in the paper above.
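As a rough illustration of the state-sequence computation described above, here is a minimal sketch under simplifying assumptions that are not taken from this page: only two states (a baseline rate and an elevated rate), time divided into discrete steps with binomially distributed counts, and a fixed cost for moving into the higher state. The function name `burst_states` and the parameters `scale` and `gamma` are hypothetical, and the automaton in the paper may differ in its details.

```python
import math

def burst_states(relevant, totals, scale=2.0, gamma=1.0):
    """Return the minimum-cost sequence of states (0 = baseline, 1 = burst).

    relevant[t] -- number of documents at step t that contain the word
    totals[t]   -- number of all documents at step t
    scale       -- assumed ratio between the burst rate and the baseline rate
    gamma       -- assumed multiplier on the cost of entering the burst state
    """
    n = len(relevant)
    base = sum(relevant) / sum(totals)
    if not 0.0 < base < 1.0:
        raise ValueError("the word must occur in some, but not all, documents")
    rates = [base, min(base * scale, 0.9999)]
    up_cost = gamma * math.log(n)  # paid only on a 0 -> 1 transition

    def emit_cost(state, r, d):
        # Negative log-likelihood of r hits in d documents; the binomial
        # coefficient is identical for both states, so it is left out.
        p = rates[state]
        return -(r * math.log(p) + (d - r) * math.log(1.0 - p))

    inf = float("inf")
    prev_cost = [0.0, inf]  # the automaton is assumed to start in state 0
    back = []               # back[t][s] = best predecessor of state s at step t
    for r, d in zip(relevant, totals):
        cur_cost, pointers = [0.0, 0.0], [0, 0]
        for s in (0, 1):
            best_prev, best = 0, inf
            for q in (0, 1):
                c = prev_cost[q] + (up_cost if (q == 0 and s == 1) else 0.0)
                if c < best:
                    best_prev, best = q, c
            cur_cost[s] = best + emit_cost(s, r, d)
            pointers[s] = best_prev
        back.append(pointers)
        prev_cost = cur_cost

    # Trace the cheapest path backwards from the final step.
    state = 0 if prev_cost[0] <= prev_cost[1] else 1
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    states.reverse()
    return states

counts = [1, 1, 1, 9, 9, 9, 1, 1, 1]
totals = [10] * 9
print(burst_states(counts, totals))  # [0, 0, 0, 1, 1, 1, 0, 0, 0]
```

In this toy input the word appears in 9 of 10 documents during the middle three steps, so the cheapest path enters state 1 there and leaves it again afterwards; the two transitions mark the beginning and end of the burst.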
The upshot of this analysis is a ranked list of the most
significant word bursts in the document stream, together
with the intervals of time in which they occurred.
This can serve as a means of identifying topics or concepts
that rose to prominence over the course of the stream,
were discussed actively for a period of time, and then faded away again.
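To make the shape of such a ranked list concrete, the following sketch assumes that each word's state sequence is already available as a list of 0/1 labels (1 meaning "in a burst"). The helper names `burst_intervals` and `rank_bursts` are hypothetical, and the weight used for ranking (the number of occurrences inside the interval) is only a stand-in; it is not the significance measure defined in the paper.

```python
def burst_intervals(states):
    """Return inclusive (start, end) index pairs for each maximal run of 1s."""
    intervals = []
    start = None
    for t, s in enumerate(states):
        if s == 1 and start is None:
            start = t
        elif s != 1 and start is not None:
            intervals.append((start, t - 1))
            start = None
    if start is not None:
        intervals.append((start, len(states) - 1))
    return intervals

def rank_bursts(states_by_word, counts_by_word):
    """Rank bursts across all words by a placeholder weight (hypothetical)."""
    ranked = []
    for word, states in states_by_word.items():
        counts = counts_by_word[word]
        for start, end in burst_intervals(states):
            weight = sum(counts[start:end + 1])  # stand-in for significance
            ranked.append((weight, word, start, end))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked

states_by_word = {"war": [0, 1, 1, 0], "the": [0, 0, 0, 0], "gold": [1, 0, 0, 1]}
counts_by_word = {"war": [1, 8, 9, 1], "the": [50, 50, 50, 50], "gold": [5, 0, 0, 3]}
for weight, word, start, end in rank_bursts(states_by_word, counts_by_word):
    print(word, (start, end), weight)
```

Note that a word such as "the" contributes nothing here despite its high counts, because its state sequence never leaves the baseline state.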
In all the examples below, essentially no pre-processing is performed on the documents
(we simply eliminate all non-letter characters and convert all words to lower case).
In particular, burst analysis is performed for all words
(including stop-words such as `the').
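The pre-processing described above is simple enough to sketch in a few lines. This version makes two assumptions that the page does not state: words are separated by whitespace, and "letter" means an ASCII letter. The function name `tokenize` is hypothetical.

```python
import re

def tokenize(text):
    """Split on whitespace, delete non-letter characters, and lower-case."""
    words = []
    for raw in text.split():
        word = re.sub(r"[^A-Za-z]", "", raw).lower()
        if word:  # tokens made up only of digits or punctuation vanish
            words.append(word)
    return words

print(tokenize("The State of the Union, 1790-2002!"))
# ['the', 'state', 'of', 'the', 'union']
```

Stop-words such as "the" are deliberately kept, matching the statement that burst analysis is performed for all words.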
Presidential State of the Union Addresses
A basic first example comes from the set of all
U.S. Presidential State of the Union Addresses, 1790-2002.
(For many years the addresses were given as
written messages rather than speeches, though
the overall formats were comparable.)
One finds that many of the automatically extracted
bursts seem to correspond naturally
to national events and issues, while others appear to arise
from a repeated emphasis of certain key words over a period of years.
Computer Science Conference Paper Titles
A second set of examples comes from the titles of papers
from computer science conferences over the past few decades.
Four areas are considered separately:
theory, databases, networking, and AI.
In this context, automatically extracted bursty terms
suggest research topics that were in fashion for a period of time.