Mining Sequential Patterns
We are developing algorithms for finding frequent sequences in
transactional databases (SPAM) and for enhancing load value prediction (LVP).
SPAM combines efficient pruning and indexing techniques to enable the discovery
of frequent sequences even for very long patterns. The LVP technique discovers
frequent patterns in traces of a program's memory accesses. Based on the
discovered patterns efficient pre-fetching strategies can be developed.


SPAM is a new algorithm for finding all frequent sequences within a transactional database.
The algorithm is
especially efficient when the sequential patterns in the database are very long. A depth-first search strategy is used
to generate candidate sequences, and various pruning mechanisms are implemented to reduce the search space.
The transactional data is stored using a vertical bitmap representation, which allows for efficient support counting
as well as significant bitmap compression. A salient feature of our algorithm is that it incrementally outputs new
frequent itemsets in an online fashion.
In a thorough experimental evaluation using standard benchmark data, we isolate the effects of the
individual components of our algorithm. Our performance numbers show that our algorithm outperforms previous work by a
factor of 3 to over an order of magnitude.
Walkthrough of the algorithm
Obtaining SPAM
The full source code for SPAM is now available for download from Sourceforge.
People
Jay Ayres
Jason Flannick
Johannes Gehrke
Tomi Yiu
Publications
Jay Ayres, J. E. Gehrke, Tomi Yiu, and Jason Flannick. Sequential
PAttern Mining Using Bitmaps. In Proceedings of the
Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining. Edmonton, Alberta, Canada, July 2002.

Summary
The goal of this project is to develop enhanced load value prediction hardware.
To do this, we have developed a program that finds frequent patterns in traces
of memory load values. The trace files are a series of load number, load
value pairs, where load number is a value that represents a load
instruction, and load value is the value that was retrieved from memory.
Patterns with specified support are mined from the trace file. The algorithm we
use to mine the patterns is similar to the Apriori Algorithm as described in Fast
Algorithms for Mining Association Rules.
This project is the combined work of Jeff Hoy, a Cornell University Masters of
Engineering candidate, Prof.
Martin Burtscher from the Cornell Computer Systems Laboratory, and Prof.
Johannes Gehrke from the Cornell Database Group.
Implementation
The mining program has been made available to the research community under the
terms of the GNU General Public
License. Source code, details of the program implementation, documentation,
and examples follow.
Input Specification
Trace files consist of 32-bit integers. The first number in the file indicates
the range of the load numbers; 0 <= load number < first number. The
remainder of the file is a series of load number, load value pairs, where
load number is a value that represents a load instruction, and load
value is the value that was retrieved from memory. Load numbers must be in
the range from 0 to the total number of loads. Load values can be anything in
the range of 32-bit numbers. It is relatively straightforward to adapt the code
to different input formats, such as 64-bit integers.
Output Specification
The program will find frequent patterns for each load value and combine the
counts of each pattern. The program requires two parameters, minSupport, and
minSupportTotal. Only patterns that occur at least minSupportTotal times will be
output by the program. minSupportTotal is a combination of patterns from each
load number, where each load finds patterns with at least minSupport frequency.
Let us assume that Load 1 and Load 2 each fetches the pattern "1 2 1
2" 10000 times, and that Load 3 fetches the pattern "1 2 1 2"
with frequency 20000. If the program is run with minSupport = 10000 or less and
MinSupportTotal = 40000 or less, the result "1 2 1 2" will be output.
However, if minSupport = 15000, the evidence from Loads 1 and 2 will not be used
to find the total number of occurrences of the pattern. MinSupportTotal acts as
a filter to output only the more significant sequences. Lower values of
minSupport increase both program accuracy and running time. minSupport = 1 is
computationally intractable for large trace files.
Before calculating the combined support, several postprocessing steps occur.
First, the absolute values of the loads is not important for our purposes. The
program will normalize each pattern before combining the results. The sequence
"1 2 1 2" will become "A B A B". The sequence "514 17
514 17" will also become "A B A B", and the two sequences' counts
will be combined. Identical but shifted patterns are combined and filtered from
the output. For example the sequences "A B B" and "B A B"
are redundant because it is likely that "B A B" occurs in a larger
chain of "B A B B A B B A B ...". When combining these shifted
patterns, the new count is taken as the maximum of the counts of the combined
patterns. Output files include all patterns that occur with at least
MinSupportTotal evidence. Sequences of length from 1 to cutoff are output, where
a typical cutoff value is 50. Below is sample output from a small (5-value)
trace file.
Data File Name: data
Minimum support for each load: 1
Minimum support over all loads: 1
Sequences Length 1
Count: 5 Sequence: A
Sequences Length 2
Count: 2 Sequence: A B
Count: 1 Sequence: A A
Sequences Length 3
Count: 1 Sequence: A B C
Algorithm
The mining algorithm works as follows:
- Over the entire trace file, count the number of loads performed for each
load number. Sort the load numbers by count of values loaded, and determine
the number of loads necessary to maintain percentReads (Line 31, see
HTML source below) coverage of the data (Lines 684 - 729). This filters many
irrelevant loads.
- For each load used, create a new file specific to that load number and
write to that file the sequence of numbers specific to that load (738 -
803). This separates each load number into its own sequence of values. To
limit the number of simultaneously opened files, the load files are built numFiles
(33) files at a time.
- For each created load file, find sequences with required support (797):
- First find all sequences of length 1 with minSupport (command
line argument) support. Since the space of integers is large, this is
done with a probabilistic hashing technique. The first pass collects
counts of load values into hash buckets. For each bucket with minimum
support, the second pass collects the counts of those values that hash
into the significant buckets. Values with lower than minimum support are
thrown out (140 - 255).
- Add each sequence of length 1 with minimum support to the global
block. The globalBlock (44) is a global memory space to store program
results. It is organized as a table of sequences, and their support.
Before sequences are added to the global block (464 - 503), sequence
values are normalized into alphanumeric characters (416 - 447).
- Build potential sequences of length 2 by combining sequences of length
1. Then find support for each of those potential sequences by a pass
over the trace data. Save the supported sequences for future use, and
add their support to the global block (264 - 411).
- Repeat building sequences of length n+1 and adding their support to
the block, until sequences of length cutoff (32) are found (637 -
651).
- After all loads have had their sequences added to the global block, it is
time to print the information in the global block (578 - 631). This consists
of first combining all patterns that are the same except shifted, such as
"A B B", "A B A", and "A A B" (536 - 573).
Merging these patterns requires renormalization of the sequences (509 - 531)
Wildcards
The program includes support for wildcard pattern matching. The MAX_ASTERIK (38)
value represents the maximum number of wildcards that will be used in the
search. Valid values are 0, 1, and 2. Each asterisk represents a single position
in the sequence that will be satisfied by any value. For example, the sequence
"15 * 22" will be get support from sequences such as "15 7
22" and "15 82 22". Since use of asterisks exponentially expands
the search space, the running time is significantly higher. In output files, the
wildcard is marked by an '*' rather than by an alphanumeric character.
Command Line Arguments
The program is executed from the command line with five parameters:
mine.exe datafile outfile minSupportIndividual minSupportTotal
mine.exe is the name of the binary file.
datafile is the name of the trace file in <NUM ... value, num, load
loads>
format.
outfile is the destination of output text.
minSupportIndividual is the count a sequence requires to be considered in
the total results.
minSupportTotal is the minimum total support required for a sequence to
be printed to the output file.
Other parameters that might change between executions, such as number of percentReads
and cutoff, are currently compiled into the program but can be easily
added as command line arguments.
Program Execution
Program running time varies depending upon trace file and required support. For
a five gigabyte file with minSupport = 30000, the running time is on the order
of several hours. The output file will be on the order of hundreds of kilobytes.
A small file will output to the console something similar to the following
console snapshot:
D:\mining>mine data outnew.txt 1 1
Execution Beginning
Max Sequence Length: 50
2 Total Loads
2 Loads Used
Building loads 0 to 1
Num potential sequences len1 is 3
Num potential sequences length 2 is 9
Num potential sequences length 3 is 1
Num potential sequences length 4 is 0
Analysis of Load 0 completed
Num potential sequences len1 is 1
Num potential sequences length 2 is 1
Num potential sequences length 3 is 1
Analysis of Load 1 completed
Printing block data to outnew.txt ...
Global block size is 4
Execution Complete
The number of Loads Used multiplied by the time it takes to analyze one
load is a good indication of overall running time. The program will update its
progress on each load by outputting the sequence length it is analyzing and the
number of potential sequences. Time required to format the global block is
currently O(n2), so if the block grows huge it can take some time to
output.
Source Code
If you have questions or problems with this code, feel free to contact the
author Jeff Hoy, jrh26@cornell.edu.
Original source code is available at mine.cpp. The
program was developed on the Windows platform and it has compiled cleanly with
g++ on Unix systems.
HTML-formatted code with line numbers is available at mine.html.
Sample output files are available at sample1.txt and sample2.txt.