Mining Sequential Patterns

We are developing algorithms for finding frequent sequences in transactional databases (SPAM) and for enhancing load value prediction (LVP). SPAM combines efficient pruning and indexing techniques to enable the discovery of frequent sequences even for very long patterns. The LVP technique discovers frequent patterns in traces of a program's memory accesses. Based on the discovered patterns efficient pre-fetching strategies can be developed.

SPAM: Sequential PAttern Mining

SPAM is a new algorithm for finding all frequent sequences within a transactional database. The algorithm is especially efficient when the sequential patterns in the database are very long. A depth-first search strategy is used to generate candidate sequences, and various pruning mechanisms are implemented to reduce the search space.

The transactional data is stored using a vertical bitmap representation, which allows for efficient support counting as well as significant bitmap compression. A salient feature of our algorithm is that it incrementally outputs new frequent itemsets in an online fashion.

In a thorough experimental evaluation using standard benchmark data, we isolate the effects of the individual components of our algorithm. Our performance numbers show that our algorithm outperforms previous work by a factor of 3 to over an order of magnitude.

Walkthrough of the algorithm

Obtaining SPAM

The full source code for SPAM is now available for download from Sourceforge.


Jay Ayres
Jason Flannick
Johannes Gehrke
Tomi Yiu


Jay Ayres, J. E. Gehrke, Tomi Yiu, and Jason Flannick. Sequential PAttern Mining Using Bitmaps. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 2002.


Mining Sequential Patterns for Load Value Prediction


The goal of this project is to develop enhanced load value prediction hardware. To do this, we have developed a program that finds frequent patterns in traces of memory load values. The trace files are a series of load number, load value pairs, where load number is a value that represents a load instruction, and load value is the value that was retrieved from memory. Patterns with specified support are mined from the trace file. The algorithm we use to mine the patterns is similar to the Apriori Algorithm as described in Fast Algorithms for Mining Association Rules.

This project is the combined work of Jeff Hoy, a Cornell University Masters of Engineering candidate, Prof. Martin Burtscher from the Cornell Computer Systems Laboratory, and Prof. Johannes Gehrke from the Cornell Database Group.


The mining program has been made available to the research community under the terms of the GNU General Public License. Source code, details of the program implementation, documentation, and examples follow.

Input Specification

Trace files consist of 32-bit integers. The first number in the file indicates the range of the load numbers; 0 <= load number < first number. The remainder of the file is a series of load number, load value pairs, where load number is a value that represents a load instruction, and load value is the value that was retrieved from memory. Load numbers must be in the range from 0 to the total number of loads. Load values can be anything in the range of 32-bit numbers. It is relatively straightforward to adapt the code to different input formats, such as 64-bit integers.

Output Specification

The program will find frequent patterns for each load value and combine the counts of each pattern. The program requires two parameters, minSupport, and minSupportTotal. Only patterns that occur at least minSupportTotal times will be output by the program. minSupportTotal is a combination of patterns from each load number, where each load finds patterns with at least minSupport frequency.

Let us assume that Load 1 and Load 2 each fetches the pattern "1 2 1 2" 10000 times, and that Load 3 fetches the pattern "1 2 1 2" with frequency 20000. If the program is run with minSupport = 10000 or less and MinSupportTotal = 40000 or less, the result "1 2 1 2" will be output. However, if minSupport = 15000, the evidence from Loads 1 and 2 will not be used to find the total number of occurrences of the pattern. MinSupportTotal acts as a filter to output only the more significant sequences. Lower values of minSupport increase both program accuracy and running time. minSupport = 1 is computationally intractable for large trace files.

Before calculating the combined support, several postprocessing steps occur. First, the absolute values of the loads is not important for our purposes. The program will normalize each pattern before combining the results. The sequence "1 2 1 2" will become "A B A B". The sequence "514 17 514 17" will also become "A B A B", and the two sequences' counts will be combined. Identical but shifted patterns are combined and filtered from the output. For example the sequences "A B B" and "B A B" are redundant because it is likely that "B A B" occurs in a larger chain of "B A B B A B B A B ...". When combining these shifted patterns, the new count is taken as the maximum of the counts of the combined patterns. Output files include all patterns that occur with at least MinSupportTotal evidence. Sequences of length from 1 to cutoff are output, where a typical cutoff value is 50. Below is sample output from a small (5-value) trace file.

          Data File Name: data
          Minimum support for each load: 1
          Minimum support over all loads: 1
          Sequences Length 1
            Count: 5 Sequence: A
          Sequences Length 2
            Count: 2 Sequence: A B
            Count: 1 Sequence: A A
          Sequences Length 3
            Count: 1 Sequence: A B C


The mining algorithm works as follows:
  1. Over the entire trace file, count the number of loads performed for each load number. Sort the load numbers by count of values loaded, and determine the number of loads necessary to maintain percentReads (Line 31, see HTML source below) coverage of the data (Lines 684 - 729). This filters many irrelevant loads.
  2. For each load used, create a new file specific to that load number and write to that file the sequence of numbers specific to that load (738 - 803). This separates each load number into its own sequence of values. To limit the number of simultaneously opened files, the load files are built numFiles (33) files at a time.
  3. For each created load file, find sequences with required support (797):
    1. First find all sequences of length 1 with minSupport (command line argument) support. Since the space of integers is large, this is done with a probabilistic hashing technique. The first pass collects counts of load values into hash buckets. For each bucket with minimum support, the second pass collects the counts of those values that hash into the significant buckets. Values with lower than minimum support are thrown out (140 - 255).
    2. Add each sequence of length 1 with minimum support to the global block. The globalBlock (44) is a global memory space to store program results. It is organized as a table of sequences, and their support. Before sequences are added to the global block (464 - 503), sequence values are normalized into alphanumeric characters (416 - 447).
    3. Build potential sequences of length 2 by combining sequences of length 1. Then find support for each of those potential sequences by a pass over the trace data. Save the supported sequences for future use, and add their support to the global block (264 - 411).
    4. Repeat building sequences of length n+1 and adding their support to the block, until sequences of length cutoff (32) are found (637 - 651).
  4. After all loads have had their sequences added to the global block, it is time to print the information in the global block (578 - 631). This consists of first combining all patterns that are the same except shifted, such as "A B B", "A B A", and "A A B" (536 - 573). Merging these patterns requires renormalization of the sequences (509 - 531)


The program includes support for wildcard pattern matching. The MAX_ASTERIK (38) value represents the maximum number of wildcards that will be used in the search. Valid values are 0, 1, and 2. Each asterisk represents a single position in the sequence that will be satisfied by any value. For example, the sequence "15 * 22" will be get support from sequences such as "15 7 22" and "15 82 22". Since use of asterisks exponentially expands the search space, the running time is significantly higher. In output files, the wildcard is marked by an '*' rather than by an alphanumeric character.

Command Line Arguments

The program is executed from the command line with five parameters:
          mine.exe datafile outfile minSupportIndividual minSupportTotal
mine.exe is the name of the binary file.
datafile is the name of the trace file in <NUM ... value, num, load loads> format.
outfile is the destination of output text.
minSupportIndividual is the count a sequence requires to be considered in the total results.
minSupportTotal is the minimum total support required for a sequence to be printed to the output file.

Other parameters that might change between executions, such as number of percentReads and cutoff, are currently compiled into the program but can be easily added as command line arguments.

Program Execution

Program running time varies depending upon trace file and required support. For a five gigabyte file with minSupport = 30000, the running time is on the order of several hours. The output file will be on the order of hundreds of kilobytes. A small file will output to the console something similar to the following console snapshot:
          D:\mining>mine data outnew.txt 1 1
          Execution Beginning
          Max Sequence Length: 50
          2 Total Loads
          2 Loads Used
          Building loads 0 to 1
          Num potential sequences len1 is 3
          Num potential sequences length 2 is 9
          Num potential sequences length 3 is 1
          Num potential sequences length 4 is 0
          Analysis of Load 0 completed
          Num potential sequences len1 is 1
          Num potential sequences length 2 is 1
          Num potential sequences length 3 is 1
          Analysis of Load 1 completed
          Printing block data to outnew.txt ...
          Global block size is 4
          Execution Complete

The number of Loads Used multiplied by the time it takes to analyze one load is a good indication of overall running time. The program will update its progress on each load by outputting the sequence length it is analyzing and the number of potential sequences. Time required to format the global block is currently O(n2), so if the block grows huge it can take some time to output.

Source Code

If you have questions or problems with this code, feel free to contact the author Jeff Hoy,
Original source code is available at mine.cpp. The program was developed on the Windows platform and it has compiled cleanly with g++ on Unix systems.

HTML-formatted code with line numbers is available at mine.html.
Sample output files are available at sample1.txt and sample2.txt.