We are developing algorithms for finding frequent sequences in transactional databases (SPAM) and for enhancing load value prediction (LVP). SPAM combines efficient pruning and indexing techniques to enable the discovery of frequent sequences even for very long patterns. The LVP technique discovers frequent patterns in traces of a program's memory accesses. Based on the discovered patterns efficient pre-fetching strategies can be developed.

SPAM is a new algorithm for finding all frequent sequences within a transactional database. The algorithm is especially efficient when the sequential patterns in the database are very long. A depth-first search strategy is used to generate candidate sequences, and various pruning mechanisms are implemented to reduce the search space.

The transactional data is stored using a vertical bitmap representation, which allows for efficient support counting as well as significant bitmap compression. A salient feature of our algorithm is that it incrementally outputs new frequent itemsets in an online fashion.

In a thorough experimental evaluation using standard benchmark data, we isolate the effects of the individual components of our algorithm. Our performance numbers show that our algorithm outperforms previous work by a factor of 3 to over an order of magnitude.

The full source code for SPAM is now available for download from Sourceforge.

Jay Ayres

Jason Flannick

Johannes Gehrke

Tomi Yiu

This project is the combined work of Jeff Hoy, a Cornell University Masters of Engineering candidate, Prof. Martin Burtscher from the Cornell Computer Systems Laboratory, and Prof. Johannes Gehrke from the Cornell Database Group.

Let us assume that Load 1 and Load 2 each fetches the pattern "1 2 1 2" 10000 times, and that Load 3 fetches the pattern "1 2 1 2" with frequency 20000. If the program is run with minSupport = 10000 or less and MinSupportTotal = 40000 or less, the result "1 2 1 2" will be output. However, if minSupport = 15000, the evidence from Loads 1 and 2 will not be used to find the total number of occurrences of the pattern. MinSupportTotal acts as a filter to output only the more significant sequences. Lower values of minSupport increase both program accuracy and running time. minSupport = 1 is computationally intractable for large trace files.

Before calculating the combined support, several postprocessing steps occur. First, the absolute values of the loads is not important for our purposes. The program will normalize each pattern before combining the results. The sequence "1 2 1 2" will become "A B A B". The sequence "514 17 514 17" will also become "A B A B", and the two sequences' counts will be combined. Identical but shifted patterns are combined and filtered from the output. For example the sequences "A B B" and "B A B" are redundant because it is likely that "B A B" occurs in a larger chain of "B A B B A B B A B ...". When combining these shifted patterns, the new count is taken as the maximum of the counts of the combined patterns. Output files include all patterns that occur with at least MinSupportTotal evidence. Sequences of length from 1 to cutoff are output, where a typical cutoff value is 50. Below is sample output from a small (5-value) trace file.

Data File Name: data Minimum support for each load: 1 Minimum support over all loads: 1 Sequences Length 1 Count: 5 Sequence: A Sequences Length 2 Count: 2 Sequence: A B Count: 1 Sequence: A A Sequences Length 3 Count: 1 Sequence: A B C

- Over the entire trace file, count the number of loads performed for each
load number. Sort the load numbers by count of values loaded, and determine
the number of loads necessary to maintain
*percentReads*(Line 31, see HTML source below) coverage of the data (Lines 684 - 729). This filters many irrelevant loads. - For each load used, create a new file specific to that load number and
write to that file the sequence of numbers specific to that load (738 -
803). This separates each load number into its own sequence of values. To
limit the number of simultaneously opened files, the load files are built
*numFiles*(33) files at a time. - For each created load file, find sequences with required support (797):
- First find all sequences of length 1 with
*minSupport*(command line argument) support. Since the space of integers is large, this is done with a probabilistic hashing technique. The first pass collects counts of load values into hash buckets. For each bucket with minimum support, the second pass collects the counts of those values that hash into the significant buckets. Values with lower than minimum support are thrown out (140 - 255). - Add each sequence of length 1 with minimum support to the global block. The globalBlock (44) is a global memory space to store program results. It is organized as a table of sequences, and their support. Before sequences are added to the global block (464 - 503), sequence values are normalized into alphanumeric characters (416 - 447).
- Build potential sequences of length 2 by combining sequences of length 1. Then find support for each of those potential sequences by a pass over the trace data. Save the supported sequences for future use, and add their support to the global block (264 - 411).
- Repeat building sequences of length n+1 and adding their support to
the block, until sequences of length
*cutoff*(32) are found (637 - 651).

- First find all sequences of length 1 with
- After all loads have had their sequences added to the global block, it is time to print the information in the global block (578 - 631). This consists of first combining all patterns that are the same except shifted, such as "A B B", "A B A", and "A A B" (536 - 573). Merging these patterns requires renormalization of the sequences (509 - 531)

mine.exe datafile outfile minSupportIndividual minSupportTotal

Other parameters that might change between executions, such as number of

D:\mining>mine data outnew.txt 1 1 Execution Beginning Max Sequence Length: 50 2 Total Loads 2 Loads Used Building loads 0 to 1 Num potential sequences len1 is 3 Num potential sequences length 2 is 9 Num potential sequences length 3 is 1 Num potential sequences length 4 is 0 Analysis of Load 0 completed Num potential sequences len1 is 1 Num potential sequences length 2 is 1 Num potential sequences length 3 is 1 Analysis of Load 1 completed Printing block data to outnew.txt ... Global block size is 4 Execution CompleteThe number of

Original source code is available at mine.cpp. The program was developed on the Windows platform and it has compiled cleanly with g++ on Unix systems.

HTML-formatted code with line numbers is available at mine.html.

Sample output files are available at sample1.txt and sample2.txt.