Homework 11

CS409 - Spring 2000

Now Due: Tuesday, May 2 (original due date: Thursday, Apr 27)

For this assignment, you are to implement a suffix tree. This is a surprisingly powerful data structure that is extremely useful for various applications involving strings. Recently, suffix trees have become important for working with strings associated with molecular biology (e.g., DNA strings and protein strings). The end of this document contains a brief introduction to suffix trees.

You are given an abstract class called SuffixTree. Your goal is to build a new class called MySuffixTree that extends
this class. You shouldn't change the class SuffixTree at all.

You'll be able to use all the internal classes and fields that appear in SuffixTree. You can also use the methods --- the ones that aren't abstract --- that appear there. In particular, you'll need to run the main() method to get the output that we want to see for grading.

The SuffixTree abstract class contains:

Internal classes --- including Node, LeafNode, InternalNode, and Edge --- that define the structures needed to build a suffix tree.
Fields root and string to hold the suffix tree data.
A nonfunctional constructor. You'll have to make a functional constructor for MySuffixTree. It's possible to construct a suffix tree in linear time, but for this assignment you can use the quadratic method of inserting each suffix, one at a time.
Two abstract methods, findAll() and longestPrefix(). What these methods are supposed to do is described in the comments. You'll have to create non-abstract working versions of these for MySuffixTree.
Several useful methods, including toString() and main(). If you implement MySuffixTree correctly then toString() should be able to print it.

The Terminator/Separator ($)

The input to your suffix tree will be a string made up of letters and the terminator/separator character ($). The $ character can appear within the string as a separator as well as appearing at the end of the string as a terminator. We want the $ to mark the end of a suffix. In other words, whenever you reach $, don't store the rest of the suffix past the $.

This allows us to have a separator character, but it leads to a new problem: two different suffixes can end up at the same leaf. For example, this would happen for the input string "abc$bac$" because the suffix "c$" appears twice. To accommodate this situation, we have to allow multiple suffix IDs to be stored in a single leaf. (A suffix ID is just the location in the original string where the suffix starts.) The LeafNode class includes a field called more which can hold another LeafNode. Using this field, you can form a linked list of LeafNodes allowing multiple suffix IDs to be held in a single leaf.

What to Turn In

A listing of MySuffixTree.
The output of SuffixTree.main().

Notes/Hints

Here's some sample output. The dots represent nodes in the suffix tree; the letters represent edges. There is one line of output for each leaf in the tree showing all the nodes and edges passed on the way from the root to the leaf. The number at the end of the line is a suffix ID (i.e., the location in the original string where the suffix starts). If your suffix tree is built correctly then the toString() method supplied in SuffixTree should print this for you.
```
SuffixTree for mississippi$:
 .$ 11
 .i.$ 10
 .i.ppi$ 7
 .i.ssi.ppi$ 4
 .i.ssi.ssippi$ 1
 .mississippi$ 0
 .p.i$ 9
 .p.pi$ 8
 .s.i.ppi$ 6
 .s.i.ssippi$ 3
 .s.si.ppi$ 5
 .s.si.ssippi$ 2
```
You may want to add some additional fields. For instance, I found it useful to have some fields to represent my current location within the SuffixTree. You may also want to consider including some fields that can be used to accumulate suffix IDs; both findAll() and longestPrefix() require you to return an array of suffix IDs and you need a place to store this information while you're accumulating it.
Java allows you to do do arithmetic on characters. For instance, (ch-'a'+1) will map any of the characters 'a' through 'z' to the numbers 1 through 26.
Take a look at the methods in java.lang.String. Several of these methods are useful for this assignment. Try not to use methods that are likely to be inefficient. Note though that the substring() method is very efficient --- it doesn't actually copy characters, it just keeps track of indices.
Consider building some private helper methods to do some of the tasks common to findAll(), longestPrefix(), and your constructor.
You can determine if a node is leaf node by using instanceof. Once you know a node is a leaf node you can use type conversion to use it as leaf node.
```
    if (node instanceof LeafNode) {
        LeafNode leaf = (LeafNode) node;
    }
  
```

Background on Suffix Trees

A suffix tree is actually a compressed trie holding all the suffixes of a given string. A trie is yet another Dictionary data structure. The name came from some of the characters in retrieval. Trie is often pronounced like tree, but then it's hard to tell whether trie or tree is meant; for this reason, I usually pronounce it as try. For a trie, Dictionary operations are fast (proportional to the length of the key), but storage costs are high. Each node in a trie has a number of children equal to the number of letters in your alphabet. Here's a trie for the words she, sells, sea, and shells. (The root is on the left. A node is represented by either a period or $; $ indicates that a word terminates at this node. An edge is represented by - or \ and the following letter indicates the label on that edge.)

.-s.-h.-e$-l.-l.-s$
    \
     e.-l.-l.-s$
       \
        a$

It's easy to see that (1) all the original words are all represented in this trie and (2) we can determine whether a query word appears in the trie in time proportional to the length of the query word.

For an uncompressed suffix tree, we simply build a trie of all the suffixes in the input string, starting with the input string itself. If you look at an uncompressed suffix tree it's easy to see that most of the nodes aren't really doing anything. The important nodes are the ones at which branching occurs. So we get rid of all the other nodes, allowing edges to be labeled with strings. This saves some space, but we've still got strings saved all over the data structure.

The real trick is to realize that all the strings that appear as edge labels are substrings of the original input string. So we don't have to label each edge with a string. Instead we can label each edge with two indices into the original string, one to show where the substring starts and one to show where the substring ends. Now each label has just two numbers associated with it instead of an arbitrary-length string.

To simplify storage, we assume that the input string ends with a special terminator character ($) that occurs nowhere else in the string. This forces each suffix to end at a leaf. The leaf is usually labeled with a suffix ID indicating which suffix ends at this leaf. The simplest suffix ID is to use is the starting position of the suffix within the original string; it's easy to see that this uniquely identifies each suffix..

Space

If the original string has length n then the final suffix tree has n leaves, one for each suffix. Each internal node (i.e., nonleaf node) in the tree has a fixed, maximum number of children based on the size of our alphabet (for this assignment, our alphabet is of size 27). We want to get an upper bound on the number of nodes in a suffix tree based on the fact that there are n leaves. The number of nodes is maximized if there is not much branching, so the worst-case would occur if our suffix tree were a binary tree. In this case, there would be 2n-1 nodes since a binary tree with n leaves has 2n-1 nodes. This was a worst-case bound, so the number of nodes is O(n). Since each edge has to go to a node, the number of edges must also be O(n). Each edge and each node stores a fixed amount of data, so the total space for the whole suffix tree is O(n).

Time

Remarkably, a suffix tree can be built in time linear in the length of the input string. We'll discuss this in class. For the assignment, don't worry about building a suffix tree quickly, just find a way to build it.

Using a Suffix Tree

To find all occurrences of string q in string S:

Build a suffix tree for S. [O(n) time; n = length of S]
Follow q into the suffix tree and find the node N at or below q. [O(q) time; q = length of q]
Walk the tree below N, reporting the suffix IDs found in any leaves we pass. [O(k) time; k = number of leaves below N]

Thus total time is O(n+q+k). Note that k is both the number of leaves examined and the number of answers that must be reported.

The real power of a suffix tree can be seen if we consider the problem of multiple queries over a fixed string S. Once the suffix tree for S is built, many queries can be done. Each query takes time proportional to the size of the query string and the number of answers given; the time doesn't depend on the size of S at all!

Suffix trees have many other uses. We'll talk about some more ways to use them in class.