CS312 Lecture 26: String search

Administrivia

String Matching Algorithms

There are two famous string matching algorithms, both from 1977: Boyer-Moore and Knuth-Morris-Pratt. We'll also discuss the difference between an algorithm and a heuristic.

A common problem in text editing and DNA sequence analysis: finding strings inside other strings.

Suppose we have a text T consisting of an array of characters from some alphabet S. For example, S might be just the set {0,1}, in which case the possible strings are strings of binary digits (e.g., "1001010011"); it might be {A,C,T,G}, in which case the strings are DNA sequences (e.g., "GATTACA"). The string matching problem is this: given a smaller string P (the pattern), find the occurrences of P in T. P occurs in T at shift i if P[j] = T[i+j] for all valid indices j of P.

For example, in the string "ABABABAC", the pattern string "BAB" occurs at shifts 1 and 3.

A slightly simpler version of this problem, which we will consider here, is just to find the first occurrence of P in T. Algorithms for solving this problem are adapted fairly easily to solve the more general problem.

(* Return the smallest shift at which P occurs within T, or
 * length(T) if there is no such shift. *)
stringMatch(T: string, P: string): int

Note: the string T to search has size n; the pattern P has size m.

Of course we can phrase this problem more generally; the strings we match on can be arrays of any sort. We assume that all we can do with array elements is compare them for equality. Here is pseudo-code for a naive string matching algorithm, which steps the shift along by one and tries to compare corresponding characters.

for i := 0 to n-m {
  for j := 0 to m-1 {
    if P[j] <> T[i+j] then break
  }
  if j = m then return i
}
return n

This is clearly O(nm) in the worst case (think of a pattern and a text that are both mostly a single character). If the pattern string is short, this isn't a problem, but for longer patterns it can be slow.
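As a concrete reference point, the naive algorithm can be rendered in executable Python like this (the function name is ours):

```python
# A direct Python rendering of the naive algorithm above.
# Returns the smallest shift at which p occurs in t, or len(t) if none.
def naive_match(t: str, p: str) -> int:
    n, m = len(t), len(p)
    for i in range(n - m + 1):           # candidate shifts 0 .. n-m
        j = 0
        while j < m and p[j] == t[i + j]:
            j += 1
        if j == m:                       # all m pattern characters matched
            return i
    return n                             # no occurrence
```

For instance, `naive_match("ABABABAC", "BAB")` returns 1, the first of the two shifts noted earlier.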

A little notation: we write P[i..j] to mean the substring of P containing all the characters between indices i and j inclusive. A prefix of P is a substring P[0..k], k<m; a proper prefix is P[0..k], k < m-1. A suffix of P is a substring P[k..m-1], k>=0, and a proper suffix similarly requires k > 0.

Examples: string "baaad", proper prefixes are "b", "ba", "baa", "baaa". Prefixes also include "baaad". Proper suffixes are "d", "ad", "aad", "aaad". Suffixes also include "baaad".
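These definitions are easy to check mechanically; for instance, in Python:

```python
# Enumerating the prefixes and proper suffixes of "baaad"
# per the definitions above (0-based indices, m = len(p)).
p = "baaad"
prefixes = [p[:k + 1] for k in range(len(p))]        # P[0..k], k < m
proper_suffixes = [p[k:] for k in range(1, len(p))]  # P[k..m-1], k > 0
# prefixes        == ['b', 'ba', 'baa', 'baaa', 'baaad']
# proper_suffixes == ['aaad', 'aad', 'ad', 'd']
```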

Boyer-Moore algorithm

The insight of the Boyer-Moore algorithm is to start matching at the end of the pattern string P rather than the beginning. When a mismatch is found, this allows the shift to be increased by more than one. Consider these two strings:

T = ...IN THE UNTITLED STATES
P =             EDITED
Matching from the back, we first encounter a mismatch at "L". Because "L" does not occur in P, we can increase the shift by 4 without worrying about missing a potential match. In general, the mismatched character may occur in P; in that case the shift can only be increased far enough to align the last occurrence of that character in P with the mismatched character in T. A simplified version of Boyer-Moore uses just this bad-character heuristic.

Note the difference between a heuristic and an algorithm. A heuristic defines rules or guidelines that reduce or limit the search for solutions in domains that are difficult to manipulate. Unlike algorithms, heuristics do not guarantee optimal, or even feasible, solutions. A real-life example of a heuristic: a student completing a true/false test might expect a balanced number of true and false answers, although that is not necessarily the case.

The following is a possible implementation of the bad-character heuristic.

s := 0 (* the current shift *)
while s <= n - m do {
  j := m - 1
  while j >= 0 and P[j] = T[s+j] do j := j - 1
  if j < 0 then return s
  s := s + max(1, j - last[T[s+j]])
}
return n

The array last must be precomputed; last[c] is the index of the last occurrence of the character c in P, or -1 if there is no such occurrence. (The -1 default makes the shift j - last[c] equal to j + 1 when c does not occur in P at all.) It is computed in O(m+|S|) time where |S| is the size of the alphabet S that strings are made up of:

for all c, last[c] := -1
for j := 0 to m-1 {
  last[P[j]] := j
}  
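As a concrete sketch, here is the bad-character version in executable Python (0-based indices; we use a dictionary for last, defaulting to -1 for absent characters, and the function name is ours):

```python
# A sketch of simplified Boyer-Moore: bad-character heuristic only.
# last maps each character to its last index in p; absent characters
# default to -1, so the shift j - last[c] becomes j + 1 in that case.
def bad_char_match(t: str, p: str) -> int:
    n, m = len(t), len(p)
    last = {c: j for j, c in enumerate(p)}   # last occurrence of each char
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and p[j] == t[s + j]:   # compare right to left
            j -= 1
        if j < 0:                            # whole pattern matched
            return s
        s += max(1, j - last.get(t[s + j], -1))
    return n
```

On the "UNTITLED STATES" example this returns len(t), since "EDITED" never occurs; on "ABABABAC" it finds "BAB" at shift 1, like the naive algorithm, but typically with far fewer comparisons.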

This algorithm works well if the alphabet is reasonably large relative to the length of the pattern. If the last character of the pattern usually fails to match, the current shift s is increased by m each time around the loop, and the total number of character comparisons is typically about n/m. This compares well with the roughly n comparisons that the naive algorithm performs on similar inputs; in fact, the longer the pattern string, the faster the search! However, the worst-case run time is still O(nm). The algorithm as presented doesn't work very well if the alphabet is small, because even in randomly generated strings the last occurrence of any given character tends to be near the end of the pattern, making the bad-character shifts small.

Both of the improvements we consider, the full Boyer-Moore algorithm and Knuth-Morris-Pratt, use the same intuition: take advantage of a partial match to figure out how far you can safely skip forward. Consider this example:

T = ...CAAAD...
P =    BAAAD

The last-occurrence table doesn't help here. Clearly we can skip forward by more than 1, though, essentially because the pattern has no "stutters" (repeated sub-patterns) in it. In fact, we can skip ahead in this example by the length of the pattern, m: having matched AAAD in both string and pattern, we know that no right shift of BAAAD by fewer than m characters can line up with the string. Moreover, we can compute this by analyzing just the pattern!

Boyer-Moore can be improved by using a heuristic based on the partial match for increasing the shift. Consider this example:

T = ...LIVID_MEMOIRS...
P =   EDITED_MEMOIRS
When the I/E mismatch is observed, the bad-character heuristic doesn't help at all, because the last "I" in P is already beyond the point of comparison; the algorithm above can only increment the shift by one. However, knowing that the suffix "D_MEMOIRS" has already been matched tells us that P can be shifted forward by a full m characters, because no suffix of "D_MEMOIRS" appears anywhere earlier in P. Compare

P =   IRS_EDITED_MEMOIRS

where we can't skip forward as much.

Given that a suffix S = P[j+1..m-1] has already been matched when a mismatch is encountered, we can compute a shift forward based on where suffixes of that suffix appear in P. We say that two strings "suffix-match" when the shorter of the two is a suffix of the other. The good-suffix heuristic is this: we can conservatively shift forward by the smallest shift that causes S to suffix-match with a prefix of P. This smallest shift corresponds to the longest proper prefix of P that suffix-matches with S; call this prefix S'. If S' is the longer of the two strings, there might be a copy of P that starts between the current shifted start of P and the mismatch point. If S is the longer of the two, there might be a copy of P that starts after the mismatch point (and before or at the current shifted end of P). Here is an illustration of the two cases:

(case 1)  T =    ....CABABDABAB...       S = ABAB, S' = BAB, largest safe shift = 8-3 = 5
          P =     BABDABAB

(case 2)  T =    ...CCABABAB...          S = ABAB, S' = CCABAB, largest safe shift = 8-6 = 2
          P =     CCABABAB    

For each possible j, let k_j be the length of the longest proper prefix of P that suffix-matches with P[j+1..m-1]. We then define good_suffix[j] = m - k_j: the algorithm can safely shift forward by m - k_j characters. Here then is the full Boyer-Moore algorithm:

s := 0
while s <= n - m do {
  j := m - 1
  while j >= 0 and P[j] = T[s+j] do j := j - 1
  if j < 0 then return s
  s := s + max(good_suffix[j], j - last[T[s+j]])
}
return n

In the "MEMOIRS" example above, k_j = 0, so the full m-character shift is safe. Consider also a less trivial case for the good-suffix heuristic:

T = ...BABABABA...
P =    BABACABA

The matched suffix is S = "ABA", and the longest suffix-matching proper prefix is S' = "BABA". Hence the possible shift forward is 8-4 = 4 characters. Notice that here the bad-character heuristic generates a useless shift of -2. The good-suffix refinement is particularly helpful if the alphabet is small. Note that good_suffix[j] is always positive, because necessarily k_j < m (a prefix of length m would not be proper; the whole pattern would have matched!). This is why we no longer need to take the maximum with 1 when computing the new shift.

We can compute the array good_suffix in time O(m), using techniques similar to the prefix computation of the Knuth-Morris-Pratt algorithm below. Therefore the total time to set up a Boyer-Moore search is still O(m+|S|) even with this refinement. The total run time, however, is still no better in the worst case.
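The linear-time construction is not shown here, but the table can be computed by brute force directly from the definition of k_j; here is a quadratic Python sketch, for illustration only (function names are ours):

```python
def suffix_match(a: str, b: str) -> bool:
    # Two strings suffix-match when the shorter is a suffix of the other.
    return a.endswith(b) if len(a) >= len(b) else b.endswith(a)

# good_suffix[j] = m - k_j, where k_j is the length of the longest proper
# prefix of p that suffix-matches with p[j+1..m-1].
def good_suffix_table(p: str) -> list:
    m = len(p)
    gs = [0] * m
    for j in range(m):
        s = p[j + 1:]                        # suffix matched before P[j] failed
        k = 0
        for length in range(m - 1, 0, -1):   # proper prefixes, longest first
            if suffix_match(p[:length], s):
                k = length
                break
        gs[j] = m - k
    return gs
```

This reproduces the shifts in the examples above: for "BABDABAB" with the mismatch at j=3 it yields 5, for "CCABABAB" at j=3 it yields 2, and for "BABACABA" at j=4 it yields 4.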

Knuth-Morris-Pratt Algorithm

The Boyer-Moore algorithm is a good choice for many string-matching problems, but it does not offer asymptotic guarantees that are any stronger than those of the naive algorithm. If asymptotic guarantees are needed, the Knuth-Morris-Pratt algorithm (KMP) is a good alternative. The algorithm takes the following form:
j := 0
for i := 0 to n-1 {
  while (j > 0 and P[j] <> T[i]) {
    j := prefix[j]
  }
  if P[j] = T[i] then j := j + 1
  if j = m then return i - m + 1
}

The idea is to scan the string T from left to right, looking at every character T[i] exactly once. The variable j keeps track of which character of P is being compared against T[i], so the currently considered shift is i-j. When a mismatch is encountered, the characters matched so far, S = T[i-j..i-1] = P[0..j-1], may still be part of a match; if so, some proper suffix of this j-character string must itself be a prefix of P. Suppose that prefix[j] contains the length of the largest prefix of P that is a proper suffix of P[0..j-1]. Then we can increase the shift by the smallest amount that might allow a suffix of S to be a prefix of the new positioning of P, by setting j to prefix[j]. We then check whether P[j] = T[i], to see whether a partial match has been obtained that includes the mismatched character T[i]. If it does not match, we try the next smaller prefix that is a proper suffix of S, and so on until a match is found or it is determined that no proper suffix of S can be part of a match, at which point j = 0. Note that each outer loop iteration either increases the shift i-j or leaves it the same, because j is increased at most once per iteration whereas i is always increased, and because prefix[j] < j.

To see how this works, consider the string P="ababaca". The prefix array for this string is:

prefix[1] = 0
prefix[2] = 0    (The only proper suffix of "ab" is "b", which is not a prefix)
prefix[3] = 1    ("a" is a proper suffix of "aba" and is a prefix)
prefix[4] = 2
prefix[5] = 3
prefix[6] = 0    (No proper suffix of "ababac" is a prefix)
prefix[7] = 1
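This table can be double-checked by brute force directly from the definition, e.g. in Python:

```python
# Brute-force check of the table: prefix[j] is the length of the longest
# prefix of p that is a proper suffix of p[0..j-1].
p = "ababaca"
m = len(p)
prefix = [0] * (m + 1)        # prefix[1..m]; index 0 unused
for j in range(1, m + 1):
    s = p[:j]                 # P[0..j-1]
    # k < j guarantees properness; k = 0 (empty prefix) always qualifies.
    prefix[j] = max(k for k in range(j) if s.endswith(p[:k]))
# prefix[1:] == [0, 0, 1, 2, 3, 0, 1]
```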

Now, consider searching for this string, where the scan of T has proceeded up to the indicated point:

T = ...ababababa...
P =    ababaca
            ^
            i

Here the algorithm finds a mismatch at j=5, so it tries shifting the string forward two spaces to j=prefix[5]=3 to see whether the mismatched character "b" matches P[3] -- which it does. Therefore the algorithm is ready to step forward to the next character of T. On the other hand, if T looked a little different, we'd skip forward farther:

T = ...ababaxaba...
P =    ababaca
            ^
            i

In this case, at the mismatch, j is still set to prefix[5]=3, but P[3]<>"x", so we try again with the shorter matched string "aba". j is set to prefix[3]=1, and again P[1]<>"x", so j is set to prefix[1]=0 and the inner loop terminates. Thus, all the possible matching prefixes of P are tried out in order from longest to shortest, and because none of them has "x" as a following character in P, the inner loop effectively shifts P forward by a full 6 characters.

The code above is O(n), which is not obvious. Clearly the outer loop and all the code except the inner while loop takes O(n) time. The question is how many times the while loop can be executed. The key observation is that every time the while loop executes, j decreases. But j is never negative, so it can be decreased at most as many times as it is increased by the statement j:=j+1, which is executed at most n times total. Therefore the inner loop also can be executed at most n times total, and the whole algorithm is O(n).

The one remaining part of the algorithm is the computation of prefix[j]. Recall that prefix[j] is the length of the largest prefix of P that is a proper suffix of P[0..j-1]. It turns out that this array can be computed in O(m) time by an algorithm very similar to KMP itself, which makes sense because we're searching for prefixes of P within P. The key is that knowing prefix[1], ..., prefix[j], we can efficiently compute prefix[j+1]:

prefix[1] := 0
j := 0
for i := 1 to m-1 {
  while j>0 and P[j] <> P[i] {
    j := prefix[j]
  }
  if P[j] = P[i] then j := j + 1
  prefix[i+1] := j
}
return prefix

This is clearly O(m) by the same reasoning that we used for the search code above, but why does it work? It looks exactly like running KMP to search for P within P itself, except that prefix isn't fully initialized before running this code, and we start at i=1 to avoid the trivial match at shift 0. The uninitialized state of prefix doesn't affect the execution, because in computing prefix[i+1] this code never uses prefix[k] for any k > i. Now, recall that the KMP algorithm maintains in j the length of the longest prefix of P matching a suffix of the text scanned so far. Here the text is P itself, and starting at i=1 rules out the trivial full-length match, so after processing P[i] the value of j is the length of the longest prefix of P that is a proper suffix of P[0..i]. But this is exactly prefix[i+1] as defined.
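Putting the two pieces together, here is the whole algorithm as an executable Python sketch (0-based strings, with the prefix table kept in the same 1-based convention as the pseudo-code; the function name is ours):

```python
# Complete KMP search: prefix-table construction followed by the scan of t.
# Returns the smallest shift at which p occurs in t, or len(t) if none.
def kmp_match(t: str, p: str) -> int:
    n, m = len(t), len(p)
    if m == 0:
        return 0                       # empty pattern matches at shift 0
    # Compute prefix[1..m] as in the pseudo-code above.
    prefix = [0] * (m + 1)
    j = 0
    for i in range(1, m):
        while j > 0 and p[j] != p[i]:
            j = prefix[j]
        if p[j] == p[i]:
            j += 1
        prefix[i + 1] = j
    # Scan t, maintaining j = length of the currently matched prefix of p.
    j = 0
    for i in range(n):
        while j > 0 and p[j] != t[i]:
            j = prefix[j]
        if p[j] == t[i]:
            j += 1
        if j == m:
            return i - m + 1
    return n
```

For example, `kmp_match("abababaca", "ababaca")` returns 2, after stepping through exactly the kind of prefix-table fallbacks illustrated above.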

A note about programming languages

Strings are a good example of a data structure that is more naturally manipulated in a low-level language like C.

CS312  © 2002 Cornell University Computer Science