String matching

A common problem in text editing, DNA sequence analysis, and web crawling is finding strings inside other strings. Suppose we have a text `T` consisting of an array of characters from some alphabet Σ. For example, Σ might be just the set {0,1}, in which case the possible strings are strings of binary digits (e.g., "1001010011"); it might be {A,C,T,G}, in which case the strings are DNA sequences (e.g., "GATTACA"). The string matching problem is this: given a smaller string `P` (the *pattern*), find the occurrences of `P` in `T`. If `P` occurs in `T` at *shift* `i`, then `P[j] = T[i+j]` for all valid indices `j` of `P`.

For example, in the string "ABABABAC", the pattern string "BAB" occurs at shifts 1 and 3.

A slightly simpler version of this problem, which we will consider here, is just to find the first occurrence of `P` in `T`. Algorithms for solving this problem are adapted fairly easily to solve the more general problem.

```
signature STRING_MATCH = sig
  (* match(T,P) is the smallest shift at which
   * P occurs within T. Raises NoMatch if P doesn't
   * occur within T. *)
  exception NoMatch
  val match: string * string -> int
end
```

Of course we can phrase this problem more generally; the strings we match on can be arrays of any sort. We assume that all we can do with array elements is compare them for equality. Here is pseudo-code for a naive string matching algorithm, which steps the shift along by one and tries to compare corresponding characters.

```
for i := 0 to n-m {
    for j := 0 to m-1 {
        if P[j] <> T[i+j] then break
    }
    if j = m then return i
}
```

This pseudo-code can of course be written in SML too.
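For concreteness, here is a sketch of the naive algorithm in Python rather than SML (the name `naive_match` and the convention of returning -1 when there is no match are our own choices):

```python
def naive_match(T: str, P: str) -> int:
    """Return the smallest shift at which P occurs in T, or -1 if none."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):             # try each shift in turn
        j = 0
        while j < m and P[j] == T[i + j]:  # compare corresponding characters
            j += 1
        if j == m:                         # all m characters matched
            return i
    return -1
```

For example, `naive_match("ABABABAC", "BAB")` returns 1, the first of the two shifts noted above.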

This is clearly *O*(*nm*) in the worst
case (think of a pattern and a text that are both mostly a single character). If the pattern string is short, this
isn't a problem, but for longer patterns it can be slow. It turns out that there
are faster string matching algorithms.

A little notation: we write `P[i..j]` to mean the substring of `P` containing all the characters between indices `i` and `j` inclusive. A **prefix** of `P` is a substring `P[0..k]`, `k < m`; a **proper prefix** is `P[0..k]`, `k < m-1`. A **suffix** of `P` is a substring `P[k..m-1]`, `k >= 0`, and a **proper suffix** similarly requires `k > 0`.

The insight of the Boyer-Moore algorithm is to start matching at the *end* of the pattern string `P` rather than the beginning. When a mismatch is found, this allows the shift to be increased by more than one. Consider these two strings:

```
T = ...IN THE UNTITLED WORKS BY...
P =             EDITED
```

Matching from the back, we first encounter a mismatch at `"L"`. Because `"L"` does not occur in `P`, the shift can be increased by 4 without missing a potential match. In general, the mismatched character could be in `P`. In this case the shift can only be increased to the point where that character occurs in `P`. Here is a simplified version of Boyer-Moore using this **bad-character heuristic**:

```
s := 0   (* the current shift *)
while s <= n - m do {
    j := m - 1
    while j >= 0 and P[j] = T[s+j] do
        j := j - 1
    if j < 0 then return s
    s := s + max(1, j - last[T[s+j]])
}
```

The array `last` must be precomputed; `last[c]` is the position of the last occurrence of the character `c` in `P`, or `-1` if there is no such occurrence. It is computed in *O*(*m*+|Σ|) time, where |Σ| is the size of the alphabet Σ that strings are made up of:

```
for all c, last[c] := -1
for j := 0 to m-1 {
    last[P[j]] := j
}
```

This algorithm works well if the alphabet is reasonably big, but not too big. If the last character usually fails to match, the current shift `s` is increased by `m` each time around the loop. The total number of character comparisons is typically about *n*/*m*, which compares well with the roughly *n* comparisons that would be performed in the naive algorithm for similar problems. In fact, the longer the pattern string, the faster the search! However, the worst-case run time is still *O*(*nm*).
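A Python sketch of this simplified Boyer-Moore (bad-character heuristic only) may help make the shift rule concrete; the name `bm_bad_char`, the dictionary representation of `last`, and the -1 no-match convention are our own:

```python
def bm_bad_char(T: str, P: str) -> int:
    """Boyer-Moore search using only the bad-character heuristic.

    Returns the smallest shift at which P occurs in T, or -1 if none.
    """
    n, m = len(T), len(P)
    # last[c] = index of the last occurrence of character c in P
    last = {c: j for j, c in enumerate(P)}
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[s + j]:   # match right to left
            j -= 1
        if j < 0:
            return s
        # align the last occurrence of the mismatched text character with
        # the mismatch position (characters absent from P count as -1),
        # but always advance by at least 1
        s += max(1, j - last.get(T[s + j], -1))
    return -1
```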

The algorithm as presented doesn't work very well if the alphabet is small, because the last occurrence of any given character tends to be near the end of the string. This can be improved by using another heuristic for increasing the shift. Consider this example:

```
T = ...LIVID_MEMOIRS...
P =   EDITED_MEMOIRS
```

When the `I/E` mismatch is observed, the bad-character heuristic doesn't help at all, because the last "`I`" in `P` is already beyond the point of comparison. The algorithm above must just increment the shift by one. However, knowing that the suffix `"D_MEMOIRS"` has already been matched tells us that `P` can be shifted forward by a full `m` characters, because no suffix of `"D_MEMOIRS"` appears earlier in `P` than the point found.

Given that a suffix `S = P[j+1..m-1]` has already been matched when a mismatch is encountered, we can compute a shift forward based on where suffixes of that suffix appear in `P`. This is called the **good-suffix heuristic**: we can conservatively shift forward by the smallest shift that causes a suffix of `S` to match a prefix of `P`. We say that two strings "suffix-match" when the shorter of the two strings is a suffix of the other string. This smallest shift will correspond to the longest proper prefix of `P` that suffix-matches. Call this prefix `S'`. If `S'` is the longer of the two strings, it means that there might be a copy of `P` that starts between the current shifted start of `P` and the mismatch point. If the string `S` is the longer of the two strings, there might be a copy of `P` that starts after the mismatch point (and before or at the current shifted end of `P`). Here is an illustration of these two cases:

```
(case 1)
T = ....CABABDABAB...
P =  BABDABAB
S = ABAB, S' = BAB, largest safe shift = 8-3 = 5

(case 2)
T = ...CCABABAB...
P =  CCABABAB
S = ABAB, S' = CCABAB, largest safe shift = 8-6 = 2
```

For each possible `j`, let `k_j` be the length of the longest proper prefix that suffix-matches with `P[j+1..m-1]`. For each `j`, we then define `good_suffix[j]` as `m-k_j`; the algorithm can shift forward safely by `m-k_j` characters. Here then is the full Boyer-Moore algorithm (also available written in SML):

```
s := 0
while s <= n - m do {
    j := m - 1
    while j >= 0 and P[j] = T[s+j] do
        j := j - 1
    if j < 0 then return s
    s := s + max(good_suffix[j], j - last[T[s+j]])
}
```

In the `"MEMOIRS"` example above, `k_j` = 0, so the full `m`-character shift is safe. Consider also a less trivial case for the good-suffix heuristic:

```
T = ...BABABABA...
P =    BABACABA
```

The string `S = "ABA"`, and the longest suffix-matching prefix is `"BABA"`. Hence the possible shift forward is 8−4 = 4 characters. Notice that here the bad-character heuristic generates a useless shift of −2. The good-suffix refinement is particularly helpful if the alphabet is small. Note that `good_suffix[j]` is always positive, because necessarily `k_j < m` (if they were equal, the pattern would have been found!). This is why we no longer need to take the maximum with `1` when computing the shift to `s`.

We can compute the array `good_suffix` in time *O*(*m*), as discussed below. Therefore the total time to set up a Boyer-Moore search is still *O*(*m*+|Σ|) even with this refinement. The total run time is still not better in the worst case.

The Boyer-Moore algorithm is a good choice for many string-matching problems, but it does not offer asymptotic guarantees that are any stronger than those of the naive algorithm. If asymptotic guarantees are needed, the Knuth-Morris-Pratt algorithm (KMP) is a good alternative. The algorithm takes the following form:

```
j := 0
for i := 0 to n-1 {
    while j > 0 and P[j] <> T[i] {
        j := prefix[j]
    }
    if P[j] = T[i] then j := j + 1
    if j = m then return i - m + 1
}
```

The idea is to scan the string `T` from left to right, looking at every character `T[i]` once. The variable `j` keeps track of which character of `P` is being compared against `T[i]`, so the currently considered shift is `i-j`. When a mismatch is encountered, the characters matched so far, `S = T[i-j..i-1] = P[0..j-1]`, may be part of a match. If so, some proper suffix of the `j`-character string `S` must itself be a prefix of `P`.

Suppose that `prefix[j]` is the length of the largest prefix of `P` that is a proper suffix of `P[0..j-1]`. In this case we can increase the shift by the smallest amount that might allow a suffix of `S` to be a prefix of the new positioning of `P`, by setting `j` to `prefix[j]`. We then check whether `P[j] = T[i]`, to see whether a partial match has been obtained including the mismatched character `T[i]`. If this does not match, we try the next smaller prefix that is a proper suffix of `S`, and so on until a match is found or it is determined that no proper suffix of `S` can be part of a match, at which point `j = 0`. Note that each outer loop iteration always increases the shift `i-j` or leaves it the same, because `j` is increased at most once in the loop whereas `i` is always increased, and because `prefix[j] < j`.

To see how this works, consider the string `P = "ababaca"`. The prefix array for this string is:

```
prefix[1] = 0
prefix[2] = 0   (The only proper suffix of "ab" is "b", which is not a prefix)
prefix[3] = 1   ("a" is a proper suffix of "aba" and is a prefix)
prefix[4] = 2
prefix[5] = 3
prefix[6] = 0   (No proper suffix of "ababac" is a prefix)
prefix[7] = 1
```
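This table can be checked mechanically. The following Python transcription of the standard prefix-function computation (1-based result array with `prefix[0]` unused; the naming is ours) reproduces it:

```python
def compute_prefix(P: str) -> list:
    """prefix[j] = length of the longest proper prefix of P
    that is a suffix of P[0..j-1]; entry 0 is unused."""
    m = len(P)
    prefix = [0] * (m + 1)
    j = 0
    for i in range(1, m):
        while j > 0 and P[j] != P[i]:
            j = prefix[j]          # fall back to a shorter prefix
        if P[j] == P[i]:
            j += 1
        prefix[i + 1] = j
    return prefix

print(compute_prefix("ababaca")[1:])   # [0, 0, 1, 2, 3, 0, 1]
```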

Now, consider searching for this string, where the scan of T has proceeded up to the indicated point:

```
T = ...ababababa...
P =    ababaca
            ^
            i
```

Here the algorithm finds a mismatch at `j=5`, so it tries shifting the string forward two spaces to `j = prefix[5] = 3` to see whether the mismatched character "b" matches `P[3]` -- which it does. Therefore the algorithm is ready to step forward to the next character of `T`. On the other hand, if `T` looked a little different, we'd skip forward farther:

```
T = ...ababaxaba...
P =    ababaca
            ^
            i
```

In this case, at the mismatch, `j` is still set to `prefix[5] = 3`, but `P[3] <> "x"`, so we try again with the shorter matched string `"aba"`. `j` is set to `prefix[3] = 1`, and again `P[1] <> "x"`, so `j` is set to `prefix[1] = 0` and the inner loop terminates. Thus, all the possible matching prefixes of `P` are tried out in order from longest to shortest, and because none of them has `"x"` as a following character in `P`, the inner loop effectively shifts `P` forward by a full 6 characters.

The code above is *O*(*n*), which is not obvious. Clearly the outer loop and all the code except the inner `while` loop takes *O*(*n*) time. The question is how many times the `while` loop can be executed. The key observation is that every time the `while` loop executes, `j` decreases. But `j` is never negative, so it can be decreased at most as many times as it is increased by the statement `j := j + 1`, which is executed at most *n* times total. Therefore the inner loop also can be executed at most *n* times total, and the whole algorithm is *O*(*n*).
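Here is the whole KMP matcher as a self-contained Python sketch, with the prefix computation inlined (the name `kmp_match` and the convention of returning -1 instead of raising an exception are our own):

```python
def kmp_match(T: str, P: str) -> int:
    """Return the smallest shift at which P occurs in T, or -1 if none."""
    n, m = len(T), len(P)
    # prefix[j] = length of the longest proper prefix of P that is a
    # suffix of P[0..j-1]  (1-based; prefix[0] unused)
    prefix = [0] * (m + 1)
    j = 0
    for i in range(1, m):
        while j > 0 and P[j] != P[i]:
            j = prefix[j]
        if P[j] == P[i]:
            j += 1
        prefix[i + 1] = j
    # scan T left to right, examining each character once
    j = 0
    for i in range(n):
        while j > 0 and P[j] != T[i]:
            j = prefix[j]        # try the next shorter matched prefix
        if P[j] == T[i]:
            j += 1
        if j == m:
            return i - m + 1
    return -1
```

For instance, searching for `"ababaca"` in `"ababaxababaca"` skips past the `"x"` exactly as described above and finds the match at shift 6.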

The one remaining part of the algorithm is the computation of `prefix[j]`. Recall that `prefix[j]` is the length of the largest prefix of `P` that is a proper suffix of `P[0..j-1]`. It turns out that this array can be computed in *O*(*m*) time by an algorithm very similar to KMP itself, which makes sense because we're searching for prefixes of `P` within `P`. The key is that knowing `prefix[0]`, ..., `prefix[j]`, we can efficiently compute `prefix[j+1]`:

```
prefix[1] := 0
j := 0
for i := 1 to m-1 {
    while j > 0 and P[j] <> P[i] {
        j := prefix[j]
    }
    if P[j] = P[i] then j := j + 1
    prefix[i+1] := j
}
return prefix
```

This is clearly *O*(*m*) by the same reasoning that we used for the code above, but why does it work? This looks exactly like running KMP to search for `P` within `P` itself, except that `prefix` isn't initialized before running this code, and also we start things off at `i = 1` to avoid the trivial match at shift 0. The uninitialized state of `prefix` doesn't affect the execution because clearly, in computing `prefix[i+1]`, this code doesn't use `prefix[k]` for any `k > i`. Now, for each index `i` into the string being searched, the KMP algorithm finds the longest prefix `P[0..j-1]` of the pattern string that matches a suffix of `P[i-j..i]`. But this is exactly `prefix[i+1]` as defined.

Now let us return to the Boyer-Moore algorithm. It turns out that we can use the same prefix computation to efficiently compute the `good_suffix` array in *O*(*m*) time. Recall that `good_suffix[j]` is `m-k_j`, where `k_j` is the length of the longest proper prefix of `P` that suffix-matches with `P[j+1..m-1]`. Call the previous routine `compute_prefix`. Using it, here is the algorithm for computing `good_suffix`:

```
prefix := compute_prefix(P)
for j := 0 to m-1 {
    good_suffix[j] := m - prefix[m]
}
P' := reverse(P)
prefix' := compute_prefix(P')
for j' := 1 to m {
    j := m - prefix'[j'] - 1
    good_suffix[j] := min(good_suffix[j], j' - prefix'[j'])
}
```

Note that `prefix[m]` is the length of the longest proper prefix that matches a suffix extending through the end of `P`.

The value `prefix[m]` is the length of the largest proper prefix of `P` that could suffix-match with *any* suffix of `P`, regardless of `j`. Therefore, `m - prefix[m]` is the smallest relative shift position that corresponds to a possible suffix-match regardless of the value of `j`. The value `m - prefix[m]` therefore sets an upper bound on the value of `good_suffix[j]` -- the algorithm definitely can't shift safely beyond that point. It also captures the earliest possible case 1 suffix-match. However, it may not be the earliest possible suffix-match, because there may be an instance of case 2 with a smaller shift value.

The second loop fixes `good_suffix` so that it does not shift beyond possible case 2 matches that have a smaller shift than `m - prefix[m]`. The trick is to apply `compute_prefix` to the *reverse* of `P` to find longest suffixes. The entry `prefix'[j']` is the length of the longest suffix of `P` that matches a proper prefix of `P[m-j'..m-1]`.

Notice that `j' - prefix'[j'] > 0` and `m - prefix[m] > 0`, so `good_suffix[j] > 0`, guaranteeing that the Boyer-Moore algorithm makes progress.

A small example may help illustrate. Consider the string `P = "ADEADHEAD"`. The following table shows the relevant values computed:

```
                  A  D  E  A  D  H  E  A  D      (m = 9)
j              0  1  2  3  4  5  6  7  8  9
prefix[j]         0  0  0  1  2  0  0  1  2      (m - prefix[m] = 7)

j'                   9  8  7  6  5  4  3  2  1
prefix'[j']          2  1  3  2  1  0  0  0  0
m-prefix'[j']-1      6  7  5  6  7  8  8  8  8
j'-prefix'[j']       7  7  4  4  4  4  3  2  1
```

Thus, running the algorithm will set `good_suffix[5]`...`good_suffix[7]` to 4, and `good_suffix[8]` to 1, which is what we'd expect, because the repeating "EAD" allows case-2 suffix-matches at shift 4.
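The table above can be reproduced by a direct Python transcription of the `good_suffix` computation (the names `compute_prefix` and `compute_good_suffix` are ours):

```python
def compute_prefix(P: str) -> list:
    """prefix[j] = length of the longest proper prefix of P
    that is a suffix of P[0..j-1]  (1-based; entry 0 unused)."""
    m = len(P)
    prefix = [0] * (m + 1)
    j = 0
    for i in range(1, m):
        while j > 0 and P[j] != P[i]:
            j = prefix[j]
        if P[j] == P[i]:
            j += 1
        prefix[i + 1] = j
    return prefix

def compute_good_suffix(P: str) -> list:
    m = len(P)
    prefix = compute_prefix(P)
    # upper bound: the shift given by the longest prefix that
    # suffix-matches any suffix of P
    good_suffix = [m - prefix[m]] * m
    rprefix = compute_prefix(P[::-1])     # prefix' of the reversed pattern
    for jp in range(1, m + 1):            # jp plays the role of j'
        j = m - rprefix[jp] - 1
        good_suffix[j] = min(good_suffix[j], jp - rprefix[jp])
    return good_suffix

print(compute_good_suffix("ADEADHEAD"))   # [7, 7, 7, 7, 7, 4, 4, 4, 1]
```

The output agrees with the table: entries 5 through 7 are 4, and entry 8 is 1.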

Simple string matching isn't powerful enough for some applications. Many programming languages and operating systems offer more powerful string matching and searching facilities based on **regular expressions** (e.g., Perl, Unix grep, the search utility in Windows, Emacs (try `Meta-X search-forward-regexp`)).

Regular expressions are defined inductively in the following table. With each regular expression, we also say what strings match it; this is also an inductive definition. The table describes the syntax of regular expressions as understood by a number of standard tools. In the table, *A* and *B* refer to regular expressions and *X*, *Y* refer to strings. The top set of regular expressions are standard; the bottom set are common extensions that make regular expressions more convenient.

| Regular Expression | Matches |
| --- | --- |
| any ordinary character | that character only |
| *A*\|*B* | any string that matches either *A* or *B* |
| *AB* | any string of the form X^Y, where X matches *A* and Y matches *B* |
| *A*\* | any string of the form X1^X2^...^Xn, where each Xi matches *A* |
| (*A*) | the same strings that *A* matches |
| `.` | any single non-newline character |
| `[abc]` | any character inside the brackets (same as `(a\|b\|c)`) |
| `[^abc]` | any character not inside the brackets |
| *A*`?` | an optional *A* (same as `(`*A*`\|)`) |

Parentheses are used to avoid ambiguity in the construction of regular
expressions. Note that in the definition of the set of strings that
match `A*`, `n` can be 0. The concatenation of a
list of no strings is the empty string `""`. This is the
appropriate convention, since the empty string is the identity for
concatenation (evaluate `String.concat []` and see what you
get!), just like the empty sum is `0` and the empty product is
`1`. Thus the empty string `""` always
matches `A*` for any `A`.

Here are some examples.

```
val as_then_b = "a*.b"
val a_or_bs = "(a|b)*"
val powers_of_ten = "10*"
val nonzero_digit = "(1|2|3|4|5|6|7|8|9)"
val digit = "(0|1|2|3|4|5|6|7|8|9)"
val num = nonzero_digit ^ "(" ^ digit ^ ")*"
    (* = "(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*" *)
val even = nonzero_digit ^ "(" ^ digit ^ ")*(0|2|4|6|8)"
val lowerchar = "(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)"
val upperchar = "(A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z)"
val word = "(" ^ lowerchar ^ "|" ^ upperchar ^ ")*"
```

These regular expressions are matched by the following strings:

- `as_then_b` matches any string of zero or more `a`'s, followed by any single character, followed by a `b`
- `a_or_bs` matches any string of zero or more `a`'s or `b`'s in any order
- `powers_of_ten` matches 1, 10, 100, 1000, etc.
- `nonzero_digit` matches any single decimal digit except `0`
- `digit` matches any single decimal digit
- `num` matches the string representation of any positive integer
- `even` matches the string representation of any positive even integer
- `lowerchar` matches any lowercase character of the English alphabet
- `upperchar` matches any uppercase character of the English alphabet
- `word` matches any string of zero or more characters of the English alphabet, either lower- or uppercase, in any order.
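These patterns use only operators that modern regex engines also support, so a few of them can be sanity-checked with Python's `re` module (`re.fullmatch` requires the entire string to match, which is the notion of matching used here):

```python
import re

# patterns copied from the SML definitions above
num = "(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*"
powers_of_ten = "10*"

print(bool(re.fullmatch(num, "2048")))            # True
print(bool(re.fullmatch(num, "0123")))            # False: leading zero
print(bool(re.fullmatch(powers_of_ten, "1000")))  # True
print(bool(re.fullmatch(powers_of_ten, "11")))    # False
```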

Clearly regular expressions are a powerful way to describe textual patterns that one wants to search for. Surprisingly, it turns out that it is possible to efficiently match strings against regular expressions or to find the first occurrence of a regular expression in a given string.

Cormen, Leiserson, and Rivest. *Introduction to Algorithms*. MIT Press,
McGraw-Hill. ISBN 0-07-013151-1.

A nice visualization and discussion of the Boyer-Moore algorithm (and many
other string matching algorithms) is in Charras and Lecroq, *Exact
String Matching Algorithms*.