String matching

A common problem in text editing, DNA sequence analysis, and web crawling is finding strings inside other strings. Suppose we have a text `T` consisting of an array of characters from some alphabet Σ. For example, Σ might be just the set {0,1}, in which case the possible strings are strings of binary digits (e.g., "1001010011"); it might be {A,C,T,G}, in which case the strings are DNA sequences (e.g., "GATTACA"). The string matching problem is this: given a smaller string `P` (the *pattern*), find the occurrences of `P` in `T`. If `P` occurs in `T` at *shift* `i`, then `P[j] = T[i+j]` for all valid indices `j` of `P`.

For example, in the string "ABABABAC", the pattern string "BAB" occurs at shifts 1 and 3.

A slightly simpler version of this problem, which we will consider here, is just to find the first occurrence of `P` in `T`. Algorithms for solving this problem are adapted fairly easily to solve the more general problem.

```
signature STRING_MATCH = sig
  (* match(T,P) is the smallest shift at which
   * P occurs within T. Raises NoMatch if P doesn't
   * occur within T. *)
  exception NoMatch
  val match: string * string -> int
end
```

Of course we can phrase this problem more generally; the strings we match on can be arrays of any sort. We assume that all we can do with array elements is compare them for equality. Here is pseudo-code for a naive string matching algorithm, which steps the shift along by one and tries to compare corresponding characters.

```
for i := 0 to n-m {
    for j := 0 to m-1 {
        if P[j] <> T[i+j] then break
    }
    if j = m then return i
}
```

This pseudo-code can of course be written in SML too.
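For concreteness, here is a sketch of the naive algorithm in Python rather than SML (the name `naive_match` and the convention of returning -1 when there is no match are our own choices):

```python
def naive_match(T: str, P: str) -> int:
    """Return the smallest shift at which P occurs in T, or -1 if none."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):             # try each shift in turn
        j = 0
        while j < m and P[j] == T[i + j]:  # compare corresponding characters
            j += 1
        if j == m:                         # all m characters matched
            return i
    return -1
```

For example, `naive_match("ABABABAC", "BAB")` returns 1, the first of the two shifts noted above.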

This is clearly *O*(*nm*) in the worst
case (think of a pattern and a text that are both mostly a single character). If the pattern string is short, this
isn't a problem, but for longer patterns it can be slow. It turns out that there
are faster string matching algorithms.

A little notation: we write `P[i..j]` to mean the substring of `P` containing all the characters between indices `i` and `j` inclusive. A **prefix** of `P` is a substring `P[0..k]`, `k < m`; a **proper prefix** is `P[0..k]`, `k < m-1`. A **suffix** of `P` is a substring `P[k..m-1]`, `k >= 0`, and a **proper suffix** similarly requires `k > 0`.

The insight of the Boyer-Moore algorithm is to start matching at the *end* of the pattern string `P` rather than the beginning. When a mismatch is found, this allows the shift to be increased by more than one. Consider these two strings:

```
T = ...IN THE UNTITLED WORKS BY...
P =             EDITED
```

Matching from the back, we first encounter a mismatch at `"L"`. Because `"L"` does not occur in `P`, the shift can be increased by 4 without missing a potential match. In general, the mismatched character could be in `P`. In this case the shift can only be increased to the point where that character occurs in `P`. Here is a simplified version of Boyer-Moore using this **bad-character heuristic**:

```
s := 0   (* the current shift *)
while s <= n - m do {
    j := m - 1
    while j >= 0 and P[j] = T[s+j] do
        j := j - 1
    if j < 0 then return s
    s := s + max(1, j - last[T[s+j]])
}
```

The array `last` must be precomputed; `last[c]` is the position of the last occurrence of the character `c` in `P`, or `-1` if there is no such occurrence. It is computed in *O*(*m*+|Σ|) time, where |Σ| is the size of the alphabet Σ that strings are made up of:

```
for all c, last[c] := -1
for j := 0 to m-1 {
    last[P[j]] := j
}
```

This algorithm works well if the alphabet is reasonably big, but not too big. If the last character usually fails to match, the current shift `s` is increased by `m` each time around the loop. The total number of character comparisons is typically about *n*/*m*, which compares well with the roughly *n* comparisons that would be performed in the naive algorithm for similar problems. In fact, the longer the pattern string, the faster the search! However, the worst-case run time is still *O*(*nm*).
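A Python sketch of this simplified Boyer-Moore (bad-character heuristic only) may help make the shift rule concrete; the name `bm_bad_char`, the dictionary representation of `last`, and the -1 no-match convention are our own:

```python
def bm_bad_char(T: str, P: str) -> int:
    """Boyer-Moore search using only the bad-character heuristic.

    Returns the smallest shift at which P occurs in T, or -1 if none.
    """
    n, m = len(T), len(P)
    # last[c] = index of the last occurrence of character c in P
    last = {c: j for j, c in enumerate(P)}
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[s + j]:   # match right to left
            j -= 1
        if j < 0:
            return s
        # align the last occurrence of the mismatched text character with
        # the mismatch position (characters absent from P count as -1),
        # but always advance by at least 1
        s += max(1, j - last.get(T[s + j], -1))
    return -1
```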

The algorithm as presented doesn't work very well if the alphabet is small, because the last occurrence of any given character tends to be near the end of the string. This can be improved by using another heuristic for increasing the shift. Consider this example:

```
T = ...LIVID_MEMOIRS...
P =   EDITED_MEMOIRS
```

When the `I/E` mismatch is observed, the bad-character heuristic doesn't help at all, because the last "`I`" in `P` is already beyond the point of comparison. The algorithm above must just increment the shift by one. However, knowing that the suffix `"D_MEMOIRS"` has already been matched tells us that `P` can be shifted forward by a full `m` characters, because no suffix of `"D_MEMOIRS"` appears earlier in `P` than the point found.

Given that a suffix `S = P[j+1..m-1]` has already been matched when a mismatch is encountered, we can compute a shift forward based on where suffixes of that suffix appear in `P`. This is called the **good-suffix heuristic**: we can conservatively shift forward by the smallest shift that causes a suffix of `S` to match a prefix of `P`. We say that two strings "suffix-match" when the shorter of the two strings is a suffix of the other string. This smallest shift will correspond to the longest proper prefix of `P` that suffix-matches. Call this prefix `S'`. If `S'` is the longer of the two strings, it means that there might be a copy of `P` that starts between the current shifted start of `P` and the mismatch point. If the string `S` is the longer of the two strings, there might be a copy of `P` that starts after the mismatch point (and before or at the current shifted end of `P`). Here is an illustration of these two cases:

```
(case 1)
T = ....CABABDABAB...
P =  BABDABAB
S = ABAB, S' = BAB, largest safe shift = 8-3 = 5

(case 2)
T = ...CCABABAB...
P =  CCABABAB
S = ABAB, S' = CCABAB, largest safe shift = 8-6 = 2
```

For each possible `j`, let `k_j` be the length of the longest proper prefix that suffix-matches with `P[j+1..m-1]`. For each `j`, we then define `good_suffix[j]` as `m-k_j`; the algorithm can shift forward safely by `m-k_j` characters. Here then is the full Boyer-Moore algorithm (also available written in SML):

```
s := 0
while s <= n - m do {
    j := m - 1
    while j >= 0 and P[j] = T[s+j] do
        j := j - 1
    if j < 0 then return s
    s := s + max(good_suffix[j], j - last[T[s+j]])
}
```

In the `"MEMOIRS"` example above, `k_j` = 0, so the full `m`-character shift is safe. Consider also a less trivial case for the good-suffix heuristic:

```
T = ...BABABABA...
P =    BABACABA
```

The string `S = "ABA"`, and the longest suffix-matching prefix is `"BABA"`. Hence the possible shift forward is 8−4 = 4 characters. Notice that here the bad-character heuristic generates a useless shift of −2. The good-suffix refinement is particularly helpful if the alphabet is small. Note that `good_suffix[j]` is always positive, because necessarily `k_j < m` (if they were equal, the pattern would have been found!). This is why we no longer need to take the maximum with `1` when computing the shift to `s`.

We can compute the array `good_suffix` in time *O*(*m*), as discussed below. Therefore the total time to set up a Boyer-Moore search is still *O*(*m*+|Σ|) even with this refinement. The total run time is still not better in the worst case.

The Boyer-Moore algorithm is a good choice for many string-matching problems, but it does not offer asymptotic guarantees that are any stronger than those of the naive algorithm. If asymptotic guarantees are needed, the Knuth-Morris-Pratt algorithm (KMP) is a good alternative. The algorithm takes the following form:

```
j := 0
for i := 0 to n-1 {
    while j > 0 and P[j] <> T[i] {
        j := prefix[j]
    }
    if P[j] = T[i] then j := j + 1
    if j = m then return i - m + 1
}
```

The idea is to scan the string `T` from left to right, looking at every character `T[i]` once. The variable `j` keeps track of which character of `P` is being compared against `T[i]`, so the currently considered shift is `i-j`. When a mismatch is encountered, the characters matched so far, `S = T[i-j..i-1] = P[0..j-1]`, may be part of a match. If so, some proper suffix of the `j`-character string `S` must itself be a prefix of `P`.

Suppose that `prefix[j]` is the length of the largest prefix of `P` that is a proper suffix of `P[0..j-1]`. In this case we can increase the shift by the smallest amount that might allow a suffix of `S` to be a prefix of the new positioning of `P`, by setting `j` to `prefix[j]`. We then check whether `P[j] = T[i]`, to see whether a partial match has been obtained including the mismatched character `T[i]`. If this does not match, we try the next smaller prefix that is a proper suffix of `S`, and so on until a match is found or it is determined that no proper suffix of `S` can be part of a match, at which point `j = 0`. Note that each outer loop iteration always increases the shift `i-j` or leaves it the same, because `j` is increased at most once in the loop whereas `i` is always increased, and because `prefix[j] < j`.

To see how this works, consider the string `P = "ababaca"`. The prefix array for this string is:

```
prefix[1] = 0
prefix[2] = 0   (The only proper suffix of "ab" is "b", which is not a prefix)
prefix[3] = 1   ("a" is a proper suffix of "aba" and is a prefix)
prefix[4] = 2
prefix[5] = 3
prefix[6] = 0   (No proper suffix of "ababac" is a prefix)
prefix[7] = 1
```
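This table can be checked mechanically. The following Python transcription of the standard prefix-function computation (1-based result array with `prefix[0]` unused; the naming is ours) reproduces it:

```python
def compute_prefix(P: str) -> list:
    """prefix[j] = length of the longest proper prefix of P
    that is a suffix of P[0..j-1]; entry 0 is unused."""
    m = len(P)
    prefix = [0] * (m + 1)
    j = 0
    for i in range(1, m):
        while j > 0 and P[j] != P[i]:
            j = prefix[j]          # fall back to a shorter prefix
        if P[j] == P[i]:
            j += 1
        prefix[i + 1] = j
    return prefix

print(compute_prefix("ababaca")[1:])   # [0, 0, 1, 2, 3, 0, 1]
```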

Now, consider searching for this string, where the scan of T has proceeded up to the indicated point:

```
T = ...ababababa...
P =    ababaca
            ^
            i
```

Here the algorithm finds a mismatch at `j=5`, so it tries shifting the string forward two spaces to `j = prefix[5] = 3` to see whether the mismatched character "b" matches `P[3]` -- which it does. Therefore the algorithm is ready to step forward to the next character of `T`. On the other hand, if `T` looked a little different, we'd skip forward farther:

```
T = ...ababaxaba...
P =    ababaca
            ^
            i
```

In this case, at the mismatch, `j` is still set to `prefix[5] = 3`, but `P[3] <> "x"`, so we try again with the shorter matched string `"aba"`. `j` is set to `prefix[3] = 1`, and again `P[1] <> "x"`, so `j` is set to `prefix[1] = 0` and the inner loop terminates. Thus, all the possible matching prefixes of `P` are tried out in order from longest to shortest, and because none of them has `"x"` as a following character in `P`, the inner loop effectively shifts `P` forward by a full 6 characters.

The code above is *O*(*n*), which is not obvious. Clearly the outer loop and all the code except the inner `while` loop takes *O*(*n*) time. The question is how many times the `while` loop can be executed. The key observation is that every time the `while` loop executes, `j` decreases. But `j` is never negative, so it can be decreased at most as many times as it is increased by the statement `j := j + 1`, which is executed at most *n* times total. Therefore the inner loop also can be executed at most *n* times total, and the whole algorithm is *O*(*n*).
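Here is the whole KMP matcher as a self-contained Python sketch, with the prefix computation inlined (the name `kmp_match` and the convention of returning -1 instead of raising an exception are our own):

```python
def kmp_match(T: str, P: str) -> int:
    """Return the smallest shift at which P occurs in T, or -1 if none."""
    n, m = len(T), len(P)
    # prefix[j] = length of the longest proper prefix of P that is a
    # suffix of P[0..j-1]  (1-based; prefix[0] unused)
    prefix = [0] * (m + 1)
    j = 0
    for i in range(1, m):
        while j > 0 and P[j] != P[i]:
            j = prefix[j]
        if P[j] == P[i]:
            j += 1
        prefix[i + 1] = j
    # scan T left to right, examining each character once
    j = 0
    for i in range(n):
        while j > 0 and P[j] != T[i]:
            j = prefix[j]        # try the next shorter matched prefix
        if P[j] == T[i]:
            j += 1
        if j == m:
            return i - m + 1
    return -1
```

For instance, searching for `"ababaca"` in `"ababaxababaca"` skips past the `"x"` exactly as described above and finds the match at shift 6.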

The one remaining part of the algorithm is the computation of `prefix[j]`. Recall that `prefix[j]` is the length of the largest prefix of `P` that is a proper suffix of `P[0..j-1]`. It turns out that this array can be computed in *O*(*m*) time by an algorithm very similar to KMP itself, which makes sense because we're searching for prefixes of `P` within `P`. The key is that knowing `prefix[0]`, ..., `prefix[j]`, we can efficiently compute `prefix[j+1]`:

```
prefix[1] := 0
j := 0
for i := 1 to m-1 {
    while j > 0 and P[j] <> P[i] {
        j := prefix[j]
    }
    if P[j] = P[i] then j := j + 1
    prefix[i+1] := j
}
return prefix
```

This is clearly *O*(*m*) by the same reasoning that we used for the code above, but why does it work? This looks exactly like running KMP to search for `P` within `P` itself, except that `prefix` isn't initialized before running this code, and also we start things off at `i = 1` to avoid the trivial match at shift 0. The uninitialized state of `prefix` doesn't affect the execution because clearly, in computing `prefix[i+1]`, this code doesn't use `prefix[k]` for any `k > i`. Now, for each index `i` into the string being searched, the KMP algorithm finds the longest prefix `P[0..j-1]` of the pattern string that matches a suffix of `P[i-j..i]`. But this is exactly `prefix[i+1]` as defined.

Now let us return to the Boyer-Moore algorithm. It turns out that we can use the same prefix computation to efficiently compute the `good_suffix` array in *O*(*m*) time. Recall that `good_suffix[j]` is `m-k_j`, where `k_j` is the length of the longest proper prefix of `P` that suffix-matches with `P[j+1..m-1]`. Call the previous routine `compute_prefix`. Using it, here is the algorithm for computing `good_suffix`:

```
prefix := compute_prefix(P)
for j := 0 to m-1 {
    good_suffix[j] := m - prefix[m]
}
P' := reverse(P)
prefix' := compute_prefix(P')
for j' := 1 to m {
    j := m - prefix'[j'] - 1
    good_suffix[j] := min(good_suffix[j], j' - prefix'[j'])
}
```

Note that `prefix[m]` is the length of the longest proper prefix that matches a suffix extending through the end of `P`.

The value `prefix[m]` is the length of the largest proper prefix of `P` that could suffix-match with *any* suffix of `P`, regardless of `j`. Therefore, `m - prefix[m]` is the smallest relative shift position that corresponds to a possible suffix-match regardless of the value of `j`. The value `m - prefix[m]` therefore sets an upper bound on the value of `good_suffix[j]` -- the algorithm definitely can't shift safely beyond that point. It also captures the earliest possible case 1 suffix-match. However, it may not be the earliest possible suffix-match, because there may be an instance of case 2 with a smaller shift value.

The second loop fixes `good_suffix` so that it does not shift beyond possible case 2 matches that have a smaller shift than `m - prefix[m]`. The trick is to apply `compute_prefix` to the *reverse* of `P` to find longest suffixes. The entry `prefix'[j']` is the length of the longest suffix of `P` that matches a proper prefix of `P[m-j'..m-1]`.

Notice that `j' - prefix'[j'] > 0` and `m - prefix[m] > 0`, so `good_suffix[j] > 0`, guaranteeing that the Boyer-Moore algorithm makes progress.

A small example may help illustrate. Consider the string `P = "ADEADHEAD"`. The following table shows the relevant values computed:

```
                  A  D  E  A  D  H  E  A  D      (m = 9)
j              0  1  2  3  4  5  6  7  8  9
prefix[j]         0  0  0  1  2  0  0  1  2      (m - prefix[m] = 7)

j'                   9  8  7  6  5  4  3  2  1
prefix'[j']          2  1  3  2  1  0  0  0  0
m-prefix'[j']-1      6  7  5  6  7  8  8  8  8
j'-prefix'[j']       7  7  4  4  4  4  3  2  1
```

Thus, running the algorithm will set `good_suffix[5]`...`good_suffix[7]` to 4, and `good_suffix[8]` to 1, which is what we'd expect, because the repeating "EAD" allows case-2 suffix-matches at shift 4.
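The table above can be reproduced by a direct Python transcription of the `good_suffix` computation (the names `compute_prefix` and `compute_good_suffix` are ours):

```python
def compute_prefix(P: str) -> list:
    """prefix[j] = length of the longest proper prefix of P
    that is a suffix of P[0..j-1]  (1-based; entry 0 unused)."""
    m = len(P)
    prefix = [0] * (m + 1)
    j = 0
    for i in range(1, m):
        while j > 0 and P[j] != P[i]:
            j = prefix[j]
        if P[j] == P[i]:
            j += 1
        prefix[i + 1] = j
    return prefix

def compute_good_suffix(P: str) -> list:
    m = len(P)
    prefix = compute_prefix(P)
    # upper bound: the shift given by the longest prefix that
    # suffix-matches any suffix of P
    good_suffix = [m - prefix[m]] * m
    rprefix = compute_prefix(P[::-1])     # prefix' of the reversed pattern
    for jp in range(1, m + 1):            # jp plays the role of j'
        j = m - rprefix[jp] - 1
        good_suffix[j] = min(good_suffix[j], jp - rprefix[jp])
    return good_suffix

print(compute_good_suffix("ADEADHEAD"))   # [7, 7, 7, 7, 7, 4, 4, 4, 1]
```

The output agrees with the table: entries 5 through 7 are 4, and entry 8 is 1.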

Simple string matching isn't powerful enough for some applications. Many programming languages and operating systems offer more powerful string matching and searching facilities based on **regular expressions** (e.g., Perl, Unix grep, the search utility in Windows, Emacs (try `Meta-X search-forward-regexp`)).

Regular expressions are defined inductively in the following table. With each regular expression, we also say what strings match it; this is also an inductive definition. The table describes the syntax of regular expressions as understood by a number of standard tools. In the table, *A* and *B* refer to regular expressions and *X*, *Y* refer to strings. The top set of regular expressions are standard; the bottom set are common extensions that make regular expressions more convenient.

| Regular Expression | Matches |
| --- | --- |
| any ordinary character | that character only |
| *A*\|*B* | any string that matches either *A* or *B* |
| *AB* | any string of the form X^Y, where X matches *A* and Y matches *B* |
| *A*\* | any string of the form X1^X2^...^Xn, where each Xi matches *A* |
| (*A*) | the same strings that *A* matches |
| `.` | any single non-newline character |
| `[abc]` | any character inside the brackets (same as `(a\|b\|c)`) |
| `[^abc]` | any character not inside the brackets |
| *A*`?` | an optional *A* (same as `(`*A*`\|)`) |

Parentheses are used to avoid ambiguity in the construction of regular
expressions. Note that in the definition of the set of strings that
match `A*`, `n` can be 0. The concatenation of a
list of no strings is the empty string `""`. This is the
appropriate convention, since the empty string is the identity for
concatenation (evaluate `String.concat []` and see what you
get!), just like the empty sum is `0` and the empty product is
`1`. Thus the empty string `""` always
matches `A*` for any `A`.

Here are some examples.

```
val as_then_b = "a*.b"
val a_or_bs = "(a|b)*"
val powers_of_ten = "10*"
val nonzero_digit = "(1|2|3|4|5|6|7|8|9)"
val digit = "(0|1|2|3|4|5|6|7|8|9)"
val num = nonzero_digit ^ "(" ^ digit ^ ")*"
    (* = "(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*" *)
val even = nonzero_digit ^ "(" ^ digit ^ ")*(0|2|4|6|8)"
val lowerchar = "(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)"
val upperchar = "(A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z)"
val word = "(" ^ lowerchar ^ "|" ^ upperchar ^ ")*"
```

These regular expressions are matched by the following strings:

- `as_then_b` matches any string of zero or more `a`'s, followed by any single character, followed by a `b`
- `a_or_bs` matches any string of zero or more `a`'s or `b`'s in any order
- `powers_of_ten` matches 1, 10, 100, 1000, etc.
- `nonzero_digit` matches any single decimal digit except `0`
- `digit` matches any single decimal digit
- `num` matches the string representation of any positive integer
- `even` matches the string representation of any positive even integer
- `lowerchar` matches any lowercase character of the English alphabet
- `upperchar` matches any uppercase character of the English alphabet
- `word` matches any string of zero or more characters of the English alphabet, either lower- or uppercase, in any order.
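These patterns use only operators that modern regex engines also support, so a few of them can be sanity-checked with Python's `re` module (`re.fullmatch` requires the entire string to match, which is the notion of matching used here):

```python
import re

# patterns copied from the SML definitions above
num = "(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*"
powers_of_ten = "10*"

print(bool(re.fullmatch(num, "2048")))            # True
print(bool(re.fullmatch(num, "0123")))            # False: leading zero
print(bool(re.fullmatch(powers_of_ten, "1000")))  # True
print(bool(re.fullmatch(powers_of_ten, "11")))    # False
```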

Clearly regular expressions are a powerful way to describe textual patterns that one wants to search for. Surprisingly, it turns out that it is possible to efficiently match strings against regular expressions or to find the first occurrence of a regular expression in a given string.

Cormen, Leiserson, and Rivest. *Introduction to Algorithms*. MIT Press,
McGraw-Hill. ISBN 0-07-013151-1.

A nice visualization and discussion of the Boyer-Moore algorithm (and many
other string matching algorithms) is in Charras and Lecroq, *Exact
String Matching Algorithms*.