A common problem in text editing, DNA sequence analysis, and web crawling: finding strings
inside other strings. Suppose we have a text T consisting of an array of
characters from some alphabet S. For example, S
might be just the set {0,1},
in which case the possible strings are strings of binary digits (e.g.,
"1001010011"); it might be {A,C,T,G},
in which case the strings are DNA sequences (e.g., "GATTACA"). The string matching problem is
this: given a smaller string P (the pattern), find the occurrences of P in T.
P occurs in T at shift i if P[j] = T[i+j] for all valid indices j of P.
For example, in the string "ABABABAC", the pattern string "BAB" occurs at shifts 1 and 3.
A slightly simpler version of this problem, which we will consider here, is just to find the first
occurrence of P
in T
. Algorithms for solving this
problem are adapted fairly easily to solve the more general problem.
signature STRING_MATCH = sig
  (* match(T,P) is the smallest shift at which
   * P occurs within T. Raises NoMatch if P doesn't
   * occur within T. *)
  exception NoMatch
  val match: string * string -> int
end
Of course we can phrase this problem more generally; the strings we match on can be arrays of any sort. We assume that all we can do with array elements is compare them for equality. Here is pseudo-code for a naive string matching algorithm, which steps the shift along by one and tries to compare corresponding characters.
for i := 0 to n-m {
  for j := 0 to m-1 {
    if P[j] <> T[i+j] then break
  }
  if j = m then return i
}
which can of course be written in SML too.
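Written out in a mainstream language, the same algorithm looks like this; a Python sketch for illustration (the course's real implementation would be the SML match function, with NoMatch raised on failure):

```python
def naive_match(T, P):
    """Return the smallest shift at which P occurs in T.

    Direct transcription of the naive O(nm) algorithm: try each
    candidate shift i and compare corresponding characters.
    """
    n, m = len(T), len(P)
    for i in range(n - m + 1):                 # candidate shifts 0..n-m
        if all(P[j] == T[i + j] for j in range(m)):
            return i
    raise ValueError("NoMatch")                # P does not occur in T

print(naive_match("ABABABAC", "BAB"))          # -> 1 (first occurrence)
```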
This is clearly O(nm) in the worst case (think of a pattern and a text that are both mostly a single character). If the pattern string is short, this isn't a problem, but for longer patterns it can be slow. It turns out that there are faster string matching algorithms.
A little notation: we write P[i..j] to mean the substring of P containing all the
characters between indices i and j inclusive. A prefix of P is a substring
P[0..k], k < m; a proper prefix is P[0..k], k < m-1. A suffix of P is a
substring P[k..m-1], k >= 0, and a proper suffix similarly requires k > 0.
The insight of the Boyer-Moore algorithm is to start matching at the end of the pattern string
P
rather than the beginning. When a mismatch is found, this allows the shift to be
increased by more than one. Consider these two strings:

T = ...IN THE UNTITLED WORKS BY
P =             EDITED
Matching from the back, we first encounter a mismatch at "L"
.
Because "L"
does not occur in P
, the shift
can be increased by 4
without missing a potential match. In general, the
mismatched character could be in P
. In this case the shift can only be increased
to the point where that character occurs in P. Here is a simplified version of
Boyer-Moore using this bad-character heuristic:
s := 0   (* the current shift *)
while s <= n - m do {
  j := m - 1
  while j >= 0 and P[j] = T[s+j] do
    j := j - 1
  if j < 0 then return s
  s := s + max(1, j - last[T[s+j]])
}
The array last
must be precomputed; last[c]
is the position of
the last occurrence of the character c
in P
, or -1
if there is no
such occurrence. It is computed in O(m+|S|) time, where |S| is the size of the
alphabet S that strings are made up of:
for all c, last[c] := -1
for j := 0 to m-1 {
  last[P[j]] := j
}
This algorithm works well if the alphabet is reasonably big, but not too big.
If the last character usually fails to match, the current shift s
is increased
by m
each time around the loop. The total number of character comparisons is
typically about n/m, which compares
well with the roughly n comparisons that
would be performed in the naive algorithm for similar problems. In fact, the
longer the pattern string, the faster the search! However, the worst-case run time is still O(nm).
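In Python, the simplified (bad-character-only) Boyer-Moore above might be sketched as follows; the dictionary stands in for the last array indexed by the alphabet, with missing characters playing the role of the -1 entries:

```python
def last_occurrence(P):
    """last[c] = index of the last occurrence of c in P."""
    return {c: j for j, c in enumerate(P)}

def bm_bad_char(T, P):
    """Boyer-Moore search using only the bad-character heuristic."""
    n, m = len(T), len(P)
    last = last_occurrence(P)
    s = 0                                     # the current shift
    while s <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[s + j]:    # match right to left
            j -= 1
        if j < 0:
            return s
        # shift past the bad character; .get(..., -1) plays the role
        # of the -1 entries for characters that don't occur in P
        s += max(1, j - last.get(T[s + j], -1))
    raise ValueError("NoMatch")

print(bm_bad_char("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))   # -> 17
```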
The algorithm as presented doesn't work very well if the alphabet is small, because the last occurrence of any given character tends to be near the end of the string. This can be improved by using another heuristic for increasing the shift. Consider this example:
T = ...LIVID_MEMOIRS...
P =   EDITED_MEMOIRS
When the I/E
mismatch is observed, the bad-character heuristic
doesn't help at all because the last "I
" in P is already beyond the
point of comparison. The algorithm above must just increment by one. However,
knowing that the suffix ("D_MEMOIRS"
) has already been
matched tells us that P can be shifted forward by a full m
characters, because
no suffix of "D_MEMOIRS"
appears earlier in P than the
point found.
Given that a suffix S=P[j+1..m-1]
has already been matched when
a mismatch is encountered, we can compute a shift forward based on where suffixes of that suffix
appear in P
. This is called the good-suffix heuristic: we
can conservatively shift forward by the smallest shift that causes a suffix of S
to match a prefix of P
. We say that two strings "suffix-match" when the shorter of the two strings is a
suffix of the other string. This smallest shift corresponds to the longest proper
prefix S' of P that suffix-matches S. If S' is the longer of the two strings, there
might be a copy of P that starts between the current shifted start of P and the
mismatch point. If the string S is the longer of the two strings, there might be a
copy of P that starts after the mismatch point (and before or at the current shifted
end of P). Here is an illustration of these two cases:
(case 1)  S = ABAB, S' = BAB, largest safe shift = 8-3 = 5
T = ....CABABDABAB...
P =       BABDABAB

(case 2)  S = ABAB, S' = CCABAB, largest safe shift = 8-6 = 2
T = ...CCABABAB...
P =    CCABABAB
For each possible j, let k_j be the length of the longest proper prefix of P that
suffix-matches with P[j+1..m-1]. For each j, we then define good_suffix[j] as
m-k_j; the algorithm can shift forward safely by m-k_j characters. Here then is
the full Boyer-Moore algorithm (also available written in SML):
s := 0
while s <= n - m do {
  j := m - 1
  while j >= 0 and P[j] = T[s+j] do
    j := j - 1
  if j < 0 then return s
  s := s + max(good_suffix[j], j - last[T[s+j]])
}
In the "MEMOIRS" example above, k_j = 0, so the full m-character shift is safe.
Consider also a less trivial case for the good-suffix heuristic:
T = ...BABABABA...
P =    BABACABA
The string S="ABA"
, and the longest suffix-matching prefix is
"BABA"
. Hence the possible shift forward is 8−4
= 4 characters. Notice that here the bad-character
heuristic generates a useless shift of −2. The good-suffix
refinement is particularly helpful if the alphabet is small. Note that good_suffix[j]
is always
positive, because necessarily k < m
(if they were equal, the
pattern would have been found!). This is why we no longer need to find the
maximum with 1
when computing the shift to s
.
We can compute the array good_suffix
in time O(m),
as discussed below. Therefore the total time to set up a Boyer-Moore search is
still O(m+|S|)
even with this refinement. The total run time is still not better in the worst
case.
The Boyer-Moore algorithm is a good choice for many string-matching problems, but it does not offer asymptotic guarantees that are any stronger than those of the naive algorithm. If asymptotic guarantees are needed, the Knuth-Morris-Pratt algorithm (KMP) is a good alternative. The algorithm takes the following form:
j := 0
for i := 0 to n-1 {
  while j > 0 and P[j] <> T[i] {
    j := prefix[j]
  }
  if P[j] = T[i] then j := j + 1
  if j = m then return i - m + 1
}
The idea is to scan the string T
from left to right, looking at
every character T[i]
once. The variable j
keeps track
of what character of P
is being compared against T[i]
,
so the currently considered shift is i-j
. When a mismatch is
encountered, the characters matched so far, S
=T[i-j..i-1]=P[0..j-1]
, may be
part of a match. If so, some proper suffix of the j
-character
string S
must itself be a prefix of P
.
Suppose that prefix[j]
is the length of the largest prefix of P
that is a proper suffix
of P[0..j-1]
. In this case we can increase the shift by the smallest amount that
might allow a suffix of S
to be a prefix of the new positioning of P
,
by setting j
to prefix[j]
. We then check whether P[j]=T[i]
,
to see whether a partial match has been obtained including the mismatched
character T[i]
. If this does not match, we try the next smaller prefix that is a proper suffix
of S, and so on until a match is found or it is determined that no proper suffix
of S can be part of a match, at which point j=0. Note that each outer-loop
iteration always increases the shift i-j or leaves it the same, because j is
increased at most once in the loop whereas i is always increased, and because
prefix[j] < j.
To see how this works, consider the string P="ababaca"
.
The prefix array for this string is:
prefix[1] = 0
prefix[2] = 0   (the only proper suffix of "ab" is "b", which is not a prefix)
prefix[3] = 1   ("a" is a proper suffix of "aba" and is a prefix)
prefix[4] = 2
prefix[5] = 3
prefix[6] = 0   (no proper suffix of "ababac" is a prefix)
prefix[7] = 1
Now, consider searching for this string, where the scan of T has proceeded up to the indicated point:
T = ...ababababa...
P =    ababaca
            ^
            i
Here the algorithm finds a mismatch at j=5
, so it tries shifting the string
forward two spaces to j=prefix[5]=3
to see whether the mismatched
character "b" matches P[3] -- which it does. Therefore the algorithm
is ready to step forward to the next character of T. On the other hand, if T
looked a little different, we'd skip forward farther:
T = ...ababaxaba...
P =    ababaca
            ^
            i
In this case, at the mismatch, j
is still set to prefix[5]=3
, but P[3]<>"x"
,
so we try again with the shorter matched string "aba"
. j
is set to prefix[3]=1
, and again P[1]<>"x"
,
so j
is set to prefix[1]=0
and the inner loop
terminates. Thus, all the possible matching prefixes of P
are tried
out in order from longest to shortest, and because none of them has "x"
as a following character in P
, the
inner loop effectively shifts P
forward by a full 6 characters.
The code above is O(n), which is not obvious. Clearly the outer loop and all the code except the
inner while
loop takes O(n)
time. The question is how many times the while
loop can be executed. The key
observation is that every time the while
loop executes, j
decreases. But j
is never negative, so it can be decreased at most
as many times as it is increased by the statement j:=j+
1, which is
executed at most n
times total. Therefore
the inner loop also can be executed at most n
times total, and the whole algorithm is O(n).
The one remaining part of the algorithm is the computation of prefix[j]
.
Recall that prefix[j]
is the length of the largest prefix of P
that is a proper suffix of P[0..j-1]
. It turns out that this array can be computed in O(m)
time by an algorithm very similar to KMP itself, which makes sense because we're
searching for prefixes of P within P. The key is that knowing prefix[0]
,
..., prefix[j]
, we can efficiently compute prefix[j+1]
:
prefix[1] := 0
j := 0
for i := 1 to m-1 {
  while j > 0 and P[j] <> P[i] {
    j := prefix[j]
  }
  if P[j] = P[i] then j := j + 1
  prefix[i+1] := j
}
return prefix
This is clearly O(m) by the same
reasoning that we used for the code above, but why does it work? This looks
exactly like running KMP to search for P
within P
itself, except that prefix
isn't initialized before running this
code, and also we start things off at i=1 to avoid the trivial match at shift 0.
The uninitialized state of prefix
doesn't affect the execution
because clearly, in computing prefix[i+1]
this code doesn't use prefix[k]
for any k>i
. Now, for each index i into the string being searched, the KMP algorithm finds
the longest prefix P[0..j-1] of the pattern string that is a proper suffix of
P[0..i]. But this is exactly prefix[i+1] as defined.
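Putting the two pieces together in Python (a sketch of the pseudocode above; the arrays are shifted onto Python's zero-based lists, with prefix[0] unused):

```python
def compute_prefix(P):
    """prefix[j] = length of the longest prefix of P that is a
    proper suffix of P[0..j-1], for j = 1..m (prefix[0] is unused)."""
    m = len(P)
    prefix = [0] * (m + 1)
    j = 0
    for i in range(1, m):            # start at 1: skip the trivial match
        while j > 0 and P[j] != P[i]:
            j = prefix[j]
        if P[j] == P[i]:
            j += 1
        prefix[i + 1] = j
    return prefix

def kmp_match(T, P):
    """Return the smallest shift at which P occurs in T, in O(n+m) time."""
    prefix = compute_prefix(P)
    j = 0
    for i in range(len(T)):
        while j > 0 and P[j] != T[i]:
            j = prefix[j]            # fall back to the next shorter prefix
        if P[j] == T[i]:
            j += 1
        if j == len(P):
            return i - len(P) + 1
    raise ValueError("NoMatch")

print(compute_prefix("ababaca")[1:])   # -> [0, 0, 1, 2, 3, 0, 1]
```

The printed list reproduces the prefix table for "ababaca" worked out above.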
Now let us return to the Boyer-Moore algorithm. It turns out that we can use the same prefix
computation to efficiently compute the good_suffix
array in O(m)
time. Recall that good_suffix[j]
is m-kj
where kj
is the length of the longest proper prefix of P
that suffix-matches with P[j+1..m-1]
. Call the previous routine compute_prefix
. Using it, here is the
algorithm for computing good_suffix
:
prefix := compute_prefix(P)
for j := 0 to m-1 {
  good_suffix[j] := m - prefix[m]
}
P' := reverse(P)
prefix' := compute_prefix(P')
for j' := 1 to m {
  j := m - prefix'[j'] - 1
  good_suffix[j] := min(good_suffix[j], j' - prefix'[j'])
}
Note that prefix[m] is the length of the longest proper prefix of P that matches
up through the end of P. This is the length of the largest proper prefix of P
that could suffix-match with any suffix of P, regardless of j.
Therefore, m-prefix[m]
is the smallest relative shift position that
corresponds to a possible suffix-match regardless of the value of j
.
The value m-prefix[m]
therefore sets an upper bound on the value of
good_suffix[j]
-- the algorithm definitely can't shift safely
beyond that point. It also captures the earliest possible case 1 suffix-match.
However, it may not be the earliest possible suffix-match, because there may be
an instance of case 2 with a smaller shift value.
The second loop fixes good_suffix
to not shift beyond possible
case 2 matches that have a smaller shift than m-prefix[m]
. The trick is to apply compute_prefix
to
the reverse of P
to find longest suffixes. The entry prefix'[j']
is the length of
the longest suffix of P
that matches a proper prefix of P[m-j'..m-1]
.
Notice that j'-prefix'[j']>0
and m-prefix[m]>0
,
so good_suffix[j]>0
, guaranteeing that the Boyer-Moore algorithm
makes progress.
A small example may help illustrate. Consider the string P = "ADEADHEAD"
.
The following table shows the relevant values computed:
j                        0  1  2  3  4  5  6  7  8  9
P[j]                     A  D  E  A  D  H  E  A  D        (m = 9)
prefix[j]                   0  0  0  1  2  0  0  1  2     (m - prefix[m] = 7)
j'                       9  8  7  6  5  4  3  2  1
prefix'[j']              2  1  3  2  1  0  0  0  0
j = m - prefix'[j'] - 1  6  7  5  6  7  8  8  8  8
j' - prefix'[j']         7  7  4  4  4  4  3  2  1
Thus, running the algorithm will set good_suffix[5]...good_suffix[7]
to 4, and good_suffix[8]
to 1, which is what we'd expect because
the repeating "EAD" allows case-2 suffix-matches at shift 4.
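The whole construction can be checked mechanically. The Python sketch below repeats the compute_prefix function from the KMP section and builds good_suffix exactly as in the pseudocode above (with jp standing in for j'):

```python
def compute_prefix(P):
    """KMP prefix function: prefix[j] = length of the longest prefix
    of P that is a proper suffix of P[0..j-1] (prefix[0] unused)."""
    m = len(P)
    prefix = [0] * (m + 1)
    j = 0
    for i in range(1, m):
        while j > 0 and P[j] != P[i]:
            j = prefix[j]
        if P[j] == P[i]:
            j += 1
        prefix[i + 1] = j
    return prefix

def compute_good_suffix(P):
    m = len(P)
    prefix = compute_prefix(P)
    good = [m - prefix[m]] * m            # upper bound from case 1
    prefix_r = compute_prefix(P[::-1])    # prefixes of reverse(P) = suffixes of P
    for jp in range(1, m + 1):            # jp plays the role of j'
        j = m - prefix_r[jp] - 1
        good[j] = min(good[j], jp - prefix_r[jp])
    return good

def boyer_moore(T, P):
    """Full Boyer-Moore: bad-character and good-suffix heuristics."""
    n, m = len(T), len(P)
    last = {c: j for j, c in enumerate(P)}
    good = compute_good_suffix(P)
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[s + j]:
            j -= 1
        if j < 0:
            return s
        s += max(good[j], j - last.get(T[s + j], -1))
    raise ValueError("NoMatch")

print(compute_good_suffix("ADEADHEAD"))   # -> [7, 7, 7, 7, 7, 4, 4, 4, 1]
```

The printed array agrees with the table above: entries 5 through 7 are 4 and entry 8 is 1, with the default value m - prefix[m] = 7 everywhere else.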
Simple string matching isn't powerful enough for some applications. Many programming languages and operating systems offer more powerful string matching and searching facilities based on regular expressions (e.g., Perl, Unix grep, the search utility in Windows, and Emacs (try Meta-X search-forward-regexp)).
Regular expressions are defined inductively in the following table; with each form of regular expression, we also say what strings match it (this too is an inductive definition). The table describes the syntax of regular expressions as understood by a number of standard tools. In it, A and B refer to regular expressions and X, Y refer to strings. The top set of regular expressions is standard; the bottom set consists of common extensions that make regular expressions more convenient.
Regular Expression     | Matches
-----------------------|-----------------------------------------------------------
any ordinary character | that character only
A|B                    | any string that matches either A or B
AB                     | any string of the form X^Y, where X matches A and Y matches B
A*                     | any string of the form X1^X2^...^Xn, where each Xi matches A
(A)                    | the same strings that A matches
-----------------------|-----------------------------------------------------------
.                      | any single non-newline character
[abc]                  | any character inside the brackets (same as (a|b|c))
[^abc]                 | any character not inside the brackets
A?                     | an optional A (same as (A|))
Parentheses are used to avoid ambiguity in the construction of regular expressions. Note that in the definition of the set of strings that match A*, n can be 0. The concatenation of a list of no strings is the empty string "". This is the appropriate convention, since the empty string is the identity for concatenation (evaluate String.concat [] and see what you get!), just like the empty sum is 0 and the empty product is 1. Thus the empty string "" always matches A* for any A.
Here are some examples.
val as_then_b = "a*.b"
val a_or_bs = "(a|b)*"
val powers_of_ten = "10*"
val nonzero_digit = "(1|2|3|4|5|6|7|8|9)"
val digit = "(0|1|2|3|4|5|6|7|8|9)"
val num = nonzero_digit ^ "(" ^ digit ^ ")*"
    (* = "(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*" *)
val even = nonzero_digit ^ "(" ^ digit ^ ")*(0|2|4|6|8)"
val lowerchar = "(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)"
val upperchar = "(A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z)"
val word = "(" ^ lowerchar ^ "|" ^ upperchar ^ ")*"
These regular expressions are matched, for example, by the following strings: as_then_b by "aaab" and "cb"; a_or_bs by "", "a", and "abba"; powers_of_ten by "1" and "1000"; num by "7" and "240"; even by "42" and "100"; and word by "", "cs", and "Cornell".
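These claims are easy to check mechanically. Here is a quick sanity check as a Python sketch (Python's re module understands the core operators in the table, and fullmatch requires the entire string to match):

```python
import re

# Patterns built exactly as in the SML definitions above
nonzero_digit = "(1|2|3|4|5|6|7|8|9)"
digit = "(0|1|2|3|4|5|6|7|8|9)"
num = nonzero_digit + "(" + digit + ")*"
even = nonzero_digit + "(" + digit + ")*(0|2|4|6|8)"
a_or_bs = "(a|b)*"

assert re.fullmatch(num, "240")        # nonzero first digit, then digits
assert not re.fullmatch(num, "042")    # leading zero is rejected
assert re.fullmatch(even, "42")        # ends in an even digit
assert not re.fullmatch(even, "7")     # even requires at least two characters
assert re.fullmatch(a_or_bs, "")       # A* always matches the empty string
print("all regular-expression examples check out")
```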
Clearly regular expressions are a powerful way to describe textual patterns that one wants to search for. Surprisingly, it turns out that it is possible to efficiently match strings against regular expressions or to find the first occurrence of a regular expression in a given string.
Cormen, Leiserson, and Rivest. Introduction to Algorithms. MIT Press, McGraw-Hill. ISBN 0-07-013151-1.
A nice visualization and discussion of the Boyer-Moore algorithm (and many
other string matching algorithms) is in Charras and Lecroq, Exact
String Matching Algorithms.