A lexer generator converts a lexical specification consisting of a list of regular expressions and corresponding actions into code that breaks the input into tokens. In this lecture we examine how this conversion works.
We can think of the lexical specification as one big regular expression \(R_1 \mid R_2 \mid \dots \mid R_n\), where each \(R_i\) is the description of one kind of token.
A lexer generator works by converting this regular expression into a deterministic finite automaton (DFA). This is done in a couple of steps. First, the regular expression is converted into a nondeterministic finite automaton (NFA). The NFA is converted into a DFA, which then becomes the basis for a table-driven lexer.
Recall that a DFA is an abstract machine \((Q, Σ, δ, q_0, F)\): a finite set of states \(Q\), an input alphabet \(Σ\), a transition function \(δ : Q × Σ → Q\), an initial state \(q_0 ∈ Q\), and a set of final states \(F ⊆ Q\).
A DFA can be drawn as a labeled graph in which states are nodes, the initial state q0 is indicated by an incoming edge from outside, other edges are labeled with the corresponding input symbol, and final states in F are marked by nodes with double circles. For example, consider the following DFA, which accepts only odd numbers expressed in binary, corresponding to the regular expression (0|1)*1:
We model illegal characters by adding a non-final error state to the DFA, which we typically do not bother to draw in such a diagram. Every state has transitions to the error state on every symbol that cannot lead to a final state. Therefore, δ is a total function.
We can describe the transition function δ as a table, which hints at how we might implement the DFA:
| δ | 0 | 1 |
|---|---|---|
| q0 | q0 | q1 |
| q1 | q0 | q1 |
Pseudo-code for a DFA that reads an input of length n, where input[i] is the i'th input character, is simple and efficient. It is a loop that simply reads the appropriate table entry for each input character and updates the state accordingly:
```
q = q0
i = 1
while (i ≤ n) {
    q = δ(q, input[i])
    i = i + 1
}
if (q ∈ F) return accept
else return fail
```
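As a concrete illustration, here is a minimal sketch of such a table-driven recognizer in Python, hard-coding the transition table of the odd-binary-number DFA above. The dictionary encoding of δ and the state names are our own choices, not something a real lexer generator would emit:

```python
# Table-driven DFA for (0|1)*1, the odd-binary-number example above.
# Missing table entries play the role of the implicit error state.
DELTA = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q0", ("q1", "1"): "q1",
}
FINAL = {"q1"}

def accepts(s: str) -> bool:
    q = "q0"
    for c in s:
        q = DELTA.get((q, c))   # None means we fell into the error state
        if q is None:
            return False
    return q in FINAL

assert accepts("101") and not accepts("10")
```

Representing δ sparsely and treating missing entries as the error state keeps the table small while still making δ effectively total.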
Now the question is how to obtain the table δ from a regular expression.
The first step is to convert the regular expression into a nondeterministic finite automaton. An NFA differs from a DFA in that each state can transition to zero or more other states on each input symbol, and a state can also transition to others without reading a symbol. In the diagram representation, multiple exiting edges can be labeled with the same symbol. Edges corresponding to not reading a symbol are labeled with ε. We write q→xq′ to mean that the NFA can transition from q to q′ on symbol x.
For example, the following is an NFA:
The NFA accepts a given input stream if there is any way to reach a final state while reading the entire input. That is, it exhibits angelic nondeterminism. We imagine that when it needs to make a choice, there is an infallible oracle telling it which transitions to take. If the machine above receives the input “aba”, it can reach a final state by choosing the upper ε-transition, and staying within the top three states. Therefore the machine accepts this input. It can also accept “abc” by choosing the lower “b” transition. The machine does not accept “ac”, however, because there is no way to reach a final state while reading that input. (Challenge: Can you write a regular expression that describes exactly the strings that this NFA accepts?)
We now see how to translate a regular expression to an equivalent NFA by induction on the structure of the regular expression. That is, given that we know how to convert the subexpressions of a regular expression, we show how to use the NFAs produced by those translations to produce the NFA for the full expression.
In each case, the result of translating a regular expression will be an NFA with a single accept state, which we represent with the following diagram, in which the squiggly dashed arrow and surrounding oval represent some additional nodes and edges that are part of the NFA:
Let us write \(⟦R⟧\) to mean the translation of regular expression \(R\) to an NFA that accepts exactly the language of \(R\). (The special double brackets are known as “Oxford brackets”, and are used to express translations.) We define the translation recursively as follows, by induction on the size of the expression \(R\):
| Expression | Resulting NFA |
|---|---|
| ⟦ε⟧ | a start state with a single ε-transition to the accept state |
| ⟦a⟧ | a start state with a single transition labeled a to the accept state |
| ⟦R1R2⟧ | ⟦R1⟧ followed by ⟦R2⟧, with an ε-transition from the accept state of ⟦R1⟧ to the start state of ⟦R2⟧ |
| ⟦R1\|R2⟧ | a new start state with ε-transitions to the start states of ⟦R1⟧ and ⟦R2⟧, and ε-transitions from their accept states to a new accept state |
| ⟦R*⟧ | a new start state with ε-transitions to the start state of ⟦R⟧ and to a new accept state, plus ε-transitions from the accept state of ⟦R⟧ back to its start state and to the new accept state |
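The following Python sketch implements this inductive translation (Thompson's construction). The tuple-based regular expression AST and the edge-list NFA encoding are our own assumptions for illustration; edges labeled None play the role of ε-transitions, and each case produces a single accept state, as promised above.

```python
# Regular expressions are tuples: ("eps",), ("sym", a), ("cat", r1, r2),
# ("alt", r1, r2), ("star", r). An NFA is (start, accept, edges), where
# edges is a list of (state, label, state) and label None means ε.
counter = 0

def fresh():
    """Allocate a fresh NFA state, represented as an integer."""
    global counter
    counter += 1
    return counter

def build(r):
    """Return (start, accept, edges) for the NFA of regular expression r."""
    kind = r[0]
    if kind == "eps":
        s, f = fresh(), fresh()
        return s, f, [(s, None, f)]
    if kind == "sym":
        s, f = fresh(), fresh()
        return s, f, [(s, r[1], f)]
    if kind == "cat":
        s1, f1, e1 = build(r[1])
        s2, f2, e2 = build(r[2])
        return s1, f2, e1 + e2 + [(f1, None, s2)]
    if kind == "alt":
        s1, f1, e1 = build(r[1])
        s2, f2, e2 = build(r[2])
        s, f = fresh(), fresh()
        return s, f, e1 + e2 + [(s, None, s1), (s, None, s2),
                                (f1, None, f), (f2, None, f)]
    if kind == "star":
        s1, f1, e1 = build(r[1])
        s, f = fresh(), fresh()
        return s, f, e1 + [(s, None, s1), (s, None, f),
                           (f1, None, s1), (f1, None, f)]

# The running example (0|1)*1:
odd = ("cat", ("star", ("alt", ("sym", "0"), ("sym", "1"))), ("sym", "1"))
start, accept, edges = build(odd)
```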
By working bottom-up, we can use these translations to construct an NFA for any regular expression. For example, the odd-number regular expression above, (0|1)*1, translates to the following NFA, which clearly accepts the same strings. The unlabeled edges in the diagram are ε-transitions. (The states in this diagram are labeled with letter names for later use).
Although an NFA clearly can do anything a DFA can, the reverse is also true. We can convert an arbitrary NFA into a DFA (though the DFA may in general be exponentially larger than the NFA). The intuition is that we make a DFA that simulates all possible executions of the NFA. At any given point in the input stream, the NFA could be in some set of states. For each set of states the NFA could be in during its execution, we create a state in the DFA.
The final states of this DFA will be the states that include some final state from the NFA, since being in that DFA state means that the NFA could have reached a final state.
Since ε-transitions can be taken at any time, it is useful to have the concept of the ε-closure of an NFA state q: the set of all states reachable from q using zero or more ε-transitions. Similarly, we can take the ε-closure of a set of states by finding all states reachable from any state in the set using only ε-transitions.
For example, in the odd-number NFA above, ε-closure({F}) = {F, G, A, B, D, H}. The ε-closure of {E, J} is ε-closure({E}) ∪ ε-closure({J}) = {E, F, A, B, D, G, H, J}.
Now let us construct the corresponding DFA. The initial state of the DFA is the ε-closure of the start state of the NFA; using capital Q to denote DFA states, Q0 = ε-closure({q0}). From each DFA state Q and each input symbol x, we find the set of NFA states that can be reached by following a single x-transition from any state in Q. In general, multiple transitions may be possible; all the reached states and their ε-closures are included in the resulting DFA state. In other words, the DFA transition on input x from a state Q is defined as follows: δ(Q, x) = ε-closure({q′ | ∃q∈Q, q→xq′})
To perform the conversion, we apply this step repeatedly for every DFA state and every input symbol until no new DFA states are constructed.
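Here is a rough Python sketch of ε-closure and the subset construction, reusing the edge-list NFA encoding assumed in the previous sketch. DFA states are represented as frozensets of NFA states, and the empty frozenset plays the role of the error state ∅:

```python
from collections import deque

def eps_closure(states, edges):
    """All NFA states reachable from `states` via zero or more ε-transitions."""
    closure, work = set(states), list(states)
    while work:
        q = work.pop()
        for (p, label, p2) in edges:
            if p == q and label is None and p2 not in closure:
                closure.add(p2)
                work.append(p2)
    return frozenset(closure)

def subset_construction(start, edges, alphabet):
    """Build the DFA transition table; each DFA state is a set of NFA states."""
    Q0 = eps_closure({start}, edges)
    delta, work, seen = {}, deque([Q0]), {Q0}
    while work:
        Q = work.popleft()
        for x in alphabet:
            moved = {p2 for (p, label, p2) in edges if p in Q and label == x}
            Q2 = eps_closure(moved, edges)   # frozenset() is the error state ∅
            delta[(Q, x)] = Q2
            if Q2 not in seen:
                seen.add(Q2)
                work.append(Q2)
    return Q0, delta
```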
In our example NFA, the initial state of the resulting DFA is the ε-closure of the start state of the NFA: that is, ε-closure({S}) = {S,A,B,D,G,H}. From that set of states we can take a transition on either 0 or 1. A transition on 0 can only happen from state B to state C, so the DFA state reached is ε-closure({C}) = {C,F,A,B,D,G,H}. From either of these two DFA states, we can transition on 1 to reach NFA states E and J, so the final DFA state is ε-closure({E,J}) = {E,F,A,B,D,G,H,J}. The full DFA looks as follows:
Note that there will in general be a non-final error state ∅ capturing the case in which no NFA state is reachable using the input seen up to a certain point, though this state is not reachable for this particular example unless the alphabet includes some symbols other than 0 and 1.
In general, the DFA generated by this procedure may have more states than necessary. According to the Myhill–Nerode theorem, there is a unique minimal DFA that accepts the same language as a given DFA. This minimal DFA can be found by merging together the reachable states of the original DFA that are equivalent to each other. If the DFA contains states that cannot be reached from the start state, these unreachable states can simply be discarded immediately, since they do not affect which strings are accepted.
For any state, there is a set of strings that would be accepted if that state were the start state of the DFA. For reachable states, these strings must be suffixes of input strings accepted by the full DFA. For any reachable state of the DFA, there must be a state in the minimal DFA that accepts exactly the same input suffixes: otherwise, there would be some input prefix leading to the original state on which the minimal DFA behaves differently. Therefore, if each distinct set of input suffixes accepted by some reachable DFA state is represented by exactly one state, the resulting DFA contains the minimum possible number of reachable states needed to accept the same language. This is why we can minimize the DFA by merging states that accept the same suffixes.
Thus, two states \(q_1\) and \(q_2\) of a DFA are considered equivalent, written \(q_1 ≈ q_2\), if the machine, having reached either one of the two states by reading some input, accepts exactly the same remaining input suffixes. If the machine accepts or rejects exactly the same input when starting from those two states, then merging them together into one state cannot change the strings accepted by the machine. With this notion of equivalence, there is a unique equivalence relation associated with any given DFA. It can be used to reduce the DFA into a minimal DFA in which no states are equivalent.
We find the equivalent states by finding all the states that are not equivalent. Let us write \(q_1 ≉ q_2 \) if merging states \(q_1\) and \(q_2\) would change the language accepted by the DFA; in this case we say that \(q_1\) and \(q_2\) are distinguishable. If two states are distinguishable, there must be some string \(s\) that is accepted starting from one of the states but not from the other. Let us write \(s : q_1 ≉ q_2 \) if the string \(s\) demonstrates that the states behave differently, with one state accepting \(s\) and the other rejecting it.
Clearly, states \(q_1\) and \(q_2\) are distinguishable if one of them is final and the other is non-final, since the empty input suffix is accepted in the former case but not in the latter. That is, the suffix ε distinguishes the two states, written \( ε : q_1 ≉ q_2 \). We can express this idea as the following reasoning rule:
\[ \frac{q_1 ∈ F \qquad q_2 ∉ F}{ε : q_1 ≉ q_2} \quad \text{(Rule 1)} \]
Two states are also distinguishable if following the same symbol from each of them leads to distinguishable states.
\[ \frac{q_1' = δ(q_1, x) \qquad q_2' = δ(q_2, x) \qquad s : q_1' ≉ q_2'}{xs : q_1 ≉ q_2} \quad \text{(Rule 2)} \]
The idea is that on the same input symbol \(x\), the machine transitions from \(q_1\) to \(q_1'\) and also from \(q_2\) to \(q_2'\). Since \(q_1'\) and \(q_2'\) are distinguishable, there must be some string \(s\) on which one of them accepts and the other does not. Therefore, the states \(q_1\) and \(q_2\) are distinguishable because one of them accepts the string \(xs\) and the other does not.
We can use these two rules to derive a distinguishing suffix for every pair of distinguishable states; this can be seen by induction on the length of the shortest string on which the two states are distinguishable. Therefore, if we cannot use these rules to infer that two states are distinguishable, the states must be equivalent, and merging them will not change which strings the DFA accepts.
The algorithm keeps track of whether each pair of reachable states \(q_i\) and \(q_j\) is distinguishable, starting from the supposition that all pairs are indistinguishable. It first marks all final/non-final pairs distinguishable, using Rule 1. It then applies Rule 2, following identically labeled edges backward from distinguishable states to identify additional distinguishable pairs, until no more pairs can be marked. At that point, merging two states not marked distinguishable cannot affect which strings are accepted. Note that the rules describe how to produce witness strings that prove distinguishability, but the witness strings are not actually needed by the minimization algorithm.
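Below is a minimal Python sketch of this marking algorithm. For simplicity it rechecks all pairs until a fixed point is reached, rather than following edges backward from newly marked pairs as the more efficient formulation above suggests; the encoding of the DFA (state list, final-state set, and a dictionary for the total transition function) is our own assumption.

```python
from itertools import combinations

def distinguishable_pairs(states, finals, alphabet, delta):
    """Mark all distinguishable pairs; unmarked pairs are equivalent."""
    # Rule 1: a final and a non-final state are distinguished by ε.
    marked = {frozenset((p, q)) for p, q in combinations(states, 2)
              if (p in finals) != (q in finals)}
    # Rule 2: repeat until no new pairs can be marked.
    changed = True
    while changed:
        changed = False
        for p, q in combinations(states, 2):
            pair = frozenset((p, q))
            if pair in marked:
                continue
            # If some symbol leads to a distinguishable pair, mark this one.
            if any(frozenset((delta[p, x], delta[q, x])) in marked
                   for x in alphabet):
                marked.add(pair)
                changed = True
    return marked
```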
For the odd-number DFA, the result of this algorithm is shown in the following table, which compares all 3 possible pairs of states (a similar table of size \(\binom{n}{2}\) can be constructed for any number of states \(n\)):
| | SABDGH | CFABDGH |
|---|---|---|
| CFABDGH | | |
| EFABDGHJ | \(ε\) | \(ε\) |
By Rule 1, states SABDGH and CFABDGH are both distinguishable from EFABDGHJ using the empty string as input, as indicated by the \(ε\) in the table. Rule 2 cannot be applied to either of these pairs of distinguishable states, so all distinguishable pairs have already been identified. Since SABDGH and CFABDGH are not distinguished by any input, they can be merged, giving us exactly the 2-state DFA shown at the beginning of the notes.
Thus far we have been considering how to build a DFA for a single regular expression. However, the lexical specification for a set of tokens has the form \( R_1 \mid R_2 \mid \dots \mid R_n\), where we want to know not only what the token is but also which of the \(n\) token types matched. Further, we want to implement the longest-matching token rule, while prioritizing the patterns \(R_i\) appropriately. We start by constructing an alternation NFA, while keeping the final states distinct so they can be associated with the appropriate lexer action.
We convert this NFA to a DFA but continue to mark each final state with the corresponding action. Where two different NFA final states are both part of the DFA state, the action chosen is the one with higher priority according to the lexer specification.
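As a small illustration, assuming each NFA final state has been tagged with the index of its rule in the specification (a lower index means the rule was listed earlier and has higher priority), the action for a DFA state might be resolved as in this Python sketch; the names rule_of and actions are hypothetical:

```python
def action_for(dfa_state, rule_of, actions):
    """Return the highest-priority action for a DFA state (a set of NFA
    states), or None if the state contains no NFA final state."""
    rules = [rule_of[q] for q in dfa_state if q in rule_of]
    return actions[min(rules)] if rules else None
```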
When the lexer hits an accept state in the DFA, it remembers which accept state was encountered and keeps reading ahead. It stops only when the DFA error state ∅ is reached, because reaching ∅ means there is no way to read more symbols to build a longer token. At that point, the lexer rewinds the input back to the last final state encountered and invokes the corresponding action on the symbols seen up to that point. One way to implement this backtracking is to assume the input stream has an operation unread(c) that lets us put a character back into the stream.
To find the last final state, the lexer keeps track of all the states it has seen by pushing them onto a stack. (It might seem unnecessary, though harmless, to remember the non-final states seen along the way, but we will use them for an optimization shortly.)
```
start = i
q = q0
// read ahead until stuck
while (true) {
    input[i] = read()
    if (input[i] == EOF or δ(q, input[i]) == ∅) {
        unread(input[i])   // put the unconsumed lookahead character back
        break
    }
    if (q ∈ F) clear the stack
    push q
    q = δ(q, input[i])
    i = i + 1
}
// backtrack to the last final state
while (q ∉ F) {
    if (stack is empty) fail
    q = pop()
    i = i - 1
    unread(input[i])
}
return input[start..i-1]
```
For most lexical specifications, this algorithm is fairly efficient, taking time linear in the number of characters in the input. In the worst case, however, the algorithm can be quadratic because of backtracking. Consider what happens when the lexical specification is abc | (abc)*d and the input is these n characters: “abcabcabc... abc”. The correct result is a sequence of "abc" tokens, but for each token, the lexer reads all the way to the end of the input to find out whether there is a "d". The algorithm above therefore backtracks n/3 times, taking \(Θ(n^2)\) time.
It is possible to ensure that lexical analysis takes linear time, using an algorithm due to Tom Reps. The algorithm memoizes hopeless lexer states: if, during the backtracking phase, some non-final state \(q\) was encountered at position \(i\), there is no reason ever to try finding a token from that state and position again. We add a memoization table hopeless[q,i] to record such scanner states, and modify the algorithm above slightly to update and use this information:
```
start = i
q = q0
// read ahead until stuck
while (true) {
    if (hopeless[q, i]) break
    input[i] = read()
    if (input[i] == EOF or δ(q, input[i]) == ∅) {
        unread(input[i])   // put the unconsumed lookahead character back
        break
    }
    if (q ∈ F) clear the stack
    push q
    q = δ(q, input[i])
    i = i + 1
}
// backtrack to the last final state
while (q ∉ F) {
    hopeless[q, i] = true
    if (stack is empty) fail
    q = pop()
    i = i - 1
    unread(input[i])
}
return input[start..i-1]
```
In the example of lexing “abcabcabc...abc”, the lexer reads all the input to find the first token, but on the second and following tokens, it does not read past the characters that make up each "abc" token, since the corresponding states are known to be hopeless.
The construction above is worthwhile for recognizing regular expressions in a compiler, since the token specification does not change. However, regular expressions are frequently used in other settings where precompilation into a DFA is not worth the cost. Unfortunately, common regular expression libraries such as those relied upon by Java and Perl take exponential time in the worst case, because they use backtracking to handle the alternation operator.
A straightforward but effective way to recognize regular expressions with little precomputation is to construct the NFA from the regular expression and then to directly simulate the execution of the NFA. As each input symbol is processed, the set of possible NFA states is updated, lazily constructing the states of the (unminimized) DFA on an as-needed basis. Thus, no backtracking is necessary. It's even possible to memoize these constructed states, yielding speedup for some regular expressions.
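A sketch of this simulation, reusing eps_closure and the edge-list NFA encoding assumed earlier: the set current is exactly the DFA state that the subset construction would have built, computed lazily as the input is read.

```python
def nfa_accepts(start, accept, edges, s):
    """Simulate the NFA directly, tracking the set of possible states."""
    current = eps_closure({start}, edges)
    for x in s:
        moved = {p2 for (p, label, p2) in edges if p in current and label == x}
        current = eps_closure(moved, edges)
        if not current:
            return False   # the error state ∅: no execution can continue
    return accept in current
```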
An alternative technique is to interpret the regular expression directly as the input is read, using regular expression derivatives, an elegant technique due to Janus Brzozowski. The idea is that for any given regular expression \(R\) and input symbol \(a\), we can compute a regular expression that accepts exactly what \(R\) accepts after the symbol \(a\) has been consumed from the input; we write this derived expression with the suggestive notation ∂a R. Using ∅ to denote a regular expression that accepts no strings, and using + to denote alternation, the regular expression derivatives of the various constructs are as follows:
| | | |
|---|---|---|
| ∂a a | = ε | (ε functions like the constant 1) |
| ∂a ε | = ∅ | (the derivative of a constant is zero) |
| ∂a b | = ∅ | (where b ≠ a) |
| ∂a(R1+R2) | = ∂a R1 + ∂a R2 | (just like ordinary differentiation!) |
| ∂a(R*) | = (∂a R) R* | (like differentiating an exponential) |
| ∂a(R1 R2) | = (∂a R1) R2 + ν(R1) ∂a R2 | (like differentiating a product) |
Clearly, these equations closely mirror the usual mathematical notion of derivative. One point of divergence is the rule for concatenation; the function ν(R) is equal to ε if R accepts the empty string and ∅ otherwise.
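To make the technique concrete, here is a Python sketch of a derivative-based matcher over the same tuple AST assumed earlier, extended with ("empty",) for ∅. It applies the equations above literally (plus ∂a ∅ = ∅); a practical implementation would also simplify the intermediate expressions to keep them small.

```python
def nullable(r):
    """ν(R): does R accept the empty string?"""
    kind = r[0]
    if kind in ("eps", "star"):
        return True
    if kind in ("sym", "empty"):
        return False
    if kind == "alt":
        return nullable(r[1]) or nullable(r[2])
    if kind == "cat":
        return nullable(r[1]) and nullable(r[2])

def deriv(r, a):
    """∂a R, following the equations above."""
    kind = r[0]
    if kind in ("eps", "empty"):
        return ("empty",)
    if kind == "sym":
        return ("eps",) if r[1] == a else ("empty",)
    if kind == "alt":
        return ("alt", deriv(r[1], a), deriv(r[2], a))
    if kind == "star":
        return ("cat", deriv(r[1], a), r)
    if kind == "cat":
        left = ("cat", deriv(r[1], a), r[2])
        if nullable(r[1]):                      # the ν(R1) ∂a R2 term
            return ("alt", left, deriv(r[2], a))
        return left

def matches(r, s):
    """R accepts s iff the derivative by all of s is nullable."""
    for a in s:
        r = deriv(r, a)
    return nullable(r)

# (0|1)*1 again:
odd = ("cat", ("star", ("alt", ("sym", "0"), ("sym", "1"))), ("sym", "1"))
assert matches(odd, "101") and not matches(odd, "10")
```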
Regular expression derivatives also conveniently handle negation and intersection of regular expressions.