Top-down (LL) parsing

We saw previously that we could build a recursive-descent parser for s-expressions, implementing parsing of the following grammar:

S → x | ( L )
L → ε | S L

Recall that the symbols x, (, and ) are all terminal symbols (tokens), where x stands for any legal identifier token.

Predictive Parsing Table (PPT)

The recursive-descent parser code can be abstracted into a table that describes its actions succinctly: the predictive parsing table (PPT). The core task of a top-down parser is to use the next input token to predict which production to use, given that the upcoming input is expected to correspond to a particular nonterminal. For example, the recursive-descent parser for s-expressions can be summarized in the following PPT:
            (             x             )
S           S → ( L )     S → x         error
L           L → S L       L → S L       L → ε

Each row of this table corresponds to the different cases found inside a single parsing function in a recursive-descent parser. Alternatively, it is possible to build a non-recursive parser whose actions are driven by a PPT.

The table above is a 1-lookahead PPT; it is also possible to have a PPT that looks ahead more than one symbol. For example, a 2-lookahead PPT would in principle have a column for every possible combination of two input tokens. In practice we would want to use a more efficient representation, since the table is typically quite sparse and in a k-lookahead table the number of columns is |Σ|^k, where Σ is the set of terminal symbols (tokens).

We say that a grammar is LL(k) if a k-lookahead PPT can be constructed for it. In particular, each cell in the PPT must contain at most one production, so that the parser always knows which production to use.
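
To make the table-driven alternative concrete, here is a minimal sketch of a non-recursive LL(1) parser for the s-expression grammar. It is not taken from any particular implementation; the representation (plain strings for symbols and tokens, with "$" as an end-of-input marker) is an assumption made for illustration, and real parsers use more efficient encodings.

import java.util.*;

class TableDrivenParser {
    // PPT for the s-expression grammar: row = nonterminal, column = lookahead
    // token, cell = the right-hand side of the production to apply.
    static final Map<String, Map<String, List<String>>> PPT = Map.of(
        "S", Map.of("(", List.of("(", "L", ")"),
                    "x", List.of("x")),
        "L", Map.of("(", List.of("S", "L"),
                    "x", List.of("S", "L"),
                    ")", List.<String>of()));       // L → ε

    static boolean isTerminal(String sym) {
        return !PPT.containsKey(sym);               // nonterminals are the PPT rows
    }

    // The token list must end with the end-of-input marker "$".
    static void parse(List<String> tokens) {
        Deque<String> stack = new ArrayDeque<>(List.of("S", "$"));
        int pos = 0;
        while (!stack.isEmpty()) {
            String top = stack.pop(), lookahead = tokens.get(pos);
            if (isTerminal(top)) {                  // terminals must match the input
                if (!top.equals(lookahead)) throw new RuntimeException("syntax error");
                pos++;
            } else {                                // nonterminals are expanded via the PPT
                List<String> rhs = PPT.get(top).get(lookahead);
                if (rhs == null) throw new RuntimeException("syntax error");
                for (int i = rhs.size() - 1; i >= 0; i--)
                    stack.push(rhs.get(i));         // push in reverse: leftmost symbol on top
            }
        }
    }

    public static void main(String[] args) {
        parse(List.of("(", "x", "(", "x", ")", ")", "$"));   // accepts "(x (x))"
    }
}

The stack holds the grammar symbols the parser still expects to see; expanding a nonterminal using the PPT corresponds exactly to the call a recursive-descent parser would make at that point.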

NULLABLE

To build the PPT, some facts must be computed about the grammar. For simplicity, we consider only the LL(1) case.

Recall that we use Greek letters (e.g., α, β, γ) to represent strings of terminal and nonterminal symbols, and that the symbol ε represents the empty string.

The first fact we need about the grammar is whether each nonterminal can derive the empty string. We compute a predicate NULLABLE(X), which is true exactly when there is a way to derive the empty string from X. Given a definition of NULLABLE(X), we can determine whether an arbitrary string β is nullable: β is nullable exactly when every symbol in it is a nullable nonterminal. In particular, ε is nullable, and any string containing a terminal symbol is not.

The algorithm works as follows:

  1. Set NULLABLE(X) to false for all nonterminals X.

  2. Repeat the following step until no more changes to NULLABLE(X) are possible:

    Find some nonterminal X that has a production X→γ where γ is nullable given the current values for the predicate NULLABLE, and set NULLABLE(X) to true.

For example, consider the s-expression grammar. The algorithm converges on the correct value of NULLABLE after one iteration:

Iteration   NULLABLE(S)   NULLABLE(L)
0           false         false
1           false         true

More generally, the algorithm works because NULLABLE(X) only changes in the direction from false to true, and it is only set to true when the algorithm has proved that X is nullable. The algorithm can take at most as many iterations as there are nonterminals in the grammar, since each iteration that makes a change marks one more nonterminal as nullable. Therefore, the algorithm always terminates.

When the algorithm terminates, it has found all the nullable nonterminals. This can be seen by induction on the number of steps in the derivation of ε for each nonterminal.
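
This iterative computation is straightforward to code directly. Below is a minimal sketch in Java; the grammar representation (a map from each nonterminal to its list of productions, each production being a list of symbol names) is an assumption made here for illustration, not part of the notes.

import java.util.*;

class Nullable {
    // grammar: nonterminal -> list of productions; a production is a list of symbols.
    // Any symbol that is not a key of the map is treated as a terminal.
    static Set<String> compute(Map<String, List<List<String>>> grammar) {
        Set<String> nullable = new HashSet<>();    // step 1: nothing is nullable yet
        boolean changed = true;
        while (changed) {                          // step 2: iterate until a fixed point
            changed = false;
            for (var entry : grammar.entrySet()) {
                for (List<String> rhs : entry.getValue()) {
                    // A right-hand side is nullable when every symbol in it is nullable;
                    // terminals are never in the set, so they block nullability.
                    if (nullable.containsAll(rhs) && nullable.add(entry.getKey())) {
                        changed = true;
                    }
                }
            }
        }
        return nullable;
    }

    public static void main(String[] args) {
        // S → x | ( L )        L → ε | S L
        Map<String, List<List<String>>> g = Map.of(
            "S", List.of(List.of("x"), List.of("(", "L", ")")),
            "L", List.of(List.<String>of(), List.of("S", "L")));
        System.out.println(compute(g));            // prints [L]
    }
}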

FIRST

Recall that a top-down parser needs to predict a production based on the first symbol of the string derived using that production. Therefore, it is useful to compute the symbols that can begin the expanded form of each nonterminal. We write FIRST(X) to represent the set of terminals that can begin a string derived starting from X. This function, once known, can be extended to operate on arbitrary strings γ as follows:

FIRST(ε) = ∅
FIRST(aγ) = { a }                          (a is a terminal)
FIRST(Xγ) = FIRST(X)                       (X is a nonterminal that is not nullable)
FIRST(Xγ) = FIRST(X) ∪ FIRST(γ)            (X is a nullable nonterminal)

Since the strings derived from nonterminal X must come from one of its productions, the following equation must hold for each nonterminal X:

FIRST(X) = ⋃ { FIRST(γ) | X → γ }
         = ⋃_{X→γ} FIRST(γ)

Note that the second line is just an alternate notation for taking a union. To solve these equations for a grammar G, we use an approach similar to the computation of NULLABLE:

  1. Set FIRST(X) := ∅ for all X

  2. Repeat the following step until no change is possible:

    Apply all FIRST(X) equations as assignments, updating FIRST(X) to be ⋃ { FIRST(γ) | X → γ ∈ G}.

For example, consider the computation of FIRST for the s-expression grammar. Expanding and simplifying the equations for this grammar, we have:

FIRST(L) = FIRST(SL) ∪ FIRST(ε) = FIRST(SL) = FIRST(S)
FIRST(S) = FIRST("(L)") ∪ FIRST(x) = {"(", x}

Now, we can apply these equations using the algorithm above to find solutions iteratively:

Iteration   FIRST(S)      FIRST(L)
0           ∅             ∅
1           {"(", x}      ∅
2           {"(", x}      {"(", x}

The algorithm works for much the same reason that the algorithm for computing NULLABLE does: at each iteration, the value of FIRST(X) is a conservative approximation of the true value; nothing is ever put into a FIRST set that does not belong there. The algorithm must terminate because FIRST sets only increase in size at each iteration and they can't keep growing forever; when it does terminate, it therefore finds the minimal solution to the equations. We'll see later that both of these algorithms (and others) are instances of the Fixed Point Theorem; a set of equations can be solved in this iterative fashion when they satisfy certain conditions.
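
The FIRST computation can be sketched in the same style as NULLABLE, again using the assumed grammar representation from the previous sketch (not part of the notes).

import java.util.*;

class First {
    static Map<String, Set<String>> compute(Map<String, List<List<String>>> grammar,
                                            Set<String> nullable) {
        Map<String, Set<String>> first = new HashMap<>();
        for (String x : grammar.keySet())
            first.put(x, new HashSet<>());                 // step 1: FIRST(X) := ∅

        boolean changed = true;
        while (changed) {                                  // step 2: iterate to a fixed point
            changed = false;
            for (var entry : grammar.entrySet()) {
                for (List<String> rhs : entry.getValue()) {
                    // FIRST of the string rhs: scan symbols left to right, stopping at
                    // the first one that is not nullable (terminals are never nullable).
                    for (String sym : rhs) {
                        Set<String> add = grammar.containsKey(sym)
                                ? first.get(sym)           // nonterminal: use its FIRST set
                                : Set.of(sym);             // terminal: just the symbol itself
                        changed |= first.get(entry.getKey()).addAll(add);
                        if (!nullable.contains(sym)) break;
                    }
                }
            }
        }
        return first;
    }
}

Running this on the s-expression grammar yields FIRST(S) = FIRST(L) = {"(", x}, matching the table above (the exact number of passes depends on the order in which the nonterminals are visited).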

FOLLOW

All of the cells in the s-expression PPT are now explained by FIRST, except for one: the cell for nonterminal L with lookahead ")", where the parser is supposed to predict the production L→ε. This production is chosen because the symbol ")" can follow the derivation of L somewhere in a string derived from the start symbol S.

In general, we will need to know what set of terminals can follow a nonterminal X. We call this set FOLLOW(X). There are three sources of symbols in this set:

  1. If X appears on the right-hand side of a production Y→αXβ, then any symbol in FIRST(β) is in FOLLOW(X).

  2. If X appears in a production Y→αXβ and β is nullable, then any symbol in FOLLOW(Y) is also in FOLLOW(X).

  3. If X is the start symbol of the grammar, then the special end-of-input symbol (which we represent as $) is in FOLLOW(X).

These three sources can be interpreted as the basis of an equation defining FOLLOW(X) as the union of three sets. Expanding these equations for our example grammar, we have:

FOLLOW(S) = FIRST(L) ∪ FOLLOW(L) ∪ { $ } = FOLLOW(L) ∪ { $, "(", x }
FOLLOW(L) = FIRST(")") ∪ FOLLOW(L) ∪ ∅ = FOLLOW(L) ∪ { ")" }

These equations can be solved iteratively just like those for NULLABLE and FIRST:

  1. Initialize FOLLOW(X) := ∅ for all X.

  2. Apply the equations for FOLLOW(X) repeatedly as assignments, using the current value of FOLLOW(X) to interpret the right-hand side of each equation, until no further change occurs.

Iteration   FOLLOW(S)           FOLLOW(L)
0           ∅                   ∅
1           {$, "(", x}         { ")" }
2           {$, "(", x, ")"}    { ")" }

Notice that because FOLLOW(L) appears on both the left- and right-hand sides of its equation, other solutions to the equations could be found. However, the iterative solving technique produces the smallest set satisfying the equations.
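
The FOLLOW computation fits the same iterative pattern. The sketch below reuses the assumed grammar representation together with the NULLABLE and FIRST results from the earlier sketches; it is an illustration, not the notes' implementation.

import java.util.*;

class Follow {
    static Map<String, Set<String>> compute(Map<String, List<List<String>>> grammar,
                                            Set<String> nullable,
                                            Map<String, Set<String>> first,
                                            String startSymbol) {
        Map<String, Set<String>> follow = new HashMap<>();
        for (String x : grammar.keySet())
            follow.put(x, new HashSet<>());                // initialize FOLLOW(X) := ∅
        follow.get(startSymbol).add("$");                  // source 3: $ follows the start symbol

        boolean changed = true;
        while (changed) {
            changed = false;
            for (var entry : grammar.entrySet()) {
                String y = entry.getKey();                 // production Y → α X β
                for (List<String> rhs : entry.getValue()) {
                    for (int i = 0; i < rhs.size(); i++) {
                        String x = rhs.get(i);
                        if (!grammar.containsKey(x)) continue;   // only nonterminals have FOLLOW
                        boolean betaNullable = true;
                        for (int j = i + 1; j < rhs.size(); j++) {
                            String sym = rhs.get(j);
                            // source 1: FIRST(β) ⊆ FOLLOW(X)
                            Set<String> add = grammar.containsKey(sym)
                                    ? first.get(sym) : Set.of(sym);
                            changed |= follow.get(x).addAll(add);
                            if (!nullable.contains(sym)) { betaNullable = false; break; }
                        }
                        // source 2: if β is nullable, FOLLOW(Y) ⊆ FOLLOW(X)
                        if (betaNullable)
                            changed |= follow.get(x).addAll(follow.get(y));
                    }
                }
            }
        }
        return follow;
    }
}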

Constructing the PPT

Using the FIRST and FOLLOW sets for the various nonterminals, it is straightforward to build the PPT. A production X→γ is placed into the PPT cell for nonterminal X and terminal a in either of two cases:

  1. If a∈FIRST(γ)

  2. If a∈FOLLOW(X) and γ is nullable.

Applying these rules to the s-expression grammar, we obtain the PPT given at the beginning.

If each cell contains at most one production, then the grammar is LL(1) and can be parsed by a recursive-descent parser. If a cell (X, a) contains no production, the parser detects a syntax error whenever it is parsing an X and sees the terminal symbol a. If a cell contains two or more productions, the grammar cannot be parsed with only one lookahead token; it is not LL(1). Either the grammar must be rewritten or a more powerful parsing technology is needed.
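
Putting the pieces together, a PPT builder is a short function over the sets computed above. The sketch below uses the same assumed representations as the earlier sketches; each cell holds a list of candidate productions, so an LL(1) conflict shows up as a cell with more than one entry.

import java.util.*;

class PPTBuilder {
    // Returns nonterminal -> (lookahead terminal -> candidate productions).
    static Map<String, Map<String, List<List<String>>>> build(
            Map<String, List<List<String>>> grammar,
            Set<String> nullable,
            Map<String, Set<String>> first,
            Map<String, Set<String>> follow) {
        Map<String, Map<String, List<List<String>>>> table = new HashMap<>();
        for (var entry : grammar.entrySet()) {
            String x = entry.getKey();
            table.put(x, new HashMap<>());
            for (List<String> rhs : entry.getValue()) {
                // Compute FIRST(γ) and whether γ is nullable, scanning γ left to right.
                Set<String> lookaheads = new HashSet<>();
                boolean gammaNullable = true;
                for (String sym : rhs) {
                    lookaheads.addAll(grammar.containsKey(sym) ? first.get(sym) : Set.of(sym));
                    if (!nullable.contains(sym)) { gammaNullable = false; break; }
                }
                // Condition 1: a ∈ FIRST(γ).  Condition 2: γ is nullable and a ∈ FOLLOW(X).
                if (gammaNullable) lookaheads.addAll(follow.get(x));
                for (String a : lookaheads)
                    table.get(x).computeIfAbsent(a, k -> new ArrayList<>()).add(rhs);
            }
        }
        return table;
    }
}

The grammar is LL(1) exactly when every cell built this way holds at most one production; a cell that is never filled in corresponds to an error entry in the table.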

Fixing grammars to be LL(1)

We've already discussed how to change grammars to be unambiguous for some common cases. However, even unambiguous grammars can fail to be LL(1). One common problem is that two productions from a given nonterminal can have overlapping FIRST sets. For example, if the grammar contains productions S→E and S→E+S, clearly FIRST(E) and FIRST(E+S) are not disjoint. The solution to this problem is often left-factoring. We introduce a new nonterminal that captures the part of the productions that is not in common. For example,

S → E | E + S
becomes, after the introduction of a new nonterminal T, compatible with LL parsing:
S → E T
T → ε | + S
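
For intuition, here is how the left-factored grammar reads as recursive-descent code, sketched using the same hypothetical helper methods (next, consume, parseE, and the PLUS token) that appear in the loop-based parser later in these notes.

void parseS() {
    parseE();                       // S → E T
    parseT();
}

void parseT() {
    if (next().equals(PLUS)) {      // T → + S
        consume(PLUS);
        parseS();
    }
    // otherwise T → ε: consume nothing and return
}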

Another problem that commonly arises is left-associativity, which is naturally expressed using left-recursive productions. Left-recursive productions automatically cause conflicts among FIRST sets, however: for a production like S→S+E, we have FIRST(S+E) ⊇ FIRST(S), so its lookahead set overlaps with that of every other production for S. A commonly useful trick is to rewrite left-recursive productions into right-recursive form, and then to fix the associativity in the step where the parse tree is converted into the abstract syntax tree.

For example, suppose that we have the grammar

S → E | S + E

We can write a right-recursive grammar that accepts the same language but generates a right-associative parse tree:

S → E | E + S

Abusing regular-expression syntax, we could also write a grammar that uses Kleene star to achieve the same effect (the parentheses and asterisk are meta-syntax here):

S → E ( + E )*

This grammar suggests that we can parse nonterminal S using a loop instead of recursion, along the following lines:

void parseS() {
    parseE();                          // the leading E in E ( + E )*
    while (next().equals(PLUS)) {      // loop as long as another "+" follows
        consume(PLUS);
        parseE();
    }
}

By extending this code appropriately, we can construct an AST with whatever associativity is desired. In particular, it is easy to construct a left-associative AST:

Expr parseS() {
    Expr result = parseE();
    while (next().equals(PLUS)) {
        consume(PLUS);
        // fold each new operand into the left side, building a left-leaning tree
        result = new Plus(result, parseE());
    }
    return result;
}