So far we've been looking at top-down parsing. A top-down parser is limited in the grammars it can handle because it must be able to commit to predicted productions high in the parse tree based on relatively little information. For example, it must be able to choose the correct production from the start symbol based on the first token in the input.
This limitation motivates bottom-up parsing, in which the parser can choose productions after seeing more input. For example, bottom-up parsers can handle left-recursive productions whereas top-down parsers cannot. In fact, bottom-up parsers work best when productions are left-recursive, although they can handle right-recursive productions too. In the following figure, the shaded areas represent the part of the parse tree that must be predicted when some of the input has been read.
A powerful yet ingeniously simple bottom-up parser is the one due to Jay Earley. It can parse all context-free grammars, including ambiguous ones. Its worst-case time is cubic in the input length, but for most grammars encountered in practice, it takes linear time if implemented carefully.
The idea of the Earley parser is to keep track of all possible parses as the input is read. We can think of the parser as one that, unlike an LL parser, doesn't try to predict just one production; instead, it blindly predicts all productions regardless of lookahead, spawning a new concurrent thread to handle each such prediction. Spawning these threads would work in principle and would be able to parse any grammar, but would be extremely inefficient, because an exponential number of threads would be created in general. Moreover, a huge amount of work would be repeated by different threads. So instead, the Earley parser simulates the work that those threads would do while ensuring that the work is only done once.
The Earley parser proceeds one input position at a time, keeping track of all the information that would be needed to simulate the parsing threads. For each input position j, an Earley parser builds up a set of items Ij representing the state of productions that might be used in the derivation. An Earley item has the form [A → β.γ, k]. The first part of the item is a production A→βγ from the grammar, with a dot located somewhere on the right-hand side of the production to show how much of that production has been completed. For a production with an empty right-hand side, like X→ε, the first part of the item is just the dot: [X→ . , k]. The second part of the item, k, is the position in the input where the parsing of the nonterminal A started, measured as the number of tokens read up to that point. We can think of k as recording the position where the thread for this item was spawned.
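To make this concrete, an Earley item can be represented directly as a small record. Here is one possible Python representation (the field names are our own choices, not part of any standard API):

```python
from collections import namedtuple

# An Earley item [A -> beta . gamma, k]:
#   lhs:   the nonterminal A
#   rhs:   the production's right-hand side, as a tuple of symbols
#   dot:   how many symbols of rhs have been matched so far
#   start: the input position k where parsing of A began
Item = namedtuple("Item", "lhs rhs dot start")
```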
The algorithm starts by initializing the set I0 to contain the single item [S′ → . S, 0], where S is the start symbol of the grammar and S′ is a new artificial symbol introduced to simplify the algorithm. The algorithm successfully parses the sequence of n input tokens a0a1...an−1 if the item [S′ → S . , 0] is in the set In.
The algorithm considers each input position j from 0 up to n. At each input position, it builds up the set Ij by applying the following three rules as many times as possible, until none of the rules has any effect:

- Predict: if [A → β . C γ, k] is in Ij and C → δ is a production of the grammar, add the item [C → . δ, j] to Ij.
- Scan: if [A → β . a γ, k] is in Ij and the next input token aj is the terminal a, add the item [A → β a . γ, k] to Ij+1.
- Complete: if [A → γ . , k] is in Ij, then for every item [B → β . A δ, m] in Ik, add the item [B → β A . δ, m] to Ij.
These are the same three actions that we saw in the table-driven LL parser, but unlike that parser, the Earley parser does not have to commit to its predictions. Instead, it handles multiple predictions—and even completions—in parallel. The predictions that do not pan out effectively die out because at some point they lead to items that are incompatible with the input tokens encountered.
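Continuing the sketch above, here is a minimal Earley recognizer in Python. The grammar encoding (a dict from each nonterminal to a list of right-hand sides) and the function name are our own illustrative choices; also, this simple worklist version can miss completions involving ε-productions, a case the Aycock–Horspool refinement discussed below addresses.

```python
def earley_recognize(grammar, start, tokens):
    """Return True iff tokens is derivable from start.
    grammar maps each nonterminal to a list of right-hand sides (tuples);
    any symbol not in grammar is treated as a terminal."""
    n = len(tokens)
    sets = [set() for _ in range(n + 1)]
    sets[0].add(Item("S'", (start,), 0, 0))      # initial item [S' -> . S, 0]
    for j in range(n + 1):
        worklist = list(sets[j])
        while worklist:
            item = worklist.pop()
            if item.dot < len(item.rhs):
                sym = item.rhs[item.dot]
                if sym in grammar:
                    # Predict: add [C -> . delta, j] for each production C -> delta.
                    for rhs in grammar[sym]:
                        new = Item(sym, tuple(rhs), 0, j)
                        if new not in sets[j]:
                            sets[j].add(new)
                            worklist.append(new)
                elif j < n and tokens[j] == sym:
                    # Scan: move the dot over the matching token into I_{j+1}.
                    sets[j + 1].add(item._replace(dot=item.dot + 1))
            else:
                # Complete: advance the items in I_k that were waiting on this
                # nonterminal, adding the advanced items to I_j.
                for waiting in list(sets[item.start]):
                    if (waiting.dot < len(waiting.rhs)
                            and waiting.rhs[waiting.dot] == item.lhs):
                        new = waiting._replace(dot=waiting.dot + 1)
                        if new not in sets[j]:
                            sets[j].add(new)
                            worklist.append(new)
    return Item("S'", (start,), 1, 0) in sets[n]

# For the example grammar used below (numbers tokenized as "n"):
# earley_recognize({"S": [("S", "+", "E"), ("E",)],
#                   "E": [("n",), ("(", "S", ")")]},
#                  "S", ["(", "n", ")", "+", "n"])   # ==> True
```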
Consider the following grammar, which is not LL(1):
S → S + E | E
E → n | ( S )

Given the input (1) + 2, the Earley parser proceeds as follows:
| Set | Item | Pos | Reason | Step # |
|---|---|---|---|---|
| I0 | S' → . S | 0 | initial item | 1 |
| | S → . S + E | 0 | predict S | 2 (from 1) |
| | S → . E | 0 | predict S | 3 (from 1 and also from 2) |
| | E → . n | 0 | predict E | 4 (from 3) |
| | E → . ( S ) | 0 | predict E | 5 (from 3) |
| I1 | E → ( . S ) | 0 | scan ( | 6 (from 5) |
| | S → . S + E | 1 | predict S | 7 (from 6) |
| | S → . E | 1 | predict S | 8 (from 6 and also from 7) |
| | E → . n | 1 | predict E | 9 (from 8) |
| | E → . ( S ) | 1 | predict E | 10 (from 8) |
| I2 | E → n . | 1 | scan 1 | 11 (from 9) |
| | S → E . | 1 | complete E | 12 (from 11, 8) |
| | E → ( S . ) | 0 | complete S | 13 (from 12, 6) |
| | S → S . + E | 1 | complete S | 14 (from 12, 7) |
| I3 | E → ( S ) . | 0 | scan ) | 15 (from 13) |
| | S → E . | 0 | complete E | 16 (from 15, 3) |
| | S' → S . | 0 | complete S | 17 (from 16, 1) |
| | S → S . + E | 0 | complete S | 18 (from 16, 2) |
| I4 | S → S + . E | 0 | scan + | 19 (from 18) |
| | E → . n | 4 | predict E | 20 (from 19) |
| | E → . ( S ) | 4 | predict E | 21 (from 19) |
| I5 | E → n . | 4 | scan 2 | 22 (from 20) |
| | S → S + E . | 0 | complete E | 23 (from 22, 19) |
| | S' → S . | 0 | complete S: success! | 24 (from 23, 1) |
| | S → S . + E | 0 | complete S | 25 (from 23, 2) |
We can visualize the action of the parser by drawing a graph showing how the various steps depend on each other:
Here, the scan actions are annotated with the token scanned in blue. The dashed arrows show the item advanced by each completion and are labeled with the nonterminal being completed. As this graph shows, the parser explores some states that do not lead to the successful parse, but these states (4, 10, 14, 17, and 21) quickly get stuck because they are unable to scan matching input. Removing these states and looking just at the sequence of completed productions leading to the success at step 24, we see that they describe a derivation of the string:
S' → S → S+E → S+2 → E+2 → (S)+2 → (E)+2 → (1)+2

where the successive expansions are the work of parser steps 24, 23, 18–22, 16, 13–15, 12, and 1–11, respectively.
Note that at each step, the rightmost nonterminal is expanded using some production. However, the parser completes these productions backward, so it generates the rightmost derivation backward, starting from the input and ending with the start symbol. This backward construction of a rightmost derivation is what bottom-up parsers do.
Earley parsers run in worst-case time O(n³) when carefully implemented; on unambiguous grammars they take worst-case time O(n²). Intuitively, the reason Earley parsers are polynomial rather than exponential in the worst case is that they share work among the different "threads". For example, notice that the single completion of production S → E. at step 12 does work for two threads that fork off at step 6 to states 7 and 8. The "7" thread gets stuck eventually because it does not see the expected "+" on the input, but the other thread continues to the end.
On LL or LR grammars, Earley parsers are even more practical, taking linear time when carefully implemented. Our example grammar turns out to be an unambiguous LR grammar, which is why the graph of states above is mostly linear and little work is wasted.
One simple and useful optimization is to complete productions only when the lookahead token is in the FOLLOW set of the nonterminal in question. It can also be helpful to memoize cascaded prediction steps.
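In terms of the recognizer sketched earlier, the first of these optimizations amounts to a guard on the Complete rule. A minimal sketch, assuming a precomputed follow map from each nonterminal to its FOLLOW set (the FOLLOW computation itself is not shown):

```python
def may_complete(item, tokens, j, follow):
    """Gate the Complete rule: advance the waiting items for [A -> gamma ., k]
    only if the lookahead token can legally follow A; otherwise this completion
    cannot be part of a valid parse and may be skipped."""
    lookahead = tokens[j] if j < len(tokens) else "$"
    return lookahead in follow[item.lhs]
```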
A further improvement that avoids redundant computation is suggested by Aycock and Horspool [Practical Earley Parsing]. In the PREDICT step, when the predicted symbol C is nullable, the item [A→βC.γ, k] is immediately added to the set Ij as well. They report that with this and other optimizations, Earley parsing is only 50% slower than the Bison LALR parser, a reasonable tradeoff given the improvement in parsing power.
LR parsing, including LALR parsing, is a popular way to parse. It is the underlying technology in a number of parser generators, such as yacc, Bison, and CUP. We can view LR parsing as an optimization of Earley parsing, in which all predictions are precomputed. LR parsing consists of two actions: shift and reduce, so LR parsers are called shift–reduce parsers. The shift action corresponds to the Earley scan action, followed by as much prediction as possible; the reduce action corresponds to the complete action, again followed by prediction.
Rather than keeping track of all possible rightmost derivations, a shift-reduce parser keeps track of just one. This limitation allows the state of the parser to be simpler than in an Earley parser: the state of the parser is based on a stack of symbols. At any point during a shift-reduce parse, the current sentential form in the derivation is the concatenation of the stack with the unconsumed input. For example, we can write the parse of the example expression (1)+2 as a series of shift and reduce operations that build exactly the same rightmost derivation as the Earley parser did above:
| Action | Derivation | Stack | Unconsumed input |
|---|---|---|---|
| | (1) + 2 | | (1) + 2 |
| shift | (1) + 2 | ( | 1) + 2 |
| shift | (1) + 2 | (1 | ) + 2 |
| reduce E→n | (E) + 2 | (E | ) + 2 |
| reduce S→E | (S) + 2 | (S | ) + 2 |
| shift | (S) + 2 | (S) | + 2 |
| reduce E→(S) | E + 2 | E | + 2 |
| reduce S→E | S + 2 | S | + 2 |
| shift | S + 2 | S + | 2 |
| shift | S + 2 | S + 2 | |
| reduce E→n | S + E | S + E | |
| reduce S→S+E | S | S | |
To determine which action to take at each step, the parser needs to know which productions it might reduce in the future. These possible future productions form a set of items, closed under prediction; this item set tells us which action makes sense at a given point in the parse. To map a stack to its set of items, these item sets can be interpreted as the states of a deterministic finite automaton that reads the stack. These states look similar to Earley sets, but have the property that, ignoring prediction, only one meaningful action can be taken in each state: either to shift (scan) or to reduce (complete) a particular item.
The initial state of the LR(0) automaton corresponds to the Earley set I0, except that we add the end-of-input symbol to the top-level production. Just like the Earley set, it is closed under prediction.
S' → . S $
S → . S + E
S → . E
E → . n
E → . ( S )
The automaton is constructed by repeating the following procedure until no further change is possible. Choose a state and a symbol that appears to the right of the dot in one or more of its items. Shift the dot past that symbol in those items and close the resulting item set under prediction. The resulting item set may be either an existing automaton state or a new one. In either case, add a transition on the chosen symbol from the old state to the state for this item set.
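This construction can be written down directly. Below is a rough Python sketch, reusing the grammar encoding from the Earley sketch above; here LR(0) items are (lhs, rhs, dot) triples with no start position, and all names are our own illustrative choices:

```python
def closure(grammar, items):
    """Close an item set under prediction: whenever the dot is before a
    nonterminal C, add [C -> . delta] for every production C -> delta."""
    items = set(items)
    worklist = list(items)
    while worklist:
        lhs, rhs, dot = worklist.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for prod in grammar[rhs[dot]]:
                new = (rhs[dot], tuple(prod), 0)
                if new not in items:
                    items.add(new)
                    worklist.append(new)
    return frozenset(items)

def goto(grammar, state, sym):
    """Shift the dot past sym in every item that allows it, then close."""
    moved = {(lhs, rhs, dot + 1)
             for (lhs, rhs, dot) in state
             if dot < len(rhs) and rhs[dot] == sym}
    return closure(grammar, moved) if moved else None

def lr0_automaton(grammar, start):
    """Build the LR(0) states and transitions for the grammar augmented
    with the extra production S' -> start $."""
    initial = closure(grammar, {("S'", (start, "$"), 0)})
    states, transitions, worklist = [initial], {}, [initial]
    while worklist:
        state = worklist.pop()
        for sym in {rhs[dot] for (_, rhs, dot) in state if dot < len(rhs)}:
            target = goto(grammar, state, sym)
            if target not in states:
                states.append(target)
                worklist.append(target)
            transitions[(state, sym)] = target
    return states, transitions
```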
For example, taking a transition on the token ( from the initial state shifts just the final item right, yielding a state containing the item E → ( . S ), which we then close under prediction, adding items for S and then E. Taking a transition from the same state on the nonterminal S shifts the first two items right to obtain the items S' → S . $ and S → S . + E; prediction closure adds no items to this state. The following figure shows the steps in completing the automaton. In the figure, the states are numbered in green, and states in which a reduce action is to be performed have a thicker border.
We can summarize this automaton in two tables that can be used to drive a parser. The first is the action table, which tells the parser what to do in each state, given the lookahead symbol: either shift and go to a new state, or else reduce some production. The second table is the goto table, which tells the parser what state to go to when it reduces a production. Below are these two tables for the automaton we just constructed, side by side. Empty entries in the tables correspond to syntax errors. Numeric entries in the action table correspond to shift actions; a production in the action table indicates a reduce action.
(Columns n, +, (, ), and $ form the action table; columns S and E form the goto table.)

| State | n | + | ( | ) | $ | S | E |
|---|---|---|---|---|---|---|---|
| 0 | 1 | | 2 | | | 8 | 3 |
| 1 | E→n | E→n | E→n | E→n | E→n | | |
| 2 | 1 | | 2 | | | 4 | 3 |
| 3 | S→E | S→E | S→E | S→E | S→E | | |
| 4 | | 5 | | 6 | | | |
| 5 | 1 | | 2 | | | | 7 |
| 6 | E→(S) | E→(S) | E→(S) | E→(S) | E→(S) | | |
| 7 | S→S+E | S→S+E | S→S+E | S→S+E | S→S+E | | |
| 8 | | 5 | | | 9 | | |
| 9 | S'→S$ | S'→S$ | S'→S$ | S'→S$ | S'→S$ | | |
A shift-reduce parser operates using these two tables. It has a stack of states, rather than just a current state as in a finite automaton. It starts with the stack containing only the initial state, and no input consumed. As it reads each input token, it looks up in the action table the current action for the state on top of the stack, given the next token. If the action is a shift, it pushes the specified state onto the stack. If the action is a reduce, it pops one state off the stack for each symbol on the right-hand side of the production being reduced, then uses the goto table entry for the state now on top of the stack to determine the next state to push onto the stack.
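As a minimal sketch, the driver loop might look like this in Python. The table encoding is our own assumption: action[state][token] holds either ('shift', s) or ('reduce', A, n), where n is the length of the production's right-hand side, goto_[state][A] holds the state to enter after reducing A, and the grammar is augmented with S' → S $ as above.

```python
def shift_reduce_parse(action, goto_, tokens):
    """Run a table-driven LR parse; return True iff the input is accepted."""
    stack = [0]                          # stack of states; 0 is the initial state
    i = 0
    while True:
        token = tokens[i] if i < len(tokens) else "$"
        entry = action[stack[-1]].get(token)
        if entry is None:
            return False                 # empty table entry: syntax error
        if entry[0] == "shift":
            stack.append(entry[1])       # push the specified state
            i += 1
        else:
            _, lhs, rhs_len = entry
            if lhs == "S'":
                return True              # reduced the start production: accept
            if rhs_len:
                del stack[-rhs_len:]     # pop one state per right-hand-side symbol
            stack.append(goto_[stack[-1]][lhs])
```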
Let's try this parser on the earlier example, (1)+2. The initial stack is just 0. To make the trace easier to read, each state on the stack is preceded by the symbol that was followed to reach it; these symbols correspond to the part of the derivation that has been constructed so far by the parser.
| Stack | Unconsumed input | Action |
|---|---|---|
| 0 | (1)+2 | shift 2 |
| 0 (2 | 1)+2 | shift 1 |
| 0 (2 11 | )+2 | reduce E→n |
| 0 (2 E3 | )+2 | reduce S→E |
| 0 (2 S4 | )+2 | shift 6 |
| 0 (2 S4 )6 | +2 | reduce E→(S) |
| 0 E3 | +2 | reduce S→E |
| 0 S8 | +2 | shift 5 |
| 0 S8 +5 | 2 | shift 1 |
| 0 S8 +5 21 | $ | reduce E→n |
| 0 S8 +5 E7 | $ | reduce S→S+E |
| 0 S8 | $ | shift 9 |
| 0 S8 $9 | | reduce S'→S$ |
Notice that in the LR construction we've seen so far, each reduce state reduces a single fixed production, regardless of the next token on the input. As a result, the states containing reductions (states 1, 3, 6, 7, and 9 in the example) all ignore the lookahead token. Hence, we call this an LR(0) parser. By filling in the action and goto tables in a less constrained way, we could obtain a more powerful LR(1) parser. However, we will need to extend the LR(0) construction that we just saw in order to obtain this increased expressive power.
A simple but surprisingly effective extension to LR(0) is to add a reduce action to the action table only when the lookahead token is in the FOLLOW set of the nonterminal being reduced, since otherwise the production clearly does not lead to a valid parse. When this refinement to the LR(0) construction produces a valid shift-reduce parser, we say that the grammar is SLR (Simple LR). By adding reduce actions to the action table conditionally on the lookahead, the SLR construction exposes more of the power of the shift-reduce parser. However, even the SLR construction does not exploit the full power of a shift-reduce parser. For that we need the LR(1) construction.
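Concretely, filling in the tables from the LR(0) automaton built earlier might look like the following sketch, again assuming a precomputed follow map from nonterminals to FOLLOW sets; a conflict while filling in the table means the grammar is not SLR.

```python
def slr_tables(grammar, states, transitions, follow):
    """Build SLR action and goto tables from the LR(0) automaton."""
    action = {i: {} for i in range(len(states))}
    goto_ = {i: {} for i in range(len(states))}
    index = {state: i for i, state in enumerate(states)}
    for state, i in index.items():
        for (lhs, rhs, dot) in state:
            if dot < len(rhs):
                sym = rhs[dot]
                target = index[transitions[(state, sym)]]
                if sym in grammar:
                    goto_[i][sym] = target               # nonterminal: goto entry
                else:
                    if action[i].get(sym, ("shift", target)) != ("shift", target):
                        raise ValueError(f"conflict in state {i} on {sym}: not SLR")
                    action[i][sym] = ("shift", target)   # terminal: shift entry
            else:
                # Completed item: reduce, but only on lookaheads in FOLLOW(lhs).
                for tok in follow.get(lhs, {"$"}):       # S' defaults to {$}
                    if tok in action[i]:
                        raise ValueError(f"conflict in state {i} on {tok}: not SLR")
                    action[i][tok] = ("reduce", lhs, len(rhs))
    return action, goto_
```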