Grammars and Parsing

Context-Free Grammars

Grammars are defined by a set of productions. Productions use two kinds of symbols, terminals and nonterminals. Terminals are also called tokens: they are the symbols that the scanner recognizes and provides to the parser. A nonterminal represents some sequence of tokens in the string that is being parsed. A production has a left-hand side, which is always a nonterminal, and a right-hand side, which is a sequence of terminals and nonterminals. We will use upper-case letters to represent nonterminals and other characters to represent terminals. For example, the following are productions:

  E → T
  E → ( E + E )
  T → n
A production explains one way to put together tokens to make up the nonterminal on the left-hand side. In this case, the productions mean that an E can be either anything generated by the nonterminal T, or else a left parenthesis, followed by anything generated by the nonterminal E itself, followed by the token +, followed by anything generated by E and then a closing parenthesis. Here the nonterminal T can only represent an integer, but the nonterminal E can represent “(n+n)”, “((n+n)+n)”, “(n+(n+n))”, and so on. (Since n actually stands for a number, real examples of inputs would have numbers in them, e.g. 2, (2+3), ((2+3)+7), etc.)

A grammar has a start symbol, usually a nonterminal. A legal string as defined by the grammar is anything that can be generated from the start symbol using productions. One way to think about this is that we start with a string containing just the start symbol, and use productions to change nonterminals into whatever appears on the right-hand side of the production, until there are no nonterminals left. For example, we can generate the string (1 + (2 + 3)) by using productions as follows: E → (E + E) → (T + E) → (1 + E) → (1 + (E + E)) → (1 + (T + E)) → (1 + (T + T)) → (1 + (2 + T)) → (1 + (2 + 3)).

Recursive descent parsing and exceptions

Many grammars can be parsed using recursive descent parsing, as we saw in class. Recursive descent parsers have one function (method) per nonterminal, and the job of that method is to read in all the tokens for that nonterminal. In order for this to work, the method must be able to decide which production to use based on the input it sees. For example, with the grammar above, the code to parse an E looks very roughly like:
  void parseE() {
    if (lookaheadToken().isNumber()) {
      parseT();
      return; // E → T
    } else {
      checkToken("(");
      parseE();
      checkToken("+");
      parseE(); // E → ( E + E )
      checkToken(")");
    }
  }
Here we've assumed a method lookaheadToken that returns the next token without consuming it, and a method checkToken that reads the next token and makes sure that it is the given expected token. If we have a method nextToken that reads (consumes) the next token from the scanner, we can implement checkToken as follows:
  void checkToken(String expected) throws Exception {
     String s = nextToken();
     if (s.equals(expected)) return;
     throw new Exception("Got " + s + ", expected " + expected);
  }
This method simply returns normally if the expected token was seen. If some other token was seen, there must have been a syntax error in the input, and the code throws an exception, which will propagate up the stack of method calls until it hits an exception handler. If there is no exception handler, the program will stop and print an error message. Notice that we had to add the declaration throws Exception to the top of the method to indicate that it could throw an exception. If we take this approach to handling errors, we'll have to add similar declarations to all the parsing methods.
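
For concreteness, here is one way the token-stream helpers we have been assuming, nextToken and lookaheadToken, might be implemented. This is just a sketch under the assumption that the scanner hands us a list of token strings; with string tokens, the isNumber() test used earlier would become a helper that checks whether the token consists of digits.

  // A sketch of the assumed token-stream helpers, backed by a
  // java.util.List<String> of tokens produced by some scanner.
  // (This representation is an illustrative assumption, not required.)
  List<String> tokens;   // e.g. ["(", "2", "+", "3", ")"]
  int position = 0;      // index of the next unread token

  // Return the next token without consuming it; "<eof>" marks the end.
  String lookaheadToken() {
    return position < tokens.size() ? tokens.get(position) : "<eof>";
  }

  // Read (consume) the next token.
  String nextToken() {
    String token = lookaheadToken();
    if (position < tokens.size()) position++;
    return token;
  }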

We can handle exceptions using a try...catch statement. For example, we could parse an expression as follows:

  try {
    parseE();
  } catch (Exception exc) {
    System.out.println("Syntax error: " + exc.getMessage());
  }

If the input has an unexpected token, the exception will stop all the recursively nested calls to parseE, etc., and run the exception handler starting with catch. Notice that the exception handler declares a variable exc of type Exception. This variable refers to the new exception object that was created in the throw statement, and exc.getMessage() returns the string message that was given when the exception was created.
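
To see how the pieces fit together, here is a rough sketch of a top-level method that parses an entire input. It uses the token-list fields sketched above and assumes, just for illustration, that the tokens in the input string are separated by spaces (a real scanner would do something smarter):

  void run(String input) {
    // Illustrative assumption: tokens are separated by spaces,
    // e.g. "( 1 + ( 2 + 3 ) )". Requires import java.util.Arrays.
    tokens = Arrays.asList(input.split(" "));
    position = 0;
    try {
      parseE();
      checkToken("<eof>"); // make sure there is no leftover input
      System.out.println("Input parsed successfully");
    } catch (Exception exc) {
      System.out.println("Syntax error: " + exc.getMessage());
    }
  }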

Doing computation during parsing

So far, the return type of all the parsing methods has been void or (in class) boolean. The parsers have just checked whether the input has the right form. Suppose we wanted to compute the value of the input expression, instead of just checking it. Then we could declare the type of the parse methods to be int and write code like this:
  int parseE() throws Exception {
    if (lookaheadToken().isNumber()) {
      return parseT(); // E → T
    } else {
      checkToken("(");
      int a = parseE();
      checkToken("+");
      int b = parseE();
      checkToken(")");
      return a + b; // E → ( E + E )
    }
  }
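
For this to work, parseT must also return an int. A minimal sketch, assuming tokens are strings as above and using Java's Integer.parseInt to convert the number token:

  int parseT() throws Exception {
    String token = nextToken(); // T → n: the next token should be a number
    try {
      return Integer.parseInt(token);
    } catch (NumberFormatException e) {
      throw new Exception("Got " + token + ", expected a number");
    }
  }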

Dealing with problematic grammars

Left-recursive grammars

Not every grammar works with recursive descent parsing, but usually it is possible to rewrite your grammar so it works. One problem that can arise is a grammar that is left-recursive: it has a nonterminal that can be expanded through productions into a string of symbols in which the left-most symbol is the same nonterminal. For example, in this grammar:

  E → E + E
  E → T

The right-hand side of the first production starts with E, the very same nonterminal being expanded by the production. When we write a recursive descent parser for this production, it calls itself immediately without consuming any input, so it recurses forever:

  parseE() {
    parseE();
    checkToken("+");
    parseE();
  }

To handle this kind of grammar with a recursive descent parser, the grammar must be rewritten so that it has no left recursion.
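
As general background, one standard rewriting handles simple cases of this. A left-recursive pair of productions of the form

  A → A α
  A → β

where β does not begin with A, generates a β followed by zero or more α's, so it can be replaced by

  A → β A'
  A' → α A'
  A' → ε

which generates the same strings without any left recursion.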

Ambiguous grammars

Actually, there is a second problem with this grammar: it is ambiguous, meaning that there is more than one way to parse some strings. If we see an input like x + y + z (where x, y, and z are numbers), we don't know whether the first or the second + was applied last; that is, we don't know whether to parse it as (x + y) + z or as x + (y + z). The former treats + as left-associative, the latter as right-associative. Since addition is an associative operator, it doesn't make too much difference in this case, but if we were talking about subtraction (-), the two parses would have very different meanings. With recursive descent parsers, it's tricky to parse left-associative operators because of the left recursion problem.

Here is a grammar that is not ambiguous, works with recursive descent, and generates the same strings. It does make + right-associative, however.

  E → T P
  P → ε
  P → + E

Note that ε represents the empty string. We can then write parseE, parseP as follows (a little more verbosely than necessary, for clarity):

  void parseE() throws Exception {
    parseT();
    parseP();
  }
  void parseP() throws Exception {
    if (lookaheadToken().equals("+")) {
      checkToken("+"); // always succeeds, since the lookahead just saw a +
      parseE();
      return; // P → + E
    } else {
      return; // P → ε (don't consume any input)
    }
  }

Left-factoring a grammar

To get to this grammar, we first eliminate the left recursion by eliminating the ambiguity, choosing to make + right-associative:
  E → T + E
  E → T

This still won't work because both productions have right-hand sides starting with T, so by looking only at the next token we won't know which one to pick. The next step is to left-factor the grammar by splitting out the common prefix of the right-hand sides, in this case T, and generating the different suffixes that can follow it (+ E and ε) using a new nonterminal, called P above. This lets us write the E production as E → T P, so we don't have to figure out which P production to choose until we get to a place in the input where a sensible choice can be made.
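
Putting this together with the earlier idea of computing values during parsing, here is one possible version of parseE and parseP for the left-factored grammar that also computes the value of the expression. This is just a sketch of one design, reusing the token helpers assumed above: the value of the T is passed into parseP so it can be added to whatever follows.

  // E → T P: parse a term, then let parseP handle an optional "+ E".
  int parseE() throws Exception {
    int value = parseT();
    return parseP(value);
  }

  // One possible design: given the value parsed so far, return the
  // value of the whole expression.
  int parseP(int left) throws Exception {
    if (lookaheadToken().equals("+")) {
      checkToken("+");        // consume the +
      return left + parseE(); // P → + E
    } else {
      return left;            // P → ε (don't consume any input)
    }
  }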

Parser Generators

For serious parsing tasks, it's usually a good idea to use a parser generator to produce your parser code. A parser generator takes a grammar as input and produces Java code that parses the input. Parser generators can also handle a larger class of grammars than recursive descent parsers can.

There is much more to building parsers than we can cover in this course. Take the compilers course, CS412, to learn more about building parsers and how to write and parse different grammars (and much else besides).