CS412/413
Introduction to Compilers and Translators
Spring 2000
Cornell University Computer Science Department

Programming Assignment 2: Syntactic and Semantic Analysis

due Monday, February 21, 10AM


In this programming assignment, you will build the syntactic and semantic analysis phases for the language Iota. The latest version of the Iota language definition may be found at http://courses.cs.cornell.edu/cs412/2000sp/iota/iota.html. Your program will determine whether a source file defines a legal interface or module. We expect you to build upon the code that you wrote for Programming Assignment 1, and to fix any problems with your lexical analysis phase.

Your program will have a main class, Iota.SemTest, that implements a syntactic/semantic checker. When invoked from the command line, this checker will implement at least the following minimal specification: it should expect to receive a single argument that is the name of a source file containing a module specification, with an extension of  ".mod".

Syntactic analysis

Your checker will first determine whether the source file contains any syntactic errors, and if so, report them using the syntax "source-file-name:line-number:message". For example, a reasonable error message might look like the following:

hello.mod:4:syntax error at token "else"

Your syntactic analysis phase must be structured as a recursive-descent parser. In order to parse the Iota Language top-down, you will need to transform the Iota language grammar into an equivalent LL grammar, which you will turn in as part of your report.  We will provide you with a partial LL(1) grammar for the language.  This grammar does not recognize the entire Iota language; in particular, the if statements and stmt_expr expressions are incomplete, and the operator precedence rules are not implemented correctly.  You may extend this grammar or create one from scratch from the grammar in the Iota language specification.  For this assignment, you are not permitted to use an automatic parser generator tool; however, to assist you in your task, we have provided a tool to analyze a grammar and determine if it is LL(1).

You will find that to produce an LL(1) grammar for this language is essentially impossible; we expect you to produce a grammar that is nearly LL(1), in the sense that most predictive parse table entries can be filled in uniquely using one look-ahead symbol. If your grammar is not LL(1), you should discuss the reasons why and describe what modifications were made to your parser to deal with the problematic parts of the grammar. However, difficulty in writing the syntactic analysis is not an acceptable reason for changing the language accepted by your compiler.

AST construction

If the program is syntactically correct, your checker will produce an abstract syntax tree (AST). The abstract syntax tree is the interface between the syntactic and semantic analysis phases, and designing it carefully before starting to write code will result in much less work.

The abstract syntax tree should be constructed so that binary operators are left associative.  The nodes of the abstract syntax tree will not necessarily correspond to the non-terminals of your LL grammar.  Use the grammar in the Iota specification as a guideline for constructing the AST.  The LL grammar is in a left-factored form and may bear little resemblance to the grammar in the specification.  For example, in the provided partial LL(1) grammar, the statement x = 0 has this parse tree (we apologize for the crude artwork) :

                        Stmt
                         |
                  IdentExprOrStmt
                         |
                    +----+----+
                    |         |
                   ID  IdentExprOrStmtContinued
                              |
                           Assign
                              |
                         +----+----+
                         |         |
                        EQ        Expr
                                   |
                    +--------------+--------------+
                    |                             |
                UnaryExpr                      ExprRem
                    |                             |
               PrimaryExpr                     epsilon
                    |
          +---------+---------+
          |         |         |
       Constant  OptArgs OptSubscripts
          |         |         |
       INTEGER   epsilon   epsilon
However, a reasonable AST for this statement would look more like this:
             Assignment
                 |
            +----+----+
            |         |
        Variable  IntegerConstant
You will want to think about how to generate an AST that eliminate some of the unnecessary levels of the parse tree. If you don't, the remaining phases of the compiler, such as semantic analysis, will become much more tedious to write. The code to generate the compact AST shown might look like the following:
        Stmt parse_Stmt()
        {
            switch (nextToken.type()) {
                case ID:
                    return parse_IdentExprOrStmt();
                // ...
            }
        }

        Stmt parse_IdentExprOrStmt()
        {
            Identifier id;
            switch (nextToken.type()) {
                case ID:
                    id = new Identifier((String) nextToken.attribute());
                    nextToken = lexer.getToken();
                    parse_IdentExprOrStmtContinued(id);
                // ...
            }
        }

        Stmt parse_IdentExprOrStmtContinued(Identifier id)
        {
            switch (nextToken.type()) {
                case PLUS:
                case MINUS:
                case TIMES: // ...
                    int op = parse_BinaryOp();
                    Expr left = new IdentifierExpr(id);
                    Expr right = parse_Expr();
                    Expr e = new BinaryExpr(op, left, right);
                    return new ExprStmt(e);
                case EQ:
                    Expr left = new IdentifierExpr(id);
                    Expr right = parse_Assign();
                    return new AssignmentStmt(left, right);
                case COLON:
                    nextToken = lexer.getToken();
		    Type type = parse_Type();
		    Expr expr = parse_OptAssign();
                    return new VarDeclStmt(id, type, expr);
                // ...
            }
        }

        Expr parse_Assign(Identifier id)
        {
            switch (nextToken.type()) {
                case EQ:
                    nextToken = lexer.getToken();
		    return parse_Expr();
                // ...
            }
        }

Semantic analysis

After the AST is constructed, your checker will analyze it to ensure that it contains no semantic errors.  As part of this analysis, the checker may need to read interface files (extension .int) to learn the signatures of operations from other modules. Interface files are read in when uses declarations are processed during semantic checking. The checker must report any lexical, syntactic, or semantic errors found in the interface files. The checker should look for interface files in the same directory as the module file.

Semantic errors include the following categories: type errors, invalid types, undefined identifiers, overlaps of identifier scope (see the language definition). Your checker must report these categories of errors. It is not required to report more than one error; it may terminate after reporting the first lexical, syntactic, or semantic error it encounters. These errors should be reported in the same format as syntax errors.

Output

In addition, your checker must support a "-dump_ast" option. When invoked with this option and the name of a module file (e.g., java Iota.SemTest -dump_ast hello.mod), the checker will print to System.out a representation of the decorated abstract syntax tree for the module it has checked.  Each node in the AST must implement the following interface:

interface Unparse {
    public void unparse(Iota.util.CodeWriter cw);
        // Write a human-readable representation of the node to the given
        // CodeWriter cw.
}

You will use the CodeWriter class to format text for output. This formatter automatically inserts line breaks and indents text as necessary. You can print the AST to System.out using a CodeWriter with code like the following:

    CodeWriter cw = new CodeWriter(System.out, 72);
    ast.unparse(cw);
    cw.flush();

The AST representation that is printed should use parentheses to indicate the tree structure. For example, a WhileStmt node might be partially implemented as:

class WhileStmt extends Stmt implements Unparse {
    Expr condition;
    Stmt body;

    // ...

    public void unparse(CodeWriter cw) {
        cw.begin(4);

        cw.write("(while");
        cw.allowBreak(0);

        condition.unparse(cw);
        cw.allowBreak(0);

        body.unparse(cw);
        cw.write(")");
        cw.end();
    }
}

In this case, a WhileStmt with condition E and body S should be printed as (while E S), where E and S are the printed representations of the corresponding nodes in the abstract syntax tree. Those familiar with Scheme or Lisp will recognize this format as an s-expression. The first word within the parentheses -- while -- indicates the kind of node that this is; the remaining output within the parentheses describes the children of this node. For nodes representing expressions or statements that have a value, the type of the expression is included as the last component of the output for the node. For example, a node representing the addition of two integers 2 and 3 could be output as follows:

(+ (integer 2) (integer 3) :int)
In other words, the right-most component of the printed represention of an expression node should have the form :type, where type is the type of the expression. As shown in the example, it is not necessary to include this component for constants, where the type is obvious.

As another example, an addition expression like the one above might be implemented as:

class AddExpr extends Expr implements Unparse {
    Expr left;
    Expr right;
    Type type;

    // ...

    public void unparse(CodeWriter cw) {
        cw.begin(4);
        cw.write("(+");
        cw.allowBreak(0);
        left.unparse(cw);
        cw.allowBreak(0);
        right.unparse(cw);
        cw.allowBreak(0);
        cw.write(":");
        type.unparse(cw);
        cw.write(")");
        cw.end();
    }
}

Use the allowBreak method of CodeWriter to specify where newlines can be inserted.  The newline method forces a newline to be printed and should be used sparingly, if at all. The begin and end methods can be used to indent blocks of statements.  This description has not been a precise specification of how to dump the AST; you have some latitude in choosing the output style.

How to proceed

This is a much longer assignment than the first programming assignment. It will be important to get your entire team working effectively on this assignment in order to get it done in time. We recommend working on the syntactic and semantic analysis phases in parallel, perhaps in two teams of two. To make this approach work, you will need to design the AST data structures carefully.

What to turn in

Turn in on paper:

Turn in electronically:

Electronic submission instructions

Your electronic submission is expected at the same time as your written submission: at the beginning of class on the due date. Electronic submissions after 10AM will be considered a day late. To submit Programming Assignment 2, please drop your files in \\goose\courses\cs412-sp00\grpX\pa2, where grpX is your group identifier. Please organize your top-level directory structure as follows :

Note: Failure to submit your assignment in the proper format may result in deductions from your grade.