CS412/413
Introduction to Compilers
Cornell University Computer Science Department, Spring 2001

Programming Assignment 1: Lexical Analysis

due Wednesday, February 7


In this programming assignment, you will build a lexical analysis phase for the language Iota, defined on the web page at http://www.cs.cornell.edu/courses/cs412/2001sp/iota/iota.html. Your lexer, or tokenizer, should have the following interface (it may have additional methods, of course):

class Lexer {
    Lexer(InputStream i);
        // Create a lexer that reads characters from the input stream i
    Token getToken() throws LexicalError;
        // Return the next language token on the input stream. Returns
        // a token representing the end of file as the last token, assuming
        // that no lexical error is encountered first.
}

The class Token may be defined in whatever way you prefer, but it should at least implement the following interface:

interface LexerResult {
    void unparse(OutputStream o) throws IOException;
        // Print a human-readable representation of this token on the
        // output stream o; one that contains all the relevant information
        // associated with the token. The representation has the form
        // <token-type, attribute, line-number>.
        // I/O exceptions on the output stream o are passed through.
    int lineNumber();
        // Return the number of the line that this token came from.
}

The class LexicalError should also implement the LexerResult interface, though you will want to choose a different output format for the unparse method.
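As a starting point, the two classes might be sketched as follows. This is only one possible design, not a required one: the field names, the string-valued token type, the error message format, and the use of toString are all assumptions made for illustration, and the package declaration is omitted so that the sketch compiles on its own.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Interface from the assignment, repeated so that this sketch compiles alone.
interface LexerResult {
    void unparse(OutputStream o) throws IOException;
    int lineNumber();
}

// Hypothetical token class; the fields and type names are illustrative only.
class Token implements LexerResult {
    private final String type;       // e.g. "ID" or "EOF" (assumed names)
    private final String attribute;  // lexeme or literal value; may be empty
    private final int line;

    Token(String type, String attribute, int line) {
        this.type = type;
        this.attribute = attribute;
        this.line = line;
    }

    // Builds the <token-type, attribute, line-number> form.
    @Override
    public String toString() {
        return "<" + type + ", " + attribute + ", " + line + ">";
    }

    // Writes one token per line; IOException is passed through to the caller.
    @Override
    public void unparse(OutputStream o) throws IOException {
        o.write((toString() + "\n").getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public int lineNumber() { return line; }
}

// Extends Exception so that getToken() can throw it.
class LexicalError extends Exception implements LexerResult {
    private final int line;

    LexicalError(String message, int line) {
        super(message);
        this.line = line;
    }

    // A different output format from Token's; the wording is an assumption.
    @Override
    public void unparse(OutputStream o) throws IOException {
        String text = "lexical error, line " + line + ": " + getMessage() + "\n";
        o.write(text.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public int lineNumber() { return line; }
}
```

For example, new Token("ID", "x", 3).unparse(System.out) would write the line <ID, x, 3>.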

You must also implement a lexer test-bed program. This program must be a class LexTest that implements the following behavior. When run from the command line, the LexTest program takes a single filename as an argument. It reads the file, breaks it into tokens, and uses the Token.unparse method to dump a representation of the file as a series of tokens. If a lexical error is encountered, it prints an error message that includes the line number on which the error occurred. It must report the first lexical error in the file; it may but need not report additional lexical errors.
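The driver loop described above might look roughly like the sketch below. To keep it self-contained, it uses simplified stand-ins for Token, LexicalError, and Lexer. The stand-in Lexer is a placeholder that emits one token per non-whitespace character and rejects '#'; that rule is invented for the demonstration and is NOT Iota's lexical grammar. The dump helper, the exit codes, and the choice to print the end-of-file token are likewise assumptions.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

// Simplified stand-ins; a real submission would use its own Iota classes.
class Token {
    final String type, attribute;
    final int line;
    Token(String type, String attribute, int line) {
        this.type = type; this.attribute = attribute; this.line = line;
    }
    boolean isEof() { return type.equals("EOF"); }
    void unparse(OutputStream o) throws IOException {
        String text = "<" + type + ", " + attribute + ", " + line + ">\n";
        o.write(text.getBytes(StandardCharsets.UTF_8));
    }
}

class LexicalError extends Exception {
    private final int line;
    LexicalError(String message, int line) { super(message); this.line = line; }
    int lineNumber() { return line; }
}

// Placeholder lexer: NOT Iota's grammar, just enough to exercise the driver.
class Lexer {
    private final InputStream in;
    private int line = 1;
    Lexer(InputStream i) { in = i; }
    Token getToken() throws LexicalError {
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '\n') { line++; continue; }
                if (Character.isWhitespace(c)) continue;
                if (c == '#') throw new LexicalError("illegal character '#'", line);
                return new Token("CHAR", String.valueOf((char) c), line);
            }
            return new Token("EOF", "", line);
        } catch (IOException e) {
            throw new LexicalError("I/O failure: " + e.getMessage(), line);
        }
    }
}

class LexTest {
    // Dumps tokens through EOF. Returns 0 on success, 1 after the first
    // lexical error, and 2 if the output stream fails.
    static int dump(InputStream src, PrintStream out, PrintStream err) {
        Lexer lexer = new Lexer(src);
        try {
            while (true) {
                Token t = lexer.getToken();
                t.unparse(out);
                if (t.isEof()) return 0;
            }
        } catch (LexicalError e) {
            err.println("line " + e.lineNumber() + ": " + e.getMessage());
            return 1;
        } catch (IOException e) {
            err.println("I/O error: " + e.getMessage());
            return 2;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Iota.LexTest <filename>");
            System.exit(2);
        }
        int status;
        try (InputStream in = new FileInputStream(args[0])) {
            status = dump(in, System.out, System.err);
        }
        System.exit(status);
    }
}
```

Because this version stops at the first LexicalError, it reports the first error in the file and no later ones, which the requirements above permit.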

String literals should be unparsed with their characters translated into canonical form. The canonical form for any printable character is itself, e.g., "\065BC" should be unparsed as "ABC". For non-printable characters you may choose a suitable canonical form, such as the "\ddd" or "\^c" escape sequences, as appropriate.
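The translation to canonical form can be illustrated with a small helper. The sketch below makes several assumptions: the class and method names are hypothetical, "printable" is taken to mean ASCII space through '~', non-printable characters are written with the "\ddd" form using three decimal digits, and only the "\ddd" escape is decoded (the example above implies decimal, since 65 is the code for 'A'). Any other escape sequences that Iota defines are not handled here; consult the language definition for the full set.

```java
// Hypothetical helper for unparsing string literals; not part of the
// required interface.
final class StringCanon {
    private StringCanon() {}

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Replaces each backslash followed by exactly three decimal digits with
    // the character of that code. Other escapes are left untouched, and the
    // numeric value is not range-checked in this sketch.
    static String decodeDecimalEscapes(String body) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\' && i + 3 < body.length()
                    && isAsciiDigit(body.charAt(i + 1))
                    && isAsciiDigit(body.charAt(i + 2))
                    && isAsciiDigit(body.charAt(i + 3))) {
                sb.append((char) Integer.parseInt(body.substring(i + 1, i + 4)));
                i += 4;
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    // Printable ASCII maps to itself; anything else becomes \ddd with the
    // decimal code zero-padded to three digits (assumes codes below 1000).
    static String canonical(String decoded) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (c >= 0x20 && c <= 0x7e) {
                sb.append(c);
            } else {
                int code = c;
                sb.append('\\');
                if (code < 100) sb.append('0');
                if (code < 10) sb.append('0');
                sb.append(code);
            }
        }
        return sb.toString();
    }
}
```

With these definitions, canonical(decodeDecimalEscapes("\\065BC")) yields ABC, and a string containing a newline character between A and B is unparsed as A\010B.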

All of the classes you write should be in or under the package Iota, so the Lexer class will be Iota.Lexer, the testbed will be Iota.LexTest, etc.

You may use a lexer generator such as JLex to do this assignment. However, we do not take responsibility for helping you figure out how to use JLex; if you use it, you are on your own. If you use a lexer generator, you should turn in the lexer generator input as well as the Java source code that it emits!

The correctness of your lexer will be important, and we will be more rigorous in our expectations for correctness if you use a lexer generator (though this should not discourage you from using such a tool). We expect you to perform your own testing of the lexer. Often student projects do not handle erroneous input properly -- make sure that yours does! You should develop a thorough test suite that, at a minimum, tests all legal tokens and all possible lexical errors. Testing your program on corner cases is also a good idea. We will test your lexer rigorously against our own test cases -- including programs that are lexically correct, and also programs that contain lexical errors.
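One way to organize such a test suite is as a table of inputs paired with expected token dumps. The sketch below is only an illustration of that idea: GoldenTests and failures are hypothetical names, and the function passed in stands for whatever routine turns source text into your lexer's unparsed output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical table-driven test helper; not part of the required interface.
final class GoldenTests {
    private GoldenTests() {}

    // Each case maps a name to a two-element array {input, expectedOutput}.
    // Returns the names of the cases whose actual output differs.
    static List<String> failures(Map<String, String[]> cases,
                                 Function<String, String> dump) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, String[]> e : cases.entrySet()) {
            String input = e.getValue()[0];
            String expected = e.getValue()[1];
            if (!expected.equals(dump.apply(input))) {
                failed.add(e.getKey());
            }
        }
        return failed;
    }
}
```

Erroneous inputs fit the same table if the dump function returns the error message text when a lexical error occurs.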

Planning ahead

This programming assignment is much smaller than the remainder of the assignments will be. Use this assignment as a warm-up and a chance to set up your code production process. Start thinking now about how you will manage the size and complexity of your source code and test cases. Although we cannot provide support in using them, CVS and Visual SourceSafe are both available for use in managing your code base. You may also wish to consider automation of your testing via shell scripts or other tools.

What to turn in

Because groups in this class are relatively large, we will be expecting a higher level of quality in your product than in some other courses you have taken. Much of the value in a compiler (or any other large program) is in how easily it can be maintained. We place a high value on clarity and brevity, in both documentation and code.

Turn in on paper:

Turn in electronically:

Electronic submission instructions

Your electronic submission is expected at the same time as your written submission: at the beginning of class on the due date. Electronic submissions after 10 AM will be considered a day late. Place your files in \\goose\courses\cs412-sp01\grpX\pa1, where grpX is your group identifier. Please organize your top-level directory structure as follows:

Note: Failure to submit your assignment in the proper format may result in deductions from your grade.