CS 4120: Introduction to Compilers
Fall 2013

Programming Assignment 1: Lexical Analysis

due: Wednesday, September 11

In this assignment you will implement a lexer (also called a scanner or a tokenizer) for the Cubex programming language. As discussed in lecture 2, a lexer provides a stream of tokens (also called symbols or lexemes) given a stream of characters.


Your submission has to have two parts:

The lexer

You must use some language that uses the Java Virtual Machine and turn in an executable JAR-file. The program should accept a file name as an argument, read the specified file and print out the following text representations of tokens, separated by a space wherever whitespace would be permitted (here should be no leading or trailing spaces).
Token typeReplace with
Variable/function namesname
Type (variable) namesName
String literals""
Integer literals0
Boolean literalstrue
Comments and whitespace[remove]
Everything else[leave as is]
If the program encounters invalid input that cannot be tokenized, it should just print "error", and nothing else.

We encourage you to use a lexer generator such as ANTLR in your implementation, but this is not required. If you do use ANTLR, basic setup information and sample grammars for the Xi language from a previous iteration of this course are provided.

You have to include the complete source code in the jar-file, including inputs for any lexer generators you might use. You will need a manifest file (MANIFEST.MF) that sets the classpath. Your manifest file should look like this:

Manifest-Version: 1.0
Class-Path: . antlr-4.1-complete.jar
Main-Class: [your-class-name-here]

The following code snippet does this packaging for you ( .g4 is the file extension for ANTLR files).

jar cvfm Lexer.jar MANIFEST.MF *.class *.java *.g4

The text file

Finally, we ask that you submit a small text file (.txt) that contains the following information:

Source control

We recommend that you use some kind of source control. Be aware that your repositories should not be publicly viewable on the web. GitHub offers private repositories for students.


You should thoroughly test your solutions before turning them in. We provided some basic test examples for you to play with, but keep in mind that we will test your submissions with more complex examples, too. The way a test works is that after running

java -jar Lexer.jar simple_test1.in > tokens1.out

the content of tokens1.out should be equal to simple_test1.out .

There may be a few special cases where lexing is ambiguous. However, those cases would later be rejected by the parser no matter what rules are chosen. We promise not to have test cases that are affected by these ambiguities.

Working as a group

You need to form a group of three or four people in CMS for this assignment. These groups will be kept for all future programming assigments.

For this example, you should sit in front of one computer as a whole group and write the grammar together. This will help you to get going as a group.

After that, one person can write the main function while the others generate test cases.

Planning ahead

This assignment is much smaller than future assignments will be: it is intended primarily as a warmup assignment that gives your group the chance to practice working together. The later assignments will test your ability to work effectively as a group, so this is a great time to learn how to work together as a group. It is also a good time to set up the infrastructure that you will use for the rest of the semester.