CS412/413
Introduction to Compilers and Translators
Spring 2000
Cornell University Computer Science Department

Programming Assignment 3: Intermediate Code Generation

due Wednesday, March 8, 10AM


In this programming assignment, you will build an intermediate code generator for the language Iota. The latest version of the Iota language definition is at http://courses.cs.cornell.edu/cs412/2000sp/iota/iota.html. We expect you to build upon the code that you wrote for Programming Assignments 1 and 2, and to fix any problems with your lexical analysis, parsing, or semantic analysis that interfere with IR generation for correct Iota programs.

Requirements

Your program will have a main class, Iota.IRGen, that translates Iota code to an intermediate representation. When invoked from the command line, the translator will implement at least the following minimal specification: it should expect to receive a single argument that is the name of a source file containing a module specification, with an extension of  ".mod". The translator will parse and type-check the program as in Programming Assignments 1 and 2. It will next convert the AST for each function into an intermediate representation (IR) and then put the IR into canonical form.

Your translator must support a number of dump options for debugging purposes:

You may choose to add additional options not mentioned here as long as they do not interfere with the required functionality. For example, your IR generator might accept more than one source file on the command line or support more debugging options than those listed. You may allow multiple dump options to be specified; we require support for only a single dump option on a particular command invocation.

Extra credit

Certain language features need not be implemented in this assignment. However, you will be required to implement them for a later programming assignment, and you may do them now for extra credit. If these language features are encountered in the program, the compiler should print a message: "Language feature not implemented: <description>" and then exit.

Logic Operators. Iota has the short-circuiting logical operators "&" and "|". You are not required to implement these operations.

Bounds checks. The Iota language specification requires that arrays and strings be bounds-checked. For this programming assignment, we do not require implementation of bounds checking.

String comparisons. The operators ==, <, and > all may be used to compare strings to determine their lexicographic equality or their lexicographic ordering. You need not implement these operators for this assignment.

Intermediate Code

IR model

We expect that you will use a tree-based IR similar to the one described in class or the slightly different IR described by Appel. You may choose to add more IR nodes or modify some of the IR nodes; if you do so, you must describe the meaning of these new IR nodes and explain why you added them. We recommend that before coding you carefully design and document all of the IR translation functions and the transformations that you use to convert the IR to canonical form. Canonical form for the IR must have the properties described in lecture. In other words, all side-effects, including function calls, are moved to top level as statements. All SEQ and ESEQ nodes must be removed as part of this process. Also, two-way conditional jumps must be converted to one-way jumps by basic block reordering and the possible introduction of new jump statements.

Suggestions for how to proceed

You are free to implement this project in any order you see fit. However, we suggest that the first step is to carefully design your IR tree nodes and write the interfaces defining them. This will allow the IR translation and canonicalization phases to be attacked independently.

For IR translation, define a method on AST nodes that creates a corresponding IR node, as discussed in lecture. Write a placeholder method that generates an empty IR node or, even better, a special IR node that can contain the entire AST node that was to be transformed. When dumped, the special IR node uses your existing AST dumping facility. This placeholder method can then be implemented for each of the Iota language constructs in turn, so that you can test IR translation on simple programs and gradually add the functionality for translating all of the AST nodes to IR.

For canonicalization, you can take the same approach. First, attack the problem of bringing side-effecting nodes up to the top level. Elimination of two-way jumps can be worked on as a parallel problem. For the hoisting of side-effecting nodes, define a transformation method for each IR node that reduces it to a sequence of statement nodes plus an expression node in the case of canonicalizing expressions, as described in class. Again, you can use a placeholder method to allow incremental development and testing. The placeholder transformation method simply returns the statement node or expression node it is invoked on. This approach will allow you to test out your canonicalization functionality without having implemented all of the relevant code.

Output format

As in Programming Assignment 2, the printed IR representation should use parentheses to indicate tree structure. The output will be simpler because there is no need to print the types of the various nodes. Binary operation nodes may be indicated using the corresponding symbol. The non-canonical IR for each function will be a single item in the print-out; the canonical IR will be a list of parenthesized items, one for each statement in the canonical IR. For example, the Iota function
    f(x: int) = x + g(x)
might be compiled to IR and printed out as follows:
    MOVE(RV, MEM(FP + 8) + CALL(NAME(g), MEM(FP + 8)))
For compactness, + nodes in the trees are printed out in infix notation here; CONST nodes are printed simply as the corresponding constant, and FP and RV denote the frame pointer and return value registers, respectively. During canonicalization, the CALL node must be hoisted to the top of the tree, resulting in the following output:
    MOVE(TEMP(t1), CALL(NAME(g), MEM(FP + 8)))
    MOVE(RV, MEM(FP + 8) + TEMP(t1))

Pretty-printing the IR output

To help you generate reasonable dumps of these IR trees, we are providing a pretty-printing class CodeWriter that you should use. The source code for this class is online.  Your IR nodes must implement the Unparse interface from Programming Assignment 2:
    interface Unparse { 
        public void unparse(Iota.util.CodeWriter cw); 
            // Write a human-readable representation of the node to the given 
            // CodeWriter cw. 
    } 

CodeWriter automatically introduces newlines to avoid excessively long lines, while also maintaining indentation levels. In expected use, you will call begin(0) whenever you write a parenthesis and end() whenever you write its closing parenthesis. allowBreak(0) is called at every comma and after binary operators. With this usage, the IR expression above will be formatted as follows with a narrow formatting width, resulting in a more readable output than simply wrapping lines:

    MOVE(TEMP(t1),
         CALL(NAME(g),
              MEM(FP + 8)))
    MOVE(RV, MEM(FP + 8) +
             TEMP(t1))
Do not use the newline() method to insert newlines except at the end of each statement in canonicalized IR.

As an example, the ESEQ node described in Appel might be dumped with the following code:

    class ExpSeqNode extends IRNode implements Unparse { 
        StmNode stm;
        ExpNode exp;

        public void unparse(CodeWriter cw) { 
            cw.write("ESEQ("); 
            cw.begin(0); 
    
            stm.unparse();

            cw.write(","); 
            cw.allowBreak(0); 

            exp.unparse();

            cw.write(")"); 
            cw.end(); 
        } 
    } 

Implementing Iota

Word size

In the next assignment, you will be generating 32-bit Pentium code, so all variables and registers will take a full 32 bits of storage. Although it would be possible to generate code that stores characters and booleans more compactly, these data types will be stored in 32-bit words as well. All stack locations and dynamically allocated objects will be aligned to 32-bit boundaries: the address of a variable or dynamically allocated object always will be divisible by four.

Iota modules and name mangling

An Iota program may comprise several Iota modules that must be linked together. When you generate code for an Iota module, you must mangle the names used for publicly visible symbols in the Iota program so that the linker can link external references from other modules to these global variables or functions. The public variables and functions in the module are, of course, those declared in the interface file of the module. External references in the IR into the form required by the linker.

Global variables and functions names in Iota are translated to assembler names as follows. A global identifier I in module M is translated as an assembler symbol _M__I; that is, two underscore characters are placed after the name of the module, and one before. For example, if a call is made to a function F in another module M, the following IR might be generated for the call, even though the Iota program expression will look like F(...) :

       CALL(NAME(_M__F), ...)

These fully-expanded names are used even when the global function or variable is found within the same module as the call itself. This name mangling scheme is chosen for simplicity. Obviously you can create confusion by writing Iota programs in which double underscores appear in the identifiers. A better mangling would avoid this problem but be more complex to implement.

The only exception to this name-mangling strategy is the function main, which is the first function executed in the program. To allow the run-time system to find this function, it must be mangled as _iota__main regardless of what module it occurs in.

Iota runtime

Various Iota language constructs require runtime support.  When you generate assembly code in the next programming assignment, we will provide a library of runtime support routines. These functions are available to the assembler code you will generate but are not exposed to the Iota programmer. Your IR should call these functions.

Dynamically allocated objects

Arrays and strings will be stored in memory in the manner described in class. To create new arrays and strings, you will use the iota__newstring and iota__newarray routines described above. Note that these routines allocate space to store the length of the array or string. For both arrays and strings, the array or string variable on the stack, and the addresses returned by the creation routines, actually are pointers to the 32-bit word following the length; that is, to the array elements or string characters themselves. Strings are stored in a packed representation in which every character is placed in a consecutive byte. The string is always terminated by an extra null character (0), and the string is also padded out with nulls until it fills out a 32-bit word completely.

a : array[T]
a.length
a ® a[0]
a[1]
a[2]
...
a[a.length-1]

Image of a string memory layout

 

The Iota language specification requires that arrays and strings be bounds-checked, although it is not required for this assignment. To implement such a feature, it is necessary to have a way to halt the program in mid-execution. The procedure iota__abort() can be used to accomplish this.

Return

The Iota return statement can be converted to IR in a number of ways. We recommend that a return statement be converted to a sequence of two IR statements. The first statement moves the returned expression (if any) into the return-value register (RV), and the second statement jumps to the label of the current function epilogue. This strategy allows the function epilogue to be shared by all possible return paths from the function. Of course, the epilogue must have a label for this to work. Creating the function prologue and epilogue can be postponed until code generation. Consequently, the function epilogue in the IR is empty except for a single labeled statement; we would like you to give it the label "F_epilogue" where F is the name of the function.

What to turn in

Turn in on paper:

Turn in electronically:

Electronic submission instructions

Your electronic submission is expected at the same time as your written submission: at the beginning of class on the due date. Electronic submissions after 10AM will be considered a day late. To submit Programming Assignment 3, please drop your files in \\goose\courses\cs412-sp00\grpX\pa3, where grpX is your group identifier. Please organize your top-level directory structure as follows :

Note: Failure to submit your assignment in the proper format may result in deductions from your grade.