CS 412

CS 412/413
Introduction to Compilers
Spring 2002

IC Language Description

IC is a simple language for which we will be writing compilers in CS413. This language definition will be updated periodically as the need for clarifications or corrections arises. If you see ambiguity in this specification, you should contact the course staff to resolve ambiguities.

A program consists of a collection of one or more modules. Each module consists of two files: an interface file and an implementation file. The implementation file provides the actual code for the module, and the interface file declares the parts of the module which are externally visible, i.e. the variables and functions from that module which are accessible from other modules. A module must define everything that its interface declares, but it may also define additional variables and functions that are not accessible from outside the module. The filename extension for implementation files is ii and for interface files ic (a module mod will have an interface file mod.ii, and an implementation mod.ic).

Grammar and overview

The grammar of the language is presented in the Backus-Naur Form (BNF). A production written in class as A ! B C will be written here as A ::= B C . In BNF, nonterminals are enclosed between the special symbols '<' and '>'. Keywords and terminal characters are written in bold fonts (e.g. while, <=). Terminals which represent tokens are written in bold italic type (e.g., string). Brackets [ ] indicate an optional item in the production and parantheses ( ) are used for grouping. Quoted brackets and quoted parantheses '[', ']', '(', ')' are terminal characters. Finally, a group of items followed by an asterisk (item)* is the Kleene closure, i.e. a sequence of zero or more items.

Modules and definitions

A module implementation file starts with a declaration of what external components it accesses (uses), then continues with a series of definitions.

<module> ::= [<uses>] <definition>* <uses> ::= uses <component> (, component)* <component> ::= id . id

The non-terminal <uses> represents a declaration that the code of this module will use several external components. Each component is of the form mod.i, denoting the fact the the identifier i from module mod may be used in the current module under the name i.

A module contains one or more definitions of variables or functions. All top-level definitions are visible throughout the module, regardless of where they are defined. Two definitions may not use the same name; nor may they conflict with any identifier defined by a use declaration.

A function definition may define two kinds of functions: functions that return a value, and functions that do not. Functions of the former sort are indicated by writing the type of their return value after the closing parenthesis of the formal arguments. In both case, the implementation of the function is described by a block containing a list of one or more statements between enclosing parantheses '{' and '}'. The list of formal arguments of a function specifies the name and the type of each argument; a function may be called only if the actual arguments to the function have a type that is the same as the type of the formal arguments.

<definition> ::= <type> id | function id : '(' [<formals>] ')' [-> <type>] <block>

<formals> ::= <formal> (, <formal> )* <formal> ::= <type> id

<block> ::= { (<statement>)* }

Interfaces and declarations

The syntax of an interface is similar to that of an implementation; it is a series of declarations of variables or functions. Unlike definitions in implementation modules, function declarations do not include any implementation block; they only specify the signature of the exported functions. The types of any identifiers (variables and functions) defined in the interface must agree with the corresponding types in the module implementation of the same name.

<interface> ::= <declaration>* <declaration> ::= <type> id | function id : <signature>

<signature> ::= '(' [<type_list>] ')' [-> <type>] <type_list> ::= <type> [id] (, <type> [id])*

Types

There are three primitive types: integer, boolean, and string. Also, for any type T, the type array[T] represents an array of that type. Using this recursive type definition, array[array[int]]is an array of an array of integer, i.e. an integer matrix.

<type> ::= int | bool | string | array '[' <type> ']'

Statements

The possible statement forms are shown below.

<statement> ::= | <type> id [= <expression>] ; | <function_call> ; | <lvalue_expression> = <expression> ; | if '(' <expression> ')' <statement> [else <statement>] | while '(' <expression> ')' <statement> | break ; | continue ; | return ; | return expr ; | <block>

A new variable may be introduced by a statement, as in the first production. As in Java, this variable remains in scope in subsequent statements until the closing brace of the tightest enclosing pair of braces that surround this statement. The new variable may optionally be initialized with an assignment to an expression. If not, the value of the variable is undefined until its first assignment.

An if statement evaluates the initial expression and executes the following statement if the expression evaluates to true. The expression must have type bool. If the expression evaluates to false, the else clause is evaluated if present. A while statement operates as in Java. The expression tested must have boolean type. A break or a continue statement must belong to an enclosing while statement. The break statement stops the execution of the innermost while loop; the execution of the program continues with the next statement after the loop body. The continue statement stops the execution of the current iteration of the innermost while loop; the execution of the program continues by testing the loop condition.

A return statement may either return a value or not. A return statement in a function with no return type may not return a value; a return statement in a function with a return type must return a value. A return statement has no value itself.

Finally a block of statements is a statement itself, which means that blocks of statements can be nested.

Expressions

Expressions may be of one of the following forms:

<expression> ::= <simple_expression> | <binary_expression> | <unary_expression> | <constructor_expression> | <constant_expression> | '(' <expression> ')'

<simple_expression> ::= <lvalue_expression> | <function_call> <lvalue_expression> ::= id | <simple_expression> '[' <expression> ']' <binary_expression> ::= <expression> <binary_op> <expression> <unary_expression> ::= <unary_op> <expression> <constructor_expression> ::= new <type> '[' <expr> ']' <constant_expression> ::= string | integer | true | false

There are three kinds of constants : integer constants, string constants, and boolean constants. The forms of integers and strings are described below in Lexical considerations. The non-terminal <simple_expression> represents an expression with high precedence -- an expression that can be used as the left-hand-side of an array index; these expressions include variable names, array index expressions, and function call expressions. The non-terminal <lvalue_expression>represents expressions that can occur in the left hand side of an assignment; these can be either variable names or array index expressions.

Operators may be either arithmetic operators, comparison operators or conditional operators. Operators have the same precedence and associativity that they do in Java, and unless specified, the same meaning.

<binary_op> ::= <arithmetic_op> | <comparison_op> | <conditional_op> <arithmetic_op> ::= + | - | * | / | % <comparison_op> ::= < | > | <= | >= | == | != <conditional_op> ::= \/ | /\ <unary_op> ::= - | lengthof

Arithmetic operators include addition, subtraction, multiplication, division, and the modulo operator. All of these operate only on integers, with the exception of + which is also used to concatenate two strings. A divisor or modulus of zero is checked, and results in the program terminating.

Comparison operators test the relation between their operands: strictly less than, strictly greater than, less than, greater than; these operators may be used to compare integer numbers or to compare strings according to their lexicographic order. The other two comparison operators are the equality test operators: equals and not equals. These operators can be applied to any variables of the same type. They work as follows: for integers and boolean they test the equality of values; for strings (or arrays) they test for the equality of the string (or array) object.

The conditional operators are the short-circuit and operator '/\' and the short-circuit or operator '\/'; they have the same semantics as the equivalent && and || operators in Java. Finally, unary operators include sign change for integers and computing the length of an array.

Brackets may be used to index both arrays and strings, extracting either the component of the array with that index, or the ASCII code of the character with that index, respectively. Both arrays and strings are zero-indexed, so the largest index that may be used is one less than the length of the array or string. Out-of-bounds errors are checked for and terminate the program with an error if encountered.

A new array object is created by an array constructor expression. The keyword new is followed by element type and the number of elements in the array. For example, the expression new int[5] creates an array[int] containing 5 elements.

A function call looks the same as in Java. The optional list of actual expressions is separated by commas. The types of the expressions must match the declared types of the corresponding formal arguments in the interface of the module that the identifier comes from.

<function_call> ::= id '(' [<actuals>] ')' <actuals> ::= <expr> (, <expr>)*

Scope of variables

Variables with the same name never overlap in scope. The globally visible names in a module (whether variables, functions, or imported identifiers) all must have distinct names. Similarly, local variables and formal argument variables within a particular function definition may not have the same names as any globally visible identifiers, or the same name as any other variable that is in scope at the point of their declaration. Local variables in different functions may have the same names, since their scopes do not overlap, and local variables within the same function also may have the same names if their scopes do not overlap, as in the following example:

{ int tmp = x1; x1 = y1; y1 = tmp; }
{ int tmp = x2; x2 = y2; y2 = tmp; }

Since the scope of each variable temp is limited to its respective block of statements, the variables may have the same name.

Lexical considerations

Both forms of Java comments are supported; a comment beginning with the characters // indicates that the remainder of the line of code is a comment. In addition, a comment may begin with the characters/*, in which case all of the characters, including newline characters, up to the next occurrence of the characters */, is part of a comment.

Integer literals (integer) are either 0 or a digit in the range 1..9, followed by a sequence of digits in the range 0..9. Negative integers are expressed as a unary negation operator followed by a positive integer constant. Integers in 32-bit values. The largest integer literal is 2147483648 (2³¹). All integer literals from 0 to 2147483647 may appear anywhere an integer literal may appear, but the literal 2147483648 may appear only as the operand of the unary negation operator -.

String literals (string) are largely as defined for Tiger (Appel, p. 526). The escape sequence \^c, where c is an arbitrary character, stands for a control character. To clarify this definition, c may be any character whose ASCII code is in the range 64..126; the escape sequence stands for the control character whose ASCII code is in the range 0..31 and is equal to the ASCII code of c, modulo 32.

Boolean literals must be either the keyword true or the keyword false.

Identifiers must begin with an alphabetic character. Following the initial alphabetic character may be any sequence of alphabetic characters, numeric characters, or the underscore character ( _ ). Uppercase and lowercase alphabetic characters are both considered alphabetic and are distinguished, so x and X are different identifiers. Keywords in the language (int, string, bool, array, uses, function, if, else, while, break, continue, return, new, lengthof, true and false) may not be used as identifiers.

White spaces consists of a sequence of one or more space, tab, or newline characters. White spaces may appear between any tokens. Keywords and identifiers must be separated by white space or a token that is neither a keyword or an identifier. For instance, elsex will represent a single identifier, not the keyword else followed by the identifier x.

The main program

Any module can serve as a main program, by defining a function main with the following signature:

function main : (array[string] args) -> int

There must be a function named main in exactly one of the modules in the program and it must have exactly this signature. When the program is run, this function will be evaluated, passing any command-line arguments to the argument variables args. The expression (lengthof args) can be used to determine the number of arguments. The return status of the program is defined by the return value of main.

Standard Libraries

io

The io module provides several functions for performing input and output, which are defined as follows:

function print : (string s)
    // output the string s to the standard output device
function printi : (int i)
    // output the integer i as a formatted integer to standard
    // output. Equivalent to print(itos(i))
function printc : (int c)
    // output the character whose ASCII value is c to standard output
function readln : () -> string
    // read a line from the standard input device
function readc : () -> int
    // read a single character from standard input
function eof : () -> bool
    // return whether the end of file has been reached on standard input

conv

The conv module provides conversions between different types.

function itos : (int i) -> string 
    // Produce a formatted version of i.
function stoi : (string s, int err) -> int
    // Parse s as a formatted integer.
    // Return err if the string could not be parsed.
function itoc : (int i) -> string
    // Produce a string containing the single character whose ASCII code is i.
function atos : (array[int] a) -> string
    // Produce a string containing the characters whose ASCII codes are in a,
    // in proper sequence
function stoa : (string s)-> array[int]
    // produce an array containing the ASCII codes of all the characters in s,
    // in proper sequence

Quicksort

The following is an example IC program which implements the classic quicksort routine.

quicksort.ii:

function quicksort : (array[int] a, int l, int h)
    /* Put the array elements a[l]...a[h] into ascending order.
       Requires that 0 <= l <= h <= lengthof a */

quicksort.ic:

function quicksort : (array[int] a, int low, int high)  {
    if (low < high) { 
    	int mid = partition(a, low, high); 
    	quicksort(a, low, mid); 
    	quicksort(a, mid + 1, high);
    }
}

function partition : (array[int] a, int low, int high) -> int 
{
    /* Reorder the elements in a into two contiguous groups. If
       ret is the return value, then the first group is the
       elements in low...ret, and the second group is the elements 
       in ret+1...high. Each element in the second group will be at 
       least as large as every element in the first group. */

    int x = a[low];
    int i = low; 
    int j = high; 
    while (true) { 
        while (a[j] > x) { j = j - 1; } 
        while (a[i] < x) { i = i + 1; } 
        if (i < j) {
            int tmp = a[i];
            a[i] = a[j]; a[j] = tmp;
        } else {
            return j;
	}
    }
    return 0;
}

CS 412/413 Introduction to CompilersSpring 2002