At this point, we have talked about lexing and parsing. Lexical analysis has converted the input stream of characters into tokens, and syntactic analysis (parsing) has converted the stream of tokens into an abstract syntax tree. The next phase of the compiler is semantic analysis, which checks that the AST represents a valid, well-formed program.
Each of these first three stages may detect some kind of error in the program. Lexical analysis detects lexical errors (ill-formed tokens), syntactic analysis detects syntax errors, and semantic analysis detects semantic errors, such as static type errors, undefined variables, and uninitialized variables. Once semantic analysis is complete and successful, the program must be a legal program in the programming language; no further errors in the program should be reported by the compiler.
In addition to checking that the AST is a valid program, semantic analysis may also compute additional information that is useful for the rest of the compilation process: for example, the types of expressions, the memory layout of data structures, or non-AST representations of classes and modules. The output of successful semantic analysis is thus a decorated AST in which the AST has been augmented with this additional information.
For many languages, static type checking is the core task of semantic analysis. It is conventionally implemented as a recursive traversal of the AST. To see how this works, consider the problem of type-checking uses of the operator +. Since it shares many features with other binary operators, in an object-oriented language like Java it is natural to define a class that captures the common features of all binary operators, such as the left and right operands:
class BinaryExpr extends Expr {
    Expr left, right;
    ...
}
Let us suppose that our type checker is intended to record the type of each expression node in the AST. This decoration can be expressed by adding an instance variable to the common superclass Expr:
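A sketch of what this declaration might look like; Type is a hypothetical class representing the types of the language being compiled:

abstract class Expr {
    Type type;                 // decoration: the type of this expression,
                               // filled in during semantic analysis
    abstract Type typeCheck(); // each node class supplies its own logic
}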
The variable type is a decoration to be filled in by semantic analysis with an actual type. The method typeCheck() is used to implement type checking, with each node class supplying the logic for type checking the kind of expressions it describes; it is natural for typeCheck() to be a recursive method. Note that no arguments are given to typeCheck(); as we will see shortly, at least one argument is needed.
Now let's try to write code for type-checking addition in Java. Here is a simplified version of that code, with some missing pieces. Assuming we have an existing representation of the types int and String, we might write code like the following, using recursive method calls to type-check the subtrees left and right:
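A sketch under stated assumptions: Type.INT and Type.STRING are hypothetical constants representing the two types, and error reporting (one of the missing pieces) is reduced to a thrown Error:

class Plus extends BinaryExpr {
    Type typeCheck() {
        Type t1 = left.typeCheck();   // recursively check the subtrees
        Type t2 = right.typeCheck();
        if (t1 == Type.INT && t2 == Type.INT)
            return type = Type.INT;       // integer addition
        if (t1 == Type.STRING || t2 == Type.STRING)
            return type = Type.STRING;    // string concatenation
        throw new Error("ill-typed operands of +"); // real error handling elided
    }
}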
That something is still missing can be seen if we consider how to type-check an identifier expression (such as a local variable). The body of the typeCheck() method is impossible to implement, because we have no way to know what type the identifier name should be associated with.
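A sketch of the predicament; the node class Id and its field name are illustrative:

class Id extends Expr {
    String name;
    Type typeCheck() {
        // Stuck: nothing here tells us what type name is bound to.
        return type = null;
    }
}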
In general, when doing semantic analysis of some part of the program, we need a description of the surrounding context in which that program fragment is located. To handle type-checking identifiers, the context needs to include an environment that maps identifier names to types. In the setting of compilers, this environment is known as a symbol table. A simple typing context might look like the following:
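A minimal sketch, with illustrative operation names lookup and put:

abstract class Context {
    abstract Type lookup(String id);      // the type bound to id, if any
    abstract void put(String id, Type t); // bind id to type t in the current scope
}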
Then, we modify the signature of typeCheck to take a Context as an argument. The implementation of Id.typeCheck becomes easy:
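A sketch, assuming the abstract method in Expr now also takes a Context; the error case of an unbound identifier is omitted:

class Id extends Expr {
    String name;
    Type typeCheck(Context c) {
        return type = c.lookup(name); // the context supplies the type
    }
}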
However, the change to the signature of typeCheck means that the context must be threaded through all recursive calls. For example, the method Plus.typeCheck is updated accordingly:
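A sketch of the updated method, threading c through both recursive calls:

class Plus extends BinaryExpr {
    Type typeCheck(Context c) {
        Type t1 = left.typeCheck(c);  // the context is passed along
        Type t2 = right.typeCheck(c); // to both subtrees
        if (t1 == Type.INT && t2 == Type.INT)
            return type = Type.INT;
        if (t1 == Type.STRING || t2 == Type.STRING)
            return type = Type.STRING;
        throw new Error("ill-typed operands of +");
    }
}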
We represent environments more formally as a mapping \(Γ\) from identifiers \(x\) to types \(t\). An environment \(Γ\) is written as \(Γ = x_1:t_1, x_2:t_2, \dots, x_n:t_n\), meaning that each identifier \(x_i\) is mapped to the corresponding type \(t_i\).
Using this notation, let's consider the following example, assuming that the initial typing context has no variables in scope: \(Γ=∅\).
{                           // Γ=∅
    int i, n;               // Γ=i:int, n:int
    for (i = 0; i < n; i++) {
        boolean b = ...     // Γ=i:int, n:int, b:boolean
        ...
    }                       // Γ=i:int, n:int
}                           // Γ=∅
As we can see, the typing context changes as we encounter variable declarations and as scopes are exited. For example, we might implement type checking for variable declarations as follows, adding the declared variable to the context:
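A sketch; the node class VarDecl, its superclass Stmt, and its fields are assumed for illustration:

class VarDecl extends Stmt {
    String name;
    Type declaredType;
    void typeCheck(Context c) {
        c.put(name, declaredType); // extend the current environment
    }
}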
This implementation is designed under the assumption that the statements in a given block are checked sequentially, allowing new declarations to be accumulated by simply mutating the current environment.
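A sketch of Block, assuming Context also provides the push() and pop() operations described next:

class Block extends Stmt {
    Stmt[] statements;
    void typeCheck(Context c) {
        c.push();                 // save the current environment
        for (Stmt s : statements)
            s.typeCheck(c);       // declarations accumulate in the new scope
        c.pop();                  // restore it: declarations go out of scope
    }
}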
However, as the example above shows, blocks introduce another challenge: variables must go out of scope. One way to implement this is as shown in Block. The context maintains a stack of environments, so that c.push() saves the current environment and c.pop() restores the most recently saved environment that hasn't been restored.
Programming languages tend to have a lot of mechanisms for introducing new scopes. For example, in Java we have blocks, methods, classes (including nested classes), and packages. The typing context needs to be able to keep track of what identifiers are bound in all these scopes.
Stack of hash tables. The traditional implementation is a symbol table organized as a stack of hash tables. The symbol table is updated imperatively during type checking. On entry to a new scope, a new, empty hash table is pushed onto the stack, taking \(O(1)\) time. On exit from a scope, the top hash table is popped from the stack, taking \(O(1)\) time. Adding a new identifier to the context just requires adding a mapping to the top hash table, in \(O(1)\) time. However, looking up an identifier requires, in general, walking through the stack, checking each hash table. For an identifier at depth \(d\), this takes \(O(d)\) time. Of course, the value of \(d\) is small in most programs.
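A sketch of this approach; the class name StackedContext and the choice of ArrayDeque and HashMap are implementation choices, not prescribed by the text:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class StackedContext {
    private final Deque<Map<String, Type>> scopes = new ArrayDeque<>();

    StackedContext() { push(); }                      // start with one global scope

    void push() { scopes.addFirst(new HashMap<>()); } // enter a scope: O(1)
    void pop()  { scopes.removeFirst(); }             // exit a scope: O(1)

    void put(String id, Type t) {                     // add a binding: O(1)
        scopes.peekFirst().put(id, t);
    }

    Type lookup(String id) {                          // O(d) for depth d
        for (Map<String, Type> scope : scopes) {      // innermost scope first
            Type t = scope.get(id);
            if (t != null) return t;
        }
        return null;                                  // unbound identifier
    }
}

With this representation, the earlier Block, VarDecl, and Id code works unchanged.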
Immutable search tree. A second way to implement the typing context is to use a persistent (non-destructive) data structure, such as a binary search tree. A binary search tree can be updated in \(O(\lg n)\) time to produce a new binary search tree without affecting the original. This is possible because the data structure is immutable and the new binary search tree shares all but \(O(\lg n)\) nodes with the original tree. On entry to a scope, no change is necessary; as bindings are added to the environment, new contexts are constructed. On exit from a scope, the extended context is simply discarded. One advantage of this implementation is that the context used to check a given expression can be saved as a decoration, since it is immutable.
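A minimal sketch of a persistent tree context, left unbalanced for brevity; a production version would use a balanced tree (such as a red-black tree) to obtain the \(O(\lg n)\) bound:

final class TreeContext {
    final String id;
    final Type type;
    final TreeContext left, right;  // null represents an empty subtree

    private TreeContext(String id, Type type, TreeContext left, TreeContext right) {
        this.id = id; this.type = type; this.left = left; this.right = right;
    }

    // Returns a new tree containing the binding; the original tree is
    // untouched, and all nodes off the insertion path are shared.
    static TreeContext put(TreeContext t, String id, Type type) {
        if (t == null) return new TreeContext(id, type, null, null);
        int c = id.compareTo(t.id);
        if (c < 0) return new TreeContext(t.id, t.type, put(t.left, id, type), t.right);
        if (c > 0) return new TreeContext(t.id, t.type, t.left, put(t.right, id, type));
        return new TreeContext(id, type, t.left, t.right); // shadow the outer binding
    }

    static Type lookup(TreeContext t, String id) {
        while (t != null) {
            int c = id.compareTo(t.id);
            if (c == 0) return t.type;
            t = (c < 0) ? t.left : t.right;
        }
        return null;  // unbound identifier
    }
}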
Hash table with duplicate entries and a log. If a mutable implementation is acceptable, the asymptotically optimal data structure is a single chained hash table in which duplicate keys are allowed to appear in the same bucket. In addition, the context keeps a log of updated keys, which can be implemented as a linked list. An identifier is looked up by using the hash table directly and taking the earliest matching mapping in its bucket, taking \(O(1)\) time. There may be other mappings for the same identifier, but the ones belonging to outer scopes appear later in the bucket list. An identifier is added to the data structure by inserting it at the head of its bucket list and also recording it at the head of the log, taking total time \(O(1)\). On entry to a new scope, a special marker is put at the head of the log, a constant-time operation. On exit from a scope, all identifiers are popped from the log up to and including the first marker; each such identifier is also removed from the hash table in the usual way. The total time for this operation is proportional to the number of variables in the innermost scope, but that number is also proportional to the number of put() operations performed in that scope, so the total time to enter a scope, add variables, and exit is still proportional to the number of variables added. Hence, the amortized complexity of exiting a scope is also \(O(1)\). This implementation is probably the fastest, but it is not so easily constructed from off-the-shelf data structures.
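A sketch of this idea in Java. Rather than hand-rolling a chained hash table with duplicate keys in one bucket, this variant maps each name to a stack of its bindings (innermost first), which behaves equivalently and has the same costs; null entries in the log act as the scope markers:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

class LoggedContext {
    // For each name, its bindings, with the innermost scope's binding first.
    private final Map<String, Deque<Type>> table = new HashMap<>();
    // Names bound since entering the current scope; null is a scope marker.
    private final LinkedList<String> log = new LinkedList<>();

    void push() { log.addFirst(null); }               // enter a scope: O(1)

    void put(String id, Type t) {                     // add a binding: O(1)
        table.computeIfAbsent(id, k -> new ArrayDeque<>()).addFirst(t);
        log.addFirst(id);
    }

    Type lookup(String id) {                          // O(1): innermost binding
        Deque<Type> bindings = table.get(id);
        return bindings == null ? null : bindings.peekFirst();
    }

    void pop() {  // exit: pop the log up to and including the first marker
        String id;
        while ((id = log.pollFirst()) != null) {
            Deque<Type> bindings = table.get(id);
            bindings.removeFirst();                   // drop the innermost binding
            if (bindings.isEmpty()) table.remove(id); // keep lookups O(1)
        }
    }
}

Since it offers the same push/put/lookup/pop interface as the other two representations, the type-checking code itself does not change.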