Semantic Analysis and Symbol Tables

Semantic analysis

At this point, we have talked about lexing and parsing. Lexical analysis has converted the input stream of characters into tokens, and syntactic analysis (parsing) has converted the stream of tokens into an abstract syntax tree. The next phase of the compiler is semantic analysis, which checks that the AST represents a valid, well-formed program.

Each of these first three stages may detect some kind of error in the program. Lexical analysis detects lexical errors (ill-formed tokens), syntactic analysis detects syntax errors, and semantic analysis detects semantic errors, such as static type errors, undefined variables, and uninitialized variables. Once semantic analysis is complete and successful, the program must be a legal program in the programming language; no further errors in the program should be reported by the compiler.

In addition to checking that the AST is a valid program, semantic analysis may also compute additional information that is useful for the rest of the compilation process: for example, the types of expressions, the memory layout of data structures, or non-AST representations of classes and modules. The output of successful semantic analysis is thus a decorated AST in which the AST has been augmented with this additional information.

Type checking

For many languages, static type checking is the core task of semantic analysis. It is conventionally implemented as a recursive traversal of the AST. To see how this works, consider the problem of type-checking uses of the operator +. Since it shares many features with other binary operators, in an object-oriented language like Java it is natural to define a class that captures the common features of all binary operators, such as the left and right operands:

class BinaryExpr extends Expr {
    Expr left, right;
    ...
}

Let us suppose that our type checker is intended to record the type of each expression node in the AST. This decoration can be expressed by adding an instance variable to the common superclass Expr, along the lines of the following sketch (assuming a class Type that represents types):

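class Expr {
    Type type;                // decoration: filled in by semantic analysis
                              // (the class Type, representing types, is assumed)
    Type typeCheck() { ... }  // overridden by each kind of expression node
}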

The variable type is a decoration to be filled in by semantic analysis with an actual type. The method typeCheck() is used to implement type checking, with each node class supplying the logic for type-checking the kind of expression it describes; it is natural for typeCheck() to be a recursive method. Note that no arguments are given to typeCheck(); we will see shortly that at least one argument is needed.

Now let's try to write code for type-checking addition in Java. Here is a simplified version of that code, with some missing pieces. Assuming we have an existing representation of the types int and String, we might write code like the following, using recursive method calls to type-check the subtrees left and right:

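class Plus extends BinaryExpr {
    Type typeCheck() {
        Type t1 = left.typeCheck();    // recursively check the subtrees
        Type t2 = right.typeCheck();
        // Type.INT and Type.STRING are the assumed representations
        // of the types int and String.
        if (t1.equals(Type.INT) && t2.equals(Type.INT))
            type = Type.INT;           // integer addition
        else if (t1.equals(Type.STRING) && t2.equals(Type.STRING))
            type = Type.STRING;        // string concatenation
        else
            ...                        // missing piece: report a type error
        return type;
    }
}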

Typing contexts

We can see that something is still missing if we consider how to type-check an identifier expression (such as a local variable). The body of the typeCheck() method below is impossible to implement, because we have no way to know what type the identifier name should be associated with:

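class Id extends Expr {
    String name;
    Type typeCheck() {
        type = ...;   // stuck: what type is bound to name here?
        return type;
    }
}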

In general, when doing semantic analysis of some part of the program, we need a description of the surrounding context in which that program fragment is located. To handle type-checking identifiers, the context needs to include an environment that maps identifier names to types. In the setting of compilers, this environment is known as a symbol table. A simple typing context might look like the following:

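class Context {
    Type lookup(String name) { ... }     // the type bound to identifier name
    void put(String name, Type t) { ... }  // bind identifier name to type t
}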

Then, we modify the signature of typeCheck to take a Context as an argument. The implementation of Id.typeCheck becomes easy:

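class Id extends Expr {
    String name;
    Type typeCheck(Context c) {
        type = c.lookup(name);   // ask the symbol table
        return type;
    }
}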

However, the change to the signature of typeCheck means that the context must be threaded through all recursive calls. For example, the method Plus.typeCheck is updated accordingly:

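class Plus extends BinaryExpr {
    Type typeCheck(Context c) {
        Type t1 = left.typeCheck(c);   // the context is threaded through
        Type t2 = right.typeCheck(c);  // both recursive calls
        ...                            // the rest is as before
    }
}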

Formalizing typing contexts

We represent environments more formally as a mapping \(Γ\) from identifiers \(x\) to types \(t\). An environment \(Γ\) is written as \(Γ = x_1:t_1, x_2:t_2, \dots, x_n:t_n\), meaning that each identifier \(x_i\) is mapped to the corresponding type \(t_i\).

Using this notation, let's consider the following example, assuming that the initial typing context has no variables in scope: \(Γ=∅\).

{                               // Γ=∅
    int i, n;                   // Γ=i:int, n:int
    for (i = 0; i < n; i++) {
        boolean b = ...         // Γ=i:int, n:int, b:boolean
        ...
    }                           // Γ=i:int, n:int
}
                                // Γ=∅

As we can see, the typing context changes as we encounter variable declarations and as scopes are exited. For example, we might implement type checking for variable declarations as follows, adding the declared variable to the context:

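class VarDecl extends Stmt {        // class and field names are illustrative
    String name;
    Type declaredType;
    void typeCheck(Context c) {
        c.put(name, declaredType);  // bring the variable into scope
    }
}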

This implementation is designed under the assumption that the statements in a given block are checked sequentially, allowing new declarations to be accumulated by simply mutating the current environment.

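A sketch of Block, assuming a statement superclass Stmt whose typeCheck method also takes the context:

class Block extends Stmt {
    Stmt[] statements;
    void typeCheck(Context c) {
        c.push();               // save the current environment
        for (Stmt s : statements)
            s.typeCheck(c);     // declarations mutate c as we go
        c.pop();                // restore it: locals go out of scope
    }
}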

However, as the example above shows, blocks introduce another challenge: variables must go out of scope. One way to implement this is as shown in Block. The context maintains a stack of environments, so that c.push() saves the current environment and c.pop() restores the most recently saved environment that hasn't been restored.

Implementing typing contexts

Programming languages tend to have a lot of mechanisms for introducing new scopes. For example, in Java we have blocks, methods, classes (including nested classes), and packages. The typing context needs to be able to keep track of what identifiers are bound in all these scopes.

Stack of hash tables. The traditional implementation is a symbol table implemented as a stack of hash tables. The symbol table is updated imperatively during type checking. On entry to a new scope, a new, empty hash table is pushed onto the stack, taking \(O(1)\) time. On exit from a scope, the top hash table is popped from the stack, taking \(O(1)\) time. Adding a new identifier to the context just requires adding a mapping to the top hash table, in \(O(1)\) time. However, looking up an identifier requires, in general, walking down the stack, checking each hash table in turn. For an identifier at depth \(d\), this takes \(O(d)\) time. Of course, the value of \(d\) is small in most programs.
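To make this concrete, here is a minimal sketch in Java. The class name SymbolTable and the choice of Deque and HashMap are illustrative, and Type is the representation of types assumed earlier:

import java.util.*;

class SymbolTable {
    // innermost scope first
    private final Deque<Map<String, Type>> scopes = new ArrayDeque<>();

    void push() { scopes.addFirst(new HashMap<>()); }  // enter scope: O(1)
    void pop()  { scopes.removeFirst(); }              // exit scope: O(1)

    void put(String name, Type t) {                    // add binding: O(1)
        scopes.peekFirst().put(name, t);
    }

    Type lookup(String name) {                         // O(d) at depth d
        for (Map<String, Type> scope : scopes) {       // inner to outer
            Type t = scope.get(name);
            if (t != null) return t;
        }
        return null;                                   // unbound identifier
    }
}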

Immutable search tree. A second way to implement the typing context is to use a persistent (non-destructive) data structure, such as a binary search tree. A binary search tree can be updated in \(O(\lg n)\) time to produce a new binary search tree without affecting the original. This is possible because the data structure is immutable and the new binary search tree shares all but \(O(\lg n)\) nodes with the original tree. On entry to a scope, no change is necessary; as bindings are added to the environment, new contexts are constructed. On exit from a scope, the extended context is simply discarded. One advantage of this implementation is that the context used to check a given expression can be saved as a decoration, since it is immutable.
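Here is a minimal sketch of such an environment as an immutable binary search tree. For simplicity it does no rebalancing; a production version would use a balanced tree (e.g., a red-black tree) to guarantee the \(O(\lg n)\) bound. The class name Env is illustrative:

class Env {   // one immutable node; null is the empty environment
    final String name; final Type type;
    final Env left, right;
    Env(String name, Type type, Env left, Env right) {
        this.name = name; this.type = type; this.left = left; this.right = right;
    }

    static Type lookup(Env e, String n) {
        if (e == null) return null;                 // unbound identifier
        int c = n.compareTo(e.name);
        if (c == 0) return e.type;
        return lookup(c < 0 ? e.left : e.right, n);
    }

    // Returns a new environment; the original is untouched, and the two
    // trees share all nodes off the search path.
    static Env put(Env e, String n, Type t) {
        if (e == null) return new Env(n, t, null, null);
        int c = n.compareTo(e.name);
        if (c == 0) return new Env(n, t, e.left, e.right);
        if (c < 0)  return new Env(e.name, e.type, put(e.left, n, t), e.right);
        else        return new Env(e.name, e.type, e.left, put(e.right, n, t));
    }
}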

Hash table with duplicate entries and a log. If a mutable implementation is acceptable, the asymptotically optimal data structure is a single chained hash table in which duplicate keys may appear in the same bucket. In addition, the context keeps a log of updated keys, which can be implemented as a linked list. An identifier is looked up by using the hash table directly and taking the earliest matching mapping in its bucket, taking expected \(O(1)\) time. There may be other mappings for the same identifier, but the ones belonging to outer scopes appear later in the bucket list. An identifier is added by inserting it at the head of its bucket list and also recording it at the head of the log, taking total time \(O(1)\). On entry to a new scope, a special marker is pushed onto the head of the log, a constant-time operation. On exit from a scope, identifiers are popped from the log up to and including the first marker, and each such identifier is removed from the hash table in the usual way. The total time for this operation is proportional to the number of variables declared in the innermost scope, but that number is exactly the number of put() operations performed in that scope, so the total time to enter a scope, add variables, and exit is still proportional to the number of variables added. Hence, the amortized cost of exiting a scope is also \(O(1)\). This implementation is probably the fastest of the three, but it is not so easily built from off-the-shelf data structures.
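Here is a sketch of this design. For simplicity it keeps a per-identifier stack of bindings rather than literal duplicate entries in one bucket list; the behavior and asymptotic costs are the same. The class name FastSymbolTable is illustrative:

import java.util.*;

class FastSymbolTable {
    private static final Object MARKER = new Object();     // scope marker in the log
    private final Map<String, Deque<Type>> table = new HashMap<>();
    private final Deque<Object> log = new ArrayDeque<>();  // names and markers

    void put(String name, Type t) {                        // O(1)
        table.computeIfAbsent(name, k -> new ArrayDeque<>()).addFirst(t);
        log.addFirst(name);
    }

    Type lookup(String name) {                             // expected O(1)
        Deque<Type> bindings = table.get(name);            // innermost binding first
        return (bindings == null) ? null : bindings.peekFirst();
    }

    void push() { log.addFirst(MARKER); }                  // enter scope: O(1)

    void pop() {                                           // exit scope: amortized O(1)
        Object entry;
        while ((entry = log.pollFirst()) != MARKER)
            table.get((String) entry).removeFirst();       // undo innermost bindings
    }
}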