CS 312 Lecture 16:
An ML Interpreter

For the next few lectures we will investigate programming languages more deeply. We have talked about what the various constructs of SML mean and how they are evaluated. We have seen that evaluation of SML programs can be described by rewrite rules that explain how to reduce SML subexpressions to other expressions. When we have fully described how to evaluate a program, we have obtained a semantics for the programming language. The word "semantics" means "meaning". A semantics for a programming language tells you how to determine what any program in that language means.

The kinds of semantics we have looked at are operational semantics: descriptions of how to evaluate programs. (There are other kinds of semantics, such as axiomatic semantics, which tell you how to prove statements about programs.) There is even more than one way to specify an operational semantics for a given programming language. We have been exploring a particular operational model of evaluation called the substitution model. The key idea of the substitution model is that when a variable is bound to a value (by pattern-matching), the value is substituted in place of all occurrences of the variable that are bound by the pattern in question.

Evaluation

In a functional language, we can think of the execution of the program as a series of rewrite steps applied to the program text. This is also how we usually think about the evaluation of an arithmetic expression. For example, if we see the expression (2+3)*4+3*4, we know that it evaluates in four steps:

(2+3)*4+3*4 -> 5*4+3*4 -> 20+3*4 -> 20+12 -> 32

In each step, we take some part of the expression and replace it with a new expression. For example, in the first step we replace 2+3 with 5. Thus, each rewrite step acts locally to replace a subexpression with its value. These local rewritings are called reductions.

Sometimes there are several rewrite steps we can choose for a given expression; these different choices lead to different evaluation orders. For example, here is a different evaluation order for the same expression:

(2+3)*4+3*4 -> (2+3)*4+12 -> 5*4+12 -> 20+12 -> 32

It doesn't matter in what order we evaluate things; we always get the same result. This will also be true for SML as long as we stick to functional language features (that is, stay away from imperative features such as refs, arrays, and :=). One benefit of functional programming is precisely this: the result of evaluating an expression does not depend on the order of evaluation, and it is the same no matter how many times the expression is evaluated.
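
The rewriting view of arithmetic can be sketched as a small SML program. This is only an illustration: the datatype exp and the functions step and trace are names invented here, and step always picks the leftmost available reduction, which is just one of the possible evaluation orders.

```sml
(* Hypothetical AST for arithmetic expressions; the names are illustrative. *)
datatype exp = Num of int
             | Add of exp * exp
             | Mul of exp * exp

(* Perform one reduction on the leftmost reducible subexpression.
   Returns NONE when the expression is already a number. *)
fun step (Num _) = NONE
  | step (Add (Num a, Num b)) = SOME (Num (a + b))
  | step (Mul (Num a, Num b)) = SOME (Num (a * b))
  | step (Add (l, r)) =
      (case step l of
           SOME l' => SOME (Add (l', r))
         | NONE => Option.map (fn r' => Add (l, r')) (step r))
  | step (Mul (l, r)) =
      (case step l of
           SOME l' => SOME (Mul (l', r))
         | NONE => Option.map (fn r' => Mul (l, r')) (step r))

(* Collect every intermediate expression until no reduction applies. *)
fun trace e =
  case step e of
      NONE => [e]
    | SOME e' => e :: trace e'

(* The expression (2+3)*4+3*4 from the text. *)
val example = Add (Mul (Add (Num 2, Num 3), Num 4), Mul (Num 3, Num 4))
val steps = trace example
```

Here steps holds the starting expression followed by the result of each of the four reductions, ending in Num 32. Changing the order of the clauses in step would give a different evaluation order with the same final result.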

Here are some examples of simple SML evaluations:

#2(2+3*4, false) -> #2(2+12, false) -> #2(14, false) -> false
false::(false orelse true)::nil -> false::true::nil

These evaluations use various reductions that are part of SML. For example, there are many arithmetic reductions of the form v₁ op v₂ -> v₃. In addition, there are reductions on tuples; as seen in the first example, we have a reduction

#i(v₁,...,v_n) -> v_i    (where 1 <= i <= n)

Every SML expression form has its own reductions. For example, the if..then..else expression has two reductions that capture the essential computational behavior of the expression:

if true then e₁ else e₂ -> e₁
if false then e₁ else e₂ -> e₂
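
These reductions can be checked directly at the SML prompt. The sketch below is only a sanity check, not part of any interpreter; the names a, b, c and d are arbitrary. The raise expressions sit in the branch that the if reductions discard, so they are never evaluated.

```sml
(* Only the selected branch is evaluated, so neither raise is reached. *)
val a = if true then 1 else raise Fail "else branch evaluated"
val b = if false then raise Fail "then branch evaluated" else 2

(* The two evaluation examples shown earlier. *)
val c = #2 (2 + 3 * 4, false)
val d = false :: (false orelse true) :: nil
```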

Expressions vs. Values

When does the program stop? In arithmetic, it's when we reach a number, because there are no further steps to take. In general, we have some set of expressions in the programming language that can't be evaluated any further; we call these expressions values. Values are things that you can type at the SML prompt and get the same thing right back. For example, in SML, the following are values:

1
true
"hello"
(true, "5", 1)
fn(x:int) => x
5::4::nil         (=[5,4])

The following expressions are not values, because an evaluation step can be performed on them:

1+2
true orelse false
(true, "5", 0+1)
(fn(x:int) => x) (3)

We can write a BNF grammar for values v, just as we did earlier for expressions:

c ::= integer_const
    | bool_const
    | string_const
    | real_const
    | char_const

v ::= c                     (* constants *)
    | (v1,...,vn)           (* tuples of values *)
    | (fn (id:t):t' => e)   (* anonymous functions *)
    | {id1=v1, ..., idn=vn} (* records of values *)
    | Id  | Id(v)           (* data constructors *)

Anything described by this grammar is a value and thus a legal result of an SML program. In other words, any tuple whose elements are values is itself a value; any record whose fields are bound to values is a value; any data constructor applied to a value is a value; and any anonymous function is a value, even if its body is an arbitrary expression e. That is, the body of a function is not evaluated at all until the function is applied to an argument.
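
The grammar for values translates almost line by line into a predicate on expressions. The following is a minimal sketch under simplifying assumptions: the datatype exp and its constructor names are invented for illustration, type annotations on functions are dropped, and records are omitted because they would be handled just like tuples.

```sml
(* Hypothetical expression AST; the constructor names are illustrative. *)
datatype exp = Num of int
             | Bool of bool
             | Str of string
             | Var of string
             | Tuple of exp list
             | Fn of string * exp          (* parameter name and body *)
             | Con of string * exp option  (* Id or Id(e) *)
             | Binop of string * exp * exp
             | App of exp * exp

(* An expression is a value exactly when the grammar for v describes it. *)
fun isValue (Num _) = true
  | isValue (Bool _) = true
  | isValue (Str _) = true
  | isValue (Tuple es) = List.all isValue es
  | isValue (Fn _) = true                  (* the body is never inspected *)
  | isValue (Con (_, NONE)) = true
  | isValue (Con (_, SOME e)) = isValue e
  | isValue (Var _) = false
  | isValue (Binop _) = false
  | isValue (App _) = false

(* Examples from the text. *)
val v1 = isValue (Tuple [Bool true, Str "5", Num 1])
val v2 = isValue (Tuple [Bool true, Str "5", Binop ("+", Num 0, Num 1)])
val v3 = isValue (Fn ("x", Binop ("+", Num 1, Num 2)))
val v4 = isValue (App (Fn ("x", Var "x"), Num 3))
```

The tuple containing 0+1 is not a value because one component can still be reduced, while the function is a value even though its body contains a reducible expression.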

How do we know that a program will always reach a value? Actually, we don't. A program might go into an infinite loop. But no matter how long the program executes, as long as it hasn't reached a value there will always be a reduction to perform. For example, we'll never have to apply a reduction to #i(v₁,...,v_n) where i > n. The SML type checker ensures that this and other bad things will never happen. This is what it means to say that SML is type-safe.

Variables

Of course, SML is quite a bit more complicated than 3rd-grade arithmetic. The biggest difference is that in SML expressions can contain variables: names that are bound to values. In the substitution model we handle variables by substituting for them the values to which they are bound. For example, the expression let val x=2 in x+3 end is evaluated by taking its body, x+3, and replacing all occurrences of x with the value to which it is bound, 2. Therefore, it steps to 2+3 and then to 5. In general, an expression of the form let val x=v in e' end is evaluated by replacing it with e', but with occurrences of x replaced by v. We denote the result of this substitution as e'{v/x}; that is, there is a reduction

let val x=v in e' end -> e'{v/x}

Here are some examples of substitution:

x{true/x} = true
x{true/y} = x
(x+(2*x)){1/x} = 1 + (2*1)
(x + let val x = 1 in x end){2/x} = (2 + let val x = 1 in x end)
((fn x: int => x+1)(#1 x)){(3,"three")/x} = (fn x: int => x+1)(#1 (3,"three"))

Occurrences of a variable in an expression can be either bound, unbound, or binding occurrences. For example, in the expression x+3, the variable x is unbound: its meaning is not defined by the expression. In the expression x + let val x = 1 in x+3 end, the first occurrence of x is unbound; the second is a binding occurrence that binds x to the value 1 throughout the body of the let expression. The third occurrence is a bound occurrence because it occurs within the scope of the second, binding occurrence.

The last two substitution examples illustrate an important point: when we substitute for some variable x, we don't replace the binding or bound occurrences of x, because that variable is really a different variable despite having the same name.
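
The rule about not replacing binding or bound occurrences can be made concrete with a short substitution function. This is a sketch over a hypothetical AST (the constructor names and the curried argument order of subst are choices made here), and it assumes the substituted value v contains no unbound variables, so variable capture cannot occur.

```sml
(* Hypothetical AST with variables, let, and anonymous functions. *)
datatype exp = Var of string
             | Num of int
             | Add of exp * exp
             | Let of string * exp * exp   (* let val x = e1 in e2 end *)
             | Fn of string * exp
             | App of exp * exp

(* subst v x e computes e{v/x}: replace the unbound occurrences of x in e. *)
fun subst v x e =
  case e of
      Var y => if x = y then v else e
    | Num _ => e
    | Add (a, b) => Add (subst v x a, subst v x b)
    | Let (y, e1, e2) =>
        (* e1 is outside the scope of y; e2 is left alone if y rebinds x *)
        Let (y, subst v x e1, if x = y then e2 else subst v x e2)
    | Fn (y, body) => if x = y then e else Fn (y, subst v x body)
    | App (f, a) => App (subst v x f, subst v x a)

(* (x + let val x = 1 in x end){2/x} *)
val r1 = subst (Num 2) "x" (Add (Var "x", Let ("x", Num 1, Var "x")))

(* (fn x => x+1){3/x} leaves the function unchanged *)
val r2 = subst (Num 3) "x" (Fn ("x", Add (Var "x", Num 1)))
```

In r1 only the first x is replaced; the x inside the let belongs to the inner binding and is untouched.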

We can also use substitution to explain the action of a function invocation. An expression of the form

(fn(x: t) => e) (v)

reduces to

e{v/x}

That is, we take the body of the function and replace all occurrences of x that are unbound within the body e (these are exactly the occurrences bound by the binding occurrence in the argument list) with the actual argument value v.

What about named functions? A declaration of the form

fun f(x: t):t' = e

is mostly just syntactic sugar for the declaration

val f = fn(x: t) => e

(It isn't completely syntactic sugar, because a named function can refer to itself recursively. But that's another story.) So we can understand the evaluation of calls to named functions as using the same rule as calls to anonymous functions. Here's an example:

let val y = 3
    fun f(x:int):int = x*y
in
  f(2+y)
end
->                           (let reduction)
let fun f(x:int):int = x*3
in
  f(2+3)
end
->                           (let reduction)
(fn(x:int):int => x*3)(2+3)
->                           (+ reduction)
(fn(x:int):int => x*3)(5)
->                           (fn application reduction)
5*3
->                           (* reduction)
15
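
The example above can be replayed mechanically by a tiny evaluator built on substitution. This is a sketch under stated assumptions, not the interpreter discussed later in these notes: the AST, the exception Stuck, and the helper arith are invented here, type annotations are dropped, and a non-recursive fun declaration is represented as a let that binds the name to an anonymous function.

```sml
datatype exp = Var of string
             | Num of int
             | Add of exp * exp
             | Mul of exp * exp
             | Let of string * exp * exp   (* let val x = e1 in e2 end *)
             | Fn of string * exp
             | App of exp * exp

(* subst v x e computes e{v/x}; v is assumed to have no unbound variables. *)
fun subst v x e =
  case e of
      Var y => if x = y then v else e
    | Num _ => e
    | Add (a, b) => Add (subst v x a, subst v x b)
    | Mul (a, b) => Mul (subst v x a, subst v x b)
    | Let (y, e1, e2) =>
        Let (y, subst v x e1, if x = y then e2 else subst v x e2)
    | Fn (y, body) => if x = y then e else Fn (y, subst v x body)
    | App (f, a) => App (subst v x f, subst v x a)

exception Stuck of string

(* Arithmetic reduction on two values. *)
fun arith f (Num a, Num b) = Num (f (a, b))
  | arith _ _ = raise Stuck "arithmetic on a non-number"

(* Evaluate subexpressions first, then apply the reduction for this form. *)
fun eval e =
  case e of
      Num _ => e
    | Fn _ => e
    | Var x => raise Stuck ("unbound variable " ^ x)
    | Add (a, b) => arith (fn (m, n) => m + n) (eval a, eval b)
    | Mul (a, b) => arith (fn (m, n) => m * n) (eval a, eval b)
    | Let (x, e1, e2) => eval (subst (eval e1) x e2)
    | App (f, a) =>
        (case (eval f, eval a) of
             (Fn (x, body), v) => eval (subst v x body)
           | _ => raise Stuck "applying a non-function")

(* let val y = 3  fun f(x) = x*y  in f(2+y) end *)
val program =
  Let ("y", Num 3,
       Let ("f", Fn ("x", Mul (Var "x", Var "y")),
            App (Var "f", Add (Num 2, Var "y"))))
val result = eval program
```

Evaluating program follows the same reductions as the trace above and yields Num 15.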

Evaluation order

The other thing we have to keep in mind is that we can't perform reductions just anywhere. Each SML expression imposes some order on the evaluation of its subexpressions. For example, no reductions can be performed on the body of a let expression until all of its declarations have been evaluated and the results substituted into the body. Similarly, no evaluations are performed

Abstract syntax

When we talk about language semantics, we first need to say what it is we are defining the semantics of; that is, what is our representation of a "program". One obvious representation is the stream of bytes that are the ASCII codes for the characters in the program. However, this representation is not convenient for talking about language semantics.

Early in the course we commented on a similarity between BNF declarations and datatype declarations. In fact, we can define datatype declarations that act like the corresponding BNF declarations. The values of these datatypes then represent legal expressions that can occur in the language. For example, our earlier BNF definition of legal SML types

(base types)   b ::= int | real | string | bool | char
(types)        t ::= b | t₁->t₂ | t₁*t₂*...*t_n | { id₁ : t₁,..., id_n : t_n } | id

has the same structure as the following datatype declarations:

type id = string
datatype baseType = Int | Real | String | Bool | Char
datatype type_ = Base of baseType | Arrow of type_*type_ | Product of type_ List
                 | Record of (id*type_) List | DatatypeName of id

Any legal SML type expression can be represented by a value of type type_ that contains all the information in the type expression. This value is known as the abstract syntax for that expression. It is abstract, because it doesn't contain any information about the actual symbols used to represent the expression in the program. For example, the abstract syntax for the (type) expression int*bool->{name: string} would be

Arrow( Product(Cons(Base Int, Cons(Base Bool, Nil))),
       Record(Cons(("name", Base String), Nil)))
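
For experimenting at the prompt, the same declarations can be written with SML's built-in list type. This sketch departs from the declarations above in that one respect (list and [...] in place of List, Cons and Nil), and the printing functions baseToString and toString are illustrative additions that recover concrete syntax from the abstract syntax.

```sml
type id = string
datatype baseType = Int | Real | String | Bool | Char
datatype type_ = Base of baseType
               | Arrow of type_ * type_
               | Product of type_ list
               | Record of (id * type_) list
               | DatatypeName of id

fun baseToString Int = "int"
  | baseToString Real = "real"
  | baseToString String = "string"
  | baseToString Bool = "bool"
  | baseToString Char = "char"

(* Print a type, adding parentheses only where precedence requires them. *)
fun toString (Base b) = baseToString b
  | toString (DatatypeName n) = n
  | toString (Record fs) = "{" ^ String.concatWith ", " (map field fs) ^ "}"
  | toString (Product ts) = String.concatWith "*" (map component ts)
  | toString (Arrow (a, b)) = domain a ^ "->" ^ toString b
and field (n, t) = n ^ ": " ^ toString t
(* A product component needs parentheses if it is a product or an arrow. *)
and component (t as Product _) = "(" ^ toString t ^ ")"
  | component (t as Arrow _) = "(" ^ toString t ^ ")"
  | component t = toString t
(* The left side of an arrow needs parentheses only if it is an arrow. *)
and domain (t as Arrow _) = "(" ^ toString t ^ ")"
  | domain t = toString t

(* Abstract syntax for int*bool->{name: string} *)
val example = Arrow (Product [Base Int, Base Bool],
                     Record [("name", Base String)])
val printed = toString example
```

The parentheses that toString inserts are exactly the kind of concrete-syntax detail that the abstract syntax itself does not record.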

It will be convenient to draw abstract syntax as trees. For example, the expression above has the following abstract syntax tree (AST):

            ->
           /  \
          *    {}
         / \     \
      int  bool   name: string

In this diagram, the names of the nodes are not essential; for example, -> is written where Arrow could be as easily and as correctly written instead.

Compilers typically use abstract syntax internally to represent the program that they are compiling, and we can also use it to talk about operational semantics. Inside a compiler it is the job of the parser to convert the string-of-characters representation of the program into the abstract syntax. Parsers can be built mostly automatically by giving the BNF grammar for the language to a parser generator. To learn how parser generators work, take CS 412!

Definitional Interpreter

Now that we have a representation of an SML program as a data structure, we have the opportunity to precisely define the semantics of SML by writing a definitional interpreter. An interpreter is a program that accepts as input another program written in some language, and executes that program (or simulates its execution, depending on your viewpoint). A definitional interpreter is an interpreter written for the purpose of describing the semantics of a programming language. Since its purpose is to help us understand what SML programs are supposed to do, we will put a premium on clarity and worry less about performance issues here. However, it is possible to produce a reasonably fast interpreter using the basic approach shown here.

Below is a definitional interpreter for a subset of SML. Here are some things to notice about this interpreter:

- The first part of the code defines the abstract syntax of the simplified language. Because values and expressions overlap, we can't define values as a separate datatype; instead, a function is_value determines whether an expression is a value according to the rules above.
- toString is a helper function that prints expressions in a more readable form than the AST it accepts as input. It isn't really part of the interpreter, though.
- subst explains how substitution is done. Notice that the rules for let and fn expressions substitute into the body of the expression only if the variable being substituted is not bound by that expression.
- eval_binop implements all the reductions for primitive types.
- The function eval takes an expression AST as input and returns the value that the expression evaluates to.
  - For each possible kind of expression, it (1) recursively evaluates any subexpressions, (2) performs the appropriate reduction for the resulting expression, and (3) applies eval to finish the evaluation of the new, reduced expression.
  - So each expression has some core code that performs the reduction, plus some code around it that specifies the order in which things should be evaluated.
  - Since we have no type checker yet to make sure that there is always a legal reduction to be performed, the interpreter does run-time checking. For example, if a raw, unsubstituted variable shows up during evaluation, it must have been unbound in the original program, because it was never substituted for by a containing expression that bound it.
  - Evaluation is split into two functions, eval and eval', so that the interpreter not only reports the final result of evaluation but also reports each intermediate step along the way.

In Problem Set 5, you will be building an interpreter for a language that is not too different from ML, except that it is a concurrent language. Like this interpreter, your interpreter will have to implement reductions. Unlike this interpreter, your evaluator will take only one evaluation step at a time. This will be necessary in order to simulate the execution of multiple concurrent processes. So this interpreter is a good model for your Problem Set 5 code in some ways, but not in others.

[Listing: code/interp1.sml]

Pattern Matching

The language above doesn't support datatypes or pattern matching. Here is a definitional interpreter based on the substitution model that does support pattern matching.

[Listing: code/interp2.sml]
