CS 312 Lecture 5
Identifiers, Scope, Binding. Evaluation and Substitution.

Administrivia

We returned homework 1 in sections yesterday. If you did not pick up yours, Alan has it. Read the regrade policy on the course web page and make sure you submit your regrade request within two weeks from the return of the homework.

Motivation

We spent the last four lectures talking about SML, and we have come a long way. While we still have many things to learn, we can now write recursive functions, we understand types, we know about - and use - parameterized types and polymorphic functions. This is quite a lot - in the case of other languages one might not even be done with syntax in four lectures! Our progress, however, hides a number of dirty little secrets; on our way here we have skirted a number of important issues - now we will start to catch up.

We mentioned informally that the following two constructs are equivalent:

fun f(a: t1): t2 = e
val f :t1->t2 = fn a:t1 => e

This equivalence seems to hold for a function that, for example, only increments its unique integer argument:

- fun f1(a: int): int = a + 1;
val f1 = fn : int -> int
- f1 3;
val it = 4 : int
- val f2: int -> int = fn a:int => a + 1;
val f2 = fn : int -> int
- f2 3;
val it = 4 : int

But what about recursive functions? Take a look at this:

- fun f3(n: int): int = case n of 0 => 0 | _ => n * n * (f3 (n - 1));
val f3 = fn : int -> int
- val f4: int -> int = fn n: int => case n of 0 => 0 | _ => n * n * (f4 (n - 1));
stdIn:19.9-19.11 Error: unbound variable or constructor: f4

The equivalence we hoped for does not hold; apparently f4 is not known (or "visible," or "accessible") in the very expression that defines it. It is thus important for us to understand the rules that govern the "visibility" of identifiers.
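
One way to recover the equivalence for recursive functions is the rec keyword, which makes the name being declared visible inside its own defining expression. The following is a minimal sketch; the names f5 and r are ours, introduced only for illustration:

```sml
(* val rec puts f5 in scope inside the fn expression that defines it. *)
val rec f5 = fn (n: int) => case n of 0 => 0 | _ => n * n * (f5 (n - 1))

(* Like f3 above, f5 returns 0 for any non-negative argument, *)
(* because the base case contributes a factor of 0. *)
val r = f5 3
```

Without rec, the right-hand side of a val declaration is evaluated in the scope that existed before the declaration, which is why f4 was reported as unbound.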

Let us look at another example:

- let
=   val int = 3
=   val int:int->bool = fn(int:int) => int = int
= in
=   int(3)
= end;
val it = true : bool

Here int names an integer value, a function, that function's parameter, and a type, making the let expression quite confusing. Clearly, we need precise rules that define the meaning of expressions like the one above.

We note here that writing confusing code is discouraged - don't do it even if you understand the relevant rules well. It is easy to write hard-to-understand code that relies on obscure features - it is hard to write crisp, simple (w.r.t. the problem at hand), and efficient solutions. We encourage you to do the latter.

Identifiers and Scope

Identifiers allow us to name all important SML constructs, such as values, types, datatype constructors, type variables, as well as signatures and modules.

Notes:

Functions are just a special category of values.
An identifier associated with a value is often called a variable, a term that suggests that the value associated with the identifier can change. Nothing that we have learnt until now, however, allows us to change the value associated with an identifier (but we can shadow the respective identifier, see below). In a purely functional language, like the subset of SML that we have learnt up to now, there are thus no proper variables like there are, say, in Java/C/C++.

Interestingly, the impossibility of changing the value associated with an identifier does not preclude us from reusing an identifier to define a new association that will supersede (shadow) the original one. In the let example above, the declaration of function int (the second val declaration) shadows the declaration of the integer variable int on the immediately preceding line.

Shadowed variables do not cease to exist; if the shadowing variable's definition "expires" (goes out of scope), the old variable becomes visible again.

Shadowing is a concept that is often encountered in modern programming languages. Consider, for example, the following generic piece of Java/C++ code:

{
  int s = 0;
  int x = 1;

  s += x;

  {
    int x = 3;  
    s += x;
  }

  s += x;
}

The first addition to s increases its value by 1, the second by 3, the third by 1 again. Notice how once the second definition of x "expires" (goes out of scope), the old, once shadowed value of x becomes visible again. SML behaves much like this; in some sense you can think of SML as behaving like Java/C/C++ restricted to having only variable declarations and initializations.
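
The same behavior can be sketched directly in SML. Since we cannot update s, the sketch below introduces a new name for each running total; the names s1, s2, and s3 are ours and purely illustrative:

```sml
val x = 1
val s1 = 0 + x                          (* uses the outer x: 1 *)

(* Inside the let, the inner x shadows the outer one. *)
val s2 = let val x = 3 in s1 + x end    (* 1 + 3 = 4 *)

(* The inner x is out of scope here, so x is the outer binding again. *)
val s3 = s2 + x                         (* 4 + 1 = 5 *)
```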

The region of an SML program in which a given identifier is visible (i.e. the identifier can be referred to) is called the respective identifier's scope. The scope of an identifier in SML can always be determined on the basis of the program's text only. We say that identifiers have static (or lexical) scoping; no run-time information is needed to determine the scope of a variable. We talk about the scope of identifiers in general to emphasize that it is not only variables that have scope. Types and datatype constructors, for example, have scope as well:

- fun f():int =
=   let
=      datatype 'a mylist = Empty | LIST of 'a * 'a mylist
=      val dummy = LIST(3, Empty)
=   in
=      3
=   end;
val f = fn : unit -> int

- LIST(3, Empty);
stdIn:35.1-35.5 Error: unbound variable or constructor: LIST

There exist languages that use dynamic (as opposed to static) scoping, for example Emacs Lisp (historically its default) and shell languages such as Bash. The grandfather of functional languages, LISP, also has elements of dynamic scoping. In dynamic scoping the "visibility" of an identifier depends on the execution path, and in general cannot be determined based only on a given program's text (the execution path will, in general, depend on the program's input, for example). Consider the following example in (non-existent) dynamically-scoped ML:

(* We assume no x is defined in the global scope. *)
(* With this assumption 'regular' SML would complain of x being unbound. *)
fun f(): int = x + 1

fun g(): int =
let
  val x: int = 3
in
  f()
end

fun h(): int = 
let
  val x: string = "this is bad"
in
  f()
end

Assuming that SML had dynamic scoping, expression g() would evaluate to 4 (function f would use the most recent definition of x that was encountered during execution and is still in effect), while expression h() would produce an error (one cannot add 1 to a string). Dynamic scoping is powerful and flexible, but it imposes a lot of overhead (e.g. type checking must be done at execution time) and it is harder to think about (thus being more error-prone).
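
For contrast, here is a sketch of what real, statically scoped SML does with a similar program. We assume a global x bound to 10 so that f is legal; all names are illustrative:

```sml
val x = 10

(* The x in the body of f refers to the global x above, *)
(* because that is the binding whose scope textually encloses f. *)
fun f (): int = x + 1

fun g (): int =
  let
    val x: int = 3   (* a new x, not visible inside the body of f *)
  in
    f ()
  end

val r = g ()         (* 11 under static scoping, not 4 *)
```

The local x in g has no effect on f: the scope of that x is only the text between in and end.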

Binding

There are three different ways that one can use an identifier:

A binding occurrence, which binds the identifier to a particular value or type. For example, in the expression let val x:int = 1 in x end, the first occurrence is a binding occurrence that binds x to 1. Each binding occurrence introduces a new variable.
A bound occurrence is a use of an identifier in the scope of an identifier binding. For each bound occurrence of an identifier, there is a single corresponding binding of that identifier. For example, in expression (fn(x:int)=>x), the second occurrence of x is a bound occurrence; its corresponding binding occurrence is the first occurrence. At run time this variable will be bound to whatever value is passed to the function when it is invoked.
An unbound or free occurrence is a use of an identifier with no corresponding binding occurrence in whose scope the respective identifier would exist. For example, in the expression let val y:foo = x+1 in y end, the use of x is an unbound occurrence because there is no containing binding of x. The identifier foo is also an unbound occurrence of a type identifier. A legal SML program cannot contain an unbound occurrence of an identifier. However, for the purpose of understanding how SML works, sometimes it is useful to write down syntactically legal fragments of SML programs and talk about the unbound variables that occur in them.

Given an occurrence of an identifier that is not a binding occurrence, there is a simple way to figure out whether it is bound or unbound, and if the former, to which binding occurrence the identifier is bound. If the variable lies within the scope of more than one binding occurrence, then one of those bindings shadows the rest. It will be the binding occurrence whose scope most tightly encloses the use of the identifier.
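
As an illustration of the "most tightly enclosing" rule, consider the sketch below, in which three different bindings of x are in play (the names are ours):

```sml
val x = 1                      (* binding 1 *)

val r =
  let
    val x = 2                  (* binding 2 shadows binding 1 inside the let *)
  in
    (* The x in x + 1 is bound by the function parameter (binding 3), *)
    (* the tightest enclosing binding. The final x lies outside the fn, *)
    (* so it is bound by binding 2. *)
    (fn (x: int) => x + 1) 10 + x
  end
```

Here r is (10 + 1) + 2 = 13; binding 1 is never used inside the let.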

Substitution

An SML program is evaluated by performing a number of transformations on the initial program to produce the final result. Evaluation stops when no more such transformations can be applied. If there is always a transformation that can be applied to the program, then we say that the program is in an infinite loop.

Note that the transformations that we describe here are specified at a conceptual, abstract level. The actual implementation details, while interesting in their own right, are completely ignored.

Let us assume that x is bound to -5; then consider the evaluation of the following if expression:

if x > 0 then sqrt(x) else 0

Now, SML could evaluate this expression by first evaluating the conditional expression and the expressions on the two branches of the if. Then, based on the value of the condition, one of the values obtained from the if's branches could be returned. This method of evaluating an if would clearly be wasteful, as it would involve the computation of values that are ultimately not used. There is also a more subtle problem we encounter here: in an otherwise correct program the evaluation of the branch that is not selected may lead to a fatal error (in fact, many if expressions are introduced precisely to shield certain expressions from evaluation when that would lead to an error). In our example, the then branch would apply sqrt to a negative number. Clearly, the order in which expressions are evaluated matters.

Here is what SML actually does: first, it evaluates the condition, then it uses the following rewrite rules to replace the if expression with the - yet unevaluated - expression on one of its branches:

if true then e1 else e2 --> e1
if false then e1 else e2 --> e2

Thus the original if expression is substituted by the expression on one of its branches; in the next step it is this expression that will be evaluated.
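
The difference can be observed in a small sketch. The helper names safeDiv and strictIf are hypothetical; strictIf is an ordinary function, so its arguments are all evaluated before the call, which mimics the wasteful strategy described above:

```sml
(* The built-in if evaluates only the selected branch, *)
(* so n div d is never evaluated when d = 0. *)
fun safeDiv (n: int, d: int): int = if d = 0 then 0 else n div d
val r1 = safeDiv (7, 0)

(* A function version of if: both "branches" are evaluated as arguments. *)
fun strictIf (c: bool, t: int, e: int): int = if c then t else e

(* 7 div 0 raises Div before strictIf is ever entered. *)
val r2 = strictIf (true, 0, 7 div 0) handle Div => ~1
```

Here r1 is 0, while r2 is ~1 because the exception is raised while evaluating the arguments.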

Similar rewrite rules apply to other expressions; consider, for example, a simple let expression:

let val x:t = v in e end --> e (with occurrences of x replaced by v)

Here e is an arbitrary expression, while x is an arbitrary identifier; v is a value -- that is, a fully evaluated expression (term).

We now know this rule will break occasionally: e may contain occurrences of x whose binding occurrence is not the binding in the let (x:t = v). It doesn't make sense to substitute v for these occurrences. For example, consider evaluation of the expression:

let
  val x:int = 1
  fun f(x:int) = x
  val y:int = x+1
in
  fn(a:string) => x*2
end

The next step of evaluation replaces the occurrences of x in the declaration of y and in the body of the let (that is, in x+1 and x*2) with 1, because these occurrences have the first declaration as their binding occurrence. Notice that the two occurrences of x inside the function f, which are respectively a binding and a bound occurrence, are not replaced. Thus, the result of this rewriting step is:

let
  fun f(x:int) = x
  val y:int = 1+1
in
  fn(a:string) => 1*2
end
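
As a sanity check, one can bind both forms in SML and observe that they behave identically. In the sketch below the names original and rewritten are ours; since both expressions are functions, we compare them by applying each to an arbitrary string:

```sml
val original =
  let
    val x:int = 1
    fun f(x:int) = x
    val y:int = x+1
  in
    fn(a:string) => x*2
  end

val rewritten =
  let
    fun f(x:int) = x
    val y:int = 1+1
  in
    fn(a:string) => 1*2
  end

(* Both functions ignore their argument and return 2. *)
val r1 = original "any string"
val r2 = rewritten "any string"
```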

Let's write the substitution e{v/x} to mean the expression e with all unbound occurrences of x replaced by the value v. Then we can restate the rule for evaluating let more simply:

let val x:t = v in e end --> e{v/x}

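This works because any unbound occurrence of x in e is bound by exactly this declaration: val x:t = v. Occurrences of x that are bound inside e by a nested declaration or a function parameter are left untouched. Here are some examples of substitution: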

x{2/x} = 2
x{2/y} = x
(fn(y:int) => x) {"hi"/x} = (fn(y:int) => "hi")
(fn(y:int) => y) {"hi"/x} = (fn(y:int) => y)
f(x) { fn(y:int) => y/f } = (fn(y:int) => y)(x)
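
Substitution can itself be written as a short SML function over a toy expression language. The datatype exp and the function subst below are hypothetical and far smaller than SML; the sketch also assumes that the value v has no unbound variables of its own, so no variable capture can occur:

```sml
(* A toy expression language (hypothetical, for illustration only). *)
datatype exp =
    Int of int
  | Var of string
  | Plus of exp * exp
  | Fn of string * exp          (* Fn (x, body) stands for fn x => body *)
  | App of exp * exp

(* subst (v, x) e computes e{v/x}: every unbound occurrence of x in e *)
(* is replaced by v. Assumes v is closed (contains no unbound variables). *)
fun subst (v: exp, x: string) (e: exp): exp =
  case e of
      Int _ => e
    | Var y => if y = x then v else e
    | Plus (e1, e2) => Plus (subst (v, x) e1, subst (v, x) e2)
      (* A parameter named x shadows the x being replaced: stop here. *)
    | Fn (y, body) => if y = x then e else Fn (y, subst (v, x) body)
    | App (e1, e2) => App (subst (v, x) e1, subst (v, x) e2)

(* x{2/x} = 2 *)
val ex1 = subst (Int 2, "x") (Var "x")
(* (fn y => x){5/x} = (fn y => 5) *)
val ex2 = subst (Int 5, "x") (Fn ("y", Var "x"))
(* (fn x => x){5/x} = (fn x => x), since the inner x is bound by the fn *)
val ex3 = subst (Int 5, "x") (Fn ("x", Var "x"))
```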

One of the features that makes ML fairly unique is the ability to write complex patterns containing binding occurrences of variables. Pattern matching in ML causes these variables to be bound in such a way that the pattern matches the supplied value. This can be a very concise and convenient way of binding variables. We can generalize the notation used above by writing e{v/p} to mean the expression e with all unbound occurrences of variables appearing in the pattern p replaced by the values obtained when p is matched against v. Using this notation, we can express the let rule simply:

let val p = v in e end --> e{v/p}
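
For instance, the following sketch uses tuple patterns in let declarations (the variable names are illustrative):

```sml
(* Matching the pattern (a, b) against the value (1, 2) binds a to 1 *)
(* and b to 2, so the let steps to 1 + 2 and then to 3. *)
val r = let val (a, b) = (1, 2) in a + b end

(* Patterns can nest; the wildcard _ matches without binding anything. *)
val s = let val (p, (_, q)) = (10, ("ignored", 20)) in p + q end
```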

What if a let expression introduces multiple declarations? Such an expression is identical in effect to a series of nested let expressions. Thus, we can use the following rewrite that pulls out the first declaration so the rules above apply.