CS312 PROBLEM SET 5B: Common Subexpression Elimination

Issued: Monday, November 10

Due: Friday, December 5

This is the second part of problem set 5. You will hand in both parts at once, as a single problem set. This problem set, and especially this part of the problem set, is long – get started early!

You may do this problem set alone or in pairs. You will submit the files you changed through CMS.

Please note that you may NOT use side effects in your solutions.

Common Subexpression Elimination (CSE)

The goal of this problem set is to analyze an SML program and produce a better SML program by finding and eliminating common subexressions. A common subexpression is a subset of the input expression that appears multiple times; the input expression can be made more efficient by introducing a temporary variable to hold the value of the common subexpression, and then using that variable multiple times.

For example, consider the expression

let

  val x = 42

in

  (fact x)*(fact x)

end

The subexpression (fact x) appears twice. An equivalent, but much faster, expression would be

let

  val x = 42

in

let

    val tmp0 = fact(x)

in

    tmp0*tmp0

end

end

This transformation seems relatively straightforward, but actually will turn out to have numerous subtleties. By the end of this problem set, your code will be able to perform common subexpression elimination.

Simplifying assumptions

In order to make this problem considerably easier, you can make the following assumptions about the input expression that you will optimize:

Alpha renaming: In your solutions, you can assume that alpha renaming has already occurred (i.e. that any two identifiers with the same name are in fact the same).

Limited input expressions: For the CSE problem, you can assume that your input expression is in the same form as you handled is PS#4. Specifically, you can assume that there are no fun declarations, and that there are no side effects. Note that you will need to make a different simplifying assumption about the input for the callgraph problem (PS#5A).

Test cases for this problem set

In order to minimize the ambiguities in this problem set, we have provided a set of examples showing how your code should behave. These examples have actually been run using our solutions. If there is sufficient demand we will add additional examples over time.

What to hand in:

In this problem set, you will modify the included file CSE.sml. This is the only file you should modify. The code that you add will need to use the following types that are defined in this file:

  type path = int list

  type searchlist = (exp * path) list

  type cselist =

   (exp  *     (* AST representing the common expression *)

    int  *     (* number of tmp variable *)

    path list) (* paths where the same subexpression occurs *)

                 ) list

Implementing CSE

CSE Step 1: listing all subtrees

Subtrees and rooted trees

An AST is, of course, a tree (that is what the T in AST stands for…) A subset of a tree is called a subtree if the subset is also a tree. A subtree is called rooted if it consists of a single node plus all of the elements of the original tree that lie beneath that node.

Rooted subtrees play a vital role in common subexpression elimination, because the subexpressions that we will consider will be rooted subtrees of the input AST. (Note that it is possible to perform common subexpression elimination on non-rooted subtrees as well; this is more general, but quite a bit more complicated.) As a result, thoughout this problem set “subexpression” will simply mean “rooted subtree of the input AST”.

It is important to note that a leaf is not a subexpression. As a result the expression

let val x = 42 in f(x) end

only has a single subexpression, which is the application of f(x).

Problem CSE-1.1: Write a function ASTsubexpressions which takes an AST and returns a list of AST’s that contains all the subexpressions (rooted subtrees) of the input AST. The subexpressions should be listed in prefix order, and should include the input subexpression last. Your code should behave as shown below:

Analyze CSE >> let val x = 2 in x + 1 end

CSE analyzed at level 1:

CSE-1-1: Subexpressions:

Binop_e(Id_e(x),Plus,Int_c(1))

Let_e(Val_d(x,Int_c(2)),Binop_e(Id_e(x),Plus,Int_c(1)))

Analyze CSE >> let val x = d(1) in x + 1 end

CSE analyzed at level 1:

CSE-1-1: Subexpressions:

Binop_e(Id_e(x),Plus,Int_c(1))

Apply_e(Id_e(d),Int_c(1))

Let_e(Val_d(x,Apply_e(Id_e(d),Int_c(1))),Binop_e(Id_e(x),Plus,Int_c(1)))

Analyze CSE >> let val x = d(1) in f(x + 1) end

CSE analyzed at level 1:

CSE-1-1: Subexpressions:

Binop_e(Id_e(x),Plus,Int_c(1))

Apply_e(Id_e(f),Binop_e(Id_e(x),Plus,Int_c(1)))

Apply_e(Id_e(d),Int_c(1))

Let_e(Val_d(x,Apply_e(Id_e(d),Int_c(1))),Apply_e(Id_e(f),Binop_e(Id_e(x),Plus,Int_c(1))))

Paths and searchlists

The same subexpression can appear in multiple places within the AST. In order to describe these different places, we need a notion of a path, which describes the route from the root of the tree to the particular subexpression.

Paths encode the choices made among the different subexpressions at every (non-leaf) node of the AST. Some nodes have only a single subexpression, such as unary operators. In this case, we will still include the node in the path, but we will use the number 0 to describe the single (forced) choice. Some paths have an arbitrary number of descendants, like tuples or let expressions. For a tuple with k elements, we will encode the choice using a number in the range 0..k-1. A let expression with k var declarations will have k+1 subexpressions (one for each var declaration’s expression, and one for the body); they will be numbered 0..k.

Thus, an expression like fact(x)* fact(x) would be contain two different (but identical) subtrees with two different paths.

A searchlist is a list, where each element is a subexpression and its associated path. The searchlist type is defined in CSE.sml. The included function printSL will print out a searchlist, by printing out each expression and then its path.

Problem CSE-1.2: Write a function ASTsearchlist which takes an AST and returns a searchlist of AST’s that contains all the rooted subtrees of the input AST and their associated paths. You will need to use the path and searchlist types which we have defined for you in CSE.sml. Hint: you may find it easier to initially create the paths backwards, and then reverse them at the end. Your code should behave as shown below:

Analyze CSE >> let val x = 42 in (f x) + (f x) end

CSE analyzed at level 2:

CSE-1-2: Searchlist:

Apply_e(Id_e(f),Id_e(x))

1 1

Apply_e(Id_e(f),Id_e(x))

1 0

Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(f),Id_e(x)))

Let_e(Val_d(x,Int_c(42)),Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(f),Id_e(x))))

<empty-path>

Analyze CSE >> let val x = 42 in (f x) + (g x) end

CSE analyzed at level 2:

CSE-1-2: Searchlist:

Apply_e(Id_e(g),Id_e(x))

1 1

Apply_e(Id_e(f),Id_e(x))

1 0

Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(g),Id_e(x)))

Let_e(Val_d(x,Int_c(42)),Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(g),Id_e(x))))

<empty-path>

Analyze CSE >> let val x = 42 val y = 24 in (f x) + (g x) end

CSE analyzed at level 2:

CSE-1-2: Searchlist:

Apply_e(Id_e(g),Id_e(x))

2 1

Apply_e(Id_e(f),Id_e(x))

2 0

Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(g),Id_e(x)))

Let_e(Val_d(x,Int_c(42)),Val_d(y,Int_c(24)),Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(g),Id_e(x))))

<empty-path>

CSE Step 2: finding common subexpressions

The searchlist for your input AST will contain various subexpressions that only appear once, and other subexpressions that occur multiple times (at multiple paths). We now need to go through the searchlist and collect the common subexpressions. More precisely, a particular subexpression will occur at multiple paths. We will need to collect the set of paths where that subexpression occurs

First, you will need to tell whether a given subexpression appears more than once. To do this, you can use the expression equality function expEqual that is provided for you in CSE.sml.

We have defined a cselist type in CSE.sml. This is a list whose elements are an expression, a list of paths, and a number. Such a list will be the output of your solution to this part of the problem. The number will be the identifier of the temporary variable that will be used to replace the expression. Your numbering scheme should use the prefix order on subexpressions that you obtained in step 1.

Problem CSE-2.1: Write a function ASTtoCSElist which takes an AST and returns the associated CSElist. You will need to use the cselist datastructure which we have defined for you in CSE.sml, as well as your solution in step 1. Your code should behave as shown below:

Analyze CSE >> let val x = 42 in (f x) + (f x) end

CSE analyzed at level 3:

CSE-2-1: CSElist:

Apply_e(Id_e(f),Id_e(x))

 ** 0 ** [1 0] [1 1]

Analyze CSE >> let val x = 42 in (f (g x)) + (f (g x)) end

CSE analyzed at level 3:

CSE-2-1: CSElist:

Apply_e(Id_e(f),Apply_e(Id_e(g),Id_e(x)))

 ** 1 ** [1 0] [1 1]

Apply_e(Id_e(g),Id_e(x))

 ** 0 ** [1 0 1] [1 1 1]

CSE Step 3: Reordering the common subexpression list using containment

CSE is complicated by the fact that a common subexpression can appear inside another common subexpression. For example, consider the expression

(f (g x))+(f (g x))+(g x)

Here, the common subexpressions are (g x) and (f (g x)). If we simply substituted for (f (g x)) we would end up with:

let val tmp0 = (f (g x))

in

  tmp0 + tmp0 + (g x)

end

This is not optimal; instead, we would like to obtain:

let val tmp0 = (g x)

in

  let val tmp1 = (f tmp0)

in

    tmp1 + tmp1 + tmp0

end

end

In order to accomplish this, however, we need to reorder the CSElist so that “smaller” subexpressions (such as (g x) in the example above) are handled before “larger” subexpressions (such as (f (g x)) above).

More precisely, you will need to reorder the CSElist so that if subexpression contains subexpression B, subexpression B appears before subexpression A. Note that two subexpressions will frequently be “incomparable”, in that neither is a subexpression of the other.

To begin with, you will need to determine if one subexpression is contained inside another. While this can be done directly on AST’s, it is both faster and simpler to do it on paths.

Problem CSE-3.1: Write a function contains which takes a pair of paths and determines if the subexpression corresponding to the first path contains the subexpression corresponding to the second path.

Problem CSE-3.2: Write a function reorderCSElist which takes a CSElist and reorders it using the containment relation. [Hint: this problem is not trivial. Please think carefully about what is involved, and consider whether any code you have written elsewhere in CS312 may be applicable.] After your code is written, it should be the case that for any two subexpressions e1,e2 in the CSElist if contains(e1,e2) is true then e2 appears in the reordered CSElist before e1.

Your code should behave as shown below:

let val x = 42 in (f (g x)) + (f (g x)) + (g x) end

CSE analyzed at level 4:

CSE-3-2: Reordered CSElist:

Apply_e(Id_e(g),Id_e(x))

 ** 0 ** [1 0 0 1] [1 0 1 1] [1 1]

Apply_e(Id_e(f),Apply_e(Id_e(g),Id_e(x)))

 ** 1 ** [1 0 0] [1 0 1]

CSE Step 4: Substitute variable names for common subexpressions

Our next task is to take common subexpressions and replace them all with variable names. (In the final CSE step, we will introduce a binding that creates the variable name and binds it to the value of the common subexpression.)

In order to handle cases like this, we have to do substitution within the CSElist. More precisely, we place the input expression at the end of the CSElist and then walk down the CSElist, repeatedly taking one element and replacing all its occurrences in the rest of the list by its corresponding variable name. You will also need to write a function which takes two paths, the first of which is contained in the second, and returns a path that specifies the location of the first one relative to the second one. (We will not grade this function separately, but you should be aware that such a function is necessary.)

If the AST already contains a Let_e node, then you should insert the new definition at the beginning of the declaration list (otherwise you should create a Let_e which wraps the original AST node, and has a single declaration for the just-inserted _tmp variable). This method of inserting avoids creating a long series of let statements that only contain the declaration of one temporary variable.

Problem CSE-4.1: Write a function substituteCSElist which takes an CSElist and performs substitution in it. You will need to use the CSElist datastructure which we have defined for you in CSE.sml, as well as your solution to steps 2 and 3. Be sure that the variables you create follow the same naming convention that we use. Your code should behave as shown below:

Analyze CSE >> let val x = 42 in (f (g x)) + (f (g x)) + (g x) end

CSE analyzed at level 5:

CSE-4-1: CSElist substitution

Apply_e(Id_e(g),Id_e(x))

 ** 0 ** [1 0 0 1] [1 0 1 1] [1 1]

Apply_e(Id_e(f),Id_e(_tmp0))

 ** 1 ** [1 0 0] [1 0 1]

Let_e(Val_d(x,Int_c(42)),Binop_e(Binop_e(Id_e(_tmp1),Plus,Id_e(_tmp1)),Plus,Id_e(_tmp0)))

 ** ~1 ** [<empty-path>]

let val x = 42 in ((_tmp1+_tmp1)+_tmp0) end

CSE Step 5: Introduce bindings for common subexpressions

Our final task is to introduce a binding that creates the variable name and binds it to the value of the common subexpression. This will involve creating one let expression per variable and inserting it into the input expression. In order to do this, you will first need to figure out where in the input expression the new let should go. This requires computing the lowest common ancestor of two paths (or, to put it differently, the smallest subexpression that contains two input expressions).

Once this is done, you will need to take each element of the CSElist and insert the appropriate binding at the appropriate place. The subtlety here is that each time you insert a let all of the existing paths become invalid. To solve this problem, you can skip over let statements that you have introduced (you can tell this by looking at the variable name).

Problem CSE-5.1: Write a function substituteCSElist which takes an CSElist and performs substitution in it. You will need to use you solutions from the previous step. Your code should behave as shown below:

Analyze CSE >> let val x = 42 in (f (g x)) + (f (g x)) + (g x) end

CSE analyzed at level 6:

let val x = 42 in let val _tmp0 = g(x) in (let val _tmp1 = f(_tmp0) in (_tmp1+_tmp1) end+_tmp0) end end

Analyze CSE >> let val a = [2, 3] val b = (1::a) val c = (0::1::a) in c end

CSE analyzed at level 6:

let val a = [2,3] val _tmp0 = (1::a) val b = _tmp0 val c = (0::_tmp0) in c end