This is the second part of problem set 5. You will hand in both parts at once, as a single problem set. This problem set, and especially this part of the problem set, is long – get started early!
You may do this problem set alone or in pairs. You will submit the files you changed through CMS.
Please note that you may NOT use side effects in your solutions.
The goal of this problem set is to analyze an SML program and produce a better SML program by finding and eliminating common subexressions. A common subexpression is a subset of the input expression that appears multiple times; the input expression can be made more efficient by introducing a temporary variable to hold the value of the common subexpression, and then using that variable multiple times.
For example, consider the expression
let
val x = 42
in
(fact x)*(fact x)
end
The subexpression (fact x) appears twice. An equivalent, but much faster, expression would be
let
val x = 42
in
let
val tmp0 = fact(x)
in
tmp0*tmp0
end
end
This transformation seems relatively straightforward, but actually will turn out to have numerous subtleties. By the end of this problem set, your code will be able to perform common subexpression elimination.
In order to make this problem considerably easier, you can make the
following assumptions about the input expression that you will optimize:
Alpha renaming: In your solutions, you can assume that alpha renaming
has already occurred (i.e. that any two identifiers with the same name are in
fact the same).
Limited input expressions: For the CSE problem, you can assume that your input
expression is in the same form as you handled is PS#4. Specifically, you can
assume that there are no fun declarations, and
that there are no side effects. Note that you will need to make a different
simplifying assumption about the input for the callgraph problem (PS#5A).
In order to minimize the ambiguities in this problem set, we have provided a set of examples showing how your code should behave. These examples have actually been run using our solutions. If there is sufficient demand we will add additional examples over time.
In this problem set, you will modify the included file CSE.sml. This is the
only file you should modify. The code that you add will need to use the
following types that are defined in this file:
type path = int list
type searchlist = (exp * path) list
type cselist =
(exp * (* AST representing the common expression *)
int * (* number of tmp variable *)
path list) (* paths where the same subexpression occurs *)
) list
An AST is, of course, a tree
(that is what the T in AST stands for…) A subset of a tree is called a subtree
if the subset is also a tree. A subtree is called rooted if it consists of
a single node plus all of the elements of the original tree that lie beneath
that node.
Rooted subtrees play a vital
role in common subexpression elimination, because the subexpressions that we
will consider will be rooted subtrees of the input AST. (Note that it is
possible to perform common subexpression elimination on non-rooted subtrees as
well; this is more general, but quite a bit more complicated.) As a result,
thoughout this problem set “subexpression” will simply mean “rooted subtree of
the input AST”.
It is important to note that
a leaf is not a subexpression. As a result the expression
let val x = 42 in f(x) end
only has a single
subexpression, which is the application of f(x).
Problem CSE-1.1: Write a function ASTsubexpressions which
takes an AST and returns a list of AST’s that contains all the subexpressions
(rooted subtrees) of the input AST. The subexpressions should be listed in
prefix order, and should include the input subexpression last. Your code should
behave as shown below:
Analyze CSE >> let val x = 2 in x + 1 end
CSE analyzed at level 1:
CSE-1-1: Subexpressions:
Binop_e(Id_e(x),Plus,Int_c(1))
Let_e(Val_d(x,Int_c(2)),Binop_e(Id_e(x),Plus,Int_c(1)))
Analyze CSE >> let val x = d(1) in x + 1 end
CSE analyzed at level 1:
CSE-1-1: Subexpressions:
Binop_e(Id_e(x),Plus,Int_c(1))
Apply_e(Id_e(d),Int_c(1))
Let_e(Val_d(x,Apply_e(Id_e(d),Int_c(1))),Binop_e(Id_e(x),Plus,Int_c(1)))
Analyze CSE >> let val x = d(1) in f(x + 1) end
CSE analyzed at level 1:
CSE-1-1: Subexpressions:
Binop_e(Id_e(x),Plus,Int_c(1))
Apply_e(Id_e(f),Binop_e(Id_e(x),Plus,Int_c(1)))
Apply_e(Id_e(d),Int_c(1))
Let_e(Val_d(x,Apply_e(Id_e(d),Int_c(1))),Apply_e(Id_e(f),Binop_e(Id_e(x),Plus,Int_c(1))))
The same subexpression can
appear in multiple places within the AST. In order to describe these different
places, we need a notion of a path, which describes the route from the
root of the tree to the particular subexpression.
Paths encode the choices
made among the different subexpressions at every (non-leaf) node of the AST.
Some nodes have only a single subexpression, such as unary operators. In this
case, we will still include the node in the path, but we will use the number 0
to describe the single (forced) choice. Some paths have an arbitrary number of
descendants, like tuples or let expressions. For a tuple with k elements, we
will encode the choice using a number in the range 0..k-1. A let expression
with k var declarations will have k+1 subexpressions (one for each var
declaration’s expression, and one for the body); they will be numbered 0..k.
Thus, an expression like fact(x)* fact(x) would be contain two different (but identical) subtrees
with two different paths.
A searchlist is a
list, where each element is a subexpression and its associated path. The searchlist
type is defined in CSE.sml. The included function printSL will print out a searchlist, by printing out each
expression and then its path.
Problem CSE-1.2: Write a function ASTsearchlist which takes
an AST and returns a searchlist of AST’s that contains all the rooted subtrees
of the input AST and their associated paths. You will need to use the path and
searchlist types which we have defined for you in CSE.sml. Hint:
you may find it easier to initially create the paths backwards, and then
reverse them at the end. Your code should behave as shown below:
Analyze CSE >> let val x = 42 in (f x) + (f x) end
CSE analyzed at level 2:
CSE-1-2: Searchlist:
Apply_e(Id_e(f),Id_e(x))
1 1
Apply_e(Id_e(f),Id_e(x))
1 0
Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(f),Id_e(x)))
1
Let_e(Val_d(x,Int_c(42)),Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(f),Id_e(x))))
<empty-path>
Analyze CSE >> let val x = 42 in (f x) + (g x) end
CSE analyzed at level 2:
CSE-1-2: Searchlist:
Apply_e(Id_e(g),Id_e(x))
1 1
Apply_e(Id_e(f),Id_e(x))
1 0
Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(g),Id_e(x)))
1
Let_e(Val_d(x,Int_c(42)),Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(g),Id_e(x))))
<empty-path>
Analyze CSE >> let val x = 42 val y = 24 in (f x) + (g x) end
CSE analyzed at level 2:
CSE-1-2: Searchlist:
Apply_e(Id_e(g),Id_e(x))
2 1
Apply_e(Id_e(f),Id_e(x))
2 0
Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(g),Id_e(x)))
2
Let_e(Val_d(x,Int_c(42)),Val_d(y,Int_c(24)),Binop_e(Apply_e(Id_e(f),Id_e(x)),Plus,Apply_e(Id_e(g),Id_e(x))))
<empty-path>
The searchlist for your
input AST will contain various subexpressions that only appear once, and other
subexpressions that occur multiple times (at multiple paths). We now need to go
through the searchlist and collect the common subexpressions. More precisely, a
particular subexpression will occur at multiple paths. We will need to collect
the set of paths where that subexpression occurs
First, you will need to tell
whether a given subexpression appears more than once. To do this, you can use
the expression equality function expEqual
that is provided for you in CSE.sml.
We have defined a cselist type
in CSE.sml. This is a list whose elements are an expression, a list of paths,
and a number. Such a list will be the output of your solution to this part of
the problem. The number will be the identifier of the temporary variable that
will be used to replace the expression. Your numbering scheme should use the
prefix order on subexpressions that you obtained in step 1.
Problem CSE-2.1: Write a function ASTtoCSElist which takes an
AST and returns the associated CSElist. You will need to use the cselist datastructure which we have defined for you in
CSE.sml, as well as your solution in step 1. Your code should behave as shown
below:
Analyze CSE >> let val x = 42 in (f x) + (f x) end
CSE analyzed at level 3:
CSE-2-1: CSElist:
Apply_e(Id_e(f),Id_e(x))
** 0 ** [1 0] [1 1]
Analyze CSE >> let val x = 42 in (f (g x)) + (f (g x)) end
CSE analyzed at level 3:
CSE-2-1: CSElist:
Apply_e(Id_e(f),Apply_e(Id_e(g),Id_e(x)))
** 1 ** [1 0] [1 1]
Apply_e(Id_e(g),Id_e(x))
** 0 ** [1 0 1] [1 1 1]
CSE is complicated by the
fact that a common subexpression can appear inside another common subexpression.
For example, consider the expression
(f (g x))+(f (g x))+(g x)
Here, the common
subexpressions are (g x) and (f (g x)). If we simply substituted for (f (g x))
we would end up with:
let val tmp0 = (f (g x))
in
tmp0 + tmp0 + (g x)
end
This is not optimal;
instead, we would like to obtain:
let val tmp0 = (g x)
in
let val tmp1 = (f tmp0)
in
tmp1 + tmp1 + tmp0
end
end
In order to accomplish this,
however, we need to reorder the CSElist so that “smaller” subexpressions (such
as (g
x) in the example above) are handled
before “larger” subexpressions (such as (f
(g x)) above).
More precisely, you will
need to reorder the CSElist so that if subexpression contains subexpression B,
subexpression B appears before subexpression A. Note that two subexpressions
will frequently be “incomparable”, in that neither is a subexpression of the
other.
To begin with, you will need
to determine if one subexpression is contained inside another. While this can be
done directly on AST’s, it is both faster and simpler to do it on paths.
Problem CSE-3.1: Write a function contains which takes a pair
of paths and determines if the subexpression corresponding to the first path contains
the subexpression corresponding to the second path.
Problem CSE-3.2: Write a function reorderCSElist which
takes a CSElist and reorders it using the containment relation. [Hint: this problem is not trivial. Please
think carefully about what is involved, and consider whether any code you have
written elsewhere in CS312 may be applicable.] After your code is written, it
should be the case that for any two subexpressions e1,e2 in
the CSElist if contains(e1,e2) is true then e2 appears
in the reordered CSElist before e1.
Your code should behave as
shown below:
let val x = 42 in (f (g x)) + (f (g x)) + (g x) end
CSE analyzed at level 4:
CSE-3-2: Reordered CSElist:
Apply_e(Id_e(g),Id_e(x))
** 0 ** [1 0 0 1] [1 0 1 1] [1 1]
Apply_e(Id_e(f),Apply_e(Id_e(g),Id_e(x)))
** 1 ** [1 0 0] [1 0 1]
Our next task is to take common subexpressions and replace them all with variable names. (In the final CSE step, we will introduce a binding that creates the variable name and binds it to the value of the common subexpression.)
In order to handle cases
like this, we have to do substitution within the CSElist. More precisely, we place
the input expression at the end of the CSElist and then walk down the CSElist,
repeatedly taking one element and replacing all its occurrences in the rest of
the list by its corresponding variable name. You will also need to write a function
which takes two paths, the first of which is contained in the second, and returns
a path that specifies the location of the first one relative to the second one.
(We will not grade this function separately, but you should be aware that such
a function is necessary.)
If the AST already contains
a Let_e node, then you should insert the new definition at
the beginning of the declaration list (otherwise you should create a Let_e which wraps the original AST node, and has a single
declaration for the just-inserted _tmp
variable). This method of inserting avoids creating a long series of let statements that only contain the declaration of one
temporary variable.
Problem CSE-4.1: Write a function substituteCSElist which takes
an CSElist and performs substitution in it. You will need to use the CSElist datastructure which we have defined for you in
CSE.sml, as well as your solution to steps 2 and 3. Be sure that the variables
you create follow the same naming convention that we use. Your code should
behave as shown below:
Analyze CSE >> let val x = 42 in (f (g x)) + (f (g x)) + (g x) end
CSE analyzed at level 5:
CSE-4-1: CSElist substitution
Apply_e(Id_e(g),Id_e(x))
** 0 ** [1 0 0 1] [1 0 1 1] [1 1]
Apply_e(Id_e(f),Id_e(_tmp0))
** 1 ** [1 0 0] [1 0 1]
Let_e(Val_d(x,Int_c(42)),Binop_e(Binop_e(Id_e(_tmp1),Plus,Id_e(_tmp1)),Plus,Id_e(_tmp0)))
** ~1 ** [<empty-path>]
let val x = 42 in ((_tmp1+_tmp1)+_tmp0) end
Our final task is to
introduce a binding that creates the variable name and binds it to the value of
the common subexpression. This will involve creating one let expression per
variable and inserting it into the input expression. In order to do this, you
will first need to figure out where in the input expression the new let should
go. This requires computing the lowest common ancestor of two paths (or, to put
it differently, the smallest subexpression that contains two input
expressions).
Once this is done, you will
need to take each element of the CSElist and insert the appropriate
binding at the appropriate place. The subtlety here is that each time you
insert a let all of the existing paths become invalid. To solve
this problem, you can skip over let
statements that you have introduced (you can tell this by looking at the
variable name).
Problem CSE-5.1: Write a function substituteCSElist which
takes an CSElist and performs substitution in it. You will need to use you
solutions from the previous step. Your code should behave as shown below:
Analyze CSE >> let val x = 42 in (f (g x)) + (f (g x)) + (g x) end
CSE analyzed at level 6:
let val x = 42 in let val _tmp0 = g(x) in (let val _tmp1 = f(_tmp0) in (_tmp1+_tmp1) end+_tmp0) end end
Analyze CSE >> let val a = [2, 3] val b = (1::a) val c = (0::1::a) in c end
CSE analyzed at level 6:
let val a = [2,3] val _tmp0 = (1::a) val b = _tmp0 val c = (0::_tmp0) in c end