Program Analysis as Non-Standard Denotational Semantics

We can use a non-standard denotational semantics to reason about programs. The basic insight is that if we abstract from actual values, keep our abstractions suitably finite, and are always conservative, then we can compute conservative approximations of the flow of a given program.

For example, let us consider the abstract domain D defined as:

    D ::= Z | Z+ | Z- | 0

where Z represents the set of all integers, Z+ the set of all positive integers, Z- the set of all negative integers, and 0 the singleton set { 0 }. (The empty set of integers, {}, would sit at the bottom of this ordering; nothing below needs it, so we leave it out of D.)

This is an abstraction of our concrete domain of integers, and it is only one of many abstractions that we could choose for a given analysis problem. It has the property that there is an "information ordering": Z is a superset of all of the other abstract domain elements, so knowing only that a value is in Z gives us less information than knowing that it is in Z+ or 0.

            Z
          / | \
         /  |  \
        Z-  0   Z+

In principle, we can use any lattice to abstract the domain of values, but it helps to have a finite lattice or at least a lattice of finite height (no infinite ascending or descending chains).

We can define a denotational semantics using this abstract domain as follows. First, we use abstract stores, which will be functions from variables to D instead of from variables to integers. In other words, we're going to forget what specific integer value a variable holds and only remember whether it's zero, positive, negative, etc. If we don't know anything about the variable, then we'll have to assume that it's any possible integer (i.e., Z).
    S in AbsStore : Var -> D

Next, we provide an interpretation of expressions E' that respects our abstraction:

    E'[i]S = 0     if i = 0
             Z-    if i < 0
             Z+    if i > 0

    E'[x]S = S(x)

    E'[e1 + e2]S = 0     if E'[e1]S = E'[e2]S = 0
                   Z+    if (E'[e1]S = E'[e2]S = Z+)
                         or (E'[e1]S = Z+ and E'[e2]S = 0)
                         or (E'[e1]S = 0 and E'[e2]S = Z+)
                   Z-    if (E'[e1]S = E'[e2]S = Z-)
                         or (E'[e1]S = Z- and E'[e2]S = 0)
                         or (E'[e1]S = 0 and E'[e2]S = Z-)
                   Z     otherwise

Note that evaluating an integer i returns either 0, Z-, or Z+ but not Z. This reflects perfect information with respect to the abstraction: we're returning the most precise thing that we possibly can.

The variable case is just as in the standard denotational semantics: we look up the value in the store. Of course, here, the store returns an abstract domain value instead of an integer.

Finally, we had to interpret the operation + in a way that's consistent with the domain. For instance, we can only say that we get a positive number out if we know that e1 and e2 both yield non-negative numbers and at least one of them yields a positive number.

In general, the property that we want is that if we run the real semantics, whatever value we get out is contained in the set returned by the abstract semantics. Formally, we can say that an abstract store S is faithful to a concrete store s if for all variables x, s(x) is an element of S(x). Then, we can say that E' is faithful to E if for all expressions e, all stores s, and all abstract stores S that are faithful to s, E[e]s is an element of E'[e]S.

We can define a similar function B' for boolean-valued expressions. However, we'll need a new abstract domain for booleans:

    D2 ::= True | False | DontKnow

Here, DontKnow represents the set {true,false}, while True represents {true} and False represents {false}.
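The definitions of D, E', and faithfulness above can be tried out concretely. What follows is only a minimal sketch in Python, not part of the notes' formal development: the string names for the domain elements, the tuple encoding of expressions, and the helper names (alpha, abs_add, abs_eval, conc_eval, contains) are all assumptions made for illustration.

```python
# Abstract domain D; the string names are hypothetical stand-ins.
Z, ZPOS, ZNEG, ZERO = "Z", "Z+", "Z-", "0"

def alpha(i):
    """Most precise element of D containing the integer i (the E'[i] case)."""
    if i == 0:
        return ZERO
    return ZPOS if i > 0 else ZNEG

def abs_add(a, b):
    """Abstract +, following the case analysis for E'[e1 + e2]."""
    if a == ZERO:
        return b              # 0 + b has the sign information of b
    if b == ZERO:
        return a
    if a == b and a != Z:
        return a              # Z+ + Z+ = Z+ and Z- + Z- = Z-
    return Z                  # mixed signs or unknown: any integer

# Assumed encoding of expressions: ("int", i), ("var", x), ("add", e1, e2).
def abs_eval(e, S):
    """E'[e]S, where the abstract store S is a dict from names to D."""
    if e[0] == "int":
        return alpha(e[1])
    if e[0] == "var":
        return S.get(e[1], Z)  # a variable we know nothing about is Z
    if e[0] == "add":
        return abs_add(abs_eval(e[1], S), abs_eval(e[2], S))
    raise ValueError("unknown expression: %r" % (e,))

def conc_eval(e, s):
    """The standard semantics E[e]s over a concrete store s (dict to int)."""
    if e[0] == "int":
        return e[1]
    if e[0] == "var":
        return s[e[1]]
    return conc_eval(e[1], s) + conc_eval(e[2], s)

def contains(d, n):
    """Is the integer n an element of the set denoted by d?"""
    return d == Z or d == alpha(n)

e = ("add", ("var", "x"), ("int", 1))   # x + 1
s = {"x": -5}                           # a concrete store
S = {"x": ZNEG}                         # an abstract store faithful to s
print(abs_eval(e, S))                            # Z
print(contains(abs_eval(e, S), conc_eval(e, s))) # True
```

The last line checks one instance of faithfulness: the concrete result -4 is an element of the abstract result Z. It is a spot check on a single store, not a proof.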
Then we can define B' as follows:

    B'[true]S = True

    B'[false]S = False

    B'[e1 <= e2]S = True      if (E'[e1]S = Z- and E'[e2]S = 0 or Z+)
                              or (E'[e1]S = 0 and E'[e2]S = Z+)
                    False     if (E'[e1]S = Z+ and E'[e2]S = 0 or Z-)
                              or (E'[e1]S = 0 and E'[e2]S = Z-)
                    DontKnow  otherwise

Next we need to define our abstract semantics for commands, C'. The first few cases work just as before:

    C'[skip]S = S
    C'[x := e]S = S[x -> E'[e]S]
    C'[c1 ; c2]S = C'[c2](C'[c1]S)

We run into problems with if-commands, because in general, we don't know which branch will be taken. In particular, if we have "if e1 <= e2 then c1 else c2", and we only know that e1 and e2 are integers, then we don't know which branch will be selected. Thus, we must conservatively look at both branches to compute the possible output state, and we must somehow merge the information in the two output states to get a single abstract store.

    C'[if e then c1 else c2]S = C'[c1]S                         if B'[e]S = True
                                C'[c2]S                         if B'[e]S = False
                                merge_stores(C'[c1]S, C'[c2]S)  otherwise

where we define merge_stores(S1,S2) as:

    { (x,merge(S1(x),S2(x))) | x in Var }

and merge for two abstract domain elements as:

    merge(Z,_) = Z
    merge(_,Z) = Z
    merge(X,X) = X
    merge(X,Y) = Z    if X and Y are different

In general, we calculate the union of the two sets and then find the smallest abstract domain element that is big enough to cover the union. Since Z contains everything, merging it with one of the other elements always yields Z. In our small domain, Z is also the only element that covers the union of two different elements, which is why the last case yields Z.

As an example, consider what we get out of analyzing:

    if x <= 0 then x := x + 1 else skip

If we assume on input that x is any negative integer (i.e., S(x) = Z-), then the analysis works as follows:

  A. First, we need to compute E'[x]S = S(x) = Z-.
  B. Then we need to compute E'[0]S = 0.
  C. We know that all elements of Z- are less than 0, so B'[x <= 0]S = True and we only need to calculate C'[x := x + 1]S and return that.
  D. C'[x := x + 1]S requires computing E'[x + 1]S. Since E'[x]S = S(x) = Z- and E'[1]S = Z+, we can only conclude that the result is Z. Thus, C'[x := x + 1]S = S[x -> Z].
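The rules for B', merge, and C' given so far (everything except while) can be sketched as a small executable analysis. This is an illustrative Python sketch under stated assumptions: the tuple encodings of expressions and commands and all function names are hypothetical, and a variable missing from a store is treated as Z, in line with the remark that an unknown variable may be any integer.

```python
Z, ZPOS, ZNEG, ZERO = "Z", "Z+", "Z-", "0"            # abstract integers (D)
TRUE, FALSE, DONT_KNOW = "True", "False", "DontKnow"  # abstract booleans (D2)

def alpha(i):
    """Most precise element of D containing the integer i."""
    return ZERO if i == 0 else (ZPOS if i > 0 else ZNEG)

def abs_add(a, b):
    """Abstract + as in the definition of E'[e1 + e2]."""
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    return a if (a == b and a != Z) else Z

def abs_eval(e, S):
    """E'[e]S for ("int", i), ("var", x), ("add", e1, e2)."""
    if e[0] == "int":
        return alpha(e[1])
    if e[0] == "var":
        return S.get(e[1], Z)
    return abs_add(abs_eval(e[1], S), abs_eval(e[2], S))

def abs_leq(a, b):
    """B'[e1 <= e2]S, given the abstract values a and b of e1 and e2."""
    if (a == ZNEG and b in (ZERO, ZPOS)) or (a == ZERO and b == ZPOS):
        return TRUE
    if (a == ZPOS and b in (ZERO, ZNEG)) or (a == ZERO and b == ZNEG):
        return FALSE
    return DONT_KNOW

def abs_bool(b, S):
    """B'[b]S for ("true",), ("false",), ("leq", e1, e2)."""
    if b[0] == "true":
        return TRUE
    if b[0] == "false":
        return FALSE
    return abs_leq(abs_eval(b[1], S), abs_eval(b[2], S))

def merge(a, b):
    """Smallest element of D covering both a and b."""
    return a if a == b else Z

def merge_stores(S1, S2):
    """Pointwise merge; a variable absent from one store counts as Z."""
    return {x: merge(S1.get(x, Z), S2.get(x, Z)) for x in S1.keys() | S2.keys()}

def abs_exec(c, S):
    """C'[c]S for ("skip",), ("assign", x, e), ("seq", c1, c2), ("if", b, c1, c2)."""
    if c[0] == "skip":
        return S
    if c[0] == "assign":
        return {**S, c[1]: abs_eval(c[2], S)}
    if c[0] == "seq":
        return abs_exec(c[2], abs_exec(c[1], S))
    if c[0] == "if":
        guard = abs_bool(c[1], S)
        if guard == TRUE:
            return abs_exec(c[2], S)
        if guard == FALSE:
            return abs_exec(c[3], S)
        # Unknown guard: analyze both branches and merge the results.
        return merge_stores(abs_exec(c[2], S), abs_exec(c[3], S))
    raise ValueError("unknown command: %r" % (c,))

# if x <= 0 then x := x + 1 else skip
prog = ("if", ("leq", ("var", "x"), ("int", 0)),
        ("assign", "x", ("add", ("var", "x"), ("int", 1))),
        ("skip",))
print(abs_exec(prog, {"x": ZNEG}))   # {'x': 'Z'}
```

Running the sketch on the example with S(x) = Z- follows steps A through D: the guard evaluates to True, only the then-branch is analyzed, and x ends up mapped to Z.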
Now let's analyze a different program:

    if x <= 0 then x := 3 else x := -4

and assume on input that x is any integer (i.e., S(x) = Z).

  A. Compute E'[x]S = S(x) = Z.
  B. Compute E'[0]S = 0.
  C. We can't tell which way the if goes since we don't know whether x <= 0 (i.e., not all elements of Z are less than or equal to zero, nor are they all greater than zero). So, we have to compute both possible outcomes and merge them.
  D. Compute C'[x := 3]S = S[x -> Z+] since E'[3]S = Z+.
  E. Compute C'[x := -4]S = S[x -> Z-] since E'[-4]S = Z-.
  F. Merge the two output states S[x -> Z+] and S[x -> Z-], which results in S[x -> Z] since in one case x is positive and in the other case it's negative. The most we can say is that after executing the if-command, x is an integer.

So much for conditionals. What about while loops?

    C'[while e do c]S = ???

Suppose we could guess an output S' for this. What properties should S' have? Well, it ought to be the case that S' is faithful to the original semantics. In particular, if we have a concrete store s and S is faithful to s (i.e., for all x, s(x) is an element of S(x)), and C[while e do c]s = s', then it ought to be the case that S' is faithful to s'.

One thing we could guess for S' is the store that maps every variable to Z. Let's call that store Top. Top is certainly faithful. It also has the property that:

    C'[if e then (c; while e do c) else skip]S <= Top

where S1 <= S2 means that for all x, S1(x) is a subset of S2(x).

So, we could just use Top, but that's a little imprecise. I claim that you can do something like the following:

    fun loop S =
      let val S' = merge_stores(S, C'[c]S)
      in if S' = S then S' else loop S'
      end

That is, start with the input state, compute C'[c]S (the result of running the body of the while loop), and merge that with the input state to produce an S'. If S' is different from S (i.e., for some variable x, S(x) != S'(x)), then we try again, but using S' as the input state. Why would this work?
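Before turning to that question, it may help to watch the loop procedure run. The following is a rough Python sketch, not the notes' definition: the abstract effect of the loop body, C'[c], is passed in as an ordinary function named body, the loop body chosen for the example is made up, and the extra rounds counter exists only to show how many times we go around. As in the pseudocode above, the guard e is ignored, which is conservative.

```python
Z, ZPOS, ZNEG, ZERO = "Z", "Z+", "Z-", "0"   # hypothetical names for D

def abs_add(a, b):
    """Abstract + as in the definition of E'[e1 + e2]."""
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    return a if (a == b and a != Z) else Z

def merge(a, b):
    """Smallest element of D covering both a and b."""
    return a if a == b else Z

def merge_stores(S1, S2):
    """Pointwise merge; a variable absent from one store counts as Z."""
    return {x: merge(S1.get(x, Z), S2.get(x, Z)) for x in S1.keys() | S2.keys()}

def loop(body, S, rounds=1):
    """Iterate S' = merge_stores(S, body(S)) until S' = S.

    body plays the role of C'[c]; rounds counts trips around the loop.
    """
    S_next = merge_stores(S, body(S))
    if S_next == S:
        return S_next, rounds
    return loop(body, S_next, rounds + 1)

def body(S):
    """Abstract effect of the made-up loop body: x := x + 1; y := x."""
    S1 = {**S, "x": abs_add(S["x"], ZPOS)}
    return {**S1, "y": S1["x"]}

result, rounds = loop(body, {"x": ZPOS, "y": ZERO})
print(result["x"], result["y"], rounds)   # Z+ Z 2
```

On this input, the first trip raises y from 0 to Z (it may still be 0, or it may have become positive), and the second trip changes nothing, so the procedure stops. Note that x stays at Z+ throughout, which is more precise than Top. Why the procedure is justified in general is the subject of the argument that follows.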
First, note that:

    (1) S1 <= merge_stores(S1,S2) and S2 <= merge_stores(S1,S2)
    (2) if S1 <= S2, then merge_stores(S,S1) <= merge_stores(S,S2)

So, that tells us that after we go around the loop, S' is at least as abstract as S and at least as abstract as C'[c]S. So, for instance, if S'(x) = 0, then it must be the case that S(x) = 0 and (C'[c]S)(x) = 0.

Second, suppose we've reached a point where S' = S. That is, merge_stores(S,C'[c]S) = S. This tells us that going around the loop again won't matter, because we'll always be getting the same S out. So we might as well stop.

Now I need to convince you of two things:

First, using loop in this fashion will terminate (i.e., we'll eventually find a fixed point for the loop). This is pretty easy to see: if we go around the loop and S' differs from S, then S' is strictly more abstract than what we had on input. That means that for some variable, we went from 0, Z+, or Z- to Z. From then on, that variable can't change. Only a finite number of variables can change (since the program can only mention a finite number of variables), so sooner or later, we'll stop. In this case, at most n trips around the loop can change the store, where n is the number of program variables, so we stop after at most n + 1 iterations. (If we had a deeper lattice, it might require more iterations.)

Second, I need to convince you that this is a faithful approximation. I won't prove this formally, but it's pretty easy to see that for all finite unrollings of the while loop, we get out something that is a faithful approximation.

--------------------------------------------------------------------

Homework 3:

1.
Given the following datatypes for IMP programs:

    type var = string
    datatype oper = Plus | Times | Minus | Divide
    datatype exp = Var of var | Int of int | Oper of exp * oper * exp
    datatype bexp = True | False | LessThanEq of exp * exp
    datatype com = Skip | Assign of var * exp | Seq of com * com
                 | If of bexp * com * com | While of bexp * com

Write an analysis which conservatively determines whether or not the program can divide by zero. You should build a non-standard denotational semantics with an abstract domain that records whether an integer value can be negative, positive, zero, non-negative, non-positive, or simply an integer. That is, your lattice will have more elements than the example I gave above. You'll need to define suitable merge operations for abstract integers (and abstract booleans). You'll want to use abstract stores that are finite maps from variables to abstract integers (e.g., association lists or something from the SML library) and define a suitable merge_stores function.

2. Suppose we modified IMP so that we allowed assigning boolean values to variables, as well as integer values. For instance, we could write:

    x := true;
    x := 1 + 42;
    if x then x := 42 else x := true;

We want to allow variables to hold both integers and booleans, but we would like to warn the programmer if they ever use a variable in a way that is inconsistent with the way the variable was last assigned. In particular, if the variable was last assigned an integer but is used as a boolean, then this should generate a warning. Similarly, if the variable was last assigned a boolean but is then used as an integer, we should generate a warning. Write down an analysis (on paper, or as ML code) which generates these warnings.