CS312 Lecture 9
Representation and Module Invariants

We need to write some comments in module implementations to help implementers and maintainers reason about the code within the module. We'd like them to be able to reason about individual functions within the module in a local way, so they can judge whether each function is correctly implemented without thinking about every other function in the module. Last time we saw the abstraction function, which is essential to this process because it explains how information within the module is viewed abstractly by module clients.

Commutation diagrams

Using the abstraction function, we can now talk about what it means for an implementation of an abstraction to be correct. It is correct exactly when every operation that takes place in the concrete space makes sense when mapped by the abstraction function into the abstract space. This can be visualized as a commutation diagram:

A commutation diagram means that if we take the two paths around the diagram, we have to get to the same place. Suppose that we start from a concrete value and apply the actual implementation of some operation to it to obtain a new concrete value or values. When viewed abstractly, a concrete result should be an abstract value that is a possible result of applying the function as described in its specification to the abstract view of the actual inputs. For example, consider the union function applied to the concrete pair ([1,3],[2,2]), which would be the lower-left corner of the diagram. The result of this operation is the list [2,2,1,3], whose corresponding abstract value is the set {1,2,3}. Note that if we apply the abstraction function AF to the lists [1,3] and [2,2], we have the sets {1,3} and {2}. The commutation diagram requires that in this instance the union of {1,3} and {2} is {1,2,3}, which is of course true.

Some specifications, as we have seen, are nondeterministic: they do not fully specify the abstract behavior of the functions. For nondeterministic specifications there are several possible arrows leading from the abstract state at the upper left corner of the commutation diagram to other states. The commutation diagram is satisfied as long as the implemented function completes the diagram for any one of those arrows.
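
The commutation check can be sketched directly in code. The following is an illustrative sketch, not part of the lecture's implementations: it assumes the list-with-duplicates representation, and a hypothetical helper af that normalizes a rep into a sorted, duplicate-free list standing for its abstract value, so that abstract sets can be compared with ordinary equality.

```sml
(* Insert x into a sorted list, dropping duplicates.  Used only to
 * normalize a rep into a canonical form for its abstract value. *)
fun insert (x, []) = [x]
  | insert (x, h::t) =
      if x < h then x::h::t
      else if x = h then h::t
      else h :: insert (x, t)

(* Hypothetical af helper: the abstract value of a rep, written as a
 * sorted, duplicate-free list so abstract sets compare with =. *)
fun af (rep: int list): int list = foldl insert [] rep

(* NatSet's union simply appends the two lists. *)
fun unionImpl (s1: int list, s2: int list) = s1 @ s2

(* Abstract union, computed on the normalized views. *)
fun abstractUnion (a1, a2) = af (a1 @ a2)

(* Both paths around the diagram reach the same abstract value,
 * the set {1,2,3}, so the diagram commutes for this input: *)
val commutes =
  af (unionImpl ([1,3], [2,2])) = abstractUnion (af [1,3], af [2,2])
```

Here af plays the role of AF on the lower path, while abstractUnion plays the role of the specified operation on the upper path; commutes evaluates to true exactly when the two paths agree.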

Some missing information

Recall the NATSET interface for sets of natural numbers:

signature NATSET = sig
  (* Overview: A "set" is a set of natural numbers:
     e.g., {1,11,0}, {}, and {1001} *)
  type set
 
  (* empty is the empty set *)
  val empty : set
 
  (* single(x) is {x}. Requires: x >= 0 *)
  val single : int -> set
 
  (* union is set union. *)
  val union : set*set -> set
 
  (* contains(x,s) is whether x is a member of s *)
  val contains: int*set -> bool
 
  (* size(s) is the number of elements in s *)
  val size: set -> int
end

As discussed last time, in an implementation of this interface we need an abstraction function. We talked about three different implementations: a list of integers with no duplicates (NatSetNoDups), a list of integers possibly with duplicates (NatSet), and a vector of booleans. Each has its own abstraction function, but we can use the same abstraction function for the first two:

(* Abstraction function: the list [a1,...,an] represents the
 * smallest set containing all of the elements a1,...,an.
 * The empty list represents the empty set.
 *)

Consider how we might write the size function in each of these implementations. For the list of integers with no duplicates (NatSetNoDups), the size is just the length of the list:

val size = List.length

But for the representation of a list of integers with possible duplicates (NatSet) we need to make sure we don't double-count any duplicate elements:

fun size(lst) = case lst of
   [] => 0
 | h::t => size(t) + (if contains(h, t) then 0 else 1)

How do we know that we don't need to do this check in NatSetNoDups? After all, the type of the representation is exactly the same: int list. And the abstraction function is identical. What is different is that an int list representing a NatSetNoDups.set cannot have any duplicate elements. Because the code doesn't say this, implementers will not be able to reason locally about whether functions like size are implemented correctly.

If we think about this in terms of the commutation diagram, we see that the abstraction function is not enough. Consider taking the size of the set {2}. The abstraction function maps the representation [2,2] to this set. But the abstract size operation on the set gives size({2}) = 1, whereas the NatSetNoDups implementation computes size([2,2]) = 2.
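
We can see the broken commutation concretely. Under NatSetNoDups, size is just List.length, so applying it to the invalid rep [2,2] takes the lower path of the diagram to the wrong place:

```sml
(* NatSetNoDups computes size as the length of the list.  On the
 * invalid rep [2,2], which abstractly is the set {2}, this gives 2,
 * while the abstract size of {2} is 1; the diagram fails to
 * commute for reps containing duplicates. *)
val badSize = List.length [2,2]   (* 2, but size({2}) = 1 *)
```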

Representation Invariant

We can fix this by adding a second piece of information to the implementation: the representation invariant (or rep invariant, or RI). The rep invariant defines which concrete data values are valid representations (reps) of abstract values, and we write it alongside the abstraction function. For NatSetNoDups, a valid representation must satisfy the following condition:

structure NatSetNoDups = struct
  type set = int list
(* Abstraction function: the list [a1,...,an] represents the set
 *   {a1,...,an}. [] represents the empty set.
 * Representation invariant: given the rep [a_1,...,a_n],
 *   no elements are negative, and no two elements are equal. *)


The rep invariant holds for any valid representation. A value of the representation type that does not satisfy the rep invariant is not a valid rep: it does not correspond (via the abstraction function) to any abstract value. The correct functioning of the implemented operations depends on the rep invariant being maintained at all times. For instance, the NatSetNoDups implementation requires that the list contain no duplicates; if this constraint is broken, functions such as size will not return the correct answer.


Rep invariant vs. abstraction function

We observed earlier that the abstraction function may be a partial function. In fact, in the case of both NatSet and NatSetNoDups, the abstraction function is partial: it is undefined on lists containing negative integers (such as [-1,1]), because such lists do not correspond to any set of natural numbers. To make sure that an implementation works (that it completes the commutation diagram), it had better be the case that the implementation never produces any concrete values that do not map to abstract values.

The role of the representation invariant is to restrict the domain of the abstraction function to those values on which the implementation is going to work properly. The relationship between the representation invariant and the abstraction function is depicted in this figure:

The rep invariant is a condition that is intended to hold for all values of the abstract type (e.g., set). Therefore, in implementing one of the operations of the abstract data type, it can be assumed that any arguments of the abstract type satisfy the rep invariant. This assumption restores local reasoning about correctness, because we can use the rep invariant and abstraction function to judge whether the implementation of a single operation is correct in isolation from the rest of the module. It is correct if, assuming that:

  1. the function's requires and checks clauses hold and
  2. the concrete representations of all values of the abstract type satisfy the rep invariant

we can show that

  1. the returns clause of the function is satisfied (that is, the commutation diagram holds) and
  2. all new values of the abstract type that are created have concrete representations satisfying the rep invariant

The rep invariant makes it easier to write code that is provably correct, because it means that we don't have to write code that works for all possible incoming concrete representations, only those that satisfy the rep invariant. This is why NatSetNoDups.union doesn't have to work on lists that contain duplicate elements. On return there is a corresponding responsibility to produce only values that satisfy the rep invariant. As suggested in the figure above, the rep invariant holds for all reps both before and after each function runs, which is why we call it an invariant at all.

repOK

When implementing a complex abstract data type, it is often helpful to write a function internal to the module that checks that the rep invariant holds. This function provides an additional level of assurance about your reasoning about the correctness of the code. By convention we will call this function repOK; given an abstract type (say, set) implemented as a concrete type (say, int list), it always has the same specification:

(* Returns whether x satisfies the representation invariant *)
fun repOK(x: int list): bool = ...

repOK can be used to help us implement a module and be sure that each function is independently correct. The trick is to bulletproof each function in the module against all the others by applying repOK to any values of the abstract type that come from outside the function. In addition, if the function creates any new values of the abstract type, it applies repOK to them to ensure that it isn't breaking the rep invariant itself. With this approach, a bug in one function is less likely to create the appearance of a bug in another.

repOK as an identity function

A more convenient way to write repOK is to make it an identity function that raises an exception if the rep invariant doesn't hold. Making it an identity function lets us conveniently test the rep invariant in various ways, as shown below.

(* The identity function.
 * Checks whether x satisfies the representation invariant. *)
fun repOK(x: int list): int list = ...

Here is an example of how we might use repOK for the NatSetNoDups implementation of sets given in lecture. Notice that repOK is applied to all sets that are created. This ensures that if a bad set representation is created, it will be detected immediately. In case we somehow missed a check on creation, we also apply repOK to all incoming set arguments. If there is a bug, these checks will help us quickly figure out where the rep invariant is being broken.

structure NatSetNoDups :> NATSET = struct
  type set = int list
  (* AF: the list [a1,...,an] represents the set {a1,...,an}.
   * RI: list contains no negative elements or duplicates.
   *)
  fun contains_internal(x:int,s:int list) =
    case s of
       [] => false
     | h::t => x = h orelse contains_internal(x,t)
  fun repOK(s: int list): int list =
    case s of
      [] => s
    | h::t => if h >= 0 andalso not(contains_internal(h,repOK(t)))
                then s
                else raise Fail "RI failed"
  val empty = []
  fun single(x) = repOK([x])
  fun contains(x,s) = contains_internal(x, repOK(s))
  fun union(s1, s2) =
    repOK (foldl (fn (x,s) => 
                    if contains(x,s) then s else x::s)
            (repOK(s1)) (repOK(s2)))
  fun size(s) = length(repOK(s))
end

Here, repOK is implemented using contains_internal rather than the function contains, because using contains would result in a lot of extra repOK checks.  Writing an unchecked helper function like this is a common pattern when implementing a repOK check. Fortunately we can reuse contains_internal when implementing the real contains.

Production vs. development code

Calling repOK on every argument can be too expensive for the production version of a program. The repOK above is quite expensive (though it could be implemented more cheaply). For production code it may be more appropriate to use a version of repOK that only checks the parts of the rep invariant that are cheap to check. When there is a requirement that there be no run-time cost, repOK can be changed to an identity function (or macro) so the compiler optimizes away the calls to it. However, it is a good idea to keep the full code of repOK around (perhaps in a comment) so it can easily be reinstated during future debugging.
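
One way to set this up is to dispatch on a flag. This is a sketch, not code from the lecture; the debugMode flag and repOKFull name are illustrative.

```sml
(* Illustrative sketch: a flag selects between the full, expensive
 * check and a free identity function.  When debugMode is false, a
 * good compiler can optimize repOK calls away entirely. *)
val debugMode = true

(* Full check for NatSetNoDups: no negative elements, no duplicates. *)
fun repOKFull (s: int list): int list =
  let
    fun mem (x, l) = List.exists (fn y => y = x) l
    fun check [] = true
      | check (h::t) = h >= 0 andalso not (mem (h, t)) andalso check t
  in
    if check s then s else raise Fail "RI failed"
  end

(* Entry point used throughout the module. *)
fun repOK (s: int list): int list =
  if debugMode then repOKFull s else s
```

Flipping debugMode to false turns every repOK call into the identity, keeping the full check available for future debugging without paying for it in production.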

Module invariants

Invariants on data are useful even when writing modules that are not easily seen as providing abstract data types. Sometimes it is difficult to identify an abstract view of the data that a module provides, and there may not be any abstract type at all. Invariants are important even without an abstraction function, because they document the legal states and representations that the code is expected to handle correctly. In general, we call invariants enforced by a module module invariants; in the case of an ADT, the rep invariant is a module invariant. Module invariants are useful for understanding how the code works, and also for maintenance, because the maintainer can avoid changes to the code that violate the module invariant.

Module invariants and code evolution

A strong module invariant is not always the best choice, because it restricts future changes to the module. We described interface specifications as a contract between the implementer of a module and the user. A module invariant is a contract between the implementer and herself, or among the various implementers of the module, present and future. According to assumption 2, above, ADT operations may be implemented assuming that the rep invariant holds. If the rep invariant is ever weakened (made more permissive), some parts of the implementation may break. Thus, one of the most important purposes of the rep invariant is to document exactly what may and what may not be safely changed about a module implementation. A weak invariant forces the implementer to work harder to produce a correct implementation, because less can be assumed about concrete representation values, but conversely it gives maximum flexibility for future changes to the code.

Let us consider the rep invariant for the vector implementation of NATSET. There is some question about what we should write. One possibility is to write the strongest possible specification of the possible values that can be created by the implementation. It happens that the vector representing the set never has trailing false values:

structure NatSetVec :> NATSET = struct
  type set = bool vector
  (* Abstraction function: the vector v represents the set
     of all natural numbers i such that sub(v,i) = true

     Representation invariant: v is empty, or the last element
     of v is true (i.e., v has no trailing false elements)
   *)
  val empty:set = Vector.fromList []
  (* The remaining operations (one possible implementation): *)
  fun single(x) = Vector.tabulate(x+1, fn i => i = x)
  fun contains(x,s) =
    x >= 0 andalso x < Vector.length s andalso Vector.sub(s,x)
  fun union(s1,s2) =
    Vector.tabulate(Int.max(Vector.length s1, Vector.length s2),
                    fn i => contains(i,s1) orelse contains(i,s2))
  fun size(s) = Vector.foldl (fn (b,n) => if b then n+1 else n) 0 s
end

This representation invariant describes an interesting property of the implementation that may be useful in judging its performance. However, we don't need it in order to show that the implementation is correct: all of the operations of NatSetVec work even on sets that somehow violate the no-trailing-false property, so the four conditions above can be established without appealing to this rep invariant at all.

Modularity and module invariants

A sign of good code design is that invariants on program data are enforced in a localized way, within modules, so that programmers can reason about whether the invariant is enforced without thinking about the rest of the program. To do this it is necessary to figure out just the right operations to be exposed by the various modules, so that useful functionality can be provided while also ensuring that invariants are maintained.

Conversely, a common design error is to break a program up into a set of modules that simply encapsulate data and provide low-level accessor operations, while putting all the interesting logic of the program in one main module. The problem with this design is that all the interesting (and hard!) code still lives in one place, and the main module is responsible for enforcing many complex invariants among the data. This kind of design does not break the program into simpler parts that can be reasoned about independently. It shows the big danger sign that the abstractions aren't right: all the code is either boring code, or overly complex code that is hard to reason about. It is a kind of fake modularity.

For example, suppose we are implementing a graphical chess game. The game state includes a board and a bunch of pieces. We might want to keep track of where each piece is, as well as what is on each board square. And there may be a good deal of state in the graphical display too. A good design would ensure that the board, the pieces, and the graphical display stay in sync with each other in code that is separate from that which handles the detailed rules of the game of chess.
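
The synchronization requirement could be captured as a module invariant behind a narrow interface. The following signature is a hypothetical sketch (the names BOARD, move, and pieceAt are illustrative, not from the lecture): clients manipulate the game only through operations that preserve the invariant, so no other module can put the board and pieces out of sync.

```sml
signature BOARD = sig
  (* Overview: a board tracks which piece, if any, occupies each
   * square.
   * Module invariant (maintained internally, invisible to clients):
   * the piece-to-square and square-to-piece views of the state
   * always agree, and the graphical display reflects them. *)
  type board
  type square = int * int
  datatype piece = Pawn | Knight | Bishop | Rook | Queen | King

  (* initial is the standard starting position *)
  val initial : board
  (* move(b,s1,s2) is the board after moving the piece on s1 to s2.
   * Checks: some piece occupies s1. *)
  val move : board * square * square -> board
  (* pieceAt(b,s) is SOME p if piece p occupies s, else NONE *)
  val pieceAt : board * square -> piece option
end
```

Because the representation is hidden behind the signature, only the implementation of BOARD has to be examined to verify that the invariant holds, restoring the local reasoning this lecture has emphasized.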