CS 3110 Lecture 16
Mutable data abstractions

Evaluation with a store

Once we add refs and arrays to OCaml, we can no longer reason about evaluation of programs as simply. Previously we could think about evaluation as involving a series of changes to a system configuration that consisted of just a single program term (i.e., expression). Each evaluation step took one subterm of the term and reduced it to a value. With imperative update, the system configuration comprises not only a term but also a store that records for each store location what value is in it.

In response to imperative updates, such as uses of the assignment operator :=, the store part of the configuration changes. So an evaluation step can affect both parts of the configuration. Reasoning about the changes to the store is more difficult than reasoning about the changes to the program term because where in the store the update happens is determined by locations that happen to have been computed in the term. In practice this leads to a lot of bugs, so it is a good idea to use imperative update in a limited, careful way.
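As a small illustration of a configuration with a store, the sketch below tracks the store in comments as each step runs. The names x, y, z, x_now, and z_now are made up for this example, and the location names loc1 and loc2 are only notation.

```ocaml
(* Each use of ref allocates a fresh location in the store. *)
let x = ref 0        (* store: loc1 -> 0 *)
let y = x            (* y names the same location; no new cell is allocated *)
let z = ref 0        (* store: loc1 -> 0, loc2 -> 0 *)

(* Assignment through y updates loc1, so the change is visible through x. *)
let () = y := 1      (* store: loc1 -> 1, loc2 -> 0 *)

let x_now = !x       (* x and y share a location *)
let z_now = !z       (* z is a different location *)
```

Which location is updated depends on what y evaluated to, not on how the assignment is written. That is the source of the difficulty described above.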

Mutable data abstraction

Mutable data abstractions are abstractions whose value can change over time. We have avoided using them until now because they are harder to reason about than immutable (functional) data abstractions. But for solving some problems they offer an advantage in efficiency.

Arrays

An important kind of mutable data structure that OCaml provides is the array. The type t array is in fact very similar to the Java array type t[]. Arrays generalize refs in that they are a sequence of mutable cells containing values. We can think of a ref cell as an array of size 1. Here's a partial signature for the built-in Array module of OCaml, including specifications:

  module type Array =
    sig
      (* Overview: an 'a array is a mutable fixed-length sequence of
       * elements of type 'a. *)
      type 'a array

      (* length a  is the length of a *)
      val length : 'a array -> int

      (* get a i  is the ith element in a. If i is
       * out of bounds, raise Invalid_argument *)
      val get : 'a array -> int -> 'a

      (* set a i x
       * Effects: Set the ith element of a to x
       * Raise Invalid_argument if i is not a legal index into a *)
      val set : 'a array -> int -> 'a -> unit

      (* create n x  is a new array of length n whose elements are
       * all equal to x. (The standard library names this make.) *)
      val create : int -> 'a -> 'a array

      (* of_list lst  is a new array containing the values in lst *)
      val of_list : 'a list -> 'a array
      ...
    end

See the OCaml documentation for more information on the operations available on arrays.
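A short tour of these operations, written against the standard library's Array module. The bound names (a, middle, b, n, out_of_bounds) are illustrative only.

```ocaml
(* Creating, updating, and reading arrays with the standard Array module. *)
let a = Array.make 3 0            (* [|0; 0; 0|] *)
let () = Array.set a 1 42         (* Effects: a is now [|0; 42; 0|] *)
let middle = Array.get a 1        (* reads the cell just written *)
let b = Array.of_list [1; 2; 3]   (* a fresh array [|1; 2; 3|] *)
let n = Array.length b

(* An out-of-bounds access raises Invalid_argument in the standard library. *)
let out_of_bounds =
  try ignore (Array.get a 3); false
  with Invalid_argument _ -> true
```

OCaml also offers the notation a.(i) for Array.get a i and a.(i) <- x for Array.set a i x.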

Notice that we have started using a new kind of clause in the specification, the effects clause. This clause specifies side effects that the operation has beyond the value it returns. When a routine has a side effect, there should be an "Effects:" clause to explicitly warn the user that a side effect may occur. For example, the set function returns no interesting value, but it does have a side effect. The set of things affected by the side effect should be clearly described. The client should be able to assume that anything not mentioned in the effects clause is unaffected by the use of the function.

An imperative change to a mutable data abstraction is also known as a destructive update, because it "destroys" the old state of the data structure. An assignment to an array element changes the array in place, destroying the old sequence of elements that formerly made up the array. When a destructive operation is performed on a mutable data abstraction, it looks to the client as though an imperative assignment has been performed, changing the abstraction to refer to a new value.

Programming in an imperative style is trickier than in a functional style exactly because the programmer has to be sure that the old value of the mutable data is no longer needed at the time that a destructive update is performed. In general it's hard to know whether some other part of the program still holds a reference to the data and does not expect the side effect.
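The hazard can be seen in a few lines. In this sketch, scores, report, and snapshot are hypothetical names; Array.sort and Array.copy are the standard library functions.

```ocaml
(* report is a second name for the same array; snapshot is a real copy. *)
let scores = [| 3; 1; 2 |]
let report = scores                 (* alias: same array, not a copy *)
let snapshot = Array.copy scores    (* independent copy of the old state *)

(* Array.sort sorts in place, destroying the old order. *)
let () = Array.sort compare scores

let report_first = report.(0)       (* the alias sees the sorted order *)
let snapshot_first = snapshot.(0)   (* the copy kept the old order *)
```

Code that reads report after the sort silently gets the new order. Only the explicit copy preserves the old value.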

Mutable sets and specifying side effects

Mutable collections such as sets and maps are another important kind of mutable data abstraction. We've seen several different implementations of sets thus far, but they have implemented an immutable set abstraction. A mutable set is a set that can be imperatively updated to include more elements, or to remove some elements. 

Here is an example of a signature for a mutable set. This signature shows an important issue in writing effects clauses. To specify a side effect, sometimes we need to be able to talk about the state of a mutable value both before and after the routine is executed. Writing "_pre" or "_post" after the name of a variable is a compact way of talking about the state of the value in that variable before and after the function executes, respectively.

module type MSet = sig
  (* Overview: a set is a mutable set of items of type elem.
   * For example, if elem is int, then a set might be
   * {1,-11,0}, {}, or {1001} *)
  type elem
  type set
  (* empty() creates a new empty set *)
  val empty : unit -> set
  (* Effects: add(s,x) adds the element x to s if it is
   * not there already: s_post = s_pre U {x} *)
  val add: set * elem -> unit
  (* Effects: remove(s,x) removes the element x from s if it is
   * there already: s_post = s_pre \ {x} *)
  val remove: set * elem -> unit
  (* member(s,x) is whether x is a member of s *)
  val member: set * elem -> bool
  (* size(s) is the number of elements in s *)
  val size: set -> int
  (* fold over the elements of the set *)
  val fold: ((elem*'b)->'b) -> 'b -> set -> 'b
  (* fromList(lst) is a new set containing the elements of lst *)
  val fromList: elem list -> set
  (* toList(s) is a list of the elements of s *)
  val toList: set -> elem list
end
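To make the pre/post notation concrete, here is a minimal sketch of one possible implementation. It is not the implementation used later in these notes: the signature MSET is a trimmed copy of the one above, and IntListMSet, with its list-in-a-ref representation, is a hypothetical stand-in chosen for brevity.

```ocaml
(* A trimmed copy of the mutable set signature, limited to four operations. *)
module type MSET = sig
  type elem
  type set
  val empty : unit -> set
  val add : set * elem -> unit
  val member : set * elem -> bool
  val size : set -> int
end

(* Hypothetical implementation: a ref to a duplicate-free list of ints. *)
module IntListMSet : MSET with type elem = int = struct
  type elem = int
  type set = int list ref
  let empty () = ref []
  let member (s, x) = List.mem x !s
  (* Effects: s_post = s_pre U {x} *)
  let add (s, x) = if not (List.mem x !s) then s := x :: !s
  let size s = List.length !s
end

let s = IntListMSet.empty ()
let size_pre = IntListMSet.size s       (* observed before the mutator runs *)
let () = IntListMSet.add (s, 7)
let () = IntListMSet.add (s, 7)         (* already present: no change *)
let size_post = IntListMSet.size s      (* observed after *)
let has_seven = IntListMSet.member (s, 7)
```

The same variable s denotes different abstract sets before and after the calls to add, which is exactly what s_pre and s_post distinguish.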

Classifying operations

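When designing the interface to a mutable data abstraction, it is a good idea to select operations that fall into one of three broad categories:

Creators, which create new values of the abstraction.

Observers, which report on the state of a value without changing it.

Mutators, which change the state of an existing value.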

This rule of thumb reduces the amount of reasoning that programmers need to do about side effects, because creators and observers usually do not have side effects; only mutators do. Because mutators are hard to reason about, it's typically a good idea to have relatively few mutators, and while giving clients the power they need, make operations as functional as possible.

The MSet signature contains examples of all three kinds of operations: empty and fromList are creators; member, size, fold, and toList are observers; and add and remove are mutators. Similarly, in the Array signature we have creators create and of_list, observers get and length, and a mutator set.

One interesting kind of observer is an iterator, which allows clients to iterate over a series of values contained in or computed by the abstraction. In OCaml, iterators are usually folds. In the interface above, the observer fold is an iterator abstraction. An alternative iterator interface is streams, which are closer to the Iterator interface in Java. It is also possible to have iterators that mutate or allow mutation of the underlying data abstraction, though these are typically more trouble than they're worth.
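The two iterator styles can be compared on an ordinary array. This sketch uses Array.fold_left for the fold style and the standard Seq module (available since OCaml 4.07) as one possible stream interface. The names sum, evens, and first_even are illustrative.

```ocaml
let a = [| 1; 2; 3; 4 |]

(* Fold-style iteration: the abstraction drives the loop and the
 * client supplies the step function. *)
let sum = Array.fold_left (fun acc x -> acc + x) 0 a

(* Stream-style iteration: the client pulls elements on demand.
 * Array.to_seq yields a lazy sequence over the array. *)
let evens = Seq.filter (fun x -> x mod 2 = 0) (Array.to_seq a)

(* Forcing the sequence once produces only its first element. *)
let first_even =
  match evens () with
  | Seq.Nil -> None
  | Seq.Cons (x, _) -> Some x
```

With a fold the client cannot stop early without extra machinery. With a stream it simply stops pulling.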

Programmers in object-oriented languages often define operations that are getters and setters, allowing reads from and writes to fields of an object. A getter is a kind of observer and a setter is a kind of mutator. However, programmers often reflexively define getters and setters for data abstractions when they shouldn't. If all fields have getters and setters, there isn't much data abstraction, and it often makes the interface needlessly wide. It's better to ask, “What observers and mutators does this abstraction really need?” Some of these operations may end up being getters or setters, but usually there aren't many.


Lecture 17: Implementing mutable data abstractions

Equality, similarity, and abstraction

The Leibniz principle is that two things are equal iff they can be substituted for each other in any context without any observable difference. Our logic for reasoning about programs used this principle. We said that if terms a1 and a2 were equal (a1=a2), then the terms are completely indistinguishable. That is, for an arbitrary proposition P that mentions a name x of the right type, P{a1/x} ⇔ P{a2/x}. Furthermore, the Leibniz principle says that if two things are indistinguishable, they are equal.

When we implement a data abstraction for which a test of equality makes sense, we should support a test of equality satisfying the Leibniz principle. How do we go about this?

OCaml has two built-in operators that purport to test equality: = and ==. Unfortunately, neither of these operators agrees with the Leibniz principle. The Java equals() method and operator == have the same problem. So when implementing a data abstraction, the correct implementation of an equals operation should not simply use these builtin operators.

Mutable data abstractions break simple reasoning about equality. Two arrays produced by separate evaluations of the same array-creating expression cannot be substituted for each other; they will be distinguishable in some contexts. For example:

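  let a = Array.make 3 0
  and b = Array.make 3 0
  in
    (* = compares the current contents, so this message is printed *)
    if a = b then print_string "a and b are similar";
    Array.set a 0 1;
    (* after the update, a and b can be told apart *)
    if Array.get a 0 <> Array.get b 0 then print_string "a and b are not equal"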

Here, the values referenced by a and b are not truly equal, because some expressions, such as Array.get a 0, will have a different value when b is substituted for a. We say that two mutable data objects are similar when their current state is abstractly equal. Data objects that are similar now may not be in the future.

Two mutable values are truly equal only if every change to one of the values affects the other one. Typically, this requires that the refs inside them share the same location.

So we can see that the OCaml operator = reports similarity instead of equality. However, OCaml provides another equality test, ==, which behaves like the operator of the same name in Java. It tests whether two values are stored in the same place in memory. If they are, they must truly be equal.

# "hi" = "hi";;
- : bool = true
# "hi" == "hi";;
- : bool = false

Unfortunately, the operator == also violates the Leibniz principle, because immutable values like tuples and records can appear unequal even though there is no operator other than == that can detect it:

# (2) = (2);;
- : bool = true
# (2) == (2);;
- : bool = true
# (1,2) = (1,2);;
- : bool = true
# (1,2) == (1,2);;
- : bool = false
# {x = 1; y = 2} == {x = 1; y = 2};;
- : bool = false

The upshot is that you should never rely on = or == to provide the notion of equality for your data abstractions. You should always implement your own equals or compare operation, because = does not test equality, and == violates the data abstraction. In general, == is too strong and = is too weak.

A language that is confused about what equality means leads programmers to make mistakes. For example, Java programmers are also encouraged to write their own equals() methods. But in Java, equals means equality in some classes and similarity in others. So you have to be very careful. For example, think about what happens if you have a hash table whose keys are mutable. If a key is mutated, it could break the data structure invariant of the hash table. It will not be possible to find that key using the Java equals() method in its typical implementation as similarity.
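The same hazard can be reproduced in OCaml, since the standard Hashtbl module hashes and compares keys structurally, that is, by similarity. The sketch below uses an int array as a (deliberately ill-advised) mutable key. The names table, key, found_before, found_after, and still_stored are illustrative.

```ocaml
(* A table keyed by mutable arrays; keys are hashed by their contents. *)
let table : (int array, string) Hashtbl.t = Hashtbl.create 16
let key = [| 1; 2 |]
let () = Hashtbl.add table key "payload"

(* Before mutation, any similar key finds the binding. *)
let found_before = Hashtbl.find_opt table [| 1; 2 |]

(* Mutating the key leaves the binding in the bucket that was chosen
 * for the old contents. *)
let () = key.(0) <- 99

(* A key similar to the old contents reaches that bucket, but the
 * stored key no longer compares equal to it. *)
let found_after = Hashtbl.find_opt table [| 1; 2 |]
let still_stored = Hashtbl.length table
```

The binding is still counted by Hashtbl.length, yet the contents it was stored under no longer find it. A lookup with the mutated key itself would usually fail too, because its hash now points at a different bucket.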

The rule of thumb is that for mutable data abstractions, equality should be based on the == test, whereas for immutable data abstractions, equality should be the same as similarity. This doesn't mean that equality on immutable abstractions should be implemented using =. For an immutable abstraction, two concrete values x and y should be equal when AF(x) = AF(y). For example, if implementing rational numbers, we probably want 2/3 = 4/6, yet the representations of these two numbers would be different in some implementations.
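For the rational number example, a sketch of an equality test that compares abstract values rather than representations might look like this. The pair representation and the names rational, equal, two_thirds, and four_sixths are assumptions made for illustration.

```ocaml
(* Hypothetical rep: (n, d) represents the rational n/d.
 * Rep invariant: d > 0. The rep need not be in lowest terms. *)
type rational = int * int

(* equal x y is whether AF(x) = AF(y), i.e. n1/d1 = n2/d2.
 * Cross-multiplying avoids division; overflow is ignored in this sketch. *)
let equal ((n1, d1) : rational) ((n2, d2) : rational) : bool =
  n1 * d2 = n2 * d1

let two_thirds : rational = (2, 3)
let four_sixths : rational = (4, 6)

let structurally_equal = (two_thirds = four_sixths)   (* compares the reps *)
let abstractly_equal = equal two_thirds four_sixths   (* compares the rationals *)
```

The built-in = reports the two values as different because the pairs differ, while equal identifies them as the same rational.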

Rep invariants

Mutable data abstractions need rep invariants, just like immutable abstractions do. However, mutation raises some new issues for maintaining rep invariants. For functional programming, we have to make sure that any new abstract values constructed satisfy the rep invariant. For imperative programming we also need to make sure that the rep invariant is not broken for existing abstract values.

For example, consider the following implementation of the MSet signature, in which an underlying sorted array is used as the representation:

module ArrayMSet (Key: ORD_KEY)
    : MSet with type elem = Key.ord_key =
struct
    open Array
    type elem = Key.ord_key
    type set = {elems: elem option array ref; mutable size: int}
    (* {elems; size} represents a set containing the first size
     * elements in !elems.
     * Rep invariant: the first size elements have the form Some(e)
     * and they are in sorted order according to Key.compare.
     *)
    let initial_size = 10
    let empty () = {elems = ref (make initial_size None); size = 0}

    let equals s1 s2 = (s1.elems == s2.elems)
    ...
end

The idea is to create an array that is large enough to hold all the elements of the set. If too many add operations are done, a new array is created. Only the first size elements of the array are actually used to store elements, and they are stored in sorted order. The member operation can be performed using binary search in O(lg n) time. However, add will not be as efficient, because insertion into the middle of the array will take O(n) time. Note that adding the element at the end of the array would take O(1) time but would break the rep invariant on the set that is being extended.
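The member and add operations described here can be sketched as follows. To stay self-contained, this version is simplified relative to the functor above: elements are plain ints, the array is a mutable record field rather than a ref, and the helper find is a hypothetical name. Doubling the array when it fills up is one common growth policy, assumed here.

```ocaml
(* Rep invariant: the first size cells of elems are in ascending order. *)
type set = { mutable elems : int array; mutable size : int }

let empty () = { elems = Array.make 10 0; size = 0 }

(* find s x is the least index i in [0, size] such that every element
 * before i is less than x. Binary search: O(lg n). *)
let find s x =
  let lo = ref 0 and hi = ref s.size in
  while !lo < !hi do
    let mid = (!lo + !hi) / 2 in
    if s.elems.(mid) < x then lo := mid + 1 else hi := mid
  done;
  !lo

let member (s, x) =
  let i = find s x in
  i < s.size && s.elems.(i) = x

(* Effects: s_post = s_pre U {x}. Shifting the tail costs O(n). *)
let add (s, x) =
  let i = find s x in
  if not (i < s.size && s.elems.(i) = x) then begin
    (* Grow by doubling when every cell is in use. *)
    if s.size = Array.length s.elems then begin
      let bigger = Array.make (2 * s.size) 0 in
      Array.blit s.elems 0 bigger 0 s.size;
      s.elems <- bigger
    end;
    (* Shift cells i..size-1 one place right, then insert at i. *)
    Array.blit s.elems i s.elems (i + 1) (s.size - i);
    s.elems.(i) <- x;
    s.size <- s.size + 1
  end
```

Inserting at index i rather than at the end is what keeps the rep invariant intact, at the price of the O(n) shift.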

Exposing the rep

A common mistake when designing a mutable abstraction is exposing the rep -- that is, implementing operations in a way that allows users access to mutable pieces of the representation. The problem is that these mutable values may then be updated by a thoughtless or malicious client, causing invalid changes to the original abstract value.

For example, suppose that we add an operation to_array: set -> elem option array to the mutable set interface. It looks very easy to implement in the functor above:

let to_array {elems = e; size = _} = !e

This implementation simply returns the array that is used as part of the representation of the set. The problem is that a client receiving the array of elements can change the array using set, and in doing so break the rep invariant of the set the array was received from.

We can also expose the rep by using a mutable value in a creator. For example, if we had a creator that created a set from an array, it would be tempting to reuse the array:

let of_array a = {elems = ref a; size = length a}

But a client can hold onto the array used in the creator and then use that array later to break the rep invariant of the set.
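One standard remedy is to copy mutable data whenever it crosses the abstraction boundary. The sketch below assumes a simplified int-only representation. The names of_sorted_array, client, view, and rep_now are hypothetical.

```ocaml
(* Hypothetical simplified rep: a sorted int array hidden behind a record. *)
type set = { elems : int array ref; size : int }

(* Requires: a is sorted and duplicate-free.
 * Copy on the way in: the client's array never becomes part of the rep. *)
let of_sorted_array a = { elems = ref (Array.copy a); size = Array.length a }

(* Copy on the way out: the client receives a fresh array. *)
let to_array s = Array.sub !(s.elems) 0 s.size

let client = [| 1; 2; 3 |]
let s = of_sorted_array client
let () = client.(0) <- 100        (* changes only the client's array *)
let view = to_array s
let () = view.(2) <- (-5)         (* changes only the returned copy *)
let rep_now = to_array s          (* the set itself is unaffected *)
```

The copies cost O(n) each, which is the price of keeping the rep invariant out of the client's reach.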

In early versions of Java, there was actually a security hole that arose from exposing the rep. An array was returned containing information about the security policies enforced by the system. But the actual internal array was returned, so if applet code modified the array, it changed the system security policy!

Aliasing

When a routine has side effects or manipulates mutable data structures, aliasing is another important issue to think about. Aliasing occurs when two different variables refer to the same underlying mutable data, so changes through one variable affect the other one. For example, consider a function that copies elements from one set to another:

(* copy(s1,s2): add all the elements of s2 into the set s1
 * Requires: s1 and s2 are different sets. *)
val copy: set * set -> unit

This function might easily be implemented in a way that causes it not to work when the inputs are aliased and the function is trying to copy elements from the set to itself. Therefore, in the specification for this function, we should specify whether the two sets are allowed to be the same set.

Aliasing is the source of many bugs, because often programmers do not think enough about the possibility of aliasing.
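An alternative to the Requires clause is to make copy safe under aliasing by checking for it. The sketch below assumes a mutable int set backed by the standard Hashtbl. The type set and the helper of_list are hypothetical. The check matters because Hashtbl.iter's behavior is unspecified if the table is modified during the iteration.

```ocaml
(* Hypothetical mutable int set backed by the standard Hashtbl. *)
type set = (int, unit) Hashtbl.t

let of_list (xs : int list) : set =
  let s = Hashtbl.create 16 in
  List.iter (fun x -> Hashtbl.replace s x ()) xs;
  s

(* copy (s1, s2): add all the elements of s2 into the set s1.
 * Effects: s1_post = s1_pre U s2. The sets may be aliased: the physical
 * equality test skips the loop in that case, so s1 is never modified
 * while it is being iterated over. *)
let copy ((s1, s2) : set * set) : unit =
  if s1 != s2 then
    Hashtbl.iter (fun x () -> Hashtbl.replace s1 x ()) s2

let a = of_list [1; 2]
let b = of_list [2; 3]
let () = copy (a, b)     (* a now holds 1, 2, and 3 *)
let () = copy (a, a)     (* aliased call: a is unchanged *)
let size_a = Hashtbl.length a
```

Skipping the loop is correct here because the union of a set with itself is the same set. For other operations the aliased case may need a snapshot of the source instead.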

Benign side effects

Not every side effect needs to be documented with an effects clause. If a side effect does not affect the abstract value that a mutable representation maps to, the side effect will be invisible to the user of the abstraction. Therefore, it need not be mentioned in the specification. Side effects of this sort are known as benign side effects, because they are not destructive. Benign side effects can be useful for building caches that remain internal to data structures, or for data structure reorganizations that improve performance without affecting their abstract value.

For example, suppose we want to implement an immutable set as a list. We know that an operation that is frequently used is finding an element in the set, but that almost always the element looked for is the same as the previous one. We can use a benign side effect to accelerate this find operation: the list is paired with a ref that stores the last element looked for, and the find operation checks the ref first.

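type clist = int list * int option ref

(* contains x s  is whether s contains x *)
let contains x ((lst, cache) : clist) =
  (* Implementation: uses a benign side effect. The cache holds the
   * last element found in lst, or None before the first successful
   * lookup, so a stale initial value can never be reported as a member. *)
  if !cache = Some x then true
  else if List.mem x lst then (cache := Some x; true)
  else false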

Benign side effects can be a useful implementation technique. Because they use mutation, we have to be more careful in a concurrent setting, just as we do with the non-benign variety.