Mutable Data Types

# Mutable Data Types

* * *

<i>
Topics:

* refs
* mutable fields
* arrays
* mutable data structures
</i>

* * *

## Mutable Data Types

OCaml is not a *pure* language: it does admit side effects. We have seen that already with I/O, especially printing. But up till now we have limited ourselves to the subset of the language that is *immutable*: values could not change. Today, we look at data types that are mutable.

Mutability is neither good nor bad. It enables new functionality that we couldn't implement (at least not easily) before, and it enables us to create certain data structures that are asymptotically more efficient than their purely functional analogues. But mutability does make code more difficult to reason about, hence it is a source of many faults in code. One reason for that might be that humans are not good at thinking about change. With immutable values, we're guaranteed that any fact we might establish about them can never change. But with mutable values, that's no longer true. "Change is hard," as they say.

## Refs

A *ref* is like a pointer or reference in an imperative language. It is a location in memory whose contents may change. Refs are also called *ref cells*, the idea being that there's a cell in memory that can change.

**A first example.** Here's an example utop transcript to introduce refs:

```
# let x = ref 0;;
val x : int ref = {contents = 0}

# !x;;
- : int = 0

# x := 1;;
- : unit = ()

# !x;;
- : int = 1
```

At a high level, what that shows is creating a ref, getting the value from inside it, changing its contents, and observing the changed contents. Let's dig a little deeper.

The first phrase, `let x = ref 0`, creates a reference using the `ref` function. That's a location in memory whose contents are initialized to `0`. Think of the location itself as being an address (for example, 0x3110bae0), even though there's no way to write down such an address in an OCaml program.
The `ref` function is what causes the memory location to be allocated and initialized. The first part of the response from utop, `val x : int ref`, indicates that `x` is a variable whose type is `int ref`. We have a new type constructor here. Much like `list` and `option` are type constructors, so is `ref`. A `t ref`, for any type `t`, is a reference to a memory location that is guaranteed to contain a value of type `t`. As usual, we should read a type from right to left: `t ref` means a reference to a `t`. The second part of the response shows us the contents of the memory location. Indeed, the contents have been initialized to `0`.

The second phrase, `!x`, dereferences `x` and returns the contents of the memory location. Note that `!` is the dereference operator in OCaml, not Boolean negation.

The third phrase, `x := 1`, is an assignment. It mutates the contents of `x` to be `1`. Note that `x` itself still points to the same location (i.e., address) in memory. Variables really are immutable in that way. What changes is the contents of that memory location. Memory is mutable; variable bindings are not. The response from utop is simply `()`, meaning that the assignment took place, much like printing functions return `()` to indicate that the printing did happen.

The fourth phrase, `!x`, again dereferences `x` to demonstrate that the contents of the memory location did indeed change.

**A more sophisticated example.** Here is code that implements a *counter*. Every time `next_val` is called, it returns one more than the previous time.

```
# let counter = ref 0;;
val counter : int ref = {contents = 0}

# let next_val =
    fun () ->
      counter := (!counter) + 1;
      !counter;;
val next_val : unit -> int = <fun>

# next_val ();;
- : int = 1

# next_val ();;
- : int = 2

# next_val ();;
- : int = 3
```

In the implementation of `next_val`, there are two expressions separated by a semicolon. The first expression, `counter := (!counter) + 1`, is an assignment that increments `counter` by 1.
The second expression, `!counter`, returns the newly incremented contents of `counter`.

This function is unusual in that every time we call it, it returns a different value. That's quite different from any of the functions we've implemented ourselves so far, which have always been *deterministic*: for a given input, they always produced the same output. On the other hand, we've seen some library functions that are *nondeterministic*, for example, functions in the `Random` module, and `Pervasives.read_line`. It's no coincidence that those happen to be implemented using mutable features.

We could improve our counter in a couple of ways. First, there is a library function `incr : int ref -> unit` that increments an `int ref` by 1. Thus it is like the `++` operator in many languages in the C family. Using it, we could write `incr counter` instead of `counter := (!counter) + 1`. Second, the way we coded the counter currently exposes the `counter` variable to the outside world. Maybe we'd prefer to hide it so that clients of `next_val` can't directly change it. We could do so by nesting `counter` inside the scope of `next_val`:

```
let next_val =
  let counter = ref 0 in
  fun () ->
    incr counter;
    !counter
```

Now `counter` is in scope inside of `next_val`, but not accessible outside that scope.

When we gave the dynamic semantics of let expressions before, we talked about substitution. One way to think about the definition of `next_val` is as follows.

* First, the expression `ref 0` is evaluated. That returns a location `loc`, which is an address in memory. The contents of that address are initialized to `0`.

* Second, everywhere in the body of the let expression that `counter` occurs, we substitute for it that location. So we get:

  ```
  fun () -> incr loc; !loc
  ```

* Third, that anonymous function is bound to `next_val`.

So any time `next_val` is called, it increments and returns the contents of that one memory location `loc`.
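
The same substitution reading suggests what happens if the hidden-counter definition is wrapped in a function. The following is only a sketch: the name `make_counter` is hypothetical and not part of these notes. Each call to `make_counter` evaluates `ref 0` once, so each closure it returns should own a separate location.

```ocaml
(* [make_counter ()] is a fresh counter function. Each call to
   [make_counter] evaluates [ref 0] once, so every counter it
   returns has its own private memory location.
   ([make_counter] is a hypothetical name used for illustration.) *)
let make_counter () =
  let counter = ref 0 in
  fun () ->
    incr counter;
    !counter

let c1 = make_counter ()
let c2 = make_counter ()

let a = c1 ()  (* 1 *)
let b = c1 ()  (* 2 *)
let c = c2 ()  (* 1: [c2] increments its own ref cell, not [c1]'s *)
```

The two counters advance independently because `counter` was bound twice, once per call to `make_counter`.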

Now imagine that we instead had written the following (broken) code:

```
let next_val_broken = fun () ->
  let counter = ref 0 in
  incr counter;
  !counter
```

It's only a little different: the binding of `counter` occurs after the `fun () ->` instead of before. But it makes a huge difference:

```
# next_val_broken ();;
- : int = 1

# next_val_broken ();;
- : int = 1

# next_val_broken ();;
- : int = 1
```

Every time we call `next_val_broken`, it returns `1`: we no longer have a counter. What's going wrong here?

The problem is that every time `next_val_broken` is called, the first thing it does is to evaluate `ref 0` to a new location that is initialized to `0`. That location is then incremented to `1`, and `1` is returned. Every call to `next_val_broken` is thus allocating a new ref cell, whereas `next_val` allocates just one new ref cell.

**Syntax.** The first three of the following are new syntactic forms involving refs, and the last is a syntactic form that we haven't yet fully explored.

* Ref creation: `ref e`

* Ref assignment: `e1 := e2`

* Dereference: `!e`

* Sequencing of effects: `e1; e2`

**Dynamic semantics.**

* To evaluate `ref e`,

  - Evaluate `e` to a value `v`.
  - Allocate a new location `loc` in memory to hold `v`.
  - Store `v` in `loc`.
  - Return `loc`.

* To evaluate `e1 := e2`,

  - Evaluate `e2` to a value `v`, and `e1` to a location `loc`.
  - Store `v` in `loc`.
  - Return `()`, i.e., unit.

* To evaluate `!e`,

  - Evaluate `e` to a location `loc`.
  - Return the contents of `loc`.

* To evaluate `e1; e2`,

  - First evaluate `e1` to a value `v1`.
  - Then evaluate `e2` to a value `v2`.
  - Return `v2`. (`v1` is not used at all.)
  - If there are multiple expressions in a sequence, e.g., `e1; e2; ...; en`, then evaluate each one in order from left to right, returning only `vn`. Another way to think about this is that the semicolon is right associative: for example, `e1; e2; e3` is the same as `e1; (e2; e3)`.
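
To see these rules at work, here is a small sketch that exercises each form once. The names `r`, `u`, `v`, and `w` are ours, chosen only for illustration.

```ocaml
(* [ref e]: evaluate [1 + 2] to [3], allocate a new location,
   and store [3] in it. *)
let r = ref (1 + 2)

(* [e1 := e2]: store [10] in the location and return unit. *)
let u : unit = r := 10

(* [!e]: return the current contents of the location. *)
let v = !r  (* 10 *)

(* [e1; e2; e3]: evaluate left to right and return the value of
   the last expression. The contents go 10, then 11, then 22. *)
let w = (r := !r + 1; r := !r * 2; !r)  (* 22 *)
```

Because the sequence is evaluated left to right, the increment happens before the doubling, which is why `w` is `22` rather than `21`.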

Note that locations are values that can be passed to and returned from functions. But unlike other values (e.g., integers, variants), there is no way to directly write a location in an OCaml program. That's different from languages like C, where programmers can directly write memory addresses and do arithmetic on pointers. C programmers want that kind of low-level access to do things like interface with hardware and build operating systems. Higher-level programmers are willing to forgo it to get *memory safety*. That's a hard term to define, but according to [Hicks 2014][memory-safety-hicks] it intuitively means that

* pointers are only created in a safe way that defines their legal memory region,
* pointers can only be dereferenced if they point to their allotted memory region, and
* that region is (still) defined.

[memory-safety-hicks]: http://www.pl-enthusiast.net/2014/07/21/memory-safety/

**Static semantics.** We have a new type constructor, `ref`, such that `t ref` is a type for any type `t`. Note that the name `ref` is used in two ways: as a type constructor, and as a function that constructs refs.

* `ref e : t ref` if `e : t`.

* `e1 := e2 : unit` if `e1 : t ref` and `e2 : t`.

* `!e : t` if `e : t ref`.

* `e1; e2 : t` if `e1 : unit` and `e2 : t`. Similarly, `e1; e2; ...; en : t` if `e1 : unit`, `e2 : unit`, ... (i.e., all expressions except `en` have type `unit`), and `en : t`.

The typing rule for the semicolon is designed to prevent programmer mistakes. For example, a programmer who writes `2+3; 7` probably didn't mean to: there's no reason to evaluate `2+3` then throw away the result and instead return `7`. The compiler will give you a warning if you violate this particular typing rule. To get rid of the warning (if you're sure that's what you need to do), there's a function `ignore : 'a -> unit` in the standard library. Using it, `ignore(2+3); 7` will compile without a warning. Of course, you could code up `ignore` yourself: `let ignore _ = ()`.
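
The typing rules can be checked by writing the types out as annotations. The sketch below does that; `bump` is a hypothetical helper of ours, included to show a location being passed to a function and to give `ignore` a side effect worth keeping.

```ocaml
(* [ref e : t ref] if [e : t]. *)
let r : int ref = ref 0

(* [e1 := e2 : unit] if [e1 : t ref] and [e2 : t]. *)
let assigned : unit = r := 5

(* [!e : t] if [e : t ref]. *)
let n : int = !r  (* 5 *)

(* [bump cell] increments [cell] and returns its new contents.
   The location is passed in like any other value.
   ([bump] is a hypothetical helper, not a library function.) *)
let bump (cell : int ref) : int =
  incr cell;
  !cell

(* [bump r] has type [int], so [bump r; bump r] would draw the
   warning described above. [ignore] turns the first call into a
   [unit] expression while keeping its side effect. *)
let last : int = ignore (bump r); bump r  (* 7 *)
```

Unlike `ignore (2+3)`, the ignored call here does useful work: it increments `r`, so the second call returns `7`.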

**Aliasing.** Now that we have refs, we have *aliasing*: two refs could point to the same memory location, hence updating through one causes the other to also be updated. For example,

```
let x = ref 42
let y = ref 42
let z = x
let () = x := 43
let w = (!y) + (!z)
```

The result of executing that code is that `w` is bound to `85`, because `let z = x` causes `z` and `x` to become aliases, hence updating `x` to be `43` also causes `z` to be `43`.

**Equality.** OCaml has two equality operators, physical equality and structural equality. The [documentation][pervasives] of `Pervasives.(==)` explains physical equality:

> `e1 == e2` tests for physical equality of `e1` and `e2`. On mutable types such as
> references, arrays, byte sequences, records with mutable fields and objects with
> mutable instance variables, `e1 == e2` is `true` if and only if physical modification
> of `e1` also affects `e2`. On non-mutable types, the behavior of `( == )` is
> implementation-dependent; however, it is guaranteed that `e1 == e2` implies
> `compare e1 e2 = 0`.

[pervasives]: http://caml.inria.fr/pub/docs/manual-ocaml/libref/Pervasives.html

One interpretation could be that `==` should be used only when comparing refs (and other mutable data types) to see whether they point to the same location in memory. Otherwise, don't use `==`.

Structural equality is also explained in the documentation of `Pervasives.(=)`:

> `e1 = e2` tests for structural equality of `e1` and `e2`. Mutable structures
> (e.g. references and arrays) are equal if and only if their current contents
> are structurally equal, even if the two mutable objects are not the same
> physical object. Equality between functional values raises `Invalid_argument`.
> Equality between cyclic data structures may not terminate.

Structural equality is usually what you want to test. For refs, it checks whether the contents of the memory location are equal, regardless of whether they are the same location.
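
Aliasing and the two equalities interact, which a short sketch can make concrete. It reuses `x`, `y`, and `z` from the aliasing example; the remaining names are ours. Physical equality tracks which names share a location, while structural equality on refs can change as contents change.

```ocaml
let x = ref 42
let y = ref 42
let z = x                     (* [z] is an alias of [x] *)

let same_cell = x == z        (* true: one location, two names *)
let distinct_cells = x == y   (* false: two separate locations *)
let equal_before = x = y      (* true: both currently contain 42 *)

let () = z := 43              (* update through the alias *)

let x_now = !x                (* 43: the update is visible through [x] *)
let equal_after = x = y       (* false: 43 versus 42 *)
```

Note that `x == z` would still be `true` after the update, whereas `x = y` went from `true` to `false`.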

The negation of physical equality is `!=`, and the negation of structural equality is `<>`. This can be hard to remember.

Here are some examples involving equality and refs to illustrate the difference between structural equality (`=`) and physical equality (`==`):

```
# let r1 = ref 3110;;
val r1 : int ref = {contents = 3110}

# let r2 = ref 3110;;
val r2 : int ref = {contents = 3110}

# r1 == r1;;
- : bool = true

# r1 == r2;;
- : bool = false

# r1 != r2;;
- : bool = true

# r1 = r1;;
- : bool = true

# r1 = r2;;
- : bool = true

# r1 <> r2;;
- : bool = false

# ref 3110 <> ref 2110;;
- : bool = true
```

## Mutable fields

The fields of a record can be declared as mutable, meaning their contents can be updated without constructing a new record. For example, here is a record type for two-dimensional colored points whose color field `c` is mutable:

```
# type point = {x:int; y:int; mutable c:string};;
type point = {x:int; y:int; mutable c:string; }
```

Note that `mutable` is a property of the field, rather than the type of the field. In particular, we write `mutable field : type`, not `field : mutable type`.

The operator to update a mutable field is `<-`:

```
# let p = {x=0; y=0; c="red"};;
val p : point = {x=0; y=0; c="red"}

# p.c <- "white";;
- : unit = ()

# p;;
- : point = {x=0; y=0; c="white"}

# p.x <- 3;;
Error: The record field x is not mutable
```

The syntax and semantics of `<-` are similar to those of `:=` but complicated by fields:

* **Syntax:** `e1.f <- e2`

* **Dynamic semantics:** To evaluate `e1.f <- e2`, evaluate `e2` to a value `v2`, and `e1` to a value `v1`, which must have a field named `f`. Update `v1.f` to `v2`. Return `()`.

* **Static semantics:** `e1.f <- e2 : unit` if `e1 : t1` and `t1 = {...; mutable f : t2; ...}`, and `e2 : t2`.

## Refs and mutable fields

It turns out that refs are actually implemented as mutable fields.
In [`Pervasives`][pervasives] we find the following declaration:

```
type 'a ref = { mutable contents : 'a; }
```

And that's why when we create a ref it does in fact look like a record: it *is* a record!

```
# let r = ref 3110;;
val r : int ref = {contents = 3110}
```

The other syntax we've seen for refs is in fact equivalent to simple OCaml functions:

```
(* Equivalent to [fun v -> {contents=v}]. *)
val ref : 'a -> 'a ref

(* Equivalent to [fun r -> r.contents]. *)
val (!) : 'a ref -> 'a

(* Equivalent to [fun r v -> r.contents <- v]. *)
val (:=) : 'a ref -> 'a -> unit
```

The reason we say "equivalent" is that those functions are actually implemented not in OCaml but in the OCaml run-time, which is implemented mostly in C. But the functions do behave the same as the OCaml source given above in comments.

## Arrays

Arrays are fixed-length mutable sequences with constant-time access and update. So they are similar in various ways to refs, lists, and tuples. Like refs, they are mutable. Like lists, they are (finite) sequences. Like tuples, their length is fixed in advance and they cannot be resized.

The syntax for arrays is similar to lists:

```
# let v = [|0.; 1.|];;
val v : float array = [|0.; 1.|]
```

That code creates an array whose length is fixed to be 2 and whose contents are initialized to `0.` and `1.`. The name `array` is a type constructor, much like `list`.

Later those contents can be changed using the `<-` operator:

```
# v.(0) <- 5.;;
- : unit = ()

# v;;
- : float array = [|5.; 1.|]
```

As you can see in that example, indexing into an array uses the syntax `array.(index)`, where the parentheses are mandatory.

The [`Array` module][array] has many useful functions on arrays.
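
As a small taste of that module, here is a sketch using a few standard functions (`Array.length`, `Array.make`, `Array.init`, `Array.map`, and `Array.fold_left`). The array names are ours, chosen for illustration.

```ocaml
let v = [|0.; 1.; 2.|]

(* The length is fixed when the array is created. *)
let n = Array.length v                         (* 3 *)

(* [Array.make len x] creates an array of [len] copies of [x]. *)
let zeros = Array.make 3 0                     (* [|0; 0; 0|] *)
let () = zeros.(1) <- 7                        (* now [|0; 7; 0|] *)

(* [Array.init len f] fills index [i] with [f i]. *)
let squares = Array.init 4 (fun i -> i * i)    (* [|0; 1; 4; 9|] *)

(* [Array.map] builds a new array and leaves [v] unchanged. *)
let doubled = Array.map (fun x -> 2. *. x) v   (* [|0.; 2.; 4.|] *)

(* [Array.fold_left] combines the elements from left to right. *)
let total = Array.fold_left (+.) 0. v          (* 3. *)
```

Of these, only the `<-` assignment mutates an existing array; `Array.map` and `Array.fold_left` read their argument without changing it.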

[array]: http://caml.inria.fr/pub/docs/manual-ocaml/libref/Array.html

**Syntax.**

* Array creation: `[|e0; e1; ...; en|]`

* Array indexing: `e1.(e2)`

* Array assignment: `e1.(e2) <- e3`

**Dynamic semantics.**

* To evaluate `[|e0; e1; ...; en|]`, evaluate each `ei` to a value `vi`, create a new array of length `n+1`, and store each value in the array at its index.

* To evaluate `e1.(e2)`, evaluate `e1` to an array value `v1`, and `e2` to an integer `v2`. If `v2` is not within the bounds of the array (i.e., `0` to `n-1`, where `n` is the length of the array), raise `Invalid_argument`. Otherwise, index into `v1` to get the value `v` at index `v2`, and return `v`.

* To evaluate `e1.(e2) <- e3`, evaluate each expression `ei` to a value `vi`. Check that `v2` is within bounds, as in the semantics of indexing. Mutate the element of `v1` at index `v2` to be `v3`. Return `()`.

**Static semantics.**

* `[|e0; e1; ...; en|] : t array` if `ei : t` for all the `ei`.

* `e1.(e2) : t` if `e1 : t array` and `e2 : int`.

* `e1.(e2) <- e3 : unit` if `e1 : t array` and `e2 : int` and `e3 : t`.

**Loops.** OCaml has while loops and for loops. Their syntax is as follows:

```
while e1 do e2 done
for x=e1 to e2 do e3 done
for x=e1 downto e2 do e3 done
```

The second form of `for` loop counts down from `e1` to `e2`; that is, it decrements its index variable at each iteration.

Though not mutable features themselves, loops can be useful with mutable data types like arrays. We can also use functions like `Array.iter`, `Array.map`, and `Array.fold_left` instead of loops.

## Mutable data structures

As an example of a mutable data structure, let's look at stacks. We're already familiar with functional stacks:

```
exception Empty

module type Stack = sig
  (* ['a t] is the type of stacks whose elements have type ['a]. *)
  type 'a t

  (* [empty] is the empty stack *)
  val empty : 'a t

  (* [push x s] is the stack whose top is [x] and the rest is [s]. *)
  val push : 'a -> 'a t -> 'a t

  (* [peek s] is the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val peek : 'a t -> 'a

  (* [pop s] is all but the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val pop : 'a t -> 'a t
end
```

An interface for a *mutable* or *non-persistent* stack would look a little different:

```
module type MutableStack = sig
  (* ['a t] is the type of mutable stacks whose elements have type ['a].
   * The stack is mutable not in the sense that its elements can
   * be changed, but in the sense that it is not persistent:
   * the operations [push] and [pop] destructively modify the stack. *)
  type 'a t

  (* [empty ()] is the empty stack *)
  val empty : unit -> 'a t

  (* [push x s] modifies [s] to make [x] its top element.
   * The rest of the elements are unchanged. *)
  val push : 'a -> 'a t -> unit

  (* [peek s] is the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val peek : 'a t -> 'a

  (* [pop s] removes the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val pop : 'a t -> unit
end
```

Notice especially how the type of `empty` changes: instead of being a value, it is now a function. This is typical of functions that create mutable data structures. Also notice how the types of `push` and `pop` change: instead of returning an `'a t`, they return `unit`. This again is typical of functions that modify mutable data structures. In all these cases, the use of `unit` makes the functions more like their equivalents in an imperative language. The constructor for an empty stack in Java, for example, might not take any arguments (which is equivalent to taking unit). And the push and pop functions for a Java stack might return `void`, which is equivalent to returning `unit`.

Now let's implement the mutable stack with a mutable linked list. We'll have to code that up ourselves, since OCaml linked lists are persistent.

```
module MutableRecordStack = struct
  (* An ['a node] is a node of a mutable linked list. It has
   * a field [value] that contains the node's value, and
   * a mutable field [next] that is [None] if the node has
   * no successor, or [Some n] if the successor is [n]. *)
  type 'a node = {value : 'a; mutable next : 'a node option}

  (* AF: An ['a t] is a stack represented by a mutable linked list.
   * The mutable field [top] is the first node of the list,
   * which is the top of the stack. The empty stack is represented
   * by {top = None}. The record {top = Some n} represents the
   * stack whose top is [n], and whose remaining elements are
   * the successors of [n]. *)
  type 'a t = {mutable top : 'a node option}

  let empty () =
    {top = None}

  (* To push [x] onto [s], we allocate a new node with [Some {...}].
   * Its successor is the old top of the stack, [s.top].
   * The top of the stack is mutated to be the new node. *)
  let push x s =
    s.top <- Some {value = x; next = s.top}

  let peek s =
    match s.top with
    | None -> raise Empty
    | Some {value; _} -> value

  (* To pop [s], we mutate the top of the stack to become its successor. *)
  let pop s =
    match s.top with
    | None -> raise Empty
    | Some {next; _} -> s.top <- next
end
```

Here is some example usage of the mutable stack:

```
# let s = empty ();;
val s : '_a t = {top = None}

# push 1 s;;
- : unit = ()

# s;;
- : int t = {top = Some {value = 1; next = None}}

# push 2 s;;
- : unit = ()

# s;;
- : int t = {top = Some {value = 2; next = Some {value = 1; next = None}}}

# pop s;;
- : unit = ()

# s;;
- : int t = {top = Some {value = 1; next = None}}
```

The `'_a` in the first utop response in that transcript is a *weakly polymorphic type variable.* It indicates that the type of elements of `s` is not yet fixed, but that as soon as one element is added, the type (for that particular stack) will forever be fixed. Weak type variables tend to appear once mutability is involved, and they are important for the type system to prevent certain kinds of errors, but we won't discuss them further.
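
The linked-list representation is not the only way to meet the `MutableStack` interface. As a point of comparison, here is a sketch of an alternative that stores a persistent list in a single ref. The module name `ListRefStack` is hypothetical and this is not the implementation used in these notes; `exception Empty` is repeated only so that the block stands alone.

```ocaml
exception Empty

(* A hypothetical alternative implementation, for comparison only. *)
module ListRefStack = struct
  (* AF: the list [!s] holds the elements of the stack, top first.
     The empty stack is represented by [ref []]. *)
  type 'a t = 'a list ref

  let empty () = ref []

  (* To push [x], replace the contents with a list whose head is [x]. *)
  let push x s = s := x :: !s

  let peek s =
    match !s with
    | [] -> raise Empty
    | x :: _ -> x

  (* To pop, replace the contents with the tail of the old list. *)
  let pop s =
    match !s with
    | [] -> raise Empty
    | _ :: rest -> s := rest
end

let s = ListRefStack.empty ()
let () = ListRefStack.push 1 s
let () = ListRefStack.push 2 s
let top_before = ListRefStack.peek s  (* 2 *)
let () = ListRefStack.pop s
let top_after = ListRefStack.peek s   (* 1 *)
```

Here the only mutable piece is the one ref cell; the lists stored in it remain persistent, so `push` and `pop` replace the contents rather than editing nodes in place.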

## Summary

We cover mutable data types in the "Advanced Data Structures" section of this course because they are, in fact, harder to reason about. For example, before refs, we didn't have to worry about aliasing in OCaml. But mutability does have its uses. I/O is fundamentally about mutation. And some data structures (like arrays, which we saw here, and hash tables) cannot be implemented as efficiently without mutability.

Mutability thus offers great power, but with great power comes great responsibility. Try not to abuse your new-found power!

## Terms and concepts

* address
* alias
* array
* assignment
* dereference
* deterministic
* immutable
* index
* loop
* memory safety
* mutable
* mutable field
* nondeterministic
* persistent
* physical equality
* pointer
* pure
* ref
* ref cell
* reference
* sequencing
* structural equality

## Further reading

* *Introduction to Objective Caml*, chapters 7 and 8

* *OCaml from the Very Beginning*, chapter 13

* *Real World OCaml*, chapter 8

* [*Relaxing the value restriction*][relaxing], by Jacques Garrigue, explains more about weak type variables. Section 2 is a succinct explanation of why they are needed.

[relaxing]: https://caml.inria.fr/pub/papers/garrigue-value_restriction-fiwflp04.pdf