Lecture 8
Modules and data abstractions

Modular programming

Modular programming is a valuable technique for building medium-sized and large programs because it allows a program to be broken up into loosely coupled pieces that can be developed largely in isolation. It facilitates local reasoning: the programmer(s) can think about the implementation of a piece of a program without full knowledge of the rest of the program. Rather, the rest of the program only needs to be understood abstractly, at the level of detail presented by the interfaces to the various modules on which the piece of code being worked on depends. This abstraction makes the programmer's job much easier; it is helpful even when there is only one programmer working on a moderately large program, and it is crucial when there is more than one programmer.

Because modules can be used only through their declared interfaces, the job of the module implementer is also made easier. The implementer has the flexibility to change the module as long as the module still satisfies its interface. The interface (signature) ensures that the module is loosely coupled to its clients. Loose coupling gives implementers and clients the freedom to work on their code mostly independently.

Data abstractions

We have already talked about functional abstraction, in which a function hides the details of the computation it performs. Structures and signatures in SML provide a new kind of abstraction: data (or type) abstraction. A signature can declare a type without stating what that type is; such a type is known as an abstract type.

A data abstraction (or abstract data type, ADT) consists of an abstract type along with a set of operations and values. ML of course provides a number of built-in types with built-in operations; a data abstraction is in a sense a clean extension of the language to support a new kind of type. For example, a record type has builtin projection operators, and datatypes have builtin constructors. For a data abstraction, its signature creates an abstraction barrier that prevents users from using its values in a way incompatible with the operations declared in the signature.

Signatures and Structures

To successfully develop large programs, we need more than the ability to group related operations together. We need to be able to use the compiler to enforce the separation between different modules, so that code in one module cannot depend on, or interfere with, the internals of another. Signatures are the mechanism in ML that enforces this separation.

A signature declares a set of types and values that any module implementing it must provide. It consists of type, datatype, exception and val specifications. These specifications differ a bit from the declarations we have seen so far: they give only names and types, not values.

A signature specifies an "interface", what a particular module of code does, as opposed to an "implementation" of how a module operates.  A signature for a queue might look something like the following:

signature QUEUE =
  sig 
    type 'a queue
    exception Empty
    val empty: 'a queue
    val insert : 'a * 'a queue -> 'a queue
    val remove : 'a queue -> 'a * 'a queue
  end 

Note that this signature defines a parameterized queue type, an exception called Empty, a constant called empty, and two functions that operate on queues.  By convention signature names use all capital letters. 

A programmer can use queues based on these definitions without seeing the implementation.  Many possible data representations and corresponding code could implement this queue: singly linked lists, doubly linked lists, vectors, etc.

A structure can be used to implement a signature.  There are several variations on structures in SML; a structure need not refer to a signature at all.  Here we use the opaque form of a structure, where the structure name is separated from the signature name by a :> rather than the : normally used for specifying a type.  In an opaque structure such as this, the concrete implementation is not accessible outside the structure itself.  That is, only the signature and not the particular implementation choice are available to users of the structure. 

A structure must implement all the specifications of its signature.  It may implement more than what is in the signature, but those additional definitions are accessible only inside the structure definition itself, not to users of the structure.

Here is a simple structure implementing the queue signature:

structure Queue :> QUEUE =
  struct
    type 'a queue = 'a list * 'a list
    exception Empty
    val empty = (nil, nil)
    fun insert (x, (b,f)) = (x::b, f)
    fun remove (nil, nil) = raise Empty
      | remove (bs, nil) = remove(nil, rev bs)
      | remove (bs, f::fs) = (f, (bs, fs))
  end

This implementation represents a queue as two lists: elements are inserted into one list and removed from the other. When the list being removed from becomes empty, the list being inserted into is reversed and becomes the new list to remove from, and the list being inserted into becomes empty. Insertion is thus always O(1) time. Removal is O(1) time except when the list being removed from is empty, in which case it takes O(n) time. However, these O(n) removals cannot happen often: in a sequence of operations in which each queue value is used only once, every element is moved by a reversal at most once, so the total reversal work is bounded by the number of insertions. We will discuss this kind of algorithm, where costs can be amortized, a bit later in the semester.
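The cost argument above can be checked with a small experiment. The following is only a sketch: it copies the two-list operations outside of any structure and adds a hypothetical counter, revWork, that records how many list cells are reversed. The names revWork, fill, and drain are introduced here for illustration and are not part of the QUEUE signature.

```sml
exception Empty

(* Hypothetical instrumentation: total number of list cells reversed so far *)
val revWork = ref 0

fun insert (x, (b, f)) = (x :: b, f)

fun remove (nil, nil) = raise Empty
  | remove (bs, nil) =
      (revWork := !revWork + length bs; remove (nil, rev bs))
  | remove (bs, f :: fs) = (f, (bs, fs))

(* fill (n, q) inserts n, n-1, ..., 1 into q *)
fun fill (0, q) = q
  | fill (n, q) = fill (n - 1, insert (n, q))

(* drain removes elements until the queue is empty *)
fun drain q = drain (#2 (remove q)) handle Empty => ()

(* 1000 insertions followed by 1000 removals *)
val () = drain (fill (1000, (nil, nil)))
val totalRevWork = !revWork
```

In this run only the first removal triggers a reversal, so the 2000 operations together reverse 1000 cells: a constant amount of extra work per operation on average.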

With this implementation we can now use queues, but their implementation is completely opaque. The fact that they are represented as lists is not available to programmers using the structure.  There are several ways to refer to the elements of a structure.  One is with fully qualified names: Queue.Empty, Queue.empty, Queue.insert, Queue.remove.  Another is by using the open declaration, open Queue, which makes the names accessible without the need to specify the prefix.

- Queue.remove(Queue.insert(2,Queue.insert(1,Queue.empty)));
val it = (1,-) : int * int Queue.queue
- 

Note that the value of the opaque queue type is printed as -; we cannot see how int Queue.queue is implemented. The use of signatures with opaque implementations ensures that programmers cannot inadvertently depend on the implementation. Thus we are free to change the implementation later. This may sound like something that is only of theoretical interest. But the first time you have to change a data structure that is not opaque in a large program (say 100k lines or more), and have to work out whether each use depends on the concrete implementation, you will realize the value of this. It is incredibly tempting to write code without thinking about the abstract signature or interface versus the concrete implementation. It's a bad way of working because it puts an unnecessary limitation on the ability to change and improve code.
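To make the freedom to change the implementation concrete, here is a sketch of a second structure that matches the same signature. The name ListQueue and the helper snoc are hypothetical, chosen for this illustration. It stores the queue as a single list, which makes insert O(n), so it is a worse implementation; the point is only that client code cannot tell the difference except by running time. The helper snoc also illustrates that a structure may define more than its signature specifies, with the extra definitions invisible to clients.

```sml
signature QUEUE =
  sig
    type 'a queue
    exception Empty
    val empty : 'a queue
    val insert : 'a * 'a queue -> 'a queue
    val remove : 'a queue -> 'a * 'a queue
  end

(* Hypothetical alternative: one list, front of the queue at the head *)
structure ListQueue :> QUEUE =
  struct
    type 'a queue = 'a list
    exception Empty
    val empty = nil
    (* snoc is a private helper: it is not in QUEUE, so clients cannot use it *)
    fun snoc (xs, x) = xs @ [x]
    fun insert (x, q) = snoc (q, x)
    fun remove nil = raise Empty
      | remove (x :: q) = (x, q)
  end

(* Client code written only against the QUEUE operations *)
val (first, rest) =
  ListQueue.remove (ListQueue.insert (2, ListQueue.insert (1, ListQueue.empty)))
val (second, _) = ListQueue.remove rest
```

Elements still come out in insertion order, so first is 1 and second is 2. An attempt to call ListQueue.snoc from outside the structure is rejected by the type checker.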

Another data abstraction example

Queues make a nice simple example, but you could probably more or less guess the implementation from the signature, since lists are about the only appropriate tool currently at our disposal in SML. Suppose we want to develop a data abstraction for univariate polynomials; that is, expressions of the form a + bx + cx^2 + dx^3 + ... + zx^n. We'd like to be able to create polynomials and to add, subtract, and multiply them. The name of the variable is not important, so all we need to keep track of is the coefficient corresponding to each exponent.

For many data abstractions, the capabilities offered by signatures (or other interface specification techniques in other languages) still do not provide enough expressive power. For instance, for univariate polynomials, we would like to ensure not only that the exponent of each term is an integer but also that it is non-negative. We note such additional requirements in comments.

Interface

The following signature POLYNOMIAL is an interface to a data abstraction for polynomials:

signature POLYNOMIAL =
  sig

    (* A poly is a univariate polynomial with integer
     * coefficients. For example, 2 + 3x + x^3. *)
    type poly

    (* zero is the polynomial 0 *)
    val zero: poly

    (* singleton(c,d) is the polynomial c*x^d.
     * Requires: d >= 0 *)
    val singleton: int*int -> poly

    (* degree(p) is the degree of the polynomial:
     * the largest exponent of the polynomial with
     * a nonzero coefficient *)
    val degree: poly -> int

    (* evaluate(p,x) is p evaluated at x *)
    val evaluate: poly*int -> int

    (* coeff(p,n) is the coefficient c of the term
     * of form c*x^n, or zero if there is no such term.
     * Requires: n >= 0 *)
    val coeff: poly*int -> int

    (* plus, minus, times are +, -, * on polynomials,
     * respectively *)
    val plus: poly * poly -> poly
    val minus: poly * poly -> poly
    val times: poly * poly -> poly

    (* toString converts a poly to a nicely readable string *)
    val toString: poly -> string

  end

The type poly is an abstract type that may be implemented in different ways by different structures that implement this signature. Again, by looking at the signature, we can tell what poly does but not what it is. The signature prevents clients from depending on the module in inappropriate ways, by hiding all the things they're not supposed to know about. The signature also acts like a defensive perimeter that prevents clients from constructing values of the declared types except through the operations provided. Thus, the signature is a contract between the implementer of the module and the clients of the module. As long as both sides abide by the contract -- the implementer by providing all of the operations that the signature defines, and the client by only using the module in accordance with the signature -- the two sides can work without stepping on one another's toes. The client doesn't need to see or think about the code that the implementer is writing, and the implementer doesn't have to think about the details of how clients are using the code.

This signature provides not only the types of the operations but also their specifications. As discussed earlier, the signature is the right place to put these specifications. There are two views of a data abstraction: the abstract view, which is the view from the standpoint of the user of the data abstraction, and the concrete view, which is the view of the implementer. The abstract view is presented by the module interface; the concrete view by the module implementation. A well-designed data abstraction can be used entirely from the abstract view, without knowing the concrete type that represents the abstract values, or the actual algorithm being used to implement the operations. Thus, the specifications that appear in the signature should always be written from the abstract view; writing them from the concrete view would violate the abstraction barrier.

The singleton and coeff operations are both partial functions because they are not defined for negative exponents, and hence have "requires" in the comments. In the specifications for plus, minus, times, we rely on the reader's understanding of polynomials to avoid writing tedious specifications of the form, "plus(p,q) is p+q", etc. It is acceptable and even a good idea to rely on the reader's likely knowledge to avoid long specifications. However, as with all writing tasks, this requires a judgment about your likely reader. If that reader is yourself (perhaps at some time in the future), it is relatively easy to assess what will be comprehensible! But when writing code for a larger organization more care must be taken.

The right way to develop modules is to figure out the signature (interface) first, then write the structure (module implementation) to match the interface. This approach has two big advantages. First, a lot of design problems become evident when the signature is being written. It costs much less development time to get the design right before trying to implement the module. Second, code can be written using the interface even before the implementation is complete; the module client and module implementer can work in parallel, speeding up development. And because the interface is known by both parties, it is more likely that when they finish their work, the complete program will work as intended.

Implementation

Choosing the right representation for a data abstraction is the first step in any implementation. The following is a simple representation of polynomials:

type poly = int list

The first item in the list is the constant term a (the coefficient of x^0), the second is the coefficient b of x, the third is the coefficient c of x^2, and so on. The length of the list therefore tells us the degree of the polynomial: a nonempty list of length n represents a polynomial of degree n-1. In addition, we need to make sure that the list never ends in a trailing sequence of zeros, because that would mislead us about the degree of the polynomial. The empty list represents the polynomial 0.

Note that this is just one of many possible ways to represent a polynomial, all of which can meet the signature but which can lead to very different implementations (structures).
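As one illustration of a different choice, a polynomial could instead be stored sparsely, as a list of (coefficient, exponent) pairs. The sketch below is hypothetical and is not the representation used in the rest of these notes; the names sparse, sparseDegree, and sparseCoeff are invented here, and the code assumes the pairs are sorted by decreasing exponent with no zero coefficients.

```sml
(* Hypothetical alternative representation: (coefficient, exponent) pairs,
 * sorted by decreasing exponent, with no zero coefficients *)
type sparse = (int * int) list

(* The leading pair carries the largest exponent *)
fun sparseDegree (p : sparse) : int =
  case p of
    nil => 0
  | (_, d) :: _ => d

(* Search for a term with exponent n; absent terms have coefficient 0 *)
fun sparseCoeff (p : sparse, n : int) : int =
  case p of
    nil => 0
  | (c, d) :: t => if d = n then c else sparseCoeff (t, n)

(* 2 + 3x + x^3 *)
val example : sparse = [(1, 3), (3, 1), (2, 0)]
val exampleDegree = sparseDegree example
val exampleCoeffs = (sparseCoeff (example, 1), sparseCoeff (example, 2))
```

With this representation the degree is read off in constant time and a polynomial such as x^1000 takes one list cell, at the price of a more involved plus.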

Now we can start to implement the operations specified in the signature POLYNOMIAL. For example, the function degree:

fun degree(p: poly):int =
  case p of
      [] => 0
    | _ => length(p) - 1

How about polynomial addition?

fun plus(p: poly, q: poly): poly =
  case (p, q) of
      (nil, q) => q
    | (p, nil) => p
    | (a::p2, b::q2) =>
        (a+b)::plus(p2,q2)

Actually this doesn't quite work. Why? Because the result might have trailing zeros if the two polynomials cancel each other out, causing the degree function to return the wrong result.

- plus([1,2], [1,~2]);
val it = [2,0]: poly
- degree(it);
val it = 1: int

We can avoid this by checking as follows:

fun plus(p: poly, q: poly): poly =
  case (p,q) of
      (nil,q) => q
    | (p, nil) => p
    | (a::p2, b::q2) =>
        case (a+b)::plus(p2,q2) of
            [0] => []
          | r => r

- plus([1,2], [1,~2]);
val it = [2]: poly
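The check for [0] is enough only because the recursive result is already free of trailing zeros, which in turn relies on the arguments being free of them. A more defensive variant, sketched below under the assumption that poly is int list as above, separates the two concerns: add term by term, then strip all trailing zeros. The names trim, plusRaw, and plusTrimmed are hypothetical.

```sml
type poly = int list

(* trim p removes every trailing zero from p *)
fun trim (p : poly) : poly =
  case p of
    nil => nil
  | a :: t =>
      (case (a, trim t) of
         (0, nil) => nil
       | (_, t') => a :: t')

(* Term-by-term addition with no normalization *)
fun plusRaw (p : poly, q : poly) : poly =
  case (p, q) of
    (nil, q) => q
  | (p, nil) => p
  | (a :: p2, b :: q2) => (a + b) :: plusRaw (p2, q2)

(* Addition that restores the no-trailing-zeros condition *)
fun plusTrimmed (p : poly, q : poly) : poly = trim (plusRaw (p, q))

val r1 = plusTrimmed ([1, 2], [1, ~2])
val r2 = plusTrimmed ([1, 2, 3], [~1, ~2, ~3])
val r3 = trim [0, 0, 1, 0, 0]
```

Here r1 is [2], r2 is [] (the zero polynomial), and r3 is [0,0,1]. This version traverses the result a second time, so the single-pass check in the text is the more economical choice when the inputs are known to be well formed.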

Here is more of the implementation:

structure Polynomial :> POLYNOMIAL =
  struct
    (* Univariate polynomials represented using a list of coefficients.
     * Degree of each term is based on its position in the list. *)
    type poly = int list
    val zero: poly = []

    (* A singleton cx^d with c nonzero is a list of length d+1, where
     * the first d elements are 0 and the last element is c *)
    fun singleton(coeff: int, degree: int):poly =
      case (coeff, degree) of
        (0, _) => zero
      | (c, 0) => [c]
      | (c, d) => if (d<0)
                    then raise Fail "negative degree"
                    else 0::singleton(c, d-1)

    fun degree(p:poly):int =       
      case p of
        [] => 0
      | _ => length(p)-1

    fun coeff(p: poly, n: int):int =
        case p of
          nil => 0
        | h::t => if n = 0 then h else coeff(t, n-1)

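    (* plus and minus both operate term by term, so this function
     * abstracts out the common pattern. A missing term counts as a
     * zero coefficient, so f is still applied to the leftover terms
     * (this matters for minus, where 0 - b = ~b). *)
    fun termapply (f:int*int->int,p:poly,q:poly):poly =
      case (p,q) of
        (nil,q) => map (fn b => f(0,b)) q
      | (p, nil) => map (fn a => f(a,0)) p
      | (a::p2, b::q2) =>
        case f(a,b)::termapply(f,p2,q2) of
          [0] => []
        | r => r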

    fun plus(p:poly, q:poly):poly =
      termapply(op+, p, q)

    fun minus(p:poly, q:poly):poly =
      termapply(op-, p, q)

    fun times(p:poly, q:poly):poly =
      raise Fail "Not implemented"

    fun evaluate(p:poly, x:int): int =
      case p of
        nil => 0
      | a::q => a + x*evaluate(q, x)

    fun toString (p: poly): string =
      let fun pp_ndegree(deg: int, p: poly): string =
        case p of
          nil => ""
        | h::t =>
          if h = 0 then 
            pp_ndegree(deg+1,t)
          else 
            Int.toString(h) ^
            ( if deg > 0 then 
                "x" ^ (if deg > 1 then "^"^Int.toString(deg) else "")
              else 
                "" ) ^
                ( case t of
                    nil => ""
                  | _ => " + " ^ pp_ndegree(deg+1, t))
        in 
          case pp_ndegree(0, p) of
            "" => "0"
          | s => s
        end 

  end 
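The times operation is left unimplemented above. One possible way to fill it in, given here only as a sketch and not necessarily the intended solution, uses the identity p*q = a*q + x*(p2*q), where p = a + x*p2. The code is written as stand-alone functions over int list so that it can be tried outside the structure; the helpers scale and shift are hypothetical names.

```sml
type poly = int list

(* plus as in the text: add term by term, dropping a trailing zero *)
fun plus (p : poly, q : poly) : poly =
  case (p, q) of
    (nil, q) => q
  | (p, nil) => p
  | (a :: p2, b :: q2) =>
      (case (a + b) :: plus (p2, q2) of
         [0] => []
       | r => r)

(* scale (c, p) multiplies every coefficient of p by c *)
fun scale (c : int, p : poly) : poly =
  if c = 0 then [] else map (fn a => c * a) p

(* shift p multiplies p by x; the zero polynomial stays [] *)
fun shift (p : poly) : poly =
  case p of
    [] => []
  | _ => 0 :: p

(* Writing p = a + x*p2 gives p*q = a*q + x*(p2*q) *)
fun times (p : poly, q : poly) : poly =
  case p of
    [] => []
  | a :: p2 => plus (scale (a, q), shift (times (p2, q)))

val square = times ([1, 1], [1, 1])      (* (1 + x)(1 + x) *)
val diff = times ([1, 1], [~1, 1])       (* (1 + x)(~1 + x) *)
```

Here square is [1,2,1] and diff is [~1,0,1]. Both scale and shift take care not to create trailing zeros, so the result satisfies the same condition as the inputs.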

We can provide this module to other programmers, and they can then create polynomials using Polynomial.zero and Polynomial.singleton and manipulate them with operations such as Polynomial.degree and Polynomial.plus. They don't have to know that polynomials are really lists of integers (and with only the signature they won't know).

The abstraction barrier

The abstraction barrier prevents the clients of the Polynomial module from using their knowledge of what poly is. In fact, the SML interpreter will not even print out values of a type like poly. Without the signature, we can see what values of type poly really are:

- Polynomial.zero;
val it = []: Polynomial.poly

Once the module is protected by its signature, values of the type poly are printed only as a dash:

- Polynomial.zero;
val it = - : Polynomial.poly

Without the abstraction barrier, users might get into trouble. For example, a client using the Polynomial structure might see that  polynomials are really lists and write code like this:

let val z: Polynomial.poly = [2,3,4] in ... end

It looks convenient; what's wrong with it? Two things. First, this code depends on the actual type used to represent polynomials. An implementer cannot change between int list and another representation of polynomials without breaking this code; therefore we've lost loose coupling. Second, there is nothing that prevents the client from constructing lists that violate our no-trailing-zeros condition. The operations defined on polynomials will not work properly if polynomials are constructed out of such lists. In general, a misbehaving client could cause the program to give wrong answers or even crash with an exception in a module that another programmer wrote! This is bad because it makes it hard to assign blame for bugs.
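The second problem is easy to reproduce. The sketch below assumes the representation is exposed, so poly is just a synonym for int list, and reuses the degree function from the text; the values wellFormed and illFormed are hypothetical client data.

```sml
(* With the representation exposed, poly is just int list *)
type poly = int list

fun degree (p : poly) : int =
  case p of
    [] => 0
  | _ => length p - 1

(* Both lists denote 2 + 3x, but the second has trailing zeros *)
val wellFormed : poly = [2, 3]
val illFormed : poly = [2, 3, 0, 0]

val d1 = degree wellFormed
val d2 = degree illFormed
```

Here d1 is 1 but d2 is 3, even though both lists denote the same polynomial. The wrong answer comes out of degree, yet the mistake was made by the client who built the list by hand.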

The abstraction barrier gives the implementer the freedom to change what the poly type is bound to and correspondingly change the implementation of degree, plus, zero, etc. to match. For example, the implementer might decide to use the SML vector type instead of list, resulting in a more efficient implementation of polynomials.
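As a rough sketch of what such a change might look like, the following shows degree and coeff over the Basis Library's vector type. The names vpoly, vdegree, and vcoeff are hypothetical, only two of the operations are shown, and the code assumes the same conventions as before: index i holds the coefficient of x^i and there are no trailing zeros.

```sml
(* Hypothetical vector-based representation: index i holds the
 * coefficient of x^i, with no trailing zeros *)
type vpoly = int vector

fun vdegree (p : vpoly) : int =
  if Vector.length p = 0 then 0 else Vector.length p - 1

(* Requires: n >= 0. Constant-time lookup instead of a list traversal *)
fun vcoeff (p : vpoly, n : int) : int =
  if n < Vector.length p then Vector.sub (p, n) else 0

(* 2 + 3x + x^3 *)
val example : vpoly = Vector.fromList [2, 3, 0, 1]
val exampleDegree = vdegree example
val exampleCoeffs = (vcoeff (example, 1), vcoeff (example, 7))
```

Here exampleDegree is 3 and exampleCoeffs is (3, 0). Because clients see only the signature, swapping this representation in would not require any change to their code.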

Designing interfaces

We have talked about what makes a specification good; a few comments about what makes an interface good are also in order. Obviously, an interface should contain good specifications of its components. In addition, a well designed interface strikes a balance between simplicity and completeness. Sometimes it is better not to offer every possible operation that the users might want, particularly if those users can efficiently construct the desired computation by using other operations. An interface that specifies many components is said to be wide; a small interface is narrow. Narrow interfaces are good because they provide a simpler, more flexible contract between the user and implementer. The user is less dependent on the details of the implementation, and the implementer has greater flexibility to change how the abstraction is implemented. Interfaces should be made as narrow as possible while still providing users with the operations they need to get the job done.

Modules in other languages

Modules and interfaces are supported in SML by structures and signatures, but they are also found in other modern programming languages in different form. In Java, interfaces, classes, and packages facilitate modular programming. All three of these constructs can be thought to provide interfaces in the more general sense that we are using in this course. The interface to a Java class or package consists of its public components. The Java approach is to use the javadoc tool to extract this interface into a readable form that is separate from the code. Because the interface consists of the public methods and classes, these are the program components that must be carefully specified.

The C language, on the other hand, works more like SML. Programmers write programs by writing source files (".c files") and header files (".h files").  Source files correspond to ML structures and header files correspond to signatures. Header files may declare abstract types and function types, just like in SML. Therefore, the place to write function specifications in C (and in C++) is in header files.

Java-style interface extraction makes life a little easier for the implementer because a separate interface does not have to be written as in SML. However, automatic interface extraction is also dangerous, because changes to any public components of the class will implicitly change the interface and possibly break client code that depends on that interface. The discipline provided by explicit interfaces is useful in preventing these problems for larger programming projects.