# Lists
* * *
Topics:
* lists
* recursive functions on lists
* pattern matching
* tail recursion
* * *
## Lists
An OCaml list is a sequence of values all of which have the same type.
OCaml lists are like the classic linked list data
structure that you would find in other languages. But more so than any
other classic data structure, these lists enjoy a first-class status in
the language: there is special support for easily creating and working
with lists. That's a characteristic that OCaml shares with many other
functional languages. Mainstream imperative languages, like Python,
have such support these days too. Maybe that's because programmers find
it so pleasant to work directly with lists as a first-class part of the
language, rather than having to go through a library (as in C and Java).
## Building lists
**Syntax.** There are three syntactic forms for building lists:
```
[]
e1::e2
[e1; e2; ...; en]
```
The empty list is written `[]` and is pronounced "nil", a
name that comes from Lisp. Given a list `lst` and element `elt`, we can
prepend `elt` to `lst` by writing `elt::lst`. The double-colon operator
is pronounced "cons", a name that comes from an operator in Lisp that
__cons__tructs objects in memory. "Cons" can also be used as a verb,
as in "I will cons an element onto the list." The first element of a list
is usually called its *head* and the rest of the elements (if any) are
called its *tail*.
The square bracket syntax is convenient but unnecessary. Any list
`[e1; e2; ...; en]` could instead be written with the more primitive
nil and cons syntax: `e1::e2::...::en::[]`. When a pleasant syntax
can be defined in terms of a more primitive syntax within the language,
we call the pleasant syntax *syntactic sugar*: it makes the language
"sweeter". Transforming the sweet syntax into the more primitive
syntax is called *desugaring*.
Because the elements of the list can be arbitrary expressions, lists
can be nested as deeply as we like, e.g., `[ [[]]; [[1;2;3]]]`.
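As a quick check of the desugaring described above, the following sketch writes one list three ways and confirms that they denote the same value. The names `sugared`, `desugared`, `mixed`, `all_same`, and `nested` are chosen here only for illustration.

```ocaml
(* Three ways of writing the same three-element list. *)
let sugared = [1; 2; 3]
let desugared = 1 :: 2 :: 3 :: []
let mixed = 1 :: [2; 3]

(* Structural equality (=) confirms that all three are the same list. *)
let all_same = (sugared = desugared) && (desugared = mixed)

(* Lists can be nested: this is a list of lists of lists of ints. *)
let nested : int list list list = [ [[]]; [[1; 2; 3]] ]
```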
**Dynamic semantics.**
* `[]` is already a value.
* if `e1` evaluates to `v1`, and if `e2` evaluates to `v2`,
then `e1::e2` evaluates to `v1::v2`.
As a consequence of those rules and how to desugar the square-bracket
notation for lists, we have the following derived rule:
* if `ei` evaluates to `vi` for all `i` in `1..n`,
then `[e1; ...; en]` evaluates to `[v1; ...; vn]`.
It's starting to get tedious to write "evaluates to" in all our
evaluation rules. So let's introduce a shorter notation for it.
We'll write `e ==> v` to mean that `e` evaluates to `v`. Note that
`==>` is not a piece of OCaml syntax. Rather, it's a notation
we use in our description of the language, kind of like metavariables.
Using that notation, we can rewrite the latter two rules above:
* if `e1 ==> v1`, and if `e2 ==> v2`,
then `e1::e2 ==> v1::v2`.
* if `ei ==> vi` for all `i` in `1..n`,
then `[e1; ...; en] ==> [v1; ...; vn]`.
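To make the derived rule concrete, here is a small sketch in which the list elements are expressions rather than values. The names `from_brackets`, `from_cons`, and `agree` are illustrative only.

```ocaml
(* Each element expression is evaluated to a value before the list is built. *)
let from_brackets = [1 + 1; 2 * 3; 10 - 3]

(* The desugared form evaluates to the same value. *)
let from_cons = (1 + 1) :: (2 * 3) :: (10 - 3) :: []

(* Both evaluate to [2; 6; 7]. *)
let agree = (from_brackets = [2; 6; 7]) && (from_brackets = from_cons)
```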
**Static semantics.**
All the elements of a list must have the same type. If that
element type is `t`, then the type of the list is `t list`.
You should read such types from right to left: `t list` is a
list of `t`'s, `t list list` is a list of list of `t`'s, etc.
The word `list` itself here is not a type: there is no way
to build an OCaml value that has type simply `list`.
Rather, `list` is a *type constructor*: given a type, it produces
a new type. For example, given `int`, it produces the type `int list`.
You could think of type constructors as being like functions that
operate on types, instead of functions that operate on values.
The type-checking rules:
* `[] : 'a list`
* if `e1 : t` and `e2 : t list` then `e1::e2 : t list`.
In the rule for `[]`, recall that `'a` is a type variable: it stands
for an unknown type. So the empty list is a list whose elements have an
unknown type. If we cons an `int` onto it, say `2::[]`, then the
compiler infers that for that particular list, `'a` must be `int`. But
if in another place we cons a `bool` onto it, say `true::[]`, then the
compiler infers that for that particular list, `'a` must be `bool`.
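The typing rules can be observed by adding type annotations, as in the sketch below. The annotations are optional, since the compiler would infer the same types, and the names `ints`, `bools`, and `rows` are illustrative only.

```ocaml
(* Consing an int onto [] instantiates the type variable to int. *)
let ints : int list = 2 :: []

(* Consing a bool onto [] instantiates the type variable to bool. *)
let bools : bool list = true :: []

(* Read right to left: a list of lists of ints. *)
let rows : int list list = [[1]; []; [2; 3]]

(* Mixing element types is rejected by the type checker, so this
   definition is left commented out:
   let bad = [1; true]
*)
```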
## Accessing lists
There are really only two ways to build a list, with nil and cons.
So if we want to take apart a list into its component pieces, we
have to say what to do with the list if it's empty, and what to do if
it's non-empty (that is, a cons of one element onto some other list).
We do that with a language feature called *pattern matching*.
Here's an example of using pattern matching to compute the sum of
a list:
```
let rec sum lst =
  match lst with
  | [] -> 0
  | h::t -> h + sum t
```
This function says to take the input `lst` and see whether it has the
same shape as the empty list. If so, return 0. Otherwise, if
it has the same shape as the list `h::t`, then let `h` be the first
element of `lst`, and let `t` be the rest of the elements of `lst`,
and return `h + sum t`. The choice of variable names here is
meant to suggest "head" and "tail" and is a common idiom, but we could
use other names if we wanted. Another common idiom is:
```
let rec sum xs =
  match xs with
  | [] -> 0
  | x::xs' -> x + sum xs'
```
That is, the input list is a list of xs (pronounced EX-uhs),
the head element is an x, and the tail is xs' (pronounced EX-uhs prime).
Here's another example of using pattern matching to compute the
length of a list:
```
let rec length lst =
  match lst with
  | [] -> 0
  | h::t -> 1 + length t
```
Note how we didn't actually need the variable `h` in the pattern match.
When we want to indicate the presence of some value in a pattern without
actually giving it a name, we can write `_` (the underscore character):
```
let rec length lst =
  match lst with
  | [] -> 0
  | _::t -> 1 + length t
```
And here's a third example that appends one list onto the beginning of
another list:
```
let rec append lst1 lst2 =
  match lst1 with
  | [] -> lst2
  | h::t -> h::(append t lst2)
```
For example, `append [1;2] [3;4]` is `[1;2;3;4]`.
That function is actually available as a built-in operator `@`, so
we could instead write `[1;2]@[3;4]`.
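The following sketch checks that a hand-written `append` and the built-in `@` operator agree on the example above. The names `by_function` and `by_operator` are illustrative only.

```ocaml
(* [append lst1 lst2] is the list containing the elements of [lst1]
   followed by the elements of [lst2]. *)
let rec append lst1 lst2 =
  match lst1 with
  | [] -> lst2
  | h :: t -> h :: append t lst2

(* The same concatenation, written with the function and the operator. *)
let by_function = append [1; 2] [3; 4]
let by_operator = [1; 2] @ [3; 4]
```

Note that `append` recurses only on its first argument, so the work it does is proportional to the length of `lst1`.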
As a final example, we could write a function to determine whether
a list is empty:
```
let empty lst =
  match lst with
  | [] -> true
  | _::_ -> false
```
But there's a much easier way to write the same function without pattern
matching:
```
let empty lst =
  lst = []
```
Note how all the recursive functions above are similar to doing proofs
by induction on the natural numbers: every natural number is either 0
or is 1 greater than some other natural number \\(n\\), and so a proof
by induction has a base case for 0 and an inductive case for \\(n+1\\).
Likewise all our functions have a base case for the empty list and a
recursive case for the list that has one more element than another list.
This similarity is no accident. There is a deep relationship between
induction and recursion. If you ever study the proof assistant [Coq]
you might learn more about this.
[coq]: https://coq.inria.fr/
By the way, there are two library functions `List.hd` and `List.tl`
that return the head and tail of a list. It is not good, idiomatic
OCaml to apply these directly to a list. The problem is that they
will raise an exception when applied to the empty list, and you will
have to remember to handle that exception. Instead, you should use
pattern matching: you'll then be forced to match against both
the empty list and the non-empty list (at least), which will prevent
exceptions from being raised, thus making your program more robust.
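As a sketch of the pattern-matching alternative, the hypothetical helper `head_or` below takes a caller-supplied `default` to return when the list is empty. Neither the function nor its parameter name comes from the standard library. The second definition uses `try ... with`, which is covered later, only to show that `List.hd` does raise an exception on the empty list.

```ocaml
(* [head_or default lst] is the first element of [lst], or [default]
   if [lst] is empty. Both cases are handled, so no exception can occur. *)
let head_or default lst =
  match lst with
  | [] -> default
  | h :: _ -> h

(* By contrast, List.hd raises an exception on the empty list. *)
let hd_failed_on_empty =
  try
    ignore (List.hd ([] : int list));
    false
  with Failure _ -> true
```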
## Mutating lists
Lists are immutable. There's no way to change an element of a list from
one value to another. Instead, OCaml programmers create new lists out
of old lists. For example, suppose we wanted to write a function that
returned the same list as its input list, but with the first element (if
there is one) incremented by 1. We could do that:
```
let inc_first lst =
  match lst with
  | [] -> []
  | h::t -> (h+1)::t
```
Now you might be concerned about whether we're being wasteful of space.
After all, there are at least two ways the compiler could implement
the above code:
1. Copy the entire tail list `t` when the new list is created in
the pattern match with cons, such that the amount of memory
in use just increased by an amount proportionate to the length of `t`.
2. Share the tail list `t` between the old list and the new list,
such that the amount of memory in use does not increase (beyond
the one extra piece of memory needed to store `h+1`).
In fact, the compiler does the latter. So there's no need for concern.
The reason that it's quite safe for the compiler to implement sharing
is exactly that list elements are immutable. If they were instead mutable,
then we'd start having to worry about whether the list I have
is shared with the list you have, and whether changes I make will be
visible in your list. So immutability makes it easier to reason about
the code, and makes it safe for the compiler to perform an optimization.
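Sharing can be observed with OCaml's physical equality operator `==`, which tests whether two values are the very same object in memory. The sketch below uses it purely for demonstration; ordinary code should compare lists with `=`. The helper `tail_of` and the other names are illustrative only.

```ocaml
let inc_first lst =
  match lst with
  | [] -> []
  | h :: t -> (h + 1) :: t

(* [tail_of lst] is the tail of [lst], or [] if [lst] is empty. *)
let tail_of lst =
  match lst with
  | [] -> []
  | _ :: t -> t

(* Build the list at run time so that it lives on the heap. *)
let original = List.init 3 (fun i -> i + 1)   (* [1; 2; 3] *)
let updated = inc_first original              (* [2; 2; 3] *)

(* The two lists have different heads but the very same tail in memory. *)
let tails_shared = tail_of original == tail_of updated
```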
## Pattern matching
We saw above how to access lists using pattern matching. Let's
look more carefully at this feature.
**Syntax.**
```
match e with
| p1 -> e1
| p2 -> e2
| ...
| pn -> en
```
Each of the clauses `pi -> ei` is called a *branch* or a *case* of
the pattern match. The first vertical bar in the entire pattern match
is optional.
The `p`'s here are a new syntactic form called a *pattern*. For now,
a pattern may be:
* a variable name, e.g. `x`
* the underscore character `_`, which is called the *wildcard*
* the empty list `[]`
* `p1::p2`
* `[p1; ...; pn]`
No variable name may appear more than once in a pattern. For example,
the pattern `x::x` is illegal. The wildcard may occur any number of times.
As we learn more of the data structures available in OCaml, we'll expand
the possibilities for what a pattern may be.
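Here is a small sketch that uses each of the list pattern forms listed above. The function name `describe` and the strings it returns are illustrative only.

```ocaml
(* [describe lst] is a short description of how many elements [lst] has. *)
let describe lst =
  match lst with
  | [] -> "empty"                            (* the empty-list pattern *)
  | [_] -> "one element"                     (* bracket pattern, length 1 *)
  | [_; _] -> "two elements"                 (* bracket pattern, length 2 *)
  | _ :: _ :: _ -> "three or more elements"  (* cons patterns with wildcards *)
```

The last pattern on its own would match any list with at least two elements, but two-element lists never reach it because the branch before it matches them first.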
**Dynamic semantics.**
In lecture we gave an abbreviated version of the dynamic semantics.
Here we give the full details.
Pattern matching involves two inter-related tasks: determining whether
a pattern matches a value, and determining what parts of the value
should be associated with which variable names in the pattern. The
former task is intuitively about determining whether a pattern and a
value have the same *shape*. The latter task is about determining the
*variable bindings* introduced by the pattern. For example, in
```
match 1::[] with
| [] -> false
| h::t -> (h>=1) && (length t = 0)
```
(which evaluates to `true`)
when evaluating the right-hand side of the second branch, `h=1` and `t=[]`.
Let's write `h->1` to mean the variable binding saying that `h` has value `1`;
this is not a piece of OCaml syntax, but rather a notation we use to
reason about the language. So the variable bindings produced
by the second branch would be `h->1,t->[]`.
More carefully, here is a definition of when a pattern matches a value
and the bindings that match produces:
* The pattern `x` matches any value `v` and produces the variable
binding `x->v`.
* The pattern `_` matches any value and produces no bindings.
* The pattern `[]` matches the value `[]` and produces no bindings.
* If `p1` matches `v1` and produces a set \\(b_1\\) of bindings,
and if `p2` matches `v2` and produces a set \\(b_2\\) of bindings,
then `p1::p2` matches `v1::v2` and produces the set \\(b_1 \cup b_2\\)
of bindings. Note that `v2` must be a list (since it's on the
right-hand side of `::`) and could have any length: 0 elements, 1
element, or many elements. Note that the union \\(b_1 \cup b_2\\) of
bindings will never have a problem where the same variable is bound
separately in both \\(b_1\\) and \\(b_2\\) because of the syntactic
restriction that no variable name may appear more than once in a
pattern.
* If for all `i` in `1..n`, it holds that `pi` matches `vi` and produces
the set \\(b_i\\) of bindings, then `[p1; ...; pn]` matches `[v1; ...;
vn]` and produces the set \\(\bigcup_i b_i\\) of bindings. Note that
this pattern specifies the exact length the list must be.
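The difference between cons patterns and bracket patterns can be seen in a small sketch. The function `sum_first_two` is a hypothetical example, not something defined elsewhere in these notes.

```ocaml
(* [sum_first_two lst] is the sum of the first two elements of [lst],
   treating missing elements as 0. *)
let sum_first_two lst =
  match lst with
  | x :: y :: _ -> x + y  (* binds x and y; the rest may have any length *)
  | [x] -> x              (* matches only lists of exactly one element *)
  | [] -> 0               (* produces no bindings *)
```

For instance, matching `[1; 2; 3]` against `x :: y :: _` produces the bindings `x->1,y->2`, with the wildcard matching `[3]` and binding nothing.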
Now we can say how to evaluate `match e with p1 -> e1 | ... | pn -> en`:
* Evaluate `e` to a value `v`.
* Match `v` against `p1`, then against `p2`, and so on, in the order they
appear in the match expression.
* If `v` does not match against any of the patterns, then evaluation of
the match expression raises a `Match_failure` exception.
We haven't yet discussed exceptions in OCaml, but you're familiar with
them from CS 1110 (Python) and CS 2110 (Java). We'll come back to exceptions
after we've covered some of the other built-in data structures in OCaml.
* Otherwise, stop trying to match at the first time a match succeeds
against a pattern. Let `pi` be that pattern and let \\(b\\) be the
variable bindings produced by matching `v` against `pi`.
* Substitute those bindings inside `ei`, producing a new expression `e'`.
* Evaluate `e'` to a value `v'`.
* The result of the entire match expression is `v'`.
For example, here's how this match expression would be evaluated:
```
match 1::[] with
| [] -> false
| h::t -> (h=1) && (t=[])
```
* `1::[]` is already a value
* `[]` does not match `1::[]`
* `h::t` does match `1::[]` and produces variable bindings
\\(\\{\\)`h->1,t->[]`\\(\\}\\), because:
- `h` matches `1` and produces the variable binding `h->1`
- `t` matches `[]` and produces the variable binding `t->[]`
* substituting \\(\\{\\)`h->1,t->[]`\\(\\}\\) inside `(h=1) && (t=[])`
produces a new expression `(1=1) && ([]=[])`
* evaluating `(1=1) && ([]=[])` yields the value `true`
(we omit the justification for that fact here, but it follows from
other evaluation rules for built-in operators and function application)
* so the result of the entire match expression is `true`.
**Static semantics.**
* If `e:ta` and for all `i`, it holds that `pi:ta` and `ei:tb`,
then `(match e with p1 -> e1 | ... | pn -> en) : tb`.
That rule relies on being able to judge whether a pattern has a
particular type. As usual, type inference comes into play here. The
OCaml compiler infers the types of any pattern variables as well as all
occurrences of the wildcard pattern. As for the list patterns, they
have the same type-checking rules as list expressions.
In addition to that type-checking rule, there are two other checks
the compiler does for each match expression:
* **Exhaustiveness:** the compiler checks to make sure that there are
enough patterns to guarantee that at least one of them matches
the expression `e`, no matter what the value of that expression
is at run time. This ensures that the programmer did not forget
any branches. For example, the function below will cause
the compiler to emit a warning:
```
# let head lst = match lst with h::_ -> h;;
Warning 8: this pattern-matching is not exhaustive.
Here is an example of a value that is not matched:
[]
```
By presenting that warning to the programmer, the compiler is helping
the programmer to defend against the possibility of `Match_failure`
exceptions at runtime.
* **Unused branches:** the compiler checks to see whether any of the branches
could never be matched against because one of the previous branches
is guaranteed to succeed.
For example, the function below will cause the compiler to emit a warning:
```
# let rec sum lst =
    match lst with
    | h::t -> h + sum t
    | [h] -> h
    | [] -> 0;;
Warning 11: this match case is unused.
```
The second branch is unused because the first branch will match anything
the second branch matches.
Unused match cases are usually a sign that the programmer wrote something
other than what they intended. So by presenting that warning, the compiler
is helping the programmer to detect latent bugs in their code.
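One way to keep every branch reachable is to put more specific patterns before more general ones. The sketch below reorders the branches of the `sum` example so that `[h]` is tried before `h::t`. The `[h]` branch is not actually needed for correctness; it is kept only to show the effect of ordering.

```ocaml
(* More specific patterns come first, so every branch can be reached:
   the empty list, then exactly one element, then any non-empty list. *)
let rec sum lst =
  match lst with
  | [] -> 0
  | [h] -> h
  | h :: t -> h + sum t
```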
Here's an example of one of the most common bugs that causes an unused match
case warning. Understanding it is also a good way to check your understanding
of the dynamic semantics of match expressions:
```
let length_is lst n =
  match length lst with
  | n -> true
  | _ -> false
```
The programmer was thinking that if the length of `lst` is equal to `n`,
then this function will return `true`, and otherwise will return `false`.
But in fact this function *always* returns `true`. Why? Because the
pattern variable `n` is distinct from the function argument `n`.
Suppose that the length of `lst` is 5. Then the pattern match becomes:
`match 5 with n -> true | _ -> false`. Does `n` match 5? Yes, according
to the rules above: a variable pattern matches any value and here produces
the binding `n->5`. Then evaluation applies that binding to `true`,
substituting all occurrences of `n` inside of `true` with 5. Well,
there are no such occurrences. So we're done, and the result of
evaluation is just `true`.
What the programmer really meant to write was:
```
let length_is lst n =
  match length lst with
  | m -> if m=n then true else false
```
or better yet:
```
let length_is lst n =
  match length lst with
  | m -> m=n
```
or even better yet:
```
let length_is lst n =
  length lst = n
```
## Tail recursion
A function is *tail recursive* if it calls itself recursively but does not
perform any computation after the recursive call returns, and
immediately returns to its caller the value of its recursive call.
Consider these two implementations, `sum` and `sum_tr`, of summing a list,
where we've provided some type annotations to help you understand the code:
```
let rec sum (l : int list) : int =
  match l with
  | [] -> 0
  | x :: xs -> x + (sum xs)

let rec sum_plus_acc (acc : int) (l : int list) : int =
  match l with
  | [] -> acc
  | x :: xs -> sum_plus_acc (acc + x) xs

let sum_tr : int list -> int =
  sum_plus_acc 0
```
Observe the following difference between the `sum` and `sum_tr` functions
above: In the `sum` function, which is not tail recursive, after the
recursive call returned its value, we add `x` to it. In the tail
recursive `sum_tr`, or rather in `sum_plus_acc`, after the recursive call
returns, we immediately return the value without further computation.
Why do we care about tail recursion? Actually, sometimes functional
programmers fixate a bit too much upon it. If all you care about is
writing the first draft of a function, you probably don't need to worry
about it.
But if you're going to write functions on really long lists, tail
recursion becomes important for performance. Recall (from CS 1110) that
there is a call stack, which is a stack (the data structure with push
and pop operations) with one element for each function call that has
been started but has not yet completed. Each element stores things like
the value of local variables and what part of the function has not been
evaluated yet. When the evaluation of one function body calls another
function, a new element is pushed on the call stack and it is popped off
when the called function completes.
When a function makes a recursive call to itself and there is nothing
more for the caller to do after the callee returns (except return the
callee's result), this situation is called a tail call. Functional
languages like OCaml (and even imperative languages like C++) typically
include a hugely useful optimization: when a call is a tail call, the
caller's stack-frame is popped before the call—the callee's stack-frame
just replaces the caller's. This makes sense: the caller was just going
to return the callee's result anyway. With this optimization, recursion
can sometimes be as efficient as a while loop in imperative languages
(such loops don't make the call stack bigger). The "sometimes" is
exactly when calls are tail calls—something both you and the compiler
can (often) figure out. With tail-call optimization, the space
performance of a recursive algorithm can be reduced from \\(O(n)\\) to \\(O(1)\\),
that is, from one stack frame per call to a single stack frame for all
calls.
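As a sketch of what constant stack space buys, the tail-recursive sum below is applied to a list of one million elements. The list size and the names `big` and `total` are arbitrary choices for illustration, and the expected total assumes a 64-bit system, where OCaml's `int` is wide enough to hold it.

```ocaml
(* [sum_plus_acc acc l] is [acc] plus the sum of the elements of [l].
   The recursive call is a tail call: nothing remains to be done after it. *)
let rec sum_plus_acc acc l =
  match l with
  | [] -> acc
  | x :: xs -> sum_plus_acc (acc + x) xs

let sum_tr l = sum_plus_acc 0 l

(* A list of the integers 1 through 1_000_000. *)
let big = List.init 1_000_000 (fun i -> i + 1)

(* Summing it needs only a constant amount of stack space. *)
let total = sum_tr big
```

By the formula for the sum of the first \\(n\\) positive integers, `total` should equal `1_000_000 * 1_000_001 / 2`.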
So when you have a choice between using a tail-recursive vs.
non-tail-recursive function, you are likely better off using the
tail-recursive function on really long lists to achieve space
efficiency. For that reason, the List module documents which functions
are tail recursive and which are not.
But that doesn't mean that a tail-recursive implementation is strictly
better. For example, the tail-recursive function might be harder to
read. (Consider `sum_plus_acc`.) Also, there are cases where
implementing a tail-recursive function entails having to do a pre- or
post-processing pass to reverse the list. On small to medium sized
lists, the overhead of reversing the list (both in time and in
allocating memory for the reversed list) can make the tail-recursive
version less time efficient. What constitutes "small" vs. "big" here?
That's hard to say, but maybe 10,000 is a good estimate, according
to the [standard library documentation of the `List` module][list].
[list]: http://caml.inria.fr/pub/docs/manual-ocaml/libref/List.html
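The reversal mentioned above arises because an accumulator collects elements in the opposite order from the input. As a sketch, the hypothetical function `inc_all_tr` below increments every element of a list tail recursively, and must reverse its accumulator at the end to restore the original order.

```ocaml
(* [inc_all_tr lst] is [lst] with every element incremented by 1. *)
let inc_all_tr lst =
  (* [go acc l] conses the incremented elements of [l] onto [acc], so
     [acc] holds the results in reverse order until the final List.rev. *)
  let rec go acc l =
    match l with
    | [] -> List.rev acc
    | h :: t -> go ((h + 1) :: acc) t
  in
  go [] lst
```

The call to `List.rev` is the post-processing pass: it walks the accumulated list once more and allocates a second list of the same length.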
## Bonus syntax
Here are a couple of additional pieces of syntax related to lists and
pattern matching.
**Immediate matches:**
When you have a function that immediately pattern-matches against
its final argument, there's a nice piece of syntactic sugar
you can use to avoid writing extra code. Here's an example:
instead of
```
let rec sum lst =
  match lst with
  | [] -> 0
  | h::t -> h + sum t
```
you can write
```
let rec sum = function
  | [] -> 0
  | h::t -> h + sum t
```
The word `function` is a keyword. Notice that we're able to leave
out the line containing `match` as well as the name of the argument,
which was never used anywhere else but that line. In such cases, though,
it's especially important in the specification comment for the function
to document what that argument is supposed to be, since the code
no longer gives it a descriptive name.
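The same sugar works when the function has other arguments before the one being matched. In the sketch below, a hand-written membership test names its first argument `x` and matches on its unnamed final argument. The standard library already provides this operation as `List.mem`; the definition here is only for illustration.

```ocaml
(* [mem x lst] is [true] if [x] is an element of [lst], and [false]
   otherwise. The list argument is unnamed: [function] matches it directly. *)
let rec mem x = function
  | [] -> false
  | h :: t -> h = x || mem x t
```

The specification comment still refers to the list as `lst`, which is how a reader learns what the unnamed argument is supposed to be.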
**List comprehensions:**
Some languages, including Python and Haskell,
have a syntax called *comprehension* that allows lists to be written
somewhat like set comprehensions from mathematics. The earliest example
of comprehensions seems to be the functional language NPL, which was
designed in 1977. OCaml doesn't have built-in support for
comprehensions. And we won't be using comprehensions in this course, so
it's safe for you to ignore the rest of this subsection.
It is possible to get a limited form of support for them through
Camlp4, the Caml Preprocessor and Pretty-Printer, a tool that was once
part of the official OCaml distribution but is no longer. If you would
like to install it, run:
```
opam install -y camlp4
```
Then in utop you can write comprehensions in a Haskell-like notation:
```
# #require "camlp4.listcomprehension";;
# [ x+y | x <- [1;2;3]; y <- [1;2;3]; x < y ];;
- : int list = [3; 4; 5]
```
## Summary
Lists are a highly useful built-in data structure in OCaml. The language
provides a lightweight syntax for building them, rather than requiring
you to use a library. Accessing parts of a list makes use of pattern matching,
a very powerful feature (as you might expect from its rather lengthy semantics).
We'll see more uses for pattern matching as the course proceeds.
These built-in lists are implemented as linked lists. That's important
to keep in mind when your needs go beyond small to medium sized lists.
Recursive functions on long lists will take up a lot of stack space,
so tail recursion becomes important. And if you're attempting to
process really huge lists, you probably don't want linked lists at all,
but instead a data structure that will do a better job of exploiting
memory locality.
## Terms and concepts
* append
* binding
* branch
* cons
* copying
* desugaring
* exhaustiveness
* head
* induction
* list
* nil
* pattern matching
* prepend
* recursion
* sharing
* stack frame
* syntactic sugar
* tail
* tail call
* tail recursion
* type constructor
* wildcard
## Further reading
* *Introduction to Objective Caml*, chapters 4, 5.3, 5.4
* *OCaml from the Very Beginning*, chapters 3, 4, 5
* *Real World OCaml*, chapter 3