# Data Types
* * *
Topics:
* let expressions
* scope
* variants
* records
* tuples
* pattern matching with let and functions
* * *
## Let expressions
In our use of the word `let` thus far, we've been making definitions
in the toplevel and in `.ml` files. For example,
```
# let x = 42;;
val x : int = 42
```
defines `x` to be 42, after which we can use `x` in future definitions
at the toplevel. We'll call this use of `let` a *let definition*.
There's another use of `let` which is as an expression:
```
# let x = 42 in x+1;;
- : int = 43
```
Here we're *binding* a value to the name `x` then using that binding
inside another expression, `x+1`. We'll call this use of `let` a
*let expression*. Since it's an expression it evaluates to a value.
That's different than definitions, which themselves do not evaluate
to any value. You can see that if you try putting a let definition
in place of where an expression is expected:
```
# (let x = 42) + 1;;
Error: Syntax error: operator expected.
```
Syntactically, a let definition is not permitted on the left-hand side
of the `+` operator, because a value is needed there, and definitions
do not evaluate to values. On the other hand, a let expression
would work fine:
```
# (let x = 42 in x) + 1;;
- : int = 43
```
Another way to understand let definitions at the toplevel is that they
are like let expressions where we just haven't provided the body expression
yet. Implicitly, that body expression is whatever else we type
in the future. For example,
```
# let a = "big";;
# let b = "red";;
# let c = a^b;;
# ...
```
is understood by OCaml in the same way as
```
let a = "big" in
let b = "red" in
let c = a^b in
...
```
That latter series of `let` bindings is idiomatically how several variables
can be bound inside a given block of code.
**Syntax.**
```
let x = e1 in e2
```
As usual `x` is an identifier. We call `e1` the *binding expression*, because
it's what's being bound to `x`; and we call `e2` the *body expression*,
because that's the body of code in which the binding will be in scope.
**Dynamic semantics.**
To evaluate `let x = e1 in e2`:
* Evaluate `e1` to a value `v1`.
* Substitute `v1` for `x` in `e2`, yielding a new expression `e2'`.
* Evaluate `e2'` to a value `v2`.
* The result of evaluating the let expression is `v2`.
Here's an example:
```
let x = 1+4 in x*3
--> (evaluate e1 to a value v1)
let x = 5 in x*3
--> (substitute v1 for x in e2, yielding e2')
5*3
--> (evaluate e2' to v2)
15
(result of evaluation is v2)
```
If you compare these evaluation rules to the rules for function application,
you will notice they both involve substitution. This is not an accident.
In fact, anywhere `let x = e1 in e2` appears in a program, we could replace
it with `(fun x -> e2) e1`. They are syntactically different but semantically
equivalent. So let expressions are really syntactic
sugar for anonymous function application.
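As a small sketch of that equivalence, the two definitions below compute the same value, one with a let expression and one with an anonymous function applied to the binding expression. The names `via_let` and `via_fun` are introduced only for this illustration.

```ocaml
(* The let expression from the evaluation example above. *)
let via_let = let x = 1 + 4 in x * 3

(* The same computation written as an anonymous function applied to 1+4. *)
let via_fun = (fun x -> x * 3) (1 + 4)

let () = Printf.printf "%d %d\n" via_let via_fun  (* prints "15 15" *)
```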
**Static semantics.**
* If `e1:t1` and if under the assumption that `x:t1` it holds that `e2:t2`,
then `(let x = e1 in e2) : t2`.
We use the parentheses above just for clarity. As usual, the compiler's
type inferencer determines what the type of the variable is, or the programmer
could explicitly annotate it with this syntax:
```
let x : t = e1 in e2
```
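For instance, the following sketch binds the same value with and without an annotation. The names `inferred` and `annotated` are made up for the example; both definitions should have the same type and value.

```ocaml
(* Inferred: x has type int because 5 has type int. *)
let inferred = let x = 5 in x + 1

(* Annotated: same meaning, but the type of x is written out explicitly.
   Writing (let x : string = 5 in ...) instead would be rejected by the
   type checker, because 5 does not have type string. *)
let annotated = let x : int = 5 in x + 1
```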
## Scope
Let bindings are in effect only in the block of code in which they occur.
This is exactly what you're used to from nearly any modern programming
language. For example:
```
let x=42 in
(* y is not meaningful here *)
x + (let y="3110" in
(* y is meaningful here *)
int_of_string y)
```
The *scope* of a variable is where its name is meaningful. Variable `y`
is in scope only inside of the `let` expression that binds it above.
It's possible to have overlapping bindings of the same name. For example:
```
let x = 5 in
((let x = 6 in x) + x)
```
But this is darn confusing, and for that reason, it is strongly discouraged
style—much like ambiguous pronouns are discouraged in natural language.
Nonetheless, let's consider what that code means.
To what value does that code evaluate? The answer comes down to how `x`
is replaced by a value each time it occurs. Here are a few possibilities
for such *substitution*:
```
(* possibility 1 *)
let x = 5 in
((let x = 6 in 6) + 5)
(* possibility 2 *)
let x = 5 in
((let x = 6 in 5) + 5)
(* possibility 3 *)
let x = 5 in
((let x = 6 in 6) + 6)
```
The first one is what nearly any reasonable language would do. And most likely
it's what you would guess. But **why?**
The answer is something we'll call the *Principle of Name Irrelevance*: the
name of a variable shouldn't intrinsically matter. You're used to this from
math. For example, the following two functions are the same:
\\[
f(x) = x^2 \\\\
f(y) = y^2
\\]
It doesn't intrinsically matter whether we call the argument to the function
\\(x\\) or \\(y\\); either way, it's still the squaring function.
Therefore, in programs, these two functions should be identical:
```
let f x = x*x
let f y = y*y
```
This principle is more commonly known as *alpha equivalence*: the two functions
are equivalent up to renaming of variables, which is also called *alpha conversion*,
for historical reasons that are unimportant here.
According to the Principle of Name Irrelevance, these two expressions should
be identical:
```
let x = 6 in x
let y = 6 in y
```
Therefore the following two expressions, which have the above expressions
embedded in them, should also be identical:
```
let x = 5 in (let x = 6 in x) + x
let x = 5 in (let y = 6 in y) + x
```
But for those to be identical, we **must** choose possibility 1 from the
three possibilities above. It is the only one that makes the name of
the variable be irrelevant.
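One way to check this is to evaluate both versions and compare the results. In the sketch below (the names `with_same_name` and `with_renaming` are only for illustration), both expressions produce 11, which is what possibility 1 predicts; possibility 2 would give 10 and possibility 3 would give 12.

```ocaml
(* Inner binding also named x: it shadows the outer x inside the parentheses. *)
let with_same_name = let x = 5 in (let x = 6 in x) + x

(* Inner binding renamed to y; nothing else changes. *)
let with_renaming = let x = 5 in (let y = 6 in y) + x

let () = Printf.printf "%d %d\n" with_same_name with_renaming  (* prints "11 11" *)
```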
There is a term commonly used for this phenomenon: a new binding of a
variable *shadows* any old binding of the variable name. Metaphorically,
it's as if the new binding temporarily casts a shadow over the old binding.
But eventually the old binding could reappear as the shadow recedes.
Shadowing is not mutable assignment. For example, both the following
expressions evaluate to 11:
```
let x = 5 in ((let x = 6 in x) + x)
let x = 5 in (x + (let x = 6 in x))
```
Likewise, the following utop transcript is not mutable assignment, though
at first it could seem like it is:
```
# let x = 42;;
val x : int = 42
# let x = 22;;
val x : int = 22
```
Recall that every let definition in the toplevel is effectively a nested let
expression. So the above is effectively the following:
```
let x = 42 in
let x = 22 in
... (* whatever else is typed in the toplevel *)
```
The right way to think about this is that the second `let` binds an entirely
new variable that just happens to have the same name as the first `let`.
Here is another utop transcript that is well worth studying:
```
# let x=42;;
val x : int = 42
# let f y = x+y;;
val f : int -> int = <fun>
# f 0;;
- : int = 42
# let x=22;;
val x : int = 22
# f 0;;
- : int = 42 (* x did not mutate! *)
```
To summarize, each let definition binds an entirely new variable.
If that new variable happens to have the same name as an old variable,
the new variable temporarily shadows the old. But the old variable is
still around, and its value is immutable: it never, ever changes.
So even though let expressions might superficially look like assignment
statements from imperative languages, they are actually quite different.
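The same experiment can be written as a source file rather than a toplevel session. This is a minimal sketch; `before` and `after` are names chosen here just to record the two results of calling `f`.

```ocaml
let x = 42

(* f refers to the x bound above, whose value is 42 forever. *)
let f y = x + y

let before = f 0

(* This binds a new variable that happens to also be named x. *)
let x = 22

(* f still sees the original x, not the new one. *)
let after = f 0

let () = Printf.printf "%d %d %d\n" before after x  (* prints "42 42 22" *)
```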
## Variants
A *variant* is a data type representing a value that is one of
several possibilities. At their simplest, variants are like
enums from C or Java:
```
type day = Sun | Mon | Tue | Wed | Thu | Fri | Sat
let d:day = Tue
```
The individual names of the values of a variant are called
*constructors* in OCaml. In the example above, the constructors
are `Sun`, `Mon`, etc. This is a somewhat different use of
the word constructor than in C++ or Java.
For each kind of data type in OCaml, we've been discussing how
to build and access it. For variants, building is
easy: just write the name of the constructor. For accessing,
we use pattern matching. For example:
```
let int_of_day d =
match d with
| Sun -> 1
| Mon -> 2
| Tue -> 3
| Wed -> 4
| Thu -> 5
| Fri -> 6
| Sat -> 7
```
There isn't any kind of automatic way of mapping a constructor name
to an `int`, like you might expect from languages with enums.
**Syntax.**
Defining a variant type:
```
type t = C1 | ... | Cn
```
The constructor names must begin with an uppercase letter. OCaml
uses that to distinguish constructors from variable identifiers.
The syntax for writing a constructor value is simply its name, e.g., `C`.
**Dynamic semantics.**
* A constructor is already a value. There is no computation to perform.
**Static semantics.**
* if `t` is a type defined as `type t = ... | C | ...`,
then `C : t`.
Suppose there are two types defined with overlapping constructor names,
for example,
```
type t1 = C | D
type t2 = D | E
let x = D
```
When `D` appears after these definitions, to which type does it refer?
That is, what is the type of `x` above? The answer is that the type defined
later wins. So `x : t2`. That is potentially surprising to programmers,
so within any given scope (e.g., a file or a module, though we haven't
covered modules yet) it's idiomatic whenever overlapping constructor names
might occur to prefix them with some distinguishing character.
For example, suppose we're defining types to represent Pokémon:
```
type ptype =
TNormal | TFire | TWater
type peff =
ENormal | ENotVery | ESuper
```
Because "Normal" would naturally be a constructor name for both the type
of a Pokémon and the effectiveness of a Pokémon attack,
we add an extra character in front of each constructor name to indicate
whether it's a type or an effectiveness.
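Returning to `t1` and `t2`, here is a minimal sketch of the "later definition wins" rule in a source file. The second binding relies on a type annotation to select the earlier type; that is something OCaml's type checker supports, but it goes slightly beyond what these notes cover, so treat it as an aside rather than a recommended style.

```ocaml
type t1 = C | D
type t2 = D | E

(* With no other information, D refers to the most recently defined type, t2. *)
let x = D

(* An explicit annotation asks the type checker to use t1's D instead. *)
let y : t1 = D
```

Prefixing the constructors, as in the Pokémon example, avoids the need for such annotations in the first place.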
**Pattern matching.**
As we said in the last lecture, each time we introduce a new kind of
data type, we need to introduce the new patterns associated with it.
For variants, this is easy. We add the following new pattern form
to the list of legal patterns:
* a constructor name `C`
And we extend the definition of when a pattern matches a value and produces
a binding as follows:
* The pattern `C` matches the value `C` and produces no bindings.
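As a sketch of constructor patterns in use, here is a hypothetical `is_weekend` function over the `day` type from earlier. Each branch's pattern is just a constructor name, so no variables are bound in any branch.

```ocaml
type day = Sun | Mon | Tue | Wed | Thu | Fri | Sat

(* The value d is compared against each constructor pattern in order. *)
let is_weekend d =
  match d with
  | Sat -> true
  | Sun -> true
  | Mon -> false
  | Tue -> false
  | Wed -> false
  | Thu -> false
  | Fri -> false
```

Listing every constructor is verbose, but it means the compiler can warn about a missing case if a constructor is ever added to `day`.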
**Note.**
Variants are actually considerably more powerful than what
we have seen today. We'll return to them again in a future
lecture.
## Records
A *record* is a kind of type in OCaml that programmers can define.
It is a composite of other types of data, each of which is named.
OCaml records are much like structs in C. Here's an example
of a record type definition for a Poké__mon__:
```
type mon = {name: string; hp : int; ptype: ptype}
```
This type defines a record with three *fields* named `name`,
`hp` (hit points), and `ptype` (defined above). The type
of each of those fields is also given. Note that `ptype`
can be used as both a type name and a field name; the *namespace*
for those is distinct in OCaml.
To build a value of a record type, we write a record expression,
which looks like this:
```
{name = "Charmander"; hp = 39; ptype = TFire}
```
So in a type definition we write a colon between the name and the type
of a field, but in an expression we write an equals sign.
To access a record and get a field from it, we use the dot notation
that you would expect from many other languages. For example:
```
# let c = {name = "Charmander"; hp = 39; ptype = TFire};;
# c.hp;;
- : int = 39
```
It's also possible to use pattern matching to access record fields:
```
match c with
| {name=n; hp=h; ptype=t} -> h
```
The `n`, `h`, and `t` here are pattern variables. There is a syntactic
sugar provided if you want to use the same name for both the field
and a pattern variable:
```
match c with
| {name; hp; ptype} -> hp
```
Here, the pattern `{name; hp; ptype}` is sugar for `{name=name; hp=hp; ptype=ptype}`.
In each of those subexpressions, the identifier appearing on the left-hand side
of the equals is a field name, and the identifier appearing on the right-hand
side is a pattern variable.
**Syntax.**
A record expression is written:
```
{f1 = e1; ...; fn = en}
```
The order of the `fi=ei` inside a record expression is irrelevant.
For example, `{f=e1;g=e2}` is entirely equivalent to `{g=e2;f=e1}`.
A field access is written:
```
e.f
```
where `f` is an identifier of a field name, not an expression.
**Dynamic semantics.**
* If for all `i` in `1..n`, it holds that `ei ==> vi`, then
`{f1=e1; ...; fn=en} ==> {f1=v1; ...; fn=vn}`.
* If `e ==> {...; f=v; ...}` then `e.f ==> v`.
**Static semantics.**
A record type is written:
```
{f1 : t1; ...; fn : tn}
```
The order of the `fi:ti` inside a record type is irrelevant.
For example, `{f:t1;g:t2}` is entirely equivalent to `{g:t2;f:t1}`.
Note that record types must be defined before they can be used. This
enables OCaml to do better type inference than would be possible if
record types could be used without definition.
The type checking rules are:
* If for all `i` in `1..n`, it holds that `ei : ti`, and if
`t` is defined to be `{f1:t1; ...; fn:tn}`, then
`{f1=e1; ...; fn=en} : t`. Note that the set of fields provided in a
record expression must be the full set of fields defined as part of the
record's type.
* If `e : t1` and if `t1` is defined to be `{...; f:t2; ...}`, then
`e.f : t2`.
**Record copy.**
Another syntax is also provided to construct a new record out of an old record:
```
{e with f1 = e1; ...; fn = en}
```
This doesn't mutate the old record; it constructs a new one with new values.
The set of fields provided after the `with` does not have to be the full
set of fields defined as part of the record's type. In the newly copied
record, any field not provided as part of the `with` is copied
from the old record.
The dynamic and static semantics of this are what you might expect, though
they are tedious to write down mathematically.
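A short sketch of record copy, reusing the `mon` type from above. The name `hurt` is made up for the example.

```ocaml
type ptype = TNormal | TFire | TWater
type mon = {name : string; hp : int; ptype : ptype}

let c = {name = "Charmander"; hp = 39; ptype = TFire}

(* A new record: hp is replaced, while name and ptype are copied from c. *)
let hurt = {c with hp = 30}

(* c itself is unchanged. *)
let () = Printf.printf "%d %d\n" c.hp hurt.hp  (* prints "39 30" *)
```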
**Pattern matching.**
We add the following new pattern form to the list of legal patterns:
* `{f1=p1; ...; fn=pn}`
And we extend the definition of when a pattern matches a value and produces
a binding as follows:
* If for all `i` in `1..n`, it holds that `pi` matches `vi` and produces
bindings \\(b_i\\), then the record pattern `{f1=p1; ...; fn=pn}` matches the
record value `{f1=v1; ...; fn=vn; ...}` and produces the set
\\(\bigcup_i b_i\\) of bindings.
Note that the record value may have more fields than the record pattern does.
As a syntactic sugar, another form of record pattern is provided: `{f1; ...; fn}`.
It is desugared to `{f1=f1; ...; fn=fn}`.
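The following sketch shows record patterns that mention only some of the fields. The trailing `; _` is optional; it states explicitly that the remaining fields are being ignored, which avoids a compiler warning about missing fields when that warning is enabled. The function names `get_hp` and `is_fire` are chosen for illustration.

```ocaml
type ptype = TNormal | TFire | TWater
type mon = {name : string; hp : int; ptype : ptype}

(* The pattern names only hp; the _ stands for all the other fields. *)
let get_hp m =
  match m with
  | {hp; _} -> hp

(* A field's subpattern can be a constructor pattern instead of a variable. *)
let is_fire m =
  match m with
  | {ptype = TFire; _} -> true
  | _ -> false
```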
## Tuples
A *tuple* is another kind of type in OCaml that programmers can define.
Like records, it is a composite of other types of data. But instead of
naming the *components*, they are identified by position. Here are some
examples of tuples:
```
(1,2,10)
1,2,10
(true, "Hello")
([1;2;3], (0.5,'X'))
```
A tuple with two components is called a *pair*. A tuple with three
components is called a *triple*. Beyond that, we usually just use
the word *tuple* instead of continuing a naming scheme based on numbers.
Also, beyond that, it's arguably better to use records instead of tuples,
because it becomes hard for a programmer to remember which component
was supposed to represent what information.
Building of tuples is easy: just write the tuple, as above.
Accessing again involves pattern matching, for example:
```
match (1,2,3) with
| (x,y,z) -> x+y+z
```
**Syntax.**
A tuple is written
```
(e1, e2, ..., en)
```
The parentheses are optional but might sometimes be necessary
to ensure the compiler parses your code the way you intended. One place
where it is somewhat idiomatic to omit them is in a match expression
between the `match` and `with` keywords (and also in the patterns
in the following branches).
**Dynamic semantics.**
* if for all `i` in `1..n` it holds that `ei ==> vi`,
then `(e1, ..., en) ==> (v1, ..., vn)`.
**Static semantics.**
Tuple types are written using a new type constructor `*`, which is
different than the multiplication operator. The type `t1 * ... * tn`
is the type of tuples whose first component has type `t1`, ..., and
nth component has type `tn`.
* if for all `i` in `1..n` it holds that `ei : ti`,
then `(e1, ..., en) : t1 * ... * tn`.
**Pattern matching.**
We add the following new pattern form to the list of legal patterns:
* `(p1, ..., pn)`
The parentheses are optional but might sometimes be necessary
to ensure the compiler parses your code the way you intended.
And we extend the definition of when a pattern matches a value and produces
a binding as follows:
* If for all `i` in `1..n`, it holds that `pi` matches `vi` and produces
bindings \\(b_i\\), then the tuple pattern `(p1, ..., pn)` matches the
tuple value `(v1, ..., vn)` and produces the set
\\(\bigcup_i b_i\\) of bindings.
Note that the tuple value must have exactly the same number
of components as the tuple pattern does.
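To tie the typing and matching rules together, here is a small sketch using two of the tuples shown earlier in this section. The annotations are not required (OCaml would infer them); they are written out only to show the `*` type syntax. All names are illustrative.

```ocaml
(* A triple whose type is a product of three component types. *)
let t : int * string * bool = (1, "one", true)

(* A pair whose second component is itself a pair. *)
let nested : int list * (float * char) = ([1; 2; 3], (0.5, 'X'))

(* Tuple patterns extract components by position. *)
let second =
  match t with
  | (_, s, _) -> s

(* Patterns can nest in the same way that tuple values do. *)
let inner_char =
  match nested with
  | (_, (_, c)) -> c
```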
## One-of vs. each-of
The big difference between variants and tuples/records is that a value of
a variant type is *one of* a set of possibilities, whereas a value
of a tuple/record type provides *each of* a set of possibilities.
Going back to our examples, a value of type `day` is **one of**
`Sun` or `Mon` or etc. But a value of type `mon` provides **each of**
a `string` and an `int` and a `ptype`. Note how, in those previous two sentences,
the word "or" is associated with variant types, and the word "and" is associated
with tuple/record types. That's a good clue if you're ever trying to decide
whether you want to use a variant or a tuple/record: if you need one piece
of data *or* another, you want a variant; if you need one piece of data
*and* another, you want a tuple/record.
One-of types are more commonly known as *sum types*, and each-of types
as *product types*. Those names come from set theory. Variants are
like [disjoint union][disjun], because each value of a variant comes
from one of many underlying sets (and thus far each of those sets is
just a single constructor hence has cardinality one). And disjoint
union is sometimes written with a summation operator \\(\Sigma\\).
Tuples/records are like [Cartesian product][cartprod], because each
value of a tuple/record contains a value from each of many underlying
sets. And Cartesian product is usually written with a product operator
\\(\times\\).
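A rough way to see why the name *product type* fits is to count values. The sketch below enumerates every value of the tuple type `day * bool`; the lists `all_days` and `all_bools` are helpers introduced only for this example. The number of pairs comes out as the product of the sizes of the two underlying sets.

```ocaml
type day = Sun | Mon | Tue | Wed | Thu | Fri | Sat

let all_days = [Sun; Mon; Tue; Wed; Thu; Fri; Sat]
let all_bools = [false; true]

(* Every pair (d, b): one component drawn from each underlying set. *)
let all_pairs : (day * bool) list =
  List.concat
    (List.map (fun d -> List.map (fun b -> (d, b)) all_bools) all_days)

let () =
  Printf.printf "%d = %d * %d\n"
    (List.length all_pairs)
    (List.length all_days)
    (List.length all_bools)
(* prints "14 = 7 * 2" *)
```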
[disjun]: https://en.wikipedia.org/wiki/Disjoint_union
[cartprod]: https://en.wikipedia.org/wiki/Cartesian_product
## Let and pattern matching
The syntax we've been using so far for let expressions
is, in fact, a special case of the full syntax that OCaml permits.
That syntax is:
```
let p = e1 in e2
```
That is, the left-hand side of the binding may in fact be a pattern,
not just an identifier. Of course, variable identifiers are on our
list of valid patterns, so that's why the syntax we've studied so
far is just a special case.
Given this syntax, we revisit the semantics of let expressions.
**Dynamic semantics.**
To evaluate `let p = e1 in e2`:
1. Evaluate `e1` to a value `v1`.
2. Match `v1` against pattern `p`. If it doesn't match, raise
the exception `Match_failure`. Otherwise, if it does match,
it produces a set \\(b\\) of bindings.
3. Substitute those bindings \\(b\\) in `e2`, yielding a new expression `e2'`.
4. Evaluate `e2'` to a value `v2`.
5. The result of evaluating the let expression is `v2`.
**Static semantics.**
* If all the following hold:
- `e1:t1`
- the pattern variables in `p` are `x1..xn`
- `e2:t2` under the assumption that for all `i` in `1..n` it holds that
`xi:ti`,
then `(let p = e1 in e2) : t2`.
**Let definitions.**
As before, let definitions can be understood as let expressions whose
body has not yet been given. So their syntax can be generalized
to
```
let p = e
```
and their semantics follow from the semantics of let expressions, as before.
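Here is a brief sketch of both generalized forms. The names are illustrative, and the patterns used are ones that always match, so no `Match_failure` can occur.

```ocaml
(* A let expression with a tuple pattern on the left-hand side. *)
let total =
  let (x, y, z) = (1, 2, 3) in
  x + y + z

(* A let definition with a pattern binds several names at once. *)
let (quotient, remainder) = (17 / 5, 17 mod 5)

let () = Printf.printf "%d %d %d\n" total quotient remainder
(* prints "6 3 2" *)
```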
## Functions and pattern matching
The syntax we've been using so far for functions is also
a special case of the full syntax that OCaml permits.
That syntax is:
```
let f p1 ... pn = e1 in e2 (* function as part of let expression *)
let f p1 ... pn = e (* function definition at toplevel *)
fun p1 ... pn -> e (* anonymous function *)
```
The truly primitive syntactic form we need to care about is
`fun p -> e`. Let's revisit the semantics of anonymous functions
and their application with that form; the changes to the other forms
follow from those below:
**Static semantics.**
* Let `x1..xn` be the pattern variables appearing in `p`. If by assuming that
`x1:t1` and `x2:t2` and ... and `xn:tn`, we can conclude that `p:t` and `e:u`,
then `fun p -> e : t -> u`.
* The type checking rule for application is unchanged.
**Dynamic semantics.**
* The evaluation rule for anonymous functions is unchanged.
* To evaluate `e0 e1`:
1. Evaluate `e0` to an anonymous function `fun p -> e`, and
evaluate `e1` to value `v1`.
2. Match `v1` against pattern `p`. If it doesn't match, raise
the exception `Match_failure`. Otherwise, if it does match,
it produces a set \\(b\\) of bindings.
3. Substitute those bindings \\(b\\) in `e`, yielding a new expression `e'`.
4. Evaluate `e'` to a value `v`, which is the result of evaluating `e0 e1`.
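The application rule above can be seen in a small sketch. The record type `point` and the function names are hypothetical, introduced only to show a pattern in parameter position.

```ocaml
(* An anonymous function whose single parameter is a pair pattern. *)
let add_pair = fun (x, y) -> x + y

(* The let f p = e form, here with a record pattern as the parameter. *)
type point = {px : int; py : int}
let norm1 {px; py} = abs px + abs py

(* Applying add_pair matches the value (3, 4) against (x, y),
   binding x to 3 and y to 4 before the body is evaluated. *)
let seven = add_pair (3, 4)
```

Note that `add_pair` takes one argument, a pair, rather than two separate arguments; its type is `int * int -> int`.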
## Using patterns
Here are many examples of how to use patterns with the various language
features we've seen today:
```
(* Pokemon types *)
type ptype =
TNormal | TFire | TWater
(* A record to represent Pokemon *)
type mon = {name: string; hp : int; ptype: ptype}
(*********************************************
* Several ways to get a Pokemon's hit points:
*********************************************)
(* OK *)
let get_hp m =
match m with
| {name=n; hp=h; ptype=t} -> h
(* better *)
let get_hp m =
match m with
| {name=_; hp=h; ptype=_} -> h
(* better *)
let get_hp m =
match m with
| {name; hp; ptype} -> hp
(* better *)
let get_hp m =
match m with
| {hp} -> hp
(* best *)
let get_hp m = m.hp
(**************************************************
* Several ways to get the 3rd component of a tuple
**************************************************)
(* OK *)
let thrd t =
match t with
| (x,y,z) -> z
(* good *)
let thrd t =
let (x,y,z) = t in z
(* better *)
let thrd t =
let (_,_,z) = t in z
(* best *)
let thrd (_,_,z) = z
(*************************************
* How to get the components of a pair
*************************************)
let fst (x,_) = x
let snd (_,y) = y
(* both fst and snd are functions already provided in the standard library *)
```
## Additional patterns
Here are some additional pattern forms that are useful:
* `p1 | ... | pn`: an "or" pattern; matching against it succeeds if
a match succeeds against any of the individual patterns `pi`, which
are tried in order from left to right. All the patterns must bind
the same variables.
* `(p : t)`: a pattern with an explicit type annotation.
* `c`: here, `c` means any constant, such as integer literals,
string literals, and booleans.
* `'ch1'..'ch2'`: here, `ch` means a character literal. For example,
`'A'..'Z'` matches any uppercase letter.
You can read about [all the pattern forms][patterns] in the manual.
[patterns]: http://caml.inria.fr/pub/docs/manual-ocaml/patterns.html
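The sketch below exercises each of those pattern forms once. The function names `classify`, `describe`, and `double` are invented for the example.

```ocaml
(* Character range patterns, with a wildcard for everything else. *)
let classify c =
  match c with
  | 'A'..'Z' -> "uppercase"
  | 'a'..'z' -> "lowercase"
  | '0'..'9' -> "digit"
  | _ -> "other"

(* Constant patterns and an "or" pattern. *)
let describe n =
  match n with
  | 0 -> "zero"
  | 1 | 2 | 3 -> "small"
  | _ -> "other"

(* A pattern with an explicit type annotation, used as a parameter. *)
let double (x : int) = 2 * x
```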
## Summary
Let expressions can be used to provide local scope for variables: the
binding is in scope only in the body of the let expression. OCaml
provides data types for variants (one-of types) and tuples and records
(each-of types). Pattern matching can be used to access values of each
of those data types. And pattern matching can be used in let expressions
and functions.
## Terms and concepts
* binding
* binding expression
* body expression
* constructor
* each-of type
* field
* let definition
* let expression
* one-of type
* pair
* product type
* record
* substitution
* sum type
* triple
* tuple
* variant
## Further reading
* *Introduction to Objective Caml*, chapters 3, 5.2, 8.1
* *OCaml from the Very Beginning*, chapters 2, 5, 8
* *Real World OCaml*, chapters 2, 5