Interpreters
In lecture we saw how to write an interpreter, assuming that the AST was provided to us. Producing an AST from a text file is the job of the lexer and parser. In this lab, we'll see how to build a lexer and parser in OCaml. And we'll extend our interpreter to handle some new language features.
Parsing in OCaml
You could build your own lexer and parser from scratch. But many languages include tools for automatically generating lexers and parsers from formal descriptions of the syntax of a language. The ancestors of many of those tools are lex and yacc, which generate lexers and parsers, respectively; lex and yacc were developed in the 1970s for C. As part of the standard distribution, OCaml provides lexer and parser generators named ocamllex and ocamlyacc. There is a more modern parser generator named menhir available through opam; menhir is "90% compatible" with ocamlyacc and provides significantly improved support for debugging generated parsers.
Part 1: Explore the base code
We provide some base code for this lab. Download it. The base code completes the interpreter we wrote in lecture by adding parsing and lexing.
Base language. The language the base interpreter implements is a simple arithmetic expression language with integers, addition, and let expressions. Language syntax is usually described using notation like the following:
e ::= x (* variables *)
| i (* integers *)
| e1 + e2 (* addition *)
| let x = e1 in e2 (* let expressions *)
x ::= identifiers
i ::= integers
This notation is called Backus-Naur form (BNF). The above description
means that e
can be one of four things:
something of the form
x
, which is an (undefined) set of identifierssomething of the form
i
, which is an (undefined) set of integerse1 + e2
, wheree1
ande2
themselves are of the forme
let x = e1 in e2
, wheree1
ande2
themselves are again of the forme
, andx
is an identifier.
The e
, x
, and i
above are called syntactic variables or meta-variables:
they stand for an unknown piece of syntax.
Explore the base code.
You'll find six files in the base code archive:
ast.ml
: The variant type for the abstract syntax tree (AST) of the expression language.main.ml
: The interpreter for the expression language. The first three functions in this file,subst
,step
, andeval
should be familiar.parser.mly
: The input file to menhir. It describes the syntax of the expression language; we'll discuss it below.lexer.mll
: The input file to ocamllex. It describes the tokens of the expression language; we'll discuss it below.test.ml
: An OUnit test suite..ocamlinit
: An initialization file to cause utop to make the functions defined inmain.ml
available for use.
Exercise: base ast [✭]
Open and read ast.ml
to remind yourself how the AST is defined.
□
Exercise: base step [✭]
Read Main.eval
and Main.step
to remind yourself how the interpreter
works: step
causes an expression to take a single step of execution,
and eval
takes as many steps as possible.
□
Compiling and running
To compile the interpreter and run the unit test suite, run make
.
If you open the tags file you'll notice we're using a new directive use_menhir
,
which directs the build system to use menhir instead of ocamlyacc.
To experiment with the interpreter interactively, first run make
, then
launch the toplevel. Because of the .ocamlinit
we provided, there are
already two functions available for your use:
(* [parse s] is the AST corresponding to the concrete syntax of expression [s]. *)
val parse : string -> expr
(* [interp s] parses the string [s] into an AST, interprets the AST, and yields
the resulting integer value. *)
val interp : string -> int
Exercise: warmup [✭✭]
Evaluate the following expressions in the toplevel. Note what each returns.
parse "22"
interp "22"
parse "1+2+3"
interp "1+2+3"
parse "let x = 2 in 20+x"
interp "let x = 2 in 20+x"
Also evaluate these expressions, which will raise exceptions. Explain why each one is an error, and where during the interpretation process the error occurs.
interp "3.14"
interp "3+"
interp "(let x = 2 in 20)+x"
□
The parser and lexer
Read the parse
function in main.ml
:
(* Parse a string into an ast *)
let parse s =
let lexbuf = Lexing.from_string s in
let ast = Parser.prog Lexer.read lexbuf in
ast
This function takes a string s
and uses the standard
library's Lexing
module to create a lexer buffer from
it. Think of that buffer as the token stream that we
discussed in lecture. The function then lexes and parses the
string into an AST, using Lexer.read
and Parser.prog
.
The Lexer
and Parser
modules are code that is generated
automatically during the compilation process by ocamllex and
menhir:
- ocamllex produces
lexer.ml
from input filelexer.mll
. - menhir produces
parser.ml
from input fileparser.mly
.
Exercise: parser.mly [✭✭]
Read the
parser.mly
file. That file gives a grammar definition for the expression language. Read the comments in the file.Compare the
expr
rule inparser.mly
to the BNF given above for the expression language. How are they the same? How are they different?Open
_build/parser.ml
, which is the module generated automatically by menhir fromparser.mly
. Skim through the file to appreciate not having to write the parser yourself.
□
Exercise: lexer.mll [✭✭]
Read the
lexer.mll
file. That file gives a lexer definition for the expression language. Read the comments in the file carefully.Examine the definition of the
id
regular expression. Identify at least one way in which it differs from the definition of OCaml identifiers.Open
_build/lexer.ml
, which is the module generated automatically by ocamllex fromlexer.mll
. Skim through the file to appreciate not having to write the lexer yourself.
□
Part 2: Multiplication
In this part of the lab, we'll add multiplication to the expression language. The new BNF is as follows:
e ::= x (* variables *)
| i (* integers *)
| e1 + e2 (* addition *)
| e1 * e2 (* multiplication *)
| let x = e1 in e2 (* let expressions *)
Exercise: multiplication [✭✭]
Follow the next five steps to add multiplication to your interpreter.
Step 1 (AST): Add the following line to the definition of the expr
type in ast.ml
:
| Mult of expr*expr
Recompile the code. You should get two compiler warnings about inexhaustive
pattern matching. The compiler is telling you that the implementations
of subst
and step
are now incomplete, because they don't handle the
new AST node.
Step 2 (Evaluation):
Add the following pattern to the false branch of the is_value
function
in main.ml
:
... | Mult _ | ...
Add the following line to the definition of subst
in main.ml
:
| Mult(el,er) -> Mult(subst el v x, subst er v x)
Add the following lines to the definition of step
in main.ml
:
| Mult(Int n1, Int n2) -> Int (n1*n2)
| Mult(Int n1, e2) -> Mult(Int n1, step e2)
| Mult(e1,e2) -> Mult(step e1, e2)
Recompile the code. You should no longer get any compiler warnings. The evaluation part of your interpreter is finished, but you stil need to extend the parser to handle multiplication.
Step 3 (Parsing):
Add the following line to the declarations section of parser.mly
:
%token TIMES
Add the following line to the precedence and associativity section of parser.mly
:
%left TIMES
It must be the next line after PLUS
in that section. That's because multiplication
has higher precedence than addition—i.e., 1+2*3
should be parsed as
1+(2*3)
, not as (1+2)*3
.
Add the following production to the expr
rule:
| e1 = expr; TIMES; e2 = expr { Mult(e1,e2) }
Recompile the code. You should not receive any errors or warnings.
Step 4 (Lexing):
Add the following line to the read
rule in lexer.mll
:
| "*" { TIMES }
Recompile the code. You should not receive any errors or warnings.
Step 5 (Testing):
Add unit tests for the following expressions to test.ml
:
2 * 11
2+2*10
2*2+10
2*2*10
Run make test
.
You've successfully added multiplication to the interpreter!
□
Exercise: operator parsing [✭✭, optional]
You declared the TIMES
token as having higher precedence than PLUS
,
and as being left associative. Let's experiment with other choices.
Evaluate
parse "1*2*3"
with your current interpreter. Note the AST. Now change the declaration of the associativity ofTIMES
inparser.mly
to be%right
instead of%left
. Recompile and reevaluateparse "1*2*3"
. How did the AST change? Before moving on, restore the declaration to be%left
.Evaluate
parse "1+2*3"
with your current interpreter. Note the AST. Now swap the declaration%left TIMES
inparser.mly
with the declaration%left PLUS
. Recompile and reevaluateparse "1+2*3"
. How did the AST change? Before moving on, restore the original declaration order.
□
Part 3: If expressions
Finally, we'll add if
expressions, Booleans, and a comparison
operator to the expression language.
The new BNF is as follows:
e ::= x (* variables *)
| i (* integers *)
| b (* Booleans *)
| e1 + e2 (* addition *)
| e1 * e2 (* multiplication *)
| e1 <= e2 (* less than or equal *)
| let x = e1 in e2 (* let expressions *)
| if e1 then e2 else e3 (* if expressions *)
Exercise: if expressions [✭✭✭✭]
Follow the next five steps to extend your interpreter. The instructions are deliberately less precise than in the previous part of the lab.
Step 1 (AST): Extend your AST type to include three new
kinds of nodes: Boolean values, the <=
operator, and if
expressions. Also add the following definition to ast.ml
:
type value =
| VInt of int
| VBool of bool
We will need this definition in the next step, because there is now more than one type of value in the language.
Step 2 (Evaluation): Add code to the definitions of is_value
, subst
,
step
, and extract_value
to handle the three new kinds of AST nodes. To handle
Boolean values, change extract_value
as follows:
let extract_value = function
| Int i -> VInt i
| Bool b -> VBool b (* NEW *)
| _ -> failwith "Not a value"
Step 3 (Parsing):
Declare six new tokens in parser.mly
: TRUE
, FALSE
, LEQ
, IF
, THEN
, and ELSE
.
Change the precedence and associativity section to declare ELSE
as nonassociative.
The precedence from least to greatest should be IN
, ELSE
, LEQ
, PLUS
, TIMES
.
You can see that OCaml uses a similar precedence and associativity by looking at the
table immediately above section 6.7.1 of the OCaml manual.
Step 4 (Lexing):
Add six new lines to lexer.mll
for the six new tokens. Make sure they all appear
before the line that lexes ID
, otherwise the five new keywords (true
, false
,
if
, then
, and else
) would be considered identifiers rather than keywords.
Step 5 (Testing):
Modify the existing test cases in main.ml
to expect VInt
. Add new test cases
for the extended language. Here are some suggestions:
if true then 22 else 0
true
1<=1
if 1+2 <= 3+4 then 22 else 0
if 1+2 <= 3*4 then let x = 22 in x else 0
let x = 1+2 <= 3*4 in if x then 22 else 0
□
Part 4: Binary operators
Exercise: binop [✭✭✭✭]
You now have three binary operators in the language. The code
implementing them in subst
and step
is highly repetitive. Fix that
inelegance by having a single AST node, Binop
, that unifies all three.
See how much repetetive code you can eliminate. It might help to
introduce some additional types or functions.
□