**No partners:**You must work on this assignment individually.**Compile errors:**All programs that you submit must compile.**Programs that do not compile will receive an automatic zero.**If you are having trouble getting your assignment to compile, please visit consulting hours. If you run out of time, it is better to comment out the parts that do not compile and hand in a file that compiles than hand in a more complete file that does not compile.**Missing functions:**We will be using an automatic grading script, so it is**crucial**that you name your functions and order their arguments according to the problem set instructions, and that you place the functions in the correct files. Otherwise you may not receive credit for a function properly written.**Code style:**Please pay attention to style. Refer to the CS 3110 style guide and lecture notes. Ugly code that is functionally correct may still lose points. Take the extra time to think through the problems and find the most elegant solutions before coding them up.**Wrong versions:**Please double-check that you are submitting the correct versions of your files to CMS. The excuse, "I had it all done, but I accidentally submitted the wrong files" will not be accepted.

Find an expression that, when substituted for the ??, will cause the following expression to evaluate to 42. Show the evaluation using the substitution model. You may omit minor substitutions, but do show all of the main steps.

let initial x = 0 in let rec zardoz (f1, lst) = match lst with [] -> 3110 | h :: t -> let f x = h * List.length lst in if List.length t = 0 then f1 0 else zardoz (f, t) + f 0 in zardoz (initial, ??)

In this problem you will implement a data compression mechanism called
*Huffman coding*. This is a useful mechanism for compressing data based
on the frequencies (number of occurrences) of bytes in a file. The idea is that
bytes that occur very frequently are encoded as short bitstrings, while those
that occur infrequently are encoded as long bitstrings.

A *bitstring* is just a sequence of bits, e.g. 10010110. Normally,
printable characters in a file are encoded as *bytes*, or bitstrings of
length 8. This is called ASCII code. For example, the ASCII code for the character
`'a'`

is the bitstring 01100001, which is binary for 97 decimal.
However, we are not forced to use a fixed-length encoding. If we choose a
different encoding in which the codes can have different lengths, then we can
assign shorter strings to the more frequent characters, thereby saving space.

To illustrate, let's do this for bitstrings of length 2. Suppose we have a file that is exactly 2000 bits long. We can view this file as a sequence of 1000 2-bit bitstrings. Suppose we scan the file and find that the four 2-bit bitstrings 00, 01, 10, 11 occur in the file with the frequencies as shown in Fig. 1.

A good encoding for this set of frequencies would be the variable-length code shown in Fig. 2.

This particular encoding has the nice property that it is
*self-delimiting*, which means that no codeword is a prefix of any other
codeword. This property allows us to place the codewords end-to-end in a file
without having to mark their endpoints, and still be able to recover the
original text. For example, if the file begins with

`00 00 00 10 10 11 01 00 00 00 00 01...`

then it would be encoded as

`0 0 0 110 110 111 10 0 0 0 0 10`

(We have inserted spaces just to show the original 2-bit bitstrings and the codewords, but the spaces are not really there in the file.) Given the correspondence of Fig. 2, the two sequences determine each other uniquely.

How much space have we saved? The original file was 2000 bits long. The length of the encoded file is

`900 · 1 + 90 · 2 + 9 · 3 + 3 = 1110`

bits. We have compressed the file by almost half.

A convenient representation of a self-delimiting code is a complete binary
tree with the original uncompressed bitstrings at the leaves. The path from the
root down to the leaf containing `x`

, represented as a bitstring
(left=0, right=1) is the codeword for `x`

. For example, the code of
Fig. 2 would be represented by the tree in Fig. 3. The tree in Fig. 4 shows the
*identity code*, in which each word is represented as itself. This code
doesn't achieve any compression.

A *Huffman tree* is a tree built from known frequencies for bitstrings
of some fixed constant length `k`

. It codes frequently
occurring bitstrings of length `k`

as short bitstrings and
infrequently occurring bitstrings of length `k`

as long bitstrings.
The tree of Fig. 3 is a Huffman tree for `k=2`

constructed from the
frequencies of Fig. 1. In this assignment, you will implement Huffman trees for
`k=8`

bits (that is, each bitstring is a byte).

To build a Huffman tree, suppose we know the frequency of each
`k`

-bit bitstring in the file. We make a separate tree node for each
`k`

-bit bitstring containing that bitstring as the *key* along with its
*frequency*; these will be the leaves of our Huffman tree when we are done. So we
start with a list of `2`

single-node trees.^{k}

Now we build the Huffman tree bottom up. We repeatedly do the following, while there are at least two nodes in the list:

- Find the two nodes x, y with the lowest frequency. These can be internal nodes or leaf nodes. Remove them from the list.
- Make a new internal node z with children x and y and frequency the sum of the frequencies of x and y. Put it on the list.

We keep doing this until there is exactly one element remaining in the list, which is the root of the completed Huffman tree. This has to happen eventually, since the size of the list decreases by one in each stage.

For example, suppose we wish to form a Huffman tree from the frequencies as
shown in Fig. 1. We start with the four tree nodes `(00,900)`

,
`(01,90)`

, `(10,9)`

, `(11,1)`

. In the first
stage, the two lowest-frequency elements are `x = (10,9)`

and
`y = (11,1)`

. We remove them from the list and
form a new tree node `(u,10)`

with children
`(10,9)`

and `(11,1)`

.
Here `u`

is a new key that we assign (see below).
After this step, the list will contain `(00,900)`

,
`(01,90)`

, and `(u,10)`

.

Doing this again, after the next stage the list will contain
`(00,900)`

and `(v,100)`

, where `v`

is a new
node with children `(u,10)`

and `(01,90)`

. Finally, the
last two nodes are combined, yielding a node `(w,1000)`

which is the
root of the Huffman tree. The resulting Huffman tree looks like Fig. 3.

**Lemma**: A complete binary tree with `n`

leaves contains
exactly `2n - 1`

nodes.

When constructing the tree, you should give each newly constructed internal node a key consisting of a number assigned sequentially, starting at 256 (doing this will help you reconstruct the tree in the decode phase—see below). All the leaves corresponding to ASCII characters will already have keys 0 to 255. By the above lemma, since the root of the tree is the last one constructed, it should always have key value 510. You might want to include this test as a sanity check.

**Representation**: Choose an appropriate representation for Huffman trees and define the type`huffmantree`

.**Compression****Huffman codes**: Write the function`get_codes : huffmantree -> encoding list`

.This function takes a Huffman tree and returns a list of (character, codeword) pairs. A codeword is represented as an

`int list`

, where each`int`

in the list is either 0 or 1. You can implement this function recursively (see Tree Representation of Self-Delimiting Codes for a reminder of how this works). When it encounters an internal node, it recursively finds the list of codes for the left child and prepends 0 to each, then finds the list of codes for the right child and prepends a 1 to each, and then combines both lists. This can be done recursively with only a few lines of code.**Encoding**: Write the function`lookup_code : char -> encoding list -> int list`

.This function returns the codeword in the encoding list that corresponds to the character.

Now write the function

`encode : char list -> huffmantree * int list`

.This function takes in text represented as a list of characters and performs Huffman coding on it. It returns the Huffman tree created for the text, along with the encoded text as a list of bits represented as integers.

**Saving the Huffman tree**: The final compressed file will contain the Huffman tree followed by the sequence of codewords corresponding to the bytes of the input file. We have provided helper functions that write a bitstream to disk (explained below), and in this function you are preparing that bitstream.Write the function

`prepend_tree : huffmantree -> int list -> int list`

.This takes a Huffman tree and an existing bitstream and returns a list with the bit representation of the Huffman tree at the front. The bit representation of the Huffman tree consists of the keys in

*preorder*. Add just the keys, nothing else. Use the helper function`Util.to_bits`

to find the bit representation of each key. You will be writing an integer, but you only want to write 9 bits of it. (All keys are less than 512, so they only require 9 bits. We are trying to save space, remember?)This can also be done recursively with only a few lines of code.

**Longest codeword**: Write the function`longest_code : encoding list -> int list`

.This will return the longest codeword of all those generated by

`get_codes`

, and must be of length 8 bits or more (can you explain why?). The`Util.write_bits`

helper function will use a prefix of that codeword to pad the output file to a multiple of 8 bits. This step is necessary because file lengths are always in bytes, a multiple of 8 bits. If we just pad out with 0's, we run the risk of adding an extra unwanted codeword to the end of the file (there might exist a short code containing all 0s). There is no danger if we use a codeword 8 bits or longer, because the end of the file will be reached first.

**Decompression****Reconstructing the Huffman tree**: Before we can decompress, we must reconstruct the Huffman tree that was saved in the compressed file as a preorder sequence of keys. In the decompression phase, the frequencies are not used.Write the function

`regrow_tree : int list -> huffmantree * int list`

.This takes a combined bitstream and returns the Huffman tree saved within it, along with the remainder of the bitstream (without the tree representation bits). Use the helper function

`Util.read_n_bits_to_int`

to read and process the 9-bit bitstrings representing the keys in preorder.**Decoding**: After reading in the Huffman tree, the remainder of the input bitstream contains the codewords corresponding to the bytes of the uncompressed file.Write the function

`decode : huffmantree -> int list -> char list`

.This function returns the character list decoded from the remaining bitstream, based on the given Huffman tree.

We have provided you with helper functions in `util.ml`

. We have
also provided you with the `main`

function which writes the
compressed version of the original file, then decompresses that compressed
version. Make sure you understand what these functions do so you can use them
effectively.

With your Huffman codec, you should see about 25 to 45% compression on large
text files. For very small files, compression will actually make things worse,
because the Huffman tree is stored in the compressed file. We have supplied a
text file `text.txt`

of 213,304 bytes, which our solution code
compresses to 118,576, around 44% compression. You should see comparable
performance, possibly with minor variation depending on your implementation and platform.

For this section of the problem set, you will be implementing a
regular expression matcher that supports fuzzy matching of
strings. The so-called *regular expressions* are built using the
following constructs.

A lone character is always a regular expression. In our implementation, we only allow lowercase alphabetical characters (i.e. a to z).

# Regexp.accept (Regexp.build "a") "a";; - : bool = trueA

# Regexp.accept (Regexp.build "{hello,3}") "helpoh";; - : bool = true

Using these two types of regular expressions, we can build more complicated ones. The first way to do so is by concatenation. Simply adjoin two regular expressions to create a new one that matches both in sequence.

# Regexp.accept (Regexp.build "{zardoz,2}{zardoz,2}") "zardozmoardoz";; - : bool = trueWe may also use alternation with the pipe symbol (

# Regexp.accept (Regexp.build "{help,3}|{me,3}") "derp";; - : bool = true # Regexp.accept (Regexp.build "{help,3}|{me,3}") "merp";; - : bool = trueFinally, we may use the star operator to generate the Kleene closure of a given regular expression. The regular expression

# Regexp.accept (Regexp.build "(hua)*") "huahuahuahuahua";; - : bool = true # Regexp.accept (Regexp.build "a*") "";; - : bool = trueYou may have noticed that we needed to use parentheses in building our regular expressions. The star and alternation operators have higher precedence than concatenation, so

A nondeterministic
finite automaton (NFA) is a construct for implementing a function that takes
a string as input and returns true if and only if its input is in a particular
set. This set is the *language* recognized by the NFA. For any legal regular
expression, we will formulate a procedure for constructing an NFA whose language
is precisely the strings that match the regular expression.

Formally, a NFA is a set of states, an alphabet, a transition relation, a set of start states, and a set of end states. Strings passed as input to the NFA are sequences of characters taken from the alphabet. The transition relation acts as a function that takes the current state and a character from the alphabet as input and returns a set of states. NFAs are naturally depicted as graphs.

The states are the nodes in our graph, and the transition relation
provides labeled edges between nodes. Note that for a given state,
there can be multiple edges labeled with the same character leaving
the node. Additionally, we provide a special transition called
an *epsilon transition*. These edges in the graph represent a
"free" edge that can be traversed without consuming a character.

When executing the machine, we would like to step through the input
characters one at a time, keeping track of the possible
states. However, when we are in the position to consume an input
character, a current state of the machine may have an epsilon
transition. Taking this path will not consume an input character, but
taking other paths may use up the input. Therefore, we would like to
not have to worry about epsilon transitions at all. That is, we would
like a set of possible states to be *epsilon closed*, where
following an epsilon transition does not leave the set.

Running an NFA steps through the characters of the input string one by
one, keeping track of the set of all possible states. Before we have
read the input, the machine can be in any of its start states. Next,
say `S` is the set of possible states, and `c` is the
next character of input. Then for each `s` in `S` we use
the transition relation to lookup the set `d(s,c)` of all
states the machine could be in after reading `c` while in
state `s`. After doing this for each `s`, we now have a
set of sets of possible states. We then flatten this set to get the
next set of possible states. We continue to do this until we run out
of characters in the input. Once we are done, we check to see if any
of the possible states is also a final state. If so, the string has
matched and we return true. Otherwise, we return false.

Given a string `s` and a maximal edit distance `k`, we can create
an NFA that accepts the set of strings that are less than `k` edits from
`s`. Here's a NFA that accepts all words within two edits of "food". The
following diagram (and the idea for Levenshtein automata) were derived from this
blog
post.

The sole start state is indicated by the arrow, and the three double circle
states in the rightmost column are accept states. The states are labeled as
n^{e}, where n is the number of characters matched to "food", and e is
the number of edits used so far. The horizontal edges represent correct
characters in the input string, as we move forward one column while consuming
one character of input. The vertical edges are deletions that consume any
character and use up one edit without moving forward at all. There are two
diagonal transitions, one epsilon transition that does not consume a character
and one that consumes any character. The epsilon transition represents inserting
the correct character in place, and the consuming transition represents swapping
the current character with the correct one. If any of the final states are
reached, then we were able to transform the input string to "food" with less
than three edits.

We can also convert any regular expression into an equivalent NFA. We will start by converting simple regular expressions to NFAs. We already implemented fuzzy strings in the previous section, so let's now form an NFA that accepts a single character. We simply have one start state and one end state, and there is one transition connecting them that accepts that character.

Now let's build our compound regular expressions. We first need to concatenate two regular expressions. One way to do this is to connect every accepting state of the first regular expression to every start state of the second with an epsilon transition. The start states of the first automata are the new start states, and the final states of the second are the new accepting states. Here are two examples of concatenations.

Alternation is trivial. All we have to do is take the union of the start and end states of both automata.

The final case is when we have a star in our regular expression. Consider the case
when we have `r*` for some regular expression `r`. We have to accept
the empty string, so we should first make every start state of `r` a final
state as well. Next, we need to allow for arbitrary repetitions of `r`.
We can do this by connecting every accept state of `r` to every start state
with an epsilon closure.

We have provided a handful of modules for your use in completing this
assignment. The first, `RegexpUtil` provides an abstract
syntax tree type `regexp` for regular expressions and a
parser. The second, `Util` contributes useful functions for
generating lists of integers and strings. Finally, the `Table`
module provides an implementation of immutable two-dimensional arrays
for storing the transition information. Within the `NFA`
module, we have created `StateSet`
and `Dict`
modules using the OCaml library functors. The interfaces for these
modules can be found in the links provided above. We recommend that
you familiarize yourself with all of these modules before you start
coding.

`NFA.make`: This function takes a number`n_states`as an argument and assumes that all states are integers between zero and`n_states`minus one. After processing its arguments, it returns a NFA of type NFA.nfa.`NFA.count_states`: This function takes a NFA and returns the number of states. Do not ignore "dead states," so the number of states is precisely the number of rows in the transition matrix.`NFA.epsilon_closure`: Implement a function that takes an NFA and a list of states and returns the epsilon closure of that set of states. For simplicity of implementation, you may assume that the NFA does not have any epsilon cycles.`NFA.step`: Given an epsilon closed list of input states and a character from the alphabet, compute all the possible states the machine can be in after consuming that character.`NFA.accept`: With an NFA and a string, return true if and only if the NFA accepts the input string.`NFA.union`: Take two NFAs as input and return a new NFA that accepts a string if either of the input NFAs accepts.`NFA.concat`: Form a new NFA from two input NFAs that accepts if the first one accepts, and then the second accepts.`NFA.star`: Given an NFA, create a new NFA that accepts the Kleene closure of the input NFA.