# Problem Set 3: Tries and LZW Compression

### Due Thursday, February 26

For this problem set, you are to turn in an overview document covering Part 2. The overview document should be composed as described in Overview Requirements.  You should read this document carefully. We have provided an example overview document written as if for Problem Set 2 from CS 312, Spring 2008. Submission in PDF is preferred; however, you can also submit your overview document as a .doc or a .txt file.

In particular, starting with this problem set, it's your job to convince us that your code works, by describing your implementation and by presenting a testing strategy, test cases, and results.

## Part 1 : Substitution Model

Find an expression that, when substituted for the ??, will cause the following function for the given input to evaluate to 42. Show the evaluation using the substitution model. You may omit minor substitutions, but be sure to show all the main steps.  At an absolute minimum, show each step before and after a recursive call. Steps that create new global bindings should also be shown

```let rec zardoz l =
match l with
[] -> ??
| h::t -> (h (zardoz t)) in
let rec f n = if n=1 then 1 else n*f(n-1)+n+2
in
zardoz [(fun n -> 7*n); f];;
```

To do: turn in the solution under Substition Model in CMS. Submission is allowed in .pdf, .doc, or .txt. Please include your name and netid in your solution.

## Part 3: LZW Compression

Lempel-Ziv-Welch is a widely used data compression algorithm. It is a lossless compression scheme, losing no data between compression and decompression. The method was introduced in 1978 and improved in 1984, and by the late 1980s became the first widely used compression algorithm on computers. This method is always used in GIF image files, and it can optionally be used with TIFF files. Many modern compression programs, and other programs such as Adobe Acrobat, contain an implementation of LZW. In this assignment, you will be implementing LZW compression in OCaml.

### Compression

LZW processes input by building a string translation table (also called code table, or dictionary) and converting chunks of text into fixed-length codes (often 12-bit codes are used). If the character set contains 256 characters (standard ASCII 8-bit encoding), entries 0-255 in the table are filled with the single byte codes expressed in 12 bits. The remaining entries (for 12-bit codes, these are entries 256-4095) represent sequences of bytes.

Pseudo code for the algorithm (taken from Wikipedia's page on LZW compression):

```   w = NIL;
add all possible charcodes to the dictionary
for (every character c in the uncompressed data) do
if ((w + c) exists in the dictionary) then
w = w + c;
else
add (w + c) to the dictionary;
add the dictionary code for w to output;
w = c;
endif
done
add the dictionary code for w to output;
display output;
```

After initializing the table, the algorithm keeps track of two things: a string w, representing the current sequence; and a character c, the new character in the input file. If a sequence (w + c) exists in the translation table, the algorithm keeps adding bytes to the sequence by setting w to (w + c) and reading the next byte of input, c. When a sequence (w + c) does not appear in the translation table, the algorithm knows that string w is in the table, so it appends the code for w to the output file, it adds the new sequence (w + c) to the table, and it resets the current sequence (w = c).

If an input file consisted of each charcode represented exactly once, then no compression would take place (in fact, since 8-bit charcodes are replaced with 12-bit codes in the translation table, the "compressed" output would be 50% larger than the input). Importantly, LZW only compresses the input when it encounters repeated sequences.

The following is an example of the compression algorithm in action, taken from http://www.dspguide.com/ch27/5.htm. Consider the ASCII string:

`the/rain/in/Spain/falls/mainly/on/the/plain/`
. The following table demonstrates how this string is compressed with LZW:

As a caveat, the output of this example is actually larger (408 bits) than the input (352 bits). However, the example demonstrates how new sequences are added to the translation table, and how repeated sequences allow for compression. With a proper implementation, this algorithm can compress large English texts to half their original size.

### Decompression

The important feature of the decompression algorithm is that it does not require the translation table that was given by the compressor; instead, it rebuilds it as it processes the compressed string. Specifically, it is the requirement that every sequence be seen once and added to the table before being added to the compressed output that allows for this fact.

Pseudo code for the decompression algorithm is as follows:

```   add all possible charcodes to the dictionary
output o
w = look up dictionary entry for n
output w
add o + w[0] to dictionary
o = n
done
```

Each input that is read by the decompressor should be a 12-bit code corresponding to an entry in the compressor's translation dictionary. The first code in the input must represent a single character, so it can be output directly. For each additional character, the algorithm keeps track of: n, new code being read; and o, the old code that was read.

The resulting table from the decompression algorithm is exactly the same as that of the compression algorithm. For example, given the compressed version of the string:

`the/rain/in/Spain/falls/mainly/on/the/plain/`

the decompression algorithm looks the first code and outputs it, in this case a "t". It then looks at the next code, which is the code for "h". It looks up this entry in the table and outputs it, and then adds "th" as the 256th entry in the table, etc.

There is a single exception to this decompression algorithm. Suppose there is a string consisting of a string w and a character c (w + c) already defined in the table. If the input stream sees a sequence of wcwcw, the compression algorithm will produce a code before the decompressor puts it in the table.

Suppose the string RICKD is defined in the table. Somewhere later in the input, the sequence RICKDRICKDRICK appears. The compression will add an entry (say, entry 500) to the table mapping to the sequence RICKDR. The new string will start with R, and eventually the compressor will add the sequence RICKDRI to the table as entry 501. The problem is that the decompression algorithm as listed will read code 500 and attempt to look it up in the table before adding it.

Fortunately, this is the only case in which a problem occurs. To get around it, we can add a special case to the pseudo code: if a code is not definied, we know the translation is the string translation of the old code plus the first character of that string. The new pseudo code becomes:

```   add all possible charcodes to the dictionary
output o
c = o
if n is not in the translation table
w = look up entry for o
w = w + c
else
w = look up entry for n
endif
output w
c = w[0]
add o + c to the translation table
o = n
done
```

One final comment: while this is a relatively simple algorithm to implement, performance will suffer immensely if you are not careful when implementing the translation table. It is necessary to think about how to properly save space and runtime when writing this code.

1. Implement ```compress (infile: string) (outfile: string) (fixed: bool) (maxsize: int): unit. infile and outfile give the path location of the input and output files respectively.  fixed specifies whether the codes the compressed file have a fixed or variable bit length.  In the example above, we used a fixed length of 12; however, a better scheme would be to use variable code lengths, increasing the number of bits as values are added to the dictionary.  For instance, if there are fewer than 256 entries in the table, we should only write 8 bit codes to the out file; once we add another element to the dictionary, however, we should start using 9 bit codes for all outputs (including those which were previously 8 bits).  Finally, maxsize specifies the maximum number of entries we can add to the dictionary.```
2. ```Implement decompress (infile: string) (outfile: string) (fixed: bool) (maxsize: int): unit. Arguments here have the same meanings as specified above.```