Problem Set 3: Tries and LZW Compression

Due Thursday, February 26


For this problem set, you are to turn in an overview document covering Part 2. The overview document should be composed as described in Overview Requirements.  You should read this document carefully. We have provided an example overview document written as if for Problem Set 2 from CS 312, Spring 2008. Submission in PDF is preferred; however, you can also submit your overview document as a .doc or a .txt file.

In particular, starting with this problem set, it's your job to convince us that your code works, by describing your implementation and by presenting a testing strategy, test cases, and results.

 

Part 1 : Substitution Model

Find an expression that, when substituted for the ??, will cause the following function for the given input to evaluate to 42. Show the evaluation using the substitution model. You may omit minor substitutions, but be sure to show all the main steps.  At an absolute minimum, show each step before and after a recursive call. Steps that create new global bindings should also be shown 

let rec zardoz l =
  match l with
    [] -> ??
  | h::t -> (h (zardoz t)) in
let rec f n = if n=1 then 1 else n*f(n-1)+n+2
in
  zardoz [(fun n -> 7*n); f];;

To do: turn in the solution under Substition Model in CMS. Submission is allowed in .pdf, .doc, or .txt. Please include your name and netid in your solution.

 

Part 2: Tries

Several algorithms, including the compression algorithms and string matching algorithms, make use of maps where the keys are strings (or sequences) of values of a certain type. A trie  (pronounced "try")  or prefix tree is an efficient data structure for implementing the map ADT for this kind of map. The set of possible elements in the key strings is referred to as the alphabet of the trie.

A trie is a tree structure whose edges are labeled with the elements of the alphabet. Each node can have up to N children, where N is the size of the alphabet. Nodes correspond to strings of elements of the alphabet. More precisely, each node corresponds to the sequence of elements traversed on the path from the root to that node. The map binds a string s to a value v by storing the value v in the node of s. The advantage of this scheme is that if efficiently represents common string prefixes, and lookups or inserts are achieved using quick searches starting from the root.

Suppose we want to build a map from strings of ASCII characters to integer codes. In this case, edges are labeled with ASCII characters, and integer codes are stored in (some of) the nodes. Here is an example: 

The above trie describes the map {"to"->7, "tea"->3, "ten"->12, "in"->5, "inn"->9}. The values (shown in blue) are stored in the nodes. Note that nodes might not have values. For instance the node that corresponds to string "te" doesn't have a value. Hence, "te" is not in the map.

It is sometimes useful to define a cursor pointing into the trie, in order to cache the last search in the trie and advance the cursor one step at a time.

In this part you will implement tries that map strings of integers (int list) to values of arbitrary types ('a). Your task is to do the following:

  1. Interface specifications. Before you begin implementing the trie data structure, you need to finish its specifications in the signature TRIE. The existing specifications may be incomplete, incorrect, or nonexistent.
     

  2. Representation. You must decide how you will represent a trie. Briefly inspect the functions below to decide if certain pieces of information would make their implementation simpler. Then define an appropriate trie type.

    type 'a trie = (* your implementation here *)
  3. Next to the representation type write a comment giving the abstraction function and representation invariant for tries.
     

  4. Implement the function repOK. If the input trie satisfies the representation invariant, then repOK simply returns the input trie.  Otherwise, repOK raises Failure.
     

  5. Trie operations. Implement the empty value, representing an empty trie. Then implement the functions put, get, size, and tabulate. The put function adds a new mapping to a trie. If the value to be inserted is associated with a list of integers that already exist in the trie, then the code should raise an exception ElementExists. The function get returns the binding of a string in the trie. If the string does not exist in the trie, then an exception NotFound must be raised. Finally, the function size returns the size of a trie; and tabulate f n creates a new trie with n bindings, where binding k is f(k), for k between 0 and n-1.

    For example, the code tabulate (fun k -> ([k; k+1], k*k)) 3 would create a trie that makes [0;1] to 0, [1;2] to 1, and [2;3] to 4.

    val empty: 'a trie
    val put: 'a trie -> int list -> 'a -> 'a trie
    val get: 'a trie -> int list -> 'a
    val tabulate: (int -> int list * 'a) -> int -> 'a trie
    val size: 'a trie -> int

     

    Note: the functions that build new tries are important since the course staff will use them to generate the inputs for testing other functions. Therefore, be careful to implement them correctly.
     

  6. Cursor support. Next, define an appropriate type to describe a cursor into a trie structure. Make sure the type contains enough information to support the cursor operations defined below.

    type 'a cursor = (* your implementation here *)

    Implement the following functions:

    val cursor: 'a trie -> int list -> 'a cursor
    val advance: 'a cursor -> int -> 'a cursor option
    val getc: 'a cursor -> 'a option
    val putc: 'a trie -> 'a cursor -> int -> 'a -> 'a trie

    The function cursor returns a cursor positioned at the node that corresponds to the string given as argument. In the string doesn't exits, the exception NotFound is raised. The function advance moves the cursor one position down the tree, following the edge that corresponds to the value passed as an argument. This function returns None if no such edge exists. The function getc yields the value stored at the node that the cursor points to. Finally, putc t c x v advances cursor c along the edge x (creating the edge if it doesn't exit), inserts value v at that position in trie t, and returns the updated trie. The appropriate exception is raised if a value already exists at the insertion point.
     

To do: turn in the solution in the files trie.mli and trie.ml.

 

Part 3: LZW Compression

Lempel-Ziv-Welch is a widely used data compression algorithm. It is a lossless compression scheme, losing no data between compression and decompression. The method was introduced in 1978 and improved in 1984, and by the late 1980s became the first widely used compression algorithm on computers. This method is always used in GIF image files, and it can optionally be used with TIFF files. Many modern compression programs, and other programs such as Adobe Acrobat, contain an implementation of LZW. In this assignment, you will be implementing LZW compression in OCaml.

Compression

LZW processes input by building a string translation table (also called code table, or dictionary) and converting chunks of text into fixed-length codes (often 12-bit codes are used). If the character set contains 256 characters (standard ASCII 8-bit encoding), entries 0-255 in the table are filled with the single byte codes expressed in 12 bits. The remaining entries (for 12-bit codes, these are entries 256-4095) represent sequences of bytes.

Pseudo code for the algorithm (taken from Wikipedia's page on LZW compression):

   w = NIL;
   add all possible charcodes to the dictionary
   for (every character c in the uncompressed data) do
       if ((w + c) exists in the dictionary) then
           w = w + c;
       else
           add (w + c) to the dictionary;
           add the dictionary code for w to output;
           w = c;
       endif
   done
   add the dictionary code for w to output;
   display output;

After initializing the table, the algorithm keeps track of two things: a string w, representing the current sequence; and a character c, the new character in the input file. If a sequence (w + c) exists in the translation table, the algorithm keeps adding bytes to the sequence by setting w to (w + c) and reading the next byte of input, c. When a sequence (w + c) does not appear in the translation table, the algorithm knows that string w is in the table, so it appends the code for w to the output file, it adds the new sequence (w + c) to the table, and it resets the current sequence (w = c).

If an input file consisted of each charcode represented exactly once, then no compression would take place (in fact, since 8-bit charcodes are replaced with 12-bit codes in the translation table, the "compressed" output would be 50% larger than the input). Importantly, LZW only compresses the input when it encounters repeated sequences.

The following is an example of the compression algorithm in action, taken from http://www.dspguide.com/ch27/5.htm. Consider the ASCII string:

the/rain/in/Spain/falls/mainly/on/the/plain/
. The following table demonstrates how this string is compressed with LZW:

As a caveat, the output of this example is actually larger (408 bits) than the input (352 bits). However, the example demonstrates how new sequences are added to the translation table, and how repeated sequences allow for compression. With a proper implementation, this algorithm can compress large English texts to half their original size.

Decompression

The important feature of the decompression algorithm is that it does not require the translation table that was given by the compressor; instead, it rebuilds it as it processes the compressed string. Specifically, it is the requirement that every sequence be seen once and added to the table before being added to the compressed output that allows for this fact.

Pseudo code for the decompression algorithm is as follows:

   add all possible charcodes to the dictionary
   read code o
   output o
   for (each additional code n)
      w = look up dictionary entry for n
	  output w
	  add o + w[0] to dictionary
	  o = n
   done

Each input that is read by the decompressor should be a 12-bit code corresponding to an entry in the compressor's translation dictionary. The first code in the input must represent a single character, so it can be output directly. For each additional character, the algorithm keeps track of: n, new code being read; and o, the old code that was read.

The resulting table from the decompression algorithm is exactly the same as that of the compression algorithm. For example, given the compressed version of the string:

the/rain/in/Spain/falls/mainly/on/the/plain/

the decompression algorithm looks the first code and outputs it, in this case a "t". It then looks at the next code, which is the code for "h". It looks up this entry in the table and outputs it, and then adds "th" as the 256th entry in the table, etc.

There is a single exception to this decompression algorithm. Suppose there is a string consisting of a string w and a character c (w + c) already defined in the table. If the input stream sees a sequence of wcwcw, the compression algorithm will produce a code before the decompressor puts it in the table.

Suppose the string RICKD is defined in the table. Somewhere later in the input, the sequence RICKDRICKDRICK appears. The compression will add an entry (say, entry 500) to the table mapping to the sequence RICKDR. The new string will start with R, and eventually the compressor will add the sequence RICKDRI to the table as entry 501. The problem is that the decompression algorithm as listed will read code 500 and attempt to look it up in the table before adding it.

Fortunately, this is the only case in which a problem occurs. To get around it, we can add a special case to the pseudo code: if a code is not definied, we know the translation is the string translation of the old code plus the first character of that string. The new pseudo code becomes:

   add all possible charcodes to the dictionary
   read code o
   output o
   c = o
   for (each additional code n)
      if n is not in the translation table
	     w = look up entry for o
		 w = w + c
      else
	     w = look up entry for n
	  endif
	  output w
	  c = w[0]
	  add o + c to the translation table
	  o = n
   done

One final comment: while this is a relatively simple algorithm to implement, performance will suffer immensely if you are not careful when implementing the translation table. It is necessary to think about how to properly save space and runtime when writing this code.

Your tasks for this part include the following:

  1. Implement compress (infile: string) (outfile: string) (fixed: bool) (maxsize: int): unit. infile and outfile give the path location of the input and output files respectively.  fixed specifies whether the codes the compressed file have a fixed or variable bit length.  In the example above, we used a fixed length of 12; however, a better scheme would be to use variable code lengths, increasing the number of bits as values are added to the dictionary.  For instance, if there are fewer than 256 entries in the table, we should only write 8 bit codes to the out file; once we add another element to the dictionary, however, we should start using 9 bit codes for all outputs (including those which were previously 8 bits).  Finally, maxsize specifies the maximum number of entries we can add to the dictionary.
  2. Implement decompress (infile: string) (outfile: string) (fixed: bool) (maxsize: int): unit. Arguments here have the same meanings as specified above.

To do: turn in the solution in the file lzw.ml. You are not expected to modify lzw.mli

In order to complete this task, you will need to be able to manipulate binary files.  The basic ocaml installation provides limited support for this, so we recommend you make use the ocaml-extlib library, which includes a module IO containing many useful functions for reading and writing bitstrings of variable length.  Extract the contents of the ExtLib archive to your "Objective Caml/lib" directory, and then run ocaml install.ml from the command line.  When this is done, you can load the library into the ocaml toplevel using #load "extlib.cma".  Functions for opening and closing files can be found in the pervasives module of the core ocaml library.