For this problem set, you are to turn in an overview document covering Part 2. The overview document should be composed as described in Overview Requirements. You should read this document carefully. We have provided an example overview document written as if for Problem Set 2 from CS 312, Spring 2008. Submission in PDF is preferred; however, you can also submit your overview document as a .doc or a .txt file.
In particular, starting with this problem set, it's your job to convince us that your code works, by describing your implementation and by presenting a testing strategy, test cases, and results.
Find an expression that, when substituted for the ??, will cause the following function for the given input to evaluate to 42. Show the evaluation using the substitution model. You may omit minor substitutions, but be sure to show all the main steps. At an absolute minimum, show each step before and after a recursive call. Steps that create new global bindings should also be shown
let rec zardoz l = match l with [] -> ?? | h::t -> (h (zardoz t)) in let rec f n = if n=1 then 1 else n*f(n-1)+n+2 in zardoz [(fun n -> 7*n); f];;
To do: turn in the solution under Substition Model in CMS. Submission is allowed in .pdf, .doc, or .txt. Please include your name and netid in your solution.
Several algorithms, including the compression algorithms and string matching algorithms, make use of maps where the keys are strings (or sequences) of values of a certain type. A trie (pronounced "try") or prefix tree is an efficient data structure for implementing the map ADT for this kind of map. The set of possible elements in the key strings is referred to as the alphabet of the trie.
A trie is a tree structure whose edges are labeled with the elements of the alphabet. Each node can have up to N children, where N is the size of the alphabet. Nodes correspond to strings of elements of the alphabet. More precisely, each node corresponds to the sequence of elements traversed on the path from the root to that node. The map binds a string s to a value v by storing the value v in the node of s. The advantage of this scheme is that if efficiently represents common string prefixes, and lookups or inserts are achieved using quick searches starting from the root.
Suppose we want to build a map from strings of ASCII characters to integer codes. In this case, edges are labeled with ASCII characters, and integer codes are stored in (some of) the nodes. Here is an example:
The above trie describes the map {"to"->7, "tea"->3,
"ten"->12, "in"->5, "inn"->9}
. The values (shown in blue) are stored
in the nodes. Note that nodes might not have values. For instance the node
that corresponds to string "te"
doesn't have a value. Hence,
"te"
is not in the map.
It is sometimes useful to define a cursor pointing into the trie, in order to cache the last search in the trie and advance the cursor one step at a time.
In this part you will implement tries that map
strings of integers (int list
) to values of arbitrary types ('a
).
Your task is to do the following:
Interface specifications. Before you
begin implementing the trie data structure, you need to finish its
specifications in the signature TRIE
. The existing
specifications may be incomplete, incorrect, or nonexistent.
Representation. You must decide how you will represent a trie. Briefly inspect the functions below to decide if certain pieces of information would make their implementation simpler. Then define an appropriate trie type.
type 'a trie = (* your implementation here *)
Next to the representation type write a
comment giving the abstraction function and representation invariant for
tries.
Implement the function repOK. If
the input trie satisfies the representation invariant, then repOK
simply returns the input trie. Otherwise, repOK raises
Failure.
Trie operations. Implement the
empty
value, representing an empty trie. Then implement the
functions put
, get
, size
, and
tabulate
. The put
function adds a new mapping
to a trie. If the value to be inserted is associated with a list of
integers that already exist in the trie, then the code should raise an
exception ElementExists
. The function get
returns the binding of a string in the trie. If the string does not
exist in the trie, then an exception NotFound
must be
raised. Finally, the function size
returns the size of a
trie; and tabulate f n
creates a new trie with n bindings,
where binding k is f(k), for k between 0 and n-1.
For example, the code tabulate (fun k -> ([k; k+1], k*k)) 3
would create a trie that makes [0;1]
to 0, [1;2]
to 1, and [2;3]
to 4.
val empty: 'a trie val put: 'a trie -> int list -> 'a -> 'a trie val get: 'a trie -> int list -> 'a val tabulate: (int -> int list * 'a) -> int -> 'a trie val size: 'a trie -> int
Note: the functions that build new tries are
important since the course staff will use them to generate the inputs
for testing other functions. Therefore, be careful to implement them
correctly.
Cursor support. Next, define an appropriate type to describe a cursor into a trie structure. Make sure the type contains enough information to support the cursor operations defined below.
type 'a cursor = (* your implementation here *)
Implement the following functions:
val cursor: 'a trie -> int list -> 'a cursor val advance: 'a cursor -> int -> 'a cursor option val getc: 'a cursor -> 'a option val putc: 'a trie -> 'a cursor -> int -> 'a -> 'a trie
The function cursor
returns a
cursor positioned at the node that corresponds to the string given as
argument. In the string doesn't exits, the exception NotFound
is raised. The function advance
moves the cursor one
position down the tree, following the edge that corresponds to the value
passed as an argument. This function returns None
if no
such edge exists. The function getc
yields the value stored
at the node that the cursor points to. Finally, putc t c x v
advances cursor c
along the edge x
(creating
the edge if it doesn't exit), inserts value v
at that
position in trie t
, and returns the updated trie. The
appropriate exception is raised if a value already exists at the
insertion point.
To do: turn in the solution in the files trie.mli and trie.ml.
Lempel-Ziv-Welch is a widely used data compression algorithm. It is a lossless compression scheme, losing no data between compression and decompression. The method was introduced in 1978 and improved in 1984, and by the late 1980s became the first widely used compression algorithm on computers. This method is always used in GIF image files, and it can optionally be used with TIFF files. Many modern compression programs, and other programs such as Adobe Acrobat, contain an implementation of LZW. In this assignment, you will be implementing LZW compression in OCaml.
LZW processes input by building a string translation table (also called code table, or dictionary) and converting chunks of text into fixed-length codes (often 12-bit codes are used). If the character set contains 256 characters (standard ASCII 8-bit encoding), entries 0-255 in the table are filled with the single byte codes expressed in 12 bits. The remaining entries (for 12-bit codes, these are entries 256-4095) represent sequences of bytes.
Pseudo code for the algorithm (taken from Wikipedia's page on LZW compression):
w = NIL; add all possible charcodes to the dictionary for (every character c in the uncompressed data) do if ((w + c) exists in the dictionary) then w = w + c; else add (w + c) to the dictionary; add the dictionary code for w to output; w = c; endif done add the dictionary code for w to output; display output;
After initializing the table, the algorithm keeps track of two things: a string w, representing the current sequence; and a character c, the new character in the input file. If a sequence (w + c) exists in the translation table, the algorithm keeps adding bytes to the sequence by setting w to (w + c) and reading the next byte of input, c. When a sequence (w + c) does not appear in the translation table, the algorithm knows that string w is in the table, so it appends the code for w to the output file, it adds the new sequence (w + c) to the table, and it resets the current sequence (w = c).
If an input file consisted of each charcode represented exactly once, then no compression would take place (in fact, since 8-bit charcodes are replaced with 12-bit codes in the translation table, the "compressed" output would be 50% larger than the input). Importantly, LZW only compresses the input when it encounters repeated sequences.
The following is an example of the compression algorithm in action, taken from http://www.dspguide.com/ch27/5.htm. Consider the ASCII string:
the/rain/in/Spain/falls/mainly/on/the/plain/. The following table demonstrates how this string is compressed with LZW:
As a caveat, the output of this example is actually larger (408 bits) than the input (352 bits). However, the example demonstrates how new sequences are added to the translation table, and how repeated sequences allow for compression. With a proper implementation, this algorithm can compress large English texts to half their original size.
The important feature of the decompression algorithm is that it does not require the translation table that was given by the compressor; instead, it rebuilds it as it processes the compressed string. Specifically, it is the requirement that every sequence be seen once and added to the table before being added to the compressed output that allows for this fact.
Pseudo code for the decompression algorithm is as follows:
add all possible charcodes to the dictionary read code o output o for (each additional code n) w = look up dictionary entry for n output w add o + w[0] to dictionary o = n done
Each input that is read by the decompressor should be a 12-bit code corresponding to an entry in the compressor's translation dictionary. The first code in the input must represent a single character, so it can be output directly. For each additional character, the algorithm keeps track of: n, new code being read; and o, the old code that was read.
The resulting table from the decompression algorithm is exactly the same as that of the compression algorithm. For example, given the compressed version of the string:
the/rain/in/Spain/falls/mainly/on/the/plain/
the decompression algorithm looks the first code and outputs it, in this case a "t". It then looks at the next code, which is the code for "h". It looks up this entry in the table and outputs it, and then adds "th" as the 256th entry in the table, etc.
There is a single exception to this decompression algorithm. Suppose there is a string consisting of a string w and a character c (w + c) already defined in the table. If the input stream sees a sequence of wcwcw, the compression algorithm will produce a code before the decompressor puts it in the table.
Suppose the string RICKD is defined in the table. Somewhere later in the input, the sequence RICKDRICKDRICK appears. The compression will add an entry (say, entry 500) to the table mapping to the sequence RICKDR. The new string will start with R, and eventually the compressor will add the sequence RICKDRI to the table as entry 501. The problem is that the decompression algorithm as listed will read code 500 and attempt to look it up in the table before adding it.
Fortunately, this is the only case in which a problem occurs. To get around it, we can add a special case to the pseudo code: if a code is not definied, we know the translation is the string translation of the old code plus the first character of that string. The new pseudo code becomes:
add all possible charcodes to the dictionary read code o output o c = o for (each additional code n) if n is not in the translation table w = look up entry for o w = w + c else w = look up entry for n endif output w c = w[0] add o + c to the translation table o = n done
One final comment: while this is a relatively simple algorithm to implement, performance will suffer immensely if you are not careful when implementing the translation table. It is necessary to think about how to properly save space and runtime when writing this code.
Your tasks for this part include the following:
compress (infile: string) (outfile: string) (fixed: bool)
(maxsize: int): unit. infile
and outfile give the path location of
the input and output files respectively. fixed
specifies whether the codes the compressed
file have a fixed or variable bit length. In the example above, we
used a fixed length of 12; however, a better scheme would be to use variable
code lengths, increasing the number of bits as values are added to the
dictionary. For instance, if there are fewer than 256 entries in the
table, we should only write 8 bit codes to the out file; once we add another
element to the dictionary, however, we should start using 9 bit codes for
all outputs (including those which were previously 8 bits). Finally,
maxsize specifies the maximum number
of entries we can add to the dictionary.
Implement
decompress (infile: string) (outfile: string) (fixed: bool) (maxsize:
int): unit. Arguments here
have the same meanings as specified above.
To do: turn in the solution in the file lzw.ml. You are not expected to modify lzw.mli
In order to complete this task, you will need to be able to manipulate binary files. The basic ocaml installation provides limited support for this, so we recommend you make use the ocaml-extlib library, which includes a module IO containing many useful functions for reading and writing bitstrings of variable length. Extract the contents of the ExtLib archive to your "Objective Caml/lib" directory, and then run ocaml install.ml from the command line. When this is done, you can load the library into the ocaml toplevel using #load "extlib.cma". Functions for opening and closing files can be found in the pervasives module of the core ocaml library.