CS 410 Fall 99
Programming Assignment 4 -- Lempel-Ziv Compression
Due Tuesday, December 7 (note new date!)

Your assignment is to implement Lempel-Ziv compression. This is much the same algorithm used in gzip, pkzip, and WinZip; the main difference is that we use a fixed code size (2 bytes), and when the dictionary gets full, we just use it without entering or deleting any strings.

Before you start, please read the handout on Lempel-Ziv compression (scanned from H.R. Lewis and L. Dennenberg, Data Structures & Their Algorithms).

Getting Started

Download the files 410zip.exe and 410zip.410. The first is a little WinZip-like application, and the second is an archive created by that application containing all its source files, which you will need for your project.

Fire up the application and play with it a bit. "New" creates a new archive, "Open" opens an existing archive, "Add" adds a file to an open archive, and "Extract" extracts all the files to C:\Temp. Note that right now, it does no compression at all--when you add a file to an archive or extract a file, it just copies the file verbatim.

Open the archive 410zip.410 and extract the four source files for the project. Create a new Java project with these files. The main class is Zip and it is found in Zip.java.

The Files

Here is a description of the four source files. They may contain some Java constructs you have not seen before. You should have copies of them in front of you to refer to as you read this.

Zip.java. This contains the Zip class and main procedure, as well as all code implementing the user interface. The code in the constructor of Zip initializes the user interface. Also in this file are event handlers that dispatch the appropriate routines in response to menu selections and button clicks. The code is fairly self-explanatory and you don't need to change it (unless you want to).

IO.java. This contains several low-level file I/O routines. File I/O is handled by input and output streams. The call new FileInputStream() opens a file for input. Bytes can then be read from the file in a sequential manner. The call BufferedInputStream(FileInputStream f) buffers the input so that the file is not accessed every time a byte is read. FileOutputStream and BufferedOutputStream are similar. There are also routines for reading and writing single bytes, Strings, ints, and byte arrays to and from the current output and input streams. The ones you will find most useful are get() and put(), which read and write a single byte of data in the form of an int in the range 0-255. The routine get() returns -1 when the end of the file is reached; note this is not the same as the byte value 255.

Archive.java. This contains the class Archive. The file format of an archive on disk is described there. An archive file consists of a header giving some information about the files in the archive followed by the compressed data itself. The header consists of the string "<410zip>", then an int (4 bytes) giving the number of files in the archive, followed by some information for each file. For each file, there is an int giving the length of the file name, followed by the filename itself, then an int giving the location where the compressed data for that file starts, and finally an int giving its uncompressed length.

When an existing archive is opened, the header data is read in and a summary is displayed in the main window. The summary gives the compressed/uncompressed size of each file and the compression ratio. The compressed data is not read at that point, but the file is left open with the input stream pointer pointing to the first byte of the compressed data in case the user later wants to extract files. When a new archive is created, a standard header with an empty file list is created and written out to disk.

The routines that handle adding a file to the archive, extracting files, and closing the archive are also found in Archive.java.

LZ.java. This contains the code for the dictionary structure that maps Strings to their codes and vice-versa, as well as the compression and decompression code. Right now, the compression and decompression routines don't do anything except copy characters verbatim.

What do I have to do?

The assignment is divided into three parts, two required and one optional.

(Required) You are to implement a Lempel-Ziv decoder as described in the handout on Lempel-Ziv compression. We have implemented the corresponding encoder for you; you will find it at the end of LZ.java, commented out. Pseudocode for the decoder is given in the handout. The decoder not only decodes the compressed stream, it also builds the dictionary.
Our decoder is called with a number m, the compressed length of the file. You can assume the input stream pointer is set to the beginning of the compressed file you wish to decode; this is taken care of by the calling procedure. You should decode exactly m bytes of the input stream. You should read two bytes at a time from the input stream (all code words are two bytes), convert them to an integer, look up the String corresponding to that code in the dictionary, and write that String to the output stream. If there are fewer than m bytes available from the input stream--that is, if the end of the file was reached prematurely--then that means the archive has been corrupted. In that case you should throw an IOException with the message "Unexpected end of file". Also, if there is no String in the dictionary with that code, then you should throw an IOException with the message "Archive corrupted".

During all this, you are also building the dictionary. You insert a String s into the dictionary dict by calling dict.insert(s). The insert routine checks for you whether the dictionary is full or whether it already contains s, in which case it just returns without doing anything. In each step, the String you want to insert is the String that was output in the previous step concatenated with the first character of the String output in this step. Remember to skip this in the first step, since there was no previous step.

Get this working before moving on to part 2. When you have it working, it's a good idea to save a copy of your entire project in case you do not have time to finish part 2. We would rather see a working version of part 1 than an incomplete version of part 2.
(Required) Right now the dictionary uses a Java Hashtable to map Strings -> codes. This is rather inefficient in both time and space. Your job is to change the dictionary data structure to use a version of radix search trees instead. (You should still use a Vector to map in the other direction, codes -> Strings.) The top level of the tree should be an array of length 256 containing a node for each character in the alphabet (byte value 0-255). These represent the strings of length 1. The byte value is the same as the array index where it is stored. This array should be initialized in the constructor of the Dictionary class. Below the top level, nodes in the tree are represented by an object of class Node. Each instance of Node should contain a byte representing a single character, an integer in the range 0-65535 representing a code, a pointer to the next sibling, and a pointer to the first child. The sibling pointer is null if the node is the last sibling of its parent, and the child pointer is null if the node is a leaf. (This is similar to our implementation of Binomial Trees.)
The integer code stored at a Node is the code number of the string obtained by concatenating the characters of the nodes on the path from the root down to that Node. The code number is assigned by the routine that inserts the string in the Dictionary; it just assigns the next available code. Note that whenever a string is inserted, its maximal proper prefix (i.e., the string obtained by deleting the last character) is already there, so you only need to add the last character as a new child node of some other node that already exists in the tree.

To search for a string in the dictionary, look at the first character of the string (there are methods in the String class for extracting single characters or substrings of a String). Go to the corresponding index of the array, where the child list of that character is stored. Search the child list for the next character; if found, search the child list of that node for the third character; and so on. Note that the longestMatch routine will require substantial hacking; but keep the routines for the construction and manipulation of these trees appropriately encapsulated.
(Optional) For extra credit, modify the application to add some new functionality of your choice. The amount of extra credit will depend on the difficulty of what you did and how well you did it, as judged by the course staff. Tell us clearly what you did so that we can try it out.
Here are some ideas, roughly in order of difficulty:
- We would like to open an archive by double-clicking on it or dragging-and-dropping it on the 410zip.exe icon after registering the file type .410 in Windows to launch 410zip.exe. When either of these things happen, the application receives the filename of the archive in args[0]. Make the main procedure open the passed-in archive in that case.
- Right now files are extracted to C:\Temp. Change this to display a dialog box that allows the user to select a directory to extract to.
- Right now all files are extracted at once. Change this to allow a single file to be extracted by right-clicking or opened by double-clicking on it in the main window.
- Implement deletion of files from the archive.
- Right now all codes are of a fixed length, 16 bits. This hurts compression for both files that are too small (since we may not need that many bits per codeword) and files that are too big (since we will run out of codewords after 65536 entries). Implement a means of using a different codeword length depending on the size of the file to be compressed. To allow bit lengths that are not multiples of 8, you should implement getBit() and putBit() methods in the IO class that do buffering on the bit level as BufferedInputStream and BufferedOutputStream do on the byte level, using an 8-bit buffer. You should encapsulate the action of getting or putting a codeword in methods getCode() and putCode() that are called the same way independent of the size of the codeword; right now these actions are hardwired for 2 bytes in the body of the encode and decode procedures.
- At present, when the dictionary is full, any further encoding just uses that dictionary without changing it. Implement a method to recode the entire archive using a larger code size when the dictionary fills up. You will need to implement the previous enhancement before implementing this one.
- At present, each file is encoded separately, reinitializing the dictionary every time a new file is added. Change this to use the same dictionary for all files.

What are we looking for?

You should know by now. This is not an easy assignment. Get started early!

What do I turn in and how?

Please pass in a printout showing any code that you have changed. In addition, you should submit your project online in the CSUGLab as before in your personal handin directory \\goose\courses\cs410-fall99\proj4\<login-id>, where <login-id> is your CSUGLab login identifier. Please check as soon as possible that this directory exists and that you can save files there. Please do not wait till the day of the deadline to do this. Sorry, no diskettes accepted.

When you are ready to submit, copy all the files in your project directory to your handin directory. Please copy your entire project, including all executables, .class files, source files, and project (.vjp, .sln) files into your handin directory. We will run your code and test all functionality.

If you are working with a partner, you only need to submit once, and you may submit it in either partner's handin directory.

The deadline for electronic submission is Tuesday, December 7, 7pm (note new date!). At that time the permissions on the handin directory will change and you will not be able to access it. You may modify or replace files in your handin directory before the deadline as often as you wish.

Please contact cs410@cs.cornell.edu if you encounter problems.