Your assignment is to implement Lempel-Ziv compression. This is much the same algorithm used in gzip, pkzip, and WinZip; the main difference is that we use a fixed code size (2 bytes), and when the dictionary gets full, we just use it without entering or deleting any strings.
Before you start, please read the handout on Lempel-Ziv compression (scanned from H.R. Lewis and L. Dennenberg, Data Structures & Their Algorithms).
Download the files 410zip.exe and 410zip.410. The first is a little WinZip-like application, and the second is an archive created by that application containing all its source files, which you will need for your project.
Fire up the application and play with it a bit. "New" creates a new archive, "Open" opens an existing archive, "Add" adds a file to an open archive, and "Extract" extracts all the files to C:\Temp. Note that right now, it does no compression at all--when you add a file to an archive or extract a file, it just copies the file verbatim.
Open the archive 410zip.410 and extract the four source files for the project. Create a new Java project with these files. The main class is Zip and it is found in Zip.java.
Here is a description of the four source files. They may contain some Java constructs you have not seen before. You should have copies of them in front of you to refer to as you read this.
Zip.java. This contains the Zip class and main procedure, as well as all code implementing the user interface. The code in the constructor of Zip initializes the user interface. Also in this file are event handlers that dispatch the appropriate routines in response to menu selections and button clicks. The code is fairly self-explanatory and you don't need to change it (unless you want to).
IO.java. This contains several low-level file I/O routines. File I/O is handled by
input and output streams. The call new FileInputStream(
Archive.java. This contains the class Archive. The file format of an archive on disk is described there. An archive file consists of a header giving some information about the files in the archive followed by the compressed data itself. The header consists of the string "<410zip>", then an int (4 bytes) giving the number of files in the archive, followed by some information for each file. For each file, there is an int giving the length of the file name, followed by the filename itself, then an int giving the location where the compressed data for that file starts, and finally an int giving its uncompressed length.
When an existing archive is opened, the header data is read in and a summary is displayed in the main window. The summary gives the compressed/uncompressed size of each file and the compression ratio. The compressed data is not read at that point, but the file is left open with the input stream pointer pointing to the first byte of the compressed data in case the user later wants to extract files. When a new archive is created, a standard header with an empty file list is created and written out to disk.
The routines that handle adding a file to the archive, extracting files, and closing the archive are also found in Archive.java.
LZ.java. This contains the code for the dictionary structure that maps Strings to their codes and vice-versa, as well as the compression and decompression code. Right now, the compression and decompression routines don't do anything except copy characters verbatim.
The assignment is divided into three parts, two required and one optional.
Our decoder is called with a number m, the compressed length of the file. You can assume the input stream pointer is set to the beginning of the compressed file you wish to decode; this is taken care of by the calling procedure. You should decode exactly m bytes of the input stream. You should read two bytes at a time from the input stream (all code words are two bytes), convert them to an integer, look up the String corresponding to that code in the dictionary, and write that String to the output stream. If there are fewer than m bytes available from the input stream--that is, if the end of the file was reached prematurely--then that means the archive has been corrupted. In that case you should throw an IOException with the message "Unexpected end of file". Also, if there is no String in the dictionary with that code, then you should throw an IOException with the message "Archive corrupted".
During all this, you are also building the dictionary. You insert a String s into the dictionary dict by calling dict.insert(s). The insert routine checks for you whether the dictionary is full or whether it already contains s, in which case it just returns without doing anything. In each step, the String you want to insert is the String that was output in the previous step concatenated with the first character of the String output in this step. Remember to skip this in the first step, since there was no previous step.
Get this working before moving on to part 2. When you have it working, it's a good idea to save a copy of your entire project in case you do not have time to finish part 2. We would rather see a working version of part 1 than an incomplete version of part 2.
The integer code stored at a Node is the code number of the string obtained by concatenating the characters of the nodes on the path from the root down to that Node. The code number is assigned by the routine that inserts the string in the Dictionary; it just assigns the next available code. Note that whenever a string is inserted, its maximal proper prefix (i.e., the string obtained by deleting the last character) is already there, so you only need to add the last character as a new child node of some other node that already exists in the tree.
To search for a string in the dictionary, look at the first character of the string (there are methods in the String class for extracting single characters or substrings of a String). Go to the corresponding index of the array, where the child list of that character is stored. Search the child list for the next character; if found, search the child list of that node for the third character; and so on. Note that the longestMatch routine will require substantial hacking; but keep the routines for the construction and manipulation of these trees appropriately encapsulated.
Here are some ideas, roughly in order of difficulty:
You should know by now. This is not an easy assignment. Get started early!
Please pass in a printout showing any code that you have changed. In addition, you should submit your project online in the CSUGLab as before in your personal handin directory \\goose\courses\cs410-fall99\proj4\<login-id>, where <login-id> is your CSUGLab login identifier. Please check as soon as possible that this directory exists and that you can save files there. Please do not wait till the day of the deadline to do this. Sorry, no diskettes accepted.
When you are ready to submit, copy all the files in your project directory to your handin directory. Please copy your entire project, including all executables, .class files, source files, and project (.vjp, .sln) files into your handin directory. We will run your code and test all functionality.
If you are working with a partner, you only need to submit once, and you may submit it in either partner's handin directory.
The deadline for electronic submission is Tuesday, December 7, 7pm (note new date!). At that time the permissions on the handin directory will change and you will not be able to access it. You may modify or replace files in your handin directory before the deadline as often as you wish.
Please contact firstname.lastname@example.org if you encounter problems.