Discussion 9: Huffman Coding

Download Handout download

Compression can help computers store and transfer files by reducing the amount of data that they contain. Some compression algorithms are lossy, meaning that they alter the contents of the file in a way that cannot be reversed. For example, the JPEG compression algorithm can significantly reduce the size of an image file by using many clever tricks to produce a smaller, nearly indistinguishable image from the original. However, the exact (pixel-perfect) original image can never be recovered. Remarkably, there are also lossless compression algorithms, which can (typically) reduce the size of a file but still allow the original file contents to be recovered. One such lossless compression algorithm, Huffman Coding, is used by your computer’s zip application (e.g., to create and extract files from the zip files we’ve used to distribute all the code in this course). The Huffman coding procedure uses both a priority queue and a binary tree to carry out its compression, and it is the focus of today’s discussion.

Learning Outcomes

Use data structures from the Java standard library.
Visualize operations on binary trees and heaps.
Demonstrate understanding of the Huffman coding procedure.
Use loop invariants to develop an iterative method involving a data structure.

Reminder: Discussion Guidelines

The work that you complete in discussion serves as a formative assessment tool, meaning it offers the opportunity to assess your understanding of the material and for our course staff to get a “pulse” on how things are going so we can make adjustments in future classes. Your grade in discussion is based entirely on attendance and participation; if you show up and you are actively engaged with the activity (working on the activity on your computer, discussing the activity with your group, asking and answering questions, etc.) for the entire 50-minute section period, you will earn full credit. If you complete the activity early, helping other students is a great way to further your own understanding. You do not need to submit any of the work that you complete during discussion.

Since discussion activities are not graded for correctness, we do not place any restrictions on resources that you may use to complete them, which include notes, books, unrestricted conversations with other students, internet searches, and the use of large language models or other generative AI. We advise you to be pragmatic about your use of these resources and think critically about whether they are enhancing your learning. Discussion activities are intended to serve as “strength training” for programming tasks we will expect on assignments and exams (and that you will encounter in future courses and careers), and their main benefit comes from critically thinking to “puzzle” them out.

Working together in small groups is encouraged during discussion!

Background: The Huffman Coding Procedure

In basic text encoding, we can imagine that each character is represented using the same number of bits. For example, the ASCII encoding represents each of the first 128 unicode characters (including all the common English characters in both uppercase and lowercase, numerals, and basic punctuation) using 7 bits. If we want to save space, we could try to encode the more common characters with fewer bits, at the expense of lengthening the encodings of some more infrequent characters. This is exactly the strategy employed in Huffman coding.

Prefix Trees

Huffman coding works by building up a prefix tree, which is a binary tree whose leaves are each labeled with a distinct character (one of the characters that appears in the text we are compressing). For example, the prefix tree for the String “an Albanian banana” looks like:

We’ll talk about how to construct this tree below. Here, we’ve labeled the tree connections with binary digits (0 and 1) to describe how the binary encoding of characters works using this prefix tree. To obtain the encoding of a character, we trace along the path from the tree root to that character’s node, and record the connection labels as we go. Every time we follow a left() connection, we record a 0, and every time we follow a right() connection, we record a 1. For example, the encoding of 'b' is 101. By carrying out this encoding process for all the characters in sequence and concatenating the results, we obtain our compressed binary text encoding. “an Albanian banana” becomes

000111010011111010001100000011101010001000100

To decode the text (given we have access to the prefix tree, or at least the binary String encodings associated with each character), we reverse this process. Starting at the beginning of the encoded String and at the root of the tree, we follow arrows until we reach a leaf node.

In this case, the initial 0 moves us into the left subtree of the root, and the following 0 takes us to the leaf node labeled 'a'. Thus, the first character of our original String must be 'a'. Then, we repeat this process, returning to the root node to process the next binary digit in the encoded String.

The fact that every leaf node (and no non-leaf node) in the tree is labeled with a character ensures that this decoding process can be carried out unambiguously (as every traversal path down a tree must end at a leaf).

Now that we know how to use a prefix tree, it remains to discuss how it is built in the first place.

Constructing a Prefix Tree

As we said earlier, Huffman coding works by assigning shorter encodings to more frequent characters and longer encodings to less frequent characters. This means that less frequent characters should end up lower in the prefix tree.

Our procedure to build the prefix tree will construct the tree from the bottom to the top. It will maintain a min priority queue of subtrees whose leaves are labeled with characters. The priority of each subtree is the sum of the frequencies of the characters in its leaves.

Initially, the priority queue contains a tree of size 1 for each character.
In each iteration of the construction loop, two subtrees are removed from the priority queue (the subtrees with the highest priorities, so the lowest frequencies). They become the left and right subtrees of a new subtree, which is added to the priority queue.
This process repeats until there is only one subtree in the priority queue. This will be the final prefix tree.

The following animation visualizes this construction process on our example from earlier, "an Albanian banana":

Notice that less frequent characters get “merged” earlier, so they get pushed down lower in the final prefix tree. This is exactly what we wanted. Now, we have all the pieces we need to carry out the Huffman coding procedure.

Your Tasks

For the rest of the discussion, you’ll walk through another example by hand to practice its mechanics. Then, you’ll complete an implementation of this procedure so you can see Huffman coding in action!

Exercise 1: Huffman Coding by Hand

In this exercise, you'll use Huffman coding to encode the String,

"carter raced a red car"

(a)

First, fill out the following table to determine the frequencies of each character in this String.

Character	Frequency

(b)

Next, use the procedure described in the previous section to draw the prefix tree.

Note: Our above description did not specify what should happen in the case of ties (when there are multiple subtrees in the priority queue with the same priority). This means that there can be multiple valid prefix trees (which, interestingly, all produce a final encoding using the same number of bits).

To make sure you end up with the same prefix tree as us, assume that the priority queue breaks ties by returning the tree with highest priority that contains the alphabetically earliest leftmost leaf (the space character is alphabetically before all letters). In addition, always draw the highest-priority subtree from each iteration to the right of the second-highest-priority subtree. These rules were used to break ties in the example from the previous section.

(c)

Now, use your prefix tree to fill out the following table of binary encodings for each character.

Character	Binary Encoding

(d)

Finally, use your table of character encodings to write the final binary String encoding our original String, “carter raced a red car”.

Exercise 2: Programming the Huffman Coding Procedure

Now that you've gotten more comfortable with the Huffman coding procedure, we can think about how to implement it using some of the data structures that we've studied. Download the starter code and take a couple minutes to read through the functionality that is already provided.

Download Starter Code download

Much of this code handles file input/output (in particular, serialization, the process of turning Java objects into Strings to write to files). Another portion uses HashMap objects (which implement the Map interface that we will discuss in the next lecture) to build the frequency and encoding tables that you created in parts (a) and (c) of the previous exercise. There are three TODOs that have been left for you to fill in.

(a)

buildPrefixTree()

Complete the definition of the buildPrefixTree() method, which carries out the tree construction procedure similar to part (b) of the previous exercise. Pay careful attention to the data types involved in this method.

The priority queue is not the MaxHeapPriorityQueue that we developed in lecture. Rather, it is the PriorityQueue from Java’s standard library (offering you the opportunity to work with more of the Java library code). The main difference is that Java’s priority queue uses the “natural ordering” of the objects themselves to determine priority, rather than allowing the user to specify the priority in add().

- The elements in the priority queue are of type `SymbolCount`. This is a record class defined at the top of the file that contains a subtree (specifically, an `ImmutableBinaryTree` from the lecture notes) and an integer frequency. The nodes of this subtree are labeled with `String`s, which should contain **all the characters of that subtree's leaves, in order from left to right**. The frequency is the sum of the frequencies of all these characters.

Complete the method definition to iteratively select the two highest-priority subtrees from the priority queue and combine them to build a larger subtree. Once the priority queue includes a single entry, return the tree in that entry.

(b)

getBinaryEncoding()

Complete the definition of the getBinaryEncoding() method, which returns the binary String encoding the given character, according to the given prefix tree (i.e., this method allows us to build a table similar to that from part (c) of the previous exercise).

In your definition, you’ll want to traverse down the prefix tree, building up the binary String that you will return as you go. Use the labels of each node to determine which child to traverse down to.

At this point, you should have a working Huffman encoder (i.e., a program that will compress text files). Running the main() method in HuffmanCoding.java will compress the text of Jules Verne's Twenty Thousand Leagues under the Sea. You should find that the compressed file is just over half the size of the original. Feel free to compress other files by supplying program arguments to this method.

(c)

decode()

The compressed files after Huffman coding appear nothing like the original (take a look!), so we’ll need a way to decode them and restore the original text. We’ve provided most of the definition of a decode() method that does just this. To complete this method, you must fill in the portion to scan over the binary String and restore the original characters.

To do this, write a loop over the bits in the BitSet object bits. You can read the ith bit by calling bits.get(i) which will return true if this bit is a 1 and false if this bit is a 0. The total number of bits is bits.length(). You’ll want to use the decodingMap to convert binary Strings back to characters. Read through the Map documentation (or look ahead at the Lecture 19 notes) to see which map operations may be useful.

The provided unit test checks that running the encoding procedure followed by the decoding procedure yields the exact file that you started with (i.e., the compression was lossless). If you pass this test, you can be fairly confident that your Huffman coding works correctly.