Huffman Coding

Suppose we want to compress a 100,000-byte data file that we know contains
only the six letters `A` through `F`. Since we have only six
distinct characters to encode, we can represent each one with three bits rather
than the eight bits normally used to store characters:

| Letter   | A   | B   | C   | D   | E   | F   |
|----------|-----|-----|-----|-----|-----|-----|
| Codeword | 000 | 001 | 010 | 011 | 100 | 101 |

This **fixed-length code** gives us a compression ratio of 5/8 = 62.5%.
Can we do better?
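The arithmetic behind that figure is worth a quick check. A sketch in SML (the variable names here are just for illustration):

```sml
(* Quick check of the savings: 3 bits per character instead of 8. *)
val originalBits = 100000 * 8   (* 800,000 bits as normally stored *)
val fixedBits    = 100000 * 3   (* 300,000 bits with the 3-bit code *)
val savedBits    = originalBits - fixedBits
(* savedBits = 500000, and 500,000/800,000 = 5/8 = 62.5% *)
```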

What if we knew the relative frequencies at which each letter occurred? It would be logical to assign shorter codes to the most frequent letters and save longer codes for the infrequent letters. For example, consider this code:

| Letter                | A  | B   | C   | D   | E    | F    |
|-----------------------|----|-----|-----|-----|------|------|
| Frequency (thousands) | 45 | 13  | 12  | 16  | 9    | 5    |
| Codeword              | 0  | 101 | 100 | 111 | 1101 | 1100 |

Using this code, our file can be represented with

(45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4) × 1,000 = 224,000 bits

or 28,000 bytes, which gives a compression ratio of 72%. In fact, this is an
optimal character code for this file (which is *not* to say that
the file is not further compressible by other means).
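We can verify the 224,000-bit figure directly from the frequency table. A small SML sketch, where each tuple is (letter, frequency in thousands, codeword length):

```sml
(* Each entry: (letter, frequency in thousands, codeword length). *)
val table = [(#"A", 45, 1), (#"B", 13, 3), (#"C", 12, 3),
             (#"D", 16, 3), (#"E", 9, 4), (#"F", 5, 4)]

(* Total bits = sum of frequency * codeword length, scaled back up by 1000. *)
val bits = 1000 * foldl (fn ((_, f, len), acc) => acc + f * len) 0 table
(* bits = 224000 *)
```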

Notice that in our variable-length code, no codeword is a prefix of any other
codeword. For example, we have a codeword 0, so no other codeword starts with 0.
And both of our four-bit codewords start with 110, which is not a codeword. A
code where no codeword is a prefix of any other is called a **prefix code**. Prefix codes are useful because they
make a stream of bits unambiguous: we simply accumulate bits from the stream
until we have completed a codeword. (Notice that encoding is simple regardless
of whether our code is a prefix code: we just build a dictionary from letters to
codewords, look up each letter we're trying to encode, and append the codewords
to an output stream.) It turns out that a prefix code can always
achieve the optimal compression for a character code, so we're not losing
anything by restricting ourselves to this type of character code.
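The dictionary-based encoding just described can be sketched in a few lines of SML; the `code` table and `encode` function here are illustrative names, not part of any library:

```sml
(* The variable-length code from the table above, as a lookup dictionary. *)
val code = [(#"A", "0"),   (#"B", "101"), (#"C", "100"),
            (#"D", "111"), (#"E", "1101"), (#"F", "1100")]

fun lookup (c : char) : string =
  case List.find (fn (c', _) => c = c') code of
    SOME (_, w) => w
  | NONE => raise Fail "character not in alphabet"

(* Encoding needs no tree at all: look up each character and append. *)
fun encode (s : string) : string =
  String.concat (map lookup (String.explode s))
(* encode "ABC" = "0101100" *)
```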

When we're decoding a stream of bits using a prefix code, what data structure might we want to use to help us determine whether we've read a whole codeword yet?

One convenient representation is to use a binary tree with the codewords stored in the leaves so that the bits determine the path to the leaf. This binary tree is a trie in which only the leaves map to letters. In our example, the codeword 1100 is found by starting at the root, moving down the right subtree twice and the left subtree twice:

```
            100
           /   \
          A     55
        [45]   /  \
             25    30
            /  \   /  \
           C    B 14    D
         [12][13] / \  [16]
                 F   E
               [5]   [9]
```
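Decoding by walking such a tree can be sketched as follows, using the `HTree` datatype we'll also use for Huffman's construction below (here `false` means a 0/left branch and `true` a 1/right branch; `decode` is an illustrative helper, not from any library):

```sml
datatype HTree = Leaf of char * int | Branch of HTree * int * HTree

fun decode (root : HTree) (bits : bool list) : char list =
  let
    fun step (t : HTree, bs : bool list) : char list =
      case t of
        Leaf (c, _) =>
          (* Reached a leaf: a codeword is complete; restart at the root. *)
          c :: (case bs of [] => [] | _ => step (root, bs))
      | Branch (l, _, r) =>
          (case bs of
             [] => []   (* stream ended mid-codeword *)
           | b :: rest => step (if b then r else l, rest))
  in
    case root of
      Leaf _ => []   (* degenerate one-letter alphabet; not handled here *)
    | _ => step (root, bits)
  end
```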

Here we've labeled the leaves with their frequencies and the branches with the
total frequencies of the leaves in their subtrees. You'll notice that this is a
**full binary tree**: every nonleaf node has two children. This happens to be true of all
optimal codes, so we can tell that our fixed-length code is suboptimal by
observing its tree:

```
             100
            /    \
          86      14
         /  \      |
       58    28    14
      /  \   /  \  /  \
     A    B C    D E    F
   [45] [13][12][16][9] [5]
```

Since we can restrict ourselves to full trees, we know that for an alphabet *C*,
we will have a tree with exactly |*C*| leaves and |*C*|−1 internal
nodes. Given a tree *T* corresponding to a prefix code, we can also compute
the number of bits required to encode a file:

B(T) = ∑_{c ∈ C} f(c) d_{T}(c)

where *f*(*c*) is the frequency of character *c* and *d*_{*T*}(*c*) is the
depth of *c*'s leaf in the tree *T*, which is also the length of *c*'s
codeword.

Huffman invented a simple algorithm for constructing such trees given the set
of characters and their frequencies. Like Dijkstra's algorithm, this is a
**greedy algorithm**, which
means that it makes choices that are locally optimal yet achieves a globally
optimal solution.

The algorithm constructs the tree in a bottom-up way. Given a set of leaves
containing the characters and their frequencies, we merge the two subtrees
that currently have the *smallest* frequencies. We perform this merging by
creating a parent node labeled with the sum of the frequencies of its two
children. We repeat this process until we have performed |*C*|−1
mergings and are left with a single tree.

As an example, use Huffman's algorithm to construct the tree for our input.
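Here is one possible run on our frequencies, showing the queue contents (as frequencies) before each merge; when frequencies tie, either choice yields an optimal tree, though possibly a different one:

```
Queue (frequencies)      Merge
45 13 12 16 9 5          F(5)  + E(9)  -> 14
45 13 12 16 14           C(12) + B(13) -> 25
45 16 14 25              14    + D(16) -> 30
45 25 30                 25    + 30    -> 55
45 55                    45    + 55    -> 100
100                      done
```

This sequence of merges produces exactly the tree shown earlier.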

How can we implement Huffman's algorithm efficiently? The operation we need to perform repeatedly is extraction of the two subtrees with smallest frequencies, so we can use a priority queue. We can express this in ML as:

```sml
datatype HTree = Leaf of char * int
               | Branch of HTree * int * HTree

fun huffmanTree (alpha : (char * int) list) : HTree =
  let
    val alphasize = length alpha
    fun freq (node : HTree) : int =
      case node of
        Leaf (_, i) => i
      | Branch (_, i, _) => i
    val q = new_heap (fn (x, y) => Int.compare (freq x, freq y)) alphasize
    fun merge (i : int) : HTree =
      if i = 0 then extract_min q
      else
        let
          val x = extract_min q
          val y = extract_min q
        in
          insert q (Branch (x, freq x + freq y, y));
          merge (i - 1)
        end
  in
    app (fn (c : char, i : int) : unit => insert q (Leaf (c, i))) alpha;
    merge (alphasize - 1)
  end
```

We won't prove that the result is an optimal prefix tree, but why does this algorithm produce a valid and full prefix tree? We can see that every time we merge two subtrees, we're differentiating the codewords of all of their leaves by prepending a 0 to all the codewords of the left subtree and a 1 to all the codewords of the right subtree. And every non-leaf node has exactly two children by construction.
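The prepending of bits described above can be made concrete by reading the codewords back off a finished tree. A sketch, using the `HTree` datatype defined above (`codewords` is an illustrative helper, not part of the course library):

```sml
(* Walk to each leaf, prepending "0" for a left branch and "1" for a
   right one; the accumulated path is the leaf's codeword. *)
fun codewords (t : HTree) : (char * string) list =
  case t of
    Leaf (c, _) => [(c, "")]   (* degenerate single-leaf tree *)
  | Branch (l, _, r) =>
      map (fn (c, w) => (c, "0" ^ w)) (codewords l) @
      map (fn (c, w) => (c, "1" ^ w)) (codewords r)
```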

Let's analyze the running time of this algorithm if our alphabet has *n*
characters. Building the initial queue takes O(*n* log *n*) time,
since each `insert`
operation takes O(log *n*) time. Then we perform *n*−1
merges, each of
which takes O(log *n*) time. Thus Huffman's algorithm takes O(*n* log *n*)
time.

If we want to compress a file with our current approach, we have to scan through the whole file to tally the frequencies of each character. Then we use the Huffman algorithm to compute an optimal prefix tree, and we scan the file a second time, writing out the codewords of each character of the file. But that's not sufficient. Why? We also need to write out the prefix tree so that the decompression algorithm knows how to interpret the stream of bits.
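One simple way to write out the tree (a sketch, using the `HTree` datatype above; `serialize` is an illustrative helper, and a real implementation would emit bits rather than characters) is a preorder walk, where `1` introduces a branch followed by its two subtrees and `0` introduces a leaf followed by its character. The decompressor rebuilds the tree by reading the same walk back:

```sml
fun serialize (t : HTree) : string =
  case t of
    Leaf (c, _) => "0" ^ String.str c
  | Branch (l, _, r) => "1" ^ serialize l ^ serialize r
```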

So our algorithm has one major potential drawback: we need to scan the whole input file before we can build the prefix tree. For large files, this can take a long time. (Disk access is very slow compared to CPU cycle times.) And in some cases it may be impractical: we may have a long stream of data that we'd like to compress, and it could be infeasible to accumulate all of the data before scanning it. We'd like an algorithm that lets us compress a stream of data without seeing the whole stream in advance.

The solution is **adaptive Huffman coding**, which builds the prefix
tree incrementally in such a way that the coding is always optimal for the
sequence of characters already seen. We start with a tree that has a frequency of
zero for each character. When we read an input character, we increment the
frequency of that character (and the frequency in all branches above it). We
may then have to modify the tree to maintain the invariant that the least
frequent characters are at the greatest depths. Because the tree is constructed
incrementally, the decoding algorithm can simply update its copy of the tree
after every character is decoded, so we don't need to include the prefix tree
along with the compressed data.