Lecture 19: Priority Queues, Heaps, Huffman Coding

Priority Queues

Priority queues are another abstract data type for a collection of elements, but with fewer operations than some of our other collection classes:

Each element has a priority, an element of a totally ordered set (usually a number)
More important things come out first, even if they were added later
Our convention: smaller number = higher priority
There is no operation to find out whether an arbitrary element is in the queue
Useful for event-based simulators (with priority = simulated time), real-time games, searching, routing, compression via Huffman coding

<% ShowSMLFile("lec19/imp_prioq.sml") %>
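As a rough guide to what such a signature might contain, here is a minimal sketch. It is not the contents of imp_prioq.sml: the signature name, `create`, `is_empty`, and the exception are assumptions, while the curried shape of `insert` and the shape of `extract_min` follow the way they are called in the Huffman code later in these notes.

```sml
(* Hypothetical sketch of an imperative priority queue signature.
   All names here are assumptions made for illustration. *)
signature PRIOQ_SKETCH =
sig
  type 'a prioq                       (* a mutable queue of 'a elements *)
  exception EmptyQueue

  (* create cmp: a new empty queue ordered by cmp; LESS means higher priority *)
  val create : ('a * 'a -> order) -> 'a prioq

  (* insert q x: add x to q *)
  val insert : 'a prioq -> 'a -> unit

  (* extract_min q: remove and return the highest-priority element;
     raises EmptyQueue if q is empty *)
  val extract_min : 'a prioq -> 'a

  val is_empty : 'a prioq -> bool
end
```

Note that the sketch has no membership test, matching the restriction listed above.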

There are many ways to implement this signature. For example, we could implement it as a linked list where the cells of the list are connected through refs so it can be updated imperatively:

<% ShowSMLFile("lec19/list_prioq.sml") %>

What is the asymptotic performance of this implementation?

insert: O(n), because it has to bubble a new element into its rightful place in the sorted list.
extract_min: O(1), because it can just remove the first element of the list.
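The two costs can be seen in a simplified sketch. This is not the ref-linked cell structure of list_prioq.sml: to keep it short it assumes integer priorities and stores an ordinary sorted list in a single ref, and the names `prioq`, `create`, and `EmptyQueue` are chosen for illustration.

```sml
(* Simplified sketch: a sorted int list held in one ref.
   Smaller number = higher priority. *)
type prioq = int list ref

exception EmptyQueue

fun create () : prioq = ref []

(* O(n): walk past every element that is <= x, then splice x in. *)
fun insert (q : prioq) (x : int) : unit =
  let
    fun ins [] = [x]
      | ins (y :: ys) = if x < y then x :: y :: ys else y :: ins ys
  in
    q := ins (!q)
  end

(* O(1): the minimum is always at the head of the list. *)
fun extract_min (q : prioq) : int =
  case !q of
      [] => raise EmptyQueue
    | y :: ys => (q := ys; y)
```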

Another alternative implementation is to use red-black trees or another of the balanced search trees. For example, in red-black trees we can find the minimum element by simply walking down the left children all the way from the root. Extracting the minimum element requires deleting it from the tree; we haven't seen how to do this, but it's about twice as complicated as the insertion we've already seen. This implementation has better performance for many applications:

insert: O(lg n), because an element must be inserted into the tree according to its priority, which serves as the key.
extract_min: O(lg n), because red-black deletion also requires walking up and down the tree.

In fact, we can tell that this is the best we can do in terms of asymptotic performance when priorities are only compared: we can implement sorting using O(n) priority queue operations, and we know that comparison-based sorting takes Ω(n lg n) time in the worst case. So insert and extract_min cannot both run in o(lg n) time. The idea is simply to insert all the elements to be sorted into the priority queue, and then use extract_min to pull them out in the right order:

<% ShowSMLFile("lec19/heapsort.sml") %>
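The reduction from sorting to priority queue operations can be sketched independently of any particular queue. In the sketch below, `pq_sort` and its three parameters are hypothetical names, and the demo queue (a sorted list in a ref) is only a stand-in so the example runs on its own.

```sml
(* Sort with any priority queue, given its operations as arguments:
   n inserts followed by n extract_min calls. *)
fun pq_sort (create, insert, extract_min) xs =
  let
    val q = create ()
    val () = app (insert q) xs
  in
    (* List.tabulate builds its result from left to right,
       so the elements come out in priority order. *)
    List.tabulate (length xs, fn _ => extract_min q)
  end

(* Throwaway demo queue: an int list ref kept sorted. *)
fun demo_create () = ref ([] : int list)

fun demo_insert q x =
  let
    fun ins [] = [x]
      | ins (y :: ys) = if x < y then x :: y :: ys else y :: ins ys
  in
    q := ins (!q)
  end

fun demo_extract q =
  case !q of
      y :: ys => (q := ys; y)
    | [] => raise Empty

val sorted = pq_sort (demo_create, demo_insert, demo_extract) [12, 3, 10, 5, 9, 6]
(* sorted = [3, 5, 6, 9, 10, 12] *)
```

If both queue operations ran in o(lg n) time, `pq_sort` would sort in o(n lg n) comparisons, contradicting the lower bound.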

Heaps

Although they have good asymptotic performance, it turns out that red-black trees are overkill for implementing priority queues: they are more complicated and slower than necessary. There is a simple, fast way to implement priority queues.

A heap is a special kind of balanced binary tree. (The term heap is also used to refer to the part of computer memory that is used to allocate new objects at run time; this is a different kind of heap.) The tree satisfies two invariants:

The priorities of the children of a node are at least as large as the priority of the parent. By implication, the node at the top (root) of the tree has minimum priority.
The different paths from root to leaf differ in length by at most one. At the bottom of the tree there may be some missing leaves; these are to the right of all of the leaves that are present.

Suppose the priorities are just numbers. Here is a possible heap:

              3
             / \
            /   \
           5     9
          / \   /
         12  6 10

Obviously we can find the minimum element in O(1) time. Extracting it while maintaining the heap invariant will take O(lg n) time. Inserting a new element and establishing the heap invariant will also take O(lg n) time. So asymptotic performance is the same as for red-black trees but constant factors are better for heaps.

The key observation is that we can represent a heap as an array.

The root of the tree is at location 0 in the array and the children of the node stored at position i are at locations 2i+1 and 2i+2. This means that the array corresponding to the tree contains all the elements of the tree, read across row by row. The representation of the tree above is:

[3 5 9 12 6 10]

Given an element at index i, we can compute where the children are stored, and conversely we can go from a child at index j to its parent at index floor((j-1)/2).
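As a small illustration of the index arithmetic, using the example array above (the function names `left`, `right`, and `parent` are chosen here for illustration):

```sml
fun left i = 2 * i + 1
fun right i = 2 * i + 2
fun parent j = (j - 1) div 2   (* meaningful for j >= 1 *)

val heap = Vector.fromList [3, 5, 9, 12, 6, 10]

(* The element 5 sits at index 1; its children are at indices 3 and 4. *)
val kids_of_5 = (Vector.sub (heap, left 1), Vector.sub (heap, right 1))
(* kids_of_5 = (12, 6) *)

(* The element 10 sits at index 5; its parent is at index 2. *)
val parent_of_10 = Vector.sub (heap, parent 5)
(* parent_of_10 = 9 *)
```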

The rep invariant for heaps in this representation is actually simpler than when in tree form:

Rep invariant for heap a (the partial ordering property):

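a[i] <= a[2i+1] whenever 2i+1 < n, and a[i] <= a[2i+2] whenever 2i+2 < n
for 0 <= i < n, where n is the number of elements (equivalently, a[floor((j-1)/2)] <= a[j] for 1 <= j <= n-1)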
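One way to check the partial ordering property is to compare every non-root element with its parent. The following is a sketch assuming integer priorities stored in positions 0 through n-1 of an array; `is_heap` is a name chosen for illustration.

```sml
(* True if every element at index j >= 1 is at least as large as
   its parent at index (j - 1) div 2. *)
fun is_heap (a : int array, n : int) : bool =
  let
    fun ok j =
      j >= n orelse
      (Array.sub (a, (j - 1) div 2) <= Array.sub (a, j) andalso ok (j + 1))
  in
    ok 1
  end

val good = is_heap (Array.fromList [3, 5, 9, 12, 6, 10], 6)   (* true *)
val bad = is_heap (Array.fromList [9, 5, 4, 12, 6, 10], 6)    (* false: 9 > 5 *)
```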

Now let's see how to implement the priority queue operations:

insert

1. Put the element at the first missing leaf. (Extend the array by one element.)
2. Switch it with its parent if its parent is larger: "bubble up".
3. Repeat step 2 as necessary.

Example: inserting 4 into previous tree.

              3
             / \
            /   \
           5     9        [3 5 9 12 6 10 4]
          / \   / \
         12  6 10  4

              3
             / \
            /   \
           5     4        [3 5 4 12 6 10 9]
          / \   / \
         12  6 10  9

This operation requires only O(lg n) time: the tree has depth floor(lg n), and we do a bounded amount of work on each level.
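The bubble-up steps can be sketched as follows. This is not the code in heap.sml: it assumes integer priorities and a fixed-capacity array, and `int_heap`, `new_int_heap`, `swap`, and `contents` are names introduced for illustration.

```sml
(* Elements live in positions 0 .. !size - 1 of data. *)
type int_heap = { data : int array, size : int ref }

fun new_int_heap (capacity : int) : int_heap =
  { data = Array.array (capacity, 0), size = ref 0 }

fun swap (a : int array, i : int, j : int) : unit =
  let val t = Array.sub (a, i)
  in Array.update (a, i, Array.sub (a, j)); Array.update (a, j, t) end

fun insert ({ data, size } : int_heap) (x : int) : unit =
  let
    (* Swap the element at j with its parent while the parent is larger. *)
    fun bubble_up j =
      let val p = (j - 1) div 2
      in
        if j > 0 andalso Array.sub (data, j) < Array.sub (data, p)
        then (swap (data, j, p); bubble_up p)
        else ()
      end
  in
    Array.update (data, !size, x);   (* raises Subscript if the array is full *)
    size := !size + 1;
    bubble_up (!size - 1)
  end

fun contents ({ data, size } : int_heap) : int list =
  List.tabulate (!size, fn i => Array.sub (data, i))

val h = new_int_heap 16
val () = app (insert h) [3, 5, 9, 12, 6, 10]
val () = insert h 4
val result = contents h   (* [3, 5, 4, 12, 6, 10, 9] *)
```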

extract_min

extract_min works by returning the element at the root.

Guaranteed to be the most important (smallest value) by the partial ordering property.
Now we have the two subtrees to put right, though.

The trick is this:

Copy the last leaf (last element) to the root (first element), and shrink the array by one.
If it's larger than one of the children, bubble it down.
Swap with the higher priority (smaller) child, to make sure the parent is always at least as important as both children.

Original heap to delete top element from (leaves two subheaps)

              3
             / \
            /   \
           5     4        [3 5 4 12 6 10 9]
          / \   / \
         12  6 10  9

copy last leaf to root

              9
             / \
            /   \
           5     4        [9 5 4 12 6 10]
          / \   /
         12  6 10

"push down"

              4
             / \
            /   \
           5     9        [4 5 9 12 6 10]
          / \   /
         12  6 10

Again an O(lg n) operation.
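A sketch of the push-down steps, again assuming integer priorities in an array whose first `!size` positions hold the heap. The names `EmptyHeap`, `swap`, and `push_down` are assumptions for illustration rather than the contents of heap.sml.

```sml
exception EmptyHeap

fun swap (a : int array, i : int, j : int) : unit =
  let val t = Array.sub (a, i)
  in Array.update (a, i, Array.sub (a, j)); Array.update (a, j, t) end

fun extract_min (data : int array, size : int ref) : int =
  if !size = 0 then raise EmptyHeap
  else
    let
      val top = Array.sub (data, 0)
      val n = !size - 1                  (* size after removal *)
      (* Swap the element at i with its smaller child until neither
         child is smaller. *)
      fun push_down i =
        let
          val l = 2 * i + 1
          val r = 2 * i + 2
          val s = if l < n andalso Array.sub (data, l) < Array.sub (data, i)
                  then l else i
          val s = if r < n andalso Array.sub (data, r) < Array.sub (data, s)
                  then r else s
        in
          if s <> i then (swap (data, i, s); push_down s) else ()
        end
    in
      Array.update (data, 0, Array.sub (data, n));   (* last leaf to root *)
      size := n;
      push_down 0;
      top
    end

val data = Array.fromList [3, 5, 4, 12, 6, 10, 9]
val size = ref 7
val m = extract_min (data, size)                                (* 3 *)
val rest = List.tabulate (!size, fn i => Array.sub (data, i))   (* [4, 5, 9, 12, 6, 10] *)
```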

We can sort using this implementation of priority queues.
How expensive is the sorting function built from this?

n insertions, at O(lg n) cost, for O(n lg n) total
n deletions, at O(lg n) cost, for O(n lg n) total.

Thus, O(n lg n) total cost.

It's called heapsort and it's a reliable standard sorting algorithm.

If you have to sort by doing comparisons only, this is as fast as possible (up to a constant factor). There are plenty of other O(n lg n) algorithms with better properties in some cases, for example:

smaller constant factor in practice (quicksort, which is O(n lg n) on average but O(n^2) in the worst case)
faster if the list is already sorted (mergesort)

One last comment -- you might be worried about the fixed size for the array of values. The solution is just to use a resizable array abstraction, which you know how to build.

<% ShowSMLFile("lec19/heap.sml") %>

Huffman Coding

Fixed-Length Codes

Suppose we want to compress a 100,000-byte data file that we know contains only the letters A through F. Since we have only six distinct characters to encode, we can represent each one with three bits rather than the eight bits normally used to store characters:

Letter     A    B    C    D    E    F
Codeword   000  001  010  011  100  101

This fixed-length code shrinks the file to 3/8 of its original size (37,500 bytes), a saving of 5/8 = 62.5%. Can we do better?

Variable-Length Codes

What if we knew the relative frequencies at which each letter occurred? It would be logical to assign shorter codes to the most frequent letters and save longer codes for the infrequent letters. For example, consider this code:

Letter          A    B    C    D    E     F
Frequency (K)   45   13   12   16   9     5
Codeword        0    101  100  111  1101  1100

Using this code, our file can be represented with

(45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4) × 1000 = 224 000 bits

or 28 000 bytes, a saving of 72% over the original 100,000 bytes. In fact, this is an optimal character code for this file (which is not to say that the file is not further compressible by other means).
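The arithmetic for both codes can be checked with a few lines of ML. The list and function names below are chosen for illustration; the frequencies and codeword lengths are the ones in the tables above.

```sml
val freqs = [45, 13, 12, 16, 9, 5]     (* thousands of occurrences of A..F *)
val fixed_lens = [3, 3, 3, 3, 3, 3]    (* codeword lengths, fixed-length code *)
val var_lens = [1, 3, 3, 3, 4, 4]      (* codeword lengths, variable-length code *)

(* Total bits = 1000 * sum over letters of frequency * codeword length. *)
fun total_bits (fs, ls) =
  1000 * ListPair.foldl (fn (f, l, acc) => acc + f * l) 0 (fs, ls)

val fixed_bits = total_bits (freqs, fixed_lens)   (* 300000 *)
val var_bits = total_bits (freqs, var_lens)       (* 224000 *)
val var_bytes = var_bits div 8                    (* 28000 *)
```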

Prefix Codes

Notice that in our variable-length code, no codeword is a prefix of any other codeword. For example, we have a codeword 0, so no other codeword starts with 0. And both of our four-bit codewords start with 110, which is not a codeword. A code where no codeword is a prefix of any other is called a prefix code. Prefix codes are useful because they make a stream of bits unambiguous; we can simply accumulate bits from a stream until we have completed a codeword. (Notice that encoding is simple regardless of whether our code is a prefix code: we just build a dictionary of letters to codewords, look up each letter we're trying to encode, and append the codewords to an output stream.) It turns out that prefix codes can always be used to achieve the optimal compression for a character code, so we're not losing anything by restricting ourselves to this type of character code.

When we're decoding a stream of bits using a prefix code, what data structure might we want to use to help us determine whether we've read a whole codeword yet?

One convenient representation is to use a binary tree with the codewords stored in the leaves so that the bits determine the path to the leaf. This binary tree is a trie in which only the leaves map to letters. In our example, the codeword 1100 is found by starting at the root, moving down the right subtree twice and the left subtree twice:

      100
    /     \
  A         55
[45]      /    \
       25        30
     /  \      /  \
     C    B   14    D
   [12] [13] /  \  [16]
            F    E
           [5]  [9]

Here I've labeled the leaves with their frequencies and the branches with the total frequencies of the leaves in their subtrees. You'll notice that this is a full binary tree: every nonleaf node has two children. This happens to be true of all optimal codes, so we can tell that our fixed-length code is suboptimal by observing its tree:

                  100
              /         \
         86                 14
       /    \             /
    58        28       14
   /  \      /  \     /  \
  A    B    C    D   E    F
[45] [13] [12] [16] [9]  [5]
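Decoding with such a tree can be sketched as follows, using the variable-length tree (the first of the two trees above). The datatype is the one used for Huffman's algorithm later in these notes; representing the bit stream as a string of '0' and '1' characters, and the names `example`, `decode`, and `BadCode`, are assumptions made to keep the example short.

```sml
datatype HTree = Leaf of char * int | Branch of HTree * int * HTree

(* The variable-length code tree: left edge = 0, right edge = 1. *)
val example =
  Branch (Leaf (#"A", 45), 100,
          Branch (Branch (Leaf (#"C", 12), 25, Leaf (#"B", 13)), 55,
                  Branch (Branch (Leaf (#"F", 5), 14, Leaf (#"E", 9)), 30,
                          Leaf (#"D", 16))))

exception BadCode

fun decode (tree : HTree) (bits : string) : string =
  let
    (* Follow bits down from node until a leaf is reached; return the
       letter found there and the bits not yet consumed. *)
    fun one (Leaf (c, _), bs) = (c, bs)
      | one (Branch (l, _, _), #"0" :: bs) = one (l, bs)
      | one (Branch (_, _, r), #"1" :: bs) = one (r, bs)
      | one (Branch _, _) = raise BadCode   (* out of bits, or not a bit *)
    fun all [] = []
      | all bs = let val (c, rest) = one (tree, bs) in c :: all rest end
  in
    case tree of
        Leaf _ => raise BadCode   (* a one-letter tree has no codewords *)
      | Branch _ => implode (all (explode bits))
  end

val decoded = decode example "01011100"   (* "ABF" *)
```

Because no codeword is a prefix of another, reaching a leaf always marks the end of exactly one codeword, so the decoder never needs to look ahead.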

Since we can restrict ourselves to full trees, we know that for an alphabet C, we will have a tree with exactly |C| leaves and |C|-1 internal nodes. Given a tree T corresponding to a prefix code, we also can compute the number of bits required to encode a file:

B(T) = \sum_{c \in C} f(c) d_T(c)

where f(c) is the frequency of character c and d_T(c) is the depth of the character in the tree (which also is the length of the codeword for c). We call B(T) the cost of the tree T.
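The cost can be computed by a single traversal that tracks depth. In this sketch `cost` and `example` are names chosen for illustration, and the example tree is the variable-length code tree from above with frequencies in thousands.

```sml
datatype HTree = Leaf of char * int | Branch of HTree * int * HTree

(* B(T): sum over the leaves of frequency times depth. *)
fun cost (t : HTree) : int =
  let
    fun go (Leaf (_, f), d) = f * d
      | go (Branch (l, _, r), d) = go (l, d + 1) + go (r, d + 1)
  in
    go (t, 0)
  end

val example =
  Branch (Leaf (#"A", 45), 100,
          Branch (Branch (Leaf (#"C", 12), 25, Leaf (#"B", 13)), 55,
                  Branch (Branch (Leaf (#"F", 5), 14, Leaf (#"E", 9)), 30,
                          Leaf (#"D", 16))))

val b = cost example   (* 224, i.e. 224 000 bits *)
```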

Huffman's Algorithm

Huffman invented a simple algorithm for constructing such trees given the set of characters and their frequencies. The algorithm is greedy, which means that it makes choices that are locally optimal.

The algorithm constructs the tree in a bottom-up way. Given a set of leaves containing the characters and their frequencies, we merge the current two subtrees with the smallest frequencies. We perform this merging by creating a parent node labeled with the sum of the frequencies of its two children. Then we repeat this process until we have performed |C|-1 mergings to produce a single tree.

As an example, use Huffman's algorithm to construct the tree for our input.

How can we implement Huffman's algorithm efficiently? The operation we need to perform repeatedly is extraction of the two subtrees with smallest frequencies, so we can use a priority queue. We can express this in ML as:

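datatype HTree = Leaf of char * int | Branch of HTree * int * HTree

(* Build a Huffman tree from a list of (character, frequency) pairs.
   Assumes alpha is nonempty. new_heap, insert, and extract_min are
   assumed to be the operations of the heap implementation above. *)
fun huffmanTree(alpha : (char * int) list) : HTree =
    let val alphasize = length(alpha)
        (* total frequency stored at the root of a subtree *)
        fun freq(node:HTree):int = case node of
                                       Leaf(_,i) => i
                                     | Branch(_,i,_) => i
        (* heap ordered by frequency, so extract_min yields the rarest subtree *)
        val q = new_heap (fn (x,y) => Int.compare(freq x, freq y)) alphasize
        (* perform i merges, then return the single remaining tree *)
        fun merge(i:int):HTree =
            if i = 0 then extract_min(q)
            else let val x = extract_min(q)
                     val y = extract_min(q)
                 in
                     insert q (Branch(x, freq(x)+freq(y), y));
                     merge(i-1)
                 end
    in
        (* start with one leaf per character *)
        app (fn (c:char,i:int) => insert q (Leaf(c,i))) alpha;
        merge(alphasize-1)
    end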

We won't prove that the result is an optimal prefix tree, but why does this algorithm produce a valid and full prefix tree? We can see that every time we merge two subtrees, we're differentiating the codewords of all of their leaves by prepending a 0 to all the codewords of the left subtree and a 1 to all the codewords of the right subtree. And every nonleaf node has exactly two children by construction.
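The codewords can be read off a finished tree by the same rule, appending a 0 on the way into a left subtree and a 1 on the way into a right subtree. In this sketch `codes` and `example` are names chosen for illustration, and the tree is the one for our running example.

```sml
datatype HTree = Leaf of char * int | Branch of HTree * int * HTree

(* List every letter with its codeword, in left-to-right leaf order. *)
fun codes (t : HTree) : (char * string) list =
  let
    fun go (Leaf (c, _), prefix) = [(c, prefix)]
      | go (Branch (l, _, r), prefix) =
          go (l, prefix ^ "0") @ go (r, prefix ^ "1")
  in
    go (t, "")
  end

val example =
  Branch (Leaf (#"A", 45), 100,
          Branch (Branch (Leaf (#"C", 12), 25, Leaf (#"B", 13)), 55,
                  Branch (Branch (Leaf (#"F", 5), 14, Leaf (#"E", 9)), 30,
                          Leaf (#"D", 16))))

val table = codes example
(* [(#"A","0"), (#"C","100"), (#"B","101"),
    (#"F","1100"), (#"E","1101"), (#"D","111")] *)
```

Each leaf's codeword is its root-to-leaf path, and distinct leaves have paths that diverge at some branch, which is why no codeword can be a prefix of another.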

Let's analyze the running time of this algorithm if our alphabet has n characters. Building the initial queue takes time O(n log n) since each enqueue operation takes O(log n) time. Then we perform n-1 merges, each of which takes time O(log n). Thus Huffman's algorithm takes O(n log n) time.
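To experiment with the algorithm without the heap module, here is a self-contained sketch that swaps the heap for a list kept sorted by frequency. That substitution makes each insertion O(n), so this version runs in O(n^2) rather than O(n log n); the names `huffman`, `insert_sorted`, and `EmptyAlphabet` are assumptions for illustration.

```sml
datatype HTree = Leaf of char * int | Branch of HTree * int * HTree

fun freq (Leaf (_, f)) = f
  | freq (Branch (_, f, _)) = f

(* Insert into a list kept sorted by frequency; stands in for the heap. *)
fun insert_sorted (t, []) = [t]
  | insert_sorted (t, u :: us) =
      if freq t <= freq u then t :: u :: us else u :: insert_sorted (t, us)

exception EmptyAlphabet

fun huffman (alpha : (char * int) list) : HTree =
  let
    (* Repeatedly merge the two least frequent subtrees. *)
    fun merge [] = raise EmptyAlphabet
      | merge [t] = t
      | merge (x :: y :: rest) =
          merge (insert_sorted (Branch (x, freq x + freq y, y), rest))
  in
    merge (foldl (fn ((c, f), q) => insert_sorted (Leaf (c, f), q)) [] alpha)
  end

val tree = huffman [(#"A", 45), (#"B", 13), (#"C", 12),
                    (#"D", 16), (#"E", 9), (#"F", 5)]
```

On our input the merges happen in the order F+E = 14, C+B = 25, 14+D = 30, 25+30 = 55, A+55 = 100, which reproduces the variable-length code tree shown earlier.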

Adaptive Huffman Coding

If we want to compress a file with our current approach, we have to scan through the whole file to tally the frequencies of each character. Then we use the Huffman algorithm to compute an optimal prefix tree, and we scan the file a second time, writing out the codewords of each character of the file. But that's not sufficient. Why? We also need to write out the prefix tree so that the decompression algorithm knows how to interpret the stream of bits.

So our algorithm has one major potential drawback: we need to scan the whole input file before we can build the prefix tree. For large files, this can take a long time. (Disk access is very slow compared to CPU cycle times.) And in some cases it may be unreasonable; we may have a long stream of data that we'd like to compress, and it could be impractical to accumulate the data until we can scan it all. We'd like an algorithm that allows us to compress a stream of data without having to see all of it, and build the whole prefix tree, in advance.

The solution is adaptive Huffman coding, which builds the prefix tree incrementally in such a way that the coding is always optimal for the sequence of characters already seen. We start with a tree that has a frequency of zero for each character. When we read an input character, we increment the frequency of that character (and the frequency in all branches above it). We then may have to modify the tree to maintain the invariant that the least frequent characters are at the greatest depths. Because the tree is constructed incrementally, the decoding algorithm can simply update its copy of the tree after every character is decoded, so we don't need to include the prefix tree along with the compressed data.

References

Cormen, Leiserson, and Rivest. Introduction to Algorithms.
Aho, Hopcroft, and Ullman. Data Structures and Algorithms.