Huffman coding

This is a coding scheme in which each character, instead of being coded
as a fixed-length bit string as in ASCII or Unicode, receives a
variable-length code according to its frequency in the text.  More
frequent letters get shorter codes and rarer letters get longer codes.
This saves space.

Example: To code the string

    aaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbcd

(25 a's, 25 b's, 1 c, and 1 d) we might assign a fixed-length code such
as

    a = 00
    b = 01      (*)
    c = 10
    d = 11

Each symbol is 2 bits, so the length of the coded sequence is
50 + 50 + 2 + 2 = 104 bits.  However, if we assign

    a = 0
    b = 10      (**)
    c = 110
    d = 111

the coded sequence is only 25 + 50 + 3 + 3 = 81 bits.

If codewords are assigned so that no code is a prefix of another code,
then the coding scheme is /self-delimiting/.  Any fixed-length code such
as (*) is self-delimiting.  So is (**).  An example of a
non-self-delimiting code would be

    a = 001
    b = 101
    q = 101001

When decoding, we would not be able to tell the difference between ba
and q.  The problem is that the code for b, 101, is a prefix of the code
for q, 101001.  If we disallow this, the input sequence is unambiguously
determined by the coded sequence.

A self-delimiting code can be represented by a binary tree with the
characters at the leaves.  The binary code of a character is given by
the path from the root down to the character, where 0 = left and
1 = right.  This tree can be used to decode a coded stream.  As the bits
come in, trace the path they describe down the tree until hitting a
leaf, and output the character there.  Then go back up to the root and
start again.  For example, the codes (*) and (**) would be represented
respectively by

          *                     *
         / \                   / \
        /   \                 a   *
       *     *                   / \
      / \   / \                 b   *
     a   b c   d                   / \
                                  c   d

Given the frequency distribution on the characters in the input string
(number of occurrences of each character), there is a coding scheme that
is OPTIMAL for that frequency distribution in the sense that it
minimizes the length of the coded input string.  Such a code is called a
/Huffman code/.

We can construct a Huffman code for a given frequency distribution in a
greedy fashion.  Given an alphabet a,b,c,... with frequencies
w(a),w(b),w(c),... (w(x) is the number of occurrences of x in the input
text to be coded), form a forest of binary trees as follows.  Start by
making every character x a tree of a single node with weight w(x).  Now
repeat the following: find two trees u and v of minimum weight, say w(u)
and w(v); create a new tree with root y of weight w(y) = w(u) + w(v) and
subtrees u and v.  Continue combining pairs of minimum weight trees like
this until there is only one tree left.

Example: for the input string above, we would start with

    a 25    b 25    c 1    d 1

In the first stage we would choose c and d and combine them to give

    a 25    b 25      2
                     / \
                   c 1   d 1

In the second stage we would combine b and that tree to get

    a 25       27
              /  \
           b 25    2
                  / \
                c 1   d 1

Finally we would combine the two remaining trees to get

         52
        /  \
    a 25    27
           /  \
        b 25    2
               / \
             c 1   d 1

which gives the code (**) above.  Thus this code is optimal.  There is
no coding scheme assigning a fixed bit string to each character that
gets better compression for this frequency distribution (although other
schemes that assign codes to /strings/, such as Lempel-Ziv, may do
better).
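To make the greedy construction concrete, here is a short sketch in
Python (not part of the original notes; build_huffman_tree and
assign_codes are made-up names, and heapq plays the role of the priority
queue of trees).  Ties among minimum-weight trees may be broken
differently than in the example above, which can change the individual
codewords but not the total coded length.

    import heapq
    from collections import Counter

    def build_huffman_tree(freq):
        """freq: dict mapping character -> number of occurrences w(x)."""
        # Heap entries are (weight, tiebreak, tree); a tree is either a
        # single character (a leaf) or a pair (left, right).
        heap = [(w, i, ch) for i, (ch, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            wu, _, u = heapq.heappop(heap)     # two trees of minimum weight
            wv, _, v = heapq.heappop(heap)
            heapq.heappush(heap, (wu + wv, count, (u, v)))
            count += 1
        return heap[0][2]

    def assign_codes(tree, prefix=""):
        """Read the codewords off the tree: 0 = left, 1 = right."""
        if isinstance(tree, str):              # leaf
            return {tree: prefix or "0"}
        left, right = tree
        codes = assign_codes(left, prefix + "0")
        codes.update(assign_codes(right, prefix + "1"))
        return codes

    text = "a" * 25 + "b" * 25 + "c" + "d"
    codes = assign_codes(build_huffman_tree(Counter(text)))
    print(codes)                               # an optimal prefix code
    print(sum(len(codes[ch]) for ch in text))  # 81 bits, as with (**)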
To compute a Huffman code, you have to know the exact frequencies of the
characters in advance, which may require two passes through the data,
once to count the characters and once to do the coding.  (However, you
can do quite well if you have a good estimate.  For example, in most
English text, e, t, and s occur frequently and z, q, and x occur rarely.
The frequency of e is about 13%, while those of q and z are only about
0.1%.)  Alternatively, you can use adaptive Huffman coding, which
constructs the tree that is optimal for the character sequence seen so
far.  It updates the frequencies (and the codes) as new characters come
in.

To prove that Huffman codes are optimal, we first make a couple of
definitions.  For a given binary tree T with weights w(x) at the leaves,
define inductively for each non-leaf u

    w(u) = w(left(u)) + w(right(u))

where left(u) and right(u) are the left and right child of u,
respectively.  Let

    d(u,x) = the length of the path from u down to x, if x is a
             descendant of u

    |x| = d(r,x), where r is the root.

If x is a leaf, |x| is the length of the codeword assigned to x.

Define the /weighted path length/ of a node u inductively as

    W(u) = 0, if u is a leaf
    W(u) = W(left(u)) + W(right(u)) + w(left(u)) + w(right(u))
         = W(left(u)) + W(right(u)) + w(u).

It can be shown by induction that

    W(u) = SUM w(x).d(u,x)
           x a leaf below u

In particular, if r is the root, then

    W(r) = SUM w(x).|x|
           x a leaf

is the total length of the encoded string, since each x occurs w(x)
times in the input string and each occurrence contributes |x| to the
length of the coded string.

Let N be the length (number of characters) of the input string.  For
each character x, let p(x) = w(x)/N.  p(x) is the probability that a
randomly chosen character in the input string, all character positions
equally likely, is x.  For a non-leaf u, define inductively

    p(u) = p(left(u)) + p(right(u)),

where left(u) and right(u) are the left and right subtrees of u,
respectively.  Note p(u) = w(u)/N.  p(u) is the probability that a
randomly chosen character from the input string lies in the subtree
rooted at u.  One can show by induction that

    p(u) = SUM p(x)
           x a leaf below u

A quantity closely related to the weighted path length is entropy.
Define the entropy of a node u inductively as

    H(u) = 0, if u is a leaf
    H(u) = H(left(u)) + H(right(u)) + p(left(u)) + p(right(u))
         = H(left(u)) + H(right(u)) + p(u).

One can show by induction that

    H(u) = SUM p(x).d(u,x)
           x a leaf below u

If r is the root, then

    H(r) = SUM p(x).|x|
           x a leaf

         = W(r)/N

This is the average code length for letters chosen randomly from the
input string.  Thus we wish to minimize H(r) (equivalently, minimize
W(r)) over all binary trees with leaves a,b,... and weights
w(a),w(b),...
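As a quick sanity check on these identities, here is a small sketch in
Python (not part of the notes; weight, W, and weighted_depths are
made-up names) that evaluates both sides on the tree (**) from the
example above: the inductive definition of W agrees with
SUM w(x).d(r,x), and H(r) = W(r)/N is the average code length.

    def weight(t):
        """w(u); a tree is a leaf (char, weight) or a pair (left, right)."""
        return t[1] if isinstance(t[0], str) else weight(t[0]) + weight(t[1])

    def W(t):
        """Weighted path length: 0 at a leaf, else W(left) + W(right) + w(u)."""
        if isinstance(t[0], str):
            return 0
        return W(t[0]) + W(t[1]) + weight(t)

    def weighted_depths(t, d=0):
        """SUM of w(x) * d(t,x) over the leaves x below t."""
        if isinstance(t[0], str):
            return t[1] * d
        return weighted_depths(t[0], d + 1) + weighted_depths(t[1], d + 1)

    # The tree for code (**): a = 0, b = 10, c = 110, d = 111.
    tree = (("a", 25), (("b", 25), (("c", 1), ("d", 1))))
    N = weight(tree)                       # 52 characters in the input string
    print(W(tree), weighted_depths(tree))  # both 81, the coded length
    print(W(tree) / N)                     # H(r) = W(r)/N, about 1.56 bits/char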
Let T be a binary tree with root r.  Let S be a set of vertices of T, no
two of which are related to each other in the ancestor/descendant
relationship.  Let T' be the tree obtained by deleting all the subtrees
below any node in S, but retaining the nodes in S and their weights.
For example, if T is

              27
             /  \
            /    \
          14      13*
         /  \    /  \
       5*    9  6    7
      / \   / \
     3   2 5   4*

where the nodes marked with * are in S, then T' would be

              27
             /  \
            /    \
          14      13
         /  \
        5    9
            / \
           5   4

Let H and H' refer to the entropy functions of T and T', respectively.

Lemma 1.  For any S,

    H(r) = H'(r) + SUM H(s).
                  s in S

This can be proved by induction on the depth of the tree.

Examples: For S = {r} or S = {leaves of T}, the lemma reduces to the
trivial identity H(r) = H(r).  For S = {left(r), right(r)}, the lemma
says H(r) = p(r) + H(left(r)) + H(right(r)), which is just the inductive
definition of H(r).

Lemma 2.

(a) Let u and v be nodes of T such that w(u) = w(v) or |u| = |v|.  Then
the tree T' obtained from T by switching the subtrees rooted at u and v
has the same entropy as T.

(b) Let u and v be nodes of T such that w(u) < w(v) and |u| < |v|.  Then
the tree T' obtained from T by switching the subtrees rooted at u and v
has smaller entropy than T.

Proof.  By Lemma 1, the difference in entropy when u and v are switched
is (up to the factor 1/N)

    |u|(w(v) - w(u)) + |v|(w(u) - w(v)) = (|u| - |v|)(w(v) - w(u)),

which is 0 in case (a) and negative in case (b).

If there are k characters in all, there will be 2k-1 nodes of T.

Theorem 1.  T is optimal iff there is a numbering of the nodes of T, say
n_0, n_1, ..., n_{2k-2}, such that

    (i)  for all i, 0 <= i < 2k-2, w(n_i) <= w(n_{i+1});
    (ii) for all i, 0 <= i < k-1, n_{2i} and n_{2i+1} are siblings.

Note that the Huffman algorithm produces a tree with this property: we
just number the nodes in the order they are chosen to be combined.

Proof.  This is trivial for k = 1 or 2, so assume k >= 3.

First suppose T is optimal.  The deepest level of the tree contains only
leaves, and it contains at least 2 leaves; by Lemma 2(b), the two
elements of smallest weight can be found on that level.  By Lemma 2(a),
we can swap elements on that level if necessary to make the two smallest
elements siblings, say a and b.  Now remove the subtree

      t
     / \
    a   b

which by Lemma 1 reduces the entropy by H(t) = p(a) + p(b)
(equivalently, reduces the weighted path length by w(t) = w(a) + w(b)).
The remaining tree T' is optimal (if not, then T was not, since if there
were a tree with the same data as T' but with smaller entropy, then by
Lemma 1 we could put t back to get a tree with the same data as T and
smaller entropy).  By the induction hypothesis, T' has an ordering of
nodes as in the statement of the theorem.  Then so does T, by putting a
and b at the front of this ordering.

Conversely, suppose T has an ordering n_0, n_1, ... as stated in the
theorem.  We want to show that T is optimal.  Let S be an optimal tree
with the same data.  By the argument above, n_0 and n_1 occur at the
deepest level of S, and we can swap nodes on that level if necessary to
make n_0 and n_1 siblings in S.  Now remove the subtree t pictured above
from both T and S to get T' and S', respectively.  The ordering on T'
obeys the properties of the theorem, so by the induction hypothesis, T'
is optimal.  Since S' has the same data as T', H(T') <= H(S').  By
putting t back in both trees, the entropy increases by H(t) in both,
thus H(T) <= H(S), so T is optimal.


Adaptive Huffman Coding

We can avoid two passes through the data by adaptive Huffman coding.  In
this technique we adjust the Huffman tree as the characters come in, so
that at all times the coding is optimal for the sequence of characters
seen so far.  The codes for each character may change as time evolves.
On the decoding side, we can reconstruct the tree as the codes are
received; there is no need to transmit the tree.

We always maintain an ordering of the nodes of the Huffman tree
satisfying the two properties of Theorem 1.  We start with a balanced
tree assigning equal weight 0 to each character.  Now as the characters
of the input stream come in, we update the frequencies.  For each new
character, we increment the weight of its leaf and all its ancestors by
1.  The problem is, the tree may then no longer satisfy the properties
of Theorem 1, so we may have to rearrange the tree.  This can be done in
time proportional to the depth of the character.  On the decoding side,
the receiver starts with the same starting tree, decodes the incoming
stream using the current tree, and as soon as the next character is
found, updates the tree exactly as the encoder would.
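The invariant that the encoder and decoder both maintain is exactly the
numbering condition of Theorem 1.  As a concrete statement of that
condition, here is a small sketch in Python (not part of the notes;
satisfies_theorem_1 is a made-up name), with the nodes kept in a list
realizing the order n_0, n_1, ...:

    def satisfies_theorem_1(nodes):
        """nodes[i] = (weight, parent index) of n_i; the root's parent is None."""
        # (i) weights are non-decreasing along the numbering
        if any(nodes[i][0] > nodes[i + 1][0] for i in range(len(nodes) - 1)):
            return False
        # (ii) n_{2i} and n_{2i+1} are siblings, i.e. share a parent
        return all(nodes[2 * i][1] == nodes[2 * i + 1][1]
                   for i in range(len(nodes) // 2))

    # The tree built earlier for 25 a's, 25 b's, 1 c, 1 d, numbered in the
    # order the nodes were combined: c, d, (c+d), b, a, (b+c+d), root.
    nodes = [(1, 2), (1, 2), (2, 5), (25, 5), (25, 6), (27, 6), (52, None)]
    print(satisfies_theorem_1(nodes))      # True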
Here's how we rearrange the tree to keep it optimal.  Call the set of
nodes of a given weight a /block/.  We will maintain some extra
information that will allow us to find in constant time, given any
element n_i, the maximum (in the order n_0, n_1, ...) element of its
block.  Before we increment a node n_i, we find the maximum element n_j
of its block and swap the trees rooted at n_i and n_j (unless n_j is an
ancestor--we handle this case specially).  This does not disrupt the
properties of Theorem 1.  Our element to be incremented is now n_j.  We
look at its parent and do the same thing--swap it with the maximum
element of its block.  We continue on up the tree in this fashion.  Now
the original leaf to be incremented and all its ancestors satisfy
w(n_k) < w(n_{k+1}), so we just go up the tree and increment them all.

To find the maximum element of the block of a node in constant time, we
link the nodes of each block in a doubly linked list ordered according
to the order n_0, n_1, ...  We link all these doubly linked lists
together in a doubly linked list ordered by weight.  The maximum node of
a block is found at the head of the block list.  When we increment a
node, we unlink it from its block list and either link it onto the end
of the next block list or start a new block list.  If the node was the
last node of its old block list, we delete the old block list.

For an applet, check out
http://www.cs.sfu.ca/cs/CC/365/mark/squeeze/AdaptiveHuff.html
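Here is a simplified sketch in Python of the update step just described
(not part of the notes; all names are made up).  For brevity it finds
the block maximum by a linear scan rather than with the doubly linked
block lists above, and it simply skips the swap when the block maximum
is an ancestor instead of handling that case specially.

    class Node:
        def __init__(self, weight, parent=None, left=None, right=None):
            self.weight, self.parent = weight, parent
            self.left, self.right = left, right

    def is_ancestor(a, b):
        """True if a is a proper ancestor of b."""
        while b.parent is not None:
            b = b.parent
            if b is a:
                return True
        return False

    def swap_subtrees(order, u, v):
        """Exchange the subtrees rooted at u and v, and their positions in
        the list `order`, which realizes the numbering n_0, n_1, ..."""
        pu, pv = u.parent, v.parent
        u_side = 'left' if pu.left is u else 'right'
        v_side = 'left' if pv.left is v else 'right'
        setattr(pu, u_side, v)
        setattr(pv, v_side, u)
        u.parent, v.parent = pv, pu
        i, j = order.index(u), order.index(v)
        order[i], order[j] = v, u

    def update(order, leaf):
        """Account for one more occurrence of the character at `leaf`."""
        # First pass: walk up from the leaf, swapping each node on the path
        # with the maximum element of its block (same weight, highest number).
        node = leaf
        while node.parent is not None:
            block_max = max((m for m in order if m.weight == node.weight),
                            key=order.index)
            if block_max is not node and not is_ancestor(block_max, node):
                swap_subtrees(order, node, block_max)
            node = node.parent
        # Second pass: increment the leaf and all of its ancestors by 1.
        node = leaf
        while node is not None:
            node.weight += 1
            node = node.parent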