Huffman coding

This is a coding scheme in which each character, instead of being coded
as a fixed-length bit string as in ASCII or Unicode, receives a
variable-length code according to its frequency in the text.  More
frequent letters get shorter codes and rarer letters get longer codes.
This saves space.

Example: To code the string

    aaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbcd

(25 a's, 25 b's, 1 c, and 1 d) we might assign a fixed-length code such
as

    a = 00
    b = 01      (*)
    c = 10
    d = 11

Each symbol is 2 bits, so the length of the coded sequence is
50 + 50 + 2 + 2 = 104 bits.  However, if we assign

    a = 0
    b = 10      (**)
    c = 110
    d = 111

the coded sequence is only 25 + 50 + 3 + 3 = 81 bits.

If codewords are assigned so that no code is a prefix of another code,
then the coding scheme is /self-delimiting/.  Any fixed-length code such
as (*) is self-delimiting.  So is (**).  An example of a
non-self-delimiting code would be

    a = 001
    b = 101
    q = 101001

When decoding, we would not be able to tell the difference between ba
and q.  The problem is that the code for b, 101, is a prefix of the code
for q, 101001.  If we disallow this, the input sequence is unambiguously
determined by the coded sequence.

A self-delimiting code can be represented by a binary tree with the
characters at the leaves.  The binary code of a character is given by
the path from the root down to the character, where 0 = left and
1 = right.  This tree can be used to decode a coded stream.  As the bits
come in, trace the path they describe down the tree until hitting a
leaf, and output the character there.  Then go back up to the root and
start again.  For example, the codes (*) and (**) would be represented
respectively by

          *                     *
         / \                   / \
        /   \                 a   *
       *     *                   / \
      / \   / \                 b   *
     a   b c   d                   / \
                                  c   d

Given the frequency distribution on the characters in the input string
(number of occurrences of each character), there is a coding scheme that
is OPTIMAL for that frequency distribution in the sense that it
minimizes the length of the coded input string.  Such a code is called a
/Huffman code/.

We can construct a Huffman code for a given frequency distribution in a
greedy fashion.  Given an alphabet a,b,c,... with frequencies
w(a),w(b),w(c),... (w(x) is the number of occurrences of x in the input
text to be coded), form a forest of binary trees as follows.  Start by
making every character x a tree of a single node with weight w(x).  Now
repeat the following: find two trees u and v of minimum weight, say w(u)
and w(v); create a new tree with root y of weight w(y) = w(u) + w(v) and
subtrees u and v.  Continue combining pairs of minimum weight trees like
this until there is only one tree left.

Example: for the input string above, we would start with

    a 25    b 25    c 1    d 1

In the first stage we would choose c and d and combine them to give

    a 25    b 25      2
                     / \
                   c 1   d 1

In the second stage we would combine b and that tree to get

    a 25       27
              /  \
           b 25    2
                  / \
                c 1   d 1

Finally we would combine the two remaining trees to get

         52
        /  \
    a 25    27
           /  \
        b 25    2
               / \
             c 1   d 1

which gives the code (**) above.  Thus this code is optimal.  There is
no coding scheme assigning a fixed bit string to each character that
gets better compression for this frequency distribution (although other
schemes that assign codes to /strings/, such as Lempel-Ziv, may do
better).
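To make the greedy construction concrete, here is a short sketch in
Python (not part of the original notes; build_huffman_tree and
assign_codes are made-up names, and heapq plays the role of the priority
queue of trees).  Ties among minimum-weight trees may be broken
differently than in the example above, which can change the individual
codewords but not the total coded length.

    import heapq
    from collections import Counter

    def build_huffman_tree(freq):
        """freq: dict mapping character -> number of occurrences w(x)."""
        # Heap entries are (weight, tiebreak, tree); a tree is either a
        # single character (a leaf) or a pair (left, right).
        heap = [(w, i, ch) for i, (ch, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            wu, _, u = heapq.heappop(heap)     # two trees of minimum weight
            wv, _, v = heapq.heappop(heap)
            heapq.heappush(heap, (wu + wv, count, (u, v)))
            count += 1
        return heap[0][2]

    def assign_codes(tree, prefix=""):
        """Read the codewords off the tree: 0 = left, 1 = right."""
        if isinstance(tree, str):              # leaf
            return {tree: prefix or "0"}
        left, right = tree
        codes = assign_codes(left, prefix + "0")
        codes.update(assign_codes(right, prefix + "1"))
        return codes

    text = "a" * 25 + "b" * 25 + "c" + "d"
    codes = assign_codes(build_huffman_tree(Counter(text)))
    print(codes)                               # an optimal prefix code
    print(sum(len(codes[ch]) for ch in text))  # 81 bits, as with (**)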
To compute a Huffman code, you have to know the exact frequencies of the
characters in advance, which may require two passes through the data,
once to count the characters and once to do the coding.  (However, you
can do quite well if you have a good estimate.  For example, in most
English text, e, t, and s occur frequently and z, q, and x occur rarely.
The frequency of e is about 13%, while those of q and z are only about
0.1%.)  Alternatively, you can use adaptive Huffman coding, which
constructs the tree that is optimal for the character sequence seen so
far.  It updates the frequencies (and the codes) as new characters come
in.

To prove that Huffman codes are optimal, we first make a couple of
definitions.  For a given binary tree T with weights w(x) at the leaves,
define inductively for each non-leaf u

    w(u) = w(left(u)) + w(right(u))

where left(u) and right(u) are the left and right child of u,
respectively.  Let

    d(u,x) = the length of the path from u down to x, if x is a
             descendant of u

    |x| = d(r,x), where r is the root.

If x is a leaf, |x| is the length of the codeword assigned to x.

Define the /weighted path length/ of a node u inductively as

    W(u) = 0, if u is a leaf
    W(u) = W(left(u)) + W(right(u)) + w(left(u)) + w(right(u))
         = W(left(u)) + W(right(u)) + w(u).

It can be shown by induction that

    W(u) = SUM w(x).d(u,x)
           x a leaf below u

In particular, if r is the root, then

    W(r) = SUM w(x).|x|
           x a leaf

is the total length of the encoded string, since each x occurs w(x)
times in the input string and each occurrence contributes |x| to the
length of the coded string.

Let N be the length (number of characters) of the input string.  For
each character x, let p(x) = w(x)/N.  p(x) is the probability that a
randomly chosen character in the input string, all character positions
equally likely, is x.  For a non-leaf u, define inductively

    p(u) = p(left(u)) + p(right(u)),

where left(u) and right(u) are the left and right subtrees of u,
respectively.  Note p(u) = w(u)/N.  p(u) is the probability that a
randomly chosen character from the input string lies in the subtree
rooted at u.  One can show by induction that

    p(u) = SUM p(x)
           x a leaf below u

A quantity closely related to the weighted path length is entropy.
Define the entropy of a node u inductively as

    H(u) = 0, if u is a leaf
    H(u) = H(left(u)) + H(right(u)) + p(left(u)) + p(right(u))
         = H(left(u)) + H(right(u)) + p(u).

One can show by induction that

    H(u) = SUM p(x).d(u,x)
           x a leaf below u

If r is the root, then

    H(r) = SUM p(x).|x|
           x a leaf

         = W(r)/N

This is the average code length for letters chosen randomly from the
input string.  Thus we wish to minimize H(r) (equivalently, minimize
W(r)) over all binary trees with leaves a,b,... and weights
w(a),w(b),...
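As a quick sanity check on these identities, here is a small sketch in
Python (not part of the notes; weight, W, and weighted_depths are
made-up names) that evaluates both sides on the tree (**) from the
example above: the inductive definition of W agrees with
SUM w(x).d(r,x), and H(r) = W(r)/N is the average code length.

    def weight(t):
        """w(u); a tree is a leaf (char, weight) or a pair (left, right)."""
        return t[1] if isinstance(t[0], str) else weight(t[0]) + weight(t[1])

    def W(t):
        """Weighted path length: 0 at a leaf, else W(left) + W(right) + w(u)."""
        if isinstance(t[0], str):
            return 0
        return W(t[0]) + W(t[1]) + weight(t)

    def weighted_depths(t, d=0):
        """SUM of w(x) * d(t,x) over the leaves x below t."""
        if isinstance(t[0], str):
            return t[1] * d
        return weighted_depths(t[0], d + 1) + weighted_depths(t[1], d + 1)

    # The tree for code (**): a = 0, b = 10, c = 110, d = 111.
    tree = (("a", 25), (("b", 25), (("c", 1), ("d", 1))))
    N = weight(tree)                       # 52 characters in the input string
    print(W(tree), weighted_depths(tree))  # both 81, the coded length
    print(W(tree) / N)                     # H(r) = W(r)/N, about 1.56 bits/char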
Let T be a binary tree with root r.  Let S be a set of vertices of T, no
two of which are related to each other in the ancestor/descendant
relationship.  Let T' be the tree obtained by deleting all the subtrees
below any node in S, but retaining the nodes in S and their weights.
For example, if T is

              27
             /  \
            /    \
          14      13*
         /  \    /  \
       5*    9  6    7
      / \   / \
     3   2 5   4*

where the nodes marked with * are in S, then T' would be

              27
             /  \
            /    \
          14      13
         /  \
        5    9
            / \
           5   4

Let H and H' refer to the entropy functions of T and T', respectively.

Lemma 1.  For any S,

    H(r) = H'(r) + SUM H(s).
                  s in S

This can be proved by induction on the depth of the tree.

Examples: For S = {r} or S = {leaves of T}, the lemma reduces to the
trivial identity H(r) = H(r).  For S = {left(r), right(r)}, the lemma
says H(r) = p(r) + H(left(r)) + H(right(r)), which is just the inductive
definition of H(r).

Lemma 2.

(a) Let u and v be nodes of T such that w(u) = w(v) or |u| = |v|.  Then
the tree T' obtained from T by switching the subtrees rooted at u and v
has the same entropy as T.

(b) Let u and v be nodes of T such that w(u) < w(v) and |u| < |v|.  Then
the tree T' obtained from T by switching the subtrees rooted at u and v
has smaller entropy than T.

Proof.  By Lemma 1, the difference in entropy when u and v are switched
is (up to the factor 1/N)

    |u|(w(v) - w(u)) + |v|(w(u) - w(v)) = (|u| - |v|)(w(v) - w(u)),

which is 0 in case (a) and negative in case (b).

If there are k characters in all, there will be 2k-1 nodes of T.

Theorem 1.  T is optimal iff there is a numbering of the nodes of T, say
n_0, n_1, ..., n_{2k-2}, such that

    (i)  for all i, 0 <= i < 2k-2, w(n_i) <= w(n_{i+1});
    (ii) for all i, 0 <= i < k-1, n_{2i} and n_{2i+1} are siblings.

Note that the Huffman algorithm produces a tree with this property: we
just number the nodes in the order they are chosen to be combined.

Proof.  This is trivial for k = 1 or 2, so assume k >= 3.

First suppose T is optimal.  The deepest level of the tree contains only
leaves, and it contains at least 2 leaves; by Lemma 2(b), the two
elements of smallest weight can be found on that level.  By Lemma 2(a),
we can swap elements on that level if necessary to make the two smallest
elements siblings, say a and b.  Now remove the subtree

      t
     / \
    a   b

which by Lemma 1 reduces the entropy by H(t) = p(a) + p(b)
(equivalently, reduces the weighted path length by w(t) = w(a) + w(b)).
The remaining tree T' is optimal (if not, then T was not, since if there
were a tree with the same data as T' but with smaller entropy, then by
Lemma 1 we could put t back to get a tree with the same data as T and
smaller entropy).  By the induction hypothesis, T' has an ordering of
nodes as in the statement of the theorem.  Then so does T, by putting a
and b at the front of this ordering.

Conversely, suppose T has an ordering n_0, n_1, ... as stated in the
theorem.  We want to show that T is optimal.  Let S be an optimal tree
with the same data.  By the argument above, n_0 and n_1 occur at the
deepest level of S, and we can swap nodes on that level if necessary to
make n_0 and n_1 siblings in S.  Now remove the subtree t pictured above
from both T and S to get T' and S', respectively.  The ordering on T'
obeys the properties of the theorem, so by the induction hypothesis, T'
is optimal.  Since S' has the same data as T', H(T') <= H(S').  By
putting t back in both trees, the entropy increases by H(t) in both,
thus H(T) <= H(S), so T is optimal.


Adaptive Huffman Coding

We can avoid two passes through the data by adaptive Huffman coding.  In
this technique we adjust the Huffman tree as the characters come in, so
that at all times the coding is optimal for the sequence of characters
seen so far.  The codes for each character may change as time evolves.
On the decoding side, we can reconstruct the tree as the codes are
received; there is no need to transmit the tree.

We always maintain an ordering of the nodes of the Huffman tree
satisfying the two properties of Theorem 1.  We start with a balanced
tree assigning equal weight 0 to each character.  Now as the characters
of the input stream come in, we update the frequencies.  For each new
character, we increment the weight of its leaf and all its ancestors by
1.  The problem is, the tree may then no longer satisfy the properties
of Theorem 1, so we may have to rearrange the tree.  This can be done in
time proportional to the depth of the character.  On the decoding side,
the receiver starts with the same starting tree, decodes the incoming
stream using the current tree, and as soon as the next character is
found, updates the tree exactly as the encoder would.
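The invariant that the encoder and decoder both maintain is exactly the
numbering condition of Theorem 1.  As a concrete statement of that
condition, here is a small sketch in Python (not part of the notes;
satisfies_theorem_1 is a made-up name), with the nodes kept in a list
realizing the order n_0, n_1, ...:

    def satisfies_theorem_1(nodes):
        """nodes[i] = (weight, parent index) of n_i; the root's parent is None."""
        # (i) weights are non-decreasing along the numbering
        if any(nodes[i][0] > nodes[i + 1][0] for i in range(len(nodes) - 1)):
            return False
        # (ii) n_{2i} and n_{2i+1} are siblings, i.e. share a parent
        return all(nodes[2 * i][1] == nodes[2 * i + 1][1]
                   for i in range(len(nodes) // 2))

    # The tree built earlier for 25 a's, 25 b's, 1 c, 1 d, numbered in the
    # order the nodes were combined: c, d, (c+d), b, a, (b+c+d), root.
    nodes = [(1, 2), (1, 2), (2, 5), (25, 5), (25, 6), (27, 6), (52, None)]
    print(satisfies_theorem_1(nodes))      # True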
Here's how we rearrange the tree to keep it optimal.  Call the set of
nodes of a given weight a /block/.  We will maintain some extra
information that will allow us to find in constant time, given any
element n_i, the maximum (in the order n_0, n_1, ...) element of its
block.  Before we increment a node n_i, we find the maximum element n_j
of its block and swap the trees rooted at n_i and n_j (unless n_j is an
ancestor--we handle this case specially).  This does not disrupt the
properties of Theorem 1.  Our element to be incremented is now n_j.  We
look at its parent and do the same thing--swap it with the maximum
element of its block.  We continue on up the tree in this fashion.  Now
the original leaf to be incremented and all its ancestors satisfy
w(n_k) < w(n_{k+1}), so we just go up the tree and increment them all.

To find the maximum element of the block of a node in constant time, we
link the nodes of each block in a doubly linked list ordered according
to the order n_0, n_1, ...  We link all these doubly linked lists
together in a doubly linked list ordered by weight.  The maximum node of
a block is found at the head of the block list.  When we increment a
node, we unlink it from its block list and either link it onto the end
of the next block list or start a new block list.  If the node was the
last node of its old block list, we delete the old block list.

For an applet, check out
http://www.cs.sfu.ca/cs/CC/365/mark/squeeze/AdaptiveHuff.html
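Here is a simplified sketch in Python of the update step just described
(not part of the notes; all names are made up).  For brevity it finds
the block maximum by a linear scan rather than with the doubly linked
block lists above, and it simply skips the swap when the block maximum
is an ancestor instead of handling that case specially.

    class Node:
        def __init__(self, weight, parent=None, left=None, right=None):
            self.weight, self.parent = weight, parent
            self.left, self.right = left, right

    def is_ancestor(a, b):
        """True if a is a proper ancestor of b."""
        while b.parent is not None:
            b = b.parent
            if b is a:
                return True
        return False

    def swap_subtrees(order, u, v):
        """Exchange the subtrees rooted at u and v, and their positions in
        the list `order`, which realizes the numbering n_0, n_1, ..."""
        pu, pv = u.parent, v.parent
        u_side = 'left' if pu.left is u else 'right'
        v_side = 'left' if pv.left is v else 'right'
        setattr(pu, u_side, v)
        setattr(pv, v_side, u)
        u.parent, v.parent = pv, pu
        i, j = order.index(u), order.index(v)
        order[i], order[j] = v, u

    def update(order, leaf):
        """Account for one more occurrence of the character at `leaf`."""
        # First pass: walk up from the leaf, swapping each node on the path
        # with the maximum element of its block (same weight, highest number).
        node = leaf
        while node.parent is not None:
            block_max = max((m for m in order if m.weight == node.weight),
                            key=order.index)
            if block_max is not node and not is_ancestor(block_max, node):
                swap_subtrees(order, node, block_max)
            node = node.parent
        # Second pass: increment the leaf and all of its ancestors by 1.
        node = leaf
        while node is not None:
            node.weight += 1
            node = node.parent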