Updating the Huffman Tree - The Data Compression Book 2nd Ed Mark Nelson pdf

The algorithm for constructing a Huffman coding tree is fairly simple, but it is not something we would want to do after every character is encoded. It would be relatively simple to implement adaptive Huffman coding with the following update function:

update_model( int c )
{
    counts[ c ]++;
    construct_tree( counts );
}

Unfortunately, what we would end up with would probably be the world’s slowest data-compression program. Building the tree takes too much work to reasonably expect to do it after every character. Fortunately, there is a way to take an existing Huffman coding tree and modify it to account for a new character. All it takes is a slightly different approach to building the tree in the first place. This approach introduces a concept known as the sibling property. A Huffman tree is simply a binary tree that has a weight assigned to every node, whether an internal node or a leaf node. Each node (except for the root) has a sibling, the other node that shares the same parent. The tree exhibits the sibling property if the nodes can be listed in order of increasing weight and if every node appears adjacent to its sibling in the list.

A binary tree is a Huffman tree if and only if it obeys the sibling property. Figure 4.1 shows a Huffman tree that illustrates how this works. In this tree, the nodes have been assigned numbers, with the numbers assigned from left to right starting at the lowest row of nodes and working up. This tree was created using a conventional Huffman algorithm given the weights A=1, B=2, C=2, D=2, and E=10.

Figure 4.1 A Huffman tree.

In Figure 4.1, the A, B, C, and D nodes at the bottom of the tree are numbered in increasing order starting at 1. Nodes 5 and 6 are the first two internal nodes, with weights of 3 and 4. The node numbers work their way up to node 9, the root. This arrangement shows that this tree obeys the sibling property. The nodes have been numbered in order of increasing weight, and each node is adjacent to its sibling in the list.
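The check described above can be sketched in a few lines of C. This is only an illustration: the function name, the flat weight and parent arrays, and the choice to store node k of the list at array index k - 1 are assumptions made here, not part of the chapter's sample program. The routine tests one particular listing of the nodes, which is all the definition asks for.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if the given node list obeys the sibling property, 0 otherwise.
 * weight[i] and parent[i] describe node i + 1 of the list (the text numbers
 * nodes from 1). The last entry is the root, whose parent is recorded as 0.
 */
int has_sibling_property(const unsigned weight[], const int parent[],
                         size_t count)
{
    size_t i;

    /* A tree in which every internal node has two children has an odd size. */
    assert(count >= 1 && count % 2 == 1);

    /* Weights must never decrease as we move up the list. */
    for (i = 1; i < count; i++)
        if (weight[i] < weight[i - 1])
            return 0;

    /* Below the root the nodes pair up as (1,2), (3,4), and so on.
       Each pair must share a parent for siblings to be adjacent. */
    for (i = 0; i + 1 < count; i += 2)
        if (parent[i] != parent[i + 1])
            return 0;

    return 1;
}

/* The tree of Figure 4.1, built from A=1, B=2, C=2, D=2, E=10. */
const unsigned fig41_weight[9] = { 1, 2, 2, 2, 3, 4, 7, 10, 17 };
const int      fig41_parent[9] = { 5, 5, 6, 6, 7, 7, 9,  9,  0 };
```

Passing the two Figure 4.1 arrays with a count of 9 should report that the property holds. A listing in which some node sits next to a node that is not its sibling is rejected.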

The sibling property is important in adaptive Huffman coding since it helps show what we need to do to a Huffman tree when it is time to update the counts. Maintaining the sibling property during the update assures that we have a Huffman tree before and after the counts are adjusted.

Updating the tree consists of two basic types of operations. The first, incrementing the count, is easy to follow conceptually. To increment the count for symbol ‘c,’ start at the leaf node for the symbol and increment the count for the leaf node. Then move up to the parent node. Since the weight of the parent node is the sum of the weights of its children, incrementing its weight by one will adjust it to its correct value. This process continues all the way up the tree till we reach the root node.

Figure 4.2 shows how the increment operation affects the tree. Starting at the leaf, the increment works its way up the tree till it reaches the root node. Implementing this portion of the code is relatively simple. Be sure that each node has a parent pointer and that an index points to the leaf node for each symbol. This can be done using conventional data structures at a low cost. The average number of increment operations required will correspond to the average number of bits needed to encode a symbol.
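A minimal sketch of the increment pass on its own, ignoring swaps for the moment, might look like the following. The array names, the 1-based numbering with an unused slot 0, and the leaf_for_symbol() helper are assumptions for illustration. The weights and parent links are those of the Figure 4.1 tree.

```c
#include <assert.h>

#define NODE_COUNT 9   /* nodes 1..9 as numbered in Figure 4.1; slot 0 unused */
#define NO_PARENT  0   /* parent value stored for the root */

/* Weights and parent links for Figure 4.1 (A=1, B=2, C=2, D=2, E=10). */
unsigned weight[NODE_COUNT + 1] = { 0, 1, 2, 2, 2, 3, 4, 7, 10, 17 };
const int parent[NODE_COUNT + 1] = { 0, 5, 5, 6, 6, 7, 7, 9, 9, NO_PARENT };

/* Hypothetical index from symbol to leaf node; only A..E exist here. */
int leaf_for_symbol(char symbol)
{
    assert(symbol >= 'A' && symbol <= 'E');
    return symbol == 'E' ? 8 : symbol - 'A' + 1;
}

/*
 * Adds one to the leaf for the symbol and to every ancestor up to and
 * including the root. Returns the number of links climbed, which is the
 * symbol's current code length in bits.
 */
int increment_path(char symbol)
{
    int node = leaf_for_symbol(symbol);
    int links = 0;

    while (parent[node] != NO_PARENT) {
        weight[node]++;
        node = parent[node];
        links++;
    }
    weight[node]++;            /* finally, the root itself */
    return links;
}
```

Calling increment_path('A') on this tree climbs three links and leaves the weights along the path at 2, 4, 8, and 18, with the E node untouched.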

Figure 4.2 The increment process.

The second operation required in the update procedure arises when the node increment causes a violation of the sibling property. This occurs when the node being incremented has the same weight as the next highest node in the list. If the increment were to proceed as normal, we would no longer have a Huffman tree.

When we have an increment that violates the sibling property, we need to move the affected node to a higher point in the list. This means that the node is detached from its present position in the tree and swapped with a node farther up the list.

Figure 4.3 shows the same Huffman tree from Figure 4.2 after the A node has been incremented again, then switched with the D node. How was the D node selected as the one to be switched? To minimize the amount of work during the shuffle, we want to swap just two nodes. If the newly incremented node has a weight of W + 1, the next higher node will have a weight of W. There may be more nodes after the next higher one that have a value of W as well. The swap procedure moves up the node list till it finds the last node with a weight of W. That node is swapped with the node with weight W + 1. The new node list will then have a string of 1 or more weight W nodes, followed by the newly incremented node with weight W + 1.

Figure 4.3 After a node switch (only the A node has been incremented).

In Figure 4.3, the A node was incremented from a weight of 2 to 3. Since the next node in the list, the B node, had a weight of 2, the tree no longer obeyed the sibling property. This meant it was time to swap. We worked our way up the list of nodes till we found the last node with a weight of 2, the D node. The A and D nodes were then swapped, yielding a correctly ordered tree.

After the swap is completed, the update can continue. The next node to be incremented will be the new parent of the incremented node. In Figure 4.3, this would be internal node #6. As each node is incremented, a check is performed for correct ordering. A swap is performed if necessary.

What Swapping Does

The swap shown in Figure 4.3 doesn’t have a noticeable effect on the coding of the symbols. The A and D nodes were swapped, but the length of their codes did not change. They were both three bits long before the swap and three bits long after.

Figure 4.4 shows what happens to the tree after the A symbol has been incremented two more times. After the second increment, the A node has increased enough to swap positions with an internal node on a higher level of the tree. This reshapes the tree, impacting the length of the codes. When A had a count of two, like three other symbols, it was encoded using three bits. Now that its count has increased to five, it is encoded using only two bits. Symbol C is still encoded using three bits, but B and D have slipped down to four bits.

Figure 4.4 After another node switch.

The Algorithm

In summary, the algorithm for incrementing the count of a node goes something like what’s shown below:

for ( ; ; ) {
    nodes[ node ].count++;        /* increment this node's count */
    if ( node == ROOT )
        break;
    if ( nodes[ node ].count > nodes[ node + 1 ].count )
        swap_nodes();             /* restore the sibling property */
    node = nodes[ node ].parent;
}

The swap_nodes() routine has to move up through the list of nodes until it finds the right node to swap with. It then performs the swap. This routine looks something like that shown below:

swap_node = node + 1;
while ( nodes[ swap_node + 1 ].count < nodes[ node ].count )
    swap_node++;
temp = nodes[ swap_node ].parent;
nodes[ swap_node ].parent = nodes[ node ].parent;
nodes[ node ].parent = temp;
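To see the increment and swap steps working together, here is a self-contained sketch that applies them to the tree of Figure 4.1. It is not the chapter's sample program. The struct layout, the 1-based list with an unused slot 0, the fix_links() helper, and the choice to swap node contents while each list position keeps its parent are all assumptions made for this illustration.

```c
#include <assert.h>

#define NODE_COUNT 9            /* nodes 1..9 as in Figure 4.1; slot 0 unused */
#define ROOT       NODE_COUNT
#define NONE       0            /* "no parent" (root) or "no child" (leaf) */

struct node {
    unsigned weight;
    int parent;                 /* list position of the parent */
    int child;                  /* first child; its sibling is child + 1 */
    char symbol;                /* valid only when child == NONE */
};

/* Figure 4.1: A=1, B=2, C=2, D=2, E=10, listed by increasing weight. */
struct node nodes[NODE_COUNT + 1] = {
    { 0, NONE, NONE, 0 },
    { 1, 5, NONE, 'A' }, { 2, 5, NONE, 'B' },
    { 2, 6, NONE, 'C' }, { 2, 6, NONE, 'D' },
    { 3, 7, 1, 0 },      { 4, 7, 3, 0 },
    { 7, 9, 5, 0 },      { 10, 9, NONE, 'E' },
    { 17, NONE, 7, 0 }
};
int leaf[256] = { ['A'] = 1, ['B'] = 2, ['C'] = 3, ['D'] = 4, ['E'] = 8 };

/* Re-point whatever hangs below list position pos at pos itself. */
static void fix_links(int pos)
{
    if (nodes[pos].child == NONE) {
        leaf[(unsigned char)nodes[pos].symbol] = pos;
    } else {
        nodes[nodes[pos].child].parent = pos;
        nodes[nodes[pos].child + 1].parent = pos;
    }
}

/* Exchange the subtrees at positions i and j; each position keeps its parent. */
static void swap_nodes(int i, int j)
{
    struct node tmp = nodes[i];
    int parent_i = nodes[i].parent;
    int parent_j = nodes[j].parent;

    nodes[i] = nodes[j];
    nodes[j] = tmp;
    nodes[i].parent = parent_i;
    nodes[j].parent = parent_j;
    fix_links(i);
    fix_links(j);
}

/* Increment the symbol's leaf and all its ancestors, swapping where needed. */
void update_model(int c)
{
    int node;

    assert(c >= 0 && c < 256 && leaf[c] != NONE);  /* symbol must be in the tree */
    node = leaf[c];
    for (;;) {
        nodes[node].weight++;
        if (node == ROOT)
            break;
        if (nodes[node].weight > nodes[node + 1].weight) {
            int swap_node = node + 1;
            /* Find the last node that is still lighter than this one. */
            while (nodes[swap_node + 1].weight < nodes[node].weight)
                swap_node++;
            swap_nodes(node, swap_node);
            node = swap_node;
        }
        node = nodes[node].parent;
    }
}

/* Code length of a symbol: the number of links from its leaf to the root. */
int code_length(int c)
{
    int bits = 0;
    int node;

    for (node = leaf[c]; node != ROOT; node = nodes[node].parent)
        bits++;
    return bits;
}
```

Under these assumptions, two calls to update_model('A') move A to list position 4 and D to position 1, with both still three links from the root, as in Figure 4.3. Two further calls leave A two links from the root, C three, and B and D four, which agrees with the code lengths quoted for Figure 4.4.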

An Enhancement

One way to make coding more efficient is to make sure your coder doesn’t waste coding space for symbols not used in the message. With the standard Huffman coding in the previous chapter, this was easy. Since we made a pass over the data to collect statistics before building the tree, we knew in advance which symbols weren’t used. So when we built the Huffman tree we didn’t have to include symbols with a count of 0.

With an adaptive process, we don’t know in advance which symbols will show up in the message. The simplest way to handle this problem is to initialize the Huffman tree to have all 256 possible bytes (for conventional 8-bit data messages) predefined with a count of 1. When the encoding first starts, each code will have a length of eight bits. As statistics accumulate, frequently seen characters will start to use fewer and fewer bits.

This method of encoding works, but in many cases it wastes coding capacity. Particularly in shorter messages, the extra unused codes tend to blunt the effect of compression by skewing the statistics of the message.

A better way to handle this aspect of coding is to start the encoding process with an empty table and add symbols only as they are seen in the incoming message. But this presents us with a seeming contradiction. The first time a symbol appears, it can’t be encoded since it doesn’t appear in the table. So how do we get around this problem?

The Escape Code

The answer to this puzzle is the escape code. The escape code is a special symbol sent out of the encoder to signify that we are going to ‘escape’ from the current context. The decoder then knows that the next symbol will be encoded in a different context. We can use this mechanism to encode symbols that don’t appear in the currently defined Huffman tree.

In the example program in this chapter, the escape code signifies that the next symbol to be encoded will be sent as a plain 8-bit character. The symbol is added to the table, and regular encoding resumes. The C code to implement the encoder for this algorithm looks something like this:

encode( char c )
{
    if ( in_tree( c ) )
        transmit_huffman_code( c, out_file );
    else {
        transmit_huffman_code( ESCAPE, out_file );
        putc( c, out_file );
        add_code_to_tree( c );
    }
    update_tree( c );
}
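The shape of the resulting output stream can be illustrated without a Huffman tree at all. The sketch below is a toy, not the chapter's encoder: it records a tagged token wherever the real program would emit bits, and the token type, the in_table[] array standing in for in_tree(), and all function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token_kind { TOKEN_CODE, TOKEN_ESCAPE, TOKEN_LITERAL };

struct token {
    enum token_kind kind;
    unsigned char symbol;       /* unused for TOKEN_ESCAPE */
};

/* Stand-in for in_tree(): nonzero once a symbol has been added to the table. */
static int in_table[256];

void reset_table(void)
{
    memset(in_table, 0, sizeof in_table);
}

/* Writes the tokens for one symbol to out[] and returns how many were written. */
size_t encode_symbol(unsigned char c, struct token out[2])
{
    if (in_table[c]) {
        out[0].kind = TOKEN_CODE;       /* would be c's Huffman code */
        out[0].symbol = c;
        return 1;
    }
    out[0].kind = TOKEN_ESCAPE;         /* would be the escape symbol's code */
    out[0].symbol = 0;
    out[1].kind = TOKEN_LITERAL;        /* would be the plain 8-bit character */
    out[1].symbol = c;
    in_table[c] = 1;                    /* stands in for add_code_to_tree() */
    return 2;
}

/* Encodes a whole string; out[] needs room for two tokens per character. */
size_t encode_message(const char *message, struct token out[])
{
    size_t count = 0;

    for (; *message != '\0'; message++)
        count += encode_symbol((unsigned char)*message, out + count);
    return count;
}
```

For the message "ABA" starting from an empty table, this yields five tokens: an escape and a literal for the first A, an escape and a literal for B, and then an ordinary code for the second A.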

This example shows that the escape code is transmitted like any other symbol from the Huffman tree, so it has to appear in the Huffman tree to be properly transmitted. When the encoder first starts up, it needs to be initialized with the escape code already present.

In the implementation used in the example code for this chapter, the Huffman tree is actually initialized with two values: the escape code and the end of file code. Since both will appear in the file, we start off with them in a very small Huffman tree:

Figure 4.5 A Huffman tree initialized with two values.

As the encoding process goes on, the table fills up and the tree fleshes out. The end of file code will always have a weight of one, and in this implementation, so will the escape code. As the tree grows, these two codes will always be stuck down at the remotest branches of the tree and have the longest codes.

The Overflow Problem

As the compression program progresses, the counts in the table increase. At some point, the counts become large enough to cause trouble for the program. There are two possible areas of concern. The first occurs when the weight of the root exceeds the capacity of the counters in the tree. For most of the programs used here, that will be 16 bits.

Another possible problem can occur even sooner. It happens when the length of the longest possible Huffman code exceeds the range of the integer used to transmit it. The decoding process doesn’t care how long a code is, since it works its way down through the tree a bit at a time. The transmitter has a different problem though. It has to start at the leaf node of the tree and work up towards the root. It accumulates bits to be transmitted in reverse order, so it has to stack them up. This is conventionally done in an integer variable, so this means that when a Huffman code exceeds the size of that integer, there is a problem.
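The stacking step can be sketched as follows, using the parent links of the Figure 4.1 tree. The function name, the convention that the first node of each sibling pair stands for a 0 bit and the second for a 1 bit, and the -1 return value for a code that would not fit are assumptions made for this illustration.

```c
#include <assert.h>
#include <limits.h>

#define NODE_COUNT 9   /* nodes 1..9 as numbered in Figure 4.1; slot 0 unused */
#define ROOT       NODE_COUNT

static const int parent[NODE_COUNT + 1] = { 0, 5, 5, 6, 6, 7, 7, 9, 9, 0 };

/*
 * Walks from a leaf to the root, stacking one bit per link into *code.
 * The bit nearest the leaf lands in bit 0 and the bit nearest the root in
 * the highest used position, so sending the low `length` bits from the top
 * down gives the root-to-leaf order the decoder needs.
 * Returns the code length, or -1 if the code does not fit in an unsigned.
 */
int build_code(int leaf_node, unsigned *code)
{
    unsigned bits = 0;
    unsigned current_bit = 1;
    int length = 0;
    int node;

    for (node = leaf_node; node != ROOT; node = parent[node]) {
        if (length == (int)(sizeof bits * CHAR_BIT))
            return -1;          /* the overflow described in the text */
        if (node % 2 == 0)      /* second node of a sibling pair: a 1 bit */
            bits |= current_bit;
        current_bit <<= 1;
        length++;
    }
    *code = bits;
    return length;
}
```

With this bit convention the D leaf (node 4) gets the three-bit code 011 and the E leaf (node 8) the one-bit code 1. On a tree deep enough to exceed the width of unsigned, the function reports the failure instead of silently dropping bits.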

The maximum length of a Huffman code is related to the maximum count via a Fibonacci sequence. A Fibonacci function is defined as follows:

int fib( int n )
{
    if ( n <= 1 )
        return( 1 );
    else
        return( fib( n - 1 ) + fib( n - 2 ) );
}

The sequence of Fibonacci numbers looks something like this: 1, 1, 2, 3, 5, 8, 13, 21, 34, etc. These numbers show up in the worst-case, most lopsided Huffman tree:

Figure 4.6 A lopsided Huffman tree produced through a sequence of Fibonacci numbers.

From this we can deduce that if the weight at the root node of a Huffman tree equals fib(i), then the longest code for that tree is i - 1. This means that if the integers used with our Huffman codes are only 16 bits long, a root value of 4181, which is fib(18), could potentially produce a 17-bit code and introduce an overflow. (This low value is frequently overlooked in simple Huffman implementations. Setting up a file with Fibonacci counts up to fib(18) is a good way to test a Huffman program.) When we update the tree, we ought to check for a maximum value. Once we reach that value, we need to rescale all the counts, typically dividing them by a fixed factor, often two.
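The arithmetic behind the 4181 figure is easy to check. The sketch below uses an iterative fib() with the same convention as the recursive version above, fib(0) = fib(1) = 1. The smallest_risky_root() helper is a hypothetical name that simply applies the rule just stated: a root weight of fib(i) allows a code of i - 1 bits.

```c
#include <assert.h>

/* Iterative Fibonacci with fib(0) = fib(1) = 1, as in the chapter.
   The result overflows unsigned long for large n; small n only. */
unsigned long fib(int n)
{
    unsigned long previous = 1;
    unsigned long current = 1;
    int i;

    assert(n >= 0);
    for (i = 2; i <= n; i++) {
        unsigned long next = previous + current;
        previous = current;
        current = next;
    }
    return current;
}

/*
 * Smallest root weight at which the longest code could exceed max_bits.
 * A code of max_bits + 1 bits needs i - 1 = max_bits + 1, so i = max_bits + 2.
 */
unsigned long smallest_risky_root(int max_bits)
{
    return fib(max_bits + 2);
}
```

For a 16-bit code variable, smallest_risky_root(16) evaluates fib(18), which is 4181.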

One problem with dividing all the counts by two is that it can possibly reshape the tree. Since we are dealing with integers, dividing by two truncates the fractional part of the result, which can lead to imbalances. Consider the Huffman tree shown in Figure 4.7.

Figure 4.7 A Huffman tree created for four symbols.

This is a tree created for four symbols: A, B, C, and D, with weights of 3, 3, 6, and 6. The nodes of the tree are numbered in this diagram, and the diagram clearly shows that the tree is a Huffman tree, since it obeys the sibling property. The problem with this tree occurs if we try a rescaling operation. The simple version of the rescaling algorithm would go through the tree, dividing every leaf node weight by two, then rebuilding upwards from the leaf nodes. The resulting tree would look like what follows.

Figure 4.8 The rescaling problem after the nodes are divided by two.

The problem with the resulting tree is that it is no longer a Huffman tree. Because of the vagaries of truncation that follow integer division, we need to end up with a tree that has a slightly different shape:

Figure 4.9 What the tree should look like after integer division.

The properly organized Huffman tree has a drastically different shape from what it had before being rescaled. This happens because rescaling loses some of the precision in our statistical base, introducing errors sometimes reflected in the rescaled tree. As time goes on, we accumulate more statistics, and the effect of the errors gradually fades away.
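A little arithmetic shows why the old shape stops being optimal. The sketch below compares the total coded length of two tree shapes before and after halving the weights 3, 3, 6, and 6. The figures are not reproduced here, so the two shapes are assumptions: a balanced tree ((A,B),(C,D)) for the original, and a lopsided tree (((A,B),C),D) for the rebuilt one. The function and array names are also hypothetical.

```c
#include <assert.h>

#define SYMBOLS 4   /* A, B, C, D in that order */

/* Total bits used if every symbol is coded once per unit of weight. */
unsigned total_bits(const unsigned weight[SYMBOLS],
                    const unsigned depth[SYMBOLS])
{
    unsigned total = 0;
    int i;

    for (i = 0; i < SYMBOLS; i++)
        total += weight[i] * depth[i];
    return total;
}

/* The simple rescaling pass: integer division truncates 3 / 2 to 1. */
void rescale(unsigned weight[SYMBOLS])
{
    int i;

    for (i = 0; i < SYMBOLS; i++)
        weight[i] /= 2;
}

/* Code lengths for the two assumed shapes. */
const unsigned balanced_depth[SYMBOLS] = { 2, 2, 2, 2 };  /* ((A,B),(C,D)) */
const unsigned lopsided_depth[SYMBOLS] = { 3, 3, 2, 1 };  /* (((A,B),C),D) */
```

With the original weights both shapes cost 36 bits, so the balanced tree is as good as any. After rescale() the weights are 1, 1, 3, and 3, and the balanced shape costs 16 bits against 15 for the lopsided one, so keeping the old shape would no longer give a minimum-length code.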

Unfortunately, there is no simple way to compensate for the necessary reshaping of the tree after rescaling. The sample code in this chapter merely does a brute-force rebuilding after rescaling. Rebuilding the entire tree is a relatively expensive operation, but since the rescaling operation is rare, we can live with the cost.

A Rescaling Bonus

An interesting side effect comes out of rescaling our tree at periodic intervals. Though we lose accuracy by scaling our counts, testing reveals that rescaling generally results in better compression ratios than if rescaling is postponed. This occurs because data streams frequently have a “decaying recency” effect: the statistics for recently seen symbols are generally more valid than those accumulated farther back in the data stream. To put it simply, current symbols are more like recent symbols than older symbols.

The rescaling operation tends to discount the effect of older symbols, while increasing the importance of recent symbols. Though difficult to quantify, this seems to have a good effect on compression. Experimenting with various rescaling points will yield different results at differing values, but it doesn’t seem possible to pin down an optimal strategy. There may be an optimal value for rescaling, but it moves around with different types of data streams.

The Code

The sample code for this chapter is a simple order-0 adaptive Huffman compression routine. It is linked with the standard I/O and user interface routines from the previous chapter to create a standalone compression program and a decompression program.

The key to understanding how this sample code operates lies in understanding the data structures in the program. The data structure that describes the tree is shown next.

struct tree {
    int leaf[ SYMBOL_COUNT ];
    int next_free_node;
    struct node {
        unsigned int weight;
        int parent;
        int child_is_leaf;
        int child;
    } nodes[ NODE_TABLE_COUNT ];
} Tree;

Two arrays describe the tree. The tree itself is entirely represented by the nodes[] array. This array is a set of structures with the following elements:

unsigned int weight: This weight element is the weight of the individual node, just as it has been described previously in this chapter.

int parent: This int is the index of the parent node. The parent node information is necessary both when encoding a symbol and when updating the model.

int child_is_leaf: The child of a given node can either be a leaf or a pair of nodes. This flag is used to indicate which it is.

int child: If the child is a leaf, this int holds the value of the symbol encoded at the leaf. If the child is a pair of nodes, this value is the index to the first node. Because of the sibling property, the two nodes will always be adjacent to one another, so we know the first node will be child, and the second node will be child+1.

As described earlier in the chapter, every node in the tree is kept in a numbered list. When discussing the list before, we had the nodes with the lowest weight starting at 1 and working up to higher numbers until reaching the root. The implementation in this program is backwards from that, though the same principles apply. The list of nodes is the nodes[] array, with the highest-weight node, the root, appearing at nodes[0]. As we work our way down through the lower weights, we go to higher indices in the nodes list.

When the tree is first initialized, nodes[0] is the root node, nodes[1] is set to the end-of-stream symbol, and nodes[2] is set to the escape symbol. The next_free_node element in the tree is then set to 3, and the next time a character is added to the tree, it will be placed in nodes[3].

The leaf[] array in the tree data structure is used to find the leaf node for a particular symbol. To encode a symbol, start at the leaf node and work up to the root node of the tree, accumulating bits on
