
Into the Huffman Code

With the infrastructure code in place, all we need to do to create a program that demonstrates Huffman coding is to write two routines, CompressFile() and ExpandFile(), and a couple of strings that describe the name of the compression method and program usage. The code for this is found in HUFF.C.

To build the Huffman decoding tree, we need to create a data structure that models the tree on the computer. In our previous examples, each node on the tree had several pieces of information: first, the weight associated with it; second, pointers to two child nodes, one associated with the 0 bit and one associated with the 1 bit. Finally, leaf nodes had the value of the symbol associated with the leaf. The data structure used in this program to model the Huffman tree was built around the node structure:

typedef struct tree_node {
    unsigned int count;
    unsigned int saved_count;
    int child_0;
    int child_1;
} NODE;

The first thing to notice about this structure is that there is no information about the value of a leaf node. This is because the node structures are allocated as an array of 514 nodes. The lower nodes are all assigned to be leaf nodes, and the upper nodes become internal nodes. The information about the value of a leaf is encoded based on the position of the node in the array.

Instead of having 256 symbols in our alphabet for this program, we actually have 257. Values 0 through 255 are reserved for the normal range of bytes that fit into a character. The remaining symbol value of 256 is reserved for the end-of-stream indicator. This is the last code written out to the stream, and it indicates that no more data will be arriving. Because of the bit-oriented nature of compressed data, it is not ordinarily a simple matter to determine when you have reached an end-of- file state. Handling it with a special code for end-of-stream is one method for getting around this. Another would be to encode the length of the file as a prefix to the compressed data.

With 257 symbols to deal with, we know in advance the largest possible size of the Huffman tree. If all 257 symbols are in use, we will have 256 internal nodes, meaning that we have to allocate an array of 513 node structures. In the program, I actually allocate 514 and use the last one as a dummy value for comparisons when building the tree.
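The array layout described above can be sketched as follows. This is a minimal illustration, not the declarations from HUFF.C: the constant names SYMBOL_COUNT, NODE_COUNT, and DUMMY_NODE and the helpers is_leaf() and leaf_symbol() are assumptions made for the example.

```c
#include <assert.h>

#define END_OF_STREAM 256               /* symbol 256 marks the end of the stream */
#define SYMBOL_COUNT  257               /* byte values 0..255 plus end-of-stream  */
#define NODE_COUNT    514               /* 257 leaves + 256 internal + 1 dummy    */
#define DUMMY_NODE    (NODE_COUNT - 1)  /* index 513, used only for comparisons   */

typedef struct tree_node {
    unsigned int count;
    unsigned int saved_count;
    int child_0;
    int child_1;
} NODE;

/* A node's position in the array tells us whether it is a leaf: indices
 * 0..256 are leaves, everything above is an internal node (or the dummy). */
static int is_leaf(int node_index)
{
    assert(node_index >= 0 && node_index < NODE_COUNT);
    return node_index <= END_OF_STREAM;
}

/* For a leaf, the array index is the symbol value, which is why the
 * structure needs no separate "value" member. */
static int leaf_symbol(int node_index)
{
    assert(is_leaf(node_index));
    return node_index;
}
```

With this layout, the leaf for the byte 'A' is simply the node at index 65, and the leaf for the end-of-stream symbol is the node at index 256.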

Counting the Symbols

To build the tree, I first calculate the relative frequencies of the symbols. In HUFF.C, I set up an array of 256 longs and count the occurrences of every character in the file, from the start to the end. The position of the file input pointer is saved when the count starts and is restored when it is done. All this takes place in function count_bytes().
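A counting pass of this kind might look like the sketch below. It is not the count_bytes() routine from HUFF.C; the name count_bytes_sketch() and the return convention are assumptions. It counts from the current file position, which is the start of the file when the file has just been opened.

```c
#include <assert.h>
#include <stdio.h>

/* Count how often each byte value occurs between the current file
 * position and end of file, then restore the file position.
 * Returns 0 on success and -1 on an I/O error (an assumed convention). */
static int count_bytes_sketch(FILE *input, unsigned long counts[256])
{
    long start;
    int c;
    int i;

    assert(input != NULL && counts != NULL);
    for (i = 0; i < 256; i++)
        counts[i] = 0;

    start = ftell(input);              /* remember where we started */
    if (start < 0)
        return -1;

    while ((c = getc(input)) != EOF)   /* getc() returns 0..255 or EOF */
        counts[c]++;
    if (ferror(input))
        return -1;

    /* A successful fseek() also clears the end-of-file indicator. */
    return fseek(input, start, SEEK_SET) == 0 ? 0 : -1;
}
```

Restoring the position matters because the compressor reads the file a second time to encode it.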

Though I start with 32-bit unsigned long counts, I scale the counts back significantly in the routine scale_counts(). scale_counts() finds the maximum count for any symbol in the file, then develops a scaling factor to make that count and all the rest of the counts fit in a single unsigned character. These counts are then copied into the weight elements of the first 257 node elements.

There are several good reasons for scaling back the counts. First, by limiting any symbol’s weight to an 8-bit unsigned character, I can confine all of the math I perform when building the tree to 16-bit unsigned integers. This helps the program run a little faster, and it cuts back on the amount of storage required for the node array. It also limits the maximum length of a Huffman code, ensuring that it will fit in a 16-bit unsigned integer.

For the expansion program to correctly expand the Huffman encoded bit stream it will be receiving, it needs a copy of the Huffman tree identical to the one used by the encoder. This means that the tree, or its equivalent, must be passed as a header to the file so the expander can read it in before it starts to read Huffman codes.

The easiest way for the expansion program to get this data would probably be to store the entire node array as a preamble to the compressed data. This would work well and would not be too hard for the compressor to do. An alternative method that occupies far less space in the compressed file, however, is to transmit the symbol counts to the expander. Since the Huffman tree is built up in an unambiguous manner from the symbol counts, it stands to reason that the expansion program doesn’t need more to do its job. And since the scaled count array will be only 256 bytes, compared to the Huffman tree’s 4K bytes, there is good reason to choose this.

I elected to try to cut down on the amount of data to be passed even further. Under many circumstances, the number of counts that stay at zero is considerable. With ASCII text files, such as program listings, there will generally be only around 100 symbols in use out of the possible 256. It seems a waste to transmit all those zero counts when they aren’t necessary. To make this happen, I use a slightly more complicated format for the header.

The header used in HUFF.C that contains the symbol counts consists of a series of “count run” definitions, followed by a 0 terminator. A count-run definition consists of the value of the first symbol in the run, followed by the value of the last symbol in the run, followed by the counts for all of the symbols in the run from first to last. This is repeated until each run has been stored in the output file. When there is no more data to store, a first value of zero is written out to the file. Note that a value of zero for the very first run is not treated as an end of data.
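The count-run format can be illustrated with a simplified writer. This sketch is not output_counts() from HUFF.C: it writes into a memory buffer instead of a file, it emits every maximal run of nonzero counts (the real routine may choose run boundaries differently), and it assumes at least one symbol has a nonzero count. The names write_count_runs() and HEADER_MAX are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define HEADER_MAX 512   /* assumed buffer size, larger than any header here */

/* Write count runs as: first, last, counts[first..last], ..., 0.
 * Returns the number of bytes written to out[]. */
static size_t write_count_runs(const unsigned char scaled[256],
                               unsigned char out[HEADER_MAX])
{
    size_t n = 0;
    int first = 0;
    int last;
    int i;

    /* Find the first symbol that actually occurs. */
    while (first < 256 && scaled[first] == 0)
        first++;
    assert(first < 256);   /* assumption: at least one nonzero count */

    while (first < 256) {
        /* Extend the run while the next count is nonzero. */
        last = first;
        while (last + 1 < 256 && scaled[last + 1] != 0)
            last++;

        out[n++] = (unsigned char)first;
        out[n++] = (unsigned char)last;
        for (i = first; i <= last; i++)
            out[n++] = scaled[i];

        /* Skip the zero counts that separate this run from the next. */
        first = last + 1;
        while (first < 256 && scaled[first] == 0)
            first++;
    }

    out[n++] = 0;   /* a "first" value of zero terminates the header */
    return n;
}
```

Because runs are written in ascending order, only the very first run can legitimately start at symbol 0, which is why a zero in any later "first" position can serve as the terminator.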

For a typical ASCII file, the start of the compressed file might look something like Figure 3.4.

Figure 3.4 The start of a typical compressed ASCII file.

This symbol count format takes a fair amount of work to generate, performed in output_counts() in HUFF.C. Reading in the symbol counts is much simpler, since the work has been done in advance. Reading the counts in from the compressed file during expansion is done in the input_counts() routine.
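The reading side can be sketched as follows. This is not input_counts() from HUFF.C: it parses a memory buffer rather than a file, and the name read_count_runs() and its return convention (bytes consumed, or 0 for a malformed header) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Parse a count-run header: first, last, counts[first..last], ..., 0.
 * Fills scaled[0..255] and returns the number of header bytes consumed,
 * or 0 if the header is truncated or inconsistent. */
static size_t read_count_runs(const unsigned char *in, size_t in_len,
                              unsigned char scaled[256])
{
    size_t n = 0;
    int first;
    int last;
    int i;

    assert(in != NULL && scaled != NULL);
    memset(scaled, 0, 256);   /* symbols outside every run stay at zero */

    if (n >= in_len)
        return 0;
    first = in[n++];          /* a zero here is a real symbol, not the end */

    for (;;) {
        if (n >= in_len)
            return 0;
        last = in[n++];
        if (last < first || n + (size_t)(last - first + 1) > in_len)
            return 0;
        for (i = first; i <= last; i++)
            scaled[i] = in[n++];

        if (n >= in_len)
            return 0;
        first = in[n++];
        if (first == 0)       /* zero after the first run ends the header */
            return n;
    }
}
```

The return value tells the caller where the Huffman-coded data begins in the buffer.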

Building the Tree

Whether compressing or expanding, once the counts have been loaded, it is time to build the Huffman tree. In HUFF.C, this is done in a function called build_tree(). Because some care was taken when creating the data structure, the actual process of creating the tree is the simple matter of sitting in a loop and combining the two free nodes with the lowest weight into a new internal node with the combined weight of the nodes. Once only one free node is left, the tree is done, and the free node is the root of the tree.

The logic of the build_tree() routine is fairly simple. When the routine is first entered, all nodes below 257 have a count value set to their frequency in the file. A nonzero value here means that this is an active node.

build_tree() also sets up a special node used as a straw man for comparison purposes. Node 513, which will never be used, is set to have a count value of 65535, which no normal node can ever exceed. When searching for the two minimum nodes, I will start by setting the minimum node to 513, knowing that any valid active node will fall below its value.

Finally, before the comparisons start, an index to the next free node is initialized. The node array is in use from 0 to 256, so the next free node will be at 257.

After things have been set up, build_tree() goes into an infinite loop. On each pass through the loop, build_tree() tries to find the two active nodes with the lowest weights. If only one node is found, the tree is complete and the loop is exited. If there are two good minimum values, a new node can be added to the tree. This new node is set up using the next_free node index. Its two child pointers are set to point to the two minimum nodes found before, and its weight is their sum. The two minimum nodes are now marked as being inactive by setting their weights to 0. Nodes with a weight of 0 are considered to be unused and will never again be selected to represent a minimum.

One piece of inefficient code is deliberately left in build_tree(). There is an extra member in the node structure called saved_count. When a node is taken off the active list by having its count set to zero, the previous count is stored in saved_count. Later, if the user has selected the -d option in order to print out the model, the saved_count can be printed. This helps when debugging the program and when trying to understand how the tree works.

Using the Tree

During the expansion phase, it is easy to see how to use the Huffman tree. Starting at the root node, a single bit at a time is read in by the decoder. If the bit is a 0, the next node is the one pointed to by the child_0 index. If the bit is a 1, the next node is the one pointed to by the child_1 index. If the new node is 256 or less, we have reached a leaf of the tree and can output the corresponding symbol. If the symbol was the special end-of-stream symbol, we can exit instead of sending it out. This is what is done in the expand_node() function. It is just a few lines of code, and it decodes a compressed Huffman code file with relative ease.
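A decoding loop along these lines is sketched below. It reads bits from a memory buffer rather than through the bit-oriented file routines used by the program, and it assumes the most significant bit of each byte comes first. The names read_bit() and expand_sketch() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define END_OF_STREAM 256

typedef struct tree_node {
    unsigned int count;
    unsigned int saved_count;
    int child_0;
    int child_1;
} NODE;

/* Fetch one bit, most significant bit first (an assumed bit order). */
static int read_bit(const unsigned char *data, size_t *bit_pos)
{
    int bit = (data[*bit_pos / 8] >> (7 - *bit_pos % 8)) & 1;
    (*bit_pos)++;
    return bit;
}

/* Decode symbols until end-of-stream is seen or the input runs out.
 * Returns the number of bytes written to out[]. */
static size_t expand_sketch(const NODE *nodes, int root,
                            const unsigned char *data, size_t data_bits,
                            unsigned char *out)
{
    size_t bit_pos = 0;
    size_t n = 0;
    int node;

    assert(nodes != NULL && data != NULL && out != NULL);
    for (;;) {
        node = root;
        /* Walk down from the root until we land on a leaf (index <= 256). */
        while (node > END_OF_STREAM) {
            if (bit_pos >= data_bits)
                return n;                /* truncated input */
            node = read_bit(data, &bit_pos) ? nodes[node].child_1
                                            : nodes[node].child_0;
        }
        if (node == END_OF_STREAM)
            break;                       /* marker reached: stop, emit nothing */
        out[n++] = (unsigned char)node;  /* leaf index is the symbol value */
    }
    return n;
}
```

For a tree in which 'a' has the code 1, 'b' the code 01, and end-of-stream the code 00, the single byte 0xAC (bits 1 01 01 1 00) decodes to "abba".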

Compressing the same file is a bit harder. Essentially, we want to work down the tree, outputting a 1 or a 0 bit at each node, till we get to the appropriate leaf node. Unfortunately, the tree structure makes this impossible. When we start at the root node, we have no idea whether to take the 0 or the 1 branch to arrive at a particular symbol.

One way to solve this problem when building the tree would be to add a parent member to the node structure. When combining the two minimum nodes to form a new internal node, each minimum node would have its parent member set to point to the new node. With this new member, we could start at the leaf node and work our way up through the tree toward the root. The only problem with this procedure is that we would accumulate bits in reverse order as we went up the tree. We would have to rack them up till we reached the root node, then put them out in reverse order.

Fortunately, there is a better way to do this. Rather than trying to use the tree to code our symbols when compressing a file, we could build a code table by recursively traversing the entire tree one time only. This creates a table of codes, one for each symbol, along with the length of each code. Once the table is built, the file can be encoded by simply outputting the appropriate code for every character in the input file.

The code to convert the tree data structures into a table of codes is very simple, thanks to a recursive algorithm. We start at the root node of the tree with a zero. Then we begin working down the individual branches of the tree, adding a one or a zero to the code each time we travel down a branch. Whenever we reach a leaf, we store the code values for that leaf in the code array and back up to the previous node, where we can start searching down the other side of the tree.

The code to accomplish this is in function convert_tree_to_code(). This routine takes a fair amount of work to create the code table, but once it is done the actual file compression is very easy.