Decompression - The Data Compression Book 2nd Ed Mark Nelson pdf

The companion algorithm for compression is the decompression algorithm. It takes the stream of codes output from the compression algorithm and uses them to recreate the exact input stream. One reason for the efficiency of the LZW algorithm is that it does not need to pass the dictionary to the decompressor. The table can be built exactly as it was during compression, using the input stream as data. This is possible because the compression algorithm always outputs the phrase and character components of a code before it uses it in the output stream, so the compressed data is not burdened with carrying a large dictionary.

old_string[ 0 ] = input_bits();
old_string[ 1 ] = '\0';
putc( old_string[ 0 ], output );
while ( ( new_code = input_bits() ) != EOF ) {
    /* Translate the code and send its string to the output. */
    new_string = dictionary_lookup( new_code );
    fputs( new_string, output );
    /* New table entry: previous string plus first character of this one. */
    append_char_to_string( old_string, new_string[ 0 ] );
    add_to_dictionary( old_string );
    strcpy( old_string, new_string );
}

Preceding is a rough C implementation. Like the compression algorithm, it adds a new string to the string table each time it reads in a new code. In addition, it translates each incoming code into a string and sends it to the output.

Following is the output of the algorithm given the input created by the earlier compression. Note that the string table ends up looking exactly like the table built during compression. The output string is identical to the input string from the compression algorithm. Note also that the first 256 codes are already defined to translate to single-character strings, as in the compression code.

Input Codes: " WED<256>E<260><261><257>B<260>T"

Input/      OLD_CODE    STRING/     CHARACTER   New table entry
NEW_CODE                Output
‘ ’         ‘ ’         “ ”
‘W’         ‘ ’         “W”         ‘W’         256 = “ W”
‘E’         ‘W’         “E”         ‘E’         257 = “WE”
‘D’         ‘E’         “D”         ‘D’         258 = “ED”
256         ‘D’         “ W”        ‘ ’         259 = “D ”
‘E’         256         “E”         ‘E’         260 = “ WE”
260         ‘E’         “ WE”       ‘ ’         261 = “E ”
261         260         “E ”        ‘E’         262 = “ WEE”
257         261         “WE”        ‘W’         263 = “E W”
‘B’         257         “B”         ‘B’         264 = “WEB”
260         ‘B’         “ WE”       ‘ ’         265 = “B ”
‘T’         260         “T”         ‘T’         266 = “ WET”

The Catch

Unfortunately, the decompression algorithm shown is just a little too simple. A single exception in the LZW compression algorithm causes some trouble in decompression. Each time the compressor adds a new string to the phrase table, it does so before the entire phrase has actually been output to the file. If for some reason the compressor used that phrase as its next code, the expansion code would have a problem. It would be expected to decode a string that was not yet in its table. Unfortunately, there is a way this can occur. If there is a phrase already in the table composed of a CHARACTER, STRING pair, and the input stream then sees a sequence of CHARACTER, STRING, CHARACTER, STRING, CHARACTER, the compression algorithm will output a code before the decompressor defines it.

A simple example will illustrate the point. Imagine the string IWOMBAT is defined in the table as code 300. Later, the sequence IWOMBATIWOMBATI occurs in the input stream. The compression output will look like the following:

Input String: IWOMBAT...IWOMBATIWOMBATIXXX

Character Input    New code value and associated string    Code Output
...I
WOMBATA            300 = IWOMBAT                           288 (IWOMBA)
. . .
...I
WOMBATI            400 = IWOMBATI                          300 (IWOMBAT)

When the decompression algorithm sees this input stream, it first decodes code 300 and outputs the IWOMBAT string. It will then add the definition for code 399 to the table, whatever that may be. It then reads the next input code, 400, and finds that it is not in the table.

Fortunately, this is the only time when the decompression algorithm will encounter an undefined code. Since it is the only time, we can add an exception handler to the algorithm. The modified algorithm just looks for the special case of an undefined code and handles it. In the sample, the decompression routine sees a code of 400. Since 400 is undefined, the program goes back to the previous code/string, which was “IWOMBAT”, or code 300. It then appends the first character of that string to the end of the string, yielding “IWOMBATI,” the correct value for code 400. Processing then proceeds as normal.

The exception handler takes advantage of the knowledge that this problem can happen only in the special circumstances of CHARACTER+STRING+CHARACTER+STRING+CHARACTER. Given that, any time an unknown code occurs, it can determine what the unknown code is given knowledge of the previous string from the input.

old_string[ 0 ] = input_bits();
old_string[ 1 ] = '\0';
putc( old_string[ 0 ], output );
while ( ( new_code = input_bits() ) != EOF ) {
    new_string = dictionary_lookup( new_code );
    if ( new_string == NULL ) {
        /* Undefined code: build it from the previous string plus that
           string's first character. temp_string is an assumed scratch
           buffer, since nothing can be copied into a NULL new_string. */
        strcpy( temp_string, old_string );
        append_character_to_string( temp_string, temp_string[ 0 ] );
        new_string = temp_string;
    }
    fputs( new_string, output );
    append_character_to_string( old_string, new_string[ 0 ] );
    add_to_dictionary( old_string );
    strcpy( old_string, new_string );
}
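As a rough, self-contained sketch of the modified algorithm, the following C code keeps the dictionary as a plain array of strings rather than the tree used in LZW12.C. The names table, lzw_decode, MAX_CODES, and MAX_LEN are assumptions made for this illustration and do not come from the book's listings, and the codes arrive as an int array instead of through input_bits().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIRST_CODE 256   /* codes 0..255 are the predefined single characters */
#define MAX_CODES  512   /* assumption: a small table is enough for a demo */
#define MAX_LEN    32    /* assumption: longest string the demo can hold */

static char table[MAX_CODES][MAX_LEN];
static int next_code;

/* Return the string for a code, or NULL if the code is not defined yet. */
static const char *dictionary_lookup(int code)
{
    static char single[2];

    if (code < FIRST_CODE) {
        single[0] = (char) code;
        single[1] = '\0';
        return single;
    }
    return code < next_code ? table[code] : NULL;
}

/* Decode n codes into out (NUL-terminated). Returns how many undefined
   codes the exception handler repaired, or -1 on bad input or overflow. */
int lzw_decode(const int *codes, int n, char *out, size_t out_size)
{
    char old_string[MAX_LEN], scratch[MAX_LEN];
    size_t used = 0, len;
    int repaired = 0, i;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    next_code = FIRST_CODE;
    if (n <= 0)
        return 0;
    if (codes[0] < 0 || codes[0] >= FIRST_CODE || out_size < 2)
        return -1;
    old_string[0] = (char) codes[0];
    old_string[1] = '\0';
    out[used++] = old_string[0];
    out[used] = '\0';

    for (i = 1; i < n; i++) {
        const char *new_string;

        if (codes[i] < 0)
            return -1;
        new_string = dictionary_lookup(codes[i]);
        len = strlen(old_string);
        if (len + 2 > MAX_LEN || next_code >= MAX_CODES)
            return -1;
        if (new_string == NULL) {
            /* The only legal undefined code is the one about to be added:
               the previous string plus its own first character. */
            if (codes[i] != next_code)
                return -1;
            memcpy(scratch, old_string, len);
            scratch[len] = old_string[0];
            scratch[len + 1] = '\0';
            new_string = scratch;
            repaired++;
        }
        if (used + strlen(new_string) >= out_size)
            return -1;
        strcpy(out + used, new_string);
        used += strlen(new_string);

        /* New table entry: previous string plus first character of this one. */
        old_string[len] = new_string[0];
        old_string[len + 1] = '\0';
        strcpy(table[next_code++], old_string);
        strcpy(old_string, new_string);
    }
    return repaired;
}
```

Under these assumptions, decoding the codes from the earlier example (' ', 'W', 'E', 'D', 256, 'E', 260, 261, 257, 'B', 260, 'T') should reproduce " WED WE WEE WEB WET" without ever reaching the exception handler. A hand-built stream such as 'A', 256, 'A' uses code 256 before it is defined, so it exercises the handler once and yields "AAAA".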

LZW Implementation

The concepts in the compression algorithm are so simple that the whole algorithm can be expressed in a dozen lines. Implementation of this algorithm is somewhat more complicated, mainly due to management of the dictionary. A short example program that uses twelve-bit codes is in LZW12.C, and it will illustrate some of the techniques used here.

Tree Maintenance and Navigation

As in the LZ78 algorithm, the LZW dictionary is maintained as a multiway tree. But in the case of LZW, the way the data is stored doesn’t look much like a tree. A little analysis, however, will reveal a multiway tree hidden behind the dictionary data structures.

struct dictionary {
    int code_value;
    int parent_code;
    char character;
} dict[ TABLE_SIZE ];

The structure shown in the preceding figure holds the entire dictionary tree. Each element in the data structure represents a single node. The node is defined by three items: (1) Code_value. This number is the actual code for the string that terminates at this node and is what the compression program emits when it wants to encode the string; (2) Parent_code. Under LZ78-style compression, every string in the dictionary has a parent string one character shorter than it. This integer is the code for that parent string; (3) Character. This is the character for this particular node. If the string encoded by the parent of a node were “GREENLEA,” and the character value was “F,” this node would encode “GREENLEAF.”

Something that immediately becomes noticeable as a problem here is that each dictionary node does not have a pointer or pointers to its child nodes. As we navigate the tree, how are we supposed to find the children of each node if there are no pointers to children?

The answer is that this tree maintains the dictionary pointers through a hashed array of nodes. To find the child of a particular node, we apply a hashing function to see where that puts us in the list. The hashing function used in LZW12.C is shown next.

unsigned int find_child_node( parent_code, child_character )
int parent_code;
int child_character;
{
    int index;
    int offset;

    /* First probe: shift the character up and XOR in the parent code. */
    index = ( child_character << ( BITS - 8 ) ) ^ parent_code;
    if ( index == 0 )
        offset = 1;
    else
        offset = TABLE_SIZE - index;
    for ( ; ; ) {
        /* An empty slot means this parent/character pair is not in the tree. */
        if ( dict[ index ].code_value == UNUSED )
            return( index );
        if ( dict[ index ].parent_code == parent_code &&
             dict[ index ].character == (char) child_character )
            return( index );
        /* Collision: step back by offset, wrapping around the table. */
        index -= offset;
        if ( index < 0 )
            index += TABLE_SIZE;
    }
}

This hashing function is essentially the same one used in the UNIX compress program. It combines the numeric values of the parent_code and the child_character to form an offset into the list of nodes; with an eight-bit character and a BITS-bit parent code, that offset is at most BITS bits wide. After finding the target node, it checks for collisions, since that node may be in use by some other element in the tree. Eventually, one of two things happens. Either this function finds a node already defined as belonging to the parent and child, or it finds an empty node that can be used that way.

This hashing function performs fairly well. The collision avoidance mechanism depends on having TABLE_SIZE be a prime number, and performance depends on it being at least 20 percent larger than two raised to the BITS power. In LZW12.C, TABLE_SIZE needs to be larger than 4,096. The number actually used was 5,021.
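To see how the hashed array stands in for child pointers, here is a minimal sketch that wraps the same probing logic in a small program fragment. BITS and TABLE_SIZE take the values quoted for LZW12.C; the value of UNUSED, the counter next_code, and the helpers init_dictionary and find_or_add_child are assumptions introduced only for this illustration.

```c
#include <assert.h>

#define BITS        12
#define TABLE_SIZE  5021      /* prime, and roughly 20% above 1 << BITS */
#define FIRST_CODE  256       /* codes 0..255 are the single characters */
#define UNUSED      (-1)      /* assumption: marker for an empty slot */

struct dictionary {
    int  code_value;
    int  parent_code;
    char character;
} dict[TABLE_SIZE];

static int next_code = FIRST_CODE;

/* Mark every slot empty before the first lookup. */
void init_dictionary(void)
{
    int i;

    for (i = 0; i < TABLE_SIZE; i++)
        dict[i].code_value = UNUSED;
    next_code = FIRST_CODE;
}

/* Return the slot holding (parent_code, child_character), or the empty
   slot where that pair would be stored. */
unsigned int find_child_node(int parent_code, int child_character)
{
    int index = (child_character << (BITS - 8)) ^ parent_code;
    int offset = (index == 0) ? 1 : TABLE_SIZE - index;

    for (;;) {
        if (dict[index].code_value == UNUSED)
            return (unsigned int) index;
        if (dict[index].parent_code == parent_code &&
            dict[index].character == (char) child_character)
            return (unsigned int) index;
        index -= offset;              /* collision: try the next probe */
        if (index < 0)
            index += TABLE_SIZE;
    }
}

/* Hypothetical helper: return the code for parent+character, creating
   the child node with the next free code if it does not exist yet. */
int find_or_add_child(int parent_code, int child_character)
{
    unsigned int slot;

    assert(child_character >= 0 && child_character <= 255);
    slot = find_child_node(parent_code, child_character);
    if (dict[slot].code_value == UNUSED) {
        assert(next_code < (1 << BITS));
        dict[slot].code_value = next_code++;
        dict[slot].parent_code = parent_code;
        dict[slot].character = (char) child_character;
    }
    return dict[slot].code_value;
}
```

After init_dictionary(), calling find_or_add_child('A', 'B') creates the node for "AB" as code 256, and find_or_add_child(256, 'C') hangs "ABC" beneath it as code 257. Asking for ('A', 'B') again lands on the same slot and returns 256, which is the "move down the tree" step the compressor needs. A real compressor would emit the parent code at the moment the child is missing; the helper folds lookup and insertion together only to keep the sketch short.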

With the hashing function in place, we can now effectively navigate down through the tree. The data structures used to maintain the dictionary during compression don’t help us move up the tree, but during compression we don’t need to move up the tree, only down.

During decompression, the hashing function is no longer used. Instead, each node in the tree has its parent code and character value stored at the array offset defined by its own code. This allows for quick lookup of dictionary values, which lets us move up the tree quickly. We need to move up the tree during decompression to determine the entire contents of a string, and this different storage method makes this possible. We never need to move down the tree during decompression, so the hashing function is no longer needed.
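The upward walk can be sketched as follows, assuming the decoder stores each node at the array offset equal to its own code. This is an illustration under that assumption, not the routine from LZW12.C: the function name decode_string, its signature, and the two-pass approach (measure the path, then fill the buffer from the end) are choices made here.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 5021
#define FIRST_CODE 256        /* codes below this are single characters */

struct dictionary {
    int  code_value;          /* not needed while decoding */
    int  parent_code;
    char character;
} dict[TABLE_SIZE];

/* Write the string for code into buf (NUL-terminated) by following
   parent_code links up to a single-character root. Returns the string
   length, or 0 if buf is too small. Assumes code is already defined. */
size_t decode_string(int code, char *buf, size_t buf_size)
{
    size_t len = 1, i;
    int c;

    assert(code >= 0 && code < TABLE_SIZE);

    /* Pass 1: count one character per node on the path, plus the root. */
    for (c = code; c >= FIRST_CODE; c = dict[c].parent_code)
        len++;
    if (len + 1 > buf_size)
        return 0;

    /* Pass 2: the walk yields characters last-to-first, so fill the
       buffer from the end toward the front. */
    buf[len] = '\0';
    i = len;
    for (c = code; c >= FIRST_CODE; c = dict[c].parent_code)
        buf[--i] = dict[c].character;
    buf[0] = (char) c;        /* the root code is its own character */
    return len;
}
```

Using three entries from the decompression table shown earlier (256 = " W", 260 = " WE", 266 = " WET"), node 266 stores parent 260 and character 'T', node 260 stores parent 256 and 'E', and node 256 stores parent ' ' and 'W'. Walking up from 266 therefore collects 'T', 'E', 'W', and finally the root ' ', which the function writes out in forward order as " WET".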

One additional feature of the dictionary tree used in LZW12.C needs explanation. The first 256 nodes are considered “special” nodes by the program. Each of them represents the one character string that corresponds with its node value. In other words, code 65 will always represent the character “A,” and it will automatically be assumed not to have a parent. These nodes are all predefined when the program is first initialized.
