One obvious way to tackle these problems is simply to start tinkering with the size of the window and the size of the look-ahead buffer. Instead of using a 4K window and a seventeen-byte buffer, for example, why not use a 64K text window and a 1K look-ahead buffer? Wouldn’t that address both problems effectively?
While raising the size of both parameters does seem to address these problems, the scheme has two major drawbacks. First, when we increase the window size from 4K to 64K, we need sixteen bits to encode an index location instead of twelve. And instead of needing four bits to encode a phrase length, we now need ten. So the cost of encoding a phrase rises from seventeen bits to twenty-seven. This increase of nearly 60 percent in the bit size of an index/length token can have a severely negative impact on the compression algorithm. For one thing, it will change the BREAK_EVEN point in the program from just under two characters to three characters. This means that matches of three or fewer characters can no longer be coded efficiently as index/length tokens and will instead have to be encoded as single characters. Encoding data as single characters is actually less efficient than plain text, since each character needs an additional bit to indicate that a normal character is coming.
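The bit counts above can be checked with a short sketch. It assumes one flag bit per token and nine bits (one flag bit plus eight data bits) per literal character, as the text describes; the function names are illustrative and do not come from the original program.

```c
#include <assert.h>

/* Bits in one index/length token: 1 flag bit + index bits + length bits. */
int token_bits(int index_bits, int length_bits)
{
    assert(index_bits > 0 && length_bits > 0);
    return 1 + index_bits + length_bits;
}

/* Bits needed to send n characters as literals: 1 flag bit + 8 data bits each. */
int literal_bits(int n)
{
    return n * 9;
}

/* Shortest match length for which a token is strictly cheaper than literals. */
int shortest_profitable_match(int index_bits, int length_bits)
{
    int n = 1;
    while (literal_bits(n) <= token_bits(index_bits, length_bits))
        n++;
    return n;
}
```

Under these assumptions, a 4K window with a four-bit length gives a seventeen-bit token and a shortest profitable match of two characters. A 64K window with a ten-bit length gives a twenty-seven-bit token, and a match must run to four characters before the token pays for itself.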
An even more distressing effect is that changing these parameters will drastically increase the amount of CPU time needed to perform compression. Under LZ77, just changing the text window size from 4K to 64K will result in the average search taking sixteen times longer, since every string in the window is compared to the look-ahead buffer. The situation is somewhat better under LZSS, since the strings are kept in a binary tree. In this case, the runtime cost of the window size is proportional to the logarithm of the window size. But this still means over a 30 percent increase in runtime.
The real penalty comes when the size of the look-ahead buffer is increased. Since our string comparisons between the text window phrases and the look-ahead buffer proceed sequentially, the runtime here will increase in direct proportion to the length of the look-ahead buffer. Going from sixteen to 1,024 characters means this portion of the program is going to run sixty-four times more slowly—a costly penalty indeed.
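The three scaling factors quoted above follow from simple ratios. The sketch below works them out under the stated cost models (linear in window size for a brute-force search, logarithmic for the binary tree, linear in look-ahead length for the string comparison). The function names are hypothetical, and the figures are relative costs, not measured timings.

```c
#include <assert.h>

/* Bits needed to address every position in a window whose size is a power of two. */
int bits_for(unsigned long window_size)
{
    int bits = 0;
    while ((1UL << bits) < window_size)
        bits++;
    return bits;
}

/* Brute-force LZ77: every window position is tried, so cost scales with size. */
double linear_search_factor(unsigned long old_window, unsigned long new_window)
{
    assert(old_window > 0);
    return (double)new_window / (double)old_window;
}

/* Binary-tree LZSS: cost scales with the logarithm of the window size. */
double tree_search_factor(unsigned long old_window, unsigned long new_window)
{
    assert(old_window > 1);
    return (double)bits_for(new_window) / (double)bits_for(old_window);
}

/* Sequential string comparison: cost scales with the look-ahead length. */
double compare_factor(unsigned long old_lookahead, unsigned long new_lookahead)
{
    assert(old_lookahead > 0);
    return (double)new_lookahead / (double)old_lookahead;
}
```

For a change from 4K to 64K, linear_search_factor gives 16 and tree_search_factor gives 16/12, or about 1.33. For a look-ahead change from 16 to 1,024 characters, compare_factor gives 64.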
These effects combine to effectively cancel out any gains from increasing either of these parameters in an LZ77 algorithm. And even with a 64K text window, we are still effectively tied to an algorithm that depends on recency to perform adequate compression.
Enter LZ78
To effectively sidestep these problems, Ziv and Lempel developed a different form of dictionary-based compression. This algorithm, popularly referred to as LZ78, was published in “Compression of Individual Sequences via Variable-Rate Coding” in IEEE Transactions on Information Theory (September 1978).
LZ78 abandons the concept of a text window. Under LZ77 the dictionary of phrases was defined by a fixed window of previously seen text. Under LZ78, the dictionary is a potentially unlimited list of previously seen phrases.
LZ78 is similar to LZ77 in some ways. LZ77 outputs a series of tokens. Each token has three components: a phrase location, the phrase length, and a character that follows the phrase. LZ78 also outputs a series of tokens with essentially the same meanings. Each LZ78 token consists of a code that selects a given phrase and a single character that follows the phrase. Unlike LZ77, the phrase length is not passed since the decoder knows it.
Unlike LZ77, LZ78 does not have a ready-made window full of text to use as a dictionary. It creates a new phrase each time a token is output, and it adds that phrase to the dictionary. After the phrase is added, it will be available to the encoder at any time in the future, not just for the next few thousand characters.
LZ78 Details
When using the LZ78 algorithm, both the encoder and the decoder start off with a nearly empty dictionary. By definition, the dictionary has a single encoded string: the null string. As each character is read in, it is added to the current string. As long as the current string matches some phrase in the dictionary, this process continues.
But eventually the string will no longer have a corresponding phrase in the dictionary. This is when LZ78 outputs a token and a character. Remember that the string did have a match in the dictionary until the last character was read in. The current string, therefore, is defined as that last match with one new character added on. This is what LZ78 outputs: the index for the previous match and the character that broke that match.
But at this point, LZ78 takes an additional step. The new phrase, consisting of the dictionary match and the new character, is added to the dictionary. The next time that phrase appears, it can be used to build an even longer phrase.
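The pseudocode below glosses over some details, but it is a fairly faithful representation of the algorithm.

    for ( ; ; ) {
        current_match = 0;              /* start from phrase 0, the null string */
        current_length = 0;
        memset( test_string, '\0', MAX_STRING );
        for ( ; ; ) {
            /* extend the current string by one character
               (end-of-file handling is omitted here) */
            test_string[ current_length++ ] = getc( input );
            new_match = find_match( test_string );
            if ( new_match == -1 )
                break;                  /* no such phrase: the match is over */
            current_match = new_match;
        }
        output_code( current_match );   /* index of the last successful match */
        output_char( test_string[ current_length - 1 ] );  /* character that broke it */
        add_string_to_dictionary( test_string );
    }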
By definition, the empty string will always match string 0, the null node in the dictionary. Thus, when we encounter a character for the first time, it is encoded as phrase 0 plus the new character. The next time that character appears, it will be encoded as part of a phrase.
An example of the encoder output follows. The input text is a sequence of words from the dictionary of a spelling checker. The LZ78 encoder starts encoding with no phrases in the dictionary; therefore, the first character it reads in from the input text, ‘D’, creates a string that has no match in the dictionary. The encoder will then output a phrase/character pair, in this case 0 and ‘D’. Remember that the dictionary starts up with zero defined as the empty phrase.
Input text: "DAD DADA DADDY DADO..."
The first two characters to come through the encoder, ‘D’ and ‘A’, have not been seen before. Each will have to be encoded as a pair consisting of phrase 0 plus the character. “D” is added to the dictionary as phrase 1, and “A” is added as phrase 2.
When the third character, ‘D,’ is read in, it matches an existing dictionary phrase. The ‘ ’ character, the next character read in, creates a new phrase with no match in the dictionary. LZ78 will output code 1 for the previous match (the D string), then the “ ” character.
As the encoding continues, the dictionary quickly builds up fairly long phrases. Of course, since these entries are from a dictionary sorted in alphabetical order, we probably build up phrases much faster than would normally be the case. After just nineteen characters have been read in and encoded, the dictionary looks like the one following.

Output Phrase    Output Character    Encoded String
0                ‘D’                 “D”
0                ‘A’                 “A”
1                ‘ ’                 “D ”
1                ‘A’                 “DA”
4                ‘ ’                 “DA ”
4                ‘D’                 “DAD”
1                ‘Y’                 “DY”
0                ‘ ’                 “ ”
6                ‘O’                 “DADO”
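The worked example can be reproduced with a small runnable sketch. This is not the chapter's implementation: it assumes a tiny fixed-size dictionary stored as parallel arrays and searched linearly, which is slow but easy to follow. The names Token, find_phrase, lz78_encode, and MAX_PHRASES are introduced here for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PHRASES 256   /* arbitrary small limit for this sketch */

typedef struct {
    int  code;        /* index of the longest matching phrase (0 = null string) */
    char character;   /* character that followed the match */
} Token;

/* Phrase i is phrase parent[i] followed by last_char[i]; phrase 0 is empty. */
static int  parent[MAX_PHRASES];
static char last_char[MAX_PHRASES];
static int  phrase_count;

/* Return the phrase formed by appending c to phrase `prefix`, or -1 if none. */
static int find_phrase(int prefix, char c)
{
    for (int i = 1; i < phrase_count; i++)
        if (parent[i] == prefix && last_char[i] == c)
            return i;
    return -1;
}

/* Encode a NUL-terminated string; returns the number of tokens written. */
int lz78_encode(const char *text, Token *out, int max_out)
{
    int n = 0;

    assert(text != NULL && out != NULL);
    phrase_count = 1;                     /* only the null string is defined */
    while (*text != '\0' && n < max_out) {
        int match = 0;
        int next;

        /* Extend the match while the dictionary holds the longer phrase.
           Stop before the final input character so every token carries one. */
        while (text[1] != '\0' && (next = find_phrase(match, *text)) != -1) {
            match = next;
            text++;
        }
        out[n].code = match;
        out[n].character = *text++;

        /* The match plus the new character becomes the next phrase. */
        if (phrase_count < MAX_PHRASES) {
            parent[phrase_count] = match;
            last_char[phrase_count] = out[n].character;
            phrase_count++;
        }
        n++;
    }
    return n;
}
```

Calling lz78_encode on "DAD DADA DADDY DADO" yields nine tokens whose codes and characters match the table: (0,D), (0,A), (1,space), (1,A), (4,space), (4,D), (1,Y), (0,space), (6,O). Storing each phrase as a parent index plus one character avoids keeping full strings, at the cost of a linear scan per lookup.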
LZ78 Implementation
Like LZ77, LZ78 can arbitrarily set the size of the phrase dictionary. And like LZ77, in LZ78 we have to worry about the effects of this in two ways. First, we have to consider the number of bits allocated in the output token for the phrase code. Second, and more importantly, we have to consider how much CPU time managing the dictionary will take.
In theory, LZ78 should compress better and better as the size of the dictionary increases. But this only holds true as the length of the input text tends towards infinity. In practice, smaller files will quickly begin to suffer as the code size grows larger.
The real difficulty with LZ78 actually comes in managing the dictionary. If we use a sixteen-bit code for the phrase index, for example, we can accommodate 65,536 phrases, including the null code. The phrases can vary tremendously in length, including the improbable possibility of 65,536 different versions of a phrase composed of runs of a single, repeated character.
These phrases are conventionally stored in a multiway tree. The tree starts at a root node, 0, the null string. Each possible character that can be added to the null string is a new branch of the tree, with each phrase created that way getting a new node number.
Figure 9.1 An LZ78 Dictionary Tree.

0 “” (root, the null string)
    1 “D”
        3 “D ”
        4 “DA”
            5 “DA ”
            6 “DAD”
                9 “DADO”
        7 “DY”
    2 “A”
    8 “ ”
The dictionary tree shown here would be created after the previous nineteen-character sequence was encoded. The major difficulty with managing a tree such as this is the potentially large number of branches off of each node. When compressing binary files with an eight-bit alphabet, 256 branches off of each node are possible. We could simply allocate an array of indices or pointers at each node that was large enough to accommodate all 256 possible descendants. But since most nodes will not have that many descendants, it would be incredibly wasteful to allocate so much storage. Instead, descendant nodes are usually managed as a list of indices no longer than the number of descendant nodes that actually exist. This technique makes better use of available memory, but it is also significantly slower. It is essentially the same technique used in Chapter 6 to perform higher-order modeling of data streams.
With a tree like this, comparing an existing string to the dictionary is simple. It is just a matter of walking through the tree, traversing a single node of the tree for every character in the phrase. If the phrase terminates at a particular node, we have a match. If characters remain in the phrase but the current node has no branch for the next one, there is not a match. After the symbol has been encoded, adding the new phrase to the tree is also simple: it is just a matter of adding space to the descendant list, then inserting a new descendant node at the node last matched.
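One way to realize such a tree is sketched below. It assumes each node keeps its descendants on a singly linked sibling list, so storage grows only with the branches that actually exist. The type and function names (Node, Dictionary, dict_find_child, and so on) and the MAX_NODES limit are illustrative choices, not taken from the original program.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4096   /* arbitrary dictionary size for this sketch */

typedef struct {
    unsigned char symbol;   /* character on the branch leading to this node */
    int first_child;        /* index of the first descendant, or -1 */
    int next_sibling;       /* index of the next node with the same parent, or -1 */
} Node;

typedef struct {
    Node nodes[MAX_NODES];
    int  count;             /* nodes in use; node 0 is the null string */
} Dictionary;

void dict_init(Dictionary *d)
{
    d->nodes[0].symbol = 0;
    d->nodes[0].first_child = -1;
    d->nodes[0].next_sibling = -1;
    d->count = 1;
}

/* Scan the descendant list of `node` for `symbol`; return its index or -1. */
int dict_find_child(const Dictionary *d, int node, unsigned char symbol)
{
    assert(node >= 0 && node < d->count);
    for (int c = d->nodes[node].first_child; c != -1; c = d->nodes[c].next_sibling)
        if (d->nodes[c].symbol == symbol)
            return c;
    return -1;
}

/* Add a descendant under `node`; return its index, or -1 if the tree is full. */
int dict_add_child(Dictionary *d, int node, unsigned char symbol)
{
    if (d->count >= MAX_NODES)
        return -1;
    int n = d->count++;
    d->nodes[n].symbol = symbol;
    d->nodes[n].first_child = -1;
    d->nodes[n].next_sibling = d->nodes[node].first_child;
    d->nodes[node].first_child = n;
    return n;
}

/* Walk one node per character; return the node where the phrase ends, or -1. */
int dict_find_phrase(const Dictionary *d, const char *phrase)
{
    int node = 0;
    for (; *phrase != '\0'; phrase++) {
        node = dict_find_child(d, node, (unsigned char)*phrase);
        if (node == -1)
            return -1;
    }
    return node;
}
```

The sibling list trades speed for space: each lookup scans up to as many entries as the node has descendants, whereas a 256-entry array per node would give constant-time lookup at a much higher memory cost.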
One negative side effect of LZ78 not found in LZ77 is that the decoder has to maintain this tree as well. With LZ77, a dictionary index was just a pointer or index to a previous position in the data stream. But with LZ78, the index is the number of a node in the dictionary tree. The decoder, therefore, has to keep up the tree in exactly the same fashion as the encoder, or a disastrous mismatch will occur.
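A minimal decoder sketch shows what keeping the dictionary in step involves. It assumes tokens arrive as (code, character) pairs and stores each phrase as a parent index plus its final character; the names Token, lz78_decode, and MAX_PHRASES are hypothetical. The decoder adds a phrase after every token, exactly as the encoder would, and treats a code it has not yet defined as a sign that the two sides have fallen out of step.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PHRASES 256   /* must equal the encoder's limit */

typedef struct {
    int  code;        /* index of a previously defined phrase (0 = null string) */
    char character;   /* character that follows the phrase */
} Token;

/* Decode tokens into `out` as a NUL-terminated string.
   Returns the decoded length, or -1 on a bad code or a full buffer. */
int lz78_decode(const Token *tokens, int token_count, char *out, int out_size)
{
    int  parent[MAX_PHRASES];      /* phrase i = phrase parent[i] + last_char[i] */
    char last_char[MAX_PHRASES];
    int  length[MAX_PHRASES];      /* length of each phrase in characters */
    int  phrase_count = 1;
    int  pos = 0;

    assert(tokens != NULL && out != NULL && out_size > 0);
    parent[0] = 0;
    last_char[0] = '\0';
    length[0] = 0;

    for (int t = 0; t < token_count; t++) {
        int code = tokens[t].code;
        if (code < 0 || code >= phrase_count)
            return -1;                         /* dictionaries out of step */
        int len = length[code];
        if (pos + len + 1 >= out_size)
            return -1;                         /* no room for phrase, char, NUL */

        /* Follow parent links, writing the phrase from its last character back. */
        for (int i = code, k = pos + len - 1; i != 0; i = parent[i], k--)
            out[k] = last_char[i];
        pos += len;
        out[pos++] = tokens[t].character;

        /* Mirror the encoder: matched phrase + character is the next phrase. */
        if (phrase_count < MAX_PHRASES) {
            parent[phrase_count] = code;
            last_char[phrase_count] = tokens[t].character;
            length[phrase_count] = len + 1;
            phrase_count++;
        }
    }
    out[pos] = '\0';
    return pos;
}
```

Feeding this function the nine tokens from the earlier table reconstructs the nineteen-character input "DAD DADA DADDY DADO". The decoder never searches the dictionary, so it only needs parent links rather than descendant lists.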
Another issue ignored so far is that of the dictionary filling up. Regardless of how big the dictionary space is, it is going to fill up sooner or later. If we are using a sixteen-bit code, the dictionary will fill up after it has 65,535 phrases defined in it.
There are several alternative choices regarding a full dictionary. Probably the safest default choice is to stop adding new phrases to the dictionary after it is full. This only requires an extra line or two of code in the add_string_to_dictionary() routine.
But just leaving the dictionary alone may not be the best choice. When compressing large streams of data, we may see significant changes in the character of the incoming data. When compressing a program’s binary image (such as an EXE file), for example, we would expect to see a major shift in the statistical model of the data as we move out of the code section of the file and into the data section.
If we keep using our existing phrase dictionary, we may be stuck with an out-of-date dictionary that isn’t compressing very well. At the same time, we have to be careful not to throw away a dictionary that is compressing well.
The UNIX compress program, which uses an LZ78 variant, manages the full dictionary problem by monitoring the compression ratio of the file. If the compression ratio ever starts to deteriorate, the dictionary is deleted and the program starts over from scratch. Otherwise, the existing dictionary continues to be used, though no new phrases are added to it.
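The two policies described above, freezing a full dictionary and discarding it when compression deteriorates, can be sketched as follows. This is an illustration of the idea only, not the code used by compress: the checkpoint interval, the use of an output/input byte ratio, and every name here (DictState, dict_try_add, dict_checkpoint) are assumptions made for the example.

```c
#include <assert.h>

#define MAX_PHRASES 65536L   /* sixteen-bit codes, including the null phrase */

typedef struct {
    long   phrase_count;   /* phrases currently defined, including phrase 0 */
    double best_ratio;     /* lowest output/input ratio seen since the
                              dictionary filled; negative means none yet */
} DictState;

void dict_state_init(DictState *s)
{
    s->phrase_count = 1;
    s->best_ratio = -1.0;
}

/* Called when a new phrase is ready. Returns 1 if it was added,
   0 if the dictionary is full and was left alone. */
int dict_try_add(DictState *s)
{
    if (s->phrase_count >= MAX_PHRASES)
        return 0;
    s->phrase_count++;
    return 1;
}

/* Called periodically with byte counts for the most recent interval.
   Returns 1 if the dictionary was discarded, 0 otherwise. */
int dict_checkpoint(DictState *s, long bytes_in, long bytes_out)
{
    assert(bytes_in > 0);
    if (s->phrase_count < MAX_PHRASES)
        return 0;                          /* still growing: nothing to judge */

    double ratio = (double)bytes_out / (double)bytes_in;
    if (s->best_ratio < 0.0 || ratio <= s->best_ratio) {
        s->best_ratio = ratio;             /* holding steady or improving */
        return 0;
    }
    dict_state_init(s);                    /* ratio got worse: start over */
    return 1;
}
```

In a real codec the decoder must be told when the dictionary is discarded, for example through a reserved code in the output stream, so that both sides restart together. A practical version would probably also tolerate small fluctuations in the ratio rather than resetting on the first bad interval.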