token is less than the number of bits in the phrase, compression will occur. But this definition of dictionary-based compression still leaves enormous room for variation. Consider, for example, the methods for building and maintaining a dictionary.
In some cases, it is advantageous to use a predefined dictionary to encode text. If the text to be encoded is a database containing all motor-vehicle registrations for Texas, we could develop a dictionary with only a few thousand entries that concentrated on words like “General Motors,” “Smith,” “Main,” and “1977.” Once this dictionary was compiled, it could be kept on-line and used by both the encoder and decoder as needed.
A dictionary like this is called a static dictionary. It is built up before compression occurs, and it does not change while the data is being compressed. It has advantages and disadvantages. One of the biggest advantages is that a static dictionary can be “tuned” to fit the data it is compressing. With the motor-vehicle registration database, for example, Huffman encoding could allocate fewer bits to strings such as “Ford” and more bits to “Yugo.” Of course, we could use different bit strings depending on which field is being compressed.
Adaptive compression schemes can’t tune their dictionaries in advance, which in principle would seem a major disadvantage. But static dictionary schemes have to deal with the problem of how to pass the dictionary from the encoder to the decoder. Chapters 3 and 5 showed that passing statistics along with compressed data can significantly harm compression, particularly on small files.
But this doesn’t have to be a disadvantage in every case. In many situations, a static dictionary could remain the same over long periods of time and be kept on line, available to both the compressor and the decompressor. The motor-vehicle database dictionary could be calculated once, for example, then kept on hand. In the case of an exceptionally large amount of data, the compression ratio may not be significantly degraded if the dictionary is passed with the compressed text.
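A static dictionary of this kind can be sketched as a fixed table that the encoder and decoder both hold before compression starts. The sketch below is only an illustration: the table contents are the four sample phrases mentioned earlier, and the names STATIC_DICTIONARY, static_look_up, and static_phrase are hypothetical, not part of any real program.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical static dictionary: built before compression, never modified. */
static const char *const STATIC_DICTIONARY[] = {
    "General Motors", "Smith", "Main", "1977"
};

enum {
    STATIC_DICTIONARY_SIZE =
        sizeof STATIC_DICTIONARY / sizeof STATIC_DICTIONARY[0]
};

/* Encoder side: return the index of phrase, or -1 if it is not in the table. */
int static_look_up(const char *phrase)
{
    for (int i = 0; i < STATIC_DICTIONARY_SIZE; i++)
        if (strcmp(STATIC_DICTIONARY[i], phrase) == 0)
            return i;
    return -1;
}

/* Decoder side: turn an index back into its phrase. */
const char *static_phrase(int index)
{
    assert(index >= 0 && index < STATIC_DICTIONARY_SIZE);
    return STATIC_DICTIONARY[index];
}
```

With only four entries, an index fits in two bits, while the phrase “General Motors” occupies fourteen bytes as plain text. Because both sides already hold the table, nothing but the indices needs to be transmitted.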
Adaptive Methods
At present, dictionary-based compression schemes using static dictionaries are mostly ad hoc, implementation dependent, and not general purpose. Most well-known dictionary algorithms are adaptive. Instead of having a completely defined dictionary when compression begins, adaptive schemes start out either with no dictionary or with a default baseline dictionary. As compression proceeds, the algorithms add new phrases to be used later as encoded tokens.
The basic principle behind adaptive dictionary programs is relatively easy to follow. Imagine a section of code that compressed text using an algorithm that looked something like this:
for ( ; ; ) {
    word = read_word( input_file );
    dictionary_index = look_up( word, dictionary );
    if ( dictionary_index < 0 ) {
        output( word, output_file );
        add_to_dictionary( word, dictionary );
    } else
        output( dictionary_index, output_file );
}
If the dictionary index used here could be encoded as an integer index into a table, we would achieve respectable compression with what is actually a very simple algorithm. This code is specialized for written documents, but the principle behind it is similar to that behind many more sophisticated algorithms. It illustrates the basic components of an adaptive dictionary compression algorithm:
1. To parse the input text stream into fragments tested against the dictionary.
2. To test the input fragments against the dictionary; it may or may not be desirable to report on partial matches.
3. To add new phrases to the dictionary.
4. To encode dictionary indices and plain text so that they are distinguishable.
The corresponding decompression program has a slightly different set of requirements. It no longer has to parse the input text stream into fragments, and it doesn’t have to test fragments against the dictionary. Instead, it has the following requirements: (1) to decode the input stream into either dictionary indices or plain text; (2) to add new phrases to the dictionary; (3) to convert dictionary indices into phrases; and (4) to output phrases as plain text. The ability to accomplish these tasks with relatively low costs in system resources made dictionary-based programs popular over the last ten years.
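The encoder and decoder requirements listed above can be sketched together in a few lines of C. This is a minimal illustration under stated assumptions, not a practical compressor: words are separated by single spaces, the input fits in a 256-byte buffer, no input word begins with '#', and a dictionary index is written as the text "#<index>" so that it can be told apart from plain text. The names Dictionary, encode_words, and decode_words are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORDS    64
#define MAX_WORD_LEN 32

/* Words are stored in the order in which they are first seen. */
typedef struct {
    char words[MAX_WORDS][MAX_WORD_LEN];
    int  count;
} Dictionary;

/* Return the index of word in the dictionary, or -1 if it is absent. */
static int look_up(const Dictionary *d, const char *word)
{
    for (int i = 0; i < d->count; i++)
        if (strcmp(d->words[i], word) == 0)
            return i;
    return -1;
}

/* Append word; assumes the table has room and the word fits. */
static void add_to_dictionary(Dictionary *d, const char *word)
{
    assert(d->count < MAX_WORDS && strlen(word) < MAX_WORD_LEN);
    strcpy(d->words[d->count++], word);
}

/* Encoder: new words pass through as text, repeats become "#<index>". */
void encode_words(const char *text, char *out, size_t out_size)
{
    Dictionary dict = { .count = 0 };
    char copy[256];
    size_t used = 0;

    strncpy(copy, text, sizeof copy - 1);
    copy[sizeof copy - 1] = '\0';
    out[0] = '\0';
    for (char *w = strtok(copy, " "); w != NULL; w = strtok(NULL, " ")) {
        int index = look_up(&dict, w);
        if (index < 0) {
            used += snprintf(out + used, out_size - used, "%s ", w);
            add_to_dictionary(&dict, w);
        } else {
            used += snprintf(out + used, out_size - used, "#%d ", index);
        }
    }
}

/* Decoder: rebuilds the same dictionary from the plain-text words it sees. */
void decode_words(const char *encoded, char *out, size_t out_size)
{
    Dictionary dict = { .count = 0 };
    char copy[256];
    size_t used = 0;

    strncpy(copy, encoded, sizeof copy - 1);
    copy[sizeof copy - 1] = '\0';
    out[0] = '\0';
    for (char *t = strtok(copy, " "); t != NULL; t = strtok(NULL, " ")) {
        if (t[0] == '#') {                    /* dictionary index */
            int index = atoi(t + 1);
            assert(index >= 0 && index < dict.count);
            used += snprintf(out + used, out_size - used, "%s ",
                             dict.words[index]);
        } else {                              /* plain text: output and learn it */
            used += snprintf(out + used, out_size - used, "%s ", t);
            add_to_dictionary(&dict, t);
        }
    }
}
```

Note that the decoder never receives the dictionary itself. It reconstructs it by adding each plain-text word in the same order the encoder did, which is what keeps the indices on both sides in step.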
A Representative Example
Compressing data when sending it to magnetic tape has several nice side effects. First, it reduces the use of magnetic tape. Though magnetic tape is not particularly expensive, some applications make prodigious use of it. Second, the effective transfer rate to and from the tape is increased.
Improvements in transfer speed through hardware are generally expensive, but compression through software is in a sense “free.” Finally, in some cases, the overall CPU time involved may actually be reduced. If the CPU cost of writing a byte to magnetic tape is sufficiently high, writing half as many compressed bytes may save enough cycles to pay for the compression.
While the benefits of compressing data before sending it to magnetic tape have been clear, only sporadic methods were used until the late 1980s. In 1989, however, Stac Electronics successfully implemented a dictionary-based compression algorithm on a chip. This algorithm was quickly embraced as an industry standard and is now widely used by tape-drive manufacturers worldwide. This compression method is generally referred to by the standard which defines it: QIC-122. (QIC refers to the Quarter Inch Cartridge industry group, a trade association of tape-drive manufacturers.) As you may know, Stac Electronics expanded the scope of this algorithm beyond tape drives to the consumer hard disk utility market in the form of its successful Stacker program (discussed later in this chapter).
QIC-122 provides a good example of how a sliding-window, dictionary-based compression algorithm actually works. It is based on the LZ77 sliding-window concept. As symbols are read in by the encoder, they are added to the end of a 2K window that forms the phrase dictionary. To encode a symbol, the encoder checks to see if it is part of a phrase already in the dictionary. If it is, it creates a token that defines the location of the phrase and its length. If it is not, the symbol is passed through unencoded.
The output of a QIC-122 encoder consists of a stream of data, which, in turn, consists of tokens and symbols freely intermixed. Each token or symbol is prefixed by a single bit flag that indicates whether the following data is a dictionary reference or a plain symbol. The definitions for these two sequences are: (1) plaintext: <1><eight-bit-symbol>; (2) dictionary reference: <0><window-offset><phrase-length>.
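To make the flag-plus-payload layout concrete, here is a small bit-writer sketch that emits the plaintext form, a 1 bit followed by the eight-bit symbol. The BitWriter type, the 64-byte buffer, the helper names, and the choice to pack bits most-significant-bit first are all assumptions made for illustration; the text above does not specify a bit order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory bit sink. */
typedef struct {
    unsigned char bytes[64];
    size_t        bit_count;   /* number of bits written so far */
} BitWriter;

void bit_writer_init(BitWriter *w)
{
    memset(w, 0, sizeof *w);
}

/* Append one bit, filling each byte from its most significant bit down
   (an assumption; the packing order is not given in the text). */
void output_bit(BitWriter *w, int bit)
{
    assert(w->bit_count < sizeof w->bytes * 8);
    if (bit)
        w->bytes[w->bit_count / 8] |=
            (unsigned char)(0x80u >> (w->bit_count % 8));
    w->bit_count++;
}

/* Append the low `width` bits of value, highest bit first. */
void output_bits(BitWriter *w, unsigned value, int width)
{
    for (int i = width - 1; i >= 0; i--)
        output_bit(w, (int)((value >> i) & 1u));
}

/* Plaintext sequence: <1><eight-bit-symbol>, nine bits in total. */
void output_plain_symbol(BitWriter *w, unsigned char symbol)
{
    output_bit(w, 1);
    output_bits(w, symbol, 8);
}
```

A plain symbol therefore costs nine bits rather than eight, so a stream with no matches at all expands slightly. Dictionary references have to save more than that overhead for the scheme to pay off.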
The QIC-122 encoder complicates things by further encoding the window-offset and phrase-length codes. Window offsets of less than 128 bytes are encoded in seven bits. Offsets between 128 bytes and 2,047 bytes are encoded in eleven bits. The phrase length uses a variable-bit coding scheme which favors short phrases over long. This explanation will gloss over these as “implementation details.” The glossed-over version of the C code for this algorithm is shown here.
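while ( !out_of_symbols ) {
    /* Search the window once for the longest phrase matching the input. */
    length = find_longest_match( &offset );
    if ( length > 1 ) {
        /* Dictionary reference: flag bit 0, then offset and length. */
        output_bit( 0 );
        output_bits( offset );
        output_bits( length );
        shift_input_buffer( length );
    } else {
        /* Plain symbol: flag bit 1, then the symbol itself. */
        output_bit( 1 );
        output_byte( buffer[ 0 ] );
        shift_input_buffer( 1 );
    }
}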
Following is an example of what this sliding window looks like when used to encode some C code, in this case the phrase “output_byte.” The previously encoded text, which ends with the phrase “output_bit( 1 );\r,” is at the end of the window. The find_longest_match routine will return a value of 8, since the first eight characters of “output_byte” match the first eight characters of “output_bit.” The encoder will then output a 0 bit to indicate that a dictionary reference is following. Next it will output a 15 to indicate that the start of the phrase is fifteen characters back into the window (‘\r’ is a single symbol). Finally, it will output an 8 to indicate that there are eight matching symbols from the phrase.
Figure 7.1 A sliding window used to encode some C code.
Using QIC-122 encoding, this will take exactly sixteen bits to encode, which means it encodes 8 bytes of data with only 2 bytes. This is clearly a respectable compression ratio, typical of how QIC-122 works under the best circumstances as shown here:
Figure 7.2 Encoding 8 bytes of data using only 2 bytes.
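The match in this example can be reproduced with a brute-force search. The function below is a sketch, not the QIC-122 routine: the name longest_match_in_window and its signature are hypothetical, it only considers matches that lie entirely inside the window, and it scans every starting position rather than using any faster lookup. For the check that follows, the window is assumed to end with the fifteen characters of `output_bit(1);\r`, written without inner spaces so that the distance comes out to fifteen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Find the longest prefix of `ahead` that also appears in `window`.
 * Returns the match length; *offset receives the distance from the end
 * of the window back to the first matching character (0 if no match).
 * On ties, the earliest (farthest back) match is kept.
 */
int longest_match_in_window(const char *window, size_t window_len,
                            const char *ahead, size_t ahead_len,
                            size_t *offset)
{
    size_t best_len = 0;

    *offset = 0;
    for (size_t start = 0; start < window_len; start++) {
        size_t len = 0;

        /* Extend the match while both buffers agree. */
        while (len < ahead_len && start + len < window_len &&
               window[start + len] == ahead[len])
            len++;

        if (len > best_len) {
            best_len = len;
            *offset = window_len - start;
        }
    }
    return (int)best_len;
}
```

Called with the window `output_bit(1);\r` and input beginning `output_byte`, the search stops at the first mismatch ('i' against 'y') and reports a length of 8 at an offset of 15.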
After the dictionary reference is output, the input stream is shifted over eight characters, with the last symbol encoded becoming the last symbol in the window. The next three symbols will not match anything in the window, so they will have to be individually encoded.
This example of QIC-122 gives a brief look at how a dictionary-based compression scheme might work. Chapter 8 will take a more extensive look at LZ77 and its derivatives.