LZW compression process - Assembly Language, The True Language Of Programmers pdf

The number of bits per pixel appears once more at the start of the first Raster Block, directly before the length byte. The LZW process uses this value to compress the image.

The most important advantage of using LZW is that it reduces the file size of the image.

Unlike other compression schemes such as RLE (Run Length Encoding) used in the PCX format, LZW can handle both adjacent identical bytes and sequences of bytes that are not adjacent. To do this, an "extended alphabet" is used. Instead of the usual 8 bits, this alphabet uses additional encoding bits. For example, by using 9 bits per character, a file can contain codes 256-511 in addition to the usual codes 0-255. These are used to represent character strings.
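To make the contrast with RLE concrete, here is a minimal Python sketch of a generic run-length encoder. It is not the exact PCX variant, and the function name rle_encode is made up for this illustration. It only shows that run-length coding helps when identical bytes sit next to each other and gains nothing from a repeated pattern such as ABCABCABC.

```python
def rle_encode(data):
    """Generic run-length encoding: a list of (run_length, byte_value) pairs.

    Illustrative only; the PCX format uses its own RLE byte layout.
    """
    runs = []
    for value in data:
        if runs and runs[-1][1] == value:
            # Same byte as the previous one: extend the current run.
            runs[-1] = (runs[-1][0] + 1, value)
        else:
            # Different byte: start a new run of length 1.
            runs.append((1, value))
    return runs


# Six identical adjacent bytes collapse into a single run.
print(rle_encode(b"AAAAAA"))
# A repeated three-byte pattern has no adjacent duplicates,
# so every byte becomes its own run and nothing is saved.
print(len(rle_encode(b"ABCABCABC")))
```

A repeated pattern of this second kind is exactly where string codes in an extended alphabet pay off.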

Here's how it works. Characters are read from the source file or video memory until a character string is encountered which is no longer found in the alphabet. This happens in the beginning after only two characters: The first character will be in the alphabet (an ordinary character from 0 to 255), while the string formed from the first two characters does not yet exist.

When the compressor reaches this point it writes this character string into the alphabet, so the next time the string occurs it can compress it by replacing it with its code. The code of the longest character string still contained in the alphabet is then written to the destination file, and the character string is reinitialized with the last character read. Characters are again added until the string is no longer found in the alphabet.

Now the question is how many bits you should use for the alphabet. If you use too few, the alphabet will soon overflow, and reinitializing it greatly reduces the compression rate. However, taking too many bits immediately wastes space because the upper bits will never be needed. The solution is the modified LZW process, which uses a variable code width. You begin with nine bits, which allows an alphabet with 512 entries. If this limit is exceeded, another bit is simply added.

On the other hand, it makes no sense to keep extending indefinitely. This wastes bits unnecessarily, since the majority of character strings already in the alphabet will never be used again. Therefore, once the width would have to grow beyond 12 bits, a clear-code is sent. This clear-code completely clears the alphabet and resets the width back to nine bits. Compression then basically starts from the beginning.
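The basic compression loop described above, leaving the question of code width aside, could be sketched in Python as follows. This is a minimal illustration, not the book's implementation: the name lzw_compress is invented here, codes are returned as a plain list of integers instead of being packed into bits, and the alphabet is assumed to grow without limit.

```python
def lzw_compress(data):
    """Return the list of LZW codes for `data` (bytes).

    Sketch only: no bit packing, no clear-code, unlimited alphabet.
    """
    # Codes 0-255 are the ordinary single characters.
    alphabet = {bytes([i]): i for i in range(256)}
    next_code = 256            # first code of the extended alphabet
    codes = []
    current = b""              # longest string found in the alphabet so far
    for value in data:
        candidate = current + bytes([value])
        if candidate in alphabet:
            # Still a known string: keep extending it.
            current = candidate
        else:
            # Write the code of the longest known string ...
            codes.append(alphabet[current])
            # ... record the new, longer string in the alphabet ...
            alphabet[candidate] = next_code
            next_code += 1
            # ... and restart with the last character read.
            current = bytes([value])
    if current:
        codes.append(alphabet[current])
    return codes


print(lzw_compress(b"ABABABA"))  # [65, 66, 256, 258]
```

In the example, seven characters are written as four codes: 256 stands for AB and 258 for ABA, both of which were entered into the alphabet while compressing.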
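The variable width and the clear-code can be added to such a loop roughly as shown in the Python sketch below. Several details are assumptions made only for this illustration: the clear-code is given the value 256 (so string codes start at 257), the function name and its min_width and max_width parameters are invented, and each code is returned together with the width it would be written with instead of being packed into a bit stream. Real file formats may reserve further codes and differ in exactly when the width changes.

```python
CLEAR = 256        # assumed value of the clear-code in this sketch
FIRST_FREE = 257   # assumed first code available for character strings


def lzw_compress_variable(data, min_width=9, max_width=12):
    """Return (code, width_in_bits) pairs for `data` (bytes)."""
    table = {bytes([i]): i for i in range(256)}
    next_code = FIRST_FREE
    width = min_width
    out = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate
            continue
        out.append((table[current], width))
        if next_code < (1 << max_width):
            table[candidate] = next_code
            if next_code == (1 << width):
                # The new code no longer fits: add one more bit.
                width += 1
            next_code += 1
        else:
            # Alphabet is full at the maximum width: send the
            # clear-code and start again with the initial width.
            out.append((CLEAR, width))
            table = {bytes([i]): i for i in range(256)}
            next_code = FIRST_FREE
            width = min_width
        current = bytes([value])
    if current:
        out.append((table[current], width))
    return out


print(lzw_compress_variable(b"ABABABA"))
```

Passing a small max_width, for example max_width=9, makes the clear-code appear after only a few hundred bytes, which is convenient for experimenting.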

The effectiveness of this algorithm depends strongly on the length of the file to be compressed. You'll need large amounts of data to access the alphabet repeatedly and be able to store long character strings in a small number of bits. This algorithm is therefore best suited for images of several kilobytes in size.

What is even more important for our purposes is the decompression algorithm. It takes the compressed data and transforms it back into a recognizable image. Before you can understand the unpacking process, however, you must first understand the compressor.

Interestingly, the LZW process does not require storing the alphabet. The alphabet is regenerated from the packed data during decompression. The program uses the fact that the only character strings coded in the compressed data are those that have already occurred and therefore exist in the alphabet.

The following describes how the decompressor proceeds. Each compressed character read is first checked to see whether it's actually a real, uncompressed byte. This would be indicated by a value less than 256. These characters can be written directly to the destination file (or video memory). When encountering an extended code, however, the corresponding character string is retrieved from the alphabet and then written. Of course, the alphabet must be constructed at the same time by combining the last decompressed character string (or uncompressed character) with the first character of the just decoded character string and entering it into the alphabet.

This corresponds exactly to the compression process but "in reverse". So, the alphabets formed during compression and decompression correspond exactly at any point in time. The one exception to "any point" is when a code occurs that is not yet in the decompressor's alphabet. Consider compressing a character string of the form AbcAbcA. If the character string Abc already exists in the alphabet, the compressor writes this entry's code to the destination file and forms the new alphabet entry AbcA. This new string appears again immediately afterward, so the compressor uses it right away and writes its code to the destination file.

The decompressor, however, does not yet recognize this character string at this point. How would it know the next character will be an A, since that character has not yet been written to the destination file? You should be able to notice when this situation occurs because it arises only with character strings as described above. If a code appears that is not yet in the alphabet, the last decoded character string plus its first character is simply written to the video memory and the new character string recorded in the alphabet.
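The decompression steps, including the special case just described, can be sketched in Python as follows. As before, this is an illustration under stated assumptions and not the book's routine: the name lzw_decompress is invented, the input is a plain list of integer codes with string codes starting at 256, and neither variable width nor a clear-code is handled.

```python
def lzw_decompress(codes):
    """Rebuild the original bytes from a list of LZW codes.

    Sketch only: fixed code numbering from 256, no clear-code.
    """
    alphabet = {i: bytes([i]) for i in range(256)}
    next_code = 256
    out = bytearray()
    previous = None            # last decoded character string
    for code in codes:
        if code in alphabet:
            entry = alphabet[code]
        elif code == next_code and previous is not None:
            # Code not yet in the alphabet: it can only be the last
            # decoded string plus that string's first character.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code %d" % code)
        out += entry
        if previous is not None:
            # Rebuild the alphabet exactly as the compressor did:
            # last string + first character of the current string.
            alphabet[next_code] = previous + entry[:1]
            next_code += 1
        previous = entry
    return bytes(out)


print(lzw_decompress([65, 66, 256, 258]))  # b'ABABABA'
```

In this example the last code, 258, arrives before the decompressor has entered it into its alphabet. The special case produces ABA from the previous string AB plus its first character, which is exactly the string the compressor had stored under 258.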

An alphabet could require a great amount of memory, which was a problem in the past. The algorithm was therefore improved again. Although it may seem more complicated, it actually simplifies your work even more. As we have seen, in both compressing and decompressing, each new alphabet entry is formed from an already existing character string plus one new character. All that has to be stored is the code of the old character string and the code of the new character. We therefore need just two entries for each string: Prefix and Tail.
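A minimal Python sketch of this Prefix/Tail representation is shown below. The two lists prefix and tail and the helper expand are names chosen for this illustration, and entry i of each list is assumed to describe code 256 + i. A string is recovered by following the prefix codes back to an ordinary character and reversing the collected tails.

```python
def expand(code, prefix, tail):
    """Rebuild the character string for `code` from Prefix/Tail tables.

    prefix[i] holds the code of the old string and tail[i] the new
    character for code 256 + i (layout assumed for this sketch).
    """
    out = bytearray()
    while code >= 256:
        # Collect the last character, then step to the shorter string.
        out.append(tail[code - 256])
        code = prefix[code - 256]
    out.append(code)           # an ordinary character ends the chain
    out.reverse()              # characters were collected back to front
    return bytes(out)


# Entries as they arise for the input ABABABA:
# 256 = A + B, 257 = B + A, 258 = (code 256) + A
prefix = [65, 66, 256]
tail = [66, 65, 65]
print(expand(258, prefix, tail))  # b'ABA'
```

Each entry costs one code and one character regardless of how long the string it represents is, which is why this layout keeps the memory requirement small.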

In document Assembly Language, The True Language Of Programmers pdf (Page 37-39)