4.2.3 Exploiting redundancy in the sensor stream
Compression has been a topic of interest since the birth of information theory in the work of Shannon [122], with the aim of reducing the size of data for storage or transmission. Compression exploits repetition in the data by building a dictionary of codewords and then replacing each occurrence of a word with its codeword in the dictionary, with shorter codewords used for frequent words and longer codewords for less frequent ones. Provided that the codewords are, on average, shorter than the words they replace, compression is achieved. Most compression algorithms require no prior knowledge of the input data stream and can deal with codewords of different lengths.
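The idea of giving frequent symbols shorter codewords can be illustrated with a small sketch. Huffman coding is used here purely as a convenient, well-known example of a variable-length code; it is not one of the methods discussed in this section, and the function name `huffman_code` and the sample string are assumptions made for illustration.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a dict mapping each symbol to a prefix-free bit string."""
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit codeword.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker stops Python from ever comparing the dicts.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)  # lightest subtree
        w2, _, codes2 = heapq.heappop(heap)  # second lightest subtree
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]

if __name__ == "__main__":
    sample = "aaaaaaaabbbc"  # hypothetical input: 'a' is far more frequent
    code = huffman_code(sample)
    encoded_bits = sum(len(code[s]) for s in sample)
    print(code, encoded_bits)
```

For this sample, the frequent symbol receives a one-bit codeword and the two rare symbols receive two-bit codewords, so the 12 symbols take 16 bits rather than the 24 bits needed by a fixed two-bit code.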
Compression algorithms can be broken down into two encoding schemes: lossless and lossy. The former allows the perfect reproduction of the input from the compressed stream, while the latter does not. Clearly, for most computer files, lossless compression is what is required. However, for images and sound, lossy compression is often used: knowledge of the type of data to be stored, and of the limits of human perception of sight and sound, enables more compression to be achieved by suppressing data that would not be detected by humans anyway, such as very high frequencies in sound recordings. Lossy compression can also help to deal with noise and variability, provided that the component that is lost is the noisy part of the data.
Lossless compression
In lossless compression, the decompressed data is an exact replica of the original data. One common application of lossless compression is the ZIP file format, which compresses large document files in order to reduce their size. Two popular lossless compression techniques used for text are the statistical approach, which is based on context modeling, and the dictionary approach.
(a) The statistical approach
The main idea behind the statistical approach is to predict the occurrence of the next symbol based on the symbols that precede it. One well-known statistical approach is Prediction by Partial Matching (PPM), which makes use of finite-context models of order k (where k is the number of preceding symbols used) for character prediction [19, 18].
PPM uses a table to keep track of the number of times each character has been seen in each context in the input stream. The table is updated adaptively as each character is read from the input stream. The counts are turned into prediction probabilities for the upcoming character, so each context model maintains its own probability distribution.
The encoding of a character begins at the model with the highest order k, to see whether the character has occurred in the current context. If it has occurred before, then its count is used to encode the character in that context. However, if the character has never been encountered in that context, then an ‘escape’ symbol is transmitted to tell the decoder to switch to the model of order k−1, and the process is repeated until a model is reached in which the character has been seen; the character is then encoded according to the probability distribution of that model [18]. There are many methods to determine the ‘escape’ probability [5, 19, 96, 121], but these are beyond the scope of this thesis. If a character has never occurred in any context at all, then it is processed at the bottom level, i.e., k = −1, where all characters are assigned equal probabilities. For example, if there are 10 characters, then each character in the k = −1 model is assigned a probability of 0.1. Figure 4.3 illustrates the PPM method with three models, k = 2, 1 and 0, for the input string mousemousemouse. As the figure shows, all the previously seen contexts are shown along with the frequency counts and probabilities for each k model.
Figure 4.3: Illustration of the Prediction by Partial Matching (PPM) method based on order-2 context models. The ‘esc’ symbol stands for the ‘escape’ event and is used when a novel character is encountered that has not been seen in the context of the current model.
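The context tables and the fall-back from higher to lower orders can be sketched as follows. This is a simplified illustration rather than a full PPM coder: it builds the counts in one pass instead of adaptively, and it reports the raw relative frequency of a character in its context, ignoring the probability mass that a real PPM variant reserves for the ‘escape’ symbol. The function names `context_counts` and `first_matching_order` are assumptions made for this example.

```python
from collections import Counter, defaultdict

def context_counts(text, max_order=2):
    """counts[k][context][symbol] = times symbol followed the length-k context."""
    counts = {k: defaultdict(Counter) for k in range(max_order + 1)}
    for i, symbol in enumerate(text):
        for k in range(max_order + 1):
            if i >= k:  # enough preceding symbols for an order-k context
                counts[k][text[i - k:i]][symbol] += 1
    return counts

def first_matching_order(counts, history, symbol, max_order=2):
    """Find the highest order whose current context has already seen `symbol`.

    Returns (order, raw relative frequency). Each step down to a lower
    order corresponds to one 'escape'. Returns (-1, None) when the symbol
    is novel in every context, i.e. the k = -1 model would be used.
    """
    for k in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - k:]
        seen = counts[k].get(context)
        if seen and symbol in seen:
            return k, seen[symbol] / sum(seen.values())
    return -1, None

if __name__ == "__main__":
    counts = context_counts("mousemousemouse")
    for k in (2, 1, 0):
        for context, seen in counts[k].items():
            print(k, repr(context), dict(seen))
    print(first_matching_order(counts, "mo", "u"))
    print(first_matching_order(counts, "mo", "s"))
```

With the input string mousemousemouse, the order-2 context ‘mo’ has only ever been followed by ‘u’, so ‘u’ is found at k = 2. The character ‘s’ has not been seen after ‘mo’ or after ‘o’, so two escapes are needed before it is found in the order-0 model.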
(b) The dictionary approach
The dictionary approach replaces phrases in the text with a pointer to an earlier occurrence of the same phrase, effectively creating a ‘codebook’ of common phrases. One popular dictionary-based compression algorithm is the Lempel-Ziv-Welch (LZW) algorithm [144], a member of the Lempel-Ziv (LZ78) family of compression algorithms [159]. LZW begins with a dictionary containing all the single characters, and then examines the token stream character by character until a string that is not in the dictionary is found. This string is added to the dictionary, and the search for dictionary strings recommences, starting with the last letter of the string that has just been added.
As an example, the second time the phrase ‘mo’ is seen in the input string mousemousemouse, the algorithm finds ‘mo’ in the dictionary, outputs its index, and extends the phrase by concatenating it with the next character from the sequence to form a new phrase (‘mou’), which is then added to the dictionary. The search then continues from the token ‘u’. Each phrase in the dictionary is indexed by a code, and the code representing a phrase is found by looking the phrase up in the dictionary. For a complete description of LZW, see [144].
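The dictionary-building procedure can be sketched in a few lines. This is a minimal illustration of LZW that assumes the dictionary is initialised with the sorted set of characters in the input (a real coder would use a fixed alphabet known to both encoder and decoder); the function names are chosen for this example.

```python
def lzw_compress(text, alphabet=None):
    """Return (codes, dictionary) for `text` using textbook LZW."""
    if alphabet is None:
        alphabet = sorted(set(text))  # assumption: alphabet taken from the input
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    phrase = ""
    codes = []
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                      # keep extending the match
        else:
            codes.append(dictionary[phrase])        # emit longest known phrase
            dictionary[candidate] = len(dictionary) # learn the new phrase
            phrase = ch                             # restart from its last letter
    if phrase:
        codes.append(dictionary[phrase])
    return codes, dictionary

def lzw_decompress(codes, alphabet):
    """Rebuild the text; the decoder reconstructs the same dictionary."""
    if not codes:
        return ""
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # The code refers to the phrase the encoder had only just added.
            entry = previous + previous[0]
        out.append(entry)
        dictionary[len(dictionary)] = previous + entry[0]
        previous = entry
    return "".join(out)

if __name__ == "__main__":
    text = "mousemousemouse"
    codes, dictionary = lzw_compress(text)
    print(codes)
    print(lzw_decompress(codes, sorted(set(text))) == text)
```

For the string mousemousemouse, the 15 input characters are encoded as 10 codes, and the phrase ‘mou’ enters the dictionary the second time ‘mo’ is encountered, as described above.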
Within the context of smart homes, Das et al. [25] use compression to predict the inhabitant’s movement (location) in the home. They partitioned the home into different zones (rooms), with each zone represented by a symbol. A sequence of zones is generated as the inhabitant moves from one room to another. They first used the LZ78 method to build a dictionary that represents the inhabitant’s movements in the home. The statistical method is then used to predict the inhabitant’s next location based on the probability distributions of the phrases in the dictionary.
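The general idea of combining an LZ78 dictionary with frequency-based prediction can be sketched as below. This is not the algorithm of Das et al. [25]; it is a deliberately simplified illustration in which the zone symbols (‘K’ for kitchen, ‘L’ for lounge, ‘B’ for bedroom), the movement sequence, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def lz78_phrases(sequence):
    """Parse a sequence into LZ78 phrases: a known phrase plus one new symbol."""
    seen = set()
    phrases = []
    phrase = ""
    for symbol in sequence:
        phrase += symbol
        if phrase not in seen:
            seen.add(phrase)
            phrases.append(phrase)
            phrase = ""
    return phrases  # a trailing, already-known phrase is simply dropped

def next_symbol_counts(phrases):
    """For every phrase prefix, count which symbol followed it in the dictionary."""
    follow = defaultdict(Counter)
    for p in phrases:
        for i in range(len(p)):
            follow[p[:i]][p[i]] += 1
    return follow

def predict_next(follow, context):
    """Most frequent symbol after `context`, or None if the context is unknown."""
    counts = follow.get(context)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

if __name__ == "__main__":
    movements = "KLKLBKLKLB"  # hypothetical zone sequence
    phrases = lz78_phrases(movements)
    follow = next_symbol_counts(phrases)
    print(phrases)
    print(predict_next(follow, "K"))
```

In this toy sequence, every dictionary phrase that starts with the kitchen continues to the lounge, so the lounge is predicted as the next zone after the kitchen.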
Cilibrasi and Vitányi [17] propose a similarity distance based on the lengths of compressed data files, the ‘normalised compression distance’ (NCD), and then use hierarchical clustering to identify clusters within the data based on the most dominant shared feature, which is computed from the lengths of the compressed data files. Their approach was aimed at rather longer sequences, such as music files and genome data.
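The NCD of two sequences x and y is commonly written as NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(·) is the length of the compressed input and xy is the concatenation of x and y. A minimal sketch follows, assuming zlib as the compressor (any real compressor can be substituted, and the choice affects the values obtained); the sample byte strings are hypothetical.

```python
import random
import zlib

def compressed_length(data: bytes) -> int:
    """C(x): length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance between two byte strings."""
    cx = compressed_length(x)
    cy = compressed_length(y)
    cxy = compressed_length(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

if __name__ == "__main__":
    # Hypothetical data: a highly repetitive stream and a pseudo-random one.
    repetitive = b"kitchen lounge bedroom " * 40
    rng = random.Random(0)
    noise = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))
    print(ncd(repetitive, repetitive))  # similar inputs: small distance
    print(ncd(repetitive, noise))       # unrelated inputs: distance near 1
```

Values close to 0 indicate that one sequence adds little new information once the other is known, while values close to 1 indicate that the two share almost nothing the compressor can exploit.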