Variable Frame Size for Vector Quantization and Application to Speech Coding
Carlos Moreno
Department of Electrical & Computer Engineering
McGill University Montreal, Canada
December 2005
A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Master of Engineering.
© 2005 Carlos Moreno
2005/12/18
Library and Archives Canada
Published Heritage Branch
395 Wellington Street
Ottawa ON K1A 0N4
Canada

Bibliothèque et Archives Canada
Direction du Patrimoine de l'édition
395, rue Wellington
Ottawa ON K1A 0N4
Canada
NOTICE:
The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.
The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.
ISBN: 978-0-494-24999-4
Abstract
Vector Quantization (VQ) is a lossy data compression technique that is often applied in the field of speech communications. In VQ, a group of values, or vector, is replaced by the closest vector from a list of possible choices, the codebook. Compression is achieved by providing the index corresponding to the closest vector in the codebook, which in general can be represented with less data than the original vector.
In the case of VQ applied to speech signals, the input signal is divided into frames of a given length. Depending on the particular technique being used, the system either extracts a vector representation of the whole frame (usually some form of spectral representation), or applies some processing to the signal and uses the processed frame itself as the vector to be quantized. The two techniques are often combined, and the system uses VQ for the spectral representation of the frame and also for the processed frame.
A typical assumption in this scheme is that the frame size is fixed. This simplifies the scheme and thus reduces the computing-power requirements for a practical implementation.
In this study, we present a modification to this technique that allows for variable size frames, providing an additional degree of freedom for the optimization of the data compression process.
The quantization error is minimized by choosing the closest point in the codebook for the given frame. We now minimize it further by choosing the frame size that yields the lowest quantization error. Notice that the quantization error is a function of the given frame and the codebook; by considering different frame sizes, we get different actual frames that yield different quantization errors, allowing us to choose the optimal size, effectively providing a second level of optimization.
This idea has two caveats: we require additional data to represent the frame, since we have to indicate the size that was used; also, the complexity of the system increases, since we have to try different frame sizes, requiring more computing power for a practical implementation of the scheme.
The results of this study show that this technique effectively improves the quality of the compressed signal at a given compression ratio, even if the improvement is not dramatic.
Whether or not the increase in complexity is worth the quality improvement for a given application depends entirely on the design constraints for that particular application.
Summary
Vector Quantization (VQ) is a lossy data compression technique that is often used in the field of voice communications. With VQ, a group of values, or vector, is replaced by the closest point from a list of possibilities, the codebook. Compression is obtained by using the index that corresponds to the closest point in the codebook.
In the case of VQ used for voice signals, the input signal is divided into blocks of a certain length. The system then extracts a vector representation of this block, either as a representation based on a model of the data (normally a model that uses spectral parameters), or as the block of the signal obtained after some processing.
The length of these blocks is normally constant. This simplifies the system and reduces the computing power needed for a real implementation.
This study presents a modification to this technique that allows the use of variable-length blocks. This represents an additional level for the optimization of the data compression process.
The quantization error is minimized by choosing the closest point in the codebook. This error is minimized further by choosing the block length that produces the lowest error; note that the quantization error is a function of the data block and of the codebook. By considering different block lengths, we obtain different values for the error, which allows us to choose the optimal length. This represents, in effect, a second level of optimization.
This idea suffers from two difficulties: additional data are needed to represent each block, since the length chosen for that block must be indicated. In addition, the complexity of the system increases, because several block lengths must be tried; this demands more powerful digital processors in a real implementation of the system.
The results of this study show that the technique does improve the quality of the compressed signal at the same compression ratio, even if the difference is not dramatic. Depending on the application and the design constraints, the increase in implementation complexity may be justified by the increase in signal quality.
Acknowledgments
My first and foremost expression of gratitude goes to my research advisor, Professor Fabrice Labeau, for his continuous support, encouragement and guidance throughout all the phases of this project. I very much enjoyed the time spent on technical discussions with Prof. Labeau, and I feel that these discussions greatly contributed to a higher quality of the education I received from McGill University.
I also consider myself fortunate to have had such high-quality instructors during the courses that I attended as part of this degree. I am very grateful to each and every one of the instructors I had. They made me and all my fellow students work very hard during those semesters; for that, I will always be grateful and will regard them as excellent instructors.
Among these instructors, I would especially like to address my thanks to Prof. Douglas O'Shaughnessy and Prof. Richard Rose, who provided me with a very enjoyable introduction to the area of Speech Processing and Communications as well as the related area of Speech Recognition. Prof. O'Shaughnessy kindly offered valuable advice during the first stages of my research project.
As a last note of acknowledgment, I would like to thank Prof. Xiao-Wen Chang in the Computer Science Department, for kindly allowing me to attend one of his excellent courses, which indirectly contributed to the success of this project.
Contents
1 Introduction
    Background
    The Problem
    Our Contribution
    This Document
2 Speech Compression
  2.1 Data Compression
    2.1.1 Lossless Data Compression
    2.1.2 Lossy Data Compression
  2.2 Lossy Compression of Speech Signals
    How Speech is Produced
    How Speech is Encoded
  2.3 Linear Prediction Coding (LPC)
  2.4 Vector Quantization
  2.5 Vector Quantization Applied to LPC
    2.5.1 Quantizing the Prediction Filter
    2.5.2 Quantizing the Residual
3 Vector Quantization of Frames with Non-Uniform Size
  3.1 Non-uniform Sampling of Signals
  3.2 Non-uniform Sampling in Vector Quantization
  3.3 Variable Frame Size Vector Quantization Applied to LPC
4 Experimental Setup and Results
  4.1 Experimental Setup
  4.2 Results
    Results - Objective Measurable Parameters
    Results - Perceptual Evaluation
5 Conclusions
  5.1 Discussion and General Conclusions
  5.2 Recommendations for Future Work
A Finding the Optimal Prediction Filter for LPC
B Generation of Optimal Codebook from Speech Samples (Training)
References
List of Figures
2.1 Example of Lossy Compression: Replacing a "Noisy" Signal with One that has Identical Spectral Properties
2.2 Simplified Model of the Human Speech Production System
2.3 Example of a Typical Voiced Speech Spectrum
2.4 A Typical Linear Prediction Coding (LPC) Setup
2.5 Example of a Two-Dimensional Vector Quantizer
2.6 Probability Density Function of a Two-Dimensional Random Vector
2.7 LPC Setup with VQ for the Filter and the Residual
3.1 Example of Non-Uniform Sampling Scalar Quantization
3.2 Variable Frame Size as a Form of Non-Uniform Sampling VQ
3.3 Example of Variable Frame Size Vector Quantization
3.4 Encoding Algorithm for Variable Frame Size Vector Quantization
4.1 Signal-to-Noise Ratio (SNR) vs. Bit rate - 8 Residual Chunks per Frame
4.2 Segmental SNR (20 ms Segments) vs. Bit rate - 8 Residual Chunks per Frame
4.3 Signal-to-Noise Ratio (SNR) vs. Bit rate - 4 Residual Chunks per Frame
4.4 Segmental SNR (20 ms Segments) vs. Bit rate - 4 Residual Chunks per Frame
4.5 Reconstruction Error of a Small Segment of Speech (Vowel Sound)
4.6 Percentiles 90 to 99 of Quantization Error of the Prediction Filter
4.7 Perceptual Evaluation of Audio Quality (PEAQ) Grades for the Processed Speech Samples
List of Tables
4.1 Number of Bits for Codebooks and for the Encoding of the Frame Size - 8 Residual Chunks per Frame
4.2 Number of Bits for Codebooks and for the Encoding of the Frame Size - 4 Residual Chunks per Frame
4.3 Frame Sizes for the Experimental Setup
4.4 Signal-to-Noise Ratio (SNR) After Reconstruction - 8 Residual Chunks per Frame
4.5 Signal-to-Noise Ratio (SNR) After Reconstruction - 4 Residual Chunks per Frame
List of Acronyms
VQ      Vector Quantization
SQ      Scalar Quantization
LPC     Linear Prediction Coding
ABS     Analysis By Synthesis
MPELP   Multi-Pulse Excited Linear Prediction
CELP    Code-Excited Linear Prediction
LSF     Line Spectral Frequencies
SNR     Signal-to-Noise Ratio
FIR     Finite Impulse Response
IIR     Infinite Impulse Response
PRN     Pseudo-Random Number(s)
PDF     Probability Density Function
CDF     Cumulative Distribution Function
IID     Independent and Identically Distributed
LBG     Linde-Buzo-Gray (Algorithm)
Chapter 1
Introduction
Background
Vector Quantization (VQ) is a lossy data compression technique that is often applied in the field of Speech Communications [1, 2, 3]. This technique is mainly used for speech coding, but it also has other applications, such as speech recognition [1].
In VQ encoding, we replace a group of values, or vector, with a codeword that we choose following some optimality criterion. Perhaps the most typical example would be encoding a multi-dimensional discrete-time signal by replacing each sample with the closest point from a list of possible values, the codebook. Compression is achieved by providing the index corresponding to the closest point as the codeword. This index can be represented using fewer bits than required to represent the original sample of the signal.
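As a minimal sketch of the encoding step just described, the following Python fragment maps each input vector to the index of its nearest codeword under squared Euclidean distance. The codebook contents, the sample vectors, and the function names (`vq_encode`, `vq_decode`) are illustrative assumptions and are not taken from this study.

```python
# Toy vector quantizer: nearest-codeword search (illustrative only).

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def vq_encode(vector, codebook):
    """Return the index of the codeword closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(vector, codebook[k]))

def vq_decode(index, codebook):
    """Reconstruction is a simple table lookup."""
    return codebook[index]

# Hypothetical 2-D codebook with 4 entries, so each index fits in 2 bits.
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (0.0, -1.5)]

samples = [(0.9, 1.2), (-0.2, -1.1), (0.1, 0.2)]
indices = [vq_encode(s, CODEBOOK) for s in samples]
reconstructed = [vq_decode(i, CODEBOOK) for i in indices]
```

In this toy setup each two-component sample is replaced by a 2-bit index, and the decoder only needs a copy of the same codebook.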
If the group of values exhibits some structure or statistical correlation, this approach has a great advantage over direct scalar quantization of each of the values [2], since we can optimize the codebook to follow this structure and take full advantage of the statistical properties of the values as a group, including correlation between the various components.
When using VQ to encode a signal, the signal may be multi-dimensional in nature (for example, a set of values measured in parallel such as the array of positions and velocities of the articulation points of a robot), or it can be a multi-dimensional representation obtained through a model of an otherwise one-dimensional signal.
In the case of VQ applied to speech signals, the input signal is divided into frames of a given length. Depending on the particular technique being used, the system either extracts a representation of the whole frame (usually some form of spectral representation), or applies some processing to the signal and uses the processed frame as the multi-dimensional sample.¹ Due to the nature of speech signals (in particular, the fact that human speech sounds tend to be restricted to a finite set of possible sounds, the phonemes), the spectral properties of speech frames tend to exhibit a certain structure, which makes VQ a suitable and efficient technique for speech coding [1].
The Problem
Even though VQ has proved extremely effective in practical applications such as speech compression in cellular telephony, some opportunities for potential improvement may have been neglected. In particular, a typical assumption is that the frame size is fixed.
Little attention has been paid to the possible benefits that could derive from encoding with variable frame size, and it has often been considered an unnecessary complexity [1].
Non-uniform sampling has been the subject of several studies, some of them of theoretical interest only, some of them oriented to the issue of jitter in discrete-time signals [4]. A technique using variable analysis frame sizes has been proposed for the coding of multiband excitation model parameters [5]. In that study, however, the encoding allowed for variable frame size under certain conditions, to ensure stationary spectral parameters of the signal within frames.
The idea of non-uniform size for the quantized blocks has been successfully applied to image compression techniques; images are well suited for the use of variable block size, since blocks of a given size usually represent perceptually distinct regions, and thus coding them separately leads to substantial benefits [6]. Indeed, a comparison with variable rate VQ (based on variable bit allocation for the codebook, tree structured codebooks, etc.) shows that variable block size leads to better results [7]. This represents an important hint in favor of the idea proposed in this study. The technique has also been combined with adaptive codebooks, where adjacent blocks affect the encoding of a block, taking advantage of any correlation between adjacent blocks in an image [8].
Other studies have focused on variable rate VQ. One such study reports little success when applied to VQ of LPC parameters [9]. Another study reports success in using variable bit allocation for Line-Spectrum Pairs and generalized spectral distributions; the relative entropy in these parameters can be exploited to develop variable rate Vector Quantizers [10].
¹ As we will discuss in Chapter 2, often both techniques are applied.
Another area that has received some attention is the use of variable dimension spectral vectors, where the vectors of harmonic spectral peaks have variable dimension to optimize the bit rate according to the current spectral characteristics [11, 12, 13].
Variable precision VQ has also been proposed, presenting a form of variable-rate VQ where the codebook search stops as soon as a required precision is reached [14].
All of these ideas, however related to what we propose in this study, still do not consider the possibility of an additional level of optimization that could result from using non-uniform sampling or variable frame size VQ.
Another important class of related techniques where the use and optimization of VQ has been sought is Analysis-by-Synthesis (ABS), in particular Multi-Pulse Excited Linear Prediction (MPELP) techniques [15] and Code-Excited Linear Prediction (CELP) techniques, both in their original form [16, 17] and in some optimized variations [18, 19].
In the context of CELP, variable rate quantization has been applied with success [20].
Specifically, variable rate quantization of spectral parameters has been done by classifying frames as voiced or unvoiced and coding each class separately [21, 22], and later by also including transition frames [23]. This last study has one detail in common with our approach, even if the rationale is completely different.
In a more recent study, a variable percentage of overlap between frames was proposed [24]. Even though the context and the rationale are different from those of our study, there are some common ideas and common consequences of this technique and our proposed technique.
Our Contribution
In this study, we present a modification to the VQ technique that allows for variable size frames, providing an additional degree of freedom for the optimization of the data compression process.
The quantization error goes through a first level of minimization by choosing the closest point in the codebook for the given frame. We now minimize it further by choosing the frame size that yields the lowest quantization error. Notice that the quantization error is a function of the given frame and the codebook; by considering different frame sizes, we get different actual frames that yield different quantization errors, allowing us to choose the optimal size. This effectively provides a second level of optimization.
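The two levels of minimization can be illustrated with a toy sketch. It assumes one codebook per candidate frame size and compares sizes by mean squared error per sample; both choices, like every name and value below, are assumptions made for illustration only and should not be read as the encoding algorithm of this study, which is presented in Chapter 3.

```python
# Toy illustration of two-level minimization (all names are hypothetical).

def mse(u, v):
    """Mean squared error per sample, so different frame sizes are comparable."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def best_codeword(frame, codebook):
    """First level: closest codeword for a fixed frame."""
    errors = [mse(frame, c) for c in codebook]
    k = min(range(len(codebook)), key=errors.__getitem__)
    return k, errors[k]

def best_frame_size(signal, start, codebooks):
    """Second level: try each candidate size and keep the lowest error.

    `codebooks` maps a frame size to the codebook assumed for that size.
    Returns (size, codeword index, error), or None if no size fits.
    """
    best = None
    for size in sorted(codebooks):
        frame = signal[start:start + size]
        if len(frame) < size:      # not enough samples left for this size
            continue
        k, err = best_codeword(frame, codebooks[size])
        if best is None or err < best[2]:
            best = (size, k, err)
    return best

# Assumed codebooks for candidate frame sizes 2 and 3.
CODEBOOKS = {
    2: [(0.0, 0.0), (1.0, 1.0)],
    3: [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
}
signal = [1.0, 0.9, 0.1, 0.0]
size, index, error = best_frame_size(signal, 0, CODEBOOKS)
```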
This idea has two caveats. First, as part of the encoding process, we have to provide the size that was chosen for each frame, requiring additional data to represent the frame. Second, the complexity of the system increases, since we have to try different frame sizes. This involves a certain degree of redundant processing of the same data, and thus requires more computing power for a practical implementation of the scheme.
The results of this study show that this technique effectively improves the quality of the compressed signal at a given compression ratio, even if the improvement is not dramatic.
Whether or not the increase in complexity is worth the quality improvement for a given application depends entirely on the design constraints for that particular application.
This Document
This thesis presents the details and results of the study as follows: in Chapter 2, we present the basic concepts in Data and Speech Compression, with particular emphasis on Linear Prediction Coding (LPC) and VQ. This chapter may be skipped entirely if the reader is familiar with these techniques and the related mathematical background.
Chapter 3 presents a detailed description of the method that this study introduces, highlighting the potential advantages as well as potential drawbacks. We start by discussing the general idea of VQ with non-uniform sampling and several concrete ways in which it can be applied. We then concentrate on applying the idea to speech compression in the context of an LPC setup, which is the main focus of this study.
Chapter 4 describes the experimental setups and the results; these are discussed in Chapter 5, presenting the conclusions as well as recommendations for possible improvements and future work deriving from this study.
Chapter 2
Speech Compression
In this chapter, we present the basic concepts of data and speech compression. We describe both lossless and lossy compression techniques, with emphasis on lossy compression, which is the type of technique normally used for speech compression, given the higher compression ratio that it exhibits.
We specifically describe Linear Prediction Coding (LPC) techniques, since virtually all speech compression schemes presently used involve - directly or indirectly - the use of LPC. We also devote special attention to Vector Quantization techniques, for two reasons: it is an efficient speech compression technique that is commonly used; and it is the basis of this study, where we propose a slight modification to this technique.
2.1 Data Compression
Data compression is a process that has the goal of representing a sequence of data using less storage or transmission space than the original representation. In the context of a digital system and in particular in digital communications and signal processing, this translates to representing a signal with fewer bits or with a lower bit rate than the original representation.
A simple example that illustrates how data compression can be achieved is the following:
we have a signal that is represented by a stream of integer values between 0 and 9. Given the following instance of that signal:
{2 1 1 1 1 1 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 3 3 3 3 3 3 7}
we can take advantage of the fact that it exhibits a particular pattern - there are long sequences of the same digit. A simple data compression scheme could encode these patterns by using just two digits to represent each sequence of up to 10 consecutive repetitions of the same number:
{1 x 2 5 x 1 7 x 0 1 x 4 9 x 0 6 x 3 1 x 7}
Taking into account that spaces are just placeholders, and that the x symbol is not really part of the new data stream and is only shown to make the example clearer, we notice that we were able to represent, using only 14 characters, the same stream of data that took 30 characters in its original representation. A sequence of a number repeated 10 times would be represented as 0 x N (where N is the given number). A sequence of a number repeated more than 10 times can be split into two or more sequences of no more than 10 numbers each.¹
There is an implicit assumption here: reading and interpreting the compressed stream requires knowledge of the convention that we used. In this simple example, it requires knowing that every pair of digits represents a sequence consisting of the second digit repeated the number of times indicated by the first digit. It also requires knowing that if the first digit is 0, then the sequence is 10 characters long.
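The convention above can be written down as a short Python sketch. The function names `rle_encode` and `rle_decode` are illustrative assumptions; the digit stream is the one from the example.

```python
# Run-length scheme from the example: each run of up to 10 equal digits
# becomes a (count, digit) pair, with a count of 10 written as 0.

def rle_encode(digits):
    out = []
    i = 0
    while i < len(digits):
        run = 1
        # Extend the run while the digit repeats, up to a maximum of 10.
        while (i + run < len(digits) and digits[i + run] == digits[i]
               and run < 10):
            run += 1
        out.extend([run % 10, digits[i]])   # a run of 10 is stored as 0
        i += run
    return out

def rle_decode(pairs):
    out = []
    for count, digit in zip(pairs[0::2], pairs[1::2]):
        out.extend([digit] * (10 if count == 0 else count))
    return out

signal = [2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 4,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 7]
encoded = rle_encode(signal)
decoded = rle_decode(encoded)
```

Here `encoded` holds the 14 digits shown above, and decoding them restores the original 30 digits. A run longer than 10 is split automatically, so twelve repetitions of 5 would be stored as the pairs (0, 5) and (2, 5).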
This illustrates the more general principle that a particular data compression scheme or compression algorithm always consists of two processes or algorithms: the encoding process and the decoding or reconstruction process [2].
Another important principle illustrated by this simple example is the fact that compression can be achieved as a result of particular patterns or probabilistic properties involving some form of redundancy in the data. If we had a stream of data that almost never had sequences of the same digit repeated, then the above scheme would be useless as a compression mechanism - in fact, it could have a negative effect, as each number would take two characters to be represented, instead of one in the original, straightforward representation.
Redundancy in a signal or stream of data may be somewhat "hidden" in the sense that a direct and naive inspection of the data would not reveal any pattern or redundancy; however, it is possible that we could model such data in a way that would allow us to represent a considerable fraction of the data with just a few parameters.
¹ This is not necessarily an optimal compression mechanism, but we want to keep the example as simple as possible, since it is only intended to illustrate the basic principle.
For example, let us consider a signal that is given by the positions of a train measured at every minute while the train is moving at approximately constant speed. To simplify the example, let us assume that the position is an integer number, measuring the distance in meters. In this case, we can model the sequence of data as a straight line, and only encode the deviations from the model, which are usually referred to as the residual. This model would be dynamically obtained, by computing the best-fitting straight line for the given sequence of data. An instance of this sequence could be the following:
{302, 401, 497, 600, 699, 703, 801, 895, 998}
We notice that representing the data directly (assuming a binary number system) would require 10 bits per sample, to cover the range from 0 to 1000; however, the deviations from the best-fitting straight line, which in this case is x = 300 + 100t, are in the range -5 to +5, so we require only 4 bits per sample to represent the residual. By encoding the data as two values (of 9 bits and 7 bits, respectively) representing the best-fitting straight line, plus the residual, we can reconstruct the original signal exactly.
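The line-plus-residual idea can be sketched in a few lines of Python. The sketch takes the line x = 300 + 100t from the text as given rather than fitting it, and the sample values are assumptions chosen to stay within 5 m of that line; the helper names are likewise hypothetical.

```python
# Line-plus-residual coding (illustrative sketch, not a fitted model).
# Line parameters are taken from the text: x = 300 + 100 t.
OFFSET, SLOPE = 300, 100

def line_residuals(samples):
    """Deviation of each sample from the line OFFSET + SLOPE * t."""
    return [x - (OFFSET + SLOPE * t) for t, x in enumerate(samples)]

def line_reconstruct(residuals):
    """Add the residuals back onto the line to recover the samples."""
    return [OFFSET + SLOPE * t + r for t, r in enumerate(residuals)]

def bits_for(n_values):
    """Bits needed to distinguish `n_values` different values."""
    return (n_values - 1).bit_length()

# Assumed sample values, one per minute, within 5 m of the line.
positions = [302, 401, 497, 600, 699, 801, 895, 998]
residuals = line_residuals(positions)

bits_direct = bits_for(1001)     # positions in the range 0..1000
bits_residual = bits_for(11)     # residuals in the range -5..+5
```

Under these assumptions the residuals fit in 4 bits each instead of 10, and `line_reconstruct(residuals)` returns the original positions exactly, which is what makes the scheme lossless.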
Sacrificing Unnecessary Information
Another class of data compression techniques is one where the compression algorithm aims at reconstructing a stream of data that is not the same, but conveys an underlying message that, from a given point of view, is either equivalent, or exhibits differences that are irrelevant. This "point of view" is usually a perceptual point of view. A typical example would be a sound compression scheme that produces (after reconstruction) a sound that is different, but perceptually indistinguishable from the original for a normal human ear.
This class of compression techniques is of great interest in communications and signal processing, and in particular to this study; it involves an additional detail with respect to the schemes presented in the previous examples - we not only take advantage of the fact that there is redundancy in the stream of data, but also of the fact that there is unnecessary information in it. There are components of a sound that, for physiological reasons, we are unable to perceive; a data compression scheme for signals representing sound could and should take advantage of that fact and eliminate all the data that represent those particular components of the sound.
Different Classes of Compression Techniques
The contrast between this notion of unnecessary information and the previous introductory examples illustrates an important classification of data compression techniques: lossless compression and lossy compression. Lossless compression provides a mechanism where the original data is reconstructed exactly. Lossy compression attempts to identify and eliminate unnecessary information, usually achieving much higher compression ratios. As a result, lossy compression mechanisms reconstruct a signal that is different but, under certain criteria, equivalent or similar enough to be useful - possibly even indistinguishable from the original.
Since this study focuses on one particular lossy compression technique, we will cover these in greater detail. We will, however, briefly introduce lossless data compression, since most of the concepts and ideas are also used in lossy compression techniques - in particular, lossy compression techniques often decompose the signal into several components, discard one or more of these components, and use some suitable lossless compression technique for some or all of the remaining components.
2.1.1 Lossless Data Compression
As discussed in the previous section, lossless data compression techniques take advantage of some form of redundancy in the data. This redundancy can be formally modeled from a probabilistic point of view, based on two fundamental notions from Information Theory: Self-information and Entropy [25].
The self-information associated with an event X is a measure of the amount of information that such an event conveys. It is defined in terms of the probability of that event, P(X), as follows:
$$i(X) = -\log P(X) \tag{2.1}$$
Depending on the base that we choose for the logarithm, we obtain a measure in different units - for instance, if we use $\log_2$, we obtain the self-information in bits.
Intuitively, this definition tells us that an event with low probability conveys more information. For example, if you look at a highway at 3AM and see no cars, there is not much that you can conclude from this normal, high-probability event. However, if you look at a highway at 5 or 6PM and see no cars, that tells you a lot!
If we have a process S that generates outcomes or events given by a set of possibilities $\{X_1, X_2, \ldots, X_N\}$, then the entropy of that process is defined as the average self-information associated with the events generated by the process:
$$H(S) = \sum_{k=1}^{N} -P(X_k) \log P(X_k) \tag{2.2}$$
Claude E. Shannon showed that the entropy of a source is directly related to the average number of bits per symbol required to encode the output of the source [2]. A very simple example that illustrates this is the following: if we have a source of data where the only possible values are -5, +5, -10, and +10 with equal probability of 1/4, the entropy of this source (in bits) is given by
$$H(S) = \sum_{k=1}^{4} -\frac{1}{4} \log_2 \frac{1}{4} = 2$$
Clearly, we can encode the output of this source with two bits per symbol - we assign each of the four combinations of two bits (00, 01, 10, 11) to one of the four possible symbols that we are encoding.
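The two definitions can be checked numerically with a short Python sketch. The function names and the skewed distribution are assumptions added for illustration; the uniform four-symbol source is the one from the example.

```python
import math

def self_information(p):
    """Self-information, in bits, of an event with probability p (Eq. 2.1)."""
    return -math.log2(p)

def entropy(probabilities):
    """Average self-information of a source, in bits (Eq. 2.2)."""
    return sum(p * self_information(p) for p in probabilities if p > 0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])   # the four-symbol source above
skewed = entropy([0.7, 0.1, 0.1, 0.1])        # hypothetical skewed source
```

The uniform source evaluates to 2 bits per symbol, while the skewed source has a lower entropy, which is the property that variable-length codes exploit.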
A key detail is the fact that if one of the symbols occurs with high probability, then it has less self-information associated with it, and we could use fewer bits to encode it. It is clear that if we use fewer bits for the symbols that occur more frequently (the symbols with higher probability), the total number of bits needed to encode an actual sequence of the given symbols will be lower than with a straightforward representation in which we presumably use the same number of bits or characters for each symbol.
This idea was exploited by an algorithm proposed by Shannon, and later by an algorithm proposed by Huffman, who presented an optimal encoding scheme that minimizes redundancy in the data [2], under the constraint that no codeword is a prefix of another (that is, no codeword can begin with a pattern that is exactly another codeword).
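For readers who want to see the prefix constraint in action, the following is a compact sketch of the usual textbook construction of a Huffman code, which repeatedly merges the two least probable subtrees. The symbol probabilities and the function name `huffman_code` are hypothetical.

```python
import heapq

def huffman_code(frequencies):
    """Build a prefix code from a {symbol: weight} mapping.

    Returns {symbol: bit string}. Ties are broken by insertion order.
    """
    # Heap entries: (weight, tie-breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate single-symbol source
        return {s: "0" for s in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)     # two least probable subtrees
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c0.items()}
        merged.update({s: "1" + code for s, code in c1.items()})
        heapq.heappush(heap, (w0 + w1, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical symbol probabilities.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
average_length = sum(probs[s] * len(code[s]) for s in probs)
```

For these particular probabilities the most likely symbol gets a 1-bit codeword and the two least likely ones get 3 bits each, for an average of 1.75 bits per symbol, compared with 2 bits for a fixed-length code.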
A more detailed description of Huffman coding and a deeper coverage of lossless compression techniques is omitted in this document; the examples shown in this section are intended to provide an intuitive introduction to the subject, since this is not the main focus of this study. A more comprehensive coverage of lossless data compression can be found in some of the references, such as [2].
2.1.2 Lossy Data Compression
Lossy data compression is a class of compression methods where the reconstructed data or signal is different from the original, but similar enough to be equivalent and possibly indistinguishable under certain criteria. An example of such criteria would be the inability of the human ear to perceive certain features of a given sound, or the inability of the human eye to perceive certain features of a given image or sequence of images (motion video).
Lossy data compression techniques exploit one additional characteristic in the signal: the presence of unnecessary information. Most signals that are intended for human perception - such as sound and images - do contain components that are in general not perceived by us. A data compression system could eliminate those components and we would not be able to distinguish the reconstructed signal from the original. Even if we are able to distinguish them, for some applications, the difference may be irrelevant for the particular purpose. Such is the case with intelligibility of speech - a compression system could eliminate certain features from a speech signal, making it sound different but keeping all the attributes that allow us to understand the words without errors, even if we do notice that it sounds different.
In the context of communications and audio and speech signal processing, most lossy compression techniques are based on spectral analysis of the signal, given that sound perception is largely dependent on spectral features² [1, 26].
A simple example is a signal that consists of a loud tone plus some faint background noise. If we look at the spectral content of that signal, we notice that most of the energy is concentrated at a single frequency. Due to perceptual features of the human ear, in particular masking [26], the background noise could be imperceptible, since it is masked by the presence of the much louder tone. A simple spectral analysis and processing would enable us to represent this single-tone signal with much less data than the original representation.
The reconstructed signal is different from the original; however, it will be indistinguishable, due to the fact that the difference is a component of the original signal that is not perceived by the human ear.
Another possibility in the previous case, instead of eliminating portions of the spectral content, would be to replace them with data that has similar spectral characteristics. For the purpose of this example, let us assume that the background noise is sufficiently loud as to be perceived. If this background noise is white noise, it would contain more detail than our auditory system can resolve. Our ear would be unable to process that large amount of information, and it would only perceive the fact that there is noise with the given spectral characteristics [26]. The key detail in this example is that our ear would be unable to distinguish between different instances of white noise with the same intensity, and will perceive them all as a background hiss.

² At least there is evidence that sound perception by the human ear exhibits behavior that is based on spectral features - whether or not the brain actually extracts or uses spectral content is not known definitively.
A lossy data compression scheme could take advantage of that fact and encode all the spectral content present in the background noise as a single parameter representing the intensity of the noise. Given the characteristics of this random process, the intensity may be obtained as the variance of the noise samples [27]. The decoding or reconstruction algorithm would synthesize a different instance of white noise by generating a sequence of pseudo-random numbers (PRN) with the given variance, and add it to the tone. The reconstructed signal will necessarily be different from the original; however, our ear would perceive it as exactly the same: a tone with a background hiss of the same intensity.
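The scheme just described can be sketched in a few lines. This is a minimal illustration, not the procedure used to produce any figure in this document: it assumes that the tone and the noise have already been separated (the spectral analysis step is not shown), and the sample rate, tone frequency, noise level, and seeds are arbitrary values chosen for the example.

```python
import math
import random

def make_tone(n, freq_hz, sample_rate_hz, amplitude=1.0):
    """Samples of a pure tone."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * k / sample_rate_hz)
            for k in range(n)]

def variance(samples):
    """Sample variance (intensity) of a sequence."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

# Original signal: loud tone plus white Gaussian noise (all values hypothetical).
n, fs = 8000, 8000.0
tone = make_tone(n, 440.0, fs)
rng = random.Random(0)
noise = [rng.gauss(0.0, 0.1) for _ in range(n)]
original = [t + w for t, w in zip(tone, noise)]

# "Encoder": the whole noise component is reduced to a single number.
# Assumption: the noise has already been separated from the tone.
noise_variance = variance(noise)

# "Decoder": synthesize a different instance of white noise with that variance.
rng_decoder = random.Random(12345)
new_noise = [rng_decoder.gauss(0.0, math.sqrt(noise_variance)) for _ in range(n)]
reconstructed = [t + w for t, w in zip(tone, new_noise)]

largest_difference = max(abs(a - b) for a, b in zip(original, reconstructed))
print(noise_variance, variance(new_noise), largest_difference)
```

The reconstructed samples differ from the original ones, yet the noise component has practically the same variance.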
Figure 2.1 illustrates this simple scheme. Although the way our eye perceives this graphical display is different from the way our ear perceives the sound represented by it, the principle is illustrated accurately: following a simple visual inspection, we do not distinguish the original from the reconstructed signal (in fact, if we do not see both at the same time, most likely we would not be able to distinguish them at all).
The examples in this and the previous sections illustrate one important aspect of lossy compression methods: detailed knowledge about the nature of the process that generates the signals that we are encoding is essential for us to design an efficient compression scheme.
We discuss how this principle applies to speech signals in the next section, where we present a description of speech compression methods.
2.2 Lossy Compression of Speech Signals
Signals intended for human perception usually contain both redundancy and unnecessary information. The sounds present in human speech are a perfect example of this general rule. This makes sense from a functional - and even from an evolutionary - point of view: the speech production system aims at producing messages with some redundancy to guarantee that the communication is successful even in the presence of other sounds that would "pollute" the signal; and our ear and brain would naturally aim to "optimize" the reception system by cleaning up and extracting key information that would allow us to perceive the intended message with a minimum amount of information to process.

[Figure: original signal (top) and reconstructed signal (bottom), plotted against t (ms).]
Fig. 2.1 Example of Lossy Compression: Replacing a "Noisy" Signal with One that has Identical Spectral Properties.
Without necessarily trying to emulate the way our ear works, knowledge about the process that produces human speech does help us identify the components that are unnecessary - or rather, design procedures that identify these unnecessary components. It also helps us identify redundant components as well as the exact way in which this redundancy is "embedded" in the signal; this in turn allows us to design efficient mechanisms that extract the underlying redundancy from the signal.
How Speech is Produced
The two main types of sounds that are present in human speech are voiced and unvoiced.³ Voiced speech is a quasi-periodic sound that is produced when the vocal cords vibrate producing pulses of airflow through the larynx. These pulses are then acoustically filtered by the combination of the vocal and nasal cavities; these act as a network of $\lambda/4$ resonators with different resonance frequencies, arranged in a mixture of series and parallel configurations.

³ The classification is more detailed, but for the purpose of understanding how good speech compression schemes can be achieved, this simple categorization suffices.
Unvoiced speech is produced when the vocal cords do not vibrate, and instead they allow a continuous flow of air through the larynx. With certain specific geometric configurations of the vocal tract, the teeth and the tongue, this airflow goes through a radiating effect to produce a sound that has a noisy quality, similar to white noise or filtered white noise.
Examples of this are sounds like S, SH, or F. Also, certain transitions between silence and a voiced sound do contain small segments of sound with this noisy quality. An example of this occurs in the transition from a T to a vowel.
Figure 2.2 shows a simplified model of the human speech production system.
[Figure: block diagram with the blocks Pulses, White Noise, Acoustic Filter ($\lambda/4$ resonators), and Low-Pass Filter (dampening and losses in vocal tract).]
Fig. 2.2 Simplified Model of the Human Speech Production System.
In the frequency domain, if we model the pulses of airflow as ideal pulses, the spectrum produced at the input of the network of resonator filters is a series of impulses at equal spacing given by the frequency of the train of pulses. Depending on the exact configuration of the network of acoustic resonators, the overall transfer function will be a series of peaks and possibly some notches, defining the spectral envelope. The geometry of the vocal and nasal cavities is changed to produce different spectral envelopes, which in turn produces sounds with different spectral properties or timbre [26].
Figure 2.3 shows an example of the spectrum of a voiced sound, in this particular case, the sound of the vowel A.
[Figure: spectrum of a voiced sound, plotted against f (Hz) from 0 to 4000 Hz.]
Fig. 2.3 Example of a Typical Voiced Speech Spectrum.

Naturally, the spectrum does not consist of perfect impulses multiplied by a spectral envelope, as in the ideal case; in practice, the spectral characteristics are not perfectly stationary, for several reasons: the pulses of airflow do not occur at perfectly regular intervals, nor are they identical; the spectral envelope is slowly moving, since the geometric configuration of the vocal tract is constantly changing during normal speech, due to the transitions from one phoneme to the next.
However, the ideal model does provide a reasonable approximation that enables us to design processing and compression techniques well adapted to the actual speech production system. We notice that the spectrum does contain prominent peaks that are equally spaced (approximately every 90 Hz). We also notice that the amplitudes of these peaks do exhibit a pattern consistent with that of a spectral envelope, since the values of these amplitudes change smoothly (as a function of $f$), and exhibit several "bumps" and "valleys," as predicted by the ideal model. Not only are these bumps present: they are spaced consistently with the response of a network of $\lambda/4$ resonator filters. For a single $\lambda/4$ resonator, the peaks in the spectral envelope occur at $f_0$, then at $3f_0, 5f_0, 7f_0, \ldots$ (where $f_0$ is the frequency corresponding to the wavelength $\lambda$ that resonates in the filter). With the network of resonators of the vocal tract, the typical (and most noticeable) effect is that the first two resonances are shifted [1]. We confirm this characteristic in the spectrum of Figure 2.3, for $f_0 \approx 500$ Hz: the first resonance is slightly shifted up, to approximately 600 Hz, and the second one is slightly shifted down, to approximately 1300 Hz.
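The odd-multiple pattern $f_0, 3f_0, 5f_0, \ldots$ of a single $\lambda/4$ resonator can be illustrated with the textbook model of an ideal tube closed at one end, for which $f_0 = c/(4L)$. The tube length of 17 cm and the speed of sound of 343 m/s used below are assumptions of this sketch, not values taken from this document; they are chosen because they give an $f_0$ close to 500 Hz.

```python
def quarter_wave_resonances(tube_length_m, count, speed_of_sound_m_s=343.0):
    """Resonance frequencies (Hz) of an ideal tube closed at one end: f0, 3*f0, 5*f0, ..."""
    f0 = speed_of_sound_m_s / (4.0 * tube_length_m)
    return [(2 * k - 1) * f0 for k in range(1, count + 1)]

# Hypothetical tube length (assumption, not from the text).
resonances = quarter_wave_resonances(tube_length_m=0.17, count=4)
print([round(f) for f in resonances])
```

With these assumed values the ideal resonances fall near 500, 1500, 2500, and 3500 Hz, which is the unshifted pattern against which the observed 600 Hz and 1300 Hz resonances are compared.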
How Speech is Encoded
Speech compression techniques take advantage of the spectral nature of speech sounds to extract the important information. We notice that these spectral properties are short-term spectral properties, since they apply to a specific sound (e.g., a vowel), and normal speech consists of a sequence of different sounds, each with relatively short duration. Hence, a speech signal has to be segmented prior to processing or encoding, and the segments or frames are individually processed. These frames should be short enough to ensure that the signal has quasi-stationary properties; that is, that the short-term spectral properties of the signal are practically constant within the frame.
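The segmentation step can be sketched as follows. This is the simplest possible choice, consecutive non-overlapping frames of one fixed length; the 8000 Hz sample rate and the 20 ms frame duration are assumptions made for the example, not values stated in this section.

```python
def split_into_frames(samples, frame_length):
    """Split a signal into consecutive, non-overlapping frames of a fixed length.

    The last frame may be shorter if the signal length is not a multiple
    of frame_length.
    """
    return [samples[i:i + frame_length] for i in range(0, len(samples), frame_length)]

sample_rate_hz = 8000                              # assumed sample rate
frame_ms = 20                                      # assumed frame duration
frame_length = sample_rate_hz * frame_ms // 1000   # samples per frame

signal = list(range(400))                          # stand-in for speech samples
frames = split_into_frames(signal, frame_length)
print(frame_length, [len(f) for f in frames])
```

Each frame would then be analyzed and encoded on its own, under the assumption that its spectral properties are practically constant.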
When processing a given frame, we notice that for both voiced and unvoiced sounds, the spectral envelope plays an extremely important role. In the case of unvoiced sounds, it would suffice to encode the shape of the spectral envelope; to reconstruct the frame, the system would simply generate white noise (as a sequence of pseudo-random numbers) and pass it through a filter with frequency response equal to the encoded spectral envelope.
In the case of voiced sounds, there is more information that needs to be encoded if we want the reconstructed signal to sound natural, but in principle, the shape of the spectral envelope and the pulses representing the fundamental frequency or pitch convey most of the useful information in the signal.
Some compression techniques use the approach described above directly. Some other techniques exploit these spectral properties indirectly; for example, they exploit some time-domain characteristic of the signal that is a direct consequence of its spectral nature.
One of the most commonly used speech compression techniques - which we discuss in the next section - exploits the fact that these spectral characteristics imply a high correlation between nearby samples (or, equivalently, a high auto-correlation at small time lags). This in turn implies that a good approximation of a sample can be obtained from the few samples preceding it; in particular, from a linear combination of them.
This technique - or rather, family of techniques - is called Linear Prediction. The simplest form of linear prediction is differential encoding. If we know that consecutive samples are highly correlated, then the difference between them should be low; instead of encoding each sample, we encode the difference with respect to the previous sample, the residual, which will require fewer bits, since the values are much smaller than the values of the samples. In this case, we are "predicting" the current sample as the value of the previous sample, and we encode the prediction error or residual.
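A minimal sketch of differential encoding follows. The sample values are hypothetical, and the convention of sending the first sample unchanged is an assumption of this sketch.

```python
def differential_encode(samples):
    """Residual of the 'previous sample' predictor; the first sample is kept as-is."""
    if not samples:
        return []
    residual = [samples[0]]
    residual.extend(samples[n] - samples[n - 1] for n in range(1, len(samples)))
    return residual

def differential_decode(residual):
    """Rebuild the samples by accumulating the residual."""
    samples = []
    for e in residual:
        samples.append(e if not samples else samples[-1] + e)
    return samples

# Hypothetical, slowly varying samples (illustration only).
samples = [100, 104, 109, 113, 116, 118, 119, 119, 118, 116]
residual = differential_encode(samples)
decoded = differential_decode(residual)
print(residual)
```

After the first value, every residual has a magnitude of at most 5, while the samples themselves are all above 100, so the residual needs fewer bits per value.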
If the nature of the signal involves the fact that the waveform is smooth, then we could encode the difference with respect to a linear extrapolation using the previous two samples; though this will not necessarily produce a good or optimal encoding, it illustrates the idea that using more than only the previous sample can yield a better approximation or prediction. This in turn means that the prediction error (which is what we encode) takes smaller values, requiring fewer bits to encode.
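The comparison between the two predictors can be made concrete. In the sketch below, the linear extrapolation from the previous two samples is taken to be $2x(n-1) - x(n-2)$, and the sample values are hypothetical; as noted above, the improvement shown here is specific to a smooth sequence and is not guaranteed in general.

```python
def difference_residual(samples):
    """e(n) = x(n) - x(n-1), evaluated for n >= 2."""
    return [samples[n] - samples[n - 1] for n in range(2, len(samples))]

def extrapolation_residual(samples):
    """e(n) = x(n) - (2*x(n-1) - x(n-2)), evaluated for n >= 2."""
    return [samples[n] - (2 * samples[n - 1] - samples[n - 2])
            for n in range(2, len(samples))]

# Hypothetical smooth sequence (illustration only).
samples = [100, 104, 109, 113, 116, 118, 119, 119, 118, 116]
e_diff = difference_residual(samples)
e_extrap = extrapolation_residual(samples)
print(e_diff)
print(e_extrap)
```

For this sequence the extrapolation residual never exceeds 1 in magnitude, whereas the simple difference reaches 5.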
These are just two simple examples of linear prediction. A more complete overview of this technique is presented in the next section.
2.3 Linear Prediction Coding (LPC)
In general, Linear Prediction attempts to find the linear combination of the past p samples (where p is given as part of the design of the system) that yields the best approximation or prediction of the current value. We encode the signal by providing the coefficients of the linear combination, and the difference between the predicted value and the actual value, the prediction error or residual.
This optimal linear combination defines the optimal prediction filter, P. Our notational convention is as follows: P denotes the optimal prediction filter; P(z) denotes its transfer function in the z-domain; P denotes a prediction filter in general.
The process of computing the residual or prediction error is implemented as a Finite Impulse Response (FIR) filter with a transfer function H(z) = 1 - P(z). It is quite straightforward to see that this filter computes the difference between the current sample and the optimal linear combination of the previous p samples.
The optimality criterion when we talk about the optimal prediction filter is usually the minimization of the residual values (in the statistical sense), since the lower these values are, the fewer bits we will need to encode them.
Figure 2.4 shows a diagram of a typical LPC setup, showing the analysis or encoding stage and the synthesis or decoding stage.

[Figure: block diagram of the analysis (encoding) and synthesis (decoding) stages; the filter coefficients are transmitted or stored.]
Fig. 2.4 A Typical Linear Prediction Coding (LPC) Setup.
A prediction filter, P, is defined by the set of coefficients $\{a_1, a_2, \ldots, a_p\}$ producing an output $y$ according to the formula

$$y(n) = \sum_{k=1}^{p} a_k\, x(n-k) \tag{2.3}$$
The residual or prediction error is obtained as the difference between the actual signal and the output of the prediction filter:

$$e(n) = x(n) - \sum_{k=1}^{p} a_k\, x(n-k) \tag{2.4}$$
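Equations (2.3) and (2.4) translate directly into code. The sketch below also includes the inverse operation, which rebuilds the signal from the residual; treating samples before the start of the signal as zero is an assumption of this sketch, and the coefficients $a_1 = 2$, $a_2 = -1$ are simply those of the linear extrapolation predictor mentioned earlier, not optimal values.

```python
def predict(x, coeffs, n):
    """y(n) = sum_{k=1..p} a_k * x(n-k), Eq. (2.3); samples before n = 0 count as 0."""
    return sum(a * x[n - k] for k, a in enumerate(coeffs, start=1) if n - k >= 0)

def analysis(x, coeffs):
    """Residual e(n) = x(n) - y(n), Eq. (2.4): the FIR filter 1 - P(z)."""
    return [x[n] - predict(x, coeffs, n) for n in range(len(x))]

def synthesis(e, coeffs):
    """Rebuild x(n) = e(n) + y(n) sample by sample from the residual."""
    x = []
    for n in range(len(e)):
        x.append(e[n] + predict(x, coeffs, n))
    return x

coeffs = [2, -1]  # a_1, a_2 of the linear extrapolation predictor (not optimal)
x = [100, 104, 109, 113, 116, 118, 119, 119, 118, 116]  # hypothetical samples
e = analysis(x, coeffs)
x_rebuilt = synthesis(e, coeffs)
print(e)
```

Because the synthesis loop feeds its own past outputs back into the predictor, it recovers the original samples exactly when the residual is passed through unchanged.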
The optimal prediction filter, P, is computed for every frame as the filter that minimizes the sum of the squared prediction error for the frame under the constraint of producing a filter 1 - P that is minimum phase.⁴ We will explain this additional constraint shortly. For a frame of N samples, we would have

$$P = \arg\min \left\{ \sum_{k=1}^{N} |e(k)|^2 \right\} \tag{2.5}$$
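One common way to carry out a minimization of this kind is the autocorrelation method solved with the Levinson-Durbin recursion, which is known to yield a minimum-phase filter 1 - P. The sketch below shows that method as an illustration; it is an assumption that it matches the procedure intended here, and the test frame (a short sinusoid) and the order p = 2 are hypothetical.

```python
import math

def autocorrelation(frame, max_lag):
    """r(lag) = sum_i frame[i] * frame[i - lag], for lag = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve sum_k a_k r(|i-k|) = r(i), i = 1..p; returns ([a_1..a_p], residual energy).

    Assumes r[0] > 0 (the frame is not all zeros).
    """
    a = [0.0] * (order + 1)  # a[k] holds a_k; a[0] is unused
    error = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        reflection = acc / error
        new_a = a[:]
        new_a[m] = reflection
        for k in range(1, m):
            new_a[k] = a[k] - reflection * a[m - k]
        a = new_a
        error *= 1.0 - reflection * reflection
    return a[1:], error

# Hypothetical frame: 160 samples of a sinusoid (illustration only).
frame = [math.sin(0.3 * n) for n in range(160)]
r = autocorrelation(frame, 2)
coeffs, residual_energy = levinson_durbin(r, 2)
print(coeffs, residual_energy / r[0])
```

For a strongly correlated frame such as this one, the residual energy is a small fraction of the frame energy `r[0]`, which is what makes the residual cheaper to encode than the samples.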
⁴ For notational convenience, we refer to 1 - P as a shorthand for an encoding filter P' with transfer function P'(z) = 1 - P(z).

At the decoding stage, which is fed by the residual from the encoding stage, the output