DSP Fundamentals and Implementation
3.5 Quantization Errors
As discussed in Section 3.4, digital signals and system parameters are represented by a finite number of bits. There is a noticeable error between the desired and actual results, known as the finite-precision (finite-wordlength, or numerical) effects. In general, finite-precision effects can be broadly categorized into the following classes:
1. Quantization errors
   a. Input quantization
   b. Coefficient quantization
2. Arithmetic errors
   a. Roundoff (truncation) noise
   b. Overflow
The limit cycle oscillation is another phenomenon that may occur when implementing a feedback system such as an IIR filter with finite-precision arithmetic. The output of the system may continue to oscillate indefinitely while the input remains 0. This can happen because of quantization errors or overflow.
This section briefly analyzes finite-precision effects in DSP systems using fixed-point arithmetic, and presents methods for confining these effects to acceptable levels.
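The limit cycle oscillation mentioned above can be illustrated with a minimal sketch. It assumes a first-order feedback system with zero input, y(n) = Q[-(7/8) y(n-1)], where Q[] rounds to the nearest integer. The coefficient, the integer state, and the function names are illustrative assumptions, not taken from the text.

```c
/* Divide num by den (den > 0) and round to the nearest integer,
   with ties rounded away from zero. */
long round_div(long num, long den)
{
    if (num >= 0)
        return (num + den / 2) / den;
    return -((-num + den / 2) / den);
}

/* One step of the zero-input recursion y(n) = Q[-(7/8) y(n-1)]. */
long limit_cycle_step(long y_prev)
{
    return round_div(-7 * y_prev, 8);
}

/* Run the recursion for the given number of steps, starting from y0. */
long limit_cycle_run(long y0, int steps)
{
    long y = y0;
    for (int n = 0; n < steps; n++)
        y = limit_cycle_step(y);
    return y;
}
```

Starting from y0 = 10, the state decays to magnitude 4 and then alternates between 4 and -4 indefinitely, although an infinite-precision version of the same filter would decay to zero.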
3.5.1 Input Quantization Noise
The ADC shown in Figure 1.2 converts a given analog signal x(t) into digital form x(n).
The input signal is first sampled to obtain the discrete-time signal x(nT). Each x(nT) value is then encoded using a B-bit wordlength to obtain the digital signal x(n), which consists of M magnitude bits and one sign bit (B = M + 1) as shown in Figure 3.11. As discussed in Section 3.4, we assume that the signal x(n) is scaled such that $-1 \le x(n) < 1$. Thus the full-scale range of fractional numbers is 2. Since the quantizer employs B bits, the number of quantization levels available for representing x(nT) is $2^B$. Thus the spacing between two successive quantization levels is
$$\Delta = \frac{\text{full-scale range}}{\text{number of quantization levels}} = \frac{2}{2^{B}} = 2^{-B+1} = 2^{-M}, \tag{3.5.1}$$
which is called the quantization step (interval, width, or resolution).
Common methods of quantization are rounding and truncation. With rounding, the signal value is approximated by the nearest quantization level. With truncation, the signal value is assigned to the highest quantization level that is not greater than the signal itself. Since truncation produces a bias effect (see exercise problem), we use rounding for quantization in this book. The input value x(nT) is rounded to the nearest level as illustrated in Figure 3.12. We assume there is a line midway between two quantization levels. A signal value above this line is assigned to the higher quantization level, while a signal value below this line is assigned to the lower level. For example, the discrete-time signal x(T) is rounded to 010, since the real value is below the middle line between 010 and 011, while x(2T) is rounded to 011 since the value is above the middle line.
Figure 3.12 Quantization process related to ADC (x(t) versus time t at the instants 0, T, and 2T, with quantization levels 000 to 011 spaced Δ apart and the error e(n) bounded by Δ/2)
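The two quantization methods described above can be sketched in C as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the input is taken to be scaled to -1 ≤ x < 1, ties in rounding go to the higher level, and values that would round above the largest level are clamped to it.

```c
/* Largest integer not greater than v (valid while v fits in a long). */
long floor_to_long(double v)
{
    long k = (long)v;                 /* the cast truncates toward zero */
    return (v < (double)k) ? k - 1 : k;
}

/* Quantization step of (3.5.1): 2 / 2^B = 2^(1-B). */
double quant_step(int B)
{
    double step = 2.0;
    for (int i = 0; i < B; i++)
        step /= 2.0;
    return step;
}

/* Clamp a level index to the B-bit two's-complement range. */
long clamp_level(long k, int B)
{
    long max = (1L << (B - 1)) - 1;
    long min = -(1L << (B - 1));
    return (k > max) ? max : ((k < min) ? min : k);
}

/* Rounding: pick the nearest quantization level. */
double quantize_round(double x, int B)
{
    double step = quant_step(B);
    return (double)clamp_level(floor_to_long(x / step + 0.5), B) * step;
}

/* Truncation: pick the highest level that is not greater than x. */
double quantize_trunc(double x, int B)
{
    double step = quant_step(B);
    return (double)clamp_level(floor_to_long(x / step), B) * step;
}
```

With B = 3 the step is 0.25, so an input of 0.4 is rounded to 0.5 but truncated to 0.25. The truncated value is never above the input, which is the source of the bias mentioned in the text.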
The quantization error (noise), e(n), is the difference between the discrete-time signal, x(nT), and the quantized digital signal, x(n). The error due to quantization can be expressed as
$$e(n) = x(n) - x(nT). \tag{3.5.2}$$
Figure 3.12 clearly shows that
$$|e(n)| \le \frac{\Delta}{2}. \tag{3.5.3}$$
Thus the quantization noise generated by an ADC depends on the quantization interval. Using more bits gives a smaller quantization step and therefore less quantization noise.
From (3.5.2), we can view the ADC output as being the sum of the quantizer input x(nT) and the error component e(n). That is,
$$x(n) = Q[x(nT)] = x(nT) + e(n), \tag{3.5.4}$$
where Q[] denotes the quantization operation. The nonlinear operation of the quantizer is modeled as a linear process that introduces an additive noise e(n) to the discrete-time signal x(nT) as illustrated in Figure 3.13. Note that this model is not accurate for low-amplitude slowly varying signals.
For an arbitrary signal with fine quantization (B is large), the quantization error e(n) may be assumed to be uncorrelated with the digital signal x(n), and can be assumed to be random noise that is uniformly distributed in the interval $[-\Delta/2, \Delta/2]$. From (3.3.13), we can show that
$$E[e(n)] = \frac{-\Delta/2 + \Delta/2}{2} = 0. \tag{3.5.5}$$
Figure 3.13 Linear model for the quantization process (x(nT) and the noise e(n) are summed to produce x(n))
That is, the quantization noise e(n) has zero mean. From (3.3.14) and (3.5.1), we can show that the variance is
$$\sigma_e^2 = \frac{\Delta^2}{12} = \frac{2^{-2B}}{3}. \tag{3.5.6}$$
Therefore the larger the wordlength, the smaller the input quantization error.
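The zero mean and the variance Δ²/12 can be checked numerically. The sketch below is an assumption-laden illustration, not the book's method: it sweeps a uniform grid of inputs over (-0.5, 0.5), rounds each one to a B-bit level, and accumulates the statistics of e(n) = x(n) - x(nT). The names `noise_stats`, `rounding_noise_stats`, and `noise_variance_theory` are hypothetical.

```c
typedef struct {
    double mean;      /* sample mean of e(n) */
    double variance;  /* sample variance of e(n) */
} noise_stats;

/* Quantization step 2^(1-B) from (3.5.1). */
double step_for_bits(int B)
{
    double step = 2.0;
    for (int i = 0; i < B; i++)
        step /= 2.0;
    return step;
}

/* Theoretical variance of the rounding error: step^2 / 12. */
double noise_variance_theory(int B)
{
    double step = step_for_bits(B);
    return step * step / 12.0;
}

/* Measure the rounding error over num_points inputs spread evenly
   over (-0.5, 0.5), so that no input reaches the saturation limits. */
noise_stats rounding_noise_stats(int B, long num_points)
{
    double step = step_for_bits(B);
    double sum = 0.0, sum_sq = 0.0;

    for (long i = 0; i < num_points; i++) {
        double x = ((double)i + 0.5) / (double)num_points - 0.5;
        double v = x / step + 0.5;
        long k = (long)v;
        if ((double)k > v)
            k--;                          /* floor(v): nearest level index */
        double e = (double)k * step - x;  /* e(n) = x(n) - x(nT) */
        sum += e;
        sum_sq += e * e;
    }

    noise_stats s;
    s.mean = sum / (double)num_points;
    s.variance = sum_sq / (double)num_points - s.mean * s.mean;
    return s;
}
```

For example, `rounding_noise_stats(8, 65536)` gives a mean of essentially zero and a variance within a fraction of a percent of `noise_variance_theory(8)`. A deterministic sweep is used instead of random inputs so that the result is repeatable.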
If the quantization error is regarded as noise, the signal-to-noise ratio (SNR) can be expressed as
$$\mathrm{SNR} = \frac{\sigma_x^2}{\sigma_e^2} = 3 \cdot 2^{2B} \sigma_x^2, \tag{3.5.7}$$
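where $\sigma_x^2$ denotes the variance of the signal x(n). Usually, the SNR is expressed in decibels (dB) as
$$\begin{aligned}
\mathrm{SNR} &= 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_e^2}\right) = 10 \log_{10}\left(3 \cdot 2^{2B} \sigma_x^2\right) \\
&= 10 \log_{10} 3 + 20B \log_{10} 2 + 10 \log_{10} \sigma_x^2 \\
&= 4.77 + 6.02B + 10 \log_{10} \sigma_x^2.
\end{aligned} \tag{3.5.8}$$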
This equation indicates that each additional bit used in the ADC provides about 6 dB of signal-to-quantization-noise ratio gain. When using a 16-bit ADC (B = 16), the SNR is about 96 dB. Another important fact shown by (3.5.7) and (3.5.8) is that the SNR is proportional to $\sigma_x^2$. Therefore we want to keep the signal power as large as possible.
This is an important consideration when we discuss scaling issues in Section 3.6.
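As a worked illustration of (3.5.8), assume a sinusoidal input, which is not a case the text itself works out. A sinusoid of amplitude A has variance $\sigma_x^2 = A^2/2$. For B = 16, a near full-scale amplitude A = 1 and a low-level amplitude A = 0.1 give approximately

```latex
\begin{aligned}
A = 1:\quad   & \mathrm{SNR} \approx 4.77 + 6.02 \times 16 + 10\log_{10}(0.5)
              = 4.77 + 96.32 - 3.01 \approx 98.1\ \text{dB}, \\
A = 0.1:\quad & \mathrm{SNR} \approx 4.77 + 6.02 \times 16 + 10\log_{10}(0.005)
              = 4.77 + 96.32 - 23.01 \approx 78.1\ \text{dB}.
\end{aligned}
```

Reducing the amplitude by a factor of 10 costs 20 dB of SNR, which is why the signal should be scaled to use as much of the available range as possible.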
In digital audio applications, quantization errors arising from low-level signals are referred to as granulation noise. It can be eliminated by adding dither (low-level noise) to the signal before quantization. However, dithering reduces the SNR. In many applications, the inherent noise of the analog audio components (microphones, amplifiers, or mixers) may already provide enough dithering, so adding further dither may not be necessary.
If the digital filter is a linear system, the effect of the input quantization noise alone on the output may be computed. For example, for the FIR filter defined in (3.1.16), the variance of the output noise due to the input quantization noise may be expressed as
$$\sigma_{y,e}^2 = \sigma_e^2 \sum_{l=0}^{L-1} b_l^2. \tag{3.5.9}$$
This noise is relatively small when compared with other numerical errors and is determined by the wordlength of the ADC.
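Equation (3.5.9) translates directly into a few lines of C. The sketch below assumes the coefficients are available as an array of doubles and that the input was quantized by rounding to B bits, so that (3.5.6) applies. The function names are hypothetical.

```c
/* Variance of the B-bit rounding noise from (3.5.6): 2^(-2B) / 3. */
double quant_noise_variance(int B)
{
    double v = 1.0 / 3.0;
    for (int i = 0; i < 2 * B; i++)
        v /= 2.0;
    return v;
}

/* Output noise variance of an L-tap FIR filter from (3.5.9):
   the input noise variance times the sum of squared coefficients. */
double fir_output_noise_variance(const double *b, int L, int B)
{
    double sum_sq = 0.0;
    for (int l = 0; l < L; l++)
        sum_sq += b[l] * b[l];
    return quant_noise_variance(B) * sum_sq;
}
```

For a 4-tap moving-average filter with all coefficients equal to 0.25, the sum of squares is 0.25, so the output noise variance is one quarter of the input noise variance.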
Example 3.5: Input quantization effects may be subjectively evaluated by observing and listening to the quantized speech. A speech file called timitl.asc (included in the software package) was digitized using $f_s = 8$ kHz and B = 16.
This speech file can be viewed and played using the MATLAB script:
load('timitl.asc');
plot(timitl);
soundsc(timitl, 8000, 16);
where the MATLAB function soundsc autoscales and plays the vector as sound.
We can simulate the quantization of data with 8-bit wordlength by
qx = round(timitl/256);
where the function, round, rounds the real number to the nearest integer. We then evaluate the quantization effects by
plot(qx);
soundsc(qx, 8000, 16);
By comparing the graph and sound of timitl and qx, the quantization effects may be understood.
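The same 16-bit to 8-bit requantization can be sketched in C with integer arithmetic only. This is an illustrative equivalent of round(x/256), not code from the software package. It rounds ties away from zero, as the MATLAB function round does, and the function names are hypothetical.

```c
#include <stdint.h>

/* Requantize one 16-bit sample to 8-bit resolution: round(x / 256),
   with ties rounded away from zero. The result is not saturated, so
   the largest inputs map to 128, just as in the MATLAB expression. */
int requantize_16_to_8(int16_t x)
{
    int32_t v = x;                    /* widen so that -v cannot overflow */
    if (v >= 0)
        return (int)((v + 128) / 256);
    return -(int)((-v + 128) / 256);
}

/* Requantize a block of n samples from in[] into out[]. */
void requantize_block(const int16_t *in, int *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = requantize_16_to_8(in[i]);
}
```

Every input from -127 to 127 maps to 0, which is why low-level passages of the speech are the most audibly affected.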
3.5.2 Coefficient Quantization Noise
When implementing a digital filter, the filter coefficients are quantized to the wordlength of the DSP hardware so that they can be stored in the memory. The filter coefficients, $b_l$ and $a_m$, of the digital filter defined by (3.2.18) are determined by a filter design package such as MATLAB for given specifications. These coefficients are usually represented using the floating-point format and have to be encoded using a finite number of bits for a given fixed-point processor. Let $b'_l$ and $a'_m$ denote the quantized values corresponding to $b_l$ and $a_m$, respectively. The difference equation that can actually be implemented becomes
$$y(n) = \sum_{l=0}^{L-1} b'_l\, x(n-l) - \sum_{m=1}^{M} a'_m\, y(n-m). \tag{3.5.10}$$
This means that the performance of the digital filter implemented on the DSP hardware will be slightly different from its design specification. Design and implementation of digital filters for real-time applications will be discussed in Chapter 5 for FIR filters and Chapter 6 for IIR filters.
If the wordlength B is not large enough, there will be undesirable effects. The coefficient quantization effects become more significant when tighter specifications are used. This generally affects IIR filters more than it affects FIR filters. In many applications, it is desirable for a pole (or poles) of IIR filters to lie close to the unit circle.
Coefficient quantization can cause serious problems if the poles of desired filters are too close to the unit circle because those poles may be shifted on or outside the unit circle due to coefficient quantization, resulting in an unstable implementation. Such undesirable effects due to coefficient quantization are far more pronounced when high-order systems (where L and M are large) are directly implemented since a change in the value of a particular coefficient can affect all the poles. If the poles are tightly clustered for a lowpass or bandpass filter with narrow bandwidth, the poles of the direct-form realization are sensitive to coefficient quantization errors. The greater the number of clustered poles, the greater the sensitivity.
The coefficient quantization noise is also affected by the different structures for the implementation of digital filters. For example, the direct-form implementation of IIR filters is more sensitive to coefficient quantization errors than the cascade structure consisting of sections of first- or second-order IIR filters. This problem will be further discussed in Chapter 6.
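The risk of a pole moving onto the unit circle can be illustrated with a second-order section whose denominator is $1 + a_1 z^{-1} + a_2 z^{-2}$. The sketch below makes several assumptions that are not in the text: coefficients are rounded to a given number of fractional bits with no limit on the integer part, the example values are made up, and stability is tested with the standard conditions $a_2 < 1$ and $|a_1| < 1 + a_2$. The function names are hypothetical.

```c
/* Round a coefficient to the nearest multiple of 2^(-frac_bits).
   Range limiting (saturation) is not modeled in this sketch. */
double quantize_coef(double c, int frac_bits)
{
    double scale = 1.0;
    for (int i = 0; i < frac_bits; i++)
        scale *= 2.0;

    double v = c * scale + 0.5;
    long k = (long)v;
    if ((double)k > v)
        k--;                          /* floor(v) */
    return (double)k / scale;
}

/* Stability test for 1 + a1 z^-1 + a2 z^-2: both poles lie strictly
   inside the unit circle when a2 < 1 and |a1| < 1 + a2. */
int is_stable_2nd_order(double a1, double a2)
{
    double abs_a1 = (a1 < 0.0) ? -a1 : a1;
    return (a2 < 1.0) && (abs_a1 < 1.0 + a2);
}
```

With $a_1 = -1.9$ and $a_2 = 0.998$ the poles are complex with radius about 0.999, so the ideal filter is stable. Rounding to 6 fractional bits turns $a_2$ into exactly 1.0, which places the poles on the unit circle, while 14 fractional bits keep the section stable.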
3.5.3 Roundoff Noise
As shown in Figure 3.3 and (3.1.11), we may need to compute the product y(n) = ax(n) in a DSP system. Assuming the wordlength associated with a and x(n) is B bits, the multiplication yields a 2B-bit product y(n). For example, a 16-bit number times another 16-bit number will produce a 32-bit product. In most applications, this product may have to be stored in memory or output as a B-bit word. The 2B-bit product can be either truncated or rounded to B bits. Since truncation causes an undesired bias effect, we restrict our attention to the rounding case.
In C programming, rounding a real number to an integer number can be implemented by adding 0.5 to the real number and then truncating the fractional part. For example, the following C statement
y = (int)(x + 0.5);
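rounds the real number x to the nearest integer y, provided that x is non-negative. Because the cast truncates toward zero, the result is generally incorrect for negative x. As shown in Example 3.5, MATLAB provides the function round for rounding a real number.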
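The add-0.5-and-truncate idiom can be extended to negative numbers by subtracting 0.5 instead. The following is a minimal sketch with hypothetical function names. It assumes x is small enough for the result to fit in an int.

```c
/* The idiom from the text: correct only for x >= 0. */
int round_nonneg(double x)
{
    return (int)(x + 0.5);
}

/* Sign-aware variant: rounds to the nearest integer, with ties
   rounded away from zero. */
int round_nearest(double x)
{
    return (x >= 0.0) ? (int)(x + 0.5) : (int)(x - 0.5);
}
```

For x = -2.6 the first function returns -2, because the cast truncates -2.1 toward zero, while the second returns -3, the nearest integer.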
In TMS320C55x implementation, the CPU rounds the operands enclosed by the rnd( ) expression qualifier. For example,
mov rnd(HI(AC0)), *AR1
This instruction rounds the content of the high portion of AC0(31:16), and the rounded 16-bit value is stored in the memory location pointed at by AR1. Another keyword, R (or r), when used with the operation code, also performs a rounding operation on the operands. The following example rounds the product of AC0 and AC1, stores the rounded result in the upper portion of the accumulator AC1(31:16), and clears the lower portion AC1(15:0):
mpyr AC0, AC1
The process of rounding a 2B-bit product to B bits is very similar to that of quantizing discrete-time samples using a B-bit quantizer. Similar to (3.5.4), the nonlinear operation of product roundoff can be modeled as the linear process shown in Figure 3.13. That is,
$$y(n) = Q[ax(n)] = ax(n) + e(n), \tag{3.5.11}$$
where ax(n) is the 2B-bit product and e(n) is the roundoff noise due to rounding the 2B-bit product to B bits. The roundoff noise is a uniformly distributed random process in the interval defined in (3.5.3). Thus it has zero mean and its power is defined in (3.5.6).
It is important to note that most commercially available fixed-point DSP devices such as the TMS320C55x have double-precision (2B-bit) accumulator(s). As long as the program is carefully written, it is quite possible to ensure that rounding occurs only at the final stage of calculation. For example, consider the computation of the FIR filter output given in (3.1.16). We can keep all the temporary products, $b_l x(n-l)$ for $l = 0, 1, \ldots, L-1$, in the double-precision accumulator. Rounding is only performed when computation is completed and the sum of products is saved to memory with B-bit wordlength.
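The benefit of rounding only once can be seen in a small C sketch. It is written under assumptions that go beyond the text: samples and coefficients are 16-bit Q15 fractions, a 64-bit integer stands in for the wide accumulator, x[l] holds the sample x(n-l), and the result is returned without saturation. The function names are hypothetical.

```c
#include <stdint.h>

/* Round a Q30 value to Q15: add half an LSB, then divide by 2^15
   with the quotient rounded toward minus infinity. */
int32_t round_q30_to_q15(int64_t acc)
{
    int64_t biased = acc + (1 << 14);
    int64_t q = biased / 32768;       /* C division truncates toward zero */
    if (biased % 32768 < 0)
        q--;                          /* correct the quotient for negatives */
    return (int32_t)q;
}

/* Keep all products in the wide accumulator and round only once. */
int32_t fir_round_once(const int16_t *b, const int16_t *x, int L)
{
    int64_t acc = 0;
    for (int l = 0; l < L; l++)
        acc += (int32_t)b[l] * x[l];  /* exact Q30 product */
    return round_q30_to_q15(acc);
}

/* For comparison: round every product to Q15 before adding. */
int32_t fir_round_each(const int16_t *b, const int16_t *x, int L)
{
    int32_t y = 0;
    for (int l = 0; l < L; l++)
        y += round_q30_to_q15((int32_t)b[l] * x[l]);
    return y;
}
```

With three coefficients of 0.25 (8192 in Q15) and three input samples of one LSB each, the exact sum is 0.75 LSB. Rounding once gives 1, while rounding each product gives 0 because every individual product of 0.25 LSB rounds to zero.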