Speech digitization
The human ear can perceive frequencies in the
range 16 Hz to 20 kHz, known as the audio range, whereas
speech occupies a narrower band, roughly 100 Hz to 10 kHz, within this range.
Reducing the transmitted bandwidth is desirable because it reduces the cost
of the communication system.
An acceptable level of speech intelligibility is obtained by
transmitting only the frequencies in the range 300-3400 Hz. Such a band-limited speech signal (a bandwidth of 3.1 kHz) is often called ‘toll’ (telephone) quality speech.
Within this band, the ear is most sensitive
to frequencies around 3 kHz. For female voices, the maximum energy is distributed around this frequency, whereas for male voices the maximum energy occurs at a much lower frequency. This is the reason traditionally given for preferring women as telephone operators and announcers.
A channel in a communication system has a finite
transmission loss and is subject to noise impairment.
When the length of the transmission path increases,
the signal-to-noise ratio at the receiving end
decreases.
In analog voice transmission, the effect of noise and
interference is most apparent during speech pauses
when the signal amplitude is near zero. Even
relatively low noise levels can be quite annoying to a
listener during speech pauses. The same levels of
noise may be unnoticeable when speech is present.
Hence, it is the absolute noise level of an idle channel
that determines the analog speech quality.
In a digital system, both speech and speech pauses are encoded
as data patterns and transmitted at a constant power level.
Signal regeneration at regular intervals bringing the signal to
the original level virtually eliminates all noise due to the transmission medium. Thus, the idle channel noise is determined by the encoding process and not by the transmission link in a digital system.
Besides, the ability of digital transmission to reject
crosstalk is superior to that of an analog system. First, low-level crosstalk is eliminated because of the constant
amplitude signals. Second, high-amplitude crosstalk results in detection errors and as such is unintelligible.
Other advantages of digital systems include the ability to
support nonvoice services, and easy data encryption and performance monitoring.
Although digital systems require greater bandwidth than
analog systems, and transmission media like wire pairs cause greater attenuation when larger bandwidth signals are passed through them, the advantages offered by digital systems outweigh these drawbacks.
Sampling
The first step in digitizing speech is to establish a set of
discrete times at which the input waveform is sampled.
The discrete sample instances may be spaced either at
regular or irregular intervals.
The minimum sampling frequency required to reconstruct the
original waveform from the sampled sequence is given by the Nyquist criterion, which can be stated as
fS ≥ 2H
where fS = sampling frequency (its minimum value, 2H, is the Nyquist rate)
H = highest frequency component in the input analog waveform
H is the bandwidth of the input waveform if it is not band
limited with a lower cut-off frequency. In this case, the
original waveform is reconstructed by passing the sampled values through a low-pass filter, which smooths or interpolates the signal between sampled values.
Sampling is a process of multiplying a constant
amplitude impulse train with the input signal. It is an
amplitude modulation process, where the pulse train acts as the carrier.
Since the amplitude of the
pulses is modulated, the scheme is called pulse
amplitude modulation (PAM).
The frequency spectrum of an
amplitude modulated signal, when the carrier is a sine
wave, has frequencies ranging from fC-H to fC+H, where fC is the carrier frequency.
If the carrier is a pulse train, as is the case in PAM, the output spectrum contains the
fundamental as well as the
harmonics of the fundamental.
If the pulse train is a square wave
(50% duty cycle), only the
fundamental and odd harmonics are present.
The low pass filter at the receiver
end allows only the baseband
component 0-H Hz to pass. If fS is less than twice H, portions of PAM signal spectrum will overlap.
This overlapping of the sidebands
produces beat frequencies that interfere with the desired signal and such an interference is
referred to as aliasing or foldover distortion.
To avoid aliasing with speech band limited to 3.4 kHz, the
minimum sampling frequency required is 6.8 kHz, though in the
digital telephone network speech is sampled at a rate of 8 kHz.
•The filter used for band limiting the input speech waveform is known as the antialiasing filter.
•Sampling at 8 kHz amounts to oversampling, which allows for the nonideal filter characteristics.
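The foldover effect can be illustrated numerically. The following is a minimal sketch and not part of the original slides: the function names `min_sampling_rate` and `alias_frequency` are hypothetical, and the input is assumed to be a single real tone.

```python
def min_sampling_rate(highest_freq_hz):
    """Nyquist criterion: fS must be at least 2H."""
    return 2 * highest_freq_hz


def alias_frequency(tone_hz, fs_hz):
    """Frequency at which a real tone appears after sampling at fs_hz.

    The sampled spectrum repeats every fs_hz, so a tone above fs_hz / 2
    folds back into the baseband 0..fs_hz / 2.
    """
    folded = tone_hz % fs_hz
    return min(folded, fs_hz - folded)


print(min_sampling_rate(3400))       # minimum rate for 3.4 kHz speech
print(alias_frequency(3000, 8000))   # below fS/2: frequency is unchanged
print(alias_frequency(5000, 8000))   # above fS/2: folds into the speech band
```

Under these assumptions a 5 kHz component sampled at 8 kHz becomes indistinguishable from a 3 kHz tone, which is why the input must be band limited before sampling.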
Quantization & binary coding
PAM systems are not generally useful over long
distances, owing to vulnerability of the individual
pulse amplitudes to noise, distortion & crosstalk.
The amplitude susceptibility may be reduced or
eliminated by converting the PAM samples into a
digital format, thereby allowing the use of
regenerative repeaters to remove transmission
imperfections before errors result.
With n bits, the number of sample values that can be
represented is $2^n$. But the PAM sample amplitudes
can take on an infinite range of values. Therefore,
it is necessary to quantize each PAM sample
amplitude to the nearest of a set of discrete
amplitude levels.
Signal V is confined to a range from VL to VH, and this range is divided into M equal steps. The step size is S=(VH-VL)/M.
In the center of each of these
steps we locate the quantization levels V0, V1,…,VM-1. The quantized signal Vq takes on any one of the quantized level values.
A signal V is quantized to its
nearest quantization level.
The boundary values between the
steps are equidistant from two quantization levels and a
convention may be adopted to
quantize them to one of the levels.
Thus, the signal Vq makes a
quantum jump of step size S and at any instant of time the
quantization error V-Vq has a magnitude which is equal to or less than S/2.
When the step size is uniform, it is
known as linear or uniform quantization.
Vq=V3 if (V3-S/2)≤V<(V3+S/2)
Vq=V4 if (V4-S/2)≤V<(V4+S/2)
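The uniform quantization rule described above can be sketched in a few lines. This is an illustrative sketch only: the function name `quantize` and the choice to clamp out-of-range inputs to the end levels are assumptions, and boundary values are assigned to the upper level, matching the inequalities (V3-S/2)≤V<(V3+S/2).

```python
def quantize(v, v_low, v_high, m):
    """Return (index, level) of the quantization level assigned to v."""
    step = (v_high - v_low) / m            # S = (VH - VL) / M
    index = int((v - v_low) // step)       # which of the M steps v falls in
    index = max(0, min(m - 1, index))      # clamp so that v = VH maps to the top level
    level = v_low + (index + 0.5) * step   # levels sit at the centers of the steps
    return index, level


index, vq = quantize(0.3, -4.0, 4.0, 8)
error = 0.3 - vq
print(index, vq, abs(error) <= 0.5)        # the error magnitude never exceeds S/2
```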
The process of quantization itself brings about a
certain amount of noise immunity to the signal.
The quantized signal is an approximation to the
original signal. The quality of approximation may be
improved by reducing the size of the steps and
thereby increasing the no. of allowable levels.
However, reducing the step size makes the PAM
signal more susceptible to noise.
So, each quantized level is represented by a code
number which is transmitted instead of the
quantized sample value itself. If binary arithmetic is
used for coding, then the code number is
transmitted as a series of pulses. Hence, such a
system of transmission is called pulse code modulation (PCM).
The analog signal is
limited to -4V to +4V.
Step size is one volt.
Eight quantization levels
are used and are
located at -3.5V, -2.5V,
…,+3.5V.
Code number 0 is
assigned to -3.5V, the
code number 1 to -2.5V
and so on.
Each code number has
its equivalent 3-bit binary representation.
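The example above (signal range -4 V to +4 V, eight levels, code numbers 0 to 7) can be written out as a small encoder and decoder. This is a sketch under stated assumptions: the names `pcm_encode` and `pcm_decode` and the sample values are hypothetical, and the code words use plain natural binary.

```python
V_LOW, V_HIGH = -4.0, 4.0
LEVELS = 8                          # eight quantization levels -> 3 bits
STEP = (V_HIGH - V_LOW) / LEVELS    # step size of one volt


def pcm_encode(v):
    """Map an analog sample to its 3-bit code word."""
    code = int((v - V_LOW) // STEP)
    code = max(0, min(LEVELS - 1, code))   # keep the code number in 0..7
    return format(code, "03b")


def pcm_decode(word):
    """Map a 3-bit code word back to its quantization level."""
    return V_LOW + (int(word, 2) + 0.5) * STEP


samples = [-3.7, -0.2, 1.3, 3.9]           # hypothetical sample values
words = [pcm_encode(v) for v in samples]
print(words)
print([pcm_decode(w) for w in words])
```

Code number 0 decodes to -3.5 V and code number 7 to +3.5 V, as in the example.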
A PCM system
The analog input signal V
is band limited to 3.4 kHz to prevent aliasing and sampled at 8 kHz.
The quantizer and the
encoder together perform the A-D conversion.
The quantizer and the
decoder together perform D-A conversion at the
receiver.
The quantized PAM levels
are then passed through a filter which rejects the
frequency components lying outside the
baseband and produces a reconstructed waveform of the original band
limited signal.
Quantization noise
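The instantaneous error
e=V-Vq is randomly distributed within the range ±S/2 and is called the quantization error or noise.
The average quantization
noise output power is given by the variance, where p(e)=1/S is the uniform probability density of the error:
$\sigma^2 = \int_{-S/2}^{S/2} e^2 \, p(e) \, de = \frac{S^2}{12}$
The signal-to-quantization-noise ratio (SQR) is
a good measure of the
performance of a PCM system. For a full-range sinusoidal input coded with n bits,
SQR = 1.76 + 6.02n dB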
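The SQR expression can be checked from first principles. The sketch below assumes a full-range sinusoid of peak amplitude Vm quantized uniformly with n bits, so that the step size is 2Vm divided by the number of levels and the noise power is the step size squared over 12; the function name `sqr_db` is hypothetical.

```python
import math


def sqr_db(n_bits, v_peak=1.0):
    """Signal-to-quantization-noise ratio in dB for a full-range sinusoid."""
    step = 2 * v_peak / 2 ** n_bits    # step size for 2^n levels
    noise_power = step ** 2 / 12       # variance of the quantization error
    signal_power = v_peak ** 2 / 2     # mean power of a sinusoid
    return 10 * math.log10(signal_power / noise_power)


for n in (7, 8, 12):
    print(n, round(sqr_db(n), 2), round(1.76 + 6.02 * n, 2))
```

The ratio simplifies to 1.5 times 2 to the power 2n, which is where the 1.76 dB constant and the 6.02 dB per bit come from.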
Companding
In linear or uniform quantization, the magnitude of quantization noise is absolute for a particular system and is independent of the input signal amplitude.
Therefore, comparatively, the weak and low-level signals suffer worse from quantization noise than the loud and strong signals. The very high percentage error
at low input signal levels actually represents idle channel noise. The effect of this is particularly
bothersome during speech
pauses and can be minimized by choosing 0 volt level as a
quantization level and avoiding the mid points of the first
intervals on either side of the zero level as quantization levels.
The percentage quantization error is ef=[(S/2)/|V|]×100%.
For a sinusoidal input of peak amplitude Vm, S=2Vm/M. Hence, ef=[Vm/(M|V|)]×100%.
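A short numerical example shows how quickly the percentage error grows as the signal level falls. This is a sketch only: the function name `percent_error` is hypothetical, and M=256 levels with Vm=1 V are assumed values chosen for illustration.

```python
def percent_error(v, v_peak, m):
    """Worst-case percentage quantization error, ef = Vm / (M |V|) x 100."""
    return v_peak / (m * abs(v)) * 100.0


M = 256       # assumed number of levels (8-bit linear quantizer)
V_PEAK = 1.0  # assumed peak amplitude Vm in volts

for v in (1.0, 0.1, 0.01):
    print(v, round(percent_error(v, V_PEAK, M), 2))
```

With these assumed values the error is under half a percent at full amplitude but tens of percent for a signal one hundredth of that size.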
The scheme which uses the
first two midpoints is known
as the mid-riser scheme and the
other as the mid-tread scheme.
The mid-tread scheme uses an
odd number of quantization
levels, i.e., $M = 2^n - 1$.
In mid-tread scheme, very
low signals are decoded into a
constant, zero-level output.
However, if a d.c. bias exists
in the encoder, idle channel
noise is still a problem with
mid-tread quantisation.
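The difference between the two schemes, and the effect of a d.c. bias, can be seen in a small sketch. The function names and the noise and bias values are assumptions chosen for illustration; the quantizers are unbounded here, so the limit on the number of levels is ignored.

```python
import math


def mid_riser(v, step):
    """Levels at odd multiples of step/2; zero is a decision boundary."""
    return (math.floor(v / step) + 0.5) * step


def mid_tread(v, step):
    """Levels at integer multiples of step; zero is itself a level."""
    return math.floor(v / step + 0.5) * step


idle_noise = [0.1, -0.1, 0.1, -0.1]    # hypothetical low-level idle channel noise
print([mid_riser(v, 1.0) for v in idle_noise])         # toggles between +0.5 and -0.5
print([mid_tread(v, 1.0) for v in idle_noise])         # decoded as a constant zero level
print([mid_tread(v + 0.45, 1.0) for v in idle_noise])  # with a d.c. bias the output toggles again
```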
A more efficient method of minimizing large variations in the percentage
quantization error over the signal range is to use nonlinear or nonuniform
quantization.
It is interesting to note that uniform
quantization intervals result in
nonuniform SQR over the signal range and nonuniform intervals result in
uniform SQR.
The effect of permitting larger
quantization intervals at higher signal
amplitudes is to compress the input signal to achieve a uniform quantization level.
The input signal is first compressed by
using a nonlinear functional device and then a linear quantizer is used. At the receiving end, the quantized signal is
expanded by a nonuniform device having an inverse characteristic of the
compression at the sending end.
The process of first compressing and then
expanding is referred to as companding.
A variety of nonlinear compression-expansion
functions can be chosen to implement a compandor. The obvious one is a logarithmic law.
Unfortunately, the function
y=lnx does not pass through the origin.
So, it is necessary to substitute
a linear portion to the curve for lower values of x.
Most practical companding
systems are based on a law
suggested by K.W. Cattermole.
These equations are collectively
known as the A-law, which is used in India and in European countries.
The U.S.A. and Japan follow a variation
of the A-law known as the µ-law.
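For the logarithmic section,
$y = \frac{1 + \ln(Ax)}{1 + \ln A}$ for $1/A \le x \le 1$
For the linear section,
$y = \frac{Ax}{1 + \ln A}$ for $0 \le x \le 1/A$
where A is the compression coefficient.
The expansion function for the logarithmic section is given by
$x = \frac{e^{\,y(1 + \ln A) - 1}}{A}$ for $\frac{1}{1 + \ln A} \le y \le 1$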
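The compression and expansion functions can be tried out directly. This is a minimal sketch of the continuous A-law, not of a practical codec: the value A=87.6 is an assumption (it is the value used in ITU-T G.711), the function names are hypothetical, and the expansion of the linear section is simply the inverse of the linear compression formula.

```python
import math

A = 87.6               # assumed compression coefficient
K = 1 + math.log(A)    # common denominator 1 + ln A


def compress(x):
    """A-law compression of a normalized sample, -1 <= x <= 1."""
    ax = abs(x)
    if ax < 1 / A:
        y = A * ax / K                    # linear section
    else:
        y = (1 + math.log(A * ax)) / K    # logarithmic section
    return math.copysign(y, x)


def expand(y):
    """Inverse of compress()."""
    ay = abs(y)
    if ay < 1 / K:
        x = ay * K / A                    # inverse of the linear section
    else:
        x = math.exp(ay * K - 1) / A      # inverse of the logarithmic section
    return math.copysign(x, y)


for x in (0.01, 0.1, 1.0):
    y = compress(x)
    print(x, round(y, 3), round(expand(y), 6))
```

Small inputs are boosted strongly before the linear quantizer, while inputs near full scale are hardly changed.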
In practice, a piecewise linear segment approximation is used.
A-law companding consists of eight
linear segments for each polarity.
The slope halves for each segment
except the lowest two segments which have the same slope.
The lowest two segments of positive
& negative polarities coalesce into one straight line segment.
As a result, there are 13 effective
segments in the curve and the law is sometimes referred to as
13-segment companding law.
In µ-law, the slope halves in the
lowest two segments also, giving rise to 15 effective segments.
Each segment is divided into 16
linear steps. Eight bits are required to represent each sample value: a 1-bit sign, a 3-bit segment number and a 4-bit linear step number.
There are in all 256 defined signal
levels.
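The three fields of the 8-bit sample word can be packed and unpacked with simple bit operations. The sketch below assumes a hypothetical layout with the sign in the most significant bit, followed by the segment number and then the step number; practical codecs may order or invert bits differently.

```python
def pack_sample(sign, segment, step):
    """Combine a 1-bit sign, 3-bit segment and 4-bit step into one byte."""
    if not (0 <= sign <= 1 and 0 <= segment <= 7 and 0 <= step <= 15):
        raise ValueError("field out of range")
    return (sign << 7) | (segment << 4) | step


def unpack_sample(word):
    """Split a byte back into (sign, segment, step)."""
    return (word >> 7) & 0x1, (word >> 4) & 0x7, word & 0xF


word = pack_sample(1, 3, 5)
print(format(word, "08b"), unpack_sample(word))
print(2 * 8 * 16)    # 2 polarities x 8 segments x 16 steps = number of defined levels
```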
Differential coding
PCM is not specifically designed for digitizing speech
waveforms.
Speech waveforms exhibit considerable redundancy which
can be usefully exploited in designing coding schemes.
The following characteristics of speech signals contribute to
the redundancy:
Nonuniform amplitude distributions
Sample-to-sample correlations
Periodicity or cycle-to-cycle correlations
Pitch interval-to-pitch interval correlations
Speech pauses or inactivity factors
A sizeable fraction of the human speech sounds is produced
by the flow of puffs of air from the lungs into the vocal tract. The interval between these puffs of air is known as the pitch interval. There may be as many as 20 to 40 pitch intervals in a single sound.
Delta or differential coding systems are designed to take
advantage of the sample-to-sample redundancies in speech waveforms.
Because of the strong correlation between adjacent speech
samples, large abrupt changes in levels do not occur frequently in speech waveforms.
In such situations, it is more efficient to transmit, or to encode and
transmit, only the signal changes instead of the absolute values of the samples.
Delta modulation (DM) is a scheme that transmits only the
signal changes and differential pulse code modulation (DPCM) encodes the differences and transmits them.
A delta modulator may be implemented by simply comparing
each new signal sample with the previous sample and transmitting the resulting difference signal.
At the receiver end, the difference signals are added up to
construct the absolute signal by using an integrator.
However, such a system, being open loop, suffers from the
possibility of the receiver output diverging from the transmitter input due to system errors or inaccuracies.
The system can be converted into a closed loop system by setting up a feedback path with an integrator at the transmitting end.
When the input is constant, the
output of the transmitter is an alternating positive and negative pulse train. This constitutes the quantization noise in delta
modulators and is also known as granular noise.
If the transmitter input signal
changes too rapidly, the receiver output is unable to keep up and this phenomenon is known as slope overload.
This problem may be overcome by
using a variable slope integrator whose output slope is increased or decreased, depending on the rate of change of the input signal.
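Granular noise and slope overload both show up in a very small simulation of a closed loop delta modulator. This is a sketch under stated assumptions: a fixed step size, one bit per sample, an integrator that starts at zero, and hypothetical function names `dm_encode` and `dm_decode`.

```python
def dm_encode(samples, step):
    """Closed loop delta modulator: compare the input with the fed-back estimate."""
    estimate = 0.0                      # output of the integrator in the feedback path
    bits = []
    for sample in samples:
        bit = 1 if sample >= estimate else 0
        estimate += step if bit else -step
        bits.append(bit)
    return bits


def dm_decode(bits, step):
    """Receiver: integrate the received positive and negative pulses."""
    estimate = 0.0
    output = []
    for bit in bits:
        estimate += step if bit else -step
        output.append(estimate)
    return output


constant_input = [0.0] * 6
print(dm_encode(constant_input, 0.1))   # alternating pulses: granular noise

step_input = [1.0] * 6                  # abrupt jump from 0 to 1
bits = dm_encode(step_input, 0.1)
print(bits, dm_decode(bits, 0.1)[-1])   # output climbs one step per sample: slope overload
```

A variable slope integrator would enlarge the step while a long run of identical bits persists, letting the estimate catch up faster.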
Vocoders
By considering some of the properties that are more or less
unique to speech, such as pitch interval and cycle
correlations, significant reductions can be achieved in bit rates.
Coding systems that are so specifically designed for voice
signal are known as voice coders or vocoders & operate typically at bit rates in the range 1.2-2.4 kbps.
Vocoders take into account the physiology of the vocal cords,
the larynx, the throat, the mouth, the nasal passages and the ear in their design.
The basic purpose of the vocoders is to encode only the
perceptually important aspects of speech and thereby reduce the bit rate significantly.
As a result, the reproduced voice is synthetic sounding and
unnatural with artificial quality.
Main applications include recorded message announcements,
encrypted voice transmission, voice mail etc.
Human speech is generated in two basic ways:
Voiced sounds, generated as a result of vibrations in the vocal cords.
Unvoiced sounds, formed by expelling air through the lips and teeth (as in the pronunciation of s, p, t and f).
Human speech can now be
modeled as a sequence of voiced and unvoiced sounds passed
through a filter which represents the effect of mouth, throat, etc. on the generated sounds.
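This source and filter model can be sketched very roughly as follows. Everything in the sketch is an illustrative assumption rather than a description of a real vocoder: voiced excitation is taken to be a pulse train with a pitch period of 80 samples, unvoiced excitation is uniform random noise, and the vocal tract is reduced to a single-pole recursive filter.

```python
import random


def excitation(voiced, length, pitch_period=80, seed=0):
    """Voiced: one pulse per pitch interval. Unvoiced: random noise."""
    if voiced:
        return [1.0 if i % pitch_period == 0 else 0.0 for i in range(length)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(length)]


def vocal_tract(signal, pole=0.9):
    """Stand-in for the vocal tract filter: y[k] = x[k] + pole * y[k-1]."""
    output = []
    previous = 0.0
    for value in signal:
        previous = value + pole * previous
        output.append(previous)
    return output


voiced_sound = vocal_tract(excitation(True, 240))
unvoiced_sound = vocal_tract(excitation(False, 240))
print(len(voiced_sound), len(unvoiced_sound))
```

A vocoder receiver switches between the two excitation sources and updates the filter as the transmitted parameters change.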
There are three basic types of vocoders:
Channel vocoders
Formant vocoders
Linear predictive coders
The speech spectrum exhibits sound-specific structures with
energy peaks at some frequencies and energy valleys at others over short periods. Channel vocoders attempt to determine these short-term signal spectra as a function of time and take advantage of them.
In addition, they also determine the nature of speech
excitations and the pitch intervals. The excitation
information is used at the receiver end to synthesize speech by switching in the appropriate signal source for the required duration. The filter at the receiver implements a vocal tract transfer function.
The three or four energy peaks in the short-term
spectral density of speech are known as formants. A
formant vocoder determines the location and amplitude
of these spectral peaks and transmits this information
instead of the entire spectrum envelope.
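The idea of sending only the peak locations and amplitudes can be illustrated with a simple peak picker. This is a simplified sketch: the function name `pick_formants` and the sample spectrum are hypothetical, and a practical formant vocoder would first estimate a smooth spectral envelope rather than search raw spectral values.

```python
def pick_formants(spectrum, count=3):
    """Return (index, amplitude) for the `count` strongest local maxima."""
    peaks = [
        (i, spectrum[i])
        for i in range(1, len(spectrum) - 1)
        if spectrum[i - 1] < spectrum[i] >= spectrum[i + 1]
    ]
    strongest = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:count]
    return sorted(strongest)            # report the peaks in frequency order


# Hypothetical short-term spectral magnitudes, one value per frequency bin.
spectrum = [0, 1, 5, 2, 1, 4, 1, 0, 3, 1, 0.5, 2, 0]
print(pick_formants(spectrum))
```

Only three pairs of numbers are kept here in place of the thirteen spectral values, which is the source of the bit rate saving.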