Advanced Low Bit-Rate Speech Coding Below 2.4 Kbps.


Emre Unver

Submitted for the Degree of

Doctor of Philosophy

from the

University of Surrey


Centre for Communication Systems Research

Faculty of Engineering and Physical Sciences

University of Surrey

Guildford, Surrey GU2 7XH, UK

February 2010


ProQuest U512734

Published by ProQuest LLC (2019). Copyright of the Dissertation is held by the Author. All Rights Reserved.

This work is protected against unauthorized copying under Title 17, United States Code Microform Edition © ProQuest LLC.

ProQuest LLC

789 East Eisenhower Parkway P.O. Box 1346


There has been a fast growth in the telecommunications industry in the past decades. With the increasing demand in the transmission of speech over bandwidth-limited media, such as mobile or satellite communication links, and storage of spoken information in bit-rate-limited media, such as silicon memory, efficient compression of speech has become an important issue. Although there are speech coding standards producing high quality speech above 4 kbps, there is still room for improvement at lower bit rates especially at 2.4 kbps and below. Especially for military wireless communications where some of the bandwidth is required for error correction, or for applications where speech is embedded into other speech or non-speech data, achieving good speech quality and intelligibility at very low bit-rates is important.

Parametric coders, such as sinusoidal coders, are used extensively at low bit-rates. In this work, relaxing the delay, memory and complexity constraints, strategies for lowering the bit-rates of sinusoidal coders while maintaining good speech quality are discussed. These strategies include the extension of the previous work in the literature on combining several frames within a metaframe and variable bit-allocation schemes as well as a new voicing estimation algorithm from the spectral envelope. Moreover, the use of phonemes in speech coding is investigated for further bit reductions. A method for producing highly intelligible speech with modest quality at a very low bit-rate is presented. Coding of any extra information in order to achieve high quality is also discussed.

These strategies have been implemented in the SB-LPC vocoder in order to perform parameter quantisation at several bit-rates. In listening tests, it has been found that the proposed techniques have been effective in lowering the bit-rate from 2.4 kbps to 1.2 kbps, from 1.2 kbps to 0.8 kbps, and from 4.0 kbps to 1.8 kbps while maintaining the speech quality. In addition to those, a coding scheme is also designed operating at 309 bps and producing speech whose intelligibility is similar to that of the MELP operating at 600 bps. Finally, discussions about the performance of the strategies proposed in this thesis as well as possibilities for improvement are given.

Keywords: Speech, Sinusoidal Coding, Low Bit-Rate, Metaframe, Mode-Based, Phoneme

Email: [email protected]

I would like to express my sincere gratitude to my PhD supervisors, Professor Ahmet Kondoz and Dr Stéphane Villette for their full support, encouragement and guidance during my PhD. It would have been impossible without their presence. I also would like to thank all my colleagues in the Multimedia Communications Research Group and friends for their advice and encouragement. Finally, I would like to say thanks to my parents for their continuous love, encouragement and support.

Contents

List of Figures
List of Tables
Glossary of Terms

1 Introduction
1.1 Background
1.2 Thesis Outline
1.3 Original Contributions

2 Review of Speech Coding
2.1 Introduction
2.2 Design Criteria
2.2.1 Bit-Rate and Quality
2.2.2 Delay
2.2.3 Implementation Complexity and Cost
2.2.4 Robustness to Input Signal Variations
2.2.5 Robustness to Acoustic Noise
2.2.6 Robustness to Channel Errors
2.3 General Speech Coding Paradigms
2.3.1 Waveform Coding
2.3.2 Parametric Coding
2.3.3 Hybrid Coding
2.4 Fundamental Techniques in Speech Coding

2.4.1.1 Scalar Quantisation
2.4.1.2 Vector Quantisation
2.4.2 LP Modelling of Speech
2.4.2.1 The Source-Filter Model
2.4.2.2 Linear Prediction
2.4.2.3 LSF Representation of the LP Coefficients
2.5 Applications of Speech Coding
2.5.1 Terrestrial Systems
2.5.2 Satellite Communications
2.6 Standardisation
2.7 Conclusion

3 Low Bit-Rate Speech Coding
3.1 Introduction
3.2 Vocoder Models
3.2.1 Channel Vocoder
3.2.2 Formant Vocoder
3.2.3 LPC Vocoder
3.2.4 Mixed Excitation Linear Prediction (MELP) Coder
3.2.5 Waveform Interpolation Coding
3.2.6 Sinusoidal Coding
3.2.6.1 Multi-Band Excitation Vocoder
3.2.6.2 Split-Band LPC Coder
3.2.6.3 Performance of Sinusoidal Coders
3.3 Efficient Parameter Quantization
3.3.1 Use of Metaframes
3.3.1.1 MELP Based Coding of Speech at 2.4/1.2/0.6 kbps
3.3.1.2 HSX Speech Coder Operating at 2.4/1.2 kbps
3.3.1.3 SB-LPC Vocoder Operating at 2.4/1.2 kbps
3.3.2 Variable Bit-Rate Coding
3.4 Use of Phonemes
3.4.1 Text-to-Speech Synthesis
3.4.2 Phoneme-Based Speech Coding

4 Joint Quantisation Strategies for Low Bit-Rate Sinusoidal Coding
4.1 Introduction
4.2 Use of Metaframes
4.2.1 Fixed-Rate Metaframes vs. Variable Bit-Rate Coding
4.2.2 Determination of an Optimum Metaframe Size
4.3 Classification of Metaframes
4.3.1 Introduction
4.3.2 Proposed Method of Classification
4.3.3 Performance and Discussions
4.4 An Alternative Approach to Voicing Estimation for VLBR Coding
4.4.1 Introduction
4.4.2 Voicing Status Estimation from the Spectral Envelope
4.4.2.1 Simple Codebook Search
4.4.2.2 Improvements for the Voicing Status Estimation
4.4.2.3 Voicing Strength Determination
4.4.2.4 Discussions
4.4.3 Application of the New Voicing Estimation Algorithm to Metaframe Classification
4.4.3.1 Estimation of the Voicing Statuses and the Metaframe Class
4.4.3.2 Estimation of the Voicing Strengths
4.4.3.3 Discussions
4.5 Conclusion

5 Mode Based SB-LPC Coding
5.1 Introduction
5.2 Quantisation of Pitch and Voicing
5.2.1 Pitch
5.2.2 Voicing Strength
5.3 Quantisation of Energy
5.4 Quantisation of Spectral Information

5.4.2 Spectral Amplitudes
5.5 Coding Schemes at 1200 and 800 bps
5.5.1 Coding Scheme at 1200 bps
5.5.2 Coding Scheme at 800 bps
5.6 Performance Evaluation
5.6.1 Preparation for the Subjective Evaluation
5.6.2 Results and Discussions
5.7 Conclusion

6 Phoneme-Based Scalable Speech Coding
6.1 Introduction
6.2 Motivation
6.3 Database Analysis and Observations
6.3.1 The TIMIT Speech Corpus
6.4 Observations
6.4.1 Phoneme Groups
6.4.1.1 Vowels
6.4.1.2 Semivowels and Glides
6.4.1.3 Nasals
6.4.1.4 Fricatives
6.4.1.5 Affricates
6.4.1.6 Closure Intervals and Stops
6.5 Coding for Intelligibility
6.5.1 Introduction
6.5.2 Encoding the Phoneme Index and Duration
6.5.3 Changing the Duration
6.5.4 Choosing the Template Phoneme
6.5.4.1 Encoding the Residual Excitation Parameters
6.5.4.2 Encoding the Energy Values
6.5.4.3 Closure Intervals
6.5.4.4 Spectral Amplitudes

6.6 Coding for High Quality
6.6.1 Introduction
6.6.2 Encoding of the Phoneme Index, Duration and Template Index
6.6.3 Quantisation of LSF Parameters
6.6.4 Quantisation of Pitch and Voicing
6.6.5 Quantisation of Energy Residual
6.6.6 Quantisation of Spectral Amplitudes
6.6.7 Bit Allocation
6.6.8 Performance and Discussions
6.7 Conclusion

7 Conclusions
7.1 Preamble
7.2 Concluding Overview
7.3 Future Work

A List of Phone-Codes Used In TIMIT Database

List of Figures

2.1 Preferred operation region for speech coding techniques
2.2 An example of the partitioning of a two dimensional space
2.3 An example of speech spectrum (solid) and the corresponding formant structure (dotted)
2.4 The source-filter speech production model
4.1 Average SD of joint LSF quantisation for different metaframe sizes
4.2 The allowed metaframe combinations
4.3 An example of altering the voicing combination of a metaframe
4.4 Examples of (a-b) unvoiced and (c-d) voiced spectra
4.5 Block diagram of the proposed voicing cut-off estimation method
4.6 Normalized spectral envelope shapes with 4 codebook entries representing (a-b) unvoiced and (c-d) voiced spectra
4.7 Examples of normalized spectral envelope shapes with 16 codebook entries representing (a) unvoiced and (b) voiced spectra
4.8 Energy histograms of the voiced frames classified as unvoiced initially and then (a) corrected or (b) left as unvoiced
4.9 Example spectral envelope shapes plotted using 64-point FFT along with their cut-off frequencies
4.10 Calculating the height and width of a peak
4.11 Examples of spectra which contain high energy in middle and higher frequencies
4.12 Example of a double peak
4.13 Example results of the estimated cut-off frequencies (vertical dotted) with the original values (vertical solid) and the respective spectral envelope shapes
4.14 Distribution of the percentage spectral energy of the bands affected from the voicing status and strength estimation

4.15 Distribution of the ratio of the erroneous spectral energy to the total metaframe spectral energy as a result of the (a) initial (b) improved metaframe voicing strength estimation algorithms
4.16 An example of the change in the voicing strengths of neighbouring frames
5.1 Change of the voicing strength along speech frames
5.2 Examples of energy shapes for different voicing combinations. Each line represents the change in the energy shape vector of the metaframe for the particular combination. A distinctive trend can be observed depending on the metaframe class based on voicing statuses
5.3 Average logarithmic quantisation error in the energy shapes obtained using a general codebook and dedicated codebooks for each combination
5.4 Block diagram of the proposed coding schemes at 1200 bps
5.5 Block diagram of the proposed coding schemes at 800 bps
5.6 Listening test results
6.1 The time domain waveform of the words “she had” consisting of phonemes “h#/sh/iy/hv/ae/dcl/jh”
6.2 Extracted voicing and energy parameters of an “ae” phoneme
6.3 The histogram of the length of “ae” phonemes in samples
6.4 Extracted voicing and energy parameters of an “hv” phoneme
6.5 The histogram of the length of “hv” phonemes in samples
6.6 The time domain waveform of the word “romantic” consisting of phonemes “r/ow/m/ae/n/tcl/t/ix/kcl/k”
6.7 Extracted voicing and energy parameters of an “m” phoneme
6.8 The histogram of the length of “m” phonemes in samples
6.9 Extracted voicing and energy parameters of a “z” phoneme
6.10 Extracted energy of an “sh” phoneme
6.11 The histogram of the length of “sh” phonemes in samples
6.12 Extracted voicing and energy parameters of a “jh” phoneme
6.13 Extracted energy parameters of a “ch” phoneme
6.14 The histogram of the length of “ch” phonemes in samples
6.15 Extracted voicing and energy parameters of a “tcl” phoneme followed by the “t” phoneme
6.16 The histogram of the lengths of (a) “tcl” phonemes, (b) “t” phonemes

6.17 Histograms of the phoneme durations in terms of number of frames of 10 ms for the phonemes (a) aa (b) m (c) r (d) s
6.18 The comparison of the original (dotted) and template (solid) voicing values over a number of frames
6.19 The comparison of the original energy values (dotted) and those of the chosen templates (solid) in phonemes “k/ey/m/ax/pcl”
6.20 The comparison of the original (dotted) and average-adjusted (solid)

List of Tables

2.1 The MOS speech quality scale
2.2 Memory and complexity requirements for the three MELP-based standard speech coder bit-rates
2.3 A comparison of some telephone band speech coding standards
3.1 The bit allocation table for the MELP at 2.4 kbps
3.2 The bit allocation table for the MELP at 1.2 kbps
3.3 The bit allocation table for the MELP at 600 bps
3.4 The bit allocation table for the HSX Coder at 2.4 kbps
3.5 The bit allocation table for the HSX Coder at 1.2 kbps
3.6 The bit allocation table for the SB-LPC at 2.4 and 1.2 kbps
4.1 Number and percentage of metaframes according to their voicing statuses
4.2 Bit-wise representation of the metaframe classes
4.3 Initial and final percentages of the metaframe combinations as a result of classification
4.4 The average values for the energy of an altered frame and their representation as a percentage of the respective total metaframe energy
4.5 The average values of the affected spectral energies and their representation as a percentage of the total metaframe spectral energy
4.6 The average error as a result of metaframe classification
4.7 Estimation accuracy of voiced and unvoiced frames with different codebook sizes
4.8 Initial and improved performance of voiced/unvoiced estimation
4.9 Accuracy of metaframe classes with voicing estimation from the spectral shape
4.10 Accuracy of metaframe class estimation from the spectral shape with different number of codebooks for each combination

4.11 Initial and final results for the estimation of the metaframe class
5.1 Performance of the LSF quantisation method at different bit rates
5.2 Bit allocation according to the voicing combinations for the 1.2 kbps version
5.3 Bit allocation according to the voicing combinations for the 800 bps version
5.4 The scoring table for the comparison of the first sentence to the second one
5.5 Listening test results
5.6 Listening test results
5.7 Total processing times for a speech segment of 400 sec using the CuT and the reference codecs at different bit-rates
6.1 The representation of the pitch average of the first voiced phoneme
6.2 The Allocation Bits for Each Phoneme
6.3 DRT scores comparing the performance of the CuT to MELP-based standard coders at various bit-rates. Different figures for the official scores are obtained from different papers
6.4 Bit allocation according to the phoneme groups
6.5 Subjective comparison test results showing preference ratio of each codec

Glossary of Terms

ACELP - Algebraic Code Excited Linear Prediction
AMR - Adaptive Multi-Rate
APC - Adaptive Predictive Coding
BER - Bit Error Rates
BSS - Broadcast Satellite Service
CCSR - the Centre for Communication Systems Research
CELP - Code-Excited Linear Predictive
CS-ACELP - Conjugate-Structure Algebraic Code Excited Linear Prediction
CuT - Codec under Test
DRT - Diagnostic Rhyme Test
DSP - Digital Signal Processing
DTMF - Dual Tone Multi-Frequency
EFR - Enhanced Full-Rate
ETSI - European Telecommunications Standards Institute
FEC - Forward Error Correction
FSS - Fixed Satellite Service
GSM-AMR - Global System for Mobile Communications - Adaptive Multi-Rate
HMM - Hidden Markov Model
HR-GSM - Half-Rate GSM
HSX - Harmonic & Stochastic Excitation
IMBE - Improved Multi-Band Excitation
ITU - International Telecommunication Union
LAR - Log Area Ratios
LBG - Linde Buzo Gray
LD-CELP - Low Delay Code Excited Linear Prediction
LP - Linear Prediction
LPC - Linear Predictive Coding
LSF - Line Spectral Frequency
LSP - Line Spectral Pair
MA - Moving Average
MBE - Multi-Band Excitation
MELP - Mixed-Excitation Linear Predictive
MIPS - Million Instructions Per Second
MIRS - Modified Intermediate Response System
MSE - Mean-Squared Error

MSVQ - Multi-Stage Vector Quantisation
MOS - Mean Opinion Score
PCM - Pulse Code Modulation
PSTN - Public Switched Telephone Network
RELP - Residual Excited Linear Prediction
REW - Rapidly Evolving Waveform
RMS - Root Mean Square
SB-LPC - Split-Band Linear Predictive Coding
SD - Spectral Distortion
SEW - Slowly Evolving Waveform
SNR - Signal-to-Noise Ratio
SVQ - Split-Vector Quantisation
STP - Short Term Prediction
VAD - Voice Activity Detection
VXC - Vector Excitation Coder
VLSI - Very Large Scale Integration
VQ - Vector Quantization
WI - Waveform Interpolation
WMSE - Weighted Mean Square Error

Introduction

1.1 Background

The most important and the most natural form of communication between human beings is speech. Since the invention of the telephone by Alexander Graham Bell in 1876, it has been the most widespread and primary means of communication worldwide. The main problem with the transmission of speech signals over long distances was the fact that speech signals are analogue in nature and the accumulation of noise over the transmission channel corrupted the signal, making it unusable and thus limiting the transmission distance and overall quality. This has been overcome by the transition to digital communication systems, where speech is sampled and quantised into a bit-stream. Digital systems have the advantage of regeneration and flexible manipulation, such as multiplexing, encryption, forward error correction and storage. The bandwidth of speech signals has traditionally been limited to 300-3400 Hz for transmission over the Public Switched Telephone Network (PSTN). As a result, during digitization, speech signals have to be sampled at 8 kHz in order to satisfy the Nyquist rate. Using 8 bits per sample, this led to the logarithmic Pulse Code Modulation (PCM) system operating at 64 kbps. This is, however, much greater than the bandwidth of the analogue speech signal, and therefore is limited to broadband channels. Later, using adaptive quantisation techniques, the bit-rate has been lowered to 32 kbps in Adaptive Differential Pulse Code Modulation (ADPCM). This bit-rate is acceptable only on trunk telephone links.
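As a rough illustration of the arithmetic above, the bit rate of a coder that spends a fixed number of bits on every sample is simply the sampling rate multiplied by the bits per sample. The sketch below is not taken from the thesis: the function names are hypothetical, and the figure of 4 bits per sample for ADPCM is an assumption chosen to match the 32 kbps quoted in the text.

```python
def nyquist_rate_hz(max_frequency_hz: float) -> float:
    """Minimum sampling rate needed to represent a signal band-limited to max_frequency_hz."""
    return 2.0 * max_frequency_hz


def bit_rate_bps(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate of a coder that spends a fixed number of bits on every sample."""
    if sample_rate_hz <= 0 or bits_per_sample <= 0:
        raise ValueError("sample rate and bits per sample must be positive")
    return sample_rate_hz * bits_per_sample


# Telephone-band speech extends to 3400 Hz, so 8 kHz sampling exceeds the Nyquist rate.
assert 8000 >= nyquist_rate_hz(3400)

pcm_bps = bit_rate_bps(8000, 8)    # logarithmic PCM, 8 bits per sample
adpcm_bps = bit_rate_bps(8000, 4)  # ADPCM, assuming 4 bits per sample
print(f"PCM: {pcm_bps / 1000:.0f} kbps, ADPCM: {adpcm_bps / 1000:.0f} kbps")
```

The same relation does not apply directly to the frame-based parametric coders discussed later, where the bit rate is set by the number of bits per frame rather than per sample.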

In the last few years there has been a rapid increase in the use of mobile communication networks. These include satellite systems and cellular telephony where the bandwidth is limited. Especially with the increase in the number of users, there has been a great demand for a more efficient use of the available bandwidth. With the advances in electronic hardware technology, research has focused on the compression of speech signals at very low bit-rates for bandwidth efficiency.

One of the main challenges of speech coding has been maintaining the output speech quality while reducing the bit rate. Several speech coding algorithms have been developed operating at various bit rates and offering various qualities. These range from very low bit-rate coders operating below 1 kbps with synthetic speech quality to medium-to-high bit-rate coders operating at 8 kbps or higher and producing speech that sounds the same as the original. With the growing mobile telephone industry, there is a demand for even lower bit rates with high speech quality.

The work presented in this thesis focuses on efficient quantisation strategies for reducing the bit-rates of speech coding systems for use in a variety of applications. These applications include mobile telephony, mobile military communications, secure communication applications, digital watermarking, storage and internet telephony.

1.2 Thesis Outline

The research work described in this thesis focuses mainly on new quantisation strategies or the extension or combination of the existing ones. The strategies that have been developed have then been implemented using the Split-Band Linear Predictive Coding (SB-LPC) vocoder for testing purposes. Using these strategies has resulted in significant reductions in bit-rates while the output speech quality is maintained.

Chapter 2 gives a brief overview of speech coding as well as some fundamental principles. The main criteria for the design of speech coding algorithms, such as bit-rate, quality and delay, are discussed. The three main speech coding paradigms are also presented along with brief discussions. Moreover, the main applications for speech coders are given with examples and standard speech coders are mentioned. Quantisation of parameters is a key concept in low bit-rate speech coding. After the introduction of basic quantisation techniques, the speech production model is presented. This is a powerful model used by many speech coders. The modulation part of this model is modelled using a Linear Predictive (LP) filter which effectively removes the short term correlations existent between speech samples. The representation of the LP filter coefficients by Line Spectral Frequencies (LSF) is also discussed.

Chapter 3 covers the low bit-rate speech coding algorithms and bit-rate reduction strategies. Beginning with early parametric speech coders, the main low bit-rate speech coders are briefly mentioned. Sinusoidal coding is an efficient coding scheme for low bit-rate applications, and a short description of sinusoidal coding is given with examples of sinusoidal coders. After the discussion of the advantages and disadvantages of sinusoidal coders, existing strategies for efficient parameter quantisation are presented. The use of metaframes for joint-quantisation and correlation exploitation between successive frames, variable bit-rate coding schemes using mode-based coding and bit-allocation, and finally phoneme-based speech coding are presented along with examples, as they form the basis for the work presented in this thesis.

Chapter 4 investigates the extension and combination of metaframe-based and mode-based coding techniques. An optimum metaframe size is given, and an effective metaframe classification technique is presented. Moreover, an alternative method for voicing status and strength determination for very low bit-rate coders is introduced. Finally, the application of this method to metaframe and mode-based quantisation schemes is discussed.

Chapter 5 presents the quantisation of parameters based on the strategies described in Chapter 4. The advantages of metaframe and mode-based quantisation are further illustrated here. Two coding schemes based on the Split-Band LPC coder are developed, operating at 1.2 kbps and 800 bps. The performance of these coding schemes is assessed by comparing them to the original SB-LPC operating at 2.4 kbps and 1.2 kbps respectively as well as to a standard coder, the Mixed-Excitation Linear Predictive (MELP) coder, at various bit rates.

Chapter 6 investigates the use of phonemes as templates in low bit-rate speech coding. For speaker independent operation, a large database is used. The criteria for choosing the phoneme template are discussed. By using the parameters from a carefully chosen template phoneme, and a little side information, an intelligibility-oriented speech coder is produced. Moreover, a high quality coding scheme is also presented where residual information on top of the phoneme template is encoded. The performance of both coding schemes is discussed with possible extensions.

Chapter 7 gives a summary of the discussions in the previous chapters. The most significant achievements of this project are highlighted, and possible areas for future research are suggested.

1.3 Original Contributions

The original contributions included in this thesis can be summarized as follows:

• Investigation of the effect of metaframe size on quantisation efficiency and finding an optimum metaframe size.

• Combination and extension of metaframe and mode-based coding schemes featuring an effective metaframe classification method.

• A novel voicing estimation algorithm for estimating the voicing status and strength from the spectral shape which enables the voicing information recovery at the decoder with no extra bits.

• Extension of the voicing estimation algorithm to metaframes for the estimation of metaframe class and voicing strengths of the frames within the metaframe at the decoder without the transmission of any bits.

• Designing mode-based joint-quantisation schemes for the parameters.

• Development of two very low bit-rate quantisation schemes reducing the bit-rate significantly while maintaining the quality.

• Investigation of the use of phoneme templates from a database for use in low bit-rate speech coding.

• Developing criteria for choosing a suitable phoneme template and duration modification.

• Development of an intelligibility-oriented very low bit-rate codec as well as a quality-oriented higher bit-rate codec.

Review of Speech Coding

2.1 Introduction

Digital signals enjoy many benefits over analogue signals, such as ease of regeneration, security and flexibility. Therefore representing the speech signals in digital format is very advantageous. In Digital Speech Coding, the digital speech signals are processed using sophisticated signal processing techniques to achieve efficient compression in order to be used for transmission or storage.

The invention of Pulse Code Modulation (PCM) in 1938 was the first example of digital speech communication systems. PCM became very popular later with the availability of the necessary hardware and was applied to private and public switched telephone networks. Today almost all of the Public Switched Telephone Network (PSTN) is based upon PCM and its spin-off technologies.

The use of PCM, however, requires more bandwidth than the original analogue signal does. This poses a problem especially in communication links where the bandwidth is limited, such as satellite or cellular mobile radio systems. As the user demand on such systems has grown, there has been extensive research interest in developing signal processing algorithms aiming at efficient compression of the source speech data. Especially with the advancements in Very Large Scale Integration (VLSI) technologies, new Digital Signal Processing (DSP) hardware has been produced, fuelling rapid developments in the speech compression area, which then has allowed the widespread acceptance and use of these technologies by the end user.

There are different approaches to efficiently code the speech signals. However, for improved efficiency, and simultaneously high quality, some kind of a parametric model mimicking the human speech production mechanism is required instead of sample-by-sample coding of the waveform. This parametric model makes use of the repetitions and correlations between the consecutive speech samples. The short term correlations between speech samples can be removed by using Linear Predictive Coding (LPC), which is a powerful tool and used extensively in speech coding. The parameters obtained as a result have to be quantised to be either stored or transmitted. It is important to perform the parameter quantisation with minimum distortion. Otherwise, quantisation may cause degradations in the output speech quality. For example, for transparent quantisation of the Line Spectral Frequencies (LSF), i.e. quantisation with no audible distortion, the following criteria need to be met [1]:

• The average Spectral Distortion (SD) should be less than 1 dB.

• The percentage of the outliers at 2 dB should be less than 1%.

• There should be no outliers at 4 dB.
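The three criteria above can be checked mechanically once a per-frame SD value is available for a test set. The following is a minimal sketch under stated assumptions, not the evaluation procedure used in the thesis: the function name and parameter names are hypothetical, the per-frame SD values are taken as given, and "outliers at 2 dB" is interpreted here as frames whose SD exceeds 2 dB.

```python
from statistics import mean


def is_transparent(sd_db, avg_limit_db=1.0, outlier_db=2.0,
                   max_outlier_fraction=0.01, hard_limit_db=4.0):
    """Check per-frame spectral distortion values (in dB) against the three criteria."""
    if not sd_db:
        raise ValueError("need at least one frame")
    # Criterion 1: average SD below 1 dB.
    average_ok = mean(sd_db) < avg_limit_db
    # Criterion 2: fewer than 1% of frames above 2 dB (assumed interpretation).
    outlier_fraction = sum(sd > outlier_db for sd in sd_db) / len(sd_db)
    outliers_ok = outlier_fraction < max_outlier_fraction
    # Criterion 3: no frame above 4 dB.
    no_large_outliers = all(sd <= hard_limit_db for sd in sd_db)
    return average_ok and outliers_ok and no_large_outliers


# Illustrative values only: 199 frames at 0.8 dB and one frame at 2.5 dB.
print(is_transparent([0.8] * 199 + [2.5]))
```

Keeping the thresholds as parameters makes it easy to apply a different convention for the outlier bands if the cited reference defines them differently.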

2.2 Design Criteria

There are several criteria which need to be considered when designing a speech coding algorithm. These criteria are often conflicting, and improving the algorithm with respect to one criterion may result in degradation with respect to another. Therefore, a balance must be sought during the design process with an optimal trade-off between the criteria depending on the needs of the application [2].

2.2.1 B it-R a te and Q u ality

B it R ate and Speech Q uality are probably th e m ost im p o rtan t design criteria, which are usually conflicting, since a drop in b it-rate is usually accom panied by a d egradation

(25)

in quality. Generally, there is an optim um operating range in term s of th e b it-rate for each coding algorithm ; exceeding th e u pper lim it brings little or no benefit, while operating below the lower lim it causes severe degradation. For example, there are waveform coders such as G.721 A D PCM operating at 32 kbps [10] and producing near-transparent quality, hybrid coders such as G.729 Algebraic Code Excited Linear

Predictive (ACELP) [16] operating a t 8 kbps and producing toll speech quality, and

param etric coders such as M ixed-Excitation Linear Predictive (M ELP) [11] operating

at 2.4 kbps and producing com m unication speech quality.

It is very difficult to find an objective measure of speech quality. Especially at low bit-rates, where matching the input and synthetic speech waveforms is usually not possible, human subjects are usually used for determining the suitability of a speech coder for a specific application. Often different assessment techniques are used, each measuring a different aspect of quality. One of these techniques is the Diagnostic Rhyme Test (DRT) [3], which measures intelligibility, a very important attribute in low bit-rate coding. For subjective evaluation of the output quality of speech coders at all rates, the Mean Opinion Score (MOS) [4] can be used. The MOS test is a widely used procedure where test subjects are required to score individual coded speech samples on a scale of 1 to 5, as shown in Table 2.1. The average score is used as the final MOS score for a system.

Grade   Subjective Opinion                          Quality

5       Excellent   Imperceptible                   Transparent
4       Good        Perceptible, but not annoying   Toll
3       Fair        Slightly annoying               Communication
2       Poor        Annoying                        Synthetic
1       Bad         Very annoying                   Bad

Table 2.1: The MOS speech quality scale

Informal listening tests can be organised following the procedures listed in the standards in order to assess the coder performance. Formal assessment of speech coders,


2.2.2 Delay

Delay is an important criterion when designing real-time speech communication applications. Most modern speech coders operate on blocks of speech data called speech frames. This introduces some delay, since at least one speech frame is buffered for analysis. Using more frames can lead to more efficient coding by removing the redundancies between them. In some speech coding algorithms, future values of speech are used for redundancy removal, and as a result look-ahead delay occurs. In addition to the buffering delay, the processing time required by the encoder and the decoder also adds delay to the system. Delays may also be introduced in the transmission channel.

When the end-to-end delay of the system becomes too large, typically over 250 ms, two-way conversation may become uncomfortable. Another problem with delay occurs in the case of an impedance mismatch between the switching equipment and telephone hybrid circuits. This results in signal reflections which, together with the delay of the system, cause an annoying echo effect. In such cases, sophisticated echo cancellation hardware must be used to control this effect. Delay over mobile and satellite communication systems is very large and therefore there is a requirement for echo cancellation. On the other hand, in land-based systems delay is usually very low. However, when a speech coding algorithm is introduced into such a system, the delay of the system increases significantly, which requires the use of echo cancellation. For example, in the United Kingdom the maximum delay allowed on the PSTN is 5 ms [6]. Since many speech coders typically operate on frames of 20 ms, the total delay, including the processing, buffering and look-ahead delays, can exceed 50 ms. In order to limit the complexity of the echo cancellation systems, it is important to keep the delay to a minimum.

2.2.3 Implementation Complexity and Cost

The complexity of an algorithm determines whether the algorithm can practically be implemented or not. With the recent advances in DSP technology, it has become possible to implement more complex algorithms. However, cost and power consumption are still important issues for mass market applications, especially where speech coding is extensively used, such as mobile telephony.

Similarly, the amount of memory required for an algorithm is also an issue when it comes to implementation. For buffering and internal processing, fast memory is needed, which is usually expensive and can become a problem with mass market applications.

In summary, sometimes the complexity and implementation costs become the major criteria rather than the quality for speech coders. Processing speed depends on the CPU architecture. As a rough guide, the complexity of a speech coder is generally given in terms of Million Instructions Per Second (MIPS). Typical complexity figures for existing speech codecs are in the range 20-50 MIPS [1]. The memory requirements, mostly due to the quantisation tables to be held in Read Only Memory (ROM), are measured in words. As an example, memory and complexity requirements for the NATO standard coders based on MELP at various bit-rates are given in Table 2.2 [46].

Requirements             MELP 2.4 kbps   MELP 1.2 kbps   MELP 0.6 kbps

Program Memory (kWord)   20              30              30
Data Memory (kWord)      20              53              60
Complexity (MIPS)        54              66              70

Table 2.2. Memory and complexity requirements for the three MELP-based standard speech coder bit-rates

2.2.4 Robustness to Input Signal Variations

Sometimes there are non-speech signals on speech communication channels. For example, PSTN applications require the ability to carry non-human signals, such as modem tones or the signalling tones used in Dual Tone Multi-Frequency (DTMF). Moreover, the input signal levels or speech characteristics can vary. Speech coders are often required to cope with these variations as well as background noise.


2.2.5 Robustness to Acoustic Noise

Speech coders usually use a specific speech production model and expect the input to be compatible with that model. In many cases, however, speech is contaminated with acoustic background noise with characteristics very different from speech, and this may result in poor quality. Especially at low bit-rates, where accurate parameter estimation becomes very important, speech coders are generally more sensitive to acoustic noise.

A popular method for solving this problem is to use noise reduction techniques which make use of the different statistical properties of the speech signal and the acoustic noise to differentiate between noise and speech. The aim is to reduce the amount of noise present in the speech before passing it on to the encoding stage [7]. Significant improvements can be obtained in low bit-rate coders operating under heavy background noise.

2.2.6 Robustness to Channel Errors

In most cases, the bitstream encoded by a speech coder is transmitted over a communication channel. In case of an error in the channel, depending on the bits affected, severe degradations might occur, such as annoying blasts in the output due to unstable filter coefficients.

At low Bit Error Rates (BER), such as those present in the PSTN, designing inherently robust algorithms may solve the problem. However, with higher BER, such as 1% to 5% in mobile and satellite communication channels, it is necessary to include Forward Error Correction (FEC) techniques. This is achieved by introducing high degrees of redundancy into the bitstream. There are also frame substitution and muting strategies to be used when the conditions become worse than the channel coding can handle.
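The idea of trading redundancy for error protection can be illustrated with the simplest possible code. The sketch below uses a rate-1/3 repetition code with majority-vote decoding; this is chosen only for illustration and is not the channel coding used by any of the standards discussed in this chapter, which rely on much more efficient block or convolutional codes. The function names are hypothetical.

```python
def encode_repetition(bits, n=3):
    """Repeat every bit n times (n odd), tripling the bit-rate for n = 3."""
    return [b for b in bits for _ in range(n)]


def decode_repetition(coded, n=3):
    """Majority vote over each group of n received bits."""
    return [
        1 if 2 * sum(coded[i:i + n]) > n else 0
        for i in range(0, len(coded), n)
    ]


if __name__ == "__main__":
    payload = [1, 0, 1, 1]
    received = encode_repetition(payload)
    received[1] ^= 1          # one channel error in the first group
    print(decode_repetition(received) == payload)
```

With n = 3, any single bit error within a group is corrected, at the cost of three channel bits per information bit.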


2.3 General Speech Coding Paradigms

Speech coders are generally classified into three categories depending on how they exploit the speech signal:

• Waveform Coding

• Parametric Coding

• Hybrid Coding

As illustrated in Figure 2.1, each technique has a preferred operation region.

2.3.1 Waveform Coding

In Waveform Coding, the algorithm tries to match the original speech waveform on a sample-by-sample basis. This allows the quality to be measured using Signal-to-Noise Ratio (SNR). Waveform coders provide high quality; however, their bit-rates are quite high as well. As an example, toll quality speech requires sampling the speech at 8 kHz with 13-bit accuracy, resulting in bit-rates around 100 kbps. The number of bits allocated for each sample can be reduced using logarithmic companding techniques, such as A-law or μ-law [8]. This has led to the widely used ITU G.711 64 kbps PCM standard [9].

[Figure 2.1: speech quality (from Poor upwards) against coding bit rate in kb/s, showing the preferred operating regions of the different coder types, including hybrid and parametric coders]
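The effect of logarithmic companding can be sketched with the continuous μ-law characteristic, F(x) = sgn(x) ln(1 + μ|x|) / ln(1 + μ). The code below is an illustration under stated assumptions: μ = 255, input normalised to [-1, 1], and a plain uniform 8-bit quantiser applied after compression. G.711 itself uses a piecewise-linear approximation of this curve, so this is not a bit-exact implementation of the standard, and all function names are hypothetical.

```python
import math

MU = 255.0  # assumed companding constant


def mulaw_compress(x, mu=MU):
    """Map x in [-1, 1] to y in [-1, 1] with logarithmic compression."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantise_8bit(y):
    """Uniform 8-bit index for y in [-1, 1]."""
    return min(255, int((y + 1.0) / 2.0 * 256))


def dequantise_8bit(index):
    """Mid-point reconstruction of a uniform 8-bit index."""
    return (index + 0.5) / 256 * 2.0 - 1.0


if __name__ == "__main__":
    x = 0.01  # a low-level sample
    companded = mulaw_expand(dequantise_8bit(quantise_8bit(mulaw_compress(x))))
    linear = dequantise_8bit(quantise_8bit(x))
    print(abs(companded - x), abs(linear - x))
    print(8000 * 8)  # samples per second times bits per sample
```

For low-level samples the companded path gives a smaller error than uniform 8-bit quantisation, which is why 8 bits per sample at 8 kHz (64 kbps) can stand in for a longer linear word.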

Further reduction in bit rate can be achieved by exploiting the high correlation between consecutive speech samples. This has led to the ITU G.721 32 kbps Adaptive Differential Pulse Code Modulation (ADPCM) standard [10].

With decreasing bit-rate, the quality of waveform coders decreases rapidly.

2.3.2 Parametric Coding

In Parametric Coding, which will be explained in more detail in Section 3.2, instead of matching the original speech waveform, the speech signal is characterized by a number of parameters which are then quantised and sent to the decoder. At the decoder, speech which is ideally perceptually identical to the input speech is synthesized using the transmitted parameters.

Very low bit rates can be achieved by exploiting the fundamental properties of speech. The quality is, however, limited by the speech production model and by the accuracy of parameter estimation and quantisation. With recent developments, good quality speech can be produced at around 2.4 kbps [11].

Low bit-rate coders require a speech production model and expect human speech, resulting in loss of quality with non-human speech or acoustic background noise. With parametric coders, low quality speech can be obtained at rates as low as 800 bps, while natural sounding speech is possible at 4.8 kbps.

2.3.3 Hybrid Coding

Hybrid coders combine the benefits of both parametric and waveform coding techniques. Similar to parametric coders, they employ a speech production model in order to exploit the correlations between neighbouring speech samples. After that, waveform coding of the residual signal is performed. Therefore the final transmitted signal includes the predictor coefficients for redundancy removal and the waveform-coded residual. At medium bit rates, hybrid coders are the most popular coding schemes, providing near toll quality.

The Analysis-and-Synthesis (AaS) models used in early schemes find the speech production model parameters and then inverse filter the input speech using these parameters. Long-term correlations are then removed from the original signal or the residual using long-term prediction. At the decoder, the quantised residual is combined with the long-term correlations and used to excite the speech production model. Notable examples include the Adaptive Predictive Coder (APC) [12] and the Residual Excited Linear Prediction (RELP) coder [13]. The main differences between these two lie in the technique used for the quantisation of the residual signal. APC operates at 16 kbps producing high quality speech, while RELP offers good speech quality at 9.6 kbps.

With the advances in DSP technology, a closed-loop approach for finding the optimum excitation was proposed, aiming at lower bit-rates and better quality. In the Analysis-by-Synthesis (AbS) method, the error between the synthetic and original speech samples is minimized. The most notable example of AbS coders is Code Excited Linear Prediction (CELP) [14], where the coarse and fine spectral information is represented using time-varying linear filters. The optimal excitation sequence is chosen as the codebook entry which produces the minimum error upon synthesis. In its original form, the CELP algorithm was computationally very intensive. Feasible versions were developed later, such as VSELP [15] and ACELP [16], utilizing efficient codebook structures.
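The closed-loop selection described above can be sketched as an exhaustive codebook search. This is a deliberately simplified illustration, not the CELP algorithm of [14]: it assumes a short-term all-pole synthesis filter only (no long-term predictor, no perceptual weighting, zero filter memory), a per-entry optimal gain, and a plain squared-error criterion. The names `synthesise` and `abs_search` are hypothetical.

```python
def synthesise(excitation, lp_coeffs):
    """All-pole synthesis: s(n) = x(n) + sum_j a_j * s(n - j)."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for j, a in enumerate(lp_coeffs, start=1):
            if n - j >= 0:
                acc += a * out[n - j]
        out.append(acc)
    return out


def abs_search(target, codebook, lp_coeffs):
    """Return (index, gain, error) of the entry whose synthesis best matches target."""
    best = (None, 0.0, float("inf"))
    for index, code in enumerate(codebook):
        synth = synthesise(code, lp_coeffs)
        energy = sum(v * v for v in synth)
        if energy == 0.0:
            continue  # an all-zero entry cannot be scaled to match anything
        # Gain minimising the squared error for this entry.
        gain = sum(t * v for t, v in zip(target, synth)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, synth))
        if error < best[2]:
            best = (index, gain, error)
    return best


if __name__ == "__main__":
    codebook = [[1, 0, 0, 0], [0, 1, 0, -1], [1, 1, 1, 1]]  # toy entries
    lp = [0.9]                                              # toy first-order filter
    target = [2.0 * v for v in synthesise(codebook[1], lp)]
    print(abs_search(target, codebook, lp))
```

Every candidate is synthesised before it is compared with the target, which is what makes the search "analysis-by-synthesis" and also what makes the unstructured version expensive.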

2.4 Fundamental Techniques in Speech Coding

In the previous section, different approaches to coding speech signals efficiently were briefly presented, and it was stated that parametric coders perform better at low and very low bit-rates. Parametric coders use a model for representing the human speech production system, and try to estimate several parameters for this model. One such popular model is the source-filter model. In this model, the vocal tract is represented by an all-pole Linear Predictive (LP) filter.

After the model parameters have been obtained, they have to be quantised for either storage or transmission.

2.4.1 Quantisation

Most speech coders, especially the ones operating at medium or low bit-rates, employ a speech production model for exploiting the redundancies in the speech signal. Upon estimation, the parameters of this model have to be encoded into a bitstream for either transmission to the decoding side or digital storage. Quantisation is the process in which the parameters of the speech production model are converted into a suitable form for transmission or storage. During quantisation, a continuous or a discrete value signal with an infinite range is mapped to a set of levels with a finite range. The difference between the initial value and the mapped value is known as the quantisation error or noise. The main goal of a quantisation scheme is to represent the parameters in such a way that the quantisation noise is imperceptible to human listeners.

Parameter quantisation can be performed using scalar or vector quantisation.

2.4.1.1 Scalar Quantisation

In scalar quantisation, a single continuous value is mapped to the nearest level from a finite set of possible levels. If $I$ is the number of possible levels, and $D$ is the finite range, then the number of bits for representing the selected level is given as

$$B = \log_2(I) \qquad (2.1)$$

The spacing between levels can be uniform, with step size $D/I$, or non-uniform. Uniform quan

this may not always be the case, and usually the parameter values are distributed unevenly. In that case, the quantisation levels should be determined by taking the statistical distribution of the values into account. For example, in places with a concentration of values more levels could be assigned for more accurate representation of the values.
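A uniform scalar quantiser of the kind described above can be sketched in a few lines. The sketch assumes mid-point reconstruction within each step and clamps out-of-range values to the end levels; the function names and the example range are hypothetical.

```python
import math


def bits_for_levels(levels):
    """Number of bits needed to index the given number of levels."""
    return math.ceil(math.log2(levels))


def uniform_quantise(value, lo, hi, levels):
    """Return (index, reconstructed value) for a uniform quantiser on [lo, hi]."""
    step = (hi - lo) / levels              # range divided by number of levels
    index = int((value - lo) / step)
    index = max(0, min(levels - 1, index))  # clamp values outside the range
    return index, lo + (index + 0.5) * step


if __name__ == "__main__":
    print(bits_for_levels(32))
    print(uniform_quantise(0.3, 0.0, 1.0, 4))
```

Inside the range, the quantisation error is bounded by half a step. A non-uniform quantiser would replace the fixed step with levels placed according to the distribution of the parameter.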

2.4.1.2 Vector Quantisation

Scalar quantisers are simple and efficient in terms of memory usage. However, in terms of achieving more efficient quantisation, vector quantisation performs better than scalar quantisation. In vector quantisation, a group of values is combined in a vector and quantised jointly. Vector quantisation can exploit the correlations existing between the values, which leads to an improved quantisation efficiency.

If $x$ is an $N$-dimensional vector with real-valued elements given by

$$x = [x_1, x_2, x_3, \ldots, x_N] \qquad (2.2)$$

it is then mapped onto another $N$-dimensional real-valued vector, $y$. $y$ is the quantised version of $x$, typically chosen from a finite set of values, $Y = \{y_i, \; 1 \le i \le C\}$, where $y_i = [y_{i1}, y_{i2}, y_{i3}, \ldots, y_{iN}]$. The set of vectors $Y$ is called a codebook with size $C$, which is usually chosen to be a power of 2.

Designing a codebook requires optimal partitioning of an $N$-dimensional space into $C$ regions. Each region is represented by a code vector $y_i$, which is generally the centroid of the region. Figure 2.2 illustrates partitioning of a two-dimensional space.

One of the most popular methods for codebook design is the Linde-Buzo-Gray (LBG) algorithm [17], which is an iterative algorithm for obtaining optimum partitions and code vectors.
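The iterative design can be sketched as alternating nearest-neighbour partitioning and centroid updates, starting from a single centroid that is repeatedly split. The code below is a minimal illustration in the spirit of the Linde-Buzo-Gray procedure rather than a reproduction of [17]: it assumes squared Euclidean distortion, a fixed number of iterations instead of a convergence threshold, an additive split perturbation `eps`, and a codebook size that is a power of 2. All names are hypothetical.

```python
def nearest(vector, codebook):
    """Index of the code vector with the lowest squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def lbg(training, size, iterations=20, eps=0.01):
    """Design a codebook of `size` vectors (a power of 2) from training data."""
    codebook = [centroid(training)]
    while len(codebook) < size:
        # Split every code vector into a slightly perturbed pair.
        codebook = [[c + s for c in code] for code in codebook for s in (eps, -eps)]
        for _ in range(iterations):
            cells = [[] for _ in codebook]
            for vector in training:
                cells[nearest(vector, codebook)].append(vector)
            # Empty cells keep their previous code vector.
            codebook = [centroid(cell) if cell else code
                        for cell, code in zip(cells, codebook)]
    return codebook


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [1, 1],
            [10, 10], [10, 11], [11, 10], [11, 11]]  # two toy clusters
    print(sorted(lbg(data, 2)))
```

On the toy data the two code vectors settle on the centres of the two clusters, which are the centroids of the resulting regions.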

When the number of bits used for a codebook becomes too large, as in codebooks for LSF quantisation for example, the complexity and storage requirements may render the system impractical for implementation. For a vector quantiser consisting of L

$$a = N L = N 2^{B} \qquad (2.3)$$

where $B$ is the number of bits allocated. The memory requirement of this system can be given by:

$$M = N L = N 2^{B} \qquad (2.4)$$

in words.

For large codebooks, sub-optimal design and search techniques exist. One of the most common techniques is Split-Vector Quantisation (SVQ). In this scheme, the vector is divided into sub-vectors, which use their own codebooks. Due to the reduced vector size, $N$, and the reduced number of entries, $L$, each codebook has significantly lower complexity and memory requirements. Hence the total complexity and memory requirements are less than in the non-split case. There are, however, disadvantages of this system. Since the sub-vectors are treated separately, the intra-vector correlations cannot be exploited properly. Moreover, the splitting of the vector into sub-vectors may not be optimal. For example, during LSF quantisation, sometimes an LSF pair corresponding to a formant can be split into different sub-codebooks. As a result, quantisation may not be very efficient. Finally, the bit allocation scheme for each sub-codebook is fixed and can only cater for the perceptual importance of each value in a limited fashion. For example, in a case where all the elements of a sub-vector are of low importance, the number of allocated bits remains the same, and as a result the quantisation efficiency is lowered.

[Figure 2.2: partitioning of a two-dimensional space, showing the centroids (codebook entries)]
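The saving offered by splitting can be seen with a short calculation based on the memory requirement of a codebook, dimension times two to the power of the number of bits, in words. The figures below are hypothetical and chosen only for illustration: a 10-dimensional vector quantised with 24 bits, either as a single codebook or as a (3, 3, 4) split with 8 bits per sub-codebook.

```python
def vq_memory_words(dimension, bits):
    """Memory in words for one codebook: dimension * 2**bits."""
    return dimension * 2 ** bits


# Hypothetical configuration: 24 bits in total for a 10-dimensional vector.
full_search = vq_memory_words(10, 24)
split = sum(vq_memory_words(d, b) for d, b in [(3, 8), (3, 8), (4, 8)])

if __name__ == "__main__":
    print(full_search, split)
```

The single codebook needs 167,772,160 words against 2,560 words for the split version, which is why the unsplit case is impractical even though it would exploit the correlations better.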

Another technique, which addresses the problems of SVQ, is Multi-Stage Vector Quantisation (MSVQ), where a combination of smaller codebooks is used to quantise the input vector [18]. In MSVQ, the vectors in each codebook have the same length as the input vector, which makes it possible to exploit the intra-vector correlations as well as perceptual weighting of the values within each vector. However, testing each combination of vectors from each stage can be computationally very complex. Instead, the stage codebooks can be searched sequentially, finding the best index for each stage and then searching for the best one at the next stage, minimizing the residual error at each stage. The disadvantage of this method is that the combination with the lowest intermediate distortion may not result in the lowest overall distortion.

An optimum trade-off between complexity and quantisation efficiency can be achieved using an M-Best tree search algorithm. In this search technique, the M indices of the first stage giving the lowest distortion are kept. The residual for each of the M indices is then searched at the next stage, and again the M indices resulting in the lowest distortion are kept. Therefore, at each stage the codebook is searched M times, once for each of the M previously kept indices. This results in M candidate paths giving the lowest intermediate distortion. Finally, the candidate path resulting in the lowest overall distortion is chosen.
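The search procedure above can be sketched as follows. This is a minimal illustration assuming an unweighted squared-error distortion and small in-memory codebooks; the function name `msvq_mbest` and the toy codebooks are hypothetical. Setting `m = 1` reduces the procedure to the sequential search.

```python
def msvq_mbest(x, stages, m):
    """Return (indices, distortion) of the best path found by an M-best search.

    x      : input vector (list of floats)
    stages : list of codebooks, each a list of vectors with len(x) elements
    m      : number of candidate paths kept after every stage
    """
    # Each path is (distortion, indices chosen so far, current residual).
    paths = [(sum(v * v for v in x), [], list(x))]
    for codebook in stages:
        candidates = []
        for _, indices, residual in paths:
            for i, code in enumerate(codebook):
                new_residual = [r - c for r, c in zip(residual, code)]
                distortion = sum(v * v for v in new_residual)
                candidates.append((distortion, indices + [i], new_residual))
        candidates.sort(key=lambda path: path[0])
        paths = candidates[:m]   # keep the M lowest-distortion paths
    return paths[0][1], paths[0][0]


if __name__ == "__main__":
    stages = [[[0.0], [1.4]], [[-0.8], [0.1]]]   # toy two-stage codebooks
    print(msvq_mbest([0.6], stages, m=1))
    print(msvq_mbest([0.6], stages, m=2))
```

In the toy example the sequential search commits to the first-stage entry with the lowest intermediate distortion and ends with a distortion of 0.25, while keeping two paths finds a combination with essentially zero overall distortion.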

When designing codebooks for MSVQ, the LBG algorithm can be used on the input training set for the first stage and on the residual of the previous stage for the subsequent ones. Finally, these codebooks are jointly optimized using several iterations.


2.4.2 LP Modelling of Speech

Linear Predictive Coding (LPC) [19] is one of the most powerful analysis techniques. There is usually a significant amount of correlation between successive speech samples, known as short-term correlation. The amount of short-term correlation depends on the characteristics of the speech signal. LPC analysis tries to model these correlations using a short-order filter. The filter order is usually 10 for narrowband and 14 for wideband speech. Due to the modelling of the sample-to-sample correlations (formants), LPC is also called Short Term Prediction (STP). An example of the spectrum of a speech segment and the corresponding formant structure can be seen in Figure 2.3.

2.4.2.1 The Source-Filter Model

The majority of low bit-rate speech coders employ a speech production model mimicking the human speech production mechanism, which is formed of two parts: the excitation and the modulation. Excitation can be either voiced or unvoiced. During voiced excitation, the vocal folds open and close at regular intervals, breaking the air forced from the lungs into quasi-periodic pulses whose frequency is controlled by pitch. Unvoiced excitation, on the other hand, is caused by turbulent air from the lungs. The excitation signal then passes through the vocal tract, which acts as the modulation filter. The shape of the modulation filter depends on the positions of the tongue, velum, lips and the nasal cavity.

Figure 2.3. An example of speech spectrum (solid) and the corresponding formant structure (dotted)

Figure 2.4 shows the popular source-filter model which is widely used in speech coding [20]. This model is assumed to be linear with independent excitation and modulation parts. This way a simple and practical implementation is possible.

2.4.2.2 Linear Prediction

The accuracy of the modelling of the modulation filter in the source-filter model is critical for good performance. While many different techniques have been developed for the modelling of this modulation filter, Linear Prediction (LP) is the most widely used one [22].

In LP, the combined effects of the vocal tract, glottal flow and the lips, represented by the time-varying filter, are modelled as a pole-zero filter whose transfer function is given by:

$$H(z) = \frac{S(z)}{X(z)} = G \, \frac{1 - \sum_{i=1}^{q} b_i z^{-i}}{1 - \sum_{j=1}^{p} a_j z^{-j}} \qquad (2.5)$$

[Figure 2.4: the source-filter model, in which an impulse train and a noise generator feed a voiced/unvoiced selector, followed by the gain G and the time-varying filter producing the output speech]

Finding the optimal coefficients for this filter with both poles and zeros is a challenging task, as it requires complex numerical optimisation techniques [21]. However, this filter can be simplified to an all-pole model with a high enough order due to the nature of the human speech production mechanism. Since there are no more than 4 or 5 formants in human speech limited to 4 kHz in bandwidth, a 10th-order filter is usually sufficient, representing each formant with two poles. The $p$-th order all-pole filter transfer function is then given by:

$$H(z) = \frac{G}{1 - \sum_{j=1}^{p} a_j z^{-j}} = \frac{G}{A(z)} \qquad (2.6)$$

where

$$A(z) = 1 - \sum_{j=1}^{p} a_j z^{-j} \qquad (2.7)$$

In the sample domain, this equation becomes:

$$s(n) = G x(n) + \sum_{j=1}^{p} a_j s(n-j) \qquad (2.8)$$

which is the LPC difference equation, where the output $s(n)$ is expressed as a weighted sum of the past outputs, $s(n-j)$, and the current input, $x(n)$. With a finite number of coefficients, the predicted value $\hat{s}(n)$ fails to represent the output $s(n)$ completely, resulting in the prediction error $e(n)$, which is also called the residual:

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{j=1}^{p} a_j s(n-j) \qquad (2.9)$$

The main objective here is to determine the optimal coefficients $a_j$ which minimize the Mean-Squared Error (MSE), i.e.,

$$\min E = \varepsilon[e^2(n)] = \varepsilon[(s(n) - \hat{s}(n))^2] = \varepsilon\Big[\Big(s(n) - \sum_{j=1}^{p} a_j s(n-j)\Big)^2\Big] \qquad (2.10)$$

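where $\varepsilon[\cdot]$ denotes the expectation, here taken of the squared residual. The coefficients $a_j$ minimizing the total $E$ can be calculated by taking the derivative of $E$ with respect to $a_k$ and equating it to zero. With the expectation evaluated as a sum over the $N$ samples of the analysis frame, this gives:

$$\frac{\partial E}{\partial a_k} = -2 \sum_{n=1}^{N} \Big[ s(n) s(n-k) - \sum_{j=1}^{p} a_j s(n-j) s(n-k) \Big] = 0 \qquad (2.11)$$

Equation 2.11 can be expressed using the autocorrelation $\phi(i,j) = \sum_{n=1}^{N} s(n-i)\,s(n-j)$, which leads to:

$$\phi(0,k) - \sum_{j=1}^{p} a_j \phi(j,k) = 0, \qquad k = 1, \ldots, p \qquad (2.12)$$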

Although this equation is very complex, due to the characteristics of the autocorrelation matrix, there are efficient algorithms for a solution. The most widely used technique is the Levinson-Durbin algorithm [23], which employs a recursive solution.
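The recursion can be sketched as follows. The code is a minimal illustration, assuming the autocorrelation method, in which the correlation between two delayed copies of the signal depends only on the lag between them, and a non-zero frame energy. It uses the sign convention of the predictor above, in which the predicted sample is a weighted sum of past samples with weights a_1 to a_p. The function names are hypothetical.

```python
def autocorrelation(signal, max_lag):
    """r[k] = sum over n of s(n) * s(n - k), for k = 0..max_lag."""
    return [
        sum(signal[n] * signal[n - k] for n in range(k, len(signal)))
        for k in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the normal equations for a_1..a_p from autocorrelation r[0..p].

    Returns (coefficients, residual_energy). Assumes r[0] > 0.
    """
    a = [0.0] * (order + 1)   # a[0] is unused; a[j] holds a_j
    energy = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient of stage i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / energy
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        energy *= 1.0 - k * k   # prediction error energy after stage i
    return a[1:], energy


if __name__ == "__main__":
    # Autocorrelation of an idealised first-order process with r[k] = 0.5**k.
    print(levinson_durbin([1.0, 0.5, 0.25], order=2))
```

For the idealised first-order example the recursion returns a first coefficient of 0.5 and a second coefficient of zero, since one tap already captures all of the correlation. Each stage reuses the solution of the previous order, which avoids a general matrix inversion.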

2.4.2.3 LSF Representation of the LP Coefficients

When using the LP coefficients $a_k$ in speech coders, they have to be quantised and often interpolated. However, this causes some problems. Since these are the coefficients of an IIR filter, their sensitivity to small changes is very large. This is especially a problem during quantisation, where some addition of quantisation noise is unavoidable. As a result, the resulting filter may be completely different or even unstable. The lack of a stability check for the LP coefficients is an important shortcoming in this case. Moreover, interpolation between two sets of LP coefficients is very difficult due to the unpredictable relation between the filter coefficients and the associated frequency response. The result of an interpolation between two sets of LP coefficients may have no resemblance to either set.

In order to overcome the difficulties discussed above, an alternative form of representation for the LP coefficients is required. The most common form of representation is the Line Spectral Frequencies (LSF), which are easier to manipulate, robust to small distortions, open to interpolation, and have a stability check. The conversion between the LP coefficients and LSF is a lossless transformation.


LSF have strong relations to the speech spectrum. For speech sampled at 8 kHz, the LSF are limited to the range (0, 4000) Hz. Two consecutive LSF come closer near the formant frequencies, where their distance depends on the strength of the formant. This fact causes the LSF sets to contain the redundancies in the speech spectrum as a result of the correlations between successive speech frames. Another important property of the LSF is the convenient stability check. The corresponding filter is guaranteed to be stable when the LSF are in an increasing order, i.e.:

$$0 < LSF_1 < LSF_2 < \cdots < LSF_{10} < 4000 \qquad (2.13)$$
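The ordering condition translates directly into a stability check, and it also shows why LSF are convenient for interpolation: a weighted average of two ordered sets, with a weight between 0 and 1, is itself ordered. The sketch below assumes LSF values expressed in Hz for 8 kHz sampled speech; the function names and the example values are hypothetical.

```python
def lsf_is_stable(lsf_hz, nyquist_hz=4000.0):
    """True if 0 < LSF_1 < LSF_2 < ... < LSF_p < nyquist_hz."""
    previous = 0.0
    for frequency in lsf_hz:
        if not previous < frequency:
            return False
        previous = frequency
    return previous < nyquist_hz


def interpolate_lsf(lsf_a, lsf_b, weight):
    """Linear interpolation between two LSF sets, with 0 <= weight <= 1."""
    return [(1.0 - weight) * a + weight * b for a, b in zip(lsf_a, lsf_b)]


if __name__ == "__main__":
    frame_a = [300, 600, 1100, 1500, 1900, 2300, 2700, 3000, 3400, 3700]
    frame_b = [250, 700, 1000, 1600, 2000, 2200, 2800, 3100, 3300, 3800]
    middle = interpolate_lsf(frame_a, frame_b, 0.5)
    print(lsf_is_stable(middle))
```

A quantiser can apply the same check to a decoded set and reorder or replace it when quantisation noise has broken the ordering.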

2.5 Applications of Speech Coding

There are two main types of application of speech coding: Voice Storage and Telecommunications. Voice Storage systems are used for archiving or in answer phones. Telecommunication applications can be further classified as terrestrial and satellite systems.

2.5.1 Terrestrial Systems

Public Switched Telephone Networks (PSTN), Integrated Services Digital Networks (ISDN) and cellular mobile radio systems are all terrestrial voice communication systems. The first generation PSTN employed the ITU G.711 companded PCM at 64 kbps [9]. Later it was replaced by a more efficient version, the ITU G.721 ADPCM at 32 kbps [10]. As the number of subscribers grew and more bandwidth-efficient coding was required, the ITU G.728 Low Delay Code Excited Linear Prediction (LD-CELP) at 16 kbps [15] was introduced. Then the ITU G.729 at 8 kbps [16] was developed, producing near toll quality with a higher delay due to the 10 ms frame sizes.

In ISDN, speech and data are integrated, using two channels at 64 kbps for data, voice or video, and one control channel at 16 kbps.

Cellular telephony is a major application area for speech coding, with a very large number of subscribers on a limited-bandwidth radio link. The speech coders to be used must provide good speech quality under a high BER, with delay kept reasonably low. The GSM standard, which was set up by the European Telecommunication Standards Institute (ETSI) in 1988, employs the GSM Full Rate (FR) [24] coder operating at a gross rate of 22.8 kbps, with 13 kbps used for speech coding and the rest for channel coding. There is also the ETSI Enhanced Full Rate (EFR) [25] coder, which offers better speech quality than the GSM FR and operates at a gross rate of 22.8 kbps with 12.2 kbps used for speech. The Half-Rate GSM (HR-GSM) [26] coder operates at 11.4 kbps with 5.6 kbps used for speech, although FR and EFR are used more frequently. Finally, the Adaptive Multi-Rate (AMR) standard [27] has been introduced, operating at a gross rate of 22.8 kbps or 11.4 kbps.

2.5.2 Satellite Communications

Satellite systems are mainly used for long distance communications since they have wide area coverage and allow point-to-point and point-to-multi-point connection. The International Telecommunications Union (ITU) defines three main types of satellite services. These are the Fixed Satellite Service (FSS), which provides television, relay, telephony and data communication services to fixed earth stations, the Mobile Satellite Service (MSS), which provides services such as maritime, aeronautical or land to fixed and mobile terminals, and the Broadcast Satellite Service (BSS), which broadcasts television and radio services to home users.

The highly erroneous and bursty channel as well as long delays are the main problems of satellite systems. FSS and BSS require broadcast audio quality and therefore provide high bit rates, generally more than 32 kbps. MSS systems, on the other hand, use lower bit-rate systems. One of the main standards for MSS is the Inmarsat-M Improved Multi-Band Excitation (IMBE) [28] speech coder operating at a gross rate of 6.4 kbps with 4.15 kbps used for speech coding. The Iridium system, which is a constellation of 66 low earth orbit satellites providing telephony and data services from any location on Earth, uses an improved version of this coder.


2.6 Standardisation

Since the 1960s, there has been a large amount of research activity in the field of speech coding, resulting in a large number of speech coding algorithms. At first, many companies developed and employed their own speech coders in their products and private networks. With speech coders becoming publicly available in telecommunications services, the necessity of making speech coders compatible with one another has arisen. Through standardisation, it has also been possible for equipment manufacturers to combine their research efforts and also compete with each other, resulting in lower prices. A number of standardisation bodies exist which set requirements for the next generation systems, and choose the speech coder that best meets these requirements as the new standard. Table 2.3 shows some of the well known coding standards for telephone band speech.


Standard      Year   Algorithm        Bit rate*      MOS**       Delay***

G.711         1972   Companded PCM    64             4.3         0.125
G.726         1991   VBR-ADPCM        16/24/32/40    toll        0.125
G.728         1994   LD-CELP          16             4           0.625
G.729         1995   CS-ACELP         8              4           15
G.723.1       1995   A/MP-MLQ CELP    5.3/6.3        toll        37.5
ITU 4         -      -                4              toll        25
GSM FR        1989   RPE-LTP          13             3.7         20
GSM EFR       1995   ACELP            12.2           4           20
GSM/2         1994   VSELP            5.6            3.5         24.375
IS54          1989   VSELP            7.95           3.6         20
IS96          1993   Q-CELP           0.8/2/4/8.5    3.5         20
JDC           1990   VSELP            6.7            commun.     20
JDC/2         1993   PSI-CELP         3.45           commun.     40
Inmarsat-M    1990   IMBE             4.15           3.4         78.75
FS1015        1984   LPC-10           2.4            synthetic   112.5
FS1016        1991   CELP             4.8            3           37.5
New FS 2.4    1997   MELP             2.4            3           45.5

Table 2.3: A comparison of some telephone band speech coding standards

* Bit rate is given in kbps.

** The MOS figures are obtained from different formal subjective tests using different test material. The MOS figures given here are for guidance only.

*** Delay is the total algorithmic delay, i.e. the frame length and look-ahead, and is given in milliseconds.


2.7 Conclusion

The three main types of speech coding schemes (waveform coding, hybrid coding and parametric coding) have been mentioned. The research presented in this thesis falls into the category of parametric coders. Moreover, the main factors regarding the design of speech coding systems