ATIS 3GPP SPECIFICATION


ATIS.3GPP.26.090.900-2010

3rd Generation Partnership Project;

Technical Specification Group Services and System Aspects;

Mandatory Speech Codec speech processing functions;

Adaptive Multi-Rate (AMR) speech codec;

Transcoding functions (Release 9)

Approved by WTSC

Wireless Technologies and Systems Committee


ATIS is committed to providing leadership for, and the rapid development and promotion of, worldwide technical and operations standards for information, entertainment and communications technologies using a pragmatic, flexible and open approach.

< http://www.atis.org/ >

The text in this ATIS Specification is identical to 3GPP TS 26.090 V9.0.0.

Please note that ATIS.3GPP.26.090.900-2010 was developed within the Third Generation Partnership Project (3GPP™) and may be further elaborated for the purposes of 3GPP™. The contents of ATIS.3GPP.26.090.900-2010 are subject to continuing work within the 3GPP™ and may change following formal 3GPP™ approval. Should the 3GPP™ modify the contents of ATIS.3GPP.26.090.900-2010 it will be re-released by the 3GPP™ with an identifying change of release date and an increase in version number. The user of this Specification is advised to check for the latest version of 3GPP TS 26.090 V9.0.0 at the following address:

ftp://ftp.3gpp.org/Specs/ (sorted by release date)

The user is further advised to verify the changes over the version listed as the approved basis for this Specification and to utilize discretion after identifying any changes.

3GPP Support Office

650 Route des Lucioles -- Sophia Antipolis Valbonne - FRANCE

tel: +33 4 92 94 42 00 fax: +33 4 93 65 47 16 web: http://www.3gpp.org

"3GPP" is a registered trademark of ETSI in France and other jurisdictions on behalf of the 3rd Generation Partnership Project Organizational Partners (ARIB, ATIS, CCSA, ETSI, TTA, TTC).

ATIS.3GPP.26.090.900-2010

Published by

Alliance for Telecommunications Industry Solutions 1200 G Street, NW, Suite 500

Washington, DC 20005

No part of this publication may be reproduced in any form, in an electronic retrieval system or otherwise, without the prior written permission of the publisher. For information contact ATIS at +1 202.628.6380. ATIS is online at < http://www.atis.org >.

Printed in the United States of America.

Foreword
1 Scope
2 References
3 Definitions, symbols and abbreviations
3.1 Definitions
3.2 Symbols
3.3 Abbreviations
4 Outline description
4.1 Functional description of audio parts
4.2 Preparation of speech samples
4.2.1 PCM format conversion
4.3 Principles of the adaptive multi-rate speech encoder
4.4 Principles of the adaptive multi-rate speech decoder
4.5 Sequence and subjective importance of encoded parameters
5 Functional description of the encoder
5.1 Pre-processing (all modes)
5.2 Linear prediction analysis and quantization
5.2.1 Windowing and auto-correlation computation
5.2.2 Levinson-Durbin algorithm (all modes)
5.2.3 LP to LSP conversion (all modes)
5.2.4 LSP to LP conversion (all modes)
5.2.5 Quantization of the LSP coefficients
5.2.6 Interpolation of the LSPs
5.2.7 Monitoring resonance in the LPC spectrum (all modes)
5.3 Open-loop pitch analysis
5.4 Impulse response computation (all modes)
5.5 Target signal computation (all modes)
5.6 Adaptive codebook
5.6.1 Adaptive codebook search
5.6.2 Adaptive codebook gain control (all modes)
5.7 Algebraic codebook
5.7.1 Algebraic codebook structure
5.7.2 Algebraic codebook search
5.8 Quantization of the adaptive and fixed codebook gains
5.8.1 Adaptive codebook gain limitation in quantization
5.8.2 Quantization of codebook gains
5.8.3 Update past quantized adaptive codebook gain buffer (all modes)
5.9 Memory update (all modes)
6 Functional description of the decoder
6.1 Decoding and speech synthesis
6.2 Post-processing
6.2.1 Adaptive post-filtering (all modes)
6.2.2 High-pass filtering and up-scaling (all modes)
7 Detailed bit allocation of the adaptive multi-rate codec
8 Homing sequences
8.1 Functional description
8.2 Definitions
8.3 Encoder homing
8.4 Decoder homing
9 Bibliography
Annex A (informative): Change history

Foreword

This Technical Specification has been produced by the 3rd Generation Partnership Project (3GPP).

The contents of the present document are subject to continuing work within the TSG and may change following formal TSG approval. Should the TSG modify the contents of the present document, it will be re-released by the TSG with an identifying change of release date and an increase in version number as follows:

Version x.y.z where:

x the first digit:

1 presented to TSG for information;

2 presented to TSG for approval;

3 or greater indicates TSG approved document under change control.

y the second digit is incremented for all changes of substance, i.e. technical enhancements, corrections, updates, etc.

z the third digit is incremented when editorial only changes have been incorporated in the document.


1 Scope

The present document describes the detailed mapping from input blocks of 160 speech samples in 13-bit uniform PCM format to encoded blocks of 95, 103, 118, 134, 148, 159, 204, and 244 bits and from encoded blocks of 95, 103, 118, 134, 148, 159, 204, and 244 bits to output blocks of 160 reconstructed speech samples. The sampling rate is 8 000 samples/s leading to a bit rate for the encoded bit stream of 4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2 or 12.2 kbit/s. The coding scheme for the multi-rate coding modes is the so-called Algebraic Code Excited Linear Prediction Coder, hereafter referred to as ACELP. The multi-rate ACELP coder is referred to as MR-ACELP.
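The relation between frame size and bit rate stated above can be checked with a few lines of arithmetic. The following is only an illustrative sketch, not part of the specification; the names `BITS_PER_FRAME` and `bit_rate_kbps` are chosen here for the example.

```python
# Illustrative sketch: bit rate implied by the number of bits per 20 ms frame.
# Names are hypothetical and do not come from the ANSI-C code in [4].

SAMPLE_RATE_HZ = 8000        # sampling rate given in the Scope clause
SAMPLES_PER_FRAME = 160      # one frame = 160 samples = 20 ms

# Encoded block sizes listed in the Scope clause, lowest mode first.
BITS_PER_FRAME = [95, 103, 118, 134, 148, 159, 204, 244]


def bit_rate_kbps(bits_per_frame: int) -> float:
    """Return the bit rate in kbit/s for a given encoded frame size."""
    frames_per_second = SAMPLE_RATE_HZ / SAMPLES_PER_FRAME  # 50 frames/s
    return bits_per_frame * frames_per_second / 1000.0


rates = [bit_rate_kbps(bits) for bits in BITS_PER_FRAME]
```

Each entry of `rates` is the frame size multiplied by 50 frames per second, which reproduces the eight mode bit rates from 4.75 to 12.2 kbit/s.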

In the case of discrepancy between the requirements described in the present document and the fixed point computational description (ANSI-C code) of these requirements contained in [4], the description in [4] will prevail.

The ANSI-C code is not described in the present document, see [4] for a description of the ANSI-C code.

The transcoding procedure specified in the present document is mandatory for systems using the AMR speech codec.

2 References

The following documents contain provisions which, through reference in this text, constitute provisions of the present document.

- References are either specific (identified by date of publication, edition number, version number, etc.) or non-specific.

- For a specific reference, subsequent revisions do not apply.

- For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including a GSM document), a non-specific reference implicitly refers to the latest version of that document in the same Release as the present document.

[1] GSM 03.50: "Digital cellular telecommunications system (Phase 2+); Transmission planning aspects of the speech service in the GSM Public Land Mobile Network (PLMN) system".

[2] 3GPP TS 26.101: "Frame Structure".

[3] 3GPP TS 26.094: "AMR Speech Codec; Voice Activity Detector".

[4] 3GPP TS 26.073: "Adaptive Multi-Rate (AMR); ANSI C source code".

[5] 3GPP TS 26.074: "Adaptive Multi-Rate (AMR); Test sequences".

[6] ITU-T Recommendation G.711 (1988): "Pulse code modulation (PCM) of voice frequencies".

[7] ITU-T Recommendation G.726: "40, 32, 24, 16 kbit/s Adaptive Differential Pulse Code Modulation (ADPCM)".

[8] ITU-T Recommendation G.712

3 Definitions, symbols and abbreviations

3.1 Definitions

For the purposes of the present document, the following terms and definitions apply:

adaptive codebook: contains excitation vectors that are adapted for every subframe. The adaptive codebook is derived from the long-term filter state. The lag value can be viewed as an index into the adaptive codebook

adaptive postfilter: this filter is applied to the output of the short-term synthesis filter to enhance the perceptual quality of the reconstructed speech. In the adaptive multi-rate codec, the adaptive postfilter is a cascade of two filters: a formant postfilter and a tilt compensation filter

algebraic codebook: fixed codebook where algebraic code is used to populate the excitation vectors (innovation vectors). The excitation contains a small number of nonzero pulses with predefined interlaced sets of positions

anti-sparseness processing: adaptive post-processing procedure applied to the fixed codebook vector in order to reduce perceptual artefacts from a sparse fixed codebook vector

closed-loop pitch analysis: adaptive codebook search, i.e., a process of estimating the pitch (lag) value from the weighted input speech and the long term filter state. In the closed-loop search, the lag is searched using error minimization loop (analysis-by-synthesis). In the adaptive multi-rate codec, closed-loop pitch search is performed for every subframe

direct form coefficients: One of the formats for storing the short term filter parameters. In the adaptive multi-rate codec, all filters which are used to modify speech samples use direct form coefficients.

fixed codebook: The fixed codebook contains excitation vectors for speech synthesis filters. The contents of the codebook are non-adaptive (i.e., fixed). In the adaptive multi-rate codec, the fixed codebook is implemented using an algebraic codebook.

fractional lags: A set of lag values having sub-sample resolution. In the adaptive multi-rate codec a sub-sample resolution of 1/6th or 1/3rd of a sample is used.

frame: time interval equal to 20 ms (160 samples at an 8 kHz sampling rate)

integer lags: set of lag values having whole sample resolution

interpolating filter: FIR filter used to produce an estimate of subsample resolution samples, given an input sampled with integer sample resolution

inverse filter: this filter removes the short term correlation from the speech signal. The filter models an inverse frequency response of the vocal tract

lag: long term filter delay. This is typically the true pitch period, or its multiple or sub-multiple

Line Spectral Frequencies: (see Line Spectral Pair)

Line Spectral Pair: transformation of LPC parameters. Line Spectral Pairs are obtained by decomposing the inverse filter transfer function A(z) into a set of two transfer functions, one having even symmetry and the other having odd symmetry. The Line Spectral Pairs (also called Line Spectral Frequencies) are the roots of these polynomials on the z-unit circle

LP analysis window: for each frame, the short term filter coefficients are computed using the high pass filtered speech samples within the analysis window. In the adaptive multi-rate codec, the length of the analysis window is always 240 samples. For each frame, two asymmetric windows are used to generate two sets of LP coefficients in the 12.2 kbit/s mode. For the other modes, only a single asymmetric window is used to generate a single set of LP coefficients. In the 12.2 kbit/s mode, no samples of the future frames are used (no lookahead). The other modes use a 5 ms lookahead

LP coefficients: Linear Prediction (LP) coefficients (also referred to as Linear Predictive Coding (LPC) coefficients) is a generic descriptive term for the short term filter coefficients

mode: when used alone, refers to the source codec mode, i.e., to one of the source codecs employed in the AMR codec

open-loop pitch search: process of estimating the near optimal lag directly from the weighted speech input. This is done to simplify the pitch analysis and confine the closed-loop pitch search to a small number of lags around the open-loop estimated lags. In the adaptive multi-rate codec, an open-loop pitch search is performed in every other subframe

residual: the output signal resulting from an inverse filtering operation


short term synthesis filter: this filter introduces, into the excitation signal, short term correlation which models the impulse response of the vocal tract

perceptual weighting filter: this filter is employed in the analysis-by-synthesis search of the codebooks. The filter exploits the noise masking properties of the formants (vocal tract resonances) by weighting the error less in regions near the formant frequencies and more in regions away from them

subframe: time interval equal to 5 ms (40 samples at 8 kHz sampling rate)

vector quantization: method of grouping several parameters into a vector and quantizing them simultaneously

zero input response: output of a filter due to past inputs, i.e. due to the present state of the filter, given that an input of zeros is applied

zero state response: output of a filter due to the present input, given that no past inputs have been applied, i.e., given that the state information in the filter is all zeroes

3.2 Symbols

For the purposes of the present document, the following symbols apply:

 

$A(z)$  The inverse filter with unquantized coefficients

$\hat{A}(z)$  The inverse filter with quantized coefficients

$H(z) = \dfrac{1}{\hat{A}(z)}$  The speech synthesis filter with quantized coefficients

$a_i$  The unquantized linear prediction parameters (direct form coefficients)

$\hat{a}_i$  The quantified linear prediction parameters

$m$  The order of the LP model

$\dfrac{1}{B(z)}$  The long-term synthesis filter

$W(z)$  The perceptual weighting filter (unquantized coefficients)

$\gamma_1, \gamma_2$  The perceptual weighting factors

$F_E(z)$  Adaptive pre-filter

$T$  The integer pitch lag nearest to the closed-loop fractional pitch lag of the subframe

$\beta$  The adaptive pre-filter coefficient (the quantified pitch gain)

$H_f(z) = \dfrac{\hat{A}(z/\gamma_n)}{\hat{A}(z/\gamma_d)}$  The formant postfilter

$\gamma_n$  Control coefficient for the amount of the formant post-filtering

$\gamma_d$  Control coefficient for the amount of the formant post-filtering

$H_t(z)$  Tilt compensation filter

$\gamma_t$  Control coefficient for the amount of the tilt compensation filtering

$\mu = \gamma_t k_1'$  A tilt factor, with $k_1'$ being the first reflection coefficient

$h_f(n)$  The truncated impulse response of the formant postfilter

$L_h$  The length of $h_f(n)$

$r_h(i)$  The auto-correlations of $h_f(n)$

$\hat{A}(z/\gamma_n)$  The inverse filter (numerator) part of the formant postfilter

$1/\hat{A}(z/\gamma_d)$  The synthesis filter (denominator) part of the formant postfilter

$\hat{r}(n)$  The residual signal of the inverse filter $\hat{A}(z/\gamma_n)$

$h_t(n)$  Impulse response of the tilt compensation filter

$\beta_{sc}(n)$  The AGC-controlled gain scaling factor of the adaptive postfilter

$\alpha$  The AGC factor of the adaptive postfilter

$H_{h1}(z)$  Pre-processing high-pass filter

$w_I(n)$, $w_{II}(n)$  LP analysis windows

$L_1^{(I)}$  Length of the first part of the LP analysis window $w_I(n)$

$L_2^{(I)}$  Length of the second part of the LP analysis window $w_I(n)$

$L_1^{(II)}$  Length of the first part of the LP analysis window $w_{II}(n)$

$L_2^{(II)}$  Length of the second part of the LP analysis window $w_{II}(n)$

$r_{ac}(k)$  The auto-correlations of the windowed speech $s'(n)$

$w_{lag}(i)$  Lag window for the auto-correlations (60 Hz bandwidth expansion)

$f_0$  The bandwidth expansion in Hz

$f_s$  The sampling frequency in Hz

$r'_{ac}(k)$  The modified (bandwidth expanded) auto-correlations

$E_{LD}(i)$  The prediction error in the $i$th iteration of the Levinson algorithm

$k_i$  The $i$th reflection coefficient

$a_j^{(i)}$  The $j$th direct form coefficient in the $i$th iteration of the Levinson algorithm

$F_1'(z)$  Symmetric LSF polynomial

$F_2'(z)$  Antisymmetric LSF polynomial

$F_1(z)$  Polynomial $F_1'(z)$ with root $z = -1$ eliminated

$F_2(z)$  Polynomial $F_2'(z)$ with root $z = 1$ eliminated

$q_i$  The line spectral pairs (LSPs) in the cosine domain

$\mathbf{q}$  An LSP vector in the cosine domain

$\hat{\mathbf{q}}_i^{(n)}$  The quantified LSP vector at the $i$th subframe of the frame $n$

$\omega_i$  The line spectral frequencies (LSFs)

$T_m(x)$  An $m$th order Chebyshev polynomial

$f_1(i)$, $f_2(i)$  The coefficients of the polynomials $F_1(z)$ and $F_2(z)$

$f_1'(i)$, $f_2'(i)$  The coefficients of the polynomials $F_1'(z)$ and $F_2'(z)$

$f(i)$  The coefficients of either $F_1(z)$ or $F_2(z)$

$C(x)$  Sum polynomial of the Chebyshev polynomials

$x$  Cosine of angular frequency $\omega$

$\lambda_k$  Recursion coefficients for the Chebyshev polynomial evaluation

$f_i$  The line spectral frequencies (LSFs) in Hz

$\mathbf{f}^t = [f_1 \; f_2 \; \ldots \; f_{10}]$  The vector representation of the LSFs in Hz

$\mathbf{z}^{(1)}(n)$, $\mathbf{z}^{(2)}(n)$  The mean-removed LSF vectors at frame $n$

$\mathbf{r}^{(1)}(n)$, $\mathbf{r}^{(2)}(n)$  The LSF prediction residual vectors at frame $n$

$\mathbf{p}(n)$  The predicted LSF vector at frame $n$

$\hat{\mathbf{r}}^{(2)}(n-1)$  The quantified second residual vector at the past frame

$\hat{\mathbf{f}}^k$  The quantified LSF vector at quantization index $k$

$E_{LSP}$  The LSP quantization error

$w_i,\ i = 1, \ldots, 10$  LSP-quantization weighting factors

$d_i$  The distance between the line spectral frequencies $f_{i+1}$ and $f_{i-1}$

$h(n)$  The impulse response of the weighted synthesis filter

$O_k$  The correlation maximum of open-loop pitch analysis at delay $k$

$O_{t_i},\ i = 1, \ldots, 3$  The correlation maxima at delays $t_i,\ i = 1, \ldots, 3$

$(M_i, t_i),\ i = 1, \ldots, 3$  The normalized correlation maxima $M_i$ and the corresponding delays $t_i,\ i = 1, \ldots, 3$

$H(z)W(z) = \dfrac{A(z/\gamma_1)}{\hat{A}(z)\,A(z/\gamma_2)}$  The weighted synthesis filter

$A(z/\gamma_1)$  The numerator of the perceptual weighting filter

$1/A(z/\gamma_2)$  The denominator of the perceptual weighting filter

$T_1$  The integer nearest to the fractional pitch lag of the previous (1st or 3rd) subframe

$s'(n)$  The windowed speech signal

$s_w(n)$  The weighted speech signal

$\hat{s}(n)$  Reconstructed speech signal

$\hat{s}'(n)$  The gain-scaled post-filtered signal

$\hat{s}_f(n)$  Post-filtered speech signal (before scaling)

$x(n)$  The target signal for adaptive codebook search

$x_2(n)$, $\mathbf{x}_2^t$  The target signal for algebraic codebook search

$res_{LP}(n)$  The LP residual signal

$c(n)$  The fixed codebook vector

$v(n)$  The adaptive codebook vector

$y(n) = v(n) * h(n)$  The filtered adaptive codebook vector

$y_k(n)$  The past filtered excitation

$u(n)$  The excitation signal

$\hat{u}(n)$  The emphasized adaptive codebook vector

$\hat{u}'(n)$  The gain-scaled emphasized excitation signal

$T_{op}$  The best open-loop lag

$t_{min}$  Minimum lag search value

$t_{max}$  Maximum lag search value

$R(k)$  Correlation term to be maximized in the adaptive codebook search

$b_{24}$  The FIR filter for interpolating the normalized correlation term $R(k)$

$R(k)_t$  The interpolated value of $R(k)$ for the integer delay $k$ and fraction $t$

$b_{60}$  The FIR filter for interpolating the past excitation signal $u(n)$ to yield the adaptive codebook vector $v(n)$

$A_k$  Correlation term to be maximized in the algebraic codebook search at index $k$

$C_k$  The correlation in the numerator of $A_k$ at index $k$

$E_{Dk}$  The energy in the denominator of $A_k$ at index $k$

$\mathbf{d} = \mathbf{H}^t \mathbf{x}_2$  The correlation between the target signal $x_2(n)$ and the impulse response $h(n)$, i.e., backward filtered target

$\mathbf{H}$  The lower triangular Toeplitz convolution matrix with diagonal $h(0)$ and lower diagonals $h(1), \ldots, h(39)$

$\mathbf{\Phi} = \mathbf{H}^t \mathbf{H}$  The matrix of correlations of $h(n)$

$d(n)$  The elements of the vector $\mathbf{d}$

$\phi(i, j)$  The elements of the symmetric matrix $\mathbf{\Phi}$

$\mathbf{c}_k$  The innovation vector

$C$  The correlation in the numerator of $A_k$

$m_i$  The position of the $i$th pulse

$\vartheta_i$  The amplitude of the $i$th pulse

$N_p$  The number of pulses in the fixed codebook excitation

$E_D$  The energy in the denominator of $A_k$

$res_{LTP}(n)$  The normalized long-term prediction residual

$b(n)$  The signal used for presetting the signs in algebraic codebook search

$s_b(n)$  The sign signal for the algebraic codebook search

$d'(n)$  Sign extended backward filtered target

$\phi'(i, j)$  The modified elements of the matrix $\mathbf{\Phi}$, including sign information

$\mathbf{z}^t$, $z(n)$  The fixed codebook vector convolved with $h(n)$

$E(n)$  The mean-removed innovation energy (in dB)

$\bar{E}$  The mean of the innovation energy

$\tilde{E}(n)$  The predicted energy

$[b_1 \; b_2 \; b_3 \; b_4]$  The MA prediction coefficients

$\hat{R}(k)$  The quantified prediction error at subframe $k$

$E_I$  The mean innovation energy

$R(n)$  The prediction error of the fixed-codebook gain quantization

$E_Q$  The quantization error of the fixed-codebook gain quantization

$e(n)$  The states of the synthesis filter $1/\hat{A}(z)$

$e_w(n)$  The perceptually weighted error of the analysis-by-synthesis search

$\eta$  The gain scaling factor for the emphasized excitation

$g_c$  The fixed-codebook gain

$g_c'$  The predicted fixed-codebook gain

$\hat{g}_c$  The quantified fixed codebook gain

$g_p$  The adaptive codebook gain

$\hat{g}_p$  The quantified adaptive codebook gain

$\gamma_{gc} = g_c / g_c'$  A correction factor between the gain $g_c$ and the estimated one $g_c'$

$\hat{\gamma}_{gc}$  The optimum value for $\gamma_{gc}$

$\gamma_{sc}$  Gain scaling factor

3.3 Abbreviations

For the purposes of the present document, the following abbreviations apply.

ACELP Algebraic Code Excited Linear Prediction

AGC Adaptive Gain Control

AMR Adaptive Multi-Rate

CELP Code Excited Linear Prediction

EFR Enhanced Full Rate

FIR Finite Impulse Response


ISPP Interleaved Single-Pulse Permutation

LP Linear Prediction

LPC Linear Predictive Coding

LSF Line Spectral Frequency

LSP Line Spectral Pair

LTP Long Term Predictor (or Long Term Prediction)

MA Moving Average

4 Outline description

The present document is structured as follows:

Clause 4.1 contains a functional description of the audio parts including the A/D and D/A functions. Clause 4.2 describes the conversion between 13-bit uniform and 8-bit A-law or μ-law samples. Clauses 4.3 and 4.4 present a simplified description of the principles of the AMR codec encoding and decoding process respectively. In clause 4.5, the sequence and subjective importance of encoded parameters are given.

Clause 5 presents the functional description of the AMR codec encoding, whereas clause 6 describes the decoding procedures. In clause 7, the detailed bit allocation of the AMR codec is tabulated.

4.1 Functional description of audio parts

The analogue-to-digital and digital-to-analogue conversion will in principle comprise the following elements:

1) Analogue to uniform digital PCM

- microphone;

- input level adjustment device;

- input anti-aliasing filter;

- sample-hold device sampling at 8 kHz;

- analogue-to-uniform digital conversion to 13-bit representation.

The uniform format shall be represented in two's complement.

2) Uniform digital PCM to analogue

- conversion from 13bit/8 kHz uniform PCM to analogue;

- a hold device;

- reconstruction filter including x/sin( x ) correction;

- output level adjustment device;

- earphone or loudspeaker.

In the terminal equipment, the A/D function may be achieved either:

- by direct conversion to 13-bit uniform PCM format;

- or by conversion to 8-bit A-law or μ-law companded format, based on a standard A-law or μ-law codec/filter according to ITU-T Recommendations G.711 [6] and G.714, followed by the 8-bit to 13-bit conversion as specified in clause 4.2.1.

For the D/A operation, the inverse operations take place.

In the latter case it should be noted that the specifications in ITU-T G.714 (superseded by G.712) are concerned with PCM equipment located in the central parts of the network. When used in the terminal equipment, the present document does not on its own ensure sufficient out-of-band attenuation. The specification of out-of-band signals is defined in [1] in clause 2.

4.2 Preparation of speech samples

The encoder is fed with data comprising samples with a resolution of 13 bits, left justified in a 16-bit word. The three least significant bits are set to '0'. The decoder outputs data in the same format. Outside the speech codec, further processing must be applied if the traffic data occurs in a different representation.
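The sample format described above (13-bit two's complement, left justified in a 16-bit word, three least significant bits zero) can be illustrated as follows. This is a minimal sketch using plain Python integers; the function names are hypothetical and are not taken from the ANSI-C code in [4].

```python
# Illustrative sketch of the 13-bit-in-16-bit sample format.
# Function names are hypothetical, chosen for this example only.

def to_codec_word(sample_13bit: int) -> int:
    """Left-justify a signed 13-bit sample in a 16-bit word (3 LSBs = 0)."""
    if not -4096 <= sample_13bit <= 4095:
        raise ValueError("sample does not fit in 13-bit two's complement")
    return sample_13bit << 3


def from_codec_word(word_16bit: int) -> int:
    """Recover the signed 13-bit sample from a left-justified 16-bit word."""
    return word_16bit >> 3  # arithmetic shift keeps the sign


def clear_lsbs(word_16bit: int) -> int:
    """Force the three least significant bits of a 16-bit sample to zero."""
    return word_16bit & ~0x7
```

For input that arrives as full 16-bit PCM, `clear_lsbs` shows one way to bring it into the expected representation; whether truncation or rounding is appropriate is outside the scope of this sketch.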

4.2.1 PCM format conversion

The conversion between 8-bit A-law or μ-law compressed data and linear data with 13-bit resolution at the speech encoder input shall be as defined in ITU-T Rec. G.711 [6].

ITU-T Rec. G.711 [6] specifies the A-law or μ-law to linear conversion and vice versa by providing table entries.

Examples on how to perform the conversion by fixed-point arithmetic can be found in ITU-T Rec. G.726 [7]. Clause 4.2.1 of G.726 [7] describes A-law or μ-law to linear expansion and clause 4.2.8 of G.726 [7] provides a solution for linear to A-law or μ-law compression.

4.3 Principles of the adaptive multi-rate speech encoder

The AMR codec consists of eight source codecs with bit-rates of 12.2, 10.2, 7.95, 7.40, 6.70, 5.90, 5.15 and 4.75 kbit/s.

The codec is based on the code-excited linear predictive (CELP) coding model. A 10th order linear prediction (LP), or short-term, synthesis filter is used which is given by:

$$H(z) = \frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{m} \hat{a}_i z^{-i}}, \tag{1}$$

where $\hat{a}_i$, $i = 1, \ldots, m$, are the (quantified) linear prediction (LP) parameters, and $m = 10$ is the predictor order.
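In the time domain, the short-term synthesis filter 1/Â(z) with Â(z) = 1 + Σ â_i z^(-i) corresponds to the recursion ŝ(n) = u(n) − Σ â_i ŝ(n−i), where u(n) is the excitation. The sketch below illustrates this recursion in floating point; it is not the fixed-point reference implementation of [4], and the function name `lp_synthesis` is an assumption made for this example.

```python
# Illustrative floating-point sketch of the all-pole synthesis filter 1/A(z),
# with A(z) = 1 + sum_{i=1..m} a[i-1] * z^-i. Not the reference code of [4].

def lp_synthesis(excitation, a, memory=None):
    """Filter `excitation` through 1/A(z).

    a      : list of m coefficients a_1 .. a_m
    memory : optional list of m past outputs, memory[0] = s(n-1)
    """
    m = len(a)
    past = list(memory) if memory is not None else [0.0] * m
    output = []
    for u in excitation:
        # s(n) = u(n) - sum_i a_i * s(n - i)
        s = u - sum(a[i] * past[i] for i in range(m))
        output.append(s)
        past = [s] + past[:-1]  # shift the filter memory by one sample
    return output


# Usage: a first-order example with a_1 = -0.5 gives a decaying impulse response.
impulse_response = lp_synthesis([1.0, 0.0, 0.0, 0.0], [-0.5])
```

In the codec itself m = 10 and the memory is carried over from one subframe to the next.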

The long-term, or pitch, synthesis filter is given by:

$$\frac{1}{B(z)} = \frac{1}{1 - g_p z^{-T}}, \tag{2}$$

where $T$ is the pitch delay and $g_p$ is the pitch gain. The pitch synthesis filter is implemented using the so-called adaptive codebook approach.
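For an integer pitch delay T, the long-term synthesis filter 1/B(z) = 1/(1 − g_p z^(-T)) corresponds to the recursion y(n) = x(n) + g_p y(n−T). The following sketch shows only this basic recursion; the codec realizes it through the adaptive codebook with fractional delays, which is not modelled here. The function name `pitch_synthesis` is hypothetical.

```python
# Illustrative sketch of the long-term (pitch) synthesis recursion
# y(n) = x(n) + g_p * y(n - T) for an integer delay T.
# Fractional delays and the adaptive codebook search are not modelled.

def pitch_synthesis(x, gain, lag, history=None):
    """Apply 1/(1 - gain * z^-lag) to the sequence `x`.

    history : optional list of at least `lag` past outputs, oldest first.
    """
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    buffer = list(history) if history is not None else [0.0] * lag
    if len(buffer) < lag:
        raise ValueError("history must hold at least `lag` samples")
    output = []
    for sample in x:
        y = sample + gain * buffer[-lag]  # y(n - T) sits `lag` samples back
        buffer.append(y)
        output.append(y)
    return output


# Usage: a single pulse repeats every `lag` samples, scaled by the gain.
periodic = pitch_synthesis([1.0, 0.0, 0.0, 0.0, 0.0, 0.0], gain=0.5, lag=2)
```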

The CELP speech synthesis model is shown in figure 2. In this model, the excitation signal at the input of the short-term LP synthesis filter is constructed by adding two excitation vectors from adaptive and fixed (innovative) codebooks. The speech is synthesized by feeding the two properly chosen vectors from these codebooks through the short-term synthesis filter. The optimum excitation sequence in a codebook is chosen using an analysis-by-synthesis search procedure in which the error between the original and synthesized speech is minimized according to a perceptually weighted distortion measure.

The perceptual weighting filter used in the analysis-by-synthesis search technique is given by:

$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}, \tag{3}$$

where $A(z)$ is the unquantized LP filter and $0 < \gamma_2 < \gamma_1 \le 1$ are the perceptual weighting factors. The values $\gamma_1 = 0.9$ (for the 12.2 and 10.2 kbit/s mode) or $\gamma_1 = 0.94$ (for all other modes) and $\gamma_2 = 0.6$ are used. The weighting filter uses the unquantized LP parameters.
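Evaluating A(z/γ) amounts to scaling the i-th LP coefficient by γ^i, so the numerator and denominator of the weighting filter W(z) = A(z/γ1)/A(z/γ2) are both obtained from the same unquantized coefficients. The sketch below illustrates this under the weighting factors quoted in the text; the function names and the use of a mode string are assumptions made for the example.

```python
# Illustrative sketch: coefficients of A(z/gamma) from those of A(z).
# With A(z) = 1 + sum a_i z^-i, A(z/gamma) = 1 + sum (a_i * gamma^i) z^-i.
# Names and the mode-string convention are hypothetical.

def bandwidth_expand(a, gamma):
    """Return the coefficients a_i * gamma**i for i = 1..m."""
    return [coef * gamma ** (i + 1) for i, coef in enumerate(a)]


def weighting_filter_coeffs(a, mode):
    """Return (numerator, denominator) coefficient lists of W(z).

    mode : bit rate as a string, e.g. "12.2" or "7.40"
    """
    gamma1 = 0.9 if mode in ("12.2", "10.2") else 0.94
    gamma2 = 0.6
    return bandwidth_expand(a, gamma1), bandwidth_expand(a, gamma2)


# Usage with toy second-order coefficients.
numerator, denominator = weighting_filter_coeffs([1.0, 1.0], "12.2")
```

The leading coefficient 1 of each polynomial is implicit in the returned lists.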


The coder operates on speech frames of 20 ms corresponding to 160 samples at the sampling frequency of 8 000 sample/s. At each 160 speech samples, the speech signal is analysed to extract the parameters of the CELP model (LP filter coefficients, adaptive and fixed codebooks' indices and gains). These parameters are encoded and transmitted. At the decoder, these parameters are decoded and speech is synthesized by filtering the reconstructed excitation signal through the LP synthesis filter.

The signal flow at the encoder is shown in figure 3. LP analysis is performed twice per frame for the 12.2 kbit/s mode and once for the other modes. For the 12.2 kbit/s mode, the two sets of LP parameters are converted to line spectrum pairs (LSP) and jointly quantized using split matrix quantization (SMQ) with 38 bits. For the other modes, the single set of LP parameters is converted to line spectrum pairs (LSP) and vector quantized using split vector quantization (SVQ). The speech frame is divided into 4 subframes of 5 ms each (40 samples). The adaptive and fixed codebook parameters are transmitted every subframe. The quantized and unquantized LP parameters or their interpolated versions are used depending on the subframe. An open-loop pitch lag is estimated in every other subframe (except for the 5.15 and 4.75 kbit/s modes for which it is done once per frame) based on the perceptually weighted speech signal.

Then the following operations are repeated for each subframe:

- The target signal $x(n)$ is computed by filtering the LP residual through the weighted synthesis filter $W(z)H(z)$ with the initial states of the filters having been updated by filtering the error between LP residual and excitation (this is equivalent to the common approach of subtracting the zero input response of the weighted synthesis filter from the weighted speech signal).

- The impulse response, $h(n)$, of the weighted synthesis filter is computed.

- Closed-loop pitch analysis is then performed (to find the pitch lag and gain), using the target $x(n)$ and impulse response $h(n)$, by searching around the open-loop pitch lag. Fractional pitch with 1/6th or 1/3rd of a sample resolution (depending on the mode) is used.

- The target signal $x(n)$ is updated by removing the adaptive codebook contribution (filtered adaptive codevector), and this new target, $x_2(n)$, is used in the fixed algebraic codebook search (to find the optimum innovation).

- The gains of the adaptive and fixed codebook are scalar quantified with 4 and 5 bits respectively or vector quantified with 6-7 bits (with moving average (MA) prediction applied to the fixed codebook gain).

- Finally, the filter memories are updated (using the determined excitation signal) for finding the target signal in the next subframe.

The bit allocation of the AMR codec modes is shown in table 1. In each 20 ms speech frame, 95, 103, 118, 134, 148, 159, 204 or 244 bits are produced, corresponding to a bit-rate of 4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2 or 12.2 kbit/s. More detailed bit allocation among the codec parameters is given in tables 9a-9h. Note that the most significant bits (MSB) are always sent first.

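Table 1: Bit allocation of the AMR coding algorithm for 20 ms frame

Mode          Parameter       1st       2nd       3rd       4th       total
                              subframe  subframe  subframe  subframe  per frame

12.2 kbit/s   2 LSP sets                                              38
(GSM EFR)     Pitch delay     9         6         9         6         30
              Pitch gain      4         4         4         4         16
              Algebraic code  35        35        35        35        140
              Codebook gain   5         5         5         5         20
              Total                                                   244

10.2 kbit/s   LSP set                                                 26
              Pitch delay     8         5         8         5         26
              Algebraic code  31        31        31        31        124
              Gains           7         7         7         7         28
              Total                                                   204

7.95 kbit/s   LSP sets                                                27
              Pitch delay     8         6         8         6         28
              Pitch gain      4         4         4         4         16
              Algebraic code  17        17        17        17        68
              Codebook gain   5         5         5         5         20
              Total                                                   159

7.40 kbit/s   LSP set                                                 26
(TDMA EFR)    Pitch delay     8         5         8         5         26
              Algebraic code  17        17        17        17        68
              Gains           7         7         7         7         28
              Total                                                   148

6.70 kbit/s   LSP set                                                 26
(PDC EFR)     Pitch delay     8         4         8         4         24
              Algebraic code  14        14        14        14        56
              Gains           7         7         7         7         28
              Total                                                   134

5.90 kbit/s   LSP set                                                 26
              Pitch delay     8         4         8         4         24
              Algebraic code  11        11        11        11        44
              Gains           6         6         6         6         24
              Total                                                   118

5.15 kbit/s   LSP set                                                 23
              Pitch delay     8         4         4         4         20
              Algebraic code  9         9         9         9         36
              Gains           6         6         6         6         24
              Total                                                   103

4.75 kbit/s   LSP set                                                 23
              Pitch delay     8         4         4         4         20
              Algebraic code  9         9         9         9         36
              Gains           8                   8                   16
              Total                                                   95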

4.4 Principles of the adaptive multi-rate speech decoder

The signal flow at the decoder is shown in figure 4. At the decoder, based on the chosen mode, the transmitted indices are extracted from the received bitstream. The indices are decoded to obtain the coder parameters at each transmission frame. These parameters are the LSP vectors, the fractional pitch lags, the innovative codevectors, and the pitch and innovative gains. The LSP vectors are converted to the LP filter coefficients and interpolated to obtain LP filters at each subframe. Then, at each 40-sample subframe:

- the excitation is constructed by adding the adaptive and innovative codevectors scaled by their respective gains;

- the speech is reconstructed by filtering the excitation through the LP synthesis filter.
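The first decoder step above, building the excitation from the two gain-scaled codevectors, can be written as u(n) = g_p v(n) + g_c c(n) over the 40 samples of a subframe. The following sketch shows only this summation; decoding of the indices, interpolation, and the synthesis and post-filters are not included, and the function name is hypothetical.

```python
# Illustrative sketch of the decoder excitation construction for one subframe:
# u(n) = g_p * v(n) + g_c * c(n). Parameter decoding is not modelled.

SUBFRAME_LENGTH = 40  # samples per 5 ms subframe at 8 kHz


def build_excitation(adaptive_vec, fixed_vec, pitch_gain, code_gain):
    """Sum the gain-scaled adaptive and innovative codevectors."""
    if len(adaptive_vec) != len(fixed_vec):
        raise ValueError("codevectors must have the same length")
    return [pitch_gain * v + code_gain * c
            for v, c in zip(adaptive_vec, fixed_vec)]


# Usage with toy vectors: a constant adaptive vector and a sparse
# innovation containing two signed pulses.
v = [1.0] * SUBFRAME_LENGTH
c = [0.0] * SUBFRAME_LENGTH
c[3], c[17] = 1.0, -1.0
u = build_excitation(v, c, pitch_gain=0.5, code_gain=2.0)
```

The resulting excitation is then passed through the LP synthesis filter of the current subframe.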

Finally, the reconstructed speech signal is passed through an adaptive postfilter.


4.5 Sequence and subjective importance of encoded parameters

The encoder will produce the output information in a unique sequence and format, and the decoder must receive the same information in the same way. In tables 9a-9h, the sequence of output bits and the bit allocation for each parameter are shown.

The different parameters of the encoded speech and their individual bits have unequal importance with respect to subjective quality. The output and input frame formats for the AMR speech codec are given in [2], where a reordering of bits takes place.

5 Functional description of the encoder

In this clause, the different functions of the encoder represented in figure 3 are described.

5.1 Pre-processing (all modes)

Two pre-processing functions are applied prior to the encoding process: high-pass filtering and signal down-scaling.

Down-scaling consists of dividing the input by a factor of 2 to reduce the possibility of overflows in the fixed-point implementation.

The high-pass filter serves as a precaution against undesired low frequency components. A filter with a cut off frequency of 80 Hz is used, and it is given by:

$$H_{h1}(z) = \frac{0.927246093 - 1.8544941\,z^{-1} + 0.927246903\,z^{-2}}{1 - 1.906005859\,z^{-1} + 0.911376953\,z^{-2}}. \qquad (4)$$

Down-scaling and high-pass filtering are combined by dividing the coefficients at the numerator of $H_{h1}(z)$ by 2.
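The combined pre-processing step can be sketched as a single second-order recursive filter. The sketch below is a floating-point illustration only (the codec itself is specified in fixed point), the coefficients are taken as printed in equation (4), and the names `B`, `A` and `preprocess` are illustrative.

```python
# Numerator and denominator coefficients as printed in equation (4).
B = (0.927246093, -1.8544941, 0.927246903)
A = (1.0, -1.906005859, 0.911376953)


def preprocess(samples):
    """High-pass filter the input and scale it down by 2 in one pass.

    The down-scaling is folded into the filter by halving the
    numerator coefficients.
    """
    b0, b1, b2 = (coef / 2.0 for coef in B)
    x1 = x2 = 0.0  # previous two input samples
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - A[1] * y1 - A[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out
```

With these coefficients a constant (DC) input decays to almost zero, while components near half the sampling rate pass with a gain close to 0.5, which reflects the division by 2.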

5.2 Linear prediction analysis and quantization

12.2 kbit/s mode

Short-term prediction, or linear prediction (LP), analysis is performed twice per speech frame using the auto-correlation approach with 30 ms asymmetric windows. No lookahead is used in the auto-correlation computation.

The auto-correlations of windowed speech are converted to the LP coefficients using the Levinson-Durbin algorithm. Then the LP coefficients are transformed to the Line Spectral Pair (LSP) domain for quantization and interpolation purposes. The interpolated quantified and unquantized filter coefficients are converted back to the LP filter coefficients (to construct the synthesis and weighting filters at each subframe).

10.2, 7.95, 7.40, 6.70, 5.90, 5.15, 4.75 kbit/s modes

Short-term prediction, or linear prediction (LP), analysis is performed once per speech frame using the auto-correlation approach with 30 ms asymmetric windows. A lookahead of 40 samples (5 ms) is used in the auto-correlation computation.

The auto-correlations of windowed speech are converted to the LP coefficients using the Levinson-Durbin algorithm. Then the LP coefficients are transformed to the Line Spectral Pair (LSP) domain for quantization and interpolation purposes. The interpolated quantified and unquantized filter coefficients are converted back to the LP filter coefficients (to construct the synthesis and weighting filters at each subframe).


5.2.1 Windowing and auto-correlation computation

12.2 kbit/s mode

LP analysis is performed twice per frame using two different asymmetric windows. The first window has its weight concentrated at the second subframe and it consists of two halves of Hamming windows with different sizes. The window is given by:

$$w_I(n) = \begin{cases} 0.54 - 0.46 \cos\left(\dfrac{\pi n}{L_1^{(I)} - 1}\right), & n = 0, \ldots, L_1^{(I)} - 1, \\[2ex] 0.54 + 0.46 \cos\left(\dfrac{\pi \left(n - L_1^{(I)}\right)}{L_2^{(I)} - 1}\right), & n = L_1^{(I)}, \ldots, L_1^{(I)} + L_2^{(I)} - 1. \end{cases} \qquad (5)$$

The values $L_1^{(I)} = 160$ and $L_2^{(I)} = 80$ are used. The second window has its weight concentrated at the fourth subframe and it consists of two parts: the first part is half a Hamming window and the second part is a quarter of a cosine function cycle. The window is given by:

$$w_{II}(n) = \begin{cases} 0.54 - 0.46 \cos\left(\dfrac{2 \pi n}{2 L_1^{(II)} - 1}\right), & n = 0, \ldots, L_1^{(II)} - 1, \\[2ex] \cos\left(\dfrac{2 \pi \left(n - L_1^{(II)}\right)}{4 L_2^{(II)} - 1}\right), & n = L_1^{(II)}, \ldots, L_1^{(II)} + L_2^{(II)} - 1, \end{cases} \qquad (6)$$

where the values $L_1^{(II)} = 232$ and $L_2^{(II)} = 8$ are used.

Note that both LP analyses are performed on the same set of speech samples. The windows are applied to 80 samples from the past speech frame in addition to the 160 samples of the present speech frame. No samples from future frames are used (no lookahead). A diagram of the two LP analysis windows is depicted below.

[Figure: the two LP analysis windows $w_I(n)$ and $w_{II}(n)$ plotted against time t across frame n-1 and frame n; a frame is 20 ms (160 samples) and a subframe is 5 ms (40 samples).]

Figure 1: LP analysis windows
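The two windows of equations (5) and (6) can be generated directly from their definitions. The following is a minimal floating-point sketch; the function names `window_i` and `window_ii` are illustrative, and the default lengths are the 12.2 kbit/s values given above.

```python
import math


def window_i(l1=160, l2=80):
    """Window of equation (5): two Hamming halves of different sizes."""
    rising = [0.54 - 0.46 * math.cos(math.pi * n / (l1 - 1))
              for n in range(l1)]
    falling = [0.54 + 0.46 * math.cos(math.pi * (n - l1) / (l2 - 1))
               for n in range(l1, l1 + l2)]
    return rising + falling


def window_ii(l1=232, l2=8):
    """Window of equation (6): half a Hamming window followed by a
    quarter of a cosine cycle."""
    rising = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (2 * l1 - 1))
              for n in range(l1)]
    falling = [math.cos(2.0 * math.pi * (n - l1) / (4 * l2 - 1))
               for n in range(l1, l1 + l2)]
    return rising + falling


w_i = window_i()
w_ii = window_ii()
```

Both windows span 240 samples. In this sketch `w_i` reaches its maximum at the end of the second subframe of the present frame (index 159 of the 240-sample buffer) and `w_ii` inside the fourth subframe (index 232), matching the description of where each window concentrates its weight.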

The auto-correlations of the windowed speech $s'(n)$, $n = 0, \ldots, 239$, are computed by:

$$r_{ac}(k) = \sum_{n=k}^{239} s'(n)\, s'(n-k), \qquad k = 0, \ldots, 10, \qquad (7)$$

and a 60 Hz bandwidth expansion is used by lag windowing the auto-correlations using the window:

$$w_{lag}(i) = \exp\left[ -\frac{1}{2} \left( \frac{2 \pi f_0\, i}{f_s} \right)^2 \right], \qquad i = 1, \ldots, 10, \qquad (8)$$

where $f_0 = 60$ Hz is the bandwidth expansion and $f_s = 8000$ Hz is the sampling frequency. Further, $r_{ac}(0)$ is multiplied by the white noise correction factor 1.0001, which is equivalent to adding a noise floor at -40 dB.
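The auto-correlation, lag windowing and white noise correction steps can be sketched in floating point as follows. The function names are illustrative and the code is not the fixed-point procedure of the codec; it only mirrors equations (7) and (8) and the correction factor 1.0001.

```python
import math


def autocorrelation(windowed, order=10):
    """Equation (7): r_ac(k) = sum over n >= k of s'(n) s'(n - k)."""
    count = len(windowed)
    return [sum(windowed[n] * windowed[n - k] for n in range(k, count))
            for k in range(order + 1)]


def lag_window(order=10, f0=60.0, fs=8000.0):
    """Equation (8): Gaussian lag window w_lag(i), i = 1..order."""
    return [math.exp(-0.5 * (2.0 * math.pi * f0 * i / fs) ** 2)
            for i in range(1, order + 1)]


def modified_autocorrelation(r_ac, f0=60.0, fs=8000.0):
    """Apply the white noise correction to r_ac(0) and the lag window
    to r_ac(1), ..., r_ac(order)."""
    window = lag_window(len(r_ac) - 1, f0, fs)
    return [1.0001 * r_ac[0]] + [r * w for r, w in zip(r_ac[1:], window)]
```

The -40 dB figure follows from the correction factor: adding 0.0001 times r_ac(0) corresponds to 10 log10(0.0001) = -40 dB relative to the signal energy.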

10.2, 7.95, 7.40, 6.70, 5.90, 5.15, 4.75 kbit/s modes

LP analysis is performed once per frame using an asymmetric window. The window has its weight concentrated at the fourth subframe and it consists of two parts: the first part is half a Hamming window and the second part is a quarter of a cosine function cycle. The window is given by equation (6) where the values $L_1 = 200$ and $L_2 = 40$ are used.

The auto-correlations of the windowed speech $s'(n)$, $n = 0, \ldots, 239$, are computed by equation (7) and a 60 Hz bandwidth expansion is used by lag windowing the auto-correlations using the window of equation (8). Further, $r_{ac}(0)$ is multiplied by the white noise correction factor 1.0001, which is equivalent to adding a noise floor at -40 dB.

5.2.2 Levinson-Durbin algorithm (all modes)

The modified auto-correlations $r'_{ac}(0) = 1.0001\, r_{ac}(0)$ and $r'_{ac}(k) = r_{ac}(k)\, w_{lag}(k)$, $k = 1, \ldots, 10$, are used to obtain the direct form LP filter coefficients $a_k$, $k = 1, \ldots, 10$, by solving the set of equations:

$$\sum_{k=1}^{10} a_k\, r'_{ac}(|i - k|) = -r'_{ac}(i), \qquad i = 1, \ldots, 10. \qquad (9)$$

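The set of equations in (9) is solved using the Levinson-Durbin algorithm. This algorithm uses the following recursion:

$$\begin{array}{l}
E_{LD}(0) = r'_{ac}(0) \\
\text{for } i = 1 \text{ to } 10 \text{ do} \\
\quad a_0^{(i-1)} = 1 \\
\quad k_i = -\left[ \sum_{j=0}^{i-1} a_j^{(i-1)}\, r'_{ac}(i - j) \right] / E_{LD}(i-1) \\
\quad a_i^{(i)} = k_i \\
\quad \text{for } j = 1 \text{ to } i - 1 \text{ do} \\
\quad \quad a_j^{(i)} = a_j^{(i-1)} + k_i\, a_{i-j}^{(i-1)} \\
\quad \text{end} \\
\quad E_{LD}(i) = \left(1 - k_i^2\right) E_{LD}(i-1) \\
\text{end}
\end{array}$$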

The final solution is given as $a_j = a_j^{(10)}$, $j = 1, \ldots, 10$.

The LP filter coefficients are converted to the line spectral pair (LSP) representation for quantization and interpolation purposes. The conversions to the LSP domain and back to the LP filter coefficient domain are described in the next clause.
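The Levinson-Durbin recursion above translates almost line by line into code. The following floating-point sketch is an illustration only; the function name and return layout are choices made here, and the reference codec uses a fixed-point variant.

```python
def levinson_durbin(r, order=10):
    """Solve equation (9) for a_1..a_order.

    r: modified auto-correlations r'_ac(0), ..., r'_ac(order).
    Returns (coefficients, final_error), where coefficients is
    [a_1, ..., a_order] and final_error is E_LD(order).
    """
    a = [1.0] + [0.0] * order      # a[0] stays 1 throughout
    err = r[0]                     # E_LD(0)
    for i in range(1, order + 1):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err             # reflection coefficient k_i
        prev = a[:]                # coefficients of order i - 1
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        err *= 1.0 - k * k         # E_LD(i)
    return a[1:], err
```

For example, for auto-correlations of the form r(k) = 0.5^k the recursion returns a_1 = -0.5 and zero for all higher coefficients, as expected for a first-order process.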

5.2.3 LP to LSP conversion (all modes)

The LP filter coefficients $a_k$, $k = 1, \ldots, 10$, are converted to the line spectral pair (LSP) representation for quantization and interpolation purposes. For a 10th order LP filter, the LSPs are defined as the roots of the sum and difference polynomials:

$$F_1'(z) = A(z) + z^{-11} A(z^{-1}) \qquad (10)$$

and

$$F_2'(z) = A(z) - z^{-11} A(z^{-1}), \qquad (11)$$

respectively. The polynomials $F_1'(z)$ and $F_2'(z)$ are symmetric and anti-symmetric, respectively. It can be proven that all roots of these polynomials are on the unit circle and that they alternate with each other. $F_1'(z)$ has a root $z = -1$ ($\omega = \pi$) and $F_2'(z)$ has a root $z = 1$ ($\omega = 0$). To eliminate these two roots, we define the new polynomials:

$$F_1(z) = F_1'(z) / \left(1 + z^{-1}\right) \qquad (12)$$

and

$$F_2(z) = F_2'(z) / \left(1 - z^{-1}\right). \qquad (13)$$

Each polynomial has 5 conjugate roots on the unit circle $\left(e^{\pm j \omega_i}\right)$, therefore, the polynomials can be written as

$$F_1(z) = \prod_{i = 1, 3, \ldots, 9} \left(1 - 2 q_i z^{-1} + z^{-2}\right) \qquad (14)$$

and

$$F_2(z) = \prod_{i = 2, 4, \ldots, 10} \left(1 - 2 q_i z^{-1} + z^{-2}\right), \qquad (15)$$

where $q_i = \cos(\omega_i)$ with $\omega_i$ being the line spectral frequencies (LSF) and they satisfy the ordering property $0 < \omega_1 < \omega_2 < \ldots < \omega_{10} < \pi$. We refer to $q_i$ as the LSPs in the cosine domain.

Since both polynomials $F_1(z)$ and $F_2(z)$ are symmetric, only the first 5 coefficients of each polynomial need to be computed. The coefficients of these polynomials are found by the recursive relations (for $i = 0$ to 4):

$$\begin{array}{l}
f_1(i+1) = a_{i+1} + a_{m-i} - f_1(i), \\
f_2(i+1) = a_{i+1} - a_{m-i} + f_2(i),
\end{array} \qquad (16)$$

where $m = 10$ is the predictor order.
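The recursive relations of equation (16) can be sketched as below. The starting values f1(0) = f2(0) = 1 are an assumption made explicit here: they follow from the leading coefficient 1 of A(z), and hence of the sum and difference polynomials. The function name is illustrative.

```python
def sum_diff_coefficients(a):
    """Coefficients f1(0..5) and f2(0..5) of F1(z) and F2(z).

    a: LP coefficients [a_1, ..., a_m] with m = 10.
    Starts from f1(0) = f2(0) = 1 (assumption, see the lead-in).
    """
    m = len(a)
    lp = [1.0] + list(a)           # lp[k] = a_k, with a_0 = 1
    f1 = [1.0]
    f2 = [1.0]
    for i in range(m // 2):        # i = 0 to 4 for m = 10
        f1.append(lp[i + 1] + lp[m - i] - f1[i])
        f2.append(lp[i + 1] - lp[m - i] + f2[i])
    return f1, f2
```

Because both polynomials are symmetric, the remaining coefficients are mirror images of the first five, so f(10 - i) = f(i).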

The LSPs are found by evaluating the polynomials $F_1(z)$ and $F_2(z)$ at 60 points equally spaced between 0 and $\pi$ and checking for sign changes. A sign change signifies the existence of a root and the sign change interval is then divided 4 times to better track the root. The Chebyshev polynomials are used to evaluate $F_1(z)$ and $F_2(z)$. In this method the roots are found directly in the cosine domain $\{q_i\}$. The polynomials $F_1(z)$ or $F_2(z)$ evaluated at $z = e^{j\omega}$ can be written as:

$$F(\omega) = 2 e^{-j 5 \omega}\, C(x),$$

with:

$$C(x) = T_5(x) + f(1)\, T_4(x) + f(2)\, T_3(x) + f(3)\, T_2(x) + f(4)\, T_1(x) + f(5)/2, \qquad (17)$$


where $T_m(x) = \cos(m\omega)$ is the $m$th order Chebyshev polynomial, and $f(i)$, $i = 1, \ldots, 5$, are the coefficients of either $F_1(z)$ or $F_2(z)$, computed using the equations in (16). The polynomial $C(x)$ is evaluated at a certain value of $x = \cos(\omega)$ using the recursive relation:

$$\begin{array}{l}
\text{for } k = 4 \text{ down to } 1 \\
\quad \lambda_k = 2 x \lambda_{k+1} - \lambda_{k+2} + f(5 - k) \\
\text{end} \\
C(x) = x \lambda_1 - \lambda_2 + f(5)/2,
\end{array}$$

with initial values $\lambda_5 = 1$ and $\lambda_6 = 0$.

The details of the Chebyshev polynomial evaluation method are found in P. Kabal and R.P. Ramachandran [4].
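The evaluation of C(x) and the sign-change search can be sketched as follows. This is a simplified illustration: it searches a single polynomial on the 60-point grid and returns the midpoint of the bracketing interval after 4 bisections. The midpoint choice and the function names are assumptions of this sketch, and the interleaved search over both polynomials is not shown.

```python
import math


def chebyshev_eval(x, f):
    """Evaluate C(x) of equation (17) with the recursion on lambda_k.

    f[i] holds f(i) for i = 0..5; f[0] is not used by the recursion.
    """
    lam_k1, lam_k2 = 1.0, 0.0              # lambda_5 = 1, lambda_6 = 0
    for k in range(4, 0, -1):              # k = 4 down to 1
        lam_k = 2.0 * x * lam_k1 - lam_k2 + f[5 - k]
        lam_k2, lam_k1 = lam_k1, lam_k
    return x * lam_k1 - lam_k2 + f[5] / 2.0


def find_roots(f, grid_points=60, bisections=4):
    """Locate the roots q_i of C(x) in the cosine domain."""
    roots = []
    x_prev = 1.0                           # x = cos(0)
    y_prev = chebyshev_eval(x_prev, f)
    for step in range(1, grid_points + 1):
        x = math.cos(math.pi * step / grid_points)
        y = chebyshev_eval(x, f)
        if y_prev * y < 0.0:               # sign change: a root is inside
            x_lo, y_lo, x_hi = x_prev, y_prev, x
            for _ in range(bisections):
                x_mid = 0.5 * (x_lo + x_hi)
                y_mid = chebyshev_eval(x_mid, f)
                if y_lo * y_mid <= 0.0:
                    x_hi = x_mid
                else:
                    x_lo, y_lo = x_mid, y_mid
            roots.append(0.5 * (x_lo + x_hi))
        x_prev, y_prev = x, y
    return roots
```

The roots come out in decreasing order of x, which corresponds to increasing frequency. With all f(1..5) equal to zero, `chebyshev_eval` reduces to T_5(x), which gives a quick sanity check of the recursion.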

5.2.4 LSP to LP conversion (all modes)

Once the LSPs are quantified and interpolated, they are converted back to the LP coefficient domain $\{a_k\}$. The conversion to the LP domain is done as follows. The coefficients of $F_1(z)$ or $F_2(z)$ are found by expanding equations (14) and (15) knowing the quantified and interpolated LSPs $q_i$, $i = 1, \ldots, 10$. The following recursive relation is used to compute $f_1(i)$:

$$\begin{array}{l}
\text{for } i = 1 \text{ to } 5 \\
\quad f_1(i) = -2 q_{2i-1}\, f_1(i-1) + 2 f_1(i-2) \\
\quad \text{for } j = i - 1 \text{ down to } 1 \\
\quad \quad f_1(j) = f_1(j) - 2 q_{2i-1}\, f_1(j-1) + f_1(j-2) \\
\quad \text{end} \\
\text{end}
\end{array}$$

with initial values $f_1(0) = 1$ and $f_1(-1) = 0$. The coefficients $f_2(i)$ are computed similarly by replacing $q_{2i-1}$ by $q_{2i}$.

Once the coefficients $f_1(i)$ and $f_2(i)$ are found, $F_1(z)$ and $F_2(z)$ are multiplied by $1 + z^{-1}$ and $1 - z^{-1}$, respectively, to obtain $F_1'(z)$ and $F_2'(z)$; that is:

$$\begin{array}{ll}
f_1'(i) = f_1(i) + f_1(i-1), & i = 1, \ldots, 5, \\
f_2'(i) = f_2(i) - f_2(i-1), & i = 1, \ldots, 5.
\end{array} \qquad (18)$$

Finally the LP coefficients are found by:

$$a_i = \begin{cases} 0.5\, f_1'(i) + 0.5\, f_2'(i), & i = 1, \ldots, 5, \\ 0.5\, f_1'(11 - i) - 0.5\, f_2'(11 - i), & i = 6, \ldots, 10. \end{cases} \qquad (19)$$

This is directly derived from the relation $A(z) = \left(F_1'(z) + F_2'(z)\right)/2$, and considering the fact that $F_1'(z)$ and $F_2'(z)$ are symmetric and anti-symmetric polynomials, respectively.
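The conversion from LSPs back to LP coefficients can be sketched as below, following the recursion for f1(i) and f2(i) and equations (18) and (19). This is a floating-point illustration; the function names and the example LSP set `EXAMPLE_OMEGAS` are hypothetical and chosen only to exercise the code.

```python
import math


def lsp_poly(q_subset):
    """Coefficients f(0..5) of the product of (1 - 2 q z^-1 + z^-2)
    over the 5 LSPs in q_subset, using f(0) = 1 and f(-1) = 0."""
    f = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for i in range(1, 6):
        q = q_subset[i - 1]
        f_im2 = f[i - 2] if i >= 2 else 0.0
        f[i] = -2.0 * q * f[i - 1] + 2.0 * f_im2
        for j in range(i - 1, 0, -1):
            f_jm2 = f[j - 2] if j >= 2 else 0.0
            f[j] = f[j] - 2.0 * q * f[j - 1] + f_jm2
    return f


def lsp_to_lp(q):
    """Convert LSPs q_1..q_10 (cosine domain) to a_1..a_10."""
    f1 = lsp_poly(q[0::2])                          # q_1, q_3, ..., q_9
    f2 = lsp_poly(q[1::2])                          # q_2, q_4, ..., q_10
    f1p = [f1[i] + f1[i - 1] for i in range(1, 6)]  # f1'(1..5), eq. (18)
    f2p = [f2[i] - f2[i - 1] for i in range(1, 6)]  # f2'(1..5), eq. (18)
    a = [0.0] * 10
    for i in range(1, 6):                           # eq. (19), i = 1..5
        a[i - 1] = 0.5 * f1p[i - 1] + 0.5 * f2p[i - 1]
    for i in range(6, 11):                          # eq. (19), i = 6..10
        a[i - 1] = 0.5 * f1p[10 - i] - 0.5 * f2p[10 - i]
    return a


# Hypothetical, evenly spaced line spectral frequencies in radians.
EXAMPLE_OMEGAS = [0.3 * (i + 1) for i in range(10)]
a_example = lsp_to_lp([math.cos(w) for w in EXAMPLE_OMEGAS])
```

As a consistency check, the resulting A(z) should make the sum polynomial of equation (10) vanish at the odd-indexed frequencies and the difference polynomial of equation (11) vanish at the even-indexed ones.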

5.2.5 Quantization of the LSP coefficients

12.2 kbit/s mode

The two sets of LP filter coefficients per frame are quantified using the LSP representation in the frequency domain; that is:

$$f_i = \frac{f_s}{2\pi} \arccos(q_i), \qquad i = 1, \ldots, 10, \qquad (20)$$

where $f_i$ are the line spectral frequencies (LSF) in Hz [0, 4000] and $f_s = 8000$ is the sampling frequency. The LSF vector is given by $\mathbf{f}^t = \left[ f_1\ f_2\ \ldots\ f_{10} \right]$, with $t$ denoting transpose.
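Equation (20) is a direct mapping from the cosine domain to Hz and can be sketched in one function. The name `lsp_to_lsf_hz` is illustrative.

```python
import math


def lsp_to_lsf_hz(q, fs=8000.0):
    """Equation (20): f_i = fs / (2 pi) * arccos(q_i), in Hz."""
    return [fs / (2.0 * math.pi) * math.acos(qi) for qi in q]
```

For instance, q = 1 maps to 0 Hz, q = 0 to 2000 Hz and q = -1 to 4000 Hz, so an LSP set that decreases in the cosine domain yields LSFs that increase in frequency.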

A 1st order MA prediction is applied, and the two residual LSF vectors are jointly quantified using split matrix quantization (SMQ). The prediction and quantization are performed as follows. Let $\mathbf{z}^{(1)}(n)$ and $\mathbf{z}^{(2)}(n)$ denote the mean-removed LSF vectors at frame $n$. The prediction residual vectors $\mathbf{r}^{(1)}(n)$ and $\mathbf{r}^{(2)}(n)$ are given by:

$$\mathbf{r}^{(1)}(n) = \mathbf{z}^{(1)}(n) - \mathbf{p}(n), \quad \text{and} \quad \mathbf{r}^{(2)}(n) = \mathbf{z}^{(2)}(n) - \mathbf{p}(n), \qquad (21)$$

where $\mathbf{p}(n)$ is the predicted LSF vector at frame $n$. First order moving-average (MA) prediction is used where:

$$\mathbf{p}(n) = 0.65\, \hat{\mathbf{r}}^{(2)}(n-1), \qquad (22)$$

where $\hat{\mathbf{r}}^{(2)}(n-1)$ is the quantified second residual vector at the past frame.
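Equations (21) and (22) can be sketched as below. The mean LSF vector is passed in as an argument because its values are not given in this clause; `mean_lsf`, `prediction_residuals` and `reconstruct_lsf` are hypothetical names used only for illustration, and the quantizer itself is not shown.

```python
PREDICTION_FACTOR = 0.65  # MA prediction factor of equation (22)


def predict(prev_quantized_r2):
    """Equation (22): p(n) = 0.65 * quantified r2 of the past frame."""
    return [PREDICTION_FACTOR * r for r in prev_quantized_r2]


def prediction_residuals(lsf_1, lsf_2, mean_lsf, prev_quantized_r2):
    """Equation (21): remove the mean, then subtract the prediction."""
    p = predict(prev_quantized_r2)
    r1 = [f - m - pi for f, m, pi in zip(lsf_1, mean_lsf, p)]
    r2 = [f - m - pi for f, m, pi in zip(lsf_2, mean_lsf, p)]
    return r1, r2


def reconstruct_lsf(quantized_r, mean_lsf, prev_quantized_r2):
    """Inverse step: add back the prediction and the mean."""
    p = predict(prev_quantized_r2)
    return [r + m + pi for r, m, pi in zip(quantized_r, mean_lsf, p)]
```

Both residual vectors of a frame share the same prediction, which depends only on the quantified second residual of the previous frame.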

The two LSF residual vectors $\mathbf{r}^{(1)}$ and $\mathbf{r}^{(2)}$ are jointly quantified using split matrix quantization (SMQ). The matrix $\left( \mathbf{r}^{(1)}\ \mathbf{r}^{(2)} \right)$ is split into 5 submatrices of dimension 2 x 2 (two elements from each vector). For example, the first submatrix consists of the elements $r_1^{(1)}$, $r_2^{(1)}$, $r_1^{(2)}$, and $r_2^{(2)}$. The 5 submatrices are quantified with 7, 8, 8+1, 8, and 6 bits, respectively. The third submatrix uses a 256-entry signed codebook (8-bit index plus 1-bit sign).

A weighted LSP distortion measure is used in the quantization process. In general, for an input LSP vector $\mathbf{f}$ and a quantified vector at index $k$, $\hat{\mathbf{f}}^k$, the quantization is performed by finding the index $k$ which minimizes:

$$E_{LSP} = \sum_{i=1}^{10} \left[ f_i w_i - \hat{f}_i^k w_i \right]^2. \qquad (23)$$

The weighting factors $w_i$, $i = 1, \ldots, 10$, are given by

$$w_i = \begin{cases} 3.347 - \dfrac{1.547}{450}\, d_i, & \text{for } d_i < 450, \\[2ex] 1.8 - \dfrac{0.8}{1050} \left( d_i - 450 \right), & \text{otherwise,} \end{cases} \qquad (24)$$

where $d_i = f_{i+1} - f_{i-1}$ with $f_0 = 0$ and $f_{11} = 4000$. Here, two sets of weighting coefficients are computed for the two LSF vectors. In the quantization of each submatrix, two weighting coefficients from each set are used with their corresponding LSFs.
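The weighting factors of equation (24) and the distortion measure of equation (23) can be sketched as follows. This is a floating-point illustration with illustrative function names; the codebook search that minimizes the distortion over the index k is not shown.

```python
def lsf_weights(f):
    """Equation (24): weights from the spacing d_i = f_(i+1) - f_(i-1),
    with f_0 = 0 and f_11 = 4000 Hz. f holds f_1..f_10 in Hz."""
    ext = [0.0] + list(f) + [4000.0]
    weights = []
    for i in range(1, len(f) + 1):
        d = ext[i + 1] - ext[i - 1]
        if d < 450.0:
            weights.append(3.347 - 1.547 / 450.0 * d)
        else:
            weights.append(1.8 - 0.8 / 1050.0 * (d - 450.0))
    return weights


def weighted_distortion(f, f_hat, weights):
    """Equation (23): sum of squared weighted differences."""
    return sum((fi * wi - fhi * wi) ** 2
               for fi, fhi, wi in zip(f, f_hat, weights))
```

The two branches of equation (24) meet at d_i = 450, where both give 1.8, so closely spaced LSFs (small d_i) receive larger weights than widely spaced ones.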

10.2, 7.95, 7.40, 6.70, 5.90, 5.15, 4.75 kbit/s modes

The set of LP filter coefficients per frame is quantified using the LSP representation in the frequency domain using equation (20).

A 1st order MA prediction is applied, and the residual LSF vector is quantified using split vector quantization. The prediction and quantization are performed as follows. Let $\mathbf{z}(n)$ denote the mean-removed LSF vector at frame $n$. The prediction residual vector $\mathbf{r}(n)$ is given by:
