Simple Digital Speech Synthesis

(1)

American Journal of Computational Linguistics

Microfiche 16

SIMPLE DIGITAL SPEECH SYNTHESIS

William M. Fisher and A. Maynard Engebretson

Central Institute for the Deaf
818 South Euclid Street
St. Louis, Mo. 63110

and

Biomedical Computer Laboratory
Washington University School of Medicine
700 South Euclid Street
St. Louis, Mo. 63110

Copyright 1975

(2)

Relatively simple computer methods for synthesizing speech to be used in phonetic/perceptual research are presented, with particular reference to the problems and successes encountered in the development of such a system at Central Institute for the Deaf and the Biomedical Computer Laboratory of Washington University. The purpose of this paper is to present a synthesis and clarification of established methods so as to encourage other computational linguists to tackle digital speech synthesis. The approach is semi-tutorial: crucial algorithms are given in Fortran or block-diagram form, and bibliographic references that were found to be most useful in the system development are listed and discussed.

The system described requires a minimum of hardware; a mini-computer is sufficient, if it is equipped with tape or disk secondary memory. The sound pressure wave is calculated entirely by software, and only a digital-to-analog converter and a low-pass filter are required to convert it to a recordable electrical signal.

The vocal apparatus is simulated by a rough model which is still general enough to make most speech sounds. The two main types of excitation of the vocal tract -- periodic glottal waves for voicing and random noise for frication or aspiration -- are supplied by algorithms presented as function subroutines. The effect of the vocal tract on these inputs is modelled by combinations of three other elemental functions, whose coding is based on recursive equation theory for computational efficiency. A resonance provides the user with a means for accentuating the signal at a certain frequency, such as a formant frequency of a vowel; an anti-resonance or notch filter is provided to cut back the energy at a certain frequency, as in simulation of the nasal anti-formant; and a radiation-effect subroutine simulates the effect on the speech signal of passage from the lips through a short stretch of air. Empirically obtained wave shapes and spectra of the outputs of these five basic functions are given in order to give the reader a better feel for what they do.

These five elements can be combined in a number of ways. A detailed discussion is given for one of the simplest reasonable models, in which the glottal wave and frication generators excite a series of three variable resonators, using a set of fixed resonances to simulate higher frequency formants in addition to the radiation-effect simulator. Several other more complex arrangements are presented, including parallel resonator models and models with separate filters for shaping voiced, fricative, and nasal components, and their advantages and disadvantages are discussed.

An example is given of a complete modular Fortran program generating the word "seat". The equations for specifying the parameters controlling the elemental functions were derived, with much effort, from analysis of one token utterance, and spectrographs of the real and synthetic words are shown to illustrate the degree of naturalness obtainable with the simple three-resonator series model. A simpler example for generating a constant vowel sound is also given, along with a summary of data useful in making many vowels.

This paper is a slightly expanded version of one given orally

(3)

Table of Contents
                                                    Page No.
I.   Introduction . . . . . . . . . . . . . . . . . .   4
III. Basic Elements
     A. Sources . . . . . . . . . . . . . . . . . . .  10
        1. Glottal Wave Generator . . . . . . . . . .  10
        2. White Noise Generator  . . . . . . . . . .  16
     B. Spectral Shaping Elements . . . . . . . . . .  17
        1. Resonances and Anti-resonances . . . . . .  17
        2. Radiation Effect
IV.  Organization of Elements . . . . . . . . . . . .  22
     A. The Simplest Model
     B. Control . . . . . . . . . . . . . . . . . . .  28
     C. Other Models  . . . . . . . . . . . . . . . .  30
        1. Parallel Formant
        2. Separate Noise Shaping Channel
        3. Other More Complex Models  . . . . . . . .  33
     D. A Complete Example  . . . . . . . . . . . . .  33
V.   Final Remarks  . . . . . . . . . . . . . . . . .  44

(4)

I. Introduction

Several years ago, it was decided that the Research Department at Central Institute for the Deaf in St. Louis should have a digital speech synthesizer to aid studies in psychoacoustics and phonetic perception. The equipment on hand at the time consisted primarily of a 12-bit-word mini-computer with keyboard and scope, with special-purpose hardware for doing digital-to-analog conversion, low-pass filtering, and floating-point arithmetic. More peripheral devices and core memory have now been added. We have been working since then on writing digital computer programs to synthesize speech, studying the literature and gaining practical experience.

Almost all of the theory and techniques necessary to program the synthesis of English sounds can be found in published literature, but in bits and pieces, here and there. Utilizing the contributions of many authors, plus our own experience, we present and explain a basic program for synthesizing speech, in the hope that computational linguists who have not worked with low-level speech phenomena may be encouraged to program synthesizers. The use of synthetic-speech stimuli has been extremely important to the investigation of the perceptually distinctive features of speech and of low-level phonological rules, but much work remains undone. Synthesizing speech is clearly important to phonetic research, and the field could well use more researchers with linguistic training.

The system described in this paper requires

(5)

II. Overview

Synthesizers use varying amounts of special-purpose hardware. The type of synthesis we describe here uses the bare minimum, calculating the speech wave on a digital computer and requiring only a digital-to-analog converter and a low-pass analog filter as special hardware. This minimum set of equipment is shown in Figure 1, page 6. If we assume that tape or disk secondary storage is available, then a 4k 12-bit-word mini-computer is sufficiently large, and a 12-bit D/A converter will give enough dynamic range. Once the speech wave is generated and stored on tape or disk, there is the problem of writing a program to output enough of it synchronously at a fast enough rate. We will not go further into this problem here, since the solution will depend on the particular machine installation you have.

Whatever output sample rate you achieve, there are two things to note: the analog low-pass filter should pass only frequencies below half the output sample rate, and the output sample rate is a parameter whose value must be fed into the digital calculations.

Although our synthesizer can be described as a terminal analog model of the vocal apparatus, the attitude we take is that our method of synthesis is used rather to produce the significant acoustic features of speech. To set the stage for an understanding of the synthesizer presented here, and for the benefit of those not familiar with acoustic phonetics, let us

(6)

[Figure 1. Block diagram of the minimum set of equipment; one block is labelled KEYBOARD.]

(7)

Figure 2, page 8, shows a typical spectrogram of the word "seat." Frequency is the vertical axis, time the horizontal, and darker marks indicate more energy. The mottled area at the upper left is the quasi-random noise of the /s/ sound. To make other fricatives, such as /f/, we need to alter the frequency spectrum and intensity of the noise. The dark horizontal bands in the center are concentrations of energy -- resonances called "formants" -- characteristic of vowel sounds. To make different vowels, we need to change the center frequencies, bandwidths, and relative intensities of the three lower formants visible here. There are higher-frequency formants, which do not show up well in this spectrogram, but they seem to be important only to the naturalness of the speech, not to which vowel is perceived. Note the beginning and ending slopes of the formants; these formant transitions are crucial to the perception of occlusive consonants such as the stops /b/, /d/, and /g/. And finally, the vertical mark at the far right is a burst of noise marking the release of the final consonant.

The detailed algorithms we present here will be expressed in Fortran for clarity, although the synthesizer with which we have had the most experience is coded in assembly language. We are presently converting to Fortran, and the subroutines and final example program listed in this paper have been tested in their Fortran form.

(8)

[Figure 2. Spectrogram of the word "seat". Horizontal axis: time (in msec); vertical axis: frequency.]

(9)

C     SKELETON OF SYNTHESIS PROGRAM
      DIMENSION NBUF(256)
C     (THE TWO LOOP HEADERS BELOW ARE RECONSTRUCTED FROM THE CONTINUE
C     STATEMENTS; NBUFS, THE NUMBER OF BUFFER-LOADS, IS AN ASSUMED NAME)
      DO 800 NB=1,NBUFS
      DO 750 NPT=1,256
C     GENERATE SPEECH WAVE POINT AS THE VALUE OF YN
C     STORE SPEECH WAVE POINT
      NBUF(NPT)=IFIX(YN)
  750 CONTINUE
C     WRITE OUTPUT BUFFER
      CALL WRITB(NBUF)
  800 CONTINUE
      CALL EXIT
      END

Figure 3. Over-all Synthesizer Program Logic. WRITB is a subroutine to write the contents

(10)

this logic is that we synthesize the speech a point at a time, in one pass, storing each buffer-full on tape or disk as it is generated. This program will produce an integral number of buffer-loads, but other methods for terminating the main loop are easy to implement.
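The one-pass, buffer-at-a-time logic can be sketched in free-form Fortran as follows. This is only an illustration under stated assumptions: `wave_point` is a placeholder sine generator standing in for the real synthesis step, `writb` merely counts points instead of writing to tape or disk, and the names `skeleton_sketch` and `synthesize` are hypothetical, not taken from the program in Figure 3.

```fortran
module skeleton_sketch
  implicit none
  integer, parameter :: nbsize = 256      ! points per buffer, as in NBUF(256)
contains
  ! Placeholder for the real per-point synthesis: a 100 Hz sine wave.
  real function wave_point(tms)
    real, intent(in) :: tms               ! simulated time in msec
    wave_point = 1000.0 * sin(2.0 * 3.14159265 * 100.0 * tms / 1000.0)
  end function wave_point

  ! Stand-in for WRITB: only counts the points it is handed.
  subroutine writb(nbuf, nwritten)
    integer, intent(in) :: nbuf(nbsize)
    integer, intent(inout) :: nwritten
    nwritten = nwritten + size(nbuf)
  end subroutine writb

  ! One pass over nbufs buffer-loads; returns the number of points written.
  integer function synthesize(nbufs, osr)
    integer, intent(in) :: nbufs
    real, intent(in) :: osr               ! output sample rate, samples/sec
    integer :: nbuf(nbsize), nb, npt, nwritten
    real :: t, tdel
    tdel = 1000.0 / osr                   ! msec between output samples
    t = 0.0
    nwritten = 0
    do nb = 1, nbufs
       do npt = 1, nbsize
          nbuf(npt) = int(wave_point(t))  ! truncation, like IFIX
          t = t + tdel
       end do
       call writb(nbuf, nwritten)         ! one buffer-full per outer pass
    end do
    synthesize = nwritten
  end function synthesize
end module skeleton_sketch

program skeleton_demo
  use skeleton_sketch
  implicit none
  print *, 'points written:', synthesize(4, 10000.0)
end program skeleton_demo
```

Because the outer loop always completes whole buffers, the number of points produced is a multiple of 256, which is the "integral number of buffer-loads" behaviour noted above.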

III. Basic Elements

There are five basic elements in this method of speech synthesis:

1. A glottal wave generator, with controls for repetition rate (pitch) and amplitude;
2. A white-noise generator, with a control for amplitude;
3. A resonant filter, with center frequency and bandwidth controls;
4. An anti-resonant filter, with similar controls; and
5. A radiation-effect simulator.

These five elements can be connected in a variety of ways to produce models of greater or lesser complexity. The glottal wave generator and noise generator produce sounds whose spectra are then shaped by combinations of the other elements.

A. Sources

1. Glottal-Wave Generator

Natural glottal waves, while subject to much variation, are usually considered to consist of three parts: a glottis-opening phase in which the volume velocity is increasing, a glottis-closing phase in which the volume velocity is decreasing, and a glottis-closed phase in which the volume velocity is zero. The spectrum of such waves is supposed to fall off at about 10 to 12 dB/octave,1

(11)

(12)

[Figure 4 panel labels: SPEECH WAVE (NORMALIZED) against TIME, MSEC.; PHASE ANGLE SPECTRUM.]

Figure 4. Wave Shape and Spectral Analysis of a Typical Glottal Wave Period. The intensity spectrum has been normalized so that the largest component is at the 80 dB level; before normalization it was 61.54 dB relative to an arbitrary standard of 0.25. The horizontal line across the intensity spectrum graph is a conservatively estimated noise cut-off line; any component with magnitude less than this level may be the product of computational noise. The phase angles of the components are $\varphi_n$, where the wave is represented by $\sum_n A_n \sin(n\omega_1 t + \varphi_n)$ and $\varphi_1$ is arbitrarily zero. The

(13)

execution time and the polynomial scored the highest in his tests of naturalness. Figure 5, page 14, shows the wave shape and spectrum of a linear approximation and Figure 6, page 14, shows that of a polynomial approximation. The over-all falloff of the spectrum of the polynomial more nearly matches our example wave, but no lobes are apparent, as they are in the spectra of the linear approximation and the natural example. Judging from some informal listening tests we have made, these distinctions do not seem to make a great difference: both approximations sound good.

As Figure 7, page 15, we present a Fortran function GLOT(P,AV) for generating glottal waves using the polynomial approximation. Values of pitch (P) and zero-to-peak amplitude (AV) are passed directly as control parameters. The subroutine uses the common area to store several variables, which of course could be declared as formal parameters instead. ISWV is a voicing switch used in the logic internal to GLOT. TDEL is the period between output sample points in milliseconds; in the over-all initialization of the program, its value should be calculated as 1000.0/OSR, where OSR is the output sample rate in samples per second. TG is a variable used as a simulated time clock by GLOT, keeping track of how far through the glottal wave it has gone. TP, T1, and T2 are durations (in milliseconds) from the beginning of the glottal pulse, calculated and used by GLOT: TP is the duration of the pulse, T1 is the duration of the opening phase, and T2 is the duration of the opening and

(14)

[Figure 5 panel label: PHASE ANGLE SPECTRUM.]

Figure 5. Wave Shape and Spectral Analysis of a Linear Approximation to a Glottal Wave. For details of the display, cf. Figure 4.

[Figure 6 panel label: SPEECH WAVE (NORMALIZED) against TIME, MSEC.]

Figure 6. Wave Shape and Spectral Analysis of a Polynomial Approximation to a Glottal Wave. For

(15)

Figure 7.

      FUNCTION GLOT(P,AV)
      COMMON ISWV,TDEL,TG,TP,T1,T2,OPTR,CLTR,AVSAVE
C  PRODUCES A POLYNOMIAL APPROXIMATION TO A GLOTTAL WAVE
C  WITH CONSTANT WAVE SHAPE
C
C  USES THE FOLLOWING VARIABLES FROM COMMON:
C  ISWV,TDEL,TG,TP,T1,T2,OPTR,CLTR,AVSAVE
C
C  FIRST SEE IF WE NEED TO RE-INITIALIZE FOR BEGINNING OF GLOTTAL WAVE
C
C  ARE WE NOW GENERATING VOICE?
      IF (ISWV) 900,20,10
C  YES -- IF WE ARE NOT AT THE END OF A PULSE,
C  PARAMETERS ARE O.K.
C  OTHERWISE WE NEED TO CHECK AV TO SEE
C  IF WE NEED ANOTHER PULSE
   10 IF (TG+0.5*TDEL-TP) 50,20,20
C  EITHER WE HAVE NOT BEEN GENERATING VOICE OR
C  WE HAVE JUST FINISHED A PULSE --
C  IF AV .GT. 0, INITIALIZE TO GENERATE A(NOTHER) PULSE AND (RE)SET ISWV TO 1
C  OTHERWISE (RE)SET ISWV TO 0
   20 IF (AV) 30,30,40
   30 ISWV=0
      GO TO 50
C  INITIALIZE FOR ANOTHER PULSE
   40 ISWV=1
      AVSAVE=AV
      TG=0.0
      TP=1000.0/P
      T1=OPTR*TP
      T2=T1+CLTR*TP
   50 CONTINUE
C
C  END OF PARAMETER SETTING
C
C  BEGINNING OF LOGIC TO GENERATE GLOTTAL WAVE
C
      IF (ISWV) 57,55,57
   55 Y=0.0
      GO TO 180
   57 CONTINUE
C
C  IF TG .LT. T1, Y=AV*(3*(TG/T1)**2-2*(TG/T1)**3)
  130 IF (TG-T1) 140,150,150
  140 Y=AVSAVE*(3.0*((TG/T1)**2)-2.0*((TG/T1)**3))
      GO TO 180
C  ELSE IF TG .LT. T2, Y=AV*(1-(((TG-T1)/(T2-T1))**2))
  150 IF (TG-T2) 160,170,170
C  (STATEMENT 160 IS RESTORED FROM THE COMMENT ABOVE)
  160 Y=AVSAVE*(1.0-(((TG-T1)/(T2-T1))**2))
      GO TO 180
C  ELSE Y=0
  170 Y=0.0
      GO TO 180
C
C  END OF GLOTTAL PULSE GENERATION
C  RETURN VALUE OF GLOTTAL WAVE
C  AND INCREMENT TG
C
  180 GLOT=Y
      TG=TG+TDEL
      RETURN
C
C  ERROR IN VALUE OF ISWV
C
  900 WRITE(1,910) ISWV
  910 FORMAT(" ERR IN GLOT: ISWV=",I5)
      CALL HOLD
      CALL EXIT
      END
C  GLOT

(16)

ratio") is the fraction of the wave occupied by the opening phase, and CLTR is the fraction occupied by closing. In the over-all initialization, OPTR should be set to .40 and CLTR to .16, values which maximize naturalness according to Rosenberg's paper.

If the instantaneous values of AV were used by GLOT, the standard wave shape would be altered if AV were changing during generation of a glottal wave. To keep the wave shape constant, GLOT uses the variable AVSAVE to store the value of AV at the beginning of each pitch period, and during the generation of the pulse, AVSAVE is used as the (constant) amplitude. Between calls to GLOT, the values of ISWV, TG, TP, T1, T2, and AVSAVE should not be altered.
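The pulse shape itself can be isolated from the bookkeeping in GLOT. The following is a minimal free-form Fortran sketch of the polynomial pulse as stated in the comment lines of Figure 7, assuming a 100 Hz pitch and a 10,000 samples/sec output rate purely for illustration; the module and function names are hypothetical, and the amplitude is passed in directly rather than held in AVSAVE.

```fortran
module glottal_sketch
  implicit none
contains
  ! Polynomial glottal pulse.
  ! tg: time since start of pulse (msec); t1: end of opening phase (msec);
  ! t2: end of closing phase (msec); av: zero-to-peak amplitude.
  real function pulse_poly(tg, t1, t2, av)
    real, intent(in) :: tg, t1, t2, av
    real :: r
    if (tg .lt. t1) then
       r = tg / t1                          ! opening phase, rises 0 -> av
       pulse_poly = av * (3.0 * r**2 - 2.0 * r**3)
    else if (tg .lt. t2) then
       r = (tg - t1) / (t2 - t1)            ! closing phase, falls av -> 0
       pulse_poly = av * (1.0 - r**2)
    else
       pulse_poly = 0.0                     ! glottis closed
    end if
  end function pulse_poly
end module glottal_sketch

program glottal_demo
  use glottal_sketch
  implicit none
  real, parameter :: osr = 10000.0          ! assumed output sample rate
  real, parameter :: p = 100.0              ! assumed pitch, Hz
  real :: tdel, tp, t1, t2, tg
  integer :: i, n
  tdel = 1000.0 / osr                       ! msec per sample
  tp = 1000.0 / p                           ! pitch period, msec
  t1 = 0.40 * tp                            ! OPTR = .40
  t2 = t1 + 0.16 * tp                       ! CLTR = .16
  n = nint(tp / tdel)
  tg = 0.0
  do i = 1, n, 5                            ! print every fifth sample
     print '(f7.2, f10.2)', tg, pulse_poly(tg, t1, t2, 1000.0)
     tg = tg + 5.0 * tdel
  end do
end program glottal_demo
```

With these settings the pulse occupies the first 56 percent of each period and the remainder is silent, which is why the amplitude must be frozen for the duration of the pulse if the shape is to stay constant.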

Rosenberg's equations for the polynomial approximation are used in GLOT; to get the linear approximation, the following two lines of Fortran should be substituted in GLOT for lines number 60 and 64:

2. White Noise Generator

Almost any reasonably good random-number generator can be used as a source of white noise. If the spectrum of the random numbers produced is flat, it will be easier to shape into the desired spectra for the different fricative sounds. The algorithm we use was developed for use in synthetic speech work: it is very fast, and produces noise with a quite

(17)

and Perry,1 is easy to follow and implement. The logic of the algorithm is formally stated in Fortran in Figure 8, page 18, but if possible this should be one function coded in assembly language: if the right machine instructions are available, it will be a snap, but bit manipulation in Fortran is very slow. Our function IRN4(X) -- X is a dummy variable required by our Fortran compiler -- contains this algorithm in assembly language, producing on successive calls a series of random numbers with a uniform distribution over the interval from -2047 to +2047. To implement a white noise generator, only this line of coding is needed:

where AN is a variable whose value is the amplitude of noise desired.

A typical stretch of noise produced in this manner, along with its spectrum, is given as Figure 9, page 19. Note that there does not appear to be any significant deviation from flatness in the spectrum intensity.
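Where bit-manipulation intrinsics are available, the logic of Figure 8 need not be spelled out one bit at a time. The sketch below packs each 19-bit word into an ordinary integer and uses the standard intrinsics `ieor`, `ishftc` and `ishft`. It is an illustration under assumptions, not the IRN4 routine itself: the seed values are arbitrary, the name `irn_sketch` is hypothetical, and the final mapping of the left-most 12 bits to a signed value (subtracting 2048, giving -2048 to +2047) is our guess at what MINT does.

```fortran
module noise_sketch
  implicit none
  integer, parameter :: nbits = 19          ! word length used in Figure 8
  ! Past values; arbitrary non-zero seeds below 2**19 (assumed).
  integer :: nm1 = 123457, nm2 = 271828
contains
  integer function irn_sketch()
    integer :: nx
    nx = ieor(nm1, nm2)                     ! bit-wise exclusive or of past values
    nx = ishftc(nx, -8, nbits)              ! rotate 8 places right within 19 bits
    nm2 = nm1                               ! shift past values
    nm1 = nx
    ! keep the left-most 12 of the 19 bits, then re-centre about zero
    irn_sketch = ishft(nx, -(nbits - 12)) - 2048
  end function irn_sketch
end module noise_sketch

program noise_demo
  use noise_sketch
  implicit none
  real, parameter :: an = 500.0             ! noise amplitude (assumed value)
  integer :: i
  do i = 1, 5
     print *, an * (real(irn_sketch()) / 2047.0)
  end do
end program noise_demo
```

The scaling in the demo follows the noise line of Figure 17, where the integer is divided by 2047.0 before being multiplied by AN.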

B. Spectral Shaping Elements

1. Resonances and Anti-resonances

We use recursive equations, a technique developed by electrical engineers, to simulate resonant and anti-resonant (notch) filters as elements to shape spectra. Each individual filter can be represented by a second-order linear differential

1. Rader, Rabiner and Schafer (1970) and Perry, Schafer.

(18)

      FUNCTION IRN4(X)
      COMMON NM1(19),NM2(19)
      DIMENSION NX1(19),NX2(19)
C  FORM BIT-WISE EXCLUSIVE OR OF NM1,NM2
      DO 10 I=1,19
   10 NX1(I)=MOR(NM1(I),NM2(I))
C  ROTATE NX1, 8 PLACES TO THE RIGHT
      DO 20 I=1,11
   20 NX2(I+8)=NX1(I)
      DO 30 I=12,19
   30 NX2(I-11)=NX1(I)
C  SHIFT PAST VALUES
      DO 40 I=1,19
      NM2(I)=NM1(I)
   40 NM1(I)=NX2(I)
C  RETURN VALUE OF LEFT-MOST 12 BITS OF NX2
      IRN4=MINT(NX2)
C  END
      RETURN
      END

Figure 8. Random Number Generator Documented in Fortran. NM1(19) and NM2(19) are arrays whose elements have either the value 0 or 1. MOR(N1,N2) is a function returning the exclusive or of N1 and N2, variables having either the value 0 or 1. MINT(N) is a function whose argument N is a bit-

(19)

Figure 9. Typical Wave Form and Spectral Analysis of the Output From the White Noise Generator.

(20)

equation, giving only one resonance or anti-resonance. For those who feel at home in the s-plane, the recent book Speech Synthesis, edited by Flanagan and Rabiner, contains reprints of papers developing the theory of recursive equation filter simulation; for the rest of us, the paper by Lovell et al. (1973) is a clear presentation, with some more general Fortran algorithms than will be given here.

Figure 10, page 21, gives one Fortran subroutine and two Fortran functions which we use to simulate resonant and anti-resonant filters. The functions RES and ARES return the output values of simple resonant (conjugate pole pair) and anti-resonant (conjugate zero pair) filters, respectively. A0, A1, and A2 are coefficients used in the recursive equations. YM1 and YM2 are remembered previous values of the signal, and Y is the input to the filter. A0, A1, and A2 determine the characteristics of the filter: center frequency and bandwidth. Each simulated filter should have its own variables in which to save the values of YM1 and YM2, and between calls to the function simulating that filter, the values of these variables should not be changed.
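One way to honor the rule that each filter keeps its own past values is to bundle them with the coefficients. The free-form Fortran sketch below does this with a derived type; it assumes the coefficient formulas and the resonator recursion as we read them from Figure 10, and the names `filter`, `set_resonance`, `resonate` and `step_response` are hypothetical. Since A0 is chosen as 1 - A1 + A2, a constant input should eventually pass through with unit gain, which gives a simple check.

```fortran
module resonator_sketch
  implicit none
  type :: filter
     real :: a0 = 1.0, a1 = 0.0, a2 = 0.0   ! recursive-equation coefficients
     real :: ym1 = 0.0, ym2 = 0.0           ! remembered past outputs
  end type filter
contains
  ! Coefficients for a resonance of center frequency cf and bandwidth bw (Hz)
  ! at sample rate sr (samples/sec).
  subroutine set_resonance(f, cf, bw, sr)
    type(filter), intent(inout) :: f
    real, intent(in) :: cf, bw, sr
    real, parameter :: pi = 3.14159265
    real :: a, b
    a = pi * bw / sr
    b = 2.0 * pi * cf / sr
    f%a2 = exp(-2.0 * a)
    f%a1 = 2.0 * exp(-a) * cos(b)
    f%a0 = 1.0 - f%a1 + f%a2                ! unit gain at zero frequency
  end subroutine set_resonance

  ! One step of the recursion; the filter's own state is updated in place.
  real function resonate(f, y)
    type(filter), intent(inout) :: f
    real, intent(in) :: y
    resonate = f%a0 * y + f%a1 * f%ym1 - f%a2 * f%ym2
    f%ym2 = f%ym1
    f%ym1 = resonate
  end function resonate

  ! Output after n samples of a constant input of 1.0 (fresh filter each call).
  real function step_response(cf, bw, sr, n)
    real, intent(in) :: cf, bw, sr
    integer, intent(in) :: n
    type(filter) :: f
    integer :: i
    call set_resonance(f, cf, bw, sr)
    step_response = 0.0
    do i = 1, n
       step_response = resonate(f, 1.0)
    end do
  end function step_response
end module resonator_sketch

program resonator_demo
  use resonator_sketch
  implicit none
  print *, 'settled output for constant input:', &
       step_response(3000.0, 200.0, 10000.0, 2000)
end program resonator_demo
```

Declaring one `filter` variable per formant makes it hard to overwrite another filter's past values by accident, which is the failure the paragraph above warns against.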

The subroutine COEFF is used to calculate appropriate values for A0, A1, and A2, based on CF, the center frequency, and BW, the bandwidth, of the resonance or anti-resonance.1 SR is the output sample rate, and MPZ tells the subroutine

1. In the version of this paper presented at the 12th annual

(21)

      SUBROUTINE COEFF(CF,BW,A0,A1,A2,SR,MPZ)
C
C  COMPUTES THE RECURSIVE EQUATION COEFFICIENTS
C  A0,A1,A2 FOR EITHER A RESONANT OR ANTIRESONANT FILTER
C  SPECIFIED BY CENTER FREQUENCY CF AND BANDWIDTH BW (HZ.)
C  SR IS THE SAMPLE RATE (SAMPLES/SEC)
C  IF MPZ=1, FILTER WILL BE A RESONANCE
C  IF MPZ=0, FILTER WILL BE AN ANTIRESONANCE
      PI=3.14159265
      A=PI*BW/SR
      B=2.0*PI*CF/SR
      A2=EXP(-2.0*A)
      A1=2.0*EXP(-A)*COS(B)
      A0=1.0-A1+A2
      IF (MPZ) 20,10,20
   10 A0=1.0/A0
   20 RETURN
      END

      FUNCTION RES(Y,YM1,YM2,A0,A1,A2)
C
C  SIMULATES RESONATOR (CONJUGATE POLE PAIR)
C  GIVEN BY RECURSIVE EQUATION COEFFICIENTS A0,A1,A2
C  YM1 AND YM2 ARE PAST VALUES OF Y; THEIR VALUES
C  MUST BE SAVED
      YRES=A0*Y+A1*YM1-A2*YM2
      YM2=YM1
      YM1=YRES
      RES=YRES
      RETURN
      END

      FUNCTION ARES(Y,YM1,YM2,A0,A1,A2)
C
C  SIMULATES ANTIRESONATOR (CONJUGATE ZERO PAIR)
C  GIVEN BY RECURSIVE EQUATION COEFFICIENTS A0,A1,A2
C  YM1 AND YM2 ARE PAST VALUES OF Y; THEIR VALUES
C  MUST BE SAVED
      TEMP=Y*A0
      ARES=TEMP-A1*YM1+A2*YM2
      YM2=YM1
      YM1=TEMP
      RETURN
      END

(22)

whether to compute coefficients for a resonance or an anti-resonance.

The spectral effect of a resonance with center frequency of 3000 Hz and bandwidth of 200 Hz is illustrated in Figure 11, page 23, while Figure 12, page 23, is a similar illustration of the effect of an anti-resonance of the same center frequency and bandwidth. In these figures, the first graph shows output for an impulse input and the second graph shows the normalized intensity spectrum of that output.

2. Radiation Effect

The effect on the spectrum of the radiation of sound from the lips through a short stretch of air can be reasonably approximated by a differentiator.1 Figure 13, page 24, gives a simple Fortran function RAD simulating this effect. Y is the speech wave input to the simulator, YM1 is the remembered immediately previous value of Y, and G is a normalizing gain control which should be calculated in the over-all initialization as a direct function of the output sample rate, some K times OSR. The spectral effect of RAD, approximately a 6 dB/octave rise, is illustrated in Figure 14, page 24.
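The 6 dB/octave figure can be checked directly from the first-difference form of the differentiator. For an input of frequency f at sample rate SR, the difference of successive samples has gain 2 sin(pi f / SR), which roughly doubles for each doubling of f as long as f is well below SR/2. The free-form Fortran sketch below illustrates this; `rad_sketch` mirrors the form of RAD as we read it from Figure 13, while `diff_gain_db` and the 10,000 samples/sec rate are assumptions introduced only for the demonstration.

```fortran
module radiation_sketch
  implicit none
  real, parameter :: pi = 3.14159265
contains
  ! First-difference approximation to a differentiator.
  ! y: current input; ym1: previous input (updated here); g: gain.
  real function rad_sketch(y, ym1, g)
    real, intent(in) :: y, g
    real, intent(inout) :: ym1
    rad_sketch = g * (y - ym1)
    ym1 = y
  end function rad_sketch

  ! Gain in dB of the unit-gain first difference at frequency f (Hz),
  ! sample rate sr (samples/sec): 20 log10( 2 sin(pi f / sr) ).
  real function diff_gain_db(f, sr)
    real, intent(in) :: f, sr
    diff_gain_db = 20.0 * log10(2.0 * sin(pi * f / sr))
  end function diff_gain_db
end module radiation_sketch

program radiation_demo
  use radiation_sketch
  implicit none
  real :: ym1, out
  ym1 = 0.0
  out = rad_sketch(1.0, ym1, 1.0)          ! a step appears as a single spike
  print *, 'first sample of step response:', out
  out = rad_sketch(1.0, ym1, 1.0)
  print *, 'second sample of step response:', out
  print *, 'gain change, 200 to 400 Hz (dB):', &
       diff_gain_db(400.0, 10000.0) - diff_gain_db(200.0, 10000.0)
end program radiation_demo
```

Because the raw difference shrinks as the sample rate grows, scaling G in proportion to OSR keeps the output level roughly independent of the rate chosen.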

IV. Organization of Elements

A. The Simplest Model

The simplest reasonable model for connecting these elements, which we have taken to calling "Model T", is given in block diagram form in Figure 15, page 25. We use this organization in our currently running synthesizer. The three variable

(23)

[Figure 11 panel label: SPEECH WAVE (NORMALIZED) against TIME, MSEC.]

Figure 11. Spectral Shaping Effect of an Elemental Resonant Filter with CF=3000 Hz and BW=200 Hz.

(24)

      FUNCTION RAD(Y,YM1,G)
      RAD=G*(Y-YM1)
      YM1=Y
      RETURN
      END

Figure 13. Fortran Function to Simulate Radiation Effect.

[Figure 14 panel: INTENSITY SPECTRUM (NORMALIZED) against FREQUENCY, KHz.]

Figure 14. Spectral Shaping Effect of Radiation

(25)

(26)

resonant filters are used to make the formants of voiced speech and, in a rather strained fashion, to shape the noise spectrum during frication and aspiration. All of the elements in this figure should now be familiar except the one called "higher order correction filter." This is a series of resonant filters of fixed center frequency and bandwidth which compensate for the effect of higher-frequency resonances present in a real vocal tract but absent in a digital simulation of this kind. Their use is discussed in Rabiner (1968), from which the values presented in Figure 16, page 27, were taken. These are the values to use for center frequency and bandwidth of the higher order correcting filters. Only higher-order filters with center frequency less than half the output sample rate should be used. The recursive equation coefficients A0, A1, and A2 need be calculated only once, in the over-all initialization.

Theoretically, the order of computation of the series elements, such as those in the main stem of Model T, makes no difference. However, because the digital numbers are finite in length, round-off or truncation errors are introduced at each step of the computation. The overall error increases as the number of computational steps increases. Some types of computation, such as differentiation, tend to increase the error, while other types, such as integration, tend to decrease the error. For this reason, overall system error is related in a complex way to the order of computation. An understanding of error buildup and testing of the various algorithms will help in choosing the computational sequence that results in smallest errors. In the case of cascaded resonators, it is better to perform the computation in reverse order from that implied by Fig. 15 - radiation effect first, then higher order filters in descending center

(27)

Resonator No.     Center Freq. (Hz)     Bandwidth (Hz)

Figure 16. Higher Order Correction Filter Center Frequencies and Bandwidths. From Rabiner (1968).

C  ZERO VARIABLE HOLDING SPEECH WAVE POINT
      YN=0.0
C  ADD GLOTTAL WAVE
      YN=YN+GLOT(P,AV)
C  ADD FRICATIVE NOISE
      YN=YN+AN*(FLOAT(IRN4(X))/2047.0)
C  APPLY FORMANT FILTERS
      DO 200 I=1,3
  200 YN=RES(YN,YM1(I),YM2(I),A0(I),A1(I),A2(I))
C  APPLY HIGHER ORDER CORRECTING FILTERS
      DO 250 I=1,7
  250 YN=RES(YN,HYM1(I),HYM2(I),HA0(I),HA1(I),HA2(I))
C  APPLY RADIATION EFFECT
      YN=RAD(YN,RYM1,GRAD)

(28)

The conversion of the Model T block diagram into Fortran is illustrated by Figure 17, page 27, the series elements being here computed in their natural order. This coding is an example of what should be inserted into the over-all logic (Figure 3, page 9) following the comment lines "GENERATE SPEECH WAVE POINT ...".

B. Control

In the Model T organization, nine control parameters are available: P (pitch), AV (amplitude of voicing), AN (amplitude of noise), and the center frequency and bandwidth of three variable formant filters. In the main loop, just before generating the next speech wave point, a subroutine (it can be in-line code, of course) calculating values of these parameters is needed.

If at the beginning of the program the variable T (time) is initialized to zero and incremented by TDEL at the end of the main loop, it can serve as a simulated-time clock on which to base calculation of the control parameters. The simplest method of control is to formu-

(29)

include Fortran statements in this section calculating their values as in the algebraic equations. For example, suppose we wanted the pitch to rise linearly from 80 to 120 Hz in the first 100 msec, stay constant at 120 Hz for 200 msec, then fall linearly to 100 Hz in the next 100 msec and stay at that value from then on. The following Fortran statements can be used to calculate P:

      GO TO 270
  260 P=100.0
  270 CONTINUE
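Only the last lines of the Fortran listing survive above, so the whole pitch contour described in the text is restated here as a minimal Python sketch. It is not the authors' code; the function name `pitch_hz` and the choice of milliseconds as the unit of the time argument are assumptions made for illustration.

```python
def pitch_hz(t_msec):
    """Piecewise-linear pitch contour from the example in the text."""
    if t_msec < 100.0:
        # Rise linearly from 80 Hz to 120 Hz over the first 100 msec.
        return 80.0 + 40.0 * (t_msec / 100.0)
    if t_msec < 300.0:
        # Hold at 120 Hz for the next 200 msec.
        return 120.0
    if t_msec < 400.0:
        # Fall linearly from 120 Hz to 100 Hz over the next 100 msec.
        return 120.0 - 20.0 * ((t_msec - 300.0) / 100.0)
    # Stay at 100 Hz from then on.
    return 100.0

for t in (0.0, 50.0, 200.0, 350.0, 1000.0):
    print(t, pitch_hz(t))
```

Each branch corresponds to one segment of the desired curve, in the same way that each labeled group of Fortran statements would handle one time interval of T.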

When new values of CF and BW are computed for the three variable formant filters, subroutine COEFF should be called to translate these into the coefficients A0, A1, and A2 actually used by function RES.
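The kind of translation that a routine like COEFF performs can be sketched as follows. This is a Python illustration under stated assumptions, not the authors' subroutine: it uses the common two-pole digital resonator y[n] = a0*x[n] + a1*y[n-1] + a2*y[n-2] normalized to unity gain at 0 Hz, and the assignment of these three numbers to the names A0, A1, and A2 is assumed rather than taken from the paper. The function names `coeff` and `res` and the sampling rate are likewise hypothetical.

```python
import math

def coeff(cf_hz, bw_hz, sample_rate_hz):
    """Return (a0, a1, a2) for y[n] = a0*x[n] + a1*y[n-1] + a2*y[n-2]."""
    t = 1.0 / sample_rate_hz
    r = math.exp(-math.pi * bw_hz * t)                   # pole radius set by bandwidth
    a2 = -r * r
    a1 = 2.0 * r * math.cos(2.0 * math.pi * cf_hz * t)   # pole angle set by center frequency
    a0 = 1.0 - a1 - a2                                   # makes the gain at 0 Hz equal to 1
    return a0, a1, a2

def res(x, state, coeffs):
    """Advance the resonator one sample; state holds [y[n-1], y[n-2]]."""
    a0, a1, a2 = coeffs
    y = a0 * x + a1 * state[0] + a2 * state[1]
    state[1] = state[0]
    state[0] = y
    return y

coeffs = coeff(500.0, 60.0, 10000.0)   # hypothetical formant and sampling rate
state = [0.0, 0.0]
impulse_response = [res(1.0 if n == 0 else 0.0, state, coeffs) for n in range(8)]
print(coeffs)
print(impulse_response)
```

Because the exponential and cosine are evaluated only in `coeff`, the per-sample cost of `res` stays at three multiplications and two additions, which is why the coefficients are recomputed only when CF or BW change.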

If new values of the control parameters are calculated every sample point, the execution time of the program will be very long. There are several obvious ways to speed up this calculation. One way is to calculate new values for P and AV only at

(30)

With some error introduced, the control parameters can be re-computed only every so many msec to speed things up. A variable used as a time clock in the same way that TG is used by GLOT can control this period. Computing the source control parameters (P, AV, and AN) this way introduces error only in that the actual parameter curves will follow the desired curve in a step-wise fashion, but changing the characteristics of the formant filters this way will introduce another type of error, which comes out sounding like clicks or static if the change in filter characteristics is too large.
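The step-wise behaviour described above can be made concrete with a short Python sketch. It is only an illustration: where the text describes a clock variable measured in time, this sketch counts samples instead, which avoids floating-point drift, and the names `hold_parameters`, `samples_per_update`, and `compute_params` are hypothetical.

```python
def hold_parameters(n_samples, samples_per_update, tdel_msec, compute_params):
    """Recompute a parameter every samples_per_update samples and hold it in between."""
    held = []
    countdown = 0      # zero forces a computation on the first sample
    current = None
    for n in range(n_samples):
        if countdown == 0:
            current = compute_params(n * tdel_msec)   # simulated time of this sample
            countdown = samples_per_update
        countdown -= 1
        held.append(current)
    return held

# A linearly rising "desired curve", re-evaluated every 4 samples of 0.5 msec each.
steps = hold_parameters(10, 4, 0.5, lambda t_msec: t_msec)
print(steps)
```

The output is a staircase that touches the desired curve only at the update instants. For source parameters this is the whole error; for filter coefficients each jump of the staircase is also an abrupt change in the filter, which is the source of the clicks mentioned in the text.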

Our currently-implemented assembly-language synthesizer reads a file of tabled values created by another program as values representing the parameter curves. The period between tabled parameter values is changeable, but on the order of 5 to 10 msec. In computing the actual parameter values used, the synthesizer interpolates linearly along the tabled parameter data curves. The step size of the interpolation can be easily changed, allowing a smooth trade-off between accuracy and execution time.
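Linear interpolation along a table of parameter values can be sketched in a few lines of Python. This is an illustration only, not the assembly-language implementation described above; the function name `interpolate_table`, the table contents, and the 10 msec table period are assumptions, and times are taken to be non-negative.

```python
def interpolate_table(table, table_period_msec, t_msec):
    """Linearly interpolate a parameter tabled at a fixed period (t_msec >= 0)."""
    position = t_msec / table_period_msec
    index = int(position)
    if index >= len(table) - 1:
        # Past the last table entry: hold the final value.
        return table[-1]
    fraction = position - index
    return table[index] + fraction * (table[index + 1] - table[index])

# Hypothetical pitch table, one entry every 10 msec.
pitch_table = [80.0, 120.0, 120.0, 100.0]
for t in (0.0, 5.0, 10.0, 25.0, 100.0):
    print(t, interpolate_table(pitch_table, 10.0, t))
```

Calling such a routine on every sample gives the most accurate curve; calling it only every few samples and holding the result in between trades accuracy for speed, in the manner of the adjustable interpolation step described in the text.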

To sum up, computing new control parameter values for each sample point generated is the easiest and most accurate way, but alternative schemes allowing a convenient trade of accuracy for speed are easily programmed.

C. Other Models

We will briefly describe several alternative organizations of the elements, although most of our practical experience has

(31)

1. Parallel Formant

One model used in some synthesizers is the parallel formant model, whose block diagram is given as Figure 18, page 32. In serial formant models such as Model T, no independent control of the relative intensities of formants is possible, since the order of operations is immaterial. It has been shown that the relative intensities of formants in a serial synthesis closely match those found in natural speech,¹ which is some justification of the serial model as an analog of the vocal tract. But in a parallel formant arrangement, each parallel channel must have a separate gain control. This is fine if you're investigating the perception of relative formant intensities, but not many have chosen this model for general speech synthesis. Rabiner (1968) contains a worthwhile discussion of the relative merits of serial and parallel synthesis. If you decide to try a parallel formant model, the higher order correcting filters are apparently unnecessary, and Rabiner (1968) mentions that zeros (anti-resonances) are introduced into the spectrum.
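The structural difference between the two arrangements can be shown in a short Python sketch. It is a simplified illustration, not a rendering of Figure 18: the resonator form and coefficient formulas are the assumed two-pole design with unity gain at 0 Hz, and the formant frequencies, bandwidths, channel gains, and sampling rate are hypothetical values chosen only to make the example run.

```python
import math

class Resonator:
    """Two-pole resonator y[n] = a0*x[n] + a1*y[n-1] + a2*y[n-2] (assumed form)."""

    def __init__(self, cf_hz, bw_hz, sample_rate_hz):
        r = math.exp(-math.pi * bw_hz / sample_rate_hz)
        self.a2 = -r * r
        self.a1 = 2.0 * r * math.cos(2.0 * math.pi * cf_hz / sample_rate_hz)
        self.a0 = 1.0 - self.a1 - self.a2
        self.y1 = 0.0
        self.y2 = 0.0

    def step(self, x):
        y = self.a0 * x + self.a1 * self.y1 + self.a2 * self.y2
        self.y2, self.y1 = self.y1, y
        return y

def serial_step(resonators, x):
    """Serial model: each resonator filters the output of the one before it."""
    for resonator in resonators:
        x = resonator.step(x)
    return x

def parallel_step(resonators, gains, x):
    """Parallel model: every resonator sees the same input; outputs are weighted and summed."""
    return sum(g * resonator.step(x) for g, resonator in zip(gains, resonators))

FS = 10000.0                                               # hypothetical sampling rate, Hz
FORMANTS = [(270.0, 60.0), (2290.0, 90.0), (3010.0, 150.0)]  # illustrative (CF, BW) pairs
serial_bank = [Resonator(cf, bw, FS) for cf, bw in FORMANTS]
parallel_bank = [Resonator(cf, bw, FS) for cf, bw in FORMANTS]
gains = [1.0, 0.5, 0.25]                                   # one gain control per channel

impulse = [1.0] + [0.0] * 9
print([serial_step(serial_bank, x) for x in impulse])
print([parallel_step(parallel_bank, gains, x) for x in impulse])
```

In the serial version the relative formant levels follow from the filters themselves, while in the parallel version they are set by the three entries of `gains`, which is the extra control the text refers to.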

2. Separate Noise Shaping Channel

In Model T, the same three filters are used to make the formants of voiced speech and to shape the spectrum of noise during unvoiced speech. This is cumbersome and difficult, and a simple alternative is illustrated in Figure 19, page 34.

1. Fant (1956)

(32)

Figure 18. Parallel Formant Organization Model. [Block diagram; labels: GLOTTAL WAVE, AV, RESONANCE NO. 1 (CF1, BW1), RESONANCE NO. 2 (CF2), RESONANCE NO. 3 (CF3), RADIATION EFFECT.]

(33)

A separate channel is devoted to noise, with its own resonant and anti-resonant filters for spectral shaping. It has been suggested that one resonance and one anti-resonance are sufficient to model most English fricatives.¹ Of course, this model adds two new control parameters to be computed.
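One common way to build an anti-resonance is as the exact inverse of a two-pole resonator, so that the filter subtracts weighted past inputs instead of adding weighted past outputs. The Python sketch below illustrates that construction; it is an assumption about how such a filter might be realized, not the design used in the system described here, and the function names and numerical values are hypothetical.

```python
import math

def resonator_coeffs(cf_hz, bw_hz, fs_hz):
    """Coefficients (a, b, c) of the resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    r = math.exp(-math.pi * bw_hz / fs_hz)
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * cf_hz / fs_hz)
    a = 1.0 - b - c
    return a, b, c

def resonate(signal, cf_hz, bw_hz, fs_hz):
    """Resonance: a spectral peak near cf_hz."""
    a, b, c = resonator_coeffs(cf_hz, bw_hz, fs_hz)
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        y2, y1 = y1, y
        out.append(y)
    return out

def antiresonate(signal, cf_hz, bw_hz, fs_hz):
    """Anti-resonance: a spectral dip near cf_hz, the inverse of resonate()."""
    a, b, c = resonator_coeffs(cf_hz, bw_hz, fs_hz)
    out, x1, x2 = [], 0.0, 0.0
    for x in signal:
        out.append((x - b * x1 - c * x2) / a)
        x2, x1 = x1, x
    return out

FS = 10000.0                                    # hypothetical sampling rate, Hz
noise_like = [((37 * n) % 11 - 5) / 5.0 for n in range(50)]   # deterministic test signal
shaped = antiresonate(noise_like, 2500.0, 200.0, FS)
restored = resonate(shaped, 2500.0, 200.0, FS)
print(max(abs(p - q) for p, q in zip(noise_like, restored)))
```

Passing a signal through the anti-resonance and then through the matching resonance returns the original signal, which confirms that the two filters are inverses. In a noise channel the two would instead be given different center frequencies, one placing a peak and the other a dip in the fricative spectrum.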

3. Other More Complex Models

Model T does not use anti-resonances; they are not typical of voiced speech, but rather are present in the spectra of fricatives and nasalized segments. For making nasal sounds, a parallel nasal channel whose input is the glottal wave and whose output is added in just before the radiance effect calculation can be added. The spectral shaping filters needed in this channel are not obvious from published reports, but one variable anti-resonance and several fixed resonances are probably a minimum complement.

A multitude of more complex models can be seen in the literature; Rabiner (1968), to take one example, includes a special arrangement for generating voiced fricatives.

D. A Complete Example

To illustrate the capabilities of the simplest synthesis model, on the following pages we present as Figure 20 a complete Fortran program for synthesizing the word "seat" ([sitʰ]). We tried to duplicate one particular token utterance of this word. Spectrograms of the original sound used as a model and the synthesized sound calculated by the Fortran program are shown in

(34)

Figure 19. Block Diagram of Model With Separate [Block diagram; labels: GLOTTAL WAVE PARAMETERS, RESONANCES, VOCAL TRACT PARAMETERS, ANTI-RESONANCE.]

(35)

Figure 21, page 44. Deriving the parameter curves from the token utterance took a lot of work, but as Figure 21 shows, the resulting synthetic word is a reasonably close copy of the original.

For those who may want to use this program as a beginning to their work, several features of it will be explained.

The basic model used is the simplest, Model T (Figure 15, page 25), with one addition: the noise signal is multiplied by a relative gain constant (GFRIC) before entering the vocal tract (variable filter) section.

The output sound wave is stored a block at a time in file WWRK1. This file is opened by the subroutine called in line 43, written into in line 455, and closed in line 470.

Parameter values are periodically calculated from piecewise polynomial algebraic specifications in the section called "GPAR", lines 142 to 415. The periods between parameter re-calculations are controlled by two variables serving as clocks, TVOC for voicing parameters and TFRIC for frication parameters. The values of PVOC and PFRIC are the times in msec between re-calculations of vocalic and fricative parameter values, respectively. These parameter values can be reset more often by merely changing the values assigned to PVOC and PFRIC in lines 93 and 37; at present frication parameters are reset every 0.1 msec and vocalic parameters every 0.2 msec. The Fortran logic calculating the parameters was coded for clarity, not economy, and though lengthy should be easy to follow. The primitive subroutine TPOW returns powers of a variable for ease in calcu-

(36)

Certain parameters are constant during some of the sounds, e.g., spectral parameters during "3". As a minor concession to execution speed, these constant parameters are not reset after the first entry into the section during which they are constant.

The variables ISW1, ISW2, and ISW3 are "first-time-through" switches, used to remember whether or not the temporarily constant parameters have been calculated yet.

We have found that duplicating a token of natural speech using this simple basic synthesis model, though possible, can be quite difficult, requiring much trial-and-error work. Fortunately, much research on speech perception does not require exact duplication of given utterances, but instead uses simpler sets of parameter curves. An example of such a program, which synthesizes the vowel /i/ with constant pitch and intensity, can be made by substituting the following code for the "GPAR" section, lines 151 to 411 of the sample program.

      IF (ISW1) 150,100,150
  100 ISW1=1
      P=100.0
      AVDB=50.0
      AV=10.0**(AVDB/20.0)
      DO 110 I=1,3