CS578- Speech Signal Processing

(1)

CS578- Speech Signal Processing

Lecture on Voice Function Assessment

Yannis Stylianou

University of Crete, Computer Science Dept., Multimedia Informatics Lab [email protected]

(2)

Outline

1 _Introduction

Purpose

2 _{Jitter estimation}

Spectral Jitter Estimator Short time SJE

3 _{Using sinusoidal modeling}

(3)

Voice Function Assessment

Use of:

Voice Production Models

Algorithms of Signal/Speech Processing

(4)

(5)

(6)

High-Speed (invasive)

L R

Patient: Sample Wolf

Recording 'Völli_A014' from 2005-Aug-30 12:09

Page: 1 23-Mar-2009 Sequence position in ms 0 20 40 60 80 100 120 140 160 180 200 218 11 10 8 6 4 2 0 -2 -4 -6 -8 -10

(7)

High-Speed (invasive)

I Glottal surface Sequence position in ms 0 20 40 60 80 100 120 140 160 180 200 218 1221 1100 1000 900 800 700 600 500 400 300 200 100 51

(8)

Phonovibrography - PVG[1]

! ! !hardly change. Thus, for

! ! !can

be approximated by linear interpolation. The connection line

! ! !defines the glottal main axis. In a

" !"

!" !are linearly connected to the computed

! ! !which merge the formerly separated

The accuracy of the segmentation procedure was evaluated within an extensive study [29]. The procedure was applied to 372 different high-speed movies each comprising 500 frames. The procedure showed a mean information flow-rate of almost 100 images per second and can thus be applied directly after the recording of a high-speed movie. For quantifying the segmentation accuracy the computed contours

" !"

!" ! were compared to manual segmentation results

which were performed by ten clinical experts with years of experience. Totally 25,200 manual markings were set onto the vocal fold edges in 630 raw images. It could be shown, that at medial position the mean accuracy of the automatic

"##pixels which is in the range

$#%$$. Further details about the segmentation procedure

(9)

Videokymography[2]

(10)

Non-invasive

EGG [3] I Principle

Contact microphone signals

(11)

Outline

1 _Introduction

Purpose

(12)

Measuring Jitter (1/2)

Local jitter is the period-to-period variability of pitch (%) 1 N − 1 N−1 X n=1 |u(n + 1) − u(n)| 1 N N X n=1 u(n)

(13)

Measuring Jitter (2/2)

Relative Average Perturbation (RAP): 3 periods (%) 1

N − 2

N−2

X

n=1

|2u(n + 1) − u(n) − u(n + 2)| 3 1 N N X n=1 u(n)

Pitch Period Perturbation Quotient (PPQ): 5 periods (%) 1

N − 4

N−4

X

n=1

(14)

Spectral Jitter Model

Jittered impulse train:

g [n] = +∞ X k=−∞ δ[n − (2k)P]+ +∞ X k=−∞ δ[n + − (2k + 1)P]

and the corresponding magnitude spectrum:

(15)

Spectral Jitter Model

Jittered impulse train:

g [n] = +∞ X k=−∞ δ[n − (2k)P]+ +∞ X k=−∞ δ[n + − (2k + 1)P]

and the corresponding magnitude spectrum:

(16)

Beat Spectrum

This representation leads to a beat spectrum: 1 + cos h (P − ) kω0 2 i = 1 + cos (kπ) cos (k Pπ) with intersections at:

(17)

(18)

Harmonic and Subharmonic spectra

The magnitude spectrum can be split into a Harmonic spectrum: H(, l ω0) = 10 log10

ω2 0

2 (1 + cos [(P − ) l ω0])

(19)

(20)

Voice Pathology Detection

AUC (standard error) %

MEEI PdA

(21)

Running Speech Database

MEEI - Rainbow passage: MEEIRainbow

Recordings are limited to 12 seconds (2 first sentences) 53 signals from healthy voices and 660 signals from pathological voices

Sampling frequency: 25 kHz (683 signals), 10 kHz (30 signals), 16 bits

Onset and offset effects in the voiced areas (2 frames) Pitch period measured at a rate of 10ms

(22)

Features

0 2 4 6 8 10 12 0 100 200 300 400 500

normal reading text signal (Over = 13.17%)

time (s) (a) absolute jitter ( µ s) 0 2 4 6 8 10 12 0 200 400 600 800 1000 1200

pathological reading text signal (Over = 80.14%)

time (s) (b)

absolute jitter (

µ

(23)

Discrimination Power

AUC (standard error) % (MEEIRainbow,ThrSJE)

Over Max Over Max Under

(24)

Running and local: Over

Local:

Feature is computed using a sliding analysis window of fixed size

Running:

(25)

Example from MEEI: Running Over

2 4 6 8 10 12 0 10 20 30 40 50 60 70

running Over of a normal signal

time (s) (a) Over (%) 2 4 6 8 10 12 0 10 20 30 40 50 60 70

running Over of a pathological signal

(26)

Example from MEEI: Local Over

2 4 6 8 10 12 0 20 40 60 80 100

local Over of a normal signal

time (s) (a) Over (%) 2 4 6 8 10 12 0 20 40 60 80 100

local Over of a pathological signal

(27)

(28)

(29)

(30)

(31)

References for the work on SJE

1 _{Miltos Vassilakis and Yannis Stylianou:}

Spectral jitter modeling and estimation.

Elsevier Biomedical Signal Processing and Control. Special Issue: M&A of Vocal Emissions, 2009, DOI:

10.1016/j.bspc.2009.02.001

2 _{Miltos Vassilakis and Yannis Stylianou:}

Voice Pathology Detection based on Short-Time Jitter Estimations in Running Speech.

(32)

Outline

1 _Introduction

Purpose

(33)

(34)

Jitter and Shimmer

Jitter: f0(t) = f0− δ sin(πf0t + ψk) Shimmer: µk(t) = µk[1 + γkcos(πf0t + χk)] so then: s(t) = L X k=−L Ak[1 + γkcos(πf0t + χk)] ej [2πkf0t+δkcos(πf0t+ψk)+θk]_{w (t)} s(t) ≈ L X k=−L Akej θkej 2πkf0tw (t)

[1 + (γkcos(χk) cos(πf0t) − γksin(χk) sin(πf0t))

(35)

Jitter and Shimmer

(36)

Jitter and Shimmer

(37)

Models of shimmer and jitter

Time (samples) Amplitude Time (samples) Frequency (Hz) 0 μk(1+γk) μ_k μ_k(1−γ_k) 2P 3P 4P 5P 0 P 2P 3P 4P 5P f₀ 0 P f₀+δ f 0−δ

(38)

Parameters estimation

Suggesting: x (t) = L X k=−L

[ak + bksin(πf0t) + ckcos(πf0t)]ej 2πkf0tw (t)

(39)

Parameters estimation

Suggesting: x (t) = L X k=−L

[ak + bksin(πf0t) + ckcos(πf0t)]ej 2πkf0tw (t)

(40)

Comparing models

What we want, was:

s(t) ≈

L

X

k=−L

Akej θkej 2πkf0tw (t)

+ j (δkcos(ψk) cos(πf0t) − δksin(ψk) sin(πf0t))]

and what we suggest:

(41)

Comparing models

What we want, was:

s(t) ≈

L

X

k=−L

Akej θkej 2πkf0tw (t)

+ j (δkcos(ψk) cos(πf0t) − δksin(ψk) sin(πf0t))]

and what we suggest:

(42)

Final ...

π1,k = ˆγkcos(χk)

ρ1,k = −ˆγksin(χk)

π2,k = ˆδkcos(ψk)

ρ2,k = −ˆδksin(ψk)

(43)

Validation

Glottal airflow rate r [n] =A1 +∞ X k=−∞ δ[n − (2k)P]+ A2 +∞ X k=−∞ δ[n + − (2k + 1)P]

where A1 = A0+ A and A2= A0− A, with A to control the

shimmer value while controls the jitter.

Glottal airflow frequency response (maximum phase)

G (z) = 1

(1 − bz)2

(44)

Validation: Modeling Shimmer

0 5 10 15 20 25 30 35 40 45 50 −0.5 0 0.5 1 Time (ms) (a) 0 5 10 15 20 25 30 35 40 45 50 −0.5 0 0.5 1 Time (ms) (b) 0 5 10 15 20 25 30 35 40 45 50 −0.5 0 0.5 Time (ms) (c)

Signal with shimmer Standard Sinusoidal Model

Signal with shimmer Suggested Sinusoidal Model

(45)

Validation: Modeling Jitter

0 5 10 15 20 25 30 35 40 45 50 −0.5 0 0.5 1 Time (ms) (a)

Signal with jitter Standard Sinusoidal model

0 5 10 15 20 25 30 35 40 45 50 −0.5 0 0.5 1 Time (ms) (b)

Signal with jitter Suggested sinusoidal Model

(46)

Evaluation: Voice Pathology Detection

(MEEI)

AUC: 92.06% (L = 20, FLD, 5000 Monte Carlo repetitions)

0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

False Positive Rate

True Positive Rate

Receiver Operating Characteristic (ROC) curve on MEEI

(47)

(48)

(49)

Iterative QHM

Iteratively improve estimation of QHM parameters:

Let’s assume that we know at n = 0, fk(0):

1 Compute the a_k,n and b_k,n through Least Squares using

fk(n − 1).

2 Compute instantaneous components, for k = 1 . . . K :

   Ak(n) = mk(0) Φk(n) = φk(0) Fk(n) = fk(0)

(50)

Iterative QHM

Iteratively improve estimation of QHM parameters: Let’s assume that we know at n = 0, fk(0):

fk(n − 1).

(51)

Iterative QHM

fk(n − 1).

(52)

Iterative QHM

fk(n − 1).

(53)

(54)

(55)

(56)

Harmonics and Sub-harmonics: Frequency

Left: Normophonic speaker, Right: Dysphonic speaker

(57)

Harmonics and Sub-harmonics: Time

Left: Normophonic speaker, Right: Dysphonic speaker

(58)

References for the work on Sinusoidal Model

1 Yannis Pantazis, Olivier Rosec, and Yannis Stylianou:

Adaptive AM-FM Signal Decomposition with Application to Speech Analysis

IEEE Trans. on Audio, Speech and Language Processing, Vol.19, No.2, Feb 2011, pp 290-300.

2 Yannis Pantazis, Miltiadis Vasilakis, and Yannis Stylianou:

Jitter and Shimmer modeling based on a Time-Varying Sinusoidal Model of Speech

(59)

Outline

1 _Introduction

Purpose

(60)

Define Vocal Tremor

Vocal Tremor: Involuntary modulations of frequency and/or amplitude in sustained phonation.

Pathological & Physiological Vocal Tremor.

Pathological Tremor: From diseases like Parkinson, essential tremor, etc. ⇒ Strong motor synchronization.

Physiological Tremor: Natural stochastic modulations in the interval [2, 15]Hz with low amplitude.

Acoustic Vocal Tremor Attributes:

(61)

Define Vocal Tremor

(62)

Define Vocal Tremor

(63)

Define Vocal Tremor

(64)

Define Vocal Tremor

(65)

Define Vocal Tremor

Modulation Frequency: How fast are the modulations.

(66)

Define Vocal Tremor

(67)

Vocal Tremor Estimation

Use of an AM-FM decomposition algorithm based on the adaptive time-varying quasi-harmonic model for speech.

High resolution in Time-Frequency plane.

Estimation of Vocal Tremor for any sinusoidal component of speech.

(68)

AM-FM Decomposition using aQHM

Speech is modeled as a sum of AM-FM sinusoids:

s(t) =

K

X

k=1

ak(t)cos(φk(t))

K is the number of components,

ak(t) is the instantaneous amplitude of the kth sinusoid,

φk(t) is the instantaneous phase of the kth sinusoid, and

fk(t) =_2π1 d φk(t)

dt is the instantaneous frequency of the k

th

sinusoid.

(69)

(70)

Preprocessing of Inst. Component

Downsample inst. component to fs = 1000Hz

Remove the very slow (< 2Hz) modulations of the instantaneous component.

This is performed by Savinzky-Golay smoothing filter.

(71)

(72)

Compute Modulation Frequency & Level

Assuming that the processed inst. component has a single but time-varying modulation frequency and modulation level.

x (t) = m(t)cos(ψ(t))

Apply for second time the AM-FM dec. alg. to the processed inst. component.

Thus,

Modulation frequency, _2π1 d ψ(t)_dt , is estimated from the FM

component of AM-FM dec. alg.

(73)

(74)

(75)

References for the work on Tremor

1 Yannis Pantazis, Maria Koutsoyannaki and Yannis Stylianou:

A novel method for the extraction of vocal tremor, MAVEBA-2009, Florence, Italy, 14-16 Dec, 2009.

2 _{Maria Koutsoyannaki, Yannis Pantazis, Yannis Stylianou, and}

Philippe Dejonckere:

(76)

Outline

1 _Introduction

Purpose

(77)

(78)

Our work on Modulation Spectra

1 Maria Markaki and Yannis Stylianou:

Voice Pathology Detection and Discrimination based on Modulation Spectral Features.

IEEE Trans. on Audio, Speech and Language Processing. TASL.2010.2104141, Jan 2011

2 _{J.D. Arias-Londono, J.I. Godino-Llorente, M. Markaki, and Y.}

Stylianou:

On combining information from Modulation Spectra and Mel-Frequency Cepstral coefficients for Automatic Detection of Pathological Voices.

Logopedics, Phoniatrics, Vocology (LPV), Nov 2010.

3 _{Maria Markaki and Yannis Stylianou:}

Discrimination of Speech from Nonspeeech in Broadcast News Based on Modulation Frequency Features

(79)

Outline

1 _Introduction

Purpose

(80)

Acknowledgments

My ex-students: Miltos Vasilakis, Yannis Pantazis. My colleague: Olivier Rosec, France Telecom

Juan Ignacio Godino-Llorente, Department of Circuits & Systems Engineering, Universidad Polit´ecnica de Madrid, for the availability of the PdA database.

(81)

(82)

Outline

1 _Introduction

Purpose

(83)

J. Lohscheller, U. Eysholdt, H. Toy, and M. Doellinger, “Phonovibrography: Mapping High-Speed Movies of Vocal Fold Vibrations into 2-D Diagrams for Visualizing and Analyzing the Underlying Laryngeal Dynamics,” IEEE Trans. Medical Imaging, vol. 27, no. 3, pp. 300–309, 2008.

J. Svec and H. Schutte, “Videokymography: high-speed line scanning of vocal fold vibration,” Journal of Voice, vol. 10, no. 2, pp. 201–205, 1996.

(84)