CS578- Speech Signal Processing
Lecture on Voice Function AssessmentYannis Stylianou
University of Crete, Computer Science Dept., Multimedia Informatics Lab [email protected]
Outline
1 Introduction
Purpose
2 Jitter estimation
Spectral Jitter Estimator Short time SJE
3 Using sinusoidal modeling
Voice Function Assessment
Use of:
Voice Production Models
Algorithms of Signal/Speech Processing
High-Speed (invasive)
L R
Patient: Sample Wolf
Recording 'Völli_A014' from 2005-Aug-30 12:09
Page: 1 23-Mar-2009 Sequence position in ms 0 20 40 60 80 100 120 140 160 180 200 218 11 10 8 6 4 2 0 -2 -4 -6 -8 -10
Patient: Sample Wolf
Recording 'Völli_A014' from 2005-Aug-30 12:09
High-Speed (invasive)
I Glottal surface Sequence position in ms 0 20 40 60 80 100 120 140 160 180 200 218 1221 1100 1000 900 800 700 600 500 400 300 200 100 51Patient: Sample Wolf
Recording 'Völli_A014' from 2005-Aug-30 12:09
Phonovibrography - PVG[1]
! ! !hardly change. Thus, for
! ! !can
be approximated by linear interpolation. The connection line
! ! !defines the glottal main axis. In a
" !"
!" !are linearly connected to the computed
! ! !which merge the formerly separated
The accuracy of the segmentation procedure was evaluated within an extensive study [29]. The procedure was applied to 372 different high-speed movies each comprising 500 frames. The procedure showed a mean information flow-rate of almost 100 images per second and can thus be applied directly after the recording of a high-speed movie. For quantifying the segmentation accuracy the computed contours
" !"
!" ! were compared to manual segmentation results
which were performed by ten clinical experts with years of experience. Totally 25,200 manual markings were set onto the vocal fold edges in 630 raw images. It could be shown, that at medial position the mean accuracy of the automatic
"##pixels which is in the range
$#%$$. Further details about the segmentation procedure
Videokymography[2]
Non-invasive
EGG [3] I Principle
Contact microphone signals
Outline
1 Introduction
Purpose
2 Jitter estimation
Spectral Jitter Estimator Short time SJE
3 Using sinusoidal modeling
Measuring Jitter (1/2)
Local jitter is the period-to-period variability of pitch (%) 1 N − 1 N−1 X n=1 |u(n + 1) − u(n)| 1 N N X n=1 u(n)
Measuring Jitter (2/2)
Relative Average Perturbation (RAP): 3 periods (%) 1
N − 2
N−2
X
n=1
|2u(n + 1) − u(n) − u(n + 2)| 3 1 N N X n=1 u(n)
Pitch Period Perturbation Quotient (PPQ): 5 periods (%) 1
N − 4
N−4
X
n=1
Spectral Jitter Model
Jittered impulse train:
g [n] = +∞ X k=−∞ δ[n − (2k)P]+ +∞ X k=−∞ δ[n + − (2k + 1)P]
and the corresponding magnitude spectrum:
Spectral Jitter Model
Jittered impulse train:
g [n] = +∞ X k=−∞ δ[n − (2k)P]+ +∞ X k=−∞ δ[n + − (2k + 1)P]
and the corresponding magnitude spectrum:
Beat Spectrum
This representation leads to a beat spectrum: 1 + cos h (P − ) kω0 2 i = 1 + cos (kπ) cos (k Pπ) with intersections at:
Harmonic and Subharmonic spectra
The magnitude spectrum can be split into a Harmonic spectrum: H(, l ω0) = 10 log10
ω2 0
2 (1 + cos [(P − ) l ω0])
Voice Pathology Detection
AUC (standard error) %
MEEI PdA
Running Speech Database
MEEI - Rainbow passage: MEEIRainbow
Recordings are limited to 12 seconds (2 first sentences) 53 signals from healthy voices and 660 signals from pathological voices
Sampling frequency: 25 kHz (683 signals), 10 kHz (30 signals), 16 bits
Onset and offset effects in the voiced areas (2 frames) Pitch period measured at a rate of 10ms
Features
0 2 4 6 8 10 12 0 100 200 300 400 500normal reading text signal (Over = 13.17%)
time (s) (a) absolute jitter ( µ s) 0 2 4 6 8 10 12 0 200 400 600 800 1000 1200
pathological reading text signal (Over = 80.14%)
time (s) (b)
absolute jitter (
µ
Discrimination Power
AUC (standard error) % (MEEIRainbow,ThrSJE)
Over Max Over Max Under
Running and local: Over
Local:
Feature is computed using a sliding analysis window of fixed size
Running:
Example from MEEI: Running Over
2 4 6 8 10 12 0 10 20 30 40 50 60 70running Over of a normal signal
time (s) (a) Over (%) 2 4 6 8 10 12 0 10 20 30 40 50 60 70
running Over of a pathological signal
Example from MEEI: Local Over
2 4 6 8 10 12 0 20 40 60 80 100local Over of a normal signal
time (s) (a) Over (%) 2 4 6 8 10 12 0 20 40 60 80 100
local Over of a pathological signal
References for the work on SJE
1 Miltos Vassilakis and Yannis Stylianou:
Spectral jitter modeling and estimation.
Elsevier Biomedical Signal Processing and Control. Special Issue: M&A of Vocal Emissions, 2009, DOI:
10.1016/j.bspc.2009.02.001
2 Miltos Vassilakis and Yannis Stylianou:
Voice Pathology Detection based on Short-Time Jitter Estimations in Running Speech.
Outline
1 Introduction
Purpose
2 Jitter estimation
Spectral Jitter Estimator Short time SJE
3 Using sinusoidal modeling
Jitter and Shimmer
Jitter: f0(t) = f0− δ sin(πf0t + ψk) Shimmer: µk(t) = µk[1 + γkcos(πf0t + χk)] so then: s(t) = L X k=−L Ak[1 + γkcos(πf0t + χk)] ej [2πkf0t+δkcos(πf0t+ψk)+θk]w (t) s(t) ≈ L X k=−L Akej θkej 2πkf0tw (t)[1 + (γkcos(χk) cos(πf0t) − γksin(χk) sin(πf0t))
Jitter and Shimmer
Jitter: f0(t) = f0− δ sin(πf0t + ψk) Shimmer: µk(t) = µk[1 + γkcos(πf0t + χk)] so then: s(t) = L X k=−L Ak[1 + γkcos(πf0t + χk)] ej [2πkf0t+δkcos(πf0t+ψk)+θk]w (t) s(t) ≈ L X k=−L Akej θkej 2πkf0tw (t)[1 + (γkcos(χk) cos(πf0t) − γksin(χk) sin(πf0t))
Jitter and Shimmer
Jitter: f0(t) = f0− δ sin(πf0t + ψk) Shimmer: µk(t) = µk[1 + γkcos(πf0t + χk)] so then: s(t) = L X k=−L Ak[1 + γkcos(πf0t + χk)] ej [2πkf0t+δkcos(πf0t+ψk)+θk]w (t) s(t) ≈ L X k=−L Akej θkej 2πkf0tw (t)[1 + (γkcos(χk) cos(πf0t) − γksin(χk) sin(πf0t))
Models of shimmer and jitter
Time (samples) Amplitude Time (samples) Frequency (Hz) 0 μk(1+γk) μk μk(1−γk) 2P 3P 4P 5P 0 P 2P 3P 4P 5P f0 0 P f0+δ f 0−δParameters estimation
Suggesting: x (t) = L X k=−L[ak + bksin(πf0t) + ckcos(πf0t)]ej 2πkf0tw (t)
Parameters estimation
Suggesting: x (t) = L X k=−L[ak + bksin(πf0t) + ckcos(πf0t)]ej 2πkf0tw (t)
Comparing models
What we want, was:
s(t) ≈
L
X
k=−L
Akej θkej 2πkf0tw (t)
[1 + (γkcos(χk) cos(πf0t) − γksin(χk) sin(πf0t))
+ j (δkcos(ψk) cos(πf0t) − δksin(ψk) sin(πf0t))]
and what we suggest:
Comparing models
What we want, was:
s(t) ≈
L
X
k=−L
Akej θkej 2πkf0tw (t)
[1 + (γkcos(χk) cos(πf0t) − γksin(χk) sin(πf0t))
+ j (δkcos(ψk) cos(πf0t) − δksin(ψk) sin(πf0t))]
and what we suggest:
Final ...
π1,k = ˆγkcos(χk)
ρ1,k = −ˆγksin(χk)
π2,k = ˆδkcos(ψk)
ρ2,k = −ˆδksin(ψk)
Validation
Glottal airflow rate r [n] =A1 +∞ X k=−∞ δ[n − (2k)P]+ A2 +∞ X k=−∞ δ[n + − (2k + 1)P]
where A1 = A0+ A and A2= A0− A, with A to control the
shimmer value while controls the jitter.
Glottal airflow frequency response (maximum phase)
G (z) = 1
(1 − bz)2
Validation: Modeling Shimmer
0 5 10 15 20 25 30 35 40 45 50 −0.5 0 0.5 1 Time (ms) (a) 0 5 10 15 20 25 30 35 40 45 50 −0.5 0 0.5 1 Time (ms) (b) 0 5 10 15 20 25 30 35 40 45 50 −0.5 0 0.5 Time (ms) (c)Signal with shimmer Standard Sinusoidal Model
Signal with shimmer Suggested Sinusoidal Model
Validation: Modeling Jitter
0 5 10 15 20 25 30 35 40 45 50 −0.5 0 0.5 1 Time (ms) (a)Signal with jitter Standard Sinusoidal model
0 5 10 15 20 25 30 35 40 45 50 −0.5 0 0.5 1 Time (ms) (b)
Signal with jitter Suggested sinusoidal Model
Evaluation: Voice Pathology Detection
(MEEI)
AUC: 92.06% (L = 20, FLD, 5000 Monte Carlo repetitions)
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
False Positive Rate
True Positive Rate
Receiver Operating Characteristic (ROC) curve on MEEI
Iterative QHM
Iteratively improve estimation of QHM parameters:
Let’s assume that we know at n = 0, fk(0):
1 Compute the ak,n and bk,n through Least Squares using
fk(n − 1).
2 Compute instantaneous components, for k = 1 . . . K :
Ak(n) = mk(0) Φk(n) = φk(0) Fk(n) = fk(0)
Iterative QHM
Iteratively improve estimation of QHM parameters: Let’s assume that we know at n = 0, fk(0):
1 Compute the ak,n and bk,n through Least Squares using
fk(n − 1).
2 Compute instantaneous components, for k = 1 . . . K :
Ak(n) = mk(0) Φk(n) = φk(0) Fk(n) = fk(0)
Iterative QHM
Iteratively improve estimation of QHM parameters: Let’s assume that we know at n = 0, fk(0):
1 Compute the ak,n and bk,n through Least Squares using
fk(n − 1).
2 Compute instantaneous components, for k = 1 . . . K :
Ak(n) = mk(0) Φk(n) = φk(0) Fk(n) = fk(0)
Iterative QHM
Iteratively improve estimation of QHM parameters: Let’s assume that we know at n = 0, fk(0):
1 Compute the ak,n and bk,n through Least Squares using
fk(n − 1).
2 Compute instantaneous components, for k = 1 . . . K :
Ak(n) = mk(0) Φk(n) = φk(0) Fk(n) = fk(0)
Harmonics and Sub-harmonics: Frequency
Left: Normophonic speaker, Right: Dysphonic speaker
Harmonics and Sub-harmonics: Time
Left: Normophonic speaker, Right: Dysphonic speaker
References for the work on Sinusoidal Model
1 Yannis Pantazis, Olivier Rosec, and Yannis Stylianou:
Adaptive AM-FM Signal Decomposition with Application to Speech Analysis
IEEE Trans. on Audio, Speech and Language Processing, Vol.19, No.2, Feb 2011, pp 290-300.
2 Yannis Pantazis, Miltiadis Vasilakis, and Yannis Stylianou:
Jitter and Shimmer modeling based on a Time-Varying Sinusoidal Model of Speech
Outline
1 Introduction
Purpose
2 Jitter estimation
Spectral Jitter Estimator Short time SJE
3 Using sinusoidal modeling
Define Vocal Tremor
Vocal Tremor: Involuntary modulations of frequency and/or amplitude in sustained phonation.
Pathological & Physiological Vocal Tremor.
Pathological Tremor: From diseases like Parkinson, essential tremor, etc. ⇒ Strong motor synchronization.
Physiological Tremor: Natural stochastic modulations in the interval [2, 15]Hz with low amplitude.
Acoustic Vocal Tremor Attributes:
Define Vocal Tremor
Vocal Tremor: Involuntary modulations of frequency and/or amplitude in sustained phonation.
Pathological & Physiological Vocal Tremor.
Pathological Tremor: From diseases like Parkinson, essential tremor, etc. ⇒ Strong motor synchronization.
Physiological Tremor: Natural stochastic modulations in the interval [2, 15]Hz with low amplitude.
Acoustic Vocal Tremor Attributes:
Define Vocal Tremor
Vocal Tremor: Involuntary modulations of frequency and/or amplitude in sustained phonation.
Pathological & Physiological Vocal Tremor.
Pathological Tremor: From diseases like Parkinson, essential tremor, etc. ⇒ Strong motor synchronization.
Physiological Tremor: Natural stochastic modulations in the interval [2, 15]Hz with low amplitude.
Acoustic Vocal Tremor Attributes:
Define Vocal Tremor
Vocal Tremor: Involuntary modulations of frequency and/or amplitude in sustained phonation.
Pathological & Physiological Vocal Tremor.
Pathological Tremor: From diseases like Parkinson, essential tremor, etc. ⇒ Strong motor synchronization.
Physiological Tremor: Natural stochastic modulations in the interval [2, 15]Hz with low amplitude.
Acoustic Vocal Tremor Attributes:
Define Vocal Tremor
Vocal Tremor: Involuntary modulations of frequency and/or amplitude in sustained phonation.
Pathological & Physiological Vocal Tremor.
Pathological Tremor: From diseases like Parkinson, essential tremor, etc. ⇒ Strong motor synchronization.
Physiological Tremor: Natural stochastic modulations in the interval [2, 15]Hz with low amplitude.
Acoustic Vocal Tremor Attributes:
Define Vocal Tremor
Vocal Tremor: Involuntary modulations of frequency and/or amplitude in sustained phonation.
Pathological & Physiological Vocal Tremor.
Pathological Tremor: From diseases like Parkinson, essential tremor, etc. ⇒ Strong motor synchronization.
Physiological Tremor: Natural stochastic modulations in the interval [2, 15]Hz with low amplitude.
Acoustic Vocal Tremor Attributes:
Modulation Frequency: How fast are the modulations.
Define Vocal Tremor
Vocal Tremor: Involuntary modulations of frequency and/or amplitude in sustained phonation.
Pathological & Physiological Vocal Tremor.
Pathological Tremor: From diseases like Parkinson, essential tremor, etc. ⇒ Strong motor synchronization.
Physiological Tremor: Natural stochastic modulations in the interval [2, 15]Hz with low amplitude.
Acoustic Vocal Tremor Attributes:
Vocal Tremor Estimation
Use of an AM-FM decomposition algorithm based on the adaptive time-varying quasi-harmonic model for speech.
High resolution in Time-Frequency plane.
Estimation of Vocal Tremor for any sinusoidal component of speech.
AM-FM Decomposition using aQHM
Speech is modeled as a sum of AM-FM sinusoids:
s(t) =
K
X
k=1
ak(t)cos(φk(t))
K is the number of components,
ak(t) is the instantaneous amplitude of the kth sinusoid,
φk(t) is the instantaneous phase of the kth sinusoid, and
fk(t) =2π1 d φk(t)
dt is the instantaneous frequency of the k
th
sinusoid.
Preprocessing of Inst. Component
Downsample inst. component to fs = 1000Hz
Remove the very slow (< 2Hz) modulations of the instantaneous component.
This is performed by Savinzky-Golay smoothing filter.
Compute Modulation Frequency & Level
Assuming that the processed inst. component has a single but time-varying modulation frequency and modulation level.
x (t) = m(t)cos(ψ(t))
Apply for second time the AM-FM dec. alg. to the processed inst. component.
Thus,
Modulation frequency, 2π1 d ψ(t)dt , is estimated from the FM
component of AM-FM dec. alg.
References for the work on Tremor
1 Yannis Pantazis, Maria Koutsoyannaki and Yannis Stylianou:
A novel method for the extraction of vocal tremor, MAVEBA-2009, Florence, Italy, 14-16 Dec, 2009.
2 Maria Koutsoyannaki, Yannis Pantazis, Yannis Stylianou, and
Philippe Dejonckere:
Outline
1 Introduction
Purpose
2 Jitter estimation
Spectral Jitter Estimator Short time SJE
3 Using sinusoidal modeling
Our work on Modulation Spectra
1 Maria Markaki and Yannis Stylianou:
Voice Pathology Detection and Discrimination based on Modulation Spectral Features.
IEEE Trans. on Audio, Speech and Language Processing. TASL.2010.2104141, Jan 2011
2 J.D. Arias-Londono, J.I. Godino-Llorente, M. Markaki, and Y.
Stylianou:
On combining information from Modulation Spectra and Mel-Frequency Cepstral coefficients for Automatic Detection of Pathological Voices.
Logopedics, Phoniatrics, Vocology (LPV), Nov 2010.
3 Maria Markaki and Yannis Stylianou:
Discrimination of Speech from Nonspeeech in Broadcast News Based on Modulation Frequency Features
Outline
1 Introduction
Purpose
2 Jitter estimation
Spectral Jitter Estimator Short time SJE
3 Using sinusoidal modeling
Acknowledgments
My ex-students: Miltos Vasilakis, Yannis Pantazis. My colleague: Olivier Rosec, France Telecom
Juan Ignacio Godino-Llorente, Department of Circuits & Systems Engineering, Universidad Polit´ecnica de Madrid, for the availability of the PdA database.
Outline
1 Introduction
Purpose
2 Jitter estimation
Spectral Jitter Estimator Short time SJE
3 Using sinusoidal modeling
J. Lohscheller, U. Eysholdt, H. Toy, and M. Doellinger, “Phonovibrography: Mapping High-Speed Movies of Vocal Fold Vibrations into 2-D Diagrams for Visualizing and Analyzing the Underlying Laryngeal Dynamics,” IEEE Trans. Medical Imaging, vol. 27, no. 3, pp. 300–309, 2008.
J. Svec and H. Schutte, “Videokymography: high-speed line scanning of vocal fold vibration,” Journal of Voice, vol. 10, no. 2, pp. 201–205, 1996.