Disordered Speech Assessment Using Automatic
Methods Based on Quantitative Measures
Lingyun Gu
Computational NeuroEngineering Laboratory, Department of Electrical&Computer Engineering, University of Florida, Gainesville, FL 32611-6200, USA
Email:[email protected]
John G. Harris
Computational NeuroEngineering Laboratory, Department of Electrical&Computer Engineering, University of Florida, Gainesville, FL 32611-6200, USA
Email:[email protected]
Rahul Shrivastav
Department of Communication Sciences&Disorders, University of Florida, Gainesville, FL 32611, USA Email:[email protected]
Christine Sapienza
Department of Communication Sciences&Disorders, University of Florida, Gainesville, FL 32611, USA Email:[email protected]
Received 2 November 2003; Revised 6 August 2004
Speech quality assessment methods are necessary for evaluating and documenting treatment outcomes of patients suffering from degraded speech due to Parkinson’s disease, stroke, or other disease processes. Subjective methods of speech quality assessment are more accurate and more robust than objective methods but are time-consuming and costly. We propose a novel objective measure of speech quality assessment that builds on traditional speech processing techniques such as dynamic time warping (DTW) and the Itakura-Saito (IS) distortion measure. Initial results show that our objective measure correlates well with the more expensive subjective methods.
Keywords and phrases:objective speech quality measures, subjective speech quality measures, pathology, anthropomorphic.
1. INTRODUCTION
The accurate assessment of speech quality is a major research problem that has attracted attention in the field of speech communications for many years. The two major classes of methods employed in the assessment of speech quality are subjective and objective speech quality measures. Subjective quality measures are more accurate and robust since they are given by professional personnel who have received spe-cial assessment training, but they are necessarily time con-suming and costly. On the contrary, objective quality mea-sures, inspired by speech signal processing techniques, pro-vide an efficient, economical alternative to subjective mea-sures. Although it is not suggested to use objective quality measures to completely replace subjective measures, objec-tive quality measures do show the strong ability to predict subjective quality measures and the results do correlate very
well with those produced by subjective quality measures [1]. Traditionally, objective measures have been used to evaluate speech after decoding and in the presence of noise. Currently, some pioneers have already developed some system protocols or algorithms to apply objective speech quality assessment into disordered speech analysis.
(3) they must be reliable and robust in their test method. The key aspect of utilitarian approaches is that the results are summarized by a single number. On the other hand, analytic methods try to identify the underlying psychological compo-nents that determine perceived quality, and to discover the acoustic correlates of these components. Therefore the re-sults from analytic methods are summarized on a multidi-mensional scale [1].
The modified rhyme test (MRT) by House and the di-agnostic rhyme test (DRT) by Voiers are both intelligibil-ity measures. The mean opinion score (MOS) test and the diagnostic acceptability measure (DAM) are overall quality measures, even though MOS is also commonly categorized as utilitarian and DAM is classified as analytic. It is under-standable that subjective quality measures are the preferable means of quality assessment but subjective measures do have several major drawbacks: (1) subjective measures require sig-nificant time and personnel resources, making it difficult to evaluate the range of potential speech/voice distortion; (2) subjective measures do not work very well when the tested speech database is large [2]; (3) some rating score protocols are not suitable for measurement of speech/voice [3]; (4) some literature suggests that listeners cannot agree on spe-cific speech/voice ratings [4].
Compared with the subjective measures mentioned above, objective measures have several outstanding advan-tages: (1) they are less expensive to administer, saving money, time, and human resources; (2) they produce more consis-tent results and are not affected by human error; (3) most importantly, the form of the objective measure itself can give valuable insight into the nature of the human speech per-ception process, helping researchers understand the speech production mechanism more deeply [1]. Generally speaking, objective speech quality measures are usually evaluated in the time, spectral, or cepstral domains.
This paper is organized as follows. In Section 2, dis-ordered speech background will be introduced. Then, in Section 3, the DTW method is discussed. Specific speech fea-tures for disordered speech will be proposed in Section 4. Section 5deals with one subjective measure. All experimen-tal results are discussed inSection 6. Finally, conclusions are drawn inSection 7.
2. DISORDERED SPEECH BACKGROUND
Usually, patients with Parkinson’s disease or people who have suffered a stroke have difficulty producing clear speech, re-sulting in a loss of intelligibility. Hence, it is important to develop a means to help them produce more clear speech or develop algorithms to automatically clarify their unclear speech. These efforts require an efficient method to evaluate disordered speech as the first step.
Attempts to develop algorithms to evaluate disordered speech require us to understand how disordered speech is produced, the factors that affect disordered speech, and the explicit phenomena related to these factors. The term “dysarthria” is used to describe changes in speech production
Automatic assessment procedure
Patients’ speech
DTW alignment
Objective quality assessment
Healthy speech
Score scaling system
Speech quality score
Figure1: Objective patients’ speech quality assessment block dia-gram.
characterized by an impairment in one or more of the sys-tems involved in speech [5]. The three major syssys-tems in-volved in speech production are respiration, voice produc-tion, and articulation. Voice is produced by the larynx and the oral structures articulate to modify the sound source pro-duced by the larynx. The dysarthria associated with Parkin-son’s disease is referred to as a hypokinetic dysarthria [6,7]. Common symptoms of hypokinetic dysarthria include re-duced loudness of speech and/or monoloudness (lack of loudness variation) and reduced speaking rate with intermit-tent rapid bursts of speech. For instance, speakers may show a slow rate of speech, but particular words or phrases within that utterance may be produced with a rapid rate. The oral structures such as the tongue and lips are “rigid,” resulting in a reduced range of movement. This effectively dampens the speech signal and distorts the accuracy of the sound (con-sonant or vowel) production. There may be some instances of hypernasality as the condition worsens resulting from an inadequate velar closure. This may also result in the damp-ening of the sound produced. Voice quality in these patients is often described as hoarse or harsh.
In this paper, we test several well-known speech process-ing parameters that can quantify the severity of disordered speech. These are the Itakura-Saito (IS) measure, the log-likelihood ratio (LLR) measure, and the log-area-ratio (LAR) measure which evaluate the spectral envelope of the given disordered speech.Figure 1 shows the objective disordered speech quality assessment block diagram.
3. DYNAMIC TIME WARPING
by healthy people as the gold standard to compare with dis-ordered speech. In this case, aligning the two different speech segments to the same reasonable comparable length is cru-cial. Dynamic time warping (DTW) is the most straightfor-ward solution and is used to solve exactly this problem in speech recognition applications.
Given two speech patterns,XandY, these patterns can be represented by a sequence (x1,x2,. . .,xTx) and (y1,y2,. . ., yTy), wherexiandyiare the feature vectors. As we have noted, in general the sequence ofxi’s will not have the same length as the sequence ofyi’s. In order to determine the distance be-tweenXandY, given that some distance functiond(x,y) ex-ists, we need a meaningful way to determine how to properly align the vectors for the comparison. DTW is one way that such an alignment can be made [8]. We define two warping functions,φxandφy, which transform the indices of the vec-tor sequences to a normalized time axis,k. Thus we have
ix=φx(k), k=1, 2,. . .,T, iy=φy(k), k=1, 2,. . .,T.
(1)
This gives us a mapping from (x1,x2,. . .,xTx) to (x1,x2,. . ., xT) and from (y1,y2,. . .,yTy) to (y1,y2,. . .,yT). With such a mapping, we are able to computedφ(x,y) using these warp-ing functions, givwarp-ing us the total distance between two pat-terns as
dφ(x,y)= T
k=1
dφx(k),φy(k)
m(k) Mφ
, (2)
wherem(k) is a path weight andMφis a normalization factor. Thus, all that remains is the specification of the pathφ indi-cated in the above equation. The most common technique is to specify thatφis the minimum of all possible paths, subject to certain constraints by using the equation as follows:
d(X,Y)min
φ dφ(x,y). (3)
For time normalization, the optimal path based on DTW has fixed beginning and ending points. Some other constraints may also apply. For example, the path should be monotonic, which requires a positive slope. This constraint eliminates the possibility of reverse warping. Therefore, we choose to en-force the Type III local constraint [8]. In addition, our nu-merous experimental results show that the eight local con-straints will not significantly change the final results. Because of the local continuity constraints, certain portions are ex-cluded from the region the optimal warping path can tra-verse. By using the maximum and minimum possible path expansion, we can define global path constraints as follows:
1 +
φx(k)−1
Qmax ≤
1 +Qmax
φx(k)−1
,
Ty+Qmax
φx(k)−Tx
≤Ty+
φx(k)−Tx
Qmax .
(4)
In this aspect, slope weighting along the path adds yet an-other dimension of control in the search for the optimal warping path. There are four types of slope weighting. The type chosen in this paper is
m(k)=φx(k)−φx(k−1) +φy(k)−φy(k−1). (5)
If we take the notationd(ix,iy) as the distance betweenxix and yiy, which are the elements of (x1,x2,. . .,xTx) and (y1,y2,. . .,yTy), respectively, and D(ix,iy) as the accumula-tive optimal value, then we can apply the exact local con-straint as well as the slope weight to get
Dix,iy
=min
Dix−2,yx−1
+ 3dix,iy
Dix−1,yx−1
+ 2dix,iy
Dix−1,yx−2
+ 3dix,iy
. (6)
4. OBJECTIVE QUALITY MEASURES
From an anthropomorphic perspective, speech production is very complex but a simple view is that vowels are produced by the lungs, the larynx excitation, and the resonance of the vocal tract. The laryngeal configuration and the tongue’s po-sition dramatically change an individual speaker’s speech in-tonation, pitch, or quality. For example, due to differences in tongue positions during pronunciation, nonnative speakers of English may use tongue movements characteristic to their native language, thereby producing a noticeable accent. Sim-ilarly, the rigid tongue movement of the Parkinson’s patient causes their pronunciation to become distorted. We attempt to develop objective speech quality measures using knowl-edge of human speech production. However, we first need to define a few terms commonly used in speech processing. A formant is defined as a peak in the speech power spectrum. The pitch of speech is usually determined by the frequency of the excitation signal, which is produced by the vibration of the vocal folds. The vocal tract resonance is usually repre-sented by the spectral envelope.
percentage [10,11], and is often unsuitable for clinical pur-poses. Many of these measures can only be calculated from relatively steady portions of the speech signal. However, nu-merous studies have stressed that the unsteady parts of the signal, such as onset, could provide valuable information for objective evaluation of speech and allow finer discrimination of the severity of dysphonia. In addition, many of these mea-sures are calculated from a single vowel that patients are re-quired to produce for a relatively long period of time [12,13]. In reality, the natural continuous sentence may provide a more accurate picture of the patients speech disorder.
To overcome some of these shortcomings of the existing speech analysis techniques, we propose a new algorithm orig-inally inspired by the speech coding-decoding and speech telecommunications techniques. The first meaningful mea-sure which can be obtained to compare speech differences is to compute the differences of the logarithms of the power spectrum at each frequency range [4]. We use the following equation to represent the difference:
d(w)=lnX(w)2−lnY(w)2, (7)
whereX(w) andY(w) are the magnitudes in the frequency domain of two compared speech signals. It is also possible to formally express the most easy and straightforward method to stand for the spectral distortion as follows:
d(X,Y)= π −π
d(w)kdw 2π
1/k
, (8)
where, again,XandY here represent the two speech signals to be compared.
Although the above method is easy to implement, good results are not guaranteed. Many different types of modi-fied standard objective quality measures have been proposed. These include measures such as the Itakura-Saito (IS) dis-tortion measure, the log-likelihood ratio (LLR) measure, the log-area-ratio (LAR) measure, the segmental SNR measure, and the weighted spectral slope (WSS) measure. In this pa-per, we chose to investigate the first three measures: IS, LLR and LAR [14,15,16].
The IS distortion measure is calculated based on the fol-lowing equation:
dIS
ad,aφ
=
σφ2 σ2 d
adRφaTd
aφRφaTφ
+ log
σφ2 σ2 d
−1, (9)
where σ2
φ andσd2 represent the all-pole gains for the stan-dard healthy people’s speech and the test patients’ speech.aφ andadare the healthy-speech and patient-speech LPC coef-ficient vectors, respectively.Rφis the autocorrelation matrix forxφ(n), wherexφ(n) is the sampled speech of healthy peo-ple. The elements ofRφare defined as
rφ
|i−j|= N−|i−j|
n=1
rφ(n)rφ
n+|i−j|,
|i−j| =0, 1,. . .,p,
(10)
Table1: MOS subjective measure evaluation table.
Rating Speech quality Level of distortion 5 Excellent Imperceptible
4 Good Perceptible, but not annoying 3 Fair Perceptible, and slightly annoying 2 Poor Annoying, but not objectionable 1 Unsatisfied Very annoying and objectionable
whereNis the length of the speech frame andpis the order of LPC coefficients.
LLR is similar to the IS measure. However, while the IS measure incorporates the gain factor by using variance terms, LLR only considers the difference between the general spec-tral shapes. The following equation provides the details for computing the LLR:
dLLR
ad,aφ
=log
adRφaTd
aφRφaTφ
. (11)
LAR is another speech quality assessment measure based on the dissimilarity of LPC coefficients between healthy speech and the patient’s speech. Different from LLR, LAR uses the reflection coefficients to calculate the difference and is expressed by the equation
dLAR=
1p
p
i=1
log1 +rφ(i) 1−rφ(i)−
log1 +rd(i) 1−rd(i)
2
1/2
, (12)
wherepis the order of the LPC coefficients,rφ(i) andrd(i) are theith reflection coefficients of healthy and patient’s speech signals.
In the following section describing the experiment and results, we will compare the performances of each of these measures applied to our database. The correlation between these objective quality assessment measures and one subjec-tive quality assessment will also be discussed.
5. SUBJECTIVE QUALITY MEASURES
No matter how speech quality is defined, it must be based on human response and perception. So designing a suitable subjective measure of quality is very important in the assess-ment of speech quality. Correspondingly, the most important criterion to evaluate the accuracy of an objective measure of quality is to determine its correlation with subjective quality measures.
Table2: Moderate-severe subjective measure evaluation table.
Rating Level of distortion
3 Moderate
2 Moderate to severe
1 Severe
quality. The final speech quality assessment value can be cal-culated as the average of the responses of several listeners. The MOS test is widely used in the telecommunications area to compare the original signal quality with that of the dis-torted signal. For disordered speech analysis, however, it may not be feasible to categorize sentences as “perceptible, but not annoying” or “annoying, but not objectionable.” Therefore, a different commonly used subjective utilitarian measure was obtained. In this test, listeners rated the sentences into three categories: mild, moderate, or severe [5,6,7]. A similar 4-point rating scale, called the GRBAS method, has been pre-sented for the evaluation of disorder voice quality [17]. In these subjective tests, each test sentence was assigned a score based on whether the disordered sentence quality was per-ceived to be mild, moderate, or severe. Based on our database of Parkinson’s patients tested in this experiment, we modi-fied the mild-moderate-severe rating scale to have three new levels: moderate, moderate to severe, and severe. The details and criteria for these ratings are listed inTable 2. The fol-lowing procedures were followed when obtaining perceptual judgment in the present experiment: Listeners were asked to listen carefully to each test sentence. Listeners were allowed to hear the test sentence as many times as needed to ensure that they assigned the most appropriate score to each sentence. Listeners were asked to read the criteria table (Tables1and 2) carefully and were required to assign a score to each sen-tence based on the level of distortion described in the tables.
6. EXPERIMENTAL RESULTS
The speech database used in this experiment was collected by the experimenters at the Motor Movement Disorders Clinic, University of Florida. Ten patients with Parkinson’s disease were recorded reading a standard passage (“Grandfather Pas-sage”). Additionally, the same passage was also recorded from four healthy adult speakers. Although speakers vary in their rate of speech, this passage takes approximately 1 minute to read. Three successive sentences (around 15 seconds in du-ration) were selected from this passage for acoustic and per-ceptual analyses. The sentences include “You wish to know all about my grandfather. Well, he is nearly ninety three years old. He dresses himself in an ancient black frock coat, usu-ally minus several buttons.” The fourteen speakers were di-vided into two groups—males and females. In the first lis-tening test, six listeners evaluated the speech of four Parkin-son’s patients and one healthy speaker. In the second listening test, we tested twelve listeners who rated the speech of seven Parkinson’s patients and one healthy speaker. Of the 18 par-ticipants in the listening tests, six were from the USA, five from China, five from India, one from Korea, and one from
Turkey. Seven of them were male and the rest were female. All listeners spoke fluent English.
The first listening test was used to obtain ratings using the MOS criteria listed inTable 1. Listeners gave an individ-ual score to each sentence. In this study, two different meth-ods were used to compare the objective and the subjective measures. In the first method, all MOS scores given by the listeners were correlated with the distance measures calcu-lated by the various algorithms. In the second approach, the order of the MOS scores (rather than the actual value of the MOS scores) was correlated with the distance measures. In this approach, listeners simply ordered each sentence from the best to the worst quality. If two or more sentences were given the same rank, listeners were asked to listen carefully and choose different ranks for each sentence. In contrast, in the first method, listeners may end up giving identical integer scores to two speech segments even though one may sound noticeably better than the other.Table 3gives the details on all the sentences scored using MOS scale for male speakers only. Sentences labelled as P1, P2, P3, and P4 were spoken by the Parkinson’s patients and H1 is the sentences spoken by the healthy speaker. The six listeners are labelled as List1 to List6.
One sentence from a healthy speaker was used as the standard sentence for calculating the objective measures of quality. DTW was first applied to align this standard sen-tence with each patient’s sensen-tence.Figure 2shows the opti-mal frame match path between the standard healthy speech and the patient’s speech. For the second method, every pro-cedure is the same except for replacing the exact score by the relative order. Therefore, inTable 3, each column is the order given by each listener.
Finally, the three distortion measures (IS, LLR, and LAR) were calculated. The last three columns inTable 3show the exact values of IS, LLR, and LAR, respectively. In Table 4, the last three columns show the relative order of the distor-tion scores obtained from each speaker.Figure 3shows the healthy speech waveform (upper panel), the patient speech waveform (middle panel) and their distortion curve calcu-lated by the IS measure (lower panel).Figure 4shows a sim-ilar comparison based on LLR andFigure 5shows the same comparison based on LAR.Figure 6exhibits the histogram of the distortion values, which may give us deeper insight about the differences between the healthy speaker and the patient’s speech. This may provide greater information than the use of a single number obtained by averaging the distortion mea-sures across a number of frames.
As discussed earlier, the quality of an objective measure is determined by how well it predicts the subjective measure. The following formula is widely used to evaluate the perfor-mance of objective measures:
ρ=
d
Sd−S¯d
Od−O¯d
d
Sd−S¯d 2
d
Od−O¯d
21/2, (13)
whereSd andOd are subjective and objective results. ¯Sdand ¯
Table3: Subjective test results and their correlation with objective test using method 1 in the first round.
Subject List1 List2 List3 List4 List5 List6 Avg. IS LLR LAR
P1 2 3 2 2 3 2 2.33 71 035 197.5 1441.5
P2 2 1 2 1 2 1 1.50 769 990 175.6 1054.2
P3 3 1 1 1 2 2 1.67 572 200 152.3 1014.9
P4 3 2 3 2 3 3 2.67 304 150 218.8 1025.4
H1 5 5 5 5 5 5 5 24 155 96.2 752.5
Corr. — — — — — — — 0.7638 0.6419 0.5729
20 40 60 80 100 120 140 160 180
50 100 150
Frame number
Fr
am
e
n
u
m
b
er
20 40 60 80 100 120 140 160 180
50 100 150
Frame number
Fr
am
e
n
u
m
b
er
Figure2: Dynamic time warping (DTW) optimal path between the recorded speech of a healthy person (horizontal axis) and a Parkinson’s patient (vertical axis).
Table4: Subjective test results and their correlation with objective test using method 2 in the first round.
Subject List1 List2 List3 List4 List5 List6 Avg. IS LLR LAR
P1 5 2 3 3 3 3 3.2 2 4 5
P2 4 4 4 5 4 5 4.3 5 3 4
P3 2 5 5 4 5 4 4.2 4 2 2
P4 3 3 2 2 2 2 2.3 3 5 3
H1 1 1 1 1 1 1 1 1 1 1
Corr. — — — — — — — 0.8684 0.1828 0.5142
three objective measures and their correlation values based on method 1. The IS measure, with a correlation of 0.7638, showed the best performance. Table 4 lists the correlation values based on method 2, and once again the IS measure showed the highest correlation of 0.8684. In analyzing (9), (11) and (12), we can see that the good performance of the IS measure might be partially due to the fact that it not only considers the general spectral difference, but also uses the variance term to take into account the gain factor of the all-pole filter model.
0.5 0
−0.5
−1
A
m
plitude
0 0.5 1 1.5 2
Time(s) 0.5
0
−0.5
A
m
plitude
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 Time(s)
5000 4000 3000 2000 1000
A
m
plitude
20 40 60 80 100 120 140 160 180 200 Frame number
Figure3: IS value (lower) versus healthy speech waveform (upper) and patient speech waveform (middle).
0.5 0
−0.5
−1
A
m
plitude
0 0.5 1 1.5 2
Time(s) 0.5
0
−0.5
A
m
plitude
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 Time(s)
15 10 5
A
m
plitude
20 40 60 80 100 120 140 160 180 200 Frame number
Figure4: LLR value (lower) versus healthy speech waveform (up-per) and patient speech waveform (middle).
MOS from individual listeners, the average MOS, and the correlation between the IS measure and MOS values based on method 1 described earlier. This correlation was found to be 0.8032 and is comparable with 0.7638 obtained in the first round test.Table 6shows the moderate-severe test scores from each listener, the average moderate-severe test scores, and the correlation between the IS measure and the subjec-tive ratings. Once again, a correlation of 0.7417 was obtained which is comparable to that obtained in the first round test.
0.5 0
−0.5
−1
A
m
plitude
0 0.5 1 1.5 2
Time(s) 0.5
0
−0.5
A
m
plitude
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 Time(s)
2.5 2 1.5 1 0.5
A
m
plitude
20 40 60 80 100 120 140 160 180 200 Frame number
Figure5: LAR value (lower) versus healthy speech waveform (up-per) and patient speech waveform (middle).
80 70 60 50 40 30 20 10 0
0 0.5 1 1.5 2 2.5 3
×104
Distortion value
Co
u
n
t
Figure6: The histogram of the distortion values based on the IS method.
Table5: Subjective test results and their correlation with objective test using method 1 in the second round based on MOS test.
Subject List1 List2 List3 List4 List5 List6 List7 List8 List9 List10 List11 List12 Avg. IS
P1 3 2 4 3 2 2 3 3 3 3 1 1 2.50 41 500
P2 2 3 3 2 3 1 2 2 2 3 2 1 2.17 84 200
P3 1 2 2 1 2 1 2 1 1 2 1 1 1.42 264 000
P4 4 4 4 4 4 3 5 4 5 4 4 4 4.08 10 300
P5 4 4 3 3 3 4 5 4 4 5 4 4 3.92 29 800
P6 1 3 2 1 2 2 3 1 2 3 2 1 1.92 205 000
P7 2 3 2 2 2 3 4 3 3 3 3 2 2.67 103 000
H1 5 5 5 5 5 3 5 5 5 5 5 5 4.83 6010
Corr. — — — — — — — — — — — — — 0.8032
Table6: Subjective test results and their correlation with objective test using method 1 in the second round based on moderate-severe test.
Subject List1 List2 List3 List4 List5 List6 List7 List8 List9 List10 List11 List12 Avg. IS
P1 1 2 2 1 2 1 2 1 1 1 2 1 1.42 205 000
P2 2 2 1 2 2 2 2 2 3 2 2 2 2 103 000
P3 3 3 3 3 3 3 3 3 3 3 3 3 3 10 300
P4 1 1 1 1 1 1 1 2 1 1 1 1 1.08 264 000
P5 2 2 2 1 2 2 2 1 2 2 1 1 1.67 84 200
P6 2 3 1 2 2 2 3 2 2 2 2 2 2.08 41 500
P7 3 3 3 3 3 3 3 3 3 3 3 3 3 29 800
Corr. — — — — — — — — — — — — — 0.7417
the monoloudness or energy variation [2]. The following equations give the exact mathematic expressions for these measures:
PT= 1 N−1
N−1
i=1
P(i+ 1)−P(i),
ET= 1
N−1 N−1
i=1
E(i+ 1)−E(i), (14)
where N is the total number of frames of the given sen-tence, andP(i) and E(i) represent the pitch and energy of the framei. We used the data obtained in the first round of evaluation to test the correlation between these measures (PT and ET) and the subjective ratings.Table 7shows the pitch turbulence (PT) and energy turbulence (ET) values calcu-lated from (14) as well as their correlation based on method 1.Table 8shows the similar results based on the method 2. Figures 7and8 show the pitch turbulence and energy tur-bulence from a given speech signal. Based on Tables 7and 8, it appears that PT and ET are poorly correlated with the subjective assessments, using either method 1 or method 2. This suggests that during subjective assessment, humans put most of their emphasis on intelligibility, which, from a signal processing view, is related primarily to the spectral envelope. The excitation (pitch) and energy variation are not as impor-tant as spectral envelope variation in the perception of over-all speech quality. Even in our current algorithm, pitch and energy turbulence were not very efficient in predicting the
Table7: Subjective test results and their correlation with PT and ET test using method 1.
Subject Avg. PT ET
P1 2.33 13.2777 3.9040
P2 1.50 8.0712 8.0712
P3 1.67 4.5775 11.9782
P4 2.67 16.3607 5.8966
H1 5 4.2815 8.3446
Corr. — 0.1264 0.1137
Table8: Subjective test results and their correlation with PT and ET test using method 2.
Subject Avg. PT ET
P1 3.2 4 5
P2 4.3 3 3
P3 4.2 2 1
P4 2.3 5 4
H1 1 1 2
Corr. — 0.1828 0.0800
0.5 0
−0.5
−1
A
m
plitude
0 0.5 1 1.5 2
Time(s)
200 150 100
Pit
ch
20 40 60 80 100 120 140 160 180 200 220 Frame number
Figure7: The pitch turbulence (lower) from a given speech signal (upper).
0.5 0
−0.5
−1
A
m
plitude
0 0.5 1 1.5 2
Time(s) 100
50 0
−50
−100
−150
A
m
plitude
20 40 60 80 100 120 140 160 180 200 220 Frame number
Figure8: The energy turbulence (lower) from a given speech signal (upper).
7. CONCLUSION
Objective evaluation of disordered speech quality is not an easy task. In this paper, we discuss three objective qual-ity assessment measures and one subjective measure. By evaluating our speech database, the IS measure showed a strong correlation with the MOS tests. Therefore, the IS mea-sure is suggested to be more suitable than LLR and LAR for use as a reliable tool to evaluate the overall quality of disor-dered speech. The IS measure could also be used to predict the subjective quality measure MOS score given by humans.
ACKNOWLEDGMENT
The authors are grateful to three reviewers who provided us with a large number of detailed suggestions for improving the submitted manuscript.
REFERENCES
[1] S. Quanckenbush, T. Barnwell, and M. Clements,Objective Measures of Speech Quality, Prentice Hall, New York, NY, USA, 1988.
[2] J. Hansen and S. Nandkumar, “Objective quality assessment and the RPE-LTP vocoder in different noise and language con-ditions,”Journal of the Acoustical Society of America, vol. 97, no. 1, pp. 609–627, 1995.
[3] J. Hansen and L. Arslan, “Robust feature-estimation and ob-jective quality assessment for noisy speech recognition using the credit card corpus,”IEEE Trans. Speech Audio Processing, vol. 3, no. 3, pp. 169–184, 1995.
[4] S. Dimolitsas, “Objective speech distortion measures and their relevance to speech quality assessments,”IEE Proceed-ings Part I: Communications, Speech and Vision, vol. 136, no. 5, pp. 317–324, 1989.
[5] L. Ramig, C. Bonitati, J. Lemke, and Y. Horii, “Voice treat-ment for patients with Parkinson disease: Developtreat-ment of an approach and preliminary efficacy data,”Journal of Medical Speech-Language Pathology, vol. 2, no. 3, pp. 191–209, 1994. [6] S. Countryman, L. Ramig, and A. Pawlas, “Speech and
voice deficits in Parkinsonian Plus syndromes: Can they be treated?”Journal of Medical Speech-Language Pathology, vol. 2, no. 3, pp. 211–225, 1994.
[7] S. Countryman and L. Ramig, “Effects of intensive voice ther-apy on voice deficits associated with bilateral thalamotomy in Parkinson disease: A case study,”Journal of Medical Speech-Language Pathology, vol. 1, no. 4, pp. 233–250, 1993. [8] L. Rabiner and B. Juang,Fundamental of Speech Recognition,
Prentice Hall, New York, NY, USA, 1984.
[9] P. Yu, M. Ouaknine, J. Revis, and A. Giovanni, “Objective voice analysis for dysphonic patients: a multiparametric pro-tocol including acoustic and aerodynamic measurements,” Journal of Voice, vol. 15, no. 4, pp. 529–542, 2001.
[10] A. Giovanni, D. Robert, N. Estublier, B. Teston, M. Zanaret, and M. Cannoni, “Objective evaluation of dysphonia: prelim-inary results of a device allowing simultaneous acoustic and aerodynamic measurements,” Folia Phoniatr Logop, vol. 48, no. 4, pp. 175–185, 1996.
[11] J. Revis, A. Giovanni, F. Wuyts, and J. Triglia, “Comparison of different voice samples for perceptual analysis,”Folia Phoniatr Logop, vol. 51, no. 3, pp. 108–116, 1999.
[12] D. Berry, K. Verdolini, D. Montequin, M. Hess, R. Chan, and I. Titze, “A quantitative output-cost ratio in voice production,” Journal of Speech, Language and Hearing Research, vol. 44, no. 1, pp. 29–37, 2001.
[13] P. Dejonckere, C. Obbens, G. De Moor, and G. Wieneke, “Per-ceptual evaluation of dysphonia: Reliability and relevance,” Folia Phoniatr Logop, vol. 45, no. 2, pp. 76–83, 1993.
[14] E. Wallen and J. Hansen, “A screening test for speech pathol-ogy assessment using objective quality measures,” inProc. 4th International Conference on Spoken Language Proceedings (IC-SLP ’96), vol. 2, pp. 776–779, Philadelphia, Pa, USA, October 1996.
[15] L. Thorpe and W. Yang, “Performance of current perceptual objective speech quality measures,” inProc. IEEE Workshop on Speech Coding Proceedings (SCW ’99), pp. 144–146, Porvoo, Finland, June 1999.
[16] J. Hansen and B. Pellom, “An effective quality evaluation pro-tocol for speech enhancement algorithms,” On-line Technical Report.
Lingyun Gureceived his B.S. and M.S. de-grees in electrical engineering from the Uni-versity of Electronic Science and Technol-ogy of China (UESTC) and Old Domin-ion University in 1998 and 2002, respec-tively. He is currently pursuing his Ph.D. in the Computational NeuroEngineering Lab-oratory, Electrical and Computer Engineer-ing Department, University of Florida (UF). His main research interests are in robust
speech recognition, speech signal processing, and auditory produc-tion and percepproduc-tion.
John G. Harrisreceived his B.S. and M.S. degrees in electrical engineering from MIT in 1983 and 1986. He earned his Ph.D. de-gree from Caltech in the interdisciplinary Computation and Neural Systems Program in 1991. After a two-year postdoc at the MIT AI Lab, Dr. Harris joined the Electrical and Computer Engineering Department, Uni-versity of Florida (UF). He is currently an Associate Professor and leads the Hybrid
Signal Processing Group in researching biologically inspired cir-cuits, architectures, and algorithms for signal processing. Dr. Harris has published over 100 research papers and patents in this area. He codirects the Computational NeuroEngineering Laboratory and has a joint appointment in the Biomedical Engineering Depart-ment at UF.
Rahul Shrivastavearned his B.S. degree in 1995 and M.S. degree in 1997 in speech and hearing sciences from the University of Mysore, India. He completed his Ph.D. degree in speech and hearing science from Indiana University, Bloomington, in 2001. Currently he is on the faculty at the Depart-ment of Communication Sciences and Dis-orders, University of Florida. His research is studying the factors that affect the
percep-tion of voice quality and speech intelligibility in patients with a va-riety of speech disorders.
Christine Sapienza received her Ph.D. degree in speech science from The State University of New York at Buffalo in 1993. Currently, she is a Professor in the Depart-ment of Communication Sciences and Dis-orders, University of Florida. Her most re-cent work has focused on the use of strength training paradigms in multiple populations including voice disorders, Parkinson’s dis-ease, spinal cord injury, and multiple