Every day, huge amounts of multimedia data are generated by broadcast media worldwide, and it would be desirable to have this data automatically segmented and transcribed. Currently, real systems exist that provide accurate transcriptions in some contexts, such as TV broadcast news applications. In this work, we have started to investigate whether a broadcast news transcription system for the European Portuguese language can be reused for the similar problem of radio broadcast transcription. The main challenges are the significant increase in both telephone speech and spontaneous speaking style. Thus, we have initially focused on the automatic detection of telephone speech and on improving phonetic acoustic modelling for conversational telephone speech in particular. A relative WER reduction of 11.8% was achieved with respect to the broadcast news system, along with a high classification rate for the proposed telephone channel detector. Finally, some thoughts on future development are provided.
The results for the proposed FDLP technique are compared with those obtained for several other robust feature extraction techniques, namely RASTA, Multi-resolution RASTA (MRASTA), and the ETSI advanced (noise-robust) distributed speech recognition front-end. The first set of experiments compares the performance of these feature extraction techniques for the clean test conditions in the TIMIT database. The results of this experiment are shown in the first column of Table 1. The conventional PLP feature extraction used with a context of 9 frames is denoted as PLP-9. RASTA-PLP-9 features use a 9-frame context of the PLP features extracted by applying RASTA filtering. ETSI-9 corresponds to a 9-frame context of the features generated by the ETSI front-end. In a manner similar to the proposed spectro-temporal features, we also combine the posterior probabilities for the MRASTA features with PLP-9. For the proposed FDLP-based technique, we investigate the effect of gain normalization of the Hilbert envelopes on the spectro-temporal feature extraction. The FDLP-G features are derived from temporal envelopes without gain normalization, whereas the gain-normalized temporal envelopes are used for deriving the FDLP-GN features. Table 1 also shows the phoneme recognition results for one of the telephone sets (cb1) in the HTIMIT database. It can be seen that the spectro-temporal features extracted from temporal envelopes without removing the gain (FDLP-G) perform better than the other features on clean speech. Without much degradation in performance for clean speech, the FDLP-GN features provide significant improvements for telephone speech.
In this study, a novel approach for estimation of four characteristics of speakers, namely age, height, weight and smoking habit, from spontaneous telephone speech signals has been proposed. In this method, utterances were modeled using the i-vector and the NFA frameworks, which are based on the factor analysis on GMM means and weights, respectively. Then, ANNs and LSSVR were employed to estimate the age, height and weight of speakers, and ANNs and LR were used to perform smoking habit detection. Afterward, the score-level fusion of the i-vector-based and the NFA-based recognizers was considered for speaker age and smoking habit estimation tasks to improve the performance.
In this paper, we present a spectro-temporal feature extraction technique using sub-band Hilbert envelopes of relatively long segments of speech signal. Hilbert envelopes of the sub-bands are estimated using Frequency Domain Linear Prediction (FDLP). Spectral features are derived by integrating the sub-band Hilbert envelopes in short-term frames and the temporal features are formed by converting the FDLP envelopes into modulation frequency components. These are then combined at the phoneme posterior level and are used as the input features for a phoneme recognition system. In order to improve the robustness of the proposed features to telephone speech, the sub-band temporal envelopes are gain normalized prior to feature extraction. Phoneme recognition experiments on telephone speech in the HTIMIT database show significant performance improvements for the proposed features when compared to other robust feature techniques (average relative reduction of 11% in phoneme error rate).
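As a rough illustration of the envelope-estimation step, the sketch below applies linear prediction to the DCT of a single signal segment, so that the resulting all-pole model approximates the squared Hilbert envelope over time. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`dct2`, `levinson`, `fdlp_envelope`), the model order of 8, and the full-band processing (no sub-band decomposition) are all hypothetical choices, and gain normalization is modelled here simply as setting the prediction gain to one.

```python
import math

def dct2(x):
    """Unnormalised type-II DCT, direct O(N^2) form (fine for short segments)."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
                for t in range(n)) for k in range(n)]

def levinson(r, order):
    """Levinson-Durbin recursion on autocorrelation r[0..order].

    Returns the prediction-error filter a (with a[0] == 1) and the
    final prediction error, which acts as the gain of the all-pole model.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def fdlp_envelope(x, order=8, normalise_gain=False):
    """Approximate the squared temporal envelope of x with an all-pole model.

    Linear prediction is run on the DCT coefficients; the model "spectrum"
    evaluated at angle pi*t/N is then read as the envelope at time t.
    With normalise_gain=True the gain term is dropped (assumed form of
    gain normalization), which makes the result independent of input level.
    """
    n = len(x)
    c = dct2(x)
    r = [sum(c[k] * c[k + lag] for k in range(n - lag))
         for lag in range(order + 1)]
    a, err = levinson(r, order)
    gain = 1.0 if normalise_gain else err
    env = []
    for t in range(n):
        w = math.pi * t / n
        re = sum(a[j] * math.cos(j * w) for j in range(order + 1))
        im = sum(a[j] * math.sin(j * w) for j in range(order + 1))
        env.append(gain / (re * re + im * im))
    return env

# Toy segment: a carrier whose amplitude peaks around sample 40.
segment = [math.cos(0.9 * t) / (1.0 + ((t - 40) / 4.0) ** 2) for t in range(128)]
envelope = fdlp_envelope(segment, order=8)
normalised = fdlp_envelope(segment, order=8, normalise_gain=True)
```

In this simplified form, scaling the input changes `envelope` but leaves `normalised` unchanged, which is the property that makes a gain-normalized envelope less sensitive to channel-dependent level differences.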
A system for bandwidth extension of telephone speech, aided by data embedding, is presented. The proposed system uses the transmitted analog narrowband speech signal as a carrier of the side information needed to carry out the bandwidth extension. The upper band of the wideband speech is reconstructed at the receiving end from two components: a synthetic wideband excitation signal, generated from the narrowband telephone speech, and a wideband spectral envelope, parametrically represented and transmitted as embedded data in the telephone speech. We propose a novel data embedding scheme, in which the scalar Costa scheme is combined with an auditory masking model, allowing high-rate transparent embedding while maintaining a low bit error rate. The signal is transformed to the frequency domain via the discrete Hartley transform (DHT) and is partitioned into subbands. Data is embedded in an adaptively chosen subset of subbands by modifying the DHT coefficients. In our simulations, high-quality wideband speech was obtained from speech transmitted over a telephone line (characterized by spectral magnitude distortion, dispersion, and noise), in which side information data is transparently embedded at the rate of 600 information bits/second and with a bit error rate of approximately 3 · 10⁻⁴. In a listening test, the reconstructed wideband speech was preferred (to different degrees) over conventional telephone speech in 92.5% of the test utterances.
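The transform-domain embedding idea can be sketched in a few lines. The example below computes the DHT directly from its definition and hides bits in a handful of coefficients using plain quantization index modulation. This is a deliberately simplified stand-in, not the scheme described above: it omits the scalar Costa scaling, the auditory masking model, and the adaptive subband selection, and every name and parameter (`dht`, `embed`, `extract`, `step`, `first`) is a hypothetical choice made for illustration.

```python
import math

def dht(x):
    """Discrete Hartley transform: H[k] = sum_t x[t] * cas(2*pi*t*k/N)."""
    n = len(x)
    return [sum(x[t] * (math.cos(2 * math.pi * t * k / n)
                        + math.sin(2 * math.pi * t * k / n))
                for t in range(n)) for k in range(n)]

def idht(h):
    """The DHT is its own inverse up to a factor 1/N."""
    n = len(h)
    return [v / n for v in dht(h)]

def quantise(c, bit, step):
    """Snap c to the lattice step*Z (bit 0) or step*Z + step/2 (bit 1)."""
    offset = bit * step / 2.0
    return round((c - offset) / step) * step + offset

def embed(x, bits, step, first=1):
    """Embed one bit per DHT coefficient, starting at index `first`."""
    h = dht(x)
    for i, bit in enumerate(bits):
        h[first + i] = quantise(h[first + i], bit, step)
    return idht(h)

def extract(y, count, step, first=1):
    """Recover bits by choosing the nearer of the two lattices."""
    h = dht(y)
    bits = []
    for i in range(count):
        c = h[first + i]
        d0 = abs(c - quantise(c, 0, step))
        d1 = abs(c - quantise(c, 1, step))
        bits.append(0 if d0 <= d1 else 1)
    return bits

# Toy frame and payload.
frame = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(16)]
payload = [1, 0, 1, 1]
marked = embed(frame, payload, step=0.5)
recovered = extract(marked, len(payload), step=0.5)
```

One convenient property of the DHT for this purpose is that its coefficients are real for a real signal, so each coefficient can be modified independently and the inverse transform is still a real signal.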
conversations typically last 5 minutes and originate from a large number of participants for whom metadata is recorded, including participant age. The NIST databases were chosen for this work due to the large number of speakers and because the total variability subspace requires a considerable amount of development data for training. The development dataset used to train the total variability subspace and UBM includes over 30,000 speech recordings and was sourced from the NIST 2004–2006 SRE databases and the LDC releases of Switchboard 2 Phase III and Switchboard Cellular (parts 1 and 2). For the purpose of age estimation, telephone recordings from the common protocols of the recent NIST 2010 and 2008 SRE databases are used for training and testing respectively. The core protocol, short2-short3, from the 2008 database contains 3999 telephone recordings for 1336 speakers for whom the age is known. Similarly, the extended core-core protocol of the 2010 database contains 5634 telephone speech segments from 445 speakers. Figure 2 illustrates the age histograms of male and female speakers of the NIST 2010 and 2008 SRE databases.
This paper reports insights from translating Spanish conversational telephone speech into English text by cascading an automatic speech recognition (ASR) system with a statistical machine translation (SMT) system. The key new insight is that the informal register of conversational speech is a greater challenge for ASR than for SMT: the BLEU score for translating the reference transcript is 63%, but drops to 31% for translating automatic transcripts, whose word error rate (WER) is 40%. Several strategies are examined to mitigate the impact of ASR errors on the SMT output: (i) providing the ASR lattice, instead of the 1-best output, as input to the SMT system, (ii) training the SMT system on Spanish ASR output paired with English text, instead of Spanish reference transcripts, and (iii) improving the core ASR system. Each leads to consistent and complementary improvements in the SMT output. Compared to translating the 1-best output of an ASR system with 40% WER using an SMT system trained on Spanish reference transcripts, translating the output lattice of an ASR system with 35% WER using an SMT system trained on ASR output improves BLEU from 31% to 37%.
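For readers less familiar with the WER figures quoted above, the metric is the word-level edit distance between the reference and the automatic transcript, divided by the number of reference words. The sketch below is a minimal illustration with a hypothetical function name (`word_error_rate`) and made-up example sentences; it does no text normalization and is not the scoring tool used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length.

    Computed with the standard dynamic-programming edit distance over
    whitespace-separated words, keeping only one row at a time.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        cur = [i]  # deleting i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # match/substitution
        prev = cur
    return prev[-1] / len(ref)

# Made-up example: one reference word is dropped by the recognizer.
print(word_error_rate("no lo sé todavía", "no sé todavía"))
```

Because insertions are counted against the reference length, WER can exceed 100% for very poor hypotheses.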
3. Spanish Fisher Speech Corpus, developed by the Linguistic Data Consortium, consists of 819 telephone conversations lasting around 10 to 12 minutes each, yielding roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers. A broad set of topics is covered in the conversations, ensuring speech variability. Speaker segmentation is done by independently analysing each conversation channel, which is assumed to correspond to one speaker. The Fisher corpus is a challenging, large-vocabulary, spontaneous speech recognition dataset ideal for our purposes.
Training our system on the Continuous corpus resulted in a major improvement of the word error rate (WER) during evaluation. We attribute this result to several factors. Firstly, our corpus consists only of telephone speech, and GlobalPhone, contrary to its name, does not. This results in a mismatch between the spectral characteristics of the training and evaluation recordings for a system trained on GlobalPhone. Another reason is the larger amount of data available in the Continuous corpus: 25 hours versus GlobalPhone's 15 hours. Finally, our recordings are of higher quality and generally not as noisy as those encountered in GlobalPhone.
Seven audio recordings of varying lengths were used as the test data. They consisted of recorded interviews with SRVK users about their experiences with SRVK and their use in research and education. The audio data was originally collected for a qualitative research experiment, so there was no attempt to have uniform recording lengths. The test data was recorded with a hand-held, single-channel recorder (an Olympus DS-50 Digital Voice Recorder). The interviewer was speaking near the recorder, providing clean, direct speech, while the interviewee was speaking on a speakerphone, providing telephone speech. Since the audio data came from interview recordings, it consisted of spontaneous speech about ASR systems and the SRVK. Table 1 shows the speakers listed by gender (F/M) and native (E) or non-native (N) speaker of American English, recording length, analysis of audio files, and out-of-vocabulary percentage.
We have presented a complete state-of-the-art system for the transcription of conversational telephone speech and described a range of techniques in acoustic, pronunciation and language modeling specifically important for this task. Particularly powerful methods in acoustic modeling are the use of side-based cepstral normalization, VTLN, discriminative training using the MMI or MPE criteria, and heteroscedastic LDA. Speaker adaptation using standard or lattice-based MLLR and full variance transforms yields considerable word error rate improvements. In language modeling, the use of a background Broadcast News corpus together with class-based language models makes it possible to reduce the effect of the general lack of training data for this task. Pronunciation probabilities give consistent performance improvements. The use of lattices allows confusion network decoding and the efficient implementation of system combination. We have discussed several systems with similar performance and their use in system combination.
Anthony Rousseau, Paul Deléglise, and Yannick Estève. 2012. TED-LIUM: an automatic speech recognition dedicated corpus. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May. European Language Resources Association (ELRA).
In this paper, we investigate the effectiveness of GMS and i-vector for native language recognition on a spontaneous and real speech database instead of the ABI-1 corpus, which consists of clean and read speech signals. Consequently, we formed a database of non-native accents of English by extracting English utterances with Russian, Hindi, American English, Thai, Vietnamese and Cantonese accents from the NIST 2008 SRE database. For each utterance modeling method, three different classifiers, namely SVM, NBC and SRC, are employed to further investigate the role of classifiers in this task. Unlike SVM and NBC, sparse representation classification techniques have never been tested on accent recognition problems. On the other hand, recent studies show the effectiveness of GPPS in other speech technology problems such as speaker adaptation and speaker age group recognition [6, 8]. Consequently, we test GPPS along with i-vectors and GMS in our investigations on accent recognition too.
It is very common nowadays for people, especially university students, to make a phone call to get information such as results or application status. A computer telephony system usually answers the call automatically. All the caller needs to do is provide some information, for example an identity card number, by pressing telephone buttons, and then wait for the response. The computer system at the other end of the line searches the database for the desired information based on the caller's input.
The telephone plant uses copper wires to carry voice traffic between the subscriber and the central office. Call processing signals are also sent on these wires. The systems use wire pairs. The twisting of each pair (in a multipair cable) is staggered, as shown in Figure 2–2. Radiated energy from the current flowing in one wire of the pair is largely canceled by the radiated energy of the current flowing back in the return wire of the same pair. This approach greatly reduces the effect of crosstalk (interference on the pair). Moreover, each pair in the cable is less susceptible to external noise; the pair cancels out much of the noise because noise is coupled almost equally into each wire of the pair.
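The cancellation argument can be checked with a small numerical sketch. The model below is a deliberately idealized assumption, not a description of real cable behaviour: the signal is sent as equal and opposite voltages on the two wires, the same noise sample is added to both, and the receiver takes the difference. The function name and the `imbalance` parameter (the fraction by which noise coupling into the second wire falls short of the first) are hypothetical.

```python
import math
import random

def receive_over_pair(signal, noise, imbalance=0.0):
    """Send `signal` differentially over a wire pair and return the
    difference seen by the receiver.

    Each wire carries half the signal with opposite sign. Noise couples
    fully into wire A and with factor (1 - imbalance) into wire B, so a
    perfectly balanced pair (imbalance = 0) cancels it in the difference.
    """
    wire_a = [s / 2.0 + n for s, n in zip(signal, noise)]
    wire_b = [-s / 2.0 + n * (1.0 - imbalance) for s, n in zip(signal, noise)]
    return [a - b for a, b in zip(wire_a, wire_b)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
signal = [math.sin(2 * math.pi * t / 50.0) for t in range(200)]
noise = [rng.gauss(0.0, 0.5) for _ in range(200)]

balanced = receive_over_pair(signal, noise)
unbalanced = receive_over_pair(signal, noise, imbalance=0.1)
single_ended = [s + n for s, n in zip(signal, noise)]  # one wire against ground

def worst_error(received):
    return max(abs(r - s) for r, s in zip(received, signal))

print(worst_error(balanced), worst_error(unbalanced), worst_error(single_ended))
```

In this toy model the residual error is proportional to the imbalance, which is why the text says the noise must be coupled "almost equally" into each wire for the cancellation to work well.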
Until recently, although the telephone has been thought of as a peculiarly female means of communication, there has been an absence of studies which compare male and female telephone use. Published reviews (eg Singer, 1981; Claisse and Rowe, 1987) offer no analysis by sex. Some incidental data is available. Maddox (1977) reported that women use the telephone most frequently for intrinsic reasons and use it more often because they are more likely to be at home. Noble (1987) found that women use the telephone more frequently than men for intrinsic reasons, whereas there were no sex differences for instrumental use of the domestic telephone.