How voice-recognition software presents a useful transcription tool for qualitative and mixed methods researchers

ANNA K FLETCHER AND GREG SHAW

Faculty of Education, Health and Science, Charles Darwin University, Darwin, NT, Australia

ABSTRACT

Voice recognition (VR) software has increased in accuracy and ease of use over the last decade. While VR may carry the potential to significantly ease the transcription process, only recently has it gained enough accuracy and ease of use to become a valid alternative to manually typed transcription of qualitative data. However, the use of VR transcription in mixed methods research has largely remained unexplored. This article aims to illustrate how VR software is useful when transcribing open-ended questionnaires and interviews in mixed methods research. A significant amount of time was saved, yet valuable insights into emerging themes were gained at an early stage of the data processing.

Keywords: transcription, voice recognition, MacSpeech, questionnaires, interviews, mixed methods research

INTRODUCTION

Whilst we were planning and preparing to collect data for research into student engagement in writing assessments, the incidental observation of a primary student using voice-recognition software to dictate a narrative triggered the idea to use voice recognition (VR) for data transcription. The student had reduced fine-motor skills, a side effect of the treatment of a brain tumour some years earlier, and used VR on his laptop for writing tasks in class. Having made some inquiries about the availability of the software and its purchase costs, we realised VR presented a time-effective, accurate method for transcribing interviews and questionnaires at a relatively low cost.

In qualitative research, the transcription of interview and observation data is common practice, and often central to the processes of data analysis and theorising. However, the role and importance of transcription is rarely examined, and the process itself is seen as a mechanical chore for the researcher (Oliver, Serovich, & Mason, 2005). In smaller research projects, transcription tends to be handled by the researcher directly (McLellan, MacQueen, & Neidig, 2003), who typically listens to an audio recording, moment by moment, and converts it into digital text. Alternatively, the researcher may choose to save time by using a professional transcription service. However, this requires funding, which may not be available in a small research project. The time and cost of transcription are also a concern for researchers adopting a mixed-methods approach, particularly in regard to surveys with open-ended questionnaires, where larger numbers of participants are frequent and individual responses can be lengthy.

Despite these obstacles, the process of transcribing data carries the significant benefit of allowing researchers to familiarise themselves with the data and gain valuable insight into themes emerging in the initial data analysis (Anderson, 1998; Matheson, 2007; Park & Zeanah, 2005). Faced with these pros and cons, yet keen to explore a more innovative approach to data transcription, we used MacSpeech Dictate voice-recognition software to transcribe open-ended questionnaire responses and semi-structured interviews with students and teachers in a mixed-method study into primary students’ learning.

In this article VR is presented as a useful tool for qualitative and quantitative data transcription. VR is a valuable tool in any research where data collected in questionnaires and interviews require transcription before analysis. Additionally, the process of transcribing using VR software provides a further opportunity for the researcher to become familiar with their data and begin processes of analysis.

BACKGROUND

Voice recognition is an Information Communication Technology (ICT), which allows the user to create digital text by dictating words and punctuation to a computer via a connected microphone. Juang and Rabiner (2004) describe the technology as developing from two main approaches. One was a speaker-dependent voice-activated typewriter called Tangora, which was developed by IBM. Tangora’s main technical focus was the extensiveness of its recognised vocabulary. The second approach, a speaker-independent technology, was aimed at recognising words spoken by people with different accents, and was developed by AT&T Bell Laboratories.

The terms voice recognition and speech recognition denote different characteristics of the technology. Coniam (1999) makes the distinction between voice recognition, which is speaker-dependent and requires machine training, and speech recognition, the technology used for example in call centres, which does not require training and is therefore defined as ‘speaker-independent’. It is worth noting that Nuance, the company which produces Dragon Naturally Speaking (DNS) and Dragon Dictate, despite requiring the user to undertake some machine training by creating a voice profile, refers to its products as ‘speech recognition’. Perhaps this reference to speech recognition is an indicator of where the technology is heading. Indeed, having conducted an informal test using a colleague’s Naturally Speaking software, we would suggest that the software is in fact speaker-independent but requires voice profiling for maximum accuracy.

Voice recognition technology has become significantly more accurate and easy to use over the last 20 years (Juang & Rabiner, 2004). Initially, many of the users of VR turned to the technology as a result of a range of medical conditions, which resulted in difficulties with typing. Voice recognition has enabled sufferers of Multiple Sclerosis (Lodato, 2005), carpal tunnel syndrome (Honeycutt, 2003; Matheson, 2007) and tenosynovitis of the wrist (Honeycutt, 2003) to remain productive writers.

With the advances in accuracy and ease of use, new applications of the technology are emerging. VR software has gained significant popularity and MacSpeech indicate a doubling in sales from 2005 to 2009 (MacSpeech, 2009). This popularity, together with the range of different uses, has promoted the development of specialist variants of software for the medical and legal fields.

The application of the software to facilitate language training has also been explored. Coniam (1999) has investigated the potential of using VR in English language teaching. The study aimed to ascertain whether VR software would be a useful tool in the oral language assessment of second language speakers.

Other research shows some advantages for teachers in providing feedback comments to students using VR software, particularly from the point of view of time efficiency (Batt & Wilson, 2008). Wald (2005) indicates success in the use of VR in higher education in the preparation of lecture materials to assist the learning of students with disabilities.

In commenting on how VR software works Coniam (1999) points out that:

Part of the attraction – or potential for English Language teaching – of VR software such as ‘Dragon Naturally Speaking’ lies in the fact that it does not produce a phonemic transcription, but attempts to interpret input as recognisable English. With second language speakers this is important, since what might be called ‘standard deviant’ forms for a particular language group (e.g., /l/ and /n/, /v/ and /w/ confusion) do not appear in their incorrect forms in the output. (p. 51)

In the context of transcribing data for this research the potential of VR as a valuable tool for second language researchers was significant, as the main author (Anna Fletcher) has English as her second language. It is important to remember that VR relies on spoken words in a grammatical context. Therefore, if VR is used to transcribe the spoken words of an interview, the sentences need to follow grammatical conventions to some degree for VR to work effectively. However, in this study we found that VR generally transcribed the sentences accurately even when they did not follow the correct grammatical pattern, particularly when short phrases were enunciated.

All transcription requires proofreading, and grammatically incorrect segments in particular need careful checking to ensure that they match what was said. We found ourselves quite amused to discover unexpected phrases in the interview transcripts such as ‘nuclear armament?’ instead of ‘are you clear on that?’.

Doubts about the level of accuracy claimed by the software companies exist, particularly as these claims appear to be based on numbers of accurate words, not accurate words in context (Coniam, 1999). Honeycutt (2003) highlights the same notion by quoting a witty David Pogue from the New York Times: ‘After all, it’s nose mall feet for a computer to understand mice peach’ (Pogue, 2001).

In sum, despite the glaringly inaccurate transcription of phrases that occasionally occurs, VR is a technology which is continually undergoing development and fine-tuning. It carries notable advantages, enabling writers to rapidly produce large quantities of text irrespective of their ability to type.

In our case, Anna is a doctoral candidate working full time as a primary school teacher and consequently her research time is constrained. The option to use professional transcription services for interview and other data, while saving time, would mean additional cost as well as a lost opportunity to engage with the data at an early stage. The data in this mixed methods research project was drawn from semi-structured interviews and two sets of 165 open-ended questionnaires, all of which required transcription.

As previously indicated, a chance observation of VR in action captured our interest to explore how VR might provide an effective way to transcribe the data. It indicated potential to save time and yet still provide opportunity to become familiar with the data early in the stages of data processing, as is common practice in qualitative research. Bruce (2007) highlights the importance of engaging with data early in the process of analysis, and prior to further data collection, as a useful approach within Grounded Theory. In this research, early identification of themes, in part through a process of familiarisation with the data, was planned.

Anderson (1998) notes:

The benefits of ‘speaking what you hear’ and having it immediately transcribed to text will be a godsend for those who regularly transcribe tapes. A notable benefit comes from being able to remain familiar with the empirical data, given that transcribing represents one of the first analytical moments for qualitative researchers who prefer not to send their tapes off to professional transcription services. (p. 721)


METHODS

Equipment and software

The voice recognition software we used for transcription was MacSpeech Dictate 1.3. The software was installed on a Macintosh computer with a 1.83 GHz Intel Core 2 Duo processor with 512 MB RAM, running the OS X 10.4.11 (‘Tiger’) operating system.

The accuracy of voice recognition software is greatly affected by the quality of the microphone used for dictation. We used a Plantronics DSP 400 USB Headset with a noise-cancelling microphone. In addition to using a good quality headset, the level of accuracy is clearly helped if the dictation is conducted in a quiet environment. The headset’s microphone is positioned a few centimetres away from the speaker’s mouth and the noise-cancelling function prevents it from picking up sounds other than the user’s voice. For example, it does not transcribe the words spoken by a second person who is speaking in the same room as the transcriber. Nevertheless, the software works at the optimal level of accuracy when in a quiet environment.

Once the MacSpeech software had been installed onto the computer, it was set to Australian English. A training session followed in which the software created a user voice profile. The training helps MacSpeech Dictate learn the speech pattern of the user and adapt to the way the user speaks. The software works best when words are spoken in a smooth manner, with natural phrasing and pauses. During the training process the user is asked to read two stories. As the user reads, the text turns green as it is recognised. If the text turns red, the user needs to pause then begin reading again. MacSpeech suggests that the training process takes about 5 minutes. In this case the voice training took approximately 5 minutes per story, resulting in a total of 10 minutes.

Further tuning of the user profile can be done through vocabulary training. Vocabulary training involves importing samples of the user’s own written documents. MacSpeech Dictate then applies an analysing function, which recognises the user’s writing style, aimed at improving accuracy as the software is tuned to the way the user puts words together. We found the vocabulary training to be a useful function as the nature of transcribing interviews and questionnaires entails using a number of context-specific words.

MacSpeech can be operated by voice commands, which means that the user does not need to use a mouse or a keyboard to operate the software once it has been opened. For example, giving the command ‘New Line’ has the same effect as pressing the return key on the keyboard once.

Once the voice training steps have been completed, the user can start transcribing data. In order to do so, the user opens the MacSpeech Dictate application on the computer. After the voice profile has been activated, two windows appear. A small black palette controls the microphone, while the larger window shows the transcribed words as they are being spoken.

Transcribing

We started with one of us, Anna, transcribing 165 handwritten open-ended questionnaire responses. This was a good way to commence because reading written responses aloud turned out to be an easy way of getting used to dictating in general.

The accuracy of the software improved with use, as a more detailed bank of information about Anna’s speech pattern was built up. Precision was further enhanced through the addition of new vocabulary to the software, as a series of topic-specific words were frequently repeated in the questionnaire responses. We also found that accuracy improved once a few words had been spoken after the microphone had been switched on. Thus, we found that leaving the microphone switched on even when Anna remained silent for periods of time helped reduce errors in the transcription.

Transcribing NVivo sound files

Having completed the transcription of the first series of questionnaires, Anna set about transcribing interviews. This required her to build on the technique she had become accustomed to when transcribing the handwritten questionnaire responses. The interviews were recorded on a digital voice recorder, an Olympus DS-2, and then uploaded into NVivo 8, a qualitative analysis software package. As NVivo operates on a Windows platform, a second computer, a Toshiba laptop, was used to play the NVivo sound file. On a practical level, this meant plugging the microphone of the Plantronics headset into the Macintosh computer and the earphones of the headset into the Toshiba laptop. This was not a problem as the headset had separate plugs for the microphone and earphones, which normally are plugged into the headset’s USB adaptor. It is also possible to run NVivo on a Macintosh computer that has PC emulating software such as Parallels or VMware.

An advantage of using NVivo 8 to play back the sound file was the sound wave bar. Having the bar visualise the entire audio file with its sound waves made it easier to replay the exact segment that Anna wanted to listen to, in order to repeat it orally to the MacSpeech software. This also facilitated the setting of time markers for segments she wished to pre-code at this early stage. Thus, having to write down time markers on the text transcript, as recommended by researchers in the early days of using digital audio data, seemed redundant (Maloney & Paolisso, 2001).

Anna would generally start with a few warm-up sentences before starting to play the recorded interview. Once the transcription of Anna’s words was running smoothly, she would begin playing back and transcribing the interview. The technique is very similar to the ‘listen and repeat’ technique (Matheson, 2007; Park & Zeanah, 2005), or the ‘shadowing’ technique (Bain, Basson, Faisman, & Kanevsky, 2005). The technique consists of three steps. First, the transcriber listens to a segment of the recorded interview. Second, the interview is paused. Third, the transcriber repeats the words into the microphone and MacSpeech then types the words on the screen. The process is then repeated for the next segment, and so on.
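The listen, pause and repeat cycle can be pictured as a walk through the recording in short segments. The Python sketch below is illustrative only: the 10-second segment length and the helper name `shadowing_segments` are assumptions made for the example, since in practice the transcriber chose each segment by ear using the NVivo sound wave bar.

```python
from typing import Iterator, Tuple


def shadowing_segments(duration_s: float,
                       segment_s: float = 10.0) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) times in seconds for each segment to replay."""
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    start = 0.0
    while start < duration_s:
        # The final segment is shortened so it never runs past the recording.
        end = min(start + segment_s, duration_s)
        yield (start, end)
        start = end


# Hypothetical 35-second excerpt of an interview.
for start, end in shadowing_segments(35.0):
    # Step 1: listen to the segment; step 2: pause playback;
    # step 3: repeat the words into the microphone for dictation.
    print(f"listen {start:.0f}-{end:.0f} s, pause, then dictate")
```

A fixed segment length is only a convenience here; shorter segments are easier to hold in memory while repeating, at the cost of more pauses.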

RESULTS

We found dictating to be a simple and time-saving method of transcribing data. The dictation of the handwritten questionnaire responses took an average of 2.4 minutes per questionnaire, including saving each transcript as an individual document. Ignoring the few seconds it took to save each questionnaire transcription, MacSpeech Dictate produced the questionnaire transcripts at a typing speed of approximately 77 words per minute. This was a significant reduction in time compared to Anna’s result of 28 words per minute with 96% accuracy in an online typing test (TypingTest, 2009).
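The questionnaire figures can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: it takes the reported values (165 questionnaires, 2.4 minutes each, about 77 words per minute dictated, 28 words per minute typed) and assumes, purely for the estimate, that the whole 2.4 minutes was spent dictating and that manual typing would cover the same number of words with no extra time for corrections. The function and variable names are hypothetical.

```python
# Reported figures from the questionnaire transcription.
QUESTIONNAIRES = 165
MINUTES_PER_QUESTIONNAIRE = 2.4   # average time per questionnaire
DICTATION_WPM = 77                # approximate speed with MacSpeech Dictate
MANUAL_WPM = 28                   # result of the online typing test


def total_hours(n_items: int, minutes_each: float) -> float:
    """Total time in hours for n_items taking minutes_each on average."""
    return n_items * minutes_each / 60


def estimated_words(minutes: float, wpm: float) -> float:
    """Rough word count implied by a duration and a words-per-minute rate."""
    return minutes * wpm


# Time spent dictating one full set of questionnaires.
dictation_hours = total_hours(QUESTIONNAIRES, MINUTES_PER_QUESTIONNAIRE)

# Assumption: manual typing would cover the same words at 28 wpm.
words_each = estimated_words(MINUTES_PER_QUESTIONNAIRE, DICTATION_WPM)
manual_minutes_each = words_each / MANUAL_WPM
manual_hours = total_hours(QUESTIONNAIRES, manual_minutes_each)

speed_ratio = DICTATION_WPM / MANUAL_WPM

print(f"Dictation: {dictation_hours:.1f} h")
print(f"Manual typing (estimated): {manual_hours:.1f} h")
print(f"Speed ratio: {speed_ratio:.2f}")
```

Under these assumptions, one set of 165 questionnaires takes about 6.6 hours to dictate against an estimated 18 hours or so of manual typing, a ratio of 2.75. These are indicative figures, not measured results.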

Once Anna had become proficient in the shadowing technique, we found 1:5 to be the approximate ratio between recorded time and transcription time for a verbatim transcription, which included all words and false starts of the speakers. The transcription time included time for editing words that had been transcribed incorrectly.

To give an indication of the typing speed in interview transcripts, we have simply divided the number of transcribed words by the number of minutes taken to produce the transcript. In other words, this ‘typing speed calculation’ includes the time taken to listen to the voice recording as well as editing, giving a combined transcription, typing and editing speed of approximately 31 words per minute. If the voice recording listening time is deducted for the purposes of comparing manual typing time with VR typing and editing time, the latter indicates a typing speed of approximately 38 words per minute.
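The two interview figures are linked through the 1:5 ratio between recorded time and transcription time. The sketch below is a minimal illustration in Python, assuming that the deducted listening time equals a single playback of the recording, that is, one fifth of the total transcription time. The function name and the 10-minute recording length are hypothetical.

```python
def speed_excluding_listening(combined_wpm: float,
                              recorded_minutes: float,
                              transcription_ratio: float) -> float:
    """Words per minute once one playback of the recording is deducted.

    combined_wpm: words divided by total transcription minutes.
    transcription_ratio: transcription minutes per recorded minute (5 for 1:5).
    """
    total_minutes = recorded_minutes * transcription_ratio
    words = combined_wpm * total_minutes
    # Assumption: listening time is a single pass through the recording.
    typing_and_editing_minutes = total_minutes - recorded_minutes
    return words / typing_and_editing_minutes


# Hypothetical 10-minute recording; the result does not depend on its length.
vr_wpm = speed_excluding_listening(31, recorded_minutes=10, transcription_ratio=5)
gain_over_manual = vr_wpm - 28   # 28 wpm is the manual typing test result

print(f"VR typing and editing: {vr_wpm:.2f} wpm")
print(f"Gain over manual typing: {gain_over_manual:.2f} wpm")
```

With the reported values this gives between 38 and 39 words per minute, in line with the figure stated above, and a gain of roughly 10 words per minute over the typing test result.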

Limitations

The fact that only one voice can be used to dictate the transcription has frequently been described as a limiting condition on the application of dictation software to transcribing interviews (Matheson, 2007; Park & Zeanah, 2005). In our experience, while it would without doubt be quicker to transcribe an interview without using a shadowing technique to repeat the spoken words, the process of repeating speech had the significant advantage of making us more familiar with the data. Having to listen to the interview data and then repeat the spoken words provided a further opportunity to be immersed in the data, and through this to become more familiar with the issues and the emerging themes.

The typing times stated for the different transcriptions are only provided as an indicator of the time it took to transcribe the material. In order to establish the validity and reliability of these typing times, more rigorously controlled testing conditions would be needed, which was not our priority when using VR to transcribe interviews. In respect to this article, we simply wanted to give an example of how we have found VR a time-saving and useful transcription tool in our research.

DISCUSSION

Having embarked on this method of transcription, we were encouraged by the speed and ease with which it was possible to transcribe data that Anna was in the process of collecting. Greg, who previously had used MacSpeech mainly to compose emails, was keen to learn just how much time was saved by using the method compared to transcribing by manually typing the text. As indicated by the difference in typing speed between MacSpeech Dictate and Anna’s typing speed, it would have taken her more than twice the time to manually type the questionnaire responses.

In regard to transcribing the interviews, the time saving was not of the same magnitude. As described in the results, the use of MacSpeech in this case resulted in an increase in typing speed of approximately 10 words per minute.

A notable benefit of this time-effective method of transcription was that it lent flexibility to transcribe the data during the collection period. For example, a set of 25 open-ended questionnaires, which had been answered in a school class during the day, would take approximately an hour to transcribe later in the evening. In addition to presenting a quick and efficient way of processing data, the method enabled us to gain a valuable understanding of the data content at an early stage. We would argue that the process of not only reading the data, but also dictating and proofreading it, added to our awareness of the emerging themes. As Lapadat and Lindsay (1999) write, transcription is an important component of the analysis process. ‘We want to emphasize that it is not just the transcription product – those verbatim words written down – that is important; it is also the process that is valuable’ (p. 82).

CONCLUSION

Voice recognition technology has developed significantly in accuracy and ease of use over the last 20 years. From having been a technology primarily used by people with decreased function of their hands due to medical conditions, who had little option but to dictate their writing, it now presents a time-saving option for a wide audience. Earlier research into how VR can be used in qualitative research has mainly focused on using the technology to transcribe interviews (Matheson, 2007; Park & Zeanah, 2005). This article describes how VR is also a highly useful tool in mixed methods research, particularly in transcribing open-ended questionnaires, as found in this small-scale mixed methods research into primary students’ writing assessment.

With the improvements in VR technology leading to increased speed and ease of use, it seems likely that VR will become a common tool for many researchers in the future. It should be noted as well that MacSpeech was purchased in 2010 by Nuance, who also produce DNS, the well-established and reliable VR software for PC computers. Nuance now produce Dragon Dictate for Macintosh computers, which uses the same VR engine as DNS. The relatively low cost of purchasing the software, combined with the significant amount of time saved and the benefits of increased understanding of the data through the transcription process, presents VR as a particularly attractive option for qualitative and mixed methods researchers.

References

Anderson, J. F. (1998). Transcribing with voice recognition software: A new tool for qualitative researchers. Qualitative Health Research, 8(5), 718–723.

Bain, K., Basson, S., Faisman, A., & Kanevsky, D. (2005). Accessibility, transcription, and access everywhere. IBM Systems Journal, 44(3), 589–603.

Batt, T., & Wilson, S. (2008). A study of voice-recognition software as a tool for teacher response. Computers and Composition, 25(2), 165–181.

Bruce, C. (2007). Questions arising about emergence, data collection, and its interaction with analysis in a grounded theory study. International Journal of Qualitative Methods, 6(1), 1–12.

Coniam, D. (1999). Voice recognition software accuracy with second language speakers of English. System, 27(1), 49–64.

Honeycutt, L. (2003). Researching the use of voice recognition writing software. Computers and Composition, 20(1), 77–95.

Juang, B. H., & Rabiner, L. R. (2004). Automatic speech recognition – A brief history of the technology development. Georgia Institute of Technology, Atlanta, GA; Rutgers University and the University of California, Santa Barbara, CA.

Lapadat, J. C., & Lindsay, A. C. (1999). Transcription in research and practice: From standardization of technique to interpretive positionings. Qualitative Inquiry, 5(1), 64–86.

Lodato, J. (2005). Advances in voice recognition. The Futurist, 39(1), 7–8.

MacSpeech. (2009). Releases MacSpeech dictate medical MacSpeech press [online]. MacSpeech, Inc. Retrieved August 30, 2009, from http://www.macspeech.com/article_info.php?articles_id=324

Maloney, R. S., & Paolisso, M. (2001). What can digital audio data do for you? Field Methods, 13(1), 88–96.

Matheson, J. L. (2007). The voice transcription technique: Use of voice recognition software to transcribe digital interview data in qualitative research. The Qualitative Report, 12(4), 547–560.

McLellan, E., MacQueen, K. M., & Neidig, J. L. (2003). Beyond the qualitative interview: Data preparation and transcription. Field Methods, 15(1), 63–84.

Oliver, D., Serovich, J. M., & Mason, T. L. (2005). Constraints and opportunities with interview transcription: Towards reflection in qualitative research. Social Forces, 84(2), 1273–1289.

Park, J., & Zeanah, A. E. (2005). An evaluation of voice recognition software for use in interview-based research: A research note. Qualitative Research, 5(2), 245–251.

Pogue, D. (2001). State of the art: If typing won’t do, speak up [electronic version]. The New York Times. Retrieved August 30, 2009, from http://www.nytimes.com/2001/04/19/technology/19STAT.html?pagewanted=1

TypingTest. (2009). Test your typing skills. Retrieved August 30, 2009, from www.typingtest.com

Wald, M. (2005). Using automatic speech recognition to enhance education for all students: Turning a vision into reality. Proceedings of 34th ASEE/IEEE Frontiers in education conference, October 19–22 (pp. 22–25). Indianapolis, IN.

Received 09 December 2009 Accepted 23 August 2011
