TRANSCRIBE YOUR CLASS: EMPOWERING STUDENTS, INSTRUCTORS, AND INSTITUTIONS:
FACTORS AFFECTING IMPLEMENTATION AND ADOPTION OF A HOSTED TRANSCRIPTION SERVICE

Keith Bain¹, Janice Stevens¹, Heather Martin¹, Eunice Lund-Lucas²

¹ Saint Mary's University (CANADA)
² Trent University (CANADA)

[email protected], [email protected]

Abstract

Access to lecture content and the requirement for adequate notetaking skills are key challenges facing students with disabilities. Emerging technologies such as Speech Recognition may offer solutions to accessibility issues. This paper reviews a project that developed and tested a Hosted Transcription Service that used speech recognition to automatically caption and transcribe course media, including live presentations and eLearning content.

Keywords: accessibility, technology, lecture transcription, multimedia, disability, speech recognition, transcription

1 INTRODUCTION

This paper describes initial outcomes associated with establishing a prototype Hosted Transcription Service (HTS) in the Canadian post-secondary education environment. Through a project designed to improve access to information for learners with disabilities, a prototype system was implemented in an applied research context to perform offline transcription of audio and video files and create ‘Multimedia Transcripts’. In addition to exploring the technology itself, a multidisciplinary team studied the environmental conditions that affected adoption and use.

1.1 Background: Accessibility and Education

Accessibility issues fundamentally affect educational and employment opportunities for learners with disabilities and other at-risk students. According to the International Labour Organization, the annual loss of global GDP due to the exclusion of persons with disabilities from the labour market is $1.3-$1.8 trillion [1]. The Canadian Council on Social Development reports that 42% of Canadians with disabilities do not work, compared to 18% of those without disabilities. A root cause of labour market exclusion is a systemic lack of access to information in educational settings.

Educational achievement is highly correlated with employment, especially for individuals with disabilities. It has been shown that employment rates for persons with disabilities double with a high school diploma and triple with a postsecondary education [2]. While the impact of a successful postsecondary education for graduates with disabilities can be profound and far-reaching, the postsecondary environment itself can be extremely difficult and exclusionary.

In post-secondary education, audio is the primary communication channel, which presents pervasive notetaking and listening comprehension challenges for students with various disabilities. All students require adequate notetaking skills, as studying lecture notes results in a richer learning experience and higher grades [3]. While notetaking is clearly connected to academic achievement, notetakers typically capture less than 40% of the information presented during lectures. Furthermore, studies show that students without disabilities record up to 70% more lecture information than students with disabilities [4],[5]. While various intermediary-based notetaking supports exist, including peer volunteers, computerized notetaking, and professional stenography, they are typically fraught with problems of reliability, availability, quality, and cost.


1.2 Project Description

The project scope encompassed a two-year applied research effort titled the Liberated Learning Youth Initiative. The project’s overarching goal was to improve access to information for persons with disabilities using Speech Recognition (SR) technology. Near-term objectives included developing an automated lecture transcription system designed to improve access for post-secondary students with disabilities, identifying system requirements, assessing technology performance, and understanding the technology, implementation, and environmental factors that enabled or hindered successful adoption.

The project was initiated by the Liberated Learning Consortium, an international research network dedicated to improving access to information through SR based captioning and transcription systems. The project was designed and executed by a multidisciplinary team of university, industry, and national disability organizations.

2 HOSTED TRANSCRIPTION & MULTIMEDIA TRANSCRIPTS

The primary technology studied in this project is referred to as the Hosted Transcription Service (HTS). HTS is a web-based prototype SR system that automatically transcribes audio or video files and creates Multimedia Transcripts. The system and its constituent components are described in the following section.

2.1 Speech Recognition

At its core, HTS is powered by complex SR technology. SR can be used to convert digitized media, such as a lecture or presentation, into a searchable transcript for review and study purposes [6]. Case studies have demonstrated that post-secondary students with disabilities found SR-generated transcripts of academic lectures useful and that the transcripts enhanced access to lecture content [7].

SR systems are composed of numerous technology layers, including media conversion engines, SR engines, and statistical models that improve recognition results.

Audio/video files are saved in a wide range of multimedia formats, including WAV, WMA, WMV, AVI, AU, Real Media, MP3, and QuickTime. Certain formats are proprietary (ASF, Real Media, QuickTime), and others are accessible only through proprietary APIs or usable only on specific operating systems. Media converters allow the system to process a larger variety of input formats and can greatly increase the system’s usability.
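As a rough illustration of what such a conversion layer does, the following sketch normalises an arbitrary input recording to the PCM WAV format HTS accepted. It assumes the ffmpeg command-line tool is available; the actual converters used by HTS are not described in this paper.

```python
# Illustrative only: shows one way a media conversion layer could normalise
# input recordings to uncompressed PCM WAV (the format HTS accepted).
# Assumes ffmpeg is installed and on the PATH.
import subprocess

def to_pcm_wav(src: str, dst: str, sample_rate: int = 22050) -> None:
    """Convert any ffmpeg-readable media file to mono 16-bit PCM WAV."""
    subprocess.run(
        ["ffmpeg", "-y",              # overwrite the output file if it exists
         "-i", src,                   # input in any supported container/codec
         "-ac", "1",                  # mix down to a single channel
         "-ar", str(sample_rate),     # resample to the target rate (>= 22 kHz)
         "-acodec", "pcm_s16le",      # linear PCM, 16-bit little-endian
         dst],
        check=True,
    )

if __name__ == "__main__":
    to_pcm_wav("lecture.mp3", "lecture.wav")
```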

IBM’s Attila speech recognition engine powered this system [8]. While a comprehensive description of Attila’s technical architecture is beyond the scope of this paper, it is important to note that the engine included a flexible and scalable toolkit and set of APIs that facilitated the development of HTS.

SR systems utilize statistical models to improve recognition results. These include acoustic models, which represent typical waveforms for a language, and language models, which use statistical analysis to determine which words and word combinations are most likely for a given domain/task and language. For this system, the models were created from a Broadcast News corpus that included read speech and a language model built from thousands of hours of American television broadcast news transcripts. These statistical models are subsequently referred to as the U.S. English Broadcast News Model.
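These two models are combined in the standard statistical formulation of speech recognition: given the acoustic observations A, the recognizer searches for the word sequence W that maximizes the posterior probability,

\[
\hat{W} \;=\; \operatorname*{arg\,max}_{W} \, P(W \mid A) \;=\; \operatorname*{arg\,max}_{W} \, P(A \mid W)\,P(W),
\]

where the first factor is scored by the acoustic model and the second by the language model, so a mismatch between either model and the target domain directly lowers the probability assigned to the correct word sequence.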

While these models are well suited to general transcription tasks, they were not specifically designed for the unique speech and language environment of the typical post-secondary classroom. Furthermore, because the source data is derived from North American English speakers, transcription of accented English was predicted to be less accurate.

2.2 Hosted Transcription Service

HTS leveraged the IBM Attila SR engine to perform offline transcription of audio or video files. The core HTS technology provided by IBM Research was implemented virtually via the Amazon Elastic Compute Cloud (Amazon EC2) web service. This cloud-based infrastructure was selected for its ability to scale computing capacity according to system requirements, usage, and other technical variables. To use HTS, authenticated users visited an online portal, logged into their secure accounts, and then uploaded a media file for automatic transcription. HTS was classified as a speaker-independent system given that it did not require voice profile training or enrolment to achieve better recognition results.
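The portal interaction just described is simple to picture in code. The sketch below is hypothetical: the paper does not document HTS's actual endpoints, field names, or response format, so everything beyond "log in, then upload a file as a job" is an assumption made for illustration.

```python
# Hypothetical sketch: authenticate with an HTS-style portal and submit a
# recording as a transcription "job". URL, endpoints, form fields, and the
# response shape are placeholders, not the real HTS API.
import requests

BASE_URL = "https://hts.example.org"  # placeholder portal address

def upload_for_transcription(username: str, password: str, wav_path: str) -> str:
    with requests.Session() as session:
        # Log in to the secure account (endpoint and field names assumed)
        login = session.post(f"{BASE_URL}/login",
                             data={"user": username, "password": password})
        login.raise_for_status()
        # Upload the media file to create a new job
        with open(wav_path, "rb") as media:
            resp = session.post(f"{BASE_URL}/jobs", files={"media": media})
        resp.raise_for_status()
        return resp.json()["job_id"]  # assumed response field

if __name__ == "__main__":
    print(upload_for_transcription("student01", "secret", "lecture.wav"))
```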

Once HTS had converted and transcribed the recorded lecture, students participating in the project received SR-generated Multimedia Transcripts. While traditional transcripts are typically generated by listening and manually typing what is heard, Multimedia Transcripts refer to SR-generated text that is synchronized with a spoken language source and other media (slides, images, etc.).

Multimedia Transcripts could be viewed online. When users logged into HTS, they were brought to a "Jobs" page that listed all media files that had been uploaded and transcribed [Fig. 1].

Fig. 1: HTS “Jobs” Interface

Students could access either a Flash- or HTML-based interactive interface that allowed the user to customize how the Multimedia Transcript was viewed. Although the content was automatically synchronized by HTS, disaggregated content was also technically available, allowing students to self-select the individual learning objects and combinations (text only, audio only) that suited their preferences.

Fig. 2: Multimedia Transcript: Flash Interface

Each multimedia format had perceived advantages and disadvantages. The Flash version offered numerous default layouts that allowed users to easily reconfigure the media sources. The Flash version did, however, introduce accessibility challenges. For example, the interface was not navigable by the screen reader technologies used by people with visual impairments. It was also difficult to search for key words and phrases in the Flash player.

The HTML version was designed as an accessible interface and included keyboard shortcuts and screen reader navigability. It furthermore facilitated search and retrieval of keywords or phrases, which was demonstrated to be an important feature in similar SR systems [9].
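To make the search feature concrete, the sketch below shows one minimal way keyword search could work over a time-aligned transcript, returning the timestamps at which a phrase occurs so playback can jump to them. The (word, start-time) representation is an assumption for illustration, not the actual HTS data structure.

```python
# Illustrative sketch: find every occurrence of a phrase in a time-aligned
# transcript and return the start times, which a player could seek to.
# The (word, start_seconds) pairing is assumed, not HTS's internal format.
from typing import List, Tuple

TimedWord = Tuple[str, float]  # (word, start time in seconds)

def find_phrase(transcript: List[TimedWord], phrase: str) -> List[float]:
    """Return the start times of each occurrence of `phrase`."""
    target = phrase.lower().split()
    words = [w.lower() for w, _ in transcript]
    hits = []
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            hits.append(transcript[i][1])
    return hits

# Example: locate "speech recognition" in a short fabricated transcript
demo = [("today", 0.0), ("we", 0.4), ("discuss", 0.7),
        ("speech", 1.2), ("recognition", 1.6), ("systems", 2.3)]
print(find_phrase(demo, "speech recognition"))  # -> [1.2]
```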


3 METHODOLOGY

The following section describes participant recruitment, selection criteria, participant roles and responsibilities, and the planned transcription workflow.

3.1 Participation Requirements

After HTS was developed, implemented, and internally tested, organizers facilitated an external evaluation with targeted stakeholders. Testing was organized into two phases according to the generic North American academic year: September 2010 - May 2011 and September 2011 - May 2012. Phases I and II were structured to provide access to HTS for up to 75 and 150 students respectively. To be eligible, students were required to have a documented disability and be actively taking courses at an accredited post-secondary institution. Given that the funding envelope focused on youth issues, participants were furthermore required to be between the ages of 18 and 30. Organizers also attempted to secure representation from five different geographical regions as defined by project sponsors.

Resources were provided to process up to five recorded media hours per participant (e.g., five one-hour lectures or ten 30-minute lectures). Processing included automated transcription, data analysis, digital archiving, and subsequent correction of recognition errors by human editors. Students were asked to adhere to specific recording protocols, including the use of a headset or lapel-style microphone, which aimed to reduce ambient noise captured in the recordings and thereby improve transcription accuracy and reduce editing requirements. Given the vast array of multimedia formats and corresponding differences in quality, HTS was configured to accept only PCM WAV or ADPCM WAV audio recordings with a minimum sampling rate of 22 kHz. These formats were chosen because of a desire to use the archived data for acoustic and language model development.
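A simple pre-upload check against this protocol could look like the sketch below, which uses Python's standard wave module to verify that a recording is an uncompressed WAV file at 22 kHz or higher. It is illustrative only: HTS's own validation behaviour is not described here, and the wave module does not read the ADPCM WAV files that HTS also accepted.

```python
# Illustrative sketch of a pre-upload protocol check (PCM WAV, >= 22 kHz).
# Note: Python's wave module handles uncompressed PCM only, so ADPCM WAV
# recordings (also accepted by HTS) would require a different check.
import wave

def check_recording(path: str, min_rate: int = 22050) -> list:
    """Return a list of protocol violations for a WAV file (empty if OK)."""
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            if rate < min_rate:
                problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    except wave.Error as err:
        problems.append(f"not an uncompressed PCM WAV file: {err}")
    return problems

if __name__ == "__main__":
    for issue in check_recording("lecture.wav"):
        print("Protocol issue:", issue)
```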

Student participants were obligated to seek permission from a willing instructor. Instructors had to authorize the recording of at least five hours of course lectures and permit the upload of recorded lectures to HTS. Instructors also had to consent to using either headset or lapel-style microphones as noted above. Both students and instructors were furthermore required to abide by the HTS Terms of Usage developed for the project.

3.2 Participant Recruitment

The project team put out a formal call for participation by disseminating project information through a variety of student, faculty, and professional networks. The team sent invitations to numerous email lists, made direct phone calls to post-secondary staff, posted information on social media platforms, delivered targeted presentations at several institutions, and established a project website (www.transcribeyourclass.ca).

There were three anticipated scenarios for recruiting participants. In the first, students with disabilities would learn about the project, initiate involvement, and independently seek out faculty approval. Alternatively, faculty themselves would identify a prospective student who could benefit from the project. A third scenario involved Disability Resource Centre staff facilitating involvement from qualified students and willing faculty. Initiation by disability resource professionals was anticipated to be the most common method for securing participants.

Given plans to accommodate 75 users in Phase I, an online form on the project website allowed prospective candidates to apply to participate on a first-come, first-served basis. Students meeting the established criteria were offered a Participation Agreement, which needed to be signed by the student and the faculty member(s) whose lectures would be recorded and transcribed.

3.3 Participant Activity

After submitting a signed Participation Agreement, students received a generic email message welcoming them to the project and a hyperlink to an anonymous online Pre-Technology Survey. The survey was the initial component of a qualitative research activity analysing student/system interaction.

Once the survey was completed, HTS account information was sent to the email address on record. The message included links to help documents that outlined recording tips and suggested strategies for using HTS. Once these preliminary activities were completed, students were free to record lectures using whatever means they had at their disposal, provided they followed the recording protocols. Once media was recorded, students could log into their user accounts, electronically accept the HTS Terms of Usage, and submit media files for transcription.

3.4 HTS Transcription and Editing Workflow

The project team engaged a lead editor to manage incoming recordings and conduct the important task of correcting recognition errors. The editing workflow was designed to produce both a Multimedia Transcript for participants and usable lecture data for improving SR performance.

HTS was set up to send a notification to the lead editor when a “job” was completed. The lead editor could either edit the job or delegate it to another editor for error correction. Error correction was completed by downloading a directory containing the original media and the resulting transcription files, which could be reviewed using a few alternative tools, including HTS itself, IBM ViaScribe, or an open-source tool called Transcriber [10], [11]. Once finished, the editor uploaded the edited transcript for the participant. HTS reintegrated the edited transcript with the media source using timing data expressed in an implicit XML schema. This format enabled the media synchronization and playback that characterised the Multimedia Transcripts provided to students.
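The schema itself is not published in the paper, but the reintegration step can be pictured with a small sketch: edited text segments are written back out alongside the start/end times taken from the original recognition pass, so a player can highlight text in sync with the media. The element and attribute names below are assumptions.

```python
# Hypothetical sketch of writing edited transcript segments with timing data
# to XML so text and media can be played back in sync. The element and
# attribute names are assumed; the actual HTS schema is not documented here.
import xml.etree.ElementTree as ET

def build_sync_xml(segments, out_path="transcript_sync.xml"):
    """segments: iterable of (start_sec, end_sec, edited_text) tuples."""
    root = ET.Element("transcript")
    for start, end, text in segments:
        seg = ET.SubElement(root, "segment",
                            start=f"{start:.2f}", end=f"{end:.2f}")
        seg.text = text
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

# Example: two corrected segments keep the timings from the recognition pass
build_sync_xml([(0.00, 4.20, "Welcome to today's lecture on cell biology."),
                (4.20, 9.80, "Please open your notes to chapter three.")])
```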

For status reporting, HTS sent a series of automated email notifications at various stages of the workflow. Once a media file was uploaded, HTS sent an automated message informing the project team that the file was being processed. Upon completion, another email was sent to the participant confirming that the job was complete; this message included a raw, text-only transcript. Finally, a third message alerted the participant when the edited Multimedia Transcript was available and included a direct hyperlink to the online content.
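The three notification stages could be implemented with something as simple as the following sketch. Addresses, message wording, and the SMTP host are placeholders; the paper does not describe the actual notification mechanism.

```python
# Illustrative sketch of the three workflow notifications described above.
# Sender, recipients, subjects, and the SMTP host are placeholders.
import smtplib
from email.message import EmailMessage

def notify(to_addr: str, subject: str, body: str,
           smtp_host: str = "localhost") -> None:
    msg = EmailMessage()
    msg["From"] = "hts-notifications@example.org"  # placeholder sender
    msg["To"] = to_addr
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)

# Stage 1: file received, processing under way (to the project team)
notify("team@example.org", "HTS job received", "lecture.wav is being processed.")
# Stage 2: raw transcript ready (to the participant)
notify("student@example.org", "HTS job complete", "Your raw transcript is ready.")
# Stage 3: edited Multimedia Transcript available, with a direct link
notify("student@example.org", "Multimedia Transcript ready",
       "View it at https://hts.example.org/jobs/123 (placeholder link).")
```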

4 INITIAL RESULTS

Evaluations and analysis of Phase I were conducted through a variety of measures and sources including HTS technical logs, issues logs, observations, participant feedback, formal and informal interviews, and focus groups.

The initial participation results somewhat contradicted the pre-technology feedback received from prospective participants. During recruitment, researchers were generally greeted with enthusiastic responses to the opportunity to participate. Students, professional staff, and instructors alike lauded the project as an important innovation and believed it could provide numerous accessibility and learning benefits. In many cases, instructors and professional support staff offered active assistance in recruiting and supporting students. Yet by the end of Phase I, 68 applications were received and 66 students met the initial participation criteria; from this pool of 66, only 28 returned Participation Agreements, 18 completed pre-technology surveys, and 16 ultimately used HTS.

4.1 HTS Performance

HTS performance was evaluated based on conformance to stated system requirements, perceived ease of use, noted technical errors, and accuracy of SR transcription.

4.1.1 HTS Conformance to Requirements

High-level requirements for HTS included tools to facilitate account creation and management, efficient media conversion, error handling, Multimedia Transcript access, navigation, and playback, and SR processing.

HTS inherently lacked a designated database that could provide finely tuned account creation and user management, which forced developers to implement a temporary workaround. Given the limited number of users, this was not a major performance issue at this initial development stage. The existing media conversion engines contained a finite set of converters, and the system was furthermore configured to accept only PCM WAV or ADPCM WAV audio recordings. The system successfully converted and transcribed media files recorded according to these parameters; files not recorded and submitted in these formats were, as expected, not processed by HTS.

HTS did not facilitate intuitive error handling. When a media file was not converted properly or could not be read by the SR engine, the system did not alert administrators or users of a processing error. A few other technical issues were identified during testing. The ‘Jobs’ table listing available Multimedia Transcripts did not refresh in Firefox browsers, forcing some students to use other browsers. While playing back and navigating Multimedia Transcripts, there were issues with zooming features and resizing the text window. Similarly, certain options that allowed users to customize the Multimedia Transcript, such as changing font size, worked intermittently, and customized settings were not saved. Finally, the system would occasionally time out when processing larger recordings or when many recordings were in the processing queue; this required a manual restart of the system, which typically resolved the issue.

4.1.2 HTS Accuracy

SR accuracy, or conversely, word error rate (WER), was deemed a critical success factor for this project. Although extensive editing resources were available to provide students with edited Multimedia Transcripts, a key focus of the project was evaluating HTS's ability to process recordings from a variety of different speakers, disciplines, and recording conditions. Furthermore, as a speaker-independent system, HTS did not offer tools to create the customized voice profiles that typically reduce WER in mainstream SR systems. Previous SR research had identified a target benchmark of 85% accuracy / 15% WER; at this level, the errors present could be corrected with a reasonable amount of editing.
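For reference, WER is conventionally computed from a minimum-edit-distance alignment between the recognizer output and a reference transcript:

\[
\mathrm{WER} = \frac{S + D + I}{N},
\]

where S, D, and I are the numbers of substituted, deleted, and inserted words in the alignment and N is the number of words in the reference; word accuracy is commonly reported as 1 - WER, so the 85% accuracy target corresponds to the 15% WER benchmark.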

The average WER during this first phase was well above the established 15% benchmark [Fig. 3]. There was significant WER variation between speakers, given noticeable differences in recording quality, course vocabulary, and individual speaker characteristics.

Measure                        Average      Standard Deviation
Lecture Time                   85.9 min     55.7 min
Editing Time                   300.2 min    168.8 min
Editing Time / Lecture Time    3.91         1.87
WER                            34.91%       22.16%

Fig. 3: Word Error Rates and Editing Ratio

Audio quality was deemed to be highly correlated with WER. Many recordings were not made following the recording protocols: such recordings included extraneous noise, and most were not recorded using lapel or headset microphones. There were also a number of instructors with distinct accents; given that HTS used the U.S. English Broadcast News models, WER scores were much higher for these recordings. Out-of-Vocabulary (OOV) rates were not calculated explicitly, but several courses included extensive use of proper names, uncommon abbreviations, and foreign words. These items would not be in the SR model's lexicon and therefore could not be transcribed correctly. On average, it took 3.91 hours of work per hour of audio to correct recognition errors; this effort included editing and other preparatory/administrative activities. There were also significant differences between editors in the level of effort required to complete the correction task.

4.2 Barriers

Numerous environmental and project barriers (external to HTS) were identified that affected project participation, HTS usage, and sustained adoption.


4.2.1 Instructor Reticence

In a limited number of cases, instructors were unwilling to sign participation agreements and would not allow their lectures to be recorded. Some instructors cited intellectual property concerns and expressed fears that their lectures could be published or otherwise disseminated without permission. Others were worried that peers or superiors could use the resulting transcripts to monitor or evaluate their teaching methods and in-class performance. Two instructors noted that because HTS was hosted on servers residing in the U.S., it was subject to the U.S. Patriot Act, which allows authorities access to the records of internet service providers. In one case, the instructor felt the lapel microphone was too inconvenient and uncomfortable to wear.

4.2.2 Disability Service Providers / Technical Support

Disability service providers were deemed a key ally in this project given their vast knowledge of individual students' unique support and learning requirements. These professionals also typically maintained good relations with instructors and on-campus IT support. Interactions with service providers during recruitment and initiation were extremely positive, and most made explicit offers to help and outlined specific actions they would take to facilitate participation.

In practice, many were unable to facilitate involvement as expected. Observations indicated that service providers were extremely busy and carried very large caseloads. Most were fully engaged in providing traditional supports to their clients or intervening in crisis situations, which left little time for the unique requirements of the project. Similarly, support staff were typically not able to intervene when students faced technical difficulties.

4.2.3 Recording Equipment / Format

Organizers did not anticipate that recording equipment would be a critical success factor in the project. The assumption was that some institutions were already recording lectures, would have microphones available, or could easily purchase new equipment. Therefore, the project did not allocate resources for the purchase of recording equipment for participants. Unexpectedly, some willing participants were not able to secure the proper recording equipment and therefore could not digitize their lectures. Most of the poor-quality recordings were made without an external microphone, using either a student's laptop or a digital recorder placed on a desk close to the professor. These scenarios resulted in transcripts with higher than average WER.

Certain digital recorders used by participants natively recorded in MP3 format, ostensibly to reduce file size and facilitate easy transfer to other devices or platforms. Participants using these recorders could not generate the requisite WAV formats accepted by HTS.

4.2.4 Student and Learning Factors

In some cases, students who were initially interested in the project did not participate, for a variety of known and unknown reasons. Some found the various forms and surveys mandated by sponsors and research ethics requirements overwhelming. Others admittedly forgot to obtain their instructors' signatures in a timely manner. Certain courses that students anticipated would be lecture intensive turned out to be less didactic, rendering HTS and Multimedia Transcripts unnecessary. A few students who successfully recorded lectures and received Multimedia Transcripts mistakenly viewed only the unedited text transcripts and never accessed the Multimedia Transcripts themselves. Anecdotally, some students seemed to lack the motivation to complete the many necessary initiation steps.

For students who successfully navigated the initiation steps, recorded media, and accessed the Multimedia Transcripts, initial results were overwhelmingly positive. Once students used the system, they continued to do so. Specific research analyzing accessibility and learning outcomes is forthcoming.

5 HTS AND WORKFLOW MODIFICATIONS

Based on lessons learned, organizers modified participation requirements and made changes to HTS and associated workflows for Phase II.


The initiation process was significantly streamlined. Geographical considerations were removed from the application process, and the age requirements were broadened. Most importantly, the five-hour transcription maximum imposed because of resource limitations was eliminated, allowing students to submit as many recordings as they wanted. Researchers furthermore offered a nominal participation incentive, intended to help students acquire the recording equipment and microphones that were critical to success.

HTS was reconfigured to allow a variety of different media types, including MP3 recordings, and a number of the technical issues identified in Phase I were resolved. The editing workflow was modified to ensure students accessed Multimedia Transcripts by removing the unedited text file that was initially provided. System notifications were simplified so that students received an email notification only once their file had been corrected.

5.1 Phase II Preliminary Results

Based on these changes, there were corresponding improvements in overall participation, with over 50 participants using HTS during Phase II. Given that Phase II is ongoing, specific causes for this increased adoption are not yet known. In addition to the technology and process improvements, increased awareness of the project and the HTS system in the Canadian post-secondary ecosystem is a likely contributing factor.

WER has also improved substantially, with a mid-point average of 23.32%. Moreover, the majority of recordings are now meeting the stated benchmark of 15% WER or better. Because the underlying statistical models are unchanged, certain accented speakers continue to generate relatively high WER, which inflates the average.

6 FUTURE WORK

Lecture data from both Phases I and II are being used by scientists to develop statistical models designed specifically for the lecture transcription domain/task. Early research shows that a statistical model derived from actual lecture data improves recognition performance for this task, especially for accented speakers. Lower WERs translate into reduced editing requirements, which improves turnaround time, student satisfaction, and usability, and contributes to establishing a viable business case for HTS-like systems as sustainable solutions.

As referenced, a simultaneous qualitative research effort was conducted along with the technical and implementation evaluation described in this paper. Future publications are planned that will articulate the learning and accessibility benefits associated with the use of HTS and multimedia transcripts for students with disabilities. The effect of the systems on pedagogy will also be explored in future work.

7 ACKNOWLEDGEMENTS

This project was made possible with financial support from the Social Development Partnerships Program, Office for Disability Issues, Human Resources and Skills Development Canada. The authors would like to thank the project partners from Saint Mary's University, Trent University, IBM Research, Learning Disabilities Association of Canada, Canadian Hard of Hearing Association, the Neil Squire Society, and Easter Seals Canada for their extensive contributions. Additional support was provided by the Liberated Learning Consortium, the National Educational Association of Disabled Students, and members of the Canadian Association of Disability Service Providers in Postsecondary Education.

REFERENCES

[1] S. Buckup, "The price of exclusion: the economic consequences of excluding people with disabilities from the world of work," Employment Working Paper No. 43, International Labour Office, Employment Sector, Skills and Employability Department, Geneva: ILO, 2009, 85 p.

[2] R. A. Stodden and P. W. Dowrick, "Postsecondary education and employment of adults with disabilities," American Rehabilitation, vol. 25, pp. 19-23, 1999.

[3] S. T. Peverly, V. Ramaswamy, C. Brown, J. Sumowski, M. Alidoost, and J. Garner, "What predicts skill in lecture note taking?," Journal of Educational Psychology, vol. 99, no. 1, pp. 167-180, 2007.

[4] B. Titsworth and K. Kiewra, "Spoken organizational lecture cues and student notetaking as facilitators of student learning," Contemporary Educational Psychology, vol. 29, no. 4, pp. 447-461, Oct. 2004.

[5] K. A. Kiewra, "Students' note-taking behaviors and the efficacy of providing the instructor's notes for review," Contemporary Educational Psychology, vol. 10, 1985.

[6] M. Wald and K. Bain, "Universal access to communication and learning: the role of automatic speech recognition," Universal Access in the Information Society, vol. 6, no. 4, pp. 435-447, 2008.

[7] D. Leitch and T. MacMillan, "Liberated Learning Initiative Innovative Technology and Inclusion: Current Issues and Future Directions for Liberated Learning Research," Year III Report, Saint Mary's University, Nova Scotia, 2003.

[8] H. Soltau, G. Saon, and B. Kingsbury, "The IBM Attila speech recognition toolkit," in Proc. IEEE Spoken Language Technology Workshop (SLT), pp. 97-102, Dec. 2010.

[9] K. Bain, J. Hines, P. Lingras, and Q. Yumei, "Using speech recognition and intelligent search tools to enhance information accessibility," in Proceedings of HCI International 2007, Lecture Notes in Computer Science (LNCS), 2007.

[10] K. Bain, S. Basson, A. Faisman, and D. Kanevsky, "Accessibility, transcription, and access everywhere," IBM Systems Journal, vol. 44, no. 3, pp. 589-604, 2005.
