
A Readability Evaluation of Real-Time Crowd Captions in the Classroom

Raja S. Kushalnagar, Walter S. Lasecki†, Jeffrey P. Bigham†

Department of Information and Computing Studies, Rochester Institute of Technology, 1 Lomb Memorial Dr, Rochester, NY 14623
† Department of Computer Science, University of Rochester, 160 Trustee Rd, Rochester, NY 14627

[email protected] {wlasecki,jbigham}@cs.rochester.edu

ABSTRACT

Deaf and hard of hearing individuals need accommodations that transform aural to visual information, such as transcripts generated in real-time to enhance their access to spoken information in lectures and other live events. Professional captionists' transcripts work well in general events such as community, administrative or legal meetings, but are often perceived as not readable enough in specialized content events such as higher education classrooms. Professional captionists with experience in specialized content areas are scarce and expensive. Commercial automatic speech recognition (ASR) software transcripts are far cheaper, but are often perceived as unreadable due to ASR's sensitivity to accents, background noise and slow response time. We evaluate the readability of a new crowd captioning approach in which captions are typed collaboratively by classmates into a system that aligns and merges the multiple incomplete caption streams into a single, comprehensive real-time transcript. Our study asked 48 deaf and hearing readers to evaluate transcripts produced by a professional captionist, automatic speech recognition software and crowd captioning software respectively, and found that the readers preferred crowd captions over professional captions and ASR.

Categories and Subject Descriptors

H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems; K.4.2 [Social Issues]: Assistive technologies for persons with disabilities

General Terms

Human Factors, Design, Experimentation

Keywords

Accessible Technology, Educational Technology, Deaf and Hard of Hearing Users

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ASSETS’12, October 22–24, 2012, Boulder, Colorado, USA. Copyright 2012 ACM 978-1-4503-1321-6/12/10 ...$15.00.

1. INTRODUCTION

Deaf and hard of hearing (DHH) individuals typically cannot understand audio alone, and access the audio through accommodations that translate the auditory information to visual information. The most common accommodations are real-time transcription or sign language translation of the audio.

Because deafness is a low-incidence disability, deaf and hard of hearing individuals are evenly and thinly spread [18]. As a result, many DHH individuals tend to be located far from major population centers and find it hard to obtain accommodation providers, especially those who can handle situations that require specialized content knowledge. These providers prefer to live close to areas where they can obtain enough demand to provide services. If there is not enough demand for providers in the area, there is a catch-22 for the DHH students and institutions. Therefore, for many institutions, in terms of content knowledge, availability and cost, it is best to use accommodation services centered on the student, such as classmates or on-demand remote workers.

This paper analyzes the readability of a new student-centered approach to real-time captioning in which multiple classmates simultaneously caption speech in real-time. Although classmates cannot type as quickly as the natural speaking rate of most speakers, we have found that they can provide accurate partial captions. We align and merge the multiple incomplete caption streams into a single, comprehensive real-time transcript. We compare deaf and hearing students' evaluation of the effectiveness and usability of this crowd-sourced real-time transcript against transcripts produced by professional captionists and automatic speech recognition software respectively.

2. BACKGROUND

Equal access to communication is fundamental to students' academic success, but is often taken for granted. In mainstream environments where deaf, hard-of-hearing, and hearing students study and attend classes together, people tend to assume that captioners or interpreters enable full communication between deaf and hearing people in the class. This assumption is especially detrimental as it does not address other information accessibility issues, such as translation delays that impact interaction and readability that impacts comprehension.

There are two popular approaches to generating real-time captions that attempt to convey every spoken word in the classroom: professional captioning and automatic speech recognition (ASR). Both professional captioning and ASR provide a real-time word-for-word display of what is said in class, as well as options for saving the text after class for study. We discuss the readability of these approaches and a new approach, which utilizes crowdsourcing to generate real-time captions.

Figure 1: Professional real-time captioning using a stenograph. (a) A stenograph keyboard that shows its phonetic-based keys. (b) A stenographer's typical words per minute (WPM) limit and range.

2.1 Professional Captioning

The most widely used approach, Communications Access Real-Time (CART), is generated by professional captionists who use shorthand software to generate captions that can keep up with natural speaking rates. Although CART is popular, professional captioners undergo years of training, which makes professional captioning services expensive. Furthermore, captionists usually have inadequate content knowledge and dictionaries to handle higher education lectures in specific fields. CART is the most reliable transcription service, but is also the most expensive one. Trained stenographers type in shorthand on a stenographic (shorthand writing system) keyboard, as shown in Figure 1. This keyboard maps multiple key presses to phonemes that are expanded to verbatim full text. Stenography requires 2-3 years of training to achieve at least 225 words per minute (WPM), and up to the 300 WPM needed to consistently transcribe all real-time speech, which helps to explain the current cost of more than $100 an hour. CART stenographers need only to recognize and type in the phonemes to create the transcript, which enables them to type fast enough to keep up with the natural speaking rate. But the software translation of phonemes to words requires a dictionary that already contains the words used in the lecture; typing new words into the dictionary slows down the transcription speed considerably. The stenographer can transcribe speech even if the words or phonemes do not make sense to them, e.g., if the spoken words appear to violate rules of grammar, pronunciation, or logic. If the captioner cannot understand the phoneme or word at all, then they cannot transcribe it.

In response to the high costs of CART, computer-based macro expansion services like C-Print were developed and introduced. C-Print is a type of nearly real-time transcription that was developed at the National Technical Institute for the Deaf. The captionist balances the tradeoff between typing speed and summarization by including as much information as possible, generally providing a meaning-for-meaning but not verbatim translation of the spoken English content. This system enables operators who are trained in academic situations to consolidate and better organize the text, with the goal of creating an end result more like class notes that may be more conducive to learning. C-Print captionists need less training, and generally charge around $60 an hour. As the captionist normally cannot type as fast as the natural speaking rate, they are not able to produce a verbatim real-time transcript. Also, the captionist can only effectively convey classroom content if they understand that content themselves. The advantage is that the C-Print transcript accuracy and readability is high [21], but the disadvantage of this approach is that the transcript shows a summary that is based on the captionist's understanding of the material, which may be different from the speaker's or reader's understanding of the material.

There are several captioning challenges in higher education. The first challenge is content knowledge: lecture information is dense and contains specialized vocabulary. This makes it hard to identify and schedule captionists who are both skilled in typing and have the appropriate content knowledge. Another captioning issue involves transcription delay, which occurs when captionists have to understand the phonemes or words and then type in what they have recognized. As a result, captionists tend to type the material to students with a delay of several seconds. This prevents students from effectively participating in an interactive classroom. Another challenge is speaker identification, in which captionists are unfamiliar with participants and are challenged to properly identify the current speaker. They can simplify this by recognizing the speaker by name, or by asking the speaker to pause before beginning until the captionist has caught up and had an opportunity to identify the new speaker.

In terms of availability, captionists typically are not available to transcribe live speech or dialogue for short periods or on-demand. Professional captionists usually need at least a few hours' advance notice, and prefer to work in 1-hour increments so as to account for their commute times. As a result, students cannot easily decide at the last minute to attend a lecture or after-class interactions with peers and teacher. Captionists used to need to be physically present at the event they were transcribing, but captioning services are increasingly being offered remotely [12, 1]. Captionists often are simply not available for many technical fields [21, 8]. Remote captioning offers the potential to recruit captionists familiar with a particular subject (e.g., organic chemistry) even if the captionist is located far away from an event. Selecting for expertise further reduces the pool of captionists. A final challenge is cost: professional captionists are highly trained to keep up with speech with low error rates, and so are highly paid. Experienced verbatim captionists' pay can exceed $200 an hour, and newly trained summarization [...]

2.2 Automatic Speech Recognition

ASR platforms typically use probabilistic approaches to translate speech to text. These platforms face challenges in accurately capturing modern classroom lectures, which can have one or more of the following problems: extensive technical vocabulary, poor acoustic quality, multiple information sources, speaker accents, or other problems. They also impose a processing delay of several seconds, and the delay lengthens as the amount of data to be analyzed gets bigger. In other words, ASR works well under ideal situations, but degrades quickly in many real settings. Kheir et al. [12] found that untrained ASR software had a 75% accuracy rate, which with training could rise to 90% under ideal single-speaker conditions, but this accuracy rate was still too low for use by deaf students. In the best possible case, in which the speaker has trained the ASR and wears a high-quality, noise-canceling microphone, the accuracy can be above 90%. When recording a speaker using a standard microphone on ASR not trained for the speaker, accuracy rates plummet to far below 50%. Additionally, the errors made by ASR often change the meaning of the text, whereas we have found non-expert captionists are much more likely to simply omit words or make spelling errors. In Figure 2, for instance, the ASR changes ‘two fold axis’ to ‘twenty four lexus’, whereas the typists typically omit words they do not understand or make spelling errors. Current ASR is speaker-dependent, has difficulty recognizing domain-specific jargon, and adapts poorly to vocal changes, such as when the speaker is sick [6, 7]. ASR systems generally need substantial computing power and high-quality audio to work well, which means systems can be difficult to transport. They are also ill-equipped to recognize and convey tone, attitudes, interest and emphasis, and to refer to visual information such as slides or demonstrations. ASR services charge about $15-20 an hour. However, these systems are more easily integrated with other functions such as multimedia indexing.

2.3 Crowd Captions in the Classroom

Deaf and hard of hearing students have had a long history of enhancing their classroom accessibility by collaborating with classmates. For example, they often arrange to copy notes from a classmate and share them with their study group. Crowdsourcing has been applied to offline transcription with great success [2], but has just recently been used for real-time transcription [15]. Applying a collaborative captioning approach among classmates enables real-time transcription from multiple non-experts, and crowd agreement mechanisms can be utilized to vet transcript quality [14].

We imagine a deaf or hard of hearing person eventually being able to capture aural speech with her cellphone anywhere and have captions returned to her with a few seconds' latency. She may use this to follow along in a lecture for which a professional captionist was not requested, to participate in informal conversation with peers after class, or to enjoy a movie or other live event that lacks closed captioning. These use cases are currently beyond the scope of ASR, and their serendipitous nature precludes pre-arranging a professional captionist.

……….that has a two fold axis…….
………….have a crystal that………..
...we have a crystal………..
...we have a crystal that has a two fold axis…..

Figure 2: The crowd captioning interface. The interface provides a text input box at the bottom, and shifts text up as users type (either when the text hits the end of the box, or when the user presses the enter key). To encourage users to continue typing even when making mistakes, editing of text is disabled word by word. Partial captions are forwarded to the server in real-time, which uses overlapping segments and the order in which segments are received to align and merge them.

Lasecki et al. have demonstrated that a modest number of people can provide reasonably high coverage over the caption stream, and introduced an algorithm that uses overlapping portions of the sequences to align and merge them using the Legion:Scribe system [15]. Scribe is based on the Legion [13] framework, which uses crowds of workers to accomplish tasks in real-time. Unlike Legion, Scribe merges responses to create a single, better response instead of selecting the best sequence from the inputs. This merger is done using an online multiple sequence alignment algorithm that aligns worker input to both reconstruct the final stream and correct errors (such as spelling mistakes) made by individual workers.
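
As a rough illustration of how overlapping partial captions can be combined, the following Python sketch greedily joins word sequences on their longest suffix/prefix overlap. It is a deliberately simplified stand-in, not the online multiple sequence alignment used by Legion:Scribe [15]: it assumes the partial captions arrive in spoken order and that overlapping words are typed identically, and the function names `merge_pair` and `merge_captions` are hypothetical. The example input reuses the partial captions shown in Figure 2.

```python
from typing import List


def merge_pair(left: List[str], right: List[str]) -> List[str]:
    """Append `right` to `left`, collapsing their longest suffix/prefix overlap."""
    max_overlap = min(len(left), len(right))
    for k in range(max_overlap, 0, -1):
        # Does the last k words of `left` equal the first k words of `right`?
        if left[-k:] == right[:k]:
            return left + right[k:]
    # No shared words at the boundary: simply concatenate.
    return left + right


def merge_captions(partials: List[str]) -> str:
    """Merge partial captions (assumed to be in spoken order) into one string."""
    merged: List[str] = []
    for partial in partials:
        merged = merge_pair(merged, partial.lower().split())
    return " ".join(merged)


if __name__ == "__main__":
    streams = [
        "we have a crystal",
        "have a crystal that",
        "that has a two fold axis",
    ]
    print(merge_captions(streams))
    # we have a crystal that has a two fold axis
```

A real system would also need to tolerate spelling differences and out-of-order arrival, which is what the alignment step in Scribe is described as handling.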

Crowd captioning offers several potential benefits over existing approaches. First, it is potentially much cheaper than hiring a professional captionist because non-expert captionists do not need extensive training to acquire a specific skill set, and thus may be drawn from a variety of sources, e.g., classmates, audience members, microtask marketplaces, volunteers, or affordable and readily available employees. Our workforce can be very large because, for people who can hear, speech recognition is relatively easy and most people can type accurately. The problem is that individually they cannot type quickly enough to keep up with natural speaking rates, and crowd captioning nicely remedies this problem. Recent work has demonstrated that small crowds can be recruited quickly on-demand (in less than 2 seconds) [...] receive a transcript of a short sound sequence in a few minutes, but is not able to produce verbatim captions over long periods of time [17].

In previous work, we developed a crowd captioning system that accepts real-time transcription from multiple non-experts, as shown in Figure 2. While non-experts cannot type as quickly as the natural speaking rate, we have found that they can provide accurate partial captions. Our system recruits fellow students with no training and compensates for slower typing speed and lower accuracy by combining the efforts of multiple captionists simultaneously and merging these partial captions in real-time. We have shown that groups of non-experts can achieve more timely captions than a professional captionist, that we can encourage them to focus on specific portions of the speech to improve global coverage, and that it is possible to recombine partial captions and effectively trade off coverage and precision [15].

2.4 Real-time text reading versus listening

Most people only see real-time text on TV at the bar or gym in the form of closed captions, which tend to have noticeable errors. However, those programs are captioned by live captionists or stenographers. To reduce errors, these real-time transcripts are often corrected and made into a permanent part of the video file by off-line captionists who prepare captions from pre-recorded videotapes and thoroughly review the work for errors before airing.

The translation of speech to text is not direct, but rather is interpreted and changed in the course of each utterance. Markers like accent, tone, and timbre are stripped out and represented by standardized written words and symbols. The reader then interprets these words and their flow to make meaning for themselves. Captionists tend not to include all spoken information so that readers can keep up with the transcript. Captionists are encouraged to alter the original transcription to provide time for the readers to completely read the caption and to synchronize with the audio. This is needed because, for a non-orthographic language like English, the length of a spoken utterance is not necessarily proportional to the length of a spelled word. In other words, reading speed is not the same as listening speed, especially for real-time scrolling text, as opposed to static pre-prepared text. For static text, reading speed has been measured at 291 wpm [19]. By contrast, the average caption rate for TV programs is 141 wpm [11], while the most comfortable reading rate for hearing, hard-of-hearing, and deaf adults is around 145 wpm [10]. The reason is that the task of viewing real-time captions involves different processing demands in visual location and tracking of moving text on a dynamic background.

English literacy rates among deaf and hard of hearing people are low compared to those of hearing peers. Captioning research has shown that caption rate, text reduction and viewer reading ability are all important factors, and that captions need to be provided within 5 seconds so that the reader can participate [20].

The number of spoken words and their complexity can also influence the captioning decision on the number of words to transcribe and the degree of summarization to include so as to reduce the reader's total cognitive load. Jensema et al. [10] analyzed a large sample of captioned TV programs and found that the total set had around 800K words consisting of 16,000 unique words. Furthermore, over two-thirds of the transcript words were drawn from just 250 words. Higher education lecture transcripts have a very different profile. For comparison purposes, we selected a 50-minute-long clip from the MIT OpenCourseWare (OCW) website¹. The audio sample was picked from a lecture segment in which the speech was relatively clear. We chose this lecture because it combined both technical and non-technical components. We found that the lecture had 9137 words, of which 1428 were unique, at 182.7 wpm. Furthermore, over two-thirds of the transcript consisted of around 500 words, which is double the size of the captioned TV word set.
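
The vocabulary profile above (speaking rate, and how many distinct words account for two-thirds of a transcript) can be computed with a few lines of Python. The sketch below is illustrative only: the whitespace tokenization, the function names, and the toy token list are assumptions, not the procedure used for the MIT OCW transcript. The only figures taken from the text are the 9137 words and 50 minutes used in the rate calculation.

```python
from collections import Counter
from typing import Iterable


def words_needed_for_coverage(tokens: Iterable[str], fraction: float = 2 / 3) -> int:
    """Number of most-frequent distinct words needed to cover `fraction` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= fraction * total:
            return rank
    return 0  # empty input


def words_per_minute(n_words: int, minutes: float) -> float:
    """Average speaking (or caption) rate."""
    return n_words / minutes


if __name__ == "__main__":
    # Toy tokens, purely for illustration.
    toy = "a a a a b b c d".split()
    print(words_needed_for_coverage(toy))        # 2
    # Lecture figures quoted in the text: 9137 words in a 50-minute clip.
    print(round(words_per_minute(9137, 50), 1))  # 182.7
```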

3. EVALUATION

To evaluate the efficacy of crowd-sourced real-time transcripts, we compared deaf and hearing users' perceptions of the usability of crowd-sourced real-time transcripts against Communications Access Real-Time (CART) transcripts and Automatic Speech Recognition (ASR) transcripts.

3.1 Design Criteria

Based on prior work as well as our own observations and experiences, we have developed the following design criteria for effective real-time transcript presentation for deaf and hard of hearing students:

1. The transcript must have enough information to be understood by the viewer.

2. The transcript must not be too fast or too slow, so that it can be comfortably read.

3. Reading must not require substantial backtracking.

3.2 Transcript Generation

We obtained three transcripts of an OCW lecture, generated by crowd captioners, a professional captioner and automatic speech recognition software.

A professional real-time stenographer captionist, who charged $200 an hour, created a professional real-time transcript of the lecture. The captioner listened to the audio and transcribed in real-time. The mean typing speed was about 180 wpm with a latency of 4.2 seconds. We calculated latency by averaging the latency of all matched words.
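
A minimal sketch of this latency measure, assuming word-level timestamps are available for both the spoken audio and the caption stream. The greedy matching rule (each caption word is paired with the earliest unused identical spoken word that does not come after it) and the name `mean_latency` are assumptions made for illustration; the text states only that latency was averaged over all matched words.

```python
from typing import List, Tuple

TimedWord = Tuple[str, float]  # (word, time in seconds)


def mean_latency(spoken: List[TimedWord], captioned: List[TimedWord]) -> float:
    """Average delay between a word being spoken and its caption appearing."""
    used = [False] * len(spoken)
    latencies: List[float] = []
    for word, shown_at in captioned:
        for i, (spoken_word, said_at) in enumerate(spoken):
            # Pair with the earliest unused, identical word spoken no later
            # than the moment the caption appeared.
            if not used[i] and spoken_word == word and said_at <= shown_at:
                used[i] = True
                latencies.append(shown_at - said_at)
                break
    if not latencies:
        raise ValueError("no caption word could be matched to a spoken word")
    return sum(latencies) / len(latencies)


if __name__ == "__main__":
    # Hypothetical timestamps, not data from the study.
    spoken = [("we", 0.0), ("have", 0.5), ("a", 0.8), ("crystal", 1.0)]
    captioned = [("we", 4.0), ("a", 4.5), ("crystal", 5.5)]  # "have" was omitted
    print(f"{mean_latency(spoken, captioned):.2f}")  # 4.07
```

Words the captionist omitted simply contribute nothing to the average, which is why latency and coverage need to be reported separately.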

We recruited 20 undergraduate students to act as non-expert captionists for our crowd captioning system. These students had no special training or previous formal experience transcribing audio. Participants then provided partial captions for the lecture audio. The final transcript speed was about 130 WPM, with a latency of 3.87 seconds.

In addition to these two transcripts, we generated a transcript with automatic speech recognition (ASR), using Nuance Dragon Naturally Speaking 11 software. We used an untrained profile to simulate our target context of students transcribing speech from new or multiple speakers. To conduct this test, the audio files were played and redirected to Dragon. We used a software loop to redirect the audio signal without resampling using SoundFlower², and a custom program to record the time when each word was generated by the ASR. The ASR transcript speed was 71.0 wpm (SD=23.7) with a latency of 7.9 seconds.

3.3 Transcript Evaluation

¹ http://ocw.mit.edu/ ²

Figure 3: The transcript viewing experience.

We recruited 48 students over two weeks to participate in the study, and recruited deaf and hearing students, male and female, evenly. Twenty-one of them were deaf, four of them were hard of hearing and the remainder, twenty-four, were hearing. There were 21 females and 27 males, which reflects the gender balance on campus. Their ages ranged from 18 to 29 and all were students at RIT, ranging from first-year undergraduates to graduate students. We recruited through flyers and word of mouth on the campus. We asked students to contact us and schedule an appointment by email. All students were reimbursed for their participation. All deaf participants were asked if they used visual accommodations for their classes, and all of them answered affirmatively.

Testing was conducted in a quiet room with a 22-inch flat-screen monitor, as shown in Figure 3. Each person was directed to an online web page that explained the purpose of the study. Next, the students were asked to complete a short demographic questionnaire in order to determine eligibility for the test, and were asked for informed consent. Then they were asked to view a short 30-second introductory video to familiarize themselves with the process of viewing transcripts. Then the students were asked to watch a series of transcripts of the same lecture, each lasting two minutes. The clips were labeled Transcript 1, 2 and 3, and were presented in a randomized order without any accompanying audio. The total time for the study was about 15 minutes.

After the participant had watched all three video clips of the real-time transcripts, they were asked to complete a questionnaire. The questionnaire asked three questions. The first question asked “How easy was it to follow transcript 1?”. In response to the question, the participants were presented with a Likert scale that ranged from 1 through 5, with 1 being “Very hard” and 5 being “Very easy”. The second question asked “How easy was it to follow transcript 2?”. In response to this question, participants were prompted to answer using a similar Likert scale response as in question 1. The third question was “How easy was it to follow transcript 3?”. In response to this question, participants were prompted with a similar, corresponding Likert scale response to questions 1 and 2. Then participants were asked to answer, in their own words, three questions that asked for their thoughts about following the lecture through the transcripts. The answers were open ended and many participants gave wonderful feedback. The first video transcript contained the captions created by the stenographer. The second video transcript contained the captions created by the automatic speech recognition software, in this case Dragon Naturally Speaking v. 11. The third and final video transcript contained the captions created by the crowd captioning process.

Figure 4: A comparison of the flow for each transcript. Both CART and crowd captions exhibit a relatively smooth real-time text flow. Students prefer this flow over the more choppy ASR transcript flow.

4. DISCUSSION

For the user preference questions, there was a significant difference in the Likert score distributions between Transcripts 1 and 2 and between Transcripts 2 and 3. In general, participants found it hard to follow Transcript 2 (automatic speech recognition); the median rating for it was 1, i.e., “Very hard”. The qualitative comments indicated that many of them thought the transcript was too choppy and had too much latency. In contrast, participants found it easier to follow either Transcript 1 (professional captions) or 3 (crowd captions). Overall, both deaf and hearing students had similar preference ratings for crowd captions and professional captions (CART) in the absence of audio. While the overall rating for crowd captions was slightly higher at 3.15 (SD=1.06) than for professional captions (CART) at 3.08 (SD=1.24), the difference was not statistically significant (χ² = 32.52, p < 0.001). There was greater variation in preference ratings for professional captions than for crowd captions.

Figure 5: A graph of the latencies for each transcript (professional, automatic speech recognition and crowd). CART and crowd captions have reasonable latencies of less than 5 seconds, which allows students to keep up with class lectures, but not to consistently participate in class questions and answers, or other interactive class discussion.

When we divided the students into deaf and hearing subgroups and looked at their Likert preference ratings, there was no significant difference between crowd captions and professional captions for deaf students (χ² = 25.44, p < 0.001). Hearing students as a whole showed a significant difference between crowd captions and professional captions (χ² = 19.56, p = 0.07).
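
For readers who want to see how a χ² statistic of this kind is obtained, the sketch below computes Pearson's chi-square statistic and its degrees of freedom for a contingency table of rating counts (for example, one row per transcript type and one column per Likert level). It assumes a Pearson test of independence, which the text does not state explicitly, and the table in the usage example is made-up illustrative data, not the study's responses.

```python
from typing import List, Tuple


def chi_square(table: List[List[int]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


if __name__ == "__main__":
    # Illustrative 2 x 2 table only (not study data).
    stat, dof = chi_square([[10, 20], [20, 10]])
    print(round(stat, 3), dof)  # 6.667 1
```

The p-value then follows from the chi-square distribution with `dof` degrees of freedom, which a statistics library would normally supply.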

The qualitative comments from hearing students revealed that transcript flow (as shown in Figure 4), latency (as shown in Figure 5) and speed were significant factors in their preference ratings. For example, one hearing student had the following comment for the professionally captioned real-time transcript: “The words did not always seem to form coherent sentences and the topics seemed to change suddenly as if there was no transition from one topic to the next. This made it hard to understand so I had to try and reread it quickly”. In contrast, for crowd captioning, the same student commented: “I feel this was simpler to read mainly because the words even though some not spelled correctly or grammatically correct in English were fairly simple to follow. I was able to read the sentences about there being two sub-trees, the left and the right and that there are two halves of the algorithm attempted to be explained. The word order was more logical to me so I didn’t need to try and reread it”. On the other hand, for the professional captions, a deaf student commented: “It was typing slowly so I get distracted and I looked repeatedly from the beginning”; and for crowd captions, the deaf student commented: “It can be confusing so slow respsone on typing, so i get distracted on other paragraphs just to keep myself focused”.

Overall, hearing participants appeared to like the slower and smoother-flowing crowd transcript rather than the faster and less smooth captions. Deaf participants appear to accept all transcripts. It may be that the deaf students are more used to bad and distorted input and more easily skip or tolerate errors by picking out key words, but this or any other explanation requires further research. These considerations would seem to be particularly important in educational contexts where material may be captioned with the intention of making curriculum-based information available to learners.

A review of the literature on captioning comprehension and readability shows this result is consistent with findings from Burnham et al. [5], who found that text reduction did not reduce comprehension for deaf adults, whether good or poor at reading. The same study also found that slower caption rates tended to assist comprehension for more proficient readers, but this was not the case for less proficient readers. This may explain why hearing students significantly preferred crowd captions over professional captions, whereas deaf students did not show any significant preference for crowd captions over professional captions. Since deaf students on average have a wider range of reading skills, it appears that slower captions do not help the less proficient readers in this group. Based on the qualitative comments, it appears that these students preferred to have a smoother word flow and to keep latency low rather than to slow down the real-time text. In fact, many of the less proficient readers commented that the captions were too slow. We hypothesize that these students, who tend to use interpreters rather than real-time captions, are focusing on key words and ignoring the rest of the text.

5. CONCLUSIONS

Likert ratings showed that hearing students rated crowd captions at or higher than professional captions, while deaf students rated both equally. A summary of qualitative comments on crowd captions suggests that these transcripts are presented at a readable pace, that phrasing and vocabulary made more sense, and that captioning flow is better than in professional captioning or Automatic Speech Recognition.

We hypothesize that this finding is attributable to two factors. The first factor is that the speaking rate typically varies from 175-275 wpm [19], which is faster than the reading rate for captions of around 100-150 wpm, especially for dense lecture material. The second factor is that the timing for listening to spoken language is different from the timing for reading written text. Speakers often pause, change rhythm or repeat themselves. The end result is that captioning flow is as important as traditional captioning metrics such as coverage, accuracy and speed, if not more. The averaging of multiple caption streams into an aggregate stream appears to smooth the flow of text as perceived by the reader, as compared with the flow of text in professional captioning or ASR captions.

We think the crowd captionists are typing the information that is most important to them, in other words dropping the unimportant bits, and this happens to better match the reading rate. As the captionists are working simultaneously, it can be regarded as a group vote for the most important information. A group of non-expert captionists appears to be able to collectively catch, understand and summarize at least as well as a single expert captioner. The constraint of the maximum average reading rate for real-time transcript word flow reduces the need for making a tradeoff between coverage and speed; beyond a speed of about 140 words per minute [10], coverage and flow become more important. In other words, assuming a limiting reading rate (especially for dense lecture information), the comments show that students prefer condensed material so that they can maintain reading speed and flow to keep up with the instructor.

One of the key advantages to using human captionists instead of ASR is the type of errors generated by the system when it fails to correctly identify a word. Instead of random text, humans are capable of inferring meaning, [...] context of the speech. We anticipate this will make quick-Caption more usable than automated systems even in cases where there may be minimal difference in measures such as accuracy and coverage.

We propose a new crowd captioning approach that recruits classmates and others to transcribe and share classroom lectures. Classmates are likely to be more familiar with the topic being discussed, and to be used to the speaker's style. We show that readers prefer this approach. This approach is less expensive and is more inclusive, scalable, flexible and easier to deploy than traditional captioning, especially when used with mobile devices. This approach can scale in terms of classmates and vocabulary, and can enable efficient retrieval and viewing on a wide range of devices. The crowd captioning transcript, as an average of multiple streams from all captionists, is likely to be more consistent and have less surprise than any single captionist, and to have less delay, all of which reduce the likelihood of information loss by the reader. This approach can be viewed as parallel note-taking that benefits all students, who get a high-coverage, high-quality reviewable transcript that none of them could normally type on their own.

We have introduced the idea of real-time non-expert captioning, and demonstrated through coverage experiments that this is a promising direction for future research. We show that deaf and hearing students alike prefer crowd captions over ASR because the students find the errors easier to backtrack on and correct in real-time. Most people cannot tolerate an error rate of 10% or more, as errors can completely change the meaning of the text. Human operators who correct the errors on-the-fly make these systems more viable, opening the field to operators with far less expertise and the ability to format, add punctuation, and indicate speaker changes. Until the time ASR becomes a mature technology that can handle all kinds of speech and environments, human assistance in captioning will continue to be an essential ingredient in speech transcription.

We also notice that crowd captions appear to have more accurate technical vocabulary than either ASR or professional captions. Crowd captioning outperforms ASR in many real settings. Non-expert real-time captioning has not yet, and might not ever, replace professional captionists or ASR, but it shows a lot of promise. The reason is that a single captioner cannot optimize their dictionary fully, as they have to adapt to various teachers, lecture content and their context. Classmates are much better positioned to adapt to all of these, and fully optimize their typing, spelling, and flow. Crowd captioning enables the software and users to effectively adapt to a variety of environments that a single captionist and dictionary cannot handle.

One common thread among the feedback comments revealed that deaf participants are not homogeneous, and there is no neat unifying learning style abstraction. Lesson complexity, learning curves, expectations, anxiety, trust and suspicions can all affect learning experiences and indirectly the satisfaction and rating of transcripts.

6. FUTURE WORK

From the perspective of a reader viewing a real-time transcript, not all errors are equally important, and human perceptual errors of the dialog are much easier for users to understand and adapt to than ASR errors. Also unlike ASR, crowd captioning can handle poor dialog audio or untrained speech, e.g. multiple speakers, meetings, panels, audience questions. Using this knowledge, we hope to be able to encourage crowd captioning workers to leverage their understanding of the context that content is spoken in to capture the segments with the highest information content.

Non-expert captionists and ASR make different types of errors. Specifically, humans generally type words that actually appear in the audio, but miss many words. Automatic speech recognition often misunderstands which word was spoken, but generally gets the number of words spoken nearly correct. One approach may be to use ASR as a stable underlying signal for real-time transcription, and use non-expert transcription to replace incorrect words. This may be particularly useful when transcribing speech that contains jargon terms. A non-expert captionist could type as many of these terms as possible, and could fit them into the transcription provided by ASR where appropriate.
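The idea of using ASR as the underlying signal and human input as corrections can be sketched as follows. This is a minimal illustration under stated assumptions, not the alignment model used in this work: it aligns the two word sequences with Python's standard `difflib.SequenceMatcher`, and the function name `patch_asr_with_human` and the substitution rule (swap in human words only where both sides disagree over a span of equal length) are hypothetical choices made for the example.

```python
from difflib import SequenceMatcher


def patch_asr_with_human(asr_words, human_words):
    """Keep the ASR word sequence, but trust the human on disagreements.

    asr_words:   full-length but error-prone ASR output (list of str)
    human_words: sparse but mostly correct human caption (list of str)
    """
    matcher = SequenceMatcher(a=asr_words, b=human_words, autojunk=False)
    merged = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            # Both sources agree on these words.
            merged.extend(asr_words[a0:a1])
        elif tag == "replace" and (a1 - a0) == (b1 - b0):
            # Same number of words on both sides: prefer the human's words.
            merged.extend(human_words[b0:b1])
        elif tag in ("replace", "delete"):
            # The human skipped or condensed this stretch: keep the ASR words.
            merged.extend(asr_words[a0:a1])
        # "insert": human words with no ASR counterpart are dropped in this
        # sketch, so the ASR word count is preserved.
    return merged


if __name__ == "__main__":
    asr = "we use a hash cable to store keys".split()
    human = "use a hash table to store".split()
    print(" ".join(patch_asr_with_human(asr, human)))
```

In the usage example the human caption omits the first and last words but corrects the misrecognized jargon term, so the merged result keeps the ASR's word count while carrying the human's vocabulary.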

ASR systems usually cannot provide a reliable confidence level for their own accuracy. On the other hand, the crowd usually has a better sense of their own accuracy. One approach to leverage this would be to provide an indication of the confidence the system has in recognition accuracy. This could be done in many ways, for example through colors. This would enable the users to pick their own confidence threshold.

It would be useful to add automatic speech recognition as a complementary source of captions because its errors are generally independent of those of non-expert captionists. This difference means that matching captions input by captionists and ASR can likely be used with high confidence, even in the absence of many layers of redundant captionists or ASR systems. Future work also seeks to integrate multiple sources of evidence, such as N-gram frequency data, into a probabilistic framework for transcription and ordering. Estimates of worker latency or quality can also be used to weight the input of multiple contributors in order to reduce the amount of erroneous input from lazy or malicious contributors, while not penalizing good ones. This is especially important if crowd services such as Amazon’s Mechanical Turk are to be used to support these systems in the future. The models currently used to align and merge sets of partial captions from contributors are in their infancy, and will improve as more work is done in this area. As crowd captioning improves, students can begin to rely more on readable captions being made available at any time for any speaker.
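One simple way to picture weighting contributors and reporting a confidence is a weighted vote over the candidate words proposed for a single aligned position in the transcript. The sketch below is only an illustration under stated assumptions: the function name `weighted_word_vote`, the per-contributor quality weights, and the definition of confidence as the winner's share of the total weight are all hypothetical, and it is not the probabilistic framework envisioned above.

```python
from collections import defaultdict


def weighted_word_vote(candidates):
    """Pick one word for an aligned transcript position.

    candidates: list of (word, weight) pairs, one per contributor (human
    or ASR); weight is an assumed positive estimate of that contributor's
    quality. Returns (winning_word, confidence), where confidence is the
    winner's share of the total weight.
    """
    if not candidates:
        raise ValueError("need at least one candidate word")
    totals = defaultdict(float)
    for word, weight in candidates:
        if weight <= 0:
            raise ValueError("weights must be positive")
        totals[word.lower()] += weight
    best = max(totals, key=totals.get)
    confidence = totals[best] / sum(totals.values())
    return best, confidence


if __name__ == "__main__":
    # Two trusted contributors agree; one low-weight contributor differs.
    word, conf = weighted_word_vote(
        [("entropy", 1.0), ("Entropy", 1.0), ("entry", 0.5)]
    )
    print(word, round(conf, 2))
```

A display layer could then map the returned confidence to a color, letting each reader choose the threshold below which words are shown as uncertain.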

The benefits of captioning by local or remote workers presented in this paper aim to further motivate the use of crowd captioning. We imagine a deaf or hard of hearing person eventually being able to capture speech with her cellphone anywhere and have captions returned to her within a few seconds’ latency. She may use this to follow along in a lecture for which a professional captionist was not requested, to participate in informal conversation with peers after class, or enjoy a movie or other live event that lacks closed captioning. These use cases are currently beyond the scope of ASR, and their serendipitous nature precludes pre-arranging a professional captionist. Moreover, ASR and professional captioning systems do not have a consistent way of adding appropriate punctuation from lecture speech in real-time, resulting in captions that are very difficult to read and understand [9,16].

A challenge in developing new methods for real-time captioning is that it can be difficult to quantify whether the


ability and readability of real-time captioning is dependent on much more than just Word Error Rate, involving at a minimum naturalness of errors, regularity, latency and flow. These concepts are difficult to capture automatically, which makes it difficult to make reliable comparisons across different approaches. Designing metrics that can be universally applied will improve our ability to make progress in systems for real-time captioning.
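For reference, Word Error Rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below assumes that standard definition, which this paper does not spell out, and the function name `word_error_rate` is chosen for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(word_error_rate(reference, "the cat sat on mat"))      # omission
    print(word_error_rate(reference, "the cat sad on the mat"))  # substitution
```

Both hypotheses in the usage example contain exactly one error and therefore receive the same score, although one is a human-style omission and the other an ASR-style substitution. This is the kind of difference in naturalness of errors that Word Error Rate alone does not capture.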

7. ACKNOWLEDGMENTS

We thank our participants for their time and feedback in evaluating the captions, and the real-time captionists for their work in making the lecture accessible to deaf and hard of hearing students.

8. REFERENCES

[1] FAQ about CART (real-time captioning), 2011. http://www.ccacaptioning.org/articles-resources/faq.

[2] Y. C. Beatrice Liem, Haoqi Zhang. An iterative dual pathway structure for speech-to-text transcription. In Proceedings of the 3rd Workshop on Human Computation (HCOMP ’11), HCOMP ’11, 2011.

[3] M. S. Bernstein, J. R. Brandt, R. C. Miller, and D. R. Karger. Crowds in two seconds: Enabling realtime crowd-powered interfaces. In Proceedings of the 24th annual ACM symposium on User interface software and technology, UIST ’11, page to appear, New York, NY, USA, 2011. ACM.

[4] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, and T. Yeh. VizWiz: nearly real-time answers to visual questions. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, UIST ’10, pages 333–342, New York, NY, USA, 2010. ACM.

[5] D. Burnham, G. Leigh, W. Noble, C. Jones, M. Tyler, L. Grebennikov, and A. Varley. Parameters in television captioning for deaf and hard-of-hearing adults: Effects of caption rate versus text reduction on comprehension. Journal of Deaf Studies and Deaf Education, 13(3):391–404, 2008.

[6] X. Cui, L. Gu, B. Xiang, W. Zhang, and Y. Gao. Developing high performance ASR in the IBM multilingual speech-to-speech translation system. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5121–5124, 31 2008-April 4 2008.

[7] L. B. Elliot, M. S. Stinson, D. Easton, and J. Bourgeois. College Students Learning With C-Print’s Education Software and Automatic Speech Recognition. In American Educational Research Association Annual Meeting, New York, NY, 2008.

[8] M. B. Fifield. Realtime remote online captioning: An effective accommodation for rural schools and colleges. In Instructional Technology And Education of the Deaf Symposium, 2001.

[9] A. Gravano, M. Jansche, and M. Bacchiani. Restoring punctuation and capitalization in transcribed speech. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 4741–4744, April 2009.

[10] C. Jensema. Closed-captioned television presentation speed and vocabulary. American Annals of the Deaf, 141(4):284–292, 1996.

[11] C. J. Jensema, R. Danturthi, and R. Burch. Time spent viewing captions on television programs. American Annals of the Deaf, 145(5):464–468, 2000.

[12] R. Kheir and T. Way. Inclusion of deaf students in computer science classes using real-time speech transcription. In Proceedings of the 12th annual SIGCSE conference on Innovation and technology in computer science education, ITiCSE ’07, pages 261–265, New York, NY, USA, 2007. ACM.

[13] W. Lasecki, K. Murray, S. White, R. C. Miller, and J. P. Bigham. Real-time crowd control of existing interfaces. In Proceedings of the 24th annual ACM symposium on User interface software and technology, UIST ’11, page to appear, New York, NY, USA, 2011. ACM.

[14] W. S. Lasecki and J. P. Bigham. Online quality control for real-time captioning. In Proceedings of the 14th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’12, 2012.

[15] W. S. Lasecki, C. Miller, A. Sadilek, A. Abumoussa, D. Borrello, R. Kushalnagar, and J. P. Bigham. Realtime captioning by groups of nonexperts. In Proceedings of the 25th ACM UIST Symposium, UIST ’12, 2012.

[16] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. Audio, Speech, and Language Processing, IEEE Transactions on, 14(5):1526–1540, Sept. 2006.

[17] T. Matthews, S. Carter, C. Pai, J. Fong, and J. Mankoff. In Proceeding of the 8th International Conference on Ubiquitous Computing, pages 159–176, Berlin, 2006. Springer-Verlag.

[18] R. E. Mitchell. How many deaf people are there in the United States? Estimates from the Survey of Income and Program Participation. Journal of deaf studies and deaf education, 11(1):112–9, Jan. 2006.

[19] S. J. Samuels and P. R. Dahl. Establishing appropriate purpose for reading and its effect on flexibility of reading rate. Journal of Educational Psychology, 67(1):38–43, 1975.

[20] M. Wald. Using automatic speech recognition to enhance education for all students: Turning a vision into reality. In Frontiers in Education, 2005. FIE ’05. Proceedings 35th Annual Conference, page S3G, Oct. 2005.

[21] M. Wald. Creating accessible educational multimedia through editing automatic speech recognition captioning in real time. Interactive Technology and Smart Education, 3(2):131–141, 2006.

1 http://ocw.mit.edu/
