A Readability Evaluation of Real-Time Crowd Captions in
the Classroom
Raja S. Kushalnagar, Walter S. Lasecki†, Jeffrey P. Bigham†
Department of Information and Computing Studies, Rochester Institute of Technology, 1 Lomb Memorial Dr, Rochester, NY 14623
†Department of Computer Science, University of Rochester, 160 Trustee Rd, Rochester, NY 14627
[email protected] {wlasecki,jbigham}@cs.rochester.edu
ABSTRACT
Deaf and hard of hearing individuals need accommodations that transform aural information into visual information, such as transcripts generated in real time, to enhance their access to spoken information in lectures and other live events. Professional captionists’ transcripts work well in general events such as community, administrative or legal meetings, but are often perceived as not readable enough in specialized content events such as higher education classrooms. Professional captionists with experience in specialized content areas are scarce and expensive. Commercial automatic speech recognition (ASR) software transcripts are far cheaper, but are often perceived as unreadable due to ASR’s sensitivity to accents and background noise and its slow response time. We evaluate the readability of a new crowd captioning approach in which captions are typed collaboratively by classmates into a system that aligns and merges the multiple incomplete caption streams into a single, comprehensive real-time transcript. Our study asked 48 deaf and hearing readers to evaluate transcripts produced by a professional captionist, automatic speech recognition software and crowd captioning software, respectively, and found that the readers preferred crowd captions over professional captions and ASR.
Categories and Subject Descriptors
H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems; K.4.2 [Social Issues]: Assistive technologies for persons with disabilities
General Terms
Human Factors, Design, Experimentation
Keywords
Accessible Technology, Educational Technology, Deaf and Hard of Hearing Users
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ASSETS’12, October 22–24, 2012, Boulder, Colorado, USA. Copyright 2012 ACM 978-1-4503-1321-6/12/10 ...$15.00.
1. INTRODUCTION
Deaf and hard of hearing (DHH) individuals typically cannot understand audio alone, and access the audio through accommodations that translate the auditory information to visual information. The most common accommodations are real-time transcription or sign language translation of the audio.
Because deafness is a low-incidence disability, deaf and hard of hearing individuals are evenly and thinly spread [18]. As a result, many DHH individuals tend to be located far from major population centers and find it hard to obtain accommodation providers, especially those who can handle situations that require specialized content knowledge. These providers prefer to live close to areas where there is enough demand for their services. If there is not enough demand for providers in an area, DHH students and institutions face a catch-22. Therefore, in terms of content knowledge, availability and cost, it is best for many institutions to use accommodation services centered on the student, such as classmates or on-demand remote workers.
This paper analyzes the readability of a new student-centered approach to real-time captioning in which multiple classmates simultaneously caption speech in real time. Although classmates cannot type as quickly as the natural speaking rate of most speakers, we have found that they can provide accurate partial captions. We align and merge the multiple incomplete caption streams into a single, comprehensive real-time transcript. We compare deaf and hearing students’ evaluations of the effectiveness and usability of this crowd-sourced real-time transcript against transcripts produced by professional captionists and automatic speech recognition software, respectively.
2. BACKGROUND
Equal access to communication is fundamental to students’ academic success, but is often taken for granted. In mainstream environments where deaf, hard-of-hearing, and hearing students study and attend classes together, people tend to assume that captioners or interpreters enable full communication between deaf and hearing people in the class. This assumption is especially detrimental as it does not address other information accessibility issues, such as translation delays that impact interaction and readability that impacts comprehension.
There are two popular approaches to generating real-time captions that attempt to convey every spoken word in the classroom: professional captioning and automatic speech recognition (ASR). Both professional captioning and ASR provide a real-time word-for-word display of what is said in class, as well as options for saving the text after class for study. We discuss the readability of these approaches and a new approach, which utilizes crowdsourcing to generate real-time captions.
Figure 1: Professional Real-Time Captioning using a stenograph. (a) A stenograph keyboard that shows its phonetic-based keys. (b) A stenographer’s typical Words Per Minute (WPM) limit and range.
2.1 Professional Captioning
The most widely used approach, Communication Access Real-Time Translation (CART), is generated by professional captionists who use shorthand software to generate captions that can keep up with natural speaking rates. Although CART is popular, professional captioners undergo years of training, which makes professional captioning services expensive. Furthermore, captionists usually have inadequate content knowledge and dictionaries to handle higher education lectures in specific fields. CART is the most reliable transcription service, but is also the most expensive one. Trained stenographers type in shorthand on a stenographic (shorthand writing system) keyboard as shown in Figure 1. This keyboard maps multiple key presses to phonemes that are expanded to verbatim full text. Stenography requires 2-3 years of training to achieve at least 225 words per minute (WPM), and up to the 300 WPM that is needed to consistently transcribe all real-time speech, which helps to explain the current cost of more than $100 an hour. CART stenographers need only to recognize and type in the phonemes to create the transcript, which enables them to type fast enough to keep up with the natural speaking rate. But the software translation of phonemes to words requires a dictionary that already contains the words used in the lecture; typing new words into the dictionary slows down the transcription speed considerably. The stenographer can transcribe speech even if the words or phonemes do not make sense to them, e.g., if the spoken words appear to violate rules of grammar, pronunciation, or logic. If the captioner cannot understand the phoneme or word at all, then they cannot transcribe it.
In response to the high costs of CART, computer-based macro expansion services like C-Print were developed and introduced. C-Print is a type of nearly real-time transcription that was developed at the National Technical Institute for the Deaf. The captionist balances the tradeoff between typing speed and summarization by including as much information as possible, generally providing a meaning-for-meaning but not verbatim translation of the spoken English content. This system enables operators who are trained in academic situations to consolidate and better organize the text, with the goal of creating an end result more like class notes that may be more conducive to learning. C-Print captionists need less training, and generally charge around $60 an hour. As the captionist normally cannot type as fast as the natural speaking rate, they are not able to produce a verbatim real-time transcript. Also, the captionist can only effectively convey classroom content if they understand that content themselves. The advantage is that the C-Print transcript accuracy and readability is high [21], but the disadvantage of this approach is that the transcript shows a summary that is based on the captionist’s understanding of the material, which may be different from the speaker’s or reader’s understanding of the material.
There are several captioning challenges in higher education. The first challenge is content knowledge: lecture information is dense and contains specialized vocabulary. This makes it hard to identify and schedule captionists who are both skilled in typing and have the appropriate content knowledge. Another captioning issue involves transcription delay, which occurs when captionists have to understand the phonemes or words and then type in what they have recognized. As a result, captionists tend to deliver the material to students with a delay of several seconds. This prevents students from effectively participating in an interactive classroom. Another challenge is speaker identification, in which captionists are unfamiliar with participants and are challenged to properly identify the current speaker. They can simplify this by recognizing the speaker by name, or by asking the speaker to pause before beginning until the captionist has caught up and had an opportunity to identify the new speaker. In terms of availability, captionists typically are not available to transcribe live speech or dialogue for short periods or on demand. Professional captionists usually need at least a few hours advance notice, and prefer to work in 1-hour increments so as to account for their commute times. As a result, students cannot easily decide at the last minute to attend a lecture or after-class interactions with peers and teachers. Captionists used to need to be physically present at the event they were transcribing, but captioning services are increasingly being offered remotely [12, 1]. Captionists often are simply not available for many technical fields [21, 8]. Remote captioning offers the potential to recruit captionists familiar with a particular subject (e.g., organic chemistry) even if the captionist is located far away from an event. Selecting for expertise further reduces the pool of captionists. A final challenge is cost: professional captionists are highly trained to keep up with speech with low error rates, and so are highly paid. Experienced verbatim captionists’ pay can exceed $200 an hour, and newly trained summariza
2.2 Automatic Speech Recognition
ASR platforms typically use probabilistic approaches to translate speech to text. These platforms face challenges in accurately capturing modern classroom lectures that can have one or more of the following challenges: extensive technical vocabulary, poor acoustic quality, multiple information sources, speaker accents, or other problems. They also impose a processing delay of several seconds, and the delay lengthens as the amount of data to be analyzed gets bigger. In other words, ASR works well under ideal situations, but degrades quickly in many real settings. Kheir et al. [12] found that untrained ASR software had a 75% accuracy rate but, with training, could reach 90% under ideal single-speaker conditions; this accuracy rate was still too low for use by deaf students. In the best possible case, in which the speaker has trained the ASR and wears a high-quality, noise-canceling microphone, the accuracy can be above 90%. When recording a speaker using a standard microphone on ASR not trained for the speaker, accuracy rates plummet to far below 50%. Additionally, the errors made by ASR often change the meaning of the text, whereas we have found non-expert captionists are much more likely to simply omit words or make spelling errors. In Figure 2, for instance, the ASR changes ‘two fold axis’ to ‘twenty four lexus’, whereas the typists typically omit words they do not understand or make spelling errors. Current ASR is speaker-dependent, has difficulty recognizing domain-specific jargon, and adapts poorly to vocal changes, such as when the speaker is sick [6, 7]. ASR systems generally need substantial computing power and high-quality audio to work well, which means systems can be difficult to transport. They are also ill-equipped to recognize and convey tone, attitudes, interest and emphasis, and to refer to visual information such as slides or demonstrations. ASR services charge about $15-20 an hour. However, these systems are more easily integrated with other functions such as multimedia indexing.
2.3 Crowd Captions in the Classroom
Deaf and hard of hearing students have had a long history of enhancing their classroom accessibility by collaborating with classmates. For example, they often arrange to copy notes from a classmate and share them with their study group. Crowdsourcing has been applied to offline transcription with great success [2], but has just recently been used for real-time transcription [15]. Applying a collaborative captioning approach among classmates enables real-time transcription from multiple non-experts, and crowd agreement mechanisms can be utilized to vet transcript quality [14].
………. that has a two fold axis …….
…………. have a crystal that ………..
... we have a crystal ………..
... we have a crystal that has a two fold axis …..
Figure 2: The crowd captioning interface. The interface provides a text input box at the bottom, and shifts text up as users type (either when the text hits the end of the box, or when the user presses the enter key). To encourage users to continue typing even when making mistakes, editing of text is disabled word by word. Partial captions are forwarded to the server in real time, which uses overlapping segments and the order in which segments are received to align and merge them.
We imagine a deaf or hard of hearing person eventually being able to capture aural speech with her cell phone anywhere and have captions returned to her with a few seconds of latency. She may use this to follow along in a lecture for which a professional captionist was not requested, to participate in informal conversation with peers after class, or to enjoy a movie or other live event that lacks closed captioning. These use cases are currently beyond the scope of ASR, and their serendipitous nature precludes pre-arranging a professional captionist. Lasecki et al. have demonstrated that a modest number of people can provide reasonably high coverage over the caption stream, and introduced an algorithm that uses overlapping portions of the sequences to align and merge them using the Legion:Scribe system [15]. Scribe is based on the Legion [13] framework, which uses crowds of workers to accomplish tasks in real time. Unlike Legion, Scribe merges responses to create a single, better response instead of selecting the best sequence from the inputs. This merger is done using an online multiple sequence alignment algorithm that aligns worker input to both reconstruct the final stream and correct errors (such as spelling mistakes) made by individual workers.
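The idea of merging overlapping partial captions can be illustrated with a minimal sketch. This is not the Scribe algorithm: it is a simplified greedy word-overlap merge, written only to show how the partial captions in Figure 2 could combine into one stream. The function names `merge_partial` and `merge_streams` are hypothetical, the partial captions are assumed to arrive in roughly spoken order, and no spelling correction is attempted.

```python
def merge_partial(merged, partial):
    """Append the word list `partial` to `merged`, collapsing any word overlap.

    Simplified illustration only; the real system uses online multiple
    sequence alignment, which this sketch does not implement.
    """
    n, m = len(merged), len(partial)
    # If the partial caption already appears inside the merged stream, add nothing.
    for start in range(n - m + 1):
        if merged[start:start + m] == partial:
            return merged
    # Otherwise find the longest suffix of `merged` that is a prefix of `partial`.
    for k in range(min(n, m), 0, -1):
        if merged[-k:] == partial[:k]:
            return merged + partial[k:]
    # No overlap found: assume the partial caption simply follows the stream.
    return merged + partial


def merge_streams(partials):
    """Merge partial caption strings (in arrival order) into one transcript."""
    merged = []
    for text in partials:
        merged = merge_partial(merged, text.lower().split())
    return " ".join(merged)


if __name__ == "__main__":
    # The three partial captions shown in Figure 2.
    parts = ["we have a crystal", "have a crystal that", "that has a two fold axis"]
    print(merge_streams(parts))
```

With the Figure 2 fragments as input, the sketch reconstructs the combined sentence shown in the last line of the figure. A greedy merge like this breaks down when workers misspell words or when segments arrive far out of order, which is one reason an alignment-based method is used in practice.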
Crowd captioning offers several potential benefits over existing approaches. First, it is potentially much cheaper than hiring a professional captionist because non-expert captionists do not need extensive training to acquire a specific skill set, and thus may be drawn from a variety of sources, e.g., classmates, audience members, microtask marketplaces, volunteers, or affordable and readily available employees. Our workforce can be very large because, for people who can hear, speech recognition is relatively easy and most people can type accurately. The problem is that individually they cannot type quickly enough to keep up with natural speaking rates, and crowd captioning nicely remedies this problem. Recent work has demonstrated that small crowds can be recruited quickly on demand (in less than 2 seconds) receive a transcript of a short sound sequence in a few minutes, but is not able to produce verbatim captions over long periods of time [17].
In previous work, we developed a crowd captioning system that accepts real-time transcription from multiple non-experts, as shown in Figure 2. While non-experts cannot type as quickly as the natural speaking rate, we have found that they can provide accurate partial captions. Our system recruits fellow students with no training and compensates for slower typing speed and lower accuracy by combining the efforts of multiple captionists simultaneously and merging these partial captions in real time. We have shown that groups of non-experts can achieve more timely captions than a professional captionist, that we can encourage them to focus on specific portions of the speech to improve global coverage, and that it is possible to recombine partial captions and effectively trade off coverage and precision [15].
2.4 Real-time text reading versus listening
Most people only see real-time text on TV at the bar or gym in the form of closed captions, which tend to have noticeable errors. However, those programs are captioned by live captionists or stenographers. To reduce errors, these real-time transcripts are often corrected and made into a permanent part of the video file by off-line captionists who prepare captions from pre-recorded videotapes and thoroughly review the work for errors before airing.
The translation of speech to text is not direct, but rather is interpreted and changed in the course of each utterance. Markers like accent, tone, and timbre are stripped out and represented by standardized written words and symbols. Then the reader interprets these words and their flow to make meanings for themselves. Captionists tend not to include all spoken information so that readers can keep up with the transcript. Captionists are encouraged to alter the original transcription to provide time for the readers to completely read the caption and to synchronize with the audio. This is needed because, for a non-orthographic language like English, the length of a spoken utterance is not necessarily proportional to the length of a spelled word. In other words, reading speed is not the same as listening speed, especially for real-time scrolling text, as opposed to static pre-prepared text. For static text, reading speed has been measured at 291 wpm [19]. By contrast, the average caption rate for TV programs is 141 wpm [11], while the most comfortable reading rate for hearing, hard-of-hearing, and deaf adults is around 145 wpm [10]. The reason is that the task of viewing real-time captions involves different processing demands in visual location and tracking of moving text on a dynamic background. English literacy rates among deaf and hard of hearing people are low compared to hearing peers. Captioning research has shown that both rate and text reduction and viewer reading ability are important factors, and that captions need to be provided within 5 seconds so that the reader can participate [20].
The number of spoken words and their complexity can also influence the captioning decision on the number of words to transcribe and the degree of summarization to include so as to reduce the reader’s total cognitive load. Jensema et al. [10] analyzed a large sample of captioned TV programs and found that the total set had around 800K words consisting of 16,000 unique words. Furthermore, over two-thirds of the transcript words consisted of 250 words. Higher education lecture transcripts have a very different profile. For comparison purposes, we selected a 50-minute-long clip from the MIT OpenCourseWare (OCW) website¹. The audio sample was picked from a lecture segment in which the speech was relatively clear. We chose this lecture because it combined both technical and non-technical components. We found that the lecture had 9137 words, of which 1428 were unique, at 182.7 wpm. Furthermore, over two-thirds of the transcript consisted of around 500 words, which is double the size of the captioned TV word set.
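A vocabulary profile of this kind can be computed from any plain-text transcript. The sketch below is an illustration under stated assumptions, not the procedure used in the study: the tokenizer (lowercased alphabetic words with an optional internal apostrophe) and the function names `vocabulary_profile` and `duration_minutes` are hypothetical. It counts total and unique words and finds the smallest set of most frequent words that covers a given share of the transcript. It also checks the reported figures for consistency: 9137 words at 182.7 wpm corresponds to roughly 50 minutes of speech.

```python
import re
from collections import Counter


def vocabulary_profile(text, coverage=2 / 3):
    """Return total words, unique words, and the size of the core vocabulary.

    The core vocabulary is the smallest number of most frequent words whose
    occurrences together account for at least `coverage` of all words.
    """
    # Assumed tokenization: lowercase words, allowing forms such as "don't".
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(words)
    total = len(words)
    covered = 0
    core = 0
    for _, count in counts.most_common():
        if covered >= coverage * total:
            break
        covered += count
        core += 1
    return {"total": total, "unique": len(counts), "core_vocabulary": core}


def duration_minutes(total_words, words_per_minute):
    """Speech duration implied by a word count and a speaking rate."""
    return total_words / words_per_minute


if __name__ == "__main__":
    print(vocabulary_profile("the cat and the dog and the bird"))
    print(round(duration_minutes(9137, 182.7), 1))
```

Different tokenization choices (for example, how numbers or hyphenated terms are handled) would shift the counts somewhat, so figures produced this way are only comparable when the same rules are applied to every transcript.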
3. EVALUATION
To evaluate the efficacy of crowd-sourced real-time transcripts, we compared deaf and hearing users’ perceptions of the usability of crowd-sourced real-time transcripts against Computer Aided Real-Time transcripts (CART) and Automatic Speech Recognition transcripts (ASR).
3.1 Design Criteria
Based on prior work as well as our own observations and experiences, we have developed the following design criteria for effective real-time transcript presentation for deaf and hard of hearing students:
1. The transcript must have enough information to be understood by the viewer.
2. The transcript must not be too fast or too slow, so that it can be comfortably read.
3. Reading must not require substantial backtracking.
3.2 Transcript Generation
We obtained three transcriptions of an OCW lecture using crowd captioners, a professional captioner and automatic speech recognition software, and generated three transcripts of the lecture.
A professional real-time stenographer captionist, who charged $200 an hour, created a professional real-time transcript of the lecture. The captioner listened to the audio and transcribed in real time. The mean typing speed was about 180 wpm with a latency of 4.2 seconds. We calculated latency by averaging the latency of all matched words.
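The two measures reported for each transcript, speed in words per minute and mean latency over matched words, can be sketched as follows. This is a minimal illustration rather than the study's actual tooling: how caption words were matched to spoken words is not described here, so the sketch assumes the matching has already been done and each matched word is given as a hypothetical pair of timestamps in seconds (when it was spoken, when it appeared in the caption).

```python
def mean_latency(matched_words):
    """Average delay in seconds between a word being spoken and being shown.

    `matched_words` is a list of (spoken_at, shown_at) pairs in seconds; only
    words that were matched between the audio and the caption are included.
    """
    if not matched_words:
        raise ValueError("at least one matched word is required")
    delays = [shown_at - spoken_at for spoken_at, shown_at in matched_words]
    return sum(delays) / len(delays)


def words_per_minute(word_count, duration_seconds):
    """Transcript speed given a word count and the elapsed time in seconds."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return word_count * 60.0 / duration_seconds


if __name__ == "__main__":
    # Hypothetical timestamps for three matched words.
    sample = [(0.0, 4.0), (1.0, 5.5), (2.0, 6.0)]
    print(mean_latency(sample))
    print(words_per_minute(360, 120))
```

Because unmatched words (omissions or recognition errors) are excluded, a transcript with low coverage can still show a low mean latency, so latency is best read alongside the speed and coverage figures.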
We recruited 20 undergraduate students to act as non-expert captionists for our crowd captioning system. These students had no special training or previous formal experience transcribing audio. Participants then provided partial captions for the lecture audio. The final transcript speed was about 130 WPM, with a latency of 3.87 seconds.
In addition to these two transcripts, we generated a transcript using automatic speech recognition (ASR) with Nuance Dragon NaturallySpeaking 11 software. We used an untrained profile to simulate our target context of students transcribing speech from new or multiple speakers. To conduct this test, the audio files were played and redirected to Dragon. We used a software loop to redirect the audio signal without resampling using SoundFlower², and a custom program to record the time when each word was generated by the ASR. The ASR transcript speed was 71.0 wpm (SD=23.7) with a latency of 7.9 seconds.
3.3 Transcript Evaluation
¹ http://ocw.mit.edu/ ²
Figure 3: The transcript viewing experience.
We recruited 48 students over two weeks to participate in the study, and recruited evenly among deaf and hearing students, male and female. Twenty-one of them were deaf, four of them were hard of hearing and the remainder, twenty-four, were hearing. There were 21 females and 27 males, which reflects the gender balance on campus. Their ages ranged from 18 to 29 and all were students at RIT, ranging from first-year undergraduates to graduate students. We recruited through flyers and word of mouth on the campus. We asked students to contact us and schedule an appointment through email. All students were reimbursed for their participation. All deaf participants were asked if they used visual accommodations for their classes, and all of them answered affirmatively.
Testing was conducted in a quiet room with a 22-inch flat-screen monitor as shown in Figure 3. Each person was directed to an online web page that explained the purpose of the study. Next, the students were asked to complete a short demographic questionnaire in order to determine eligibility for the test and were asked for informed consent. Then they were asked to view a short 30-second introductory video to familiarize themselves with the process of viewing transcripts. Then the students were asked to watch a series of transcripts on the same lecture, each lasting two minutes. The clips were labeled Transcript 1, 2 and 3, and were presented in a randomized order without any accompanying audio. The total time for the study was about 15 minutes.
After the participant completed watching all three video clips of the real-time transcripts, they were asked to complete a questionnaire. The questionnaire asked three questions. The first question asked “How easy was it to follow transcript 1?”. In response to the question, the participants were presented with a Likert scale that ranged from 1 through 5, with 1 being “Very hard” and 5 being “Very easy”. The second question asked “How easy was it to follow transcript 2?”, and the third question asked “How easy was it to follow transcript 3?”. For each of these questions, participants were prompted to answer using the same Likert scale as in question 1. Then participants were asked to answer, in their own words, three questions that asked for their thoughts about following the lecture through the transcripts. The answers were open ended and many participants gave wonderful feedback. The first video transcript contained the captions created by the stenographer. The second video transcript contained the captions created by the automatic speech recognition software, in this case Dragon NaturallySpeaking v. 11. The third and final video transcript contained the captions created by the crowd captioning process.
Figure 4: A comparison of the flow for each transcript. Both CART and crowd captions exhibit a relatively smooth real-time text flow. Students prefer this flow over the more choppy ASR transcript flow.
4. DISCUSSION
For the user preference questions, there was a significant difference in the Likert score distributions between Transcripts 1 and 2 and between Transcripts 2 and 3. In general, participants found it hard to follow Transcript 2 (automatic speech recognition); the median rating for it was a 1, i.e., “Very hard”. The qualitative comments indicated that many of them thought the transcript was too choppy and had too much latency. In contrast, participants found it easier to follow either Transcript 1 (professional captions) or 3 (crowd captions). Overall, both deaf and hearing students had similar preference ratings for both crowd captions and professional captions (CART), in the absence of audio. While the overall rating for crowd captions was slightly higher at 3.15 (SD=1.06) than for professional captions (CART) at 3.08 (SD=1.24), the differences were not statistically significant (χ² = 32.52, p < 0.001). There was a greater variation in preference ratings for professional captions than for crowd captions. When we divided the students into deaf and hearing subgroups and looked at their Likert preference ratings, there was no significant difference between crowd captions and professional captions for deaf students (χ² = 25.44, p < 0.001). Hearing students as a whole showed a significant difference between crowd captions and professional captions (χ² = 19.56, p = 0.07).
Figure 5: A graph of the latencies for each transcript (professional, automatic speech recognition and crowd). CART and crowd captions have reasonable latencies of less than 5 seconds, which allows students to keep up with class lectures, but not consistently participate in class questions and answers, or other interactive class discussion.
The qualitative comments from hearing students revealed that transcript flow as shown in Figure 4, latency as shown in Figure 5, and speed were significant factors in their preference ratings. For example, one hearing student had the following comment for the professionally captioned real-time transcript: “The words did not always seem to form coherent sentences and the topics seemed to change suddenly as if there was no transition from one topic to the next. This made it hard to understand so I had to try and reread it quickly”. In contrast, for crowd captioning, the same student commented: “I feel this was simpler to read mainly because the words even though some not spelled correctly or grammatically correct in English were fairly simple to follow. I was able to read the sentences about there being two sub-trees, the left and the right and that there are two halves of the algorithm attempted to be explained. The word order was more logical to me so I didn’t need to try and reread it”. On the other hand, for the professional captions, a deaf student commented: “It was typing slowly so I get distracted and I looked repeatedly from the beginning”; and for crowd captions, the deaf student commented: “It can be confusing so slow respsone on typing, so i get distracted on other paragraphs just to keep myself focused”.
Overall, hearing participants appeared to like the slower and more smooth-flowing crowd transcript rather than the faster and less smooth captions. Deaf participants appear to accept all transcripts. It may be that the deaf students are more used to bad and distorted input and more easily skip or tolerate errors by picking out key words, but this or any other explanation requires further research. These considerations would seem to be particularly important in educational contexts where material may be captioned with the intention of making curriculum-based information available to learners.
A review of the literature on captioning comprehension and readability shows this result is consistent with findings from Burnham et al. [5], who found that there was no reduction in comprehension with text reduction for deaf adults, whether good or poor at reading. The same study also found that slower caption rates tended to assist comprehension of more proficient readers, but this was not the case for less proficient readers. This may explain why hearing students significantly preferred crowd captions over professional captions, whereas deaf students did not show any significant preference for crowd captions over professional captions. Since deaf students on average have a wider range of reading skills, it appears that slower captions do not help the less proficient readers in this group. Based on the qualitative comments, it appears that these students preferred to have a smoother word flow and to keep latency low rather than to slow down the real-time text. In fact, many of the less proficient readers commented that the captions were too slow. We hypothesize that these students, who tend to use interpreters rather than real-time captions, are focusing on key words and ignoring the rest of the text.
5. CONCLUSIONS
Likert ratings showed that hearing students rated crowd captions at or higher than professional captions, while deaf students rated both equally. A summary of qualitative comments on crowd captions suggests that these transcripts are presented at a readable pace, that phrasing and vocabulary made more sense, and that captioning flow is better than with professional captioning or Automatic Speech Recognition.
We hypothesize that this finding is attributable to two factors. The first factor is that the speaking rate typically varies from 175-275 wpm [19], which is faster than the reading rate for captions of around 100-150 wpm, especially for dense lecture material. The second factor is that the timing for listening to spoken language is different from the timing for reading written text. Speakers often pause, change rhythm or repeat themselves. The end result is that the captioning flow is as important as traditional captioning metrics such as coverage, accuracy and speed, if not more. The averaging of multiple caption streams into an aggregate stream appears to smooth the flow of text as perceived by the reader, as compared with the flow of text in professional captioning or ASR captions.
We think the crowd captionists are typing the information that is most important to them, in other words, dropping the unimportant bits, and this happens to better match the reading rate. As the captionists are working simultaneously, it can be regarded as a group vote for the most important information. A group of non-expert captionists appears to be able to collectively catch, understand and summarize at least as well as a single expert captioner. The constraint of the maximum average reading rate for real-time transcript word flow reduces the need for making a tradeoff between coverage and speed; beyond a speed of about 140 words per minute [10], coverage and flow become more important. In other words, assuming a limiting reading rate (especially for dense lecture information), the comments show that students prefer condensed material so that they can maintain reading speed and flow to keep up with the instructor.
One of the key advantages of using human captionists instead of ASR is the type of errors generated by the system when it fails to correctly identify a word. Instead of random text, humans are capable of inferring meaning from the context of the speech. We anticipate this will make quick-Caption more usable than automated systems even in cases where there may be minimal difference in measures such as accuracy and coverage.
We propose a new crowd captioning approach that recruits classmates and others to transcribe and share classroom lectures. Classmates are likely to be more familiar with the topic being discussed, and to be used to the speaker’s style. We show that readers prefer this approach. This approach is less expensive and is more inclusive, scalable, flexible and easier to deploy than traditional captioning, especially when used with mobile devices. This approach can scale in terms of classmates and vocabulary, and can enable efficient retrieval and viewing on a wide range of devices. The crowd captioning transcript, as an average of multiple streams from all captionists, is likely to be more consistent and have less surprise than that of any single captionist, and to have less delay, all of which reduce the likelihood of information loss by the reader. This approach can be viewed as parallel note-taking that benefits all students, who get a high-coverage, high-quality reviewable transcript that none of them could normally type on their own.
We have introduced the idea of real-time non-expert captioning, and demonstrated through coverage experiments that this is a promising direction for future research. We show that deaf and hearing students alike prefer crowd captions over ASR because the students find the errors easier to backtrack on and correct in real time. Most people cannot tolerate an error rate of 10% or more, as errors can completely change the meaning of the text. Human operators who correct the errors on the fly make these systems more viable, opening the field to operators with far less expertise and the ability to format, add punctuation, and indicate speaker changes. Until the time ASR becomes a mature technology that can handle all kinds of speech and environments, human assistance in captioning will continue to be an essential ingredient in speech transcription.
We also notice that crowd captions appear to have more accurate technical vocabulary than either ASR or professional captions. Crowd captioning outperforms ASR in many real settings. Non-expert real-time captioning has not yet replaced, and might never replace, professional captionists or ASR, but it shows a lot of promise. The reason is that a single captioner cannot optimize their dictionary fully, as they have to adapt to various teachers, lecture content and their context. Classmates are much better positioned to adapt to all of these, and to fully optimize their typing, spelling, and flow. Crowd captioning enables the software and users to effectively adapt to a variety of environments that a single captionist and dictionary cannot handle.
One common thread among the feedback comments was that deaf participants are not homogeneous, and that there is no neat unifying learning-style abstraction. Lesson complexity, learning curves, expectations, anxiety, trust and suspicions can all affect learning experiences and, indirectly, the satisfaction with and rating of transcripts.
6. FUTURE WORK
From the perspective of a reader viewing a real-time transcript, not all errors are equally important, and human perceptual errors of the dialog are much easier for users to understand and adapt to than ASR errors. Also, unlike ASR, crowd captioning can handle poor dialog audio or untrained speech, e.g. multiple speakers, meetings, panels, audience questions. Using this knowledge, we hope to be able to encourage crowd captioning workers to leverage their understanding of the context that content is spoken in to capture the segments with the highest information content.
Non-expert captionists and ASR make different types of errors. Specifically, humans generally type words that actually appear in the audio, but miss many words. Automatic speech recognition often misunderstands which word was spoken, but generally gets the number of words spoken nearly correct. One approach may be to use ASR as a stable underlying signal for real-time transcription, and use non-expert transcription to replace incorrect words. This may be particularly useful when transcribing speech that contains jargon terms. A non-expert captionist could type as many of these terms as possible, and could fit them into the transcription provided by ASR where appropriate.
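
As a minimal sketch of this idea, and not the system evaluated in this paper, the following Python fragment aligns a partial crowd caption against an ASR word sequence using the standard library's difflib.SequenceMatcher. The function name merge_asr_with_crowd and the rule applied to each aligned span are illustrative assumptions: where the two streams disagree the crowd's words are trusted, and where the crowd typed nothing the ASR words are kept.

```python
from difflib import SequenceMatcher


def merge_asr_with_crowd(asr_words, crowd_words):
    """Use ASR as the backbone and let crowd-typed words override it.

    Hypothetical helper for illustration only; both inputs are lists of
    lowercase word strings covering the same stretch of speech.
    """
    matcher = SequenceMatcher(a=asr_words, b=crowd_words, autojunk=False)
    merged = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag in ("equal", "delete"):
            # Both agree, or the crowd skipped these words: keep ASR output.
            merged.extend(asr_words[a_start:a_end])
        else:
            # "replace" or "insert": assume the human-typed words are right.
            merged.extend(crowd_words[b_start:b_end])
    return merged


asr = "the icon value of the matrix is positive".split()
crowd = "the eigenvalue of matrix positive".split()
print(" ".join(merge_asr_with_crowd(asr, crowd)))
```

In this example the crowd caption is missing "the" and "is" but supplies the jargon term that ASR split into two wrong words. A known weakness of this word-only alignment is that ASR words skipped by the crowd are lost when they sit directly next to a disagreement; a deployed system would probably also need timing information to align the streams.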
ASR systems usually cannot provide a reliable confidence level of their own accuracy. On the other hand, the crowd usually has a better sense of their own accuracy. One approach to leverage this would be to provide an indication of the confidence the system has in recognition accuracy. This could be done in many ways, for example through colors. This would enable the users to pick their own confidence threshold.
It would be useful to add automatic speech recognition as a complementary source of captions because its errors are generally independent of those of non-expert captionists. This difference means that matching captions input by captionists and ASR can likely be used with high confidence, even in the absence of many layers of redundant captionists or ASR systems. Future work also seeks to integrate multiple sources of evidence, such as N-gram frequency data, into a probabilistic framework for transcription and ordering. Estimates of worker latency or quality can also be used to weight the input of multiple contributors in order to reduce the amount of erroneous input from lazy or malicious contributors, while not penalizing good ones. This is especially important if crowd services such as Amazon’s Mechanical Turk are to be used to support these systems in the future. The models currently used to align and merge sets of partial captions from contributors are in their infancy, and will improve as more work is done in this area. As crowd captioning improves, students can begin to rely more on readable captions being made available at any time for any speaker.
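
One simple way to combine contributor weighting with a reader-chosen confidence threshold is a weighted vote per word position. The sketch below is an illustration under stated assumptions, not the merging model used in our system: it assumes the candidate words have already been aligned into slots, that each contributor (human or ASR) carries a hypothetical quality weight between 0 and 1, and that low-confidence words are marked with brackets where a real interface might use color. The names weighted_vote and render are invented for this example.

```python
from collections import defaultdict


def weighted_vote(candidates):
    """Pick the word with the largest total weight for one aligned slot.

    candidates: list of (word, weight) pairs, one per contributor.
    Returns (word, confidence), where confidence is the winner's share
    of the total weight in this slot.
    """
    totals = defaultdict(float)
    for word, weight in candidates:
        totals[word.lower()] += weight
    total_weight = sum(totals.values())
    if total_weight == 0:
        return None, 0.0
    best = max(totals, key=totals.get)
    return best, totals[best] / total_weight


def render(slots, threshold):
    """Join the winning words, flagging those below the reader's threshold."""
    shown = []
    for candidates in slots:
        word, confidence = weighted_vote(candidates)
        if word is None:
            continue
        shown.append(word if confidence >= threshold else f"[{word}?]")
    return " ".join(shown)


# Two slots; weights are made-up quality estimates for three contributors.
slots = [
    [("gradient", 0.9), ("gradient", 0.6), ("radiant", 0.3)],
    [("descent", 0.9), ("decent", 0.8)],
]
print(render(slots, threshold=0.6))
```

Because agreement between independent sources adds their weights together, a word typed by a captionist and also recognized by ASR rises in confidence without needing many redundant contributors, while input from a low-weight contributor is outvoted rather than discarded outright.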
The benefits of captioning by local or remote workers presented in this paper aim to further motivate the use of crowd captioning. We imagine a deaf or hard of hearing person eventually being able to capture speech with her cell phone anywhere and have captions returned to her within a few seconds. She may use this to follow along in a lecture for which a professional captionist was not requested, to participate in informal conversation with peers after class, or to enjoy a movie or other live event that lacks closed captioning. These use cases are currently beyond the scope of ASR, and their serendipitous nature precludes pre-arranging a professional captionist. Moreover, ASR and professional captioning systems do not have a consistent way of adding appropriate punctuation from lecture speech in real time, resulting in captions that are very difficult to read and understand [9,16].
A challenge in developing new methods for real-time captioning is that the usability and readability of real-time captions are difficult to quantify: they depend on much more than just Word Error Rate, involving at a minimum naturalness of errors, regularity, latency and flow. These concepts are difficult to capture automatically, which makes it difficult to make reliable comparisons across different approaches. Designing metrics that can be universally applied will improve our ability to make progress in systems for real-time captioning.
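
To make the limitation of Word Error Rate concrete, the sketch below computes it with the standard definition, WER = (S + D + I) / N, where S, D and I are the substituted, deleted and inserted words in a minimum-cost alignment and N is the number of words in the reference. The function name and the example sentences are invented for illustration and are not taken from our study data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words (one row of the dynamic program).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


reference = "solve the differential equation"
dropped_word = "solve differential equation"      # omission of a function word
wrong_word = "salve the differential equation"    # substitution of a content word
print(word_error_rate(reference, dropped_word))
print(word_error_rate(reference, wrong_word))
```

Both hypotheses contain exactly one error in four words and therefore receive the same score, although a reader is likely to recover from the dropped function word far more easily than from the substituted one. This is the kind of difference that a metric based only on error counts cannot express.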
7. ACKNOWLEDGMENTS
We thank our participants for their time and feedback in evaluating the captions, and the real-time captionists for their work in making the lecture accessible to deaf and hard of hearing students.
8. REFERENCES
[1] FAQ about CART (real-time captioning), 2011. http://www.ccacaptioning.org/articles-resources/faq.
[2] Y. C. Beatrice Liem, Haoqi Zhang. An iterative dual pathway structure for speech-to-text transcription. In Proceedings of the 3rd Workshop on Human Computation (HCOMP ’11), HCOMP ’11, 2011.
[3] M. S. Bernstein, J. R. Brandt, R. C. Miller, and D. R. Karger. Crowds in two seconds: Enabling realtime crowd-powered interfaces. In Proceedings of the 24th annual ACM symposium on User interface software and technology, UIST ’11, page to appear, New York, NY, USA, 2011. ACM.
[4] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, and T. Yeh. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, UIST ’10, pages 333–342, New York, NY, USA, 2010. ACM.
[5] D. Burnham, G. Leigh, W. Noble, C. Jones, M. Tyler, L. Grebennikov, and A. Varley. Parameters in television captioning for deaf and hard-of-hearing adults: Effects of caption rate versus text reduction on comprehension. Journal of Deaf Studies and Deaf Education, 13(3):391–404, 2008.
[6] X. Cui, L. Gu, B. Xiang, W. Zhang, and Y. Gao. Developing high performance ASR in the IBM multilingual speech-to-speech translation system. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5121–5124, 31 2008-april 4 2008.
[7] L. B. Elliot, M. S. Stinson, D. Easton, and J. Bourgeois. College Students Learning With C-Print’s Education Software and Automatic Speech Recognition. In American Educational Research Association Annual Meeting, New York, NY, 2008.
[8] M. B. Fifield. Realtime remote online captioning: An effective accommodation for rural schools and colleges. In Instructional Technology And Education of the Deaf Symposium, 2001.
[9] A. Gravano, M. Jansche, and M. Bacchiani. Restoring punctuation and capitalization in transcribed speech. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 4741–4744, april 2009.
[10] C. Jensema. Closed-captioned television presentation speed and vocabulary. American Annals of the Deaf, 141(4):284–292, 1996.
[11] C. J. Jensema, R. Danturthi, and R. Burch. Time spent viewing captions on television programs. American Annals of the Deaf, 145(5):464–468, 2000.
[12] R. Kheir and T. Way. Inclusion of deaf students in computer science classes using real-time speech transcription. In Proceedings of the 12th annual SIGCSE conference on Innovation and technology in computer science education, ITiCSE ’07, pages 261–265, New York, NY, USA, 2007. ACM.
[13] W. Lasecki, K. Murray, S. White, R. C. Miller, and J. P. Bigham. Real-time crowd control of existing interfaces. In Proceedings of the 24th annual ACM symposium on User interface software and technology, UIST ’11, page to appear, New York, NY, USA, 2011. ACM.
[14] W. S. Lasecki and J. P. Bigham. Online quality control for real-time captioning. In Proceedings of the 14th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’12, 2012.
[15] W. S. Lasecki, C. Miller, A. Sadilek, A. Abumoussa, D. Borrello, R. Kushalnagar, and J. P. Bigham. Realtime captioning by groups of nonexperts. In Proceedings of the 25th ACM UIST Symposium, UIST ’12, 2012.
[16] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. Audio, Speech, and Language Processing, IEEE Transactions on, 14(5):1526–1540, sept. 2006.
[17] T. Matthews, S. Carter, C. Pai, J. Fong, and J. Mankoff. In Proceeding of the 8th International Conference on Ubiquitous Computing, pages 159–176, Berlin, 2006. Springer-Verlag.
[18] R. E. Mitchell. How many deaf people are there in the United States? Estimates from the Survey of Income and Program Participation. Journal of deaf studies and deaf education, 11(1):112–9, Jan. 2006.
[19] S. J. Samuels and P. R. Dahl. Establishing appropriate purpose for reading and its effect on flexibility of reading rate. Journal of Educational Psychology, 67(1):38–43, 1975.
[20] M. Wald. Using automatic speech recognition to enhance education for all students: Turning a vision into reality. In Frontiers in Education, 2005. FIE ’05. Proceedings 35th Annual Conference, page S3G, oct. 2005.
[21] M. Wald. Creating accessible educational multimedia through editing automatic speech recognition captioning in real time. Interactive Technology and Smart Education, 3(2):131–141, 2006.