A Readability Evaluation of Real-Time Crowd Captions in the Classroom


Raja S. Kushalnagar, Walter S. Lasecki†, Jeffrey P. Bigham†

Department of Information and Computing Studies, Rochester Institute of Technology, 1 Lomb Memorial Dr, Rochester, NY 14623

†Department of Computer Science, University of Rochester, 160 Trustee Rd, Rochester, NY 14627

[email protected] {wlasecki,jbigham}@cs.rochester.edu

ABSTRACT

Deaf and hard of hearing individuals need accommodations that transform aural to visual information, such as transcripts generated in real-time to enhance their access to spoken information in lectures and other live events. Professional captionists' transcripts work well in general events such as community, administrative or legal meetings, but are often perceived as not readable enough in specialized content events such as higher education classrooms. Professional captionists with experience in specialized content areas are scarce and expensive. Commercial automatic speech recognition (ASR) software transcripts are far cheaper, but are often perceived as unreadable due to ASR's sensitivity to accents, background noise and slow response time. We evaluate the readability of a new crowd captioning approach in which captions are typed collaboratively by classmates into a system that aligns and merges the multiple incomplete caption streams into a single, comprehensive real-time transcript. Our study asked 48 deaf and hearing readers to evaluate transcripts produced by a professional captionist, automatic speech recognition software and crowd captioning software respectively, and found the readers preferred crowd captions over professional captions and ASR.

1. INTRODUCTION

Deaf and hard of hearing (DHH) individuals typically cannot understand audio alone, and access the audio through accommodations that translate the auditory information to visual information. The most common accommodations are real-time transcription or sign language translation of the audio.

Because deafness is a low incidence disability, deaf and hard of hearing individuals are evenly and thinly spread [18]. As a result, many DHH individuals tend to be located far from major population centers and find it hard to obtain accommodation providers, especially those who can handle situations that require specialized content knowledge. These providers prefer to live close to areas where they can obtain enough demand to provide services. If there is not enough demand for providers in the area, there is a catch-22 for the DHH students and institutions. Therefore, for many institutions, in terms of content knowledge, availability and cost, it is best to use accommodation services centered on the student, such as classmates or on-demand remote workers.

This paper analyzes the readability of a new student-centered approach to real-time captioning in which multiple classmates simultaneously caption speech in real-time. Although classmates cannot type as quickly as the natural speaking rate of most speakers, we have found that they can provide accurate partial captions. We align and merge the multiple incomplete caption streams into a single, comprehensive real-time transcript. We compare deaf and hearing students' evaluation of the effectiveness and usability of this crowd-sourced real-time transcript against transcripts produced by professional captionists and automatic speech recognition software respectively.

2. BACKGROUND

Equal access to communication is fundamental to students' academic success, but is often taken for granted. In mainstream environments where deaf, hard-of-hearing, and hearing students study and attend classes together, people tend to assume that captioners or interpreters enable full communication between deaf and hearing people in the class. This assumption is especially detrimental as it does not address other information accessibility issues such as translation delays that impact interaction and readability that impacts comprehension.

There are two popular approaches to generating real-time captions that attempt to convey every spoken word in the classroom: professional captioning and automatic speech recognition (ASR). Both professional captioning and ASR provide a real-time word-for-word display of what is said in class, as well as options for saving the text after class for study. We discuss the readability of these approaches and a new approach, which utilizes crowdsourcing to generate real-time captions.

Figure 1: Professional Real-Time Captioning using a stenograph. (a) A stenograph keyboard that shows its phonetic-based keys. (b) A stenographer's typical Words Per Minute (WPM) limit and range.

2.1 Professional Captioning

The most widely used approach, Communications Access Real Time (CART), is generated by professional captionists who use shorthand software to generate captions that can keep up with natural speaking rates. Although popular, professional captioners undergo years of training, which results in professional captioning services being expensive. Furthermore, captionists usually have inadequate content knowledge and dictionaries to handle higher education lectures in specific fields. CART is the most reliable transcription service, but is also the most expensive one. Trained stenographers type in shorthand on a stenographic (shorthand writing system) keyboard as shown in Figure 1. This keyboard maps multiple key presses to phonemes that are expanded to verbatim full text. Stenography requires 2-3 years of training to achieve at least 225 words per minute (WPM) and up to the 300 WPM that is needed to consistently transcribe all real-time speech, which helps to explain the current cost of more than $100 an hour. CART stenographers need only to recognize and type in the phonemes to create the transcript, which enables them to type fast enough to keep up with the natural speaking rate. But the software translation of phonemes to words requires a dictionary that already contains the words used in the lecture; typing new words into the dictionary slows down the transcription speed considerably. The stenographer can transcribe speech even if the words or phonemes do not make sense to them, e.g., if the spoken words appear to violate rules of grammar, pronunciation, or logic. If the captioner cannot understand the phoneme or word at all, then they cannot transcribe it.

In response to the high costs of CART, computer-based macro expansion services like C-Print were developed and introduced. C-Print is a type of nearly-realtime transcription that was developed at the National Technical Institute for the Deaf. The captionist balances the tradeoff between typing speed and summarization by including as much information as possible, generally providing a meaning-for-meaning but not verbatim translation of the spoken English content. This system enables operators who are trained in academic situations to consolidate and better organize the text with the goal of creating an end result more like class notes that may be more conducive to learning. C-Print captionists need less training, and generally charge around $60 an hour. As the captionist normally cannot type as fast as the natural speaking rate, they are not able to produce a verbatim real-time transcript. Also, the captionist can only effectively convey classroom content if they understand that content themselves. The advantage is that the C-Print transcript accuracy and readability is high [21], but the disadvantage of this approach is that the transcript shows a summary that is based on the captionist's understanding of the material, which may be different from the speaker's or reader's understanding of the material.

There are several captioning challenges in higher education. The first challenge is content knowledge: lecture information is dense and contains specialized vocabulary. This makes it hard to identify and schedule captionists who are both skilled in typing and have the appropriate content knowledge. Another captioning issue involves transcription delay, which occurs when captionists have to understand the phonemes or words and then type in what they have recognized. As a result, captionists tend to type the material to students with a delay of several seconds. This prevents students from effectively participating in an interactive classroom. Another challenge is speaker identification, in which captionists are unfamiliar with participants and are challenged to properly identify the current speaker. They can simplify this by recognizing the speaker by name, or asking the speaker to pause before beginning until the captionist has caught up and had an opportunity to identify the new speaker. In terms of availability, captionists typically are not available to transcribe live speech or dialogue for short periods or on-demand. Professional captionists usually need at least a few hours advance notice, and prefer to work in 1-hour increments so as to account for their commute times. As a result, students cannot easily decide at the last minute to attend a lecture or after-class interactions with peers and teachers. Captionists used to need to be physically present at the event they were transcribing, but captioning services are increasingly being offered remotely [12, 1]. Captionists often are simply not available for many technical fields [21, 8]. Remote captioning offers the potential to recruit captionists familiar with a particular subject (e.g., organic chemistry) even if the captionist is located far away from an event. Selecting for expertise further reduces the pool of captionists. A final challenge is their cost: professional captionists are highly trained to keep up with speech with low error rates, and so are highly paid. Experienced verbatim captionists' pay can exceed $200 an hour, and newly trained summariza


2.2 Automatic Speech Recognition

ASR platforms typically use probabilistic approaches to translate speech to text. These platforms face challenges in accurately capturing modern classroom lectures that can have one or more of the following problems: extensive technical vocabulary, poor acoustic quality, multiple information sources, speaker accents, or other problems. They also impose a processing delay of several seconds, and the delay lengthens as the amount of data to be analyzed gets bigger. In other words, ASR works well under ideal situations, but degrades quickly in many real settings. Kheir et al. [12] found that untrained ASR software had a 75% accuracy rate, which could rise to 90% with training under ideal single-speaker conditions, but this accuracy rate was still too low for use by deaf students. In the best possible case, in which the speaker has trained the ASR and wears a high-quality, noise-canceling microphone, the accuracy can be above 90%. When recording a speaker using a standard microphone on ASR not trained for the speaker, accuracy rates plummet to far below 50%. Additionally, the errors made by ASR often change the meaning of the text, whereas we have found non-expert captionists are much more likely to simply omit words or make spelling errors. In Figure 2 for instance, the ASR changes ‘two fold axis’ to ‘twenty four lexus’, whereas the typists typically omit words they do not understand or make spelling errors. Current ASR is speaker-dependent, has difficulty recognizing domain-specific jargon, and adapts poorly to vocal changes, such as when the speaker is sick [6, 7]. ASR systems generally need substantial computing power and high-quality audio to work well, which means systems can be difficult to transport. They are also ill-equipped to recognize and convey tone, attitudes, interest and emphasis, and to refer to visual information such as slides or demonstrations. ASR services charge about $15-20 an hour. However, these systems are more easily integrated with other functions such as multimedia indexing.

2.3 Crowd Captions in the Classroom

Deaf and hard of hearing students have had a long history of enhancing their classroom accessibility by collaborating with classmates. For example, they often arrange to copy notes from a classmate and share them with their study group. Crowdsourcing has been applied to offline transcription with great success [2], but has just recently been used for real-time transcription [15]. Applying a collaborative captioning approach among classmates enables real-time transcription from multiple non-experts, and crowd agreement mechanisms can be utilized to vet transcript quality [14].

We imagine a deaf or hard of hearing person eventually being able to capture aural speech with her cellphone anywhere and have captions returned to her with a few seconds latency. She may use this to follow along in a lecture for which a professional captionist was not requested, to participate in informal conversation with peers after class, or enjoy a movie or other live event that lacks closed captioning. These use cases are currently beyond the scope of ASR, and their serendipitous nature precludes pre-arranging a professional captionist. Lasecki et al. have demonstrated that a modest number of people can provide reasonably high coverage over the caption stream, and introduced an algorithm that uses overlapping portions of the sequences to align and merge them using the Legion:Scribe system [15]. Scribe is based on the Legion [13] framework, which uses crowds of workers to accomplish tasks in real-time. Unlike Legion, Scribe merges responses to create a single, better response instead of selecting the best sequence from the inputs. This merger is done using an online multiple sequence alignment algorithm that aligns worker input to both reconstruct the final stream and correct errors (such as spelling mistakes) made by individual workers.

……that has a two fold axis……

……have a crystal that……

...we have a crystal……

...we have a crystal that has a two fold axis…..

Figure 2: The crowd captioning interface. The interface provides a text input box at the bottom, and shifts text up as users type (either when the text hits the end of the box, or when the user presses the enter key). To encourage users to continue typing even when making mistakes, editing of text is disabled word by word. Partial captions are forwarded to the server in real-time, which uses overlapping segments and the order in which segments are received to align and merge them.
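The idea of merging partial captions through their overlapping segments can be illustrated with a small sketch. This is not the Scribe algorithm, which performs online multiple sequence alignment and also corrects errors; it is a deliberately simplified greedy merge that assumes fragments arrive in spoken order and overlap exactly, word for word. The function names are hypothetical and chosen only for this illustration.

```python
def merge_by_overlap(merged, fragment):
    """Append `fragment` to `merged`, collapsing the longest word overlap
    between the end of `merged` and the start of `fragment`."""
    max_k = min(len(merged), len(fragment))
    for k in range(max_k, 0, -1):
        if merged[-k:] == fragment[:k]:
            return merged + fragment[k:]
    return merged + fragment  # no overlap found: simply concatenate


def merge_partial_captions(fragments):
    """Merge partial captions (strings), assumed to be ordered by arrival."""
    merged = []
    for text in fragments:
        merged = merge_by_overlap(merged, text.lower().split())
    return " ".join(merged)


# Partial captions modelled on the example shown in Figure 2.
partials = [
    "we have a crystal",
    "have a crystal that",
    "that has a two fold axis",
]
print(merge_partial_captions(partials))
```

A real system would also need to handle misspellings, fragments that arrive out of order, and fragments fully contained in earlier ones, none of which this sketch attempts.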

Crowd captioning offers several potential benefits over existing approaches. First, it is potentially much cheaper than hiring a professional captionist because non-expert captionists do not need extensive training to acquire a specific skill set, and thus may be drawn from a variety of sources, e.g. classmates, audience members, microtask marketplaces, volunteers, or affordable and readily available employees. Our workforce can be very large because, for people who can hear, speech recognition is relatively easy and most people can type accurately. The problem is that individually they cannot type quickly enough to keep up with natural speaking rates, and crowd captioning nicely remedies this problem. Recent work has demonstrated that small crowds can be recruited quickly on-demand (in less than 2 seconds)


receive a transcript of a short sound sequence in a few minutes, but is not able to produce verbatim captions over long periods of time [17].

In previous work, we developed a crowd captioning system that accepts real-time transcription from multiple non-experts as shown in Figure 2. While non-experts cannot type as quickly as the natural speaking rate, we have found that they can provide accurate partial captions. Our system recruits fellow students with no training and compensates for slower typing speed and lower accuracy by combining the efforts of multiple captionists simultaneously and merging these partial captions in real-time. We have shown that groups of non-experts can achieve more timely captions than a professional captionist, that we can encourage them to focus on specific portions of the speech to improve global coverage, and that it is possible to recombine partial captions and effectively trade off coverage and precision [15].

2.4 Real-time text reading versus listening

Most people only see real-time text on TV at the bar or gym in the form of closed captions, which tend to have noticeable errors. However, those programs are captioned by live captionists or stenographers. To reduce errors, these real-time transcripts are often corrected and made into a permanent part of the video file by off-line captionists who prepare captions from pre-recorded videotapes and thoroughly review the work for errors before airing.

The translation of speech to text is not direct, but rather is interpreted and changed in the course of each utterance. Markers like accent, tone, and timbre are stripped out and represented by standardized written words and symbols. Then the reader interprets these words and flow to make meanings for themselves. Captionists tend not to include all spoken information so that readers can keep up with the transcript. Captionists are encouraged to alter the original transcription to provide time for the readers to completely read the caption and to synchronize with the audio. This is needed because, for a non-orthographic language like English, the length of a spoken utterance is not necessarily proportional to the length of a spelled word. In other words, reading speed is not the same as listening speed, especially for real-time scrolling text, as opposed to static pre-prepared text. For static text, reading speed has been measured at 291 wpm [19]. By contrast, the average caption rate for TV programs is 141 wpm [11], while the most comfortable reading rate for hearing, hard-of-hearing, and deaf adults is around 145 wpm [10]. The reason is that the task of viewing real-time captions involves different processing demands in visual location and tracking of moving text on a dynamic background.

English literacy rates among deaf and hard of hearing people are low compared to hearing peers. Captioning research has shown that both rate and text reduction and viewer reading ability are important factors, and that captions need to be provided within 5 seconds so that the reader can participate [20].

The number of spoken words and their complexity can also influence the captioning decision on the number of words to transcribe and the degree of summarization to include so as to reduce the reader's total cognitive load. Jensema et al. [10] analyzed a large sample of captioned TV programs and found that the total set had around 800K words consisting of 16,000 unique words. Furthermore, over two-thirds of the transcript words consisted of 250 words. Higher education lecture transcripts have a very different profile. For comparison purposes, we selected a 50 minute long clip from the MIT OpenCourseWare (OCW) website¹. The audio sample was picked from a lecture segment in which the speech was relatively clear. We chose this lecture because it combined both technical and non-technical components. We found that the lecture had 9137 words, of which 1428 were unique, at 182.7 wpm. Furthermore, over two-thirds of the transcript consisted of around 500 words, which is double the size of the captioned TV word set.
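The transcript statistics quoted above (total words, unique words, words per minute, and the number of distinct words that account for two-thirds of the transcript) can be computed with a short script. The following is a minimal sketch rather than the tooling actually used for the study; the function name and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def vocabulary_profile(words, duration_minutes, coverage=2 / 3):
    """Summarize a transcript given as a list of word tokens.

    `core_vocabulary` is the number of most frequent distinct words
    needed to account for `coverage` of all word occurrences.
    """
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    covered = 0
    core_size = 0
    for _, freq in counts.most_common():
        if covered >= coverage * total:
            break
        covered += freq
        core_size += 1
    return {
        "total_words": total,
        "unique_words": len(counts),
        "wpm": total / duration_minutes,
        "core_vocabulary": core_size,
    }


# Toy input for illustration only; a real run would load the lecture transcript.
toy = "the cell divides and the cell grows and the cell rests".split()
print(vocabulary_profile(toy, duration_minutes=0.5))

# Sanity check of the rate reported above: 9137 words over 50 minutes.
print(round(9137 / 50, 1))
```

Tokenization choices (punctuation, contractions, numerals) change the unique-word count, so figures from such a script are only comparable when the same tokenizer is applied to every transcript.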

3. EVALUATION

To evaluate the efficacy of crowd-sourced real-time transcripts, we compared deaf and hearing users' evaluations of the usability of crowd-sourced real-time transcripts against Computer Aided Real-Time transcripts (CART) and Automatic Speech Recognition transcripts (ASR).

3.1 Design Criteria

Based on prior work as well as our own observations and experiences, we have developed the following design criteria for effective real-time transcript presentation for deaf and hard of hearing students:

1. The transcript must have enough information to be understood by the viewer.

2. The transcript must not be too fast or too slow so that it can be comfortably read.

3. Reading must not require substantial backtracking.

3.2 Transcript Generation

We obtained three transcripts of an OCW lecture: one from crowd captioners, one from a professional captioner, and one from automatic speech recognition software.

A professional real-time stenographer captionist, who charged $200 an hour, created a professional real-time transcript of the lecture. The captioner listened to the audio and transcribed in real-time. The mean typing speed was about 180 wpm with a latency of 4.2 seconds. We calculated latency by averaging the latency of all matched words.
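Averaging latency over matched words can be sketched as follows. The paper does not describe how words were matched, so this example assumes one plausible method: align the spoken and captioned word sequences with Python's standard `difflib.SequenceMatcher` and average the time differences of the words that line up. The function name, the input format of `(word, time_in_seconds)` pairs, and the sample timestamps are all hypothetical.

```python
from difflib import SequenceMatcher


def mean_latency(spoken, captioned):
    """Average delay between when a word was spoken and when it was shown.

    `spoken` and `captioned` are lists of (word, time_in_seconds) tuples.
    Only words that align between the two sequences contribute.
    Returns None when no words match.
    """
    spoken_words = [w.lower() for w, _ in spoken]
    caption_words = [w.lower() for w, _ in captioned]
    matcher = SequenceMatcher(None, spoken_words, caption_words, autojunk=False)
    latencies = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            t_spoken = spoken[block.a + offset][1]
            t_shown = captioned[block.b + offset][1]
            latencies.append(t_shown - t_spoken)
    if not latencies:
        return None
    return sum(latencies) / len(latencies)


# Made-up timestamps for illustration; the captioner skipped the word "a".
spoken = [("we", 0.0), ("have", 0.5), ("a", 1.0), ("crystal", 1.5)]
captioned = [("we", 4.0), ("have", 4.5), ("crystal", 6.5)]
print(mean_latency(spoken, captioned))
```

Omitted or misrecognized words simply drop out of the average under this scheme, which is why a transcript with many errors can still report a low mean latency.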

We recruited 20 undergraduate students to act as non-expert captionists for our crowd captioning system. These students had no special training or previous formal experience transcribing audio. Participants then provided partial captions for the lecture audio. The final transcript speed was about 130 WPM, with a latency of 3.87 seconds.

In addition to these two transcripts, we generated a transcript with automatic speech recognition (ASR) using Nuance Dragon Naturally Speaking 11 software. We used an untrained profile to simulate our target context of students transcribing speech from new or multiple speakers. To conduct this test, the audio files were played and redirected to Dragon. We used a software loop to redirect the audio signal without resampling using SoundFlower², and a custom program to record the time when each word was generated by the ASR. The ASR transcript speed was 71.0 wpm (SD=23.7) with a latency of 7.9 seconds.

3.3 Transcript Evaluation

¹ http://ocw.mit.edu/
²


Figure 3: The transcript viewing experience.

We recruited 48 students over two weeks to participate in the study, and recruited evenly among deaf and hearing students, male and female. Twenty-one of them were deaf, four of them were hard of hearing and the remainder, twenty-four, were hearing. There were 21 females and 27 males, which reflects the gender balance on campus. Their ages ranged from 18 to 29 and all were students at RIT, ranging from first year undergraduates to graduate students. We recruited through flyers and word of mouth on the campus. We asked students to contact us by email to schedule an appointment. All students were reimbursed for their participation. All deaf participants were asked if they used visual accommodations for their classes, and all of them answered affirmatively.

Testing was conducted in a quiet room with a 22 inch flat-screen monitor as shown in Figure 3. Each person was directed to an online web page that explained the purpose of the study. Next, the students were asked to complete a short demographic questionnaire in order to determine eligibility for the test and asked for informed consent. Then they were asked to view a short 30 second introductory video to familiarize themselves with the process of viewing transcripts. Then the students were asked to watch a series of transcripts of the same lecture, each lasting two minutes. The clips were labeled Transcript 1, 2 and 3, and were presented in a randomized order without any accompanying audio. The total time for the study was about 15 minutes.

After the participant completed watching all three video clips of the real-time transcripts, they were asked to complete a questionnaire. The questionnaire asked three questions. The first question asked “How easy was it to follow transcript 1?”. In response to the question, the participants were presented with a Likert scale that ranged from 1 through 5, with 1 being “Very hard” and 5 being “Very easy”. The second question asked “How easy was it to follow transcript 2?”. In response to this question, participants were prompted to answer using a similar Likert scale response as in question 1. The third question was “How easy was it to follow transcript 3?”. In response to this question, participants were prompted with a similar, corresponding Likert scale response to questions 1 and 2. Then participants were asked to answer in their own words three questions that asked for their thoughts about following the lecture through the transcripts. The answers were open ended and many participants gave wonderful feedback. The first video transcript contained the captions created by the stenographer. The second video transcript contained the captions created by the automatic speech recognition software, in this case, Dragon Naturally Speaking v. 11. The third and final video transcript contained the captions created by the crowd captioning process.

Figure 4: A comparison of the flow for each transcript. Both CART and crowd captions exhibit a relatively smooth real-time text flow. Students prefer this flow over the more choppy ASR transcript flow.

4. DISCUSSION

For the user preference questions, there was a significant difference in the Likert score distributions between Transcripts 1 and 2 and between Transcripts 2 and 3. In general, participants found it hard to follow Transcript 2 (automatic speech recognition); the median rating for it was a 1, i.e., “Very hard”. The qualitative comments indicated that many of them thought the transcript was too choppy and had too much latency. In contrast, participants found it easier to follow either Transcript 1 (professional captions) or 3 (crowd captions). Overall, both deaf and hearing students had similar preference ratings for both crowd captions and professional captions (CART), in the absence of audio. While the overall responses for crowd captions were slightly higher at 3.15 (SD=1.06) than for professional captions (CART) at 3.08 (SD=1.24), the differences were not statistically significant (χ² = 32.52, p < 0.001). There was a greater variation in preference ratings for professional captions than for crowd captions. When we divided the students into deaf and hearing subgroups and looked at their Likert preference ratings, there was no significant difference between crowd captions and professional captions for deaf students (χ² = 25.44, p < 0.001). Hearing students as a whole showed a significant difference between crowd captions and professional captions (χ² = 19.56, p = 0.07).

Figure 5: A graph of the latencies for each transcript (professional, automatic speech recognition and crowd). CART and Crowd Captions have reasonable latencies of less than 5 seconds, which allows students to keep up with class lectures, but not consistently participate in class questions and answers, or other interactive class discussion.
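The text does not state how the χ² values were computed. One common way to compare two Likert rating distributions is Pearson's chi-square test of independence on a table of rating counts, sketched below in plain Python. The function name is hypothetical and the counts are invented purely for illustration; they are not the study's data and the sketch makes no claim to reproduce the reported statistics.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows of observed counts, e.g. one row per
    transcript and one column per Likert rating (1 to 5).
    Every row and column total must be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical rating counts (ratings 1..5), NOT taken from the study.
professional = [5, 10, 14, 12, 7]
crowd = [3, 9, 18, 13, 5]
statistic, dof = chi_square_statistic([professional, crowd])
print(f"chi-square = {statistic:.2f}, dof = {dof}")
```

The statistic would then be compared against the chi-square distribution with `dof` degrees of freedom to obtain a p-value. Because each participant rated every transcript, a paired or rank-based test may be more appropriate than this independence test.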

The qualitative comments from hearing students revealed that transcript flow as shown in Figure 4, latency as shown in Figure 5 and speed were significant factors in their preference ratings. For example, one hearing student had the following comment for the professional captioned real-time transcript: “The words did not always seem to form coherent sentences and the topics seemed to change suddenly as if there was no transition from one topic to the next. This made it hard to understand so I had to try and reread it quickly”. In contrast, for crowd captioning, the same student commented: “I feel this was simpler to read mainly because the words even though some not spelled correctly or grammatically correct in English were fairly simple to follow. I was able to read the sentences about there being two sub-trees, the left and the right and that there are two halves of the algorithm attempted to be explained. The word order was more logical to me so I didn’t need to try and reread it”. On the other hand, for the professional captions, a deaf student commented: “It was typing slowly so I get distracted and I looked repeatedly from the beginning”; and for crowd captions, the deaf student commented: “It can be confusing so slow respsone on typing, so i get distracted on other paragraphs just to keep myself focused”.

Overall, hearing participants appeared to like the slower and more smoothly flowing crowd transcript rather than the faster and less smooth captions. Deaf participants appear to accept all transcripts. It may be that the deaf students are more used to bad and distorted input and more easily skip or tolerate errors by picking out key words, but this or any other explanation requires further research. These considerations would seem to be particularly important in educational contexts where material may be captioned with the intention of making curriculum-based information available to learners.

A review of the literature on captioning comprehension and readability shows this result is consistent with findings from Burnham et al. [5], who found that text reduction did not reduce comprehension for deaf adults, whether good or poor at reading. The same study also found that slower caption rates tended to assist comprehension of more proficient readers, but this was not the case for less proficient readers. This may explain why hearing students significantly preferred crowd captions over professional captions, whereas deaf students did not show any significant preference for crowd captions over professional captions. Since deaf students on average have a wider range of reading skills, it appears that slower captions do not help the less proficient readers in this group. Based on the qualitative comments, it appears that these students preferred to have a smoother word flow and to keep latency low rather than to slow down the real-time text. In fact, many of the less proficient readers commented that the captions were too slow. We hypothesize that these students, who tend to use interpreters rather than real-time captions, are focusing on key words and ignoring the rest of the text.

5. CONCLUSIONS

Likert ratings showed that hearing students rated crowd captions at or higher than professional captions, while deaf students rated both equally. A summary of qualitative comments on crowd captions suggests that these transcripts are presented at a readable pace, that phrasing and vocabulary made more sense, and that captioning flow is better than in professional captioning or Automatic Speech Recognition.

We hypothesize that this finding is attributable to two factors. The first factor is that the speaking rate typically varies from 175-275 wpm [19], which is faster than the reading rate for captions of around 100-150 wpm, especially for dense lecture material. The second factor is that the timing for listening to spoken language is different from the timing for reading written text. Speakers often pause, change rhythm or repeat themselves. The end result is that the captioning flow is as important as traditional captioning metrics such as coverage, accuracy and speed, if not more. The averaging of multiple caption streams into an aggregate stream appears to smooth the flow of text as perceived by the reader, as compared with the flow of text in professional captioning or ASR captions.

We think the crowd captionists are typing the information most important to them, in other words, dropping the unimportant bits, and this happens to better match the reading rate. As the captionists are working simultaneously, it can be regarded as a group vote for the most important information. A group of non-expert captionists appears to be able to collectively catch, understand and summarize as well as a single expert captioner. The constraint of the maximum average reading rate for real-time transcript word flow reduces the need for making a tradeoff between coverage and speed; beyond a speed of about 140 words per minute [10], coverage and flow become more important. In other words, assuming a limiting reading rate (especially for dense lecture information), the comments show that students prefer condensed material so that they can maintain reading speed/flow to keep up with the instructor.

One of the key advantages of using human captionists instead of ASR is the types of errors which are generated by the system when it fails to correctly identify a word. Instead of random text, humans are capable of inferring meaning,


context of the speech. We anticipate this will make quick-Caption more usable than automated systems even in cases where there may be minimal difference in measures such as accuracy and coverage.

We propose a new crowd captioning approach that recruits classmates and others to transcribe and share classroom lectures. Classmates are likely to be more familiar with the topic being discussed, and to be used to the speaker's style. We show that readers prefer this approach. This approach is less expensive and is more inclusive, scalable, flexible and easier to deploy than traditional captioning, especially when used with mobile devices. This approach can scale in terms of classmates and vocabulary, and can enable efficient retrieval and viewing on a wide range of devices. The crowd captioning transcript, as an average of multiple streams from all captionists, is likely to be more consistent and have less surprise than any single captionist, and have less delay, all of which reduce the likelihood of information loss by the reader. This approach can be viewed as parallel note-taking that benefits all students, who get a high coverage, high quality reviewable transcript that none of them could normally type on their own.

We have introduced the idea of real-time non-expert captioning, and demonstrated through coverage experiments that this is a promising direction for future research. We show that deaf and hearing students alike prefer crowd captions over ASR because the students find the errors easier to backtrack on and correct in real-time. Most people cannot tolerate an error rate of 10% or more as errors can completely change the meaning of the text. Human operators who correct the errors on-the-fly make these systems more viable, opening the field to operators with far less expertise and the ability to format, add punctuation, and indicate speaker changes. Until the time ASR becomes a mature technology that can handle all kinds of speech and environments, human assistance in captioning will continue to be an essential ingredient in speech transcription.

We also notice that crowd captions appear to have more accurate technical vocabulary than either ASR or professional captions. Crowd captioning outperforms ASR in many real settings. Non-expert real-time captioning has not yet, and might not ever, replace professional captionists or ASR, but it shows a lot of promise. The reason is that a single captioner cannot optimize their dictionary fully, as they have to adapt to various teachers, lecture content and their context. Classmates are much better positioned to adapt to all of these, and fully optimize their typing, spelling, and flow. Crowd captioning enables the software and users to effectively adapt to a variety of environments that a single captionist and dictionary cannot handle.

One common thread among the feedback comments was that deaf participants are not homogeneous, and there is no neat unifying learning style abstraction. Lesson complexity, learning curves, expectations, anxiety, trust and suspicions can all affect learning experiences and, indirectly, the satisfaction with and rating of transcripts.

6. FUTURE WORK

From the perspective of a reader viewing a real-time transcript, not all errors are equally important, and human perceptual errors of the dialog are much easier for users to understand and adapt to than ASR errors. Also, unlike ASR, crowd captioning can handle poor dialog audio or untrained speech, e.g. multiple speakers, meetings, panels, audience questions. Using this knowledge, we hope to be able to encourage crowd captioning workers to leverage their understanding of the context that content is spoken in to capture the segments with the highest information content.

Non-expert captionists and ASR make different types of errors. Specifically, humans generally type words that actually appear in the audio, but miss many words. Automatic speech recognition often misunderstands which word was spoken, but generally gets the number of words spoken nearly correct. One approach may be to use ASR as a stable underlying signal for real-time transcription, and use non-expert transcription to replace incorrect words. This may be particularly useful when transcribing speech that contains jargon terms. A non-expert captionist could type as many of these terms as possible, and could fit them into the transcription provided by ASR where appropriate.
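As a rough illustration of this idea, the sketch below keeps the ASR word sequence as the backbone and swaps in a typed jargon term wherever an ASR word is a near-miss of that term. It is only a toy under stated assumptions: the function name, the sample words, and the 0.8 similarity cutoff are hypothetical, and string similarity from Python's standard `difflib` stands in for whatever alignment a real system would use.

```python
from difflib import get_close_matches


def patch_asr_with_typed_terms(asr_words, typed_terms, cutoff=0.8):
    """Return a copy of asr_words with near-misses replaced by typed terms.

    asr_words   -- ASR output tokens, used as the stable backbone
    typed_terms -- jargon terms typed by a non-expert captionist
    cutoff      -- minimum difflib similarity (0..1) to accept a replacement;
                   0.8 is an illustrative assumption, not a tuned value
    """
    vocabulary = [term.lower() for term in typed_terms]
    patched = []
    for word in asr_words:
        # Look for the single closest typed term above the similarity cutoff.
        match = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        patched.append(match[0] if match else word)
    return patched


# Hypothetical example: ASR hears "poison", the captionist typed "Poisson".
asr = ["the", "poison", "distribution", "models", "rare", "events"]
typed = ["Poisson"]
print(" ".join(patch_asr_with_typed_terms(asr, typed)))
```

Because the ASR output fixes the number and order of words, the captionist only needs to supply the terms that matter; a word that ASR already got right is left untouched unless it closely resembles a typed term.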

ASR usually cannot provide a reliable confidence level for its own accuracy. On the other hand, the crowd usually has a better sense of its own accuracy. One approach to leverage this would be to provide an indication of the confidence the system has in recognition accuracy. This could be done in many ways, for example through colors. This would enable users to pick their own confidence threshold.

It would be useful to add automatic speech recognition as a complementary source of captions because its errors are generally independent of those of non-expert captionists. This difference means that matching captions input by captionists and ASR can likely be used with high confidence, even in the absence of many layers of redundant captionists or ASR systems. Future work also seeks to integrate multiple sources of evidence, such as N-gram frequency data, into a probabilistic framework for transcription and ordering. Estimates of worker latency or quality can also be used to weight the input of multiple contributors in order to reduce the amount of erroneous input from lazy or malicious contributors, while not penalizing good ones. This is especially important if crowd services such as Amazon’s Mechanical Turk are to be used to support these systems in the future. The models currently used to align and merge sets of partial captions from contributors are in their infancy, and will improve as more work is done in this area. As crowd captioning improves, students can begin to rely more on readable captions being made available at any time for any speaker.
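The weighting and confidence-display ideas above can be sketched together. The code below assumes the hard part (aligning contributors' words to the same position) is already done, and shows only a weighted vote at one position plus a color chosen from a reader-set threshold. The function names, the weights, the 0.6 threshold, and the two colors are all hypothetical; this is not the merging model used by the system evaluated in this paper.

```python
from collections import defaultdict


def vote(candidates):
    """Pick the word with the largest total weight at one aligned position.

    candidates -- list of (word, weight) pairs, one per contributor or ASR
                  engine; weight is a hypothetical quality estimate
    Returns (best_word, confidence), where confidence is the winning word's
    share of the total weight.
    """
    totals = defaultdict(float)
    for word, weight in candidates:
        totals[word.lower()] += weight
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


def color_for(confidence, threshold=0.6):
    """Map a confidence score to a display color using a reader-chosen threshold."""
    return "black" if confidence >= threshold else "red"


# Hypothetical slot: two captionists agree, a lower-weighted source disagrees.
slot = [("kernel", 1.0), ("kernel", 0.5), ("colonel", 0.5)]
word, confidence = vote(slot)
print(word, confidence, color_for(confidence))
```

Down-weighting a contributor with a poor quality estimate reduces their influence on the vote without discarding their input, and agreement between independent sources pushes the confidence up even when only a few sources are available.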

The benefits of captioning by local or remote workers presented in this paper aim to further motivate the use of crowd captioning. We imagine a deaf or hard of hearing person eventually being able to capture speech with her cell phone anywhere and have captions returned to her within a few seconds. She may use this to follow along in a lecture for which a professional captionist was not requested, to participate in informal conversation with peers after class, or to enjoy a movie or other live event that lacks closed captioning. These use cases are currently beyond the scope of ASR, and their serendipitous nature precludes pre-arranging a professional captionist. Moreover, ASR and professional captioning systems do not have a consistent way of adding appropriate punctuation from lecture speech in real time, resulting in captions that are very difficult to read and understand [9, 16].

A challenge in developing new methods for real-time captioning is that it can be difficult to quantify whether the [...] ability and readability of real-time captioning is dependent on much more than just Word Error Rate, involving at a minimum naturalness of errors, regularity, latency and flow. These concepts are difficult to capture automatically, which makes it difficult to make reliable comparisons across different approaches. Designing metrics that can be universally applied will improve our ability to make progress in systems for real-time captioning.
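For reference, Word Error Rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal version of that standard calculation; the function name and sample sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "the cat sat on the mat"
dropped_word = "the cat on the mat"      # one word missed
wrong_word = "the bat sat on the mat"    # one word misrecognized
print(word_error_rate(reference, dropped_word))
print(word_error_rate(reference, wrong_word))
```

Both hypotheses receive the same score, although one omits a word and the other replaces it with a different word. The metric treats every error alike, which is one reason it cannot by itself capture naturalness of errors, latency or flow.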

7. ACKNOWLEDGMENTS

We thank our participants for their time and feedback in evaluating the captions, and the real-time captionists for their work in making the lecture accessible to deaf and hard of hearing students.

8. REFERENCES

[1] FAQ about CART (real-time captioning), 2011. http://www.ccacaptioning.org/articles-resources/faq.

[2] Y. C. Beatrice Liem, Haoqi Zhang. An iterative dual pathway structure for speech-to-text transcription. In Proceedings of the 3rd Workshop on Human Computation (HCOMP ’11), HCOMP ’11, 2011.

[3] M. S. Bernstein, J. R. Brandt, R. C. Miller, and D. R. Karger. Crowds in two seconds: Enabling realtime crowd-powered interfaces. In Proceedings of the 24th annual ACM symposium on User interface software and technology, UIST ’11, page to appear, New York, NY, USA, 2011. ACM.

[4] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, and T. Yeh. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, UIST ’10, pages 333–342, New York, NY, USA, 2010. ACM.

[5] D. Burnham, G. Leigh, W. Noble, C. Jones, M. Tyler, L. Grebennikov, and A. Varley. Parameters in television captioning for deaf and hard-of-hearing adults: Effects of caption rate versus text reduction on comprehension. Journal of Deaf Studies and Deaf Education, 13(3):391–404, 2008.

[6] X. Cui, L. Gu, B. Xiang, W. Zhang, and Y. Gao. Developing high performance ASR in the IBM multilingual speech-to-speech translation system. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 5121–5124, 31 2008–April 4 2008.

[7] L. B. Elliot, M. S. Stinson, D. Easton, and J. Bourgeois. College Students Learning With C-Print’s Education Software and Automatic Speech Recognition. In American Educational Research Association Annual Meeting, New York, NY, 2008.

[8] M. B. Fifield. Realtime remote online captioning: An effective accommodation for rural schools and colleges. In Instructional Technology And Education of the Deaf Symposium, 2001.

[9] A. Gravano, M. Jansche, and M. Bacchiani. Restoring punctuation and capitalization in transcribed speech. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 4741–4744, April 2009.

[10] C. Jensema. Closed-captioned television presentation speed and vocabulary. American Annals of the Deaf, 141(4):284–292, 1996.

[11] C. J. Jensema, R. Danturthi, and R. Burch. Time spent viewing captions on television programs. American Annals of the Deaf, 145(5):464–468, 2000.

[12] R. Kheir and T. Way. Inclusion of deaf students in computer science classes using real-time speech transcription. In Proceedings of the 12th annual SIGCSE conference on Innovation and technology in computer science education, ITiCSE ’07, pages 261–265, New York, NY, USA, 2007. ACM.

[13] W. Lasecki, K. Murray, S. White, R. C. Miller, and J. P. Bigham. Real-time crowd control of existing interfaces. In Proceedings of the 24th annual ACM symposium on User interface software and technology, UIST ’11, page to appear, New York, NY, USA, 2011. ACM.

[14] W. S. Lasecki and J. P. Bigham. Online quality control for real-time captioning. In Proceedings of the 14th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’12, 2012.

[15] W. S. Lasecki, C. Miller, A. Sadilek, A. Abumoussa, D. Borrello, R. Kushalnagar, and J. P. Bigham. Realtime captioning by groups of nonexperts. In Proceedings of the 25th ACM UIST Symposium, UIST ’12, 2012.

[16] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. Audio, Speech, and Language Processing, IEEE Transactions on, 14(5):1526–1540, Sept. 2006.

[17] T. Matthews, S. Carter, C. Pai, J. Fong, and J. Mankoff. In Proceeding of the 8th International Conference on Ubiquitous Computing, pages 159–176, Berlin, 2006. Springer-Verlag.

[18] R. E. Mitchell. How many deaf people are there in the United States? Estimates from the Survey of Income and Program Participation. Journal of Deaf Studies and Deaf Education, 11(1):112–9, Jan. 2006.

[19] S. J. Samuels and P. R. Dahl. Establishing appropriate purpose for reading and its effect on flexibility of reading rate. Journal of Educational Psychology, 67(1):38–43, 1975.

[20] M. Wald. Using automatic speech recognition to enhance education for all students: Turning a vision into reality. In Frontiers in Education, 2005. FIE ’05. Proceedings 35th Annual Conference, page S3G, Oct. 2005.

[21] M. Wald. Creating accessible educational multimedia through editing automatic speech recognition captioning in real time. Interactive Technology and Smart Education, 3(2):131–141, 2006.

1 http://ocw.mit.edu/
