Deaf, Hard of Hearing, and Hearing Perspectives on using Automatic Speech Recognition in Conversation
Abraham Glasser, Kesavan Kushalnagar, and Raja Kushalnagar
Rochester Institute of Technology, 1 Lomb Memorial Drive, Rochester, NY 14623
{atg2036, krk4565, reuami}@rit.edu
ABSTRACT
In this experience report, we describe the accessibility challenges that deaf, hard of hearing, and hearing participants encounter in mixed-group conversation when using personal devices with Automatic Speech Recognition (ASR) applications. We discuss problems and describe accessibility barriers in using these devices. We also describe best practices, lessons learned, and pitfalls to avoid in using personal devices in conversation.
1. INTRODUCTION
Deaf or hard of hearing (DHH) people usually cannot understand speech unaided, and usually depend on additional support such as hearing aids or speech-to-text technology, especially in multi-speaker environments. Simple low-technology aids, such as using paper and pen to write back and forth or texting back and forth, can work, but they are about 3-4 times slower than spoken or signed communication and tend not to be effective for sustained communication.

Deaf and hard of hearing people use sign language interpreters and/or stenographers for access to auditory information. Interpreters listen to the spoken information and translate it into sign language. Stenographers also listen, and they type what they hear. Stenographers are intensively trained to use a special keyboard on which different keystrokes act as shortcuts for what they hear phonetically. The transcription program then processes the phonetic input and matches it against an immense vocabulary database. If a mistake is made, or if the program does not correctly identify a word, the stenographer can change it on the fly. Interpreters and stenographers are very good resources for providing timely access for deaf and hard of hearing individuals. However, there are a few issues concerning the supply and dependability of these resources. If they become ill or are delayed and cannot come in, the deaf person is stuck without them. Also, the cost of the service is very high and is generally not affordable below the university level. The mental and physical work required to interpret or caption is heavy and is hard to maintain for a prolonged amount of time. Because of this, multiple people must be hired so that they can swap with each other and take turns. Small organizations and businesses are very likely unable to afford professional interpreters or stenographers.

This is where Automatic Speech Recognition comes in. The technology is very cheap, and ASR applications can be put on virtually any device. The only foreseeable physical issues with ASR technology are battery life, storage space, hardware quality, Internet connection, and portability, all of which have feasible solutions.

When participants use speech-to-text technology that is capable of keeping up with speech, they face challenges in following both the speaker's gestures and the extra visual of the speaker's speech-to-text. Deaf participants have to concentrate on managing competing tasks, such as shifting attention between multiple visuals, compared with their hearing peers. They are left receiving incomplete information and trying to connect these segments together, all while searching for cues to know where to pay attention. As a result, even when provided with accurate real-time text through captioners, they receive only 50%-80% of the information, compared to 84%-95% for their hearing peers [1]. Similarly, hard of hearing participants have to make sense of reduced speech information through their hearing aids.

Deaf, hard of hearing, and hearing people face diverse challenges with spoken language communication with each other in most conversational settings, especially in large-group and multiple-talker settings such as classrooms and workplaces. They face a wide array of communication strategies and need considerable flexibility in accessible technology for upward mobility [2].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ASSETS '17, October 29 - November 1, Baltimore, MD, USA.
Professional jobs in such areas as art, education, technology, and management demand more interaction and greater communication skills with hearing people than do nonprofessional occupations such as clerical, machine operations, printing, welding, food preparation, or janitorial work. Job-related demands also make the workplace a more difficult communication situation for those who are deaf compared to those who are hard of hearing [3]. Both groups, however, tend to experience less success in securing higher-level jobs than their hearing peers and are limited by level of college degree [4]. For both deaf and hard-of-hearing people, communication on the job involves English about 80% of the time, whether through writing, speech, or sign language with speech [4]. Many deaf, hard of hearing, and hearing participants have experimented with addressing these communication challenges through the use of ASR applications on their personal devices.

Given the spoken-language communication requirements of the classroom and the workplace, we discuss how current ASR applications enhance access by deaf and hard-of-hearing individuals. We also examine how ASR applications enhance communication exchanges between deaf or hard-of-hearing persons and hearing persons in the classroom or workplace.
2. PARTICIPANTS
Deaf, hard of hearing, and hearing speakers and listeners have different challenges and accessibility needs in mixed-group conversation in most settings, including academic and workplace settings.
Deaf participants have challenges both in accessing and following spoken information and in conveying information efficiently and quickly to others in group settings. When deaf participants use ASR, they primarily use it to display and read the words of their communication partners. Sometimes they also use ASR to speak and convey information quickly to their communication partners, especially if their speech clarity is sufficiently close to that of their hearing peers. The ability to display their own speech can aid communication by conveying both spoken and text information to their communication partners.

For deaf participants, ASR is less accurate than it is for their hearing peers, because their speech intelligibility is lower on average. Venkatagiri [5] found that such systems could not recognize voice commands from subjects who had relatively low speech intelligibility, and that often those individuals were unable to correct their dictation errors. Deaf speakers tend to have different segmental and prosodic factors, such as rate, pausing, voice volume, intonation, and stress, that may influence the overall performance of speech recognition applications and software. Jeyalakshmi et al. [6] found larger than normal variation in pitch and formants for deaf children, ages five to ten years, making such features unusable for recognition of deaf speech.
Hard of hearing people often do not face conversational challenges in quiet or one-to-one settings. They often face difficulties in multi-speaker or noisy settings, as they have difficulty handling acoustics that interfere with the quality of the signal, extensive technical vocabulary, multiple information sources, and/or talkers with dialects or accents [7]. Even if they use auditory assistive technologies that incorporate noise reduction algorithms capable of improving listening-alone performance, these cannot make up for the adverse effects of having to concentrate on following speech and of dealing with competing tasks such as taking notes or shifting attention between multiple speakers or visuals [8].
While hearing participants do not typically have difficulty speaking with or listening to other hearing peers, they face difficulties in understanding deaf or hard of hearing speakers. This is due to their inability to adjust to the wide variance in speech production by deaf and hard of hearing speakers. In Figure 2, we show the distribution of about 650 deaf people as rated by professional speech pathologists at the National Technical Institute for the Deaf. Each of these ratings was based on a set of sentences called "Clarke Sentences." In personal testing, scores from 1 to 3 were generally impossible to understand. A score from 3 to 4.5 was difficult to recognize, but still feasible for communication. Anything rated at about 4.5 and above should be understandable to most hearing people (with some effort). However, among deaf people rated 5.0, ASR technology had a Word Error Rate of about 53%, which is not useful enough for most applications.
3. ASR EVOLUTION
Speech recognition is still a very difficult task for applications. For the last 50 years, researchers and inventors have iteratively implemented and improved applications that can understand speech, including conversation.

First-generation systems adopted a pattern-matching approach in which speech waveforms were matched with specific word waveforms. Because every talker produces a different waveform, and a given talker often has different waveforms for the same word (e.g., if spoken excitedly, or with a cold), the first-generation systems could not be used outside controlled laboratory situations.

Second-generation systems incorporated statistical prediction algorithms, mainly hidden Markov modeling (HMM). This approach was more robust to speech waveform variation, as it merged speech data from many talkers over time to build statistical models that captured the variation. As a result, speech recognition systems improved. In addition, faster computers and the ability to process dramatically more data through the cloud enabled many practical interactive speech systems. Most modern interactive phone systems are speech recognition systems that can answer simple questions. These systems, however, limit the domain of possible words in order to simplify analysis and increase accuracy. Unfortunately, for unrestricted domains, even the best second-generation speech recognition systems still had word error rates of 20-25% with arbitrary speech [7].

Third-generation systems now have the potential to assist deaf and hard-of-hearing users. These systems incorporate learning algorithms, which try to emulate human brain behavior. Driven by companies such as Apple, Microsoft, and Google, with intended applications in wearables, cars, robotics, and machines, these systems seem to handle human speech variation better. In fact, in many commercial systems, word (or character) error rates for tasks such as mobile short message dictation and voice search are far below 10%.
Some companies are even aiming at reducing the sentence error rate to below 10% [9]. However, as constraints are relaxed, allowing longer utterances, technical jargon, and environmental noise, speech recognition systems still have word and character error rates of about 20%, as reported by Yu et al. [9], under conditions such as:

• Far-field microphones (e.g., when the microphone is in the background, as in a living room, meeting room, or audiovisual recording made in the field)
• Noise (e.g., when loud music is captured by the microphone)
• Speech produced with an accent
• Multi-talker speech or side conversation (as in a meeting or with multiparty chatting)
• Spontaneous speech that is not fluent, with variable speeds, or with emotion.
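To make the second-generation statistical approach concrete, the following is a minimal sketch (our illustration, not code from any real recognizer) of the forward algorithm at the heart of HMM-based recognition: it scores how probable an observed acoustic sequence is under a model, summing over all hidden state paths. The two states and all probabilities are toy values chosen for illustration.

```python
# Toy forward algorithm for a hidden Markov model (HMM).
# States, transition, and emission probabilities are illustrative assumptions.

def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(obs) under the HMM by summing over all hidden state paths."""
    # alpha[t][s] = P(obs[0..t], state at t == s)
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({
            s: emit_p[s][obs[t]] * sum(alpha[t - 1][p] * trans_p[p][s] for p in states)
            for s in states
        })
    return sum(alpha[-1].values())

# Two hidden "phone" states emitting two acoustic symbols, "x" and "y".
states = ("A", "B")
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}

p = forward(("x", "y"), states, start_p, trans_p, emit_p)
```

A real recognizer merges data from many talkers into such models, which is why the approach tolerates waveform variation better than first-generation template matching.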
4. ASR FOR CONVERSATIONAL USE
Major obstacles that limit ASR use in group conversation for deaf and hard of hearing people include text accuracy and lag time. Additionally, ASR accuracy for any application can be affected by other variables, including auditory factors such as speech fidelity, ambient noise, and microphone quality, and computing factors such as available computing power, the speech recognition engine, and its associated statistical models.
The text accuracy for ASR needs to be sufficiently high to be usable and beneficial. Wald [10] found that learning improved for students with disabilities when using such technologies, but only if the text was at least 85% accurate. Others have suggested that a speech recognition transcript must be at least 90% accurate to be useful in the classroom [11], and that an even greater accuracy of 98% or higher may be needed to maintain the meaning and intent of the message [12]. In practice, actual accuracy rates are most often too low for use by deaf and hard-of-hearing students in typical higher education settings [11].
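Accuracy figures like those above are conventionally derived from Word Error Rate (WER): the word-level edit distance between the ASR transcript and a reference transcript, divided by the reference length. A minimal sketch (our illustration; the sentences are hypothetical, not data from the cited studies):

```python
# Word Error Rate: (substitutions + insertions + deletions) / reference word count,
# computed with a word-level Levenshtein edit distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the lecture starts at nine in the morning"
hyp = "the lecture starts at five in morning"
error = wer(ref, hyp)       # 1 substitution + 1 deletion over 8 words = 0.25
accuracy = 1.0 - error      # 75%, well below the 90% classroom threshold
```

Note that a single substituted word ("five" for "nine") can invert the meaning of an utterance while costing only a few points of accuracy, which is why thresholds as high as 98% have been proposed for preserving intent.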
Similarly, the lag time for ASR needs to be short enough to be usable. Lag time for ASR becomes worse as the amount of data to be analyzed increases and causes processing delays. It has been shown that students cannot effectively participate in discussions or dialogues if lag time is more than 5 seconds [10]. In a study of college students who viewed transcriptions of a lecture, both deaf and hearing participants commented that captions created by automatic speech recognition software were choppy and had too much latency, making them hard to follow, compared to captioning by a stenographer or a crowd-captioning process [7].
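One simple way to quantify caption lag in an evaluation (our sketch; the cited studies do not describe their instrumentation, and the timestamps below are hypothetical) is to record when each utterance ends and when its caption appears, then compare the per-utterance delays against the 5-second participation threshold:

```python
# Caption lag: time between the end of an utterance and the display of its
# caption. Timestamps are hypothetical measurements, in seconds.

def caption_lags(utterance_end_times, caption_display_times):
    """Per-utterance lag between speech and its displayed caption."""
    return [display - end
            for end, display in zip(utterance_end_times, caption_display_times)]

def usable(lags, threshold=5.0):
    """Rule of thumb from the literature: every caption within `threshold` seconds."""
    return all(lag <= threshold for lag in lags)

speech_ends = [2.0, 6.5, 11.0, 15.2]   # when each utterance finished
captions_up = [3.1, 8.0, 17.5, 16.9]   # when its caption was displayed

lags = caption_lags(speech_ends, captions_up)
ok = usable(lags)   # False: the third caption lagged 6.5 s, past the threshold
```

A single long stall is enough to break conversational participation, so a maximum-lag criterion like this is stricter, and arguably more realistic, than reporting only the average lag.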
5. USER EXPERIENCE
To investigate the capabilities of current ASR applications, from Fall 2016 through Summer 2017 the authors used a variety of applications on personal devices in everyday, real-world settings. The purpose of using the speech recognition applications was to facilitate face-to-face spoken language interactions by providing a visible text representation of speech in the following contexts: 1) classroom communication, 2) conversation, 3) job interviews, and 4) speech production practice, in which the text displayed by an app was used as an indicator of intelligibility.

Application software, also known as an "application" or "app", enables a user to carry out specific tasks on a computer. The authors used and evaluated the communication apps described in the list below. These apps were chosen because they were available at no additional cost, other than device charges and service plans, and had been rated at least 3.5 out of 5 for user satisfaction in evaluations published by the developers. The developers of each product described them as follows:

DEAFCOM by askjerry Communications "is a very simple application that will convert speech into readable text. For a non-hearing or hard-of-hearing person, the application will allow faster communication with deaf persons. For deaf users, the software can assist in faster communication and may also be used as a useful tool when practicing your speech".

Dragon Dictation by Nuance Communications is described as "an easy-to-use voice recognition application powered by Dragon NaturallySpeaking that allows you to easily speak and instantly see your text or email messages. In fact, it's up to five (5) times faster than typing on the keyboard".

Siri by Apple Inc., part of iOS, is described in these words: "Talk to Siri as you would to a friend and it can help you get things done, like sending messages, placing calls, or making dinner reservations. You can ask Siri to show you the Orion constellation or to flip a coin.
Siri works hands-free, so you can ask it to show you the best route home and what your ETA is while driving. And it's connected to the world, working with Wikipedia, Yelp, Rotten Tomatoes, Shazam, and other online services to get you even more answers. The more you use Siri, the more you'll realize how great it is. And just how much it can do for you".

Virtual Voice by Gareth Hannaway Communications is "designed to use the text to speech (TTS) and the speech recognition features of your Android device. It was created with deaf and/or mute people in mind, so they can communicate with others without the need for sign language or lip reading".

Ava is an eponymous product, which is described as follows: "Ava shows you who says what. Ava shows you what people say, in less than a second. Easy communication is only a tap away."

The Google Assistant is from Google Inc. "Meet your Google Assistant. Ask it questions. Tell it to do things. It's your own personal Google, always ready to help. Ready to help, wherever you are. Your one Assistant extends to help you across devices, like Google Home, your phone, and more. Discover what your Assistant can do. Learn more about how you can get help from your Assistant. Works with your favorite stuff, too. Shuffle your favorite playlist, dim your Philips Hue lights with just your voice, or ask your Assistant on Google Home to stream Netflix to your TV with Chromecast. Discover more services and smart devices that work with your Google Assistant."

(App store pages: play.google.com/store/apps/details?id=defcom.v1; play.google.com/store/apps/details?id=appinventor.aiGareth Hannaway 420.VirtualVoice)

[Figure 1: Logos of various ASR platforms used.]

Alexa is owned by Amazon.com, Inc. "Using Alexa is as simple as asking a question. Just ask to play music, read the news, control your smart home, tell a joke, and more; Alexa will respond instantly. Whether you are at home or on the go, Alexa is designed to make your life easier by letting you voice-control your world.
Alexa lives in the cloud so it's always getting smarter, and updates are delivered automatically. The more you talk to Alexa, the more it adapts to your speech patterns, vocabulary, and personal preferences. Alexa comes included with Echo and other Alexa devices."

(https://assistant.google.com/)

The authors downloaded the chosen apps to their personal iPhone or Android device and evaluated use in both quiet one-to-one settings and in noisy, open areas with multiple speakers. Specifically, the aim was to assess how well these technologies could facilitate: 1) classroom communication, 2) informal conversations, and 3) speech production practice. The technologies selected for evaluation were DEAFCOM, Dragon Dictation, Siri, Virtual Voice, Ava, Google Assistant, and Amazon Alexa. The logos for these products can be seen in Figure 1.

All apps were both accurate and had minimal lag in specific [...]

[Figure 2: Distribution of Deaf Speech scores.]

[...] work for everybody:

"We are not the only ones with problems understanding .... That we're not the only ones. There's non-disabled people out there who are having the same problems ... you feel equal."
6. LESSONS LEARNED
When using ASR technology, deaf, hard of hearing, and hearing people report different ways of interacting with the technology. The following lessons were learned from daily experience with the technology.

1. Hearing people often use the technology when their hands are occupied, as a way to do basic tasks such as looking up simple queries on the web. They do not generally consider speech-to-text translations to be particularly important.

2. Deaf people most often reported that their experience with the technology for their own use was restricted to experimenting with it in order to see what the technology might interpret from their speech. Some report that ASR is starting to work better for them now, to the point where some deaf speech can be comfortably understandable.

3. One deaf person reported that she commonly used Amazon Alexa for its intended purposes and said that she was satisfied with its use. Generally, most deaf people we met have not had this level of success with the technology yet.

4. So far, only a small portion of the hearing population seems to use the voice recognition feature of their phone often. Several reported that the technology does not work well enough for them, or that they felt uncomfortable using the technology in a public setting. This led to them not using the technology in a private setting either.

5. ASR services generally use Internet connections to send audio data to their service. Phones have various connection strengths and reliability. Some have difficulty with wireless networks and connections, so performing the actual analysis can take erratic amounts of time. Also, using ASR while not on Wi-Fi may result in additional data, battery, and space usage and costs.

6. When using ASR to communicate with another person, the ability to change the text on the fly has been limited and/or not feasible. Many people mention this when discussing why they do not use ASR as much as texting.

7. Interaction with ASR devices such as Alexa and Google Home has been limited, since deaf people do not have access to the verbal responses from these devices after commands are spoken to them. The two most popular personal assistants for smartphones, Google Assistant and Siri, display their commands but sometimes only voice their responses. It was not until a couple of months ago that both personal assistants introduced text inputs in addition to voice inputs. This makes them more accessible for DHH people; however, it removes much of the appeal of a personal assistant.

8. A few deaf people have reported that they experience obstacles when they try to converse with a hearing person who has never used ASR technology before. They reported that this has been mostly a result of the user interface design of the ASR app itself and the fact that the apps were not intuitive to use.

9. Users said that it would be ideal if ASR systems could tell the user when repetition of a specific word was needed, rather than requiring repetition of the whole sentence. Using a system like this might increase the interaction of users with the device.

10. Mild accents no longer affect ASR much, due to advances in the technology. Heavy or unique accents still confuse the software too much for it to be usable on a daily basis.

While improvements in speech recognition are being reported with third-generation processing strategies in laboratory settings [13], deaf users are still noting performance issues in the real world, especially in the critical settings of the classroom and the job. Improvements are needed, particularly to control noise and side-talk interference, perhaps with better noise-canceling algorithms, more advanced microphone array techniques, and through use of a lapel mic, Bluetooth streaming, and/or Wi-Fi Direct, and ultimately to convince deaf and hard-of-hearing persons that speech recognition technologies are better than pencil and paper when trying to communicate with a person with typical hearing when it counts, in the classroom or on the job.

These apps can produce text or spoken output for users with speech that is difficult to understand. However, that does not mean users will actually use them. They may prefer more reliable, easier, or less conspicuous low-tech solutions such as paper and pen. The idea today is that everyone should make a small effort to make the conversation work. Of course a minimal effort, but the burden should not be on the deaf person alone. Everyone works together.

Many deaf speakers have low volume, and directional microphones can make a significant difference. It would be helpful to include external microphone support for microphones that can be either plugged in or connected over Bluetooth. This could be a lapel-type mic clipped on the person, or a more central hardware piece with multiple microphones. These could do either directional beamforming or omnidirectional capturing. Most applications need fairly loud speech samples for optimal accuracy. Deaf people typically cannot monitor their own speech volume, so it is important to give users good feedback about placement of the phone and the associated speech volume.
7. CONCLUSIONS
Although many people use ASR systems such as Siri or Google for recreational use, and every once in a while to send a text, they are not comfortable using these systems for sustained conversational use, as the systems have higher than tolerable error rates, especially in less than perfect settings. Whenever there are errors, the errors are time-consuming to fix, and the interfaces are not customizable.

ASR technology has been around for over 50 years and has been constantly updated and improved by many researchers and companies. However, ASR technology has been focused on the hearing population, and therefore it is difficult for deaf and hard of hearing individuals, even those who use voice on a regular basis, to be fully comfortable with ASR. As found in one of our studies, there is a big variance in WER for those who have voices that are understandable by a hearing person. This means that even though a deaf person might seem to speak well, ASR still will not always produce good results. On the other hand, the variance was very small for those whose voices scored low on the intelligibility scale, which means these people have almost no hope of using ASR technology with their own voices. This, along with several other factors, shows that deaf people will have varied experiences with ASR technology, and that ASR is not stable or reliable enough to use with deaf people at this time.

Putting aside the problems facing deaf people using their own voices with ASR, there are still user interface accessibility issues. For example, for ASR technology to be practical in conversations between a deaf person and a hearing person, the ASR interface should be intuitive and easy to use. We have seen that current technology is still not always convenient. People have frustrations with lack of Internet connections and with battery and space usage, so it should be an objective to make ASR apps efficient.
People who have little or no experience using ASR technology may not know how to interact with the app. Thus, clear, intuitive user interfaces need to be added to ASR apps so that the conversation will have better flow and comfort. Apps are typically limited and will transcribe only a certain amount in a given time period. It has been observed that some apps require purchases for more usage. This is disadvantageous, because deaf people should not have to work harder and give up more to have equal access to information.

When ASR technology is improved upon, it is critical to take all of these factors and perspectives into consideration. It would be very beneficial to run an experiment along with a survey and to recruit people with different backgrounds. This way, developers and researchers will be able to work with accessibility in mind and have their products benefit a wide spectrum of people. This process applies to all technology, especially technologies that become widely available, commercial products.
8. REFERENCES

[1] Marc Marschark, Patricia Sapere, Carol Convertino, and Jeff Pelz. Learning via direct and mediated instruction by deaf students. Journal of Deaf Studies and Deaf Education, 13(4):546-561, 2008.

[2] Susan Bannerman Foster and Gerard G. Walter. Deaf Students in Postsecondary Education. Routledge, 1992.

[3] Daniel L. Boutin. Professional jobs and hearing loss: A comparison of deaf and hard of hearing consumers. Journal of Rehabilitation, 75(1):36-40, Mar 2009.

[4] Ronald R. Kelly. Deaf workers: Educated and employed, but limited in career growth.

[5] H. S. Venkatagiri. Speech recognition technology applications in communication disorders. American Journal of Speech-Language Pathology, 11(4):323-332, 2002.

[6] C. Jeyalakshmi, V. Krishnamurthi, and A. Revathy. Deaf speech assessment using digital processing techniques. Signal & Image Processing: An International Journal (SIPIJ), 1(1):14-25, 2010.

[7] Raja S. Kushalnagar, Walter S. Lasecki, and Jeffrey P. Bigham. Accessibility evaluation of classroom captions. ACM Transactions on Accessible Computing, 5(3):1-24, 2014.

[8] Karen L. Anderson. Access is the issue, not hearing loss: New policy clarification requires schools to ensure effective communication access. SIG 9 Perspectives on Hearing and Hearing Disorders in Childhood, 25(1):24-36, 2015.

[9] Dong Yu and Li Deng. Automatic Speech Recognition: A Deep Learning Approach. Springer, 2014.

[10] Mike Wald. Using automatic speech recognition to enhance education for all students: Turning a vision into reality. In Frontiers in Education, 2005. FIE '05. Proceedings 35th Annual Conference, pages S3G-S3G. IEEE, 2005.

[11] Richard Kheir and Thomas Way. Inclusion of deaf students in computer science classes using real-time speech transcription. ACM SIGCSE Bulletin, 39(3):261-265, 2007.

[12] Ross Stuckless. Recognition means more than just getting the words right: Beyond accuracy to readability.