Feasibility of Using Automatic Speech Recognition with Voices of Deaf and Hard-of-Hearing Individuals
Abraham Glasser, Kesavan Kushalnagar, and Raja Kushalnagar
Rochester Institute of Technology, Rochester, NY 14623
{atg2036, krk4565}@rit.edu, [email protected]
ABSTRACT
Many personal devices have transitioned from visual-controlled interfaces to speech-controlled interfaces to reduce costs and interaction friction, supported by the rapid growth in the capabilities of speech-controlled interfaces, e.g., Amazon Echo or Apple's Siri. A consequence is that people who are deaf or hard of hearing (DHH) may be unable to use these speech-controlled devices. We show that deaf speech has a high error rate compared to hearing speech in commercial speech-controlled interfaces: deaf speech had approximately a 78% word error rate (WER), compared to an 18% WER for hearing speech. Our findings show that current speech-controlled interfaces are not usable by DHH people, and that significant advances in speech recognition software or alternative approaches will be needed before DHH people can use speech-controlled interfaces.
CCS Concepts
• Human-centered computing → Accessibility; Accessibility design and evaluation methods;
Keywords
Automatic Speech Recognition, Deaf Speech, Hearing Speech, Word Error Rate, Deaf, Hearing
1. RELATED WORK
Prior research has investigated how the lack of a feedback loop for deaf people who cannot hear their own speech results in poor speech quality due to vowel errors, intonation errors, and length errors [1, 2]. ASR software is generally trained with speech samples from hearing people, which results in very poor recognition of deaf speech. Even when used with limited possibilities, e.g., single digits, ASR for deaf speakers with poor speech intelligibility yielded a 13% Word Error Rate (WER) [3], compared to nearly no errors for hearing speech [4].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ASSETS '17, October 29–November 1, 2017, Baltimore, MD, USA. https://doi.org/10.1145/3132525.3134819

Figure 1: Distribution of Clarke Sentence Scores.
2. METHODOLOGY
2.1 Deaf Speech Dataset
We sampled from a subset of a large speech dataset of 650 deaf and hard of hearing (DHH) individuals at the National Technical Institute for the Deaf at Rochester Institute of Technology, which has an enrollment of around 1,100 deaf and hard of hearing students. The dataset consisted of samples taken from DHH individuals who took the Clarke Sentences intelligibility test [5]. The test has 60 sentence lists, each with 10 sentences of 10 syllables; the number of words varies across the sentences and lists. Each audio file has one speaker speaking one list from the Clarke sentence lists. In each audio file, the speaker says the sentence number, then says that sentence, and repeats until all ten sentences are spoken. The audio files were recorded by one individual, and the samples were then sent to a speech pathologist, who assigned an intelligibility score from 0 to 50, computed by listening for 50 target words within the sample and awarding credit for each. A score of 50 indicates that the deaf person is generally intelligible to the speech pathologist, a score of 30 means difficult to understand, and a score of 0 means completely unintelligible. About half of all deaf individuals had scores under 40, which is usually unintelligible to people not used to deaf speech, as shown in Figure 1.

Figure 2: Intelligibility rating vs. Word Error Rate

We used the Microsoft Translator Speech API to create transcriptions for each audio file. This API is used by businesses for transcriptions, and as commercial-level software it matches other similar transcription software [6, 7, 8]. We also used the National Institute of Standards and Technology Speech Recognition Scoring Toolkit (SCTK), Version 2.4.0.4, for the Word Error Rate calculations [9].
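For readers unfamiliar with the metric SCTK reports: WER is the word-level edit distance (substitutions + insertions + deletions) between the reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that computation, not the SCTK implementation, and the example sentences are hypothetical:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("ran" -> "rain") and one deletion ("the") over
# six reference words gives a WER of 2/6 ~= 0.33.
print(wer("the boy ran to the store", "the boy rain to store"))
```

Note that WER can exceed 100% when the hypothesis contains many spurious insertions, which is why error rates near 78% do not imply that 22% of words were recognized correctly.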
3. ANALYSIS RESULTS
3.1 WER for Hearing Speakers
We calculated the ASR transcription and WER analysis results for five hearing subjects who read various lists from the same Clarke Sentences database. The speech samples were recorded with a cell phone in a noisy environment, with background noise from a lab with many people speaking and computers running, which is similar to common use-case scenarios for phone ASR services. The average WER was 18%. While the speech recognition was not close to perfect for hearing people, it was still passable; these numbers are expected from the current state-of-the-art technology. The majority of voice command interfaces are currently found in cell phones and home assistants, which are often used in noisy environments. If we had repeated these recordings in the same setting as the deaf speech samples, we would expect an even lower WER.
3.2 WER for Deaf Speakers
We ran a sample of the deaf speech database through the Microsoft Translator Speech API. We used 45 total samples, chosen by a naive listener who identified 15 good samples (40+), 16 mediocre samples (30–40), and 14 poor samples (10–30). The error rates were extremely high over all samples at 77%, including a 53% WER for the good samples. The average WER and standard deviation were calculated for each group, as shown in Figure 3. The average WER for the good speech group was significantly less than that of either the mediocre or poor speech groups; a t-test comparison between the good and mediocre groups yielded p <
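The group comparison above is a two-sample t-test on per-sample WERs. A minimal sketch of Welch's t-test (which does not assume equal variances) is shown below; the per-sample WER values are hypothetical stand-ins, since the paper reports only group averages:

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of
    freedom (Welch-Satterthwaite) for unequal-variance groups."""
    va, vb = variance(a), variance(b)          # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb                    # squared standard error
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical per-sample WERs for the "good" and "mediocre" groups:
good = [0.48, 0.55, 0.51, 0.60, 0.49]
mediocre = [0.78, 0.82, 0.75, 0.85, 0.80]
t, df = welch_t(good, mediocre)
print(f"t = {t:.2f}, df = {df:.1f}")  # a large |t| indicates a significant gap
```

In practice one would compute the p-value from the t distribution (e.g., `scipy.stats.ttest_ind` with `equal_var=False`) rather than by hand.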
4. CONCLUSIONS
The current WER of Microsoft Translator was too high for comfortable use. Deaf speakers' results were significantly poorer than those of hearing people under similar conditions. There are a number of factors that have an impact on the accuracy of automatic speech recognition systems with deaf speech.

Figure 3: Group vs. average WER

The results also show much greater variance among deaf speakers compared with hearing speakers. For ASR systems to recognize deaf speech as well as they do hearing speech would require a huge database of deaf speakers. While conceptually simple, this is still challenging: the deaf population is relatively small compared to the hearing population and has far more varied backgrounds, while ASR systems are trained on huge hearing-speaker datasets. The results show that even deaf speakers with "good" speech have worse accuracy than the average hearing speaker. Although the Clarke Sentence test is useful for speech pathology evaluation, it is less useful for providing feedback to DHH people about the usability of current ASR interfaces, since the top rating of 50 does not distinguish between DHH speakers with high ASR accuracy and those with lower ASR accuracy. It would be helpful to develop an automated test that provides feedback to DHH people on their use of ASR services such as Siri or Alexa.
5. REFERENCES
[1] M. S. Thirumalai and S. G. Gayathri. Speech of the Hearing Impaired. Central Institute of Indian Languages, 2004.
[2] Mary J. Osberger and Nancy S. McGarr. Speech production characteristics of the hearing impaired. Journal of Speech, Language and Hearing Research, 8:221–283, 1982.
[3] C. Jeyalakshmi, V. Krishnamurthi, and A. Revathy. Deaf speech assessment using digital processing techniques. Signal & Image Processing: An International Journal (SIPIJ), 1(1):14–25, 2010.
[4] George R. Doddington. Phonetically sensitive discriminants for improved speech recognition. In Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference on, pages 556–559. IEEE, 1989.
[5] Vincent J. Samar and Dale Evan Metz. Criterion validity of speech intelligibility rating-scale procedures for the hearing-impaired population.