Do face masks introduce bias in speech technologies? The case of automated scoring of speaking proficiency
Anastassia Loukina, Keelan Evanini, Matthew Mulholland, Ian Blood, Klaus Zechner
Educational Testing Service, NJ, USA
Abstract
The COVID-19 pandemic has led to a dramatic increase in the use of face masks worldwide. Face coverings can affect both the acoustic properties of the signal and speech patterns, and can have unintended effects if the person wearing the mask attempts to use speech processing technologies. In this paper we explore the impact of wearing face masks on the automated assessment of English language proficiency. We use a dataset from a large-scale speaking test for which test-takers were required to wear face masks during the test administration, and we compare it to a matched control sample of test-takers who took the same test before the mask requirements were put in place. We find that the two samples differ across a range of acoustic measures and also show a small but significant difference in speech patterns. However, these differences do not lead to differences in human or automated scores of English language proficiency. Several measures of bias showed no differences in scores between the two groups.
Index Terms: speech recognition, human-computer interaction, fairness, computational paralinguistics
1. Introduction
The spread of the COVID-19 pandemic has led to a dramatic increase in the use of face masks worldwide. However, the use of face coverings can lead to transmission loss and can modify the acoustic properties of the speech signal. They may also impact the speech patterns of the person wearing the mask. Both of these factors may in turn have unintended effects if the person wearing the mask attempts to use speech processing technologies [1, 2, 3, 4, 5].

In this paper we explore the impact of wearing face masks on the automated assessment of English language proficiency. We consider an example of a large-scale English proficiency test which includes a speaking section and is usually taken at an official test center. With few exceptions, test-takers did not typically wear any face coverings when taking the test prior to the COVID-19 pandemic. However, starting in early 2020, local regulations in some areas have required test-takers to wear face masks. For instance, all test-takers taking the test in Hong Kong starting in late February 2020 were required to wear surgical masks throughout the duration of the test, including the speaking section. This gave us a unique opportunity to evaluate the impact of face masks using data collected under real-life conditions.

The study has two main goals. Our first goal is to explore the validity of using automated speech scoring systems in a situation where test-takers are wearing face masks. When automated systems are used as part of the scoring pipeline for assessments, it is important that they do not introduce bias into the final scores. Automated scoring systems are likely to have been trained on data from test-takers who were not wearing face masks. If the models trained on this population do not generalize well to recordings from test-takers who are wearing masks, the scores assigned by the system would be unfair [6]. Our study is the first to explore the impact of face masks on the validity of automated speech scoring.

Our second goal is to contribute to our knowledge of the effect face masks might have on speech technologies in general. Previous studies on this topic have typically used small corpora collected under laboratory conditions. In our case, we have access to a unique dataset collected under real-life conditions from speakers who were wearing face masks while engaging in an authentic task (taking the speaking test). Furthermore, since this dataset was collected in the context of a standardized test, we also have data from a large population of test-takers who engaged with the same assessment tasks under the same conditions except that they were not wearing face masks. This allows us to compare the data from test-takers wearing face masks to a matched control sample. We compare the two samples across a wide range of acoustic and phonetic measures to obtain a better understanding of how face masks may affect speech production and speech processing.
2. Speech when wearing a face mask
Previous research on the effect of surgical masks in speech studies has considered three possible types of effects: the effect on transmission loss and other signal properties, the effect on speech perception and intelligibility, and finally the effect on speech production when wearing a mask. Some of these studies were done in a clinical [2, 7, 8] or forensic context [4, 5], although [1] also looked at the effect of face masks in the context of language proficiency assessment.

In terms of acoustic effects, most fabrics have been shown to lead to transmission loss, especially at higher frequencies [9]. These results have also been confirmed for surgical face masks by [3] and [5], who found the largest differences at frequencies above 4.5 kHz.

The limited evidence so far suggests that these changes to spectral quality have little effect on speech intelligibility. Paradoxically, [2] reported that speech perception scores for stimuli from the Connected Speech Test recorded by one speaker with the mask present were significantly better than the scores without the mask, for listeners with both normal and impaired hearing. However, given that this difference was less than 1%, they did not consider it clinically significant. No effect of face masks for listeners with normal hearing was also reported by [8] and [3], who used a similar approach. In a slightly different set-up where 16 nurses were paired with each other for a series of face-to-face tests, [7] also found that word intelligibility scores for subjects wearing a surgical mask were not significantly different from controls (in the control condition the speaker's face was obscured by a piece of cardboard).

Less is known about the effect of face masks on speech technologies. [4, 5] collected a corpus for forensic automatic speaker recognition which contained 1.5 hours of recordings from 8 speakers wearing several types of face coverings, including surgical masks, and a control condition. They found that surgical masks had little effect on the accuracy of closed-set automatic speaker identification: the accuracy of identifying a speaker wearing a mask using a speaker model trained without the mask ('no mask') was 95.1%, very close to the 95.2% accuracy in the 'no mask' - 'no mask' condition [5].

Finally, in the study that is most relevant to the context of this paper, [1] considered the effect face masks had on an English speaking proficiency assessment administered by in-person examiners in Hong Kong. In this study 186 students took a mock oral exam twice, with and without surgical masks, and then completed a short questionnaire. The tests were rated by trainee teachers. The results of the study showed that face masks had no significant effect on scores assigned by the human raters, although for some areas, such as pronunciation, it is possible that a significant effect could have been detected in a larger sample.

To our knowledge, there are no empirical studies of the effect face masks might have on speaker behavior. The analysis of questionnaires in [1] provides some qualitative evidence in this direction: the test-takers felt that they performed at a lower level when wearing face masks and found it more difficult to understand their peers. The study also revealed some of the strategies used to compensate for face masks, as reported by the speakers and raters: 92% of the students felt they spoke more slowly and 89% felt they spoke louder. 12 out of 15 raters also observed that students might have been speaking louder and more slowly, or articulating more clearly.

We use a large corpus of spoken responses recorded under real-life conditions to address the following research questions: (1) What are the acoustic differences in responses recorded with and without face masks? (2) Are there any differences in speech patterns between the test-takers who are wearing face masks and those who are not? (3) Do face masks lead to a bias in the performance of an automated scoring engine for spoken language proficiency?
3. Data
The data set considered in this paper includes responses from 597 test-takers sampled from the data collected during operational administration of a large-scale language proficiency test in Hong Kong. The data set consists of two samples: FACEMASK and CONTROL.

Our FACEMASK sample consists of 297 test-takers who took an English language proficiency test in Hong Kong between late February and March 2020, during the time when a requirement to wear surgical face masks was in place. Of these, 75% reported Chinese as their native language, 8% reported Korean, with a further 10% of speakers reporting English, Japanese, German or Indonesian. Other native languages were each represented by less than 1% of the test-takers. (The metadata we have available does not differentiate between different linguistic varieties.) 50% of the test-takers in this sample reported their gender as "Female".

All test-takers took four components of the test: reading, listening, speaking and writing. The total duration of the test is about 3 hours. For the speaking section, which takes about 20 minutes, each test-taker provided responses to 4 questions that were designed to elicit spontaneous speech. For some questions test-takers were expected to use provided materials (e.g., a reading passage) as the basis for their response, while other questions were more general and elicited personal opinions or narratives. Depending on the question type, the speakers were given 45 seconds or 1 minute to complete their response. The audio was captured through a headset with a microphone in .ogg format. The final corpus consisted of 1,188 spoken responses.

We also selected a matched CONTROL sample from test-takers who took the test in Hong Kong in Fall 2019, before the requirement to wear masks took effect. We selected a stratified sample of 300 test-takers matched to the FACEMASK sample based on the proportion of speakers of different languages for the 6 languages that occurred most frequently in the FACEMASK sample. The test-takers in this sample took the test under the same conditions as the test-takers in the FACEMASK sample: they answered the same types of questions and their answers were recorded using the same equipment. The final corpus of control responses consisted of 1,200 spoken responses.

The responses in both samples were scored using the same procedure: the audio recordings were sent to a distributed network of human raters who scored the responses on a 1-4 scale according to the scoring rubrics. The rating process is organized such that different responses from the same person are never scored by the same rater. In both samples, the responses from Hong Kong were scored together with responses from test centers in other countries. The raters were not aware of the country where the response was recorded.
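As an illustration, the stratified matching described above could be sketched as follows in Python. This is a minimal sketch, assuming pandas DataFrames `candidates` (Fall 2019 test-takers) and `facemask` with a `native_language` column; these names are hypothetical and do not reflect the actual metadata schema.

```python
import pandas as pd

def select_matched_control(candidates: pd.DataFrame,
                           facemask: pd.DataFrame,
                           n_total: int = 300,
                           n_languages: int = 6,
                           seed: int = 42) -> pd.DataFrame:
    """Draw a control sample whose native-language proportions mirror
    the FACEMASK sample for its most frequent languages."""
    # Proportions of the most frequent native languages in FACEMASK.
    props = facemask["native_language"].value_counts(normalize=True)
    props = props.head(n_languages)
    parts = []
    for lang, prop in props.items():
        n_lang = round(n_total * prop)
        pool = candidates[candidates["native_language"] == lang]
        # Sample at most the available number of candidates per language.
        parts.append(pool.sample(n=min(n_lang, len(pool)), random_state=seed))
    return pd.concat(parts)
```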
4. Automated scoring engine
All responses were also scored by SpeechRater®, ETS's engine for scoring spoken responses to language proficiency tests [10, 11, 12].

All responses were first processed using an automatic speech recognition (ASR) system built using the Kaldi toolkit [13]. For the training of the acoustic model we used 800 hours of spoken responses of non-native spontaneous speech, covering over 100 native languages across almost 9,000 different speakers. The DNN model was adapted to speakers with fMLLR and i-vectors using Kaldi's nnet2 DNN environment. The language model is a trigram model trained on the same dataset used for acoustic model training and is based on a vocabulary of around 30k lexical entries. The ASR training corpus was elicited using questions similar to the ones considered in this study. There was no overlap of speakers or questions between the ASR training corpus and the corpus used in this paper. We did not additionally adapt the ASR to the speakers or responses in this study. Detailed information about the ASR model building approach is provided in [14].

The scoring model used to produce automated scores included 28 features. The scoring model was trained on a large sample of 500,000 responses from earlier administrations of the test worldwide. It was not adapted to the current sample. The features included in the model cover aspects of delivery and language use, two dimensions of speaking proficiency that are considered by the human raters. Features related to delivery covered general fluency, pronunciation and prosody. Features related to language use covered vocabulary, grammar and some aspects of discourse structure. An additional module was used to flag atypical responses where an automated score is likely to be unreliable [11, 15]. See [12] for a detailed description of the features and the filtering module.

5. Results

Even though the focus of this paper is the effect of face masks on speech technologies, it is conceivable that the potential discomfort of wearing face masks would affect test-taker performance on other sections of the test (listening, reading, writing). Furthermore, there could be additional factors that would lead to score differences between the two samples in our study that are not directly related to wearing face masks.

To address this, we first used a mixed linear regression model to evaluate whether there was a significant difference between the two samples in scores for each section of the test. The linear model had section score as the dependent variable, test taker as a random factor, and included sample and section as well as their interaction as fixed factors. We considered all four sections: speaking, listening, reading and writing. (All mixed-effects models reported in this paper were fit using the lme4 package [16]; significance tests for parameter estimates were performed using the lmerTest package [17].)

While there were differences in score distributions between sections, the results showed that face masks had no effect on section scores, with p values varying from 0.36 to 0.97. In other words, there were no differences in means between the two samples, either in the human scores for the speaking section or in the scores for the other sections.

We further considered whether there were differences in human scores at the level of individual speaking responses, using response score as the dependent variable, sample as a fixed factor and test taker as a random factor. The results were consistent with those obtained for section scores: the fact that some of the test-takers were wearing face masks during the speaking section had no effect on the average human scores assigned to their responses (2.95 for the CONTROL sample vs. 2.97 for the FACEMASK sample, p=0.657).
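To make the model specification concrete, here is a minimal sketch of the section-score analysis in Python with statsmodels. The models in the paper were actually fit in R with lme4/lmerTest [16, 17], and the DataFrame and column names below (`scores`, `score`, `sample`, `section`, `test_taker`) are illustrative assumptions.

```python
import statsmodels.formula.api as smf

# Section score as the dependent variable; sample, section and their
# interaction as fixed effects; a random intercept per test taker.
# `scores` is assumed to be a long-format DataFrame with one row per
# test taker x section.
model = smf.mixedlm("score ~ sample * section", data=scores,
                    groups=scores["test_taker"])
result = model.fit()
print(result.summary())  # inspect p-values for the sample terms
```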
To examine the impact of wearing face masks on the acoustic properties of test-taker responses, we extracted a range of acoustic features using the openSMILE toolkit [18]. We extracted the 88 features in the extended GeMAPS set, which include measurements related to frequency, energy/amplitude and spectral characteristics. (Two of the control responses, for which openSMILE features could not be extracted, were excluded from this analysis, resulting in a total of 1,198 control responses.)

Unpaired, two-sample t-tests on the distributions of the feature values for the FACEMASK and CONTROL samples indicate that 38 out of the 88 openSMILE features show significant differences between the two samples (using a Bonferroni correction to adjust the significance level for multiple comparisons). Examples of features with distinct distributions include the mean slope of the power spectrum between 0 and 500 Hz and the mean bandwidth of F2.
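A sketch of this comparison is shown below, assuming the `opensmile` Python wrapper (the paper used the openSMILE toolkit [18] directly) and lists of audio paths `mask_files` / `control_files`; the exact eGeMAPS version used in the paper is not specified, so v02 here is an assumption.

```python
import opensmile
import pandas as pd
from scipy import stats

# Extract the 88 eGeMAPS functionals per response.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
mask_feats = pd.concat([smile.process_file(f) for f in mask_files])
ctrl_feats = pd.concat([smile.process_file(f) for f in control_files])

# Unpaired two-sample t-test per feature, with the significance level
# Bonferroni-adjusted for 88 comparisons.
alpha = 0.05 / mask_feats.shape[1]
for col in mask_feats.columns:
    t, p = stats.ttest_ind(mask_feats[col], ctrl_feats[col])
    if p < alpha:
        print(f"{col}: t={t:.2f}, p={p:.2e}")
```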
In order to investigate whether responses from the two samples can be distinguished using these features, classification experiments were conducted using the SKLL machine learning toolkit (https://github.com/EducationalTestingService/skll). Specifically, 10-fold cross-validation (with no overlap of responses from the same test-taker across folds) was used to train and evaluate a range of models; the best performing model (GradientBoostingClassifier) achieved an F1-score of 0.786 and an accuracy of 0.783.
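The same experiment can be sketched with scikit-learn directly (the paper ran it through SKLL, which wraps scikit-learn estimators); `X`, `y` and `speaker_ids` are assumed arrays of openSMILE features, mask/control labels and test-taker IDs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

clf = GradientBoostingClassifier()
# Grouping by test-taker keeps all responses from one speaker in a
# single fold, so the classifier never sees a test speaker in training.
cv = GroupKFold(n_splits=10)
f1 = cross_val_score(clf, X, y, cv=cv, groups=speaker_ids, scoring="f1")
acc = cross_val_score(clf, X, y, cv=cv, groups=speaker_ids,
                      scoring="accuracy")
print(f"F1 = {np.mean(f1):.3f}, accuracy = {np.mean(acc):.3f}")
```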
We next considered whether face masks had any effect on test-taker speech patterns. Using ASR hypotheses and timestamps, we computed 8 different features designed to capture whether test-takers wearing masks made more pauses, spoke more slowly or showed different patterns of disfluencies. Since all of these features depend on speaker proficiency, for each feature we first fit a baseline linear model with the feature value as the dependent variable and human score as the predictor. We then added sample as a second independent variable and computed the difference in adjusted R² between the two models. The results are shown in Table 1 and Figure 1.

Four out of the eight properties we considered showed significant differences: the average duration of silences and the number of silences per word, as well as the duration of chunks between pauses, whether measured in words or in seconds. In other words, speakers wearing masks spoke with about the same articulation rate as those not wearing masks, but paused slightly more often. However, the difference between the two samples was very small: face masks explained less than 1% of the variance in feature values after controlling for proficiency. In absolute terms, wearing masks reduced the duration of chunks between two pauses by 0.6 words or 0.2 seconds.
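For one feature, the adjusted-R² comparison described above could be sketched as follows; the DataFrame `df` and its columns (`feature`, `human_score`, `sample`) are illustrative names.

```python
import statsmodels.formula.api as smf

# Baseline: feature value predicted from proficiency alone.
baseline = smf.ols("feature ~ human_score", data=df).fit()
# Extended: add sample membership (FACEMASK vs. CONTROL).
extended = smf.ols("feature ~ human_score + sample", data=df).fit()

# Additional variance explained by wearing a mask after controlling
# for proficiency (the "R2 diff" column in Table 1).
r2_diff = extended.rsquared_adj - baseline.rsquared_adj
print(f"Adjusted R2 difference: {r2_diff:.3f}")
```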
Table 1: The effect of face masks on various speech patterns. The table shows the additional variance in feature value explained by FACEMASK after controlling for human score, the standardized coefficient, and whether the result was statistically significant after Bonferroni correction for multiple comparisons.

Feature                         R² diff   Coef.    sig.
Average duration of silences     0.006     0.021    *
N silences per word              0.003     0.007    *
Total N silences                 0.001     0.382    ns
Seconds between two pauses       0.009    -0.188    *
N words between two pauses       0.008    -0.567    *
Total N words                    0.002    -2.278    ns
Words per second of speech       0.000    -0.014    ns
Number of disfluencies           0.002    -0.563    ns

Figure 1: Average values for different speech properties for the two groups of test-takers, conditioned on the human score.
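To illustrate how pause-based features like those in Table 1 can be derived from ASR word-level timestamps, here is a minimal sketch; `words` is an assumed list of (word, start_sec, end_sec) tuples for one response, and the 0.15 s pause threshold is an illustrative value, not the engine's actual setting.

```python
def pause_features(words, min_pause=0.15):
    """Compute simple pause-pattern features from word timestamps."""
    silences, chunk_lengths = [], []
    n_words_in_chunk = 0
    for (w1, s1, e1), (w2, s2, e2) in zip(words, words[1:]):
        n_words_in_chunk += 1
        gap = s2 - e1  # inter-word gap in seconds
        if gap >= min_pause:
            silences.append(gap)
            chunk_lengths.append(n_words_in_chunk)
            n_words_in_chunk = 0
    total_speech = sum(e - s for _, s, e in words)
    return {
        "avg_silence_dur": sum(silences) / len(silences) if silences else 0.0,
        "n_silences_per_word": len(silences) / len(words),
        "words_between_pauses": (sum(chunk_lengths) / len(chunk_lengths)
                                 if chunk_lengths else len(words)),
        "words_per_sec_speech": len(words) / total_speech,
    }
```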
To further explore the effect face masks might have on ASR performance, we selected a sample of 55 responses (28 FACEMASK responses and 27 CONTROL responses) and had them transcribed by a group of three transcribers. Of these, 16 responses were triple-transcribed. We then computed the ASR word error rate (WER) for each group. A Mann-Whitney U test showed that there was no difference in WER between the two samples (29.6% for the CONTROL sample vs. 27.7% for the FACEMASK sample, p=0.22). The WER between two human transcribers varied between 19.7% and 27.6% depending on the pair of transcribers. Surprisingly, human-human WER seemed to be lower for the FACEMASK sample (27.6% for CONTROL vs. 20.7% for FACEMASK; linear regression after controlling for transcriber pair, p=0.04). We note, however, that this sample size is very small.
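A sketch of this comparison under stated assumptions: per-response (reference, hypothesis) transcript pairs for each sample, and the jiwer package for WER (the paper does not specify its WER tooling).

```python
import jiwer
from scipy.stats import mannwhitneyu

# Per-response WER for each sample; `facemask_pairs` and
# `control_pairs` are assumed lists of (reference, hypothesis) strings.
mask_wer = [jiwer.wer(ref, hyp) for ref, hyp in facemask_pairs]
ctrl_wer = [jiwer.wer(ref, hyp) for ref, hyp in control_pairs]

# Two-sided Mann-Whitney U test on the per-response WER distributions.
stat, p = mannwhitneyu(mask_wer, ctrl_wer, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```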
Finally, we considered whether the small differences we observed between the two samples had any effect on the automated scores computed for the responses. First of all, we looked into whether more responses from test-takers with face masks were flagged as non-scorable by the SpeechRater filters. We found this not to be the case: only 1 response in the FACEMASK sample was flagged as non-scorable, in comparison to 3 responses in the CONTROL sample. In addition, 11 responses in the CONTROL sample and 9 responses in the FACEMASK sample were not sent for automated scoring.

We first consider the effect of face masks on outcome fairness: that is, whether the automated scores assigned to responses are affected by whether the test-taker is wearing a mask. We considered the overall accuracy of scores as well as overall score differences, that is, whether the automated scores are consistently different from human scores for members of a certain group [19, 20]. We used RSMTool [21] to compute all metrics.
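For concreteness, the core outcome-fairness computations could be sketched as follows; the paper used RSMTool [21], and the human/system score arrays per sample (`h_mask`, `s_mask`, `h_ctrl`, `s_ctrl`) are assumed inputs. The Fisher r-to-z test for two independent correlations is written out explicitly.

```python
import numpy as np
from scipy import stats

def fisher_r_to_z(r1, n1, r2, n2):
    """Two-sided test for the difference between two independent correlations."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return 2 * (1 - stats.norm.cdf(abs(z)))

# Human-system agreement per sample.
r_mask = np.corrcoef(h_mask, s_mask)[0, 1]
r_ctrl = np.corrcoef(h_ctrl, s_ctrl)[0, 1]
rmse_mask = np.sqrt(np.mean((s_mask - h_mask) ** 2))
rmse_ctrl = np.sqrt(np.mean((s_ctrl - h_ctrl) ** 2))

p = fisher_r_to_z(r_mask, len(h_mask), r_ctrl, len(h_ctrl))
print(f"r: {r_mask:.2f} vs {r_ctrl:.2f} (p={p:.2f}); "
      f"RMSE: {rmse_mask:.2f} vs {rmse_ctrl:.2f}")
```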
Table 2: Accuracy of automated scores for responses from test-takers wearing masks and the control sample. The table shows the number of responses (N), mean system score, Pearson's r, and root mean squared error (RMSE) for each of the CONTROL and FACEMASK samples.
Pearson's r between human and system scores was 0.56 for the CONTROL sample and 0.58 for the FACEMASK sample. A test for the significance of the difference between two independent correlations using the Fisher r-to-z transform showed that the difference was not significant (p=0.44). There was also no significant difference in root mean squared error (RMSE) between the two samples (p=0.06) ("overall score accuracy" in [20]). Finally, there was no difference in the actual scores: the differences between standardized means for both samples were below the 0.01 threshold and there was no significant difference in the absolute error between machine and human scores (p=0.678) ("overall score difference" in [20]).

We also considered process fairness, that is, whether the automated scoring engine assigns different scores to test-takers from different samples despite them having the same proficiency ("conditional score accuracy" in [20]). To do this, we looked into how much additional variance in score error is explained by sample membership after controlling for human score. The results once again showed no difference between test-takers in the two samples (p=0.853).

6. Discussion and conclusions

In this paper we used a large corpus of responses to a language proficiency assessment to evaluate the effect face masks may have on different aspects of speech processing.

Face masks had an effect on various acoustic properties of the signal: our classifier experiments showed that it is possible to predict with almost 80% accuracy whether a test-taker is wearing a mask or not based on low-level openSMILE features. We also found that face masks led to small but significant differences in speech patterns: test-takers wearing masks tended to pause a bit more often than the control sample. The difference was very small: 0.6 words or 0.2 seconds.

However, these differences in acoustics and speech patterns did not have a further effect on the performance of automatic speech recognition or the automated scoring engine. We found no difference in ASR WER. Automated scores also were not affected by whether the test-takers were wearing masks or not. Finally, we found no difference in proficiency scores assigned by human raters. These results are very encouraging: as of the time of writing this paper, the use of face masks remains widespread. It is important to know that they can continue to be used in the context of language proficiency assessment without having negative impacts on test scores.

To our knowledge, this is the first study that has explored the use of face masks using a large number of speakers and a corpus obtained in real-life conditions. Our results are consistent with previous research conducted with smaller numbers of subjects in controlled laboratory conditions, which showed no effect of surgical face masks on speech intelligibility by humans. While these results may appear counter-intuitive given the widespread perception that face masks degrade intelligibility, we note that in face-to-face communication the use of face coverings also results in a reduction in the visual information available to listeners. Multiple studies since the 1930s have shown that being able to see a speaker's mouth movements greatly enhances intelligibility, especially when speech is presented in noise [3].
Both this paper and previous studies (except [1]) considered situations where the listeners and the automated systems had access only to the auditory stimulus, collected with or without a face mask, and without additional visual information.

To conclude, our analysis of speech patterns and acoustic features, in combination with previously published results, suggests that face masks are unlikely to have a substantial detrimental effect on speech technologies. Yet the differences we observed for low-level acoustic features suggest that some types of technologies and applications may be more affected than others.

7. References

[1] D. Coniam, "The impact of wearing a face mask in a high-stakes oral examination: An exploratory post-SARS study in Hong Kong," Language Assessment Quarterly, vol. 2, no. 4, pp. 235–261, 2005.
[2] L. L. Mendel, J. A. Gardino, and S. R. Atcherson, "Speech understanding using surgical masks: A problem in health care?" Journal of the American Academy of Audiology, vol. 19, no. 9, pp. 686–695, 2008.
[3] C. Llamas, P. Harrison, D. Donnelly, and D. Watt, "Effects of different types of face coverings on speech acoustics and intelligibility," York Papers in Linguistics Series 2, no. 9, pp. 80–104, 2008.
[4] R. Saeidi, T. Niemi, H. Karppelin, J. Pohjalainen, T. Kinnunen, and P. Alku, "Speaker recognition for speech under face cover," in Proceedings of INTERSPEECH, 2015, pp. 1012–1016.
[5] R. Saeidi, I. Huhtakallio, and P. Alku, "Analysis of face mask effect on speaker recognition," in Proceedings of INTERSPEECH, 2016, pp. 1800–1804.
[6] M. J. Zieky, "Fairness in test design and development," in Fairness in Educational Assessment and Measurement, N. J. Dorans and L. L. Cook, Eds. Routledge, 2016, pp. 9–32.
[7] L. J. Radonovich, R. Yanke, J. Cheng, and B. Bender, "Diminished speech intelligibility associated with certain types of respirators worn by healthcare workers," Journal of Occupational and Environmental Hygiene, vol. 7, no. 1, pp. 63–70, 2010.
[8] S. R. Atcherson, L. L. Mendel, W. J. Baltimore, C. Patro, S. Lee, M. Pousson, and M. J. Spann, "The effect of conventional and transparent surgical masks on speech understanding in individuals with and without hearing loss," Journal of the American Academy of Audiology, vol. 28, no. 1, 2017.
[9] M. E. Nute and K. Slater, "The effect of fabric parameters on sound-transmission loss," The Journal of The Textile Institute, vol. 64, no. 11, pp. 652–658, 1973.
[10] K. Zechner, D. Higgins, X. Xi, and D. M. Williamson, "Automatic scoring of non-native spontaneous speech in tests of spoken English," Speech Communication, vol. 51, no. 10, pp. 883–895, 2009.
[11] D. Higgins, X. Xi, K. Zechner, and D. Williamson, "A three-stage approach to the automated scoring of spontaneous spoken responses," Computer Speech & Language, vol. 25, no. 2, pp. 282–306, 2011.
[12] L. Chen, K. Zechner, S.-Y. Yoon, K. Evanini, X. Wang, A. Loukina, J. Tao, L. Davis, C. M. Lee, M. Ma, R. Mundkowsky, C. Lu, C. W. Leong, and B. Gyawali, "Automated scoring of nonnative speech using the SpeechRater v. 5.0 engine," ETS Research Report Series, vol. 2018, no. 1, pp. 1–31, 2018. [Online]. Available: http://doi.wiley.com/10.1002/ets2.12198
[13] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proceedings of the Workshop on Automatic Speech Recognition and Understanding, 2011.
[14] Y. Qian, X. Wang, K. Evanini, and D. Suendermann-Oeft, "Self-adaptive DNN for improving spoken language proficiency assessment," in Proceedings of INTERSPEECH, 2016, pp. 3122–3126.
[15] S.-Y. Yoon, A. Cahill, A. Loukina, K. Zechner, B. Riordan, and N. Madnani, "Atypical inputs in educational applications," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers). Stroudsburg, PA, USA: Association for Computational Linguistics, 2018, pp. 60–67. [Online]. Available: http://aclweb.org/anthology/N18-3008
[16] D. Bates, M. Mächler, B. Bolker, and S. Walker, "Fitting linear mixed-effects models using lme4," Journal of Statistical Software, vol. 67, no. 1, pp. 1–48, 2015.
[17] A. Kuznetsova, P. B. Brockhoff, and R. H. B. Christensen, "lmerTest package: Tests in linear mixed effects models," Journal of Statistical Software, vol. 82, no. 13, pp. 1–26, 2017.
[18] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE – the Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia. Firenze, Italy: Association for Computing Machinery, 2010, pp. 1459–1462. [Online]. Available: https://dl.acm.org/doi/pdf/10.1145/1873951.1874246
[19] D. M. Williamson, X. Xi, and F. J. Breyer, "A framework for evaluation and use of automated scoring," Educational Measurement: Issues and Practice, vol. 31, no. 1, pp. 2–13, 2012.
[20] A. Loukina, N. Madnani, and K. Zechner, "The many dimensions of algorithmic fairness in educational applications," in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Florence, Italy: Association for Computational Linguistics, 2019, pp. 1–10.
[21] N. Madnani and A. Loukina, "RSMTool: A collection of tools for building and evaluating automated scoring models," Journal of Open Source Software, vol. 1, no. 3, 2016.