Non-intrusive speech intelligibility prediction using automatic speech recognition derived measures
Mahdie Karbasi, Stefan Bleeck, and Dorothea Kolossa

Institute of Communication Acoustics, Faculty of Electrical Engineering and Information Technology, Ruhr University Bochum, Universitätsstr. 150, 44801 Bochum, Germany

Institute of Sound and Vibration Research, University of Southampton, SO17 1BJ, UK
The estimation of speech intelligibility is still far from being a solved problem. One aspect is especially problematic: most of the standard models require a clean reference signal in order to estimate intelligibility. This is an issue of some significance, as a reference signal is often unavailable in practice. In this work, therefore, a non-intrusive speech intelligibility estimation framework is presented. In it, human listeners' performance in keyword recognition tasks is predicted using intelligibility measures that are derived from models trained for automatic speech recognition (ASR). One such ASR-based and one signal-based measure are combined into a full framework, the proposed NO-Reference Intelligibility (Nori) estimator, which is evaluated in predicting the performance of both normal-hearing and hearing-impaired listeners in multiple noise conditions. It is shown that the Nori framework even outperforms the widely used reference-based (or intrusive) short-time objective intelligibility (STOI) measure in most considered scenarios, while being applicable in fully blind scenarios with no reference signal or transcription, creating perspectives for online and personalized optimization of speech enhancement systems.
I. INTRODUCTION
The intelligibility of a speech signal, defined as the percentage of words or phonetic units that can be recognized by a human listener, depends on a large number of factors, including the loudness of the speech, ambient noise, the characteristics of the transmission channel, and the speaking style (French and Steinberg, 1947). A complete description of all parameters is very complex, and although there have been many (simplified) efforts to predict speech intelligibility (SI) objectively, this has not been achieved with equally high accuracy over different conditions such as non-linearly distorted or reverberated speech. Also, modeling hearing-impaired listeners and predicting their performance is still a challenging task.

Speech intelligibility prediction methods can be divided into two categories: intrusive methods (reference-based approaches) require the clean reference signal.
Non-intrusive (reference-free) methods predict the intelligibility without the need for a clean reference signal. Early work on intelligibility prediction exclusively used intrusive methods, as this is conceptually much simpler. For instance, the articulation index (AI) (French and Steinberg, 1947), the speech intelligibility index (SII) (ANSI/ASA, 1997), and the speech transmission index (STI) (Steeneken and Houtgast, 1980) were the first methods developed. Both AI and SII work by first estimating the SNRs within psychoacoustically motivated frequency bands, and then using the weighted sum of those estimates as a measure of the speech intelligibility. The speech transmission index (STI) is calculated as the weighted average of the reductions in temporal envelope modulation, obtained by estimating the modulation transfer function. The speech-based envelope power spectrum model (sEPSM) (Jørgensen and Dau, 2011) and later the multi-resolution sEPSM (mr-sEPSM) (Jørgensen et al., 2013) were introduced to overcome the drawbacks of STI in predicting the intelligibility of non-linearly distorted or reverberated speech.

An improved and well-established intrusive measure for speech intelligibility prediction today is the short-time objective intelligibility (STOI) measure (Taal et al., 2011), which has become a common benchmark in the field of speech processing (Gao and Tew, 2015; Lightburn and Brookes, 2015). STOI is based on the correlation between the test signal and the reference signal within one-third octave frequency bands over short time segments. Despite the reported high correlation between STOI and human speech recognition performance in different acoustic scenarios, it does not perform well for distorted (processed) speech or reverberant acoustic conditions. Such distortions are especially a problem when assessing signal processing algorithms that are used in hearing aids, as they introduce further non-linear distortions, e.g., by compressing the dynamic range, that lead to subsequently reduced speech intelligibility. To overcome this, further studies have been conducted to improve the original STOI metric to work in a wider range of conditions. For instance, the Extended STOI (ESTOI) (Jensen and Taal, 2016) was introduced to perform better in scenarios with fluctuating noise. ESTOI is based on energy-normalized short-time spectrograms that are orthogonally decomposed into subspaces which are important for intelligibility.

While correlation-based measures such as STOI are limited to second-order statistics, it has been demonstrated that higher-order statistics improve SI predictions by using the mutual information (MI) between the test signal and the reference (Jensen and Taal, 2014; Taghia and Martin, 2014). Speech signals are sparsely encoded in the time-frequency domain, and in this representation, the regions with higher energy of the speech relative to the masker are called glimpses. It has been shown that the reliable glimpses, which have an adequate SNR, contribute to the intelligibility of speech (Cooke, 2006; Li et al., 2012) and can be used to predict the intelligibility instrumentally (Tang, 2014; Tang et al., 2016).

Such methods provide an estimate of the average intelligibility over the entire speech signal. Also, they usually require longer segments of speech in order to achieve more accurate predictions of intelligibility. Assessing the intelligibility of speech not at the signal level but at the phoneme level has also attracted some recent attention. Ullmann et al.
(2015) suggest using the distance between the phoneme posterior probabilities of the distorted and the reference signal, which are estimated by a deep neural network (DNN). Huber et al. (2017) introduce the mean temporal distance between posteriograms as a predictor of human listening effort.

All methods described so far are intrusive, or reference-signal-based. However, in many applications the clean signal is not available or not practical to access. It is therefore important to also investigate non-intrusive methods with the goal of predicting the intelligibility of speech without the need for a reference signal. In recent work, Schädler et al. (2015) have chosen a non-intrusive approach to develop the framework for auditory discrimination experiments (FADE), which predicts speech reception thresholds (SRTs) for a German matrix sentence test. In FADE, reference HMMs are trained on the specific target speech tokens as well as the noise types that are used in the testing phase. To individualize FADE for hearing-impaired listeners, audiogram thresholds and supra-threshold characteristics (distortions) are used to estimate the spectrogram that is employed during the speech feature extraction (Kollmeier et al., 2016). Another ASR-based framework has been introduced by Spille et al. (2018) to predict SRTs in different noisy scenarios. They use a hybrid DNN/HMM structure to identify words from a German matrix sentence test. Phone-based HMMs with context-dependent triphone models are used for training. In addition to statistical models, in another approach (Andersen et al., 2018), a convolutional neural network has been used to learn and predict the speech intelligibility non-intrusively.

One issue that none of the described methods is conceptually able to model is the question of prior knowledge. Humans can of course use experience, knowledge about context, and prior knowledge about the characteristics of speech units (such as phonemes) when listening to speech (Fingscheidt and Bauer, 2013). Therefore, a complete objective model for SI and speech quality must also take the phonetic contextual information into account. Otherwise, for example, comparing the processed speech only to a signal-based reference might lead to unreasonably low intelligibility estimates in scenarios where the speech is strongly modified or even partially synthesized in the enhancement stage. With the goal of making similar knowledge available to SI measures, a non-intrusive method has been introduced by Karbasi et al. (2016a,b) that uses the distorted input speech and its corresponding transcription to synthesize an estimated reference. In this method, the estimated reference signal can be used in intrusive measures (such as STOI) to predict intelligibility.

In addition to the above approaches, which provide an estimate of the average intelligibility over the entire speech signal, there are also so-called microscopic SI prediction models. These predict the intelligibility of smaller units of speech, such as words or phonemes, taking knowledge about the functioning of the human auditory process into account (Jürgens and Brand, 2009; Jürgens et al., 2010). This allows for individual tailoring of the process, for example by taking wider filter shapes or higher thresholds into account. Microscopic methods promise to be more precise than macroscopic models in predicting intelligibility and in diagnosing problems due to specific phoneme confusions.
However, they still need further improvements to become more robust against variations of the test scenarios.

In order to utilize the advantage of microscopic methods in accurately predicting the intelligibility of small units of speech, and inspired by previous work on non-intrusive methods, in this paper we investigate the possibility of using automatic speech recognition to extract non-intrusive measures for predicting SI. As a secondary goal, we also investigate the feasibility of incorporating hearing profiles and predicting individual hearing-impaired listeners' performance.

In previous work (Karbasi and Kolossa, 2015) we have presented a related, but simpler method of microscopic SI prediction with limited applicability. In this paper, we propose to improve the model using new objective measures. For this purpose, we additionally include two ASR-based discriminance measures for predicting the speech intelligibility from a microscopic viewpoint (Karbasi and Kolossa, 2017), and we improve their computation to arrive at a fully blind framework which only requires pre-trained reference models in its computation. Specifically, its low-level intelligibility measures are extracted utilizing an HMM-based ASR system. To create a full framework for SI prediction, we combine our new model-based intelligibility measures with a signal-based measure and add a neural-network-based regression stage. We finally demonstrate that the ultimately proposed NO-Reference Intelligibility (Nori) framework outperforms traditional methods and is broadly applicable, both under noisy conditions and for modeling the perception of hearing-impaired listeners.

The proposed measures and the full Nori framework are explained in detail in the following section. The data used for evaluation are described in Section II D, and the experimental setup and the evaluations are presented and discussed in Sections III and IV.

II. METHODS AND MATERIALS
An automatic speech recognizer delivers, in principle, the same output measure, and it faces the same challenges as a human listener when it comes to recognizing speech in a formalized intelligibility test. In previous studies (Schädler et al., 2016; Schädler et al., 2015; Spille et al., 2018) it has been suggested to use ASR as a predictor of intelligibility and to compare the results directly to the results from human listeners. However, there are fundamental differences between human and machine hearing.

Direct ASR-based predictions are restricted by the quality and the level of detail of the model of the human hearing system, and it is currently not possible to model human hearing accurately enough. Also, adding specific details of human perception does not necessarily contribute to better speech recognition performance: the recognition engine and the language models also differ between ASR systems and human listeners. Furthermore, ASR systems are usually not trained to match the performance (high or low) of human listeners; rather, they aim to achieve the best possible performance overall. This leads to low correlations between the ASR recognition output and the human listener performance in many circumstances.

In order to overcome this problem, we propose to proceed differently in this paper: first, we will compute a set of ASR discriminance scores (here called model-based measures), inspired by ASR confidence measures, and compare them in microscopic SI prediction tasks in order to find the best-performing model-based measure. In addition, we also compute signal-dependent (or signal-based) measures, including the blindly estimated SNR, STOI, ESTOI, mr-sEPSM, and the non-parametric estimation of MI using k-nearest neighbors (MI-KNN) (Taghia and Martin, 2014). This allows for a detailed comparison between the traditional and the proposed approaches. Moreover, a blind estimator of the SNR can be included in our proposed non-intrusive SI prediction framework. This leads to a vector-valued measure that is a combination of the estimated SNR and the best-performing model-based measure. In a second step, we combine each intelligibility measure with a subsequent prediction stage, which maps the respective measure to the expected human outcome in recognizing the speech units.
A. Automatic speech recognition
At the first stage of this ASR system, speech-related features were extracted. Here, the first 13 Mel-frequency cepstral coefficients (MFCCs) and their first-order (Δ) and second-order (ΔΔ) derivatives were used as features. Hamming-windowed frames with a length of 25 ms and a frame shift of 10 ms were used for the MFCC extraction. The sampling frequency was 25 kHz.

We use a conventional, statistical model-based automatic speech recognition (ASR) system in all subsequent work. Similar to the procedure developed previously in our lab (Karbasi and Kolossa, 2017), each word ν was modeled using a linear left-to-right HMM. The number of states was chosen as three times the number of phonemes of the word. A 2-mixture diagonal-covariance GMM was used to model the state output distribution of the HMMs. In order to be able to compute confidence intervals, each experiment was implemented with a 5-fold cross-validation, training and testing the ASR models 5 times on different splits of the dataset. For each experiment and in each fold, the speech material was divided into disjoint training (60%), development (20%), and test (20%) sets. To achieve the highest accuracy, noise-dependent models were trained separately for each SNR. The development sets were only used to control the progress of the training. In each fold, only the test set was used to extract the ASR-based intelligibility measures.
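For illustration, the feature extraction stage could be sketched as follows. This is a minimal sketch using librosa rather than the authors' original toolchain, so implementation details such as the filterbank configuration are assumptions and the exact feature values will differ:

```python
# Sketch of the 39-dimensional MFCC+delta+delta-delta front end:
# 13 MFCCs from 25 ms Hamming windows with a 10 ms shift at 25 kHz.
import librosa
import numpy as np

def extract_features(wav_path, sr=25000):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)   # 25 ms analysis window
    hop = int(0.010 * sr)     # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop,
                                window="hamming")
    delta = librosa.feature.delta(mfcc)             # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)   # second-order derivatives
    # One 39-dimensional observation vector per frame
    return np.vstack([mfcc, delta, delta2]).T
```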
B. Model-based SI measures

There are many ASR-derived measures that can be used to estimate human recognition performance. Based on the central idea that the degree of ASR uncertainty is related to the intelligibility level of the speech signal, we investigate several statistical measures: the time alignment difference (TAD) and the normalized likelihood difference (NLD) (Karbasi and Kolossa, 2017), the entropy (H), the log-likelihood ratio (L), and the dispersion (D).

All of these measures are computed in parallel by the ASR system, which can provide several hypotheses, each with an associated likelihood, as the recognized transcription of the input speech. In contrast to the NLD and TAD (Karbasi and Kolossa, 2017), which require ground-truth time alignments and transcriptions of the signal, the newly proposed discriminative measures are computed based only on the information provided by the ASR and do not require any type of reference. Note that all measures in this paper are extracted at the word level. However, the proposed framework is not subject to any inherent constraints regarding the chosen speech units, and it can also be generalized to a phoneme-based framework in future work.

In the following subsections, the details of the proposed method for extracting the model-based intelligibility measures will be described.
1. Preprocessing
After applying feature extraction to the speech signal, the entire sequence of feature vectors, i.e., the observation sequence O, is divided into segments, each corresponding to one word. To perform this procedure, word boundary information is required. This can be obtained in two ways: if the reference word alignment information is available, it can be employed directly to divide the observation sequence into words. If not, the automatic speech recognizer can be used to produce estimated word alignments through its recognition (see Fig. 1). In order to produce these recognized alignments, the trained HMM set and the grammar are used on the observation sequence in a Viterbi algorithm, yielding not only the recognized word sequence, but also the corresponding word boundaries. The Viterbi algorithm uses dynamic programming to find the HMM state sequence that matches the observation sequence best (Forney, 1973).

FIG. 1. Block diagram of the preprocessing steps to produce observation sequences.

After these steps, the intelligibility measures are calculated separately for each segment O_n of the observation sequence. In the following sections, the segment index n is dropped for simpler notation, but note that all proposed measures are extracted given the segmented observation O_n at the word level.
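The following is a minimal log-domain Viterbi sketch, a generic textbook implementation rather than the authors' code; in the actual system, the search is additionally constrained by the grammar, and word boundaries are read off wherever the best path transitions between word models:

```python
# Log-domain Viterbi: given per-frame state log-likelihoods for an HMM,
# return the best state path (from which word boundaries follow when
# states are grouped by word model).
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """log_obs: (T, S) frame-wise state log-likelihoods,
    log_trans: (S, S) log transition matrix,
    log_init: (S,) log initial state probabilities."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]        # best score ending in each state
    psi = np.zeros((T, S), dtype=int)    # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # (previous, next) state
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(S)] + log_obs[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):        # backtrack along the backpointers
        path[t - 1] = psi[t, path[t]]
    return path
```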
2. Extracting intelligibility measures
We consider five ASR-based measures: the dispersion, the entropy, the log-likelihood ratio, the time alignment difference, and the normalized likelihood difference, which are introduced in detail below.

a) The model-based dispersion represents the degree of uncertainty of the ASR decoder in recognizing a speech signal. The dispersion of the speech signal corresponding to a single word is computed (Estellers et al., 2011) as

D = \frac{2}{N(N-1)} \sum_{k=1}^{N-1} \sum_{l=k+1}^{N} \log \frac{P(\lambda_k \mid O)}{P(\lambda_l \mid O)}.    (1)

Here, P(\lambda_k \mid O) is the probability of the word model \lambda_k given the observation sequence O, where the probabilities are sorted in descending order, with P(\lambda_{k=1} \mid O) as the highest. N is the number of best hypotheses used to compute the dispersion.

In Eq. (1), the model probabilities P(\lambda_k \mid O) are required. Having one HMM per word k, it is possible to compute the likelihood of the observation sequence given the word model, P(O \mid \lambda_k), using the forward algorithm (Rabiner, 1989), and to then obtain the model probability P(\lambda_k \mid O) using Bayes' theorem:

P(\lambda_k \mid O) = \frac{P(O \mid \lambda_k) \, P(\lambda_k)}{P(O)}.    (2)

The prior probability of the models, P(\lambda_k), can be acquired using a language model. In our experiments we use a matrix test for evaluation. Since the prior probability of each model P(\lambda_\nu) is thus equal for all possible words, and since the probability of the observation sequence, P(O), is independent of \lambda, Eq. (1) can be reformulated in this work as

D = \frac{2}{N(N-1)} \sum_{k=1}^{N-1} \sum_{l=k+1}^{N} \log \frac{P(O \mid \lambda_k)}{P(O \mid \lambda_l)}.    (3)

The steps required for extracting the HMM-based dispersion are shown in Algorithm 1 as an overview. First, the likelihood of the observation O is needed for all word models \lambda_\nu, \nu = 1 \ldots V. These likelihoods need to be sorted in descending order, of which the N highest values are used in Eq. (3) to compute the dispersion.

Algorithm 1: Compute the model-based dispersion for the observation sequence O_n, corresponding to the word position n in a sentence.
1. Compute the likelihood of O_n, given all possible word models for the word position n, using the forward algorithm: P(O_n \mid \lambda_\nu), \nu = 1 \ldots V_n;
2. Sort all likelihoods P(O_n \mid \lambda_\nu) in descending order;
3. Compute the dispersion from the N highest likelihoods via Eq. (3).

b) The second measure proposed as a model-based intelligibility measure is the entropy, which can also be considered an indicator of the ASR confidence. The entropy is computed for the segmented observation O via

H = \sum_{m=1}^{M} -P(\lambda_m \mid O) \log P(\lambda_m \mid O),    (4)

where M is the number of all possible word models.

c) The log-likelihood ratio L (Karbasi and Kolossa, 2015) is the third proposed model-based measure. L indicates the decoder's discrimination between the first and second best models; L thus corresponds to the dispersion measure with N = 2. For comparison, the results of intelligibility prediction using the L measure are reported in Section III.

d) TAD and e) NLD are the final model-based measures. As introduced in (Karbasi and Kolossa, 2017), TAD is defined as the difference between the recognized and the ground-truth time alignments of the input signal. NLD is the normalized likelihood difference between the first and second most likely models given the ground-truth transcription. Both measures require knowledge of the ground-truth transcription and time alignments and cannot be computed without a time-alignment and transcription reference.
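Under the equal-prior assumption of the matrix test, the three reference-free measures above can all be computed from the vector of per-word-model log-likelihoods. The following is an illustrative sketch, not the authors' implementation; in particular, the softmax conversion of log-likelihoods to posteriors in the entropy relies on the equal-prior assumption:

```python
# Model-based measures from log-likelihoods log P(O | lambda_v):
# dispersion D (Eq. 3), entropy H (Eq. 4), log-likelihood ratio L.
import numpy as np
from itertools import combinations

def dispersion(log_liks, n_best=5):
    top = np.sort(log_liks)[::-1][:n_best]   # N highest log-likelihoods
    pairs = combinations(range(len(top)), 2)
    # log(P_k / P_l) = log P_k - log P_l for each pair k < l
    return 2.0 / (len(top) * (len(top) - 1)) * sum(
        top[k] - top[l] for k, l in pairs)

def entropy(log_liks):
    # With equal priors, posteriors are the softmax of the log-likelihoods
    p = np.exp(log_liks - np.max(log_liks))
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def llr(log_liks):
    top2 = np.sort(log_liks)[::-1][:2]
    return float(top2[0] - top2[1])          # equals dispersion with N = 2
```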
C. Signal-based SI measures

In addition to the above model-based measures, the SNR is estimated as a signal-dependent measure (ŜNR). To estimate the SNR, the clean speech and noise signal powers are required. Here, the Improved Minima Controlled Recursive Averaging (IMCRA) algorithm and the Wiener filter are used to estimate the power of the clean signal, ŝ(t), and of the noise, n̂(t), without the need for a reference. ŜNR is then calculated as

\widehat{\mathrm{SNR}} = 10 \log_{10} \frac{\sum_t \hat{s}^2(t)}{\sum_t \hat{n}^2(t)}.    (5)

As baseline intrusive measures, STOI, ESTOI, MI-KNN, and mr-sEPSM values are also extracted and used for comparison. To compute these measures (except ŜNR), the reference alignments have been used to divide the signals into word units. We will investigate and compare their performance to find the strongest baseline for our setup. Note that none of these methods is explicitly designed to predict the performance of hearing-impaired listeners, but we will use the best-performing measure as the baseline for predicting the hearing-impaired listeners as well.
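A minimal sketch of Eq. (5), assuming the clean-speech and noise estimates have already been produced by the IMCRA/Wiener stage (which is not shown here):

```python
# Blind SNR estimate in dB from estimated speech and noise signals.
import numpy as np

def estimated_snr_db(s_hat, n_hat, eps=1e-12):
    # Ratio of estimated speech power to estimated noise power
    return 10.0 * np.log10((np.sum(s_hat ** 2) + eps) /
                           (np.sum(n_hat ** 2) + eps))
```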
D. Databases
We used three speech intelligibility databases to evaluate the performance of the considered measures. These databases are all based on the speech signals from the Grid corpus (Cooke et al., 2006), and we used associated intelligibility scores collected in listening tests with human subjects. The Grid corpus originally contains clean speech signals and their time alignments. In total, there are 34,000 clean speech signals in this corpus, collected from 34 English speakers at the University of Sheffield. Each Grid utterance is a semantically unpredictable 6-word sentence with a fixed grammar: Verb(4)-Color(4)-Preposition(4)-Letter(25)-Digit(10)-Adverb(4), where the numbers in parentheses represent the number of available choices for each word position (e.g., "place blue at F 9 now").
1. The original Grid intelligibility database (DB1)
The first database (DB1) is a noisy version of the original Grid corpus, the 'Grid speech intelligibility database' (Barker and Cooke, 2007). It contains noisy signals created by adding speech-shaped noise (SSN) to clean Grid signals at 12 different SNRs, from -14 dB up to 6 dB in steps of 2 dB, plus one condition at 40 dB, labeled as 'clean'. Speech-shaped noise is created as Gaussian noise shaped with a long-term average spectrum identical to that of an averaged speech signal from the Grid corpus. DB1 also includes the listening test results from 20 normal-hearing listeners (NHL). The participants' responses to 2000 utterances and the ground-truth transcriptions are provided for the keywords color, letter, and digit.
2. The Grid intelligibility database with crowd-sourcing (DB2)
A second noisy version of the Grid speech signals, referred to as DB2, was used in addition. The intelligibility scores in this database were collected by the authors in previous work (Karbasi et al., 2016a). In this database, in addition to speech-shaped noise (SSN), tokens with white noise and babble noise were also created, with the babble noise taken from the AURORA database (Hirsch and Pearce, 2000), containing a mixture of speech signals from a crowd of both female and male English speakers.

To obtain human speech recognition results for the newly generated data, separate listening tests were carried out for the signals of each noise type in a large-scale listening experiment, using crowd-sourcing tests at CrowdFlower, Inc. (2016). Every test participant was asked to transcribe a set of 22 audio signals covering different SNR conditions, ranging from -10 dB to 6 dB in steps of 2 dB, where the noise type was fixed for a given test set. In order to prevent memory effects, we ensured that the same utterance was only utilized once within a given test set.

Experimentation in a weakly controlled environment like crowd-sourcing offers many benefits, specifically the potentially large number of participants that can be approached, but this comes at the price of unknown variability in the sampling. In order to control the quality of the responses, additional control steps are therefore necessary to ensure, first, that only participants are recruited who are qualified for the task (e.g., they need to have sufficient language skills) and, second, that participants are concentrating during the tests. English proficiency was established in a self-reporting questionnaire at the beginning, and to test concentration, each test set also contained 4 clean utterances. These were randomly interspersed between the actual test signals. For the analysis, only results were used where at least 50% of the control utterances were correctly transcribed. Later analysis demonstrated that the minimum accuracy of listeners who passed this threshold, and were therefore used for further analysis, was above 70%. The hearing status of listeners in DB2 was not tested, because it is very difficult to objectively measure or verify the listeners' hearing ability in crowd-sourcing experiments.

The transcriptions were recorded in a multiple-choice experiment using a web-based graphical user interface. Each contributor was allowed to participate multiple times, but was restricted to 6 test sets. Participants were paid $
3. The Grid intelligibility database for hearing-impaired listeners (DB3)
In order to evaluate the performance of the proposed measures in predicting the performance of hearing-impaired listeners (HIL), an additional intelligibility database was collected by conducting listening tests with hearing-impaired participants at the University of Southampton. The resulting third database (DB3) contains the same noisy Grid tokens as DB1, but with hearing-impaired listeners' intelligibility scores.

All participants in this study were native English speakers and regular users of hearing aids. During the test, they were not wearing their hearing aids. First, pure tone audiometry (PTA) was performed to measure the hearing thresholds. The better ear was used in the speech test. In total, 9 listeners (3 female and 6 male) aged from 62 to 79 years took part in this study and were paid for their participation. The study was approved by the local ethics committee. Information on all participants, including their individual audiometric thresholds, is listed in Tab. I. Speech from DB1 was presented, so the stimuli contained SSN at SNRs ranging from -6 dB to 6 dB in steps of 2 dB, plus the clean signals. The stimuli were presented via circumaural headphones (Sennheiser HD380pro) in a quiet room. The equipment was calibrated with clean speech to a presentation level of 65 dB SPL using a sound level meter (Brüel & Kjær 2260) and an artificial ear simulator (Brüel & Kjær 4153).

The participants were asked to repeat the Grid keywords (color, letter, and digit) after they had heard the whole sentence, and the experimenter recorded their answers for each keyword. A short training session was performed prior to the main test in order to familiarize the participants with the task and the stimuli.
III. EXPERIMENTS AND RESULTS
Three sets of experiments were conducted to evaluate the performance of the proposed intelligibility measures for different tasks:

A. Microscopic prediction of intelligibility for NHLs,
B. Macroscopic prediction of intelligibility for NHLs,
C. Microscopic prediction of intelligibility for HILs.

The overall goal of the experiments was to predict the performance of human listeners in keyword recognition tasks using ASR-driven measures and to analyze their performance based on both microscopic and macroscopic models. For comparison, the intrusive signal-based measures STOI, ESTOI, MI-KNN, and mr-sEPSM were also computed using the noisy speech and its clean counterpart. For every noisy speech utterance in each database considered in the current work, there is a clean counterpart from the original Grid corpus; those pairs are used in the computation of the intrusive measures.

FIG. 2. Block diagram of the mapping stage, including the training and testing phases that map the intelligibility measure to actual prediction values.
A. Microscopic SI prediction for normal-hearing listeners
For the microscopic evaluation, the human listeners' performance was predicted using the estimated intelligibility measures as described in detail below, and the outcomes are reported as 'accuracy' values, that is, the percentage of words for which the model predicts the human listener's performance correctly, either as "recognized" or as "not recognized".

In order to obtain the predicted intelligibility, a mapping was performed between the speech intelligibility measure under test and the results of the human speech recognition experiments, cf. Fig. 2. For that purpose, a binary classification neural network (NN) was trained to predict the human keyword recognition outcomes using the MATLAB patternnet toolbox. A feed-forward network was employed, using the cross-entropy as the cost function. The network had one hidden layer with 10 neurons. The input size was defined by the intelligibility measure(s) used for training and testing: in our experiments, it was either one scalar input or two, in the case of Nori. The NN was trained using the respective intelligibility measure for each token and its corresponding label, taken from the listening test results. The binary label was defined on the word level, depending on whether or not the word was correctly recognized by the human listener. A 7-fold cross-validation was performed in this step so as to be able to compute confidence intervals for the evaluation results. In each fold, 70% of the data was used for training the network, 15% for validation, and 15% for evaluation. The network was trained individually for each intelligibility measure, for each noise condition, and over all SNRs.
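The paper uses MATLAB's patternnet for this stage; an equivalent sketch using scikit-learn's MLPClassifier (whose defaults, e.g. the activation function, differ from patternnet) could look as follows:

```python
# Feed-forward binary classifier with one hidden layer of 10 neurons,
# trained with a cross-entropy (log-loss) criterion to map intelligibility
# measures to the human keyword recognition outcome.
from sklearn.neural_network import MLPClassifier

def train_mapping(measures, outcomes):
    """measures: (n_words, n_features) array, e.g. one column for D and
    one for the estimated SNR in the Nori case; outcomes: 0/1 labels
    (word recognized by the listener or not)."""
    net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000)
    return net.fit(measures, outcomes)

# Predicted per-word intelligibility on held-out data:
# si_hat = train_mapping(train_x, train_y).predict(test_x)
```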
TABLE I. Audiometric thresholds of the tested ears for each listener.

                             Thresholds (dB HL) at frequency (kHz)
ID  Age  Gender  Tested ear  0.25  0.5  1   2   4   8
L1  73   Male    Right       15    10   15  20  50  75
L2  79   Male    Right       35    30   40  50  65  70
L3  66   Female  Left        20    20   35  45  35  40
L4  72   Male    Right       0     5    15  45  65  60
L5  62   Male    Left        10    15   20  25  50  60
L6  65   Female  Left        65    65   65  60  60  80
L7  72   Male    Left        15    30   40  50  60  65
L8  72   Male    Right       20    25   35  30  60  50
L9  65   Female  Right       15    20   25  20  15  20
1. Preliminary experiments
Prior to the main evaluation, two pilot experiments were conducted, using the data of DB1 only. In the first of these, the best value for N in Eq. (1) was determined in a microscopic intelligibility prediction task, with N varying between 2 and 8 (Fig. 3). The best microscopic SI prediction accuracy for the dispersion was achieved with N = 5 hypotheses, and N = 5 was thus selected for all further processing. This indicates that (at least for this dataset) the likelihood differences between the 5 best hypotheses contain the highest amount of information regarding the associated intelligibility, and that adding more hypotheses does not provide further benefit.

FIG. 3. Accuracy of the measure dispersion (D) for different values of N.

In the second pilot experiment, the average accuracy of all proposed model-based intelligibility measures was computed to pre-select the most accurate model-based measure. The speech signal was divided into word segments using the true alignments for the intrusive measures. For the non-intrusive measures, ASR-recognized alignments were used to divide the signal. The results of the second pilot experiment are shown in Tab. II. Among all investigated measures, the dispersion (D) shows the highest accuracy with 84.09% and 85.89%, using the recognized and true alignments, respectively. Also, it can be seen that all measures perform better using the true alignments than using the recognized ones. As described above, D is computed using the five highest model likelihoods. In contrast, the entropy H considers all possible model likelihoods in estimating the decoder uncertainty and can contain redundant information, leading to a loss of accuracy. On the other hand, the log-likelihood ratio L takes only the two highest model likelihoods into account and might therefore underestimate the amount of uncertainty in the ASR. The comparison between the results gained by using the true alignments versus the recognized alignments also shows that the correctness of the time alignments has the biggest impact on the performance of TAD. This outcome was expected, since TAD is computed directly from the time alignments. The dispersion D, in contrast, is less influenced by the correctness of the alignment information.

As a result of this preliminary assessment, we chose the dispersion D computed with the recognized alignments as the most appropriate reference-free model-based measure for the remainder of this paper, and in the following we compare its performance against signal-based measures in various test scenarios under different noise conditions. In total, in addition to the baseline intrusive methods, we investigated three non-intrusive intelligibility measures: the dispersion D, the estimated SNR (ŜNR), and a combination of D and ŜNR (the combination used in Nori).

TABLE II. Results of the second pilot experiment, for the pre-selection of metrics: average accuracy (%) of the proposed model-based intelligibility measures in predicting the performance of 20 normal-hearing listeners from DB1 in the Grid keyword recognition task. NLD = normalized likelihood difference, TAD = time alignment difference, D = dispersion, H = entropy, L = log-likelihood ratio.

Alignment              NLD    TAD    D      H      L
True alignment         78.85  80.88  85.89  81.25  79.71
Recognized alignment   76.16  77.65  84.09  78.44  78.64
2. Microscopic SI prediction results
In this section, intelligibility measures computed from the normal-hearing listener databases DB1 and DB2 are used for evaluation. All results are averaged over all listeners of each database. The results are organized by the type of keyword in the sentence. Tab. III shows the accuracy of all considered measures for DB1. The keywords in the Grid corpus are of three different types: colors, letters, and digits. The keyword types differ with respect to their degree of difficulty, mainly due to different perplexity (4 different choices for colors vs. 25 letters and 10 digits) and duration. We expected the highest intelligibility prediction accuracy for the color category, since it also contains the longest words, which helps in the decoding phase. The lowest intelligibility estimation accuracy was expected in the letter category due to its high perplexity and because the letters are relatively short, both contributing to a higher degree of difficulty. The accuracy results in Tab. III show that the combined (reference-free) Nori system delivers a slightly better performance than the reference-based STOI, specifically in the letter and digit categories; however, the difference from STOI is not statistically significant. On average, it can also be seen that the single measures, the dispersion D or ŜNR alone, perform less well. Apparently, the information captured by D and ŜNR, which are combined in the Nori framework, jointly improves the accuracy of intelligibility prediction, without the need for a clean reference.

Among the intrusive measures, mr-sEPSM had the lowest accuracy in predicting the performance of listeners in SSN. The extended version of STOI (ESTOI) and the mutual information-based measure (MI-KNN) did not achieve a higher accuracy than STOI in this microscopic evaluation.

The intelligibility prediction performance of the methods is assessed by computing their accuracy in predicting the listening test results at different SNRs. The results for STOI, as the best intrusive method, and for all non-intrusive methods are shown in Fig. 4. All considered measures perform worst at around -10 dB SNR. This corresponds roughly to the human SRT, which is at -10.31 dB in DB1. This implies that (at least using our methods) predicting the intelligibility of a speech signal is most difficult at SNRs where human performance is around 50%. Accuracy rises with higher SNRs as expected, but it also rises towards lower SNRs. STOI performs slightly better than D at almost all SNRs, except for three low SNRs. For comparison, in addition to D, which is computed using the recognized alignments, we also include a dispersion measure computed with the true alignments. It shows better performance at lower SNRs, but the improvement is very small at higher SNRs. This demonstrates the importance of accurate word boundaries in computing the dispersion measure. The dispersion computed with the true alignments still performs slightly worse than STOI at higher SNRs. At SNRs from -6 down to -14 dB, ŜNR shows lower accuracy than D; at higher SNRs, however, it performs slightly better. The Nori framework, taking advantage of both measures, D and ŜNR, performs well at all SNRs.

FIG. 4. Accuracy of all considered intelligibility measures in predicting the keyword recognition performance of all listeners in DB1 with SSN data, using a feed-forward NN in the mapping stage.

TABLE III. Average accuracy of all considered intelligibility measures in predicting the performance of 20 normal-hearing listeners from DB1 in a keyword recognition task, categorized by keyword type.

Keyword type  mr-sEPSM  STOI   ESTOI  MI-KNN  D      ŜNR    Nori
Color         86.92     88.84  88.21  87.18   88.53  87.22  88.67
Letter        66.85     82.72  80.81  77.25   79.25  81.52  83.02
Digit         77.75     85.34  83.35  79.21   84.49  84.44  85.75
Average       77.17     85.63  84.12  81.21   84.09  84.39  85.81
The above experiment was based on data from DB1, which contains only one noise type (speech-shaped noise) at 12 different SNRs. Incorporating more types of noise in the speech distortion is important for a more comprehensive evaluation of the proposed Nori method, since different types of distortion can influence the accuracy of intelligibility prediction methods differently.

Therefore, in the next experiment, the database DB2 was used for a further investigation of the proposed measures. DB2 consists of speech signals distorted with three different noise types: speech-shaped noise, white noise, and babble noise, with listening test results collected by crowd-sourcing, as detailed in Sec. II D 2. The average accuracy of the intelligibility measures is shown in Fig. 5. Here, the dispersion measure D, both on its own and in combination with ŜNR, outperforms the STOI measure in all noise conditions, especially in SSN and white noise. The combined Nori measure shows the highest prediction accuracy among all tested measures. Predicting intelligibility in white noise is the most difficult task; all white-noise results show lower accuracy in comparison to babble and speech-shaped noise. According to Fisher's exact test, the gain in accuracy obtained by using either D or D+ŜNR in the full Nori framework over the STOI measure is statistically significant at a level of p < 0.01 in conditions with SSN or white noise. In babble noise, however, the accuracy achieved with the proposed measures is not statistically different from that based on the STOI measure.

FIG. 5. Average accuracy of all considered intelligibility measures in three different noise types, SSN, white, and babble noise, in predicting the performance of all listeners in DB2.

Among all considered intrusive methods, STOI showed the best performance on DB1. In the subsequent experiments, we therefore only evaluated and compared our proposed measures against STOI as the best-performing intrusive measure.
B. Macroscopic SI prediction for normal-hearing listeners
In addition to the microscopic evaluation, we also analyzed the correspondence between the predicted and the ground-truth intelligibility, i.e., the average human word recognition scores.
1. Experimental setup
As the intelligibility measures are extracted per keyword here, their corresponding ground-truth data are binary. For macroscopic evaluation, however, a continuous distribution of the intelligibility scores is required. In our databases, each file contains one Grid utterance, and each utterance contains 3 keywords. Before evaluation, the speech files were randomly divided into segments consisting of 10 files. Then, all measures, extracted per keyword in each segment, were averaged. This was repeated for every SI measure, and the same averaging process was also applied to the corresponding human word recognition scores. This created continuously distributed data for the macroscopic evaluation.

As evaluation metrics, the averaged normalized cross-correlation coefficient (NCC), Kendall's Tau (τ), and the root mean square error (RMSE) were used to compare the predicted intelligibility and the human word recognition scores across different SNRs. NCC and RMSE only provide valid estimates when their input variables have a linear relationship. However, it is possible to linearize the relationship between the machine-derived and the human listening test results by estimating a mapping function. To estimate such a function, both neural networks and logistic function estimation were tested. Since the logistic function had a lower accuracy than the neural network regression, only the NN results are reported here. For NN training and testing, fitnet, the MATLAB shallow neural network toolbox, was used with the mean square error as its default cost function. A mapping function was estimated for each intelligibility measure using a network with one hidden layer with 10 neurons. For every measure and each noise type, a separate mapping was estimated over all SNRs. Since utilizing a mapping function can influence the final evaluation results, a metric that does not require linearity, namely Kendall's Tau, was also included in the evaluation. Kendall's Tau is computed between the rank orderings within two data sets without requiring a mapping function. Similar to the microscopic experiments, the network parameter estimation and evaluation were performed in a 7-fold cross-validation. This allowed us to use all available data for evaluation and to achieve more reliable results. The results reported here are the average values of the evaluation metrics computed over the ones obtained for each fold.

This evaluation was performed with normal-hearing listener data from DB1 and DB2.
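The three evaluation metrics can be sketched as follows, assuming pred and wcs hold the segment-averaged predicted intelligibility and human word correct scores; computing NCC as the Pearson correlation coefficient is an assumption of this sketch:

```python
# Macroscopic evaluation metrics: NCC, Kendall's Tau, and RMSE.
import numpy as np
from scipy.stats import kendalltau, pearsonr

def macroscopic_metrics(pred, wcs):
    ncc = pearsonr(pred, wcs)[0]    # normalized cross-correlation
    tau = kendalltau(pred, wcs)[0]  # rank agreement, needs no mapping
    rmse = np.sqrt(np.mean((np.asarray(pred) - np.asarray(wcs)) ** 2))
    return ncc, tau, rmse
```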
2. Macroscopic SI prediction results
In the following experiments, our intelligibility prediction, based on all considered SI measures, is compared to the human speech recognition accuracy, given as the word correct score (WCS). The WCS is computed by dividing the number of correctly recognized keywords by the total number of keywords.

The amount of correlation between the intelligibility prediction and the corresponding WCS (ground-truth intelligibility) in all conditions taken from DB1 and DB2 is shown in Fig. 6 in terms of NCC, τ, and the root mean square error (RMSE). The average performance evaluation shows that D+ŜNR, as used in the Nori framework, again performs better than the individual measures D and ŜNR in terms of both correlation and error.

The average performance of all tested intelligibility measures was also evaluated separately for the three groups of noisy data present in DB2, also shown in Fig. 6. In the case of white and babble noise, Nori always outperforms STOI in terms of all three evaluation metrics. For SSN, Nori performs slightly worse than STOI, however, with the added benefit of predicting the intelligibility non-intrusively. Similar results were achieved using the SSN data from DB1. ŜNR has the lowest performance under all noise conditions.

FIG. 6. Comparison of intelligibility measures regarding their predictive performance for listening tests, with noise types SSN, babble, and white noise from DB2, in terms of NCC (%), RMSE, and Kendall's Tau (τ).

The speech reception threshold (SRT), i.e., the SNR at which the speech recognition accuracy is 50%, is another measure to describe the intelligibility of a signal. It is computed based on the psychometric function over many different SNRs, but it does not provide detailed information about the intelligibility of the signal at each SNR separately. Fig. 7 shows the SRTs computed using the listening test data collected in DB1 and DB2. The estimated SRTs in SSN show large differences between DB1 (-10.1 dB) and DB2 (-6.9 dB). The listening tests in DB1 were conducted in a controlled environment with normal-hearing listeners, while such a level of control was not possible in the crowd-sourcing experiment; some differences are therefore expected. The human results from DB2 show the highest SRT in babble noise (at -4.8 dB) and the lowest in white noise (-8.8 dB). As expected, babble noise is the most effective masker and white noise is the least effective masker for human listeners. On the other hand, predicting the human SRT is most difficult in white noise compared to babble and SSN. Comparing the predicted SRTs in each noise type shows that using the dispersion measure D yields results closest to human performance in conditions with SSN and white noise, where the direct ASR-based prediction and STOI do not perform as well. For babble noise, the SRT predicted by D is slightly less accurate than those based on the STOI and ASR values. Also, the results show that adding the estimated ŜNR to D does not improve the prediction accuracy in this case. Based on the results for SSN (DB1), it can be seen that using ASR as a direct predictor overestimates the SRT and performs less accurately than the other intelligibility measures in this case.

FIG. 7. Predicted SRTs using all considered intelligibility measures, also including direct ASR recognition results, in comparison to the ground-truth SRT computed from the human listening test results. All SRT values are reported for four groups of speech data: SSN in DB1, SSN in DB2, white noise in DB2, and babble noise in DB2.
C. Microscopic SI prediction for hearing-impaired listeners
In the final set of experiments, we investigated how well we can predict whether an individual hearing-impaired listener will be able to understand specific utterances or words.

In order to simulate the individual hearing status, personalized features were extracted (Kollmeier et al., 2016) for the ASR-based SI measures. To do this, the audiogram data was used to set thresholds in computing the logarithmic Mel-scale spectrograms during the MFCC feature extraction; prior to this, all speech signals in DB3 were amplified to the same presentation level. An example of these personalized feature maps is shown in Fig. 8: the Mel-scale spectrogram of (a) a clean speech signal, (b) its equivalent representation at 4 dB SNR, and (c) the threshold-adapted version. It can be seen how raised hearing thresholds cause a loss of information, specifically at higher frequencies.
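A minimal sketch of such threshold-adapted features: the listener's audiogram is interpolated to the Mel band center frequencies and applied as a floor on the log-Mel spectrogram before the DCT step of the MFCC computation. The conversion of dB HL thresholds to the log-Mel energy scale is left abstract here and is an assumption of this sketch:

```python
# Apply audiogram-derived thresholds to a log-Mel spectrogram.
import numpy as np

def apply_audiogram(log_mel, mel_center_hz, audiogram_hz, thresholds):
    """log_mel: (n_mels, T) log-Mel spectrogram; audiogram_hz/thresholds:
    the measured frequencies and (suitably converted) hearing thresholds
    for one listener, in the same units as log_mel."""
    floor = np.interp(mel_center_hz, audiogram_hz, thresholds)
    # Raised thresholds mask everything below them in each band
    return np.maximum(log_mel, floor[:, None])
```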
1. Experimental setup
The outcomes are reported as 'accuracy' values, that is, the percentage of words for which the model predicts the HIL performance on a single word correctly. Similar to the microscopic experiments on normal-hearing listeners, an NN-based mapping was applied to the SI measures before computing the accuracies. In this experiment, for each listener and noise condition, a separate mapping NN was trained over all SNRs. For the model-based measures, each listener was modeled separately, i.e., all HMMs and GMMs were trained for the specific listener. The signal-based measures, STOI and ŜNR, however, were computed without further processing to model the hearing loss.
2. Results for HI listeners
The intelligibility prediction results using STOI, D, ŜNR, and D+ŜNR as proposed in the Nori framework are shown in Tab. IV. The results are reported individually and also on average. They show that Nori outperforms STOI on average and also individually for participants L2, L6, and L9. For the other HILs, Nori performs at almost the same level as STOI.

TABLE IV. Average accuracy in predicting the hearing-impaired listeners' speech recognition performance.

HIL ID  STOI   D      ŜNR    Nori
L1      80.83  80.41  77.91  79.16
L2      70.00  66.04  66.04  73.12
L3      74.79  73.95  70.83  74.16
L4      74.58  76.25  68.12  74.58
L5      83.33  82.70  78.54  81.66
L6      65.35  70.53  61.42  69.82
L7      70.53  68.92  60.35  70.35
L8      77.50  72.32  72.85  74.82
L9      75.17  77.85  75.35  76.96
Mean    74.67  74.33  70.15  74.95

In evaluations of the FADE framework (Schädler et al., 2015), it has been shown that audiogram data alone is not sufficient for modeling the effects of hearing impairment on speech recognition results. In our experiments, however, the audiogram data provided a significant benefit for accurately predicting the speech perception of HILs, even without employing a reference signal and despite the fact that our simple hearing impairment model only applies raised thresholds and ignores all other potential problems of the auditory system, such as widened filters or reduced temporal gap detection. With access to further supra-threshold hearing loss information, the feature extraction algorithm could, and should, be adapted to achieve an even better prediction of individual listener performance.

FIG. 8. Short-time Mel-scale spectral representation of one speech signal in different scenarios: (a) clean, (b) noisy, and (c) threshold-adapted noisy spectrum.
IV. DISCUSSION
We have introduced a novel reference-free approach to speech intelligibility prediction, and we have evaluated its performance for both normal-hearing and hearing-impaired listeners in various noise conditions.

Our approach is based on an intelligibility measure that we derived from the discriminability (or confidence) information within an automatic speech recognition (ASR) system, namely the model-based dispersion D. Given that speech intelligibility is affected by many internal and external factors, we also took signal-based intelligibility information into account. It has been shown in previous studies that the SNR is one such indicator of speech intelligibility. Accordingly, we combined the discriminability score D with the estimated ŜNR to provide an improved measure.
Nori method performs well in predicting the performance ofnormal-hearing and hearing-impaired listeners in a noisykeyword recognition task and that
Nori can even out-perform the often-used reference-based STOI measure.The results of evaluation with SSN data from DB1(Fig. 4) show that this approach increased prediction ac-curacy in most SNRs. The average performance of allconsidered measures are statistically different from eachother at a level of p < .
05 (Table III) with
Nori outper-forming STOI in two out of three conditions. As shownin Fig. 4, predicting speech intelligibility is most difficultwhen the word recognition rate of listeners is around 50%,i.e., around the speech reception threshold (SRT), wherethe slope of the psychometric function is steepest. Ourinitial experiments demonstrate that if we provide theproposed method with additional reference alignmentsto segment the speech to smaller units, the ASR-baseddiscriminance score D becomes more accurate and out-performs STOI, even without taking S (cid:98) NR into consid-eration. Using the reference time alignments for speechsegmentation also helps the dispersion to be more accu-rate in lower SNRs close to the SRT. This implies anexpected benefit of further robustness improvements inthe ASR system, which will be one goal of future work.The
Nori framework has been developed to predictthe intelligibility of speech in smaller units like words.Among the considered keywords, letters and digits arethe shortest words with the highest perplexity, whichmakes them the most difficult for ASR and also for otherinstrumental measures to predict their intelligibility. Ananalysis based on the word type shows that the proposed
Nori framework is more successful at word-level intelli-gibility prediction in comparison to STOI in these mostdifficult cases of letters and digits (Tab. III).We also investigated the ability of
Nori to predict in-dividual hearing-impaired listeners’ performance in var-ious noise situations. Although our hearing-impairmentmodel is very simple and only involves raised thresholds,
Nori performs equivalently to STOI in some conditions,and is even slightly superior on average. The predic-tion accuracy may be improved further when includingother supra-threshold effects characterizing hearing im-pairment, like wider auditory filters and temporal smear-ing. Note, however, that STOI, in contrast to
Nori ,has not been explicitly designed for modeling hearing im-paired listening.Overall, the proposed intelligibility estimate of the
Nori framework is computed non-intrusively and is suc-cessful in predicting the speech intelligibility microscop-ically as well as macroscopically. Whilst the well-knownSTOI measure requires a clean reference signal, ourframework only requires some time and data for train-ing the ASR models.The proposed method has been evaluated on theGrid corpus, a small-vocabulary dataset with a matrixsentence test structure, and it is hence applicable di-rectly to other similar data. However, the framework isnot limited to matrix tests, but rather it can and shouldbe extended and used for the prediction of intelligibilityat a phoneme level in future work. Consequently, theframework is also extendable to large-vocabulary scenar-ios. This application, however, would call for a large-vocabulary speech dataset with a corresponding, largeset of human listening test results collected for evalua-tion.
V. CONCLUSION
The main goal of this work was to predict microscopic (i.e., word-by-word) speech intelligibility in a non-intrusive manner, i.e., without access to any clean reference signal. The dispersion, extracted as a discriminance measure from ASR models, together with the blindly estimated ŜNR, were introduced as non-intrusive measures and embedded as predictors into the introduced Nori framework for NO-Reference Intelligibility estimation. The evaluation is based on a large amount of normal-hearing listeners' data, showing that the Nori framework can outperform STOI in predicting NHLs' word recognition performance, and that it correlates well with human data in terms of NCC, τ, and RMSE in most conditions. Overall, Nori performs accurately and can predict the word-level speech intelligibility precisely, without the need for any extra information such as the clean reference signal during the prediction process.

Finally, it was shown that it is feasible to use the proposed measure in predicting the performance of hearing-impaired listeners, reaching an accuracy equivalent to that of STOI. However, to better model the hearing impairments, it will be a goal of future work to take supra-threshold effects into account in computing the front-end features for the employed ASR models. To evaluate the framework with large-vocabulary data, it will be necessary to collect larger databases of real-life speech that are annotated with human listening results; this task, while difficult, is deemed by the authors to be vital to move the field forward towards real-life on-line intelligibility optimization.

Research in speech communication systems requires the ability to assess speech intelligibility rapidly. Experiments with human listeners are time-consuming and costly, and they are unrealistic for big-data and learning-driven approaches and for on-line adaptation. Therefore, estimated speech intelligibility needs to be available and reliable as a stand-in measure. We have demonstrated here that reference-free algorithms are a viable option for this purpose, which can considerably widen the space for system optimization towards real-life and user-adaptive speech enhancement.

ACKNOWLEDGMENTS
ACKNOWLEDGMENTS

This research has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. [317521]. The authors would like to thank Jon Barker for providing a noisy version of the Grid database with comprehensive listening test results.

Andersen, A. H., de Haan, J. M., Tan, Z., and Jensen, J. (2018). "Nonintrusive speech intelligibility prediction using convolutional neural networks," IEEE/ACM Trans. Audio, Speech, Language Process. 26(10), 1925–1939.
ANSI/ASA, ed. (1997). S3.5 Methods for the Calculation of the Speech Intelligibility Index (ANSI, New York, NY, USA).
Barker, J., and Cooke, M. (2007). "Modelling speaker intelligibility in noise," Speech Communication 49(5), 402–417.
Cooke, M. (2006). "A glimpsing model of speech perception in noise," J. Acoust. Soc. Am. 119(3), 1562–1573.
Cooke, M., Barker, J., Cunningham, S., and Shao, X. (2006). "An audio-visual corpus for speech perception and automatic speech recognition," J. Acoust. Soc. Am. 120(5), 2421–2424.
CrowdFlower, Inc. (n.d.). "CrowdFlower".
Estellers, V., Gurban, M., and Thiran, J.-P. (2012). "On dynamic stream weighting for audio-visual speech recognition," IEEE/ACM Trans. Audio, Speech, Language Process. 20(4), 1145–1157.
Fingscheidt, T., and Bauer, P. (2013). "A phonetic reference paradigm for instrumental speech quality assessment of artificial speech bandwidth extension," in Proc. 4th International Workshop on Perceptual Quality of Systems.
Forney, G. D. (1973). "The Viterbi algorithm," Proc. IEEE 61(3), 268–278.
French, N., and Steinberg, J. (1947). "Factors governing the intelligibility of speech sounds," J. Acoust. Soc. Am. 19(1), 90–119.
Gao, J., and Tew, A. (2015). "The segregation of spatialised speech in interference by optimal mapping of diverse cues," in Proc. ICASSP, pp. 2095–2099.
Hirsch, H.-G., and Pearce, D. (2000). "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in ASR2000 – Automatic Speech Recognition: Challenges for the new Millenium, ISCA Tutorial and Research Workshop (ITRW).
Huber, R., Spille, C., and Meyer, B. T. ( ). "Single-ended prediction of listening effort based on automatic speech recognition," in Proc. Interspeech, pp. 1168–1172.
Jensen, J., and Taal, C. H. (2014). "Speech intelligibility prediction based on mutual information," IEEE/ACM Trans. Audio, Speech, Language Process. 22(2), 430–440.
Jensen, J., and Taal, C. H. (2016). "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Trans. Audio, Speech, Language Process. 24(11), 2009–2022.
Jørgensen, S., and Dau, T. (2011). "Predicting speech intelligibility based on the envelope power signal-to-noise ratio after modulation-frequency selective processing," J. Acoust. Soc. Am. (4), 2384–2384.
Jørgensen, S., Ewert, S. D., and Dau, T. (2013). "A multi-resolution envelope-power based model for speech intelligibility," J. Acoust. Soc. Am. 134(1), 436–446.
Jürgens, T., and Brand, T. (2009). "Microscopic prediction of speech recognition for listeners with normal hearing in noise using an auditory model," J. Acoust. Soc. Am. 126(5), 2635–2648.
Jürgens, T., Fredelake, S., Meyer, R. M., Kollmeier, B., and Brand, T. (2010). "Challenging the speech intelligibility index: macroscopic vs. microscopic prediction of sentence recognition in normal and hearing-impaired listeners," in Proc. Interspeech, pp. 2478–2481.
Karbasi, M., Abdelaziz, A. H., and Kolossa, D. (2016a). "Twin-HMM-based non-intrusive speech intelligibility prediction," in Proc. ICASSP, pp. 624–628.
Karbasi, M., Abdelaziz, A. H., Meutzner, H., and Kolossa, D. (2016b). "Blind non-intrusive speech intelligibility prediction using twin-HMMs," in Proc. Interspeech, pp. 625–629.
Karbasi, M., and Kolossa, D. (2015). "A microscopic approach to speech intelligibility prediction using auditory models," in Proc. DAGA 2015, pp. 16–19.
Karbasi, M., and Kolossa, D. (2017). "ASR-based measures for microscopic speech intelligibility prediction," in Proc. 1st International Workshop on Challenges in Hearing Assistive Technology (CHAT-2017).
Kollmeier, B., Schädler, M. R., Warzybok, A., Meyer, B. T., and Brand, T. (2016). "Sentence recognition prediction for hearing-impaired listeners in stationary and fluctuation noise with FADE: Empowering the attenuation and distortion concept by Plomp with a quantitative processing model," Trends in Hearing 20, 1–17.
Li, G., Lutman, M. E., Wang, S., and Bleeck, S. (2012). "Relationship between speech recognition in noise and sparseness," International Journal of Audiology 51(2), 75–82.
Lightburn, L., and Brookes, M. (2015). "SOBM – a binary mask for noisy speech that optimises an objective intelligibility metric," in Proc. ICASSP, pp. 5078–5082.
Rabiner, L. R. (1989). "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE 77(2), 257–286.
Schädler, M. R., Warzybok, A., Ewert, S. D., and Kollmeier, B. (2016). "A simulation framework for auditory discrimination experiments: Revealing the importance of across-frequency processing in speech perception," J. Acoust. Soc. Am. 139(5), 2708–2722.
Schädler, M. R., Warzybok, A., Hochmuth, S., and Kollmeier, B. (2015). "Matrix sentence intelligibility prediction using an automatic speech recognition system," International Journal of Audiology 54(sup2), 100–107.
Spille, C., Ewert, S. D., Kollmeier, B., and Meyer, B. T. (2018). "Predicting speech intelligibility with deep neural networks," Computer Speech & Language 48, 51–66.
Steeneken, H. J., and Houtgast, T. (1980). "A physical method for measuring speech-transmission quality," J. Acoust. Soc. Am. 67(1), 318–326.
Taal, C., Hendriks, R., Heusdens, R., and Jensen, J. (2011). "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE/ACM Trans. Audio, Speech, Language Process. 19(7), 2125–2136.
Taghia, J., and Martin, R. (2014). "Objective intelligibility measures based on mutual information for speech subjected to speech enhancement processing," IEEE/ACM Trans. Audio, Speech, Language Process. 22(1), 6–16.
Tang, Y. (2014). "Speech intelligibility enhancement and glimpse-based intelligibility models for known noise conditions," Ph.D. thesis, Universidad del País Vasco.
Tang, Y., Cooke, M., Fazenda, B. M., and Cox, T. J. (2016). "A metric for predicting binaural speech intelligibility in stationary noise and competing speech maskers," J. Acoust. Soc. Am. 140(3), 1858–1870.
Ullmann, R., Doss, M. M., and Bourlard, H. (2015). "Objective speech intelligibility assessment through comparison of phoneme class conditional probability sequences," in Proc. ICASSP, pp. 4924–4928.