Thoughts on the potential to compensate a hearing loss in noise
Marc René Schädler
Medizinische Physik and Cluster of Excellence Hearing4all, Universität Oldenburg, [email protected]
February 25, 2021
Abstract
The effect of hearing impairment on speech perception was described by Plomp (1978) as a sum of a loss of class A, due to signal attenuation, and a loss of class D, due to signal distortion. While a loss of class A can be compensated by linear amplification, a loss of class D, which severely limits the benefit of hearing aids in noisy listening conditions, cannot. Quite a few users of hearing aids keep complaining about the limited benefit of their devices in noisy environments. Recently, in an approach to model human speech recognition by means of a re-purposed automatic speech recognition system, the loss of class D was explained by introducing a level uncertainty which reduces the individual accuracy of spectro-temporal signal levels. Based on this finding, an implementation of a patented dynamic range manipulation scheme (PLATT) is proposed, which aims to mitigate the effect of increased level uncertainty on speech recognition in noise by expanding spectral modulation patterns in the range of 2 to 4 ERB. An objective evaluation of the benefit in speech recognition thresholds in noise using an ASR-based speech recognition model suggests that more than half of the class D loss due to an increased level uncertainty might be compensable.
Keywords: theoretical audiology, speech perception modeling, impaired hearing, hearing loss compensation
Introduction
To this day, hearing aids without directional amplification or directional noise suppression provide their users only with limited benefit in noisy listening conditions. This limited benefit has long been known and was extensively described and put into context by Plomp (1978). There, the effect of impaired hearing on speech recognition performance was described as a sum of two fundamentally different classes of hearing loss: class A, which accounts for an attenuation of the signal, and class D, which accounts for a distortion of the signal. While the class A loss is defined such that it can be fully compensated by a suitable linear amplification of the signal, the class D loss was assumed to be level-independent, that is, it cannot be compensated by linear amplification. The class A loss, according to Plomp (1978), usually makes up the largest part of the total increase in speech recognition thresholds (SRTs) in quiet (A+D), which is why hearing aids provide the largest benefits in quiet environments. In noisy environments with sufficiently high levels, the contribution of the class A loss diminishes, and the contribution of the class D loss dominates the total loss. Plomp (1978) estimated that, on average, the class A loss accounts for approximately two-thirds and the class D loss for approximately one-third of the total loss, where huge individual variability was expected. A main drawback of this model is that it does not provide a specific hint on whether and how the class D loss can be compensated. Recently, Kollmeier et al. (2016) proposed a modification to the feature extraction stage of the simulation framework for auditory discrimination experiments (FADE, Schädler et al., 2016a) to implement a class A loss as an absolute hearing threshold and a class D loss as a level uncertainty.
The simulation approach with FADE employs a re-purposed automatic speech recognition (ASR) system to predict the outcome of a speech recognition test, such as the matrix sentence test (Kollmeier et al., 2015). The FADE approach was already successfully used to predict the outcomes of several speech-in-noise recognition experiments (Schädler et al., 2015, 2016b) as well as the outcomes of basic psycho-acoustic experiments (Schädler et al., 2016a) for listeners with normal hearing. Kollmeier et al. (2016) proposed to remove the information that is not available to an individual listener with impaired hearing in the feature extraction stage of the ASR system used in the FADE modeling approach.

To induce a class A loss in the model, variations in the internal spectro-temporal signal levels below the individual hearing threshold, determined by the individual audiogram, were removed. This manipulation is illustrated in the center panel of Figure 1, where the low-energy portions (blue/green) were replaced by constant values which are equal to the individual absolute hearing threshold, while the high-energy portions above the individual absolute hearing threshold are unchanged compared to the unmodified representation in the upper panel. It seems plausible that, if all relevant signal portions are above the hearing threshold, this manipulation should have no effect on the predicted speech recognition performance.

To induce a class D loss in the model, random values were drawn from a normal distribution and added to the internal spectro-temporal signal levels, where the standard deviation of the normal distribution was a variable called level uncertainty. This manipulation is illustrated in the lower panel in Figure 1, where all signal portions, including those above the hearing threshold, are affected. Because the signal energy in that representation (which is a logarithmically scaled Mel-spectrogram) is represented in a logarithmic domain, linear amplification cannot be expected to change its effect on the predicted speech recognition performance.

Figure 1: Figure reproduced from Kollmeier et al. (2015). Internal spectro-temporal signal representation (log Mel-spectrogram) as it is used in the FADE modeling approach, of a speech-in-noise mixture (upper panel), and examples of manipulations to it that were introduced to induce a class A hearing loss (center panel) and a class D hearing loss (lower panel).

Kollmeier et al. (2016) evaluated the effect of these manipulations on the predicted outcomes of the German matrix sentence test in a stationary and a fluctuating noise condition for different noise levels, and fitted the A/D-class description proposed by Plomp (1978) to the data. The results, reproduced in Figure 2, clearly show that the two manipulations largely achieved the intended effects, that is, inducing a class A and a class D hearing loss. The left panel shows FADE simulations with different standard audiograms from Bisgaard et al. (2010), which converge for high noise levels, that is, the manipulations can be compensated by amplification as one would expect from a class A loss. The right panel shows FADE simulations with increasing values for the level uncertainty, which do not converge for high noise levels, that is, the manipulations cannot be compensated by amplification as one would expect from a class D loss. An important observation of Kollmeier et al. (2015) was that their empirical data set, which included matrix sentence test results in noise of almost 200 ears, could not be satisfactorily predicted with the class A loss alone, indicating that an implementation of a mechanism that induces a class D loss is needed to explain the speech recognition performance of individual listeners. Schädler et al.
(2020a) extended that approach by inferring the individual frequency-dependent level uncertainty from tone-in-noise detection thresholds, and achieved unprecedented accuracy in the prediction of benefits in SRTs due to different traditional hearing loss compensation schemes in noise (and in quiet). The central assumption there is that the same mechanism that affects individual speech-in-noise perception also affects individual tone-in-noise perception. Other mechanisms that reduce the information encoded in the internal signal representation, like a reduced spectral resolution, might also be considered in this modeling approach. However, Hülsmeier et al. (2020) found that, with FADE, a reduced spectral resolution had only little effect on the simulated SRTs compared to an increased level uncertainty. This indicates that, between a reduced spectral resolution and the level uncertainty, the latter is the more suitable mechanism to implement a class D loss in FADE. If the mechanism that removes the information in the auditory system of listeners with impaired hearing works similarly to the level uncertainty, then the level uncertainty can be seen as the functional counterpart to a compensation strategy for a class D hearing loss. This is exactly what we will assume in the remainder of this contribution to what we consider theoretical audiology. In the context of FADE, this will allow to generate testable hypotheses on the achievable benefit in speech recognition performance due to a possible compensation strategy for a class D loss.
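In code, the two feature-domain manipulations described above can be sketched as follows. This is a minimal illustration of the principle only, not FADE's actual implementation; the function names and array shapes are made up for this sketch.

```python
import numpy as np

def induce_class_a_loss(log_mel, threshold_db):
    """Class A: remove variations below the absolute hearing threshold
    by clipping the log Mel-spectrogram to the threshold."""
    # threshold_db may be a scalar or one value per frequency channel
    return np.maximum(log_mel, threshold_db)

def induce_class_d_loss(log_mel, level_uncertainty_db, rng=None):
    """Class D: add Gaussian 'level uncertainty' in the log (dB) domain.
    A constant gain (linear amplification) shifts all values equally and
    therefore cannot undo this distortion."""
    rng = np.random.default_rng() if rng is None else rng
    return log_mel + rng.normal(0.0, level_uncertainty_db, log_mel.shape)
```

Because the noise is added in the logarithmic level domain, scaling the input signal only adds a constant to `log_mel` and leaves the induced distortion untouched, which is exactly the defining property of a class D loss.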
The aim of this contribution is to:

A) Present an approach which is able to partially compensate a class D loss as implemented with the level uncertainty in FADE, and

B) objectively evaluate this approach and come up with testable quantitative hypotheses on the benefit in noisy listening conditions.

For an effective mitigation of the effect of the level uncertainty on speech recognition performance, three main problems need to be addressed:

1) Which portions, or more precisely, patterns of the speech signal carry the most relevant information and need to be protected?

2) Which signal patterns can be protected, given the strict constraints on the available future temporal context of the signal (approximately 1 ms) in hearing aid applications?

3) How can such a protection be achieved?

Continuing to follow an (assumed) analogy of basic principles in human and machine speech recognition, the literature on robust automatic speech recognition provides hints on which signal patterns are relevant for good (automatic) speech recognition performance in noise. Let us assume that a log Mel-spectrogram, such as it is used for the calculation of the widely used Mel-frequency cepstral coefficient (MFCC) features, is representative of the information that is available to an ASR system. The spectral resolution of such a log Mel-spectrogram, of which an example is depicted in the upper panel of Figure 1, is about 1 equivalent rectangular bandwidth (ERB), and the temporal resolution is 10 ms. The relevant speech information is encoded in the represented spectro-temporal dynamics, that is, the differences of spectro-temporal signal levels over time, called temporal modulations, and over frequency, called spectral modulations. It is remarkable that ASR systems traditionally don't even use the whole information in the log Mel-spectrogram, but work with a reduced spectral resolution compared to the spectral resolution of the human auditory system.
For example, the standard features used for ASR, MFCCs (ETSI, 2007), specifically encode spectral modulation frequencies only up to about 0.25 cycles per ERB; spectral modulation frequencies above this limit empirically don't contribute to automatic speech recognition performance. In line with this finding, the robust Gabor filter bank (GBFB) features use only slightly more than half of the available spectral resolution of the log Mel-spectrograms (cf. Schädler et al., 2012). By omitting the corresponding parts of the feature vector, Schädler et al. (2012) assessed the relative importance of different spectro-temporal modulation frequencies in a robust ASR task. There, it was observed that the highest represented spectral modulation frequencies (the bandwidth of a Mel band there is approximately 1 ERB) do not seem to be important for the considered speech-in-noise recognition task. Before going into more detail on the importance of spectro-temporal modulations for noisy speech recognition, the limited possibilities to manipulate these in the context of hearing aids have to be considered.

Figure 2: Figures reproduced from Kollmeier et al. (2015). Simulated speech recognition thresholds with FADE for a stationary and a fluctuating noise condition at different noise levels. The left panel shows simulations with different absolute hearing thresholds, which induce different class A losses. The right panel shows simulations with different values for the level uncertainty, which induce different class D losses. The embedded tables show the contributions of A and D when the descriptive model of Plomp (1978) was fitted to the data. For further details please refer to the original publication.

The restriction to the availability of approximately 1 ms of future temporal context in hearing aids does not allow for the reliable manipulation of temporal modulations below approximately 500 Hz.
This limit is still way above the temporal modulation frequencies that are represented in the features of ASR systems. With the common analysis window length and shift of 25 and 10 ms, respectively, as used in Figure 1, the upper limit for represented temporal modulation frequencies is around 50 Hz. Hence, the only modulations that can be reliably manipulated are spectral modulations, which require a temporal signal context in the order of 1 ms. Fortunately, spectral modulations seem to play an important role in (automatic) speech recognition. With the already mentioned approach to model human speech recognition performance by means of a re-purposed ASR system, Schädler et al. (2016a) found that explicitly encoding spectral modulations, that is, across-frequency interactions, in the feature vector was required to explain the empirically found benefit when listening in a fluctuating masker. In other words, spectral modulations seem to be especially relevant in fluctuating noise maskers. Combining the answers to problems 1) and 2): Spectral modulations just below 0.25 cycles per ERB, let's say spectral patterns between 2 and 4 ERB, seem to be a good first candidate for protection against the level uncertainty. How can these signal patterns be protected, or at least hardened, against the effect of the internal noise due to the level uncertainty? In the logarithmic level domain (in the log Mel-spectrogram) the noise of the level uncertainty is additive, and one effective way of mitigating the effect of an additive noise is amplification.
This means that the desired spectral modulation patterns should be amplified, that is, expanded, before the noise of the level uncertainty can remove that information. At first glance, an expansion of the signal dynamics in the context of the reduced residual dynamic range of listeners with impaired hearing might seem completely undesirable: The expansion of spectral modulations can result in uncomfortably high and/or inaudibly soft signal levels, which can occur at the same time in different frequency ranges. However, considering the scale (2 to 4 ERB), it is clear that this is a region which is not modified by common approaches to multi-band dynamic compression, where the signal is usually independently compressed in approximately six bands. The traditional approaches to multi-band dynamic compression with less than 9 bands leave spectral patterns with a width of 2 to 4 ERB virtually uncompressed, which might also be interpreted as an indication of the importance of these patterns. If now the expansion of one (small, as we will later see) part of the spectral dynamics increases the total spectral dynamic range of the signal, it might be possible to counter it with a stronger compression of other (less relevant) parts of the spectral dynamics. However, compression should only be applied if it is required, that is, if the input signal dynamic range does not fit into the available output dynamic range. The dynamic range of speech in noise is less than the dynamic range of a clean speech signal. Especially when considering speech in only slightly fluctuating noises at signal-to-noise ratios (SNRs) around 0 dB, the dynamic range of the mixture can be close to the minimum which is required to discriminate words. In this condition, no improvement can be expected from compressing the signal dynamics. Hence, a micro-management of the limited residual dynamic range, to optimize its use for speech-recognition-relevant information, is a desirable goal.
PLATT is an approach which gives preference to spectral modulation patterns that are supposedly relevant for speech recognition, at the expense of spectral modulation patterns which are supposedly less relevant, when mapping an input signal dynamic range to a given output signal dynamic range. The ideas outlined here are part of a German patent (DE 10 2017 216 972); a patent application in Europe is pending. The implementation details of PLATT, which are presented in the Section Methods, were already engineered towards low-delay real-time processing. In a simpler implementation, the representation of the log Mel-spectrogram could be manipulated in the MFCC domain, where the spectral modulations are suitably encoded for this operation, and which in MFCC terminology would be called liftering. The inverse transform of the modified MFCCs, with amplified spectral modulation patterns between 0.25 and 0.125 cycles per band, could be compared to the unmodified log Mel-spectrogram, and a suitable filter response in the time domain could be designed for each signal frame.
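The core of such a liftering operation might look as follows. The mapping of cepstral index q to the spectral modulation frequency q/(2M) cycles per band follows from the DCT-II basis; the expansion factor and band edges used here are illustrative assumptions, not values prescribed by PLATT.

```python
import numpy as np
from scipy.fft import dct, idct

def expand_spectral_modulations(log_mel_frame, expansion=2.0,
                                f_lo=0.125, f_hi=0.25):
    """Amplify spectral modulations between f_lo and f_hi cycles per band
    in one log Mel-spectrogram frame by liftering in the DCT domain."""
    M = len(log_mel_frame)
    c = dct(log_mel_frame, type=2, norm='ortho')
    # cepstral index q encodes the modulation frequency q/(2M) cycles/band
    f = np.arange(M) / (2.0 * M)
    c[(f >= f_lo) & (f <= f_hi)] *= expansion
    return idct(c, type=2, norm='ortho')
```

A spectral ripple whose frequency falls inside the selected band is scaled by the expansion factor, while slower spectral shapes (e.g., the overall spectral tilt) pass through unchanged.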
While such an approach would have the advantage of simplicity (even simpler than the related implementation of an intelligibility-improving signal processing approach (IISPA, Schädler, 2020b)), it would not be applicable without further modifications in a hearing device: The main problem would be the infeasible window length of 25 ms. Because of the highly non-linear nature of the interactions between the factors influencing speech recognition performance (speech material, masker type and level, reverberation, non-linear signal processing, hearing impairment), an implementation that already fulfills the most basic requirements of a hearing aid algorithm and can run on a hearing aid prototype (https://github.com/m-r-s/hearingaid-prototype) was preferred over a simple proof-of-concept implementation. This additional effort makes a seamless translation to an application in a hearing device more likely and increases the meaningfulness of the presented results for a possibly realizable hearing aid solution.

For an evaluation of the implementation with respect to a possible compensation of a class D loss, the following points need to be considered:

1) Which listening conditions, that is, which speech tests and maskers, are suitable to evaluate the PLATT implementation objectively with FADE and (also later) empirically?

2) Which listener profiles are suited to clearly demonstrate a (partial) compensation of a class D loss like it is implemented in FADE?

The first point is important to enable the verification with empirical data of any hypotheses that are based on the model predictions. Schädler et al. (2020a) discussed this point and proposed to use the SRT-50 measured with the matrix sentence test in quiet, in a stationary, and in a fluctuating noise condition, to cover the very different masker properties in typical listening conditions: quiet, low-dynamic maskers, and high-dynamic maskers.
The main reason for using these "laboratory" signals and the SRT-50, instead of real noise recordings and, e.g., the SRT-80, was the known high test-retest reliability of these tests. For individual predictions, the test-retest reliability of the employed speech recognition test sets the lower limit for the achievable prediction error in an evaluation with the data. That means, low measurement errors in the empirical data may facilitate or even enable falsification of the model predictions. An SRT of 0 dB at high noise levels in the test-specific noise condition, that is, with a stationary noise with the same long-term spectrum as the speech signal, can be considered very problematic when normal-hearing listeners can achieve about -8 dB. If only half of the hearing loss in that condition in noise could be compensated (which would be a huge achievement), the measurable benefit would be only 4 dB. Considering that the benefit in SRT is calculated as the difference between two measurements, the targeted error of a single measurement should be less than √(1/2) · 4 dB ≈ 2.8 dB. Such low measurement errors, which can be achieved in SRT measurements with the matrix sentence test, would later enable showing individual benefits without averaging over groups of listeners, given the benefit was 4 dB. When adding to this consideration the need for a level-dependent evaluation, which is required to identify the class D loss according to Plomp (1978), one arrives at the test conditions which were already studied in Kollmeier et al. (2015) and that are depicted in Figure 2. The second point, the selection of suitable listener profiles, is a bit more complex than it might initially appear. A sensible approach would be to take the individual profiles inferred from the psychoacoustic measurements by Schädler et al. (2020a), which are available online (https://doi.org/10.5281/zenodo.4394186).
The main problem with this approach is that, even with a small pure class A loss, the SRTs are generally not level-independent at high levels. This can already be observed for the fluctuating noise condition in the left panel of Figure 2. There, the simulations with a pure class A loss with hearing thresholds according to the standard profile N1 (corresponding to a very mild hearing loss) do not converge with the data of the normal-hearing profile None up to noise levels of 90 dB SPL. Hence, even for very small increases in hearing threshold, amplification improves the SRT in the fluctuating noise condition up to very high presentation levels. With the aim of clearly attributing compensation strategies to the compensation of a class A or a class D loss, this is highly undesirable. To clearly identify the compensation of a class D loss, simple linear amplification alone must not improve the SRT. The reason for the observed model behavior is that, for the Bisgaard profiles, the frequency range above the hearing threshold increases with the presentation level. While the limited frequency range can safely be assumed to be a factor contributing to a class D loss and has to be considered in a suitable listener profile, it must be avoided that the effectively used frequency range changes at high presentation levels. One option would be to low-pass filter the speech material. Another option is to define profiles with very steeply sloping hearing loss functions. The former option would be very suitable for measuring empirical data. The latter option is regarded as cleaner from a modeling perspective, because it reduces the number of parameters that influence the SRT and results in a simpler and possibly better traceable model. Hence, listener profiles with normal hearing thresholds below, and infinite hearing loss above, a given limit frequency are suitable for the considerations in this contribution.

Methods
The methods described in the following were used to simulate speech recognition experiments in stationary and fluctuating noise at different presentation levels for 16 listener profiles with class D hearing losses, without and with the later proposed dynamic range expansion by PLATT, including different degrees of expansion.
Speech recognition tests
The speech material of the (male) German matrix sentence test (Wagener et al., 1999; Kollmeier et al., 2015) was used with two masker signals: the test-specific noise (called OLNOISE) and the fluctuating ICRA5-250 noise signal (Dreschler et al., 2001; Wagener et al., 2006). The matrix test, which exists in more than 20 languages, comprises 50 phonetically balanced common words, of which sentences with a fixed syntax, such as "Peter got four large rings" or "Nina wants seven heavy tables", are built. Typically, the SRT-50, that is, the speech level that is required to correctly recognize 50% of the words, is measured with an adaptive procedure in experiments with human listeners. In this contribution, the speech and masker material was used to predict the SRT-50 with FADE. The test-specific noise (OLNOISE) has the same long-term spectrum as the speech material. It can be assumed to mask the speech signal similarly well across all frequencies. At the SRT for normal-hearing listeners, -7 dB (Hochmuth et al., 2015), this results in a noisy speech signal with a low dynamic range, where the spectro-temporal maxima of the mixtures are dominated by the speech signal. The effect can be observed in Figure 3, where the log Mel-spectrograms of a clean speech signal (upper panel) and the same speech signal with the OLNOISE masker at -7 dB SNR (center panel) are depicted. The ICRA5-250 noise is a speech-shaped noise which is co-modulated with speech-like temporal patterns in three independent frequency bands, where the pause duration was limited to 250 ms (Wagener et al., 2006). The empirical SRTs with this masker signal are usually more than 10 dB lower than in the corresponding test-specific noise condition (Hochmuth et al., 2015). At the SRT for listeners with normal hearing, -19 dB (Hochmuth et al., 2015), this results in a noisy speech signal with a high dynamic range, where the spectro-temporal maxima of the mixtures are dominated by the masker signal.
This can be observed in the lower panel of Figure 3, where the log Mel-spectrogram of a speech signal with the ICRA5-250 masker at -19 dB SNR is depicted. To assess the presentation level dependency, both maskers are considered at presentation levels from 0 to 100 dB SPL in 10-dB steps. At 0 dB SPL presentation level, this corresponds to listening in quiet. The selected listening conditions reflect important dimensions of speech perception: listening in quiet, at low levels, and at high levels, as well as listening in stationary and fluctuating noise. All considered speech tests can also be performed with human listeners.
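The construction of speech-in-noise mixtures at a prescribed SNR, as used in these listening conditions, can be sketched as follows. This is a hypothetical helper based on broadband signal powers; calibration and level conventions of the actual test material are omitted.

```python
import numpy as np

def mix_at_snr(speech, masker, snr_db, rng=None):
    """Mix a speech signal with a randomly chosen masker fragment,
    scaling the fragment so that the broadband SNR equals snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    start = rng.integers(0, len(masker) - len(speech) + 1)
    fragment = masker[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_masker = np.mean(fragment ** 2)
    # scale so that 10*log10(p_speech / p_scaled_masker) == snr_db
    scale = np.sqrt(p_speech / (p_masker * 10 ** (snr_db / 10)))
    return speech + scale * fragment
```

For example, mixing a matrix sentence with an OLNOISE fragment at -7 dB or with an ICRA5-250 fragment at -19 dB reproduces the two normal-hearing reference conditions described above.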
Simulations of matrix tests with FADE
The speech tests considered in Section Speech recognition tests were simulated with an ASR-based approach, and their outcome, the SRT-50, was predicted based on the simulation results. The simulations were performed with the latest standard version of FADE (https://doi.org/10.5281/zenodo.4003779), as described by Schädler et al. (2016a). In this contribution, FADE is used as a tool to predict the outcome of speech recognition tests, where the only changes to the standard setup were the manipulation of the feature extraction stage, which is explained in Section Listener profiles: Class D hearing losses, and the processing of the noisy speech signals with PLATT, as explained in Section PLATT dynamic range manipulation. Hence, the FADE simulation method is only outlined here, and we refer the interested reader to the original description by Schädler et al. (2016a). Predictions with FADE are performed completely independently for each listening condition (masker, masker level, hearing loss compensation, and hearing profile). There is no dependency on any empirically measured SRT, nor on predictions of the same model in other/reference conditions (and hence no need to define such). This means that, with FADE, a single SRT of a speech recognition test for which no empirical data exists can be predicted. For the prediction of one outcome, the following standard procedure as described by Schädler et al. (2016a) was used. An ASR system was trained on a broad range of SNRs with noisy speech material from the considered condition, e.g., the German matrix sentence test in OLNOISE for listener profile "P-8000-1" with no compensation. For this, a corpus of noisy speech material at different SNRs was generated from the clean matrix sentence test material and the masker signal, by mixing randomly chosen masker signal fragments with the speech material. The noisy signals were processed with
Figure 3:
Log Mel-spectrograms of a clean German matrix sentence at 65 dB SPL (upper panel), and of the same sentence in the stationary noise (center panel) and fluctuating noise (lower panel) conditions at the SNRs which correspond to the SRTs of listeners with normal hearing, -7 and -19 dB, respectively.
PLATT when an aided listening condition was considered. From the noisy (and optionally processed) speech signals, features were extracted, where this step included the implementation of the class D hearing loss, as described in Section Listener profiles: Class D hearing losses. Subsequently, an ASR system using whole-word models, implemented with Gaussian Mixture Models and Hidden Markov Models, was trained on the features. This resulted in 50 whole-word models for each training SNR. These models were then used with a language model that considers only valid matrix sentences (of which 100,000 exist) to recognize test sentences on a broad range of SNRs with noisy speech material from the same considered condition. For each combination of a training SNR and a test SNR, the transcriptions of the test sentences were evaluated in terms of the percentage of correctly recognized words. The resulting recognition result map (cf. left panel of Figure 7 in Schädler et al. (2016a) for an example), which contained the speech recognition performance of the ASR system depending on the training and testing SNRs in 3 dB steps, was queried for the SRT. For a given target recognition rate, e.g., 50%, the lowest SNR at which this performance was achieved was interpolated from the data in the recognition result map and reported as the predicted SRT for the considered condition. The whole simulation process, including the creation of noisy speech material, an optional processing of this noisy speech material with PLATT, the feature extraction (which depends on the listener profile), the training of the ASR system, the recognition of the test sentences, and the evaluation of the recognition result map, was (independently) repeated for each considered condition.

Listener profiles: Class D hearing losses
Outcome predictions of speech recognition tests as well as of basic psychoacoustic tests with FADE were found to be close to the empirical results for listeners with normal hearing (Schädler et al., 2016a). As proposed by Kollmeier et al. (2016) and successfully used by Schädler et al. (2020a), impaired hearing was implemented in the ASR system by removing the information from the feature vectors that is presumably not available to listeners with impaired hearing. As discussed in the Introduction, two types of manipulations which induce a class D hearing loss were considered: 1) a limitation of the frequency range, and 2) an increase of the level uncertainty. The effect of both parameters on the log Mel-spectrogram of a clean speech sample is depicted in Figure 4, where in the upper panel, the frequency range was limited to 8000 Hz and the level uncertainty was 1 dB. In the center panel, the level uncertainty was increased to 7 dB, compared to the upper panel. In the lower panel, the frequency range was additionally limited to 2000 Hz compared to the center panel. An amplification of the input signal increases all values in a log Mel-spectrogram by a constant value. Both manipulations introduce a level-independent loss of information and hence induce a class D loss. For the evaluation, upper frequency limits of 1000, 2000, 4000, and 8000 Hz were considered, where the class D loss decreases with high values. For the level uncertainty, frequency-independent values of 1, 7, 14, and 21 dB were considered, where the class D loss increases with
Figure 4:
Illustration of the two considered class-D-loss-inducing log Mel-spectrogram manipulations: Log Mel-spectrograms forlistener profiles “P-8000-1” (upper panel), “P-8000-7” (center panel), and “P-2000-7” (lower panel). The first number in the profileencodes the upper frequency limit, in this example 8000 and 2000 Hz, and the second number indicates the level uncertainy, here1 and 7 dB. high values. All combinations of both parameters result in 16profiles from “P-8000-1” to “P-1000-21”. The former can beexpected to be the best performing one, while the latter canbe expected to be the worst performing one.
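The two described feature manipulations can be sketched in a few lines. The following Python sketch applies a class-D-inducing frequency limitation and level uncertainty to a log Mel-spectrogram, assuming the level uncertainty is realized as additive zero-mean Gaussian noise (standard deviation equal to the uncertainty in dB) on each spectro-temporal bin; the function name and signature are illustrative and not FADE's API.

```python
import random

def apply_class_d_loss(log_mel, center_freqs, f_limit_hz, level_uncertainty_db, seed=0):
    """Sketch of the two class-D manipulations on a log Mel-spectrogram:
    1) discard channels above the upper frequency limit,
    2) add zero-mean Gaussian noise with a standard deviation equal to
       the level uncertainty (in dB) to each remaining bin.
    Illustrative reading of the described scheme, not FADE's implementation."""
    rng = random.Random(seed)
    kept = [i for i, f in enumerate(center_freqs) if f <= f_limit_hz]
    out = []
    for frame in log_mel:  # frames = time steps, each a list of channel levels in dB
        out.append([frame[i] + rng.gauss(0.0, level_uncertainty_db) for i in kept])
    return out
```

For example, a hypothetical profile "P-2000-7" would correspond to `f_limit_hz=2000` and `level_uncertainty_db=7.0`.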
PLATT dynamic range manipulation
In this section, the patented (DE 10 2017 216 972) PLATT dynamic range manipulation, as it was conceived for a later implementation in a hearing device, is described. The implementation was optimized to run in real-time on a Raspberry Pi 3 Model B to enable field studies with mobile hearing aid prototype hardware. The ability to expand spectral modulation frequencies in the range corresponding to spectral patterns of 2 to 4 ERB is a feature that integrates naturally with the approach. Even if not strictly necessary for the goals of this contribution, the method is described here in detail to make statements about its ability to compensate a class D loss in the algorithmic context in which it might later be usable in a hearing device. To motivate the design decisions behind PLATT, which generally aims to preserve relevant speech modulations when compressing the dynamic range of a signal, this subsection comes with its own introductory part.

Introduction to the PLATT concept
Conditions in which the available dynamic range for acoustic communication is reduced are rather the norm than the exception. For example, in a driving car, the lower limit of the available dynamic range is determined by the driving noise. Or, in a library, the upper limit is given by the accepted sound levels in such an environment. And, importantly, the available dynamic range for communication is limited for listeners with impaired hearing. For a successful communication, it may be required to adapt a source signal, which may contain speech and non-speech parts, to the available dynamic range on the receiver side by dynamic range compression. But, in many real-time applications, the available temporal context to perform this operation is very limited. (Examples of mobile hearing aid prototype hardware: https://github.com/m-r-s/hearingaid-prototype or https://batandcat.com/portable-hearing-laboratory-phl.html.)

Multi-band dynamic range compressors used in hearing (aid) research (e.g., Grimm et al., 2015) statically map the input dynamic range to a reduced output dynamic range in a number of independent frequency bands. The compression is often applied with rather short attack time constants, e.g., 20 ms, with the aim to protect the user from high levels, while the release time constants are usually much longer, e.g., 100 ms to 1000 ms, with the aim to limit compression when it is not desirable, i.e., during short speech pauses. However, no distinction is made whether the signal contains speech portions or not. Approaches which depend on a classification of whether or not speech is present in the input signal are prone to errors if the (speech-)signal-to-noise ratio (SNR) is low, that is, just when the classification result is most important. Approaches that require more than a few milliseconds of future temporal context cannot be used in applications which require low latency, such as hearing devices.
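The asymmetric attack and release behavior of such conventional compressors can be illustrated with a one-pole envelope follower that tracks level increases quickly and decreases slowly. This is a generic textbook sketch with the time constants mentioned above, not the implementation of any cited compressor.

```python
import math

def envelope_follower(levels_db, fs, attack_ms=20.0, release_ms=100.0):
    """Generic attack/release smoothing of a level trajectory (in dB):
    fast tracking of level increases (attack), slow tracking of
    decreases (release). A textbook sketch, not a cited implementation."""
    a_att = math.exp(-1.0 / (fs * attack_ms / 1000.0))
    a_rel = math.exp(-1.0 / (fs * release_ms / 1000.0))
    env, out = levels_db[0], []
    for lvl in levels_db:
        a = a_att if lvl > env else a_rel   # asymmetric time constants
        env = a * env + (1.0 - a) * lvl     # one-pole smoothing
        out.append(env)
    return out
```

Because no speech/non-speech distinction enters this recursion, the smoothed level, and hence the applied gain, reacts identically to speech and noise, which is exactly the limitation discussed above.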
Regarding the speech intelligibility of processed signals, compression in few wide frequency bands is preferred over compression in many narrow frequency bands; however, the recommended number of channels varies greatly (usually between 1 and 8) (Plomp, 1988; Dreschler, 1992; Hohmann and Kollmeier, 1995; Yund and Buckles, 1995; Moore et al., 1999; Souza, 2002). The fewer channels are used, the better the spectral dynamic, that is, the spectral contrast or spectral modulation, is preserved. Static dynamic range compression only preserves the spectral modulation within each independent frequency band, but not across bands, even if the dynamic range would be available. Also, fewer frequency channels reduce the need for sharp filters, which would require long integration time constants and introduce additional latency.

Bustamante and Braida (1987) proposed to compress the first two principal components (PC1 and PC2) of the short-term speech spectrum, which were roughly representative of overall level and spectral tilt. With this approach, the frequency bands were not processed independently anymore, and the finer spectral structure was always preserved. Their analysis indicated that the highest intelligibility was obtained when audibility was improved and the relative spectral shapes of different speech sounds were preserved (Bustamante and Braida, 1987). In their concluding section, they recommended to investigate the enhancement of spectral differences while compressing level variations.
Levitt and Neuman (1991) proposed an approach which decomposes and manipulates the short-term spectrum using a set of orthogonal polynomial functions with the aim to preserve important speech cues. Referring to the study of Bustamante and Braida (1987), Levitt and Neuman (1991) wrote: “Both studies showed that compression of the lowest order component (factor 1 in the principal-components method and the constant term in the orthogonal polynomial method, respectively) had by far the largest effect, and that compression of higher order components had little effect, if any.” The common idea behind these two studies was to linearly map and manipulate the spectral dimension of a suitable spectro-temporal representation with the aim of separating important from less important speech signal dynamic. However, both studies considered only clean speech signals, and hence did not consider which portions of the speech signal are relevant for its recognition in noise.

That the signal dynamic can be described as the difference of frequency-dependent short-term effective amplitudes, e.g., across time (temporal dynamic), across frequency (spectral dynamic), or both (spectro-temporal dynamic), raises the question of which representation is most suitable to manipulate it. ASR systems are the technical solution to decode speech signals and hence provide a model for speech recognition. As outlined in the Introduction, the feature extraction stages of ASR systems provide representations of the speech dynamic that are well suited for the robust recognition of speech in noise. However, the often employed basis for the feature extraction stages, the log Mel-spectrogram, is not suited for low-latency signal processing due to its long integration window.
The relatively long integration window of the log Mel-spectrogram serves two objectives: 1) to obtain a sufficiently high frequency resolution to separate low-frequency signal content into approximately 1 ERB-wide bands, and 2) to ensure that in voiced speech portions each signal frame contains at least one pulse (that is, to remove the temporal fine structure of the speech signal). Fortunately, these two aspects (sufficient spectral resolution for low frequencies and limited temporal resolution) are compatible and can be optimized for low-latency processing at the cost of a frequency-dependent group delay. In the following, the design of PLATT, a fast adaptive dynamic range manipulation scheme that takes the mentioned observations into account, is proposed, where the following three objectives were pursued:

• Preservation and enhancement of spectral modulations which are assumed to be relevant for speech recognition
• Low latency and fast reaction time while minimizing audible artifacts
• Adaptive limitation of the compression to the necessary minimum

PLATT consists of three functional parts:

1) An auditory-motivated frequency decomposition and re-synthesis of the audio signal, which allows to manipulate frequency- and time-dependent amplitudes in a perceptually relevant domain and helps to ensure that only limited audible artifacts can be introduced.
2) Extraction of a spectro-temporal representation from the frequency-decomposed signal which is similar to those used for robust ASR and hence suitable for the analysis of the relevant spectral modulations.
3) Adaptive calculation of frequency- and time-dependent gains from the spectro-temporal representation which uses compression only as required to provide high speech recognition performance when the available dynamic range on the output side is limited.

Figure 5 illustrates the relations between the signal processing blocks that were used to implement this functionality. A detailed description is provided in the following.
The exact implementation details are provided in a reference implementation that is written in C (cf. Section Availability of resources).
Frequency decomposition & re-synthesis
The frequency decomposition of the input signal is performed with a filter bank of fourth-order Gammatone filters. To get a set of frequencies which are relevant for (automatic) speech recognition, the center frequencies are chosen equidistantly on a Mel-frequency scale with half the distance that is commonly used for calculating MFCCs. This results in the following 78 + 4 values: (64, 93,) 123, 155, 187, 221, 256, 293, 330, 370, 410, 453, 496, 542, 589, 638, 689, 742, 797, 854, 914, 975, 1039, 1105, 1174, 1245, 1319, 1396, 1476, 1559, 1645, 1734, 1827, 1923, 2023, 2127, 2235, 2346, 2462, 2583, 2708, 2838, 2972, 3112, 3257, 3408, 3565, 3727, 3896, 4071, 4253, 4441, 4637, 4840, 5051, 5270, 5498, 5734, 5979, 6233, 6497, 6771, 7056, 7352, 7658, 7977, 8307, 8650, 9006, 9376, 9760, 10158, 10572, 11001, 11447, 11909, 12390, 12888, 13406, 13943, (14501, 15080) Hz. Only the 78 filters with center frequencies from 123 Hz to 13943 Hz, spaced approximately 0.5 equivalent rectangular bandwidths (ERB), are used. The -10 dB-bandwidth of each fourth-order Gammatone filter is chosen to be equal to the difference of the frequencies two positions right and left of its center frequency, e.g., 221 Hz − 93 Hz = 128 Hz for the filter at 155 Hz.

Figure 5: Diagram illustrating the relations of the main signal processing blocks which were used to implement the signal analysis, manipulation, and re-synthesis with PLATT. (Block labels include: Gammatone filter bank; 15 ms hold with peak/max tracking and 1 dB/ms decay; downsampling; 20·log10(x); modulation band-pass filters; compression/expansion; difference; gains; 10^(x/20); gain rate limit; multiplication; sum; input; output.)
Figure 6: Real part of the impulse responses of a subset of the normalized, phase-adjusted, fourth-order Gammatone filters that were employed for the frequency decomposition, and the (scaled) sum of the impulse responses of all employed Gammatone filters.

The aim is to evenly cover the relevant frequency range with filters that have a bandwidth similar to auditory filters (≈ 1 ERB) and allow a trivial re-synthesis in the time domain by simple summation of all filter bank outputs. With this goal, the filter coefficients are determined as follows: The pole in the complex z-plane that describes the frequency-dependent properties of a first-order infinite impulse response Gammatone filter is calculated as

p = |p| · exp(2πi f_c / f_s), (1)

where the magnitude |p| < 1 is chosen such that the corresponding fourth-order filter attains the -10 dB-bandwidth bw in Hz, f_c is the center frequency of the corresponding fourth-order filter, and f_s the sampling frequency. The phase of the single FIR coefficient of each filter is chosen such that the phases of each pair of fourth-order filters with neighboring center frequencies are identical at the delay where the product of their respective temporal envelopes reaches its maximum. This happens to be the case at different delays for different pairs and minimizes destructive interference between the corresponding filter outputs. In addition, the absolute value of the FIR coefficient of each filter is chosen such that the maximum gain is 2. Together, a) evenly covering frequencies, b) avoiding destructive interference, and c) normalizing the maximum gain result in a very flat frequency response for the sum of all filter bank channels. Figure 6 shows the first 11 ms of the real part of the impulse responses of a subset of the Gammatone filters and also the (scaled) sum over the real parts of all impulse responses.
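A fourth-order Gammatone band can be realized as a cascade of four identical first-order complex one-pole sections with a pole of the form given in Eq. (1). In the following Python sketch, the law relating the pole magnitude to the bandwidth is a standard first-order approximation and not the exact constant of the reference implementation; phase adjustment and gain normalization of the FIR coefficient are omitted.

```python
import cmath
import math

def gammatone_band(x, fc, bw, fs):
    """Fourth-order complex Gammatone band: cascade of four identical
    first-order one-pole sections. The pole magnitude law is an assumed
    approximation, not the constant of the PLATT reference implementation."""
    damping = math.exp(-math.pi * bw / fs)          # assumed magnitude law, |p| < 1
    p = damping * cmath.exp(2j * math.pi * fc / fs)  # phase sets the center frequency
    y = [complex(v) for v in x]
    for _ in range(4):                               # fourth order = 4 cascaded sections
        state = 0j
        section = []
        for v in y:
            state = v + p * state                    # one-pole recursion y[n] = x[n] + p*y[n-1]
            section.append(state)
        y = section
    return y  # complex analytic band signal; the real part is used for re-synthesis
```

Summing the real parts of all band outputs then re-synthesizes the signal, as described above.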
The joint impulse response (sum of the real-valued impulse responses of all normalized, phase-adjusted, fourth-order Gammatone filters) is a downward frequency sweep. The frequency-dependent delay can be read from Figure 6; it is smallest at the highest center frequencies and grows to several milliseconds towards the lowest center frequencies.

Figure 7: Absolute values of the transfer functions corresponding to the impulse responses shown in Figure 6 and the absolute value of the joint transfer function of all filters (including those not shown).

Figure 7 shows the corresponding absolute values of the transfer functions of the same subset of filters and the absolute value of the transfer function corresponding to the joint impulse response of all filters. The joint transfer function of all 78 employed Gammatone filters, which characterizes the system property after re-synthesis if the amplitudes are not manipulated, has a flat frequency response over most of the covered frequency range. That the frequency decomposition and re-synthesis has almost no spectral, and only a limited temporal, effect on processed signals indicates that a perceptually mostly transparent re-synthesis can probably be achieved. The amplitudes of the filter bank output represent perceptually relevant properties of the signal and can be interpreted as a proxy for the displacement of oscillatory systems with properties similar to those of the respective Gammatone filters, e.g., the human basilar membrane. The filter bank output can be manipulated directly, e.g., multiplied with a time- and frequency-dependent gain function, prior to the re-synthesis. The rate of change of the gain functions is limited to 24 dB per period of the corresponding center frequency to limit channel crosstalk.

Spectro-temporal signal representation
A representation of the spectral dynamic which encodes information similar to a log Mel-spectrogram is determined as follows, based on the real-valued output of the filter bank. For each filter output, the values are held for 15 ms and subsequently decay at a rate of 1 dB/ms, if the held value is above the current value. This approach approximately extracts the temporal envelope of each channel while preserving fast increases in amplitude (on-sets). Hence, the exact timing (or temporal fine structure) is removed from this representation, and only the local maximum values remain as an estimate of the maximum amplitude (or displacement) of an oscillatory system with properties similar to those of the employed Gammatone filters. This representation can be down-sampled by any factor which reduces the sample rate to 1/(15 ms) ≈ 67 Hz or higher without missing any local maximum value. The encoded information is very similar to the information encoded in features for ASR, where updated spectral values are determined every 10 ms in frequency bands which are equally spaced on a Mel scale. Unpublished pilot experiments by the author confirm that the proposed representation, down-sampled to 100 Hz, achieves very similar simulation results in a range of speech recognition experiments with FADE. The use in a hearing device, however, requires faster updates, which is why the representation is down-sampled to 1000 Hz, that is, an update of the 78 spectral values is calculated every 1 ms. In Figure 8, the spectro-temporal representation used in PLATT and the log Mel-spectrogram of a clean speech sample at 65 dB SPL are shown. Compared to the log Mel-spectrogram, the proposed spectro-temporal representation has a 10-times higher temporal resolution (visible at the on-sets), where, however, the temporal fine structure is effectively removed. The off-sets are not as prominent because the maximum tracking only allows a decrease by 1 dB/ms = 100 dB/100 ms. Also, the proposed spectro-temporal representation has approximately twice the number of frequency channels, 78 compared to 36 with the log Mel-spectrogram. The aim of the spectral oversampling is to provide headroom for spectral modulation manipulations.

Because a pure tone is not changed in amplitude by a filter of the employed Gammatone filter bank when it matches the filter's center frequency, the calibration of its input and output values is identical. Assuming a calibrated setup, only values above the frequency-dependent normal hearing threshold, as defined by the International Organization for Standardization in its standard 226:2003 (ISO 226, 2003), are considered in the representation; values below the normal hearing threshold are replaced by the corresponding threshold value. This effectively removes spectro-temporal modulations that cannot be perceived by listeners with normal hearing from the proposed spectro-temporal signal representation. The effect can be observed in the high-frequency range in Figure 8, where the speech signal has no energy.
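The per-channel peak-hold envelope extraction described above (15 ms hold, 1 dB/ms decay) can be sketched as follows; the exact reset behavior on new maxima is an assumed reading of the description.

```python
def hold_and_decay(levels_db, fs, hold_ms=15.0, decay_db_per_ms=1.0):
    """Peak-hold envelope tracker for one filter bank channel (sketch):
    a local maximum is held for hold_ms, then decays at decay_db_per_ms
    until the input level catches up again. Assumed reading of the
    described scheme, not the reference implementation."""
    hold_samples = int(round(fs * hold_ms / 1000.0))
    decay_per_sample = decay_db_per_ms * 1000.0 / fs
    tracked, age, out = float("-inf"), 0, []
    for lvl in levels_db:
        if lvl >= tracked:
            tracked, age = lvl, 0                       # new local maximum: restart hold
        else:
            age += 1
            if age > hold_samples:                      # hold period over: decay
                tracked = max(lvl, tracked - decay_per_sample)
        out.append(tracked)
    return out
```

With a 15 ms hold, every local maximum survives at least one sample at any output rate of 1/(15 ms) ≈ 67 Hz or higher, which is why the subsequent down-sampling is lossless with respect to the local maxima.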
Adaptive spectral gain
The adaptive determination of time- and frequency-dependent gains takes into account the current spectral input dynamic and the currently available output dynamic. It aims to minimize the compression under the constraint to avoid masking the signal parts which could carry important (speech) information. It also allows to expand the spectral modulations that are assumed to be important for speech recognition and to trade the such increased signal dynamic against an increased compression of less relevant signal dynamic.

The spectral input dynamic is analyzed with spectral modulation low-pass filters. For this, each vector of spectral values is convolved with Hanning windows of four increasing widths, which approximately correspond to full widths at half maximum (FWHM) of 2, 4, 8, and 16 ERB, respectively, to obtain increasingly spectrally smoothed versions of the initial vector. The left panel of Figure 9 shows an example of the spectral analysis for a signal which consists of two pure tones, at 500 Hz and 2000 Hz. The input spectral values are depicted in black, the smoothed versions with ascending widths of the Hanning window in increasingly lighter shades of gray. The differences between the curves of increasingly smoothed spectral representations are indicated with the digits 1 to 4. Because the difference of two low-pass filters is a band-pass filter, the differences 1, 2, 3, and 4 can be interpreted as the result of a spectral modulation band-pass filtering, with difference 1 containing the highest and difference 4 the lowest resolved spectral modulation frequencies. By definition, the original vector of spectral values (cf. black line in the left panel of Figure 9) can be recovered by adding the differences 1 to 4 to the low-pass filtered spectral values which contain the lowest spectral modulation frequencies (cf. lightest line in the left panel of Figure 9). In the following, this vector of low-pass filtered spectral values is referred to as the base layer. The base layer contains a very coarse description of the spectral dynamic and only resolves spectral patterns larger than approximately 16 ERB. The other layers, the differences 4 to 1, resolve the spectral dynamic which is not described in the base layer with decreasing pattern sizes.

The base layer needs to be mapped from the input dynamic range to the output dynamic range. For the application in a hearing device, the mapping of the base layer could be performed with a prescription rule. In addition, the limits of the input and output dynamic ranges need to be defined. Here, for the input, a dynamic range that is probably relevant for listeners with normal hearing is assumed: a frequency-independent uncomfortable level as the upper limit, and frequency-dependent levels close to the normal hearing threshold as the lower limit. The assumed input dynamic range is indicated with triangles in the left panel of Figure 9.
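The modulation low-pass analysis and layer decomposition can be sketched as follows. The Hanning window lengths are illustrative assumptions (chosen for a 0.5-ERB channel spacing); only the telescoping structure, in which the base layer plus all differences exactly recovers the input, is taken from the text.

```python
import math

def hann(n):
    """Normalized Hanning window of length n (weights sum to 1)."""
    w = [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]
    s = sum(w)
    return [v / s for v in w]

def smooth(spectrum_db, win):
    """Convolve a spectral slice with a normalized window ('same' length;
    edges handled by renormalizing over the valid part)."""
    half = len(win) // 2
    out = []
    for i in range(len(spectrum_db)):
        acc = wsum = 0.0
        for k, w in enumerate(win):
            j = i + k - half
            if 0 <= j < len(spectrum_db):
                acc += w * spectrum_db[j]
                wsum += w
        out.append(acc / wsum)
    return out

def decompose(spectrum_db, widths=(5, 9, 17, 33)):
    """Split a spectral slice into a coarse base layer plus difference
    layers 1..4 (finest to coarsest), following the described modulation
    low-pass analysis. Window lengths are illustrative assumptions.
    Base layer + all differences exactly recovers the input."""
    layers, current = [], spectrum_db
    for n in widths:
        smoother = smooth(spectrum_db, hann(n))
        layers.append([a - b for a, b in zip(current, smoother)])
        current = smoother
    return current, layers  # base layer, [difference 1, ..., difference 4]
```

The exact reconstruction follows from the telescoping sum: each difference is the gap between two successive smoothing stages, so adding them all back to the most strongly smoothed vector cancels every intermediate stage.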
For the output, an exemplary reduced dynamic range is assumed, which is arbitrarily limited to a frequency-independent range of levels, resembling elevated hearing thresholds due to environmental noise or impaired hearing and a lower acceptance of high levels. The targeted output dynamic range is indicated with triangles in the center panel of Figure 9. The mapping of the base layer can be independent from the input and output dynamic range definitions. In the context of a hearing device, the mapping of the base layer will mainly determine the output levels and strongly affect loudness perception, while the lower limit of the output dynamic range (to be chosen related to the individual hearing thresholds) will strongly affect speech recognition performance. The defined output dynamic range is the reservoir that can be used by PLATT to map the input dynamic.

In our example in Figure 9, let's assume the base layer was mapped linearly from the defined input dynamic range to the defined output dynamic range. The base layer is depicted as the lightest gray line in the left panel, and the mapped base layer as the lightest gray line in the center panel. The gains which would be theoretically required to achieve such a smooth output dynamic, given the black line in the left panel as the input dynamic, are depicted as the lightest gray line in the right panel. But such an extreme compression of the spectral dynamic would introduce audible artifacts and would most likely have a negative effect on speech recognition performance. Hence, the more of the remaining original signal dynamic described by the differences 1 to 4 is added back to the mapped base layer, the less compression will be needed, and the better spectral modulation patterns will be preserved. However, unconditionally adding the whole dynamic that is encoded in the differences 1 to 4 could result
Figure 8: Comparison of spectro-temporal representations: in the upper panel, a log Mel-spectrogram, as used in FADE, of a clean speech sample at 65 dB SPL; in the lower panel, the presented spectro-temporal representation, as used in PLATT, of the same sentence.
Figure 9: Example dynamic mapping of two pure tones. Left panel: input dynamic analysis with spectral modulation low-pass filters. Center panel: (conditional) reconstruction with reduced dynamic. Right panel: gains required to map the input dynamic to the reconstruction stages of the output dynamic (final gains are black). Numbers indicate the differences.
in output levels below the lower limit of the output dynamic range, which might not contribute to speech recognition anymore, or above the upper limit of the output dynamic range, which might lead to undesirably high output levels. A good compromise in the fundamental conflict that too much and too little compression can both result in sub-optimal speech recognition performance requires a compression management which depends on the current spectral input dynamic and the currently available output dynamic.

To prefer spectral patterns which are important for robust (automatic) speech recognition, the differences corresponding to high spectral modulation frequencies are added first. The highest spectral modulation frequencies are encoded in difference 1 and account only for a very small part of the total dynamic, which is why it is added unconditionally. The example with two pure tones is an extreme one which assesses the maximum dynamic that can be encoded in each difference, which is about 6 dB for difference 1. The corresponding output dynamic, when only adding difference 1 to the base layer, can be observed in the center panel of Figure 9 as the light gray curve which deviates only slightly from the base layer. Probably the most important spectral modulation frequencies describe spectral patterns between 2 and 4 ERB and are mainly encoded in difference 2, which is why it is also added unconditionally. To protect this difference, which encodes a maximum dynamic of less than 9 dB, against a hearing loss of class D as implemented in FADE with the level uncertainty, it can be expanded by a factor greater than 1 prior to adding it to the base layer.
The expansion of difference 2 could increase the total output signal dynamic, which, however, can often be compensated by an increased compression of the remaining differences 3 and 4, which encode a larger part of the signal dynamic compared to difference 2.

The remaining differences 3, and subsequently 4, are added conditionally and possibly partially according to the following constraints: 1) For each frequency, the respective difference is multiplied with the highest possible factor between 0 and 1 for which its addition to the current output dynamic (which already includes the base layer and differences 1 and 2) does not result in output values above the upper limit or below the lower limit. For example, if the remaining headroom to the limit is 10 dB and the difference is +30 dB, only a third of the difference can be added and the factor is 1/3. This rule is not sufficient because, when applied independently for each frequency band, it would result in a total loss of high spectral modulation frequencies in the frequency regions where no output dynamic range is left to add the difference; the already added spectral modulations need to be protected from being compressed. Hence, 2) the minimum factor (highest compression) is propagated along the frequency axis by adding values of normalized Hanning windows with FWHMs of 6 and 12 samples for differences 3 and 4, respectively. For example, if the factor according to 1) turns out to be zero at some center frequency when adding difference 3, the factors at the neighboring frequencies are limited accordingly and only return to 1 outside the extent of the propagation window. The propagation ensures that neighboring channels are compressed similarly and protects the higher spectral modulation frequencies. The final desired spectral output levels are described by the sum of the mapped base layer, the unconditionally added differences 1 and 2, and the conditionally compressed differences 3 and 4. The frequency-dependent gain needed to achieve the desired spectral output levels is the difference of the spectral output levels (black curve in the center panel of Figure 9) and the spectral input levels (black curve in the left panel of Figure 9). The final frequency-dependent gain is plotted in black in the right panel of Figure 9, along with the partial gains that would theoretically be needed after the cumulative addition of only differences 1 to 3 to the base layer, in increasingly darker gray shades. The final frequency-dependent gain is then applied to the output of the Gammatone filter bank with the limitation of the rate of change to 24 dB per period of the corresponding center frequencies.

Admittedly, in our example with the two pure tones, there is only energy at 500 Hz and 2000 Hz, and hence only the levels of the two tones will be changed, while the gains at other frequencies will have no effect. However, this signal creates a pattern of extreme spectral modulation in the proposed signal analysis and hence is well suited to illustrate how PLATT works. For a signal with only low spectral modulations, no conditional compression would be required.
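Constraint 1), the per-channel limitation of the addition factor, can be sketched as follows; the propagation of the minimum factor along the frequency axis (constraint 2) is omitted from this sketch.

```python
def addition_factors(current_db, diff_db, lower_db, upper_db):
    """For each channel, the largest factor in [0, 1] by which a difference
    layer can be added to the current output levels without leaving the
    output dynamic range (constraint 1 of the described compression
    management; a sketch, not the reference implementation)."""
    factors = []
    for cur, d in zip(current_db, diff_db):
        if d > 0:
            room = upper_db - cur          # headroom towards the upper limit
        elif d < 0:
            room = cur - lower_db          # headroom towards the lower limit
        else:
            factors.append(1.0)            # nothing to add, no restriction
            continue
        factors.append(max(0.0, min(1.0, room / abs(d))))
    return factors
```

With 10 dB of headroom and a difference of +30 dB, this reproduces the factor of 1/3 from the example above.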
Summary of PLATT dynamic range manipulation
With PLATT, compression is only applied if the available output dynamic range is less than required to represent the input signal dynamic. The main effect can be observed in Figure 10, where the calculated gains for high (solid curves) and reduced (dotted curves) spectral input dynamic are shown for an exemplarily reduced output dynamic range, as indicated by the triangles. While the signals with high spectral dynamic are compressed (observed here as different gains for different frequencies) to make relevant signal portions audible and avoid excessively high output levels, the signal with low spectral dynamic is not (observed here as similar gains for different frequencies), because it fits in the available output dynamic range without compression. Only low spectral modulation frequencies, which are less important for speech recognition, are conditionally compressed, while the important higher spectral modulation frequencies can even be expanded. An efficient implementation of spectral modulation filtering in the time domain is possible with reasonably low latency, fast reaction times, and probably few audible artifacts due to the auditory-motivated signal decomposition and re-synthesis.
Compensation of level uncertainty with PLATT
The possibility to selectively expand the spectral modulation patterns between 2 and 4 ERB with PLATT was used to protect these patterns against the effect of the level uncertainty as implemented in FADE. With the aim to decouple the evaluation of the expansion from the conditional compression feature of PLATT as far as possible, the available output dynamic range was set to match the input dynamic range, and the base layer mapping was set to identity. This minimizes compression and effectively disables the fine-grained compression management for all but very low and very high signal levels. Nonetheless, in view of expanding a part of the signal dynamic, such a management will probably be needed in an application with a limited output dynamic range to counter the expansion with a possible compression of other parts of the signal dynamic. Expansion factors of 2, 4, 6, and 8 were considered for the
Figure 10:
Examples of calculated spectral gains for an example with reduced available output dynamic range, as indicated by the triangles. Left panel: spectral input dynamic with two pure tones at different presentation levels (gray shaded solid curves), and with white noise added (dotted curve). Center panel: corresponding spectral output levels. Right panel: corresponding gains.

evaluation, and the corresponding compensation conditions are referred to as PLATT-2 to PLATT-8 in the remainder of the manuscript. The effect of processing noisy speech signals at 0 dB SNR with PLATT-6 is illustrated in Figure 11. In the top row, log Mel-spectrograms of a speech signal in the stationary and fluctuating noise at 0 dB SNR are depicted in the left and right panel, respectively. In the center row, the same log Mel-spectrograms are shown, but with a level uncertainty of 7 dB. In the bottom row, log Mel-spectrograms of the same signals, but processed with PLATT-6, are shown with a level uncertainty of 7 dB. Especially for the stationary noise, some of the speech patterns are better distinguishable in the log Mel-spectrogram with the added level uncertainty after the processing. This justifies the expectation that the expansion might protect important speech patterns against the level uncertainty. With the fluctuating masker, the speech patterns are not as easily distinguishable from the background noise. But comparing the representations of the unprocessed (center) and processed (bottom) signals, more of the patterns that are hidden by the level uncertainty in the unprocessed signal are discernible in the processed variant. This illustrates that the expansion does not selectively process the speech portions of the signal but generally protects the spectral modulation patterns between 2 and 4 ERB against the level uncertainty, independently of what caused these patterns. Also observable, the processing with PLATT-6 results in a possible increase in the output level.
This affirms the need to evaluate the method in a context where simple time-invariant linear amplification does not change the speech recognition performance, and thus to decouple the effect of the expansion from the effect of any linear amplification.

The effects of an expansion of the spectral modulation patterns between 2 and 4 ERB on noisy speech signals described above were quantified in simulation experiments with FADE. For this, SRTs with the German matrix sentence test were simulated for all combinations of the considered noise maskers (OLNOISE, ICRA5-250), noise presentation levels (0, 10, 20, ..., 100 dB SPL), listener profiles (P-8000-1 through P-1000-21), and compensations (none, and PLATT-2 through PLATT-8), summing to a total of 1760 outcome predictions with FADE.
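The expansion step can be sketched as a manipulation of the spectral modulation spectrum of each short-time frame. The following minimal Python sketch is not the PLATT reference implementation; the function name, the assumption of approximately 1-ERB channel spacing, and the FFT-based band selection are illustrative choices. It amplifies spectral modulations with periods between 2 and 4 ERB, that is, spectral modulation frequencies between 0.25 and 0.5 cycles/ERB:

```python
import numpy as np

def expand_spectral_modulations(log_mel, factor, lo=0.25, hi=0.5):
    """Expand spectral modulations with periods of 2 to 4 ERB (sketch).

    log_mel: (channels, frames) log Mel-spectrogram whose channels are
             assumed to be spaced approximately 1 ERB apart.
    factor:  expansion factor applied to the selected modulation band.
    lo, hi:  spectral modulation band in cycles/ERB; periods of 2-4 ERB
             correspond to 0.25-0.5 cycles/ERB.
    """
    n = log_mel.shape[0]
    # Decompose each frame into spectral modulation components
    # (FFT along the frequency axis).
    spec = np.fft.rfft(log_mel, axis=0)
    f_mod = np.fft.rfftfreq(n, d=1.0)  # cycles/ERB for 1-ERB spacing
    band = (f_mod >= lo) & (f_mod <= hi)
    # Amplify the selected band; all other components pass unchanged.
    spec[band, :] *= factor
    return np.fft.irfft(spec, n=n, axis=0)
```

A pure spectral cosine with a period of 4 ERB is scaled by the expansion factor, while the mean level (DC component) is left untouched, which mirrors the intended selectivity of the expansion.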
Evaluation of simulation results
Key simulation outcomes are presented as “Plomp curves”, that is, SRT-50s in dB SPL as a function of the noise presentation level, analogous to Figure 2. On the one hand, this depiction allows assessing the effect of the (noise) presentation level on the predicted SRTs. On the other hand, it also allows quantifying improvements in SRT in aided conditions over a given reference condition, e.g., normal hearing or an unaided condition. Because not all 1760 data points can be presented in graphical form in this contribution, a summary of the achieved improvements in SRT at high presentation levels, at which we can confidently assume that linear time-invariant amplification cannot improve the SRT, is presented in the form of a table for all listener profiles and compensations. For key listener profiles, psychometric functions were obtained by simulating SRT-20 to SRT-90 to assess the SNR-dependency of PLATT. The main interest here was whether the effect of the PLATT compensation is different at higher SRTs, which are more realistic for conversations.
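The condition grid described above can be enumerated to verify the total count of 1760 predictions (a sketch; the profile and compensation labels follow the naming scheme used in this manuscript):

```python
from itertools import product

# Enumeration of the simulated condition grid; profile names follow the
# P-<limit frequency>-<level uncertainty> scheme.
maskers = ["OLNOISE", "ICRA5-250"]
levels = list(range(0, 101, 10))  # 0, 10, ..., 100 dB SPL
profiles = [f"P-{f}-{u}" for f in (8000, 4000, 2000, 1000)
            for u in (1, 7, 14, 21)]
compensations = ["none", "platt2", "platt4", "platt6", "platt8"]

conditions = list(product(maskers, levels, profiles, compensations))
print(len(conditions))  # 2 * 11 * 16 * 5 = 1760
```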
Availability of resources
To facilitate the reproduction of the presented experiments, and to encourage, foster, and accelerate the verification and adoption of the presented methods, the employed resources are provided, as far as licensing allows it. FADE version 2.4.0, which is open source software, was used for the FADE simulations (https://doi.org/10.5281/zenodo.4003779). The code and scripts for setting up the simulations were based on, and are now integrated into, the measurement and prediction framework (https://doi.org/10.5281/zenodo.4500810). This includes:

• The modified feature extraction.
• The reference implementation of PLATT and the used configuration files.
• The scripts which prepare and run the FADE simulations using the modified feature extraction and the reference implementation of PLATT.
• The scripts which evaluate the raw experimental results and plot the results figures.
Figure 11: Illustration of the effect of the dynamic range manipulation with PLATT-6: Log Mel-spectrograms of noisy speech in stationary (left column) and fluctuating noise (right column) at 0 dB SNR (top row), the same log Mel-spectrograms with a level uncertainty of 7 dB (center row), and log Mel-spectrograms with a level uncertainty of 7 dB of the same signals processed with PLATT-6, that is, with an expansion factor of 6 (bottom row). Axes: time in ms vs. frequency in Hz; color scale: level in dB SPL.
This does not include:

• The Hidden Markov Toolkit (used by FADE; http://htk.eng.cam.ac.uk/), which cannot be distributed because of its license.
• The speech material of the German matrix sentence test and the corresponding test-specific noise signal (OLNOISE), which cannot be distributed because of the license.
• The ICRA5-250 noise signal (http://medi.uni-oldenburg.de/download/ICRA/index.html), due to the unknown license conditions.

Results
Effect of limit frequency and level uncertainty
Two modifications were used to implement a hearing loss of class D: the limitation of the frequency range up to a limit frequency, and the increase of the level uncertainty. Their separate effects on simulated SRTs are shown in Figure 12 and Figure 13, respectively.

The limitation of the available frequency range affected the simulated SRTs in the fluctuating noise condition much more than in the stationary noise condition when assuming a level uncertainty of 1 dB (cf. Figure 12). In the stationary noise condition, the limitation of the frequency range to 4000 Hz did not have an effect on the simulated outcome of the German matrix sentence test. A reduction to 2000 Hz resulted in a small, mostly level-independent increase in SRT of about 1 dB, and a further reduction of the frequency range to 1000 Hz resulted in a further increase of about 1 dB. Hence, limiting the frequency range while assuming a level uncertainty of 1 dB had relatively little effect on the simulated SRTs. A different picture can be observed in the fluctuating noise condition. There, the reduction of the frequency range had a large and level-dependent effect on the simulated SRTs. The effect of the reduction increases with higher presentation levels and stabilizes at levels above 60 dB SPL. At these levels, the increase in SRT was about 3, 6, and 15 dB for a limitation to 4000, 2000, and 1000 Hz, respectively.

The increase of the level uncertainty also affected the simulated SRTs in the fluctuating noise condition more than in the stationary noise condition (cf. Figure 13). The effect was not as level-independent at low presentation levels as one might have expected. This indicates an interaction between the implementation of the absolute hearing threshold and the level uncertainty of about 5 dB. This behavior could already be observed in the data from Kollmeier et al. (2016) (cf. right panel in Figure 2).
However, in this contribution, the focus does not lie on the interactions of the level uncertainty with the absolute hearing threshold, but on the effective elimination of the absolute hearing threshold as a factor from the evaluation. In the fluctuating noise condition, this is the case for presentation levels of 70 dB SPL and above, where the differences (cf. panel four in Figure 13) stabilize. At these levels, the increase in SRT was about 4, 9, and 12 dB for level uncertainties of 7, 14, and 21 dB, respectively. In the stationary noise condition, the corresponding increases were lower, with about 1, 4, and 6 dB for level uncertainties of 7, 14, and 21 dB, respectively. Neither implementation alone sufficed to increase the SRT to above 0 dB SNR with the considered parameter values.
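A minimal sketch of how a level uncertainty can degrade the accuracy of spectro-temporal signal levels is to perturb each log Mel feature value (in dB) with independent Gaussian noise; the exact noise model and its place in FADE's modified feature extraction may differ, and the function name and arguments are illustrative:

```python
import numpy as np

def apply_level_uncertainty(log_mel, uncertainty_db, rng=None):
    """Sketch of a class D loss as level uncertainty: each spectro-temporal
    level of the log Mel-spectrogram (in dB) is perturbed by i.i.d. Gaussian
    noise with a standard deviation of `uncertainty_db`."""
    rng = np.random.default_rng() if rng is None else rng
    return log_mel + rng.normal(0.0, uncertainty_db, size=log_mel.shape)
```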
Figure 12: Simulated “Plomp curves” in stationary noise (first panel) and fluctuating noise (third panel) when limiting the available frequency range (to 8000, 4000, 2000, and 1000 Hz) in the feature extraction stage, that is, for profiles P-8000-1, P-4000-1, P-2000-1, and P-1000-1 without compensation. The dotted lines indicate an SNR of 0 dB. The corresponding level-dependent differences in SRT compared to the profile plotted in black (here profile P-8000-1) are depicted in panels two and four.
Figure 13: Effect of increasing the level uncertainty (to 7, 14, and 21 dB), that is, profiles P-8000-1, P-8000-7, P-8000-14, and P-8000-21 without compensation. Analogous to Figure 12.
Figure 14: Effect of increasing the level uncertainty when the frequency range is limited to 1000 Hz, that is, profiles P-1000-1, P-1000-7, P-1000-14, and P-1000-21 without compensation. Analogous to Figure 12.
Figure 15: Effect of mixed profiles with increasing level uncertainty and frequency range limitation, that is, profiles P-8000-1, P-4000-7, P-2000-14, and P-1000-21 without compensation. Analogous to Figure 12.
Effect of PLATT expansion
The effect of the expansion with PLATT on the simulation results for listener profiles P-4000-7, P-2000-14, and P-1000-21 is presented in Figures 16, 17, and 18, respectively. The blue, red, and yellow lines indicate the simulated speech recognition performance with compensations PLATT-2, PLATT-4, and PLATT-6, respectively. The black lines indicate the corresponding unaided reference conditions, while the purple lines indicate the simulated performance of profile P-8000-1, that is, the performance of a listener with normal hearing.

The expansion with PLATT improved the SRTs in almost all simulated conditions. The benefit in SRT due to PLATT was generally higher in the fluctuating noise condition than in the stationary noise condition. Higher expansion factors strongly tended to result in higher benefits, however, not in all conditions. As expected, the observed compensations were partial, that is, the simulated performance of listeners with normal hearing was not restored. For a quantitative analysis, the results are interpreted only at the levels at which an improvement in SRT due to simple linear amplification can be ruled out.

The curves stabilize at high presentation levels, that is, a further increase in level does not improve the SNR anymore. In the stationary noise condition, this is the case for presentation levels of 40 dB SPL and above. But even above this level, the curves show some variability, which is due to the stochastic nature of the simulation process. The plotted data points were derived as the difference of two independent simulations. Assuming a random (normally distributed) prediction error of 0.5 dB, this would lead to a random error of the plotted data points of approximately ±0.7 dB, that is, for 5 of 100 simulation results, on average, the error will be larger than ±0.7 · 1.96 ≈ ±1.4 dB. Hence, “stabilized” refers to reaching a state which may allow to assume that the remaining variability is due to the random error.
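The error propagation behind this estimate can be reproduced in a few lines, assuming independent, normally distributed prediction errors:

```python
import math

# Each plotted point is the difference of two independent simulations,
# each assumed to carry a normally distributed prediction error of 0.5 dB.
single_error = 0.5
difference_error = math.hypot(single_error, single_error)  # sqrt(2)*0.5
# About 5 of 100 results deviate by more than 1.96 standard deviations.
bound_95 = 1.96 * difference_error
print(round(difference_error, 2), round(bound_95, 2))  # 0.71 1.39
```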
In the fluctuating noise condition, the random error is increased because of the generally shallower slope of the psychometric function in this condition. There, the curve of the normal-hearing profile stabilizes at levels of about 70 dB SPL. The curves of the unaided and compensated simulations already stabilize at lower levels. Hence, at noise presentation levels of 70, 80, and 90 dB SPL, an improvement in SRT due to simple linear amplification can be ruled out.

For a break-down of the simulation results, the average improvements at noise presentation levels of 70, 80, and 90 dB SPL were calculated and are reported in Table 1. The table presents the simulated benefits in SRT using PLATT with expansion factors 2, 4, 6, and 8, for all 16 considered listener profiles, sorted by the limit frequency. In the last column, the benefit in SRT that would correspond to restoring normal-hearing (NH) performance is presented as an orientation to assess the proportion of the total class D loss that can be compensated. Positive values indicate an improvement, that is, a lower SRT. An indicative value for the random error of the presented simulation data in Table 1 is 1 dB.

For listener profiles with a level uncertainty of 1 dB (P-*-1), only small benefits were observed. In the stationary noise condition, processing the signals with PLATT had no effect on the SRTs (≤ 0.4 dB). In the fluctuating noise condition, small positive and negative benefits (min. -2 dB and max. 2 dB) were observed; the SRTs for the listener profile P-1000-1 improved (lower SRTs) and the SRTs for the listener profiles P-8000-1 and P-4000-1 increased (negative improvement). This effect was most pronounced with PLATT-8 and PLATT-6, less pronounced with PLATT-4, and could not be observed with PLATT-2, that is, it depended on the expansion factor.
The processing with PLATT, by increasing the amplitude of certain spectral modulations, decreases the prominence of (or “masks”) other modulation patterns, and it introduces a possible leak of information across frequencies, e.g., from above 1000 Hz to below 1000 Hz. While the masking of other modulation patterns could explain the detrimental effect of PLATT for the profiles P-8000-1 and P-4000-1, the
Figure 16: Simulated “Plomp curves” for listener profile P-4000-7 without and with PLATT expansion factors 2, 4, and 6 in stationary noise (first panel) and fluctuating noise (third panel). The dotted lines indicate an SNR of 0 dB. The corresponding level-dependent differences in SRT compared to the profile plotted in black (here the unaided profile P-4000-7) are depicted in panels two and four. The data with profile P-8000-1 (normal hearing) was added as an orientation for normal-hearing performance.
Figure 17: Simulated “Plomp curves” for listener profile P-2000-14 without and with PLATT. Analogous to Figure 16.
Figure 18: Simulated “Plomp curves” for listener profile P-1000-21 without and with PLATT. Analogous to Figure 16.

leakage of information from higher frequencies could explain the improvement for profile P-1000-1. Despite these small variations of the SRTs due to the processing with PLATT in the fluctuating noise condition, one might feel comfortable to state: as expected, the effect of a limited frequency range (an increase in SRT that cannot be compensated by simple amplification) cannot be compensated with the expansion performed by PLATT.

The data in the column NH, that is, the difference in SRT between the simulation with a specific listener profile and the normal-hearing profile, shows again how different the simulation results are in the stationary and fluctuating noise conditions. While limiting the frequency range to 1000 Hz (P-1000-1) only resulted in a surprisingly small increase in SRT in the stationary noise condition, namely by 2.3 dB, the same modification increased the SRT in the fluctuating noise condition by 15.0 dB. This reaffirms the use of different noise maskers to assess speech-in-noise recognition performance.

Interpreting the addition of maskers to a speech signal as a frequency-dependent removal of information, the results indicate that the information in the mid-frequency range (1000-4000 Hz) was much more relevant in the considered fluctuating noise condition (ICRA5-250) than in the stationary noise condition (OLNOISE). In other words, in the stationary noise condition, this mid-frequency portion was mostly redundant at the SRT, and the ASR system could discriminate the 50 words of the matrix sentence test almost equally well using only the low-frequency information. This was not the case

Table 1:
Simulated benefits in SRT compared to the respective unaided conditions, averaged over high presentation levels (70, 80, and 90 dB SPL), where simple linear amplification cannot improve the SRT. NH indicates the normal-hearing listener profile.
Stationary noise (OLNOISE)

Profile     PLATT-2  PLATT-4  PLATT-6  PLATT-8    NH
P-8000-1        0.0      0.1      0.2     -0.1   0.0
P-8000-7        0.5      0.7      0.8      0.7   1.2
P-8000-14       1.0      2.0      2.5      2.7   4.2
P-8000-21       0.8      2.2      3.2      3.9   6.7
P-4000-1        0.1      0.4      0.4      0.2  -0.1
P-4000-7        0.5      1.0      1.3      1.2   1.5
P-4000-14       0.9      2.2      3.0      3.4   4.9
P-4000-21       1.3      3.0      4.0      4.8   7.9
P-2000-1        0.1      0.2      0.2     -0.1   1.2
P-2000-7        0.5      1.1      1.4      1.2   3.1
P-2000-14       0.8      2.0      3.0      3.2   6.5
P-2000-21       1.0      2.5      4.0      4.6   9.7
P-1000-1       -0.1      0.0     -0.1     -0.2   2.3
P-1000-7        0.6      1.4      1.6      1.5   5.0
P-1000-14       0.9      2.4      3.1      3.5   9.0
P-1000-21       1.3      3.4      4.6      5.3  13.1

Fluctuating noise (ICRA5-250)

Profile     PLATT-2  PLATT-4  PLATT-6  PLATT-8    NH
P-8000-1       -0.0     -0.5     -1.8     -2.0   0.0
P-8000-7        1.2      1.9      1.9      1.4   4.6
P-8000-14       1.2      2.9      4.2      4.3   8.7
P-8000-21       1.8      4.7      5.7      5.9  12.5
P-4000-1        0.4     -0.7     -0.8     -1.0   3.2
P-4000-7        1.3      2.7      2.0      2.4   7.2
P-4000-14       2.1      3.7      4.6      4.8  11.4
P-4000-21       2.0      4.7      6.1      7.0  14.7
P-2000-1        0.6      0.4      0.0      0.2   6.5
P-2000-7        1.3      2.5      2.2      2.9   9.7
P-2000-14       1.8      4.2      5.3      5.7  14.3
P-2000-21       1.9      4.6      6.6      7.4  17.9
P-1000-1        0.3      1.0      2.1      1.8  15.0
P-1000-7        1.9      3.3      4.1      5.1  18.8
P-1000-14       1.3      4.3      5.7      6.6  22.4
P-1000-21       2.1      4.9      6.2      7.8  26.3

for the fluctuating noise condition, where the mid-frequency portions contributed a substantial part of the information to achieve low SRTs (say, less than -15 dB). For the following presentation of the benefits with PLATT, it is important to keep in mind that this missing information, by design, cannot be compensated with PLATT. That means, when considering the maximum achievable improvement for a specific profile, the increase in SRT due to limiting the frequency range was disregarded. For example, with profile P-2000-14 in the fluctuating noise condition, the difference to the normal-hearing SRT is 14.3 dB; improving the SRT by 14.3 dB would restore the performance of the normal-hearing profile. When only considering the increase in SRT due to limiting the frequency range, that is, profile P-2000-1, the difference to the normal-hearing SRT is 6.5 dB. Hence, the maximum achievable improvement for profile P-2000-14 is estimated by the difference 14.3 − 6.5 = 7.8 dB.

For profiles with increased level uncertainty (P-*-7, P-*-14, and P-*-21), the average benefits in SRT due to using PLATT were positive and ranged from 0.5 to 7.8 dB. For these profiles, the lowest improvements were found with PLATT-2, and the highest improvements often, but not always, with PLATT-8. Overall, in both noise conditions, a very strong relation between the increase in level uncertainty and the improvement in SRT due to using PLATT could be observed.
Combined, there was a strong general tendency that larger improvements were observed with higher level uncertainty and with higher expansion factors. Exceptions were observed only for the profiles with a level uncertainty of 7 dB (P-*-7), where higher expansion factors did not further increase the improvement. This data supports the sensible expectation that lower expansion factors are sufficient (or even optimal) to compensate lower values of level uncertainty. For profiles with a level uncertainty of 14 or 21 dB, the improvements always increased together with the expansion factor. With high values of level uncertainty, higher expansion factors might further improve the SRT; however, increasing the dynamic of a signal portion eightfold might have undesirable collateral effects which are discussed later.

In absolute terms, the improvements were generally larger in the fluctuating noise condition than in the stationary noise condition. For example, with the extreme profile P-1000-21 and PLATT-8 compensation, the improvement was 5.3 dB in the stationary noise condition, and 7.8 dB in the fluctuating one. However, relating the improvement to the performance with the normal-hearing profile (rightmost column in Table 1), 5.3 dB of a total class D loss of 13.1 dB were compensated in the stationary noise condition and “only” 7.8 dB of a total class D loss of 26.3 dB in the fluctuating one. While this interpretation correctly reflects the proportion of the total class D loss that was compensated, it does not reflect that a part of the class D loss cannot be compensated by the expansion approach with PLATT by design. As explained earlier, the portion of the class D loss due to limiting the frequency range to 1000 Hz is very different in both maskers. To evaluate the achieved improvements with respect to the maximum achievable improvement, the class D loss due to limiting the frequency range needs to be disregarded. In the context of the maximum achievable improvement, PLATT-8 compensated 5.3 dB of (13.1 − 2.3 =) 10.8 dB and 7.8 dB of (26.3 − 15.0 =) 11.3 dB of the class D loss due to a level uncertainty of 21 dB in the stationary and fluctuating noise condition, respectively. For the intermediate mixed profile P-2000-14, PLATT-6 compensated 3.0 dB of (6.5 − 1.2 =) 5.3 dB and 5.3 dB of (14.3 − 6.5 =) 7.8 dB in the stationary and fluctuating noise condition, respectively. And for the least extreme mixed profile P-4000-7, PLATT-4 compensated 1.0 dB of (1.5 − (−0.1) =) 1.6 dB, and 2.7 dB of (7.2 − 3.2 =) 4.0 dB. Hence, in relative terms, over a broad range of assumed parameters, the PLATT expansion compensated at least about half of the class D loss caused by an increase of the level uncertainty.

The absolute improvements due to PLATT tended to increase with an increased limitation of the frequency range. The effect of the level uncertainty increased with an increasing limitation of the frequency range. Based on the presented data, however, it is difficult to make statements about the frequency-dependency of an optimal expansion factor because of the diverse non-linear interactions between the considered parameters.
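The relative-benefit computation used above can be written out explicitly; the values are taken from Table 1 (fluctuating noise, profile P-2000-14 with PLATT-6):

```python
# Class D losses relative to the normal-hearing profile (NH column of
# Table 1, fluctuating noise ICRA5-250), in dB:
loss_p2000_14 = 14.3  # frequency limit 2000 Hz + 14 dB level uncertainty
loss_p2000_1 = 6.5    # frequency limit 2000 Hz alone
benefit_platt6 = 5.3  # simulated benefit for P-2000-14 with PLATT-6

# The loss due to the frequency limitation alone cannot be compensated
# by the expansion and is therefore disregarded:
achievable = loss_p2000_14 - loss_p2000_1  # 7.8 dB
fraction = benefit_platt6 / achievable
print(round(achievable, 1), round(fraction, 2))  # 7.8 0.68
```

The same computation applied to the other quoted profiles yields the compensated proportions of roughly one half to two thirds reported in the text.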
Figure 19: Segments of psychometric functions for simulations with listener profile P-4000-7 without and with PLATT expansion factors 2, 4, 6, and 8 in stationary noise (left panel) and fluctuating noise (right panel). The dotted and dashed lines indicate a word recognition rate of 50% and 80% correct, respectively.
Figure 20: Segments of psychometric functions for simulations with listener profile P-2000-14 without and with PLATT. Analogous to Figure 19.
SNR-dependency of benefits in SRT
We (humans) generally prefer to have conversations at higher SNRs than the SRT-50, that is, at SNRs at which more than 50% of the words can be correctly recognized, which might correspond to something like the SRT-80. The SRT-80 can be simulated, but it cannot be as efficiently and accurately measured in listening experiments as the SRT-50, because of the shallower slope of the psychometric function at the SRT-80. The main reason for simulating the SRT-50 was that it can be accurately measured in later listening experiments with human listeners, and accurate measurements are a requirement to show effects as small as 1 dB. To assess if the improvements would (at least according to the model) translate to improvements at SRTs preferred in real conversations, the psychometric functions of aided and unaided conditions were compared. Segments of psychometric functions were obtained by evaluating simulations at SRT-20, SRT-25, ..., SRT-90. Figures 19, 20, and 21 present the unaided and aided psychometric functions for the mixed profiles P-4000-7, P-2000-14, and P-1000-21, respectively. The simulation results in the unaided conditions are plotted in black, the corresponding simulation results with PLATT-2, PLATT-4, PLATT-6, and PLATT-8 expansion in blue, red, yellow, and purple, respectively. As expected, the slopes in the fluctuating noise conditions (right panels) were shallower than the slopes in
Figure 21: Segments of psychometric functions for simulations with listener profile P-1000-21 without and with PLATT. Analogous to Figure 19.

the stationary noise condition (left panels). Also, as expected, the data points in the fluctuating noise conditions were noisier, due to the greater variability in the spectro-temporal distribution of the masker energy. This variability could be decreased by increasing the amount of training data and testing data. Within the uncertainty due to this variability, no reduction in improvement between SRT-50 and SRT-80 can be observed for listener profile P-4000-7. For profile P-2000-14 (in Figure 20), where the improvements are larger, there was a trend towards a slight increase in slope with higher expansion factors. This indicates that the improvements of the SRT-80 due to the expansion with PLATT might be slightly larger than the corresponding improvements of the SRT-50. This trend was confirmed by the data with the profile P-1000-21 in Figure 21. According to these simulations, the expansion with PLATT was found to improve the simulated SRT-80 to the same extent as, or even more than, the SRT-50.
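The relation between SRT-50 and SRT-80 follows directly from the assumed shape of the psychometric function. A sketch with a logistic function (the slope values below are illustrative assumptions, not fitted to the simulation data):

```python
import math

def srt(p_target, srt50, slope):
    """SNR at which a logistic psychometric function reaches p_target.

    p(SNR) = 1 / (1 + exp(-4 * slope * (SNR - srt50))), where slope is
    the slope at the 50% point in 1/dB."""
    return srt50 + math.log(p_target / (1.0 - p_target)) / (4.0 * slope)

# Offset of the SRT-80 above the SRT-50 for an assumed steep (stationary
# noise) and shallow (fluctuating noise) psychometric function:
print(round(srt(0.8, 0.0, 0.17), 1))  # 2.0 dB above the SRT-50
print(round(srt(0.8, 0.0, 0.08), 1))  # 4.3 dB above the SRT-50
```

The shallower the slope, the farther the SRT-80 lies above the SRT-50, which is why a benefit that persists (or grows) at the SRT-80 matters for conversational SNRs.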
Discussion
The presented experimental results were derived using a model of auditory perception, more specifically, a model of impaired human speech recognition based on automatic speech recognition. The modeling approach with FADE brings assumptions about the impaired human speech recognition process into a form in which they can be tested by comparing predictions with empirical data. The employed model, the framework for auditory discrimination experiments (FADE) as it was used by Schädler et al. (2020a), was already evaluated with respect to predictions of the individual aided speech recognition performance of listeners with impaired hearing. There, as elaborated in the Introduction, an important assumption was that the part of the hearing loss which cannot be explained by the absolute hearing threshold, that is, the missing piece to describe the effect of hearing loss on speech recognition in noise, can be explained by the level uncertainty. Another assumption was that this model parameter also affects tone-in-noise perception and hence its value could be inferred from tone-in-noise detection tests. While the results supported this hypothesis, the evidence was not sufficient to rule out other mechanisms that would also increase the SRTs in noise. This is a fundamental problem in modeling individual speech recognition performance. While the quantity that is predicted by the model, the SRT, can be measured in experiments with human listeners, the outcome of such measurements still depends on many correlated and non-linearly interacting parameters, some of which cannot be controlled very well, such as, e.g., attention. And even if the experimental results were measured with the most accurate methods, the measurement errors include this uncontrollable (human) variability, which can increase the amount of data required to falsify hypotheses to infeasible regions. This is especially true for hypotheses which predict relatively small effects.
Hence, there is reasonable doubt about whether the removal of information in listeners with impaired hearing is really well described by the level uncertainty, or if it just coincidentally increased the SRT in the correct conditions.

To more specifically test if the level uncertainty is suitable to describe the effect of hearing loss on speech recognition in noise, a promising approach is to interact with it. The expansion of PLATT was specifically designed to interact with, namely to compensate, the effect of the level uncertainty. The modeled data clearly showed this interaction. If this specific interaction were found in empirical data, it would strongly support the hypothesis of the existence of a mechanism in the human auditory system similar to the level uncertainty. Beyond the academic interest in the suitability of the assumptions to describe impaired human speech recognition performance, a positive result would have immediate practical implications for the design of hearing loss compensation strategies.

Let us remember that the goal was not to test if the expansion with PLATT improves the speech recognition performance of listeners with impaired hearing. For that, measurements with listeners with impaired hearing will be necessary. The aim of this contribution was to:

A) present an approach which is able to partially compensate a class D loss as implemented with the level uncertainty in FADE, and
B) objectively evaluate this approach and come up with testable quantitative hypotheses on the benefit in noisy listening conditions.

The latter aim, in other words, was to guide the planning of the measurements with listeners with impaired hearing towards optimal evidence. Hence, the following discussion is oriented towards the planning of a suitable experiment.
What is a realistic class D loss?
Plomp (1978) did not distinguish between mechanisms causing the postulated class D hearing loss which he used to describe hearing loss in noise. His D component described the total loss in SRT in noise due to impaired hearing, also referred to as speech hearing loss D (SHL D). He reported average empirical values of SHL D from five investigations which range from as low as 1 dB to above 10 dB (cf. TABLE III in Plomp (1978)). In stationary noises (average speech spectrum noise, white noise, and airplane cockpit noise) the maximum average values for SHL D were 6 to 8 dB. In fluctuating noises (interfering talker and voice babble) the maximum average values were larger, up to 14 dB. The increase in SRT in noise was related to the increase in SRT in quiet (SHL A+D) in FIG. 8 in Plomp (1978), which showed that every 3-dB increase of SHL A+D comes with a 1-dB increase of SHL D. This data indicated that, on average, about a third of the hearing loss in quiet (SHL A+D) was incompensable by simple amplification; that would be a lot. Listeners with a speech hearing loss in quiet of 39 dB would have, on average, 13 dB of traditionally incompensable loss. Also, Plomp (1978) observed that individual data differed considerably from the mean. This individual variability could be due to individually different degrees of and causes for the measured loss, or due to insufficient measurement accuracy. An important observation he made was that data points from listeners with age-related hearing loss agreed well with data points from studies on other sensorineural hearing impairments. As a consequence, he considered age-related hearing loss to be primarily due to deterioration in the auditory pathway rather than to mental impairment. This last point is fundamental considering that mental impairment can most likely not be compensated by signal processing.

However, these findings have to be taken with care. The data used by Plomp (1978) were measured at different sites with different speech tests using different setups. The resulting list of possible systematic and random errors in the underlying data is long. The main contribution to the systematic error (expectable differences across studies), apart from the calibration error and the different listener panels, was probably the use of different speech tests (including measurement paradigm, speech material, masker, presentation, ...). It is known that the type of speech material (e.g., logatomes, numbers, isolated words, or sentences) makes a difference in the outcome of a speech-in-noise recognition experiment. Hence, tests with different speech material might also be differently susceptible to hearing loss and result in different values of SHL D. There is no reason to assume that SHL D is independent of the speech test.
The main contribution to the random error (unpredictable differences across measurements) was probably different for the studies and due to the stochastic nature of the measurement procedures. Both the random and the systematic errors were not specifically considered in the analysis of Plomp (1978). The variety of unknowns makes it difficult to translate his finding into expectable values and individual variability of SHL_D with the employed matrix sentence test. Matrix sentence tests were designed to minimize the random error in the measurement of an SRT; Kollmeier et al. (2015) reported a test-retest reliability of 0.5 dB for listeners with normal hearing and 0.7 dB for listeners with impaired hearing.
Fortunately, a suitable data set was measured with a matrix sentence test. Wardenga et al. (2015) found a similar relation between the degree of hearing loss and SHL_D for the German matrix sentence test in the (stationary) test-specific noise condition, where the SRT in noise increased by about 1 dB for every 10-dB increase in PTA for PTAs below 47 dB HL (cf. Figure 5 there). This is much less than the average increase (1 dB every 3 dB) found by Plomp (1978). One reason for the lower increase could be that the speech hearing loss in quiet (SHL_{A+D}) includes the speech hearing loss in noise (SHL_D), while the PTA might not. But that alone cannot explain the huge difference. Another reason for the difference could be that Wardenga et al. (2015) derived the relation from the pure-tone average (PTA) rather than from the speech hearing loss in quiet. Following their relation, SHL_D for PTAs of 60 dB HL could be about 6 dB in stationary noise. The individual variability of SHL_D in their analysis was reported with 1.17 dB for PTAs below 47 dB HL, which was only about twice the test-retest reliability of the matrix test. An interpretation of this relation could be that a given hearing loss in quiet is very likely related to a certain hearing loss in noise.
This interpretation would only be valid if additional amplification did not change the relation, that is, if the same relation was observed for a noise presentation level of, e.g., 75 dB SPL instead of 65 dB SPL. However, it is difficult to speculate on that. On the one hand, the test-specific noise (OLNOISE) reduces the effect of the individual hearing threshold by masking the speech signals with a stationary matched-spectrum noise. On the other hand, for higher PTAs, the individual hearing threshold will eventually exceed the noise level at high frequencies, which could be compensated by amplification. Then again, according to the simulations presented in this contribution, the removal of high-frequency portions (> … Hz) of the signals in the OLNOISE condition had little effect on the SRT. These considerations demonstrate the inherently non-linear relations between the parameters assumed to affect speech recognition in noise, and hence the difficulty of abstracting these from a given context.
Nonetheless, to continue with a specific value for SHL_D for the German matrix sentence test that could likely be found in the wild, the relation established by Wardenga et al. (2015) is assumed to be valid. In favor of this assumption is that the linear regression estimated from the data with PTAs below 47 dB HL describes the data in their Figure 5 (solid black line) very well. One would expect an increase in variability with the PTA (below 47 dB HL) if the absolute hearing threshold affected the SRT, such as it was observed for higher PTAs; but this was not the case. Accordingly, Wardenga et al. (2015) concluded that with a fixed noise presentation level of 65 dB SPL, the SRT is determined by listening in noise for PTAs below approximately 47 dB HL, and above that it is determined by listening in quiet.
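The two empirical relations discussed above can be contrasted in a small numerical sketch. The piecewise form below is an illustrative reading of Wardenga et al. (2015), not their fitted model: below about 47 dB HL the SRT in noise grows by roughly 1 dB per 10 dB PTA (listening in noise); above that, the absolute threshold dominates. Plomp's average relation of 1 dB SHL_D per 3 dB SHL_{A+D} is included for comparison.

```python
def shl_d_wardenga(pta_db_hl, slope=0.1, knee_db_hl=47.0):
    """Illustrative SHL_D (dB) vs. pure-tone average, after Wardenga et al.
    (2015): about 1 dB per 10 dB PTA below the knee; above the knee the SRT
    is dominated by listening in quiet (not modeled here, hence None)."""
    if pta_db_hl > knee_db_hl:
        return None  # threshold-dominated regime, no simple linear relation
    return slope * pta_db_hl

def shl_d_plomp(shl_a_plus_d_db):
    """Plomp's (1978) average relation: 1 dB SHL_D per 3 dB SHL_{A+D}."""
    return shl_a_plus_d_db / 3.0

# A listener with a PTA of 40 dB HL: ~4 dB SHL_D following Wardenga et al.,
# versus ~13 dB following Plomp's average relation for SHL_{A+D} = 39 dB.
print(shl_d_wardenga(40.0))  # 4.0
print(shl_d_plomp(39.0))     # 13.0
```

The large gap between the two estimates mirrors the discrepancy discussed above; the knee value and slope are taken from the cited figures and serve only as orientation.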
Based on these considerations, it seems likely that listeners with an SRT of 0 dB SNR in the German matrix sentence test in the test-specific stationary noise exist whose speech recognition performance cannot be improved further by simple amplification. That would indicate an SHL_D of about 7 dB.
This estimate can be used as an orientation to interpret the right-most column in Table 1, which is equivalent to the SHL_D. For example, profile P-2000-14 with an SHL_D of 6.5 dB lies within the range of empirically observed values. In this context, profiles P-1000-14 and P-2000-21 might be border cases with values of 9 to 10 dB for SHL_D, and profile P-1000-21 probably lies outside the range of empirically observed values for SHL_D. Profiles P-8000-21, P-4000-14, P-2000-14, and P-1000-7 are all compatible with the empirically observed values, with an SHL_D of less than 7 dB in the stationary noise condition. The key difference between these profiles is the proportion of SHL_D that is due to the level uncertainty and hence might be compensable by a PLATT expansion. Assuming that the average frequency range available to listeners with impaired hearing is between 2000 and 4000 Hz, P-4000-14 and P-2000-14 could represent parameter values in a realistic range.
The aim of classifying profiles into more (and less) realistic ones is to use them to formulate a quantitative hypothesis on the expected benefit in SRT due to PLATT expansion. This does not mean that the considered profiles exist. Their use is justified by the assumption that the reduction in speech recognition performance in noise is due to a combination of a limited frequency range and a mechanism similar to an increased level uncertainty. While this assumption may be sensible, it doesn't mean that it is correct. This would have to be tested in listening experiments. For the effect of the limited frequency range, this could be achieved by, e.g., low-pass filtering the signals in speech recognition experiments.
For the effect of an increased level uncertainty, a direct verification is currently not possible, because the level uncertainty adds a noise in a domain which is not accessible for manipulations in experiments with human listeners; unlike for the limitation of the frequency range, there is no known equivalent signal manipulation. As explained in the Introduction, the most promising (and at the same time constructive) approach to test whether a mechanism similar to an increased level uncertainty causes a part of the class D hearing loss of listeners with impaired hearing is trying to compensate it.

Role of PLATT expansion
The expansion feature of the PLATT dynamic range manipulation approach aims to mitigate the increase in SRT due to an increased level uncertainty. It was specifically designed to compensate the effect of the level uncertainty on speech recognition performance as it is implemented in FADE. The presented simulation results showed that this necessary interim goal was achieved and indicate that about half of the class D loss due to an increased level uncertainty was compensated. This only demonstrates that the expansion works as intended in the context of the model. There is no reason to assume that the results, that is, the partial compensation of a class D hearing loss, would transfer to experiments with human listeners. Now, the key question of this contribution: what will happen in experiments with human listeners, and what would the outcome imply about the central model assumptions?
A) If the PLATT expansion does not improve the SRTs in noise of human listeners with a class D hearing loss, this would indicate that the mechanism that causes the class D loss is not well described by the level uncertainty, because the PLATT expansion interacts differently with the mechanism in human listeners than with the mechanism in the model.
B) If the PLATT expansion improves the SRTs in noise of human listeners and if these improvements are in line with individual predictions, this would strongly support the hypothesis that a mechanism similar to an increased level uncertainty causes a part of the class D hearing loss of listeners with impaired hearing.
C) However, more likely than B), if the PLATT expansion improves the SRTs in noise of human listeners but the improvements are found to be substantially smaller than predicted, this would only partly support the hypothesis that a mechanism similar to an increased level uncertainty causes a part of the class D hearing loss of listeners with impaired hearing.
In scenario A), a central model assumption is incorrect.
The (to-be) collected empirical data can still help to improve the model, but the expansion feature of PLATT would be useless in the context of a hearing device. In scenario B), the central model assumptions are probably correct. The (to-be) collected empirical data cannot help to improve the model, but the expansion feature of PLATT would overcome a serious limitation of current hearing device technology. In scenario C), most of the central model assumptions were probably correct. The (to-be) collected empirical data can probably help to improve the model, and the expansion feature of PLATT could give strong hints regarding how to overcome a serious limitation of current hearing device technology. In this last scenario, a differentiated analysis of the prediction errors will be helpful in tracking down the inaccurate assumptions. The probabilities for scenarios A), B), and C) are unknown. To provide evidence for any scenario, an experiment with human listeners suitable to unmistakably show the compensation effect has to be conceived.
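The intended interaction between level uncertainty and expansion can be illustrated with a toy numerical sketch (not the FADE or PLATT implementation): if the level uncertainty is modeled as additive Gaussian noise on log-scaled spectral levels, expanding the spectral pattern around its mean before the noise acts preserves more of the pattern. The level-uncertainty value of 7 dB and the sinusoidal test pattern are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spectral pattern across 32 auditory channels (deviations in dB)
channels = np.arange(32)
pattern = 10.0 * np.sin(2.0 * np.pi * channels / 8.0)

LEVEL_UNCERTAINTY_DB = 7.0  # hypothetical class D parameter (std. dev., dB)

def observe(pat, expansion_factor):
    """Expand the pattern around its mean, then add level uncertainty."""
    expanded = pat.mean() + expansion_factor * (pat - pat.mean())
    return expanded + rng.normal(0.0, LEVEL_UNCERTAINTY_DB, pat.shape)

def mean_similarity(expansion_factor, trials=500):
    """Average correlation between the clean and the observed pattern."""
    return float(np.mean([
        np.corrcoef(pattern, observe(pattern, expansion_factor))[0, 1]
        for _ in range(trials)
    ]))

# Without expansion, the pattern is partly buried in the level uncertainty;
# with an expansion factor of 4, its shape survives much better.
print(mean_similarity(1))  # noticeably below 1
print(mean_similarity(4))  # close to 1
```

In this caricature the expansion trivially helps because the "internal" noise is fixed; the simulations discussed in the text address the non-trivial part, namely whether the manipulation still helps after the auditory model and the recognition back-end.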
Potential of compensating a class D loss
The presented simulation results clearly show the potential of the expansion of spectral modulation patterns in the range between 2 and 4 ERB, as implemented in PLATT, to compensate a class D hearing loss that was implemented by an increased level uncertainty. In both noise conditions, approximately half of the class D loss due to an increased level uncertainty was compensated, while the class D loss due to a limited frequency range was not compensated. Narrowing down the simulation results to profiles P-4000-14 and P-2000-14, which were identified earlier as more realistic, the predicted improvements in SRT with PLATT-6 were 3.0 dB in the stationary noise condition. These improvements correspond to a compensation of roughly half of the class D hearing loss due to the increased level uncertainty. In the fluctuating noise condition, improvements in SRT of 4.6 and 5.3 dB were predicted with PLATT-6 for profiles P-4000-14 and P-2000-14, respectively, which also corresponds to a compensation of more than half of the class D hearing loss due to the increased level uncertainty. Hence, if the model assumptions are correct, benefits in SRT of about 3.0 dB and 5.0 dB in the stationary and fluctuating noise condition are expected, which correspond to more than 50% of the respective class D losses due to the level uncertainty.
Such benefits would be individually measurable with the standard German matrix sentence test. However, according to the profile P-4000-7, for which benefits in SRT of up to 1.3 dB and 2.7 dB were predicted in the stationary and fluctuating noise condition, respectively, individual measurements would possibly not yield significant results. This is because the individual benefit in SRT is derived from two individual measurements (aided and unaided), each generating a random error of about 0.7 dB (standard deviation) for listeners with impaired hearing. If the real benefit is lower than about 2 dB (1.96 · √2 · 0.7 ≈ 1.9 dB), a significant (p < 0.05) effect can only be shown in group averages but not in single individual measurements. However, showing an effect in individual measurements is highly preferable, because then the data could be used to further analyze the individual characteristics of listeners with and without benefits. Hence, even if this goal might not be achieved, experiments with human listeners should be designed with the aim to maximize the absolute effect.

Optimal values for the expansion
The optimal (in terms of improvements in SRT) value for the expansion factor depended on the level uncertainty parameter. For lower values of the level uncertainty, lower factors (4 or 6) resulted in the best speech recognition performance in many conditions, but higher values did not decrease the improvement much. For the profiles P-4000-14 and P-2000-14, a factor of 8 was optimal. From this perspective, nothing speaks against evaluating PLATT with high expansion factors in listening experiments. However, because an expansion factor of 8 results in a strong modification of the signal, and because it is unknown how such modifications are perceived in terms of quality and loudness by listeners with impaired hearing, at least two values for the expansion factor should be evaluated in listening experiments.
Audio quality and loudness
In measurements with human listeners, speech recognition performance, audio quality perception, and loudness perception all depend on the presentation level. In this contribution, efforts were made to separate the speech recognition performance from the presentation level as much as possible. This resulted in considering elevated presentation levels, and loudness perception is strongly related to presentation levels. Also, efforts were made to mitigate audible artifacts in the manipulation and resynthesis stages of PLATT. But the non-linear manipulation inevitably results in audible artifacts which probably affect audio quality perception. The effect of processing noisy speech signals at 0 dB SNR with PLATT-1, PLATT-4, and PLATT-8 is illustrated in Figure 22. The spectro-temporal maxima of the shown log Mel-spectrograms are 68.8, 73.4, and 79.8 dB SPL for the noisy signals in stationary noise, and 77.0, 81.8, and 89.3 dB for the noisy signals in fluctuating noise. The corresponding effective amplitudes (RMS) are 66.6, 69.2, and 74.4 dB SPL for the noisy signals in stationary noise, and 69.7, 72.3, and 77.8 dB for the noisy signals in fluctuating noise. The increase in RMS and maximum power with PLATT-8, by about 8 dB and more than 10 dB, respectively, suggests an increased loudness perception and also demonstrates the necessity to evaluate the approach in conditions in which an increase in level must not result in an improvement in SRT. Also, the processing will probably have an effect on the perceived audio quality, where, with increased expansion factors, the audio quality will eventually decrease.

Figure 22: Illustration of the effect of the expansion factor with PLATT: Log Mel-spectrograms of processed noisy speech in stationary (left column) and fluctuating noise (right column) at 0 dB SNR, processed with PLATT-1 (top row), PLATT-4 (center row), and PLATT-8 (bottom row). [Axes: time in ms, frequency in Hz; color scale: level in dB SPL.]

In the subjective opinion of the author, based on a comparison of unprocessed and processed signals, the artifacts due to processing the stimuli with PLATT-1 are minor, with PLATT-4 clearly audible, and with PLATT-8 pronounced. While it would be a difficult task to predict the effect of the processing on the perception of audio quality for listeners with normal hearing, it would be even more so for listeners with impaired hearing.
Apart from speech recognition performance, individual loudness and audio quality perception are important perceptual dimensions for hearing aid users. The expected increase in perceived loudness and a possible decrease in perceived audio quality are considered collateral side-effects of a signal manipulation with PLATT. These side-effects can either be tolerated or mitigated/compensated with further extensions or modifications of the PLATT approach. It is unclear how a listener with impaired hearing would trade speech recognition performance for audio quality or loudness in the context of PLATT expansion. To gather evidence about the relation of PLATT expansion, loudness perception, and audio quality perception, the considered expansion factors should cover a wide range of these side-effects. Hence, an evaluation of speech recognition performance, audio quality, and loudness with PLATT expansion factors of 1, 4, and 8 is proposed. This will help to better estimate the suitability of the raw approach in the context of hearing aids and show which properties should be in the focus for possible future optimization. There are several possibilities to optimize the expansion approach with respect to loudness perception and quality. For example, the expansion factor can be chosen to be frequency dependent.
If the frequency regions in which the benefits are achieved by expansion do not fully overlap with the frequency regions where loudness perception is critical, there could be potential to favorably trade speech recognition performance against loudness. The expansion factor could also be limited, or otherwise mapped non-linearly, to expand only the low modulation amplitudes required to improve speech perception in low-SNR conditions, while leaving the already high modulation amplitudes in high-SNR conditions unmodified. A detailed discussion of possible modifications is considered too hypothetical to be further elaborated in this contribution.
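One such non-linear mapping could look as follows; this is a hypothetical sketch of the idea, not part of the PLATT implementation, and the cap value is an arbitrary illustrative choice. Low modulation amplitudes are expanded by the full factor, the expanded value is limited to a cap, and amplitudes that already exceed the cap pass through unmodified.

```python
def limited_expansion(amp_db: float, factor: float = 8.0,
                      cap_db: float = 24.0) -> float:
    """Hypothetical limited expansion of a spectral modulation amplitude (dB).

    Low amplitudes are expanded by `factor`, the expanded value is capped at
    `cap_db`, and amplitudes that already exceed the cap pass unmodified.
    Between cap_db/factor and cap_db, the mapping plateaus at the cap.
    """
    return max(amp_db, min(factor * amp_db, cap_db))

# Weak modulations are strongly expanded, strong ones pass through:
print(limited_expansion(1.0))   # 8.0  (expanded)
print(limited_expansion(30.0))  # 30.0 (unmodified)
```

A mapping of this kind would concentrate the expansion on the weak modulation patterns typical of low-SNR conditions while avoiding additional loudness growth for signals that are already strongly modulated.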
Proposed measurement conditions
Based on the presented simulations and considerations, the following experimental measurement conditions are proposed to test the hypothesis that a part of a class D hearing loss can be compensated by expanding spectral patterns in the range between 2 and 4 ERB in speech-in-noise recognition experiments with human listeners. Regarding the noise maskers, an evaluation with the test-specific noise (e.g., OLNOISE for the German matrix sentence test), the fluctuating ICRA5-250 noise, and, in addition, a competing voice masker, e.g., the International Speech Test Signal (ISTS; Holube et al., 2010) or its optimized version with limited pause durations, the International Female Fluctuating Masker (IFFM), is proposed. The IFFM masker is interesting because it shares the speech modulation properties with the target speaker and usually results in very weak masking for listeners with normal hearing. It was not included in the objective evaluation with FADE because predictions with FADE for competing talker scenarios are known to be inaccurate; Schädler et al. (2018) reported the effect of the masker to be overestimated by approximately 10 dB.
The noise presentation levels ideally would be sufficiently high to compensate any hearing loss which is compensable
by simple amplification. For the stationary noise condition, this would be easier to achieve than with the fluctuating noise conditions. As shown in the left panel of Figure 2, even mild hearing loss profiles are expected to require very high presentation levels to maximize the compensation with simple linear amplification. However, the improvements in SRT at high presentation levels can be expected to be reasonably small. As a compromise, and to avoid measuring the whole Plomp curve, a reference experiment with an increased level can be performed to test if the equivalent linear amplification would result in a similar speech recognition performance as using PLATT expansion. The observed increase in RMS level with PLATT-8 was about 8 dB. Adding a small margin, 10 dB amplification should result in higher RMS levels of noisy speech signals at 0 dB SNR than processing the same signals with PLATT-8. Hence, noise presentation levels of 70 and 80 dB SPL are proposed. The proposed noise presentation level is 70 dB SPL for aided and unaided measurements, while the noise presentation level of 80 dB SPL serves as an alternative to the aided measurement, one in which 10 dB amplification is applied. If the processing with PLATT expansion achieves better SRTs than the 10 dB linear amplification, this would indicate the compensation of a class D hearing loss. However, the RMS level is only a rough proxy for loudness perception. Hence, in all conditions (70 dB SPL unaided, 70 dB SPL aided, 80 dB SPL unaided), the loudness perception of characteristic signals, e.g., speech in noise at 0 dB SNR, should be measured to check if the perceived loudness of the processed 70 dB SPL signals is really lower than the perceived loudness of the 80 dB SPL signals. To further mitigate any benefits due to simple linear amplification at high presentation levels, all stimuli should be low-pass filtered to achieve an effective frequency range of up to 2000 Hz.
This would limit the evaluation of a benefit due to PLATT expansion to the frequency range up to 2000 Hz, the central frequency region that listeners with impaired hearing typically can use for speech recognition. A focus on this frequency region could be advantageous in a first evaluation. The low-pass filtering would make a group of listeners with impaired hearing less heterogeneous, because larger differences across listeners are usually observed at frequencies above 2000 Hz. And the loudness perception of such low-pass filtered stimuli might also be more pleasant than that of their corresponding broadband variants.
In total, the proposed measurements comprise 18 conditions: three for training, three unaided measurements at 70 dB SPL (one for each noise masker), three unaided measurements at 80 dB SPL (also one for each noise masker), and nine aided measurements with PLATT-1, PLATT-4, and PLATT-8 (that is, three for each noise masker). In all conditions, it is recommended to assess SRT, loudness perception, and audio quality perception.
In addition, it is recommendable to characterize the listeners in the way Schädler et al. (2020a) proposed, that is, with tone detection and tone-in-noise detection experiments, and not only with clinical audiograms. This additional data could be used to perform individual predictions of benefits with FADE and compare them to the individual measurements. The comparison of individual measurements and predictions is crucial to detect invalid assumptions in the model.
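The proposed condition count can be made explicit with a short enumeration; the condition labels below are placeholders for illustration, not names used in any measurement software.

```python
from itertools import product

maskers = ["OLNOISE", "ICRA5-250", "IFFM"]

training = [("training", masker) for masker in maskers]
unaided = [(f"unaided-{level}dB", masker)
           for level, masker in product([70, 80], maskers)]
aided = [(f"PLATT-{factor}-70dB", masker)
         for factor, masker in product([1, 4, 8], maskers)]

conditions = training + unaided + aided
print(len(conditions))  # 18
```

In each of the 18 conditions, SRT, loudness perception, and audio quality perception would be assessed.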
Final practical considerations
As speech material for the SRT measurements, the optimized matrix test sentences in the listeners' native language are recommended (Kollmeier et al., 2015). These are phonetically balanced and optimized for a high test-retest reliability. To minimize the training effect, three training lists of 20 sentences in noise are recommended. Lists of 30 sentences achieve a higher test-retest reliability but can also result in excessively long measurements, which can be fatiguing. Hence, as a compromise, measurements with lists of 20 sentences could be used and repeated on a different day, which would also allow detecting a possibly remaining training effect.
For a first approach, headphone measurements are preferable over free-field measurements because they can be performed monaurally, which is recommendable. A monaural presentation prevents the interference of possibly individual binaural effects and facilitates the comparison to model predictions, as discussed by Schädler et al. (2020a).
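The individual detectability argument from the discussion of expected benefits can be reproduced with a few lines using only the standard library; the test-retest standard deviation of 0.7 dB is the value reported by Kollmeier et al. (2015) for listeners with impaired hearing.

```python
from math import sqrt
from statistics import NormalDist

def minimal_detectable_benefit(test_retest_sd_db: float,
                               alpha: float = 0.05) -> float:
    """Smallest individual SRT benefit distinguishable from measurement noise.

    The benefit is the difference of two independent measurements (aided and
    unaided), so its standard deviation is sqrt(2) times the test-retest
    standard deviation; a two-sided test at level alpha needs z times that.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # 1.96 for alpha = 0.05
    return z * sqrt(2.0) * test_retest_sd_db

print(round(minimal_detectable_benefit(0.7), 1))  # 1.9 dB, i.e., about 2 dB
```

With the 0.5 dB test-retest reliability of listeners with normal hearing, the same computation yields about 1.4 dB, which illustrates why individual significance is easier to reach in less variable populations.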
Outlook
Once there is evidence whether the proposed compensation strategy has a positive effect on the speech recognition performance of listeners with impaired hearing, the bandwidth limitation to 2000 Hz that was recommended for the first experiments should be removed. For the next steps, it would no longer be important to demonstrate the compensation of a class D loss alone, but to show benefits in more realistic and individually optimized configurations. Hence, there should be a shift towards a joint individual optimization of the compensation of class A and class D losses, loudness perception, and audio quality. Then, care should be taken to individually normalize loudness perception to obtain meaningful data. If the expansion approach with PLATT proves to work in monaural listening conditions, the concept should be extended to binaural listening conditions. For this, a free-field setup with mobile hearing aid prototype hardware is recommendable to correctly assess individual binaural hearing, including binaural loudness perception. To better understand which speech portions are most affected by PLATT expansion, it would also be interesting to study its effect on phonemic contrasts.
Conclusions
The most important findings of this work can be summarized as follows:
• The functional modeling of the class D hearing loss with the framework for auditory discrimination experiments (FADE), implemented by means of the level uncertainty, was interpreted as the counterpart of a compensation strategy which aims to (partially) compensate a class D hearing loss.
• The strict low-delay constraints in hearing aid applications only allow for a manipulation of mainly spectral modulation patterns. Of these, the patterns in the range of 2 to 4 ERB seem especially suitable to be protected against the effect of the level uncertainty by dynamic range expansion.
• A low-delay, real-time capable implementation of a patented dynamic range manipulation scheme (PLATT), which allows performing the required dynamic range expansion, was proposed. The implementation was optimized to run in real-time on the Raspberry Pi 3 Model B platform.
• The evaluation of the PLATT expansion with FADE for several idealized profiles of hearing loss indicated that approximately half of the class D hearing loss due to an increased level uncertainty was compensable.
• Simulations with FADE were used to predict the outcomes of specific speech recognition experiments prior to performing these. The underlying hypothesis, that a class D hearing loss can be (partially) compensated, can be directly tested in an experiment with human listeners in the same listening conditions. Recommendations for this experiment were elaborated.
References
Bisgaard, N., Vlaming, M. S., and Dahlquist, M. (2010) Standard audiograms for the IEC 60118-15 measurement procedure. Trends in Amplification, 14(2):113–120, https://doi.org/10.1177%2F1084713810379609

Bustamante, D. K. and Braida, L. D. (1987) Principal-component amplitude compression for the hearing impaired. The Journal of the Acoustical Society of America, 82(4):1227–1242, https://doi.org/10.1121/1.395259

Dreschler, W. A. (1992) Fitting multichannel-compression hearing aids. Audiology, 31(3):121–131, https://doi.org/10.3109/00206099209072907

Dreschler, W. A., Verschuure, H., Ludvigsen, C., and Westermann, S. (2001) ICRA noises: artificial noise signals with speech-like spectral and temporal properties for hearing instrument assessment. Audiology, 40(3):148–157, https://doi.org/10.3109/00206090109073110

European Telecommunications Standards Institute (2007) "202 050 v1.1.5" Speech processing, transmission and quality aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms. Standard.

Grimm, G., Herzke, T., Ewert, S., and Hohmann, V. (2015) Implementation and evaluation of an experimental hearing aid dynamic range compressor. In Proceedings of the German Annual Conference on Acoustics, 185–188, http://pub.dega-akustik.de/DAGA_2015/data/articles/000429.pdf

Hochmuth, S., Kollmeier, B., Brand, T., and Jürgens, T. (2015) Influence of noise type on speech reception thresholds across four languages measured with matrix sentence tests. International Journal of Audiology, 54(sup2):62–70, https://doi.org/10.3109/14992027.2015.1046502

Hohmann, V. and Kollmeier, B. (1995) The effect of multichannel dynamic compression on speech intelligibility. The Journal of the Acoustical Society of America, 97(2):1191–1195, https://doi.org/10.1121/1.413092

Holube, I., Fredelake, S., Vlaming, M., and Kollmeier, B. (2010) Development and analysis of an international speech test signal (ISTS). International Journal of Audiology, 49(12):891–903, https://doi.org/10.3109/14992027.2010.506889

Hülsmeier, D., Warzybok, A., Kollmeier, B., and Schädler, M. R. (2020) Simulations with FADE of the effect of impaired hearing on speech recognition performance cast doubt on the role of spectral resolution. Hearing Research, 395, https://doi.org/10.1016/j.heares.2020.107995

ISO (2003) Standard 226:2003: Acoustics – normal equal-loudness-level contours. International Organization for Standardization, 63.

Kollmeier, B., Warzybok, A., Hochmuth, S., Zokoll, M. A., Uslar, V., Brand, T., and Wagener, K. C. (2015) The multilingual matrix test: Principles, applications, and comparison across languages: A review. International Journal of Audiology, 54(sup2):3–16, https://doi.org/10.3109/14992027.2015.1020971

Kollmeier, B., Schädler, M. R., Warzybok, A., Meyer, B. T., and Brand, T. (2016) Sentence recognition prediction for hearing-impaired listeners in stationary and fluctuation noise with FADE: Empowering the attenuation and distortion concept by Plomp with a quantitative processing model. Trends in Hearing, 20, https://doi.org/10.1177%2F2331216516655795

Levitt, H. and Neuman, A. C. (1991) Evaluation of orthogonal polynomial compression. The Journal of the Acoustical Society of America, 90(1):241–252, https://doi.org/10.1121/1.401294

Moore, B. C. J., Peters, R. W., and Stone, M. A. (1999) Benefits of linear amplification and multichannel compression for speech comprehension in backgrounds with spectral and temporal dips. The Journal of the Acoustical Society of America, 105(1):400–411, https://doi.org/10.1121/1.424571

Plomp, R. (1978) Auditory handicap of hearing impairment and the limited benefit of hearing aids. The Journal of the Acoustical Society of America, 63(2):533–549, https://doi.org/10.1121/1.381753

Plomp, R. (1988) The negative effect of amplitude compression in multichannel hearing aids in the light of the modulation-transfer function. The Journal of the Acoustical Society of America, 83(6):2322–2327, https://doi.org/10.1121/1.396363

Schädler, M. R., Meyer, B., and Kollmeier, B. (2012) Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition. The Journal of the Acoustical Society of America, 131(5):4134–4151, https://doi.org/10.1121/1.3699200

Schädler, M. R., Warzybok, A., Hochmuth, S., and Kollmeier, B. (2015) Matrix sentence intelligibility prediction using an automatic speech recognition system. International Journal of Audiology, 54(sup2):100–107, https://doi.org/10.3109/14992027.2015.1061708

Schädler, M. R., Warzybok, A., Ewert, S. D., and Kollmeier, B. (2016b) A simulation framework for auditory discrimination experiments: Revealing the importance of across-frequency processing in speech perception. The Journal of the Acoustical Society of America, 139(5):2708–2722, https://doi.org/10.1121/1.4948772

Schädler, M. R., Hülsmeier, D., Warzybok, A., Hochmuth, S., and Kollmeier, B. (2016a) Microscopic multilingual matrix test predictions using an ASR-based speech recognition model. In Proceedings of INTERSPEECH, 610–614, http://dx.doi.org/10.21437/Interspeech.2016-1119

Schädler, M. R., Warzybok, A., and Kollmeier, B. (2018) Objective prediction of hearing aid benefit across listener groups using machine learning: Speech recognition performance with binaural noise-reduction algorithms. Trends in Hearing, 22, https://doi.org/10.1177/2331216518768954

Schädler, M. R., Hülsmeier, D., Warzybok, A., and Kollmeier, B. (2020a) Individual aided speech-recognition performance and predictions of benefit for listeners with impaired hearing employing FADE. Trends in Hearing, 24, https://doi.org/10.1177%2F2331216520938929

Schädler, M. R. (2020b) Optimization and evaluation of an intelligibility-improving signal processing approach (IISPA) for the Hurricane Challenge 2.0 with FADE. In Proceedings of INTERSPEECH, 1331–1335, https://doi.org/10.21437/Interspeech.2020-0093

Souza, P. E. (2002) Effects of compression on speech acoustics, intelligibility, and sound quality. Trends in Amplification, 6(4):131–165, https://doi.org/10.1177%2F108471380200600402

Wagener, K., Brand, T., and Kollmeier, B. (1999) Entwicklung und Evaluation eines Satztests für die Deutsche Sprache I-III: Design, Optimierung und Evaluation des Oldenburger Satztests. Zeitschrift für Audiologie, 38(1-3):4–15

Wagener, K. C., Brand, T., and Kollmeier, B. (2006) The role of silent intervals for sentence intelligibility in fluctuating noise in hearing-impaired listeners. International Journal of Audiology, 45(1):26–33, https://doi.org/10.1080/14992020500243851

Wardenga, N., Batsoulis, C., Wagener, K. C., Brand, T., Lenarz, T., and Maier, H. (2015) Do you hear the noise? The German matrix sentence test with a fixed noise level in subjects with normal hearing and hearing impairment. International Journal of Audiology, 54(sup2):71–79, https://doi.org/10.3109/14992027.2015.1079929

Yund, E. W. and Buckles, K. M. (1995) Multichannel compression hearing aids: Effect of number of channels on speech discrimination in noise. The Journal of the Acoustical Society of America, 97(2):1206–1223, https://doi.org/10.1121/1.413093