Thoughts on the potential to compensate a hearing loss in noise
TThoughts on the potential to compensate a hearing loss in noise
Marc René SchädlerMedizinische Physik and Cluster of Excellence Hearing4all, Universität Oldenburg, [email protected] 25, 2021
Abstract
The effect of hearing impairment on speech perception was described by Plomp (1978) as a sum of a loss of class A, dueto signal attenuation, and a loss of class D, due to signal distortion. While a loss of class A can be compensated by linearamplification, a loss of class D, which severely limits the benefit of hearing aids in noisy listening conditions, cannot. Notfew users of hearing aids keep complaining about the limited benefit of their devices in noisy environments. Recently, in anapproach to model human speech recognition by means of a re-purposed automatic speech recognition system, the loss of classD was explained by introducing a level uncertainty which reduces the individual accuracy of spectro-temporal signal levels.Based on this finding, an implementation of a patented dynamic range manipulation scheme (PLATT) is proposed, whichaims to mitigate the effect of increased level uncertainty on speech recognition in noise by expanding spectral modulationpatterns in the range of 2 to 4 ERB. An objective evaluation of the benefit in speech recognition thresholds in noise using anASR-based speech recognition model suggests that more than half of the class D loss due to an increased level uncertaintymight be compensable.
Keywords: theoretical audiology, speech perception modeling, impaired hearing, hearing loss compensation
Introduction
To this day, hearing aids without directional amplificationor directional noise suppression provide their users only withlimited benefit in noisy listening conditions. The limited ben-efit of hearing aids in noisy listening conditions is long knownand was extensively described and put into context by Plomp(1978). There, the effect of impaired hearing on speech recog-nition performance was described as a sum of two fundamen-tally different classes of hearing loss: class A, which accountsfor an attenuation of the signal, and class D, which accountsfor a distortion of the signal. While, the class A loss is definedsuch that it can be fully compensated by a suitable linear am-plification of the signal, the class D loss was assumed to belevel independent, that is, it cannot be compensated by linearamplification.The class A loss, according to Plomp (1978) usually makesup the largest part of the total increase in speech recognitionthresholds (SRTs) in quiet (A+D), which is why hearing aidsprovide the largest benefits in quiet environments. In noisyenvironments with sufficiently high levels, the contribution ofthe class A loss diminishes, and the contribution of the classD loss dominates the total loss. Plomp (1978) estimated that,on average, the class A loss accounts for approximately two-thirds and the class D loss for approximately one-third of thetotal loss, where huge individual variability was expected. Amain drawback of this model is that it does not provide a spe-cific hint on if and how the class D loss can be compensated.Recently, Kollmeier et al. (2016) proposed a modification tothe feature extraction stage of the simulation framework forauditory discrimination experiments (FADE, Schädler et al.,2016a) to implement a class A loss as an absolute hearingthreshold and a class D closs as a level uncertainty. The simulation approach with FADE employs a re-purposed au-tomatic speech recognition system (ASR) to predict the out-come of a speech recognition test, such as the matrix sen-tence test (Kollmeier et al., 2015). The FADE approach wasalready successfully used to predict the outcomes of severalspeech in noise recognition experiments (Schädler et al., 2015,2016b) as well as the outcomes of basic psycho-acoustic ex-periments (Schädler et al., 2016a) for listeners with normalhearing. Kollmeier et al. (2016) proposed to remove the in-formation that is not available to an individual listener withimpaired hearing in the feature extraction stage of the ASRsystem used in the FADE modeling approach.To induce a class A loss in the model, variations in theinternal spectro-temporal signal levels below the individualhearing threshold, determined by the individual audiogram,were removed. This manipulation is illustrated in the centerpanel of Figure 1, where the low-energy portions (blue/green)were replaced by constant values which are equal to the indi-vidual absolute hearing threshold, while the high-energy por-tions above the individual absolute hearing threshold are un-changed compared to unmodified representation in the upperpanel. It seems plausible that, if all relevant signal portionsare above the hearing threshold, this manipulation shouldhave no effect on the predicted speech recognition perfor-mance.To induce a class D loss in the model, random values weredrawn from a normal distribution and added to the internalspectro-temportal signal levels, where the standard deviationof the normal distribution was a variable called level uncer-tainty . This manipulation is illustrated in the lower panelin Figure 1, where all signal portions, including those abovethe hearing threshold, are affected. Because the signal energyin that representation (which is a logarithmically scaled Mel-1 a r X i v : . [ ee ss . A S ] F e b igure 1: Figure reproduced from Kollmeier et al. (2015).Internal spectro-temporal signal representation (log Mel-spectrogram) like it is used in the FADE modeling approachof a speech in noise mixture (upper panel) and examples ofmanipulations to it that were introduced to induce a class Ahearing loss (center panel) and a class D hearing loss (lowerpanel). spectrogram) is represented in a logarithmic domain, linearamplification cannot be expected to change its effect on thepredicted speech recognition performance.Kollmeier et al. (2016) evaluated the effect of these ma-nipulations on the predicted outcomes of the German matrixsentence test in a stationary and a fluctuating noise conditionfor different noise levels, and fitted the A/D-class descriptionproposed by Plomp (1978) to the data. The results, repro-duced in Figure 2, clearly show that the two manipulationslargely achieved the intended effects, that is, inducing a classA and a class D hearing loss. The left panel shows FADE sim-ulations with different standard audiograms from Bisgaardet al. (2010), which converge for high noise levels, that is,the manipulations can be compensated by amplification asone would expect from a class A loss. The right panel showsFADE simulations with increasing values for the level uncer-tainty, which do not converge for high noise levels, that is,the manipulations cannot be compensated by amplificationas one would expect from a class D loss. An important ob-servation of Kollmeier et al. (2015) was, that their empiricaldata set, which included matrix sentence test results in noiseof almost 200 ears, could not be satisfactorily predicted withthe class A loss alone, indicating that an implementation ofa mechanism that induces a class D loss is needed to explainthe speech recognition performance of individual listeners.Schädler et al. (2020a) extended that approach by inferringthe individual frequency-dependent level uncertainty fromtone in noise detection thresholds, and achieved unprece-dented accuracy in the prediction of benefits in SRTs dueto different traditional hearing loss compensation schemes innoise (and in quiet). The central assumption there is thatthe same mechanism that affects individual speech in noiseperception also affects individual tone in noise perception.Other mechanisms that reduce the information encoded inthe internal signal representation, like a reduced spectral res-olution, might also be considered in this modeling approach.However, Hülsmeier et al. (2020) found that, with FADE, areduced spectral resolution had only little effect on the simu-lated SRTs compared to an increased level uncertainty. Thisindicates, that between a reduced spectral resolution and thelevel uncertainty, the latter is considered to be the suitable mechanism to implement a class D loss in FADE.If the mechanism that removes the information in the audi-tory system of listeners with impaired hearing works similarto the level uncertainty, than the level uncertainty can be seenas the functional counter-part to a compensation strategy fora class D hearing loss. This is exactly what we will assume inthe remainder of this contribution to what we consider the-oretical audiology. In the context of FADE, this will allowto generate testable hypothesis on the achievable benefit inspeech recognition performance due to a possible compensa-tion strategy for a class D loss. The aim of this contributionis to:A) Present an approach which is able to partially compen-sate a class D loss as implemented with the level uncer-tainty in FADE, andB) objectively evaluate this approach and come up withtestable quantitative hypothesis on the benefit in noisylistening conditions.For an effective mitigation of the effect of the level uncer-tainty on speech recognition performance, three main prob-lems need to be addressed:1) Which portions, or more precisely, patterns of the speechsignal carry the most relevant information and need tobe protected?2) Which signal patterns can be protected, given the strictconstrains on the available future temporal context of thesignal (approximately 1 ms) in hearing aid applications?3) How can such a protection be achieved?Continuing to follow an (assumed) analogy of basic prin-ciples in human and machine speech recognition, literatureon robust automatic speech recognition provides hints onwhich signal patterns are relevant for good (automatic) speechrecognition performance in noise. Let us assume that a logMel-spectrogram, such as it used for the calculation of thewidely used Mel-frequency cepstral coefficient (MFCC) fea-tures, is representative for the information that is availableto an ASR system. The spectral resolution of such a log Mel-spectrogram, of which an Example is depicted in the upperpanel of Figure 1, is about 1 equivalent rectangular bandwidth(ERB), the temporal resolution is 10 ms. The relevant speechinformation is encoded in the represented spectro-temporaldynamic, that is, the differences of spectro-temporal signallevels over time, called temporal modulations, and over fre-quency, called spectral modulations. It is remarkable thatASR systems traditionally don’t even use the whole informa-tion in the log Mel-spectrogram, but work with a reducedspectral resolution compared to the spectral resolution ofthe human auditory system. For example, the standard fea-tures used for ASR, MFCCs (ETSI, 2007), specifically en-code spectral modulation frequencies from to about cy-cles per ERB; spectral modulation frequencies above cyclesper ERB empirically don’t contribute to automatic speechrecognition performance. In line with this finding, the robustGabor filter bank (GBFB) features use only slightly morethan half of the available spectral resolution of the log Mel-spectrograms (c.f. Schädler et al., 2012). By omitting the cor-responding parts of the feature vector, Schädler et al. (2012)2 igure 2: Figures reproduced from Kollmeier et al. (2015). Simulated speech recognition thresholds with FADE for a stationary anda fluctuating noise condition at different noise levels. The left panel shows simulations with different absolute hearing threshold,which induce different class A losses. The right panel shows simulations with different values for the level uncertainty, which inducedifferent class D losses. The embedded tables show the contributions of A and D when the descriptive model of Plomp (1978) wasfitted to the data. For further details please refer to the original publication. assessed the relative importance of different spectro-temporalmodulation frequencies in a robust ASR task. There, it wasobserved that the highest spectral modulation frequencies,around cycles per band (the bandwidth of a Mel band thereis approximately 1 ERB) seem to important for the consid-ered speech in noise recognition task. Before going into moredetail on the importance of spectro-temporal modulations fornoisy speech recognition, the limited possibilities to manipu-late these in the context of hearing aids have to be considered.The restriction to the availability of approximately 1 msfuture temporal context in hearing aids does not allow forthe reliable manipulation of temporal modulations below ap-proximately 500 Hz. This limit is still way above the temporalmodulation frequencies that are represented in the features ofASR systems. With the common analysis window length andshift of 25 and 10 ms, respectively, as used in Figure 1, theupper limit for represented temporal modulation frequenciesis around 50 Hz. Hence, the only modulations that can be re-liably manipulated are spectral modulations, which requiresa temporal signal context in the order of 1 ms. Fortunately,spectral modulations seem to play an important role in (au-tomatic) speech recognition. With the already mentioned ap-proach to model human speech recognition performance bymeans of a re-purposed ASR system, Schädler et al. (2016a)found that explicitly encoding spectral modulations, that is,across-frequency interactions, in the feature vector was re-quired to explain the empirically found benefit when listeningin a fluctuation masker. In other words, spectral modulationsseem to be especially relevant in fluctuating noise maskers.In combination of the answers to problems 1) and 2): Spec-tral modulations just below 0.25 cycles per ERB, let’s sayspectral patterns between 2 and 4 ERB, seem to be a goodfirst candidate for protecting them against the level uncer-tainty. How can these signal patterns be protected or at leasthardened against the effect of the internal noise due to thelevel uncertainty? In the logarithmic level domain (in the log Mel-spectrogram) the noise of the level uncertainty is ad-ditive, and one effective way of mitigating the effect of anadditive noise is amplification. This means, that the desiredspectral modulation patterns should be amplified, that is, ex-panded before the noise of the level uncertainty can removethat information.At a first glance, an expansion of signal dynamic in thecontext of the reduced residual dynamic range of listenerswith impaired hearing might seem completely undesirable:The expansion of spectral modulations can result in uncom-fortably high and/or inaudibly soft signal levels; which canoccur at the same time in different frequency ranges. How-ever, considering the scale (2 to 4 ERB) it is clear, that thisis a region which is not modified by common approaches tomulti-band dynamic compression, where the signal is usuallyindependently compressed in approximately six bands. Thetraditional approaches to multi-band dynamic compressionwith less than 9 bands leave spectral patterns with a widthof 2 to 4 ERB virtually uncompressed, which might also beinterpreted as an indication towards the importance of thesepatterns. If now the expansion of one (small, as we will latersee) part of the spectral dynamic increases the total spectraldynamic range of the signal, it might be possible to counterit with a stronger compression of other (less relevant) partsof the spectral dynamic. However, compression should onlybe applied if it is required, that is, if the input signal dy-namic range does not fit into the available output dynamicrange. The dynamic range of speech in noise is less thanthe dynamic range of a clean speech signal. Especially whenconsidering speech in only slightly fluctuating noises at signal-to-noise ratios (SNRs) around 0 dB, the dynamic range of themixture can be close to the minimum which is required todiscriminate words. In this condition, no improvement canbe expected from compressing the signal dynamic. Hence,a micro-management of the limited residual dynamic rangeto optimize its use for speech-recognition-relevant informa-3ion is a desirable goal. PLATT is an approach which givespreference to spectral modulation patterns that are suppos-edly relevant for speech recognition at the expense of spectralmodulation patterns which are supposedly less relevant whenmapping an input signal dynamic to a given output signal dy-namic. The here outlined ideas are part of a German patent(DE 10 2017 216 972); a patent application in Europe is pend-ing.The implementation details of PLATT, which are presentedin the Section Methods, were already engineered towards low-delay real-time processing. In a simpler implementation, therepresentation of the log Mel-spectrogram could be manipu-lated in the MFCC domain where the spectral modulationsare suitably encoded for this operation, and which in MFCC-terminology would be called liftering . The inverse transformof the modified MFCCs, with amplified spectral modulationpatterns between 0.25 and 0.125 cycles per band, could becompared to the unmodified log Mel-spectrogram, and a suit-able filter response in the time-domain could be designed foreach signal frame. While such an approach would have theadvantage of simplicity (even simpler than the related im-plementation of an intelligibility-improving signal processingapproach (IISPA) Schädler (2020b)), it would not be appli-cable without further modifications in a hearing device: Themain problem would be the infeasible window length of 25 ms.Because of the highly non-linear nature of the interactions be-tween the factors influencing speech recognition performance(speech material, masker type and level, reverberation, non-linear signal processing, hearing impairment), an implemen-tation that already fulfills the most basic requirements of ahearing aid algorithm and can run on a hearing aid prototype was preferred over a simple proof-of-concept implementation.This additional effort makes a seamless translation to an ap-plication in a hearing device more likely and increases themeaningfulness of the presented results for a possibly realiz-able hearing aid solution.For an evaluation of the implementation with respect to apossible compensation of a class D loss, the following pointsneed to be considered:1) Which listening conditions, that is, which speech test andmaskers, are suitable to evaluate the PLATT implemen-tation objectively with FADE and (also later) empiri-cally.2) Which listener profiles are suited to clearly demonstratea (partial) compensation of a class D loss like it is imple-mented in FADE.The first point is important to enable the verification withempirical data of any hypotheses that are based on the modelpredictions. Schädler et al. (2020a) discussed this point andproposed to use the SRT-50 measured with the matrix sen-tence test in quiet, in a stationary, and in a fluctuating noisecondition, to cover the very different masker properties in typ-ical listening conditions: Quiet, low-dynamic maskers, high-dynamic masker. The main reason for using these “labora-tory” signals and the SRT-50 instead of real noise recordingand, e.g., the SRT-80, was the known high test-retest reliabil-ity of these tests. For individual predictions, the test-retest https://github.com/m-r-s/hearingaid-prototype reliability of the employed speech recognition test sets thelower limit for the achievable prediction error in an evalua-tion with the data. That means, low measurement errors inthe empirical data may facilitate or even enable falsificationof the model predictions. An SRT of 0 dB at high noise levelsin the test-specific noise condition, this is, with a stationarynoise of identical long term spectrum than the speech sig-nal, can be considered very problematic when normal-hearinglisteners can achieve about -8 dB. If only half of the hear-ing loss in that condition in noise could be compensated(which would be a huge achievement), the measurable benefitwould be only 4 dB. Considering that the benefit in SRT iscalculated as the difference between two measurements, thetargeted error of a single measurement should be less than √ · · dB ≈ . dB. Such low measurement errors, whichcan be achieved in SRT measurements with the matrix sen-tence test, would later enable to show individual benefitswithout averaging over groups of listeners, given the benefitwas 4 dB. When adding to this consideration the need for alevel-dependent evaluation, which is required to identify theclass D loss according to Plomp (1978), one arrives at thetest conditions which were already studied in Kollmeier et al.(2015) and that are depicted in Figure 2.The second point, the selection of suitable listener pro-files, is a bit more complex than it might initially appear.A sensible approach would be to take the individual profilesinferred from the psychoacoustic measurements by Schädleret al. (2020a) which are available online . The main prob-lem with this approach is, that even with a small pure classA loss, the SRTs are generally not level-independent at highlevels. This can already be observed for the fluctuating noisecondition in the left panel of Figure 2. There, the simulationswith a pure class A loss with hearing thresholds according tothe standard profile N1 (corresponding to a very mild hearingloss) do not converge with the data of the normal-hearing pro-file None up to noise levels of 90 dB SPL. Hence, even for verysmall increases in hearing threshold, amplification improvesthe SRT in the fluctuating noise condition up to very highpresentation levels. With the aim of clearly attributing com-pensation strategies to compensate a class A or a class D loss,this is highly undesirable. To clearly identify the compensa-tion of a class D loss, simple linear amplification alone mustnot improve the SRT. The reason for the observed model be-havior is that, for the Bisgaard profiles, the frequency rangeabove the hearing threshold increases with the presentationlevel. While the limited frequency range can safely be as-sumed a factor contributing to a class D loss and has to beconsidered in a suitable listener profile, it must be avoidedthat the effectively used frequency range changes at high pre-sentation levels. One option would be to low-pass filter thespeech material. Another option is to define profiles with verysteep sloping hearing loss functions. The former option wouldbe very suitable to measure empirical data. The latter optionis regarded cleaner from a modeling perspective, because itreduces the number of parameters that influence the SRTand results in a simpler and possibly better traceable model.Hence, listener profiles with normal hearing thresholds be-low, and infinite hearing loss above a given limit frequencyare suitable for the considerations in this contribution. While https://doi.org/10.5281/zenodo.4394186 Methods
The methods described in the following were used to simulatespeech recognition experiments in stationary and fluctuatingnoise at different presentation levels for 16 listening profileswith class D hearing losses without and with the later pro-posed dynamic range expansion by PLATT including differentdegrees of expansion.
Speech recognition tests
The speech material of the (male) German matrix sen-tence test (Wagener et al., 1999; Kollmeier et al., 2015)was used with two masker signals: The test-specific noise(called OLNOISE) and the fluctuating ICRA5-250 noise sig-nal (Dreschler et al., 2001; Wagener et al., 2006). The matrixtest, which exists in more than 20 languages, comprises 50phonetically balanced common words of which sentences witha fixed syntax, such as “Peter got four large rings” or “Ninawants seven heavy tables”, are built. Typically, the SRT-50,this is the speech level that is required to correctly recognize50% of the words, is measured with an adaptive procedure inexperiments with human listeners. In this contribution, thespeech and masker material was used to predicted the SRT-50 with FADE. The test-specific noise (OLNOISE) has thesame long-term spectrum as the speech material. It can beassumed to mask the speech signal similarly well across allfrequencies. At the SRT for normal hearing listeners, -7 dB(Hochmuth et al., 2015), this results in a noisy speech signalwith a low dynamic range, where spectro-temporal maxima ofthe mixtures are dominated by the speech signal. The effectcan be observed in Figure 3, where the log Mel-spectrogram of clean speech signal (upper panel) and the same speech sig-nal with the OLNOISE masker at -7 dB SNR (center panel)are depicted. The ICRA5-250 noise is a speech-shaped noisewhich is co-modulated with speech-like temporal patterns inthree independent frequency bands, where the pause durationwas limited to 250 ms (Wagener et al., 2006). The empiricalSRTs with this masker signal are usually more than 10 dBlower than in the corresponding test-specific noise condition(Hochmuth et al., 2015). At the SRT for listeners with nor-mal hearing, -19 dB (Hochmuth et al., 2015), this results ina noisy speech signal with a high dynamic range, where thespectro-temporal maxima of the mixtures are dominated bythe masker signal. This can be observed in the lower panelof Figure 3, where the log Mel-spectrogram of a speech sig-nal with the ICRA5-250 masker at -19 dB SNR is depicted inthe lower panel. To assess the presentation level dependency,both maskers are considered at presentation levels from 0 to100 dB SPL in 10-dB steps. At 0 dB SPL presentation level,this corresponds to listening in quiet. The selected listeningconditions reflect important dimensions of speech perception:Listening in quiet, at low levels, and at high levels, as well aslistening in stationary and fluctuating noise. All consideredspeech tests can also be performed human listeners.
Simulations of matrix tests with FADE
The speech tests considered in Section Speech recognitiontests were simulated with an ASR-based approach and theiroutcome, the SRT-50, was predicted based on the simula-tion results. The simulations were performed with the lateststandard version of FADE , as described by Schädler et al.(2016a). In this contribution, FADE is used as a tool to pre-dict the outcome of speech recognition tests, where the onlychange to the standard setup was the manipulation of thefeature extraction stage which is explained in Section Lis-tener profiles: Class D hearing losses and the processing ofthe noisy speech signals with PLATT as explained in Sec-tion PLATT dynamic range manipulation. Hence, the FADEsimulation method is only outlined here, and we refer the in-terested reader to the original description from Schädler et al.(2016a).Predictions with FADE are performed completely indepen-dently for each listening condition (masker, maskerlevel, hear-ing loss compensation, and hearing profile). There is no de-pendency on any empirically measured SRT, nor on predic-tions of the same model in other/reference conditions (andhence no need to define such). This means, with FADE, asingle SRT of a speech recognition test for which no empir-ical data exists can be predicted. For the prediction of oneoutcome the following standard procedure as described bySchädler et al. (2016a) was used.An ASR system was trained on a broad range of SNRswith noisy speech material form the considered condition, e.g.German matrix sentence test in OLNOISE for listener pro-file “P-8000-1” with no compensation. For this, a corpus ofnoisy speech material at different SNRs was generated fromthe clean matrix sentence test material and the masker sig-nal, by adding randomly chosen masker signal fragments withthe speech material. The noisy signals were processed with https://doi.org/10.5281/zenodo.4003779 ime / ms F r equen cy / H z
250 500 750 1000 1250 1500 1750 20001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 1000 1250 1500 1750 20001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 1000 1250 1500 1750 20001245031057186530454768728410957 l e v e l / d B SP L Figure 3:
Log Mel-spectrograms of a clean German matrix sentence at 65 dB SPL (upper panel), of the same sentence in the stationarynoise (center panel) and fluctuating noise (lower panel) conditions at the SNRs which correspond to the SRT listeners with normalhearing of speech, -7 and -19 dB, respectively.
PLATT when an aided listening condition was considered.From the noisy (and optionally processed) speech signals, fea-tures were extracted, where this step included the implemen-tation of the class D hearing loss, as described in Section Lis-tener profiles: Class D hearing losses. Subsequently, an ASRsystem using whole-word models implemented with GaussianMixture Models and Hidden Markov Models, was trained onthe features. This resulted in 50 whole-word models for eachtraining SNR. These models were then used with a languagemodel that considers only valid matrix sentences (of which exist) to recognize test sentences on a broad range ofSNRs with noisy speech material form the same consideredcondition. For each combination of a training SNR and a testSNR, the transcriptions of the test sentences were evaluatedin terms of the percentage of correctly recognized words. Theresulting recognition result map (cf. left panel in Figure 7in Schädler et al. (2016a) for an example), which containedthe speech recognition performance of the ASR system de-pending on the training and testing SNRs in 3 dB steps, wasqueried for the SRT. For a given target recognition range, e.g.50%, the lowest SNR at which this performance was achievedwas interpolated from the data in the recognition result mapand reported as the predicted SRT for the considered con-dition. The whole simulation process, including the creationof noisy speech material, an optional processing of this noisyspeech material with PLATT, the feature extraction (whichdepends on the listener profile), the training of the ASR sys-tem, the recognition of the test sentences, and the evaluationof the recognition result map, was (independently) repeated for each considered condition. Listener profiles: Class D hearing losses
Outcome predictions of speech recognition tests as well as ba-sic psychoacoustic tests with FADE were found to be closeto the empirical results for listeners with normal hearing(Schädler et al., 2016a). As proposed by Kollmeier et al.(2016) and successfully used by Schädler et al. (2020a), im-paired hearing was implemented in the ASR system by remov-ing the information from the feature vectors that is presum-ably not available to listeners with impaired hearing. As dis-cussed in the Introduction, two types of manipulations whichinduce class D hearing loss were considered: 1) A limitationof the frequency range, 2) An increase of level uncertainty.The effect of both parameters on the log Mel-spectrogram ofa clean speech sample is depicted in Figure 4, where in theupper panel, the frequency range was limited to 8000 Hz andthe level uncertainty was 1 dB. In the center panel, the leveluncertainty was increased to 7 dB, compared to the upperpanel. In the lower panel, the frequency range was addi-tional limited to 2000 Hz compared to the center panel. Anamplification of the input signal increases all values in a logMel-spectrograms by a constant value. Both manipulationsintroduce a level-independent loss of information and henceinduce a class D loss. For the evaluation, upper frequency lim-its of 1000, 2000, 4000, and 8000 Hz were considered, wherethe class D loss decreases with high values. For the leveluncertainty, frequency-independent values of 1, 7, 14, and21 dB were considered, where the class D loss increases with6 ime / ms F r equen cy / H z
250 500 750 1000 1250 1500 1750 20001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 1000 1250 1500 1750 20001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 1000 1250 1500 1750 20001245031057186530454768728410957 l e v e l / d B SP L Figure 4:
Illustration of the two considered class-D-loss-inducing log Mel-spectrogram manipulations: Log Mel-spectrograms forlistener profiles “P-8000-1” (upper panel), “P-8000-7” (center panel), and “P-2000-7” (lower panel). The first number in the profileencodes the upper frequency limit, in this example 8000 and 2000 Hz, and the second number indicates the level uncertainy, here1 and 7 dB. high values. All combinations of both parameters result in 16profiles from “P-8000-1” to “P-1000-21”. The former can beexpected to be the best performing one, while the latter canbe expected to be the worst performing one.
PLATT dynamic range manipulation
In this section, the patented (DE 10 2017 216 972) PLATTdynamic range manipulation as it was conceived for a laterimplementation in a hearing device is described. The imple-mentation was optimized to run in real-time on a RaspberryPi 3 Model B to enable field studies with mobile hearing aidprototype hardware . The ability to expand spectral modu-lation frequencies in the range of to cyclesERB is a feature thatintegrates naturally with the approach. Even if not strictlynecessary for the goals of this contribution, the method is de-scribed here in detail to make statements about its ability tocompensate a class D loss in the algorithmic context in whichit might be later usable in a hearing device. To motivatethe design decisions behind PLATT, which generally aims topreserve relevant speech modulations when compressing thedynamic range of a signal, this subsection comes with its ownintroductory part. Introduction to the PLATT concept
Conditions inwhich the available dynamic range for acoustic communica-tion is reduced are rather the norm than the exception. For For example: https://github.com/m-r-s/hearingaid-prototype or https://batandcat.com/portable-hearing-laboratory-phl.html example, in a driving car, the lower limit of the availabledynamic range is determined by the driving noise. Or in alibrary, the upper limit is given by the accepted sound levelsin such an environment. And, importantly, the available dy-namic range for communication is limited for listeners withimpaired hearing. For a successful communication, it may berequired to adapt a source signal, which may contain speechand non-speech parts, to the available dynamic range on thereceiver side by dynamic range compression. But, in manyreal-time applications, the available temporal context to per-form this operation is very limited.Multi-band dynamic range compressors used in hearing(aid) research (e.g. Grimm et al., 2015) statically map theinput dynamic range to a reduced output dynamic range ina number of independent frequency bands. The compres-sion is often applied with rather short attack time constants,e.g., 20 ms, with the aim to protect the user from high levels,while the release time constants are usually much longer, e.g.100 ms to 1000 ms, with the aim to limit compression whenit not desirable, i.e. during short speech pauses. However,no distinction is made whether the signal contains speechportions or not. Approaches which depend on a classifica-tion whether or not speech is present in the input signal, areprone to errors if the (speech-)signal-to-noise ratio (SNR) islow, that is, just when the classification result is most impor-tant. Approaches that require more than a few milliseconds offuture temporal context cannot be used in applications whichrequire low latency, such as, e.g., hearing devices. Regardingthe speech intelligibility of processed signals, compression in7 few wide frequency bands is preferred over compression inmany narrow frequency bands, however, the recommendednumber of channels greatly varies (usually between 1 and8) (Plomp, 1988; Dreschler, 1992; Hohmann and Kollmeier,1995; Yund and Buckles, 1995; Moore et al., 1999; Souza,2002). The fewer channels are used, the better the spectraldynamic, that is, the spectral contrast or spectral modula-tion , is preserved. Static dynamic range compression onlypreserves the spectral modulation within each independentfrequency band, but not across bands, even if the dynamicrange would be available. Also, fewer frequency channels re-duce the need for sharp filters which would require long inte-gration time constants and introduce additional latency.Bustamante and Braida (1987) proposed to compress thefirst two principle components (PC1 and PC2) of the short-term speech spectrum, which were roughly representative ofoverall level and spectral tilt. With this approach, the fre-quency bands were not processed independently anymore,and the finer spectral structure was always preserved. Theiranalysis indicated that the highest intelligibility was obtainedwhen audibility was improved and the relative spectral shapesof different speech sounds were preserved (Bustamante andBraida, 1987). In their concluding section, they recommendedto investigate the enhancement of spectral differences whilecompressing level variations. Levitt and Neuman (1991) pro-posed an approach which decomposes and manipulates theshort-term spectrum using a set of orthogonal polynomialfunctions with the aim to preserve important speech cues.Referring to the study of Bustamante and Braida (1987),Levitt and Neuman (1991) wrote: “Both studies showed thatcompression of the lowest order component (factor 1 in theprincipal-components method and the constant term in theorthogonal polynomial method, respectively) had by far thelargest effect, and that compression of higher order compo-nents had little effect, if any.” The common idea behind thesetwo studies was to linearly map and manipulate the spectraldimension of a suitable spectro-temporal representation withthe aim of separating important from less important speechsignal dynamic. However, both studies considered only cleanspeech signals, and hence did not consider the relevant por-tions of speech signal for their recognition in noise.That the signal dynamic can be described as the differ-ence of frequency-dependent short-term effective amplitudes,e.g., across time (temporal dynamic), across frequency (spec-tral dynamic), or both (spectro-temporal dynamic), raises thequestion which representation is most suitable to manipulateit. ASR systems are the technical solution to decode speechsignals and hence provide a model for speech recognition. Asoutlined in the Introduction, the feature extraction stages ofASR systems provide representations of the speech dynamicthat are well suited for the robust recognition of speech innoise. However, the often employed basis for the feature ex-traction stages, the log Mel-spectrogram, is not suited forlow-latency signal processing due to its long integration win-dow. The relatively long integration window of the log Mel-spectrogram serves two objectives: 1) Obtain a sufficientlyhigh frequency resolution to separate low-frequency signalcontent into approximately 1 ERB-wide bands, and 2) Ensurethat in voiced speech portions each signal frame contains atleast one pulse (that is to remove the temporal fine-structureof the speech signal). Fortunately, these two aspects (suffi- cient spectral resolution for low frequencies and limited tem-poral resolution) are compatible and can be optimized forlow-latency processing at the cost of a frequency-dependentgroup delay. In the following, the design of PLATT, a fastadaptive dynamic range manipulation scheme that takes thementioned observations into account, is proposed, where thefollowing three objectives were pursued:• Preservation and enhancement of spectral modulationswhich are assumed to be relevant for speech recognition• Low-latency and fast reaction time while minimizing au-dible artifacts• Adaptive limitation of the compression to the necessaryminimumPLATT consists of three functional parts:1) An auditory-motivated frequency decomposition and re-synthesis of the audio signal, which allows to manipulatefrequency- and time-dependent amplitudes in a percep-tually relevant domain and helps to ensure that only lim-ited audible artifacts can be introduced.2) Extraction of a spectro-temporal representation from thefrequency-decomposed signal which is similar to thoseused for robust ASR and hence suitable for the analysisof the relevant spectral modulations.3) Adaptive calculation of frequency- and time-dependentgains from the spectro-temporal representation whichuses compression only as required to provide high speechrecognition performance when the available dynamicrange on the output side is limited.Figure 5 illustrates the relations between the signal pro-cessing blocks that were used to implement this functional-ity. A detailed description is provided in the following. Theexact implementation details are provided in a reference im-plementation that is written in C (cf. Section Availability ofresources).
Frequency decomposition & re-synthesis
The fre-quency decomposition of the input signal is performed with afilter bank of fourth-order Gammatone filters. To get a set offrequencies which are relevant for (automatic) speech recog-nition, the center frequencies are chosen equidistantly on aMel-frequency scale with half the distance that is commonlyused for calculating MFCCs. Considering frequencies in therange from Hz to
Hz results in the following
78 + 4 values: (64, 93,) 123, 155, 187, 221, 256, 293, 330, 370, 410, 453, 496,542, 589, 638, 689, 742, 797, 854, 914, 975, 1039, 1105, 1174, 1245, 1319,1396, 1476, 1559, 1645, 1734, 1827, 1923, 2023, 2127, 2235, 2346, 2462,2583, 2708, 2838, 2972, 3112, 3257, 3408, 3565, 3727, 3896, 4071, 4253,4441, 4637, 4840, 5051, 5270, 5498, 5734, 5979, 6233, 6497, 6771, 7056,7352, 7658, 7977, 8307, 8650, 9006, 9376, 9760, 10158, 10572, 11001,11447, 11909, 12390, 12888, 13406, 13943, (14501, 15080) Hz . Onlythe 78 filters with center frequencies from 123 Hz to 13943 Hz,spaced approximately 0.5 equivalent rectangular bandwidths(ERB), are used. The -10 dB-bandwidth of each fourth-orderGammatone filter is chosen to be equal to the difference ofthe frequencies two positions right and left to its center fre-quency, e.g., Hz − Hz = 128 Hz for 155 Hz. The aim8 ammatone fi lterbank real 15 ms hold peak/max 1dB/ms decay downsample 20*log10(x) modulation band-pass fi lters compression/expansion... sum...targetreferenceband-pass fi ltered signal gain rate limit10^(x/20) di ff erencegainsmultiplicationsuminput output Figure 5:
Diagram illustrating the relations of the main signal processing blocks which were used to implement the signal analysis,manipulation, and re-synthesis with PLATT. C en t e r f r equen c i e s and a m p li t ude Delay / ms0 1 2 3 4 5 6 7 8 9 10 11830742532023797123sum
Figure 6:
Real part of the impulse responses of a subset ofthe normalized, phase-adjusted, fourth-order Gammatone fil-ters that were employed for the frequency decomposition, andthe (scaled) sum of the impulse responses of all employed Gam-matone filters. is to evenly cover the relevant frequency range with filtersthat have a bandwidth similar to auditory filters ( ≈ ERB)and allow a trivial re-synthesis in the time domain by sim-ple summation of all filter bank outputs. With this goal,the filter coefficients are determined as follows: The pole inthe complex z-plane that describes the frequency-dependentproperties of a first-order infinite impulse response Gamma-tone filter is calculated according to the formula p = (cid:32) − √ · bw + 0 . (cid:33) · exp (cid:18) π i f c f s (cid:19) , (1)where bw is the -10 dB-bandwidth in Hz, f c the center fre-quency of the corresponding fourth-order filter, and f s thesampling frequency. The phase of the single FIR coefficientof each filter is chosen such that the phases of each pair offourth-order filters with neighboring center frequencies wereidentical at the delay where the product of their respectivetemporal envelopes reaches its maximum. This happens to bethe case at different delays for different pairs and minimizesdestructive interference between the corresponding filter out-puts. In addition, the absolute value of the FIR coefficient ofeach filter is chosen such that the maximum gain is 2. To-gether, a) evenly covering frequencies, b) avoiding destructiveinterference, and c) normalizing the maximum gain result ina very flat frequency response for the sum of all filter bankchannels. Figure 6 shows the first 11 ms of the real part ofthe impulse responses of a subset of the Gammatone filtersand also the (scaled) sum over the real parts of all impulse re-sponses. The joint impulse response (sum of the real-valuedimpulse responses of all normalized, phase-adjusted, forth-order Gammatone filters) is a downward frequency-sweep.The frequency-dependent delay can be read from Figure 6 andis about . ms at kHz and about . ms at Hz. Figure 7 A tt enua t i on / d B Frequency / Hz 8k4k2k1k5002501250-20-40
Figure 7:
Absolute values of the transfer functions correspondingto the impulse responses shown in Figure 6 and the absolutevalue of the joint transfer function of all filters (including thosenot shown). shows the corresponding absolute values of the transfer func-tions of the same sub-set of filters and the absolute valuesof the transfer function corresponding to the joint impulseresponse of all filters. The joint transfer function of all 78employed Gammatone-filters, which characterizes the systemproperty after re-synthesis if the amplitudes are not manipu-lated, has a flat frequency response between about
Hz and kHz. That the frequency decomposition and re-synthesishas almost no spectral, and only a limited temporal effect onprocessed signals indicates that a perceptually mostly trans-parent re-synthesis can probably be achieved. The amplitudesof the filter bank output represent perceptually relevant prop-erties of the signal and can be interpreted as a proxy for thedisplacement of oscillatory systems with properties similar tothose of the respective Gammatone filters, e.g., the humanbasilar membrane. The filter bank output can be manipu-lated directly, e.g., multiplied with a time- and frequency-dependent gain function, prior to the re-synthesis. The rateof change of the gain functions is limited to 24 dB per pe-riod of the corresponding center frequency to limit channelcrosstalk. Spectro-temporal signal representation
A representa-tion of the spectral dynamic which encodes information sim-ilar to a log Mel-spectrogram is determined as follows, basedon the real-valued output of the filter bank. For each filteroutput, the values are held for 15 ms and decay subsequentlywith a rate of dBms , if the held value is above the current value.This approach approximately extracts the temporal envelopeof each channel while preserving fast increases in amplitude(on-sets). Hence, the exact timing (or temporal fine struc-ture) is removed from this representation, and only the localmaximum values remain as an estimate of the maximum am-plitude (or displacement) of an oscillatory system with prop-erties similar to those of the employed Gammatone filters.This representation can be down-sampled by any factor whichreduces the sample rate to ms ≈ Hz or higher, withoutmissing any local maximum value. The encoded informationis very similar to the information encoded in features for ASR,9here updated spectral values are determined every ms infrequency-bands which are equally spaced on a Mel-scale. Un-published pilot experiments by the author confirm that theproposed representation, down-sampled to Hz, achievesvery similar simulation results in a range of speech recogni-tion experiments with FADE. The use in a hearing device,however, requires faster updates which is why the represen-tation is down-sampled to
Hz, that is, an update of the78 spectral values is calculated every ms. In Figure 8, thespectro-temporal representation used in PLATT and the logMel-spectrogram of a clean speech sample at 65 dB SPL areshown. Compared to the log Mel-spectrogram, the proposedspectro-temporal representation has a 10-times higher tem-poral resolution (visible at the on-sets), where, however, thetemporal fine structure is effectively removed. The off-setsare not as prominent because the maximum tracking only al-lows a decrease by dBms =
100 dB100 ms . Also, the proposed spectro-temporal representation has approximately twice the numberof frequency channels, 78 compared to 36 with the the logMel-spectrogram. The aim of the spectral oversampling is toprovide headroom for spectral modulation manipulations.Because a pure tone is not changed in amplitude by a filterof the employed Gammatone filter bank when it matches thefilters center frequency, the calibration of its input and outputvalues is identical. Assuming a calibrated setup, only valuesabove the frequency-dependent normal hearing threshold, asdefined by the International Organization for Standardizationin its standard 226:2003 (ISO 226, 2003), are considered inthe representation; values below the normal hearing thresh-old are replaced by the corresponding threshold value. Thiseffectively removes spectro-temporal modulations that can-not be perceived by listeners with normal hearing from theproposed spectro-temporal signal representation. The effectcan be observed in the high frequency range ( > Hz) inFigure 8, where the speech signal has no energy.
Adaptive spectral gain
The adaptive determination oftime- and frequency-dependent gains takes into account thecurrent spectral input dynamic and the currently availableoutput dynamic. It aims to minimize the compression withthe constraint to avoid masking the signal parts which couldcarry important (speech) information. It also allows to ex-pand the spectral modulations that are assumed to be im-portant for speech recognition and trade the such increasedsignal dynamic against an increased compression of less rele-vant signal dynamic.The spectral input dynamic is analyzed with spectral mod-ulation low-pass filters. For this, each vector of the spectralvalues is convolved with Hanning windows of the followingwidths to obtain increasingly spectrally smoothed versions ofthe initial vector: , , , and , which approximatelycorrespond to a full width at half maximum (FWHM) of , , , and ERB, respectively. The left panel of Figure 9shows an example of the spectral analysis for a signal whichconsists of two pure tones, of
Hz and
Hz. The inputspectral values are depicted in black, the smoothed versionswith ascending widths of the Hanning window in increasinglylighter shades of gray. The differences between the curves ofincreasingly smoothed spectral representations are indicatedwith the digits 1 to 4. Because the difference of two low-passfilters is a band-pass filter, the differences 1, 2, 3, and 4 can be interpreted as the result of a spectral modulation band-pass filtering which roughly contains the respective spectralmodulation frequencies 1) above , 2) from to , 3) from to , and 4) from to cyclesERB , respectively. By defini-tion, the original vector of spectral values (cf. black line inleft panel of Figure 9) can be recovered by adding the dif-ferences 1 to 4 to the low-pass filtered spectral values whichcontain the spectral modulation frequencies below cyclesERB (cf. lightest line in left panel of Figure 9). In the following,the vector of low-pass filtered spectral values which containsthe lowest modulation frequencies is referred to as the baselayer . The base layer contains a very coarse description of thespectral dynamic and resolves spectral patterns larger thanapproximately ERB. The other layers, the differences 4 to1, resolve the spectral dynamic which is not described in thebase layer with decreasing pattern sizes.The base layer needs to be mapped from the input dynamicrange to the output dynamic range. For the application in ahearing device, the mapping of the base layer could be per-formed with a prescription rule. In addition, the limits ofthe input and output dynamic ranges need to be defined.Here, for the input, a dynamic range that is probably rele-vant for normal hearing listeners is assumed: A frequency-independent uncomfortable level of dB SPL as the up-per limit, and levels of dB SPL from Hz to kHz, and dB above kHz and below Hz, as the lower limit. Theassumed input dynamic range is indicated with triangles inthe left panel in Figure 9. For the output, an exemplaryreduced dynamic range is assumed, which is arbitrarily lim-ited to frequency-independent levels from to dB SPL;resembling elevated hearing thresholds due to environmentalnoise or impaired hearing and a lower acceptance of high lev-els. The targeted output dynamic range is indicated withtriangles in the center panel in Figure 9. The mapping ofthe base layer can be independent from the input and outputdynamic range definitions. In the context of a hearing de-vice, the mapping of the base layer will mainly determine theoutput levels and strongly affect loudness perception, whilethe lower limit of the output dynamic range (to be chosen re-lated to the individual hearing thresholds) will strongly affectspeech recognition performance. The defined output dynamicrange is the reservoir that can be used by PLATT to map theinput dynamic.In our example in Figure 9, let’s assume the base layer wasmapped linearly from the defined input dynamic range to thedefined output dynamic range. The base layer is depictedas the lightest gray line in the left panel, and the mappedbase layer as the lightest gray line in the center panel. Thegains which would be theoretically required to achieve sucha smooth output dynamic, given the black line in the leftpanel as the input dynamic, are depicted as the lightest grayline in the right panel. But such an extreme compression ofthe spectral dynamic would introduce audible artifacts andwould most likely have a negative effect on speech recogni-tion performance. Hence, the more of the remaining originalsignal dynamic described by the differences 1 to 4 is addedback to the mapped base layer, the less compression will beneeded, and the better spectral modulation patterns will bepreserved. However, unconditionally adding the whole dy-namic that is encoded in the differences 1 to 4 could result10 ime / ms F r equen cy / H z
250 500 750 1000 1250 1500 1750 20001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 1000 1250 1500 1750 20001234961039182729724637705610572 l e v e l / d B SP L Figure 8:
Comparison of spectro-temporal representations: In the upper panel a log Mel-spectrogram as it is used in FADE of aclean speech sample at 65dB SPL, and in the lower panel the presented specto-temporal representation that is used in PLATT ofthe same sentence.
Frequency / Hz A m p li t ude / d B SP L G a i n / d B INPUT OUTPUT GAIN
Figure 9:
Example dynamic mapping of two pure tones. Left panel: Input dynamic analysis with spectral modulation low-pass filters.Center panel: (Conditional) reconstruction with reduced dynamic. Right Panel: Gains required to map the input dynamic to thereconstruction stages of the output dynamic (final gains are black). Numbers indicate the differences.
11n output levels below the lower limit of the output dynamicrange, which might not contribute to speech recognition any-more, or above the upper limit of the output dynamic range,which might lead to undesirably high output levels. A goodcompromise in the fundamental conflict that too much andtoo little compression can both result in sub-optimal speechrecognition performance requires a compression managementwhich depends on the current spectral input dynamic and thecurrently available output dynamic.To prefer spectral patterns which are important for robust(automatic) speech recognition, the differences correspondingto high spectral modulation frequencies are added first. Thehighest spectral modulation frequencies are encoded in dif-ference 1 and account only for a very small part of the totaldynamic, which is why it is added unconditionally . The exam-ple with two pure tones is an extreme one which assesses themaximum dynamic that can be encoded in each difference,which is about 6 dB for difference 1. The corresponding out-put dynamic, when only adding difference 1 to the base layer,can be observed in the center panel of Figure 9 as the lightgray curve which only deviates slightly from the base layer.Probably the most important spectral modulation frequen-cies describe spectral patterns between and ERB and aremainly encoded in difference 2, which is why it is also addedunconditionally. To protect this difference, which encodes amaximum dynamic of less than 9 dB, against a hearing loss ofclass D as implemented in FADE with the level uncertainty,it can be expanded by a factor greater than 1 prior to addingit to the base layer. The expansion of the difference 2 couldincrease the total output signal dynamic, which however canoften be compensated by an increased compression of the re-maining differences 3 and 4, which encode larger part of thesignal dynamic, compared to difference 2.The remaining differences 3, and subsequently 4, are added conditionally and possibly partially according to the followingconstraints: 1) For each frequency, the respective differenceis multiplied with the highest possible factor between 0 and 1for which its addition to the current output dynamic (whichalready includes the base layer and differences 1 and 2) doesnot result in output values above the upper limit or below thelower limit. For example, if the current value is dB SPL,the difference +30 dB SPL, and the limit dB SPL only athird of difference can be added and the factor is . This ruleis not sufficient because, when applied independently for eachfrequency band, it would result in a total loss of high spec-tral modulation frequencies in the frequency regions where nooutput dynamic range is left to add the difference; the alreadyadded spectral modulations needs to be protected from beingcompressed. Hence, 2) the minimum factor (highest com-pression) is propagated along the frequency axis by addingvalues of normalized Hanning windows of FWHM of 6 and 12samples for difference 3 and 4, respectively. For example, ifthe factor according to 1) turns out to be zero at the centerfrequency of Hz when adding difference 3, the factor at
Hz and
Hz can be at most . and below Hzand above
Hz again 1. The propagation ensures thatneighboring channels are compressed similarly and protectsany higher spectral modulation frequencies. The final de-sired spectral output levels are described by the sum of themapped base layer, the unconditionally added differences 1and 2, and the conditionally compressed differences 3 and 4. The frequency-dependent gain needed to achieve the desiredspectral output levels is the difference of the spectral outputlevels (black curve in the center panel in Figure 9) and spec-tral input levels (black curve in the left panel in Figure 9).The final frequency-dependent gain is plotted in black in theright panel of Figure 9 along with the partial gains that wouldtheoretically be needed after the cumulative addition of onlydifferences 1 to 3 to the base layer in increasingly darker grayshades. The final frequency-dependent gain is then applied tothe output of the Gammatone filter bank with the limitationof the rate of change to 24 dB per period of the correspondingcenter-frequencies.Admittedly, in our example with the two pure tones, thereis only energy at 500 Hz and 2000 Hz, and hence only the levelof the two tones will be changed, while the gains at otherfrequencies will have no effect. However, this signal creates apattern of extreme spectral modulation in the proposed signalanalysis, and hence is well suited to illustrate how PLATTworks. For a signal with only low spectral modulations, noconditional compression would be required.
Summary of PLATT dynamic range manipulation
With PLATT, compression is only applied if the availableoutput dynamic range is less than required to represent theinput signal dynamic. The main effect can be observed in Fig-ure 10, where the calculated gains for high (solid curves) and reduced (dotted curves) spectral input dynamic are shown foran exemplarily reduced output dynamic range, as indicatedby the triangles. While the signals with high spectral dy-namic are compressed (observed here as different gains fordifferent frequencies) to make relevant signal portions audi-ble and avoid excessively high output levels, the signal withlow spectral dynamic is not (observed here as similar gainsfor different frequencies), because it fits in the available out-put dynamic range without compression. Only low spectralmodulation frequencies, which are less important for speechrecognition, are conditionally compressed, while the impor-tant higher spectral modulation frequencies can be even ex-panded. An efficient implementation of spectral modulationfiltering in the time-domain is possible with reasonably lowlatency, fast reaction times, and probably little audible arti-facts due to the auditory-motivated signal decomposition andre-synthesis.
Compensation of level uncertainty with PLATT
Thepossibility to selectively expand the spectral modulation pat-terns between 2 and 4 ERB with PLATT was used to protectthese patterns against the effect of the level uncertainty asimplemented in FADE. With the aim to best decouple theevaluation of the expansion from the conditional compres-sion feature of PLATT, the available output dynamic rangewas set to match the input dynamic range, and the base layermapping was set to identity. This minimizes compression andeffectively disables the fine-grained compression managementfor all but very low and very high signal levels. Nonetheless,in view of expanding a part of the signal dynamic, such amanagement will probably be needed in an application witha limited output dynamic range to counter the expansion witha possible compression of other parts of the signal dynamic.Expansion factors of 2, 4, 6, and 8 were considered for the12 requency / Hz A m p li t ude / d B SP L G a i n / d B INPUT OUTPUT GAIN
Figure 10:
Examples of calculated spectral gain for an example with reduced available output dynamic range as indicated by thetriangles. Left panel: Spectral input dynamic with two pure tones at different presentation levels (gray shaded solid curves), andwith white noise added (dotted curve). Center panel: Corresponding spectral output levels. Right panel: Corresponding gains. evaluation, and the corresponding compensation conditionsare referred to as PLATT-2 to PLATT-8 in the remainder ofthe manuscript. The effect of processing noisy speech signalsat 0 dB SNR with PLATT-6 is illustrated in Figure 11. In thetop row, log Mel-spectrograms of a speech signal in the sta-tionary and fluctuating noise at 0 dB SNR are depicted in theleft and right panel, respectively. In the center row, the samelog Mel-spectrograms are shown, but with a level uncertaintyof 7 dB. In the bottom row, log Mel-spectrograms of the samesignals, but processed with PLATT-6 are shown with a leveluncertainty of 7 dB. Especially for the stationary noise, someof the speech patterns are better distinguishable in the logMel-spectrogram with the added level uncertainty after theprocessing. This justifies the expectation that the expansionmight protect important speech patterns against level uncer-tainty. With the fluctuating masker, the speech patterns arenot as easily distinguishable from the background noise. Butcomparing the representation of the unprocessed (center) andprocessed (bottom) signals, more of the patterns that are hid-den by the level uncertainty in the unprocessed signal are dis-cernible in the processed variant. This illustrates that the ex-pansion does not selectively process the speech portions of thesignal but generally protects the spectral modulation patternsbetween 2 and 4 ERB against the level uncertainty, indepen-dently of what caused these patterns. Also observable, theprocessing with PLATT-6 results in a possible increase in theoutput level. This affirms the need to evaluate the methodin a context where simple time-invariant linear amplificationwill not change the speech recognition performance and thusdecouple the effect of the expansion from the effect of anylinear amplification.The just described effects of an expansion of the spectralmodulation patterns between 2 and 4 ERB on noisy speechsignals were quantified in simulation experiments with FADE.For this, SRTs with the German matrix sentence test weresimulated for all combinations of considered noise maskers(OLNOISE, ICRA5-250), noise presentation levels (0, 10,20, ..., 100 dB SPL), listener profiles (P-8000-1 through P-1000-21), and compensations (none, and PLATT-2 throughPLATT-8); summing a total of 1760 outcome predictions withFADE.
Evaluation of simulation results
Key simulation outcomes are presented as “Plomp curves”,that is, SRT-50s in dB SPL as a function of the noise pre-sentation level, analog to Figure 2. On the one hand, thisdepiction allows to assess the effect of the (noise) presenta-tion level on the predicted SRTs. On the other hand, it alsoallows to quantify improvements in SRT in aided conditionsover a given reference condition, e.g., normal hearing or anunaided condition. Because not all 1760 data points can pre-sented in graphical form in this contribution, a summary ofthe achieved improvements in SRT at high presentation levels,at which we can confidently assume that linear time-invariantamplification cannot improve the SRT, are presented in theform of a table for all listener profiles and compensations. Forkey listener profiles, psychometric functions were obtained bysimulating SRT-20 to SRT-90 to assess the SNR-dependencyof PLATT. The main interest here was if the effect of thePLATT compensation is different for higher SRTs, which aremore realistic for conversations.
Availability of resources
To facilitate the reproduction of the presented experiments,and to encourage, foster, and accelerate the verification andadoption of the presented methods, the employed resourcesare provided, as far as licensing allows it. FADE version2.4.0, which is open source sofware, was used for the FADEsimulations. The code and scripts for the setting up the sim-ulations were based on, and are now integrated into the mea-surement and prediction framework . This includes:• The modified feature extraction.• The reference implementation of PLATT and the usedconfiguration files• The scripts which prepare and run the FADE simulationsusing the modified feature extraction and the referenceimplementation of PLATT.• The scripts which evaluate raw the experimental resultsand plot the results figures. https://doi.org/10.5281/zenodo.4003779 https://doi.org/10.5281/zenodo.4500810 ime / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Figure 11:
Illustration of the effect of the dynamic range manipulation with PLATT-6: Log Mel-spectrograms of noisy speech instationary (left column) and fluctuating noise (right column) at 0 dB SNR (top row), the same log Mel-spectrograms with leveluncertainty of 7 dB (center row), and log Mel-spectrograms with level uncertainty of 7 dB of the same signals processed withPLATT-6, that is with an expansion factor of 6.
This does not include:• The Hidden Markov Toolkit (used by FADE), whichcannot be distributed because of the license.• The speech material of the German matrix sentence test and the corresponding test specific noise signal (OL-NOISE), which cannot be distributed because of the li-cense.• The ICRA5-250 noise signal, due to the unknown licenseconditions. Results
Effect of limit frequency and level uncertainty
Two modifications were used to implement a hearing loss ofclass D, the limitation of the frequency range up to a limitfrequency, and the increase of the level uncertainty. Theirseparate effect on simulated SRTs is shown in Figure 12 andFigure 13, respectively.The limitation of the available frequency range affectedthe simulated SRTs in the fluctuating noise condition muchmore than in the stationary noise condition when assuminga level uncertainty of 1 dB (cf. Figure 12). In the station-ary noise condition, the limitation of the frequency range to4000 Hz did not have an effect on the simulated outcome ofthe German matrix sentence test. A reduction to 2000 Hz re-sulted in a small, mostly level-independent increase in SRT of http://htk.eng.cam.ac.uk/ http://medi.uni-oldenburg.de/download/ICRA/index.html about 1 dB, and a further reduction of the frequency range to1000 Hz, resulted in a further increase of about 1 dB. Hence,limiting the frequency range while assuming a level uncer-tainty of 1 dB had relatively little effect on the simulatedSRTs. A different picture can be observed in the fluctuat-ing noise condition. There, the reduction of the frequencyrange had a large, and level-dependent effect on the simu-lated SRTs. The effect of the reduction increases with thehigher presentation level and stabilizes at levels above 60 dBSPL. At these levels, the increase in SRT was about 3, 6, and15 dB for a limitation to 4000, 2000, 1000 Hz, respectively.The increase of the level uncertainty also affected the sim-ulated SRTs in the fluctuating noise conditions more than inthe stationary noise condition (cf. Figure 13). The effectwas not as level-independent at low presentation levels as onemight have expect. This indicates an interaction betweenthe implementation of the absolute hearing threshold and thelevel uncertainty of about 5 dB. The behavior could alreadybe observed in the data from Kollmeier et al. (2016) (cf. rightpanel in Figure 2). However, in this contribution, the focusdoes not lie on the interactions of the level uncertainty withthe absolute hearing threshold but effective elimination of theabsolute hearing threshold as a factor from the evaluation. Inthe fluctuating noise condition, this is the case for presenta-tion levels of 70 dB SPL and above, where the differences (cf.panel four in Figure 13) stabilize. At these levels, the increasein SRT was about 4, 9, and 12 dB for level uncertainties of 7,14, and 21 dB, respectively. In the stationary noise condition,the corresponding increases were lower, with about 1, 4, and6 dB for level uncertainties of 7, 14, and 21 dB, respectively.Neither implementation alone achieves to increase the SRTto above 0 dB with the considered parameter values. In lis-14
15 25 35 45 55 65 75 85 955152535455565758595
Noise level / dB SPL S pee c h l e v e l / d B SP L OLNOISE SRT
OLNOISE differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
P-8000-1 noneP-4000-1 noneP-2000-1 noneP-1000-1 none
Noise level / dB SPL S pee c h l e v e l / d B SP L ICRA5-250 SRT
ICRA5-250 differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
Figure 12:
Simulated “Plomp curves” in stationary noise (first panel) and fluctuating noise (third panel) when limiting the availablefrequency range (to 8000, 4000, 2000, and 1000 Hz) in the feature extraction stage. The dotted lines indicate an SNR of 0 dB.The corresponding level-dependent differences in SRT compared the profile plotted in black (here profile P-8000-1) are depicted inpanels two and four.
Noise level / dB SPL S pee c h l e v e l / d B SP L OLNOISE SRT
OLNOISE differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
P-8000-1 noneP-8000-7 noneP-8000-14 noneP-8000-21 none
Noise level / dB SPL S pee c h l e v e l / d B SP L ICRA5-250 SRT
ICRA5-250 differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
Figure 13:
Effect of increasing the level uncertainty (to 7, 14, and 21 dB). Analog to Figure 12.
Noise level / dB SPL S pee c h l e v e l / d B SP L OLNOISE SRT
OLNOISE differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
P-1000-1 noneP-1000-7 noneP-1000-14 noneP-1000-21 none
Noise level / dB SPL S pee c h l e v e l / d B SP L ICRA5-250 SRT
ICRA5-250 differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
Figure 14:
Effect of increasing the level uncertainty when the frequency range is limited to 1000 Hz. Analog to Figure 12.
Noise level / dB SPL S pee c h l e v e l / d B SP L OLNOISE SRT
OLNOISE differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
P-8000-1 noneP-4000-7 noneP-2000-14 noneP-1000-21 none
Noise level / dB SPL S pee c h l e v e l / d B SP L ICRA5-250 SRT
ICRA5-250 differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
Figure 15:
Effect of mixed profiles with increasing level uncertainty and frequency range limitation. Analog to Figure 12.
Effect of PLATT expansion
The effect of the expansion with PLATT on the simulation re-sults for listener profiles P-4000-7, P-2000-14, and P-1000-21is presented in Figures 16, 17, and 18, respectively. The blue,red, and yellow lines indicate the simulated speech recogni-tion performance with compensations PLATT-2, PLATT-4,and PLATT-6, respectively. The black lines indicated thecorresponding unaided reference conditions, while the purplelines indicate the simulated performance of profile P-8000-1,that is, the performance of a listener with normal hearing.The expansion with PLATT improved the SRTs in almostall simulated conditions. The benefit in SRT due to PLATT in was generally higher in the fluctuating noise condition thanin the stationary noise condition. Higher expansion factorsstrongly tended to result in higher benefits, however, not inall conditions. As expected, the observed compensations werepartial, that is, the simulated performance of listeners withnormal hearing were not restored. For a quantitative analy-sis, the results are interpreted only at the levels at which animprovement in SRT due to simple linear amplification canbe ruled out.The curves stabilize at high SRTs, that is, a further in-crease in levels does not improve the SNR anymore. In thestationary noise condition, this is the case for presentationlevels of 40 dB SPL and above. But even above this level,the curves show some variability, which is due to the stochas-tic nature of the simulation process. The plotted data pointswere derived as the difference of two independent simulations.Assuming a random (normally distributed) prediction errorof 0.5 dB, this would lead to a random error of the plotteddata points of approximately ± . dB, that is for 5 of 100simulation results, on average, the error will be larger than ± . · . ≈ ± . dB. Hence, “stabilized” refers to reach-ing a state which may allow to assume that the remainingvariability is due to the random error. In the fluctuatingnoise condition, the random error is increased, because of thegenerally shallower slope of the psychometric function in thiscondition. There, the curve of the normal-hearing profile sta-bilizes at levels of about 70 dB. The curves of the unaidedand compensated simulations already stabilize at lower lev-els. Hence, at noise presentation levels of 70, 80, and 90 dBSPL an improvement in SRT due to simple linear amplifica-tion can be ruled out.For a break-down of the simulation results, the average im-provements at noise presentation levels of 70, 80, and 90 dBSPL were calculated and are reported in Table 1. The tablespresent the simulated benefits in SRT using PLATT with ex-pansion factors 2, 4, 6, and 8, for all 16 considered listenerprofiles, sorted by the limit frequency. In the last column, thebenefit in SRT that corresponds to the normal-hearing (NH)profile is presented as an orientation to assess the proportionof the total class D loss that be compensated. Positive valuesindicate an improvement, that is, a lower SRT. An orientativevalue for the random error of the presented simulation datain Table 1 is 1 dB.For listener profiles with a level uncertainty of 1 dB (P-*-1), only small benefits were observed. In the stationarynoise condition, processing the signals with PLATT had noeffect on the SRTs ( ≤ . dB). In the fluctuating noise con-dition, small positive and negative benefits (min. -2 dB andmax. 2 dB) were observed; the SRTs for the listener profileP-1000-1 improved (lower SRTs) and the SRTs for the listenerprofiles P-8000-1 and P-4000-1 increased (negative improve-ment). This effect was most pronounced with PLATT-8 andPLATT-6, less pronounced with PLATT-4, and could notbe observed with PLATT-2, that is, it depended on the ex-pansion factor. The processing with PLATT, by increasingthe amplitude of certain spectral modulations, decreases theprominence of (or “masks”) other modulation patterns, and itintroduces a possible leak of information across frequencies,e.g., from > Hz to < Hz . While the masking ofother modulation patterns could explain the detrimental ef-fect of PLATT for the profiles P-8000-1 and P-4000-1, the16
15 25 35 45 55 65 75 85 955152535455565758595
Noise level / dB SPL S pee c h l e v e l / d B SP L OLNOISE SRT
OLNOISE differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
P-4000-7 noneP-4000-7 platt2P-4000-7 platt4P-4000-7 platt6P-8000-1 none
Noise level / dB SPL S pee c h l e v e l / d B SP L ICRA5-250 SRT
ICRA5-250 differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
Figure 16:
Simulated “Plomp curves” for listener profile P-4000-7 without and with PLATT expansion factors 2, 4 and 6 in stationarynoise (first panel) and fluctuating noise (third panel). The dotted lines indicate an SNR of 0 dB. The corresponding level-dependentdifferences in SRT compared the profile plotted in black (here the unaided profile P-4000-7) are depicted in panels two and four.The data with profile P-8000-1 (normal hearing) was added as an orientation for normal-hearing performance.
Noise level / dB SPL S pee c h l e v e l / d B SP L OLNOISE SRT
OLNOISE differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
P-2000-14 noneP-2000-14 platt2P-2000-14 platt4P-2000-14 platt6P-8000-1 none
Noise level / dB SPL S pee c h l e v e l / d B SP L ICRA5-250 SRT
ICRA5-250 differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
Figure 17:
Simulated “Plomp curves” for listener profile P-2000-14 without and with PLATT. Analog to Figure 16.
Noise level / dB SPL S pee c h l e v e l / d B SP L OLNOISE SRT
OLNOISE differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
P-1000-21 noneP-1000-21 platt2P-1000-21 platt4P-1000-21 platt6P-8000-1 none
Noise level / dB SPL S pee c h l e v e l / d B SP L ICRA5-250 SRT
ICRA5-250 differences in SRT S pee c h l e v e l d i ff e r en c e / d B Noise level / dB SPL
Figure 18:
Simulated “Plomp curves” for listener profile P-1000-21 without and with PLATT. Analog to Figure 16. leakage of information from higher frequencies could explainthe improvement for profile P-1000-1. Despite these smallvariations of the SRTs due to the processing with PLATT inthe fluctuating noise condition, one might feel comfortable tostate: As expected, the effect of a limited frequency range (anincrease in SRT that cannot be compensated by simple am-plification) cannot be compensated with the expansion per-formed by PLATT.The data in the column NH, that is, the difference in SRTbetween the simulation with a specific listener profile and thenormal-hearing profile, shows again how different the simula-tion results are in the stationary and fluctuating noise con-dition. While limiting the frequency range to 1000 Hz (P-1000-1) only resulted in a surprisingly small increase in SRT, namely by 2.3 dB, the same modification increased the SRTin the fluctuating noise condition by 15.0 dB. This reaffirmsthe use of different noise maskers to assess speech in noiserecognition performance.Interpreting the addition of maskers to a speech signal as afrequency-dependent removal of information, the results indi-cate that the information in the mid-frequency range (1000-4000 Hz), was much more relevant in the considered fluctuat-ing noise condition (ICRA5-250) than in the stationary noisecondition (OLNOISE). In other words, in the stationary noisecondition, this mid-frequency portion was mostly redundantat the SRT, and the ASR system could discriminate the 50words of the matrix sentence test almost equally well onlyusing the low-frequency information. This was not the case17 able 1:
Simulated benefits in SRT compared to the respec-tive unaided conditions averaged over high presentation levels(70, 80, and 90 dB SPL), where simple linear amplification can-not improve the SRT. NH indicates the normal-hearing listenerprofile.
Stationary noise (OLNOISE)Profile
PLATT-2 PLATT-4 PLATT-6 PLATT-8
NHP-8000-1 0.0 0.1 0.2 -0.1 0.0P-8000-7 0.5 0.7 0.8 0.7 1.2P-8000-14 1.0 2.0 2.5 2.7 4.2P-8000-21 0.8 2.2 3.2 3.9 6.7P-4000-1 0.1 0.4 0.4 0.2 -0.1P-4000-7 0.5 1.0 1.3 1.2 1.5P-4000-14 0.9 2.2 3.0 3.4 4.9P-4000-21 1.3 3.0 4.0 4.8 7.9P-2000-1 0.1 0.2 0.2 -0.1 1.2P-2000-7 0.5 1.1 1.4 1.2 3.1P-2000-14 0.8 2.0 3.0 3.2 6.5P-2000-21 1.0 2.5 4.0 4.6 9.7P-1000-1 -0.1 0.0 -0.1 -0.2 2.3P-1000-7 0.6 1.4 1.6 1.5 5.0P-1000-14 0.9 2.4 3.1 3.5 9.0P-1000-21 1.3 3.4 4.6 5.3 13.1Fluctuating noise (ICRA5-250)Profile
PLATT-2 PLATT-4 PLATT-6 PLATT-8
NHP-8000-1 -0.0 -0.5 -1.8 -2.0 0.0P-8000-7 1.2 1.9 1.9 1.4 4.6P-8000-14 1.2 2.9 4.2 4.3 8.7P-8000-21 1.8 4.7 5.7 5.9 12.5P-4000-1 0.4 -0.7 -0.8 -1.0 3.2P-4000-7 1.3 2.7 2.0 2.4 7.2P-4000-14 2.1 3.7 4.6 4.8 11.4P-4000-21 2.0 4.7 6.1 7.0 14.7P-2000-1 0.6 0.4 0.0 0.2 6.5P-2000-7 1.3 2.5 2.2 2.9 9.7P-2000-14 1.8 4.2 5.3 5.7 14.3P-2000-21 1.9 4.6 6.6 7.4 17.9P-1000-1 0.3 1.0 2.1 1.8 15.0P-1000-7 1.9 3.3 4.1 5.1 18.8P-1000-14 1.3 4.3 5.7 6.6 22.4P-1000-21 2.1 4.9 6.2 7.8 26.3 for the fluctuating noise condition, where the mid-frequencyportions contributed a substantial part of the information toachieve low SRTs (say, less than -15 dB). For the followingpresentation of the benefits with PLATT, it is important tokeep in mind that this missing information, by design, cannotbe compensated with PLATT. That means, when consideringthe maximum achievable improvement for a specific profile,the increase in SRT due to limiting the frequency range wasdisregarded. For example, with profile P-2000-14 in the fluc-tuating noise condition, the difference to the normal-hearingSRT is 14.3 dB; improving the SRT by 14.3 dB the perfor-mance of the normal-hearing profile would be achieved. Whenonly considering the increase in SRT due to limiting the fre-quency range, that is profile P-2000-1, the difference to thenormal-hearing SRT is 6.5 dB. Hence, the maximum achiev-able improvement for profile P-2000-14 is then estimated bythe difference . − . . dB.For profiles with increased level uncertainty (P-*-7, P-*-14,and P-*-21), the average benefits in SRT due to using PLATT were positive and ranged from 0.5 to 7.8 dB. For these profiles,the lowest improvements were found with PLATT-2, and thehighest improvements often, but not always, with PLATT-8.Overall, in both noise conditions, a very strong relation be-tween the increase in level uncertainty and the improvementin SRT due to using PLATT could be observed. Combined,there was strong general tendency that with higher level un-certainty and with higher expansion factors larger improve-ments were observed. Exceptions were observed only for theprofiles with a level uncertainty of 7 dB (P-*-7), where higherexpansion factors did not further increase the improvement.This data supports the sensible expectation that lower ex-pansion factors are sufficient (or even optimal) to compensatelower values of level uncertainty. For profiles with a level un-certainty of 14 or 21 dB, the improvements always increasedtogether with the expansion factor. With high values of leveluncertainty, higher expansion factors might further improvethe SRT; however, increasing the dynamic of a signal portioneight fold might have undesirable collateral effects which arediscussed later.In absolute terms, the improvements were generally largerin the fluctuating noise condition than in the stationary noisecondition. For example with the extreme profile P-1000-21and PLATT-8 compensation, the improvement was 5.3 dB inthe stationary noise condition, and 7.8 dB in the fluctuat-ing one. However, relating the improvement to the perfor-mance with the normal-hearing profile (rightmost column inTable 1), 5.3 dB of a total class D loss of 13.1 dB were com-pensated in the stationary noise condition and “only” 7.8 dBof a total class D loss of 26.3 dB. While this interpretationcorrectly reflects the proportion of the total class D loss thatwas compensated, it does not reflect that a part of the classD loss cannot be compensated by expansion approach withPLATT by design. As explained earlier, the portion of theclass D loss due to limiting the frequency range to 1000 Hzis very different in both maskers. To evaluate the achievedimprovements with respect to the maximum achievable im-provement, the class D loss due to limiting the frequencyrange needs to be disregarded. In the context of the maxi-mum achievable improvement, PLATT-8 compensated 5.3 dBof (13 . − . . dB and 7.8 dB of (26 . − . . dB ofthe class D loss due to a level uncertainty of 21 dB in the sta-tionary and fluctuating noise condition, respectively. For theintermediate mixed profile P-2000-14, PLATT-6 compensated3.0 dB of (6 . − . . dB and 5.3 dB of (14 . − . . dBin the stationary and fluctuating noise condition, respectively.And for the least extreme mixed profile P-4000-7, PLATT-4compensated 1.0 dB of (1 . − − . . dB, and 2.7 dB of (7 . − . . dB. Hence, in relative terms, over a broadrange of assumed parameters, PLATT expansion compen-sated at least about half of the class D loss caused by anincrease of a level uncertainty.The absolute improvements due to PLATT tended to in-crease with an increased limitation of the frequency range.The effect of the level uncertainty increased with an increas-ing limitation of the frequency range. Based on the presenteddata, however, it is difficult to make statements about thefrequency-dependency of an optimal expansion factor becauseof the diverse non-linear interactions between the consideredparameters.18
20 -15 -10 -5 0 5 10 15 200102030405060708090100
P-4000-7 OLNOISE W o r d r e c ogn i t i on r a t e / % c o rr e c t SNR / dB noneplatt2platt4platt6platt8 -20 -15 -10 -5 0 5 10 15 200102030405060708090100
P-4000-7 ICRA5-250 W o r d r e c ogn i t i on r a t e / % c o rr e c t SNR / dB
Figure 19:
Segments of psychometric functions for simulationswith listener profile P-4000-7 without and with PLATT ex-pansion factors 2, 4 and 6 in stationary noise (left panel) andfluctuating noise (right panel). The dotted and dashed linesindicate a word recognition rate of 50% and 80% correct, re-spectively. -20 -15 -10 -5 0 5 10 15 200102030405060708090100
P-2000-14 OLNOISE W o r d r e c ogn i t i on r a t e / % c o rr e c t SNR / dB noneplatt2platt4platt6platt8 -20 -15 -10 -5 0 5 10 15 200102030405060708090100
P-2000-14 ICRA5-250 W o r d r e c ogn i t i on r a t e / % c o rr e c t SNR / dB
Figure 20:
Segments of psychometric functions for simulationswith listener profile P-2000-14 without and with PLATT. Ana-log to Figure 19.
SNR-dependency of benefits in SRT
We (humans) generally prefer to have conversations at higherSNRs than the SRT-50, that is, at SNRs at which more than50% of the words can be correctly recognized; which mightcorrespond to something like the SRT-80. The SRT-80 can besimulated, but it cannot be as efficiently and accurately mea-sured in listening experiments as the SRT-50, because of theshallower slope of the psychometric function at the SRT-80.The main reason for simulating the SRT-50 was that it can beaccurately measured in later listening experiments with hu-man listeners, and accurate measurements are a requirementto show effects as small as 1 dB. To assess if the improve-ments would (at least according to the model) translate toimprovements at SRTs preferred in real conversations, thepsychometric functions of aided and unaided conditions werecompared. Segments of psychometric functions were obtainedby evaluating simulations at SRT-20, SRT-25, ..., SRT-90.Figure 19, 20, and 21 present the unaided and aided psycho-metric functions for the mixed profiles P-4000-7, P-2000-14,and P-1000-21, respectively. The simulation results in theunaided conditions are plotted in black, the correspondingsimulation results with PLATT2, PLATT4, PLATT6, andPLATT8 expansion in blue, red, yellow, and purple color, re-spectively. As expected, the slopes in the fluctuating noiseconditions (right panels) were shallower than the slopes in -20 -15 -10 -5 0 5 10 15 200102030405060708090100
P-1000-21 OLNOISE W o r d r e c ogn i t i on r a t e / % c o rr e c t SNR / dB noneplatt2platt4platt6platt8 -20 -15 -10 -5 0 5 10 15 200102030405060708090100
P-1000-21 ICRA5-250 W o r d r e c ogn i t i on r a t e / % c o rr e c t SNR / dB
Figure 21:
Segments of psychometric functions for simulationswith listener profile P-1000-21 without and with PLATT. Ana-log to Figure 19. the stationary noise condition (left panel). Also, as expected,the data points in the fluctuating noise conditions were morenoisy, due to the greater variability in the spectro-temporaldistribution of the masker energy. This variability could bedecreased by increasing the amount of training data and test-ing data. Within the uncertainty due to this variability, noreduction in improvement between SRT-50 and SRT-80 canbe observed for listener profile P-4000-7. For profile P-2000-14 (in Figure 20), where the improvements are larger, therewas a trend towards a slight increase in slope with higher ex-pansion factors. This indicates, that the improvements due tothe expansion with PLATT of the SRT-80 might be slightlylarger than the corresponding improvement of the SRT-50.This trend was confirmed by the data with the profile P-1000-21 in Figure 21. According to the these simulations, theexpansion with PLATT was found to improve the simulatedSRT-80 to the same extent or even more than the SRT-50.
Discussion
The presented experimental results were derived using amodel of auditory perception, more specifically, a model of im-paired human speech recognition based on automatic speechrecognition. The modeling approach with FADE brings as-sumptions about the impaired human speech recognition pro-cess into a form in which they can be tested by comparing pre-dictions with empirical data. The employed model, the frame-work for auditory discrimination experiments (FADE) suchas it was used by Schädler et al. (2020a), was already evalu-ated with respect to predictions of the individual aided speechrecognition performance of listeners with impaired hearing.There, as elaborated in the Introduction, an important as-sumption was that the part of the hearing loss which cannotbe explained by the absolute hearing threshold, that is, themissing piece to describe the effect of hearing loss on speechrecognition in noise, can be explained by the level uncertainty.Another assumption was, that this model parameter also af-fects tone in noise perception and hence its value could beinferred from tone in noise detection tests. While the resultssupported this hypothesis, evidence was not sufficient to ruleout other mechanisms that would also increase the SRTs innoise. This is a fundamental problem in modeling the indi-vidual speech recognition performance. While the quantitythat is predicted by the model, the SRT, can be measured in19xperiments with human listeners, the outcome of such mea-surements still depends on many correlated and non-linearlyinteracting parameters; some of which cannot be controlledvery well, such as, e.g., attention. And even if the experimen-tal results were measured with the most accurate methods,the measurement errors include this uncontrollable (human)variability which can increase the required amount of data tofalsify hypotheses to infeasible regions. This is especially truefor hypotheses which predict relatively small effects. Hence,there is reasonable doubt about whether the removal of in-formation in listeners with impaired hearing is really well de-scribed by the level uncertainty, or if it just coincidentallyincreased the SRT in the correct conditions.To more specifically test if the level uncertainty is suit-able to describe the effect of hearing loss on speech recog-nition in noise, a promising approach is to interact with it.The expansion of PLATT was specifically designed to inter-act with—namely compensate—the effect of the level uncer-tainty. The modeled data clearly showed this interaction. Ifthis specific interaction was found in empirical data, it wouldbe strongly supportive for the hypothesis of the existence of amechanism in the human auditory system similar to the leveluncertainty. Beyond the academical interest in the suitabilityof the assumptions to describe impaired human speech recog-nition performance, a positive result would have immediatepractical implications for the design of hearing loss compen-sation strategies.Let’s remember that the goal was not to test if the expan-sion with PLATT improves speech recognition performanceof listeners with impaired hearing. For that, measurementswith listeners with impaired hearing will be necessary. Theaim of this contribution was to:A) Present an approach which is able to partially compen-sate a class D loss as implemented with the level uncer-tainty in FADE, andB) objectively evaluate this approach and come up withtestable quantitative hypothesis on the benefit in noisylistening conditions.The latter aim, in other words, was to guide the planningof the measurements with listeners with impaired hearing to-wards optimal evidence. Hence, the following discussion isoriented towards the planning of a suitable experiment.
What is a realistic class D loss?
Plomp (1978) did not distinguish mechanisms causing thepostulated class D hearing loss which he used to describehearing loss in noise. His D component described the totalloss in SRT in noise due to impaired hearing, also referred toas speech hearing loss D (SHL D ). He reported average em-pirical values of SHL D from five investigations which rangefrom as low as 1 dB to above 10 dB (cf. TABLE III in Plomp(1978)). In stationary noises (average speech spectrum noise,white noise, and airplane cockpit noise) the maximum averagevalues for SHL D were 6 to 8 dB. In fluctuating noises (inter-fering talker and voice babble) the maximum average valueswere larger, up to 14 dB. The increase in SRT in noise wasrelated to the increase in SRT in quiet SHL A+D in FIG. 8in Plomp (1978), which showed that every 3-dB increase of SHL
A+D comes with a 1-dB increase of SHL D . This dataindicated that on average, about a third of the hearing lossin quiet (SHL A+D ) was incompensable by simple amplifica-tion; that would be a lot . Listeners with speech hearing lossin quiet of 39 dB would have on average 13 dB of tradition-ally incompensible loss. Also, Plomp (1978) observed thatindividual data differed considerably from the mean. This in-dividual variability could be due to individually different de-grees of and causes for the measured loss, or due to insufficientmeasurement accuracy. An important observation he madewas that data points from listeners with age-related hear-ing loss agreed well with data points based from studies onother sensorineural hearing impairments. As a consequencehe considered age-related hearing loss to be primarily due todeterioration in the auditory pathway rather than to mentalimpairment. This last point is fundamental considering thatmental impairment can most likely not be compensated bysignal processing.However, these findings have to be taken with care. Thedata used by Plomp (1978) were measured at different siteswith different speech tests using different setups. The result-ing list of possible systematic and random errors in the under-lying data is long. The main contribution to the systematicerror (expectable differences across studies) apart from thecalibration error and the different listener panels was proba-bly the use of different speech tests (including measurementparadigm, speech material, masker, presentation, ...). It isknown, that the type of speech material (e.g., logatomes,numbers, isolated words, or sentences) makes a differencein the outcome of a speech in noise recognition experiment.Hence, tests with different speech material might also be dif-ferently susceptible to hearing loss and result in different val-ues of SHL D . There is no reason to assume that SHL D isindependent from the speech test. The main contribution tothe random error (unpredictable differences across measure-ments) was probably different for the studies and due to thestochastic nature of the measurement procedures. Both, therandom and the systematic errors were not specifically con-sidered in the analysis of Plomp (1978). The variety of un-knowns makes it difficult to translate his finding to expectablevalues and individual variability of SHL D with the employedmatrix sentence test. Matrix sentence tests were designed tominimize the random error in the measurement of an SRT,where Kollmeier et al. (2015) reported a test-retest reliabil-ity of 0.5 dB for listeners with normal hearing and 0.7 dB forlisteners with impaired hearing.Fortunately, a suitable data set was measured with a ma-trix sentence test. Wardenga et al. (2015) found a similarrelation between the degree of hearing loss and SHL D for theGerman matrix sentence test in the (stationary) test-specificnoise condition, where the SRT in noise increased by about1 dB every 10-dB increase in PTA for PTAs below 47 dB HL(cf. Figure 5 there). This is much less than the average (1 dBevery 3-dB) increase found by Plomp (1978). One reasonfor the lower increase could be that the speech hearing lossin quiet (SHL A+D ) includes the speech hearing loss in noise(SHL D ), while the PTA might not. But that alone cannotexplain the huge difference. Another reason for the differencecould be that Wardenga et al. (2015) derived the relation from pure-tone average D for PTAs of 60 dB HLcould be about 6 dB in stationary noise. The individual vari-ability of SHL D in their analysis was reported with 1.17 dB forPTAs below 47 dB HL, which was only twice the test-retestreliability of the matrix test. An interpretation of this relationcould be that a given hearing loss in quiet is very likely relatedto a certain hearing loss in noise. This interpretation wouldbe only valid if additional amplification would not changethe relation, that is, if the same relation was observed for anoise presentation level of, e.g., 75 dB SPL instead of 65 dBSPL. However, it is difficult to speculate on that. On the onehand, the test-specific noise (OLNOISE) reduces the effectof the individual hearing threshold by masking the speechsignals with a stationary matched-spectrum noise. On theother hand, for higher PTAs, the individual hearing thresh-old will eventually exceed the noise level at high frequencies,which could be compensated by amplification. Then again,according to the simulations presented in this contribution,the removal of high-frequency portions ( > Hz) of the sig-nals in the OLNOISE condition had little effect on the SRT.These considerations demonstrate the inherently non-linearrelations between the parameters assumed to affect speechrecognition in noise, and hence the difficulty of abstractingthese from a given context.Nonetheless, to continue with an specific value for SHL D forthe German matrix sentence test that could likely be found inthe wild, the relation established by Wardenga et al. (2015)is assumed to ve valid. In favor of this assumption is that thelinear regression estimated from the data with PTAs below47 dB HL describes the data in their Figure 5 (solid blackline) really well. One would expect an increase in variabil-ity with the PTA (below 47 dB HL) if the absolute hearingthreshold affected the SRT, such as it was observed for higherPTAs; but this was not the case. Accordingly, Wardengaet al. (2015) concluded that with 65 dB SPL fixed noise pre-sentation level the SRT is determined by listening in noise forPTAs < ≈ dB HL, and above it is determined by listeningin quiet. Based on these considerations, it seems likely thatlisteners with an SRT of 0 dB in the German matrix sentencetest in the test-specific stationary noise exist whose speechrecognition performance cannot be improved further by sim-ple amplification. That would indicate an SHL D of about7 dB.This estimate can be used as an orientation to interpretthe right-most column in Table 1, which is equivalent tothe SHL D . For example, profile P-2000-14 with an SHL D of 6.5 dB lies within the range of empirically observed values.In this context, profiles P-1000-14 and P-2000-21 might beborder cases with values of 9 to 10 dB for SHL D , and profileP-1000-21 lies probably outside range of empirically observedvalues for SHL D . Profiles P-8000-21, P-4000-14, P-2000-14,and P-1000-7, are all compatible with a large SHL D of lessthan 7 dB in the stationary noise condition. The key differ- ence between these profiles is the proportion of SHL D that isdue to the level uncertainty and hence might be compensableby a PLATT expansion. Assuming that the average availablefrequency range to listeners with impaired hearing is between2000 and 4000 Hz, P-4000-14 and P-2000-14 could representparameter values in a realistic range.The aim of classifying profiles into more (and less) realisticones is to use them to formulate a quantitative hypothesis onthe expected benefit in SRT due to PLATT expansion. Thisdoes not mean that the considered profiles exist. The assump-tion that the reduction in speech recognition performance innoise is due to a combination of a limited frequency rangeand a mechanism similar to an increased level uncertaintyjustifies their use. While this assumption may be sensible,it doesn’t mean that it is correct. This would have to betested in listening experiments. For the effect of the limitedfrequency range, this could be achieved by, e.g., low-pass fil-tering the signals in speech recognition experiments. For theeffect of an increased level uncertainty, a direct verificationis currently not possible because the level uncertainty adds anoise in a domain which is not accessible for manipulationsin experiments with human listeners; unlike for the limitationof the frequency range, there is no known equivalent signalmanipulation. As explained in the Introduction, the mostpromising (and at the same time constructive) approach totest whether a mechanism similar to an increased level un-certainty causes a part of the class D hearing loss of listenerswith impaired hearing is trying to compensate it. Role of PLATT expansion
The expansion feature of the PLATT dynamic range manip-ulation approach aims to mitigate the increase in SRT dueto an increased level uncertainty. It was specifically designedto compensate the effect of the level uncertainty on speechrecognition performance such as it is implemented in FADE.The presented simulation results showed that this necessaryinterim goal was achieved and indicate that about half ofclass D loss due to an increased level uncertainty was com-pensated. This only demonstrates that the expansion worksas intended in the context of the model. There is no reason toassume that the results, that is, the partial compensation ofa class D hearing loss, would transfer to experiments humanlisteners. Now—the key question of this contribution—whatwill happen in experiments with human listeners, in case thecentral model assumptions are incorrect?A) If the PLATT expansion does not improve the SRTs innoise of human listeners with a class D hearing loss, thiswould indicate that the mechanism that causes the classD loss is not well described by the level uncertainty, be-cause the PLATT expansion interacts differently with themechanism in human listeners than with the mechanismin the model.B) If the PLATT expansion improves the SRTs in noise ofhuman listeners and if these improvements are in linewith individual predictions, this would strongly supportthe hypothesis that a mechanism similar to an increasedlevel uncertainty causes a part of the class D hearing lossof listeners with impaired hearing.21) However, more likely than B), if the PLATT expansionimproves the SRTs in noise of human listeners but theimprovements are found to be substantially smaller thanpredicted, this would only partly support the hypothe-sis that a mechanism similar to an increased level uncer-tainty causes a part of the class D hearing loss of listenerswith impaired hearing.In scenario A), a central model assumption is incorrect. The(to-be) collected empirical data still can help to improve themodel, but the expansion feature of PLATT would be uselessin the context of a hearing device. In scenario B), the centralmodel assumptions are probably correct. The (to-be) col-lected empirical data cannot help to improve the model, butthe expansion feature of PLATT would overcome a seriouslimitation of current hearing device technology. In scenarioC), most of the central model assumptions were probably cor-rect. The (to-be) collected empirical data can probably helpto improve the model, and the expansion feature of PLATTcould give strong hints regarding how to overcome a seriouslimitation of current hearing device technology. In this lastscenario, a differentiated analysis of the predictions errors willbe helpful in tracking down the inaccurate assumptions. Theprobabilities for scenario A), B), and C) are unknown. Toprovide evidence for any scenario, an experiment with humanlisteners suitable to unmistakably show the compensation ef-fect has to be conceived.
Potential of compensating a class D loss
The presented simulation results clearly show the potentialof the expansion of spectral modulation patterns in the rangebetween 2 and 4 ERB, as implemented in PLATT, to com-pensate a class D hearing loss that was implemented by anincreased level uncertainty. In both noise conditions, approxi-mately half of the class D loss due to an increased level uncer-tainty was compensated, while the class D loss due to a lim-ited frequency range was not compensated. Narrowing downthe simulation results to profiles P-4000-14 and P-2000-14,which were identified earlier as more realistic, the predictedimprovements in SRT with PLATT-6 were 3.0 dB in the sta-tionary noise condition. These improvements would corre-spond to a compensation of . . ≈ and . . − . ≈ ofthe class D hearing loss due to the increased level uncertainty.In the fluctuating noise condition, improvements in SRT of4.6 and 5.3 dB were predicted with PLATT-6, with profilesP-4000-14 and P-2000-14, respectively. These improvementswould correspond to a compensation of . . − . ≈ and . . − . ≈ of the class D hearing loss due to the in-creased level uncertainty. Hence, if the model assumptionsare correct, benefits in SRT of about 3.0 dB and 5.0 dB inthe stationary and fluctuating noise condition are expected,which correspond to more than 50% of the respective class Dlosses due to the level uncertainty.Such benefits would be individually measurable with thestandard German matrix sentence test. However, accordingto the profile P-4000-7, for which benefits in SRT of up to1.3 dB and 2.7 dB were predicted in the stationary and fluctu-ating noise condition, respectively, individual measurementswould possibly not yield significant results. This is becausethe individual benefit in SRT is derived from two individual measurements (aided and unaided), each generating a ran-dom error of about 0.7 dB (standard deviation) for listenerswith impaired hearing. If the real benefit is lower than about( . · √ · . ≈ . ≈ )2 dB, a significant (with p-value < . ) effect can be only shown in group averages but not insingle individual measurements. However, showing an effectin individual measurements is highly preferable because thenthe data could be used to further analyze the individual char-acteristics of listeners with and without benefits. Hence, evenif this goal might not be achieved, experiments with humanlisteners should be designed with the aim to maximize theabsolute effect. Optimal values for the expansion
The optimal (in terms of improvements in SRT) value forthe expansion factor depended on the level uncertainty pa-rameter. For lower values of the level uncertainty, in manyconditions lower factors (4 or 6) resulted in the best speechrecognition performance, but higher values did not decreasethe improvement much. For the profiles P-4000-14 and P-2000-14, a factor of 8 was optimal. From this perspective,nothing speaks against evaluating PLATT with high expan-sion factors in listening experiments. Because an expansionfactor of 8 results in a strong modification of the signal, andbecause it is unknown how such modifications are perceivedin terms of quality and loudness by listeners with impairedhearing, at least two values for the expansion factor shouldbe evaluated in listening experiments.
Audio quality and loudness
In measurements with human listeners, speech recognitionperformance, audio quality perception, and loudness percep-tion all depend on the presentation level. In this contribu-tion, efforts were made to separate the speech recognitionperformance from the presentation level as much as possible.This resulted in considering elevated presentation levels, andloudness perception is strongly related to presentation levels.Also in this contribution, efforts were made to mitigate au-dible artifacts in the manipulation and resynthesis stages ofPLATT. But the non-linear manipulation inevitable resultsin audible artifacts which probably affect audio quality per-ception. The effect of processing noisy speech signals at 0 dBSNR with PLATT-1, PLATT-4, and PLATT-8 is illustratedin Figure 22. The spectro-temporal maxima of the shownlog Mel-spectrograms are 68.8, 73.4, and 79.8 dB SPL for thenoisy signals in stationary noise, and 77.0, 81.8, and 89.3 dBfor the noisy signals in fluctuating noise. The correspond-ing effective amplitudes (RMS) are 66.6, 69.2, and 74.4 dBSPL for the noisy signals in stationary noise, and 69.7, 72.3,77.8 dB for the noisy signals in fluctuating noise. The increasein RMS and maximum power, with PLATT-8 by about 8 dBand more than 10 dB, respectively, suggests an increased loud-ness perception and also demonstrates the necessity to eval-uate the approach in conditions in which an increase in levelmust not result in an improvement in SRT. Also, the pro-cessing will probably have an effect on the perceived audioquality, where, with increased expansion factors, the audioquality will eventually decrease. In the subjective opinion ofthe author, based on a comparison of unprocessed and pro-22 ime / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Time / ms F r equen cy / H z
250 500 750 10001245031057186530454768728410957 l e v e l / d B SP L Figure 22:
Illustration of the effect of the expansion factor with PLATT: Log Mel-spectrograms of processed noisy speech in stationary(left column) and fluctuating noise (right column) at 0 dB SNR processed with PLATT-1 (top row), PLATT-4 (center row), andPLATT-8 (bottom row). cessed signals, the artifacts due to processing the stimuli withPLATT-1 are minor, with PLATT-4 clearly audible, and withPLATT-8 pronounced. While it would be a difficult task topredict the effect of the processing on the perception of au-dio quality of listeners with normal hearing, it would be evenmore so for listener with impaired hearing.Apart from speech recognition performance, individualloudness and audio quality perception are important percep-tual dimensions for hearing aid users. The expected increasein perceived loudness and a possible decrease in perceivedaudio quality are considered collateral side-effects of a signalmanipulation with PLATT. These side-effects can either betolerated or mitigated/compensated with further extensionsor modifications of the PLATT approach. It is unclear howa listener with impaired hearing would trade speech recogni-tion performance for audio quality or loudness in the contextof PLATT expansion. To gather evidence about the relationof PLATT expansion, loudness perception, and audio qual-ity perception, the considered expansion factors should covera wide range of these side-effects. Hence an evaluation ofspeech recognition performance, audio quality, and loudnesswith PLATT expansion factors of 1, 4, and 8 is proposed.This will help to better estimate the suitability of the rawapproach in the context of hearing aids and show which prop-erties should be in the focus for possible future optimization.There are several possibilities to optimize the expansion ap-proach with respect to loudness perception and quality. Forexample, the expansion factor can be chosen to be frequencydependent. If the frequency regions in which the benefits areachieved by expansion do not fully overlap with the frequencyregions where loudness perception is critical, there could bepotential to favorably trade speech recognition performanceagainst loudness. The expansion factor could be limited, or mapped otherwise non-linearly to only expand (the low) mod-ulation amplitudes required to improve speech perception inlow-SNR conditions while leaving (the already high) mod-ulation amplitudes in high-SNR conditions unmodified. Adetailed discussion of possible modifications is considered toohypothetical to be further elaborated in this contribution.
Proposed measurement conditions
Based on the presented simulations and considerations, thefollowing experimental measurement conditions are proposedto test the presented hypothesis that a part of a class D hear-ing loss can be compensated by expanding spectral patternsin the range between 2 and 4 ERB in speech in noise recogni-tion experiments with human listeners. Regarding the noisemaskers, an evaluation with the test-specific noise (e.g. OL-NOISE for the German matrix sentence test), the fluctuatingICRA5-250 noise, and, in addition, a competing voice masker,e.g., the International Speech Test Signal (ISTS; Holube et al.,2010) or its optimized version with limited pause durations,the International Female Fluctuating Masker (IFFM ), isproposed. The IFFM masker is interesting because it sharesthe speech modulation properties with the target speaker andusually results in very weak masking for listeners with nor-mal hearing. It was not included in the objective evaluationwith FADE because predictions with FADE for competingtalker scenarios are known to be inaccurate; Schädler et al.(2018) reported the effect of the masker to be overestimatedby approximately 10 dB.The noise presentation levels ideally would be sufficientlyhigh to compensate any hearing loss which is compensable
23y simple amplification. For the stationary noise condition,this would be easier to achieve than with the fluctuating noiseconditions. As shown in left panel of Figure 2, even mild hear-ing loss profiles are expected to require very high presenta-tion levels to maximize the compensation with simple linearamplification. However, the improvements in SRT at highpresentation levels can be expected to be reasonably small.As a compromise, and to avoid measuring the whole Plomp-curve, a reference experiment with an increased level can beperformed to test if the equivalent linear amplification wouldresult in a similar speech recognition performance than usingPLATT expansion. The observed increase in RMS level withPLATT-8 was about 8 dB. Adding a small margin, 10 dB am-plification should result in higher RMS levels of noisy speechsignals at 0 dB SNR than processing the same signals withPLATT-8. Hence, noise presentation levels of 70 and 80 dBSPL are proposed. The proposed noise presentation level is70 dB SPL for aided and unaided measurements, while thenoise presentation level of 80 dB SPL serves as an alternativeto the aided measurement; one in which 10 dB amplificationis applied. If the processing with PLATT expansion achievesbetter SRTs than with the 10 dB linear amplification, thiswould indicate the compensation of a class D hearing loss.However, the RMS level is only a rough proxy to loudnessperception. Hence, in all conditions (70 dB SPL unaided,70 dB SPL aided, 80 dB SPL unaided), the loudness percep-tion of characteristic signals, e.g., speech in noise at 0 dBSNR, should be measured to check if the perceived loudnessof the processed 70 dB SPL signals is really lower than theperceived loudness of the 80 dB SPL signals.To further mitigate any benefits due to simple linear am-plification at high presentation levels, all stimuli should below-pass filtered to achieve an effective frequency range of2000 Hz. This would limit the evaluation of a benefit due toPLATT expansion to the frequency range up to 2000 Hz; thecentral frequency region that listeners with impaired hear-ing typically can use for speech recognition. A focus onthis frequency region could be advantageous in a first evalu-ation. The low-pass filtering would make a group of listenerswith impaired hearing less heterogeneous, because larger dif-ferences across listeners are usually observed at frequenciesabove 2000 Hz. And the loudness perception of such low-passfiltered stimuli might also be more pleasant than their corre-sponding broadband variants.In total, the proposed measurements comprise 18 condi-tions: Three for training, three unaided measurements at70 dB SPL (one for each noise masker), three unaided mea-surements at 80 dB SPL (also one for each noise masker), andnine measurements with PLATT-1, PLATT-4, and PLATT-8 (that is, three for each noise masker). In all conditions,SRT, loudness perception and audio quality perception arerecommended to be assessed.In addition, it is recommendable to characterize the listen-ers in the way Schädler et al. (2020a) proposed it, that is, withtone detection and tone in noise detection experiments, andnot only with clinical audiograms. This additional data couldbe used to perform individual predictions of benefits withFADE and compare them to the individual measurements.The comparison of individual measurements and predictionsis crucial to detect invalid assumptions in the model.
Final practical considerations
As speech material for the SRT measurements, the optimizedmatrix test sentences in native language are recommended(Kollmeier et al., 2015). These are phonetically balanced andoptimized for a high test-retest reliability. To minimize thetraining effect, three training lists of 20 sentences in noiseare recommended. Lists of 30 sentences obtain a higher test-retest reliability but can also result in excessively long mea-surements which can be fatiguing. Hence, as a compromise,measurements with lists of 20 sentences could be used andrepeated on a different day, which would also allow to detecta possibly remaining training effect.For a first approach, headphone measurements are prefer-able over free-field measurements because they can be per-formed monaurally, which is recommendable. A monauralpresentation prevents the interference of possibly individualbinaural effects and facilitates the comparison to model pre-dictions, as discussed by Schädler et al. (2020a).
Outlook
Once there is evidence whether the proposed compensationstrategy has a positive effect on speech recognition perfor-mance of listeners with impaired hearing, the band-width lim-itation to 2000 Hz that was recommended for the first exper-iments should be removed. For the next steps, it would notbe important anymore to demonstrate the compensation of aclass D loss alone, but to show benefits in more realistic andindividually optimized configurations. Hence, there shouldbe a shift towards a joint individual optimization of the com-pensation of class A and class D loss, loudness perception,and audio quality. Then, care should be taken to individuallynormalize loudness perception to get meaningful data. If theexpansion approach with PLATT proves to work in monaurallistening conditions, the concept should be extended to a bin-aural listening condition. For this, a free-field setup with amobile hearing aid prototype hardware is recommendable tocorrectly assess individual binaural hearing, including binau-ral loudness perception. To better understand which speechportions are most affected by PLATT expansion, it wouldalso be interesting to study its effect on phonemic contrasts.
Conclusions
The most important findings of this work can be summarizedas follows:• The functional modeling of the class D hearing loss withthe framework for auditory discrimination experiments(FADE), implemented by means of the level uncertainty,was interpreted as the counterpart of a compensationstrategy which aims to (partially) compensate a class Dheaing loss.• The strict low-delay constraints in hearing aid applica-tions only allow for a manipulation of mainly spectralmodulation patterns. Of these, the patterns in the rangeof 2 to 4 ERB seem especially suitable to protect themagainst the effect of the level uncertainty by dynamicrange expansion.24 A low-delay, real-time capable implementation of apatented dynamic range manipulation scheme (PLATT)which allows to perform the required dynamic range ex-pansion was proposed. The implementation was opti-mized to run in real-time on the Raspberry Pi 3 ModelB platform.• The evaluation of the PLATT expansion with FADE forseveral idealized profiles of hearing loss indicated thatapproximately half of the class D hearing loss due to anincreased level uncertainty was compensable.• Simulations with FADE were used to predict the out-comes of specific speech recognition experiments prior to performing these. The underling hypothesis, that aclass D hearing loss can be (partially) compensated, canbe directly tested in an experiment with human listenersin the same listening conditions. Recommendations forthis experiment were elaborated.
References
Bisgaard, N., Vlaming, M. S., and Dahlquist, M. (2010) Stan-dard audiograms for the IEC 60118-15 measurement procedure.
Trends in amplification , 14(2):113–120, https://doi.org/10.1177%2F1084713810379609
Bustamante, D. K. and Braida, L. D. (1987) Principal-component am-plitude compression for the hearing impaired.
The Journal of theAcoustical Society of America , 82(4):1227–1242, https://doi.org/10.1121/1.395259
Dreschler, W. A. (1992) Fitting multichannel-compression hear-ing aids.
Audiology , 31(3):121–131, https://doi.org/10.3109/00206099209072907
Dreschler, W. A., Verschuure, H., Ludvigsen, C., and Westermann, S.(2001) ICRA noises: artificial noise signals with speech-like spectraland temporal properties for hearing instrument assessment.
Audiol-ogy , 40(3):148–157, https://doi.org/10.3109/00206090109073110
European Telecommunications Standards Institute (2007) "202050 v1.1.5" Speech processing transmission and quality aspects(STQ); Distributed speech recognition; Advanced front-end fea-ture extraction algorithm; Compression algorithms.
Standard , Grimm, G., Herzke, T., Ewert, S., and Hohmann, V. (2015) Imple-mentation and evaluation of an experimental hearing aid dynamicrange compressor. In
Proceedings of German Annual Conference onAcoustics , 185–188, http://pub.dega-akustik.de/DAGA_2015/data/articles/000429.pdf
Hochmuth, S., Kollmeier, B., Brand, T., and Jürgens, T. (2015) Influ-ence of noise type on speech reception thresholds across four languagesmeasured with matrix sentence tests.
International Journal of Au-diology , 54(sup2):62–70, https://doi.org/10.3109/14992027.2015.1046502
Hohmann, V. and Kollmeier, B. (1995) The effect of multichanneldynamic compression on speech intelligibility.
The Journal of theAcoustical Society of America , 97(2):1191–1195, https://doi.org/10.1121/1.413092
Holube, I., Fredelake, S., Vlaming, M. and Kollmeier, B. (2010) Devel-opment and analysis of an international speech test signal (ISTS).
In-ternational Journal of Audiology , 49(12):891–903, https://doi.org/10.3109/14992027.2010.506889
Hülsmeier, D., Warzybok, A., Kollmeier, B., and Schädler, M. R. (2020)Simulations with FADE of the effect of impaired hearing on speechrecognition performance cast doubt on the role of spectral resolu-tion.
Hearing Research , 395, https://doi.org/10.1016/j.heares.2020.107995
ISO (2003). Standard 226: 2003: Acoustics–normal equal-loudness-levelcontours.
International Organization for Standardization , 63,
Kollmeier, B., Warzybok, A., Hochmuth, S., Zokoll, M. A., Uslar, V.,Brand, T., and Wagener, K. C. (2015) The multilingual matrixtest: Principles, applications, and comparison across languages: Areview.
International Journal of Audiology , 54(sup2):3–16, https://doi.org/10.3109/14992027.2015.1020971 .Kollmeier, B., Schädler, M. R., Warzybok, A., Meyer, B. T., and Brand,T. (2016) Sentence recognition prediction for hearing-impaired lis-teners in stationary and fluctuation noise with fade: Empoweringthe attenuation and distortion concept by Plomp with a quantitativeprocessing model.
Trends in Hearing , 20, https://doi.org/10.1177%2F2331216516655795
Levitt, H. and Neuman, A. C. (1991) Evaluation of orthogonal polyno-mial compression.
The Journal of the Acoustical Society of America ,90(1):241–252, https://doi.org/10.1121/1.401294
Moore, B. C. J., Peters, R. W., and Stone, M. A. (1999) Benefits of linearamplification and multichannel compression for speech comprehensionin backgrounds with spectral and temporal dips.
The Journal of theAcoustical Society of America , 105(1):400–411, https://doi.org/10.1121/1.424571
Plomp, R. (1978) Auditory handicap of hearing impairment and thelimited benefit of hearing aids.
The Journal of the Acoustical Societyof America , 63(2):533–549, https://doi.org/10.1121/1.381753
Plomp, R. (1988) The negative effect of amplitude compression in multi-channel hearing aids in the light of the modulation-transfer function.
The Journal of the Acoustical Society of America , 83(6):2322–2327, https://doi.org/10.1121/1.396363
Schädler, M. R., Meyer, B., and Kollmeier, B. (2012) Spectro-temporalmodulation subspace-spanning filter bank features for robust auto-matic speech recognition.
The Journal of the Acoustical Society ofAmerica , 131(5):4134–4151, https://doi.org/10.1121/1.3699200
Schädler, M. R., Warzybok, A., Hochmuth, S., and Kollmeier, B.(2015) Matrix sentence intelligibility prediction using an auto-matic speech recognition system.
International Journal of Audi-ology , 54(sup2):100–107, https://doi.org/10.3109/14992027.2015.1061708
Schädler, M. R., Warzybok, A., Ewert, S. D., and Kollmeier, B.(2016b) A simulation framework for auditory discrimination exper-iments: Revealing the importance of across-frequency processing inspeech perception.
The Journal of the Acoustical Society of America ,139(5):2708–2722, https://doi.org/10.1121/1.4948772
Schädler, M. R., Hülsmeier, D., Warzybok, A., Hochmuth, S., andKollmeier, B. (2016a) Microscopic multilingual matrix test predic-tions using an ASR-based speech recognition model. In
Proceed-ings of INTERSPEECH , 610–614, http://dx.doi.org/10.21437/Interspeech.2016-1119
Schädler, M. R., Warzybok, A., and Kollmeier, B. (2018) Objec-tive Prediction of Hearing Aid Benefit Across Listener Groups Us-ing Machine Learning: Speech Recognition Performance With Bin-aural Noise-Reduction Algorithms.
Trends in Hearing , 22, https://doi.org/10.1177/2331216518768954 .Schädler, M. R., Hülsmeier, D., Warzybok, A., and Kollmeier, B.(2020) Individual Aided Speech-Recognition Performance and Pre-dictions of Benefit for Listeners With Impaired Hearing Employ-ing FADE.
Trends in Hearing , 24, https://doi.org/10.1177%2F2331216520938929 chädler, M. R. (2020b) Optimization and evaluation of an intelligibility-improving signal processing approach (IISPA) for the Hurricane Chal-lenge 2.0 with FADE. In Proceedings of INTERSPEECH , 1331–1335, https://doi.org/10.21437/Interspeech.2020-0093
Souza, P. E. (2002) Effects of compression on speech acoustics, intel-ligibility, and sound quality.
Trends in Amplification , 6(4):131–165, https://doi.org/10.1177%2F108471380200600402
Wagener, K., Brand, T., and Kollmeier, B. (1999) Entwicklung undEvaluation eines Satztests für die Deutsche Sprache I-III: Design, Op-timierung und Evaluation des Oldenburger Satztests.
Zeitschrift fürAudiologie , 38(1-3):4–15Wagener, K. C., Brand, T., and Kollmeier, B. (2006) The role of silentintervals for sentence intelligibility in fluctuating noise in hearing-impaired listeners.
International Journal of Audiology , 45(1):26–33, https://doi.org/10.1080/14992020500243851
Wardenga, N., Batsoulis, C., Wagener, K. C., Brand, T., Lenarz, T., andMaier, H. (2015) Do you hear the noise? The German matrix sentencetest with a fixed noise level in subjects with normal hearing and hear-ing impairment.
International Journal of Audiology , 54(sup2):71–79, https://doi.org/10.3109/14992027.2015.1079929 .Yund, E. W. and Buckles, K. M. (1995) Multichannel compression hear-ing aids: Effect of number of channels on speech discrimination innoise.
The Journal of the Acoustical Society of America , 97(2):1206–1223, https://doi.org/10.1121/1.413093https://doi.org/10.1121/1.413093