On the Use of Audio Fingerprinting Features for Speech Enhancement with Generative Adversarial Network
Farnood Faraji a,*, Yazid Attabi a, Benoit Champagne a, and Wei-Ping Zhu b
a Department of Electrical and Computer Engineering, McGill University, Montreal, QC, Canada
b Department of Electrical and Computer Engineering, Concordia University, Montreal, QC, Canada
Abstract—The advent of learning-based methods in speech enhancement has revived the need for robust and reliable training features that can compactly represent speech signals while preserving their vital information. Time-frequency domain features, such as the Short-Term Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficients (MFCC), are preferred in many approaches. While the MFCC provide a compact representation, they ignore the dynamics and distribution of energy in each mel-scale subband. In this work, a speech enhancement system based on the Generative Adversarial Network (GAN) is implemented and tested with a combination of Audio FingerPrinting (AFP) features obtained from the MFCC and the Normalized Spectral Subband Centroids (NSSC). The NSSC capture the locations of speech formants and complement the MFCC in a crucial way. In experiments with diverse speakers and noise types, GAN-based speech enhancement with the proposed AFP feature combination achieves the best objective performance while reducing memory requirements and training time.
Index Terms—audio fingerprinting, generative adversarial network, spectral subband centroids, speech enhancement
I. INTRODUCTION
Speech enhancement aims to isolate a desired speech signal from additive background noise, and to increase the quality or intelligibility of the processed speech [1]. In the past decade, due to important theoretical advances, faster and cheaper computational resources, and the availability of large recorded datasets for training, neural networks have been applied successfully to a variety of non-linear mapping problems, including speech enhancement. For instance, [2] proposes a supervised speech enhancement system based on a Deep Neural Network (DNN) that can outperform conventional methods.

The Generative Adversarial Network (GAN) aims to generate more realistic output patterns that exhibit characteristics closer to the real data [3]. Adversarial training can also be employed in the field of speech enhancement. Proposed by [4], the Speech Enhancement GAN (SEGAN) works in the time domain and uses a one-dimensional Convolutional Neural Network (CNN). A similar architecture is investigated by [5] using Short-Term Fourier Transform (STFT) features. Studies by [6] and [7] use Gammatone spectrum and STFT features, respectively, and propose modified training targets. Neural network systems require substantial training data to give the best performance. Thus, having a reliable feature set which reduces memory requirements and training time is an important asset, especially for embedded systems and real-time applications. Speech enhancement with GAN can work in both the time [4], [8] and frequency domains [5]-[7]. However, these works indicate that frequency-domain features have a clear advantage over the former, especially in terms of measures like the Perceptual Evaluation of Speech Quality (PESQ) [8].

Frequency-domain features such as the STFT, Gammatone spectrum and Mel-Frequency Cepstral Coefficients (MFCC) have been used frequently. In addition, a combination of STFT with MFCC is employed in [9] for training wide residual networks for speech enhancement. Compared to the STFT, filter-based features like MFCC exhibit reduced dimensionality and are more suitable for learning algorithms, as they can reduce memory and computational requirements while maintaining a comparable level of performance [7], [10], [11]. The MFCC belong to a larger family of so-called Audio Fingerprinting (AFP) features, which include the Spectral Subband Centroids (SSC) and Spectral Energy Peaks (SEP), and are used to compress data and extract essential patterns in audio frames [12].

The MFCC are computed by applying the Discrete Cosine Transform (DCT) to a set of weighted subband energies obtained from a mel-spaced filterbank. The filter-based energy computation of this process ignores important information about the audio signal in each subband, such as the locations of energy peaks corresponding to speech formants. The SSC, introduced by Paliwal [13], provide crucial information about the centroid frequency in each subband, which has proven to be of great value in several applications. The SSC have been successfully employed in speech recognition, speaker identification and music classification, with non-learning or dictionary-based systems [14]-[16]. Besides, a combination of MFCC and SSC was proposed for speaker authentication with non-learning methods in [17].

In this paper, a state-of-the-art speech enhancement system based on GAN is implemented to predict the Ideal Ratio Mask (IRM) of the noisy speech, using a compact set of features obtained from the combination of MFCC, Normalized SSC (NSSC) and their time differences (i.e., delta versions). The performance of the resulting system is evaluated by means of standard objective measures, and compared to that of other possible combinations of features, including the STFT coefficients. Our results show that the proposed combination of AFP features based on MFCC and NSSC can achieve the best (or near-best) performance under a wide range of SNRs, while significantly reducing memory requirements and training time.

Funding for this work was provided by a grant from NSERC Canada with industrial sponsor Microsemi Corporation. * Corresponding author: [email protected]

II. GENERATIVE ADVERSARIAL NETWORK
GANs are generative models designed to map noisy sample vectors z from a prior distribution into outputs that resemble those generated from the real (i.e., actual) data distribution. To achieve this, a generator (G) learns to effectively imitate the real data distribution under adversarial conditions. The adversary in this case is the discriminator (D), a binary classifier whose inputs are either samples from the real distribution or fake samples made up by G. The training process is a game between G and D: G tries to fool D into accepting its outputs as real, while D gets better at detecting fake inputs from G and distinguishing them from real data. As a result, G adjusts its parameters to move towards the real data manifold described by the training data [3]. The described adversarial training can be formulated as the following minimax problem,

\min_G \max_D V = E[\log D(x)] + E[\log(1 - D(G(z)))]   (1)

where V ≡ V(D, G) is the value function of the system, referred to as the sigmoid cross-entropy loss function, x is the feature vector from the real data distribution, z is the latent vector generated from a noisy distribution, D(x) and G(z) are the outputs of D and G, and E denotes the expected value.

In speech enhancement applications, it has been observed that the Conditional GAN (CGAN) [18] results in better performance than the conventional GAN [4], [6], [7]. CGAN uses an additional data vector x_c in both G and D for regression purposes. Moreover, the GAN method from (1) uses the sigmoid cross-entropy loss function, which causes a vanishing gradient problem for some fake samples far from the real data, leading to saturation of the loss function. In the sequel, CGAN is combined with the Least-Squares GAN (LSGAN) [19], which solves this problem by stabilizing GAN training and increasing the quality of G's output. This is achieved by substituting the cross-entropy loss with a binary-coded least-squares function, and training G and D individually. The modified GAN objective functions are expressed by,

\min_D V(D) = E[(D(x, x_c) - 1)^2] + E[(D(G(z, x_c), x_c))^2]
\min_G V(G) = E[(D(G(z, x_c), x_c) - 1)^2]   (2)

III. PROPOSED SYSTEM
A. Speech Model in the Frequency Domain
Let y[m] denote the observed noisy speech signal, where m ∈ Z is the discrete-time index. The noisy speech results from the contamination of a desired, clean speech signal s[m] with an additive noise signal n[m], i.e.,

y[m] = s[m] + n[m],  m ∈ Z   (3)

We represent the signals of interest in the time-frequency domain, as obtained from the application of the STFT to (3). Specifically, the STFT coefficients of the noisy speech signal y[m] are defined as,

Y(k, f) = \sum_{m=0}^{M-1} y[m + kL] \, h[m] \, e^{-j 2\pi f m / M}   (4)

where k ∈ Z is the frame index, L is the frame advance, f ∈ {0, 1, ..., M/2} is the frequency bin index, M is the frame size and h[m] is a window function. In practice, the calculation in (4) is implemented by means of an M-point Fast Fourier Transform (FFT) algorithm. Applying the STFT formula from (4) to the time-domain model (3) yields the time-frequency model representation

Y(k, f) = S(k, f) + N(k, f)   (5)

where S(k, f) and N(k, f) are the STFT coefficients of the clean speech and noise signals, respectively.
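As an illustration of (4), the analysis stage can be sketched in a few lines of NumPy. The helper below is our own minimal implementation (the function name and framing logic are ours, not the paper's); it assumes the Hanning window and parameter values introduced later in Sec. IV.

```python
import numpy as np

def stft(y, M=512, L=256):
    """STFT coefficients Y(k, f) per Eq. (4), for f = 0, ..., M/2.

    y: 1-D time-domain signal; M: frame size; L: frame advance.
    Returns a (num_frames, M//2 + 1) complex array.
    """
    h = np.hanning(M)  # analysis window h[m]
    num_frames = (len(y) - M) // L + 1
    frames = np.stack([y[k * L : k * L + M] * h for k in range(num_frames)])
    # rfft uses the e^{-j 2*pi*f*m/M} convention of Eq. (4), keeping bins 0..M/2
    return np.fft.rfft(frames, n=M, axis=1)
```

Since (4) is linear in y[m], applying this transform to s[m] + n[m] yields the additive model (5) exactly.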
B. Audio Fingerprinting Features

To train the GAN architecture, we propose a new feature set obtained by the combination of MFCC and NSSC. In this part, we explain the calculation and combination of these AFP features.
1) Mel-Frequency Cepstral Coefficients (MFCC):
The MFCC are widely used in speech recognition and enhancement due to their powerful compacting capabilities while preserving essential information in speech [10], [11], [20]. To calculate the MFCC features, the time-domain signal y[m] is passed through a first-order FIR filter to boost the highband formants in a so-called pre-emphasis stage, as given by,

y'[m] = y[m] - \alpha y[m - 1]   (6)

where α is the pre-emphasis coefficient, chosen close to 1. Next, the STFT of the filtered signal y'[m] is calculated as in (4), yielding the STFT coefficients Y'(k, f). For each data frame, these STFT coefficients are used to calculate a set of Spectral Subband Energies (SSE) defined in terms of a bank of overlapping narrow-band filters. Specifically, the SSE of the k-th frame are calculated as,

SSE_y(k, b) = \sum_{f=l_b}^{h_b} w_b(f) \, |Y'(k, f)|^2   (7)

where b ∈ {0, 1, ..., B - 1}, B is the number of subbands in the filterbank, and w_b(f) ≥ 0 is the spectral shaping filter of the b-th subband, with l_b and h_b denoting the lower and upper frequency limits of w_b(f). More specifically, the filters w_b(f) together form a mel-spaced filterbank, i.e., they are characterized by triangular shapes with peak frequencies distributed according to the mel scale of frequency. Finally, the Discrete Cosine Transform (DCT) is applied to the logarithm of the SSE to obtain the desired MFCC features, which is expressed as,

MFCC_y(k, p) = \sqrt{2/B} \sum_{b=0}^{B-1} \log SSE_y(k, b) \, \cos\!\left( \frac{p\pi}{B} \left(b + \frac{1}{2}\right) \right)   (8)

where p ∈ {0, 1, ..., P - 1} and P is the number of coefficients. We define the MFCC feature vector of the current data frame as MFCC_y = [MFCC_y(k, 0), ..., MFCC_y(k, P - 1)].

2) Spectral Subband Centroids (SSC): The SSC were introduced in [13] to measure the center of mass of a subband spectrum in terms of frequency, using a weighted averaging technique. These features exhibit robustness against equalization, data compression and additive noise, which do not significantly alter the peak frequencies at moderate to high Signal-to-Noise Ratio (SNR) [12]. In [21], the SSC outperform the MFCC when used as inputs in an audio recognition task based on dictionary matching. To generate the SSC values, the noisy speech signal y[m] is pre-emphasized as in (6) and the corresponding STFT coefficients Y'(k, f) are computed. For each frame, a set of SSC is obtained by calculating the centroid frequencies of a bank of narrowband filters as in the MFCC. Specifically, the SSC of the k-th frame are calculated as,

SSC_y(k, b) = \frac{ \sum_{f=l_b}^{h_b} f \, w'_b(f) \, |Y'(k, f)|^2 }{ \sum_{f=l_b}^{h_b} w'_b(f) \, |Y'(k, f)|^2 }   (9)

where b ∈ {0, 1, ..., B - 1} and w'_b(f) is the corresponding subband filter. In this work, to simplify implementation, we use the same bank of triangular mel-scale filters for both the MFCC and SSC calculations, i.e., w'_b(f) = w_b(f).

Finally, following [21], the SSC values are normalized within the range [-1, 1], which is more convenient for use with neural network layers and activation functions. The normalized SSC (NSSC) features are obtained as,

NSSC_y(k, b) = \frac{ SSC_y(k, b) - (h_b + l_b)/2 }{ (h_b - l_b)/2 }   (10)

For later reference, we define the NSSC feature vector of the signal y[m] at the current frame k as NSSC_y = [NSSC_y(k, 0), ..., NSSC_y(k, B - 1)].
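For concreteness, the chain (6)-(10) can be sketched as follows, reusing the stft helper above with Yp = stft(preemphasis(y)) supplying the pre-emphasized coefficients Y'(k, f). This is our own illustrative NumPy implementation: the function names, the mel-edge construction and the small flooring constants are assumptions, not details taken from the paper.

```python
import numpy as np

def preemphasis(y, alpha=0.97):
    """First-order pre-emphasis filter, Eq. (6); 0.97 is a typical alpha."""
    return np.append(y[0], y[1:] - alpha * y[:-1])

def mel_filterbank(B, M, fs):
    """Triangular mel-spaced filters w_b(f) with their bin limits l_b, h_b."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel2hz(np.linspace(hz2mel(0.0), hz2mel(fs / 2.0), B + 2))
    bins = np.floor((M / fs) * edges).astype(int)  # frequency-bin indices
    W = np.zeros((B, M // 2 + 1))
    for b in range(B):
        l, c, h = bins[b], bins[b + 1], bins[b + 2]
        W[b, l:c + 1] = (np.arange(l, c + 1) - l) / max(c - l, 1)  # rising slope
        W[b, c:h + 1] = (h - np.arange(c, h + 1)) / max(h - c, 1)  # falling slope
    return W, bins[:-2], bins[2:]  # W, l_b, h_b

def mfcc_nssc(Yp, W, l_b, h_b, P=22):
    """Per-frame MFCC (8) and NSSC (10) from pre-emphasized STFT Y'(k, f)."""
    power = np.abs(Yp) ** 2
    sse = power @ W.T  # SSE_y(k, b), Eq. (7)
    B = W.shape[0]
    basis = np.cos(np.pi * np.outer(np.arange(P), np.arange(B) + 0.5) / B)
    mfcc = np.sqrt(2.0 / B) * np.log(sse + 1e-10) @ basis.T  # Eq. (8)
    f = np.arange(W.shape[1])
    ssc = (power * f) @ W.T / (sse + 1e-10)  # Eq. (9), with w'_b = w_b
    nssc = (ssc - (h_b + l_b) / 2.0) / ((h_b - l_b) / 2.0)  # Eq. (10), in [-1, 1]
    return mfcc, nssc[:, :P]  # only the first P NSSC are kept (Sec. IV)
```

With B = 64 filters and P = 22 as in Sec. IV, mfcc and nssc each become (num_frames, 22) arrays.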
3) Feature Combination:
In this paper, we propose to use the concatenation of the MFCC and NSSC vectors, along with their first and second time differences (i.e., delta and double-delta), for training the GAN architecture. In the sequel, we refer to this extended feature set as the AFP Combination (AFPC). The MFCC and their deltas have long been used as an efficient alternative to the STFT, as they contain crucial information about the spectral subband energies and their temporal evolution [23]. Nevertheless, due to the smoothing nature of (7), the MFCC ignore the dynamics of the formants present in each subband. In contrast, the NSSC and their deltas can provide critical information about the formant locations and their temporal variations. At the same time, the NSSC tend to be more noise-robust than the MFCC, since the formant locations are not significantly disturbed by the additive noise distortion [13]. Hence, the proposed AFPC features have the ability to capture information about the distribution of energy, both across and inside spectral subbands.

To obtain the AFPC, the MFCC and NSSC are both extracted from the STFT of the noisy signal, Y(k, f), as described previously. The proposed AFPC feature vector at the k-th time frame for the signal y[m] is then defined as,

AFPC_y = [MFCC_y, \Delta MFCC_y, \Delta^2 MFCC_y, NSSC_y, \Delta NSSC_y, \Delta^2 NSSC_y]   (11)

where ΔMFCC_y and Δ²MFCC_y are the deltas and double-deltas of the MFCC and, similarly, ΔNSSC_y and Δ²NSSC_y are the deltas and double-deltas of the NSSC; a sketch of this assembly follows below.
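The assembly of (11) is a simple concatenation, sketched below. The regression-style delta window is a common convention that we assume here, since the text does not specify the exact delta computation.

```python
import numpy as np

def delta(feat, width=2):
    """Regression-based delta features along the frame axis (a common
    convention; the paper does not specify its exact delta window)."""
    T = feat.shape[0]
    num = sum(w * (feat[np.minimum(np.arange(T) + w, T - 1)]
                   - feat[np.maximum(np.arange(T) - w, 0)])
              for w in range(1, width + 1))
    return num / (2 * sum(w * w for w in range(1, width + 1)))

def afpc(mfcc, nssc):
    """AFPC_y per Eq. (11): [MFCC, dMFCC, ddMFCC, NSSC, dNSSC, ddNSSC]."""
    d_m, d_n = delta(mfcc), delta(nssc)
    return np.concatenate([mfcc, d_m, delta(d_m), nssc, d_n, delta(d_n)], axis=1)
```

With the 22 MFCC and 22 NSSC of Sec. IV, afpc returns a 132-dimensional vector per frame, which matches the MFCC+NSSC feature size reported later in Table IV.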
C. Incorporation of AFPC within GAN

We assume that the power spectrum of the noisy speech can be approximated by the sum of the clean speech and noise power spectra, i.e., |Y(k, f)|^2 ≈ |S(k, f)|^2 + |N(k, f)|^2. The generator in the adversarial setting is trained to predict a real output, which is taken as the Ideal Ratio Mask (IRM) generated from the known clean speech and noise signals [22], i.e.,

IRM(k, f) = \sqrt{ \frac{|S(k, f)|^2}{|S(k, f)|^2 + |N(k, f)|^2} }   (12)

We define the IRM vector at the current frame k as IRM = [IRM(k, 0), ..., IRM(k, M/2)]. Then, the generator produces the estimated IRM, whose patterns and distribution should be close to those of the real IRM, as expressed by,

\widehat{IRM} = G(z, AFPC^j_y)   (13)

where AFPC^j_y represents the AFPC feature vector at the current frame, obtained by concatenating the AFPC feature vectors from a subset of 2j + 1 consecutive context frames centered at the current one (i.e., by including the j adjacent frames to its left and right). The estimated output \widehat{IRM} in (13) is only calculated for the current frame. By examining \widehat{IRM} and the
AFPC_y of the current frame, D decides whether its input is the real IRM from (12) or the fake output from (13). The estimated IRM for every frame and frequency index is used as a Wiener-type filter on the STFT magnitude of the noisy speech. This method only enhances the amplitude of the signal and uses the phase from the noisy speech to reconstruct the time-domain enhanced signal via overlap-add and the Inverse STFT (ISTFT), as shown in,

|\hat{S}(k, f)| = \widehat{IRM}(k, f) \, |Y(k, f)|   (14)

\hat{s}[m] = ISTFT\{ |\hat{S}(k, f)| \, e^{j \angle Y(k, f)} \}   (15)

In [4], it is reported that adding an extra term to the generator loss when training with CGAN is very useful. Pandey et al. [7] show that using the L1 loss gives better performance than the L2 loss in speech enhancement applications. This approach allows the adversarial component to produce more refined and realistic results. The weight of the L1 component in the objective function is controlled by a parameter λ > 0. Therefore, the objective functions from (2) are modified as,

\min_D V(D) = E[(D(IRM, AFPC_y) - 1)^2] + E[(D(G(z, AFPC^j_y), AFPC_y))^2]   (16)

\min_G V(G) = E[(D(G(z, AFPC^j_y), AFPC_y) - 1)^2] + \lambda \, \| G(z, AFPC^j_y) - IRM \|_1   (17)

A schematic of this adversarial training procedure is illustrated in Fig. 1. The training consists of three consecutive steps: First, D is trained with a concatenation of the IRM vector and the
AFPC_y feature vector, in such a way that it recognizes the IRM as real (or output 1). Next, D learns to categorize the concatenation of \widehat{IRM} and the
AFPC_y feature vector as belonging to the fake data distribution (or output 0). Finally, the variables of D are frozen and G is trained with the AFPC^j_y features to fool D.

Fig. 1. The proposed GAN training procedure used with the AFPC.
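The three training steps of Fig. 1 correspond to one least-squares update of (16) followed by one update of (17). The PyTorch fragment below is a minimal sketch under our own naming (train_step, afpc_ctx for AFPC^j_y, afpc_cur for AFPC_y); it is not the authors' code and works with any G and D modules of compatible dimensions.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, irm, afpc_ctx, afpc_cur, lam=100.0):
    """One adversarial update per Eqs. (16)-(17) (illustrative sketch)."""
    z = torch.randn(irm.size(0), 15)  # latent vector z (15-d, Sec. IV-B)

    # Discriminator step, Eq. (16): push real pairs to 1, fake pairs to 0.
    fake = G(torch.cat([z, afpc_ctx], dim=1)).detach()
    d_real = D(torch.cat([irm, afpc_cur], dim=1))
    d_fake = D(torch.cat([fake, afpc_cur], dim=1))
    loss_d = ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step, Eq. (17): fool D, plus the lambda-weighted L1 term.
    fake = G(torch.cat([z, afpc_ctx], dim=1))
    d_fake = D(torch.cat([fake, afpc_cur], dim=1))
    loss_g = ((d_fake - 1.0) ** 2).mean() + lam * F.l1_loss(fake, irm)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

Detaching the fake mask in the discriminator step keeps the D update from propagating gradients into G, mirroring the individual training of G and D noted in Sec. II.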
A block diagram of the system architecture is depicted in Fig. 2. The operation consists of two stages: training and enhancement. During the training stage, the system uses the AFPC features to train D and G as shown in Fig. 1 and learn the IRM. In the enhancement stage, the G from the GAN setting is fed with the AFPC features to output the estimated \widehat{IRM}, and the speech spectrum is reconstructed using the Wiener-type filtering shown in (14)-(15).
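The enhancement stage thus reduces to (14)-(15): mask the noisy magnitude, keep the noisy phase, and resynthesize by overlap-add. The NumPy sketch below is ours; the squared-window normalization at the end is an implementation detail we assume rather than one stated in the paper.

```python
import numpy as np

def enhance(Y, irm_hat, M=512, L=256):
    """Apply the predicted mask (14) and resynthesize via ISTFT (15)."""
    S_hat = irm_hat * np.abs(Y) * np.exp(1j * np.angle(Y))  # Wiener-type masking
    frames = np.fft.irfft(S_hat, n=M, axis=1)               # back to time domain
    h = np.hanning(M)
    out = np.zeros((len(frames) - 1) * L + M)
    wsum = np.zeros_like(out)
    for k, frame in enumerate(frames):
        out[k * L : k * L + M] += frame * h                 # synthesis window
        wsum[k * L : k * L + M] += h ** 2
    return out / np.maximum(wsum, 1e-8)                     # normalize the overlap
```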
IV. EXPERIMENTAL SETUP
A. Dataset
We use the LibriSpeech [24] dataset, an open corpus based on audio books containing 1000 hours of relatively noise-free speech in English. For training, 1755 utterances are randomly selected from 250 speakers (half male, half female), for a total of 6 hours of speech. For testing, 255 utterances are selected from 40 speakers (half male, half female), for a total of 30 minutes of speech. The clean files are contaminated with additive noise at -5dB, 0dB and 5dB SNRs for both the training and testing sets, while two extra SNRs of 10dB and 15dB are added for testing. Five different noise types from NOISEX-92 [25] are used for both training and testing: babble, pink, buccaneer2, factory1 and hfchannel.

All the audio files are sampled at 16 kHz. The STFT coefficients are extracted with an M = 512-point FFT, using a 32 ms Hanning window, an overlap of 50% (L = 256) and three context frames (i.e., j = 1). The MFCC and NSSC are computed from the STFT parameters using B = 64 subbands, with mel-frequency triangular filters w_b(f) distributed between 0 Hz and 8 kHz. The number of MFCC is set to P = 22, while for the NSSC, only the first 22 coefficients are kept in the feature vector. A fixed pre-emphasis factor α < 1 is used in (6). The delta and double-delta variations are included in the feature sets for each context frame [13]. The estimated IRM (13) is calculated only for the middle STFT frame. For each feature set, one model is trained for all noise types, SNRs and speakers.

B. Training and Evaluation
The generator's architecture has three hidden layers, each including 512 nodes. The ReLU activation function is used after each hidden layer, with a dropout rate of 0.2. The discriminator has the same structure as the generator but uses the leaky ReLU activation function instead. Both employ the sigmoid activation at the output layer because they predict the IRM. The latent vector z has 15 elements generated randomly from a standard Gaussian distribution. The GAN architecture is trained over 50 epochs, with the learning rate reduced for the second half of the epochs. The batch size is set to 128 and the ADAM optimizer is used for training. We set λ = 100 in (17), which provides good convergence.

We compare different combinations of the discussed features, i.e., STFT coefficients, MFCC and NSSC, designated with "+", which denotes concatenation of the indicated feature vectors. Out of the seven distinct possible combinations, the STFT+MFCC+NSSC combination is not included in the study, since it does not substantially improve the performance nor the computational efficiency. In each experiment, one GAN model is trained for each feature set using all SNRs and noise types, with the same architecture, training procedure and hyper-parameters. The feature sets are compared objectively in terms of PESQ, which provides a measure of signal quality between -0.5 and 4.5; the Signal-to-Distortion Ratio (SDR), which measures the speech quality in dB based on the introduced speech distortion; and the Short-Time Objective Intelligibility (STOI), which provides a measure of intelligibility between 0 and 1. Besides these performance measures, we also compare the different feature combinations in terms of system efficiency, i.e., feature vector size, training time per epoch, and number of network parameters.
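Under our reading of these hyper-parameters, both networks are plain MLPs. The PyTorch sketch below is an assumption-laden illustration: the input/output widths (15-dimensional z, 132-dimensional AFPC per frame, 3 context frames, 257 IRM bins) are inferred from Secs. III and IV rather than quoted from the paper, and the leaky-ReLU slope is our choice.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, act):
    """Three 512-unit hidden layers, dropout 0.2, sigmoid output (Sec. IV-B)."""
    layers, d = [], in_dim
    for _ in range(3):
        layers += [nn.Linear(d, 512), act, nn.Dropout(0.2)]
        d = 512
    layers += [nn.Linear(d, out_dim), nn.Sigmoid()]
    return nn.Sequential(*layers)

# Dimensions implied by Secs. III-IV (our reading): AFPC is 132-d per frame,
# 2j + 1 = 3 context frames feed G, and the IRM spans 512/2 + 1 = 257 bins.
G = mlp(15 + 3 * 132, 257, nn.ReLU())        # (z, context AFPC) -> IRM estimate
D = mlp(257 + 132, 1, nn.LeakyReLU(0.2))     # (IRM, current AFPC) -> real/fake
```

These models plug directly into the train_step sketch of Sec. III-C.

V. RESULTS AND DISCUSSION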
In this section, we present and discuss the experimental results. To select the number of context frames (i.e., 2j + 1), the PESQ performance of three selected feature sets is studied, as shown in Fig. 3. When the number of context frames increases, the performance tends to improve for each feature set. However, since most of the gains for MFCC+NSSC and STFT+MFCC are already obtained with 3 context frames, we use the value j = 1 for all subsequent experiments.

Fig. 2. Block diagram of the proposed AFPC training feature set and its incorporation into GAN.

Fig. 3. Average PESQ performance for three feature sets (STFT, MFCC+NSSC and STFT+MFCC) as the number of context frames varies from 1 to 9.

For each feature set, results are obtained for five different noise types at five SNR levels from -5dB to 15dB. The average PESQ, SDR and STOI measures over all noise types are reported in Tables I-III, where the best results (within 2% of the observed maximum) are highlighted for each SNR.

TABLE I. AVERAGE PESQ RESULTS FOR ALL NOISE TYPES AT VARIOUS SNRS

Feature Set   -5dB    0dB     5dB     10dB    15dB
Noisy         1.13    1.40    1.72    2.07    2.43
STFT          1.71    2.12    2.52    2.82    2.99
NSSC          1.56    2.07    2.48    2.80    3.07
MFCC          1.69    2.11    2.50    2.84    3.12
STFT+NSSC     1.77    2.20    2.60    2.90    3.04
STFT+MFCC

TABLE II. AVERAGE SDR RESULTS FOR ALL NOISE TYPES AT VARIOUS SNRS

Feature Set   -5dB    0dB     5dB     10dB    15dB
Noisy         -5.21   -0.34   4.62    9.61    14.6
STFT          3.80    7.71    11.5    15.1    17.8
NSSC          3.05    7.10    10.8    14.0    16.5
MFCC          3.17    6.96    10.7    14.3    17.2
STFT+NSSC

TABLE III. AVERAGE STOI RESULTS FOR ALL NOISE TYPES AT VARIOUS SNRS

Feature Set   -5dB    0dB     5dB     10dB    15dB
Noisy         0.56    0.67    0.78    0.87    0.93
STFT          0.69    0.79
NSSC          0.64    0.76    0.85    0.90    0.93
MFCC          0.68    0.79    0.86    0.91
STFT+NSSC
STFT+MFCC
MFCC+NSSC

When used separately, the MFCC and NSSC improve the overall speech quality compared to the noisy speech, but do not generally outperform the STFT. Comparing STFT with STFT+NSSC and STFT+MFCC indicates that both AFP features add important information to the STFT features. STFT+MFCC outperforms STFT+NSSC in terms of both PESQ and STOI, while achieving a similar SDR performance.

According to Tables I-III, the proposed AFPC, i.e., MFCC+NSSC, substantially increases the performance of the GAN-based speech enhancement system in all three measures compared to MFCC or STFT alone. Furthermore, MFCC+NSSC achieves the best PESQ performance (within the error margin) and demonstrates a performance close to STFT+MFCC in terms of SDR and STOI. In particular, MFCC+NSSC outperforms the other feature sets in all three measures at the high unmatched SNR of 15dB. This is due to the fact that at such a high SNR, the additive noise does not significantly corrupt the extraction of formant frequencies with the NSSC.

While the bottom three feature sets in Tables I-III achieve the best performance in terms of average PESQ, STOI and SDR, the cost of this improvement for a GAN-based system using STFT+NSSC or STFT+MFCC is much higher than for the proposed MFCC+NSSC (i.e., AFPC). As shown in Table IV, the latter significantly outperforms the former in terms of feature size, training time and number of network parameters. Specifically, MFCC+NSSC leads to reductions of 59.1% in memory storage for the training data, 43.3% in training time for the GAN system, and 25.0% in the number of network parameters. Compared to the STFT baseline, MFCC+NSSC requires 49.6% less memory storage for features and 30.1% less training time, while achieving significant performance improvements. The savings in training time and network size with the proposed AFPC become larger when more context frames are added (i.e., j > 1). The testing time is not reported in Table IV since it is almost the same for all systems. In testing, most of the processing time is allocated to the STFT computation, which is similar for all feature combinations.
TABLE IV. FEATURE VECTOR SIZE AND TRAINING TIME PER EPOCH
Feature Set   Average PESQ   Feature Size   Training Time per Epoch   Network Param.
STFT          2.43           257            17.6 mins                 1.06M
STFT+NSSC     2.50           323            21.7 mins                 1.16M
STFT+MFCC                    323            21.7 mins                 1.16M
MFCC+NSSC     2.57           132            12.3 mins                 870K
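These feature sizes follow directly from the setup of Sec. IV: the STFT vector spans M/2 + 1 = 512/2 + 1 = 257 bins per frame; the proposed MFCC+NSSC (AFPC) vector has 3 x 22 + 3 x 22 = 132 entries (static, delta and double-delta versions of the 22 MFCC and 22 NSSC); and STFT+NSSC concatenates 257 + 66 = 323 values per frame.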
Fig. 4 shows the spectrograms of: (a) clean speech; (b) noisy speech after contamination with babble noise at 0dB SNR; (c) enhanced speech using GAN with STFT; and (d) enhanced speech using the proposed AFPC. It can be seen that the proposed AFPC features preserve the speech formants while removing more noise during the non-speech segments.
Fig. 4. (a) Clean speech; (b) Noisy speech (0dB babble noise); (c) Processed speech using the STFT features; (d) Processed speech using the AFPC features.
VI. CONCLUSION
In this work, we proposed using a compact set of features obtained from the combination of two AFP techniques, i.e., MFCC and NSSC, to implement a speech enhancement system based on GAN and trained to predict the IRM of the noisy speech. The NSSC capture the speech formants and the distribution of energy in each subband, and therefore complement the MFCC in a crucial way. In experiments with diverse speakers and noise types, GAN-based speech enhancement with the proposed AFPC (MFCC+NSSC) achieved the best average performance in terms of the PESQ, STOI and SDR objective measures. Furthermore, compared to the STFT+MFCC combination with nearly similar performance, the AFPC led to reductions of about 60% in memory storage, 45% in training time, and 25% in network size. Hence, the proposed AFPC set is a promising feature-extraction method for learning-based speech enhancement systems.
REFERENCES

[1] J. Benesty, S. Makino, and J. D. Chen, Speech Enhancement. Berlin, Germany: Springer, 2005.
[2] Y. Xu, J. Du, L. R. Dai, and C. H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Trans. on Audio, Speech, and Language Proc., vol. 23, no. 1, pp. 7-19, Jan. 2015.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, 2014, pp. 2672-2680.
[4] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in Proc. Interspeech, Stockholm, Sweden, 2017, pp. 3642-3646.
[5] C. Donahue, B. Li, and R. Prabhavalkar, "Exploring speech enhancement with generative adversarial networks for robust speech recognition," arXiv:1711.05747, 2017.
[6] M. H. Soni, N. Shah, and H. A. Patil, "Time-frequency masking-based speech enhancement using generative adversarial network," in Proc. ICASSP, Calgary, AB, Canada, 2018, pp. 5039-5043.
[7] A. Pandey and D. Wang, "On adversarial training and loss functions for speech enhancement," in Proc. ICASSP, 2018.
[8] J. Abdulbaqi, Y. Gu, and I. Marsic, "RHR-Net: A residual hourglass recurrent neural network for speech enhancement," arXiv:1904.07294, 2019.
[9] D. Ribas, J. Llombart, A. Miguel, and L. Vicente, "Deep speech enhancement for reverberated and noisy signals using wide residual networks," arXiv:1901.00660, 2019.
[10] R. Razani, H. Chung, Y. Attabi, and B. Champagne, "A reduced complexity MFCC-based DNN approach for speech enhancement," in Proc. IEEE Symp. on Signal Process. and Information Tech., Dec. 2017.
[11] J. Chen, Y. Wang, and D. Wang, "A feature study for classification-based speech separation at low signal-to-noise ratios," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1993-2002, 2014.
[12] N. Q. Duong and H. T. Duong, "A review of audio features and statistical models exploited for voice pattern design," arXiv:1502.06811, 2015.
[13] K. K. Paliwal, "Spectral subband centroid features for speech recognition," in Proc. ICASSP, vol. 2, Seattle, USA, 1998, pp. 617-620.
[14] N. Poh, C. Sanderson, and S. Bengio, "An investigation of spectral subband centroids for speaker authentication," in Proc. LNCS Int. Conf. Biometric Authentication (ICBA), Hong Kong, 2004, pp. 631-639.
[15] A. Nicolson, J. Hanson, J. Lyons, and K. Paliwal, "Spectral subband centroids for robust speaker identification using marginalization-based missing feature theory," Int. Journal of Signal Processing Systems, vol. 6, no. 1, pp. 12-16, 2018.
[16] T. Kinnunen, B. Zhang, J. Zhu, and Y. Wang, "Speaker verification with adaptive spectral subband centroids," in Advances in Biometrics, 2007, pp. 58-66.
[17] N. Thian, C. Sanderson, and S. Bengio, "Spectral subband centroids as complementary features for speaker authentication," in Biometric Authentication, 2004, pp. 1-38.
[18] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv:1411.1784, 2014.
[19] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, and Z. Wang, "Least squares generative adversarial networks," arXiv:1611.04076, 2016.
[20] M. Kolbæk, Z. Tan, and J. Jensen, "Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 153-167, 2017.
[21] J. S. Seo, M. Jin, S. Lee, D. Jang, S. Lee, and C. D. Yoo, "Audio fingerprinting based on normalized spectral subband centroids," in Proc. ICASSP, Philadelphia, USA, vol. 3, 2005, pp. 213-216.
[22] A. Narayanan and D. L. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in Proc. ICASSP, Vancouver, Canada, 2013, pp. 7092-7096.
[23] X. Huang, A. Acero, and H. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, 2001.
[24] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. ICASSP, Brisbane, Australia, 2015, pp. 5206-5210.
[25] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247-251, 1993.