End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking
Xingjian Du, Mengyao Zhu, Xuan Shi, Xinpeng Zhang, Wen Zhang, Jingdong Chen
Abstract—Recently, phase processing has attracted increasing interest in the speech enhancement community. Some researchers integrate a phase estimation module into speech enhancement models by using training targets based on the complex-valued short-time Fourier transform (STFT) spectrogram, e.g., the Complex Ratio Mask (cRM) [1]. However, masking a spectrogram can violate its consistency constraints. In this work, we show that this inconsistency problem enlarges the solution space of the speech enhancement model and causes unintended artifacts. Consistent Spectrogram Masking (CSM) is proposed to estimate the complex spectrogram of a signal under the consistency constraint in a simple but non-trivial way. Experiments comparing our CSM-based end-to-end model with other methods confirm that CSM accelerates model training and yields significant improvements in speech quality. Our experimental results show that the proposed method enhances noisy speech both efficiently and effectively.
Index Terms—speech enhancement, end-to-end model, complex spectrogram, phase processing

I. INTRODUCTION
Many audio and speech processing approaches represent the signal in a time-frequency transform, most commonly the short-time discrete Fourier transform (STFT). After this transformation, the signal is represented in complex-valued form by its magnitude and phase. For the past three decades, however, the phase has been largely ignored while researchers focused on modeling and processing the STFT magnitude [2].

As soon as reconstruction is desired, phase information becomes essential. When the magnitude is modified, the original phase is often simply reused to recover the signal, which may lead to undesired artifacts. Some researchers focus on applications where the original phase is not available [3]. In this case, STFT phase retrieval algorithms construct a new valid phase from the modified magnitude, allowing complete disposal of the existing phase.

Phase enhancement research has shown that enhancing the phase spectrogram of noisy speech leads to perceptual quality improvements [4]. Instead of enhancing the magnitude and phase responses of noisy speech separately, recent work jointly enhances the two to further improve perceptual quality [5]. If the spectrogram
is modified, the modified spectrogram may no longer correspond to the STFT of any time-domain signal; it is then called an inconsistent spectrogram [2]. The majority of speech enhancement approaches either modify only the magnitude or estimate the complex spectrogram directly, which will most likely lead to an inconsistent spectrogram. It is worth mentioning that the consistent spectrograms, obtained from the STFT of time-domain signals, form only a small subset of all complex spectrograms. In this letter, we propose a joint real and imaginary reconstruction algorithm on the consistent spectrogram: given the complex spectrum of noisy speech, we recover the consistent spectrum of clean speech. Because the optimization space of our method is restricted to consistent spectrograms, the proposed speech enhancement algorithm achieves a fast convergence rate and high accuracy.

This paper is organized as follows. Section II reviews masking-based speech enhancement methods and the inconsistent spectrogram problem. Section III proposes the Consistent Spectrogram Masking algorithm. Section IV describes the experimental setups used to evaluate the performance of the proposed model. Finally, Section V presents conclusions.

[Footnote: Du Xingjian, Zhu Mengyao, Shi Xuan and Zhang Xinpeng are with the School of Communication and Information, Shanghai University (corresponding author: Zhu Mengyao, e-mail: [email protected]). Zhang Wen and Chen Jingdong are with the Center of Intelligent Acoustics and Immersive Communication, Northwestern Polytechnical University. Manuscript received Sept. 13, 2018. This work was supported by the National Natural Science Foundation of China (61831019) and the Key Support Projects of the Shanghai Science and Technology Committee (16010500100).]

II. MASKING METHODS AND THE INCONSISTENT SPECTROGRAM PROBLEM
The common speech enhancement setup consists of STFT analysis, spectral modification, and a subsequent inverse STFT (ISTFT). The analysis of a digital signal x yields the complex-valued STFT coefficients; this procedure can be compactly written as S = STFT(x). Recently, phase processing has emerged as a further leverage in speech enhancement, including noticeable work such as Phase-Sensitive Masking (PSM) [6] and Complex Ratio Masking (cRM) [7], [1]. Wang et al. illustrated that the real and imaginary spectrograms exhibit clear temporal and spectral structure, so they proposed the cRM, defined as follows:

cRM(t, f) = \frac{Re\{S_{t,f}\}}{Re\{S_{t,f} + N_{t,f}\}} + i \, \frac{Im\{S_{t,f}\}}{Im\{S_{t,f} + N_{t,f}\}}    (1)

However, the methods mentioned above all ignore the inconsistent spectrogram problem. This problem, illustrated by Timo Gerkmann, is a great challenge for speech enhancement: because the STFT analysis uses overlapping analysis windows, any modification of individual signal components (sinusoids, impulses) is spread over multiple frames and multiple STFT frequency locations. Le Roux et al. [8] derived the consistency constraints for STFT spectrograms concisely. Let S_{t,f} be a set of complex numbers, where t corresponds to the frame index and f to the frequency band index, and let W_a, W_s be analysis and synthesis window functions satisfying the perfect reconstruction conditions for a frame shift R. For any complex spectrogram S, the following holds:

STFT(ISTFT(S))_{t,f'} = S_{t,f'} + \frac{1}{N} \sum_{k} W_a(k) \, e^{-j 2\pi k f'/N} \Big\{ W_s(k+R) \sum_{f=0}^{N-1} S_{t-1,f} \, e^{j 2\pi f (k+R)/N} + W_s(k-R) \sum_{f=0}^{N-1} S_{t+1,f} \, e^{j 2\pi f (k-R)/N} \Big\}

S can be divided into S_con and S_incon. S_con can be obtained from the STFT of a time signal x, and there is a one-to-one mapping between S_con and x but a many-to-one mapping between S_incon and x. The resynthesized time signal ISTFT(S_incon) has a consistent spectrogram S_con after the STFT transform. Consequently, the relation between S_con and S_incon is

S_con = STFT(ISTFT(S_incon)) ≠ S_incon    (2)

Given the many-to-one mapping between S_incon and x and the one-to-one mapping between S_con and x, as illustrated in Fig. 2, the space of S_incon is much larger than the space of S_con. Therefore, the estimated clean spectrogram Ŝ of a speech enhancement system tends to fall into the inconsistent spectrogram space. The commonly ignored inconsistent spectrogram problem not only introduces artifacts into the resynthesized signals, because of the inconsistency of overlapping frames, but also makes model convergence more difficult due to the expansion of the solution space.

Fig. 1. The framework of our proposed end-to-end model for speech enhancement (noisy signal → Quasi-STFT layer → RI spectrogram → FCN layers → CSM → estimated RI spectrogram → Quasi-ISTFT layer → clean signal).

III. CONSISTENT SPECTROGRAM MASKING
A. Masking with Consistency Constraints
Most model-based speech enhancement methods can be regarded as minimizing the following objective function:

O = || Ŝ − STFT(x) ||^β    (3)
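The consistency relation of Eq. 2, which motivates constraining this objective, is easy to check numerically. The sketch below is our own illustration (not the paper's implementation), using SciPy's STFT routines with the 1024/512 window and hop from Section IV: it masks a spectrogram, shows that one ISTFT/STFT round trip changes it (it was inconsistent), and that a second round trip leaves it fixed (it is now consistent).

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
x = rng.standard_normal(16384)                      # stand-in for noisy speech

nseg, nov = 1024, 512                               # window and overlap as in Sec. IV
_, _, S = stft(x, nperseg=nseg, noverlap=nov)

# Masking individual bins (a crude stand-in for cRM-style masking)
# almost surely produces an inconsistent spectrogram.
S_mod = S * (rng.random(S.shape) > 0.3)

# One ISTFT/STFT round trip projects S_mod onto the consistent subset ...
_, x_mod = istft(S_mod, nperseg=nseg, noverlap=nov)
_, _, S_con = stft(x_mod, nperseg=nseg, noverlap=nov)
changed = not np.allclose(S_con, S_mod)             # Eq. 2: S_con != S_incon

# ... and a consistent spectrogram is a fixed point of that round trip.
_, x2 = istft(S_con, nperseg=nseg, noverlap=nov)
_, _, S_con2 = stft(x2, nperseg=nseg, noverlap=nov)
fixed = np.allclose(S_con2, S_con, atol=1e-8)

print(changed, fixed)                               # expected: True True
```

The round trip is exactly one Griffin-Lim-style projection step; CSM avoids the need for it by keeping the estimate consistent in the first place.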
Fig. 2. An illustration of the notion of consistency. The STFT is an injective function which maps distinct valid signals to their consistent spectrograms S_con, i.e., there is a one-to-one correspondence between the set of time signals and the set of consistent spectrograms. However, the STFT is not guaranteed to be invertible for inconsistent spectrograms S_incon; there is a many-to-one mapping between S_incon and the time signal x, as indicated by the red arrows.

Here Ŝ is the estimated clean spectrogram, x denotes the clean signal, i.e., the ground truth for the model, and β is a tunable parameter that scales the distance.

Because Ŝ is estimated by a non-linear function F(S + N) of the noisy speech (the non-linear function can be a neural network, an HMM, etc.), these non-linear operations may destroy the dependence between neighbouring frames and cannot guarantee the consistency of Ŝ. As a result, an objective function defined on the spectrogram incurs the aforementioned inconsistent spectrogram problem. Here we derive the difference between objective functions defined on consistent and inconsistent spectrograms.

Applying both the ISTFT and STFT transforms to the terms of Eq. 3 gives the following equations. Since the consistency of the estimated Ŝ cannot be guaranteed, Ŝ_con = STFT(ISTFT(Ŝ)) can be deduced from Eq. 2, and Ŝ_con is not equal to Ŝ. Therefore, the following objective functions are not equal to the objective function in Eq. 3. It is worth noting that the last two equalities in Eq. 4 give equivalent forms of the objective function in the time domain and on the consistent spectrogram:

|| STFT(ISTFT(Ŝ)) − STFT(ISTFT(STFT(x))) ||^β = || Ŝ_con − STFT(x) ||^β = || ISTFT(Ŝ_con) − x ||^β    (4)

Following the motivations noted in Section II and the derivation of Eq. 4, we introduce an objective function termed O_con, defined on the consistent spectrogram domain Ŝ_con:

O_con = || ISTFT(Ŝ) − x ||^β    (5)

Although Ŝ_con and Ŝ are different, ISTFT(Ŝ_con) and ISTFT(Ŝ) are the same in the time domain (as illustrated by Fig. 2 and Eq. 2); this yields the convenient form of the objective function in Eq. 5. We name our method Consistent Spectrogram Masking (CSM) because it iteratively minimizes this objective function and derives a mask on a consistent spectrogram. The proposed method dispels the artifacts of the resynthesized signal and speeds up model training, thanks to the contraction of the solution space to consistent spectrograms. There are some similarities between Eq. 5 and the Griffin-Lim algorithm [9], since both require many ISTFT and STFT computations during optimization. In the Griffin-Lim algorithm, however, the phase is derived solely from the magnitude of the spectrogram, whereas our method estimates both magnitude and phase, in complex-valued form, on the consistent spectrogram. Given the complex spectrogram Y_{t,f} of the noisy speech, we thus define Consistent Spectrogram Masking as

Ŝ_{t,f} = M^R_{t,f} · Re\{Y_{t,f}\} + i · M^I_{t,f} · Im\{Y_{t,f}\}    (6)

where M^R_{t,f} and M^I_{t,f} are the masks for the real and imaginary spectrograms at time t and frequency f.

B. The framework of our proposed end-to-end model
Following the aforementioned methodology, i.e., optimizing the model under the consistency constraint, we designed an end-to-end speech enhancement model which comprises a densely connected convolutional neural network (CNN) and integrated Quasi-Layers (QL). A high-level depiction of the proposed model is presented in Fig. 1. The CNN module adaptively modifies the spectrogram of the input signal, while the QL is a backpropagatable module designed to simulate the STFT transform and its inverse, making it possible to accumulate the loss directly on the consistent spectrogram.

CNN-based acoustic models have been used in speech enhancement and source separation tasks and have been shown to improve performance [10]. Their connection structure and weight sharing make CNNs capable of learning feature representations by applying convolutional filters to the spectrogram of the audio. However, there is an intrinsic trade-off between kernel size and feature resolution: a larger kernel can exploit more contextual information along the time dimension, or learn patterns in a wider band, but it yields lower-resolution features. In this work, we utilize a densely connected fully convolutional network (FCN) [11], which can learn multi-scale features efficiently, to address this trade-off. In a standard feedforward network, the output of the l-th layer is computed as x_l = H_l(x_{l−1}), where x_{l−1} is the layer input and H_l(·) is a non-linear transformation, possibly a composite function of operations such as non-linear activation, pooling, or convolution [11]. The idea of DenseNet is to use the concatenation of the feature maps produced in all preceding layers as the input to succeeding layers:

x_l = H_l([x_{l−1}, x_{l−2}, ..., x_0])    (7)

where [x_{l−1}, x_{l−2}, ..., x_0] refers to the concatenation of the feature maps produced in layers 0, ..., l−1 [11].
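As an illustration of Eq. 7, dense connectivity can be sketched in a few lines of NumPy. This is our simplification, not the paper's network: H_l is a random linear map followed by a ReLU instead of a convolution, and the depth and growth rate are arbitrary.

```python
import numpy as np

def dense_block(x0, num_layers=4, growth=8, rng=None):
    """Toy dense connectivity (Eq. 7): each layer receives the concatenation
    of all preceding feature maps along the channel axis."""
    if rng is None:
        rng = np.random.default_rng(0)
    feats = [x0]                                     # x0: (channels, time)
    for _ in range(num_layers):
        inp = np.concatenate(feats, axis=0)          # [x_{l-1}, ..., x_0]
        W = rng.standard_normal((growth, inp.shape[0])) / np.sqrt(inp.shape[0])
        feats.append(np.maximum(W @ inp, 0.0))       # H_l: linear map + ReLU
    return np.concatenate(feats, axis=0)

out = dense_block(np.ones((3, 10)))
print(out.shape)                                     # (35, 10): 3 + 4*8 channels
```

Because every layer's output is kept and re-used, the channel count grows linearly with depth while each H_l stays small, which is the efficiency argument made above.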
Such dense connectivity enables all layers not only to receive the gradient directly but also to reuse the features computed in preceding layers. This avoids the re-computation of similar features in different layers and lets the network learn features of different levels within the same layer [11]. The experimental results show that our DenseNet-based approach achieves a considerable improvement over the DNN-based model.

The FCN is the backbone of our model, and the preprocessing and postprocessing Quasi-Layers are also vital parts of the whole system. The Quasi-STFT layer uses two 1-dimensional convolutions, initialized with the real and imaginary parts of the discrete Fourier transform kernels respectively, following the definition of the STFT:

S_{t,f} = \sum_{n=0}^{N−1} x_{Nt+n} · [cos(2π f n / N) − i · sin(2π f n / N)]    (8)

for f ∈ [0, N−1]; the Quasi-ISTFT layer is defined analogously. These modules are built from ordinary convolutional layers, so they are easy to integrate into a neural-network-based model. The Quasi-Layers bring benefits in two respects: first, the Quasi-ISTFT makes it possible to define the objective function on a consistent spectrogram, as in Eq. 5; second, integrating the STFT and ISTFT into the end-to-end model makes the Fourier transform kernels and window functions learnable through backpropagation.

IV. EXPERIMENT
A. Experimental Setup
We conducted our experiments on the Centre for Speech Technology Voice Cloning Toolkit (VCTK) [12] and the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) [13]. The training data are supplied by VCTK, which includes 400 × 109 sentences uttered by 109 native speakers of English with various accents, and the model is evaluated on TIMIT; training and testing on different datasets supports the reliability of the results. The following broadband noises are used: speech babble (Babble), cafeteria (Cafe), factory floor noise (Factory), and transportation noise (Road). The training set is composed by mixing ten random segments from the first half of each noise with each training sample at different SNR levels: −6, −3, 0, 3, and 6 dB. The test set is generated by mixing 60 clean utterances with the last half of the above noises at the same SNRs. Dividing the noises into two halves ensures that the testing noise segments are unseen during training.

The proposed model, termed QL-FCN-CSM, is shown in Fig. 1. Ahead of the FCN, the raw audio input of 66048 samples is transformed into a 512 × 16 × 2 matrix by the Quasi-STFT layer, whose window length and hop length are set to 1024 and 512, respectively. Mean and variance normalization was applied to the input vector to stabilize training and testing. The perceptual evaluation of speech quality (PESQ) [14] and the signal-to-noise ratio (SNR) are used to evaluate the quality and intelligibility of the different signals.

B. Experimental Results

1) Comparison Between Different Objective Functions:
We conducted experiments with models based on different objective functions. The model trained to minimize the error between the complex spectrogram of clean speech and its noisy version is denoted QL-FCN-cRM (similar to QL-FCN-CSM, but with CSM replaced by cRM), and the model which estimates the magnitude only is denoted QL-FCN-IRM (again similar to QL-FCN-CSM, but with CSM replaced by IRM). Table I shows a substantial performance gap between QL-FCN-CSM and QL-FCN-cRM, and between QL-FCN-CSM and QL-FCN-IRM, which demonstrates the efficiency of CSM: optimizing the model with an objective function defined on the consistent spectrogram and synthesizing waveforms directly. The average PESQ scores and SNRs of QL-FCN-CSM and QL-FCN-cRM are consistently better than those of the other models, which demonstrates the effectiveness of the proposed end-to-end model. Our best result in the 0 dB condition is even more encouraging: the PESQ score is 0.38 higher than that of DNN-cRM, the state-of-the-art DNN approach. It is noteworthy that the convergence speed of QL-FCN-CSM overtakes the others while achieving better performance; this reinforces our view that constraining the estimated spectrogram to the scope of consistent spectrograms leads to the faster convergence shown in Fig. 4.
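For reference, the SNR scores in Table I follow the usual global signal-to-noise ratio definition; a minimal sketch of such a metric is given below (our illustration, with random signals standing in for speech; the PESQ scores, by contrast, come from the ITU-T P.862 algorithm and are not reproduced here).

```python
import numpy as np

def snr_db(clean, estimate):
    """Global SNR (dB) of an enhanced estimate against the clean reference."""
    residual = clean - estimate
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + rng.standard_normal(16000)           # roughly a 0 dB mixture

score = snr_db(clean, noisy)
print(f"{score:.1f} dB")                             # close to 0 dB
```

An enhancement model improves this score by shrinking the residual between its output and the clean reference.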
2) Comparison Between Different Network Architectures:
To compare our FCN-based model with DNN-based ones, we ran experiments against DNN-cRM [1] and DNN-IRM [15] (no QL is used there, as there is no convolution procedure; a deep neural network is used instead of the FCN). From Table I, we observe that QL-FCN-CSM and QL-FCN-cRM outperform DNN-cRM and DNN-IRM in all conditions, which confirms our choice of network architecture. However, the results of QL-FCN-CSM are only comparable to those of QL-FCN-cRM in the 6 dB and −6 dB conditions, because the artifacts caused by the loss of phase information are negligible at very high or very low SNRs [16].

V. CONCLUSIONS
We draw two observations from prior work: a) phase processing is essential to speech enhancement tasks; b) masking a spectrogram can destroy its consistency constraints. In this letter, we show that the inconsistent spectrogram problem slows the convergence of the model and causes unintended artifacts. To estimate the clean spectrogram (including magnitude and phase) from the STFT of noisy speech under the constraint of consistency, we design a CSM on the complex spectrogram and derive the loss function on
the consistent spectrogram, which resolves the inconsistent spectrogram problem and the phase processing problem simultaneously and jointly.

In technical detail, we implement new Quasi-Layers that emulate the STFT with convolutional layers in the neural network, which makes it possible to optimize our model with an objective function on the consistent spectrogram. DenseNet is selected as the basis of our model framework, rather than a vanilla CNN or DNN, for its superior ability to extract features at various scales from a spectrogram. The experimental results confirm both the expected acceleration of convergence and the improvement in quality.

TABLE I
PESQ AND SNR PERFORMANCE FOR THE MODELS: NO ENHANCEMENT (a), QL-FCN-CSM (b), QL-FCN-cRM (c), QL-FCN-IRM (d), DNN-cRM (e), DNN-IRM (f).

                |              PESQ                |               SNR
SNR (dB)        |  -6    -3     0     3     6      |  -6     -3     0      3      6
Babble    a     | 1.179 1.301 1.489 1.672 1.998    | -6.00  -3.00  0.00   3.00   6.00
          b     | 1.951   —     —     —     —      |   —      —     —      —      —
          c     | 1.953 2.068 2.543 2.833 2.966    |  5.89   8.13 10.76    —      —
Cafe      a     | 1.413 1.676 1.894 2.123 2.342    | -6.00  -3.00  0.00   3.00   6.00
          c     | 2.363 2.501 2.686   —     —      |   —      —     —      —      —
Factory   a     | 0.987 1.119 1.265 1.468 1.695    | -6.00  -3.00  0.00   3.00   6.00
          c     | 1.778 1.89  2.106 2.302 2.441    |  7.10   8.55 11.37  14.16  16.25
          d     | 1.78  1.893 2.101 2.246 2.408    |  7.12   8.59 11.30  13.36  15.75
          e     | 1.687 1.813 1.908 2.113 2.381    |  5.89   7.55  8.78  11.47  15.33
          f     | 1.625 1.765 1.874 2.046 2.240    |  5.09   6.93  8.34  10.55  13.27
Road      a     | 2.182 2.363 2.547 2.721 2.903    | -6.00  -3.00  0.00   3.00   6.00
          d     | 2.98  3.078 3.249 3.356 3.493    |  7.22   8.79 11.45  13.43  15.89
          e     | 2.905 3.007 3.084 3.253 3.467    |  6.03   7.64  8.87  11.53  15.39
          f     | 2.853 2.966 3.059 3.185 3.352    |  5.19   7.01  8.47  10.42  13.37

Fig. 3. A random clip (768 samples) from the waveforms of the experimental results. The red line indicates the clean signal; the remaining lines indicate the outputs of QL-FCN-CSM and QL-FCN-IRM, respectively. It is evident that estimating spectrogram masks in a consistent manner reduces the distortion of the results in the time domain.

Fig. 4. Training the CSM-QL and cRM models on the VCTK dataset. The performance of CSM-QL surpasses the cRM model, with a faster convergence speed.

REFERENCES

[1] D. S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for monaural speech separation," IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
[2] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, "Phase processing for single-channel speech enhancement: History and recent advances," IEEE Signal Process. Mag., vol. 32, no. 2, pp. 55–66, 2015. [Online]. Available: https://doi.org/10.1109/MSP.2014.2369251
[3] Z. Prusa and P. Rajmic, "Toward high-quality real-time signal reconstruction from STFT magnitude," IEEE Signal Process. Lett., vol. 24, no. 6, pp. 892–896, 2017. [Online]. Available: https://doi.org/10.1109/LSP.2017.2696970
[4] K. K. Paliwal, K. K. Wójcicki, and B. J. Shannon, "The importance of phase in speech enhancement," Speech Communication, vol. 53, no. 4, pp. 465–494, 2011.
[5] D. S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for joint enhancement of magnitude and phase," in Proc. IEEE ICASSP, 2016, pp. 5220–5224. [Online]. Available: https://doi.org/10.1109/ICASSP.2016.7472673
[6] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Proc. IEEE ICASSP, 2015, pp. 708–712. [Online]. Available: https://doi.org/10.1109/ICASSP.2015.7178061
[7] D. S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for joint enhancement of magnitude and phase," in Proc. IEEE ICASSP, 2016, pp. 5220–5224.
[8] J. Le Roux, "Phase-controlled sound transfer based on maximally-inconsistent spectrograms," in Proceedings of the Acoustical Society of Japan Spring Meeting, no. 1-Q-51, Mar. 2011.
[9] S. Nawab, T. Quatieri, and J. Lim, "Signal reconstruction from short-time Fourier transform magnitude," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 31, no. 4, pp. 986–998, 1983.
[10] S. Fu, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," CoRR, vol. abs/1709.03658, 2017.
[11] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[12] C. Veaux, J. Yamagishi, K. MacDonald et al., "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2017.
[13] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon technical report N, vol. 93, 1993.
[14] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE ICASSP, 2001, pp. 749–752.
[15] M. Tu and X. Zhang, "Speech enhancement based on deep neural networks with skip connections," IEEE, 2017, pp. 5565–5569.
[16] P. C. Loizou,