Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks
Ju Lin, Student Member, IEEE, Adriaan J. van Wijngaarden, Senior Member, IEEE, Kuang-Ching Wang, Senior Member, IEEE, and Melissa C. Smith, Senior Member, IEEE

Ju Lin, Kuang-Ching Wang and Melissa C. Smith are with the Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA (e-mail: {jul, kwang, smithmc}@clemson.edu). Adriaan J. de Lind van Wijngaarden is with Nokia Bell Labs, Murray Hill, NJ 07974, USA (e-mail: adriaan.de_lind_van_wijngaarden@nokia-bell-labs.com).

This work is supported by the US Army Medical Research and Materiel Command under Contract No. W81XWH-17-C-0238.

The site https://linjucs.github.io/demo/ms_sa_tcn.html contains audio samples that demonstrate the effectiveness of the proposed multi-stage SA-TCN speech enhancement systems.
Abstract—Multi-stage learning is an effective technique to invoke multiple deep-learning modules sequentially. This paper applies multi-stage learning to speech enhancement by using a multi-stage structure, where each stage comprises a self-attention (SA) block followed by stacks of temporal convolutional network (TCN) blocks with doubling dilation factors. Each stage generates a prediction that is refined in a subsequent stage. A fusion block is inserted at the input of later stages to re-inject original information. The resulting multi-stage speech enhancement system, in short, multi-stage SA-TCN, is compared with state-of-the-art deep-learning speech enhancement methods using the LibriSpeech and VCTK data sets. The multi-stage SA-TCN system's hyper-parameters are fine-tuned, and the impact of the SA block, the fusion block and the number of stages is determined. The use of a multi-stage SA-TCN system as a front-end for automatic speech recognition systems is investigated as well. It is shown that the multi-stage SA-TCN systems perform well relative to other state-of-the-art systems in terms of speech enhancement and speech recognition scores.
Index Terms—Speech enhancement, speech recognition, neural networks, self-attention, temporal convolutional networks, multi-stage architectures.
I. INTRODUCTION

Speech enhancement is a basic function that is used to improve the quality and the intelligibility of a speech signal that is degraded by ambient noise. Speech enhancement algorithms are used extensively in many audio and communication systems, including mobile handsets, speaker verification systems and hearing aids. Popular classic techniques include spectral-subtraction algorithms, statistical model-based methods that use maximum-likelihood (ML) estimators, Bayesian estimators, minimum mean squared error (MMSE) methods, subspace algorithms based on singular value decomposition, and noise-estimation algorithms (see [1] and references therein). Modern techniques often use deep learning. Early examples include a recurrent neural network (RNN) to model long-term acoustic characteristics [2], and a deep auto-encoder that denoises speech signals with greedy layer-oriented pre-training [3]. In [4], a deep neural network (DNN) was used as a non-linear regression function. In [5], a convolutional recurrent neural network (CRN) was used, consisting of a convolutional encoder-decoder architecture and multiple long short-term memory (LSTM) layers that aim to capture long-context information. Other speech enhancement systems use a generative adversarial network (GAN), which is known for its ability to generate natural-looking signals in the time or frequency domain [6]–[10]. Recent studies consider the use of an attention mechanism [11]–[16]. Self-attention [17] is an efficient context information aggregation mechanism that operates on the input sequence itself and that can be utilized for any task that has a sequential input and output. In [15], self-attention is combined with a dense convolutional neural network. A time-frequency (T-F) attention method, proposed in [16], combines time-domain and frequency-domain attention to perform denoising and dereverberation at the same time.

A temporal convolutional network (TCN) consists of dilated 1-D convolutions that create a large temporal receptive field with fewer parameters than other models. Recent research shows that TCN-based models achieve excellent performance for text-to-speech [18], speech enhancement [19]–[23], and speech separation [24]. In [20], a speech enhancement system was proposed that uses a multi-branch TCN, in short MB-TCN, which effectively performs a split-transform-aggregate operation and enables the model to learn and determine an accurate representation by aggregating the information from each branch. In [22], the TCN used in [24] for speech separation was adapted for speech enhancement and integrated in a multi-layer encoder-decoder architecture. The use of a complex short-time Fourier transform (STFT) for TCN-based speech enhancement rather than magnitude or time-domain features was investigated in [21].

The above-mentioned methods can generally be classified as feature-mapping and mask-learning methods, which are two commonly used deep-learning approaches for single-channel speech enhancement. Feature-mapping approaches enhance the noisy features using a mapping network that minimizes the mean square error between the enhanced and clean features. Mask-learning approaches estimate the ideal ratio mask, the ideal binary mask or the complex ratio mask, and then use this mask to filter noisy speech signals and reconstruct the clean speech signals. Mask-learning methods usually perform better than feature-mapping methods in terms of speech quality metrics [25]–[27].

Recently, multi-stage learning has been successfully applied for a wide variety of tasks, including human pose estimation [28], action segmentation [29], speech enhancement [30]–[32] and speech separation [33]. A multi-stage architecture consists of stages that sequentially use the same model or a combination of different models, and each model operates directly on the output of the previous stage.
The effect of such an arrangement is that the model used in a given stage takes the predictions from prior stages as input and incrementally refines these predictions. Multi-stage learning systems that perform the same task in each stage typically use the same supervision principles in each intermediate stage [28], [29], [32]. In [29], multiple stacked TCN networks are proposed for action segmentation. In [32], a multi-stage network with dynamic attention is introduced, where the intermediate output in each stage is corrected with a memory mechanism. To reduce the model parameters, each stage uses a shared network. It is shown that this multi-stage approach typically performs better than systems with a larger and deeper network.

Multi-stage learning systems where each stage performs a different task are considered in [30], [31], [33]. Here, each stage has a different task and a different target. The performance can be improved by aggregating different stages if the nature of each stage is complementary. For instance, a two-stage speech enhancement approach is presented in [30], where the first stage uses a model to predict a binary mask to remove frequency bins that are dominated by severe noise, and where the second stage performs in-painting of the masked spectrogram from the first stage to recover the speech spectrogram that was removed in the first stage. In [31], a two-stage algorithm is proposed to optimize the magnitude and phase separately. The magnitude is optimized in the first stage, and the enhanced magnitude and phase are then further refined jointly.

This paper details a novel multi-stage speech enhancement system, where each stage comprises a self-attention (SA) block [17] followed by stacks of dilated temporal convolutional network (TCN) blocks. The system is referred to as a multi-stage SA-TCN speech enhancement system. Each stage generates a prediction in the form of a soft mask that is refined in each subsequent stage. Each self-attention block produces a dynamic representation for different noise environments and their relevance across frequency bins, as such enhancing the features, and the stacks of TCN blocks perform sequential refinement processing. A fusion block is inserted at the input of later stages to re-inject original speech information and to mitigate possible speech information loss in earlier stages.

This paper is organized as follows. Section II details the proposed multi-stage SA-TCN speech enhancement system and the underlying SA, TCN, and fusion blocks. Section III details the comprehensive experiments using the LibriSpeech [34] and VCTK [35] corpora. Section IV first presents the experiments that were performed to fine-tune the multi-stage SA-TCN system's hyper-parameters, to determine the optimum number of stages, and to quantify the impact of the SA block and the fusion block on the performance. The use of the proposed multi-stage SA-TCN system as a front-end for automatic speech recognition (ASR) systems is investigated as well. Extensive experiments with the LibriSpeech [34] and VCTK [35] corpora show that multi-stage SA-TCN systems achieve significantly better speech enhancement and speech recognition scores than other state-of-the-art speech enhancement systems. Section V concludes the paper and discusses further research directions.

II. MULTI-STAGE SA-TCN SYSTEMS
Speech enhancement systems take a sampled received noisy speech signal as input and aim to reconstruct the speech signal. Let {x(t) | t ∈ Z} denote a deterministic discrete-time data sequence that is obtained by sampling a received continuous-time noisy speech signal x_c(·) at time interval T_s, i.e., x(t) = x_c(t · T_s) ∈ R, and let the total number of samples be denoted by T_ℓ. The short-time Fourier transform (STFT) of length N of {x(t)} with window function w(t) of length N and hop-length T_h is given by

X_{τ,ω} = Σ_{n=0}^{N−1} w(n) x(τ T_h + n) exp(−j 2πωn/N),   (1)

where τ is the index of the sliding window and 0 ≤ ω < N is the frequency index. In this paper, a Hanning window is used, where

w(t) = sin²(πt/(N−1)).   (2)

Let X and Ω denote the STFT magnitude and phase, i.e., X = {|X_{τ,ω}|} ∈ R^{F×T} and Ω ∈ R^{F×T}, where F = N/2 + 1 denotes the number of frequency bins and T = T_ℓ/T_h + 1.

The proposed multi-stage SA-TCN speech enhancement system consists of K stages. Fig. 1 illustrates a 4-stage SA-TCN system. Each stage comprises a self-attention (SA) block followed by R stacks of L TCN blocks. For K-stage SA-TCN systems where K ≥ 3, a feature fusion block is inserted prior to each stage k, where 3 ≤ k ≤ K.

Each of the blocks has special features that are particularly suited for speech enhancement. The self-attention mechanism aggregates context information across channels, which is particularly helpful in obtaining a dynamic representation when the noise is non-stationary, and this is the case for many speech enhancement scenarios.

The TCN consists of R stacks of L non-causal TCN blocks, where the dilation factor of the ℓ-th TCN block in the stack is given by Δ_ℓ = 2^{ℓ−1}. As such, each stack has a large receptive field, which makes it particularly suited for temporal sequence modeling. Each TCN block has a skip connection between the input and output to reduce the loss of low-level details and to provide hooks for optimization.

The multi-stage architecture iteratively refines the initial predictions. It should be noted that the prediction of a previous stage may include some errors. For instance, the frequency bins dominated by speech may be masked and the resulting magnitude spectrogram may have lost some of the speech information. A fusion block is therefore inserted prior to each stage k, where 3 ≤ k ≤ K, that combines the predicted magnitude X̂^(k−1) at the output of stage k−1 and the original magnitude X as input, in order to re-inject the original speech information.
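As a point of reference for the analysis front-end of Eqs. (1) and (2), the following minimal PyTorch sketch extracts the magnitude X and phase Ω from a waveform, using the 32 ms window and 16 ms hop adopted later in Section III. Note that torch.hann_window defaults to the periodic variant of the Hanning window, which differs slightly from the symmetric form in Eq. (2); the function name and shapes are illustrative, not taken from the authors' code.

```python
import torch

def stft_features(x, n_fft=512, hop=256):
    """Return the STFT magnitude X and phase Omega of a mono waveform x."""
    window = torch.hann_window(n_fft)          # periodic Hann; cf. Eq. (2)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, win_length=n_fft,
                      window=window, return_complex=True)
    return spec.abs(), spec.angle()            # X and Omega, each in R^{F x T}

x = torch.randn(16000)                         # one second of dummy audio at 16 kHz
X, Omega = stft_features(x)
print(X.shape)                                 # torch.Size([257, 63]), so F = 257
```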
Fig. 1. Block diagram of a multi-stage SA-TCN speech enhancement system with K = 4 stages, where each stage consists of a self-attention (SA) block, followed by R = 3 stacks of L = 8 TCN blocks, and where the dilation factor of the ℓ-th TCN block in the stack equals Δ_ℓ = 2^{ℓ−1}, i.e., the dilation doubles for each next TCN block in the stack. A fusion block (FB) is used as of Stage 3 to re-inject the original STFT magnitude. The figure also shows the detailed structure of the self-attention module, with frequency and time dimensions F and T, a single TCN block with hyper-parameters B, H and P, and the proposed fusion block.

The first stage consists of a self-attention (SA) block that takes X as input and that uses three 1×1 convolutional layers to generate the query Q and the key-value pair (K, V), where Q, K, V ∈ R^{F×T}. In order to compute the attention component A, we first compute the weight W, given by

W = QKᵀ/√F,   (3)

and then use the soft-max function σ(·) to obtain Ŵ = {Ŵ_{i,j}} = σ(W), i.e.,

Ŵ_{i,j} = exp(W_{i,j})/w_j, where w_j = Σ_{i=1}^{F} exp(W_{i,j}).   (4)

The attention component A ∈ R^{F×T} is now determined using

A = ŴV.   (5)

The SA block outputs X̂ = X + δA, where δ is a scalar with initial value zero that allows the network to first rely on the cues in the local channels X and then gradually assign more weight to the non-local channels; δ reaches its optimal value through back-propagation.

The output X̂ ∈ R^{F×T} is fed into a TCN with input feature dimension B and network feature map dimension H by using a bottleneck layer to reduce the number of channels from F to B. The TCN consists of R identical stacks of L TCN blocks. Each TCN block comprises a 1×1 convolution layer that expands the number of channels from B to the TCN block's internal feature map dimension H, a dilated depth-wise convolution (D-conv) layer with kernel size P and dilation factor Δ_ℓ = 2^{ℓ−1}, where ℓ denotes the order of the TCN block in the stack of L TCN blocks, and a 1×1 convolution layer that reduces the number of channels from H back to B. This output is then recombined with the input using a skip connection to avoid losing low-level details. A parametric rectified linear unit (PReLU) activation layer [36] and a batch normalization layer [37] are inserted prior to and after the depth-wise convolution layer to accelerate training and improve performance. A sigmoid function is applied at the output of the last TCN block of the last stack to obtain a mask M^(1) with values in [0, 1] that minimizes the mean absolute error loss

L^(1) = ‖M^(1) ⊙ X − S‖₁,   (6)

where the operator ⊙ denotes the Hadamard product and S denotes the STFT magnitude of the clean speech signal s(t).

The stack of L TCN blocks with kernel size P and dilation factor Δ_ℓ = 2^{ℓ−1} creates a receptive field of size R(P, L), given by

R(P, L) = 1 + Σ_{ℓ=1}^{L} (P − 1) · 2^{ℓ−1}.   (7)

As such, a stack of L TCN blocks creates a large temporal receptive field with fewer parameters than other models. This paper considers multi-stage SA-TCN systems with kernel size P = 3. An illustration of the receptive field for a stack of L = 5 TCN blocks with kernel size P = 3 is shown in Fig. 2.

Fig. 2. Example of the connections formed by a non-causal D-convolution for a stack of L = 5 TCN blocks, each with kernel P = 3 and dilation Δ_ℓ = 2^{ℓ−1}, where 1 ≤ ℓ ≤ L (i.e., Δ = 1, 2, 4, 8, 16).

The multi-stage SA-TCN system's hyper-parameters (H, B, R, L) will be optimized using experiments.
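To make these block definitions concrete, the sketch below gives one possible PyTorch rendering of an SA block (Eqs. (3)–(5)) and a single TCN block, together with the receptive-field computation of Eq. (7). It follows the layer ordering described above, but it is a simplified sketch under stated assumptions rather than the authors' implementation; class names and initialization details are assumptions.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Sketch of the SA block of Eqs. (3)-(5): Q, K, V come from 1x1 convolutions."""
    def __init__(self, freq_bins):
        super().__init__()
        self.q = nn.Conv1d(freq_bins, freq_bins, 1)
        self.k = nn.Conv1d(freq_bins, freq_bins, 1)
        self.v = nn.Conv1d(freq_bins, freq_bins, 1)
        self.delta = nn.Parameter(torch.zeros(1))      # scalar delta, initialized to zero

    def forward(self, x):                              # x: (batch, F, T)
        q, k, v = self.q(x), self.k(x), self.v(x)
        w = torch.einsum('bit,bjt->bij', q, k) / math.sqrt(x.size(1))  # W = QK^T / sqrt(F), Eq. (3)
        w = torch.softmax(w, dim=1)                    # column-wise soft-max, Eq. (4)
        a = torch.einsum('bij,bjt->bit', w, v)         # A = W_hat V, Eq. (5)
        return x + self.delta * a                      # X_hat = X + delta * A

class TCNBlock(nn.Module):
    """One non-causal TCN block: 1x1 conv B->H, dilated depth-wise conv,
    1x1 conv H->B, with PReLU/batch-norm around the D-conv and a skip connection."""
    def __init__(self, b, h, p, dilation):
        super().__init__()
        pad = dilation * (p - 1) // 2                  # 'same' padding (non-causal)
        self.net = nn.Sequential(
            nn.Conv1d(b, h, 1),                        # expand B -> H
            nn.PReLU(), nn.BatchNorm1d(h),
            nn.Conv1d(h, h, p, dilation=dilation, padding=pad, groups=h),  # D-conv
            nn.PReLU(), nn.BatchNorm1d(h),
            nn.Conv1d(h, b, 1),                        # reduce H -> B
        )

    def forward(self, x):
        return x + self.net(x)                         # skip connection

def receptive_field(p, l):
    """Receptive field R(P, L) of Eq. (7) for a stack of L blocks with kernel P."""
    return 1 + sum((p - 1) * 2 ** (i - 1) for i in range(1, l + 1))

print(receptive_field(3, 5))                           # 63, matching Fig. 2
```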
As indicated, the same SA-TCN structure is used for subsequent stages, and an additional element, a fusion block, is inserted prior to each stage if there are three or more stages.

For notational convenience, let Ψ_k^{(R,L)}(·) denote the mapping performed by the R stacks of L TCN blocks in stage k, and let Υ_k(·) denote the self-attention operation at stage k. It follows that M^(1) can now be expressed as

M^(1) = S(Ψ_1^{(R,L)}(Υ_1(X))),   (8)

where S(·) denotes the sigmoid function. As such, M^(1) is the predicted mask at the output of the first stage. The enhanced speech STFT magnitude X̂^(1) at the output of stage 1 is given by X̂^(1) = M^(1) ⊙ X.

In a similar fashion, the predicted mask M^(2) at the output of the second stage can be obtained by evaluating

M^(2) = S(Ψ_2^{(R,L)}(Υ_2(X̂^(1))))   (9)

and the estimated STFT magnitude X̂^(2) = M^(2) ⊙ X̂^(1).

A multi-stage SA-TCN speech enhancement system with three or more stages (K ≥ 3) is constructed by inserting a fusion block that performs operation Φ_k(·) prior to each stage k, where 3 ≤ k ≤ K, taking the masked STFT magnitude X̂^(k−1) and the STFT magnitude X as inputs. Each input is passed through a 1×1 convolutional layer and a PReLU activation, after which a global layer normalization (gLN) is performed [24]. The operation gLN(Y) is given by

gLN(Y) = ((Y − E[Y]) / √(var(Y) + ε)) ⊙ γ + β,   (10)

where Y ∈ R^{F×T} is the input feature with mean E[Y] and variance var(Y), γ, β ∈ R^{F×1} are trainable parameters, and ε is a small constant for numerical stability.

The outputs of the two gLN operations are added, and the result is again sent through a 1×1 convolutional layer and a PReLU activation, another gLN, and another 1×1 convolutional layer with a PReLU activation. The output, denoted as X̆^(k−1), is given by

X̆^(k−1) = Φ_k(M^(k−1) ⊙ X, X̂^(k−1)).   (11)

The output X̆^(k−1) is then used as an input to the next stage, and the expression for the mask M^(k) at the output of the k-th stage is now given by

M^(k) = S(Ψ_k^{(R,L)}(Υ_k(X̆^(k−1)))).   (12)

The enhanced magnitude X̂^(k) at stage k is given by

X̂^(k) = M^(k) ⊙ X̂^(k−1).   (13)

Each next stage k, where k > 1, computes a mask M^(k) that minimizes the mask-based signal approximation mean absolute error loss L^(k), given by

L^(k) = ‖M^(k) ⊙ X̂^(k−1) − S‖₁,   (14)

where X̂^(k−1) denotes the estimated STFT magnitude at stage k−1.

At the output of the last stage of the multi-stage SA-TCN system, the time-domain waveform ŝ is computed using the processed STFT magnitude X̂^(K) and the original STFT phase Ω by applying the inverse STFT, in short ISTFT, denoted as

ŝ = ISTFT(X̂^(K), Ω).   (15)

The proposed multi-stage SA-TCN system provides a mean absolute error loss L^(k) at the output of each stage. Since each stage provides an equal contribution during the training process, we use the accumulated mask-based signal approximation training objective function

L = Σ_{k=1}^{K} L^(k).   (16)

The use of the mean absolute error loss is motivated by recent observations that it achieves better objective quality scores when using spectral mapping techniques [38], [39].
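Given the per-stage definitions above, the accumulated objective of Eqs. (13), (14) and (16) can be written compactly. The sketch below reads the norm in Eq. (14) as a mean absolute error over all time-frequency bins; its function and variable names are illustrative.

```python
import torch

def multistage_loss(masks, X, S):
    """Accumulated mask-based signal-approximation loss of Eqs. (13), (14) and (16).
    masks: list [M(1), ..., M(K)] of per-stage masks; X: noisy STFT magnitude;
    S: clean STFT magnitude (all of shape (F, T))."""
    X_hat, total = X, X.new_zeros(())
    for M in masks:
        total = total + torch.mean(torch.abs(M * X_hat - S))  # L(k), Eq. (14)
        X_hat = M * X_hat                                     # X_hat(k), Eq. (13)
    return total
```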
III. EXPERIMENTAL SETUP

In the following, the data sets, the model setup, and the evaluation metrics are detailed.
A. Data Set
To verify the effectiveness of the proposed multi-stage SA-TCN system, we conduct experiments using the LibriSpeech and VCTK data sets. The setup for each data set is described below.
LibriSpeech is an open-source corpus that contains 960 hours of speech derived from audio books in the LibriVox project. The sampling frequency is 16 kHz. The clean source is trained using 100 hours of speech data from the “train-clean” data set. The validation set uses 800 sentences from the “dev-clean” data set, and the test set uses 500 sentences from the “test-clean” data set. The training set uses 10,000 randomly selected noise sample sequences from the DNS Challenge [40]. The training clean speech has been cut into 75,206 4-second segments. The training and validation sets distort the clean segments with a randomly selected noise sound from the DNS Challenge noise set at a randomly selected SNR (in dB). The test set uses three distinct noise types: “babble noise” from the NOISEX-92 corpus [41], and “office noise” and “kitchen noise” from the DEMAND noise corpus [42]. The first channel signal of the corpus is used for data generation. Each clean utterance is distorted by a randomly selected noise type at a randomly selected SNR from the set {−5, 0, 5, 10, 15} (in dB).

The VCTK database used here is derived from the Valentini-Botinhao corpus [35]. Each speaker fragment contains about 10 different sentences. The training set uses 28 speakers, and the test set uses two speakers. The training set uses 40 noise conditions: eight noise types from the DEMAND database [42] and two artificial noise types are used at a randomly selected SNR from the set {0, 5, 10, 15} (in dB). The test set uses 20 noise conditions: five noise types from the DEMAND database at a randomly selected SNR from the set {2.5, 7.5, 12.5, 17.5} (in dB). There are about 20 different sentences in each condition for each test speaker. The test set conditions are different from the training set, as the test set uses different speakers and noise conditions.
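The noisy mixtures described above are obtained by scaling a noise segment to a target SNR before adding it to the clean segment. The snippet below is a generic sketch of this standard procedure, not the authors' actual data-generation script.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise power ratio equals `snr_db`,
    then add it to `clean`."""
    noise = noise[:len(clean)]                          # trim noise to utterance length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```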
B. Model Setup

The baseline systems used for performance comparison are a CRN system [5], a complex-CNN system that is based on concepts proposed in [43] and that was adapted for speech enhancement, and the multi-stage system DARCN [32]. The setup of the baseline systems and of the proposed multi-stage SA-TCN systems is detailed below.
CRN: The CRN-based approach takes the magnitude as input. Instead of directly mapping the noisy magnitude to the clean magnitude, we adapted the CRN to predict the ratio mask and as such improve its performance. The CRN-based method consists of five 2-D convolution layers with [16, 32, 64, 128, 256] output channels, respectively. This output is post-processed by two LSTM layers with 1024 nodes each, and five 2-D deconvolution layers with [128, 64, 32, 16, 1] output channels, respectively.
Complex-CNN: The complex-CNN performs a complex spectral mapping [44], [45], where the real and imaginary spectrograms of the noisy speech signal are treated as two different input channels. An STFT is used with a 20 ms Hanning window, a 20 ms filter length and a 10 ms hop size. The architecture uses eight convolutional layers, one LSTM layer and two fully-connected layers, each with ReLU activations except for the last layer, which has a sigmoid activation. The parameters used here are similar to the ones used in [43], but now both the input and the output have two channels with real and imaginary components, respectively. The prediction serves as a complex mask, consisting of a real and an imaginary mask. The training stage uses a multi-resolution STFT loss function [46], which is the sum of STFT loss functions computed with different STFT parameters.
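A multi-resolution STFT loss sums magnitude-domain losses computed at several STFT resolutions. The sketch below illustrates this idea; the specific resolutions and the choice of spectral-convergence plus log-magnitude terms are assumptions borrowed from common practice, not the exact configuration used for the complex-CNN baseline.

```python
import torch

def stft_mag(x, n_fft, hop):
    """STFT magnitude of a 1-D waveform."""
    w = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft, hop_length=hop, window=w, return_complex=True).abs()

def multires_stft_loss(est, ref, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Sum of STFT losses over several (n_fft, hop) resolutions (illustrative)."""
    loss = torch.zeros((), device=est.device)
    for n_fft, hop in resolutions:
        e, r = stft_mag(est, n_fft, hop), stft_mag(ref, n_fft, hop)
        sc = torch.norm(r - e, p='fro') / (torch.norm(r, p='fro') + 1e-8)        # spectral convergence
        mag = torch.mean(torch.abs(torch.log(r + 1e-8) - torch.log(e + 1e-8)))   # log-magnitude L1
        loss = loss + sc + mag
    return loss
```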
DARCN: DARCN [32] is a recently proposed monaural speech enhancement technique that uses multiple stages and that combines dynamic attention and recursive learning. Experiments are conducted with the open-source code (https://github.com/Andong-Li-speech/DARCN) using a non-causal, 3-stage configuration.
Proposed multi-stage SA-TCN systems: The proposed multi-stage SA-TCN systems are characterized by the number of stages K and the hyper-parameters (H, B, R, L). Each K-stage SA-TCN system uses an STFT with a 32 ms Hanning window, a 32 ms filter length and a 16 ms hop size. As such, F = 257. The multi-stage SA-TCN systems are trained using 80 epochs of 4-second utterances from the LibriSpeech corpus and using 100 epochs of variable-length utterances from the VCTK corpus. The proposed multi-stage SA-TCN systems are trained using the Adam optimizer [47] with an initial learning rate of 0.0002. All models use a mini-batch of 16 utterances. For each mini-batch of 16 utterances from the VCTK corpus, the longest utterance is determined and the other utterances are zero-padded to obtain equal-length utterances.
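The zero-padding of variable-length VCTK utterances can be implemented as a DataLoader collate function. The helper below is a hypothetical sketch consistent with the description above; it would be passed to the loader as DataLoader(dataset, batch_size=16, collate_fn=collate_pad).

```python
import torch

def collate_pad(batch):
    """Zero-pad a mini-batch of variable-length utterances to the longest one.
    Each element of `batch` is assumed to be a (noisy, clean) pair of 1-D tensors."""
    max_len = max(noisy.size(0) for noisy, _ in batch)
    pad = lambda x: torch.nn.functional.pad(x, (0, max_len - x.size(0)))
    noisy = torch.stack([pad(n) for n, _ in batch])
    clean = torch.stack([pad(c) for _, c in batch])
    return noisy, clean
```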
C. ASR Setup

The automatic speech recognition (ASR) experiments use a time-delay neural network-hidden Markov model (TDNN-HMM) hybrid chain model [48]. The TDNN models long-term temporal dependencies with training times that are comparable to standard feed-forward DNNs. The data is represented at different time points by adding a set of delays to the input, which allows the TDNN to have a finite dynamic response to the time series input data. This acoustic model is trained using the Kaldi toolkit [49] with the standard recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/librispeech/s5). The ASR acoustic models were trained using 960 hours from the LibriSpeech training set. The word error rate (WER) was measured using the LibriSpeech “test-clean” set.

D. Evaluation Metrics
The speech enhancement systems are evaluated using the commonly used wide-band perceptual evaluation of speech quality (PESQ) score [50]–[52], the short-time objective intelligibility (STOI) score [53], the scale-invariant signal-to-distortion ratio (SI-SDR) [54], and the CSIG, CBAK and COVL scores. The CSIG score is a signal distortion mean opinion score, the CBAK score measures background intrusiveness, and the COVL score measures the speech quality. The automatic speech recognition performance is measured by determining the word error rate (WER).
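Of these metrics, SI-SDR [54] is simple enough to state directly in code; the sketch below implements the common zero-mean, scale-invariant definition (an illustration, not the official evaluation script).

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB, following the definition in [54]."""
    reference = reference - reference.mean()      # remove means
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference                    # optimally scaled target
    noise = estimate - target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```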
IV. EXPERIMENTAL PERFORMANCE RESULTS
Extensive experiments have been performed to determine the performance of the proposed multi-stage SA-TCN speech enhancement systems. This section first details the findings of the ablation studies, and then presents the performance results for the multi-stage SA-TCN systems.
A. Ablation Studies
Ablation studies were performed to fine-tune the multi-stage SA-TCN system's hyper-parameters (H, B, R, L), and to analyze the effectiveness of the self-attention and fusion blocks.

The performance of 5-stage SA-TCN systems is measured in terms of PESQ and STOI scores for several hyper-parameter configurations. The results are listed in Table I. We observe that it is more effective to increase the number of channels (hyper-parameters B and H) in each TCN block than to increase the number of TCN blocks per stack (L). For instance, when R = 2 and H and B are doubled, the PESQ score improves from 2.59 to 2.65 and the STOI score improves from 92.36 to 93.02. At the same time, using L = 8 instead of L = 5 causes a slight degradation of the PESQ score. The performance can also be improved significantly by increasing the number of stacks R. We determined the model size for the larger TCN with R = 3 stacks and L = 8 TCN blocks per stack, which accounts for about 1.68 M parameters. Each SA block has about 0.2 M parameters and each fusion block has about 1.7 M parameters. If we only consider models with less than 10 million parameters, the model where (H, B, R, L) = (256, 128, 3, 8) performs best. We should also note that there is a trade-off between the performance and the model size.

TABLE I
PERFORMANCE FOR SEVERAL 5-STAGE SA-TCN CONFIGURATIONS

R  L  H    B    P  Model size  PESQ  STOI
2  5  128   64  3  2.38 M      2.59  92.36
2  5  256  128  3  5.19 M      2.65  93.02
2  8  128   64  3  2.90 M      2.53  92.32
2  8  256  128  3  7.21 M      2.64  93.05
3  5  128   64  3  2.81 M      2.61  92.67
3  5  256  128  3  6.88 M      2.71

The best score in a column is bold-faced, the second best is navy blue and the third best is dark pink.

Next, we investigate the impact of the number of stages K on the performance of a multi-stage SA-TCN speech enhancement system. The motivation for employing multi-stage learning is that the initial prediction is refined by the next stage. The results in Table II show that the performance improves step-wise after each stage. For instance, when comparing the first and the fifth stage, the PESQ score improves from 2.60 to 2.73, and the STOI score improves from 93.08 % to 93.37 %. We also observe that the PESQ score's rate of improvement gradually decreases from 0.05 to 0.01, which suggests that adding further stages has diminishing returns in terms of performance and that a 5-stage SA-TCN system is likely close to the upper bound on performance for this multi-stage TCN-based approach.
TABLE II
PER-STAGE PESQ AND STOI SCORES FOR A 5-STAGE SA-TCN SYSTEM

Stage    PESQ  STOI
stage 1  2.60  93.08
stage 2  2.65  93.10
stage 3  2.70  93.22
stage 4  2.72  93.33
stage 5  2.73  93.37
The performance impact of using self-attention was determined using PESQ and STOI scores. The results are shown in Fig. 3. On average, a 5-stage SA-TCN system provides a STOI score improvement of 3.5 % and a PESQ score improvement of 1.05 relative to unprocessed noisy speech. The insertion of the SA block prior to the stacked layers of TCN blocks consistently improves PESQ and STOI scores for all SNR conditions: the average PESQ score improves from 2.68 to 2.73 and the average STOI score improves from 93.16 % to 93.37 %. This indicates that the SA block is able to aggregate the frequency context, which is helpful for TCN-based speech enhancement. We also observe that the use of SA blocks shows more significant performance gains at low SNR, e.g., at -5 dB, the PESQ score improves from 2.04 to 2.14 and the STOI score improves from 86.57 % to 87.05 %. This also indicates that multi-stage SA-TCN systems are more robust at lower SNR.

Fig. 3. Impact of using a self-attention module in a 5-stage SA-TCN system, where (H, B, R, L) = (256, 128, 3, 8).

a. PESQ score:
                 -5 dB  0 dB  5 dB  10 dB  15 dB  average
Noisy speech      1.15  1.35  1.50  1.77   2.36   1.63
5-stage TCN       2.04  2.33  2.75  2.98   3.23   2.68
5-stage SA-TCN    2.14  2.35  2.76  3.02   3.35   2.73

b. STOI score [%]:
                 -5 dB  0 dB   5 dB   10 dB  15 dB  average
Noisy speech     81.51  85.52  90.87  93.99  96.25  89.66
5-stage TCN      86.57  90.27  94.98  96.50  97.36  93.16
5-stage SA-TCN   87.05  90.49  95.09  96.65  97.49  93.37

The effectiveness of the proposed fusion block, which re-injects original information in stages 3–5 of a 5-stage SA-TCN system to alleviate any speech signal loss, is considered next. The PESQ and STOI scores are shown in Fig. 4. Both scores improve for all SNR scenarios. The average PESQ score improves from 2.65 to 2.73, and the average STOI score improves from 93.08 % to 93.37 %. The impact of the fusion block is, as expected, more prominent at lower SNR, when the model not only removes the noise, but can also easily partly remove the speech signal itself.

Fig. 4. Impact of using a fusion block in a 5-stage SA-TCN system, where (H, B, R, L) = (256, 128, 3, 8).

a. PESQ score:
                                     -5 dB  0 dB  5 dB  10 dB  15 dB  average
Noisy speech                          1.15  1.35  1.50  1.77   2.36   1.63
5-stage SA-TCN without fusion block   2.04  2.31  2.71  2.89   3.29   2.65
5-stage SA-TCN with fusion block      2.14  2.35  2.76  3.02   3.35   2.73

b. STOI score [%]:
                                     -5 dB  0 dB   5 dB   10 dB  15 dB  average
Noisy speech                         81.51  85.52  90.87  93.99  96.25  89.66
5-stage SA-TCN without fusion block  86.95  90.06  94.78  96.32  97.19  93.08
5-stage SA-TCN with fusion block     87.05  90.49  95.09  96.65  97.49  93.37

B. Baseline System Comparison
Extensive experiments with the proposed multi-stage SA-TCN system and the CRN-based, complex-CNN and DARCN systems were conducted using the LibriSpeech data set. All multi-stage SA-TCN systems use hyper-parameters (H, B, R, L) = (256, 128, 3, 8). Table III shows that all multi-stage SA-TCN systems outperform the baseline systems in terms of the PESQ score for the different noise types and SNR conditions.

TABLE III
PESQ SCORES FOR MULTI-STAGE SA-TCN AND BASELINE SYSTEMS USING SAMPLES FROM THE LIBRISPEECH CORPUS

Noise type       Office Noise                   Babble Noise                   Kitchen Noise                  Average
SNR [dB]         -5    0     5     10    15     -5    0     5     10    15     -5    0     5     10    15
Noisy speech     1.30  1.68  2.01  2.60  3.39   1.06  1.09  1.22  1.37  1.80   1.07  1.18  1.30  1.62  2.10   1.63
CRN              1.90  2.16  2.58  3.07  3.45   1.08  1.20  1.46  1.75  2.24   1.43  1.81  2.19  2.44  2.78   2.09
Complex-CNN      2.14  2.40  2.84  3.01  3.24   1.13  1.30  1.70  2.06  2.54   1.77  2.20  2.46  2.67  2.95   2.30
DARCN            2.23  2.48  3.02  3.35  3.62   1.14  1.33  1.72  1.97  2.52   1.80  2.16  2.45  2.64  2.82   2.35
1-stage SA-TCN   2.29  2.60  3.11  3.38  3.64   1.15  1.38  1.85  2.28  2.81   1.83  2.27  2.69  2.84  2.87   2.47
2-stage SA-TCN   2.55  2.85

All multi-stage SA-TCN models use hyper-parameters (H, B, R, L) = (256, 128, 3, 8). The best score in a column is bold-faced, the second best is navy blue and the third best is dark pink.

The results also show that multi-stage SA-TCN systems with more stages have a better PESQ score. Similarly, Table IV shows that the STOI scores of the multi-stage SA-TCN systems are generally better than those of the baseline systems, and that the best STOI scores are generally obtained for 4-stage and 5-stage SA-TCN systems. Interestingly, even the single-stage SA-TCN system outperforms all baseline systems in terms of PESQ score. Adding more stages improves the overall performance significantly. For instance, the single-stage SA-TCN and the 2-stage SA-TCN have average PESQ scores of 2.47 and 2.67, respectively, and average STOI scores of 92.52 % and 92.88 %. The best performance is achieved with K = 5 stages, with an average PESQ score of 2.73 and an average STOI score of 93.37 %. The proposed 5-stage SA-TCN system has much better PESQ and STOI scores than the baseline systems, which demonstrates the effectiveness of the proposed approach. We also observe that multi-stage learning is more effective at a low SNR. For example, the 5-stage SA-TCN system achieves much better performance for Office and Kitchen Noise at -5 dB, and it also performs well for Babble Noise at low SNR.

Finally, we determined the SI-SDR metrics that quantify speech distortion. Table V shows that the proposed multi-stage SA-TCN systems generally outperform the baseline systems. We also observe that the SI-SDR performance for multi-stage SA-TCN systems with a larger number of stages decreases slightly, which indicates that the additional stages not only mask the noise, but also distort the speech signal. However, it will be shown next that these speech distortions do not impact the ASR performance.

C. Automatic Speech Recognition
We conducted automatic speech recognition (ASR) experiments using LibriSpeech to assess the performance of multi-stage SA-TCN systems with up to five stages, and determined the word error rate (WER) as well as the WER reduction. The baseline systems are the CRN-based method, and the complex-CNN and DARCN methods. The results are shown in Table VI. Our 1-stage SA-TCN system performs slightly worse than the best baseline systems, but the multi-stage SA-TCN methods perform better, and the proposed 5-stage SA-TCN achieves a WER improvement of 18.8 %, 8.4 % and 4.6 % relative to the CRN, complex-CNN and DARCN methods, respectively. The ASR results are similar to the STOI performance.
D. Spectrogram-Based Visualization
Speech enhancement performance can be assessed using spectrograms. Consider the situation where clean speech is perturbed by Babble Noise at an SNR of 5 dB. Fig. 5 shows spectrograms of the noisy speech signal and the clean speech target, as well as of the CRN-based and complex-CNN systems, the DARCN system, and the proposed 5-stage SA-TCN enhanced speech system. The spectrograms clearly show that the proposed system is best at suppressing residual noise while preserving the speech patterns.
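Log-magnitude spectrograms such as those in Fig. 5 can be produced with a few lines of matplotlib. The generic sketch below (assumed parameters: 16 kHz sampling and a 256-sample hop) is not the exact plotting code behind the figure.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_spectrogram(mag, hop=256, sr=16000, title=""):
    """Display a log-magnitude spectrogram (mag: F x T array) for visual comparison."""
    t_max = mag.shape[1] * hop / sr                 # duration in seconds
    plt.imshow(20 * np.log10(mag + 1e-8), origin="lower", aspect="auto",
               extent=(0.0, t_max, 0.0, sr / 2000.0))
    plt.xlabel("Time [s]"); plt.ylabel("Frequency [kHz]")
    plt.title(title); plt.colorbar(label="dB")
```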
E. Speech-Enhancement Benchmark Results
The proposed multi-stage SA-TCN speech enhancement systems are compared with state-of-the-art methods using the publicly available benchmark data set VCTK. As shown in Table VII, the proposed multi-stage SA-TCN systems outperform methods that use T-F features, including magnitude, gamma-tone spectral and complex STFT features, in terms of all the speech enhancement metrics used in this paper. Compared with the recently proposed time-domain method DEMUCS, our proposed method uses fewer parameters and achieves better performance in terms of the CBAK and COVL metrics, while the PESQ, STOI and CSIG scores are slightly worse. The experiments with the VCTK corpus show that adding more stages still provides some incremental performance improvements.
TABLE IV
STOI SCORES FOR MULTI-STAGE SA-TCN AND BASELINE SYSTEMS USING SAMPLES FROM THE LIBRISPEECH CORPUS

Noise type     Office                              Babble                              Kitchen                             Average
SNR [dB]       -5     0      5      10     15      -5     0      5      10     15      -5     0      5      10     15
Noisy speech   92.91  96.81  98.50  98.62  99.25   54.94  64.93  80.04  87.10  91.08   84.59  91.75  95.40  98.21  98.86   89.66
CRN            93.02  96.15  97.29  97.55  98.07   54.44  68.15  83.75  90.04  91.98   87.14  92.60  95.57  97.15  97.23   90.23
Complex-CNN    93.23  95.78  97.25  96.91  97.01   60.70  72.92  87.01  91.85  93.26   87.38  92.63  95.11  96.83  96.83   91.09
DARCN          95.08  97.04  98.46  98.44  98.82   62.60  75.20  88.90  91.72  94.41   90.31  93.71  96.60

All multi-stage SA-TCN models use hyper-parameters (H, B, R, L) = (256, 128, 3, 8). The best score in a column is bold-faced, the second best is navy blue and the third best is dark pink.

Fig. 5. Spectrograms of a sample input mixed with Babble Noise at an SNR of 5 dB and the speech-enhanced signals at the output of SA-TCN and baseline systems: a. Noisy Speech; b. Clean Target; c. CRN-based System; d. Complex-CNN System; e. DARCN System; f. 5-stage SA-TCN System.

TABLE V
SI-SDR SCORES FOR MULTI-STAGE SA-TCN AND BASELINE SYSTEMS USING SAMPLES FROM THE LIBRISPEECH CORPUS

SNR [dB]         -5     0      5      10     15     Average
CRN              1.43   5.72   9.63   13.07  16.39  9.28
Complex-CNN      9.66   11.32  13.35  14.49  16.86  13.15
DARCN            11.00  12.65  15.84  17.60  19.87  15.41
1-stage SA-TCN   11.13

The best score in a column is bold-faced, the second best is navy blue and the third best is dark pink.

TABLE VI
SPEECH RECOGNITION PERFORMANCE OF MULTI-STAGE SA-TCN AND BASELINE SYSTEMS

Method           WER [%]  WER reduction [%]
Noisy Speech     32.94    –
CRN              30.86    6.3
Complex-CNN      27.44    16.7
DARCN            26.18    20.5
1-stage SA-TCN   27.86    15.4
2-stage SA-TCN   25.32    23.1
3-stage SA-TCN   26.11    20.7
4-stage SA-TCN   25.27    23.3
5-stage SA-TCN

The best score in a column is bold-faced, the second best is navy blue and the third best is dark pink.

TABLE VII
PERFORMANCE EVALUATION SCORES OF MULTI-STAGE SA-TCN AND BASELINE SYSTEMS USING SAMPLES FROM THE VCTK CORPUS

                          Model size  Feature type          PESQ  STOI    CSIG  CBAK  COVL  SI-SDR
Noisy speech              –           –                     1.97  0.921   3.35  2.44  2.63  8.45
SEGAN [6] (2017)          43.2 M      Waveform              2.16  0.93    3.48  2.94  2.80  –
Wave-U-Net [55] (2018)    10.2 M      Waveform              2.40  –       3.52  3.24  2.96  –
DFL [56] (2018)           0.64 M      Waveform              –     –       3.86  3.33  3.22  –
MMSE-GAN [57] (2018)      0.79 M      Gamma-tone spectral   2.53  0.93    3.80  3.12  3.14  –
MetricGAN [7] (2019)      1.89 M      Magnitude             2.86  –       3.99  3.18  3.42  –
MB-TCN [20] (2019)        1.66 M      Magnitude             2.94  0.9364  4.21  3.41  3.59  –
DeepMMSE [58] (2020)      –           Magnitude             2.95  0.94    4.28  3.46  3.64  –
MHSA-SPK [14] (2020)      –           STFT                  2.99  –       4.15  3.42  3.57  –
STFT-TCN [21] (2020)      –           STFT                  2.89  –       4.24  3.40  3.56  –
DEMUCS [46] (2020)        127.9 M     Waveform

The best score in a column is bold-faced, the second best is navy blue and the third best is dark pink.

V. DISCUSSION AND CONCLUSIONS

In this paper, we have presented novel multi-stage SA-TCN speech enhancement systems, where each stage consists of a self-attention block followed by R stacks of L temporal convolutional network blocks with doubling dilation factors. The stacks of L TCN blocks effectively perform sequential refinement processing. Multi-stage SA-TCN systems with three or more stages use a fusion block as of the third stage to mitigate possible loss of the original speech information in later stages. The proposed self-attention module is used to provide a dynamic representation by aggregating the frequency context. Extensive experiments were used to fine-tune the hyper-parameters. It was shown that both the addition of the self-attention modules and the fusion blocks resulted in better performance. We noted that even the basic 1-stage SA-TCN system performs well and that adding stages improves the speech enhancement scores. The model size increases almost linearly with the number of stages. The relative improvement when adding an additional stage reduces when more stages are added, and as such one approaches an implicit upper bound for this approach. The best overall performance with a reasonable model size was obtained with a 5-stage SA-TCN system.

Extensive experiments were conducted using the LibriSpeech and VCTK data sets to determine the performance of the multi-stage SA-TCN speech enhancement systems and to compare the proposed system with other state-of-the-art deep-learning speech enhancement systems. It was shown that the proposed multi-stage SA-TCN methods achieve better performance in terms of widely used objective metrics while having fewer parameters. Speech enhancement, especially in mobile applications, requires computational- and parameter-efficient models. The proposed methods meet this requirement and at the same time provide excellent performance. Spectrograms were used to visualize that the proposed 5-stage SA-TCN systems can remove noise effectively while preserving the speech patterns. The proposed multi-stage SA-TCN systems predict a soft mask at each stage, which can be viewed as an implicit ideal ratio mask (IRM). For speech signals that are dominated by noise, the noise is suppressed gradually in each stage, which is a main reason for the excellent performance. The proposed multi-stage SA-TCN systems are also shown to have excellent ASR performance.

The focus of this paper is to process and enhance the spectrum magnitude, and the unaltered noisy phase is used when reconstructing the waveforms in the time domain. Recently, several studies have shown that phase information is also important for improving the perceptual quality [31], [59]. Thus, incorporating phase information into the proposed approach may lead to further improvements. This is currently being investigated.
ACKNOWLEDGMENTS
The authors thank Clemson University for the generous allotment of compute time on its Palmetto cluster.

REFERENCES
[1] P. C. Loizou, Speech Enhancement, 2nd ed. Boca Raton, FL: CRC Press, 2013.
[2] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent neural networks for noise reduction in robust ASR," in Proc. Interspeech, Portland, OR, Sep. 2012, pp. 22–25.
[3] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. Interspeech, Lyon, France, Aug. 2013, pp. 436–440.
[4] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 23, no. 1, pp. 7–19, Jan. 2015.
[5] K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Proc. Interspeech, Hyderabad, India, Sep. 2018, pp. 3229–3233.
[6] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in Proc. Interspeech, Stockholm, Sweden, Aug. 2017, pp. 3642–3646.
[7] S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin, "MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement," in Proc. Int'l Conf. on Machine Learning, Long Beach, CA, Jun. 2019, pp. 2031–2041.
[8] J. Lin, S. Niu, Z. Wei, X. Lan, A. J. van Wijngaarden, M. C. Smith, and K.-C. Wang, "Speech enhancement using forked generative adversarial networks with spectral subtraction," in Proc. Interspeech, Graz, Austria, Sep. 2019, pp. 3163–3167.
[9] D. Baby and S. Verhulst, "SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty," in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Proc., Brighton, United Kingdom, May 2019, pp. 106–110.
[10] J. Lin, S. Niu, A. J. van Wijngaarden, J. L. McClendon, M. C. Smith, and K.-C. Wang, "Improved speech enhancement using a time-domain GAN with mask learning," in Proc. Interspeech, Shanghai, China, Oct. 2020, pp. 3286–3290.
[11] R. Giri, U. Isik, and A. Krishnaswamy, "Attention Wave-U-Net for speech enhancement," in Proc. IEEE Workshop on Appl. of Signal Proc. to Audio and Acoustics, New Paltz, NY, Oct. 2019, pp. 249–253.
[12] J. Kim, M. El-Khamy, and J. Lee, "T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement," in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Proc., Barcelona, Spain, May 2020, pp. 6649–6653.
[13] Y. Zhao, D. Wang, B. Xu, and T. Zhang, "Monaural speech dereverberation using temporal convolutional networks with self attention," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28, pp. 1598–1607, May 2020.
[14] Y. Koizumi, K. Yatabe, M. Delcroix, Y. Masuyama, and D. Takeuchi, "Speech enhancement using self-adaptation and multi-head self-attention," in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Proc., Barcelona, Spain, May 2020, pp. 181–185.
[15] A. Pandey and D. Wang, "Dense CNN with self-attention for time-domain speech enhancement," arXiv:2009.01941, Sep. 2020.
[16] Y. Zhao and D. Wang, "Noisy-reverberant speech enhancement using DenseUNet with time-frequency attention," in Proc. Interspeech, Shanghai, China, Oct. 2020, pp. 3261–3265.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Advances in Neural Information Proc. Sys., Long Beach, CA, Dec. 2017, pp. 5998–6008.
[18] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in Proc. ISCA Speech Synthesis Workshop, Sunnyvale, CA, Sep. 2016, p. 125.
[19] A. Pandey and D. Wang, "TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain," in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Proc., Brighton, United Kingdom, May 2019, pp. 6875–6879.
[20] Q. Zhang, A. Nicolson, M. Wang, K. K. Paliwal, and C. Wang, "Monaural speech enhancement using a multi-branch temporal convolutional network," arXiv:1912.12023, Dec. 2019.
[21] Y. Koyama, T. Vuong, S. Uhlich, and B. Raj, "Exploring the best loss function for DNN-based low-latency speech enhancement with temporal convolutional networks," arXiv:2005.11611, May 2020.
[22] V. Kishore, N. Tiwari, and P. Paramasivam, "Improved speech enhancement using TCN with multiple encoder-decoder layers," in Proc. Interspeech, Shanghai, China, Oct. 2020.
[23] C.-L. Liu, S.-W. Fu, Y.-J. Li, J.-W. Huang, H.-M. Wang, and Y. Tsao, "Multichannel speech enhancement by raw waveform-mapping using fully convolutional networks," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28, pp. 1888–1900, 2020.
[24] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 27, no. 8, pp. 1256–1266, Aug. 2019.
[25] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22, no. 12, pp. 1849–1858, Dec. 2014.
[26] Z. Chen, Y. Huang, J. Li, and Y. Gong, "Improving mask learning based speech enhancement system with restoration layers and residual connection," in Proc. Interspeech, Stockholm, Sweden, Aug. 2017, pp. 3632–3636.
[27] A. Narayanan and D. Wang, "Investigation of speech separation as a front-end for noise robust speech recognition," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22, no. 4, pp. 826–835, Apr. 2014.
[28] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Proc. European Conf. on Computer Vision, Amsterdam, The Netherlands, Oct. 2016, pp. 483–499.
[29] Y. A. Farha and J. Gall, "MS-TCN: Multi-stage temporal convolutional network for action segmentation," in Proc. IEEE/CVF Conf. Comput. Vision and Pattern Recognition, Long Beach, CA, Jun. 2019, pp. 3575–3584.
[30] X. Hao, X. Su, S. Wen, Z. Wang, Y. Pan, F. Bao, and W. Chen, "Masking and inpainting: A two-stage speech enhancement approach for low SNR and non-stationary noise," in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Proc., Barcelona, Spain, May 2020, pp. 6959–6963.
[31] A. Li, C. Zheng, R. Peng, and X. Li, "Two heads are better than one: A two-stage approach for monaural noise reduction in the complex domain," arXiv:2011.01561, Nov. 2020.
[32] A. Li, C. Zheng, C. Fan, R. Peng, and X. Li, "A recursive network with dynamic attention for monaural speech enhancement," arXiv:2003.12973, Mar. 2020.
[33] C. Fan, J. Tao, B. Liu, J. Yi, Z. Wen, and X. Liu, "Deep attention fusion feature for speech separation with end-to-end post-filter method," arXiv:2003.07544, Mar. 2020.
[34] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: an ASR corpus based on public domain audio books," in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Proc., Brisbane, Australia, Apr. 2015, pp. 5206–5210.
[35] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech," in Proc. ISCA Speech Synthesis Workshop, Sunnyvale, CA, Sep. 2016, pp. 146–152.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int'l Conf. Comput. Vision, Santiago, Chile, Dec. 2015, pp. 1026–1034.
[37] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int'l Conf. on Machine Learning, Lille, France, Jul. 2015, pp. 448–456.
[38] A. Pandey and D. Wang, "On adversarial training and loss functions for speech enhancement," in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Proc., Calgary, AB, Apr. 2018, pp. 5414–5418.
[39] A. Pandey and D. Wang, "A new framework for CNN-based speech enhancement in the time domain," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 27, no. 7, pp. 1179–1188, Jul. 2019.
[40] C. K. A. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, "ICASSP 2021 deep noise suppression challenge," arXiv:2009.06122, Sep. 2020.
[41] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun., vol. 12, no. 3, pp. 247–251, Jul. 1993.
[42] J. Thiemann, N. Ito, and E. Vincent, "The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings," in Proc. Int'l Conf. on Acoustics, Montréal, Canada, Jun. 2013, pp. 1–6.
[43] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. Lopez Moreno, "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," in Proc. Interspeech, Graz, Austria, Sep. 2019, pp. 2728–2732.
[44] S.-W. Fu, T.-Y. Hu, Y. Tsao, and X. Lu, "Complex spectrogram enhancement by convolutional neural network with multi-metrics learning," in Proc. IEEE Int'l Workshop on Machine Learning for Signal Proc., Tokyo, Japan, Sep. 2017, pp. 1–6.
[45] K. Tan and D. Wang, "Complex spectral mapping with a convolutional recurrent network for monaural speech enhancement," in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Proc., Brighton, United Kingdom, May 2019, pp. 6865–6869.
[46] A. Défossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," arXiv:2006.12847, Sep. 2020.
[47] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in Proc. Int'l Conf. on Learning Representations, San Diego, CA, May 2015, pp. 1–15.
[48] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. Interspeech, Dresden, Germany, Sep. 2015, pp. 3214–3218.
[49] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. IEEE Workshop on Autom. Speech Recognition and Understanding, Waikoloa, HI, Dec. 2011.
[50] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Proc., Salt Lake City, UT, May 2001, pp. 749–752.
[51] Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, ITU-T Recommendation P.862, Feb. 2001.
[52] Perceptual objective listening quality prediction, ITU-T Recommendation P.863, Mar. 2018.
[53] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 7, pp. 2125–2136, Sep. 2011.
[54] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – half-baked or well done?" in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Proc., Brighton, United Kingdom, May 2019, pp. 626–630.
[55] C. Macartney and T. Weyde, "Improved speech enhancement with the Wave-U-Net," arXiv:1811.11307, Nov. 2018.
[56] F. G. Germain, Q. Chen, and V. Koltun, "Speech denoising with deep feature losses," arXiv:1806.10522, Jun. 2018.
[57] M. H. Soni, N. Shah, and H. A. Patil, "Time-frequency masking-based speech enhancement using generative adversarial network," in Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Proc., Calgary, AB, Apr. 2018, pp. 5039–5043.
[58] Q. Zhang, A. Nicolson, M. Wang, K. K. Paliwal, and C. Wang, "DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28, pp. 1404–1415, Apr. 2020.
[59] K. Paliwal, K. Wójcicki, and B. Shannon, "The importance of phase in speech enhancement," Speech Commun., vol. 53, no. 4, pp. 465–494, Apr. 2011.