Dense CNN with Self-Attention for Time-Domain Speech Enhancement
Ashutosh Pandey, Student Member, IEEE, and DeLiang Wang, Fellow, IEEE
Abstract—Speech enhancement in the time domain has become increasingly popular in recent years, due to its capability to jointly enhance both the magnitude and the phase of speech. In this work, we propose a dense convolutional network (DCN) with self-attention for speech enhancement in the time domain. DCN is an encoder and decoder based architecture with skip connections. Each layer in the encoder and the decoder comprises a dense block and an attention module. Dense blocks and attention modules help in feature extraction using a combination of feature reuse, increased network depth, and maximum context aggregation. Furthermore, we reveal previously unknown problems with a loss based on the spectral magnitude of enhanced speech. To alleviate these problems, we propose a novel loss based on the magnitudes of the enhanced speech and a predicted noise. Even though the proposed loss is based on magnitudes only, a constraint imposed by noise prediction ensures that the loss enhances both magnitude and phase. Experimental results demonstrate that DCN trained with the proposed loss substantially outperforms other state-of-the-art approaches to causal and non-causal speech enhancement.
Index Terms—Speech enhancement, self-attention network, time-domain enhancement, dense convolutional network, frequency-domain loss.
I. INTRODUCTION
A speech signal in a real-world environment is degraded by background noise, which reduces its intelligibility and quality for human listeners. Further, background noise can severely degrade the performance of speech-based applications, such as automatic speech recognition (ASR), teleconferencing, and hearing aids. Speech enhancement aims at improving the intelligibility and quality of a speech signal by removing or attenuating background noise. It is used as a preprocessor in speech-based applications to improve their performance in noisy environments. Monaural (single-channel) speech enhancement provides a versatile and cost-effective approach to the problem by utilizing recordings from a single microphone. Single-channel speech enhancement in low signal-to-noise ratio (SNR) conditions is considered a very challenging problem. This study focuses on single-channel speech enhancement in the time domain.

Traditional monaural speech enhancement approaches include spectral subtraction, Wiener filtering, and statistical model-based methods [1]. Speech enhancement has been extensively studied in recent years as a supervised learning problem using deep neural networks (DNNs) since the first study in [2].
This research was supported in part by two NIDCD grants (R01DC012048 and R02DC015521) and the Ohio Supercomputer Center. A. Pandey is with the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA (e-mail: [email protected]). D. L. Wang is with the Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA (e-mail: [email protected]).

Supervised approaches to speech enhancement generally convert a speech signal to a time-frequency (T-F) representation, and extract input features and training targets from it [3]. Training targets are either masking based or mapping based [4]. Masking based targets, such as the ideal ratio mask (IRM) [4] and the phase sensitive mask [5], are based on the time-frequency relation between noisy and clean speech, whereas mapping based targets [6], [7], such as the spectral magnitude and the log power spectrum, are based on clean speech. Input features and training targets are used to train a DNN that estimates targets from noisy features. Finally, the enhanced waveform is obtained by reconstructing a signal from the estimated target.

Most of the T-F representation based methods aim to enhance only spectral magnitudes, and the noisy phase is used unaltered for time-domain signal reconstruction [6], [7], [8], [9], [10], [11], [12], [13]. This is mainly because phase was considered not important for speech enhancement [14], and exhibits no spectro-temporal structure amenable to supervised learning [15]. A recent study, however, found that phase can play an important role in the quality of enhanced speech, especially in low SNR conditions [16]. This has led researchers to explore techniques to jointly enhance magnitude and phase [15], [17], [18], [19].

There are two approaches to jointly enhance magnitude and phase: complex spectrogram enhancement and time-domain enhancement. In complex spectrogram enhancement, the real and the imaginary part of the complex-valued noisy STFT (short-time Fourier transform) are enhanced. Based on training targets, complex spectrogram enhancement is further categorized as complex ratio masking [15] and complex spectral mapping [17], [18], [19].

Time-domain enhancement aims at directly predicting enhanced speech samples from noisy speech samples, and in the process, magnitude and phase are jointly enhanced [20], [21], [22], [23], [24], [25], [26]. Even though complex spectrogram enhancement and time-domain enhancement have similar objectives, time-domain enhancement has some advantages. First, time-domain enhancement avoids the computations associated with the conversion of a signal to and from the frequency domain. Second, since the underlying DNN is trained from raw samples, it can potentially learn to extract better features that are suited for the particular task of speech enhancement. Finally, short-time processing based on a T-F representation requires the frame size to be greater than some threshold to have sufficient spectral resolution, whereas in time-domain processing the frame size can be set to an arbitrary value. In [27] and [28], the performance of a time-domain speaker separation network is substantially improved by setting the frame size to very small values.
However, using a smaller frame size requires more computations due to an increased number of frames.

Self-attention is a widely utilized mechanism for sequence-to-sequence tasks, such as machine translation [29], image generation [30], and ASR [31]. First introduced in [29], self-attention is a mechanism for selective context aggregation, where a given output in a sequence is computed based on only a subset of the input sequence (attending on that subset) that is helpful for the output prediction. It can be utilized for any task that has sequential input and output. Self-attention can be a helpful mechanism for speech enhancement for the following reason. A spoken utterance generally contains many repeating phones. In a low SNR condition, a given phone can be present in both high and low SNR regions of the utterance. This suggests that a speech enhancement system based on self-attention can attend over phones in high SNR regions to better reconstruct phones in low SNR regions. Recent studies [32], [33], [34], [35] have successfully employed self-attention for speech enhancement with promising results.

In this work, we propose a dense convolutional network (DCN) with self-attention for speech enhancement in the time domain. DCN is based on an encoder-decoder architecture with skip connections [24], [25], [26]. Each of the layers in the encoder and the decoder comprises a dense block [36] and an attention module. The dense block is used for better feature extraction with feature reuse in a deeper network, and the attention module is used for utterance-level context aggregation. This study is an extension of our previous work in [26], where dilated convolutions are utilized inside a dense block for context aggregation. We find attention to be superior to dilated convolutions for speech enhancement. We use an attention module similar to the one proposed in [37].

Furthermore, we find that the spectral magnitude (SM) loss proposed for training a time-domain network [38] obtains better objective intelligibility and quality scores, but introduces a previously unknown artifact in enhanced utterances. Also, it is inconsistent in terms of SNR improvement. We propose a magnitude based loss to remove this artifact and obtain consistent SNR improvement as a result. The proposed loss function is based on the spectral magnitudes of the enhanced speech and a predicted noise. In the case of perfect estimation, the proposed loss reduces the possible number of phase values at a given T-F unit from infinity to two, one of which corresponds to the clean phase, i.e., it constrains the phase to be much closer to the clean phase. We call this loss the phase constrained magnitude (PCM) loss.

The rest of the paper is organized as follows. We describe speech enhancement in the time domain in Section II. The DCN architecture and its building blocks are explained in Section III. Section IV describes different loss functions along with the proposed loss. Experimental settings are given in Section V, and results are discussed in Section VI. Concluding remarks are given in Section VII.

II. SPEECH ENHANCEMENT IN THE TIME DOMAIN
Given a clean speech signal s and a noise signal n, the noisy speech signal is modeled as

y = s + n    (1)

where {y, s, n} ∈ R^{M×1}, and M represents the number of samples in the signal. The goal of a speech enhancement algorithm is to get a close estimate, ŝ, of s given y.

Speech enhancement in the time domain aims at computing ŝ directly from y instead of using a T-F representation of y. We can formulate time-domain enhancement using a DNN as

ŝ = f_θ(y)    (2)

where f_θ denotes a function defining a DNN model parametrized by θ. The DNN model f_θ can be any of the existing DNN architectures, such as a feedforward, recurrent, or convolutional neural network.

A. Frame-Level Processing
Generally, the input signal y is first chunked into overlapping frames, which are then processed by frame-level enhancement. Let Y ∈ R^{T×L} denote the matrix containing the frames of signal y, and y_t ∈ R^{L×1} the t-th frame. y_t is defined as

y_t[k] = y[(t − 1) · J + k],  k = 0, ..., L − 1    (3)

where T is the number of frames, L is the frame length, and J is the frame shift. T is given by \lceil M/J \rceil, where \lceil \cdot \rceil denotes the ceiling function. Note that y is padded with zeros if M is not divisible by J. Frame-level processing using a DNN can be defined as

ŝ_t = f_θ(y_{t−K}, ..., y_{t−1}, y_t, y_{t+1}, ..., y_{t+K})    (4)

where ŝ_t is computed using y_t, K past frames, and K future frames.
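The framing of Eq. (3) maps directly onto tensor unfolding. Below is a minimal PyTorch sketch, using the frame parameters reported in Section V (L = 512, J = 256); the function name and the exact zero-padding strategy are illustrative assumptions, not taken from the paper.

import torch

def frame_signal(y: torch.Tensor, frame_len: int = 512, frame_shift: int = 256) -> torch.Tensor:
    """Chunk a 1-D signal of M samples into a T x L matrix of frames,
    where T = ceil(M / J) and y is zero-padded so every frame is full."""
    M = y.shape[0]
    T = -(-M // frame_shift)                     # ceil(M / J), as in the text
    pad = (T - 1) * frame_shift + frame_len - M  # zeros appended at the end
    y = torch.nn.functional.pad(y, (0, pad))
    # unfold yields overlapping windows: row t holds y[t * J + k], k = 0..L-1
    return y.unfold(0, frame_len, frame_shift)

frames = frame_signal(torch.randn(16000))       # 1 s of 16 kHz audio
print(frames.shape)                             # torch.Size([63, 512])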
B. Causal Speech Enhancement

A speech enhancement system is considered causal if the prediction for a given frame is computed using only the current and the past frames. This can be defined as

ŝ_t = f_θ(y_{t−K}, ..., y_{t−1}, y_t)    (5)

A causal speech enhancement system is required for real-time speech enhancement.

III. DENSE CONVOLUTIONAL NETWORK
A block diagram of DCN is shown in Fig. 1. The building blocks of DCN are 2D convolution, sub-pixel convolution [39], layer normalization [40], the dense block [36], and the self-attention module [29]. Next, we describe these building blocks one by one.
Fig. 1: Diagram of the proposed DCN model.
A. 2-D Convolution
Formally, a 2-D discrete convolution operator ∗, which convolves a signal Y of size T × L with a kernel K of size m × n and stride (r, s), is defined as

(Y ∗ K)(i, j) = \sum_{u=0}^{m−1} \sum_{v=0}^{n−1} Y(r · i + u, s · j + v) · K(u, v)    (6)

where i ∈ {0, ..., T − m} and j ∈ {0, 1, ..., L − n}. Note that Eq. (6) is actually a correlation operator, generally referred to as convolution in convolutional neural networks. Further, Eq. (6) defines VALID convolution, in which the kernel is placed only at locations where it does not cross the signal boundary; as a result, the output size is reduced to (T − m + 1) × (L − n + 1). Fig. 2(a) illustrates the position of the kernel at the four corners for VALID convolution. To obtain an output of the same size as the input, the input is padded with zeros around all the boundaries, which is known as SAME padding and is shown in Fig. 2(b).

Fig. 2: Illustration of different types of convolution of an input with a kernel. (a) VALID convolution. (b) Non-causal convolution with SAME padding. (c) Causal convolution with SAME padding.

Causal convolution is a term used for convolution with time-series signals, such as audio and video. A convolution is considered causal if the output at time t is computed using inputs at time instances less than or equal to t. For speech enhancement, the matrix Y, which stores the frames of the speech signal, y_0, y_1, ..., y_t, ..., y_{T−1}, is a time series. A non-causal convolution can be easily converted to a causal one by padding extra zeros in the beginning (t < 0). A causal convolution is shown in Fig. 2(c). In general, a padding of length m − 1 is required for causal convolution with a kernel of size m along the time dimension.
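As a concrete illustration of the causal padding just described, the sketch below pads m − 1 zeros at the front of the time axis and applies an unpadded convolution, so each output frame depends only on the current and past frames; the module name and the default kernel shape are our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv2d(nn.Module):
    """2-D convolution that is causal along the time (frame) axis and
    SAME-padded along the within-frame axis."""
    def __init__(self, in_ch: int, out_ch: int, kernel=(2, 3)):
        super().__init__()
        self.m, self.n = kernel
        self.conv = nn.Conv2d(in_ch, out_ch, kernel)

    def forward(self, x):                 # x: [batch, channels, T, L]
        # pad order: (L-left, L-right, T-front, T-back); m - 1 zeros before t = 0
        x = F.pad(x, (self.n // 2, (self.n - 1) // 2, self.m - 1, 0))
        return self.conv(x)

x = torch.randn(1, 16, 100, 64)
print(CausalConv2d(16, 16)(x).shape)      # torch.Size([1, 16, 100, 64])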
B. Sub-pixel Convolution

First proposed in [39], sub-pixel convolution is used to increase the size of a signal (upsampling). It has become increasingly popular as an alternative to transposed convolution, as it avoids the well-known checkerboard artifact in the output signal [41] and is computationally efficient. For an upsampling rate (r, s), sub-pixel convolution uses r · s convolutions to obtain r · s different signals of the same size as the input. The different convolutions in a sub-pixel convolution are defined as

S_{0,0} = Pad(Y) ∗ K_{0,0}, S_{0,1} = Pad(Y) ∗ K_{0,1}, ..., S_{r−1,s−1} = Pad(Y) ∗ K_{r−1,s−1}    (7)

where Pad denotes the SAME padding operation. S_{0,0}, S_{0,1}, ..., and S_{r−1,s−1} are combined to obtain the upsampled signal using the following equation:

S(i, j) = S_{(i % r),(j % s)}(⌊i/r⌋, ⌊j/s⌋)    (8)

where % denotes the remainder operator, ⌊·⌋ the floor operator, i ∈ {0, 1, ..., r · T − 1}, and j ∈ {0, 1, ..., s · L − 1}. A diagram of sub-pixel convolution is shown in Fig. 3.

Fig. 3: An illustration of sub-pixel convolution for upsampling a 2D signal by rate (2, 2).
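A minimal sketch of Eqs. (7)-(8): the r · s SAME-padded convolutions are realized here as one convolution with r · s times as many output channels, and their outputs are interleaved according to Eq. (8). The single-convolution trick and all names are implementation assumptions, not specified by the paper.

import torch
import torch.nn as nn

class SubPixelConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rate=(2, 2), kernel=(3, 3)):
        super().__init__()
        self.r, self.s = rate
        # one conv with out_ch * r * s channels stands in for r * s separate convs
        self.conv = nn.Conv2d(in_ch, out_ch * self.r * self.s, kernel,
                              padding=(kernel[0] // 2, kernel[1] // 2))

    def forward(self, x):                        # x: [B, C, T, L]
        B, _, T, L = x.shape
        S = self.conv(x)                         # [B, out*r*s, T, L], Eq. (7)
        S = S.view(B, -1, self.r, self.s, T, L)  # split the r*s sub-signals
        # Eq. (8): output(i, j) = S_{i % r, j % s}(i // r, j // s)
        S = S.permute(0, 1, 4, 2, 5, 3)          # [B, out, T, r, L, s]
        return S.reshape(B, -1, self.r * T, self.s * L)

x = torch.randn(1, 64, 50, 32)
print(SubPixelConv2d(64, 32, rate=(1, 2))(x).shape)  # torch.Size([1, 32, 50, 64])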
C. Layer Normalization

Layer normalization is a technique proposed to improve generalization and facilitate DNN training [40]. It is used as an alternative to batch normalization, which is sensitive to the training batch size. We use the following layer normalization:

y_norm = (y − μ_y) / \sqrt{σ_y^2 + ε} ⊙ γ + β    (9)

where μ_y and σ_y^2, respectively, are the mean and variance of y. γ and β are trainable variables of the same size as y, and ⊙ denotes element-wise multiplication. ε is a small positive constant to avoid division by zero. For an input of shape [C, T, L] (C channels, T frames), normalization is performed over the last dimension using γ and β that are shared across channels and frames.
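Because γ and β are shared across channels and frames, Eq. (9) reduces to normalization over the last dimension only. A minimal sketch (equivalent, up to the choice of ε, to PyTorch's nn.LayerNorm applied to the last dimension):

import torch
import torch.nn as nn

class LastDimLayerNorm(nn.Module):
    def __init__(self, L: int, eps: float = 1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(L))   # trainable scale, shared over C and T
        self.beta = nn.Parameter(torch.zeros(L))   # trainable shift
        self.eps = eps

    def forward(self, y):                          # y: [B, C, T, L]
        mu = y.mean(dim=-1, keepdim=True)
        var = y.var(dim=-1, unbiased=False, keepdim=True)
        return (y - mu) / torch.sqrt(var + self.eps) * self.gamma + self.beta  # Eq. (9)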
D. Dense Block

Densely connected convolutional networks were recently proposed in [36]. A densely connected network is based on the idea of feature reuse, in which the output at a given layer is reused multiple times in the subsequent layers. In other words, the input to a given layer is not just the output from the previous layer but also the outputs from several layers before it. This has two major advantages. First, it can avoid the vanishing gradient problem in DNNs because of the direct connections of a given layer to the subsequent layers. Second, a thinner dense network is found to outperform a wider normal network, which improves the parameter efficiency of the network. Formally, a dense connection can be defined as

y_l = g(y_{l−1}, y_{l−2}, ..., y_{l−D})    (10)

where y_l denotes the output at layer l, g is the function represented by a single layer in the network, and D is the depth of the dense connections. DCN uses a dense block after each layer in the encoder and the decoder. The proposed dense block is shown in Fig. 4. It consists of five convolutional layers with m × 3 convolutions followed by layer normalization and parametric ReLU nonlinearity [42]. We set m to 2 for causal and to 3 for non-causal convolution. The input to a given layer is formed by a concatenation of the input to and the output of the previous layer. The number of input channels in the successive layers increases linearly as C, 2C, 3C, 4C, 5C. The output after each convolution has C channels.

Fig. 4: The proposed dense block. X and Y in the pair (X, Y) inside a convolution box, respectively, denote the number of input and output channels.
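A minimal sketch of the dense block, shown here in its non-causal form (m = 3 with SAME padding); the causal variant would instead use the front-padded convolution of Section III-A. The layer count and channel growth follow the text; everything else is an assumption.

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, C: int, L: int, m: int = 3, depth: int = 5):
        super().__init__()
        # input channels grow as C, 2C, 3C, 4C, 5C; every output has C channels
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d((d + 1) * C, C, (m, 3), padding=(m // 2, 1)),
                nn.LayerNorm(L),                  # over the last dimension
                nn.PReLU(C))
            for d in range(depth))

    def forward(self, x):                         # x: [B, C, T, L]
        skip = x
        for layer in self.layers:
            out = layer(skip)
            skip = torch.cat([skip, out], dim=1)  # feature reuse, Eq. (10)
        return out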
E. Self-attention Module

DCN uses self-attention after downsampling in the encoder and upsampling in the decoder. An attention mechanism comprises three key components: query Q, key K, and value V, where {Q, K} ∈ R^{T×P} and V ∈ R^{T×Q}. First, correlation scores of all the rows in Q are computed with all the rows in K using the following equation:

W = Q K^T    (11)

where K^T denotes the transpose of K and W ∈ R^{T×T}. Next, correlation scores are converted to probability values using a Softmax operation defined as

Softmax(W)(i, j) = \exp W(i, j) / \sum_{j'=0}^{T−1} \exp W(i, j')    (12)

Finally, the rows of V are linearly combined using the weights in Softmax(W) to obtain the attention output:

A = Softmax(W) V    (13)

An attention mechanism is called self-attention if Q and K are computed from the same sequence. For example, given an input sequence Y, a self-attention layer can be implemented by using a linear layer to compute Q, K, and V, and then using Eqs. (11)-(13) to get the attention output.

The proposed self-attention module in DCN is shown in Fig. 5. First, three different 1 × 1 convolutions are used to transform an input of shape [C, T, L] to Q of shape [E, T, L], K of shape [E, T, L], and V of shape [F, T, L]. Next, Q, K, and V are reshaped to obtain 2D matrices. Finally, Eq. (11), Eq. (12), and Eq. (13) are applied to get the 2D attention output, which is reshaped to get an output of shape [F, T, L]. The proposed attention module is similar to the one in [37] with one difference: we do not use linear layers to project Q and K to lower dimensions. We find that the performance is similar with and without linear layers.

Fig. 5: The proposed self-attention module.

Causal attention can be implemented by applying a mask to W, where entries above the main diagonal are set to negative infinity so that the contribution from future frames in Eq. (12) becomes zero. This can be defined as

A_causal = Softmax(Mask(W)) V    (14)

where

Mask(W)(i, j) = W(i, j) if j ≤ i, and −∞ otherwise    (15)

With the building blocks described, we now present the processing flow of DCN. First, a given utterance y is chunked into frames of size L, reshaped to a shape of [1, T, L], and fed to the encoder. The first layer in the encoder uses 1 × 1 convolution to increase the number of channels to C, and its output is then processed by a dense block. The following layers in the encoder process their input by one convolutional layer for downsampling, one attention module, and one dense block. The output of the attention module is concatenated with its input along the channel dimension before being fed to the dense block. The output of the encoder is fed to the decoder. Each layer in the decoder has one module for upsampling using sub-pixel convolution, one attention module, and one dense block. The output of the decoder is concatenated with the output of the corresponding symmetric layer in the encoder. The final layer in the decoder does not include a dense block, and uses 1 × 1 convolution to output a signal with 1 channel, which is subject to overlap-and-add to obtain the enhanced utterance. Each convolution in DCN, except at the input and at the output, is followed by layer normalization and parametric ReLU [42].
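A minimal sketch of the attention module of Eqs. (11)-(15). The 1 × 1 convolutions and output shapes follow the text; flattening each [E, T, L] tensor into a T × (E·L) matrix is our reading of the "reshaped to obtain 2D matrices" step, and the default E and F values are those reported in Section V.

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, C: int, E: int = 5, F: int = 32, causal: bool = False):
        super().__init__()
        self.q = nn.Conv2d(C, E, 1)    # [B, C, T, L] -> [B, E, T, L]
        self.k = nn.Conv2d(C, E, 1)
        self.v = nn.Conv2d(C, F, 1)
        self.causal = causal

    def forward(self, x):                                # x: [B, C, T, L]
        B, _, T, L = x.shape
        Q = self.q(x).transpose(1, 2).reshape(B, T, -1)  # [B, T, E*L]
        K = self.k(x).transpose(1, 2).reshape(B, T, -1)
        V = self.v(x).transpose(1, 2).reshape(B, T, -1)  # [B, T, F*L]
        W = torch.bmm(Q, K.transpose(1, 2))              # Eq. (11): [B, T, T]
        if self.causal:                                  # Eq. (15): mask j > i
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                         device=W.device), diagonal=1)
            W = W.masked_fill(mask, float('-inf'))
        A = torch.softmax(W, dim=-1) @ V                 # Eqs. (12)-(13)
        return A.reshape(B, T, -1, L).transpose(1, 2)    # back to [B, F, T, L]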
IV. LOSS FUNCTIONS

A. Time-Domain Loss
An utterance-level mean squared error (MSE) loss in the time domain is defined as

L_T(s, ŝ) = (1/M) \sum_{k=0}^{M−1} (s[k] − ŝ[k])^2    (16)

B. STFT Magnitude Loss
A loss based on STFT magnitude was proposed in [24], which was found to be superior to the time-domain loss in terms of objective intelligibility and quality scores, and a little worse in terms of scale-invariant speech-to-distortion ratio (SI-SDR). The loss is defined as

L_SM(s, ŝ) = (1/(T · F)) \sum_{t=0}^{T−1} \sum_{f=0}^{F−1} | (|S_r(t, f)| + |S_i(t, f)|) − (|Ŝ_r(t, f)| + |Ŝ_i(t, f)|) |    (17)

where S and Ŝ respectively denote the STFTs of s and ŝ, T is the number of time frames, and F is the number of frequency bins. Subscripts r and i respectively denote the real and the imaginary part of a complex variable. L_SM is a mean absolute error loss between the L1 norms of the clean and estimated STFT coefficients [43].

Even though L_SM can obtain better objective scores, it has some disadvantages. First, we find that it is not consistent in terms of SNR improvement, as in some cases the processed SNR is found to be worse than the unprocessed SNR. However, a consistent improvement is observed in scale-invariant scores, such as SI-SNR and SI-SDR, suggesting that utterances enhanced using L_SM do not have an appropriate scale, which is a requirement for speech enhancement algorithms. Second, we find that L_SM introduces an unknown artifact in enhanced utterances, which does not affect intelligibility and quality scores, but this steady buzzing sound is annoying to human listeners.

We find that the introduced artifact is not visible in a spectrogram with the same frequency resolution as in the STFT of L_SM. However, it can be observed with a higher frequency resolution. Spectrograms of a sample noisy utterance enhanced using DCN trained with different loss functions are plotted in Fig. 6. The first row plots spectrograms with frame size and frame shift equal to the ones used in the computation of L_SM, L_TF (Eq. (18)), and L_PCM (Eq. (20)). The second row plots spectrograms with a frame size twice that in the first row. We can see horizontal stripes in the second-row plots for L_SM and L_TF, which are not visible in the first row, and these stripes correspond to the artifact in enhanced utterances. This artifact is not present with the time-domain MSE loss or the PCM loss proposed in this study.
C. Time-frequency Loss
The time-frequency loss, which was proposed in [26], is a combination of L_T and L_SM. It is defined as

L_TF = α · L_T + (1 − α) · L_SM    (18)

where α is a hyperparameter. We find that L_TF can solve the inconsistent SNR problem associated with L_SM, as it obtains consistent SNR improvement similar to L_T. Additionally, L_TF preserves the improvements in objective scores obtained using L_SM. However, L_TF is not able to remove the artifacts, as shown in Fig. 6. We have explored different values of α in Eq. (18) and find that the artifact is present for a wide range of α values, and it is not straightforward to find a value that removes the artifacts while maintaining objective scores similar to L_SM.

D. Phase Constrained Magnitude Loss
We propose a new loss that is based on STFT magnitudes but can alleviate both of the problems associated with L_SM. Given y, s, and ŝ, a prediction for the noise can be defined as

n̂ = y − ŝ    (19)

Now, we can modify the objective of speech enhancement to match not only the STFT magnitude of the speech but that of the noise as well. The PCM loss is defined as

L_PCM(s, ŝ) = (1/2) · L_SM(s, ŝ) + (1/2) · L_SM(n, n̂)    (20)

Even though one can play with the relative contributions of speech and noise, we find that the equal contribution in Eq. (20) obtains consistent SNR improvement similar to L_T, removes the artifacts associated with L_SM, and achieves objective intelligibility and quality scores similar to L_SM.

Fig. 6: Spectrograms of a sample utterance processed using DCN trained with different loss functions. The STFT frame size is 32 ms in the first row and 64 ms in the second row.

How can L_PCM remove the artifact caused by L_SM? Let y(t, f), s(t, f), and n(t, f) respectively denote the STFT coefficients at a given T-F unit of noisy speech, clean speech, and noise. L_SM aims at obtaining close estimates of |s(t, f)| only, and there are infinitely many perfect estimates of |s(t, f)| in the complex plane. This is illustrated in Fig. 7(a), where the perfect estimates of |s(t, f)| lie on the perimeter of a circle with radius |s(t, f)|. L_PCM, on the other hand, aims at getting good estimates of both |s(t, f)| and |n(t, f)|, and it has only two candidates for the perfect estimate, as shown in Fig. 7(b). This implies that L_PCM optimizes L_SM with an additional constraint on phase, hence the name PCM.
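The SM and PCM losses are straightforward to compute with torch.stft. A minimal sketch of Eqs. (17), (19), and (20), assuming a 512-sample frame and 256-sample shift for the loss STFT (the network's frame parameters; the paper's exact loss-STFT settings are not recoverable here):

import torch

def sm_loss(s, s_hat, n_fft=512, hop=256):
    """Eq. (17): MAE between the L1 norms of the real and imaginary STFT parts."""
    window = torch.hann_window(n_fft, device=s.device)
    S = torch.stft(s, n_fft, hop, window=window, return_complex=True)
    S_hat = torch.stft(s_hat, n_fft, hop, window=window, return_complex=True)
    mag = S.real.abs() + S.imag.abs()            # |S_r| + |S_i|
    mag_hat = S_hat.real.abs() + S_hat.imag.abs()
    return (mag - mag_hat).abs().mean()

def pcm_loss(s, s_hat, y):
    """Eq. (20): equal-weight SM losses on the speech and the predicted noise."""
    n, n_hat = y - s, y - s_hat                  # Eq. (19)
    return 0.5 * sm_loss(s, s_hat) + 0.5 * sm_loss(n, n_hat)

y, s = torch.randn(2, 16000), torch.randn(2, 16000)  # noisy batch, clean targets
loss = pcm_loss(s, torch.randn(2, 16000), y)         # s_hat comes from the model in practice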
Fig. 7: Differences between (a) L_SM and (b) L_PCM in Cartesian (rectangular) coordinates. Re and Im respectively denote the real and the imaginary axes in the complex plane.

V. EXPERIMENTAL SETTINGS
A. Datasets
We evaluate all the models in a speaker- and noise-independent way on the WSJ0 SI-84 dataset (WSJ) [44], which consists of 7138 utterances from 83 speakers (42 males and 41 females). Noisy utterances for training are generated at SNRs uniformly sampled from {−5 dB, −4 dB, −3 dB, −2 dB, −1 dB, 0 dB}, and noisy test utterances are generated for both of the test noises at SNRs of −5 dB, 0 dB, and 5 dB.
All the utterances are resampled to 16 kHz. We use L = 512, J = 256, C = 64, E = 5, and F = 32. Inside a dense block, m is set to 2 for causal and 3 for non-causal DCN.

The Adam optimizer [45] is used for SGD (stochastic gradient descent) based optimization with a fixed batch size of utterances. All the models are trained for 15 epochs using the learning rate schedule given in Table I. We use PyTorch [46] to develop all the models, and utilize its default settings for initialization.

TABLE I: Learning rate schedule for training the proposed model.

Epochs         1 to 3    4 to 9    10 to 12   13 to 15
Learning rate  0.0002    0.0001    0.00005    0.00001
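Table I is a simple piecewise-constant decay, which can be applied by resetting the optimizer's learning rate at the start of each epoch. A minimal sketch; everything besides the epoch boundaries and rates taken from Table I is placeholder.

import torch

model = torch.nn.Linear(512, 512)               # stand-in for the DCN model

def lr_for_epoch(epoch: int) -> float:          # 1-indexed, per Table I
    if epoch <= 3:
        return 2e-4
    if epoch <= 9:
        return 1e-4
    if epoch <= 12:
        return 5e-5
    return 1e-5

optimizer = torch.optim.Adam(model.parameters(), lr=lr_for_epoch(1))
for epoch in range(1, 16):                      # 15 epochs
    for group in optimizer.param_groups:
        group['lr'] = lr_for_epoch(epoch)
    # ... run one training epoch over the noisy/clean pairs ...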
C. Baseline Models
We compare DCN with different existing approaches to speech enhancement, namely T-F masking, spectral mapping, complex spectral mapping, and time-domain enhancement. For T-F masking, we train an IRM-based 4-layered bidirectional long short-term memory (BLSTM) network [12]. A gated residual network (GRN) proposed in [13] is used for spectral mapping. For complex spectral mapping, we report results from a recently proposed state-of-the-art gated convolutional recurrent network (GCRN) [19]. We compare with both causal and non-causal GCRN. For time-domain enhancement, we compare results with three different models: auto-encoder CNN (AECNN) [24], temporal convolutional neural network (TCNN) [25], and speech enhancement generative adversarial network (SEGAN) [20]. SEGAN is trained with the time-domain loss, as we find it to be superior to the adversarial training proposed in the original paper.
D. Evaluation Metrics
We use short-time objective intelligibility (STOI) [47], perceptual evaluation of speech quality (PESQ) [48], and signal-to-noise ratio (SNR) as the evaluation metrics, which are standard metrics for speech enhancement. STOI values typically range from 0 to 1, which can be roughly interpreted as percent correct. PESQ values range from −0.5 to 4.5.

VI. RESULTS AND DISCUSSIONS
A. Ablation Study
In this section, we present the findings of an ablation study performed to analyze the effectiveness of different context-aggregation techniques in DCN. There are three components responsible for context aggregation. The first is using m > 1 in a dense block so that the receptive field of convolution extends beyond one frame. The second is using an exponentially increasing dilation rate in the layers of dense blocks, as proposed in [26]. The third is the attention module proposed in this study (Section III-E). STOI, PESQ, and SNR scores for causal and non-causal models trained using L_T are given in Table II.

We observe that even when there is no context, i.e., m = 1, no dilation, and no attention, a considerable average improvement in STOI, PESQ, and SNR over the unprocessed mixtures is obtained in causal enhancement. Increasing m to 2 with causal convolution obtains a further improvement in STOI, PESQ, and SNR. Next, replacing causal convolutions with dilated and causal convolutions, as in [26], obtains a further improvement in STOI, PESQ, and SNR. Most of the improvements due to dilated convolutions are at the negative SNR of −5 dB. This suggests that a larger context is more helpful for speech enhancement in low SNR conditions. Further, inserting the attention module into the network consistently improves objective scores, with relatively larger improvements at −5 dB. In summary, objective scores are improved by progressively adding all three components of context aggregation to the model, and most of the improvements are obtained at −5 dB.

Next, we change the dilated convolutions to normal convolutions and observe that objective scores either improve or remain similar. This suggests that using dilated convolutions along with attention would be redundant, since attention can utilize the maximum available context. Thus we can expect that m = 1 with attention should be sufficient for context aggregation. However, we find that reducing m from 2 to 1 degrades performance. Therefore, context aggregation using the attention module along with some context from normal convolution is important for optimal results. Also, we find m = 3 to be worse than m = 2 (not reported here). A similar behavior is observed for non-causal models, where m is set to 3 instead of 2 to maintain symmetry in context from the past and future.

B. Loss Comparisons
This section analyzes different loss functions and illustrates the advantages of the proposed L_PCM. First, we reveal the inconsistent SNR improvement issue with L_SM. Causal and non-causal DCN are trained using L_T, L_SM, L_TF, and L_PCM, and average STOI and PESQ scores over the two test noises at five SNRs are plotted in Fig. 8. We observe that L_SM, L_TF, and L_PCM obtain similar STOI scores, and they are better than L_T. L_T, L_TF, and L_PCM obtain similar SNR scores, whereas L_SM obtains similar SNR for the causal system but significantly worse SNR (even worse than unprocessed) for the non-causal system. We find that the SNR improvement of L_SM is sensitive to the learning rate, initialization, and model architecture, i.e., it is not consistent. We also find that both L_TF and L_PCM obtain consistent SNR improvement similar to L_T, suggesting that L_TF and L_PCM can solve this issue without compromising STOI and PESQ scores.

Next, we evaluate the effects of α in L_TF. Average STOI, PESQ, and SNR scores of a dilation-based model [26] are plotted in Fig. 9 over the two test noises at the same five SNRs. We use six α values ranging from 0 to 1. We can notice that for α < 1, STOI and PESQ scores are similar. For α = 1, which corresponds to L_T, STOI and PESQ results are worse. Similarly, SNR scores are similar for α > 0 and worse for α = 0, which corresponds to L_SM. These observations suggest that as long as L_SM is included in training, better STOI and PESQ results are obtained. Similarly, as long as L_T is included in training, a consistent improvement in SNR is obtained.

We provide enhanced speech samples at https://web.cse.ohio-state.edu/~wang.77/pnl/demo/PandeyDCN.html. The artifact is observed with L_SM and L_TF, but not with L_T and L_PCM. These comparisons suggest that L_TF can solve the inconsistent SNR issue but is not able to remove the artifact. Fig. 8 suggests that the proposed L_PCM improves SNR consistently and obtains STOI and PESQ similar to L_SM. As shown in Fig. 6, the PCM loss removes the buzzing artifact present in the SM and TF losses.

C. Comparison with Baselines
In this section, we present results to demonstrate the superiority of DCN over different approaches. DCN is compared with a BLSTM for T-F masking [12], GRN [13] for spectral mapping, GCRN [19] for complex spectral mapping, and SEGAN [20], AECNN [24], and TCNN [25] for time-domain enhancement.

Fig. 8: STOI, PESQ, and SNR comparisons between different loss functions.
TABLE II: Performance comparisons between different configurations of dense block, dilation, and attention in DCN. Boldfaceindicates the best score in a given condition.
[Table II body not recoverable from the source: it reports STOI (%), PESQ, and SNR (dB) on babble and cafeteria noise at −5, 0, and 5 dB test SNRs for causal and non-causal DCN configurations that toggle m, dilation (Dil.), and attention (Att.). Unprocessed mixture scores: STOI 58.4/70.5/81.3 (babble) and 57.1/69.7/81.0 (cafeteria); PESQ 1.56/1.82/2.12 and 1.46/1.77/2.12.]
TABLE III: STOI (%) and PESQ comparisons between DCN and the baseline models of a) T-F masking, b) spectral mapping, c) complex spectral mapping, and d) time-domain enhancement. Scores are listed as −5/0/5 dB/Avg. for each test noise; cells marked "-" could not be recovered from the source.

Approach  Model          Causal?  Real-time?  STOI Babble          STOI Cafeteria       PESQ Babble          PESQ Cafeteria
          Mixture        -        -           58.4/70.5/81.3/70.1  57.1/69.7/81.0/69.2  1.56/1.82/2.12/1.83  1.46/1.77/2.12/1.78
a)        BLSTM [12]     No       No          77.4/85.8/91.0/84.7  76.1/84.7/90.5/83.7  1.97/2.37/2.69/2.34  2.01/2.38/2.51/2.30
b)        GRN [13]       No       No          80.2/88.9/93.4/87.5  79.4/88.0/92.9/86.8  2.16/2.63/2.97/2.59  2.23/2.62/2.96/2.60
c)        GCRN [19]      Yes      Yes         82.4/90.9/94.8/89.4  79.1/89.3/94.0/87.5  2.17/2.70/3.07/2.65  2.10/2.60/2.99/2.56
          NC-GCRN [19]   No       No          87.0/93.0/95.6/91.9  84.1/91.7/95.1/90.3  2.53/2.96/3.25/2.91  2.40/2.85/3.17/2.81
d)        SEGAN-T [20]   Yes      No          81.5/90.3/94.1/88.6  79.8/89.5/93.5/87.6  2.11/2.62/2.97/2.57  2.15/2.61/2.94/2.57
          AECNN-SM [24]  Yes      No          82.6/91.5/95.1/89.7  81.1/90.7/94.5/88.8  2.21/2.80/3.17/2.73  2.23/2.76/3.12/2.70
          TCNN [25]      Yes      Yes         82.8/91.3/94.8/89.6  80.6/89.8/94.0/88.1  2.18/2.70/3.06/2.65  2.14/2.62/2.98/2.58
          DCN-T          Yes      Yes         -/-/-/-              -/-/-/-              -/-/-/-              -/-/-/-
          DCN-SM         Yes      Yes         85.2/-/-/-           -/-/-/-              -/-/-/-              -/-/-/-
          DCN-PCM        Yes      Yes         85.1/-/-/-           -/-/-/-              -/-/-/-              -/-/-/-
          NC-DCN-T       No       No          87.9/93.5/96.1/92.5  85.0/92.1/95.3/90.8  2.61/3.04/3.33/2.99  2.45/2.91/3.23/2.86
          NC-DCN-SM      No       No          -/-/-/-              -/-/-/-              -/-/-/-              -/-/-/-
          NC-DCN-PCM     No       No          89.0/-/-/-           -/-/-/-              -/-/-/-              -/-/-/-
Fig. 9: Performance of L_TF with different α values.

In our results, we call a system real-time if it is causal and uses a frame size less than or equal to 32 ms, which is a general setting for real-time enhancement algorithms. The STOI and PESQ scores over the two test noises are given in Table III. We denote non-causal DCN as NC-DCN and non-causal GCRN as NC-GCRN. DCN trained with L_X is denoted as DCN-X.

First, we observe that a frame-based model with m = 1, no dilation, and no attention (Table II) outperforms BLSTM-based T-F masking on average. BLSTM is slightly better at −5 dB SNR. Note that BLSTM is a non-causal system that utilizes a whole utterance for the enhancement of one frame. This suggests that, even without any context information, the proposed model is a highly effective network for speech enhancement in the time domain.

Further, using m = 2 with causal convolution makes it significantly better than the spectral mapping based non-causal GRN and the time-domain SEGAN, which is a causal network but uses a frame size of 1 second, and hence is not real-time. It is also similar to or better than the complex spectral mapping based causal GCRN in all cases but babble noise at −5 dB. Similarly, using m = 3 with non-causal convolution makes it comparable to NC-GCRN, which is the best performing network among the baseline models. This implies that the proposed network can outperform all the baselines without any dilation and attention. Also, these comparisons are made with the proposed network trained with L_T; training with L_PCM would obtain even better performance over the baselines.

Additionally, Table III reports STOI and PESQ numbers for DCN-T, DCN-SM, and DCN-PCM. We can see that DCN-SM and DCN-PCM obtain similar scores, which are better than DCN-T in all cases except babble at −5 dB, where scores are similar for all three losses. Finally, we compare DCN-PCM, the best real-time version, with other real-time baselines. For real-time systems, TCNN is the best baseline, and DCN, on average, is better than TCNN in both STOI and PESQ. Similarly, we compare NC-DCN-PCM with NC-GCRN, the best non-causal baseline system. NC-DCN, on average, outperforms NC-GCRN in both STOI and PESQ.
The attention mechanism in DCN is meant to focus on the frames of an utterance that can aid speech enhancement. In this section, we plot the attention scores of Eq. (13) for non-causal and causal DCN. Attention scores for a sample utterance from the last layer of the encoder of DCN are plotted in Fig. 10 and Fig. 11. The horizontal axis represents the frame of interest, and the vertical axis represents the frames over which a given frame attends. The spectrogram on top shows the noisy speech, and the one on the right shows the clean speech.

Fig. 10: Attention map of a sample utterance with non-causal DCN.

For non-causal DCN, we observe that most of the attention is paid to the harmonic structure, i.e., to voiced speech, in the middle of the utterance. Also, there is some attention to two high-frequency sounds towards the end of the utterance.

For causal DCN, since frames in the future are not available, the attention on voiced sounds is shifted to earlier frames. For high-frequency sounds, the two sounds towards the end of the utterance that are used in the non-causal case are not available, and hence the attention is shifted to earlier high-frequency sounds. Also, attention in causal DCN is sharper than that in non-causal DCN.

Fig. 11: Attention map of the same utterance as in Fig. 10 with causal DCN.

VII. CONCLUDING REMARKS
In this study, we have proposed a novel dense convolutional network with self-attention for speech enhancement in the time domain. The proposed DCN is based on an encoder-decoder structure with skip connections. The encoder and the decoder each consist of dense blocks and attention modules that enhance feature extraction using a combination of feature reuse, increased depth, and maximum context aggregation. We have evaluated different configurations of DCN, and found that the attention mechanism in conjunction with a normal convolution with a small receptive field, i.e., no dilation, is helpful for time-domain enhancement. We have developed causal and non-causal DCN, and have shown that DCN substantially outperforms existing approaches to speaker- and noise-independent speech enhancement.

We have revealed some of the existing problems with a spectral magnitude based loss. Even though the magnitude based loss obtains better objective intelligibility and quality scores, it is inconsistent in terms of SNR improvement, and introduces an unknown artifact in enhanced utterances. We have proposed a new phase constrained magnitude loss that combines two losses over the STFT magnitudes of the enhanced speech and the predicted noise. The PCM loss solves the SNR and artifact issues while maintaining the improvements in objective scores.

By visualizing attention maps, we have found that most of the attention seems to be paid to voiced segments and some high-frequency regions. Further, the attended regions appear different for causal and non-causal DCN, and attention is relatively sharper for causal speech enhancement.

DCN is trained on the WSJ corpus and evaluated on untrained WSJ speakers. We have recently revealed that DNN-based speech enhancement fails to generalize to untrained corpora, and better performance on a trained corpus does not necessarily lead to better performance on untrained corpora [49], [50]. For future research, we plan to evaluate DCN on untrained corpora, and explore techniques to improve cross-corpus generalization.

REFERENCES
[1] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed. Boca Raton, FL, USA: CRC Press, 2013.
[2] Y. Wang and D. L. Wang, "Towards scaling up classification-based speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, pp. 1381-1390, 2013.
[3] D. L. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, pp. 1702-1726, 2018.
[4] Y. Wang, A. Narayanan, and D. L. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, pp. 1849-1858, 2014.
[5] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in ICASSP, 2015, pp. 708-712.
[6] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in INTERSPEECH, 2013, pp. 436-440.
[7] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, pp. 7-19, 2015.
[8] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in International Conference on Latent Variable Analysis and Signal Separation, 2015, pp. 91-99.
[9] J. Chen, Y. Wang, S. E. Yoho, D. L. Wang, and E. W. Healy, "Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises," The Journal of the Acoustical Society of America, vol. 139, pp. 2604-2612, 2016.
[10] S.-W. Fu, Y. Tsao, and X. Lu, "SNR-aware convolutional neural network modeling for speech enhancement," in INTERSPEECH, 2016, pp. 3768-3772.
[11] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," in INTERSPEECH, 2017, pp. 1993-1997.
[12] J. Chen and D. L. Wang, "Long short-term memory for speaker generalization in supervised speech separation," The Journal of the Acoustical Society of America, vol. 141, 2017.
[13] K. Tan, J. Chen, and D. L. Wang, "Gated residual networks with dilated convolutions for monaural speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 189-198, 2018.
[14] D. Wang and J. Lim, "The unimportance of phase in speech enhancement," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, 1982.
[15] D. S. Williamson, Y. Wang, and D. L. Wang, "Complex ratio masking for monaural speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 483-492, 2016.
[16] K. Paliwal, K. Wójcicki, and B. Shannon, "The importance of phase in speech enhancement," Speech Communication, vol. 53, pp. 465-494, 2011.
[17] S.-W. Fu, T.-y. Hu, Y. Tsao, and X. Lu, "Complex spectrogram enhancement by convolutional neural network with multi-metrics learning," in Workshop on Machine Learning for Signal Processing, 2017, pp. 1-6.
[18] A. Pandey and D. L. Wang, "Exploring deep complex networks for complex spectrogram enhancement," in ICASSP, 2019, pp. 6885-6889.
[19] K. Tan and D. L. Wang, "Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 380-390, 2019.
[20] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in INTERSPEECH, 2017, pp. 3642-3646.
[21] D. Rethage, J. Pons, and X. Serra, "A WaveNet for speech denoising," in ICASSP, 2018, pp. 5069-5073.
[22] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson, "Speech enhancement using Bayesian WaveNet," in INTERSPEECH, 2017, pp. 2013-2017.
[23] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, pp. 1570-1584, 2018.
[24] A. Pandey and D. L. Wang, "A new framework for CNN-based speech enhancement in the time domain," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 1179-1188, 2019.
[25] ——, "TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain," in ICASSP, 2019, pp. 6875-6879.
[26] ——, "Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain," in ICASSP, 2020, pp. 6629-6633.
[27] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 1256-1266, 2019.
[28] Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," in ICASSP, 2020, pp. 46-50.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017, pp. 5998-6008.
[30] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, "Self-attention generative adversarial networks," in ICML, 2019, pp. 7354-7363.
[31] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in ICASSP, 2018, pp. 5884-5888.
[32] Y. Zhao, D. L. Wang, B. Xu, and T. Zhang, "Monaural speech dereverberation using temporal convolutional networks with self attention," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1598-1607, 2020.
[33] R. Giri, U. Isik, and A. Krishnaswamy, "Attention Wave-U-Net for speech enhancement," in WASPAA, 2019, pp. 249-253.
[34] J. Kim, M. El-Khamy, and J. Lee, "T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement," in ICASSP, 2020, pp. 6649-6653.
[35] Y. Koizumi, K. Yatabe, M. Delcroix, Y. Masuyama, and D. Takeuchi, "Speech enhancement using self-adaptation and multi-head self-attention," in ICASSP, 2020, pp. 181-185.
[36] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, 2017, pp. 4700-4708.
[37] Y. Liu, B. Thoshkahna, A. Milani, and T. Kristjansson, "Voice and accompaniment separation in music using self-attention convolutional neural network," arXiv:2003.08954, 2020.
[38] A. Pandey and D. L. Wang, "A new framework for supervised speech enhancement in the time domain," in INTERSPEECH, 2018, pp. 1136-1140.
[39] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in CVPR, 2016, pp. 1874-1883.
[40] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv:1607.06450, 2016.
[41] A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in ICCV, 2015, pp. 1026-1034.
[43] A. Pandey and D. L. Wang, "On adversarial training and loss functions for speech enhancement," in ICASSP, 2018, pp. 5414-5418.
[44] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Workshop on Speech and Natural Language, 1992, pp. 357-362.
[45] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[46] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[47] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 2125-2136, 2011.
[48] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in ICASSP, 2001, pp. 749-752.
[49] A. Pandey and D. L. Wang, "On cross-corpus generalization of deep learning based speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, in press, 2020.
[50] ——, "Learning complex spectral mapping for speech enhancement with improved cross-corpus generalization," in INTERSPEECH, 2020.