Real-time Denoising and Dereverberation with Tiny Recurrent U-Net
Hyeong-Seok Choi, Sungjin Park, Jie Hwan Lee, Hoon Heo, Dongsuk Jeon, Kyogu Lee
Department of Intelligence and Information, Artificial Intelligence Institute, Seoul National University; Supertone Inc.
ABSTRACT
Modern deep learning-based models have seen outstanding performance improvement on speech enhancement tasks. The number of parameters of state-of-the-art models, however, is often too large to be deployed on devices for real-world applications. To this end, we propose Tiny Recurrent U-Net (TRU-Net), a lightweight online inference model that matches the performance of current state-of-the-art models. The size of the quantized version of TRU-Net is 362 kilobytes, which is small enough to be deployed on edge devices. In addition, we combine the small-sized model with a new masking method called phase-aware β-sigmoid mask, which enables simultaneous denoising and dereverberation. Results of both objective and subjective evaluations have shown that our model can achieve competitive performance with the current state-of-the-art models on benchmark datasets using fewer parameters by orders of magnitude.

Index Terms — real-time speech enhancement, lightweight network, denoising, dereverberation
1. INTRODUCTION
In this paper, we focus on developing a deep learning-based speech enhancement model for real-world applications that meets the following criteria: 1. a small and fast model that reduces the single-frame real-time factor (RTF) as much as possible while keeping competitive performance against state-of-the-art deep learning networks; 2. a model that can perform both denoising and dereverberation simultaneously.

To address the first issue, we aim to improve a popular neural architecture, U-Net [1], which has proven its superior performance on speech enhancement tasks [2, 3, 4]. Previous approaches that use U-Net for source separation apply the convolution kernel not only on the frequency-axis but also on the time-axis. This non-causal nature of U-Net increases computational complexity because additional computations are required on past and future frames to infer the current frame. Therefore, it is not suitable for online inference scenarios where the current frame needs to be processed in real-time. In addition, the time-axis kernel makes the network computationally inefficient because there exists redundant computation between adjacent frames in both the encoding and decoding paths of U-Net. To tackle this problem, we propose a new neural architecture, Tiny Recurrent U-Net (TRU-Net), which is suitable for online speech enhancement. The architecture is designed to enable efficient decoupling of the frequency-axis and time-axis computations, which makes the network fast enough to process a single frame in real-time. The number of parameters of the proposed network is only 0.38 million (M), which is small enough to deploy the model not only on a laptop but also on a mobile device, and even on an embedded device when combined with a quantization technique [5]. The details of TRU-Net are described in Section 2.

Next, to suppress noise and reverberation simultaneously, we propose a phase-aware β-sigmoid mask (PHM). The proposed PHM is inspired by [6], in which the authors propose to estimate phase by reusing an estimated magnitude mask value from a trigonometric perspective. The major difference between PHM and the approach in [6] is that PHM is designed to respect the triangular relationship between the mixture, the target source, and the remaining part, hence the sum of the estimated target source and the remaining part is always equal to the mixture. We extend this property to a quadrilateral by producing two different PHMs simultaneously, which allows us to effectively deal with both denoising and dereverberation. We will discuss PHM in further detail in Section 3.
2. TINY RECURRENT U-NET

Fig. 1. The network architecture of TRU-Net.
A spectrogram is perhaps the most popular input feature for many speech enhancement models. Per-channel energy normalization (PCEN) [7] combines both dynamic range compression and automatic gain control, which reduces the variance of foreground loudness and suppresses background noise when applied to a spectrogram [8]. PCEN is also suitable for online inference scenarios as it includes a temporal integration step, which is essentially a first-order infinite impulse response (IIR) filter that depends solely on the previous input frame. In this work, we employ the trainable version of PCEN.
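As an illustration, the PCEN computation can be sketched in a few lines of NumPy. The smoothing coefficient and compression constants below are common defaults from [7], not necessarily the values learned by the trainable version used in this work:

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a magnitude spectrogram.

    E: (T, F) non-negative spectrogram (time x frequency).
    The smoother M is a first-order IIR filter over time, so each output
    frame depends only on the current and past input frames, which is
    what makes PCEN usable for online inference.
    """
    M = np.zeros_like(E)
    M[0] = E[0]
    for t in range(1, len(E)):               # temporal integration (IIR)
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]
    # automatic gain control followed by dynamic range compression
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```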
TRU-Net is based on the U-Net architecture, except that the convolution kernel does not span the time-axis. Therefore, it can be considered a frequency-axis U-Net with 1D Convolutional Neural Networks (CNNs) and recurrent neural networks in the bottleneck layer. The encoder is composed of 1D Convolutional Neural Network (1D-CNN) blocks and a Frequency-axis Gated Recurrent Unit (FGRU) block. Each 1D-CNN block is a sequence of pointwise convolution and depthwise convolution similar to [9], except the first layer, which uses the standard convolution operation without a preceding pointwise convolution. To spare the network size, we use six 1D-CNN blocks, which downsample the frequency-axis size from 256 to 16 using strided convolutions. This results in a small receptive field (1,750 Hz), which may be detrimental to the network performance. To increase the receptive field, we use a bi-directional GRU layer [10] along the frequency-axis instead of stacking more 1D-CNN blocks. That is, the sequence of 16 vectors from the 1D-CNN blocks is passed into the bi-directional GRU to increase the receptive field and share information along the frequency-axis. We call this frequency-axis bi-directional GRU layer an FGRU layer. A pointwise convolution, batch normalization (BN), and rectified linear unit (ReLU) are used after the FGRU layer, composing an FGRU block. We used 64 hidden dimensions for each forward and backward FGRU cell.

The decoder is composed of a Time-axis Gated Recurrent Unit (TGRU) block and 1D Transposed Convolutional Neural Network (1D-TrCNN) blocks. The output of the encoder is passed into a uni-directional GRU layer to aggregate information along the time-axis. We call this GRU layer a TGRU layer. While one can apply different GRU cells to each frequency-axis index of the encoder output, we share the same cell across frequency-axis indices to save the number of parameters. A pointwise convolution, BN, and ReLU follow the TGRU layer, composing a TGRU block. We used 128 hidden dimensions for the TGRU cell. Finally, 1D-TrCNN blocks are used to upsample the output from the TGRU block to the original spectrogram size. The 1D-TrCNN block takes two inputs, a previous layer output and a skipped tensor from the encoder at the same hierarchy, and upsamples them as follows.
First, the two inputs are concatenated and projected to a smaller channel size (256 → 64) using a pointwise convolution. Then, a 1D transposed convolution is used to upsample the compressed information. This procedure saves both the number of parameters and computation compared to the usual U-Net implementation, where the two inputs are concatenated and upsampled immediately using the transposed convolution operation. Note that we did not use depthwise convolution for the 1D-TrCNN block, as we empirically observed that it drops the performance significantly when used in the decoding stage.

Every convolution operation used in the encoder and decoder is followed by BN and ReLU. We denote the convolution configurations as follows, $l$-th: $(\kappa, s, c)$, where $l$, $\kappa$, $s$, $c$ denote the layer index, kernel size, stride, and output channels, respectively. The detailed configurations of the encoder and decoder are as follows, EncoderConfig = { }, DecoderConfig = { }. Note that the pointwise convolution operations share the same output channel configuration, with the exception that $\kappa$ and $s$ are both 1. The overview of TRU-Net and the number of parameters used for the 1D-CNN blocks, FGRU block, TGRU block, and 1D-TrCNN blocks are shown in Fig. 1.
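The four block types can be sketched in PyTorch as follows. Since the exact EncoderConfig/DecoderConfig values are not reproduced above, the kernel sizes, strides, and channel widths below are illustrative placeholders, and the class names are ours:

```python
import torch
import torch.nn as nn

class Conv1dBlock(nn.Module):
    """Encoder block: pointwise conv, then a strided depthwise conv over
    the frequency axis (MobileNet-style factorization [9])."""
    def __init__(self, c_in, c_out, kernel=3, stride=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(c_in, c_out, 1),                     # pointwise
            nn.BatchNorm1d(c_out), nn.ReLU(),
            nn.Conv1d(c_out, c_out, kernel, stride,
                      padding=kernel // 2, groups=c_out),  # depthwise
            nn.BatchNorm1d(c_out), nn.ReLU())

    def forward(self, x):                # x: (batch, channels, freq)
        return self.net(x)

class FGRUBlock(nn.Module):
    """Bi-directional GRU over the 16 downsampled frequency positions,
    widening the receptive field without extra strided conv layers."""
    def __init__(self, c_in, hidden=64):
        super().__init__()
        self.gru = nn.GRU(c_in, hidden, batch_first=True, bidirectional=True)
        self.post = nn.Sequential(nn.Conv1d(2 * hidden, 2 * hidden, 1),
                                  nn.BatchNorm1d(2 * hidden), nn.ReLU())

    def forward(self, x):                # x: (batch, channels, freq)
        y, _ = self.gru(x.transpose(1, 2))   # sequence axis = frequency
        return self.post(y.transpose(1, 2))

class TGRUBlock(nn.Module):
    """Uni-directional GRU over time; one cell is shared by every
    frequency position and its hidden state is carried across frames."""
    def __init__(self, c_in, hidden=128):
        super().__init__()
        self.cell = nn.GRUCell(c_in, hidden)
        self.post = nn.Sequential(nn.Conv1d(hidden, hidden, 1),
                                  nn.BatchNorm1d(hidden), nn.ReLU())

    def forward(self, x, state):         # x: (batch, channels, freq)
        b, c, f = x.shape
        flat = x.transpose(1, 2).reshape(b * f, c)
        state = self.cell(flat, state)   # state: (batch * freq, hidden)
        y = state.reshape(b, f, -1).transpose(1, 2)
        return self.post(y), state

class TrConv1dBlock(nn.Module):
    """Decoder block: concatenate the skip connection, project channels
    down with a pointwise conv, then upsample with a transposed conv."""
    def __init__(self, c_in, c_mid, c_out, kernel=3, stride=2):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv1d(c_in, c_mid, 1),
                                  nn.BatchNorm1d(c_mid), nn.ReLU())
        self.up = nn.Sequential(
            nn.ConvTranspose1d(c_mid, c_out, kernel, stride,
                               padding=kernel // 2, output_padding=stride - 1),
            nn.BatchNorm1d(c_out), nn.ReLU())

    def forward(self, x, skip):
        return self.up(self.proj(torch.cat([x, skip], dim=1)))
```

Note that only TGRUBlock carries state across frames; every other block sees a single frame at a time, which reflects the decoupling of frequency-axis and time-axis computations described above.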
3. SINGLE-STAGE DENOISING AND DEREVERBERATION
A noisy-reverberant mixture signal $x$ is commonly modeled as the sum of additive noise $y^{(n)}$ and a reverberant source $\tilde{y}$, where $\tilde{y}$ is the result of a convolution between a room impulse response (RIR) $h$ and a dry source $y$ as follows,

$$x = \tilde{y} + y^{(n)} = h \circledast y + y^{(n)}. \quad (1)$$

More concretely, we can break $h$ down into two parts: first, the direct path part $h^{(d)}$, which does not include the reflection paths, and second, the rest of the part $h^{(r)}$, which includes all the reflection paths, as follows,

$$x = h^{(d)} \circledast y + h^{(r)} \circledast y + y^{(n)} = y^{(d)} + y^{(r)} + y^{(n)}, \quad (2)$$

where $y^{(d)}$ and $y^{(r)}$ denote the direct path source and the reverberation, respectively. In this setting, our goal is to separate $x$ into the three elements $y^{(d)}$, $y^{(r)}$, and $y^{(n)}$. The corresponding time-frequency $(t, f)$ representations computed by the short-time Fourier transform (STFT) are denoted as $X_{t,f} \in \mathbb{C}$, $Y^{(d)}_{t,f} \in \mathbb{C}$, $Y^{(r)}_{t,f} \in \mathbb{C}$, $Y^{(n)}_{t,f} \in \mathbb{C}$, and estimated values are denoted by the hat operator $\hat{\cdot}$.

3.1. Phase-aware β-sigmoid mask

The proposed phase-aware β-sigmoid mask (PHM) is a complex-valued mask which is capable of systemically restricting the sum of estimated complex values to be exactly the value of the mixture, $X_{t,f} = \hat{Y}^{(k)}_{t,f} + \hat{Y}^{(\neg k)}_{t,f}$. The PHM separates the mixture $X_{t,f}$ in the STFT domain into two parts in a one-vs-rest manner, that is, the signal $Y^{(k)}_{t,f}$ and the sum of the rest of the signals $Y^{(\neg k)}_{t,f} = X_{t,f} - Y^{(k)}_{t,f}$, where the index $k$ can be one of the direct path source ($d$), reverberation ($r$), and noise ($n$) in our setting, $k \in \{d, r, n\}$. The complex-valued mask $M^{(k)}_{t,f} \in \mathbb{C}$ estimates the magnitude and phase value of the source of interest $k$.

Computing the PHM requires two steps. First, the network outputs the magnitude parts of the two masks, $|M^{(k)}_{t,f}|$ and $|M^{(\neg k)}_{t,f}|$, with a sigmoid function $\sigma^{(k)}(z_{t,f})$ multiplied by a coefficient $\beta_{t,f}$ as follows,

$$|M^{(k)}_{t,f}| = \beta_{t,f} \cdot \sigma^{(k)}(z_{t,f}) = \beta_{t,f} \cdot \left(1 + e^{-(z^{(k)}_{t,f} - z^{(\neg k)}_{t,f})}\right)^{-1},$$

where $z^{(k)}_{t,f}$ is the output located at $(t, f)$ from the last layer of the neural-network function $\psi^{(k)}(\phi)$, and $\phi$ is a function composed of the network layers before the last layer. $|M^{(k)}_{t,f}|$ serves as a magnitude mask to estimate the source $k$, and its value ranges from 0 to $\beta_{t,f}$. The role of $\beta_{t,f}$ is to design a mask that is close to optimal values with a flexible magnitude range so that the values are not bounded between 0 and 1, unlike the commonly used sigmoid mask. In addition, because the sum of the complex-valued masks $M^{(k)}_{t,f}$ and $M^{(\neg k)}_{t,f}$ must compose a triangle, it is reasonable to design a mask that satisfies the triangle inequalities, that is, $|M^{(k)}_{t,f}| + |M^{(\neg k)}_{t,f}| \geq 1$ and $\left| |M^{(k)}_{t,f}| - |M^{(\neg k)}_{t,f}| \right| \leq 1$. To address the first inequality, we designed the network to output $\beta_{t,f}$ from the last layer with a softplus activation function as follows, $\beta_{t,f} = 1 + \mathrm{softplus}((\psi_{\beta}(\phi))_{t,f})$, where $\psi_{\beta}$ denotes an additional network layer to output $\beta_{t,f}$. The second inequality can be satisfied by clipping the upper bound of $\beta_{t,f}$ by $1 / |\sigma^{(k)}(z_{t,f}) - \sigma^{(\neg k)}(z_{t,f})|$.

Once the magnitude masks are decided, we can construct a phase mask $e^{j\theta^{(k)}_{t,f}}$. Given the magnitudes as three sides of a triangle, we can compute the cosine of the absolute phase difference $\Delta\theta^{(k)}_{t,f}$ between the mixture and the source $k$ as follows, $\cos(\Delta\theta^{(k)}_{t,f}) = (1 + |M^{(k)}_{t,f}|^2 - |M^{(\neg k)}_{t,f}|^2) \,/\, (2 |M^{(k)}_{t,f}|)$. Then, the rotational direction $\xi_{t,f} \in \{1, -1\}$ (clockwise or counterclockwise) for phase correction is estimated, and the phase mask is given as $e^{j\theta^{(k)}_{t,f}} = \cos(\Delta\theta^{(k)}_{t,f}) + j\,\xi_{t,f} \sin(\Delta\theta^{(k)}_{t,f})$. A two-class straight-through Gumbel-softmax estimator was used to estimate $\xi_{t,f}$ [11]. $M^{(k)}_{t,f}$ is then defined as $M^{(k)}_{t,f} = |M^{(k)}_{t,f}| \cdot e^{j\theta^{(k)}_{t,f}}$. Finally, $M^{(k)}_{t,f}$ is multiplied with $X_{t,f}$ to estimate the source $k$ as follows, $\hat{Y}^{(k)}_{t,f} = M^{(k)}_{t,f} \cdot X_{t,f}$.
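The two-step computation above can be summarized with the following NumPy sketch. The straight-through Gumbel-softmax sampling of $\xi_{t,f}$ is omitted here, and the rotational direction is taken as a given input:

```python
import numpy as np

def phm_pair(z_k, z_nk, beta_raw, xi, X):
    """Phase-aware beta-sigmoid masks for one source pair (k vs. rest).

    z_k, z_nk : last-layer logits for source k and its complement.
    beta_raw  : last-layer output of the beta head (psi_beta).
    xi        : rotational direction in {+1, -1}; the paper samples it
                with a straight-through Gumbel-softmax [11].
    X         : complex STFT of the mixture.
    """
    beta = 1.0 + np.log1p(np.exp(beta_raw))    # softplus; beta >= 1 ensures
                                               # |M_k| + |M_nk| = beta >= 1
    sig_k = 1.0 / (1.0 + np.exp(-(z_k - z_nk)))
    sig_nk = 1.0 - sig_k                       # sigma^(not k) by construction
    # second triangle inequality: ||M_k| - |M_nk|| <= 1
    beta = np.minimum(beta, 1.0 / (np.abs(sig_k - sig_nk) + 1e-8))
    m_k, m_nk = beta * sig_k, beta * sig_nk    # magnitude masks
    # law of cosines, with the mixture's own (identity) mask as unit side
    cos = np.clip((1.0 + m_k ** 2 - m_nk ** 2) / (2.0 * m_k + 1e-8),
                  -1.0, 1.0)
    sin = np.sqrt(1.0 - cos ** 2)
    M_k = m_k * (cos + 1j * xi * sin)          # complex-valued PHM
    Y_k = M_k * X
    return Y_k, X - Y_k                        # the pair always sums to X
```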
3.2. Masking from the perspective of a quadrilateral

Fig. 2. The illustration of the PHM masking method on a quadrilateral.

Because we desire to extract both the direct and the reverberant source, two pairs of PHMs are used, one for each. The first pair of masks, $M^{(d)}_{t,f}$ and $M^{(\neg d)}_{t,f}$, separates the mixture into the direct source and the rest of the components. The second pair of masks, $M^{(n)}_{t,f}$ and $M^{(\neg n)}_{t,f}$, separates the mixture into the noise and the reverberant source. Since the PHM guarantees that the mixture and the separated components construct a triangle in the complex STFT domain, the outcome of the separation can be seen from the perspective of a quadrilateral, as shown in Fig. 2. In this setting, as three sides and two side angles are already determined by the two pairs of PHMs, the fourth side of the quadrilateral, $M^{(r)}_{t,f}$, is uniquely decided.

3.3. Loss function

Recently, a multi-scale spectrogram (MSS) loss function has been successfully used in a few audio synthesis studies [12, 13]. We incorporate this multi-scale scheme not only in the spectral domain but also in the waveform domain, similar to [14].

Learning to maximize cosine similarity can be regarded as maximizing the signal-to-distortion ratio (SDR) [2]. The cosine similarity loss $C$ between the estimated signal $\hat{y}^{(k)} \in \mathbb{R}^N$ and the ground truth signal $y^{(k)} \in \mathbb{R}^N$ is defined as follows,

$$C(y^{(k)}, \hat{y}^{(k)}) = -\frac{\langle y^{(k)}, \hat{y}^{(k)} \rangle}{\lVert y^{(k)} \rVert \, \lVert \hat{y}^{(k)} \rVert},$$

where $N$ denotes the temporal dimensionality of a signal and $k$ denotes the type of signal ($k \in \{d, r, n\}$). Consider a sliced signal $y^{(k)}_{[\frac{N}{M}(i-1) : \frac{N}{M} i]}$, where $i$ denotes the segment index and $M$ denotes the number of segments. By slicing the signal and normalizing it by its norm, each sliced segment is considered a unit for computing $C$. Therefore, we hypothesize that it is important to choose a proper segment length unit $\frac{N}{M}$ when computing $C$. In our case, we used multiple settings of segment lengths $g_j = \frac{N}{M_j}$ as follows,

$$L^{(k)}_{\mathrm{wav}} = \sum_j \frac{1}{M_j} \sum_{i=1}^{M_j} C\left(y^{(k)}_{[g_j(i-1) : g_j i]},\, \hat{y}^{(k)}_{[g_j(i-1) : g_j i]}\right), \quad (3)$$

where $M_j$ denotes the number of sliced segments. In our case, the set of $g_j$'s was chosen as follows, g_j ∈ { , , , }.

Next, the multi-scale loss on the spectral domain is defined as follows,

$$L^{(k)}_{\mathrm{spec}} = \sum_i \left\lVert |\mathrm{STFT}_i(y^{(k)})|^{0.3} - |\mathrm{STFT}_i(\hat{y}^{(k)})|^{0.3} \right\rVert, \quad (4)$$

where $i$ denotes the FFT size of $\mathrm{STFT}_i$. The only difference from the original MSS loss is that we replaced the log transformation with power-law compression, as it has been successfully used in previous speech enhancement studies [15, 16]. We used FFT sizes of (1024, 512, 256) for the STFT, with 75% overlap. The final loss function is defined by adding all the components as follows, $L_{\mathrm{final}} = \sum_{k \in \{d, r, n\}} L^{(k)}_{\mathrm{wav}} + L^{(k)}_{\mathrm{spec}}$.
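Under the definitions above, the two loss terms can be sketched as follows. The segment lengths $g_j$, the compression exponent, and the norm choice are placeholders rather than confirmed values from the paper:

```python
import numpy as np
from scipy.signal import stft

def cosine_loss(y, y_hat, eps=1e-8):
    """Negative cosine similarity; maximizing it is a proxy for SDR [2]."""
    return -np.dot(y, y_hat) / (np.linalg.norm(y) * np.linalg.norm(y_hat) + eps)

def multiscale_wav_loss(y, y_hat, segment_lengths=(4096, 2048, 1024, 512)):
    """Eq. (3): segment-wise cosine loss averaged per scale, summed over
    scales. The segment lengths g_j here are illustrative placeholders."""
    total = 0.0
    for g in segment_lengths:
        n_seg = len(y) // g
        total += np.mean([cosine_loss(y[i * g:(i + 1) * g],
                                      y_hat[i * g:(i + 1) * g])
                          for i in range(n_seg)])
    return total

def multiscale_spec_loss(y, y_hat, fft_sizes=(1024, 512, 256), p=0.3):
    """Eq. (4): multi-scale spectrogram loss with power-law compression and
    75% overlap; the exponent p and the L1 norm are assumptions."""
    total = 0.0
    for n in fft_sizes:
        _, _, Y = stft(y, nperseg=n, noverlap=3 * n // 4)
        _, _, Y_hat = stft(y_hat, nperseg=n, noverlap=3 * n // 4)
        total += np.sum(np.abs(np.abs(Y) ** p - np.abs(Y_hat) ** p))
    return total
```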
4. EXPERIMENTS

4.1. Implementation details
Since our goal is to perform both denoising and dereverberation, we used pyroomacoustics [20] to simulate artificial reverberation with randomly sampled absorption, room size, source location, and microphone distance. We used 2-second speech and noise segments, and mixed them with a uniformly distributed signal-to-noise ratio (SNR) ranging from -5 dB to 25 dB. The input features were a channel-wise concatenation of the log-magnitude spectrogram, the PCEN spectrogram, and the real/imaginary parts of the demodulated phase. We used the AdamW optimizer [21], and the learning rate was halved when the validation score did not improve for three consecutive epochs. The initial learning rate was set to 0.0004. The window size and hop size were set to 512 (32 ms) and 128 (8 ms), respectively.

We also quantized the proposed model into the INT8 format and compared the model size with prior works. The purpose of our quantized model experiments is to reduce the model size and computational cost for embedded environments. We adopted the computation flow using quantized numbers suggested in [5] to quantize the neural network. In addition, the uniform symmetric quantization scheme [22], which uses uniform quantization and restricts the zero-point to 0, was applied for efficient hardware implementation. In the experiments, all the layers in the neural network are processed using quantized weights, activations, and inputs; only the bias values are represented in full precision. Other processing steps such as feature extraction and masking are computed in full precision. For the encoder and decoder layers, we observe the scale statistics of intermediate tensors during training. Then, during inference, we fix the scales of activations using the average of the observed minimum and maximum values. Only the GRU layers are dynamically quantized at inference time, due to the large dynamic range of their internal activations at each time step.
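The data simulation described above can be approximated with pyroomacoustics as follows. The room-geometry and absorption ranges are illustrative guesses rather than the paper's actual sampling distributions, and the API usage follows recent pyroomacoustics versions:

```python
import numpy as np
import pyroomacoustics as pra

def make_mixture(speech, noise, fs=16000):
    """Simulate a noisy-reverberant mixture roughly as in Sec. 4.1:
    random shoebox room, random source/mic placement, then additive
    noise at a uniformly sampled SNR in [-5, 25] dB."""
    room_dim = np.random.uniform([3.0, 3.0, 2.5], [10.0, 10.0, 4.0])
    room = pra.ShoeBox(room_dim, fs=fs,
                       materials=pra.Material(np.random.uniform(0.1, 0.6)),
                       max_order=17)
    src = np.random.uniform([0.5] * 3, room_dim - 0.5)
    mic = np.random.uniform([0.5] * 3, room_dim - 0.5)
    room.add_source(src, signal=speech)
    room.add_microphone(mic)
    room.simulate()
    reverberant = room.mic_array.signals[0, :len(speech)]

    snr_db = np.random.uniform(-5.0, 25.0)
    noise = noise[:len(reverberant)]
    gain = np.linalg.norm(reverberant) / (np.linalg.norm(noise)
                                          * 10 ** (snr_db / 20) + 1e-8)
    return reverberant + gain * noise
```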
4.2. Ablation study on the CHiME2 dataset

In order to confirm the effect of PCEN, the multi-scale objective, and the FGRU block, we trained and validated the model using the CHiME2 training set and development set, respectively. An ablation study was conducted on the CHiME2 test set. TRU-Net-A denotes the proposed method. TRU-Net-B denotes the model trained without the multi-scale objective. TRU-Net-C denotes the model trained without the PCEN feature. TRU-Net-D denotes the model trained without the FGRU block. We used the original SDR [23] to compare our model with other models. The results are shown in Table 2. It is clearly observable that all the proposed methods contribute to the performance improvement. Note that the FGRU block contributes significantly to the performance. We also compared the proposed model with other models using the CHiME2 test set. The proposed model showed better performance than not only the recent lightweight model TinyLSTM (TLSTM) and its pruned version (PTLSTM) [24], but also the large-sized model of [16].
4.3. Denoising evaluation on the DNS-challenge dataset

We further checked the denoising performance of our model by training it on the large-scale DNS-challenge dataset [25] and an internally collected dataset. It was tested on the two non-blind DNS development sets: 1) synthetic clips without reverb (Synthetic without Reverb) and 2) synthetic clips with reverb (Synthetic with Reverb). We compared our model with the recent models [3, 4, 17, 18, 19] submitted to the previous 2020 Interspeech DNS-challenge. Six evaluation metrics, PESQ [26], CBAK, COVL, CSIG [27], SI-SDR [28], and STOI [29], were used. Note that although it is recommended to use the ITU-T P.862.2 wide-band version of PESQ (PESQ2), a few studies reported their scores using ITU-T P.862.1 (PESQ1). Therefore, we used both PESQ versions to compare our model with other models. The results are shown in Table 1. We can see that TRU-Net shows the best performance in the Synthetic without Reverb set while having the smallest number of parameters. In the Synthetic with Reverb set, TRU-Net showed competitive performance using orders of magnitude fewer parameters than the other models.

Method | Size (M/MB) | RT | Synthetic without Reverb (PESQ1 / PESQ2 / CBAK / COVL / CSIG / SI-SDR / STOI) | Synthetic with Reverb (same order)
Noisy | – | – | 2.45 / 1.58 / 2.53 / 2.35 / 3.19 / 9.07 / 91.52 | 2.75 / 1.82 / 2.80 / 2.64 / 3.50 / 9.03 / 86.62
NSnet [17] | 1.27/4.84 | ✓ | – / 2.73 / 3.64 / 3.41 / 4.07 / – / – | – / 2.71 / – / – / – / – / –
PoCoNet1 [3] | 50/190.73 | ✗ | – / 2.71 / 3.02 / 3.29 / 3.85 / – / – | – / – / – / – / – / – / –
PoCoNet2 [3] | 50/190.73 | ✗ | – / – / – / – / – / – / – | – / 2.75 / 3.04 / 3.42 / 4.08 / – / –
DCCRN-E [4] | 3.7/14.11 | ✓ | – / – / – / – / – / – / – | – / – / – / – / – / – / –
TRU-Net (FP32) | 0.38/1.45 | ✓ | – / – / – / – / – / – / – | – / – / – / – / – / – / –
Table 1. Objective evaluation results on the DNS-challenge synthetic development sets. PoCoNet2 denotes the model with partial dereverberation described in [3], and PoCoNet1 is the model trained without it. We denote the network size (Size) in two aspects, the number of parameters in millions (M) and the actual model size in megabytes (MB). The models with real-time (RT) capability are marked with ✓, otherwise ✗.

Method | Size (M/MB) | -6 dB | -3 dB | 0 dB | 3 dB | 6 dB | 9 dB | Avg.
TLSTM (FP32) [24] | 0.97/3.70 | 10.01 | 11.54 | 13.08 | 14.23 | 15.85 | 17.46 | 13.70
PTLSTM (FP32) [24] | 0.52/1.97 | 10.07 | 11.59 | 13.10 | 14.31 | 15.89 | 17.50 | 13.74
PTLSTM (INT8) [24] | 0.61/0.58 | 9.82 | 11.37 | 12.91 | 14.20 | 15.74 | 17.44 | 13.58
PTLSTM (INT8) [24] | 0.33/0.31 | 9.33 | 10.91 | 12.46 | 13.79 | 15.46 | 17.16 | 13.18
Wilson et al. [16] | 65/247.96 | 12.17 | 13.44 | 14.70 | 15.83 | 17.30 | 18.78 | 15.37
TRU-Net-A (FP32) | 0.38/1.45 | – | – | – | – | – | – | –
TRU-Net-B (FP32) | 0.38/1.45 | 12.21 | 13.39 | 14.91 | 16.09 | 17.53 | 19.24 | 15.56
TRU-Net-C (FP32) | 0.38/1.45 | 11.96 | 13.24 | 14.69 | 15.97 | 17.47 | 19.18 | 15.42
TRU-Net-D (FP32) | 0.31/1.18 | 11.83 | 13.14 | 14.63 | 15.85 | 17.28 | 18.97 | 15.28
TRU-Net-A (INT8) | 0.38/0.36 | 12.35 | 13.62 | 15.03 | 16.18 | 17.62 | 19.30 | 15.68
TRU-Net-B (INT8) | 0.38/0.36 | 12.23 | 13.40 | 14.91 | 16.08 | 17.51 | 19.21 | 15.56
TRU-Net-C (INT8) | 0.38/0.36 | 11.96 | 13.20 | 14.64 | 15.94 | 17.42 | 19.11 | 15.38
TRU-Net-D (INT8) | 0.31/0.30 | 11.79 | 13.13 | 14.56 | 15.78 | 17.19 | 18.85 | 15.22

Table 2. Objective evaluation results (SDR in dB for each input SNR) on the CHiME2 test set.
4.4. Denoising and dereverberation on the WHAMR dataset

The performance of simultaneous denoising and dereverberation was tested on the min subset of the WHAMR dataset, which contains 3,000 audio files. The WHAMR dataset is composed of noisy-reverberant mixtures and the direct sources as ground truth. The TRU-Net models (FP32 and INT8) from Table 1 were used for the test. We show the denoising and dereverberation performance of our model in Table 3, along with two other models that were tested on the same WHAMR dataset. Our model achieved the best results compared to the other baseline models, which shows the parameter efficiency of TRU-Net on the simultaneous denoising and dereverberation task.
Method | Size (M/MB) | PESQ1 | SI-SDR | STOI
Noisy | – | 1.83 | -2.73 | 73.00
NSnet [17] | 1.27/4.84 | 1.91 | 0.34 | 73.02
DTLN [18] | 0.99/3.78 | 2.23 | 2.12 | 80.40
TRU-Net (FP32) | 0.38/1.45 | – | – | –
TRU-Net (INT8) | 0.38/0.36 | 2.49 | 3.03 | 80.56

Table 3. Objective evaluation of simultaneous denoising and dereverberation results on the WHAMR dataset.
4.5. 2021 ICASSP DNS Challenge

Using the proposed model (TRU-Net (FP32) in Table 1), we participated in Track 1 of the 2021 ICASSP DNS Challenge [25]. For better perceptual quality, we mixed the estimated direct source and reverberant source at 15 dB, and applied a zero-delay dynamic range compression (DRC). The average computation time to process a single frame (including FFT, iFFT, and DRC) was 1.97 ms and 1.3 ms on 2.7 GHz Intel i5-5257U and 2.6 GHz Intel i7-6700HQ CPUs, respectively. The lookahead of TRU-Net is 0 ms. A listening test was conducted based on ITU-T P.808. The results are shown in Table 4. The model was tested on various speech sets including singing voice, tonal language, non-English (including tonal), English, and emotional speech. The results show that TRU-Net achieves better performance than the baseline model, NSnet2 [30].
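The remixing step can be sketched as follows; the zero-delay DRC is left out since its parameters are not specified above:

```python
import numpy as np

def remix(direct_hat, reverb_hat, dr_ratio_db=15.0):
    """Mix the estimated direct source and reverberation at a fixed
    direct-to-reverberant energy ratio (15 dB in the challenge entry)."""
    scale = np.linalg.norm(direct_hat) / (np.linalg.norm(reverb_hat)
                                          * 10 ** (dr_ratio_db / 20) + 1e-8)
    return direct_hat + scale * reverb_hat
```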
Method | Size (M/MB) | Singing | Tonal | Non-English | English | Emotional | Overall
Noisy | – | 2.96 | 3.00 | 2.96 | 2.80 | 2.67 | 2.86
NSnet2 [30] | 2.8/10.68 | – | – | – | – | – | –
TRU-Net (FP32) | 0.38/1.45 | – | – | – | – | – | –

Table 4. MOS results on the DNS-challenge blind test set.
5. RELATION TO PRIOR WORKS
Recently, there has been increasing interest in phase-aware speech enhancement because of the sub-optimality of reusing the phase of the mixture signal. While most of these works tried to estimate the clean phase by using a phase mask or an additional network, the absolute phase difference between the mixture and a source can actually be computed using the law of cosines [31]. Inspired by this, [6] proposed to estimate the rotational direction of the absolute phase difference for speaker separation.

The FGRU and TGRU used in TRU-Net are similar to the work in [32]. They used bi-directional long short-term memory (bi-LSTM) networks on the frequency-axis and the time-axis combined with a 2D-CNN-based U-Net. The difference is that bi-LSTM was utilized to increase performance in [32], whereas we employ the FGRU and a uni-directional TGRU to better handle the online inference scenario, combined with the proposed lightweight 1D-CNN-based (frequency-axis) U-Net.
6. CONCLUSIONS
In this work, we proposed TRU-Net, an efficient neural network architecture specifically designed for online inference applications. Combined with the proposed PHM, we successfully demonstrated single-stage denoising and dereverberation in real-time. We also showed that using PCEN and multi-scale objectives further improves the performance. Experimental results confirm that our model achieves comparable performance with state-of-the-art models having a significantly larger number of parameters. For future work, we plan to employ modern pruning techniques on an over-parameterized model to develop a big-sparse model, which may provide better performance than a small-dense model with the same number of parameters.

7. REFERENCES

[1] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. MICCAI, 2015, pp. 234–241.
[2] Hyeong-Seok Choi, Jang-Hyun Kim, Jaesung Huh, Adrian Kim, Jung-Woo Ha, and Kyogu Lee, "Phase-aware speech enhancement with deep complex U-Net," arXiv preprint arXiv:1903.03107, 2019.
[3] Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, and Arvindh Krishnaswamy, "PoCoNet: Better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss," in Proc. INTERSPEECH, 2020.
[4] Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, and Lei Xie, "DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement," in Proc. INTERSPEECH, 2020.
[5] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proc. CVPR, 2018, pp. 2704–2713.
[6] Zhong-Qiu Wang, Ke Tan, and DeLiang Wang, "Deep learning based phase reconstruction for speaker separation: A trigonometric perspective," in Proc. ICASSP, 2019, pp. 71–75.
[7] Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, and Rif A. Saurous, "Trainable frontend for robust and far-field keyword spotting," in Proc. ICASSP, 2017, pp. 5670–5674.
[8] Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello, "Per-channel energy normalization: Why and how," IEEE Signal Processing Letters, vol. 26, no. 1, pp. 39–43, 2018.
[9] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[10] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proc. EMNLP, 2014, pp. 1724–1734.
[11] Eric Jang, Shixiang Gu, and Ben Poole, "Categorical reparameterization with Gumbel-softmax," in Proc. ICLR, 2017.
[12] Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Neural source-filter-based waveform model for statistical parametric speech synthesis," in Proc. ICASSP, 2019, pp. 5916–5920.
[13] Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu, and Adam Roberts, "DDSP: Differentiable digital signal processing," in Proc. ICLR, 2020.
[14] Jian Yao and Ahmad Al-Dahle, "Coarse-to-fine optimization for speech enhancement," in Proc. INTERSPEECH, 2019, pp. 2743–2747.
[15] Hakan Erdogan and Takuya Yoshioka, "Investigations on data augmentation and loss functions for deep learning based speech-background separation," in Proc. INTERSPEECH, 2018, pp. 3499–3503.
[16] Kevin Wilson, Michael Chinen, Jeremy Thorpe, Brian Patton, John Hershey, Rif A. Saurous, Jan Skoglund, and Richard F. Lyon, "Exploring tradeoffs in models for low-latency speech enhancement," in Proc. IWAENC, 2018, pp. 366–370.
[17] Yangyang Xia, Sebastian Braun, Chandan K. A. Reddy, Harishchandra Dubey, Ross Cutler, and Ivan Tashev, "Weighted speech distortion losses for neural-network-based real-time speech enhancement," in Proc. ICASSP, 2020, pp. 871–875.
[18] Nils L. Westhausen and Bernd T. Meyer, "Dual-signal transformation LSTM network for real-time noise suppression," in Proc. INTERSPEECH, 2020.
[19] Yuichiro Koyama, Tyler Vuong, Stefan Uhlich, and Bhiksha Raj, "Exploring the best loss function for DNN-based low-latency speech enhancement with temporal convolutional networks," arXiv preprint arXiv:2005.11611, 2020.
[20] Robin Scheibler, Eric Bezzam, and Ivan Dokmanić, "Pyroomacoustics: A Python package for audio room simulation and array processing algorithms," in Proc. ICASSP, 2018, pp. 351–355.
[21] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar, "On the convergence of Adam and beyond," in Proc. ICLR, 2018.
[22] Raghuraman Krishnamoorthi, "Quantizing deep convolutional networks for efficient inference: A whitepaper," arXiv preprint arXiv:1806.08342, 2018.
[23] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[24] Igor Fedorov, Marko Stamenovic, Carl Jensen, Li-Chia Yang, Ari Mandell, Yiming Gan, Matthew Mattina, and Paul N. Whatmough, "TinyLSTMs: Efficient neural speech enhancement for hearing aids," in Proc. INTERSPEECH, 2020.
[25] Chandan K. A. Reddy, Harishchandra Dubey, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, and Sriram Srinivasan, "ICASSP 2021 deep noise suppression challenge," arXiv preprint arXiv:2009.06122, 2020.
[26] ITU-T Recommendation, "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," Rec. ITU-T P.862, 2001.
[27] Philipos C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2013.
[28] Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R. Hershey, "SDR – half-baked or well done?," in Proc. ICASSP, 2019, pp. 626–630.
[29] Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," in Proc. ICASSP, 2010, pp. 4214–4217.
[30] Sebastian Braun and Ivan Tashev, "Data augmentation and loss normalization for deep noise suppression," in Proc. International Conference on Speech and Computer, 2020, pp. 79–86.
[31] Pejman Mowlaee, Rahim Saeidi, and Rainer Martin, "Phase estimation for signal reconstruction in single-channel source separation," in Proc. INTERSPEECH, 2012.
[32] Tomasz Grzywalski and Szymon Drgas, "Using recurrences in time and frequency within U-Net architecture for speech enhancement," in Proc. ICASSP, 2019.