VACE-WPE: Virtual Acoustic Channel Expansion Based On Neural Networks for Weighted Prediction Error-Based Speech Dereverberation
Joon-Young Yang and Joon-Hyuk Chang , Senior Member, IEEE
Abstract—Speech dereverberation is an important issue for many real-world speech processing applications. Among the techniques developed, the weighted prediction error (WPE) algorithm, which blindly cancels out the late reverberation component from the reverberant mixture of microphone signals, has been widely adopted and advanced over the last decade. In this study, we extend the neural-network-based virtual acoustic channel expansion (VACE) framework for WPE-based speech dereverberation, a variant of the WPE that we recently proposed to enable the use of the dual-channel WPE algorithm in a single-microphone speech dereverberation scenario. Building on the previous study, ablation studies are conducted on the constituents of the VACE-WPE in an offline processing scenario. These studies help in understanding the dynamics of the system, thereby simplifying the architecture and leading to the introduction of new strategies for training the neural network for the VACE. Experimental results in noisy reverberant environments reveal that the VACE-WPE considerably outperforms its single-channel counterpart in terms of objective speech quality and is complementary to the single-channel WPE when employed as the front-end for a far-field automatic speech recognizer.
Index Terms—Speech dereverberation, weighted prediction error, deep neural network, single microphone, offline processing.
I. INTRODUCTION
Speech signals traveling in an enclosed space encounter walls, floor, ceiling, and other obstacles present in the room, creating multiple reflections of the source image. Hence, when they are captured by a set of microphones at a distance, the delayed and attenuated replicas of the sound source appear as the so-called reverberation component of the microphone observations. The reverberation component can be considered a composition of the early reflections and late reverberation [1]. In particular, the former is known to change the timbre of the source speech yet helps improve intelligibility [2], whereas the latter degrades the perceptual listening quality and deteriorates the performance of speech and speaker recognition applications [3]–[5]. One of the most popular approaches for speech dereverberation is to exploit the multi-channel linear prediction (MCLP) technique to model the late reverberation component and subsequently cancel it out from the microphone observations. Specifically, in [6], the delayed linear prediction (LP) model was adopted to estimate the late reverberation, whose model parameters are obtained via iterative maximization of the likelihood function defined under the assumption that the dereverberated speech signal follows a complex normal distribution with time-varying variance. This method is referred to as the weighted prediction error (WPE) algorithm, and both the time- and short-time Fourier transform (STFT) domain implementations were presented in [6]; the latter is usually preferred to the former owing to its computational efficiency.

(The authors are with Hanyang University, Seoul, 04763, Korea; e-mail: [email protected].)

Several variants of the WPE algorithm and MCLP-based speech dereverberation methods have been proposed over the past decade.
In [7], a generalized version of the WPE algorithm [6] was derived via the introduction of a new cost function that measures the temporal correlation within the sequence of dereverberated samples. In [8], log-spectral domain priors based on Gaussian mixture models were introduced into the procedure for estimating the power spectral density (PSD) of the dereverberated speech signal. The STFT coefficients of the dereverberated speech were modeled using the Laplacian distribution in [9], whereas a more general sparse prior, the complex generalized Gaussian (CGG) [10], was adopted in [11]. More recently, Student's t-distribution was employed as the prior of the desired signal, and the LP filter coefficients were subjected to probabilistic Bayesian sparse modeling with a Gaussian prior [12].

Another branch of WPE variants integrates deep neural networks (DNNs) into the WPE-based speech dereverberation framework. In [13], a DNN was trained to estimate the PSD of the early arriving speech components, which substituted the iterative PSD estimation routine of the conventional WPE algorithm [6]. It was shown in [14] that such a DNN for supporting the WPE algorithm can be trained in an unsupervised manner (i.e., without requiring parallel data for supervision) by performing an end-to-end optimization of ℓ-norm-based cost functions involving the relevant signals. Moreover, the DNN-supported WPE [13] was subjected to an end-to-end joint optimization with a DNN-based acoustic model for robust speech recognition [15]. Unlike [13], an autoencoder DNN trained on clean speech was used to constrain the estimated PSD to have characteristics similar to those of clean speech in a learned feature space [16].
Meanwhile, a DNN was employed to estimate the shape parameter of the CGG source prior [17], which provides a more flexible form of the WPE algorithm proposed in [11].

A common observation underlying the abovementioned studies [11]–[16] is that the multi-channel WPE algorithm is generally superior to its single-channel counterpart. Inspired by this, we previously proposed the virtual acoustic channel expansion (VACE) technique for the WPE [18], a variant of the WPE designed to utilize the dual-channel WPE algorithm in a single-microphone speech dereverberation scenario. Specifically, the neural WPE [13] is assisted by another neural network that generates a virtual signal from an actual single-channel observation, whereby the pair of actual and virtual signals is directly consumed by the dual-channel neural WPE algorithm. The neural network for the virtual signal generation, termed the VACENet, is first pre-trained and then fine-tuned to produce the dereverberated signal via the actual output channel of the dual-channel neural WPE.

This article is an extension of [18], which aims to provide a more comprehensive understanding of the VACE-WPE based on empirical evaluation results obtained via sets of experiments, each of which is designed to investigate the dynamics of the VACE-WPE with respect to the various system constituents. The limitations of the previous study [18] are listed below:

• The VACE-WPE system in [18] was designed rather ad hoc, and the dynamics of the system were not sufficiently investigated.

• Because [18] is essentially a feasibility study, the experiments were conducted only in noiseless reverberant conditions, which is practically unrealistic.

Accordingly, the contribution of this article is two-fold:

• Ablation studies are conducted with regard to the system components of the VACE-WPE, which helps in understanding the characteristics of the VACE-WPE and further leads to an overall performance improvement.
• Experimental results in noisy reverberant environments are provided, which demonstrate that the VACE-WPE is significantly superior to the single-channel WPE in achieving better objective speech quality, while the two are complementary to each other as front-ends for the reverberant speech recognition task.

II. OVERVIEW OF THE VACE-WPE
A. Signal Model
Suppose that a speech source signal is captured by D microphones in a reverberant enclosure. In the STFT domain, the observed signal impinging on the d-th microphone can be approximated as follows [6], [7]:

X_{t,f,d} = \sum_{\tau=0}^{l-1} h^{*}_{\tau,f,d} S_{t-\tau,f} + V_{t,f,d},  (1)

where S_{t,f} and V_{t,f,d} denote the STFT-domain representations of the source speech and the noise observed at the d-th microphone, respectively; the superscript * denotes the complex conjugate operation, and h_{t,f,d} represents the room impulse response (RIR) from the source to the d-th microphone, whose duration is l. Further decomposing the speech term into the early arriving component (i.e., the direct path plus the early reflections) and the late reverberation [6] provides

X_{t,f,d} = \sum_{\tau=0}^{\Delta-1} h^{*}_{\tau,f,d} S_{t-\tau,f} + \sum_{\tau=\Delta}^{l-1} h^{*}_{\tau,f,d} S_{t-\tau,f} + V_{t,f,d}  (2)
        = X^{(early)}_{t,f,d} + X^{(late)}_{t,f,d} + V_{t,f,d},  (3)

where \Delta denotes the STFT-domain time index that determines the duration of the RIR contributing to the early arriving speech component. Herein, the early arriving speech is assumed to be obtained upon convolution between the source speech and the RIR truncated up to 50 ms after the main peak. Accordingly, with the 64 ms Hann window and a hop size of 16 ms employed for the STFT analysis, \Delta is fixed to 3 (3 × 16 ms ≈ 50 ms).

B. Review of the WPE Algorithm

1) Iterative WPE:
Under the noiseless assumption that V_{t,f,d} = 0, ∀d, the late reverberation component X^{(late)}_{t,f,d} in Eq. (3) can be approximated by the delayed LP technique as follows [6]:

\hat{X}^{(late)}_{t,f,d} = \sum_{\tau=\Delta}^{\Delta+K-1} g^{H}_{\tau,f,d} X_{t-\tau,f}  (4)
                        = \tilde{g}^{H}_{f,d} \tilde{X}_{t-\Delta,f},  (5)

where g_{\tau,f,d} ∈ C^{D} represents the K-th order time-invariant LP filter coefficients for the output channel index d; X_{t,f} ∈ C^{D} represents the D-channel stack of the microphone input signal; \tilde{g}_{f,d} = [g^{T}_{\Delta,f,d}, ..., g^{T}_{\Delta+K-1,f,d}]^{T} ∈ C^{DK}; \tilde{X}_{t-\Delta,f} = [X^{T}_{t-\Delta,f}, ..., X^{T}_{t-(\Delta+K-1),f}]^{T} ∈ C^{DK}; and the superscripts T and H denote the transpose and Hermitian operations, respectively. Under the assumption that X^{(early)}_{t,f,d} is sampled from a complex normal distribution with a zero mean and time-varying variance λ_{t,f,d}, the objective of the WPE algorithm is to maximize the log-likelihood function [6], [7]:

\tilde{g}'_{f,d}, λ'_{t,f,d} = \arg\max_{\tilde{g}_{f,d}, λ_{t,f,d}} L_{f,d},  (6)
L_{f,d} = \mathcal{N}(\hat{X}^{(early)}_{t,f,d} = X_{t,f,d} - \tilde{g}^{H}_{f,d} \tilde{X}_{t-\Delta,f}; 0, λ_{t,f,d})  (7)

for d ∈ {1, 2, ..., D}. As this optimization problem has no analytic solution, \tilde{g}_{f,d} and λ_{t,f,d} are alternately updated via the following iterative procedure [6], [7]:

Step 1)  λ_{t,f} = (1/D) \sum_{d} [ 1/(2δ+1) \sum_{\tau=-δ}^{δ} |Z_{t+\tau,f,d}|^2 ],  (8)
Step 2)  R_f = \sum_{t} \tilde{X}_{t-\Delta,f} \tilde{X}^{H}_{t-\Delta,f} / λ_{t,f} ∈ C^{DK×DK},  (9)
         P_f = \sum_{t} \tilde{X}_{t-\Delta,f} X^{H}_{t,f} / λ_{t,f} ∈ C^{DK×D},  (10)
         G_f = R_f^{-1} P_f ∈ C^{DK×D},  (11)
Step 3)  Z_{t,f} = X_{t,f} - G_f^{H} \tilde{X}_{t-\Delta,f},  (12)

where Eq. (8) is obtained by further assuming that λ_{t,f,1} = λ_{t,f,2} = ... = λ_{t,f,D}, and δ is a term introduced to consider the temporal context between neighboring frames. G_f is a matrix whose d-th column is \tilde{g}_{f,d}, and Z_{t,f} is the D-channel stack of the dereverberated output signal, \hat{X}^{(early)}_{t,f,d}. In the first iteration, Z_{t,f} is initialized to X_{t,f}. It was revealed in [7] that the WPE algorithm described in Eqs. (8)-(12) can be derived as a special case of the generalized WPE, without enforcing the noiseless assumption.

Fig. 1. Block diagram of the VACE-WPE systems: (a) VACE-WPE [18] and (b) simplified VACE-WPE. The subscripts 1 and v denote the actual and virtual channel signals, respectively.
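For illustration, the iterative procedure of Eqs. (8)-(12) can be sketched for a single frequency bin in NumPy. This is a minimal sketch, not the authors' implementation: the function name and arguments are our own, the context parameter δ is taken as 0, and a small diagonal loading is added to R_f for numerical stability.

```python
import numpy as np

def wpe_one_bin(X, delay=3, K=10, iters=3, eps=1e-8):
    """Iterative WPE for one frequency bin, following Eqs. (8)-(12).

    X: (D, T) complex STFT frames of the D-channel observation.
    Returns Z: (D, T) dereverberated frames.
    """
    D, T = X.shape
    # Stacked delayed observations X~_{t-delay,f} of size (D*K, T), as in Eq. (5).
    Xt = np.zeros((D * K, T), dtype=complex)
    for k in range(K):
        tau = delay + k
        Xt[k * D:(k + 1) * D, tau:] = X[:, :T - tau]
    Z = X.copy()  # first iteration: Z is initialized to X
    for _ in range(iters):
        # Step 1: channel-averaged time-varying variance, Eq. (8) with delta = 0.
        # (The simplified VACE-WPE of Eq. (19) would use only the actual channel.)
        lam = np.mean(np.abs(Z) ** 2, axis=0) + eps
        # Step 2: weighted correlation matrices and LP filters, Eqs. (9)-(11).
        Xn = Xt / lam                                    # scale each frame by 1/lambda_t
        R = Xn @ Xt.conj().T                             # Eq. (9), shape (DK, DK)
        P = Xn @ X.conj().T                              # Eq. (10), shape (DK, D)
        G = np.linalg.solve(R + eps * np.eye(D * K), P)  # Eq. (11)
        # Step 3: decorrelation, Eq. (12).
        Z = X - G.conj().T @ Xt
    return Z
```

With D = 1 the same code realizes the single-channel WPE; the VACE-WPE instead feeds it a (2, T) stack of the actual and virtual signals.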
2) Neural WPE:
Neural WPE [13] exploits a neural network to estimate the PSD of the dereverberated output signal, |Z_{t,f,d}|^2, as follows:

\ln |\hat{Z}_{t,f,d}|^2 = \mathcal{F}(\ln |X_d|^2; \Theta_{LPS}),  (13)

where \mathcal{F}(·; \Theta_{LPS}) denotes the neural network, parameterized by \Theta_{LPS}, that estimates the log-scale power spectra (LPS) of the dereverberated signal in a channel-independent manner; the time-frequency (T-F) indices are dropped in X_d, as neural networks often consume multiple T-F units within a context as the input. Accordingly, Eq. (8) can be rewritten as follows:

λ_{t,f} = (1/D) \sum_{d} |\hat{Z}_{t,f,d}|^2.  (14)

For the rest of this paper, we denote the neural network for the PSD estimation, \mathcal{F}(·; \Theta_{LPS}), as the LPSNet [18], as it operates in the LPS domain of the relevant signals.

C. VACE-WPE System Description

1) Overview:
The entire VACE-WPE system [18] consists of two separate modules: the VACE module, which is responsible for the generation of the virtual signal, and the dual-channel neural WPE, which operates in exactly the same manner as described in Eqs. (9)-(14) for D = 2. To build the complete VACE-WPE system, the LPSNet is trained to estimate the LPS of the early arriving speech given the reverberant observation, and the VACENet is pre-trained under a certain predefined criterion. These two steps are independent of each other and thus can be performed in parallel. Subsequently, the VACE-WPE system is constructed as depicted in Fig. 1, and the VACENet is fine-tuned to produce the dereverberated signal at the output channel corresponding to the actual microphone. During the fine-tuning, the LP order is fixed to K = K_trn, and the parameters of the LPSNet are frozen.

Fig. 2. Four different VACENet architectures for modeling the RI components: (a) VACENet-a, (b) VACENet-b, (c) VACENet-c, and (d) VACENet-d. The input and output feature maps are represented in (time, frequency, channel) format, and the numbers above the rectangles denote the number of channels.
2) Architecture of the VACENet:
Similar to our previous study [18], we used the U-Net [19] as the backbone architecture of the VACENet, whose input and output representations are the real and imaginary (RI) components of the STFT coefficients of the actual and virtual signals, respectively. Unlike [18], four different architectures of the VACENet are considered in this study, each of which differs in whether a shared or a separate stream is used for the convolutional encoder and decoder. Fig. 2 illustrates the four distinctive VACENet architectures, denoted as VACENet-{a, b, c, d}. First, all the models consume both of the RI components as the input to the encoder stream, whether it is separated or not, which is intended to fully exploit the information residing in the pair of RI components. Second, VACENet-{a, c} use a shared decoder stream to model the RI components of the virtual signal, whereas VACENet-{b, d} split the decoder stream into two to separately model each attribute of the RI components. As shown in Fig. 2, the difference between VACENet-b and VACENet-d lies in whether the separated decoder streams share the bottleneck feature, as well as the encoder feature maps for the skip connections. Meanwhile, VACENet-c can be considered a more flexible version of VACENet-a, as it splits the encoder stream into two separate streams and thus doubles the number of skip connections originating from the encoder module.

In each subfigure of Fig. 2, the rectangles denote the feature maps, whose height and width represent their relative size and depth, respectively, and the numbers above the rectangles are the channel sizes of the feature maps. Each of the wide arrows denotes a 2D convolution (Conv2D) with a kernel size of 3, and ⊕ denotes the concatenation of feature maps along the channel axis. Every downsampling or upsampling operation is performed by either a strided Conv2D or a transposed Conv2D with a stride size of 2, and 1 × 1 convolutions are used in the bottleneck and last layers of the network. A gated linear unit [20] was used instead of a simple convolution followed by an activation function, except for the downsampling and upsampling layers. Lastly, to make fair comparisons between the different model structures, we designed each model to have a similar total number of parameters, as shown in Table I.

TABLE I: MODEL SIZE OF THE DIFFERENT VACENET ARCHITECTURES

A similar investigation regarding the model architecture was conducted in [21] for the speech enhancement task, where a structure analogous to that depicted in Fig. 2-(b) was shown to be effective. In contrast, it was mentioned in [22] that separately handling each RI component is beneficial. Because the task at hand, and hence the role of the VACENet, is fundamentally different from that of the neural networks adopted for speech enhancement [21], [22], we argue that it is worthwhile to examine which architecture is more appropriate for the VACE task.
3) Loss Function:
Two types of loss functions, namely the frequency-domain loss and the time-domain loss, are defined to train the VACENet [18]:

L_freq(A, B) = α · [MSE(A^r, B^r) + MSE(A^i, B^i)] + β · MSE(\ln|A|, \ln|B|),  (15)
L_time(a, b) = MAE(a, b),  (16)
L(A, B) = L_freq(A, B) + γ · L_time(a, b),  (17)

where A and B are the STFT coefficients, and \ln|A| and \ln|B| are their log-scale magnitudes; a and b are the time-domain signals obtained by taking the inverse STFT of A and B, respectively; the superscripts r and i denote the RI components, respectively; α, β, and γ are scaling factors that weigh the losses defined in the different domains of the signal representations; and MSE(·,·) and MAE(·,·) compute the mean squared and mean absolute error between the inputs, respectively.

It is worth noting that α and β should be determined such that the values of α · [MSE(A^r, B^r) + MSE(A^i, B^i)] and β · MSE(\ln|A|, \ln|B|) are similar. When the former was considerably larger than the latter, severe checkerboard artifacts [23] appeared in the output signal of the network. In the opposite condition, fine-grained representations of the RI components of the output signal could not be obtained. γ was also set to make γ · L_time(a, b) have values similar to, or slightly smaller than, those of the two aforementioned terms.
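The composite loss of Eqs. (15)-(17) is straightforward to express in NumPy. The sketch below is illustrative: the default values of `alpha`, `beta`, and `gamma` are placeholders rather than the values used in the paper, and a small ε is added inside the logarithm for numerical safety.

```python
import numpy as np

def mse(u, v):
    # mean squared error between two real-valued arrays
    return np.mean((u - v) ** 2)

def vace_loss(A, B, a, b, alpha=1.0, beta=1.0, gamma=1.0, eps=1e-8):
    """Composite VACENet training loss, Eqs. (15)-(17).

    A, B: complex STFT coefficient arrays; a, b: the corresponding
    time-domain signals (i.e., inverse STFTs of A and B).
    """
    l_freq = alpha * (mse(A.real, B.real) + mse(A.imag, B.imag)) \
        + beta * mse(np.log(np.abs(A) + eps), np.log(np.abs(B) + eps))  # Eq. (15)
    l_time = np.mean(np.abs(a - b))                                     # Eq. (16), MAE
    return l_freq + gamma * l_time                                      # Eq. (17)
```

In practice, α and β would be balanced so that the RI and log-magnitude terms have similar magnitudes, and γ so that the time-domain term is similar or slightly smaller, as discussed above.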
4) Pre-training of the VACENet:
In this study, we consider two different pre-training strategies to initialize the VACENet. Suppose that the time-domain representations of the actual and virtual signals are denoted by x and x_v, respectively, and their STFT-domain counterparts by X and X_v, respectively. Then, the forward pass of the VACENet can be expressed as follows:

X_v = \mathcal{G}(X; \Theta_{VACE}),  (18)

where \mathcal{G}(·; \Theta_{VACE}) denotes the VACENet parameterized by \Theta_{VACE}. First, considering the observed signal as the input, the VACENet can be pre-trained to reconstruct the input signal itself [18] by minimizing the loss function L(X_v, X). Alternatively, we propose to pre-train the VACENet to estimate the late reverberation component of the input signal, denoted by X^(late), by minimizing L(X_v, X^(late)).

The rationale behind these pre-training strategies is rather simple and intuitive. Under the assumption that actual dual-channel speech recordings may not deviate significantly from each other, we employed the first method in [18], expecting the virtual signal to resemble the observed signal. However, the generated virtual signal was shown to have characteristics different from those of the observed signal [18], and the shape and scale of its waveform resembled those of the late reverberation component of the observed signal, as shown in Fig. 7 in Section IV-C. Accordingly, we suggest initializing the VACENet to produce the late reverberation component of the observed signal. For the rest of this paper, we denote the two pre-training strategies described above as PT-self and PT-late.
5) Fine-tuning of the VACENet:
As mentioned earlier, the VACENet is fine-tuned within the VACE-WPE architecture depicted in Fig. 1. The loss function is set to L(Z_1, X_1^(early)), where X_1^(early) denotes the early arriving speech component of the observed signal, X_1, and Z_1 is the output of the WPE algorithm on the actual channel side [18]; the virtual channel output, Z_v, is neglected.
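The fine-tuning signal flow can be summarized by the following sketch. Here `vacenet`, `wpe_dual`, and `loss_fn` are stand-ins for the trained network, the dual-channel neural WPE of Section II-B2, and the loss of Eq. (17); only the wiring is ours to illustrate, not the authors' code.

```python
import numpy as np

def vace_wpe_finetune_step(X1, X1_early, vacenet, wpe_dual, loss_fn):
    """One conceptual fine-tuning step of the VACE-WPE (Fig. 1).

    X1: complex STFT of the actual single-microphone observation.
    X1_early: STFT of its early arriving speech component (training target).
    """
    Xv = vacenet(X1)                        # virtual signal, Eq. (18)
    X_pair = np.stack([X1, Xv], axis=0)     # actual + virtual channels, D = 2
    Z = wpe_dual(X_pair)                    # dual-channel neural WPE output
    Z1 = Z[0]                               # keep the actual-channel output only
    return loss_fn(Z1, X1_early)            # L(Z1, X1_early); Zv is neglected
```

The gradient of this loss with respect to the VACENet parameters is what drives the fine-tuning; the LPSNet inside the WPE stays frozen.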
6) Simplification of the PSD Estimation Routine:
In addition to the architecture of the original VACE-WPE system [18], depicted in Fig. 1-(a), we propose the simplified VACE-WPE, depicted in Fig. 1-(b), by removing the contribution of the virtual signal to the PSD estimation routine expressed in Eq. (14). Accordingly, Eq. (14) can be rewritten as follows:

λ_{t,f} = |\hat{Z}_{t,f,1}|^2.  (19)

One of the motivations behind this modification is to take some burden away from the VACENet by reducing the dependency of the entire system on the model. In other words, if we consider WPE-based dereverberation as a two-stage process of early arriving speech PSD estimation (Eq. (13)) followed by decorrelation (Eqs. (9)-(12)), the VACENet in Fig. 1-(a) is expected to generate a virtual signal that contributes to both stages. In contrast, as the contribution of the virtual signal to the first stage is removed in Fig. 1-(b), the VACENet can concentrate more on the second stage. Further details regarding the simplified VACE-WPE system are provided in Section IV-B with the experimental results.

III. EXPERIMENTAL SETUP
A. On-the-fly Data Generator
To present as many random samples as possible to the neural networks during training, an on-the-fly data generator was used. Given the sets of clean speech utterances, RIRs, and noises, the data generator first randomly selects a speech utterance, an RIR, and a noise sample from each set. Then, the speech utterance is randomly cropped and subsequently convolved with the full-length RIR as well as the truncated RIR to create the reverberated speech and the early arriving speech, respectively. The noise sample is either cropped or duplicated to match the duration of the speech excerpt and added to both the reverberated and early arriving speech; the signal-to-noise ratio (SNR) is randomly chosen within a predefined range of integers.

TABLE II: PARAMETERS FOR RIR SIMULATION [24] BASED ON THE IMAGE METHOD [25]

Parameter                  | Medium          | Large
Room size (lower bound)    | [10 × × 2] m    | [30 × × 2] m
Room size (upper bound)    | [30 × × 5] m    | [50 × × 5] m
Duration                   | 1.0 s           | 2.0 s
Reflection order           | 10              | 10
Absorption coefficient     | [0.2, 0.8]      | [0.2, 0.8]
Source-receiver distance   | [1.0, 5.0] m    | [1.0, 5.0] m
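The generator logic described above can be sketched as follows. This is a simplified illustration: the function and parameter names are ours, the RIR truncation uses a plain 50 ms window after the argmax peak, and edge cases such as utterances shorter than the crop length are ignored.

```python
import numpy as np

def generate_example(speech_set, rir_set, noise_set, crop_len,
                     snr_range=(5, 15), sr=16000, early_ms=50, seed=None):
    """Draw one (reverberant, early-arriving) noisy training pair on the fly."""
    rng = np.random.default_rng(seed)
    s = speech_set[rng.integers(len(speech_set))]   # random clean utterance
    h = rir_set[rng.integers(len(rir_set))]         # random RIR
    v = noise_set[rng.integers(len(noise_set))]     # random noise sample
    # random crop of the speech excerpt
    t0 = rng.integers(0, max(1, len(s) - crop_len + 1))
    s = s[t0:t0 + crop_len]
    # truncate the RIR up to `early_ms` after its main peak
    peak = int(np.argmax(np.abs(h)))
    h_early = h[:peak + int(sr * early_ms / 1000)]
    x_rev = np.convolve(s, h)[:crop_len]            # reverberated speech
    x_early = np.convolve(s, h_early)[:crop_len]    # early arriving speech
    # crop or duplicate the noise to match the excerpt duration
    v = np.tile(v, int(np.ceil(crop_len / len(v))))[:crop_len]
    # add the noise at an integer SNR drawn from the predefined range
    snr = rng.integers(snr_range[0], snr_range[1] + 1)
    g = np.sqrt(np.sum(x_rev ** 2) / (np.sum(v ** 2) * 10.0 ** (snr / 10.0)))
    return x_rev + g * v, x_early + g * v
```

Adding the same scaled noise to both signals mirrors the description above, so the network's target differs from its input only in the late reverberation.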
B. Training Datasets

1) TrainSimuClean:
The clean speech utterances were taken from the "training" portion of the TIMIT [26] dataset, which comprises phonetically balanced English speech sampled at 16 kHz. After excluding the common-transcript utterances and filtering out those with durations of less than 2 s, we obtained 3,337 utterances from 462 speakers; the average duration of the training utterances was 3.21 s. The simulated RIRs in [24] were used for training; this RIR set is freely available and widely used in Kaldi's speech and speaker recognition recipes for data augmentation purposes [27]. A total of 16,200 medium-room and 5,400 large-room RIRs were randomly selected to construct a simulated RIR dataset for training, where we excluded the small-room RIRs to check whether the trained neural WPE variants can generalize well to small-room conditions at evaluation time. The parameters of the RIR simulation [25] are presented in Table II, and further details can be found in [24]. No additive noise samples were used in this dataset.
2) TrainSimuNoisy:
The modified LibriSpeech-80h dataset was used as the clean speech corpus, which is a subset of the LibriSpeech [28] corpus and was provided as part of the VOiCES Challenge 2019 dataset [4], [5]. It consists of read English speech sampled at 16 kHz, whose transcripts are derived from public domain audiobooks. As most of the speech samples contain considerable amounts of epenthetic silence regions, as well as silence at the beginning and end of the utterance, we employed an energy-based voice activity detector implemented in Kaldi [27] to trim the silence regions. The utterances whose duration was less than 2.8 s were filtered out after the silence removal. Consequently, we obtained 16,341 utterances from 194 speakers, with an average speech duration of 12.26 s. The simulated RIR dataset described in Section III-B1 was reused. As for the noise dataset, we used 58,772 audio samples in the DNS Challenge 2020 dataset [29], which contains audio clips selected from Google Audioset and Freesound. The dataset comprises 150 unique audio classes, including animal sounds, vehicular sounds, indoor and outdoor environment sounds originating from various things and daily supplies, music of different genres, and musical instruments.

Instead of directly feeding the raw clean speech samples to the neural network models during training, we set a limit on the dynamic range of the speech waveform amplitudes as described in the following. Suppose that x is a vector of the time-domain speech waveform amplitudes normalized to have values between -1 and 1. Then, the waveform amplitudes after applying a simple dynamic range control (DRC) scheme can be obtained as follows:

x_drc = x / (ā_max − ā_min) · r,  (20)

where ā_max and ā_min are the averages of the n largest and n smallest waveform amplitudes, respectively, and r is a constant for the DRC; n = 100 was used in this study.

TABLE III: SPECIFICATIONS OF THE REAL RIRS TAKEN FROM THE REVERB CHALLENGE DATASET [3]

Condition    | Duration | T60    | Recording distance
Small-near   | 1 s      | 0.25 s | 0.5 m
Small-far    |          |        |
Medium-near  |          |        |
Medium-far   |          |        |
Large-near   |          |        |
Large-far    |          |        |

C. Test Datasets

1) TestRealClean:
The "core test" set of the TIMIT [26] dataset was used as the clean speech corpus; no speakers or transcripts overlap with those of the TrainSimuClean dataset described in Section III-B1, and the average speech duration is 3.04 s. The entire set of utterances was randomly convolved with the real RIRs taken from the REVERB Challenge 2014 [3] dataset to create six unique test sets, each of which differs in the room size as well as the recording distance for the RIR measurement. Among the eight microphone channels [3], only the first and fifth channels were used to create the dual-channel test sets; these two channels were located on opposite sides of each other at a distance of 20 cm. The specifications of the real RIRs are presented in Table III. Similar to TrainSimuClean, TestRealClean contains no additive noise.
2) TestRealNoisy:
To create the TestRealNoisy dataset, the stationary air conditioner noise residing in each room [3], as well as the nonstationary babble and factory noise from the NOISEX-92 [30] dataset and the music samples from the MUSAN [31] dataset, were added to the TestRealClean dataset. To simulate test environments with various SNR levels, the noise samples were added to the reverberated speech with SNRs randomly chosen between 5 dB and 15 dB.

https://research.google.com/audioset
https://freesound.org

TABLE IV: LPSNET ARCHITECTURE ADOPTED AND MODIFIED FROM [32]

TABLE V: STRUCTURE OF THE DILATEDCONV1DBLOCK

Layer                                    | Kernel | Stride | Dilation
DilatedConv1D_k + ELU (for k = 1, ...)   | 3      | 1      | 2^k

D. LPSNet Specifications
We adopted the dilated convolutional network proposed in [32] as the LPSNet architecture, with a few modifications. Tables IV and V show the detailed architectures of the LPSNet and the DilatedConv1DBlock, respectively, where the latter serves as a building block for the former. In Table IV, "BN" is batch normalization [33], "ELU" is the exponential linear unit [34], and "Shortcut Sum" takes the summation of the outputs of the layers in the shaded rows. In Table V, a feature map is first processed by a stack of dilated Conv1D layers and another Conv1D layer, and further compressed to have values between 0 and 1 using the sigmoid function. This compressed representation is element-wise multiplied with the feature map fed to the DilatedConv1DBlock, thus working as an analogue to a T-F mask. Note that the input LPS features were also normalized using a trainable BN [33].

The LPSNet was trained for 65 epochs using the Adam optimizer [35], where the learning rate was halved after the 20th, 35th, 45th, and 55th epochs. Dropout regularization [36] was applied with a drop rate of 0.3 for every third mini-batch, and gradient clipping [37] was used to stabilize the training with a global norm threshold of 3.0. The weights of the LPSNet were also subject to ℓ2-regularization. The specifications regarding the mini-batch composition and the number of iterations defined for a single training epoch are presented in Table VI.

TABLE VI: HYPERPARAMETERS FOR TRAINING THE LPSNET MODELS

Dataset                   | TrainSimuClean  | TrainSimuNoisy
Mini-batch size, duration | 4, [2.0, 2.8] s | 6, [2.4, 2.8] s
SNR range                 | -               | [3, 20] dB

TABLE VII: HYPERPARAMETERS FOR TRAINING THE VACENET MODELS. p DENOTES THE DROPOUT RATE.

Dataset | TrainSimuClean                      | TrainSimuNoisy
Stage   | Pre-training       | Fine-tuning    | Pre-training | Fine-tuning
        | PT-self | PT-late  |                | PT-late      |
α       | 10      | 10       | 10             | 2            | 1
β       |         |          |                |              |
γ       | 20      | 20       | 20             | 10           | 5
p       |         |          |                |              |

E. VACENet Specifications

The architecture of the VACENet is basically the same as that of the U-Net [19], including the number of downsampling and upsampling operations and the positions of the concatenations between the encoder and decoder feature maps. Similar to the LPSNet, each attribute of the input RI components was normalized using a trainable BN [33]. In addition, the RI components of the output signal were de-normalized using pre-computed mean and variance statistics. Other details of the VACENet are described in Section II-C2 and Fig. 2.

The training of the VACENet was conducted in a manner similar to that described in Section III-D for training the LPSNet, employing the same on-the-fly mini-batching scheme presented in Table VI. Table VII shows the hyperparameters set during the pre-training and fine-tuning of the VACENet models, where the values of α, β, and γ were determined by monitoring the first few thousand iterations of the training. To make fair comparisons across the different VACE-WPE systems, all the VACENet models were trained for 60 epochs in both the pre-training and fine-tuning stages. In the pre-training stage, the learning rate was annealed by a factor of 0.2 after the 20th and 40th training epochs, and the fine-tuning stage used the same annealing schedule.

F. Evaluation Metrics
The dereverberation performance of the WPE algorithms was evaluated in terms of the perceptual evaluation of speech quality (PESQ) [38], cepstrum distance (CD), log-likelihood ratio, frequency-weighted segmental SNR (FWSegSNR) [39], and the non-intrusive normalized signal-to-reverberation modulation energy ratio (SRMR) [40]. For the metrics computation, the early arriving speech was used as the reference signal, except for the SRMR, which can be calculated from the processed signal itself.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, the experimental results and analysis ofthe VACE-WPE system are provided. The ablation studiesregarding the constituents of the VACE-WPE are providedfrom Section IV-A to IV-D; these studies are performedunder noiseless reverberant conditions; that is, the LPSNet andVACENet models are trained on
TrainSimuClean and evalu-ated on
TestRealClean . The rationale behind this design of
Fig. 3. Performance evaluation results of the VACE-WPE and baseline WPEalgorithms on
TestSimuClean . The horizontal axis denotes the LP filter order, K . The VACE-WPE employed the VACENet-b model pre-trained with thePT-self method, and was constructed as depicted in Fig. 1-(a). K trn was setto 10 during fine-tuning. experiments is that, by excluding any interferences other thanreverberation, it would be easier to observe how the differentsystem components of the VACE-WPE influence the operatingcharacteristics of the system as well as the realization of thevirtual signal. The results of noisy reverberant conditions andspeech recognition results on real recordings are provided inSection IV-E and Section IV-F, respectively.The baseline systems under comparison are the single- anddual-channel neural WPE algorithms, where the latter is fedwith actual dual-channel speech signals; for the latter, only thedereverberated signal at the first output channel will be underevaluation. Although it is not possible to exploit the dual-channel WPE in a single-microphone speech dereverberationscenario, it was included for comparison purposes. Please notethat the results for the iterative WPE [6], [7] are not presented,as it requires a cumbersome process of parameter tuning, forexample, the context parameter, δ , in Eq. (8) and the numberof iterations, per test condition; nevertheless, the performanceof the iterative WPE was slightly worse than that of the neuralWPE, when measured on our test datasets. A. Comparison to the Baselines1) Performance Analysis:
Similar to our previous study [18], we first compared the VACE-WPE with the baseline single- and dual-channel WPE algorithms. Starting with the VACE-WPE whose architecture is identical to that described in [18], the VACENet-b was pre-trained using the PT-self method and fine-tuned within the VACE-WPE architecture, as depicted in Fig. 1-(a), with K_trn set to 10. Fig. 3 shows the evaluation results on TestSimuClean in terms of the PESQ, CD, and SRMR metrics. As shown in the figure, the evaluation for each algorithm was conducted over fixed sets of LP orders with a constant step size, that is, K ∈ { , , , , } and K ∈ { , , , , } for the single-channel WPE and the dual-channel versions, respectively. Although these values may not represent the best operating points, they are sufficient for observing the performance variation of each algorithm across different LP orders and for comparing the overall performance of the different WPE-based dereverberation methods.

First, in the small room conditions, as the LP order grows, the PESQ score monotonically decreased while the CD increased. This is because large LP orders lead to overestimation of the reverberation and, consequently, to speech distortion in a room with a low reverberation time (T60). In contrast, the SRMR slightly increased with K; because it only considers the energy ratio in the modulation spectrogram [40], it cannot accurately reflect distortions relative to the reference signal. All three methods revealed the lowest CD at their smallest considered LP orders, exhibiting overall comparable performance.

In the medium room conditions, the performance measured at the far distance was clearly inferior to that measured at the near distance. Moreover, setting K too small or too large led to inaccurate estimation of the late reverberation, as demonstrated by both the PESQ and CD metrics. Unlike in the small room conditions, there are noticeable performance gaps between the single-channel WPE and the other methods, which are further emphasized in the far distance condition. Furthermore, there are operating points at which the VACE-WPE outperforms the single-channel WPE in terms of all three metrics, yet it is not competitive with the dual-channel WPE. The results in the large room conditions showed patterns similar to those observed in the medium rooms, but with overall performance degradation, which is attributed to the increased reverberation level.
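The delayed linear prediction at the core of all the WPE variants compared here can be sketched compactly. The following is a minimal, illustrative NumPy implementation of the STFT-domain recursion for a single frequency bin, not the authors' code: the function name, the alternating variance/filter updates, and the default prediction delay are our assumptions, and Eqs. (9)–(14) of the paper are only approximated in this simplified form.

```python
import numpy as np

def wpe_bin(X, K=10, delay=3, iters=3, eps=1e-8):
    """Minimal STFT-domain WPE sketch for a single frequency bin.

    X : (channels, frames) complex STFT coefficients.
    K : LP filter order (taps per channel); delay : prediction delay.
    Returns the dereverberated coefficients, same shape as X.
    """
    C, T = X.shape
    Z = X.copy()
    for _ in range(iters):
        # Time-varying PSD estimate from the current dereverberated output
        lam = np.maximum(np.mean(np.abs(Z) ** 2, axis=0), eps)  # (T,)
        # Stack delayed multi-channel observations: (C*K, T)
        Xtil = np.zeros((C * K, T), dtype=complex)
        for k in range(K):
            d = delay + k
            Xtil[k * C:(k + 1) * C, d:] = X[:, :T - d]
        # Variance-weighted normal equations for the LP filters G
        R = (Xtil / lam) @ Xtil.conj().T   # (C*K, C*K)
        P = (Xtil / lam) @ X.conj().T      # (C*K, C)
        G = np.linalg.solve(R + eps * np.eye(C * K), P)
        # Subtract the predicted late reverberation
        Z = X - G.conj().T @ Xtil
    return Z
```

Increasing K enlarges the stacked observation matrix and hence the correlation matrix R, which is why large LP orders both raise the computational cost and, as observed above, risk overestimating the reverberation in rooms with little of it.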
2) Visualization of Virtual Signals and LP Filters: As the dual-channel WPE and the VACE-WPE in [18] share the same neural WPE back-end and differ only in the type of the secondary input signal, we compared the input and output signals of the two systems. Fig. 4 shows the spectrograms, waveforms, and LP filter coefficients obtained from a sample test utterance taken from TestRealClean in the Large-near condition; the filters were calculated with K = 10. As shown in the first two rows, the generated virtual signal (X_v) appears considerably different from the pair of actual signals (X1 and X2), yet the dereverberated outputs (Z's) look similar. This implies that, other than the actual observation, there exists an alternative form of the secondary signal that facilitates blind dereverberation via Eqs. (9)–(14), and that a mechanism for generating such a signal can be learned in a data-driven manner using a neural network. A noticeable feature of the virtual signal is the scale difference: the amplitudes of the waveform were reduced by an approximate factor of 0.1, as shown in Fig. 4. This "amplitude shrinkage" started to appear in the very early stage of fine-tuning, even though the VACENet was initialized using the PT-self method to produce signals whose amplitudes are similar to those of the inputs. We conjecture that this may be attributed to setting the LP order, K_trn, to a constant during fine-tuning, which forces the VACENet to generate virtual signals that can effectively function as the secondary input for a WPE operating with a fixed LP order, regardless of the degree of reverberation measured in the observed signal. Nonetheless, the rightmost panel of Fig. 3 shows that the VACE-WPE does not break down when the LP order at inference time does not match that employed for fine-tuning.

The LP filter coefficients of the dual-channel WPE and the VACE-WPE, with K set to 10, are shown in the right panel of Fig. 4. This clearly verifies that, despite the same operations expressed by Eqs. (9)–(14), the principles behind the late reverberation estimation are completely different between the two algorithms. For example, the filters of the dual-channel WPE for both channels seem to focus more on the low-frequency bands, whereas those of the VACE-WPE [18] are concentrated on a few specific frame delay indices over a wide range of frequency bins and reveal more inter-channel asymmetry.

Fig. 4. (a) Spectrograms and waveforms of the input and output signals of the different WPE algorithms; X1 and X2 denote the actual first- and second-channel signals, X_v is the virtual signal; and Z1 and Z_v denote the WPE output signals corresponding to X1 and X_v, respectively. (b) Visualization of the (complex-valued) LP filters (K = 10) of the WPE algorithms. The label "First channel" denotes the filter applied to the first-channel input signal. In each subfigure of the filter, the left and right halves represent the real and imaginary components, respectively.

Fig. 5. Spectrograms (in log-magnitudes) obtained from the output of the LPSNet.

In terms of perceptual quality, an informal listening test revealed that the virtual signal does not necessarily sound like completely natural speech, occasionally producing machine-like sounds. This was attributed to the checkerboard artifacts [23], which inevitably appeared in some utterances. In addition, the virtual signal sounded more like a delayed and attenuated version of the observed speech, similar to the late reverberation component. Accordingly, the phonetic sounds or pronunciations of the linguistic contents still remained to some extent, but not as clearly as in the original utterance.

B. Simplification of the PSD Estimation Routine
An observation regarding the LPSNet, derived from the "amplitude shrinkage" of the virtual signal, is shown in Fig. 5. In the figure, the first two images are the outputs of the LPSNet given the actual and virtual signals as the inputs, respectively, and the last image is the average PSD obtained via Eq. (14). As seen in the figure, owing to the significant reduction in the amplitudes of the virtual signal, followed by the channel-wise average operation in Eq. (14), the average PSD is merely a faded-out version of the power spectra of the reverberant or dereverberated speech of the reference (actual) channel. Based on this observation, we hypothesized that this fadeout would adversely affect the operation of the VACE-WPE, and we therefore modified the system architecture as depicted in Fig. 1-(b). Section II-C6 further explains the simplified architecture.

Fig. 6. Performance comparison between the VACE-WPE systems before and after the simplification of the PSD estimation routine described in Section II-C6. The horizontal axis denotes the LP filter order, K. Both systems share the same VACENet-b model pre-trained with the PT-self method. K_trn was set to 10 during fine-tuning.

Fig. 6 shows the comparison between the VACE-WPE in [18] and the simplified VACE-WPE in terms of the PESQ, CD, and SRMR metrics. Herein, the simplified VACE-WPE was constructed by fine-tuning the pre-trained VACENet-b, described in Section IV-A1, within the simplified architecture; the same hyperparameters were employed for fine-tuning. Note that the results for the single-channel WPE are omitted for visual clarity. Overall, the simplification boosted both the PESQ and SRMR scores, particularly in the Medium-far and Large-far conditions by considerable margins, with only marginal increments in the CD measures. In other words, the simplified VACE-WPE can be regarded as better capable of fitting larger rooms and farther distance conditions, at the expense of a slight increase in CD. The spectrograms and waveforms of the virtual signals produced by the simplified VACE-WPE are presented in the last row of Fig. 4. Relative to the system without the simplification, the LP filters seem to exploit the virtual signal more aggressively. Meanwhile, the amplitudes of the virtual signals were amplified by an approximate factor of 2.0.

Fig. 7. Spectrograms and waveforms of the virtual signals and the oracle late reverberation signal.

For the rest of the sections, we use the simplified architecture for all the experiments.
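The fadeout is easy to reproduce numerically. The snippet below is a toy illustration only (random data standing in for STFT coefficients, and the factor 0.1 taken from the amplitude shrinkage observed in Fig. 4); it shows that the channel-wise average of Eq. (14) is then dominated by, yet sits roughly 3 dB below, the reference-channel power:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in STFT of the actual reference channel: (freq bins, frames)
X1 = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
# Virtual channel shrunk by ~0.1, as observed for X_v in Fig. 4
Xv = 0.1 * X1
# Channel-wise average PSD, in the spirit of Eq. (14)
psd_avg = 0.5 * (np.abs(X1) ** 2 + np.abs(Xv) ** 2)
ratio = psd_avg.mean() / (np.abs(X1) ** 2).mean()
print(round(ratio, 3))  # 0.505: about half the reference-channel power (~3 dB down)
```

This is why bypassing the averaging with the shrunken virtual channel, as done in the simplified architecture, keeps the PSD estimate on the power scale of the reference channel.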
C. VACENet Architecture and Pre-training Methods

As briefly mentioned in Section IV-A2, we observed a resemblance, to an extent, between the virtual signal and the late reverberation. Fig. 7 shows the spectrograms and waveforms of the virtual signals and those of the oracle late reverberation component of the observed signal; the first two were generated using the VACE-WPE [18] and its simplified version, respectively. As seen in the figure, all these signals are clearly different from the reverberant input signals (X1 and X2) depicted in Fig. 4, yet are partially similar to each other, for example, in the time-domain waveforms and in the temporal distribution of the "hot" regions of the spectrograms. Inspired by this, we proposed to pre-train the VACENet to estimate the late reverberation component of the observed signal, as described in Section II-C4.

Fig. 8 compares the PESQ and CD measures obtained from the different VACE-WPE systems, each of which is distinguished by the pre-training strategy employed and the VACENet structure; details of the four different VACENet models can be found in Fig. 2 and Table I in Section II-C2. In the figure, the results for K ∈ { , } are omitted because the simplified VACE-WPE revealed unfavorably high CD values with nearly constant PESQ and SRMR scores (see Fig. 6). First, focusing on the impact of the new pre-training strategy on the four VACENet models, the VACE-WPE systems built with the VACENet-{b, c} models revealed noticeable improvements from adopting the PT-late method in both the medium and large room conditions, and exhibited negligible differences in the small room conditions. Moreover, between VACENet-b and VACENet-c, the latter was overall superior. In contrast, when the PT-late strategy was applied to the systems built with the VACENet-{a, d} models, the performance was marginally improved in the small rooms but substantially degraded in the Medium-far, Large-near, and Large-far conditions with regard to either the PESQ or the CD measure. This is possibly due to their distinctive structure, which employs either a shared or a separate stream for both the encoder and the decoder, as depicted in Fig. 2.

Next, comparing the VACENet structures initialized with the PT-self method, VACENet-a and VACENet-c, both of which have a shared-stream decoder for modeling the RI components of the virtual signal, broadly outperformed the others in terms of both the PESQ and CD metrics. Meanwhile, VACENet-d exhibited the worst performance in the Medium-far and the large room conditions, under both the PT-self and PT-late strategies.

To summarize, among the eight VACE-WPE systems under evaluation, the combination of the VACENet-c structure and the PT-late initialization strategy showed the best performance.
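A common way to construct the early-arriving speech and the oracle late reverberation used as training targets (the PT-late target among them) is to split the room impulse response (RIR) at a fixed boundary after the direct-path peak and convolve the source with each part. The sketch below assumes a 50 ms boundary; neither the function name nor the exact boundary is taken from the paper:

```python
import numpy as np

def split_reverb(src, rir, sr=16000, early_ms=50.0):
    """Split a reverberant signal into early-arriving and late components.

    The RIR is cut `early_ms` milliseconds after its main peak (direct path);
    the two truncated RIRs are convolved with the dry source separately.
    """
    peak = int(np.argmax(np.abs(rir)))
    cut = peak + int(sr * early_ms / 1000.0)
    rir_early, rir_late = rir.copy(), rir.copy()
    rir_early[cut:] = 0.0   # keep direct path + early reflections
    rir_late[:cut] = 0.0    # keep the late tail only
    early = np.convolve(src, rir_early)[: len(src)]
    late = np.convolve(src, rir_late)[: len(src)]
    return early, late
```

By linearity of convolution, the two components sum exactly to the full reverberant signal, so a network trained to output one implicitly learns the other as the residual.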
D. Effect of the LP Order Set During the Fine-tuning

In this subsection, we investigate the effect of the LP order set during the fine-tuning of the VACENet. Based on the analysis in Section IV-C, we constructed a simplified VACE-WPE on top of the VACENet-c model initialized using the PT-late method. Fig. 9 shows the performance of the VACE-WPE systems fine-tuned with different values of the LP order, K_trn ∈ { , , , }, in terms of the PESQ, CD, and SRMR metrics. Notably, the systems trained with the relatively large LP orders of K_trn ∈ { , } tend to fail severely in the medium and large room conditions when evaluated using the smaller LP orders of K ∈ { , }. In contrast, under the same test conditions, the systems built with the relatively small LP orders of K_trn ∈ { , } showed favorable trade-offs between the PESQ and CD metrics measured with K = 5 and those measured with K = 10, while exhibiting adversely high CD values for K ∈ { , , }. These two contrasting trends indicate that the VACENet, pre-trained using the PT-late strategy, is in fact fit to generate a virtual signal that is most effective as the auxiliary input when the back-end WPE operates with an LP order close to that employed in the fine-tuning stage. This may be attributed to training the VACENet in an end-to-end manner within the WPE dereverberation framework, where the algorithm is restricted to operate with a fixed LP order. However, the VACE-WPE systems trained with K_trn ∈ { , }, even when evaluated using the matched LP orders of K ∈ { , }, failed to achieve high PESQ and SRMR scores in the Medium-far and Large-far conditions. This is explained in Fig. 10, which visualizes the training and validation losses observed during the fine-tuning of the four different VACE-WPE systems; the validation loss was computed on a small, separate validation set using K = 15. The figure shows that, unlike the systems trained with K_trn ∈ { , }, those trained with K_trn ∈ { , } fail to sufficiently reduce both the training and validation losses. Furthermore, comparing the two systems trained with K_trn = 5 and K_trn = 10, the former converged faster than the latter. These observations indicate that generating virtual input signals from scratch for a dual-channel WPE operating with relatively large LP orders is difficult, possibly because the degrees of freedom of the relevant matrices in Eqs. (9)–(11) increase with the LP order. Nonetheless, it is notable that the VACE-WPE fine-tuned with K_trn = 5 performed well in the large room conditions, even when evaluated using relatively small LP orders of K ∈ { , }. Meanwhile, in the small room conditions, the systems trained with K_trn ∈ { , } were comparable or marginally superior to those trained with K_trn ∈ { , } in terms of the PESQ and CD measures, with slightly lower SRMR scores.

Fig. 8. Performance comparison of the simplified VACE-WPE systems built with different pre-training strategies (i.e., PT-self and PT-late) and VACENet structures (i.e., VACENet-{a, b, c, d}): (a) PESQ and (b) CD. The horizontal axis denotes the LP filter order, K. K_trn was set to 10 during fine-tuning.

Fig. 9. Performance comparison of the simplified VACE-WPE systems fine-tuned with different LP orders, K_trn ∈ { , , , }. The VACENet-c model pre-trained using the PT-late method was adopted for fine-tuning. The horizontal axis denotes the LP filter order, K.

Fig. 10. Training and validation losses observed during fine-tuning the simplified VACE-WPE with the different LP orders, K_trn ∈ { , , , }. The validation loss was calculated with K = 15, and the loss values are depicted for every third epoch for visual clarity.

E. Results in Noisy Reverberant Conditions
In this subsection, the performance of the VACE-WPE is verified under noisy reverberant test conditions. Both the LPSNet and VACENet-c models were trained using the TrainSimuNoisy dataset, as described in Sections III-D and III-E, and the PT-late strategy was adopted to pre-train the VACENet. Herein, the early arriving speech plus noise was employed as the target signal for training the LPSNet and VACENet, as the WPE algorithm is only capable of blind dereverberation and is not explicitly designed for noise removal. Based on the observation from Fig. 10, we fine-tuned the VACENet by gradually increasing the LP filter order, K_trn, as the training progresses. More specifically, for every mini-batch, K_trn was randomly chosen within the set S_K = {K | K_trn^lower ≤ K ≤ K_trn^upper} ⊂ Z+, and the optimization was performed using the selected LP order; K_trn^lower was fixed at 4, and K_trn^upper was initially set to 6 and increased to 9, 12, 15, 18, and 21 after the 15th, 25th, 35th, 44th, and 52nd epochs, respectively.

The evaluation results on the TestRealNoisy dataset are shown in Figs. 11 and 12, where the former presents the results measured in the small room environment and the latter those in the medium and large rooms. Comparing the single-channel WPE and the VACE-WPE, the latter tends to exhibit operating points generally superior to those of the former in terms of all the evaluation metrics considered. Similar to the results obtained in Section IV-A, the performance gap between the two algorithms further increased in the far-field speaking conditions, particularly with regard to the PESQ, SRMR, and FWSegSNR metrics. Moreover, the VACE-WPE was also favorably comparable to the dual-channel WPE, revealing marginally better PESQ measures in the babble and factory noise conditions in various room environments and moderately higher SRMR scores in the Medium-far and Large-far conditions. Interestingly, these SRMR scores measured with different values of the LP order imply that the VACE-WPE is better at producing "dry" signals than the dual-channel WPE when using relatively small LP orders. Finally, considering that there is a mismatch between the clean speech corpus of TrainSimuNoisy and that of TestRealNoisy, it can be stated that the training of the VACE-WPE generalizes well to a larger corpus instead of simply overfitting to a small-scale dataset.
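The mini-batch-wise LP-order sampling described above can be written as a small scheduler. The epoch boundaries and bounds are those stated in the text (lower bound fixed at 4; upper bound 6, raised to 9, 12, 15, 18, and 21 after epochs 15, 25, 35, 44, and 52); the helper names are ours:

```python
import random

# (epoch after which the bound applies, new upper bound), from the text
SCHEDULE = [(15, 9), (25, 12), (35, 15), (44, 18), (52, 21)]
K_LOWER = 4

def k_upper(epoch):
    """Upper bound K_trn^upper for the given (1-indexed) epoch."""
    upper = 6  # initial value
    for boundary, value in SCHEDULE:
        if epoch > boundary:
            upper = value
    return upper

def sample_k_trn(epoch, rng=random):
    """Draw K_trn uniformly from {K_lower, ..., K_upper} for one mini-batch."""
    return rng.randint(K_LOWER, k_upper(epoch))
```

Starting from small upper bounds sidesteps the convergence problem seen in Fig. 10 for large fixed K_trn, while the growing range keeps the VACENet from specializing to a single inference-time LP order.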
F. Speech Recognition Results on Real Recordings

In this subsection, we verify the performance of the various speech dereverberation methods as front-ends for the automatic speech recognition (ASR) task. Specifically, we followed the protocol of the ASR task of the VOiCES Challenge 2019 [4], [5], a recent benchmark on far-field ASR in challenging noisy reverberant room environments. The challenge provides two sets of utterances for system development and evaluation, namely the "dev" and "eval" sets [4], [5]; each set consists of a small portion of the VOiCES corpus [41]. The VOiCES corpus is a re-recorded subset of the LibriSpeech dataset [28], and the re-recording was performed using twelve microphones of different types and locations in the presence of background noise, for example, fan, babble, music, and television [41]. To build the baseline ASR system, we used an open-source script (https://github.com/freewym/kaldi-voices, kaldi/egs/librispeech/s5/local/chain/tuning/run_cnn_tdnn_1a.sh) that partially implements the system described in [42] based on the Kaldi [27] toolkit. The acoustic model was built using the modified LibriSpeech-80h dataset [4], [5] after applying the standard data augmentation and speed perturbation [24] provided by the Kaldi recipes [27]; 40-dimensional log-mel-filterbank energies, extracted with a 25 ms window and a 10 ms hop size, were used as the input acoustic features. A 3-gram statistical language model constructed from the transcripts of the training utterances was employed for decoding.

Tables VIII and IX present the SRMR scores and word error rates (WERs) obtained using the different speech dereverberation methods, respectively. For the single-channel WPE and the VACE-WPE, the LP filter order, K, was set to 80 and 35, respectively; further increasing K did not significantly improve the performance of either algorithm. As shown in the tables, besides the single-channel WPE, two fully neural speech dereverberation models, namely the LPSNet-Drv and the VACENet-c-Drv, were also compared. More specifically, the LPSNet-Drv was implemented by simply combining the dereverberated magnitude spectra, estimated by the trained LPSNet, with the phase spectra of the reverberant observation. The VACENet-c-Drv was obtained by training a neural network, whose structure is identical to that of the VACENet-c, to estimate the RI components of the early arriving speech plus noise. These models allow a direct comparison between i) employing a neural network to directly estimate the early arriving speech component and ii) employing a neural network to generate the virtual signal and subsequently letting the pre-trained dual-channel neural WPE perform the dereverberation.

TABLE VIII
SRMR scores measured on the VOiCES Challenge dataset

Method | Raw signal | WPE-single (K = 80) | VACE-WPE (K = 35) | LPSNet-Drv | VACENet-c-Drv
dev    | 2.30       | 2.80                |                   |            |
eval   | 2.07       | 2.59                |                   |            |

TABLE IX
WER (%) measured on the VOiCES Challenge dataset

Method | Raw signal | WPE-single (K = 80) | VACE-WPE (K = 35) | LPSNet-Drv | VACENet-c-Drv
dev    | 24.2       |                     |                   |            |

Table VIII shows that the VACE-WPE and the VACENet-c-Drv reveal significantly higher SRMR scores than the other methods and are comparable with each other. However, as shown in Table IX, the single-channel WPE achieved the lowest WER on both sets, followed by the VACE-WPE with slightly worse performance; both the LPSNet-Drv and the VACENet-c-Drv failed to reduce the WER. Accordingly, it can be stated that the proposed VACE-WPE achieves a good balance between objective speech quality improvement and front-end processing for the ASR task in terms of dereverberation.

Table X further presents the results obtained after performing lattice interpolation [43] on the ASR output lattices generated using the single-channel WPE front-end and those generated using the VACE-WPE; the scaling factor, λ, was varied from 0.1 to 0.9. Absolute WER reductions of 0.3% and 0.9%, achieved on the "dev" and "eval" sets, respectively, indicate that the single-channel WPE and the VACE-WPE can be complementary as speech dereverberation front-ends for the ASR task.

TABLE X
WER (%) after performing lattice interpolation [43] between the ASR output lattices generated using the single-channel WPE and those using the VACE-WPE. λ was applied to the former and 1 − λ to the latter.

Fig. 11. Speech dereverberation performance on TestRealNoisy in the small room environment: (a) air conditioner, (b) babble, (c) factory, and (d) music. The horizontal axis denotes the LP filter order, K.

V. CONCLUSIONS
In this study, we first investigated the properties of the VACE-WPE system via ablation studies, which led to the introduction of a simplified architecture and new strategies for training the neural network for the VACE. Based on these findings, the performance of the VACE-WPE was further examined with regard to i) the objective quality of the dereverberated speech under noisy reverberant conditions and ii) the ASR results measured on real noisy reverberant recordings. The experimental results and analysis indicate that neural-network-based virtual signal generation followed by the modified neural WPE back-end provides an effective speech dereverberation algorithm for single-microphone offline processing scenarios.

REFERENCES

[1] H. Kuttruff, Room Acoustics. Boca Raton, FL, USA: CRC Press, 2016.
[2] J. S. Bradley, H. Sato, and M. Picard, "On the importance of early reflections for speech in rooms," J. Acoust. Soc. Amer., vol. 113, no. 6, pp. 3233–3244, 2003.
[3] K. Kinoshita et al., "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., 2013, pp. 1–4.
[4] M. K. Nandwana et al., "The VOiCES from a distance challenge 2019 evaluation plan," arXiv:1902.10828, 2019.
[5] M. K. Nandwana et al., "The VOiCES from a distance challenge 2019," in Proc. INTERSPEECH, 2019, pp. 2438–2442.
[6] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B. Juang, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1717–1731, 2010.
[7] T. Yoshioka and T. Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 10, pp. 2707–2720, 2012.
[8] Y. Iwata and T. Nakatani, "Introduction of speech log-spectral priors into dereverberation based on Itakura-Saito distance minimization," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2012, pp. 245–248.
[9] A. Jukić and S. Doclo, "Speech dereverberation using weighted prediction error with Laplacian model of the desired speech," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2014, pp. 5172–5176.
[10] M. Novey, T. Adali, and A. Roy, "A complex generalized Gaussian distribution–Characterization, generation, and estimation," IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1427–1433, 2010.
[11] A. Jukić, T. van Waterschoot, T. Gerkmann, and S. Doclo, "Multi-channel linear prediction-based speech dereverberation with sparse priors," IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 9, pp. 1509–1520, 2015.
[12] S. R. Chetupalli and T. V. Sreenivas, "Late reverberation cancellation using Bayesian estimation of multi-channel linear predictors and Student's t-source prior," IEEE Trans. Audio, Speech, Lang. Process., vol. 27, no. 6, pp. 1007–1018, 2019.
[13] K. Kinoshita, M. Delcroix, H. Kwon, T. Hori, and T. Nakatani, "Neural network based spectrum estimation for online WPE dereverberation," in Proc. INTERSPEECH, 2017, pp. 384–388.
[14] P. N. Petkov, V. Tsiaras, R. Doddipatla, and Y. Stylianou, "An unsupervised learning approach to neural-net-supported WPE dereverberation," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, pp. 5761–5765.
[15] J. Heymann, L. Drude, R. Haeb-Umbach, K. Kinoshita, and T. Nakatani, "Joint optimization of neural network-based WPE dereverberation and acoustic model for robust online ASR," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, pp. 6655–6659.
[16] S. R. Chetupalli and T. V. Sreenivas, "Clean speech AE-DNN PSD constraint for MCLP based reverberant speech enhancement," in Proc. Eur. Signal Process. Conf., 2019, pp. 1–5.
[17] T. Taniguchi, A. S. Subramanian, X. Wang, D. Tran, Y. Fujita, and S. Watanabe, "Generalized weighted-prediction-error dereverberation with varying source priors for reverberant speech recognition," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., 2019, pp. 293–297.
[18] J.-Y. Yang and J.-H. Chang, "Virtual acoustic channel expansion based on neural networks for weighted prediction error-based speech dereverberation," in Proc. INTERSPEECH, 2020, pp. 3930–3934.
[19] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Interv., 2015, pp. 234–241.
[20] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," in Proc. Int. Conf. Mach. Learn., 2017, pp. 933–941.
[21] K. Tan and D. Wang, "Complex spectral mapping with a convolutional recurrent network for monaural speech enhancement," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, pp. 6865–6869.
[22] S.-W. Fu, T.-y. Hu, Y. Tsao, and X. Lu, "Complex spectrogram enhancement by convolutional neural network with multi-metrics learning," in Proc. Int. Workshop Mach. Learn. Signal Process., 2017, pp. 1–6.
[23] A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard
[24] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 5220–5224.
[25] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., vol. 65, no. 4, pp. 943–950, 1979.

Fig. 12. Speech dereverberation performance on TestRealNoisy in the medium and large room environments: (a) air conditioner, (b) babble, (c) factory, and (d) music. The horizontal axis denotes the LP filter order, K.

[26] J. S. Garofolo, "TIMIT acoustic phonetic continuous speech corpus," Linguistic Data Consortium, 1993.
[27] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proc. IEEE Workshop Automat. Speech Recognit. Understanding, 2011.
[28] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 5206–5210.
[29] C. K. A. Reddy et al., "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective speech quality and testing framework," arXiv:2001.08662, 2020.
[30] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun., vol. 12, no. 3, pp. 247–251, 1993.
[31] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv:1510.08484, 2015.
[32] S. Pirhosseinloo and J. S. Brumberg, "Monaural speech enhancement with dilated convolutions," in Proc. INTERSPEECH, 2019, pp. 3143–3147.
[33] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167, 2015.
[34] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units," arXiv:1511.07289, 2015.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.
[36] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[37] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. Int. Conf. Mach. Learn., 2013, pp. 1310–1318.
[38] "P.862.2: Wideband extension to recommendation P.862 for the assessment of wideband telephone networks and speech codecs," ITU-T Recommendation, 2005.
[39] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229–238, 2007.
[40] J. F. Santos, M. Senoussaoui, and T. H. Falk, "An improved non-intrusive intelligibility metric for noisy and reverberant speech," in Proc. Int. Workshop Acoust. Signal Enhance., 2014, pp. 55–59.
[41] C. Richey et al., "Voices obscured in complex environmental settings (VOiCES) corpus," in Proc. INTERSPEECH, 2018, pp. 1566–1570.
[42] Y. Wang, D. Snyder, H. Xu, V. Manohar, P. S. Nidadavolu, D. Povey, and S. Khudanpur, "The JHU system for VOiCES from a distance challenge 2019," in Proc. INTERSPEECH, 2019, pp. 2488–2492.
[43] D. Povey et al., "Generating exact lattices in the WFST framework," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2012.