VACE-WPE: Virtual Acoustic Channel Expansion Based On Neural Networks for Weighted Prediction Error-Based Speech Dereverberation
Joon-Young Yang and Joon-Hyuk Chang , Senior Member, IEEE
Abstract—Speech dereverberation is an important issue for many real-world speech processing applications. Among the techniques developed, the weighted prediction error (WPE) algorithm, which blindly cancels out the late reverberation component from the reverberant mixture of microphone signals, has been widely adopted and advanced over the last decade. In this study, we extend the neural-network-based virtual acoustic channel expansion (VACE) framework for WPE-based speech dereverberation, a variant of the WPE that we recently proposed to enable the use of the dual-channel WPE algorithm in a single-microphone speech dereverberation scenario. Building on the previous study, ablation studies are conducted on the constituents of the VACE-WPE in an offline processing scenario. These studies help in understanding the dynamics of the system, thereby simplifying the architecture and leading to the introduction of new strategies for training the neural network for the VACE. Experimental results in noisy reverberant environments reveal that the VACE-WPE considerably outperforms its single-channel counterpart in terms of objective speech quality and is complementary to the single-channel WPE when employed as the front-end for a far-field automatic speech recognizer.
Index Terms—Speech dereverberation, weighted prediction error, deep neural network, single microphone, offline processing.
I. INTRODUCTION
Speech signals traveling in an enclosed space encounter walls, floor, ceiling, and other obstacles present in the room, creating multiple reflections of the source image. Hence, when they are captured by a set of microphones at a distance, the delayed and attenuated replicas of the sound source appear as the so-called reverberation component of the microphone observations. The reverberation component can be considered a composition of the early reflections and late reverberation [1]. In particular, the former is known to change the timbre of the source speech yet helps improve intelligibility [2], whereas the latter degrades the perceptual listening quality and deteriorates the performance of speech and speaker recognition applications [3]–[5]. One of the most popular approaches for speech dereverberation is to exploit the multi-channel linear prediction (MCLP) technique to model the late reverberation component and subsequently cancel it out from the microphone observations. Specifically, in [6], the delayed linear prediction (LP) model was adopted to estimate the late reverberation, whose model parameters are obtained via iterative maximization of the likelihood function defined under the assumption that the dereverberated speech signal follows a complex normal distribution with time-varying variance. This method is referred to as the weighted prediction error (WPE) algorithm, and both the time- and short-time Fourier transform (STFT) domain implementations were presented in [6]; the latter is usually preferred to the former owing to its computational efficiency.

(The authors are with Hanyang University, Seoul, 04763, Korea; e-mail: [email protected].)

Several variants of the WPE algorithm and MCLP-based speech dereverberation methods have been proposed over the past decade.
In [7], a generalized version of the WPE algorithm [6] was derived via the introduction of a new cost function that measures the temporal correlation within the sequence of dereverberated samples. In [8], log-spectral domain priors based on Gaussian mixture models were introduced into the procedure for estimating the power spectral density (PSD) of the dereverberated speech signal. The STFT coefficients of the dereverberated speech were modeled using the Laplacian distribution in [9], whereas a more general sparse prior, the complex generalized Gaussian (CGG) [10], was adopted in [11]. More recently, Student's t-distribution was employed as the prior of the desired signal, and the LP filter coefficients were subjected to probabilistic Bayesian sparse modeling with a Gaussian prior [12].

Another branch of WPE variants integrates deep neural networks (DNNs) into the WPE-based speech dereverberation framework. In [13], a DNN was trained to estimate the PSD of the early arriving speech components, which substituted the iterative PSD estimation routine of the conventional WPE algorithm [6]. It was shown in [14] that such a DNN for supporting the WPE algorithm can be trained in an unsupervised manner (i.e., without requiring parallel data for supervision) by performing an end-to-end optimization of ℓ-norm-based cost functions involving the relevant signals. Moreover, the DNN-supported WPE [13] was subjected to an end-to-end joint optimization with a DNN-based acoustic model for robust speech recognition [15]. Unlike [13], an autoencoder DNN trained on clean speech was used to constrain the estimated PSD to have characteristics similar to those of clean speech in a learned feature space [16].
Meanwhile, a DNN was employed to estimate the shape parameter of the CGG source prior [17], which provides a more flexible form of the WPE algorithm proposed in [11].

A common observation underlying the abovementioned studies [11]–[16] is that the multi-channel WPE algorithm is generally superior to its single-channel counterpart. Inspired by this, we previously proposed the virtual acoustic channel expansion (VACE) technique for the WPE [18], a variant of the WPE designed to utilize the dual-channel WPE algorithm in a single-microphone speech dereverberation scenario. Specifically, the neural WPE [13] is assisted by another neural network that generates a virtual signal from an actual single-channel observation, whereby the pair of actual and virtual signals is directly consumed by the dual-channel neural WPE algorithm. The neural network for the virtual signal generation, termed the VACENet, is first pre-trained and then fine-tuned to produce the dereverberated signal via the actual output channel of the dual-channel neural WPE.

This article is an extension of [18], which aims to provide a more comprehensive understanding of the VACE-WPE based on empirical evaluation results obtained via sets of experiments, each of which is designed to investigate the dynamics of the VACE-WPE with respect to the various system constituents. The limitations of the previous study [18] are listed below:

• The VACE-WPE system in [18] was designed rather ad hoc, and the dynamics of the system were not sufficiently investigated.

• Because [18] is essentially a feasibility study, the experiments were conducted only in noiseless reverberant conditions, which is practically unrealistic.

Accordingly, the contribution of this article is two-fold:

• Ablation studies are conducted with regard to the system components of the VACE-WPE, which helps in understanding the characteristics of the VACE-WPE and further leads to an overall performance improvement.
• Experimental results in noisy reverberant environments are provided, which demonstrate that the VACE-WPE is significantly superior to the single-channel WPE in achieving better objective speech quality, while the two are complementary to each other as front-ends for the reverberant speech recognition task.

II. OVERVIEW OF THE VACE-WPE
A. Signal Model
Suppose that a speech source signal is captured by D microphones in a reverberant enclosure. In the STFT domain, the observed signal impinging on the d-th microphone can be approximated as follows [6], [7]:

X_{t,f,d} = \sum_{\tau=0}^{l-1} h^{*}_{\tau,f,d} S_{t-\tau,f} + V_{t,f,d},  (1)

where S_{t,f} and V_{t,f,d} denote the STFT-domain representations of the source speech and the noise observed at the d-th microphone, respectively; the superscript * denotes the complex conjugate operation, and h_{t,f,d} represents the room impulse response (RIR) from the source to the d-th microphone, whose duration is l. Further decomposing the speech term into the early arriving component (i.e., the direct path plus the early reflections) and the late reverberation [6] provides

X_{t,f,d} = \sum_{\tau=0}^{\Delta-1} h^{*}_{\tau,f,d} S_{t-\tau,f} + \sum_{\tau=\Delta}^{l-1} h^{*}_{\tau,f,d} S_{t-\tau,f} + V_{t,f,d}  (2)
        = X^{(early)}_{t,f,d} + X^{(late)}_{t,f,d} + V_{t,f,d},  (3)

where \Delta denotes the STFT-domain time index that determines the duration of the RIR contributing to the early arriving speech component. Herein, the early arriving speech is assumed to be obtained upon convolution between the source speech and the RIR truncated up to 50 ms after the main peak. Accordingly, with the 64 ms Hann window and a hop size of 16 ms employed for the STFT analysis, \Delta is fixed to 3 (3 × 16 ms ≈ 50 ms).

B. Review of the WPE Algorithm

1) Iterative WPE:
Under the noiseless assumption that V_{t,f,d} = 0, ∀d, the late reverberation component X^{(late)}_{t,f,d} in Eq. (3) can be approximated by the delayed LP technique as follows [6]:

\hat{X}^{(late)}_{t,f,d} = \sum_{\tau=\Delta}^{\Delta+K-1} g^{H}_{\tau,f,d} X_{t-\tau,f}  (4)
                        = \tilde{g}^{H}_{f,d} \tilde{X}_{t-\Delta,f},  (5)

where g_{\tau,f,d} ∈ C^{D} represents the K-th order time-invariant LP filter coefficients for the output channel index d; X_{t,f} ∈ C^{D} represents the D-channel stack of the microphone input signal; \tilde{g}_{f,d} = [g^{T}_{\Delta,f,d}, ..., g^{T}_{\Delta+K-1,f,d}]^{T} ∈ C^{DK}; \tilde{X}_{t-\Delta,f} = [X^{T}_{t-\Delta,f}, ..., X^{T}_{t-(\Delta+K-1),f}]^{T} ∈ C^{DK}; and the superscripts T and H denote the transpose and Hermitian operations, respectively. Under the assumption that X^{(early)}_{t,f,d} is sampled from a complex normal distribution with a zero mean and time-varying variance λ_{t,f,d}, the objective of the WPE algorithm is to maximize the log-likelihood function [6], [7]:

\tilde{g}'_{f,d}, λ'_{t,f,d} = \arg\max_{\tilde{g}_{f,d}, λ_{t,f,d}} L_{f,d},  (6)
L_{f,d} = \mathcal{N}(\hat{X}^{(early)}_{t,f,d} = X_{t,f,d} - \tilde{g}^{H}_{f,d} \tilde{X}_{t-\Delta,f}; 0, λ_{t,f,d})  (7)

for d ∈ {1, 2, ..., D}. As this optimization problem has no analytic solution, \tilde{g}_{f,d} and λ_{t,f,d} are alternately updated via the following iterative procedure [6], [7]:

Step 1)  λ_{t,f} = (1/D) \sum_{d} [ 1/(2δ+1) \sum_{\tau=-δ}^{δ} |Z_{t+\tau,f,d}|^2 ],  (8)
Step 2)  R_f = \sum_{t} \tilde{X}_{t-\Delta,f} \tilde{X}^{H}_{t-\Delta,f} / λ_{t,f} ∈ C^{DK×DK},  (9)
         P_f = \sum_{t} \tilde{X}_{t-\Delta,f} X^{H}_{t,f} / λ_{t,f} ∈ C^{DK×D},  (10)
         G_f = R_f^{-1} P_f ∈ C^{DK×D},  (11)
Step 3)  Z_{t,f} = X_{t,f} - G_f^{H} \tilde{X}_{t-\Delta,f},  (12)

where Eq. (8) is obtained by further assuming that λ_{t,f,1} = λ_{t,f,2} = ... = λ_{t,f,D}, and δ is a term introduced to consider the temporal context between neighboring frames. G_f is a matrix whose d-th column is \tilde{g}_{f,d}, and Z_{t,f} is the D-channel stack of the dereverberated output signal, \hat{X}^{(early)}_{t,f,d}. In the first iteration, Z_{t,f} is initialized to X_{t,f}. It was revealed in [7] that the WPE algorithm described in Eqs. (8)-(12) can be derived as a special case of the generalized WPE, without enforcing the noiseless assumption.

Fig. 1. Block diagram of the VACE-WPE systems: (a) VACE-WPE [18] and (b) simplified VACE-WPE. The subscripts 1 and v denote the actual and virtual channel signals, respectively.
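For illustration, the iterative procedure of Eqs. (8)-(12) can be sketched for a single frequency bin in NumPy. This is a minimal sketch, not the authors' implementation: the function name and arguments are our own, the context parameter δ is taken as 0, and a small diagonal loading is added to R_f for numerical stability.

```python
import numpy as np

def wpe_one_bin(X, delay=3, K=10, iters=3, eps=1e-8):
    """Iterative WPE for one frequency bin, following Eqs. (8)-(12).

    X: (D, T) complex STFT frames of the D-channel observation.
    Returns Z: (D, T) dereverberated frames.
    """
    D, T = X.shape
    # Stacked delayed observations X~_{t-delay,f} of size (D*K, T), as in Eq. (5).
    Xt = np.zeros((D * K, T), dtype=complex)
    for k in range(K):
        tau = delay + k
        Xt[k * D:(k + 1) * D, tau:] = X[:, :T - tau]
    Z = X.copy()  # first iteration: Z is initialized to X
    for _ in range(iters):
        # Step 1: channel-averaged time-varying variance, Eq. (8) with delta = 0.
        # (The simplified VACE-WPE of Eq. (19) would use only the actual channel.)
        lam = np.mean(np.abs(Z) ** 2, axis=0) + eps
        # Step 2: weighted correlation matrices and LP filters, Eqs. (9)-(11).
        Xn = Xt / lam                                    # scale each frame by 1/lambda_t
        R = Xn @ Xt.conj().T                             # Eq. (9), shape (DK, DK)
        P = Xn @ X.conj().T                              # Eq. (10), shape (DK, D)
        G = np.linalg.solve(R + eps * np.eye(D * K), P)  # Eq. (11)
        # Step 3: decorrelation, Eq. (12).
        Z = X - G.conj().T @ Xt
    return Z
```

With D = 1 the same code realizes the single-channel WPE; the VACE-WPE instead feeds it a (2, T) stack of the actual and virtual signals.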
2) Neural WPE:
Neural WPE [13] exploits a neural network to estimate the PSD of the dereverberated output signal, |Z_{t,f,d}|^2, as follows:

\ln |\hat{Z}_{t,f,d}|^2 = \mathcal{F}(\ln |X_d|^2; \Theta_{LPS}),  (13)

where \mathcal{F}(·; \Theta_{LPS}) denotes the neural network, parameterized by \Theta_{LPS}, that estimates the log-scale power spectra (LPS) of the dereverberated signal in a channel-independent manner; the time-frequency (T-F) indices are dropped in X_d, as neural networks often consume multiple T-F units within a context as the input. Accordingly, Eq. (8) can be rewritten as follows:

λ_{t,f} = (1/D) \sum_{d} |\hat{Z}_{t,f,d}|^2.  (14)

For the rest of this paper, we denote the neural network for the PSD estimation, \mathcal{F}(·; \Theta_{LPS}), as the LPSNet [18], as it operates in the LPS domain of the relevant signals.

C. VACE-WPE System Description

1) Overview:
The entire VACE-WPE system [18] consists of two separate modules: the VACE module, which is responsible for the generation of the virtual signal, and the dual-channel neural WPE, which operates in exactly the same manner as described in Eqs. (9)-(14) for D = 2. To build the complete VACE-WPE system, the LPSNet is trained to estimate the LPS of the early arriving speech given the reverberant observation, and the VACENet is pre-trained under a certain predefined criterion. These two steps are independent of each other and thus can be performed in parallel. Subsequently, the VACE-WPE system is constructed as depicted in Fig. 1, and the VACENet is fine-tuned to produce the dereverberated signal at the output channel corresponding to the actual microphone. During the fine-tuning, the LP order is fixed to K = K_trn, and the parameters of the LPSNet are frozen.

Fig. 2. Four different VACENet architectures for modeling the RI components: (a) VACENet-a, (b) VACENet-b, (c) VACENet-c, and (d) VACENet-d. The input and output feature maps are represented in (time, frequency, channel) format, and the numbers above the rectangles denote the number of channels.
2) Architecture of the VACENet:
Similar to our previous study [18], we used the U-Net [19] as the backbone architecture of the VACENet, whose input and output representations are the real and imaginary (RI) components of the STFT coefficients of the actual and virtual signals, respectively. Unlike [18], four different architectures of the VACENet are considered in this study, each of which differs in whether a shared or a separate stream is used for the convolutional encoder and decoder. Fig. 2 illustrates the four distinctive VACENet architectures, denoted as VACENet-{a, b, c, d}. First, all the models consume both of the RI components as the input to the encoder stream, whether it is separated or not, which is intended to fully exploit the information residing in the pair of RI components. Second, VACENet-{a, c} use a shared decoder stream to model the RI components of the virtual signal, whereas VACENet-{b, d} split the decoder stream into two to separately model each attribute of the RI components. As shown in Fig. 2, the difference between VACENet-b and VACENet-d lies in whether the separated decoder streams share the bottleneck feature, as well as the encoder feature maps for the skip connections. Meanwhile, VACENet-c can be considered a more flexible version of VACENet-a, as it splits the encoder stream into two separate streams and thus doubles the number of skip connections originating from the encoder module.

In each subfigure of Fig. 2, the rectangles denote the feature maps, whose height and width represent their relative size and depth, respectively, and the numbers above the rectangles are the channel sizes of the feature maps. Each of the wide arrows denotes a 2D convolution (Conv2D) with a kernel size of 3, and ⊕ denotes the concatenation of feature maps along the channel axis. Every downsampling or upsampling operation is performed by either a strided Conv2D or a transposed Conv2D with a stride size of 2, and 1 × 1 convolutions are used in the bottleneck and last layers of the network. A gated linear unit [20] was used instead of a simple convolution followed by an activation function, except for the downsampling and upsampling layers. Lastly, to make fair comparisons between the different model structures, we designed each model to have a similar total number of parameters, as shown in Table I.

TABLE I: MODEL SIZE OF THE DIFFERENT VACENET ARCHITECTURES

A similar investigation regarding the model architecture was conducted in [21] for the speech enhancement task, where a structure analogous to that depicted in Fig. 2-(b) was shown to be effective. In contrast, it was mentioned in [22] that separately handling each RI component is beneficial. Because the task at hand, and hence the role of the VACENet, is fundamentally different from that of the neural networks adopted for speech enhancement [21], [22], we argue that it is worthwhile to examine which architecture is more appropriate for the VACE task.
3) Loss Function:
Two types of loss functions, namely the frequency-domain loss and the time-domain loss, are defined to train the VACENet [18]:

L_freq(A, B) = α · [MSE(A^r, B^r) + MSE(A^i, B^i)] + β · MSE(\ln|A|, \ln|B|),  (15)
L_time(a, b) = MAE(a, b),  (16)
L(A, B) = L_freq(A, B) + γ · L_time(a, b),  (17)

where A and B are the STFT coefficients, and \ln|A| and \ln|B| are their log-scale magnitudes; a and b are the time-domain signals obtained by taking the inverse STFT of A and B, respectively; the superscripts r and i denote the RI components, respectively; α, β, and γ are scaling factors that weigh the losses defined in the different domains of the signal representations; and MSE(·,·) and MAE(·,·) compute the mean squared and mean absolute error between the inputs, respectively.

It is worth noting that α and β should be determined such that the values of α · [MSE(A^r, B^r) + MSE(A^i, B^i)] and β · MSE(\ln|A|, \ln|B|) are similar. When the former was considerably larger than the latter, severe checkerboard artifacts [23] appeared in the output signal of the network. In the opposite condition, fine-grained representations of the RI components of the output signal could not be obtained. γ was also set to make γ · L_time(a, b) have values similar to, or slightly smaller than, those of the two aforementioned terms.
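The composite loss of Eqs. (15)-(17) is straightforward to express in NumPy. The sketch below is illustrative: the default values of `alpha`, `beta`, and `gamma` are placeholders rather than the values used in the paper, and a small ε is added inside the logarithm for numerical safety.

```python
import numpy as np

def mse(u, v):
    # mean squared error between two real-valued arrays
    return np.mean((u - v) ** 2)

def vace_loss(A, B, a, b, alpha=1.0, beta=1.0, gamma=1.0, eps=1e-8):
    """Composite VACENet training loss, Eqs. (15)-(17).

    A, B: complex STFT coefficient arrays; a, b: the corresponding
    time-domain signals (i.e., inverse STFTs of A and B).
    """
    l_freq = alpha * (mse(A.real, B.real) + mse(A.imag, B.imag)) \
        + beta * mse(np.log(np.abs(A) + eps), np.log(np.abs(B) + eps))  # Eq. (15)
    l_time = np.mean(np.abs(a - b))                                     # Eq. (16), MAE
    return l_freq + gamma * l_time                                      # Eq. (17)
```

In practice, α and β would be balanced so that the RI and log-magnitude terms have similar magnitudes, and γ so that the time-domain term is similar or slightly smaller, as discussed above.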
4) Pre-training of the VACENet:
In this study, we consider two different pre-training strategies to initialize the VACENet. Suppose that the time-domain representations of the actual and virtual signals are denoted by x and x_v, respectively, and their STFT-domain counterparts by X and X_v, respectively. Then, the forward pass of the VACENet can be expressed as follows:

X_v = \mathcal{G}(X; \Theta_{VACE}),  (18)

where \mathcal{G}(·; \Theta_{VACE}) denotes the VACENet parameterized by \Theta_{VACE}. First, considering the observed signal as the input, the VACENet can be pre-trained to reconstruct the input signal itself [18] by minimizing the loss function L(X_v, X). Alternatively, we propose to pre-train the VACENet to estimate the late reverberation component of the input signal, denoted by X^(late), by minimizing L(X_v, X^(late)).

The rationale behind these pre-training strategies is rather simple and intuitive. Under the assumption that actual dual-channel speech recordings may not deviate significantly from each other, we employed the first method in [18], expecting the virtual signal to resemble the observed signal. However, the generated virtual signal was shown to have characteristics different from those of the observed signal [18], and the shape and scale of its waveform resembled those of the late reverberation component of the observed signal, as shown in Fig. 7 in Section IV-C. Accordingly, we suggest initializing the VACENet to produce the late reverberation component of the observed signal. For the rest of this paper, we denote the two pre-training strategies described above as PT-self and PT-late.
5) Fine-tuning of the VACENet:
As mentioned earlier, the VACENet is fine-tuned within the VACE-WPE architecture depicted in Fig. 1. The loss function is set to L(Z_1, X_1^(early)), where X_1^(early) denotes the early arriving speech component of the observed signal, X_1, and Z_1 is the output of the WPE algorithm on the actual channel side [18]; the virtual channel output, Z_v, is neglected.
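The fine-tuning signal flow can be summarized by the following sketch. Here `vacenet`, `wpe_dual`, and `loss_fn` are stand-ins for the trained network, the dual-channel neural WPE of Section II-B2, and the loss of Eq. (17); only the wiring is ours to illustrate, not the authors' code.

```python
import numpy as np

def vace_wpe_finetune_step(X1, X1_early, vacenet, wpe_dual, loss_fn):
    """One conceptual fine-tuning step of the VACE-WPE (Fig. 1).

    X1: complex STFT of the actual single-microphone observation.
    X1_early: STFT of its early arriving speech component (training target).
    """
    Xv = vacenet(X1)                        # virtual signal, Eq. (18)
    X_pair = np.stack([X1, Xv], axis=0)     # actual + virtual channels, D = 2
    Z = wpe_dual(X_pair)                    # dual-channel neural WPE output
    Z1 = Z[0]                               # keep the actual-channel output only
    return loss_fn(Z1, X1_early)            # L(Z1, X1_early); Zv is neglected
```

The gradient of this loss with respect to the VACENet parameters is what drives the fine-tuning; the LPSNet inside the WPE stays frozen.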
6) Simplification of the PSD Estimation Routine:
In addition to the architecture of the original VACE-WPE system [18], depicted in Fig. 1-(a), we propose the simplified VACE-WPE, depicted in Fig. 1-(b), by removing the contribution of the virtual signal to the PSD estimation routine expressed in Eq. (14). Accordingly, Eq. (14) can be rewritten as follows:

λ_{t,f} = |\hat{Z}_{t,f,1}|^2.  (19)

One of the motivations behind this modification is to take some burden away from the VACENet by reducing the dependency of the entire system on the model. In other words, if we consider WPE-based dereverberation as a two-stage process of early arriving speech PSD estimation (Eq. (13)) followed by decorrelation (Eqs. (9)-(12)), the VACENet in Fig. 1-(a) is expected to generate a virtual signal that contributes to both stages. In contrast, as the contribution of the virtual signal to the first stage is removed in Fig. 1-(b), the VACENet can concentrate more on the second stage. Further details regarding the simplified VACE-WPE system are provided in Section IV-B with the experimental results.

III. EXPERIMENTAL SETUP
A. On-the-fly Data Generator
To present as many random samples as possible to the neural networks during training, an on-the-fly data generator was used. Given the sets of clean speech utterances, RIRs, and noises, the data generator first randomly selects a speech utterance, an RIR, and a noise sample from each set. Then, the speech utterance is randomly cropped and subsequently convolved with the full-length RIR as well as the truncated RIR to create the reverberated speech and the early arriving speech, respectively. The noise sample is either cropped or duplicated to match the duration of the speech excerpt and added to both the reverberated and early arriving speech; the signal-to-noise ratio (SNR) is randomly chosen within a predefined range of integers.

TABLE II: PARAMETERS FOR RIR SIMULATION [24] BASED ON THE IMAGE METHOD [25]

Parameter                  | Medium          | Large
Room size (lower bound)    | [10 × × 2] m    | [30 × × 2] m
Room size (upper bound)    | [30 × × 5] m    | [50 × × 5] m
Duration                   | 1.0 s           | 2.0 s
Reflection order           | 10              | 10
Absorption coefficient     | [0.2, 0.8]      | [0.2, 0.8]
Source-receiver distance   | [1.0, 5.0] m    | [1.0, 5.0] m
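The generator logic described above can be sketched as follows. This is a simplified illustration: the function and parameter names are ours, the RIR truncation uses a plain 50 ms window after the argmax peak, and edge cases such as utterances shorter than the crop length are ignored.

```python
import numpy as np

def generate_example(speech_set, rir_set, noise_set, crop_len,
                     snr_range=(5, 15), sr=16000, early_ms=50, seed=None):
    """Draw one (reverberant, early-arriving) noisy training pair on the fly."""
    rng = np.random.default_rng(seed)
    s = speech_set[rng.integers(len(speech_set))]   # random clean utterance
    h = rir_set[rng.integers(len(rir_set))]         # random RIR
    v = noise_set[rng.integers(len(noise_set))]     # random noise sample
    # random crop of the speech excerpt
    t0 = rng.integers(0, max(1, len(s) - crop_len + 1))
    s = s[t0:t0 + crop_len]
    # truncate the RIR up to `early_ms` after its main peak
    peak = int(np.argmax(np.abs(h)))
    h_early = h[:peak + int(sr * early_ms / 1000)]
    x_rev = np.convolve(s, h)[:crop_len]            # reverberated speech
    x_early = np.convolve(s, h_early)[:crop_len]    # early arriving speech
    # crop or duplicate the noise to match the excerpt duration
    v = np.tile(v, int(np.ceil(crop_len / len(v))))[:crop_len]
    # add the noise at an integer SNR drawn from the predefined range
    snr = rng.integers(snr_range[0], snr_range[1] + 1)
    g = np.sqrt(np.sum(x_rev ** 2) / (np.sum(v ** 2) * 10.0 ** (snr / 10.0)))
    return x_rev + g * v, x_early + g * v
```

Adding the same scaled noise to both signals mirrors the description above, so the network's target differs from its input only in the late reverberation.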
B. Training Datasets

1) TrainSimuClean:
The clean speech utterances were taken from the "training" portion of the TIMIT [26] dataset, which comprises phonetically balanced English speech sampled at 16 kHz. After excluding the common-transcript utterances and filtering out those with durations of less than 2 s, we obtained 3,337 utterances from 462 speakers; the average duration of the training utterances was 3.21 s. The simulated RIRs in [24] were used for training; this RIR set is freely available and widely used in Kaldi's speech and speaker recognition recipes for data augmentation purposes [27]. A total of 16,200 medium-room and 5,400 large-room RIRs were randomly selected to construct a simulated RIR dataset for training, where we excluded the small-room RIRs to check whether the trained neural WPE variants can generalize well to small-room conditions at evaluation time. The parameters of the RIR simulation [25] are presented in Table II, and further details can be found in [24]. No additive noise samples were used in this dataset.
2) TrainSimuNoisy:
The modified LibriSpeech-80h dataset was used as the clean speech corpus, which is a subset of the LibriSpeech [28] corpus and was provided as part of the VOiCES Challenge 2019 dataset [4], [5]. It consists of read English speech sampled at 16 kHz, whose transcripts are derived from public domain audiobooks. As most of the speech samples contain considerable amounts of epenthetic silence regions, as well as silence at the beginning and end of the utterance, we employed an energy-based voice activity detector implemented in Kaldi [27] to trim the silence regions. The utterances whose duration was less than 2.8 s were filtered out after the silence removal. Consequently, we obtained 16,341 utterances from 194 speakers, with an average speech duration of 12.26 s. The simulated RIR dataset described in Section III-B1 was reused. As for the noise dataset, we used 58,772 audio samples in the DNS Challenge 2020 dataset [29], which contains audio clips selected from Google Audioset and Freesound. The dataset comprises 150 unique audio classes, including animal sounds, vehicular sounds, indoor and outdoor environment sounds originating from various things and daily supplies, music of different genres, and musical instruments.

Instead of directly feeding the raw clean speech samples to the neural network models during training, we set a limit on the dynamic range of the speech waveform amplitudes as described in the following. Suppose that x is a vector of the time-domain speech waveform amplitudes normalized to have values between -1 and 1. Then, the waveform amplitudes after applying a simple dynamic range control (DRC) scheme can be obtained as follows:

x_drc = x / (ā_max − ā_min) · r,  (20)

where ā_max and ā_min are the averages of the n largest and n smallest waveform amplitudes, respectively, and r is a constant for the DRC; n = 100 was used in this study.

TABLE III: SPECIFICATIONS OF THE REAL RIRS TAKEN FROM THE REVERB CHALLENGE DATASET [3]

Condition    | Duration | T60    | Recording distance
Small-near   | 1 s      | 0.25 s | 0.5 m
Small-far    |          |        |
Medium-near  |          |        |
Medium-far   |          |        |
Large-near   |          |        |
Large-far    |          |        |

C. Test Datasets

1) TestRealClean:
The "core test" set of the TIMIT [26] dataset was used as the clean speech corpus; no speakers or transcripts overlap with those of the TrainSimuClean dataset described in Section III-B1, and the average speech duration is 3.04 s. The entire set of utterances was randomly convolved with the real RIRs taken from the REVERB Challenge 2014 [3] dataset to create six unique test sets, each of which differs in the room size as well as the recording distance for the RIR measurement. Among the eight microphone channels [3], only the first and fifth channels were used to create the dual-channel test sets; these two channels were located on opposite sides of each other at a distance of 20 cm. The specifications of the real RIRs are presented in Table III. Similar to TrainSimuClean, TestRealClean contains no additive noise.
2) TestRealNoisy:
To create the TestRealNoisy dataset, the stationary air conditioner noise residing in each room [3], as well as the nonstationary babble and factory noise from the NOISEX-92 [30] dataset and the music samples from the MUSAN [31] dataset, were added to the TestRealClean dataset. To simulate test environments with various SNR levels, the noise samples were added to the reverberated speech with SNRs randomly chosen between 5 dB and 15 dB.

https://research.google.com/audioset
https://freesound.org

TABLE IV: LPSNET ARCHITECTURE ADOPTED AND MODIFIED FROM [32]

TABLE V: STRUCTURE OF THE DILATEDCONV1DBLOCK

Layer                                    | Kernel | Stride | Dilation
DilatedConv1D_k + ELU (for k = 1, ...)   | 3      | 1      | 2^k

D. LPSNet Specifications
We adopted the dilated convolutional network proposed in [32] as the LPSNet architecture, with a few modifications. Tables IV and V show the detailed architectures of the LPSNet and the DilatedConv1DBlock, respectively, where the latter serves as a building block for the former. In Table IV, "BN" is batch normalization [33], "ELU" is the exponential linear unit [34], and "Shortcut Sum" takes the summation of the outputs of the layers in the shaded rows. In Table V, a feature map is first processed by a stack of dilated Conv1D layers and another Conv1D layer, and further compressed to have values between 0 and 1 using the sigmoid function. This compressed representation is element-wise multiplied with the feature map fed to the DilatedConv1DBlock, thus working as an analogue to a T-F mask. Note that the input LPS features were also normalized using a trainable BN [33].

The LPSNet was trained for 65 epochs using the Adam optimizer [35], where the learning rate was halved after the 20th, 35th, 45th, and 55th epochs. Dropout regularization [36] was applied with a drop rate of 0.3 for every third mini-batch, and gradient clipping [37] was used to stabilize the training with a global norm threshold of 3.0. The weights of the LPSNet were also subject to ℓ2-regularization. The specifications regarding the mini-batch composition and the number of iterations defined for a single training epoch are presented in Table VI.

TABLE VI: HYPERPARAMETERS FOR TRAINING THE LPSNET MODELS

Dataset                   | TrainSimuClean  | TrainSimuNoisy
Mini-batch size, duration | 4, [2.0, 2.8] s | 6, [2.4, 2.8] s
SNR range                 | -               | [3, 20] dB

TABLE VII: HYPERPARAMETERS FOR TRAINING THE VACENET MODELS. p DENOTES THE DROPOUT RATE.

Dataset | TrainSimuClean                      | TrainSimuNoisy
Stage   | Pre-training       | Fine-tuning    | Pre-training | Fine-tuning
        | PT-self | PT-late  |                | PT-late      |
α       | 10      | 10       | 10             | 2            | 1
β       |         |          |                |              |
γ       | 20      | 20       | 20             | 10           | 5
p       |         |          |                |              |

E. VACENet Specifications

The architecture of the VACENet is basically the same as that of the U-Net [19], including the number of downsampling and upsampling operations and the positions of the concatenations between the encoder and decoder feature maps. Similar to the LPSNet, each attribute of the input RI components was normalized using a trainable BN [33]. In addition, the RI components of the output signal were de-normalized using pre-computed mean and variance statistics. Other details of the VACENet are described in Section II-C2 and Fig. 2.

The training of the VACENet was conducted in a manner similar to that described in Section III-D for training the LPSNet, employing the same on-the-fly mini-batching scheme presented in Table VI. Table VII shows the hyperparameters set during the pre-training and fine-tuning of the VACENet models, where the values of α, β, and γ were determined by monitoring the first few thousand iterations of the training. To make fair comparisons across the different VACE-WPE systems, all the VACENet models were trained for 60 epochs in both the pre-training and fine-tuning stages. In the pre-training stage, the learning rate was annealed by a factor of 0.2 after the 20th and 40th training epochs, and the fine-tuning stage used the same annealing schedule.

F. Evaluation Metrics
The dereverberation performance of the WPE algorithms was evaluated in terms of the perceptual evaluation of speech quality (PESQ) [38], cepstrum distance (CD), log-likelihood ratio, frequency-weighted segmental SNR (FWSegSNR) [39], and the non-intrusive normalized signal-to-reverberation modulation energy ratio (SRMR) [40]. For the metrics computation, the early arriving speech was used as the reference signal, except for the SRMR, which can be calculated from the processed signal itself.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, the experimental results and analysis ofthe VACE-WPE system are provided. The ablation studiesregarding the constituents of the VACE-WPE are providedfrom Section IV-A to IV-D; these studies are performedunder noiseless reverberant conditions; that is, the LPSNet andVACENet models are trained on
TrainSimuClean and evalu-ated on
TestRealClean . The rationale behind this design of
Fig. 3. Performance evaluation results of the VACE-WPE and baseline WPEalgorithms on
TestSimuClean . The horizontal axis denotes the LP filter order, K . The VACE-WPE employed the VACENet-b model pre-trained with thePT-self method, and was constructed as depicted in Fig. 1-(a). K trn was setto 10 during fine-tuning. experiments is that, by excluding any interferences other thanreverberation, it would be easier to observe how the differentsystem components of the VACE-WPE influence the operatingcharacteristics of the system as well as the realization of thevirtual signal. The results of noisy reverberant conditions andspeech recognition results on real recordings are provided inSection IV-E and Section IV-F, respectively.The baseline systems under comparison are the single- anddual-channel neural WPE algorithms, where the latter is fedwith actual dual-channel speech signals; for the latter, only thedereverberated signal at the first output channel will be underevaluation. Although it is not possible to exploit the dual-channel WPE in a single-microphone speech dereverberationscenario, it was included for comparison purposes. Please notethat the results for the iterative WPE [6], [7] are not presented,as it requires a cumbersome process of parameter tuning, forexample, the context parameter, δ , in Eq. (8) and the numberof iterations, per test condition; nevertheless, the performanceof the iterative WPE was slightly worse than that of the neuralWPE, when measured on our test datasets. A. Comparison to the Baselines1) Performance Analysis:
Similar to our previous study [18], we first compared the VACE-WPE with the baseline single- and dual-channel WPE algorithms. Starting with the VACE-WPE whose architecture is identical to that described in [18], the VACENet-b was pre-trained using the PT-self method and fine-tuned within the VACE-WPE architecture, as depicted in Fig. 1-(a), with K_trn set to 10. Fig. 3 shows the evaluation results on TestSimuClean in terms of the PESQ, CD, and SRMR metrics. As shown in the figure, the evaluation for each algorithm was conducted over fixed sets of LP orders with a constant step size, that is, K ∈ { , , , , } and K ∈ { , , , , } for the single-channel WPE and the dual-channel versions, respectively. Although these values may not represent the best operating points, they are sufficient for observing the performance variation of each algorithm across different LP orders and for comparing the overall performance of the different WPE-based dereverberation methods.

First, in the small room conditions, as the LP order grows, the PESQ score monotonically decreased while the CD increased. This is because large LP orders lead to overestimation of the reverberation and, consequently, to speech distortion in a room with a low reverberation time (T60). In contrast, the SRMR slightly increased with K; because it only considers the energy ratio in the modulation spectrogram [40], it cannot accurately reflect distortions relative to the reference signal. All three methods revealed the lowest CD at their smallest considered LP orders, exhibiting overall comparable performance.

In the medium room conditions, the performance measured at the far distance was clearly inferior to that measured at the near distance. Moreover, setting K too small or too large led to inaccurate estimation of the late reverberation, as demonstrated by both the PESQ and CD metrics. Unlike in the small room conditions, there are noticeable performance gaps between the single-channel WPE and the other methods, which are further emphasized in the far distance condition. Furthermore, there are operating points at which the VACE-WPE outperforms the single-channel WPE in terms of all three metrics, yet it is not competitive with the dual-channel WPE. The results in the large room conditions showed patterns similar to those observed in the medium rooms, but with overall performance degradation, which is attributed to the increased reverberation level.
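The delayed linear prediction at the core of all the WPE variants compared here can be sketched compactly. The following is a minimal, illustrative NumPy implementation of the STFT-domain recursion for a single frequency bin, not the authors' code: the function name, the alternating variance/filter updates, and the default prediction delay are our assumptions, and Eqs. (9)–(14) of the paper are only approximated in this simplified form.

```python
import numpy as np

def wpe_bin(X, K=10, delay=3, iters=3, eps=1e-8):
    """Minimal STFT-domain WPE sketch for a single frequency bin.

    X : (channels, frames) complex STFT coefficients.
    K : LP filter order (taps per channel); delay : prediction delay.
    Returns the dereverberated coefficients, same shape as X.
    """
    C, T = X.shape
    Z = X.copy()
    for _ in range(iters):
        # Time-varying PSD estimate from the current dereverberated output
        lam = np.maximum(np.mean(np.abs(Z) ** 2, axis=0), eps)  # (T,)
        # Stack delayed multi-channel observations: (C*K, T)
        Xtil = np.zeros((C * K, T), dtype=complex)
        for k in range(K):
            d = delay + k
            Xtil[k * C:(k + 1) * C, d:] = X[:, :T - d]
        # Variance-weighted normal equations for the LP filters G
        R = (Xtil / lam) @ Xtil.conj().T   # (C*K, C*K)
        P = (Xtil / lam) @ X.conj().T      # (C*K, C)
        G = np.linalg.solve(R + eps * np.eye(C * K), P)
        # Subtract the predicted late reverberation
        Z = X - G.conj().T @ Xtil
    return Z
```

Increasing K enlarges the stacked observation matrix and hence the correlation matrix R, which is why large LP orders both raise the computational cost and, as observed above, risk overestimating the reverberation in rooms with little of it.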
2) Visualization of Virtual Signals and LP Filters: As the dual-channel WPE and the VACE-WPE in [18] share the same neural WPE back-end and differ only in the type of the secondary input signal, we compared the input and output signals of the two systems. Fig. 4 shows the spectrograms, waveforms, and LP filter coefficients obtained from a sample test utterance taken from TestRealClean in the Large-near condition; the filters were calculated with K = 10. As shown in the first two rows, the generated virtual signal (X_v) appears considerably different from the pair of actual signals (X1 and X2), yet the dereverberated outputs (Z's) look similar. This implies that, other than the actual observation, there exists an alternative form of the secondary signal that facilitates blind dereverberation via Eqs. (9)–(14), and that a mechanism for generating such a signal can be learned in a data-driven manner using a neural network. A noticeable feature of the virtual signal is the scale difference: the amplitudes of the waveform were reduced by an approximate factor of 0.1, as shown in Fig. 4. This "amplitude shrinkage" started to appear in the very early stage of fine-tuning, even though the VACENet was initialized using the PT-self method to produce signals whose amplitudes are similar to those of the inputs. We conjecture that this may be attributed to setting the LP order, K_trn, to a constant during fine-tuning, which forces the VACENet to generate virtual signals that can effectively function as the secondary input for a WPE operating with a fixed LP order, regardless of the degree of reverberation measured in the observed signal. Nonetheless, the rightmost panel of Fig. 3 shows that the VACE-WPE does not break down when the LP order at inference time does not match that employed for fine-tuning.

The LP filter coefficients of the dual-channel WPE and the VACE-WPE, with K set to 10, are shown in the right panel of Fig. 4. This clearly verifies that, despite the same operations expressed by Eqs. (9)–(14), the principles behind the late reverberation estimation are completely different between the two algorithms. For example, the filters of the dual-channel WPE for both channels seem to focus more on the low-frequency bands, whereas those of the VACE-WPE [18] are concentrated on a few specific frame delay indices over a wide range of frequency bins and reveal more inter-channel asymmetry.

Fig. 4. (a) Spectrograms and waveforms of the input and output signals of the different WPE algorithms; X1 and X2 denote the actual first- and second-channel signals, X_v is the virtual signal; and Z1 and Z_v denote the WPE output signals corresponding to X1 and X_v, respectively. (b) Visualization of the (complex-valued) LP filters (K = 10) of the WPE algorithms. The label "First channel" denotes the filter applied to the first-channel input signal. In each subfigure of the filter, the left and right halves represent the real and imaginary components, respectively.

Fig. 5. Spectrograms (in log-magnitudes) obtained from the output of the LPSNet.

In terms of perceptual quality, an informal listening test revealed that the virtual signal does not necessarily sound like completely natural speech, occasionally producing machine-like sounds. This was attributed to the checkerboard artifacts [23], which inevitably appeared in some utterances. In addition, the virtual signal sounded more like a delayed and attenuated version of the observed speech, similar to the late reverberation component. Accordingly, the phonetic sounds or pronunciations of the linguistic contents still remained to some extent, but not as clearly as in the original utterance.

B. Simplification of the PSD Estimation Routine
An observation regarding the LPSNet, derived from the "amplitude shrinkage" of the virtual signal, is shown in Fig. 5. In the figure, the first two images are the outputs of the LPSNet given the actual and virtual signals as the inputs, respectively, and the last image is the average PSD obtained via Eq. (14). As seen in the figure, owing to the significant reduction in the amplitudes of the virtual signal, followed by the channel-wise average operation in Eq. (14), the average PSD is merely a faded-out version of the power spectra of the reverberant or dereverberated speech of the reference (actual) channel. Based on this observation, we hypothesized that this fadeout would adversely affect the operation of the VACE-WPE, and we therefore modified the system architecture as depicted in Fig. 1-(b). Section II-C6 further explains the simplified architecture.

Fig. 6. Performance comparison between the VACE-WPE systems before and after the simplification of the PSD estimation routine described in Section II-C6. The horizontal axis denotes the LP filter order, K. Both systems share the same VACENet-b model pre-trained with the PT-self method. K_trn was set to 10 during fine-tuning.

Fig. 6 shows the comparison between the VACE-WPE in [18] and the simplified VACE-WPE in terms of the PESQ, CD, and SRMR metrics. Herein, the simplified VACE-WPE was constructed by fine-tuning the pre-trained VACENet-b, described in Section IV-A1, within the simplified architecture; the same hyperparameters were employed for fine-tuning. Note that the results for the single-channel WPE are omitted for visual clarity. Overall, the simplification boosted both the PESQ and SRMR scores, particularly in the Medium-far and Large-far conditions by considerable margins, with only marginal increments in the CD measures. In other words, the simplified VACE-WPE can be regarded as better capable of fitting larger rooms and farther distance conditions, at the expense of a slight increase in CD. The spectrograms and waveforms of the virtual signals produced by the simplified VACE-WPE are presented in the last row of Fig. 4. Relative to the system without the simplification, the LP filters seem to exploit the virtual signal more aggressively. Meanwhile, the amplitudes of the virtual signals were amplified by an approximate factor of 2.0.

Fig. 7. Spectrograms and waveforms of the virtual signals and the oracle late reverberation signal.

For the rest of the sections, we use the simplified architecture for all the experiments.
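The fadeout is easy to reproduce numerically. The snippet below is a toy illustration only (random data standing in for STFT coefficients, and the factor 0.1 taken from the amplitude shrinkage observed in Fig. 4); it shows that the channel-wise average of Eq. (14) is then dominated by, yet sits roughly 3 dB below, the reference-channel power:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in STFT of the actual reference channel: (freq bins, frames)
X1 = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
# Virtual channel shrunk by ~0.1, as observed for X_v in Fig. 4
Xv = 0.1 * X1
# Channel-wise average PSD, in the spirit of Eq. (14)
psd_avg = 0.5 * (np.abs(X1) ** 2 + np.abs(Xv) ** 2)
ratio = psd_avg.mean() / (np.abs(X1) ** 2).mean()
print(round(ratio, 3))  # 0.505: about half the reference-channel power (~3 dB down)
```

This is why bypassing the averaging with the shrunken virtual channel, as done in the simplified architecture, keeps the PSD estimate on the power scale of the reference channel.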
C. VACENet Architecture and Pre-training Methods

As briefly mentioned in Section IV-A2, we observed a resemblance, to an extent, between the virtual signal and the late reverberation. Fig. 7 shows the spectrograms and waveforms of the virtual signals and those of the oracle late reverberation component of the observed signal; the first two were generated using the VACE-WPE [18] and its simplified version, respectively. As seen in the figure, all these signals are clearly different from the reverberant input signals (X1 and X2) depicted in Fig. 4, yet are partially similar to each other, for example, in the time-domain waveforms and in the temporal distribution of the "hot" regions of the spectrograms. Inspired by this, we proposed to pre-train the VACENet to estimate the late reverberation component of the observed signal, as described in Section II-C4.

Fig. 8 compares the PESQ and CD measures obtained from the different VACE-WPE systems, each of which is distinguished by the pre-training strategy employed and the VACENet structure; details of the four different VACENet models can be found in Fig. 2 and Table I in Section II-C2. In the figure, the results for K ∈ { , } are omitted because the simplified VACE-WPE revealed unfavorably high CD values with nearly constant PESQ and SRMR scores (see Fig. 6). First, focusing on the impact of the new pre-training strategy on the four VACENet models, the VACE-WPE systems built with the VACENet-{b, c} models revealed noticeable improvements from adopting the PT-late method in both the medium and large room conditions, and exhibited negligible differences in the small room conditions. Moreover, between VACENet-b and VACENet-c, the latter was overall superior. In contrast, when the PT-late strategy was applied to the systems built with the VACENet-{a, d} models, the performance was marginally improved in the small rooms but substantially degraded in the Medium-far, Large-near, and Large-far conditions with regard to either the PESQ or the CD measure. This is possibly due to their distinctive structure, which employs either a shared or a separate stream for both the encoder and the decoder, as depicted in Fig. 2.

Next, comparing the VACENet structures initialized with the PT-self method, VACENet-a and VACENet-c, both of which have a shared-stream decoder for modeling the RI components of the virtual signal, broadly outperformed the others in terms of both the PESQ and CD metrics. Meanwhile, VACENet-d exhibited the worst performance in the Medium-far and the large room conditions, under both the PT-self and PT-late strategies.

To summarize, among the eight VACE-WPE systems under evaluation, the combination of the VACENet-c structure and the PT-late initialization strategy showed the best performance.
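A common way to construct the early-arriving speech and the oracle late reverberation used as training targets (the PT-late target among them) is to split the room impulse response (RIR) at a fixed boundary after the direct-path peak and convolve the source with each part. The sketch below assumes a 50 ms boundary; neither the function name nor the exact boundary is taken from the paper:

```python
import numpy as np

def split_reverb(src, rir, sr=16000, early_ms=50.0):
    """Split a reverberant signal into early-arriving and late components.

    The RIR is cut `early_ms` milliseconds after its main peak (direct path);
    the two truncated RIRs are convolved with the dry source separately.
    """
    peak = int(np.argmax(np.abs(rir)))
    cut = peak + int(sr * early_ms / 1000.0)
    rir_early, rir_late = rir.copy(), rir.copy()
    rir_early[cut:] = 0.0   # keep direct path + early reflections
    rir_late[:cut] = 0.0    # keep the late tail only
    early = np.convolve(src, rir_early)[: len(src)]
    late = np.convolve(src, rir_late)[: len(src)]
    return early, late
```

By linearity of convolution, the two components sum exactly to the full reverberant signal, so a network trained to output one implicitly learns the other as the residual.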
D. Effect of the LP Order Set During the Fine-tuning

In this subsection, we investigate the effect of the LP order set during the fine-tuning of the VACENet. Based on the analysis in Section IV-C, we constructed a simplified VACE-WPE on top of the VACENet-c model initialized using the PT-late method. Fig. 9 shows the performance of the VACE-WPE systems fine-tuned with different values of the LP order, K_trn ∈ { , , , }, in terms of the PESQ, CD, and SRMR metrics. Notably, the systems trained with the relatively large LP orders of K_trn ∈ { , } tend to fail severely in the medium and large room conditions when evaluated using the smaller LP orders of K ∈ { , }. In contrast, under the same test conditions, the systems built with the relatively small LP orders of K_trn ∈ { , } showed favorable trade-offs between the PESQ and CD metrics measured with K = 5 and those measured with K = 10, while exhibiting adversely high CD values for K ∈ { , , }. These two contrasting trends indicate that the VACENet, pre-trained using the PT-late strategy, is in fact fit to generate a virtual signal that is most effective as the auxiliary input when the back-end WPE operates with an LP order close to that employed in the fine-tuning stage. This may be attributed to training the VACENet in an end-to-end manner within the WPE dereverberation framework, where the algorithm is restricted to operate with a fixed LP order. However, the VACE-WPE systems trained with K_trn ∈ { , }, even when evaluated using the matched LP orders of K ∈ { , }, failed to achieve high PESQ and SRMR scores in the Medium-far and Large-far conditions. This is explained in Fig. 10, which visualizes the training and validation losses observed during the fine-tuning of the four different VACE-WPE systems; the validation loss was computed on a small, separate validation set using K = 15. The figure shows that, unlike the systems trained with K_trn ∈ { , }, those trained with K_trn ∈ { , } fail to sufficiently reduce both the training and validation losses. Furthermore, comparing the two systems trained with K_trn = 5 and K_trn = 10, the former converged faster than the latter. These observations indicate that generating virtual input signals from scratch for a dual-channel WPE operating with relatively large LP orders is difficult, possibly because the degrees of freedom of the relevant matrices in Eqs. (9)–(11) increase with the LP order. Nonetheless, it is notable that the VACE-WPE fine-tuned with K_trn = 5 performed well in the large room conditions, even when evaluated using relatively small LP orders of K ∈ { , }. Meanwhile, in the small room conditions, the systems trained with K_trn ∈ { , } were comparable or marginally superior to those trained with K_trn ∈ { , } in terms of the PESQ and CD measures, with slightly lower SRMR scores.

Fig. 8. Performance comparison of the simplified VACE-WPE systems built with different pre-training strategies (i.e., PT-self and PT-late) and VACENet structures (i.e., VACENet-{a, b, c, d}): (a) PESQ and (b) CD. The horizontal axis denotes the LP filter order, K. K_trn was set to 10 during fine-tuning.

Fig. 9. Performance comparison of the simplified VACE-WPE systems fine-tuned with different LP orders, K_trn ∈ { , , , }. The VACENet-c model pre-trained using the PT-late method was adopted for fine-tuning. The horizontal axis denotes the LP filter order, K.

Fig. 10. Training and validation losses observed during fine-tuning the simplified VACE-WPE with the different LP orders, K_trn ∈ { , , , }. The validation loss was calculated with K = 15, and the loss values are depicted for every third epoch for visual clarity.

E. Results in Noisy Reverberant Conditions
In this subsection, the performance of the VACE-WPE is verified under noisy reverberant test conditions. Both the LPSNet and VACENet-c models were trained using the TrainSimuNoisy dataset, as described in Sections III-D and III-E, and the PT-late strategy was adopted to pre-train the VACENet. Herein, the early arriving speech plus noise was employed as the target signal for training the LPSNet and VACENet, as the WPE algorithm is only capable of blind dereverberation and is not explicitly designed for noise removal. Based on the observation from Fig. 10, we fine-tuned the VACENet by gradually increasing the LP filter order, K_trn, as the training progresses. More specifically, for every mini-batch, K_trn was randomly chosen within the set S_K = {K | K_trn^lower ≤ K ≤ K_trn^upper} ⊂ Z+, and the optimization was performed using the selected LP order; K_trn^lower was fixed at 4, and K_trn^upper was initially set to 6 and increased to 9, 12, 15, 18, and 21 after the 15th, 25th, 35th, 44th, and 52nd epochs, respectively.

The evaluation results on the TestRealNoisy dataset are shown in Figs. 11 and 12, where the former presents the results measured in the small room environment and the latter those in the medium and large rooms. Comparing the single-channel WPE and the VACE-WPE, the latter tends to exhibit operating points generally superior to those of the former in terms of all the evaluation metrics considered. Similar to the results obtained in Section IV-A, the performance gap between the two algorithms further increased in the far-field speaking conditions, particularly with regard to the PESQ, SRMR, and FWSegSNR metrics. Moreover, the VACE-WPE was also favorably comparable to the dual-channel WPE, revealing marginally better PESQ measures in the babble and factory noise conditions in various room environments and moderately higher SRMR scores in the Medium-far and Large-far conditions. Interestingly, these SRMR scores measured with different values of the LP order imply that the VACE-WPE is better at producing "dry" signals than the dual-channel WPE when using relatively small LP orders. Finally, considering that there is a mismatch between the clean speech corpus of TrainSimuNoisy and that of TestRealNoisy, it can be stated that the training of the VACE-WPE generalizes well to a larger corpus instead of simply overfitting to a small-scale dataset.
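The mini-batch-wise LP-order sampling described above can be written as a small scheduler. The epoch boundaries and bounds are those stated in the text (lower bound fixed at 4; upper bound 6, raised to 9, 12, 15, 18, and 21 after epochs 15, 25, 35, 44, and 52); the helper names are ours:

```python
import random

# (epoch after which the bound applies, new upper bound), from the text
SCHEDULE = [(15, 9), (25, 12), (35, 15), (44, 18), (52, 21)]
K_LOWER = 4

def k_upper(epoch):
    """Upper bound K_trn^upper for the given (1-indexed) epoch."""
    upper = 6  # initial value
    for boundary, value in SCHEDULE:
        if epoch > boundary:
            upper = value
    return upper

def sample_k_trn(epoch, rng=random):
    """Draw K_trn uniformly from {K_lower, ..., K_upper} for one mini-batch."""
    return rng.randint(K_LOWER, k_upper(epoch))
```

Starting from small upper bounds sidesteps the convergence problem seen in Fig. 10 for large fixed K_trn, while the growing range keeps the VACENet from specializing to a single inference-time LP order.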
F. Speech Recognition Results on Real Recordings

In this subsection, we verify the performance of the various speech dereverberation methods as front-ends for the automatic speech recognition (ASR) task. Specifically, we followed the protocol of the ASR task of the VOiCES Challenge 2019 [4], [5], a recent benchmark on far-field ASR in challenging noisy reverberant room environments. The challenge provides two sets of utterances for system development and evaluation, namely the "dev" and "eval" sets [4], [5]; each set consists of a small portion of the VOiCES corpus [41]. The VOiCES corpus is a re-recorded subset of the LibriSpeech dataset [28], and the re-recording was performed using twelve microphones of different types and locations in the presence of background noise, for example, fan, babble, music, and television [41]. To build the baseline ASR system, we used an open-source script (https://github.com/freewym/kaldi-voices, kaldi/egs/librispeech/s5/local/chain/tuning/run_cnn_tdnn_1a.sh) that partially implements the system described in [42] based on the Kaldi [27] toolkit. The acoustic model was built using the modified LibriSpeech-80h dataset [4], [5] after applying the standard data augmentation and speed perturbation [24] provided by the Kaldi recipes [27]; 40-dimensional log-mel-filterbank energies, extracted with a 25 ms window and a 10 ms hop size, were used as the input acoustic features. A 3-gram statistical language model constructed from the transcripts of the training utterances was employed for decoding.

Tables VIII and IX present the SRMR scores and word error rates (WERs) obtained using the different speech dereverberation methods, respectively. For the single-channel WPE and the VACE-WPE, the LP filter order, K, was set to 80 and 35, respectively; further increasing K did not significantly improve the performance of either algorithm. As shown in the tables, besides the single-channel WPE, two fully neural speech dereverberation models, namely the LPSNet-Drv and the VACENet-c-Drv, were also compared. More specifically, the LPSNet-Drv was implemented by simply combining the dereverberated magnitude spectra, estimated by the trained LPSNet, with the phase spectra of the reverberant observation. The VACENet-c-Drv was obtained by training a neural network, whose structure is identical to that of the VACENet-c, to estimate the RI components of the early arriving speech plus noise. These models allow a direct comparison between i) employing a neural network to directly estimate the early arriving speech component and ii) employing a neural network to generate the virtual signal and subsequently letting the pre-trained dual-channel neural WPE perform the dereverberation.

TABLE VIII
SRMR scores measured on the VOiCES Challenge dataset

Method | Raw signal | WPE-single (K = 80) | VACE-WPE (K = 35) | LPSNet-Drv | VACENet-c-Drv
dev    | 2.30       | 2.80                |                   |            |
eval   | 2.07       | 2.59                |                   |            |

TABLE IX
WER (%) measured on the VOiCES Challenge dataset

Method | Raw signal | WPE-single (K = 80) | VACE-WPE (K = 35) | LPSNet-Drv | VACENet-c-Drv
dev    | 24.2       |                     |                   |            |

Table VIII shows that the VACE-WPE and the VACENet-c-Drv reveal significantly higher SRMR scores than the other methods and are comparable with each other. However, as shown in Table IX, the single-channel WPE achieved the lowest WER on both sets, followed by the VACE-WPE with slightly worse performance; both the LPSNet-Drv and the VACENet-c-Drv failed to reduce the WER. Accordingly, it can be stated that the proposed VACE-WPE achieves a good balance between objective speech quality improvement and front-end processing for the ASR task in terms of dereverberation.

Table X further presents the results obtained after performing lattice interpolation [43] on the ASR output lattices generated using the single-channel WPE front-end and those generated using the VACE-WPE; the scaling factor, λ, was varied from 0.1 to 0.9. Absolute WER reductions of 0.3% and 0.9%, achieved on the "dev" and "eval" sets, respectively, indicate that the single-channel WPE and the VACE-WPE can be complementary as speech dereverberation front-ends for the ASR task.

TABLE X
WER (%) after performing lattice interpolation [43] between the ASR output lattices generated using the single-channel WPE and those using the VACE-WPE. λ was applied to the former and 1 − λ to the latter.

Fig. 11. Speech dereverberation performance on TestRealNoisy in the small room environment: (a) air conditioner, (b) babble, (c) factory, and (d) music. The horizontal axis denotes the LP filter order, K.

V. CONCLUSIONS
In this study, we first investigated the properties of the VACE-WPE system via ablation studies, which led to the introduction of a simplified architecture and new strategies for training the neural network for the VACE. Based on these findings, the performance of the VACE-WPE was further examined with regard to i) the objective quality of the dereverberated speech under noisy reverberant conditions and ii) the ASR results measured on real noisy reverberant recordings. The experimental results and analysis indicate that neural-network-based virtual signal generation followed by the modified neural WPE back-end provides an effective speech dereverberation algorithm for single-microphone offline processing scenarios.

REFERENCES

[1] H. Kuttruff, Room Acoustics. Boca Raton, FL, USA: CRC Press, 2016.
[2] J. S. Bradley, H. Sato, and M. Picard, "On the importance of early reflections for speech in rooms," J. Acoust. Soc. Amer., vol. 113, no. 6, pp. 3233–3244, 2003.
[3] K. Kinoshita et al., "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., 2013, pp. 1–4.
[4] M. K. Nandwana et al., "The VOiCES from a distance challenge 2019 evaluation plan," arXiv:1902.10828, 2019.
[5] M. K. Nandwana et al., "The VOiCES from a distance challenge 2019," in Proc. INTERSPEECH, 2019, pp. 2438–2442.
[6] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B. Juang, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1717–1731, 2010.
[7] T. Yoshioka and T. Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 10, pp. 2707–2720, 2012.
[8] Y. Iwata and T. Nakatani, "Introduction of speech log-spectral priors into dereverberation based on Itakura-Saito distance minimization," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2012, pp. 245–248.
[9] A. Jukić and S. Doclo, "Speech dereverberation using weighted prediction error with Laplacian model of the desired speech," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2014, pp. 5172–5176.
[10] M. Novey, T. Adali, and A. Roy, "A complex generalized Gaussian distribution–Characterization, generation, and estimation," IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1427–1433, 2010.
[11] A. Jukić, T. van Waterschoot, T. Gerkmann, and S. Doclo, "Multi-channel linear prediction-based speech dereverberation with sparse priors," IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 9, pp. 1509–1520, 2015.
[12] S. R. Chetupalli and T. V. Sreenivas, "Late reverberation cancellation using Bayesian estimation of multi-channel linear predictors and Student's t-source prior," IEEE Trans. Audio, Speech, Lang. Process., vol. 27, no. 6, pp. 1007–1018, 2019.
[13] K. Kinoshita, M. Delcroix, H. Kwon, T. Hori, and T. Nakatani, "Neural network based spectrum estimation for online WPE dereverberation," in Proc. INTERSPEECH, 2017, pp. 384–388.
[14] P. N. Petkov, V. Tsiaras, R. Doddipatla, and Y. Stylianou, "An unsupervised learning approach to neural-net-supported WPE dereverberation," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, pp. 5761–5765.
[15] J. Heymann, L. Drude, R. Haeb-Umbach, K. Kinoshita, and T. Nakatani, "Joint optimization of neural network-based WPE dereverberation and acoustic model for robust online ASR," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, pp. 6655–6659.
[16] S. R. Chetupalli and T. V. Sreenivas, "Clean speech AE-DNN PSD constraint for MCLP based reverberant speech enhancement," in Proc. Eur. Signal Process. Conf., 2019, pp. 1–5.
[17] T. Taniguchi, A. S. Subramanian, X. Wang, D. Tran, Y. Fujita, and S. Watanabe, "Generalized weighted-prediction-error dereverberation with varying source priors for reverberant speech recognition," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., 2019, pp. 293–297.
[18] J.-Y. Yang and J.-H. Chang, "Virtual acoustic channel expansion based on neural networks for weighted prediction error-based speech dereverberation," in Proc. INTERSPEECH, 2020, pp. 3930–3934.
[19] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Interv., 2015, pp. 234–241.
[20] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," in Proc. Int. Conf. Mach. Learn., 2017, pp. 933–941.
[21] K. Tan and D. Wang, "Complex spectral mapping with a convolutional recurrent network for monaural speech enhancement," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, pp. 6865–6869.
[22] S.-W. Fu, T.-y. Hu, Y. Tsao, and X. Lu, "Complex spectrogram enhancement by convolutional neural network with multi-metrics learning," in Proc. Int. Workshop Mach. Learn. Signal Process., 2017, pp. 1–6.
[23] A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard
[24] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 5220–5224.
[25] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., vol. 65, no. 4, pp. 943–950, 1979.

Fig. 12. Speech dereverberation performance on TestRealNoisy in the medium and large room environments: (a) air conditioner, (b) babble, (c) factory, and (d) music. The horizontal axis denotes the LP filter order, K.

[26] J. S. Garofolo, "TIMIT acoustic phonetic continuous speech corpus," Linguistic Data Consortium, 1993.
[27] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proc. IEEE Workshop Automat. Speech Recognit. Understanding, 2011.
[28] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 5206–5210.
[29] C. K. A. Reddy et al., "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective speech quality and testing framework," arXiv:2001.08662, 2020.
[30] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun., vol. 12, no. 3, pp. 247–251, 1993.
[31] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv:1510.08484, 2015.
[32] S. Pirhosseinloo and J. S. Brumberg, "Monaural speech enhancement with dilated convolutions," in Proc. INTERSPEECH, 2019, pp. 3143–3147.
[33] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167, 2015.
[34] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units," arXiv:1511.07289, 2015.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.
[36] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[37] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. Int. Conf. Mach. Learn., 2013, pp. 1310–1318.
[38] "P.862.2: Wideband extension to recommendation P.862 for the assessment of wideband telephone networks and speech codecs," ITU-T Recommendation, 2005.
[39] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229–238, 2007.
[40] J. F. Santos, M. Senoussaoui, and T. H. Falk, "An improved non-intrusive intelligibility metric for noisy and reverberant speech," in Proc. Int. Workshop Acoust. Signal Enhance., 2014, pp. 55–59.
[41] C. Richey et al., "Voices obscured in complex environmental settings (VOiCES) corpus," in Proc. INTERSPEECH, 2018, pp. 1566–1570.
[42] Y. Wang, D. Snyder, H. Xu, V. Manohar, P. S. Nidadavolu, D. Povey, and S. Khudanpur, "The JHU system for VOiCES from a distance challenge 2019," in Proc. INTERSPEECH, 2019, pp. 2488–2492.
[43] D. Povey et al., "Generating exact lattices in the WFST framework," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2012.