Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement
Jun Qi, Hu Hu, Yannan Wang, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee
Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Computer Engineering School, University of Enna, Italy
Tencent Media Lab, Tencent Corporation, Shenzhen, Guangdong, China
Abstract
This paper investigates different trade-offs between the number of model parameters and enhanced speech quality by employing several deep tensor-to-vector regression models for speech enhancement. We find that a hybrid architecture, namely CNN-TT, is capable of maintaining good quality performance with a reduced model parameter size. CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality, and a tensor-train (TT) output layer on the top to reduce model parameters. We first derive a new upper bound on the generalization power of convolutional neural network (CNN) based vector-to-vector regression models. Then, we provide experimental evidence on the Edinburgh noisy speech corpus to demonstrate that, in single-channel speech enhancement, CNN outperforms DNN at the expense of a small increment in model size. Moreover, CNN-TT slightly outperforms the CNN counterpart while utilizing only 32% of the CNN model parameters, and a further performance improvement can be attained if the number of CNN-TT parameters is increased to 44% of the CNN model size. Finally, our experiments on multi-channel speech enhancement with a simulated noisy WSJ0 corpus demonstrate that our proposed hybrid CNN-TT architecture achieves better results than both DNN and CNN models in terms of better enhanced speech quality and smaller parameter sizes.
Index Terms: convolutional neural network, tensor-train network, tensor-to-vector regression, speech enhancement
1. Introduction
A speech enhancement system aims at restoring the quality and intelligibility of noisy speech. State-of-the-art speech enhancement systems are commonly built with deep neural network (DNN) based vector-to-vector regression models, where inputs are context-dependent log power spectrum (LPS) features of noisy speech and outputs correspond to either clean or enhanced LPS features. Although DNN based speech enhancement [1, 2] has demonstrated state-of-the-art performance in a single-channel setting, it can also be extended to multi-channel speech enhancement scenarios with even better enhanced speech quality [3]. Both single- and multi-channel speech enhancement can be cast as DNN based vector-to-vector regression, aiming at learning a functional relationship f : Y → X such that input noisy speech y ∈ Y can be mapped to the corresponding clean speech x ∈ X. In [1, 4], DNNs with feed-forward fully-connected (FC) hidden layers were proposed to attain state-of-the-art speech enhancement performance on the target tasks, and the related theorems were later established in [5, 6, 7]. In follow-up studies, recurrent neural networks (RNNs) [8, 9] and convolutional neural networks (CNNs) [10] were further investigated to boost speech enhancement quality [11]. Moreover, a deep bidirectional RNN with LSTM gates was instead used in [12], and a generative adversarial network (GAN) was attempted for speech enhancement tasks in [13]. In particular, a CNN is a tensor-to-vector regression model because it is capable of dealing with 3D/4D tensorized input data, and recent works [10, 14] suggest that CNNs can outperform both DNN and RNN counterparts for speech enhancement. Similarly, a tensor-to-vector regression model can also be built by directly employing the tensor-train network (TTN) [15]; TT-DNN is a compact representation of the fully-connected (FC) layers of a DNN in a tensor-train (TT) format [16].
In [17], we were the first to attempt a tensor-train deep neural network (TT-DNN) to tackle the multi-channel speech enhancement task, demonstrating that the TT representation of a DNN does not degrade the quality of the enhanced speech while also significantly reducing the number of model parameters. More importantly, the quality of speech enhancement can be improved over the DNN counterpart by allowing the TT-DNN parameters to grow.

A significant advantage of tensor-to-vector regression models, such as CNN and TT-DNN, is their compact architecture, which suits settings with stringent hardware constraints, where computational resources are often limited. Therefore, it is worth investigating these models in terms of representation power, and experimentally comparing them by considering the trade-off between enhancement performance and the number of model parameters. On one hand, a CNN is a powerful model to learn spatial-temporal features and extract semantically meaningful aspects in its higher hidden layers. On the other hand, a TT-DNN can maintain the baseline results of the corresponding DNN by applying the TT transformation to the FC hidden layers. Hence, in this work, we focus on a tensor-to-vector model that takes advantage of both CNN and TT-DNN. More specifically, we propose a novel hybrid architecture, namely CNN-TT, with convolutional layers stacked at the bottom and one TT hidden layer on the top. To highlight the advantages of CNN-TT, we compare different deep tensor-to-vector models for speech enhancement: (a) DNN; (b) CNN; (c) TT-DNN; (d) CNN-TT. In more detail, we first explain the fundamental mechanisms of tensor-to-vector regression based on our theorems for DNN based vector-to-vector regression [5, 6, 18, 19].
Then, we validate our CNN-TT models in speech enhancement tasks.

Figure 1: Four tensor-to-vector regression models used in this study.

Our experimental results show that, in single-channel speech enhancement on the Edinburgh noisy speech corpus [20], CNN outperforms the best DNN with a small increment in parameter size. Moreover, our proposed CNN-TT slightly outperforms a
CNN with only 32% of the CNN model size. A further improvement can be attained if the size of the CNN-TT model is increased up to 44% of the CNN model size. Finally, the experiments on a multi-channel speech enhancement task with a simulated noisy WSJ0 corpus [21] show the same trend: our proposed hybrid CNN-TT architecture compares favorably with both DNN and CNN models, achieving better enhanced speech quality with much smaller model sizes.
2. Deep Tensor-to-vector Regression
Figure 1 shows all regression network architectures studied here: (a) DNN, (b) CNN, (c) DNN-TT, and (d) CNN-TT.
2.1. CNN Based Tensor-to-vector Regression

CNN follows a feed-forward architecture that transforms a tensor input into a vector output through a sequence of convolutional layers [22]. The CNN based tensor-to-vector regression model has four two-dimensional (2D) convolutional layers, each having twice the number of channels of the previous layer. ReLU activation and batch normalization components are appended at the output of each convolutional layer. A fully-connected (FC) layer is employed as the last hidden layer of the neural architecture to generate the desired enhanced speech vectors. A typical convolutional layer transforms a three-dimensional input tensor
X ∈ R^{W × H × C} into an output tensor Y ∈ R^{(W − L + 1) × (H − L + 1) × S} by convolving X with a kernel tensor K ∈ R^{L × L × C × S} as:

$$\mathcal{Y}(x, y, s) = \sum_{i=1}^{L} \sum_{j=1}^{L} \sum_{c=1}^{C} \mathcal{K}(i, j, c, s)\, \mathcal{X}(x + i - 1,\, y + j - 1,\, c).$$

In [5], we studied the representation power of DNN based vector-to-vector regression and derived upper bounds for different DNN architectures. That study allows us to better understand the successful application of DNNs to the speech enhancement tasks observed in [23]. To extend the theorems proposed in [5] to CNNs, we need to obtain a matrix representation for both the input and the kernel of the CNN. Thus, setting W' = W − L + 1 and H' = H − L + 1, we introduce a matrix X̃ of size W'H' × L²C, in which the k-th row corresponds to the L × L × C patch of the input tensor that is used to compute the k-th row of the matrix Y:

$$\mathcal{X}(x + i - 1,\, y + j - 1,\, c) = \tilde{X}\big(x + W'(y - 1),\; i + L(j - 1) + L^2(c - 1)\big),$$

where y = 1, ..., H', x = 1, ..., W', and i, j = 1, ..., L. The kernel tensor K can be reshaped into a matrix K̃ of size L²C × S as follows:

$$\mathcal{K}(i, j, c, s) = \tilde{K}\big(i + L(j - 1) + L^2(c - 1),\; s\big).$$

Finally, a convolutional layer can be rewritten in matrix form as Y = X̃K̃, and the process is illustrated in Figure 2.

Figure 2: Convolution as a matrix-by-matrix multiplication.
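The patch-flattening construction above can be checked numerically. The following sketch (plain NumPy, with illustrative variable names) compares a direct valid 2D convolution against its im2col matrix-by-matrix form:

```python
import numpy as np

def conv2d_direct(X, K):
    """Valid 2D convolution of input X (W, H, C) with kernel K (L, L, C, S)."""
    W, H, C = X.shape
    L, _, _, S = K.shape
    Y = np.zeros((W - L + 1, H - L + 1, S))
    for x in range(W - L + 1):
        for y in range(H - L + 1):
            for s in range(S):
                Y[x, y, s] = np.sum(K[:, :, :, s] * X[x:x + L, y:y + L, :])
    return Y

def conv2d_as_matmul(X, K):
    """Same convolution via im2col: each row of X_mat is one flattened patch."""
    W, H, C = X.shape
    L, _, _, S = K.shape
    Wp, Hp = W - L + 1, H - L + 1
    X_mat = np.zeros((Wp * Hp, L * L * C))
    for y in range(Hp):
        for x in range(Wp):
            # Row index x + Wp * y mirrors the paper's x + W'(y - 1) mapping.
            X_mat[x + Wp * y] = X[x:x + L, y:y + L, :].reshape(-1)
    K_mat = K.reshape(L * L * C, S)  # kernel reshaped to (L^2 C, S)
    Y_mat = X_mat @ K_mat            # one matrix-by-matrix multiplication
    return Y_mat.reshape(Hp, Wp, S).transpose(1, 0, 2)
```

Both routines produce identical outputs; the matrix form simply exposes the convolution as the product Y = X̃K̃ described above.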
We are now ready to link CNNs with our theorems for DNN-based vector-to-vector regression in [5]. Let f̂ : R^d → R^q denote a smooth vector-to-vector function; then we can find a deep CNN f_CNN with B ReLU-activated layers such that Eq. (1) is satisfied:

$$\|\hat{f} - f_{\mathrm{CNN}}\|_1 = O\left(\frac{q}{(L_B^2 C_B + B - 1)^{1/d}}\right), \qquad (1)$$

where C_B and L_B denote the number of channels and the width of the B-th CNN layer.

2.2. DNN-TT Based Tensor-to-vector Regression

A DNN-TT based tensor-to-vector regression model relies on the TT decomposition, which is described as follows: for a set of integer ranks r = {r_1, r_2, ..., r_{K+1}}, the TT decomposition factorizes a tensor W ∈ R^{(m_1 n_1) × (m_2 n_2) × ··· × (m_K n_K)}, with positive integers m_i, n_i for all i ∈ {1, ..., K}, into a product of core tensors:

$$\mathcal{W}((i_1, j_1), (i_2, j_2), \ldots, (i_K, j_K)) = \prod_{k=1}^{K} \mathcal{C}^{[k]}(r_k, i_k, j_k, r_{k+1}), \qquad (2)$$

where, for the given ranks r_k and r_{k+1}, the k-th core tensor slice C^{[k]}(r_k, ·, ·, r_{k+1}) ∈ R^{m_k × n_k}, with i_k ∈ {1, 2, ..., m_k} and j_k ∈ {1, 2, ..., n_k}; r_1 and r_{K+1} are fixed to 1. A DNN-TT only stores the low-rank core tensors {C^{[k]}}_{k=1}^{K}, of total size Σ_{k=1}^{K} m_k n_k r_k r_{k+1}, which is much less than the size ∏_{k=1}^{K} m_k n_k of the corresponding DNN weight matrix.

Figure 3: A conversion from a DNN to a DNN-TT.

Figure 3 shows the relationship between a traditional hidden layer of a DNN and a tensor layer of a DNN-TT. The matrix associated with a DNN hidden layer corresponds to two matrices given the ranks, and the DNN input vector is reshaped into a higher-order input tensor. We have shown that the TT decomposition can keep the representation power of a DNN [17]. In [17], we have also demonstrated that for a tensor-to-vector function T̂ : R^{J_1 × J_2 × ··· × J_K} → R^{I_1 · I_2 ··· I_K}, there is a DNN-TT T with K hidden tensor layers such that Eq. (3) is satisfied.
$$\|\hat{T} - T\|_1 = O\left(\prod_{k=1}^{K} I_k \left(r_{k-1} r_k\, n_{k,B} + B - 1\right)^{-\frac{r_{k-1} r_k}{J_k}}\right), \qquad (3)$$

where n_{k,B} is the width of the B-th hidden layer for the k-th core tensor. Eq. (3) suggests that DNN-TT can maintain the representation power of the corresponding DNN.

2.3. Hybrid CNN-TT Architecture

Figure 1(d) displays a hybrid tensor-to-vector regression model having both convolutional and tensor-train layers. A key benefit of this hybrid tensorized model is that the number of parameters of the original FC layer is significantly reduced by one TTN. Moreover, we can expect that the representation power of salient input features is preserved because of the convolutional blocks in the lower layers. The representation power of CNN-TT combines the characteristics of both CNN and TT-DNN, and Eq. (4) gives an upper bound on the performance of CNN-TT based tensor-to-vector regression. The derivation of the upper bound is based on the combination of Eqs. (2) and (3) [17]:

$$\|\hat{T} - T\|_1 = O\left(\prod_{k=1}^{K} I_k \left(r_{k-1} r_k\, c_{k,B} + B - 1\right)^{-\frac{r_{k-1} r_k}{J_k}}\right), \qquad (4)$$

where ∏_{k=1}^{K} c_{k,B} = L_B^2 C_B and the other notations are the same as in Eqs. (2) and (3).
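The storage saving behind the TT format, and the rank-index product in Eq. (2), can be illustrated with a small NumPy sketch. The shapes and ranks below are arbitrary examples for illustration, not the paper's actual settings:

```python
import numpy as np

def tt_param_count(ms, ns, ranks):
    """Parameters stored by a TT layer, sum_k m_k n_k r_k r_{k+1},
    versus prod_k m_k * prod_k n_k for the dense matrix it represents."""
    tt = sum(m * n * ranks[k] * ranks[k + 1]
             for k, (m, n) in enumerate(zip(ms, ns)))
    dense = int(np.prod(ms)) * int(np.prod(ns))
    return tt, dense

def tt_entry(cores, idx):
    """One entry W((i1, j1), ..., (iK, jK)) as the chained matrix product
    of core slices C[k][:, i_k, j_k, :], following Eq. (2).
    Each core has shape (r_k, m_k, n_k, r_{k+1}), with r_1 = r_{K+1} = 1."""
    acc = np.eye(1)
    for core, (i, j) in zip(cores, idx):
        acc = acc @ core[:, i, j, :]  # (r_k, r_{k+1}) rank-index slice
    return acc[0, 0]
```

For example, with ms = [2, 3, 4], ns = [3, 2, 2] and ranks [1, 2, 3, 1], the TT layer stores 72 numbers in place of the 288 entries of the equivalent dense 24 × 12 matrix, which is the mechanism behind the parameter reductions reported below.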
3. Experiments and Result Analysis
The proposed architectures were evaluated on two different speech databases. The first is the Edinburgh noisy speech database [20], where clean utterances were recorded from male and female speakers from different accent regions of both Scotland and the United States. Clean data were randomly split into training and test waveforms. The noisy training speech materials, at four SNR levels (15dB, 10dB, 5dB, and 0dB), were created by corrupting the clean waveforms with the following noises: a domestic noise (inside a kitchen), an office noise (in a meeting room), three public space noises (cafeteria, restaurant, subway station), two transportation noises (car and metro), and a street noise (busy traffic intersection). In total, there were 40 different noisy backgrounds for synthesizing the noisy training data (ten noises × four SNRs). As for the noisy test set, the noise types included a domestic noise (living room), an office noise (office space), one transport noise (bus), and two street noises (open area cafeteria and a public square), at SNR values of 17.5dB, 12.5dB, 7.5dB, and 2.5dB. Therefore, there were 20 different noisy backgrounds for synthesizing the test data.

The second is a synthesized database with 30 hours of simulated material obtained by corrupting the clean WSJ0 corpus [21] with the OSU noise dataset [24], split into training and test waveforms. To simulate the noisy data, each waveform was corrupted with one kind of background noise from the noise set. The target speech and an additional interfering speech signal were convolved with their corresponding RIRs to generate the final noisy waveform. In doing so, the dataset contained additive noise, interfering speakers, and reverberation. Before setting up the training and testing sets, an improved image-source method (ISM) [25] was used to generate RIRs over a range of reverberation times (RT60) and the corresponding direct-path response for each microphone channel. For both training and test datasets, the RIR settings were fixed to the same conditions, such as the room size, RT60, and all of the distances and directions. Additional details about the data simulation procedure can be found in [3, 17].

In all experiments, we use normalized log-power spectral (LPS) feature vectors as inputs. LPS features were generated by computing the Fourier transform on short speech segments. For each input frame, M neighboring frames on each side were concatenated, resulting in a D × (2M + 1) × B dimensional feature, where D is the number of frequency bins and B is the channel number of the input signal. As for the setup of TT-DNN, we ignored the first dimension of the input LPS features because it corresponds to the direct-current component. After the regression, the first dimension of the input was concatenated back to the output without any change. The clean speech features were assigned to the top layers of the tensor-to-vector regression models as the reference during the training stage.

The DNN based regression model was adopted as the baseline. On the Edinburgh data set, the DNN model consisted of several fully-connected hidden layers, while on the WSJ0 simulated dataset, we set up a DNN model with a fixed hidden dimension across layers.

Table 1: PESQ comparisons of single-channel deep speech enhancement models on the Edinburgh noisy speech database.

Moreover, the CNN models kept similar deep tensor-to-vector structures in all experiments and were composed of four convolutional layers with a gradually increasing number of channels, doubling at each layer. The ReLU activation function and batch normalization were also utilized in each convolutional layer, and two FC layers were stacked on the top to generate the output vectors.
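As a rough illustration of the LPS feature pipeline described above, the sketch below frames a waveform, computes log-power spectra, and concatenates the 2M + 1 neighboring frames per input frame. The frame length, hop size, and window are placeholder choices, since the paper's exact values are not preserved here:

```python
import numpy as np

def lps_with_context(wave, n_fft=512, hop=256, M=3, eps=1e-12):
    """Log-power-spectrum features with M context frames on each side.
    n_fft / hop are illustrative assumptions, not the paper's settings.
    Returns an array of shape (num_frames, n_bins * (2M + 1))."""
    window = np.hamming(n_fft)
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        spec = np.fft.rfft(wave[start:start + n_fft] * window)
        frames.append(np.log(np.abs(spec) ** 2 + eps))  # log-power spectrum
    frames = np.asarray(frames)                          # (T, n_fft // 2 + 1)
    # Edge-pad so every frame has M neighbors on both sides.
    padded = np.pad(frames, ((M, M), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * M + 1].reshape(-1)
                     for t in range(len(frames))])
```

For a multi-channel input, the same context-expanded features would be stacked once per channel, matching the D × (2M + 1) × B feature dimension discussed above.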
Besides, we used different kernel sizes on the two datasets to obtain two slightly different model sizes. Moreover, to improve subjective perception in the speech enhancement tasks, global variance equalization was applied to alleviate the over-smoothing problem by correcting the global variance between the estimated features and the clean reference targets, and noise-aware training (NAT) was also employed to enable awareness of non-stationary noise. The mean squared error (MSE) loss was applied, which corresponds to the norm upper bounds in Eqs. (1), (3), and (4). The Adam optimizer [26] was utilized during training, and the back-propagation (BP) algorithm was used to update the model parameters. The size of the context window at the input layer was set separately for the DNN on the Edinburgh data, the DNN on the WSJ0 simulated data, and the CNN models. The perceptual evaluation of speech quality (PESQ) [27] was employed in our experimental validation.

Table 1 shows our experimental results on the Edinburgh noisy speech data set. Tensor-to-vector regression based on CNN outperforms the DNN baseline in terms of a higher PESQ score. DNN-TT, with much fewer parameters, can maintain the experimental performance of the DNN, where the TT transformation was applied to the fully-connected layers. More importantly, by combining convolutional and TT layers, the proposed CNN-TT attains the highest PESQ score. If we allow the size of the CNN-TT model to increase up to 5.05M parameters, a better speech enhancement quality can be attained, with a PESQ score of 3.13.
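The global variance equalization mentioned in the setup above can be sketched as a per-dimension rescaling of the enhanced features toward the clean-target statistics. This is a simplified illustration of the idea, not the paper's exact implementation:

```python
import numpy as np

def gv_equalize(enhanced, clean_ref):
    """Global variance equalization (sketch): rescale each feature
    dimension of the enhanced features so its global standard deviation
    matches that of the clean reference statistics, reducing the
    over-smoothing typical of MSE-trained regression outputs."""
    mu = enhanced.mean(axis=0)
    ratio = clean_ref.std(axis=0) / (enhanced.std(axis=0) + 1e-12)
    # Center, rescale variance, and restore the original mean.
    return (enhanced - mu) * ratio + mu
```

After this step, the enhanced features share the global variance of the clean targets while keeping their per-dimension means unchanged.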
The evaluation results on the 30-hour WSJ0 simulated multi-channel data are shown in Table 2. The experimental results of both DNN and DNN-TT are in line with those reported in [17]. The DNN-TT model can significantly reduce the number of parameters without degrading the performance. Moreover, the CNN based tensor-to-vector regression model outperforms the DNN based one; thus, CNN offers both parameter reduction and improved enhanced speech quality over DNN.

Table 2: PESQ comparisons of different deep models for multi-channel speech enhancement on the WSJ0 corpus. The average PESQ score for unprocessed ch-1 noisy speech is 2.02.

In more detail, for the single-channel case, CNN-TT attains a PESQ score of 3.04 using 2.8M parameters, whereas CNN attains a comparable score of 3.03 but costs more than 9.4M parameters. If the number of parameters is reduced to as little as 1.6M, the PESQ score decreases. For our two-channel experiments, the CNN baseline has the same number of parameters as the single-channel one because the convolutional layers can be directly adapted to the multi-channel inputs. However, if the two fully connected layers in the CNN-based architecture are replaced with tensor-train layers, the model parameters can be significantly reduced from 9.4M to 2.8M without degrading the system performance in terms of PESQ scores (3.13 vs. 3.11).

Tucker decomposition [28] is a higher-order extension of the singular value decomposition, obtained by computing the orthonormal spaces associated with the different modes of a tensor. It is also meaningful to verify whether Tucker decomposition applied to each CNN convolutional layer can lead to the same parameter reduction with only a small drop in the PESQ value. We refer to this Tucker-reduced CNN as CNN-Tucker. In particular, CNN-Tucker-3 means that we apply the Tucker decomposition to the first three CNN hidden layers, excluding the top one. The CNN-Tucker-3 results in Tables 1 and 2 demonstrate that high-order singular value decomposition is not sufficient to obtain a smaller deep tensor-to-vector regression model without sacrificing speech quality.
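For intuition on why a Tucker factorization shrinks a convolutional layer, the following sketch counts parameters for a Tucker-2 factorized kernel (a small core plus two mode factor matrices along the channel dimensions). The ranks used are illustrative assumptions, not the paper's CNN-Tucker settings:

```python
def tucker2_conv_params(L, C, S, Rc, Rs):
    """Parameter count of a Tucker-2 factorized L x L x C x S conv kernel:
    a core of size L x L x Rc x Rs plus factor matrices of sizes C x Rc
    and S x Rs, versus the full dense kernel of size L x L x C x S."""
    factored = L * L * Rc * Rs + C * Rc + S * Rs
    dense = L * L * C * S
    return factored, dense
```

For instance, a 3 x 3 kernel mapping 64 to 128 channels, factorized with ranks Rc = 16 and Rs = 32, stores 9,728 parameters instead of 73,728, roughly a 7.6x reduction per layer under these assumed ranks.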
4. Conclusion
We compare several tensor-to-vector regression models for speech enhancement. These models include CNN, DNN-TT, and the hybrid model composed of convolutional and TT layers, namely CNN-TT. We first discuss the representation power by linking tensor-to-vector regression to our earlier theories on DNN based vector-to-vector regression. Next, we evaluate these models for single-channel speech enhancement on the Edinburgh noisy speech database. Finally, we conduct multi-channel speech enhancement on a synthesized WSJ0 noisy corpus. Our experimental results suggest that CNN can outperform both DNN-TT and DNN with smaller regression errors and higher PESQ scores. Moreover, when the fully-connected output layer of the CNN is replaced with a TT layer to form a hybrid regression network, we achieve even better performance by gradually increasing the model size of the TT layer. In future work, we will investigate different tensor representations to reduce the parameters of the hidden convolutional layers.

5. References

[1] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech enhancement based on deep neural networks,”
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 1, pp. 7–19, 2015.

[2] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.

[3] Q. Wang, S. Wang, F. Ge, C. W. Han, J. Lee, L. Guo, and C.-H. Lee, “Two-stage enhancement of noisy and reverberant microphone array speech for automatic speech recognition systems trained with only clean speech,” in ISCSLP, 2018, pp. 21–25.

[4] C. Yu, K.-H. Hung, S.-S. Wang, Y. Tsao, and J.-w. Hung, “Time-domain multi-modal bone/air conducted speech enhancement,” IEEE Signal Processing Letters, 2020.

[5] J. Qi, J. Du, S. M. Siniscalchi, and C.-H. Lee, “A theory on deep neural network based vector-to-vector regression with an illustration of its expressive power in speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 27, no. 12, pp. 1932–1943, 2019.

[6] J. Qi, J. Du, S. M. Siniscalchi, X. Ma, and C.-H. Lee, “Analyzing upper bounds on mean absolute errors for deep neural network based vector-to-vector regression,” IEEE Transactions on Signal Processing (TSP), vol. 68, pp. 3411–3422, 2020.

[7] J. Qi, J. Du, S. M. Siniscalchi, X. Ma, and C.-H. Lee, “On mean absolute error for deep neural network based vector-to-vector regression,” accepted to IEEE Signal Processing Letters (SPL), 2020.

[8] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 91–99.

[9] H. Zhao, S. Zarar, I. Tashev, and C.-H. Lee, “Convolutional-recurrent neural networks for speech enhancement,” in ICASSP, 2018, pp. 2401–2405.

[10] S. R. Park and J. Lee, “A fully convolutional neural network for speech enhancement,” arXiv preprint arXiv:1609.07132, 2016.

[11] C.-H. Yang, J. Qi, P.-Y. Chen, X. Ma, and C.-H. Lee, “Characterizing speech adversarial examples using self-attention u-net enhancement,” in ICASSP, 2020, pp. 3107–3111.

[12] L. Sun, S. Kang, K. Li, and H. Meng, “Voice conversion using deep bidirectional long short-term memory based recurrent neural networks,” in ICASSP, 2015, pp. 4869–4873.

[13] S. Pascual, A. Bonafonte, and J. Serra, “SEGAN: Speech enhancement generative adversarial network,” arXiv preprint arXiv:1703.09452, 2017.

[14] T. Kounovsky and J. Malek, “Single channel speech enhancement using convolutional neural network,” 2017, pp. 1–5.

[15] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, “Tensorizing neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 442–450.

[16] I. V. Oseledets, “Tensor-train decomposition,” SIAM Journal on Scientific Computing, vol. 33, no. 5, pp. 2295–2317, 2011.

[17] J. Qi, H. Hu, Y. Wang, C. H. Yang, S. M. Siniscalchi, and C.-H. Lee, “Tensor-to-vector regression for multi-channel speech enhancement based on tensor-train network,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020, pp. 7504–7508.

[18] J. Qi, X. Ma, S. M. Siniscalchi, and C.-H. Lee, “Upper bounding mean absolute errors for deep tensor regression based on tensor-train networks,” submitted to IEEE Transactions on Signal Processing (TSP).

[19] J. Qi, X. Ma, C.-H. Lee, J. Du, and S. M. Siniscalchi, “Performance analysis for tensor-train decomposition to deep neural network based vector-to-vector regression,” 2020, pp. 1–6.

[20] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust text-to-speech,” in SSW, 2016, pp. 146–152.

[21] D. B. Paul and J. M. Baker, “The design for the Wall Street Journal-based CSR corpus,” in Proc. Workshop on Speech and Natural Language, Banff, Canada, Oct. 1992, pp. 899–902.

[22] T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov, “Ultimate tensorization: compressing convolutional and FC layers alike,” arXiv preprint arXiv:1611.03214, 2016.

[23] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2013.

[24] G. Hu and D. Wang, “A tandem algorithm for pitch estimation and voiced speech segregation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2067–2079, 2010.

[25] E. A. Lehmann and A. M. Johansson, “Prediction of energy decay in room impulse responses simulated with an image-source model,” The Journal of the Acoustical Society of America, vol. 124, no. 1, pp. 269–277, 2008.

[26] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[27] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs,” in IEEE International Conference on Acoustics, Speech, and Signal Processing Proceedings, vol. 2, 2001, pp. 749–752.

[28] Y.-D. Kim and S. Choi, “Nonnegative Tucker decomposition,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition.