Attention-based multi-task learning for speech-enhancement and speaker-identification in multi-speaker dialogue scenario
Chiang-Jen Peng∗, Yun-Ju Chan∗, Cheng Yu†, Syu-Siang Wang†‡, Yu Tsao† and Tai-Shih Chi∗

∗ Department of Electrical and Computer Engineering, National Chiao Tung University, Hsinchu, Taiwan
Email: [email protected]
† Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
Email: [email protected], chengyu [email protected]
‡ Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
Email: [email protected]
Abstract—Multi-task learning (MTL) and the attention mechanism have been proven to effectively extract robust acoustic features for various speech-related tasks in noisy environments. In this study, we propose an attention-based MTL (ATM) approach that integrates MTL and the attention-weighting mechanism to simultaneously realize a multi-model learning structure that performs speech enhancement (SE) and speaker identification (SI). The proposed ATM system consists of three parts: SE, SI, and attention-Net (AttNet). The SE part is composed of a long short-term memory (LSTM) model, and deep neural network (DNN) models are used to develop the SI and AttNet parts. The overall ATM system first extracts representative features, then enhances the speech signals with LSTM-SE and specifies the speaker identity with DNN-SI. The AttNet computes weights based on DNN-SI to prepare better representative features for LSTM-SE. We tested the proposed ATM system on Taiwan Mandarin hearing in noise test sentences. The evaluation results confirm that the proposed system can effectively enhance the speech quality and intelligibility of a given noisy input. Moreover, the accuracy of SI can also be notably improved by using the proposed ATM system.
Index Terms—Speech enhancement, speaker identification, multi-task learning, attention weighting, neural network.
I. INTRODUCTION
Speech signals propagating in an acoustic environment are inevitably distorted by noise. Such distortions may considerably degrade the performance of target speech-related tasks, such as assistive hearing systems [1], [2], automatic speech recognition (ASR) [3], [4], and speaker recognition [5], [6], [7]. To address this issue, speech enhancement (SE), which aims to extract clean acoustic signals from noisy inputs, has been widely used. Conventional SE techniques, including signal subspace [8], power spectral subtraction [9], Wiener filtering [10], and minimum mean square error based estimations [11], [12], perform well in stationary noise environments, where the statistical assumptions on environmental noises and human speech hold adequately [13], [14], [15]. For environments involving non-stationary noises, conventional SE techniques may not provide satisfactory performance. Recently, deep learning (DL)-based SE methods have been widely studied, and notable performance improvements over conventional techniques have been observed. In general, DL-based SE methods aim to transform the noisy source to a clean target using a nonlinear mapping, for which no assumption about the statistical properties of noise and speech signals is required [16], [17], [18]. In [19], [20], the authors proposed a deep denoising autoencoder (DDAE) SE system that encodes input noisy signals into a series of frame-level speech codes and then performs a decoding process to retrieve the enhanced signals from the system output. Another study [3] applied a long short-term memory (LSTM) model to integrate context information for SE, improving speech quality and intelligibility and achieving a low word error rate in an ASR system. In [21], a transformer model that utilizes an attention mechanism [22] to compute attention weights is used to emphasize and fuse related context symbols to obtain clean components.

An SE system can be used as a front-end processor for a specific application by placing it in front of the main speech-signal-processing system. By jointly minimizing the losses of the SE and the main system, the overall system is optimized in a multi-task learning (MTL) manner [23], [24], [25]. In such systems, MTL aims to purify the representations and thereby boost the performance of the main task [26], [27], [28]. In [29], [30], visual information is treated as the second task to promote the SE capability. Experimental results show that audio and visual cues can be jointly considered to derive more representative acoustic features in a DL-based SE system.

MTL has also been used in speaker recognition, namely speaker identification (SI) and speaker verification (SV), systems [31], [32], [33]. The recognition accuracy of an SI task is highly dependent on the quality of speaker feature extraction. Therefore, most existing systems aim to compute a decent speaker representation from speech signals. A well-known speaker recognition system is the combination of an i-vector with probabilistic linear discriminant analysis [34]. This system has been widely used and yields satisfactory performance in numerous speaker recognition tasks.
More recently, d-vector [35] and x-vector [36] features extracted by DL models have been proven to provide more abundant speaker information and, therefore, show superior recognition performance to the i-vector.

Inspired by the transformer model structure, this study proposes a novel system, namely the attention-based MTL (ATM), to extract the shared information between SE and SI to attain improved performance on the individual tasks. The input of the ATM system is a noisy speech signal, and its outputs are the enhanced speech and the identification result. In addition, an attention-based network (AttNet) is used to integrate speech and speaker cues between the SE and SI models to extract robust features. The ATM consists of three DL-based models: an LSTM that enhances the noisy input and two DNNs that identify the speaker and extract the attention weights, respectively. We tested the proposed system on the Taiwan Mandarin hearing in noise test (TMHINT) sentences [37]. The experimental results show that the proposed ATM can not only enhance the quality and intelligibility of noisy speech but also improve the SI accuracy.

The remainder of this paper is organized as follows. Section II reviews the related work, including LSTM-based SE and DNN-based SI. Section III introduces the proposed ATM architecture. Experimental results and analyses are provided in Section IV. Finally, Section V presents the conclusions and directions for future research.

II. RELATED WORKS

This section briefly reviews the LSTM-SE and DNN-SI systems. Consider noisy speech signals obtained by contaminating clean speech signals with additive noise. With the short-time Fourier transform (STFT) and several feature processing steps, we obtain noisy and clean logarithmic power spectra (LPS), Y and S, respectively, from the noisy and clean speech signals. We assume there are N frames in the paired (Y, S). The context feature of the noisy LPS, created by concatenating the M adjacent feature frames on either side of the target feature vector Y[n], namely, Y[n] = [Y′[n − M], · · · , Y′[n], · · · , Y′[n + M]]′, is used, where ′ denotes transposition.
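For concreteness, the following Python sketch illustrates this feature pipeline. The window and shift values follow Section IV-A, while the function names and the edge padding at utterance boundaries are illustrative choices of ours; the paper does not specify an implementation.

import numpy as np
from scipy.signal import stft

def noisy_lps(waveform, fs=16000, frame_ms=32, shift_ms=16):
    # 32 ms window / 16 ms shift at 16 kHz -> 512-sample frames,
    # giving 257-dimensional spectra per frame (Section IV-A).
    nperseg = fs * frame_ms // 1000               # 512 samples
    noverlap = nperseg - fs * shift_ms // 1000    # 256 samples
    _, _, Z = stft(waveform, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return np.log(np.abs(Z) ** 2 + 1e-12).T      # (N frames, 257) LPS

def context_expand(Y, M=5):
    # Stack the M adjacent frames on each side of every frame,
    # yielding 257 * (2M + 1) = 2,827-dimensional context vectors.
    pad = np.pad(Y, ((M, M), (0, 0)), mode="edge")
    return np.stack([pad[n:n + 2 * M + 1].reshape(-1)
                     for n in range(Y.shape[0])])  # (N, 2827)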
A. Speech enhancement

In this study, the baseline SE system is composed of an L-hidden-layer LSTM and a feed-forward layer, and is denoted as LSTM-SE. The input–output relationship (z_ℓ[n], z_{ℓ+1}[n]) at the n-th frame and the ℓ-th hidden layer is formulated as

    z_{ℓ+1}[n] = LSTM_ℓ(z_ℓ[n]),  ℓ = 1, 2, · · · , L.    (1)

The input of the first LSTM layer is Y, i.e., z_1[n] = Y[n], and the output z_{L+1}[n] yields

    Ŝ[n] = W z_{L+1}[n] + b,    (2)

where W and b are the weight matrix and bias vector, respectively. In the training stage, the parameters of the LSTM-SE system are updated to minimize the difference between Ŝ[n] and S[n] in terms of the mean square error (MSE). In the testing stage, the output Ŝ of the LSTM-SE is combined with the phase of the noisy speech signal to produce the enhanced time-domain signal ŝ.
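A minimal PyTorch sketch of this baseline follows, with layer sizes taken from Section IV-A (L = 2 LSTM layers with 300 cells and a 257-node output layer). This is an illustrative reimplementation of Eqs. (1)–(2), not the authors' released code.

import torch
import torch.nn as nn

class LSTMSE(nn.Module):
    # Baseline LSTM-SE of Eqs. (1)-(2): L stacked LSTM layers followed
    # by a feed-forward output layer.
    def __init__(self, dim=257, hidden=300, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, num_layers, batch_first=True)
        self.out = nn.Linear(hidden, dim)   # S_hat[n] = W z_{L+1}[n] + b

    def forward(self, y):                   # y: (batch, N, 257) noisy LPS
        z, _ = self.lstm(y)                 # z_{L+1}[n] for every frame n
        return self.out(z)                  # (batch, N, 257) enhanced LPS

# Training minimizes the MSE between the estimate and the clean LPS S:
#   loss = torch.nn.functional.mse_loss(model(Y), S)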
B. Speaker identification

The objective of DNN-SI is to classify an input speech feature Y[n] at the n-th frame to a specific speaker identity. We categorize the non-speech segments as a single virtual speaker. Therefore, the dimension of the DNN-SI output is the number of speakers plus one, namely K + 1. The reference target for the DNN training is a one-hot (K + 1)-dimensional vector I[n], whose single non-zero element corresponds to the target speaker identity.

The DNN-SI contains D layers, and the input–output relationship (z_d[n], z_{d+1}[n]) at the d-th layer and the n-th frame can be formulated as

    z_{d+1}[n] = σ_d(F_d(z_d[n])),  d = 1, · · · , D,    (3)

where σ_d(·) and F_d(·) are the activation and linear transformation functions, respectively. In this study, the softmax function is used as the activation function of the output layer, and the rectified linear unit (ReLU) function is used for all hidden layers. Meanwhile, the input and output of the DNN are z_1[n] = Y[n] and z_{D+1}[n] = Î[n], respectively. The categorical cross-entropy loss is used to optimize the DNN parameters in Eq. (3).
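A corresponding sketch of the DNN-SI classifier of Eq. (3), again with the layer sizes from Section IV-A (1024, 1024, 256, and 7 nodes). Note that PyTorch's cross-entropy loss applies the softmax internally, so the module below outputs logits; returning the speaker feature z_D as well is our choice, anticipating its use by AttNet.

import torch.nn as nn

class DNNSI(nn.Module):
    # DNN-SI of Eq. (3): ReLU hidden layers and a (K + 1)-way output
    # (K speakers plus one non-speech virtual speaker).
    def __init__(self, in_dim=2827, num_classes=7):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),     # z_D: the speaker feature
        )
        self.head = nn.Linear(256, num_classes)  # softmax applied in the loss

    def forward(self, y):                   # y: (batch, 2827) context LPS
        z_d = self.body(y)                  # speaker feature, reused by AttNet
        return self.head(z_d), z_d          # class logits and z_D

# logits, z_d = model(Y_ctx)
# loss = nn.CrossEntropyLoss()(logits, speaker_ids)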
III. THE PROPOSED APPROACH

Figure 1 shows the block diagram of the proposed ATM system. The input to the ATM system is a noisy LPS feature, Y, while the outputs are the enhanced LPS feature from SE and the speaker identity vector from SI. Between the SE and SI tasks, an AttNet is employed to reshape the feature size from SI and extract more compact speaker cues for SE. Based on the way the attention mechanism is incorporated, two ATM architectures are proposed, namely ATM_bef and ATM_ide, which are detailed in the following two subsections.
Fig. 1. The block diagram of the proposed ATM system, in which the input is noisy speech while the outputs are enhanced speech and the recognized speaker identity.
Fig. 2. The architecture of ATM_bef. The output of the (L + 1)-th layer of LSTM-SE is used to compute ω, which is then used to weight the representative features at the L-th layer.

A. The ATM_bef system
Figure 2 illustrates the block diagram of the ATM_bef system. As shown in the figure, the SE model provides the embedded speech code vector, z_{L+1}[n], from the output of the L-th LSTM hidden layer. We then create the context information of speech by concatenating the adjacent vectors of z_{L+1}[n] to obtain [z′_{L+1}[n − M], · · · , z′_{L+1}[n], · · · , z′_{L+1}[n + M]]′, which forms the input of SI to compute the speaker feature (from the output of the last hidden layer). Then, AttNet, a J-layer DNN model, takes the speaker feature as input to compute the weighting vector, ω, which weights the LSTM-SE by performing ω[n] ⊙ z_L[n], where ⊙ is an element-wise multiplication operator. Finally, the enhanced speech, Ŝ, and the recognized speaker identity, Î, are obtained. This system is referred to as ATM_bef because the attention operation is performed before extracting the acoustic feature representation.

To train ATM_bef, we prepare noisy LPS features as the input and the corresponding speaker-identity vectors and clean LPS features as the two outputs. Then, an iterative training procedure is applied to train the SI and SE–AttNet models using the following steps: (1) The categorical cross-entropy loss is used to train the SI model, where the model input and output are the contextual embedding features and the speaker-identity vectors, respectively. (2) The speaker features, z_D, are then extracted using the SI model. (3) The training proceeds with Y and z_D on the input side of SE and AttNet, respectively, to produce an enhanced output that approximates S. Notably, the SE and AttNet models are jointly trained based on the MSE loss.
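The weighting step can be sketched as follows. The AttNet sizes follow Section IV-A (J = 2 layers of 300 nodes, matching the 300 LSTM cells being gated); since the paper does not state AttNet's output activation, the sigmoid gate below is an assumption of this sketch.

import torch.nn as nn

class AttNet(nn.Module):
    # J = 2 hidden layers of 300 nodes; omega must match the
    # 300-dimensional LSTM activations it gates. The sigmoid output
    # activation is an assumption, not stated in the paper.
    def __init__(self, spk_dim=256, hidden=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )

    def forward(self, z_d):            # z_d: (batch, N, 256) speaker feature
        return self.net(z_d)           # omega: (batch, N, 300)

# ATM_bef weighting step: omega gates the L-th layer activations
# element-wise before the remaining SE computation:
#   z_weighted = attnet(z_d) * z_L    # omega[n] ⊙ z_L[n]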
Fig. 3. The architecture of ATM_ide. The shared representation of SE and SI and the weighting operation are performed in the same layer of LSTM-SE.
B. The ATM_ide system
Figure 3 shows the block diagram of ATM_ide. Different from ATM_bef, ATM_ide performs the feature extraction and the weighting operation in the same layer of LSTM-SE. More specifically, ATM_ide performs four steps to obtain Ŝ and Î. First, the acoustic code z_{L+1}[n] is computed by passing the noisy LPS Y[n] through LSTM-SE. Next, the SI model generates Î[n] at its output and the speaker code z_D[n], which serves as the input of the AttNet to obtain the weighting vector, ω[n]. Then, the weighting vector, ω[n], is applied to weight the acoustic code (that is, ω[n] ⊙ z_{L+1}[n]). Finally, a transformation is applied to the weighted acoustic code to generate the enhanced output Ŝ[n]. For ATM_ide, the weighting vector ω is extracted to introduce speaker-dependent characteristics into the acoustic feature and guide the SE to generate the output corresponding to the target speaker. The proposed ATM systems (ATM_ide and ATM_bef) can be considered multi-model systems because the speaker characteristics are used to guide SE to achieve better performance.

A dynamic weighted loss function has been proposed to address the scaling issue between classification and regression tasks [38], [39]. The loss is formulated in Eq. (4) with two additional trainable parameters, α and β:

    L(Θ, α, β) = (1 / 2α²) L₁(Θ) + (1 / β²) L₂(Θ) + log α + log β,    (4)

where L₁ and L₂ are the MSE and the categorical cross-entropy loss, respectively, and Θ represents all the parameters in ATM_ide.
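A sketch of Eq. (4) as a trainable PyTorch module; parameterizing log α and log β for numerical stability is our implementation choice, following common practice for the uncertainty weighting of [38] rather than anything stated in the paper.

import torch
import torch.nn as nn

class DynamicWeightedLoss(nn.Module):
    # Eq. (4): uncertainty-based weighting [38] of the SE regression
    # loss L1 (MSE) and the SI classification loss L2 (cross-entropy).
    def __init__(self):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(()))  # log(alpha)
        self.log_beta = nn.Parameter(torch.zeros(()))   # log(beta)

    def forward(self, loss_se, loss_si):
        alpha_sq = torch.exp(2.0 * self.log_alpha)      # alpha^2
        beta_sq = torch.exp(2.0 * self.log_beta)        # beta^2
        return (loss_se / (2.0 * alpha_sq) + loss_si / beta_sq
                + self.log_alpha + self.log_beta)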
IV. EXPERIMENTS AND ANALYSES

In the following subsections, we first introduce the experimental setup and then provide the experimental results along with a discussion of our findings.
A. Experimental setup
We evaluated the proposed ATM system on the TMHINT sentences. The training and testing utterances were recorded by eight speakers at a 16 kHz sampling rate in a noise-free meeting room. For the training set, a total of 1,560 clean utterances were pronounced by three males and three females (K = 6 in Section II-B), each providing 260 utterances. From these clean utterances, we randomly concatenated three utterances to simulate the dialogue scenario and subsequently generated 520 clean training utterances, where each utterance contained exactly three different speakers. Noisy utterances were generated by artificially adding 100 different types of noises [40] at six signal-to-noise ratio (SNR) levels to the prepared 520 clean training utterances (see the mixing sketch at the end of this subsection), thus generating 312,000 (= 520 × 100 × 6) noisy–clean training pairs. Among them, we randomly selected 500 noisy–clean pairs to form the validation set. Meanwhile, two different testing configurations were prepared for the SE and SI tasks. For SE, the testing set contained speech from an additional male and an additional female speaker. We randomly concatenated one utterance of the male speaker and one utterance of the female speaker to generate 60 clean utterances for testing. Noisy testing utterances were then prepared by deteriorating these clean utterances with four additive noises ("engine", "pink", "street", and "white") at three SNR levels. Accordingly, we prepared 720 (= 60 × 4 × 3) noisy testing utterances. In contrast to the SE testing set, the utterances used to evaluate the SI part were from the same speakers as in the training set. We prepared 120 clean dialogue utterances for testing, each containing segments from three different speakers. Then, we added the same four additive noises at the same three SNR levels to these clean testing utterances to form the noisy utterances. Accordingly, we prepared 1,440 (= 120 × 4 × 3) noisy utterances for testing the SI performance. Please note that we did not consider overlapped speech in this study, and thus there were no overlapped segments in the utterances of the training and testing sets.

To apply the STFT, we used a window with a frame size of 32 ms and a frame shift of 16 ms, from which a 257-dimensional LPS vector was obtained. The context feature was created with M = 5, thus with a dimension of 2,827 (= 257 × (2 × 5 + 1)). Accordingly, the input- and output-layer sizes of SE were both 257, and those of SI were 2,827 and 7 (i.e., K + 1 = 6 + 1), respectively. For the overall ATM system, the input size was 257, and the output sizes were 257 for SE and 7 for SI. The detailed network configuration is as follows:
• The SE model consisted of two LSTM layers (L = 2) with 300 cells in each layer, followed by a 257-node feed-forward layer.
• The SI model comprised four hidden layers (D = 4) with 1024, 1024, 256, and 7 nodes, in that order.
• The AttNet comprised two hidden layers (J = 2), each with 300 nodes.

In this study, we applied three metrics to evaluate the proposed system: perceptual evaluation of speech quality (PESQ) [41], short-time objective intelligibility (STOI) [42], and the segmental SNR index (SSNRI) [43]. The score ranges of PESQ and STOI are [−0.5, 4.5] and [0, 1], respectively. Higher PESQ and STOI scores indicate better speech quality and intelligibility, while a higher SSNRI score indicates better signal-level SE performance.
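As an illustration of the noisy-data preparation described above, the following sketch scales a noise signal to a target SNR before mixing. This is a standard recipe; the paper does not detail its own mixing procedure.

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Loop or trim the noise to the clean signal's length, scale it so
    # that the clean-to-noise power ratio equals snr_db, and add.
    noise = np.resize(noise, clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise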
B. Experimental results

In this subsection, we split the evaluation results into two parts: we first report the SE evaluation results and then the SI performance.
1) SE results:
Table I lists the averaged PESQ, STOI, and SSNRI results over all testing utterances for the noisy baseline (denoted as "Noisy") and for the enhanced speech obtained by the conventional LSTM-SE, ATM_bef, and ATM_ide. In addition, the results of MTL, which is composed of only the SE and SI models (without AttNet) in Fig. 1, are also listed for comparison.
TABLE I
AVERAGED PESQ, STOI, AND SSNRI SCORES OF NOISY, LSTM-SE, MTL, ATM_bef, AND ATM_ide.

From the table, most evaluation metrics obtained using the MTL criterion, that is, by MTL, ATM_bef, and ATM_ide, are better than those provided by LSTM-SE, except the PESQ score of MTL. These results confirm the effectiveness of MTL-based models in improving speech quality, intelligibility, and background noise reduction. In addition, both ATM_bef and ATM_ide provide better results than MTL for all evaluation metrics, confirming that the MTL-based SE system can be further improved by applying the attention-weighting mechanism. Moreover, ATM_ide yields scores superior to ATM_bef, implying that a suitable attention mechanism further promotes the system capability.

To further analyze the benefits of the proposed systems, we report the detailed PESQ and STOI scores of Table I in Tables II and III, respectively. We compared the performance of Noisy, LSTM-SE, MTL, ATM_bef, and ATM_ide with respect to four testing noise environments over all SNR levels. From both tables, we observe that all DL-based SE approaches provide better PESQ and STOI scores than the noisy baseline under all evaluated conditions, with ATM_ide performing the best. The results verify the capability of the proposed ATM approach to extract robust features for SE, thus further improving speech quality and intelligibility.
2) SI results:
Figure 4 illustrates the frame-wise SI accuracy of the DNN-SI baseline, MTL, ATM_bef, and ATM_ide. The evaluations were conducted on testing utterances involving the "engine", "pink", "street", and "white" noise backgrounds, among which "street" is considered the most complex noise type. From the figure, it can be observed that the MTL-based approaches (MTL, ATM_bef, and ATM_ide) provide higher SI accuracies than the conventional DNN-SI. In addition, ATM_ide shows the highest recognition accuracy in the street background and competes with MTL in the other noise environments. The results demonstrate that the MTL architecture can effectively enhance the SI performance, which can be further improved by incorporating the attention-weighting mechanism.

Next, we analyze the speaker features extracted by DNN-SI and ATM_ide based on t-SNE analyses [44].
TABLE II
THE AVERAGED PESQ SCORES WITH RESPECT TO FOUR DIFFERENT NOISE ENVIRONMENTS OVER ALL SNR LEVELS, ACHIEVED BY THE NOISY, LSTM-SE, MTL, ATM_bef, AND ATM_ide SYSTEMS.

          Noisy   LSTM-SE   MTL    ATM_bef   ATM_ide
White     1.25    2.01      2.00   2.08
Pink      1.28    1.88      1.88   1.96
Street    1.32    1.84      1.83   1.89
Engine    1.16    1.72      1.71   1.81
TABLE III
THE AVERAGED STOI SCORES WITH RESPECT TO DIFFERENT NOISE ENVIRONMENTS OVER ALL SNR LEVELS, ACHIEVED BY THE NOISY, LSTM-SE, MTL, ATM_bef, AND ATM_ide SYSTEMS.

          Noisy   LSTM-SE   MTL    ATM_bef   ATM_ide
White     0.75    0.75      0.75   0.76
Pink      0.72    0.72      0.73   0.73
Street    0.72    0.74      0.75   0.75
Engine    0.69    0.70      0.71   0.71
Fig. 4. The frame-wise SI accuracy of DNN-SI, MTL, ATM_bef, and ATM_ide in four testing noise environments.

Fig. 5. The distributions of the (a) DNN-SI- and (b) ATM_ide-extracted speaker features under t-SNE analysis.

The t-SNE analysis is a widely used technique that provides visualized feature clusters from high-dimensional spaces. In this study, seven speakers were involved in the training set (including one non-speech virtual speaker). The analysis was carried out by first feeding all SI-testing noisy utterances to DNN-SI or ATM_ide to derive the associated speaker features. Then, these high-dimensional DNN-SI- and ATM_ide-extracted speaker features were processed by t-SNE to yield two-dimensional representations. Fig. 5 illustrates the distributions of these dimension-reduced (a) DNN-SI and (b) ATM_ide features with the associated speaker identities. In the figure, it can be observed that the ATM_ide system provides a larger inter-class distance and a clearer class boundary than the DNN-SI baseline. The results show that the combination of MTL and AttNet techniques can extract more representative features for the SI task.
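A sketch of this analysis using scikit-learn's t-SNE; the input features would be, e.g., the 256-dimensional z_D vectors from Section IV-A, and the plotting details are illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_speaker_tsne(features, speaker_ids):
    # Project the high-dimensional speaker features to 2-D with
    # t-SNE [44] and color each point by its speaker identity.
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for spk in np.unique(speaker_ids):
        sel = speaker_ids == spk
        plt.scatter(emb[sel, 0], emb[sel, 1], s=4, label=f"spk {spk}")
    plt.legend()
    plt.show()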
V. CONCLUSION
In this study, we proposed a novel ATM approach that integrates MTL and the attention-weighting mechanism to carry out the SE and SI tasks simultaneously. The overall ATM system is composed of SE, SI, and AttNet modules and is able to extract representative and robust acoustic features in noisy environments. Experimental results on the simulated dialogue conditions confirm that the proposed ATM can significantly reduce the noise components in noisy speech, thereby improving speech quality and intelligibility for the SE task. Meanwhile, a suitable attention mechanism performed in ATM can further improve the enhancement performance. On the other hand, the recognition accuracy of the SI system can also be improved through the proposed ATM approach. In the future, we plan to test the ATM system on other languages. We will also explore ATM using other types of SE and SI models. Finally, we will test the proposed ATM architecture on speaker-diarization and speech-source-separation tasks.
REFERENCES

[1] DeLiang Wang, "Deep learning reinvents the hearing aid," IEEE Spectrum, vol. 54, no. 3, pp. 32–37, 2017.
[2] Ying-Hui Lai, Fei Chen, Syu-Siang Wang, Xugang Lu, Yu Tsao, and Chin-Hui Lee, "A deep denoising autoencoder approach to improving the intelligibility of vocoded speech in cochlear implant simulation," IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1568–1578, 2016.
[3] Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and Björn Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Proc. LVA/ICA, 2015, pp. 91–99.
[4] Jinyu Li, Li Deng, Yifan Gong, and Reinhold Haeb-Umbach, "An overview of noise-robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745–777, 2014.
[5] Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen, "Speech enhancement using long short-term memory based recurrent neural networks for noise robust speaker verification," in Proc. SLT, 2016, pp. 305–311.
[6] Suwon Shon, Hao Tang, and James Glass, "VoiceID loss: Speech enhancement for speaker verification," arXiv preprint arXiv:1904.03601, 2019.
[7] K. A. Al-Karawi, A. H. Al-Noori, Francis F. Li, Tim Ritchings, et al., "Automatic speaker recognition system in adverse conditions—implication of noise and reverberation on system performance," International Journal of Information and Electronics Engineering, vol. 5, no. 6, pp. 423–427, 2015.
[8] Philipos C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2013.
[9] Steven Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[10] Jae Soo Lim and Alan V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proceedings of the IEEE, vol. 67, no. 12, pp. 1586–1604, 1979.
[11] Yariv Ephraim and David Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[12] Yariv Ephraim and David Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.
[13] P. C. Loizou, "Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 857–869, 2005.
[14] Kuldip Paliwal, Kamil Wójcicki, and Belinda Schwerin, "Single-channel speech enhancement using spectral subtraction in the short-time modulation domain," Speech Communication, vol. 52, no. 5, pp. 450–475, 2010.
[15] Thomas Lotter and Peter Vary, "Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model," EURASIP Journal on Applied Signal Processing, vol. 2005, pp. 1110–1126, 2005.
[16] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2014.
[17] Yan Zhao, Zhong-Qiu Wang, and DeLiang Wang, "Two-stage deep learning for noisy-reverberant speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 53–62, 2018.
[18] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, "Dynamic noise aware training for speech enhancement based on deep neural networks," in Proc. INTERSPEECH, 2014, pp. 2670–2674.
[19] Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. INTERSPEECH, 2013, pp. 436–440.
[20] Cheng Yu, Ryandhimas E. Zezario, Syu-Siang Wang, Jonathan Sherman, Yi-Yen Hsieh, Xugang Lu, Hsin-Min Wang, and Yu Tsao, "Speech enhancement based on denoising autoencoder with multi-branched encoders," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2756–2769, 2020.
[21] Jaeyoung Kim, Mostafa El-Khamy, and Jungwon Lee, "T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement," in Proc. ICASSP, 2020, pp. 6649–6653.
[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," arXiv preprint arXiv:1706.03762, 2017.
[23] Zhuo Chen, Shinji Watanabe, Hakan Erdogan, and John R. Hershey, "Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks," in Proc. INTERSPEECH, 2015, pp. 3274–3278.
[24] Geon Woo Lee and Hong Kook Kim, "Multi-task learning U-net for single-channel speech enhancement and mask-based voice activity detection," Applied Sciences, vol. 10, no. 9, p. 3230, 2020.
[25] Jing Shi, Jiaming Xu, and Bo Xu, "Which ones are speaking? Speaker-inferred model for multi-talker speech separation," in Proc. INTERSPEECH, 2019, pp. 4609–4613.
[26] Sebastian Ruder, "An overview of multi-task learning in deep neural networks," arXiv preprint arXiv:1706.05098, 2017.
[27] Yu Zhang and Qiang Yang, "A survey on multi-task learning," arXiv preprint arXiv:1707.08114, 2017.
[28] Michael Crawshaw, "Multi-task learning with deep neural networks: A survey," arXiv preprint arXiv:2009.09796, 2020.
[29] Giovanni Morrone, Daniel Michelsanti, Zheng-Hua Tan, and Jesper Jensen, "Audio-visual speech inpainting with deep learning," arXiv preprint arXiv:2010.04556, 2020.
[30] Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, and Hsin-Min Wang, "Audio-visual speech enhancement using multimodal deep convolutional neural networks," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018.
[31] Nanxin Chen, Yanmin Qian, and Kai Yu, "Multi-task learning for text-dependent speaker verification," in Proc. INTERSPEECH, 2015, pp. 185–189.
[32] Zhiyuan Tang, Lantian Li, and Dong Wang, "Multi-task recurrent model for speech and speaker recognition," in Proc. APSIPA, 2016, pp. 1–4.
[33] Gueorgui Pironkov, Stéphane Dupont, and Thierry Dutoit, "Speaker-aware multi-task learning for automatic speech recognition," in Proc. ICPR, 2016, pp. 2900–2905.
[34] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[35] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014.
[36] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018, pp. 5329–5333.
[37] M. W. Huang, "Development of Taiwan Mandarin hearing in noise test," Master's thesis, Department of Speech Language Pathology and Audiology, National Taipei University of Nursing and Health Sciences, 2005.
[38] Alex Kendall, Yarin Gal, and Roberto Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics," in Proc. CVPR, 2018.
[39] Ozan Sener and Vladlen Koltun, "Multi-task learning as multi-objective optimization," arXiv preprint arXiv:1810.04650, 2018.
[40] Guoning Hu and DeLiang Wang, "A tandem algorithm for pitch estimation and voiced speech segregation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2067–2079, 2010.
[41] Antony W. Rix, John G. Beerends, Michael P. Hollier, and Andries P. Hekstra, "Perceptual evaluation of speech quality (PESQ)—a new method for speech quality assessment of telephone networks and codecs," in Proc. ICASSP, 2001.
[42] Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen, "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[43] Jingdong Chen, Jacob Benesty, Yiteng Arden Huang, and Eric J. Diethorn, "Fundamentals of noise reduction," in Springer Handbook of Speech Processing, pp. 843–872, Springer, 2008.
[44] Laurens van der Maaten and Geoffrey Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.