On Mean Absolute Error for Deep Neural Network Based Vector-to-Vector Regression
Jun Qi, Student Member, IEEE, Jun Du, Member, IEEE, Sabato Marco Siniscalchi, Senior Member, IEEE, Xiaoli Ma, Fellow, IEEE, and Chin-Hui Lee, Fellow, IEEE

J. Qi, X. Ma, and C.-H. Lee are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [email protected], [email protected], [email protected]). J. Du is with the National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China (e-mail: [email protected]). S. M. Siniscalchi is with the Faculty of Architecture and Engineering, University of Enna "Kore", Enna 94100, Italy, and also with the Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [email protected]).
Abstract—In this paper, we exploit the properties of mean absolute error (MAE) as a loss function for the deep neural network (DNN) based vector-to-vector regression. The goal of this work is two-fold: (i) presenting performance bounds of MAE, and (ii) demonstrating new properties of MAE that make it more appropriate than mean squared error (MSE) as a loss function for DNN based vector-to-vector regression. First, we show that a generalized upper bound for DNN-based vector-to-vector regression can be ensured by leveraging the known Lipschitz continuity property of MAE. Next, we derive a new generalized upper bound in the presence of additive noise. Finally, in contrast to conventional MSE, commonly adopted to approximate Gaussian errors for regression, we show that MAE can be interpreted as an error modeled by a Laplacian distribution. Speech enhancement experiments are conducted to corroborate our proposed theorems and validate the performance advantages of MAE over MSE for DNN based regression.
Index Terms—Mean absolute error, mean squared error, deep neural network, vector-to-vector regression, speech enhancement
I. INTRODUCTION

MEAN absolute error (MAE), originating from a measure of average error [1], is often employed in assessing vector-to-vector (a.k.a. multivariate) regression models [2]. Another form of average error is the root-mean-squared error (RMSE), but MAE was shown to outperform RMSE for measuring average model accuracy in most situations except Gaussian noisy scenarios [3]–[5]. An exception occurs when the expected error is Gaussian-distributed and enough training samples are available [3]. Besides, mean squared error (MSE) is the squared form of RMSE and is commonly adopted as a regression loss function [6]–[9].

In the literature, there are some discussions on the relationship between MSE and MAE. Berger [10] presented pros and cons of squared and absolute errors from an estimation point of view. In [11], a better solution to support vector machines could be obtained based on a loss function of an absolute difference instead of the quadratic error. Li et al. [12] discussed the effectiveness of MAE and its variations when training a deep model for energy load forecasting; Imani et al. [13] investigated distributional losses, including both MAE and MSE, for regression problems from the perspective of
efficient optimization. Pandey and Wang [14] exploited the MAE and MSE loss functions for generative adversarial nets (GANs). However, a comparison between MAE and MSE in terms of generalization capabilities [15]–[17] is still missing in theory. Thus, this paper aims at bridging this gap. In particular, we investigate MAE and MSE in terms of performance error bounds and robustness against various noises in the context of deep neural network (DNN) based vector-to-vector regression, since DNNs offer better representation power and generalization capability in large-scale regression problems, such as those addressed in [18]–[21].

In this paper, we prove that the Lipschitz continuity property [22], [23], which holds for MAE but not for MSE, is a necessary condition to derive the upper bound on the Rademacher complexity [24], [25] of DNN based vector-to-vector regression functions, as we have demonstrated in [26]. Next, we show that the MAE Lipschitz continuity property can also result in a new upper bound on the generalization capability of DNN-based vector-to-vector regression in the presence of additive noise [27]–[29]. Moreover, another contribution of this work is that we establish a connection between the MAE loss function and the Laplacian distribution [30], in contrast to the MSE loss function, which is associated with the Gaussian distribution [31]. In doing so, we can highlight the key advantages of MAE over MSE by comparing the characteristics of those two distributions.

Our speech enhancement experiments are used as the regression task to assess our theoretical derivations and to empirically verify the effectiveness of MAE over MSE. We choose regression-based speech enhancement because it is an unbounded mapping from $\mathbb{R}^d \to \mathbb{R}^q$, where enhanced speech features are expected to closely approximate the clean speech features in regression.

The remainder of this paper is organized as follows: Section II introduces the necessary mathematical notations and theorems. Sections III and IV highlight key properties of the MAE loss function for DNN based vector-to-vector regression. Section V associates the MAE loss function with the Laplacian distribution. The related speech enhancement experiments are given in Section VI, and Section VII concludes this work.

II. PRELIMINARIES

Notations:
• $f \circ g$: the composition of functions $f$ and $g$.
• $\|\mathbf{x}\|_p$: the $L_p$ norm of the vector $\mathbf{x}$.
• $\mathbb{R}^d$: the $d$-dimensional real coordinate space.
• $[n]$: the integer set $\{1, 2, ..., n\}$.
• $\mathbf{1}$: the vector of all ones.

Lipschitz Continuity
Definition 1. A function $f$ is $\beta$-Lipschitz continuous if, for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and an integer $p \ge 1$,

$$\|f(\mathbf{x}) - f(\mathbf{y})\|_p \le \beta \|\mathbf{x} - \mathbf{y}\|_p. \quad (1)$$

Mean Absolute Error (MAE)
Definition 2. MAE measures the average magnitude of absolute differences between $N$ predicted vectors $S = \{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_N\}$ and the corresponding targets $S^* = \{\mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_N\}$. The corresponding loss function is defined as:

$$\mathcal{L}_{\rm MAE}(S, S^*) = \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{y}_i\|_1, \quad (2)$$

where $\|\cdot\|_1$ denotes the $L_1$ norm.

Mean Squared Error (MSE)
Definition 3. MSE denotes a quadratic scoring rule that measures the average squared magnitude of differences between $N$ predicted vectors $S = \{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_N\}$ and $N$ actual observations $S^* = \{\mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_N\}$. The corresponding loss function is given as:

$$\mathcal{L}_{\rm MSE}(S, S^*) = \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{y}_i\|_2^2, \quad (3)$$

where $\|\cdot\|_2$ denotes the $L_2$ norm.
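To make Definitions 2 and 3 concrete, the following minimal NumPy sketch (our illustration, not part of the original text) computes both losses over a batch of predicted and target vectors:

```python
import numpy as np

def mae_loss(X, Y):
    """Definition 2: average L1 norm ||x_i - y_i||_1 over N vector pairs."""
    return np.mean(np.sum(np.abs(X - Y), axis=1))

def mse_loss(X, Y):
    """Definition 3: average squared L2 norm ||x_i - y_i||_2^2 over N vector pairs."""
    return np.mean(np.sum((X - Y) ** 2, axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # N = 4 predicted vectors in R^3
Y = rng.normal(size=(4, 3))  # N = 4 target vectors
print(mae_loss(X, Y), mse_loss(X, Y))
```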
Empirical Rademacher Complexity

Definition 4. The empirical Rademacher complexity of a hypothesis space $\mathcal{H}$ of functions $h: \mathbb{R}^n \to \mathbb{R}$ with respect to $N$ samples $S = \{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_N\}$ is:

$$\hat{\mathcal{R}}_S(\mathcal{H}) := \mathbb{E}_{\sigma_1, ..., \sigma_N} \left[ \sup_{h \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \sigma_i h(\mathbf{x}_i) \right], \quad (4)$$

where $\sigma_1, \sigma_2, ..., \sigma_N$ are the Rademacher random variables, defined by the uniform distribution as:

$$\sigma_i = \begin{cases} +1, & \text{with probability } 1/2 \\ -1, & \text{with probability } 1/2 \end{cases}. \quad (5)$$

In [32]–[34], it was shown that a function class with larger empirical Rademacher complexity is more likely to overfit the training data.
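As an illustration of Definition 4 (our sketch, not from the paper), the empirical Rademacher complexity can be approximated by Monte Carlo when the hypothesis space is replaced by a finite stand-in class: draw Rademacher sign vectors, take the supremum of the normalized correlation over the class, and average over draws. The linear class `H` below is a hypothetical example.

```python
import numpy as np

def empirical_rademacher(hypotheses, X, num_trials=2000, seed=0):
    """Monte-Carlo estimate of Eq. (4): E_sigma[ sup_h (1/N) sum_i sigma_i h(x_i) ].

    hypotheses: list of callables h mapping an (N, n) array to an (N,) array
                (a finite stand-in for the hypothesis space H)
    X: (N, n) array of samples
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    outputs = np.stack([h(X) for h in hypotheses])   # (|H|, N) matrix of h(x_i)
    sups = np.empty(num_trials)
    for t in range(num_trials):
        sigma = rng.choice([-1.0, 1.0], size=N)      # Rademacher variables, Eq. (5)
        sups[t] = np.max(outputs @ sigma) / N        # sup over the finite class
    return sups.mean()

X = np.random.default_rng(1).normal(size=(50, 5))
H = [lambda X, w=w: X @ w for w in np.random.default_rng(2).normal(size=(20, 5))]
print(empirical_rademacher(H, X))
```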
III. MAE LOSS FUNCTION FOR UPPER BOUNDING EMPIRICAL RADEMACHER COMPLEXITY

The Lipschitz continuity property is fundamental to deriving an upper bound on the estimated regression error. In Lemma 1 below, we show that the MAE loss function ensures the Lipschitz continuity property. In Lemma 2, we instead show that the property does not hold for MSE.
Lemma 1. The MAE loss function is 1-Lipschitz continuous.

Proof. For two vectors $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^q$ and a target vector $\mathbf{x}_0 \in \mathbb{R}^q$, the MAE loss difference is

$$|\mathcal{L}_{\rm MAE}(\mathbf{x}_1, \mathbf{x}_0) - \mathcal{L}_{\rm MAE}(\mathbf{x}_2, \mathbf{x}_0)| = \left| \|\mathbf{x}_1 - \mathbf{x}_0\|_1 - \|\mathbf{x}_2 - \mathbf{x}_0\|_1 \right| \le \|\mathbf{x}_1 - \mathbf{x}_2\|_1 \text{ (triangle inequality)} = \mathcal{L}_{\rm MAE}(\mathbf{x}_1, \mathbf{x}_2). \quad (6)$$

Lemma 2. The MSE loss function cannot lead to the Lipschitz continuity property.

Proof. $\forall \mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^q$ with $\|\mathbf{x}_1\|_2 > \|\mathbf{x}_2\|_2$, there is

$$\|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 = \|\mathbf{x}_1\|_2^2 + \|\mathbf{x}_2\|_2^2 - 2\mathbf{x}_1^T \mathbf{x}_2. \quad (7)$$

Next, assuming the target $\mathbf{x}_0 = 2\mathbf{x}_1$, we have

$$\|\mathbf{x}_2 - \mathbf{x}_0\|_2^2 - \|\mathbf{x}_1 - \mathbf{x}_0\|_2^2 = \|\mathbf{x}_2\|_2^2 - 2\mathbf{x}_0^T \mathbf{x}_2 - \|\mathbf{x}_1\|_2^2 + 2\mathbf{x}_0^T \mathbf{x}_1 = \|\mathbf{x}_2\|_2^2 - 4\mathbf{x}_1^T \mathbf{x}_2 - \|\mathbf{x}_1\|_2^2 + 4\|\mathbf{x}_1\|_2^2 = \|\mathbf{x}_2\|_2^2 - 4\mathbf{x}_1^T \mathbf{x}_2 + 3\|\mathbf{x}_1\|_2^2. \quad (8)$$

By subtracting Eq. (7) from Eq. (8), and using the assumption $\|\mathbf{x}_1\|_2 > \|\mathbf{x}_2\|_2$,

$$\|\mathbf{x}_2 - \mathbf{x}_0\|_2^2 - \|\mathbf{x}_1 - \mathbf{x}_0\|_2^2 - \|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 = 2\|\mathbf{x}_1\|_2^2 - 2\mathbf{x}_1^T \mathbf{x}_2 > \|\mathbf{x}_1\|_2^2 + \|\mathbf{x}_2\|_2^2 - 2\mathbf{x}_1^T \mathbf{x}_2 = \|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 > 0, \quad (9)$$

from which we derive that

$$\left| \|\mathbf{x}_2 - \mathbf{x}_0\|_2^2 - \|\mathbf{x}_1 - \mathbf{x}_0\|_2^2 \right| > \|\mathbf{x}_1 - \mathbf{x}_2\|_2^2, \quad (10)$$

which contradicts the property of Lipschitz continuity; since $\mathbf{x}_1$ can be scaled arbitrarily, no finite Lipschitz constant exists. Thus, the MSE loss function is not Lipschitz continuous.

We now discuss how the Lipschitz continuity derived from the MAE loss function upper bounds the estimation error $T$, which is associated with the generalization capability and defined as:

$$T = \sup_{f_v \in \mathcal{F}} \left| \mathcal{L}(f_v) - \hat{\mathcal{L}}(f_v) \right| \le \hat{\mathcal{R}}_S(\mathcal{L}), \quad (11)$$

where $\mathcal{F} = \{f_v: \mathbb{R}^d \to \mathbb{R}^q\}$ is a family of DNN based vector-to-vector functions and $\mathcal{L} = \{\mathcal{L}(f_v, f_v^*): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}, f_v \in \mathcal{F}\}$ denotes the family of generalized MAE loss functions. In [26], we have shown that the estimation error $T$ can be upper bounded by the empirical Rademacher complexity $\hat{\mathcal{R}}_S(\mathcal{L})$, and further upper bounded as:

$$T = \sup_{f_v \in \mathcal{F}} \left| \mathcal{L}(f_v) - \hat{\mathcal{L}}(f_v) \right| \le \hat{\mathcal{R}}_S(\mathcal{L}) \le \hat{\mathcal{R}}_S(\mathcal{F}), \quad (12)$$

where $\hat{\mathcal{R}}_S(\mathcal{F})$ is defined as:

$$\hat{\mathcal{R}}_S(\mathcal{F}) = \frac{1}{N} \mathbb{E}_{\boldsymbol{\sigma}} \left[ \sup_{f_v \in \mathcal{F}} \sum_{i=1}^{N} \boldsymbol{\sigma}_i^T f_v(\mathbf{x}_i) \right], \quad (13)$$

where $\boldsymbol{\sigma} = \{\boldsymbol{\sigma}_1, \boldsymbol{\sigma}_2, ..., \boldsymbol{\sigma}_N\}$ denotes a set of Rademacher random vectors whose entries follow Definition 4.
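The contrast between Lemmas 1 and 2 can be checked numerically. The sketch below (ours, under the same setup as the proofs) verifies the 1-Lipschitz bound of Eq. (6) for MAE and reproduces the MSE counterexample with target $\mathbf{x}_0 = 2\mathbf{x}_1$ from Eq. (10):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = 10.0 * rng.normal(size=8)   # chosen so that ||x1||_2 > ||x2||_2
x2 = 0.1 * rng.normal(size=8)
x0 = 2.0 * x1                    # the target used in the proof of Lemma 2

mae = lambda a, b: np.sum(np.abs(a - b))      # single-pair MAE, Definition 2
mse = lambda a, b: np.sum((a - b) ** 2)       # single-pair MSE, Definition 3

# Lemma 1: |MAE(x1,x0) - MAE(x2,x0)| <= MAE(x1,x2) always holds.
assert abs(mae(x1, x0) - mae(x2, x0)) <= mae(x1, x2) + 1e-9

# Lemma 2: the analogous bound fails for MSE with x0 = 2*x1, Eq. (10).
print(abs(mse(x1, x0) - mse(x2, x0)), ">", mse(x1, x2))
```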
IV. MAE LOSS FUNCTION FOR DNN ROBUSTNESS AGAINST ADDITIVE NOISES
We now show that the MAE loss function yields an upper bound on the regression error that ensures DNN robustness against additive noise.
Theorem 1. For an objective function $h = \mathcal{L} \circ f_v: \mathbb{R}^d \to \mathbb{R}$ with the MAE loss function $\mathcal{L}: \mathbb{R}^q \to \mathbb{R}$ and a vector-to-vector regression function $f_v: \mathbb{R}^d \to \mathbb{R}^q$, the change of the objective caused by adding noise $\boldsymbol{\eta}$ to the signal $\mathbf{x}$ is bounded as:

$$|h(\mathbf{x} + \boldsymbol{\eta}) - h(\mathbf{x})| \le L_\infty \|\boldsymbol{\eta}\|_1, \quad (14)$$

where $L_\infty = \sum_{i=1}^{q} L_{\infty, i}$ is the Lipschitz constant for the DNN based vector-to-vector regression, and each $L_{\infty, i}$ is given as:

$$L_{\infty, i} = \sup \{\|\nabla f_i(\mathbf{x})\|_\infty : \mathbf{x} \in \mathbb{R}^d\}. \quad (15)$$
Proof. To prove Theorem 1, we first introduce Lemma 3, which is obtained by modifying Theorem 1 in [35].
Lemma 3. For a vector-to-vector regression function $f: \mathbb{R}^d \to \mathbb{R}^q$ with the property of Lipschitz continuity, $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, there exists an inequality:

$$\|f(\mathbf{x}) - f(\mathbf{y})\| \le L_p \|\mathbf{x} - \mathbf{y}\|_q, \quad (16)$$

where $L_p = \sup \{\|\nabla f(\mathbf{x})\|_p : \mathbf{x} \in \mathbb{R}^d\}$ is a Lipschitz constant, and $\frac{1}{p} + \frac{1}{q} = 1$, $p, q \ge 1$.

We employ the fact that DNNs with the ReLU activation function are Lipschitz continuous [36]. Then, based on both the triangle inequality and Lemma 3, we can upper bound the difference of the objective functions with and without the additive noise $\boldsymbol{\eta}$ as:

$$|h(\mathbf{x} + \boldsymbol{\eta}) - h(\mathbf{x})| = \left| \|f_v(\mathbf{x} + \boldsymbol{\eta})\|_1 - \|f_v(\mathbf{x})\|_1 \right| \le \|f_v(\mathbf{x} + \boldsymbol{\eta}) - f_v(\mathbf{x})\|_1 \text{ (triangle inequality)} \le L_\infty \|\boldsymbol{\eta}\|_1 \text{ (Lemma 3)},$$

which completes the proof.

Theorem 1 holds for the MAE loss function but not for the MSE loss, because MSE is not Lipschitz continuous. In other words, the effect of additive noise imposed upon the DNN based vector-to-vector function is unbounded under the MSE loss function, whereas MAE guarantees an upper bound. The upper bound is most meaningful when the additive noise is small, since it then implies that the imposed noise cannot cause a significant performance degradation.
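Theorem 1 can be probed empirically. The following sketch (ours; an arbitrary random one-hidden-layer ReLU network stands in for $f_v$) compares $|h(\mathbf{x} + \boldsymbol{\eta}) - h(\mathbf{x})|$ against the bound $L_\infty \|\boldsymbol{\eta}\|_1$, where each per-output constant $L_{\infty,i}$ is upper-bounded via the absolute values of the layer weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, hdim, q = 6, 16, 4
W1 = rng.normal(size=(hdim, d)) / np.sqrt(d)
W2 = rng.normal(size=(q, hdim)) / np.sqrt(hdim)

f = lambda x: W2 @ np.maximum(W1 @ x, 0.0)   # ReLU network f_v: R^d -> R^q
y = rng.normal(size=q)                       # regression target
h = lambda x: np.sum(np.abs(f(x) - y))       # MAE objective h = L o f_v

# Crude upper bound on L_inf = sum_i sup_x ||grad f_i(x)||_inf for a ReLU net:
# |grad f_i(x)_j| <= sum_k |W2[i,k]| * |W1[k,j]|, independent of x.
L_inf = np.sum(np.max(np.abs(W2) @ np.abs(W1), axis=1))

x = rng.normal(size=d)
eta = 0.05 * rng.normal(size=d)              # small additive noise
lhs = abs(h(x + eta) - h(x))
print(lhs, "<=", L_inf * np.sum(np.abs(eta)))  # Theorem 1, Eq. (14)
```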
V. CONNECTION OF MAE LOSS FUNCTION TO LAPLACIAN DISTRIBUTION
We now separately link the MAE and MSE loss functions to the Laplacian distribution (LD) and Gaussian distribution (GD) based loss functions defined in [37]. Both LD and GD based losses for DNN-based multivariate regression were experimentally compared and contrasted in [37], where it was shown that the LD loss can attain better vector-to-vector regression accuracies than those obtained by optimizing GD losses.

For $N$ input samples $\{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_N\}$ and $N$ target vectors $\{\mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_N\}$, assuming $f: \mathbb{R}^d \to \mathbb{R}^q$ is a vector-to-vector regression function, we rewrite the MAE loss function as:

$$\mathcal{L}_{\rm MAE}(S, S^*) = \frac{1}{N} \sum_{n=1}^{N} \|f(\mathbf{x}_n) - \mathbf{y}_n\|_1 = \frac{1}{N} \sum_{n=1}^{N} \sum_{m=1}^{q} |f_m(\mathbf{x}_n) - y_{n,m}| = \frac{1}{N} \sum_{n=1}^{N} \sum_{m=1}^{q} \frac{|\hat{f}_m(\mathbf{x}_n) - \hat{y}_{n,m}|}{\alpha_m}, \quad (17)$$

where $\hat{f}_m(\mathbf{x}_n) = \alpha_m f_m(\mathbf{x}_n)$, $\hat{y}_{n,m} = \alpha_m y_{n,m}$, and $\alpha_m$ is the variance of dimension $m$.

To link with the LD based loss function $\mathcal{L}_{\rm LD}(S, S^*)$ in [37], an additional term $\sum_{m=1}^{q} \ln \alpha_m$ is added to $\mathcal{L}_{\rm MAE}(S, S^*)$, and we obtain

$$\mathcal{L}_{\rm LD}(S, S^*) = \mathcal{L}_{\rm MAE}(S, S^*) + \sum_{m=1}^{q} \ln \alpha_m. \quad (18)$$

Moreover, the MSE based loss function can be modified as:

$$\mathcal{L}_{\rm MSE}(S, S^*) = \frac{1}{N} \sum_{n=1}^{N} \sum_{m=1}^{q} \frac{|\hat{f}_m(\mathbf{x}_n) - \hat{y}_{n,m}|^2}{\alpha_m^2}. \quad (19)$$

Then, the GD based loss $\mathcal{L}_{\rm GD}(S, S^*)$ can be derived by adding the term $\sum_{m=1}^{q} \ln \alpha_m$ to the MSE loss $\mathcal{L}_{\rm MSE}(S, S^*)$:

$$\mathcal{L}_{\rm GD}(S, S^*) = \mathcal{L}_{\rm MSE}(S, S^*) + \sum_{m=1}^{q} \ln \alpha_m. \quad (20)$$

We observe that $\mathcal{L}_{\rm MAE}(S, S^*)$ and $\mathcal{L}_{\rm MSE}(S, S^*)$ are special cases of $\mathcal{L}_{\rm LD}(S, S^*)$ and $\mathcal{L}_{\rm GD}(S, S^*)$ that disregard the variance terms. When the variance $\alpha_m$ is a constant for all $m \in [q]$, $\mathcal{L}_{\rm LD}(S, S^*)$ and $\mathcal{L}_{\rm GD}(S, S^*)$ correspond exactly to $\mathcal{L}_{\rm MAE}(S, S^*)$ and $\mathcal{L}_{\rm MSE}(S, S^*)$, respectively, up to an additive constant.

Since the work in [37] suggests that the LD based loss function can achieve better regression performance than the GD based one, the MAE loss function can be expected to keep this advantage over MSE when the variance-related terms are identical. Our speech enhancement experiments in Section VI, where both MAE and MSE loss functions are involved, are used to verify this.
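For concreteness, below is a sketch (ours) of Eqs. (17)–(20) under our reading that the prediction errors are scaled per dimension by $\alpha_m$: the LD and GD losses reduce to weighted MAE and MSE terms plus the shared log-variance penalty, and a constant $\alpha_m = 1$ recovers plain MAE and MSE.

```python
import numpy as np

def ld_loss(F, Y, alpha):
    """Eq. (18): Laplacian-distribution loss = scaled MAE term + sum_m log(alpha_m).

    F, Y: (N, q) predictions and targets; alpha: (q,) per-dimension parameters.
    """
    mae_term = np.mean(np.sum(np.abs(F - Y) / alpha, axis=1))
    return mae_term + np.sum(np.log(alpha))

def gd_loss(F, Y, alpha):
    """Eq. (20): Gaussian-distribution loss = scaled MSE term + sum_m log(alpha_m)."""
    mse_term = np.mean(np.sum((F - Y) ** 2 / alpha ** 2, axis=1))
    return mse_term + np.sum(np.log(alpha))

rng = np.random.default_rng(0)
F, Y = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))
alpha = np.ones(3)  # constant alpha_m: LD/GD reduce to plain MAE/MSE
print(ld_loss(F, Y, alpha), gd_loss(F, Y, alpha))
```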
VI. EXPERIMENTS

This section presents our speech enhancement experiments to corroborate the aforementioned theorems. The goal of the experiments is to verify that MAE can achieve better regression performance than MSE under various noisy conditions because of the ensured upper bounds on the MAE loss function for DNN-based vector-to-vector regression.
A. Data Preparation
Our experiments were conducted on the Edinburgh noisy speech database, with a total of 11,572 and 824 clean utterances for training and testing, respectively. The noisy training dataset, at four SNR levels (15 dB, 10 dB, 5 dB, and 0 dB), was obtained using the following noises: a domestic noise (inside a kitchen), an office noise (in a meeting room), three public space noises (cafeteria, restaurant, subway station), two transportation noises (car and metro), a street noise (busy traffic intersection), and two artificially generated noises (speech-shaped noise and babble). In sum, we had 40 different noisy conditions with which to synthesize the noisy training speech utterances. As for the noisy test set, the noisy conditions comprise a domestic noise (living room), an office noise (office space), one transport noise (bus), and two street noises (open area cafeteria and a public square), at four SNR values (17.5 dB, 12.5 dB, 7.5 dB, and 2.5 dB). Thus, there were 20 noisy conditions for generating the noisy test speech utterances. The Edinburgh noisy speech corpus provides a challenging speech scenario, which allows us to better support our theorems.

B. Experimental Setup
In this work, the DNN based vector-to-vector regression models followed feed-forward architectures, where the inputs were normalized log-power spectral (LPS) feature vectors of noisy speech [38], [39], and the outputs were LPS features of either clean or enhanced speech. At training time, clean LPS vectors were assigned to the top layer of the DNN to serve as targets; at test time, the top layer of the DNN generated enhanced LPS vectors. The DNN followed an input-hidden-output feed-forward structure, where the ReLU activation function was employed in the hidden neurons and the top layer used a linear function for vector-to-vector regression. The enhanced waveforms were reconstructed using the overlap-add method, as shown in [20]. The technique of global variance equalization [40] was utilized to improve the subjective perception of the enhanced speech. At training time, the standard back-propagation (BP) algorithm was adopted to update the model parameters, with the MAE and MSE loss functions separately used to measure the difference between the normalized LPS features and the reference ones. A stochastic gradient descent (SGD) based optimizer with momentum was set up for the BP algorithm. Moreover, noise-aware training (NAT) [41] was used to enable non-stationary noise awareness, and context information was taken into account at the input by concatenating neighboring LPS frames within a sliding window [42]–[44]. During training, a maximum number of epochs was set, and one-tenth of the training data was randomly split off as a validation set; training was stopped early if the performance on the validation set started to degrade.

The evaluation metrics comprised MAE, MSE, the perceptual evaluation of speech quality (PESQ) [45], and the short-time objective intelligibility (STOI) [46]. PESQ, which ranges from -0.5 to 4.5, is an indirect evaluation highly correlated with speech quality; a higher PESQ score corresponds to better perceptual quality. Similarly, the STOI score, which ranges from 0 to 1, measures the predicted intelligibility of noisy or enhanced speech; a higher STOI score corresponds to better speech intelligibility.
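A minimal PyTorch sketch of the training configuration described above is given below (ours; the layer sizes, context window, learning rate, and momentum are placeholder values, since the paper's exact settings are not reproduced here):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 257-dim LPS frames with a 7-frame context window.
context_dim, lps_dim, hidden = 257 * 7, 257, 1024

model = nn.Sequential(                  # feed-forward DNN with ReLU hidden layers
    nn.Linear(context_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, lps_dim),         # linear output layer for regression
)

# MAE criterion; swap in nn.MSELoss() for the MSE-trained model. Note that
# nn.L1Loss averages per element, which matches Definition 2 up to a constant.
loss_fn = nn.L1Loss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

noisy = torch.randn(32, context_dim)    # stand-in batch of noisy LPS context vectors
clean = torch.randn(32, lps_dim)        # corresponding clean LPS targets

optimizer.zero_grad()
loss = loss_fn(model(noisy), clean)     # BP with the chosen loss
loss.backward()
optimizer.step()
```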
C. Evaluation Results

Using the DNN models trained with the MAE criterion (DNN-MAE) and the MSE criterion (DNN-MSE), Table I lists the MAE and MSE values for the speech enhancement experiments on the test data. The values evaluated with DNN-MAE in the top row are consistently lower than those evaluated with DNN-MSE in the bottom row under each column. More specifically, DNN-MAE achieves a lower MAE score than DNN-MSE (0.7812 vs. 0.8278). Similarly, DNN-MAE achieves a lower MSE score than DNN-MSE (0.7954 vs. 0.8371). Besides, the MAE scores for both DNN-MAE and DNN-MSE are consistently lower than the corresponding MSE values.
TABLE I
THE MAE AND MSE VALUES ON THE EDINBURGH SPEECH CORPUS.

Models     MAE      MSE
DNN-MAE    0.7812   0.7954
DNN-MSE    0.8278   0.8371

TABLE II
THE PESQ AND STOI SCORES ON THE EDINBURGH SPEECH CORPUS.

Models     PESQ     STOI
DNN-MAE    2.93     0.8509
DNN-MSE    2.85     0.8317
Moreover, Table II shows the PESQ and STOI scores obtained with the DNN-MAE and DNN-MSE models. The DNN model trained with the MAE criterion consistently outperforms the model trained with the MSE criterion (2.93 vs. 2.85 for PESQ, and 0.8509 vs. 0.8317 for STOI), which further confirms that MAE is a good objective function to optimize when training DNNs for speech enhancement. Furthermore, the performance advantages of DNN-MAE over DNN-MSE are consistent with the aforementioned theorems: (1) the upper bound in Eq. (14) ensures more robust performance against additive noise; (2) the performance gain agrees with the connection between the MAE loss function and the Laplacian distribution.

VII. CONCLUSION
This work investigates the advantages of the MAE loss function for DNN based vector-to-vector regression. On one hand, we emphasize that the Lipschitz continuity property can not only ensure a performance upper bound on DNN based vector-to-vector regression but also yield an upper bound predicting the robustness against additive noise. On the other hand, we associate the MAE loss function with the Laplacian distribution. Our experiments show that DNN based regression optimized with the MAE loss function can achieve lower loss values than those obtained with the MSE counterpart. Moreover, the MAE loss function also leads to better enhanced speech quality in terms of the PESQ and STOI scores. Our empirical results are in line with the proposed theorems for MAE and indirectly reflect that the MAE loss function benefits from its related upper bounds, as shown in this study.

REFERENCES

[1] C. Willmott, S. Ackleson, R. Davis, J. Feddema, K. Klink, D. Legates, J. O'Donnell, and C. Rowe, "Statistics for the evaluation of model performance," J. Geophys. Res., vol. 90, no. C5, pp. 8995–9005, 1985.
[2] H. Borchani, G. Varando, C. Bielza, and P. Larrañaga, "A survey on multi-output regression," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 5, pp. 216–233, 2015.
[3] T. Chai and R. R. Draxler, "Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature," Geoscientific Model Development, vol. 7, no. 3, pp. 1247–1250, 2014.
[4] C. J. Willmott and K. Matsuura, "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance," Climate Research, vol. 30, no. 1, pp. 79–82, 2005.
[5] C. J. Willmott, K. Matsuura, and S. M. Robeson, "Ambiguities inherent in sums-of-squares-based error statistics," Atmospheric Environment, vol. 43, no. 3, pp. 749–752, 2009.
[6] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer New York Inc., 2001.
[7] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[8] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. MIT Press, 2018.
[9] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[10] J. O. Berger, Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 2013.
[11] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., 1995.
[12] N. Li, L. Wang, X. Li, and Q. Zhu, "An effective deep learning neural network model for short-term load forecasting," Concurrency and Computation: Practice and Experience, vol. 32, 2020.
[13] E. Imani and M. White, "Improving regression performance with distributional losses," arXiv preprint arXiv:1806.04613, 2018.
[14] A. Pandey and D. Wang, "On adversarial training and loss functions for speech enhancement," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5414–5418.
[15] V. N. Vapnik and A. Y. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Theory of Probability & Its Applications, vol. 16, no. 2, pp. 264–280, 1971.
[16] Z. Charles and D. Papailiopoulos, "Stability and generalization of learning algorithms that converge to global optima," arXiv preprint arXiv:1710.08402, 2017.
[17] J. Qi, J. Du, S. M. Siniscalchi, and C.-H. Lee, "A theory on deep neural network based vector-to-vector regression with an illustration of its expressive power in speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 1932–1943, 2019.
[18] A. Lorencs, I. Mednieks, and J. Sinica-Sinavskis, "Biomedical image processing based on regression models," 2008, pp. 536–539.
[19] H. Takeda, S. Farsiu, and P. Milanfar, "Kernel regression for image processing and reconstruction," IEEE Transactions on Image Processing, vol. 16, no. 2, pp. 349–366, 2007.
[20] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
[21] J. Qi, H. Hu, Y. Wang, C.-H. Yang, S. M. Siniscalchi, and C.-H. Lee, "Tensor-to-vector regression for multi-channel speech enhancement based on tensor-train network," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7504–7508.
[22] O. L. Mangasarian and T.-H. Shiau, "Lipschitz continuity of solutions of linear inequalities, programs and complementarity problems," SIAM Journal on Control and Optimization, vol. 25, no. 3, pp. 583–595, 1987.
[23] M. Ó Searcóid, Metric Spaces. Springer Science & Business Media, 2006.
[24] P. L. Bartlett and S. Mendelson, "Rademacher and Gaussian complexities: Risk bounds and structural results," Journal of Machine Learning Research, vol. 3, pp. 463–482, 2002.
[25] P. L. Bartlett, O. Bousquet, and S. Mendelson, "Local Rademacher complexities," Annals of Statistics, vol. 33, no. 4, pp. 1497–1537, 2005.
[26] J. Qi, J. Du, S. M. Siniscalchi, X. Ma, and C.-H. Lee, "Analyzing upper bounds on mean absolute errors for deep neural network based vector-to-vector regression," IEEE Transactions on Signal Processing, vol. 68, pp. 3411–3422, 2020.
[27] J. Su, D. V. Vargas, and K. Sakurai, "One pixel attack for fooling deep neural networks," IEEE Transactions on Evolutionary Computation, vol. 23, no. 5, pp. 828–841, 2019.
[28] C.-H. Yang, J. Qi, P.-Y. Chen, X. Ma, and C.-H. Lee, "Characterizing speech adversarial examples using self-attention U-Net enhancement," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 3107–3111.
[29] T.-W. Weng, H. Zhang, P.-Y. Chen, J. Yi, D. Su, Y. Gao, C.-J. Hsieh, and L. Daniel, "Evaluating the robustness of neural networks: An extreme value theory approach," in Proc. International Conference on Learning Representations (ICLR), 2018.
[30] S. Kotz, T. Kozubowski, and K. Podgorski, The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Springer Science & Business Media, 2012.
[31] N. R. Goodman, "Statistical analysis based on a certain multivariate complex Gaussian distribution (an introduction)," The Annals of Mathematical Statistics, vol. 34, no. 1, pp. 152–177, 1963.
[32] J. Fan, C. Ma, and Y. Zhong, "A selective overview of deep learning," accepted to Statistical Science, 2020.
[33] J. Zhu, B. R. Gibson, and T. T. Rogers, "Human Rademacher complexity," in Advances in Neural Information Processing Systems, 2009, pp. 2322–2330.
[34] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019, vol. 48.
[35] R. Paulavičius and J. Žilinskas, "Analysis of different norms and corresponding Lipschitz constants for global optimization," Technological and Economic Development of Economy, vol. 12, no. 4, pp. 301–306, 2006.
[36] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas, "Efficient and accurate estimation of Lipschitz constants for deep neural networks," in Proc. Advances in Neural Information Processing Systems (NIPS), 2019, pp. 11423–11434.
[37] L. Chai, J. Du, Q.-F. Liu, and C.-H. Lee, "Using generalized Gaussian distributions to improve regression error modeling for deep learning-based speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 1919–1931, 2019.
[38] L. Deng, J. Droppo, and A. Acero, "Enhancement of log mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 2, pp. 133–143, 2004.
[39] J. Qi, D. Wang, Y. Jiang, and R. Liu, "Auditory features based on gammatone filters for robust speech recognition," in Proc. IEEE International Symposium on Circuits and Systems, 2013, pp. 305–308.
[40] H. Silén, E. Helander, J. Nurminen, and M. Gabbouj, "Ways to implement global variance in statistical speech synthesis," in Proc. INTERSPEECH, 2012, pp. 1436–1439.
[41] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "Dynamic noise aware training for speech enhancement based on deep neural networks," in Proc. INTERSPEECH, 2014, pp. 2670–2674.
[42] J. Qi, D. Wang, and J. Tejedor Noguerales, "Subspace models for bottleneck features," in Proc. INTERSPEECH, 2013, pp. 1746–1750.
[43] J. Qi, D. Wang, J. Xu, and J. Tejedor Noguerales, "Bottleneck features based on gammatone frequency cepstral coefficients," in Proc. INTERSPEECH, 2013, pp. 1751–1755.
[44] J. Qi and J. Tejedor, "Robust submodular data partitioning for distributed speech recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2254–2258.
[45] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, 2001, pp. 749–752.
[46] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 4214–4217.