An l 1 -oracle inequality for the Lasso in mixture-of-experts regression models
TrungTin Nguyen, Hien D Nguyen, Faicel Chamroukhi, Geoffrey J McLachlan
aa r X i v : . [ m a t h . S T ] S e p An l -oracle inequality for the Lassoin mixture-of-experts regression models TrungTin Nguyen ∗ , Hien D Nguyen , Faicel Chamroukhi ,and Geoffrey J McLachlan Lab of Mathematics Nicolas Oresme LMNO, UMR CNRS, Caen, France. School of Engineering and Mathematical Sciences. Department of Mathematics and Statistics, LaTrobe University, Melbourne, Victoria, Australia. School of Mathematics and Physics, University of Queensland, St. Lucia, Brisbane, Australia. ∗ Corresponding author, email: [email protected].
Abstract
Mixture-of-experts (MoE) models are a popular framework for modeling heterogeneity in data,for both regression and classification problems in statistics and machine learning, due to theirflexibility and the abundance of statistical estimation and model choice tools. Such flexibilitycomes from allowing the mixture weights (or gating functions) in the MoE model to depend on theexplanatory variables, along with the experts (or component densities). This permits the modelingof data arising from more complex data generating processes, compared to the classical finitemixtures and finite mixtures of regression models, whose mixing parameters are independent of thecovariates. The use of MoE models in a high-dimensional setting, when the number of explanatoryvariables can be much larger than the sample size (i.e., p ≫ n ), is challenging from a computationalpoint of view, and in particular from a theoretical point of view, where the literature is still lackingresults in dealing with the curse of dimensionality, in both the statistical estimation and featureselection. We consider the finite mixture-of-experts model with soft-max gating functions andGaussian experts for high-dimensional regression on heterogeneous data, and its l -regularizedestimation via the Lasso. We focus on the Lasso estimation properties rather than its featureselection properties. We provide a lower bound on the regularization parameter of the Lassofunction that ensures an l -oracle inequality satisfied by the Lasso estimator according to theKullback-Leibler loss. Keywords.
Mixture-of-Experts, mixture of regressions, penalized maximum likelihood, l -oracle inequal-ity, high-dimensional statistics, Lasso. Mixture-of-experts (MoE) models, a flexible generalization of classical finite mixture models, were introducedby Jacobs et al. (1991) in a problem decomposition context, and are widely used in statistics and machinelearning, thanks to their flexibility and the abundance of statistical estimation and model choice tools. Themain idea of MoE is a divide-and-conquer principle that proposes dividing a complex problem into a set ofsimpler subproblems and then one or more specialized problem-solving tools, or experts, are assigned to eachof the subproblems. The flexibility of MoE models comes from allowing the mixture weights (or the gatingfunctions) to depend on the explanatory variables, along with the experts (or the component densities). Thispermits the modeling of data arising from more complex data generating processes than the classical finitemixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates.Statistically, the MoE models are used to estimate the conditional distribution of a random variable Y ∈ R q , given certain features from n observations { x i } i ∈ [ n ] = { ( x i , . . . , x ip ) } i ∈ [ n ] ∈ ( R p ) n , where q, p, n ∈ N ⋆ ,[ n ] := { , . . . , n } , N ⋆ denotes the positive integer numbers, and R p means the p -dimensional real number. In thecontext of regression, finite MoE models with Gaussian experts and soft-max gating functions are a standardchoice and a powerful tool for modeling more complex non-linear relationships between response and predictors,arising from different subpopulations, compared to the finite mixture of Gaussian regression models. The readeris referred to Nguyen & Chamroukhi (2018) for a recent review on the topic. he use of MoE models in the high-dimensional regression setting, when the number of explanatory variablescan be much larger than the sample size, remains a challenge, particularly from a theoretical point of view,where there is still a lack of results in the literature regarding both statistical estimation and model selection. Insuch settings, we are required to reduce the dimension of the problem by seeking the most relevant relationships,to avoid numerical identifiability problems.We focus on the use of an l -penalized maximum likelihood estimator (MLE), as originally proposed as theLasso by Tibshirani (1996), which tends to produce sparse solutions and can be viewed as a convex surrogatefor the non-convex l -penalization problem. These methods have attractive computational and theoreticalproperties (cf. Fan & Li, 2001). First introduced in Tibshirani (1996) for the linear regression model, theLasso estimator has since been extended to many statistical problems, including for high-dimensional regressionof non-homogeneous data by using finite mixture regression models as considered by Khalili & Chen (2007),Stadler et al. (2010), and Lloyd-Jones et al. (2018). In Stadler et al. (2010), it is assumed that, for i ∈ [ n ] , n ∈ N ⋆ , the observations y i , conditionally on X i = x i , come from a conditional density s ψ ( ·| x i ) , which is a finitemixture of K ∈ N ⋆ Gaussian conditional densities with mixing proportions ( π , , . . . , π ,K ), where Y i | X i = x i ∼ s ψ ( y i | x i ) = K X k =1 π ,k φ ( y i ; β ⊤ ,k x i , σ ,k ) . (1)Here φ ( · ; µ, σ ) = 1 √ πσ exp − ( · − µ ) σ ! 
is the univariate Gaussian probability density function (PDF), with mean µ ∈ R and variance σ ∈ R + , and ψ = ( π ,k , β ,k , σ ,k ) k ∈ [ K ] is the vector of model parameters.Then, considering a model S , defined by the form (1). To estimate the true generative model s ψ ,Stadler et al. (2010) proposed a Lasso-regularization based estimator, which consists of a minimiser of thepenalized negative conditional log-likelihood that is defined by b s Lasso ( λ ) = argmin s ψ ∈ S ( − n n X i =1 ln ( s ψ ( y i | x i )) + pen λ ( ψ ) ) , pen λ ( ψ ) = λ K X k =1 π k p X j =1 (cid:12)(cid:12) σ − k β kj (cid:12)(cid:12) , λ > , ψ = ( π, β k , σ k ) k ∈ [ K ] . (2)For this estimator, the authors provided an l -oracle inequality, satisfied by b s Lasso ( λ ), conditional on the re-stricted eigenvalue condition and margin condition, which leads to link the Kullback-Leibler loss function to the l -norm of the parameters.Another direction of study regarding b s Lasso ( λ ) is to look at its l -regularization properties; see, for example,Massart & Meynet (2011), Meynet (2013), and Devijver (2015). As indicated by Devijver (2015), contrary toresults for the l penalty, some results for the l penalty are valid with no assumptions, neither on the Grammatrix nor on the margin. However, such results can be achieved only at a rate of convergence of 1 /n , ratherthan at order 1 / √ n .In the framework of finite mixtures of Gaussian regression models, Meynet (2013) considered the casefor a univariate response, and Devijver (2015) extended these results to the case of a multivariate responses, i.e., the Gaussian conditional pdf in (1) is replaced by a multivariate Gaussian PDF of the form φ ( · ; µ, Σ)with mean vector µ and a covariance matrix Σ. In particular, Devijver (2015) considered an extension of theLasso-estimator (2), with a regularization term defined by pen λ ( ψ ) = λ P Kk =1 P pj =1 P qz =1 (cid:12)(cid:12)(cid:12) [ β k ] z,j (cid:12)(cid:12)(cid:12) .In this article, we shall extend such result for the finite mixture of Gaussian regressions models, which isconsidered as a special case of the MoE models, where only the mixture components depend on the features,to the more general mixture of Gaussian experts regression models with soft-max gating functions, as definedin (6). Since each mixing proportion is modeled by a soft-max function of the covariates, the dependence oneach feature appears both in the experts pdfs and in the mixing proportion functions (gating functions), whichallows us to capture more complex non-linear relationships between the response and predictors arising fromdifferent subpopulations, compared to the finite mixture of Gaussian regression models. This is demonstratedvia numerical experiments in several articles such as Nguyen & Chamroukhi (2018), Chamroukhi & Huynh(2018), and Chamroukhi & Huynh (2019).In the context of studying the statistical properties of the penalized maximum likelihood approach for MoEmodels with soft-max gating functions, we may consider the prior works of Khalili (2010) and Montuelle et al.(2014). In Khalili (2010), for feature selection, two extra penalty terms are applied to the l -penalized conditional og-likelihood function. 
Their penalized conditional log-likelihood estimator is given by b s PL ( λ ) = argmin s ψ ∈ S ( − n n X i =1 ln ( s ψ ( y i | x i )) + pen λ ( ψ ) ) , (3) s ψ ( y | x ) = K X k =1 g k ( x ; γ ) φ (cid:0) y ; β k + β ⊤ k x, σ k (cid:1) , ψ = ( γ k , β k , σ k ) k ∈ [ K ] , (4)pen λ ( ψ ) = K X k =1 λ [1] k p X j =1 | γ kj | + K X k =1 λ [2] k p X j =1 | β kj | + λ [3] K X k =1 k γ k k , (5)where λ = (cid:16) λ [1]1 , . . . , λ [1] K , λ [2]1 , . . . , λ [2] K , λ [3] (cid:17) is a vector of non-negative regularization parameters, S contains allfunctions of form (3), k·k is the Euclidean norm in R p , and g k ( x ; γ ) = exp (cid:0) γ k + γ ⊤ k x (cid:1)P Kl =1 exp (cid:0) γ l + γ ⊤ l x (cid:1) is a soft-max gating function. Note that the first two terms from (5) are the normal Lasso functions ( l penalty function), while the l penalty function for the gating network is added to excessively wildly largeestimates of the regression coefficients corresponding to the mixing proportions. This behavior can be ob-served in logistic/multinomial regression when the number of potential features is large and highly corre-lated (see e.g., Park & Hastie, 2008 and Bunea et al., 2008). However, this also affects the sparsity of theregularization model, which is confirmed via the numerical experiments of Chamroukhi & Huynh (2018) andChamroukhi & Huynh (2019).By extending the theoretical developments for mixture of linear regression models in Khalili & Chen (2007),standard asymptotic theorems for MoE models are established in Khalili (2010). More precisely, under severalstrict regularity conditions on the true joint density function s ψ ( y, x ) and the choice of tuning parameter λ , theestimator of the true parameter vector b ψ PL n ( λ ), defined via b s PL ( λ ) from (3) but using the Scad penalty functionfrom Fan & Li (2001), instead of Lasso, is proved to be both consistent in feature selection and maintains root- n consistency. Differing from Scad, for Lasso, the estimator b ψ PL n ( λ ) cannot achieve both properties, simultaneously.In other words, Lasso is consistent in feature selection but introduces bias to the estimators of the true nonzerocoefficients.Another related result to our work is the weak oracle inequality from Montuelle et al. (2014, Theorem 1).Montuelle et al. (2014) focused on the variable selection procedure instead of investigating the l -regularizationproperties for the Lasso estimator. A detailed comparison between our work and their results can be foundin Remark 3.1. Therefore, our non-asymptotic result in Theorem 3.1 can be considered as a complement tosuch asymptotics for MoE regression models with soft-max gating functions. To obtain our oracle inequality,Theorem 3.1, we shall restrict our study to the Lasso estimator without the l -norm.While studying the oracle inequality within the context of the ( l + l )-norm may also be interesting. Ithas been demonstrated, in Huynh & Chamroukhi (2019), that the regularized maximum-likelihood estimationof MoE models for generalized linear models, better encourages sparsity under the l -norm, compared to whenusing the ( l + l )-norm, which may affect sparsity. We shall not discuss such approaches, further.To the best of our knowledge, we are the first to study the l -regularization properties of the MoE regressionmodels. In the current paper, we focus on a simplified but standard setting in which the means of the expertsare linear functions, with respect to explanatory variables. 
Although simplified, this model captures the coreof the MoE regression problem, which is the interactions among the different mixture components. We believethat the general techniques that we develop here can be extended to more general experts, such as Gaussianexperts with polynomial means ( e.g., Mendes & Jiang, 2012) or even with hierarchical MoE for exponentialfamily regression models in Jiang & Tanner (1999). But we leave such nontrivial developments for future work.The main contribution of our paper is a theoretical result: an oracle inequality, Theorem 3.1, which providesthe lower bound on the regularization parameters of Lasso that ensures such non asymptotic theoretical controlon the Kullback-Leibler loss of the Lasso estimator for the mixtures of Gaussian experts regression models withsoft-max gating functions. Note that this result is non-asymptotic; i.e., the number of observations n is fixed,while the number of predictors p and the dimension of the responses q can grow, with respect to n , and canbe much larger than n . Good discussions about non-asymptotic statistics are provided in Massart (2007) andWainwright (2019).Note that, as in Khalili (2010), the true order K of the MoE model (the true number of experts in our model)is supposed to be known. From a pragmatic perspective, one may estimate it via using the AIC of Akaike (1974),the BIC of Schwarz et al. (1978), or slope heuristic of Birg´e & Massart (2007). Our result follows directly theline of work of Meynet (2013) and Devijver (2015). In fact, our theorem combined Vapnik’s structural risk inimization paradigm ( e.g., Vapnik, 1982) and theory of model selection for conditional density estimation( e.g.,
Cohen & Pennec, 2011), which is an extended version of the density estimation results from Massart(2007).The goal of this paper is to provide a treatment regarding penalizations that guarantee an l -oracle inequalityfor finite MoE models in particular for high-dimensional non-linear regression. As such, the remainder of thearticle progresses as follows. In Section 2, we discuss the construction and framework of finite mixture ofGaussian experts regression models with soft-max gating functions. In Section 3, we state the main result of thearticle, which is an l -oracle inequality satisfied by the Lasso estimator in the finite mixture of Gaussian expertsregression models. Section 4 is devoted to the proof of these main results. The proof of technical lemmas can befounded in Section 5. Some conclusions are provided in Section 6, and additional technical results are relegatedto Appendix A. We consider the statistical framework in which we model a sample of high-dimensional regression data generatedfrom a heterogeneous population via the mixtures of Gaussian experts regression models with Gaussian gatingfunctions. We observe n independent couples (( x i , y i )) i ∈ [ n ] ∈ ( X × R q ) n ⊂ ( R p × R q ) n ( p, q, n ∈ N ⋆ ), wheretypically p ≫ n , x i is fixed and y i is a realization of the random variable variable Y i , for all i ∈ [ n ]. We assumethat X is a compact set of R p . We also assume that the response variable Y i depends on the set of explanatoryvariables (covariates) through a regression-type model. The conditional probability density function (PDF) ofthe model is approximated by mixture of Gaussian experts regression models with soft-max gating functions.The approximation capabilities of such MoE models have been extensively studied in Jiang & Tanner (1999),Norets et al. (2010), Nguyen et al. (2016), Ho et al. (2019), and Nguyen et al. (2019), and particular in the caseof finite mixture models by Genovese et al. (2000), Nguyen et al. (2013), Ho et al. (2016a), Ho et al. (2016b),and Nguyen et al. (2020a,b).More precisely, we assume that, conditionally to the { x i } i ∈ [ n ] , { Y i } i ∈ [ n ] are independent and identicallydistributed with conditional density s ( ·| x i ), which is approximated by a MoE model. Our goal is to estimatethis conditional density function s from the observations.For any K ∈ N ⋆ , the K -component MoE model can be defined asMoE ( y | x ; θ ) = K X k =1 g k ( x ; γ ) f k ( y | x ; η ) , where g k ( x ; γ ) > P Kk =1 g k ( x ; γ ) = 1, and f k ( y | x ; η ) is a conditional PDF (cf. Nguyen & Chamroukhi,2018). In our proposal, we consider the MoE model of Jordan & Jacobs (1994), which extended the originalMoE from Jacobs et al. (1991), for a regression model. More precisely, we utilize the following mixtures ofGaussian experts regression models with soft-max gating functions: s ψ ( y | x ) = K X k =1 g k ( x ; γ ) φ ( y ; v k ( x ) , Σ k ) , (6)to estimate s , where given any k ∈ [ K ], φ ( · ; v k , Σ k ) is the multivariate Gaussian density with mean v k ,which is a function of x tgat specifies the mean of the k th component, and with covariance matrix Σ k . Here,( v, Σ) := (( v , . . . , v K ) , (Σ , . . . 
, Σ K )) ∈ (Υ × V ), where Υ is a set of K -tuples of mean functions from X to R q and V is a sets of K -tuples of symmetric positive definite matrices on R q , and the soft-max gating function g k ( x ; γ ) is defined as in (7): g k ( x ; γ ) = exp ( w k ( x )) P Kl =1 exp ( w l ( x )) , w k ( x ) = γ k + γ ⊤ k x, γ = (cid:0) γ k , γ ⊤ k (cid:1) k ∈ [ K ] ∈ Γ = R ( p +1) K . (7)We shall define the parameter vector ψ in the sequel. Inspired by the framework in Meynet (2013) and Devijver (2015), the explanatory variables x i and the numberof components K ∈ N ⋆ are both fixed. We assume that the observed x i , i ∈ [ n ], are finite. Without loss ofgenerality, we choose to rescale x , so that k x k ∞ ≤
1. Therefore, we can assume that the explanatory variables x i ∈ X = [0 , p , for all i ∈ [ n ]. Note that such a restriction is also used in Devijver (2015). Under only theassumption of bounded parameters, we provide a lower bound on the Lasso regularization parameter λ , whichguarantees an oracle inequality. Note that in this non-random explanatory variables setting, we focus on theLasso for its l -regularization properties rather than as a model selection procedure, as in the case of randomexplanatory variables and unknown K , as in Montuelle et al. (2014).For simplicity, we consider the case where the means of Gaussian experts are linear functions of the ex-planatory variables; i.e., Υ = (cid:26) v : X 7→ v β ( x ) := ( β k + β k x ) k ∈ [ K ] ∈ ( R q ) K (cid:12)(cid:12)(cid:12)(cid:12) β = ( β k , β k ) k ∈ [ K ] ∈ B = (cid:16) R q × ( p +1) (cid:17) K (cid:27) , where β k and β k are respectively the q × q × p regression coefficients matrix for the k th expert.In summary, we wish to estimate s via conditional densities belonging to the class: { ( x, y ) s ψ ( y | x ) | ψ = ( γ, β, Σ) ∈ Ψ } , (8)where Ψ = Γ × Ξ, and Ξ =
B × V .From hereon in, for a vector x ∈ R p , we assume that x = ( x , . . . , x p ) is in the column form. Similarly, theparameter of the entire model, ψ = ( γ, β, Σ), is also a column vector, where we consider any matrix as a vectorproduced using vec( · ): the vectorization operator that stacks the columns of a matrix into a vector. For a matrix A , let m ( A ) be the modulus of the smallest eigenvalue, and M ( A ) the modulus of the largesteigenvalue. We shall restrict our study to estimate s by conditional PDFs belonging to the model class S ,which has boundedness assumptions on the softmax gating and Gaussian expert parameters. Specifically, weassume that there exists deterministic constants A γ , A β , a Σ , A Σ >
0, such that ψ ∈ e Ψ, where e Γ = (cid:26) γ ∈ Γ | ∀ k ∈ [ K ] , sup x ∈X (cid:0) | γ k | + (cid:12)(cid:12) γ ⊤ k x (cid:12)(cid:12)(cid:1) ≤ A γ (cid:27) , e Ξ = (cid:26) ξ ∈ Ξ | ∀ k ∈ [ K ] , max z ∈{ ,...,q } sup x ∈X ( | [ β k ] z | + | [ β k x ] z | ) ≤ A β , a Σ ≤ m (cid:0) Σ − k (cid:1) ≤ M (cid:0) Σ − k (cid:1) ≤ A Σ (cid:27) , e Ψ = e Γ × e Ξ . (9)Since a G := exp ( − A γ ) P Kl =1 exp ( A γ ) ≤ sup x ∈X ,γ ∈ e Γ exp (cid:0) γ k + γ ⊤ k x (cid:1)P Kl =1 exp (cid:0) γ l + γ ⊤ l x (cid:1) ≤ exp ( A γ ) P Kl =1 exp ( − A γ ) =: A G , there exists deterministic positive constants a G , A G , such that a G ≤ sup x ∈X ,γ ∈ e Γ g k ( x ; γ ) ≤ A G . (10)We wish to use the model class S of conditional PDFs to estimate s , where S = n ( x, y ) s ψ ( y | x ) (cid:12)(cid:12)(cid:12) ψ = ( γ, β, Σ) ∈ e Ψ o . (11)To simplify the proofs, we shall assume that the true density s belongs to S . That is to say, there exists ψ = ( γ , β , Σ ) ∈ e Ψ, such that s = s ψ . In maximum likelihood estimation, we consider the Kullback-Leibler information as the loss function, which isdefined for densities s and t byKL( s, t ) = (R R q ln (cid:16) s ( y ) t ( y ) (cid:17) s ( y ) dy if sdy is absolutely continuous with respect to tdy, + ∞ otherwise . ince we are working with conditional PDFs and not with classical densities, we define the following adaptedKullback-Leibler information that takes into account the structure of conditional PDFs. For fixed explanatoryvariables ( x i ) ≤ i ≤ n , we consider the average loss functionKL n ( s, t ) = 1 n n X i =1 KL ( s ( ·| x i ) , t ( · , | x i )) = 1 n n X i =1 Z R q ln (cid:18) s ( y | x i ) t ( y | x i ) (cid:19) s ( y | x i ) dy. (12)The maximum likelihood estimation approach suggests to estimate s by the conditional PDF s ψ thatmaximizes the likelihood, conditioned on ( x i ) ≤ i ≤ n , defined asln n Y i =1 s ψ ( y i | x i ) ! = n X i =1 ln ( s ψ ( y i | x i )) . Or equivalently, that minimizes the empirical contrast: − n n X i =1 ln ( s ψ ( y i | x i )) . However, since we want to handle high-dimensional data, we have to regularize the maximum likelihood estima-tor (MLE) in order to obtain reasonable estimates. Here, we shall consider l -regularization and the associatedso-called Lasso estimator, which is the following l -norm penalized MLE: b s Lasso ( λ ) := argmin s ψ ∈ S ( − n n X i =1 ln ( s ψ ( y i | x i )) + pen λ ( ψ ) ) , (13)where λ ≥ ψ = ( γ, β, Σ) andpen λ ( ψ ) = λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) := λ (cid:16)(cid:13)(cid:13)(cid:13) ψ [1] (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) ψ [2] (cid:13)(cid:13)(cid:13) (cid:17) , (14) (cid:13)(cid:13)(cid:13) ψ [1] (cid:13)(cid:13)(cid:13) = k γ k = K X k =1 p X j =1 | γ kj | , (15) (cid:13)(cid:13)(cid:13) ψ [2] (cid:13)(cid:13)(cid:13) = k vec( β ) k = K X k =1 p X j =1 q X z =1 (cid:12)(cid:12)(cid:12) [ β k ] z,j (cid:12)(cid:12)(cid:12) . (16)From now on, we denote k β k p ( p ∈ { , , ∞} ) by the induced p -norm of a matrix; see Definition A.1, whichdiffers from k vec( β ) k p .Note that pen λ ( ψ ) is a Lasso regularization term encouraging sparsity for both the gating and expertparameters. Recall that this penalty is also studied in Khalili (2010), Chamroukhi & Huynh (2018), andChamroukhi & Huynh (2019), in which the authors studied the univariate case: Y ∈ R . Notice that, withoutconsidering the l -norm, the penalty function considered in (5) belongs to our framework and the l -oracle in-equality from Theorem 3.1 can be obtained for it. 
Indeed, by considering λ = min n λ [1]1 , . . . , λ [1] K , λ [2]1 , . . . , λ [2] K , λ [3] o ,the condition for a regularization parameter’s lower bound, (17) from Theorem 3.1, can also be applied to model(3), which leads to an l -oracle inequality. l -oracle inequality for the Lasso estimator In this section, we state Theorem 3.1, which is proved in Section 4.3. This result provides an l -oracle inequalityfor the Lasso estimator for mixtures of Gaussian experts regression models with soft-max gating functions. It isthe primary contribution of this article and is motivated by the problem studied in Meynet (2013) and Devijver(2015). Theorem 3.1 ( l -oracle inequality) . We observe (( x i , y i )) i ∈ [ n ] ∈ ([0 , p × R q ) , coming from the unknownconditional mixture of Gaussian experts regression models s := s ψ ∈ S , cf. (11) . We define the Lassoestimator b s Lasso ( λ ) , by (13) , where λ ≥ is a regularization parameter to be tuned. Then, if λ ≥ κ KB ′ n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) , (17) B ′ n = max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) , (18) or some absolute constants κ ≥ , the estimator b s Lasso ( λ ) satisfies the following l -oracle inequality: E (cid:2) KL n (cid:0) s , b s Lasso ( λ ) (cid:1)(cid:3) ≤ (cid:0) κ − (cid:1) inf s ψ ∈ S (cid:16) KL n ( s , s ψ ) + λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) (cid:17) + λ + r Kn e q − π q/ A q/ p qA γ + 302 q r Kn max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) × K (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! . (19) Remark 3.1.
Theorem 3.1 provide information about the performance of the Lasso as an l regularizationestimator for mixtures of Gaussian experts regression models. If the regularization parameter λ is properlychosen, the Lasso estimator, which is the solution of the l -penalized empirical risk minimization problem,behaves as well as the deterministic Lasso, which is the solution of the l -penalized true risk minimizationproblem, up to an error term of order λ .of observations n is fixed while the number of covariates p can grow with respect to n , and in fact can bemuch larger than n . The number of components K in the MoE model is fixed.As in Devijver (2015), we suppose that the regressors belong to X = [0 , p , for simplicity. However, thearguments in our proof are valid for covariates of any scale.To the best of our knowledge, we are the first to prove the non-asymptotic l -oracle inequality of Theorem3.1, for the mixture of Gaussian experts regression models with l -regularization. Note that by extending thetheoretical developments for mixture of linear regression models in Khalili & Chen (2007), a standard asymptotictheory for MoE models is established in Khalili (2010). Therefore, our non-asymptotic result in Theorem 3.1can be considered as a complementary result to such asymptotic results for MoE models with soft-max gatingfunctions. Motivated by the idea from Meynet (2013) and Devijver (2015), we study the Lasso as the solution of a penalizedmaximum likelihood model selection procedure over countable collections of models in an l -ball. Then Theorem3.1 is an immediate consequence of Theorem 4.1, stated below, which is an l -ball MoE regression model selectiontheorem for l -penalized maximum conditional likelihood estimation, in the Gaussian mixture framework. Theorem 4.1.
Assume that we observe (( x i , y i )) i ∈ [ n ] with unknown conditional Gaussian mixture PDF s . Forall m ∈ N ⋆ , consider the l -ball S m = n s ψ ∈ S, (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) ≤ m o (20) where, (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) ψ [1] (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) ψ [2] (cid:13)(cid:13)(cid:13) , (cid:13)(cid:13)(cid:13) ψ [1] (cid:13)(cid:13)(cid:13) = k γ k = K X k =1 p X j =1 | γ kj | , (cid:13)(cid:13)(cid:13) ψ [2] (cid:13)(cid:13)(cid:13) = k vec( β ) k = K X k =1 p X j =1 q X z =1 (cid:12)(cid:12)(cid:12) [ β k ] z,j (cid:12)(cid:12)(cid:12) , and let b s m be a η m - ln -likelihood minimizer in S m for some η m ≥ : − n n X i =1 ln ( b s m ( y i | x i )) ≤ inf s m ∈ S m − n n X i =1 ln ( s m ( y i | x i )) ! + η m . (21) Assume that, for all m ∈ N ⋆ , the penalty function satisfies pen ( m ) = λm , where λ is defined later. Then, wedefine the penalized likelihood estimate b s b m , where b m is defined via the satisfaction of the inequality − n n X i =1 ln ( b s b m ( y i | x i )) + pen ( b m ) ≤ inf m ∈ N ⋆ − n n X i =1 ln ( b s m ( y i | x i )) + pen ( m ) ! + η, (22) for some η ≥ . Then, if λ ≥ κ KB ′ n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) , (23) B ′ n = max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) , (24) for some absolute constants κ ≥ , then E [KL n ( s , b s b m )] ≤ (cid:0) κ − (cid:1) inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + pen ( m ) + η m (cid:19) + η + r Kn e q − π q/ A q/ p qA γ + 302 q r Kn max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) × K (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! . (25) Remark 4.1.
Note that Theorem 3.1 is also complementary to Theorem 1 of Montuelle et al. (2014), whoalso considered the mixture of Gaussian experts regression models with soft-max gating functions. Notice thatthey focused on model selection and obtained a weak oracle inequality for the penalized MLE, while we aim tostudy the l -regularization properties of the Lasso estimators. However, we can compare their procedure withTheorem 4.1.The main reason explaining their result being considered a weak oracle inequality is that we can see thatTheorem 1 of Montuelle et al. (2014) uses difference divergence on the left (the JKL ⊗ n ρ , tensorized Jensen-Kullback-Leibler divergence), and on the right (the KL ⊗ n , tensorized Kullback-Leibler divergence). However,under a strong assumption, the two divergences are equivalent for the conditional PDFs considered. This strongassumption is nevertheless satisfied, if we assume that X is compact, as is the case of X = [0 , p in Theorem4.1, s is compactly supported, and the regression functions are uniformly bounded, and there is a uniformlower bound on the eigenvalues of the covariance matrices.To illustrate the strictness of the compactness assumption for s , we only need to consider s as a univari-ate Gaussian PDF, which obviously does not satisfy such a hypothesis. Therefore, in such case, Theorem 1in Montuelle et al. (2014) is actually weaker than Theorem 3.1, with respect to the compact support assumptionon the true conditional PDF s . On the contrary, the only assumption used to establish Theorem 4.1 is theboundedness of the parameters of the mixtures, which is also assumed in Montuelle et al. (2014, Theorem 1).Furthermore, these boundedness assumptions also appeared in Stadler et al. (2010), Meynet (2013), andDevijver (2015), and is quite usual when working with maximum likelihood estimation (Baudry, 2009, Maugis & Michel,2011), at least when considering the problem of the unboundedness of the likelihood on the boundary of the pa-rameter space (McLachlan & Peel, 2000, Redner & Walker, 1984), and to prevent the likelihood from diverging.Nevertheless, by using the smaller divergence: JKL ⊗ n ρ (or more strict assumptions on s and s m , so that thesame divergence KL ⊗ n appears on both side of the oracle inequality in Theorem 4.1), Montuelle et al. (2014,Theorem 1) obtained the faster rate of convergence of order 1 /n , while in Theorem 4.1, we only seek a rate ofconvergence of order 1 / √ n . Therefore, in cases where there are no guarantees on the strict conditions such asthe compactness of the support of s and the uniform boundedness of the regression functions, Theorem 4.1provides a theoretical foundation for the Lasso estimators with the order of convergence of 1 / √ n with only aboundedness assumption on the parameter space.Note that the constants 1 + κ − from the upper bound in Theorem 4.1 and C from Montuelle et al. (2014,Theorem 1) can not be taken to be equal to 1. This fact is consequential as when s does not belong tothe approximation class, i.e., when the model is misspecified. This problem also occurred in the l -oracleinequalities from Meynet (2013) and Devijver (2015). Deriving an oracle inequality such that 1 + κ − = 1, forthe Kullback-Leibler loss, is still an open problem. We hope to overcome this challenge in the future.Theorem 4.1 can be deduced from the two following propositions, which address the cases for large andsmall values of Y . Proposition 4.1.
Assume that we observe (( x i , y i )) i ∈ [ n ] , with unknown conditional PDF s . Let M n > andconsider the event T = (cid:26) max i =1 ,...,n k Y i k ∞ = max i =1 ,...,n max z ∈{ ,...,q } | [ Y i ] z | ≤ M n (cid:27) . For all m ∈ N ⋆ , consider the l -ball S m = n s ψ ∈ S, (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) ≤ m o and let b s m be a η m - ln -likelihood minimizer in S m , for some η m ≥ : − n n X i =1 ln ( b s m ( y i | x i )) ≤ inf s m ∈ S m − n n X i =1 ln ( s m ( y i | x i )) ! + η m . Assume that for all m ∈ N ⋆ , the penalty function satisfies pen ( m ) = λm , where λ is defined later. Then, wedefine the penalized likelihood estimate b s b m with b m defined via the inequality − n n X i =1 ln ( b s b m ( y i | x i )) + pen ( b m ) ≤ inf m ∈ N ⋆ − n n X i =1 ln ( b s m ( y i | x i )) + pen ( m ) ! + η, (26) for some η ≥ . Then, if λ ≥ κ KB n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) ,B n = max ( A Σ , KA G ) (cid:16) q √ q ( M n + A β ) A Σ (cid:17) , for some absolute constants κ ≥ , then E [KL n ( s , b s b m ) T ] ≤ (cid:0) κ − (cid:1) inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + pen ( m ) + η m (cid:19) + 302 K / qB n √ n (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! + η. (27) Proposition 4.2.
Consider s , T , and b s m as defined in Proposition 4.1. Denote by T C the complement of T , i.e., T C = (cid:26) max i =1 ,...,n k Y i k ∞ = max i =1 ,...,n max z ∈{ ,...,q } | [ Y i ] z | > M n (cid:27) . Then, E [KL n ( s , b s b m ) T C ] ≤ e q/ − π q/ A q/ p KnqA γ e − M n − MnAβ A Σ . Theorem 4.1, and Propositions 4.1 and 4.2 are proved in the Sections 4.4, 4.5 and 4.6, respectively.
We first introduce some definitions and notations that we shall use in the proofs. For any measurable function f : R → R , consider its empirical norm k f k n := vuut n n X i =1 f ( y i | x i ) , and its conditional expectation E X [ f ] = E [ f ( Y | X ) | X = x ] = Z R f ( y | x ) s ( y | x ) dy, as well as its empirical process P n ( f ) := 1 n n X i =1 f ( Y i | x i ) , (28)with expectation E X [ P n ( f )] = 1 n n X i =1 E X [ f ( Y i | x i )] = 1 n n X i =1 Z R f ( y | x i ) s ( y | x i ) dy (29)and the recentered process ν n ( f ) := P n ( f ) − E X [ P n ( f )] = 1 n n X i =1 (cid:20) f ( y i | x i ) − Z R f ( y | x i ) s ( y | x i ) dy (cid:21) . (30)For all m ∈ N ⋆ , consider the model S m = (cid:8) s ψ ∈ S, | s ψ | ≤ m (cid:9) , and define F m = (cid:26) f m = − ln (cid:18) s m s (cid:19) = ln( s ) − ln( s m ) , s m ∈ S m (cid:27) . (31)By using the basic properties of the infimum: for every ǫ >
0, there exists x ǫ ∈ A , such that x ǫ < inf A + ǫ .Then let δ KL > m ∈ N ⋆ , and let η m ≥
0. It holds that there exist two functions b s m and s m in S m , suchthat P n ( − ln b s m ) ≤ inf s m ∈ S m P n ( − ln s m ) + η m , and (32)KL n ( s , s m ) ≤ inf s m ∈ S m KL n ( s , s m ) + δ KL . (33)Define b f m := − ln (cid:18) b s m s (cid:19) , and f m := − ln (cid:18) s m s (cid:19) . (34)Let η ≥ m ∈ N ⋆ . Further, define M ( m ) = { m ′ ∈ N ⋆ | P n ( − ln b s m ′ ) + pen( m ′ ) ≤ P n ( − ln b s m ) + pen( m ) + η } . (35) Let λ > b m to be the smallest integer such that b s Lasso ( λ ) belongs to S b m , i.e., b m := (cid:6)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13) (cid:7) ≤ (cid:13)(cid:13) ψ [1 , (cid:13)(cid:13) + 1. Then using the definition of b m , (13), (20), and S = S m ∈ N ⋆ S m , we get − n n X i =1 ln (cid:0)b s Lasso ( λ ) ( y i | x i ) (cid:1) + λ b m ≤ − n n X i =1 ln (cid:0)b s Lasso ( λ ) ( y i | x i ) (cid:1) + λ (cid:16)(cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) + 1 (cid:17) = inf s ψ ∈ S − n n X i =1 ln ( s ψ ( y i | x i )) + λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) ! + λ = inf m ∈ N ⋆ inf s ψ ∈ S m − n n X i =1 ln ( s ψ ( y i | x i )) + λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) !! + λ = inf m ∈ N ⋆ inf s ψ ∈ S, k ψ [1 , k ≤ m − n n X i =1 ln ( s ψ ( y i | x i )) + λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) ! + λ ≤ inf m ∈ N ⋆ inf s m ∈ S m − n n X i =1 ln ( s m ( y i | x i )) + λm !! + λ, which implies − n n X i =1 ln (cid:0)b s Lasso ( λ ) ( y i | x i ) (cid:1) + pen( b m ) ≤ inf m ∈ N ⋆ − n n X i =1 ln ( b s m ( y i | x i )) + pen( m ) ! + η with pen( m ) = λm, η = λ , and b s m is a η m -ln-likelihood minimizer in S m , with η m ≥ b s Lasso ( λ ) satisfies (22) with b s Lasso ( λ ) ≡ b s b m , i.e., − n n X i =1 ln ( b s b m ( y i | x i )) + pen( b m ) ≤ inf m ∈ N ⋆ − n n X i =1 ln ( b s m ( y i | x i )) + pen( m ) ! + η. (36)Given κ ≥ E (cid:2) KL n (cid:0) s , b s Lasso ( λ ) (cid:1)(cid:3) ≤ (cid:0) κ − (cid:1) inf s ψ ∈ S (cid:16) KL n ( s , s ψ ) + λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) (cid:17) + λ + r Kn e q − π q/ A q/ p qA γ + 302 q r Kn max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) × K (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! , as required. Let M n > κ ≥ m ∈ N ⋆ , the penalty function satisfies pen( m ) = λm , with λ ≥ κ KB n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) . (37)We derive, from Propositions 4.1 and 4.2, that any penalized likelihood estimate b s b m with b m , satisfying − n n X i =1 ln ( b s b m ( y i | x i )) + pen( b m ) ≤ inf m ∈ N ⋆ − n n X i =1 ln ( b s m ( y i | x i )) + pen( m ) ! + η, for some η ≥
0, yields E [KL n ( s , b s b m )]= E [KL n ( s , b s b m ) T ] + E [KL n ( s , b s b m ) T c ] ≤ (cid:0) κ − (cid:1) inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + pen( m ) + η m (cid:19) + 302 K / qB n √ n (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! + η + e q/ − π q/ A q/ p KnqA γ e − M n − MnAβ A Σ . (38)To obtain inequality (25), it only remains to optimize the inequality (38), with respect M n . Since the twoterms depending on M n , in (38), have opposite monotonicity with respect to M n , we are looking for a valueof M n such that these two terms are the same order with respect to n . Consider the positive solution M n = A β + q A β + 4 A Σ ln n of the equation X ( X − A β )4 A Σ − ln n = 0. Then, on the one hand, e − M n − MnAβ A Σ √ n = e − ln n √ n = 1 √ n . On the other hand, using the inequality ( a + b ) ≤ a + b ), we have B n = max ( A Σ , KA G ) (cid:16) q √ q ( M n + A β ) A Σ (cid:17) = max ( A Σ , KA G ) (cid:18) q √ qA Σ (cid:16) A β + q A β + 4 A Σ ln n (cid:17) (cid:19) ≤ max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) , hence (38) implies (25). Indeed, it hold that E [KL n ( s , b s b m )] ≤ (cid:0) κ − (cid:1) inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + pen( m ) + η m (cid:19) + η + r Kn e q − π q/ A q/ p qA γ + 302 q r Kn max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) × K (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! . (39) For every m ′ ∈ M ( m ), from (35), (34), and (32), we obtain P n (cid:16) b f m ′ (cid:17) + pen( m ′ ) = P n (ln( s ) − ln ( b s m ′ )) + pen( m ′ ) (using (34)) ≤ P n (ln( s ) − ln ( b s m )) + pen( m ) + η (using (35)) ≤ P n (ln( s ) − ln ( s m )) + η m + pen( m ) + η (using (32))= P n (cid:0) f m (cid:1) + pen( m ) + η m + η (using (34)) , which implies that E X h P n (cid:16) b f m ′ (cid:17)i + pen( m ′ ) ≤ E X (cid:2) P n (cid:0) f m (cid:1)(cid:3) + pen( m ) + ν n (cid:0) f m (cid:1) − ν n (cid:16) b f m ′ (cid:17) + η + η m . Taking into account (12) and (28), we obtainKL n ( s , b s m ′ ) = 1 n n X i =1 Z R ln (cid:18) s ( y | x i ) b s m ′ ( y | x i ) (cid:19) s ( y | x i ) dy = 1 n n X i =1 Z R b f m ′ ( y | x i ) s ( y | x i ) dy (using (34))= 1 n n X i =1 E X h b f m ′ ( y i | x i ) i = E X h P n (cid:16) b f m ′ (cid:17)i (using (28)) . Similarly, we also obtain KL n ( s , s m ) = E X (cid:2) P n (cid:0) f m (cid:1)(cid:3) . Hence (33) implies thatKL n ( s , b s m ′ ) + pen( m ′ ) ≤ KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) − ν n (cid:16) b f m ′ (cid:17) + η + η m ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) − ν n (cid:16) b f m ′ (cid:17) + η m + δ KL + η. (40)All that remains is to control the deviation of − ν n (cid:16) b f m ′ (cid:17) = ν n (cid:16) − b f m ′ (cid:17) . To handle the randomness of b f m ′ , weshall control the deviation of sup f m ′ ∈ F m ′ ν n ( − f m ′ ), since b f m ′ ∈ F m ′ . Such control is provided by Lemma 4.1. Control of deviation
Lemma 4.1.
Let M n > . Consider the event T = (cid:26) max i =1 ,...,n k Y i k ∞ = max i =1 ,...,n max z ∈{ ,...,q } | [ Y i ] z | ≤ M n (cid:27) , and set B n = max ( A Σ , KA G ) (cid:16) q √ q ( M n + A β ) A Σ (cid:17) , and (41)∆ m ′ = m ′ p ln(2 p + 1) ln n + 2 √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) . (42) Then, on the event T , for all m ′ ∈ N ⋆ , and for all t > , with P X -probability greater than − e − t , sup f m ′ ∈ F m ′ | ν n ( − f m ′ ) | ≤ KB n √ n (cid:20) q ∆ m ′ + √ (cid:18) A γ + qA β + q √ qa Σ (cid:19) √ t (cid:21) . (43) Proof.
The proof appears in Section 5.1.From (40) and (43), we derive that on the event T , for all m ∈ N ⋆ , m ′ ∈ M ( m ), and t >
0, with P X -probability larger than 1 − e − t ,KL n ( s , b s m ′ ) + pen( m ′ ) ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) − ν n (cid:16) b f m ′ (cid:17) + η m + δ KL + η. ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) + η m + δ KL + η + 4 KB n √ n (cid:20) q ∆ m ′ + √ (cid:18) A γ + qA β + q √ qa Σ (cid:19) √ t (cid:21) ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) + η m + δ KL + η + 4 KB n √ n " q ∆ m ′ + 12 (cid:18) A γ + qA β + q √ qa Σ (cid:19) + t , (44)where we get the last inequality using the fact that 2 ab ≤ a + b for b = √ t , and a = (cid:16) A γ + qA β + q √ qa Σ (cid:17) / √ m ∈ N ⋆ and m ′ ∈ M ( m ). To getan inequality valid on a set of high probability, we need to adequately choose the value of the parameter t ,depending on m ∈ N ⋆ and m ′ ∈ M ( m ). Let z >
0, for all m ∈ N ⋆ and m ′ ∈ M ( m ), and apply (44) to obtain t = z + m + m ′ . Then, on the event T , for all m ∈ N ⋆ and m ′ ∈ M ( m ), with P X -probability larger than1 − e − ( z + m + m ′ ),KL n ( s , b s m ′ ) + pen( m ′ ) ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) + η m + δ KL + η + 4 KB n √ n " q ∆ m ′ + 12 (cid:18) A γ + qA β + q √ qa Σ (cid:19) + ( z + m + m ′ ) , (45)KL n ( s , b s m ′ ) − ν n (cid:0) f m (cid:1) ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:20) pen( m ) + 4 KB n √ n m (cid:21) + η m + δ KL + η + (cid:20) KB n √ n (37 q ∆ m ′ + m ′ ) − pen( m ′ ) (cid:21) + 4 KB n √ n " (cid:18) A γ + qA β + q √ qa Σ (cid:19) + z . (46)Taking into account (42), we getKL n ( s , b s m ′ ) − ν n (cid:0) f m (cid:1) ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:20) pen( m ) + 4 KB n √ n m (cid:21) + η m + δ KL + η + (cid:20) KB n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) m ′ − pen( m ′ ) (cid:21) + 4 KB n √ n " (cid:18) A γ + qA β + q √ qa Σ (cid:19) + 74 q √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) + z . (47) Now, let κ ≥ m ) = λm , for all m ∈ N ⋆ with λ ≥ κ KB n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) . Then, (47) impliesKL n ( s , b s m ′ ) − ν n (cid:0) f m (cid:1) ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:20) λm + 4 KB n √ n m (cid:21) + η m + δ KL + η + KB n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17)| {z } ≤ λκ − m ′ − λm ′ + 4 KB n √ n " (cid:18) A γ + qA β + q √ qa Σ (cid:19) + 74 q √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) + z ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + 4 KB n √ n m | {z } ≤ κ − pen( m ) + η m + δ KL + η + (cid:2) λκ − m ′ − λm ′ (cid:3)| {z } ≤ + 4 KB n √ n " (cid:18) A γ + qA β + q √ qa Σ (cid:19) + 74 q √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) + z ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m + δ KL + η + 4 KB n √ n " (cid:18) A γ + qA β + q √ qa Σ (cid:19) + 74 q √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) + z . Next, using the inequality 2 ab ≤ β − a + β − b for a = √ K , b = K (cid:16) A γ + qA β + q √ qa Σ (cid:17) , and β = √ K , and thefact that K ≤ K / , for all K ∈ N ⋆ , it follows thatKL n ( s , b s m ′ ) − ν n (cid:0) f m (cid:1) ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m + δ KL + η + 4 B n √ n " qK / (cid:18) A γ + qA β + q √ qa Σ (cid:19) + 74 q √ KK (cid:18) A γ + qA β + q √ qa Σ (cid:19)| {z } q × ab + Kz ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m + δ KL + η + 4 B n √ n " qK / + 75 qK / (cid:18) A γ + qA β + q √ qa Σ (cid:19) + Kz . (48)By (26) and (35), b m belongs to M ( m ), for all m ∈ N ⋆ , so we deduce from (48) that on the event T , for all z >
0, with P X -probability greater than 1 − e − z ,KL n ( s , b s b m ) − ν n (cid:0) f m (cid:1) ≤ inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m (cid:19) + η + δ KL + 4 B n √ n " qK / + 75 qK / (cid:18) A γ + qA β + q √ qa Σ (cid:19) + Kz . (49)By integrating (49) over z >
0, using the fact that for any non-negative random variable Z and any a > , E [ Z ] = a R z ≥ P ( Z > az ) dz . Then, note that E (cid:2) ν n (cid:0) f m (cid:1)(cid:3) = 0, and that δ KL > small, we obtain that E [KL n ( s , b s b m ) T ] ≤ inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m (cid:19) + η + 4 B n √ n " qK / + 75 qK / (cid:18) A γ + qA β + q √ qa Σ (cid:19) + K ≤ inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m (cid:19) + η + 4 B n √ n " qK / + 75 qK / (cid:18) A γ + qA β + q √ qa Σ (cid:19) + qK / ≤ inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m (cid:19) + η + 302 K / qB n √ n (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! . (50) By the Cauchy-Schwarz inequality, E [KL n ( s , b s b m ) T C ] ≤ q E (cid:2) KL n ( s , b s b m ) (cid:3)q P ( T C ) . (51)We seek to bound the two terms of the right-hand side of (51).For the first term, let us bound KL ( s ( ·| x ) , s ψ ( ·| x )), for all s ψ ∈ S and x ∈ X . Let s ψ ∈ S and x ∈ X .Since s is a density, s is bounded by 1 and thusKL ( s ( ·| x ) , s ψ ( ·| x )) = Z R q ln (cid:18) s ( y | x ) s ψ ( y | x ) (cid:19) s ( y | x ) dy = Z R q ln ( s ( y | x )) s ( y | x ) dy − Z R q ln ( s ψ ( y | x )) s ( y | x ) dy ≤ − Z R q ln ( s ψ ( y | x )) s ( y | x ) dy (cid:18) since Z R q ln ( s ( y | x )) s ( y | x ) dy ≤ (cid:19) . (52) Thus, for all y ∈ R q ,ln ( s ψ ( y | x )) s ( y | x )= ln " K X k =1 g k ( x ; γ )(2 π ) q/ det(Σ k ) / exp − ( y − ( β k + β k x )) ⊤ Σ − k ( y − ( β k + β k x ))2 ! × K X k =1 g ,k ( x ; γ )(2 π ) q/ det(Σ ,k ) / exp − ( y − ( β ,k + β ,k x )) ⊤ Σ − ,k ( y − ( β ,k + β ,k x ))2 ! ≥ ln " K X k =1 a G det(Σ − k ) / (2 π ) q/ exp (cid:16) − (cid:16) y ⊤ Σ − k y + ( β k + β k x ) ⊤ Σ − k β k x ( β k + β k x ) (cid:17)(cid:17) × K X k =1 a G det(Σ − ,k ) / (2 π ) q/ exp (cid:16) − (cid:16) y ⊤ Σ − ,k y + ( β ,k + β ,k x ) ⊤ Σ − ,k ( β ,k + β ,k x ) (cid:17)(cid:17)(cid:0) using (10) and − ( a − b ) ⊤ A ( a − b ) / ≥ − ( a ⊤ Aa + b ⊤ Ab ), e.g., a = y, b = β k + β k x , A = Σ k (cid:1) ≥ ln " K X k =1 a G a q/ (2 π ) q/ exp (cid:16) − (cid:16) y ⊤ Σ − k y + ( β k + β k x ) ⊤ Σ − k β k x ( β k + β k x ) (cid:17)(cid:17) × K X k =1 a G a q/ (2 π ) q/ exp (cid:16) − (cid:16) y ⊤ Σ − ,k y + ( β ,k + β ,k x ) ⊤ Σ − ,k ( β ,k + β ,k x ) (cid:17)(cid:17) (using (9)) ≥ ln " K a G a q/ (2 π ) q/ exp (cid:0) − (cid:0) y ⊤ y + qA β (cid:1) A Σ (cid:1) × K a G a q/ (2 π ) q/ exp (cid:0) − (cid:0) y ⊤ y + qA β (cid:1) A Σ (cid:1) (using (9)) , (53)where, in the last inequality, we use the fact that for all u ∈ R q . By using the eigenvalue decomposition ofΣ = P ⊤ DP , (cid:12)(cid:12) u ⊤ Σ u (cid:12)(cid:12) = (cid:12)(cid:12) u ⊤ P ⊤ DP u (cid:12)(cid:12) ≤ k
P u k ≤ M ( D ) k P u k ≤ A Σ k u k ≤ A Σ q k u k ∞ , where in the last inequality, we used the fact that (79). Therefore, setting u = √ A Σ y and h ( t ) = t ln t , for all t ∈ R , and noticing that h ( t ) ≥ h (cid:0) e − (cid:1) = − e − , for all t ∈ R , and from (52) and (53), we get thatKL ( s ( ·| x ) , s ψ ( ·| x )) ≤ − Z R q " ln " K a γ a q/ (2 π ) q/ exp (cid:0) − (cid:0) y ⊤ y + qA β (cid:1) A Σ (cid:1) K a γ a q/ (2 π ) q/ exp (cid:0) − (cid:0) y ⊤ y + qA β (cid:1) A Σ (cid:1)! dy = − Ka γ a q/ e − qA β A Σ (2 A Σ ) q/ Z R q " ln K a γ a q/ (2 π ) q/ ! − qA β A Σ − u ⊤ u e − u ⊤ u (2 π ) q/ du = − Ka γ a q/ e − qA β A Σ (2 A Σ ) q/ E U "" ln K a γ a q/ (2 π ) q/ ! − qA β A Σ − U ⊤ U (with U ∼ N q (0 , I q ))= − Ka γ a q/ e − qA β A Σ (2 A Σ ) q/ " ln K a γ a q/ (2 π ) q/ ! − qA β A Σ − q = − Ka γ a q/ e − qA β A Σ − q (2 π ) q/ ( A Σ ) q/ e q/ π q/ ln Ka γ a q/ e − qA β A Σ − q (2 π ) q/ ! ≤ e q/ − π q/ A q/ , (54)where we used the fact that t ln( t ) ≥ − e − , for all t ∈ R . Then, for all s ψ ∈ S ,KL n ( s , s ψ ) = 1 n n X i =1 KL ( s ( ·| x i ) , s ψ ( . | x i )) ≤ e q/ − π q/ A q/ , and note that b s b m ∈ S , and thus q E (cid:2) KL n ( s , b s b m ) (cid:3) ≤ e q/ − π q/ A q/ . (55) We now provide an upper bound for P (cid:0) T C (cid:1) : P (cid:0) T C (cid:1) = E [ T C ] = E [ E X [ T C ]] = E (cid:2) P X (cid:0) T C (cid:1)(cid:3) ≤ E " n X i =1 P X ( k Y i k ∞ > M n ) . (56)For all i ∈ [ n ], Y i | x i ∼ K X k =1 g k ( x i ; γ ) N q ( β k + β k x i , Σ k ) , so we see from (56) that we need to provide an upper bound on P ( | Y x | > M n ), with Y x ∼ K X k =1 g k ( x ; γ ) N q ( β k + β k x, Σ k ) , x ∈ X . First, using Chernoff’s inequality for a centered Gaussian variable (see Lemma A.5), and the fact that ψ belongsto the bounded space e Ψ (defined by (9)), and that P Kk =1 g k ( x ; γ ) = 1, we get P ( k Y x k ∞ > M n )= Z { k y k ∞ >M n } K X k =1 g k ( x ; γ )(2 π ) q/ det(Σ k ) / exp − ( y − ( β k + β k x )) ⊤ Σ − k ( y − ( β k + β k x ))2 ! dy = K X k =1 g k ( x ; γ )(2 π ) q/ det(Σ k ) / Z { k y k ∞ >M n } exp − ( y − ( β k + β k x )) ⊤ Σ − k ( y − ( β k + β k x ))2 ! dy = K X k =1 g k ( x ; γ ) P (cid:0) k Y x,k k ∞ > M n (cid:1) ≤ K X k =1 g k ( x ; γ ) q X z =1 P (cid:0)(cid:12)(cid:12) [ Y x,k ] z (cid:12)(cid:12) > M n (cid:1) = K X k =1 g k ( x ; γ ) q X z =1 (cid:0) P (cid:0) [ Y x,k ] z < − M n (cid:1) + P (cid:0) [ Y x,k ] z > M n (cid:1)(cid:1) = K X k =1 g k ( x ; γ ) q X z =1 P U > M n − [ β k + β k x ] z [Σ k ] / z,z ! + P U < − M n − [ β k + β k x ] z [Σ k ] / z,z !! = K X k =1 g k ( x ; γ ) q X z =1 P U > M n − [ β k + β k x ] z [Σ k ] / z,z ! + P U > M n + [ β k + β k x ] z [Σ k ] / z,z !! ≤ K X k =1 g k ( x ; γ ) q X z =1 e − Mn − [ βk βkx ] z [ Σ k ] / z,z ! + e − Mn + [ βk βkx ] z [ Σ k ] / z,z ! (using Lemma A.5, (90)) ≤ K X k =1 g k ( x ; γ ) q X z =1 e − Mn − | [ βk βkx ] z | [ Σ k ] / z,z ! = 2 K X k =1 g k ( x ; γ ) q X z =1 e − M n − Mn | [ βk βkx ] z | + | [ βk βkx ] | z [ Σ k ] z,z ≤ K X k =1 g k ( x ; γ ) q X z =1 e − M n − Mn | [ βk βkx ] z | + | [ βk βkx ] | z [ Σ k ] z,z ≤ KA γ qe − M n − MnAβ A Σ , (57)where Y x,k ∼ N q ( β k + β k x, Σ k ) ,Y x,k ∼ N (cid:16) [ β k + β k x ] z , [Σ k ] z,z (cid:17) , and U = [ Y x,k ] z − [ βx ] z [Σ k ] / z,z ∼ N (0 , , and using the facts that e − | [ βk βkx ] | zA Σ ≤ ≤ z ≤ q (cid:12)(cid:12)(cid:12) [Σ k ] z,z (cid:12)(cid:12)(cid:12) ≤ k Σ k k = M (Σ k ) = m (cid:0) Σ − k (cid:1) ≤ A Σ . 
Wederive from (56) and (57) that P ( T c ) ≤ KnqA γ e − M n − MnAβ A Σ , (58) nd finally from (51), (55), and (58), we obtain E [KL n ( s , b s b m ) T C ] ≤ e q/ − π q/ A q/ p KnqA γ e − M n − MnAβ A Σ . (59) First, we give some tools to prove Lemma 4.1. Recall that k f k n = vuut n n X i =1 g ( y i | x i ) , for any measurable function g .Let m ∈ N ⋆ , we have sup f m ∈ F m | ν n ( − f m ) | = sup f m ∈ F m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( f m ( Y i | x i ) − E [ f m ( Y i | x i )]) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . (60)To control the deviation of (60), we shall use concentration and symmetrization arguments. We shall first usethe following concentration inequality, which can be found in Boucheron et al. (2013). Lemma 5.1 (See Boucheron et al., 2013) . Let Z , . . . , Z n be independent random variables with values in somespace Z and let F be a class of real-valued functions on Z . Assume that there exists R n , a non-random constant,such that sup f ∈F k f k n ≤ R n . Then, for all t > , P sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 [ f ( Z i ) − E [ f ( Z i )]] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > E " sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 [ f ( Z i ) − E [ f ( Z i )]] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 2 √ R n r tn ! ≤ e − t . (61)Then, we propose to bound E (cid:2) sup f ∈F (cid:12)(cid:12) n P ni =1 [ f ( Z i ) − E [ f ( Z i )]] (cid:12)(cid:12)(cid:3) due to the following symmetrizationargument. The proof of this result can be found in Van Der Vaart & Wellner (1996). Lemma 5.2 (See Lemma 2.3.6 in Van Der Vaart & Wellner, 1996) . Let Z , . . . , Z n be independent randomvariables with values in some space Z and let F be a class of real-valued functions on Z . Let ( ǫ , . . . , ǫ n ) be aRademacher sequence independent of ( Z , . . . , Z n ) . Then, E " sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 [ f ( Z i ) − E [ f ( Z i )]] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ E " sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ǫ i f ( Z i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . (62)From (62), the problem is to provide an upper bound on E " sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ǫ i f ( Z i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . To do so, we shall apply the following lemma, which is adapted from Lemma 6.1 in Massart (2007).
Lemma 5.3 (See Lemma 6.1 in Massart, 2007) . Let Z , . . . , Z n be independent random variables with valuesin some space Z and let F be a class of real-valued functions on Z . Let ( ǫ , . . . , ǫ n ) be a Rademacher sequence,independent of ( Z , . . . , Z n ) . Define R n , a non-random constant, such that sup f ∈F k f k n ≤ R n . (63) Then, for all S ∈ N ⋆ , E " sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ǫ i f ( Z i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ R n √ n S X s =1 − s q ln [1 + M (2 − s R n , F , k . k n )] + 2 − S ! , (64) where M ( δ, F , k . k n ) stands for the δ -packing number (see Definition A.2) of the set of functions F , equippedwith the metric induced by the norm k·k n . In our case, from (60), we apply a conditional version of Lemmas 5.1–5.3 to F = F m , ( Z , . . . , Z n ) =( Y | x , . . . , Y n | x n ), and f ( Z i ) = f m ( Y i | x i ), so as to control sup f m ∈ F m | ν n ( − f m ) | . On the one hand, we see from(63) that we need an upper bound of sup f m ∈ F m k f m k n . On the other hand, we see from (64) that we need tobound the entropy of the set of functions F m , equipped with the metric induced by the norm k·k n . Such boundsare provided by the two following lemmas.Let M n > T = (cid:26) max i =1 ,...,n k Y i k ∞ = max i =1 ,...,n max z ∈{ ,...,q } | [ Y i ] z | ≤ M n (cid:27) , and put B n = max ( A Σ , KA G ) (cid:16) q √ q ( M n + A β ) A Σ (cid:17) . Lemma 5.4.
On the event T , for all m ∈ N ⋆ , sup f m ∈ F m k f m k n T ≤ KB n (cid:18) A γ + qA β + q √ qa Σ (cid:19) =: R n . (65) Proof.
See Section 5.2.1.
Lemma 5.5.
Let δ > and m ∈ N ⋆ . On the event T , we have the following upper bound of the δ -packingnumber of the set of functions F m , equipped with the metric induced by the norm k·k n : M ( δ, F m , k·k n ) ≤ (2 p + 1) B nq K m δ (cid:18) B n KqA β δ (cid:19) K (cid:18) B n KA γ δ (cid:19) K (cid:18) B n Kq √ qa Σ δ (cid:19) K . Proof.
See Section 5.2.2.
Lemma 5.6 (Lemma 5.9 from Meynet, 2013) . Let δ > and ( x ij ) i =1 ,...,n ; j =1 ,...,p ∈ R np . There exists afamily B of (2 p + 1) k x k ,n /δ vectors in R p , such that for all β ∈ R p , with k β k ≤ , where k x k ,n = n P ni =1 max j ∈{ ,...,p } x ij , there exists β ′ ∈ B , such that n n X i =1 p X j =1 (cid:0) β j − β ′ j (cid:1) x ij ≤ δ . Proof.
See in the proof of Lemma 5.9 Meynet (2013).Via the upper bounds provided in Lemmas 5.4 and 5.5, we can apply Lemma 5.3 to get an upper bound on E X (cid:2) sup f m ∈F m (cid:12)(cid:12) n P ni =1 ǫ i f m ( Y i | x i ) (cid:12)(cid:12)(cid:3) . We thus obtain the following results. Lemma 5.7.
Let m ∈ N ⋆ , consider ( ǫ , . . . , ǫ n ) , a Rademacher sequence independent of ( Y , . . . , Y n ) . Then, onthe event T , E X " sup f m ∈F m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ǫ i f m ( Y i | x i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ KB n q √ n ∆ m , (66)∆ m := m p ln(2 p + 1) ln n + 2 √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) . (67) Proof.
See Section 5.2.3.Now using (66) and applying both Lemmas 5.1 and 5.2 to F = F m , ( Z , . . . , Z n ) = ( Y | x , . . . , Y n | x n ) and f ( Z i ) = f m ( Y i | x i ), we get for all m ∈ N ⋆ and t >
0, with P X -probability greater than 1 − e − t ,sup f m ∈ F m | ν n ( − f m ) | = sup f m ∈ F m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( f m ( Y i | x i ) − E X [ f m ( Y i | x i )]) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ E " sup f m ∈ F m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( f m ( Y i | x i ) − E X [ f m ( Y i | x i )]) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 2 √ R n r tn (Lemma 5.1) ≤ E " sup f m ∈ F m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ǫ i f ( Y i | x i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 2 √ R n r tn (using Lemma 5.2) ≤ KB n q √ n ∆ m + 4 √ KB n (cid:18) A γ + qA β + q √ qa Σ (cid:19) r tn (cid:18) using Lemma 5.7 and R n = 2 KB n (cid:18) A γ + qA β + q √ qa Σ (cid:19)(cid:19) ≤ KB n √ n (cid:20) q ∆ m + √ (cid:18) A γ + qA β + q √ qa Σ (cid:19) √ t (cid:21) . The proofs of Lemmas 5.4–5.5 require an upper bound on the uniform norm of the gradient of ln s ψ , for s ψ ∈ S .We begin by providing such an upper bound. Lemma 5.8.
Lemma 5.8. Given $s_\psi$, as described in (11), it holds that
$$\sup_{x\in\mathcal{X}}\sup_{\psi\in\widetilde\Psi}\Bigg\|\frac{\partial\ln(s_\psi(\cdot|x))}{\partial\psi}\Bigg\|_\infty \le G(\cdot),\qquad G:\ \mathbb{R}^q\ni y\mapsto G(y) = \max(A_\Sigma,KA_G)\big(1+q\sqrt q\,(\|y\|_\infty+A_\beta)^2A_\Sigma\big). \quad (68)$$

Proof.
Let $s_\psi\in S$, with $\psi=(\gamma,\beta,\Sigma)$. From now on, we consider any $x\in\mathcal X$, any $y\in\mathbb R^q$, and any $k\in[K]$. We can write
$$\ln(s_\psi(y|x)) = \ln\Bigg(\sum_{k=1}^Kg_k(x;\gamma)\,\phi(y;\beta_{k0}+\beta_kx,\Sigma_k)\Bigg) = \ln\Bigg(\sum_{k=1}^Kf_k(x,y)\Bigg),$$
with
$$g_k(x;\gamma) = \frac{\exp(w_k(x))}{\sum_{l=1}^K\exp(w_l(x))},\qquad w_k(x)=\gamma_{k0}+\gamma_k^\top x,$$
$$\phi(y;\beta_{k0}+\beta_kx,\Sigma_k) = \frac{1}{(2\pi)^{q/2}\det(\Sigma_k)^{1/2}}\exp\Bigg(-\frac{(y-(\beta_{k0}+\beta_kx))^\top\Sigma_k^{-1}(y-(\beta_{k0}+\beta_kx))}{2}\Bigg),$$
$$f_k(x,y) = g_k(x;\gamma)\,\phi(y;\beta_{k0}+\beta_kx,\Sigma_k) = \frac{g_k(x;\gamma)}{(2\pi)^{q/2}\det(\Sigma_k)^{1/2}}\exp\Big[-\frac12(y-(\beta_{k0}+\beta_kx))^\top\Sigma_k^{-1}(y-(\beta_{k0}+\beta_kx))\Big].$$

By using the chain rule, for all $l\in[K]$,
$$\frac{\partial\ln(s_\psi(y|x))}{\partial\gamma_{l0}} = \sum_{k=1}^K\frac{f_k(x,y)}{g_k(x;\gamma)\sum_{k=1}^Kf_k(x,y)}\,\frac{\partial g_k(x;\gamma)}{\partial w_l(x)}\underbrace{\frac{\partial w_l(x)}{\partial\gamma_{l0}}}_{=1},\quad\text{and}\quad \frac{\partial\ln(s_\psi(y|x))}{\partial(\gamma_l^\top x)} = \sum_{k=1}^K\frac{f_k(x,y)}{g_k(x;\gamma)\sum_{k=1}^Kf_k(x,y)}\,\frac{\partial g_k(x;\gamma)}{\partial w_l(x)}\underbrace{\frac{\partial w_l(x)}{\partial(\gamma_l^\top x)}}_{=1}.$$

Furthermore,
$$\frac{\partial g_k(x;\gamma)}{\partial w_l(x)} = \frac{\partial}{\partial w_l(x)}\Bigg(\frac{\exp(w_k(x))}{\sum_{l=1}^K\exp(w_l(x))}\Bigg) = \frac{\frac{\partial}{\partial w_l(x)}\exp(w_k(x))}{\sum_{l=1}^K\exp(w_l(x))} - \frac{\exp(w_k(x))}{\big(\sum_{l=1}^K\exp(w_l(x))\big)^2}\frac{\partial}{\partial w_l(x)}\sum_{i=1}^K\exp(w_i(x)) \qquad \Bigg(\text{using } \frac{\partial}{\partial x}\Big(\frac{f(x)}{g(x)}\Big) = \frac{f'(x)g(x)-g'(x)f(x)}{g(x)^2}\Bigg)$$
$$= \delta_{lk}\,\frac{\exp(w_k(x))}{\sum_{l=1}^K\exp(w_l(x))} - \frac{\exp(w_k(x))}{\sum_{l=1}^K\exp(w_l(x))}\cdot\frac{\exp(w_l(x))}{\sum_{l=1}^K\exp(w_l(x))} = g_k(x;\gamma)\big(\delta_{lk}-g_l(x;\gamma)\big),\qquad \delta_{lk} = \begin{cases}1 & l=k,\\ 0 & l\ne k.\end{cases}$$

Therefore, we obtain
$$\Bigg|\frac{\partial\ln(s_\psi(y|x))}{\partial(\gamma_l^\top x)}\Bigg| = \Bigg|\frac{\partial\ln(s_\psi(y|x))}{\partial\gamma_{l0}}\Bigg| = \Bigg|\sum_{k=1}^K\frac{f_k(x,y)}{\sum_{k=1}^Kf_k(x,y)}\big(\delta_{lk}-g_l(x;\gamma)\big)\Bigg| \le \Bigg|\sum_{k=1}^K\big(\delta_{lk}-g_l(x;\gamma)\big)\Bigg| \quad \Bigg(\text{since }\frac{f_k(x,y)}{\sum_{k=1}^Kf_k(x,y)}\le1\Bigg)$$
$$= \big|1-Kg_l(x;\gamma)\big| \le KA_G \quad (\text{using (10)}).$$
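The soft-max Jacobian identity $\partial g_k/\partial w_l = g_k(\delta_{lk}-g_l)$ used above is easy to verify numerically; a minimal sketch with a hypothetical gating vector:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())   # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(2)
w = rng.normal(size=4)
g = softmax(w)
h = 1e-6
for l in range(4):
    w_pert = w.copy(); w_pert[l] += h
    num = (softmax(w_pert) - g) / h            # numerical Jacobian column l
    ana = g * ((np.arange(4) == l) - g[l])     # g_k (delta_{lk} - g_l)
    assert np.allclose(num, ana, atol=1e-4)
print("softmax Jacobian identity verified")
```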
Similarly, by using the fact that $\psi$ belongs to the bounded space $\widetilde\Psi$ and that $f_l(x,y)/\sum_{k=1}^Kf_k(x,y)\le1$,
$$\Bigg\|\frac{\partial\ln(s_\psi(y|x))}{\partial\beta_{l0}}\Bigg\|_\infty = \Bigg\|\frac{\partial\ln(s_\psi(y|x))}{\partial(\beta_lx)}\Bigg\|_\infty = \Bigg\|\frac{f_l(x,y)}{\sum_{k=1}^Kf_k(x,y)}\,\frac{\partial}{\partial(\beta_{l0}+\beta_lx)}\Big[-\frac12(y-(\beta_{l0}+\beta_lx))^\top\Sigma_l^{-1}(y-(\beta_{l0}+\beta_lx))\Big]\Bigg\|_\infty$$
$$\le \Bigg\|\frac{\partial}{\partial(\beta_{l0}+\beta_lx)}\Big[-\frac12(y-(\beta_{l0}+\beta_lx))^\top\Sigma_l^{-1}(y-(\beta_{l0}+\beta_lx))\Big]\Bigg\|_\infty = \big\|\Sigma_l^{-1}(y-(\beta_{l0}+\beta_lx))\big\|_\infty$$
$$\le \|\Sigma_l^{-1}\|_\infty\,\|y-(\beta_{l0}+\beta_lx)\|_\infty \ (\text{using (80)}) \le \sqrt q\,\|\Sigma_l^{-1}\|_2\big(\|y\|_\infty+\|\beta_{l0}+\beta_lx\|_\infty\big) \ (\text{using (85)})$$
$$\le \sqrt q\,\lambda_{\max}\big(\Sigma_l^{-1}\big)\big(\|y\|_\infty+\|\beta_{l0}+\beta_lx\|_\infty\big) \ (\text{using (84)}) \le \sqrt q\,A_\Sigma\big(\|y\|_\infty+A_\beta\big) \ (\text{using (9)}).$$

Now, we need to calculate the gradient with respect to the covariance matrices of the Gaussian experts. To do this, we need the following result: given any $l\in[K]$ and $v_l=\beta_{l0}+\beta_lx$, it holds that
$$\frac{\partial}{\partial\Sigma_l}\phi(y;v_l,\Sigma_l) = \frac{\partial}{\partial\Sigma_l}\Bigg[(2\pi)^{-q/2}\det(\Sigma_l)^{-1/2}\exp\Bigg(-\frac{(y-v_l)^\top\Sigma_l^{-1}(y-v_l)}{2}\Bigg)\Bigg]$$
$$= \phi(y;v_l,\Sigma_l)\Bigg[-\frac12\frac{\partial}{\partial\Sigma_l}\big((y-v_l)^\top\Sigma_l^{-1}(y-v_l)\big) + \det(\Sigma_l)^{1/2}\frac{\partial}{\partial\Sigma_l}\big(\det(\Sigma_l)^{-1/2}\big)\Bigg]$$
$$= \phi(y;v_l,\Sigma_l)\Bigg[\frac12\Sigma_l^{-1}(y-v_l)(y-v_l)^\top\Sigma_l^{-1} - \frac12\det(\Sigma_l)^{-1}\det(\Sigma_l)\big(\Sigma_l^{-1}\big)^\top\Bigg] = \phi(y;v_l,\Sigma_l)\,\frac12\underbrace{\Big[\Sigma_l^{-1}(y-v_l)(y-v_l)^\top\Sigma_l^{-1}-\big(\Sigma_l^{-1}\big)^\top\Big]}_{T(y,v_l,\Sigma_l)}, \quad (69)$$
noting that
$$\frac{\partial}{\partial\Sigma_l}\big((y-v_l)^\top\Sigma_l^{-1}(y-v_l)\big) = -\Sigma_l^{-1}(y-v_l)(y-v_l)^\top\Sigma_l^{-1} \quad (\text{using Lemma A.1}), \quad (70)$$
$$\frac{\partial}{\partial\Sigma_l}\det(\Sigma_l) = \det(\Sigma_l)\big(\Sigma_l^{-1}\big)^\top \quad (\text{using the Jacobi formula, Lemma A.2}). \quad (71)$$

For any $l\in[K]$,
$$\Bigg|\frac{\partial\ln(s_\psi(y|x))}{\partial[\Sigma_l]_{z_1,z_2}}\Bigg| \le \Bigg\|\frac{\partial\ln(s_\psi(y|x))}{\partial\Sigma_l}\Bigg\|_2 \ (\text{using (84)}) = \Bigg|\frac{f_l(x,y)}{\sum_{k=1}^Kf_k(x,y)}\Bigg|\cdot\frac12\big\|\Sigma_l^{-1}(y-v_l)(y-v_l)^\top\Sigma_l^{-1}-\big(\Sigma_l^{-1}\big)^\top\big\|_2 \ (\text{using (69)})$$
$$\le \frac12\Big[A_\Sigma + \sqrt q\,\big\|(y-v_l)(y-v_l)^\top\big\|_\infty A_\Sigma^2\Big] \ (\text{using (85)}) \le \frac12\Big[A_\Sigma + q\sqrt q\,(\|y\|_\infty+A_\beta)^2A_\Sigma^2\Big] \ (\text{using (9)}),$$
where, in the last inequality, given $a=y-(\beta_{l0}+\beta_lx)$, we use the fact that
$$\|aa^\top\|_\infty = \max_{1\le i\le q}\sum_{j=1}^q\big|[aa^\top]_{i,j}\big| = \max_{1\le i\le q}\sum_{j=1}^q|a_ia_j| = \max_{1\le i\le q}|a_i|\sum_{j=1}^q|a_j| \le q\|a\|_\infty^2.$$

Thus,
$$\sup_{x\in\mathcal X}\sup_{\psi\in\widetilde\Psi}\Bigg\|\frac{\partial\ln(s_\psi(y|x))}{\partial\psi}\Bigg\|_\infty \le \max\Bigg[KA_G,\ \sqrt q(\|y\|_\infty+A_\beta)A_\Sigma,\ \frac12\big(A_\Sigma+q\sqrt q(\|y\|_\infty+A_\beta)^2A_\Sigma^2\big)\Bigg]$$
$$\le \max(A_\Sigma,KA_G)\big(1+q\sqrt q\,(\|y\|_\infty+A_\beta)^2A_\Sigma\big) =: G(y),$$
where we use the fact that
$$\sqrt q(\|y\|_\infty+A_\beta)A_\Sigma =: \theta \le 1+\theta^2 = 1+q(\|y\|_\infty+A_\beta)^2A_\Sigma^2 \le \max(A_\Sigma,KA_G)\big(1+q\sqrt q(\|y\|_\infty+A_\beta)^2A_\Sigma\big).$$

5.2.1 Proof of Lemma 5.4

Let $m\in\mathbb N^\star$ and $f_m\in\mathcal F_m$. By (31), there exists $s_m\in S_m$ such that $f_m=-\ln(s_m/s_0)$. For all $x\in\mathcal X$, let $\psi(x)=(\gamma_{k0},\gamma_k^\top x,\beta_{k0},\beta_kx,\Sigma_k)_{k\in[K]}$ be the parameters of $s_m(\cdot|x)$, and let $\psi_0(x)$ be those of $s_0(\cdot|x)$. In our case, we approximate $f(\psi)=\ln(s_\psi(y_i|x_i))$ around $\psi_0(x_i)$ by its Taylor polynomial of degree $0$. That is,
$$\Big|\underbrace{\ln(s_m(y_i|x_i))}_{\ln s_\psi(y_i|x_i)} - \ln(s_0(y_i|x_i))\Big| = |f(\psi)-f(\psi_0)| = |R_{\psi_0,0}(\psi-\psi_0)| \ (\text{defined in Lemma A.6}) \le \sup_{x\in\mathcal X}\sup_{\psi\in\widetilde\Psi}\Bigg\|\frac{\partial\ln(s_\psi(y_i|x))}{\partial\psi}\Bigg\|_\infty\,\|\psi(x_i)-\psi_0(x_i)\|_1.$$

First applying Taylor's inequality and then Lemma 5.8 on the event $\mathcal T$, for all $i\in[n]$, it holds that
$$|f_m(y_i|x_i)|\,\mathbb 1_{\mathcal T} = |\ln(s_m(y_i|x_i))-\ln(s_0(y_i|x_i))|\,\mathbb 1_{\mathcal T} \le \underbrace{\max(A_\Sigma,KA_G)\big(1+q\sqrt q(M_n+A_\beta)^2A_\Sigma\big)}_{=:B_n}\,\|\psi(x_i)-\psi_0(x_i)\|_1 \ (\text{using Lemma 5.8})$$
$$\le B_n\sum_{k=1}^K\Big(|\gamma_{k0}-\gamma_{0,k0}| + |\gamma_k^\top x_i-\gamma_{0,k}^\top x_i| + \|\beta_{k0}-\beta_{0,k0}\|_1 + \|\beta_kx_i-\beta_{0,k}x_i\|_1 + \|\mathrm{vec}(\Sigma_k-\Sigma_{0,k})\|_1\Big)$$
$$\le 2B_n\sum_{k=1}^K\Big(|\gamma_{k0}| + |\gamma_k^\top x_i| + \|\beta_{k0}\|_1 + \|\beta_kx_i\|_1 + q\|\Sigma_k\|_1\Big) \ (\text{using (82)})$$
$$\le 2KB_n\big(A_\gamma + q\|\beta_{k0}\|_\infty + q\|\beta_kx_i\|_\infty + q\sqrt q\,\|\Sigma_k\|_2\big) \ (\text{using (9), (77), (78), (86)}) \le 2KB_n\big(A_\gamma+qA_\beta+q\sqrt q\,a_\Sigma\big) \ (\text{using (9)}).$$

Therefore,
$$\sup_{f_m\in\mathcal F_m}\|f_m\|_n\,\mathbb 1_{\mathcal T} \le 2KB_n\big(A_\gamma+qA_\beta+q\sqrt q\,a_\Sigma\big) =: R_n.$$

5.2.2 Proof of Lemma 5.5

Let $m\in\mathbb N^\star$, $f^{[1]}_m\in\mathcal F_m$, and $x\in[0,1]^p$. By (31), there exists $s^{[1]}_m\in S_m$ such that $f^{[1]}_m=-\ln(s^{[1]}_m/s_0)$. Introduce the notation $s^{[2]}_m\in S$ and $f^{[2]}_m=-\ln(s^{[2]}_m/s_0)$. Let
$$\psi^{[1]}(x) = \big(\gamma^{[1]}_{k0},\gamma^{[1]\top}_kx,\beta^{[1]}_{k0},\beta^{[1]}_kx,\Sigma^{[1]}_k\big)_{k\in[K]} \quad\text{and}\quad \psi^{[2]}(x) = \big(\gamma^{[2]}_{k0},\gamma^{[2]\top}_kx,\beta^{[2]}_{k0},\beta^{[2]}_kx,\Sigma^{[2]}_k\big)_{k\in[K]}$$
be the parameters of the PDFs $s^{[1]}_m(\cdot|x)$ and $s^{[2]}_m(\cdot|x)$, respectively.
By applying Taylor's inequality and then Lemma 5.8 on the event $\mathcal T$, for all $i\in[n]$, it holds that
$$\big|f^{[1]}_m(y_i|x_i)-f^{[2]}_m(y_i|x_i)\big|\,\mathbb 1_{\mathcal T} = \big|\ln(s^{[1]}_m(y_i|x_i))-\ln(s^{[2]}_m(y_i|x_i))\big|\,\mathbb 1_{\mathcal T} \le \sup_{x\in\mathcal X}\sup_{\psi\in\widetilde\Psi}\Bigg\|\frac{\partial\ln(s_\psi(y_i|x))}{\partial\psi}\Bigg\|_\infty\,\big\|\psi^{[1]}(x_i)-\psi^{[2]}(x_i)\big\|_1\,\mathbb 1_{\mathcal T} \ (\text{using Taylor's inequality in Lemma A.6})$$
$$\le B_n\sum_{k=1}^K\Big(\big|\gamma^{[1]}_{k0}-\gamma^{[2]}_{k0}\big| + \big|\gamma^{[1]\top}_kx_i-\gamma^{[2]\top}_kx_i\big| + \big\|\beta^{[1]}_{k0}-\beta^{[2]}_{k0}\big\|_1 + \big\|\beta^{[1]}_kx_i-\beta^{[2]}_kx_i\big\|_1 + \big\|\mathrm{vec}\big(\Sigma^{[1]}_k-\Sigma^{[2]}_k\big)\big\|_1\Big) \ (\text{using Lemma 5.8}).$$

By the Cauchy–Schwarz inequality, $\big(\sum_{i=1}^ma_i\big)^2 \le m\sum_{i=1}^ma_i^2$ ($m\in\mathbb N^\star$), we get
$$\big|f^{[1]}_m(y_i|x_i)-f^{[2]}_m(y_i|x_i)\big|^2\,\mathbb 1_{\mathcal T} \le 3B_n^2\Bigg[\Bigg(\sum_{k=1}^K\big|\gamma^{[1]\top}_kx_i-\gamma^{[2]\top}_kx_i\big|\Bigg)^2 + \Bigg(\sum_{k=1}^K\sum_{z=1}^q\Big|\big[\beta^{[1]}_kx_i\big]_z-\big[\beta^{[2]}_kx_i\big]_z\Big|\Bigg)^2 + \Big(\big\|\beta^{[1]}_0-\beta^{[2]}_0\big\|_1 + \big\|\gamma^{[1]}_0-\gamma^{[2]}_0\big\|_1 + \big\|\mathrm{vec}\big(\Sigma^{[1]}-\Sigma^{[2]}\big)\big\|_1\Big)^2\Bigg]$$
$$\le 3B_n^2\Bigg[K\sum_{k=1}^K\Bigg(\sum_{j=1}^p\gamma^{[1]}_{kj}x_{ij}-\sum_{j=1}^p\gamma^{[2]}_{kj}x_{ij}\Bigg)^2 + Kq\sum_{k=1}^K\sum_{z=1}^q\Bigg(\sum_{j=1}^p\big[\beta^{[1]}_k\big]_{z,j}x_{ij}-\sum_{j=1}^p\big[\beta^{[2]}_k\big]_{z,j}x_{ij}\Bigg)^2 + \Big(\big\|\beta^{[1]}_0-\beta^{[2]}_0\big\|_1 + \big\|\gamma^{[1]}_0-\gamma^{[2]}_0\big\|_1 + \big\|\mathrm{vec}\big(\Sigma^{[1]}-\Sigma^{[2]}\big)\big\|_1\Big)^2\Bigg],$$
and
$$\big\|f^{[1]}_m-f^{[2]}_m\big\|_n^2\,\mathbb 1_{\mathcal T} = \frac1n\sum_{i=1}^n\big|f^{[1]}_m(y_i|x_i)-f^{[2]}_m(y_i|x_i)\big|^2\,\mathbb 1_{\mathcal T} \le 3B_n^2K\sum_{k=1}^K\underbrace{\frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\gamma^{[1]}_{kj}x_{ij}-\sum_{j=1}^p\gamma^{[2]}_{kj}x_{ij}\Bigg)^2}_{=:a_k}$$
$$+\ 3B_n^2Kq\sum_{k=1}^K\sum_{z=1}^q\underbrace{\frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\big[\beta^{[1]}_k\big]_{z,j}x_{ij}-\sum_{j=1}^p\big[\beta^{[2]}_k\big]_{z,j}x_{ij}\Bigg)^2}_{=:b_{k,z}} +\ 3B_n^2\Big(\big\|\beta^{[1]}_0-\beta^{[2]}_0\big\|_1 + \big\|\gamma^{[1]}_0-\gamma^{[2]}_0\big\|_1 + \big\|\mathrm{vec}\big(\Sigma^{[1]}-\Sigma^{[2]}\big)\big\|_1\Big)^2.$$

So, for all $\delta>0$, if, for all $k\in[K]$ and $z\in[q]$,
$$a_k\le\frac{\delta^2}{36B_n^2K^2},\quad b_{k,z}\le\frac{\delta^2}{36B_n^2K^2q^2},\quad \big\|\beta^{[1]}_0-\beta^{[2]}_0\big\|_1\le\frac{\delta}{18B_n},\quad \big\|\gamma^{[1]}_0-\gamma^{[2]}_0\big\|_1\le\frac{\delta}{18B_n},\quad \big\|\mathrm{vec}\big(\Sigma^{[1]}-\Sigma^{[2]}\big)\big\|_1\le\frac{\delta}{18B_n},$$
then $\big\|f^{[1]}_m-f^{[2]}_m\big\|_n^2\,\mathbb 1_{\mathcal T} \le \frac{\delta^2}{12}+\frac{\delta^2}{12}+\frac{\delta^2}{12} = \frac{\delta^2}{4}$.

To bound $a_k$ and $b_{k,z}$, we can write
$$a_k = m^2\,\frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\frac{\gamma^{[1]}_{kj}}{m}x_{ij}-\sum_{j=1}^p\frac{\gamma^{[2]}_{kj}}{m}x_{ij}\Bigg)^2,\qquad b_{k,z} = m^2\,\frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\frac{\big[\beta^{[1]}_k\big]_{z,j}}{m}x_{ij}-\sum_{j=1}^p\frac{\big[\beta^{[2]}_k\big]_{z,j}}{m}x_{ij}\Bigg)^2.$$

Then, we apply Lemma 5.6 to $\frac{\gamma^{[1]}_{k,\cdot}}{m} = \Big(\frac{\gamma^{[1]}_{kj}}{m}\Big)_{j\in[p]}$ and $\frac{[\beta^{[1]}_k]_{z,\cdot}}{m} = \Big(\frac{[\beta^{[1]}_k]_{z,j}}{m}\Big)_{j\in[p]}$, for all $k\in[K]$, $z\in[q]$. Since $s^{[1]}_m\in S_m$, and using (20), we have $\|\gamma^{[1]}_k\|_1\le m$ and $\|\mathrm{vec}(\beta^{[1]}_k)\|_1\le m$, which leads to $\sum_{j=1}^p\big|\frac{\gamma^{[1]}_{kj}}{m}\big|\le1$ and $\sum_{z=1}^q\sum_{j=1}^p\big|\frac{[\beta^{[1]}_k]_{z,j}}{m}\big|\le1$, respectively. Furthermore, given $x\in\mathcal X=[0,1]^p$, we have $\|x\|_{\infty,n}\le1$. Thus, there exist families $\mathcal A$ of $(2p+1)^{36B_n^2K^2m^2/\delta^2}$ vectors and $\mathcal B$ of $(2p+1)^{36B_n^2K^2q^2m^2/\delta^2}$ vectors of $\mathbb R^p$, such that for all $k\in[K]$, $z\in[q]$, $\gamma^{[1]}_{k,\cdot}$, and $[\beta^{[1]}_k]_{z,\cdot}$, there exist $\gamma^{[2]}_{k,\cdot}\in\mathcal A$ and $[\beta^{[2]}_k]_{z,\cdot}\in\mathcal B$, such that
$$\frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\frac{\gamma^{[1]}_{kj}}{m}x_{ij}-\sum_{j=1}^p\frac{\gamma^{[2]}_{kj}}{m}x_{ij}\Bigg)^2 \le \frac{\delta^2}{36B_n^2K^2m^2},\quad\text{and}\quad \frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\frac{\big[\beta^{[1]}_k\big]_{z,j}}{m}x_{ij}-\sum_{j=1}^p\frac{\big[\beta^{[2]}_k\big]_{z,j}}{m}x_{ij}\Bigg)^2 \le \frac{\delta^2}{36B_n^2K^2q^2m^2},$$
which leads to $a_k\le\frac{\delta^2}{36B_n^2K^2}$ and $b_{k,z}\le\frac{\delta^2}{36B_n^2K^2q^2}$. Moreover, (9) leads to
$$\big\|\beta^{[1]}_0\big\|_1 = \sum_{k=1}^K\big\|\beta^{[1]}_{k0}\big\|_1 \le Kq\big\|\beta^{[1]}_{k0}\big\|_\infty \le KqA_\beta \ (\text{using (77)}),\qquad \big\|\gamma^{[1]}_0\big\|_1 = \sum_{k=1}^K\big|\gamma^{[1]}_{k0}\big| \le KA_\gamma,\qquad \big\|\mathrm{vec}\big(\Sigma^{[1]}\big)\big\|_1 = \sum_{k=1}^K\big\|\mathrm{vec}\big(\Sigma^{[1]}_k\big)\big\|_1 \le Kq\sqrt q\,a_\Sigma.$$

Therefore, on the event $\mathcal T$, writing $\mathcal B_d(R)$ for the $l_1$-ball of radius $R$ in $\mathbb R^d$,
$$M(\delta,\mathcal F_m,\|\cdot\|_n) \le N(\delta/2,\mathcal F_m,\|\cdot\|_n) \ (\text{using Lemma A.4})$$
$$\le \mathrm{card}(\mathcal A)^K\,\mathrm{card}(\mathcal B)^{Kq}\,N\Big(\frac{\delta}{18B_n},\mathcal B_{Kq}(KqA_\beta),\|\cdot\|_1\Big)\,N\Big(\frac{\delta}{18B_n},\mathcal B_{K}(KA_\gamma),\|\cdot\|_1\Big)\,N\Big(\frac{\delta}{18B_n},\mathcal B_{Kq^2}(Kq\sqrt q\,a_\Sigma),\|\cdot\|_1\Big)$$
$$\le (2p+1)^{\frac{72B_n^2q^3K^3m^2}{\delta^2}}\Bigg(\frac{54B_nKqA_\beta}{\delta}\Bigg)^{Kq}\Bigg(\frac{54B_nKA_\gamma}{\delta}\Bigg)^{K}\Bigg(\frac{54B_nKq\sqrt q\,a_\Sigma}{\delta}\Bigg)^{Kq^2}.$$

5.2.3 Proof of Lemma 5.7

Let $m\in\mathbb N^\star$. From Lemma 5.4, on the event $\mathcal T$,
$$\sup_{f_m\in\mathcal F_m}\|f_m\|_n\,\mathbb 1_{\mathcal T} \le 2KB_n\big(A_\gamma+qA_\beta+q\sqrt q\,a_\Sigma\big) =: R_n. \quad (72)$$

From Lemma 5.5, on the event $\mathcal T$, for all $S\in\mathbb N^\star$,
$$\sum_{s=1}^S2^{-s}\sqrt{\ln\big[1+M(2^{-s}R_n,\mathcal F_m,\|\cdot\|_n)\big]} \le \sum_{s=1}^S2^{-s}\sqrt{\ln\big[2M(\delta_s,\mathcal F_m,\|\cdot\|_n)\big]} \quad\text{with } \delta_s = 2^{-s}R_n$$
$$\le \sum_{s=1}^S2^{-s}\Bigg[\sqrt{\ln2} + \frac{6\sqrt2\,B_nq\sqrt qK\sqrt K\,m}{\delta_s}\sqrt{\ln(2p+1)} + \sqrt{K\ln\Bigg[\Bigg(\frac{54B_nKqA_\beta}{\delta_s}\Bigg)^{q}\Bigg(\frac{54B_nKA_\gamma}{\delta_s}\Bigg)\Bigg(\frac{54B_nKq\sqrt q\,a_\Sigma}{\delta_s}\Bigg)^{q^2}\Bigg]}\,\Bigg]. \quad (73)$$

Notice from (72) that $R_n \ge 2KB_n\max\big(A_\gamma,qA_\beta,q\sqrt q\,a_\Sigma\big)$, so that each ratio in (73) satisfies $\frac{54B_nKqA_\beta}{\delta_s} \le 27\cdot2^s \le 2^{s+5}$, and likewise for the other two. Moreover, it holds that $\sum_{s=1}^S2^{-s} = 1-2^{-S}\le1$, $\sum_{s=1}^S(\sqrt e/2)^s \le \frac{\sqrt e}{2-\sqrt e}$, and since $e^s\ge s$ for all $s\in\mathbb N^\star$, $2^{-s}\sqrt s \le (\sqrt e/2)^s$. Using $q+1+q^2\le3q^2$ and $\sqrt{s+5}\le\sqrt6\,\sqrt s$,
$$\sqrt{K\ln\big[2^{(s+5)(q+1+q^2)}\big]} \le q\sqrt{3K\ln2}\,\sqrt{s+5} \le q\sqrt{18K\ln2}\,\sqrt s.$$
Therefore, from (73),
$$\sum_{s=1}^S2^{-s}\sqrt{\ln\big[1+M(2^{-s}R_n,\mathcal F_m,\|\cdot\|_n)\big]} \le \frac{6\sqrt2\,B_nq\sqrt qK\sqrt K\,m}{R_n}\sqrt{\ln(2p+1)}\,S + q\sqrt K\underbrace{\Bigg(\sqrt{\ln2}+\sqrt{18\ln2}\,\frac{\sqrt e}{2-\sqrt e}\Bigg)}_{=:C}, \quad (74)$$
using $q\sqrt K\ge1$ to absorb the $\sqrt{\ln2}$ term and $\sum_{s=1}^S2^{-s}\cdot2^s = S$ for the middle term. Then, from (64) and (74), for all $S\in\mathbb N^\star$,
$$\mathbb E_X\Bigg[\sup_{f_m\in\mathcal F_m}\Bigg|\frac1n\sum_{i=1}^n\epsilon_if_m(Z_i)\Bigg|\Bigg] \le R_n\Bigg[\frac1{\sqrt n}\Bigg(\frac{6\sqrt2\,B_nq\sqrt qK\sqrt K\,m}{R_n}\sqrt{\ln(2p+1)}\,S + q\sqrt K\,C\Bigg) + 2^{-S}\Bigg]. \quad (75)$$

We choose $S = \lceil\ln n/\ln2\rceil$ so that the two terms depending on $S$ in (75) are of the same order.
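For completeness, the arithmetic behind this choice of $S$:
$$S = \Big\lceil\frac{\ln n}{\ln2}\Big\rceil \ \Longrightarrow\ 2^{-S} \le 2^{-\ln n/\ln2} = e^{-\ln n} = \frac1n, \qquad S \le \frac{\ln n}{\ln2}+1 \le \frac{2\ln n}{\ln2}\quad(n\ge2),$$
so the chaining term is of order $(\ln n)/\sqrt n$ while the tail term $R_n2^{-S}\le R_n/n$ is of lower order.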
In particular, for this value of $S$, $2^{-S}\le1/n$ and $S\le\frac{2\ln n}{\ln2}$ for $n\ge2$, and we deduce from (75) and (72) that
$$\mathbb E_X\Bigg[\sup_{f_m\in\mathcal F_m}\Bigg|\frac1n\sum_{i=1}^n\epsilon_if_m(Z_i)\Bigg|\Bigg] \le \underbrace{\frac{12\sqrt2}{\ln2}}_{\le\,25}\,\frac{B_nq\sqrt qK\sqrt K\,m}{\sqrt n}\sqrt{\ln(2p+1)}\,\ln n + \frac{2KB_n\big(A_\gamma+qA_\beta+q\sqrt q\,a_\Sigma\big)q\sqrt K}{\sqrt n}\underbrace{(C+1)}_{\le\,20}$$
$$\le \frac{KB_nq}{\sqrt n}\Big[25\,m\sqrt{Kq\ln(2p+1)}\,\ln n + 40\sqrt K\big(A_\gamma+qA_\beta+q\sqrt q\,a_\Sigma\big)\Big] = \frac{KB_nq}{\sqrt n}\,\Delta_m,$$
which establishes (66).

Conclusion

We have studied an $l_1$-regularization estimator for finite mixtures of Gaussian experts regression models with soft-max gating functions. Our main contribution is the proof of an $l_1$-oracle inequality that provides a lower bound on the regularization parameter of the Lasso ensuring non-asymptotic theoretical control of the Kullback-Leibler loss of the estimator. Beyond some remaining questions regarding the tightness of the bounds and the form of the penalization functions, we believe that our contribution helps to further popularize mixtures of Gaussian experts regression models by providing a theoretical foundation for their application to high-dimensional problems.

Acknowledgments
TTN is supported by a "Contrat doctoral" from the French Ministry of Higher Education and Research and by the French National Research Agency (ANR) grant SMILES ANR-18-CE40-0014. HDN and GJM are funded by Australian Research Council grant number DP180101192.
A Technical results
We denote the vector space of all $q$-by-$q$ real matrices by $\mathbb R^{q\times q}$ ($q\in\mathbb N^\star$):
$$A\in\mathbb R^{q\times q} \iff A = (a_{i,j}) = \begin{pmatrix} a_{1,1} & \cdots & a_{1,q}\\ \vdots & & \vdots\\ a_{q,1} & \cdots & a_{q,q} \end{pmatrix},\qquad a_{i,j}\in\mathbb R.$$
If a capital letter is used to denote a matrix (e.g., $A$, $B$), then the corresponding lower-case letter with subscript $i,j$ refers to the $(i,j)$th entry (e.g., $a_{i,j}$, $b_{i,j}$). When required, we also designate the elements of a matrix with the notation $[A]_{i,j}$ or $A(i,j)$. Denote the $q$-by-$q$ identity and zero matrices by $I_q$ and $0_q$, respectively.

Lemma A.1 (Derivative of a quadratic form, Magnus & Neudecker, 2019). Assume that $X$ and $a$ are a non-singular matrix in $\mathbb R^{q\times q}$ and a vector in $\mathbb R^{q\times1}$, respectively. Then
$$\frac{\partial a^\top X^{-1}a}{\partial X} = -X^{-1}aa^\top X^{-1}.$$

Lemma A.2 (Jacobi's formula, Theorem 8.1 from Magnus & Neudecker, 2019). If $X$ is a differentiable map from the real numbers to $q$-by-$q$ matrices,
$$\frac{d}{dt}\det(X(t)) = \mathrm{tr}\Bigg(\mathrm{Adj}(X(t))\,\frac{dX(t)}{dt}\Bigg).$$
In particular,
$$\frac{\partial\det(X)}{\partial X} = \big(\mathrm{Adj}(X)\big)^\top = \det(X)\big(X^{-1}\big)^\top.$$

Definition A.1 (Operator (induced) $p$-norm). We recall the operator (induced) $p$-norm of a matrix $A\in\mathbb R^{q\times q}$ ($q\in\mathbb N^\star$, $p\in\{1,2,\infty\}$),
$$\|A\|_p = \max_{x\ne0}\frac{\|Ax\|_p}{\|x\|_p} = \max_{x\ne0}\Bigg\|A\frac{x}{\|x\|_p}\Bigg\|_p = \max_{\|x\|_p=1}\|Ax\|_p, \quad (76)$$
where, for all $x\in\mathbb R^q$,
$$\|x\|_\infty \le \|x\|_1 = \sum_{i=1}^q|x_i| \le q\|x\|_\infty, \quad (77)$$
$$\|x\|_2 = \Bigg(\sum_{i=1}^q|x_i|^2\Bigg)^{1/2} = \big(x^\top x\big)^{1/2} \le \|x\|_1 \le \sqrt q\,\|x\|_2,\ \text{and} \quad (78)$$
$$\|x\|_\infty = \max_{1\le i\le q}|x_i| \le \|x\|_2 \le \sqrt q\,\|x\|_\infty. \quad (79)$$

Lemma A.3 (Some matrix $p$-norm properties, Golub & Van Loan, 2012). By definition, we always have the important property that for every $A\in\mathbb R^{q\times q}$ and $x\in\mathbb R^q$,
$$\|Ax\|_p \le \|A\|_p\,\|x\|_p, \quad (80)$$
and every induced $p$-norm is submultiplicative, i.e., for every $A\in\mathbb R^{q\times q}$ and $B\in\mathbb R^{q\times q}$,
$$\|AB\|_p \le \|A\|_p\,\|B\|_p. \quad (81)$$
In particular, it holds that
$$\|A\|_1 = \max_{1\le j\le q}\sum_{i=1}^q|a_{ij}| \le \sum_{j=1}^q\sum_{i=1}^q|a_{ij}| =: \|\mathrm{vec}(A)\|_1 \le q\|A\|_1, \quad (82)$$
$$\|\mathrm{vec}(A)\|_\infty := \max_{1\le i,j\le q}|a_{ij}| \le \|A\|_\infty = \max_{1\le i\le q}\sum_{j=1}^q|a_{ij}| \le q\|\mathrm{vec}(A)\|_\infty, \quad (83)$$
$$\|\mathrm{vec}(A)\|_\infty \le \|A\|_2 = \lambda_{\max}(A) \le q\|\mathrm{vec}(A)\|_\infty, \quad (84)$$
where $\lambda_{\max}$ is the largest eigenvalue of a positive definite symmetric matrix $A$. The $p$-norms, when $p\in\{1,2,\infty\}$, satisfy
$$\frac{1}{\sqrt q}\|A\|_\infty \le \|A\|_2 \le \sqrt q\,\|A\|_\infty, \quad (85)$$
$$\frac{1}{\sqrt q}\|A\|_1 \le \|A\|_2 \le \sqrt q\,\|A\|_1. \quad (86)$$
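The norm relations (82)–(86) can be sanity-checked numerically; a minimal sketch with a random matrix (dimension hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
q = 5
A = rng.normal(size=(q, q))

n1   = np.linalg.norm(A, 1)        # max column sum
n2   = np.linalg.norm(A, 2)        # spectral norm
ninf = np.linalg.norm(A, np.inf)   # max row sum
vec1 = np.abs(A).sum()             # ||vec(A)||_1
vecinf = np.abs(A).max()           # ||vec(A)||_inf

assert n1 <= vec1 <= q * n1                              # (82)
assert vecinf <= ninf <= q * vecinf                      # (83)
assert vecinf <= n2 <= q * vecinf                        # (84)
assert ninf / np.sqrt(q) <= n2 <= np.sqrt(q) * ninf      # (85)
assert n1 / np.sqrt(q) <= n2 <= np.sqrt(q) * n1          # (86)
print("norm inequalities (82)-(86) hold for this draw")
```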
Given $\delta>0$, we need to define the $\delta$-packing number and the $\delta$-covering number.

Definition A.2 ($\delta$-packing number, e.g., Definition 5.4 from Wainwright, 2019). Let $(\mathcal F,\|\cdot\|)$ be a normed space and let $\mathcal G\subset\mathcal F$. With $(g_i)_{i=1,\ldots,m}\in\mathcal G$, $\{g_1,\ldots,g_m\}$ is a $\delta$-packing of $\mathcal G$ of size $m\in\mathbb N^\star$ if
$$\|g_i-g_j\| > \delta,\quad \forall i\ne j,\ i,j\in\{1,\ldots,m\},$$
or equivalently, if the closed balls $B(g_i,\delta/2)$, $i=1,\ldots,m$, are pairwise disjoint. Upon defining the $\delta$-packing, we can measure the maximal number of disjoint closed balls with radius $\delta/2$ and centers in $\mathcal G$. This number is called the $\delta$-packing number and is defined as
$$M(\delta,\mathcal G,\|\cdot\|) := \max\{m\in\mathbb N^\star : \exists\ \delta\text{-packing of }\mathcal G\text{ of size }m\}. \quad (87)$$

Definition A.3 ($\delta$-covering number, Definition 5.1 from Wainwright, 2019). Let $(\mathcal F,\|\cdot\|)$ be a normed space and let $\mathcal G\subset\mathcal F$. With $(g_i)_{i=1,\ldots,n}\in\mathcal G$, $\{g_1,\ldots,g_n\}$ is a $\delta$-covering of $\mathcal G$ of size $n$ if $\mathcal G\subset\cup_{i=1}^nB(g_i,\delta)$, or equivalently, $\forall g\in\mathcal G$, $\exists i$ such that $\|g-g_i\|\le\delta$. Upon defining the $\delta$-covering, we can measure the minimal number of closed balls with radius $\delta$ that is necessary to cover $\mathcal G$. This number is called the $\delta$-covering number and is defined as
$$N(\delta,\mathcal G,\|\cdot\|) := \min\{n\in\mathbb N^\star : \exists\ \delta\text{-covering of }\mathcal G\text{ of size }n\}. \quad (88)$$
The covering entropy (metric entropy) is defined as $H_{\|\cdot\|}(\delta,\mathcal G) = \ln\big(N(\delta,\mathcal G,\|\cdot\|)\big)$.

The relation between the packing number and the covering number is described in the following lemma.

Lemma A.4 (Lemma 5.5 from Wainwright, 2019). Let $(\mathcal F,\|\cdot\|)$ be a normed space and let $\mathcal G\subset\mathcal F$. Then
$$M(2\delta,\mathcal G,\|\cdot\|) \le N(\delta,\mathcal G,\|\cdot\|) \le M(\delta,\mathcal G,\|\cdot\|).$$
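The right-hand inequality of Lemma A.4 rests on the observation that a maximal $\delta$-packing is automatically a $\delta$-covering. A minimal numerical sketch on a hypothetical point cloud in $\mathbb R^2$:

```python
import numpy as np

def greedy_packing(points, delta):
    """Build a maximal delta-packing greedily: keep points pairwise > delta apart."""
    packed = []
    for pt in points:
        if all(np.linalg.norm(pt - g) > delta for g in packed):
            packed.append(pt)
    return np.array(packed)

rng = np.random.default_rng(4)
G = rng.uniform(size=(300, 2))            # a hypothetical bounded subset of R^2
delta = 0.15

pack_2d = greedy_packing(G, 2 * delta)    # a 2*delta-packing
pack_1d = greedy_packing(G, delta)        # a maximal delta-packing

# Maximality implies covering: every point of G has a packed point within delta,
# which is the idea behind the right-hand inequality of Lemma A.4.
dists = np.linalg.norm(G[:, None, :] - pack_1d[None, :, :], axis=2).min(axis=1)
assert dists.max() <= delta

print(f"|2delta-packing| = {len(pack_2d)} <= |maximal delta-packing| = {len(pack_1d)}")
```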
Lemma A.5 (Chernoff's inequality, e.g., Chapter 2 in Wainwright, 2019). Assume that the random variable $U$ has a moment generating function in a neighborhood of zero, meaning that there is some constant $b>0$ such that the function $\varphi(\lambda) = \mathbb E\big[e^{\lambda(U-\mu)}\big]$ exists for all $\lambda\le|b|$. In such a case, we may apply Markov's inequality to the random variable $e^{\lambda(U-\mu)}$, thereby obtaining the upper bound
$$\mathbb P(U-\mu\ge t) = \mathbb P\big(e^{\lambda(U-\mu)}\ge e^{\lambda t}\big) \le \frac{\mathbb E\big[e^{\lambda(U-\mu)}\big]}{e^{\lambda t}}.$$
Optimizing our choice of $\lambda$ so as to obtain the tightest result yields the Chernoff bound
$$\ln\big(\mathbb P(U-\mu\ge t)\big) \le -\sup_{\lambda\in[0,b]}\Big\{\lambda t - \ln\big(\mathbb E\big[e^{\lambda(U-\mu)}\big]\big)\Big\}. \quad (89)$$
In particular, suppose $U\sim\mathcal N(\mu,\sigma^2)$ is a Gaussian random variable with mean $\mu$ and variance $\sigma^2$. By a straightforward calculation, we find that $U$ has the moment generating function
$$\mathbb E\big[e^{\lambda U}\big] = e^{\mu\lambda + \frac{\sigma^2\lambda^2}{2}},\quad\text{valid for all }\lambda\in\mathbb R.$$
Substituting this expression into the optimization problem defining the optimized Chernoff bound (89), we obtain
$$\sup_{\lambda\ge0}\Big\{\lambda t - \ln\big(\mathbb E\big[e^{\lambda(U-\mu)}\big]\big)\Big\} = \sup_{\lambda\ge0}\Big\{\lambda t - \frac{\sigma^2\lambda^2}{2}\Big\} = \frac{t^2}{2\sigma^2},$$
where we have taken derivatives in order to find the optimum of this quadratic function. So, (89) leads to
$$\mathbb P(U\ge\mu+t) \le e^{-\frac{t^2}{2\sigma^2}},\quad\text{for all }t\ge0. \quad (90)$$

Recall that a multi-index $\alpha = (\alpha_1,\ldots,\alpha_p)$, $\alpha_i\in\mathbb N$, $\forall i\in\{1,\ldots,p\}$, is a $p$-tuple of non-negative integers. Let
$$|\alpha| = \sum_{i=1}^p\alpha_i,\qquad \alpha! = \prod_{i=1}^p\alpha_i!,\qquad x^\alpha = \prod_{i=1}^px_i^{\alpha_i},\ x\in\mathbb R^p,\qquad \partial^\alpha f = \partial_1^{\alpha_1}\partial_2^{\alpha_2}\cdots\partial_p^{\alpha_p}f = \frac{\partial^{|\alpha|}f}{\partial x_1^{\alpha_1}\partial x_2^{\alpha_2}\cdots\partial x_p^{\alpha_p}}.$$
The number $|\alpha|$ is called the order or degree of $\alpha$. Thus, the order of $\alpha$ is the same as the order of $x^\alpha$ as a monomial or the order of $\partial^\alpha$ as a partial derivative.

Lemma A.6 (Taylor's theorem in several variables, from Duistermaat & Kolk, 2004). Suppose $f:\mathbb R^p\to\mathbb R$ is in the class $C^{k+1}$ of continuously differentiable functions on an open convex set $S$. If $a\in S$ and $a+h\in S$, then
$$f(a+h) = \sum_{|\alpha|\le k}\frac{\partial^\alpha f(a)}{\alpha!}h^\alpha + R_{a,k}(h),$$
where the remainder is given in Lagrange's form by
$$R_{a,k}(h) = \sum_{|\alpha|=k+1}\partial^\alpha f(a+ch)\,\frac{h^\alpha}{\alpha!}\quad\text{for some }c\in(0,1),$$
or in integral form by
$$R_{a,k}(h) = (k+1)\sum_{|\alpha|=k+1}\frac{h^\alpha}{\alpha!}\int_0^1(1-t)^k\,\partial^\alpha f(a+th)\,dt.$$
In particular, we can estimate the remainder term: if $|\partial^\alpha f(x)|\le M$ for $x\in S$ and $|\alpha|=k+1$, then
$$|R_{a,k}(h)| \le \frac{M}{(k+1)!}\,\|h\|_1^{k+1},\qquad \|h\|_1 = \sum_{i=1}^p|h_i|.$$
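The way Lemma A.6 is used in Section 5.2 is with $k=0$: $|f(a+h)-f(a)|\le M\|h\|_1$ whenever every partial derivative is bounded by $M$. A minimal numerical sketch with a hypothetical smooth test function whose partial derivatives lie in $(0,1)$:

```python
import numpy as np

def f(x):
    # log(1 + sum(exp(x))): each partial derivative e^{x_j}/(1 + sum e^{x_k}) is in (0,1).
    return np.log(1.0 + np.sum(np.exp(x)))

rng = np.random.default_rng(5)
a = rng.normal(size=6)
h = 0.1 * rng.normal(size=6)
M = 1.0                           # uniform bound on the first partial derivatives

lhs = abs(f(a + h) - f(a))
rhs = M * np.abs(h).sum()         # M * ||h||_1, the k = 0 remainder bound of Lemma A.6
assert lhs <= rhs
print(f"|f(a+h)-f(a)| = {lhs:.4f} <= M*||h||_1 = {rhs:.4f}")
```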
References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.

Baudry, J.-P. (2009). Sélection de modèle pour la classification non supervisée. Choix du nombre de classes. PhD thesis, Université Paris-Sud XI.

Birgé, L. & Massart, P. (2007). Minimal penalties for Gaussian model selection. Probability Theory and Related Fields, 138(1-2), 33–73.

Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Bunea, F. et al. (2008). Honest variable selection in linear and logistic regression models via $l_1$ and $l_1+l_2$ penalization. Electronic Journal of Statistics, 2, 1153–1194.

Chamroukhi, F. & Huynh, B. T. (2018). Regularized Maximum-Likelihood Estimation of Mixture-of-Experts for Regression and Clustering. In (pp. 1–8).

Chamroukhi, F. & Huynh, B.-T. (2019). Regularized Maximum Likelihood Estimation and Feature Selection in Mixtures-of-Experts Models. Journal de la Société Française de Statistique, 160(1), 57–85.

Cohen, S. & Le Pennec, E. (2011). Conditional density estimation by penalized likelihood model selection and applications. Technical Report, INRIA.

Devijver, E. (2015). An $l_1$-oracle inequality for the Lasso in finite mixture of multivariate Gaussian regression. ESAIM: Probability and Statistics, 19, 649–670.

Duistermaat, J. J. & Kolk, J. A. (2004). Multidimensional Real Analysis I: Differentiation, volume 86. Cambridge University Press.

Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

Genovese, C. R., Wasserman, L., et al. (2000). Rates of convergence for the Gaussian mixture sieve. Annals of Statistics, 28(4), 1105–1127.

Golub, G. H. & Van Loan, C. F. (2012). Matrix Computations, volume 3. JHU Press.

Ho, N., Nguyen, X., et al. (2016a). Convergence rates of parameter estimation for some weakly identifiable finite mixtures. Annals of Statistics, 44(6), 2726–2755.

Ho, N., Nguyen, X., et al. (2016b). On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics, 10(1), 271–307.

Ho, N., Yang, C.-Y., & Jordan, M. I. (2019). Convergence Rates for Gaussian Mixtures of Experts. arXiv preprint arXiv:1907.04377.

Huynh, T. & Chamroukhi, F. (2019). Estimation and Feature Selection in Mixtures of Generalized Linear Experts Models. arXiv preprint arXiv:1907.06994.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation, 3, 79–87.

Jiang, W. & Tanner, M. A. (1999). Hierarchical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation. Annals of Statistics, (pp. 987–1011).

Jordan, M. I. & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.

Khalili, A. (2010). New estimation and feature selection methods in mixture-of-experts models. Canadian Journal of Statistics, 38(4), 519–539.

Khalili, A. & Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102(479), 1025–1038.

Lloyd-Jones, L. R., Nguyen, H. D., & McLachlan, G. J. (2018). A globally convergent algorithm for lasso-penalized mixture of linear regression models. Computational Statistics & Data Analysis, 119, 19–38.

Magnus, J. R. & Neudecker, H. (2019). Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons.

Massart, P. (2007). Concentration Inequalities and Model Selection: École d'Été de Probabilités de Saint-Flour XXXIII-2003. Springer.

Massart, P. & Meynet, C. (2011). The Lasso as an $l_1$-ball model selection procedure. Electronic Journal of Statistics, 5, 669–687.

Maugis, C. & Michel, B. (2011). A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: Probability and Statistics, 15, 41–68.

McLachlan, G. & Peel, D. (2000). Finite Mixture Models. John Wiley & Sons.

Mendes, E. F. & Jiang, W. (2012). On convergence rates of mixtures of polynomial experts. Neural Computation, 24(11), 3025–3051.

Meynet, C. (2013). An $l_1$-oracle inequality for the Lasso in finite mixture Gaussian regression models. ESAIM: Probability and Statistics, 17, 650–671.

Montuelle, L., Le Pennec, E., et al. (2014). Mixture of Gaussian regressions model with logistic weights, a penalized maximum likelihood approach. Electronic Journal of Statistics, 8(1), 1661–1695.

Nguyen, H. D. & Chamroukhi, F. (2018). Practical and theoretical aspects of mixture-of-experts modeling: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1246.

Nguyen, H. D., Chamroukhi, F., & Forbes, F. (2019). Approximation results regarding the multiple-output Gaussian gated mixture of linear experts model. Neurocomputing, 366, 208–214.

Nguyen, H. D., Lloyd-Jones, L. R., & McLachlan, G. J. (2016). A universal approximation theorem for mixture-of-experts models. Neural Computation, 28(12), 2585–2593.

Nguyen, T., Chamroukhi, F., Nguyen, H. D., & McLachlan, G. J. (2020a). Approximation of probability density functions via location-scale finite mixtures in Lebesgue spaces. arXiv preprint arXiv:2008.09787.

Nguyen, T. T., Nguyen, H. D., Chamroukhi, F., & McLachlan, G. J. (2020b). Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Mathematics & Statistics, 7(1), 1750861.

Nguyen, X. et al. (2013). Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41(1), 370–400.

Norets, A. et al. (2010). Approximation of conditional densities by smooth mixtures of regressions. Annals of Statistics, 38(3), 1733–1766.

Park, M. Y. & Hastie, T. (2008). Penalized logistic regression for detecting gene interactions. Biostatistics, 9(1), 30–50.

Redner, R. A. & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2), 195–239.

Schwarz, G. et al. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.

Stadler, N., Buhlmann, P., & van de Geer, S. (2010). $l_1$-penalization for mixture regression models. TEST, 19, 209–256.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.

Van Der Vaart, A. & Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer.

Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data (Springer Series in Statistics). Springer-Verlag.

Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press.