An l 1 -oracle inequality for the Lasso in mixture-of-experts regression models
TrungTin Nguyen, Hien D Nguyen, Faicel Chamroukhi, Geoffrey J McLachlan
aa r X i v : . [ m a t h . S T ] S e p An l -oracle inequality for the Lassoin mixture-of-experts regression models TrungTin Nguyen ∗ , Hien D Nguyen , Faicel Chamroukhi ,and Geoffrey J McLachlan Lab of Mathematics Nicolas Oresme LMNO, UMR CNRS, Caen, France. School of Engineering and Mathematical Sciences. Department of Mathematics and Statistics, LaTrobe University, Melbourne, Victoria, Australia. School of Mathematics and Physics, University of Queensland, St. Lucia, Brisbane, Australia. ∗ Corresponding author, email: [email protected].
Abstract
Mixture-of-experts (MoE) models are a popular framework for modeling heterogeneity in data,for both regression and classification problems in statistics and machine learning, due to theirflexibility and the abundance of statistical estimation and model choice tools. Such flexibilitycomes from allowing the mixture weights (or gating functions) in the MoE model to depend on theexplanatory variables, along with the experts (or component densities). This permits the modelingof data arising from more complex data generating processes, compared to the classical finitemixtures and finite mixtures of regression models, whose mixing parameters are independent of thecovariates. The use of MoE models in a high-dimensional setting, when the number of explanatoryvariables can be much larger than the sample size (i.e., p ≫ n ), is challenging from a computationalpoint of view, and in particular from a theoretical point of view, where the literature is still lackingresults in dealing with the curse of dimensionality, in both the statistical estimation and featureselection. We consider the finite mixture-of-experts model with soft-max gating functions andGaussian experts for high-dimensional regression on heterogeneous data, and its l -regularizedestimation via the Lasso. We focus on the Lasso estimation properties rather than its featureselection properties. We provide a lower bound on the regularization parameter of the Lassofunction that ensures an l -oracle inequality satisfied by the Lasso estimator according to theKullback-Leibler loss. Keywords.
Mixture-of-Experts, mixture of regressions, penalized maximum likelihood, l -oracle inequal-ity, high-dimensional statistics, Lasso. Mixture-of-experts (MoE) models, a flexible generalization of classical finite mixture models, were introducedby Jacobs et al. (1991) in a problem decomposition context, and are widely used in statistics and machinelearning, thanks to their flexibility and the abundance of statistical estimation and model choice tools. Themain idea of MoE is a divide-and-conquer principle that proposes dividing a complex problem into a set ofsimpler subproblems and then one or more specialized problem-solving tools, or experts, are assigned to eachof the subproblems. The flexibility of MoE models comes from allowing the mixture weights (or the gatingfunctions) to depend on the explanatory variables, along with the experts (or the component densities). Thispermits the modeling of data arising from more complex data generating processes than the classical finitemixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates.Statistically, the MoE models are used to estimate the conditional distribution of a random variable Y ∈ R q , given certain features from n observations { x i } i ∈ [ n ] = { ( x i , . . . , x ip ) } i ∈ [ n ] ∈ ( R p ) n , where q, p, n ∈ N ⋆ ,[ n ] := { , . . . , n } , N ⋆ denotes the positive integer numbers, and R p means the p -dimensional real number. In thecontext of regression, finite MoE models with Gaussian experts and soft-max gating functions are a standardchoice and a powerful tool for modeling more complex non-linear relationships between response and predictors,arising from different subpopulations, compared to the finite mixture of Gaussian regression models. The readeris referred to Nguyen & Chamroukhi (2018) for a recent review on the topic. he use of MoE models in the high-dimensional regression setting, when the number of explanatory variablescan be much larger than the sample size, remains a challenge, particularly from a theoretical point of view,where there is still a lack of results in the literature regarding both statistical estimation and model selection. Insuch settings, we are required to reduce the dimension of the problem by seeking the most relevant relationships,to avoid numerical identifiability problems.We focus on the use of an l -penalized maximum likelihood estimator (MLE), as originally proposed as theLasso by Tibshirani (1996), which tends to produce sparse solutions and can be viewed as a convex surrogatefor the non-convex l -penalization problem. These methods have attractive computational and theoreticalproperties (cf. Fan & Li, 2001). First introduced in Tibshirani (1996) for the linear regression model, theLasso estimator has since been extended to many statistical problems, including for high-dimensional regressionof non-homogeneous data by using finite mixture regression models as considered by Khalili & Chen (2007),Stadler et al. (2010), and Lloyd-Jones et al. (2018). In Stadler et al. (2010), it is assumed that, for i ∈ [ n ] , n ∈ N ⋆ , the observations y i , conditionally on X i = x i , come from a conditional density s ψ ( ·| x i ) , which is a finitemixture of K ∈ N ⋆ Gaussian conditional densities with mixing proportions ( π , , . . . , π ,K ), where Y i | X i = x i ∼ s ψ ( y i | x i ) = K X k =1 π ,k φ ( y i ; β ⊤ ,k x i , σ ,k ) . (1)Here φ ( · ; µ, σ ) = 1 √ πσ exp − ( · − µ ) σ ! 
is the univariate Gaussian probability density function (PDF), with mean µ ∈ R and variance σ ∈ R + , and ψ = ( π ,k , β ,k , σ ,k ) k ∈ [ K ] is the vector of model parameters.Then, considering a model S , defined by the form (1). To estimate the true generative model s ψ ,Stadler et al. (2010) proposed a Lasso-regularization based estimator, which consists of a minimiser of thepenalized negative conditional log-likelihood that is defined by b s Lasso ( λ ) = argmin s ψ ∈ S ( − n n X i =1 ln ( s ψ ( y i | x i )) + pen λ ( ψ ) ) , pen λ ( ψ ) = λ K X k =1 π k p X j =1 (cid:12)(cid:12) σ − k β kj (cid:12)(cid:12) , λ > , ψ = ( π, β k , σ k ) k ∈ [ K ] . (2)For this estimator, the authors provided an l -oracle inequality, satisfied by b s Lasso ( λ ), conditional on the re-stricted eigenvalue condition and margin condition, which leads to link the Kullback-Leibler loss function to the l -norm of the parameters.Another direction of study regarding b s Lasso ( λ ) is to look at its l -regularization properties; see, for example,Massart & Meynet (2011), Meynet (2013), and Devijver (2015). As indicated by Devijver (2015), contrary toresults for the l penalty, some results for the l penalty are valid with no assumptions, neither on the Grammatrix nor on the margin. However, such results can be achieved only at a rate of convergence of 1 /n , ratherthan at order 1 / √ n .In the framework of finite mixtures of Gaussian regression models, Meynet (2013) considered the casefor a univariate response, and Devijver (2015) extended these results to the case of a multivariate responses, i.e., the Gaussian conditional pdf in (1) is replaced by a multivariate Gaussian PDF of the form φ ( · ; µ, Σ)with mean vector µ and a covariance matrix Σ. In particular, Devijver (2015) considered an extension of theLasso-estimator (2), with a regularization term defined by pen λ ( ψ ) = λ P Kk =1 P pj =1 P qz =1 (cid:12)(cid:12)(cid:12) [ β k ] z,j (cid:12)(cid:12)(cid:12) .In this article, we shall extend such result for the finite mixture of Gaussian regressions models, which isconsidered as a special case of the MoE models, where only the mixture components depend on the features,to the more general mixture of Gaussian experts regression models with soft-max gating functions, as definedin (6). Since each mixing proportion is modeled by a soft-max function of the covariates, the dependence oneach feature appears both in the experts pdfs and in the mixing proportion functions (gating functions), whichallows us to capture more complex non-linear relationships between the response and predictors arising fromdifferent subpopulations, compared to the finite mixture of Gaussian regression models. This is demonstratedvia numerical experiments in several articles such as Nguyen & Chamroukhi (2018), Chamroukhi & Huynh(2018), and Chamroukhi & Huynh (2019).In the context of studying the statistical properties of the penalized maximum likelihood approach for MoEmodels with soft-max gating functions, we may consider the prior works of Khalili (2010) and Montuelle et al.(2014). In Khalili (2010), for feature selection, two extra penalty terms are applied to the l -penalized conditional og-likelihood function. 
Their penalized conditional log-likelihood estimator is given by b s PL ( λ ) = argmin s ψ ∈ S ( − n n X i =1 ln ( s ψ ( y i | x i )) + pen λ ( ψ ) ) , (3) s ψ ( y | x ) = K X k =1 g k ( x ; γ ) φ (cid:0) y ; β k + β ⊤ k x, σ k (cid:1) , ψ = ( γ k , β k , σ k ) k ∈ [ K ] , (4)pen λ ( ψ ) = K X k =1 λ [1] k p X j =1 | γ kj | + K X k =1 λ [2] k p X j =1 | β kj | + λ [3] K X k =1 k γ k k , (5)where λ = (cid:16) λ [1]1 , . . . , λ [1] K , λ [2]1 , . . . , λ [2] K , λ [3] (cid:17) is a vector of non-negative regularization parameters, S contains allfunctions of form (3), k·k is the Euclidean norm in R p , and g k ( x ; γ ) = exp (cid:0) γ k + γ ⊤ k x (cid:1)P Kl =1 exp (cid:0) γ l + γ ⊤ l x (cid:1) is a soft-max gating function. Note that the first two terms from (5) are the normal Lasso functions ( l penalty function), while the l penalty function for the gating network is added to excessively wildly largeestimates of the regression coefficients corresponding to the mixing proportions. This behavior can be ob-served in logistic/multinomial regression when the number of potential features is large and highly corre-lated (see e.g., Park & Hastie, 2008 and Bunea et al., 2008). However, this also affects the sparsity of theregularization model, which is confirmed via the numerical experiments of Chamroukhi & Huynh (2018) andChamroukhi & Huynh (2019).By extending the theoretical developments for mixture of linear regression models in Khalili & Chen (2007),standard asymptotic theorems for MoE models are established in Khalili (2010). More precisely, under severalstrict regularity conditions on the true joint density function s ψ ( y, x ) and the choice of tuning parameter λ , theestimator of the true parameter vector b ψ PL n ( λ ), defined via b s PL ( λ ) from (3) but using the Scad penalty functionfrom Fan & Li (2001), instead of Lasso, is proved to be both consistent in feature selection and maintains root- n consistency. Differing from Scad, for Lasso, the estimator b ψ PL n ( λ ) cannot achieve both properties, simultaneously.In other words, Lasso is consistent in feature selection but introduces bias to the estimators of the true nonzerocoefficients.Another related result to our work is the weak oracle inequality from Montuelle et al. (2014, Theorem 1).Montuelle et al. (2014) focused on the variable selection procedure instead of investigating the l -regularizationproperties for the Lasso estimator. A detailed comparison between our work and their results can be foundin Remark 3.1. Therefore, our non-asymptotic result in Theorem 3.1 can be considered as a complement tosuch asymptotics for MoE regression models with soft-max gating functions. To obtain our oracle inequality,Theorem 3.1, we shall restrict our study to the Lasso estimator without the l -norm.While studying the oracle inequality within the context of the ( l + l )-norm may also be interesting. Ithas been demonstrated, in Huynh & Chamroukhi (2019), that the regularized maximum-likelihood estimationof MoE models for generalized linear models, better encourages sparsity under the l -norm, compared to whenusing the ( l + l )-norm, which may affect sparsity. We shall not discuss such approaches, further.To the best of our knowledge, we are the first to study the l -regularization properties of the MoE regressionmodels. In the current paper, we focus on a simplified but standard setting in which the means of the expertsare linear functions, with respect to explanatory variables. 
Although simplified, this model captures the coreof the MoE regression problem, which is the interactions among the different mixture components. We believethat the general techniques that we develop here can be extended to more general experts, such as Gaussianexperts with polynomial means ( e.g., Mendes & Jiang, 2012) or even with hierarchical MoE for exponentialfamily regression models in Jiang & Tanner (1999). But we leave such nontrivial developments for future work.The main contribution of our paper is a theoretical result: an oracle inequality, Theorem 3.1, which providesthe lower bound on the regularization parameters of Lasso that ensures such non asymptotic theoretical controlon the Kullback-Leibler loss of the Lasso estimator for the mixtures of Gaussian experts regression models withsoft-max gating functions. Note that this result is non-asymptotic; i.e., the number of observations n is fixed,while the number of predictors p and the dimension of the responses q can grow, with respect to n , and canbe much larger than n . Good discussions about non-asymptotic statistics are provided in Massart (2007) andWainwright (2019).Note that, as in Khalili (2010), the true order K of the MoE model (the true number of experts in our model)is supposed to be known. From a pragmatic perspective, one may estimate it via using the AIC of Akaike (1974),the BIC of Schwarz et al. (1978), or slope heuristic of Birg´e & Massart (2007). Our result follows directly theline of work of Meynet (2013) and Devijver (2015). In fact, our theorem combined Vapnik’s structural risk inimization paradigm ( e.g., Vapnik, 1982) and theory of model selection for conditional density estimation( e.g.,
Cohen & Pennec, 2011), which is an extended version of the density estimation results from Massart(2007).The goal of this paper is to provide a treatment regarding penalizations that guarantee an l -oracle inequalityfor finite MoE models in particular for high-dimensional non-linear regression. As such, the remainder of thearticle progresses as follows. In Section 2, we discuss the construction and framework of finite mixture ofGaussian experts regression models with soft-max gating functions. In Section 3, we state the main result of thearticle, which is an l -oracle inequality satisfied by the Lasso estimator in the finite mixture of Gaussian expertsregression models. Section 4 is devoted to the proof of these main results. The proof of technical lemmas can befounded in Section 5. Some conclusions are provided in Section 6, and additional technical results are relegatedto Appendix A. We consider the statistical framework in which we model a sample of high-dimensional regression data generatedfrom a heterogeneous population via the mixtures of Gaussian experts regression models with Gaussian gatingfunctions. We observe n independent couples (( x i , y i )) i ∈ [ n ] ∈ ( X × R q ) n ⊂ ( R p × R q ) n ( p, q, n ∈ N ⋆ ), wheretypically p ≫ n , x i is fixed and y i is a realization of the random variable variable Y i , for all i ∈ [ n ]. We assumethat X is a compact set of R p . We also assume that the response variable Y i depends on the set of explanatoryvariables (covariates) through a regression-type model. The conditional probability density function (PDF) ofthe model is approximated by mixture of Gaussian experts regression models with soft-max gating functions.The approximation capabilities of such MoE models have been extensively studied in Jiang & Tanner (1999),Norets et al. (2010), Nguyen et al. (2016), Ho et al. (2019), and Nguyen et al. (2019), and particular in the caseof finite mixture models by Genovese et al. (2000), Nguyen et al. (2013), Ho et al. (2016a), Ho et al. (2016b),and Nguyen et al. (2020a,b).More precisely, we assume that, conditionally to the { x i } i ∈ [ n ] , { Y i } i ∈ [ n ] are independent and identicallydistributed with conditional density s ( ·| x i ), which is approximated by a MoE model. Our goal is to estimatethis conditional density function s from the observations.For any K ∈ N ⋆ , the K -component MoE model can be defined asMoE ( y | x ; θ ) = K X k =1 g k ( x ; γ ) f k ( y | x ; η ) , where g k ( x ; γ ) > P Kk =1 g k ( x ; γ ) = 1, and f k ( y | x ; η ) is a conditional PDF (cf. Nguyen & Chamroukhi,2018). In our proposal, we consider the MoE model of Jordan & Jacobs (1994), which extended the originalMoE from Jacobs et al. (1991), for a regression model. More precisely, we utilize the following mixtures ofGaussian experts regression models with soft-max gating functions: s ψ ( y | x ) = K X k =1 g k ( x ; γ ) φ ( y ; v k ( x ) , Σ k ) , (6)to estimate s , where given any k ∈ [ K ], φ ( · ; v k , Σ k ) is the multivariate Gaussian density with mean v k ,which is a function of x tgat specifies the mean of the k th component, and with covariance matrix Σ k . Here,( v, Σ) := (( v , . . . , v K ) , (Σ , . . . 
, Σ K )) ∈ (Υ × V ), where Υ is a set of K -tuples of mean functions from X to R q and V is a sets of K -tuples of symmetric positive definite matrices on R q , and the soft-max gating function g k ( x ; γ ) is defined as in (7): g k ( x ; γ ) = exp ( w k ( x )) P Kl =1 exp ( w l ( x )) , w k ( x ) = γ k + γ ⊤ k x, γ = (cid:0) γ k , γ ⊤ k (cid:1) k ∈ [ K ] ∈ Γ = R ( p +1) K . (7)We shall define the parameter vector ψ in the sequel. Inspired by the framework in Meynet (2013) and Devijver (2015), the explanatory variables x i and the numberof components K ∈ N ⋆ are both fixed. We assume that the observed x i , i ∈ [ n ], are finite. Without loss ofgenerality, we choose to rescale x , so that k x k ∞ ≤
1. Therefore, we can assume that the explanatory variables x i ∈ X = [0 , p , for all i ∈ [ n ]. Note that such a restriction is also used in Devijver (2015). Under only theassumption of bounded parameters, we provide a lower bound on the Lasso regularization parameter λ , whichguarantees an oracle inequality. Note that in this non-random explanatory variables setting, we focus on theLasso for its l -regularization properties rather than as a model selection procedure, as in the case of randomexplanatory variables and unknown K , as in Montuelle et al. (2014).For simplicity, we consider the case where the means of Gaussian experts are linear functions of the ex-planatory variables; i.e., Υ = (cid:26) v : X 7→ v β ( x ) := ( β k + β k x ) k ∈ [ K ] ∈ ( R q ) K (cid:12)(cid:12)(cid:12)(cid:12) β = ( β k , β k ) k ∈ [ K ] ∈ B = (cid:16) R q × ( p +1) (cid:17) K (cid:27) , where β k and β k are respectively the q × q × p regression coefficients matrix for the k th expert.In summary, we wish to estimate s via conditional densities belonging to the class: { ( x, y ) s ψ ( y | x ) | ψ = ( γ, β, Σ) ∈ Ψ } , (8)where Ψ = Γ × Ξ, and Ξ =
B × V .From hereon in, for a vector x ∈ R p , we assume that x = ( x , . . . , x p ) is in the column form. Similarly, theparameter of the entire model, ψ = ( γ, β, Σ), is also a column vector, where we consider any matrix as a vectorproduced using vec( · ): the vectorization operator that stacks the columns of a matrix into a vector. For a matrix A , let m ( A ) be the modulus of the smallest eigenvalue, and M ( A ) the modulus of the largesteigenvalue. We shall restrict our study to estimate s by conditional PDFs belonging to the model class S ,which has boundedness assumptions on the softmax gating and Gaussian expert parameters. Specifically, weassume that there exists deterministic constants A γ , A β , a Σ , A Σ >
0, such that ψ ∈ e Ψ, where e Γ = (cid:26) γ ∈ Γ | ∀ k ∈ [ K ] , sup x ∈X (cid:0) | γ k | + (cid:12)(cid:12) γ ⊤ k x (cid:12)(cid:12)(cid:1) ≤ A γ (cid:27) , e Ξ = (cid:26) ξ ∈ Ξ | ∀ k ∈ [ K ] , max z ∈{ ,...,q } sup x ∈X ( | [ β k ] z | + | [ β k x ] z | ) ≤ A β , a Σ ≤ m (cid:0) Σ − k (cid:1) ≤ M (cid:0) Σ − k (cid:1) ≤ A Σ (cid:27) , e Ψ = e Γ × e Ξ . (9)Since a G := exp ( − A γ ) P Kl =1 exp ( A γ ) ≤ sup x ∈X ,γ ∈ e Γ exp (cid:0) γ k + γ ⊤ k x (cid:1)P Kl =1 exp (cid:0) γ l + γ ⊤ l x (cid:1) ≤ exp ( A γ ) P Kl =1 exp ( − A γ ) =: A G , there exists deterministic positive constants a G , A G , such that a G ≤ sup x ∈X ,γ ∈ e Γ g k ( x ; γ ) ≤ A G . (10)We wish to use the model class S of conditional PDFs to estimate s , where S = n ( x, y ) s ψ ( y | x ) (cid:12)(cid:12)(cid:12) ψ = ( γ, β, Σ) ∈ e Ψ o . (11)To simplify the proofs, we shall assume that the true density s belongs to S . That is to say, there exists ψ = ( γ , β , Σ ) ∈ e Ψ, such that s = s ψ . In maximum likelihood estimation, we consider the Kullback-Leibler information as the loss function, which isdefined for densities s and t byKL( s, t ) = (R R q ln (cid:16) s ( y ) t ( y ) (cid:17) s ( y ) dy if sdy is absolutely continuous with respect to tdy, + ∞ otherwise . ince we are working with conditional PDFs and not with classical densities, we define the following adaptedKullback-Leibler information that takes into account the structure of conditional PDFs. For fixed explanatoryvariables ( x i ) ≤ i ≤ n , we consider the average loss functionKL n ( s, t ) = 1 n n X i =1 KL ( s ( ·| x i ) , t ( · , | x i )) = 1 n n X i =1 Z R q ln (cid:18) s ( y | x i ) t ( y | x i ) (cid:19) s ( y | x i ) dy. (12)The maximum likelihood estimation approach suggests to estimate s by the conditional PDF s ψ thatmaximizes the likelihood, conditioned on ( x i ) ≤ i ≤ n , defined asln n Y i =1 s ψ ( y i | x i ) ! = n X i =1 ln ( s ψ ( y i | x i )) . Or equivalently, that minimizes the empirical contrast: − n n X i =1 ln ( s ψ ( y i | x i )) . However, since we want to handle high-dimensional data, we have to regularize the maximum likelihood estima-tor (MLE) in order to obtain reasonable estimates. Here, we shall consider l -regularization and the associatedso-called Lasso estimator, which is the following l -norm penalized MLE: b s Lasso ( λ ) := argmin s ψ ∈ S ( − n n X i =1 ln ( s ψ ( y i | x i )) + pen λ ( ψ ) ) , (13)where λ ≥ ψ = ( γ, β, Σ) andpen λ ( ψ ) = λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) := λ (cid:16)(cid:13)(cid:13)(cid:13) ψ [1] (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) ψ [2] (cid:13)(cid:13)(cid:13) (cid:17) , (14) (cid:13)(cid:13)(cid:13) ψ [1] (cid:13)(cid:13)(cid:13) = k γ k = K X k =1 p X j =1 | γ kj | , (15) (cid:13)(cid:13)(cid:13) ψ [2] (cid:13)(cid:13)(cid:13) = k vec( β ) k = K X k =1 p X j =1 q X z =1 (cid:12)(cid:12)(cid:12) [ β k ] z,j (cid:12)(cid:12)(cid:12) . (16)From now on, we denote k β k p ( p ∈ { , , ∞} ) by the induced p -norm of a matrix; see Definition A.1, whichdiffers from k vec( β ) k p .Note that pen λ ( ψ ) is a Lasso regularization term encouraging sparsity for both the gating and expertparameters. Recall that this penalty is also studied in Khalili (2010), Chamroukhi & Huynh (2018), andChamroukhi & Huynh (2019), in which the authors studied the univariate case: Y ∈ R . Notice that, withoutconsidering the l -norm, the penalty function considered in (5) belongs to our framework and the l -oracle in-equality from Theorem 3.1 can be obtained for it. 
Indeed, by considering λ = min n λ [1]1 , . . . , λ [1] K , λ [2]1 , . . . , λ [2] K , λ [3] o ,the condition for a regularization parameter’s lower bound, (17) from Theorem 3.1, can also be applied to model(3), which leads to an l -oracle inequality. l -oracle inequality for the Lasso estimator In this section, we state Theorem 3.1, which is proved in Section 4.3. This result provides an l -oracle inequalityfor the Lasso estimator for mixtures of Gaussian experts regression models with soft-max gating functions. It isthe primary contribution of this article and is motivated by the problem studied in Meynet (2013) and Devijver(2015). Theorem 3.1 ( l -oracle inequality) . We observe (( x i , y i )) i ∈ [ n ] ∈ ([0 , p × R q ) , coming from the unknownconditional mixture of Gaussian experts regression models s := s ψ ∈ S , cf. (11) . We define the Lassoestimator b s Lasso ( λ ) , by (13) , where λ ≥ is a regularization parameter to be tuned. Then, if λ ≥ κ KB ′ n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) , (17) B ′ n = max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) , (18) or some absolute constants κ ≥ , the estimator b s Lasso ( λ ) satisfies the following l -oracle inequality: E (cid:2) KL n (cid:0) s , b s Lasso ( λ ) (cid:1)(cid:3) ≤ (cid:0) κ − (cid:1) inf s ψ ∈ S (cid:16) KL n ( s , s ψ ) + λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) (cid:17) + λ + r Kn e q − π q/ A q/ p qA γ + 302 q r Kn max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) × K (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! . (19) Remark 3.1.
Theorem 3.1 provide information about the performance of the Lasso as an l regularizationestimator for mixtures of Gaussian experts regression models. If the regularization parameter λ is properlychosen, the Lasso estimator, which is the solution of the l -penalized empirical risk minimization problem,behaves as well as the deterministic Lasso, which is the solution of the l -penalized true risk minimizationproblem, up to an error term of order λ .of observations n is fixed while the number of covariates p can grow with respect to n , and in fact can bemuch larger than n . The number of components K in the MoE model is fixed.As in Devijver (2015), we suppose that the regressors belong to X = [0 , p , for simplicity. However, thearguments in our proof are valid for covariates of any scale.To the best of our knowledge, we are the first to prove the non-asymptotic l -oracle inequality of Theorem3.1, for the mixture of Gaussian experts regression models with l -regularization. Note that by extending thetheoretical developments for mixture of linear regression models in Khalili & Chen (2007), a standard asymptotictheory for MoE models is established in Khalili (2010). Therefore, our non-asymptotic result in Theorem 3.1can be considered as a complementary result to such asymptotic results for MoE models with soft-max gatingfunctions. Motivated by the idea from Meynet (2013) and Devijver (2015), we study the Lasso as the solution of a penalizedmaximum likelihood model selection procedure over countable collections of models in an l -ball. Then Theorem3.1 is an immediate consequence of Theorem 4.1, stated below, which is an l -ball MoE regression model selectiontheorem for l -penalized maximum conditional likelihood estimation, in the Gaussian mixture framework. Theorem 4.1.
Assume that we observe (( x i , y i )) i ∈ [ n ] with unknown conditional Gaussian mixture PDF s . Forall m ∈ N ⋆ , consider the l -ball S m = n s ψ ∈ S, (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) ≤ m o (20) where, (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) ψ [1] (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) ψ [2] (cid:13)(cid:13)(cid:13) , (cid:13)(cid:13)(cid:13) ψ [1] (cid:13)(cid:13)(cid:13) = k γ k = K X k =1 p X j =1 | γ kj | , (cid:13)(cid:13)(cid:13) ψ [2] (cid:13)(cid:13)(cid:13) = k vec( β ) k = K X k =1 p X j =1 q X z =1 (cid:12)(cid:12)(cid:12) [ β k ] z,j (cid:12)(cid:12)(cid:12) , and let b s m be a η m - ln -likelihood minimizer in S m for some η m ≥ : − n n X i =1 ln ( b s m ( y i | x i )) ≤ inf s m ∈ S m − n n X i =1 ln ( s m ( y i | x i )) ! + η m . (21) Assume that, for all m ∈ N ⋆ , the penalty function satisfies pen ( m ) = λm , where λ is defined later. Then, wedefine the penalized likelihood estimate b s b m , where b m is defined via the satisfaction of the inequality − n n X i =1 ln ( b s b m ( y i | x i )) + pen ( b m ) ≤ inf m ∈ N ⋆ − n n X i =1 ln ( b s m ( y i | x i )) + pen ( m ) ! + η, (22) for some η ≥ . Then, if λ ≥ κ KB ′ n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) , (23) B ′ n = max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) , (24) for some absolute constants κ ≥ , then E [KL n ( s , b s b m )] ≤ (cid:0) κ − (cid:1) inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + pen ( m ) + η m (cid:19) + η + r Kn e q − π q/ A q/ p qA γ + 302 q r Kn max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) × K (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! . (25) Remark 4.1.
Note that Theorem 3.1 is also complementary to Theorem 1 of Montuelle et al. (2014), whoalso considered the mixture of Gaussian experts regression models with soft-max gating functions. Notice thatthey focused on model selection and obtained a weak oracle inequality for the penalized MLE, while we aim tostudy the l -regularization properties of the Lasso estimators. However, we can compare their procedure withTheorem 4.1.The main reason explaining their result being considered a weak oracle inequality is that we can see thatTheorem 1 of Montuelle et al. (2014) uses difference divergence on the left (the JKL ⊗ n ρ , tensorized Jensen-Kullback-Leibler divergence), and on the right (the KL ⊗ n , tensorized Kullback-Leibler divergence). However,under a strong assumption, the two divergences are equivalent for the conditional PDFs considered. This strongassumption is nevertheless satisfied, if we assume that X is compact, as is the case of X = [0 , p in Theorem4.1, s is compactly supported, and the regression functions are uniformly bounded, and there is a uniformlower bound on the eigenvalues of the covariance matrices.To illustrate the strictness of the compactness assumption for s , we only need to consider s as a univari-ate Gaussian PDF, which obviously does not satisfy such a hypothesis. Therefore, in such case, Theorem 1in Montuelle et al. (2014) is actually weaker than Theorem 3.1, with respect to the compact support assumptionon the true conditional PDF s . On the contrary, the only assumption used to establish Theorem 4.1 is theboundedness of the parameters of the mixtures, which is also assumed in Montuelle et al. (2014, Theorem 1).Furthermore, these boundedness assumptions also appeared in Stadler et al. (2010), Meynet (2013), andDevijver (2015), and is quite usual when working with maximum likelihood estimation (Baudry, 2009, Maugis & Michel,2011), at least when considering the problem of the unboundedness of the likelihood on the boundary of the pa-rameter space (McLachlan & Peel, 2000, Redner & Walker, 1984), and to prevent the likelihood from diverging.Nevertheless, by using the smaller divergence: JKL ⊗ n ρ (or more strict assumptions on s and s m , so that thesame divergence KL ⊗ n appears on both side of the oracle inequality in Theorem 4.1), Montuelle et al. (2014,Theorem 1) obtained the faster rate of convergence of order 1 /n , while in Theorem 4.1, we only seek a rate ofconvergence of order 1 / √ n . Therefore, in cases where there are no guarantees on the strict conditions such asthe compactness of the support of s and the uniform boundedness of the regression functions, Theorem 4.1provides a theoretical foundation for the Lasso estimators with the order of convergence of 1 / √ n with only aboundedness assumption on the parameter space.Note that the constants 1 + κ − from the upper bound in Theorem 4.1 and C from Montuelle et al. (2014,Theorem 1) can not be taken to be equal to 1. This fact is consequential as when s does not belong tothe approximation class, i.e., when the model is misspecified. This problem also occurred in the l -oracleinequalities from Meynet (2013) and Devijver (2015). Deriving an oracle inequality such that 1 + κ − = 1, forthe Kullback-Leibler loss, is still an open problem. We hope to overcome this challenge in the future.Theorem 4.1 can be deduced from the two following propositions, which address the cases for large andsmall values of Y . Proposition 4.1.
Assume that we observe (( x i , y i )) i ∈ [ n ] , with unknown conditional PDF s . Let M n > andconsider the event T = (cid:26) max i =1 ,...,n k Y i k ∞ = max i =1 ,...,n max z ∈{ ,...,q } | [ Y i ] z | ≤ M n (cid:27) . For all m ∈ N ⋆ , consider the l -ball S m = n s ψ ∈ S, (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) ≤ m o and let b s m be a η m - ln -likelihood minimizer in S m , for some η m ≥ : − n n X i =1 ln ( b s m ( y i | x i )) ≤ inf s m ∈ S m − n n X i =1 ln ( s m ( y i | x i )) ! + η m . Assume that for all m ∈ N ⋆ , the penalty function satisfies pen ( m ) = λm , where λ is defined later. Then, wedefine the penalized likelihood estimate b s b m with b m defined via the inequality − n n X i =1 ln ( b s b m ( y i | x i )) + pen ( b m ) ≤ inf m ∈ N ⋆ − n n X i =1 ln ( b s m ( y i | x i )) + pen ( m ) ! + η, (26) for some η ≥ . Then, if λ ≥ κ KB n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) ,B n = max ( A Σ , KA G ) (cid:16) q √ q ( M n + A β ) A Σ (cid:17) , for some absolute constants κ ≥ , then E [KL n ( s , b s b m ) T ] ≤ (cid:0) κ − (cid:1) inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + pen ( m ) + η m (cid:19) + 302 K / qB n √ n (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! + η. (27) Proposition 4.2.
Consider s , T , and b s m as defined in Proposition 4.1. Denote by T C the complement of T , i.e., T C = (cid:26) max i =1 ,...,n k Y i k ∞ = max i =1 ,...,n max z ∈{ ,...,q } | [ Y i ] z | > M n (cid:27) . Then, E [KL n ( s , b s b m ) T C ] ≤ e q/ − π q/ A q/ p KnqA γ e − M n − MnAβ A Σ . Theorem 4.1, and Propositions 4.1 and 4.2 are proved in the Sections 4.4, 4.5 and 4.6, respectively.
We first introduce some definitions and notations that we shall use in the proofs. For any measurable function f : R → R , consider its empirical norm k f k n := vuut n n X i =1 f ( y i | x i ) , and its conditional expectation E X [ f ] = E [ f ( Y | X ) | X = x ] = Z R f ( y | x ) s ( y | x ) dy, as well as its empirical process P n ( f ) := 1 n n X i =1 f ( Y i | x i ) , (28)with expectation E X [ P n ( f )] = 1 n n X i =1 E X [ f ( Y i | x i )] = 1 n n X i =1 Z R f ( y | x i ) s ( y | x i ) dy (29)and the recentered process ν n ( f ) := P n ( f ) − E X [ P n ( f )] = 1 n n X i =1 (cid:20) f ( y i | x i ) − Z R f ( y | x i ) s ( y | x i ) dy (cid:21) . (30)For all m ∈ N ⋆ , consider the model S m = (cid:8) s ψ ∈ S, | s ψ | ≤ m (cid:9) , and define F m = (cid:26) f m = − ln (cid:18) s m s (cid:19) = ln( s ) − ln( s m ) , s m ∈ S m (cid:27) . (31)By using the basic properties of the infimum: for every ǫ >
0, there exists x ǫ ∈ A , such that x ǫ < inf A + ǫ .Then let δ KL > m ∈ N ⋆ , and let η m ≥
0. It holds that there exist two functions b s m and s m in S m , suchthat P n ( − ln b s m ) ≤ inf s m ∈ S m P n ( − ln s m ) + η m , and (32)KL n ( s , s m ) ≤ inf s m ∈ S m KL n ( s , s m ) + δ KL . (33)Define b f m := − ln (cid:18) b s m s (cid:19) , and f m := − ln (cid:18) s m s (cid:19) . (34)Let η ≥ m ∈ N ⋆ . Further, define M ( m ) = { m ′ ∈ N ⋆ | P n ( − ln b s m ′ ) + pen( m ′ ) ≤ P n ( − ln b s m ) + pen( m ) + η } . (35) Let λ > b m to be the smallest integer such that b s Lasso ( λ ) belongs to S b m , i.e., b m := (cid:6)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13) (cid:7) ≤ (cid:13)(cid:13) ψ [1 , (cid:13)(cid:13) + 1. Then using the definition of b m , (13), (20), and S = S m ∈ N ⋆ S m , we get − n n X i =1 ln (cid:0)b s Lasso ( λ ) ( y i | x i ) (cid:1) + λ b m ≤ − n n X i =1 ln (cid:0)b s Lasso ( λ ) ( y i | x i ) (cid:1) + λ (cid:16)(cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) + 1 (cid:17) = inf s ψ ∈ S − n n X i =1 ln ( s ψ ( y i | x i )) + λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) ! + λ = inf m ∈ N ⋆ inf s ψ ∈ S m − n n X i =1 ln ( s ψ ( y i | x i )) + λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) !! + λ = inf m ∈ N ⋆ inf s ψ ∈ S, k ψ [1 , k ≤ m − n n X i =1 ln ( s ψ ( y i | x i )) + λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) ! + λ ≤ inf m ∈ N ⋆ inf s m ∈ S m − n n X i =1 ln ( s m ( y i | x i )) + λm !! + λ, which implies − n n X i =1 ln (cid:0)b s Lasso ( λ ) ( y i | x i ) (cid:1) + pen( b m ) ≤ inf m ∈ N ⋆ − n n X i =1 ln ( b s m ( y i | x i )) + pen( m ) ! + η with pen( m ) = λm, η = λ , and b s m is a η m -ln-likelihood minimizer in S m , with η m ≥ b s Lasso ( λ ) satisfies (22) with b s Lasso ( λ ) ≡ b s b m , i.e., − n n X i =1 ln ( b s b m ( y i | x i )) + pen( b m ) ≤ inf m ∈ N ⋆ − n n X i =1 ln ( b s m ( y i | x i )) + pen( m ) ! + η. (36)Given κ ≥ E (cid:2) KL n (cid:0) s , b s Lasso ( λ ) (cid:1)(cid:3) ≤ (cid:0) κ − (cid:1) inf s ψ ∈ S (cid:16) KL n ( s , s ψ ) + λ (cid:13)(cid:13)(cid:13) ψ [1 , (cid:13)(cid:13)(cid:13) (cid:17) + λ + r Kn e q − π q/ A q/ p qA γ + 302 q r Kn max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) × K (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! , as required. Let M n > κ ≥ m ∈ N ⋆ , the penalty function satisfies pen( m ) = λm , with λ ≥ κ KB n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) . (37)We derive, from Propositions 4.1 and 4.2, that any penalized likelihood estimate b s b m with b m , satisfying − n n X i =1 ln ( b s b m ( y i | x i )) + pen( b m ) ≤ inf m ∈ N ⋆ − n n X i =1 ln ( b s m ( y i | x i )) + pen( m ) ! + η, for some η ≥
0, yields E [KL n ( s , b s b m )]= E [KL n ( s , b s b m ) T ] + E [KL n ( s , b s b m ) T c ] ≤ (cid:0) κ − (cid:1) inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + pen( m ) + η m (cid:19) + 302 K / qB n √ n (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! + η + e q/ − π q/ A q/ p KnqA γ e − M n − MnAβ A Σ . (38)To obtain inequality (25), it only remains to optimize the inequality (38), with respect M n . Since the twoterms depending on M n , in (38), have opposite monotonicity with respect to M n , we are looking for a valueof M n such that these two terms are the same order with respect to n . Consider the positive solution M n = A β + q A β + 4 A Σ ln n of the equation X ( X − A β )4 A Σ − ln n = 0. Then, on the one hand, e − M n − MnAβ A Σ √ n = e − ln n √ n = 1 √ n . On the other hand, using the inequality ( a + b ) ≤ a + b ), we have B n = max ( A Σ , KA G ) (cid:16) q √ q ( M n + A β ) A Σ (cid:17) = max ( A Σ , KA G ) (cid:18) q √ qA Σ (cid:16) A β + q A β + 4 A Σ ln n (cid:17) (cid:19) ≤ max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) , hence (38) implies (25). Indeed, it hold that E [KL n ( s , b s b m )] ≤ (cid:0) κ − (cid:1) inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + pen( m ) + η m (cid:19) + η + r Kn e q − π q/ A q/ p qA γ + 302 q r Kn max ( A Σ , KA G ) (cid:0) q √ qA Σ (cid:0) A β + 4 A Σ ln n (cid:1)(cid:1) × K (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! . (39) For every m ′ ∈ M ( m ), from (35), (34), and (32), we obtain P n (cid:16) b f m ′ (cid:17) + pen( m ′ ) = P n (ln( s ) − ln ( b s m ′ )) + pen( m ′ ) (using (34)) ≤ P n (ln( s ) − ln ( b s m )) + pen( m ) + η (using (35)) ≤ P n (ln( s ) − ln ( s m )) + η m + pen( m ) + η (using (32))= P n (cid:0) f m (cid:1) + pen( m ) + η m + η (using (34)) , which implies that E X h P n (cid:16) b f m ′ (cid:17)i + pen( m ′ ) ≤ E X (cid:2) P n (cid:0) f m (cid:1)(cid:3) + pen( m ) + ν n (cid:0) f m (cid:1) − ν n (cid:16) b f m ′ (cid:17) + η + η m . Taking into account (12) and (28), we obtainKL n ( s , b s m ′ ) = 1 n n X i =1 Z R ln (cid:18) s ( y | x i ) b s m ′ ( y | x i ) (cid:19) s ( y | x i ) dy = 1 n n X i =1 Z R b f m ′ ( y | x i ) s ( y | x i ) dy (using (34))= 1 n n X i =1 E X h b f m ′ ( y i | x i ) i = E X h P n (cid:16) b f m ′ (cid:17)i (using (28)) . Similarly, we also obtain KL n ( s , s m ) = E X (cid:2) P n (cid:0) f m (cid:1)(cid:3) . Hence (33) implies thatKL n ( s , b s m ′ ) + pen( m ′ ) ≤ KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) − ν n (cid:16) b f m ′ (cid:17) + η + η m ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) − ν n (cid:16) b f m ′ (cid:17) + η m + δ KL + η. (40)All that remains is to control the deviation of − ν n (cid:16) b f m ′ (cid:17) = ν n (cid:16) − b f m ′ (cid:17) . To handle the randomness of b f m ′ , weshall control the deviation of sup f m ′ ∈ F m ′ ν n ( − f m ′ ), since b f m ′ ∈ F m ′ . Such control is provided by Lemma 4.1. Control of deviation
Lemma 4.1.
Let M n > . Consider the event T = (cid:26) max i =1 ,...,n k Y i k ∞ = max i =1 ,...,n max z ∈{ ,...,q } | [ Y i ] z | ≤ M n (cid:27) , and set B n = max ( A Σ , KA G ) (cid:16) q √ q ( M n + A β ) A Σ (cid:17) , and (41)∆ m ′ = m ′ p ln(2 p + 1) ln n + 2 √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) . (42) Then, on the event T , for all m ′ ∈ N ⋆ , and for all t > , with P X -probability greater than − e − t , sup f m ′ ∈ F m ′ | ν n ( − f m ′ ) | ≤ KB n √ n (cid:20) q ∆ m ′ + √ (cid:18) A γ + qA β + q √ qa Σ (cid:19) √ t (cid:21) . (43) Proof.
The proof appears in Section 5.1.From (40) and (43), we derive that on the event T , for all m ∈ N ⋆ , m ′ ∈ M ( m ), and t >
0, with P X -probability larger than 1 − e − t ,KL n ( s , b s m ′ ) + pen( m ′ ) ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) − ν n (cid:16) b f m ′ (cid:17) + η m + δ KL + η. ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) + η m + δ KL + η + 4 KB n √ n (cid:20) q ∆ m ′ + √ (cid:18) A γ + qA β + q √ qa Σ (cid:19) √ t (cid:21) ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) + η m + δ KL + η + 4 KB n √ n " q ∆ m ′ + 12 (cid:18) A γ + qA β + q √ qa Σ (cid:19) + t , (44)where we get the last inequality using the fact that 2 ab ≤ a + b for b = √ t , and a = (cid:16) A γ + qA β + q √ qa Σ (cid:17) / √ m ∈ N ⋆ and m ′ ∈ M ( m ). To getan inequality valid on a set of high probability, we need to adequately choose the value of the parameter t ,depending on m ∈ N ⋆ and m ′ ∈ M ( m ). Let z >
0, for all m ∈ N ⋆ and m ′ ∈ M ( m ), and apply (44) to obtain t = z + m + m ′ . Then, on the event T , for all m ∈ N ⋆ and m ′ ∈ M ( m ), with P X -probability larger than1 − e − ( z + m + m ′ ),KL n ( s , b s m ′ ) + pen( m ′ ) ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + ν n (cid:0) f m (cid:1) + η m + δ KL + η + 4 KB n √ n " q ∆ m ′ + 12 (cid:18) A γ + qA β + q √ qa Σ (cid:19) + ( z + m + m ′ ) , (45)KL n ( s , b s m ′ ) − ν n (cid:0) f m (cid:1) ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:20) pen( m ) + 4 KB n √ n m (cid:21) + η m + δ KL + η + (cid:20) KB n √ n (37 q ∆ m ′ + m ′ ) − pen( m ′ ) (cid:21) + 4 KB n √ n " (cid:18) A γ + qA β + q √ qa Σ (cid:19) + z . (46)Taking into account (42), we getKL n ( s , b s m ′ ) − ν n (cid:0) f m (cid:1) ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:20) pen( m ) + 4 KB n √ n m (cid:21) + η m + δ KL + η + (cid:20) KB n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) m ′ − pen( m ′ ) (cid:21) + 4 KB n √ n " (cid:18) A γ + qA β + q √ qa Σ (cid:19) + 74 q √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) + z . (47) Now, let κ ≥ m ) = λm , for all m ∈ N ⋆ with λ ≥ κ KB n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17) . Then, (47) impliesKL n ( s , b s m ′ ) − ν n (cid:0) f m (cid:1) ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:20) λm + 4 KB n √ n m (cid:21) + η m + δ KL + η + KB n √ n (cid:16) q ln n p ln(2 p + 1) + 1 (cid:17)| {z } ≤ λκ − m ′ − λm ′ + 4 KB n √ n " (cid:18) A γ + qA β + q √ qa Σ (cid:19) + 74 q √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) + z ≤ inf s m ∈ S m KL n ( s , s m ) + pen( m ) + 4 KB n √ n m | {z } ≤ κ − pen( m ) + η m + δ KL + η + (cid:2) λκ − m ′ − λm ′ (cid:3)| {z } ≤ + 4 KB n √ n " (cid:18) A γ + qA β + q √ qa Σ (cid:19) + 74 q √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) + z ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m + δ KL + η + 4 KB n √ n " (cid:18) A γ + qA β + q √ qa Σ (cid:19) + 74 q √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) + z . Next, using the inequality 2 ab ≤ β − a + β − b for a = √ K , b = K (cid:16) A γ + qA β + q √ qa Σ (cid:17) , and β = √ K , and thefact that K ≤ K / , for all K ∈ N ⋆ , it follows thatKL n ( s , b s m ′ ) − ν n (cid:0) f m (cid:1) ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m + δ KL + η + 4 B n √ n " qK / (cid:18) A γ + qA β + q √ qa Σ (cid:19) + 74 q √ KK (cid:18) A γ + qA β + q √ qa Σ (cid:19)| {z } q × ab + Kz ≤ inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m + δ KL + η + 4 B n √ n " qK / + 75 qK / (cid:18) A γ + qA β + q √ qa Σ (cid:19) + Kz . (48)By (26) and (35), b m belongs to M ( m ), for all m ∈ N ⋆ , so we deduce from (48) that on the event T , for all z >
0, with P X -probability greater than 1 − e − z ,KL n ( s , b s b m ) − ν n (cid:0) f m (cid:1) ≤ inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m (cid:19) + η + δ KL + 4 B n √ n " qK / + 75 qK / (cid:18) A γ + qA β + q √ qa Σ (cid:19) + Kz . (49)By integrating (49) over z >
0, using the fact that for any non-negative random variable Z and any a > , E [ Z ] = a R z ≥ P ( Z > az ) dz . Then, note that E (cid:2) ν n (cid:0) f m (cid:1)(cid:3) = 0, and that δ KL > small, we obtain that E [KL n ( s , b s b m ) T ] ≤ inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m (cid:19) + η + 4 B n √ n " qK / + 75 qK / (cid:18) A γ + qA β + q √ qa Σ (cid:19) + K ≤ inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m (cid:19) + η + 4 B n √ n " qK / + 75 qK / (cid:18) A γ + qA β + q √ qa Σ (cid:19) + qK / ≤ inf m ∈ N ⋆ (cid:18) inf s m ∈ S m KL n ( s , s m ) + (cid:0) κ − (cid:1) pen( m ) + η m (cid:19) + η + 302 K / qB n √ n (cid:18) A γ + qA β + q √ qa Σ (cid:19) ! . (50) By the Cauchy-Schwarz inequality, E [KL n ( s , b s b m ) T C ] ≤ q E (cid:2) KL n ( s , b s b m ) (cid:3)q P ( T C ) . (51)We seek to bound the two terms of the right-hand side of (51).For the first term, let us bound KL ( s ( ·| x ) , s ψ ( ·| x )), for all s ψ ∈ S and x ∈ X . Let s ψ ∈ S and x ∈ X .Since s is a density, s is bounded by 1 and thusKL ( s ( ·| x ) , s ψ ( ·| x )) = Z R q ln (cid:18) s ( y | x ) s ψ ( y | x ) (cid:19) s ( y | x ) dy = Z R q ln ( s ( y | x )) s ( y | x ) dy − Z R q ln ( s ψ ( y | x )) s ( y | x ) dy ≤ − Z R q ln ( s ψ ( y | x )) s ( y | x ) dy (cid:18) since Z R q ln ( s ( y | x )) s ( y | x ) dy ≤ (cid:19) . (52) Thus, for all y ∈ R q ,ln ( s ψ ( y | x )) s ( y | x )= ln " K X k =1 g k ( x ; γ )(2 π ) q/ det(Σ k ) / exp − ( y − ( β k + β k x )) ⊤ Σ − k ( y − ( β k + β k x ))2 ! × K X k =1 g ,k ( x ; γ )(2 π ) q/ det(Σ ,k ) / exp − ( y − ( β ,k + β ,k x )) ⊤ Σ − ,k ( y − ( β ,k + β ,k x ))2 ! ≥ ln " K X k =1 a G det(Σ − k ) / (2 π ) q/ exp (cid:16) − (cid:16) y ⊤ Σ − k y + ( β k + β k x ) ⊤ Σ − k β k x ( β k + β k x ) (cid:17)(cid:17) × K X k =1 a G det(Σ − ,k ) / (2 π ) q/ exp (cid:16) − (cid:16) y ⊤ Σ − ,k y + ( β ,k + β ,k x ) ⊤ Σ − ,k ( β ,k + β ,k x ) (cid:17)(cid:17)(cid:0) using (10) and − ( a − b ) ⊤ A ( a − b ) / ≥ − ( a ⊤ Aa + b ⊤ Ab ), e.g., a = y, b = β k + β k x , A = Σ k (cid:1) ≥ ln " K X k =1 a G a q/ (2 π ) q/ exp (cid:16) − (cid:16) y ⊤ Σ − k y + ( β k + β k x ) ⊤ Σ − k β k x ( β k + β k x ) (cid:17)(cid:17) × K X k =1 a G a q/ (2 π ) q/ exp (cid:16) − (cid:16) y ⊤ Σ − ,k y + ( β ,k + β ,k x ) ⊤ Σ − ,k ( β ,k + β ,k x ) (cid:17)(cid:17) (using (9)) ≥ ln " K a G a q/ (2 π ) q/ exp (cid:0) − (cid:0) y ⊤ y + qA β (cid:1) A Σ (cid:1) × K a G a q/ (2 π ) q/ exp (cid:0) − (cid:0) y ⊤ y + qA β (cid:1) A Σ (cid:1) (using (9)) , (53)where, in the last inequality, we use the fact that for all u ∈ R q . By using the eigenvalue decomposition ofΣ = P ⊤ DP , (cid:12)(cid:12) u ⊤ Σ u (cid:12)(cid:12) = (cid:12)(cid:12) u ⊤ P ⊤ DP u (cid:12)(cid:12) ≤ k
P u k ≤ M ( D ) k P u k ≤ A Σ k u k ≤ A Σ q k u k ∞ , where in the last inequality, we used the fact that (79). Therefore, setting u = √ A Σ y and h ( t ) = t ln t , for all t ∈ R , and noticing that h ( t ) ≥ h (cid:0) e − (cid:1) = − e − , for all t ∈ R , and from (52) and (53), we get thatKL ( s ( ·| x ) , s ψ ( ·| x )) ≤ − Z R q " ln " K a γ a q/ (2 π ) q/ exp (cid:0) − (cid:0) y ⊤ y + qA β (cid:1) A Σ (cid:1) K a γ a q/ (2 π ) q/ exp (cid:0) − (cid:0) y ⊤ y + qA β (cid:1) A Σ (cid:1)! dy = − Ka γ a q/ e − qA β A Σ (2 A Σ ) q/ Z R q " ln K a γ a q/ (2 π ) q/ ! − qA β A Σ − u ⊤ u e − u ⊤ u (2 π ) q/ du = − Ka γ a q/ e − qA β A Σ (2 A Σ ) q/ E U "" ln K a γ a q/ (2 π ) q/ ! − qA β A Σ − U ⊤ U (with U ∼ N q (0 , I q ))= − Ka γ a q/ e − qA β A Σ (2 A Σ ) q/ " ln K a γ a q/ (2 π ) q/ ! − qA β A Σ − q = − Ka γ a q/ e − qA β A Σ − q (2 π ) q/ ( A Σ ) q/ e q/ π q/ ln Ka γ a q/ e − qA β A Σ − q (2 π ) q/ ! ≤ e q/ − π q/ A q/ , (54)where we used the fact that t ln( t ) ≥ − e − , for all t ∈ R . Then, for all s ψ ∈ S ,KL n ( s , s ψ ) = 1 n n X i =1 KL ( s ( ·| x i ) , s ψ ( . | x i )) ≤ e q/ − π q/ A q/ , and note that b s b m ∈ S , and thus q E (cid:2) KL n ( s , b s b m ) (cid:3) ≤ e q/ − π q/ A q/ . (55) We now provide an upper bound for P (cid:0) T C (cid:1) : P (cid:0) T C (cid:1) = E [ T C ] = E [ E X [ T C ]] = E (cid:2) P X (cid:0) T C (cid:1)(cid:3) ≤ E " n X i =1 P X ( k Y i k ∞ > M n ) . (56)For all i ∈ [ n ], Y i | x i ∼ K X k =1 g k ( x i ; γ ) N q ( β k + β k x i , Σ k ) , so we see from (56) that we need to provide an upper bound on P ( | Y x | > M n ), with Y x ∼ K X k =1 g k ( x ; γ ) N q ( β k + β k x, Σ k ) , x ∈ X . First, using Chernoff’s inequality for a centered Gaussian variable (see Lemma A.5), and the fact that ψ belongsto the bounded space e Ψ (defined by (9)), and that P Kk =1 g k ( x ; γ ) = 1, we get P ( k Y x k ∞ > M n )= Z { k y k ∞ >M n } K X k =1 g k ( x ; γ )(2 π ) q/ det(Σ k ) / exp − ( y − ( β k + β k x )) ⊤ Σ − k ( y − ( β k + β k x ))2 ! dy = K X k =1 g k ( x ; γ )(2 π ) q/ det(Σ k ) / Z { k y k ∞ >M n } exp − ( y − ( β k + β k x )) ⊤ Σ − k ( y − ( β k + β k x ))2 ! dy = K X k =1 g k ( x ; γ ) P (cid:0) k Y x,k k ∞ > M n (cid:1) ≤ K X k =1 g k ( x ; γ ) q X z =1 P (cid:0)(cid:12)(cid:12) [ Y x,k ] z (cid:12)(cid:12) > M n (cid:1) = K X k =1 g k ( x ; γ ) q X z =1 (cid:0) P (cid:0) [ Y x,k ] z < − M n (cid:1) + P (cid:0) [ Y x,k ] z > M n (cid:1)(cid:1) = K X k =1 g k ( x ; γ ) q X z =1 P U > M n − [ β k + β k x ] z [Σ k ] / z,z ! + P U < − M n − [ β k + β k x ] z [Σ k ] / z,z !! = K X k =1 g k ( x ; γ ) q X z =1 P U > M n − [ β k + β k x ] z [Σ k ] / z,z ! + P U > M n + [ β k + β k x ] z [Σ k ] / z,z !! ≤ K X k =1 g k ( x ; γ ) q X z =1 e − Mn − [ βk βkx ] z [ Σ k ] / z,z ! + e − Mn + [ βk βkx ] z [ Σ k ] / z,z ! (using Lemma A.5, (90)) ≤ K X k =1 g k ( x ; γ ) q X z =1 e − Mn − | [ βk βkx ] z | [ Σ k ] / z,z ! = 2 K X k =1 g k ( x ; γ ) q X z =1 e − M n − Mn | [ βk βkx ] z | + | [ βk βkx ] | z [ Σ k ] z,z ≤ K X k =1 g k ( x ; γ ) q X z =1 e − M n − Mn | [ βk βkx ] z | + | [ βk βkx ] | z [ Σ k ] z,z ≤ KA γ qe − M n − MnAβ A Σ , (57)where Y x,k ∼ N q ( β k + β k x, Σ k ) ,Y x,k ∼ N (cid:16) [ β k + β k x ] z , [Σ k ] z,z (cid:17) , and U = [ Y x,k ] z − [ βx ] z [Σ k ] / z,z ∼ N (0 , , and using the facts that e − | [ βk βkx ] | zA Σ ≤ ≤ z ≤ q (cid:12)(cid:12)(cid:12) [Σ k ] z,z (cid:12)(cid:12)(cid:12) ≤ k Σ k k = M (Σ k ) = m (cid:0) Σ − k (cid:1) ≤ A Σ . 
Wederive from (56) and (57) that P ( T c ) ≤ KnqA γ e − M n − MnAβ A Σ , (58) nd finally from (51), (55), and (58), we obtain E [KL n ( s , b s b m ) T C ] ≤ e q/ − π q/ A q/ p KnqA γ e − M n − MnAβ A Σ . (59) First, we give some tools to prove Lemma 4.1. Recall that k f k n = vuut n n X i =1 g ( y i | x i ) , for any measurable function g .Let m ∈ N ⋆ , we have sup f m ∈ F m | ν n ( − f m ) | = sup f m ∈ F m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( f m ( Y i | x i ) − E [ f m ( Y i | x i )]) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . (60)To control the deviation of (60), we shall use concentration and symmetrization arguments. We shall first usethe following concentration inequality, which can be found in Boucheron et al. (2013). Lemma 5.1 (See Boucheron et al., 2013) . Let Z , . . . , Z n be independent random variables with values in somespace Z and let F be a class of real-valued functions on Z . Assume that there exists R n , a non-random constant,such that sup f ∈F k f k n ≤ R n . Then, for all t > , P sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 [ f ( Z i ) − E [ f ( Z i )]] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > E " sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 [ f ( Z i ) − E [ f ( Z i )]] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 2 √ R n r tn ! ≤ e − t . (61)Then, we propose to bound E (cid:2) sup f ∈F (cid:12)(cid:12) n P ni =1 [ f ( Z i ) − E [ f ( Z i )]] (cid:12)(cid:12)(cid:3) due to the following symmetrizationargument. The proof of this result can be found in Van Der Vaart & Wellner (1996). Lemma 5.2 (See Lemma 2.3.6 in Van Der Vaart & Wellner, 1996) . Let Z , . . . , Z n be independent randomvariables with values in some space Z and let F be a class of real-valued functions on Z . Let ( ǫ , . . . , ǫ n ) be aRademacher sequence independent of ( Z , . . . , Z n ) . Then, E " sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 [ f ( Z i ) − E [ f ( Z i )]] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ E " sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ǫ i f ( Z i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . (62)From (62), the problem is to provide an upper bound on E " sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ǫ i f ( Z i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . To do so, we shall apply the following lemma, which is adapted from Lemma 6.1 in Massart (2007).
Lemma 5.3 (See Lemma 6.1 in Massart, 2007) . Let Z , . . . , Z n be independent random variables with valuesin some space Z and let F be a class of real-valued functions on Z . Let ( ǫ , . . . , ǫ n ) be a Rademacher sequence,independent of ( Z , . . . , Z n ) . Define R n , a non-random constant, such that sup f ∈F k f k n ≤ R n . (63) Then, for all S ∈ N ⋆ , E " sup f ∈F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ǫ i f ( Z i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ R n √ n S X s =1 − s q ln [1 + M (2 − s R n , F , k . k n )] + 2 − S ! , (64) where M ( δ, F , k . k n ) stands for the δ -packing number (see Definition A.2) of the set of functions F , equippedwith the metric induced by the norm k·k n . In our case, from (60), we apply a conditional version of Lemmas 5.1–5.3 to F = F m , ( Z , . . . , Z n ) =( Y | x , . . . , Y n | x n ), and f ( Z i ) = f m ( Y i | x i ), so as to control sup f m ∈ F m | ν n ( − f m ) | . On the one hand, we see from(63) that we need an upper bound of sup f m ∈ F m k f m k n . On the other hand, we see from (64) that we need tobound the entropy of the set of functions F m , equipped with the metric induced by the norm k·k n . Such boundsare provided by the two following lemmas.Let M n > T = (cid:26) max i =1 ,...,n k Y i k ∞ = max i =1 ,...,n max z ∈{ ,...,q } | [ Y i ] z | ≤ M n (cid:27) , and put B n = max ( A Σ , KA G ) (cid:16) q √ q ( M n + A β ) A Σ (cid:17) . Lemma 5.4.
On the event T , for all m ∈ N ⋆ , sup f m ∈ F m k f m k n T ≤ KB n (cid:18) A γ + qA β + q √ qa Σ (cid:19) =: R n . (65) Proof.
See Section 5.2.1.
Lemma 5.5.
Let δ > and m ∈ N ⋆ . On the event T , we have the following upper bound of the δ -packingnumber of the set of functions F m , equipped with the metric induced by the norm k·k n : M ( δ, F m , k·k n ) ≤ (2 p + 1) B nq K m δ (cid:18) B n KqA β δ (cid:19) K (cid:18) B n KA γ δ (cid:19) K (cid:18) B n Kq √ qa Σ δ (cid:19) K . Proof.
See Section 5.2.2.
Lemma 5.6 (Lemma 5.9 from Meynet, 2013) . Let δ > and ( x ij ) i =1 ,...,n ; j =1 ,...,p ∈ R np . There exists afamily B of (2 p + 1) k x k ,n /δ vectors in R p , such that for all β ∈ R p , with k β k ≤ , where k x k ,n = n P ni =1 max j ∈{ ,...,p } x ij , there exists β ′ ∈ B , such that n n X i =1 p X j =1 (cid:0) β j − β ′ j (cid:1) x ij ≤ δ . Proof.
See in the proof of Lemma 5.9 Meynet (2013).Via the upper bounds provided in Lemmas 5.4 and 5.5, we can apply Lemma 5.3 to get an upper bound on E X (cid:2) sup f m ∈F m (cid:12)(cid:12) n P ni =1 ǫ i f m ( Y i | x i ) (cid:12)(cid:12)(cid:3) . We thus obtain the following results. Lemma 5.7.
Let m ∈ N ⋆ , consider ( ǫ , . . . , ǫ n ) , a Rademacher sequence independent of ( Y , . . . , Y n ) . Then, onthe event T , E X " sup f m ∈F m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ǫ i f m ( Y i | x i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ KB n q √ n ∆ m , (66)∆ m := m p ln(2 p + 1) ln n + 2 √ K (cid:18) A γ + qA β + q √ qa Σ (cid:19) . (67) Proof.
See Section 5.2.3.Now using (66) and applying both Lemmas 5.1 and 5.2 to F = F m , ( Z , . . . , Z n ) = ( Y | x , . . . , Y n | x n ) and f ( Z i ) = f m ( Y i | x i ), we get for all m ∈ N ⋆ and t >
0, with P X -probability greater than 1 − e − t ,sup f m ∈ F m | ν n ( − f m ) | = sup f m ∈ F m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( f m ( Y i | x i ) − E X [ f m ( Y i | x i )]) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ E " sup f m ∈ F m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( f m ( Y i | x i ) − E X [ f m ( Y i | x i )]) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 2 √ R n r tn (Lemma 5.1) ≤ E " sup f m ∈ F m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ǫ i f ( Y i | x i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 2 √ R n r tn (using Lemma 5.2) ≤ KB n q √ n ∆ m + 4 √ KB n (cid:18) A γ + qA β + q √ qa Σ (cid:19) r tn (cid:18) using Lemma 5.7 and R n = 2 KB n (cid:18) A γ + qA β + q √ qa Σ (cid:19)(cid:19) ≤ KB n √ n (cid:20) q ∆ m + √ (cid:18) A γ + qA β + q √ qa Σ (cid:19) √ t (cid:21) . The proofs of Lemmas 5.4–5.5 require an upper bound on the uniform norm of the gradient of ln s ψ , for s ψ ∈ S .We begin by providing such an upper bound. Lemma 5.8.
Lemma 5.8. Given $s_\psi$, as described in (11), it holds that
$$\sup_{x\in\mathcal{X}}\sup_{\psi\in\widetilde\Psi}\Bigg\|\frac{\partial\ln(s_\psi(\cdot|x))}{\partial\psi}\Bigg\|_\infty \le G(\cdot),\qquad G:\ \mathbb{R}^q\ni y\mapsto G(y) = \max(A_\Sigma,KA_G)\big(1+q\sqrt q\,(\|y\|_\infty+A_\beta)^2A_\Sigma\big). \quad (68)$$

Proof.
Let $s_\psi\in S$, with $\psi=(\gamma,\beta,\Sigma)$. From now on, we consider any $x\in\mathcal X$, any $y\in\mathbb R^q$, and any $k\in[K]$. We can write
$$\ln(s_\psi(y|x)) = \ln\Bigg(\sum_{k=1}^Kg_k(x;\gamma)\,\phi(y;\beta_{k0}+\beta_kx,\Sigma_k)\Bigg) = \ln\Bigg(\sum_{k=1}^Kf_k(x,y)\Bigg),$$
with
$$g_k(x;\gamma) = \frac{\exp(w_k(x))}{\sum_{l=1}^K\exp(w_l(x))},\qquad w_k(x)=\gamma_{k0}+\gamma_k^\top x,$$
$$\phi(y;\beta_{k0}+\beta_kx,\Sigma_k) = \frac{1}{(2\pi)^{q/2}\det(\Sigma_k)^{1/2}}\exp\Bigg(-\frac{(y-(\beta_{k0}+\beta_kx))^\top\Sigma_k^{-1}(y-(\beta_{k0}+\beta_kx))}{2}\Bigg),$$
$$f_k(x,y) = g_k(x;\gamma)\,\phi(y;\beta_{k0}+\beta_kx,\Sigma_k) = \frac{g_k(x;\gamma)}{(2\pi)^{q/2}\det(\Sigma_k)^{1/2}}\exp\Big[-\frac12(y-(\beta_{k0}+\beta_kx))^\top\Sigma_k^{-1}(y-(\beta_{k0}+\beta_kx))\Big].$$

By using the chain rule, for all $l\in[K]$,
$$\frac{\partial\ln(s_\psi(y|x))}{\partial\gamma_{l0}} = \sum_{k=1}^K\frac{f_k(x,y)}{g_k(x;\gamma)\sum_{k=1}^Kf_k(x,y)}\,\frac{\partial g_k(x;\gamma)}{\partial w_l(x)}\underbrace{\frac{\partial w_l(x)}{\partial\gamma_{l0}}}_{=1},\quad\text{and}\quad \frac{\partial\ln(s_\psi(y|x))}{\partial(\gamma_l^\top x)} = \sum_{k=1}^K\frac{f_k(x,y)}{g_k(x;\gamma)\sum_{k=1}^Kf_k(x,y)}\,\frac{\partial g_k(x;\gamma)}{\partial w_l(x)}\underbrace{\frac{\partial w_l(x)}{\partial(\gamma_l^\top x)}}_{=1}.$$

Furthermore,
$$\frac{\partial g_k(x;\gamma)}{\partial w_l(x)} = \frac{\partial}{\partial w_l(x)}\Bigg(\frac{\exp(w_k(x))}{\sum_{l=1}^K\exp(w_l(x))}\Bigg) = \frac{\frac{\partial}{\partial w_l(x)}\exp(w_k(x))}{\sum_{l=1}^K\exp(w_l(x))} - \frac{\exp(w_k(x))}{\big(\sum_{l=1}^K\exp(w_l(x))\big)^2}\frac{\partial}{\partial w_l(x)}\sum_{i=1}^K\exp(w_i(x)) \qquad \Bigg(\text{using } \frac{\partial}{\partial x}\Big(\frac{f(x)}{g(x)}\Big) = \frac{f'(x)g(x)-g'(x)f(x)}{g(x)^2}\Bigg)$$
$$= \delta_{lk}\,\frac{\exp(w_k(x))}{\sum_{l=1}^K\exp(w_l(x))} - \frac{\exp(w_k(x))}{\sum_{l=1}^K\exp(w_l(x))}\cdot\frac{\exp(w_l(x))}{\sum_{l=1}^K\exp(w_l(x))} = g_k(x;\gamma)\big(\delta_{lk}-g_l(x;\gamma)\big),\qquad \delta_{lk} = \begin{cases}1 & l=k,\\ 0 & l\ne k.\end{cases}$$

Therefore, we obtain
$$\Bigg|\frac{\partial\ln(s_\psi(y|x))}{\partial(\gamma_l^\top x)}\Bigg| = \Bigg|\frac{\partial\ln(s_\psi(y|x))}{\partial\gamma_{l0}}\Bigg| = \Bigg|\sum_{k=1}^K\frac{f_k(x,y)}{\sum_{k=1}^Kf_k(x,y)}\big(\delta_{lk}-g_l(x;\gamma)\big)\Bigg| \le \Bigg|\sum_{k=1}^K\big(\delta_{lk}-g_l(x;\gamma)\big)\Bigg| \quad \Bigg(\text{since }\frac{f_k(x,y)}{\sum_{k=1}^Kf_k(x,y)}\le1\Bigg)$$
$$= \big|1-Kg_l(x;\gamma)\big| \le KA_G \quad (\text{using (10)}).$$
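The soft-max Jacobian identity $\partial g_k/\partial w_l = g_k(\delta_{lk}-g_l)$ used above is easy to verify numerically; a minimal sketch with a hypothetical gating vector:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())   # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(2)
w = rng.normal(size=4)
g = softmax(w)
h = 1e-6
for l in range(4):
    w_pert = w.copy(); w_pert[l] += h
    num = (softmax(w_pert) - g) / h            # numerical Jacobian column l
    ana = g * ((np.arange(4) == l) - g[l])     # g_k (delta_{lk} - g_l)
    assert np.allclose(num, ana, atol=1e-4)
print("softmax Jacobian identity verified")
```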
Similarly, by using the fact that $\psi$ belongs to the bounded space $\widetilde\Psi$ and that $f_l(x,y)/\sum_{k=1}^Kf_k(x,y)\le1$,
$$\Bigg\|\frac{\partial\ln(s_\psi(y|x))}{\partial\beta_{l0}}\Bigg\|_\infty = \Bigg\|\frac{\partial\ln(s_\psi(y|x))}{\partial(\beta_lx)}\Bigg\|_\infty = \Bigg\|\frac{f_l(x,y)}{\sum_{k=1}^Kf_k(x,y)}\,\frac{\partial}{\partial(\beta_{l0}+\beta_lx)}\Big[-\frac12(y-(\beta_{l0}+\beta_lx))^\top\Sigma_l^{-1}(y-(\beta_{l0}+\beta_lx))\Big]\Bigg\|_\infty$$
$$\le \Bigg\|\frac{\partial}{\partial(\beta_{l0}+\beta_lx)}\Big[-\frac12(y-(\beta_{l0}+\beta_lx))^\top\Sigma_l^{-1}(y-(\beta_{l0}+\beta_lx))\Big]\Bigg\|_\infty = \big\|\Sigma_l^{-1}(y-(\beta_{l0}+\beta_lx))\big\|_\infty$$
$$\le \|\Sigma_l^{-1}\|_\infty\,\|y-(\beta_{l0}+\beta_lx)\|_\infty \ (\text{using (80)}) \le \sqrt q\,\|\Sigma_l^{-1}\|_2\big(\|y\|_\infty+\|\beta_{l0}+\beta_lx\|_\infty\big) \ (\text{using (85)})$$
$$\le \sqrt q\,\lambda_{\max}\big(\Sigma_l^{-1}\big)\big(\|y\|_\infty+\|\beta_{l0}+\beta_lx\|_\infty\big) \ (\text{using (84)}) \le \sqrt q\,A_\Sigma\big(\|y\|_\infty+A_\beta\big) \ (\text{using (9)}).$$

Now, we need to calculate the gradient with respect to the covariance matrices of the Gaussian experts. To do this, we need the following result: given any $l\in[K]$ and $v_l=\beta_{l0}+\beta_lx$, it holds that
$$\frac{\partial}{\partial\Sigma_l}\phi(y;v_l,\Sigma_l) = \frac{\partial}{\partial\Sigma_l}\Bigg[(2\pi)^{-q/2}\det(\Sigma_l)^{-1/2}\exp\Bigg(-\frac{(y-v_l)^\top\Sigma_l^{-1}(y-v_l)}{2}\Bigg)\Bigg]$$
$$= \phi(y;v_l,\Sigma_l)\Bigg[-\frac12\frac{\partial}{\partial\Sigma_l}\big((y-v_l)^\top\Sigma_l^{-1}(y-v_l)\big) + \det(\Sigma_l)^{1/2}\frac{\partial}{\partial\Sigma_l}\big(\det(\Sigma_l)^{-1/2}\big)\Bigg]$$
$$= \phi(y;v_l,\Sigma_l)\Bigg[\frac12\Sigma_l^{-1}(y-v_l)(y-v_l)^\top\Sigma_l^{-1} - \frac12\det(\Sigma_l)^{-1}\det(\Sigma_l)\big(\Sigma_l^{-1}\big)^\top\Bigg] = \phi(y;v_l,\Sigma_l)\,\frac12\underbrace{\Big[\Sigma_l^{-1}(y-v_l)(y-v_l)^\top\Sigma_l^{-1}-\big(\Sigma_l^{-1}\big)^\top\Big]}_{T(y,v_l,\Sigma_l)}, \quad (69)$$
noting that
$$\frac{\partial}{\partial\Sigma_l}\big((y-v_l)^\top\Sigma_l^{-1}(y-v_l)\big) = -\Sigma_l^{-1}(y-v_l)(y-v_l)^\top\Sigma_l^{-1} \quad (\text{using Lemma A.1}), \quad (70)$$
$$\frac{\partial}{\partial\Sigma_l}\det(\Sigma_l) = \det(\Sigma_l)\big(\Sigma_l^{-1}\big)^\top \quad (\text{using the Jacobi formula, Lemma A.2}). \quad (71)$$

For any $l\in[K]$,
$$\Bigg|\frac{\partial\ln(s_\psi(y|x))}{\partial[\Sigma_l]_{z_1,z_2}}\Bigg| \le \Bigg\|\frac{\partial\ln(s_\psi(y|x))}{\partial\Sigma_l}\Bigg\|_2 \ (\text{using (84)}) = \Bigg|\frac{f_l(x,y)}{\sum_{k=1}^Kf_k(x,y)}\Bigg|\cdot\frac12\big\|\Sigma_l^{-1}(y-v_l)(y-v_l)^\top\Sigma_l^{-1}-\big(\Sigma_l^{-1}\big)^\top\big\|_2 \ (\text{using (69)})$$
$$\le \frac12\Big[A_\Sigma + \sqrt q\,\big\|(y-v_l)(y-v_l)^\top\big\|_\infty A_\Sigma^2\Big] \ (\text{using (85)}) \le \frac12\Big[A_\Sigma + q\sqrt q\,(\|y\|_\infty+A_\beta)^2A_\Sigma^2\Big] \ (\text{using (9)}),$$
where, in the last inequality, given $a=y-(\beta_{l0}+\beta_lx)$, we use the fact that
$$\|aa^\top\|_\infty = \max_{1\le i\le q}\sum_{j=1}^q\big|[aa^\top]_{i,j}\big| = \max_{1\le i\le q}\sum_{j=1}^q|a_ia_j| = \max_{1\le i\le q}|a_i|\sum_{j=1}^q|a_j| \le q\|a\|_\infty^2.$$

Thus,
$$\sup_{x\in\mathcal X}\sup_{\psi\in\widetilde\Psi}\Bigg\|\frac{\partial\ln(s_\psi(y|x))}{\partial\psi}\Bigg\|_\infty \le \max\Bigg[KA_G,\ \sqrt q(\|y\|_\infty+A_\beta)A_\Sigma,\ \frac12\big(A_\Sigma+q\sqrt q(\|y\|_\infty+A_\beta)^2A_\Sigma^2\big)\Bigg]$$
$$\le \max(A_\Sigma,KA_G)\big(1+q\sqrt q\,(\|y\|_\infty+A_\beta)^2A_\Sigma\big) =: G(y),$$
where we use the fact that
$$\sqrt q(\|y\|_\infty+A_\beta)A_\Sigma =: \theta \le 1+\theta^2 = 1+q(\|y\|_\infty+A_\beta)^2A_\Sigma^2 \le \max(A_\Sigma,KA_G)\big(1+q\sqrt q(\|y\|_\infty+A_\beta)^2A_\Sigma\big).$$

5.2.1 Proof of Lemma 5.4

Let $m\in\mathbb N^\star$ and $f_m\in\mathcal F_m$. By (31), there exists $s_m\in S_m$ such that $f_m=-\ln(s_m/s_0)$. For all $x\in\mathcal X$, let $\psi(x)=(\gamma_{k0},\gamma_k^\top x,\beta_{k0},\beta_kx,\Sigma_k)_{k\in[K]}$ be the parameters of $s_m(\cdot|x)$, and let $\psi_0(x)$ be those of $s_0(\cdot|x)$. In our case, we approximate $f(\psi)=\ln(s_\psi(y_i|x_i))$ around $\psi_0(x_i)$ by its Taylor polynomial of degree $0$. That is,
$$\Big|\underbrace{\ln(s_m(y_i|x_i))}_{\ln s_\psi(y_i|x_i)} - \ln(s_0(y_i|x_i))\Big| = |f(\psi)-f(\psi_0)| = |R_{\psi_0,0}(\psi-\psi_0)| \ (\text{defined in Lemma A.6}) \le \sup_{x\in\mathcal X}\sup_{\psi\in\widetilde\Psi}\Bigg\|\frac{\partial\ln(s_\psi(y_i|x))}{\partial\psi}\Bigg\|_\infty\,\|\psi(x_i)-\psi_0(x_i)\|_1.$$

First applying Taylor's inequality and then Lemma 5.8 on the event $\mathcal T$, for all $i\in[n]$, it holds that
$$|f_m(y_i|x_i)|\,\mathbb 1_{\mathcal T} = |\ln(s_m(y_i|x_i))-\ln(s_0(y_i|x_i))|\,\mathbb 1_{\mathcal T} \le \underbrace{\max(A_\Sigma,KA_G)\big(1+q\sqrt q(M_n+A_\beta)^2A_\Sigma\big)}_{=:B_n}\,\|\psi(x_i)-\psi_0(x_i)\|_1 \ (\text{using Lemma 5.8})$$
$$\le B_n\sum_{k=1}^K\Big(|\gamma_{k0}-\gamma_{0,k0}| + |\gamma_k^\top x_i-\gamma_{0,k}^\top x_i| + \|\beta_{k0}-\beta_{0,k0}\|_1 + \|\beta_kx_i-\beta_{0,k}x_i\|_1 + \|\mathrm{vec}(\Sigma_k-\Sigma_{0,k})\|_1\Big)$$
$$\le 2B_n\sum_{k=1}^K\Big(|\gamma_{k0}| + |\gamma_k^\top x_i| + \|\beta_{k0}\|_1 + \|\beta_kx_i\|_1 + q\|\Sigma_k\|_1\Big) \ (\text{using (82)})$$
$$\le 2KB_n\big(A_\gamma + q\|\beta_{k0}\|_\infty + q\|\beta_kx_i\|_\infty + q\sqrt q\,\|\Sigma_k\|_2\big) \ (\text{using (9), (77), (78), (86)}) \le 2KB_n\big(A_\gamma+qA_\beta+q\sqrt q\,a_\Sigma\big) \ (\text{using (9)}).$$

Therefore,
$$\sup_{f_m\in\mathcal F_m}\|f_m\|_n\,\mathbb 1_{\mathcal T} \le 2KB_n\big(A_\gamma+qA_\beta+q\sqrt q\,a_\Sigma\big) =: R_n.$$

5.2.2 Proof of Lemma 5.5

Let $m\in\mathbb N^\star$, $f^{[1]}_m\in\mathcal F_m$, and $x\in[0,1]^p$. By (31), there exists $s^{[1]}_m\in S_m$ such that $f^{[1]}_m=-\ln(s^{[1]}_m/s_0)$. Introduce the notation $s^{[2]}_m\in S$ and $f^{[2]}_m=-\ln(s^{[2]}_m/s_0)$. Let
$$\psi^{[1]}(x) = \big(\gamma^{[1]}_{k0},\gamma^{[1]\top}_kx,\beta^{[1]}_{k0},\beta^{[1]}_kx,\Sigma^{[1]}_k\big)_{k\in[K]} \quad\text{and}\quad \psi^{[2]}(x) = \big(\gamma^{[2]}_{k0},\gamma^{[2]\top}_kx,\beta^{[2]}_{k0},\beta^{[2]}_kx,\Sigma^{[2]}_k\big)_{k\in[K]}$$
be the parameters of the PDFs $s^{[1]}_m(\cdot|x)$ and $s^{[2]}_m(\cdot|x)$, respectively.
By applying Taylor's inequality and then Lemma 5.8 on the event $\mathcal T$, for all $i\in[n]$, it holds that
$$\big|f^{[1]}_m(y_i|x_i)-f^{[2]}_m(y_i|x_i)\big|\,\mathbb 1_{\mathcal T} = \big|\ln(s^{[1]}_m(y_i|x_i))-\ln(s^{[2]}_m(y_i|x_i))\big|\,\mathbb 1_{\mathcal T} \le \sup_{x\in\mathcal X}\sup_{\psi\in\widetilde\Psi}\Bigg\|\frac{\partial\ln(s_\psi(y_i|x))}{\partial\psi}\Bigg\|_\infty\,\big\|\psi^{[1]}(x_i)-\psi^{[2]}(x_i)\big\|_1\,\mathbb 1_{\mathcal T} \ (\text{using Taylor's inequality in Lemma A.6})$$
$$\le B_n\sum_{k=1}^K\Big(\big|\gamma^{[1]}_{k0}-\gamma^{[2]}_{k0}\big| + \big|\gamma^{[1]\top}_kx_i-\gamma^{[2]\top}_kx_i\big| + \big\|\beta^{[1]}_{k0}-\beta^{[2]}_{k0}\big\|_1 + \big\|\beta^{[1]}_kx_i-\beta^{[2]}_kx_i\big\|_1 + \big\|\mathrm{vec}\big(\Sigma^{[1]}_k-\Sigma^{[2]}_k\big)\big\|_1\Big) \ (\text{using Lemma 5.8}).$$

By the Cauchy–Schwarz inequality, $\big(\sum_{i=1}^ma_i\big)^2 \le m\sum_{i=1}^ma_i^2$ ($m\in\mathbb N^\star$), we get
$$\big|f^{[1]}_m(y_i|x_i)-f^{[2]}_m(y_i|x_i)\big|^2\,\mathbb 1_{\mathcal T} \le 3B_n^2\Bigg[\Bigg(\sum_{k=1}^K\big|\gamma^{[1]\top}_kx_i-\gamma^{[2]\top}_kx_i\big|\Bigg)^2 + \Bigg(\sum_{k=1}^K\sum_{z=1}^q\Big|\big[\beta^{[1]}_kx_i\big]_z-\big[\beta^{[2]}_kx_i\big]_z\Big|\Bigg)^2 + \Big(\big\|\beta^{[1]}_0-\beta^{[2]}_0\big\|_1 + \big\|\gamma^{[1]}_0-\gamma^{[2]}_0\big\|_1 + \big\|\mathrm{vec}\big(\Sigma^{[1]}-\Sigma^{[2]}\big)\big\|_1\Big)^2\Bigg]$$
$$\le 3B_n^2\Bigg[K\sum_{k=1}^K\Bigg(\sum_{j=1}^p\gamma^{[1]}_{kj}x_{ij}-\sum_{j=1}^p\gamma^{[2]}_{kj}x_{ij}\Bigg)^2 + Kq\sum_{k=1}^K\sum_{z=1}^q\Bigg(\sum_{j=1}^p\big[\beta^{[1]}_k\big]_{z,j}x_{ij}-\sum_{j=1}^p\big[\beta^{[2]}_k\big]_{z,j}x_{ij}\Bigg)^2 + \Big(\big\|\beta^{[1]}_0-\beta^{[2]}_0\big\|_1 + \big\|\gamma^{[1]}_0-\gamma^{[2]}_0\big\|_1 + \big\|\mathrm{vec}\big(\Sigma^{[1]}-\Sigma^{[2]}\big)\big\|_1\Big)^2\Bigg],$$
and
$$\big\|f^{[1]}_m-f^{[2]}_m\big\|_n^2\,\mathbb 1_{\mathcal T} = \frac1n\sum_{i=1}^n\big|f^{[1]}_m(y_i|x_i)-f^{[2]}_m(y_i|x_i)\big|^2\,\mathbb 1_{\mathcal T} \le 3B_n^2K\sum_{k=1}^K\underbrace{\frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\gamma^{[1]}_{kj}x_{ij}-\sum_{j=1}^p\gamma^{[2]}_{kj}x_{ij}\Bigg)^2}_{=:a_k}$$
$$+\ 3B_n^2Kq\sum_{k=1}^K\sum_{z=1}^q\underbrace{\frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\big[\beta^{[1]}_k\big]_{z,j}x_{ij}-\sum_{j=1}^p\big[\beta^{[2]}_k\big]_{z,j}x_{ij}\Bigg)^2}_{=:b_{k,z}} +\ 3B_n^2\Big(\big\|\beta^{[1]}_0-\beta^{[2]}_0\big\|_1 + \big\|\gamma^{[1]}_0-\gamma^{[2]}_0\big\|_1 + \big\|\mathrm{vec}\big(\Sigma^{[1]}-\Sigma^{[2]}\big)\big\|_1\Big)^2.$$

So, for all $\delta>0$, if, for all $k\in[K]$ and $z\in[q]$,
$$a_k\le\frac{\delta^2}{36B_n^2K^2},\quad b_{k,z}\le\frac{\delta^2}{36B_n^2K^2q^2},\quad \big\|\beta^{[1]}_0-\beta^{[2]}_0\big\|_1\le\frac{\delta}{18B_n},\quad \big\|\gamma^{[1]}_0-\gamma^{[2]}_0\big\|_1\le\frac{\delta}{18B_n},\quad \big\|\mathrm{vec}\big(\Sigma^{[1]}-\Sigma^{[2]}\big)\big\|_1\le\frac{\delta}{18B_n},$$
then $\big\|f^{[1]}_m-f^{[2]}_m\big\|_n^2\,\mathbb 1_{\mathcal T} \le \frac{\delta^2}{12}+\frac{\delta^2}{12}+\frac{\delta^2}{12} = \frac{\delta^2}{4}$.

To bound $a_k$ and $b_{k,z}$, we can write
$$a_k = m^2\,\frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\frac{\gamma^{[1]}_{kj}}{m}x_{ij}-\sum_{j=1}^p\frac{\gamma^{[2]}_{kj}}{m}x_{ij}\Bigg)^2,\qquad b_{k,z} = m^2\,\frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\frac{\big[\beta^{[1]}_k\big]_{z,j}}{m}x_{ij}-\sum_{j=1}^p\frac{\big[\beta^{[2]}_k\big]_{z,j}}{m}x_{ij}\Bigg)^2.$$

Then, we apply Lemma 5.6 to $\frac{\gamma^{[1]}_{k,\cdot}}{m} = \Big(\frac{\gamma^{[1]}_{kj}}{m}\Big)_{j\in[p]}$ and $\frac{[\beta^{[1]}_k]_{z,\cdot}}{m} = \Big(\frac{[\beta^{[1]}_k]_{z,j}}{m}\Big)_{j\in[p]}$, for all $k\in[K]$, $z\in[q]$. Since $s^{[1]}_m\in S_m$, and using (20), we have $\|\gamma^{[1]}_k\|_1\le m$ and $\|\mathrm{vec}(\beta^{[1]}_k)\|_1\le m$, which leads to $\sum_{j=1}^p\big|\frac{\gamma^{[1]}_{kj}}{m}\big|\le1$ and $\sum_{z=1}^q\sum_{j=1}^p\big|\frac{[\beta^{[1]}_k]_{z,j}}{m}\big|\le1$, respectively. Furthermore, given $x\in\mathcal X=[0,1]^p$, we have $\|x\|_{\infty,n}\le1$. Thus, there exist families $\mathcal A$ of $(2p+1)^{36B_n^2K^2m^2/\delta^2}$ vectors and $\mathcal B$ of $(2p+1)^{36B_n^2K^2q^2m^2/\delta^2}$ vectors of $\mathbb R^p$, such that for all $k\in[K]$, $z\in[q]$, $\gamma^{[1]}_{k,\cdot}$, and $[\beta^{[1]}_k]_{z,\cdot}$, there exist $\gamma^{[2]}_{k,\cdot}\in\mathcal A$ and $[\beta^{[2]}_k]_{z,\cdot}\in\mathcal B$, such that
$$\frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\frac{\gamma^{[1]}_{kj}}{m}x_{ij}-\sum_{j=1}^p\frac{\gamma^{[2]}_{kj}}{m}x_{ij}\Bigg)^2 \le \frac{\delta^2}{36B_n^2K^2m^2},\quad\text{and}\quad \frac1n\sum_{i=1}^n\Bigg(\sum_{j=1}^p\frac{\big[\beta^{[1]}_k\big]_{z,j}}{m}x_{ij}-\sum_{j=1}^p\frac{\big[\beta^{[2]}_k\big]_{z,j}}{m}x_{ij}\Bigg)^2 \le \frac{\delta^2}{36B_n^2K^2q^2m^2},$$
which leads to $a_k\le\frac{\delta^2}{36B_n^2K^2}$ and $b_{k,z}\le\frac{\delta^2}{36B_n^2K^2q^2}$. Moreover, (9) leads to
$$\big\|\beta^{[1]}_0\big\|_1 = \sum_{k=1}^K\big\|\beta^{[1]}_{k0}\big\|_1 \le Kq\big\|\beta^{[1]}_{k0}\big\|_\infty \le KqA_\beta \ (\text{using (77)}),\qquad \big\|\gamma^{[1]}_0\big\|_1 = \sum_{k=1}^K\big|\gamma^{[1]}_{k0}\big| \le KA_\gamma,\qquad \big\|\mathrm{vec}\big(\Sigma^{[1]}\big)\big\|_1 = \sum_{k=1}^K\big\|\mathrm{vec}\big(\Sigma^{[1]}_k\big)\big\|_1 \le Kq\sqrt q\,a_\Sigma.$$

Therefore, on the event $\mathcal T$, writing $\mathcal B_d(R)$ for the $l_1$-ball of radius $R$ in $\mathbb R^d$,
$$M(\delta,\mathcal F_m,\|\cdot\|_n) \le N(\delta/2,\mathcal F_m,\|\cdot\|_n) \ (\text{using Lemma A.4})$$
$$\le \mathrm{card}(\mathcal A)^K\,\mathrm{card}(\mathcal B)^{Kq}\,N\Big(\frac{\delta}{18B_n},\mathcal B_{Kq}(KqA_\beta),\|\cdot\|_1\Big)\,N\Big(\frac{\delta}{18B_n},\mathcal B_{K}(KA_\gamma),\|\cdot\|_1\Big)\,N\Big(\frac{\delta}{18B_n},\mathcal B_{Kq^2}(Kq\sqrt q\,a_\Sigma),\|\cdot\|_1\Big)$$
$$\le (2p+1)^{\frac{72B_n^2q^3K^3m^2}{\delta^2}}\Bigg(\frac{54B_nKqA_\beta}{\delta}\Bigg)^{Kq}\Bigg(\frac{54B_nKA_\gamma}{\delta}\Bigg)^{K}\Bigg(\frac{54B_nKq\sqrt q\,a_\Sigma}{\delta}\Bigg)^{Kq^2}.$$

5.2.3 Proof of Lemma 5.7

Let $m\in\mathbb N^\star$. From Lemma 5.4, on the event $\mathcal T$,
$$\sup_{f_m\in\mathcal F_m}\|f_m\|_n\,\mathbb 1_{\mathcal T} \le 2KB_n\big(A_\gamma+qA_\beta+q\sqrt q\,a_\Sigma\big) =: R_n. \quad (72)$$

From Lemma 5.5, on the event $\mathcal T$, for all $S\in\mathbb N^\star$,
$$\sum_{s=1}^S2^{-s}\sqrt{\ln\big[1+M(2^{-s}R_n,\mathcal F_m,\|\cdot\|_n)\big]} \le \sum_{s=1}^S2^{-s}\sqrt{\ln\big[2M(\delta_s,\mathcal F_m,\|\cdot\|_n)\big]} \quad\text{with } \delta_s = 2^{-s}R_n$$
$$\le \sum_{s=1}^S2^{-s}\Bigg[\sqrt{\ln2} + \frac{6\sqrt2\,B_nq\sqrt qK\sqrt K\,m}{\delta_s}\sqrt{\ln(2p+1)} + \sqrt{K\ln\Bigg[\Bigg(\frac{54B_nKqA_\beta}{\delta_s}\Bigg)^{q}\Bigg(\frac{54B_nKA_\gamma}{\delta_s}\Bigg)\Bigg(\frac{54B_nKq\sqrt q\,a_\Sigma}{\delta_s}\Bigg)^{q^2}\Bigg]}\,\Bigg]. \quad (73)$$

Notice from (72) that $R_n \ge 2KB_n\max\big(A_\gamma,qA_\beta,q\sqrt q\,a_\Sigma\big)$, so that each ratio in (73) satisfies $\frac{54B_nKqA_\beta}{\delta_s} \le 27\cdot2^s \le 2^{s+5}$, and likewise for the other two. Moreover, it holds that $\sum_{s=1}^S2^{-s} = 1-2^{-S}\le1$, $\sum_{s=1}^S(\sqrt e/2)^s \le \frac{\sqrt e}{2-\sqrt e}$, and since $e^s\ge s$ for all $s\in\mathbb N^\star$, $2^{-s}\sqrt s \le (\sqrt e/2)^s$. Using $q+1+q^2\le3q^2$ and $\sqrt{s+5}\le\sqrt6\,\sqrt s$,
$$\sqrt{K\ln\big[2^{(s+5)(q+1+q^2)}\big]} \le q\sqrt{3K\ln2}\,\sqrt{s+5} \le q\sqrt{18K\ln2}\,\sqrt s.$$
Therefore, from (73),
$$\sum_{s=1}^S2^{-s}\sqrt{\ln\big[1+M(2^{-s}R_n,\mathcal F_m,\|\cdot\|_n)\big]} \le \frac{6\sqrt2\,B_nq\sqrt qK\sqrt K\,m}{R_n}\sqrt{\ln(2p+1)}\,S + q\sqrt K\underbrace{\Bigg(\sqrt{\ln2}+\sqrt{18\ln2}\,\frac{\sqrt e}{2-\sqrt e}\Bigg)}_{=:C}, \quad (74)$$
using $q\sqrt K\ge1$ to absorb the $\sqrt{\ln2}$ term and $\sum_{s=1}^S2^{-s}\cdot2^s = S$ for the middle term. Then, from (64) and (74), for all $S\in\mathbb N^\star$,
$$\mathbb E_X\Bigg[\sup_{f_m\in\mathcal F_m}\Bigg|\frac1n\sum_{i=1}^n\epsilon_if_m(Z_i)\Bigg|\Bigg] \le R_n\Bigg[\frac1{\sqrt n}\Bigg(\frac{6\sqrt2\,B_nq\sqrt qK\sqrt K\,m}{R_n}\sqrt{\ln(2p+1)}\,S + q\sqrt K\,C\Bigg) + 2^{-S}\Bigg]. \quad (75)$$

We choose $S = \lceil\ln n/\ln2\rceil$ so that the two terms depending on $S$ in (75) are of the same order.
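For completeness, the arithmetic behind this choice of $S$:
$$S = \Big\lceil\frac{\ln n}{\ln2}\Big\rceil \ \Longrightarrow\ 2^{-S} \le 2^{-\ln n/\ln2} = e^{-\ln n} = \frac1n, \qquad S \le \frac{\ln n}{\ln2}+1 \le \frac{2\ln n}{\ln2}\quad(n\ge2),$$
so the chaining term is of order $(\ln n)/\sqrt n$ while the tail term $R_n2^{-S}\le R_n/n$ is of lower order.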
In particular, for this value of $S$, $2^{-S}\le1/n$ and $S\le\frac{2\ln n}{\ln2}$ for $n\ge2$, and we deduce from (75) and (72) that
$$\mathbb E_X\Bigg[\sup_{f_m\in\mathcal F_m}\Bigg|\frac1n\sum_{i=1}^n\epsilon_if_m(Z_i)\Bigg|\Bigg] \le \underbrace{\frac{12\sqrt2}{\ln2}}_{\le\,25}\,\frac{B_nq\sqrt qK\sqrt K\,m}{\sqrt n}\sqrt{\ln(2p+1)}\,\ln n + \frac{2KB_n\big(A_\gamma+qA_\beta+q\sqrt q\,a_\Sigma\big)q\sqrt K}{\sqrt n}\underbrace{(C+1)}_{\le\,20}$$
$$\le \frac{KB_nq}{\sqrt n}\Big[25\,m\sqrt{Kq\ln(2p+1)}\,\ln n + 40\sqrt K\big(A_\gamma+qA_\beta+q\sqrt q\,a_\Sigma\big)\Big] = \frac{KB_nq}{\sqrt n}\,\Delta_m,$$
which establishes (66).

Conclusion

We have studied an $l_1$-regularization estimator for finite mixtures of Gaussian experts regression models with soft-max gating functions. Our main contribution is the proof of an $l_1$-oracle inequality that provides a lower bound on the regularization parameter of the Lasso ensuring non-asymptotic theoretical control of the Kullback-Leibler loss of the estimator. Beyond some remaining questions regarding the tightness of the bounds and the form of the penalization functions, we believe that our contribution helps to further popularize mixtures of Gaussian experts regression models by providing a theoretical foundation for their application to high-dimensional problems.

Acknowledgments
TTN is supported by a "Contrat doctoral" from the French Ministry of Higher Education and Research and by the French National Research Agency (ANR) grant SMILES ANR-18-CE40-0014. HDN and GJM are funded by Australian Research Council grant number DP180101192.
A Technical results
We denote the vector space of all $q$-by-$q$ real matrices by $\mathbb R^{q\times q}$ ($q\in\mathbb N^\star$):
$$A\in\mathbb R^{q\times q} \iff A = (a_{i,j}) = \begin{pmatrix} a_{1,1} & \cdots & a_{1,q}\\ \vdots & & \vdots\\ a_{q,1} & \cdots & a_{q,q} \end{pmatrix},\qquad a_{i,j}\in\mathbb R.$$
If a capital letter is used to denote a matrix (e.g., $A$, $B$), then the corresponding lower-case letter with subscript $i,j$ refers to the $(i,j)$th entry (e.g., $a_{i,j}$, $b_{i,j}$). When required, we also designate the elements of a matrix with the notation $[A]_{i,j}$ or $A(i,j)$. Denote the $q$-by-$q$ identity and zero matrices by $I_q$ and $0_q$, respectively.

Lemma A.1 (Derivative of a quadratic form, Magnus & Neudecker, 2019). Assume that $X$ and $a$ are a non-singular matrix in $\mathbb R^{q\times q}$ and a vector in $\mathbb R^{q\times1}$, respectively. Then
$$\frac{\partial a^\top X^{-1}a}{\partial X} = -X^{-1}aa^\top X^{-1}.$$

Lemma A.2 (Jacobi's formula, Theorem 8.1 from Magnus & Neudecker, 2019). If $X$ is a differentiable map from the real numbers to $q$-by-$q$ matrices,
$$\frac{d}{dt}\det(X(t)) = \mathrm{tr}\Bigg(\mathrm{Adj}(X(t))\,\frac{dX(t)}{dt}\Bigg).$$
In particular,
$$\frac{\partial\det(X)}{\partial X} = \big(\mathrm{Adj}(X)\big)^\top = \det(X)\big(X^{-1}\big)^\top.$$

Definition A.1 (Operator (induced) $p$-norm). We recall the operator (induced) $p$-norm of a matrix $A\in\mathbb R^{q\times q}$ ($q\in\mathbb N^\star$, $p\in\{1,2,\infty\}$),
$$\|A\|_p = \max_{x\ne0}\frac{\|Ax\|_p}{\|x\|_p} = \max_{x\ne0}\Bigg\|A\frac{x}{\|x\|_p}\Bigg\|_p = \max_{\|x\|_p=1}\|Ax\|_p, \quad (76)$$
where, for all $x\in\mathbb R^q$,
$$\|x\|_\infty \le \|x\|_1 = \sum_{i=1}^q|x_i| \le q\|x\|_\infty, \quad (77)$$
$$\|x\|_2 = \Bigg(\sum_{i=1}^q|x_i|^2\Bigg)^{1/2} = \big(x^\top x\big)^{1/2} \le \|x\|_1 \le \sqrt q\,\|x\|_2,\ \text{and} \quad (78)$$
$$\|x\|_\infty = \max_{1\le i\le q}|x_i| \le \|x\|_2 \le \sqrt q\,\|x\|_\infty. \quad (79)$$

Lemma A.3 (Some matrix $p$-norm properties, Golub & Van Loan, 2012). By definition, we always have the important property that for every $A\in\mathbb R^{q\times q}$ and $x\in\mathbb R^q$,
$$\|Ax\|_p \le \|A\|_p\,\|x\|_p, \quad (80)$$
and every induced $p$-norm is submultiplicative, i.e., for every $A\in\mathbb R^{q\times q}$ and $B\in\mathbb R^{q\times q}$,
$$\|AB\|_p \le \|A\|_p\,\|B\|_p. \quad (81)$$
In particular, it holds that
$$\|A\|_1 = \max_{1\le j\le q}\sum_{i=1}^q|a_{ij}| \le \sum_{j=1}^q\sum_{i=1}^q|a_{ij}| =: \|\mathrm{vec}(A)\|_1 \le q\|A\|_1, \quad (82)$$
$$\|\mathrm{vec}(A)\|_\infty := \max_{1\le i,j\le q}|a_{ij}| \le \|A\|_\infty = \max_{1\le i\le q}\sum_{j=1}^q|a_{ij}| \le q\|\mathrm{vec}(A)\|_\infty, \quad (83)$$
$$\|\mathrm{vec}(A)\|_\infty \le \|A\|_2 = \lambda_{\max}(A) \le q\|\mathrm{vec}(A)\|_\infty, \quad (84)$$
where $\lambda_{\max}$ is the largest eigenvalue of a positive definite symmetric matrix $A$. The $p$-norms, when $p\in\{1,2,\infty\}$, satisfy
$$\frac{1}{\sqrt q}\|A\|_\infty \le \|A\|_2 \le \sqrt q\,\|A\|_\infty, \quad (85)$$
$$\frac{1}{\sqrt q}\|A\|_1 \le \|A\|_2 \le \sqrt q\,\|A\|_1. \quad (86)$$
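The norm relations (82)–(86) can be sanity-checked numerically; a minimal sketch with a random matrix (dimension hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
q = 5
A = rng.normal(size=(q, q))

n1   = np.linalg.norm(A, 1)        # max column sum
n2   = np.linalg.norm(A, 2)        # spectral norm
ninf = np.linalg.norm(A, np.inf)   # max row sum
vec1 = np.abs(A).sum()             # ||vec(A)||_1
vecinf = np.abs(A).max()           # ||vec(A)||_inf

assert n1 <= vec1 <= q * n1                              # (82)
assert vecinf <= ninf <= q * vecinf                      # (83)
assert vecinf <= n2 <= q * vecinf                        # (84)
assert ninf / np.sqrt(q) <= n2 <= np.sqrt(q) * ninf      # (85)
assert n1 / np.sqrt(q) <= n2 <= np.sqrt(q) * n1          # (86)
print("norm inequalities (82)-(86) hold for this draw")
```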
Given $\delta>0$, we need to define the $\delta$-packing number and the $\delta$-covering number.

Definition A.2 ($\delta$-packing number, e.g., Definition 5.4 from Wainwright, 2019). Let $(\mathcal F,\|\cdot\|)$ be a normed space and let $\mathcal G\subset\mathcal F$. With $(g_i)_{i=1,\ldots,m}\in\mathcal G$, $\{g_1,\ldots,g_m\}$ is a $\delta$-packing of $\mathcal G$ of size $m\in\mathbb N^\star$ if
$$\|g_i-g_j\| > \delta,\quad \forall i\ne j,\ i,j\in\{1,\ldots,m\},$$
or equivalently, if the closed balls $B(g_i,\delta/2)$, $i=1,\ldots,m$, are pairwise disjoint. Upon defining the $\delta$-packing, we can measure the maximal number of disjoint closed balls with radius $\delta/2$ and centers in $\mathcal G$. This number is called the $\delta$-packing number and is defined as
$$M(\delta,\mathcal G,\|\cdot\|) := \max\{m\in\mathbb N^\star : \exists\ \delta\text{-packing of }\mathcal G\text{ of size }m\}. \quad (87)$$

Definition A.3 ($\delta$-covering number, Definition 5.1 from Wainwright, 2019). Let $(\mathcal F,\|\cdot\|)$ be a normed space and let $\mathcal G\subset\mathcal F$. With $(g_i)_{i=1,\ldots,n}\in\mathcal G$, $\{g_1,\ldots,g_n\}$ is a $\delta$-covering of $\mathcal G$ of size $n$ if $\mathcal G\subset\cup_{i=1}^nB(g_i,\delta)$, or equivalently, $\forall g\in\mathcal G$, $\exists i$ such that $\|g-g_i\|\le\delta$. Upon defining the $\delta$-covering, we can measure the minimal number of closed balls with radius $\delta$ that is necessary to cover $\mathcal G$. This number is called the $\delta$-covering number and is defined as
$$N(\delta,\mathcal G,\|\cdot\|) := \min\{n\in\mathbb N^\star : \exists\ \delta\text{-covering of }\mathcal G\text{ of size }n\}. \quad (88)$$
The covering entropy (metric entropy) is defined as $H_{\|\cdot\|}(\delta,\mathcal G) = \ln\big(N(\delta,\mathcal G,\|\cdot\|)\big)$.

The relation between the packing number and the covering number is described in the following lemma.

Lemma A.4 (Lemma 5.5 from Wainwright, 2019). Let $(\mathcal F,\|\cdot\|)$ be a normed space and let $\mathcal G\subset\mathcal F$. Then
$$M(2\delta,\mathcal G,\|\cdot\|) \le N(\delta,\mathcal G,\|\cdot\|) \le M(\delta,\mathcal G,\|\cdot\|).$$
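The right-hand inequality of Lemma A.4 rests on the observation that a maximal $\delta$-packing is automatically a $\delta$-covering. A minimal numerical sketch on a hypothetical point cloud in $\mathbb R^2$:

```python
import numpy as np

def greedy_packing(points, delta):
    """Build a maximal delta-packing greedily: keep points pairwise > delta apart."""
    packed = []
    for pt in points:
        if all(np.linalg.norm(pt - g) > delta for g in packed):
            packed.append(pt)
    return np.array(packed)

rng = np.random.default_rng(4)
G = rng.uniform(size=(300, 2))            # a hypothetical bounded subset of R^2
delta = 0.15

pack_2d = greedy_packing(G, 2 * delta)    # a 2*delta-packing
pack_1d = greedy_packing(G, delta)        # a maximal delta-packing

# Maximality implies covering: every point of G has a packed point within delta,
# which is the idea behind the right-hand inequality of Lemma A.4.
dists = np.linalg.norm(G[:, None, :] - pack_1d[None, :, :], axis=2).min(axis=1)
assert dists.max() <= delta

print(f"|2delta-packing| = {len(pack_2d)} <= |maximal delta-packing| = {len(pack_1d)}")
```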
Lemma A.5 (Chernoff's inequality, e.g., Chapter 2 in Wainwright, 2019). Assume that the random variable $U$ has a moment generating function in a neighborhood of zero, meaning that there is some constant $b>0$ such that the function $\varphi(\lambda) = \mathbb E\big[e^{\lambda(U-\mu)}\big]$ exists for all $\lambda\le|b|$. In such a case, we may apply Markov's inequality to the random variable $e^{\lambda(U-\mu)}$, thereby obtaining the upper bound
$$\mathbb P(U-\mu\ge t) = \mathbb P\big(e^{\lambda(U-\mu)}\ge e^{\lambda t}\big) \le \frac{\mathbb E\big[e^{\lambda(U-\mu)}\big]}{e^{\lambda t}}.$$
Optimizing our choice of $\lambda$ so as to obtain the tightest result yields the Chernoff bound
$$\ln\big(\mathbb P(U-\mu\ge t)\big) \le -\sup_{\lambda\in[0,b]}\Big\{\lambda t - \ln\big(\mathbb E\big[e^{\lambda(U-\mu)}\big]\big)\Big\}. \quad (89)$$
In particular, suppose $U\sim\mathcal N(\mu,\sigma^2)$ is a Gaussian random variable with mean $\mu$ and variance $\sigma^2$. By a straightforward calculation, we find that $U$ has the moment generating function
$$\mathbb E\big[e^{\lambda U}\big] = e^{\mu\lambda + \frac{\sigma^2\lambda^2}{2}},\quad\text{valid for all }\lambda\in\mathbb R.$$
Substituting this expression into the optimization problem defining the optimized Chernoff bound (89), we obtain
$$\sup_{\lambda\ge0}\Big\{\lambda t - \ln\big(\mathbb E\big[e^{\lambda(U-\mu)}\big]\big)\Big\} = \sup_{\lambda\ge0}\Big\{\lambda t - \frac{\sigma^2\lambda^2}{2}\Big\} = \frac{t^2}{2\sigma^2},$$
where we have taken derivatives in order to find the optimum of this quadratic function. So, (89) leads to
$$\mathbb P(U\ge\mu+t) \le e^{-\frac{t^2}{2\sigma^2}},\quad\text{for all }t\ge0. \quad (90)$$

Recall that a multi-index $\alpha = (\alpha_1,\ldots,\alpha_p)$, $\alpha_i\in\mathbb N$, $\forall i\in\{1,\ldots,p\}$, is a $p$-tuple of non-negative integers. Let
$$|\alpha| = \sum_{i=1}^p\alpha_i,\qquad \alpha! = \prod_{i=1}^p\alpha_i!,\qquad x^\alpha = \prod_{i=1}^px_i^{\alpha_i},\ x\in\mathbb R^p,\qquad \partial^\alpha f = \partial_1^{\alpha_1}\partial_2^{\alpha_2}\cdots\partial_p^{\alpha_p}f = \frac{\partial^{|\alpha|}f}{\partial x_1^{\alpha_1}\partial x_2^{\alpha_2}\cdots\partial x_p^{\alpha_p}}.$$
The number $|\alpha|$ is called the order or degree of $\alpha$. Thus, the order of $\alpha$ is the same as the order of $x^\alpha$ as a monomial or the order of $\partial^\alpha$ as a partial derivative.

Lemma A.6 (Taylor's theorem in several variables, from Duistermaat & Kolk, 2004). Suppose $f:\mathbb R^p\to\mathbb R$ is in the class $C^{k+1}$ of continuously differentiable functions on an open convex set $S$. If $a\in S$ and $a+h\in S$, then
$$f(a+h) = \sum_{|\alpha|\le k}\frac{\partial^\alpha f(a)}{\alpha!}h^\alpha + R_{a,k}(h),$$
where the remainder is given in Lagrange's form by
$$R_{a,k}(h) = \sum_{|\alpha|=k+1}\partial^\alpha f(a+ch)\,\frac{h^\alpha}{\alpha!}\quad\text{for some }c\in(0,1),$$
or in integral form by
$$R_{a,k}(h) = (k+1)\sum_{|\alpha|=k+1}\frac{h^\alpha}{\alpha!}\int_0^1(1-t)^k\,\partial^\alpha f(a+th)\,dt.$$
In particular, we can estimate the remainder term: if $|\partial^\alpha f(x)|\le M$ for $x\in S$ and $|\alpha|=k+1$, then
$$|R_{a,k}(h)| \le \frac{M}{(k+1)!}\,\|h\|_1^{k+1},\qquad \|h\|_1 = \sum_{i=1}^p|h_i|.$$
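The way Lemma A.6 is used in Section 5.2 is with $k=0$: $|f(a+h)-f(a)|\le M\|h\|_1$ whenever every partial derivative is bounded by $M$. A minimal numerical sketch with a hypothetical smooth test function whose partial derivatives lie in $(0,1)$:

```python
import numpy as np

def f(x):
    # log(1 + sum(exp(x))): each partial derivative e^{x_j}/(1 + sum e^{x_k}) is in (0,1).
    return np.log(1.0 + np.sum(np.exp(x)))

rng = np.random.default_rng(5)
a = rng.normal(size=6)
h = 0.1 * rng.normal(size=6)
M = 1.0                           # uniform bound on the first partial derivatives

lhs = abs(f(a + h) - f(a))
rhs = M * np.abs(h).sum()         # M * ||h||_1, the k = 0 remainder bound of Lemma A.6
assert lhs <= rhs
print(f"|f(a+h)-f(a)| = {lhs:.4f} <= M*||h||_1 = {rhs:.4f}")
```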
References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.

Baudry, J.-P. (2009). Sélection de modèle pour la classification non supervisée. Choix du nombre de classes. PhD thesis, Université Paris-Sud XI.

Birgé, L. & Massart, P. (2007). Minimal penalties for Gaussian model selection. Probability Theory and Related Fields, 138(1-2), 33–73.

Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Bunea, F. et al. (2008). Honest variable selection in linear and logistic regression models via $l_1$ and $l_1+l_2$ penalization. Electronic Journal of Statistics, 2, 1153–1194.

Chamroukhi, F. & Huynh, B. T. (2018). Regularized Maximum-Likelihood Estimation of Mixture-of-Experts for Regression and Clustering. In (pp. 1–8).

Chamroukhi, F. & Huynh, B.-T. (2019). Regularized Maximum Likelihood Estimation and Feature Selection in Mixtures-of-Experts Models. Journal de la Société Française de Statistique, 160(1), 57–85.

Cohen, S. & Le Pennec, E. (2011). Conditional density estimation by penalized likelihood model selection and applications. Technical Report, INRIA.

Devijver, E. (2015). An $l_1$-oracle inequality for the Lasso in finite mixture of multivariate Gaussian regression. ESAIM: Probability and Statistics, 19, 649–670.

Duistermaat, J. J. & Kolk, J. A. (2004). Multidimensional Real Analysis I: Differentiation, volume 86. Cambridge University Press.

Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

Genovese, C. R., Wasserman, L., et al. (2000). Rates of convergence for the Gaussian mixture sieve. Annals of Statistics, 28(4), 1105–1127.

Golub, G. H. & Van Loan, C. F. (2012). Matrix Computations, volume 3. JHU Press.

Ho, N., Nguyen, X., et al. (2016a). Convergence rates of parameter estimation for some weakly identifiable finite mixtures. Annals of Statistics, 44(6), 2726–2755.

Ho, N., Nguyen, X., et al. (2016b). On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics, 10(1), 271–307.

Ho, N., Yang, C.-Y., & Jordan, M. I. (2019). Convergence Rates for Gaussian Mixtures of Experts. arXiv preprint arXiv:1907.04377.

Huynh, T. & Chamroukhi, F. (2019). Estimation and Feature Selection in Mixtures of Generalized Linear Experts Models. arXiv preprint arXiv:1907.06994.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation, 3, 79–87.

Jiang, W. & Tanner, M. A. (1999). Hierarchical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation. Annals of Statistics, (pp. 987–1011).

Jordan, M. I. & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.

Khalili, A. (2010). New estimation and feature selection methods in mixture-of-experts models. Canadian Journal of Statistics, 38(4), 519–539.

Khalili, A. & Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102(479), 1025–1038.

Lloyd-Jones, L. R., Nguyen, H. D., & McLachlan, G. J. (2018). A globally convergent algorithm for lasso-penalized mixture of linear regression models. Computational Statistics & Data Analysis, 119, 19–38.

Magnus, J. R. & Neudecker, H. (2019). Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons.

Massart, P. (2007). Concentration Inequalities and Model Selection: École d'Été de Probabilités de Saint-Flour XXXIII-2003. Springer.

Massart, P. & Meynet, C. (2011). The Lasso as an $l_1$-ball model selection procedure. Electronic Journal of Statistics, 5, 669–687.

Maugis, C. & Michel, B. (2011). A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: Probability and Statistics, 15, 41–68.

McLachlan, G. & Peel, D. (2000). Finite Mixture Models. John Wiley & Sons.

Mendes, E. F. & Jiang, W. (2012). On convergence rates of mixtures of polynomial experts. Neural Computation, 24(11), 3025–3051.

Meynet, C. (2013). An $l_1$-oracle inequality for the Lasso in finite mixture Gaussian regression models. ESAIM: Probability and Statistics, 17, 650–671.

Montuelle, L., Le Pennec, E., et al. (2014). Mixture of Gaussian regressions model with logistic weights, a penalized maximum likelihood approach. Electronic Journal of Statistics, 8(1), 1661–1695.

Nguyen, H. D. & Chamroukhi, F. (2018). Practical and theoretical aspects of mixture-of-experts modeling: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1246.

Nguyen, H. D., Chamroukhi, F., & Forbes, F. (2019). Approximation results regarding the multiple-output Gaussian gated mixture of linear experts model. Neurocomputing, 366, 208–214.

Nguyen, H. D., Lloyd-Jones, L. R., & McLachlan, G. J. (2016). A universal approximation theorem for mixture-of-experts models. Neural Computation, 28(12), 2585–2593.

Nguyen, T., Chamroukhi, F., Nguyen, H. D., & McLachlan, G. J. (2020a). Approximation of probability density functions via location-scale finite mixtures in Lebesgue spaces. arXiv preprint arXiv:2008.09787.

Nguyen, T. T., Nguyen, H. D., Chamroukhi, F., & McLachlan, G. J. (2020b). Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Mathematics & Statistics, 7(1), 1750861.

Nguyen, X. et al. (2013). Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41(1), 370–400.

Norets, A. et al. (2010). Approximation of conditional densities by smooth mixtures of regressions. Annals of Statistics, 38(3), 1733–1766.

Park, M. Y. & Hastie, T. (2008). Penalized logistic regression for detecting gene interactions. Biostatistics, 9(1), 30–50.

Redner, R. A. & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2), 195–239.

Schwarz, G. et al. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.

Stadler, N., Buhlmann, P., & van de Geer, S. (2010). $l_1$-penalization for mixture regression models. TEST, 19, 209–256.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.

Van Der Vaart, A. & Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer.

Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data (Springer Series in Statistics). Springer-Verlag.

Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press.