Learning with tree tensor networks: complexity estimates and model selection
Bertrand Michel and Anthony Nouy∗

∗ Centrale Nantes, Laboratoire de Mathématiques Jean Leray, CNRS UMR 6629, France

Abstract
In this paper, we propose and analyze a model selection method for tree tensor networks in an empirical risk minimization framework. Tree tensor networks, or tree-based tensor formats, are prominent model classes for the approximation of high-dimensional functions in numerical analysis and data science. They correspond to sum-product neural networks with a sparse connectivity associated with a dimension partition tree $T$, widths given by a tuple $r$ of tensor ranks, and multilinear activation functions (or units). The approximation power of these model classes has been proved to be near-optimal for classical smoothness classes. However, in an empirical risk minimization framework with a limited number of observations, the dimension tree $T$ and ranks $r$ should be selected carefully to balance estimation and approximation errors. In this paper, we propose a complexity-based model selection strategy à la Barron, Birgé, Massart. Given a family of model classes, with different trees, ranks and tensor product feature spaces, a model is selected by minimizing a penalized empirical risk, with a penalty depending on the complexity of the model class. After deriving bounds of the metric entropy of tree tensor networks with bounded parameters, we deduce a form of the penalty from bounds on suprema of empirical processes. This choice of penalty yields a risk bound for the predictor associated with the selected model. For classical smoothness spaces, we show that the proposed strategy is minimax optimal in a least-squares setting. In practice, the amplitude of the penalty is calibrated with a slope heuristics method. Numerical experiments in a least-squares regression setting illustrate the performance of the strategy for the approximation of multivariate functions and of univariate functions identified with tensors by tensorization (quantization).

1 Introduction

Typical tasks in statistical learning include the estimation of a regression function or of posterior probabilities for classification (supervised learning), or the estimation of the probability distribution of a random variable from samples of the distribution (unsupervised learning). These approximation tasks can be formulated as a minimization problem of a risk functional $\mathcal R(f)$ whose minimizer $f^\star$ is the target (or oracle) function, and such that $\mathcal R(f) - \mathcal R(f^\star)$ measures some discrepancy between the function $f$ and $f^\star$. The risk is usually defined as
$$\mathcal R(f) = \mathbb E(\gamma(f, Z)),$$
with $Z = (X, Y)$ for supervised learning or $Z = X$ for unsupervised learning, and where $\gamma$ is a contrast function. For supervised learning, the contrast $\gamma$ is usually chosen as $\gamma(f, (x, y)) = \ell(y, f(x))$, where $\ell(y, f(x))$ measures some discrepancy between $y$ and the prediction $f(x)$ for a given realization $(x, y)$ of $(X, Y)$. In practice, given i.i.d. realizations $(Z_1, \dots, Z_n)$ of $Z$, an approximation $\hat f^M_n$ is obtained by the minimization of an empirical risk
$$\hat{\mathcal R}_n(f) = \frac{1}{n} \sum_{i=1}^n \gamma(f, Z_i)$$
over a set of functions $M$, also called a model class or hypothesis set. Assuming that the risk admits a minimizer $f_M$ over $M$, the error $\mathcal R(\hat f^M_n) - \mathcal R(f^\star)$ can be decomposed into two contributions: an approximation error $\mathcal R(f_M) - \mathcal R(f^\star)$, which quantifies the best we can expect from the model class $M$, and an estimation error $\mathcal R(\hat f^M_n) - \mathcal R(f_M)$, which is due to the use of a limited number of observations.
For agiven model class, a first problem is to understand how these errors behave under some assumptions onthe target function. When considering an increasing sequence of model classes, the approximation errordecreases but the estimation error usually increases. Then strategies are required for the selection of aparticular model class.In many applications, the target function f ⋆ ( x ) is a function of many variables x = ( x , . . . , x d ). Forapplications in image or signal classification, x may be an image (with d the number of pixels or patches) ora discrete time signal (with d the number of time instants) and f ⋆ ( x ) provides a label to a particular input x . For applications in computational science, the target function may be the solution of a high-dimensionalpartial differential equation, a parameter-dependent equation or a stochastic equation. In all these appli-cations, when d is large and when the number of observations is limited, one has to rely on suitable modelclasses M of moderate complexity that exploit specific structures of the target function f ⋆ and yield an ap-proximation b f Mn with low approximation and estimation errors. Typical examples of model classes includeadditive functions f ( x ) + · · · + f d ( x d ), sums of multiplicative functions P mk =1 f k ( x ) · · · f kd ( x d ), projectionpursuit f ( w T x ) + · · · + f m ( w Tm x ), or feed-forward neural networks σ L ◦ f L ◦ . . . ◦ σ ◦ f ( x ) where the f k are affine maps and the σ k are given nonlinear functions.In this paper, we consider the class of functions in tree-based tensor format, or tree tensor networks.These model classes are well-known approximation tools in numerical analysis and computational physicsand have also been more recently considered in statistical learning. They are particular cases of feed-forward neural networks with an architecture given by a dimension partition tree and multilinear activationfunctions (see [26, 14]). For an overview of these tools, the reader is referred to the monograph [22] andthe surveys [30, 6, 25, 12, 13]. Some results on the approximation power of tree tensor networks can befound in [32, 20, 5] for multivariate functions, or in [24, 23, 2, 3] for tensorized (or quantized) functions.A tree-based tensor format is a set of functions M Tr ( H ) = { f ∈ H : rank α ( f ) ≤ r α , α ∈ T } , where T is a dimension partition tree over { , . . . , d } , r = ( r α ) ∈ N | T | is a tuple of integers and H is a finitedimensional tensor space of multivariate functions (e.g., polynomials, splines), which is a tensor productfeature space. A function f in M Tr ( H ) have a α -rank rank α ( f ) bounded by r α , that means it admits arepresentation f ( x ) = r α X k =1 g αk ( x α ) h α c k ( x α c )for some functions g αk and h α c k of complementary groups of variables. The function f admits a representationas a composition of multilinear functions. For instance, for the dimension tree of Figure 1a, f ( x ) = f ,..., (cid:16) f , , , (cid:0) f , , ( f ( φ ( x )) , f , ( f ( φ ( x )) , f ( φ ( x )))) , f ( φ ( x )) (cid:1) ,f , , , (cid:0) f , , ( f , ( f ( φ ( x )) , f ( φ ( x ))) , f ( φ ( x ))) , f ( φ ( x )) (cid:1)(cid:17) where φ ν ( x ν ) ∈ R n ν is a vector of n ν features in the variable x ν , and f α is a multilinear map with valuesin R r α . It corresponds to the neural network illustrated on Figure 2.The main contribution of the paper is a complexity-based strategy for the selection of a model classin an empirical risk minimization framework. 
Figure 1: Dimension tree $T$ over $\{1, \dots, 8\}$ (a) and ranks $r = (r_\alpha)_{\alpha \in T}$ (b).

Figure 2: Neural network corresponding to the format $\mathcal M^T_r(H)$ with the tree $T$ and ranks $r$ of Figure 1, and $n_\nu = 10$ features per variable.
Given a family of model classes $M_m = \mathcal M^{T_m}_{r_m}(H_m)$, $m \in \mathcal M$, associated with different trees $T_m$, ranks $r_m$ and background approximation spaces $H_m$, and given the corresponding predictors $\hat f_m$ that minimize the empirical risk, we propose a strategy to select a particular model $\hat m$ with a guaranteed performance. For that purpose, we make use of the model selection approach of Barron, Birgé and Massart (see [28] for a general introduction to the topic) where $\hat m$ is obtained by minimizing a penalized empirical risk
$$\hat{\mathcal R}_n(\hat f_m) + \mathrm{pen}(m)$$
with a penalty function $\mathrm{pen}(m)$ derived from complexity estimates of the model classes $M_m$, of the form $\mathrm{pen}(m) \sim O(\sqrt{C_m/n})$ (up to logarithmic terms) in a general setting, or of the form $\mathrm{pen}(m) \sim O(C_m/n)$ (again up to logarithmic terms) in a bounded least-squares setting where faster convergence rates can be obtained. In particular, we find that our strategy is minimax adaptive over Sobolev spaces. In practice, the penalty is taken of the form $\mathrm{pen}(m) = \lambda \sqrt{C_m/n}$ (or $\mathrm{pen}(m) = \lambda C_m/n$ in a bounded regression setting), where $\lambda$ is calibrated with the slope heuristics method proposed in [10]. The family of models can be generated by adaptive learning algorithms such as the ones proposed in [18, 17].

Note that our method is an $\ell_0$-type approach. Convex regularization methods would be an interesting alternative route to follow. A straightforward convexification of tensor formats consists in using the sum of nuclear norms of unfoldings (see, e.g., [33] for the Tucker format), but this is known to be far from optimal from a statistical point of view [31]. A convex regularization method based on the tensor nuclear norm has been proposed for the Tucker format, or shallow tensor network, which comes with theoretical guarantees (see [35]). However, there is no straightforward extension of this approach to general tree tensor networks.

The outline of the paper is as follows. In Section 2, we describe the model class of tree tensor networks (or tree-based tensor formats) in the case of vector-valued functions, which generalizes the classical definition for real-valued functions [22, 16] and allows considering applications such as multiclass classification. In Section 3, we provide estimates of the metric and bracketing entropies in $L^p$ spaces for tree tensor networks $M_m$ with bounded parameters. In Section 4, we derive bounds for the estimation error in a classical empirical risk minimization framework. These bounds are derived from concentration inequalities for empirical processes. In Section 5, we present the complexity-based model selection approach and we derive risk bounds for particular choices of penalty, first in a general setting and then in the bounded least-squares setting. Then we present the practical aspects of the approach, which include the slope heuristics method for penalty calibration and the exploration strategies for the generation of a sequence of model classes and associated predictors. Finally, in Section 6, we present some numerical experiments that validate the proposed model selection strategy.

2 Tree tensor networks

We consider functions $f(x) = f(x_1, \dots, x_d)$ defined on a product set $\mathcal X = \mathcal X_1 \times \dots \times \mathcal X_d$ and with values in $\mathbb R^s$. Typically, $\mathcal X_\nu$ is a subset of $\mathbb R$ or $\mathbb R^{d_\nu}$, but it could be a set of more general objects (sequences, functions, graphs...). For each $\nu \in \{1, \dots, d\}$, we introduce a finite-dimensional space $H_\nu$ of functions defined on $\mathcal X_\nu$.
We let { φ νi ν : i ν ∈ I ν } be a basis of H ν , with I ν = { , . . . , n ν } . The functions φ νi ν ( x ν ) may be polynomials, splines,wavelets, kernel functions, or more general functions that extract n ν features from a given input x ν ∈ X ν .We let φ ν : X ν → R n ν be the associated feature map defined by φ ν ( x ν ) = ( φ ν ( x ν ) , . . . , φ νn ν ( x ν )) T ∈ R n ν .The functions φ i ( x ) = φ i ( x ) . . . φ di d ( x d ), i ∈ I = I × . . . × I d , form a basis of the tensor product space H = H ⊗ . . . ⊗ H d . A function f ∈ H admits a representation f ( x ) = X i ∈ I a i φ i ( x ) = n X i =1 . . . n d X i d =1 a i ,...,i d φ i ( x ) . . . φ di d ( x d ) , (1)where a ∈ R I = R n × ... × n d is an algebraic tensor (or multi-dimensional array) of size n × . . . × n d . Themap φ from X to R I which associates to x the elementary tensor φ ( x ) = φ ( x ) ⊗ . . . ⊗ φ d ( x d ) ∈ R I definesa tensor product feature map .A function f defined on X with values in R s whose components f k (1 ≤ k ≤ s ) are in H is identifiedwith an element of the product space H s , which is itself identified with the space H ⊗ . . . ⊗ H d ⊗ R s oftensors of order d + 1. For any α ⊂ { , . . . , d } := D , we introduce the tensor space H α = N ν ∈ α H ν of functions defined on X α = × ν ∈ α X ν , and for x ∈ X , we let x α = ( x ν ) ν ∈ α ∈ X α denote the group of variables α . We denote by α c = D \ α . We use the conventions H ∅ = R and H D = H . Definition 2.1.
The α -rank of a function f : X → R s , denoted rank α ( f ) , is the minimal integer r α suchthat f ( x ) = r α X k =1 g αk ( x α ) h α c k ( x α c ) (2)4 or some functions g αk : X α → R and h αk : X α c → R s . The above definition generalizes the classical notion of α -rank for vector-valued functions. It coincideswith the classical notion of α -rank when f : X → R s is seen as a real-valued function of s + 1 variablesdefined on X × . . . × X d × { , . . . , s } . A function f ∈ H s admits a representation (2) with functions g αk ∈ H α and h α c k ∈ H sα c . For f = 0, we have rank ∅ ( f ) = 1 and 1 ≤ rank D ( f ) ≤ s .We let T be a dimension partition tree over D , with root D and leaves { ν } , 1 ≤ ν ≤ d . For a node α ∈ T , we denote by S ( α ) the set of children of α . For any node α , we have either S ( α ) = ∅ (for leafnodes) or S ( α ) ≥ L ( T ) the set of leaves of T , and by I ( T ) = T \ L ( T )its interior nodes. For an interior node α ∈ I ( T ), S ( α ) forms a partition of α . The T -rank of a function f is the tuple rank T ( f ) = (rank α ( f )) α ∈ T . The number of nodes of a dimension partition tree over D isbounded as | T | ≤ d − . Given a tuple r ∈ N | T | we introduce the model class M Tr ( H s ) of functions in H s with T -rank boundedby r , M Tr ( H s ) = { f ∈ H s : rank T ( f ) ≤ r } . A function f ∈ M Tr ( H s ) admits the representation f ( x ) = r D X k D =1 c k D g Dk D ( x )where the c k D are vectors in R s and where the functions g Dk D ∈ H are defined recursively. For any interiornode α ∈ I ( T ), the functions g αk α admit the representation g αk α ( x α ) = X ≤ k β ≤ r β β ∈ S ( α ) C αk α , ( k β ) β ∈ S ( α ) Y β ∈ S ( α ) g βk β ( x β ) , where C α ∈ R r α × ( × β ∈ S ( α ) r β ) . For a leaf node α ∈ L ( T ), the functions g αk α ∈ H α admit the representation g αk α ( x α ) = X i α ∈ I α C αk α ,i α φ αi α ( x α ) . We let C ∅ denote the matrix whose columns are the vectors c k D . We introduce the tree T ⋆ = T ∪ ∅ and we use the conventions r ∅ = s and S ( ∅ ) = D . A function f in M Tr ( H s ) therefore admits an explicitrepresentation f k ( x ) = X i α ∈ I α α ∈L ( T ) X ≤ k β ≤ r β β ∈ T C ∅ k,k D Y α ∈ T \L ( T ) C αk α , ( k β ) β ∈ S ( α ) Y α ∈L ( T ) C αk α ,i α Y α ∈L ( T ) φ αi α ( x α ) (3)where the set of parameters ( C α ) α ∈ T ⋆ form a tree network of tensors, and C α ∈ R { ,...,r α }× I α := R K α , where I α = { , . . . , r D } for α = ∅ , I α = × β ∈ S ( α ) { , . . . , r β } for α ∈ I ( T ) or I α = { , . . . , n α } for α ∈ L ( T ).We let R H ,T,r be the map which associates to a set of tensors ( C α ) α ∈ T ⋆ the function f = R H ,T,r (( C α ) α ∈ T ⋆ )defined by (3), so that M Tr ( H s ) = { f = R H ,T,r (( C α ) α ∈ T ⋆ ) : C α ∈ R K α , α ∈ T ⋆ } . From the representation (3), we obtain the following
Lemma 2.2.
The map R r,T, H is a multilinear map from the product space × α ∈ T ⋆ R K α to H s . Remark 2.3. If r D = s , the parameter C ∅ ∈ R s × s can be chosen as the identity matrix, so that theparameters of a function in M Tr ( H s ) are reduced to the set of tensors ( C α ) α ∈ T . This includes the classicalcase of tree-based tensor formats for real-valued functions ( s = r D = 1 ). In this situation, we let T ⋆ = T . .3 Tree tensor networks as compositions of multilinear functions A function f in M Tr ( H s ) admits a representation in terms of compositions of multilinear functions. For agiven α ∈ T , we let g α ( x α ) = ( g αk α ( x α )) ≤ k α ≤ r α ∈ R r α . The matrix C ∅ ∈ R s × r D is linearly identified with alinear map f ∅ from R r D to R s . Therefore, a function f in M Tr ( H s ) admits the representation f ( x ) = f ∅ ( g D ( x )) . For any α ∈ I ( T ), the tensor C α can be linearly identified with a multilinear map f α : × β ∈ S ( α ) R r β → R r α defined by f αk α (( z β ) β ∈ S ( α ) ) = X ≤ k β ≤ r β β ∈ S ( α ) C αk α , ( k β ) β ∈ S ( α ) Y β ∈ S ( α ) z βk β for z β ∈ R r β . Therefore, g α admits the representation g α ( x α ) = f α (( g β ( x β ) β ∈ S ( α ) ) . (4)For a leaf node α ∈ L ( T ), the tensor C α can be linearly identified with a linear map f α : R n α → R r α , and g α ( x α ) = f α ( φ α ( x α )) . (5)Therefore, a function f in M Tr ( H s ) can be parametrized by a tree network of linear or multilinear maps f = ( f α ) α ∈ T ⋆ (identified with the tree tensor network ( C α ) α ∈ T ⋆ ).We denote by F α the space of linear maps from R r D to R s for α = ∅ , the space of multilinear maps from × β ∈ S ( α ) R r β to R r α for α ∈ I ( T ), or the space of linear maps from R n α to R r α for a leaf node α ∈ L ( T ).We denote by F T,r := × α ∈ T ⋆ F α the parameter space and by R H ,T,r the representation map which associates to a network f = ( f α ) α ∈ T ⋆ ∈ F T,r the function f . Then M Tr ( H s ) = {R H ,T,r ( f ) : f ∈ F T,r } . Since F α is linearly identified with R K α for all α ∈ T ⋆ , we deduce the following property from Lemma 2.2. Lemma 2.4.
The map $\mathcal R_{H,T,r}$ is a multilinear map from the product space $\mathcal F_{T,r} = \times_{\alpha \in T^\star} \mathcal F_\alpha$ to the space of functions defined on $\mathcal X$.

2.4 Measures of complexity

When interpreting a tensor (or function) network $\mathbf f \in \mathcal F_{T,r}$ as a neural network, a classical measure of complexity is the number of neurons, which is the sum of ranks $r_\alpha$, $\alpha \in T^\star$. This leads to a first measure of complexity of a function $f = \mathcal R_{H,T,r}(\mathbf f)$ defined by
$$\mathrm{compl}_N(\mathbf f) = \sum_{\alpha \in T^\star} r_\alpha.$$
From an approximation or statistical perspective, a more natural measure of complexity for a function $f \in \mathcal M^T_r(H^s)$ is its representation complexity, that is the dimension of the corresponding parameter space $\mathcal F_{T,r}$, or the number of weights of the corresponding sum-product neural network. We let $N_\alpha = \dim(\mathcal F_\alpha)$, with $N_\alpha = s r_D$ for $\alpha = \emptyset$, $N_\alpha = r_\alpha n_\alpha$ for $\alpha \in \mathcal L(T)$ and $N_\alpha = r_\alpha \prod_{\beta \in S(\alpha)} r_\beta$ for $\alpha \in T^\star \setminus \mathcal L(T)$. Then the representation complexity of a function $f = \mathcal R_{H,T,r}(\mathbf f)$ is
$$\mathrm{compl}_C(\mathbf f) := C(T, r, H^s) = \sum_{\alpha \in T^\star} N_\alpha = s r_D + \sum_{\alpha \in \mathcal I(T)} r_\alpha \prod_{\beta \in S(\alpha)} r_\beta + \sum_{\alpha \in \mathcal L(T)} r_\alpha n_\alpha. \quad (6)$$

Remark 2.5. If $r_D = s$, the function $f^\emptyset : \mathbb R^s \to \mathbb R^s$ can be taken as the identity map, so that the parameters of $\mathcal M^T_r(H^s)$ are reduced to the set of functions $\mathbf f = (f^\alpha)_{\alpha \in T}$. In this case, we let $T^\star = T$, and the complexity is
$$\mathrm{compl}_C(\mathbf f) := C(T, r, H^s) = \sum_{\alpha \in T} N_\alpha = \sum_{\alpha \in \mathcal I(T)} r_\alpha \prod_{\beta \in S(\alpha)} r_\beta + \sum_{\alpha \in \mathcal L(T)} r_\alpha n_\alpha. \quad (7)$$

Another measure of complexity of $f = \mathcal R_{H,T,r}(\mathbf f)$ can be defined as
$$\mathrm{compl}_S(\mathbf f) = \sum_{\alpha \in T^\star} \|f^\alpha\|_{\ell_0}, \quad (8)$$
where $\|f^\alpha\|_{\ell_0}$ is the number of non-zero entries in the tensor $C^\alpha$ associated with the multilinear map $f^\alpha$. This measure of complexity takes into account a possible sparsity in tensors or in the corresponding sum-product neural network. We note that $\mathrm{compl}_S(\mathbf f) \le \mathrm{compl}_C(\mathbf f)$. These different measures of complexity lead to the definition of different approximation tools and corresponding approximation classes, see [2, 3] for tensor networks, and [19] for similar results on ReLU or RePU neural networks.
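To make the representation complexity (6) concrete, the short sketch below computes $\mathrm{compl}_C$ for a given dimension partition tree, tuple of ranks and feature dimensions. The encoding of the tree (a dictionary mapping each node, stored as a frozenset of dimensions, to the list of its children) and the function name are illustrative choices, not notation from the paper.

```python
from math import prod

def representation_complexity(children, ranks, n_feat, s=1):
    """Evaluate compl_C = C(T, r, H^s) as in formula (6).

    children : dict mapping each node alpha (a frozenset of dimensions) to the
               list of its children; leaves map to an empty list.
    ranks    : dict mapping each node alpha to its rank r_alpha.
    n_feat   : dict mapping each leaf {nu} to the number of features n_nu.
    s        : number of components of the vector-valued function.
    """
    root = max(children, key=len)              # the root is the full set D
    total = s * ranks[root]                    # contribution of C^emptyset: s * r_D
    for alpha, kids in children.items():
        if kids:                               # interior node: r_alpha * prod of children ranks
            total += ranks[alpha] * prod(ranks[beta] for beta in kids)
        else:                                  # leaf node: r_alpha * n_alpha
            total += ranks[alpha] * n_feat[alpha]
    return total

# Example: balanced binary tree over {1,...,4}, all ranks equal to 3, 10 features per variable.
leaves = [frozenset({i}) for i in range(1, 5)]
a12, a34, root = frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2, 3, 4})
children = {root: [a12, a34], a12: leaves[:2], a34: leaves[2:], **{l: [] for l in leaves}}
ranks = {alpha: 3 for alpha in children}
n_feat = {l: 10 for l in leaves}
print(representation_complexity(children, ranks, n_feat))  # prints 204 (= 3 + 81 + 120)
```

In the setting of Remark 2.5 ($r_D = s$), the first term $s\,r_D$ would simply be dropped, which gives formula (7).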
A function f ∈ M Tr ( H s ) admits infinitely many equivalent parametrizations. From the multilinearity ofthe representation map R H ,T,r (see Lemma 2.4), it is clear that the model class M Tr ( H s ) is a cone, i.e. cM Tr ( H s ) ⊂ M Tr ( H s ) for any c ∈ R , and that given some norms k · k F α on the spaces F α , α ∈ T ⋆ , we have M Tr ( H s ) = { cf : c ∈ R , f ∈ M Tr ( H s ) } , where M Tr ( H s ) are elements of M Tr ( H s ) with bounded parameters, defined by M Tr ( H s ) = { f = R H ,T,r ( f ) : f = ( f α ) α ∈ T ⋆ ∈ F T,r : , k f α k F α ≤ , α ∈ T ⋆ } . (9) We assume that the sets X ν are equipped with finite measures µ ν , for all ν ∈ D = { , . . . , d } , and the set X is equipped with the product measure µ = µ ⊗ . . . ⊗ µ d . For 1 ≤ p ≤ ∞ , we consider the space L pµ ( X ; R s )of measurable functions defined on X with values in R s , with bounded norm k · k p,µ defined by k f k pp,µ = Z X k f ( x ) k pp dµ ( x ) for 1 ≤ p < ∞ , or k f k ∞ ,µ = µ - ess sup X | f | . We also consider the space L ∞ ( X ; R s ) of functions defined on X with values in R s , with bounded norm k f k ∞ = sup x ∈X | f ( x ) | . In the following, we denote by L λ ( X ; R s ) the space L pµ ( X ; R s ) equipped with the norm k·k p,µ when λ = ( p, µ )or the space L ∞ ( X ; R s ) equipped with the norm k · k ∞ when λ = ∞ . If H ν ⊂ L λ ( X ν ) for all ν ∈ D , then H ⊂ L λ ( X ) and H s ⊂ L λ ( X ; R s ). We here study the continuity properties of the representation map R H ,T,r as a map from F T,r = × α ∈ T ⋆ F α to H s ⊂ L λ ( X ; R s ), with λ = ( p, µ ) or λ = ∞ . We consider norms k · k F α on space F α , α ∈ T ⋆ , and theproduct norm k · k F over F T,r defined by k ( f α ) α ∈ T ⋆ k F T,r = max α ∈ T ⋆ k f α k F α . From the multilinearity of R H ,T,r (Lemma 2.4), we easily deduce the following property.7 emma 3.1. Assuming
H ⊂ L λ ( X ) , with either λ = ( p, µ ) or λ = ∞ , the multilinear map R H ,T,r from F T,r to H s ⊂ L λ ( X ; R s ) is continuous and such that for all f = R H ,T,r (( f α ) α ∈ T ⋆ ) in M Tr ( H s ) , k f k λ ≤ L λ Y α ∈ T ⋆ k f α k F α for some constant L λ < ∞ independent of f defined by L λ = sup f = R H ,T,r (( f α ) α ∈ T⋆ ) k f k λ Q α ∈ T ⋆ k f α k F α . (10)We denote by B ( F α ) the unit ball of F α and by B ( F T,r ) the unit ball of F . The set M Tr ( H s ) definedby (9) is such that M Tr ( H s ) = R H ,T,r ( B ( F T,r )) . (11)We then deduce that the map R H ,T,r is Lipschitz continuous on the set M Tr ( H s ) . Lemma 3.2.
Assuming
H ⊂ L λ ( X ) , with either λ = ( p, µ ) or λ = ∞ , for all f = R H ,T,r ( f ) and ˜ f = R H ,T,r (˜ f ) in M Tr ( H s ) , k f − ˜ f k λ ≤ L λ X α ∈ T ⋆ k f α − ˜ f α k F α ≤ L λ | T ⋆ |k f − ˜ f k F T,r . Proof.
Denoting by $\{\alpha_1, \dots, \alpha_K\}$ the elements of $T^\star$, we have $f - \tilde f = \sum_{k=1}^K \mathcal R_{H,T,r}(\tilde f^{\alpha_1}, \dots, \tilde f^{\alpha_{k-1}}, f^{\alpha_k} - \tilde f^{\alpha_k}, f^{\alpha_{k+1}}, \dots, f^{\alpha_K})$. Then from Lemma 3.1, we obtain
$$\|f - \tilde f\|_\lambda \le L_\lambda \sum_{k=1}^K \|f^{\alpha_k} - \tilde f^{\alpha_k}\|_{\mathcal F_{\alpha_k}} \prod_{i < k} \|\tilde f^{\alpha_i}\|_{\mathcal F_{\alpha_i}} \prod_{i > k} \|f^{\alpha_i}\|_{\mathcal F_{\alpha_i}} \le L_\lambda \sum_{k=1}^K \|f^{\alpha_k} - \tilde f^{\alpha_k}\|_{\mathcal F_{\alpha_k}},$$
where the last inequality uses that the parameters of functions in the set defined by (9) satisfy $\|f^{\alpha_i}\|_{\mathcal F_{\alpha_i}} \le 1$ and $\|\tilde f^{\alpha_i}\|_{\mathcal F_{\alpha_i}} \le 1$. The second inequality of Lemma 3.2 follows by bounding each of the $K = |T^\star|$ terms by $\|\mathbf f - \tilde{\mathbf f}\|_{\mathcal F_{T,r}}$, which ends the proof.

Proposition 3.3. Assuming that
H ⊂ L λ ( X ) , with either λ = ∞ or λ = ( p, µ ) , ≤ p ≤ ∞ , the metricentropy of the model class M Tr ( H s ) R = { cf : c ∈ R , | c | ≤ R, f ∈ M Tr ( H s ) } (13) in L λ ( X ; R s ) is such that H ( ǫ, M Tr ( H s ) R , k · k λ ) ≤ C ( T, r, H s ) log(3 ǫ − RL λ | T ⋆ | ) . Proof.
The covering number of the unit ball B ( F α ) of the N α -dimensional space F α is such that N ( ǫ, B ( F α ) , k·k F α ) ≤ (3 ǫ − ) N α . Then the unit ball B ( F T,r ) of the product space F T,r equipped with the product topol-ogy has a covering number N ( ǫ, B ( F T,r ) , k · k F T,r ) ≤ Q α ∈ T ⋆ N ( ǫ, B ( F α ) , k · k F α ) ≤ (3 ǫ − ) C ( T,r, H s ) with C ( T, r, H s ) = P α ∈ T ⋆ N α . From the Lipschitz continuity of R H ,T,r on M Tr ( H s ) (Lemma 3.2), we deducethat N ( ǫ, M Tr ( H s ) , k · k λ ) ≤ (3 ǫ − L λ | T ⋆ | ) C ( T,r, H s ) , from which we deduce that N ( ǫ, M Tr ( H s ) R , k · k λ ) ≤ (3 ǫ − RL λ | T ⋆ | ) C ( T,r, H s ) , which ends the proof. 8f f and f are two functions from L pµ ( X ; R s ), the collection of functions f ∈ L pµ ( X ; R s ) such that f ≤ f ≤ f almost everywhere is denoted by [ f , f ] and called a bracket with extremities f and f .The diameter of the bracket [ f , f ] for the norm k · k p,µ is given by k f − f k p,µ . The bracketing number N [] ( ǫ, K, k · k p,µ ) of a set K is defined as the minimal number of brackets with diameters less than ǫ whichare necessary to cover K . The corresponding bracketing entropy is defined as H [] ( ǫ, K, k · k p,µ ) := log N [] ( ǫ, K, k · k p,µ ) . Lemma 3.4.
For any ≤ p ≤ ∞ and any compact set K in L p ( X ; R s ) , H [] ( ǫ, K, k · k p,µ ) ≤ H ( ǫ µ ( X ) − /p , K, k · k ∞ ,µ ) , where µ ( X ) is the mass of the measure µ, and µ ( X ) − /p = 1 for p = ∞ .Proof. Let γ = ǫ µ ( X ) − /p and let N be a γ -net of K for the norm k · k ∞ ,µ with cardinal N ( γ, K, k · k ∞ ,µ ).Then for any f ∈ K , there exists a ˜ f ∈ N such that k f − ˜ f k ∞ ,µ ≤ γ , which implies that f is in the bracket[ ˜ f − γ, ˜ f + γ ] with diameter k γ k p,µ = 2 γµ ( X ) /p = ǫ . Then the collection of brackets { [ ˜ f − γ, ˜ f + γ ] : ˜ f ∈ N } with diameters ǫ covers K , which implies N [] ( ǫ, K, k · k p,µ ) ≤ N ( γ, K, k · k ∞ ,µ ) , which ends the proof.From Proposition 3.3 and Lemma 3.4, we directly deduce the following result. Proposition 3.5.
For any ≤ p ≤ ∞ , the set M Tr ( H s ) R defined in (13) has a bracketing entropy H [] ( ǫ, M Tr ( H s ) R , k · k p,µ ) ≤ C ( T, r, H s ) log(6 ǫ − µ ( X ) /p RL ∞ ,µ | T ⋆ | ) , with µ ( X ) − /p = 1 for p = ∞ . Assume that
H ⊂ L λ ( X ), with either λ = ( p, µ ) or λ = ∞ . The continuity constant L λ of the map R H ,T,r defined by (10) depends on λ , the norms on F α , the chosen basis for H and also on the measure µ when λ = ( p, µ ). We here introduce a particular choice of norms and basis functions which allows to bound thecontinuity constant L λ . We consider on the space F ∅ of linear maps from R r D to R s the norm (with p = ∞ when λ = ∞ ) k f ∅ k F ∅ = max z ∈ R rD k f ∅ ( z ) k p k z k p , which coincides with the classical matrix p -norm. For any interior node α ∈ I ( T ), we introduce a norm k · k F α over the space F α of multilinear maps f α : × β ∈ S ( α ) R r β → R r α , defined by k f α k F α = max ( z β ) β ∈ S ( α ) ∈ × β ∈ S ( α ) R rβ k f α (( z β ) β ∈ S ( α ) ) k p Q β ∈ S ( α ) k z β k p . For a leaf node α ∈ L ( T ), we introduce a norm k · k F α over the space F α of linear maps f α : R n α → R r α ,defined by k f α k F α = max z α ∈ R nα k f α ( z α ) k p k z α k p . (14)We assume that for any ν ∈ D , the feature map φ ν : X ν → R n ν is such that k φ ν k λ = 1. For λ = ( ∞ , µ )(resp. λ = ∞ ), that means that basis functions φ νi ν ( x ν ) have a unit norm in L ∞ ,µ ( X ν ) (resp. L ∞ ( X ν )).For p < ∞ , that means that P n ν i =1 k φ νi k pp,µ = 1, which can be obtained by rescaling basis functions so that k φ νi k p,µ = n − /pν . Proposition 3.6.
Assume
$H \subset L_\lambda(\mathcal X)$, with either $\lambda = (p, \mu)$ or $\lambda = \infty$. With the above choice of norms and normalization of basis functions (with $p = \infty$ when $\lambda = \infty$), the continuity constant $L_\lambda$ defined by (10) is such that $L_\lambda \le 1$, and for all $1 \le q \le p$, $L_{q,\mu} \le \mu(\mathcal X)^{1/q - 1/p} L_\lambda \le \mu(\mathcal X)^{1/q - 1/p}$.

Proof. See Appendix A.1.
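As a small numerical illustration of how Propositions 3.3 and 3.6 combine, the sketch below evaluates the metric entropy bound $C(T, r, H^s)\,\log(3\,\epsilon^{-1} R\, L_\lambda\, |T^\star|)$ of Proposition 3.3, using $L_\lambda \le 1$ as granted by the normalization of this section; the numbers reuse the hypothetical example of the complexity sketch in Section 2.4 and are not taken from the paper.

```python
import math

def entropy_bound(C_Trs, radius, n_nodes, eps, L_lambda=1.0):
    """Upper bound of Proposition 3.3 on the metric entropy of M^T_r(H^s)_R:
    C(T,r,H^s) * log(3 * eps^{-1} * R * L_lambda * |T*|).
    With the norms and normalized feature maps of Section 3.3, L_lambda <= 1."""
    return C_Trs * math.log(3.0 * radius * L_lambda * n_nodes / eps)

# Complexity C = 204 and |T*| = 8 from the earlier example, radius R = 10, scale eps = 1e-2.
print(entropy_bound(204, radius=10.0, n_nodes=8, eps=1e-2))  # about 2.1e3
```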
Risk bounds for empirical risk minimization
In this section, we analyze the estimation error for tree tensor networks obtained by empirical risk mini-mization. We consider as fixed the approximation space H , the tree T and the rank r ∈ N | T | . We assumethat H ⊂ L ∞ ,µ ( X ), with X equipped with a finite measure µ . We consider the model class M Tr ( H s ) R := M of tree tensor networks with bounded parameters, with the norms defined in Section 3.3 for λ = ( ∞ , µ )( p = ∞ ). We denote by C M = C ( T, r, H s ) the representation complexity of M defined by (6) (or (7) when r D = s ). We consider a risk R ( f ) = E ( γ ( f, Z )) , where Z is a random variable taking values in Z and where γ : R X × Z → R is some contrast function.The minimizer of the risk over measurable functions defined on X is the target function f ⋆ . For f random(depending on the data), E ( γ ( f, Z )) shall be understood as an expectation E Z ( γ ( f, Z )) w.r.t. Z (conditionalto the data). Example 4.1.
For supervised learning, we consider a random variable Z = ( X, Y ) , with Y a randomvariable with values in R s , X a X -valued random variable with probability law µ . The contrast is chosen as γ ( f, ( x, y )) = ℓ ( y, f ( x )) with ℓ a loss function measuring a discrepancy between y and the prediction f ( x ) . Example 4.2.
For the problem of estimating the probability distribution of a random variable X , weconsider Z = X and s = 1 . Given the model class M , we denote by f M a minimizer over M of the risk R , and by ˆ f Mn a minimizerover M of the empirical risk b R n ( f ) = 1 n n X i =1 γ ( f, Z i ) , which is seen as an empirical process over M . We introduce the excess risk E ( f ) = R ( f ) − R ( f ⋆ ) . The excess risk for the estimator ˆ f Mn satisfies E ( ˆ f Mn ) = E ( f M ) + R ( ˆ f Mn ) − R ( f M ) , (15)where E ( f M ) is the best approximation error in M and R ( ˆ f Mn ) − R ( f M ) is the estimation error. Usingthe optimality of ˆ f Mn , we obtain that the estimation error satisfies R ( ˆ f Mn ) − R ( f M ) ≤ b R n ( f M ) − R ( f M ) − b R n ( ˆ f Mn ) + R ( ˆ f Mn ) := ¯ R n ( f M ) − ¯ R n ( ˆ f Mn ) , (16)where ¯ R n ( f ) is the centered empirical process¯ R n ( f ) = b R n ( f ) − R ( f ) = 1 n n X i =1 γ ( f, Z i ) − E ( γ ( f, Z )) . (17)To obtain bounds of the estimation error, it remains to quantify the fluctuations of the centered empiricalprocess ¯ R n ( f ). We here apply classical results to control the fluctuations of the supremum of the empirical process ¯ R n ( f )over the model class M . Assumption 4.3 (Bounded contrast) . Assume that γ is uniformly bounded over M × Z , i.e. | γ ( f, Z ) | ≤ B (18) holds almost surely for all f ∈ M , with B a constant independent of f . R n ( f ). Lemma 4.4.
Under assumption 4.3, we have that P ( ¯ R n ( f ) > ǫB ) ∨ P ( ¯ R n ( f ) < − ǫB ) ≤ e − n ǫ (19) holds for all f ∈ M .Proof. We have b R n ( f ) − R ( f ) = n P ni =1 A fi − E ( A f ), where the A fi = γ ( f, Z i ) are i.i.d. copies of therandom variable A f = γ ( f, Z ). From Assumption 4.3, we have that | A f | ≤ B almost surely, so that A f issubgaussian with parameter B and the result simply follows from Hoeffding’s inequality.A stronger assumption is required to obtain a uniform concentration inequality for the empirical process¯ R n ( f ) over M . Assumption 4.5.
Assume that γ ( · , Z ) is Lipschitz continuous over M ⊂ L ∞ ,µ ( X ; R s ) , i.e. | γ ( f, Z ) − γ ( g, Z ) | ≤ Lk f − g k ∞ ,µ (20) holds almost surely for all f, g ∈ M , with L a constant independent of f and g . Lemma 4.6.
Under Assumptions 4.3 and 4.5, we have that P ( sup f ∈ M ¯ R n ( f ) > ǫB ) ∨ P ( inf f ∈ M ¯ R n ( f ) < − ǫB ) ≤ N ǫB L e − nǫ , (21) where N ǫB L = N ( ǫB L , M, k · k ∞ ,µ ) is the covering number of M at scale ǫB L , and log N ǫB L ≤ C M log (cid:0) L B − R | T ⋆ | ǫ − (cid:1) . Proof.
See Appendix A.2.
Lemma 4.7.
Under Assumptions 4.3 and 4.5, E ( sup f ∈ M | ¯ R n ( f ) | ) ≤ B p C M r β ∨ e ) √ n ) n . with β = 6 L B − R | T ⋆ | . Proof.
See Appendix A.2.
From the properties of the centered empirical process, we can now derive upper bounds of the estimationerror in probability and in expectation.
Proposition 4.8.
Under Assumptions 4.3 and 4.5, the estimation error satisfies P ( R ( ˆ f Mn ) − R ( f M ) > ǫB ) ≤ e C M log( βǫ − ) − nǫ , where β = 6 L B − R | T ⋆ | . Moreover, E ( R ( ˆ f Mn ) − R ( f M )) ≤ B p C M r β ∨ e ) √ n ) n , and thus E ( E ( ˆ f Mn )) ≤ E ( f M ) + 4 B p C M r β ∨ e ) √ n ) n . roof. See Appendix A.2.
Proposition 4.9.
Under Assumptions 4.3 and 4.5, for any t > , with probability larger than − exp( − t ) , sup f ∈ M − ¯ R n ( f ) ≤ B p C M r L B − R | T ⋆ |√ n ) n + 2 B r t n . (22) Moreover, with probability larger than − exp( − t ) , E ( ˆ f Mn ) ≤ E ( f M ) + 8 B p C M r L B − R | T ⋆ |√ n ) n + 4 B r t n . (23) Proof.
See Appendix A.2.
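Before turning to the examples, the following minimal sketch illustrates the quantities handled in this section in a least-squares setting: the empirical risk $\hat{\mathcal R}_n$, its minimizer over a model class, and a Monte Carlo estimate of the risk $\mathcal R$ on an independent sample. For brevity the model class is a plain linear-in-features class rather than a tree tensor network, and all names and constants are illustrative, so this is only a sketch of the empirical risk minimization setting, not of the paper's estimators.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):
    """Polynomial feature map of degree 5 (an arbitrary illustrative choice)."""
    return np.vander(x, N=6, increasing=True)

def empirical_risk(theta, x, y):
    """Least-squares contrast gamma(f, (x, y)) = (y - f(x))^2 averaged over the sample."""
    return float(np.mean((y - features(x) @ theta) ** 2))

def sample(n, f_star, noise=0.05):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, f_star(x) + noise * rng.standard_normal(n)

f_star = np.sqrt                                   # target function
x_train, y_train = sample(200, f_star)

# Empirical risk minimizer over the model class (here, ordinary least squares).
theta_hat, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

# Risk estimated on an independent sample; the gap with the training value
# reflects the estimation error discussed above.
x_test, y_test = sample(10_000, f_star)
print("empirical risk:", empirical_risk(theta_hat, x_train, y_train))
print("estimated risk:", empirical_risk(theta_hat, x_test, y_test))
```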
Example 4.10 (Least-squares bounded regression) . We consider the least-squares regression setting with γ ( f, Z ) = k Y − f ( X ) k ℓ . Let µ be the distribution of X . The excess risk E ( f ) = R ( f ) −R ( f ⋆ ) = k f − f ⋆ k ,µ admits f ⋆ ( x ) = E ( Y | X = x ) as a minimizer. We assume that k Y k ℓ ∞ ≤ R almost surely. For all f ∈ M ,we have γ ( f, Z ) ≤ s k Y − f ( X ) k ℓ ∞ ≤ s ( k Y k ℓ ∞ + k f k ∞ ) , so that ≤ γ ( f, Z ) ≤ B almost surely, with B = 4 sR . Also, it holds almost surely | γ ( f, Z ) − γ ( g, Z ) | = | (2 Y − f ( X ) − g ( X ) , f ( X ) − g ( X )) ℓ |≤ k Y − f ( X ) − g ( X ) k ℓ k f ( X ) − g ( X ) k ℓ ∞ ≤ s (2 k Y k ℓ ∞ + k g k ∞ ,µ + k f k ∞ ,µ ) k f − g k ∞ ,µ . Then for all f, g ∈ M , | γ ( f, Z ) − γ ( g, Z ) | ≤ Lk f − g k ∞ ,µ with L = 4 sR . The constant β from Proposition 4.8is β = 6 | T ⋆ | . Example 4.11 ( L density estimation) . We consider the estimation of the probability law ν of X . Assum-ing that ν admits a density f ⋆ with respect to the measure µ , and assuming f ⋆ ∈ L µ ( X ) , we consider thecontrast γ ( f, x ) = k f k ,µ − f ( x ) , so that E ( f ) = R ( f ) − R ( f ⋆ ) = k f − f ⋆ k ,µ admits f ⋆ as a minimizer.We assume that µ is a finite measure on X and that f ⋆ is uniformly bounded by R . Then | γ ( f, X ) | ≤ B almost surely with B = R ( µ ( X ) R + 2) . Also, for all f, g ∈ M , we have almost surely | γ ( f, X ) − γ ( g, X ) | = |k f k ,µ − k g k ,µ − f ( X ) − g ( X )) |≤ | Z ( f − g )( f + g ) dµ | + 2 k f − g k ∞ ,µ ≤ ( k f + g k ,µ + 2) k f − g k ∞ ,µ ≤ Lk f − g k ∞ ,µ with L = 2( µ ( X ) R + 1) . Since /R ≤ L /B ≤ /R , the constant β from Proposition 4.8 is such that | T ⋆ | ≤ β ≤ | T ⋆ | . In this section, we provide an improved excess risk bound in the specific case of least squares contrasts.Our results come from Talagrand Inequalities and generic chaining bounds ; we follow the presentationgiven in the book of [27]. The excess risk bound given below strongly relies on the link between the excessrisk and the variance of the excess loss (see Inequality ( ?? ) in the proof of Proposition 4.12), as explainedin Chapter 5 of [27] and Chapter 8 in [28].Let γ be either the least squares contrast in the bounded regression setting (as described in Exam-ple 4.10, with s = 1), or the least squares contrast for density estimation (as described in Example 4.11).In particular, note that in the regression setting it is assumed that k Y k ℓ ∞ ≤ R almost surely.12s before, we consider the model class M = M Tr ( H ) R of tree tensor networks with bounded parameters.Contrary to the two previous subsections, it is now assumed that H ⊂ L ∞ ( X ) equipped with the norm k·k ∞ and we still use the normalization of the parameters with λ = ∞ ( p = ∞ ) introduced in Section 3.3. Notethat L ∞ ( X ) ⊂ L ∞ ,µ ( X ), where µ is the distribution of the X i ’s in the regression setting (see Example 4.10)or the reference measure for density estimation (see Example 4.11). In particular, in this setting k f k ∞ ,µ ≤k f k ∞ < ∞ for any f ∈ H . Proposition 4.12.
Under the previous assumptions, there exists an absolute constant A and a constant κ such that for any ε ∈ (0 , and any t > , with probability at least − A exp( − t ) , it holds E ( ˆ f Mn ) ≤ (1 + ε ) E ( f M ) + κR n (cid:20) a T C M ε log + (cid:18) nε a T C M (cid:19) + tε (cid:21) (24) where a T = 1 + log + (cid:16) | T ⋆ | e (cid:17) , and κ depends on linearly on µ ( X ) . Then by integrating according to t , weobtain that for any ε ∈ (0 , , E E ( ˆ f Mn ) ≤ (1 + ε ) E ( f M ) + κR n (cid:20) a T C M ε log + (cid:18) nε a T C M (cid:19) + A ε (cid:21) . Proof.
See Appendix A.2.1.Note that the term a T is upper bounded by a term of the order of log d because | T ∗ | ≤ d . Thus theconstants in the risk bound (24) does not explode with the dimension d in regression. Note however thatin density estimation, the constant κ depends linearly on the mass µ ( X ) of the reference measure, whichmay grow exponentially with d . We now consider a family of approximation spaces H m = H sm ⊂ L ∞ ,µ ( X ), m ∈ M , with X equipped witha finite measure µ , as in Sections 4.1 and 4.2. Let ( M m ) m ∈M be a given family of tree tensor networks with M m = M T m r m ( H m ) R and where the parameters are bounded according to the norms defined in Section 3.3for λ = ( ∞ , µ ) ( p = ∞ ). Each model m has a particular tree T m , a rank r m , an approximation space H m , and a radius R . We denote by C m = C ( T m , r m , H m ) the corresponding representation complexity. Forsome m ∈ M , we let f m be a minimizer of the risk over M m , f m ∈ arg min f ∈ M m R ( f ) , and ˆ f m be a minimizer of the empirical risk over M m , ˆ f m ∈ arg min f ∈ M m b R n ( f ) . At this stage of the procedure, we have at hand a family of predictors ˆ f m and our goal is to provide astrategy for selecting a good predictor. To this aim, we make use of the model selection approach of Barron,Birg´e and Massart. More precisely, we adapt a general theorem from [28] to our problem. Similar modelselection strategies can be found in [34, 21, 11], see also [9] for an application to the selection of principalcurves.Given some penalty function pen : M → R + , we define ˆ m as the minimizer over M of the criterioncrit( m ) := b R n ( ˆ f m ) + pen( m ) , (25)and we finally select the predictor ˆ f ˆ m . With µ ( X ) = 1 for regression. ssumption 5.1. We consider a family of positive weights ( x m ) m ∈M over the family of models such that Σ = X m ∈M exp( − x m ) < ∞ . This assumption and the choice of the weights is the discussed further in Section 5.3.
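The selection rule (25) is straightforward to implement once the empirical risks and the complexities of the fitted predictors are available. The sketch below assumes hypothetical lists `empirical_risks` and `complexities` (one entry per model $m$) and uses a penalty shape $C_m/n$ as in the bounded least-squares setting discussed below; replacing it by $\sqrt{C_m/n}$ gives the general-setting variant.

```python
import numpy as np

def select_model(empirical_risks, complexities, n, lam, shape="linear"):
    """Return the index m_hat minimizing crit(m) = R_n(f_hat_m) + pen(m),
    with pen(m) = lam * C_m / n ("linear") or lam * sqrt(C_m / n) ("sqrt")."""
    risks = np.asarray(empirical_risks, dtype=float)
    C = np.asarray(complexities, dtype=float)
    pen = lam * (C / n if shape == "linear" else np.sqrt(C / n))
    return int(np.argmin(risks + pen))

# Hypothetical values for a small collection of fitted models.
empirical_risks = [1.2e-3, 4.0e-4, 2.5e-4, 2.4e-4]
complexities = [40, 90, 160, 300]
print(select_model(empirical_risks, complexities, n=1000, lam=1e-3))  # prints 2
```

In practice the constant $\lambda$ is not chosen by hand but calibrated with the slope heuristics method presented in Section 5.5.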
We follow a standard strategy that corresponds to the so-called Vapnik's structural minimization of the risk method (see for instance [28, Section 8.2]) to choose the penalty function and derive a risk bound for the estimator selected by the criterion (25). By definition of $\hat m$, for any $m \in \mathcal M$,
$$\hat{\mathcal R}_n(\hat f_{\hat m}) + \mathrm{pen}(\hat m) \le \hat{\mathcal R}_n(\hat f_m) + \mathrm{pen}(m) \le \hat{\mathcal R}_n(f_m) + \mathrm{pen}(m).$$
Therefore,
$$\hat{\mathcal R}_n(\hat f_{\hat m}) \le \hat{\mathcal R}_n(f_m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m)$$
and thus
$$\mathcal R(\hat f_{\hat m}) + \bar{\mathcal R}_n(\hat f_{\hat m}) \le \mathcal R(f_m) + \bar{\mathcal R}_n(f_m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m),$$
where $\bar{\mathcal R}_n(f)$ is the centered empirical process defined in (17). We finally derive the following upper bound on the excess risk:
$$\mathcal E(\hat f_{\hat m}) \le \mathcal E(f_m) + \bar{\mathcal R}_n(f_m) - \bar{\mathcal R}_n(\hat f_{\hat m}) - \mathrm{pen}(\hat m) + \mathrm{pen}(m). \quad (26)$$
We now provide a risk bound for a model selection strategy based on the criterion (25) with a suitable choice of penalty.

Theorem 5.2.
Under Assumptions 4.3, 4.5 and 5.1, if the penalty is such that pen( m ) ≥ λ m r C m n + 2 B r x m n , (27) with λ m = 4 B q L B − R | T m | ⋆ √ n ) , then the estimator ˆ f ˆ m selected according to the criterion (25) satisfies the following risk bound E ( E ( ˆ f ˆ m )) ≤ inf m ∈M {E ( f m ) + pen( m ) } + B Σ r π n . Proof.
See Appendix A.3.Theorem 5.2 gives a strong justification for using a penalty proportional to p C m /n , at least for nottoo large family of models. However, it is known that the Vapnik’s structural minimization of the riskmay lead to suboptimal rates of convergence. For instance, in the bounded regression setting, it is knownthat a penalty proportional to the VapnikChervonenkis dimension (typically in O ( C m /n ) leads to minimaxrates of convergence in various setting (see for instance Chapter 12 in [21]) whereas Vapnik’s structuralminimization of the risk (typically with penalty in O ( p C m /n )) is too pessimistic to provide fast rates ofconvergence. Note that the approach of [21] is based on a truncation strategy which is not easy to calibratein practice. In the next section, we give an improved model selection result for least squares inference.14 .2 Oracle inequalities for least squares inference on tree tensor networks In this section, we give an improved model selection result for least squares inference based on Proposi-tion 4.12. This corresponds to the approach presented in Sections 8.3 and 8.4 of [28] or in Section 6.3 of[27].We consider least squares density estimation and least squares bounded regression ( s = 1) in the sameframework as Section 4.3: we now consider a family of approximation spaces H m ⊂ L ∞ ( X ) with s = 1 andequipped with the norm k · k ∞ . We use the same normalization of the parameters with p = ∞ ( λ = ∞ ) asintroduced in Section 3.3. As before we consider a family of tree tensor networks ( M m ) m ∈M where eachmodel M m = M T m r m ( H m ) R has a particular tree T m , a rank r m , an approximation space H m , and a radius R . Theorem 5.3.
In the setting of Proposition 4.12 and under Assumption 5.1, there exists numerical con-stants K and K and K such that if the penalty satisfies pen( m ) = K R (cid:20) a m C m nε log nε a m C m + x m nε (cid:21) with a m = 1+log + (cid:16) | T ⋆m | e (cid:17) , then the estimator ˆ f ˆ m selected according to the penalized criterion (25) satisfiesthe following oracle inequality E E ( ˆ f ˆ m ) ≤ ε − ε inf m ∈M (cid:26) E ( f m ) + K R (cid:20) a m C m nε log nε a m C m + x m nε (cid:21)(cid:27) + K R Σ n εε (1 − ε ) . (28) Proof.
The proof is adapted from Theorem 6.5 in [27], see Appendix A.3.This theorem provides an improved oracle inequality bound with a penalty in C m n , up to logarithmicterms. In Section 5.4, we will derive adaptive optimal rates of convergence (in the minimax sense) fromthis model selection result. In Section 6 we illustrate how to calibrate the penalty in practice using theslope heuristics method. The weights x m represent the price to pay for the richness of the model collection, when there are manymodels with the same complexity C m . A typical choice for the weights is x m = x ( C m ) with a weightfunction x such that x ( c ) ≥ βc + log( N c ) , where N c = |{ m ∈ M : C m = c }| is the number of models with complexity c , and β some positive constant.With such a choice, Σ = P m ∈M exp( − x m ) = P c ≥ N c exp( − x ( c )) ≤ ( e β − − , so that Assumption 5.1 issatisfied. With such a weight function, if the model collection is not too rich, the weight x m is comparableto or smaller than the complexity C m .We restrict the following analysis to the case where the approximation space is fixed: H m = H s for any m ∈ M and we only consider binary trees, for which | T m | = 2 d − T is fixed and we need to upper bound the number N c of modelshaving the complexity c to define the weights. According the definition of the representation complexitygiven in Section 2.4, a format with complexity c satisfies c = X α ∈ T ⋆ N α = sr D + X α ∈I ( T ) r α Y β ∈ S ( α ) r β + X α ∈L ( T ) r α n α . (29)15he number of triplets of integers ( k , k , k ) such that the product k k k is less than an integer q α isclearly less than q α . So, the number of formats such that N α = q α for any α ∈ T ⋆ is less than Y α ∈ T ⋆ q α ≤ Y α ∈ T ⋆ q α ! / | T ⋆ | | T ⋆ | ≤ " | T ⋆ | X α ∈ T ⋆ q α | T ⋆ | ≤ (cid:20) c | T ⋆ | (cid:21) | T ⋆ | . Moreover, the number of tuple of integers ( q α ) α ∈ T ⋆ satisfying P α ∈ T ⋆ q α = c is (cid:0) c + | T ⋆ | c (cid:1) . For a fixed binarytree, the number N c of all possible formats of complexity c is thus such that N c ≤ (cid:18) c + | T ⋆ | c (cid:19) (cid:20) c | T ⋆ | (cid:21) | T ⋆ | . Using the inequality log (cid:18) kℓ (cid:19) ≤ ℓ (1 + log kℓ ) , (30)and the fact that | T ⋆ | ≤ C m for any model m in the collection, we obtainlog( N c ) ≤ c (1 + log( c + | T ⋆ | c )) + 2 | T ⋆ | log( c | T ⋆ | ) ≤ c (1 + log(2)) + 4 d log( c ) . c. Then for a given binary tree T , we finally take a weight function x ( c ) = ηc (31)for some η >
0. In the situation where all the formats of the collection rely on a same tree T , using theweight function given in (31), Theorem 5.3 shows that we can use a penalty proportional to C m .Leaving aside the computational aspects for the moment (see Section 5.6), we now consider the situationwhere the formats of the collection rely on several possible trees T . The number of binary dimensionpartition trees (or full binary trees) with d leaves is the Catalan number d (cid:0) d − d − (cid:1) . The number N c ofpossible formats of complexity c based on all possible binary dimension partition trees with d leaves is thussuch that N c ≤ d (cid:18) d − d − (cid:19)(cid:18) c + 2 dc (cid:19) h c d i d . Using again Inequality (30) and the fact that | T ⋆ | = 2 d ≤ C m for any model m in the collection, we obtain N c ≤ d (1 + log(2)) + c (1 + log(2)) + 4 d log( c ) . c and we finally propose the weight function x ( c ) = ηc (32)for some η >
0. In the situation where a large number of trees has been explored, we still can use penaltiesproportional to the format complexity C m . For each dimension ν ∈ { , . . . , d } , we consider approximation tools ( H ν,p ν ) p ν ∈ N for functions of the variable x ν , and we let ( H p ) p ∈ N d be the corresponding approximation tool for multivariate functions, where H p = H ,p ⊗ . . . ⊗ H d,p d .For adaptive methods in p and r (with fixed tree T ), we define an approximation toolΦ = (Φ c ) c ∈ N , Φ c = { f = R H p ,T,r ( f ) : f ∈ F T,r : r ∈ N r , p ∈ N d , compl( f ) ≤ c } , where compl( f ) is a measure of complexity of the network f , and Φ c is the set of functions with associatednetwork with complexity less than c . 16or tree adaptive methods, we define the sets Φ c asΦ c = { f = R H p ,T,r ( f ) : f ∈ F T,r : T ∈ T , r ∈ N r , p ∈ N d , compl( f ) ≤ c } , where T is a collection of possible dimension trees.The best approximation error by a tensor network with complexity less than c is defined by e c ( f ⋆ ) = inf f ∈ Φ c R ( f ) − R ( f ⋆ ) . Then given a growth function γ : N → N , an approximation class for tree tensor networks can be definedas the set A γ = { f ⋆ : sup c ≥ γ ( c ) e c ( f ⋆ ) < ∞} , which corresponds to functions that can be approximated with tree tensor networks with a convergence in O ( γ ( c ) − ).The approximation class A γ depends on the measure of complexity of the network, and on wether ornot tree adaptation is considered. Natural measures of complexity of a network f are the representationcomplexity compl C ( f ) = C ( T, r, H p ) or the sparse representation complexity compl S ( f ) (see Section 2.4).When considering the complexity measure compl = compl C , we easily derive from Theorem 5.2 or 5.3upper bounds on the rates of convergence of our model selection procedure for functions in A γ by balancingthe penalty term and the approximation term in the risk bounds.Next we provide examples that shows that minimax rates can be achieved by tensor networks forclassical smoothness classes. In all examples, we consider a least-squares setting with real valued functions( s = 1), where R ( f ) − R ( f ⋆ ) is the squared L norm of f − f ⋆ . Consider a function f ⋆ in the Sobolev space H r , r ∈ N ,of functions on (0 , d or the d -dimensional torus T d , and optimal approximation tools ( H ν,p ν ) p ν ∈ N forunivariate Sobolev functions (e.g., splines or trigonometric polynomials). For any fixed tree T , and whenconsidering the representation complexity measure compl C , we have e c ( f ⋆ ) = O ( c − rd )(see, e.g., [5]), andtherefore H r is included in A γ with γ ( c ) = c rd . In the setting of Theorem 5.3, for f ⋆ in the Sobolevspace H r over (0 , d , and when considering the family of all possible formats, we find that the rate ofconvergence of ˆ f ˆ m is of order n − r r + d log( n ) r r + d which is known to be the minimax rate of convergence over H r (up to the logarithmic term). Our model selection procedure (with variable p and r ) therefore achievesminimax rates for Sobolev spaces of any order, and is thus minimax adaptive to the regularity over Sovolevspaces. Sobolev spaces of multivariate functions with dominating mixed smoothness.
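For the reader's convenience, here is the standard balancing argument behind the Sobolev rate quoted above, written with the exponents spelled out and all constants and logarithmic factors dropped; it assumes the approximation rate $e_c(f^\star) = O(c^{-2r/d})$ for $H^r$ functions and is only a heuristic summary, not a substitute for the proof.
$$\mathbb E\,\mathcal E(\hat f_{\hat m}) \;\lesssim\; \inf_{c \ge 1}\Big\{ e_c(f^\star) + \frac{c}{n} \Big\} \;\lesssim\; \inf_{c \ge 1}\Big\{ c^{-2r/d} + \frac{c}{n} \Big\}, \qquad c^\ast \asymp n^{\frac{d}{2r+d}} \;\Longrightarrow\; \mathbb E\,\mathcal E(\hat f_{\hat m}) \;\lesssim\; n^{-\frac{2r}{2r+d}}.$$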
Consider a func-tion f ⋆ in the mixed Sobolev space H rmix , r ∈ N , on the d -dimensional torus T d , and optimal approximationtools ( H ν,p ν ) p ν ∈ N for univariate Sobolev functions on T (e.g., trigonometric polynomials). For a fixed binarytree T , when considering the complexity measure compl C , we have e c ( f ⋆ ) = O ( c − r log( c ) dr ) (see [32, 5]),and therefore, the space H rmix is included in A γ with γ ( c ) = c r log( c ) − dr . In the bounded regressionframework of Theorem 5.3, our model selection procedure shows a rate of convergence upper boundedby n − r r (log n ) rd . To our knowledge, the minimax rates of convergence over mixed Sobolev spaces areunknown for regression. However, the results of [29] for Gaussian white noise model as well as the resultsof [1] for density estimation suggest that these rates should be of the order of n − r r , up to a logarithmicterm. In fact, the minimax rate can not be obtained by our strategy since the rate of approximation errorin O ( c − r ) (up to logarithmic terms) is not the optimal rate of convergence which is in O ( c − r ) (up tologarithmic terms), the latter rate being achieved by hyperbolic cross approximation [15].17n optimal rate should probably be achieved with tree tensor networks by further exploiting sparsityin the tensors, and using the corresponding measure of complexity compl S . Indeed, optimal approximationrates should be obtained by shallow tensor networks (associated with a trivial tree) with a sparse tensor C D with O ( c ) non zero entries, and a sparsity pattern based on hyperbolic crosses. Then noting that sucha shallow network (which is a canonical tensor format with rank O ( c )) can be encoded within a tree tensornetwork with sparse tensors and the same overall complexity compl S in O ( c ) , minimax rates (up to logterms) should probably be obtained for mixed Sobolev space H rmix for any tree T , when combined with anestimate of the metric entropy of sets Φ c with the complexity measure compl S . Tree tensor networks can be used for the approximation of univariate functions after identification ofa function f ∈ L (0 ,
1) with an order- d tensor (or d -variate function) in R ⊗ . . . ⊗ R ⊗ L (0 ,
1) :=( R ) ⊗ d ⊗ L (0 , S in L (0 , S = P m , we define a tensor subspace ( R ) ⊗ d ⊗ P m = H d,m , which is isometricallyidentified with the space of univariate splines of degree m over a uniform partition of [0 ,
1] into 2 d intervals.An approximation tool is then defined by considering tensor networks in the tensor spaces H d,m withvariable d and fixed m . In [2], the authors consider tensor networks associated with linear trees, that is thetensor train format (or equivalently, recurrent sum-product neural networks). The variable d setting canbe interpreted as the tree adaptive setting presented above, where the family of trees T = { T d : d ∈ N } ,with T d the linear tree over { , . . . , d } with interior nodes { , . . . , ν } , 2 ≤ ν ≤ d .The following results are based on results from [3, Main results 3.1, 3.2 and 3.4] for Sobolev, Besov oranalytic functions. Sobolev spaces of univariate functions.
For functions f ⋆ in the Sobolev space H r of univariatefunctions on (0 , C , the approximation error e c ( f ⋆ ) = O ( c − r ) achieves the best possible approximation rate , that is H r is included in A γ with γ ( r ) = n r for any r ∈ N . Together with Theorem 5.3, we find that ˆ f ˆ m achieves a convergence in n − r r +1 (up to logarithmicterm). This shows that our model selection procedure (with variable d and fixed m , in particular m = 0)achieves minimax rates (up to logarithmic terms) for Sobolev spaces of any order r (without adapting thedegree m to the regularity of f ⋆ ). Besov spaces.
Near optimal approximation rates are also obtained for Besov spaces of univariatefunctions on (0 , f ⋆ in the Besov space B ατ,τ , with α > τ = ( r + 1 / − the Sobolev embedding number.When considering the complexity measure compl C , we have B ατ,τ ⊂ A γ with γ ( c ) = c α − ǫ for arbitrary ǫ > α > f ˆ m achieves a convergence in n − α − ǫα +1 (up tologarithmic term), which are close (but not equal to) minimax rates in n − α α +1 (up to log terms).Note that when considering the complexity measure compl S , we show B ατ,τ ⊂ A γ with γ ( c ) = c α − ǫ forarbitrary ǫ >
0, which is arbitrarily close to optimal approximation rates. Therefore, a strategy takinginto account sparsity of tensors could be able to achieve rates arbitrarily close to minimax rates for Besovspaces B ατ,τ of arbitrary smoothness α (without the need of adapting m to the regularity of f ⋆ ). Analytic functions.
For a function f ⋆ analytic on an open interval containing [0 ,
1] and when consideringthe complexity measure compl C , the approximation error converges exponentially fast as e c ( f ⋆ ) = O ( ρ − c / )for some ρ >
1. That means f ⋆ ∈ A γ with γ ( c ) = ρ c / . Together with Theorem 5.3, we find that ˆ f ˆ m also obtained by other tools such as splines of degree greater than r-1 n − log( n ) (up to logarithmic term). This is known to be the minimax rate fornonparametric estimation of analytic densities [8]. The aim of the slope heuristics method proposed by Birg´e and Massart [10] is precisely to calibrate penaltyfunction for model selection purposes. See [7] and [4] for a general presentation of the method. This methodhas shown very good performances and comes with mathematical guarantees in various settings amongother for non parametric Gaussian regression with i.i.d. error terms, see [10, 4] and references therein. Theslope heuristics have several versions (see [4]).The aim is to tune the constant λ in a penalty of the form pen( m ) = λ pen shape ( m ) where pen shape is aknown penalty shape. Let ˆ m ( λ ) be the model selected by penalized criterion with constant λ :ˆ m ( λ ) ∈ argmin m ∈M n b R n ( ˆ f m ) + λ pen shape ( m ) o . Let C m denote the complexity of the model. The complexity jump algorithm consists of the followingsteps:1. Compute the function λ ˆ m ( λ ),2. Find the constant ˆ λ cj > λ C ˆ m ( λ ) ,3. Select the model ˆ m = ˆ m (2ˆ λ cj ) such thatˆ m ∈ arg min m ∈M n b R n ( ˆ f m ) + 2ˆ λ cj pen shape ( m ) o . The exploration of all possible model classes M Tr ( H s ) with a complexity bounded by some c is intractablesince the number of such models is exponential in the number of variables d . Therefore, strategies shouldbe introduced to propose a set of candidate model classes M m , m ∈ M .In practice, a possible approach is to rely on adaptive learning algorithms from [18] (see also [17]) thatgenerate predictors ˆ f m (minimizing the empirical risk) in a sequence of model classes. For a fixed tree T , the proposed algorithm generates a sequence of model classes M m = M Tr m ( H sm ) withincreasing ranks r m , m ≥
1, by successively increasing the α -ranks for nodes α associated with the highest(estimated) truncation errors inf rank α ( f ) ≤ r m,α R ( f ) − R ( f ⋆ ) . For each m , the background approximation space is taken as H m := H p m = H ,p m, ⊗ . . . ⊗ H d,p m,d , wherefor each dimension ν ∈ { , . . . , d } , ( H ν,k ) k ∈ N is a given approximation tool (e.g., polynomials, wavelets).Exploring all possible tuples p m is again a combinatorial problem. The algorithm proposed in [18, 17]relies on a validation approach for the selection of a particular tuple. Note that a complexity-based modelselection method could also be considered for the selection of a tuple p m .19 .6.2 Variable tree Although the set of possible dimension trees over { , . . . , d } is finite, exploring this whole set of dimensiontrees is intractable for high and even moderate d . In [18], a stochastic algorithm has been proposed foroptimizing the dimension tree for the compression of a tensor. This tree optimization algorithm has beencombined with the rank-adaptive strategy discussed above. The resulting algorithm generates a sequenceof predictors in tree tensor networks associated with different trees. In the numerical experiments, weuse this learning algorithm with tree adaptation to generate a set of candidate trees. Then the learningalgorithm with rank adaptation but fixed tree is used with each of these trees. In this section, we illustrate the proposed model selection approach for supervised learning problems in aleast-squares regression setting. Y is a real-valued random variable ( s = 1) defined by Y = f ⋆ ( X ) + ǫ where ǫ is independent of X and has zero mean and standard deviation γσ ( f ⋆ ( X )). The parameter γ therefore controls the noise level in relative precision.For a given training sample, we use the learning strategies described in Section 5.6 that generate asequence of predictors ˆ f m , m ∈ M , associated with a certain collection of models M (which depends onthe training sample). Given a set of predictors ˆ f m , m ∈ M , we denote by ˆ m ⋆ the index of the model thatminimizes the risk over M , i.e. ˆ m ⋆ ∈ arg min m ∈M R ( ˆ f m ) . The model ˆ m ⋆ is the oracle model in M for a given training sample.We also denote by ˆ m ( λ ) the model such thatˆ m ( λ ) ∈ argmin m ∈M n b R n ( ˆ f m ) + λ pen shape ( m ) o , where pen shape ( m ) = C m /n , and by ˆ m = ˆ m (2ˆ λ cj ) the model selected by our model selection strategy, whereˆ λ cj is calibrated with the complexity jump algorithm (see Section 5.5).We consider two different types of problems: the approximation of univariate functions defined on(0 , R d (Section 6.2).For a given function f , the risk R ( f ) is evaluated using a sample of size 10 independent of the trainingsample. Statistics of complexities and risks (such as the expected complexity E ( C ˆ m ) or the expected risk E ( R ( ˆ f ˆ m ))) are computed using 20 different training samples. Here we consider tree tensor networks for the approximation of a univariate function in L (0 , f defined on (0 ,
6 Numerical experiments

In this section, we illustrate the proposed model selection approach for supervised learning problems in a least-squares regression setting. $Y$ is a real-valued random variable ($s = 1$) defined by
$$ Y = f^\star(X) + \epsilon, $$
where $\epsilon$ is independent of $X$ and has zero mean and standard deviation $\gamma\, \sigma(f^\star(X))$, with $\sigma(f^\star(X))$ the standard deviation of $f^\star(X)$. The parameter $\gamma$ therefore controls the noise level in relative precision.

For a given training sample, we use the learning strategies described in Section 5.6, which generate a sequence of predictors $\hat f_m$, $m \in \mathcal M$, associated with a certain collection of models $\mathcal M$ (which depends on the training sample). Given a set of predictors $\hat f_m$, $m \in \mathcal M$, we denote by $\hat m^\star$ the index of the model that minimizes the risk over $\mathcal M$, i.e.
$$ \hat m^\star \in \operatorname*{arg\,min}_{m \in \mathcal M} R(\hat f_m). $$
The model $\hat m^\star$ is the oracle model in $\mathcal M$ for a given training sample. We also denote by $\hat m(\lambda)$ the model such that
$$ \hat m(\lambda) \in \operatorname*{arg\,min}_{m \in \mathcal M} \Big\{ \widehat R_n(\hat f_m) + \lambda\, \mathrm{pen}_{\mathrm{shape}}(m) \Big\}, $$
where $\mathrm{pen}_{\mathrm{shape}}(m) = C_m/n$, and by $\hat m = \hat m(2\hat\lambda_{cj})$ the model selected by our model selection strategy, where $\hat\lambda_{cj}$ is calibrated with the complexity jump algorithm (see Section 5.5).

We consider two different types of problems: the approximation of univariate functions defined on $(0,1)$ and identified with tensors by tensorization (Section 6.1), and the approximation of multivariate functions defined on a subset of $\mathbb R^d$ (Section 6.2). For a given function $f$, the risk $R(f)$ is evaluated using a large test sample independent of the training sample. Statistics of complexities and risks (such as the expected complexity $\mathbb E(C_{\hat m})$ or the expected risk $\mathbb E(R(\hat f_{\hat m}))$) are computed using 20 different training samples.
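A condensed Python sketch of this experimental protocol is given below. The piecewise constant learner only stands in for the tree tensor network learning algorithms of Section 5.6 (which are not reproduced here), and the sample sizes and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sample(f_star, n, gamma):
    """Draw (X, Y) with Y = f*(X) + eps, eps centered with std gamma * std(f*(X))."""
    X = rng.uniform(0.0, 1.0, n)
    fX = f_star(X)
    return X, fX + gamma * np.std(fX) * rng.standard_normal(n)

def fit_piecewise_constant(X, Y, n_bins):
    """Stand-in learner: least-squares fit in a space of piecewise constants."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(X, edges) - 1, 0, n_bins - 1)
    values = np.array([Y[idx == b].mean() if np.any(idx == b) else 0.0 for b in range(n_bins)])
    return lambda x: values[np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)]

def risk(f, f_star, gamma, n_test=10**5):
    """Monte Carlo estimate of R(f) = E[(Y - f(X))^2] on an independent test sample."""
    X, Y = make_sample(f_star, n_test, gamma)
    return np.mean((Y - f(X)) ** 2)

f_star, gamma = np.sqrt, 0.1
X_train, Y_train = make_sample(f_star, n=200, gamma=gamma)
models = {m: fit_piecewise_constant(X_train, Y_train, n_bins=2**m) for m in range(1, 8)}
risks = {m: risk(f, f_star, gamma) for m, f in models.items()}
emp_risks = {m: np.mean((Y_train - f(X_train)) ** 2) for m, f in models.items()}
m_oracle = min(risks, key=risks.get)       # oracle model for this training sample
print("oracle model:", m_oracle, "risk:", risks[m_oracle])
```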
6.1 Tensorized functions

Here we consider tree tensor networks for the approximation of a univariate function in $L^2(0,1)$. A function $f$ defined on $(0,1)$ can be linearly identified with a function $\bar f = T_l(f)$ of $l+1$ variables defined on $\{0,1\}^l \times (0,1)$, such that
$$ f(x) = T_l(f)(i_0, \ldots, i_{l-1}, y) \quad \text{for} \quad x = 2^{-l}\Big( \sum_{k=0}^{l-1} i_k 2^k + y \Big). $$
$T_l$ is called the tensorization map at level $l$.
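The following sketch makes the tensorization map explicit: it computes the binary digits $(i_0, \ldots, i_{l-1})$ and the residual coordinate $y$ associated with a point $x \in (0,1)$, and inverts the map. The convention that $i_k$ is the $k$-th least significant digit follows the formula above; the paper's indexing convention may differ, so treat it as an assumption.

```python
import numpy as np

def tensorize_point(x, l):
    """Map x in (0,1) to (i_0, ..., i_{l-1}, y) with x = 2^{-l} (sum_k i_k 2^k + y), y in [0,1)."""
    i = min(int(np.floor(2**l * x)), 2**l - 1)   # index of the dyadic interval containing x
    y = 2**l * x - i                             # local coordinate within that interval
    bits = [(i >> k) & 1 for k in range(l)]      # i_k = k-th binary digit of i
    return bits, y

def detensorize_point(bits, y, l):
    """Inverse map: recover x from (i_0, ..., i_{l-1}, y)."""
    i = sum(b << k for k, b in enumerate(bits))
    return 2.0**(-l) * (i + y)

x, l = 0.7133, 4
bits, y = tensorize_point(x, l)
print(bits, y, detensorize_point(bits, y, l))    # recovers x = 0.7133
```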
This allows to isometrically identify the space $L^2(0,1)$ with the tensor space $\mathbb R^2 \otimes \cdots \otimes \mathbb R^2 \otimes L^2(0,1)$ of order $d = l + 1$. Then we consider the approximation space $H_l = \mathbb R^2 \otimes \cdots \otimes \mathbb R^2 \otimes \mathcal P_0$ of $d$-variate functions $\bar f(i_0, \ldots, i_{l-1}, y)$ independent of the variable $y$. The space $H_l$ is linearly identified with the space of piecewise constant functions on the uniform partition of $(0,1)$ into $2^l$ intervals.
Then we consider the model classes $\mathcal M_{l,T,r} = \{ f : T_l(f) \in \mathcal M^T_r(H_l) \}$, which are piecewise constant functions whose tensorized version $T_l(f)$ is in a particular tree-based tensor format. In the following experiments, for each level $l$ in a given range, we consider a fixed linear binary tree $T$ (with interior nodes $\{1, \ldots, k\}$, $1 \le k \le l+1$) and use the rank adaptive learning algorithm (Section 5.6.1) to produce a sequence of 25 approximations with increasing ranks.

Three functions $f^\star(x)$ are considered. The first function, $f^\star(x) = \sqrt x$, is analytic on the open interval $(0,1)$ and its derivative has a singularity at zero. The second function is analytic on a larger open interval including $[0,1]$. The third function belongs to the Sobolev space $H^1(0,1)$. For all functions, the proposed model selection approach shows a very good performance: it selects with high probability a model with a risk very close to the risk of the oracle $\hat f_{\hat m^\star}$.
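Low tree ranks of the tensorized function are what makes these model classes effective, and this can be checked numerically. The sketch below computes the numerical ranks of the unfoldings $\{1,\ldots,k\}$ versus the remaining variables for the tensorization of $\sqrt x$ at level $l$, which for the linear tree used here correspond to the $\alpha$-ranks of the interior nodes. Midpoint sampling is used instead of an exact $L^2$ projection onto $H_l$, and the constant variable $y$ is dropped; both are simplifications for illustration.

```python
import numpy as np

def tensorize_values(f, l):
    """Values of f at the midpoints of the 2^l dyadic intervals, as an order-l tensor of mode size 2."""
    x = (np.arange(2**l) + 0.5) / 2**l
    return f(x).reshape((2,) * l)

def unfolding_ranks(tensor, tol=1e-10):
    """Numerical ranks of the matricizations {1,...,k} vs {k+1,...,l} of an order-l tensor."""
    l = tensor.ndim
    ranks = []
    for k in range(1, l):
        s = np.linalg.svd(tensor.reshape(2**k, 2**(l - k)), compute_uv=False)
        ranks.append(int(np.sum(s > tol * s[0])))
    return ranks

print("unfolding ranks of tensorized sqrt(x), l = 10:",
      unfolding_ranks(tensorize_values(np.sqrt, 10)))
```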
6.1.1 Tensorized function $f^\star(x) = \sqrt x$

We consider the function $f^\star(x) = \sqrt x$, which is analytic on the open interval $(0,1)$. Figures 3 and 4 illustrate the slope heuristics for different values of $n$ and of the noise level. Tables 1 and 2 show expectations of complexities and errors for the selected estimator and illustrate the very good performance of the approach when compared to the oracle.

Figure 3: Slope heuristics for the tensorized function $f^\star(x) = \sqrt x$ with $n = 200$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

$n$ | $\mathbb E(C_{\hat m^\star})$ | $\mathbb E(C_{\hat m})$ | $\mathbb E(R(\hat f_{\hat m^\star}))$ | $\mathbb E(R(\hat f_{\hat m}))$
100 | 123.2 | 91.6 | 1.6e-05 | 5.0e-05
200 | 163.8 | 165.0 | 3.0e-06 | 5.1e-06
500 | 182.2 | 182.6 | 9.2e-07 | 1.2e-06
1000 | 190.2 | 228.5 | 7.1e-07 | 1.4e-06

Table 1: Expectation of complexities and risks of the model selected by the slope heuristics, with the function $f^\star(x) = \sqrt x$, different values of $n$ and a fixed noise level $\gamma$.

Figure 4: Slope heuristics for the tensorized function $f^\star(x) = \sqrt x$ with $n = 1000$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

Table 2: Expectation of complexities and risks of the model selected by the slope heuristics, with the function $f^\star(x) = \sqrt x$, different values of $\gamma$ and $n = 1000$.

6.1.2 Second tensorized function

We consider a second function $f^\star$, which is analytic on an open interval containing $[0,1]$. Figures 5 and 6 and Tables 3 and 4 report the corresponding results.

Figure 5: Slope heuristics for the second tensorized function with $n = 200$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

Figure 6: Slope heuristics for the second tensorized function with $n = 1000$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

$n$ | $\mathbb E(C_{\hat m^\star})$ | $\mathbb E(C_{\hat m})$ | $\mathbb E(R(\hat f_{\hat m^\star}))$ | $\mathbb E(R(\hat f_{\hat m}))$
100 | 88.0 | 83.0 | 9.3e-07 | 1.0e-06
200 | 97.3 | 92.8 | 6.4e-07 | 6.6e-07
500 | 92.9 | 124.4 | 5.8e-07 | 6.9e-07
1000 | 108.4 | 107.5 | 5.3e-07 | 5.3e-07

Table 3: Expectation of complexities and risks of the model selected by the slope heuristics, with the second function, different values of $n$ and a fixed noise level $\gamma$.

Table 4: Expectation of complexities and risks with the second function, different values of $\gamma$ and $n = 1000$.

6.1.3 Tensorized function $f^\star(x) = g(g(x))$

We consider the function $f^\star(x) = g(g(x))$ with $g(x) = 1 - |2x - 1|$, which is in the Sobolev space $H^1(0,1)$. Figures 7 and 8 illustrate again the good behaviour of the model selection approach for different sample sizes and noise levels, and Tables 5 and 6 again illustrate the very good performance (in expectation) of the selected estimator when compared to the oracle.

Figure 7: Slope heuristics for the tensorized function $f^\star(x) = g(g(x))$ with $n = 200$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.
Figure 8: Slope heuristics for the tensorized function $f^\star(x) = g(g(x))$ with $n = 1000$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

$n$ | $\mathbb E(C_{\hat m^\star})$ | $\mathbb E(C_{\hat m})$ | $\mathbb E(R(\hat f_{\hat m^\star}))$ | $\mathbb E(R(\hat f_{\hat m}))$
200 | 176.4 | 181.6 | 6.3e-07 | 1.6e-06
500 | 188.2 | 198.8 | 3.9e-07 | 4.1e-07
1000 | 196.6 | 233.8 | 3.2e-07 | 3.5e-07

Table 5: Expectation of complexities and risks for the function $f^\star(x) = g(g(x))$, different values of $n$ and a fixed noise level $\gamma$.

Table 6: Expectation of complexities and risks for the function $f^\star(x) = g(g(x))$, different values of $\gamma$ and $n = 1000$.

6.2 Multivariate functions

6.2.1 Corner peak function

We consider the function
$$ f^\star(X) = \frac{1}{1 + \sum_{\nu=1}^{d} \nu^{-1} X_\nu} $$
with $d = 10$, where the $X_\nu \sim \mathcal U(0,1)$ are i.i.d. uniform random variables. The function $f^\star$ is analytic on $[0,1]^d$.
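A short Python definition of this test case is given below; the coefficients $\nu^{-1}$ follow the reading of the formula given above and should be treated as an assumption rather than the paper's exact specification.

```python
import numpy as np

d = 10
coeff = 1.0 / np.arange(1, d + 1)          # assumed coefficients nu^{-1}, nu = 1, ..., d

def corner_peak(x):
    """Corner-peak-type test function on [0,1]^d (assumed reading of the formula)."""
    x = np.atleast_2d(x)                   # shape (n_samples, d)
    return 1.0 / (1.0 + x @ coeff)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5, d))     # i.i.d. U(0,1) inputs
print(corner_peak(X))
```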
We use the fixed balanced binary tree $T$ of Figure 9 (a generic construction of such a tree is sketched after Table 7). Figures 10 and 11 illustrate the very good behaviour of the model selection approach for a sample size $n = 1000$. Tables 7 and 8 show expectations of complexities and risks for the selected model (for different values of $n$ and $\gamma$), which are of the same order as for the oracle.

Figure 9: Corner peak function. Dimension tree $T$ (a balanced binary tree over $\{1, \ldots, 10\}$).

$n$ | $\mathbb E(C_{\hat m^\star})$ | $\mathbb E(C_{\hat m})$ | $\mathbb E(R(\hat f_{\hat m^\star}))$ | $\mathbb E(R(\hat f_{\hat m}))$
100 | 124.1 | 73.7 | 2.1e-06 | 1.1e-05
500 | 286.7 | 291.3 | 9.8e-11 | 1.0e-10
1000 | 286.2 | 293.8 | 6.6e-11 | 6.7e-11

Table 7: Expectation of complexities and risks of the model selected by the slope heuristics, with the Corner peak function, different values of $n$ and a fixed small noise level $\gamma$.
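The following sketch builds a balanced binary dimension partition tree over $\{1, \ldots, d\}$ by recursively splitting the set of variables into two halves; the exact split convention of the tree in Figure 9 is not recoverable here, so this is only a generic construction.

```python
def balanced_binary_tree(variables):
    """Nodes (as tuples of variables) of a balanced binary dimension partition tree."""
    nodes = []
    def split(vs):
        nodes.append(tuple(vs))
        if len(vs) > 1:
            half = len(vs) // 2
            split(vs[:half])
            split(vs[half:])
    split(list(variables))
    return nodes

for node in balanced_binary_tree(range(1, 11)):   # tree over {1, ..., 10}
    print(node)
```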
Figure 10: Slope heuristics for the Corner peak function with $n = 1000$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.
Figure 11: Slope heuristics for the Corner peak function with $n = 1000$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

Table 8: Expectation of complexities and risks of the model selected by the slope heuristics, with the Corner peak function, different values of $\gamma$ and $n = 1000$.

6.2.2 Borehole function

We consider the function
$$ g(U_1, \ldots, U_8) = \frac{2\pi U_3 (U_4 - U_6)}{(U_2 - \log(U_1))\Big(1 + \dfrac{2 U_7 U_3}{(U_2 - \log(U_1))\, U_1^2\, U_8} + \dfrac{U_3}{U_5}\Big)}, $$
which models the water flow through a borehole as a function of 8 independent random variables $U_1 \sim \mathcal N(0.10,\ 0.0161812)$, $U_2 \sim \mathcal N(7.71,\ 1.0056)$, $U_3 \sim \mathcal U(63070,\ 115600)$, $U_4 \sim \mathcal U(990,\ 1110)$, $U_5 \sim \mathcal U(63.1,\ 116)$, $U_6 \sim \mathcal U(700,\ 820)$, $U_7 \sim \mathcal U(1120,\ 1680)$, $U_8 \sim \mathcal U(9855,\ 12045)$. We then consider the function
$$ f^\star(X_1, \ldots, X_d) = g(g_1(X_1), \ldots, g_8(X_8)), $$
where the $g_\nu$ are functions such that $U_\nu = g_\nu(X_\nu)$, with $X_\nu \sim \mathcal N(0,1)$ for $\nu \in \{1, 2\}$ and $X_\nu \sim \mathcal U(-1,1)$ for $\nu \in \{3, \ldots, 8\}$. The function $f^\star$ is thus defined on $\mathcal X = \mathbb R^2 \times [-1,1]^6$.
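Below is a Python sketch of this test case. The maps $g_\nu$ are taken to be the natural affine transports from the standardized inputs $X_\nu$ to the variables $U_\nu$ (a modelling choice that the text does not spell out), and the distribution bounds are those of the standard borehole benchmark, so treat both as assumptions.

```python
import numpy as np

def borehole(U):
    """Borehole response as a function of the physical variables U_1, ..., U_8."""
    U1, U2, U3, U4, U5, U6, U7, U8 = (U[..., k] for k in range(8))
    log_ratio = U2 - np.log(U1)
    return 2.0 * np.pi * U3 * (U4 - U6) / (
        log_ratio * (1.0 + 2.0 * U7 * U3 / (log_ratio * U1**2 * U8) + U3 / U5)
    )

# Assumed affine maps g_nu matching N(0.10, 0.0161812), N(7.71, 1.0056) and the uniform ranges.
normal_params = [(0.10, 0.0161812), (7.71, 1.0056)]
uniform_ranges = [(63070.0, 115600.0), (990.0, 1110.0), (63.1, 116.0),
                  (700.0, 820.0), (1120.0, 1680.0), (9855.0, 12045.0)]

def f_star(X):
    X = np.atleast_2d(X)                                      # shape (n_samples, 8)
    U = np.empty_like(X, dtype=float)
    for k, (mu, sigma) in enumerate(normal_params):
        U[:, k] = mu + sigma * X[:, k]                        # X_k ~ N(0,1)  -> U_k ~ N(mu, sigma)
    for k, (a, b) in enumerate(uniform_ranges, start=2):
        U[:, k] = 0.5 * (a + b) + 0.5 * (b - a) * X[:, k]     # X_k ~ U(-1,1) -> U_k ~ U(a, b)
    return borehole(U)

rng = np.random.default_rng(0)
X = np.column_stack([rng.standard_normal((4, 2)), rng.uniform(-1.0, 1.0, (4, 6))])
print(f_star(X))
```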
As univariate approximation tools, we use polynomial spaces $H_{\nu, p_\nu} = \mathcal P_{p_\nu}(\mathcal X_\nu)$, $\nu \in D$. We use the exploration strategy described in Section 5.6.1. More precisely, we first run a learning algorithm with tree adaptation, starting from an initial binary tree drawn randomly, with $n = 100$ samples. The learning algorithm visited the 9 trees plotted in Figure 12. Then, for each of these trees, we run a learning algorithm with fixed tree and rank adaptation. Figures 13 to 15 illustrate the behaviour of the model selection strategy for different sample sizes $n$. Table 9 shows the expectation of complexities and risks. The model selection approach shows very good performance, except for the very small training size $n = 100$, where it selects a model rather far from the optimal one (in terms of expected risk and complexity).

$n$ | $\mathbb E(C_{\hat m^\star})$ | $\mathbb E(C_{\hat m})$ | $\mathbb E(R(\hat f_{\hat m^\star}))$ | $\mathbb E(R(\hat f_{\hat m}))$
100 | 132.1 | 63.4 | 6.9e-06 | 9.3e-04
200 | 149.7 | 156.0 | 3.0e-08 | 1.1e-07
500 | 144.7 | 178.2 | 1.0e-08 | 1.8e-08
1000 | 154.1 | 194.2 | 8.3e-09 | 1.2e-08

Table 9: Borehole function. Expectation of complexities and risks, for a fixed small noise level $\gamma$ and different values of $n$.

Figure 12: Borehole function. The path of 9 trees generated by the tree adaptive learning algorithm.
Figure 13: Slope heuristics for the Borehole function with $n = 100$ and a fixed small noise level $\gamma$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

Figure 14: Slope heuristics for the Borehole function with $n = 200$ and a fixed small noise level $\gamma$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

Figure 15: Slope heuristics for the Borehole function with $n = 1000$ and a fixed small noise level $\gamma$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

References

[1] Nathalie Akakpo. Multivariate intensity estimation via hyperbolic wavelet selection. Journal of Multivariate Analysis, 161:32-57, 2017.
[2] Mazen Ali and Anthony Nouy. Approximation with tensor networks. Part I: Approximation spaces. arXiv e-prints, arXiv:2007.00118, 2020.
[3] Mazen Ali and Anthony Nouy. Approximation with tensor networks. Part II: Approximation rates for smoothness classes. arXiv e-prints, arXiv:2007.00128, 2020.
[4] Sylvain Arlot. Minimal penalties and the slope heuristics: a survey. arXiv preprint arXiv:1901.07277, 2019.
[5] M. Bachmayr, A. Nouy, and R. Schneider. Approximation power of tree tensor networks for compositional functions, 2020.
[6] M. Bachmayr, R. Schneider, and A. Uschmajew. Tensor networks and hierarchical tensors for the solution of high-dimensional partial differential equations. Foundations of Computational Mathematics, pages 1-50, 2016.
[7] Jean-Patrick Baudry, Cathy Maugis, and Bertrand Michel. Slope heuristics: overview and implementation. Statistics and Computing, 22(2):455-470, 2012.
[8] Eduard Belitser et al. Efficient estimation of analytic density under random censorship. Bernoulli, 4(4):519-543, 1998.
[9] Gérard Biau and Aurélie Fischer. Parameter selection for principal curves. IEEE Transactions on Information Theory, 58(3):1924-1939, 2012.
[10] Lucien Birgé and Pascal Massart. Minimal penalties for Gaussian model selection. Probability Theory and Related Fields, 138(1-2):33-73, 2007.
[11] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pages 169-207. Springer, 2004.
[12] A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao, and D. Mandic. Tensor networks for dimensionality reduction and large-scale optimization: Part 1 Low-rank tensor decompositions. Foundations and Trends in Machine Learning, 9(4-5):249-429, 2016.
[13] A. Cichocki, A.-H. Phan, Q. Zhao, N. Lee, I. Oseledets, M. Sugiyama, and D. Mandic. Tensor networks for dimensionality reduction and large-scale optimization: Part 2 Applications and future perspectives. Foundations and Trends in Machine Learning, 9(6):431-673, 2017.
[14] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698-728, 2016.
[15] Dinh Dũng, Vladimir Temlyakov, and Tino Ullrich. Hyperbolic Cross Approximation. Springer, 2018.
[16] A. Falcó, W. Hackbusch, and A. Nouy. Tree-based tensor formats. SeMA Journal, 2018.
[17] E. Grelier, A. Nouy, and R. Lebrun. Learning high-dimensional probability distributions using tree tensor networks. arXiv e-prints, arXiv:1912.07913, 2019.
[18] Erwan Grelier, Anthony Nouy, and Mathilde Chevreuil. Learning with tree-based tensor formats. arXiv e-prints, arXiv:1811.04455, 2018.
[19] Rémi Gribonval, Gitta Kutyniok, Morten Nielsen, and Felix Voigtlaender. Approximation spaces of deep neural networks. arXiv e-prints, arXiv:1905.01208, 2019.
[20] Michael Griebel and Helmut Harbrecht. Analysis of tensor approximation schemes for continuous functions. arXiv e-prints, arXiv:1903.04234, 2019.
[21] László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.
[22] W. Hackbusch. Tensor Spaces and Numerical Tensor Calculus, volume 42 of Springer Series in Computational Mathematics. Springer, Heidelberg, 2012.
[23] Vladimir Kazeev, Ivan Oseledets, Maxim Rakhuba, and Christoph Schwab. QTT-finite-element approximation for multiscale problems I: Model problems in one dimension. Advances in Computational Mathematics, 43(2):411-442, 2017.
[24] Vladimir Kazeev and Christoph Schwab. Approximation of singularities by quantized-tensor FEM. PAMM, 15(1):743-746, 2015.
[25] B. Khoromskij. Tensors-structured numerical methods in scientific computing: Survey on recent advances. Chemometrics and Intelligent Laboratory Systems, 110(1):1-19, 2012.
[26] Valentin Khrulkov, Alexander Novikov, and Ivan Oseledets. Expressive power of recurrent neural networks. In International Conference on Learning Representations, 2018.
[27] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media, 2011.
[28] P. Massart. Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer-Verlag, 2007.
[29] Michael Neumann. Multivariate wavelet thresholding in anisotropic function spaces. Statistica Sinica, 10:399-431, 2000.
[30] A. Nouy. Low-rank methods for high-dimensional approximation and model order reduction. In P. Benner, A. Cohen, M. Ohlberger, and K. Willcox, editors, Model Reduction and Approximation: Theory and Algorithms. SIAM, Philadelphia, PA, 2017.
[31] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.
[32] R. Schneider and A. Uschmajew. Approximation rates for the hierarchical tensor format in periodic Sobolev spaces. Journal of Complexity, 30(2):56-71, 2014. Dagstuhl 2012.
[33] Marco Signoretto, Lieven De Lathauwer, and Johan A. K. Suykens. Nuclear norms for tensors and their use for convex multilinear estimation. Submitted to Linear Algebra and Its Applications, 43, 2010.
[34] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[35] Ming Yuan and Cun-Hui Zhang. On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics, 16(4):1031-1068, 2016.
A Proofs
A.1 Proofs of Section 3
Proof of Proposition 3.6.
Let $f = R_{\mathcal H, T, r}((f_\alpha)_{\alpha \in T^\star})$ and let $\lambda = (p, \mu)$, $1 \le p \le \infty$, or $\lambda = \infty$ (with $p = \infty$ when $\lambda = \infty$). For $x \in \mathcal X$, we first note that
$$ \| f(x) \|_\lambda = \| f_\emptyset(g_D(x)) \|_\lambda \le \| f_\emptyset \|_{F_\emptyset} \, \| g_D(x) \|_\lambda. $$
Then, for any interior node $\alpha \in I(T)$, we have
$$ \| g_\alpha(x_\alpha) \|_\lambda = \big\| f_\alpha\big( (g_\beta(x_\beta))_{\beta \in S(\alpha)} \big) \big\|_\lambda \le \| f_\alpha \|_{F_\alpha} \prod_{\beta \in S(\alpha)} \| g_\beta(x_\beta) \|_\lambda, $$
and for any leaf node $\alpha \in L(T)$,
$$ \| g_\alpha(x_\alpha) \|_\lambda = \| f_\alpha(\phi_\alpha(x_\alpha)) \|_\lambda \le \| f_\alpha \|_{F_\alpha} \, \| \phi_\alpha(x_\alpha) \|_\lambda. $$
We deduce that
$$ \| f(x) \|_\lambda \le \prod_{\alpha \in T^\star} \| f_\alpha \|_{F_\alpha} \prod_{1 \le \nu \le d} \| \phi_\nu(x_\nu) \|_\lambda, $$
and therefore, since $\mu$ is a product measure and from the particular normalization of the functions $\phi_\nu$, we obtain
$$ \| f \|_\lambda \le \prod_{\alpha \in T^\star} \| f_\alpha \|_{F_\alpha} \prod_{1 \le \nu \le d} \| \phi_\nu \|_\lambda = \prod_{\alpha \in T^\star} \| f_\alpha \|_{F_\alpha}, $$
which proves that $L_\lambda \le 1$. Finally, for $1 \le q \le p$, we note that $\mu(\mathcal X)^{1/p - 1/q} \| f \|_{q,\mu} \le \| f \|_{p,\mu} \le \| f \|_\infty$, which yields $L_{q,\mu} \le \mu(\mathcal X)^{1/q - 1/p} L_\lambda$.

A.2 Proofs of Section 4
Proof of Lemma 4.6.
Let $\bar\gamma = \frac{\epsilon B}{4L}$ and let $\mathcal N$ be a $\bar\gamma$-net of $M$ for the $\|\cdot\|_{\infty,\mu}$-norm, with cardinal $\mathcal N\big(\tfrac{\epsilon B}{4L}\big)$. Using Lemma 4.4 and a union bound argument, we obtain
$$ P\Big( \sup_{g \in \mathcal N} \bar R_n(g) > \tfrac{\epsilon B}{2} \Big) \ \vee\ P\Big( \inf_{g \in \mathcal N} \bar R_n(g) < -\tfrac{\epsilon B}{2} \Big) \le \mathcal N\big(\tfrac{\epsilon B}{4L}\big)\, e^{-n\epsilon^2/2}. $$
For any $f \in M$, there exists a $g \in \mathcal N$ such that $\|f - g\|_{\infty,\mu} \le \bar\gamma$. Noting that
$$ \bar R_n(f) = \bar R_n(g) + \widehat R_n(f) - \widehat R_n(g) + R(g) - R(f), $$
we deduce from Assumption 4.5 that
$$ \bar R_n(f) \le \bar R_n(g) + 2L \|f - g\|_{\infty,\mu} \le \sup_{g \in \mathcal N} \bar R_n(g) + \tfrac{\epsilon B}{2}, \qquad \bar R_n(f) \ge \bar R_n(g) - 2L \|f - g\|_{\infty,\mu} \ge \inf_{g \in \mathcal N} \bar R_n(g) - \tfrac{\epsilon B}{2}. $$
This implies that
$$ P\Big( \sup_{f \in M} \bar R_n(f) > \epsilon B \Big) \le P\Big( \sup_{g \in \mathcal N} \bar R_n(g) > \tfrac{\epsilon B}{2} \Big) \quad \text{and} \quad P\Big( \inf_{f \in M} \bar R_n(f) < -\epsilon B \Big) \le P\Big( \inf_{g \in \mathcal N} \bar R_n(g) < -\tfrac{\epsilon B}{2} \Big), $$
which yields (21). The bound on $\mathcal N\big(\tfrac{\epsilon B}{4L}\big)$ directly follows from Proposition 3.3 and Proposition 3.6.

Proof of Lemma 4.7. We have
$$ \mathbb E\Big( \sup_{f \in M} |\bar R_n(f)| \Big) = \int_0^\infty P\Big( \sup_{f \in M} |\bar R_n(f)| > t \Big)\, dt = B \int_0^\infty P\Big( \sup_{f \in M} |\bar R_n(f)| > \epsilon B \Big)\, d\epsilon. $$
Let $\beta = 6 L B^{-1} R\, |T^\star|$. Then, according to Lemma 4.6, for any $\delta > 0$,
$$ \mathbb E\Big( \sup_{f \in M} |\bar R_n(f)| \Big) \le B \Big[ \delta + 2 \int_\delta^\infty (\beta \epsilon^{-1})^{C_M} e^{-n\epsilon^2/2}\, d\epsilon \Big] \le B \Big[ \delta + 2\, n^{-1} \beta^{C_M} \delta^{-C_M - 1} e^{-n\delta^2/2} \Big], $$
where the last inequality uses $(\beta\epsilon^{-1})^{C_M} \le (\beta\delta^{-1})^{C_M}$ for $\epsilon \ge \delta$ and the Gaussian tail bound $\int_\delta^\infty e^{-n\epsilon^2/2}\, d\epsilon \le (n\delta)^{-1} e^{-n\delta^2/2}$. By taking
$$ \delta = \sqrt{\frac{2 C_M}{n} \log\big( (\beta \vee e) \sqrt n \big)}, $$
we have
$$ n^{-1} \beta^{C_M} \delta^{-C_M - 1} e^{-n\delta^2/2} = n^{-1} \beta^{C_M} \delta^{-C_M - 1} \big( (\beta \vee e) \sqrt n \big)^{-C_M} \le \delta^{-C_M - 1} n^{-C_M/2 - 1} = \delta\, (\delta^2 n)^{-C_M/2 - 1} = \delta \big( 2 C_M \log((\beta \vee e)\sqrt n) \big)^{-C_M/2 - 1} \le \delta, $$
where we have used the fact that $2 C_M \log((\beta \vee e)\sqrt n) \ge 1$. Then
$$ \mathbb E\Big( \sup_{f \in M} |\bar R_n(f)| \Big) \le 3 B \delta, $$
which concludes the proof.
Proof of Proposition 4.8.
Starting from (16), we obtain
$$ R(\hat f_n^M) - R(f_M) \le \bar R_n(f_M) - \bar R_n(\hat f_n^M) \le \sup_{f \in M} \bar R_n(f) - \inf_{f \in M} \bar R_n(f). $$
Then, using Lemma 4.6, we deduce
$$ P\big( R(\hat f_n^M) - R(f_M) > 2\epsilon B \big) \le P\Big( \sup_{f \in M} \bar R_n(f) > \epsilon B \Big) + P\Big( \inf_{f \in M} \bar R_n(f) < -\epsilon B \Big) \le 2\, \mathcal N\big(\tfrac{\epsilon B}{4L}\big)\, e^{-n\epsilon^2/2}, $$
with $\log \mathcal N\big(\tfrac{\epsilon B}{4L}\big) \le C_M \log(\beta \epsilon^{-1})$. In the same way, for the expectation bound, we have
$$ \mathbb E\big( R(\hat f_n^M) - R(f_M) \big) \le \mathbb E\big( \bar R_n(f_M) - \bar R_n(\hat f_n^M) \big) \le 2\, \mathbb E\Big( \sup_{f \in M} |\bar R_n(f)| \Big), $$
and the result directly follows from Lemma 4.7.

Proof of Proposition 4.9. The two inequalities come from a standard application of the bounded difference inequality, see for instance Theorem 5.1 in [28]. The bounded difference inequality applied to $\sup_{f \in M} -\bar R_n(f) = \sup_{f \in M} \big( R(f) - \widehat R_n(f) \big)$ gives that, with probability larger than $1 - \exp(-t)$,
$$ \sup_{f \in M} -\bar R_n(f) \le \mathbb E\Big( \sup_{f \in M} -\bar R_n(f) \Big) + 2B \sqrt{\frac t n}. $$
Inequality (22) directly derives from this inequality and Lemma 4.7. Next, Inequality (16) gives that
$$ \mathcal E(\hat f_n^M) \le \mathcal E(f_M) + 2 \sup_{f \in M} |\bar R_n(f)|. $$
We finally prove the risk bound (23) by applying the bounded difference inequality to $\sup_{f \in M} |\bar R_n(f)|$ and Lemma 4.7 again.

A.2.1 Proof of Proposition 4.12
The proof follows the presentation of [27]. The least-squares contrast $\gamma$ corresponds either to the regression contrast or to the density estimation contrast. Under the assumptions of the proposition, in both frameworks the oracle function satisfies $\|f^\star\|_\infty \le R$.

$\bullet$ We first prove the proposition in the case where $R = 1$, assuming for the moment that $M = M_1 = \mathcal M^T_r(H_s)_1$. For the regression framework, it is also assumed for the moment that $\|Y\|_{\ell^\infty} \le 1$ and $\|f^\star\|_\infty \le 1$.

$\bullet$ For the least-squares regression contrast (see Example 4.10), we have $\gamma(f, Z) = (Y - f(X))^2$. For all $f \in M$, it gives $\gamma(f,Z) \le \|Y - f(X)\|^2_{\ell^\infty} \le (\|Y\|_{\ell^\infty} + \|f\|_\infty)^2$, so that $0 \le \gamma(f,Z) \le B$ almost surely, with $B = 4$. The distribution of the random variable $X$ is denoted $\mu$. Then
$$ \mathbb E\big( (\gamma(f,Z) - \gamma(f^\star,Z))^2 \big) = \mathbb E\big[ (f^\star(X) - f(X))^2 \big( 2(Y - f^\star(X)) + f^\star(X) - f(X) \big)^2 \big] = 4\, \mathbb E\big[ (f^\star(X) - f(X))^2 (Y - f^\star(X))^2 \big] + \mathbb E\big[ (f^\star(X) - f(X))^4 \big], $$
where the cross term vanishes since $\mathbb E(Y - f^\star(X) \mid X) = 0$. Hence
$$ \mathbb E\big( (\gamma(f,Z) - \gamma(f^\star,Z))^2 \big) \le \big( 4 \|Y - f^\star(X)\|^2_{\ell^\infty} + \|f^\star - f\|^2_\infty \big)\, \|f - f^\star\|^2_{2,\mu} \le 8\, \|f - f^\star\|^2_{2,\mu} = 2B\, \|f - f^\star\|^2_{2,\mu}, $$
where the last inequality has been obtained using $\|Y - f^\star(X)\|_{\ell^\infty} = \|Y - \mathbb E(Y|X)\|_{\ell^\infty} \le 1$. Let $\bar\gamma = \gamma/B$. We have $0 \le \bar\gamma \le 1$,
$$ \bar{\mathcal E}(f) := \mathbb E[\bar\gamma(f,Z) - \bar\gamma(f^\star,Z)] = \frac 1 B \|f - f^\star\|^2_{2,\mu} = \frac 1 B \mathcal E(f), $$
and $\mathbb E\big( [\bar\gamma(f,Z) - \bar\gamma(f^\star,Z)]^2 \big) \le D \|f - f^\star\|^2_{2,\mu}$ with $D = 2/B = 1/2$.

$\bullet$ We now consider the density estimation framework, with $\gamma(f,x) = \|f\|^2_{2,\mu} - 2 f(x)$. According to Example 4.11, $|\gamma(f,X)| \le B = \mu(\mathcal X) + 2$. The excess risk satisfies $\mathcal E(f) = \|f^\star - f\|^2_{2,\mu}$ and
$$ \mathbb E\big( [\gamma(f,Z) - \gamma(f^\star,Z)]^2 \big) = \mathbb E\Big( \big[ \|f\|^2_{2,\mu} - \|f^\star\|^2_{2,\mu} + 2(f^\star(X) - f(X)) \big]^2 \Big) \le \big( \|f\|^2_{2,\mu} - \|f^\star\|^2_{2,\mu} \big)^2 + 4 \big( \|f\|^2_{2,\mu} - \|f^\star\|^2_{2,\mu} \big) \langle f^\star - f, f^\star \rangle_{2,\mu} + 4 \|f - f^\star\|^2_{2,\mu} $$
$$ = \big( \|f\|^2_{2,\mu} - \|f^\star\|^2_{2,\mu} \big) \big( \|f\|^2_{2,\mu} - \|f^\star\|^2_{2,\mu} + 4 \langle f^\star - f, f^\star \rangle_{2,\mu} \big) + 4 \|f - f^\star\|^2_{2,\mu} = \langle f - f^\star, f + f^\star \rangle_{2,\mu} \|f - f^\star\|^2_{2,\mu} - 2 \langle f - f^\star, f + f^\star \rangle_{2,\mu} \langle f - f^\star, f^\star \rangle_{2,\mu} + 4 \|f - f^\star\|^2_{2,\mu}. $$
We have $\langle f - f^\star, f + f^\star \rangle_{2,\mu} \le \|f\|^2_{2,\mu} \le \mu(\mathcal X)$, $\|f^\star\|^2_{2,\mu} \le \|f^\star\|_{\infty,\mu} \|f^\star\|_{1,\mu} \le 1$ and $\|f + f^\star\|_{2,\mu} \le \mu(\mathcal X)^{1/2} + 1$, so that
$$ \mathbb E\big( (\gamma(f,Z) - \gamma(f^\star,Z))^2 \big) \le \big( \mu(\mathcal X) + 2 \|f + f^\star\|_{2,\mu} \|f^\star\|_{2,\mu} + 4 \big) \|f - f^\star\|^2_{2,\mu} \le \big( \mu(\mathcal X) + 2\mu(\mathcal X) + 2 + 4 \big) \|f - f^\star\|^2_{2,\mu} = 3\big( \mu(\mathcal X) + 2 \big) \|f - f^\star\|^2_{2,\mu} = 3B \|f - f^\star\|^2_{2,\mu}, $$
where we have used $\mu(\mathcal X) \ge 1$. Let $\bar\gamma = \frac{1}{2B}(\gamma + B)$. Then $0 \le \bar\gamma(f,X) \le 1$ for $f \in M$. Moreover,
$$ \bar{\mathcal E}(f) := \mathbb E[\bar\gamma(f,Z) - \bar\gamma(f^\star,Z)] = \frac{1}{2B} \mathcal E(f), $$
and $\mathbb E\big( (\bar\gamma(f,Z) - \bar\gamma(f^\star,Z))^2 \big) \le D \|f - f^\star\|^2_{2,\mu}$ with $D = \frac{3}{4B} \le \frac 1 4$, where we have again used $\mu(\mathcal X) \ge 1$.

$\bullet$ For $\delta > 0$, we introduce
$$ \omega_n(\delta) = \omega_n(M, f^\star, \delta) = \mathbb E \sup_{f \in M,\ \|f - f_M\|^2_{2,\mu} \le \delta/D} \Big| \frac 1 n \sum_{i=1}^n \bar\gamma(f, Z_i) - \mathbb E(\bar\gamma(f,Z)) \Big|. $$
Following [27] (Section 4.1, p. 57), we introduce the sharp transformation $\sharp$ of the function $\omega_n$:
$$ \omega_n^\sharp(\varepsilon) = \inf\Big\{ \delta > 0 : \ \sup_{\sigma \ge \delta} \frac{\omega_n(\sigma)}{\sigma} \le \varepsilon \Big\}. $$
According to Proposition 4.1 in [27], there exist absolute constants $\kappa$ and $A$ such that, for any $\varepsilon \in (0,1)$ and $t > 0$, with probability at least $1 - A\exp(-t)$,
$$ \bar{\mathcal E}(\hat f^M_n) \le (1 + \varepsilon)\, \bar{\mathcal E}(f_M) + \frac 1 D\, \omega_n^\sharp\Big( \frac{\varepsilon \kappa}{D} \Big) + \kappa\, D\, \frac{t}{n \varepsilon}. \tag{33} $$
The sharp transformation is monotonic: if $\Psi_1 \le \Psi_2$ then $\Psi_1^\sharp \le \Psi_2^\sharp$ (see Appendix A.3 in [27]). Thus it remains to find an upper bound on the sharp transformation of an upper bound on $\omega_n$.

$\bullet$ We use standard symmetrization and contraction arguments for Rademacher variables. The Rademacher process indexed by the class $M$ is defined by
$$ \mathrm{Rad}_n(f) = \frac 1 n \sum_{i=1}^n \varepsilon_i f(X_i), $$
where the $\varepsilon_i$'s are i.i.d. Rademacher random variables (that is, $\varepsilon_i$ takes the values $+1$ and $-1$ with probability $1/2$), independent of the $X_i$'s. By the symmetrization inequality (see for instance Theorem 2.1 in [27]),
$$ \omega_n(\delta) \le 2\, \mathbb E \sup_{f \in M,\ \|f - f_M\|^2_{2,\mu} \le \delta/D} \big| \mathrm{Rad}_n\big( \bar\gamma(f, \cdot) - \bar\gamma(f_M, \cdot) \big) \big|. $$
We introduce the function
$$ \Psi_n(\delta) = \mathbb E \sup_{f \in M,\ \|f - f_M\|^2_{2,\mu} \le \delta} \big| \mathrm{Rad}_n(f - f_M) \big|. $$
In the regression setting, a contraction argument gives $\omega_n(\delta) \le 8\, \Psi_n(\delta/D)$. In the density estimation setting, $\bar\gamma(f, \cdot) - \bar\gamma(f_M, \cdot)$ differs from $-\frac 1 B (f - f_M)$ by a constant function and, since the fluctuations of a constant function are zero, we obtain
$$ \omega_n(\delta) \le \frac 2 B\, \mathbb E \sup_{f \in M,\ \|f - f_M\|^2_{2,\mu} \le \delta/D} \big| \mathrm{Rad}_n(f - f_M) \big| \le 8\, \Psi_n(\delta/D). \tag{34} $$

$\bullet$ We now introduce the following subset of the $L^2$ ball centered at $f_M$:
$$ \mathcal M(\delta, f^\star) = \big\{ f - f_M \,:\, f \in M,\ \|f - f_M\|^2_{2,\mu} \le \delta \big\}. $$
In the density estimation setting, the empirical measure of the $X_i$'s is denoted $\nu_n$. We also denote by $\nu_n$ the empirical measure in the regression setting (take $\nu = \mu$). The constant function $F = 2$ is an envelope for $\mathcal M(\delta, f^\star)$ and $\|F\|_{2,\nu_n} = 2$. According to Proposition 3.3,
$$ H\big( \varepsilon, \mathcal M(\delta, f^\star), \|\cdot\|_{2,\nu_n} \big) \le C_M \log\Big( \frac{|T^\star|\, L_{2,\nu_n}}{\varepsilon} \Big), \quad \varepsilon \le 1, \quad \nu^{\otimes n}\text{-almost surely}, $$
where $L_{2,\nu_n}$ is defined by (10) for the measure $\nu_n$ and for $p = 2$. According to Proposition 3.6, $L_{2,\nu_n}$ satisfies $L_{2,\nu_n} \le \sqrt{\nu_n(\mathcal X)}\, L_{\infty,\nu_n} = L_{\infty,\nu_n}$. Here it is assumed that $\mathcal H \subset L^\infty(\mathcal X)$ equipped with the norm $\|\cdot\|_\infty$. According to Proposition 3.6 we have $L_{\infty,\nu_n} \le L_\infty \le 1$, thus $L_{2,\nu_n} \le 1$ and
$$ H\big( \varepsilon, \mathcal M(\delta, f^\star), \|\cdot\|_{2,\nu_n} \big) \le C_M \Big[ \log\Big( \frac e \varepsilon \Big) + \log_+\Big( \frac{|T^\star|}{e} \Big) \Big] \le C_M \Big[ 1 + \log_+\Big( \frac{|T^\star|}{e} \Big) \Big] \log\Big( \frac e \varepsilon \Big) \le C_M\, a_T\, h\Big( \frac 1 \varepsilon \Big), \quad \varepsilon \le 1, $$
with $a_T = 1 + \log_+\big( \frac{|T^\star|}{e} \big)$ and $h(u) := \log(2 e u)$ for $u \ge 1/2$. We are now in position to apply Theorem A.1, given at the end of this section. We can take $\sigma^2 = \delta$ in Theorem A.1 because $\mathbb E_\nu(g(X)^2) \le \delta$ for $g \in \mathcal M(\delta, f^\star)$. Thus, there exists an absolute constant $\kappa_1 > 0$ such that
$$ \Psi_n(\delta) \le \kappa_1 \Bigg[ \sqrt{\frac \delta n\, C_M\, a_T\, h\Big( \frac{2}{\sqrt\delta} \Big)} \ \vee\ \frac 1 n\, C_M\, a_T\, h\Big( \frac{2}{\sqrt\delta} \Big) \Bigg]. $$
For regression, it can be easily checked (see also Example 3, p. 80 in [27]) that
$$ \Psi_n^\sharp(\varepsilon) \le \kappa_2\, \frac{C_M\, a_T}{\varepsilon^2 n}\, \log\Big( \frac{e\, \varepsilon^2 n}{\kappa_2\, C_M\, a_T} \Big). $$
Similar calculations hold for density estimation. Together with Inequalities (33) and (34), and according to the properties of the sharp transformation (see Appendix A.3 in [27]), this gives that, with probability at least $1 - A\exp(-t)$,
$$ \bar{\mathcal E}(\hat f^M_n) \le (1 + \varepsilon)\, \bar{\mathcal E}(f_M) + \frac 1 D \big( 8 \Psi_n(\cdot\, D) \big)^\sharp\Big( \frac{\varepsilon\kappa}{D} \Big) + \kappa D \frac{t}{n\varepsilon} \le (1 + \varepsilon)\, \bar{\mathcal E}(f_M) + (8\Psi_n)^\sharp( \varepsilon\kappa ) + \kappa D \frac{t}{n\varepsilon} \le (1 + \varepsilon)\, \bar{\mathcal E}(f_M) + \kappa_3\, \frac{a_T\, C_M}{\varepsilon^2 n} \log\Big( \frac{\kappa_4\, \varepsilon^2 n}{a_T\, C_M} \Big) + \kappa D \frac{t}{n\varepsilon}, $$
where $\kappa_3$ and $\kappa_4$ are absolute constants. This completes the proof for $R = 1$, by rewriting the risk bound for the original excess risk $\mathcal E$ (recall that $\bar{\mathcal E} = \mathcal E/B$ for regression and $\bar{\mathcal E} = \mathcal E/(2B)$ for density estimation).

$\bullet$ We now consider the more general situation where $M = \mathcal M^T_r(H_s)_R$ with $R \ge 1$. We first consider regression and now assume that $\|Y\|_{\ell^\infty} \le R$ almost surely. Let $f^\star$, $f_M$ and $\hat f^M$ be defined as in Section 4 for the observations $Z_1, \ldots, Z_n$. We consider the least-squares regression problem for the normalized data $(X_1, Y_1/R), \ldots, (X_n, Y_n/R)$ with the functional set $M_1$. For this problem the oracle is $f^\star/R$, the best approximation on $M_1$ is $f_M/R$, and the least-squares estimator is $\hat f^M/R$. The risk bound (24) is valid for the normalized data (with $R = 1$) and it directly gives (24) for $R \ge 1$. The same method applies for proving the risk bound in the density estimation case.
A.2.2 An adaptation of Theorem 3.12 in [27]
We consider the same framework as in [27]. We observe $X_1, \ldots, X_n$ according to the distribution $\nu$ and let $\nu_n$ be the empirical measure. Let $\mathcal F$ be a function space. Assume that the functions in $\mathcal F$ are uniformly bounded by a constant $U$ and let $F \le U$ denote a measurable envelope of $\mathcal F$. We assume that $\sigma$ is a number such that $\sup_{f \in \mathcal F} \mathbb E_\nu f^2 \le \sigma^2 \le \|F\|^2_{2,\nu}$. Let $h : [0,\infty) \to [0,\infty)$ be a regularly varying function of exponent $0 \le \alpha < 2$, strictly increasing for $u \ge 1/2$ and such that $h(u) = 0$ for $0 \le u < 1/2$. The constants $\kappa_h$ below depend only on $h$ and not on $c$.

Theorem A.1 (Theorem 3.12 in [27]). Let $c > 0$. If, for all $\varepsilon > 0$ and $n \ge 1$,
$$ \log N\big( \varepsilon, \mathcal F, \|\cdot\|_{2,\nu_n} \big) \le c\, h\Big( \frac{\|F\|_{2,\nu_n}}{\varepsilon} \Big) \quad \nu^{\otimes n}\text{-almost surely}, $$
then there exists a constant $\kappa_h > 0$ that depends only on $h$ such that
$$ \mathbb E \sup_{f \in \mathcal F} |R_n(f)| \le \kappa_h \Bigg[ \frac{\sigma}{\sqrt n} \sqrt{c\, h\Big( \frac{\|F\|_{2,\nu}}{\sigma} \Big)} \ \vee\ \frac U n\, c\, h\Big( \frac{\|F\|_{2,\nu}}{\sigma} \Big) \Bigg]. $$

Proof. The proof of Theorem 3.12 of [27] starts by applying Theorem 3.11 of [27]. As in [27], we assume without loss of generality that $U = 1$. In our context it gives
$$ E := \mathbb E \sup_{f \in \mathcal F} |R_n(f)| \le C \sqrt c\, n^{-1/2}\, \mathbb E \int_0^{\sigma_n} \sqrt{ h\Big( \frac{\|F\|_{2,\nu_n}}{\varepsilon} \Big) }\, d\varepsilon, $$
where $\sigma_n^2 = \sup_{f \in \mathcal F} \frac 1 n \sum_{i=1}^n f(X_i)^2$ and where $C$ is a universal numerical constant. By following the lines of the proof of [27], we find that $E$ satisfies the following inequation:
$$ E \le \sqrt c\, \kappa_{h,1}\, n^{-1} + \sqrt c\, \kappa_{h,2}\, n^{-1/2}\, \sigma \sqrt{ h\Big( \frac{\|F\|_{2,\nu}}{\sigma} \Big) } + \sqrt c\, \kappa_{h,3}\, n^{-1/2}\, \sqrt E\, \sqrt{ h\Big( \frac{\|F\|_{2,\nu}}{\sigma} \Big) }, $$
where $\kappa_{h,1}$, $\kappa_{h,2}$ and $\kappa_{h,3}$ are positive numerical constants which only depend on the function $h$ (see the proof in [27] for the expression of these three constants). Solving this inequation completes the proof.

A.3 Proofs of Section 5
Proof of Theorem 5.2.
According to Inequality (22) of Proposition 4.9, for any $t > 0$ and $m \in \mathcal M$, one has, with probability larger than $1 - \exp(-t)$,
$$ \sup_{f \in M_m} -\bar R_n(f) \le \lambda_m \sqrt{\frac{C_m}{n}} + 2B\sqrt{\frac t n}. $$
Then, with probability larger than $1 - \sum_{m \in \mathcal M} \exp(-x_m - t) = 1 - \Sigma \exp(-t)$, it holds that
$$ -\bar R_n(\hat f_{\hat m}) \le \sup_{f \in M_{\hat m}} -\bar R_n(f) \le \lambda_{\hat m} \sqrt{\frac{C_{\hat m}}{n}} + 2B\sqrt{\frac{t + x_{\hat m}}{n}}, $$
which together with (26) implies that
$$ \mathcal E(\hat f_{\hat m}) \le \mathcal E(f_m) + \bar R_n(f_m) + \lambda_{\hat m} \sqrt{\frac{C_{\hat m}}{n}} + 2B\sqrt{\frac{x_{\hat m}}{n}} - \mathrm{pen}(\hat m) + \mathrm{pen}(m) + 2B\sqrt{\frac t n} $$
holds for all $m \in \mathcal M$. Then, with the condition (27) on the penalty function, the upper bound
$$ \mathcal E(\hat f_{\hat m}) \le \mathcal E(f_m) + \bar R_n(f_m) + \mathrm{pen}(m) + 2B\sqrt{\frac t n} $$
holds for all $m \in \mathcal M$ simultaneously, with probability larger than $1 - \Sigma \exp(-t)$. Next, integrating with respect to $t$ gives
$$ \mathbb E\Big[ 0 \vee \Big( \mathcal E(\hat f_{\hat m}) - \mathcal E(f_m) - \bar R_n(f_m) - \mathrm{pen}(m) \Big) \Big] \le B\, \Sigma\, \sqrt{\frac \pi n}. $$
Finally, since $\bar R_n(f_m)$ has zero mean, for any $m \in \mathcal M$,
$$ \mathbb E\big( \mathcal E(\hat f_{\hat m}) \big) \le \mathcal E(f_m) + \mathrm{pen}(m) + B\, \Sigma\, \sqrt{\frac \pi n}, $$
and we conclude by taking the infimum over $m \in \mathcal M$.

Proof of Theorem 5.3.
The proof is adapted from Theorem 6.5 in [27], which corresponds to an alternative statement of Theorem 8.5 in [28]. We follow the lines of Section 6.3 in [27] (p. 107-108). We first consider the case $R = 1$ and we consider the normalized contrast $\bar\gamma$ and the normalized excess risk $\bar{\mathcal E}$ as in the proof of Proposition 4.12. We have shown that
$$ \mathbb E\big( [\bar\gamma(f,Z) - \bar\gamma(f^\star,Z)]^2 \big) \le D\, \|f - f^\star\|^2_{2,\mu}, $$
where $D$ does not depend on the model $M_m$. Next, it has also been shown in the proof of Proposition 4.12 that, for $\varepsilon \in (0,1)$,
$$ \omega_n^\sharp(\varepsilon) \le \kappa\, \frac{a_m\, C_m}{n\varepsilon^2} \log_+\Big( \frac{n\varepsilon^2}{a_m\, C_m} \Big), $$
with $a_m = 1 + \log_+\big( \frac{|T^\star_m|}{e} \big)$ and where $\kappa$ is an absolute constant. We consider the penalized criterion (25) with a penalty of the form
$$ \mathrm{pen}(m) = \kappa_1\, \frac{a_m\, C_m}{n\varepsilon} \log_+\Big( \frac{n\varepsilon}{a_m\, C_m} \Big) + \kappa_2\, \frac{x_m}{n\varepsilon}. $$
We take
$$ \delta_n^\varepsilon(m) = \tilde\delta_n^\varepsilon(m) = \hat\delta_n^\varepsilon(m) = \kappa\, \frac{a_m\, C_m}{n\varepsilon} \log\Big( \frac{n\varepsilon}{a_m\, C_m} \Big) + K\, \frac{x_m}{n\varepsilon} $$
(and thus $p_m = 0$ in the theorem), and we also note that, for any $t > 0$,
$$ \delta_n^\varepsilon(m) + K\, \frac{t}{n\varepsilon} \le K\, \Big[ \frac{a_m\, C_m}{n\varepsilon} \log\Big( \frac{n\varepsilon}{a_m\, C_m} \Big) + \frac{x_m + t}{n\varepsilon} \Big] $$
for a numerical constant $K$. Finally, according to Theorem 6.5 in [27], there exist numerical constants $K_1$ and $K_2$ such that, for any $t > 0$,
$$ P\bigg( \bar{\mathcal E}(\hat f_{\hat m}) > \frac{1+\varepsilon}{1-\varepsilon} \inf_{m \in \mathcal M} \Big\{ \bar{\mathcal E}(f_m) + K_1\Big[ \frac{a_m\, C_m}{n\varepsilon} \log\Big( \frac{n\varepsilon}{a_m\, C_m} \Big) + \frac{x_m + t}{n\varepsilon} \Big] \Big\} \bigg) \le K_2 \sum_{m \in \mathcal M} \exp(-t - x_m). $$
Under Assumption 5.1, we easily derive the oracle bound (28) by rewriting it for the contrast $\gamma$ and then by integrating this probability bound with respect to $t$. This bound generalizes to the case $R \ge 1$ by the same normalization argument as in the proof of Proposition 4.12.