Learning with tree tensor networks: complexity estimates and model selection
Bertrand Michel and Anthony Nouy∗

∗ Centrale Nantes, Laboratoire de Mathématiques Jean Leray, CNRS UMR 6629, France

Abstract
In this paper, we propose and analyze a model selection method for tree tensor networks in an empirical risk minimization framework. Tree tensor networks, or tree-based tensor formats, are prominent model classes for the approximation of high-dimensional functions in numerical analysis and data science. They correspond to sum-product neural networks with a sparse connectivity associated with a dimension partition tree $T$, widths given by a tuple $r$ of tensor ranks, and multilinear activation functions (or units). The approximation power of these model classes has been proved to be near-optimal for classical smoothness classes. However, in an empirical risk minimization framework with a limited number of observations, the dimension tree $T$ and ranks $r$ should be selected carefully to balance estimation and approximation errors. In this paper, we propose a complexity-based model selection strategy à la Barron, Birgé, Massart. Given a family of model classes, with different trees, ranks and tensor product feature spaces, a model is selected by minimizing a penalized empirical risk, with a penalty depending on the complexity of the model class. After deriving bounds of the metric entropy of tree tensor networks with bounded parameters, we deduce a form of the penalty from bounds on suprema of empirical processes. This choice of penalty yields a risk bound for the predictor associated with the selected model. For classical smoothness spaces, we show that the proposed strategy is minimax optimal in a least-squares setting. In practice, the amplitude of the penalty is calibrated with a slope heuristics method. Numerical experiments in a least-squares regression setting illustrate the performance of the strategy for the approximation of multivariate functions and of univariate functions identified with tensors by tensorization (quantization).

1 Introduction

Typical tasks in statistical learning include the estimation of a regression function or of posterior probabilities for classification (supervised learning), or the estimation of the probability distribution of a random variable from samples of the distribution (unsupervised learning). These approximation tasks can be formulated as a minimization problem of a risk functional $\mathcal R(f)$ whose minimizer $f^\star$ is the target (or oracle) function, and such that $\mathcal R(f) - \mathcal R(f^\star)$ measures some discrepancy between the function $f$ and $f^\star$. The risk is usually defined as
$$\mathcal R(f) = \mathbb E(\gamma(f, Z)),$$
with $Z = (X, Y)$ for supervised learning or $Z = X$ for unsupervised learning, and where $\gamma$ is a contrast function. For supervised learning, the contrast $\gamma$ is usually chosen as $\gamma(f, (x, y)) = \ell(y, f(x))$, where $\ell(y, f(x))$ measures some discrepancy between $y$ and the prediction $f(x)$ for a given realization $(x, y)$ of $(X, Y)$. In practice, given i.i.d. realizations $(Z_1, \dots, Z_n)$ of $Z$, an approximation $\hat f^M_n$ is obtained by the minimization of an empirical risk
$$\hat{\mathcal R}_n(f) = \frac{1}{n} \sum_{i=1}^n \gamma(f, Z_i)$$
over a set of functions $M$, also called a model class or hypothesis set. Assuming that the risk admits a minimizer $f_M$ over $M$, the error $\mathcal R(\hat f^M_n) - \mathcal R(f^\star)$ can be decomposed into two contributions: an approximation error $\mathcal R(f_M) - \mathcal R(f^\star)$, which quantifies the best we can expect from the model class $M$, and an estimation error $\mathcal R(\hat f^M_n) - \mathcal R(f_M)$, which is due to the use of a limited number of observations.
For agiven model class, a first problem is to understand how these errors behave under some assumptions onthe target function. When considering an increasing sequence of model classes, the approximation errordecreases but the estimation error usually increases. Then strategies are required for the selection of aparticular model class.In many applications, the target function f ⋆ ( x ) is a function of many variables x = ( x , . . . , x d ). Forapplications in image or signal classification, x may be an image (with d the number of pixels or patches) ora discrete time signal (with d the number of time instants) and f ⋆ ( x ) provides a label to a particular input x . For applications in computational science, the target function may be the solution of a high-dimensionalpartial differential equation, a parameter-dependent equation or a stochastic equation. In all these appli-cations, when d is large and when the number of observations is limited, one has to rely on suitable modelclasses M of moderate complexity that exploit specific structures of the target function f ⋆ and yield an ap-proximation b f Mn with low approximation and estimation errors. Typical examples of model classes includeadditive functions f ( x ) + · · · + f d ( x d ), sums of multiplicative functions P mk =1 f k ( x ) · · · f kd ( x d ), projectionpursuit f ( w T x ) + · · · + f m ( w Tm x ), or feed-forward neural networks σ L ◦ f L ◦ . . . ◦ σ ◦ f ( x ) where the f k are affine maps and the σ k are given nonlinear functions.In this paper, we consider the class of functions in tree-based tensor format, or tree tensor networks.These model classes are well-known approximation tools in numerical analysis and computational physicsand have also been more recently considered in statistical learning. They are particular cases of feed-forward neural networks with an architecture given by a dimension partition tree and multilinear activationfunctions (see [26, 14]). For an overview of these tools, the reader is referred to the monograph [22] andthe surveys [30, 6, 25, 12, 13]. Some results on the approximation power of tree tensor networks can befound in [32, 20, 5] for multivariate functions, or in [24, 23, 2, 3] for tensorized (or quantized) functions.A tree-based tensor format is a set of functions M Tr ( H ) = { f ∈ H : rank α ( f ) ≤ r α , α ∈ T } , where T is a dimension partition tree over { , . . . , d } , r = ( r α ) ∈ N | T | is a tuple of integers and H is a finitedimensional tensor space of multivariate functions (e.g., polynomials, splines), which is a tensor productfeature space. A function f in M Tr ( H ) have a α -rank rank α ( f ) bounded by r α , that means it admits arepresentation f ( x ) = r α X k =1 g αk ( x α ) h α c k ( x α c )for some functions g αk and h α c k of complementary groups of variables. The function f admits a representationas a composition of multilinear functions. For instance, for the dimension tree of Figure 1a, f ( x ) = f ,..., (cid:16) f , , , (cid:0) f , , ( f ( φ ( x )) , f , ( f ( φ ( x )) , f ( φ ( x )))) , f ( φ ( x )) (cid:1) ,f , , , (cid:0) f , , ( f , ( f ( φ ( x )) , f ( φ ( x ))) , f ( φ ( x ))) , f ( φ ( x )) (cid:1)(cid:17) where φ ν ( x ν ) ∈ R n ν is a vector of n ν features in the variable x ν , and f α is a multilinear map with valuesin R r α . It corresponds to the neural network illustrated on Figure 2.The main contribution of the paper is a complexity-based strategy for the selection of a model classin an empirical risk minimization framework. 
Figure 1: Dimension tree $T$ over $\{1, \dots, 8\}$ (a) and ranks $r = (r_\alpha)_{\alpha \in T}$ (b).

Figure 2: Neural network corresponding to the format $\mathcal M^T_r(H)$ with the tree $T$ and ranks $r$ of Figure 1, and $n_\nu = 10$ features per variable.
Given a family of model classes $M_m = \mathcal M^{T_m}_{r_m}(H_m)$, $m \in \mathcal M$, associated with different trees $T_m$, ranks $r_m$ and background approximation spaces $H_m$, and given the corresponding predictors $\hat f_m$ that minimize the empirical risk, we propose a strategy to select a particular model $\hat m$ with a guaranteed performance. For that purpose, we make use of the model selection approach of Barron, Birgé and Massart (see [28] for a general introduction to the topic) where $\hat m$ is obtained by minimizing a penalized empirical risk
$$\hat{\mathcal R}_n(\hat f_m) + \mathrm{pen}(m)$$
with a penalty function $\mathrm{pen}(m)$ derived from complexity estimates of the model classes $M_m$, of the form $\mathrm{pen}(m) \sim O(\sqrt{C_m/n})$ (up to logarithmic terms) in a general setting, or of the form $\mathrm{pen}(m) \sim O(C_m/n)$ (again up to logarithmic terms) in a bounded least-squares setting where faster convergence rates can be obtained. In particular, we find that our strategy is minimax adaptive over Sobolev spaces. In practice, the penalty is taken of the form $\mathrm{pen}(m) = \lambda \sqrt{C_m/n}$ (or $\mathrm{pen}(m) = \lambda C_m/n$ in a bounded regression setting), where $\lambda$ is calibrated with the slope heuristics method proposed in [10]. The family of models can be generated by adaptive learning algorithms such as the ones proposed in [18, 17].

Note that our method is an $\ell_0$-type approach. Convex regularization methods would be an interesting alternative route to follow. A straightforward convexification of tensor formats consists in using the sum of nuclear norms of unfoldings (see, e.g., [33] for the Tucker format), but this is known to be far from optimal from a statistical point of view [31]. A convex regularization method based on the tensor nuclear norm has been proposed for the Tucker format, or shallow tensor network, which comes with theoretical guarantees (see [35]). However, there is no straightforward extension of this approach to general tree tensor networks.

The outline of the paper is as follows. In Section 2, we describe the model class of tree tensor networks (or tree-based tensor formats) in the case of vector-valued functions, which generalizes the classical definition for real-valued functions [22, 16] and allows considering applications such as multiclass classification. In Section 3, we provide estimates of the metric and bracketing entropies in $L^p$ spaces for tree tensor networks $M_m$ with bounded parameters. In Section 4, we derive bounds for the estimation error in a classical empirical risk minimization framework. These bounds are derived from concentration inequalities for empirical processes. In Section 5, we present the complexity-based model selection approach and we derive risk bounds for particular choices of penalty, first in a general setting and then in the bounded least-squares setting. Then we present the practical aspects of the approach, which include the slope heuristics method for penalty calibration and the exploration strategies for the generation of a sequence of model classes and associated predictors. Finally, in Section 6, we present some numerical experiments that validate the proposed model selection strategy.

2 Tree tensor networks

We consider functions $f(x) = f(x_1, \dots, x_d)$ defined on a product set $\mathcal X = \mathcal X_1 \times \dots \times \mathcal X_d$ and with values in $\mathbb R^s$. Typically, $\mathcal X_\nu$ is a subset of $\mathbb R$ or $\mathbb R^{d_\nu}$, but it could be a set of more general objects (sequences, functions, graphs...). For each $\nu \in \{1, \dots, d\}$, we introduce a finite-dimensional space $H_\nu$ of functions defined on $\mathcal X_\nu$.
We let { φ νi ν : i ν ∈ I ν } be a basis of H ν , with I ν = { , . . . , n ν } . The functions φ νi ν ( x ν ) may be polynomials, splines,wavelets, kernel functions, or more general functions that extract n ν features from a given input x ν ∈ X ν .We let φ ν : X ν → R n ν be the associated feature map defined by φ ν ( x ν ) = ( φ ν ( x ν ) , . . . , φ νn ν ( x ν )) T ∈ R n ν .The functions φ i ( x ) = φ i ( x ) . . . φ di d ( x d ), i ∈ I = I × . . . × I d , form a basis of the tensor product space H = H ⊗ . . . ⊗ H d . A function f ∈ H admits a representation f ( x ) = X i ∈ I a i φ i ( x ) = n X i =1 . . . n d X i d =1 a i ,...,i d φ i ( x ) . . . φ di d ( x d ) , (1)where a ∈ R I = R n × ... × n d is an algebraic tensor (or multi-dimensional array) of size n × . . . × n d . Themap φ from X to R I which associates to x the elementary tensor φ ( x ) = φ ( x ) ⊗ . . . ⊗ φ d ( x d ) ∈ R I definesa tensor product feature map .A function f defined on X with values in R s whose components f k (1 ≤ k ≤ s ) are in H is identifiedwith an element of the product space H s , which is itself identified with the space H ⊗ . . . ⊗ H d ⊗ R s oftensors of order d + 1. For any α ⊂ { , . . . , d } := D , we introduce the tensor space H α = N ν ∈ α H ν of functions defined on X α = × ν ∈ α X ν , and for x ∈ X , we let x α = ( x ν ) ν ∈ α ∈ X α denote the group of variables α . We denote by α c = D \ α . We use the conventions H ∅ = R and H D = H . Definition 2.1.
The α -rank of a function f : X → R s , denoted rank α ( f ) , is the minimal integer r α suchthat f ( x ) = r α X k =1 g αk ( x α ) h α c k ( x α c ) (2)4 or some functions g αk : X α → R and h αk : X α c → R s . The above definition generalizes the classical notion of α -rank for vector-valued functions. It coincideswith the classical notion of α -rank when f : X → R s is seen as a real-valued function of s + 1 variablesdefined on X × . . . × X d × { , . . . , s } . A function f ∈ H s admits a representation (2) with functions g αk ∈ H α and h α c k ∈ H sα c . For f = 0, we have rank ∅ ( f ) = 1 and 1 ≤ rank D ( f ) ≤ s .We let T be a dimension partition tree over D , with root D and leaves { ν } , 1 ≤ ν ≤ d . For a node α ∈ T , we denote by S ( α ) the set of children of α . For any node α , we have either S ( α ) = ∅ (for leafnodes) or S ( α ) ≥ L ( T ) the set of leaves of T , and by I ( T ) = T \ L ( T )its interior nodes. For an interior node α ∈ I ( T ), S ( α ) forms a partition of α . The T -rank of a function f is the tuple rank T ( f ) = (rank α ( f )) α ∈ T . The number of nodes of a dimension partition tree over D isbounded as | T | ≤ d − . Given a tuple r ∈ N | T | we introduce the model class M Tr ( H s ) of functions in H s with T -rank boundedby r , M Tr ( H s ) = { f ∈ H s : rank T ( f ) ≤ r } . A function f ∈ M Tr ( H s ) admits the representation f ( x ) = r D X k D =1 c k D g Dk D ( x )where the c k D are vectors in R s and where the functions g Dk D ∈ H are defined recursively. For any interiornode α ∈ I ( T ), the functions g αk α admit the representation g αk α ( x α ) = X ≤ k β ≤ r β β ∈ S ( α ) C αk α , ( k β ) β ∈ S ( α ) Y β ∈ S ( α ) g βk β ( x β ) , where C α ∈ R r α × ( × β ∈ S ( α ) r β ) . For a leaf node α ∈ L ( T ), the functions g αk α ∈ H α admit the representation g αk α ( x α ) = X i α ∈ I α C αk α ,i α φ αi α ( x α ) . We let C ∅ denote the matrix whose columns are the vectors c k D . We introduce the tree T ⋆ = T ∪ ∅ and we use the conventions r ∅ = s and S ( ∅ ) = D . A function f in M Tr ( H s ) therefore admits an explicitrepresentation f k ( x ) = X i α ∈ I α α ∈L ( T ) X ≤ k β ≤ r β β ∈ T C ∅ k,k D Y α ∈ T \L ( T ) C αk α , ( k β ) β ∈ S ( α ) Y α ∈L ( T ) C αk α ,i α Y α ∈L ( T ) φ αi α ( x α ) (3)where the set of parameters ( C α ) α ∈ T ⋆ form a tree network of tensors, and C α ∈ R { ,...,r α }× I α := R K α , where I α = { , . . . , r D } for α = ∅ , I α = × β ∈ S ( α ) { , . . . , r β } for α ∈ I ( T ) or I α = { , . . . , n α } for α ∈ L ( T ).We let R H ,T,r be the map which associates to a set of tensors ( C α ) α ∈ T ⋆ the function f = R H ,T,r (( C α ) α ∈ T ⋆ )defined by (3), so that M Tr ( H s ) = { f = R H ,T,r (( C α ) α ∈ T ⋆ ) : C α ∈ R K α , α ∈ T ⋆ } . From the representation (3), we obtain the following
Lemma 2.2.
The map R r,T, H is a multilinear map from the product space × α ∈ T ⋆ R K α to H s . Remark 2.3. If r D = s , the parameter C ∅ ∈ R s × s can be chosen as the identity matrix, so that theparameters of a function in M Tr ( H s ) are reduced to the set of tensors ( C α ) α ∈ T . This includes the classicalcase of tree-based tensor formats for real-valued functions ( s = r D = 1 ). In this situation, we let T ⋆ = T . .3 Tree tensor networks as compositions of multilinear functions A function f in M Tr ( H s ) admits a representation in terms of compositions of multilinear functions. For agiven α ∈ T , we let g α ( x α ) = ( g αk α ( x α )) ≤ k α ≤ r α ∈ R r α . The matrix C ∅ ∈ R s × r D is linearly identified with alinear map f ∅ from R r D to R s . Therefore, a function f in M Tr ( H s ) admits the representation f ( x ) = f ∅ ( g D ( x )) . For any α ∈ I ( T ), the tensor C α can be linearly identified with a multilinear map f α : × β ∈ S ( α ) R r β → R r α defined by f αk α (( z β ) β ∈ S ( α ) ) = X ≤ k β ≤ r β β ∈ S ( α ) C αk α , ( k β ) β ∈ S ( α ) Y β ∈ S ( α ) z βk β for z β ∈ R r β . Therefore, g α admits the representation g α ( x α ) = f α (( g β ( x β ) β ∈ S ( α ) ) . (4)For a leaf node α ∈ L ( T ), the tensor C α can be linearly identified with a linear map f α : R n α → R r α , and g α ( x α ) = f α ( φ α ( x α )) . (5)Therefore, a function f in M Tr ( H s ) can be parametrized by a tree network of linear or multilinear maps f = ( f α ) α ∈ T ⋆ (identified with the tree tensor network ( C α ) α ∈ T ⋆ ).We denote by F α the space of linear maps from R r D to R s for α = ∅ , the space of multilinear maps from × β ∈ S ( α ) R r β to R r α for α ∈ I ( T ), or the space of linear maps from R n α to R r α for a leaf node α ∈ L ( T ).We denote by F T,r := × α ∈ T ⋆ F α the parameter space and by R H ,T,r the representation map which associates to a network f = ( f α ) α ∈ T ⋆ ∈ F T,r the function f . Then M Tr ( H s ) = {R H ,T,r ( f ) : f ∈ F T,r } . Since F α is linearly identified with R K α for all α ∈ T ⋆ , we deduce the following property from Lemma 2.2. Lemma 2.4.
The map $\mathcal R_{H,T,r}$ is a multilinear map from the product space $\mathcal F_{T,r} = \times_{\alpha \in T^\star} \mathcal F_\alpha$ to the space of functions defined on $\mathcal X$.

2.4 Measures of complexity

When interpreting a tensor (or function) network $\mathbf f \in \mathcal F_{T,r}$ as a neural network, a classical measure of complexity is the number of neurons, which is the sum of ranks $r_\alpha$, $\alpha \in T^\star$. This leads to a first measure of complexity of a function $f = \mathcal R_{H,T,r}(\mathbf f)$ defined by
$$\mathrm{compl}_N(\mathbf f) = \sum_{\alpha \in T^\star} r_\alpha.$$
From an approximation or statistical perspective, a more natural measure of complexity for a function $f \in \mathcal M^T_r(H^s)$ is its representation complexity, that is the dimension of the corresponding parameter space $\mathcal F_{T,r}$, or the number of weights of the corresponding sum-product neural network. We let $N_\alpha = \dim(\mathcal F_\alpha)$, with $N_\alpha = s r_D$ for $\alpha = \emptyset$, $N_\alpha = r_\alpha n_\alpha$ for $\alpha \in \mathcal L(T)$ and $N_\alpha = r_\alpha \prod_{\beta \in S(\alpha)} r_\beta$ for $\alpha \in T^\star \setminus \mathcal L(T)$. Then the representation complexity of a function $f = \mathcal R_{H,T,r}(\mathbf f)$ is
$$\mathrm{compl}_C(\mathbf f) := C(T, r, H^s) = \sum_{\alpha \in T^\star} N_\alpha = s r_D + \sum_{\alpha \in \mathcal I(T)} r_\alpha \prod_{\beta \in S(\alpha)} r_\beta + \sum_{\alpha \in \mathcal L(T)} r_\alpha n_\alpha. \quad (6)$$

Remark 2.5. If $r_D = s$, the function $f^\emptyset : \mathbb R^s \to \mathbb R^s$ can be taken as the identity map, so that the parameters of $\mathcal M^T_r(H^s)$ are reduced to the set of functions $\mathbf f = (f^\alpha)_{\alpha \in T}$. In this case, we let $T^\star = T$, and the complexity is
$$\mathrm{compl}_C(\mathbf f) := C(T, r, H^s) = \sum_{\alpha \in T} N_\alpha = \sum_{\alpha \in \mathcal I(T)} r_\alpha \prod_{\beta \in S(\alpha)} r_\beta + \sum_{\alpha \in \mathcal L(T)} r_\alpha n_\alpha. \quad (7)$$

Another measure of complexity of $f = \mathcal R_{H,T,r}(\mathbf f)$ can be defined as
$$\mathrm{compl}_S(\mathbf f) = \sum_{\alpha \in T^\star} \|f^\alpha\|_{\ell_0}, \quad (8)$$
where $\|f^\alpha\|_{\ell_0}$ is the number of non-zero entries in the tensor $C^\alpha$ associated with the multilinear map $f^\alpha$. This measure of complexity takes into account a possible sparsity in tensors or in the corresponding sum-product neural network. We note that $\mathrm{compl}_S(\mathbf f) \le \mathrm{compl}_C(\mathbf f)$. These different measures of complexity lead to the definition of different approximation tools and corresponding approximation classes, see [2, 3] for tensor networks, and [19] for similar results on ReLU or RePU neural networks.
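To make the representation complexity (6) concrete, the short sketch below computes $\mathrm{compl}_C$ for a given dimension partition tree, tuple of ranks and feature dimensions. The encoding of the tree (a dictionary mapping each node, stored as a frozenset of dimensions, to the list of its children) and the function name are illustrative choices, not notation from the paper.

```python
from math import prod

def representation_complexity(children, ranks, n_feat, s=1):
    """Evaluate compl_C = C(T, r, H^s) as in formula (6).

    children : dict mapping each node alpha (a frozenset of dimensions) to the
               list of its children; leaves map to an empty list.
    ranks    : dict mapping each node alpha to its rank r_alpha.
    n_feat   : dict mapping each leaf {nu} to the number of features n_nu.
    s        : number of components of the vector-valued function.
    """
    root = max(children, key=len)              # the root is the full set D
    total = s * ranks[root]                    # contribution of C^emptyset: s * r_D
    for alpha, kids in children.items():
        if kids:                               # interior node: r_alpha * prod of children ranks
            total += ranks[alpha] * prod(ranks[beta] for beta in kids)
        else:                                  # leaf node: r_alpha * n_alpha
            total += ranks[alpha] * n_feat[alpha]
    return total

# Example: balanced binary tree over {1,...,4}, all ranks equal to 3, 10 features per variable.
leaves = [frozenset({i}) for i in range(1, 5)]
a12, a34, root = frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2, 3, 4})
children = {root: [a12, a34], a12: leaves[:2], a34: leaves[2:], **{l: [] for l in leaves}}
ranks = {alpha: 3 for alpha in children}
n_feat = {l: 10 for l in leaves}
print(representation_complexity(children, ranks, n_feat))  # prints 204 (= 3 + 81 + 120)
```

In the setting of Remark 2.5 ($r_D = s$), the first term $s\,r_D$ would simply be dropped, which gives formula (7).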
A function f ∈ M Tr ( H s ) admits infinitely many equivalent parametrizations. From the multilinearity ofthe representation map R H ,T,r (see Lemma 2.4), it is clear that the model class M Tr ( H s ) is a cone, i.e. cM Tr ( H s ) ⊂ M Tr ( H s ) for any c ∈ R , and that given some norms k · k F α on the spaces F α , α ∈ T ⋆ , we have M Tr ( H s ) = { cf : c ∈ R , f ∈ M Tr ( H s ) } , where M Tr ( H s ) are elements of M Tr ( H s ) with bounded parameters, defined by M Tr ( H s ) = { f = R H ,T,r ( f ) : f = ( f α ) α ∈ T ⋆ ∈ F T,r : , k f α k F α ≤ , α ∈ T ⋆ } . (9) We assume that the sets X ν are equipped with finite measures µ ν , for all ν ∈ D = { , . . . , d } , and the set X is equipped with the product measure µ = µ ⊗ . . . ⊗ µ d . For 1 ≤ p ≤ ∞ , we consider the space L pµ ( X ; R s )of measurable functions defined on X with values in R s , with bounded norm k · k p,µ defined by k f k pp,µ = Z X k f ( x ) k pp dµ ( x ) for 1 ≤ p < ∞ , or k f k ∞ ,µ = µ - ess sup X | f | . We also consider the space L ∞ ( X ; R s ) of functions defined on X with values in R s , with bounded norm k f k ∞ = sup x ∈X | f ( x ) | . In the following, we denote by L λ ( X ; R s ) the space L pµ ( X ; R s ) equipped with the norm k·k p,µ when λ = ( p, µ )or the space L ∞ ( X ; R s ) equipped with the norm k · k ∞ when λ = ∞ . If H ν ⊂ L λ ( X ν ) for all ν ∈ D , then H ⊂ L λ ( X ) and H s ⊂ L λ ( X ; R s ). We here study the continuity properties of the representation map R H ,T,r as a map from F T,r = × α ∈ T ⋆ F α to H s ⊂ L λ ( X ; R s ), with λ = ( p, µ ) or λ = ∞ . We consider norms k · k F α on space F α , α ∈ T ⋆ , and theproduct norm k · k F over F T,r defined by k ( f α ) α ∈ T ⋆ k F T,r = max α ∈ T ⋆ k f α k F α . From the multilinearity of R H ,T,r (Lemma 2.4), we easily deduce the following property.7 emma 3.1. Assuming
H ⊂ L λ ( X ) , with either λ = ( p, µ ) or λ = ∞ , the multilinear map R H ,T,r from F T,r to H s ⊂ L λ ( X ; R s ) is continuous and such that for all f = R H ,T,r (( f α ) α ∈ T ⋆ ) in M Tr ( H s ) , k f k λ ≤ L λ Y α ∈ T ⋆ k f α k F α for some constant L λ < ∞ independent of f defined by L λ = sup f = R H ,T,r (( f α ) α ∈ T⋆ ) k f k λ Q α ∈ T ⋆ k f α k F α . (10)We denote by B ( F α ) the unit ball of F α and by B ( F T,r ) the unit ball of F . The set M Tr ( H s ) definedby (9) is such that M Tr ( H s ) = R H ,T,r ( B ( F T,r )) . (11)We then deduce that the map R H ,T,r is Lipschitz continuous on the set M Tr ( H s ) . Lemma 3.2.
Assuming
H ⊂ L λ ( X ) , with either λ = ( p, µ ) or λ = ∞ , for all f = R H ,T,r ( f ) and ˜ f = R H ,T,r (˜ f ) in M Tr ( H s ) , k f − ˜ f k λ ≤ L λ X α ∈ T ⋆ k f α − ˜ f α k F α ≤ L λ | T ⋆ |k f − ˜ f k F T,r . Proof.
Denoting by $\{\alpha_1, \dots, \alpha_K\}$ the elements of $T^\star$, we have $f - \tilde f = \sum_{k=1}^K \mathcal R_{H,T,r}(\tilde f^{\alpha_1}, \dots, \tilde f^{\alpha_{k-1}}, f^{\alpha_k} - \tilde f^{\alpha_k}, f^{\alpha_{k+1}}, \dots, f^{\alpha_K})$. Then from Lemma 3.1, we obtain
$$\|f - \tilde f\|_\lambda \le L_\lambda \sum_{k=1}^K \|f^{\alpha_k} - \tilde f^{\alpha_k}\|_{\mathcal F_{\alpha_k}} \prod_{i < k} \|\tilde f^{\alpha_i}\|_{\mathcal F_{\alpha_i}} \prod_{i > k} \|f^{\alpha_i}\|_{\mathcal F_{\alpha_i}} \le L_\lambda \sum_{k=1}^K \|f^{\alpha_k} - \tilde f^{\alpha_k}\|_{\mathcal F_{\alpha_k}},$$
where the last inequality uses that the parameters of functions in the set defined by (9) satisfy $\|f^{\alpha_i}\|_{\mathcal F_{\alpha_i}} \le 1$ and $\|\tilde f^{\alpha_i}\|_{\mathcal F_{\alpha_i}} \le 1$. The second inequality of Lemma 3.2 follows by bounding each of the $K = |T^\star|$ terms by $\|\mathbf f - \tilde{\mathbf f}\|_{\mathcal F_{T,r}}$, which ends the proof.

Proposition 3.3. Assuming that
H ⊂ L λ ( X ) , with either λ = ∞ or λ = ( p, µ ) , ≤ p ≤ ∞ , the metricentropy of the model class M Tr ( H s ) R = { cf : c ∈ R , | c | ≤ R, f ∈ M Tr ( H s ) } (13) in L λ ( X ; R s ) is such that H ( ǫ, M Tr ( H s ) R , k · k λ ) ≤ C ( T, r, H s ) log(3 ǫ − RL λ | T ⋆ | ) . Proof.
The covering number of the unit ball B ( F α ) of the N α -dimensional space F α is such that N ( ǫ, B ( F α ) , k·k F α ) ≤ (3 ǫ − ) N α . Then the unit ball B ( F T,r ) of the product space F T,r equipped with the product topol-ogy has a covering number N ( ǫ, B ( F T,r ) , k · k F T,r ) ≤ Q α ∈ T ⋆ N ( ǫ, B ( F α ) , k · k F α ) ≤ (3 ǫ − ) C ( T,r, H s ) with C ( T, r, H s ) = P α ∈ T ⋆ N α . From the Lipschitz continuity of R H ,T,r on M Tr ( H s ) (Lemma 3.2), we deducethat N ( ǫ, M Tr ( H s ) , k · k λ ) ≤ (3 ǫ − L λ | T ⋆ | ) C ( T,r, H s ) , from which we deduce that N ( ǫ, M Tr ( H s ) R , k · k λ ) ≤ (3 ǫ − RL λ | T ⋆ | ) C ( T,r, H s ) , which ends the proof. 8f f and f are two functions from L pµ ( X ; R s ), the collection of functions f ∈ L pµ ( X ; R s ) such that f ≤ f ≤ f almost everywhere is denoted by [ f , f ] and called a bracket with extremities f and f .The diameter of the bracket [ f , f ] for the norm k · k p,µ is given by k f − f k p,µ . The bracketing number N [] ( ǫ, K, k · k p,µ ) of a set K is defined as the minimal number of brackets with diameters less than ǫ whichare necessary to cover K . The corresponding bracketing entropy is defined as H [] ( ǫ, K, k · k p,µ ) := log N [] ( ǫ, K, k · k p,µ ) . Lemma 3.4.
For any ≤ p ≤ ∞ and any compact set K in L p ( X ; R s ) , H [] ( ǫ, K, k · k p,µ ) ≤ H ( ǫ µ ( X ) − /p , K, k · k ∞ ,µ ) , where µ ( X ) is the mass of the measure µ, and µ ( X ) − /p = 1 for p = ∞ .Proof. Let γ = ǫ µ ( X ) − /p and let N be a γ -net of K for the norm k · k ∞ ,µ with cardinal N ( γ, K, k · k ∞ ,µ ).Then for any f ∈ K , there exists a ˜ f ∈ N such that k f − ˜ f k ∞ ,µ ≤ γ , which implies that f is in the bracket[ ˜ f − γ, ˜ f + γ ] with diameter k γ k p,µ = 2 γµ ( X ) /p = ǫ . Then the collection of brackets { [ ˜ f − γ, ˜ f + γ ] : ˜ f ∈ N } with diameters ǫ covers K , which implies N [] ( ǫ, K, k · k p,µ ) ≤ N ( γ, K, k · k ∞ ,µ ) , which ends the proof.From Proposition 3.3 and Lemma 3.4, we directly deduce the following result. Proposition 3.5.
For any ≤ p ≤ ∞ , the set M Tr ( H s ) R defined in (13) has a bracketing entropy H [] ( ǫ, M Tr ( H s ) R , k · k p,µ ) ≤ C ( T, r, H s ) log(6 ǫ − µ ( X ) /p RL ∞ ,µ | T ⋆ | ) , with µ ( X ) − /p = 1 for p = ∞ . Assume that
H ⊂ L λ ( X ), with either λ = ( p, µ ) or λ = ∞ . The continuity constant L λ of the map R H ,T,r defined by (10) depends on λ , the norms on F α , the chosen basis for H and also on the measure µ when λ = ( p, µ ). We here introduce a particular choice of norms and basis functions which allows to bound thecontinuity constant L λ . We consider on the space F ∅ of linear maps from R r D to R s the norm (with p = ∞ when λ = ∞ ) k f ∅ k F ∅ = max z ∈ R rD k f ∅ ( z ) k p k z k p , which coincides with the classical matrix p -norm. For any interior node α ∈ I ( T ), we introduce a norm k · k F α over the space F α of multilinear maps f α : × β ∈ S ( α ) R r β → R r α , defined by k f α k F α = max ( z β ) β ∈ S ( α ) ∈ × β ∈ S ( α ) R rβ k f α (( z β ) β ∈ S ( α ) ) k p Q β ∈ S ( α ) k z β k p . For a leaf node α ∈ L ( T ), we introduce a norm k · k F α over the space F α of linear maps f α : R n α → R r α ,defined by k f α k F α = max z α ∈ R nα k f α ( z α ) k p k z α k p . (14)We assume that for any ν ∈ D , the feature map φ ν : X ν → R n ν is such that k φ ν k λ = 1. For λ = ( ∞ , µ )(resp. λ = ∞ ), that means that basis functions φ νi ν ( x ν ) have a unit norm in L ∞ ,µ ( X ν ) (resp. L ∞ ( X ν )).For p < ∞ , that means that P n ν i =1 k φ νi k pp,µ = 1, which can be obtained by rescaling basis functions so that k φ νi k p,µ = n − /pν . Proposition 3.6.
Assume
$H \subset L_\lambda(\mathcal X)$, with either $\lambda = (p, \mu)$ or $\lambda = \infty$. With the above choice of norms and normalization of basis functions (with $p = \infty$ when $\lambda = \infty$), the continuity constant $L_\lambda$ defined by (10) is such that $L_\lambda \le 1$, and for all $1 \le q \le p$, $L_{q,\mu} \le \mu(\mathcal X)^{1/q - 1/p} L_\lambda \le \mu(\mathcal X)^{1/q - 1/p}$.

Proof. See Appendix A.1.
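As a small numerical illustration of how Propositions 3.3 and 3.6 combine, the sketch below evaluates the metric entropy bound $C(T, r, H^s)\,\log(3\,\epsilon^{-1} R\, L_\lambda\, |T^\star|)$ of Proposition 3.3, using $L_\lambda \le 1$ as granted by the normalization of this section; the numbers reuse the hypothetical example of the complexity sketch in Section 2.4 and are not taken from the paper.

```python
import math

def entropy_bound(C_Trs, radius, n_nodes, eps, L_lambda=1.0):
    """Upper bound of Proposition 3.3 on the metric entropy of M^T_r(H^s)_R:
    C(T,r,H^s) * log(3 * eps^{-1} * R * L_lambda * |T*|).
    With the norms and normalized feature maps of Section 3.3, L_lambda <= 1."""
    return C_Trs * math.log(3.0 * radius * L_lambda * n_nodes / eps)

# Complexity C = 204 and |T*| = 8 from the earlier example, radius R = 10, scale eps = 1e-2.
print(entropy_bound(204, radius=10.0, n_nodes=8, eps=1e-2))  # about 2.1e3
```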
Risk bounds for empirical risk minimization
In this section, we analyze the estimation error for tree tensor networks obtained by empirical risk mini-mization. We consider as fixed the approximation space H , the tree T and the rank r ∈ N | T | . We assumethat H ⊂ L ∞ ,µ ( X ), with X equipped with a finite measure µ . We consider the model class M Tr ( H s ) R := M of tree tensor networks with bounded parameters, with the norms defined in Section 3.3 for λ = ( ∞ , µ )( p = ∞ ). We denote by C M = C ( T, r, H s ) the representation complexity of M defined by (6) (or (7) when r D = s ). We consider a risk R ( f ) = E ( γ ( f, Z )) , where Z is a random variable taking values in Z and where γ : R X × Z → R is some contrast function.The minimizer of the risk over measurable functions defined on X is the target function f ⋆ . For f random(depending on the data), E ( γ ( f, Z )) shall be understood as an expectation E Z ( γ ( f, Z )) w.r.t. Z (conditionalto the data). Example 4.1.
For supervised learning, we consider a random variable Z = ( X, Y ) , with Y a randomvariable with values in R s , X a X -valued random variable with probability law µ . The contrast is chosen as γ ( f, ( x, y )) = ℓ ( y, f ( x )) with ℓ a loss function measuring a discrepancy between y and the prediction f ( x ) . Example 4.2.
For the problem of estimating the probability distribution of a random variable X , weconsider Z = X and s = 1 . Given the model class M , we denote by f M a minimizer over M of the risk R , and by ˆ f Mn a minimizerover M of the empirical risk b R n ( f ) = 1 n n X i =1 γ ( f, Z i ) , which is seen as an empirical process over M . We introduce the excess risk E ( f ) = R ( f ) − R ( f ⋆ ) . The excess risk for the estimator ˆ f Mn satisfies E ( ˆ f Mn ) = E ( f M ) + R ( ˆ f Mn ) − R ( f M ) , (15)where E ( f M ) is the best approximation error in M and R ( ˆ f Mn ) − R ( f M ) is the estimation error. Usingthe optimality of ˆ f Mn , we obtain that the estimation error satisfies R ( ˆ f Mn ) − R ( f M ) ≤ b R n ( f M ) − R ( f M ) − b R n ( ˆ f Mn ) + R ( ˆ f Mn ) := ¯ R n ( f M ) − ¯ R n ( ˆ f Mn ) , (16)where ¯ R n ( f ) is the centered empirical process¯ R n ( f ) = b R n ( f ) − R ( f ) = 1 n n X i =1 γ ( f, Z i ) − E ( γ ( f, Z )) . (17)To obtain bounds of the estimation error, it remains to quantify the fluctuations of the centered empiricalprocess ¯ R n ( f ). We here apply classical results to control the fluctuations of the supremum of the empirical process ¯ R n ( f )over the model class M . Assumption 4.3 (Bounded contrast) . Assume that γ is uniformly bounded over M × Z , i.e. | γ ( f, Z ) | ≤ B (18) holds almost surely for all f ∈ M , with B a constant independent of f . R n ( f ). Lemma 4.4.
Under assumption 4.3, we have that P ( ¯ R n ( f ) > ǫB ) ∨ P ( ¯ R n ( f ) < − ǫB ) ≤ e − n ǫ (19) holds for all f ∈ M .Proof. We have b R n ( f ) − R ( f ) = n P ni =1 A fi − E ( A f ), where the A fi = γ ( f, Z i ) are i.i.d. copies of therandom variable A f = γ ( f, Z ). From Assumption 4.3, we have that | A f | ≤ B almost surely, so that A f issubgaussian with parameter B and the result simply follows from Hoeffding’s inequality.A stronger assumption is required to obtain a uniform concentration inequality for the empirical process¯ R n ( f ) over M . Assumption 4.5.
Assume that γ ( · , Z ) is Lipschitz continuous over M ⊂ L ∞ ,µ ( X ; R s ) , i.e. | γ ( f, Z ) − γ ( g, Z ) | ≤ Lk f − g k ∞ ,µ (20) holds almost surely for all f, g ∈ M , with L a constant independent of f and g . Lemma 4.6.
Under Assumptions 4.3 and 4.5, we have that P ( sup f ∈ M ¯ R n ( f ) > ǫB ) ∨ P ( inf f ∈ M ¯ R n ( f ) < − ǫB ) ≤ N ǫB L e − nǫ , (21) where N ǫB L = N ( ǫB L , M, k · k ∞ ,µ ) is the covering number of M at scale ǫB L , and log N ǫB L ≤ C M log (cid:0) L B − R | T ⋆ | ǫ − (cid:1) . Proof.
See Appendix A.2.
Lemma 4.7.
Under Assumptions 4.3 and 4.5, E ( sup f ∈ M | ¯ R n ( f ) | ) ≤ B p C M r β ∨ e ) √ n ) n . with β = 6 L B − R | T ⋆ | . Proof.
See Appendix A.2.
From the properties of the centered empirical process, we can now derive upper bounds of the estimationerror in probability and in expectation.
Proposition 4.8.
Under Assumptions 4.3 and 4.5, the estimation error satisfies P ( R ( ˆ f Mn ) − R ( f M ) > ǫB ) ≤ e C M log( βǫ − ) − nǫ , where β = 6 L B − R | T ⋆ | . Moreover, E ( R ( ˆ f Mn ) − R ( f M )) ≤ B p C M r β ∨ e ) √ n ) n , and thus E ( E ( ˆ f Mn )) ≤ E ( f M ) + 4 B p C M r β ∨ e ) √ n ) n . roof. See Appendix A.2.
Proposition 4.9.
Under Assumptions 4.3 and 4.5, for any t > , with probability larger than − exp( − t ) , sup f ∈ M − ¯ R n ( f ) ≤ B p C M r L B − R | T ⋆ |√ n ) n + 2 B r t n . (22) Moreover, with probability larger than − exp( − t ) , E ( ˆ f Mn ) ≤ E ( f M ) + 8 B p C M r L B − R | T ⋆ |√ n ) n + 4 B r t n . (23) Proof.
See Appendix A.2.
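Before turning to the examples, the following minimal sketch illustrates the quantities handled in this section in a least-squares setting: the empirical risk $\hat{\mathcal R}_n$, its minimizer over a model class, and a Monte Carlo estimate of the risk $\mathcal R$ on an independent sample. For brevity the model class is a plain linear-in-features class rather than a tree tensor network, and all names and constants are illustrative, so this is only a sketch of the empirical risk minimization setting, not of the paper's estimators.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):
    """Polynomial feature map of degree 5 (an arbitrary illustrative choice)."""
    return np.vander(x, N=6, increasing=True)

def empirical_risk(theta, x, y):
    """Least-squares contrast gamma(f, (x, y)) = (y - f(x))^2 averaged over the sample."""
    return float(np.mean((y - features(x) @ theta) ** 2))

def sample(n, f_star, noise=0.05):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, f_star(x) + noise * rng.standard_normal(n)

f_star = np.sqrt                                   # target function
x_train, y_train = sample(200, f_star)

# Empirical risk minimizer over the model class (here, ordinary least squares).
theta_hat, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

# Risk estimated on an independent sample; the gap with the training value
# reflects the estimation error discussed above.
x_test, y_test = sample(10_000, f_star)
print("empirical risk:", empirical_risk(theta_hat, x_train, y_train))
print("estimated risk:", empirical_risk(theta_hat, x_test, y_test))
```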
Example 4.10 (Least-squares bounded regression) . We consider the least-squares regression setting with γ ( f, Z ) = k Y − f ( X ) k ℓ . Let µ be the distribution of X . The excess risk E ( f ) = R ( f ) −R ( f ⋆ ) = k f − f ⋆ k ,µ admits f ⋆ ( x ) = E ( Y | X = x ) as a minimizer. We assume that k Y k ℓ ∞ ≤ R almost surely. For all f ∈ M ,we have γ ( f, Z ) ≤ s k Y − f ( X ) k ℓ ∞ ≤ s ( k Y k ℓ ∞ + k f k ∞ ) , so that ≤ γ ( f, Z ) ≤ B almost surely, with B = 4 sR . Also, it holds almost surely | γ ( f, Z ) − γ ( g, Z ) | = | (2 Y − f ( X ) − g ( X ) , f ( X ) − g ( X )) ℓ |≤ k Y − f ( X ) − g ( X ) k ℓ k f ( X ) − g ( X ) k ℓ ∞ ≤ s (2 k Y k ℓ ∞ + k g k ∞ ,µ + k f k ∞ ,µ ) k f − g k ∞ ,µ . Then for all f, g ∈ M , | γ ( f, Z ) − γ ( g, Z ) | ≤ Lk f − g k ∞ ,µ with L = 4 sR . The constant β from Proposition 4.8is β = 6 | T ⋆ | . Example 4.11 ( L density estimation) . We consider the estimation of the probability law ν of X . Assum-ing that ν admits a density f ⋆ with respect to the measure µ , and assuming f ⋆ ∈ L µ ( X ) , we consider thecontrast γ ( f, x ) = k f k ,µ − f ( x ) , so that E ( f ) = R ( f ) − R ( f ⋆ ) = k f − f ⋆ k ,µ admits f ⋆ as a minimizer.We assume that µ is a finite measure on X and that f ⋆ is uniformly bounded by R . Then | γ ( f, X ) | ≤ B almost surely with B = R ( µ ( X ) R + 2) . Also, for all f, g ∈ M , we have almost surely | γ ( f, X ) − γ ( g, X ) | = |k f k ,µ − k g k ,µ − f ( X ) − g ( X )) |≤ | Z ( f − g )( f + g ) dµ | + 2 k f − g k ∞ ,µ ≤ ( k f + g k ,µ + 2) k f − g k ∞ ,µ ≤ Lk f − g k ∞ ,µ with L = 2( µ ( X ) R + 1) . Since /R ≤ L /B ≤ /R , the constant β from Proposition 4.8 is such that | T ⋆ | ≤ β ≤ | T ⋆ | . In this section, we provide an improved excess risk bound in the specific case of least squares contrasts.Our results come from Talagrand Inequalities and generic chaining bounds ; we follow the presentationgiven in the book of [27]. The excess risk bound given below strongly relies on the link between the excessrisk and the variance of the excess loss (see Inequality ( ?? ) in the proof of Proposition 4.12), as explainedin Chapter 5 of [27] and Chapter 8 in [28].Let γ be either the least squares contrast in the bounded regression setting (as described in Exam-ple 4.10, with s = 1), or the least squares contrast for density estimation (as described in Example 4.11).In particular, note that in the regression setting it is assumed that k Y k ℓ ∞ ≤ R almost surely.12s before, we consider the model class M = M Tr ( H ) R of tree tensor networks with bounded parameters.Contrary to the two previous subsections, it is now assumed that H ⊂ L ∞ ( X ) equipped with the norm k·k ∞ and we still use the normalization of the parameters with λ = ∞ ( p = ∞ ) introduced in Section 3.3. Notethat L ∞ ( X ) ⊂ L ∞ ,µ ( X ), where µ is the distribution of the X i ’s in the regression setting (see Example 4.10)or the reference measure for density estimation (see Example 4.11). In particular, in this setting k f k ∞ ,µ ≤k f k ∞ < ∞ for any f ∈ H . Proposition 4.12.
Under the previous assumptions, there exists an absolute constant A and a constant κ such that for any ε ∈ (0 , and any t > , with probability at least − A exp( − t ) , it holds E ( ˆ f Mn ) ≤ (1 + ε ) E ( f M ) + κR n (cid:20) a T C M ε log + (cid:18) nε a T C M (cid:19) + tε (cid:21) (24) where a T = 1 + log + (cid:16) | T ⋆ | e (cid:17) , and κ depends on linearly on µ ( X ) . Then by integrating according to t , weobtain that for any ε ∈ (0 , , E E ( ˆ f Mn ) ≤ (1 + ε ) E ( f M ) + κR n (cid:20) a T C M ε log + (cid:18) nε a T C M (cid:19) + A ε (cid:21) . Proof.
See Appendix A.2.1.Note that the term a T is upper bounded by a term of the order of log d because | T ∗ | ≤ d . Thus theconstants in the risk bound (24) does not explode with the dimension d in regression. Note however thatin density estimation, the constant κ depends linearly on the mass µ ( X ) of the reference measure, whichmay grow exponentially with d . We now consider a family of approximation spaces H m = H sm ⊂ L ∞ ,µ ( X ), m ∈ M , with X equipped witha finite measure µ , as in Sections 4.1 and 4.2. Let ( M m ) m ∈M be a given family of tree tensor networks with M m = M T m r m ( H m ) R and where the parameters are bounded according to the norms defined in Section 3.3for λ = ( ∞ , µ ) ( p = ∞ ). Each model m has a particular tree T m , a rank r m , an approximation space H m , and a radius R . We denote by C m = C ( T m , r m , H m ) the corresponding representation complexity. Forsome m ∈ M , we let f m be a minimizer of the risk over M m , f m ∈ arg min f ∈ M m R ( f ) , and ˆ f m be a minimizer of the empirical risk over M m , ˆ f m ∈ arg min f ∈ M m b R n ( f ) . At this stage of the procedure, we have at hand a family of predictors ˆ f m and our goal is to provide astrategy for selecting a good predictor. To this aim, we make use of the model selection approach of Barron,Birg´e and Massart. More precisely, we adapt a general theorem from [28] to our problem. Similar modelselection strategies can be found in [34, 21, 11], see also [9] for an application to the selection of principalcurves.Given some penalty function pen : M → R + , we define ˆ m as the minimizer over M of the criterioncrit( m ) := b R n ( ˆ f m ) + pen( m ) , (25)and we finally select the predictor ˆ f ˆ m . With µ ( X ) = 1 for regression. ssumption 5.1. We consider a family of positive weights ( x m ) m ∈M over the family of models such that Σ = X m ∈M exp( − x m ) < ∞ . This assumption and the choice of the weights is the discussed further in Section 5.3.
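The selection rule (25) is straightforward to implement once the empirical risks and the complexities of the fitted predictors are available. The sketch below assumes hypothetical lists `empirical_risks` and `complexities` (one entry per model $m$) and uses a penalty shape $C_m/n$ as in the bounded least-squares setting discussed below; replacing it by $\sqrt{C_m/n}$ gives the general-setting variant.

```python
import numpy as np

def select_model(empirical_risks, complexities, n, lam, shape="linear"):
    """Return the index m_hat minimizing crit(m) = R_n(f_hat_m) + pen(m),
    with pen(m) = lam * C_m / n ("linear") or lam * sqrt(C_m / n) ("sqrt")."""
    risks = np.asarray(empirical_risks, dtype=float)
    C = np.asarray(complexities, dtype=float)
    pen = lam * (C / n if shape == "linear" else np.sqrt(C / n))
    return int(np.argmin(risks + pen))

# Hypothetical values for a small collection of fitted models.
empirical_risks = [1.2e-3, 4.0e-4, 2.5e-4, 2.4e-4]
complexities = [40, 90, 160, 300]
print(select_model(empirical_risks, complexities, n=1000, lam=1e-3))  # prints 2
```

In practice the constant $\lambda$ is not chosen by hand but calibrated with the slope heuristics method presented in Section 5.5.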
We follow a standard strategy that corresponds to the so-called Vapnik's structural minimization of the risk method (see for instance [28, Section 8.2]) to choose the penalty function and derive a risk bound for the estimator selected by the criterion (25). By definition of $\hat m$, for any $m \in \mathcal M$,
$$\hat{\mathcal R}_n(\hat f_{\hat m}) + \mathrm{pen}(\hat m) \le \hat{\mathcal R}_n(\hat f_m) + \mathrm{pen}(m) \le \hat{\mathcal R}_n(f_m) + \mathrm{pen}(m).$$
Therefore,
$$\hat{\mathcal R}_n(\hat f_{\hat m}) \le \hat{\mathcal R}_n(f_m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m)$$
and thus
$$\mathcal R(\hat f_{\hat m}) + \bar{\mathcal R}_n(\hat f_{\hat m}) \le \mathcal R(f_m) + \bar{\mathcal R}_n(f_m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m),$$
where $\bar{\mathcal R}_n(f)$ is the centered empirical process defined in (17). We finally derive the following upper bound on the excess risk:
$$\mathcal E(\hat f_{\hat m}) \le \mathcal E(f_m) + \bar{\mathcal R}_n(f_m) - \bar{\mathcal R}_n(\hat f_{\hat m}) - \mathrm{pen}(\hat m) + \mathrm{pen}(m). \quad (26)$$
We now provide a risk bound for a model selection strategy based on the criterion (25) with a suitable choice of penalty.

Theorem 5.2.
Under Assumptions 4.3, 4.5 and 5.1, if the penalty is such that pen( m ) ≥ λ m r C m n + 2 B r x m n , (27) with λ m = 4 B q L B − R | T m | ⋆ √ n ) , then the estimator ˆ f ˆ m selected according to the criterion (25) satisfies the following risk bound E ( E ( ˆ f ˆ m )) ≤ inf m ∈M {E ( f m ) + pen( m ) } + B Σ r π n . Proof.
See Appendix A.3.Theorem 5.2 gives a strong justification for using a penalty proportional to p C m /n , at least for nottoo large family of models. However, it is known that the Vapnik’s structural minimization of the riskmay lead to suboptimal rates of convergence. For instance, in the bounded regression setting, it is knownthat a penalty proportional to the VapnikChervonenkis dimension (typically in O ( C m /n ) leads to minimaxrates of convergence in various setting (see for instance Chapter 12 in [21]) whereas Vapnik’s structuralminimization of the risk (typically with penalty in O ( p C m /n )) is too pessimistic to provide fast rates ofconvergence. Note that the approach of [21] is based on a truncation strategy which is not easy to calibratein practice. In the next section, we give an improved model selection result for least squares inference.14 .2 Oracle inequalities for least squares inference on tree tensor networks In this section, we give an improved model selection result for least squares inference based on Proposi-tion 4.12. This corresponds to the approach presented in Sections 8.3 and 8.4 of [28] or in Section 6.3 of[27].We consider least squares density estimation and least squares bounded regression ( s = 1) in the sameframework as Section 4.3: we now consider a family of approximation spaces H m ⊂ L ∞ ( X ) with s = 1 andequipped with the norm k · k ∞ . We use the same normalization of the parameters with p = ∞ ( λ = ∞ ) asintroduced in Section 3.3. As before we consider a family of tree tensor networks ( M m ) m ∈M where eachmodel M m = M T m r m ( H m ) R has a particular tree T m , a rank r m , an approximation space H m , and a radius R . Theorem 5.3.
In the setting of Proposition 4.12 and under Assumption 5.1, there exists numerical con-stants K and K and K such that if the penalty satisfies pen( m ) = K R (cid:20) a m C m nε log nε a m C m + x m nε (cid:21) with a m = 1+log + (cid:16) | T ⋆m | e (cid:17) , then the estimator ˆ f ˆ m selected according to the penalized criterion (25) satisfiesthe following oracle inequality E E ( ˆ f ˆ m ) ≤ ε − ε inf m ∈M (cid:26) E ( f m ) + K R (cid:20) a m C m nε log nε a m C m + x m nε (cid:21)(cid:27) + K R Σ n εε (1 − ε ) . (28) Proof.
The proof is adapted from Theorem 6.5 in [27], see Appendix A.3.This theorem provides an improved oracle inequality bound with a penalty in C m n , up to logarithmicterms. In Section 5.4, we will derive adaptive optimal rates of convergence (in the minimax sense) fromthis model selection result. In Section 6 we illustrate how to calibrate the penalty in practice using theslope heuristics method. The weights x m represent the price to pay for the richness of the model collection, when there are manymodels with the same complexity C m . A typical choice for the weights is x m = x ( C m ) with a weightfunction x such that x ( c ) ≥ βc + log( N c ) , where N c = |{ m ∈ M : C m = c }| is the number of models with complexity c , and β some positive constant.With such a choice, Σ = P m ∈M exp( − x m ) = P c ≥ N c exp( − x ( c )) ≤ ( e β − − , so that Assumption 5.1 issatisfied. With such a weight function, if the model collection is not too rich, the weight x m is comparableto or smaller than the complexity C m .We restrict the following analysis to the case where the approximation space is fixed: H m = H s for any m ∈ M and we only consider binary trees, for which | T m | = 2 d − T is fixed and we need to upper bound the number N c of modelshaving the complexity c to define the weights. According the definition of the representation complexitygiven in Section 2.4, a format with complexity c satisfies c = X α ∈ T ⋆ N α = sr D + X α ∈I ( T ) r α Y β ∈ S ( α ) r β + X α ∈L ( T ) r α n α . (29)15he number of triplets of integers ( k , k , k ) such that the product k k k is less than an integer q α isclearly less than q α . So, the number of formats such that N α = q α for any α ∈ T ⋆ is less than Y α ∈ T ⋆ q α ≤ Y α ∈ T ⋆ q α ! / | T ⋆ | | T ⋆ | ≤ " | T ⋆ | X α ∈ T ⋆ q α | T ⋆ | ≤ (cid:20) c | T ⋆ | (cid:21) | T ⋆ | . Moreover, the number of tuple of integers ( q α ) α ∈ T ⋆ satisfying P α ∈ T ⋆ q α = c is (cid:0) c + | T ⋆ | c (cid:1) . For a fixed binarytree, the number N c of all possible formats of complexity c is thus such that N c ≤ (cid:18) c + | T ⋆ | c (cid:19) (cid:20) c | T ⋆ | (cid:21) | T ⋆ | . Using the inequality log (cid:18) kℓ (cid:19) ≤ ℓ (1 + log kℓ ) , (30)and the fact that | T ⋆ | ≤ C m for any model m in the collection, we obtainlog( N c ) ≤ c (1 + log( c + | T ⋆ | c )) + 2 | T ⋆ | log( c | T ⋆ | ) ≤ c (1 + log(2)) + 4 d log( c ) . c. Then for a given binary tree T , we finally take a weight function x ( c ) = ηc (31)for some η >
0. In the situation where all the formats of the collection rely on a same tree T , using theweight function given in (31), Theorem 5.3 shows that we can use a penalty proportional to C m .Leaving aside the computational aspects for the moment (see Section 5.6), we now consider the situationwhere the formats of the collection rely on several possible trees T . The number of binary dimensionpartition trees (or full binary trees) with d leaves is the Catalan number d (cid:0) d − d − (cid:1) . The number N c ofpossible formats of complexity c based on all possible binary dimension partition trees with d leaves is thussuch that N c ≤ d (cid:18) d − d − (cid:19)(cid:18) c + 2 dc (cid:19) h c d i d . Using again Inequality (30) and the fact that | T ⋆ | = 2 d ≤ C m for any model m in the collection, we obtain N c ≤ d (1 + log(2)) + c (1 + log(2)) + 4 d log( c ) . c and we finally propose the weight function x ( c ) = ηc (32)for some η >
0. In the situation where a large number of trees has been explored, we still can use penaltiesproportional to the format complexity C m . For each dimension ν ∈ { , . . . , d } , we consider approximation tools ( H ν,p ν ) p ν ∈ N for functions of the variable x ν , and we let ( H p ) p ∈ N d be the corresponding approximation tool for multivariate functions, where H p = H ,p ⊗ . . . ⊗ H d,p d .For adaptive methods in p and r (with fixed tree T ), we define an approximation toolΦ = (Φ c ) c ∈ N , Φ c = { f = R H p ,T,r ( f ) : f ∈ F T,r : r ∈ N r , p ∈ N d , compl( f ) ≤ c } , where compl( f ) is a measure of complexity of the network f , and Φ c is the set of functions with associatednetwork with complexity less than c . 16or tree adaptive methods, we define the sets Φ c asΦ c = { f = R H p ,T,r ( f ) : f ∈ F T,r : T ∈ T , r ∈ N r , p ∈ N d , compl( f ) ≤ c } , where T is a collection of possible dimension trees.The best approximation error by a tensor network with complexity less than c is defined by e c ( f ⋆ ) = inf f ∈ Φ c R ( f ) − R ( f ⋆ ) . Then given a growth function γ : N → N , an approximation class for tree tensor networks can be definedas the set A γ = { f ⋆ : sup c ≥ γ ( c ) e c ( f ⋆ ) < ∞} , which corresponds to functions that can be approximated with tree tensor networks with a convergence in O ( γ ( c ) − ).The approximation class A γ depends on the measure of complexity of the network, and on wether ornot tree adaptation is considered. Natural measures of complexity of a network f are the representationcomplexity compl C ( f ) = C ( T, r, H p ) or the sparse representation complexity compl S ( f ) (see Section 2.4).When considering the complexity measure compl = compl C , we easily derive from Theorem 5.2 or 5.3upper bounds on the rates of convergence of our model selection procedure for functions in A γ by balancingthe penalty term and the approximation term in the risk bounds.Next we provide examples that shows that minimax rates can be achieved by tensor networks forclassical smoothness classes. In all examples, we consider a least-squares setting with real valued functions( s = 1), where R ( f ) − R ( f ⋆ ) is the squared L norm of f − f ⋆ . Consider a function f ⋆ in the Sobolev space H r , r ∈ N ,of functions on (0 , d or the d -dimensional torus T d , and optimal approximation tools ( H ν,p ν ) p ν ∈ N forunivariate Sobolev functions (e.g., splines or trigonometric polynomials). For any fixed tree T , and whenconsidering the representation complexity measure compl C , we have e c ( f ⋆ ) = O ( c − rd )(see, e.g., [5]), andtherefore H r is included in A γ with γ ( c ) = c rd . In the setting of Theorem 5.3, for f ⋆ in the Sobolevspace H r over (0 , d , and when considering the family of all possible formats, we find that the rate ofconvergence of ˆ f ˆ m is of order n − r r + d log( n ) r r + d which is known to be the minimax rate of convergence over H r (up to the logarithmic term). Our model selection procedure (with variable p and r ) therefore achievesminimax rates for Sobolev spaces of any order, and is thus minimax adaptive to the regularity over Sovolevspaces. Sobolev spaces of multivariate functions with dominating mixed smoothness.
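For the reader's convenience, here is the standard balancing argument behind the Sobolev rate quoted above, written with the exponents spelled out and all constants and logarithmic factors dropped; it assumes the approximation rate $e_c(f^\star) = O(c^{-2r/d})$ for $H^r$ functions and is only a heuristic summary, not a substitute for the proof.
$$\mathbb E\,\mathcal E(\hat f_{\hat m}) \;\lesssim\; \inf_{c \ge 1}\Big\{ e_c(f^\star) + \frac{c}{n} \Big\} \;\lesssim\; \inf_{c \ge 1}\Big\{ c^{-2r/d} + \frac{c}{n} \Big\}, \qquad c^\ast \asymp n^{\frac{d}{2r+d}} \;\Longrightarrow\; \mathbb E\,\mathcal E(\hat f_{\hat m}) \;\lesssim\; n^{-\frac{2r}{2r+d}}.$$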
Consider a func-tion f ⋆ in the mixed Sobolev space H rmix , r ∈ N , on the d -dimensional torus T d , and optimal approximationtools ( H ν,p ν ) p ν ∈ N for univariate Sobolev functions on T (e.g., trigonometric polynomials). For a fixed binarytree T , when considering the complexity measure compl C , we have e c ( f ⋆ ) = O ( c − r log( c ) dr ) (see [32, 5]),and therefore, the space H rmix is included in A γ with γ ( c ) = c r log( c ) − dr . In the bounded regressionframework of Theorem 5.3, our model selection procedure shows a rate of convergence upper boundedby n − r r (log n ) rd . To our knowledge, the minimax rates of convergence over mixed Sobolev spaces areunknown for regression. However, the results of [29] for Gaussian white noise model as well as the resultsof [1] for density estimation suggest that these rates should be of the order of n − r r , up to a logarithmicterm. In fact, the minimax rate can not be obtained by our strategy since the rate of approximation errorin O ( c − r ) (up to logarithmic terms) is not the optimal rate of convergence which is in O ( c − r ) (up tologarithmic terms), the latter rate being achieved by hyperbolic cross approximation [15].17n optimal rate should probably be achieved with tree tensor networks by further exploiting sparsityin the tensors, and using the corresponding measure of complexity compl S . Indeed, optimal approximationrates should be obtained by shallow tensor networks (associated with a trivial tree) with a sparse tensor C D with O ( c ) non zero entries, and a sparsity pattern based on hyperbolic crosses. Then noting that sucha shallow network (which is a canonical tensor format with rank O ( c )) can be encoded within a tree tensornetwork with sparse tensors and the same overall complexity compl S in O ( c ) , minimax rates (up to logterms) should probably be obtained for mixed Sobolev space H rmix for any tree T , when combined with anestimate of the metric entropy of sets Φ c with the complexity measure compl S . Tree tensor networks can be used for the approximation of univariate functions after identification ofa function f ∈ L (0 ,
1) with an order- d tensor (or d -variate function) in R ⊗ . . . ⊗ R ⊗ L (0 ,
1) :=( R ) ⊗ d ⊗ L (0 , S in L (0 , S = P m , we define a tensor subspace ( R ) ⊗ d ⊗ P m = H d,m , which is isometricallyidentified with the space of univariate splines of degree m over a uniform partition of [0 ,
1] into 2 d intervals.An approximation tool is then defined by considering tensor networks in the tensor spaces H d,m withvariable d and fixed m . In [2], the authors consider tensor networks associated with linear trees, that is thetensor train format (or equivalently, recurrent sum-product neural networks). The variable d setting canbe interpreted as the tree adaptive setting presented above, where the family of trees T = { T d : d ∈ N } ,with T d the linear tree over { , . . . , d } with interior nodes { , . . . , ν } , 2 ≤ ν ≤ d .The following results are based on results from [3, Main results 3.1, 3.2 and 3.4] for Sobolev, Besov oranalytic functions. Sobolev spaces of univariate functions.
For functions f ⋆ in the Sobolev space H r of univariatefunctions on (0 , C , the approximation error e c ( f ⋆ ) = O ( c − r ) achieves the best possible approximation rate , that is H r is included in A γ with γ ( r ) = n r for any r ∈ N . Together with Theorem 5.3, we find that ˆ f ˆ m achieves a convergence in n − r r +1 (up to logarithmicterm). This shows that our model selection procedure (with variable d and fixed m , in particular m = 0)achieves minimax rates (up to logarithmic terms) for Sobolev spaces of any order r (without adapting thedegree m to the regularity of f ⋆ ). Besov spaces.
Near optimal approximation rates are also obtained for Besov spaces of univariatefunctions on (0 , f ⋆ in the Besov space B ατ,τ , with α > τ = ( r + 1 / − the Sobolev embedding number.When considering the complexity measure compl C , we have B ατ,τ ⊂ A γ with γ ( c ) = c α − ǫ for arbitrary ǫ > α > f ˆ m achieves a convergence in n − α − ǫα +1 (up tologarithmic term), which are close (but not equal to) minimax rates in n − α α +1 (up to log terms).Note that when considering the complexity measure compl S , we show B ατ,τ ⊂ A γ with γ ( c ) = c α − ǫ forarbitrary ǫ >
0, which is arbitrarily close to optimal approximation rates. Therefore, a strategy takinginto account sparsity of tensors could be able to achieve rates arbitrarily close to minimax rates for Besovspaces B ατ,τ of arbitrary smoothness α (without the need of adapting m to the regularity of f ⋆ ). Analytic functions.
For a function f ⋆ analytic on an open interval containing [0 ,
1] and when consideringthe complexity measure compl C , the approximation error converges exponentially fast as e c ( f ⋆ ) = O ( ρ − c / )for some ρ >
1. That means f ⋆ ∈ A γ with γ ( c ) = ρ c / . Together with Theorem 5.3, we find that ˆ f ˆ m also obtained by other tools such as splines of degree greater than r-1 n − log( n ) (up to logarithmic term). This is known to be the minimax rate fornonparametric estimation of analytic densities [8]. The aim of the slope heuristics method proposed by Birg´e and Massart [10] is precisely to calibrate penaltyfunction for model selection purposes. See [7] and [4] for a general presentation of the method. This methodhas shown very good performances and comes with mathematical guarantees in various settings amongother for non parametric Gaussian regression with i.i.d. error terms, see [10, 4] and references therein. Theslope heuristics have several versions (see [4]).The aim is to tune the constant λ in a penalty of the form pen( m ) = λ pen shape ( m ) where pen shape is aknown penalty shape. Let ˆ m ( λ ) be the model selected by penalized criterion with constant λ :ˆ m ( λ ) ∈ argmin m ∈M n b R n ( ˆ f m ) + λ pen shape ( m ) o . Let C m denote the complexity of the model. The complexity jump algorithm consists of the followingsteps:1. Compute the function λ ˆ m ( λ ),2. Find the constant ˆ λ cj > λ C ˆ m ( λ ) ,3. Select the model ˆ m = ˆ m (2ˆ λ cj ) such thatˆ m ∈ arg min m ∈M n b R n ( ˆ f m ) + 2ˆ λ cj pen shape ( m ) o . The exploration of all possible model classes M Tr ( H s ) with a complexity bounded by some c is intractablesince the number of such models is exponential in the number of variables d . Therefore, strategies shouldbe introduced to propose a set of candidate model classes M m , m ∈ M .In practice, a possible approach is to rely on adaptive learning algorithms from [18] (see also [17]) thatgenerate predictors ˆ f m (minimizing the empirical risk) in a sequence of model classes. For a fixed tree T , the proposed algorithm generates a sequence of model classes M m = M Tr m ( H sm ) withincreasing ranks r m , m ≥
1, by successively increasing the α -ranks for nodes α associated with the highest(estimated) truncation errors inf rank α ( f ) ≤ r m,α R ( f ) − R ( f ⋆ ) . For each m , the background approximation space is taken as H m := H p m = H ,p m, ⊗ . . . ⊗ H d,p m,d , wherefor each dimension ν ∈ { , . . . , d } , ( H ν,k ) k ∈ N is a given approximation tool (e.g., polynomials, wavelets).Exploring all possible tuples p m is again a combinatorial problem. The algorithm proposed in [18, 17]relies on a validation approach for the selection of a particular tuple. Note that a complexity-based modelselection method could also be considered for the selection of a tuple p m .19 .6.2 Variable tree Although the set of possible dimension trees over { , . . . , d } is finite, exploring this whole set of dimensiontrees is intractable for high and even moderate d . In [18], a stochastic algorithm has been proposed foroptimizing the dimension tree for the compression of a tensor. This tree optimization algorithm has beencombined with the rank-adaptive strategy discussed above. The resulting algorithm generates a sequenceof predictors in tree tensor networks associated with different trees. In the numerical experiments, weuse this learning algorithm with tree adaptation to generate a set of candidate trees. Then the learningalgorithm with rank adaptation but fixed tree is used with each of these trees. In this section, we illustrate the proposed model selection approach for supervised learning problems in aleast-squares regression setting. Y is a real-valued random variable ( s = 1) defined by Y = f ⋆ ( X ) + ǫ where ǫ is independent of X and has zero mean and standard deviation γσ ( f ⋆ ( X )). The parameter γ therefore controls the noise level in relative precision.For a given training sample, we use the learning strategies described in Section 5.6 that generate asequence of predictors ˆ f m , m ∈ M , associated with a certain collection of models M (which depends onthe training sample). Given a set of predictors ˆ f m , m ∈ M , we denote by ˆ m ⋆ the index of the model thatminimizes the risk over M , i.e. ˆ m ⋆ ∈ arg min m ∈M R ( ˆ f m ) . The model ˆ m ⋆ is the oracle model in M for a given training sample.We also denote by ˆ m ( λ ) the model such thatˆ m ( λ ) ∈ argmin m ∈M n b R n ( ˆ f m ) + λ pen shape ( m ) o , where pen shape ( m ) = C m /n , and by ˆ m = ˆ m (2ˆ λ cj ) the model selected by our model selection strategy, whereˆ λ cj is calibrated with the complexity jump algorithm (see Section 5.5).We consider two different types of problems: the approximation of univariate functions defined on(0 , R d (Section 6.2).For a given function f , the risk R ( f ) is evaluated using a sample of size 10 independent of the trainingsample. Statistics of complexities and risks (such as the expected complexity E ( C ˆ m ) or the expected risk E ( R ( ˆ f ˆ m ))) are computed using 20 different training samples. Here we consider tree tensor networks for the approximation of a univariate function in L (0 , f defined on (0 ,
6 Numerical experiments

In this section, we illustrate the proposed model selection approach for supervised learning problems in a least-squares regression setting. $Y$ is a real-valued random variable ($s = 1$) defined by
$$ Y = f^\star(X) + \epsilon, $$
where $\epsilon$ is independent of $X$ and has zero mean and standard deviation $\gamma\, \sigma(f^\star(X))$, with $\sigma(f^\star(X))$ the standard deviation of $f^\star(X)$. The parameter $\gamma$ therefore controls the noise level in relative precision.

For a given training sample, we use the learning strategies described in Section 5.6, which generate a sequence of predictors $\hat f_m$, $m \in \mathcal M$, associated with a certain collection of models $\mathcal M$ (which depends on the training sample). Given a set of predictors $\hat f_m$, $m \in \mathcal M$, we denote by $\hat m^\star$ the index of the model that minimizes the risk over $\mathcal M$, i.e.
$$ \hat m^\star \in \operatorname*{arg\,min}_{m \in \mathcal M} R(\hat f_m). $$
The model $\hat m^\star$ is the oracle model in $\mathcal M$ for a given training sample. We also denote by $\hat m(\lambda)$ the model such that
$$ \hat m(\lambda) \in \operatorname*{arg\,min}_{m \in \mathcal M} \Big\{ \widehat R_n(\hat f_m) + \lambda\, \mathrm{pen}_{\mathrm{shape}}(m) \Big\}, $$
where $\mathrm{pen}_{\mathrm{shape}}(m) = C_m/n$, and by $\hat m = \hat m(2\hat\lambda_{cj})$ the model selected by our model selection strategy, where $\hat\lambda_{cj}$ is calibrated with the complexity jump algorithm (see Section 5.5).

We consider two different types of problems: the approximation of univariate functions defined on $(0,1)$ and identified with tensors by tensorization (Section 6.1), and the approximation of multivariate functions defined on a subset of $\mathbb R^d$ (Section 6.2). For a given function $f$, the risk $R(f)$ is evaluated using a large test sample independent of the training sample. Statistics of complexities and risks (such as the expected complexity $\mathbb E(C_{\hat m})$ or the expected risk $\mathbb E(R(\hat f_{\hat m}))$) are computed using 20 different training samples.
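A condensed Python sketch of this experimental protocol is given below. The piecewise constant learner only stands in for the tree tensor network learning algorithms of Section 5.6 (which are not reproduced here), and the sample sizes and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sample(f_star, n, gamma):
    """Draw (X, Y) with Y = f*(X) + eps, eps centered with std gamma * std(f*(X))."""
    X = rng.uniform(0.0, 1.0, n)
    fX = f_star(X)
    return X, fX + gamma * np.std(fX) * rng.standard_normal(n)

def fit_piecewise_constant(X, Y, n_bins):
    """Stand-in learner: least-squares fit in a space of piecewise constants."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(X, edges) - 1, 0, n_bins - 1)
    values = np.array([Y[idx == b].mean() if np.any(idx == b) else 0.0 for b in range(n_bins)])
    return lambda x: values[np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)]

def risk(f, f_star, gamma, n_test=10**5):
    """Monte Carlo estimate of R(f) = E[(Y - f(X))^2] on an independent test sample."""
    X, Y = make_sample(f_star, n_test, gamma)
    return np.mean((Y - f(X)) ** 2)

f_star, gamma = np.sqrt, 0.1
X_train, Y_train = make_sample(f_star, n=200, gamma=gamma)
models = {m: fit_piecewise_constant(X_train, Y_train, n_bins=2**m) for m in range(1, 8)}
risks = {m: risk(f, f_star, gamma) for m, f in models.items()}
emp_risks = {m: np.mean((Y_train - f(X_train)) ** 2) for m, f in models.items()}
m_oracle = min(risks, key=risks.get)       # oracle model for this training sample
print("oracle model:", m_oracle, "risk:", risks[m_oracle])
```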
6.1 Tensorized functions

Here we consider tree tensor networks for the approximation of a univariate function in $L^2(0,1)$. A function $f$ defined on $(0,1)$ can be linearly identified with a function $\bar f = T_l(f)$ of $l+1$ variables defined on $\{0,1\}^l \times (0,1)$, such that
$$ f(x) = T_l(f)(i_0, \ldots, i_{l-1}, y) \quad \text{for} \quad x = 2^{-l}\Big( \sum_{k=0}^{l-1} i_k 2^k + y \Big). $$
$T_l$ is called the tensorization map at level $l$.
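The following sketch makes the tensorization map explicit: it computes the binary digits $(i_0, \ldots, i_{l-1})$ and the residual coordinate $y$ associated with a point $x \in (0,1)$, and inverts the map. The convention that $i_k$ is the $k$-th least significant digit follows the formula above; the paper's indexing convention may differ, so treat it as an assumption.

```python
import numpy as np

def tensorize_point(x, l):
    """Map x in (0,1) to (i_0, ..., i_{l-1}, y) with x = 2^{-l} (sum_k i_k 2^k + y), y in [0,1)."""
    i = min(int(np.floor(2**l * x)), 2**l - 1)   # index of the dyadic interval containing x
    y = 2**l * x - i                             # local coordinate within that interval
    bits = [(i >> k) & 1 for k in range(l)]      # i_k = k-th binary digit of i
    return bits, y

def detensorize_point(bits, y, l):
    """Inverse map: recover x from (i_0, ..., i_{l-1}, y)."""
    i = sum(b << k for k, b in enumerate(bits))
    return 2.0**(-l) * (i + y)

x, l = 0.7133, 4
bits, y = tensorize_point(x, l)
print(bits, y, detensorize_point(bits, y, l))    # recovers x = 0.7133
```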
This allows to isometrically identify the space $L^2(0,1)$ with the tensor space $\mathbb R^2 \otimes \cdots \otimes \mathbb R^2 \otimes L^2(0,1)$ of order $d = l + 1$. Then we consider the approximation space $H_l = \mathbb R^2 \otimes \cdots \otimes \mathbb R^2 \otimes \mathcal P_0$ of $d$-variate functions $\bar f(i_0, \ldots, i_{l-1}, y)$ independent of the variable $y$. The space $H_l$ is linearly identified with the space of piecewise constant functions on the uniform partition of $(0,1)$ into $2^l$ intervals.
Then we consider the model classes $\mathcal M_{l,T,r} = \{ f : T_l(f) \in \mathcal M^T_r(H_l) \}$, which are piecewise constant functions whose tensorized version $T_l(f)$ is in a particular tree-based tensor format. In the following experiments, for each level $l$ in a given range, we consider a fixed linear binary tree $T$ (with interior nodes $\{1, \ldots, k\}$, $1 \le k \le l+1$) and use the rank adaptive learning algorithm (Section 5.6.1) to produce a sequence of 25 approximations with increasing ranks.

Three functions $f^\star(x)$ are considered. The first function, $f^\star(x) = \sqrt x$, is analytic on the open interval $(0,1)$ and its derivative has a singularity at zero. The second function is analytic on a larger open interval including $[0,1]$. The third function belongs to the Sobolev space $H^1(0,1)$. For all functions, the proposed model selection approach shows a very good performance: it selects with high probability a model with a risk very close to the risk of the oracle $\hat f_{\hat m^\star}$.
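Low tree ranks of the tensorized function are what makes these model classes effective, and this can be checked numerically. The sketch below computes the numerical ranks of the unfoldings $\{1,\ldots,k\}$ versus the remaining variables for the tensorization of $\sqrt x$ at level $l$, which for the linear tree used here correspond to the $\alpha$-ranks of the interior nodes. Midpoint sampling is used instead of an exact $L^2$ projection onto $H_l$, and the constant variable $y$ is dropped; both are simplifications for illustration.

```python
import numpy as np

def tensorize_values(f, l):
    """Values of f at the midpoints of the 2^l dyadic intervals, as an order-l tensor of mode size 2."""
    x = (np.arange(2**l) + 0.5) / 2**l
    return f(x).reshape((2,) * l)

def unfolding_ranks(tensor, tol=1e-10):
    """Numerical ranks of the matricizations {1,...,k} vs {k+1,...,l} of an order-l tensor."""
    l = tensor.ndim
    ranks = []
    for k in range(1, l):
        s = np.linalg.svd(tensor.reshape(2**k, 2**(l - k)), compute_uv=False)
        ranks.append(int(np.sum(s > tol * s[0])))
    return ranks

print("unfolding ranks of tensorized sqrt(x), l = 10:",
      unfolding_ranks(tensorize_values(np.sqrt, 10)))
```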
6.1.1 Tensorized function $f^\star(x) = \sqrt x$

We consider the function $f^\star(x) = \sqrt x$, which is analytic on the open interval $(0,1)$. Figures 3 and 4 illustrate the slope heuristics for different values of $n$ and of the noise level. Tables 1 and 2 show expectations of complexities and errors for the selected estimator and illustrate the very good performance of the approach when compared to the oracle.

Figure 3: Slope heuristics for the tensorized function $f^\star(x) = \sqrt x$ with $n = 200$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

$n$ | $\mathbb E(C_{\hat m^\star})$ | $\mathbb E(C_{\hat m})$ | $\mathbb E(R(\hat f_{\hat m^\star}))$ | $\mathbb E(R(\hat f_{\hat m}))$
100 | 123.2 | 91.6 | 1.6e-05 | 5.0e-05
200 | 163.8 | 165.0 | 3.0e-06 | 5.1e-06
500 | 182.2 | 182.6 | 9.2e-07 | 1.2e-06
1000 | 190.2 | 228.5 | 7.1e-07 | 1.4e-06

Table 1: Expectation of complexities and risks of the model selected by the slope heuristics, with the function $f^\star(x) = \sqrt x$, different values of $n$ and a fixed noise level $\gamma$.

Figure 4: Slope heuristics for the tensorized function $f^\star(x) = \sqrt x$ with $n = 1000$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

Table 2: Expectation of complexities and risks of the model selected by the slope heuristics, with the function $f^\star(x) = \sqrt x$, different values of $\gamma$ and $n = 1000$.

6.1.2 Second tensorized function

We consider a second function $f^\star$, which is analytic on an open interval containing $[0,1]$. Figures 5 and 6 and Tables 3 and 4 report the corresponding results.

Figure 5: Slope heuristics for the second tensorized function with $n = 200$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

Figure 6: Slope heuristics for the second tensorized function with $n = 1000$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

$n$ | $\mathbb E(C_{\hat m^\star})$ | $\mathbb E(C_{\hat m})$ | $\mathbb E(R(\hat f_{\hat m^\star}))$ | $\mathbb E(R(\hat f_{\hat m}))$
100 | 88.0 | 83.0 | 9.3e-07 | 1.0e-06
200 | 97.3 | 92.8 | 6.4e-07 | 6.6e-07
500 | 92.9 | 124.4 | 5.8e-07 | 6.9e-07
1000 | 108.4 | 107.5 | 5.3e-07 | 5.3e-07

Table 3: Expectation of complexities and risks of the model selected by the slope heuristics, with the second function, different values of $n$ and a fixed noise level $\gamma$.

Table 4: Expectation of complexities and risks with the second function, different values of $\gamma$ and $n = 1000$.

6.1.3 Tensorized function $f^\star(x) = g(g(x))$

We consider the function $f^\star(x) = g(g(x))$ with $g(x) = 1 - |2x - 1|$, which is in the Sobolev space $H^1(0,1)$. Figures 7 and 8 illustrate again the good behaviour of the model selection approach for different sample sizes and noise levels, and Tables 5 and 6 again illustrate the very good performance (in expectation) of the selected estimator when compared to the oracle.

Figure 7: Slope heuristics for the tensorized function $f^\star(x) = g(g(x))$ with $n = 200$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.
Figure 8: Slope heuristics for the tensorized function $f^\star(x) = g(g(x))$ with $n = 1000$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

$n$ | $\mathbb E(C_{\hat m^\star})$ | $\mathbb E(C_{\hat m})$ | $\mathbb E(R(\hat f_{\hat m^\star}))$ | $\mathbb E(R(\hat f_{\hat m}))$
200 | 176.4 | 181.6 | 6.3e-07 | 1.6e-06
500 | 188.2 | 198.8 | 3.9e-07 | 4.1e-07
1000 | 196.6 | 233.8 | 3.2e-07 | 3.5e-07

Table 5: Expectation of complexities and risks for the function $f^\star(x) = g(g(x))$, different values of $n$ and a fixed noise level $\gamma$.

Table 6: Expectation of complexities and risks for the function $f^\star(x) = g(g(x))$, different values of $\gamma$ and $n = 1000$.

6.2 Multivariate functions

6.2.1 Corner peak function

We consider the function
$$ f^\star(X) = \frac{1}{1 + \sum_{\nu=1}^{d} \nu^{-1} X_\nu} $$
with $d = 10$, where the $X_\nu \sim \mathcal U(0,1)$ are i.i.d. uniform random variables. The function $f^\star$ is analytic on $[0,1]^d$.
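A short Python definition of this test case is given below; the coefficients $\nu^{-1}$ follow the reading of the formula given above and should be treated as an assumption rather than the paper's exact specification.

```python
import numpy as np

d = 10
coeff = 1.0 / np.arange(1, d + 1)          # assumed coefficients nu^{-1}, nu = 1, ..., d

def corner_peak(x):
    """Corner-peak-type test function on [0,1]^d (assumed reading of the formula)."""
    x = np.atleast_2d(x)                   # shape (n_samples, d)
    return 1.0 / (1.0 + x @ coeff)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5, d))     # i.i.d. U(0,1) inputs
print(corner_peak(X))
```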
We use the fixed balanced binary tree $T$ of Figure 9 (a generic construction of such a tree is sketched after Table 7). Figures 10 and 11 illustrate the very good behaviour of the model selection approach for a sample size $n = 1000$. Tables 7 and 8 show expectations of complexities and risks for the selected model (for different values of $n$ and $\gamma$), which are of the same order as for the oracle.

Figure 9: Corner peak function. Dimension tree $T$ (a balanced binary tree over $\{1, \ldots, 10\}$).

$n$ | $\mathbb E(C_{\hat m^\star})$ | $\mathbb E(C_{\hat m})$ | $\mathbb E(R(\hat f_{\hat m^\star}))$ | $\mathbb E(R(\hat f_{\hat m}))$
100 | 124.1 | 73.7 | 2.1e-06 | 1.1e-05
500 | 286.7 | 291.3 | 9.8e-11 | 1.0e-10
1000 | 286.2 | 293.8 | 6.6e-11 | 6.7e-11

Table 7: Expectation of complexities and risks of the model selected by the slope heuristics, with the Corner peak function, different values of $n$ and a fixed small noise level $\gamma$.
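The following sketch builds a balanced binary dimension partition tree over $\{1, \ldots, d\}$ by recursively splitting the set of variables into two halves; the exact split convention of the tree in Figure 9 is not recoverable here, so this is only a generic construction.

```python
def balanced_binary_tree(variables):
    """Nodes (as tuples of variables) of a balanced binary dimension partition tree."""
    nodes = []
    def split(vs):
        nodes.append(tuple(vs))
        if len(vs) > 1:
            half = len(vs) // 2
            split(vs[:half])
            split(vs[half:])
    split(list(variables))
    return nodes

for node in balanced_binary_tree(range(1, 11)):   # tree over {1, ..., 10}
    print(node)
```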
Figure 10: Slope heuristics for the Corner peak function with $n = 1000$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.
Figure 11: Slope heuristics for the Corner peak function with $n = 1000$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

Table 8: Expectation of complexities and risks of the model selected by the slope heuristics, with the Corner peak function, different values of $\gamma$ and $n = 1000$.

6.2.2 Borehole function

We consider the function
$$ g(U_1, \ldots, U_8) = \frac{2\pi U_3 (U_4 - U_6)}{(U_2 - \log(U_1))\Big(1 + \dfrac{2 U_7 U_3}{(U_2 - \log(U_1))\, U_1^2\, U_8} + \dfrac{U_3}{U_5}\Big)}, $$
which models the water flow through a borehole as a function of 8 independent random variables $U_1 \sim \mathcal N(0.10,\ 0.0161812)$, $U_2 \sim \mathcal N(7.71,\ 1.0056)$, $U_3 \sim \mathcal U(63070,\ 115600)$, $U_4 \sim \mathcal U(990,\ 1110)$, $U_5 \sim \mathcal U(63.1,\ 116)$, $U_6 \sim \mathcal U(700,\ 820)$, $U_7 \sim \mathcal U(1120,\ 1680)$, $U_8 \sim \mathcal U(9855,\ 12045)$. We then consider the function
$$ f^\star(X_1, \ldots, X_d) = g(g_1(X_1), \ldots, g_8(X_8)), $$
where the $g_\nu$ are functions such that $U_\nu = g_\nu(X_\nu)$, with $X_\nu \sim \mathcal N(0,1)$ for $\nu \in \{1, 2\}$ and $X_\nu \sim \mathcal U(-1,1)$ for $\nu \in \{3, \ldots, 8\}$. The function $f^\star$ is thus defined on $\mathcal X = \mathbb R^2 \times [-1,1]^6$.
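Below is a Python sketch of this test case. The maps $g_\nu$ are taken to be the natural affine transports from the standardized inputs $X_\nu$ to the variables $U_\nu$ (a modelling choice that the text does not spell out), and the distribution bounds are those of the standard borehole benchmark, so treat both as assumptions.

```python
import numpy as np

def borehole(U):
    """Borehole response as a function of the physical variables U_1, ..., U_8."""
    U1, U2, U3, U4, U5, U6, U7, U8 = (U[..., k] for k in range(8))
    log_ratio = U2 - np.log(U1)
    return 2.0 * np.pi * U3 * (U4 - U6) / (
        log_ratio * (1.0 + 2.0 * U7 * U3 / (log_ratio * U1**2 * U8) + U3 / U5)
    )

# Assumed affine maps g_nu matching N(0.10, 0.0161812), N(7.71, 1.0056) and the uniform ranges.
normal_params = [(0.10, 0.0161812), (7.71, 1.0056)]
uniform_ranges = [(63070.0, 115600.0), (990.0, 1110.0), (63.1, 116.0),
                  (700.0, 820.0), (1120.0, 1680.0), (9855.0, 12045.0)]

def f_star(X):
    X = np.atleast_2d(X)                                      # shape (n_samples, 8)
    U = np.empty_like(X, dtype=float)
    for k, (mu, sigma) in enumerate(normal_params):
        U[:, k] = mu + sigma * X[:, k]                        # X_k ~ N(0,1)  -> U_k ~ N(mu, sigma)
    for k, (a, b) in enumerate(uniform_ranges, start=2):
        U[:, k] = 0.5 * (a + b) + 0.5 * (b - a) * X[:, k]     # X_k ~ U(-1,1) -> U_k ~ U(a, b)
    return borehole(U)

rng = np.random.default_rng(0)
X = np.column_stack([rng.standard_normal((4, 2)), rng.uniform(-1.0, 1.0, (4, 6))])
print(f_star(X))
```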
As univariate approximation tools, we use polynomial spaces $H_{\nu, p_\nu} = \mathcal P_{p_\nu}(\mathcal X_\nu)$, $\nu \in D$. We use the exploration strategy described in Section 5.6.1. More precisely, we first run a learning algorithm with tree adaptation, starting from an initial binary tree drawn randomly, with $n = 100$ samples. The learning algorithm visited the 9 trees plotted in Figure 12. Then, for each of these trees, we run a learning algorithm with fixed tree and rank adaptation. Figures 13 to 15 illustrate the behaviour of the model selection strategy for different sample sizes $n$. Table 9 shows the expectation of complexities and risks. The model selection approach shows very good performance, except for the very small training size $n = 100$, where it selects a model rather far from the optimal one (in terms of expected risk and complexity).

$n$ | $\mathbb E(C_{\hat m^\star})$ | $\mathbb E(C_{\hat m})$ | $\mathbb E(R(\hat f_{\hat m^\star}))$ | $\mathbb E(R(\hat f_{\hat m}))$
100 | 132.1 | 63.4 | 6.9e-06 | 9.3e-04
200 | 149.7 | 156.0 | 3.0e-08 | 1.1e-07
500 | 144.7 | 178.2 | 1.0e-08 | 1.8e-08
1000 | 154.1 | 194.2 | 8.3e-09 | 1.2e-08

Table 9: Borehole function. Expectation of complexities and risks, for a fixed small noise level $\gamma$ and different values of $n$.

Figure 12: Borehole function. The path of 9 trees generated by the tree adaptive learning algorithm.
Figure 13: Slope heuristics for the Borehole function with $n = 100$ and a fixed small noise level $\gamma$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

Figure 14: Slope heuristics for the Borehole function with $n = 200$ and a fixed small noise level $\gamma$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

Figure 15: Slope heuristics for the Borehole function with $n = 1000$ and a fixed small noise level $\gamma$. (a) Complexity $C_{\hat m(\lambda)}$ as a function of $\lambda$, with $\hat\lambda_{cj}$ in red. (b) Points $(C_m, R(\hat f_m))$, $m \in \mathcal M$, and selected model in red.

References

[1] Nathalie Akakpo. Multivariate intensity estimation via hyperbolic wavelet selection. Journal of Multivariate Analysis, 161:32-57, 2017.
[2] Mazen Ali and Anthony Nouy. Approximation with tensor networks. Part I: Approximation spaces. arXiv e-prints, arXiv:2007.00118, 2020.
[3] Mazen Ali and Anthony Nouy. Approximation with tensor networks. Part II: Approximation rates for smoothness classes. arXiv e-prints, arXiv:2007.00128, 2020.
[4] Sylvain Arlot. Minimal penalties and the slope heuristics: a survey. arXiv preprint arXiv:1901.07277, 2019.
[5] M. Bachmayr, A. Nouy, and R. Schneider. Approximation power of tree tensor networks for compositional functions, 2020.
[6] M. Bachmayr, R. Schneider, and A. Uschmajew. Tensor networks and hierarchical tensors for the solution of high-dimensional partial differential equations. Foundations of Computational Mathematics, pages 1-50, 2016.
[7] Jean-Patrick Baudry, Cathy Maugis, and Bertrand Michel. Slope heuristics: overview and implementation. Statistics and Computing, 22(2):455-470, 2012.
[8] Eduard Belitser et al. Efficient estimation of analytic density under random censorship. Bernoulli, 4(4):519-543, 1998.
[9] Gérard Biau and Aurélie Fischer. Parameter selection for principal curves. IEEE Transactions on Information Theory, 58(3):1924-1939, 2012.
[10] Lucien Birgé and Pascal Massart. Minimal penalties for Gaussian model selection. Probability Theory and Related Fields, 138(1-2):33-73, 2007.
[11] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pages 169-207. Springer, 2004.
[12] A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao, and D. Mandic. Tensor networks for dimensionality reduction and large-scale optimization: Part 1 Low-rank tensor decompositions. Foundations and Trends in Machine Learning, 9(4-5):249-429, 2016.
[13] A. Cichocki, A.-H. Phan, Q. Zhao, N. Lee, I. Oseledets, M. Sugiyama, and D. Mandic. Tensor networks for dimensionality reduction and large-scale optimization: Part 2 Applications and future perspectives. Foundations and Trends in Machine Learning, 9(6):431-673, 2017.
[14] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698-728, 2016.
[15] Dinh Dũng, Vladimir Temlyakov, and Tino Ullrich. Hyperbolic Cross Approximation. Springer, 2018.
[16] A. Falcó, W. Hackbusch, and A. Nouy. Tree-based tensor formats. SeMA Journal, 2018.
[17] E. Grelier, A. Nouy, and R. Lebrun. Learning high-dimensional probability distributions using tree tensor networks. arXiv e-prints, arXiv:1912.07913, 2019.
[18] Erwan Grelier, Anthony Nouy, and Mathilde Chevreuil. Learning with tree-based tensor formats. arXiv e-prints, arXiv:1811.04455, 2018.
[19] Rémi Gribonval, Gitta Kutyniok, Morten Nielsen, and Felix Voigtlaender. Approximation spaces of deep neural networks. arXiv e-prints, arXiv:1905.01208, 2019.
[20] Michael Griebel and Helmut Harbrecht. Analysis of tensor approximation schemes for continuous functions. arXiv e-prints, arXiv:1903.04234, 2019.
[21] László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.
[22] W. Hackbusch. Tensor Spaces and Numerical Tensor Calculus, volume 42 of Springer Series in Computational Mathematics. Springer, Heidelberg, 2012.
[23] Vladimir Kazeev, Ivan Oseledets, Maxim Rakhuba, and Christoph Schwab. QTT-finite-element approximation for multiscale problems I: Model problems in one dimension. Advances in Computational Mathematics, 43(2):411-442, 2017.
[24] Vladimir Kazeev and Christoph Schwab. Approximation of singularities by quantized-tensor FEM. PAMM, 15(1):743-746, 2015.
[25] B. Khoromskij. Tensors-structured numerical methods in scientific computing: Survey on recent advances. Chemometrics and Intelligent Laboratory Systems, 110(1):1-19, 2012.
[26] Valentin Khrulkov, Alexander Novikov, and Ivan Oseledets. Expressive power of recurrent neural networks. In International Conference on Learning Representations, 2018.
[27] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media, 2011.
[28] P. Massart. Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer-Verlag, 2007.
[29] Michael Neumann. Multivariate wavelet thresholding in anisotropic function spaces. Statistica Sinica, 10:399-431, 2000.
[30] A. Nouy. Low-rank methods for high-dimensional approximation and model order reduction. In P. Benner, A. Cohen, M. Ohlberger, and K. Willcox, editors, Model Reduction and Approximation: Theory and Algorithms. SIAM, Philadelphia, PA, 2017.
[31] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.
[32] R. Schneider and A. Uschmajew. Approximation rates for the hierarchical tensor format in periodic Sobolev spaces. Journal of Complexity, 30(2):56-71, 2014. Dagstuhl 2012.
[33] Marco Signoretto, Lieven De Lathauwer, and Johan A. K. Suykens. Nuclear norms for tensors and their use for convex multilinear estimation. Submitted to Linear Algebra and Its Applications, 43, 2010.
[34] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[35] Ming Yuan and Cun-Hui Zhang. On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics, 16(4):1031-1068, 2016.
A Proofs
A.1 Proofs of Section 3
Proof of Proposition 3.6.
Let $f = R_{\mathcal H, T, r}((f_\alpha)_{\alpha \in T^\star})$ and let $\lambda = (p, \mu)$, $1 \le p \le \infty$, or $\lambda = \infty$ (with $p = \infty$ when $\lambda = \infty$). For $x \in \mathcal X$, we first note that
$$ \| f(x) \|_\lambda = \| f_\emptyset(g_D(x)) \|_\lambda \le \| f_\emptyset \|_{F_\emptyset} \, \| g_D(x) \|_\lambda. $$
Then, for any interior node $\alpha \in I(T)$, we have
$$ \| g_\alpha(x_\alpha) \|_\lambda = \big\| f_\alpha\big( (g_\beta(x_\beta))_{\beta \in S(\alpha)} \big) \big\|_\lambda \le \| f_\alpha \|_{F_\alpha} \prod_{\beta \in S(\alpha)} \| g_\beta(x_\beta) \|_\lambda, $$
and for any leaf node $\alpha \in L(T)$,
$$ \| g_\alpha(x_\alpha) \|_\lambda = \| f_\alpha(\phi_\alpha(x_\alpha)) \|_\lambda \le \| f_\alpha \|_{F_\alpha} \, \| \phi_\alpha(x_\alpha) \|_\lambda. $$
We deduce that
$$ \| f(x) \|_\lambda \le \prod_{\alpha \in T^\star} \| f_\alpha \|_{F_\alpha} \prod_{1 \le \nu \le d} \| \phi_\nu(x_\nu) \|_\lambda, $$
and therefore, since $\mu$ is a product measure and from the particular normalization of the functions $\phi_\nu$, we obtain
$$ \| f \|_\lambda \le \prod_{\alpha \in T^\star} \| f_\alpha \|_{F_\alpha} \prod_{1 \le \nu \le d} \| \phi_\nu \|_\lambda = \prod_{\alpha \in T^\star} \| f_\alpha \|_{F_\alpha}, $$
which proves that $L_\lambda \le 1$. Finally, for $1 \le q \le p$, we note that $\mu(\mathcal X)^{1/p - 1/q} \| f \|_{q,\mu} \le \| f \|_{p,\mu} \le \| f \|_\infty$, which yields $L_{q,\mu} \le \mu(\mathcal X)^{1/q - 1/p} L_\lambda$.

A.2 Proofs of Section 4
Proof of Lemma 4.6.
Let $\bar\gamma = \frac{\epsilon B}{4L}$ and let $\mathcal N$ be a $\bar\gamma$-net of $M$ for the $\|\cdot\|_{\infty,\mu}$-norm, with cardinal $\mathcal N\big(\tfrac{\epsilon B}{4L}\big)$. Using Lemma 4.4 and a union bound argument, we obtain
$$ P\Big( \sup_{g \in \mathcal N} \bar R_n(g) > \tfrac{\epsilon B}{2} \Big) \ \vee\ P\Big( \inf_{g \in \mathcal N} \bar R_n(g) < -\tfrac{\epsilon B}{2} \Big) \le \mathcal N\big(\tfrac{\epsilon B}{4L}\big)\, e^{-n\epsilon^2/2}. $$
For any $f \in M$, there exists a $g \in \mathcal N$ such that $\|f - g\|_{\infty,\mu} \le \bar\gamma$. Noting that
$$ \bar R_n(f) = \bar R_n(g) + \widehat R_n(f) - \widehat R_n(g) + R(g) - R(f), $$
we deduce from Assumption 4.5 that
$$ \bar R_n(f) \le \bar R_n(g) + 2L \|f - g\|_{\infty,\mu} \le \sup_{g \in \mathcal N} \bar R_n(g) + \tfrac{\epsilon B}{2}, \qquad \bar R_n(f) \ge \bar R_n(g) - 2L \|f - g\|_{\infty,\mu} \ge \inf_{g \in \mathcal N} \bar R_n(g) - \tfrac{\epsilon B}{2}. $$
This implies that
$$ P\Big( \sup_{f \in M} \bar R_n(f) > \epsilon B \Big) \le P\Big( \sup_{g \in \mathcal N} \bar R_n(g) > \tfrac{\epsilon B}{2} \Big) \quad \text{and} \quad P\Big( \inf_{f \in M} \bar R_n(f) < -\epsilon B \Big) \le P\Big( \inf_{g \in \mathcal N} \bar R_n(g) < -\tfrac{\epsilon B}{2} \Big), $$
which yields (21). The bound on $\mathcal N\big(\tfrac{\epsilon B}{4L}\big)$ directly follows from Proposition 3.3 and Proposition 3.6.

Proof of Lemma 4.7. We have
$$ \mathbb E\Big( \sup_{f \in M} |\bar R_n(f)| \Big) = \int_0^\infty P\Big( \sup_{f \in M} |\bar R_n(f)| > t \Big)\, dt = B \int_0^\infty P\Big( \sup_{f \in M} |\bar R_n(f)| > \epsilon B \Big)\, d\epsilon. $$
Let $\beta = 6 L B^{-1} R\, |T^\star|$. Then, according to Lemma 4.6, for any $\delta > 0$,
$$ \mathbb E\Big( \sup_{f \in M} |\bar R_n(f)| \Big) \le B \Big[ \delta + 2 \int_\delta^\infty (\beta \epsilon^{-1})^{C_M} e^{-n\epsilon^2/2}\, d\epsilon \Big] \le B \Big[ \delta + 2\, n^{-1} \beta^{C_M} \delta^{-C_M - 1} e^{-n\delta^2/2} \Big], $$
where the last inequality uses $(\beta\epsilon^{-1})^{C_M} \le (\beta\delta^{-1})^{C_M}$ for $\epsilon \ge \delta$ and the Gaussian tail bound $\int_\delta^\infty e^{-n\epsilon^2/2}\, d\epsilon \le (n\delta)^{-1} e^{-n\delta^2/2}$. By taking
$$ \delta = \sqrt{\frac{2 C_M}{n} \log\big( (\beta \vee e) \sqrt n \big)}, $$
we have
$$ n^{-1} \beta^{C_M} \delta^{-C_M - 1} e^{-n\delta^2/2} = n^{-1} \beta^{C_M} \delta^{-C_M - 1} \big( (\beta \vee e) \sqrt n \big)^{-C_M} \le \delta^{-C_M - 1} n^{-C_M/2 - 1} = \delta\, (\delta^2 n)^{-C_M/2 - 1} = \delta \big( 2 C_M \log((\beta \vee e)\sqrt n) \big)^{-C_M/2 - 1} \le \delta, $$
where we have used the fact that $2 C_M \log((\beta \vee e)\sqrt n) \ge 1$. Then
$$ \mathbb E\Big( \sup_{f \in M} |\bar R_n(f)| \Big) \le 3 B \delta, $$
which concludes the proof.
Proof of Proposition 4.8.
Starting from (16), we obtain
$$ R(\hat f_n^M) - R(f_M) \le \bar R_n(f_M) - \bar R_n(\hat f_n^M) \le \sup_{f \in M} \bar R_n(f) - \inf_{f \in M} \bar R_n(f). $$
Then, using Lemma 4.6, we deduce
$$ P\big( R(\hat f_n^M) - R(f_M) > 2\epsilon B \big) \le P\Big( \sup_{f \in M} \bar R_n(f) > \epsilon B \Big) + P\Big( \inf_{f \in M} \bar R_n(f) < -\epsilon B \Big) \le 2\, \mathcal N\big(\tfrac{\epsilon B}{4L}\big)\, e^{-n\epsilon^2/2}, $$
with $\log \mathcal N\big(\tfrac{\epsilon B}{4L}\big) \le C_M \log(\beta \epsilon^{-1})$. In the same way, for the expectation bound, we have
$$ \mathbb E\big( R(\hat f_n^M) - R(f_M) \big) \le \mathbb E\big( \bar R_n(f_M) - \bar R_n(\hat f_n^M) \big) \le 2\, \mathbb E\Big( \sup_{f \in M} |\bar R_n(f)| \Big), $$
and the result directly follows from Lemma 4.7.

Proof of Proposition 4.9. The two inequalities come from a standard application of the bounded difference inequality, see for instance Theorem 5.1 in [28]. The bounded difference inequality applied to $\sup_{f \in M} -\bar R_n(f) = \sup_{f \in M} \big( R(f) - \widehat R_n(f) \big)$ gives that, with probability larger than $1 - \exp(-t)$,
$$ \sup_{f \in M} -\bar R_n(f) \le \mathbb E\Big( \sup_{f \in M} -\bar R_n(f) \Big) + 2B \sqrt{\frac t n}. $$
Inequality (22) directly derives from this inequality and Lemma 4.7. Next, Inequality (16) gives that
$$ \mathcal E(\hat f_n^M) \le \mathcal E(f_M) + 2 \sup_{f \in M} |\bar R_n(f)|. $$
We finally prove the risk bound (23) by applying the bounded difference inequality to $\sup_{f \in M} |\bar R_n(f)|$ and Lemma 4.7 again.

A.2.1 Proof of Proposition 4.12
The proof follows the presentation of [27]. The least-squares contrast $\gamma$ corresponds either to the regression contrast or to the density estimation contrast. Under the assumptions of the proposition, in both frameworks the oracle function satisfies $\|f^\star\|_\infty \le R$.

$\bullet$ We first prove the proposition in the case where $R = 1$, assuming for the moment that $M = M_1 = \mathcal M^T_r(H_s)_1$. For the regression framework, it is also assumed for the moment that $\|Y\|_{\ell^\infty} \le 1$ and $\|f^\star\|_\infty \le 1$.

$\bullet$ For the least-squares regression contrast (see Example 4.10), we have $\gamma(f, Z) = (Y - f(X))^2$. For all $f \in M$, it gives $\gamma(f,Z) \le \|Y - f(X)\|^2_{\ell^\infty} \le (\|Y\|_{\ell^\infty} + \|f\|_\infty)^2$, so that $0 \le \gamma(f,Z) \le B$ almost surely, with $B = 4$. The distribution of the random variable $X$ is denoted $\mu$. Then
$$ \mathbb E\big( (\gamma(f,Z) - \gamma(f^\star,Z))^2 \big) = \mathbb E\big[ (f^\star(X) - f(X))^2 \big( 2(Y - f^\star(X)) + f^\star(X) - f(X) \big)^2 \big] = 4\, \mathbb E\big[ (f^\star(X) - f(X))^2 (Y - f^\star(X))^2 \big] + \mathbb E\big[ (f^\star(X) - f(X))^4 \big], $$
where the cross term vanishes since $\mathbb E(Y - f^\star(X) \mid X) = 0$. Hence
$$ \mathbb E\big( (\gamma(f,Z) - \gamma(f^\star,Z))^2 \big) \le \big( 4 \|Y - f^\star(X)\|^2_{\ell^\infty} + \|f^\star - f\|^2_\infty \big)\, \|f - f^\star\|^2_{2,\mu} \le 8\, \|f - f^\star\|^2_{2,\mu} = 2B\, \|f - f^\star\|^2_{2,\mu}, $$
where the last inequality has been obtained using $\|Y - f^\star(X)\|_{\ell^\infty} = \|Y - \mathbb E(Y|X)\|_{\ell^\infty} \le 1$. Let $\bar\gamma = \gamma/B$. We have $0 \le \bar\gamma \le 1$,
$$ \bar{\mathcal E}(f) := \mathbb E[\bar\gamma(f,Z) - \bar\gamma(f^\star,Z)] = \frac 1 B \|f - f^\star\|^2_{2,\mu} = \frac 1 B \mathcal E(f), $$
and $\mathbb E\big( [\bar\gamma(f,Z) - \bar\gamma(f^\star,Z)]^2 \big) \le D \|f - f^\star\|^2_{2,\mu}$ with $D = 2/B = 1/2$.

$\bullet$ We now consider the density estimation framework, with $\gamma(f,x) = \|f\|^2_{2,\mu} - 2 f(x)$. According to Example 4.11, $|\gamma(f,X)| \le B = \mu(\mathcal X) + 2$. The excess risk satisfies $\mathcal E(f) = \|f^\star - f\|^2_{2,\mu}$ and
$$ \mathbb E\big( [\gamma(f,Z) - \gamma(f^\star,Z)]^2 \big) = \mathbb E\Big( \big[ \|f\|^2_{2,\mu} - \|f^\star\|^2_{2,\mu} + 2(f^\star(X) - f(X)) \big]^2 \Big) \le \big( \|f\|^2_{2,\mu} - \|f^\star\|^2_{2,\mu} \big)^2 + 4 \big( \|f\|^2_{2,\mu} - \|f^\star\|^2_{2,\mu} \big) \langle f^\star - f, f^\star \rangle_{2,\mu} + 4 \|f - f^\star\|^2_{2,\mu} $$
$$ = \big( \|f\|^2_{2,\mu} - \|f^\star\|^2_{2,\mu} \big) \big( \|f\|^2_{2,\mu} - \|f^\star\|^2_{2,\mu} + 4 \langle f^\star - f, f^\star \rangle_{2,\mu} \big) + 4 \|f - f^\star\|^2_{2,\mu} = \langle f - f^\star, f + f^\star \rangle_{2,\mu} \|f - f^\star\|^2_{2,\mu} - 2 \langle f - f^\star, f + f^\star \rangle_{2,\mu} \langle f - f^\star, f^\star \rangle_{2,\mu} + 4 \|f - f^\star\|^2_{2,\mu}. $$
We have $\langle f - f^\star, f + f^\star \rangle_{2,\mu} \le \|f\|^2_{2,\mu} \le \mu(\mathcal X)$, $\|f^\star\|^2_{2,\mu} \le \|f^\star\|_{\infty,\mu} \|f^\star\|_{1,\mu} \le 1$ and $\|f + f^\star\|_{2,\mu} \le \mu(\mathcal X)^{1/2} + 1$, so that
$$ \mathbb E\big( (\gamma(f,Z) - \gamma(f^\star,Z))^2 \big) \le \big( \mu(\mathcal X) + 2 \|f + f^\star\|_{2,\mu} \|f^\star\|_{2,\mu} + 4 \big) \|f - f^\star\|^2_{2,\mu} \le \big( \mu(\mathcal X) + 2\mu(\mathcal X) + 2 + 4 \big) \|f - f^\star\|^2_{2,\mu} = 3\big( \mu(\mathcal X) + 2 \big) \|f - f^\star\|^2_{2,\mu} = 3B \|f - f^\star\|^2_{2,\mu}, $$
where we have used $\mu(\mathcal X) \ge 1$. Let $\bar\gamma = \frac{1}{2B}(\gamma + B)$. Then $0 \le \bar\gamma(f,X) \le 1$ for $f \in M$. Moreover,
$$ \bar{\mathcal E}(f) := \mathbb E[\bar\gamma(f,Z) - \bar\gamma(f^\star,Z)] = \frac{1}{2B} \mathcal E(f), $$
and $\mathbb E\big( (\bar\gamma(f,Z) - \bar\gamma(f^\star,Z))^2 \big) \le D \|f - f^\star\|^2_{2,\mu}$ with $D = \frac{3}{4B} \le \frac 1 4$, where we have again used $\mu(\mathcal X) \ge 1$.

$\bullet$ For $\delta > 0$, we introduce
$$ \omega_n(\delta) = \omega_n(M, f^\star, \delta) = \mathbb E \sup_{f \in M,\ \|f - f_M\|^2_{2,\mu} \le \delta/D} \Big| \frac 1 n \sum_{i=1}^n \bar\gamma(f, Z_i) - \mathbb E(\bar\gamma(f,Z)) \Big|. $$
Following [27] (Section 4.1, p. 57), we introduce the sharp transformation $\sharp$ of the function $\omega_n$:
$$ \omega_n^\sharp(\varepsilon) = \inf\Big\{ \delta > 0 : \ \sup_{\sigma \ge \delta} \frac{\omega_n(\sigma)}{\sigma} \le \varepsilon \Big\}. $$
According to Proposition 4.1 in [27], there exist absolute constants $\kappa$ and $A$ such that, for any $\varepsilon \in (0,1)$ and $t > 0$, with probability at least $1 - A\exp(-t)$,
$$ \bar{\mathcal E}(\hat f^M_n) \le (1 + \varepsilon)\, \bar{\mathcal E}(f_M) + \frac 1 D\, \omega_n^\sharp\Big( \frac{\varepsilon \kappa}{D} \Big) + \kappa\, D\, \frac{t}{n \varepsilon}. \tag{33} $$
The sharp transformation is monotonic: if $\Psi_1 \le \Psi_2$ then $\Psi_1^\sharp \le \Psi_2^\sharp$ (see Appendix A.3 in [27]). Thus it remains to find an upper bound on the sharp transformation of an upper bound on $\omega_n$.

$\bullet$ We use standard symmetrization and contraction arguments for Rademacher variables. The Rademacher process indexed by the class $M$ is defined by
$$ \mathrm{Rad}_n(f) = \frac 1 n \sum_{i=1}^n \varepsilon_i f(X_i), $$
where the $\varepsilon_i$'s are i.i.d. Rademacher random variables (that is, $\varepsilon_i$ takes the values $+1$ and $-1$ with probability $1/2$), independent of the $X_i$'s. By the symmetrization inequality (see for instance Theorem 2.1 in [27]),
$$ \omega_n(\delta) \le 2\, \mathbb E \sup_{f \in M,\ \|f - f_M\|^2_{2,\mu} \le \delta/D} \big| \mathrm{Rad}_n\big( \bar\gamma(f, \cdot) - \bar\gamma(f_M, \cdot) \big) \big|. $$
We introduce the function
$$ \Psi_n(\delta) = \mathbb E \sup_{f \in M,\ \|f - f_M\|^2_{2,\mu} \le \delta} \big| \mathrm{Rad}_n(f - f_M) \big|. $$
In the regression setting, a contraction argument gives $\omega_n(\delta) \le 8\, \Psi_n(\delta/D)$. In the density estimation setting, $\bar\gamma(f, \cdot) - \bar\gamma(f_M, \cdot)$ differs from $-\frac 1 B (f - f_M)$ by a constant function and, since the fluctuations of a constant function are zero, we obtain
$$ \omega_n(\delta) \le \frac 2 B\, \mathbb E \sup_{f \in M,\ \|f - f_M\|^2_{2,\mu} \le \delta/D} \big| \mathrm{Rad}_n(f - f_M) \big| \le 8\, \Psi_n(\delta/D). \tag{34} $$

$\bullet$ We now introduce the following subset of the $L^2$ ball centered at $f_M$:
$$ \mathcal M(\delta, f^\star) = \big\{ f - f_M \,:\, f \in M,\ \|f - f_M\|^2_{2,\mu} \le \delta \big\}. $$
In the density estimation setting, the empirical measure of the $X_i$'s is denoted $\nu_n$. We also denote by $\nu_n$ the empirical measure in the regression setting (take $\nu = \mu$). The constant function $F = 2$ is an envelope for $\mathcal M(\delta, f^\star)$ and $\|F\|_{2,\nu_n} = 2$. According to Proposition 3.3,
$$ H\big( \varepsilon, \mathcal M(\delta, f^\star), \|\cdot\|_{2,\nu_n} \big) \le C_M \log\Big( \frac{|T^\star|\, L_{2,\nu_n}}{\varepsilon} \Big), \quad \varepsilon \le 1, \quad \nu^{\otimes n}\text{-almost surely}, $$
where $L_{2,\nu_n}$ is defined by (10) for the measure $\nu_n$ and for $p = 2$. According to Proposition 3.6, $L_{2,\nu_n}$ satisfies $L_{2,\nu_n} \le \sqrt{\nu_n(\mathcal X)}\, L_{\infty,\nu_n} = L_{\infty,\nu_n}$. Here it is assumed that $\mathcal H \subset L^\infty(\mathcal X)$ equipped with the norm $\|\cdot\|_\infty$. According to Proposition 3.6 we have $L_{\infty,\nu_n} \le L_\infty \le 1$, thus $L_{2,\nu_n} \le 1$ and
$$ H\big( \varepsilon, \mathcal M(\delta, f^\star), \|\cdot\|_{2,\nu_n} \big) \le C_M \Big[ \log\Big( \frac e \varepsilon \Big) + \log_+\Big( \frac{|T^\star|}{e} \Big) \Big] \le C_M \Big[ 1 + \log_+\Big( \frac{|T^\star|}{e} \Big) \Big] \log\Big( \frac e \varepsilon \Big) \le C_M\, a_T\, h\Big( \frac 1 \varepsilon \Big), \quad \varepsilon \le 1, $$
with $a_T = 1 + \log_+\big( \frac{|T^\star|}{e} \big)$ and $h(u) := \log(2 e u)$ for $u \ge 1/2$. We are now in position to apply Theorem A.1, given at the end of this section. We can take $\sigma^2 = \delta$ in Theorem A.1 because $\mathbb E_\nu(g(X)^2) \le \delta$ for $g \in \mathcal M(\delta, f^\star)$. Thus, there exists an absolute constant $\kappa_1 > 0$ such that
$$ \Psi_n(\delta) \le \kappa_1 \Bigg[ \sqrt{\frac \delta n\, C_M\, a_T\, h\Big( \frac{2}{\sqrt\delta} \Big)} \ \vee\ \frac 1 n\, C_M\, a_T\, h\Big( \frac{2}{\sqrt\delta} \Big) \Bigg]. $$
For regression, it can be easily checked (see also Example 3, p. 80 in [27]) that
$$ \Psi_n^\sharp(\varepsilon) \le \kappa_2\, \frac{C_M\, a_T}{\varepsilon^2 n}\, \log\Big( \frac{e\, \varepsilon^2 n}{\kappa_2\, C_M\, a_T} \Big). $$
Similar calculations hold for density estimation. Together with Inequalities (33) and (34), and according to the properties of the sharp transformation (see Appendix A.3 in [27]), this gives that, with probability at least $1 - A\exp(-t)$,
$$ \bar{\mathcal E}(\hat f^M_n) \le (1 + \varepsilon)\, \bar{\mathcal E}(f_M) + \frac 1 D \big( 8 \Psi_n(\cdot\, D) \big)^\sharp\Big( \frac{\varepsilon\kappa}{D} \Big) + \kappa D \frac{t}{n\varepsilon} \le (1 + \varepsilon)\, \bar{\mathcal E}(f_M) + (8\Psi_n)^\sharp( \varepsilon\kappa ) + \kappa D \frac{t}{n\varepsilon} \le (1 + \varepsilon)\, \bar{\mathcal E}(f_M) + \kappa_3\, \frac{a_T\, C_M}{\varepsilon^2 n} \log\Big( \frac{\kappa_4\, \varepsilon^2 n}{a_T\, C_M} \Big) + \kappa D \frac{t}{n\varepsilon}, $$
where $\kappa_3$ and $\kappa_4$ are absolute constants. This completes the proof for $R = 1$, by rewriting the risk bound for the original excess risk $\mathcal E$ (recall that $\bar{\mathcal E} = \mathcal E/B$ for regression and $\bar{\mathcal E} = \mathcal E/(2B)$ for density estimation).

$\bullet$ We now consider the more general situation where $M = \mathcal M^T_r(H_s)_R$ with $R \ge 1$. We first consider regression and now assume that $\|Y\|_{\ell^\infty} \le R$ almost surely. Let $f^\star$, $f_M$ and $\hat f^M$ be defined as in Section 4 for the observations $Z_1, \ldots, Z_n$. We consider the least-squares regression problem for the normalized data $(X_1, Y_1/R), \ldots, (X_n, Y_n/R)$ with the functional set $M_1$. For this problem the oracle is $f^\star/R$, the best approximation on $M_1$ is $f_M/R$, and the least-squares estimator is $\hat f^M/R$. The risk bound (24) is valid for the normalized data (with $R = 1$) and it directly gives (24) for $R \ge 1$. The same method applies for proving the risk bound in the density estimation case.
A.2.2 An adaptation of Theorem 3.12 in [27]
We consider the same framework as in [27]. We observe $X_1, \ldots, X_n$ according to the distribution $\nu$ and let $\nu_n$ be the empirical measure. Let $\mathcal F$ be a function space. Assume that the functions in $\mathcal F$ are uniformly bounded by a constant $U$ and let $F \le U$ denote a measurable envelope of $\mathcal F$. We assume that $\sigma$ is a number such that $\sup_{f \in \mathcal F} \mathbb E_\nu f^2 \le \sigma^2 \le \|F\|^2_{2,\nu}$. Let $h : [0,\infty) \to [0,\infty)$ be a regularly varying function of exponent $0 \le \alpha < 2$, strictly increasing for $u \ge 1/2$ and such that $h(u) = 0$ for $0 \le u < 1/2$. The constants $\kappa_h$ below depend only on $h$ and not on $c$.

Theorem A.1 (Theorem 3.12 in [27]). Let $c > 0$. If, for all $\varepsilon > 0$ and $n \ge 1$,
$$ \log N\big( \varepsilon, \mathcal F, \|\cdot\|_{2,\nu_n} \big) \le c\, h\Big( \frac{\|F\|_{2,\nu_n}}{\varepsilon} \Big) \quad \nu^{\otimes n}\text{-almost surely}, $$
then there exists a constant $\kappa_h > 0$ that depends only on $h$ such that
$$ \mathbb E \sup_{f \in \mathcal F} |R_n(f)| \le \kappa_h \Bigg[ \frac{\sigma}{\sqrt n} \sqrt{c\, h\Big( \frac{\|F\|_{2,\nu}}{\sigma} \Big)} \ \vee\ \frac U n\, c\, h\Big( \frac{\|F\|_{2,\nu}}{\sigma} \Big) \Bigg]. $$

Proof. The proof of Theorem 3.12 of [27] starts by applying Theorem 3.11 of [27]. As in [27], we assume without loss of generality that $U = 1$. In our context it gives
$$ E := \mathbb E \sup_{f \in \mathcal F} |R_n(f)| \le C \sqrt c\, n^{-1/2}\, \mathbb E \int_0^{\sigma_n} \sqrt{ h\Big( \frac{\|F\|_{2,\nu_n}}{\varepsilon} \Big) }\, d\varepsilon, $$
where $\sigma_n^2 = \sup_{f \in \mathcal F} \frac 1 n \sum_{i=1}^n f(X_i)^2$ and where $C$ is a universal numerical constant. By following the lines of the proof of [27], we find that $E$ satisfies the following inequation:
$$ E \le \sqrt c\, \kappa_{h,1}\, n^{-1} + \sqrt c\, \kappa_{h,2}\, n^{-1/2}\, \sigma \sqrt{ h\Big( \frac{\|F\|_{2,\nu}}{\sigma} \Big) } + \sqrt c\, \kappa_{h,3}\, n^{-1/2}\, \sqrt E\, \sqrt{ h\Big( \frac{\|F\|_{2,\nu}}{\sigma} \Big) }, $$
where $\kappa_{h,1}$, $\kappa_{h,2}$ and $\kappa_{h,3}$ are positive numerical constants which only depend on the function $h$ (see the proof in [27] for the expression of these three constants). Solving this inequation completes the proof.

A.3 Proofs of Section 5
Proof of Theorem 5.2.
According to Inequality (22) of Proposition 4.9, for any $t > 0$ and $m \in \mathcal M$, one has, with probability larger than $1 - \exp(-t)$,
$$ \sup_{f \in M_m} -\bar R_n(f) \le \lambda_m \sqrt{\frac{C_m}{n}} + 2B\sqrt{\frac t n}. $$
Then, with probability larger than $1 - \sum_{m \in \mathcal M} \exp(-x_m - t) = 1 - \Sigma \exp(-t)$, it holds that
$$ -\bar R_n(\hat f_{\hat m}) \le \sup_{f \in M_{\hat m}} -\bar R_n(f) \le \lambda_{\hat m} \sqrt{\frac{C_{\hat m}}{n}} + 2B\sqrt{\frac{t + x_{\hat m}}{n}}, $$
which together with (26) implies that
$$ \mathcal E(\hat f_{\hat m}) \le \mathcal E(f_m) + \bar R_n(f_m) + \lambda_{\hat m} \sqrt{\frac{C_{\hat m}}{n}} + 2B\sqrt{\frac{x_{\hat m}}{n}} - \mathrm{pen}(\hat m) + \mathrm{pen}(m) + 2B\sqrt{\frac t n} $$
holds for all $m \in \mathcal M$. Then, with the condition (27) on the penalty function, the upper bound
$$ \mathcal E(\hat f_{\hat m}) \le \mathcal E(f_m) + \bar R_n(f_m) + \mathrm{pen}(m) + 2B\sqrt{\frac t n} $$
holds for all $m \in \mathcal M$ simultaneously, with probability larger than $1 - \Sigma \exp(-t)$. Next, integrating with respect to $t$ gives
$$ \mathbb E\Big[ 0 \vee \Big( \mathcal E(\hat f_{\hat m}) - \mathcal E(f_m) - \bar R_n(f_m) - \mathrm{pen}(m) \Big) \Big] \le B\, \Sigma\, \sqrt{\frac \pi n}. $$
Finally, since $\bar R_n(f_m)$ has zero mean, for any $m \in \mathcal M$,
$$ \mathbb E\big( \mathcal E(\hat f_{\hat m}) \big) \le \mathcal E(f_m) + \mathrm{pen}(m) + B\, \Sigma\, \sqrt{\frac \pi n}, $$
and we conclude by taking the infimum over $m \in \mathcal M$.

Proof of Theorem 5.3.
The proof is adapted from Theorem 6.5 in [27], which corresponds to an alternative statement of Theorem 8.5 in [28]. We follow the lines of Section 6.3 in [27] (p. 107-108). We first consider the case $R = 1$ and we consider the normalized contrast $\bar\gamma$ and the normalized excess risk $\bar{\mathcal E}$ as in the proof of Proposition 4.12. We have shown that
$$ \mathbb E\big( [\bar\gamma(f,Z) - \bar\gamma(f^\star,Z)]^2 \big) \le D\, \|f - f^\star\|^2_{2,\mu}, $$
where $D$ does not depend on the model $M_m$. Next, it has also been shown in the proof of Proposition 4.12 that, for $\varepsilon \in (0,1)$,
$$ \omega_n^\sharp(\varepsilon) \le \kappa\, \frac{a_m\, C_m}{n\varepsilon^2} \log_+\Big( \frac{n\varepsilon^2}{a_m\, C_m} \Big), $$
with $a_m = 1 + \log_+\big( \frac{|T^\star_m|}{e} \big)$ and where $\kappa$ is an absolute constant. We consider the penalized criterion (25) with a penalty of the form
$$ \mathrm{pen}(m) = \kappa_1\, \frac{a_m\, C_m}{n\varepsilon} \log_+\Big( \frac{n\varepsilon}{a_m\, C_m} \Big) + \kappa_2\, \frac{x_m}{n\varepsilon}. $$
We take
$$ \delta_n^\varepsilon(m) = \tilde\delta_n^\varepsilon(m) = \hat\delta_n^\varepsilon(m) = \kappa\, \frac{a_m\, C_m}{n\varepsilon} \log\Big( \frac{n\varepsilon}{a_m\, C_m} \Big) + K\, \frac{x_m}{n\varepsilon} $$
(and thus $p_m = 0$ in the theorem), and we also note that, for any $t > 0$,
$$ \delta_n^\varepsilon(m) + K\, \frac{t}{n\varepsilon} \le K\, \Big[ \frac{a_m\, C_m}{n\varepsilon} \log\Big( \frac{n\varepsilon}{a_m\, C_m} \Big) + \frac{x_m + t}{n\varepsilon} \Big] $$
for a numerical constant $K$. Finally, according to Theorem 6.5 in [27], there exist numerical constants $K_1$ and $K_2$ such that, for any $t > 0$,
$$ P\bigg( \bar{\mathcal E}(\hat f_{\hat m}) > \frac{1+\varepsilon}{1-\varepsilon} \inf_{m \in \mathcal M} \Big\{ \bar{\mathcal E}(f_m) + K_1\Big[ \frac{a_m\, C_m}{n\varepsilon} \log\Big( \frac{n\varepsilon}{a_m\, C_m} \Big) + \frac{x_m + t}{n\varepsilon} \Big] \Big\} \bigg) \le K_2 \sum_{m \in \mathcal M} \exp(-t - x_m). $$
Under Assumption 5.1, we easily derive the oracle bound (28) by rewriting it for the contrast $\gamma$ and then by integrating this probability bound with respect to $t$. This bound generalizes to the case $R \ge 1$ by the same normalization argument as in the proof of Proposition 4.12.