Tensor denoising with trend filtering
aa r X i v : . [ m a t h . S T ] J a n Tensor denoising with trend filtering
Francesco Ortelli and Sara van de GeerSeminar f¨ur Statistik, ETH Z¨urichR¨amistrasse 101, CH-8092 Z¨urich { fortelli,geer } @ethz.chJanuary 27, 2021 Abstract
We extend the notion of trend filtering to tensors by consideringthe k th -order Vitali variation – a discretized version of the integral ofthe absolute value of the k th -order total derivative. We prove adap-tive ℓ -rates and not-so-slow ℓ -rates for tensor denoising with trendfiltering.For k = { , , , } we prove that the d -dimensional margin of a d -dimensional tensor can be estimated at the ℓ -rate n − , up to loga-rithmic terms, if the underlying tensor is a product of ( k − th -orderpolynomials on a constant number of hyperrectangles. For general k we prove the ℓ -rate of estimation n − H ( d )+2 k − H ( d )+2 k − , up to logarithmicterms, where H ( d ) is the d th harmonic number.Thanks to an ANOVA-type of decomposition we can apply theseresults to the lower dimensional margins of the tensor to prove boundsfor denoising the whole tensor. Our tools are interpolating tensors tobound the effective sparsity for ℓ -rates, mesh grids for ℓ -rates and,in the background, the projection arguments by Dalalyan et al. [3]. Keywords: tensor denoising, total variation, Vitali variation, trend filtering,oracle inequalities
Contents d -dimensional tensors . . . . . . . . . 82.1.1 Tensors with product structure . . . . . . . . . . . . 92.1.2 Orthogonality between tensors . . . . . . . . . . . . 92.1.3 Linear subspaces and orthogonal projections . . . . 92.2 Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.3 Active sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 d = 1 . . . . . . . . . . . . . . . . . . . . . . 134.2 Dictionary for general d . . . . . . . . . . . . . . . . . . . . 15 k = 1 . . . . . . . . . . . . . 265.6.4 Interpolating tensor for k = 2 . . . . . . . . . . . . . 265.6.5 Interpolating tensor for k = 3 . . . . . . . . . . . . . 265.6.6 Interpolating tensor for k = 4 . . . . . . . . . . . . . 275.7 Proof of Theorem 5.2 . . . . . . . . . . . . . . . . . . . . . . 27 ℓ -rates 28 ˜ S is an enlarged mesh grid 306.3 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . 30 Denoising the whole tensor 35
B.1 Proof of Lemma 4.2 . . . . . . . . . . . . . . . . . . . . . . 41
C Proofs of Section 5 42
C.1 Proof of Lemma 5.10 . . . . . . . . . . . . . . . . . . . . . . 42C.2 Proof of Lemma 5.11 . . . . . . . . . . . . . . . . . . . . . . 44C.3 Proof of Lemma 5.12 . . . . . . . . . . . . . . . . . . . . . . 45C.4 Matching derivatives . . . . . . . . . . . . . . . . . . . . . . 45C.5 Partial integration . . . . . . . . . . . . . . . . . . . . . . . 46C.6 Proof of Lemma 5.13 . . . . . . . . . . . . . . . . . . . . . . 48
D Proofs of Section 6 49
D.1 Proof of Lemma 6.3 . . . . . . . . . . . . . . . . . . . . . . 49
Let f ∈ R n × ... × n d be a d -dimensional tensor with n = n · . . . · n d entries.We want to prove error bounds for tensor denoising, which is the task ofrecovering f from its noisy version Y = f + ǫ , where ǫ has i.i.d. Gaussianentries with mean 0 and variance σ .We show that we can estimate the underlying tensor f in an adaptivemanner with a regularized least-squares signal approximator. As regularizerwe propose the Vitali variation of the ( k − th -order total differences ofthe candidate estimator for k ≥ . We call this regularizer the “ k th -orderVitali total variation”. We use the abbreviation TV for “total variation”.This approach extends the idea of “trend filtering” [9, 22] to tensors.We expose the notion of TV regularization, review the literature onadaptive results for TV regularization, explain the concept of adaptationfor structured problems, introduce an ANOVA-type of decomposition of atensor, outline our contributions and finally present the organization of thepaper. 3 .1 TV regularization A regularized (least-squares) signal approximator is an estimator ˆ f definedas ˆ f := arg min f ∈ R n × ... × nd n k Y − f k /n + 2 λ pen(f) o , where k·k denotes the sum of the squared entries of its argument, λ > isa tuning parameter and pen(f) is a regularization penalty.When pen(f) = k D f k for a linear operator D and for k·k denotingthe sum of the absolute values of the entries of its argument, the regular-ized signal approximator is called “ ℓ -analysis estimator” or simply “anal-ysis estimator” [4]. If the linear operator D is a difference operator, then pen(f) = k D f k is usually called TV of f and the estimator ˆ f is called TVregularized estimator. Different choices of the difference operator D arepossible, resulting in different notions of TV.For a continuous image defined on ( x , . . . , x d ) ∈ [0 , d , one can choose D as a discretized version of either the total k th -order derivative opera-tor Q di =1 ∂ k / ( ∂x i ) k or of the sum of k th -order partial derivative operators P di =1 ∂ k / ( ∂x i ) k . For d = 1 partial and total derivatives coincide. With D being the firstorder difference matrix, the TV regularized estimator is also known underthe name “fused Lasso” [24, 6]. Adaptivity of the fused Lasso has beenproved by Dalalyan et al. [3], Lin et al. [10], Guntuboyina et al. [7].The “edge Lasso” extends the fused lasso to graphs and is studied bySharpnack et al. [20], H¨utter and Rigollet [8]. Ortelli and van de Geer[12, 15] prove adaptivity of the edge Lasso on tree graphs and cycle graphs,respectively.The idea of the fused Lasso can also be extended to the penalization ofhigher-order differences. This extension is called “trend filtering” [9, 22, 23].Adaptivity of trend filtering is established in Guntuboyina et al. [7], Ortelliand van de Geer [14]. Wang et al. [27] consider trend filtering on graphs,Sadhanala et al. [19] in higher-dimensional situations and Sadhanala andTibshirani [17] for additive models.Here, we consider the case of D being a discretization of Q di =1 ∂ k / ( ∂x i ) k .We call the corresponding notion of TV “ k th -order Vitali TV”. In the liter-ature, signal approximators regularized with the Vitali TV are studied byMammen and van de Geer [11], Ortelli and van de Geer [16], Fang et al.[5]. Ortelli and van de Geer [16] prove adaptivity for d = 2 and k = 1 .4ang et al. [5] show adaptivity for d = 2 and k = 1 using as regularizer theHardy-Krause variation, which is the sum of the Vitali TV of a matrix andof its margins. 
In this paper we will prove adaptivity of tensor denoisingwith k th -order Vitali TV regularization for k = { , , , } and general di-mension d ≥ . The results obtained for k = { , , , } and d = 1 in Ortelliand van de Geer [14] and for k = 1 and d = 2 in Ortelli and van de Geer[16] will then be retrieved as special cases.Signal approximators regularized with D being a discretization of thepartial derivatives P di =1 ∂ k / ( ∂x i ) k are studied by H¨utter and Rigollet [8],Sadhanala et al. [19] for general d . For d = 2 , Chatterjee and Goswami [2]show the fast rate n − / for estimating axis-aligned rectangles. The analysis estimator ˆ f can be recast in a constructive formulation as “syn-thesis estimator”. One can find dictionary tensors { φ j ∈ R n × ... × n d } j ∈ [ p ] ,such that ˆ f = p X j =1 ˆ β j φ j , where ˆ β := arg min b ∈ R n k Y − p X i =1 b j φ j k /n + 2 λ X j U | b j | , and U ⊆ { , . . . , p } is a set of indices, cf. Elad et al. [4]. The Lassoestimator [21, 1, 25] is an instance of synthesis estimator. The dictionary { φ j } j ∈ [ p ] and the set of unpenalized coefficients U ⊆ [ p ] depend on D .We can see that D imposes structure on the estimator: it determines thedictionary with which the estimator is constructed. For instance, in thecase of the st -order Vitali TV, the dictionary { φ j } j ∈ [ p ] consists of tensorsbeing constant on hyperrectangles. Therefore, the estimator ˆ f is constanton few hyperrectangular pieces.Our goal is to prove adaptation of the estimator ˆ f to the underlyingsignal f , when k D f k is the k th -order Vitali TV.Adaptation is a consequence of a high-probability upper bound on themean squared error (MSE) in the form of the oracle inequality k ˆ f − f k /n ≤ k g − f k /n + rem( D, g, S ) , (1)where g ∈ R n × ... × n d is an arbitrary tensor, S is an arbitrary set of indicesof Dg and rem( D, g, S ) is a remainder term. A result of the form of (1)establishes the adaptation of the estimator ˆ f , provided that the remainderterm rem( D, g = f , S = S ) converges to zero, where S is the set of theindices of the nonzero coefficients of Df . The cardinality s := | S | of S is called the “sparsity” of f with respect to D .5e can optimize the upper bound in (1) over g and S . However, theoptimizers g ∗ and S ∗ will depend on f – which is unobserved. Hence thename “oracle” for the pair ( g ∗ ( f ) , S ∗ ( f )) and the name “oracle inequality”for results as (1).Such a result is considered to be adaptive, since different underlyingtrue tensors f will possibly give place to different oracles ( g ∗ ( f ) , S ∗ ( f )) and to different values for the upper bound.Results as (1) are only useful if it can be proved that rem( D, f , S ) converges to zero. Typically rem( D, f , S ) = O (cid:16) λ Γ D ( S ) (cid:17) , where Γ D ( S ) is called “effective sparsity” and depends both on D and S . Proving adaptivity therefore translates into proving a bound for theeffective sparsity: a task which depends on the structure imposed by D .To bound the effective sparsity for tensor denoising with trend filtering weuse an interpolating tensor, in analogy to the interpolating vector and theinterpolating matrix by Ortelli and van de Geer [14, 16].Adaptive results as (1) are a consequence of a careful choice of λ . Thegeneral theory for the Lasso [1, 25] suggests the choice λ ≍ λ ≍ p log( n ) /n ,where λ is called the “universal choice”. The universal choice ensures thatall the noise is overruled. However, Dalalyan et al. 
[3] show that also thesmaller choice λ ≍ ˜ γλ is possible, where ˜ γ > is a scaling factor whichaccounts for the correlation in the dictionary { φ j } j ∈ [ p ] induced by D and S . The projection arguments by Dalalyan et al. [3] in the background ofour results allow us to choose the tuning parameter of smaller order thanthe universal choice λ .Projection arguments have been discussed in the literature. We do notreport them here but refer instead to Theorem 3 in Dalalyan et al. [3],Lemma B2 and Lemma C2 in Ortelli and van de Geer [15], Lemma 13 inOrtelli and van de Geer [16] and to van de Geer [26]. In the continuous case, the “nullspace” of the k th -order derivative opera-tor along one coordinate is made of constant, linear, ..., ( k − th -ordermonomial functions. The nullspace of the total derivative operator in d -dimensions is made of d -dimensional functions which are linear, ..., ( k − th -order monomial along at least one coordinate. In the discrete casewhen n ≍ . . . ≍ n d the linear space spanned by such tensors is n − /d -dimensional. 6e will decompose a tensor f ∈ R n × ... × n d into a sum of mutuallyorthogonal tensors. Each of these mutually orthogonal tensors will be con-stant or linear or ... or ( k − th -order monomial along a set of l coordinates,for l ∈ [0 : d ] . This construction will be carried out for all possible sets ofcoordinates in [ d ] . Tensors being constant or linear or ... or ( k − th -ordermonomial along d − l coordinates will be called l -dimensional margins.We will adaptively estimate l -dimensional margins with l -dimensionalVitali TV regularized estimators, for l ∈ [ d ] . The -dimensional marginswill be estimated by ordinary least squares at a rate n − . By estimating allthe margins adaptively we will be able to prove adaptivity of the denoisingof the whole tensor via Vitali TV regularization. Previously, we have derived tools like interpolating vectors and matchingderivatives to prove adaptivity for trend filtering ( d = 1 and k = { , , , } ,see Ortelli and van de Geer [14]). In Ortelli and van de Geer [16] we havecome up with tools to extend our results for adaptation of the fused Lasso( d = 1 and k = 1 ) to the two-dimensional case of image denoising ( d = 2 and k = 1 ). Here, we show in the first place how to combine and extend thetools from image denoising and one-dimensional trend filtering to handletrend filtering for k = { , , , } and for general dimension d . Establishingadaptivity requires a so-called “bound on the antiprojections”. We prove aformula giving the bounds on the antiprojections for general k and d . Wethen propose an ANOVA decomposition to ensure that all the margins ofa d -dimensional tensor can be estimated adaptively.Lastly, we prove slow rates for tensor denoising with trend filtering. Weextend the idea of mesh grid by Ortelli and van de Geer [16] to general d and general k . We then prove a bound on the antiprojections with the helpof the mesh grid holding for all d and all k .The integration of the arguments by Ortelli and van de Geer [14] withthe ones by Ortelli and van de Geer [16], the general bounds on the an-tiprojections and the ANOVA decomposition allow us to present generalrisk bounds for tensor denoising with trend filtering. 
In Section 2 we expose the required notation, the model and define thetrend filtering estimator for the d -dimensional margin.In Section 3 we list our contributions and give a preview of the results:adaptive ℓ -rates and not-so-slow ℓ -rates.7n Section 4 we derive the synthesis form of the trend filtering estimatorfor the d -dimensional margin.Proving the main result on adaptivity for tensor denoising with trendfiltering is the topic of Section 5.In Section 6 we apply a general result on not-so-slow ℓ -rates for analysisestimators to tensor denoising with trend filtering.In Section 7 we show the ANOVA decomposition of a tensor and definethe estimators for lower-dimensional margins.In Section 8 we apply the results on adaptivity and on not-so-slow ℓ -rates to the estimators for the lower-dimensional margins defined in Section7. This will establish adaptivity and not-so-slow rates for the estimation ofthe whole tensor.Section 9 concludes the paper. We consider the model Y = f + ǫ, where Y, f , ǫ ∈ R n × ... × n d are d -dimensional tensors and ǫ has i.i.d. N (0 , σ ) entries with known variance σ ∈ (0 , ∞ ) . For the case of unknown variancewe refer to Ortelli and van de Geer [15], who show how to estimate f and σ at the same time.The goal is to estimate f given its noisy observations Y . We considera signal approximator regularized with the Vitali TV. d -dimensional tensors For two integers i ≤ j we define [ i : j ] := { i, . . . , j } . Moreover, if i = 1 wewrite [ j ] := [1 : j ] .Let f ∈ R n × ... × n d be a d -dimensional tensor with n := n . . . n d entries.For indices ( j , . . . , j d ) ∈ [ n ] × . . . × [ n d ] we refer to the corresponding entryof f by f j ,...,j d using indices or by f ( j , . . . , j d ) using arguments.For ( j ′ , . . . , j ′ d ) , ( j ′′ , . . . , j ′′ d ) ∈ [ n ] × . . . × [ n d ] we use the notation j ′′ ,...,j ′′ d X j ′ ,...,j ′ d f j ,...,j d := j ′′ d X j d = j ′ d · · · j ′′ X j = j ′ f j ,...,j d . Similarly we write { f j ,...,j d } j ′′ ,...,j ′′ d j ′ ,...,j ′ d := { f j ,...,j d } ( j ′′ ,...,j ′′ d )( j ,...,j d )=( j ′ ,...,j ′ d ) . k f k := ( P n ,...,n d ,..., f j ,...,j d ) / we denote the Frobenius norm of f .Moreover we define k f k := P n ,...,n d ,..., | f j ,...,j d | as the sum of the absolutevalues of the entries of f . We now let f ∈ R n × ... × n d be a d -dimensional tensor with n := n · . . . · n d entries. Define the set of indices I of the entries of f as I := [ n ] × . . . × [ n d ] .We say that f has product structure if there are vectors { f j } j ∈ [ d ] suchthat f ( j , . . . , j d ) = f ( j ) · . . . · f d ( j d ) , ∀ ( j , . . . , j d ) ∈ I. We then write f = f × . . . × f d .Let f and g be tensors with product structure. We consider the entry-wise multiplication ( f ⊙ g ) j ,...,j d = f j ,...,j d g j ,...,j d , ( j , . . . , j d ) ∈ I .It holds that ( f ⊙ g ) j ,...,j d = Q dl =1 f l ( j l ) g l ( j l ) , ∀ ( j , . . . , j d ) ∈ I . The operation P n ,...,n d ,..., ( f ⊙ g ) j ,...,j d is the equivalent of the scalar productfor tensors.We say that the tensors f and g are orthogonal if P n ,...,n d ,..., ( f ⊙ g ) j ,...,j d =0 . If f and g have product structure and f l and g l are orthogonal to eachother for at least one coordinate l ∈ [ d ] , then f and g are orthogonal too. Let W be a linear subspace of R n × ... × n d and let W ⊥ be its orthogonalcomplement. By I : R n × ... × n d R n × ... × n d we denote the identity oper-ator, i.e., I f = f . 
By P W we denote the orthogonal projection operatoronto W and by A W := I − P W = P W ⊥ the corresponding orthogonal an-tiprojection operator. For a tensor f ∈ R n × ... × n d we write f W := P W and f W ⊥ := f − f W .For a linear operator ∆ , let N (∆) denote its nullspace. Let k be an integer in { , . . . , min i ∈ [ d ] n i − } .Let D ki be the k th -order difference operator along the i th coordinate,defined as ( D ki f )( j , . . . , j i , . . . , j d ) := n k − i k X l =0 ( − l kl ! f ( j , . . . , j i − l, . . . , j d ) , ( j , . . . , j i − , j i , j i +1 , . . . , j d ) ∈ [ n ] × . . . × [ n i − ] × [ k + 1 : n i ] × [ n i +1 ] × . . . × [ n d ] . Definition 2.1 (Total k th -order difference operator) . The total k th -order dif-ference operator D k is defined as D k := d Y i =1 D ki . The total k th -order difference operator D k can be seen as a discretizedversion of Q di =1 ∂ k / ( ∂x i ) k . It is important to note that the definition of D k implicitly includes a factor n k − that stems from the discretization.The Vitali TV of a tensor f ∈ R n × ... × n d is defined as the sum of theabsolute values of its total k th -order differences. Definition 2.2 ( k th -order Vitali TV) . The k th -order Vitali TV TV k ( f ) of a d -dimensional tensor f ∈ R n × ... × n d is defined as TV k ( f ) := k D k f k . The k th -order Vitali TV has the canonical scaling TV k ( f ) = O (1) dueto the normalization by the factor n k − in the definition of D k . We referto Sadhanala et al. [18] for more about canonical scalings.We define the nullspace N k of D k as N k := { f ∈ R n × ... × n d : D k f = 0 } and its orthogonal complement as N ⊥ k . We call f N ⊥ k the d -dimensionalmargin of a tensor f ∈ R n × ... × n d . Definition 2.3 ( k th -order trend filtering estimator) . The k th -order Vitalitrend filtering estimator ˆ f N ⊥ k for the d -dimensional margin f N ⊥ k is definedas ˆ f N ⊥ k := arg min f ∈ R n × ... × nd n k ( Y − f) N ⊥ k k /n + 2 λ TV k (f) o , where λ > is a tuning parameter. Let S ⊆ [3 : n − × . . . × [3 : n d − be a subset of the indices of D k f forsome tensor f ∈ R n × ... × n d . We write s := | S | and S = { t , . . . , t s } , where t m = ( t ,m , . . . , t d,m ) . We call { t m } sm =1 the jump locations.Moreover we define a S := { a j ,...,j d , ( j , . . . , j d ) ∈ S } and a − S := { a j ,...,j d , ( j , . . . , j d ) / ∈ S } . We will use the same notation a S for thetensor which shares its entries with a for ( j , . . . , j d ) ∈ S and has all itsother entries equal to zero. Similarly, we will also denote by a − S a tensorthat shares its entries with a for ( j , . . . , j d ) S and has its other entriesequal to zero. 10 Contributions
We make the following contributions: • We extend the idea of trend filtering to d -dimensional settings via theVitali variation and total discrete derivatives. • We prove adaptive ℓ -rates for tensor denoising with trend filteringfor k = { , , , } , see Theorem 3.1, a simplified version of Theorem5.2. The rates for d = 1 and k = { , , , } and for d = 2 and k = 1 are known. Rates for the other cases are new contributions. We alsoexpose some sufficient conditions to find adaptive bounds for general k . For each given k one can check by computer whether the conditionshold but the problem of showing that they hold for general k remainsopen. • We prove not-so-slow ℓ -rates for tensor denoising with trend filtering,see Theorem 3.2. Here too, the rates for d = 2 and k ≥ and for d ≥ are new contributions. It is still an open problem whether theserates correspond for d ≥ to minimax rates (modulo log terms). • We extend the idea of ANOVA decomposition from st -order dif-ferences to k th -order differences in d dimensions. By means of thisANOVA decomposition we can apply the results for the d -dimensionalmargin to lower dimensional margins. We obtain ℓ - and ℓ -rates forthe estimation of the whole tensor by trend filtering. • Our results allow to recover previous results for trend filtering andimage denoising [14, 16] as special cases.
We consider tensors in R n × ... × n d such that n = . . . = n d .Let λ ( t ) := σ s n ) + 2 tn , t > . We call λ ( t ) the “universal choice” of the tuning parameter. The universalchoice λ = λ ( t ) guarantees that all the noise is overruled. However, ourresults also allow for a smaller choice than the universal choice, due to theprojection arguments by Dalalyan et al. [3] in the background. Theorem 3.1 (Adaptivity of Vitali trend filtering, simplified) . Fix k ∈{ , , , } . Let g ∈ R n /d × ... × n /d be arbitrary. Let S ⊆ × i ∈ [ d ] [ k +2 : n /d − e an arbitrary set of size s := | S | defining a regular grid of cardinality s /d × . . . × s /d parallel to the coordinate axes. For a large enough constant C > only dependent on k , choose λ ≥ C d / λ (log(2 n )) s k − d . Then, with probability at least − /n , it holds that k ( ˆ f − f ) N ⊥ k k /n ≤ k g − f N ⊥ k k /n + 4 λ k ( D k g ) − S k + O λ s k log( n/s ) n ! . Proof.
See Subsection 5.7 for the proof of the more general Theorem 5.2.Some examples of the exponent of s in the rate of Theorem 3.1 for d = { , , } and k = { , , , } are exposed in Table 1. k = 1 k = 2 k = 3 k = 4 d = 1 d = 2 d = 3 d general − /d − /d − /d − /d Table 1: Some examples of the exponent of s in the rate of Theorem 3.1for the choice λ ≍ s − k − d λ (log(2 n )) .If in Theorem 3.1 we set g = f N ⊥ k and choose the tuning parameter λ ≍ s − k − d λ (log(2 n )) depending on the (typically unknown) true activeset S , we obtain the rate O s k − k − d log n log( n/s ) n . If in Theorem 3.1 we set g = f N ⊥ k and we choose the tuning parameter λ ≍ λ (log(2 n )) in a completely data-driven way not depending on the(typically unknown) true active set S , we obtain the rate O s k log n log( n/s ) n ! . We now fix k ∈ [1 : min i ∈ [ d ] n i − . For d ∈ N define the d th harmonicnumber H ( d ) as H ( d ) := P di =1 /i . 12 heorem 3.2 (Not-so-slow ℓ -rates for Vitali trend filtering) . Let g ∈ R n /d × ... × n /d be any tensor such that TV k ( g ) = O (1) . Choose λ ≍ n − H ( d )+2 k − H ( d )+2 k − log H ( d )2 H ( d )+2 k − ( n ) . Then, with probability at least − Θ(1 /n ) , it holds that k ( ˆ f − f ) N ⊥ k k /n ≤ k g − f N ⊥ k k /n + O (cid:18) n − H ( d )+2 k − H ( d )+2 k − log H ( d )2 H ( d )+2 k − ( n ) (cid:19) . Proof.
See Subsection 6.3.Some examples of the exponent of n in the rate of Theorem 3.2 for d = { , , } and k = { , , } are exposed in Table 2. k = 1 k = 2 k = 3 k general d = 1 − / − / − / − k/ (2 k + 1) d = 2 − / − / − / − (4 k + 1) / (4 k + 4) d = 3 − / − / − / − (12 k + 5) / (12 k + 16) Table 2: Some examples of the exponent of n in the rate of Theorem 3.2. According to Definition 2.3, the trend filtering estimator is an analysisestimator. In this section we want to rewrite it in a constructive form,that is, in synthesis form. We show that the trend filtering estimator canbe constructed as a linear combination of tensors with product structure,where the factors are truncated monomials of order k − . We call thecollection of such tensors the “dictionary”.We first define the dictionary and then show that it is the right dictio-nary to construct the trend filtering estimator.We start with the one-dimensional case. We then obtain the d -dimensionaldictionary from the one-dimensional dictionary by constructing tensors withproduct structure. d = 1 Let φ j := { { j ′ ≥ j } } j ′ ∈ [ n ] , j ∈ [ n ] . The vectors { φ j } j ∈ [ n ] are linearly inde-pendent and piecewise constant. 13or ≤ k ≤ n − define recursively φ kj := ( φ jj , j ∈ [ k − , P l ≥ j φ k − l /n, j ∈ [ k : n ] . We call the collection Φ k = { φ kj } j ∈ [ n ] the “original” dictionary.The dictionary Φ k is a collection of n linearly independent discrete(truncated) monomials: the first k are monomials of order , , . . . , k − ,while the last n − k are truncated monomials of order k − .We now define a partially orthonormalized version of the dictionary Φ k , k ∈ [ n − . Definition 4.1 (Partially orthonormalized dictionary in one dimension) . The(partially orthonormalized) dictionary ˜Φ k = { ˜ φ kj } j ∈ [ n ] is defined as ˜ φ kj := √ n A { φ ll ,l ∈ [ j − } φ jj / k A { φ ll ,l ∈ [ j − } φ jj k , j ∈ [ k ] , A { φ ll ,l ∈ [ k ] } φ kj , j ∈ [ k + 1 : n ] . For k ∈ [ n − , ˜Φ k = { ˜ φ kj } j ∈ [ n ] is again a collection of n linearly inde-pendent vectors, where ˜ φ k , . . . , ˜ φ kk , { ˜ φ kj } j ∈ [ k +1: n ] are mutually orthogonal.Moreover k ˜ φ kj k = n, j ∈ [ k ] . Lemma 4.2 (Relation between dictionary and difference operator) . Fix k ∈ [ n − . It holds that D k φ kj = D k ˜ φ kj = ( , j ∈ [ k ] , { j } , j ∈ [ k + 1 : n ] . Proof.
See Appendix B.1.As a consequence of Lemma 4.2, { ˜ φ jj } j ∈ [ k ] span N k and { ˜ φ jj } j ∈ [ k ] is anorthogonal basis for N k . Moreover { ˜ φ kj } j ∈ [ k +1: n ] span N ⊥ k .By Lemma 4.2 combined with Lemma 2.2 in Ortelli and van de Geer[13] about the Moore-Penrose pseudoinverse we obtain for the pseudoinverse ( D k ) + that ( D k ) + = { ˜ φ kj } j ∈ [ k +1: n ] .With the dictionary ˜Φ k and some coefficients { β j } nj = k +1 we can write avector f N ⊥ k ∈ N ⊥ k as f N ⊥ k = ( D k ) + β . Then β = D k f N ⊥ k .For d = 1 we therefore obtain the following synthesis form of the esti-mator ˆ f N ⊥ k : ˆ f N ⊥ k = n X j = k +1 ˜ φ kj ˆ β j , ˆ β = arg min b ∈ R n − k k Y N ⊥ k − n X j = k +1 b j ˜ φ kj k /n + 2 λ k b k . d Hereafter we fix k ∈ [1 : min l ∈ [ d ] n l − . Definition 4.3 (Partially orthonormalized dictonary in d -dimensions) . Thedictionary { ˜ φ kj ,...,j d ∈ R n × ... × n d } n ,...,n d ,..., is defined as ˜ φ kj ,...,j d = ˜ φ kj × . . . × ˜ φ kj d , ( j , . . . , j d ) ∈ × i ∈ [ d ] [ n i ] . The dictionary { ˜ φ kj ,...,j d } n ,...,n d ,..., is a collection of d -dimensional tensorswith product structure. By Lemma 4.2 and the product structure, N ⊥ k =span( { ˜ φ kj ,...,j d } n ,...,n d k +1 ,...,k +1 ) .For a tensor of coefficients { β j ,...,j d } n ,...,n d k +1 ,...,k +1 , write f N ⊥ k = n ,...,n d X k +1 ,...,k +1 β j ,...,j d ˜ φ kj ,...,j d . Because of the product structure of ˜ φ kj ,...,j d it holds that D k f N ⊥ k = n ,...,n d X k +1 ,...,k +1 β j ,...,j d (1 { j } × . . . × { j d } ) = β. From the fact that any candidate estimator has to belong to the spacespanned by Y N ⊥ k , it follows that ˆ f N ⊥ k = n ,...,n d X k +1 ,...,k +1 ˆ β j ,...,j d ˜ φ kj ,...,j d , where ˆ β = arg min b ∈ R ( n − k ) × ... × ( nd − k ) k Y N ⊥ k − n ,...,n d X k +1 ,...,k +1 b j ,...,j d ˜ φ kj ,...,j d k /n + 2 λ k b k . The synthesis form of the estimator ˆ f N ⊥ k is useful in two ways. Firstly, todetermine the structure of the estimator by specifying the dictionary usedto construct it. In our case, ˆ f N ⊥ k is a linear combination of d -dimensionalproducts of ( k − th -order polynomials. Secondly, the dictionary facilitatesthe approximation of some orthogonal projections in the proof of adaptive ℓ -rates and not-so-slow ℓ -rates. 15 Adaptivity
In this section we first expose some notation for our main result. After hav-ing exposed our main result, Theorem 5.2, we work out explicit expressionsfor the bound on the antiprojections ˜ v , the inverse scaling factor ˜ γ and thenoise weights v . Finally, we show a bound on the effective sparsity via asuitable interpolating tensor. In Subsection 5.7 we put the pieces togetherto prove Theorem 5.2.Fix k ∈ [1 : min i ∈ [ d ] n i − and an active set S ⊆ × i ∈ [ d ] [ k + 2 : n i − k ] .To every jump location in S , we associate a hyperrectangle of k d addi-tional jump locations to obtain the enlarged active set ˜ S , defined as ˜ S := s [ m =1 ( × i ∈ [ d ] [ t i,m : t i,m + k − . Definition 5.1 (Hyperrectangular tessellation) . We call { R m } sm =1 a hyper-rectangular tessellation of × i ∈ [ d ] [ k + 1 : n i ] if it satisfies the following con-ditions: • each R m ⊆ × i ∈ [ d ] [ k + 1 : n i ] is a hyperrectangle ( m = 1 , . . . , s ); • ∪ sm =1 R m = × i ∈ [ d ] [ k + 1 : n i ] ; • for all m and m ′ = m , the hyperrectangles R m and R m ′ possibly shareboundary points but not interior points; • for all m , the points × i ∈ [ d ] [ t i,m : t i,m + k − are interior points of R m . For a hyperrectangular tessellation { R m } sm =1 denote the vertices of thehyperrectangle R m by ( t z ,m , . . . , t z d d,m ) , ( z , . . . , z d ) ∈ {− , + } d , for m ∈ [ s ] .Moreover we define the distances of the jump locations from the verticesof their respective hyperrectangle and the respective set of indices as d − i,m := ( t i,m − t − i,m ) , R − i,m := [ t − i,m : t i,m ] ,d i,m := k, R i,m := [ t i,m : t i,m + k − ,d + i,m := ( t + i,m − t i,m − k + 1) , R + i,m := [ t i,m + k − t + i,m ] , for i ∈ [ d ] and m ∈ [ s ] . Each hyperrectangle R m of the hyperrectangulartessellation { R m } m ∈ [ s ] can be partitioned into d hyperrectangles. Define,for all ( z , . . . , z d ) ∈ {− , , + } d , R z ··· z d m := R z ,m × . . . × R z d d,m , m ∈ [ s ] . For m ∈ [ s ] , let d z ··· z d m := d z ,m · . . . · d z d d,m , { z , . . . , z d } ∈ {− , + } d .
16e define the maximal distance from an (enlarged) jump location tothe boundary of the corresponding rectangular region along the coordinate i ∈ [ d ] as d i, max ( S ) := max m ∈ [1: s ] max { d − i,m , d + i,m } .d + , − m d − , − m d − , + m d + , + m ( t +1 ,m , t − ,m ) ( t +1 ,m , t +2 ,m )( t − ,m , t +2 ,m )( t − ,m , t − ,m )( t ,m , t − ,m )( t ,m + k − , t − ,m ) ( t − ,m , t ,m ) ( t +1 ,m , t ,m + k − d +1 ,m d − ,m d − ,m d +2 ,m Figure 1: A rectangle of the tessellation { R m } sm =1 for d = 2 and k = 4 For d = 2 and k = 4 , a rectangle of the tessellation is depicted in Figure1. We present our main result, that shows that trend filtering leads to anadaptive estimation of the d -dimensional margin f N ⊥ k of f . Theorem 5.2 (Adaptivity of trend filtering) . Fix k ∈ { , , , } and choose x, t > . Let g ∈ R n × ... × n d be arbitrary. Let S be an arbitrary subset ofsize s := | S | of × i ∈ [ d ] [ k + 1 + ( k + 2) k : n i − k + 1 − ( k + 2) k ] . For a largeenough constant C > that only dependends on k , choose λ ≥ C d vuut d X i =1 (cid:18) d i, max ( S ) n i (cid:19) k − λ ( t ) . hen, with probability at least − e − x − e − t , it holds that k ( ˆ f − f ) N ⊥ k k /n ≤ k g − f N ⊥ k k /n + 4 λ k ( D k g ) − S k + 2 σ n (cid:16) √ x + √ ks (cid:17) + O λ d X i =1 log( ed i, max ( S )) ! s X m =1 X z ∈{− , + } d (cid:18) nd zm (cid:19) k − . In particular the constraint on C is C ≥ k k − a with a = , k = 1 , √ / ≈ . , k = 2 , √ / ≈ . , k = 3 , . , k = 4 , as min i ∈ [ d ] min m ∈ [ s ] min { d − i,m , d + i,m } → ∞ .Proof. See Subsection 5.7.By choosing x ≍ t ≍ log n in Theorem 5.2 and by constraining theactive set S to be a regular grid we retrieve Theorem 3.1. In that case,since S is a regular grid, we can choose λ ≍ s − k − d λ (log(2 n )) and theoracle inequality has the rate O s k ( d − d n log( n/s ) log n . Remark (The role of the hyperrectangular tessellation) . Given an activeset S , the choice of a hyperrectangular tessellation in Theorem 5.2 can beseen as arbitrary. We introduce some quantities on which Theorem 5.2 relies: the bound onthe antiprojections ˜ v , the inverse scaling factor ˜ γ , the noise weights v , asign configuration q and the effective sparsity Γ D k .Let ˜ S be the enlarged active set induced by some active set S . Let P ˜ S be the orthogonal projection operator on span( { ˜ φ kj ,...,j d } ( j ,...,j d ) ∈ ˜ S ) . Definition 5.3 (Bound on the antiprojections) . A bound on the antiprojec-tions is a tensor ˜ v ∈ R ( n − k ) × ... × ( n d − k ) such that ˜ v j ,...,j d ≥ k (I − P ˜ S ) ˜ φ kj ,...,j d k / √ n, ∀ ( j , . . . , j d ) ∈ × i ∈ [ d ] [ k + 1 : n i ] . ˜ v be a bound on the antiprojections. Definition 5.4 (Inverse scaling factor) . The inverse scaling factor ˜ γ ∈ R isdefined as ˜ γ := k ˜ v − ˜ S k ∞ . Let ˜ v be a bound on the antiprojections and ˜ γ the corresponding inversescaling factor. Definition 5.5 (Noise weights) . The noise weights v ∈ R ( n − k ) × ... × ( n d − k ) aredefined as v ≥ ˜ v/ ˜ γ ∈ [0 , ( n − k ) × ... × ( n d − k ) . We can now introduce the effective sparsity. The effective sparsity de-pends on a so-called “sign configuration”, that is, on the sign pattern asso-ciated with the jump locations.
Definition 5.6 (Sign configuration) . Let q ∈ [ − , ( n − k ) × ... × ( n d − k ) be s.t. q j ,...,j d ∈ {− , +1 } , ( j , . . . , j d ) = t m ∈ S, { q t m } , ( j , . . . , j d ) ∈ × i ∈ [ d ] [ t i,m : t i,m + k − , m ∈ [ s ] , [ − , , ( j , . . . , j d ) / ∈ ˜ S. We call q S ∈ {− , , } ( n − k ) × ... × ( n d − k ) a sign configuration. The basic definition of effective sparsity depends on the sign config-uration associated with S . One can however remove this dependence bydefining the effective sparsity as the maximum over all sign configurations. Definition 5.7 (Effective sparsity) . Let an active set S , a sign configuration q S and noise weights v be given. The effective sparsity Γ D k ( S, v − S , q S ) ∈ R is defined as Γ D k ( S, v − S , q S ) :=:= max ( s X m =1 ( q S ) t m ( D k f ) t m − k (1 − v ) − S ⊙ ( D k f ) − S k : k f k /n = 1 ) . Moreover we write Γ D k ( S, v − S ) := max q S Γ D k ( S, v − S , q S ) . By the adaptive bound of Theorem 2.2 in Ortelli and van de Geer [14](see also Theorem 2.1 in Ortelli and van de Geer [15] and Theorem 16 inOrtelli and van de Geer [16] modified with an enlarged active set), we knowthat bounding the effective sparsity is a sufficient condition for provingadaptation of ˆ f N ⊥ k . 19 .3 Effective sparsity via interpolating tensors To bound the effective sparsity we extend the technique by Ortelli andvan de Geer [14] involving interpolating vectors to interpolating tensors,i.e., tensors that interpolates the signs of the jumps.
Definition 5.8 (Interpolating tensor) . Let q S ∈ {− , , } ( n − k ) × ... × ( n d − k ) be a sign configuration and v ∈ [0 , ( n − k ) × ... × ( n d − k ) be a tensor of noiseweights. The tensor w ( q S ) ∈ R ( n − k ) × ... × ( n d − k ) is called an interpolatingtensor for the sign configuration q S and the weights v if it has the followingproperties: • w j ,...,j d ( q S ) = ( q S ) t m , ∀ ( j , . . . , j d ) ∈ × i ∈ [ d ] [ t i,m : t i,m + k − , ∀ m ∈ [ s ] , • | w j ,...,j d ( q S ) | ≤ − v j ,...,j d , ∀ ( j , . . . , j d ) ∈ ( × i ∈ [ d ] [ k : n i ]) \ ˜ S . With the help of an interpolating tensor we can bound the effectivesparsity, as the following lemma shows (Lemma 2.4 by Ortelli and van deGeer [14] in tensor form).
Lemma 5.9 (Bounding the effective sparsity with an interpolating tensor) . We have Γ D k ( S, v − S , q S ) ≤ n min w ( q S ) k ( D k ) ′ w ( q S ) k where the minimum is over all interpolating tensors w ( q S ) for the signconfiguration q S .Proof. It holds that s X m =1 ( q S ) t m ( D k f ) t m − k (1 − v ) − S ⊙ ( D k f ) − S k ≤ s X m =1 ( q S ) t m ( D k f ) t m − k w ( q S ) − S ⊙ ( D k f ) − S k ≤ n ,...,n d X ,..., w ( q S ) j ,...,j d ( D k f ) j ,...,j d = n ,...,n d X ,..., (( D k ) ′ w ( q S )) j ,...,j d f j ,...,j d ≤ √ n k ( D k ) ′ w ( q S ) k k f k / √ n. .4 Requirements on an interpolating tensor Theorem 5.2 follows by a bound on the effective sparsity obtained byLemma 5.9 with the help of an interpolating tensor. In the definition of aninterpolating tensor (cf. Definition 5.8), there is a constraint posed by thenoise weights v .Therefore, we now calculate in Subsection 5.5 a bound on the antipro-jections ˜ v to derive an appropriate inverse scaling factor ˜ γ and noise weights v . In this way we will make explicit the constraints that the interpolatingtensor has to satisfy in the specific case of tensor denoising with trendfiltering.After that, we will show in Subsection 5.6 an explicit form for the in-terpolating tensor for k = { , , , } and derive the corresponding boundon the effective sparsity.That bound on the effective sparsity combined with the fact that theinterpolating tensor used indeed is an interpolating tensor for trend filteringwill allow us to derive Theorem 5.2 from Theorem A.1. We start by finding a bound on the antiprojections ˜ v .Define, for m ∈ [ s ] and i ∈ [ d ] , ˜ v i,m ( j i ) = (cid:18) t i,m − j i n i (cid:19) k − , j i ∈ R − i,m = [ t − i,m : t i,m ] , , j i ∈ R i,m = [ t i,m : t i,m + k − , (cid:18) j i − t i,m − k + 1 n i (cid:19) k − , j i ∈ R + i,m = [ t i,m + k − t + i,m ] . Moreover, for ( j , . . . , j d ) ∈ R m we define ˜ v j i ,...,j d := vuut d X i =1 ˜ v i,m ( j i ) . Lemma 5.10 (A valid bound on the antiprojections) . For all ( j , . . . , j d ) ∈ R m and for all m ∈ [ s ] it holds that k A ˜ S ˜ φ kj ,...,j d k /n ≤ ˜ v j i ,...,j d , i.e., the tensor ˜ v ∈ R ( n − k ) × ... × ( n d − k ) is a valid bound on the antiprojections.Proof. See Appendix C.1. 21efine, for m ∈ [ s ] and i ∈ [ d ] , v i,m ( j i ) = t i,m − j i d − i,m ! k − , j i ∈ R − i,m = [ t − i,m : t i,m ] , , j i ∈ R i,m = [ t i,m : t i,m + k − , j i − t i,m − k + 1 d + i,m ! k − , j i ∈ R + i,m = [ t i,m + k − t + i,m ] . Moreover, for a constant C = C ( k ) ≥ we define for ( j , . . . , j d ) ∈ R m and m = 1 , . . . , s v j i ,...,j d := 1 d d X i =1 v i,m ( j i ) C (2)and ˜ γ = Cd vuut d X i =1 (cid:18) d i, max ( S ) n i (cid:19) k − . Lemma 5.11 (Valid noise weights) . For all m ∈ [ s ] and for all ( j , . . . , j d ) ∈ R m it holds that ˜ v j i ,...,j d ≤ v j i ,...,j d ˜ γ, i.e., the tensor v ∈ R ( n − k ) × ... × ( n d − k ) as defined in Equation (2) definesvalid noise weights.Proof. See Appendix C.2.The constant C ≥ in Equation (2) can be chosen aribtrarily. Choosinga larger C makes the noise weights smaller. As a result, the requirementsimposed on the interpolating tensor by the noise weights become weaker. We now define an interpolating tensor w = w ( q ) for any sign configuration q S . For ( j , . . . 
, j d ) ∈ R m , m ∈ [ s ] and the same constant C = C ( k ) > asin the definition of the noise weights in Equation (2), define the tensor w j ,...,j d ( q S ) := 1 d d X i =1 d Y l =1 w l,i,m ( j l ) , (3)22here, w l,i,m ( j l ) = q t m , j l ∈ R l,m , l = i,w l,i,m ( j l ) ∈ [0 , q t m ] , j l ∈ R − l,m ∪ R + l,m , l = i,w i,i,m ( j i ) = q t m , j i ∈ R i,m , l = i,w i,i,m ( j i ) ≤ q t m (1 − v i,m ( j i ) /C ) , j i ∈ R − l,m ∪ R + l,m , l = i. What differentiates the case l = i is that w i,i,m has to satisfy the require-ments imposed by the noise weights. For l = i the only constraint imposedis that | w l,i,m | ≤ . The tensor w is a sum of terms with product structureif constrained to the set of indices of a hyperrectangle R m .We define w − l,i,m := { w l,i,m ( j l ) } j l ∈ R − i,m and w + l,i,m := { w l,i,m ( j l ) } j l ∈ R + i,m . Lemma 5.12 (A valid interpolating tensor) . For any given sign configuration q S , the tensor w = w ( q S ) defined in Equation (3) is a valid interpolatingtensor.Proof. See Appendix C.3.
We now want to find the explicit form of an appropriate interpolating tensor w , to apply in Lemma 5.9. We first consider continuous versions ω ( x ) ,respectively. w( x ) , of the vectors w − i,i,m and w + i,i,m , respectively. w − l,i,m and w + l,i,m for l = i , on a mock interval x ∈ [0 , . We then set w − i,i,m ( j i ) := ω t i,m − j i d − i,m ! , j i ∈ R − i,m ,w + i,i,m ( j i ) := ω j i − t i,m − k + 1 d + i,m ! , j i ∈ R + i,m ,w − l,i,m ( j l ) := w t l,m − j l d − l,m ! , j l ∈ R − l,m ,w + l,i,m ( j l ) := w j l − t l,m − k + 1 d + l,m ! , j l ∈ R + l,m , for i ∈ [ d ] , l = i, m ∈ [ s ] .We aim to find a form of ω and w giving place to continuous functionswith k − continuous derivatives and piecewise constant k th derivative.23oreover, these functions have to be interpolating between the jump loca-tion ( x = 0 ) and the border ( x = 1 ). We guarantee that they interpolatethe signs of the jumps by restricting to polynomials with ω (0) = 1 , ω (1) = 0 , w(0) = 1 , w(1) = 0 , w( x ) = 1 − w(1 − x ) , x ∈ [0 , . The discretized version of these polynomials will vanish at the boundariesof the hyperrectangles while it will have the value at the indices belongingto the enlarged active set ˜ S , guaranteeing the interpolation of the signs ofthe jump locations. Moreover, we will have to choose the constant C > in Equation (2) such that the noise weights are made small enough for theinterpolating polynomial to satisfy the conditions of Lemma 5.12.To obtain interpolating polynomials ω and w , we split the interval [0 , into an adequate number of subintervals. We then choose ω and w to bemade of polynomial pieces of order at most k . The exception is the firstsubinterval [0 , x ] , x ∈ (0 , for ω , where we choose ω ( x ) = 1 − a x k − .We then find the explicit values of the coefficients of the polynomials byderivatives matching, as in Ortelli and van de Geer [14]. More details onderivatives matching are given in the Appendix C.4.To guarantee that ω and w can give place to interpolating tensors, onehas to check that derivative matching renders a piecewise polynomial whichis monotone. Monotonicity combined with the constraints ω (0) = w(0) = 1 and ω (1) = w(1) = 0 ensures that | ω ( x ) | ≤ and | w( x ) | ≤ .Monotone interpolating polynomials ω and w and a large enough C in the tuning parameter are sufficient conditions for a valid interpolatingtensor. In particular, given that ω is monotone, we require that C ≥ k k − /a as min i ∈ [ d ] min m ∈ [ s ] min { d − i,m , d + i,m } → ∞ . (4)Note that for the construction of w , we do not have any constraint given bythe antiprojections ˜ v , the noise weights v and the inverse scaling factor ˜ γ .Therefore, we can take the dependence on x k instead of x k − . This savesa logarithmic term, not visible in Lemma 5.13, which only contains thelogarithmic terms stemming from ω . Indeed, as Lemma C.1 in AppendixC.5 shows, partial integration of a k th -order polynomial does not incur inlog terms, while partial integration of x k − does so. We have to choosethe worse dependence on x k − for ω though, because ω has to respect theconstraints posed by the noise weights.24 .6.2 Show a bound on the effective sparsity We now show a bound on the effective sparsity, using a “candidate” in-terpolating tensor generated from the discretizations of ω and w whoseconstruction has been exposed above. We call it “candidate” interpolatingtensor because we have not yet shown that ω and w are monotone. 
For themoment we assume that matching derivatives renders monotone ω and w .We check the monotonicity for k = { , , , } in the next subsection.To make the notation and the computation steps lighter, we neglect theconstants and use the order notation O instead.Since the sign configuration q S is typically unknown, we focus on findingan upper bound on the effective sparsity that does not depend on the signconfiguration q S . Thus, the bound also accommodates for the worst-casesign configuration. Lemma 5.13 (Effective sparsity for trend filtering) . Take the interpolatingvector w as defined in Equation (3) . Choose the vectors w − i,i,m and w + i,i,m ,respectively w − l,i,m and w + l,i,m for l = i , to be discretized versions of ω ( x ) and w( x ) as in Subsubsection 5.6.1. Assume that ω ( x ) and w( x ) obtainedby derivative matching are monotone.For such an interpolating vector w , it holds that Γ D ( S, v − S ) = O d X i =1 log( ed i, max ( S )) ! s X m =1 X z ∈{− , + } d (cid:18) nd zm (cid:19) k − Proof.
See Appendix C.6.From Lemma C.1 and the matching of discrete derivatives, it followsthat, if ω and w are monotone and C is chosen large enough Γ D ( S, v − S ) = O d X i =1 log( ed i, max ( S )) ! s X m =1 X z ∈{− , + } d (cid:18) nd zm (cid:19) k − . If the active set S defines a regular grid we therefore have a bound on theeffective sparsity of order Γ D ( S, v − S ) = O (cid:16) s k log( n/s ) (cid:17) . It only remains to check the monotonicity of ω and w . We will do this for k = { , , , } . One can check monotonicity for higher values of k by solving(for instance at the computer) the appropriate system of equations and, say,graphically visualizing the result. We check monotonicity analytically for k = { , , } and computationally for k = 4 .25 .6.3 Interpolating tensor for k = 1 For k = 1 ω ( x ) = 1 − √ x, x ∈ [0 , . and w( x ) = 1 − x, x ∈ [0 , . Both ω and w are monotone. k = 2 For k = 2 ω ( x ) = − √ x / , x ∈ [0 , / ,
127 (1 − x ) , x ∈ [1 / , , and w( x ) = − x , x ∈ [0 , / , (cid:18) − x (cid:19) + 12 , x ∈ [1 / , / . Both ω and w are monotone. k = 3 For k = 3 ω ( x ) = − √ x / , x ∈ [0 , / , x − x + 25576 x + 145228 , x ∈ [1 / , / , − x ) , x ∈ [2 / , , and w( x ) = − x , x ∈ [0 , / , − (cid:18) − x (cid:19) + 2 (cid:18) − x (cid:19) + 12 , x ∈ [1 / , / . Both ω and w are monotone. 26 .6.6 Interpolating tensor for k = 4 For k = 4 ω ( x ) = − . x / , x ∈ [0 , / , . x − . x + 12 . x − . x + 1 . , x ∈ [1 / , / , − . x + 78 . x − . x + 26 . x − . , x ∈ [1 / , / , . − x ) , x ∈ [3 / , , and w( x ) = − . x , x ∈ [0 , / , x − . x + 7 . x − . x + 1 . , x ∈ [1 / , / , − . (cid:18) − x (cid:19) + 2 . (cid:18) − x (cid:19) + 12 , x ∈ [1 / , / . Both ω and w are monotone. Theorem 5.2 follows by combining Theorem A.1 with a bound on the effec-tive sparsity.Lemma 5.13 uses Lemma 5.9 to give us a bound on the effective sparsityholding for all sign configurations. This bound is based on a specific formof the interpolating tensor, obtained by derivative matching as explained inSubsection 5.6.1. The interpolating tensor obtained by derivative match-ing is valid if the monotonicity of ω and w is guaranteed. In Subsections5.6.3-5.6.6 we check that the interpolating tensors obtained by derivativematching for k = { , , , } satisfy the monotonicity requirement.There is also a constraint posed by the noise weights and by the constant C to satisfy. Lemma 5.10 gives a valid bound on the antiprojections. If wechoose ˜ γ = Cd vuut d X i =1 (cid:18) d i, max ( S ) n i (cid:19) k − the noise weights given in Equation (2) are valid noise weights, accordingto Lemma 5.11. By Lemma 5.12, interpolating tensors of the form givenin Equation (3) are valid interpolating tensors. The tensor obtained bythe discretization of the result of derivative matching has such a form (as min m ∈ [ s ] min i ∈ [ d ] min { d − i,m , d + i,m } → ∞ ).According to Equation (4) in Subsubsection 5.6.1 one has to choose C ≥ k k − /a as min i ∈ [ d ] min m ∈ [ s ] min { d − i,m , d + i,m } → ∞ . a are given in Subsubsections 5.6.3-5.6.6.Theorem 2.2 by Ortelli and van de Geer [14], on which Theorem A.1 isbased, uses a bound on the increments of empirical process { ǫ ′ f, f ∈ R n } ,where ǫ has i.i.d. entries. Theorem A.1 involves in the background anempirical process, whose increments are given by n ,...,n d X ,..., ( ǫ N ⊥ k ⊙ f ) j ,...,j d , f ∈ R n × ... × n d Note that the entries of ǫ N ⊥ k = P N ⊥ k ǫ are correlated. However, by the idem-potence of orthogonal projections, we can work with uncorrelated errors andinstead restrict to tensors f N ⊥ k ∈ N ⊥ k . Indeed P n ,...,n d ,..., ( ǫ N ⊥ k ⊙ f ) j ,...,j d = P n ,...,n d ,..., ( ǫ ⊙ f N ⊥ k ) j ,...,j d . This allows us to take over the arguments ofTheorem 2.2 by Ortelli and van de Geer [14]. (cid:3) Remark (The influence of the dimensionality) . If we choose λ ≍ ˜ γλ ( t ) ,the rate of the oracle inequality is ˜ γ P sm =1 P z ∈{− , + } d ( n/d zm ) k − /n , up tologarithmic factors. For simplicity, let S define a regular grid. Then the(hyper-)volume of one of the s hyperrectangles of the tessellation scales as d zm ≍ n/s . Hence the scaling P sm =1 P z ∈{− , + } d ( n/d zm ) k − ≍ s k . However ˜ γ , the maximal length of an antiprojection, scales as ˜ γ ≍ ( s − d ) k − , where s − d ≍ d i, max /n i is proportional to the side length of a hyperrectangle of thetessellation, up to the exponent k − . 
The influence of the dimensionalityin the exponent of s is a consequence of the different scaling of volume andside length of a hyperrectangle in d -dimensions. The (hyper-)volume scalesas s − while the side length scales as s − d . The reason for this discrep-ancy is that we are not able to find an upper bound for the noise weightsproportional the volume of the hyperrectangles, i.e., to the product of sidelengths. The bound we obtain involves rather the sum of side lengths. ℓ -rates Theorem 3.2 about not-so-slow rates for trend filtering is based on The-orem A.2, where the choice of the active set S is aribtrary. The cri-terion guiding choice of S is to get an “as small as possible” value ofthe inverse scaling factor ˜ γ . Recall that the inverse scaling factor ˜ γ isthe maximal length of the antiprojection of a dictionary atom ˜ φ kj ,...,j d onto the set of dictionary atoms indexed by the active set S , that ist ˜ γ ≥ max ( j ,...,j d ) ∈ [ n ] × ... × [ n d ] k A S ˜ φ kj ,...,j d k / √ n .28he active set S could be chosen as a regular grid parallel to the coordi-nate axes. However, we will show that we can shorten the maximal lengthof the antiprojections by choosing an active set defining a so-called “meshgrid”, whose construction we illustrate hereafter. Let δ ∈ N . For a coordinate i ∈ [ d ] , we define the set of indices Z i ( l ) suchthat Z i ( l ) = { δ dl equispaced indices in [ n i ] } , l ∈ [ d ] and Z i (1) ⊇ Z i (2) ⊇ . . . ⊇ Z i ( d ) . If, for any l ∈ [ d ] , n i is not a multiple of δ dl , we relax the requirement onthe indices to be approximately equispaced, i.e., the distance between allthe indices has to be asymptotically of the same order. For i ∈ [ d ] , we alsodefine ˜ Z i ( l ) = k − [ h =0 { Z i ( l ) + h } , l ∈ [ d ] . Let now ( l , . . . , l d ) ∈ [ d ] d be a tuple of indices. We define the set S := { ( l , . . . , l d ) ∈ [ d ] d : |{ i ∈ [ d ] : l i ≤ z }| ≤ z, ∀ z ∈ [ d ] } . Definition 6.1 (Mesh grid) . A mesh grid S is defined as S := [ ( l ,...,l d ) ∈S (cid:16) × i ∈ [ d ] Z i ( l i ) (cid:17) . Figure 2 illustrates a mesh grid for d = 2 .We now want to enlarge a mesh grid S to allow us to handle k th -ordertrend filtering for k > . Definition 6.2 (Enlarged mesh grid) . An enlarged mesh grid ˜ S is defined as ˜ S := [ ( l ,...,l d ) ∈S (cid:16) × i ∈ [ d ] ˜ Z i ( l i ) (cid:17) . Figure 3 illustrates an enlarged mesh grid for d = 2 and k = 2 .Let s := | S | and ˜ s := | ˜ S | . It holds that s ≍ ˜ s ≍ Q di =1 δ di ≍ δ dH ( d ) ,where H ( d ) = P di =1 /i is the d th harmonic number. Therefore δ ≍ s dH ( d ) .29igure 2: Mesh grid for d = 2 . Figure 3: Enlarged mesh grid for d =2 and k = 2 . ˜ S is an enlarged mesh grid We will now show that we can find a smaller bound on the inverse scalingfactor if we choose ˜ S to be an enlarged mesh grid rather than an enlargedregular grid. Lemma 6.3 (Inverse scaling factor when ˜ S is an enlarged mesh grid) . Let n ≍ . . . ≍ n d and ˜ S be an enlarged mesh grid. It holds that ˜ γ ( ˜ S ) = O (cid:18) s − k − H ( d ) (cid:19) . Proof.
See Appendix D.1.
Theorem 3.2 follows from Theorem A.2. Theorem A.2 is allowed to havecorrelated errors for the same reasons as Theorem A.1 is, see the proof ofTheorem 5.2 in Subsection 5.7.In Theorem A.2 we set x ≍ t ≍ log n . We can then choose the freeparameters S and g ∈ R n × ... × n d independently of each other. Rememberthat the normalization of the TV is included in the definition of the analysisoperator D k . Therefore it is natural to restrict the choice of g to the class { f : k D k f k = O (1) } .We can then choose S to trade off the terms ˜ γλ (log n ) ≍ ˜ γ log / ( n ) /n / and s/n . Typically, if we require S to have a regular structure, we obtain30 γ = O ( s − h ) for some h = h ( d, k ) ∈ R . The tradeoff is achieved by choosing s ≍ n h ) log h ) n and gives the slow rate sn ≍ n − h ) log h ) n. We choose the active set to be an enlarged mesh grid ˜ S . Then, byLemma 6.3, we can choose ˜ γ = O (cid:18) s − k − H ( d ) (cid:19) and the claim follows. (cid:3) Remark (Mesh grids vs. regular grids) . If we choose a regular grid as activeset, according to Lemma 5.11 we obtain ˜ γ ≍ s − k − d and a slow rate n − d +2 k − d +2 k − log d d +2 k − ( n ) , which is slower than the rate obtained with an active set defining a meshgrid. Indeed, for all d ≥ it holds that H ( d ) ≤ d .In both cases, the slow rate for fixed k goes to n − / log / ( n ) as d → ∞ .If d is fixed, the slow rates goes to n − as k → ∞ . Remark (Allow λ to depend on TV k ) . In the proof of Theorem 3.2 wecan also drop the restriction g ∈ { f : k D k f k = O (1) } and trade off ˜ γ log / ( n )TV k ( g ) /n / and s/n . The tradeoff results in the choice λ ≍ n − H ( d )+2 k − H ( d )+2 k − log H ( d )2 H ( d )+2 k − ( n ) TV k ( g ) − k − H ( d )+2 k − and gives the rate n − H ( d )+2 k − H ( d )+2 k − log H ( d )2 H ( d )+2 k − ( n ) TV k ( g ) H ( d )2 H ( d )+2 k − . In the previous sections we have shown how to estimate f N ⊥ k by trendfiltering and have established fast adaptive ℓ -rates and not-so-slow ℓ -rates.There is still an open question: how to estimate f N k ?If n ≍ . . . ≍ n d , the dimension of N k is of order n − /d . Estimating f N k by least squares would result in a rate of order n − /d and therefore belimiting for d ≥ .The approach we take is to decompose N k into lower dimensional mutu-ally orthogonal linear spaces, the so-called marginal linear spaces, to whichwe can apply a lower dimensional version of trend filtering.31et P [ d ] denote the power set of [ d ] := { , . . . , d } . We consider sets ofcoordinate indices M ⊆ [ d ] .The intuition behind the decomposition into margins is to partition theset of tensor indices into d subsets as [1 : k ] ∪ [ k + 1 : n ] × . . . × [1 : k ] ∪ [ k + 1 : n ] . For M ∈ P [ d ] define the set of indices I kM = × i ∈ M [ k + 1 : n i ] × i M [1 : k ] . We moreover define the linear spaces M ( M ) = span { ˜ φ kk ,...,k d , ( k , . . . , k d ) ∈ I kM } , M ∈ P [ d ] . Note that in one dimension, { ˜ φ kj } j ∈ [ k ] and { ˜ φ kj } j ∈ [ k +1: n ] are orthogonalto each other. Moreover, M △ M ′ = ∅ , for M = M ′ ∈ P [ d ] . Because of theproduct structure of the dictionary atoms spanning M this means that any M ( M ) and M ( M ′ ) are mutually orthogonal, for M = M ′ .The mutually orthogonal marginal linear subspaces {M ( M ) } M ∈P [ d ] par-tition R n × ... × n d . The dimension of M ( M ) is given by k d −| M | Q i ∈ M ( n i − k ) .By the multi-binomial theorem it holds that d Y i =1 n i = X M ∈P [ d ] k d −| M | Y i ∈ M ( n i − k ) for k ∈ [0 : min l ∈ [ d ] n l − . 
This means that X M ∈P [ d ] dim( M ( M )) = n and because {M ( M ) } M ∈P [ d ] are mutually orhtogonal it follows that theyalso partition R n × ... × n d .We can further partition any M ( M ) into k d −| M | mutually orthogonalsubspaces M ( M, h ) , h ∈ [1 : k ] d −| M | .The partition results by defining the set of indices I kM,h := × i ∈ M [ k + 1 : n i ] × i M { h i } and the linear subspaces M ( M, h ) := span { ˜ φ kk ,...,k d , ( k , . . . , k d ) ∈ I kM,h } . Again, {M ( M, h ) } h ∈ [ k ] d −| M | ,M ∈P [ d ] are mutually orthogonal and partition R n × ... × n d . 32 efinition 7.1 (ANOVA decomposition) . The decomposition of a tensor f as f = X M ∈P [ d ] X h ∈ [1: k ] d −| M | f M ( M,h ) is called ANOVA decomposition. By orthogonality we have that k f k = X M ∈P [ d ] X h ∈ [1: k ] d −| M | k f M ( M,h ) k . Our aim is to apply a lower dimensional version of trend filtering to estimate f M ( M,h ) , for M = ∅ . For M = ∅ it holds that | I kM = ∅ | = k d = O (1) . Wewill therefore estimate f M ( ∅ ,h ) by the least squares estimate Y M ( ∅ ,h ) at theparametric rate n − .To apply a lower dimensional version of trend filtering to estimate f M ( M,h ) we first need to reinterpret f M ( M,h ) as a | M | -dimensional tensor.We then need to justify why we can apply Theorems A.1 and A.2 whichrequire iid errors and are at the base of the adaptive rates by Theorem 5.2and the not-so-slow rates by Theorem 3.2.By writing f M ( M,h ) = ¯ f M ( M,h ) × × i M ˜ φ kh i , ¯ f M ( M,h ) ∈ R × i ∈ M n i we can interpret f M ( M,h ) as a M -dimensional object.Similarly, we can write Y M ( M,h ) = ¯ Y M ( M,h ) × × i M ˜ φ kh i , ¯ Y M ( M,h ) ∈ R × i ∈ M n i . Let n M := Q i ∈ M n i . Because of the (partial) product structure of f M ( M,h ) and since k ˜ φ kh i k = n i , h i ∈ [ k ] (cf. Definition 4.1), it holds that k f M ( M,h ) k /n = k ¯ f M ( M,h ) k /n M . Thanks to the above equation and to the ANOVA decomposition wecan add up the rates of estimation of the margins to estimate the wholetensor. 33 .2 The estimator for the lower-dimensional margins
7.2 The estimator for the lower-dimensional margins

For M ∈ P[d] \ {∅} define

D^k_M := n_M^{k-1} ∏_{i ∈ M} D^k_i.

To estimate the whole tensor, we consider the estimator

f̂ = Σ_{M ∈ P[d]} Σ_{h ∈ [k]^{d-|M|}} f̂_{M(M,h)},

where

f̂_{M(M,h)} = f̂̄_{M(M,h)} × ×_{i ∉ M} φ̃^k_{h_i},  f̂̄_{M(M,h)} ∈ R^{×_{i ∈ M} n_i}.

We define

f̂_{M(∅,h)} := Ȳ_{M(∅,h)} (×_{i ∈ [d]} φ̃^k_{h_i}),  ∀ h ∈ [k]^d,

and

f̂̄_{M(M,h)} := arg min_{f ∈ R^{×_{i ∈ M} n_i}} { ‖Ȳ_{M(M,h)} - f‖²/n_M + 2 λ_{M,h} ‖D^k_M f‖_1 },

where {λ_{M,h} > 0, h ∈ [k]^{d-|M|}, M ∈ P[d] \ {∅}} are positive tuning parameters. We call ‖D^k_M f̄_{M(M,h)}‖_1 the k-th order |M|-dimensional Vitali total variation and f̂̄_{M(M,h)} the |M|-dimensional trend filtering estimator.

Remark (We can apply Theorems A.1 and A.2). For f̄ ∈ R^{×_{i ∈ M} n_i} it holds that

ε̄_{M(M,h)} ⊙ f̄ = ε̄_{M(M,h)} ⊙ f̄_{M(M,h)} = Σ_{M' ⊆ M, h'_M = h} ( ε̄_{M(M',h')} × ×_{i ∈ M \ M'} φ̃^k_{h_i} ) ⊙ f̄_{M(M,h)}.

The n_M entries of the tensor Σ_{M' ⊆ M, h'_M = h} ( ε̄_{M(M',h')} × ×_{i ∈ M \ M'} φ̃^k_{h_i} ) are the coefficients of the projection of ε onto the linear space ×_{i ∉ M} φ̃^k_{h_i} × ×_{i ∈ M} R^{n_i} and as such have i.i.d. N(0, σ² n_M/n)-distributed entries. We can therefore apply Theorems A.1 and A.2 with noise variance σ² n_M/n.

Remark (Synthesis form for the estimator of the lower-dimensional margins). The synthesis form of the estimator for the margins can be obtained in a similar way as for the d-dimensional margin (cf. Section 4).
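Each margin-wise problem above is an ordinary ℓ^1-analysis (trend filtering) program and can be handed to a generic convex solver. The one-dimensional sketch below is only an illustration: the helper trend_filter_1d is hypothetical, D^k is taken to be the plain k-th order difference matrix, and the paper's normalization of the Vitali TV is assumed to be absorbed into the tuning parameter lam.

```python
import numpy as np
import cvxpy as cp

def trend_filter_1d(y, k, lam):
    """Sketch: argmin_f ||y - f||_2^2 / n + 2 * lam * ||D^k f||_1,
    with D^k the plain k-th order difference matrix (normalization folded into lam)."""
    n = len(y)
    D = np.eye(n)
    for _ in range(k):
        D = np.diff(D, axis=0)          # k repeated first-order differences
    f = cp.Variable(n)
    objective = cp.sum_squares(y - f) / n + 2 * lam * cp.norm1(D @ f)
    cp.Problem(cp.Minimize(objective)).solve()
    return f.value

# toy example: noisy piecewise linear signal, k = 2
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.minimum(2 * x, 1.5 - x) + 0.1 * rng.standard_normal(x.size)
fhat = trend_filter_1d(y, k=2, lam=1e-3)
```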
8 Denoising the whole tensor

We now put together the results from Sections 5 and 6 with the ANOVA decomposition given in Section 7 to show adaptivity and not-so-slow rates for the estimation of the whole tensor.
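Before stating the results, a small computation may help to compare the not-so-slow exponents as reconstructed above: (H(d)+2k-1)/(2H(d)+2k-1) for mesh grids versus (d+2k-1)/(2d+2k-1) for regular grids, logarithmic factors ignored. Since H(d) ≤ d and the exponent is decreasing in this quantity, mesh grids always give the faster rate; the values printed below are purely illustrative.

```python
from fractions import Fraction

def harmonic(d):
    # d-th harmonic number H(d) = 1 + 1/2 + ... + 1/d
    return sum(Fraction(1, i) for i in range(1, d + 1))

def slow_rate_exponent(d, k, mesh=True):
    # exponent e in the not-so-slow rate n^{-e}, logarithmic factors ignored
    g = harmonic(d) if mesh else Fraction(d)
    return float((g + 2 * k - 1) / (2 * g + 2 * k - 1))

for d in (1, 2, 3, 5, 10):
    print(d, round(slow_rate_exponent(d, 2, mesh=True), 4),
          round(slow_rate_exponent(d, 2, mesh=False), 4))
```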
We fix k ∈ {1, 2, 3, 4}. By S_{M,h} we denote a subset of I^k_{M,h} satisfying the conditions for a hyperrectangular tessellation suitable for derivative matching. By d^z_m(S_{M,h}) we denote an analogue of the quantity d^z_m appearing in Theorem 5.2, but defined on the hyperrectangular tessellation of I^k_{M,h} generated by the enlarged version S̃_{M,h} of the active set S_{M,h}.

The following theorem for denoising the whole tensor Y by means of trend filtering holds.

Theorem 8.1 (Adaptivity of tensor denoising with trend filtering). Choose x, t > 0. Let g ∈ R^{n_1 × ... × n_d} be arbitrary. For M ≠ ∅ and a large enough constant C > 0 depending only on k, choose

λ_{M,h} ≥ C^{|M|} √( Σ_{i ∈ M} (d_{i,max}(S_{M,h})/n_i)^{2k-1} ) λ_0(t + d log(k+1)).

Then, with probability at least 1 - e^{-x} - e^{-t}, it holds that

‖f̂ - f‖²/n ≤ ‖g - f‖²/n + 4 Σ_{M ∈ P[d]\∅, h ∈ [k]^{d-|M|}} λ_{M,h} ‖(D^k_M g)_{-S_{M,h}}‖_1
+ 2σ²/n ( √(x + d log(k+1)) + √(k^d) )²
+ Σ_{M ∈ P[d]\∅, h ∈ [k]^{d-|M|}} σ²/n ( √(x + d log(k+1)) + √(k s_{M,h}) )²
+ O( Σ_{M ∈ P[d]\∅, h ∈ [k]^{d-|M|}} λ²_{M,h} ( Σ_{i ∈ M} log(e d_{i,max}) ) Σ_{m=1}^{s_{M,h}} Σ_{z ∈ {-,+}^{|M|}} (n_M/d^z_m)^{k-1} ).

In particular the constraint on C is C ≥ k k − a with a = , k = 1, √ / ≈ . , k = 2, √ / ≈ . , k = 3, . , k = 4, as

min_{i ∈ M} min_{m ∈ [s_{M,h}]} min_{h ∈ [k]^{d-|M|}} min{d^-_{i,m}(S_{M,h}), d^+_{i,m}(S_{M,h})} → ∞.

Proof.
The result follows by the ANOVA decomposition. In total there are (k+1)^d margins. As a consequence of the union bound, the result for the estimation of the whole tensor is attained with probability at least 1 - e^{-t} - e^{-x} if, in the application of Theorem 5.2, one chooses x + d log(k+1) and t + d log(k+1) instead of x and t, for some x, t > 0.

If the S_{M,h} are chosen to be regular grids and the tuning parameters are chosen as

λ_{M,h} ≍ √( log n / (s_{M,h}^{(2k-1)/|M|} n) ),

then the rate of Theorem 8.1 is

O( (log n / n) Σ_{M ∈ P[d]\∅, h ∈ [k]^{d-|M|}} s_{M,h}^{k(|M|-1)/|M|} log(n_M/s_{M,h}) ).

We now present a not-so-slow rate of estimation for the whole tensor f by trend filtering. We restrict again to tensors with n_1 ≍ ... ≍ n_d ≍ n^{1/d}. We let c_{M,h} > 0 be constants of order O(1). The following theorem holds.

Theorem 8.2 (Not-so-slow ℓ^1-rate for trend filtering). Choose

λ ≍ n^{-(H(d)+2k-1)/(2H(d)+2k-1)} log^{H(d)/(2H(d)+2k-1)}(n).

Then, with probability at least 1 - Θ(1/n), it holds that

‖f̂ - f‖²/n ≤ Σ_{M ∈ P[d]\∅, h ∈ [k]^{d-|M|}} min_{f : ‖D^k_M f‖_1 ≤ c_{M,h}} ‖f̄_{M(M,h)} - f‖²/n_M + O( n^{-(H(d)+2k-1)/(2H(d)+2k-1)} log^{H(d)/(2H(d)+2k-1)}(n) ).

Proof. We apply Theorem A.2 to f̂̄_{M(M,h)} with x + d log(k+1) and t + d log(k+1), as in the proof of Theorem 8.1. We choose x ≍ t ≍ log n. Let S̃_{M,h} be an enlarged mesh grid. We have to trade off, with respect to s̃_{M,h} ≍ s_{M,h}, the terms

(n_M σ²/n) (s_{M,h}/n_M)   and   s_{M,h}^{-(2k-1)/(2H(|M|))} σ √(n_M/n) √(log n/n_M),

where in the second term the first factor is ≍ γ̃ and the second factor is ≍ λ_0(log n). We therefore obtain the rate

O( n^{-(H(|M|)+2k-1)/(2H(|M|)+2k-1)} log^{H(|M|)/(2H(|M|)+2k-1)}(n) ).

Since (H(|M|)+2k-1)/(2H(|M|)+2k-1) is decreasing in |M|, the rate of estimation of the d-dimensional margin is limiting and we obtain the claim.

We have shown that imposing structure to denoise d-dimensional tensors leads to an adaptive reconstruction. The structure is imposed via penalties on the l-dimensional k-th order Vitali TV of the l-dimensional margins of the tensor, for l ∈ [d]. If the tensor is a product of polynomials on a constant number of hyperrectangles of any dimension l ≤ d, then the MSE is bounded as ‖f̂ - f‖²/n = O(log n/n) with high probability. The true tensor f can therefore be reconstructed at an almost parametric rate. The key aspects of our results are: the reformulation of the analysis estimator in synthesis form, the interpolating tensor to bound the effective sparsity, and the ANOVA decomposition of a d-dimensional tensor. In the background of all our results are the projection arguments by Dalalyan et al. [3] to bound the random part of the problem, which are fundamental to prove the adaptivity of f̂ to the underlying unobserved f.

Note that we prove reconstruction for trend filtering of order k ∈ {1, 2, 3, 4}. We are not able to prove that the approach we use to find an interpolating tensor for k ∈ {1, 2, 3, 4} gives a suitable interpolating tensor for general k. Thus, although for each given finite k we can check by computer whether our construction gives an interpolating vector, the problem remains open for general k.

Acknowledgements

We would like to acknowledge support for this project from the Swiss National Science Foundation (SNF grant 200020_169011).
References

[1] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, 2011.
[2] S. Chatterjee and S. Goswami. New risk bounds for 2D total variation denoising. arXiv:1902.01215v2, 2019.
[3] A. Dalalyan, M. Hebiri, and J. Lederer. On the prediction performance of the Lasso. Bernoulli, 23(1):552–581, 2017.
[4] M. Elad, P. Milanfar, and R. Rubinstein. Analysis versus synthesis in signal priors. Inverse Problems, 23(3):947–968, 2007.
[5] B. Fang, A. Guntuboyina, and B. Sen. Multivariate extensions of isotonic regression and total variation denoising via entire monotonicity and Hardy–Krause variation. arXiv:1903.01395v1, 2019.
[6] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.
[7] A. Guntuboyina, D. Lieu, S. Chatterjee, and B. Sen. Adaptive risk bounds in univariate total variation denoising and trend filtering. Annals of Statistics, 48(1):205–229, 2020.
[8] J.-C. Hütter and P. Rigollet. Optimal rates for total variation denoising. JMLR: Workshop and Conference Proceedings, 49:1–32, 2016.
[9] S.-J. Kim, K. Koh, S. Boyd, and D. Gorinevsky. ℓ1 trend filtering. SIAM Review, 51(2):339–360, 2009.
[10] K. Lin, J. Sharpnack, A. Rinaldo, and R. J. Tibshirani. A sharp error analysis for the fused lasso, with application to approximate changepoint screening. Neural Information Processing Systems (NIPS), 2017.
[11] E. Mammen and S. van de Geer. Locally adaptive regression splines. Annals of Statistics, 25(1):387–413, 1997.
[12] F. Ortelli and S. van de Geer. On the total variation regularized estimator over a class of tree graphs. Electronic Journal of Statistics, 12:4517–4570, 2018.
[13] F. Ortelli and S. van de Geer. Synthesis and analysis in total variation regularization. arXiv:1901.06418v1, 2019.
[14] F. Ortelli and S. van de Geer. Prediction bounds for (higher order) total variation regularized least squares. To appear in the Annals of Statistics, 2019.
[15] F. Ortelli and S. van de Geer. Oracle inequalities for square root analysis estimators with application to total variation penalties. Information and Inference: A Journal of the IMA, (iaaa002), 2020.
[16] F. Ortelli and S. van de Geer. Adaptive rates for total variation image denoising. Journal of Machine Learning Research, 21(247):1–38, 2020.
[17] V. Sadhanala and R. J. Tibshirani. Additive models with trend filtering. Annals of Statistics, 47(6):3032–3068, 2019.
[18] V. Sadhanala, Y.-X. Wang, and R. Tibshirani. Total variation classes beyond 1d: minimax rates, and the limitations of linear smoothers. Neural Information Processing Systems (NIPS), 2016.
[19] V. Sadhanala, Y.-X. Wang, J. Sharpnack, and R. Tibshirani. Higher-order total variation classes on grids: minimax theory and trend filtering methods. In Advances in Neural Information Processing Systems, pages 5801–5811, 2017.
[20] J. Sharpnack, A. Rinaldo, and A. Singh. Sparsistency of the edge lasso over graphs. International Conference on Artificial Intelligence and Statistics (AISTATS), 22:1028–1036, 2012.
[21] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.
[22] R. Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. Annals of Statistics, 42(1):285–323, 2014.
[23] R. Tibshirani. Divided differences, falling factorials, and discrete splines: Another look at trend filtering and related problems. arXiv:2003.03886, 2020.
[24] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
[25] S. van de Geer. Estimation and Testing under Sparsity, volume 2159. Springer, 2016.
[26] S. van de Geer. The Lasso with structured design and entropy of (absolute) convex hulls. Preprint, pages 1–24, 2021.
[27] Y.-X. Wang, J. Sharpnack, A. Smola, and R. Tibshirani. Trend filtering on graphs. Journal of Machine Learning Research, 17:15–147, 2016.
A Oracle inequalities with fast and slow rates
In this section we report an oracle inequality with fast rates and one withslow rates. These oracle inequalities correspond to the adaptive and to thenon-adaptive bound of Theorem 2.2 in [14], see also Theorems 2.1 and 2.2in Ortelli and van de Geer [15] and Theorems 16 and 17 in Ortelli andvan de Geer [16] adapted to have an enlarged active set.
Theorem A.1 (Oracle inequality with fast rates). Let g ∈ R^{n_1 × ... × n_d} and S ⊆ ×_{i ∈ [d]} [k+2 : n_i - k] be arbitrary. For x, t > 0, choose λ ≥ γ̃ λ_0(t). Then, with probability at least 1 - e^{-x} - e^{-t}, it holds that

‖(f̂ - f)_{N_k^⊥}‖²/n ≤ ‖g - f_{N_k^⊥}‖²/n + 4λ ‖(D^k g)_{-S}‖_1 + ( σ√(2x/n) + σ√(2ks/n) )² + λ² Γ²_{D^k}(S, v_{-S}, q_S),

where q_S = sign((D^k g)_S).

Theorem A.2 (Oracle inequality with slow rates). Let g ∈ R^{n_1 × ... × n_d} and S ⊆ ×_{i ∈ [d]} [k+2 : n_i - k] be arbitrary. For x, t > 0, choose λ ≥ γ̃ λ_0(t). Then, with probability at least 1 - e^{-x} - e^{-t}, it holds that

‖(f̂ - f)_{N_k^⊥}‖²/n ≤ ‖g - f_{N_k^⊥}‖²/n + 4λ ‖D^k g‖_1 + ( σ√(2x/n) + σ√(2s̃/n) )².

B Proofs of Section 4
B.1 Proof of Lemma 4.2
We prove Lemma 4.2 by induction.
Anchor: k = 1. Note that φ¹_1 = φ̃¹_1 and φ¹_j - φ̃¹_j = α φ¹_1 for some α ∈ R. Therefore D¹φ¹_1 = D¹φ̃¹_1 = 0 and D¹(φ¹_j - φ̃¹_j) = 0. It follows that

(D¹φ̃¹_j)(j') = (D¹φ¹_j)(j') = 1{j' ≥ j} - 1{j' - 1 ≥ j} = 1{j' = j},  j ∈ [2 : n].

Step: k-1 implies k. For j ∈ [k-1] it holds that D^k φ^k_j = D^k φ̃^k_j = D^k φ^{k-1}_j = D^k φ̃^{k-1}_j = 0, since by assumption D^{k-1} φ^{k-1}_j = D^{k-1} φ̃^{k-1}_j = 0 for j ∈ [k-1]. Moreover

D^k φ^k_j = D^k ( Σ_{l ≥ j} φ^{k-1}_l )/n = D¹ ( Σ_{l ≥ j} D^{k-1} φ^{k-1}_l ) = D¹ {1{j' ≥ j}}_{j' ∈ [k:n]} = 0 for j = k and = 1_{{j}} for j ∈ [k+1 : n].

It also holds that φ^k_j - φ̃^k_j = Σ_{l ∈ [k]} α_l φ^l_l, j ∈ [k : n], for some {α_l ∈ R}_{l ∈ [k]}, and therefore D^k φ^k_j = D^k φ̃^k_j, j ∈ [k : n]. □

C Proofs of Section 5
C.1 Proof of Lemma 5.10
To bound the antiprojections we can use the dictionary Φ^k instead of Φ̃^k. Indeed, by Lemma 28 in Ortelli and van de Geer [16], it holds that

‖A_{{φ̃^k_t : t ∈ S̃}} φ̃^k_j‖ ≤ ‖A_{{φ^k_t : t ∈ S̃}} φ^k_j‖,  j ∈ [k+1 : n].

Bound on the antiprojections for d = 1

We first prove that, for m = 1, ..., s,

‖A_S̃ φ̃^k_j‖²/n ≤ ((t_m - j)/n)^{2k-1},  j ∈ R⁻_m = [t⁻_m : t_m],
‖A_S̃ φ̃^k_j‖²/n = 0,  j ∈ R⁰_m = [t_m : t_m + k - 1],
‖A_S̃ φ̃^k_j‖²/n ≤ ((j - t_m - k + 1)/n)^{2k-1},  j ∈ R⁺_m = [t_m + k - 1 : t⁺_m].

We then extend the reasoning to general dimension d.

For any m ∈ [s], we fix j ∈ R⁻_m and approximate φ^k_j by φ^k_{t_m}, ..., φ^k_{t_m+k-1}. By the definition of Φ^k we have that

φ^k_j(j') = n^{-k+1}(j' - j + 1)^{k-1} 1{j' ≥ j},  j' ∈ [n].

Moreover note that for k' ∈ {0, 1, ..., k-1}

Σ_{l=0}^{k'} (-1)^l binom(k', l) φ^k_{t_m+l} = n^{-k'} φ^{k-k'}_{t_m} = n^{-k+1} {(j' - t_m + 1)^{k-k'-1} 1{j' ≥ t_m}}_{j' ∈ [n]}.  (5)

We now express φ^k_j as the sum of a linear combination of φ^k_{t_m}, ..., φ^k_{t_m+k-1} and a remainder. The linear combination will approximate the projection of φ^k_j onto span{φ^k_j : j ∈ S̃}, while the remainder will be an upper bound for the antiprojections.

For all j' ∈ [n] it holds that

φ^k_j(j') = n^{-k+1}(j' - j + 1)^{k-1} ( 1{j ≤ j' ≤ t_m - 1} + 1{j' ≥ t_m} ).

By the binomial theorem

(j' - t_m + 1 + t_m - j)^{k-1} 1{j' ≥ t_m} = Σ_{l=0}^{k-1} binom(k-1, l) (t_m - j)^{k-1-l} (j' - t_m + 1)^l 1{j' ≥ t_m} = Σ_{l=0}^{k-1} binom(k-1, l) (t_m - j)^{k-1-l} n^l φ^{l+1}_{t_m}(j').

By Equation (5) we know that {φ^{l+1}_{t_m}}_{l ∈ [0:k-1]} ⊂ span({φ^k_{t_m+l}}_{l ∈ [0:k-1]}). Therefore, for j ∈ R⁻_m,

‖A_S̃ φ̃^k_j‖² ≤ n^{-2k+2} Σ_{j'=j}^{t_m-1} (j' - j + 1)^{2k-2} ≤ n^{-2k+2} ∫_0^{t_m-j} (j')^{2k-2} dj' ≤ (t_m - j)^{2k-1} / ((2k-1) n^{2k-2}) ≤ n ((t_m - j)/n)^{2k-1}.

Note that the construction of the partially orthonormalized dictionary Φ̃^k can of course also be made starting from the collection of functions {1{j ≤ j'}}_{j ∈ [n]}, j' ∈ [n], instead of {1{j ≥ j'}}_{j ∈ [n]}, j' ∈ [n], cf. Definition 4.1. The resulting dictionaries Φ̃^k coincide, up to permutation of the column indices. As a consequence, the calculation we showed to approximate ‖A_S̃ φ̃^k_j‖² for j ∈ R⁻_m can be carried out with the dictionary Φ̃^k based on {1{j ≤ j'}}_{j ∈ [n]}, j' ∈ [n], to obtain the approximation

‖A_S̃ φ̃^k_j‖² ≤ n ((j - t_m - k + 1)/n)^{2k-1},  j ∈ R⁺_m.

This consideration also applies in higher-dimensional situations.
Bound on the antiprojections for general dimension d

By the same reasons as above, we consider without loss of generality (k_1, ..., k_d) ∈ R^{-,...,-}_m. We decompose φ^k_{k_1,...,k_d} as follows:

φ^k_{k_1,...,k_d}(j_1, ..., j_d) = n^{-k+1} ∏_{i=1}^d (a_i(j_i) + b_i(j_i)),  j_i ∈ [n_i], i ∈ [d],

where

a_i = a_i(j_i) = (j_i - k_i + 1)^{k-1} 1{k_i ≤ j_i ≤ t_{i,m} - 1},
b_i = b_i(j_i) = (j_i - k_i + 1)^{k-1} 1{j_i ≥ t_{i,m}},
c_i = c_i(j_i) = (j_i - k + 1)^{k-1} 1{j_i ≥ k} ≥ a_i + b_i.

Note that a_i, b_i depend on t_{i,m}, while c_i does not. Moreover, for all (l_1, ..., l_d) ∈ [0 : k-1]^d it holds that ×_{i ∈ [d]} {t_{i,m} + l_i} ∈ S̃. Thus, we approximate

‖A_S̃ φ̃^k_{k_1,...,k_d}‖² ≤ n^{-2k+2} Σ_{j_1,...,j_d=1}^{n_1,...,n_d} ( ∏_{i=1}^d (a_i + b_i) - ∏_{i=1}^d b_i )²,

since by Equation (5) the contributions of ∏_{i=1}^d b_i are spanned by the dictionary atoms indexed by S̃. Note that ∏_{i=1}^d (a_i + b_i) - ∏_{i=1}^d b_i is nonzero on

×_{i ∈ [d]} [k_i : n_i] \ ×_{i ∈ [d]} [t_{i,m} : n_i] ⊆ ∪_{i ∈ [d]} ( [k_i : t_{i,m} - 1] × ×_{l ≠ i} [1 : n_l] ).

Moreover, on [k_i : t_{i,m} - 1] × ×_{l ≠ i} [1 : n_l], it holds that ∏_{i=1}^d (a_i + b_i) - ∏_{i=1}^d b_i ≤ a_i ∏_{l ≠ i} c_l. Therefore

‖A_S̃ φ̃^k_{k_1,...,k_d}‖² ≤ n^{-2k+2} Σ_{i=1}^d Σ_{j_1,...,j_d=1}^{n_1,...,n_d} a_i(j_i)² ∏_{l ≠ i} c_l(j_l)².

As in the one-dimensional case, n_i^{-2k+2} Σ_{j_i=1}^{n_i} a_i(j_i)² ≤ n_i ((t_{i,m} - k_i)/n_i)^{2k-1} and n_i^{-2k+2} Σ_{j_i=1}^{n_i} c_i(j_i)² ≤ n_i. It follows that

‖A_S̃ φ̃^k_{k_1,...,k_d}‖² ≤ n Σ_{i=1}^d ((t_{i,m} - k_i)/n_i)^{2k-1}.

Note that as soon as j_i ∈ R⁰_{i,m} for some coordinate i ∈ [d], then a_i(j_i) = 0 and the i-th coordinate does not contribute to the antiprojections. The bounds for all other hyperrectangles R^z_m, z ∈ {-, 0, +}^d, follow by analogous calculations. □

C.2 Proof of Lemma 5.11
For any m ∈ [s] and for any (j_1, ..., j_d) ∈ R_m it holds that

√( Σ_{i=1}^d ṽ_{i,m}(j_i)² ) ≤ Σ_{i=1}^d ṽ_{i,m}(j_i) ≤ Σ_{i=1}^d v_{i,m}(j_i) ( max{d⁻_{i,m}, d⁺_{i,m}}/n_i )^{k-1/2} ≤ ( Σ_{i=1}^d v_{i,m}(j_i) ) √( Σ_{i=1}^d ( max{d⁻_{i,m}, d⁺_{i,m}}/n_i )^{2k-1} ) ≤ v_{j_1,...,j_d} γ̃.

C.3 Proof of Lemma 5.12
Fix i ∈ [d] and m ∈ [s]. Say q_{t_m} = 1. Since w_{i,l,m} ∈ [0, 1], l ≠ i, for any j_i ∈ R⁻_{i,m} ∪ R⁰_{i,m} ∪ R⁺_{i,m} it holds that

∏_{l=1}^d w_{i,l,m}(j_l) ≤ ( 1 - q_{t_m} v_{i,m}(j_i)/C ) ∏_{l ≠ i} w_{i,l,m}(j_l) ≤ 1 - q_{t_m} v_{i,m}(j_i)/C.

Moreover, for any (j_1, ..., j_d) ∈ R_m it holds that

w_{j_1,...,j_d} = (1/d) Σ_{i=1}^d ∏_{l=1}^d w_{i,l,m}(j_l) ≤ (1/d) Σ_{i=1}^d ( 1 - q_{t_m} v_{i,m}(j_i)/C ) = 1 - Σ_{i=1}^d q_{t_m} v_{i,m}(j_i)/(dC) = 1 - v_{j_1,...,j_d}.

Analogous expressions hold if q_{t_m} = -1. The claim follows by noting that the conditions of the definition of an interpolating tensor (Definition 5.8) are satisfied by w. □

C.4 Matching derivatives
To obtain continuous vectors with k-1 continuous derivatives and piecewise constant k-th derivative, we split [0, 1] into N_ω, resp. N_w, intervals of equal length, where N_ω = k, N_w = k+1 if k is odd and N_w = k+2 if k is even. We denote these intervals by {[x_{l-1}, x_l]}_{l=1}^{N_{ω,w}}, with x_0 = 0 and x_{N_{ω,w}} = 1.

We choose

ω(x) = 1 - a_1 x^{k-1/2},  x ∈ [x_0, x_1],
ω(x) = b_{l,k} x^k + b_{l,k-1} x^{k-1} + ... + b_{l,1} x + b_{l,0},  x ∈ [x_{l-1}, x_l], l ∈ [2 : k-1],
ω(x) = c (1 - x)^k,  x ∈ [x_{k-1}, x_k].

We moreover choose

w(x) = 1 - a_1 x^k,  x ∈ [x_0, x_1],
w(x) = b_{l,k} x^k + b_{l,k-1} x^{k-1} + ... + b_{l,1} x + b_{l,0},  x ∈ [x_{l-1}, x_l], l ∈ [2 : N_w/2 - 1],
w(x) = a_L (1/2 - x)^L + ... + a_1 (1/2 - x) + 1/2,  x ∈ [x_{N_w/2 - 1}, x_{N_w/2}],

where L = k-1 if k is even and L = k if k is odd.

We choose both the coefficients (a_1, {b_{l,k}, ..., b_{l,0}}_l, c) of ω and the coefficients (a_1, a_L, ..., a_1, {b_{l,k}, ..., b_{l,0}}_l) of w by derivative matching: we require the k-1 derivatives of the different pieces of the interpolating polynomials to match at the junctions between the intervals. This gives rise to piecewise constant k-th derivatives, with the exception of the interval [x_0, x_1], where ω^{(k)}(x) ≍ -1/√x.

Matching derivatives for ω means solving a system of k(k-1) equations in k(k-1) unknowns. Matching derivatives for w means solving a system of k·k/2 equations in k·k/2 unknowns when k is even and k(k-1)/2 equations in k(k-1)/2 unknowns when k is odd. We therefore do not need to do any derivative matching for k = 1, where we just take ω(x) = 1 - √x and w(x) = 1 - x.

As an alternative to discretizing a continuous version of the interpolating polynomials, one can also proceed by matching discrete differences. The two approaches are equivalent when min_{i ∈ [d]} min_{m ∈ [s]} min{d⁻_{i,m}, d⁺_{i,m}} → ∞ as n → ∞. Discrete derivative matching requires that the counterpart of each interval [x_{l-1} : x_l] contains at least k points. We therefore require that

min{d⁻_{i,m}, d⁺_{i,m}} ≥ (k+2)k,  ∀ i ∈ [d], ∀ m ∈ [s].

We refer to Ortelli and van de Geer [14] for details on discrete derivative matching.
C.5 Partial integration
Some consequences of the fact that both the resulting ω and w have piecewise constant k-th derivatives, with the exception of the interval [0, x_1] where ω^{(k)}(x) ≍ -1/√x, are shown in the next lemma, which is useful to compute the bound on the effective sparsity in Lemma 5.13.

Lemma C.1 (Discrete differences of some polynomials). Let, for some d ∈ N, d ≥ k,

q_j := (j/d)^{k-1/2},  j = 0, ..., d.

Then n^{-k+2} ‖D^k q‖_1 = O(log(ed)/d^{k-1}). Let, for some d ∈ N, d ≥ k,

p_j := (j/d)^k,  j = 0, ..., d.

Then n^{-k+2} ‖D^k p‖_1 = O(1/d^{k-1}).

Proof. We have for j ≥ k

n^{-k+2} (D^k q)_j = Σ_{l=0}^k binom(k, l) (-1)^l ((j-l)/d)^{k-1/2} = (j/d)^{k-1/2} [ Σ_{l=0}^k binom(k, l) (-1)^l (1 - l/j)^{k-1/2} ].

We do a (k-1)-term Taylor expansion of x ↦ (1-x)^{k-1/2} around x = 0:

(1-x)^{k-1/2} = Σ_{i=0}^{k-1} a_i x^i + rem(x),

where a_0 = 1, a_1 = -(k-1/2), ..., a_{k-1} are the coefficients of the Taylor expansion and where the remainder rem(x) satisfies sup_{0 ≤ x ≤ 1/2} |rem(x)| = O(|x|^k). Thus

Σ_{l=0}^k binom(k, l) (-1)^l (1 - l/j)^{k-1/2} = Σ_{l=0}^k binom(k, l) (-1)^l ( Σ_{i=0}^{k-1} a_i (l/j)^i + rem(l/j) ),

where

Σ_{l=0}^k binom(k, l) (-1)^l Σ_{i=0}^{k-1} a_i (l/j)^i = 0,

since {Σ_{i=0}^{k-1} a_i (l/j)^i}_{l=0}^k is a polynomial of degree k-1 in l and hence its k-th order differences are zero. It follows that for j ≥ k,

| Σ_{l=0}^k binom(k, l) (-1)^l (1 - l/j)^{k-1/2} | ≤ Σ_{l=0}^k binom(k, l) |rem(l/j)| = O(1/j^k).

Then for j ≥ k, n^{-k+2} (D^k q)_j = O( 1/(j^{1/2} d^{k-1/2}) ). So n^{-k+2} ‖D^k q‖_1 = O( log(ed)/d^{k-1} ).

For p the same arguments go through. We obtain that n^{-k+2} (D^k p)_j = O(1/d^k) and so n^{-k+2} ‖D^k p‖_1 = O(1/d^{k-1}). □

C.6 Proof of Lemma 5.13

We prove a bound on the effective sparsity holding for every sign configuration. We eliminate the dependence on the sign configuration by decoupling partial integration on the whole interpolating tensor (‖(D^k)'w‖_1) into taking k-th order differences on the hyperrectangles {R_m}_{m=1}^s (‖D^k w(R_m)‖_1, where w(R_m) = {w_{j_1,...,j_d}}_{(j_1,...,j_d) ∈ R_m} denotes the restriction of the interpolating tensor w to the set of indices R_m).

To do this, we define the boundary B(R_m) of a hyperrectangle R_m as

B(R_m) := R_m \ ×_{i ∈ [d]} [t⁻_{i,m} + k : t⁺_{i,m} - k].

It holds that

n^{-k+1} ‖(D^k)'w‖_1 = O( Σ_{m=1}^s ( n^{-k+1} ‖D^k w(R_m)‖_1 + ‖w(B(R_m))‖_1 ) ).

By the definition of the interpolating tensor w it holds that

n^{-k+1} ‖D^k w(R_m)‖_1 = O( n^{-k+1} Σ_{i=1}^d ‖D^k ×_{l ∈ [d]} w_{l,i,m}‖_1 ) = O( n^{-k+1} Σ_{i=1}^d ∏_{l=1}^d ‖D^k w_{l,i,m}‖_1 )
= O( Σ_{i=1}^d ∏_{l=1}^d [ n_l^{-k+1} ‖D^k w⁻_{l,i,m}‖_1 + Σ_{j_l=t_{l,m}-k}^{t_{l,m}-1} (1 - w⁻_{l,i,m}(j_l)) + n_l^{-k+1} ‖D^k w⁺_{l,i,m}‖_1 + Σ_{j_l=t_{l,m}+k}^{t_{l,m}+2k-1} (1 - w⁺_{l,i,m}(j_l)) ] ),

where the sums stem from the differences involving the constant part of w on R⁰_{l,m}. Because of the form chosen for ω and w, it holds that

Σ_{j_l=t_{l,m}-k}^{t_{l,m}-1} (1 - w⁻_{l,i,m}(j_l)) = O(1 - ω(1/d⁻_{i,m})) = O(1/(d⁻_{i,m})^{k-1/2}) for l = i,
Σ_{j_l=t_{l,m}-k}^{t_{l,m}-1} (1 - w⁻_{l,i,m}(j_l)) = O(1 - w(1/d⁻_{l,m})) = O(1/(d⁻_{l,m})^k) for l ≠ i.

A similar bound holds for Σ_{j_l=t_{l,m}+k}^{t_{l,m}+2k-1} (1 - w⁺_{l,i,m}(j_l)).
By Lemma C.1 it holds that

n_l^{-k+1} ‖D^k w⁻_{l,i,m}‖_1 = O( log(e d⁻_{i,m}) / (d⁻_{i,m})^{k-1} ) for l = i,
n_l^{-k+1} ‖D^k w⁻_{l,i,m}‖_1 = O( 1/(d⁻_{l,m})^{k-1} ) for l ≠ i.
A similar bound holds for n_l^{-k+1} ‖D^k w⁺_{l,i,m}‖_1.

We now just have to upper bound the contributions of the boundaries B(R_m). For k = 1, w(B(R_m)) = 0 for all m ∈ [s], and the boundaries do not contribute to the effective sparsity. For k ≥ 2 it holds that

Σ_{B(R_m)} w_{j_1,...,j_d} = O( Σ_{i=1}^d Σ_{B(R_m)} ∏_{l=1}^d w_{l,i,m}(j_l) ) = O( Σ_{i=1}^d Σ_{z ∈ {-,+}^d} (d^z_m)^{k-1} ),

since all the contributions on the boundaries have the same dependence on k and we can approximate the volume of the boundaries by the sum of the volumes of the 2^d fractions {R^z_m}_{z ∈ {-,+}^d} of the hyperrectangle.

It therefore holds that

n^{-k+1} ‖(D^k)'w‖_1 = O( ( Σ_{i=1}^d log(e d_{i,max}(S)) ) Σ_{m=1}^s Σ_{z ∈ {-,+}^d} (d^z_m)^{k-1} )

and the claim follows. □

D Proofs of Section 6
D.1 Proof of Lemma 6.3
Setting
To calculate the inverse scaling factor when the active set is an enlarged mesh grid S̃, we decompose a dictionary atom – which is a product of sums – into a sum of products. Some of the components will be spanned by the dictionary atoms indexed by the mesh grid. The remaining components will contribute to the antiprojection.

By Lemma 28 in Ortelli and van de Geer [16] we can look at the dictionary atoms φ^k_{j_1,...,j_d} instead of φ̃^k_{j_1,...,j_d}; see also the proof of Lemma 5.10 in Appendix C.1. We therefore consider φ^k_{j_1,...,j_d} = φ^k_{j_1} × ... × φ^k_{j_d}, where, for i ∈ [d],

φ^k_{j_i} = n_i^{-k+1} (j - j_i + 1)^{k-1} 1{j ≥ j_i}.

Projection of the mesh grid on single coordinates

Now choose z_{i,l} ∈ Z_i(l) such that j_i ≤ z_{i,1} ≤ ... ≤ z_{i,d-1} ≤ z_{i,d}. By the definition of the mesh grid we can choose z_{i,l} ∈ Z_i(l) such that

• |j_i - z_{i,1}| = O(n_i/s^{1/H(d)});
• |z_{i,l} - z_{i,l-1}| = O(n_i/s^{1/(l H(d))}), l ∈ [2 : d];
• |z_{i,d}| ≤ n_i.

The decomposition
We now decompose the factors into sums:

φ^k_{j_i} = Σ_{l=0}^d u_{i,l},

where, for j ∈ [n_i],

u_{i,0} := 1{j ∈ [j_i : z_{i,1} - 1]} n_i^{-k+1} (j - j_i + 1)^{k-1},
u_{i,l} := 1{j ∈ [z_{i,l} : z_{i,l+1} - 1]} n_i^{-k+1} (j - j_i + 1)^{k-1},  l ∈ [1 : d-1],
u_{i,d} := 1{j ∈ [z_{i,d} : n_i]} n_i^{-k+1} (j - j_i + 1)^{k-1}.

Note that {u_{i,l}}_{l=0}^d are mutually orthogonal. Thanks to the decomposition of the factors, the following decomposition of the dictionary atom φ^k_{j_1,...,j_d} holds:

φ^k_{j_1,...,j_d} = Σ_{(l_1,...,l_d) ∈ [0:d]^d} ∏_{i=1}^d u_{i,l_i},

where {∏_{i=1}^d u_{i,l_i}}_{(l_1,...,l_d) ∈ [0:d]^d} are mutually orthogonal. We therefore obtain a decomposition of a product of sums into a sum of products.

Partitioning the decomposition
We now partition {(l_1, ..., l_d) ∈ [0:d]^d} into two subsets: Σ and Σ^c. Define

Σ := {(l_1, ..., l_d) ∈ [0:d]^d : |{i ∈ [d] : l_i ≤ z}| ≤ z, ∀ z ∈ [0:d]}.

This means that Σ contains tuples (l_1, ..., l_d) having at most d entries with value at most d, at most d-1 entries with value at most d-1, ..., at most one entry with value at most 1, and no entry with value 0.

Connecting the decomposition with the enlarged mesh grid

We now want to show that, for any (l_1, ..., l_d) ∈ Σ, ∏_{i=1}^d u_{i,l_i} can be obtained as a linear combination of {φ^k_{j_1,...,j_d}}_{(j_1,...,j_d) ∈ S̃}. These components will approximate the projection of any φ^k_{j_1,...,j_d} onto the linear span of {φ^k_{j_1,...,j_d}}_{(j_1,...,j_d) ∈ S̃}.

For l_i ∈ [1 : d-1] it holds that

u_{i,l_i}(j) = 1{z_{i,l_i} ≤ j} n_i^{-k+1} (j - j_i + 1)^{k-1} - 1{z_{i,l_i+1} ≤ j} n_i^{-k+1} (j - j_i + 1)^{k-1}.

In analogy to the proof of Lemma 5.10 (use the binomial theorem and Equation (5)), it holds that u_{i,l_i} ∈ span( {φ^k_{z_{i,l_i}+h}}_{h=0}^{k-1} ∪ {φ^k_{z_{i,l_i+1}+h}}_{h=0}^{k-1} ). For l_i = d it holds that u_{i,d} ∈ span( {φ^k_{z_{i,d}+h}}_{h=0}^{k-1} ).

We need a claim
We now show that

(l_1, ..., l_d) ∈ Σ ⟹ (l'_1, ..., l'_d) ∈ Σ, where l'_i ≥ l_i ∀ i ∈ [d],

by proving that

(l_1, ..., l_d) ∈ Σ ⟹ (l_1, ..., l_{d-1}, l_d + 1) ∈ Σ,

where without loss of generality we choose the index l_d and assume that l_d ≤ d-1. As a consequence it will follow that, for any (l_1, ..., l_d) ∈ Σ, ∏_{i=1}^d u_{i,l_i} can be obtained as a linear combination of {φ^k_{j_1,...,j_d}}_{(j_1,...,j_d) ∈ S̃}.

We now prove the claim: assume that (l_1, ..., l_d) ∈ Σ, i.e.,

|{i ∈ [d] : l_i ≤ z}| ≤ z,  ∀ z ∈ [0:d].

Take (l'_1, ..., l'_d) with l'_i = l_i for i ∈ [d-1] and l'_d = l_d + 1. Then

|{i ∈ [d] : l'_i ≤ z}| = |{i ∈ [d-1] : l_i ≤ z}| + 1{z ≥ l_d + 1} ≤ z - 1{z ≥ l_d} + 1{z ≥ l_d + 1} ≤ z.

Therefore (l'_1, ..., l'_d) ∈ Σ and the claim is proved.

Approximating the antiprojections
Thanks to the above claim and to the mutual orthogonality of the elements of {∏_{i=1}^d u_{i,l_i}}_{(l_1,...,l_d) ∈ [0:d]^d}, we can approximate as follows:

‖A_S̃ φ^k_{j_1,...,j_d}‖²/n ≤ Σ_{(l_1,...,l_d) ∉ Σ} ‖∏_{i=1}^d u_{i,l_i}‖²/n = Σ_{(l_1,...,l_d) ∉ Σ} ∏_{i=1}^d ‖u_{i,l_i}‖²/n_i.

It holds that

‖u_{i,l_i}‖²/n_i = O( s^{-(2k-1)/((l_i+1) H(d))} ).

The larger l_i, the larger the contribution of ‖u_{i,l_i}‖²/n_i. It therefore only remains to find the order of the largest contribution(s) indexed by Σ^c. A tuple of indices in Σ^c giving the contribution highest in order is (d-1, ..., d-1). It holds that

‖A_S̃ φ^k_{j_1,...,j_d}‖²/n = O( ∏_{i=1}^d ‖u_{i,l_i}‖²/n_i ) = O( s^{-(2k-1)/H(d)} ).

Since the upper bound does not depend on (j_1, ..., j_d), we read directly that

γ̃ = O( s^{-(2k-1)/(2H(d))} ). □