Tensor denoising with trend filtering
aa r X i v : . [ m a t h . S T ] J a n Tensor denoising with trend filtering
Francesco Ortelli and Sara van de GeerSeminar f¨ur Statistik, ETH Z¨urichR¨amistrasse 101, CH-8092 Z¨urich { fortelli,geer } @ethz.chJanuary 27, 2021 Abstract
We extend the notion of trend filtering to tensors by consideringthe k th -order Vitali variation – a discretized version of the integral ofthe absolute value of the k th -order total derivative. We prove adap-tive ℓ -rates and not-so-slow ℓ -rates for tensor denoising with trendfiltering.For k = { , , , } we prove that the d -dimensional margin of a d -dimensional tensor can be estimated at the ℓ -rate n − , up to loga-rithmic terms, if the underlying tensor is a product of ( k − th -orderpolynomials on a constant number of hyperrectangles. For general k we prove the ℓ -rate of estimation n − H ( d )+2 k − H ( d )+2 k − , up to logarithmicterms, where H ( d ) is the d th harmonic number.Thanks to an ANOVA-type of decomposition we can apply theseresults to the lower dimensional margins of the tensor to prove boundsfor denoising the whole tensor. Our tools are interpolating tensors tobound the effective sparsity for ℓ -rates, mesh grids for ℓ -rates and,in the background, the projection arguments by Dalalyan et al. [3]. Keywords: tensor denoising, total variation, Vitali variation, trend filtering,oracle inequalities
Contents d -dimensional tensors . . . . . . . . . 82.1.1 Tensors with product structure . . . . . . . . . . . . 92.1.2 Orthogonality between tensors . . . . . . . . . . . . 92.1.3 Linear subspaces and orthogonal projections . . . . 92.2 Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.3 Active sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 d = 1 . . . . . . . . . . . . . . . . . . . . . . 134.2 Dictionary for general d . . . . . . . . . . . . . . . . . . . . 15 k = 1 . . . . . . . . . . . . . 265.6.4 Interpolating tensor for k = 2 . . . . . . . . . . . . . 265.6.5 Interpolating tensor for k = 3 . . . . . . . . . . . . . 265.6.6 Interpolating tensor for k = 4 . . . . . . . . . . . . . 275.7 Proof of Theorem 5.2 . . . . . . . . . . . . . . . . . . . . . . 27 ℓ -rates 28 ˜ S is an enlarged mesh grid 306.3 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . 30 Denoising the whole tensor 35
B.1 Proof of Lemma 4.2 . . . . . . . . . . . . . . . . . . . . . . 41
C Proofs of Section 5 42
C.1 Proof of Lemma 5.10 . . . . . . . . . . . . . . . . . . . . . . 42C.2 Proof of Lemma 5.11 . . . . . . . . . . . . . . . . . . . . . . 44C.3 Proof of Lemma 5.12 . . . . . . . . . . . . . . . . . . . . . . 45C.4 Matching derivatives . . . . . . . . . . . . . . . . . . . . . . 45C.5 Partial integration . . . . . . . . . . . . . . . . . . . . . . . 46C.6 Proof of Lemma 5.13 . . . . . . . . . . . . . . . . . . . . . . 48
D Proofs of Section 6 49
D.1 Proof of Lemma 6.3 . . . . . . . . . . . . . . . . . . . . . . 49
Let f ∈ R n × ... × n d be a d -dimensional tensor with n = n · . . . · n d entries.We want to prove error bounds for tensor denoising, which is the task ofrecovering f from its noisy version Y = f + ǫ , where ǫ has i.i.d. Gaussianentries with mean 0 and variance σ .We show that we can estimate the underlying tensor f in an adaptivemanner with a regularized least-squares signal approximator. As regularizerwe propose the Vitali variation of the ( k − th -order total differences ofthe candidate estimator for k ≥ . We call this regularizer the “ k th -orderVitali total variation”. We use the abbreviation TV for “total variation”.This approach extends the idea of “trend filtering” [9, 22] to tensors.We expose the notion of TV regularization, review the literature onadaptive results for TV regularization, explain the concept of adaptationfor structured problems, introduce an ANOVA-type of decomposition of atensor, outline our contributions and finally present the organization of thepaper. 3 .1 TV regularization A regularized (least-squares) signal approximator is an estimator ˆ f definedas ˆ f := arg min f ∈ R n × ... × nd n k Y − f k /n + 2 λ pen(f) o , where k·k denotes the sum of the squared entries of its argument, λ > isa tuning parameter and pen(f) is a regularization penalty.When pen(f) = k D f k for a linear operator D and for k·k denotingthe sum of the absolute values of the entries of its argument, the regular-ized signal approximator is called “ ℓ -analysis estimator” or simply “anal-ysis estimator” [4]. If the linear operator D is a difference operator, then pen(f) = k D f k is usually called TV of f and the estimator ˆ f is called TVregularized estimator. Different choices of the difference operator D arepossible, resulting in different notions of TV.For a continuous image defined on ( x , . . . , x d ) ∈ [0 , d , one can choose D as a discretized version of either the total k th -order derivative opera-tor Q di =1 ∂ k / ( ∂x i ) k or of the sum of k th -order partial derivative operators P di =1 ∂ k / ( ∂x i ) k . For d = 1 partial and total derivatives coincide. With D being the firstorder difference matrix, the TV regularized estimator is also known underthe name “fused Lasso” [24, 6]. Adaptivity of the fused Lasso has beenproved by Dalalyan et al. [3], Lin et al. [10], Guntuboyina et al. [7].The “edge Lasso” extends the fused lasso to graphs and is studied bySharpnack et al. [20], H¨utter and Rigollet [8]. Ortelli and van de Geer[12, 15] prove adaptivity of the edge Lasso on tree graphs and cycle graphs,respectively.The idea of the fused Lasso can also be extended to the penalization ofhigher-order differences. This extension is called “trend filtering” [9, 22, 23].Adaptivity of trend filtering is established in Guntuboyina et al. [7], Ortelliand van de Geer [14]. Wang et al. [27] consider trend filtering on graphs,Sadhanala et al. [19] in higher-dimensional situations and Sadhanala andTibshirani [17] for additive models.Here, we consider the case of D being a discretization of Q di =1 ∂ k / ( ∂x i ) k .We call the corresponding notion of TV “ k th -order Vitali TV”. In the liter-ature, signal approximators regularized with the Vitali TV are studied byMammen and van de Geer [11], Ortelli and van de Geer [16], Fang et al.[5]. Ortelli and van de Geer [16] prove adaptivity for d = 2 and k = 1 .4ang et al. [5] show adaptivity for d = 2 and k = 1 using as regularizer theHardy-Krause variation, which is the sum of the Vitali TV of a matrix andof its margins. 
In this paper we will prove adaptivity of tensor denoisingwith k th -order Vitali TV regularization for k = { , , , } and general di-mension d ≥ . The results obtained for k = { , , , } and d = 1 in Ortelliand van de Geer [14] and for k = 1 and d = 2 in Ortelli and van de Geer[16] will then be retrieved as special cases.Signal approximators regularized with D being a discretization of thepartial derivatives P di =1 ∂ k / ( ∂x i ) k are studied by H¨utter and Rigollet [8],Sadhanala et al. [19] for general d . For d = 2 , Chatterjee and Goswami [2]show the fast rate n − / for estimating axis-aligned rectangles. The analysis estimator ˆ f can be recast in a constructive formulation as “syn-thesis estimator”. One can find dictionary tensors { φ j ∈ R n × ... × n d } j ∈ [ p ] ,such that ˆ f = p X j =1 ˆ β j φ j , where ˆ β := arg min b ∈ R n k Y − p X i =1 b j φ j k /n + 2 λ X j U | b j | , and U ⊆ { , . . . , p } is a set of indices, cf. Elad et al. [4]. The Lassoestimator [21, 1, 25] is an instance of synthesis estimator. The dictionary { φ j } j ∈ [ p ] and the set of unpenalized coefficients U ⊆ [ p ] depend on D .We can see that D imposes structure on the estimator: it determines thedictionary with which the estimator is constructed. For instance, in thecase of the st -order Vitali TV, the dictionary { φ j } j ∈ [ p ] consists of tensorsbeing constant on hyperrectangles. Therefore, the estimator ˆ f is constanton few hyperrectangular pieces.Our goal is to prove adaptation of the estimator ˆ f to the underlyingsignal f , when k D f k is the k th -order Vitali TV.Adaptation is a consequence of a high-probability upper bound on themean squared error (MSE) in the form of the oracle inequality k ˆ f − f k /n ≤ k g − f k /n + rem( D, g, S ) , (1)where g ∈ R n × ... × n d is an arbitrary tensor, S is an arbitrary set of indicesof Dg and rem( D, g, S ) is a remainder term. A result of the form of (1)establishes the adaptation of the estimator ˆ f , provided that the remainderterm rem( D, g = f , S = S ) converges to zero, where S is the set of theindices of the nonzero coefficients of Df . The cardinality s := | S | of S is called the “sparsity” of f with respect to D .5e can optimize the upper bound in (1) over g and S . However, theoptimizers g ∗ and S ∗ will depend on f – which is unobserved. Hence thename “oracle” for the pair ( g ∗ ( f ) , S ∗ ( f )) and the name “oracle inequality”for results as (1).Such a result is considered to be adaptive, since different underlyingtrue tensors f will possibly give place to different oracles ( g ∗ ( f ) , S ∗ ( f )) and to different values for the upper bound.Results as (1) are only useful if it can be proved that rem( D, f , S ) converges to zero. Typically rem( D, f , S ) = O (cid:16) λ Γ D ( S ) (cid:17) , where Γ D ( S ) is called “effective sparsity” and depends both on D and S . Proving adaptivity therefore translates into proving a bound for theeffective sparsity: a task which depends on the structure imposed by D .To bound the effective sparsity for tensor denoising with trend filtering weuse an interpolating tensor, in analogy to the interpolating vector and theinterpolating matrix by Ortelli and van de Geer [14, 16].Adaptive results as (1) are a consequence of a careful choice of λ . Thegeneral theory for the Lasso [1, 25] suggests the choice λ ≍ λ ≍ p log( n ) /n ,where λ is called the “universal choice”. The universal choice ensures thatall the noise is overruled. However, Dalalyan et al. 
[3] show that also thesmaller choice λ ≍ ˜ γλ is possible, where ˜ γ > is a scaling factor whichaccounts for the correlation in the dictionary { φ j } j ∈ [ p ] induced by D and S . The projection arguments by Dalalyan et al. [3] in the background ofour results allow us to choose the tuning parameter of smaller order thanthe universal choice λ .Projection arguments have been discussed in the literature. We do notreport them here but refer instead to Theorem 3 in Dalalyan et al. [3],Lemma B2 and Lemma C2 in Ortelli and van de Geer [15], Lemma 13 inOrtelli and van de Geer [16] and to van de Geer [26]. In the continuous case, the “nullspace” of the k th -order derivative opera-tor along one coordinate is made of constant, linear, ..., ( k − th -ordermonomial functions. The nullspace of the total derivative operator in d -dimensions is made of d -dimensional functions which are linear, ..., ( k − th -order monomial along at least one coordinate. In the discrete casewhen n ≍ . . . ≍ n d the linear space spanned by such tensors is n − /d -dimensional. 6e will decompose a tensor f ∈ R n × ... × n d into a sum of mutuallyorthogonal tensors. Each of these mutually orthogonal tensors will be con-stant or linear or ... or ( k − th -order monomial along a set of l coordinates,for l ∈ [0 : d ] . This construction will be carried out for all possible sets ofcoordinates in [ d ] . Tensors being constant or linear or ... or ( k − th -ordermonomial along d − l coordinates will be called l -dimensional margins.We will adaptively estimate l -dimensional margins with l -dimensionalVitali TV regularized estimators, for l ∈ [ d ] . The -dimensional marginswill be estimated by ordinary least squares at a rate n − . By estimating allthe margins adaptively we will be able to prove adaptivity of the denoisingof the whole tensor via Vitali TV regularization. Previously, we have derived tools like interpolating vectors and matchingderivatives to prove adaptivity for trend filtering ( d = 1 and k = { , , , } ,see Ortelli and van de Geer [14]). In Ortelli and van de Geer [16] we havecome up with tools to extend our results for adaptation of the fused Lasso( d = 1 and k = 1 ) to the two-dimensional case of image denoising ( d = 2 and k = 1 ). Here, we show in the first place how to combine and extend thetools from image denoising and one-dimensional trend filtering to handletrend filtering for k = { , , , } and for general dimension d . Establishingadaptivity requires a so-called “bound on the antiprojections”. We prove aformula giving the bounds on the antiprojections for general k and d . Wethen propose an ANOVA decomposition to ensure that all the margins ofa d -dimensional tensor can be estimated adaptively.Lastly, we prove slow rates for tensor denoising with trend filtering. Weextend the idea of mesh grid by Ortelli and van de Geer [16] to general d and general k . We then prove a bound on the antiprojections with the helpof the mesh grid holding for all d and all k .The integration of the arguments by Ortelli and van de Geer [14] withthe ones by Ortelli and van de Geer [16], the general bounds on the an-tiprojections and the ANOVA decomposition allow us to present generalrisk bounds for tensor denoising with trend filtering. 
In Section 2 we expose the required notation, the model and define thetrend filtering estimator for the d -dimensional margin.In Section 3 we list our contributions and give a preview of the results:adaptive ℓ -rates and not-so-slow ℓ -rates.7n Section 4 we derive the synthesis form of the trend filtering estimatorfor the d -dimensional margin.Proving the main result on adaptivity for tensor denoising with trendfiltering is the topic of Section 5.In Section 6 we apply a general result on not-so-slow ℓ -rates for analysisestimators to tensor denoising with trend filtering.In Section 7 we show the ANOVA decomposition of a tensor and definethe estimators for lower-dimensional margins.In Section 8 we apply the results on adaptivity and on not-so-slow ℓ -rates to the estimators for the lower-dimensional margins defined in Section7. This will establish adaptivity and not-so-slow rates for the estimation ofthe whole tensor.Section 9 concludes the paper. We consider the model Y = f + ǫ, where Y, f , ǫ ∈ R n × ... × n d are d -dimensional tensors and ǫ has i.i.d. N (0 , σ ) entries with known variance σ ∈ (0 , ∞ ) . For the case of unknown variancewe refer to Ortelli and van de Geer [15], who show how to estimate f and σ at the same time.The goal is to estimate f given its noisy observations Y . We considera signal approximator regularized with the Vitali TV. d -dimensional tensors For two integers i ≤ j we define [ i : j ] := { i, . . . , j } . Moreover, if i = 1 wewrite [ j ] := [1 : j ] .Let f ∈ R n × ... × n d be a d -dimensional tensor with n := n . . . n d entries.For indices ( j , . . . , j d ) ∈ [ n ] × . . . × [ n d ] we refer to the corresponding entryof f by f j ,...,j d using indices or by f ( j , . . . , j d ) using arguments.For ( j ′ , . . . , j ′ d ) , ( j ′′ , . . . , j ′′ d ) ∈ [ n ] × . . . × [ n d ] we use the notation j ′′ ,...,j ′′ d X j ′ ,...,j ′ d f j ,...,j d := j ′′ d X j d = j ′ d · · · j ′′ X j = j ′ f j ,...,j d . Similarly we write { f j ,...,j d } j ′′ ,...,j ′′ d j ′ ,...,j ′ d := { f j ,...,j d } ( j ′′ ,...,j ′′ d )( j ,...,j d )=( j ′ ,...,j ′ d ) . k f k := ( P n ,...,n d ,..., f j ,...,j d ) / we denote the Frobenius norm of f .Moreover we define k f k := P n ,...,n d ,..., | f j ,...,j d | as the sum of the absolutevalues of the entries of f . We now let f ∈ R n × ... × n d be a d -dimensional tensor with n := n · . . . · n d entries. Define the set of indices I of the entries of f as I := [ n ] × . . . × [ n d ] .We say that f has product structure if there are vectors { f j } j ∈ [ d ] suchthat f ( j , . . . , j d ) = f ( j ) · . . . · f d ( j d ) , ∀ ( j , . . . , j d ) ∈ I. We then write f = f × . . . × f d .Let f and g be tensors with product structure. We consider the entry-wise multiplication ( f ⊙ g ) j ,...,j d = f j ,...,j d g j ,...,j d , ( j , . . . , j d ) ∈ I .It holds that ( f ⊙ g ) j ,...,j d = Q dl =1 f l ( j l ) g l ( j l ) , ∀ ( j , . . . , j d ) ∈ I . The operation P n ,...,n d ,..., ( f ⊙ g ) j ,...,j d is the equivalent of the scalar productfor tensors.We say that the tensors f and g are orthogonal if P n ,...,n d ,..., ( f ⊙ g ) j ,...,j d =0 . If f and g have product structure and f l and g l are orthogonal to eachother for at least one coordinate l ∈ [ d ] , then f and g are orthogonal too. Let W be a linear subspace of R n × ... × n d and let W ⊥ be its orthogonalcomplement. By I : R n × ... × n d R n × ... × n d we denote the identity oper-ator, i.e., I f = f . 
By P W we denote the orthogonal projection operatoronto W and by A W := I − P W = P W ⊥ the corresponding orthogonal an-tiprojection operator. For a tensor f ∈ R n × ... × n d we write f W := P W and f W ⊥ := f − f W .For a linear operator ∆ , let N (∆) denote its nullspace. Let k be an integer in { , . . . , min i ∈ [ d ] n i − } .Let D ki be the k th -order difference operator along the i th coordinate,defined as ( D ki f )( j , . . . , j i , . . . , j d ) := n k − i k X l =0 ( − l kl ! f ( j , . . . , j i − l, . . . , j d ) , ( j , . . . , j i − , j i , j i +1 , . . . , j d ) ∈ [ n ] × . . . × [ n i − ] × [ k + 1 : n i ] × [ n i +1 ] × . . . × [ n d ] . Definition 2.1 (Total k th -order difference operator) . The total k th -order dif-ference operator D k is defined as D k := d Y i =1 D ki . The total k th -order difference operator D k can be seen as a discretizedversion of Q di =1 ∂ k / ( ∂x i ) k . It is important to note that the definition of D k implicitly includes a factor n k − that stems from the discretization.The Vitali TV of a tensor f ∈ R n × ... × n d is defined as the sum of theabsolute values of its total k th -order differences. Definition 2.2 ( k th -order Vitali TV) . The k th -order Vitali TV TV k ( f ) of a d -dimensional tensor f ∈ R n × ... × n d is defined as TV k ( f ) := k D k f k . The k th -order Vitali TV has the canonical scaling TV k ( f ) = O (1) dueto the normalization by the factor n k − in the definition of D k . We referto Sadhanala et al. [18] for more about canonical scalings.We define the nullspace N k of D k as N k := { f ∈ R n × ... × n d : D k f = 0 } and its orthogonal complement as N ⊥ k . We call f N ⊥ k the d -dimensionalmargin of a tensor f ∈ R n × ... × n d . Definition 2.3 ( k th -order trend filtering estimator) . The k th -order Vitalitrend filtering estimator ˆ f N ⊥ k for the d -dimensional margin f N ⊥ k is definedas ˆ f N ⊥ k := arg min f ∈ R n × ... × nd n k ( Y − f) N ⊥ k k /n + 2 λ TV k (f) o , where λ > is a tuning parameter. Let S ⊆ [3 : n − × . . . × [3 : n d − be a subset of the indices of D k f forsome tensor f ∈ R n × ... × n d . We write s := | S | and S = { t , . . . , t s } , where t m = ( t ,m , . . . , t d,m ) . We call { t m } sm =1 the jump locations.Moreover we define a S := { a j ,...,j d , ( j , . . . , j d ) ∈ S } and a − S := { a j ,...,j d , ( j , . . . , j d ) / ∈ S } . We will use the same notation a S for thetensor which shares its entries with a for ( j , . . . , j d ) ∈ S and has all itsother entries equal to zero. Similarly, we will also denote by a − S a tensorthat shares its entries with a for ( j , . . . , j d ) S and has its other entriesequal to zero. 10 Contributions
We make the following contributions: • We extend the idea of trend filtering to d -dimensional settings via theVitali variation and total discrete derivatives. • We prove adaptive ℓ -rates for tensor denoising with trend filteringfor k = { , , , } , see Theorem 3.1, a simplified version of Theorem5.2. The rates for d = 1 and k = { , , , } and for d = 2 and k = 1 are known. Rates for the other cases are new contributions. We alsoexpose some sufficient conditions to find adaptive bounds for general k . For each given k one can check by computer whether the conditionshold but the problem of showing that they hold for general k remainsopen. • We prove not-so-slow ℓ -rates for tensor denoising with trend filtering,see Theorem 3.2. Here too, the rates for d = 2 and k ≥ and for d ≥ are new contributions. It is still an open problem whether theserates correspond for d ≥ to minimax rates (modulo log terms). • We extend the idea of ANOVA decomposition from st -order dif-ferences to k th -order differences in d dimensions. By means of thisANOVA decomposition we can apply the results for the d -dimensionalmargin to lower dimensional margins. We obtain ℓ - and ℓ -rates forthe estimation of the whole tensor by trend filtering. • Our results allow to recover previous results for trend filtering andimage denoising [14, 16] as special cases.
We consider tensors in R n × ... × n d such that n = . . . = n d .Let λ ( t ) := σ s n ) + 2 tn , t > . We call λ ( t ) the “universal choice” of the tuning parameter. The universalchoice λ = λ ( t ) guarantees that all the noise is overruled. However, ourresults also allow for a smaller choice than the universal choice, due to theprojection arguments by Dalalyan et al. [3] in the background. Theorem 3.1 (Adaptivity of Vitali trend filtering, simplified) . Fix k ∈{ , , , } . Let g ∈ R n /d × ... × n /d be arbitrary. Let S ⊆ × i ∈ [ d ] [ k +2 : n /d − e an arbitrary set of size s := | S | defining a regular grid of cardinality s /d × . . . × s /d parallel to the coordinate axes. For a large enough constant C > only dependent on k , choose λ ≥ C d / λ (log(2 n )) s k − d . Then, with probability at least − /n , it holds that k ( ˆ f − f ) N ⊥ k k /n ≤ k g − f N ⊥ k k /n + 4 λ k ( D k g ) − S k + O λ s k log( n/s ) n ! . Proof.
See Subsection 5.7 for the proof of the more general Theorem 5.2.Some examples of the exponent of s in the rate of Theorem 3.1 for d = { , , } and k = { , , , } are exposed in Table 1. k = 1 k = 2 k = 3 k = 4 d = 1 d = 2 d = 3 d general − /d − /d − /d − /d Table 1: Some examples of the exponent of s in the rate of Theorem 3.1for the choice λ ≍ s − k − d λ (log(2 n )) .If in Theorem 3.1 we set g = f N ⊥ k and choose the tuning parameter λ ≍ s − k − d λ (log(2 n )) depending on the (typically unknown) true activeset S , we obtain the rate O s k − k − d log n log( n/s ) n . If in Theorem 3.1 we set g = f N ⊥ k and we choose the tuning parameter λ ≍ λ (log(2 n )) in a completely data-driven way not depending on the(typically unknown) true active set S , we obtain the rate O s k log n log( n/s ) n ! . We now fix k ∈ [1 : min i ∈ [ d ] n i − . For d ∈ N define the d th harmonicnumber H ( d ) as H ( d ) := P di =1 /i . 12 heorem 3.2 (Not-so-slow ℓ -rates for Vitali trend filtering) . Let g ∈ R n /d × ... × n /d be any tensor such that TV k ( g ) = O (1) . Choose λ ≍ n − H ( d )+2 k − H ( d )+2 k − log H ( d )2 H ( d )+2 k − ( n ) . Then, with probability at least − Θ(1 /n ) , it holds that k ( ˆ f − f ) N ⊥ k k /n ≤ k g − f N ⊥ k k /n + O (cid:18) n − H ( d )+2 k − H ( d )+2 k − log H ( d )2 H ( d )+2 k − ( n ) (cid:19) . Proof.
See Subsection 6.3.Some examples of the exponent of n in the rate of Theorem 3.2 for d = { , , } and k = { , , } are exposed in Table 2. k = 1 k = 2 k = 3 k general d = 1 − / − / − / − k/ (2 k + 1) d = 2 − / − / − / − (4 k + 1) / (4 k + 4) d = 3 − / − / − / − (12 k + 5) / (12 k + 16) Table 2: Some examples of the exponent of n in the rate of Theorem 3.2. According to Definition 2.3, the trend filtering estimator is an analysisestimator. In this section we want to rewrite it in a constructive form,that is, in synthesis form. We show that the trend filtering estimator canbe constructed as a linear combination of tensors with product structure,where the factors are truncated monomials of order k − . We call thecollection of such tensors the “dictionary”.We first define the dictionary and then show that it is the right dictio-nary to construct the trend filtering estimator.We start with the one-dimensional case. We then obtain the d -dimensionaldictionary from the one-dimensional dictionary by constructing tensors withproduct structure. d = 1 Let φ j := { { j ′ ≥ j } } j ′ ∈ [ n ] , j ∈ [ n ] . The vectors { φ j } j ∈ [ n ] are linearly inde-pendent and piecewise constant. 13or ≤ k ≤ n − define recursively φ kj := ( φ jj , j ∈ [ k − , P l ≥ j φ k − l /n, j ∈ [ k : n ] . We call the collection Φ k = { φ kj } j ∈ [ n ] the “original” dictionary.The dictionary Φ k is a collection of n linearly independent discrete(truncated) monomials: the first k are monomials of order , , . . . , k − ,while the last n − k are truncated monomials of order k − .We now define a partially orthonormalized version of the dictionary Φ k , k ∈ [ n − . Definition 4.1 (Partially orthonormalized dictionary in one dimension) . The(partially orthonormalized) dictionary ˜Φ k = { ˜ φ kj } j ∈ [ n ] is defined as ˜ φ kj := √ n A { φ ll ,l ∈ [ j − } φ jj / k A { φ ll ,l ∈ [ j − } φ jj k , j ∈ [ k ] , A { φ ll ,l ∈ [ k ] } φ kj , j ∈ [ k + 1 : n ] . For k ∈ [ n − , ˜Φ k = { ˜ φ kj } j ∈ [ n ] is again a collection of n linearly inde-pendent vectors, where ˜ φ k , . . . , ˜ φ kk , { ˜ φ kj } j ∈ [ k +1: n ] are mutually orthogonal.Moreover k ˜ φ kj k = n, j ∈ [ k ] . Lemma 4.2 (Relation between dictionary and difference operator) . Fix k ∈ [ n − . It holds that D k φ kj = D k ˜ φ kj = ( , j ∈ [ k ] , { j } , j ∈ [ k + 1 : n ] . Proof.
See Appendix B.1.As a consequence of Lemma 4.2, { ˜ φ jj } j ∈ [ k ] span N k and { ˜ φ jj } j ∈ [ k ] is anorthogonal basis for N k . Moreover { ˜ φ kj } j ∈ [ k +1: n ] span N ⊥ k .By Lemma 4.2 combined with Lemma 2.2 in Ortelli and van de Geer[13] about the Moore-Penrose pseudoinverse we obtain for the pseudoinverse ( D k ) + that ( D k ) + = { ˜ φ kj } j ∈ [ k +1: n ] .With the dictionary ˜Φ k and some coefficients { β j } nj = k +1 we can write avector f N ⊥ k ∈ N ⊥ k as f N ⊥ k = ( D k ) + β . Then β = D k f N ⊥ k .For d = 1 we therefore obtain the following synthesis form of the esti-mator ˆ f N ⊥ k : ˆ f N ⊥ k = n X j = k +1 ˜ φ kj ˆ β j , ˆ β = arg min b ∈ R n − k k Y N ⊥ k − n X j = k +1 b j ˜ φ kj k /n + 2 λ k b k . d Hereafter we fix k ∈ [1 : min l ∈ [ d ] n l − . Definition 4.3 (Partially orthonormalized dictonary in d -dimensions) . Thedictionary { ˜ φ kj ,...,j d ∈ R n × ... × n d } n ,...,n d ,..., is defined as ˜ φ kj ,...,j d = ˜ φ kj × . . . × ˜ φ kj d , ( j , . . . , j d ) ∈ × i ∈ [ d ] [ n i ] . The dictionary { ˜ φ kj ,...,j d } n ,...,n d ,..., is a collection of d -dimensional tensorswith product structure. By Lemma 4.2 and the product structure, N ⊥ k =span( { ˜ φ kj ,...,j d } n ,...,n d k +1 ,...,k +1 ) .For a tensor of coefficients { β j ,...,j d } n ,...,n d k +1 ,...,k +1 , write f N ⊥ k = n ,...,n d X k +1 ,...,k +1 β j ,...,j d ˜ φ kj ,...,j d . Because of the product structure of ˜ φ kj ,...,j d it holds that D k f N ⊥ k = n ,...,n d X k +1 ,...,k +1 β j ,...,j d (1 { j } × . . . × { j d } ) = β. From the fact that any candidate estimator has to belong to the spacespanned by Y N ⊥ k , it follows that ˆ f N ⊥ k = n ,...,n d X k +1 ,...,k +1 ˆ β j ,...,j d ˜ φ kj ,...,j d , where ˆ β = arg min b ∈ R ( n − k ) × ... × ( nd − k ) k Y N ⊥ k − n ,...,n d X k +1 ,...,k +1 b j ,...,j d ˜ φ kj ,...,j d k /n + 2 λ k b k . The synthesis form of the estimator ˆ f N ⊥ k is useful in two ways. Firstly, todetermine the structure of the estimator by specifying the dictionary usedto construct it. In our case, ˆ f N ⊥ k is a linear combination of d -dimensionalproducts of ( k − th -order polynomials. Secondly, the dictionary facilitatesthe approximation of some orthogonal projections in the proof of adaptive ℓ -rates and not-so-slow ℓ -rates. 15 Adaptivity
In this section we first expose some notation for our main result. After hav-ing exposed our main result, Theorem 5.2, we work out explicit expressionsfor the bound on the antiprojections ˜ v , the inverse scaling factor ˜ γ and thenoise weights v . Finally, we show a bound on the effective sparsity via asuitable interpolating tensor. In Subsection 5.7 we put the pieces togetherto prove Theorem 5.2.Fix k ∈ [1 : min i ∈ [ d ] n i − and an active set S ⊆ × i ∈ [ d ] [ k + 2 : n i − k ] .To every jump location in S , we associate a hyperrectangle of k d addi-tional jump locations to obtain the enlarged active set ˜ S , defined as ˜ S := s [ m =1 ( × i ∈ [ d ] [ t i,m : t i,m + k − . Definition 5.1 (Hyperrectangular tessellation) . We call { R m } sm =1 a hyper-rectangular tessellation of × i ∈ [ d ] [ k + 1 : n i ] if it satisfies the following con-ditions: • each R m ⊆ × i ∈ [ d ] [ k + 1 : n i ] is a hyperrectangle ( m = 1 , . . . , s ); • ∪ sm =1 R m = × i ∈ [ d ] [ k + 1 : n i ] ; • for all m and m ′ = m , the hyperrectangles R m and R m ′ possibly shareboundary points but not interior points; • for all m , the points × i ∈ [ d ] [ t i,m : t i,m + k − are interior points of R m . For a hyperrectangular tessellation { R m } sm =1 denote the vertices of thehyperrectangle R m by ( t z ,m , . . . , t z d d,m ) , ( z , . . . , z d ) ∈ {− , + } d , for m ∈ [ s ] .Moreover we define the distances of the jump locations from the verticesof their respective hyperrectangle and the respective set of indices as d − i,m := ( t i,m − t − i,m ) , R − i,m := [ t − i,m : t i,m ] ,d i,m := k, R i,m := [ t i,m : t i,m + k − ,d + i,m := ( t + i,m − t i,m − k + 1) , R + i,m := [ t i,m + k − t + i,m ] , for i ∈ [ d ] and m ∈ [ s ] . Each hyperrectangle R m of the hyperrectangulartessellation { R m } m ∈ [ s ] can be partitioned into d hyperrectangles. Define,for all ( z , . . . , z d ) ∈ {− , , + } d , R z ··· z d m := R z ,m × . . . × R z d d,m , m ∈ [ s ] . For m ∈ [ s ] , let d z ··· z d m := d z ,m · . . . · d z d d,m , { z , . . . , z d } ∈ {− , + } d .
16e define the maximal distance from an (enlarged) jump location tothe boundary of the corresponding rectangular region along the coordinate i ∈ [ d ] as d i, max ( S ) := max m ∈ [1: s ] max { d − i,m , d + i,m } .d + , − m d − , − m d − , + m d + , + m ( t +1 ,m , t − ,m ) ( t +1 ,m , t +2 ,m )( t − ,m , t +2 ,m )( t − ,m , t − ,m )( t ,m , t − ,m )( t ,m + k − , t − ,m ) ( t − ,m , t ,m ) ( t +1 ,m , t ,m + k − d +1 ,m d − ,m d − ,m d +2 ,m Figure 1: A rectangle of the tessellation { R m } sm =1 for d = 2 and k = 4 For d = 2 and k = 4 , a rectangle of the tessellation is depicted in Figure1. We present our main result, that shows that trend filtering leads to anadaptive estimation of the d -dimensional margin f N ⊥ k of f . Theorem 5.2 (Adaptivity of trend filtering) . Fix k ∈ { , , , } and choose x, t > . Let g ∈ R n × ... × n d be arbitrary. Let S be an arbitrary subset ofsize s := | S | of × i ∈ [ d ] [ k + 1 + ( k + 2) k : n i − k + 1 − ( k + 2) k ] . For a largeenough constant C > that only dependends on k , choose λ ≥ C d vuut d X i =1 (cid:18) d i, max ( S ) n i (cid:19) k − λ ( t ) . hen, with probability at least − e − x − e − t , it holds that k ( ˆ f − f ) N ⊥ k k /n ≤ k g − f N ⊥ k k /n + 4 λ k ( D k g ) − S k + 2 σ n (cid:16) √ x + √ ks (cid:17) + O λ d X i =1 log( ed i, max ( S )) ! s X m =1 X z ∈{− , + } d (cid:18) nd zm (cid:19) k − . In particular the constraint on C is C ≥ k k − a with a = , k = 1 , √ / ≈ . , k = 2 , √ / ≈ . , k = 3 , . , k = 4 , as min i ∈ [ d ] min m ∈ [ s ] min { d − i,m , d + i,m } → ∞ .Proof. See Subsection 5.7.By choosing x ≍ t ≍ log n in Theorem 5.2 and by constraining theactive set S to be a regular grid we retrieve Theorem 3.1. In that case,since S is a regular grid, we can choose λ ≍ s − k − d λ (log(2 n )) and theoracle inequality has the rate O s k ( d − d n log( n/s ) log n . Remark (The role of the hyperrectangular tessellation) . Given an activeset S , the choice of a hyperrectangular tessellation in Theorem 5.2 can beseen as arbitrary. We introduce some quantities on which Theorem 5.2 relies: the bound onthe antiprojections ˜ v , the inverse scaling factor ˜ γ , the noise weights v , asign configuration q and the effective sparsity Γ D k .Let ˜ S be the enlarged active set induced by some active set S . Let P ˜ S be the orthogonal projection operator on span( { ˜ φ kj ,...,j d } ( j ,...,j d ) ∈ ˜ S ) . Definition 5.3 (Bound on the antiprojections) . A bound on the antiprojec-tions is a tensor ˜ v ∈ R ( n − k ) × ... × ( n d − k ) such that ˜ v j ,...,j d ≥ k (I − P ˜ S ) ˜ φ kj ,...,j d k / √ n, ∀ ( j , . . . , j d ) ∈ × i ∈ [ d ] [ k + 1 : n i ] . ˜ v be a bound on the antiprojections. Definition 5.4 (Inverse scaling factor) . The inverse scaling factor ˜ γ ∈ R isdefined as ˜ γ := k ˜ v − ˜ S k ∞ . Let ˜ v be a bound on the antiprojections and ˜ γ the corresponding inversescaling factor. Definition 5.5 (Noise weights) . The noise weights v ∈ R ( n − k ) × ... × ( n d − k ) aredefined as v ≥ ˜ v/ ˜ γ ∈ [0 , ( n − k ) × ... × ( n d − k ) . We can now introduce the effective sparsity. The effective sparsity de-pends on a so-called “sign configuration”, that is, on the sign pattern asso-ciated with the jump locations.
Definition 5.6 (Sign configuration) . Let q ∈ [ − , ( n − k ) × ... × ( n d − k ) be s.t. q j ,...,j d ∈ {− , +1 } , ( j , . . . , j d ) = t m ∈ S, { q t m } , ( j , . . . , j d ) ∈ × i ∈ [ d ] [ t i,m : t i,m + k − , m ∈ [ s ] , [ − , , ( j , . . . , j d ) / ∈ ˜ S. We call q S ∈ {− , , } ( n − k ) × ... × ( n d − k ) a sign configuration. The basic definition of effective sparsity depends on the sign config-uration associated with S . One can however remove this dependence bydefining the effective sparsity as the maximum over all sign configurations. Definition 5.7 (Effective sparsity) . Let an active set S , a sign configuration q S and noise weights v be given. The effective sparsity Γ D k ( S, v − S , q S ) ∈ R is defined as Γ D k ( S, v − S , q S ) :=:= max ( s X m =1 ( q S ) t m ( D k f ) t m − k (1 − v ) − S ⊙ ( D k f ) − S k : k f k /n = 1 ) . Moreover we write Γ D k ( S, v − S ) := max q S Γ D k ( S, v − S , q S ) . By the adaptive bound of Theorem 2.2 in Ortelli and van de Geer [14](see also Theorem 2.1 in Ortelli and van de Geer [15] and Theorem 16 inOrtelli and van de Geer [16] modified with an enlarged active set), we knowthat bounding the effective sparsity is a sufficient condition for provingadaptation of ˆ f N ⊥ k . 19 .3 Effective sparsity via interpolating tensors To bound the effective sparsity we extend the technique by Ortelli andvan de Geer [14] involving interpolating vectors to interpolating tensors,i.e., tensors that interpolates the signs of the jumps.
Definition 5.8 (Interpolating tensor) . Let q S ∈ {− , , } ( n − k ) × ... × ( n d − k ) be a sign configuration and v ∈ [0 , ( n − k ) × ... × ( n d − k ) be a tensor of noiseweights. The tensor w ( q S ) ∈ R ( n − k ) × ... × ( n d − k ) is called an interpolatingtensor for the sign configuration q S and the weights v if it has the followingproperties: • w j ,...,j d ( q S ) = ( q S ) t m , ∀ ( j , . . . , j d ) ∈ × i ∈ [ d ] [ t i,m : t i,m + k − , ∀ m ∈ [ s ] , • | w j ,...,j d ( q S ) | ≤ − v j ,...,j d , ∀ ( j , . . . , j d ) ∈ ( × i ∈ [ d ] [ k : n i ]) \ ˜ S . With the help of an interpolating tensor we can bound the effectivesparsity, as the following lemma shows (Lemma 2.4 by Ortelli and van deGeer [14] in tensor form).
Lemma 5.9 (Bounding the effective sparsity with an interpolating tensor) . We have Γ D k ( S, v − S , q S ) ≤ n min w ( q S ) k ( D k ) ′ w ( q S ) k where the minimum is over all interpolating tensors w ( q S ) for the signconfiguration q S .Proof. It holds that s X m =1 ( q S ) t m ( D k f ) t m − k (1 − v ) − S ⊙ ( D k f ) − S k ≤ s X m =1 ( q S ) t m ( D k f ) t m − k w ( q S ) − S ⊙ ( D k f ) − S k ≤ n ,...,n d X ,..., w ( q S ) j ,...,j d ( D k f ) j ,...,j d = n ,...,n d X ,..., (( D k ) ′ w ( q S )) j ,...,j d f j ,...,j d ≤ √ n k ( D k ) ′ w ( q S ) k k f k / √ n. .4 Requirements on an interpolating tensor Theorem 5.2 follows by a bound on the effective sparsity obtained byLemma 5.9 with the help of an interpolating tensor. In the definition of aninterpolating tensor (cf. Definition 5.8), there is a constraint posed by thenoise weights v .Therefore, we now calculate in Subsection 5.5 a bound on the antipro-jections ˜ v to derive an appropriate inverse scaling factor ˜ γ and noise weights v . In this way we will make explicit the constraints that the interpolatingtensor has to satisfy in the specific case of tensor denoising with trendfiltering.After that, we will show in Subsection 5.6 an explicit form for the in-terpolating tensor for k = { , , , } and derive the corresponding boundon the effective sparsity.That bound on the effective sparsity combined with the fact that theinterpolating tensor used indeed is an interpolating tensor for trend filteringwill allow us to derive Theorem 5.2 from Theorem A.1. We start by finding a bound on the antiprojections ˜ v .Define, for m ∈ [ s ] and i ∈ [ d ] , ˜ v i,m ( j i ) = (cid:18) t i,m − j i n i (cid:19) k − , j i ∈ R − i,m = [ t − i,m : t i,m ] , , j i ∈ R i,m = [ t i,m : t i,m + k − , (cid:18) j i − t i,m − k + 1 n i (cid:19) k − , j i ∈ R + i,m = [ t i,m + k − t + i,m ] . Moreover, for ( j , . . . , j d ) ∈ R m we define ˜ v j i ,...,j d := vuut d X i =1 ˜ v i,m ( j i ) . Lemma 5.10 (A valid bound on the antiprojections) . For all ( j , . . . , j d ) ∈ R m and for all m ∈ [ s ] it holds that k A ˜ S ˜ φ kj ,...,j d k /n ≤ ˜ v j i ,...,j d , i.e., the tensor ˜ v ∈ R ( n − k ) × ... × ( n d − k ) is a valid bound on the antiprojections.Proof. See Appendix C.1. 21efine, for m ∈ [ s ] and i ∈ [ d ] , v i,m ( j i ) = t i,m − j i d − i,m ! k − , j i ∈ R − i,m = [ t − i,m : t i,m ] , , j i ∈ R i,m = [ t i,m : t i,m + k − , j i − t i,m − k + 1 d + i,m ! k − , j i ∈ R + i,m = [ t i,m + k − t + i,m ] . Moreover, for a constant C = C ( k ) ≥ we define for ( j , . . . , j d ) ∈ R m and m = 1 , . . . , s v j i ,...,j d := 1 d d X i =1 v i,m ( j i ) C (2)and ˜ γ = Cd vuut d X i =1 (cid:18) d i, max ( S ) n i (cid:19) k − . Lemma 5.11 (Valid noise weights) . For all m ∈ [ s ] and for all ( j , . . . , j d ) ∈ R m it holds that ˜ v j i ,...,j d ≤ v j i ,...,j d ˜ γ, i.e., the tensor v ∈ R ( n − k ) × ... × ( n d − k ) as defined in Equation (2) definesvalid noise weights.Proof. See Appendix C.2.The constant C ≥ in Equation (2) can be chosen aribtrarily. Choosinga larger C makes the noise weights smaller. As a result, the requirementsimposed on the interpolating tensor by the noise weights become weaker. We now define an interpolating tensor w = w ( q ) for any sign configuration q S . For ( j , . . . 
, j d ) ∈ R m , m ∈ [ s ] and the same constant C = C ( k ) > asin the definition of the noise weights in Equation (2), define the tensor w j ,...,j d ( q S ) := 1 d d X i =1 d Y l =1 w l,i,m ( j l ) , (3)22here, w l,i,m ( j l ) = q t m , j l ∈ R l,m , l = i,w l,i,m ( j l ) ∈ [0 , q t m ] , j l ∈ R − l,m ∪ R + l,m , l = i,w i,i,m ( j i ) = q t m , j i ∈ R i,m , l = i,w i,i,m ( j i ) ≤ q t m (1 − v i,m ( j i ) /C ) , j i ∈ R − l,m ∪ R + l,m , l = i. What differentiates the case l = i is that w i,i,m has to satisfy the require-ments imposed by the noise weights. For l = i the only constraint imposedis that | w l,i,m | ≤ . The tensor w is a sum of terms with product structureif constrained to the set of indices of a hyperrectangle R m .We define w − l,i,m := { w l,i,m ( j l ) } j l ∈ R − i,m and w + l,i,m := { w l,i,m ( j l ) } j l ∈ R + i,m . Lemma 5.12 (A valid interpolating tensor) . For any given sign configuration q S , the tensor w = w ( q S ) defined in Equation (3) is a valid interpolatingtensor.Proof. See Appendix C.3.
We now want to find the explicit form of an appropriate interpolating tensor w , to apply in Lemma 5.9. We first consider continuous versions ω ( x ) ,respectively. w( x ) , of the vectors w − i,i,m and w + i,i,m , respectively. w − l,i,m and w + l,i,m for l = i , on a mock interval x ∈ [0 , . We then set w − i,i,m ( j i ) := ω t i,m − j i d − i,m ! , j i ∈ R − i,m ,w + i,i,m ( j i ) := ω j i − t i,m − k + 1 d + i,m ! , j i ∈ R + i,m ,w − l,i,m ( j l ) := w t l,m − j l d − l,m ! , j l ∈ R − l,m ,w + l,i,m ( j l ) := w j l − t l,m − k + 1 d + l,m ! , j l ∈ R + l,m , for i ∈ [ d ] , l = i, m ∈ [ s ] .We aim to find a form of ω and w giving place to continuous functionswith k − continuous derivatives and piecewise constant k th derivative.23oreover, these functions have to be interpolating between the jump loca-tion ( x = 0 ) and the border ( x = 1 ). We guarantee that they interpolatethe signs of the jumps by restricting to polynomials with ω (0) = 1 , ω (1) = 0 , w(0) = 1 , w(1) = 0 , w( x ) = 1 − w(1 − x ) , x ∈ [0 , . The discretized version of these polynomials will vanish at the boundariesof the hyperrectangles while it will have the value at the indices belongingto the enlarged active set ˜ S , guaranteeing the interpolation of the signs ofthe jump locations. Moreover, we will have to choose the constant C > in Equation (2) such that the noise weights are made small enough for theinterpolating polynomial to satisfy the conditions of Lemma 5.12.To obtain interpolating polynomials ω and w , we split the interval [0 , into an adequate number of subintervals. We then choose ω and w to bemade of polynomial pieces of order at most k . The exception is the firstsubinterval [0 , x ] , x ∈ (0 , for ω , where we choose ω ( x ) = 1 − a x k − .We then find the explicit values of the coefficients of the polynomials byderivatives matching, as in Ortelli and van de Geer [14]. More details onderivatives matching are given in the Appendix C.4.To guarantee that ω and w can give place to interpolating tensors, onehas to check that derivative matching renders a piecewise polynomial whichis monotone. Monotonicity combined with the constraints ω (0) = w(0) = 1 and ω (1) = w(1) = 0 ensures that | ω ( x ) | ≤ and | w( x ) | ≤ .Monotone interpolating polynomials ω and w and a large enough C in the tuning parameter are sufficient conditions for a valid interpolatingtensor. In particular, given that ω is monotone, we require that C ≥ k k − /a as min i ∈ [ d ] min m ∈ [ s ] min { d − i,m , d + i,m } → ∞ . (4)Note that for the construction of w , we do not have any constraint given bythe antiprojections ˜ v , the noise weights v and the inverse scaling factor ˜ γ .Therefore, we can take the dependence on x k instead of x k − . This savesa logarithmic term, not visible in Lemma 5.13, which only contains thelogarithmic terms stemming from ω . Indeed, as Lemma C.1 in AppendixC.5 shows, partial integration of a k th -order polynomial does not incur inlog terms, while partial integration of x k − does so. We have to choosethe worse dependence on x k − for ω though, because ω has to respect theconstraints posed by the noise weights.24 .6.2 Show a bound on the effective sparsity We now show a bound on the effective sparsity, using a “candidate” in-terpolating tensor generated from the discretizations of ω and w whoseconstruction has been exposed above. We call it “candidate” interpolatingtensor because we have not yet shown that ω and w are monotone. 
For themoment we assume that matching derivatives renders monotone ω and w .We check the monotonicity for k = { , , , } in the next subsection.To make the notation and the computation steps lighter, we neglect theconstants and use the order notation O instead.Since the sign configuration q S is typically unknown, we focus on findingan upper bound on the effective sparsity that does not depend on the signconfiguration q S . Thus, the bound also accommodates for the worst-casesign configuration. Lemma 5.13 (Effective sparsity for trend filtering) . Take the interpolatingvector w as defined in Equation (3) . Choose the vectors w − i,i,m and w + i,i,m ,respectively w − l,i,m and w + l,i,m for l = i , to be discretized versions of ω ( x ) and w( x ) as in Subsubsection 5.6.1. Assume that ω ( x ) and w( x ) obtainedby derivative matching are monotone.For such an interpolating vector w , it holds that Γ D ( S, v − S ) = O d X i =1 log( ed i, max ( S )) ! s X m =1 X z ∈{− , + } d (cid:18) nd zm (cid:19) k − Proof.
See Appendix C.6.From Lemma C.1 and the matching of discrete derivatives, it followsthat, if ω and w are monotone and C is chosen large enough Γ D ( S, v − S ) = O d X i =1 log( ed i, max ( S )) ! s X m =1 X z ∈{− , + } d (cid:18) nd zm (cid:19) k − . If the active set S defines a regular grid we therefore have a bound on theeffective sparsity of order Γ D ( S, v − S ) = O (cid:16) s k log( n/s ) (cid:17) . It only remains to check the monotonicity of ω and w . We will do this for k = { , , , } . One can check monotonicity for higher values of k by solving(for instance at the computer) the appropriate system of equations and, say,graphically visualizing the result. We check monotonicity analytically for k = { , , } and computationally for k = 4 .25 .6.3 Interpolating tensor for k = 1 For k = 1 ω ( x ) = 1 − √ x, x ∈ [0 , . and w( x ) = 1 − x, x ∈ [0 , . Both ω and w are monotone. k = 2 For k = 2 ω ( x ) = − √ x / , x ∈ [0 , / ,
127 (1 − x ) , x ∈ [1 / , , and w( x ) = − x , x ∈ [0 , / , (cid:18) − x (cid:19) + 12 , x ∈ [1 / , / . Both ω and w are monotone. k = 3 For k = 3 ω ( x ) = − √ x / , x ∈ [0 , / , x − x + 25576 x + 145228 , x ∈ [1 / , / , − x ) , x ∈ [2 / , , and w( x ) = − x , x ∈ [0 , / , − (cid:18) − x (cid:19) + 2 (cid:18) − x (cid:19) + 12 , x ∈ [1 / , / . Both ω and w are monotone. 26 .6.6 Interpolating tensor for k = 4 For k = 4 ω ( x ) = − . x / , x ∈ [0 , / , . x − . x + 12 . x − . x + 1 . , x ∈ [1 / , / , − . x + 78 . x − . x + 26 . x − . , x ∈ [1 / , / , . − x ) , x ∈ [3 / , , and w( x ) = − . x , x ∈ [0 , / , x − . x + 7 . x − . x + 1 . , x ∈ [1 / , / , − . (cid:18) − x (cid:19) + 2 . (cid:18) − x (cid:19) + 12 , x ∈ [1 / , / . Both ω and w are monotone. Theorem 5.2 follows by combining Theorem A.1 with a bound on the effec-tive sparsity.Lemma 5.13 uses Lemma 5.9 to give us a bound on the effective sparsityholding for all sign configurations. This bound is based on a specific formof the interpolating tensor, obtained by derivative matching as explained inSubsection 5.6.1. The interpolating tensor obtained by derivative match-ing is valid if the monotonicity of ω and w is guaranteed. In Subsections5.6.3-5.6.6 we check that the interpolating tensors obtained by derivativematching for k = { , , , } satisfy the monotonicity requirement.There is also a constraint posed by the noise weights and by the constant C to satisfy. Lemma 5.10 gives a valid bound on the antiprojections. If wechoose ˜ γ = Cd vuut d X i =1 (cid:18) d i, max ( S ) n i (cid:19) k − the noise weights given in Equation (2) are valid noise weights, accordingto Lemma 5.11. By Lemma 5.12, interpolating tensors of the form givenin Equation (3) are valid interpolating tensors. The tensor obtained bythe discretization of the result of derivative matching has such a form (as min m ∈ [ s ] min i ∈ [ d ] min { d − i,m , d + i,m } → ∞ ).According to Equation (4) in Subsubsection 5.6.1 one has to choose C ≥ k k − /a as min i ∈ [ d ] min m ∈ [ s ] min { d − i,m , d + i,m } → ∞ . a are given in Subsubsections 5.6.3-5.6.6.Theorem 2.2 by Ortelli and van de Geer [14], on which Theorem A.1 isbased, uses a bound on the increments of empirical process { ǫ ′ f, f ∈ R n } ,where ǫ has i.i.d. entries. Theorem A.1 involves in the background anempirical process, whose increments are given by n ,...,n d X ,..., ( ǫ N ⊥ k ⊙ f ) j ,...,j d , f ∈ R n × ... × n d Note that the entries of ǫ N ⊥ k = P N ⊥ k ǫ are correlated. However, by the idem-potence of orthogonal projections, we can work with uncorrelated errors andinstead restrict to tensors f N ⊥ k ∈ N ⊥ k . Indeed P n ,...,n d ,..., ( ǫ N ⊥ k ⊙ f ) j ,...,j d = P n ,...,n d ,..., ( ǫ ⊙ f N ⊥ k ) j ,...,j d . This allows us to take over the arguments ofTheorem 2.2 by Ortelli and van de Geer [14]. (cid:3) Remark (The influence of the dimensionality) . If we choose λ ≍ ˜ γλ ( t ) ,the rate of the oracle inequality is ˜ γ P sm =1 P z ∈{− , + } d ( n/d zm ) k − /n , up tologarithmic factors. For simplicity, let S define a regular grid. Then the(hyper-)volume of one of the s hyperrectangles of the tessellation scales as d zm ≍ n/s . Hence the scaling P sm =1 P z ∈{− , + } d ( n/d zm ) k − ≍ s k . However ˜ γ , the maximal length of an antiprojection, scales as ˜ γ ≍ ( s − d ) k − , where s − d ≍ d i, max /n i is proportional to the side length of a hyperrectangle of thetessellation, up to the exponent k − . 
The influence of the dimensionalityin the exponent of s is a consequence of the different scaling of volume andside length of a hyperrectangle in d -dimensions. The (hyper-)volume scalesas s − while the side length scales as s − d . The reason for this discrep-ancy is that we are not able to find an upper bound for the noise weightsproportional the volume of the hyperrectangles, i.e., to the product of sidelengths. The bound we obtain involves rather the sum of side lengths. ℓ -rates Theorem 3.2 about not-so-slow rates for trend filtering is based on The-orem A.2, where the choice of the active set S is aribtrary. The cri-terion guiding choice of S is to get an “as small as possible” value ofthe inverse scaling factor ˜ γ . Recall that the inverse scaling factor ˜ γ isthe maximal length of the antiprojection of a dictionary atom ˜ φ kj ,...,j d onto the set of dictionary atoms indexed by the active set S , that ist ˜ γ ≥ max ( j ,...,j d ) ∈ [ n ] × ... × [ n d ] k A S ˜ φ kj ,...,j d k / √ n .28he active set S could be chosen as a regular grid parallel to the coordi-nate axes. However, we will show that we can shorten the maximal lengthof the antiprojections by choosing an active set defining a so-called “meshgrid”, whose construction we illustrate hereafter. Let δ ∈ N . For a coordinate i ∈ [ d ] , we define the set of indices Z i ( l ) suchthat Z i ( l ) = { δ dl equispaced indices in [ n i ] } , l ∈ [ d ] and Z i (1) ⊇ Z i (2) ⊇ . . . ⊇ Z i ( d ) . If, for any l ∈ [ d ] , n i is not a multiple of δ dl , we relax the requirement onthe indices to be approximately equispaced, i.e., the distance between allthe indices has to be asymptotically of the same order. For i ∈ [ d ] , we alsodefine ˜ Z i ( l ) = k − [ h =0 { Z i ( l ) + h } , l ∈ [ d ] . Let now ( l , . . . , l d ) ∈ [ d ] d be a tuple of indices. We define the set S := { ( l , . . . , l d ) ∈ [ d ] d : |{ i ∈ [ d ] : l i ≤ z }| ≤ z, ∀ z ∈ [ d ] } . Definition 6.1 (Mesh grid) . A mesh grid S is defined as S := [ ( l ,...,l d ) ∈S (cid:16) × i ∈ [ d ] Z i ( l i ) (cid:17) . Figure 2 illustrates a mesh grid for d = 2 .We now want to enlarge a mesh grid S to allow us to handle k th -ordertrend filtering for k > . Definition 6.2 (Enlarged mesh grid) . An enlarged mesh grid ˜ S is defined as ˜ S := [ ( l ,...,l d ) ∈S (cid:16) × i ∈ [ d ] ˜ Z i ( l i ) (cid:17) . Figure 3 illustrates an enlarged mesh grid for d = 2 and k = 2 .Let s := | S | and ˜ s := | ˜ S | . It holds that s ≍ ˜ s ≍ Q di =1 δ di ≍ δ dH ( d ) ,where H ( d ) = P di =1 /i is the d th harmonic number. Therefore δ ≍ s dH ( d ) .29igure 2: Mesh grid for d = 2 . Figure 3: Enlarged mesh grid for d =2 and k = 2 . ˜ S is an enlarged mesh grid We will now show that we can find a smaller bound on the inverse scalingfactor if we choose ˜ S to be an enlarged mesh grid rather than an enlargedregular grid. Lemma 6.3 (Inverse scaling factor when ˜ S is an enlarged mesh grid) . Let n ≍ . . . ≍ n d and ˜ S be an enlarged mesh grid. It holds that ˜ γ ( ˜ S ) = O (cid:18) s − k − H ( d ) (cid:19) . Proof.
See Appendix D.1.
Theorem 3.2 follows from Theorem A.2. Theorem A.2 is allowed to havecorrelated errors for the same reasons as Theorem A.1 is, see the proof ofTheorem 5.2 in Subsection 5.7.In Theorem A.2 we set x ≍ t ≍ log n . We can then choose the freeparameters S and g ∈ R n × ... × n d independently of each other. Rememberthat the normalization of the TV is included in the definition of the analysisoperator D k . Therefore it is natural to restrict the choice of g to the class { f : k D k f k = O (1) } .We can then choose S to trade off the terms ˜ γλ (log n ) ≍ ˜ γ log / ( n ) /n / and s/n . Typically, if we require S to have a regular structure, we obtain30 γ = O ( s − h ) for some h = h ( d, k ) ∈ R . The tradeoff is achieved by choosing s ≍ n h ) log h ) n and gives the slow rate sn ≍ n − h ) log h ) n. We choose the active set to be an enlarged mesh grid ˜ S . Then, byLemma 6.3, we can choose ˜ γ = O (cid:18) s − k − H ( d ) (cid:19) and the claim follows. (cid:3) Remark (Mesh grids vs. regular grids) . If we choose a regular grid as activeset, according to Lemma 5.11 we obtain ˜ γ ≍ s − k − d and a slow rate n − d +2 k − d +2 k − log d d +2 k − ( n ) , which is slower than the rate obtained with an active set defining a meshgrid. Indeed, for all d ≥ it holds that H ( d ) ≤ d .In both cases, the slow rate for fixed k goes to n − / log / ( n ) as d → ∞ .If d is fixed, the slow rates goes to n − as k → ∞ . Remark (Allow λ to depend on TV k ) . In the proof of Theorem 3.2 wecan also drop the restriction g ∈ { f : k D k f k = O (1) } and trade off ˜ γ log / ( n )TV k ( g ) /n / and s/n . The tradeoff results in the choice λ ≍ n − H ( d )+2 k − H ( d )+2 k − log H ( d )2 H ( d )+2 k − ( n ) TV k ( g ) − k − H ( d )+2 k − and gives the rate n − H ( d )+2 k − H ( d )+2 k − log H ( d )2 H ( d )+2 k − ( n ) TV k ( g ) H ( d )2 H ( d )+2 k − . In the previous sections we have shown how to estimate f N ⊥ k by trendfiltering and have established fast adaptive ℓ -rates and not-so-slow ℓ -rates.There is still an open question: how to estimate f N k ?If n ≍ . . . ≍ n d , the dimension of N k is of order n − /d . Estimating f N k by least squares would result in a rate of order n − /d and therefore belimiting for d ≥ .The approach we take is to decompose N k into lower dimensional mutu-ally orthogonal linear spaces, the so-called marginal linear spaces, to whichwe can apply a lower dimensional version of trend filtering.31et P [ d ] denote the power set of [ d ] := { , . . . , d } . We consider sets ofcoordinate indices M ⊆ [ d ] .The intuition behind the decomposition into margins is to partition theset of tensor indices into d subsets as [1 : k ] ∪ [ k + 1 : n ] × . . . × [1 : k ] ∪ [ k + 1 : n ] . For M ∈ P [ d ] define the set of indices I kM = × i ∈ M [ k + 1 : n i ] × i M [1 : k ] . We moreover define the linear spaces M ( M ) = span { ˜ φ kk ,...,k d , ( k , . . . , k d ) ∈ I kM } , M ∈ P [ d ] . Note that in one dimension, { ˜ φ kj } j ∈ [ k ] and { ˜ φ kj } j ∈ [ k +1: n ] are orthogonalto each other. Moreover, M △ M ′ = ∅ , for M = M ′ ∈ P [ d ] . Because of theproduct structure of the dictionary atoms spanning M this means that any M ( M ) and M ( M ′ ) are mutually orthogonal, for M = M ′ .The mutually orthogonal marginal linear subspaces {M ( M ) } M ∈P [ d ] par-tition R n × ... × n d . The dimension of M ( M ) is given by k d −| M | Q i ∈ M ( n i − k ) .By the multi-binomial theorem it holds that d Y i =1 n i = X M ∈P [ d ] k d −| M | Y i ∈ M ( n i − k ) for k ∈ [0 : min l ∈ [ d ] n l − . 
This means that X M ∈P [ d ] dim( M ( M )) = n and because {M ( M ) } M ∈P [ d ] are mutually orhtogonal it follows that theyalso partition R n × ... × n d .We can further partition any M ( M ) into k d −| M | mutually orthogonalsubspaces M ( M, h ) , h ∈ [1 : k ] d −| M | .The partition results by defining the set of indices I kM,h := × i ∈ M [ k + 1 : n i ] × i M { h i } and the linear subspaces M ( M, h ) := span { ˜ φ kk ,...,k d , ( k , . . . , k d ) ∈ I kM,h } . Again, {M ( M, h ) } h ∈ [ k ] d −| M | ,M ∈P [ d ] are mutually orthogonal and partition R n × ... × n d . 32 efinition 7.1 (ANOVA decomposition) . The decomposition of a tensor f as f = X M ∈P [ d ] X h ∈ [1: k ] d −| M | f M ( M,h ) is called ANOVA decomposition. By orthogonality we have that k f k = X M ∈P [ d ] X h ∈ [1: k ] d −| M | k f M ( M,h ) k . Our aim is to apply a lower dimensional version of trend filtering to estimate f M ( M,h ) , for M = ∅ . For M = ∅ it holds that | I kM = ∅ | = k d = O (1) . Wewill therefore estimate f M ( ∅ ,h ) by the least squares estimate Y M ( ∅ ,h ) at theparametric rate n − .To apply a lower dimensional version of trend filtering to estimate f M ( M,h ) we first need to reinterpret f M ( M,h ) as a | M | -dimensional tensor.We then need to justify why we can apply Theorems A.1 and A.2 whichrequire iid errors and are at the base of the adaptive rates by Theorem 5.2and the not-so-slow rates by Theorem 3.2.By writing f M ( M,h ) = ¯ f M ( M,h ) × × i M ˜ φ kh i , ¯ f M ( M,h ) ∈ R × i ∈ M n i we can interpret f M ( M,h ) as a M -dimensional object.Similarly, we can write Y M ( M,h ) = ¯ Y M ( M,h ) × × i M ˜ φ kh i , ¯ Y M ( M,h ) ∈ R × i ∈ M n i . Let n M := Q i ∈ M n i . Because of the (partial) product structure of f M ( M,h ) and since k ˜ φ kh i k = n i , h i ∈ [ k ] (cf. Definition 4.1), it holds that k f M ( M,h ) k /n = k ¯ f M ( M,h ) k /n M . Thanks to the above equation and to the ANOVA decomposition wecan add up the rates of estimation of the margins to estimate the wholetensor. 33 .2 The estimator for the lower-dimensional margins
7.2 The estimator for the lower-dimensional margins

For M ∈ P[d] \ {∅} define

D^k_M := n_M^{k-1} ∏_{i ∈ M} D^k_i.

To estimate the whole tensor, we consider the estimator

f̂ = Σ_{M ∈ P[d]} Σ_{h ∈ [k]^{d-|M|}} f̂_{M(M,h)},

where

f̂_{M(M,h)} = f̂̄_{M(M,h)} × ×_{i ∉ M} φ̃^k_{h_i},  f̂̄_{M(M,h)} ∈ R^{×_{i ∈ M} n_i}.

We define

f̂_{M(∅,h)} := Ȳ_{M(∅,h)} (×_{i ∈ [d]} φ̃^k_{h_i}),  ∀ h ∈ [k]^d,

and

f̂̄_{M(M,h)} := arg min_{f ∈ R^{×_{i ∈ M} n_i}} { ‖Ȳ_{M(M,h)} - f‖²/n_M + 2 λ_{M,h} ‖D^k_M f‖_1 },

where {λ_{M,h} > 0, h ∈ [k]^{d-|M|}, M ∈ P[d] \ {∅}} are positive tuning parameters. We call ‖D^k_M f̄_{M(M,h)}‖_1 the k-th order |M|-dimensional Vitali total variation and f̂̄_{M(M,h)} the |M|-dimensional trend filtering estimator.

Remark (We can apply Theorems A.1 and A.2). For f̄ ∈ R^{×_{i ∈ M} n_i} it holds that

ε̄_{M(M,h)} ⊙ f̄ = ε̄_{M(M,h)} ⊙ f̄_{M(M,h)} = Σ_{M' ⊆ M, h'_M = h} ( ε̄_{M(M',h')} × ×_{i ∈ M \ M'} φ̃^k_{h_i} ) ⊙ f̄_{M(M,h)}.

The n_M entries of the tensor Σ_{M' ⊆ M, h'_M = h} ( ε̄_{M(M',h')} × ×_{i ∈ M \ M'} φ̃^k_{h_i} ) are the coefficients of the projection of ε onto the linear space ×_{i ∉ M} φ̃^k_{h_i} × ×_{i ∈ M} R^{n_i} and as such have i.i.d. N(0, σ² n_M/n)-distributed entries. We can therefore apply Theorems A.1 and A.2 with noise variance σ² n_M/n.

Remark (Synthesis form for the estimator of the lower-dimensional margins). The synthesis form of the estimator for the margins can be obtained in a similar way as for the d-dimensional margin (cf. Section 4).
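Each margin-wise problem above is an ordinary ℓ^1-analysis (trend filtering) program and can be handed to a generic convex solver. The one-dimensional sketch below is only an illustration: the helper trend_filter_1d is hypothetical, D^k is taken to be the plain k-th order difference matrix, and the paper's normalization of the Vitali TV is assumed to be absorbed into the tuning parameter lam.

```python
import numpy as np
import cvxpy as cp

def trend_filter_1d(y, k, lam):
    """Sketch: argmin_f ||y - f||_2^2 / n + 2 * lam * ||D^k f||_1,
    with D^k the plain k-th order difference matrix (normalization folded into lam)."""
    n = len(y)
    D = np.eye(n)
    for _ in range(k):
        D = np.diff(D, axis=0)          # k repeated first-order differences
    f = cp.Variable(n)
    objective = cp.sum_squares(y - f) / n + 2 * lam * cp.norm1(D @ f)
    cp.Problem(cp.Minimize(objective)).solve()
    return f.value

# toy example: noisy piecewise linear signal, k = 2
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.minimum(2 * x, 1.5 - x) + 0.1 * rng.standard_normal(x.size)
fhat = trend_filter_1d(y, k=2, lam=1e-3)
```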
8 Denoising the whole tensor

We now put together the results from Sections 5 and 6 with the ANOVA decomposition given in Section 7 to show adaptivity and not-so-slow rates for the estimation of the whole tensor.
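Before stating the results, a small computation may help to compare the not-so-slow exponents as reconstructed above: (H(d)+2k-1)/(2H(d)+2k-1) for mesh grids versus (d+2k-1)/(2d+2k-1) for regular grids, logarithmic factors ignored. Since H(d) ≤ d and the exponent is decreasing in this quantity, mesh grids always give the faster rate; the values printed below are purely illustrative.

```python
from fractions import Fraction

def harmonic(d):
    # d-th harmonic number H(d) = 1 + 1/2 + ... + 1/d
    return sum(Fraction(1, i) for i in range(1, d + 1))

def slow_rate_exponent(d, k, mesh=True):
    # exponent e in the not-so-slow rate n^{-e}, logarithmic factors ignored
    g = harmonic(d) if mesh else Fraction(d)
    return float((g + 2 * k - 1) / (2 * g + 2 * k - 1))

for d in (1, 2, 3, 5, 10):
    print(d, round(slow_rate_exponent(d, 2, mesh=True), 4),
          round(slow_rate_exponent(d, 2, mesh=False), 4))
```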
We fix k ∈ {1, 2, 3, 4}. By S_{M,h} we denote a subset of I^k_{M,h} satisfying the conditions for a hyperrectangular tessellation suitable for derivative matching. By d^z_m(S_{M,h}) we denote an analogue of the quantity d^z_m appearing in Theorem 5.2, but defined on the hyperrectangular tessellation of I^k_{M,h} generated by the enlarged version S̃_{M,h} of the active set S_{M,h}.

The following theorem for denoising the whole tensor Y by means of trend filtering holds.

Theorem 8.1 (Adaptivity of tensor denoising with trend filtering). Choose x, t > 0. Let g ∈ R^{n_1 × ... × n_d} be arbitrary. For M ≠ ∅ and a large enough constant C > 0 depending only on k, choose

λ_{M,h} ≥ C^{|M|} √( Σ_{i ∈ M} (d_{i,max}(S_{M,h})/n_i)^{2k-1} ) λ_0(t + d log(k+1)).

Then, with probability at least 1 - e^{-x} - e^{-t}, it holds that

‖f̂ - f‖²/n ≤ ‖g - f‖²/n + 4 Σ_{M ∈ P[d]\∅, h ∈ [k]^{d-|M|}} λ_{M,h} ‖(D^k_M g)_{-S_{M,h}}‖_1
+ 2σ²/n ( √(x + d log(k+1)) + √(k^d) )²
+ Σ_{M ∈ P[d]\∅, h ∈ [k]^{d-|M|}} σ²/n ( √(x + d log(k+1)) + √(k s_{M,h}) )²
+ O( Σ_{M ∈ P[d]\∅, h ∈ [k]^{d-|M|}} λ²_{M,h} ( Σ_{i ∈ M} log(e d_{i,max}) ) Σ_{m=1}^{s_{M,h}} Σ_{z ∈ {-,+}^{|M|}} (n_M/d^z_m)^{k-1} ).

In particular the constraint on C is C ≥ k k − a with a = , k = 1, √ / ≈ . , k = 2, √ / ≈ . , k = 3, . , k = 4, as

min_{i ∈ M} min_{m ∈ [s_{M,h}]} min_{h ∈ [k]^{d-|M|}} min{d^-_{i,m}(S_{M,h}), d^+_{i,m}(S_{M,h})} → ∞.

Proof.
The result follows by the ANOVA decomposition. In total there are (k+1)^d margins. As a consequence of the union bound, the result for the estimation of the whole tensor is attained with probability at least 1 - e^{-t} - e^{-x} if, in the application of Theorem 5.2, one chooses x + d log(k+1) and t + d log(k+1) instead of x and t, for some x, t > 0.

If the S_{M,h} are chosen to be regular grids and the tuning parameters are chosen as

λ_{M,h} ≍ √( log n / (s_{M,h}^{(2k-1)/|M|} n) ),

then the rate of Theorem 8.1 is

O( (log n / n) Σ_{M ∈ P[d]\∅, h ∈ [k]^{d-|M|}} s_{M,h}^{k(|M|-1)/|M|} log(n_M/s_{M,h}) ).

We now present a not-so-slow rate of estimation for the whole tensor f by trend filtering. We restrict again to tensors with n_1 ≍ ... ≍ n_d ≍ n^{1/d}. We let c_{M,h} > 0 be constants of order O(1). The following theorem holds.

Theorem 8.2 (Not-so-slow ℓ^1-rate for trend filtering). Choose

λ ≍ n^{-(H(d)+2k-1)/(2H(d)+2k-1)} log^{H(d)/(2H(d)+2k-1)}(n).

Then, with probability at least 1 - Θ(1/n), it holds that

‖f̂ - f‖²/n ≤ Σ_{M ∈ P[d]\∅, h ∈ [k]^{d-|M|}} min_{f : ‖D^k_M f‖_1 ≤ c_{M,h}} ‖f̄_{M(M,h)} - f‖²/n_M + O( n^{-(H(d)+2k-1)/(2H(d)+2k-1)} log^{H(d)/(2H(d)+2k-1)}(n) ).

Proof. We apply Theorem A.2 to f̂̄_{M(M,h)} with x + d log(k+1) and t + d log(k+1), as in the proof of Theorem 8.1. We choose x ≍ t ≍ log n. Let S̃_{M,h} be an enlarged mesh grid. We have to trade off, with respect to s̃_{M,h} ≍ s_{M,h}, the terms

(n_M σ²/n) (s_{M,h}/n_M)   and   s_{M,h}^{-(2k-1)/(2H(|M|))} σ √(n_M/n) √(log n/n_M),

where in the second term the first factor is ≍ γ̃ and the second factor is ≍ λ_0(log n). We therefore obtain the rate

O( n^{-(H(|M|)+2k-1)/(2H(|M|)+2k-1)} log^{H(|M|)/(2H(|M|)+2k-1)}(n) ).

Since (H(|M|)+2k-1)/(2H(|M|)+2k-1) is decreasing in |M|, the rate of estimation of the d-dimensional margin is limiting and we obtain the claim.

We have shown that imposing structure to denoise d-dimensional tensors leads to an adaptive reconstruction. The structure is imposed via penalties on the l-dimensional k-th order Vitali TV of the l-dimensional margins of the tensor, for l ∈ [d]. If the tensor is a product of polynomials on a constant number of hyperrectangles of any dimension l ≤ d, then the MSE is bounded as ‖f̂ - f‖²/n = O(log n/n) with high probability. The true tensor f can therefore be reconstructed at an almost parametric rate. The key aspects of our results are: the reformulation of the analysis estimator in synthesis form, the interpolating tensor to bound the effective sparsity, and the ANOVA decomposition of a d-dimensional tensor. In the background of all our results are the projection arguments by Dalalyan et al. [3] to bound the random part of the problem, which are fundamental to prove the adaptivity of f̂ to the underlying unobserved f.

Note that we prove reconstruction for trend filtering of order k ∈ {1, 2, 3, 4}. We are not able to prove that the approach we use to find an interpolating tensor for k ∈ {1, 2, 3, 4} gives a suitable interpolating tensor for general k. Thus, although for each given finite k we can check by computer whether our construction gives an interpolating vector, the problem remains open for general k.

Acknowledgements

We would like to acknowledge support for this project from the Swiss National Science Foundation (SNF grant 200020_169011).
References

[1] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, 2011.
[2] S. Chatterjee and S. Goswami. New risk bounds for 2D total variation denoising. arXiv:1902.01215v2, 2019.
[3] A. Dalalyan, M. Hebiri, and J. Lederer. On the prediction performance of the Lasso. Bernoulli, 23(1):552–581, 2017.
[4] M. Elad, P. Milanfar, and R. Rubinstein. Analysis versus synthesis in signal priors. Inverse Problems, 23(3):947–968, 2007.
[5] B. Fang, A. Guntuboyina, and B. Sen. Multivariate extensions of isotonic regression and total variation denoising via entire monotonicity and Hardy–Krause variation. arXiv:1903.01395v1, 2019.
[6] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.
[7] A. Guntuboyina, D. Lieu, S. Chatterjee, and B. Sen. Adaptive risk bounds in univariate total variation denoising and trend filtering. Annals of Statistics, 48(1):205–229, 2020.
[8] J.-C. Hütter and P. Rigollet. Optimal rates for total variation denoising. JMLR: Workshop and Conference Proceedings, 49:1–32, 2016.
[9] S.-J. Kim, K. Koh, S. Boyd, and D. Gorinevsky. ℓ1 trend filtering. SIAM Review, 51(2):339–360, 2009.
[10] K. Lin, J. Sharpnack, A. Rinaldo, and R. J. Tibshirani. A sharp error analysis for the fused lasso, with application to approximate changepoint screening. Neural Information Processing Systems (NIPS), 2017.
[11] E. Mammen and S. van de Geer. Locally adaptive regression splines. Annals of Statistics, 25(1):387–413, 1997.
[12] F. Ortelli and S. van de Geer. On the total variation regularized estimator over a class of tree graphs. Electronic Journal of Statistics, 12:4517–4570, 2018.
[13] F. Ortelli and S. van de Geer. Synthesis and analysis in total variation regularization. arXiv:1901.06418v1, 2019.
[14] F. Ortelli and S. van de Geer. Prediction bounds for (higher order) total variation regularized least squares. To appear in the Annals of Statistics, 2019.
[15] F. Ortelli and S. van de Geer. Oracle inequalities for square root analysis estimators with application to total variation penalties. Information and Inference: A Journal of the IMA, (iaaa002), 2020.
[16] F. Ortelli and S. van de Geer. Adaptive rates for total variation image denoising. Journal of Machine Learning Research, 21(247):1–38, 2020.
[17] V. Sadhanala and R. J. Tibshirani. Additive models with trend filtering. Annals of Statistics, 47(6):3032–3068, 2019.
[18] V. Sadhanala, Y.-X. Wang, and R. Tibshirani. Total variation classes beyond 1d: minimax rates, and the limitations of linear smoothers. Neural Information Processing Systems (NIPS), 2016.
[19] V. Sadhanala, Y.-X. Wang, J. Sharpnack, and R. Tibshirani. Higher-order total variation classes on grids: minimax theory and trend filtering methods. In Advances in Neural Information Processing Systems, pages 5801–5811, 2017.
[20] J. Sharpnack, A. Rinaldo, and A. Singh. Sparsistency of the edge lasso over graphs. International Conference on Artificial Intelligence and Statistics (AISTATS), 22:1028–1036, 2012.
[21] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.
[22] R. Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. Annals of Statistics, 42(1):285–323, 2014.
[23] R. Tibshirani. Divided differences, falling factorials, and discrete splines: Another look at trend filtering and related problems. arXiv:2003.03886, 2020.
[24] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
[25] S. van de Geer. Estimation and Testing under Sparsity, volume 2159. Springer, 2016.
[26] S. van de Geer. The Lasso with structured design and entropy of (absolute) convex hulls. Preprint, pages 1–24, 2021.
[27] Y.-X. Wang, J. Sharpnack, A. Smola, and R. Tibshirani. Trend filtering on graphs. Journal of Machine Learning Research, 17:15–147, 2016.
A Oracle inequalities with fast and slow rates
In this section we report an oracle inequality with fast rates and one withslow rates. These oracle inequalities correspond to the adaptive and to thenon-adaptive bound of Theorem 2.2 in [14], see also Theorems 2.1 and 2.2in Ortelli and van de Geer [15] and Theorems 16 and 17 in Ortelli andvan de Geer [16] adapted to have an enlarged active set.
Theorem A.1 (Oracle inequality with fast rates). Let g ∈ R^{n_1 × ... × n_d} and S ⊆ ×_{i ∈ [d]} [k+2 : n_i - k] be arbitrary. For x, t > 0, choose λ ≥ γ̃ λ_0(t). Then, with probability at least 1 - e^{-x} - e^{-t}, it holds that

‖(f̂ - f)_{N_k^⊥}‖²/n ≤ ‖g - f_{N_k^⊥}‖²/n + 4λ ‖(D^k g)_{-S}‖_1 + ( σ√(2x/n) + σ√(2ks/n) )² + λ² Γ²_{D^k}(S, v_{-S}, q_S),

where q_S = sign((D^k g)_S).

Theorem A.2 (Oracle inequality with slow rates). Let g ∈ R^{n_1 × ... × n_d} and S ⊆ ×_{i ∈ [d]} [k+2 : n_i - k] be arbitrary. For x, t > 0, choose λ ≥ γ̃ λ_0(t). Then, with probability at least 1 - e^{-x} - e^{-t}, it holds that

‖(f̂ - f)_{N_k^⊥}‖²/n ≤ ‖g - f_{N_k^⊥}‖²/n + 4λ ‖D^k g‖_1 + ( σ√(2x/n) + σ√(2s̃/n) )².

B Proofs of Section 4
B.1 Proof of Lemma 4.2
We prove Lemma 4.2 by induction.
Anchor: k = 1. Note that φ¹_1 = φ̃¹_1 and φ¹_j - φ̃¹_j = α φ¹_1 for some α ∈ R. Therefore D¹φ¹_1 = D¹φ̃¹_1 = 0 and D¹(φ¹_j - φ̃¹_j) = 0. It follows that

(D¹φ̃¹_j)(j') = (D¹φ¹_j)(j') = 1{j' ≥ j} - 1{j' - 1 ≥ j} = 1{j' = j},  j ∈ [2 : n].

Step: k-1 implies k. For j ∈ [k-1] it holds that D^k φ^k_j = D^k φ̃^k_j = D^k φ^{k-1}_j = D^k φ̃^{k-1}_j = 0, since by assumption D^{k-1} φ^{k-1}_j = D^{k-1} φ̃^{k-1}_j = 0 for j ∈ [k-1]. Moreover

D^k φ^k_j = D^k ( Σ_{l ≥ j} φ^{k-1}_l )/n = D¹ ( Σ_{l ≥ j} D^{k-1} φ^{k-1}_l ) = D¹ {1{j' ≥ j}}_{j' ∈ [k:n]} = 0 for j = k and = 1_{{j}} for j ∈ [k+1 : n].

It also holds that φ^k_j - φ̃^k_j = Σ_{l ∈ [k]} α_l φ^l_l, j ∈ [k : n], for some {α_l ∈ R}_{l ∈ [k]}, and therefore D^k φ^k_j = D^k φ̃^k_j, j ∈ [k : n]. □

C Proofs of Section 5
C.1 Proof of Lemma 5.10
To bound the antiprojections we can use the dictionary Φ^k instead of Φ̃^k. Indeed, by Lemma 28 in Ortelli and van de Geer [16], it holds that

‖A_{{φ̃^k_t : t ∈ S̃}} φ̃^k_j‖ ≤ ‖A_{{φ^k_t : t ∈ S̃}} φ^k_j‖,  j ∈ [k+1 : n].

Bound on the antiprojections for d = 1

We first prove that, for m = 1, ..., s,

‖A_S̃ φ̃^k_j‖²/n ≤ ((t_m - j)/n)^{2k-1},  j ∈ R⁻_m = [t⁻_m : t_m],
‖A_S̃ φ̃^k_j‖²/n = 0,  j ∈ R⁰_m = [t_m : t_m + k - 1],
‖A_S̃ φ̃^k_j‖²/n ≤ ((j - t_m - k + 1)/n)^{2k-1},  j ∈ R⁺_m = [t_m + k - 1 : t⁺_m].

We then extend the reasoning to general dimension d.

For any m ∈ [s], we fix j ∈ R⁻_m and approximate φ^k_j by φ^k_{t_m}, ..., φ^k_{t_m+k-1}. By the definition of Φ^k we have that

φ^k_j(j') = n^{-k+1}(j' - j + 1)^{k-1} 1{j' ≥ j},  j' ∈ [n].

Moreover note that for k' ∈ {0, 1, ..., k-1}

Σ_{l=0}^{k'} (-1)^l binom(k', l) φ^k_{t_m+l} = n^{-k'} φ^{k-k'}_{t_m} = n^{-k+1} {(j' - t_m + 1)^{k-k'-1} 1{j' ≥ t_m}}_{j' ∈ [n]}.  (5)

We now express φ^k_j as the sum of a linear combination of φ^k_{t_m}, ..., φ^k_{t_m+k-1} and a remainder. The linear combination will approximate the projection of φ^k_j onto span{φ^k_j : j ∈ S̃}, while the remainder will be an upper bound for the antiprojections.

For all j' ∈ [n] it holds that

φ^k_j(j') = n^{-k+1}(j' - j + 1)^{k-1} ( 1{j ≤ j' ≤ t_m - 1} + 1{j' ≥ t_m} ).

By the binomial theorem

(j' - t_m + 1 + t_m - j)^{k-1} 1{j' ≥ t_m} = Σ_{l=0}^{k-1} binom(k-1, l) (t_m - j)^{k-1-l} (j' - t_m + 1)^l 1{j' ≥ t_m} = Σ_{l=0}^{k-1} binom(k-1, l) (t_m - j)^{k-1-l} n^l φ^{l+1}_{t_m}(j').

By Equation (5) we know that {φ^{l+1}_{t_m}}_{l ∈ [0:k-1]} ⊂ span({φ^k_{t_m+l}}_{l ∈ [0:k-1]}). Therefore, for j ∈ R⁻_m,

‖A_S̃ φ̃^k_j‖² ≤ n^{-2k+2} Σ_{j'=j}^{t_m-1} (j' - j + 1)^{2k-2} ≤ n^{-2k+2} ∫_0^{t_m-j} (j')^{2k-2} dj' ≤ (t_m - j)^{2k-1} / ((2k-1) n^{2k-2}) ≤ n ((t_m - j)/n)^{2k-1}.

Note that the construction of the partially orthonormalized dictionary Φ̃^k can of course also be made starting from the collection of functions {1{j ≤ j'}}_{j ∈ [n]}, j' ∈ [n], instead of {1{j ≥ j'}}_{j ∈ [n]}, j' ∈ [n], cf. Definition 4.1. The resulting dictionaries Φ̃^k coincide, up to permutation of the column indices. As a consequence, the calculation we showed to approximate ‖A_S̃ φ̃^k_j‖² for j ∈ R⁻_m can be carried out with the dictionary Φ̃^k based on {1{j ≤ j'}}_{j ∈ [n]}, j' ∈ [n], to obtain the approximation

‖A_S̃ φ̃^k_j‖² ≤ n ((j - t_m - k + 1)/n)^{2k-1},  j ∈ R⁺_m.

This consideration also applies in higher-dimensional situations.
Bound on the antiprojections for general dimension d

By the same reasons as above, we consider without loss of generality (k_1, ..., k_d) ∈ R^{-,...,-}_m. We decompose φ^k_{k_1,...,k_d} as follows:

φ^k_{k_1,...,k_d}(j_1, ..., j_d) = n^{-k+1} ∏_{i=1}^d (a_i(j_i) + b_i(j_i)),  j_i ∈ [n_i], i ∈ [d],

where

a_i = a_i(j_i) = (j_i - k_i + 1)^{k-1} 1{k_i ≤ j_i ≤ t_{i,m} - 1},
b_i = b_i(j_i) = (j_i - k_i + 1)^{k-1} 1{j_i ≥ t_{i,m}},
c_i = c_i(j_i) = (j_i - k + 1)^{k-1} 1{j_i ≥ k} ≥ a_i + b_i.

Note that a_i, b_i depend on t_{i,m}, while c_i does not. Moreover, for all (l_1, ..., l_d) ∈ [0 : k-1]^d it holds that ×_{i ∈ [d]} {t_{i,m} + l_i} ∈ S̃. Thus, we approximate

‖A_S̃ φ̃^k_{k_1,...,k_d}‖² ≤ n^{-2k+2} Σ_{j_1,...,j_d=1}^{n_1,...,n_d} ( ∏_{i=1}^d (a_i + b_i) - ∏_{i=1}^d b_i )²,

since by Equation (5) the contributions of ∏_{i=1}^d b_i are spanned by the dictionary atoms indexed by S̃. Note that ∏_{i=1}^d (a_i + b_i) - ∏_{i=1}^d b_i is nonzero on

×_{i ∈ [d]} [k_i : n_i] \ ×_{i ∈ [d]} [t_{i,m} : n_i] ⊆ ∪_{i ∈ [d]} ( [k_i : t_{i,m} - 1] × ×_{l ≠ i} [1 : n_l] ).

Moreover, on [k_i : t_{i,m} - 1] × ×_{l ≠ i} [1 : n_l], it holds that ∏_{i=1}^d (a_i + b_i) - ∏_{i=1}^d b_i ≤ a_i ∏_{l ≠ i} c_l. Therefore

‖A_S̃ φ̃^k_{k_1,...,k_d}‖² ≤ n^{-2k+2} Σ_{i=1}^d Σ_{j_1,...,j_d=1}^{n_1,...,n_d} a_i(j_i)² ∏_{l ≠ i} c_l(j_l)².

As in the one-dimensional case, n_i^{-2k+2} Σ_{j_i=1}^{n_i} a_i(j_i)² ≤ n_i ((t_{i,m} - k_i)/n_i)^{2k-1} and n_i^{-2k+2} Σ_{j_i=1}^{n_i} c_i(j_i)² ≤ n_i. It follows that

‖A_S̃ φ̃^k_{k_1,...,k_d}‖² ≤ n Σ_{i=1}^d ((t_{i,m} - k_i)/n_i)^{2k-1}.

Note that as soon as j_i ∈ R⁰_{i,m} for some coordinate i ∈ [d], then a_i(j_i) = 0 and the i-th coordinate does not contribute to the antiprojections. The bounds for all other hyperrectangles R^z_m, z ∈ {-, 0, +}^d, follow by analogous calculations. □

C.2 Proof of Lemma 5.11
For any m ∈ [s] and for any (j_1, ..., j_d) ∈ R_m it holds that

√( Σ_{i=1}^d ṽ_{i,m}(j_i)² ) ≤ Σ_{i=1}^d ṽ_{i,m}(j_i) ≤ Σ_{i=1}^d v_{i,m}(j_i) ( max{d⁻_{i,m}, d⁺_{i,m}}/n_i )^{k-1/2} ≤ ( Σ_{i=1}^d v_{i,m}(j_i) ) √( Σ_{i=1}^d ( max{d⁻_{i,m}, d⁺_{i,m}}/n_i )^{2k-1} ) ≤ v_{j_1,...,j_d} γ̃.

C.3 Proof of Lemma 5.12
Fix i ∈ [d] and m ∈ [s]. Say q_{t_m} = 1. Since w_{i,l,m} ∈ [0, 1], l ≠ i, for any j_i ∈ R⁻_{i,m} ∪ R⁰_{i,m} ∪ R⁺_{i,m} it holds that

∏_{l=1}^d w_{i,l,m}(j_l) ≤ ( 1 - q_{t_m} v_{i,m}(j_i)/C ) ∏_{l ≠ i} w_{i,l,m}(j_l) ≤ 1 - q_{t_m} v_{i,m}(j_i)/C.

Moreover, for any (j_1, ..., j_d) ∈ R_m it holds that

w_{j_1,...,j_d} = (1/d) Σ_{i=1}^d ∏_{l=1}^d w_{i,l,m}(j_l) ≤ (1/d) Σ_{i=1}^d ( 1 - q_{t_m} v_{i,m}(j_i)/C ) = 1 - Σ_{i=1}^d q_{t_m} v_{i,m}(j_i)/(dC) = 1 - v_{j_1,...,j_d}.

Analogous expressions hold if q_{t_m} = -1. The claim follows by noting that the conditions of the definition of an interpolating tensor (Definition 5.8) are satisfied by w. □

C.4 Matching derivatives
To obtain continuous vectors with k-1 continuous derivatives and piecewise constant k-th derivative, we split [0, 1] into N_ω, resp. N_w, intervals of equal length, where N_ω = k, N_w = k+1 if k is odd and N_w = k+2 if k is even. We denote these intervals by {[x_{l-1}, x_l]}_{l=1}^{N_{ω,w}}, with x_0 = 0 and x_{N_{ω,w}} = 1.

We choose

ω(x) = 1 - a_1 x^{k-1/2},  x ∈ [x_0, x_1],
ω(x) = b_{l,k} x^k + b_{l,k-1} x^{k-1} + ... + b_{l,1} x + b_{l,0},  x ∈ [x_{l-1}, x_l], l ∈ [2 : k-1],
ω(x) = c (1 - x)^k,  x ∈ [x_{k-1}, x_k].

We moreover choose

w(x) = 1 - a_1 x^k,  x ∈ [x_0, x_1],
w(x) = b_{l,k} x^k + b_{l,k-1} x^{k-1} + ... + b_{l,1} x + b_{l,0},  x ∈ [x_{l-1}, x_l], l ∈ [2 : N_w/2 - 1],
w(x) = a_L (1/2 - x)^L + ... + a_1 (1/2 - x) + 1/2,  x ∈ [x_{N_w/2 - 1}, x_{N_w/2}],

where L = k-1 if k is even and L = k if k is odd.

We choose both the coefficients (a_1, {b_{l,k}, ..., b_{l,0}}_l, c) of ω and the coefficients (a_1, a_L, ..., a_1, {b_{l,k}, ..., b_{l,0}}_l) of w by derivative matching: we require the k-1 derivatives of the different pieces of the interpolating polynomials to match at the junctions between the intervals. This gives rise to piecewise constant k-th derivatives, with the exception of the interval [x_0, x_1], where ω^{(k)}(x) ≍ -1/√x.

Matching derivatives for ω means solving a system of k(k-1) equations in k(k-1) unknowns. Matching derivatives for w means solving a system of k·k/2 equations in k·k/2 unknowns when k is even and k(k-1)/2 equations in k(k-1)/2 unknowns when k is odd. We therefore do not need to do any derivative matching for k = 1, where we just take ω(x) = 1 - √x and w(x) = 1 - x.

As an alternative to discretizing a continuous version of the interpolating polynomials, one can also proceed by matching discrete differences. The two approaches are equivalent when min_{i ∈ [d]} min_{m ∈ [s]} min{d⁻_{i,m}, d⁺_{i,m}} → ∞ as n → ∞. Discrete derivative matching requires that the counterpart of each interval [x_{l-1} : x_l] contains at least k points. We therefore require that

min{d⁻_{i,m}, d⁺_{i,m}} ≥ (k+2)k,  ∀ i ∈ [d], ∀ m ∈ [s].

We refer to Ortelli and van de Geer [14] for details on discrete derivative matching.
C.5 Partial integration
Some consequences of the fact that both the resulting ω and w have piecewise constant k-th derivatives, with the exception of the interval [0, x_1] where ω^{(k)}(x) ≍ -1/√x, are shown in the next lemma, which is useful to compute the bound on the effective sparsity in Lemma 5.13.

Lemma C.1 (Discrete differences of some polynomials). Let, for some d ∈ N, d ≥ k,

q_j := (j/d)^{k-1/2},  j = 0, ..., d.

Then n^{-k+2} ‖D^k q‖_1 = O(log(ed)/d^{k-1}). Let, for some d ∈ N, d ≥ k,

p_j := (j/d)^k,  j = 0, ..., d.

Then n^{-k+2} ‖D^k p‖_1 = O(1/d^{k-1}).

Proof. We have for j ≥ k

n^{-k+2} (D^k q)_j = Σ_{l=0}^k binom(k, l) (-1)^l ((j-l)/d)^{k-1/2} = (j/d)^{k-1/2} [ Σ_{l=0}^k binom(k, l) (-1)^l (1 - l/j)^{k-1/2} ].

We do a (k-1)-term Taylor expansion of x ↦ (1-x)^{k-1/2} around x = 0:

(1-x)^{k-1/2} = Σ_{i=0}^{k-1} a_i x^i + rem(x),

where a_0 = 1, a_1 = -(k-1/2), ..., a_{k-1} are the coefficients of the Taylor expansion and where the remainder rem(x) satisfies sup_{0 ≤ x ≤ 1/2} |rem(x)| = O(|x|^k). Thus

Σ_{l=0}^k binom(k, l) (-1)^l (1 - l/j)^{k-1/2} = Σ_{l=0}^k binom(k, l) (-1)^l ( Σ_{i=0}^{k-1} a_i (l/j)^i + rem(l/j) ),

where

Σ_{l=0}^k binom(k, l) (-1)^l Σ_{i=0}^{k-1} a_i (l/j)^i = 0,

since {Σ_{i=0}^{k-1} a_i (l/j)^i}_{l=0}^k is a polynomial of degree k-1 in l and hence its k-th order differences are zero. It follows that for j ≥ k,

| Σ_{l=0}^k binom(k, l) (-1)^l (1 - l/j)^{k-1/2} | ≤ Σ_{l=0}^k binom(k, l) |rem(l/j)| = O(1/j^k).

Then for j ≥ k, n^{-k+2} (D^k q)_j = O( 1/(j^{1/2} d^{k-1/2}) ). So n^{-k+2} ‖D^k q‖_1 = O( log(ed)/d^{k-1} ).

For p the same arguments go through. We obtain that n^{-k+2} (D^k p)_j = O(1/d^k) and so n^{-k+2} ‖D^k p‖_1 = O(1/d^{k-1}). □

C.6 Proof of Lemma 5.13

We prove a bound on the effective sparsity holding for every sign configuration. We eliminate the dependence on the sign configuration by decoupling partial integration on the whole interpolating tensor (‖(D^k)'w‖_1) into taking k-th order differences on the hyperrectangles {R_m}_{m=1}^s (‖D^k w(R_m)‖_1, where w(R_m) = {w_{j_1,...,j_d}}_{(j_1,...,j_d) ∈ R_m} denotes the restriction of the interpolating tensor w to the set of indices R_m).

To do this, we define the boundary B(R_m) of a hyperrectangle R_m as

B(R_m) := R_m \ ×_{i ∈ [d]} [t⁻_{i,m} + k : t⁺_{i,m} - k].

It holds that

n^{-k+1} ‖(D^k)'w‖_1 = O( Σ_{m=1}^s ( n^{-k+1} ‖D^k w(R_m)‖_1 + ‖w(B(R_m))‖_1 ) ).

By the definition of the interpolating tensor w it holds that

n^{-k+1} ‖D^k w(R_m)‖_1 = O( n^{-k+1} Σ_{i=1}^d ‖D^k ×_{l ∈ [d]} w_{l,i,m}‖_1 ) = O( n^{-k+1} Σ_{i=1}^d ∏_{l=1}^d ‖D^k w_{l,i,m}‖_1 )
= O( Σ_{i=1}^d ∏_{l=1}^d [ n_l^{-k+1} ‖D^k w⁻_{l,i,m}‖_1 + Σ_{j_l=t_{l,m}-k}^{t_{l,m}-1} (1 - w⁻_{l,i,m}(j_l)) + n_l^{-k+1} ‖D^k w⁺_{l,i,m}‖_1 + Σ_{j_l=t_{l,m}+k}^{t_{l,m}+2k-1} (1 - w⁺_{l,i,m}(j_l)) ] ),

where the sums stem from the differences involving the constant part of w on R⁰_{l,m}. Because of the form chosen for ω and w, it holds that

Σ_{j_l=t_{l,m}-k}^{t_{l,m}-1} (1 - w⁻_{l,i,m}(j_l)) = O(1 - ω(1/d⁻_{i,m})) = O(1/(d⁻_{i,m})^{k-1/2}) for l = i,
Σ_{j_l=t_{l,m}-k}^{t_{l,m}-1} (1 - w⁻_{l,i,m}(j_l)) = O(1 - w(1/d⁻_{l,m})) = O(1/(d⁻_{l,m})^k) for l ≠ i.

A similar bound holds for Σ_{j_l=t_{l,m}+k}^{t_{l,m}+2k-1} (1 - w⁺_{l,i,m}(j_l)).
By Lemma C.1 it holds that

n_l^{-k+1} ‖D^k w⁻_{l,i,m}‖_1 = O( log(e d⁻_{i,m}) / (d⁻_{i,m})^{k-1} ) for l = i,
n_l^{-k+1} ‖D^k w⁻_{l,i,m}‖_1 = O( 1/(d⁻_{l,m})^{k-1} ) for l ≠ i.
A similar bound holds for n_l^{-k+1} ‖D^k w⁺_{l,i,m}‖_1.

We now just have to upper bound the contributions of the boundaries B(R_m). For k = 1, w(B(R_m)) = 0 for all m ∈ [s], and the boundaries do not contribute to the effective sparsity. For k ≥ 2 it holds that

Σ_{B(R_m)} w_{j_1,...,j_d} = O( Σ_{i=1}^d Σ_{B(R_m)} ∏_{l=1}^d w_{l,i,m}(j_l) ) = O( Σ_{i=1}^d Σ_{z ∈ {-,+}^d} (d^z_m)^{k-1} ),

since all the contributions on the boundaries have the same dependence on k and we can approximate the volume of the boundaries by the sum of the volumes of the 2^d fractions {R^z_m}_{z ∈ {-,+}^d} of the hyperrectangle.

It therefore holds that

n^{-k+1} ‖(D^k)'w‖_1 = O( ( Σ_{i=1}^d log(e d_{i,max}(S)) ) Σ_{m=1}^s Σ_{z ∈ {-,+}^d} (d^z_m)^{k-1} )

and the claim follows. □

D Proofs of Section 6
D.1 Proof of Lemma 6.3
Setting
To calculate the inverse scaling factor when the active set is an enlarged mesh grid S̃, we decompose a dictionary atom – which is a product of sums – into a sum of products. Some of the components will be spanned by the dictionary atoms indexed by the mesh grid. The remaining components will contribute to the antiprojection.

By Lemma 28 in Ortelli and van de Geer [16] we can look at the dictionary atoms φ^k_{j_1,...,j_d} instead of φ̃^k_{j_1,...,j_d}; see also the proof of Lemma 5.10 in Appendix C.1. We therefore consider φ^k_{j_1,...,j_d} = φ^k_{j_1} × ... × φ^k_{j_d}, where, for i ∈ [d],

φ^k_{j_i} = n_i^{-k+1} (j - j_i + 1)^{k-1} 1{j ≥ j_i}.

Projection of the mesh grid on single coordinates

Now choose z_{i,l} ∈ Z_i(l) such that j_i ≤ z_{i,1} ≤ ... ≤ z_{i,d-1} ≤ z_{i,d}. By the definition of the mesh grid we can choose z_{i,l} ∈ Z_i(l) such that

• |j_i - z_{i,1}| = O(n_i/s^{1/H(d)});
• |z_{i,l} - z_{i,l-1}| = O(n_i/s^{1/(l H(d))}), l ∈ [2 : d];
• |z_{i,d}| ≤ n_i.

The decomposition
We now decompose the factors into sums:

φ^k_{j_i} = Σ_{l=0}^d u_{i,l},

where, for j ∈ [n_i],

u_{i,0} := 1{j ∈ [j_i : z_{i,1} - 1]} n_i^{-k+1} (j - j_i + 1)^{k-1},
u_{i,l} := 1{j ∈ [z_{i,l} : z_{i,l+1} - 1]} n_i^{-k+1} (j - j_i + 1)^{k-1},  l ∈ [1 : d-1],
u_{i,d} := 1{j ∈ [z_{i,d} : n_i]} n_i^{-k+1} (j - j_i + 1)^{k-1}.

Note that {u_{i,l}}_{l=0}^d are mutually orthogonal. Thanks to the decomposition of the factors, the following decomposition of the dictionary atom φ^k_{j_1,...,j_d} holds:

φ^k_{j_1,...,j_d} = Σ_{(l_1,...,l_d) ∈ [0:d]^d} ∏_{i=1}^d u_{i,l_i},

where {∏_{i=1}^d u_{i,l_i}}_{(l_1,...,l_d) ∈ [0:d]^d} are mutually orthogonal. We therefore obtain a decomposition of a product of sums into a sum of products.

Partitioning the decomposition
We now partition {(l_1, ..., l_d) ∈ [0:d]^d} into two subsets: Σ and Σ^c. Define

Σ := {(l_1, ..., l_d) ∈ [0:d]^d : |{i ∈ [d] : l_i ≤ z}| ≤ z, ∀ z ∈ [0:d]}.

This means that Σ contains tuples (l_1, ..., l_d) having at most d entries with value at most d, at most d-1 entries with value at most d-1, ..., at most one entry with value at most 1, and no entry with value 0.

Connecting the decomposition with the enlarged mesh grid

We now want to show that, for any (l_1, ..., l_d) ∈ Σ, ∏_{i=1}^d u_{i,l_i} can be obtained as a linear combination of {φ^k_{j_1,...,j_d}}_{(j_1,...,j_d) ∈ S̃}. These components will approximate the projection of any φ^k_{j_1,...,j_d} onto the linear span of {φ^k_{j_1,...,j_d}}_{(j_1,...,j_d) ∈ S̃}.

For l_i ∈ [1 : d-1] it holds that

u_{i,l_i}(j) = 1{z_{i,l_i} ≤ j} n_i^{-k+1} (j - j_i + 1)^{k-1} - 1{z_{i,l_i+1} ≤ j} n_i^{-k+1} (j - j_i + 1)^{k-1}.

In analogy to the proof of Lemma 5.10 (use the binomial theorem and Equation (5)), it holds that u_{i,l_i} ∈ span( {φ^k_{z_{i,l_i}+h}}_{h=0}^{k-1} ∪ {φ^k_{z_{i,l_i+1}+h}}_{h=0}^{k-1} ). For l_i = d it holds that u_{i,d} ∈ span( {φ^k_{z_{i,d}+h}}_{h=0}^{k-1} ).

We need a claim
We now show that

(l_1, ..., l_d) ∈ Σ ⟹ (l'_1, ..., l'_d) ∈ Σ, where l'_i ≥ l_i ∀ i ∈ [d],

by proving that

(l_1, ..., l_d) ∈ Σ ⟹ (l_1, ..., l_{d-1}, l_d + 1) ∈ Σ,

where without loss of generality we choose the index l_d and assume that l_d ≤ d-1. As a consequence it will follow that, for any (l_1, ..., l_d) ∈ Σ, ∏_{i=1}^d u_{i,l_i} can be obtained as a linear combination of {φ^k_{j_1,...,j_d}}_{(j_1,...,j_d) ∈ S̃}.

We now prove the claim: assume that (l_1, ..., l_d) ∈ Σ, i.e.,

|{i ∈ [d] : l_i ≤ z}| ≤ z,  ∀ z ∈ [0:d].

Take (l'_1, ..., l'_d) with l'_i = l_i for i ∈ [d-1] and l'_d = l_d + 1. Then

|{i ∈ [d] : l'_i ≤ z}| = |{i ∈ [d-1] : l_i ≤ z}| + 1{z ≥ l_d + 1} ≤ z - 1{z ≥ l_d} + 1{z ≥ l_d + 1} ≤ z.

Therefore (l'_1, ..., l'_d) ∈ Σ and the claim is proved.

Approximating the antiprojections
Thanks to the above claim and to the mutual orthogonality of the elements of {∏_{i=1}^d u_{i,l_i}}_{(l_1,...,l_d) ∈ [0:d]^d}, we can approximate as follows:

‖A_S̃ φ^k_{j_1,...,j_d}‖²/n ≤ Σ_{(l_1,...,l_d) ∉ Σ} ‖∏_{i=1}^d u_{i,l_i}‖²/n = Σ_{(l_1,...,l_d) ∉ Σ} ∏_{i=1}^d ‖u_{i,l_i}‖²/n_i.

It holds that

‖u_{i,l_i}‖²/n_i = O( s^{-(2k-1)/((l_i+1) H(d))} ).

The larger l_i, the larger the contribution of ‖u_{i,l_i}‖²/n_i. It therefore only remains to find the order of the largest contribution(s) indexed by Σ^c. A tuple of indices in Σ^c giving the contribution highest in order is (d-1, ..., d-1). It holds that

‖A_S̃ φ^k_{j_1,...,j_d}‖²/n = O( ∏_{i=1}^d ‖u_{i,l_i}‖²/n_i ) = O( s^{-(2k-1)/H(d)} ).

Since the upper bound does not depend on (j_1, ..., j_d), we read directly that

γ̃ = O( s^{-(2k-1)/(2H(d))} ). □