Optimal Stable Nonlinear Approximation
Albert Cohen, Ronald DeVore, Guergana Petrova, Przemyslaw Wojtaszczyk
aa r X i v : . [ m a t h . NA ] S e p Optimal Stable Nonlinear Approximation
Albert Cohen, Ronald DeVore, Guergana Petrova, and Przemyslaw Wojtaszczyk ∗ September 22, 2020
Abstract
While it is well known that nonlinear methods of approximation can often perform dra-matically better than linear methods, there are still questions on how to measure the optimalperformance possible for such methods. This paper studies nonlinear methods of approximationthat are compatible with numerical implementation in that they are required to be numericallystable. A measure of optimal performance, called stable manifold widths , for approximating amodel class K in a Banach space X by stable manifold methods is introduced. Fundamentalinequalities between these stable manifold widths and the entropy of K are established. Theeffects of requiring stability in the settings of deep learning and compressed sensing are discussed. Nonlinear methods are now used in many areas of numerical analysis, signal/image processing, andstatistical learning. While their improvement of error reduction when compared to linear methodsis well established, the intrinsic limitations of such methods have not been given, at least for whatnumerical analysts would consider as acceptable algorithms.Several notions of widths have been introduced to quantify optimal performance of nonlinearapproximation methods. Historically, the first of these was the Alexandroff width described in [2].Subsequently, alternate descriptions of widths were given in [11]. We refer the reader to [12], wherea summary of different nonlinear widths and their relations to one another is discussed.While these notions of nonlinear widths were shown to monitor certain approximation methodssuch as wavelet compression, they did not provide a realistic estimate for the optimal performanceof nonlinear methods in the context of numerical computation. The key ingredient missing inthese notions of widths was stability. Stability is essential in numerical computation and should beincluded in formulations of the best possible performance by numerical methods.In this paper, we modify the definition of nonlinear widths to include stability. In this way,we provide a more realistic benchmark for the optimal performance of numerical algorithms whoseultimate goal is to recover an underlying function. Such algorithms are the cornerstone of numericalmethods for solving operator equations, statistical methods in regression and classification, and incompressing and encoding signals and images. It turns out that these new notions of widths haveconsiderable interplay with various results in functional analysis, including the bounded approxi-mation property and the extension of Lipschitz mappings. ∗ This research was supported by the NSF Grant DMS 1817603 (RD-GP) and the ONR Contract N00014-17-1-2908(RD). P. W. was supported by National Science Centre, Polish grant UMO-2016/21/B/ST1/00241. A portion of thisresearch was completed when the first three authors were visiting the Isaac Newton Institute. X equippedwith a norm k · k X and we wish to approximate the elements of X with error measured in thisnorm by simpler, less complex elements such as polynomials, splines, rational functions, neuralnetworks, etc. The quality of this approximation is a critical element in the design and analysisof numerical methods. Any numerical method for computing functions is built on some form ofapproximation and hence the optimal performance of the numerical method is no better thanthe optimal performance of the approximation method. Note, however, that it may not be easyto actually design a numerical method in a given applicative context that achieves this optimalperformance. For example, one may not be given a complete access to the target function. This isthe case when we are only given limited data about the target function, as it occurs in statisticallearning and in the theory of optimal recovery.In analyzing the performance of approximation/numerical methods, we typically examine theirperformance on model classes K ⊂ X , i.e., on compact subsets K of X . The model class K summarizes what we know about the target function. For example, when numerically solving apartial differential equation (PDE), K is typically provided by a regularity theorem for the PDE.In the case of signal processing, K summarizes what is known or assumed about the underlyingsignal such as bandlimits in the frequency domain or sparsity.The concept of widths was introduced to quantify the best possible performance of approxima-tion methods on a given model class K . The best known among these widths is the Kolmogorovwidth, which was introduced to quantify the best possible approximation using linear spaces. If X n ⊂ X is a linear subspace of X of finite dimension n , then its performance in approximating theelements of the model class K is given by the worst case error E ( K, X n ) X := sup f ∈ K dist( f, X n ) X . (1.1)The value of n describes the complexity of the approximation or numerical method using the space X n . If we fix the value of n ≥
0, the Kolmogorov n -width of K is defined as d ( K ) X = sup f ∈ K k f k X , d n ( K ) X := inf dim( Y )= n E ( K, Y ) X , n ≥ . (1.2)It tells us the optimal performance possible on the model class K using linear spaces of dimension n for the approximation. Of course, it does not tell us how to select a (near) optimal space Y ofdimension n for this purpose.For classical model classes such as a finite ball in smoothness spaces like the Lipschitz, Sobolev,or Besov spaces, the Kolmogorov widths are known asymptotically. Furthermore, it is often knownthat specific linear spaces of dimension n such as polynomials, splines on uniform partition, etc.,achieve this (near) optimal performance (at least within reasonable constants). This can then beused to show that certain numerical methods, such as spectral methods or finite element methodsare also (near) optimal among all possible choices of numerical methods built on using linear spacesof dimension n for the approximation.Let us note that in the definition of Kolmogorov width, we are not requiring that the mappingwhich sends f ∈ K into an approximation to f is a linear map. There is a concept of linear width which requires the linearity of the approximation map. Namely, given n ≥ K ⊂ X , its linear width d Ln ( K ) X is defined as d L ( K ) X = sup f ∈ K k f k X , d Ln ( K ) X := inf L ∈L n sup f ∈ K k f − L ( f ) k X , n ≥ , (1.3)2here the infimum is taken over the class L n of all continuous linear maps from X into itself withrank at most n . The asymptotic decay of linear widths for classical smoothness classes are known.We refer the reader to the book of Pinkus [21] for the fundamental results for Kolmogorov andlinear widths.There is a general lower bound on the decay of the Kolmogorov width that was given by Carlin [5]. Given n ≥
0, we define the entropy number ε n ( K ) X to be the infimum of all ε > n balls of radius ε cover K . Then, Carl proved that for each r >
0, there is a constant C r suchthat whenever sup m ≥ ( m + 1) r d m ( K ) X is finite, then ε n ( K ) X ≤ C r ( n + 1) − r sup m ≥ ( m + 1) r d m ( K ) X . (1.4)Thus, for polynomial decay rates for approximation of the elements of K by n dimensional linearspaces, this decay rate cannot be better than that of the entropy numbers of K . For many standardmodel classes K , such as finite balls in Sobolev and Besov spaces, the decay rate of d n ( K ) X is muchworse than ε n ( K ) X .During the decade of the 1970’s, it was recognized that the performance of approximation andnumerical methods could be significantly enhanced if one uses certain nonlinear methods of ap-proximation in place of the linear spaces X n . For example, there was the emergence of adaptivefinite element methods in numerical PDEs, the sparse approximation from a dictionary in signalprocessing, and various nonlinear methods for learning. These new numerical methods can beviewed as replacing in the construction of the numerical algorithm the linear space X n by a non-linear manifold M n depending on n parameters. For example, in place of using piecewise linearapproximation on a fixed partition with n cells, one would use piecewise linear approximation ona partition of n cells which would be allowed to vary with the target function. Adaptive finiteelement methods (AFEM) are a primary example of such nonlinear algorithms. Another relevantexample of nonlinear approximation, which is of much interest these days, are neural networks.The parameters of the neural network are chosen depending on the target function (or the availableinformation about the target function given through data observations) and hence is a nonlinearprocedure. The outputs of neural networks with fixed architecture form a nonlinear parametricfamily M n of functions, where n is the number of parameters.When analyzing the performance of numerical algorithms built on some form of approximation(linear or nonlinear), an important new ingredient emerges, namely, the notion of stability. Sta-bility means that when the input (the information about the target function) is entered into thealgorithm, the performance of the algorithm is not severely affected by small inaccuracies. More-over, the algorithm should not be severely effected by small inaccuracies in computation since suchinaccuracies are inevitable. Having this in mind, we are interested in the following fundamentalquestion in numerical analysis: Question:
Given a numerical task on a model class K , is there a best stable numerical algorithmfor this task and accordingly, is there an optimal rate-distortion performance which incorporatesthe notion of stability? In this context, to formulate the notion of best, we need a precise definition of what are admissiblenumerical algorithms. We would like a notion that is built on nonlinear methods of approximationand also respects the requirement of numerical stability. In this paper, we take the view thatnonlinear methods of approximation depending on n parameters are built on two mappings.3 A mapping a = a n : X → R n , which when given f ∈ X chooses n parameters a ( f ) ∈ R n torepresent f . Here, when n = 0, we take R := { } . • A mapping M = M n : R n → X which maps a vector y ∈ R n back into X and is used to buildthe approximation of f . The set M n := { M n ( y ) : y ∈ R n } ⊂ X is viewed as a parametric manifold.Given f ∈ X , we approximate f by A ( f ) = M ◦ a ( f ) := M ( a ( f )). The error for approximating f ∈ X is then given by E a,M ( f ) := k f − M ( a ( f )) k X , and the approximation error on a model class K ⊂ X is E a,M ( K ) X := sup f ∈ K E a,M ( f ) . A significant question is what conditions should be placed on the mappings a, M . If no condi-tions at all are placed on these mappings, we would allow discontinuous or non-measurable map-pings that have no stability and would not be useful in a numerical context. This observation ledto requiring that both mappings a, M at least be continuous and motivated the definition of the manifold width δ n ( K ) X , see [11, 12], δ n ( K ) X := inf a,M E a,M ( K ) X , (1.5)where the infimum is taken over all mappings a : K → R n and M : R n → X with a continuous on K and M continuous on R n . A comparison between manifold widths and other types of nonlinearwidths was given in [12].Note that in numerical applications one faces the following two inaccuracies in algorithms:(i) In place of inputting f into the algorithm, one rather inputs a noisy discretization of f whichcan be viewed as a perturbation of f . So one would like to have the property that when k f − g k X is small then the algorithm outputs M ◦ a ( f ) and M ◦ a ( g ) are close to one another.A standard quantification of this is to require that the mapping A := M ◦ a is a Lipschitzmapping.(ii) In the numerical implementation of the algorithm the parameters a ( f ) are not computedexactly and so one would like to have the property that if a, b ∈ R n are close to one anotherthen M ( a ) and M ( b ) are likewise close. Again, the usual quantification of this in numericalimplementation is that the mapping M : R n → X is a Lipschitz map. This property requiresthe specification of a norm on R n which is controlling the size of the perturbation of a .One simple way to guarantee that these two properties hold is to require that the two mappings a, M are themselves Lipschitz. Note that this requirement implies (i) and (ii) but is indeed stronger.We shall come back to this point later in the paper. At present, this motivates us to introduce thefollowing stable manifold width . We fix a constant γ ≥ a and M that are γ Lipschitz continuous on their domains with respect to a norm k · k Y on R n , that is k a ( f ) − a ( g ) k Y ≤ γ k f − g k X , and k M ( x ) − M ( y ) k X ≤ γ k x − y k Y , x, y ∈ R n . (1.6)4hen, the stable manifold width δ ∗ n,γ ( K ) X of the compact set K ⊂ X is defined as δ ∗ n,γ ( K ) X := inf a,M, k·k Y E a,M ( K ) X , (1.7)where now the infimum is taken over all maps a : K → ( R n , k · k Y ), M : ( R n , k · k Y ) → X , andnorms k · k Y on R n , where a, M are γ Lipschitz.
Remark 1.1.
Note that a rescaling ˜ a ( f ) = ca ( f ) and ˜ M ( x ) = M ( c − x ) leaves E a,M ( K ) X un-changed. Therefore, if a is Lipschitz with constant λ and M is Lipschitz with constant λ we canrescale them to satisfy our definition with constant √ λ λ . We choose the above version of thedefinition for simplicity of notation. Throughout the paper, we use the standard notation ℓ np := ( R n , k · k ℓ p ) (1.8)for the space R n equipped with the ℓ p norm, and use k · k ℓ np when we need to stress the dependenceon n , or simply k · k ℓ p when there is no ambiguity, for the corresponding ℓ p norm.The stable manifold width defined above gives a benchmark for accuracy which no Lipschitzstable numerical algorithm can exceed when numerically recovering the model class K . Note,however, that whether there is a numerical procedure that can achieve this accuracy depends inpart on what access is available to the target functions from K . In typical numerical settings,one may not have full access to f and this would restrict the possible performance of a numericalprocedure. For example, if we are only given partial information in the form of data about f , thenperformance will be limited by the quality of that data.The majority of this paper is a study of this stable manifold width. We begin in the next sectionby discussing some of its fundamental properties. It turns out that some of these properties areclosely connected to classical concepts in the theory of Banach spaces. For example, we prove inTheorem 2.4 that a separable Banach space X has the property that ¯ δ n,γ ( K ) X → n → ∞ , forevery compact set K ⊂ X if and only if X has the γ -bounded approximation property. Here,¯ δ n,γ ( K ) X is a modified stable manifold width, defined the same way as δ ∗ n,γ ( K ) X , with the onlydifference being that the infimum is taken over all a : X → R n defined on the whole space X (ratherthan only on K ) which are γ Lipschitz.The next part of this paper seeks comparison of stable manifold widths of a compact set K ⊂ X with its entropy numbers. In §
3, we show that for a general Banach space X stable manifold widths δ ∗ n,γ ( K ) X essentially cannot go to zero faster than the entropy numbers of K . Namely, we showthat for any r >
0, we have ε n ( K ) X ≤ C ( r, γ )( n + 1) − r sup m ≥ ( m + 1) r δ ∗ m,γ ( K ) X , n ≥ . (1.9)Inequalities of this type are called Carl’s type inequalities since such inequalities were first provedfor Kolmogorov widths by Carl [5]. This inequality says that if δ ∗ n,γ ( K ) X tends to zero like n − r as n tends to infinity, then the entropy numbers must at least do the same. The significance ofCarl’s inequality is that in practice it is usually much easier to estimate the entropy numbers of acompact set K than it is to compute its widths. In fact, the entropy numbers of all classical Sobolevand Besov finite balls in an L p space (or Sobolev space) are known. Note that the assumption of5tability is key here since we show that less restrictive forms of nonlinear widths, for example themanifold widths, do not satisfy a Carl’s inequality.While, the inequality (1.9) is significant, one might speculate that in general ε n ( K ) X may goto zero much faster than δ ∗ n,γ ( K ) X . In §
4, we show that when X is a Hilbert space H , for anycompact set K ⊂ H , we have δ ∗ n, ( K ) H ≤ ε n ( K ) H , n ≥ . (1.10)We prove (1.10) by exploiting well know results from functional analysis (the Johnson-Lindenstraussembedding lemma together with the existence of extensions of Lipschitz mappings). When com-bined with the Carl’s inequalities this shows that δ ∗ n,γ ( K ) H and ε n ( K ) H behave the same whenthe approximation takes place in a Hilbert space H . Thus, the entropy numbers of a compact setprovide a benchmark for the best possible performance of numerical recovery algorithms in thiscase.A central question (not completely answered in this paper) is what are the best comparisonslike (1.10) that hold for a general Banach space X ? In section §
5, we prove some first results of theform (1.10) for more general Banach spaces. Our results show some loss over (1.10) when movingfrom a Hilbert space to a general Banach space in the sense that the constant 3 is now replacedby C n α , where α depends on the particular Banach space. This topic seems to be intimatelyconnected with the problem of extension of Lipschitz maps defined on a subset S of X to all of X .From the viewpoint of approximation theory and numerical analysis, it is also of interest howclassical nonlinear approximation procedures comply with the stability properties proposed in thispaper. This is discussed in § δ ∗ n,γ ( K ) X for classical smoothness classes K used in numerical analysis, for example when K is the unit ball of a Sobolev or Besov space.For now, only in the case X = L is there a satisfactory understanding of this behavior. In this section, we derive properties of the stable manifold width and discuss its relations withcertain concepts in the theory of Banach spaces such as the bounded approximation property. δ ∗ n,γ ( K ) X Let us begin by making some comments on the definition of δ ∗ n,γ ( K ) X presented in (1.7). In thisdefinition, we assumed that the mappings a were Lipschitz only on K . We could have imposedthe stronger condition that a is defined and Lipschitz on all of X . Since this concept is sometimesuseful, we define the modified stable manifold width ¯ δ n,γ ( K ) X := inf a,M, k·k Y sup f ∈ K k f − M ( a ( f )) k X , (2.1)with the infimum now taken over all norms k · k Y on R n and mappings a : X → ( R n , k · k Y ) and M : ( R n , k · k Y ) → X which are γ Lipschitz. Obviously, we have δ ∗ n,γ ( K ) X ≤ ¯ δ n,γ ( K ) X , n ≥ . (2.2)On the other hand, in the case of a Hilbert space H , the following lemma holds.6 emma 2.1. For K ⊂ H a compact convex subset of the Hilbert space H we have δ ∗ n,γ ( K ) H = ¯ δ n,γ ( K ) H , n ≥ . Proof:
Having in mind (2.2), we only need to show that δ ∗ n,γ ( K ) H ≥ ¯ δ n,γ ( K ) H . Let us fix n ≥ a : K → ( R n , k · k Y )be any γ Lipschitz map and let us consider the metric projection P K : H → K of H onto K , P K ( f ) := argmin g ∈ K k g − f k H . Note that P K is 1 Lipschitz map. Therefore, a can be extended to the γ Lipschitz map˜ a := a ◦ P K : H → ( R n , k · k Y )defined on H , and we find that E ˜ a,M ( K ) X = E a,M ( K ) X for any reconstruction map M . Thus, δ ∗ n,γ ( K ) H ≥ ¯ δ n,γ ( K ) H , and the proof is completed. ✷ Remark 2.2.
The above approach relies on properties of metric projections, see [1], and can beused to show intrinsic relations between δ ∗ n,γ ( K ) X and ¯ δ n,γ ( K ) X for certain compact subsets K ⊂ X of a general Banach space X . Remark 2.3.
In the definition of δ ∗ n,γ ( K ) X and ¯ δ n,γ ( K ) X , the space ( R n , k · k Y ) can be replacedby any normed space ( X n , k · k X n ) of dimension n . That is, for example, in the case of δ ∗ n,γ ( K ) X , δ ∗ n,γ ( K ) X = inf a,M,X n sup f ∈ K k f − M ( a ( f )) k X , (2.3) where now the infimum is taken over all normed spaces X n of dimension n with norm k · k X n andall γ Lipschitz maps a : X → ( X n , k · k X n ) and M : ( X n , k · k X n ) → X . Indeed, consider any basis ( φ , . . . , φ n ) of X n . The associated coordinate map κ : X n → R n defined by κ ( g ) = ( x , . . . , x n ) = x for g = P ni =1 x i φ i is an isometry when R n is equiped with the norm k x k Y := k g k X n . For this norm,the maps ˜ a = κ ◦ a : X → R n and ˜ M = M ◦ κ − : R n → X have the same Lipschitz constants as a : X → X n and M : X n → X , which shows the equivalence between the two definitions. ¯ δ n,γ ( K ) X tend to zero as n → ∞ ? We turn next to the question of whether ¯ δ n,γ ( K ) X tends to zero for all compact sets K ⊂ X . Inorder to orient this discussion, we first recall results of this type for other widths and for otherclosely related concepts in the theory of Banach spaces.Let X be a separable Banach space. While the Kolmogorov widths d n ( K ) X tend to zero as n → ∞ for each compact set K ⊂ X , notice that this definition of widths says nothing about howthe approximants to a given f ∈ K are constructed. In the definition of the linear widths d Ln ( K ) X ,see (1.3), it is required that the approximants to f are constructed by finite rank continuous linearmappings. In this case, it is known that a necessary and sufficient condition that these widths tendto zero is that X has the approximation property , i.e. for each compact subset K ⊂ X , there is asequence of bounded linear operators T n of finite rank at most n such thatsup f ∈ K k f − T n ( f ) k X → , n → ∞ . (2.4)7n the definition of approximation property, the norms of the operators T n are allowed to growwith n . A second concept of γ - bounded approximation property requires in addition that there isa γ ≥ k T n k ≤ γ holds for the operators in (2.4).The main result of this section is the following theorem which characterizes the Banach spaces X for which every compact subset K ⊂ X has the property ¯ δ n,γ ( K ) X → n → ∞ . Theorem 2.4.
Let X be a separable Banach space and γ ≥ . The following two statements areequivalent: (i) ¯ δ n,γ ( K ) X → as n → ∞ for every compact set K ⊂ X . (ii) X has the γ -bounded approximation property. Before going further, we state a lemma that we use in the proof of the above theorem. Theproof of the lemma is given after the proof of the theorem.
Lemma 2.5.
Let k · k Y be a norm on R n , n ≥ , and X be any separable Banach space. If M : R n → X is a γ Lipschitz mapping, then for any bounded set S ⊂ R n , and any ε > , thereexists a map M : R n → X , M = M ( S, ε ) , with the following properties: (i) M is Lipschitz with constant γ . (ii) M has finite rank, that is M ( R n ) is a subset of a finite dimensional subspace of X . (iii) M approximates M to accuracy ε on S , namely k M − M k L ∞ ( S,X ) := max x ∈ S k M ( x ) − M ( x ) k X ≤ ε. (2.5) Proof of Theorem 2.4:
First, we show that (ii) implies (i). If X has the γ -bounded approxi-mation property, then given any compact set K ⊂ X , there is a sequence of operators { T n } , n ≥ T n : X → X n with X n of dimension at most n , with operator norms k T n k ≤ γ , andsup f ∈ K k f − T n ( f ) k X → , n → ∞ . (2.6)Consider the mappings a := γ − T n : X → X n , M := γId : X n → X n ⊂ X. Each of these mappings is Lipschitz with Lipschitz constant at most γ and M ◦ a = T n . By virtueof (2.6) and Remark 2.3, we have that ¯ δ n,γ ( K ) X → n → K is any compactset in X . From the definition of ¯ δ n,γ ( K ) X , there exist γ Lipschitz mappings a n : X → R n , M n : R n → X, with some norm k · k Y n on R n andsup f ∈ K k f − M n ◦ a n ( f ) k X → , n → ∞ . We take ε = 1 /n in Lemma 2.5 and let M n be the modified mapping for M n guaranteed by thelemma with the set S being a n ( K ). Then the mapping T n : X → X defined by T n := M n ◦ a n γ Lipschitz and has a finite rank. Moreover, since for every f ∈ K , k f − T n ( f ) k X ≤ k f − M n ◦ a n ( f ) k X + k M n ◦ a n ( f ) − M n ◦ a n ( f ) k X ≤ k f − M n ◦ a n ( f ) k X + 1 /n, one has sup f ∈ K k f − T n ( f ) k X → , n → ∞ . To complete the proof, we use Theorem 5.3 from [15], see also the discussions in [14, 13], to concludethat X has the γ -bounded approximation property. ✷ We now proceed with the proof of the lemma.
Proof of Lemma 2.5:
We fix the value of n ≥ k · k Y on R n . We will prove theapparently weaker statement that for any ε, δ >
0, there exists a ( γ + δ ) Lipschitz map f M : R n → X with finite rank such that k M − f M k L ∞ ( S,X ) := max x ∈ S k M ( x ) − f M ( x ) k X ≤ ε. (2.7)Once we construct f M , we obtain the claimed statement by taking M = γγ + δ f M .
Clearly, M will satisfy (i), (ii), and (iii), since k M − M k L ∞ ( S,X ) ≤ k M − f M k L ∞ ( S,X ) + k f M − M k L ∞ ( S,X ) ≤ ε + δγ + δ max x ∈ S k f M ( x ) k X < ε + δγ max x ∈ S k f M ( x ) k X , where δ and ε are arbitrarily small and S ⊂ R n is bounded.The construction of f M from M proceeds in 3 steps, where one of the main issues is to keepcontrol of the Lipschitz constants. Step 1:
Let us fix δ >
0. In this step, we construct a map M that agrees with M on S , takes theconstant value M (0) outside of a larger set that contains S , and is ( γ + δ/
2) Lipschitz. We take R > S is contained in the ball of radius R with respect to the k · k Y norm, that is, x ∈ S = ⇒ k x k Y < R . For λ >
0, we then define the continuous piecewise linear function φ λ : R + → R by φ λ ( t ) = , ≤ t ≤ R , − λ ( t − R ) , R ≤ t ≤ R + 1 /λ, , t ≥ R + 1 /λ. Clearly, φ λ is λ Lipschitz function and 0 ≤ φ λ ( t ) ≤ t ≥
0. Next, we define the functionΦ λ : R n → R n byΦ λ ( x ) := φ λ ( k x k Y ) x = x, k x k Y ≤ R , (1 − λ ( k x k Y − R )) x, R ≤ k x k Y ≤ R + 1 /λ, , k x k Y ≥ R + 1 /λ, λ ( x ) = x for x ∈ S . Let us check the Lipschitz property of Φ λ .First, for x, y contained in the ball B of radius R + 1 /λ with respect to the k · k Y norm, wehave Φ λ ( x ) − Φ λ ( y ) = ( φ λ ( k x k Y ) − φ λ ( k y k Y )) x + φ λ ( k y k Y )( x − y ) , and thus k Φ λ ( x ) − Φ λ ( y ) k Y ≤ k x k Y (cid:12)(cid:12) φ λ ( k x k Y ) − φ λ ( k y k Y ) (cid:12)(cid:12) + φ λ ( k y k Y ) k x − y k Y ≤ λ k x k Y |k x k Y − k y k Y | + φ λ ( k y k Y ) k x − y k Y ≤ ( λ k x k Y + φ λ ( k x k Y )) k x − y k Y ≤ (1 + λR ) k x − y k Y , x, y ∈ B. (2.8)Next, for x, y ∈ R n such that k x k Y ≥ R + 1 /λ and k y k Y ≥ R + 1 /λ Φ λ ( x ) − Φ λ ( y ) = 0 . (2.9)Lastly, if k x k Y ≤ R + 1 /λ and k y k Y > R + 1 /λ , we consider the point x ∗ := x + s ∗ ( y − x ), s ∗ ∈ [0 ,
1] of the intersection of the line segment connecting x and y and the sphere with radius R + 1 /λ . We have Φ λ ( y ) = Φ λ ( x ∗ ) = 0, and thus it follows from (2.8) that k Φ λ ( x ) − Φ λ ( y ) k Y = k Φ λ ( x ) − Φ λ ( x ∗ ) k Y ≤ (1 + λR ) k x − x ∗ k Y (2.10)= (1 + λR ) s ∗ k x − y k Y ≤ (1 + λR ) k x − y k Y . From (2.8), (2.9), and (2.10), we conclude that Φ λ is a (1 + λR ) Lipschitz function. We canmake the Lipschitz constant (1 + λR ) as close to one as we wish by taking λ small. Therefore,choosing λ sufficiently small, we have that the function M := M ◦ Φ λ , is ( γ + δ/
2) Lipschitz, agrees with M over S and has constant value M (0) on the set { x ∈ R n : k x k Y ≥ R + 1 /λ } . By equivalence of norms on R n , we conclude that M has value M (0) outside an ℓ ∞ cube [ − R , R ] n ,with R = R ( λ, n ) Step 2:
In the second step, we approximate M by a function M obtained by regularization, see[16]. We consider a standard mollifier ϕ m ( x ) = m n ϕ ( mx ) , x ∈ R n , where ϕ is a smooth positive function supported on the unit euclidean ball of R n and such that R R n ϕ = 1. We then define M := ϕ ∗ M , that is, M ( x ) = M ( m, x ) := Z R n ϕ m ( y ) M ( x − y ) dy. The function M is smooth and equal to M (0) outside of the cube Q := [ − D, D ] n , D := R + 1 m . (2.11)10y taking m sufficiently large, we are ensured thatmax x ∈ R n k M ( x ) − M ( x ) k X ≤ ε/ , and in particular (since M agrees with M on S )max x ∈ S k M ( x ) − M ( x ) k X ≤ ε/ , (2.12)because k M ( x ) − M ( x ) k X = (cid:13)(cid:13)(cid:13) Z R n ϕ m ( y ) M ( x ) dy − Z R n ϕ m ( y ) M ( x − y ) dy (cid:13)(cid:13)(cid:13) X ≤ Z R n ϕ m ( y ) k M ( x ) − M ( x − y ) k X dy ≤ ( γ + δ/ Z R n ϕ m ( y ) k y k Y dy = γ + δ/ m Z R n ϕ ( y ) k y k Y dy. In addition, by convexity we find that M is ( γ + δ/
2) Lipschitz since k M ( x ) − M ( y ) k X = (cid:13)(cid:13)(cid:13) Z R n ϕ m ( z )( M ( x − z ) − M ( y − z )) dz (cid:13)(cid:13)(cid:13) X ≤ Z R n ϕ m ( z ) k M ( x − z ) − M ( y − z ) k X dz ≤ Z R n ϕ m ( z )( γ + δ/ k x − y k Y dz = ( γ + δ/ k x − y k Y . If we take m sufficiently large then (2.12) holds and the construction of M in this step is complete.We fix m for the remainder of the proof. Any constants C given below depend only on m , n , δ ,and the initial function M . The value of C may change from line to line. Step 3:
In this step, we derive f M from M by piecewise linear interpolation. We work on thesupport cube Q = [ − D, D ] n . We recall that M is constant and equal to M (0) outside of Q . Wecreate a simplicial mesh of 2 Q by subdividing it into subcubes Q k of equal side length h = 2 D/N ,and then using the Kuhn simplicial decomposition of each of these subcubes into n ! simplices, see[20]. The set of vertices of the cubes Q k form a mesh of discrete points in 2 Q . We denote byΛ h = { x ν } ⊂ Q ⊂ R n the set of these vertices that belong to Q .We denote by I h the operator of piecewise linear interpolation at the vertices of Λ h . It is usuallyapplied to scalar valued functions but its extension to Banach space valued functions is immediate.Since M has value M (0) on ∂Q , the same holds for I h M which may be written as f M ( x ) := I h M ( x ) = X x ν ∈ Λ h M ( x ν ) N ν ( x ) , x ∈ Q. Here, the functions N ν are the nodal basis for piecewise linear interpolation, that is N ν is a con-tinuous piecewise linear function with N ν ( x µ ) = δ µ,ν , with δ µ,ν the Kronecker symbol for µ ∈ Λ h .11e then can extend f M by the value M (0) outside of Q . It follows that f M ( R n ) is contained in alinear subspace of dimension h ) + 1, that is f M has finite rank. We are now left to show that f M is ( γ + δ ) Lipschitz and that (2.7) holds. Thus it is enough to show that:(i) ( f M − M ) is δ/ M is ( γ + δ/
2) Lipschitz;(ii) max x ∈ S k f M ( x ) − M ( x ) k X ≤ ε/
2, because of (2.12).In order to prove (i) and (ii), we first note that if U is the unit ball in X ∗ , we have that k ( f M ( x ) − M ( x )) − ( f M ( y ) − M ( y )) k X = sup ℓ ∈ U | ℓ ( f M ( x )) − ℓ ( M ( x )) − ( ℓ ( f M ( y )) − ℓ ( M ( y ))) | , and k f M ( x ) − M ( x ) k X = sup ℓ ∈ U | ℓ ( f M ( x )) − ℓ ( M ( x )) | . For any ℓ ∈ U , we denote by v ℓ : R n → R the piecewise linear scalar valued function v ℓ ( x ) := ℓ ( M ( x )) = ( ℓ ( M ( x )) , x ∈ Q,ℓ ( M (0)) , x ∈ R n \ Q. Then we have ℓ ( f M ( x )) = ( ℓ ( I h M ( x )) = I h v ℓ ( x ) , x ∈ Q,ℓ ( M (0)) , x ∈ R n \ Q, and ℓ ( M ( x )) − ℓ ( f M ( x )) = ( v ℓ ( x ) − I h v ℓ ( x ) , x ∈ Q, , x ∈ R n \ Q. Note here that we have used the slight abuse of notation since the same notation I h is used forthe interpolation operator applied to scalar valued functions as well as for Banach space valuedfunctions. In particular, we may extend I h v ℓ by ℓ ( M (0)) outside of Q .Therefore, to show (i) and (ii), it is enough to show that uniformly in ℓ ∈ U , for h sufficientlysmall, ( v ℓ − I h v ℓ ) is δ/ R n andmax x ∈ Q | v ℓ ( x ) − I h v ℓ ( x ) | ≤ ε/ . (2.13)Note that the functions v ℓ , ℓ ∈ U , are smooth with uniformly bounded (in ℓ ) second derivatives | v ℓ | W , ∞ ( R n ) := max x ∈ R n max | α | =2 | ∂ α v ℓ ( x ) | ≤ C , . with C a fixed constant independent of ℓ .Let K be any one of the simplices in the Kuhn simplicial decomposition of any of the Q k ’s.Then the diameter of K is √ nh and the radius of the inscribed sphere is h √ n − √ , see subsection3.1.4 in [20]. It follows from Corollary 2 in [6] thatmax x ∈K | v ℓ ( x ) − I h v ℓ ( x ) | ≤ Ch | v ℓ | W , ∞ ( K ) , max i =1 ,...,n max x ∈K | ∂∂x i v ℓ ( x ) − ∂∂x i I h v ℓ ( x ) | ≤ Ch | v ℓ | W , ∞ ( K ) , x ∈ R n | v ℓ ( x ) − I h v ℓ ( x ) | ≤ Ch | v ℓ | W , ∞ ( R n ) , max i =1 ,...,n k ∂∂x i v ℓ − ∂∂x i I h v ℓ k L ∞ ( R n ) ≤ Ch | v ℓ | W , ∞ ( R n ) . Thus (2.13) follows from the first inequality if we select h small enough. From the second of theseinequalities, we find that for x, y ∈ R n , | ( v ℓ ( x ) − I h v ℓ ( x )) − ( v ℓ ( y ) − I h v ℓ ( y )) | ≤ Ch | v ℓ | W , ∞ ( R n ) k x − y k ℓ ( R n ) ≤ Ch | v ℓ | W , ∞ ( R n ) k x − y k Y < δ/ k x − y k Y , where we have used the fact that any two norms on R n are equivalent and h can be made smallenough. This completes the proof of the lemma. ✷ δ ∗ n,γ ( K ) X = 0 ? In this section, we characterize the sets K for which δ ∗ n,γ ( K ) X = 0. We also consider a closelyrelated question of whether δ ∗ n,γ ( K ) X is assumed. We will use the following lemma which is a formof Ascoli’s theorem. Lemma 2.6.
Let ( X, d ) be a separable metric space and ( Y, ρ ) be a metric space for which everyclosed ball is compact. Let F n : X → Y be a sequence of γ Lipschitz maps for which there exists a ∈ X and b ∈ Y such that F n ( a ) = b for n = 1 , , . . . . Then, there exists a subsequence F n j , j ≥ ,which is point-wise convergent to a function F : X → Y and F is γ Lipschitz. If ( X, d ) is alsocompact, then the convergence is uniform. Proof:
For any f ∈ X we have ρ ( F n ( f ) , b ) = ρ ( F n ( f ) , F n ( a )) ≤ γd ( f, a ) . Let us fix a countable dense subset A = { f j } ∞ j =1 ⊂ X and define B j := B ( b, γd ( f j , a ))as the closed ball in Y with radius γd ( f j , a ) centered at b . Then the cartesian product B := B × B × · · · is a compact metric space under the natural product topology. We naturally identify each F n withan element ˆ F n ∈ B whose j -th coordinate is F n ( f j ), that isˆ F n := ( F n ( f ) , F n ( f ) , . . . , F n ( f j ) , . . . ) ∈ B . So, there exists a subsequence ˆ F n s convergent to an element ˆ F ∈ B , that isˆ F ( j ) = lim s →∞ ˆ F n s ( j ) = lim s →∞ F n s ( f j ) , j ≥ . In other words, we get a function F : A → Y , defined as F ( f j ) = lim s →∞ F n s ( f j ) .
13e check that ρ ( F ( f j ) , F ( f i )) ≤ γd ( f j , f i ) for each i, j = 1 , , . . . . Since A is dense, F extends toa γ Lipschitz function on X , F : X → Y . Moreover, F ( f ) = lim s →∞ F n s ( f ) for every f ∈ X . If( X, d ) is compact, uniform convergence is proved by a standard argument, remarking that for any ε > X by a finite number of ε -balls with centers g , . . . , g k ∈ X , and sosup f ∈ X ρ ( F n s ( f ) , F ( f )) ≤ γε + max i =1 ,...,n ρ ( F n s ( g i ) , F ( g i )) ≤ (2 γ + 1) ε, for s large enough. ✷ Theorem 2.7.
Let K ⊂ X be a compact set in a separable Banach space X . If δ ∗ n,γ ( K ) X = 0 ,then the set K is γ Lipschitz equivalent to a subset of R n . That is, there is a norm k · k Y on R n and a function F : K → ( R n , k · k Y ) such that F is invertible and both F and F − are γ Lipschitz.
Proof:
Notice that if we knew that δ ∗ n,γ ( K ) X = 0 was assumed by maps a, M , then we could simplytake F := a and F − = M | a ( K ) . So the proof consists of a limiting argument. For each k ≥
1, thereexist a norm k . k Y k on R n and γ Lipschitz maps a k : K → ( R n , k . k Y k ) and M k : ( R n , k . k Y k ) → X such that lim k →∞ sup f ∈ K k f − M k ( a k ( f )) k X = δ ∗ n,γ ( K ) X = 0 . (2.14)Let us fix f ∈ K and define a ′ k ( f ) := a k ( f ) − a k ( f ) and M ′ k ( x ) := M k ( x + a k ( f )) . Then, a ′ k : K → ( R n , k . k Y k ) and M ′ k : ( R n , k . k Y k ) → X are γ Lipschitz maps. Moreover, a ′ k ( f ) = 0and M ′ k ◦ a ′ k = M k ◦ a k for k = 1 , , . . . .We denote by U the unit ball in R n with respect to the Euclidean norm k . k ℓ n and by U k theunit ball of R n with respect to the norm k · k Y k . From the Fritz John theorem (see e.g [22, Chapt.3]) we infer that there exist invertible linear operators Λ k on R n such that U ⊂ Λ k ( U k ) ⊂ √ nU, and therefore the modified norm k . k Z k defined as k x k Z k := k Λ − k ( x ) k Y k , x ∈ R n , satisfies the inequality 1 √ n k x k ℓ n ≤ k x k Z k ≤ k x k ℓ n . (2.15)Next, we replace a ′ k and M ′ k by˜ a k := Λ k ◦ a ′ k : K → ( R n , k . k Z k ) , ˜ M k := M ′ k ◦ Λ − k : ( R n , k . k Z k ) → X. Note that M k ◦ a k = M ′ k ◦ a ′ k = ˜ M k ◦ ˜ a k , and ˜ a k ( f ) = 0. We note that ˜ a k and ˜ M k are γ Lipschitzwith respect to the new norm k · k k . Indeed, k ˜ a k ( f ) − ˜ a k ( g ) k Z k = k Λ k ◦ a ′ k ( f ) − Λ k ◦ a ′ k ( g ) k Z k = k a ′ k ( f ) − a ′ k ( g ) k Y k ≤ γ k f − g k X , (2.16)and k ˜ M k ( x ) − ˜ M k ( y ) k X = k M ′ k ◦ Λ − k ( x ) − M ′ k ◦ Λ − k ( y ) k X ≤ γ k Λ − k ( x ) − Λ − k ( y ) k Y k = γ k x − y k Z k .
14e then extract subsequence of these mappings that converge point-wise by using Lemma 2.6.For this, we first note that from (2.15), we have k ˜ a k ( f ) − ˜ a k ( g ) k ℓ n ≤ √ n k ˜ a k ( f ) − ˜ a k ( g ) k Z k ≤ γ √ n k f − g k X . Hence, the sequence ˜ a k : K → ℓ n is a sequence of γ √ n Lipschitz mappings for which ˜ a k ( f ) = 0.We apply Lemma 2.6 to infer that, up to a subsequence extraction, ˜ a k converges point-wise on K to a mapping F . Note that F : K → ℓ n is also a γ √ n Lipschitz map.The remainder of the proof is to show that the function F is the mapping claimed by thetheorem. To prove it, we want first to extract a single norm to use in place of the k · k Z k . For this,we apply Lemma 2.6 to the subsequence of norms k · k Z k : ℓ n → R , j = 1 , , . . . , viewed as 1 Lipschitz functions, to derive that, up to another subsequence extraction, k · k Z k converges pointwise to a 1 Lipschitz function from ℓ n to R which we denote by k . k Y . It is easy tocheck that k · k Y is a norm on R n and it satisfies1 √ n k x k ℓ n ≤ k x k Y ≤ k x k ℓ n , x ∈ R n . (2.17)We now verify the required Lipschitz properties of F with respect to k · k Y . First, we claim that k F ( f ) − F ( g ) k Y ≤ γ k f − g k X , (2.18)namely, F : K → ( R n , k . k Y ) is a γ Lipschitz mapping. Since lim k →∞ k F ( f ) − ˜ a k ( f ) k Y = 0 for all f ∈ K because of (2.17), we prove (2.18) by showing that for any ε > f, g ∈ K , we have k ˜ a k ( f ) − ˜ a k ( g ) k Y ≤ γ k f − g k X + ε, (2.19)for any sufficiently large k . Now the set S := { z ∈ R n : z = ˜ a k ( f ) − ˜ a k ( g ) , f, g ∈ K, k ≥ } isbounded and therefore sup z ∈ S |k z k Z k − k z k Y | := ε k → , k → ∞ . (2.20)This gives k ˜ a k ( f ) − ˜ a k ( g ) k Y ≤ k ˜ a k ( f ) − ˜ a k ( g ) k Z k + ε k ≤ γ k f − g k X + ε k , k ≥ , (2.21)where we have used (2.16). Choosing k sufficiently large we have (2.19) and in turn have proved(2.18).Finally, we need to check that F has an inverse on F ( K ) which is γ Lipschitz. Let f, g ∈ K .For every k we have γ k F ( f ) − F ( g ) k Z k ≥ k ˜ M k ( F ( f )) − ˜ M k ( F ( g )) k X ≥ k ˜ M k (˜ a k ( f )) − ˜ M k (˜ a k ( g )) k X − k ˜ M k ( F ( f )) − ˜ M k (˜ a k ( f )) k X − k ˜ M k ( F ( g )) − ˜ M k (˜ a k ( g )) k X ≥ k ˜ M k (˜ a k ( f )) − ˜ M k (˜ a k ( g )) k X − γ k F ( f ) − ˜ a k ( f ) k Z k − γ k F ( g ) − ˜ a k ( g ) k Z k . Passing to the limit and using (2.14) and that M k ◦ a k = ˜ M k ◦ ˜ a k , we obtain γ k F ( f ) − F ( g ) k Y ≥ k f − g k X , ✷ The above argument brings up the interesting question of when δ ∗ n,γ ( K ) X is attained. To provea result in this direction, we recall the well-known vector version of the Banach limit, see AppendixC in [3]. Given X , let ℓ ∞ ( N , X ) denote the space of all sequences ~f = ( f k ) ∞ k =1 with f k ∈ X , k ≥ k ~f k ∞ = sup k k f k k X . The following theorem holds. Theorem 2.8.
For each Banach space X , there exists a norm one linear operator L : ℓ ∞ ( N , X ) → X ∗∗ , such that L ( ~f ) = g whenever ~f = ( f k ) ∞ k =1 , f k ∈ X , and lim k →∞ f k = g ∈ X . Note that we have k L ( ~f ) k X ∗∗ ≤ lim sup k →∞ k f k k X . Theorem 2.9.
Let X be a separable Banach space such that there exists a linear norm one projec-tion P from X ∗∗ onto X . Then, for every n and every compact set K ⊂ X there is a norm k · k Y on R n and mappings ˜ a : K → ( R n , k · k Y ) and ˜ M : ( R n , k · k Y ) → X such that sup f ∈ K k f − ˜ M ◦ ˜ a ( f ) k X = δ ∗ n,γ ( K ) X . This is also the case for ¯ δ n,γ ( K ) X . Proof:
For each k ≥
1, consider the γ Lipschitz maps a k : K → ( R n , k . k Y k ), M k : ( R n , k . k Y k ) → X and the norms k . k Y k on R n , such thatlim k →∞ sup f ∈ K k f − M k ( a k ( f )) k X = δ ∗ n,γ ( K ) X . We proceed as in the proof of Theorem 2.7 to generate a norm k · k Y on R n and a sequence ofmappings ˜ a k that converges pointwise on K to the γ Lipschitz mapping ˜ a : K → ( R n , k · k Y )(denoted by F in Theorem 2.7), and a sequence of γ Lipschitz mappings ˜ M k : ( R n , k · k Y k ) → X .Note that M k ◦ a k = ˜ M k ◦ ˜ a k , and thereforelim k →∞ sup f ∈ K k f − ˜ M k ◦ ˜ a k ( f ) k X = δ ∗ n,γ ( K ) X . For x ∈ R n , we consider the sequence −−−→ M ( x ) := ( ˜ M k ( x )) ∞ k =1 ∈ ℓ ∞ ( N , X ) and define the mapping M ∞ : ( R n , k . k Y ) → X ∗∗ as M ∞ ( x ) = L ( −−−→ M ( x )). One easily verifies that this is γ Lipschitz map since k M ∞ ( x ) − M ∞ ( y ) k X ∗∗ = k L ( −−−→ M ( x ) − −−−→ M ( y )) k X ∗∗ ≤ lim sup k →∞ k ˜ M k ( x ) − ˜ M k ( y ) k X ≤ γ lim sup k →∞ k x − y k k = γ k x − y k Y . M := P ◦ M ∞ , ˜ M : ( R n , k . k Y ) → X is a γ Lipschitz map since P is linearprojection on X of norm one. For f ∈ K , we define ~f := ( f, f, . . . ) ∈ ℓ ∞ ( N , X ), and then k f − ˜ M ◦ ˜ a ( f ) k X = k P ◦ L ( ~f ) − P ◦ M ∞ ◦ ˜ a ( f ) k X ≤ k L ( ~f ) − M ∞ (˜ a ( f )) k X ∗∗ = k L ( ~f ) − L (cid:0) −−−−−→ M (˜ a ( f )) (cid:1) k X ∗∗ ≤ lim sup k →∞ k f − ˜ M k (˜ a ( f )) k X ≤ lim sup k →∞ k f − ˜ M k (˜ a k ( f )) k X + k ˜ M k (˜ a k ( f )) − ˜ M k (˜ a ( f )) k X ≤ lim sup k →∞ k f − ˜ M k (˜ a k ( f )) k X + γ k ˜ a k ( f ) − ˜ a ( f ) k k ≤ lim sup k →∞ k f − ˜ M k (˜ a k ( f )) k X + γ k ˜ a k ( f ) − ˜ a ( f ) k ℓ n ≤ δ ∗ n,γ ( K ) X . Thus, we get sup f ∈ K k f − ˜ M ◦ ˜ a ( f ) k X ≤ δ ∗ n,γ ( K ) X , and the proof is completed. To show the theorem for ¯ δ n,γ ( K ) X , it suffices to repeat those argumentsassuming that a k ’s are defined on X . ✷ Remark 2.10.
Clearly a reflexive Banach space X is complemented in X ∗∗ by a linear projectionof norm one, namely the identity. The same holds for L ([0 , and L ∞ ([0 , . However C ([0 , is not complemented in C ([0 , ∗∗ . In this section, we study whether δ ∗ n,γ ( K ) X can go to zero faster than the entropy numbers of K . To understand this question, we shall prove bounds for the entropy numbers ε n ( K ) X in termsof δ ∗ n,γ ( K ) X . The inequalities we obtain are analogous to the bounds on entropy in terms ofKolmogorov widths as given in Carl’s inequality. Before formulating our main theorem, let us notethat we cannot expect inequalities of the form ε n ( K ) X ≤ Cδ ∗ αn,γ ( K ) X , n ≥ , (3.1)with α > X = ℓ p ( N ) with 1 ≤ p < ∞ and define K := K m := { ( x , . . . , x m , , . . . ) : m X j =1 | x j | p ≤ } ⊂ ℓ p ( N ) . Then ¯ δ n, ( K m ) = δ ∗ n, ( K m ) = 0, provided n ≥ m . Indeed, in this case, we can take, a n : X → ℓ np , M n : ℓ np → X, where a n ( x ) = ( x , . . . , x n ) when x = ( x , x , . . . ) ∈ X and M n (( x , . . . , x n )) := ( x , . . . , x n , , , . . . ).Now, given any α >
0, we choose n so that αn ≥ m and find that the right side of (3.1) is zero butthe left side is not. 17 .1 A weak inequality for entropy While direct inequalities like (3.1) do not hold, we shall prove a weak inequality between theentropy numbers ε n ( K ) X and the stable widths δ ∗ n,γ ( K ) X . To formulate our results, we assumethat δ ∗ n,γ ( K ) X → n →
0, and consider the function φ ( ε ) := φ K,γ ( ε ) := min { m : δ ∗ m,γ ( K ) X ≤ ε } . (3.2)We shall use the following lemma. Lemma 3.1.
Let γ > and let K ⊂ X be a compact subset of the Banach space X . Let us fixa point f ∈ K , δ > , and consider the ball B := B ( f , δ ) in X of radius δ , centered at f and B K := K ∩ B . Then B K can be covered by N balls of radius δ/ , where N ≤ A m , with A := 1 + 16 γ , m := φ ( δ/ , (3.3) Proof:
Let N be the largest number such that there exist points f , . . . , f N from B K such that k f i − f j k X ≥ δ/ , i = j. (3.4)Since f , . . . , f N is a maximal number of points from B K satisfying the separation condition (3.4),it follows that any f ∈ B K must be in one of the balls centered at f j of radius δ/
2. So, we want tobound N .Let m = φ ( δ/ m is the minimal index for which δ ∗ m,γ ( K ) X ≤ δ/
8. In whatfollows, we assume that there are mappings a m , M m for which δ ∗ m,γ ( K ) X is assumed, that is k f − M m ( a m ( f )) k X ≤ δ/
8, where a m : K → ( R m , k · k Y m ) and M m : ( R m , k · k Y m ) → X are both γ Lipschitz. A similar proof, based on limiting arguments, holds in the case when the infimum δ ∗ m,γ ( K ) X is not assumed.Let us denote by y := a m ( f ), y j := a m ( f j ) ∈ R m and g j := M m ( y j ) ∈ X for j = 1 , , . . . , N .Then, we know that k f j − g j k X ≤ δ ∗ m,γ ( K ) X ≤ δ/ , j = 1 , . . . , N, and therefore k g i − g j k X ≥ k f i − f j k X − k f i − g i k X − k f j − g j k X ≥ δ/ , i = j. (3.5)From the assumption that M m is γ Lipschitz we have k g i − g j k X = k M ( y i ) − M ( y j ) k Y m ≤ γ k y i − y j k Y m , and therefore it follows from (3.5) that k y i − y j k Y m ≥ δ γ , i = j. (3.6)Since k y − y j k Y m = k a m ( f ) − a m ( f j ) k Y m ≤ γ k f − f j k X ≤ γδ, j = 1 , . . . , N, all y j ’s, j = 1 , , . . . , N , are in a ball B Y := B Y ( y , γδ ) of radius γδ and center y in R m withrespect to the norm k · k Y m . We recall that for any η >
0, the unit ball in an m dimensional Banachspace can be covered by (1 + 2 /η ) m open balls of radius η , see [22], p. 63. Therefore, B Y can be18overed by (1 + 2 /η ) m balls of radius ηγδ . We take η := 8 − γ − so that the radius of each of theseballs is δ γ . Then, in view of (3.6), each of these balls has at most one of the points y j , j = 1 , . . . , N .This tells us that N ≤ (1 + 2 /η ) m ≤ (1 + 16 γ ) m , and thus proves the lemma. ✷ Theorem 3.2.
Let K ⊂ X be a compact subset of the Banach space X and assume K is containedin a ball with radius R . Let ε > and L be the smallest integer such that L ε ≥ R . Then K canbe covered by N ( ε ) balls where N ( ε ) ≤ A P Lk =1 φ (2 k ε/ , A := 1 + 16 γ . (3.7) Proof:
Let ε k := 2 k ε , k = 0 , , . . . , L , and m k := φ ( ε k / K is contained in theball B of radius ε L which without loss of generality we can assume is centered at 0. From Lemma3.1, we have that K is contained in A m L balls of radius ε L − . We can apply Lemma 3.1 to each ofthese new balls and find that K is contained in A m L · A m L − = A m L + m L − balls of radius ε L − . Continuing in this way, we have that K is contained in N ( ε ) balls of radius ε = ε , where N ( ε ) ≤ A P Lk =1 m k . This proves the theorem. ✷ We can apply the last theorem to derive bounds on entropy numbers from an assumed decay of δ ∗ n,γ ( K ) X in the following way. From the assumed decay, we obtain bounds on the growth of φ ( ε ) as ε → N ( ε ) of ballsof radius ε needed to cover K . The latter then translates into bounds on ε n ( K ) X . We illustratethis approach with two examples in this section. The first is the usual form of Carl’s inequality asstated in the literature. Theorem 3.3.
Let r, γ ≥ . If K is any compact subset of a Banach space X , then ε n ( K ) X ≤ C ( n + 1) − r sup m ≥ ( m + 1) r δ ∗ m,γ ( K ) X , n ≥ , with C depending only on r and γ . Proof: . We fix r > γ > m ≥ ( m + 1) r δ ∗ m,γ ( K ) X . If Λ = ∞ , there is nothing to prove and so we assume Λ < ∞ . We claim that φ (Λ2 − αr ) ≤ α , α ∈ R . (3.8)19ndeed, this follows from the definition of φ and the fact that δ ∗ n,γ ( K ) X ≤ Λ( n + 1) − r , n ≥ . Since K is compact, it is contained in a ball of some radius R . We now define ε := 8Λ2 − nr andlet L be the smallest integer for which2 L ε = Λ2 L − nr ≥ R, and apply Theorem 3.2. From (3.8), we have L X k =1 φ (2 k ε/
8) = L X k =1 φ (Λ2 k − nr ) ≤ L X k =1 n − kr ≤ n ∞ X k =0 − kr = 2 n (1 − − /r ) − . Therefore, it follows from (3.7) that N ( ε ) ≤ A P Lk =1 φ (Λ2 k − nr ) ≤ A n (1 − − /r ) − ≤ n + c , with c an integer depending only on r and γ . It follows that ε n + c ( K ) X ≤ − rn = 2 cr − ( n + c ) r , n ≥ . This proves the desired inequality for integers of the form 2 n + c . This can then be extended to allintegers by using the monotonicity of ε n ( K ) X . ✷ This same idea can be used to derive entropy bounds under other decay rate assumptions on δ ∗ n,γ ( K ) X . We mention just one other example to illustrate this point. Suppose that δ ∗ n,γ ( K ) X ≤ Λ(log ( n + 1)) β ( n + 1) − r , n ≥ , for some r > β ∈ R . Then the above argument gives ε n ( K ) X ≤ C Λ(log ( n + 1)) β ( n + 1) − r , n ≥ , with now C depending only on r, β, γ . Remark 3.4.
The same results obviously hold for δ n,γ ( K ) X since it is larger than δ ∗ n,γ ( K ) X . It is easy to see that Carl’s inequality does not hold for the manifold widths δ n ( K ) X , where theassumption on the mappings a, M are only that these maps are continuous. For a simple example,let X = ℓ ( N ) and let ( α j ) j ≥ be any strictly decreasing sequence of positive numbers which tendto 0. We consider the set K = K ( α , α , . . . ) := { α j e j } j ≥ ∪ { } ⊂ X, where e j , j = 1 , , . . . , is the canonical basis for ℓ ( N ). For each k ≥ a k : K → R by a k (0) = α k , a k ( α j e j ) = α min( j,k ) , j = 1 , , . . . , M k : R → X as the piecewise linear function with breakpoints 0 , α k , . . . , α , defined by thefollowing conditions M k ( t ) = , t ≤ ,α e , t ≥ α ,α j e j , for t = α j j = 1 , . . . , k. Clearly M k ( a k ( x )) = x when x = α j e j with j ≤ k . For any other x ∈ K we have M k ( a k ( x )) = α k e k ,and so sup x ∈ K k x − M k ( a k ( x )) k ℓ ( N ) = sup j>k k α j e j − α k e k k ℓ ( N ) < √ α k . Since α k → k → ∞ , we get δ ( K ) ℓ ( N ) = 0, and thus δ n ( K ) ℓ ( N ) = 0 for n = 1 , , . . . . Next, we bound the entropy numbers of K from below. For 1 ≤ j ≤ n and any k = j , we have k α j e j − α k e k k ℓ ( N ) = q α j + α k > α j ≥ α n . So if we take ε := α n with n ≥
1, then any attempt to cover K with 2 n balls with radius ε ≤ ε will fail since every ball in this set will contain exactly one of the α j e j , j = 1 , . . . , n and no moreelements from K . This gives that ε n ( K ) ℓ ( N ) ≥ α n . We can now show that Carl’s inequality cannot hold for any r >
0. Given such an r , we takefor K = K ( α ) the set corresponding to a sequence α = ( α , α , . . . ), where α n := 1[1 + log n ] r/ . We have that ε n ( K ( α )) ℓ ( N ) ≥ n + 1) r/ , while δ n ( K ( α )) ℓ ( N ) = 0 ≤ n − r , n ≥ . Finally, let us observe that in the above construction of a k , M k for K , the mapping a k is 1Lipschitz. On the other hand, M k has poor Lipschitz constant. Note that since k M k ( α k ) − M k ( α k − ) k X ≥ α k − , the Lipschitz constant of M n is at least of size α k − α k − − α k . When ( α j ) tends to zero slowly as in ourexample, then these Lipschitz constants tend to infinity. Our motivation for the results in this section is the following. One may argue that requiring that themaps a, M are Lipschitz is too severe and perhaps stability can be gained under weaker assumptionsson these mappings. The results of this section show that this is indeed the case. Namely, we showthat to establish a form of numerical stability, it is enough to have the mapping a bounded andthe mapping M satisfy a considerably weaker mapping property than the requirement that it be21ipschitz. We then go on to show that even under these weaker assumptions on the mappings a, M ,one can compare the error of approximation on a model class K with the entropy numbers of K .Let K be a compact set in the Banach space X and recall the notation A := M ◦ a and E A ( K ) X := sup f ∈ K k f − A ( f ) k X . We introduce the following new properties on the pair ( a, M ) of mappings:(i) a : K → ( R n , k · k Y ) is bounded, i.e., k a ( f ) k Y ≤ γ k f k X , f ∈ K ;(ii) M : ( R n , k · k Y ) → X satisfies k M ( x ) − M ( y ) k X ≤ γ k x − y k βY + E A ( K ) X , x, y ∈ R n , (3.9)where γ, β > k · k Y is a norm on R n .Obviously, the assumption (i) is much weaker than the assumption that a is Lipschitz. Notice that(ii) is only requiring that M is a Lip β mapping for x, y sufficiently far apart which is weaker thatLipschitz when β ≥ < β ≤ bounded stable manifold width˜ δ n,γ,β ( K ) X := inf a,M, k·k Y sup f ∈ K k f − M ( a ( f )) k X , (3.10)where the infimum is over all maps a, M satisfying (i) and (ii) and all norms k · k Y on R n . Clearly,we have δ n.γ,β ( K ) X ≤ δ ∗ n,γ ( K ) X for all n ≥
1. We show that properties (i) and (ii) still guaranteea form of numerical stability.
Theorem 3.5.
If the pair ( a, M ) satisfies (i) and (ii) with respect to the norm k · k Y on R n forsome β > , then the approximation operator A := M ◦ a is stable in the following sense. If inplace of f ∈ K we input g ∈ K with k f − g k X ≤ η and in place of y = a ( g ) we compute y ′ with k y − y ′ k Y ≤ η , then k f − M ( y ′ ) k X ≤ E A ( K ) X + η + γη β . (3.11) Proof:
Since A ( g ) = M ( a ( g )) = M ( y ), we have k f − M ( y ′ ) k X ≤ k f − g k X + k g − A ( g ) k X + k M ( y ) − M ( y ′ ) k X ≤ η + E A ( K ) X + E A ( K ) X + γη β , where we have used (ii). ✷ The above theorem shows that we can obtain a form of numerical stability under rather weakassumptions on a, M . The question now is whether it is still true that when using such mappings,the approximation error cannot go to zero faster than entropy numbers. That is, do we stillhave a form of Carl inequality. The following theorem shows that this is indeed the case, up to alogarithmic loss. In formulating the theorem, we let C ( K ) := sup f ∈ K k f k X , which is finite becauseby assumption K is compact. Theorem 3.6.
Let r, γ, β > . If K is any compact subset of a Banach space X , then ε cn ln n ( K ) X ≤ ( n + 1) − r sup m ≥ ( m + 1) r ˜ δ m,γ,β ( K ) X , n ≥ , (3.12) with c depending only on r, β, γ and C ( K ) . roof: Let ˜ δ n := ˜ δ n,γ,β ( K ) X , n ≥
1. We assume that the right side of (3.12) is finite since otherwisethere is nothing to prove. Given any ε >
0, we let m = m ( ε ) be the smallest integer such that˜ δ m ≤ ε/ . (3.13)We fix for now such a pair ( ε, m ). Suppose that { f , . . . , f N } is the largest collection of points in K such that k f i − f j k X ≥ ε for all i, j . Then, the balls centered at the f j with radius ε cover K .We want now to bound N .Let the pair ( a m , M m ) satisfies (i-ii) with respect to the norm k · k Y m and achieves the accuracy˜ δ m (in case the accuracy is not actually attained, a slight modification of the argument below givesthe result). It follows from (3.13) that the mapping A := A m = M m ◦ a m satisfies˜ δ m = E A ( K ) X ≤ ε/ . (3.14)Now, consider y j := a m ( f j ) ∈ R m , g j := M m ( y j ) ∈ X, j = 1 , . . . , N.
Because of (i), the points y j , j = 1 , . . . , N , are all in the ball B centered at 0 of radius R := γC ( K )with respect to the norm k · k Y m . Since k f j − g j k X = k f j − A ( f j )) k X ≤ ˜ δ m ≤ ε/ , we have that whenever i = j , k M ( y i ) − M ( y j ) k X = k g i − g j k X ≥ k f i − f j k X − k f i − g i k X − k f j − g j k X ≥ ε/ . (3.15)Combining condition (ii), (3.14), and (3.15), we have that y j ∈ R m , j = 1 , . . . , N , satisfy γ k y i − y j k βY m ≥ k M ( y i ) − M ( y j ) k X − E A ( K ) X ≥ ε/ − E A ( K ) X = ε/ ε/ − E A ( K ) X ) ≥ ε/ , i = j. In other words, { y , . . . , y N } are in the ball B and they are separated in the sense that k y i − y j k Y m ≥ h ε γ i β =: τ, i = j. (3.16)We take a minimum covering of the ball B by balls B , . . . , B M of radius τ /
2. Then, in view of(3.16), each of these balls has at most one of the points y j , j = 1 , . . . , N , and therefore N ≤ M .As we have used earlier, for any η >
0, the unit ball with respect to k · k Y m can be covered by(1 + 2 /η ) m balls of radius η . This tells us that N ≤ M ≤ h C ε i m/β , (3.17)with C depending only on γ, β and C ( K ).We can now finish the proof of the theorem. If C := sup m ≥ ( m + 1) r ˜ δ m is finite, we take ε = C ( n + 1) − r . We can find c r ∈ N , depending on r , such that 4 /r ( n + 1) ≤ c r n + 1, for n ≥ δ c r n ≤ C ( c r n + 1) r ≤ C n + 1) r = ε/ . m = m ( ε ), we have that m ( ε ) ≤ c r n . Hence, it follows from (3.17) that K can be covered with at most h C ( n + 1) r C i c r n/β ≤ cn ln n (3.18)balls of radius ε . Here c depends only on β, γ, r , and C ( K ). In other words ε cn ln n ( K ) X ≤ C ( n + 1) − r , which is the desired result. ✷ The previous section gave lower bounds in terms of entropy numbers for the optimal possibleperformance when using Lipschitz stable approximation. We now turn to the question of whetherthese performance bounds can actually be met. In this section, we consider the case when theperformance error is measured in a Hilbert space H. The following theorem proves that in thiscase there always exits Lipschitz stable numerical algorithms whose error behaves like the entropynumbers. Hence, this result combined with the Carl type inequalities shows that stable manifoldwidths and entropy numbers behave the same in the case of Hilbert spaces.
Theorem 4.1.
Let H be a Hilbert space and K ⊂ H be any compact subset of H . Then for γ = 2 ,any n ≥ , we have δ ∗ n ( K ) H := δ ∗ n,γ ( K ) H ≤ ¯ δ n,γ ( K ) H ≤ ε n ( K ) H . (4.1) Proof:
Let us fix n and consider the discrete set K := K n := { f , . . . , f n } ⊂ K with the property that every f ∈ K can be approximated by an element from K n with accuracy ε n ( K ) H . That is, for every f ∈ K there is f j ∈ K , such that k f − f j k H ≤ ε n ( K ) H . (4.2)For the set of 2 n points K n ⊂ H we apply the Johnson-Lindenstrauss Lemma, see Theorem 2.1 in[9] for the version we use. According to this theorem, for any 0 < ε <
1, we can find a linear map a ε : K n → ℓ c ( ε ) n such that r − ε ε k f i − f j k H ≤ k a ε ( f i ) − a ε ( f j ) k ℓ ≤ k f i − f j k H , i, j = 1 , . . . , n , whenever c ( ε ) is a positive integer satisfying c ( ε ) ≥ ε / − ε / .We take ε = 3 / c ( ε ) = 26. This gives a linear map a : K n → ℓ n , for which 12 k f i − f j k H ≤ k a ( f i ) − a ( f j ) k ℓ ≤ k f i − f j k H , i, j = 1 , . . . , n . (4.3)24sing the Kirszbraun extension theorem, see Theorem 1.12 from [3], page 18, the mapping a canbe extended from K n to the whole H preserving the Lipschitz constant 1. Let us denote by M n the image of K n under a , that is the discrete set M n := { a ( f j ) : f j ∈ K n } ⊂ R n . Now consider the map M : ( M n , k · k ℓ ) → H , defined by M ( a ( f j )) = f j , j = 1 , . . . , n . Clearly k M ( a ( f i )) − M ( a ( f j )) k H = k f i − f j k H ≤ k a ( f i ) − a ( f j ) k ℓ , and therefore M is a Lipschitz map with a Lipschitz constant 2. According to the Kirszbraunextension theorem, we can extend M to a Lipschitz map on the whole ℓ n with the same Lipschitzconstant 2.Let us now consider the approximation algorithm A defined by A := M ◦ a . If f ∈ K , there isan f j ∈ K n , such that k f − f j k H ≤ ε n ( K ) H . Therefore, f − A ( f ) = ( f − f j ) + ( f j − M ( a ( f j ))) + ( M ( a ( f j )) − M ( a ( f ))) , and since f j = M ( a ( f j )), we have that k f − A ( f ) k H ≤ k f − f j k H + k M ( a ( f j )) − M ( a ( f )) k H ≤ ε n ( K ) H + 2 k a ( f ) − a ( f j ) k ℓ ≤ ε n ( K ) H + 2 k f − f j k H ≤ ε n ( K ) H , which proves the theorem. ✷ We can combine the last result with the results of the previous section to obtain the followingcorollary.
Corollary 4.2.
Let γ ≥ . If K ⊂ H is a compact set in a Hilbert space H and if r > , then δ ∗ n,γ ( K ) H = O (( n + 1) − r ) , n ≥ , if and only if ε n ( K ) H = O (( n + 1) − r ) , n ≥ . The same result holds if δ ∗ n,γ is replaced by ¯ δ n,γ . X In this section, we consider bounding the stable manifold widths by entropy numbers in the case ofan arbitrary Banach space X . Let us note that for such a general Banach space we can no longerhave a direct bound for ¯ δ n,γ ( K ) X in terms of entropy numbers. Indeed, for any compact set K andany Banach space, the entropy numbers of K tend to zero. However, we know that ¯ δ n,γ ( K ) X tendsto zero for all compact sets K only if X has the γ -bounded approximation property, see Theorem2.4. Since there are Banach spaces without this property, we must expect a loss when compared tothe theorems of the previous section. We present in this section results that exhibit a loss in boththe growth of the Lipschitz constants and in the rate of decay of ¯ δ n,γ ( K ) X , as n tends to infinity.It is quite possible that the results of this section may be improved with a deeper analysis.25 heorem 5.1. Let X be a Banach space and K ⊂ X be a compact subset of X . Then, there is afixed positive constant C , such that for each n ≥ there are Lipschitz mappings a n : X → ( R n , k · k ℓ ∞ ) , M n : ( R n , k · k ℓ ∞ ) → X whose Lipschitz constants are at most Cn / and sup f ∈ K k f − M n ( a n ( f )) k X ≤ Cn / ε n ( K ) X , n = 1 , , . . . . (5.1) Proof:
As in the proof of Theorem 4.1, we fix n >
0, and consider the discrete set K n := { f , . . . , f n } ⊂ K, with the property that for every f ∈ K there is f j ∈ K n , such that k f − f j k X ≤ ε n ( K ) X . (5.2)For the discrete set K n ⊂ X of 2 n points we apply Proposition 1 from [4], according to which wecan construct a bi-Lipschitz map ˜ a n from K n into a Hilbert space H ,˜ a n : ( K n , k · k X ) → H, ˜ a − n : ( H n , k · k H ) → K n , where H n := ˜ a n ( K n ) ⊂ H, such that ˜ a n is C n / Lipschitz map and ˜ a − n is C n − / Lipschitz map. Using the version of theJohnson-Lindenstrauss lemma as in the proof of Theorem 4.1, we get a map J : ( H n , k · k H ) → ℓ n , such that J and J − are 2 Lipschitz maps. We also consider the identity map I : ℓ n → ℓ n ∞ , where I is 1 Lipschitz map and I − is √ n Lipschitz map. Thus, the map a n := I ◦ J ◦ ˜ a n : K n → ℓ n ∞ is a Cn / Lipschitz map which, see [3, Lemma 1.1], can be extended to a map a n : X → ℓ n ∞ with the same Lipschitz constant.Next, we proceed with the construction of M n . First, we denote by M n ⊂ R n the image of K n under a n , that is the discrete set M n := { a n ( f j ) : f j ∈ K n , j = 1 , . . . , n } ⊂ R n , and consider the map ˜ M n := ˜ a − n ◦ J − ◦ I − : ( M n , k · k ℓ ∞ ) → X. From the above observations it follows that ˜ M n is a Cn / Lipschitz map. According to Theorem 1from [18], we can extend ˜ M n to a Lipschitz map M n from ℓ n ∞ into X with the Lipschitz constant Cn / . 26ow that a n and M n are constructed, we continue with the analysis of the approximation powerof the mapping M n ◦ a n . We fix f ∈ K , find f j ∈ K n , such that k f − f j k ≤ ε n ( K ) X . Clearly, k f − M n ◦ a n ( f ) k X ≤ k f − f j k X + k M n ( a n ( f j )) − M n ( a n ( f )) k X ≤ ε n ( K ) + Cn / k a n ( f ) − a n ( f j ) k ℓ ∞ ≤ ε n ( K ) + Cn / k f − f j k X ≤ Cn / ε n ( K ) X . Therefore, for the Cn / Lipschitz mappings a n and M n , we havesup f ∈ K k f − M n ◦ a n ( f ) k X ≤ Cn / ε n ( K ) X . This completes the proof. ✷ Remark 5.2.
If we have additional information about the Banach space X , we can get betterestimates than (5.1) , as illustrated in the next lemmas. Lemma 5.3.
Let the Banach space X be isometric to ℓ ∞ (Γ) for some set Γ . Then, there is a fixedpositive constant C , such that for each n ≥ there are Cn / Lipschitz mappings a n : X → ℓ n ∞ , M n : ℓ n ∞ → X, with the property sup f ∈ K k f − M n ( a n ( f )) k X ≤ Cn / ε n ( K ) X , n = 1 , , . . . . Proof:
For K n as in the proof of Theorem 5.1 and H a Hilbert space, using [4], we constructmappings ˜ a n : K n → H, ˜ a − n : H n → K n , where H n := ˜ a ( K n ) ⊂ H, where ˜ a n is Lipschitz with constant C n / and ˜ a − n with a Lipschitz constant C n / . Then, with I and J are as in Theorem 5.1, the mapping I ◦ J ◦ ˜ a n : K n → ℓ n ∞ , is a C n / Lipschitz. We extend it to a mapping a n on the whole X with the same Lipschitzconstant.Next, we consider ˜ M n := ˜ a − n ◦ J − ◦ I − : ( M n , k · k ℓ ∞ ) → X, which is C n / Lipschitz. Now, according to Lemma 1.1 from [3], since X is isometric to ℓ ∞ (Γ)for some Γ, ˜ M n can be extended to M n : ℓ n ∞ → X with the same Lipschitz constant C n / . Then A n = M n ◦ a n is Cn / Lipschitz, and k f − M n ◦ a n ( f ) k X ≤ k f − f j k X + k M n ( a n ( f j )) − M n ( a n ( f )) k X ≤ ε n ( K ) + Cn / k f − f j k X ≤ Cn / ε n ( K ) X , which gives ¯ δ n,Cn / ( K ) X ≤ Cn / ε n ( K ) X . ✷ orollary 5.4. Let C ( S ) be the Banach space of continuous functions on a compact subset S of ametric space. Further, let K ⊂ C ( S ) be a compact set. Then, there is a fixed positive constant C ,such that for each n ≥ there are Cn / Lipschitz mappings a n : C ( S ) → ℓ n , M n : ℓ n → C ( S ) , with the property sup f ∈ K k f − M n ( a n ( f )) k C ( S ) ≤ Cn / ε n ( K ) C ( S ) , n = 1 , , . . . . Proof:
Let us fix arbitrary ε >
0. Since C ( S ) is separable, it follows from [19] that there exists afinite dimensional subspace X ⊂ C ( S ) isometric to ℓ ∞ (Γ) and a linear projection P : C ( S ) → X from C ( S ) onto X of norm 1 such that sup f ∈ K k f − P ( f ) k ≤ ε. We apply Lemma 5.3 to the space X and its compact subset P ( K ), according to which there are Cn / Lipschitz mappings a n : X → ℓ n ∞ , M n : ℓ n ∞ → X, with the propertysup g ∈ P ( K ) k g − M n ( a n ( g )) k C ( S ) ≤ Cn / ε n ( P ( K )) C ( S ) , n = 1 , , . . . . We next define ˜ a n : C ( S ) → ℓ n ∞ , and ˜ M n : ℓ n ∞ → C ( S ), where˜ a n := a n ◦ P, ˜ M n := I ◦ M n , with I : X → C ( S ) the identity embedding from X into C ( S ). Clearly ˜ a n and ˜ M n are both Cn / Lipschitz andsup f ∈ K k f − ˜ M n (˜ a n ( f )) k C ( S ) = sup f ∈ K k f − M n ( a n ( P ( f )) k C ( S ) ≤ sup f ∈ K (cid:0) k f − P ( f ) k C ( S ) + k P ( f ) − M n ( a n ( P ( f )) k C ( S ) (cid:1) ≤ ε + Cn / ε n ( K ) C ( S ) , where we have used that ε n ( P ( K )) C ( S ) ≤ ε n ( K ) C ( S ) . Since ε is arbitrary we get the claim. ✷ Next, we discuss a few standard examples of approximation from the viewpoint of stable manifoldwidths. 28 .1 Linear approximation
Let X be a Banach space, K ⊂ X be compact, and let X n be a linear subspace of X of dimension n . Let us consider approximation procedures f → A ( f ) = M ◦ a ( f ) given by maps a, M , where a : X → R n , M : R n → X n ⊂ X. If we are interested only in such approximation methods given by continuous mappings then it iseasy to see that by using coverings and partitions of unity (see Theorem 2.1 in [11]) one can achievean approximation error for K equivalent to the error dist( K, X n ) X . Thus, δ n ( K ) X can be boundedby the Cd n ( K ) X where d n is the Kolmogorov width. The situation becomes more subtle when werequire Lipschitz continuity of the mappings as we now discuss.Let Φ := { φ , . . . , φ n } be any basis for X n and let us consider the norm on R n , induced by thebasis φ , . . . , φ n , namely k y k Y := k n X j =1 y j φ j k X , y ∈ R n . (6.1)We define the mapping M : ( R n , k · k Φ ) → X , as M ( y ) := n X j =1 y j φ j ∈ X n ⊂ X, y ∈ R n . Clearly, M is a linear mapping with norm one, and hence a 1 Lipschitz mapping. Thus, the mainquestion is whether we can construct a mapping a : X → ( R n , k · |k Y ) that is Lipschitz.If X n admits a bounded projection P n : X → X n , then we can write for f ∈ X , P n ( f ) = n X j =1 a j ( f ) φ j , (6.2)and therefore define a as a ( f ) = ( a ( f ) , . . . , a n ( f )) ∈ R n . Since k a ( f ) − a ( g ) k Y = k n X j =1 a j ( f − g ) φ j k X = k P n ( f − g ) k X ≤ k P n kk f − g k X ,a is a γ n -Lipschitz mapping with γ n := k P n k ≥
1. We thus have¯ δ n,γ n ( K ) X ≤ sup f ∈ K k f − M ( a ( f )) k X = sup f ∈ K k f − P n ( f ) k X . If X = H is a Hilbert space then we know there is always a projection with norm one and hence¯ δ n, ( K ) H ≤ d n ( K ) H , n ≥ . (6.3)For non-Hilbertian Banach spaces every finite dimensional space admits a projection, however thenorm may depend on n . The Kadec-Snobar theorem guarantees that there is a projection withnorm √ n and so we obtain for a general Banach space X and compact K ⊂ X the bound¯ δ n, √ n ( K ) X ≤ d n ( K ) X , n ≥ . (6.4)Of course, we already know from our earlier results that relate the decay of ¯ δ n,γ ( K ) X to the boundedapproximation property that some growth factor is needed. If we assume additional structure on X then the quantitative growth can be better controlled. For example, for X = L p , 1 < p < ∞ ,we can replace √ n in (6.4) by n | / − /p | , see e.g. [25, III.B.10.].29 .2 Compressed Sensing One of the primary settings where nonlinear approximation methods prevail is in compressed sens-ing which is concerned with the numerical recovery of sparse signals. The standard setting ofcompressed sensing is the following. We consider vectors x ∈ R N where N is large. Such a vector x is said to be k sparse if at most k of its coordinates are nonzero. Let Σ k denote the set of all k sparse vectors in R N . The goal of compressed sensing is to make a small number of n linear mea-surements of a vector x which can then be used to approximate x . The linear measurements takethe form of inner products of x with vectors φ , . . . , φ n . These measurements can be representedas the application of a compressed sensing matrix Φ ∈ R n × N to x , where the rows of Φ are thevectors φ , . . . , φ n .A fundamental assumption about the measurements used in compressed sensing is the so called restricted isometry property of order k , RIP( k, δ k ). We say that the matrix Φ satisfies the RIP( k, δ k ),0 < δ k <
1, if (1 − δ k ) k x k ℓ N ≤ k Φ( x ) k ℓ n ≤ (1 + δ k ) k x k ℓ N , for all x ∈ Σ k . (6.5)A decoder is a mapping M which takes the measurement vector y = Φ( x ) and maps it backinto R N . The vector M (Φ( x )) is the approximation to x . Thus, compressed sensing falls into ourparadigm of nonlinear approximation as given by the two mapping a : R N → ( R n , k · k Y ) with a ( x ) := Φ( x ) and the mapping M : ( R n , k · k Y ) → R N . Note that the mapping a is rather specialsince it is assumed to be linear.The first goal of compressed sensing is to find such mappings for which M ( a ( x )) = x whenever x is in Σ k . It is easy to see that n = 2 k is the smallest number of measurements for which this is trueand it is easy to characterize all of the mappings a = Φ that do the job (see e.g. [8]). However, thesematrices Φ and perfect reconstruction maps M with n = 2 k are deemed unsatisfactory becauseof their instability. To discuss this and other issues connected with compressed sensing using theviewpoint of this paper, we need to introduce a norm on R N in which we shall measure performance.We consider the ℓ p norms for 1 ≤ p ≤ X := ( R N , k · k ℓ p ).There are two flavors of results one can ask for in the context of compressed sensing or sparserecovery. The strongest guarantees are in the form of instance optimality . To formulate this let x ∈ R N and define σ k ( x ) p := inf y ∈ Σ k k x − y k ℓ p (6.6)to be its error of best approximation by k sparse vectors. We say that the measurement system(Φ , M ) is C instance optimal of order k if k x − M (Φ( x )) k ℓ p ≤ Cσ k ( x ) p , x ∈ R N . (6.7)A central issue in compressed sensing is how large must the number of measurements n be toguarantee instance optimality of order k with a reasonable constant C . It is known, see [8], thatfor p = 1, linear mappings Φ based on n measurements and satisfying the RIP(3 k, δ k ), with δ k ≤ δ < ( √ − /
3, and the recovery map M based on ℓ minimization M ( y ) := argmin {k x k ℓ : Φ x = y } , provide instance optimality. One can construct such matrices when n ≥ ck log( N/k ) with a suit-able constant c independent of k . On the other hand, see [8], when 1 < p ≤
2, the number ofmeasurements n must necessarily grow as a power of N in order to guarantee that the instance30ptimality (6.7) is achieved. In particular, for p = 2, instance optimality cannot hold unless n isproportional to N .A weaker notion of performance is to consider only distortion on compact subsets K of R N .The distortion is now measured in the worst error described by E ( K, Φ , M ) p := sup x ∈ K k x − M (Φ( x )) k ℓ p . (6.8)A common family of model classes are the unit balls K q , K q := { x ∈ R N : k x k ℓ q ≤ } , q < p. By utilizing the above results on instance optimality for p = 1, one can derive estimates for theabove error when using a suitably chosen compressed sensing matrix Φ for encoding and with ℓ minimization decoding M . Given p ≥
1, one can derive bounds for the above error for a certainrange of q and show these are optimal by comparing this error with Gelfand widths. We refer thereader to [8] for details.Our main goal in this paper is not to restrict the measurement map a to be linear but ratherimpose only that it is Lipschitz. By relaxing the condition on a to only be Lipschitz we will deriveimproved approximation error bounds. We first observe that the matrices Φ, which are the canonicalmeasurement maps of compressed sensing, have rather big Lipschitz constants when considered asmapping from ℓ Np to ℓ n . Let us denote by k Φ k ℓ Np → ℓ n the norm of Φ. Then the following lemmaholds. Lemma 6.1.
If the matrix Φ satisfies the RIP(1 , δ ) , then for all ≤ p ≤ , (1 − δ ) − n − / N − /p ≤ k Φ k ℓ Np → ℓ n ≤ (1 + δ ) N − /p . (6.9) Proof:
Let Φ := ( a i,j ) ∈ R n × N . It follows from the RIP(1 , δ ) that for j = 1 , . . . , N ,(1 − δ ) ≤ n X i =1 | a i,j | ≤ (1 + δ ) , (6.10)and therefore (1 − δ ) √ N ≤ k Φ k F ≤ (1 + δ ) √ N , (6.11)where k Φ k F is the Frobenious norm of Φ. Since1 √ n k Φ k F ≤ k Φ k ℓ N → ℓ n ≤ k Φ k F , it follows from (6.11) that (1 − δ ) n − / √ N ≤ k Φ k ℓ N → ℓ n ≤ (1 + δ ) √ N . (6.12)We now derive bounds for Φ on the ℓ Np spaces, 1 ≤ p <
2. Let e j := (0 , . . . , , , . . . , ∈ R N , bethe j -th standard basis element. We have k e j k ℓ N = 1 and k Φ e j k ℓ n = n X i =1 | a i,j | ≤ (1 + δ ) , j = 1 , . . . , N, x = P nj =1 x j e j ∈ ℓ , k Φ x k ℓ n ≤ n X j =1 | x j |k Φ e j k ℓ n ≤ (1 + δ ) k x k ℓ N . In other words, k Φ k ℓ N → ℓ n ≤ (1 + δ ) , and from (6.12) and the Riesz-Thorin theorem we get the right inequality in (6.9).To prove the left inequality in (6.9) , we observe that from (6.11) there exists 1 ≤ i ≤ n suchthat N X j =1 a i ,j ≥ N (1 − δ ) n . We define a ∗ := ( a i , , . . . , a i ,N ) ∈ R N and x ∗ := a ∗ / k a ∗ k ℓ N Then we have N / n − / (1 − δ ) ≤ N X j =1 x ∗ j a i ,j = [Φ x ∗ ] i ≤ k Φ x ∗ k ℓ n ≤ k Φ k ℓ Np → ℓ n k x ∗ k ℓ Np . Since k x ∗ k ℓ Np ≤ N /p − / we get the left inequality in(6.9). ✷ Since the mapping Φ is linear, its norm is the same as its Lipschitz constant. So the abovelemma shows that this Lipschitz constant is large, at least when we choose the norm on R n to bethe ℓ norm. Choosing another norm on R n cannot help much because of norm equivalences on R n and changing norms will change the Lip constant for the recovery mapping M . We next wantto show that dropping the requirement that a is linear, and replacing it by requiring only that itis Lipschitz, dramatically improves matters. For now, we illustrate this only in one setting. Weconsider instance optimality in ℓ which we recall fails to hold in the classical setting of compressedsensing.Let X = ℓ N and let Φ be an n × N matrix which satisfies the RIP of order 2 k (with suitableRIP constants). Define a : Σ k → ℓ n by a ( x ) := Φ( x ) , x ∈ Σ k . It follows from the RIP that k Φ x k ℓ n ≤ C k x k ℓ N , for all x ∈ Σ k , and so a is C Lipschitz on Σ k . Bythe Kirszbraun extension theorem, a has a C Lipschitz extension to all of X which extension wecontinue to denote by a . Note that a will not be linear on X .Now consider the construction of a recovery map M . There is a 1 Lipschitz inverse mapping M : a (Σ k ) → X such that M ( a ( x )) = x when x ∈ Σ k (for example ℓ minimization provides suchan M ). Again by the Kirszbraun extension theorem, M has a 1 Lipschitz extension to all of ℓ n ,which we continue to denote by M .These new mappings a : ℓ ( R N ) → ℓ ( R n ) , M : ℓ ( R n ) → ℓ ( R N ) , (6.13)have Lipschitz constant at most C for a and one for M . Moreover, when applied to any x ∈ Σ k ,we still have M ( a ( x )) = x . 32ow, consider the performance of these mappings on all of ℓ N . Given x ∈ R N , we can write x = x + e , where x is a best approximation to x from Σ k and k e k ℓ = σ k ( x ) ℓ . We have that k x − M ( a ( x )) k ℓ ≤ k x + e − M ( a ( x )) k ℓ + k M ( a ( x )) − M ( a ( x )) k ℓ ≤ k e || ℓ + C k e k ℓ = ( C + 1) σ k ( x ) ℓ , (6.14)because M ( a ( x )) = x and because the composition mapping M ◦ a is C Lipschitz mapping.Thus, instance optimality can be achieved in ℓ , for n of the order of k up to logarithmic factorsprovided one generalizes the notion of measurement maps to be nonlinear but Lipschitz, whilelinear measurements would impose that n is of the order of N . This is now a very active area of research. A neural network is a vehicle for creating multivariatefunctions which depend on a fixed number n of parameters given by the weights and biases of thenetwork. We consider all networks with n parameters with perhaps some user prescribed restrictionsimposed on the architecture of the network. Let us denote by Υ n the outputs of such networks.Thus the elements in Υ n are multivariate functions, say with d variables, described by n parametersand hence are a nonlinear manifold depending on n parameters.Let us fix a function norm k · k X to measure error. Given a target function f ∈ X (or dataobservations of f such as point values), one determines the n parameters a ( f ) = ( a ( f ) , . . . , a n ( f ))of the network which will be used to approximate f . These parameters determine the outputfunction M ( a ) from Υ n . The decoder M is explicit and simple to describe from the assumedarchitecture. For example, for the ReLU activation function this output is a piecewise linearfunction. Thus, neural networks provide an approximation procedure A ( f ) := M ( a ( f )) of the typestudied in this paper.There are by now several papers addressing the approximation properties of neural networks(see [10] and the references therein). In some cases, they advertise some surprising results. Wemention here only the results on approximating univariate 1 Lipschitz functions with respect to an L p norm on an interval [0 ,
1] by neural networks with a ReLU activation function. It is shown in[23] (with earlier results in [24]) that any function in the unit ball K of Lip 1 can be approximatedto accuracy Cn − by elements from Υ n . This result is on first glance quite surprising since theentropy number ε n ( K ) L p ≥ cn − with c an absolute constant.So, how should we evaluate such a result? The first thing we should note is that if we viewsuch a neural network approximation as simply a manifold approximation, then the result is notsurprising. Indeed, we could equally well construct a one parameter (space filling) manifold (evenwith piecewise linear manifold elements) and achieve arbitrary approximation error for K . Sucha one parameter manifold is not very useful, since given f or data for f , it would be essentiallyimpossible to numerically find an approximant from the manifold with this error. So the main issuescenter around the properties of a and M . If we require the rather minimal condition that a and M are continuous, we can never achieve accuracy better than cn − in approximating the elementsof K using an n parameter manifold as is proved in [12]. We can even lessen the requirement that a be continuous to just requiring that a is bounded if we impose a little smoothness on M (seeTheorem 3.6). So, to achieve a rate of approximation better than O ( n − ) for K using n parameterneural networks, one must necessarily use mappings which are not continuous, even a has to bepoorly bounded (with bounds growing with n ). The question is the numerical cost to find good33arameters and whether the numerical procedure to find these parameters is stable. The results ofthe present paper clarify these issues.In practice, the parameters of the neural network are found from given data observations of f , by typically using stochastic gradient descent algorithms with respect to a chosen loss functionrelated to fitting the data. Unfortunately, there is no clear analysis of the convergence of thesedecent algorithms for such optimization problems, although it seems to be recognized that oneneeds to impose constraints on the size of the steps in each iteration that tend to zero as thenumber of steps increase. The results of the present paper may provide a better understanding ofwhat conditions need to be imposed in the descent and what approximation results can be obtainedunder such constraints. A general question, which is not answered in this paper is to determine the asymptotic behaviorof δ ∗ n,γ ( K ) X for classical model classes K in classical Banach spaces X . For example, we do notknow the decay rate of δ ∗ n,γ ( K ) X for all of the Besov or Sobolev balls K that compactly embedinto L p , 1 ≤ p ≤ ∞ . The asymptotic decay of these widths remains an open fundamental question.In the case that this ball is a compact subset of L p , then it is known, see Theorem 1.1 in [7], thatthe entropy numbers of this unit ball decay like n − s/d and so in view of the Carl type inequality ofTheorem 3.3 we have δ ∗ n,γ ( K ) L p ≥ cn − s/d , n ≥ . (6.15)The main question therefore is whether the inequality in (6.15) can be reversed. In the case p = 2,the fact that it can be reversed follows from Theorem 4.1. The situation for p = 2 is not sostraightforward and is still not settled. Let us remark that for the weaker notion of manifoldwidths δ n ( K ) L p both (6.15) and its reverse have been proven, see Theorem 1.1 in [12]. Acknowledgment:
The authors thank Professor Giles Godefroy for insightful discussions on theresults of this paper.
References [1] Ya. Alber, A. Notik,
On some estimates for projection operator on Banach spaces , Comm. onApplied Nonlinear Analysis, (1) (1993), arXiv:funct-an/9311003.[2] P. Aleksandrov, Combinatorial Topology, Vol. 1 , Graylock Press, Rochester, NY, 1956.[3] Y. Benyamini, J. Lindenstrauss,
Geometric Nonlinear Functional Analysis , Vol. 1, AmericanMathematical Society Colloquium Publications, (2000), AMS, Providence, RI.[4] J. Bourgain, On Lipschitz Embedding of Finite Metric Spaces in Hilbert Space , Israel J. Math., (1985), 46–52.[5] B. Carl, Entropy numbers, s-numbers, and eigenvalue problems , J. Funct. Anal., (1981),290–306.[6] P. Ciarlet, C. Wagschal, Multipoint Taylor formulas and applications to the finite elementmethod , Num. Mathematik, (1971), 84–100.347] A. Cohen, W. Dahmen, I. Daubechies, R. DeVore, Tree Approximation and Optimal Encoding ,ACHA, (2001) 192–226.[8] A. Cohen, W. Dahmen, R. DeVore, Compressed sensing and best k-term approximation , J.Amer. Math. Soc., (2009), 211–231.[9] S. Dasgupta, A. Gupta, An elementary proof of a theorem of Johnson and Lindenstrauss ,Random Structures Algorithms, (1) (2003), 60–65.[10] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, G. Petrova, Nonlinear approximation anddeep (ReLU) networks , arXiv:1905.02199, 2019[11] R. DeVore, R. Howard, C. Micchelli,
Optimal nonlinear approximation , Manuscripta Mathe-matica (4) (1989), 469–478.[12] R. DeVore, G. Kyriazis, D. Leviaton, V. Tichomirov, Wavelet compression and nonlinearn-widths , Advances in Computational Mathematics, (2) (1993), 197–214.[13] G. Godefroy, Lipschitz approximable Banach spaces , CMUC, to appear.[14] G. Godefroy,
A survey on Lipschitz-free Banach spaces , Commentationes Mathematicae, (2015), 89–118.[15] G. Godefroy, N. Kalton, Lipschitz-free Banach spaces , Studia Math., (1) (2003), 121–141.[16] T. Hytonen, J van Neerven, M. Veraar, L. Weis,
Analysis in Banach Spaces, Volume I: Mar-tingales and Littlewood-Paley Theory , Springer, 2016.[17] W. Johnson, J. Lindenstrauss,
Extensions of Lipschitz Mappings into a Hilbert Space , Con-temporary Mathematics, (1984), 189–206.[18] W. Johnson, J. Lindenstrauss, G. Schechtman, Extensions of Lipschitz Mappings into BanachSpaces , Israel J. Math., (2) (1986), 129–138.[19] E. Michael, A. Pe lczy´nski, Separable Banach spaces which admit ℓ ∞ n approximations , Israel J.Math., (1966), 189–198.[20] S. Kachanovich, Meshing submanifolds using Coxeter triangulations
Computational Ge-ometry[cs.CG]. COMUE Universite Cote d’Azur (2015 - 2019), 2019. English. NNT:2019AZUR4072?tel-02419148v2, thesis, https://hal.inria.fr/tel-02419148v2/document [21] A. Pinkus, n -widths in Approximation Theory , Springer, 2012.[22] G. Pisier, The Volume of Convex Bodies and Banach Space Geometry , Cambridge Univ. Press,Cambridge, 1989.[23] Jianfeng Lu, Zuowei Shen, Haizhao Yang, Shijun Zhang,
Deep Network Approximation forSmooth Functions , preprint 2020.[24] D. Yarotsky,
Error bounds for approximations with deep ReLU networks , Neural Networks, (2017), 103–114. 3525] P. Wojtaszczyk, Banach Spaces for Analysts , Cambridge U. Press, 1991.