Adaptive approximation by optimal weighted least squares methods
Giovanni Migliorati∗

July 11, 2019

∗ Sorbonne Université, UPMC Univ Paris 06, CNRS, UMR 7598, Laboratoire Jacques-Louis Lions, 4 place Jussieu, 75005 Paris, France. email: [email protected]
Abstract
Given any domain $X \subseteq \mathbb{R}^d$ and a probability measure $\rho$ on $X$, we study the problem of approximating in $L^2(X,\rho)$ a given function $u : X \to \mathbb{R}$, using its noiseless pointwise evaluations at random samples. For any given linear space $V \subset L^2(X,\rho)$ with dimension $n$, previous works have shown that stable and optimally converging Weighted Least-Squares (WLS) estimators can be constructed using $m$ random samples distributed according to an auxiliary probability measure $\mu$ that depends on $V$, with $m$ being linearly proportional to $n$ up to a logarithmic term. As a first contribution, we present novel results on the stability and accuracy of WLS estimators with a given approximation space, using random samples that are more structured than those used in the previous analysis. As a second contribution, we study approximation by WLS estimators in the adaptive setting. For any sequence of nested spaces $(V_k)_k \subset L^2(X,\rho)$, we show that a sequence of WLS estimators of $u$, one for each space $V_k$, can be sequentially constructed such that: i) the estimators remain provably stable with high probability and optimally converging in expectation, simultaneously for all iterations from one to $k$, and ii) the overall number of samples necessary to construct all the first $k$ estimators remains linearly proportional to the dimension of $V_k$, up to a logarithmic term. The overall number of samples takes into account all the samples generated to build all the estimators from iteration one to $k$. We propose two sampling algorithms that achieve this goal. The first one is a purely random algorithm that recycles most of the samples from the previous iterations. The second algorithm recycles all the samples from all the previous iterations. Such an achievement is made possible by crucially exploiting the structure of the random samples. Finally we apply the results from our analysis to develop numerical methods for the adaptive approximation of functions in high dimension.

Keywords: approximation theory, weighted least squares, convergence rates, high-dimensional approximation, adaptive approximation.
AMS 2010 Classification:
In recent years, the increasing computing power and availability of data have contributed to a huge growth in the complexity of mathematical models. Dealing with such models often requires the approximation or integration of functions in high dimension, which can be a challenging task due to the curse of dimensionality. The present paper studies the problem of approximating a function $u : X \to \mathbb{R}$ that depends on a $d$-dimensional parameter $x \in X \subseteq \mathbb{R}^d$, using the information coming from the evaluations of $u$ at a set of selected samples $x^1,\dots,x^m \in X$. Two classical approaches to such a problem are interpolation and least-squares methods, see e.g. [9, 3, 15]. Here we turn our attention to least-squares methods, which are frequently used in applications for approximation, data fitting, estimation and prediction. Other approaches to function approximation are compressive sensing, see [12] and references therein, and neural networks, see e.g. [4, 16].

Previous convergence results for standard least-squares methods have been proposed in [5], in expectation, and [17], in probability. Weighted Least-Squares methods (hereafter WLS) have been previously studied in [11, 13, 7]. It has been proven in [7] that stable and optimally converging WLS estimators can be constructed using judiciously distributed random samples, whose number is only linearly proportional to the dimension of the approximation space, up to a logarithmic term. The analysis holds in general approximation spaces, and the number of samples ensuring stability and optimality of the estimator does not depend on $d$. Such a result is recalled in Theorem 1. The analysis in [7] considers both cases of noisy or noiseless evaluations of $u$. In this paper we confine to the case of noiseless evaluations, which is relevant whenever the function $u$ can be evaluated at the selected samples with sufficiently high precision, e.g. up to machine epsilon. The case of noisy evaluations can be addressed using the same techniques as in [7, 19].

The proof of Theorem 1, and more generally the analysis in [7], use results from [1, 20] on tail bounds for sums of random matrices. An interesting feature of the bounds in [20] is that the matrices need not be identically distributed. The analysis in [7] does not take advantage of this property. One of the main goals of the present paper is to show how the use of this property paves the way towards novel results in the analysis of WLS methods for a given space (Theorem 2), and towards their application in an adaptive setting (Theorem 3). The proof of Theorem 2 builds on previous contributions [5, 7]. The overall skeleton of the proof is similar, but with some crucial differences that make use of the additional structure of the random samples.

The outline of the paper is the following: in Section 1.1 we describe and motivate our contributions. In Section 2 we recall some results from the analysis in [7] on weighted least squares for a given space. Section 2.2 contains Theorem 2 and its proof. In Section 3 we apply Theorem 2 and Theorem 1 to the adaptive setting, with an arbitrary nested sequence $(V_k)_k$ of approximation spaces. In Section 4 we present the sampling algorithms. Section 5 contains some numerical tests, together with an example of adaptive algorithm that uses sequences of nested polynomial spaces. Section 6 draws some conclusions.
All the algorithms are collected in the appendix.

Let $X \subseteq \mathbb{R}^d$ be a Borel set, $\rho$ be a Borel probability measure on $X$, $(\psi_i)_{i\ge1}$ be a basis orthonormal in $L^2(X,\rho)$ equipped with the inner product $\langle f_1, f_2\rangle = \int_X f_1(x) f_2(x)\, d\rho$, and $V := \mathrm{span}\{\psi_1,\dots,\psi_n\}$ be the space obtained by retaining $n$ terms of the basis. The least-squares method approximates the function $u$ by computing its discrete $L^2$ projection onto a given space $V$, using pointwise evaluations of $u$ at a set of $m \ge n$ distinct random samples $x^1,\dots,x^m$. The analysis in [7], whose main findings are summarized in the forthcoming Theorem 1, provides some results on the stability and convergence properties of such a discrete projection, and of other WLS estimators as well. In Theorem 1, independent and identically distributed random samples are drawn from the probability measure
\[
d\mu_n = \frac{1}{n} \sum_{j=1}^{n} d\chi_j,
\]
that is an additive mixture of the probability measures $\chi_j$ defined as
\[
\chi_j(A) := \int_A |\psi_j(x)|^2\, d\rho, \quad \text{for any Borel set } A \subseteq X. \qquad (1.1)
\]
One sample from $\mu_n$ can be generated by randomly choosing an index $j$ uniformly in $\{1,\dots,n\}$ and then drawing one sample from $\chi_j$. In general $\mu_n$ is not a product measure, even if $\rho$ is a product measure.

Another novel approach proposed in this paper uses a different type of random samples. Such an approach uses a set of independent random samples of the form $x^1,\dots,x^m$ with $m = \tau n$ for a suitable integer $\tau$, and such that for any $j = 1,\dots,n$, the samples $x^{(j-1)\tau+1},\dots,x^{j\tau}$ are distributed according to $\chi_j$. These samples are not identically distributed. On the upside, they are more structured than those used in Theorem 1, since the amount of samples coming from each component of the mixture is fixed. If $\tau = 1$, the $n$ independent samples $x^1,\dots,x^n$ are jointly drawn from
\[
(x^1,\dots,x^n) \sim d\gamma_n := \prod_{j=1}^n |\psi_j(x^j)|^2\, d\rho.
\]
If $\tau \ge 1$, the $m = \tau n$ independent samples $x^1,\dots,x^m$ follow the probability measure $d\gamma_m := \otimes^{\tau} d\gamma_n$. Denote with $d\mu_m := \otimes^{m} d\mu_n$ the probability measure for the draw of $m$ i.i.d. samples from $\mu_n$. Given a fixed $n$, in the limit $m = \tau n \to \infty$ obtained by $\tau \to \infty$, the proportion of random samples of $\mu_m$ coming from each $\chi_j$ tends to $1/n$ by the strong law of large numbers, whereas the same proportion is exactly equal to $1/n$ by construction for the samples drawn from $\gamma_m$. With any $m$ the two probability measures $\mu_m$ and $\gamma_m$ generate samples with different distributions. However, the block of $n$ samples from $\gamma_n$ still mimics the samples from $\mu_n$. For example the sum of the expectations of the random samples is preserved,
\[
(x^1,\dots,x^n) \sim d\gamma_n, \ \ \tilde{x} \sim d\mu_n \implies \sum_{j=1}^n \mathbb{E}(x^j) = \sum_{j=1}^n \int_X x^j |\psi_j(x^j)|^2\, d\rho = \int_X \tilde{x} \sum_{j=1}^n |\psi_j(\tilde{x})|^2\, d\rho = n\,\mathbb{E}(\tilde{x}),
\]
and this preservation plays a main role in our forthcoming analysis. All the measures appearing in this paper are also Borel probability measures, and sometimes for brevity we refer to them just as measures.

The first main result of this paper is Theorem 2. It proves the same guarantees as Theorem 1 for the stability and accuracy of WLS estimators with a given approximation space, but when the random samples from $\mu_m$ are replaced with random samples from $\gamma_m$. The second main result concerns the analysis of WLS estimators when considering a sequence of nested approximation spaces $(V_k)_{k\ge1}$, where $V_k := \mathrm{span}\{\psi_1,\dots,\psi_{n_k}\}$ and $n_k := \dim(V_k)$. In this adaptive setting, we compare the two approaches using random samples from $\gamma_m$ or $\mu_m$.
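To make the two sampling schemes concrete, the following sketch (an illustration of ours, not part of the original paper) contrasts i.i.d. sampling from the mixture $\mu_n$ with the structured sampling from $\gamma_m$. The routine `sample_chi(j, size)`, which draws `size` samples from the component $\chi_j$ (using 0-based indices), is an assumed user-supplied ingredient, e.g. implemented via inverse transform or rejection sampling.

```python
import numpy as np

def sample_mu(n, m, sample_chi, rng):
    """m i.i.d. samples from the mixture mu_n = (1/n) * sum_j chi_j."""
    # pick a component uniformly at random for each sample, then draw from it
    counts = np.bincount(rng.integers(0, n, size=m), minlength=n)
    return np.concatenate([sample_chi(j, counts[j]) for j in range(n)])

def sample_gamma(n, tau, sample_chi):
    """m = tau*n independent samples from gamma_m: exactly tau draws per component chi_j."""
    # row j of the (n, tau) matrix holds the tau samples distributed as chi_j
    return np.stack([sample_chi(j, tau) for j in range(n)])
```

For $d > 1$ each sample is a point in $\mathbb{R}^d$ and the arrays gain a trailing dimension; the bookkeeping is unchanged.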
In both cases, in Theorems 3 and 4, we prove that a sequence of estimators of $u$, one for each space $V_k$, can be sequentially constructed such that: i) the estimators remain provably stable with high probability and optimally converging in expectation, simultaneously for all iterations from one to $k$, and ii) the overall number of samples necessary to construct all the first $k$ estimators remains linearly proportional to the dimension of $V_k$, up to a logarithmic term. As a further contribution we show that using the samples from $\gamma_m$ rather than from $\mu_m$ provides the following advantages, that are relevant in the development of adaptive WLS methods.

• Structure of the random samples.
When using $\gamma_m$, the number of random samples coming from each component $|\psi_j(x)|^2 d\rho$ of the mixture is precisely determined, and allows the development of adaptive algorithms that recycle all the samples from all the previous iterations. When using $\mu_m$ it is not possible to recycle all the samples from the previous iterations with probability equal to one. Given two nested spaces $V_{k-1} \subset V_k$ and two positive integers $\tau_{k-1} \le \tau_k$, at iteration $k$ the probability measure $\gamma_{m_k}$ of the $m_k = \tau_k n_k$ samples can be decomposed as
\[
d\gamma_{m_k} = \otimes^{\tau_k} \prod_{j=1}^{n_k} |\psi_j(x)|^2 d\rho
= d\gamma_{m_{k-1}} \,\otimes\, \Big( \otimes^{\tau_k - \tau_{k-1}} \prod_{j=1}^{n_{k-1}} |\psi_j(x)|^2 d\rho \Big) \,\otimes\, \Big( \otimes^{\tau_k} \prod_{j=n_{k-1}+1}^{n_k} |\psi_j(x)|^2 d\rho \Big), \qquad (1.2)
\]
where the first factor is the measure of the $m_{k-1}$ samples recycled with certainty from step $k-1$, the second factor is the measure of the new samples drawn from the old components of the mixture, and the third factor is the measure of the new samples drawn from the new components of the mixture. The probability measure $d\mu_{m_k} = \otimes^{m_k} d\mu_{n_k}$ cannot be decomposed as the product of two probability measures with one being $\mu_{m_{k-1}}$, because $\mu_{n_k}$ is not a product measure. It is however possible to leverage the structure of $\mu_{n_k}$ as an additive mixture of $\mu_{n_{k-1}}$ and a suitable probability measure $\sigma_{n_k}$, and decompose $\mu_{m_k}$ as
\[
d\mu_{m_k} = \otimes^{m_k} \Big( \frac{n_{k-1}}{n_k}\, d\mu_{n_{k-1}} + \frac{n_k - n_{k-1}}{n_k}\, d\sigma_{n_k} \Big), \qquad
d\mu_{n_{k-1}} = \frac{1}{n_{k-1}} \sum_{j=1}^{n_{k-1}} |\psi_j(x)|^2 d\rho, \quad
d\sigma_{n_k} := \frac{1}{n_k - n_{k-1}} \sum_{j=n_{k-1}+1}^{n_k} |\psi_j(x)|^2 d\rho, \qquad (1.3)
\]
where the first term collects the old components of $d\mu_{n_k}$, whose samples can perhaps be recycled from step $k-1$, and the second term collects the new components of $d\mu_{n_k}$. When drawing $m_k$ samples from $\mu_{n_k}$, the amount of samples coming from one of the components of $\mu_{n_{k-1}}$ is a binomial random variable with number of trials $m_k$ and rate of success $n_{k-1}/n_k$ for each trial. Whenever this random variable takes values less than $m_{k-1}$, which always occurs with some positive probability, it is not possible to recycle all the $m_{k-1}$ samples from iteration $k-1$.

• Variance reduction.
Random mixture proportions induce extra variance in the generated samples. As a consequence, random samples from $\gamma_m$ are more disciplined than random samples from $\mu_m$. This stabilization effect amplifies when using basis elements whose supports are more localized than globally supported orthogonal polynomials. More on this in Remark 6.

• Coarsening and extension to nonnested sequences of approximation spaces.
When using the samples from $\gamma_m$, thanks to the decomposition (1.2), it is possible to remove an element of the basis $\psi_j$ from the space $V$ as well as its associated samples $x^{(j-1)\tau+1},\dots,x^{j\tau}$ from the whole set $x^1,\dots,x^m$ of $m = \tau n$ samples, and at the same time recycle all the $\tau(n-1)$ remaining samples for $V \setminus \{\psi_j\}$ (see the sketch after this list). More generally, the use of $\gamma_m$ allows the development of efficient adaptive methods with arbitrary sequences of approximation spaces $(V_k)_k$, that probe any $\psi_j \notin V_k$ chosen according to some criterion. The method then either retains $\psi_j$ as $V_{k+1} = V_k \cup \{\psi_j\}$, or discards it, depending on its contribution to the reduction of (an estimator of) the error from $V_k$ to $V_{k+1}$.
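As an illustration of this bookkeeping (ours, not from the paper), the following sketch stores the $m = \tau n$ samples from $\gamma_m$ as an $n \times \tau$ matrix, one row per component $\chi_j$ (for instance the output of `sample_gamma` above); coarsening then amounts to deleting one row, and enriching the space amounts to appending rows, without touching the remaining samples.

```python
import numpy as np

def coarsen(samples, j):
    """Remove basis element psi_j: drop row j, keep the remaining tau*(n-1) samples."""
    return np.delete(samples, j, axis=0)      # shape (n-1, tau)

def enrich(samples, new_rows):
    """Add new basis elements: append one row of tau samples per new component chi_j."""
    return np.vstack([samples, new_rows])     # shape (n + n_new, tau)
```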
Comparison with [2]. Section 4.2 presents an analysis of a sampling algorithm (Algorithm 2) that sequentially generates $m$ random samples from $\mu_m$ with an arbitrary nested sequence of approximation spaces $(V_k)_k$. In [2] a similar algorithm that uses $\mu_m$ has been proposed and analysed.
Notation for product of measures. Let $\rho_1$, $\rho_2$ be two Borel measures on $X \subseteq \mathbb{R}^d$ with the Borel $\sigma$-algebra $\mathcal{B} = \mathcal{B}(X)$. The notation $\rho_1 \otimes \rho_2$ denotes the product measure on $X \times X$ with the tensor product Borel $\sigma$-algebra $\mathcal{B} \otimes \mathcal{B}$, that satisfies $\rho_1 \otimes \rho_2(A_1 \times A_2) = \rho_1(A_1)\,\rho_2(A_2)$, for any $A_1, A_2 \in \mathcal{B}$.

Let $X \subseteq \mathbb{R}^d$ be a Borel set, and $\rho$ be a Borel probability measure on $X$. We define the $L^2(X,\rho)$ inner product
\[
\langle f_1, f_2 \rangle = \int_X f_1(x) f_2(x)\, d\rho(x) \qquad (2.4)
\]
associated with the norm $\|f\| := \langle f, f\rangle^{1/2}$. Throughout the paper we denote by $(\psi_i)_{i\ge1}$ an $L^2(X,\rho)$ orthonormal basis. We define the linear space $V := \mathrm{span}\{\psi_1,\dots,\psi_n\}$ that contains $n$ arbitrarily chosen elements of the basis, and denote with $n := \dim(V)$ its dimension. We further assume that
\[
\text{for any } x \in X \text{ there exists } \psi_j \in V \text{ such that } \psi_j(x) \ne 0. \qquad (2.5)
\]
This assumption is verified for example if the space $V$ contains the functions that are constant over $X$. For any given $V$, we define the weight function $w : X \to \mathbb{R}$ as
\[
w(x) := \frac{n}{\sum_{i=1}^n |\psi_i(x)|^2}, \quad x \in X, \qquad (2.6)
\]
whose denominator does not vanish under assumption (2.5). The function $w$ is known as the Christoffel function, up to a renormalization, when $V$ is the space of algebraic polynomials with prescribed total degree. Using $w$ we define the probability measure
\[
d\mu_n := w^{-1} d\rho = \frac{\sum_{i=1}^n |\psi_i(x)|^2}{n}\, d\rho, \qquad (2.7)
\]
which depends on the chosen approximation space $V$. Another inner product used in this paper is
\[
\langle f_1, f_2 \rangle_m := \frac{1}{m} \sum_{j=1}^m w(x^j) f_1(x^j) f_2(x^j), \qquad (2.8)
\]
where the functions $w$, $f_1$, $f_2$ are evaluated at $m$ samples $x^1,\dots,x^m$ independent and identically distributed as $\mu_n$. This inner product is associated with the discrete seminorm $\|f\|_m := \langle f, f\rangle_m^{1/2}$. The discrete inner product (2.8) mimics (2.4) due to (2.7). The exact $L^2$ projection on $V$ of any function $u \in L^2(X,\rho)$ is defined as
\[
\Pi_n u := \operatorname*{argmin}_{v \in V} \|u - v\|.
\]
In practice such a projection cannot be computed except in very particular cases, motivating the interest towards the discrete least-squares approach. We define the weighted least-squares estimator $u_W$ of $u$ as
\[
u_W := \Pi_n^m u = \operatorname*{argmin}_{v \in V} \|u - v\|_m,
\]
that is obtained by applying the discrete projector $\Pi_n^m$ on $V$ to $u$. The estimator $u_W$ is associated to the solution of the linear system
\[
Ga = h, \qquad (2.9)
\]
where the Gramian matrix $G$ and the right-hand side $h$ are defined element-wise as
\[
G_{ij} = \langle \psi_i, \psi_j \rangle_m, \qquad h_i = \langle u, \psi_i \rangle_m, \qquad i,j = 1,\dots,n,
\]
and $a = (a_1,\dots,a_n)^\top$ is the vector containing the coefficients of $u_W = \sum_{i=1}^n a_i \psi_i$ expanded over the orthonormal basis. The linear system (2.9) always has at least one solution, which is unique when $G$ is nonsingular. When $G$ is singular we can define $u_W$ as the estimator associated to the unique minimal $\ell^2$-norm solution to (2.9). Moreover, we define the $L^2(X,\rho)$ best approximation of $u$ in $V$ as
\[
e_{n,2}(u) := \min_{v \in V} \|u - v\| = \|u - \Pi_n u\|, \qquad (2.10)
\]
and the weighted $L^\infty(X,\rho)$ best approximation of $u$ as
\[
e_{n,\infty,w}(u) := \inf_{v \in V} \sup_{y \in X} \sqrt{w(y)}\,|u(y) - v(y)|.
\]
Notice that $\Pi_n$, $\Pi_n^m$, $e_{n,2}$ and $e_{n,\infty,w}$ depend on the chosen space $V$, and not only on its dimension $n$. The identity matrix is denoted with $I \in \mathbb{R}^{n\times n}$. The spectral norm of any matrix $A \in \mathbb{R}^{n\times n}$ is denoted by $\|A\|_2$ and defined as
\[
\|A\|_2 := \sup_{\|v\|_{\ell^2}=1} \|Av\|_{\ell^2},
\]
using the Euclidean inner product in $\mathbb{R}^n$ and its associated norm. Another weighted least-squares estimator introduced in [7] is the conditioned estimator:
\[
u_C := \begin{cases} u_W, & \text{if } \|G - I\|_2 \le \tfrac12, \\ 0, & \text{otherwise}. \end{cases} \qquad (2.11)
\]
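The following sketch (ours, not from the paper) assembles the weighted Gramian $G$ and right-hand side $h$ of (2.9), solves for the coefficients of $u_W$, and applies the conditioning test that defines $u_C$ in (2.11). The basis-evaluation routine `eval_basis(x)`, returning the $m \times n$ matrix $(\psi_j(x^i))_{i,j}$, is an assumed user-supplied ingredient.

```python
import numpy as np

def wls_estimator(x, y, eval_basis):
    """Weighted least squares: x holds the m samples, y = u(x) has shape (m,)."""
    Psi = eval_basis(x)                        # (m, n) matrix of psi_j(x^i)
    m, n = Psi.shape
    w = n / np.sum(Psi**2, axis=1)             # weight (2.6) evaluated at the samples
    G = (Psi * w[:, None]).T @ Psi / m         # Gramian of (2.9): G_ij = <psi_i, psi_j>_m
    h = (Psi * w[:, None]).T @ y / m           # right-hand side: h_i = <u, psi_i>_m
    a = np.linalg.lstsq(G, h, rcond=None)[0]   # minimal-norm solution if G is singular
    stable = np.linalg.norm(G - np.eye(n), 2) <= 0.5
    return (a if stable else np.zeros(n)), G   # coefficients of the conditioned estimator (2.11)
```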
One of the main results from [7] is the following theorem, see [7, Theorem 2.1 and Corollary 2.2].

Theorem 1. For any real $r > 0$, if the integers $m$ and $n$ are such that the condition
\[
n\,\theta_r \le \frac{m}{\ln m}, \quad \text{with } \theta_r := \theta^{-1}(1+r), \quad \theta := \tfrac32 \ln\tfrac32 - \tfrac12 \approx 0.108, \qquad (2.12)
\]
is fulfilled, and $x^1,\dots,x^m$ are independent and identically distributed random samples from $\mu_n$, then the following holds:

(i) the matrix $G$ satisfies the tail bound
\[
\Pr\Big\{ \|G - I\|_2 > \tfrac12 \Big\} \le 2 n m^{-(r+1)} \le 2 m^{-r};
\]

(ii) if $u \in L^2(X,\rho)$ then the estimator $u_C$ satisfies
\[
\mathbb{E}(\|u - u_C\|^2) \le (1 + \varepsilon(m))\, e_{n,2}(u)^2 + 2\|u\|^2 m^{-r},
\]
where $\varepsilon(m) := 4/(\theta_r \ln m) \to 0$ as $m \to +\infty$, and $\theta_r$ as in (2.12);

(iii) with probability larger than $1 - 2m^{-r}$, the estimator $u_W$ satisfies
\[
\|u - u_W\| \le (1+\sqrt2)\, e_{n,\infty,w}(u), \qquad (2.13)
\]
for all $u$ such that $\|\sqrt{w}\,u\|_{L^\infty} < +\infty$.

The above theorem can be rewritten for a chosen confidence level, by setting $\alpha = 2 n m^{-(r+1)}$ and replacing the corresponding $r$ in (2.12). For convenience we rewrite condition (2.12) with equality using the ceiling operator, since the number of samples is an integer and usually one wishes to minimize the number of samples $m$ satisfying (2.12) for a given $n$.
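As a quick illustration (ours, not the paper's), the next snippet evaluates the sample budget prescribed by condition (2.14) of the corollary below, $m = \lceil (n/\theta)\ln(2n/\alpha) \rceil$, showing its near-linear growth with $n$.

```python
import numpy as np

THETA = 1.5 * np.log(1.5) - 0.5          # theta from (2.12), approximately 0.108

def budget(n, alpha):
    """Number of i.i.d. samples from mu_n required by condition (2.14)."""
    return int(np.ceil(n / THETA * np.log(2 * n / alpha)))

# Example: for alpha = 0.1 the ratio budget(n, 0.1) / n equals log(2*n/alpha)/THETA
# up to rounding, i.e. about 70 at n = 100.
```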
Corollary 1. For any $\alpha \in (0,1)$ and any integer $n \ge 1$, if
\[
m = \left\lceil \frac{n}{\theta} \ln\Big( \frac{2n}{\alpha} \Big) \right\rceil, \quad \text{with } \theta \text{ as in (2.12)}, \qquad (2.14)
\]
and $x^1,\dots,x^m$ are $m$ independent and identically distributed random samples from $\mu_n$, then
\[
\Pr\Big( \|G - I\|_2 > \tfrac12 \Big) \le \alpha.
\]

When the evaluations of the function $u$ are noiseless, convergence estimates in probability with confidence level $1-\alpha$ are immediate to obtain. If the evaluations of $u$ are noisy, then convergence estimates in probability of the form (2.13) can still be obtained by using techniques from large deviations to estimate the additional terms due to the presence of the noise, as shown in [19] for standard least squares.

The proof of Theorem 1, and more generally the analysis in [5, 7], use a result from [1, 20] on tail bounds for sums of random matrices. We recall below this result from [20, Theorem 1.1], in a less general form that simplifies the presentation and still fits our purposes. If $X_1,\dots,X_m$ are independent $n \times n$ random self-adjoint and positive semidefinite matrices satisfying $\lambda_{\max}(X_j) = \|X_j\|_2 \le R$ almost surely and $\mathbb{E}(\sum_{j=1}^m X_j) = I$ then it holds
\[
\Pr\Big( \lambda_{\min}\Big( \sum_{j=1}^m X_j \Big) \le 1-\delta \Big) \le n \Big( \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \Big)^{1/R}, \quad \delta \in [0,1], \qquad (2.15)
\]
\[
\Pr\Big( \lambda_{\max}\Big( \sum_{j=1}^m X_j \Big) \ge 1+\delta \Big) \le n \Big( \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \Big)^{1/R}, \quad \delta \ge 0. \qquad (2.16)
\]
Since for $\delta \in (0,1)$ the upper bound in (2.16) is always greater or equal than the upper bound in (2.15), it holds that
\[
\Pr\Big( \Big\| \sum_{j=1}^m X_j - I \Big\|_2 > \delta \Big) \le 2n \Big( \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \Big)^{1/R}. \qquad (2.17)
\]
Finding a suitable value for $R$ and taking $\delta = 1/2$ leads to item (i) in Theorem 1, see [7] for the proof.

One of the features of the bounds (2.15)-(2.16) is that the matrices $X_1,\dots,X_m$ need not be identically distributed. This property has not been exploited in the analysis in [7]. The first contribution of this paper is the following Theorem 2, which states a similar result as Theorem 1, but using a different type of random samples that is very advantageous in itself as well as for the forthcoming application to the adaptive setting.

Theorem 2. For any $\alpha \in (0,1)$ and any integer $n \ge 1$, if
\[
m = \tau n, \quad \text{with } \tau := \left\lceil \theta^{-1} \ln\Big( \frac{2n}{\alpha} \Big) \right\rceil, \quad \theta \text{ as in (2.12)}, \qquad (2.18)
\]
and $x^1,\dots,x^{n\tau}$ is a set of independent random samples such that for any $j = 1,\dots,n$ the samples $x^{(j-1)\tau+1},\dots,x^{j\tau}$ are identically distributed according to $\chi_j$ defined in (1.1), then the following holds:

(i) the matrix $G$ satisfies the tail bound
\[
\Pr\Big\{ \|G - I\|_2 > \tfrac12 \Big\} \le \alpha; \qquad (2.19)
\]

(ii) if $u \in L^2(X,\rho)$ then the estimator $u_C$ satisfies
\[
\mathbb{E}(\|u - u_C\|^2) \le \Big( 1 + \frac{4\theta}{\ln(2n/\alpha)} \Big)\, e_{n,2}(u)^2 + \alpha \|u\|^2; \qquad (2.20)
\]

(iii) with probability larger than $1-\alpha$, the estimator $u_W$ satisfies
\[
\|u - u_W\| \le (1+\sqrt2)\, e_{n,\infty,w}(u),
\]
for all $u$ such that $\|\sqrt{w}\,u\|_{L^\infty} < +\infty$.

Proof. Proof of (i): the matrix $G$ can be decomposed as $G = \sum_{j=1}^n \sum_{k=1}^\tau X^{jk}$ where the $X^{jk}$, $j=1,\dots,n$, $k=1,\dots,\tau$, are mutually independent and, given any $j = 1,\dots,n$, the $X^{j1},\dots,X^{j\tau}$ are identically distributed copies of the rank-one random matrix $X(x)$ defined element-wise as
\[
X_{pq}(x) = \frac{1}{\tau n}\, w(x)\, \psi_p(x)\, \psi_q(x), \quad p,q = 1,\dots,n,
\]
with $x$ being a random variable distributed according to $\chi_j$. Notice that the $X^{jk}$, $j=1,\dots,n$, $k=1,\dots,\tau$, are not identically distributed. Anyway, using (2.6), it holds that for any $p,q = 1,\dots,n$,
\[
\mathbb{E}(G_{pq}) = \mathbb{E}\Big( \sum_{j=1}^n \sum_{k=1}^\tau X^{jk}_{pq} \Big) = \sum_{k=1}^\tau \sum_{j=1}^n \int_X \frac{1}{\tau n} w(x)\psi_p(x)\psi_q(x)|\psi_j(x)|^2\, d\rho = \frac{1}{n} \int_X w(x)\psi_p(x)\psi_q(x)\sum_{j=1}^n |\psi_j(x)|^2\, d\rho = \int_X \psi_p(x)\psi_q(x)\, d\rho = \delta_{pq},
\]
and therefore $\mathbb{E}(G) = I$. We then use (2.17) to obtain that if $\|X^{jk}(x)\|_2 \le R$ almost surely for any $j=1,\dots,n$ and any $k=1,\dots,\tau$ then for any $\delta \in (0,1)$ it holds
\[
\Pr\big( \|G - I\|_2 > \delta \big) \le 2n \exp\Big( -\frac{c_\delta}{R} \Big), \quad \text{with } c_\delta := (1+\delta)\ln(1+\delta) - \delta > 0.
\]
We choose $\delta = 1/2$ and obtain $c_{1/2} = \theta$ as in (2.12). Since $X^{jk}$ has rank one and
\[
\|X^{jk}(x)\|_2^2 = \mathrm{trace}\big( (X^{jk}(x))^\top X^{jk}(x) \big) = \Big( \frac{1}{\tau n} w(x) \sum_{\ell=1}^n \psi_\ell(x)^2 \Big)^2 = \frac{1}{\tau^2}, \quad j = 1,\dots,n,
\]
for all $k=1,\dots,\tau$ and uniformly for all $x \in X$, we can take $R = 1/\tau$ and obtain that, if $m$ and $n$ satisfy (2.18) then
\[
\Pr\Big( \|G - I\|_2 > \tfrac12 \Big) \le 2n e^{-\theta\tau} \le 2n e^{-\ln(2n/\alpha)} = \alpha.
\]

The overall structure of the proof of (ii) follows [7], with some differences due to the fact that here the samples $x^1,\dots,x^m$ are not identically distributed. First we identify the underlying probability measure associated to these samples. The $m = \tau n$ samples $x^1,\dots,x^m$ are all mutually independent, and are subdivided into $\tau$ blocks, where each block contains $n$ random samples. More precisely, each block contains one random sample distributed as $\chi_j$, for $j = 1,\dots,n$. The probability measure of each block $z = (z_1,\dots,z_n)$ is $d\gamma_n := \prod_{j=1}^n |\psi_j(z_j)|^2 d\rho$, where each $z_j \in X$. The probability measure of the $\tau$ blocks, with all the $\tau n$ random samples $x^1,\dots,x^m$, is $d\gamma_m := \otimes^\tau d\gamma_n$. Let $\Omega$ be the set of all possible draws from $\gamma_m$, $\Omega_+$ be the set of all draws such that $\|G-I\|_2 \le 1/2$, and $\Omega_- := \Omega \setminus \Omega_+$ be its complement. Under the assumptions of Theorem 2, from (2.19) it holds that
\[
\Pr(\Omega_-) = \int_{\Omega_-} d\gamma_m \le \alpha.
\]
Denote $g := u - \Pi_n u$. We consider the event $\|G-I\|_2 \le 1/2$, where it holds
\[
\|u - u_C\|^2 = \|u - u_W\|^2 = \|g\|^2 + \|\Pi_n^m g\|^2,
\]
since $\Pi_n^m \Pi_n u = \Pi_n u$ and $g$ is orthogonal to $V$. Denoting with $(a_1,\dots,a_n)^\top$ the solution to the linear system $Ga = b$, with $b = (\langle g, \psi_k\rangle_m)_{k=1,\dots,n}$, we have that
\[
\|u - u_C\|^2 = e_{n,2}(u)^2 + \sum_{k=1}^n |a_k|^2.
\]
Since $\|G - I\|_2 \le \tfrac12 \implies \lambda_{\min}(G) \ge \tfrac12 \implies \|G^{-1}\|_2 \le 2$, from the line above
\[
\|u - u_C\|^2 \le e_{n,2}(u)^2 + 4 \sum_{k=1}^n |\langle g,\psi_k\rangle_m|^2.
\]
In the event $\|G-I\|_2 > \tfrac12$, by the definition of $u_C$ in (2.11) we have $\|u - u_C\| = \|u\|$. Taking the expectation of $\|u-u_C\|^2$ w.r.t. $\gamma_m$ we obtain that
\[
\mathbb{E}(\|u-u_C\|^2) = \int_{\Omega_+} \|u-u_C\|^2 d\gamma_m + \int_{\Omega_-} \|u-u_C\|^2 d\gamma_m \le \Big( e_{n,2}(u)^2 + 4\sum_{k=1}^n \mathbb{E}\big(|\langle g,\psi_k\rangle_m|^2\big) \Big)\Pr(\Omega_+) + \|u\|^2 \Pr(\Omega_-) \le e_{n,2}(u)^2 + 4\sum_{k=1}^n \mathbb{E}\big(|\langle g,\psi_k\rangle_m|^2\big) + \alpha\|u\|^2.
\]
We now study the second term in the above expression, crucially exploiting the structure of the random samples and the fact that their expectations still pile up and simplify, despite the samples not being identically distributed:
\[
\begin{aligned}
\sum_{k=1}^n \mathbb{E}\big(|\langle g,\psi_k\rangle_m|^2\big)
&= \sum_{k=1}^n \mathbb{E}\Big( \frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m w(x^i)w(x^j)g(x^i)g(x^j)\psi_k(x^i)\psi_k(x^j) \Big)
= \frac{1}{m^2} \sum_{k=1}^n \sum_{i=1}^m\sum_{j=1}^m \mathbb{E}\big( w(x^i)w(x^j)g(x^i)g(x^j)\psi_k(x^i)\psi_k(x^j) \big)\\
&= \frac{1}{m^2}\Bigg( \underbrace{\sum_{k=1}^n\sum_{i=1}^m\sum_{\substack{j=1\\ j\ne i}}^m \mathbb{E}\big( w(x^i)w(x^j)g(x^i)g(x^j)\psi_k(x^i)\psi_k(x^j) \big)}_{\mathrm{I}} + \underbrace{\sum_{k=1}^n\sum_{i=1}^m \mathbb{E}\Big( \big(w(x^i)g(x^i)\psi_k(x^i)\big)^2 \Big)}_{\mathrm{II}} \Bigg).
\end{aligned}
\]
For term I: for any fixed $k = 1,\dots,n$, using in sequence the independence of the samples, the structure of the samples and the definition of $w$, the inner double sum over $i \ne j$ satisfies
\[
\begin{aligned}
\sum_{i=1}^m \sum_{\substack{j=1\\ j\ne i}}^m \mathbb{E}\big( w(x^i)g(x^i)\psi_k(x^i) \big)\,\mathbb{E}\big( w(x^j)g(x^j)\psi_k(x^j) \big)
&= \sum_{i=1}^m \mathbb{E}\big( w(x^i)g(x^i)\psi_k(x^i) \big) \Big( \sum_{j=1}^m \mathbb{E}\big( w(x^j)g(x^j)\psi_k(x^j) \big) - \mathbb{E}\big( w(x^i)g(x^i)\psi_k(x^i) \big) \Big)\\
&= \sum_{i=1}^m \mathbb{E}\big( w(x^i)g(x^i)\psi_k(x^i) \big) \Big( \sum_{\ell=1}^\tau \sum_{j=1}^n \int_X w(x)g(x)\psi_k(x)|\psi_j(x)|^2\, d\rho - \mathbb{E}\big( w(x^i)g(x^i)\psi_k(x^i) \big) \Big)\\
&= \sum_{i=1}^m \mathbb{E}\big( w(x^i)g(x^i)\psi_k(x^i) \big) \Big( \sum_{\ell=1}^\tau \int_X w(x)g(x)\psi_k(x)\sum_{j=1}^n |\psi_j(x)|^2\, d\rho - \mathbb{E}\big( w(x^i)g(x^i)\psi_k(x^i) \big) \Big)\\
&= \sum_{i=1}^m \mathbb{E}\big( w(x^i)g(x^i)\psi_k(x^i) \big) \Big( \tau n \underbrace{\int_X g(x)\psi_k(x)\, d\rho}_{=\,0} - \mathbb{E}\big( w(x^i)g(x^i)\psi_k(x^i) \big) \Big)
= -\sum_{i=1}^m \Big( \mathbb{E}\big( w(x^i)g(x^i)\psi_k(x^i) \big) \Big)^2 \le 0,
\end{aligned}
\]
where $\langle g,\psi_k\rangle = 0$ for $k=1,\dots,n$ because $g$ is orthogonal to $V$; hence $\mathrm{I} \le 0$. For term II, again exploiting the structure of the samples and the definition of $w$ it holds
\[
\mathrm{II} = \sum_{i=1}^m \mathbb{E}\Big( w(x^i)^2 g(x^i)^2 \sum_{k=1}^n \psi_k(x^i)^2 \Big) = n\sum_{i=1}^m \mathbb{E}\big( w(x^i)g(x^i)^2 \big) = n\sum_{\ell=1}^\tau \sum_{j=1}^n \int_X w(x)g(x)^2|\psi_j(x)|^2\, d\rho = n\sum_{\ell=1}^\tau \int_X w(x)g(x)^2\sum_{j=1}^n |\psi_j(x)|^2\, d\rho = n^2\tau \int_X g(x)^2\, d\rho = n^2\tau \|g\|^2.
\]
Using $m = \tau n$ in term II and neglecting the nonpositive contribution of term I, we obtain
\[
\mathbb{E}(\|u - u_C\|^2) \le \Big( 1 + \frac{4n}{m} \Big) e_{n,2}(u)^2 + \alpha\|u\|^2.
\]
Since $n/m = \tau^{-1} \le \theta/\ln(2n/\alpha)$ we finally obtain (2.20).

The proof of (iii) uses (i) and then proceeds in the same way as for the proof of (iii) in Theorem 1 from [7]. From the definition of the spectral norm,
\[
\|G - I\|_2 \le \tfrac12 \iff \tfrac12\|v\|^2 \le \|v\|_m^2 \le \tfrac32\|v\|^2, \quad v \in V,
\]
and this norm equivalence holds at least with probability $1-\alpha$ from item (i) under condition (2.18). Using the above norm equivalence, the Pythagorean identity $\|u - v\|_m^2 = \|v - u_W\|_m^2 + \|u - u_W\|_m^2$, and
\[
\max\big( \|u - v\|_m, \|u - v\| \big) \le \|\sqrt{w}(u-v)\|_{L^\infty},
\]
for any $v \in V$ it holds that
\[
\|u - u_W\| \le \|u - v\| + \|v - u_W\| \le \|u - v\| + \sqrt2\,\|v - u_W\|_m \le \|u - v\| + \sqrt2\,\|u - v\|_m \le (1+\sqrt2)\,\|\sqrt{w}(u-v)\|_{L^\infty}.
\]
Since $v$ is arbitrary we obtain the thesis.

The next Corollary 2 extends Theorem 2 to any $m$ (not necessarily an integer multiple of $n$) satisfying (2.14). For any (fixed) $\tau = 0,\dots,\lfloor m/n\rfloor$, the set of $m$ random samples in Corollary 2 is obtained by merging $m - \tau n$ random samples distributed as $\mu_n$ and $\tau$ random samples from $\chi_j$ for all $j = 1,\dots,n$. When $m$ is an integer multiple of $n$ and $\tau = \lfloor m/n\rfloor$, Corollary 2 gives Theorem 2 as a particular case. When $\tau = 0$, all the random samples in Corollary 2 are distributed as $\mu_n$, like in Corollary 1.

Corollary 2.
For any $\alpha \in (0,1)$, any integers $n \ge 1$, $m \ge n$ and $\tau = 0,\dots,\lfloor m/n\rfloor$, if $m$ satisfies (2.14) and $x^1,\dots,x^m$ is a set of independent random samples such that for any $j = 1,\dots,n$ the $x^{(j-1)\tau+1},\dots,x^{j\tau}$ are identically distributed according to $\chi_j$ defined in (1.1), and the $x^{n\tau+1},\dots,x^m$ are identically distributed as $\mu_n$, then items (i), (ii) and (iii) of Theorem 2 hold true.

Proof. We proceed as in the proof of Theorem 2 but using the decomposition $G = \sum_{j=1}^n\sum_{k=1}^\tau X^{jk} + \sum_{\ell=n\tau+1}^m X^\ell$, where all the $X^{jk}$ and $X^\ell$ are mutually independent, and given any $j = 1,\dots,n$, the $X^{j1},\dots,X^{j\tau}$ are identically distributed copies of the rank-one random matrix $X(x)$ defined element-wise as
\[
X_{pq}(x) = \frac{1}{m}\, w(x)\,\psi_p(x)\,\psi_q(x), \quad p,q = 1,\dots,n,
\]
with $x$ being a random variable distributed according to $\chi_j$, and the $X^\ell$ are identically distributed copies of $X(x)$ but with $x$ being a random variable distributed according to $\mu_n$. For any $p,q = 1,\dots,n$ and $\tau = 0,\dots,\lfloor m/n\rfloor$,
\[
\begin{aligned}
\mathbb{E}(G_{pq}) &= \sum_{k=1}^\tau\sum_{j=1}^n \mathbb{E}\big(X^{jk}_{pq}\big) + \sum_{\ell=n\tau+1}^m \mathbb{E}\big(X^\ell_{pq}\big)
= \sum_{k=1}^\tau\sum_{j=1}^n \int_X \frac{1}{m} w(x)\psi_p(x)\psi_q(x)|\psi_j(x)|^2\, d\rho + \sum_{\ell=n\tau+1}^m \int_X \frac{1}{m} w(x)\psi_p(x)\psi_q(x)\frac{\sum_{j=1}^n|\psi_j(x)|^2}{n}\, d\rho\\
&= \frac{n\tau}{m}\int_X \psi_p(x)\psi_q(x)\, d\rho + \frac{m-n\tau}{m}\int_X \psi_p(x)\psi_q(x)\, d\rho = \delta_{pq}.
\end{aligned}
\]
When $\tau = 0$ ($m$ is an integer multiple of $n$ and $\tau = \lfloor m/n\rfloor$), the leftmost (rightmost) sum in the first equation above is empty. Since
\[
\|X^{jk}(x)\|_2 = \|X^\ell(x)\|_2 = \frac{n}{m}, \quad j = 1,\dots,n,
\]
for all $k=1,\dots,\tau$, for all $\ell = n\tau+1,\dots,m$ and uniformly for all $x \in X$, we can take $R = n/m$ and obtain that if $m$ and $n$ satisfy (2.14) then (2.19) holds true.

For the proof of (2.20), we proceed as in the proof of Theorem 2 with two differences. When bounding term I, we split the random samples as
\[
\sum_{j=1}^m \mathbb{E}\big( w(x^j)g(x^j)\psi_k(x^j) \big) = \sum_{\ell=1}^\tau\sum_{j=1}^n \int_X w(x)g(x)\psi_k(x)|\psi_j(x)|^2\, d\rho + \sum_{\ell=n\tau+1}^m \int_X w(x)g(x)\psi_k(x)\frac{\sum_{j=1}^n|\psi_j(x)|^2}{n}\, d\rho = n\tau\int_X g(x)\psi_k(x)\, d\rho + (m-n\tau)\int_X g(x)\psi_k(x)\, d\rho = 0,
\]
and this term again vanishes due to the orthogonality of $g$ to $\psi_k$. For term II, splitting again the random samples we obtain
\[
\begin{aligned}
\sum_{k=1}^n\sum_{j=1}^m \mathbb{E}\Big( \big( w(x^j)g(x^j)\psi_k(x^j) \big)^2 \Big) &= \sum_{j=1}^m \mathbb{E}\Big( w(x^j)^2 g(x^j)^2 \sum_{k=1}^n \psi_k(x^j)^2 \Big) = n\sum_{j=1}^m \mathbb{E}\big( w(x^j)g(x^j)^2 \big)\\
&= n\sum_{\ell=1}^\tau\sum_{j=1}^n \int_X w(x)g(x)^2|\psi_j(x)|^2\, d\rho + n\sum_{\ell=n\tau+1}^m \int_X w(x)g(x)^2\frac{\sum_{k=1}^n|\psi_k(x)|^2}{n}\, d\rho = n^2\tau\|g\|^2 + n(m-n\tau)\|g\|^2 = nm\|g\|^2.
\end{aligned}
\]
The proof of the last item is the same as the corresponding proof in Theorem 2, but using (2.19) with the random samples of Corollary 2.
We now apply the results for a given approximation space from the previous section to an arbitrary sequence of nested spaces $(V_k)_{k\ge1} \subset L^2(X,\rho)$, with $V_k := \mathrm{span}\{\psi_1,\dots,\psi_{n_k}\}$ and $n_k := \dim(V_k)$. Theorem 1 and Theorem 2 provide two different approaches to build the set of random samples for a given approximation space, and each one of them can be applied to the adaptive setting. Since the samples are adapted to the space, the underlying challenge is how to recycle as much as possible the samples associated to the spaces from the previous iterations, in order to keep the overall number of samples generated from iteration one to $k$ of the same order as $\dim(V_k)$, i.e. the same scaling as in the results for an individual approximation space.

First we briefly discuss the approach using Theorem 2. This theorem prescribes the precise number of random samples coming from each component of the mixture (2.7) associated to the space. When the spaces are nested, this trivially allows one to recycle all the samples from all the previous iterations, just by adding the missing samples to the previous ones, as shown in (1.2). The concrete procedure and the related Algorithm 1 are explained in Section 4.1.

The approach using Theorem 1 is not as simple and effective as the previous one. Without recycling the samples from the previous iterations, the naive sequential application of Theorem 1 to each space $V_1,\dots,V_t$ would require the generation of an overall number of samples equal to $\sum_{k=1}^t m_k$, with $m_k$ samples drawn from each $\mu_{n_k}$. However, despite the fact that $\mu_{n_k}$ changes at each iteration $k$, it is possible to recycle most, but not all, of the samples from the previous iterations by leveraging the additive structure of $\mu_{n_k}$ as in (1.3). This procedure is described in Section 4.2 together with Algorithm 2.

The next results are obtained by applying Theorem 2 (respectively Theorem 1) individually for each space $V_k$ and using a union bound, with the random samples produced by Algorithm 1 (respectively Algorithm 2). Here $I_k \in \mathbb{R}^{n_k\times n_k}$ denotes the identity matrix. For any $s > 1$, $\zeta(s)$ denotes the Riemann zeta function. The best approximation error (2.10) of $u$ on the space $V_k$ is denoted by $e_{n_k,2}(u)$, and $u_C^k$ denotes the estimator (2.11) on $V_k$.
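To fix ideas, the sketch below (ours, not part of the paper) evaluates the per-iteration sample budgets prescribed by conditions (3.21) and (3.22) of the two theorems that follow, for a given nested sequence of dimensions $n_1 < \dots < n_t$; it illustrates that both budgets grow essentially like $n_k \ln n_k$.

```python
import numpy as np
from scipy.special import zeta

THETA = 1.5 * np.log(1.5) - 0.5                                   # theta from (2.12)

def budgets(dims, alpha=0.1, s=2.0):
    """Per-iteration sample counts: (tau_k, m_k) of (3.21) and m_k of (3.22)."""
    dims = np.asarray(dims, float)
    log_term = np.log(2 * zeta(s) * dims ** (s + 1) / alpha)
    tau = np.ceil(log_term / THETA).astype(int)                    # condition (3.21)
    m_hat = tau * dims.astype(int)                                 # Theorem 3: m_k = tau_k * n_k
    m = np.ceil(dims * log_term / THETA).astype(int)               # condition (3.22)
    return tau, m_hat, m

# Example: budgets((10, 20, 40, 80)) shows m_hat only slightly above m, cf. Lemma 1 below.
```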
Theorem 3. Let $\alpha \in (0,1)$, $s > 1$ be real numbers and $t \ge 1$ be an integer. Given any nested sequence of spaces $V_1 \subset \dots \subset V_t \subset L^2(X,\rho)$ with dimensions $n_1 < \dots < n_t$, if
\[
m_k = \tau_k n_k, \quad \tau_k := \left\lceil \theta^{-1} \ln\Big( \frac{2\zeta(s) n_k^{s+1}}{\alpha} \Big) \right\rceil, \quad k = 1,\dots,t, \qquad (3.21)
\]
then

(i)
\[
\Pr\Big( \bigcap_{k=1}^t \Big\{ \|G_k - I_k\|_2 \le \tfrac12 \Big\} \Big) \ge 1 - \alpha,
\]
where $G_k \in \mathbb{R}^{n_k\times n_k}$ is defined element-wise as
\[
(G_k)_{pq} = m_k^{-1} \sum_{j=1}^{m_k} w_k(x^j)\,\psi_p(x^j)\,\psi_q(x^j),
\]
with $w_k$ the weight function (2.6) associated to $V_k$, and $x^1,\dots,x^{m_t}$ is a set of $m_t$ independent random samples such that for any $k = 1,\dots,t$ and for any $j = 1,\dots,n_k$ the samples $x^{(j-1)\tau_k+1},\dots,x^{j\tau_k}$ are distributed according to $\chi_j$. The set $x^1,\dots,x^{m_t}$ can be generated by Algorithm 1.

(ii) If $u \in L^2(X,\rho)$ then for any $k = 1,\dots,t$ the estimator $u_C^k$ satisfies
\[
\mathbb{E}(\|u - u_C^k\|^2) \le \Big( 1 + \frac{4\theta}{\ln(2\zeta(s)n_k^{s+1}/\alpha)} \Big)\, e_{n_k,2}(u)^2 + \alpha\|u\|^2.
\]

Proof.
Proof of (i). From Lemma 2, for any $k = 1,\dots,t$ Algorithm 1 with $\tau_k$ as in (3.21) generates a set $x^1,\dots,x^{m_k}$ of $m_k$ random samples with the required properties. By construction, these random samples satisfy the assumptions of Theorem 2, and are used to compute the matrix $G_k$. For any $k = 1,\dots,t$, using Theorem 2 individually for each $G_k$ with $\alpha_k = \frac{\alpha}{\zeta(s)n_k^s}$ gives
\[
\sum_{k=1}^t \Pr\Big( \Big\{ \|G_k - I_k\|_2 > \tfrac12 \Big\} \Big) \le \sum_{k=1}^t \alpha_k \le \frac{\alpha}{\zeta(s)} \sum_{k=1}^t k^{-s} \le \frac{\alpha}{\zeta(s)} \sum_{k\ge1} k^{-s} = \alpha,
\]
where the second inequality follows from strict monotonicity of $(n_k)_{k\ge1}$ and $n_1 \ge 1$, which implies $n_k \ge k$. Using De Morgan's law and a probability union bound for the matrices $G_1,\dots,G_t$ it holds that
\[
\Pr\Big( \bigcap_{k=1}^t \Big\{ \|G_k - I_k\|_2 \le \tfrac12 \Big\} \Big) \ge 1 - \sum_{k=1}^t \Pr\Big( \Big\{ \|G_k - I_k\|_2 > \tfrac12 \Big\} \Big) \ge 1 - \alpha.
\]
The proof of (ii) trivially follows from Theorem 2, since $\alpha_k \le \alpha$ for any $k = 1,\dots,t$.

Theorem 4.
Let $\alpha \in (0,1)$, $s > 1$ be real numbers and $t \ge 1$ be an integer. Given any nested sequence of spaces $V_1 \subset \dots \subset V_t \subset L^2(X,\rho)$ with dimensions $n_1 < \dots < n_t$, if
\[
m_k = \left\lceil \frac{n_k}{\theta} \ln\Big( \frac{2\zeta(s) n_k^{s+1}}{\alpha} \Big) \right\rceil, \quad k = 1,\dots,t, \qquad (3.22)
\]
then

(i)
\[
\Pr\Big( \bigcap_{k=1}^t \Big\{ \|G_k - I_k\|_2 \le \tfrac12 \Big\} \Big) \ge 1 - \alpha,
\]
where $G_k \in \mathbb{R}^{n_k\times n_k}$ is defined element-wise as
\[
(G_k)_{pq} = m_k^{-1} \sum_{j=1}^{m_k} w_k(x^j)\,\psi_p(x^j)\,\psi_q(x^j),
\]
and $x^1,\dots,x^{m_t}$ is a set of $m_t$ independent random samples such that for any $k = 1,\dots,t$ the $x^1,\dots,x^{m_k}$ are distributed according to $\mu_{n_k}$. The set $x^1,\dots,x^{m_t}$ can be generated by Algorithm 2 using an overall number of random samples given by the random variable $\widetilde m_t$ in (4.25).

(ii) If $u \in L^2(X,\rho)$ then for any $k = 1,\dots,t$ the estimator $u_C^k$ satisfies
\[
\mathbb{E}(\|u - u_C^k\|^2) \le \Big( 1 + \frac{4\theta}{\ln(2\zeta(s)n_k^{s+1}/\alpha)} \Big)\, e_{n_k,2}(u)^2 + \alpha\|u\|^2. \qquad (3.23)
\]

Proof.
From Lemma 3, Algorithm 2 generates a set $x^1,\dots,x^{m_t}$ of $m_t$ random samples with the required properties. The rest of the proof of items (i) and (ii) follows the proof of Theorem 3, but applying Corollary 1 individually to each $G_k$, rather than Theorem 2.

Condition (3.21) ensures that $m_k$ is an integer multiple of $n_k$ for any $k \ge 1$. This condition requires a number of points $m_k$ only slightly larger than condition (3.22) for the same values of $n_k$, $s$ and $\alpha$ (the proof is postponed to the forthcoming Lemma 1). However, to compare the effective number of samples used in Theorem 3 and Theorem 4 one cannot just compare (3.21) and (3.22), because (3.22) neglects the random samples that have not been recycled from all the previous iterations. This issue is discussed in Remark 1.

For convenience, in Remark 1, Lemma 1 and Remark 2 we denote with $\widehat m_k$ the number of points required by condition (3.21), and with $m_k$ the number of points required by (3.22), for the same values of $n_k$, $s$ and $\alpha$.

Remark 1.
For any $t \ge 1$, in Theorem 3 (Theorem 4) the set $x^1,\dots,x^{\widehat m_t}$ ($x^1,\dots,x^{m_t}$) of random samples can be generated by Algorithm 1 (Algorithm 2). In Theorem 4, the generation of the $m_t$ random samples requires Algorithm 2 to produce an overall number $\widetilde m_t$ of random samples, with $\widetilde m_t$ being the random variable defined in (4.25). Among the $\widetilde m_t$ random samples (not necessarily distributed as $\mu_{n_t}$) only $m_t$ are retained, and the remaining $\widetilde m_t - m_t$ are discarded. Condition (3.22) does not take into account the $\widetilde m_t - m_t$ discarded samples. As a consequence, when comparing the effective number of samples required in Theorem 3 and Theorem 4, one should compare $\widehat m_t$ with $\widetilde m_t$, and not $\widehat m_t$ with $m_t$.

It can be shown that $\widetilde m_t - m_t$ remains small with large probability, if $m_k$ satisfies (3.22) for all $k = 1,\dots,t$. More precisely, using the upper bound $\widetilde m_t \le m_t + U_t$ with $U_t$ defined in (4.26), Lemma 4, the last inequality in (3.24) and Remark 5, it can be shown that $\widetilde m_t$ is upper bounded by a random variable with mean at most $(2+\theta)m_t$ and variance at most $(1+\theta)m_t$ that exhibits Gaussian concentration.
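As an illustration of this comparison (ours, not from the paper), the following sketch simulates the overall number of draws $\widetilde m_t$ of the purely random strategy, using the binomial bookkeeping of Algorithm 2 described in Section 4.2, and compares it with the deterministic budget $\widehat m_t$ of condition (3.21).

```python
import numpy as np
from scipy.special import zeta

THETA = 1.5 * np.log(1.5) - 0.5

def m_hat(n, alpha, s):                        # condition (3.21): tau_k * n_k
    return n * int(np.ceil(np.log(2 * zeta(s) * n ** (s + 1) / alpha) / THETA))

def m_req(n, alpha, s):                        # condition (3.22)
    return int(np.ceil(n / THETA * np.log(2 * zeta(s) * n ** (s + 1) / alpha)))

def simulate_m_tilde(dims, alpha=0.1, s=2.0, rng=np.random.default_rng(0)):
    """Overall number of draws (4.25) made by the purely random sequential strategy."""
    m_prev = m_req(dims[0], alpha, s)
    total = m_prev
    for n_prev, n in zip(dims[:-1], dims[1:]):
        m_k = m_req(n, alpha, s)
        B_k = rng.binomial(m_k, (n - n_prev) / n)       # samples from the new components
        total += B_k + max(m_k - B_k - m_prev, 0)       # new draws at iteration k
        m_prev = m_k
    return total

# Example: for dims = (10, 20, 40, 80), simulate_m_tilde(dims) typically exceeds
# m_hat(80, 0.1, 2.0), in line with the comparison made in Remark 1.
```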
Lemma 1. For any $k \ge 1$ it holds that
\[
m_k \le \widehat m_k \le m_k + n_k - 1 \le m_k(1+\epsilon_k) - 1, \qquad (3.24)
\]
where
\[
\epsilon_k := \theta \left( \ln\Big( \frac{2 n_k^{s+1}\zeta(s)}{\alpha} \Big) \right)^{-1} \ll 1.
\]

Proof. For any $n_k \ge 1$,
\[
m_k = \left\lceil n_k\,\theta^{-1} \ln\Big( \frac{2 n_k^{s+1}\zeta(s)}{\alpha} \Big) \right\rceil \le n_k \left\lceil \theta^{-1}\ln\Big( \frac{2 n_k^{s+1}\zeta(s)}{\alpha} \Big) \right\rceil = \widehat m_k < n_k\Big( \theta^{-1}\ln\Big( \frac{2 n_k^{s+1}\zeta(s)}{\alpha} \Big) + 1 \Big).
\]
The first inequality above proves the first inequality in (3.24). The rightmost strict inequality above and (3.22) prove the second (large) inequality in (3.24). The last inequality in (3.24) is obtained by using once again (3.22) together with properties of the ceiling operator. Notice that $\theta \approx 0.108$.

Remark 2.
The small number $\widehat m_k - m_k < n_k$ of additional samples required by (3.21) contributes to further improve the stability of $u_W$. This slight surplus of samples is completely negligible from a fully adaptive point of view, where at each iteration $k$ conditions (3.21) or (3.22) are not necessarily fulfilled, but, more simply, new random samples are just added to the previous ones until a certain stability criterion is met, for example until $\|G_k - I_k\|_2 \le \xi_k$ or $\mathrm{cond}(G_k) \le \xi_k$ for some threshold $\xi_k$ eventually depending on $k$.

Remark 3.
For any integer $t \ge 1$ and reals $\alpha \in (0,1)$, $s > 1$, the result in Theorem 3 can be sharpened by using $m_t$ random samples $x^1,\dots,x^{m_t}$ such that, with the same $\tau_k$ as in (3.21),

• for any $k = 1,\dots,t-1$ and for any $j = 1,\dots,n_k$ the $x^{(j-1)\tau_k+1},\dots,x^{j\tau_k}$ are distributed according to $\chi_j$,

• at iteration $t$, for any $j = 1,\dots,n_t$ the samples $x^{(j-1)\tau_{t-1}+1},\dots,x^{j\tau_{t-1}}$ are distributed according to $\chi_j$, and the samples $x^{n_t\tau_{t-1}+1},\dots,x^{m_t}$ are distributed according to $\mu_{n_t}$.

For any $k = 1,\dots,t-1$ the set $x^1,\dots,x^{m_k}$ can be incrementally constructed using (1.2). For the construction of $x^1,\dots,x^{m_t}$ at the last iteration $t$: we recycle all the $m_{t-1} = n_{t-1}\tau_{t-1}$ samples from iteration $t-1$, then we add $(n_t - n_{t-1})\tau_{t-1}$ new samples, i.e. $\tau_{t-1}$ samples from $\chi_j$ for all $j = n_{t-1}+1,\dots,n_t$, and then add the remaining $m_t - n_t\tau_{t-1}$ samples from $\mu_{n_t}$.

Using the random samples $x^1,\dots,x^{m_t}$, in the proof of Theorem 3 we can apply Theorem 2 at the first $t-1$ iterations, and then at iteration $t$ apply Corollary 2 with $n = n_t$, $m = m_t$, $\tau = \tau_{t-1}$. This proves the same conclusions of Theorem 3 under the same condition (3.21) on $m_k$ for $k = 1,\dots,t-1$, but with the slightly better condition (3.22) on $m_t$, because $m_t$ need not be an integer multiple of $n_t$.

Remark 4.
Remark 3 can be used to develop adaptive algorithms that recycle all the samples from all the previous iterations, and that use a number of samples $m_t$ given by (3.22) at the last iteration $t$. Such adaptive algorithms need to detect in advance which is the last iteration, in contrast to algorithms developed using Theorem 3, where this information is not needed.

In the following we present two sequential algorithms that generate the random samples required by Theorem 3 or Theorem 4 at any iteration, say $t$, while recycling the samples from the previous iterations $k = 1,\dots,t-1$. In the algorithms, a loop of the form "for $i$ = start to end" on a variable, say $i$, is not executed if end < start.

This section presents Algorithm 1, which can be used to produce the random samples required by Theorem 3 using the decomposition (1.2). By construction, Algorithm 1 recycles all the samples from all the previous iterations. At any iteration $t \ge 1$, the algorithm stores the $m_t = \tau_t n_t$ random samples in an $n_t \times \tau_t$ matrix. All the elements of this matrix are modified only once as the algorithm runs from iteration one to $t$. The algorithm works with any nondecreasing positive integer sequence $(\tau_k)_{k\ge1}$.

Lemma 2.
Let $(V_k)_{k\ge1}$ be any sequence of nested spaces with dimension $n_k = \dim(V_k)$, and $(\tau_k)_{k\ge1}$ be a positive nondecreasing integer sequence. For any $t \ge 1$, Algorithm 1 generates a set of $m_t = \tau_t n_t$ random samples $x^1,\dots,x^{m_t}$ with the property that for any $k = 1,\dots,t$ and for any $j = 1,\dots,n_k$ the samples $x^{(j-1)\tau_k+1},\dots,x^{j\tau_k}$ are distributed according to $\chi_j$.

Proof. At any iteration $k = 1,\dots,t$ Algorithm 1 produces the matrix $\{x^{j\ell},\ j = 1,\dots,n_k,\ \ell = 1,\dots,\tau_k\}$ that contains the $\tau_k n_k = m_k$ random samples, by modifying only the elements $\{x^{j\ell},\ j = 1,\dots,n_{k-1},\ \ell = 1+\tau_{k-1},\dots,\tau_k\}$ and $\{x^{j\ell},\ j = 1+n_{k-1},\dots,n_k,\ \ell = 1,\dots,\tau_k\}$. By construction, for any $j = 1,\dots,n_k$, the $j$th row of this matrix contains $\tau_k$ samples distributed as $\chi_j$. This matrix is recast into a column vector by means of a transposition composed with a vectorization, which piles up its rows into the vector $(x^1,\dots,x^{m_k})^\top$.

When $\rho$ is a product measure on $X$, random samples from all the $\chi_j$ appearing in Algorithm 1 can be efficiently drawn by using the algorithms proposed in [7], i.e. inverse transform sampling or rejection sampling. The computational cost required by these algorithms scales linearly with respect to $d$ and to the desired number of samples.
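A minimal sketch of this bookkeeping (ours; the routine `sample_chi(j, size)` drawing from $\chi_j$ is again an assumed ingredient) grows the $n_k \times \tau_k$ sample matrix across nested iterations, filling each entry exactly once:

```python
import numpy as np

def extend_gamma_samples(samples, n_new, tau_new, sample_chi):
    """One iteration of the Algorithm-1-style construction.

    samples : (n_old, tau_old) array of previous draws (row j ~ chi_j), or None at k = 1.
    Returns an (n_new, tau_new) array whose first n_old rows and tau_old columns
    are the recycled samples; only the new entries are drawn.
    """
    n_old, tau_old = (0, 0) if samples is None else samples.shape
    out = np.empty((n_new, tau_new))
    if samples is not None:
        out[:n_old, :tau_old] = samples                      # recycled with certainty
    for j in range(n_old):                                   # old components, new columns
        out[j, tau_old:] = sample_chi(j, tau_new - tau_old)
    for j in range(n_old, n_new):                            # new components, all columns
        out[j, :] = sample_chi(j, tau_new)
    # flattening the matrix row by row gives the ordering x^1, ..., x^{m_k} used in Theorem 3
    return out
```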
This section presents Algorithm 2, which can be used to produce the random samples required by Theorem 4, and uses the decomposition (1.3). For any $k \ge 2$, a standard algorithm for generating $m_k$ random samples from $\mu_{n_k}$ uses a binomial random variable $B_k \sim \mathrm{Bin}(m_k, (n_k - n_{k-1})/n_k)$ to determine the proportion of samples coming from $\sigma_{n_k}$. The first parameter of $B_k$ is the number of trials, and the second parameter is the probability of success for each trial, which is given by the coefficient multiplying $d\sigma_{n_k}$ in (1.3). For any $k_1 \ne k_2$, $B_{k_1}$ and $B_{k_2}$ are mutually independent. The amount of samples associated to $\mu_{n_{k-1}}$ is equal to $m_k - B_k$. These are the samples that the algorithm can recycle from the previous iterations, whenever necessary. For any $t \ge 1$, the algorithm that generates random samples from $\mu_{n_1},\dots,\mu_{n_t}$ in a sequential manner is described in Algorithm 2. Efficient algorithms have been proposed in [7] for drawing samples from all the probability measures $\mu_{n_k}$ and $\sigma_{n_k}$ appearing in the first $t$ iterations.

Lemma 3.
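A sketch of this purely random bookkeeping (ours; `sample_mu(n, size)` and `sample_sigma(n_prev, n, size)` denote assumed user-supplied samplers for $\mu_n$ and $\sigma_n$) reads:

```python
import numpy as np

def next_mu_samples(old, n_prev, n, m, sample_mu, sample_sigma, rng):
    """One iteration of the Algorithm-2-style construction, based on decomposition (1.3).

    old : array of m_prev samples distributed as mu_{n_prev}.
    Returns m samples distributed as mu_n, recycling as many old samples as possible.
    """
    B = rng.binomial(m, (n - n_prev) / n)            # samples assigned to the new components
    keep = min(m - B, len(old))                      # old samples recycled from mu_{n_prev}
    top_up = max(m - B - len(old), 0)                # extra draws from mu_{n_prev} if needed
    new_old = sample_mu(n_prev, top_up)              # from the old part of the mixture
    new_new = sample_sigma(n_prev, n, m - keep - top_up)     # from sigma_n
    return np.concatenate([old[:keep], new_old, new_new])
```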
For any $t \ge 1$, Algorithm 2 generates a set of $m_t$ random samples $x^1,\dots,x^{m_t}$ with the property that $x^1,\dots,x^{m_k}$ are distributed according to $\mu_{n_k}$, for any $k = 1,\dots,t$. The overall number of samples generated by Algorithm 2 at iteration $t$ is
\[
\widetilde m_t := m_1 + \sum_{k=2}^t \big( B_k + \max\{ m_k - B_k - m_{k-1},\, 0 \} \big). \qquad (4.25)
\]

Proof.
The properties of the random samples are ensured by construction. We now prove (4.25). When $t = 1$ the algorithm draws exactly $m_1$ samples from $\mu_{n_1}$, hence $\widetilde m_1 = m_1$. Consider $t \ge 2$. The proof uses induction on $k$. At iteration $k = 2$, the $m_1$ samples from $\mu_{n_1}$ drawn at the first iteration are available, which verifies the induction hypothesis. Proof of the induction step: for any $k \ge 2$, supposing that $m_{k-1}$ samples from $\mu_{n_{k-1}}$ are available at iteration $k-1$, the number of recycled samples from iteration $k-1$ is $\min(m_k - B_k, m_{k-1})$. Then the algorithm adds $\max(m_k - B_k - m_{k-1}, 0)$ new samples from $\mu_{n_{k-1}}$. Afterwards $m_k - \max(m_k - B_k - m_{k-1}, 0) - \min(m_k - B_k, m_{k-1})$ new samples are added from $\sigma_{n_k}$. At the end of iteration $k$, the algorithm produces a set containing $m_k$ random samples from $\mu_{n_k}$, and throws away $m_{k-1} - \min(m_k - B_k, m_{k-1})$ samples that were drawn at iteration $k-1$ from $\mu_{n_{k-1}}$. Summation of each contribution of new samples at any iteration $k$ from 2 to $t$ gives (4.25).

The number of unrecycled samples after $t$ iterations is $\widetilde m_t - m_t$. As a sum of nonnegative random variables, this number can only increase as the algorithm runs, which represents the major disadvantage of Algorithm 2, and of any other purely random sequential algorithm. Since $m_k \ge m_{k-1}$ and $B_k \ge 0$ for any $k \ge 2$, from (4.25) we have the upper bound $\widetilde m_t \le m_t + U_t$, where $U_t$ is the random variable defined as
\[
U_t := \sum_{k=2}^t B_k, \qquad (4.26)
\]
that gives an upper bound for the number of unrecycled samples. Its mean and variance are given by
\[
\mathbb{E}(U_t) = \sum_{k=2}^t \mathbb{E}(B_k) = \sum_{k=2}^t m_k \frac{n_k - n_{k-1}}{n_k}, \qquad \mathrm{Var}(U_t) = \sum_{k=2}^t \mathrm{Var}(B_k) = \sum_{k=2}^t m_k \frac{n_k - n_{k-1}}{n_k}\,\frac{n_{k-1}}{n_k}. \qquad (4.27)
\]
The above expressions for the mean and variance of $U_t$ hold true for any condition between $m_k$ and $n_k$, not necessarily of the form (3.22). When (3.22) is fulfilled we have the following upper bounds.

Lemma 4.
For any strictly increasing sequence $(n_k)_{k\ge1}$, for any $t \ge 1$, $s > 1$ and $\alpha \in (0,1)$, if $m_k$ and $n_k$ satisfy condition (3.22) for all $k = 1,\dots,t$ then
\[
\mathrm{Var}(U_t) < \mathbb{E}(U_t) \le m_t + n_t - m_1.
\]

Proof.
\[
\begin{aligned}
\mathbb{E}(U_t) &= \sum_{k=2}^t m_k \frac{n_k - n_{k-1}}{n_k} \le \sum_{k=2}^t (n_k - n_{k-1}) \left\lceil \theta^{-1}\ln\Big( \frac{2\zeta(s)n_k^{s+1}}{\alpha} \Big) \right\rceil
\le \sum_{k=2}^t \left( n_k \left\lceil \theta^{-1}\ln\Big( \frac{2\zeta(s)n_k^{s+1}}{\alpha} \Big) \right\rceil - n_{k-1}\left\lceil \theta^{-1}\ln\Big( \frac{2\zeta(s)n_{k-1}^{s+1}}{\alpha} \Big) \right\rceil \right)\\
&= n_t \left\lceil \theta^{-1}\ln\Big( \frac{2\zeta(s)n_t^{s+1}}{\alpha} \Big) \right\rceil - n_1 \left\lceil \theta^{-1}\ln\Big( \frac{2\zeta(s)n_1^{s+1}}{\alpha} \Big) \right\rceil \le m_t + n_t - m_1.
\end{aligned}
\]
The inequality for $\mathrm{Var}(U_t)$ follows from (4.27) and strict monotonicity of the sequence $(n_k)_k$.

Remark 5. Here we show that the random variable $U_t$ concentrates like a Gaussian random variable with mean and variance given by (4.27). The central limit theorem for a binomial random variable $B \sim \mathrm{Bin}(m,p)$ with number of trials $m$ and success probability $p$ states that
\[
\lim_{m\to\infty} \Pr\Big( \frac{B - mp}{\sqrt{mp(1-p)}} \le b \Big) = \Phi(b), \quad b \in \mathbb{R},
\]
where $\Phi$ is the cumulative distribution function of the standard Gaussian distribution. This justifies the well-known Gaussian approximation of $B$ when $m$ is sufficiently large. This approximation is very accurate already when $mp$ and $m(1-p)$ exceed a small threshold (a usual rule of thumb). In our settings, when $m_k$ and $n_k$ satisfy (3.22) for some $\alpha \in (0,1)$ and $s > 1$, the parameters of the binomial random variables $B_k$ overwhelmingly verify the above conditions for any $k = 2,\dots,t$, since
\[
m_k \frac{n_k - n_{k-1}}{n_k} \ge \theta^{-1}\ln\Big( \frac{2\zeta(s)n_k^{s+1}}{\alpha} \Big)(n_k - n_{k-1}) \ge \theta^{-1}\ln\Big( \frac{2\zeta(s)n_k^{s+1}}{\alpha} \Big) \gg 1, \qquad (4.28)
\]
\[
m_k \frac{n_{k-1}}{n_k} \ge \theta^{-1}\ln\Big( \frac{2\zeta(s)n_k^{s+1}}{\alpha} \Big)\, n_{k-1} \gg 1, \qquad (4.29)
\]
and $\theta^{-1} \approx 9.24$. Using the Gaussian approximation of the binomial distribution, each $B_k$ behaves like a Gaussian r.v. with the same mean and variance. A finite linear combination of independent Gaussian random variables is a Gaussian random variable. Hence the r.v. $U_t$ behaves like a Gaussian r.v. with mean and variance as in (4.27).

The main properties of Algorithm 1 and Algorithm 2 are summarized below. At any iteration, say $t$:

• Algorithm 1 generates $m_t = \tau_t n_t$ independent random samples, with $\tau_t$ being any positive integer, and such that $\tau_t$ of these random samples are drawn from $\chi_j$, for any $j = 1,\dots,n_t$. This algorithm recycles all the samples generated at all the previous iterations $k = 1,\dots,t-1$.

• Algorithm 2 generates $m_t$ independent random samples from $\mu_{n_t}$. This algorithm recycles most of the samples generated at all the previous iterations $k = 1,\dots,t-1$. If (3.22) holds true at any iteration, then the number of unrecycled samples at iteration $t$ is upper bounded by the random variable (4.26) with mean at most $(1+\theta)m_t$ and variance at most $(1+\theta)m_t$, that exhibits Gaussian concentration.

• During the execution, Algorithm 1 modifies each element of the output set $x^1,\dots,x^{m_t}$ only once, in contrast to Algorithm 2 that can modify the same element several times, when discarding previously generated random samples.

• Algorithm 1 and Algorithm 2 can be used with any sequence of nested spaces $(V_k)_k$.

• The weighted least-squares estimators constructed with the random samples generated by both Algorithm 1 and Algorithm 2 share the same theoretical guarantees, see Theorem 3 and Theorem 4.

• In practice Algorithm 1 outperforms Algorithm 2 in all our numerical tests, recycling all the samples from all the previous iterations, and producing on average more stable Gramian matrices.

Remark 6.
When using random samples from $\gamma_m$ rather than from $\mu_m$, the benefits of variance reduction increase with bases that are more localized than orthogonal polynomials, like wavelets. The structure of the random samples from $\gamma_m$ ensures that for any element of the basis $\psi_j \in V$ at least one sample is contained in $\mathrm{supp}(\psi_j)$. If this is not the case then the Gramian matrix is singular, because the discrete inner product of two functions is equal to zero when none of the samples is contained in the intersection of their supports.

Numerical methods for adaptive (polynomial) approximation
The results presented in Theorems 3 and 4 hold for any nested sequence $(V_k)_k$ of general approximation spaces, in any dimension $d$. Two families of spaces that are suitable for approximation in arbitrary dimension $d$ are polynomial spaces and wavelet spaces. In this paper we confine to polynomial spaces. Even with this restriction, adaptive numerical methods in such a general context are still quite a large subject. Our focus in the present paper is on a more specific type of adaptive methods, in the spirit of orthogonal matching pursuit, and on the line of the greedy algorithms described in [10].

The spaces $V_k$ can be adaptively chosen from one iteration to the other, as long as the sequence remains nested. Without additional information on the function that we would like to approximate, the infinite number of elements in the basis prevents the development of a concrete strategy for performing the adaptive selection. Such additional information is available in the form of decay of the coefficients, for example, for some PDEs with parametric or stochastic coefficients, whose solution is provably well approximated by so-called downward closed polynomial spaces. See [6] and references therein for an introduction to the topic. The definition of downward closed polynomial spaces is postponed to (5.32). In the remainder of this section we assume that

the function $u$ can be well approximated by a nested sequence of downward closed polynomial approximation spaces. (5.30)

As a relevant example that motivates our interest in the above setting, for PDEs with lognormal diffusion coefficients it was shown by the author in [8, Lemma 2.4] that suitable polynomial spaces yielding provable convergence rates are actually downward closed. After (5.30) we restrict our analysis to nested sequences $(V_k)_k$ of polynomial spaces satisfying the additional constraint of being downward closed. At iteration $k$, given $V_{k-1}$, an ideal (local) optimal criterion for performing the adaptive selection is to choose $V_k \supset V_{k-1}$ as the space that delivers the smallest error among all possible downward closed spaces with prescribed dimension, for example $n_k = 1 + n_{k-1}$. Since $d$ is finite, the number of all possible choices for $V_k$ is also finite. In reality the exact error $\|u - \Pi_{n_k}u\|$ is not available, and the adaptive selection has to rely on the error $\|u - u_C^k\|$, which is a random variable. Here the error estimates from Theorems 3 and 4 come in handy because they ensure that $\|u - u_C^k\|$ is less than twice $\|u - \Pi_{n_k}u\|$ in expectation. Even if the exact error was available, the adaptive selection using the local optimal criterion does not ensure optimality of the selected spaces at the following iterations, and for this reason it is referred to as a greedy adaptive selection.

Before moving to the description of the adaptive algorithm, we briefly introduce some definitions that are useful to describe the polynomial setting. Hereafter we assume that $X = \times_{i=1}^d I_i$ is the Cartesian product of intervals $I_i \subset \mathbb{R}$, and that $d\rho = \otimes_{i=1}^d d\rho_i$ where each $\rho_i$ is a probability measure defined on $I_i$. This setting ensures the existence of a product basis orthonormal in $L^2(X,\rho)$ that we now introduce. To simplify the presentation and notation, we further suppose that $I := I_j$ and $\tilde\rho := \rho_j$ for any $j$, and denote with $(T_j)_{j\ge0}$ the univariate family of orthogonal polynomials, orthonormal in $L^2(I,\tilde\rho)$. Let $\Lambda \subset \mathcal{F} := \mathbb{N}_0^d$ be a multi-index set enumerated according to an ordering relation, for example the lexicographical ordering.
Using $\Lambda$ we define
\[
\psi_\nu(x) := \prod_{i=1}^d T_{\nu_i}(x_i), \quad \nu = (\nu_1,\dots,\nu_d) \in \Lambda, \quad x = (x_1,\dots,x_d) \in X, \qquad (5.31)
\]
and relate the orthonormal basis $(\psi_i)_{i\ge1}$ from the previous sections to the above orthonormal basis as $\psi_i = \psi_{\nu^i}$ for any $i = 1,\dots,\#(\Lambda)$, where $\nu^i$ is the $i$th element of $\Lambda$ according to the lexicographical ordering, and $\#(\Lambda)$ denotes the cardinality of $\Lambda$. The space associated to $\Lambda$ is defined as $V_\Lambda := \mathrm{span}\{\psi_\nu : \nu \in \Lambda\}$.

A set $\Lambda \subset \mathcal{F}$ is downward closed if
\[
\nu \in \Lambda \ \text{and} \ \tilde\nu \le \nu \implies \tilde\nu \in \Lambda, \qquad (5.32)
\]
where the ordering $\tilde\nu \le \nu$ is intended componentwise, i.e. $\tilde\nu_j \le \nu_j$ for all $j = 1,\dots,d$. We say that the space $V_\Lambda$ is downward closed if the supporting index set $\Lambda$ is downward closed. For any $\Lambda \subset \mathcal{F}$ downward closed we define its margin $\mathcal{M}(\Lambda)$ as
\[
\mathcal{M}(\Lambda) := \{ \nu \in \mathcal{F} : \nu \notin \Lambda \ \wedge\ \exists\, j \in \{1,\dots,d\} : \nu - e_j \in \Lambda \},
\]
where $e_j \in \mathcal{F}$ is the multi-index with all components equal to zero, except the $j$th component that is equal to one. The reduced margin $\mathcal{R}(\Lambda)$ of $\Lambda$ is defined as
\[
\mathcal{R}(\Lambda) := \{ \nu \in \mathcal{F} : \nu \notin \Lambda \ \wedge\ \forall\, j \in \{1,\dots,d\},\ \nu_j \ne 0 \implies \nu - e_j \in \Lambda \} \subseteq \mathcal{M}(\Lambda).
\]
If $\Lambda$ is downward closed then $\Lambda \cup \{\nu\}$ is downward closed for any $\nu \in \mathcal{R}(\Lambda)$. Finally, we choose the space $V_k$ from the previous sections as $V_k = V_{\Lambda_k}$ for any $k$, by means of a nested sequence $(\Lambda_k)_k \subset \mathcal{F}$ of downward closed multi-index sets.
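For concreteness (our illustration, not part of the paper), downward closed sets and their reduced margins can be manipulated with a few lines of code, representing each multi-index as a tuple of nonnegative integers; note that enlarging $\Lambda$ by any subset of $\mathcal{R}(\Lambda)$ preserves downward closedness.

```python
def parents(nu):
    """Multi-indices nu - e_j for all j with nu_j > 0."""
    return [nu[:j] + (nu[j] - 1,) + nu[j+1:] for j in range(len(nu)) if nu[j] > 0]

def reduced_margin(Lam):
    """R(Lam): indices nu outside Lam whose every parent nu - e_j belongs to Lam."""
    S = set(Lam)
    d = len(next(iter(S)))
    candidates = {nu[:j] + (nu[j] + 1,) + nu[j+1:] for nu in S for j in range(d)}
    return {nu for nu in candidates - S if all(p in S for p in parents(nu))}

# Example: reduced_margin({(0, 0), (1, 0)}) == {(2, 0), (0, 1)}
```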
For any $k \ge 1$, $\#(\Lambda_k) = \dim(V_k) = n_k$ equals the dimension of $V_k$.

In this section we describe an adaptive algorithm using optimal weighted least squares, starting from the algorithm proposed in [18] for standard least squares and inspired by orthogonal matching pursuit. The algorithm builds a sequence of nested spaces $V_{\Lambda_1} \subset \dots \subset V_{\Lambda_t}$, performing at each iteration an adaptive greedy selection of the indices identifying the elements of the basis. The adaptive construction of the index sets uses ideas that were originally proposed in [14] for developing adaptive sparse grid quadratures. The greedy selection of the indices uses a marking strategy known as bulk chasing.

The adaptive algorithm works with downward closed index sets. Given any $\Lambda$ downward closed, a nonnegative function $e : \mathcal{R}(\Lambda) \to \mathbb{R}$ and a parameter $\beta \in (0,1]$, we define the procedure $\mathrm{BULK}(\mathcal{R}(\Lambda), e, \beta)$ that computes a set $F \subseteq \mathcal{R}(\Lambda)$ of minimal positive cardinality such that
\[
\sum_{\nu \in F} e(\nu) \ge \beta \sum_{\nu \in \mathcal{R}(\Lambda)} e(\nu). \qquad (5.33)
\]
Denote with $a_\nu$ the coefficient associated to $\psi_\nu$ in the expansion $u = \sum_{\nu\in\mathcal{F}} a_\nu\psi_\nu$. For any $\nu \in \mathcal{R}(\Lambda)$, the function $e(\nu)$ is chosen as an estimator for $|a_\nu|$. The adaptive algorithm is described in Algorithm 3. At any iteration $k$ of the algorithm, $m_k$ random samples are generated by using Algorithm 1, with $m_k$ satisfying (3.21) as a function of $n_k = \#(\Lambda_k)$, for a given choice of the parameters $\alpha$ and $s$. The $m_k$ random samples are used to compute the weighted least-squares estimator $u_C^k$ on $V_{\Lambda_k}$. For convenience, in Algorithm 3 the operations performed by Algorithm 1 have been merged with those for the adaptive selection of the space. In Algorithm 3 the $\chi_\nu$ correspond to $\chi_j$ with $\psi_j = \psi_\nu$. An estimator for $|a_\nu|$ proposed in [18] that uses only the information available at iteration $k-1$ is
\[
e_{k-1}(\nu) := \big| \langle u - u_C^{k-1}, \psi_\nu \rangle_{m_{k-1}} \big|, \quad \nu \in \mathcal{R}(\Lambda_{k-1}), \qquad (5.34)
\]
where the discrete inner product uses the evaluations of the function $u$ at the same $m_{k-1}$ samples that have been used to compute $u_C^{k-1}$ at iteration $k-1$.
The estimator (5.34) uses the residual $r_{k-1} := u - u_C^{k-1}$ and is cheap to compute: it requires only the product of a vector with a matrix.

A safeguard mechanism prevents Algorithm 3 from getting stuck on indices associated to null coefficients in the expansion of $u_C^k$. Given a positive integer $k_{sg}$, once every $k_{sg}$ iterations the algorithm adds to $\Lambda_k$ the most ancient multi-index from $\mathcal{R}(\Lambda_{k-1})\setminus F$. In the numerical tests reported in the next section, such a mechanism was never activated, and the algorithm was always able to identify the best $n_k$-term index sets of the given function at any iteration $k$.

Algorithm 3 can be modified by relaxing (3.21) to a less demanding condition between $m_k$ and $n_k$ at each iteration $k$. For example, random samples can be added until a stability condition of the form $\|G_k - I_k\| \leq \xi$ is met, for some given threshold $\xi$. This provides a fully adaptive algorithm, described in Algorithm 4, that however, in contrast to Algorithm 3, does not come with the theoretical guarantees of Theorem 3.
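A minimal sketch of the stability-driven stopping rule used by the fully adaptive variant: blocks of samples are added until the empirical Gramian is close to the identity in spectral norm. The helpers `draw_block` and `basis_eval`, as well as the weights $w = n/\sum_j \psi_j^2$ (the standard optimal-WLS choice of [7]), are assumptions of this illustration.

```python
import numpy as np

def gramian(P):
    """Empirical Gramian G = (1/m) sum_i w_i psi(x_i) psi(x_i)^T with
    optimal-WLS weights w_i = n / sum_j psi_j(x_i)^2 (assumed choice, see [7])."""
    m, n = P.shape
    w = n / np.sum(P**2, axis=1)
    return (P * w[:, None]).T @ P / m

def sample_until_stable(draw_block, basis_eval, xi=0.5, max_rounds=1000):
    """Add blocks of random samples until ||G - I|| <= xi (spectral norm)."""
    X = draw_block()                               # initial block of samples
    for _ in range(max_rounds):
        P = basis_eval(X)                          # (m, n) basis evaluations
        G = gramian(P)
        if np.linalg.norm(G - np.eye(G.shape[0]), 2) <= xi:
            return X, G
        X = np.vstack([X, draw_block()])           # enrich the sample and test again
    raise RuntimeError("stability condition not reached")
```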
This section presents some numerical tests of the sampling algorithms that generate the random samples, comparing Algorithm 1 and Algorithm 2. In both cases, our implementation uses inverse transform sampling, as described in [7, Section 5.2], for drawing samples from all the $\chi_j$.

A natural way to quantify the quality of the generated samples is the deviation of the matrix $G_k$ from the identity, i.e. $\|G_k - I_k\|$. Since $\|G_k - I_k\| \leq 1/2 \Longrightarrow \mathrm{cond}(G_k) := \|G_k^{-1}\|\,\|G_k\| \leq 3$, our tests show the condition number, which is a more meaningful quantity when solving a linear system.

From the point of view of the stability and convergence properties of the weighted least-squares estimators, the random samples generated by both algorithms come with the same theoretical guarantees. But, in contrast to Algorithm 2, Algorithm 1 recycles all the samples from the previous iterations. This is the main reason to prefer Algorithm 1 over Algorithm 2. Another reason to choose Algorithm 1 is that it produces more stable Gramian matrices on average, since the sample variance of the generated samples is lower.
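As mentioned above, the samples from each $\chi_j$ are drawn by inverse transform sampling. The sketch below inverts a numerically tabulated CDF for a univariate density; for a tensorized basis the density of $\chi_\nu$ factorizes across coordinates, so each coordinate can be drawn this way independently. The specific density $L_3(t)^2\,dt/2$ used in the example is an assumption chosen for illustration.

```python
import numpy as np

def inverse_transform_sampler(pdf_vals, grid):
    """Sampler for a 1-d density tabulated on `grid` (unnormalized values `pdf_vals`),
    obtained by numerically inverting its CDF."""
    cdf = np.concatenate(([0.0],
        np.cumsum(0.5 * (pdf_vals[1:] + pdf_vals[:-1]) * np.diff(grid))))  # trapezoid rule
    cdf /= cdf[-1]
    def sample(size, rng=np.random.default_rng()):
        return np.interp(rng.uniform(size=size), cdf, grid)                # invert the CDF
    return sample

# Example (assumption): coordinate density proportional to L_3(t)^2 on [-1, 1],
# with L_3 the Legendre polynomial of degree 3 orthonormal w.r.t. dt/2.
grid = np.linspace(-1.0, 1.0, 2001)
L3 = np.polynomial.legendre.Legendre.basis(3)(grid) * np.sqrt(7.0)
draw = inverse_transform_sampler(0.5 * L3**2, grid)
x = draw(10_000)
```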
Figure 1: Left: estimators $E_i$ and $E_i + S_i$ of the sequence of random variables $(\mathrm{cond}(G_k))_{k\geq 1}$ at iterations $k=1,\ldots,50$, with $n_k = k$ and $m_k = \lceil\theta^{-1}\rceil n_k$. Hermite polynomials, $d=1$. The estimators use 10 realizations of the sequence $(\mathrm{cond}(G_k))_{k\geq 1}$. Center: estimators $E_i$ and $E_i + S_i$ of the sequence of random variables $(\mathrm{cond}(G_k))_{k\geq 1}$ at iterations $k=1,\ldots,55$, with $n_k = k$ and $m_k = (1+n_k)n_k$. Hermite polynomials, $d=1$. The estimators use 10 realizations of the sequence $(\mathrm{cond}(G_k))_{k\geq 1}$. Right: same simulation as in the center panel, but showing the difference $E_2 - E_1$ between the two sample means.

Our first tests illustrate the benefits of variance reduction, using spaces $V_k$ of univariate Hermite polynomials $(H_j)_{j\geq 0}$ with degree from 0 to $k-1$.
More precisely, the sequence $(H_j)_{j\geq 0}$ contains univariate Hermite polynomials orthonormalised as $\int_{\mathbb{R}} H_i(t) H_j(t)\, dg = \delta_{ij}$, where $dg := (2\pi)^{-1/2} e^{-t^2/2}\, dt$. Denote with $E_i \approx \mathbb{E}(\mathrm{cond}(G_k))$ and $S_i^2 \approx \mathrm{Var}(\mathrm{cond}(G_k))$ the sample mean and sample variance estimators of the random variable $\mathrm{cond}(G_k)$, with $G_k$ constructed using the random samples generated by Algorithm $i\in\{1,2\}$. Figure 1-left shows the comparison of $E_i$ and $E_i + S_i$ between the two algorithms, with $m_k = \lceil\theta^{-1}\rceil n_k$. Both estimators confirm that Algorithm 1 produces random samples whose Gramian matrix is better conditioned than that of Algorithm 2. The same trend persists when choosing other scalings like $m_k = (1+n_k)n_k$, see Figure 1-center and Figure 1-right. The difference between the two algorithms is expected to amplify when using more localized bases, with Algorithm 2 producing much more ill-conditioned Gramian matrices as the ratio $m_k/n_k$ decreases.
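For reference, the orthonormalised Hermite polynomials used in these tests can be evaluated with the standard three-term recurrence of the probabilists' Hermite family; the sketch below also checks the orthonormality with respect to $dg$ by Gauss–Hermite quadrature.

```python
import numpy as np

def hermite_orthonormal(t, n_max):
    """Orthonormal probabilists' Hermite polynomials H_0,...,H_{n_max} at the points t,
    orthonormal with respect to dg = (2*pi)^(-1/2) exp(-t^2/2) dt."""
    t = np.asarray(t, dtype=float)
    H = np.empty((n_max + 1,) + t.shape)
    H[0] = 1.0
    if n_max >= 1:
        H[1] = t
    for n in range(1, n_max):
        # three-term recurrence for the orthonormal family
        H[n + 1] = (t * H[n] - np.sqrt(n) * H[n - 1]) / np.sqrt(n + 1)
    return H

# Orthonormality check with a Gauss-Hermite rule for the weight exp(-t^2/2).
nodes, weights = np.polynomial.hermite_e.hermegauss(60)
H = hermite_orthonormal(nodes, 5)
G = (H * (weights / np.sqrt(2 * np.pi))) @ H.T   # close to the 6 x 6 identity
```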
From now on the focus is on Algorithm 1. For all the tests in the remaining part of this section we choose $m_k$ as in (3.21) with $\alpha = 0.1$ and $s = 2$. The value of $\alpha$ is chosen fairly large on purpose, to check in practice how sharp the stability constraint (3.21) is. In the first test, we choose $\rho = \bigotimes^d dg$ as the $d$-dimensional Gaussian probability measure on $X = \mathbb{R}^d$, and $V_k$ as the spaces of tensorized Hermite polynomials, obtained from (5.31) by taking $T_j = H_j$ for $j\geq 0$.
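The tensorized basis functions (5.31) are then simple products of univariate evaluations; a minimal sketch, reusing `hermite_orthonormal` from the previous snippet as the univariate family.

```python
import numpy as np

def tensor_basis(X, Lam, univariate_eval):
    """Evaluate psi_nu(x) = prod_i T_{nu_i}(x_i) for all nu in Lam at the rows of X.
    X : (m, d) sample points; Lam : list of d-dimensional multi-indices;
    univariate_eval(t, n_max) : values of T_0,...,T_{n_max} at the points t."""
    m, d = X.shape
    n_max = max(max(nu) for nu in Lam)
    T = [univariate_eval(X[:, i], n_max) for i in range(d)]   # d arrays of shape (n_max+1, m)
    P = np.ones((m, len(Lam)))
    for col, nu in enumerate(Lam):
        for i, deg in enumerate(nu):
            P[:, col] *= T[i][deg]                            # multiply the univariate factors
    return P
```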
The Gaussian case poses several challenges: as shown in [7], standard least-squares estimators with Hermite polynomials typically fail due to the ill-conditioning of the Gramian matrix. Since the ill-conditioning arises with high-degree polynomials, we choose fairly low-dimensional tests to begin with, such that very high-degree polynomials can be tested, e.g. degrees beyond 100. Figure 2 reports the results for $d=1$ and $d=4$: the condition number of $G_k$ stays well below the threshold equal to 3 during all the simulations, which contain 10 realizations of the sequence $(\mathrm{cond}(G_k))_{k\geq 1}$ with random samples generated by Algorithm 1. At any iteration $k$, the index set $\Lambda_k \supset \Lambda_{k-1}$ that defines the space $V_k = V_{\Lambda_k}$ is generated by adding to $\Lambda_{k-1}$ a random number of indices randomly chosen from $\mathcal{R}(\Lambda_{k-1})$. This procedure generates nested sequences of downward closed index sets, see Figure 3-right for an example of such a set.
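The random construction of nested downward closed sets used in these tests can be sketched as follows, reusing the `reduced_margin` helper from the earlier snippet; the distribution of the number of added indices is an arbitrary illustrative choice.

```python
import numpy as np

def grow_randomly(Lam0, d, steps, rng=np.random.default_rng(0)):
    """Nested sequence of downward closed sets: at each step, add a random number of
    indices randomly chosen from the reduced margin of the current set."""
    sets = [set(map(tuple, Lam0))]
    for _ in range(steps):
        R = sorted(reduced_margin(sets[-1], d))          # current reduced margin
        n_add = int(rng.integers(1, len(R) + 1))         # random number of indices to add
        picked = rng.choice(len(R), size=n_add, replace=False)
        sets.append(sets[-1] | {R[i] for i in picked})   # still downward closed
    return sets
```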
With other families of orthogonal polynomials the results are very similar. For example, with $d=4$, the results in Figure 3-left, obtained with the $d$-dimensional uniform probability measure on $X = [-1,1]^d$ and Legendre polynomials, are analogous to those obtained in Figure 2-right with the Gaussian measure and Hermite polynomials. Figure 3-right shows an example of (the section of the first and second coordinates of) an index set $\Lambda_k$ obtained in the simulation of Figure 2-right at iteration $k=500$.
Figure 2: Left: condition number $\mathrm{cond}(G_k)$ at iteration $k$, with $n_k = k$ and $m_k$ as in (2.18), $d=1$, Gaussian measure, Hermite polynomials, $s=2$, $\alpha=0.1$. Black lines are 10 realizations of the sequence $(\mathrm{cond}(G_k))_{k\geq 1}$ with random samples from Algorithm 1. The red line is their sample mean. Right: condition number $\mathrm{cond}(G_k)$ at iteration $k$, with $n_k = k$ and $m_k$ as in (2.18), $d=4$, Gaussian measure, Hermite polynomials, $s=2$, $\alpha=0.1$. Black lines are 10 realizations of the sequence $(\mathrm{cond}(G_k))_{k\geq 1}$ with random samples from Algorithm 1. The red line is their sample mean.

Figure 3: Left: condition number $\mathrm{cond}(G_k)$ at iteration $k$, with $n_k = k$ and $m_k$ as in (2.18), $d=4$, uniform measure, Legendre polynomials, $s=2$, $\alpha=0.1$. Black lines are 10 realizations of the sequence $(\mathrm{cond}(G_k))_{k\geq 1}$ with random samples from Algorithm 1. The red line is their sample mean. Right: section of the first and second coordinates of an index set obtained at iteration $k=500$ during the simulation in Figure 2-right.

5.3 Testing the adaptive algorithm
For the numerical tests of Algorithm 3 we choose $\rho$ as the uniform measure over $X = [-1,1]^d$, and $V_k$ as the spaces of tensorized Legendre polynomials, obtained by first defining the sequence $(L_j)_{j\geq 0}$ of univariate Legendre polynomials orthonormalised as $\int_{-1}^{+1} L_i(t) L_j(t)\, dt = \delta_{ij}$, and then taking $T_j = L_j$ in (5.31). As an illustrative example, consider the following function, which satisfies assumption (5.30):
$$u(x) = \Big(1 + \frac{1}{d}\sum_{i=1}^{d} q_i x_i\Big)^{-1}, \qquad x\in X, \qquad (5.35)$$
with $d=16$ and coefficients $q_i$ decreasing in $i$, chosen such that assumption (5.30) is satisfied. A set $X_{CV}$ of cross-validation points uniformly distributed over $X$ is chosen once and for all, and the approximation error $\|u - u_C^k\|$ is estimated as
$$\|u - u_C^k\| \approx \|u - u_C^k\|_{CV,2} := \Big(\frac{1}{\#(X_{CV})}\sum_{\tilde x\in X_{CV}} |u(\tilde x) - u_C^k(\tilde x)|^2\Big)^{1/2} \quad \text{and} \quad \|u - u_C^k\|_{CV,\infty} := \max_{\tilde x\in X_{CV}} |u(\tilde x) - u_C^k(\tilde x)|. \qquad (5.36)$$
The error estimators are denoted with $\|u - u_C^k\|_{CV,2}$ and $\|u - u_C^k\|_{CV,\infty}$, although these are not norms over the functional space.
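The cross-validation estimators (5.36) translate directly into code. The target below is only a stand-in with the same structure as (5.35): its coefficients $q_i$ are illustrative and not those of the original experiment.

```python
import numpy as np

def cv_errors(u, u_C, X_cv):
    """Estimators (5.36): discrete L2-type and max errors over the validation points."""
    diff = u(X_cv) - u_C(X_cv)
    return np.sqrt(np.mean(diff**2)), np.max(np.abs(diff))

d = 16
q = 2.0 ** (-np.arange(d))                              # illustrative coefficients only
u = lambda X: 1.0 / (1.0 + (X @ q) / d)                 # same structure as (5.35)
X_cv = np.random.default_rng(0).uniform(-1.0, 1.0, (1000, d))   # validation points on X
# err2, errinf = cv_errors(u, u_C, X_cv)   # u_C: the weighted least-squares estimator
```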
The parameter of the marking strategy is set to $\beta = 0.5$, and $\Lambda_1 = \{(0,\ldots,0)^\top\}$. Figure 4-left shows the results for the errors (5.36) obtained when approximating the function (5.35) with Algorithm 3, using the random samples generated by Algorithm 1. At each iteration $k$ the number of samples $m_k$ as a function of $n_k$ satisfies (3.21) with $\alpha = 0.1$ and $s = 2$.
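Given the samples, the weights and the basis evaluations, the weighted least-squares estimator at each iteration reduces to a standard linear least-squares solve; a minimal sketch (the weight convention is the same assumption as before).

```python
import numpy as np

def wls_coefficients(P, w, y):
    """Coefficients of u_C minimizing (1/m) sum_i w_i (y_i - (P a)_i)^2.
    P : (m, n) basis evaluations, w : (m,) weights, y : (m,) evaluations of u."""
    sw = np.sqrt(w)
    a, *_ = np.linalg.lstsq(P * sw[:, None], y * sw, rcond=None)
    return a
```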
Figure 4-right shows the condition number of $G_k$ at iteration $k$, which stays below two at all iterations. Figure 5-left shows that at each iteration $k$ the adaptive algorithm catches the coefficients in the best $n_k$-term set. The coefficients in Figure 5-left have not been sorted, and they appear in the same order in which their corresponding elements of the basis were included in the approximation space by the adaptive selection procedure. After 35 iterations the algorithm has adaptively constructed a sequence $\Lambda_1,\ldots,\Lambda_{35}$ of index sets, and the associated space $V_{\Lambda_{35}}$ provides the approximation error reported in Figure 4-left. Figure 5 shows some sections of $\Lambda_{35}$. All the $d$ coordinates in $\Lambda_{35}$ are active, i.e. $\forall j\in\{1,\ldots,d\}$, $\exists\nu\in\Lambda_{35} : \nu_j \geq 1$, showing that condition (3.21) could be relaxed while still preserving the stability of the discrete projection, and yielding faster convergence rates with respect to $m_k$ than those in Figure 4-left.

Figure 4: Left: 10 realizations of the errors $\|u - u_C^k\|_{CV,2}$ and $\|u - u_C^k\|_{CV,\infty}$ versus $m_k$, obtained with Algorithm 3 and the random samples generated by Algorithm 1. Right: 10 realizations of $\mathrm{cond}(G_k)$ versus $k$, for the same simulation on the left.

Figure 5: Left: first 10 coefficients of the estimator $u_C^k = \sum_j a_j\psi_j$ obtained at iteration $k=35$ with the index set $\Lambda_{35}$, for one realization among those shown in Figure 4. Right: some sections of the index set corresponding to the coefficients displayed on the left.

6 Conclusions

We have advanced one step further the analysis of optimal weighted least-squares estimators for a given general $d$-dimensional approximation space. The main novelty concerns the structure of the random samples, which follow a distribution with product form. The results have immediate applications to the adaptive setting with a nested sequence of approximation spaces, and point out new promising directions for the development of adaptive numerical methods for high-dimensional approximation using polynomial or wavelet spaces. Our analysis indicates that efficient adaptive methods can also be developed for general sequences of not necessarily nested spaces. This topic will be investigated in the future.

References
[1] R. Ahlswede and A. Winter, Strong converse for identification via quantum channels, IEEE Trans. Inf. Theory, 48(3):569–579, 2002.
[2] B. Arras, M. Bachmayr and A. Cohen, Sequential sampling for optimal weighted least squares approximations in hierarchical spaces, arXiv:1805.10801.
[3] H. Bungartz and M. Griebel, Sparse grids, Acta Numer., 13:147–269, 2004.
[4] G. Cybenko, Approximations by superpositions of sigmoidal functions, Mathematics of Control, Signals, and Systems, 2(4):303–314, 1989.
[5] A. Cohen, M.A. Davenport and D. Leviatan, On the stability and accuracy of least squares approximations, Found. Comput. Math., 13:819–834, 2013.
[6] A. Cohen and R. DeVore, Approximation of high-dimensional parametric PDEs, Acta Numer., 24:1–159, 2015.
[7] A. Cohen and G. Migliorati, Optimal weighted least-squares methods, SMAI Journal of Computational Mathematics, 3:181–203, 2017.
[8] A. Cohen and G. Migliorati, Multivariate approximation in downward closed polynomial spaces, in Contemporary Computational Mathematics – A Celebration of the 80th Birthday of Ian Sloan, Springer, 2018.
[9] P.J. Davis, Interpolation and approximation, Dover, 1975.
[10] R. DeVore and V.N. Temlyakov, Some remarks on greedy algorithms, Advances in Computational Mathematics, 5:173–187, 1996.
[11] A. Doostan and J. Hampton, Coherence motivated sampling and convergence analysis of least squares polynomial chaos regression, Comput. Methods Appl. Mech. Engrg., 290:73–97, 2015.
[12] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing, Birkhäuser, 2013.
[13] J.D. Jakeman, A. Narayan and T. Zhou, A Christoffel function weighted least squares algorithm for collocation approximations, Math. Comp., 86:1913–1947, 2017.
[14] T. Gerstner and M. Griebel, Dimension-adaptive tensor-product quadrature, Computing, 71(1):65–87, 2003.
[15] L. Györfi, M. Kohler, A. Krzyzak and H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer, 2002.
[16] M. Leshno, V.Y. Lin, A. Pinkus and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks, 6(6):861–867, 1993.
[17] G. Migliorati, F. Nobile, E. von Schwerin and R. Tempone, Analysis of discrete L² projection on polynomial spaces with random evaluations, Found. Comput. Math., 14:419–456, 2014.
[18] G. Migliorati, Adaptive polynomial approximation by means of random discrete least squares, Proceedings of ENUMATH 2013, Lecture Notes in Computational Science and Engineering, 103:547–554, Springer, 2015.
[19] G. Migliorati, F. Nobile and R. Tempone, Convergence estimates in probability and in expectation for discrete least squares with noisy evaluations at random points, J. Multivar. Anal., 142:167–182, 2015.
[20] J. Tropp, User-friendly tail bounds for sums of random matrices, Found. Comput. Math., 12:389–434, 2012.
A Algorithms
Algorithm 1 Deterministic sequential sampling

INPUT: $t$, $d\rho$, $(\tau_k)_{k=1}^{t}$, $(n_k)_{k=1}^{t}$, $(\psi_j)_{j=1}^{n_t}$
OUTPUT: $x_1,\ldots,x_{m_t}$ such that $x_{(j-1)\tau_k+1},\ldots,x_{j\tau_k}$ i.i.d. $\sim \chi_j$, for $j=1,\ldots,n_k$ and $k=1,\ldots,t$.

for $j = 1,\ldots,n_1$ do
  for $\ell = 1,\ldots,\tau_1$ do
    Sample $x_{j,\ell}$ from $\chi_j$
  end for
end for
$(x_1,\ldots,x_{m_1})^\top \leftarrow \mathrm{Vec}\big(((x_{j,\ell})_{j=1,\ldots,n_1}^{\ell=1,\ldots,\tau_1})^\top\big)$
for $k = 2,\ldots,t$ do
  for $j = n_{k-1}+1,\ldots,n_k$ do
    for $\ell = 1,\ldots,\tau_{k-1}$ do
      Sample $x_{j,\ell}$ from $\chi_j$
    end for
  end for
  for $j = 1,\ldots,n_k$ do
    for $\ell = \tau_{k-1}+1,\ldots,\tau_k$ do
      Sample $x_{j,\ell}$ from $\chi_j$
    end for
  end for
  $(x_1,\ldots,x_{m_k})^\top \leftarrow \mathrm{Vec}\big(((x_{j,\ell})_{j=1,\ldots,n_k}^{\ell=1,\ldots,\tau_k})^\top\big)$
end for
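A compact Python rendition of the bookkeeping in Algorithm 1; the routine `sample_chi(j, size)`, which draws i.i.d. samples from $\chi_j$ (for instance by the inverse transform sampler sketched earlier), is an assumption of this illustration.

```python
import numpy as np

def deterministic_sequential_sampling(tau, n, sample_chi):
    """Sketch of Algorithm 1: at iteration k keep tau[k] samples from chi_j for each
    j = 1,...,n[k], recycling all samples generated at the previous iterations.
    tau, n : lists of length t (nondecreasing), entry k refers to iteration k+1."""
    blocks = {}                      # blocks[j] holds the samples drawn so far from chi_j
    per_iteration = []
    for k in range(len(n)):
        for j in range(1, n[k] + 1):
            need = tau[k] - len(blocks.get(j, ()))      # top up chi_j to tau[k] samples
            if need > 0:
                new = np.atleast_2d(sample_chi(j, need))
                old = blocks.get(j)
                blocks[j] = new if old is None else np.vstack([old, new])
        # m_k = tau[k] * n[k] samples in total, grouped block-wise by j
        per_iteration.append(np.vstack([blocks[j][:tau[k]] for j in range(1, n[k] + 1)]))
    return per_iteration
```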
Algorithm 2 Random sequential sampling

INPUT: $t$, $(\mu_{n_k})_{k=1}^{t-1}$, $(\sigma_{n_k})_{k=1}^{t}$
OUTPUT: $x_1,\ldots,x_{m_t}$ such that $x_1,\ldots,x_{m_k}$ i.i.d. $\sim \mu_{n_k}$, for $k=1,\ldots,t$.

for $j = 1,\ldots,m_1$ do
  Sample $x_j$ from $\mu_{n_1} = \sigma_{n_1}$
end for
for $k = 2,\ldots,t$ do
  Sample $B_k$ from $\mathrm{Bin}\big(m_k, \frac{n_k - n_{k-1}}{n_k}\big)$
  for $j = \min(m_k - B_k, m_{k-1})+1, \ldots, \min(m_k - B_k, m_{k-1}) + \max(m_k - B_k - m_{k-1}, 0)$ do
    Sample $x_j$ from $\mu_{n_{k-1}}$
  end for
  for $j = \min(m_k - B_k, m_{k-1}) + \max(m_k - B_k - m_{k-1}, 0)+1, \ldots, m_k$ do
    Sample $x_j$ from $\sigma_{n_k}$
  end for
end for
Algorithm 3 Adaptive weighted least squares

INPUT: $\Lambda_1 = \{(0,\ldots,0)^\top\}$, $\beta$, $s$, $\alpha$, $t$, $k_{sg}$
OUTPUT: $u_C^t$

$\tau_1 = \lceil \theta^{-1}\ln(\zeta(s)(\#(\Lambda_1))^{s+1}/\alpha) \rceil$
for each $\nu\in\Lambda_1$ do
  Add $\tau_1$ random samples distributed as $\chi_\nu$
end for
$m_1 = \tau_1 \#(\Lambda_1)$
$u_C^1 = \mathrm{argmin}_{v\in V_{\Lambda_1}} \|u - v\|_{m_1}$
$r_1 = u - u_C^1$
for $k = 2,\ldots,t$ do
  $F = \mathrm{BULK}(\mathcal{R}(\Lambda_{k-1}), |\langle r_{k-1}, \psi_\nu\rangle_{m_{k-1}}|, \beta)$
  $\Lambda_k = \Lambda_{k-1}\cup F$
  $\tau_k = \lceil \theta^{-1}\ln(\zeta(s)(\#(\Lambda_k))^{s+1}/\alpha) \rceil$
  for each $\nu\in\Lambda_{k-1}$ do
    Add $\tau_k - \tau_{k-1}$ random samples distributed as $\chi_\nu$
  end for
  for each $\nu\in\Lambda_k\setminus\Lambda_{k-1}$ do
    Add $\tau_k$ random samples distributed as $\chi_\nu$
  end for
  $m_k = \tau_k \#(\Lambda_k)$
  $u_C^k = \mathrm{argmin}_{v\in V_{\Lambda_k}} \|u - v\|_{m_k}$
  if $k \bmod k_{sg} = 0$ then
    $\Lambda_k = \Lambda_k \cup\{\bar\nu\}$, with $\bar\nu$ the most ancient multi-index in $\mathcal{R}(\Lambda_{k-1})\setminus F$
  end if
  $r_k = u - u_C^k$
end for

Algorithm 4 Fully adaptive weighted least squares
INPUT: $\Lambda_1 = \{(0,\ldots,0)^\top\}$, $\beta$, $t$, $\xi$, $k_{sg}$
OUTPUT: $u_C^t$

$m_1 = 0$
repeat
  for each $\nu\in\Lambda_1$ do
    Add one random sample distributed as $\chi_\nu$
  end for
  $m_1 = m_1 + \#(\Lambda_1)$
until $\|G_1 - I_1\| < \xi$
$u_C^1 = \mathrm{argmin}_{v\in V_{\Lambda_1}} \|u - v\|_{m_1}$
$r_1 = u - u_C^1$
for $k = 2,\ldots,t$ do
  $F = \mathrm{BULK}(\mathcal{R}(\Lambda_{k-1}), |\langle r_{k-1}, \psi_\nu\rangle_{m_{k-1}}|, \beta)$
  $\Lambda_k = \Lambda_{k-1}\cup F$
  for each $\nu\in\Lambda_k\setminus\Lambda_{k-1}$ do
    Add $m_{k-1}/\#(\Lambda_{k-1})$ random samples distributed as $\chi_\nu$
  end for
  $m_k = m_{k-1}\,\#(\Lambda_k)/\#(\Lambda_{k-1})$
  repeat
    for each $\nu\in\Lambda_k$ do
      Add one random sample distributed as $\chi_\nu$
    end for
    $m_k = m_k + \#(\Lambda_k)$
  until $\|G_k - I_k\| < \xi$
  $u_C^k = \mathrm{argmin}_{v\in V_{\Lambda_k}} \|u - v\|_{m_k}$
  if $k \bmod k_{sg} = 0$ then
    $\Lambda_k = \Lambda_k \cup\{\bar\nu\}$, with $\bar\nu$ the most ancient multi-index in $\mathcal{R}(\Lambda_{k-1})\setminus F$
  end if
  $r_k = u - u_C^k$
end for