Stochastic expansions using continuous dictionaries: Lévy adaptive regression kernels
The Annals of Statistics ©
Institute of Mathematical Statistics, 2011
STOCHASTIC EXPANSIONS USING CONTINUOUS DICTIONARIES: LÉVY ADAPTIVE REGRESSION KERNELS
By Robert L. Wolpert, Merlise A. Clyde and Chong Tu
Duke University, Duke University and PIMCO
This article describes a new class of prior distributions for nonparametric function estimation. The unknown function is modeled as a limit of weighted sums of kernels or generator functions indexed by continuous parameters that control local and global features such as their translation, dilation, modulation and shape. Lévy random fields and their stochastic integrals are employed to induce prior distributions for the unknown functions or, equivalently, for the number of kernels and for the parameters governing their features. Scaling, shape, and other features of the generating functions are location-specific to allow quite different function properties in different parts of the space, as with wavelet bases and other methods employing overcomplete dictionaries. We provide conditions under which the stochastic expansions converge in specified Besov or Sobolev norms. Under a Gaussian error model, this may be viewed as a sparse regression problem, with regularization induced via the Lévy random field prior distribution. Posterior inference for the unknown functions is based on a reversible jump Markov chain Monte Carlo algorithm. We compare the Lévy Adaptive Regression Kernel (LARK) method to wavelet-based methods using some of the standard test functions, and illustrate its flexibility and adaptability in nonstationary applications.
1. Introduction.
Popular approaches for nonparametric Bayesian estimation of unobserved functions generally employ as prior distributions either Gaussian processes (or random fields, in two or more dimensions) or mixtures of Dirichlet processes. In this article, we focus attention on a wider
Received May 2010; revised January 2011. Supported by NSF Grants DMS-07-57549, PHY-09-41373 and NASA AISR Grant NNX09AK60G. Supported by NSF Grants DMS-0342172 and DMS-04-06115.
AMS 2000 subject classifications.
Primary 62G08; secondary 60E07.
Key words and phrases.
Bayes, Besov, kernel regression, LARK, Lévy random field, nonparametric regression, relevance vector machine, reversible jump Markov chain Monte Carlo, splines, support vector machine, wavelets.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2011, Vol. 39, No. 4, 1916–1962. This reprint differs from the original in pagination and typographic detail.
class of processes, Lévy random fields and their stochastic integrals. These include Gaussian random fields as a limiting case, while Dirichlet processes may be represented as "normalized" variants of the Gamma Lévy random field; Lévy random fields thus provide an important link between two of the random processes that form the foundation of Bayesian nonparametric methods (see Section 6). In this article, we construct prior distributions for the mean function in nonparametric regression as stochastic integrals of Lévy random fields. Under suitable regularity, these can be expressed as stochastic expansions using continuous dictionaries, permitting tractable Bayesian inference. While our focus is on nonparametric regression, we hope that the reader will see the possibilities of using Lévy random fields in other contexts.

To begin, suppose we have noisy measurements $\{Y_i\}_{i\in I}$ of an unknown real-valued function $f:\mathcal{X}\to\mathbb{R}$ observed at points $\{x_i\}_{i\in I}$ in some complete separable metric space $\mathcal{X}$, with $\mathsf{E}[Y_i]=f(x_i)$. In nonparametric regression models, the mean function $f(\cdot)$ is often regarded as an element of some Hilbert space $\mathcal{H}$ of real-valued functions on $\mathcal{X}$, and is expressed as a linear combination of basis functions $\{g_j\}\subset\mathcal{H}$:
$$f(x_i)=\sum_{1\le j<\infty}g_j(x_i)\,\beta_j.\tag{1}$$

2. Stochastic expansions and prior distributions. To make inference about the unknown mean function $f\in\mathcal{H}$ given noisy observations $Y_i$ of $f(x_i)$ for $\{x_i\}\subset\mathcal{X}$, we must first propose a prior distribution on $\mathcal{H}$ for $f$. Let $\Omega$ be a complete separable metric space and $\phi:\mathcal{X}\times\Omega\to\mathbb{R}$ a Borel measurable function, and set $\phi_j(x_i)\equiv\phi(x_i,\omega_j)$ for some collection $\{\omega_j\}\subset\Omega$.
As a slight extension of the basis expansion of (1), set
$$f(x)\equiv\sum_{1\le j\le J}\phi(x,\omega_j)\,\beta_j.\tag{2}$$
Possible choices for $\phi(x,\omega)$ for $\mathcal{X}=\mathbb{R}$ include translation-invariant kernel functions, such as the Gaussian
$$\phi_G(x,\omega)\equiv\exp\{-\lambda(x-\chi)^2\}\tag{4a}$$
or the Laplace
$$\phi_L(x,\omega)\equiv\exp\{-\lambda|x-\chi|\}\tag{4b}$$
kernels with $\omega\equiv(\chi,\lambda)\in\mathcal{X}\times\mathbb{R}_+\equiv\Omega$. There is no need to restrict attention to symmetric (e.g., Mercer) kernels, as required in the conventional Support Vector Machine (SVM) approach [Law and Kwok (2001), Sollich (2002)]. Asymmetric kernels, such as the one-sided exponential
$$\phi_E(x,\omega)\equiv\exp\{-\lambda(x-\chi)\}\mathbf{1}_{\{x>\chi\}}\tag{4c}$$
are useful, for example, in modeling pollutant dissipation over time. Other possibilities include piecewise-constant Haar wavelets on $\mathcal{X}=(0,1]$,
$$\phi_H(x,\omega)\equiv\mathbf{1}_{\{0<\lambda(x-\chi)\le 1\}}\tag{4d}$$
or continuous rescaling and shifting of other wavelet functions
$$\phi_\psi(x,\omega)\equiv\lambda^{1/2}\psi(\lambda(x-\chi)).\tag{4e}$$
In each of these examples, $\Omega$ is a location-scale space with location parameter $\chi$ and parameter $\lambda$ determining the scale. Higher-dimensional spaces $\mathcal{X}$ may be accommodated in a similar way; for example, in Section 8.2 we use the space–time kernel
$$\phi_{ST}(x,\omega)\equiv\exp\{-(s-\sigma)'\Lambda(s-\sigma)-\lambda|t-\tau|\}\tag{4f}$$
for space–time points $x=(s,t)\in\mathbb{R}^2\times\mathbb{R}_+$; here $\omega=(\sigma,\tau,\Lambda,\lambda)$ includes a space–time point $(\sigma,\tau)\in\mathbb{R}^2\times\mathbb{R}_+$, a positive-definite spatial dispersion matrix $\Lambda\in\mathcal{S}_2^+$, and a temporal decay rate $\lambda\in\mathbb{R}_+$.

2.2. Lévy random measures. For a measure $\nu(d\beta\,d\omega)$ on $\mathbb{R}\times\Omega$ with finite total mass $\nu_+\equiv\nu(\mathbb{R}\times\Omega)<\infty$, let $J\sim\mathrm{Po}(\nu_+)$ be Poisson-distributed with mean $\nu_+$, and let $\{(\beta_j,\omega_j)\}_{1\le j\le J}$ be i.i.d. draws from $\pi(d\beta\,d\omega)\equiv\nu(d\beta\,d\omega)/\nu_+$, and set
$$\mathcal{L}(d\omega)\equiv\sum_{1\le j\le J}\beta_j\,\delta_{\omega_j}(d\omega).\tag{5}$$
The construction extends to infinite measures with $\nu_+=\infty$: this will hold for any measure that satisfies the local $L_1$ integrability condition
$$\iint_{\mathbb{R}\times K}(1\wedge|\beta|)\,\nu(d\beta\,d\omega)<\infty\tag{7}$$
for each compact $K\subset\Omega$.
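In software the dictionary (4) amounts to a few one-line functions. The sketch below (plain Python; the parameter names `chi` and `lam` for $\chi$ and $\lambda$ are ours, and the atom list is an arbitrary illustration) implements (4a)–(4d) and evaluates a finite expansion of the form (2) for a fixed list of atoms $(\beta_j,\chi_j,\lambda_j)$.

```python
import math

# Generators phi(x, omega) of (4a)-(4d), with omega = (chi, lam); a minimal
# sketch for scalar x under the location-scale parametrization of Section 2.
def phi_gaussian(x, chi, lam):        # (4a)
    return math.exp(-lam * (x - chi) ** 2)

def phi_laplace(x, chi, lam):         # (4b)
    return math.exp(-lam * abs(x - chi))

def phi_exponential(x, chi, lam):     # (4c): one-sided, zero for x <= chi
    return math.exp(-lam * (x - chi)) if x > chi else 0.0

def phi_haar(x, chi, lam):            # (4d): indicator of 0 < lam*(x-chi) <= 1
    return 1.0 if 0.0 < lam * (x - chi) <= 1.0 else 0.0

def f_expansion(x, atoms, phi):
    """Finite LARK expansion (2): f(x) = sum_j beta_j * phi(x, omega_j)."""
    return sum(beta * phi(x, chi, lam) for beta, chi, lam in atoms)

# Example: three Gaussian atoms (beta_j, chi_j, lam_j)
atoms = [(1.0, 0.2, 50.0), (-0.5, 0.5, 200.0), (0.8, 0.8, 100.0)]
print(f_expansion(0.2, atoms, phi_gaussian))  # first atom contributes exactly 1.0
```

Swapping `phi_gaussian` for any other generator changes the local smoothness of the draw without changing the expansion machinery, which is the point of treating (4) as an overcomplete dictionary.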
The mean and variance, when they exist, are given by $\mathsf{E}[\mathcal{L}(A)]=\iint_{\mathbb{R}\times A}\beta\,\nu(d\beta\,d\omega)$ and $\mathsf{Var}[\mathcal{L}(A)]=\iint_{\mathbb{R}\times A}\beta^2\,\nu(d\beta\,d\omega)$, respectively.

Khinchine and Lévy (1936) showed that the most general ID random variables [and hence the most general ID-valued random measures; see Rajput and Rosiński (1989), Proposition 2.1] have characteristic functions of the form
$$\mathsf{E}[e^{it\mathcal{L}(A)}]=\exp\Big\{it\delta(A)-\tfrac{t^2}{2}\Sigma(A)+\iint_{\mathbb{R}\times A}\big(e^{it\beta}-1-ith(\beta)\big)\,\nu(d\beta\,d\omega)\Big\},\tag{8}$$
where $h(\beta)\equiv\beta\mathbf{1}_{[-1,1]}(\beta)$, determined uniquely by the characteristic triplet of sigma-finite measures $(\delta,\Sigma,\nu)$ consisting of a signed measure $\delta(d\omega)$ and a positive measure $\Sigma(d\omega)$ on $\Omega$, and a positive measure $\nu(d\beta\,d\omega)$ on $\mathbb{R}\times\Omega$ that satisfies the local $L_2$ integrability condition
$$\iint_{\mathbb{R}\times K}(1\wedge\beta^2)\,\nu(d\beta\,d\omega)<\infty\tag{9}$$
for each compact $K\subset\Omega$ and $\nu(\{0\},\Omega)=0$ (for more details on this nonstationary version of the classic Lévy–Khinchine formula see Jacod and Shiryaev [(1987), page 75], Cont and Tankov [(2004), pages 457–459] or Wolpert and Taqqu (2005)).

The role of the compensator function $h(\beta)$ is to make the last integrand in (8) bounded and $O(\beta^2)$ near $\beta\approx 0$, permitting the replacement of (7) with the weaker condition (9); in this case $\mathcal{L}(d\omega)$ may have countably-many points of support $\{\omega_j\}\subset\Omega$ whose magnitudes $\{\beta_j\}$ are not absolutely summable, precluding a representation of the form (5). The compensator $h(\beta)$ may be replaced by any bounded measurable function satisfying
$$h(\beta)=\beta+O(\beta^2),\qquad\beta\approx 0,\tag{10}$$
with a corresponding replacement of $\delta(d\omega)$ with $\delta_h(d\omega)=\delta(d\omega)+\int_{\mathbb{R}}[h(\beta)-h_0(\beta)]\,\nu(d\beta\,d\omega)$, where $h_0(\beta)\equiv\beta\mathbf{1}_{[-1,1]}(\beta)$ denotes the standard compensator.
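When $\nu$ is finite, the moment formulas for $\mathsf{E}[\mathcal{L}(A)]$ and $\mathsf{Var}[\mathcal{L}(A)]$ above can be checked by direct simulation. The sketch below is an illustration under assumed settings, not from the paper: it takes $\nu(d\beta\,d\omega)=\nu_+\cdot\mathrm{Exp}(\eta)(d\beta)\times\pi(d\omega)$, so $\mathcal{L}(\Omega)$ is a compound Poisson sum with mean $\nu_+/\eta$ and variance $\nu_+\mathsf{E}[\beta^2]=2\nu_+/\eta^2$.

```python
import math
import random

random.seed(1)

def draw_L_total(nu_plus, eta):
    """One draw of L(Omega) = sum_{1<=j<=J} beta_j under the illustrative finite
    Levy measure nu(dbeta domega) = nu_plus * Exp(eta)(dbeta) x pi(domega):
    J ~ Po(nu_plus), beta_j i.i.d. Exponential(eta)."""
    j, p, u = 0, math.exp(-nu_plus), random.random()
    cdf = p
    while u > cdf:                      # Poisson draw by CDF inversion
        j += 1
        p *= nu_plus / j
        cdf += p
    return sum(random.expovariate(eta) for _ in range(j))

nu_plus, eta = 5.0, 2.0
draws = [draw_L_total(nu_plus, eta) for _ in range(200_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)  # close to nu_plus/eta = 2.5 and 2*nu_plus/eta**2 = 2.5
```

The agreement of the Monte Carlo moments with $\iint\beta\,\nu$ and $\iint\beta^2\,\nu$ is exactly the content of the displayed formulas for this finite $\nu$.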
Whenever (7) is satisfied, we may take $h(\beta)\equiv 0$. By (8) the random measure $\mathcal{L}$ may be written as the sum of two independent parts: a Gaussian portion, assigning independent normally-distributed random variables with mean $\delta_h(A_i)$ and variance $\Sigma(A_i)$ to disjoint sets $A_i$, and the remaining portion, with characteristic function
$$\mathsf{E}[e^{it\mathcal{L}(A)}]=\exp\Big\{\iint_{\mathbb{R}\times A}\big(e^{it\beta}-1-ith(\beta)\big)\,\nu(d\beta\,d\omega)\Big\}.\tag{11}$$
We call a random signed measure $\mathcal{L}$ with no Gaussian component [i.e., an ID-valued measure with $\Sigma(\Omega)=\delta_h(\Omega)=0$, that satisfies (11)] a Lévy random measure. Nonnegative Lévy random measures satisfying (7) were called "completely random measures" by Kingman (1967).

2.3. Lévy random fields. A Lévy random measure $\mathcal{L}$ satisfying (11) induces a linear mapping $\phi\mapsto\mathcal{L}[\phi]$ from functions $\phi:\Omega\to\mathbb{R}$ to random variables $\mathcal{L}[\phi]\equiv\int_\Omega\phi(\omega)\,\mathcal{L}(d\omega)$; such a mapping is called a random field. For simple functions $\phi(\omega)=\sum a_i\mathbf{1}_{A_i}(\omega)$ with each $\bar A_i\subset\Omega$ compact, we set $\mathcal{L}[\phi]\equiv\sum a_i\mathcal{L}(A_i)$ and verify that
$$\mathsf{E}[e^{it\mathcal{L}[\phi]}]=\exp\Big\{\iint_{\mathbb{R}\times\Omega}\big(e^{it\phi(\omega)\beta}-1-it\phi(\omega)h(\beta)\big)\,\nu(d\beta\,d\omega)\Big\}.\tag{12}$$
It is straightforward to extend this by continuity in probability to (at least) all bounded measurable compactly-supported $\phi:\Omega\to\mathbb{R}$. We now present a general construction based on Poisson random fields, the key to our approach to tractable posterior Bayesian inference.

2.3.1. Poisson construction I: Uncompensated. When $\nu(d\beta\,d\omega)$ satisfies (7) (i.e., $|\beta|$ is locally $\nu$-integrable at zero) we may take $h(\beta)\equiv 0$ and construct $\mathcal{L}$ as follows.
Begin with a Poisson random measure $\mathcal{N}(d\beta\,d\omega)\sim\mathrm{Po}(\nu)$ on $(\mathbb{R}\times\Omega)$ that assigns independent Poisson-distributed random variables $\mathcal{N}(C_i)\sim\mathrm{Po}(\nu(C_i))$ with means $\nu(C_i)$ to disjoint Borel sets $C_i\subset(\mathbb{R}\times\Omega)$. For any Borel set $A\subset\Omega$ with compact closure $\bar A$ and bounded measurable compactly-supported $\phi:\Omega\to\mathbb{R}$, set $J\equiv\mathcal{N}(\mathbb{R}\times A)$ and
$$\mathcal{L}(A)\equiv\iint_{\mathbb{R}\times A}\beta\,\mathcal{N}(d\beta\,d\omega)=\sum_{1\le j\le J}\beta_j\mathbf{1}_A(\omega_j).\tag{13}$$

2.3.2. Poisson construction II: Compensated. The situation is more delicate in case the Lévy measure does not satisfy (7), but only the weaker bound in (9) (i.e., if $\beta^2$ is locally $\nu$-integrable but $|\beta|$ is not). Begin again with the Poisson measure $\mathcal{N}\sim\mathrm{Po}(\nu)$ on $\mathbb{R}\times\Omega$, and introduce the compensated or centered Poisson measure $\tilde{\mathcal{N}}(d\beta\,d\omega)\equiv\mathcal{N}(d\beta\,d\omega)-\nu(d\beta\,d\omega)$ with mean zero [Sato (1999), page 38], inducing an isometry from $L_2(\mathbb{R}\times\Omega,\nu(d\beta\,d\omega))$ to the square-integrable zero-mean random variables. Following Wolpert and Taqqu (2005), set
$$\mathcal{L}(A)\equiv\iint_{\mathbb{R}\times A}[\beta-h(\beta)]\,\mathcal{N}(d\beta\,d\omega)+\iint_{\mathbb{R}\times A}h(\beta)\,\tilde{\mathcal{N}}(d\beta\,d\omega),$$
$$\mathcal{L}[\phi]\equiv\iint_{\mathbb{R}\times\Omega}[\beta-h(\beta)]\phi(\omega)\,\mathcal{N}(d\beta\,d\omega)+\iint_{\mathbb{R}\times\Omega}h(\beta)\phi(\omega)\,\tilde{\mathcal{N}}(d\beta\,d\omega)\tag{14}$$
for any measurable $\phi$ for which (14) converges. If (7) holds, one may simplify (14) to
$$\mathcal{L}[\phi]=\iint_{\mathbb{R}\times\Omega}\beta\phi(\omega)\,\mathcal{N}(d\beta\,d\omega)-\iint_{\mathbb{R}\times\Omega}h(\beta)\phi(\omega)\,\nu(d\beta\,d\omega)\tag{15a}$$
$$=\sum_{1\le j\le J}\beta_j\phi(\omega_j)-\iint_{\mathbb{R}\times\Omega}h(\beta)\phi(\omega)\,\nu(d\beta\,d\omega).\tag{15b}$$

Theorem 1. Let $\nu$ be a Lévy measure on $\mathbb{R}\times\Omega$ satisfying (9). Then $\mathcal{L}[\phi]$ is well defined by (14) with characteristic function given by (12) for compensator $h(\beta)\equiv\beta\mathbf{1}_{\{|\beta|\le 1\}}$ if $\phi$ satisfies
$$\iint_{[-1,1]^c\times\Omega}(1\wedge|\beta\phi(\omega)|)\,\nu(d\beta\,d\omega)<\infty,\tag{16a}$$
$$\iint_{[-1,1]\times\Omega}\big(|\beta\phi(\omega)|^2\wedge|\beta\phi(\omega)|\big)\,\nu(d\beta\,d\omega)<\infty.\tag{16b}$$
If, in addition, $\phi$ satisfies
$$\iint_{\mathbb{R}\times\Omega}(1\wedge\beta^2)|\phi(\omega)|\,\nu(d\beta\,d\omega)<\infty,\tag{16c}$$
then $\mathcal{L}[\phi]$ is well defined for any compensator $h(\beta)$ satisfying (10). Proof.
Under these conditions, the integrands of the compensated and uncompensated Poisson integrals in (14) are in the Musielak–Orlicz spaces for which those integrals are well defined; see Rajput and Rosiński [(1989), page 9], Kwapień and Woyczyński (1992). □

In particular:

Corollary 1. $\mathcal{L}[\phi]$ is well defined with characteristic function (12) for any function $\phi$ satisfying
$$\iint_{\mathbb{R}\times\Omega}(1\wedge\beta^2)\big(|\phi(\omega)|\vee\phi(\omega)^2\big)\,\nu(d\beta\,d\omega)<\infty,\tag{17}$$
including [by (9)] all bounded measurable compactly-supported $\phi$. Thus, $\mathcal{L}(A)=\mathcal{L}[\mathbf{1}_A]$ is always well defined for any Borel set $A\subset\Omega$ with compact closure $\bar A$.

Similarly:

Proposition 1. For a Lévy measure $\nu$ satisfying (7), take $h(\beta)\equiv 0$; then (13) extends to
$$\mathcal{L}[\phi]\equiv\iint_{\mathbb{R}\times\Omega}\beta\phi(\omega)\,\mathcal{N}(d\beta\,d\omega)=\sum_{1\le j\le J}\beta_j\phi(\omega_j),$$
which converges absolutely almost surely for any $\phi$ satisfying
$$\iint_{\mathbb{R}\times\Omega}(1\wedge|\beta\phi(\omega)|)\,\nu(d\beta\,d\omega)<\infty.\tag{18}$$

2.4. LARK models. Denote by $\Phi$ the linear space of functions $\phi:\Omega\to\mathbb{R}$ for which $\mathcal{L}[\phi]$ has been defined; we have seen that this includes at least all bounded measurable compactly-supported functions $\phi$. Denote by $\mathcal{G}$ the linear space of measurable functions $g:\mathcal{X}\to\Phi$, and simplify notation by writing "$g(x,\omega)$" for $g(x)(\omega)$. Each of the generating functions introduced in (4) lies in $\mathcal{G}$. For any $g\in\mathcal{G}$, we can construct a random function $f:\mathcal{X}\to\mathbb{R}$ by
$$f(x)\equiv\mathcal{L}[g(x)]=\iint_{\mathbb{R}\times\Omega}g(x,\omega)[\beta-h(\beta)]\,\mathcal{N}(d\beta\,d\omega)+\iint_{\mathbb{R}\times\Omega}g(x,\omega)h(\beta)\,\tilde{\mathcal{N}}(d\beta\,d\omega),\tag{19}$$
with the series representation $f(x)=\sum_{1\le j}g(x,\omega_j)\beta_j$ when (7) holds and $h\equiv 0$.

2.5. Examples. We now consider some specific examples of Lévy random fields and the corresponding kernel integrals. Familiar examples include Poisson, Gamma, Cauchy and more generally $\alpha$-Stable random fields.

2.5.1. Compound Poisson processes.
The simplest model to consider would be that of (2), with finite Lévy measure satisfying $\nu_+\equiv\nu(\mathbb{R}\times\Omega)<\infty$, reproduced here:
$$f(x)\equiv\sum_{1\le j\le J}\phi(x,\omega_j)\,\beta_j.$$

2.5.2. Gamma random fields. The Lévy measure for the Gamma random field is infinite but satisfies the strong local $L_1$ integrability condition (7), obviating compensation; in the homogeneous case, it is
$$\nu(d\beta\,d\omega)=\beta^{-1}e^{-\beta\eta}\mathbf{1}_{\{\beta>0\}}\,d\beta\,\gamma(d\omega)\tag{22}$$
for some $\sigma$-finite measure $\gamma(d\omega)$ on $\Omega$, giving $\mathcal{L}(A)\sim\mathrm{Ga}(\gamma(A),\eta)$ [with mean $\gamma(A)/\eta$] for Borel measurable $A\subset\Omega$ with $\gamma(A)<\infty$. Because $\nu$ is concentrated on $\mathbb{R}_+$, the mass $\beta_j$ at each of the Gamma random measure's support points $\omega_j$ is positive, so all the coefficients in the expression $f(x)=\sum\phi(x,\omega_j)\beta_j$ are nonnegative. With a nonnegative generating function $\phi\in\mathcal{G}$, this provides a direct way to construct nonnegative mean functions $f\ge 0$, without transforming the observations $\{Y_i\}$ as Gaussian methods would require. The mean $\mathsf{E}[f(x)]=\eta^{-1}\int g(x,\omega)\,\gamma(d\omega)$ is available from (21a), as is the covariance from (21b).

2.5.3. Symmetric Gamma random fields. A symmetric analogue of the Gamma random field (22) has Lévy measure
$$\nu(d\beta\,d\omega)=|\beta|^{-1}e^{-|\beta|\eta}\,d\beta\,\gamma(d\omega)\tag{23}$$
on all of $\mathbb{R}\times\Omega$, leading to random variables $\mathcal{L}(A)$ distributed as the difference of two independent $\mathrm{Ga}(\gamma(A),\eta)$ variables, with characteristic function $\mathsf{E}[e^{it\mathcal{L}(A)}]=(1+t^2/\eta^2)^{-\gamma(A)}$. Both the standard positive Gamma random measure and this symmetric version satisfy the local $L_1$ bound (7), hence no compensation is required so we may take $h(\beta)\equiv 0$ in constructing $f(x)$. The mean $\mathsf{E}[f(x)]=0$ vanishes for the symmetric Gamma random field, or for any other Lévy random field with a symmetric (in $\pm\beta$) Lévy measure satisfying (7). Covariances are available from (21b). Nearly all of the commonly used isotropic geostatistical covariance functions [see Chilès and Delfiner (1999), Section 2.5] may be achieved by the choice of a suitable generating kernel $g(x,\cdot)$ and Lévy measure $\nu(d\beta\,d\omega)$; see Clyde and Wolpert (2007) for specific examples.

2.5.4. Symmetric α-Stable random fields.
Symmetric $\alpha$-Stable ($S\alpha S$) Lévy random fields have Lévy measure
$$\nu(d\beta\,d\omega)=c_\alpha\alpha|\beta|^{-1-\alpha}\,d\beta\,\gamma(d\omega)\tag{24}$$
on $\mathbb{R}\times\Omega$ for some $0<\alpha<2$ and $\sigma$-finite positive measure $\gamma(d\omega)$, where $c_\alpha=(1/\pi)\Gamma(\alpha)\sin(\pi\alpha/2)$. This gives $\mathcal{L}(A)\sim\mathrm{St}(\alpha,0,\gamma(A),0)$ [in parametrization (M) of Zolotarev (1986), page 11] with infinite variance (and thus no meaningful covariance function for $f(x)\equiv\mathcal{L}[g(x)]$). This infinite Lévy measure satisfies (9) for all $0<\alpha<2$, but satisfies the stronger local $L_1$ condition (7) only for $0<\alpha<1$; thus compensation is required to construct $S\alpha S$ random fields with $1\le\alpha<2$, including the Cauchy case of $\alpha=1$. One can show that $f(x)$ is well defined for any $\phi(x,\cdot)\in L_\alpha(\Omega,\gamma(d\omega))$, including the generating functions of (4). The $S\alpha S$ fields have heavier tails than, for example, the symmetric Gamma fields of Section 2.5.3, and may be more appropriate for problems where one might expect $f(\cdot)$ to be dominated by a few heavily weighted kernels.

3. Approximations for implementing kernel integrals. Computer simulations of Lévy random measures $A\mapsto\mathcal{L}(A)$ and random fields $\phi\mapsto\mathcal{L}[\phi]$ associated with finite Lévy measures $\nu$ may be constructed as in (5), (13), simply by setting $\nu_+\equiv\nu(\mathbb{R}\times\Omega)$ and drawing $J\sim\mathrm{Po}(\nu_+)$ and $\{(\beta_j,\omega_j)\}_{1\le j\le J}$ i.i.d. from $\nu/\nu_+$. For infinite Lévy measures, $\nu$ may be approximated by suitable finite measures as follows.

Theorem 2. Let $\nu$ be a Lévy measure defined on $\mathbb{R}\times\Omega$ satisfying (9) and $\phi\in\Phi$ satisfying (16). Take $\{K_\varepsilon\}$ to be any family of compact sets increasing to $\Omega$ as $\varepsilon\to 0$, and let $\nu_\varepsilon$ be the unique Borel measure on $\mathbb{R}\times\Omega$ satisfying
$$\nu_\varepsilon(B\times A)\equiv\nu\big((B\cap[-\varepsilon,\varepsilon]^c)\times(A\cap K_\varepsilon)\big)\tag{25}$$
for Borel sets $B\subset\mathbb{R}$, $A\subset\Omega$ [note $\nu_\varepsilon^+\equiv\nu_\varepsilon(\mathbb{R}\times\Omega)<\infty$]. Let $h(\cdot)$ be any bounded measurable compensator function on $\mathbb{R}$ satisfying $h(\beta)=\beta+O(\beta^2)$ for $\beta$ near zero.
Then as $\varepsilon\to 0$, the random variables
$$\mathcal{L}_\varepsilon[\phi]\equiv\iint_{[-\varepsilon,\varepsilon]^c\times K_\varepsilon}\beta\phi(\omega)\,\mathcal{N}(d\beta\,d\omega)-\iint_{[-\varepsilon,\varepsilon]^c\times K_\varepsilon}h(\beta)\phi(\omega)\,\nu(d\beta\,d\omega)\tag{26}$$
converge in probability to $\mathcal{L}[\phi]$ of (14).

Proof. The error in approximating $\mathcal{L}[\phi]$ of (14) by $\mathcal{L}_\varepsilon[\phi]$ of (26) is
$$\mathcal{L}[\phi]-\mathcal{L}_\varepsilon[\phi]=\iint_{N_\varepsilon}\big(\beta-h(\beta)\big)\phi(\omega)\,\mathcal{N}(d\beta\,d\omega)+\iint_{N_\varepsilon}h(\beta)\phi(\omega)\,\tilde{\mathcal{N}}(d\beta\,d\omega),\tag{27}$$
where $N_\varepsilon\equiv\{(\beta,\omega):|\beta|\le\varepsilon\text{ or }\omega\in K_\varepsilon^c\}$. The first term in (27) converges to zero almost surely, and the second in $L_2$, as $\varepsilon\to 0$; see the Appendix for details. □

The approximation $\mathcal{L}_\varepsilon[\phi]$ is the sum of a Lévy random field with finite Lévy measure $\nu_\varepsilon$ [hence with simple representation (13)] and a deterministic drift term $\delta_\varepsilon[\phi]$ given by the second integral in (26). The drift vanishes whenever $\nu(d\beta\,d\omega)$ is symmetric in $\pm\beta$ and $h(\beta)$ is odd.

Corollary 2. If either (a) $\nu(d\beta\,d\omega)$ satisfies (7), or (b) $\nu(d\beta\,d\omega)$ satisfies (9) and is even in $\pm\beta$, and also $h(\beta)$ is an odd function, then for each $x\in\mathcal{X}$,
$$f_\varepsilon(x)\equiv\sum_{1\le j\le J_\varepsilon}g(x,\omega_j)\,\beta_j\tag{28}$$
converges in probability to $f(x)$ as $\varepsilon\to 0$.

Proof. With $f_\varepsilon(x)\equiv\mathcal{L}_\varepsilon[g(x)]$,
$$f_\varepsilon(x)=\int_\Omega g(x,\omega)\,\mathcal{L}_\varepsilon(d\omega)=\iint_{\mathbb{R}\times\Omega}g(x,\omega)\beta\,\mathcal{N}_\varepsilon(d\beta\,d\omega)-\iint_{\mathbb{R}\times\Omega}g(x,\omega)h(\beta)\,\nu_\varepsilon(d\beta\,d\omega)\tag{29}$$
with $\mathcal{N}_\varepsilon(d\beta\,d\omega)\sim\mathrm{Po}(\nu_\varepsilon(d\beta\,d\omega))$. If $\nu$ satisfies (7), then without loss of generality take the compensator function $h(\beta)\equiv 0$. In both cases (a) and (b), the second integral in (29) vanishes, leading to (28) [cf. (2)]. □

Note that in case (b) the $\{g(x,\omega_j)\beta_j\}$ are not absolutely summable so "$\sum_{j=0}^\infty g(x,\omega_j)\beta_j$" does not converge in the Lebesgue sense. In each of our applications the conditions of Corollary 2 hold, allowing us to approximate $\nu$ by a finite Lévy measure $\nu_\varepsilon$ [and $\mathcal{L}$ by $\mathcal{L}_\varepsilon\sim\text{Lévy}(\nu_\varepsilon)$], and exploit the resulting Poisson representation for inference.

4. Function spaces for LARK models.
Theorem 2 and Corollary 2 establish pointwise convergence of $f_\varepsilon(x)$ to $f(x)$ as $\varepsilon\to 0$; in this section we provide conditions to ensure that $f_\varepsilon(\cdot)\to f(\cdot)$ in appropriate Besov or Sobolev norms if the generating functions lie in the same space.

For $s\ge 0$ and $d\in\mathbb{N}$ denote by $W^s(\mathbb{R}^d)$ the Sobolev space of real-valued square-integrable functions $f(\cdot)\in L_2(\mathbb{R}^d)$ [Sobolev (1991), Section 1.7, Reed and Simon (1975), page 50] with finite Sobolev norm
$$\|f\|_{W^s}=\Big\{(2\pi)^{-d}\int_{\mathbb{R}^d}(1+|\xi|^2)^s|\hat f(\xi)|^2\,d\xi\Big\}^{1/2}\tag{30}$$
with Fourier transforms defined for $f\in L_1(\mathbb{R}^d)$ by
$$\hat f(\xi)=\int_{\mathbb{R}^d}e^{i\xi\cdot x}f(x)\,dx$$
and by $L_2$ limits for $f\in L_2(\mathbb{R}^d)$; here $d\xi$ and $dx$ denote the Lebesgue volume element in $\mathbb{R}^d$, and $\xi\cdot x$ denotes the Euclidean inner product. Each $W^s$ is a Banach space, hence complete. By Plancherel's theorem, each $f\in W^s$ has $\lfloor s\rfloor$ distributional derivatives in $L_2(\mathbb{R}^d)$, and by Sobolev's lemma has $k$ continuous derivatives for each integer $0\le k<s-d/2$.

The Besov space $B^s_{pq}$ consists of those $f\in L_p(\mathbb{R}^d)$ whose Besov semi-norms are finite. Several equivalent Besov semi-norms appear in the literature [Triebel (1992), Theorem 2.6.1, page 140]; we use the definition given as equation 2 of that theorem. For $p,q\ge 1$, $s>d(1/p-1)_+$ and for any integer $m>s$ ($m=1+\lfloor s\rfloor$ is easiest), set
$$|f|^s_{pq}=\Big(\int_{|h|\le 1}|h|^{-sq}\|\Delta^m_h f\|^q_p\,dh/|h|^d\Big)^{1/q}$$
or, in dimension $d=1$,
$$|f|^s_{pq}=\Big(2\int_0^1 h^{-1-sq}\|\Delta^m_h f\|^q_p\,dh\Big)^{1/q},\tag{31}$$
where $\Delta^m_h$ denotes the $m$th forward finite difference,
$$\Delta^0_h f(x)=f(x),\qquad\Delta^m_h f(x)=\big[\Delta^{m-1}_h f(x+h)-\Delta^{m-1}_h f(x)\big]=\sum_{k=0}^m\binom{m}{k}(-1)^{m-k}f(x+kh).\tag{32}$$
The Besov space $B^s_{pq}$ is the Banach space completion of $L_p(\mathbb{R}^d)$ under norm
$$\|f\|^s_{pq}=\|f\|_p+|f|^s_{pq}.\tag{33}$$
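The finite differences $\Delta^m_h$ of (32) are straightforward to compute; the sketch below (an illustration, not from the paper) implements the binomial-sum form and checks the classical identities $\Delta^m_h f=0$ for polynomials $f$ of degree less than $m$, and $\Delta^m_h x^m=m!\,h^m$. A crude numerical estimate of the semi-norm (31) then follows by evaluating $\|\Delta^m_h f\|_p$ over a grid of $h$ values and applying quadrature.

```python
from math import comb

def forward_difference(f, x, h, m):
    """m-th forward finite difference of (32):
    Delta_h^m f(x) = sum_{k=0}^{m} C(m,k) (-1)^(m-k) f(x + k*h)."""
    return sum(comb(m, k) * (-1) ** (m - k) * f(x + k * h) for k in range(m + 1))

cube = lambda x: x ** 3
print(forward_difference(cube, 0.7, 0.1, 4))   # ~0, since deg 3 < m = 4
print(forward_difference(cube, 0.7, 0.1, 3))   # ~3! * 0.1**3 = 0.006
```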
For $p=q=2$, $B^s_{pq}$ coincides with the Sobolev space $W^s$. For fixed $\omega\in\Omega$, each of the kernel functions $g(\cdot,\omega)$ in (4) is in $B^s_{pq}$ for suitable $s>0$ and all $p,q\ge 1$, and hence each finite approximation of the form (28) lies in the same $B^s_{pq}$. For example, the Gaussian kernel of (4a) (along with its $d$-dimensional generalization) satisfies $g_G(\cdot,\omega)\in B^s_{pq}$ for every $s<\infty$ and $p,q\ge 1$, while in $\mathbb{R}^1$ the double-sided Laplace kernel of (4b) satisfies $g_L(\cdot,\omega)\in B^s_{pp}$ for $s<1+1/p$, and the Haar wavelet of (4d) is in $B^s_{pq}$ only for $s<1/p$. To simplify proofs in Section 4.1, we will restrict attention to generating functions $g$ on $\mathbb{R}^d$; the Besov semi-norms defined in terms of differences on bounded domains in Section 5.2.2 of Triebel (1992) may be used to extend these results to bounded domains.

We now provide conditions for LARK models to be in the same Besov space as their generating functions.

4.1. Convergence of LARK models in Besov spaces.

Theorem 3. Fix $g\in B^s_{pq}(\mathbb{R}^d)$ for some $p,q\ge 1$ and $s>0$, and a Lévy measure $\nu$ on $\mathbb{R}\times\Omega$ with $\Omega=(\mathcal{S}^+_d\times\mathbb{R}^d)$ of translation-invariant product form $\nu(d\beta\,d\omega)=\tilde\nu(d\beta\,d\Lambda)\,d\chi$ [here $\omega=(\Lambda,\chi)$] for a $\sigma$-finite measure $\tilde\nu(d\beta\,d\Lambda)$ on $\mathbb{R}\times\mathcal{S}^+_d$ that satisfies the integrability condition (7). Define a location-scale LARK model $f(\cdot)$ on $\mathcal{X}=\mathbb{R}^d$ by $f(x)=\int_\Omega\phi(x,\omega)\,\mathcal{L}(d\omega)$ where $\phi(x,\omega)\equiv g(\Lambda(x-\chi))$ satisfies (18) for each fixed $x\in\mathcal{X}$. Then $f$ has the almost surely convergent series expression
$$f(x)=\sum_j g(\Lambda_j(x-\chi_j))\,\beta_j\tag{34}$$
and $f\in B^s_{pq}$ almost surely if $\tilde\nu$ satisfies
$$\iint_{\mathbb{R}\times\mathcal{S}^+_d}\big(1\wedge|\beta||\Lambda|^{-1/p}\big)\,\tilde\nu(d\beta\,d\Lambda)<\infty,\tag{35a}$$
$$\iint_{\mathbb{R}\times\mathcal{S}^+_d}\big(1\wedge|\beta||\Lambda|^{s-1/p}\big)\,\tilde\nu(d\beta\,d\Lambda)<\infty.\tag{35b}$$

Proof. Equation (18) ensures that the sum in (34) will converge almost surely for each fixed $x\in\mathcal{X}$, with only finitely many terms $|g(\Lambda_j(x-\chi_j))\beta_j|>1$ and an absolutely convergent sum over the terms with $|g(\Lambda_j(x-\chi_j))\beta_j|\le 1$.
The $L_p$ norm of $f$ satisfies the bound
$$\|f\|_p\le\sum_j\|g(\Lambda_j(\cdot-\chi_j))\|_p|\beta_j|=\|g\|_p\sum_j|\Lambda_j|^{-1/p}|\beta_j|$$
by the triangle inequality and Proposition 2 in Appendix A. This is finite almost surely by (35a) since $g\in B^s_{pq}\subset L_p$. The Besov semi-norm of $f$ is bounded by
$$|f|^s_{pq}\le\sum_j|\beta_j|\,|g(\Lambda_j(\cdot-\chi_j))|^s_{pq}=\sum_j|\beta_j|\Big(\int_{|h|\le 1}|h|^{-d-sq}\|\Delta^m_h g(\Lambda_j(\cdot-\chi_j))\|^q_p\,dh\Big)^{1/q}=\sum_j|\beta_j|\Big(\int_{|h|\le 1}|h|^{-d-sq}|\Lambda_j|^{-q/p}\|\Delta^m_{\Lambda_j h}g\|^q_p\,dh\Big)^{1/q}$$
by Proposition 2; changing variables $t=\Lambda_j h$, this is
$$=\sum_j|\beta_j||\Lambda_j|^{s-1/p}\Big(\int_{|\Lambda_j^{-1}t|\le 1}|t|^{-d-sq}\|\Delta^m_t g\|^q_p\,dt\Big)^{1/q}.\tag{36}$$
The integral in (36) is bounded by
$$\int_{\mathbb{R}^d}|t|^{-d-sq}\|\Delta^m_t g\|^q_p\,dt=\int_{|t|\le 1}|t|^{-d-sq}\|\Delta^m_t g\|^q_p\,dt+\int_{|t|>1}|t|^{-d-sq}\|\Delta^m_t g\|^q_p\,dt.$$
The first term is just $(|g|^s_{pq})^q$, and (32) implies $\|\Delta^m_t g\|_p\le 2^m\|g\|_p$, so
$$\le(|g|^s_{pq})^q+\int_{|t|>1}|t|^{-d-sq}(2^m\|g\|_p)^q\,dt=(|g|^s_{pq})^q+\frac{2\pi^{d/2}2^{mq}}{\Gamma(d/2)\,sq}\|g\|^q_p\le(c\|g\|^s_{pq})^q$$
for some $c<\infty$, so
$$|f|^s_{pq}\le c\|g\|^s_{pq}\sum_j|\beta_j||\Lambda_j|^{s-1/p},\tag{37}$$
which is almost surely finite by (35b). □

Each of the kernels $g(\cdot,\omega)$ considered in the examples in Sections 7 and 8 may be shown to be in some Besov space $B^s_{pq}$, and each is bounded by $\|g\|_\infty\le 1$. Corollary 3 establishes that each of our LARK models with a Lévy measure that satisfies (7) is in the same space $B^s_{pq}$ as its generating function.

Corollary 3. Let $f(x)=\int\phi(x,\omega)\,\mathcal{L}(d\omega)$ be a one-dimensional LARK model on a compact set $\mathcal{X}\subset\mathbb{R}$, with product Lévy measure $\nu(d\beta\,d\omega)=\nu_\beta(d\beta)\,\pi_\lambda(d\lambda)\,d\chi$ on $\mathbb{R}\times\mathbb{R}_+\times\mathcal{X}$ satisfying (7), with Gamma probability measure $\pi_\lambda(d\lambda)=\mathrm{Ga}(a_\lambda,b_\lambda)$ and location-scale generator $\phi(x,\omega)=g(\lambda(x-\chi))$ with bounded $g\in B^s_{pq}$. Then $f\in B^s_{pq}$ almost surely if $a_\lambda>1/p$, for $p,q\ge 1$ and $s>0$.
In particular, if $a_\lambda\ge 1$ then $f\in B^s_{pq}$ if $g\in B^s_{pq}$, for all $p,q\ge 1$ and $s>0$.

Proof. Equation (18) holds for bounded $g\in B^s_{pq}$ with Lévy measures of the form indicated; the conditions on $a_\lambda$ ensure that also $\int_{\mathbb{R}_+}\lambda^{-1/p}\,\pi_\lambda(d\lambda)<\infty$ and $\int_{\mathbb{R}_+}\lambda^{s-1/p}\,\pi_\lambda(d\lambda)<\infty$, so the bounds of (35) hold. □

4.2. Comparisons with Abramovich, Sapatinas and Silverman. The stochastic wavelet expansion of Abramovich, Sapatinas and Silverman (2000) may be viewed as a LARK model using the wavelet generator (4e), with coefficients that, when conditioned on the scale parameters $\{a_j\}$, have independent Gaussian distributions $\beta_j\stackrel{\mathrm{ind}}{\sim}\mathrm{No}(0,ca_j^{-\delta})$, with $\omega=(a,b)\in[a_0,\infty)\times[0,1)$ and $\nu_\omega(d\omega)\propto a^{-\xi}\mathbf{1}_{\{a\ge a_0\}}\,db\,da$ for some $c,\delta,\xi\ge 0$. The parameters $\delta$ and $\xi$ control the size and frequency of wavelet coefficients and determine whether the expansion will have a well-defined limit. For a finite Lévy measure $\nu_\omega(d\omega)$ ($\xi>1$) the expansion has finitely many terms almost surely; for $\xi\le 1$, the Poisson mean is no longer finite; however, Abramovich, Sapatinas and Silverman (2000) provide conditions on $\delta$ and $\xi$ so that $f$ falls in the corresponding Besov space of the generating wavelet.

For "simplicity of exposition," Abramovich, Sapatinas and Silverman work with functions of unit period [i.e., satisfying $g(x)=g(x+1)$] and regard them as functions on the unit torus $\mathbb{T}$, the interval $[0,1]$ with the endpoints identified. We now illustrate how the LARK theory may be used to prove that the resulting expansion lies in $B^s_{pq}(\mathbb{T})$ if the generating function does. The Besov sequence norms used by Abramovich, Sapatinas and Silverman and others are natural for the Gaussian distributions and discrete wavelet expansions they study; we have found the (equivalent) function norms to be more convenient for the continuous wavelet expansions using non-Gaussian ($\alpha$-Stable, e.g.) distributions for the coefficients in our expansions.
We follow Nikol'skiĭ [(1975), Sections 1.1.1 and 4.3.5] in defining Besov norms on the torus by replacing the $L_p$ norm on $\mathbb{R}$ with that over $\mathbb{T}$ in the definition of the Besov semi-norm and norm [see (31), (33)], and in denoting the corresponding spaces by $L^*_p(\mathbb{T})$ and $B^{s*}_{pq}(\mathbb{T})$, respectively. To simplify the proof, we will use the following lemma.

Lemma 1. Let $\pi_z(dz)$ denote the standard normal distribution on $\mathbb{R}$, let $g\in L^*_p(\mathbb{T})$ with $p\ge 1$ and let $r\in\{1,2\}$. Then
$$\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|zg(u)^r|\lambda^{-a}\big)\lambda^{-b}\,\pi_z(dz)\,d\lambda\,du<\infty$$
for any $a\in\mathbb{R}$ if $b>1$, and for all $a>1-b$ if $b\le 1$.

The proof is given in Appendix A.1.

Theorem 4. Let $g\in B^{s*}_{pq}(\mathbb{T})$ for some $p,q\ge 1$ and $s>0$. Let $\mathcal{L}(d\omega)$ be a random field on $\Omega=[1,\infty)\times\mathbb{T}$ with Lévy measure
$$\nu(d\beta\,d\lambda\,d\chi)=\frac{1}{\sqrt{2\pi}}\lambda^{\delta/2-\zeta}e^{-\beta^2\lambda^\delta/2}\,d\beta\,d\lambda\,d\chi\tag{38}$$
on $\mathbb{R}\times\Omega$ with $\delta,\zeta\ge 0$. Then the LARK model $f(x)=\int_\Omega\lambda^{1/2}g(\lambda(x-\chi))\,\mathcal{L}(d\omega)$ has an absolutely convergent expansion
$$f(x)=\sum_j\beta_j\lambda_j^{1/2}g(\lambda_j(x-\chi_j)),\qquad 0\le x<1,\tag{39}$$
provided that $(\delta-1)/2>1-\zeta$ for $0\le\zeta\le 1$, or for any $\delta\ge 0$ if $\zeta>1$. Also $f(\cdot)\in B^{s*}_{pq}(\mathbb{T})$ almost surely for $(\delta-1)/2>s+1-\zeta$ if $0\le\zeta\le 1$, or for any $\delta\ge 0$ if $\zeta>1$.

Proof. The absolute convergence of (39) for each $x$ will follow from Proposition 1 if we can verify the condition of (18), that is, finiteness of the integral
$$\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|\beta\lambda^{1/2}g(\lambda(x-\chi))|\big)\,\nu(d\beta\,d\lambda\,d\chi).\tag{40}$$
Applying the change of variables $z=\lambda^{\delta/2}\beta$,
$$=\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|z|\lambda^{(1-\delta)/2}|g(\lambda(x-\chi))|\big)\lambda^{-\zeta}\,\pi_z(dz)\,d\lambda\,d\chi,\tag{41}$$
where $\pi_z(dz)$ is the standard normal distribution. Since the term in parentheses is bounded by one, (41) is finite for all $\delta$ and $g$ if $\zeta>1$.
For $0\le\zeta\le 1$, change variables to $u=\lambda(x-\chi)$ and apply periodicity:
$$=\int_{\mathbb{R}}\int_1^\infty\int_{\lambda(x-1)}^{\lambda x}\big(1\wedge|zg(u)|\lambda^{(1-\delta)/2}\big)\,du\,\lambda^{-1-\zeta}\,d\lambda\,\pi_z(dz)$$
which, due to periodicity, satisfies the bounds
$$\le\int_{\mathbb{R}}\int_1^\infty\int_0^1\big(1\wedge|zg(u)|\lambda^{(1-\delta)/2}\big)\,du\,\lceil\lambda\rceil\lambda^{-1-\zeta}\,d\lambda\,\pi_z(dz)\le 2\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|zg(u)|\lambda^{(1-\delta)/2}\big)\lambda^{-\zeta}\,du\,d\lambda\,\pi_z(dz),$$
where $\lceil\lambda\rceil$ denotes the least integer $\ge\lambda$. By Lemma 1 this is finite for $0\le\zeta\le 1$ and $(\delta-1)/2>1-\zeta$ with $g\in B^{s*}_{pq}$, so (18) holds and Proposition 1 ensures convergence.

The $L^*_p$ norms of the $m$th forward differences of a periodic function $g(\cdot)\in B^{s*}_{pq}(\mathbb{T})$ and their scaled translates $g(\lambda(\cdot-\chi))$ for $\chi\in\mathbb{T}$ and positive scale $\lambda\in[1,\infty)$ are related by
$$\|\Delta^m_h g(\lambda(\cdot-\chi))\|^*_p\le 2^{1/p}\|\Delta^m_{\lambda h}g\|^*_p\tag{42}$$
since, by a change of variables $u=\lambda(x-\chi)$,
$$\|\Delta^m_h g(\lambda(\cdot-\chi))\|^*_p=\lambda^{-1/p}\Bigg\{\int_{-\lambda\chi}^{\lambda(1-\chi)}\Bigg|\sum_{k=0}^m\binom{m}{k}(-1)^{m-k}g(u+k\lambda h)\Bigg|^p du\Bigg\}^{1/p},$$
which, again from periodicity, satisfies
$$\le\Big(\frac{\lceil\lambda\rceil}{\lambda}\Big)^{1/p}\Bigg\{\int_0^1\Bigg|\sum_{k=0}^m\binom{m}{k}(-1)^{m-k}g(u+k\lambda h)\Bigg|^p du\Bigg\}^{1/p}=\Big(\frac{\lceil\lambda\rceil}{\lambda}\Big)^{1/p}\|\Delta^m_{\lambda h}g\|^*_p,$$
while $\lceil\lambda\rceil/\lambda\le 2$ for $\lambda\ge 1$. The Besov semi-norm of $f$ is bounded by
$$|f|^{s*}_{pq}\le\sum_j|\beta_j|\lambda_j^{1/2}|g(\lambda_j(\cdot-\chi_j))|^{s*}_{pq}=\sum_j|\beta_j|\lambda_j^{1/2}\Big(\int_{|h|\le 1}|h|^{-1-sq}\|\Delta^m_h g(\lambda_j(\cdot-\chi_j))\|^{*q}_p\,dh\Big)^{1/q}$$
$$\le\sum_j|\beta_j|\lambda_j^{1/2}\Big(\frac{\lceil\lambda_j\rceil}{\lambda_j}\Big)^{1/p}\Big(\int_{|h|\le 1}|h|^{-1-sq}\|\Delta^m_{\lambda_j h}g\|^{*q}_p\,dh\Big)^{1/q}=\sum_j|\beta_j|\lambda_j^{s+1/2}\Big(\frac{\lceil\lambda_j\rceil}{\lambda_j}\Big)^{1/p}\Big(\int_{|t|\le\lambda_j}|t|^{-1-sq}\|\Delta^m_t g\|^{*q}_p\,dt\Big)^{1/q}.\tag{43}$$
The integral in (43) is bounded by
$$\int_{\mathbb{R}}|t|^{-1-sq}\|\Delta^m_t g\|^{*q}_p\,dt=\int_{|t|\le 1}|t|^{-1-sq}\|\Delta^m_t g\|^{*q}_p\,dt+\int_{|t|>1}|t|^{-1-sq}\|\Delta^m_t g\|^{*q}_p\,dt.$$
The first term is just $(|g|^{s*}_{pq})^q$, and (32) implies $\|\Delta^m_t g\|^*_p\le 2^m\|g\|^*_p$, so
$$\le(|g|^{s*}_{pq})^q+\int_{|t|>1}|t|^{-1-sq}(2^m\|g\|^*_p)^q\,dt=(|g|^{s*}_{pq})^q+\frac{2^{mq+1}}{sq}\|g\|^{*q}_p\le(c\|g\|^{s*}_{pq})^q$$
for some $c<\infty$, so
$$|f|^{s*}_{pq}\le c\|g\|^{s*}_{pq}\sum_j|\beta_j|\lambda_j^{s+1/2}\tag{44}$$
is almost surely finite if and only if
$$\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|\beta|\lambda^{s+1/2}\big)\,\nu(d\beta\,d\lambda\,d\chi)$$
is finite. Applying the change of variables $z=\lambda^{\delta/2}\beta$,
$$=\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|z|\lambda^{s+(1-\delta)/2}\big)\lambda^{-\zeta}\,\pi_z(dz)\,d\lambda\,d\chi$$
is finite by Lemma 1 for all $\delta\ge 0$ if $\zeta>1$, and for $(\delta-1)/2>s+1-\zeta$ if $0\le\zeta\le 1$. Similarly the $L^*_p$ norm of $f$ satisfies a bound of the form
$$\|f\|^*_p\le c\|g\|^*_p\sum_j|\beta_j|\lambda_j^{1/2}$$
for some $c<\infty$. This is finite almost surely if
$$\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|\beta|\lambda^{1/2}\big)\,\nu(d\beta\,d\lambda\,d\chi)=\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|z|\lambda^{(1-\delta)/2}\big)\lambda^{-\zeta}\,\pi_z(dz)\,d\lambda\,d\chi$$
is finite, which follows from Lemma 1 for all $\delta\ge 0$ if $\zeta>1$ and, for $\zeta\le 1$, for $\delta$ satisfying $(\delta-1)/2>1-\zeta$, since $g\in B^{s*}_{pq}\subset L^*_p$. Combining conditions, the $B^{s*}_{pq}$ norm of $f$ is finite if $\delta/2-1/2>s+1-\zeta$ for $0\le\zeta\le 1$, or for any $\delta\ge 0$ if $\zeta>1$. □

For Lévy measures $\nu(d\beta\,d\lambda\,d\chi)$ supported on $\mathbb{R}\times\mathbb{N}\times\mathbb{T}$ (i.e., for which $\lambda$ is almost-surely integral) the function $f(x)$ of (39) would inherit periodicity from the generator $g(\lambda_j(x-\chi_j))$ but, for the absolutely-continuous measure of (38), it is the definition of $f(x)$ as a function on $\mathbb{T}$ [as in Abramovich, Sapatinas and Silverman (2000), equation (2)] that induces periodicity. The model restricts scales to $\lambda\ge\lambda_0>0$ (here $\lambda_0=1$).

4.3. Compensation. For Lévy measures satisfying only the local-$L_2$ bound of (9) and not the local-$L_1$ bound of (7), we must use the definition of $f(x)$ in (14) and use (16) to establish conditions that ensure $f$ will be well defined for $g\in B^s_{pq}$. We verify these conditions for the existence of LARK models under symmetric $\alpha$-Stable random fields.

Theorem 5.
For a Symmetric $\alpha$-Stable random field with Lévy measure of the form $\nu(d\beta\,d\omega)=c_\alpha\alpha|\beta|^{-1-\alpha}\,d\beta\,\pi(d\Lambda)\,d\chi$ on $\mathbb{R}\times\mathcal{S}^+_d\times\mathbb{R}^d$ for $1\le\alpha<2$, with $\pi(d\Lambda)$ a probability measure on $\mathcal{S}^+_d$ and $g\in B^s_{pq}(\mathbb{R}^d)\cap L_1(\mathbb{R}^d)$ for $p,q\ge 1$ and $s>0$, the conditions of (16) for $f(x)$ to be well defined by Theorem 1 are satisfied for $1<\alpha\le p$, $\alpha<2$, if $\mathsf{E}[|\Lambda|^{-1}]<\infty$. For $\alpha=1$, there is the additional requirement that
$$\int_{\mathbb{R}^d}\big|g(u)\log|g(u)|\big|\,du<\infty.\tag{45}$$

Proof. Fix $x\in\mathcal{X}$. By the affine change of variables $u\equiv\Lambda(x-\chi)$,
$$\iint_{[-1,1]^c\times\Omega}(1\wedge|\beta\phi(x,\omega)|)\,\nu(d\beta\,d\omega)=2c_\alpha\alpha\int_{\mathcal{S}^+_d}|\Lambda|^{-1}\,\pi(d\Lambda)\iint_{[1,\infty)\times\mathbb{R}^d}(1\wedge\beta|g(u)|)\beta^{-1-\alpha}\,d\beta\,du$$
$$=2c_\alpha\alpha\,\mathsf{E}[|\Lambda|^{-1}]\int_{\mathbb{R}^d}\Big\{\int_1^{|g(u)|^{-1}}\beta^{-\alpha}|g(u)|\,d\beta+\int_{|g(u)|^{-1}}^\infty\beta^{-1-\alpha}\,d\beta\Big\}\,du.$$
For $1<\alpha<2$ this is
$$\le 2c_\alpha\alpha\,\mathsf{E}[|\Lambda|^{-1}]\Big\{\int_{\mathbb{R}^d}\frac{|g(u)|-|g(u)|^\alpha}{\alpha-1}\,du+\int_{\mathbb{R}^d}\frac{|g(u)|^\alpha}{\alpha}\,du\Big\},$$
which is finite for $1<\alpha\le p$ since $g\in L_1$ and $g\in B^s_{pq}\subset L_p$. For $\alpha=1$,
$$=2c_1\,\mathsf{E}[|\Lambda|^{-1}]\Big\{\int_{\mathbb{R}^d}-|g(u)|\log|g(u)|\,du+\int_{\mathbb{R}^d}|g(u)|\,du\Big\}.$$
The first integral exists and is finite by (45) while the second is finite since $g\in L_1$. Similarly, the integral in (16b) is
$$\iint_{[-1,1]\times\Omega}\big(|\beta\phi(x,\omega)|^2\wedge|\beta\phi(x,\omega)|\big)\,\nu(d\beta\,d\omega)=2c_\alpha\alpha\,\mathsf{E}[|\Lambda|^{-1}]\Big\{\iint_{[0,1]\times\mathbb{R}^d}\big(|\beta g(u)|^2\wedge|\beta g(u)|\big)\beta^{-1-\alpha}\,d\beta\,du\Big\}.$$
The integral in braces,
$$\iint_{[0,1\wedge|g(u)|^{-1}]\times\mathbb{R}^d}\beta^{1-\alpha}g(u)^2\,d\beta\,du+\iint_{[1\wedge|g(u)|^{-1},1]\times\mathbb{R}^d}\beta^{-\alpha}|g(u)|\,d\beta\,du,$$
is finite for $1<\alpha\le p$, $\alpha<2$:
$$\le\int_{\mathbb{R}^d}\frac{|g(u)|^\alpha}{2-\alpha}+\frac{|g(u)|^\alpha-|g(u)|}{\alpha-1}\,du\le\frac{\|g\|_1+\|g\|^p_p}{(2-\alpha)(\alpha-1)}<\infty,$$
while for $\alpha=1$,
$$\le\int_{\mathbb{R}^d}\big\{|g(u)|+|g(u)\log|g(u)||\big\}\,du<\infty$$
by (45). Finally, (16c) holds because
$$\iint_{\mathbb{R}\times\Omega}(1\wedge\beta^2)|\phi(x,\omega)|\,\nu(d\beta\,d\omega)=\mathsf{E}[|\Lambda|^{-1}]\,c_\alpha\alpha\iint_{\mathbb{R}\times\mathbb{R}^d}(1\wedge\beta^2)|\beta|^{-1-\alpha}|g(u)|\,d\beta\,du=\mathsf{E}[|\Lambda|^{-1}]\,\|g\|_1\,c_\alpha\alpha\int_{\mathbb{R}}(1\wedge\beta^2)|\beta|^{-1-\alpha}\,d\beta<\infty.$$
(cid:3) All of the generator functions in the examples in Section 7 satisfy theconditions of the theorem for the Cauchy random field ( α = 1), so the LARKmodels are well defined as ε → ε > 0, the approximations arein the same Besov space as g . We are able to show that this also holds forSobolev W s spaces (which are equivalent to B s ) even when compensationis required, but this remains an open question for B spq with general p and q .4.4. Convergence in W s . Theorem 6. Let { φ ( x, ω ) } be a location-scale family of the form φ ( x, ω ) ≡ g (Λ( x − χ )) for ω = ( χ, Λ) with χ ∈ R d and nonsingular d × d matrix Λ ∈ S d + for some function g ( · ) ∈ W s with s ≥ . Let ν be a L´evy measure satisfyingthe condition Z Z R × Ω | Λ | − [1 + ρ (Λ) s ](1 ∧ β ) ν ( dβ dω ) < ∞ , (46) where ρ (Λ) denotes the spectral radius (largest eigenvalue) of Λ . Recall f ( x ) ≡ Z Z R × Ω φ ( x, ω )[ β − h ( β )] N ( dβ dω ) (19) + Z Z R × Ω φ ( x, ω ) h ( β ) ˜ N ( dβ dω ) and, for ε > , define f ε ( x ) ≡ Z Z [ − ε,ε ] c × Ω φ ( x, ω )[ β − h ( β )] N ( dβ dω )+ Z Z [ − ε,ε ] c × Ω φ ( x, ω ) h ( β ) ˜ N ( dβ dω )(47) R. L. WOLPERT, M. A. CLYDE AND C. TU = X ≤ j First, consider the case of compensator functions satisfying h ( β ) = β for all | β | ≤ 1. Apply an affine change of variables to see that φ ( x, ω ) has Fourier transform (in x )ˆ φ ( ξ, ω ) = e iξ · χ | Λ | − ˆ g (Λ − ξ ) . For 0 < ε < ε < x ∈ R d , set ∆( x ) ≡ f ε ( x ) − f ε ( x ) and let A ≡ { ε < | β | ≤ ε } × Ω. Then∆( x ) = X ≤ j 0, (48) has the additional nonrandom term (cid:12)(cid:12)(cid:12)(cid:12)Z Z A e iξ · χ ˆ g (Λ − ξ ) | Λ | ( β − h ( β )) ν ( dβ dω ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ c (cid:18)Z Z A | ˆ g (Λ − ξ ) || Λ | β ν ( dβ dω ) (cid:19) leading at most to an additional constant factor of [1+ c RR R × Ω (1 ∧ β ) ν ( dβ dω )]in (49), leading as before to k f − f ε k k W s → (cid:3) Corollary 4. 
If { φ ( x, ω ) } is a location-scale family of the form con-sidered in Theorem 6 and if a L´evy measure ν is of product form ν ( dβ dω ) = ν β ( dβ ) π ω ( dω ) for some σ -finite measure ν β ( dβ ) on R and probability mea-sure π ω ( · ) on Ω that for some s ≥ satisfy Z R (1 ∧ β ) ν β ( dβ ) < ∞ , (50a) Z Ω | Λ | − ((1 + ρ (Λ)) s ) π ω ( dω ) < ∞ , (50b) then ν ( dβ dω ) also satisfies (46) and hence f ε ( · ) → f ( · ) in W s almost surelyas ε → . For example, in one dimension, (50b) is satisfied for all s > λ hasthe χ ν distribution with ν > λ ∼ Ga ( α λ , β λ )with α λ > . More generally, for any m > s > λ m ∼ Ga ( α λ , β λ ) with α λ > /m or, for m < 0, for α λ > (1 − s ) /m .Recall that the quantity ε introduced in the proof of Theorem 6 and thestatement of Corollary 4 is not a model parameter and has no bearing onthe Sobolov spaces to which the limiting function f ( · ) belongs; it is onlya tool used in proofs and implementations, to which we now turn. 5. Inference for LARK models. The LARK model introduced in Sec-tion 1 may now be summarized as E [ Y ( x ) | L , θ ] = f ( x ) ≡ Z Ω φ ( x, ω ) L ( dω ) , (51) L | θ ∼ L´evy ( ν ) ,θ ∼ π θ ( dθ )with implicit dependence of the L´evy measure ν ( dβ dω ) and conditionaldistribution for Y ( x ) on a hyperparameter vector θ . In all of our examples,we take ν to be a product measure ν ( dβ dω ) = ν β ( β ) dβ | Ω | π ω ( dω ) satisfying R. L. WOLPERT, M. A. CLYDE AND C. TU the conditions of Corollary 2, with π ω ( · ) a probability measure on Ω, | Ω | a measure of the volume of Ω, and ν β ( · ) > R satisfying R R (1 ∧ β ) ν β ( β ) dβ < ∞ [so ν satisfies (9)], for which either(a) ν also satisfies (7) or (b) ν β ( β ) is even and h ( β ) is odd in β . 
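Concretely, a single draw from a truncated prior of this form is a finite sum f(x) = Σ_j β_j g(λ_j(x − χ_j)). The sketch below uses a hypothetical Gaussian-bump generator g and illustrative distributions for J, χ_j, λ_j and β_j (not the paper's elicited choices) to show how such a draw can be evaluated on a grid:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(u):
    # hypothetical generator: a Gaussian bump (any g in the Besov class works)
    return np.exp(-0.5 * u**2)

def draw_lark_prior(n_grid=512, domain=(0.0, 10.0), mean_J=25.0):
    """Draw f(x) = sum_j beta_j g(lambda_j (x - chi_j)) from a finite
    (truncated) LARK-style prior, with illustrative distributions:
    J ~ Poisson, chi_j uniform on the domain, lambda_j ~ Gamma."""
    x = np.linspace(*domain, n_grid)
    J = rng.poisson(mean_J)                        # number of kernels
    chi = rng.uniform(*domain, size=J)             # locations
    lam = rng.gamma(shape=2.0, scale=2.0, size=J)  # inverse scales
    beta = rng.laplace(scale=1.0, size=J)          # coefficients (illustrative)
    f = np.zeros(n_grid)
    for b, l, c in zip(beta, lam, chi):
        f += b * g(l * (x - c))
    return x, f

x, f = draw_lark_prior()
print(f.shape)  # (512,)
```

The coefficient and kernel-parameter distributions here are placeholders; Sections 5.1.1–5.1.3 derive the ones actually induced by the truncated Gamma, symmetric Gamma and α-Stable Lévy measures.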
Thus, wehave the representation θ ∼ π θ ( dθ ) , (52a) J | θ ∼ Po ( ν + ε ) , ν + ε ≡ ν ε ( R × Ω) , (52b) { ( β j , ω j ) } ≤ j Motivated by the applications inSection 8, we now focus on LARK models built on approximations to Gamma,symmetric Gamma and Symmetric α -Stable (in particular, Cauchy) L´evyrandom fields, and quantify the approximation errors to facilitate the selec-tion of ε and other prior hyperparameters.5.1.1. Gamma LARK models. The Gamma random field of Section 2.5.2has ν β ( dβ ) = γβ − e − βη { β> } dβ for some constants γ > η > 0. Theparameter η in (22) controls both the Poisson rate of mass points { ( β j , ω j ) } of magnitude | β | > ε and the probability distribution of those magnitu-des { β j } . To facilitate elicitation we disentangle those two roles by trun-cating at | βη | ≥ ε (rather than | β | ≥ ε ); of course the limit as ε → J and { β j } are now given by J ∼ Po ( ν + ε ) , ν + ε = γ | Ω | E ( ε ) ,β j i . i . d . ∼ π β ( β j ) dβ j , π β ( β j ) = β j − e − β j η E ( ε ) { β j η>ε } , where the exponential integral function [Abramowitz and Stegun (1964),page 228] is denoted as E ( z ) ≡ R ∞ z t − e − t dt . With this truncation, theexpected square L norm of the loss due to truncation for any φ ∈ L (Ω , | Ω | π ω ( dω )), such as φ ( ω ) = φ ( x, ω ), is E |L [ φ ] − L ε [ φ ] | = Z Z R × Ω φ ( ω ) | β | {| βη |≤ ε } ν ( dβ dω ) ´EVY ADAPTIVE REGRESSION KERNELS = k φ k Z ε/η β ν β ( β ) dβ (53a) = γη − k φ k [1 − (1 + ε ) e − ε ] , showing the rate at which L ε [ φ ] → L [ φ ] in L as ε → 0. This is used inSection 5.2 to guide the elicitation of hyperparameters.5.1.2. Symmetric Gamma LARK models. The symmetric Gamma ran-dom field of Section 2.5.3 has L´evy measure ν β ( dβ ) = γ | β | − e −| β | η dβ forsome constants γ > η > 0. Once again truncation at | βη | > ε leads to J ∼ Po ( ν + ε ) , ν + ε = 2 γ | Ω | E ( ε ) β j i . i . d . 
∼ π β ( β j ) dβ j , π β ( β j ) = | β j | − e −| β j | η ( ε ) {| β j η | >ε } and expected squared discrepancy (used for elicitation) E |L [ φ ] − L ε [ φ ] | = 2 γη − k φ k [1 − (1 + ε ) e − ε ] . (53b)5.1.3. Symmetric α -Stable LARK models. The S α S L´evy random fieldof Section 2.5.4 has ν β ( dβ ) = ˙ γαπ Γ( α ) sin πα | β | − α − dβ for some constants˙ γ > < α < 2. To facilitate elicitation and posterior inference, wewrite ˙ γ = γη − α and (again) truncate at | β j η | > ε . This leads to J ∼ Po ( ν + ε ) , ν + ε = γ | Ω | π Γ( α ) sin πα ε − α β j i . i . d . ∼ π β ( β j ) dβ j , π β ( β j ) = αε α η α | β j | − α − {| β j η | >ε } with symmetric Pareto distributions for the coefficients { β j } . For the Cauchy( α = 1), these simplify to ν + ε = 2 γ | Ω | / ( πε ), with π β ( β j ) = ε η | β j | − {| β j η | >ε } . Although the total variation |L| is almost surely infinite, and even |L − L ε | will be infinite for α ≥ 1, still for φ ∈ L (Ω , | Ω | π ω ( dω )) the expected squareddiscrepancy is finite: E |L [ φ ] − L ε [ φ ] | = Z Z R × Ω φ ( ω ) | β | {| βη |≤ ε } ν ( dβ dω )(53c) = 2 γη − k φ k (cid:20) Γ( α + 1) π (2 − α ) sin πα ε − α (cid:21) or 2 γη − k φ k [ ε/π ] for the Cauchy case α = 1. R. L. WOLPERT, M. A. CLYDE AND C. TU Prior elicitation of hyperparameters. We now turn to the selectionof ε > 0, the vector θ ∈ Θ of (52), and the L´evy measure ν ( dβ dω ). In each ofour examples θ ≡ ( γ, η ) for rate parameters γ and η governing the frequencyand magnitude of coefficients { β j } , respectively, and the expected squaredtruncation error for L ε [ φ ] for φ ( ω ) = φ ( x, ω ) is of the form E |L [ φ ( x, · )] −L ε [ φ ( x, · )] | = γη − k φ ( x, · ) k c ( ε ) for some c ( ε ) > c ( ε ) → ε → J of terms in the stochastic expansion; (2) desired range of coefficientmagnitudes { β j } ; and (3) tolerable expected truncation error. We first selecta L´evy family (Gamma, α -Stable, etc.) 
to meet the needs of a particularproblem for symmetry or positivity, sharp or heavy tails, etc. Each of ourL´evy measures is of the product form ν ( dβ dω ) = ν β ( dβ ) π ω ( dω ) considered inTheorem 6 and Corollary 4, with location, scale, and perhaps other location-specific (and hence adaptive) attributes encoded in ω ∈ Ω in problem-specificways.Hyperparameters in the L´evy measure ν β ( dβ ) govern sparseness for LARKmodels, that is, the number J of terms in the stochastic expansion. In eachLARK model, J has a Poisson distribution with mean proportional to γ .The coefficient of variation under the Poisson distribution falls to zero asthe mean increases, overstating the prior certainty for large values of E J .To ameliorate this, we introduce an additional layer of hierarchy by placinga Gamma prior distribution on the parameter γ ∼ Ga ( a γ , b γ ), leading to theoverdispersed negative binomial prior distribution for J ∼ NB ( a J , p J ). Theparameter η governs the scale of the coefficients { β j } , and hence the rangeof the regression function f ( · ). We employ a Gamma distribution for thescale parameter η − ∼ Ga ( a η , b η ). Together the hyperparameters ε , a γ , b γ , a η , b η determine the prior distributions for J , for the coefficients { β j } (andhence the range of f ( · )), and for the expected mean-square truncation error.We select values for these five parameters to meet five criteria: attain twospecified quantiles (such as a central 99% interval) for each of J and { β j } ,and a specified bound on the expected truncation error E γη − k φ ( x, · ) k c ( ε ).Typically this involves an iterative numerical solution.As a default choice, we take π ( dω ) = π χ ( dχ ) π λ ( dλ ) to be the product ofthe uniform distribution for locations χ ∼ Un ( X ) and a Gamma distribu-tion for inverse (distance) scale parameters λ ∼ Ga ( a λ , b λ ). 
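The Poisson-mixed-over-Gamma construction for J described above can be checked by simulation; the hyperparameter values below are arbitrary placeholders, chosen only to exhibit the overdispersion of the resulting negative binomial relative to a pure Poisson:

```python
import numpy as np

rng = np.random.default_rng(1)
a_gam, b_gam, omega_vol = 3.0, 0.5, 10.0   # illustrative hyperparameters

# J | gamma ~ Poisson(gamma * |Omega|), gamma ~ Ga(a, b)  =>  J ~ NegBin
gamma = rng.gamma(a_gam, 1.0 / b_gam, size=200_000)
J = rng.poisson(gamma * omega_vol)

mean_J, var_J = J.mean(), J.var()
# Negative binomial: Var[J] = E[J] * (1 + |Omega|/b) > E[J]
print(mean_J, var_J)
assert var_J > mean_J          # overdispersed relative to Poisson
```

Here E[J] = (a_γ/b_γ)|Ω| = 60, while the variance is inflated by the factor (1 + |Ω|/b_γ), which is the extra prior uncertainty about J that the hierarchy is designed to provide.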
The shape andrate hyperparameters a λ and b λ govern the range of probable values for thelocation-specific inverse scale parameters { λ j } and hence for the smoothnessof f ( x ), similar to how bandwidth selection governs smoothness in other ker-nel methods. A kernel at ω j = ( χ j , λ j ) will represent a feature located at χ j of width 1 /λ j , so large values of λ j are needed to fit a very “spiky” part ofa curve, while a smoother part of a curve may be fit most parsimoniouslyusing small values of λ j . The prior distribution for λ j must support an ade- ´EVY ADAPTIVE REGRESSION KERNELS quate range of values in order to fit a spatially inhomogeneous curve. Valuesof a λ > E [ λ ] < ∞ and a finite covariance function; we choose( a λ , b λ ) to attain two specified quantiles, such as a central 99% interval.5.3. Posterior inference. The joint posterior density of all parametersunder the LARK model of (52), given observations Y = { Y i } , is p ( γ, η, J, β , ω | Y ) ∝ π γ ( γ ) π η ( η ) exp[ − ν ε ( R × Ω)] J !(54) × (cid:26) Y ≤ j 6. Relation of LARK to other models. Gaussian processes or random fields. For any positive Borel mea-sure Σ( dω ) on a complete separable metric space Ω, there exists a Gaussianrandom measure Z ( dω ) on Ω that assigns to disjoint Borel sets A i ⊂ Ω offinite measure Σ( A i ) < ∞ independent mean-zero Gaussian random vari-ables Z ( A i ) ∼ No (0 , Σ( A i )) of variance E Z ( A i ) = Σ( A i ). For any kernelfunction g on X × Ω with φ ( x, · ) ∈ L (Ω , Σ( dω )) for each x ∈ X , this inducesa mean-zero Gaussian random field through the Wiener stochastic integral f ( x ) = Z Ω φ ( x, ω ) Z ( dω )with covariance C ( x, y ) = E [ f ( x ) f ( y )] = R Ω φ ( x, ω ) φ ( y, ω )Σ( dω ). 
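The induced covariance C(x, y) = ∫_Ω φ(x, ω)φ(y, ω)Σ(dω) can be approximated by Monte Carlo over ω. A sketch with a hypothetical squared-exponential kernel and a uniform mixing measure of total mass one (both assumptions, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(x, chi, lam=2.0):
    # hypothetical kernel phi(x, omega) with omega = chi (location only)
    return np.exp(-0.5 * (lam * (x - chi))**2)

def cov(x, y, n_mc=100_000, domain=(0.0, 10.0)):
    """Monte Carlo estimate of C(x, y) = E_omega[phi(x, omega) phi(y, omega)]
    for Sigma(d omega) = uniform distribution on the domain (mass 1)."""
    chi = rng.uniform(*domain, size=n_mc)
    return np.mean(phi(x, chi) * phi(y, chi))

c_xx = cov(3.0, 3.0)
c_xy = cov(3.0, 3.5)
print(c_xx, c_xy)
assert c_xx > c_xy > 0.0       # covariance decays with distance
```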
The Gaus-sian random measure Z ( dω ) is the special case of a L´evy random mea-sure L ( dω ) defined earlier in (8) with δ ( dω ) ≡ ν ( dβ dω ) ≡ C ( x, y ) = c ( x − y ) may be writtenin the above form if the spectral measure has a density function ˆ c ( ω ) = R X e − iω · x c ( x ) dx whose square root is Lebesgue integrable, for example, theMat´ern class [Stein (1999), page 31] in R d with smoothness parameter ν > d/ 2. The Gaussian random field model above may also be obtained R. L. WOLPERT, M. A. CLYDE AND C. TU as the limit as α → α -Stable LARK models consideredherein, providing an alternative method for inference that avoids the need forlarge matrix inversions. To maintain a unified computational approach, wehave limited our attention in this article to LARK models with pure-jumpL´evy random measures, that is, Σ( · ) ≡ Compound Poissons and mixtures of Gaussian random fields. Mix-tures of Gaussian random fields may be constructed as LARK models withL´evy measure of the form ν ( dβ dω ) = (2 πσ ω ) − / e − β / σ ω dβν ω ( dω )(55)leading to mean functions of the form f ( x i ) = P ≤ j 0, if one scales the expected number E J of terms(as a function of τ ) properly, this model converges to a LARK model withinfinite L´evy measure (and so is not sensitive to the cut-off ε , which merelyquantifies how close is this approximation). 
If E[J] is not scaled properly to converge to a LARK model, the limiting results may depend critically on arbitrary and unintentional choices.

This may be implemented explicitly in LARK form by placing independent Ga(α/2, ε/2) prior distributions on σ_ω^{−2} in (55) to achieve independent univariate Student t_α(0, ε) distributions for the coefficients {β_j} and (approximately, as the parameter ε → 0) the heavy-tailed Symmetric α-Stable process for f(x) of Sections 2.5.4 and 5.1.3 [this also illustrates that truncating the support of β_ω is not the only way to construct suitable approximating sequences of finite Lévy measures ν_ε(dβ dω) ⇒ ν(dβ dω) for which the integrals in (27) converge]. An important feature of our infinitely divisible construction (in contrast to a compound Poisson approach from other distributional families) is that in each case, as ε → 0, the approximations converge to a well-defined limiting model.

6.3. Finite dimensional frames. LARK may be viewed as a limit of Bayesian variable selection methods with finite frames or dictionaries. Wolfe, Godsill and Ng (2004) consider frames based on discretizing Ω as a fine grid with |G| elements. They place i.i.d. prior distributions π_G(β) dβ on the nonzero coefficients and i.i.d. Bernoulli kernel inclusion indicators with inclusion probability ρ_G. If |G| ρ_G π_G(β) → ν(β) as |G| → ∞, then the result converges to a LARK model on the infinite-dimensional frame. The representation in Wolfe, Godsill and Ng (2004) uses a point mass at zero to provide sparsity. Similarly, one may view the prior distributions in LARK under the ε-truncation approach as assigning zero mass to a neighborhood around zero, also leading to sparse representations.
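The grid limit just described, i.i.d. Bernoulli inclusions with |G|ρ_G held fixed as the grid refines, can be illustrated numerically: the Binomial(|G|, ρ_G) count of active kernels approaches a Poisson distribution. The constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
expected_terms = 8.0            # |G| * rho_G, held fixed as the grid refines

for G in (50, 500, 5000):       # ever-finer grids
    rho = expected_terms / G    # per-cell inclusion probability
    counts = rng.binomial(G, rho, size=100_000)
    # Binomial(G, rho) -> Poisson(G * rho) as G grows with G * rho fixed
    print(G, counts.mean(), counts.var())

# For a Poisson limit, mean and variance agree; the Binomial variance is
# G*rho*(1-rho) = mean*(1-rho), which approaches the mean as rho -> 0.
```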
One benefit of LARK is its provision of a formal method for coherent prior specification for continuous dictionaries; a second is its provision of a proper prior specification in the limit as ε → 0, ensuring insensitivity to the choice of ε.

Standard stochastic search algorithms using finite-dimensional frames may exhibit poor mixing when the correlations between grid elements tend to ±1. Suppose, for example, that two grid points ω_1 and ω_2 are close in parameter space, leading to two highly correlated columns in the design matrix. In addition, assume that inclusion of either column leads to nearly-maximal likelihood. With the standard one-at-a-time deletion or addition moves in many stochastic search algorithms, to move from a model including a kernel indexed by ω_1 to one indexed by ω_2 would require an extremely unlikely deletion followed by an addition (or unlikely addition followed by a deletion). LARK avoids this difficulty by allowing the continuous parameter ω indexing dictionary elements to move incrementally from ω_1 to ω_2 by a series of update steps, avoiding some of the poor mixing problems associated with highly correlated frame elements in a fine-grid based method.

6.4. Dirichlet processes. The Dirichlet process [Ferguson (1973, 1974), Antoniak (1974)] has received widespread use as a prior distribution on probability distribution functions. Its popularity is due in large part to its analytic tractability in many problems; simulation is straightforward, and Bayesian MCMC inference methods are available [Escobar (1994), MacEachern (1994), Escobar and West (1995), MacEachern (1998), Müller and Quintana (2004)].
Liang, Mukherjee and West (2007) consider nonlinear regres-sion and classification models E [ Y i | X i ] = f ( X i ) for data { ( Y i , X i ) } usingkernel expansions of the form f ( x ) = Z k ( x, u ) γ ( du ) = Z k ( x, u ) w ( u ) F ( du )(56)with random signed measure γ ( du ) expressed as the integral of a weight func-tion w ( u ) with respect to a probability distribution F , modeled as a Dirichletprocess F ∼ DP ( F , α ) with base measure F and scale α > 0. If observedpoints { X i } are viewed as a random sample from F , then updating theposterior for F solely on the basis of the observed { X i } would lead in thelimit as α → F concentrated at the empiricaldistribution for X , justifying the finite-dimensional expansion f ( x ) = n X i =1 k ( x, x i ) w ( x i )with kernels evaluated only at the observed data locations. The generalized g -prior of West (2003) for the coefficients { w i = w ( x i ) } leads to dependentCauchy distributions for the { f ( x i ) } . This approach (like the SVM, RVMand related approaches) has as many coefficients as there are data points,but avoids over-fitting through shrinkage. Asymptotic properties of f ( x ) as n → ∞ are difficult to study in the absence of a limiting structure such asthat provided by LARK.The Dirichlet measure F ( du ) does not assign independent random vari-ables to disjoint sets and so (56) is not a LARK model, but it can be con-structed from one. 
In fact it is exactly the normalized LARK model f ( x ) = Z Ω k ( x, u ) w ( u ) L ( du )/ L (Ω)(57) = X j k ( x, u j ) w j β j / β + ´EVY ADAPTIVE REGRESSION KERNELS with F ( du ) = L ( du ) / L (Ω) for a Gamma random field L ( du ) with infiniteL´evy measure ν ( dβ du ) = αβ − e − β { β> } dβF ( du ) , where β + := P β j [note that w ( u ) could be absorbed into k ( x, u )].Well-known disadvantages of Dirichlet process models include their inflex-ibility (the single parameter α determines the prior dispersion everywhere ,precluding prior specifications with more uncertainty in some regions thanin others), their discreteness, and the limited variability of the masses as-signed to the countably-many support points. The normalized Gamma rep-resentation (57) of DP’s offers the opportunity to overcome some of thesedisadvantages—for example, the Gamma process may be given a variablerate parameter b ( u ) by taking ν ( dβ du ) = β − e − b ( u ) β { β> } dβF ( du )leading to a precision that can vary with location u ∈ Ω, or the Gammarandom field may be replaced with another nonnegative L´evy random fieldwith wider dispersion, such as the fully-skewed Stable process of index α < 7. Simulation study. We now turn our attention to simulated and realexamples to illustrate the performance of LARK models in practice. We con-ducted a simulation study using four spatially varying functions introducedby Donoho and Johnstone (1994) that are now standard in the wavelet liter-ature: Blocks, Bumps, Doppler and Heavysine. Data were generated for eachtest function by adding independent Gaussian random noise No (0 , σ ) to thetrue target function f ( · ) at n = 1024 equally-spaced points on X = [0 , σ was chosen toattain a root signal-to-noise ratio (RSNR) of qR X ( f ( x ) − ¯ f ) dx/σ = 7 . f ≡ |X | R X f ( x ) dx . Each target function f ( · ) has a range of approx-imately 0 ≤ f ( x ) ≤ 25. 
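As a concrete illustration of this simulation setup, the sketch below generates noisy data from the standard (unrescaled) Donoho–Johnstone Doppler function on [0, 1] and calibrates σ to a root signal-to-noise ratio of 7; the rescaling of the test functions to roughly [0, 25] used in the study is omitted here:

```python
import numpy as np

rng = np.random.default_rng(4)

def doppler(x):
    # standard Donoho-Johnstone Doppler test function on [0, 1]
    return np.sqrt(x * (1.0 - x)) * np.sin(2.0 * np.pi * 1.05 / (x + 0.05))

n = 1024
x = (np.arange(n) + 0.5) / n            # equally spaced design points
f = doppler(x)

rsnr = 7.0
sigma = f.std() / rsnr                  # calibrate noise sd to the RSNR
y = f + rng.normal(0.0, sigma, size=n)  # observed noisy data

print(sigma)
```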
For each function, we generated 100 replicate datasets to evaluate the performance of LARK and other methods on the basisof mean squared errorMSE ≡ n − n X i =1 ( [ f ( x i ) − f ( x i )) . (58)7.1. Hyperparameters. In Table 1, we report the kernel functions usedfor the four simulation examples, chosen to illustrate the flexibility of LARKto use a wide range of kernels that may be adapted to anticipated features(smoothness, spikiness, jumps, curvature, covariation, etc.) of applications. R. L. WOLPERT, M. A. CLYDE AND C. TU Table 1 Kernel functions used for four test functions Test function Kernel φ ( x i ; χ j , λ j ) Blocks { <λ j ( x i − χ j ) ≤ } Bumps e − λ j | x i − χ j | Doppler e − . λ j ( x i − χ j ) Heavysine e − . λ j ( x i − χ j ) {| x i − χ j | < . } Table 2 Hyperparameters used in examples of Section 7.1 L´evy measure ε a γ b γ a η b η a λ b λ Symmetric Gamma 0.0041 2.53 6.45 13.01 0.71 1.117 0.1965Cauchy 0.0029 2.53 14.2 0.50 1.00 1.117 0.1965 In each case, we take Ω = [0 , × R + (and | Ω | = 10), with elements denoted ω = ( χ, λ ), comprising a location parameter χ ∈ X = [0 , 10] and a shape pa-rameter λ > 0. As described in Section 5.2, we take { χ j } i . i . d . ∼ Un (Ω) and { λ j } i . i . d . ∼ Ga ( a λ , b λ ) with a λ , b λ chosen (see Table 2) to achieve a 95% priorinterval of [0 . , . 0] for λ to attain dilated kernels covering from half a per-cent up to fifty percent of X .Our choice of the remaining hyperparameters was guided by three objec-tives: to achieve a 95% prior predictive interval of [5 , J , to achievea 95% prior predictive interval of [ − , 25] for the { β j } , and to achievea limit on the mean squared truncation error of kL [ φ ] − L ε [ φ ] k = ( E |L [ φ ] −L ε [ φ ] | ) / ≤ . · k φ k (see Section 5.2). 
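In sketch form, the MSE criterion (58) and its average over replicates (the AMSE reported later) can be written as below; the placeholder fits here stand in for actual LARK estimates:

```python
import numpy as np

rng = np.random.default_rng(5)

def mse(f_hat, f_true):
    """Mean squared error (58): n^{-1} sum_i (f_hat(x_i) - f(x_i))^2."""
    return np.mean((f_hat - f_true) ** 2)

# AMSE: average the MSE over replicate fits (placeholder f_hat draws)
f_true = np.sin(np.linspace(0.0, 2.0 * np.pi, 1024))
mses = [mse(f_true + rng.normal(0.0, 0.1, f_true.shape), f_true)
        for _ in range(100)]
amse = float(np.mean(mses))
print(amse)
```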
While these objectives could bemet for the LARK model with symmetric Gamma prior with the valuesgiven in Table 2, they are not quite attainable for the Cauchy model—thecompeting goals of an extremely wide distribution for the { β j } and a lowmean squared truncation error cannot be reconciled. Upon relaxing the priorpredictive distribution requirement on { β j } to a 99.9% interval of [ − , { β i } , theremaining objectives for the distribution of J and the mean square trun-cation error were attained using the values given in Table 2. See Figure 5,Appendix C for realizations from the prior distribution.7.1.1. Performance. We compared LARK with two of the best wavelet me-thods currently available for inhomogeneous function estimation using over-complete representations: the empirical Bayes approach (“EBayesThresh”) ´EVY ADAPTIVE REGRESSION KERNELS Table 3 Average and (standard errors) over replications of mean square errors of the fourtest functions using the L´evy Adaptive Regression Kernels (LARK) using the symmetricGamma and Cauchy priors, the OCW approach using a Laplace prior [Chu, Clyde andLiang (2009)], and the EBayesThresh approach using a Laplace prior [Johnstone andSilverman (2005a)] Method Blocks Bumps HeavySine Doppler LARK-Gamma 0.030 (0.0013) 0.111 (0.0019) 0.038 (0.0010) 0.152 (0.0030)LARK-Cauchy 0.026 (0.0011) 0.105 (0.0017) 0.036 (0.0010) 0.157 (0.0028)OCW 0.060 (0.0023) 0.285 (0.0025) 0.082 (0.0010) 0.152 (0.0019)EBayesThresh 0.096 (0.0013) 0.307 (0.0032) 0.118 (0.00098) 0.202 (0.0027) of Johnstone and Silverman (2004, 2005a, 2005b) using translational-invariantwavelets, and the continuous over-complete wavelet (“OCW”) approach ofChu, Clyde and Liang (2009) based on the stochastic wavelet expansions ofAbramovich, Sapatinas and Silverman (2000). 
We replicated the results of Johnstone and Silverman (2005b) under the beta-Laplace prior using their R package EBayesThresh [Johnstone and Silverman (2005a)] with Daubechies' "least asymmetric" (la8) wavelets [see Section 4 of Daubechies (1988) or Section 6.4 of Daubechies (1992)]. OCW uses the same la8 wavelet as EBayesThresh except for the Blocks example, where both LARK and OCW use the Haar wavelet. The OCW method may be viewed as a special case of LARK with a finite nonseparable Lévy measure, where coefficients β_j have independent Laplace distributions conditional on scale parameters λ_j, which in turn have truncated Pareto distributions. As in LARK, OCW assigns independent uniform locations, with a negative binomial distribution for the number of terms in the expansion.

The performance of each method was measured by its average mean square error (AMSE), defined as the average value of the MSE given in (58) over the 100 replicated simulations. Overall, the performance of the LARK model is excellent (Table 3). Both LARK versions generated lower AMSE values than did EBayesThresh for all four test functions. LARK also has smaller AMSE than OCW, except for Doppler, where the methods are comparable. For Blocks, both LARK and OCW use the Haar wavelet, so any difference in results is due to the prior distribution on the function; LARK leads to a 50% reduction in AMSE compared to OCW. For the other examples, both OCW and EBayesThresh use a Laplace prior distribution for each coefficient in the expansion and the same wavelet; in all cases it is clear that using a continuous dictionary is better than the finite-dimensional dictionary (frame) with the nondecimated wavelets. LARK reconstructions (right column, Figure 1) consistently show less ringing and fewer artifacts than EBayesThresh (left column).

Fig. 1.
Comparison of fitted functions using EBayesThresh beta.laplace [John-stone and Silverman (2005a)] (left column) and L´evy Adaptive Regression Kernels(LARK-Gamma) (right column) for the four test functions. From top to bottom, the testfunctions are Blocks, Bumps, Doppler and Heavysine, respectively. ´EVY ADAPTIVE REGRESSION KERNELS Fig. 2. Left: results of the LARK model for the motorcycle crash data. Circles representthe observations; solid line is the posterior mean; dotted lines are pointwise 90% Bayesiancredible interval for the mean function. Right: histogram of posterior samples of the expo-nential power parameter ρ , with prior density (solid line) for comparison. 8. Applications. Motorcycle crash data. To further illustrate the method, we explorethe motorcycle crash experiment data of Schmidt, Mattern and Sch¨uler(1981) considered by Silverman (1985), shown in Figure 2. The 133 observa-tions are unequally spaced, with repeated observations at some time points.Our focus in this example is to illustrate how a single wide class of generatingfunctions may be used in LARK, with the data (through the likelihood) influ-encing the choice of kernels present in the posterior distribution. We use thepower exponential family of kernel functions φ ( x ; χ, λ, ρ ) = exp {− λ | x − χ | ρ } ,but here (in contrast with the examples in Section 7) we treat ρ as an un-certain parameter and make inference about it from the data. We take thepower ρ to be common for all kernels, and use a relatively concentratedGamma prior distribution ρ ∼ Ga (2 . , . 75) with a 50% HPD interval of[0 . , . 56] which comfortably includes both the Laplace ( ρ = 1) and Gaus-sian ( ρ = 2) kernels as special cases.The results are summarized in Figure 2. It is apparent that the fitted meancaptures the general trend of the data very well, with minimal boundaryeffects. The model is parsimonious in the sense we only need 4 kernels onaverage to fit the data. 
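The power exponential generator family used in this example is straightforward to implement; the sketch below evaluates the special cases ρ = 1 (Laplace) and ρ = 2 (Gaussian):

```python
import numpy as np

def power_exp_kernel(x, chi, lam, rho):
    """phi(x; chi, lambda, rho) = exp(-lambda * |x - chi|**rho)."""
    return np.exp(-lam * np.abs(x - chi) ** rho)

x = np.linspace(-3.0, 3.0, 601)
laplace_like = power_exp_kernel(x, 0.0, 1.0, 1.0)   # rho = 1: Laplace kernel
gaussian_like = power_exp_kernel(x, 0.0, 1.0, 2.0)  # rho = 2: Gaussian kernel

# both peak at 1 at x = chi; rho = 1 has much heavier tails than rho = 2
print(laplace_like.max(), gaussian_like.max())
```

Treating ρ as a common uncertain parameter, as done here, lets the data choose the tail behavior of the kernels rather than fixing it in advance.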
The posterior mean for ρ is approximately 3, with most of the posterior mass well above the values (ρ = 1, 2) for the Laplace and Gaussian kernels.

8.2. Spatial temporal model. In this section, we explore the performance of the LARK approach for modeling hourly SO₂ concentration levels (measured in ppm) in Pennsylvania, New Jersey, Delaware and Maryland [U.S. EPA (2007)]. The locations of the 33 monitoring stations are shown in Figure 3; the study region S, delineated by a rectangle in the figure, covers a 310 km × 310 km area.

Fig. 3. (a) Thirty-three monitors used by EPA to measure hourly SO₂ concentration in year 2002. The inverted triangle denotes Site 31. The study area is delineated by a rectangle that includes parts of Pennsylvania, Maryland, New Jersey and Delaware, and is blown up in (b), which illustrates locations of kernels from a draw from the posterior distribution. The blue circles represent aperiodic points and the red circles represent daily periodic point sources. Circle areas are proportional to the magnitudes of the point sources they represent.

We used rescaled coordinates from a Lambert (conformal conic) projection to reduce the distortion caused by the earth's curvature. For demonstration purposes, we restrict analysis to measurements taken during a 144 hour period T from September of 2002. About 5% of SO₂ readings are missing (at random) from the data set, which is not a problem for the LARK model. While Gaussian random field models are popular for modeling spatial-temporal data, the log transformation typically used in the Gaussian approach (because the mean function is strictly positive) eliminates many of the (important) spiky features of the data.
Our Gamma random field prior distribution allows us to model the data in the original units.

The model can be written in the same simple form as (52), but now the SO₂ concentration Y(x) is indexed by points x ∈ X = S × T in space–time and the Lévy random measure L(dω) assigns Gamma-distributed random variables to Borel sets of a space Ω of points ω = (σ, τ, Λ, λ) that include a location (σ, τ) ∈ S × T in space–time, a positive-definite 2 × 2 matrix Λ ∈ S²₊, and a temporal decay rate λ > 0. We employ a separable kernel of the form φ(x, ω) = exp{−(s − σ)′Λ(s − σ)/2 − λ|t − τ|} and, in the spirit of Higdon [(1998), Section 3.2] and Higdon, Swall and Kern [(1999), Section 2.2], we employ a novel parametrization for Λ in terms of its eigenvalues and the orientation of its major axis [see Tu (2006), Section 4.2.6, for details on prior specifications]. In variations also described in Tu [(2006), Chapter 4], accommodation is made for partial periodicity (due to diurnal patterns associated with daily variation in ambient temperature, traffic levels, etc.), still within the framework described by (51) but now with more elaborate choices for Ω and φ(x, ω).

The locations of latent point sources from one iteration of the RJ-MCMC algorithm are presented in Figure 3(b). Larger latent points appear to be clustered in the Baltimore metropolitan area and near the New Jersey/Pennsylvania border. The model's support points are more than a mere modeling device: they can help analysts identify possible underlying sources of pollution, or support future decisions on monitor locations.

The predictive power of the model is validated through out-of-sample prediction. The model was fit excluding data from Site 31 [the inverted triangle in Figure 3(a)], and then its predictions were compared with reported measurements from that site for the entire 144 hours. The result shown in Figure 4 is promising.
The major peak was captured clearly, and 90% pointwise Bayesian credible intervals cover in excess of 80% of the true observations. This was a challenging out-of-sample prediction problem due to low cross-correlations among sites. We are currently refining features of the prior distributions to incorporate known point sources.

Fig. 4. Out-of-sample predictions for Site 31. Dashed line represents observed time series, solid line represents predictive mean curve. Gray lines are posterior predictive intervals.

9. Discussion. In this article, we have developed a fully Bayesian adaptive kernel method, LARK, for nonparametric function estimation. The LARK model is based on a stochastic expansion of functions in a continuous overcomplete dictionary, and may be expressed as a stochastic integral of a kernel or other generating function with respect to a Lévy random field. When (7) is satisfied (so compensation is unnecessary), the Lévy field is a random signed measure. By using a positive random measure and positive kernel family, LARK models provide natural constructions for nonnegative functions (as in Section 8.2); with signed measures, unconstrained functions may be modeled (as in Sections 7 and 8.1). The kernel parameters are location-specific and thus adapt to local features of the data. As with wavelets, the adaptive smoothing using LARK preserves local features such as discontinuities and high peaks and is especially useful for modeling inhomogeneous functions. The LARK approach does not require that the data be equally-spaced without missing observations, nor that the sample size be a dyadic power, as is commonly required of many wavelet methods. The RJ-MCMC algorithm developed for fitting LARK provides an automatic stochastic search mechanism for finding sparse representations of a function.
The algorithm is computationally efficient [requiring only O(n · M) operations for data including n observations and an MCMC stream of length M], as dictionary elements are calculated only when needed. Kernel methods such as Support Vector Machines (SVMs) and Bayesian Relevance Vector Machines [or RVMs, Tipping (2001)] employ all data points as kernel locations, but attain sparsity by shrinking coefficients to zero. LARK provides additional flexibility by not restricting kernel locations. Many competing sparse methods, including the Dantzig Selector and Lasso, require the a priori selection of a pre-specified number of dictionary elements. Evaluating these kernels on a sufficiently fine grid will exceed the computational cost of LARK. Fine grids also lead to extreme multicollinearity in these approaches, which may lead both to numerical instability and violation of the conditions needed for sparse solutions.

9.1. Extensions. It is straightforward to implement LARK with wide classes of generating functions including wavelets, structural elements in texture analysis, and splines. Unlike support vector machines or other methods based on Mercer kernels [Pillai et al. (2007)], the LARK approach does not require symmetry, continuity or simple functional forms. While it is often convenient to use kernels based on some distance metric, arbitrary generating functions may be tailored to the problem at hand, as illustrated in the space–time example of Section 8.2. The LARK modeling approach adapts readily to problems in any number of dimensions.

In Section 4, we present conditions for LARK models to belong to the same Besov space as their generating functions, for Lévy measures and generating functions that satisfy the stringent local L-bound of (18).
In the more general case, where (18) fails and compensation is required, we are able to establish similar results only for $B^s_{pq}$ with $p = q = 2$ (equivalent to $W^s_2$). We are exploring extensions to the general case, but the additional drift term that arises in compensation complicates confirming the convergence of $f_\varepsilon$ to $f$ in $B^s_{pq}$ for general $p, q$.

Work is also on-going in establishing conditions for posterior consistency for function estimation. Extending methods of Choudhuri, Ghosal and Roy (2004), Ghosal and van der Vaart (2007) and Choi and Schervish (2007), Pillai (2008) has verified posterior consistency for certain LARK models with Gaussian measurement errors in work that will be reported elsewhere.

APPENDIX A: DETAILS OF PROOFS

Proposition 2. For a function $g(\cdot) \in L_p(\mathbb{R}^d)$ and its scaled translate $g(\Lambda(\cdot - \chi))$ with $\chi \in \mathbb{R}^d$ and positive definite matrix $\Lambda \in S^d_+$, the $L_p$ norms of $g(\Lambda(\cdot - \chi))$ and of its $m$th forward differences are given by
$$\|g(\Lambda(\cdot - \chi))\|_p = |\Lambda|^{-1/p}\,\|g\|_p, \qquad \|\Delta^m_h g(\Lambda(\cdot - \chi))\|_p = |\Lambda|^{-1/p}\,\|\Delta^m_{\Lambda h} g\|_p, \qquad (59)$$
where $|\Lambda|$ denotes the determinant of $\Lambda$.

Proof. By the change of variables $u = \Lambda(x - \chi)$,
$$\|\Delta^m_h g(\Lambda(\cdot - \chi))\|_p = \biggl\{\int |\Delta^m_h g(\Lambda(x - \chi))|^p \, dx\biggr\}^{1/p} = \biggl\{\int \biggl|\sum_{k=0}^m \binom{m}{k}(-1)^{m-k} g(\Lambda(x + kh - \chi))\biggr|^p dx\biggr\}^{1/p}$$
$$= |\Lambda|^{-1/p} \biggl\{\int \biggl|\sum_{k=0}^m \binom{m}{k}(-1)^{m-k} g(u + k\Lambda h)\biggr|^p du\biggr\}^{1/p} = |\Lambda|^{-1/p}\,\|\Delta^m_{\Lambda h} g\|_p.$$
The identity for the $L_p$ norm of $g(\Lambda(\cdot - \chi))$ itself follows by the same change of variables. $\Box$

A.1. Proof of Lemma 1. First, consider the case $b > 1$, $a \in \mathbb{R}$. Then
$$\iiint_{\mathbb{R}\times[1,\infty)\times T} (1 \wedge |z g(u)^r| \lambda^{-a})\,\lambda^{-b}\,\pi_z(dz)\,d\lambda\,du < \iiint_{\mathbb{R}\times[1,\infty)\times T} \lambda^{-b}\,\pi_z(dz)\,d\lambda\,du = \frac{1}{b-1} < \infty.$$
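Both the scaling identity of Proposition 2 and the bound just computed can be spot-checked numerically. The one-dimensional sketch below (Gaussian test function, grid quadrature; all parameter values illustrative) verifies $\|g(\lambda(\cdot-\chi))\|_p = \lambda^{-1/p}\|g\|_p$ and checks the case-$(b>1)$ inner $\lambda$-integral against its bound $1/(b-1)$.

```python
import numpy as np

# (i) Scaling identity of Proposition 2 in one dimension.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]

def lp_norm(vals, p):
    """L_p norm on R approximated by a Riemann sum on a fine grid."""
    return (np.sum(np.abs(vals) ** p) * dx) ** (1.0 / p)

g = lambda t: np.exp(-t ** 2)          # illustrative test function in L_p(R)
lam, chi, p = 3.0, 0.7, 2.0            # scale, translate, norm index
lhs = lp_norm(g(lam * (x - chi)), p)
rhs = lam ** (-1.0 / p) * lp_norm(g(x), p)
print(abs(lhs - rhs) < 1e-8)           # True: the scaling identity holds

# (ii) Case b > 1 of Lemma 1: the inner lambda-integral is bounded by
# int_1^inf lam^(-b) dlam = 1/(b - 1), uniformly in c = |z g(u)^r|.
a, b, c = 2.0, 1.5, 5.0
t = np.linspace(1.0, 200.0, 2000001)
integrand = np.minimum(1.0, c * t ** -a) * t ** -b
val = np.sum(integrand) * (t[1] - t[0])
print(val < 1.0 / (b - 1.0))           # True
```

The Riemann sums are effectively exact here because both integrands are smooth and decay rapidly, so the comparison is limited only by floating-point error.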
Next, consider the case $b < 1$ and $a > 1 - b$ (which together imply $a > 0$). Then
$$\iiint_{\mathbb{R}\times[1,\infty)\times T} (1 \wedge |zg(u)^r|\lambda^{-a})\,\lambda^{-b}\,\pi_z(dz)\,d\lambda\,du$$
$$= \iint_{[|zg(u)^r|>1]} \int_1^{|zg(u)^r|^{1/a}} \lambda^{-b}\,d\lambda\,\pi_z(dz)\,du + \iint_{\mathbb{R}\times T} |zg(u)^r| \int_{1\vee|zg(u)^r|^{1/a}}^{\infty} \lambda^{-a-b}\,d\lambda\,\pi_z(dz)\,du$$
$$= \iint_{[|zg(u)^r|>1]} \frac{\lambda^{1-b}}{1-b}\bigg|_{\lambda=1}^{\lambda=|zg(u)^r|^{1/a}} \pi_z(dz)\,du + \iint_{\mathbb{R}\times T} |zg(u)^r|\,\frac{\lambda^{1-a-b}}{1-a-b}\bigg|_{\lambda=1\vee|zg(u)^r|^{1/a}}^{\lambda=\infty} \pi_z(dz)\,du$$
$$= \iint_{[|zg(u)^r|>1]} \frac{|zg(u)^r|^{(1-b)/a} - 1}{1-b}\,\pi_z(dz)\,du + \iint_{[|zg(u)^r|>1]} \frac{|zg(u)^r|^{(1-b)/a}}{a+b-1}\,\pi_z(dz)\,du + \iint_{[|zg(u)^r|\le 1]} \frac{|zg(u)^r|}{a+b-1}\,\pi_z(dz)\,du$$
$$\le \frac{a}{(1-b)(a+b-1)} \iint_{\mathbb{R}\times T} |zg(u)^r|^{(1-b)/a}\,\pi_z(dz)\,du = \frac{a}{(1-b)(a+b-1)} \int_{\mathbb{R}} |z|^{(1-b)/a}\,\pi_z(dz) \int_T |g(u)|^{r(1-b)/a}\,du < \infty$$
[using $|zg(u)^r| \le |zg(u)^r|^{(1-b)/a}$ on the set $[|zg(u)^r| \le 1]$, since $(1-b)/a < 1$], for $a + b > 1$ when $r = 0$, and for $ap + b \ge 1$ when $r = 1$ [since $g \in L^*_p(T)$], which is implied by $a > 1 - b$.

Now consider the case $b = 1$ and $a > 0$:
$$\iiint_{\mathbb{R}\times[1,\infty)\times T} (1 \wedge |zg(u)^r|\lambda^{-a})\,\lambda^{-1}\,\pi_z(dz)\,d\lambda\,du$$
$$= \iint_{[|zg(u)^r|>1]} \int_1^{|zg(u)^r|^{1/a}} \lambda^{-1}\,d\lambda\,\pi_z(dz)\,du + \iint_{\mathbb{R}\times T} |zg(u)^r| \int_{1\vee|zg(u)^r|^{1/a}}^{\infty} \lambda^{-a-1}\,d\lambda\,\pi_z(dz)\,du$$
$$= \iint_{[|zg(u)^r|>1]} \log\lambda\,\bigg|_{\lambda=1}^{\lambda=|zg(u)^r|^{1/a}} \pi_z(dz)\,du + \iint_{\mathbb{R}\times T} |zg(u)^r|\,\frac{\lambda^{-a}}{-a}\bigg|_{\lambda=1\vee|zg(u)^r|^{1/a}}^{\lambda=\infty} \pi_z(dz)\,du$$
$$= \iint_{[|zg(u)^r|>1]} \frac{1}{a}\log|zg(u)^r|\,\pi_z(dz)\,du + \iint_{[|zg(u)^r|>1]} \frac{1}{a}\,\pi_z(dz)\,du + \iint_{[|zg(u)^r|\le 1]} \frac{|zg(u)^r|}{a}\,\pi_z(dz)\,du$$
$$\le \frac{1}{a} \iint_{\mathbb{R}\times T} \bigl[\log^+|zg(u)^r| + (1 \wedge |zg(u)^r|)\bigr]\,\pi_z(dz)\,du < \infty,$$
since $\log^+|zg^r| = (0 \vee \log|zg^r|) \le |z| + |g|^r$ and $g \in L^*_1(T)$. $\Box$

A.2. Proof of Theorem 2. For any compensator function $h(\beta)$ satisfying (10) there are constants $c_1, c_2, c_3 \in (0,\infty)$ such that
$$|h(\beta)| \le c_1, \qquad |\beta - h(\beta)| \le c_2(|\beta| \wedge \beta^2), \qquad |h(\beta)| \le c_3(1 \wedge |\beta|)$$
for all $\beta \in \mathbb{R}$.
Fix $0 < \varepsilon \le 1$ and $\phi : \mathbb{R} \times \Omega \to \mathbb{R}$ satisfying (16); let $B_a$, $B_b$ and $B_c$ be the values of the integrals from (16a)–(16c), respectively. To complete the proof of Theorem 2 it suffices to show that each of the two terms from (27),
$$X \equiv \iint_{N_\varepsilon} (\beta - h(\beta))\,\phi(\omega)\,N(d\beta\,d\omega) \qquad \text{and} \qquad Y \equiv \iint_{N_\varepsilon} h(\beta)\,\phi(\omega)\,\tilde N(d\beta\,d\omega), \qquad (60)$$
converges to zero in probability as $\varepsilon \to 0$. Write the first integral in (60) as the sum of two parts, $X = X_1 + X_2$, with
$$X_1 \equiv \iint_{N_\varepsilon \cap [|\beta\phi|\le 1]} (\beta - h(\beta))\,\phi(\omega)\,N(d\beta\,d\omega), \qquad X_2 \equiv \iint_{N_\varepsilon \cap [|\beta\phi|>1]} (\beta - h(\beta))\,\phi(\omega)\,N(d\beta\,d\omega).$$
Then
$$\mathbb{E}|X_1| \le c_2 \iint_{N_\varepsilon \cap [|\beta\phi|\le 1]} (|\beta| \wedge \beta^2)\,|\phi(\omega)|\,\nu(d\beta\,d\omega)$$
$$= c_2 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|\le 1]} (1 \wedge \beta^2)\,|\phi(\omega)|\,\nu(d\beta\,d\omega) + c_2 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} (1 \wedge |\beta\phi(\omega)|)\,\nu(d\beta\,d\omega)$$
$$\le c_2(B_c + B_a) < \infty,$$
so $X_1 \to 0$ in $L_1$ as $\varepsilon \to 0$, since $\mathbf{1}_{N_\varepsilon}(\beta,\omega)$ tends to zero a.e. $(\nu)$ as $\varepsilon \to 0$. Now consider $X_2$:
$$\nu(\{(\beta,\omega) : |\beta\phi(\omega)| > 1\}) = \iint_{[|\beta|\le 1] \cap [|\beta\phi|>1]} \nu(d\beta\,d\omega) + \iint_{[|\beta|>1] \cap [|\beta\phi|>1]} \nu(d\beta\,d\omega)$$
$$\le \iint_{[|\beta|\le 1] \cap [|\beta\phi|>1]} (|\beta\phi(\omega)| \wedge |\beta\phi(\omega)|^2)\,\nu(d\beta\,d\omega) + \iint_{[|\beta|>1] \cap [|\beta\phi|>1]} (1 \wedge |\beta\phi(\omega)|)\,\nu(d\beta\,d\omega)$$
$$\le B_b + B_a < \infty,$$
so almost surely the random support of $N(d\beta\,d\omega)$ in $[|\beta\phi| > 1]$ is a finite set disjoint from $\bigcap_{\varepsilon>0} N_\varepsilon$; it follows that $N(N_\varepsilon \cap [|\beta\phi(\omega)| > 1]) \to 0$ and hence $X_2 \to 0$ as $\varepsilon \to 0$. Similarly, write $Y = Y_1 + Y_2 + Y_3 + Y_4$ with
$$Y_1 \equiv \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|\le 1]} h(\beta)\,\phi(\omega)\,\tilde N(d\beta\,d\omega), \qquad Y_2 \equiv \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,\tilde N(d\beta\,d\omega),$$
$$Y_3 \equiv \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} h(\beta)\,\phi(\omega)\,\tilde N(d\beta\,d\omega), \qquad Y_4 \equiv \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,\tilde N(d\beta\,d\omega).$$
Now
$$\mathbb{E}|Y_1|^2 = \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|\le 1]} h(\beta)^2\,\phi(\omega)^2\,\nu(d\beta\,d\omega) \le c_3^2 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|\le 1]} |\beta\phi(\omega)|^2\,\nu(d\beta\,d\omega)$$
$$= c_3^2 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|\le 1]} (|\beta\phi(\omega)| \wedge |\beta\phi(\omega)|^2)\,\nu(d\beta\,d\omega) \le c_3^2 B_b < \infty,$$
so $Y_1 \to 0$ in $L_2$ (and hence also in $L_1$) as $\varepsilon \to 0$. For $Y_2$,
$$Y_2 = \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,N(d\beta\,d\omega) - \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,\nu(d\beta\,d\omega),$$
$$\mathbb{E}|Y_2| \le 2 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} |h(\beta)|\,|\phi(\omega)|\,\nu(d\beta\,d\omega) \le 2c_3 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} |\beta\phi(\omega)|\,\nu(d\beta\,d\omega)$$
$$= 2c_3 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} (|\beta\phi(\omega)| \wedge |\beta\phi(\omega)|^2)\,\nu(d\beta\,d\omega) \le 2c_3 B_b < \infty,$$
so $Y_2 \to 0$ in $L_1$ as $\varepsilon \to 0$. Similarly,
$$Y_3 = \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} h(\beta)\,\phi(\omega)\,N(d\beta\,d\omega) - \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} h(\beta)\,\phi(\omega)\,\nu(d\beta\,d\omega),$$
$$\mathbb{E}|Y_3| \le 2 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} |h(\beta)|\,|\phi(\omega)|\,\nu(d\beta\,d\omega) \le 2c_3 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} |\beta\phi(\omega)|\,\nu(d\beta\,d\omega)$$
$$= 2c_3 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} (1 \wedge |\beta\phi(\omega)|)\,\nu(d\beta\,d\omega) \le 2c_3 B_a < \infty,$$
so $Y_3 \to 0$ in $L_1$ as $\varepsilon \to 0$. Finally, for $Y_4$,
$$Y_4 = \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,N(d\beta\,d\omega) - \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,\nu(d\beta\,d\omega),$$
$$\mathbb{E}|Y_4| \le 2 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} |h(\beta)\phi(\omega)|\,\nu(d\beta\,d\omega) \le 2c_1 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} |\phi(\omega)|\,\nu(d\beta\,d\omega)$$
$$= 2c_1 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} (1 \wedge \beta^2)\,|\phi(\omega)|\,\nu(d\beta\,d\omega) \le 2c_1 B_c < \infty,$$
so $Y_4 \to 0$ in $L_1$ as $\varepsilon \to 0$, completing the proof of Theorem 2. $\Box$

APPENDIX B: REVERSIBLE-JUMP MCMC PROCEDURES

A typical RJ-MCMC procedure for sampling varying-dimensional parameters involves at least three types of moves (Birth, Death and Update); we use Metropolis–Hastings steps for each of these. Our trans-dimensional update steps entail proposing a new value $(\beta^*_j, \omega^*_j)$ for one point $(\beta_j, \omega_j)$.
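The varying-dimensional bookkeeping of this appendix can be sketched in a few lines. The following is a simplified illustration with invented move probabilities and the Metropolis–Hastings acceptance ratios omitted; it mimics the death-on-truncation rule and the conditioned double-exponential Birth proposals described here, but is not the authors' implementation.

```python
import random

random.seed(1)
eps, eta = 0.1, 1.0                    # truncation level and coefficient scale (illustrative)
points = [(0.8, 0.3), (-1.2, 0.6)]     # current support points (beta_j, omega_j)

def update_or_death(points, step=0.25):
    """Random-walk proposal for one coefficient beta_j; a proposal landing
    in the truncated region |beta* eta| < eps is treated as a Death move."""
    j = random.randrange(len(points))  # j ~ Un(0 : J-1)
    beta, omega = points[j]
    beta_star = beta + random.gauss(0.0, step)
    if abs(beta_star * eta) < eps:
        del points[j]                  # Death: remove the small-magnitude point
    else:
        points[j] = (beta_star, omega) # Update (acceptance step omitted here)

def birth(points):
    """Birth: new small coefficient from a double-exponential tail with rate
    eta/eps, conditioned on |beta| eta > eps (memorylessness: eps/eta + Exp)."""
    beta = random.choice([-1.0, 1.0]) * (eps / eta + random.expovariate(eta / eps))
    points.append((beta, random.random()))  # random kernel parameters omega

for _ in range(200):
    if not points or random.random() < 0.3:
        birth(points)
    else:
        update_or_death(points)

# Every retained coefficient stays outside the truncation region.
print(all(abs(bj * eta) >= eps for bj, _ in points))  # True
```

The invariant checked at the end holds by construction: Births draw coefficients with $|\beta|\eta > \varepsilon$, and Updates that would violate this become Deaths.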
We select $j \sim \text{Un}(0{:}J-1)$ for proposed updating, then take Gaussian random walk steps successively in the coefficient $\beta_j$, the location parameter $\chi_j$, and the log kernel shape parameter, $\log\lambda_j$. Step sizes are chosen to achieve approximately 30% acceptance rates for each class of updates. One novel feature is that when the proposed update of some coefficient $\beta_j$ falls in the truncated region $\beta^*_j\eta \in (-\varepsilon, \varepsilon)$, the move is treated as a Death: the point $(\beta_j, \omega_j)$ is removed and $J$ is decremented. This is advantageous as it automatically focuses on small-magnitude coefficients for removal (rather than a random selection as in the typical RJ-MCMC Death step). A Birth step entails generating a new point $(\beta^*, \omega^*)$ to be included among the $\{(\beta_j, \omega_j)\}$ and incrementing $J$ by one. We use a double exponential birth distribution with rate $\eta/\varepsilon$, conditioned to satisfy $|\beta_j|\eta > \varepsilon$, so that proposed coefficients are small, balancing the "Death" of small coefficients in the Update step to attain the target acceptance rates. The fixed-dimensional parameters are sampled using a conventional Metropolis–Hastings approach [Gilks, Richardson and Spiegelhalter (1996), Section 1.3.3]. Each of these inexpensive update steps requires only $O(n)$ operations [in contrast to Gaussian methods, which may require $O(n^3)$], so the method scales well in the number $n$ of observations. Further details of the RJ-MCMC are available in [Tu (2006), Appendix A.1, pages 116 and 117]. An R package [R Development Core Team (2004)] implementing LARK is under development by the authors and will be made publicly available.

APPENDIX C: EXAMPLES OF LARK PRIOR REALIZATIONS

Fig. 5. Four realizations from LARK prior distribution with (a) Blocks kernel and Symmetric Gamma Lévy measure; (b) Bumps kernel and Gamma Lévy measure; (c), (d) Doppler kernel and Cauchy Lévy measure, with $J = 1000$ components for (a)–(c) and $J = 10$ for (d).
Hyperparameters $a_\lambda$, $b_\lambda$, $a_\gamma$, $b_\gamma$, $a_\eta$, $b_\eta$ and $\varepsilon$ are given in Table 2.

Fig. 5. (Continued.)

Acknowledgments. The authors would like to thank Natesh Pillai, three referees, the Associate Editor and the Editor for helpful comments and suggestions.

REFERENCES

Abramovich, F., Sapatinas, T. and Silverman, B. W. (1998). Wavelet thresholding via a Bayesian approach. J. R. Stat. Soc. Ser. B Stat. Methodol.
Abramovich, F., Sapatinas, T. and Silverman, B. W. (2000). Stochastic expansions in an overcomplete wavelet dictionary. Probab. Theory Related Fields.
Abramowitz, M. and Stegun, I. A. (1964). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. National Bureau of Standards Applied Mathematics Series. U.S. Government Printing Office, Washington, DC. MR0167642
Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist.
Candès, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist.
Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput.
Chilès, J.-P. and Delfiner, P. (1999). Geostatistics: Modeling Spatial Uncertainty. Wiley, New York. MR1679557
Choi, T. and Schervish, M. J. (2007). On posterior consistency in nonparametric regression problems. J. Multivariate Anal.
Choudhuri, N., Ghosal, S. and Roy, A. (2004). Bayesian estimation of the spectral density of a time series. J. Amer. Statist. Assoc.
Chu, J.-H., Clyde, M. A. and Liang, F. (2009). Bayesian function estimation using continuous wavelet dictionaries. Statist. Sinica.
Chu, C.-K. and Marron, J. S. (1991). Choosing a kernel regression estimator (with discussion). Statist. Sci.
Clyde, M. A. and Wolpert, R. L. (2007). Nonparametric function estimation using overcomplete dictionaries. In Bayesian Statistics 8 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P.
Dawid, D. Heckerman, A. F. M. Smith and M. West, eds.) 91–114. Oxford Univ. Press, Oxford. MR2433190
Cont, R. and Tankov, P. (2004). Financial Modelling with Jump Processes. Chapman & Hall/CRC, Boca Raton, FL. MR2042661
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge Univ. Press, Cambridge.
Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math.
Daubechies, I. (1992). Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, PA. MR1162107
Denison, D. G. T., Mallick, B. K. and Smith, A. F. M. (1998). Automatic Bayesian curve fitting. J. R. Stat. Soc. Ser. B Stat. Methodol.
Denison, D. G. T., Holmes, C. C., Mallick, B. K. and Smith, A. F. M. (2002). Bayesian Methods for Nonlinear Classification and Regression. Wiley, Chichester. MR1962778
DiMatteo, I., Genovese, C. R. and Kass, R. E. (2001). Bayesian curve-fitting with free-knot splines. Biometrika.
Donoho, D. L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ¹ minimization. Proc. Natl. Acad. Sci. USA.
Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika.
Escobar, M. D. (1994). Estimating normal means with a Dirichlet process prior. J. Amer. Statist. Assoc.
Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist.
Ferguson, T. S. (1974). Prior distributions on spaces of probability measures. Ann. Statist.
Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for non-i.i.d. observations. Ann. Statist.
Gilks, W. R., Richardson, S. and Spiegelhalter, D. J., eds. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall, London. MR1397966
Green, P. J. (1995).
Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika.
Higdon, D. M. (1998). A process-convolution approach to modeling temperatures in the North Atlantic Ocean. Environ. Ecol. Stat.
Higdon, D., Swall, J. and Kern, J. (1999). Non-stationary spatial modeling. In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 761–768. Oxford Univ. Press, Oxford.
Jacod, J. and Shiryaev, A. N. (1987). Limit Theorems for Stochastic Processes. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, Berlin. MR0959133
Johnstone, I. M. and Silverman, B. W. (2004). Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. Ann. Statist.
Johnstone, I. M. and Silverman, B. W. (2005a). EBayesThresh: R programs for empirical Bayes thresholding. Journal of Statistical Software.
Johnstone, I. M. and Silverman, B. W. (2005b). Empirical Bayes selection of wavelet thresholds. Ann. Statist.
Jordan, M. I. (2010). Hierarchical models, nested models and completely random measures. In Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger (M.-H. Chen, D. K. Dey, P. Müller, D. Sun and K. Ye, eds.) 207–217. Springer, New York. MR2766461
Khinchine, A. Y. and Lévy, P. (1936). Sur les lois stables. C. R. Math. Acad. Sci. Paris.
Kingman, J. F. C. (1967). Completely random measures. Pacific J. Math.
Kwapień, S. and Woyczyński, W. A. (1992). Random Series and Stochastic Integrals: Single and Multiple. Birkhäuser, Boston, MA. MR1167198
Law, M. H. and Kwok, J. T. (2001). Bayesian support vector regression. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics (AISTATS).
Liang, F., Mukherjee, S. and West, M. (2007). The use of unlabeled data in predictive modeling. Statist. Sci.
MacEachern, S. N. (1994).
Estimating normal means with a conjugate style Dirichlet process prior. Comm. Statist. Simulation Comput.
MacEachern, S. N. (1998). Computational methods for mixture of Dirichlet process models. In Practical Nonparametric and Semiparametric Bayesian Statistics (D. K. Dey, P. Müller and D. Sinha, eds.). Lecture Notes in Statist.
Mallat, S. G. and Zhang, Z. (1993). Matching pursuit with time-frequency dictionaries. IEEE Trans. Signal Process.
Müller, P. and Quintana, F. A. (2004). Nonparametric Bayesian data analysis. Statist. Sci.
Nikol'skiĭ, S. M. (1975). Approximation of Functions of Several Variables and Imbedding Theorems. Die Grundlehren der Mathematischen Wissenschaften. Springer, New York. Translated from the Russian by John M. Danskin, Jr. MR0374877
Pillai, N. S. (2008). Lévy random measures: Posterior consistency and applications. Ph.D. dissertation, Dept. Statist. Sci., Duke Univ. Available at http://stat.duke.edu/people/theses/PillaiNS.html.
Pillai, N. S., Wu, Q., Liang, F., Mukherjee, S. and Wolpert, R. L. (2007). Characterizing the function space for Bayesian kernel models. J. Mach. Learn. Res.
R Development Core Team (2004). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Available at .
Rajput, B. S. and Rosiński, J. (1989). Spectral representations of infinitely divisible processes. Probab. Theory Related Fields.
Reed, M. C. and Simon, B. (1975). Methods of Modern Mathematical Physics, Vol. II: Fourier Analysis, Self-Adjointness. Academic Press, New York.
Sato, K.-i. (1999). Lévy Processes and Infinitely Divisible Distributions. Cambridge Studies in Advanced Mathematics. Cambridge Univ. Press, Cambridge. Translated from the 1990 Japanese original. Revised by the author. MR1739520
Schmidt, G., Mattern, R. and Schüler, F. (1981).
Biomechanical investigation to determine physical and traumatological differentiation criteria for the maximum load capacity of head and vertebral column with and without protective helmet under the effects of impact. EEC research program on biomechanics of impacts, final report, phase III, Project 65, Institut für Rechtsmedizin, Univ. Heidelberg, Germany.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to nonparametric regression curve fitting. J. R. Stat. Soc. Ser. B Stat. Methodol.
Sisson, S. A. (2005). Transdimensional Markov chains: A decade of progress and future perspectives. J. Amer. Statist. Assoc.
Smith, M. and Kohn, R. (1996). Nonparametric regression using Bayesian variable selection. J. Econometrics.
Sobolev, S. L. (1991). Some Applications of Functional Analysis in Mathematical Physics. Translations of Mathematical Monographs. Amer. Math. Soc., Providence, RI. MR1125990
Sollich, P. (2002). Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning.
Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York. MR1697409
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol.
Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res.
Triebel, H. (1992). Theory of Function Spaces. II. Monographs in Mathematics. Birkhäuser, Basel. MR1163193
Tu, C. (2006). Nonparametric modelling using Lévy process priors with applications for function estimation, time series modeling and spatio-temporal modeling. Ph.D. dissertation, Dept. Statist. Sci., Duke Univ. Available at .
U.S. EPA (2007). Air Quality System (AQS). Available at .
Vidakovic, B. (1999). Statistical Modeling by Wavelets. Wiley, New York. MR1681904
Wahba, G. (1992). Multivariate function and operator estimation, based on smoothing splines and reproducing kernels.
In Nonlinear Modeling and Forecasting: Proceedings of the Workshop on Nonlinear Modeling and Forecasting held September, 1990, in Santa Fe, New Mexico (M. Casdagli and S. G. Eubank, eds.). SFI Studies in the Sciences of Complexity XII.
West, M. (2003). Bayesian factor regression models in the "large p, small n" paradigm. In Bayesian Statistics 7 (J. M. Bernardo et al., eds.) 733–742. Oxford Univ. Press, New York. MR2003537
Wolfe, P. J., Godsill, S. J. and Ng, W.-J. (2004). Bayesian variable selection and regularization for time-frequency surface estimation. J. R. Stat. Soc. Ser. B Stat. Methodol.
Wolpert, R. L., Ickstadt, K. and Hansen, M. B. (2003). A nonparametric Bayesian approach to inverse problems. In Bayesian Statistics 7 (J. M. Bernardo et al., eds.) 403–417. Oxford Univ. Press, New York. MR2003186
Wolpert, R. L. and Taqqu, M. S. (2005). Fractional Ornstein–Uhlenbeck Lévy processes and the Telecom process: Upstairs and downstairs. Signal Processing.
Zolotarev, V. M. (1986). One-dimensional Stable Distributions. Translations of Mathematical Monographs. Amer. Math. Soc., Providence, RI. MR0854867

R. L. Wolpert
M. A. Clyde
Department of Statistical Science
Duke University
Durham, North Carolina 27708-0251
USA
E-mail: [email protected]@stat.duke.edu
URL: