Stochastic expansions using continuous dictionaries: Lévy adaptive regression kernels
The Annals of Statistics ©
Institute of Mathematical Statistics, 2011
STOCHASTIC EXPANSIONS USING CONTINUOUS DICTIONARIES: LÉVY ADAPTIVE REGRESSION KERNELS
By Robert L. Wolpert, Merlise A. Clyde and Chong Tu
Duke University, Duke University and PIMCO
This article describes a new class of prior distributions for nonparametric function estimation. The unknown function is modeled as a limit of weighted sums of kernels or generator functions indexed by continuous parameters that control local and global features such as their translation, dilation, modulation and shape. Lévy random fields and their stochastic integrals are employed to induce prior distributions for the unknown functions or, equivalently, for the number of kernels and for the parameters governing their features. Scaling, shape, and other features of the generating functions are location-specific to allow quite different function properties in different parts of the space, as with wavelet bases and other methods employing overcomplete dictionaries. We provide conditions under which the stochastic expansions converge in specified Besov or Sobolev norms. Under a Gaussian error model, this may be viewed as a sparse regression problem, with regularization induced via the Lévy random field prior distribution. Posterior inference for the unknown functions is based on a reversible jump Markov chain Monte Carlo algorithm. We compare the Lévy Adaptive Regression Kernel (LARK) method to wavelet-based methods using some of the standard test functions, and illustrate its flexibility and adaptability in nonstationary applications.
1. Introduction.
Popular approaches for nonparametric Bayesian estimation of unobserved functions generally employ as prior distributions either Gaussian processes (or random fields, in two or more dimensions) or mixtures of Dirichlet processes. In this article, we focus attention on a wider
Received May 2010; revised January 2011. Supported by NSF Grants DMS-07-57549, PHY-09-41373 and NASA AISR Grant NNX09AK60G. Supported by NSF Grants DMS-0342172 and DMS-04-06115.
AMS 2000 subject classifications.
Primary 62G08; secondary 60E07.
Key words and phrases.
Bayes, Besov, kernel regression, LARK, Lévy random field, nonparametric regression, relevance vector machine, reversible jump Markov chain Monte Carlo, splines, support vector machine, wavelets.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2011, Vol. 39, No. 4, 1916–1962. This reprint differs from the original in pagination and typographic detail.
class of processes, Lévy random fields and their stochastic integrals. These include Gaussian random fields as a limiting case, while Dirichlet processes may be represented as "normalized" variants of the Gamma Lévy random field; Lévy random fields thus provide an important link between two of the random processes that form the foundation of Bayesian nonparametric methods (see Section 6). In this article, we construct prior distributions for the mean function in nonparametric regression as stochastic integrals of Lévy random fields. Under suitable regularity, these can be expressed as stochastic expansions using continuous dictionaries, permitting tractable Bayesian inference. While our focus is on nonparametric regression, we hope that the reader will see the possibilities of using Lévy random fields in other contexts.

To begin, suppose we have noisy measurements $\{Y_i\}_{i\in I}$ of an unknown real-valued function $f:\mathcal{X}\to\mathbb{R}$ observed at points $\{x_i\}_{i\in I}$ in some complete separable metric space $\mathcal{X}$, with $\mathsf{E}[Y_i]=f(x_i)$. In nonparametric regression models, the mean function $f(\cdot)$ is often regarded as an element of some Hilbert space $\mathcal{H}$ of real-valued functions on $\mathcal{X}$, and is expressed as a linear combination of basis functions $\{g_j\}\subset\mathcal{H}$:
$$f(x_i)=\sum_{1\le j<\infty}g_j(x_i)\,\beta_j.\tag{1}$$

2. Stochastic expansions and prior distributions. To make inference about the unknown mean function $f\in\mathcal{H}$ given noisy observations $Y_i$ of $f(x_i)$ for $\{x_i\}\subset\mathcal{X}$, we must first propose a prior distribution on $\mathcal{H}$ for $f$. Let $\Omega$ be a complete separable metric space and $\phi:\mathcal{X}\times\Omega\to\mathbb{R}$ a Borel measurable function, and set $\phi_j(x_i)\equiv\phi(x_i,\omega_j)$ for some collection $\{\omega_j\}\subset\Omega$.
As a slight extension of the basis expansion of (1), set
$$f(x)\equiv\sum_{1\le j\le J}\phi(x,\omega_j)\,\beta_j.\tag{2}$$
Possible choices for $\phi(x,\omega)$ for $\mathcal{X}=\mathbb{R}$ include translation-invariant kernel functions, such as the Gaussian
$$\phi_G(x,\omega)\equiv\exp\{-\lambda(x-\chi)^2\}\tag{4a}$$
or the Laplace
$$\phi_L(x,\omega)\equiv\exp\{-\lambda|x-\chi|\}\tag{4b}$$
kernels with $\omega\equiv(\chi,\lambda)\in\mathcal{X}\times\mathbb{R}_+\equiv\Omega$. There is no need to restrict attention to symmetric (e.g., Mercer) kernels, as required in the conventional Support Vector Machine (SVM) approach [Law and Kwok (2001), Sollich (2002)]. Asymmetric kernels, such as the one-sided exponential
$$\phi_E(x,\omega)\equiv\exp\{-\lambda(x-\chi)\}\mathbf{1}_{\{x>\chi\}}\tag{4c}$$
are useful, for example, in modeling pollutant dissipation over time. Other possibilities include piecewise-constant Haar wavelets on $\mathcal{X}=(0,1]$,
$$\phi_H(x,\omega)\equiv\mathbf{1}_{\{0<\lambda(x-\chi)\le 1\}}\tag{4d}$$
or continuous rescaling and shifting of other wavelet functions
$$\phi_\psi(x,\omega)\equiv\lambda^{1/2}\psi(\lambda(x-\chi)).\tag{4e}$$
In each of these examples, $\Omega$ is a location-scale space with location parameter $\chi$ and parameter $\lambda$ determining the scale. Higher-dimensional spaces $\mathcal{X}$ may be accommodated in a similar way; for example, in Section 8.2 we use the space–time kernel
$$\phi_{ST}(x,\omega)\equiv\exp\{-(s-\sigma)'\Lambda(s-\sigma)-\lambda|t-\tau|\}\tag{4f}$$
for space–time points $x=(s,t)\in\mathbb{R}^2\times\mathbb{R}_+$; here $\omega=(\sigma,\tau,\Lambda,\lambda)$ includes a space–time point $(\sigma,\tau)\in\mathbb{R}^2\times\mathbb{R}_+$, a positive-definite spatial dispersion matrix $\Lambda\in\mathcal{S}_2^+$, and a temporal decay rate $\lambda\in\mathbb{R}_+$.

2.2. Lévy random measures. For a measure $\nu(d\beta\,d\omega)$ on $\mathbb{R}\times\Omega$ with finite total mass $\nu_+\equiv\nu(\mathbb{R}\times\Omega)<\infty$, let $J\sim\mathrm{Po}(\nu_+)$ be Poisson-distributed with mean $\nu_+$, and let $\{(\beta_j,\omega_j)\}_{1\le j\le J}$ be i.i.d. draws from $\pi(d\beta\,d\omega)\equiv\nu(d\beta\,d\omega)/\nu_+$, and set
$$\mathcal{L}(d\omega)\equiv\sum_{1\le j\le J}\beta_j\,\delta_{\omega_j}(d\omega).\tag{5}$$
The construction extends to infinite measures with $\nu_+=\infty$: this will hold for any measure that satisfies the local $L_1$ integrability condition
$$\iint_{\mathbb{R}\times K}(1\wedge|\beta|)\,\nu(d\beta\,d\omega)<\infty\tag{7}$$
for each compact $K\subset\Omega$.
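In software the dictionary (4) amounts to a few one-line functions. The sketch below (plain Python; the parameter names `chi` and `lam` for $\chi$ and $\lambda$ are ours, and the atom list is an arbitrary illustration) implements (4a)–(4d) and evaluates a finite expansion of the form (2) for a fixed list of atoms $(\beta_j,\chi_j,\lambda_j)$.

```python
import math

# Generators phi(x, omega) of (4a)-(4d), with omega = (chi, lam); a minimal
# sketch for scalar x under the location-scale parametrization of Section 2.
def phi_gaussian(x, chi, lam):        # (4a)
    return math.exp(-lam * (x - chi) ** 2)

def phi_laplace(x, chi, lam):         # (4b)
    return math.exp(-lam * abs(x - chi))

def phi_exponential(x, chi, lam):     # (4c): one-sided, zero for x <= chi
    return math.exp(-lam * (x - chi)) if x > chi else 0.0

def phi_haar(x, chi, lam):            # (4d): indicator of 0 < lam*(x-chi) <= 1
    return 1.0 if 0.0 < lam * (x - chi) <= 1.0 else 0.0

def f_expansion(x, atoms, phi):
    """Finite LARK expansion (2): f(x) = sum_j beta_j * phi(x, omega_j)."""
    return sum(beta * phi(x, chi, lam) for beta, chi, lam in atoms)

# Example: three Gaussian atoms (beta_j, chi_j, lam_j)
atoms = [(1.0, 0.2, 50.0), (-0.5, 0.5, 200.0), (0.8, 0.8, 100.0)]
print(f_expansion(0.2, atoms, phi_gaussian))  # first atom contributes exactly 1.0
```

Swapping `phi_gaussian` for any other generator changes the local smoothness of the draw without changing the expansion machinery, which is the point of treating (4) as an overcomplete dictionary.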
The mean and variance, when they exist, are given by $\mathsf{E}[\mathcal{L}(A)]=\iint_{\mathbb{R}\times A}\beta\,\nu(d\beta\,d\omega)$ and $\mathsf{Var}[\mathcal{L}(A)]=\iint_{\mathbb{R}\times A}\beta^2\,\nu(d\beta\,d\omega)$, respectively.

Khinchine and Lévy (1936) showed that the most general ID random variables [and hence the most general ID-valued random measures; see Rajput and Rosiński (1989), Proposition 2.1] have characteristic functions of the form
$$\mathsf{E}[e^{it\mathcal{L}(A)}]=\exp\Big\{it\delta(A)-\tfrac{t^2}{2}\Sigma(A)+\iint_{\mathbb{R}\times A}\big(e^{it\beta}-1-ith(\beta)\big)\,\nu(d\beta\,d\omega)\Big\},\tag{8}$$
where $h(\beta)\equiv\beta\mathbf{1}_{[-1,1]}(\beta)$, determined uniquely by the characteristic triplet of sigma-finite measures $(\delta,\Sigma,\nu)$ consisting of a signed measure $\delta(d\omega)$ and a positive measure $\Sigma(d\omega)$ on $\Omega$, and a positive measure $\nu(d\beta\,d\omega)$ on $\mathbb{R}\times\Omega$ that satisfies the local $L_2$ integrability condition
$$\iint_{\mathbb{R}\times K}(1\wedge\beta^2)\,\nu(d\beta\,d\omega)<\infty\tag{9}$$
for each compact $K\subset\Omega$ and $\nu(\{0\},\Omega)=0$ (for more details on this nonstationary version of the classic Lévy–Khinchine formula see Jacod and Shiryaev [(1987), page 75], Cont and Tankov [(2004), pages 457–459] or Wolpert and Taqqu (2005)).

The role of the compensator function $h(\beta)$ is to make the last integrand in (8) bounded and $O(\beta^2)$ near $\beta\approx 0$, permitting the replacement of (7) with the weaker condition (9); in this case $\mathcal{L}(d\omega)$ may have countably-many points of support $\{\omega_j\}\subset\Omega$ whose magnitudes $\{\beta_j\}$ are not absolutely summable, precluding a representation of the form (5). The compensator $h(\beta)$ may be replaced by any bounded measurable function satisfying
$$h(\beta)=\beta+O(\beta^2),\qquad\beta\approx 0,\tag{10}$$
with a corresponding replacement of $\delta(d\omega)$ with $\delta_h(d\omega)=\delta(d\omega)+\int_{\mathbb{R}}[h(\beta)-h_0(\beta)]\,\nu(d\beta\,d\omega)$, where $h_0(\beta)\equiv\beta\mathbf{1}_{[-1,1]}(\beta)$ denotes the standard compensator.
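When $\nu$ is finite, the moment formulas for $\mathsf{E}[\mathcal{L}(A)]$ and $\mathsf{Var}[\mathcal{L}(A)]$ above can be checked by direct simulation. The sketch below is an illustration under assumed settings, not from the paper: it takes $\nu(d\beta\,d\omega)=\nu_+\cdot\mathrm{Exp}(\eta)(d\beta)\times\pi(d\omega)$, so $\mathcal{L}(\Omega)$ is a compound Poisson sum with mean $\nu_+/\eta$ and variance $\nu_+\mathsf{E}[\beta^2]=2\nu_+/\eta^2$.

```python
import math
import random

random.seed(1)

def draw_L_total(nu_plus, eta):
    """One draw of L(Omega) = sum_{1<=j<=J} beta_j under the illustrative finite
    Levy measure nu(dbeta domega) = nu_plus * Exp(eta)(dbeta) x pi(domega):
    J ~ Po(nu_plus), beta_j i.i.d. Exponential(eta)."""
    j, p, u = 0, math.exp(-nu_plus), random.random()
    cdf = p
    while u > cdf:                      # Poisson draw by CDF inversion
        j += 1
        p *= nu_plus / j
        cdf += p
    return sum(random.expovariate(eta) for _ in range(j))

nu_plus, eta = 5.0, 2.0
draws = [draw_L_total(nu_plus, eta) for _ in range(200_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)  # close to nu_plus/eta = 2.5 and 2*nu_plus/eta**2 = 2.5
```

The agreement of the Monte Carlo moments with $\iint\beta\,\nu$ and $\iint\beta^2\,\nu$ is exactly the content of the displayed formulas for this finite $\nu$.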
Whenever (7) is satisfied, we may take $h(\beta)\equiv 0$. By (8) the random measure $\mathcal{L}$ may be written as the sum of two independent parts: a Gaussian portion, assigning independent normally-distributed random variables with mean $\delta_h(A_i)$ and variance $\Sigma(A_i)$ to disjoint sets $A_i$, and the remaining portion, with characteristic function
$$\mathsf{E}[e^{it\mathcal{L}(A)}]=\exp\Big\{\iint_{\mathbb{R}\times A}\big(e^{it\beta}-1-ith(\beta)\big)\,\nu(d\beta\,d\omega)\Big\}.\tag{11}$$
We call a random signed measure $\mathcal{L}$ with no Gaussian component [i.e., an ID-valued measure with $\Sigma(\Omega)=\delta_h(\Omega)=0$, that satisfies (11)] a Lévy random measure. Nonnegative Lévy random measures satisfying (7) were called "completely random measures" by Kingman (1967).

2.3. Lévy random fields. A Lévy random measure $\mathcal{L}$ satisfying (11) induces a linear mapping $\phi\mapsto\mathcal{L}[\phi]$ from functions $\phi:\Omega\to\mathbb{R}$ to random variables $\mathcal{L}[\phi]\equiv\int_\Omega\phi(\omega)\,\mathcal{L}(d\omega)$; such a mapping is called a random field. For simple functions $\phi(\omega)=\sum a_i\mathbf{1}_{A_i}(\omega)$ with each $\bar A_i\subset\Omega$ compact, we set $\mathcal{L}[\phi]\equiv\sum a_i\mathcal{L}(A_i)$ and verify that
$$\mathsf{E}[e^{it\mathcal{L}[\phi]}]=\exp\Big\{\iint_{\mathbb{R}\times\Omega}\big(e^{it\phi(\omega)\beta}-1-it\phi(\omega)h(\beta)\big)\,\nu(d\beta\,d\omega)\Big\}.\tag{12}$$
It is straightforward to extend this by continuity in probability to (at least) all bounded measurable compactly-supported $\phi:\Omega\to\mathbb{R}$. We now present a general construction based on Poisson random fields, the key to our approach to tractable posterior Bayesian inference.

2.3.1. Poisson construction I: Uncompensated. When $\nu(d\beta\,d\omega)$ satisfies (7) (i.e., $|\beta|$ is locally $\nu$-integrable at zero) we may take $h(\beta)\equiv 0$ and construct $\mathcal{L}$ as follows.
Begin with a Poisson random measure $\mathcal{N}(d\beta\,d\omega)\sim\mathrm{Po}(\nu)$ on $(\mathbb{R}\times\Omega)$ that assigns independent Poisson-distributed random variables $\mathcal{N}(C_i)\sim\mathrm{Po}(\nu(C_i))$ with means $\nu(C_i)$ to disjoint Borel sets $C_i\subset(\mathbb{R}\times\Omega)$. For any Borel set $A\subset\Omega$ with compact closure $\bar A$ and bounded measurable compactly-supported $\phi:\Omega\to\mathbb{R}$, set $J\equiv\mathcal{N}(\mathbb{R}\times A)$ and
$$\mathcal{L}(A)\equiv\iint_{\mathbb{R}\times A}\beta\,\mathcal{N}(d\beta\,d\omega)=\sum_{1\le j\le J}\beta_j\mathbf{1}_A(\omega_j).\tag{13}$$

2.3.2. Poisson construction II: Compensated. The situation is more delicate in case the Lévy measure does not satisfy (7), but only the weaker bound in (9) (i.e., if $\beta^2$ is locally $\nu$-integrable but $|\beta|$ is not). Begin again with the Poisson measure $\mathcal{N}\sim\mathrm{Po}(\nu)$ on $\mathbb{R}\times\Omega$, and introduce the compensated or centered Poisson measure $\tilde{\mathcal{N}}(d\beta\,d\omega)\equiv\mathcal{N}(d\beta\,d\omega)-\nu(d\beta\,d\omega)$ with mean zero [Sato (1999), page 38], inducing an isometry from $L_2(\mathbb{R}\times\Omega,\nu(d\beta\,d\omega))$ to the square-integrable zero-mean random variables. Following Wolpert and Taqqu (2005), set
$$\mathcal{L}(A)\equiv\iint_{\mathbb{R}\times A}[\beta-h(\beta)]\,\mathcal{N}(d\beta\,d\omega)+\iint_{\mathbb{R}\times A}h(\beta)\,\tilde{\mathcal{N}}(d\beta\,d\omega),$$
$$\mathcal{L}[\phi]\equiv\iint_{\mathbb{R}\times\Omega}[\beta-h(\beta)]\phi(\omega)\,\mathcal{N}(d\beta\,d\omega)+\iint_{\mathbb{R}\times\Omega}h(\beta)\phi(\omega)\,\tilde{\mathcal{N}}(d\beta\,d\omega)\tag{14}$$
for any measurable $\phi$ for which (14) converges. If (7) holds, one may simplify (14) to
$$\mathcal{L}[\phi]=\iint_{\mathbb{R}\times\Omega}\beta\phi(\omega)\,\mathcal{N}(d\beta\,d\omega)-\iint_{\mathbb{R}\times\Omega}h(\beta)\phi(\omega)\,\nu(d\beta\,d\omega)\tag{15a}$$
$$=\sum_{1\le j\le J}\beta_j\phi(\omega_j)-\iint_{\mathbb{R}\times\Omega}h(\beta)\phi(\omega)\,\nu(d\beta\,d\omega).\tag{15b}$$

Theorem 1. Let $\nu$ be a Lévy measure on $\mathbb{R}\times\Omega$ satisfying (9). Then $\mathcal{L}[\phi]$ is well defined by (14) with characteristic function given by (12) for compensator $h(\beta)\equiv\beta\mathbf{1}_{\{|\beta|\le 1\}}$ if $\phi$ satisfies
$$\iint_{[-1,1]^c\times\Omega}(1\wedge|\beta\phi(\omega)|)\,\nu(d\beta\,d\omega)<\infty,\tag{16a}$$
$$\iint_{[-1,1]\times\Omega}\big(|\beta\phi(\omega)|^2\wedge|\beta\phi(\omega)|\big)\,\nu(d\beta\,d\omega)<\infty.\tag{16b}$$
If, in addition, $\phi$ satisfies
$$\iint_{\mathbb{R}\times\Omega}(1\wedge\beta^2)|\phi(\omega)|\,\nu(d\beta\,d\omega)<\infty,\tag{16c}$$
then $\mathcal{L}[\phi]$ is well defined for any compensator $h(\beta)$ satisfying (10). Proof.
Under these conditions, the integrands of the compensated and uncompensated Poisson integrals in (14) are in the Musielak–Orlicz spaces for which those integrals are well defined; see Rajput and Rosiński [(1989), page 9], Kwapień and Woyczyński (1992). □

In particular:

Corollary 1. $\mathcal{L}[\phi]$ is well defined with characteristic function (12) for any function $\phi$ satisfying
$$\iint_{\mathbb{R}\times\Omega}(1\wedge\beta^2)\big(|\phi(\omega)|\vee\phi(\omega)^2\big)\,\nu(d\beta\,d\omega)<\infty,\tag{17}$$
including [by (9)] all bounded measurable compactly-supported $\phi$. Thus, $\mathcal{L}(A)=\mathcal{L}[\mathbf{1}_A]$ is always well defined for any Borel set $A\subset\Omega$ with compact closure $\bar A$.

Similarly:

Proposition 1. For a Lévy measure $\nu$ satisfying (7), take $h(\beta)\equiv 0$; then (13) extends to
$$\mathcal{L}[\phi]\equiv\iint_{\mathbb{R}\times\Omega}\beta\phi(\omega)\,\mathcal{N}(d\beta\,d\omega)=\sum_{1\le j\le J}\beta_j\phi(\omega_j),$$
which converges absolutely almost surely for any $\phi$ satisfying
$$\iint_{\mathbb{R}\times\Omega}(1\wedge|\beta\phi(\omega)|)\,\nu(d\beta\,d\omega)<\infty.\tag{18}$$

2.4. LARK models. Denote by $\Phi$ the linear space of functions $\phi:\Omega\to\mathbb{R}$ for which $\mathcal{L}[\phi]$ has been defined; we have seen that this includes at least all bounded measurable compactly-supported functions $\phi$. Denote by $\mathcal{G}$ the linear space of measurable functions $g:\mathcal{X}\to\Phi$, and simplify notation by writing "$g(x,\omega)$" for $g(x)(\omega)$. Each of the generating functions introduced in (4) lies in $\mathcal{G}$. For any $g\in\mathcal{G}$, we can construct a random function $f:\mathcal{X}\to\mathbb{R}$ by
$$f(x)\equiv\mathcal{L}[g(x)]=\iint_{\mathbb{R}\times\Omega}g(x,\omega)[\beta-h(\beta)]\,\mathcal{N}(d\beta\,d\omega)+\iint_{\mathbb{R}\times\Omega}g(x,\omega)h(\beta)\,\tilde{\mathcal{N}}(d\beta\,d\omega),\tag{19}$$
with the series representation $f(x)=\sum_{1\le j}g(x,\omega_j)\beta_j$ when (7) holds and $h\equiv 0$.

2.5. Examples. We now consider some specific examples of Lévy random fields and the corresponding kernel integrals. Familiar examples include Poisson, Gamma, Cauchy and more generally $\alpha$-Stable random fields.

2.5.1. Compound Poisson processes.
The simplest model to consider would be that of (2), with finite Lévy measure satisfying $\nu_+\equiv\nu(\mathbb{R}\times\Omega)<\infty$, reproduced here:
$$f(x)\equiv\sum_{1\le j\le J}\phi(x,\omega_j)\,\beta_j.$$

2.5.2. Gamma random fields. The Lévy measure for the Gamma random field is infinite but satisfies the strong local $L_1$ integrability condition (7), obviating compensation; in the homogeneous case, it is
$$\nu(d\beta\,d\omega)=\beta^{-1}e^{-\beta\eta}\mathbf{1}_{\{\beta>0\}}\,d\beta\,\gamma(d\omega)\tag{22}$$
for some $\sigma$-finite measure $\gamma(d\omega)$ on $\Omega$, giving $\mathcal{L}(A)\sim\mathrm{Ga}(\gamma(A),\eta)$ [with mean $\gamma(A)/\eta$] for Borel measurable $A\subset\Omega$ with $\gamma(A)<\infty$. Because $\nu$ is concentrated on $\mathbb{R}_+$, the mass $\beta_j$ at each of the Gamma random measure's support points $\omega_j$ is positive, so all the coefficients in the expression $f(x)=\sum\phi(x,\omega_j)\beta_j$ are nonnegative. With a nonnegative generating function $\phi\in\mathcal{G}$, this provides a direct way to construct nonnegative mean functions $f\ge 0$, without transforming the observations $\{Y_i\}$ as Gaussian methods would require. The mean $\mathsf{E}[f(x)]=\eta^{-1}\int g(x,\omega)\,\gamma(d\omega)$ is available from (21a), as is the covariance from (21b).

2.5.3. Symmetric Gamma random fields. A symmetric analogue of the Gamma random field (22) has Lévy measure
$$\nu(d\beta\,d\omega)=|\beta|^{-1}e^{-|\beta|\eta}\,d\beta\,\gamma(d\omega)\tag{23}$$
on all of $\mathbb{R}\times\Omega$, leading to random variables $\mathcal{L}(A)$ distributed as the difference of two independent $\mathrm{Ga}(\gamma(A),\eta)$ variables, with characteristic function $\mathsf{E}[e^{it\mathcal{L}(A)}]=(1+t^2/\eta^2)^{-\gamma(A)}$. Both the standard positive Gamma random measure and this symmetric version satisfy the local $L_1$ bound (7), hence no compensation is required so we may take $h(\beta)\equiv 0$ in constructing $f(x)$. The mean $\mathsf{E}[f(x)]=0$ vanishes for the symmetric Gamma random field, or for any other Lévy random field with a symmetric (in $\pm\beta$) Lévy measure satisfying (7). Covariances are available from (21b). Nearly all of the commonly used isotropic geostatistical covariance functions [see Chilès and Delfiner (1999), Section 2.5] may be achieved by the choice of a suitable generating kernel $g(x,\cdot)$ and Lévy measure $\nu(d\beta\,d\omega)$; see Clyde and Wolpert (2007) for specific examples.

2.5.4. Symmetric α-Stable random fields.
Symmetric $\alpha$-Stable ($S\alpha S$) Lévy random fields have Lévy measure
$$\nu(d\beta\,d\omega)=c_\alpha\alpha|\beta|^{-1-\alpha}\,d\beta\,\gamma(d\omega)\tag{24}$$
on $\mathbb{R}\times\Omega$ for some $0<\alpha<2$ and $\sigma$-finite positive measure $\gamma(d\omega)$, where $c_\alpha=(1/\pi)\Gamma(\alpha)\sin(\pi\alpha/2)$. This gives $\mathcal{L}(A)\sim\mathrm{St}(\alpha,0,\gamma(A),0)$ [in parametrization (M) of Zolotarev (1986), page 11] with infinite variance (and thus no meaningful covariance function for $f(x)\equiv\mathcal{L}[g(x)]$). This infinite Lévy measure satisfies (9) for all $0<\alpha<2$, but satisfies the stronger local $L_1$ condition (7) only for $0<\alpha<1$; thus compensation is required to construct $S\alpha S$ random fields with $1\le\alpha<2$, including the Cauchy case of $\alpha=1$. One can show that $f(x)$ is well defined for any $\phi(x,\cdot)\in L_\alpha(\Omega,\gamma(d\omega))$, including the generating functions of (4). The $S\alpha S$ fields have heavier tails than, for example, the symmetric Gamma fields of Section 2.5.3, and may be more appropriate for problems where one might expect $f(\cdot)$ to be dominated by a few heavily weighted kernels.

3. Approximations for implementing kernel integrals. Computer simulations of Lévy random measures $A\mapsto\mathcal{L}(A)$ and random fields $\phi\mapsto\mathcal{L}[\phi]$ associated with finite Lévy measures $\nu$ may be constructed as in (5), (13), simply by setting $\nu_+\equiv\nu(\mathbb{R}\times\Omega)$ and drawing $J\sim\mathrm{Po}(\nu_+)$ and $\{(\beta_j,\omega_j)\}_{1\le j\le J}$ i.i.d. from $\nu/\nu_+$. For infinite Lévy measures, $\nu$ may be approximated by suitable finite measures as follows.

Theorem 2. Let $\nu$ be a Lévy measure defined on $\mathbb{R}\times\Omega$ satisfying (9) and $\phi\in\Phi$ satisfying (16). Take $\{K_\varepsilon\}$ to be any family of compact sets increasing to $\Omega$ as $\varepsilon\to 0$, and let $\nu_\varepsilon$ be the unique Borel measure on $\mathbb{R}\times\Omega$ satisfying
$$\nu_\varepsilon(B\times A)\equiv\nu\big((B\cap[-\varepsilon,\varepsilon]^c)\times(A\cap K_\varepsilon)\big)\tag{25}$$
for Borel sets $B\subset\mathbb{R}$, $A\subset\Omega$ [note $\nu_\varepsilon^+\equiv\nu_\varepsilon(\mathbb{R}\times\Omega)<\infty$]. Let $h(\cdot)$ be any bounded measurable compensator function on $\mathbb{R}$ satisfying $h(\beta)=\beta+O(\beta^2)$ for $\beta$ near zero.
Then as $\varepsilon\to 0$, the random variables
$$\mathcal{L}_\varepsilon[\phi]\equiv\iint_{[-\varepsilon,\varepsilon]^c\times K_\varepsilon}\beta\phi(\omega)\,\mathcal{N}(d\beta\,d\omega)-\iint_{[-\varepsilon,\varepsilon]^c\times K_\varepsilon}h(\beta)\phi(\omega)\,\nu(d\beta\,d\omega)\tag{26}$$
converge in probability to $\mathcal{L}[\phi]$ of (14).

Proof. The error in approximating $\mathcal{L}[\phi]$ of (14) by $\mathcal{L}_\varepsilon[\phi]$ of (26) is
$$\mathcal{L}[\phi]-\mathcal{L}_\varepsilon[\phi]=\iint_{N_\varepsilon}\big(\beta-h(\beta)\big)\phi(\omega)\,\mathcal{N}(d\beta\,d\omega)+\iint_{N_\varepsilon}h(\beta)\phi(\omega)\,\tilde{\mathcal{N}}(d\beta\,d\omega),\tag{27}$$
where $N_\varepsilon\equiv\{(\beta,\omega):|\beta|\le\varepsilon\text{ or }\omega\in K_\varepsilon^c\}$. The first term in (27) converges to zero almost surely, and the second in $L_2$, as $\varepsilon\to 0$; see the Appendix for details. □

The approximation $\mathcal{L}_\varepsilon[\phi]$ is the sum of a Lévy random field with finite Lévy measure $\nu_\varepsilon$ [hence with simple representation (13)] and a deterministic drift term $\delta_\varepsilon[\phi]$ given by the second integral in (26). The drift vanishes whenever $\nu(d\beta\,d\omega)$ is symmetric in $\pm\beta$ and $h(\beta)$ is odd.

Corollary 2. If either (a) $\nu(d\beta\,d\omega)$ satisfies (7), or (b) $\nu(d\beta\,d\omega)$ satisfies (9) and is even in $\pm\beta$, and also $h(\beta)$ is an odd function, then for each $x\in\mathcal{X}$,
$$f_\varepsilon(x)\equiv\sum_{1\le j\le J_\varepsilon}g(x,\omega_j)\,\beta_j\tag{28}$$
converges in probability to $f(x)$ as $\varepsilon\to 0$.

Proof. With $f_\varepsilon(x)\equiv\mathcal{L}_\varepsilon[g(x)]$,
$$f_\varepsilon(x)=\int_\Omega g(x,\omega)\,\mathcal{L}_\varepsilon(d\omega)=\iint_{\mathbb{R}\times\Omega}g(x,\omega)\beta\,\mathcal{N}_\varepsilon(d\beta\,d\omega)-\iint_{\mathbb{R}\times\Omega}g(x,\omega)h(\beta)\,\nu_\varepsilon(d\beta\,d\omega)\tag{29}$$
with $\mathcal{N}_\varepsilon(d\beta\,d\omega)\sim\mathrm{Po}(\nu_\varepsilon(d\beta\,d\omega))$. If $\nu$ satisfies (7), then without loss of generality take the compensator function $h(\beta)\equiv 0$. In both cases (a) and (b), the second integral in (29) vanishes, leading to (28) [cf. (2)]. □

Note that in case (b) the $\{g(x,\omega_j)\beta_j\}$ are not absolutely summable so "$\sum_{j=0}^\infty g(x,\omega_j)\beta_j$" does not converge in the Lebesgue sense. In each of our applications the conditions of Corollary 2 hold, allowing us to approximate $\nu$ by a finite Lévy measure $\nu_\varepsilon$ [and $\mathcal{L}$ by $\mathcal{L}_\varepsilon\sim\text{Lévy}(\nu_\varepsilon)$], and exploit the resulting Poisson representation for inference.

4. Function spaces for LARK models.
Theorem 2 and Corollary 2 establish pointwise convergence of $f_\varepsilon(x)$ to $f(x)$ as $\varepsilon\to 0$; in this section we provide conditions to ensure that $f_\varepsilon(\cdot)\to f(\cdot)$ in appropriate Besov or Sobolev norms if the generating functions lie in the same space.

For $s\ge 0$ and $d\in\mathbb{N}$ denote by $W^s(\mathbb{R}^d)$ the Sobolev space of real-valued square-integrable functions $f(\cdot)\in L_2(\mathbb{R}^d)$ [Sobolev (1991), Section 1.7, Reed and Simon (1975), page 50] with finite Sobolev norm
$$\|f\|_{W^s}=\Big\{(2\pi)^{-d}\int_{\mathbb{R}^d}(1+|\xi|^2)^s|\hat f(\xi)|^2\,d\xi\Big\}^{1/2}\tag{30}$$
with Fourier transforms defined for $f\in L_1(\mathbb{R}^d)$ by
$$\hat f(\xi)=\int_{\mathbb{R}^d}e^{i\xi\cdot x}f(x)\,dx$$
and by $L_2$ limits for $f\in L_2(\mathbb{R}^d)$; here $d\xi$ and $dx$ denote the Lebesgue volume element in $\mathbb{R}^d$, and $\xi\cdot x$ denotes the Euclidean inner product. Each $W^s$ is a Banach space, hence complete. By Plancherel's theorem, each $f\in W^s$ has $\lfloor s\rfloor$ distributional derivatives in $L_2(\mathbb{R}^d)$, and by Sobolev's lemma has $k$ continuous derivatives for each integer $0\le k<s-d/2$.

The Besov space $B^s_{pq}$ consists of those $f\in L_p(\mathbb{R}^d)$ whose Besov semi-norms are finite. Several equivalent Besov semi-norms appear in the literature [Triebel (1992), Theorem 2.6.1, page 140]; we use the definition given as equation 2 of that theorem. For $p,q\ge 1$, $s>d(1/p-1)_+$ and for any integer $m>s$ ($m=1+\lfloor s\rfloor$ is easiest), set
$$|f|^s_{pq}=\Big(\int_{|h|\le 1}|h|^{-sq}\|\Delta^m_h f\|^q_p\,dh/|h|^d\Big)^{1/q}$$
or, in dimension $d=1$,
$$|f|^s_{pq}=\Big(2\int_0^1 h^{-1-sq}\|\Delta^m_h f\|^q_p\,dh\Big)^{1/q},\tag{31}$$
where $\Delta^m_h$ denotes the $m$th forward finite difference,
$$\Delta^0_h f(x)=f(x),\qquad\Delta^m_h f(x)=\big[\Delta^{m-1}_h f(x+h)-\Delta^{m-1}_h f(x)\big]=\sum_{k=0}^m\binom{m}{k}(-1)^{m-k}f(x+kh).\tag{32}$$
The Besov space $B^s_{pq}$ is the Banach space completion of $L_p(\mathbb{R}^d)$ under norm
$$\|f\|^s_{pq}=\|f\|_p+|f|^s_{pq}.\tag{33}$$
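The finite differences $\Delta^m_h$ of (32) are straightforward to compute; the sketch below (an illustration, not from the paper) implements the binomial-sum form and checks the classical identities $\Delta^m_h f=0$ for polynomials $f$ of degree less than $m$, and $\Delta^m_h x^m=m!\,h^m$. A crude numerical estimate of the semi-norm (31) then follows by evaluating $\|\Delta^m_h f\|_p$ over a grid of $h$ values and applying quadrature.

```python
from math import comb

def forward_difference(f, x, h, m):
    """m-th forward finite difference of (32):
    Delta_h^m f(x) = sum_{k=0}^{m} C(m,k) (-1)^(m-k) f(x + k*h)."""
    return sum(comb(m, k) * (-1) ** (m - k) * f(x + k * h) for k in range(m + 1))

cube = lambda x: x ** 3
print(forward_difference(cube, 0.7, 0.1, 4))   # ~0, since deg 3 < m = 4
print(forward_difference(cube, 0.7, 0.1, 3))   # ~3! * 0.1**3 = 0.006
```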
For $p=q=2$, $B^s_{pq}$ coincides with the Sobolev space $W^s$. For fixed $\omega\in\Omega$, each of the kernel functions $g(\cdot,\omega)$ in (4) is in $B^s_{pq}$ for suitable $s>0$ and all $p,q\ge 1$, and hence each finite approximation of the form (28) lies in the same $B^s_{pq}$. For example, the Gaussian kernel of (4a) (along with its $d$-dimensional generalization) satisfies $g_G(\cdot,\omega)\in B^s_{pq}$ for every $s<\infty$ and $p,q\ge 1$, while in $\mathbb{R}^1$ the double-sided Laplace kernel of (4b) satisfies $g_L(\cdot,\omega)\in B^s_{pp}$ for $s<1+1/p$, and the Haar wavelet of (4d) is in $B^s_{pq}$ only for $s<1/p$. To simplify proofs in Section 4.1, we will restrict attention to generating functions $g$ on $\mathbb{R}^d$; the Besov semi-norms defined in terms of differences on bounded domains in Section 5.2.2 of Triebel (1992) may be used to extend these results to bounded domains.

We now provide conditions for LARK models to be in the same Besov space as their generating functions.

4.1. Convergence of LARK models in Besov spaces.

Theorem 3. Fix $g\in B^s_{pq}(\mathbb{R}^d)$ for some $p,q\ge 1$ and $s>0$, and a Lévy measure $\nu$ on $\mathbb{R}\times\Omega$ with $\Omega=(\mathcal{S}^+_d\times\mathbb{R}^d)$ of translation-invariant product form $\nu(d\beta\,d\omega)=\tilde\nu(d\beta\,d\Lambda)\,d\chi$ [here $\omega=(\Lambda,\chi)$] for a $\sigma$-finite measure $\tilde\nu(d\beta\,d\Lambda)$ on $\mathbb{R}\times\mathcal{S}^+_d$ that satisfies the integrability condition (7). Define a location-scale LARK model $f(\cdot)$ on $\mathcal{X}=\mathbb{R}^d$ by $f(x)=\int_\Omega\phi(x,\omega)\,\mathcal{L}(d\omega)$ where $\phi(x,\omega)\equiv g(\Lambda(x-\chi))$ satisfies (18) for each fixed $x\in\mathcal{X}$. Then $f$ has the almost surely convergent series expression
$$f(x)=\sum_j g(\Lambda_j(x-\chi_j))\,\beta_j\tag{34}$$
and $f\in B^s_{pq}$ almost surely if $\tilde\nu$ satisfies
$$\iint_{\mathbb{R}\times\mathcal{S}^+_d}\big(1\wedge|\beta||\Lambda|^{-1/p}\big)\,\tilde\nu(d\beta\,d\Lambda)<\infty,\tag{35a}$$
$$\iint_{\mathbb{R}\times\mathcal{S}^+_d}\big(1\wedge|\beta||\Lambda|^{s-1/p}\big)\,\tilde\nu(d\beta\,d\Lambda)<\infty.\tag{35b}$$

Proof. Equation (18) ensures that the sum in (34) will converge almost surely for each fixed $x\in\mathcal{X}$, with only finitely many terms $|g(\Lambda_j(x-\chi_j))\beta_j|>1$ and an absolutely convergent sum over the terms with $|g(\Lambda_j(x-\chi_j))\beta_j|\le 1$.
The $L_p$ norm of $f$ satisfies the bound
$$\|f\|_p\le\sum_j\|g(\Lambda_j(\cdot-\chi_j))\|_p|\beta_j|=\|g\|_p\sum_j|\Lambda_j|^{-1/p}|\beta_j|$$
by the triangle inequality and Proposition 2 in Appendix A. This is finite almost surely by (35a) since $g\in B^s_{pq}\subset L_p$. The Besov semi-norm of $f$ is bounded by
$$|f|^s_{pq}\le\sum_j|\beta_j|\,|g(\Lambda_j(\cdot-\chi_j))|^s_{pq}=\sum_j|\beta_j|\Big(\int_{|h|\le 1}|h|^{-d-sq}\|\Delta^m_h g(\Lambda_j(\cdot-\chi_j))\|^q_p\,dh\Big)^{1/q}=\sum_j|\beta_j|\Big(\int_{|h|\le 1}|h|^{-d-sq}|\Lambda_j|^{-q/p}\|\Delta^m_{\Lambda_j h}g\|^q_p\,dh\Big)^{1/q}$$
by Proposition 2; changing variables $t=\Lambda_j h$, this is
$$=\sum_j|\beta_j||\Lambda_j|^{s-1/p}\Big(\int_{|\Lambda_j^{-1}t|\le 1}|t|^{-d-sq}\|\Delta^m_t g\|^q_p\,dt\Big)^{1/q}.\tag{36}$$
The integral in (36) is bounded by
$$\int_{\mathbb{R}^d}|t|^{-d-sq}\|\Delta^m_t g\|^q_p\,dt=\int_{|t|\le 1}|t|^{-d-sq}\|\Delta^m_t g\|^q_p\,dt+\int_{|t|>1}|t|^{-d-sq}\|\Delta^m_t g\|^q_p\,dt.$$
The first term is just $(|g|^s_{pq})^q$, and (32) implies $\|\Delta^m_t g\|_p\le 2^m\|g\|_p$, so
$$\le(|g|^s_{pq})^q+\int_{|t|>1}|t|^{-d-sq}(2^m\|g\|_p)^q\,dt=(|g|^s_{pq})^q+\frac{2\pi^{d/2}2^{mq}}{\Gamma(d/2)\,sq}\|g\|^q_p\le(c\|g\|^s_{pq})^q$$
for some $c<\infty$, so
$$|f|^s_{pq}\le c\|g\|^s_{pq}\sum_j|\beta_j||\Lambda_j|^{s-1/p},\tag{37}$$
which is almost surely finite by (35b). □

Each of the kernels $g(\cdot,\omega)$ considered in the examples in Sections 7 and 8 may be shown to be in some Besov space $B^s_{pq}$, and each is bounded by $\|g\|_\infty\le 1$. Corollary 3 establishes that each of our LARK models with a Lévy measure that satisfies (7) is in the same space $B^s_{pq}$ as its generating function.

Corollary 3. Let $f(x)=\int\phi(x,\omega)\,\mathcal{L}(d\omega)$ be a one-dimensional LARK model on a compact set $\mathcal{X}\subset\mathbb{R}$, with product Lévy measure $\nu(d\beta\,d\omega)=\nu_\beta(d\beta)\,\pi_\lambda(d\lambda)\,d\chi$ on $\mathbb{R}\times\mathbb{R}_+\times\mathcal{X}$ satisfying (7), with Gamma probability measure $\pi_\lambda(d\lambda)=\mathrm{Ga}(a_\lambda,b_\lambda)$ and location-scale generator $\phi(x,\omega)=g(\lambda(x-\chi))$ with bounded $g\in B^s_{pq}$. Then $f\in B^s_{pq}$ almost surely if $a_\lambda>1/p$, for $p,q\ge 1$ and $s>0$.
In particular, if $a_\lambda\ge 1$ then $f\in B^s_{pq}$ if $g\in B^s_{pq}$, for all $p,q\ge 1$ and $s>0$.

Proof. Equation (18) holds for bounded $g\in B^s_{pq}$ with Lévy measures of the form indicated; the conditions on $a_\lambda$ ensure that also $\int_{\mathbb{R}_+}\lambda^{-1/p}\,\pi_\lambda(d\lambda)<\infty$ and $\int_{\mathbb{R}_+}\lambda^{s-1/p}\,\pi_\lambda(d\lambda)<\infty$, so the bounds of (35) hold. □

4.2. Comparisons with Abramovich, Sapatinas and Silverman. The stochastic wavelet expansion of Abramovich, Sapatinas and Silverman (2000) may be viewed as a LARK model using the wavelet generator (4e), with coefficients that, when conditioned on the scale parameters $\{a_j\}$, have independent Gaussian distributions $\beta_j\stackrel{\mathrm{ind}}{\sim}\mathrm{No}(0,ca_j^{-\delta})$, with $\omega=(a,b)\in[a_0,\infty)\times[0,1)$ and $\nu_\omega(d\omega)\propto a^{-\xi}\mathbf{1}_{\{a\ge a_0\}}\,db\,da$ for some $c,\delta,\xi\ge 0$. The parameters $\delta$ and $\xi$ control the size and frequency of wavelet coefficients and determine whether the expansion will have a well-defined limit. For a finite Lévy measure $\nu_\omega(d\omega)$ ($\xi>1$) the expansion has finitely many terms almost surely; for $\xi\le 1$, the Poisson mean is no longer finite; however, Abramovich, Sapatinas and Silverman (2000) provide conditions on $\delta$ and $\xi$ so that $f$ falls in the corresponding Besov space of the generating wavelet.

For "simplicity of exposition," Abramovich, Sapatinas and Silverman work with functions of unit period [i.e., satisfying $g(x)=g(x+1)$] and regard them as functions on the unit torus $\mathbb{T}$, the interval $[0,1]$ with the endpoints identified. We now illustrate how the LARK theory may be used to prove that the resulting expansion lies in $B^s_{pq}(\mathbb{T})$ if the generating function does. The Besov sequence norms used by Abramovich, Sapatinas and Silverman and others are natural for the Gaussian distributions and discrete wavelet expansions they study; we have found the (equivalent) function norms to be more convenient for the continuous wavelet expansions using non-Gaussian ($\alpha$-Stable, e.g.) distributions for the coefficients in our expansions.
We follow Nikol'skiĭ [(1975), Sections 1.1.1 and 4.3.5] in defining Besov norms on the torus by replacing the $L_p$ norm on $\mathbb{R}$ with that over $\mathbb{T}$ in the definition of the Besov semi-norm and norm [see (31), (33)], and in denoting the corresponding spaces by $L^*_p(\mathbb{T})$ and $B^{s*}_{pq}(\mathbb{T})$, respectively. To simplify the proof, we will use the following lemma.

Lemma 1. Let $\pi_z(dz)$ denote the standard normal distribution on $\mathbb{R}$, let $g\in L^*_p(\mathbb{T})$ with $p\ge 1$ and let $r\in\{1,2\}$. Then
$$\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|zg(u)^r|\lambda^{-a}\big)\lambda^{-b}\,\pi_z(dz)\,d\lambda\,du<\infty$$
for any $a\in\mathbb{R}$ if $b>1$, and for all $a>1-b$ if $b\le 1$.

The proof is given in Appendix A.1.

Theorem 4. Let $g\in B^{s*}_{pq}(\mathbb{T})$ for some $p,q\ge 1$ and $s>0$. Let $\mathcal{L}(d\omega)$ be a random field on $\Omega=[1,\infty)\times\mathbb{T}$ with Lévy measure
$$\nu(d\beta\,d\lambda\,d\chi)=\frac{1}{\sqrt{2\pi}}\lambda^{\delta/2-\zeta}e^{-\beta^2\lambda^\delta/2}\,d\beta\,d\lambda\,d\chi\tag{38}$$
on $\mathbb{R}\times\Omega$ with $\delta,\zeta\ge 0$. Then the LARK model $f(x)=\int_\Omega\lambda^{1/2}g(\lambda(x-\chi))\,\mathcal{L}(d\omega)$ has an absolutely convergent expansion
$$f(x)=\sum_j\beta_j\lambda_j^{1/2}g(\lambda_j(x-\chi_j)),\qquad 0\le x<1,\tag{39}$$
provided that $(\delta-1)/2>1-\zeta$ for $0\le\zeta\le 1$, or for any $\delta\ge 0$ if $\zeta>1$. Also $f(\cdot)\in B^{s*}_{pq}(\mathbb{T})$ almost surely for $(\delta-1)/2>s+1-\zeta$ if $0\le\zeta\le 1$, or for any $\delta\ge 0$ if $\zeta>1$.

Proof. The absolute convergence of (39) for each $x$ will follow from Proposition 1 if we can verify the condition of (18), that is, finiteness of the integral
$$\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|\beta\lambda^{1/2}g(\lambda(x-\chi))|\big)\,\nu(d\beta\,d\lambda\,d\chi).\tag{40}$$
Applying the change of variables $z=\lambda^{\delta/2}\beta$,
$$=\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|z|\lambda^{(1-\delta)/2}|g(\lambda(x-\chi))|\big)\lambda^{-\zeta}\,\pi_z(dz)\,d\lambda\,d\chi,\tag{41}$$
where $\pi_z(dz)$ is the standard normal distribution. Since the term in parentheses is bounded by one, (41) is finite for all $\delta$ and $g$ if $\zeta>1$.
For $0\le\zeta\le 1$, change variables to $u=\lambda(x-\chi)$ and apply periodicity:
$$=\int_{\mathbb{R}}\int_1^\infty\int_{\lambda(x-1)}^{\lambda x}\big(1\wedge|zg(u)|\lambda^{(1-\delta)/2}\big)\,du\,\lambda^{-1-\zeta}\,d\lambda\,\pi_z(dz)$$
which, due to periodicity, satisfies the bounds
$$\le\int_{\mathbb{R}}\int_1^\infty\int_0^1\big(1\wedge|zg(u)|\lambda^{(1-\delta)/2}\big)\,du\,\lceil\lambda\rceil\lambda^{-1-\zeta}\,d\lambda\,\pi_z(dz)\le 2\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|zg(u)|\lambda^{(1-\delta)/2}\big)\lambda^{-\zeta}\,du\,d\lambda\,\pi_z(dz),$$
where $\lceil\lambda\rceil$ denotes the least integer $\ge\lambda$. By Lemma 1 this is finite for $0\le\zeta\le 1$ and $(\delta-1)/2>1-\zeta$ with $g\in B^{s*}_{pq}$, so (18) holds and Proposition 1 ensures convergence.

The $L^*_p$ norms of the $m$th forward differences of a periodic function $g(\cdot)\in B^{s*}_{pq}(\mathbb{T})$ and their scaled translates $g(\lambda(\cdot-\chi))$ for $\chi\in\mathbb{T}$ and positive scale $\lambda\in[1,\infty)$ are related by
$$\|\Delta^m_h g(\lambda(\cdot-\chi))\|^*_p\le 2^{1/p}\|\Delta^m_{\lambda h}g\|^*_p\tag{42}$$
since, by a change of variables $u=\lambda(x-\chi)$,
$$\|\Delta^m_h g(\lambda(\cdot-\chi))\|^*_p=\lambda^{-1/p}\Bigg\{\int_{-\lambda\chi}^{\lambda(1-\chi)}\Bigg|\sum_{k=0}^m\binom{m}{k}(-1)^{m-k}g(u+k\lambda h)\Bigg|^p du\Bigg\}^{1/p},$$
which, again from periodicity, satisfies
$$\le\Big(\frac{\lceil\lambda\rceil}{\lambda}\Big)^{1/p}\Bigg\{\int_0^1\Bigg|\sum_{k=0}^m\binom{m}{k}(-1)^{m-k}g(u+k\lambda h)\Bigg|^p du\Bigg\}^{1/p}=\Big(\frac{\lceil\lambda\rceil}{\lambda}\Big)^{1/p}\|\Delta^m_{\lambda h}g\|^*_p,$$
while $\lceil\lambda\rceil/\lambda\le 2$ for $\lambda\ge 1$. The Besov semi-norm of $f$ is bounded by
$$|f|^{s*}_{pq}\le\sum_j|\beta_j|\lambda_j^{1/2}|g(\lambda_j(\cdot-\chi_j))|^{s*}_{pq}=\sum_j|\beta_j|\lambda_j^{1/2}\Big(\int_{|h|\le 1}|h|^{-1-sq}\|\Delta^m_h g(\lambda_j(\cdot-\chi_j))\|^{*q}_p\,dh\Big)^{1/q}$$
$$\le\sum_j|\beta_j|\lambda_j^{1/2}\Big(\frac{\lceil\lambda_j\rceil}{\lambda_j}\Big)^{1/p}\Big(\int_{|h|\le 1}|h|^{-1-sq}\|\Delta^m_{\lambda_j h}g\|^{*q}_p\,dh\Big)^{1/q}=\sum_j|\beta_j|\lambda_j^{s+1/2}\Big(\frac{\lceil\lambda_j\rceil}{\lambda_j}\Big)^{1/p}\Big(\int_{|t|\le\lambda_j}|t|^{-1-sq}\|\Delta^m_t g\|^{*q}_p\,dt\Big)^{1/q}.\tag{43}$$
The integral in (43) is bounded by
$$\int_{\mathbb{R}}|t|^{-1-sq}\|\Delta^m_t g\|^{*q}_p\,dt=\int_{|t|\le 1}|t|^{-1-sq}\|\Delta^m_t g\|^{*q}_p\,dt+\int_{|t|>1}|t|^{-1-sq}\|\Delta^m_t g\|^{*q}_p\,dt.$$
The first term is just $(|g|^{s*}_{pq})^q$, and (32) implies $\|\Delta^m_t g\|^*_p\le 2^m\|g\|^*_p$, so
$$\le(|g|^{s*}_{pq})^q+\int_{|t|>1}|t|^{-1-sq}(2^m\|g\|^*_p)^q\,dt=(|g|^{s*}_{pq})^q+\frac{2^{mq+1}}{sq}\|g\|^{*q}_p\le(c\|g\|^{s*}_{pq})^q$$
for some $c<\infty$, so
$$|f|^{s*}_{pq}\le c\|g\|^{s*}_{pq}\sum_j|\beta_j|\lambda_j^{s+1/2}\tag{44}$$
is almost surely finite if and only if
$$\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|\beta|\lambda^{s+1/2}\big)\,\nu(d\beta\,d\lambda\,d\chi)$$
is finite. Applying the change of variables $z=\lambda^{\delta/2}\beta$,
$$=\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|z|\lambda^{s+(1-\delta)/2}\big)\lambda^{-\zeta}\,\pi_z(dz)\,d\lambda\,d\chi$$
is finite by Lemma 1 for all $\delta\ge 0$ if $\zeta>1$, and for $(\delta-1)/2>s+1-\zeta$ if $0\le\zeta\le 1$. Similarly the $L^*_p$ norm of $f$ satisfies a bound of the form
$$\|f\|^*_p\le c\|g\|^*_p\sum_j|\beta_j|\lambda_j^{1/2}$$
for some $c<\infty$. This is finite almost surely if
$$\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|\beta|\lambda^{1/2}\big)\,\nu(d\beta\,d\lambda\,d\chi)=\iiint_{\mathbb{R}\times[1,\infty)\times\mathbb{T}}\big(1\wedge|z|\lambda^{(1-\delta)/2}\big)\lambda^{-\zeta}\,\pi_z(dz)\,d\lambda\,d\chi$$
is finite, which follows from Lemma 1 for all $\delta\ge 0$ if $\zeta>1$ and, for $\zeta\le 1$, for $\delta$ satisfying $(\delta-1)/2>1-\zeta$, since $g\in B^{s*}_{pq}\subset L^*_p$. Combining conditions, the $B^{s*}_{pq}$ norm of $f$ is finite if $\delta/2-1/2>s+1-\zeta$ for $0\le\zeta\le 1$, or for any $\delta\ge 0$ if $\zeta>1$. □

For Lévy measures $\nu(d\beta\,d\lambda\,d\chi)$ supported on $\mathbb{R}\times\mathbb{N}\times\mathbb{T}$ (i.e., for which $\lambda$ is almost-surely integral) the function $f(x)$ of (39) would inherit periodicity from the generator $g(\lambda_j(x-\chi_j))$ but, for the absolutely-continuous measure of (38), it is the definition of $f(x)$ as a function on $\mathbb{T}$ [as in Abramovich, Sapatinas and Silverman (2000), equation (2)] that induces periodicity. The model restricts scales to $\lambda\ge\lambda_0>0$ (here $\lambda_0=1$).

4.3. Compensation. For Lévy measures satisfying only the local-$L_2$ bound of (9) and not the local-$L_1$ bound of (7), we must use the definition of $f(x)$ in (14) and use (16) to establish conditions that ensure $f$ will be well defined for $g\in B^s_{pq}$. We verify these conditions for the existence of LARK models under symmetric $\alpha$-Stable random fields.

Theorem 5.
For a Symmetric $\alpha$-Stable random field with Lévy measure of the form $\nu(d\beta\,d\omega)=c_\alpha\alpha|\beta|^{-1-\alpha}\,d\beta\,\pi(d\Lambda)\,d\chi$ on $\mathbb{R}\times\mathcal{S}^+_d\times\mathbb{R}^d$ for $1\le\alpha<2$, with $\pi(d\Lambda)$ a probability measure on $\mathcal{S}^+_d$ and $g\in B^s_{pq}(\mathbb{R}^d)\cap L_1(\mathbb{R}^d)$ for $p,q\ge 1$ and $s>0$, the conditions of (16) for $f(x)$ to be well defined by Theorem 1 are satisfied for $1<\alpha\le p$, $\alpha<2$, if $\mathsf{E}[|\Lambda|^{-1}]<\infty$. For $\alpha=1$, there is the additional requirement that
$$\int_{\mathbb{R}^d}\big|g(u)\log|g(u)|\big|\,du<\infty.\tag{45}$$

Proof. Fix $x\in\mathcal{X}$. By the affine change of variables $u\equiv\Lambda(x-\chi)$,
$$\iint_{[-1,1]^c\times\Omega}(1\wedge|\beta\phi(x,\omega)|)\,\nu(d\beta\,d\omega)=2c_\alpha\alpha\int_{\mathcal{S}^+_d}|\Lambda|^{-1}\,\pi(d\Lambda)\iint_{[1,\infty)\times\mathbb{R}^d}(1\wedge\beta|g(u)|)\beta^{-1-\alpha}\,d\beta\,du$$
$$=2c_\alpha\alpha\,\mathsf{E}[|\Lambda|^{-1}]\int_{\mathbb{R}^d}\Big\{\int_1^{|g(u)|^{-1}}\beta^{-\alpha}|g(u)|\,d\beta+\int_{|g(u)|^{-1}}^\infty\beta^{-1-\alpha}\,d\beta\Big\}\,du.$$
For $1<\alpha<2$ this is
$$\le 2c_\alpha\alpha\,\mathsf{E}[|\Lambda|^{-1}]\Big\{\int_{\mathbb{R}^d}\frac{|g(u)|-|g(u)|^\alpha}{\alpha-1}\,du+\int_{\mathbb{R}^d}\frac{|g(u)|^\alpha}{\alpha}\,du\Big\},$$
which is finite for $1<\alpha\le p$ since $g\in L_1$ and $g\in B^s_{pq}\subset L_p$. For $\alpha=1$,
$$=2c_1\,\mathsf{E}[|\Lambda|^{-1}]\Big\{\int_{\mathbb{R}^d}-|g(u)|\log|g(u)|\,du+\int_{\mathbb{R}^d}|g(u)|\,du\Big\}.$$
The first integral exists and is finite by (45) while the second is finite since $g\in L_1$. Similarly, the integral in (16b) is
$$\iint_{[-1,1]\times\Omega}\big(|\beta\phi(x,\omega)|^2\wedge|\beta\phi(x,\omega)|\big)\,\nu(d\beta\,d\omega)=2c_\alpha\alpha\,\mathsf{E}[|\Lambda|^{-1}]\Big\{\iint_{[0,1]\times\mathbb{R}^d}\big(|\beta g(u)|^2\wedge|\beta g(u)|\big)\beta^{-1-\alpha}\,d\beta\,du\Big\}.$$
The integral in braces,
$$\iint_{[0,1\wedge|g(u)|^{-1}]\times\mathbb{R}^d}\beta^{1-\alpha}g(u)^2\,d\beta\,du+\iint_{[1\wedge|g(u)|^{-1},1]\times\mathbb{R}^d}\beta^{-\alpha}|g(u)|\,d\beta\,du,$$
is finite for $1<\alpha\le p$, $\alpha<2$:
$$\le\int_{\mathbb{R}^d}\frac{|g(u)|^\alpha}{2-\alpha}+\frac{|g(u)|^\alpha-|g(u)|}{\alpha-1}\,du\le\frac{\|g\|_1+\|g\|^p_p}{(2-\alpha)(\alpha-1)}<\infty,$$
while for $\alpha=1$,
$$\le\int_{\mathbb{R}^d}\big\{|g(u)|+|g(u)\log|g(u)||\big\}\,du<\infty$$
by (45). Finally, (16c) holds because
$$\iint_{\mathbb{R}\times\Omega}(1\wedge\beta^2)|\phi(x,\omega)|\,\nu(d\beta\,d\omega)=\mathsf{E}[|\Lambda|^{-1}]\,c_\alpha\alpha\iint_{\mathbb{R}\times\mathbb{R}^d}(1\wedge\beta^2)|\beta|^{-1-\alpha}|g(u)|\,d\beta\,du=\mathsf{E}[|\Lambda|^{-1}]\,\|g\|_1\,c_\alpha\alpha\int_{\mathbb{R}}(1\wedge\beta^2)|\beta|^{-1-\alpha}\,d\beta<\infty.$$
(cid:3) All of the generator functions in the examples in Section 7 satisfy theconditions of the theorem for the Cauchy random field ( α = 1), so the LARKmodels are well defined as ε → ε > 0, the approximations arein the same Besov space as g . We are able to show that this also holds forSobolev W s spaces (which are equivalent to B s ) even when compensationis required, but this remains an open question for B spq with general p and q .4.4. Convergence in W s . Theorem 6. Let { φ ( x, ω ) } be a location-scale family of the form φ ( x, ω ) ≡ g (Λ( x − χ )) for ω = ( χ, Λ) with χ ∈ R d and nonsingular d × d matrix Λ ∈ S d + for some function g ( · ) ∈ W s with s ≥ . Let ν be a L´evy measure satisfyingthe condition Z Z R × Ω | Λ | − [1 + ρ (Λ) s ](1 ∧ β ) ν ( dβ dω ) < ∞ , (46) where ρ (Λ) denotes the spectral radius (largest eigenvalue) of Λ . Recall f ( x ) ≡ Z Z R × Ω φ ( x, ω )[ β − h ( β )] N ( dβ dω ) (19) + Z Z R × Ω φ ( x, ω ) h ( β ) ˜ N ( dβ dω ) and, for ε > , define f ε ( x ) ≡ Z Z [ − ε,ε ] c × Ω φ ( x, ω )[ β − h ( β )] N ( dβ dω )+ Z Z [ − ε,ε ] c × Ω φ ( x, ω ) h ( β ) ˜ N ( dβ dω )(47) R. L. WOLPERT, M. A. CLYDE AND C. TU = X ≤ j First, consider the case of compensator functions satisfying h ( β ) = β for all | β | ≤ 1. Apply an affine change of variables to see that φ ( x, ω ) has Fourier transform (in x )ˆ φ ( ξ, ω ) = e iξ · χ | Λ | − ˆ g (Λ − ξ ) . For 0 < ε < ε < x ∈ R d , set ∆( x ) ≡ f ε ( x ) − f ε ( x ) and let A ≡ { ε < | β | ≤ ε } × Ω. Then∆( x ) = X ≤ j 0, (48) has the additional nonrandom term (cid:12)(cid:12)(cid:12)(cid:12)Z Z A e iξ · χ ˆ g (Λ − ξ ) | Λ | ( β − h ( β )) ν ( dβ dω ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ c (cid:18)Z Z A | ˆ g (Λ − ξ ) || Λ | β ν ( dβ dω ) (cid:19) leading at most to an additional constant factor of [1+ c RR R × Ω (1 ∧ β ) ν ( dβ dω )]in (49), leading as before to k f − f ε k k W s → (cid:3) Corollary 4. 
If { φ ( x, ω ) } is a location-scale family of the form con-sidered in Theorem 6 and if a L´evy measure ν is of product form ν ( dβ dω ) = ν β ( dβ ) π ω ( dω ) for some σ -finite measure ν β ( dβ ) on R and probability mea-sure π ω ( · ) on Ω that for some s ≥ satisfy Z R (1 ∧ β ) ν β ( dβ ) < ∞ , (50a) Z Ω | Λ | − ((1 + ρ (Λ)) s ) π ω ( dω ) < ∞ , (50b) then ν ( dβ dω ) also satisfies (46) and hence f ε ( · ) → f ( · ) in W s almost surelyas ε → . For example, in one dimension, (50b) is satisfied for all s > λ hasthe χ ν distribution with ν > λ ∼ Ga ( α λ , β λ )with α λ > . More generally, for any m > s > λ m ∼ Ga ( α λ , β λ ) with α λ > /m or, for m < 0, for α λ > (1 − s ) /m .Recall that the quantity ε introduced in the proof of Theorem 6 and thestatement of Corollary 4 is not a model parameter and has no bearing onthe Sobolov spaces to which the limiting function f ( · ) belongs; it is onlya tool used in proofs and implementations, to which we now turn. 5. Inference for LARK models. The LARK model introduced in Sec-tion 1 may now be summarized as E [ Y ( x ) | L , θ ] = f ( x ) ≡ Z Ω φ ( x, ω ) L ( dω ) , (51) L | θ ∼ L´evy ( ν ) ,θ ∼ π θ ( dθ )with implicit dependence of the L´evy measure ν ( dβ dω ) and conditionaldistribution for Y ( x ) on a hyperparameter vector θ . In all of our examples,we take ν to be a product measure ν ( dβ dω ) = ν β ( β ) dβ | Ω | π ω ( dω ) satisfying R. L. WOLPERT, M. A. CLYDE AND C. TU the conditions of Corollary 2, with π ω ( · ) a probability measure on Ω, | Ω | a measure of the volume of Ω, and ν β ( · ) > R satisfying R R (1 ∧ β ) ν β ( β ) dβ < ∞ [so ν satisfies (9)], for which either(a) ν also satisfies (7) or (b) ν β ( β ) is even and h ( β ) is odd in β . 
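Concretely, a single draw from a truncated prior of this form is a finite sum f(x) = Σ_j β_j g(λ_j(x − χ_j)). The sketch below uses a hypothetical Gaussian-bump generator g and illustrative distributions for J, χ_j, λ_j and β_j (not the paper's elicited choices) to show how such a draw can be evaluated on a grid:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(u):
    # hypothetical generator: a Gaussian bump (any g in the Besov class works)
    return np.exp(-0.5 * u**2)

def draw_lark_prior(n_grid=512, domain=(0.0, 10.0), mean_J=25.0):
    """Draw f(x) = sum_j beta_j g(lambda_j (x - chi_j)) from a finite
    (truncated) LARK-style prior, with illustrative distributions:
    J ~ Poisson, chi_j uniform on the domain, lambda_j ~ Gamma."""
    x = np.linspace(*domain, n_grid)
    J = rng.poisson(mean_J)                        # number of kernels
    chi = rng.uniform(*domain, size=J)             # locations
    lam = rng.gamma(shape=2.0, scale=2.0, size=J)  # inverse scales
    beta = rng.laplace(scale=1.0, size=J)          # coefficients (illustrative)
    f = np.zeros(n_grid)
    for b, l, c in zip(beta, lam, chi):
        f += b * g(l * (x - c))
    return x, f

x, f = draw_lark_prior()
print(f.shape)  # (512,)
```

The coefficient and kernel-parameter distributions here are placeholders; Sections 5.1.1–5.1.3 derive the ones actually induced by the truncated Gamma, symmetric Gamma and α-Stable Lévy measures.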
Thus, wehave the representation θ ∼ π θ ( dθ ) , (52a) J | θ ∼ Po ( ν + ε ) , ν + ε ≡ ν ε ( R × Ω) , (52b) { ( β j , ω j ) } ≤ j Motivated by the applications inSection 8, we now focus on LARK models built on approximations to Gamma,symmetric Gamma and Symmetric α -Stable (in particular, Cauchy) L´evyrandom fields, and quantify the approximation errors to facilitate the selec-tion of ε and other prior hyperparameters.5.1.1. Gamma LARK models. The Gamma random field of Section 2.5.2has ν β ( dβ ) = γβ − e − βη { β> } dβ for some constants γ > η > 0. Theparameter η in (22) controls both the Poisson rate of mass points { ( β j , ω j ) } of magnitude | β | > ε and the probability distribution of those magnitu-des { β j } . To facilitate elicitation we disentangle those two roles by trun-cating at | βη | ≥ ε (rather than | β | ≥ ε ); of course the limit as ε → J and { β j } are now given by J ∼ Po ( ν + ε ) , ν + ε = γ | Ω | E ( ε ) ,β j i . i . d . ∼ π β ( β j ) dβ j , π β ( β j ) = β j − e − β j η E ( ε ) { β j η>ε } , where the exponential integral function [Abramowitz and Stegun (1964),page 228] is denoted as E ( z ) ≡ R ∞ z t − e − t dt . With this truncation, theexpected square L norm of the loss due to truncation for any φ ∈ L (Ω , | Ω | π ω ( dω )), such as φ ( ω ) = φ ( x, ω ), is E |L [ φ ] − L ε [ φ ] | = Z Z R × Ω φ ( ω ) | β | {| βη |≤ ε } ν ( dβ dω ) ´EVY ADAPTIVE REGRESSION KERNELS = k φ k Z ε/η β ν β ( β ) dβ (53a) = γη − k φ k [1 − (1 + ε ) e − ε ] , showing the rate at which L ε [ φ ] → L [ φ ] in L as ε → 0. This is used inSection 5.2 to guide the elicitation of hyperparameters.5.1.2. Symmetric Gamma LARK models. The symmetric Gamma ran-dom field of Section 2.5.3 has L´evy measure ν β ( dβ ) = γ | β | − e −| β | η dβ forsome constants γ > η > 0. Once again truncation at | βη | > ε leads to J ∼ Po ( ν + ε ) , ν + ε = 2 γ | Ω | E ( ε ) β j i . i . d . 
∼ π β ( β j ) dβ j , π β ( β j ) = | β j | − e −| β j | η ( ε ) {| β j η | >ε } and expected squared discrepancy (used for elicitation) E |L [ φ ] − L ε [ φ ] | = 2 γη − k φ k [1 − (1 + ε ) e − ε ] . (53b)5.1.3. Symmetric α -Stable LARK models. The S α S L´evy random fieldof Section 2.5.4 has ν β ( dβ ) = ˙ γαπ Γ( α ) sin πα | β | − α − dβ for some constants˙ γ > < α < 2. To facilitate elicitation and posterior inference, wewrite ˙ γ = γη − α and (again) truncate at | β j η | > ε . This leads to J ∼ Po ( ν + ε ) , ν + ε = γ | Ω | π Γ( α ) sin πα ε − α β j i . i . d . ∼ π β ( β j ) dβ j , π β ( β j ) = αε α η α | β j | − α − {| β j η | >ε } with symmetric Pareto distributions for the coefficients { β j } . For the Cauchy( α = 1), these simplify to ν + ε = 2 γ | Ω | / ( πε ), with π β ( β j ) = ε η | β j | − {| β j η | >ε } . Although the total variation |L| is almost surely infinite, and even |L − L ε | will be infinite for α ≥ 1, still for φ ∈ L (Ω , | Ω | π ω ( dω )) the expected squareddiscrepancy is finite: E |L [ φ ] − L ε [ φ ] | = Z Z R × Ω φ ( ω ) | β | {| βη |≤ ε } ν ( dβ dω )(53c) = 2 γη − k φ k (cid:20) Γ( α + 1) π (2 − α ) sin πα ε − α (cid:21) or 2 γη − k φ k [ ε/π ] for the Cauchy case α = 1. R. L. WOLPERT, M. A. CLYDE AND C. TU Prior elicitation of hyperparameters. We now turn to the selectionof ε > 0, the vector θ ∈ Θ of (52), and the L´evy measure ν ( dβ dω ). In each ofour examples θ ≡ ( γ, η ) for rate parameters γ and η governing the frequencyand magnitude of coefficients { β j } , respectively, and the expected squaredtruncation error for L ε [ φ ] for φ ( ω ) = φ ( x, ω ) is of the form E |L [ φ ( x, · )] −L ε [ φ ( x, · )] | = γη − k φ ( x, · ) k c ( ε ) for some c ( ε ) > c ( ε ) → ε → J of terms in the stochastic expansion; (2) desired range of coefficientmagnitudes { β j } ; and (3) tolerable expected truncation error. We first selecta L´evy family (Gamma, α -Stable, etc.) 
to meet the needs of a particularproblem for symmetry or positivity, sharp or heavy tails, etc. Each of ourL´evy measures is of the product form ν ( dβ dω ) = ν β ( dβ ) π ω ( dω ) considered inTheorem 6 and Corollary 4, with location, scale, and perhaps other location-specific (and hence adaptive) attributes encoded in ω ∈ Ω in problem-specificways.Hyperparameters in the L´evy measure ν β ( dβ ) govern sparseness for LARKmodels, that is, the number J of terms in the stochastic expansion. In eachLARK model, J has a Poisson distribution with mean proportional to γ .The coefficient of variation under the Poisson distribution falls to zero asthe mean increases, overstating the prior certainty for large values of E J .To ameliorate this, we introduce an additional layer of hierarchy by placinga Gamma prior distribution on the parameter γ ∼ Ga ( a γ , b γ ), leading to theoverdispersed negative binomial prior distribution for J ∼ NB ( a J , p J ). Theparameter η governs the scale of the coefficients { β j } , and hence the rangeof the regression function f ( · ). We employ a Gamma distribution for thescale parameter η − ∼ Ga ( a η , b η ). Together the hyperparameters ε , a γ , b γ , a η , b η determine the prior distributions for J , for the coefficients { β j } (andhence the range of f ( · )), and for the expected mean-square truncation error.We select values for these five parameters to meet five criteria: attain twospecified quantiles (such as a central 99% interval) for each of J and { β j } ,and a specified bound on the expected truncation error E γη − k φ ( x, · ) k c ( ε ).Typically this involves an iterative numerical solution.As a default choice, we take π ( dω ) = π χ ( dχ ) π λ ( dλ ) to be the product ofthe uniform distribution for locations χ ∼ Un ( X ) and a Gamma distribu-tion for inverse (distance) scale parameters λ ∼ Ga ( a λ , b λ ). 
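The Poisson-mixed-over-Gamma construction for J described above can be checked by simulation; the hyperparameter values below are arbitrary placeholders, chosen only to exhibit the overdispersion of the resulting negative binomial relative to a pure Poisson:

```python
import numpy as np

rng = np.random.default_rng(1)
a_gam, b_gam, omega_vol = 3.0, 0.5, 10.0   # illustrative hyperparameters

# J | gamma ~ Poisson(gamma * |Omega|), gamma ~ Ga(a, b)  =>  J ~ NegBin
gamma = rng.gamma(a_gam, 1.0 / b_gam, size=200_000)
J = rng.poisson(gamma * omega_vol)

mean_J, var_J = J.mean(), J.var()
# Negative binomial: Var[J] = E[J] * (1 + |Omega|/b) > E[J]
print(mean_J, var_J)
assert var_J > mean_J          # overdispersed relative to Poisson
```

Here E[J] = (a_γ/b_γ)|Ω| = 60, while the variance is inflated by the factor (1 + |Ω|/b_γ), which is the extra prior uncertainty about J that the hierarchy is designed to provide.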
The shape andrate hyperparameters a λ and b λ govern the range of probable values for thelocation-specific inverse scale parameters { λ j } and hence for the smoothnessof f ( x ), similar to how bandwidth selection governs smoothness in other ker-nel methods. A kernel at ω j = ( χ j , λ j ) will represent a feature located at χ j of width 1 /λ j , so large values of λ j are needed to fit a very “spiky” part ofa curve, while a smoother part of a curve may be fit most parsimoniouslyusing small values of λ j . The prior distribution for λ j must support an ade- ´EVY ADAPTIVE REGRESSION KERNELS quate range of values in order to fit a spatially inhomogeneous curve. Valuesof a λ > E [ λ ] < ∞ and a finite covariance function; we choose( a λ , b λ ) to attain two specified quantiles, such as a central 99% interval.5.3. Posterior inference. The joint posterior density of all parametersunder the LARK model of (52), given observations Y = { Y i } , is p ( γ, η, J, β , ω | Y ) ∝ π γ ( γ ) π η ( η ) exp[ − ν ε ( R × Ω)] J !(54) × (cid:26) Y ≤ j 6. Relation of LARK to other models. Gaussian processes or random fields. For any positive Borel mea-sure Σ( dω ) on a complete separable metric space Ω, there exists a Gaussianrandom measure Z ( dω ) on Ω that assigns to disjoint Borel sets A i ⊂ Ω offinite measure Σ( A i ) < ∞ independent mean-zero Gaussian random vari-ables Z ( A i ) ∼ No (0 , Σ( A i )) of variance E Z ( A i ) = Σ( A i ). For any kernelfunction g on X × Ω with φ ( x, · ) ∈ L (Ω , Σ( dω )) for each x ∈ X , this inducesa mean-zero Gaussian random field through the Wiener stochastic integral f ( x ) = Z Ω φ ( x, ω ) Z ( dω )with covariance C ( x, y ) = E [ f ( x ) f ( y )] = R Ω φ ( x, ω ) φ ( y, ω )Σ( dω ). 
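The induced covariance C(x, y) = ∫_Ω φ(x, ω)φ(y, ω)Σ(dω) can be approximated by Monte Carlo over ω. A sketch with a hypothetical squared-exponential kernel and a uniform mixing measure of total mass one (both assumptions, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(x, chi, lam=2.0):
    # hypothetical kernel phi(x, omega) with omega = chi (location only)
    return np.exp(-0.5 * (lam * (x - chi))**2)

def cov(x, y, n_mc=100_000, domain=(0.0, 10.0)):
    """Monte Carlo estimate of C(x, y) = E_omega[phi(x, omega) phi(y, omega)]
    for Sigma(d omega) = uniform distribution on the domain (mass 1)."""
    chi = rng.uniform(*domain, size=n_mc)
    return np.mean(phi(x, chi) * phi(y, chi))

c_xx = cov(3.0, 3.0)
c_xy = cov(3.0, 3.5)
print(c_xx, c_xy)
assert c_xx > c_xy > 0.0       # covariance decays with distance
```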
The Gaus-sian random measure Z ( dω ) is the special case of a L´evy random mea-sure L ( dω ) defined earlier in (8) with δ ( dω ) ≡ ν ( dβ dω ) ≡ C ( x, y ) = c ( x − y ) may be writtenin the above form if the spectral measure has a density function ˆ c ( ω ) = R X e − iω · x c ( x ) dx whose square root is Lebesgue integrable, for example, theMat´ern class [Stein (1999), page 31] in R d with smoothness parameter ν > d/ 2. The Gaussian random field model above may also be obtained R. L. WOLPERT, M. A. CLYDE AND C. TU as the limit as α → α -Stable LARK models consideredherein, providing an alternative method for inference that avoids the need forlarge matrix inversions. To maintain a unified computational approach, wehave limited our attention in this article to LARK models with pure-jumpL´evy random measures, that is, Σ( · ) ≡ Compound Poissons and mixtures of Gaussian random fields. Mix-tures of Gaussian random fields may be constructed as LARK models withL´evy measure of the form ν ( dβ dω ) = (2 πσ ω ) − / e − β / σ ω dβν ω ( dω )(55)leading to mean functions of the form f ( x i ) = P ≤ j 0, if one scales the expected number E J of terms(as a function of τ ) properly, this model converges to a LARK model withinfinite L´evy measure (and so is not sensitive to the cut-off ε , which merelyquantifies how close is this approximation). 
If E[J] is not scaled properly to converge to a LARK model, the limiting results may depend critically on arbitrary and unintentional choices.

This may be implemented explicitly in LARK form by placing independent Ga(α/2, ε/2) prior distributions on σ_ω^{−2} in (55) to achieve independent univariate Student t_α(0, ε) distributions for the coefficients {β_j} and (approximately, as the parameter ε → 0) the heavy-tailed Symmetric α-Stable process for f(x) of Sections 2.5.4 and 5.1.3 [this also illustrates that truncating the support of β_ω is not the only way to construct suitable approximating sequences of finite Lévy measures ν_ε(dβ dω) ⇒ ν(dβ dω) for which the integrals in (27) converge]. An important feature of our infinitely divisible construction (in contrast to a compound Poisson approach from other distributional families) is that in each case, as ε → 0, the approximations converge to a well-defined limiting model.

6.3. Finite dimensional frames. LARK may be viewed as a limit of Bayesian variable selection methods with finite frames or dictionaries. Wolfe, Godsill and Ng (2004) consider frames based on discretizing Ω as a fine grid with |G| elements. They place i.i.d. prior distributions π_G(β) dβ on the nonzero coefficients and i.i.d. Bernoulli kernel inclusion indicators with inclusion probability ρ_G. If |G| ρ_G π_G(β) → ν(β) as |G| → ∞, then the result converges to a LARK model on the infinite-dimensional frame. The representation in Wolfe, Godsill and Ng (2004) uses a point mass at zero to provide sparsity. Similarly, one may view the prior distributions in LARK under the ε-truncation approach as assigning zero mass to a neighborhood around zero, also leading to sparse representations.
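The grid limit just described, i.i.d. Bernoulli inclusions with |G|ρ_G held fixed as the grid refines, can be illustrated numerically: the Binomial(|G|, ρ_G) count of active kernels approaches a Poisson distribution. The constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
expected_terms = 8.0            # |G| * rho_G, held fixed as the grid refines

for G in (50, 500, 5000):       # ever-finer grids
    rho = expected_terms / G    # per-cell inclusion probability
    counts = rng.binomial(G, rho, size=100_000)
    # Binomial(G, rho) -> Poisson(G * rho) as G grows with G * rho fixed
    print(G, counts.mean(), counts.var())

# For a Poisson limit, mean and variance agree; the Binomial variance is
# G*rho*(1-rho) = mean*(1-rho), which approaches the mean as rho -> 0.
```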
One benefit of LARK is its provision of a formal method for coherent prior specification for continuous dictionaries; a second is its provision of a proper prior specification in the limit as ε → 0, ensuring insensitivity to the choice of ε.

Standard stochastic search algorithms using finite-dimensional frames may exhibit poor mixing when the correlations between grid elements tend to ±1. Suppose, for example, that two grid points ω_1 and ω_2 are close in parameter space, leading to two highly correlated columns in the design matrix. In addition, assume that inclusion of either column leads to nearly-maximal likelihood. With the standard one-at-a-time deletion or addition moves in many stochastic search algorithms, to move from a model including a kernel indexed by ω_1 to one indexed by ω_2 would require an extremely unlikely deletion followed by an addition (or unlikely addition followed by a deletion). LARK avoids this difficulty by allowing the continuous parameter ω indexing dictionary elements to move incrementally from ω_1 to ω_2 by a series of update steps, avoiding some of the poor mixing problems associated with highly correlated frame elements in a fine-grid based method.

6.4. Dirichlet processes. The Dirichlet process [Ferguson (1973, 1974), Antoniak (1974)] has received widespread use as a prior distribution on probability distribution functions. Its popularity is due in large part to its analytic tractability in many problems; simulation is straightforward, and Bayesian MCMC inference methods are available [Escobar (1994), MacEachern (1994), Escobar and West (1995), MacEachern (1998), Müller and Quintana (2004)].
Liang, Mukherjee and West (2007) consider nonlinear regres-sion and classification models E [ Y i | X i ] = f ( X i ) for data { ( Y i , X i ) } usingkernel expansions of the form f ( x ) = Z k ( x, u ) γ ( du ) = Z k ( x, u ) w ( u ) F ( du )(56)with random signed measure γ ( du ) expressed as the integral of a weight func-tion w ( u ) with respect to a probability distribution F , modeled as a Dirichletprocess F ∼ DP ( F , α ) with base measure F and scale α > 0. If observedpoints { X i } are viewed as a random sample from F , then updating theposterior for F solely on the basis of the observed { X i } would lead in thelimit as α → F concentrated at the empiricaldistribution for X , justifying the finite-dimensional expansion f ( x ) = n X i =1 k ( x, x i ) w ( x i )with kernels evaluated only at the observed data locations. The generalized g -prior of West (2003) for the coefficients { w i = w ( x i ) } leads to dependentCauchy distributions for the { f ( x i ) } . This approach (like the SVM, RVMand related approaches) has as many coefficients as there are data points,but avoids over-fitting through shrinkage. Asymptotic properties of f ( x ) as n → ∞ are difficult to study in the absence of a limiting structure such asthat provided by LARK.The Dirichlet measure F ( du ) does not assign independent random vari-ables to disjoint sets and so (56) is not a LARK model, but it can be con-structed from one. 
In fact it is exactly the normalized LARK model f ( x ) = Z Ω k ( x, u ) w ( u ) L ( du )/ L (Ω)(57) = X j k ( x, u j ) w j β j / β + ´EVY ADAPTIVE REGRESSION KERNELS with F ( du ) = L ( du ) / L (Ω) for a Gamma random field L ( du ) with infiniteL´evy measure ν ( dβ du ) = αβ − e − β { β> } dβF ( du ) , where β + := P β j [note that w ( u ) could be absorbed into k ( x, u )].Well-known disadvantages of Dirichlet process models include their inflex-ibility (the single parameter α determines the prior dispersion everywhere ,precluding prior specifications with more uncertainty in some regions thanin others), their discreteness, and the limited variability of the masses as-signed to the countably-many support points. The normalized Gamma rep-resentation (57) of DP’s offers the opportunity to overcome some of thesedisadvantages—for example, the Gamma process may be given a variablerate parameter b ( u ) by taking ν ( dβ du ) = β − e − b ( u ) β { β> } dβF ( du )leading to a precision that can vary with location u ∈ Ω, or the Gammarandom field may be replaced with another nonnegative L´evy random fieldwith wider dispersion, such as the fully-skewed Stable process of index α < 7. Simulation study. We now turn our attention to simulated and realexamples to illustrate the performance of LARK models in practice. We con-ducted a simulation study using four spatially varying functions introducedby Donoho and Johnstone (1994) that are now standard in the wavelet liter-ature: Blocks, Bumps, Doppler and Heavysine. Data were generated for eachtest function by adding independent Gaussian random noise No (0 , σ ) to thetrue target function f ( · ) at n = 1024 equally-spaced points on X = [0 , σ was chosen toattain a root signal-to-noise ratio (RSNR) of qR X ( f ( x ) − ¯ f ) dx/σ = 7 . f ≡ |X | R X f ( x ) dx . Each target function f ( · ) has a range of approx-imately 0 ≤ f ( x ) ≤ 25. 
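As a concrete illustration of this simulation setup, the sketch below generates noisy data from the standard (unrescaled) Donoho–Johnstone Doppler function on [0, 1] and calibrates σ to a root signal-to-noise ratio of 7; the rescaling of the test functions to roughly [0, 25] used in the study is omitted here:

```python
import numpy as np

rng = np.random.default_rng(4)

def doppler(x):
    # standard Donoho-Johnstone Doppler test function on [0, 1]
    return np.sqrt(x * (1.0 - x)) * np.sin(2.0 * np.pi * 1.05 / (x + 0.05))

n = 1024
x = (np.arange(n) + 0.5) / n            # equally spaced design points
f = doppler(x)

rsnr = 7.0
sigma = f.std() / rsnr                  # calibrate noise sd to the RSNR
y = f + rng.normal(0.0, sigma, size=n)  # observed noisy data

print(sigma)
```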
For each function, we generated 100 replicate datasets to evaluate the performance of LARK and other methods on the basisof mean squared errorMSE ≡ n − n X i =1 ( [ f ( x i ) − f ( x i )) . (58)7.1. Hyperparameters. In Table 1, we report the kernel functions usedfor the four simulation examples, chosen to illustrate the flexibility of LARKto use a wide range of kernels that may be adapted to anticipated features(smoothness, spikiness, jumps, curvature, covariation, etc.) of applications. R. L. WOLPERT, M. A. CLYDE AND C. TU Table 1 Kernel functions used for four test functions Test function Kernel φ ( x i ; χ j , λ j ) Blocks { <λ j ( x i − χ j ) ≤ } Bumps e − λ j | x i − χ j | Doppler e − . λ j ( x i − χ j ) Heavysine e − . λ j ( x i − χ j ) {| x i − χ j | < . } Table 2 Hyperparameters used in examples of Section 7.1 L´evy measure ε a γ b γ a η b η a λ b λ Symmetric Gamma 0.0041 2.53 6.45 13.01 0.71 1.117 0.1965Cauchy 0.0029 2.53 14.2 0.50 1.00 1.117 0.1965 In each case, we take Ω = [0 , × R + (and | Ω | = 10), with elements denoted ω = ( χ, λ ), comprising a location parameter χ ∈ X = [0 , 10] and a shape pa-rameter λ > 0. As described in Section 5.2, we take { χ j } i . i . d . ∼ Un (Ω) and { λ j } i . i . d . ∼ Ga ( a λ , b λ ) with a λ , b λ chosen (see Table 2) to achieve a 95% priorinterval of [0 . , . 0] for λ to attain dilated kernels covering from half a per-cent up to fifty percent of X .Our choice of the remaining hyperparameters was guided by three objec-tives: to achieve a 95% prior predictive interval of [5 , J , to achievea 95% prior predictive interval of [ − , 25] for the { β j } , and to achievea limit on the mean squared truncation error of kL [ φ ] − L ε [ φ ] k = ( E |L [ φ ] −L ε [ φ ] | ) / ≤ . · k φ k (see Section 5.2). 
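In sketch form, the MSE criterion (58) and its average over replicates (the AMSE reported later) can be written as below; the placeholder fits here stand in for actual LARK estimates:

```python
import numpy as np

rng = np.random.default_rng(5)

def mse(f_hat, f_true):
    """Mean squared error (58): n^{-1} sum_i (f_hat(x_i) - f(x_i))^2."""
    return np.mean((f_hat - f_true) ** 2)

# AMSE: average the MSE over replicate fits (placeholder f_hat draws)
f_true = np.sin(np.linspace(0.0, 2.0 * np.pi, 1024))
mses = [mse(f_true + rng.normal(0.0, 0.1, f_true.shape), f_true)
        for _ in range(100)]
amse = float(np.mean(mses))
print(amse)
```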
While these objectives could bemet for the LARK model with symmetric Gamma prior with the valuesgiven in Table 2, they are not quite attainable for the Cauchy model—thecompeting goals of an extremely wide distribution for the { β j } and a lowmean squared truncation error cannot be reconciled. Upon relaxing the priorpredictive distribution requirement on { β j } to a 99.9% interval of [ − , { β i } , theremaining objectives for the distribution of J and the mean square trun-cation error were attained using the values given in Table 2. See Figure 5,Appendix C for realizations from the prior distribution.7.1.1. Performance. We compared LARK with two of the best wavelet me-thods currently available for inhomogeneous function estimation using over-complete representations: the empirical Bayes approach (“EBayesThresh”) ´EVY ADAPTIVE REGRESSION KERNELS Table 3 Average and (standard errors) over replications of mean square errors of the fourtest functions using the L´evy Adaptive Regression Kernels (LARK) using the symmetricGamma and Cauchy priors, the OCW approach using a Laplace prior [Chu, Clyde andLiang (2009)], and the EBayesThresh approach using a Laplace prior [Johnstone andSilverman (2005a)] Method Blocks Bumps HeavySine Doppler LARK-Gamma 0.030 (0.0013) 0.111 (0.0019) 0.038 (0.0010) 0.152 (0.0030)LARK-Cauchy 0.026 (0.0011) 0.105 (0.0017) 0.036 (0.0010) 0.157 (0.0028)OCW 0.060 (0.0023) 0.285 (0.0025) 0.082 (0.0010) 0.152 (0.0019)EBayesThresh 0.096 (0.0013) 0.307 (0.0032) 0.118 (0.00098) 0.202 (0.0027) of Johnstone and Silverman (2004, 2005a, 2005b) using translational-invariantwavelets, and the continuous over-complete wavelet (“OCW”) approach ofChu, Clyde and Liang (2009) based on the stochastic wavelet expansions ofAbramovich, Sapatinas and Silverman (2000). 
We replicated the results of Johnstone and Silverman (2005b) under the beta-Laplace prior using their R package EBayesThresh [Johnstone and Silverman (2005a)] with Daubechies' "least asymmetric" (la8) wavelets [see Section 4 of Daubechies (1988) or Section 6.4 of Daubechies (1992)]. OCW uses the same la8 wavelet as EBayesThresh except for the Blocks example, where both LARK and OCW use the Haar wavelet. The OCW method may be viewed as a special case of LARK with a finite nonseparable Lévy measure, where coefficients β_j have independent Laplace distributions conditional on scale parameters λ_j, which in turn have truncated Pareto distributions. As in LARK, OCW assigns independent uniform locations, with a negative binomial distribution for the number of terms in the expansion.

The performance of each method was measured by its average mean square error (AMSE), defined as the average value of the MSE given in (58) over the 100 replicated simulations. Overall, the performance of the LARK model is excellent (Table 3). Both LARK versions generated lower AMSE values than did EBayesThresh for all four test functions. LARK also has smaller AMSE than OCW, except for Doppler, where the methods are comparable. For Blocks, both LARK and OCW use the Haar wavelet, so any difference in results is due to the prior distribution on the function; LARK leads to a 50% reduction in AMSE compared to OCW. For the other examples, both OCW and EBayesThresh use a Laplace prior distribution for each coefficient in the expansion and the same wavelet; in all cases it is clear that using a continuous dictionary is better than the finite-dimensional dictionary (frame) with the nondecimated wavelets. LARK reconstructions (right column, Figure 1) consistently show less ringing and fewer artifacts than EBayesThresh (left column).

Fig. 1.
Comparison of fitted functions using EBayesThresh beta.laplace [John-stone and Silverman (2005a)] (left column) and L´evy Adaptive Regression Kernels(LARK-Gamma) (right column) for the four test functions. From top to bottom, the testfunctions are Blocks, Bumps, Doppler and Heavysine, respectively. ´EVY ADAPTIVE REGRESSION KERNELS Fig. 2. Left: results of the LARK model for the motorcycle crash data. Circles representthe observations; solid line is the posterior mean; dotted lines are pointwise 90% Bayesiancredible interval for the mean function. Right: histogram of posterior samples of the expo-nential power parameter ρ , with prior density (solid line) for comparison. 8. Applications. Motorcycle crash data. To further illustrate the method, we explorethe motorcycle crash experiment data of Schmidt, Mattern and Sch¨uler(1981) considered by Silverman (1985), shown in Figure 2. The 133 observa-tions are unequally spaced, with repeated observations at some time points.Our focus in this example is to illustrate how a single wide class of generatingfunctions may be used in LARK, with the data (through the likelihood) influ-encing the choice of kernels present in the posterior distribution. We use thepower exponential family of kernel functions φ ( x ; χ, λ, ρ ) = exp {− λ | x − χ | ρ } ,but here (in contrast with the examples in Section 7) we treat ρ as an un-certain parameter and make inference about it from the data. We take thepower ρ to be common for all kernels, and use a relatively concentratedGamma prior distribution ρ ∼ Ga (2 . , . 75) with a 50% HPD interval of[0 . , . 56] which comfortably includes both the Laplace ( ρ = 1) and Gaus-sian ( ρ = 2) kernels as special cases.The results are summarized in Figure 2. It is apparent that the fitted meancaptures the general trend of the data very well, with minimal boundaryeffects. The model is parsimonious in the sense we only need 4 kernels onaverage to fit the data. 
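The power exponential generator family used in this example is straightforward to implement; the sketch below evaluates the special cases ρ = 1 (Laplace) and ρ = 2 (Gaussian):

```python
import numpy as np

def power_exp_kernel(x, chi, lam, rho):
    """phi(x; chi, lambda, rho) = exp(-lambda * |x - chi|**rho)."""
    return np.exp(-lam * np.abs(x - chi) ** rho)

x = np.linspace(-3.0, 3.0, 601)
laplace_like = power_exp_kernel(x, 0.0, 1.0, 1.0)   # rho = 1: Laplace kernel
gaussian_like = power_exp_kernel(x, 0.0, 1.0, 2.0)  # rho = 2: Gaussian kernel

# both peak at 1 at x = chi; rho = 1 has much heavier tails than rho = 2
print(laplace_like.max(), gaussian_like.max())
```

Treating ρ as a common uncertain parameter, as done here, lets the data choose the tail behavior of the kernels rather than fixing it in advance.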
The posterior mean for ρ is approximately 3, with most of the posterior mass well above the values (ρ = 1, 2) for the Laplace and Gaussian kernels.

8.2. Spatial temporal model. In this section, we explore the performance of the LARK approach for modeling hourly SO₂ concentration levels (measured in ppm) in Pennsylvania, New Jersey, Delaware and Maryland [U.S. EPA (2007)]. The locations of the 33 monitoring stations are shown in Figure 3; the study region S, delineated by a rectangle in the figure, covers a 310 km × 310 km area.

Fig. 3. (a) Thirty-three monitors used by EPA to measure hourly SO₂ concentration in year 2002. The inverted triangle denotes Site 31. The study area is delineated by a rectangle that includes parts of Pennsylvania, Maryland, New Jersey and Delaware, and is blown up in (b), which illustrates locations of kernels from a draw from the posterior distribution. The blue circles represent aperiodic points and the red circles represent daily periodic point sources. Circle areas are proportional to the magnitudes of the point sources they represent.

We used rescaled coordinates from a Lambert (conformal conic) projection to reduce the distortion caused by the earth's curvature. For demonstration purposes, we restrict analysis to measurements taken during a 144 hour period T from September of 2002. About 5% of SO₂ readings are missing (at random) from the data set, which is not a problem for the LARK model. While Gaussian random field models are popular for modeling spatial-temporal data, the log transformation typically used in the Gaussian approach (because the mean function is strictly positive) eliminates many of the (important) spiky features of the data.
Our Gamma random field prior distribution allows us to model the data in the original units.

The model can be written in the same simple form as (52), but now the SO₂ concentration Y(x) is indexed by points x ∈ X = S × T in space–time and the Lévy random measure L(dω) assigns Gamma-distributed random variables to Borel sets of a space Ω of points ω = (σ, τ, Λ, λ) that include a location (σ, τ) ∈ S × T in space–time, a positive-definite 2 × 2 matrix Λ ∈ S²₊, and a temporal decay rate λ > 0. We employ a separable kernel of the form φ(x, ω) = exp{−(s − σ)′Λ(s − σ)/2 − λ|t − τ|} and, in the spirit of Higdon [(1998), Section 3.2] and Higdon, Swall and Kern [(1999), Section 2.2], we employ a novel parametrization for Λ in terms of its eigenvalues and the orientation of its major axis [see Tu (2006), Section 4.2.6, for details on prior specifications]. In variations also described in Tu [(2006), Chapter 4], accommodation is made for partial periodicity (due to diurnal patterns associated with daily variation in ambient temperature, traffic levels, etc.), still within the framework described by (51) but now with more elaborate choices for Ω and φ(x, ω).

The locations of latent point sources from one iteration of the RJ-MCMC algorithm are presented in Figure 3(b). Larger latent points appear to be clustered in the Baltimore metropolitan area and near the New Jersey/Pennsylvania border. The model's support points are more than a mere modeling device: they can help analysts identify possible underlying sources of pollution, or support future decisions on monitor locations.

The predictive power of the model is validated through out-of-sample prediction. The model was fit excluding data from Site 31 [the inverted triangle in Figure 3(a)], and then its predictions were compared with reported measurements from that site for the entire 144 hours. The result shown in Figure 4 is promising.
The major peak was captured clearly, and 90% pointwise Bayesian credible intervals cover in excess of 80% of the true observations. This was a challenging out-of-sample prediction problem due to low cross-correlations among sites. We are currently refining features of the prior distributions to incorporate known point sources.

Fig. 4. Out-of-sample predictions for Site 31. Dashed line represents observed time series, solid line represents predictive mean curve. Gray lines are posterior predictive intervals.

9. Discussion. In this article, we have developed a fully Bayesian adaptive kernel method, LARK, for nonparametric function estimation. The LARK model is based on a stochastic expansion of functions in a continuous overcomplete dictionary, and may be expressed as a stochastic integral of a kernel or other generating function with respect to a Lévy random field. When (7) is satisfied (so compensation is unnecessary), the Lévy field is a random signed measure. By using a positive random measure and positive kernel family, LARK models provide natural constructions for nonnegative functions (as in Section 8.2); with signed measures, unconstrained functions may be modeled (as in Sections 7 and 8.1). The kernel parameters are location-specific and thus adapt to local features of the data. As with wavelets, the adaptive smoothing using LARK preserves local features such as discontinuities and high peaks and is especially useful for modeling inhomogeneous functions. The LARK approach does not require that the data be equally-spaced without missing observations, nor that the sample size be a dyadic power, as is commonly required of many wavelet methods. The RJ-MCMC algorithm developed for fitting LARK provides an automatic stochastic search mechanism for finding sparse representations of a function.
The algorithm is computationally efficient [requiring only O(n · M) operations for data including n observations and an MCMC stream of length M], as dictionary elements are calculated only when needed. Kernel methods such as Support Vector Machines (SVMs) and Bayesian Relevance Vector Machines [or RVMs, Tipping (2001)] employ all data points as kernel locations, but attain sparsity by shrinking coefficients to zero. LARK provides additional flexibility by not restricting kernel locations. Many competing sparse methods, including the Dantzig Selector and Lasso, require the a priori selection of a pre-specified number of dictionary elements. Evaluating these kernels on a sufficiently fine grid will exceed the computational cost of LARK. Fine grids also lead to extreme multicollinearity in these approaches, which may lead both to numerical instability and violation of the conditions needed for sparse solutions.

9.1. Extensions. It is straightforward to implement LARK with wide classes of generating functions including wavelets, structural elements in texture analysis, and splines. Unlike support vector machines or other methods based on Mercer kernels [Pillai et al. (2007)], the LARK approach does not require symmetry, continuity or simple functional forms. While it is often convenient to use kernels based on some distance metric, arbitrary generating functions may be tailored to the problem at hand, as illustrated in the space–time example of Section 8.2. The LARK modeling approach adapts readily to problems in any number of dimensions.

In Section 4, we present conditions for LARK models to belong to the same Besov space as their generating functions, for Lévy measures and generating functions that satisfy the stringent local L-bound of (18).
In the more general case, where (18) fails and compensation is required, we are able to establish similar results only for $B^s_{pq}$ with $p = q = 2$ (equivalent to $W^s_2$). We are exploring extensions to the general case, but the additional drift term that arises in compensation complicates confirming the convergence of $f_\varepsilon$ to $f$ in $B^s_{pq}$ for general $p, q$.

Work is also on-going in establishing conditions for posterior consistency for function estimation. Extending methods of Choudhuri, Ghosal and Roy (2004), Ghosal and van der Vaart (2007) and Choi and Schervish (2007), Pillai (2008) has verified posterior consistency for certain LARK models with Gaussian measurement errors in work that will be reported elsewhere.

APPENDIX A: DETAILS OF PROOFS

Proposition 2. For a function $g(\cdot) \in L_p(\mathbb{R}^d)$ and its scaled translate $g(\Lambda(\cdot - \chi))$ with $\chi \in \mathbb{R}^d$ and positive definite matrix $\Lambda \in S^d_+$, the $L_p$ norms of $g(\Lambda(\cdot - \chi))$ and of its $m$th forward differences are given by
$$\|g(\Lambda(\cdot - \chi))\|_p = |\Lambda|^{-1/p}\,\|g\|_p, \qquad \|\Delta^m_h g(\Lambda(\cdot - \chi))\|_p = |\Lambda|^{-1/p}\,\|\Delta^m_{\Lambda h} g\|_p, \qquad (59)$$
where $|\Lambda|$ denotes the determinant of $\Lambda$.

Proof. By the change of variables $u = \Lambda(x - \chi)$,
$$\|\Delta^m_h g(\Lambda(\cdot - \chi))\|_p = \biggl\{\int |\Delta^m_h g(\Lambda(x - \chi))|^p \, dx\biggr\}^{1/p} = \biggl\{\int \biggl|\sum_{k=0}^m \binom{m}{k}(-1)^{m-k} g(\Lambda(x + kh - \chi))\biggr|^p dx\biggr\}^{1/p}$$
$$= |\Lambda|^{-1/p} \biggl\{\int \biggl|\sum_{k=0}^m \binom{m}{k}(-1)^{m-k} g(u + k\Lambda h)\biggr|^p du\biggr\}^{1/p} = |\Lambda|^{-1/p}\,\|\Delta^m_{\Lambda h} g\|_p.$$
The identity for the $L_p$ norm of $g(\Lambda(\cdot - \chi))$ itself follows by the same change of variables. $\Box$

A.1. Proof of Lemma 1. First, consider the case $b > 1$, $a \in \mathbb{R}$. Then
$$\iiint_{\mathbb{R}\times[1,\infty)\times T} (1 \wedge |z g(u)^r| \lambda^{-a})\,\lambda^{-b}\,\pi_z(dz)\,d\lambda\,du < \iiint_{\mathbb{R}\times[1,\infty)\times T} \lambda^{-b}\,\pi_z(dz)\,d\lambda\,du = \frac{1}{b-1} < \infty.$$
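Both the scaling identity of Proposition 2 and the bound just computed can be spot-checked numerically. The one-dimensional sketch below (Gaussian test function, grid quadrature; all parameter values illustrative) verifies $\|g(\lambda(\cdot-\chi))\|_p = \lambda^{-1/p}\|g\|_p$ and checks the case-$(b>1)$ inner $\lambda$-integral against its bound $1/(b-1)$.

```python
import numpy as np

# (i) Scaling identity of Proposition 2 in one dimension.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]

def lp_norm(vals, p):
    """L_p norm on R approximated by a Riemann sum on a fine grid."""
    return (np.sum(np.abs(vals) ** p) * dx) ** (1.0 / p)

g = lambda t: np.exp(-t ** 2)          # illustrative test function in L_p(R)
lam, chi, p = 3.0, 0.7, 2.0            # scale, translate, norm index
lhs = lp_norm(g(lam * (x - chi)), p)
rhs = lam ** (-1.0 / p) * lp_norm(g(x), p)
print(abs(lhs - rhs) < 1e-8)           # True: the scaling identity holds

# (ii) Case b > 1 of Lemma 1: the inner lambda-integral is bounded by
# int_1^inf lam^(-b) dlam = 1/(b - 1), uniformly in c = |z g(u)^r|.
a, b, c = 2.0, 1.5, 5.0
t = np.linspace(1.0, 200.0, 2000001)
integrand = np.minimum(1.0, c * t ** -a) * t ** -b
val = np.sum(integrand) * (t[1] - t[0])
print(val < 1.0 / (b - 1.0))           # True
```

The Riemann sums are effectively exact here because both integrands are smooth and decay rapidly, so the comparison is limited only by floating-point error.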
Next, consider the case $b < 1$ and $a > 1 - b$ (which together imply $a > 0$). Then
$$\iiint_{\mathbb{R}\times[1,\infty)\times T} (1 \wedge |zg(u)^r|\lambda^{-a})\,\lambda^{-b}\,\pi_z(dz)\,d\lambda\,du$$
$$= \iint_{[|zg(u)^r|>1]} \int_1^{|zg(u)^r|^{1/a}} \lambda^{-b}\,d\lambda\,\pi_z(dz)\,du + \iint_{\mathbb{R}\times T} |zg(u)^r| \int_{1\vee|zg(u)^r|^{1/a}}^{\infty} \lambda^{-a-b}\,d\lambda\,\pi_z(dz)\,du$$
$$= \iint_{[|zg(u)^r|>1]} \frac{\lambda^{1-b}}{1-b}\bigg|_{\lambda=1}^{\lambda=|zg(u)^r|^{1/a}} \pi_z(dz)\,du + \iint_{\mathbb{R}\times T} |zg(u)^r|\,\frac{\lambda^{1-a-b}}{1-a-b}\bigg|_{\lambda=1\vee|zg(u)^r|^{1/a}}^{\lambda=\infty} \pi_z(dz)\,du$$
$$= \iint_{[|zg(u)^r|>1]} \frac{|zg(u)^r|^{(1-b)/a} - 1}{1-b}\,\pi_z(dz)\,du + \iint_{[|zg(u)^r|>1]} \frac{|zg(u)^r|^{(1-b)/a}}{a+b-1}\,\pi_z(dz)\,du + \iint_{[|zg(u)^r|\le 1]} \frac{|zg(u)^r|}{a+b-1}\,\pi_z(dz)\,du$$
$$\le \frac{a}{(1-b)(a+b-1)} \iint_{\mathbb{R}\times T} |zg(u)^r|^{(1-b)/a}\,\pi_z(dz)\,du = \frac{a}{(1-b)(a+b-1)} \int_{\mathbb{R}} |z|^{(1-b)/a}\,\pi_z(dz) \int_T |g(u)|^{r(1-b)/a}\,du < \infty$$
[using $|zg(u)^r| \le |zg(u)^r|^{(1-b)/a}$ on the set $[|zg(u)^r| \le 1]$, since $(1-b)/a < 1$], for $a + b > 1$ when $r = 0$, and for $ap + b \ge 1$ when $r = 1$ [since $g \in L^*_p(T)$], which is implied by $a > 1 - b$.

Now consider the case $b = 1$ and $a > 0$:
$$\iiint_{\mathbb{R}\times[1,\infty)\times T} (1 \wedge |zg(u)^r|\lambda^{-a})\,\lambda^{-1}\,\pi_z(dz)\,d\lambda\,du$$
$$= \iint_{[|zg(u)^r|>1]} \int_1^{|zg(u)^r|^{1/a}} \lambda^{-1}\,d\lambda\,\pi_z(dz)\,du + \iint_{\mathbb{R}\times T} |zg(u)^r| \int_{1\vee|zg(u)^r|^{1/a}}^{\infty} \lambda^{-a-1}\,d\lambda\,\pi_z(dz)\,du$$
$$= \iint_{[|zg(u)^r|>1]} \log\lambda\,\bigg|_{\lambda=1}^{\lambda=|zg(u)^r|^{1/a}} \pi_z(dz)\,du + \iint_{\mathbb{R}\times T} |zg(u)^r|\,\frac{\lambda^{-a}}{-a}\bigg|_{\lambda=1\vee|zg(u)^r|^{1/a}}^{\lambda=\infty} \pi_z(dz)\,du$$
$$= \iint_{[|zg(u)^r|>1]} \frac{1}{a}\log|zg(u)^r|\,\pi_z(dz)\,du + \iint_{[|zg(u)^r|>1]} \frac{1}{a}\,\pi_z(dz)\,du + \iint_{[|zg(u)^r|\le 1]} \frac{|zg(u)^r|}{a}\,\pi_z(dz)\,du$$
$$\le \frac{1}{a} \iint_{\mathbb{R}\times T} \bigl[\log^+|zg(u)^r| + (1 \wedge |zg(u)^r|)\bigr]\,\pi_z(dz)\,du < \infty,$$
since $\log^+|zg^r| = (0 \vee \log|zg^r|) \le |z| + |g|^r$ and $g \in L^*_1(T)$. $\Box$

A.2. Proof of Theorem 2. For any compensator function $h(\beta)$ satisfying (10) there are constants $c_1, c_2, c_3 \in (0,\infty)$ such that
$$|h(\beta)| \le c_1, \qquad |\beta - h(\beta)| \le c_2(|\beta| \wedge \beta^2), \qquad |h(\beta)| \le c_3(1 \wedge |\beta|)$$
for all $\beta \in \mathbb{R}$.
Fix $0 < \varepsilon \le 1$ and $\phi : \mathbb{R} \times \Omega \to \mathbb{R}$ satisfying (16); let $B_a$, $B_b$ and $B_c$ be the values of the integrals from (16a)–(16c), respectively. To complete the proof of Theorem 2 it suffices to show that each of the two terms from (27),
$$X \equiv \iint_{N_\varepsilon} (\beta - h(\beta))\,\phi(\omega)\,N(d\beta\,d\omega) \qquad \text{and} \qquad Y \equiv \iint_{N_\varepsilon} h(\beta)\,\phi(\omega)\,\tilde N(d\beta\,d\omega), \qquad (60)$$
converges to zero in probability as $\varepsilon \to 0$. Write the first integral in (60) as the sum of two parts, $X = X_1 + X_2$, with
$$X_1 \equiv \iint_{N_\varepsilon \cap [|\beta\phi|\le 1]} (\beta - h(\beta))\,\phi(\omega)\,N(d\beta\,d\omega), \qquad X_2 \equiv \iint_{N_\varepsilon \cap [|\beta\phi|>1]} (\beta - h(\beta))\,\phi(\omega)\,N(d\beta\,d\omega).$$
Then
$$\mathbb{E}|X_1| \le c_2 \iint_{N_\varepsilon \cap [|\beta\phi|\le 1]} (|\beta| \wedge \beta^2)\,|\phi(\omega)|\,\nu(d\beta\,d\omega)$$
$$= c_2 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|\le 1]} (1 \wedge \beta^2)\,|\phi(\omega)|\,\nu(d\beta\,d\omega) + c_2 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} (1 \wedge |\beta\phi(\omega)|)\,\nu(d\beta\,d\omega)$$
$$\le c_2(B_c + B_a) < \infty,$$
so $X_1 \to 0$ in $L_1$ as $\varepsilon \to 0$, since $\mathbf{1}_{N_\varepsilon}(\beta,\omega)$ tends to zero a.e. $(\nu)$ as $\varepsilon \to 0$. Now consider $X_2$:
$$\nu(\{(\beta,\omega) : |\beta\phi(\omega)| > 1\}) = \iint_{[|\beta|\le 1] \cap [|\beta\phi|>1]} \nu(d\beta\,d\omega) + \iint_{[|\beta|>1] \cap [|\beta\phi|>1]} \nu(d\beta\,d\omega)$$
$$\le \iint_{[|\beta|\le 1] \cap [|\beta\phi|>1]} (|\beta\phi(\omega)| \wedge |\beta\phi(\omega)|^2)\,\nu(d\beta\,d\omega) + \iint_{[|\beta|>1] \cap [|\beta\phi|>1]} (1 \wedge |\beta\phi(\omega)|)\,\nu(d\beta\,d\omega)$$
$$\le B_b + B_a < \infty,$$
so almost surely the random support of $N(d\beta\,d\omega)$ in $[|\beta\phi| > 1]$ is a finite set disjoint from $\bigcap_{\varepsilon>0} N_\varepsilon$; it follows that $N(N_\varepsilon \cap [|\beta\phi(\omega)| > 1]) \to 0$ and hence $X_2 \to 0$ as $\varepsilon \to 0$. Similarly, write $Y = Y_1 + Y_2 + Y_3 + Y_4$ with
$$Y_1 \equiv \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|\le 1]} h(\beta)\,\phi(\omega)\,\tilde N(d\beta\,d\omega), \qquad Y_2 \equiv \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,\tilde N(d\beta\,d\omega),$$
$$Y_3 \equiv \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} h(\beta)\,\phi(\omega)\,\tilde N(d\beta\,d\omega), \qquad Y_4 \equiv \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,\tilde N(d\beta\,d\omega).$$
Now
$$\mathbb{E}|Y_1|^2 = \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|\le 1]} h(\beta)^2\,\phi(\omega)^2\,\nu(d\beta\,d\omega) \le c_3^2 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|\le 1]} |\beta\phi(\omega)|^2\,\nu(d\beta\,d\omega)$$
$$= c_3^2 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|\le 1]} (|\beta\phi(\omega)| \wedge |\beta\phi(\omega)|^2)\,\nu(d\beta\,d\omega) \le c_3^2 B_b < \infty,$$
so $Y_1 \to 0$ in $L_2$ (and hence also in $L_1$) as $\varepsilon \to 0$. For $Y_2$,
$$Y_2 = \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,N(d\beta\,d\omega) - \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,\nu(d\beta\,d\omega),$$
$$\mathbb{E}|Y_2| \le 2 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} |h(\beta)|\,|\phi(\omega)|\,\nu(d\beta\,d\omega) \le 2c_3 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} |\beta\phi(\omega)|\,\nu(d\beta\,d\omega)$$
$$= 2c_3 \iint_{N_\varepsilon \cap [|\beta|\le 1] \cap [|\beta\phi|>1]} (|\beta\phi(\omega)| \wedge |\beta\phi(\omega)|^2)\,\nu(d\beta\,d\omega) \le 2c_3 B_b < \infty,$$
so $Y_2 \to 0$ in $L_1$ as $\varepsilon \to 0$. Similarly,
$$Y_3 = \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} h(\beta)\,\phi(\omega)\,N(d\beta\,d\omega) - \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} h(\beta)\,\phi(\omega)\,\nu(d\beta\,d\omega),$$
$$\mathbb{E}|Y_3| \le 2 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} |h(\beta)|\,|\phi(\omega)|\,\nu(d\beta\,d\omega) \le 2c_3 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} |\beta\phi(\omega)|\,\nu(d\beta\,d\omega)$$
$$= 2c_3 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|\le 1]} (1 \wedge |\beta\phi(\omega)|)\,\nu(d\beta\,d\omega) \le 2c_3 B_a < \infty,$$
so $Y_3 \to 0$ in $L_1$ as $\varepsilon \to 0$. Finally, for $Y_4$,
$$Y_4 = \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,N(d\beta\,d\omega) - \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} h(\beta)\,\phi(\omega)\,\nu(d\beta\,d\omega),$$
$$\mathbb{E}|Y_4| \le 2 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} |h(\beta)\phi(\omega)|\,\nu(d\beta\,d\omega) \le 2c_1 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} |\phi(\omega)|\,\nu(d\beta\,d\omega)$$
$$= 2c_1 \iint_{N_\varepsilon \cap [|\beta|>1] \cap [|\beta\phi|>1]} (1 \wedge \beta^2)\,|\phi(\omega)|\,\nu(d\beta\,d\omega) \le 2c_1 B_c < \infty,$$
so $Y_4 \to 0$ in $L_1$ as $\varepsilon \to 0$, completing the proof of Theorem 2. $\Box$

APPENDIX B: REVERSIBLE-JUMP MCMC PROCEDURES

A typical RJ-MCMC procedure for sampling varying-dimensional parameters involves at least three types of moves (Birth, Death and Update); we use Metropolis–Hastings steps for each of these. Our trans-dimensional update steps entail proposing a new value $(\beta^*_j, \omega^*_j)$ for one point $(\beta_j, \omega_j)$.
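The varying-dimensional bookkeeping of this appendix can be sketched in a few lines. The following is a simplified illustration with invented move probabilities and the Metropolis–Hastings acceptance ratios omitted; it mimics the death-on-truncation rule and the conditioned double-exponential Birth proposals described here, but is not the authors' implementation.

```python
import random

random.seed(1)
eps, eta = 0.1, 1.0                    # truncation level and coefficient scale (illustrative)
points = [(0.8, 0.3), (-1.2, 0.6)]     # current support points (beta_j, omega_j)

def update_or_death(points, step=0.25):
    """Random-walk proposal for one coefficient beta_j; a proposal landing
    in the truncated region |beta* eta| < eps is treated as a Death move."""
    j = random.randrange(len(points))  # j ~ Un(0 : J-1)
    beta, omega = points[j]
    beta_star = beta + random.gauss(0.0, step)
    if abs(beta_star * eta) < eps:
        del points[j]                  # Death: remove the small-magnitude point
    else:
        points[j] = (beta_star, omega) # Update (acceptance step omitted here)

def birth(points):
    """Birth: new small coefficient from a double-exponential tail with rate
    eta/eps, conditioned on |beta| eta > eps (memorylessness: eps/eta + Exp)."""
    beta = random.choice([-1.0, 1.0]) * (eps / eta + random.expovariate(eta / eps))
    points.append((beta, random.random()))  # random kernel parameters omega

for _ in range(200):
    if not points or random.random() < 0.3:
        birth(points)
    else:
        update_or_death(points)

# Every retained coefficient stays outside the truncation region.
print(all(abs(bj * eta) >= eps for bj, _ in points))  # True
```

The invariant checked at the end holds by construction: Births draw coefficients with $|\beta|\eta > \varepsilon$, and Updates that would violate this become Deaths.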
We select $j \sim \text{Un}(0{:}J-1)$ for proposed updating, then take Gaussian random walk steps successively in the coefficient $\beta_j$, the location parameter $\chi_j$, and the log kernel shape parameter, $\log\lambda_j$. Step sizes are chosen to achieve approximately 30% acceptance rates for each class of updates. One novel feature is that when the proposed update of some coefficient $\beta_j$ falls in the truncated region $\beta^*_j\eta \in (-\varepsilon, \varepsilon)$, the move is treated as a Death: the point $(\beta_j, \omega_j)$ is removed and $J$ is decremented. This is advantageous as it automatically focuses on small-magnitude coefficients for removal (rather than a random selection as in the typical RJ-MCMC Death step). A Birth step entails generating a new point $(\beta^*, \omega^*)$ to be included among the $\{(\beta_j, \omega_j)\}$ and incrementing $J$ by one. We use a double exponential birth distribution with rate $\eta/\varepsilon$, conditioned to satisfy $|\beta_j|\eta > \varepsilon$, so that proposed coefficients are small, balancing the "Death" of small coefficients in the Update step to attain the target acceptance rates. The fixed-dimensional parameters are sampled using a conventional Metropolis–Hastings approach [Gilks, Richardson and Spiegelhalter (1996), Section 1.3.3]. Each of these inexpensive update steps requires only $O(n)$ operations [in contrast to Gaussian methods, which may require $O(n^3)$], so the method scales well in the number $n$ of observations. Further details of the RJ-MCMC are available in [Tu (2006), Appendix A.1, pages 116 and 117]. An R package [R Development Core Team (2004)] implementing LARK is under development by the authors and will be made publicly available.

APPENDIX C: EXAMPLES OF LARK PRIOR REALIZATIONS

Fig. 5. Four realizations from LARK prior distribution with (a) Blocks kernel and Symmetric Gamma Lévy measure; (b) Bumps kernel and Gamma Lévy measure; (c), (d) Doppler kernel and Cauchy Lévy measure, with $J = 1000$ components for (a)–(c) and $J = 10$ for (d).
Hyperparameters $a_\lambda$, $b_\lambda$, $a_\gamma$, $b_\gamma$, $a_\eta$, $b_\eta$ and $\varepsilon$ are given in Table 2.

Fig. 5. (Continued.)

Acknowledgments. The authors would like to thank Natesh Pillai, three referees, the Associate Editor and the Editor for helpful comments and suggestions.

REFERENCES

Abramovich, F., Sapatinas, T. and Silverman, B. W. (1998). Wavelet thresholding via a Bayesian approach. J. R. Stat. Soc. Ser. B Stat. Methodol.
Abramovich, F., Sapatinas, T. and Silverman, B. W. (2000). Stochastic expansions in an overcomplete wavelet dictionary. Probab. Theory Related Fields.
Abramowitz, M. and Stegun, I. A. (1964). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. National Bureau of Standards Applied Mathematics Series. U.S. Government Printing Office, Washington, DC. MR0167642
Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist.
Candès, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist.
Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput.
Chilès, J.-P. and Delfiner, P. (1999). Geostatistics: Modeling Spatial Uncertainty. Wiley, New York. MR1679557
Choi, T. and Schervish, M. J. (2007). On posterior consistency in nonparametric regression problems. J. Multivariate Anal.
Choudhuri, N., Ghosal, S. and Roy, A. (2004). Bayesian estimation of the spectral density of a time series. J. Amer. Statist. Assoc.
Chu, J.-H., Clyde, M. A. and Liang, F. (2009). Bayesian function estimation using continuous wavelet dictionaries. Statist. Sinica.
Chu, C.-K. and Marron, J. S. (1991). Choosing a kernel regression estimator (with discussion). Statist. Sci.
Clyde, M. A. and Wolpert, R. L. (2007). Nonparametric function estimation using overcomplete dictionaries. In Bayesian Statistics 8 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P.
Dawid, D. Heckerman, A. F. M. Smith and M. West, eds.) 91–114. Oxford Univ. Press, Oxford. MR2433190
Cont, R. and Tankov, P. (2004). Financial Modelling with Jump Processes. Chapman & Hall/CRC, Boca Raton, FL. MR2042661
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge Univ. Press, Cambridge.
Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math.
Daubechies, I. (1992). Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, PA. MR1162107
Denison, D. G. T., Mallick, B. K. and Smith, A. F. M. (1998). Automatic Bayesian curve fitting. J. R. Stat. Soc. Ser. B Stat. Methodol.
Denison, D. G. T., Holmes, C. C., Mallick, B. K. and Smith, A. F. M. (2002). Bayesian Methods for Nonlinear Classification and Regression. Wiley, Chichester. MR1962778
DiMatteo, I., Genovese, C. R. and Kass, R. E. (2001). Bayesian curve-fitting with free-knot splines. Biometrika.
Donoho, D. L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ¹ minimization. Proc. Natl. Acad. Sci. USA.
Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika.
Escobar, M. D. (1994). Estimating normal means with a Dirichlet process prior. J. Amer. Statist. Assoc.
Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist.
Ferguson, T. S. (1974). Prior distributions on spaces of probability measures. Ann. Statist.
Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for non-i.i.d. observations. Ann. Statist.
Gilks, W. R., Richardson, S. and Spiegelhalter, D. J., eds. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall, London. MR1397966
Green, P. J. (1995).
Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika.
Higdon, D. M. (1998). A process-convolution approach to modeling temperatures in the North Atlantic Ocean. Environ. Ecol. Stat.
Higdon, D., Swall, J. and Kern, J. (1999). Non-stationary spatial modeling. In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 761–768. Oxford Univ. Press, Oxford.
Jacod, J. and Shiryaev, A. N. (1987). Limit Theorems for Stochastic Processes. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, Berlin. MR0959133
Johnstone, I. M. and Silverman, B. W. (2004). Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. Ann. Statist.
Johnstone, I. M. and Silverman, B. W. (2005a). EBayesThresh: R programs for empirical Bayes thresholding. Journal of Statistical Software.
Johnstone, I. M. and Silverman, B. W. (2005b). Empirical Bayes selection of wavelet thresholds. Ann. Statist.
Jordan, M. I. (2010). Hierarchical models, nested models and completely random measures. In Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger (M.-H. Chen, D. K. Dey, P. Müller, D. Sun and K. Ye, eds.) 207–217. Springer, New York. MR2766461
Khinchine, A. Y. and Lévy, P. (1936). Sur les lois stables. C. R. Math. Acad. Sci. Paris.
Kingman, J. F. C. (1967). Completely random measures. Pacific J. Math.
Kwapień, S. and Woyczyński, W. A. (1992). Random Series and Stochastic Integrals: Single and Multiple. Birkhäuser, Boston, MA. MR1167198
Law, M. H. and Kwok, J. T. (2001). Bayesian support vector regression. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics (AISTATS).
Liang, F., Mukherjee, S. and West, M. (2007). The use of unlabeled data in predictive modeling. Statist. Sci.
MacEachern, S. N. (1994).
Estimating normal means with a conjugate style Dirichlet process prior. Comm. Statist. Simulation Comput.
MacEachern, S. N. (1998). Computational methods for mixture of Dirichlet process models. In Practical Nonparametric and Semiparametric Bayesian Statistics (D. K. Dey, P. Müller and D. Sinha, eds.). Lecture Notes in Statist.
Mallat, S. G. and Zhang, Z. (1993). Matching pursuit with time-frequency dictionaries. IEEE Trans. Signal Process.
Müller, P. and Quintana, F. A. (2004). Nonparametric Bayesian data analysis. Statist. Sci.
Nikol'skiĭ, S. M. (1975). Approximation of Functions of Several Variables and Imbedding Theorems. Die Grundlehren der Mathematischen Wissenschaften. Springer, New York. Translated from the Russian by John M. Danskin, Jr. MR0374877
Pillai, N. S. (2008). Lévy random measures: Posterior consistency and applications. Ph.D. dissertation, Dept. Statist. Sci., Duke Univ. Available at http://stat.duke.edu/people/theses/PillaiNS.html.
Pillai, N. S., Wu, Q., Liang, F., Mukherjee, S. and Wolpert, R. L. (2007). Characterizing the function space for Bayesian kernel models. J. Mach. Learn. Res.
R Development Core Team (2004). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Available at .
Rajput, B. S. and Rosiński, J. (1989). Spectral representations of infinitely divisible processes. Probab. Theory Related Fields.
Reed, M. C. and Simon, B. (1975). Methods of Modern Mathematical Physics, Vol. II: Fourier Analysis, Self-Adjointness. Academic Press, New York.
Sato, K.-i. (1999). Lévy Processes and Infinitely Divisible Distributions. Cambridge Studies in Advanced Mathematics. Cambridge Univ. Press, Cambridge. Translated from the 1990 Japanese original. Revised by the author. MR1739520
Schmidt, G., Mattern, R. and Schüler, F. (1981).
Biomechanical investigation to determine physical and traumatological differentiation criteria for the maximum load capacity of head and vertebral column with and without protective helmet under the effects of impact. EEC research program on biomechanics of impacts, final report, phase III, Project 65, Institut für Rechtsmedizin, Univ. Heidelberg, Germany.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to nonparametric regression curve fitting. J. R. Stat. Soc. Ser. B Stat. Methodol.
Sisson, S. A. (2005). Transdimensional Markov chains: A decade of progress and future perspectives. J. Amer. Statist. Assoc.
Smith, M. and Kohn, R. (1996). Nonparametric regression using Bayesian variable selection. J. Econometrics.
Sobolev, S. L. (1991). Some Applications of Functional Analysis in Mathematical Physics. Translations of Mathematical Monographs. Amer. Math. Soc., Providence, RI. MR1125990
Sollich, P. (2002). Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning.
Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York. MR1697409
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol.
Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res.
Triebel, H. (1992). Theory of Function Spaces. II. Monographs in Mathematics. Birkhäuser, Basel. MR1163193
Tu, C. (2006). Nonparametric modelling using Lévy process priors with applications for function estimation, time series modeling and spatio-temporal modeling. Ph.D. dissertation, Dept. Statist. Sci., Duke Univ. Available at .
U.S. EPA (2007). Air Quality System (AQS). Available at .
Vidakovic, B. (1999). Statistical Modeling by Wavelets. Wiley, New York. MR1681904
Wahba, G. (1992). Multivariate function and operator estimation, based on smoothing splines and reproducing kernels.
In Nonlinear Modeling and Forecasting: Proceedings of the Workshop on Nonlinear Modeling and Forecasting held September, 1990, in Santa Fe, New Mexico (M. Casdagli and S. G. Eubank, eds.). SFI Studies in the Sciences of Complexity XII.
West, M. (2003). Bayesian factor regression models in the "large p, small n" paradigm. In Bayesian Statistics 7 (J. M. Bernardo et al., eds.) 733–742. Oxford Univ. Press, New York. MR2003537
Wolfe, P. J., Godsill, S. J. and Ng, W.-J. (2004). Bayesian variable selection and regularization for time-frequency surface estimation. J. R. Stat. Soc. Ser. B Stat. Methodol.
Wolpert, R. L., Ickstadt, K. and Hansen, M. B. (2003). A nonparametric Bayesian approach to inverse problems. In Bayesian Statistics 7 (J. M. Bernardo et al., eds.) 403–417. Oxford Univ. Press, New York. MR2003186
Wolpert, R. L. and Taqqu, M. S. (2005). Fractional Ornstein–Uhlenbeck Lévy processes and the Telecom process: Upstairs and downstairs. Signal Processing.
Zolotarev, V. M. (1986). One-dimensional Stable Distributions. Translations of Mathematical Monographs. Amer. Math. Soc., Providence, RI. MR0854867

R. L. Wolpert
M. A. Clyde
Department of Statistical Science
Duke University
Durham, North Carolina 27708-0251
USA
E-mail: [email protected]@stat.duke.edu
URL: