The Ridgelet Prior: A Covariance Function Approach to Prior Specification for Bayesian Neural Networks
Takuo Matsubara∗ (Newcastle University, The Alan Turing Institute), Chris J. Oates (Newcastle University, The Alan Turing Institute), François-Xavier Briol (University College London)

∗ Corresponding author email: [email protected]
October 19, 2020
Abstract
Bayesian neural networks attempt to combine the strong predictive performance of neural networks with formal quantification of uncertainty associated with the predictive output in the Bayesian framework. However, it remains unclear how to endow the parameters of the network with a prior distribution that is meaningful when lifted into the output space of the network. A possible solution is proposed that enables the user to posit an appropriate covariance function for the task at hand. Our approach constructs a prior distribution for the parameters of the network, called a ridgelet prior, that approximates the posited covariance structure in the output space of the network. The approach is rooted in the ridgelet transform and we establish both finite-sample-size error bounds and the consistency of the approximation of the covariance function in a limit where the number of hidden units is increased. Our experimental assessment is limited to a proof-of-concept, where we demonstrate that the ridgelet prior can out-perform an unstructured prior on regression problems for which an informative covariance function can be provided a priori.
Neural networks are beginning to be adopted in a range of sensitive application areas such as healthcare [46], social care [40], and the justice system [47], where the accuracy and reliability of their predictive output demands careful assessment. This problem lends itself naturally to the Bayesian paradigm and there has been a resurgence in interest in Bayesian neural networks (BNNs), originally introduced and studied in [4, 27, 33]. BNNs use the language of probability to express uncertainty regarding the "true" value of the parameters in the neural network, initially by assigning a prior distribution over the space of possible parameter configurations and then updating this distribution on the basis of a training dataset. The resulting posterior distribution over the parameter space implies an associated predictive distribution for the output of the neural network, assigning probabilities to each of the possible values that could be taken by the output of the network. This predictive distribution carries the formal semantics of the Bayesian framework and can be used to describe epistemic uncertainty associated with the phenomena being modelled.

Attached to any probabilistic quantification of uncertainty are semantics, which describe how probabilities should be interpreted (e.g. are these probabilities epistemic or aleatoric; whose belief is being quantified; what assumptions are premised?). As for any Bayesian model, the semantics of the posterior predictive distribution are largely inherited from the semantics of the prior distribution, which is typically a representation of a user's subjective belief about the unknown "true" values of parameters in the model. This represents a challenge for BNNs, as a user cannot easily specify their prior belief at the level of the parameters of the network in general settings where the influence of each parameter on the network's output can be difficult to understand. Furthermore, the total number of parameters can range from a few dozen to several million or more, rendering careful selection of priors for each parameter impractical. This has led some researchers to propose ad hoc choices for the prior distribution, which will be reviewed in Section 2 [see also 32]. Such ad hoc choices of prior appear to severely limit interpretability of the semantics of the BNN. It has also been reported that such priors can have negative consequences for the predictive performance of BNNs [57].

The development of interpretable prior distributions for BNNs is an active area of research that, if adequately solved, has the potential to substantially advance methodology for neural networks. Potential benefits include:
• Fewer Data Required: BNNs are "data hungry" models; their large number of parameters means that a large number of data are required for the posterior to concentrate on a suitable configuration of parameter values. The inclusion of domain knowledge in the prior distribution could be helpful in reducing the effective degrees of freedom in the parameter space, mitigating the requirement for a large training dataset.
• Faster Computation: The use of an ad hoc prior distribution can lead to a posterior distribution that is highly multi-modal [37], creating challenges for computation (e.g. using variational inference or Markov chain Monte Carlo). The inclusion of domain knowledge could be expected to counteract (to some extent) the multi-modality issue by breaking some of the symmetries present in the parametrisation of the network.
• Lower Generalisation Error: An important issue with BNNs is that their out-of-sample performance can be poor when an ad hoc prior is used. These issues have led several authors to question the usefulness of BNNs; see [29] and [53]. Model predictions are strongly driven by the prior and we therefore expect inclusion of domain knowledge to be an important factor in improving the generalisation performance of BNNs.

In this paper we do not claim to provide a solution to the problem of prior selection that enjoys all the benefits just discussed. Such an endeavour would require very extensive (and application-specific) empirical investigation, which is not our focus in this work. Rather, this paper proposes and studies a novel approach to prior specification that operates at the level of the output of the neural network and, in doing so, provides a route for expert knowledge on the phenomenon being modelled to be probabilistically encoded. The construction that we present is stylised to admit a detailed theoretical analysis and therefore the empirical results in this paper are limited to a proof-of-concept. In subsequent work we will discuss generalisations of the construction that may be more amenable to practical applications, for example by reducing the number of hidden units that may be required.

Our analysis can be viewed in the context of a recent line of research which focuses on the predictive distribution as a function of the prior distribution on network parameters [11, 15, 36, 45]. These papers propose to reduce the problem of prior selection for BNNs to the somewhat easier problem of prior selection for Gaussian processes (GPs). The approach studied in these papers, and also adopted in the present paper, can be summarised as follows: (i) elicit a GP model that encodes domain knowledge for the problem at hand; (ii) select a prior for the parameters of the BNN such that the output of the BNN in some sense "closely approximates" the GP. This high-level approach is appealing since it provides a direct connection between the established literature on covariance modelling for GPs [10, 39, 43] and the literature on uncertainty quantification using a BNN. For instance, existing covariance models can be used to encode a priori assumptions of amplitude, smoothness, periodicity and so on as required. Moreover, the number of parameters required to elicit a GP (i.e. the parameters of the mean and covariance functions) is typically much smaller than the number of parameters in a BNN.

Existing work on this topic falls into two categories. In the first, the prior is selected in order to minimise a variational objective between the BNN and the target GP [11, 15, 45]. Although some of the more recent approaches have demonstrated promising empirical results, all lack theoretical guarantees. In addition, these approaches often constrain the user to use a particular algorithm for posterior approximation (such as variational inference), or require access to some of the training data in order to construct the prior model. The second approach consists of carefully adapting the architecture of the BNN to ensure convergence to the GP via a central limit theorem argument [36]. This approach is particularly efficient, but requires deriving a new BNN architecture for every GP covariance function and in this sense may be considered impractical.

In this paper we propose the ridgelet prior, a novel method to construct interpretable prior distributions for BNNs. It follows the previous two-stage approach, but remedies several issues with existing approaches.
First, the ridgelet prior can be used to approximate any covariance function of interest (provided generic regularity conditions are satisfied) without the need to modify the architecture of the network. Second, we provide approximation error bounds which are valid for a finite-dimensional parametrisation of the network, as opposed to relying on asymptotic results such as a central limit theorem. Third, the prior can be used with any algorithm for posterior approximation, such as variational inference or Markov chain Monte Carlo. Finally, the ridgelet prior does not require access to any part of the dataset in its construction and is straightforward to implement (e.g. it does not require any numerical optimisation routine).

To construct the ridgelet prior, we build on existing analysis of the ridgelet transform [5, 30, 42], which was used to study the consistency of (non-Bayesian) neural networks. In particular, we derive a novel result for numerical approximation using a finite-bandwidth version of the ridgelet transform, presented in Theorem 2, that may be of independent interest. The ridgelet prior is defined for neural networks with $L \geq 1$ hidden layers, while our theoretical analysis concerns the case $L = 1$.

To begin, we briefly introduce notation for GPs and BNNs, discussing the issue of prior specification for these models.
This paper focuses on the problem of approximating a deterministic function $f : \mathbb{R}^d \to \mathbb{R}$ using a BNN. This problem is fundamental and underpins algorithms for regression and classification. The Bayesian approach is to model $f$ as a stochastic process (also called a "random function") $f : \mathbb{R}^d \times \Theta \to \mathbb{R}$, where $\Theta$ is a measurable parameter space on which a prior probability distribution is elicited, denoted $P$. The set $\Theta$ may be either finite or infinite-dimensional. In either case, $\theta \mapsto f(\cdot,\theta)$ is a random variable taking values in the vector space of real-valued functions on $\mathbb{R}^d$. The combination of a dataset of size $n$ and Bayes' rule are used to constrain, in a statistical sense, this distribution on $\Theta$, to produce a posterior $P_n$ that is absolutely continuous with respect to $P$. If the model is well-specified, then there exists an element $\theta^\dagger \in \Theta$ such that $f(\cdot,\theta^\dagger) = f(\cdot)$ and, if the Bayesian procedure is consistent, $P_n$ will converge (in an appropriate sense) to a point mass on $\theta^\dagger$ in the $n \to \infty$ limit.

In practice, Bayesian inference requires that a suitable prior distribution $P$ is elicited. Stochastic processes are intuitively described by their moments, and these can be used by a domain expert to elicit $P$. The first two moments are given by the mean function $m : \mathbb{R}^d \to \mathbb{R}$ and the covariance function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, given pointwise by

$$m(x) := \int_\Theta f(x,\theta)\,\mathrm{d}P(\theta), \qquad k(x,x') := \int_\Theta \big(f(x,\theta) - m(x)\big)\big(f(x',\theta) - m(x')\big)\,\mathrm{d}P(\theta)$$

for all $x, x' \in \mathbb{R}^d$. GPs and BNNs are examples of stochastic processes that can be used. In the case of a GP, the first two moments completely characterise $P$. Indeed, under the conditions of Mercer's theorem [see e.g. Section 4.5 of 44],

$$f(x,\theta) = m(x) + \sum_{i=1}^{\dim(\Theta)} \theta_i\, \varphi_i(x), \qquad \theta_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1),$$

where the $\varphi_i : \mathbb{R}^d \to \mathbb{R}$ are obtained from the Mercer decomposition of the covariance function, $k(x,x') = \sum_i \varphi_i(x)\varphi_i(x')$, and $\dim(\Theta)$ denotes the dimension of $\Theta$. The shorthand notation $GP(m,k)$ is often used to denote this GP. The kernel trick enables explicit computation with the $\theta_i$ and $\varphi_i$ to be avoided, so that the user can specify the mean and covariance functions and, in doing so, $P$ is implicitly defined. There is a well-established literature on covariance modelling [10, 39, 43] for GPs. For BNNs, however, there is no analogue of the kernel trick and it is unclear how to construct a prior $P$ for the parameters $\theta$ of the BNN that is in agreement with moments that have been expert-elicited.

Fix a function $\phi : \mathbb{R} \to \mathbb{R}$, which we will call the activation function. In this paper a BNN with $L \geq 1$ hidden layers is understood to be a stochastic process with functional form

$$f(x,\theta) = \sum_{j=1}^{N_L} w^L_{1,j}\, \phi(z^L_j(x)), \qquad z^l_i(x) := b^{l-1}_i + \sum_{j=1}^{N_{l-1}} w^{l-1}_{i,j}\, \phi(z^{l-1}_j(x)), \quad l = 2,\ldots,L, \qquad (1)$$

where $N_l := \dim(z^l(x))$ is the number of nodes in the $l$th layer and the edge case is the input layer $z^1_i(x) := b^0_i + \sum_{j=1}^d w^0_{i,j}\, x_j$. The parameters $\theta$ of the BNN consist of the weights $w^l_{i,j} \in \mathbb{R}$ and the biases $b^l_i \in \mathbb{R}$ of each layer $l = 0,\ldots,L$. Common examples of activation functions include the rectified linear unit (ReLU) $\phi(x) = \max(0,x)$, logistic $\phi(x) = 1/(1+\exp(-x))$, hyperbolic tangent $\phi(x) = \tanh(x)$ and the Gaussian $\phi(x) = \exp(-x^2)$. In all cases the complexity of the mapping $\theta \mapsto f(\cdot,\theta)$ in (1) makes prior specification challenging, since it is difficult to ensure that a distribution on the parameters $\theta$ will be meaningful when lifted to the output space of the neural network.
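To make the functional form (1) concrete, the following minimal sketch (in Python/NumPy; not part of the original paper) evaluates a single-hidden-layer network and draws its parameters from an independent Gaussian prior. The network sizes, the tanh activation and the prior scales are illustrative assumptions only:

```python
import numpy as np

def bnn_forward(x, W0, b0, W1, phi=np.tanh):
    """Evaluate a single-hidden-layer network f(x, theta) as in (1):
    f(x, theta) = sum_j W1[j] * phi(z_j(x)), where z(x) = b0 + W0 @ x."""
    z = b0 + W0 @ x        # pre-activations of the hidden layer
    return W1 @ phi(z)     # scalar network output

# Illustrative sizes: input dimension d, N1 hidden units.
d, N1 = 2, 50
rng = np.random.default_rng(0)

# An i.i.d. Gaussian prior over all parameters (an illustrative ad hoc
# choice; Section 2.2 discusses how such priors behave in output space).
W0 = rng.normal(0.0, 1.0, size=(N1, d))            # input weights w^0_{i,j}
b0 = rng.normal(0.0, 1.0, size=N1)                 # biases b^0_i
W1 = rng.normal(0.0, 1.0 / np.sqrt(N1), size=N1)   # output weights w^1_{1,i}

x = np.array([0.3, -0.7])
print(bnn_forward(x, W0, b0, W1))
```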
Here we discuss existing choices for the prior $P$ on $\theta$ in a BNN, which are motivated by the covariance structure that they induce on the output space of the neural network. This is a rapidly evolving field and a full review requires a paper in itself; we provide a succinct summary and refer the reader to the survey in [32].

Several deep connections between BNNs and GPs have been exposed [39, 43, 21, 19, 1]. The first detailed connection between BNNs and GPs was made by Neal [33]. Let $[w^l_i]_j := w^l_{i,j}$. In the case of a shallow BNN with
$L = 1$, assume that each of the weights $w^1$, $w^0_i$ and biases $b^0_i$ are a priori independent, each with mean 0 and with finite second moments $\sigma^2_{w^1}$, $\sigma^2_{w^0}$ and $\sigma^2_b$ respectively, where $\sigma^2_{w^1} = \sigma^2/N_1$ for some fixed $\sigma > 0$.
Figure 1: Lifting the parameter prior distribution of a Bayesian neural network (BNN) to the output space of the network. Left: realisations from a BNN, with the ReLU activation function and independent standard Gaussian distributions on the weights and bias parameters. Middle: the covariance function of a BNN, as the activation function is varied over ReLU, linear, sigmoid, hyperbolic tangent and Gaussian. Right: realisations from a BNN endowed with a ridgelet prior, which is constructed to approximate a GP with covariance $k(x,y) = \sigma^2 \exp(-l^2\|x-y\|_2^2)$ with $l = 0.75$. [In all cases one hidden layer was used.]

To improve presentation,
let $f(x) := f(x,\theta)$, so that $\theta$ is implicit, and let $\mathbb{E}$ denote expectation with respect to $\theta \sim P$. A well-known result from Neal [33] is that, according to the central limit theorem, the BNN converges asymptotically (as $N_1 \to \infty$) to a zero-mean GP with covariance function

$$k(x,x') := \mathbb{E}[f(x)f(x')] = \sigma^2\, \mathbb{E}\big[\phi(w \cdot x + b)\,\phi(w \cdot x' + b)\big] + \sigma_b^2. \qquad (2)$$

Analytical forms of the GP covariance were obtained for several activation functions $\phi$, such as the ReLU and Gaussian error functions, in [23, 54, 55]. Furthermore, similar results were obtained more recently for neural networks with multiple hidden layers in [23, 28, 34, 12]. Placing independent priors with the same second moments $\sigma^2/N_{l-1}$ and $\sigma_b^2$ on the weights and biases of the $l$th layer, and taking $N_1 \to \infty, N_2 \to \infty, \ldots$ in succession, it can be shown that the $l$th layer of this BNN converges to a zero-mean GP with covariance

$$k^l(x,x') = \sigma^2\, \mathbb{E}_{z^{l-1}_i \sim GP(0,\,k^{l-1})}\big[\phi(z^{l-1}_i(x))\,\phi(z^{l-1}_i(x'))\big] + \sigma_b^2. \qquad (3)$$

Of course, the discussion of this section is informal only and we refer the reader to the original references for full and precise detail.

The identification of limiting forms of covariance function allows us to investigate whether such priors are suitable for performing uncertainty quantification in real-world tasks. Unfortunately, the answer is often "no". One reason, which has been demonstrated empirically by multiple authors [15, 56, 57], is that BNNs can have poor out-of-sample performance. Typically one would want a covariance function to have a notion of locality, so that $k(x,x')$ decays sufficiently rapidly as $x$ and $x'$ become distant from each other. This ensures that when predictions are made for a location $x$ that is far from the training dataset, the predictive variance is appropriately increased. However, as exemplified in Figure 1, the covariance structure of a BNN need not be local. Even the use of a light-tailed Gaussian activation function can still lead to a covariance model that is non-local. These existing studies [15, 56, 57] illustrate the difficulties of inducing a meaningful prior on the output space of the neural network when operating at the level of the parameters $\theta$ of the network.
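As a quick numerical illustration of (2), the sketch below (not from the paper; the ReLU activation, standard Gaussian $w$, $b$ and unit variances are assumptions made purely for illustration) estimates the limiting covariance by Monte Carlo, which is one way to reproduce the middle panel of Figure 1 for different activation functions:

```python
import numpy as np

def neal_covariance(x1, x2, phi, sigma=1.0, sigma_b=1.0, n_mc=100_000, seed=0):
    """Monte Carlo estimate of the limiting BNN covariance in (2):
    k(x, x') = sigma^2 E[phi(w.x + b) phi(w.x' + b)] + sigma_b^2,
    with w and b standard Gaussian here (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(n_mc, len(x1)))
    b = rng.normal(size=n_mc)
    return sigma**2 * np.mean(phi(w @ x1 + b) * phi(w @ x2 + b)) + sigma_b**2

relu = lambda z: np.maximum(0.0, z)
x = np.array([0.5])
for x2 in [0.5, 2.0, 10.0]:
    # Non-local behaviour: k(x, x') need not decay as ||x - x'|| grows.
    print(x2, neal_covariance(x, np.array([x2]), relu))
```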
In this section the ridgelet approach to prior specification is presented. The approach relies on the classical ridgelet transform, which is briefly introduced in Section 3.1. Then, in Section 3.2, we describe how a ridgelet prior is constructed.

3.1 The Ridgelet Transform

The ridgelet transform [5, 30, 42] was developed in the context of harmonic analysis in the 1990s [3, 17, 24, 30, 22] and has received recent interest as a tool for the theoretical analysis of neural networks [42, 2, 41, 35]. In this section we provide a brief and informal description of the ridgelet transform, deferring all mathematical details until Section 4. To this end, let $\hat f$ denote the Fourier transform of a function $f$, and let $\bar z$ denote the complex conjugate of $z \in \mathbb{C}$. Given an activation function $\phi : \mathbb{R} \to \mathbb{R}$, suppose we have a corresponding function $\psi : \mathbb{R} \to \mathbb{R}$ such that the relationship

$$(2\pi)^d \int_{\mathbb{R}} |\xi|^{-d}\, \overline{\hat\psi(\xi)}\, \hat\phi(\xi)\, \mathrm{d}\xi = 1$$

holds. Such a function $\psi$ is available in closed form for many of the activation functions $\phi$ that are commonly used in neural networks; examples can be found in Table 1 in Section 4.2 and Table 2 in Appendix A.3. Then, under regularity conditions detailed in Section 4, the ridgelet transform of a function $f : \mathbb{R}^d \to \mathbb{R}$ is defined as

$$R[f](w,b) := \int_{\mathbb{R}^d} \psi(w \cdot x + b)\, f(x)\, \mathrm{d}x \qquad (4)$$

for $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, and the dual ridgelet transform of a function $\tau : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$ is defined as

$$R^*[\tau](x) := \int_{\mathbb{R}^{d+1}} \phi(w \cdot x + b)\, \tau(w,b)\, \mathrm{d}w\, \mathrm{d}b \qquad (5)$$

for $x \in \mathbb{R}^d$. There are two main properties of the ridgelet transform that we exploit in this work. First, a discretisation of (5) using a cubature method gives rise to an expression closely related to one layer of a neural network; c.f. (1). Second, under regularity conditions, the dual ridgelet transform works as the pseudo-inverse of the ridgelet transform, meaning that $(R^* R)[f] = f$ whenever the left hand side is defined. Next we explain how these two properties of the ridgelet transform will be used.

3.2 The Ridgelet Prior

In this section our proposed ridgelet prior is presented. Our starting point is a Gaussian stochastic process $GP(m,k)$ and we aim to construct a probability distribution for the parameters $\theta$ of a neural network $f(x,\theta)$ in (1) such that the stochastic process $\theta \mapsto f(\cdot,\theta)$ closely approximates the GP, in a sense yet to be defined.

The construction proceeds in three elementary steps. The first step makes use of the property that discretisation of $R^*$ in (5) using a cubature method (i.e. a linear combination of function values) gives rise to a neural network. To see this, let us abstractly denote by $\tilde R$ and $\tilde R^*$ approximations of $R$ and $R^*$ obtained using cubature methods with $D$ and $N$ nodes respectively:

$$R[f](w,b) \approx \tilde R[f](w,b) := \sum_{j=1}^{D} u_j\, \psi(w \cdot x_j + b)\, f(x_j) \qquad (6)$$

$$R^*[\tau](x) \approx \tilde R^*[\tau](x) := \sum_{i=1}^{N} v_i\, \phi(w_i \cdot x + b_i)\, \tau(w_i,b_i) \qquad (7)$$

where $(x_j,u_j)_{j=1}^D \subset \mathbb{R}^d \times \mathbb{R}$ and $((w_i,b_i),v_i)_{i=1}^N \subset \mathbb{R}^{d+1} \times \mathbb{R}$ are the cubature nodes and weights employed respectively in (6) and (7). The specification of suitable cubature nodes and weights will be addressed in Section 4, but for now we assume that they have been specified. It is clear that (7) closely resembles one layer of a neural network; c.f. (1).

The second step makes use of the fact that $(R^* R)[f] = f$, which suggests that we may approximate a function $f$ using the discretised ridgelet transform and its dual:

$$(\tilde R^* \tilde R)[f](x) = \sum_{i=1}^{N} v_i\, \phi(w_i \cdot x + b_i) \Bigg[ \sum_{j=1}^{D} u_j\, \psi(w_i \cdot x_j + b_i)\, f(x_j) \Bigg] = \sum_{i=1}^{N} \underbrace{\sum_{j=1}^{D} v_i\, u_j\, \psi(w_i \cdot x_j + b_i)\, f(x_j)}_{=:\; w^1_{1,i}}\, \phi(w_i \cdot x + b_i) \qquad (8)$$

where the coefficients $w^1 = (w^1_{1,1},\ldots,w^1_{1,N})^\top$ depend explicitly on the function $f$ being approximated.
Thus $\tilde R^* \tilde R$ is a linear operator that returns a neural network approximation to each function $f$ provided as input.

The third and final step is to compute the pushforward of $GP(m,k)$ through the linear operator $\tilde R^* \tilde R$, in order to obtain a probability distribution over the coefficients $w^1$ of the neural network in (8). Let $[m]_i := m(x_i)$, $[K]_{i,j} := k(x_i,x_j)$ and $[\Psi]_{i,j} := v_i u_j \psi(w_i \cdot x_j + b_i)$, so that $m \in \mathbb{R}^D$, $K \in \mathbb{R}^{D\times D}$ and $\Psi \in \mathbb{R}^{N\times D}$. If $f \sim GP(m,k)$ then it follows immediately that $w^1$ is a Gaussian random vector with

$$\mathbb{E}[w^1] = \Psi m, \qquad \mathbb{E}\big[(w^1 - \mathbb{E}[w^1])(w^1 - \mathbb{E}[w^1])^\top\big] = \Psi K \Psi^\top.$$

Figure 2: Example of the covariance matrix $\Psi^{l-1} K (\Psi^{l-1})^\top$ of the ridgelet prior. Left: the covariance matrix of an independent standard Gaussian prior on $\mathbb{R}^N$, for comparison. Right: the covariance matrix $\Psi^{l-1} K (\Psi^{l-1})^\top$ computed from 3 independent realisations of $\{w^{l-1}_i, b^{l-1}_i\}_{i=1}^{N_{l-1}}$ and from $K$ of a Gaussian covariance function.

To arrive at a prior for a general neural network of the form (1) we apply this construction recursively, starting from the input layer and working up towards the output layer. The dimension of the cubature rule $((w^l_i,b^l_i),v^l_i)_{i=1}^{N_l}$ used at level $l$ is required to equal $N_l$, so that our discretised ridgelet transform inherits the same network architecture as in (1). Our notation is generalised to $[\Psi^l]_{i,j} := v_i u_j \psi(w^l_i \cdot x_j + b^l_i)$ in order to indicate that this cubature rule with $N_l$ elements was used to construct the matrix $\Psi^l$. Our ridgelet prior can now be formally defined:

Definition 1 (Ridgelet Prior). Consider the neural network in (1). Given a mean function $m : \mathbb{R}^d \to \mathbb{R}$ and a covariance function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, a prior distribution is called a ridgelet prior if the weights $w^l_i$ at level $l$ depend on the weights and biases at level $l-1$ according to

$$w^l_i \mid \{(w^{l-1}_r, b^{l-1}_r) : r = 1,\ldots,N_{l-1}\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}\big(\Psi^{l-1} m,\; \Psi^{l-1} K (\Psi^{l-1})^\top\big)$$

where $i = 1,\ldots,N_l$ and $l = 1,\ldots,L$. To complete the prior specification, the bias parameters $b^l_i$ at all layers and the weights $w^0_i$ at the input layer are required to be independent and identically distributed, denoted $b^l_i \overset{\text{i.i.d.}}{\sim} P_b$, $w^0_i \overset{\text{i.i.d.}}{\sim} P_w$, where the distributions $P_b$ and $P_w$ are respectively supported on $\mathbb{R}$ and $\mathbb{R}^d$.
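The single-hidden-layer case of Definition 1 can be sketched as follows (illustrative NumPy code, not from the paper; the Gaussian target covariance, the uniform grid of cubature nodes, the placeholder pair `phi`, `psi` and the normalisation `Z = 1` are all assumptions, and Table 1 should be consulted for an admissible $(\phi,\psi)$ pair):

```python
import numpy as np

d, N, D = 1, 200, 100        # input dim, hidden units, cubature nodes
S, Z = 1.0, 1.0              # domain half-width; Z set to 1 for illustration
rng = np.random.default_rng(1)

# Placeholder pair (phi, psi); an admissible pair must satisfy Assumption 1.
phi, psi = np.tanh, lambda z: np.exp(-z**2)

xj = np.linspace(-S, S, D).reshape(D, d)    # cubature nodes x_j for (6)
uj = np.full(D, (2 * S)**d / D)             # cubature weights u_j
wi = rng.normal(size=(N, d))                # nodes (w_i, b_i) for (7),
bi = rng.normal(size=N)                     # drawn i.i.d. as in Assumption 4
vi = np.full(N, Z / N)

# [Psi]_{ij} = v_i u_j psi(w_i . x_j + b_i), as in the third step above.
Psi = vi[:, None] * uj[None, :] * psi(wi @ xj.T + bi[:, None])

# Target GP moments evaluated on the cubature nodes: m = 0, Gaussian K.
m = np.zeros(D)
K = np.exp(-(xj - xj.T)**2 / 0.5**2)

# Output weights w^1 ~ N(Psi m, Psi K Psi^T) per Definition 1.
w1 = rng.multivariate_normal(Psi @ m, Psi @ K @ Psi.T)

f = lambda x: phi(wi @ x + bi) @ w1         # one draw from the ridgelet prior
print(f(np.array([0.3])))
```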
Several initial remarks are in order:

Remark 1.
The dependence of the distribution for the weights $w^l_i$ on the previous layer's weights $w^{l-1}_r$ and biases $b^{l-1}_r$, $r = 1,\ldots,N_{l-1}$, is an important feature of our construction and seems essential if we are aiming to approximate the covariance function $k$ that was specified. This dependence is illustrated in Figure 2.
Remark 2. Here we discuss how to sample the weight parameters $w^l_i$ of the $l$th hidden layer, conditional on the values of the parameters in lower layers being fixed. Recall that sampling from a multivariate Gaussian $\mathcal{N}(0,\Sigma)$ requires computing a square root $\Sigma^{1/2}$ of the covariance matrix, $\Sigma = (\Sigma^{1/2})(\Sigma^{1/2})^\top$; a sample $w \sim \mathcal{N}(0,\Sigma)$ is then generated through the change of variables $w = \Sigma^{1/2}\tilde w$ where $\tilde w \sim \mathcal{N}(0,I)$. The most efficient way to achieve this will depend on whether $N_{l-1}$ is larger than $D$, or not:

• For $N_{l-1} > D$ the matrix $\Sigma := \Psi^{l-1} K (\Psi^{l-1})^\top$ will have less than full rank. In this case we can compute a singular value decomposition of the $D$-dimensional square matrix $K = AVA^\top$, where $V$ is diagonal, and take $\Sigma^{1/2} := \Psi^{l-1} A V^{1/2}$, where $V^{1/2}$ is the diagonal matrix with $(V^{1/2})_{i,i} = \sqrt{V_{i,i}}$.

• For $N_{l-1} \le D$ we can simply compute the singular value decomposition of the $N_{l-1}$-dimensional square matrix $\Sigma$ as $BWB^\top$, where $W$ is diagonal, and take $\Sigma^{1/2} := BW^{1/2}$, where $W^{1/2}$ is the diagonal matrix with $(W^{1/2})_{i,i} = \sqrt{W_{i,i}}$.

Thus the computational complexity of sampling the weight parameters $w^l_i$ of the $l$th hidden layer, conditional on the values of the parameters in lower layers being fixed, is $O(\min\{D, N_{l-1}\}^3)$.
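A sketch of the two square-root strategies in Remark 2 (illustrative NumPy code; the matrix sizes are arbitrary, and `Psi`, `K` stand for the quantities $\Psi^{l-1}$ and $K$ above):

```python
import numpy as np

def sample_ridgelet_weights(Psi, K, mean, rng, n_samples=1):
    """Draw w ~ N(mean, Psi K Psi^T) via an explicit square root of the
    covariance, using whichever factorisation is cheaper (Remark 2)."""
    N, D = Psi.shape
    if N > D:
        # Rank-deficient case: factorise the D x D matrix K instead.
        A, V, _ = np.linalg.svd(K)              # K = A diag(V) A^T (K symmetric PSD)
        sqrt_cov = Psi @ A @ np.diag(np.sqrt(V))       # N x D square root
    else:
        Sigma = Psi @ K @ Psi.T
        B, W, _ = np.linalg.svd(Sigma)          # Sigma = B diag(W) B^T
        sqrt_cov = B @ np.diag(np.sqrt(W))             # N x N square root
    eps = rng.normal(size=(sqrt_cov.shape[1], n_samples))
    return (mean[:, None] + sqrt_cov @ eps).T   # each row is one sample

rng = np.random.default_rng(0)
Psi = rng.normal(size=(300, 50))
K = np.exp(-np.subtract.outer(range(50), range(50))**2 / 100)
w = sample_ridgelet_weights(Psi, K, np.zeros(300), rng)
print(w.shape)   # (1, 300)
```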
Remark 3. Let $[\phi^1(x)]_i := \phi(w_i \cdot x + b_i)$. In comparison with (2), the covariance of a BNN with one hidden layer and endowed with our ridgelet prior takes the form

$$\mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big] = \mathbb{E}\big[\phi^1(x)^\top\, \Psi K \Psi^\top\, \phi^1(x')\big] \qquad (9)$$

where the expectation on the right hand side is with respect to the random weights $w_i$ and biases $b_i$. In Section 4 it is shown that, in an appropriate limit, the expression in (9) converges to $k(x,x')$, the covariance model that we aimed to approximate at the outset.

This completes our definition of the ridgelet prior. An illustration is provided in Figure 3, the full details of which are reserved for Section 5.1. It can be seen that as $N$, the number of hidden units in the BNN, is increased, the samples from the BNN begin to resemble, in a statistical sense, samples from the target GP. In the case of multiple hidden layers, a larger number $N$ of hidden units appears to be required to achieve a similar degree of approximation to the GP. Next we present our theoretical analysis, which considers only the case of one hidden layer, and is the principal contribution of the paper.

This section presents our theoretical analysis of the ridgelet prior in the setting of a single hidden layer ($L = 1$) BNN.

Figure 3: Sample paths from a Bayesian neural network (BNN) equipped with a ridgelet prior. Panels (a)-(f) examine the effect of increasing the number $N$ of hidden units; panel (g) displays sample paths from the target GP.

Notation:
The following notation will be used. For a topological space $\mathcal{X}$ and a Borel measure $\mu$ on $\mathcal{X}$, let $L^p(\mathcal{X},\mu)$ denote the set of functions $f : \mathcal{X} \to \mathbb{R}$ such that $\|f\|_{L^p(\mathcal{X},\mu)} < \infty$, where

$$\|f\|_{L^p(\mathcal{X},\mu)} := \begin{cases} \Big(\int_{\mathcal{X}} |f(x)|^p\, \mathrm{d}\mu(x)\Big)^{1/p} & 1 \le p < \infty \\ \operatorname{ess\,sup}_{x\in\mathcal{X}} |f(x)| & p = \infty. \end{cases}$$

If $\mu$ is the Lebesgue measure on $\mathcal{X} \subset \mathbb{R}^d$, we use the shorthand $L^p(\mathcal{X})$ for $L^p(\mathcal{X},\mu)$ and furthermore we let $L^p_{\text{loc}}(\mathcal{X})$ denote the set of functions $f : \mathcal{X} \to \mathbb{R}$ such that $\|f\|_{L^p(K)}$ exists and is finite on all compact $K \subset \mathcal{X}$. Given $\alpha,\beta \in \mathbb{N}_0^d$, the multi-index notation $x^\alpha := x_1^{\alpha_1}\cdots x_d^{\alpha_d}$, $\partial^\alpha h(x) := \partial^{\alpha_1}_{x_1}\cdots\partial^{\alpha_d}_{x_d} h(x)$ and $\partial^{\alpha,\beta} h(x,y) := \partial^{\alpha_1}_{x_1}\cdots\partial^{\alpha_d}_{x_d}\partial^{\beta_1}_{y_1}\cdots\partial^{\beta_d}_{y_d} h(x,y)$ will be used. For $\mathcal{X} \subset \mathbb{R}^d$, let $C(\mathcal{X})$ denote the set of all continuous functions $f : \mathcal{X} \to \mathbb{R}$. Similarly, for $r \in \mathbb{N} \cup \{\infty\}$, denote by $C^r(\mathcal{X})$ the set of all functions $f : \mathcal{X} \to \mathbb{R}$ for which the derivatives $\partial^\alpha f$ exist and are continuous on $\mathcal{X}$ for all $\alpha \in \mathbb{N}_0^d$ with $|\alpha| \le r$. Denote by $C^{r\times r}(\mathcal{X}\times\mathcal{X})$ the set of all functions $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ for which the derivatives $\partial^{\alpha,\beta} k$ exist and are continuous on $\mathcal{X}\times\mathcal{X}$ for all $\alpha,\beta \in \mathbb{N}_0^d$ with $|\alpha|,|\beta| \le r$. A bivariate function $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ is called positive definite if $\sum_{i,j} a_i a_j k(x_i,x_j) > 0$ for all $0 \ne (a_1,\ldots,a_n) \in \mathbb{R}^n$ and all distinct $\{x_i\}_{i=1}^n \subset \mathcal{X}$.

4.1 Regularity of the Activation Function

In this section we outline our regularity assumptions on the activation function $\phi$. To do this we first recall the classical Fourier transform and its generalisations, all for real-valued functions on $\mathbb{R}^d$.

Fourier transform on $L^1(\mathbb{R}^d)$ and $L^2(\mathbb{R}^d)$: The Fourier transform of $f \in L^1(\mathbb{R}^d)$ is defined by $\hat f(\xi) := (2\pi)^{-d}\int_{\mathbb{R}^d} f(x)\exp(-i\xi\cdot x)\,\mathrm{d}x$ for each $\xi \in \mathbb{R}^d$. The Fourier transform of $f \in L^2(\mathbb{R}^d)$ is formally defined as the limit of a sequence $(\hat f_n)_{n\in\mathbb{N}}$ where the $f_n \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ and $f_n \to f$ in $L^2(\mathbb{R}^d)$ [see e.g. 14, p.113-114]. The image of the Fourier transform on $L^2(\mathbb{R}^d)$ is not contained in $L^1(\mathbb{R}^d)$. However, there exists a subset of $L^1(\mathbb{R}^d)$ on which the Fourier transform defines an automorphism. It is convenient to work on this subset, which consists of so-called Schwartz functions, defined next.
Schwartz functions:
A function $f \in C^\infty(\mathbb{R}^d)$ is called a Schwartz function if for any pair of multi-indices $\alpha,\beta \in \mathbb{N}_0^d$ we have that $C_{\alpha,\beta} := \sup_{x\in\mathbb{R}^d} \big|x^\alpha\, \partial^\beta f(x)\big| < \infty$. The set of Schwartz functions is denoted by $S(\mathbb{R}^d)$ [14, p.105]. Note that the Fourier transform on $S(\mathbb{R}^d)$ is well-defined as $S(\mathbb{R}^d) \subset L^1(\mathbb{R}^d)$. Moreover, the Fourier transform is a homeomorphism from $S(\mathbb{R}^d)$ onto itself [14, p.113].

The most commonly encountered activation functions $\phi$ are not elements of $L^1(\mathbb{R})$ and we therefore also require the notion of a generalised Fourier transform:
Generalised Fourier transform: Let $S_m(\mathbb{R}^d)$ be the vector space of functions $f \in S(\mathbb{R}^d)$ that satisfy $f(x) = O(\|x\|^m)$ for $x \to 0$. For $k \in \mathbb{N}_0$, let $t : \mathbb{R}^d \to \mathbb{R}$ be any continuous function of at most polynomial growth of order $k$, meaning that $\sup_{x\in\mathbb{R}^d} |t(x)|/(1+\|x\|^k) < \infty$. A measurable function $\hat t \in L^2(\mathbb{R}^d\setminus\{0\})$ is called a generalised Fourier transform of $t$ if there exists an integer $m \in \mathbb{N}_0$ such that $\int_{\mathbb{R}^d} \hat t(w)\, f(w)\,\mathrm{d}w = \int_{\mathbb{R}^d} t(x)\, \hat f(x)\,\mathrm{d}x$ for all $f \in S_m(\mathbb{R}^d)$ [52, p.103]. The set of all continuous functions of at most polynomial growth of order $k$ that admit a generalised Fourier transform will be denoted $C^*_k(\mathbb{R}^d)$.

The generalised Fourier transform can be computed for activation functions $\phi$ that are typically used in a neural network, for which classical Fourier transforms are not well-defined:

Example 1 (Generalised Fourier transform of the ReLU function). Let $\phi(x) := \max(0,x)$; then $\phi \in C^*_1(\mathbb{R})$. Although $\phi$ is not an element of $L^1(\mathbb{R})$, it has polynomial growth and admits a generalised Fourier transform $\hat\phi(w) = -(\sqrt{2\pi}\,w^2)^{-1}$.

Example 2 (Generalised Fourier transform of the tanh function). Let $\phi(x) := \tanh(x)$; then $\phi \in C^*_0(\mathbb{R})$. Likewise, $\phi$ admits a generalised Fourier transform $\hat\phi(w) = -i\sqrt{\pi/2}\,\operatorname{csch}\big(\tfrac{\pi w}{2}\big)$, where csch is the hyperbolic cosecant.

In order to present our theoretical results, we will make the following assumptions on the activation function $\phi$ that defines the ridgelet transform:

Assumption 1 (Activation function). The activation function $\phi : \mathbb{R} \to \mathbb{R}$ satisfies:

1. $\phi \in C^*_0(\mathbb{R})$, and there exists a function $\psi : \mathbb{R} \to \mathbb{R}$ such that

2. $\psi \in S(\mathbb{R})$ and $(2\pi)^d \int_{\mathbb{R}} |\xi|^{-d}\, \overline{\hat\psi(\xi)}\, \hat\phi(\xi)\, \mathrm{d}\xi = 1$,

3. $\int_{\mathbb{R}} |\xi|^{-d-1}\, \big|\hat\psi(\xi)\, \hat\phi(\xi)\big|\, \mathrm{d}\xi < \infty$.

Function | Interpretation | Example
$\phi(z)$ | activation function | $\tanh(z)$
$\psi(z)$ | defines the ridgelet transform | a constant multiple of $\frac{\mathrm{d}^{d+1+r}}{\mathrm{d}z^{d+1+r}}\big[\exp(-z^2)\sin(\pi z)\big]$, with $r \in \{0,1\}$ determined by the parity of $d$

Table 1: An example of functions $\phi$, $\psi$ that satisfy the regularity assumptions used in this work.

The boundedness assumption $\phi \in C^*_0(\mathbb{R})$ rules out some commonly used activation functions, such as ReLU, but enables stronger convergence results to be obtained. However, our analysis is also able to handle $\phi \in C^*_1(\mathbb{R})$. For presentational purposes we present results under the assumption $\phi \in C^*_0(\mathbb{R})$ in the main text and, in Appendix A.3, we present theoretical results for the unbounded setting $\phi \in C^*_1(\mathbb{R})$. Parts 2 and 3 of Assumption 1 are the standard assumptions for the ridgelet transform [5, 30]; examples can be found in Table 1 in Section 4.2, Table 2 in Appendix A.3 and Table 4 in Appendix A.4.2. Next we turn our attention to proving novel and general results about the ridgelet transform, under Assumption 1.

This section introduces a finite-bandwidth approximation to the ridgelet transform that underpins our analysis. The results of this section focus solely on the ridgelet transform and may therefore be of more general interest.

To proceed, we aim to approximate the dual ridgelet transform in (5) with a finite-bandwidth transform, meaning that the Lebesgue reference measure in (5) is replaced by a finite measure, corresponding to $P_w$ and $P_b$ in Definition 1. This will in turn enable cubature methods to be applied to discretise (5). It will be convenient to introduce probability density functions $p_w$ and $p_b$, respectively, for the distributions $P_w$ and $P_b$, and then to introduce a scaling parameter to control convergence toward the improper uniform limit. For $0 < \sigma_w < \infty$ and $0 < \sigma_b < \infty$, define scaled densities of $p_w$ and $p_b$ by $p_{w,\sigma}(w) := \sigma_w^{-d}\, p_w(\sigma_w^{-1} w)$ and $p_{b,\sigma}(b) := \sigma_b^{-1}\, p_b(\sigma_b^{-1} b)$.
The parameters $\sigma_w$ and $\sigma_b$ will be called bandwidths. For example, if $p_w$ and $p_b$ are standard Gaussians then $p_{w,\sigma}(w)$ and $p_{b,\sigma}(b)$ are also Gaussians, with variances $\sigma_w^2 I$ and $\sigma_b^2$, i.e.

$$p_{w,\sigma}(w) = (2\pi\sigma_w^2)^{-d/2}\exp\bigg(-\frac{\|w\|^2}{2\sigma_w^2}\bigg), \qquad p_{b,\sigma}(b) = (2\pi\sigma_b^2)^{-1/2}\exp\bigg(-\frac{b^2}{2\sigma_b^2}\bigg).$$

Some regularity assumptions will be required of $p_w$ and $p_b$. Let $p$ be any probability density function defined on $\mathbb{R}^m$ for some $m \in \mathbb{N}$ and consider the following properties that $p$ may satisfy:

(i) (Symmetry at the origin) $p(x) = p(-x)$ for all $x \in \mathbb{R}^m$;
(ii) (Boundedness) $\|p\|_{L^\infty(\mathbb{R}^m)} < \infty$;
(iii) (Finite second moment) $\int_{\mathbb{R}^m} \|x\|^2\, p(x)\,\mathrm{d}x < \infty$;
(iv) (Positivity of the Fourier transform) $\hat p(x) > 0$ for all $x \in \mathbb{R}^m$;
(v) (Integrable Fourier transform) $\|\hat p\|_{L^1(\mathbb{R}^m)} < \infty$.

Of these properties, only (iv) and (v) are not straightforward. If $\hat p$ can be computed then conditions (iv) and (v) can be directly verified. Otherwise, sufficient conditions on $p$ for (iv) can be found in [49] and sufficient conditions on $p$ for (v) can be found in [25]. Specifically, the following regularity assumptions on $p_w$ and $p_b$ are required in this work:

Assumption 2 (Finite bandwidth). The probability density functions $p_w : \mathbb{R}^d \to [0,\infty)$ and $p_b : \mathbb{R} \to [0,\infty)$ satisfy properties (i)-(v) above and, in addition,

1. $\int_{\mathbb{R}^d} \|x\|^2\, \widehat{p_w}(x)\,\mathrm{d}x < \infty$,
2. $\|\partial \widehat{p_b}\|_{L^1(\mathbb{R})} < \infty$,

where we recall that $\partial \widehat{p_b}$ denotes the first derivative of $\widehat{p_b}$. As an explicit example, standard Gaussian densities for $p_w$ and $p_b$ satisfy Assumption 2. Now let $\lambda_\sigma$ denote the finite measure on $\mathbb{R}^{d+1}$ with density $\lambda_\sigma(w,b) = Z\, p_{w,\sigma}(w)\, p_{b,\sigma}(b)$, where for the analysis that follows it will be convenient to set $Z := (2\pi)\, \sigma_w^d\, \sigma_b\, \|\widehat{p_w}\|_{L^1(\mathbb{R}^d)}\, \|\widehat{p_b}\|_{L^1(\mathbb{R})}$, indicating the total measure assigned by $\lambda_\sigma$ to $\mathbb{R}^{d+1}$.

Definition 2 (Finite-bandwidth ridgelet transform). In the setting of Assumption 1 and Assumption 2, the ridgelet transform $R$ and the finite-bandwidth approximation $R^*_\sigma$ of its dual $R^*$ are defined pointwise as

$$R[f](w,b) := \int_{\mathbb{R}^d} f(x)\, \psi(w \cdot x + b)\,\mathrm{d}x, \qquad (10)$$

$$R^*_\sigma[\tau](x) := \int_{\mathbb{R}^{d+1}} \tau(w,b)\, \phi(w \cdot x + b)\,\mathrm{d}\lambda_\sigma(w,b), \qquad (11)$$

for all $f \in L^1(\mathbb{R}^d)$, $\tau \in L^1(\mathbb{R}^{d+1},\lambda_\sigma)$, $x \in \mathbb{R}^d$, $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$. For $f \in L^1(\mathbb{R}^d)$, the boundedness of $\psi \in S(\mathbb{R})$ guarantees $R[f] \in L^\infty(\mathbb{R}^{d+1})$. Similarly, for any $\tau \in L^1(\mathbb{R}^{d+1},\lambda_\sigma)$, the boundedness of $\phi \in C^*_0(\mathbb{R})$ guarantees $R^*_\sigma[\tau] \in L^\infty(\mathbb{R}^d)$. Recall that a discussion on relaxation of the assumption $\phi \in C^*_0(\mathbb{R})$ is reserved for Appendix A.3.

The classical ridgelet transform in the sense of [30] replaces $\lambda_\sigma$ with the Lebesgue measure on $\mathbb{R}^{d+1}$, which can be considered as the limit of $\lambda_\sigma$ as $\sigma_w \to \infty$ and $\sigma_b \to \infty$. In this limit the operator $R^* R$ is an identity operator on $L^1(\mathbb{R}^d)$ [30]. An original contribution of our work, which may be of more general interest, is to study approximation properties of the operator $R^*_\sigma R$ when a finite measure $\lambda_\sigma$ is used. It turns out that one can rigorously establish convergence of $(R^*_\sigma R)[f]$ to $f$ in an appropriate limit of $\sigma_w \to \infty$ and $\sigma_b \to \infty$, and we provide an explicit, non-asymptotic error bound in Theorem 1.

Let $M_r(f) := \max_{|\alpha|\le r} \sup_{x\in\mathbb{R}^d} |\partial^\alpha f(x)|$ and $B_r(f) := \int_{\mathbb{R}^d} |f(x)|\,(1+\|x\|)^r\,\mathrm{d}x$, in each case for $r \in \mathbb{N}_0$.

Theorem 1.
Let Assumption 1 and Assumption 2 hold, and let $f \in C^1(\mathbb{R}^d)$ satisfy $M_1(f) < \infty$ and $B_1(f) < \infty$. Then

$$\sup_{x\in\mathbb{R}^d} \big|f(x) - (R^*_\sigma R)[f](x)\big| \le C\, \max\big(M_1(f), B_1(f)\big)\, \bigg\{ \frac{1}{\sigma_w} + \frac{\sigma_w^d(\sigma_w + 1)}{\sigma_b} \bigg\}$$

for some constant $C$ that is independent of $\sigma_w$, $\sigma_b$ and $f$, but may depend on $\phi$, $\psi$, $p_w$ and $p_b$.

The proof is provided in Appendix A.1. Our priority here was to provide a simple upper bound on the reconstruction error that separates the influence of $f$ from the influence of the bandwidths $\sigma_w$ and $\sigma_b$; the bound is not claimed to be tight. The formulation of Theorem 1 enables us to conveniently conclude that taking $\sigma_w$ and $\sigma_b$ to infinity in a manner such that $\sigma_w^{d+1}\sigma_b^{-1} \to 0$ ensures that $(R^*_\sigma R)[f]$ converges uniformly to $f$, for any function $f$ for which $M_1(f), B_1(f) < \infty$ is satisfied. The term $M_1(f)$ reflects the general difficulty of approximating $f$ using a finite-bandwidth ridgelet transform, while the term $B_1(f)$ serves to deal with the tails of $f$ and the fact that a supremum is taken over the unbounded domain $\mathbb{R}^d$. Theorem 5 in Appendix A.3 presents a variation on Theorem 1 that relaxes the boundedness assumption on the activation function from $\phi \in C^*_0(\mathbb{R})$ to $\phi \in C^*_1(\mathbb{R})$, at the expense of strengthening the tail condition on $B_r(f)$ and restricting the supremum to a compact subset of $\mathbb{R}^d$.

In this section, we build on Theorem 1 to obtain theoretical guarantees for our BNN prior. We present Theorem 2, which establishes an explicit error bound for approximation of a GP prior using the ridgelet prior in a BNN.

Recall from Section 3 that the ridgelet prior was derived from discretisation of the classical ridgelet transform in (4) and (5). The starting point of our analysis is to consider how the finite-bandwidth ridgelet transform in (10) and (11) can be discretised. To this end, we view the discretisation of $R^*_\sigma R$ as an operator $I_{\sigma,D,N} : C(\mathbb{R}^d) \to C(\mathbb{R}^d)$,

$$I_{\sigma,D,N}[f](x) := \sum_{i=1}^{N} v_i \Bigg( \sum_{j=1}^{D} u_j\, f(x_j)\, \psi(w_i \cdot x_j + b_i) \Bigg) \phi(w_i \cdot x + b_i), \qquad (12)$$

where $\{x_j,u_j\}_{j=1}^D$ and $\{(w_i,b_i),v_i\}_{i=1}^N$ are the cubature nodes and weights used, respectively, to discretise the integrals (10) and (11). Informally, $I_{\sigma,D,N}$ ought to converge in some sense to the identity operator in an appropriate limit involving $\sigma, D, N \to \infty$; this will be made precise in the sequel. The weights $w_i$ and biases $b_i$ can be identified with $w^0_i$ and $b^0_i$ in the ridgelet prior, as can $N$ be identified with $N_1$. Indeed, to make explicit the connection between the operator $I_{\sigma,D,N}$ and the ridgelet prior of Definition 1, note that

$$f(x,\theta) \mid \{w_i,b_i\}_{i=1}^N \;\sim\; (I_{\sigma,D,N})_{\#}\, GP(m,k) \mid \{w_i,b_i\}_{i=1}^N, \qquad \text{where} \quad f(x,\theta) = \sum_{i=1}^{N} w^1_{1,i}\, \phi(w_i \cdot x + b_i)$$

is the BNN defined in Section 2.1 with $L = 1$, $N_1 = N$, $w^0_i = w_i$, $b^0_i = b_i$, and $F_{\#}\mu$ denotes the push-forward of a probability distribution $\mu$ through the function $F$ [38, p.1,17]. In other words, for fixed $\{w_i,b_i\}_{i=1}^N$, the distribution of the random function $x \mapsto f(x,\theta)$ defined by drawing $\theta$ from the ridgelet prior has the same distribution as the random function obtained by application of the operator $I_{\sigma,D,N}$ to a function drawn from $GP(m,k)$.

Our analysis exploits properties of the cubature methods defined in Assumption 3 and Assumption 4 below. To rigorously introduce these cubature methods, let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$; without loss of generality we assume $\mathcal{X}$ is contained in the interior of the hyper-cube $[-S,S]^d$ for some constant $S > 0$. It follows that there exists a function (a "mollifier") $\kappa \in C^\infty(\mathbb{R}^d)$ with the properties that $\kappa(x) \in [0,1]$, $\kappa(x) = 1$ for $x \in \mathcal{X}$ and $\kappa(x) = 0$ for $x \notin [-S,S]^d$. Indeed, letting $a := \inf\{\|x - y\| : x \in \mathcal{X},\, y \notin [-S,S]^d\}$ and

$$g(t) := \begin{cases} e^{-1/t} & t > 0 \\ 0 & t \le 0, \end{cases} \qquad h(t) := \frac{g(t)}{g(t) + g(1-t)},$$

the function

$$\kappa(x) := \prod_{i=1}^{d} \bigg[ 1 - h\bigg( \frac{|x_i| - S + a}{a} \bigg) \bigg] \qquad (13)$$

satisfies the desired properties with $\kappa \in C^\infty(\mathbb{R}^d)$ [48, p.141-143]. The function $\kappa$ is used in our theoretical analysis to restrict, via multiplication, the support of a function $f$ from $\mathbb{R}^d$ to $[-S,S]^d$ without changing $f$ on $\mathcal{X}$ and without global smoothness on $\mathbb{R}^d$ being lost; see Figure 4.

Figure 4: Illustrating the role of a mollifier: (a) a function $f$ defined on $\mathbb{R}$, to be mollified; (b) a mollifier $\kappa$ that is equal to 1 on $\mathcal{X} = [-1,1]$ and equal to 0 outside $[-S,S]$ with $S = 1.5$; (c) the product $f \cdot \kappa$ is equal to $f$ on $\mathcal{X}$ and has compact support on $[-S,S]$.

Assumption 3 (Discretisation of $R$). The cubature nodes $\{x_j\}_{j=1}^D$ are a regular grid on $[-S,S]^d$, corresponding to a Cartesian product of left endpoint rules [(2.1.2) of 8], and the cubature weights are $u_j := (2S)^d D^{-1} \kappa(x_j)$ for all $j = 1,\ldots,D$.

The use of a Cartesian grid cubature method is not crucial to our analysis and other methods could be considered. However, we found that the relatively weak assumptions required to ensure convergence of a cubature method based on a Cartesian grid were convenient to check. One could consider adapting the cubature method to exploit additional smoothness that may be present in the integrand, but we did not pursue that in this work.
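A sketch of the mollifier (13) and of the cubature rule of Assumption 3 (illustrative Python; the choices $d = 1$, $S = 1.5$ and $\mathcal{X} = [-1,1]$, so that $a = 0.5$, mirror Figure 4 but are otherwise assumptions):

```python
import numpy as np

def h(t):
    """Smooth step h(t) = g(t) / (g(t) + g(1 - t)), g(t) = exp(-1/t) for t > 0."""
    g = lambda s: np.where(s > 0, np.exp(-1.0 / np.maximum(s, 1e-12)), 0.0)
    return g(t) / (g(t) + g(1.0 - t))

def mollifier(x, S=1.5, a=0.5):
    """kappa in (13): equal to 1 for |x_i| <= S - a and 0 for |x_i| >= S."""
    x = np.atleast_2d(x)
    return np.prod(1.0 - h((np.abs(x) - (S - a)) / a), axis=1)

# Assumption 3: regular grid on [-S, S]^d, weights u_j = (2S)^d D^{-1} kappa(x_j).
d, D, S = 1, 200, 1.5
xj = np.linspace(-S, S, D, endpoint=False).reshape(-1, d)   # left endpoint rule
uj = (2 * S)**d / D * mollifier(xj, S=S, a=0.5)

# Sanity check: the rule integrates a smooth function over the mollified domain.
print(np.sum(uj * np.cos(xj[:, 0])))
```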
Assumption 4 (Discretisation of $R^*_\sigma$). The cubature nodes $\{w_i,b_i\}_{i=1}^N$ are independently sampled from $p_{w,\sigma} \times p_{b,\sigma}$ and $v_i := Z N^{-1}$ for all $i = 1,\ldots,N$.

The use of a Monte Carlo cubature scheme in Assumption 4 ensures an exact identification between the ridgelet prior, where the $(w_i,b_i)$ are a priori independent, and a cubature method. From the approximation theoretic viewpoint, $I_{\sigma,D,N}$ is now a random operator and we must account for this randomness in our analysis; high probability bounds are used to this effect in Theorem 2.

Now we present our main result, which is a finite-sample-size error bound for the ridgelet prior as an approximation of a GP. In what follows we introduce a random variable $f$ to represent the GP, and we assume that $f$ is independent of the random variables $\{w_i,b_i\}_{i=1}^N$; all random variables are understood to be defined on a common underlying probability space whose expectation operator is denoted $\mathbb{E}$. The approximation error that we study is defined over $\mathcal{X} \subset [-S,S]^d$ and to this end we introduce the modified notation $M^*_r(m) := \max_{|\alpha|\le r} \sup_{x\in[-S,S]^d} |\partial^\alpha m(x)|$ and $M^*_r(k) := \max_{|\alpha|,|\beta|\le r} \sup_{x,y\in[-S,S]^d} |\partial^\alpha_x \partial^\beta_y k(x,y)|$.
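Putting Assumptions 3 and 4 together, a Monte Carlo realisation of the operator $I_{\sigma,D,N}$ in (12) can be sketched as follows (illustrative code; the pair `psi`, `phi` is a placeholder rather than an admissible pair from Table 1, and `Z = 1` is an assumption, so the reconstruction below is not calibrated):

```python
import numpy as np

def make_I_operator(xj, uj, psi, phi, N, sigma_w, sigma_b, Z, rng):
    """Return a Monte Carlo realisation of I_{sigma,D,N} in (12):
    nodes (w_i, b_i) ~ p_{w,sigma} x p_{b,sigma} (Assumption 4), v_i = Z/N."""
    d = xj.shape[1]
    wi = sigma_w * rng.normal(size=(N, d))   # w_i ~ N(0, sigma_w^2 I)
    bi = sigma_b * rng.normal(size=N)        # b_i ~ N(0, sigma_b^2)
    vi = Z / N
    def I(f):
        # Inner sum of (12): c_i = v_i sum_j u_j f(x_j) psi(w_i . x_j + b_i).
        c = vi * (psi(wi @ xj.T + bi[:, None]) @ (uj * f(xj).ravel()))
        return lambda x: phi(wi @ np.atleast_1d(x) + bi) @ c
    return I

# Illustrative usage on f(x) = exp(-x^2), with a placeholder psi and tanh phi.
rng = np.random.default_rng(2)
xj = np.linspace(-1.5, 1.5, 300).reshape(-1, 1)
uj = np.full(300, 3.0 / 300)
psi, phi = lambda z: np.exp(-z**2), np.tanh
I = make_I_operator(xj, uj, psi, phi, N=2000, sigma_w=3.0, sigma_b=9.0, Z=1.0, rng=rng)
f = lambda x: np.exp(-x**2)
print(I(f)(np.array([0.5])), f(0.5))
```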
Consider a stochastic process $f \sim GP(m,k)$ with mean function $m \in C^1(\mathbb{R}^d)$ and symmetric positive definite covariance function $k \in C^{1\times 1}(\mathbb{R}^d \times \mathbb{R}^d)$. Let Assumptions 1, 2, 3 and 4 hold. Further, assume $\phi$ is $L_\phi$-Lipschitz continuous. Then, with probability at least $1 - \delta$,

$$\sup_{x\in\mathcal{X}} \sqrt{\mathbb{E}\big[(f(x) - I_{\sigma,D,N}f(x))^2 \mid \{w_i,b_i\}_{i=1}^N\big]} \le C\Big(M^*_1(m) + \sqrt{M^*_1(k)}\Big)\Bigg\{ \frac{1}{\sigma_w} + \frac{\sigma_w^d(\sigma_w+1)}{\sigma_b} + \frac{\sigma_b\,\sigma_w^d(\sigma_w+1)}{D^{1/d}} + \frac{\sigma_b\,\sigma_w^d(\sigma_w+1)\sqrt{\log \delta^{-1}}}{\sqrt{N}} \Bigg\}$$

where $C$ is a constant independent of $m$, $k$, $\sigma_w$, $\sigma_b$, $D$, $N$ and $\delta$.

Proof. The proof is established in Appendix A.2.

This result establishes that the root-mean-square error between the GP and its approximating BNN will vanish uniformly over the compact set $\mathcal{X}$ as the bandwidth parameters $\sigma_w$, $\sigma_b$ and the sizes $N$, $D$ of the cubature rules are increased at appropriate relative rates, which can be read off from the bound in Theorem 2. In particular, the bandwidths $\sigma_w$ and $\sigma_b$ should be increased in such a manner that $\sigma_w^{d+1}/\sigma_b \to 0$, and the numbers $D$ and $N$ of cubature nodes should be increased in such a manner that the final two terms in the bound of Theorem 2 asymptotically vanish. In such a limit, the ridgelet prior provides a consistent approximation of the GP prior over the compact set $\mathcal{X}$. The result holds with high probability ($1-\delta$) with respect to randomness in the sampling of the weights and biases $\{w_i,b_i\}_{i=1}^N$. From this concentration inequality, and in an appropriate limit of $\sigma_w$, $\sigma_b$, $N$ and $D$, the almost sure convergence of the root-mean-square error also follows (from Borel-Cantelli).

This finite-sample-size error bound can be contrasted with earlier work that provided only asymptotic convergence results [33, 23, 28].
Remark 4. The rates of convergence serve as a proof-of-concept and are the usual rates that one would expect in the absence of anisotropy assumptions, with the curse of dimension appearing in the term $D^{-1/d}$. In other words, we are gated by the rate at which we fill the volume $[-S,S]^d$ with cubature nodes $\{x_j\}_{j=1}^D$. However, it may be the case that real applications possess additional regularity, such as a low "effective dimension", that was not part of the assumptions used to obtain our bound. We therefore defer further discussion of the approximation error to the empirical assessment in Section 5.
Remark 5. The weak assumptions made on the mean and covariance functions of the GP (i.e. that they are continuously differentiable) demonstrate that the ridgelet prior is capable of approximating essentially any GP that might be encountered in applications. One can draw a loose analogy between the universal approximation theorems for (non-Bayesian) neural networks [6, 24] and Theorem 2 for BNNs.
Remark 6.
A related result can also be obtained when the boundedness assumption on the activation function is relaxed from $\phi \in C^*_0(\mathbb{R})$ to $\phi \in C^*_1(\mathbb{R})$. Details are reserved for Appendix A.3.

This completes our analysis of the ridgelet prior for BNNs and our attention turns, in the next section, to the empirical performance of the ridgelet prior.
Empirical Assessment
In this section we briefly report empirical results that are intended as a proof-of-concept. Our aims in this section are twofold: first, in Section 5.1 we seek to illustrate the theoretical analysis of Section 4 and to explore how well a BNN with a ridgelet prior approximates its intended GP target; second, in Section 5.2 we aim to establish whether use of the ridgelet prior for a BNN in a Bayesian inferential context confers some of the same characteristics as when the target GP prior is used, in terms of the inferences that are obtained.

Throughout this section we focus on BNNs with a single hidden layer ($L = 1$), with $p_w$ and $p_b$ standard Gaussians; sensitivity to these choices is examined in Appendix A.4.3. Our principal interest is in how many hidden units, $N$, are required in order for the ridgelet prior to provide an adequate approximation of the associated GP, since the size of the network determines the computational complexity of performing inference with the ridgelet prior; see Remark 2. To this end, we took $D$, $\sigma_w$ and $\sigma_b$ to be sufficiently large that their effect was negligible compared to the effect of finite $N$. For completeness, Table 3 in Appendix A.4.2 reports the values of $D$, $\sigma_w$ and $\sigma_b$ that were used for each experiment, with sensitivity to these choices examined in Appendix A.4.3.

The aim of this section is to explore how well a BNN with a ridgelet prior approximates its intended GP target. For this experiment we consider dimension $d = 1$ and a squared exponential covariance function $k(x,x') := l^2\exp(-s^2|x-x'|^2)$, where the constants $l$ and $s$ were fixed and $\mathcal{X} = [-1,1]$. Figure 3, introduced earlier in the paper, presents sample paths from a BNN equipped with the ridgelet prior as the number $N$ of hidden units is varied from 100 to 3000. In Figure 5a we report estimates of the maximum root-mean-square error (MRMSE)

$$\sup_{x\in\mathcal{X}} \sqrt{\mathbb{E}\big[(f(x) - I_{\sigma,D,N}f(x))^2 \mid \{w_i,b_i\}_{i=1}^N\big]}.$$

It was observed that the approximation error between the BNN and the GP decays at a slow rate, consistent with our theoretical error bound $O(N^{-1/2})$. To explore how well the second moments of the GP are being approximated, in Figure 5b we compared the BNN covariance function $\mathbb{E}[f(x)f(0)]$, as in (9), against the covariance function $k(x,0)$ of the target GP. As $N$ is varied we observe convergence of the BNN covariance function to the GP covariance function. For accurate approximation, a large number of hidden units appears to be required. Additional results, detailing the effect of varying $D$, $\sigma_w$, $\sigma_b$, the activation function $\phi$, and the GP covariance $k$, are provided in Appendix A.4.3.

Figure 5: Approximation quality of the ridgelet prior: (a) MRMSE for the BNN approximation of the GP, estimated as the number $N$ of hidden units was varied, with standard errors over 100 independent experiments displayed; (b) the covariance associated to the BNN, as $N$ is varied, together with the covariance of the GP being approximated.
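The MRMSE above can be estimated by averaging over GP draws, as in the following sketch (illustrative code; `I` is a hypothetical realisation of $I_{\sigma,D,N}$, e.g. as returned by `make_I_operator` in the earlier snippet, and `sample_gp` is a generic Gaussian process sampler with an assumed squared exponential covariance):

```python
import numpy as np

def estimate_mrmse(I, sample_gp, x_grid, n_draws=100):
    """Monte Carlo estimate of sup_x sqrt(E[(f(x) - I f(x))^2 | {w_i, b_i}])
    over a grid of test locations, averaging over draws f ~ GP(m, k)."""
    sq_err = np.zeros(len(x_grid))
    for _ in range(n_draws):
        f_vals = sample_gp(x_grid)        # one GP draw evaluated on the grid
        f = lambda x: np.interp(x.ravel(), x_grid.ravel(), f_vals)
        If = I(f)
        sq_err += (f_vals - np.array([If(x) for x in x_grid]))**2
    return np.sqrt(np.max(sq_err / n_draws))   # max over grid of RMS error

def sample_gp(x_grid, rng=np.random.default_rng(3)):
    """Hypothetical sampler for GP(0, k), k(x,x') = exp(-|x-x'|^2 / 0.5^2)."""
    t = x_grid.ravel()
    K = np.exp(-np.subtract.outer(t, t)**2 / 0.25)
    return rng.multivariate_normal(np.zeros(len(t)), K + 1e-10 * np.eye(len(t)))

# Usage: mrmse = estimate_mrmse(I, sample_gp, np.linspace(-1, 1, 50).reshape(-1, 1))
```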
In this section we compare the performance of the ridgelet prior to that of its target GP in an inferential context. To this end, we identified tasks where non-trivial prior information is available and can be encoded into a covariance model; the ridgelet prior is then used to approximate this covariance model using a BNN. Three tasks were considered: (i) prediction of atmospheric CO2 concentration using the well-known Mauna Loa dataset; (ii) prediction of airline passenger numbers using a historical dataset; (iii) a simple in-painting task that is closer in spirit to applications where BNNs may be used. These tasks are toy in their nature and we do not attempt an empirical investigation into the practical value of the ridgelet prior; this will be reserved for a sequel.

Prediction of Atmospheric CO2: In this first task we extracted 43 consecutive months of atmospheric CO2 concentration data recorded at the Mauna Loa observatory in Hawaii [18] and considered the task of predicting CO2 concentration up to 23 months ahead. The time period under discussion was re-scaled onto the interval $\mathcal{X} = [-1,1]$ and the response variable, representing CO2 concentration, was standardised. This task is well-suited to regression models, such as GPs, that are able to account for non-trivial prior information. In particular, one may seek to encode (1) a linear trend toward increasing concentration of atmospheric CO2 and (2) a periodic seasonal trend. In the GP framework, this can be achieved using a mean function $m$ and covariance function $k$ of the form

$$m(x) := ax, \qquad k(x,x') := l^2 \exp\Big(-s^2 \sin^2\big(\pi p\,|x-x'|\big)\Big) \qquad (14)$$

where $a$, $l$, $s$ and $p$ are fixed constants
; see [39, Section 5.4.3]. The aim of this experiment is to explore whether the posterior predictive distribution obtained using the ridgelet prior for a BNN is similar to that which would be obtained using this GP prior. To this end, we employed a simple Gaussian likelihood

$$y_i = f(x_i,\theta) + \epsilon_i, \qquad \epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma_\epsilon^2),$$

where $y_i$ is the standardised CO2 concentration in month $i$, $x_i$ is the standardised time corresponding to month $i$, $x \mapsto f(x,\theta)$ is the regression model and $\sigma_\epsilon = 0.085$ was fixed. The posterior GP is available in closed form, while we used Markov chain Monte Carlo to sample from the posterior distribution over the parameters $\theta$ when a neural network is used. The details of this sampling scheme are reserved for Appendix A.4.1.

Here we present results for a BNN with $N$ hidden units. Our benchmark is an i.i.d. prior that takes all parameters to be a priori independent, with $w^0_{i,j} \sim \mathcal{N}(0,\sigma^2_{w^0})$, $b^0_i \sim \mathcal{N}(0,\sigma^2_b)$ and $w^1_{1,i} \sim \mathcal{N}(0,\sigma^2_{w^1})$. Here $\sigma^2_{w^0} = \sigma^2_b = 36$ and $\sigma_{w^1}$ was scaled in proportion to $1/\sqrt{N}$. This is to be contrasted with the ridgelet prior, with values of $\sigma_w^2 = \sigma_b^2 = 36$ chosen to ensure a fair comparison with the i.i.d. prior.

Figure 6: Prediction of Atmospheric CO2: posterior predictive distributions were obtained using (a) a Bayesian neural network (BNN) with an i.i.d. prior, (b) a BNN with the ridgelet prior, and (c) a Gaussian process (GP) prior. The ridgelet prior in (b) was designed to approximate the GP prior used in (c). The solid blue curve is the posterior mean and the shaded regions represent the pointwise 95% credible interval.

Figure 6 displays the posterior predictive distributions obtained using (a) the i.i.d. prior, (b) the ridgelet prior, and (c) the original GP. It can be seen that (b) and (c) are more similar than (a) and (c). These results suggest that the ridgelet prior is able to encode non-trivial information on the seasonality of CO2 concentration into the prior distribution for the parameters of the BNN.
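A minimal sketch of the log-posterior that such an MCMC sampler targets is given below (illustrative code, not the paper's implementation; the Gaussian likelihood and $\sigma_\epsilon = 0.085$ are from the text, while the precomputed `Psi_mean` and the (pseudo-)inverse covariance square root `cov_sqrt_inv` are hypothetical inputs obtained as in Remark 2):

```python
import numpy as np

def log_posterior(theta, x, y, Psi_mean, cov_sqrt_inv, phi=np.tanh,
                  sigma_eps=0.085):
    """Unnormalised log-posterior for a single-hidden-layer BNN (d = 1) under
    the ridgelet prior, with Gaussian likelihood y_i = f(x_i, theta) + eps_i."""
    N = len(Psi_mean)
    w0, b0, w1 = theta[:N], theta[N:2*N], theta[2*N:]
    # Likelihood: f(x_i, theta) = sum_j w1_j phi(w0_j x_i + b0_j).
    f = phi(np.outer(x, w0) + b0) @ w1
    log_lik = -0.5 * np.sum((y - f)**2) / sigma_eps**2
    # Ridgelet prior on output weights: w1 ~ N(Psi m, Psi K Psi^T), evaluated
    # through a (pseudo-)inverse square root of the covariance.
    r = cov_sqrt_inv @ (w1 - Psi_mean)
    log_prior_w1 = -0.5 * r @ r
    # i.i.d. standard Gaussian prior on the input-layer parameters (P_w, P_b).
    log_prior_in = -0.5 * (w0 @ w0 + b0 @ b0)
    return log_lik + log_prior_w1 + log_prior_in
```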
Prediction of Airline Passenger Numbers: Next we considered a slightly more challenging example that involves a more intricate periodic trend. The dataset here is a subset of the airline passenger dataset studied in [36]. The response $y_i$ represents the (standardised) monthly total number of international airline passengers in the United States and the input $x_i$ represents the (standardised) time corresponding to month $i$. The experimental set-up was identical to the CO2 experiment, but the prior in (14) was modified to reflect the more intricate periodic trend, by taking new values of $a$, $l$, $s$ and $p = 0.75$, and the measurement noise $\sigma_\epsilon$ was fixed. As in the CO2 experiment, we implemented the i.i.d. prior and the analogous ridgelet prior, in the latter case based on $D = 200$ quadrature points.

Figure 7: Prediction of Airline Passenger Numbers: posterior predictive distributions were obtained using (a) a Bayesian neural network (BNN) with an i.i.d. prior, (b) a BNN with the ridgelet prior, and (c) a Gaussian process (GP) prior. The ridgelet prior in (b) was designed to approximate the GP prior used in (c). The solid blue curve is the posterior mean and the shaded regions represent the pointwise 95% credible interval.

The results in Figure 7 support the conclusion that the ridgelet prior is able to capture this more complicated seasonal trend. In comparison with [36] we required more hidden units to perform this regression task. However, we used a standard activation function, where the method of [36] would need to develop a new activation function for each covariance model of interest.
In-Painting Task:
Our final example is a so-called in-painting task, where we are required to infer a missing part of an image from the remaining part. Our aim is not to assess the suitability of the ridgelet prior for such tasks - to do so would require extensive and challenging empirical investigation, beyond the scope of the present paper - but rather to validate the ridgelet prior as a proof-of-concept.

The image that we consider is shown in Figure 8a and we censor the central part, described by the red square in Figure 8b. Each pixel corresponds to a real value $y_i$ and the location of the pixel is denoted $x_i \in [-1,1]^2$. To the remaining part we add a small amount of i.i.d. noise, $\epsilon_i \sim \mathcal{N}(0,\sigma_\epsilon^2)$, to each pixel $i$, shown in Figure 8b (the addition of noise here ensures conditions for ergodicity of MCMC are satisfied; otherwise the posterior is supported on a submanifold of the parameter space and more sophisticated sampling procedures would be required). The task is then to infer this missing central region using the remaining part of the image as a training dataset.

For the statistical regression model we considered a GP whose mean function $m$ is zero and covariance function $k$ is

$$k(x,x') := l^2 \exp\bigg(-\frac{\|x-x'\|_2^2}{s^2}\bigg) + \big(\cos(\pi x_1) + 1\big)\big(\cos(\pi x'_1) + 1\big)\big(\cos(\pi x_2) + 1\big)\big(\cos(\pi x'_2) + 1\big), \qquad (15)$$

where $l$ and $s$ are fixed constants. The periodic structure induced by the cosine functions is deliberately chosen to be commensurate with the separation between modes in the original dataset, meaning that considerable prior information is provided in the GP covariance model. Our benchmark here was a BNN equipped with $N$ hidden units and an i.i.d. prior with $\sigma^2_{w^0} = \sigma^2_b = 18$ and $\sigma_{w^1}$ scaled in proportion to $1/\sqrt{N}$. The ridgelet prior was based on a BNN of the same size, with $\sigma_w^2 = \sigma_b^2 = 18$ to ensure a fair comparison with the i.i.d. prior.

Figure 8: In-Painting Task: the central part of the original image in (a) was censored to produce (b), and the in-painting task is to infer the missing part of the image using the remaining part as a training dataset. Posterior predictive distributions were obtained using (c) a Bayesian neural network (BNN) with an i.i.d. prior, and (d) a BNN with the ridgelet prior.

Figure 8c and Figure 8d show posterior mean estimates for the missing region in the in-painting task, respectively for the i.i.d. prior and the ridgelet prior. It is interesting to observe that the i.i.d. prior gives rise to a posterior mean that is approximately constant, whereas the ridgelet prior gives rise to a posterior mean that somewhat resembles the original image. This suggests that the specific structure of the covariance model in (15) has been faithfully encoded into the ridgelet prior, leading to superior performance on this stylised in-painting task.
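The covariance model (15) combines a local squared-exponential term with a separable periodic term; a minimal sketch (illustrative Python, with placeholder values for the constants $l$ and $s$, which are not specified here):

```python
import numpy as np

def inpainting_kernel(x, xp, l=1.0, s=1.0):
    """Covariance (15): local squared-exponential term plus a cosine term
    commensurate with the mode separation in the image (l, s placeholders)."""
    local = l**2 * np.exp(-np.sum((x - xp)**2) / s**2)
    periodic = np.prod([(np.cos(np.pi * x[i]) + 1) * (np.cos(np.pi * xp[i]) + 1)
                        for i in range(2)])
    return local + periodic

print(inpainting_kernel(np.array([0.2, -0.4]), np.array([0.2, -0.4])))
```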
One of the main barriers to the widespread adoption of BNNs is the identification of prior distributions that are meaningful when lifted to the output space of the network. In this paper it was shown that the ridgelet transform facilitates the consistent approximation of a GP using a BNN. This has the potential to bring the powerful framework of covariance modelling for GPs to bear on the task of prior specification for BNNs. In contrast to earlier work in this direction [11, 15, 36, 45], our construction is accompanied by theoretical analysis that establishes that the approximation is consistent. Moreover, we are able to provide a finite-sample-size error bound that requires only weak assumptions on the GP covariance model (i.e. that the mean and covariance functions are continuously differentiable).

The role of this paper was to establish the ridgelet prior as a theoretical proof-of-concept only and there remain several open questions to be addressed:

• In real applications, is it necessary to have an accurate approximation of the intended covariance model, in order to deliver improved performance of the BNN, or is a crude approximation sufficient?

• What cubature rules $\{x_i,u_i\}_{i=1}^D$ are most effective in applications - is a regular grid needed, or do real-world applications exhibit an effective low dimension so that e.g. a sparse grid could be used?

• In the case of multiple hidden layers, can the convergence of the ridgelet prior to a deep GP [7, 9] be established?

These questions will likely require a substantial amount of work to address in full, but we hope to pursue them in a sequel.
Acknowledgements
The authors were supported by the Lloyd's Register Foundation Programme on Data-Centric Engineering and the Alan Turing Institute under the EPSRC grant [EP/N510129/1].
References

[1] Ben Adlam, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, and Jasper Snoek. Exploring the uncertainty properties of neural networks' implicit priors in the infinite-width limit. arXiv:2010.07355, 2020.
[2] Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18:1–53, 2018.
[3] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[4] Wray L. Buntine and Andreas S. Weigend. Bayesian back-propagation. Complex Systems, 5(6):603–643, 1991.
[5] Emmanuel Candès. Ridgelets: Theory and Applications. PhD thesis, Stanford University, 1998.
[6] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[7] Andreas C. Damianou and Neil D. Lawrence. Deep Gaussian processes. Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, PMLR 31:207–215, 2013.
[8] Philip J. Davis and Philip Rabinowitz. Methods of Numerical Integration. Academic Press, 1984.
[9] Matthew M. Dunlop, Mark A. Girolami, Andrew M. Stuart, and Aretha L. Teckentrup. How deep are deep Gaussian processes? Journal of Machine Learning Research, 19(54):1–46, 2018.
[10] David Duvenaud. Automatic Model Construction with Gaussian Processes. PhD thesis, University of Cambridge, 2014.
[11] Daniel Flam-Shepherd, James Requeima, and David Duvenaud. Mapping Gaussian process priors to Bayesian neural networks. Proceedings of the 31st Conference on Neural Information Processing Systems Bayesian Deep Learning Workshop, 2017.
[12] Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. Proceedings of the 7th International Conference on Learning Representations, 2019.
[13] Evarist Giné and Richard Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2015.
[14] Loukas Grafakos. Classical Fourier Analysis. Springer, 2000.
[15] Danijar Hafner, Dustin Tran, Timothy Lillicrap, Alex Irpan, and James Davidson. Noise contrastive priors for functional uncertainty. arXiv:1807.09289, 2019.
[16] Tailen Hsing and Randall Eubank. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley, 2015.
[17] Lee K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. The Annals of Statistics, 20(1):608–613, 1992.
[18] Charles D. Keeling and Timothy P. Whorf. Monthly atmospheric CO2 records from sites in the SIO air sampling network. In Trends: A Compendium of Data on Global Change. Carbon Dioxide Information Analysis Center, Oak Ridge National Laboratory, 2004.
[19] Mohammad Emtiyaz E. Khan, Alexander Immer, Ehsan Abedi, and Maciej Korzepa. Approximate inference turns deep networks into Gaussian processes. Proceedings of the 33rd Conference on Neural Information Processing Systems, 2019.
[20] Michael R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference. Springer, 2008.
[21] Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Fixing asymptotic uncertainty of Bayesian neural networks with infinite ReLU features. arXiv:2010.02709, 2020.
[22] Vera Kůrková and Marcello Sanguineti. Bounds on rates of variable-basis and neural-network approximation. IEEE Transactions on Information Theory, 47(6):2659–2665, 2001.
[23] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. Proceedings of the 6th International Conference on Learning Representations, 2018.
[24] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
[25] Elijah Liflyand. Integrability spaces for the Fourier transform of a function of bounded variation. Journal of Mathematical Analysis and Applications, 436(2):1082–1101, 2016.
[26] Milan N. Lukić and Jay H. Beder. Stochastic processes with sample paths in reproducing kernel Hilbert spaces. Transactions of the American Mathematical Society, 353(10):3945–3969, 2001.
[27] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
[28] Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. Proceedings of the 6th International Conference on Learning Representations, 2018.
[29] John Mitros and Brian Mac Namee. On the validity of Bayesian neural networks for uncertainty estimation. Proceedings of the 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, 2019.
[30] Noboru Murata. An integral representation of functions using three-layered networks and their approximation bounds. Neural Networks, 9(6):947–956, 1996.
[31] Iain Murray, Ryan P. Adams, and David J. C. MacKay. Elliptical slice sampling. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 9:541–548, 2010.
[32] Eric Nalisnick. On Priors for Bayesian Neural Networks. PhD thesis, University of California, Irvine, 2018.
[33] Radford M. Neal. Bayesian Learning for Neural Networks. Springer, 1995.
[34] Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. Proceedings of the 7th International Conference on Learning Representations, 2019.
[35] Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A function space view of bounded norm infinite width ReLU nets: The multivariate case. arXiv:1910.01635, 2019.
[36] Tim Pearce, Russell Tsuchida, Mohamed Zaki, Alexandra Brintrup, and Andy Neely. Expressive priors in Bayesian neural networks: Kernel combinations and periodic functions. Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence, PMLR 115:134–144, 2020.
[37] Arya A. Pourzanjani, Richard M. Jiang, and Linda R. Petzold. Improving the identifiability of neural networks for Bayesian inference. Proceedings of the 31st Conference on Neural Information Processing Systems Workshop on Bayesian Deep Learning, 2017.
[38] Giuseppe Da Prato. An Introduction to Infinite-Dimensional Analysis. Springer, 2006.
[39] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[40] Emilio Serrano and Javier Bajo. Deep neural network architectures for social services diagnosis in smart cities. Future Generation Computer Systems, 100:122–131, 2019.
[41] Sho Sonoda, Isao Ishikawa, Masahiro Ikeda, Kei Hagihara, Yoshihiro Sawano, Takuo Matsubara, and Noboru Murata. The global optimum of shallow neural network is attained by ridgelet transform. Proceedings of the 35th International Conference on Machine Learning Workshop on Theory of Deep Learning, 2018.
[42] Sho Sonoda and Noboru Murata. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 43(2):233–268, 2017.
[43] Michael L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.
[44] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.
[45] Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger Grosse. Functional variational Bayesian neural networks. arXiv:1903.05779, 2019.
[46] Eric J. Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, 2019.
[47] Leda Tortora, Gerben Meynen, Johannes Bijlsma, Enrico Tronci, and Stefano Ferracuti. Neuroprediction and AI in forensic psychiatry and criminal justice: A neurolaw perspective. Frontiers in Psychology, 11, 2020.
[48] Loring W. Tu. An Introduction to Manifolds. Springer, 2010.
[49] Ernest O. Tuck. On positivity of Fourier transforms. Bulletin of the Australian Mathematical Society, 74(1):133–138, 2006.
[50] Aad van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[51] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
[52] Holger Wendland. Scattered Data Approximation. Cambridge University Press, 2005.
[53] Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? arXiv:2002.02405, 2020.
[54] Christopher K. I. Williams. Computing with infinite networks. Neural Computation, 10(1):1203–1216, 1998.
[55] Greg Yang and Hadi Salman. A fine-grained spectral perspective on neural networks. Proceedings of the 8th International Conference on Learning Representations, 2019.
[56] Wanqian Yang, Lars Lorch, Moritz A. Graule, Srivatsan Srinivasan, Anirudh Suresh, Jiayu Yao, Melanie F. Pradier, and Finale Doshi-Velez. Output-constrained Bayesian neural networks. Proceedings of the 36th International Conference on Machine Learning Workshop on Uncertainty and Robustness in Deep Learning and Workshop on Understanding and Improving Generalization in Deep Learning, 2019.
[57] Jiayu Yao, Weiwei Pan, Soumya Ghosh, and Finale Doshi-Velez. Quality of uncertainty quantification for Bayesian neural network inference. Proceedings of the 36th International Conference on Machine Learning Workshop on Uncertainty and Robustness in Deep Learning, 2019.

Appendix
This appendix is structured as follows:

A.1 Proof of Theorem 1
  A.1.1 Approximate Identity Operators
  A.1.2 Proof of Theorem 1
  A.1.3 Auxiliary Lemma 1
A.2 Proof of Theorem 2
  A.2.1 Technical Lemmas
  A.2.2 An Intermediate Ridgelet Reconstruction Result
  A.2.3 Proof of Theorem 2
A.3 An Analogous Result to Theorem 2 for Unbounded φ
A.4 Experiments
  A.4.1 Approximating the Posterior with Markov Chain Monte Carlo
  A.4.2 Experimental Setting
  A.4.3 Comparison of Prior Predictive with Different Settings

Notation:
Throughout this appendix we adopt identical notation to the main text, but for brevity we denote $\|\cdot\|_{L^p(\mathcal{X})}$ as $\|\cdot\|_{L^p}$ whenever $\mathcal{X}$ is the Euclidean space $\mathbb{R}^d$ of any dimension $d \in \mathbb{N}$.

A.1 Proof of Theorem 1
In this section we prove Theorem 1. The proof makes use of techniques from the theory of approximate identity integral operators, which we recall first in Appendix A.1.1. The proof of Theorem 1 is then presented in Appendix A.1.2. The proof relies on an auxiliary technical lemma regarding the Fourier transform, whose statement and proof we defer to Appendix A.1.3.
A.1.1 Approximate Identity Operators

An approximate identity operator is an integral operator which converges to an identity operator in an appropriate limit. The discussion in this section follows Giné and Nickl [13, Ch. 4.1.3, 4.3.6]. For $h > 0$, define an operator $K_h$ by

$$K_h[f](x) := \int_{\mathbb{R}^d} f(x')\, h^{-d} K\!\left(\frac{x}{h}, \frac{x'}{h}\right) dx' \qquad (16)$$

where $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a measurable function and $f : \mathbb{R}^d \to \mathbb{R}$ is suitably regular for the integral to exist and be well-defined. Proposition 1 provides sufficient conditions for the approximate identity $K_h[f]$ to converge to an identity operator when $h \to 0$.

Proposition 1. [13, Propositions 4.3.31 and 4.3.33, p. 368] Let $K_h[f]$ be defined as in (16) with $K$ a measurable function satisfying, for some $N \in \mathbb{N}$,

1. $\int_{\mathbb{R}^d} \sup_{v \in \mathbb{R}^d} |K(v, v-u)|\, \|u\|^N\, du < \infty$;

2. for all $v \in \mathbb{R}^d$ and all multi-indices $\alpha$ s.t. $|\alpha| \in \{1, \dots, N-1\}$,
   (a) $\int_{\mathbb{R}^d} K(v, v-u)\, du = 1$,
   (b) $\int_{\mathbb{R}^d} K(v, v-u)\, u^\alpha\, du = 0$.

Then for each $m \le N$ there exists a constant $C$, depending only on $m$ and $K$, such that

$$f \in C^m(\mathbb{R}^d) \implies \sup_{x \in \mathbb{R}^d} |K_h[f](x) - f(x)| \le C h^m \max_{|\alpha| = m} \sup_{x \in \mathbb{R}^d} |\partial^\alpha f(x)|.$$

The sense in which Proposition 1 will be used in the proof of Theorem 1 is captured by the following example:
Example 3.
Consider a translation-invariant kernel of the form $K(x, x') = \varphi(x - x')$ for some even $\varphi : \mathbb{R}^d \to [0, \infty)$. Further assume $\int_{\mathbb{R}^d} \varphi(u)\, du = 1$ and $\int_{\mathbb{R}^d} \varphi(u) \|u\|^2\, du < \infty$. Then $K$ satisfies the preconditions of Proposition 1 for $N = 2$ and hence $K_h[f]$ is an approximate identity operator.
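The convergence asserted by Proposition 1 is easy to observe numerically. The following sketch, which is ours and not code from the paper, smooths a test function with a Gaussian kernel (a special case of Example 3 with d = 1) and checks that the sup-error decays at the rate O(h^2) predicted with N = m = 2.

```python
import numpy as np

def K_h(f, xs, h, grid):
    """Approximate K_h[f](x), the integral of f(x') h^{-1} K(x/h, x'/h) dx',
    where K(v, v') = phi(v - v') and phi is the standard Gaussian density."""
    dx = grid[1] - grid[0]
    out = []
    for x in xs:
        kernel = np.exp(-0.5 * ((x - grid) / h)**2) / (np.sqrt(2 * np.pi) * h)
        out.append(np.sum(kernel * f(grid)) * dx)
    return np.array(out)

f = lambda x: np.sin(3 * x)                 # a smooth (C^2) test function
grid = np.linspace(-10, 10, 20001)          # fine quadrature grid
xs = np.linspace(-1, 1, 101)                # points where the error is measured
for h in [0.2, 0.1, 0.05]:
    err = np.max(np.abs(K_h(f, xs, h, grid) - f(xs)))
    print(f"h = {h:.2f}, sup-error = {err:.2e}")  # roughly quarters as h halves
```

A.1.2 Proof of Theorem 1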
For this section it is convenient to introduce a notational shorthand. Let $\epsilon_w := \sigma_w^{-1}$ and $\epsilon_b := \sigma_b^{-1}$, so that $p_w(\sigma_w^{-1} w) = p_w(\epsilon_w w)$ and $p_b(\sigma_b^{-1} b) = p_b(\epsilon_b b)$. Further introduce the shorthand $p_{\epsilon_w}(\cdot) := p_w(\epsilon_w \cdot)$ and $p_{\epsilon_b}(\cdot) := p_b(\epsilon_b \cdot)$.

Now we turn to the proof of Theorem 1. Denote the topological dual space of $\mathcal{S}(\mathbb{R}^d)$ by $\mathcal{S}'(\mathbb{R}^d)$. The elements of $\mathcal{S}'(\mathbb{R}^d)$ are called tempered distributions, and the generalised Fourier transform can be considered as the Fourier transform on tempered distributions $\mathcal{S}'(\mathbb{R}^d)$ [14, p. 123–131]. Throughout the proof we exchange the order of integrals; we do so only when the absolute value of the integrand is itself an integrable function, so that the interchange is justified by Fubini's theorem.

Proof of Theorem 1.
The goal is to bound the reconstruction error when the Lebesgue measure in the classical ridgelet transform is replaced by the finite measure

$$d\lambda_\sigma(w, b) = (2\pi)\, \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1}\, p_w(\sigma_w^{-1} w)\, p_b(\sigma_b^{-1} b)\, dw\, db.$$

The reconstruction $(\mathcal{R}^*_\sigma \mathcal{R})[f]$ of a function $f$ on $\mathbb{R}^d$ is defined by the following integral:

$$(\mathcal{R}^*_\sigma \mathcal{R})[f](x) = \int_{\mathbb{R}^{d+1}} \left( \int_{\mathbb{R}^d} f(x')\, \psi(w \cdot x' + b)\, dx' \right) \phi(w \cdot x + b)\, d\lambda_\sigma(w, b)$$
$$= \int_{\mathbb{R}^d} f(x') \left\{ (2\pi)\, \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \int_{\mathbb{R}} \int_{\mathbb{R}^d} \psi(w \cdot x' + b)\, \phi(w \cdot x + b)\, p_{\epsilon_w}(w)\, p_{\epsilon_b}(b)\, dw\, db \right\} dx'.$$

It is thus clear that $(\mathcal{R}^*_\sigma \mathcal{R})[f]$ is in some sense a smoothed version of $f$, and in what follows we will make use of the theory of approximate identity operators discussed in Appendix A.1.1. We aim to deal separately with the reconstruction error due to the use of a finite measure on $w$ and the reconstruction error due to the use of a finite measure on $b$. To this end, it will be convenient to (formally) define a new operator $\mathcal{R}^*_\sigma \mathcal{R}^{(w)}$ by

$$(\mathcal{R}^*_\sigma \mathcal{R}^{(w)})[f](x) := \int_{\mathbb{R}^d} f(x') \left( \|\widehat{p_w}\|_{L^1}^{-1} \int_{\mathbb{R}} \int_{\mathbb{R}^d} \psi(w \cdot x' + b)\, \phi(w \cdot x + b)\, p_{\epsilon_w}(w)\, dw\, db \right) dx'$$

which replaces $p_{\epsilon_b}$ by the Lebesgue measure on $\mathbb{R}$. That is, $(\mathcal{R}^*_\sigma \mathcal{R}^{(w)})[f]$ can be intuitively considered as an idealised version of $(\mathcal{R}^*_\sigma \mathcal{R})[f]$ where the reconstruction error due to the use of a finite measure on $b$ is removed. Our analysis will then proceed based on the following triangle inequality:

$$\sup_{x \in \mathbb{R}^d} |f(x) - (\mathcal{R}^*_\sigma \mathcal{R})[f](x)| \le \underbrace{\sup_{x \in \mathbb{R}^d} |f(x) - (\mathcal{R}^*_\sigma \mathcal{R}^{(w)})[f](x)|}_{(*)} + \underbrace{\sup_{x \in \mathbb{R}^d} |(\mathcal{R}^*_\sigma \mathcal{R}^{(w)})[f](x) - (\mathcal{R}^*_\sigma \mathcal{R})[f](x)|}_{(**)}. \qquad (17)$$

Different strategies are required to bound $(*)$ and $(**)$, and we address them separately next.

Bounding $(*)$: A bound on $(*)$ uses techniques from the theory of approximate identity operators described in Appendix A.1.1. To this end, we show that $\mathcal{R}^*_\sigma \mathcal{R}^{(w)}$ satisfies the preconditions of Proposition 1. Define $\tilde{k} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ by

$$\tilde{k}(x, x') := \|\widehat{p_w}\|_{L^1}^{-1} \int_{\mathbb{R}} \int_{\mathbb{R}^d} \phi(w \cdot x + b)\, \psi(w \cdot x' + b)\, p_w(w)\, dw\, db. \qquad (18)$$

From a change of variable $b' = w \cdot x + b$,

$$\tilde{k}(x, x') = \|\widehat{p_w}\|_{L^1}^{-1} \int_{\mathbb{R}^d} \int_{\mathbb{R}} \phi(b')\, \psi(w \cdot (x' - x) + b')\, db'\, p_w(w)\, dw. \qquad (19)$$

Next we perform some Fourier analysis similar to that in [42, Appendix C]. From the discussion of the generalised Fourier transform in Section 4.1, recall that $\hat{t} \in L^1_{loc}(\mathbb{R}^d \setminus \{0\})$ is a generalised Fourier transform of a function $t$ if there exists an integer $m \in \mathbb{N}$ such that $\int_{\mathbb{R}^d} \hat{t}(w) h(w)\, dw = \int_{\mathbb{R}^d} t(x) \hat{h}(x)\, dx$ for all $h \in \mathcal{S}_m(\mathbb{R}^d)$. We set $t = \phi$ and $\hat{h} = \psi$ for our analysis. Part (3) of Assumption 1 implies that $\psi \in \mathcal{S}_m(\mathbb{R})$ for some $m$ depending on the generalised Fourier transform $\hat{\phi}$. Let $\psi^-(b) := \psi(-b)$. From the definition of the generalised Fourier transform,

$$\int_{\mathbb{R}} \phi(b)\, \psi(w \cdot (x' - x) + b)\, db = \int_{\mathbb{R}} \phi(b)\, \psi^-(-b - w \cdot (x' - x))\, db = \int_{\mathbb{R}} \hat{\phi}(\xi)\, \hat{\psi}(\xi)\, e^{i \xi w \cdot (x' - x)}\, d\xi, \qquad (20)$$

where we used the fact that $\widehat{\psi^-} = \hat{\psi}$ for the real function $\psi^-$ in the last equality [14, p. 109, 113]. Note that $\xi \mapsto \hat{\phi}(\xi) \hat{\psi}(\xi)$ belongs to $L^1(\mathbb{R})$ by part (3) of Assumption 1 and therefore (20) exists. Also we have that

$$\widehat{p_w}(\xi x - \xi x') = (2\pi)^{-d} \int_{\mathbb{R}^d} e^{i \xi w \cdot (x - x')}\, p_w(w)\, dw. \qquad (21)$$

Substituting (20) and (21) into (19) gives that

$$\tilde{k}(x, x') = (2\pi)^d\, \|\widehat{p_w}\|_{L^1}^{-1} \int_{\mathbb{R}} \hat{\phi}(\xi)\, \hat{\psi}(\xi)\, \widehat{p_w}(\xi x - \xi x')\, d\xi.$$

Let $k_w(v) := \|\widehat{p_w}\|_{L^1}^{-1}\, \widehat{p_w}(v)$, so that

$$\tilde{k}(x, x') = (2\pi)^d \int_{\mathbb{R}} \hat{\phi}(\xi)\, \hat{\psi}(\xi)\, k_w(\xi x - \xi x')\, d\xi. \qquad (22)$$

Recall that $p_{\epsilon_w}(w) = p_w(\epsilon_w w)$ and let $k_{w\epsilon}(v) := \|\widehat{p_w}\|_{L^1}^{-1}\, \widehat{p_{\epsilon_w}}(v)$. Since the Fourier transform of $w \mapsto p_w(\epsilon_w w)$ is $v \mapsto \epsilon_w^{-d}\, \widehat{p_w}(v / \epsilon_w)$ by standard properties of the Fourier transform [14, p. 109, 113], we have $k_{w\epsilon}(v) = \epsilon_w^{-d}\, k_w(v / \epsilon_w)$. Define

$$\tilde{k}_\epsilon(x, x') := \|\widehat{p_w}\|_{L^1}^{-1} \int_{\mathbb{R}} \int_{\mathbb{R}^d} \phi(w \cdot x + b)\, \psi(w \cdot x' + b)\, p_{\epsilon_w}(w)\, dw\, db. \qquad (23)$$

Noting the similarity between (23) and (18), an analogous argument to that just presented shows that we may re-express (23) as

$$\tilde{k}_\epsilon(x, x') = (2\pi)^d \int_{\mathbb{R}} \hat{\phi}(\xi)\, \hat{\psi}(\xi)\, k_{w\epsilon}(\xi x - \xi x')\, d\xi = \epsilon_w^{-d}\, \tilde{k}\!\left( \frac{x}{\epsilon_w}, \frac{x'}{\epsilon_w} \right).$$

Then

$$\mathcal{R}^*_\sigma \mathcal{R}^{(w)} f(x) = \int_{\mathbb{R}^d} f(x')\, \tilde{k}_\epsilon(x, x')\, dx' = \int_{\mathbb{R}^d} f(x')\, \epsilon_w^{-d}\, \tilde{k}\!\left( \frac{x}{\epsilon_w}, \frac{x'}{\epsilon_w} \right) dx'.$$

Now we will show that $\tilde{k}$ satisfies the preconditions of Proposition 1 with $N = 2$. That is, for all $v \in \mathbb{R}^d$ and all multi-indices $\alpha \in \mathbb{N}^d$ s.t. $|\alpha| = 1$,

$$\int_{\mathbb{R}^d} \sup_{v \in \mathbb{R}^d} |\tilde{k}(v, v - u)|\, \|u\|^2\, du < \infty, \qquad (24)$$
$$\int_{\mathbb{R}^d} \tilde{k}(v, v - u)\, du = 1, \qquad (25)$$
$$\int_{\mathbb{R}^d} \tilde{k}(v, v - u)\, u^\alpha\, du = 0. \qquad (26)$$

First we verify (24). From (22) we have that

$$\int_{\mathbb{R}^d} \sup_{v} |\tilde{k}(v, v - u)|\, \|u\|^2\, du = (2\pi)^d \int_{\mathbb{R}^d} \left| \int_{\mathbb{R}} \hat{\psi}(\xi) \hat{\phi}(\xi)\, k_w(\xi u)\, d\xi \right| \|u\|^2\, du \le (2\pi)^d \int_{\mathbb{R}} \int_{\mathbb{R}^d} |\hat{\psi}(\xi) \hat{\phi}(\xi)|\, |k_w(\xi u)|\, \|u\|^2\, du\, d\xi.$$

By the change of variables $u' = \xi u$,

$$\int_{\mathbb{R}^d} \sup_v |\tilde{k}(v, v - u)|\, \|u\|^2\, du \le (2\pi)^d \int_{\mathbb{R}} \int_{\mathbb{R}^d} |\hat{\psi}(\xi) \hat{\phi}(\xi)|\, |k_w(u')|\, \frac{\|u'\|^2}{|\xi|^2} \frac{du'}{|\xi|^d}\, d\xi \le (2\pi)^d \int_{\mathbb{R}} \frac{|\hat{\psi}(\xi) \hat{\phi}(\xi)|}{|\xi|^{d+2}}\, d\xi \int_{\mathbb{R}^d} |k_w(u)|\, \|u\|^2\, du.$$

That this final bound is finite follows from the requirement that $\int_{\mathbb{R}} (|\hat{\psi}(\xi) \hat{\phi}(\xi)| / |\xi|^{d+2})\, d\xi < \infty$ in Definition 2, together with the assumption that $\widehat{p_w}$ has finite second moment, which ensures the finiteness of $\int_{\mathbb{R}^d} |k_w(u)| \|u\|^2\, du$. Next we verify (25). From (22) and the change of variables $u' = \xi u$,

$$\int_{\mathbb{R}^d} \tilde{k}(v, v - u)\, du = (2\pi)^d \int_{\mathbb{R}^d} \int_{\mathbb{R}} \hat{\psi}(\xi) \hat{\phi}(\xi)\, k_w(\xi u)\, d\xi\, du = (2\pi)^d \int_{\mathbb{R}} \frac{\hat{\psi}(\xi) \hat{\phi}(\xi)}{|\xi|^d}\, d\xi \int_{\mathbb{R}^d} k_w(u')\, du' = 1,$$

where the final equality used the facts that $\int_{\mathbb{R}^d} k_w(u)\, du = \int_{\mathbb{R}^d} |k_w(u)|\, du = 1$ (as $\widehat{p_w}$ is positive) and that $(2\pi)^d \int_{\mathbb{R}} \hat{\psi}(\xi) \hat{\phi}(\xi) |\xi|^{-d}\, d\xi = 1$, from Definition 2. Finally we verify (26). By the change of variables $u' = \xi u$,

$$\int_{\mathbb{R}^d} \tilde{k}(v, v - u)\, u^\alpha\, du = (2\pi)^d \underbrace{\int_{\mathbb{R}} \frac{\hat{\psi}(\xi) \hat{\phi}(\xi)}{|\xi|^d\, \xi^{|\alpha|}}\, d\xi}_{(\mathrm{I})}\; \underbrace{\int_{\mathbb{R}^d} k_w(u')\, (u')^\alpha\, du'}_{(\mathrm{II})}.$$

The term $(\mathrm{I})$ is finite as a consequence of the assumption $\int_{\mathbb{R}} |\hat{\psi}(\xi) \hat{\phi}(\xi)| / |\xi|^{d+1}\, d\xi < \infty$ in Definition 2. For the term $(\mathrm{II})$, note that $u \mapsto u^\alpha$ is an odd function whenever $|\alpha| = 1$. On the other hand, $k_w$ is an even function since $k_w$ is proportional to the Fourier transform of the even probability density $p_w$. Thus the function $u \mapsto k_w(u)\, u^\alpha$, being the product of an even and an odd function, is odd. For any integrable odd function $h : \mathbb{R}^d \to \mathbb{R}$, $\int_{\mathbb{R}^d} h(u)\, du = \int_{\mathbb{R}^d_+} h(u)\, du + \int_{\mathbb{R}^d_-} h(u)\, du = 0$, where $\mathbb{R}^d_+$ and $\mathbb{R}^d_-$ are the positive and negative half Euclidean spaces. The function $u \mapsto k_w(u)\, u^\alpha$ is integrable since $k_w$ has finite second moment and $|\alpha| = 1$, so (26) holds.

Thus $\tilde{k}$ satisfies the conditions of Proposition 1 and, with $K_h[f] = \mathcal{R}^*_\sigma \mathcal{R}^{(w)} f$ and $h = \epsilon_w$, we obtain, for some $C_1 > 0$ depending on $\tilde{k}$ but not $f$,

$$\sup_{x \in \mathbb{R}^d} |f(x) - \mathcal{R}^*_\sigma \mathcal{R}^{(w)} f(x)| \le C_1\, M_2(f)\, \epsilon_w^2. \qquad (27)$$

Bounding $(**)$: A bound on $(**)$ makes use of Auxiliary Lemma 1. Define $k_\epsilon : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ by

$$k_\epsilon(x, x') := (2\pi)\, \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \int_{\mathbb{R}} \int_{\mathbb{R}^d} \psi(w \cdot x' + b)\, \phi(w \cdot x + b)\, p_{\epsilon_w}(w)\, p_{\epsilon_b}(b)\, dw\, db, \qquad (28)$$

so that $\mathcal{R}^*_\sigma \mathcal{R} f(x) = \int_{\mathbb{R}^d} f(x') k_\epsilon(x, x')\, dx'$, and recall that $\mathcal{R}^*_\sigma \mathcal{R}^{(w)} f(x) = \int_{\mathbb{R}^d} f(x') \tilde{k}_\epsilon(x, x')\, dx'$ where $\tilde{k}_\epsilon$ is defined by (23). Letting $\Delta k(x, x') := \tilde{k}_\epsilon(x, x') - k_\epsilon(x, x')$, the second error term is

$$\sup_x |\mathcal{R}^*_\sigma \mathcal{R}^{(w)} f(x) - \mathcal{R}^*_\sigma \mathcal{R} f(x)| = \sup_x \left| \int_{\mathbb{R}^d} f(x')\, \Delta k(x, x')\, dx' \right|.$$

By the definitions of $\tilde{k}_\epsilon$ and $k_\epsilon$,

$$\Delta k(x, x') = \|\widehat{p_w}\|_{L^1}^{-1} \int_{\mathbb{R}} \int_{\mathbb{R}^d} \psi(w \cdot x' + b)\, \phi(w \cdot x + b)\, p_{\epsilon_w}(w)\, \big\{ 1 - (2\pi) \|\widehat{p_b}\|_{L^1}^{-1}\, p_{\epsilon_b}(b) \big\}\, dw\, db.$$

Next we upper bound $\Delta k(x, x')$. Substituting the identity established in Lemma 1 yields

$$\Delta k(x, x') = -\epsilon_b\, (2\pi)\, \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \int_0^1 \int_{\mathbb{R}^d} \underbrace{\int_{\mathbb{R}} b\, \phi(w \cdot x + b)\, \psi(w \cdot x' + b)\, \partial p_b(t \epsilon_b b)\, db}_{(\mathrm{I})}\; p_{\epsilon_w}(w)\, dw\, dt.$$

By Assumption 2, $p_b$ has bounded first derivative. Thus

$$|(\mathrm{I})| \le \|\partial p_b\|_{L^\infty} \int_{\mathbb{R}} |b\, \phi(w \cdot x + b)\, \psi(w \cdot x' + b)|\, db.$$

By a change of variable $b' = w \cdot x' + b$,

$$|(\mathrm{I})| \le \|\partial p_b\|_{L^\infty} \int_{\mathbb{R}} |b' - w \cdot x'|\, |\phi(w \cdot (x - x') + b')\, \psi(b')|\, db' \le \|\partial p_b\|_{L^\infty} (1 + \|x'\|)(1 + \|w\|^2) \int_{\mathbb{R}} (1 + |b'|)\, |\phi(w \cdot (x - x') + b')\, \psi(b')|\, db'. \qquad (29)$$

By the assumption that $\phi$ is bounded,

$$|(\mathrm{I})| \le \|\partial p_b\|_{L^\infty} \|\phi\|_{L^\infty} (1 + \|x'\|)(1 + \|w\|^2) \int_{\mathbb{R}} (1 + |b'|)\, |\psi(b')|\, db'. \qquad (30)$$

Since $\psi$ is a Schwartz function, the integral is finite. Let $C_\psi := \int_{\mathbb{R}} (1 + |b|) |\psi(b)|\, db$ to see $|(\mathrm{I})| \le \|\partial p_b\|_{L^\infty} \|\phi\|_{L^\infty} C_\psi (1 + \|x'\|)(1 + \|w\|^2)$. From this upper bound on $|(\mathrm{I})|$ we have

$$|\Delta k(x, x')| \le \epsilon_b\, (2\pi)\, \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \|\partial p_b\|_{L^\infty} \|\phi\|_{L^\infty} C_\psi\, (1 + \|x'\|) \int_{\mathbb{R}^d} (1 + \|w\|^2)\, p_{\epsilon_w}(w)\, dw.$$

For the integral $\int_{\mathbb{R}^d} (1 + \|w\|^2)\, p_{\epsilon_w}(w)\, dw$, by a change of variable $w' = \epsilon_w w$,

$$\int_{\mathbb{R}^d} (1 + \|w\|^2)\, p_{\epsilon_w}(w)\, dw = \int_{\mathbb{R}^d} \left( \epsilon_w^{-d} + \frac{\|w'\|^2}{\epsilon_w^{d+2}} \right) p_w(w')\, dw'.$$

Recall that $p_w$ was assumed to have finite second moment. Let $C_{p_w} := \max\!\big( 1, \int_{\mathbb{R}^d} \|w\|^2 p_w(w)\, dw \big)$ to see

$$\int_{\mathbb{R}^d} \left( \epsilon_w^{-d} + \frac{\|w'\|^2}{\epsilon_w^{d+2}} \right) p_w(w')\, dw' \le C_{p_w} \left( \epsilon_w^{-d} + \epsilon_w^{-(d+2)} \right) = C_{p_w}\, \epsilon_w^{-d} \left( 1 + \epsilon_w^{-2} \right).$$

Plugging this upper bound into $\Delta k(x, x')$, the original error term is then bounded as

$$(**) = \sup_x \left| \int_{\mathbb{R}^d} f(x')\, \Delta k(x, x')\, dx' \right| \le (2\pi)\, \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \|\partial p_b\|_{L^\infty} \|\phi\|_{L^\infty} C_\psi C_{p_w} \int_{\mathbb{R}^d} (1 + \|x'\|)\, |f(x')|\, dx'\; \epsilon_w^{-d} (1 + \epsilon_w^{-2})\, \epsilon_b.$$

From the assumption on $f$ in Theorem 1, $B(f) = \int_{\mathbb{R}^d} (1 + \|x\|) |f(x)|\, dx < \infty$. Setting $C_2 := (2\pi) \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \|\partial p_b\|_{L^\infty} \|\phi\|_{L^\infty} C_\psi C_{p_w}$ gives

$$(**) \le C_2\, B(f)\, \epsilon_w^{-d} (1 + \epsilon_w^{-2})\, \epsilon_b. \qquad (31)$$

Overall bound: Substituting the results of (27) and (31) into (17), for some $C > 0$,

$$\sup_{x \in \mathbb{R}^d} |f(x) - \mathcal{R}^*_\sigma \mathcal{R} f(x)| \le C \max(M_2(f), B(f)) \left\{ \epsilon_w^2 + \epsilon_w^{-d} (1 + \epsilon_w^{-2})\, \epsilon_b \right\} \qquad (32)$$

where $C$ depends only on $\phi$, $\psi$, $p_w$, $p_b$. Setting $\sigma_w = \epsilon_w^{-1}$ and $\sigma_b = \epsilon_b^{-1}$, the main convergence result is obtained. ∎

A.1.3 Auxiliary Lemma 1
The following technical lemma was exploited in the proof of Theorem 1:
Lemma 1.
For $p_b$ in the setting of Assumption 2, we have that

$$1 - (2\pi)\, \|\widehat{p_b}\|_{L^1}^{-1}\, p_{\epsilon_b}(b) = -\epsilon_b\, (2\pi)\, \|\widehat{p_b}\|_{L^1}^{-1} \int_0^1 b\, \partial p_b(t \epsilon_b b)\, dt.$$

Proof.
The result will be established by proving (a) $1 = (2\pi) \|\widehat{p_b}\|_{L^1}^{-1}\, p_{\epsilon_b}(0)$ and (b) $p_b(0) - p_b(\epsilon_b b) = -\epsilon_b \int_0^1 b\, \partial p_b(t \epsilon_b b)\, dt$, which are algebraically seen to imply the stated result.

Part (a): Recall that the Fourier inversion $g(x) = (2\pi)^{-d} \int_{\mathbb{R}^d} \hat{g}(\xi) e^{i \xi \cdot x}\, d\xi$ holds for any function $g \in L^1(\mathbb{R}^d)$ s.t. $\hat{g} \in L^1(\mathbb{R}^d)$. We use the fact $g(0) = (2\pi)^{-d} \|\hat{g}\|_{L^1}$ for $g \in L^1(\mathbb{R}^d)$ s.t. $\hat{g} \in L^1(\mathbb{R}^d)$ and $\hat{g}$ is positive, which is obtained by substituting $x = 0$ into the Fourier inversion: $(2\pi)^{-d} \int_{\mathbb{R}^d} \hat{g}(w) e^{i w \cdot 0}\, dw = (2\pi)^{-d} \int_{\mathbb{R}^d} \hat{g}(w)\, dw$. Recall that $p_{\epsilon_b}(b) = p_b(\epsilon_b b)$. From standard properties of the Fourier transform [14, p. 109, 113], the Fourier transform of $b \mapsto p_b(\epsilon_b b)$ is given as $\xi \mapsto \epsilon_b^{-1} \widehat{p_b}(\xi / \epsilon_b)$, and $\widehat{p_b}$ is positive by assumption. Hence $p_{\epsilon_b}(0) = (2\pi)^{-1} \|\widehat{p_{\epsilon_b}}\|_{L^1} = (2\pi)^{-1} \| \epsilon_b^{-1} \widehat{p_b}(\cdot / \epsilon_b) \|_{L^1}$. Since the $L^1(\mathbb{R})$ norm is invariant to this scaling of the function, i.e. $\| \epsilon_b^{-1} \widehat{p_b}(\cdot / \epsilon_b) \|_{L^1} = \|\widehat{p_b}\|_{L^1}$, we obtain $p_{\epsilon_b}(0) = (2\pi)^{-1} \|\widehat{p_b}\|_{L^1}$ as required.

Part (b): We use the fact that the equation $g(y) - g(x) = \int_0^1 (y - x)\, \partial g(x + t(y - x))\, dt$ holds for $g \in C^1(\mathbb{R})$ [13, p. 302, 304] to see that

$$p_b(0) - p_b(\epsilon_b b) = \int_0^1 (-\epsilon_b b)\, \partial p_b\big((1 - t') \epsilon_b b\big)\, dt' = -\epsilon_b \int_0^1 b\, \partial p_b(t \epsilon_b b)\, dt,$$

where the change of variable $t = 1 - t'$ is applied. This holds since $p_b \in C^1(\mathbb{R})$. ∎

A.2 Proof of Theorem 2
This section is dedicated to the proof of Theorem 2. It is divided into three parts: in Appendix A.2.1 we state technical lemmas that will be useful; in Appendix A.2.2 we state and prove an intermediate result concerning the discretised ridgelet transform; then in Appendix A.2.3 we present the proof of Theorem 2.

A.2.1 Technical Lemmas

The following technical lemmas will be useful for the proof of Theorem 2.
Lemma 2.
Let X be a compact subset of R d . Let g : X ˆ R p Ñ R be such that g p x , ¨q : R p Ñ R are measurable for all x P X . Let θ , θ , ..., θ n be independent samples from a distribution P on R p . Assume that there exists a measurable function G : R p Ñ R such that E r G p θ q s ă 8 and | g p x , θ q ´ g p x , θ q| ď G p θ q} x ´ x } for all x , x P X . (33) Then E « sup x P X ˇˇˇˇˇ n n ÿ i “ g p x , θ i q ´ E r g p x , θ qs ˇˇˇˇˇff ď C ? n a E r G p θ q s . where C is a constant that depends only on X .Proof. Let g x : “ g p x , ¨q for shorthand and let G : “ t g x ˇˇ x P X u . For any g x , g x P G ,define a (random) pseudo metric ρ n p g x , g x q : “ b n ř ni “ p g x p θ i q ´ g x p θ i qq and the diameter D n : “ sup x , x P X ρ n p g x , g x q . Let N p G , ρ n , (cid:15) q denotes the covering number of the set G by (cid:15) -ball under the metric ρ n [13, p.41]. Then by [13, Theorem 3.5.1, Remark 3.5.2, p.185], E « sup x P X ˇˇˇˇˇ n n ÿ i “ g x p θ i q ´ E r g x p¨qs ˇˇˇˇˇff ď ? ? n E „ ż D n a log 2 N p G , ρ n , (cid:15) q d (cid:15) loooooooooooooomoooooooooooooon p˚q Here we reduced the assumption 0 P G by the discussion in [51, p.135]. Let } G } ρ n “ b n ř ni “ G p θ i q . By [20, Lemma 9.18, p.166] and [50, Example 19.7, p.271], the coveringnumber is bounded as N p G , ρ n , (cid:15) q ď ´ K } G } ρn (cid:15) ¯ d where K is a constant depending only on X .Hence p˚q ď ? d ş D n b log K } G } ρn (cid:15) d (cid:15) . By Cauchy-Schwartz inequality, ? d ż D n c log 2 K } G } ρ n (cid:15) d (cid:15) ď a dD n dż D n log 2 K } G } ρ n (cid:15) d (cid:15). By calculating the integral, p˚q ď a dD n d D n ˆ ` log 2 K } G } ρ n D n ˙ “ ? dD n d ` log 2 K } G } ρ n D n . By the preceding assumption (33), D n “ sup x , x P X d n n ÿ i “ p g x p θ i q ´ g x p θ i qq ď sup x , x P X d n n ÿ i “ G p θ i q } x ´ x } “ R } G } ρ n . R “ sup x , x P X } x ´ x } . We can set K so that R ď K without loss of generality,then we have K } G } ρn D n ě ô log K } G } ρn D n ě ? dD n d ` log 2 K } G } ρ n D n ď ? dD n ˆ ` log 2 K } G } ρ n D n ˙ . By the inequality 1 ` log z ď z for all z ą
0, we have 1 ` log K } G } ρn D n ď K } G } ρn D n and p˚q ď ? dK } G } ρ n . Then by Jensen’ inequality, E rp˚qs “ ? dK E «d n n ÿ i “ G p θ i q ff ď ? dK gffe E « n n ÿ i “ G p θ i q ff “ ? dK a E r G p θ q s . This completes the proof.
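The rate in Lemma 2 can be checked by simulation in a case where the expectation is available in closed form. The sketch below is our own illustration, not code from the paper: with $g(x, \theta) = \cos(\theta x)$ and $\theta \sim \mathcal{N}(0, 1)$ we have $\mathbb{E}[g(x, \theta)] = \exp(-x^2/2)$, and (33) holds with $G(\theta) = |\theta|$, so the expected sup-error should decay like $n^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 201)      # compact index set X = [0, 1]
truth = np.exp(-xs**2 / 2)           # E[cos(theta * x)] for theta ~ N(0, 1)

for n in [100, 400, 1600]:
    sup_errs = []
    for _ in range(200):             # average the sup-error over replications
        theta = rng.standard_normal(n)
        emp = np.cos(np.outer(xs, theta)).mean(axis=1)
        sup_errs.append(np.max(np.abs(emp - truth)))
    # the averaged sup-error roughly halves each time n is quadrupled
    print(n, np.mean(sup_errs))
```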
Lemma 3.
For any $b$-uniformly bounded class of functions $\mathcal{F}$ and any integer $n \ge 1$, we have, with probability at least $1 - \delta$,

$$\sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E}[f(X)] \right| \le \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E}[f(X)] \right| \right] + b \sqrt{\frac{2 \log \delta^{-1}}{n}}.$$

Proof. From equation (4.16) in [51], with probability at least $1 - \exp\!\left( -\frac{n \delta'^2}{2 b^2} \right)$,

$$\sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E}[f(X)] \right| \le \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E}[f(X)] \right| \right] + \delta'.$$

Setting $\delta' = b \sqrt{2 \log \delta^{-1}} / \sqrt{n}$ yields the result. ∎

Lemma 4.
Let $S$ be a positive constant and $h \in C^1([-S, S]^d)$. For grid points $(x_i)_{i=1}^D$ on $[-S, S]^d$ corresponding to a Cartesian product of left endpoint rules, we have that

$$\left| \int_{[-S, S]^d} h(x)\, dx - \frac{(2S)^d}{D} \sum_{i=1}^D h(x_i) \right| \le C D^{-1/d} \max_{|\alpha| = 1} \sup_{x \in [-S, S]^d} |\partial^\alpha h(x)|, \qquad (34)$$

where $C$ is a constant independent of $h$.

Proof. Let $r := 2S / \sqrt[d]{D}$. For $x_i = (x_{i,1}, \dots, x_{i,d})$, let $R_i := [x_{i,1}, x_{i,1} + r] \times \dots \times [x_{i,d}, x_{i,d} + r]$ for $i = 1, \dots, D$. Since $(x_i)_{i=1}^D$ are grid points on $[-S, S]^d$ corresponding to a Cartesian product of left endpoint rules, the domain $[-S, S]^d$ can be decomposed as $[-S, S]^d = R_1 \oplus \dots \oplus R_D$, meaning

$$\int_{[-S, S]^d} h(x)\, dx = \sum_{i=1}^D \int_{R_i} h(x)\, dx.$$

Denote the original error in (34) by $(*)$. Then, noting $r^d = \int_{R_i} dx$,

$$(*) = \left| \sum_{i=1}^D \int_{R_i} h(x)\, dx - r^d \sum_{i=1}^D h(x_i) \right| = \left| \sum_{i=1}^D \int_{R_i} (h(x) - h(x_i))\, dx \right|.$$

By the mean value theorem, there exist $x^*_i$ for $i = 1, \dots, D$ such that

$$(*) = \left| \sum_{i=1}^D \int_{R_i} \nabla h(x^*_i) \cdot (x - x_i)\, dx \right|,$$

where $\nabla h$ is the gradient vector of $h$. Calculating the integral and taking the supremum of $\nabla h$,

$$(*) \le \sum_{i=1}^D d\, r^{d+1} \max_{|\alpha| = 1} \sup_{x \in [-S, S]^d} |\partial^\alpha h(x)| = d\, D\, r^{d+1} \max_{|\alpha| = 1} \sup_{x \in [-S, S]^d} |\partial^\alpha h(x)|.$$

Substituting $r = 2S / \sqrt[d]{D}$ and setting $C := d\, (2S)^{d+1}$ completes the proof. ∎
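The $D^{-1/d}$ rate in Lemma 4 is likewise easy to verify numerically. The sketch below is ours, not code from the paper, for $d = 2$ with an integrand whose exact integral over $[-S, S]^2$ is available in closed form.

```python
import numpy as np
from math import exp

S, d = 1.0, 2
h = lambda x1, x2: np.exp(x1 + 2 * x2)               # smooth integrand
exact = (exp(1) - exp(-1)) * (exp(2) - exp(-2)) / 2  # its integral on [-1, 1]^2

for m in [10, 20, 40, 80]:                           # m points per axis, D = m^d
    left = np.linspace(-S, S, m + 1)[:-1]            # left endpoints of the cells
    X1, X2 = np.meshgrid(left, left)
    D = m**d
    approx = (2 * S)**d / D * np.sum(h(X1, X2))
    print(D, abs(approx - exact))                    # error decays like D^{-1/2}
```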
Lemma 5.

For a non-negative random variable $X$ such that $\mathbb{E}[X] < \infty$, it holds with probability at least $1 - \delta$ that $X \le \mathbb{E}[X] / \delta$.

Proof. From the Markov inequality we have that $\mathbb{P}[X \ge t] \le \mathbb{E}[X] / t$. Taking the complement of this probability and setting $\delta = \mathbb{E}[X] / t$, we obtain the result. ∎

A.2.2 An Intermediate Ridgelet Reconstruction Result
The aim of this section is to establish an analogue of Theorem 1 that holds when the ridgelet operator $\mathcal{R}^*_\sigma \mathcal{R}$ is discretised using cubature rules, as in (12). The purpose of Theorem 3 is to guarantee an accurate reconstruction with high probability. This will be central to the proof of Theorem 2. Recall $M_r(f) := \max_{|\alpha| \le r} \sup_{x \in \mathbb{R}^d} |\partial^\alpha f(x)|$ and $M^*_r(f) := \max_{|\alpha| \le r} \sup_{x \in [-S, S]^d} |\partial^\alpha f(x)|$.

Theorem 3.
Let $I_{\sigma,D,N}$ be given by (12) under Assumptions 2 and 3. Further, assume $\phi$ is $L_\phi$-Lipschitz continuous. For any $f \in C^2(\mathbb{R}^d)$ with $M^*_2(f) < \infty$, with probability at least $1 - \delta$,

$$\sup_{x \in \mathcal{X}} |f(x) - I_{\sigma,D,N} f(x)| \le C M^*_2(f) \left\{ \sigma_w^{-2} + \sigma_w^d (\sigma_w^2 + 1) \sigma_b^{-1} + \sigma_b \sigma_w^d (\sigma_w^2 + 1) D^{-1/d} + \frac{\sigma_b \sigma_w^d (\sigma_w + \sqrt{\log \delta^{-1}})}{\sqrt{N}} \right\}$$

where $C$ is a constant that may depend on $\mathcal{X}$, $\chi$, $\phi$, $\psi$, $p_w$, $p_b$ but does not depend on $f$, $\sigma_w$, $\sigma_b$, $\delta$.

Proof. Under Assumption 3, specifically $u_j = (2S)^d D^{-1} \chi(x_j)$ and $v_i = Z / N$, we have that

$$I_{\sigma,D,N} f(x) = \sum_{i=1}^N \frac{Z}{N} \left( \frac{(2S)^d}{D} \sum_{j=1}^D \chi(x_j) f(x_j)\, \psi(w_i \cdot x_j + b_i) \right) \phi(w_i \cdot x + b_i),$$

and we formally define

$$I_\sigma f(x) := \int_{\mathbb{R}} \int_{\mathbb{R}^d} \left( \int_{\mathbb{R}^d} \chi(x') f(x')\, \psi(w \cdot x' + b)\, dx' \right) \phi(w \cdot x + b)\, Z\, p_{w,\sigma}(w)\, p_{b,\sigma}(b)\, dw\, db,$$
$$I_{\sigma,D} f(x) := \int_{\mathbb{R}} \int_{\mathbb{R}^d} \left( \frac{(2S)^d}{D} \sum_{j=1}^D \chi(x_j) f(x_j)\, \psi(w \cdot x_j + b) \right) \phi(w \cdot x + b)\, Z\, p_{w,\sigma}(w)\, p_{b,\sigma}(b)\, dw\, db.$$

The error will be decomposed by the triangle inequality:

$$\sup_{x \in \mathcal{X}} |f(x) - I_{\sigma,D,N} f(x)| \le \underbrace{\sup_{x \in \mathcal{X}} |f(x) - I_\sigma f(x)|}_{(*)} + \underbrace{\sup_{x \in \mathcal{X}} |I_\sigma f(x) - I_{\sigma,D} f(x)|}_{(**)} + \underbrace{\sup_{x \in \mathcal{X}} |I_{\sigma,D} f(x) - I_{\sigma,D,N} f(x)|}_{(***)}. \qquad (35)$$

Bounding $(*)$: Let $f_{\mathrm{clip}}(x) := \chi(x) f(x)$. Since $f_{\mathrm{clip}}(x) = f(x)$ for all $x \in \mathcal{X}$ and our construction ensures that $I_\sigma f(x) = \mathcal{R}^*_\sigma \mathcal{R} f_{\mathrm{clip}}(x)$, we have that

$$(*) = \sup_{x \in \mathcal{X}} |f(x) - I_\sigma f(x)| = \sup_{x \in \mathcal{X}} |f_{\mathrm{clip}}(x) - \mathcal{R}^*_\sigma \mathcal{R} f_{\mathrm{clip}}(x)|.$$

In order to apply Theorem 1 to $(*)$, the two quantities $M_2(f_{\mathrm{clip}})$ and $B(f_{\mathrm{clip}})$ must be shown to be finite. For the first quantity, we have

$$M_2(f_{\mathrm{clip}}) \le \sup_{x \in \mathbb{R}^d} |(\chi f)(x)| + \max_{1 \le |\alpha| \le 2} \sup_{x \in \mathbb{R}^d} |\partial^\alpha (\chi f)(x)|.$$

By the Leibniz rule, each $\partial^\alpha(\chi f)$ is a finite sum of terms of the form $\partial^\beta \chi\, \partial^\gamma f$ with $\beta + \gamma = \alpha$. Recall that $\chi \in C^\infty(\mathbb{R}^d)$ has the property $\chi(x) = 1$ for $x \in \mathcal{X}$ and $\chi(x) = 0$ for $x \notin [-S, S]^d$, meaning that both $\chi$ and each $\partial^\beta \chi$ vanish for $x \notin [-S, S]^d$. By the assumption $f \in C^2(\mathbb{R}^d)$, all such terms therefore vanish outside of $x \in [-S, S]^d$, so that each supremum may be restricted to $[-S, S]^d$ and bounded by a product of suprema of $\chi$-derivatives and $f$-derivatives over $[-S, S]^d$. It follows that

$$M_2(f_{\mathrm{clip}}) \le M^*_2(\chi)\, M^*_2(f) < \infty. \qquad (36)$$

The quantity $B(f_{\mathrm{clip}})$ is clearly bounded since $f_{\mathrm{clip}}$ is compactly supported on $[-S, S]^d$:

$$B(f_{\mathrm{clip}}) = \int_{[-S, S]^d} |f_{\mathrm{clip}}(x)| (1 + \|x\|)\, dx \le \int_{[-S, S]^d} M_2(f_{\mathrm{clip}})\, (1 + S)\, dx = (1 + S)(2S)^d M_2(f_{\mathrm{clip}}) < \infty. \qquad (37)$$

Thus we may apply Theorem 1 to obtain

$$(*) \le C'_1 \max(M_2(f_{\mathrm{clip}}), B(f_{\mathrm{clip}})) \left( \sigma_w^{-2} + \sigma_w^d (\sigma_w^2 + 1) \sigma_b^{-1} \right)$$

for some constant $C'_1 > 0$ depending only on $\phi, \psi, p_w, p_b$. Let $C_1 := (1 + S)(2S)^d C'_1$, so that from (37) we have

$$(*) \le C_1 M_2(f_{\mathrm{clip}}) \left( \sigma_w^{-2} + \sigma_w^d (\sigma_w^2 + 1) \sigma_b^{-1} \right).$$

Bounding $(**)$: Let

$$h_{w,b}(x') := \frac{f_{\mathrm{clip}}(x')\, \psi(w \cdot x' + b)}{1 + \|w\|^2}.$$

Since $f_{\mathrm{clip}}$ vanishes outside $[-S, S]^d$, so does $h_{w,b}$, and therefore

$$I_\sigma f(x) - I_{\sigma,D} f(x) = \int_{\mathbb{R}} \int_{\mathbb{R}^d} \underbrace{\left\{ \int_{[-S, S]^d} h_{w,b}(x')\, dx' - \frac{(2S)^d}{D} \sum_{j=1}^D h_{w,b}(x_j) \right\}}_{(\mathrm{a})} (1 + \|w\|^2)\, \phi(w \cdot x + b)\, Z\, p_{w,\sigma}(w)\, p_{b,\sigma}(b)\, dw\, db.$$

Our aim here is to show that the collection of functions $h_{w,b}$, indexed by $w$ and $b$, has derivatives that are uniformly bounded on $[-S, S]^d$, in order that Lemma 4 can be applied to $(\mathrm{a})$. Let $\alpha$ be a multi-index such that $|\alpha| = 1$. By the chain rule of differentiation,

$$\partial^\alpha h_{w,b}(x') = \frac{ \partial^\alpha f_{\mathrm{clip}}(x')\, \psi(w \cdot x' + b) + f_{\mathrm{clip}}(x')\, w^\alpha (\partial \psi)(w \cdot x' + b) }{1 + \|w\|^2},$$

where we recall that $w^\alpha = w_i$ for the nonzero element index $i$ of $\alpha$, and that $\partial \psi$ is the first derivative of $\psi : \mathbb{R} \to \mathbb{R}$. Since $|\alpha| = 1$, we have $|w^\alpha| \le \|w\|$. In addition, $|\partial^\alpha f_{\mathrm{clip}}(x')| \le M_2(f_{\mathrm{clip}})$ and $|f_{\mathrm{clip}}(x')| \le M_2(f_{\mathrm{clip}})$ for all $x' \in [-S, S]^d$ by definition. Therefore

$$|\partial^\alpha h_{w,b}(x')| \le M_2(f_{\mathrm{clip}}) \left( \frac{|\psi(w \cdot x' + b)|}{1 + \|w\|^2} + \frac{\|w\|\, |(\partial \psi)(w \cdot x' + b)|}{1 + \|w\|^2} \right).$$

Since $\psi \in \mathcal{S}(\mathbb{R})$, $\tfrac{1}{1 + \|w\|^2} \le 1$, and $\tfrac{\|w\|}{1 + \|w\|^2} \le 1$ for all $w \in \mathbb{R}^d$, we further have $|\partial^\alpha h_{w,b}(x')| \le M_2(f_{\mathrm{clip}}) M_1(\psi)$, with $M_1(\psi) := \sup_z |\psi(z)| + \sup_z |\partial \psi(z)|$, which is a uniform bound, independent of $x' \in [-S, S]^d$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, and $\alpha$ such that $|\alpha| = 1$. Applying Lemma 4 to the term $(\mathrm{a})$ with the grid points $(x_i)_{i=1}^D$ on $[-S, S]^d$,

$$\left| \int_{[-S, S]^d} h_{w,b}(x')\, dx' - \frac{(2S)^d}{D} \sum_{j=1}^D h_{w,b}(x_j) \right| \le C_Q D^{-1/d} \max_{|\alpha| = 1} \sup_{x' \in [-S, S]^d} |\partial^\alpha h_{w,b}(x')| \le C_Q M_2(f_{\mathrm{clip}}) M_1(\psi)\, D^{-1/d},$$

where $C_Q$ is a constant independent of $h_{w,b}$. Therefore

$$|I_\sigma f(x) - I_{\sigma,D} f(x)| \le C_Q M_2(f_{\mathrm{clip}}) M_1(\psi)\, D^{-1/d} \underbrace{\int_{\mathbb{R}} \int_{\mathbb{R}^d} \left| (1 + \|w\|^2)\, \phi(w \cdot x + b)\, Z\, p_{w,\sigma}(w)\, p_{b,\sigma}(b) \right| dw\, db}_{(\mathrm{b})}, \qquad (38)$$

where we move $C_Q M_1(\psi) M_2(f_{\mathrm{clip}})$ outside the integral as they are $w$- and $b$-independent. It remains to bound $(\mathrm{b})$. By the boundedness of $\phi$,

$$(\mathrm{b}) \le Z\, \|\phi\|_{L^\infty} \int_{\mathbb{R}^d} (1 + \|w\|^2)\, p_{w,\sigma}(w)\, dw \int_{\mathbb{R}} p_{b,\sigma}(b)\, db = Z\, \|\phi\|_{L^\infty} \int_{\mathbb{R}^d} (1 + \|w\|^2)\, p_{w,\sigma}(w)\, dw,$$

where $\int_{\mathbb{R}} p_{b,\sigma}(b)\, db = 1$ since $p_{b,\sigma}$ is a probability density. Since $p_{w,\sigma}(w) = \sigma_w^{-d} p_w(w / \sigma_w)$, by a change of variable $w' = \sigma_w^{-1} w$,

$$(\mathrm{b}) \le Z\, \|\phi\|_{L^\infty} \int_{\mathbb{R}^d} (1 + \sigma_w^2 \|w'\|^2)\, p_w(w')\, dw' \le Z\, \|\phi\|_{L^\infty} C_{p_w} (1 + \sigma_w^2),$$

where, since $p_w$ has finite second moment, $C_{p_w} := \max\!\big(1, \int_{\mathbb{R}^d} \|w\|^2 p_w(w)\, dw\big)$. Recalling $Z = (2\pi) \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \sigma_w^d \sigma_b$,

$$(\mathrm{b}) \le (2\pi) \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \|\phi\|_{L^\infty} C_{p_w}\, \sigma_b\, \sigma_w^d (\sigma_w^2 + 1).$$

Plugging the upper bound on $(\mathrm{b})$ into (38) and setting $C_2 := (2\pi) \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \|\phi\|_{L^\infty} M_1(\psi) C_{p_w} C_Q$, we arrive at the overall bound

$$\sup_{x \in \mathcal{X}} |I_\sigma f(x) - I_{\sigma,D} f(x)| \le C_2 M_2(f_{\mathrm{clip}})\, \sigma_b\, \sigma_w^d (\sigma_w^2 + 1)\, D^{-1/d}.$$

Bounding $(***)$: Define

$$\tau_\phi(x, w, b) := \left( \frac{(2S)^d}{D} \sum_{j=1}^D f_{\mathrm{clip}}(x_j)\, \psi(w \cdot x_j + b) \right) \phi(w \cdot x + b).$$

Then the term $(***)$ can be written as

$$\sup_{x \in \mathcal{X}} |I_{\sigma,D} f(x) - I_{\sigma,D,N} f(x)| = Z \sup_{x \in \mathcal{X}} \left| \mathbb{E}_{(w,b)}[\tau_\phi(x, w, b)] - \frac{1}{N} \sum_{i=1}^N \tau_\phi(x, w_i, b_i) \right|.$$

We apply Lemma 2 and Lemma 3 to obtain an upper bound on $(***)$. In order to apply Lemma 2, it is to be verified that there exists $G : \mathbb{R}^{d+1} \to \mathbb{R}$ such that $\mathbb{E}_{(w,b)}[G(w, b)^2] < \infty$ and $|\tau_\phi(x, w, b) - \tau_\phi(x', w, b)| \le G(w, b) \|x - x'\|$. Recalling that $\|f_{\mathrm{clip}}\|_{L^\infty} \le M_2(f_{\mathrm{clip}})$ and that $\phi$ was assumed to be Lipschitz with constant denoted $L_\phi$, the difference of $\tau_\phi$ is given by

$$|\tau_\phi(x, w, b) - \tau_\phi(x', w, b)| = \left| \left( \frac{(2S)^d}{D} \sum_{j=1}^D f_{\mathrm{clip}}(x_j)\, \psi(w \cdot x_j + b) \right) \big( \phi(w \cdot x + b) - \phi(w \cdot x' + b) \big) \right|$$
$$\le (2S)^d M_2(f_{\mathrm{clip}}) \|\psi\|_{L^\infty} L_\phi\, |w \cdot (x - x')| \le (2S)^d M_2(f_{\mathrm{clip}}) \|\psi\|_{L^\infty} L_\phi\, \|w\|\, \|x - x'\|,$$

where the final inequality used Cauchy–Schwarz. Let $G(w, b) := (2S)^d M_2(f_{\mathrm{clip}}) \|\psi\|_{L^\infty} L_\phi \|w\|$, so that

$$\mathbb{E}_{(w,b)}[G(w, b)^2] = (2S)^{2d} M_2(f_{\mathrm{clip}})^2 \|\psi\|_{L^\infty}^2 L_\phi^2 \int_{\mathbb{R}^d} \|w\|^2\, p_{w,\sigma}(w)\, dw.$$

Since $p_{w,\sigma}(w) = \sigma_w^{-d} p_w(w / \sigma_w)$, by a change of variable $w' = \sigma_w^{-1} w$ we have $\int \|w\|^2 p_{w,\sigma}(w)\, dw = \sigma_w^2 \int \|w'\|^2 p_w(w')\, dw'$. By the assumption that $p_w$ has finite second moment, let $V_{p_w} := \int_{\mathbb{R}^d} \|w\|^2 p_w(w)\, dw$ to see

$$\mathbb{E}_{(w,b)}[G(w, b)^2] \le \left\{ (2S)^d M_2(f_{\mathrm{clip}}) \|\psi\|_{L^\infty} L_\phi \sqrt{V_{p_w}}\, \sigma_w \right\}^2.$$

By Lemma 2, for some constant $C_{\mathcal{X}}$ depending only on $\mathcal{X}$,

$$\mathbb{E}\left[ \sup_{x \in \mathcal{X}} \left| \mathbb{E}_{(w,b)}[\tau_\phi(x, w, b)] - \frac{1}{N} \sum_{i=1}^N \tau_\phi(x, w_i, b_i) \right| \right] \le \frac{C_{\mathcal{X}} \sqrt{\mathbb{E}_{(w,b)}[G(w, b)^2]}}{\sqrt{N}} \le \frac{C_{\mathcal{X}} (2S)^d M_2(f_{\mathrm{clip}}) \|\psi\|_{L^\infty} L_\phi \sqrt{V_{p_w}}\, \sigma_w}{\sqrt{N}}. \qquad (39)$$

By the upper bound $\big| \frac{(2S)^d}{D} \sum_j f_{\mathrm{clip}}(x_j) \psi(w \cdot x_j + b) \big| \le (2S)^d M_2(f_{\mathrm{clip}}) \|\psi\|_{L^\infty}$ and the boundedness of $\phi$, we have $|\tau_\phi(x, w, b)| \le (2S)^d M_2(f_{\mathrm{clip}}) \|\psi\|_{L^\infty} \|\phi\|_{L^\infty}$ for all $(x, w, b) \in \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}$. From Lemma 3, we have, with probability at least $1 - \delta$,

$$\sup_{x \in \mathcal{X}} |I_{\sigma,D} f(x) - I_{\sigma,D,N} f(x)| \le Z (2S)^d M_2(f_{\mathrm{clip}}) \|\psi\|_{L^\infty} \left( \frac{C_{\mathcal{X}} L_\phi \sqrt{V_{p_w}}\, \sigma_w}{\sqrt{N}} + \frac{\sqrt{2}\, \|\phi\|_{L^\infty} \sqrt{\log \delta^{-1}}}{\sqrt{N}} \right).$$

Set $C_3 := (2\pi) (2S)^d \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \|\psi\|_{L^\infty} \max\!\big( C_{\mathcal{X}} L_\phi \sqrt{V_{p_w}},\, \sqrt{2} \|\phi\|_{L^\infty} \big)$, where we recall that $Z = (2\pi) \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \sigma_w^d \sigma_b$. Then we have

$$\sup_{x \in \mathcal{X}} |I_{\sigma,D} f(x) - I_{\sigma,D,N} f(x)| \le C_3 M_2(f_{\mathrm{clip}})\, \frac{\sigma_b \sigma_w^d (\sigma_w + \sqrt{\log \delta^{-1}})}{\sqrt{N}}.$$

For all of the bounds on $(*)$, $(**)$ and $(***)$, recall that $M_2(f_{\mathrm{clip}}) \le M^*_2(\chi) M^*_2(f)$ from (36). Then combining $(*)$, $(**)$ and $(***)$ and setting $C := M^*_2(\chi)(C_1 + C_2 + C_3)$ completes the proof. ∎
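Before turning to the proof of Theorem 2, it may help to see the structure of $I_{\sigma,D,N}$ in code. The following sketch, which is ours and not taken from the paper, mirrors the first display of the proof above for $d = 1$. The choices of $\phi$, $\psi$, $p_w$, $p_b$ and the constant $Z$ are placeholders that make no attempt to satisfy the admissibility conditions of Definition 2, so the sketch illustrates only the computational structure of the estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

S, D, N = 1.0, 200, 2000
sigma_w, sigma_b = 3.0, 30.0
Z = 1.0                                        # placeholder for the constant Z

x_grid = np.linspace(-S, S, D, endpoint=False) # left-endpoint grid on [-S, S]
u = (2 * S) / D                                # cubature weights (chi = 1 here)

phi = np.tanh                                  # placeholder bounded activation
psi = lambda z: (z**2 - 1) * np.exp(-z**2 / 2) # placeholder Schwartz function

w = sigma_w * rng.standard_normal(N)           # w_i sampled from p_{w,sigma}
b = sigma_b * rng.standard_normal(N)           # b_i sampled from p_{b,sigma}

def I_sigma_D_N(x, f):
    # Inner cubature: the discretised ridgelet transform at each (w_i, b_i).
    R = (u * f(x_grid) * psi(np.outer(w, x_grid) + b[:, None])).sum(axis=1)
    # Outer Monte Carlo average: the dual transform evaluated at x.
    return (Z / N) * (R[:, None] * phi(np.outer(w, x) + b[:, None])).sum(axis=0)

x = np.linspace(-1, 1, 5)
print(I_sigma_D_N(x, lambda t: np.sin(np.pi * t)))
```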
A.2.3 Proof of Theorem 2

Proof.
For a bivariate function $g(x, y)$ we let $I^x_{\sigma,D,N} g(x, y)$ denote the action of $I_{\sigma,D,N}$ on the first argument of $g$, and we let $I^y_{\sigma,D,N} g(x, y)$ denote the action of $I_{\sigma,D,N}$ on the second argument of $g$. To reduce notation, in this proof we denote $\mathbb{E}_{w,b}[\cdot] := \mathbb{E}[\cdot \mid \{w_i, b_i\}_{i=1}^N]$. For fixed $x, y$ we let

$$(\mathrm{a}) := \mathbb{E}_{w,b}\big[ (f(x) - I_{\sigma,D,N} f(x))(f(y) - I_{\sigma,D,N} f(y)) \big]$$
$$= \mathbb{E}_{w,b}[f(x) f(y)] - \mathbb{E}_{w,b}[I_{\sigma,D,N} f(x)\, f(y)] - \mathbb{E}_{w,b}[f(x)\, I_{\sigma,D,N} f(y)] + \mathbb{E}_{w,b}[I_{\sigma,D,N} f(x)\, I_{\sigma,D,N} f(y)].$$

Recall that $\mathbb{E}_{w,b}[f(x) f(y)] = k(x, y) + m(x) m(y)$ for $f \sim \mathcal{GP}(m, k)$. Then

$$(\mathrm{a}) = \mathbb{E}_{w,b}[f(x) f(y)] - I^x_{\sigma,D,N} \mathbb{E}_{w,b}[f(x) f(y)] - I^y_{\sigma,D,N} \mathbb{E}_{w,b}[f(x) f(y)] + I^x_{\sigma,D,N} I^y_{\sigma,D,N} \mathbb{E}_{w,b}[f(x) f(y)]$$
$$= \big( m(x) m(y) - I_{\sigma,D,N} m(x)\, m(y) - m(x)\, I_{\sigma,D,N} m(y) + I_{\sigma,D,N} m(x)\, I_{\sigma,D,N} m(y) \big) + \big( k(x, y) - I^x_{\sigma,D,N} k(x, y) - I^y_{\sigma,D,N} k(x, y) + I^x_{\sigma,D,N} I^y_{\sigma,D,N} k(x, y) \big).$$

Let $h(x, y) := k(x, y) - I^x_{\sigma,D,N} k(x, y)$ in order to see

$$(\mathrm{a}) = (m(x) - I_{\sigma,D,N} m(x))(m(y) - I_{\sigma,D,N} m(y)) + \big( h(x, y) - I^y_{\sigma,D,N} h(x, y) \big).$$

Therefore, using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, the error is

$$\sup_{x \in \mathcal{X}} \sqrt{ \mathbb{E}_{w,b}\big[ (f(x) - I_{\sigma,D,N} f(x))^2 \big] } \le \sup_{x \in [-S,S]^d} \sqrt{ \mathbb{E}_{w,b}\big[ (f(x) - I_{\sigma,D,N} f(x))^2 \big] } \le \underbrace{\sup_{x \in [-S,S]^d} |m(x) - I_{\sigma,D,N} m(x)|}_{(*)} + \underbrace{\sup_{x \in [-S,S]^d} \sqrt{ \big| h(x, x) - I^y_{\sigma,D,N} h(x, x) \big| }}_{(**)}.$$

In the remainder we bound $(*)$ and $(**)$.

Bounding $(*)$: Applying Theorem 3, we immediately have, with probability at least $1 - \delta$ with respect to the random variables $\{w_i, b_i\}_{i=1}^N$,

$$(*) \le C_1 M^*_2(m) \left\{ \sigma_w^{-2} + \sigma_w^d (\sigma_w^2 + 1) \sigma_b^{-1} + \sigma_b \sigma_w^d (\sigma_w^2 + 1) D^{-1/d} + \frac{\sigma_b \sigma_w^d (\sigma_w + \sqrt{\log \delta^{-1}})}{\sqrt{N}} \right\}$$

for some constant $C_1$ independent of $\delta, \sigma_w, \sigma_b, D, N$ and $m$. In what follows we denote the bracketed factor above by $\{\cdots\}$.

Bounding $(**)$: It is clear that

$$(**)^2 = \sup_{x \in [-S,S]^d} \big| h(x, x) - I^y_{\sigma,D,N} h(x, x) \big| \le \sup_{x \in [-S,S]^d} \sup_{y \in [-S,S]^d} \big| h(x, y) - I^y_{\sigma,D,N} h(x, y) \big|. \qquad (40)$$

First, with respect to the supremum over $y$: from Theorem 3, with probability at least $1 - \delta$ with respect to the random variables $\{w_i, b_i\}_{i=1}^N$,

$$\sup_{y \in [-S,S]^d} \big| h(x, y) - I^y_{\sigma,D,N} h(x, y) \big| \le C_1 M^*_2(h(x, \cdot)) \left\{ \cdots \right\}, \qquad (41)$$

where $M^*_2(h(x, \cdot))$ is given as

$$M^*_2(h(x, \cdot)) = \max_{|\beta| \le 2} \sup_{y \in [-S,S]^d} |\partial^{0,\beta} h(x, y)|. \qquad (42)$$

Second, $M^*_2(h(x, \cdot))$ is to be bounded. Recall $h(x, y) = k(x, y) - I^x_{\sigma,D,N} k(x, y)$, so that $\partial^{0,\beta} h(x, y) = \partial^{0,\beta} k(x, y) - I^x_{\sigma,D,N} \partial^{0,\beta} k(x, y)$. For fixed $y \in [-S,S]^d$ and $|\beta| \le 2$, $\partial^{0,\beta} h(x, y)$ is upper bounded for all $x \in [-S,S]^d$ from Theorem 3: with probability at least $1 - \delta$ with respect to the random variables $\{w_i, b_i\}_{i=1}^N$,

$$|\partial^{0,\beta} h(x, y)| \le \sup_{x \in [-S,S]^d} \big| \partial^{0,\beta} k(x, y) - I^x_{\sigma,D,N} \partial^{0,\beta} k(x, y) \big| \le C_1 M^*_2(\partial^{0,\beta} k(\cdot, y)) \left\{ \cdots \right\},$$

where $M^*_2(\partial^{0,\beta} k(\cdot, y)) = \max_{|\alpha| \le 2} \sup_{x \in [-S,S]^d} |\partial^{\alpha,\beta} k(x, y)|$. Plugging this upper bound into (42), with probability at least $1 - \delta$,

$$M^*_2(h(x, \cdot)) \le C_1 \max_{|\beta| \le 2} \sup_{y \in [-S,S]^d} \max_{|\alpha| \le 2} \sup_{x \in [-S,S]^d} |\partial^{\alpha,\beta} k(x, y)| \left\{ \cdots \right\}.$$

Let $M^*(k) := \max_{|\alpha| \le 2, |\beta| \le 2} \sup_{x, y \in [-S,S]^d} |\partial^{\alpha,\beta} k(x, y)|$ to see, for all $x \in [-S,S]^d$, and with probability at least $1 - \delta$,

$$M^*_2(h(x, \cdot)) \le C_1 M^*(k) \left\{ \cdots \right\}.$$

Notice that $M^*(k) < \infty$ by the assumption $k \in C^{2,2}(\mathbb{R}^d \times \mathbb{R}^d)$. Combining this upper bound with (41) and taking the supremum over $x \in [-S,S]^d$, from (40) we have, with probability at least $1 - \delta$,

$$(**) \le C_2 \sqrt{M^*(k)} \left\{ \cdots \right\},$$

where $C_2 := \sqrt{C_1 C_1}$. Combining these bounds on $(*)$ and $(**)$ completes the proof. ∎

A.3 An Analogous Result to Theorem 2 for Unbounded φ

In this section we state and prove an analogous result to Theorem 2 that holds under weaker assumptions on the activation function $\phi$, with a correspondingly stronger assumption on the GP. This ensures that our theory is compatible with activation functions $\phi$ that may be unbounded, including the ReLU activation function $\phi(x) = \max(0, x)$.

First of all, we recall that a positive semi-definite function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ reproduces a Hilbert space $\mathcal{H}_k$ whose elements are functions $f : \mathbb{R}^d \to \mathbb{R}$, such that $k(\cdot, x) \in \mathcal{H}_k$ for all $x \in \mathbb{R}^d$ and $\langle f, k(\cdot, x) \rangle_{\mathcal{H}_k} = f(x)$ for all $x \in \mathbb{R}^d$ and all $f \in \mathcal{H}_k$. The Hilbert space $\mathcal{H}_k$ is called a reproducing kernel Hilbert space (RKHS). It is a well-known fact that GP sample paths are not contained in the RKHS induced by the GP covariance kernel, i.e. $f \sim \mathcal{GP}(m, k) \implies \mathbb{P}(f \in \mathcal{H}_k) = 0$, unless $\mathcal{H}_k$ is finite dimensional [16, Theorem 7.5.4]. However, $\mathbb{P}(f \in \mathcal{H}_R) = 1$ whenever $\mathcal{H}_R$ satisfies nuclear dominance over $\mathcal{H}_k$; see [26]. The additional assumption that we require on the GP in this appendix is that the GP takes values in an RKHS $\mathcal{H}_R$ where $R$ is continuously differentiable. Intuitively, this imposes an additional smoothness requirement on the GP compared to Theorem 2. Our analogous result to Theorem 2 is as follows:

Theorem 4 (Analogue of Theorem 2 for Unbounded φ). In the same setting as Theorem 2, replace the assumption that $\phi$ is bounded with the assumption that $\phi$ is continuous with at most linear growth. In addition, assume that $f \sim \mathcal{GP}(m, k)$ is a random variable taking values in $\mathcal{H}_R$ with reproducing kernel $R \in C^{2,2}(\mathbb{R}^d \times \mathbb{R}^d)$. Assume that $m \in \mathcal{H}_R$ and that the covariance operator $K$ of $f$ is trace class. Then, with probability at least $1 - \delta$,

$$\sup_{x \in \mathcal{X}} \sqrt{ \mathbb{E}\big[ (f(x) - I_{\sigma,D,N} f(x))^2 \mid \{w_i, b_i\}_{i=1}^N \big] } \le C \sqrt{M^*(R)} \big( \|m\|_{\mathcal{H}_R} + \sqrt{\mathrm{tr}(K)} \big) \left\{ \sigma_w^{-2} + \sigma_w^d (\sigma_w^3 + 1) \sigma_b^{-1} + \sigma_b (\sigma_b + 1) \sigma_w^d (\sigma_w^3 + 1) D^{-1/d} + \frac{\sigma_b \sigma_w^{d+1}}{\delta \sqrt{N}} \right\},$$

where $\mathrm{tr}(K)$ is the trace of the operator $K$ and $C$ is a constant independent of $m, k, \sigma_w, \sigma_b, D, N, \delta$.

Proof. To reduce notation, in this proof we denote $\mathbb{E}_{f|w,b}[\cdot] := \mathbb{E}[\cdot \mid \{w_i, b_i\}_{i=1}^N]$. By Jensen's inequality, we have

$$\sup_{x \in \mathcal{X}} \sqrt{ \mathbb{E}_{f|w,b}\big[ (f(x) - I_{\sigma,D,N} f(x))^2 \big] } \le \sqrt{ \mathbb{E}_{f|w,b}\left[ \Big( \sup_{x \in \mathcal{X}} |f(x) - I_{\sigma,D,N} f(x)| \Big)^2 \right] }. \qquad (43)$$

By Theorem 6 we have, with probability $1 - \delta$ with respect to the random variables $\{w_i, b_i\}_{i=1}^N$, that the right hand side of (43) can be bounded as

$$\le C \sqrt{ \mathbb{E}_{f|w,b}\big[ M^*_2(f)^2 \big] } \left\{ \sigma_w^{-2} + \sigma_w^d (\sigma_w^3 + 1) \sigma_b^{-1} + \sigma_b (\sigma_b + 1) \sigma_w^d (\sigma_w^3 + 1) D^{-1/d} + \frac{\sigma_b \sigma_w^{d+1}}{\delta \sqrt{N}} \right\},$$

where $C$ is a constant independent of $f, \sigma_w, \sigma_b, \delta$. Next we will upper-bound $\sqrt{\mathbb{E}_{f|w,b}[M^*_2(f)^2]}$. Since $f$ is an $\mathcal{H}_R$-valued random variable, by the reproducing property of $\mathcal{H}_R$,

$$\mathbb{E}_{f|w,b}\big[ M^*_2(f)^2 \big] = \mathbb{E}_{f|w,b}\left[ \Big( \max_{|\alpha| \le 2} \sup_{x \in [-S,S]^d} |\partial^\alpha f(x)| \Big)^2 \right] = \mathbb{E}_{f|w,b}\left[ \Big( \max_{|\alpha| \le 2} \sup_{x \in [-S,S]^d} \big| \langle f, \partial^{\alpha,0} R(x, \cdot) \rangle_{\mathcal{H}_R} \big| \Big)^2 \right].$$

By the Cauchy–Schwarz inequality,

$$\mathbb{E}_{f|w,b}\left[ \Big( \max_{|\alpha| \le 2} \sup_{x} \big| \langle f, \partial^{\alpha,0} R(x, \cdot) \rangle_{\mathcal{H}_R} \big| \Big)^2 \right] \le \mathbb{E}_{f|w,b}\left[ \max_{|\alpha| \le 2} \sup_{x \in [-S,S]^d} \|f\|^2_{\mathcal{H}_R}\, \partial^{\alpha,\alpha} R(x, x) \right] \le M^*(R)\, \mathbb{E}_{f|w,b}\big[ \|f\|^2_{\mathcal{H}_R} \big].$$

From [38, (1.13)], $\mathbb{E}_{f|w,b}[\|f\|^2_{\mathcal{H}_R}] = \|m\|^2_{\mathcal{H}_R} + \mathrm{tr}(K)$. Therefore we have

$$\sqrt{\mathbb{E}_{f|w,b}[M^*_2(f)^2]} \le \sqrt{ M^*(R) \big( \|m\|^2_{\mathcal{H}_R} + \mathrm{tr}(K) \big) } \le \sqrt{M^*(R)} \big( \|m\|_{\mathcal{H}_R} + \sqrt{\mathrm{tr}(K)} \big),$$

where the fact that $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$ is applied for the last inequality. Plugging in this upper bound concludes the proof. ∎

An example of an activation function $\phi$ and an associated function $\psi$ that satisfy the assumptions of Theorem 4 is given in Table 2.

Function | Interpretation | Example
$\phi(z)$ | activation function | $\max(0, z)$
$\psi(z)$ | defines the ridgelet transform | $c_d\, \frac{d^{\,d+r+1}}{dz^{\,d+r+1}} \exp(-z^2/2)$, with $r := d \bmod 2$ and $c_d$ a dimension-dependent constant

Table 2: An example of functions $\phi$, $\psi$ that satisfy the regularity assumptions of Theorem 4.

In the remaining part of this section, we show Theorem 5 and Theorem 6, which are analogous results to Theorem 1 and Theorem 3 for $\phi$ with at most linear growth. This growth assumption implies, for some $C_\phi < \infty$,

$$\phi(w \cdot x + b) \le C_\phi (1 + |w \cdot x + b|) \le C_\phi (1 + \|w\| \|x\| + |b|) \le C_\phi (1 + \|x\|)(1 + \|w\|)(1 + |b|), \qquad (44)$$

where the Cauchy–Schwarz inequality is applied for the second inequality. The same arguments as in the proofs of Theorem 1 and Theorem 3 hold upon replacing all bounds involving $\phi$ with an expression similar to (44). Recall $B_2(f) := \int_{\mathbb{R}^d} |f(x)| (1 + \|x\|)^2\, dx$.

Theorem 5 (Analogue of Theorem 1 for Unbounded φ). Let $\mathcal{X} \subset \mathbb{R}^d$ be compact. Let Assumption 1 and Assumption 2 hold, but with the boundedness of $\phi$ replaced by the assumption that $\phi$ is continuous with at most linear growth, and let $f \in C^2(\mathbb{R}^d)$ satisfy $M_2(f) < \infty$ and $B_2(f) < \infty$. Then

$$\sup_{x \in \mathcal{X}} |f(x) - (\mathcal{R}^*_\sigma \mathcal{R})[f](x)| \le C \max(M_2(f), B_2(f)) \left\{ \sigma_w^{-2} + \sigma_w^d (\sigma_w^3 + 1) \sigma_b^{-1} \right\} \qquad (45)$$

for some constant $C$ that is independent of $\sigma_w$, $\sigma_b$ and $f$, but may depend on $\mathcal{X}$, $\phi$, $\psi$, $p_w$ and $p_b$.

Proof. We use the same proof as Theorem 1, but consider the supremum error over a compact domain $\mathcal{X} \subset \mathbb{R}^d$, i.e. $\sup_{x \in \mathcal{X}} |f(x) - (\mathcal{R}^*_\sigma \mathcal{R})[f](x)|$, instead of the error over $\mathbb{R}^d$. Recall that the overall structure of the proof is (17), and in particular there are two quantities, $(*)$ and $(**)$, to be bounded. The argument used to bound $(*)$ remains valid, so our attention below turns to the argument used to bound $(**)$.

In order to establish a bound on $(**)$, we replace the upper bound involving $\phi$ subsequent to (29). From (29), we have

$$|(\mathrm{I})| \le \|\partial p_b\|_{L^\infty} (1 + \|x'\|)(1 + \|w\|^2) \int_{\mathbb{R}} (1 + |b'|)\, |\phi(w \cdot (x - x') + b')\, \psi(b')|\, db'. \qquad (46)$$

From (44), for some $C_\phi < \infty$,

$$\phi(w \cdot (x - x') + b') \le C_\phi (1 + \|x - x'\|)(1 + \|w\|)(1 + |b'|) \le C_\phi (1 + \|x\|)(1 + \|x'\|)(1 + \|w\|)(1 + |b'|).$$

Hence

$$|(\mathrm{I})| \le C_\phi \|\partial p_b\|_{L^\infty} (1 + \|x\|)(1 + \|x'\|)^2 (1 + \|w\|^2)(1 + \|w\|) \int_{\mathbb{R}} (1 + |b'|)^2 |\psi(b')|\, db'. \qquad (47)$$

Let $C_\psi := \int_{\mathbb{R}} (1 + |b|)^2 |\psi(b)|\, db$ and $C_{p_w} := \max\!\big( 1, \int_{\mathbb{R}^d} \|w\|^2 p_w(w)\, dw, \int_{\mathbb{R}^d} \|w\|^3 p_w(w)\, dw \big)$. By the same argument as in the proof of Theorem 1 subsequent to (29), considering the difference between (30) and (47) and using $(1 + \|w\|^2)(1 + \|w\|) \le 2(1 + \|w\|^3)$,

$$|\Delta k(x, x')| \le (2\pi) \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \|\partial p_b\|_{L^\infty} C_\phi C_\psi C_{p_w}\, (1 + \|x\|)(1 + \|x'\|)^2\, \epsilon_w^{-d} (1 + \epsilon_w^{-3})\, \epsilon_b.$$

Set $C' := (2\pi) \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \|\partial p_b\|_{L^\infty} C_\phi C_\psi C_{p_w}$. The error term $(**)$ is then bounded as

$$(**) = \sup_{x \in \mathcal{X}} \left| \int_{\mathbb{R}^d} f(x')\, \Delta k(x, x')\, dx' \right| \le C' \sup_{x \in \mathcal{X}} (1 + \|x\|) \int_{\mathbb{R}^d} (1 + \|x'\|)^2\, |f(x')|\, dx'\; \epsilon_w^{-d} (1 + \epsilon_w^{-3})\, \epsilon_b.$$

Setting $C_2 := C' \sup_{x \in \mathcal{X}} (1 + \|x\|)$, we have

$$(**) \le C_2\, B_2(f)\, \epsilon_w^{-d} (1 + \epsilon_w^{-3})\, \epsilon_b. \qquad (48)$$

Combining the bounds for $(*)$ and $(**)$ and setting $C := \max(C_1, C_2)$,

$$\sup_{x \in \mathcal{X}} |f(x) - \mathcal{R}^*_\sigma \mathcal{R} f(x)| \le C \max(M_2(f), B_2(f)) \left\{ \epsilon_w^2 + \epsilon_w^{-d} (1 + \epsilon_w^{-3})\, \epsilon_b \right\},$$

where $C$ depends only on $\mathcal{X}$, $\phi$, $\psi$, $p_w$, $p_b$ but not on $f, \epsilon_w, \epsilon_b$. Setting $\sigma_w = \epsilon_w^{-1}$ and $\sigma_b = \epsilon_b^{-1}$ completes the proof. ∎

Theorem 6 (Analogue of Theorem 3 for Unbounded φ). In the same setting as Theorem 3, but with the boundedness of $\phi$ replaced by the assumption that $\phi$ is continuous with at most linear growth, for any $f \in C^2(\mathbb{R}^d)$ with $M^*_2(f) < \infty$, with probability at least $1 - \delta$,

$$\sup_{x \in \mathcal{X}} |f(x) - I_{\sigma,D,N} f(x)| \le C M^*_2(f) \left\{ \sigma_w^{-2} + \sigma_w^d (\sigma_w^3 + 1) \sigma_b^{-1} + \sigma_b (\sigma_b + 1) \sigma_w^d (\sigma_w^3 + 1) D^{-1/d} + \frac{\sigma_b \sigma_w^{d+1}}{\delta \sqrt{N}} \right\}$$

where $C$ is a constant that may depend on $\mathcal{X}$, $\chi$, $\phi$, $\psi$, $p_w$, $p_b$ but does not depend on $f$, $\sigma_w$, $\sigma_b$, $\delta$.

Proof. This result follows from a modification of the proof of Theorem 3. Recall from (35) that there are three terms, $(*)$, $(**)$ and $(***)$, to be bounded. In what follows we indicate how the arguments used to establish Theorem 3 should be modified.

Bounding $(*)$: Replace (37) in the proof of Theorem 3 with the following inequality:

$$B_2(f_{\mathrm{clip}}) = \int_{[-S,S]^d} |f_{\mathrm{clip}}(x)| (1 + \|x\|)^2\, dx \le \int_{[-S,S]^d} M_2(f_{\mathrm{clip}})\, (1 + S)^2\, dx = (1 + S)^2 (2S)^d M_2(f_{\mathrm{clip}}) < \infty.$$

Now apply Theorem 5 in place of Theorem 1 to obtain

$$(*) \le C'_1 \max(M_2(f_{\mathrm{clip}}), B_2(f_{\mathrm{clip}})) \left( \sigma_w^{-2} + \sigma_w^d (\sigma_w^3 + 1) \sigma_b^{-1} \right)$$

for some constant $C'_1 > 0$ depending only on $\mathcal{X}, \phi, \psi, p_w, p_b$. Let $C_1 := (1 + S)^2 (2S)^d C'_1$ to see $(*) \le C_1 M_2(f_{\mathrm{clip}}) \big( \sigma_w^{-2} + \sigma_w^d (\sigma_w^3 + 1) \sigma_b^{-1} \big)$.

Bounding $(**)$: Here we replace the upper bound on $(\mathrm{b})$ in the proof of Theorem 3. From (44),

$$(\mathrm{b}) = Z \int_{\mathbb{R}} \int_{\mathbb{R}^d} \big| (1 + \|w\|^2)\, \phi(w \cdot x + b) \big|\, p_{w,\sigma}(w)\, p_{b,\sigma}(b)\, dw\, db \le Z C_\phi (1 + \|x\|) \underbrace{\int_{\mathbb{R}^d} (1 + \|w\|^2)(1 + \|w\|)\, p_{w,\sigma}(w)\, dw}_{(\mathrm{c})}\; \underbrace{\int_{\mathbb{R}} (1 + |b|)\, p_{b,\sigma}(b)\, db}_{(\mathrm{d})}.$$

Let $C_{p_w} := \max\!\big( 1, \int \|w\|^2 p_w\, dw, \int \|w\|^3 p_w\, dw \big)$ and $C_{p_b} := \max\!\big( 1, \int |b|\, p_b(b)\, db \big)$. By the same argument as for the upper bound of $(\mathrm{b})$ in the proof of Theorem 3, with numerical factors absorbed into the constants,

$$(\mathrm{c}) \le C_{p_w} (1 + \sigma_w^3), \qquad (\mathrm{d}) \le C_{p_b} (1 + \sigma_b).$$

Recalling $Z = (2\pi) \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} \sigma_w^d \sigma_b$, by the upper bounds on $(\mathrm{c})$ and $(\mathrm{d})$,

$$(\mathrm{b}) \le (1 + \|x\|)\, (2\pi) \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} C_{p_w} C_{p_b} C_\phi\, \sigma_b (1 + \sigma_b)\, \sigma_w^d (1 + \sigma_w^3).$$

Let $C' := \sup_{x \in [-S,S]^d} (1 + \|x\|) \cdot (2\pi) \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} C_{p_w} C_{p_b} C_\phi$. By plugging the upper bound on $(\mathrm{b})$ into (38) and setting $C_2 := C_Q M_1(\psi) C'$, we have

$$(**) \le C_2 M_2(f_{\mathrm{clip}})\, \sigma_b (\sigma_b + 1)\, \sigma_w^d (\sigma_w^3 + 1)\, D^{-1/d}.$$

Bounding $(***)$: Finally we apply Lemma 5 in place of Lemma 3. From (39), $\mathbb{E}_{\{w_i, b_i\}_{i=1}^N}[(***)]$ is upper bounded by

$$\mathbb{E}_{\{w_i, b_i\}_{i=1}^N}[(***)] \le Z\, \frac{C_{\mathcal{X}} (2S)^d M_2(f_{\mathrm{clip}}) \|\psi\|_{L^\infty} L_\phi \sqrt{V_{p_w}}\, \sigma_w}{\sqrt{N}}.$$

Set $C_3 := (2\pi) \|\widehat{p_w}\|_{L^1}^{-1} \|\widehat{p_b}\|_{L^1}^{-1} C_{\mathcal{X}} (2S)^d \|\psi\|_{L^\infty} L_\phi \sqrt{V_{p_w}}$. By Lemma 5, with probability at least $1 - \delta$,

$$(***) \le C_3 M_2(f_{\mathrm{clip}})\, \frac{\sigma_b \sigma_w^{d+1}}{\delta \sqrt{N}}.$$

For all of the bounds on $(*)$, $(**)$ and $(***)$, recall that $M_2(f_{\mathrm{clip}}) \le M^*_2(\chi) M^*_2(f)$ from (36). Then combining $(*)$, $(**)$ and $(***)$ and setting $C := M^*_2(\chi)(C_1 + C_2 + C_3)$ completes the proof. ∎

A.4 Experiments
This section expands on the experiments that were reported in the main text. In Appendix A.4.1 we detail how Markov chain Monte Carlo was used to approximate the posterior distribution in our experiments. Appendix A.4.2 reports the cubature rules that were used. Finally, Appendix A.4.3 explores the effect of using alternative settings in the ridgelet prior, compared to those used to produce the figures in the main text.
A.4.1 Approximating the Posterior with Markov Chain Monte Carlo
For the experiments reported in Section 5.2, the posterior distribution over the parameters $\theta$ of the BNN is analytically intractable and must therefore be approximated. Several approaches have been developed for this task, including variational methods and Monte Carlo methods. Here we exploited conditional linearity to obtain accurate approximations to the posterior using a Monte Carlo method. The details will now be described.

Our general regression model for covariates $x^{(i)}$ and their associated responses $y^{(i)}$ is formulated as
$$ y^{(i)} = m(x^{(i)}) + f(x^{(i)}) + \epsilon^{(i)}, \qquad f(x) = \sum_{i=1}^N w_i^{(2)} \phi(w_i^{(1)} \cdot x + b_i), \qquad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma_\epsilon^2). \quad (49) $$
For convenience, set $\tilde{y}^{(i)} = y^{(i)} - m(x^{(i)})$. Recall that the model parameters $\theta = \{w_i^{(1)}, b_i, w_i^{(2)}\}_{i=1}^N$ were assigned the prior distribution
$$ w_i^{(1)} \sim \mathcal{N}(0, \sigma_w^2 I_{d \times d}), \qquad b_i \sim \mathcal{N}(0, \sigma_b^2), \qquad \{w_i^{(2)}\}_{i=1}^N \mid \{w_i^{(1)}, b_i\}_{i=1}^N \sim \mathcal{N}(0, \Sigma), $$
where $\Sigma := \Psi K \Psi^\top$. Let $[\tilde{y}]_i = \tilde{y}^{(i)}$ and $[\Phi]_{i,j} := \phi(w_j^{(1)} \cdot x^{(i)} + b_j)$. Our main observation is that, conditional on $\{w_i^{(1)}, b_i\}_{i=1}^N$, (49) is a linear model whose coefficients $\{w_i^{(2)}\}_{i=1}^N$ can be analytically marginalised to obtain a Gaussian marginal (log-)likelihood
$$ \log p(\tilde{y} \mid \{w_i^{(1)}, b_i\}_{i=1}^N) = C - \tfrac{1}{2} \log \det \Sigma_* - \tfrac{1}{2} \tilde{y}^\top \Sigma_*^{-1} \tilde{y}, \qquad \Sigma_* := \Phi \Sigma \Phi^\top + \sigma_\epsilon^2 I, $$
where $C$ is a constant with respect to $\{w_i^{(1)}, b_i\}_{i=1}^N$. This enables the use of MCMC to sample directly from the marginal distribution of $\{w_i^{(1)}, b_i\}_{i=1}^N \mid \tilde{y}$, and for this purpose we employed the elliptical slice sampler of [31]. Furthermore, we recognise that the regression coefficients $\{w_i^{(2)}\}_{i=1}^N$ are conditionally Gaussian given $\tilde{y}$ and $\{w_i^{(1)}, b_i\}_{i=1}^N$, meaning that for each sample of $\{w_i^{(1)}, b_i\}_{i=1}^N \mid \tilde{y}$ we can simulate a corresponding sample from
$$ \{w_i^{(2)}\}_{i=1}^N \mid \tilde{y}, \{w_i^{(1)}, b_i\}_{i=1}^N \sim \mathcal{N}(m_{**}, \Sigma_{**}), $$
where $m_{**} := \sigma_\epsilon^{-2} \Sigma_{**} \Phi^\top \tilde{y}$ and $\Sigma_{**}^{-1} := \Sigma^{-1} + \sigma_\epsilon^{-2} \Phi^\top \Phi$. This two-stage procedure was observed to work effectively in the experiments that we performed. However, the matrix inversion that appears in the above expressions can exhibit poor numerical conditioning and, in this work, we employed a crude form of numerical regularisation so that such issues (which arise from the posterior approximation approach used and are not intrinsic to the ridgelet prior itself) were obviated. Specifically, we employed the Moore-Penrose pseudo-inverse whenever $\Sigma_*$ or $\Sigma^{-1} + \sigma_\epsilon^{-2} \Phi^\top \Phi$ was numerically singular and the action of the inverse matrix was required, and we employed Tikhonov regularisation (with the extent of the regularisation manually selected) whenever $\Sigma_{**}$ was numerically singular and a matrix square root was required.
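To make the two-stage procedure concrete, the following is a minimal sketch in Python/NumPy; it is not the implementation used in the paper. It assumes the hyperbolic tangent activation of Table 1, treats the hidden-layer parameters as a stacked array, and uses function names of our own choosing. The elliptical slice sampler [31] targets $\{w_i^{(1)}, b_i\}_{i=1}^N \mid \tilde{y}$ through the marginal log-likelihood, after which the output weights are drawn from their conditional Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

def marginal_loglik(W, b, X, y_tilde, Sigma, sigma_eps):
    """Gaussian marginal log-likelihood of (49) with the output weights integrated out."""
    Phi = np.tanh(X @ W.T + b)                        # [Phi]_{ij} = phi(w_j . x_i + b_j)
    Sigma_star = Phi @ Sigma @ Phi.T + sigma_eps**2 * np.eye(X.shape[0])
    _, logdet = np.linalg.slogdet(Sigma_star)
    quad = y_tilde @ np.linalg.solve(Sigma_star, y_tilde)
    return -0.5 * logdet - 0.5 * quad                 # additive constant omitted

def ess_step(theta, log_lik, prior_draw):
    """One elliptical slice sampling update [31]; the prior on theta must be zero-mean Gaussian."""
    nu = prior_draw()                                 # auxiliary draw from the prior
    log_y = log_lik(theta) + np.log(rng.uniform())    # slice threshold
    angle = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = angle - 2.0 * np.pi, angle
    while True:
        prop = theta * np.cos(angle) + nu * np.sin(angle)
        if log_lik(prop) > log_y:
            return prop
        lo, hi = (angle, hi) if angle < 0.0 else (lo, angle)  # shrink the bracket
        angle = rng.uniform(lo, hi)

def sample_output_weights(W, b, X, y_tilde, Sigma, sigma_eps):
    """Draw the output weights from their conditional Gaussian, N(m_**, Sigma_**)."""
    Phi = np.tanh(X @ W.T + b)
    prec = np.linalg.pinv(Sigma) + (Phi.T @ Phi) / sigma_eps**2   # Sigma_**^{-1}
    Sigma_ss = np.linalg.pinv(prec)                   # pseudo-inverse, as described in the text
    m_ss = Sigma_ss @ Phi.T @ y_tilde / sigma_eps**2
    return rng.multivariate_normal(m_ss, Sigma_ss)
```

In this sketch a single MCMC iteration consists of one `ess_step` over the stacked $(w^{(1)}, b)$ state, with `prior_draw` returning a fresh sample from the $\mathcal{N}(0, \sigma_w^2 I)$ and $\mathcal{N}(0, \sigma_b^2)$ priors, followed by one call to `sample_output_weights`.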
A.4.2 Experimental Setting

Here, for completeness, we report the cubature rules $\{(u_i, x_i)\}_{i=1}^D$ and the bandwidth parameters $\sigma_w$, $\sigma_b$ that were used in our experiments. The sensitivity of the reported results to these choices is investigated in Appendix A.4.3.

Recall that $\{(u_i, x_i)\}_{i=1}^D$ is a cubature rule on $[-S, S]^d$ and that, in all experiments, we aim for accurate approximation on $X = [-1, 1]^d$. For this, a value $S > 1$ is required, and in this case (13) was used. The settings that we used for the results in the main text were as follows. As mentioned in Section 5, the functions $\phi$ and $\psi$ were set as in Table 1 and the densities $p_w$ and $p_b$ were set as standard Gaussians for all of the results in Table 3. In every experiment the cubature points $x_j$ were taken to be grid points on $[-S, S]$, with density-based weights $u_j = D^{-1} p(x_j)$ in some experiments and uniform weights $u_j = D^{-1}$ in the others; the grid size $D$ and the common bandwidth $\sigma_w = \sigma_b$ varied by experiment, as collected in Table 3 and illustrated in the sketch following the table.

Table 3: Cubature rules and bandwidth parameters used for each experiment reported in the main text.
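As an illustration of the grid-based rules in Table 3, the sketch below constructs a one-dimensional grid cubature rule. The two weighting schemes follow the table; the normalisation convention and the example values of $S$ and $D$ are our own assumptions.

```python
import numpy as np

def grid_cubature(S, D, density=None):
    """Equispaced cubature rule {(u_j, x_j)}_{j=1}^D on [-S, S] for d = 1.

    With density=None the weights are uniform, u_j = 1/D; otherwise
    u_j = density(x_j)/D, matching the two schemes reported in Table 3.
    """
    x = np.linspace(-S, S, D)
    u = np.full(D, 1.0 / D) if density is None else density(x) / D
    return u, x

# Example usage: weights proportional to a standard Gaussian density,
# consistent with p_w, p_b being standard Gaussians (S and D illustrative).
u, x = grid_cubature(S=3.0, D=100,
                     density=lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi))
```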
A.4.3 Comparison of Prior Predictive with Different Settings
In this final section we investigate the effect of varying the settings of the ridgelet prior. The default settings in Table 1 were taken as a starting point and were then systematically varied. Initially we consider a squared exponential covariance model for the target GP. Specifically, we considered:

1. Different choices of $\sigma_w$ and $\sigma_b$: four fixed choices of $(\sigma_w, \sigma_b)$, ranging from small to large.
2. Dynamic setting of $\sigma_w$ and $\sigma_b$: $\sigma_w$ and $\sigma_b$ vary depending on $N$.
3. Different choices of activation function: Gaussian, hyperbolic tangent (Tanh), and ReLU.
4. Different choices of GP covariance model: squared exponential, rational quadratic, and periodic.

The findings in each case are summarised next; each comparison is based on the empirical covariance of BNN prior draws, for which a sketch is provided below.
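The following sketch shows one way such a comparison could be computed from sample paths of the BNN prior. The precise definition of the MRMSE statistic is not restated in this appendix, so the root-mean-squared discrepancy in `mrmse` should be read as our assumption rather than the paper's exact metric.

```python
import numpy as np

def empirical_bnn_covariance(paths):
    """Empirical covariance of BNN prior sample paths evaluated on a test grid.

    paths: array of shape (n_samples, n_grid), one prior draw per row.
    """
    return np.cov(paths, rowvar=False)

def mrmse(emp_cov, target_cov):
    # Assumed reading of the MRMSE reported in the figures: a root-mean-squared
    # discrepancy between the empirical BNN covariance and the GP target.
    return np.sqrt(np.mean((emp_cov - target_cov) ** 2))
```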
Different choices of $\sigma_w$ and $\sigma_b$: Figure 9 displays the MRMSE and BNN covariance function for each of the four fixed choices of $(\sigma_w, \sigma_b)$. It can be observed that the BNN covariance function for larger $(\sigma_w, \sigma_b)$ has a qualitatively correct shape but is larger overall compared to the GP target when $N$ is small. On the other hand, the BNN covariance function for smaller $(\sigma_w, \sigma_b)$ takes values that are closer to those of the GP, but is visually flatter than the GP and the approximation does not improve as $N$ is increased. These observations indicate that it may be advantageous to change the values of $(\sigma_w, \sigma_b)$ in a manner that increases with $N$. This leads us to the next experiment.

Figure 9: MRMSE and BNN covariance for the four fixed choices of $(\sigma_w, \sigma_b)$, panels (a) to (d).

Dynamic setting of $\sigma_w$ and $\sigma_b$: As observed in the previous experiment, it may be advantageous to change $(\sigma_w, \sigma_b)$ in a manner that increases with $N$. In this experiment, the values of $(\sigma_w, \sigma_b, N)$ were varied jointly over five settings, with $(\sigma_w, \sigma_b)$ increasing alongside $N$. Figure 10 shows BNN sample paths and Figure 11 shows the MRMSE and BNN covariance. This dynamic setting of $(\sigma_w, \sigma_b, N)$ appears to constitute a promising compromise between the two extremes of behaviour observed in Figure 9.

Figure 10: Sample paths of the BNN for the dynamic setting of $(\sigma_w, \sigma_b, N)$; the final panel shows the original GP.

Figure 11: MRMSE and BNN covariance for the dynamic setting of $(\sigma_w, \sigma_b, N)$.

Different choices of activation function: In this experiment, we fix $(\sigma_w, \sigma_b)$ and use three different activation functions: Gaussian, hyperbolic tangent, and ReLU. The settings for the ridgelet prior corresponding to each activation function are given in Table 4 for the Gaussian activation function, Table 1 for the hyperbolic tangent activation function, and Table 2 for the ReLU activation function. Figure 12 and Figure 13 indicate that smooth and bounded activation functions, such as the Gaussian activation function, allowed the ridgelet approximation to converge more rapidly to the GP in this experiment; the three activation functions are sketched in code below, after Table 4.

Function    Interpretation                    Example
$\phi(z)$   activation function               $\exp(-z^2/2)$
$\psi(z)$   defines the ridgelet transform    $(2\pi)^{-d}(-\sqrt{\pi})^{-r}\,\frac{\mathrm{d}^{d+r}}{\mathrm{d}z^{d+r}}\exp(-z^2)$, where $r := d \bmod 2$

Table 4: An example of functions $\phi$, $\psi$ that satisfy the regularity assumptions in the Gaussian case.
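For reference, the three activation functions being compared admit the following one-line implementations; the Gaussian normalisation follows our reconstruction of Table 4 and should be treated as an assumption.

```python
import numpy as np

# The three activation functions compared in Figures 12 and 13.
def phi_gauss(z):
    return np.exp(-z**2 / 2.0)   # Gaussian, cf. Table 4 (normalisation assumed)

def phi_tanh(z):
    return np.tanh(z)            # hyperbolic tangent, cf. Table 1

def phi_relu(z):
    return np.maximum(z, 0.0)    # ReLU, cf. Table 2
```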
Figure 12: Sample paths of the BNN for different activation functions; Gaussian, hyperbolic tangent, and ReLU.
Figure 13: MRMSE and BNN covariance for different activation functions; Gaussian, hyperbolic tangent, and ReLU.
Different choices of GP covariance model:
For these experiments we fixed the activation function to the hyperbolic tangent and fixed the bandwidths $(\sigma_w, \sigma_b)$. We then considered in turn each of the following covariance models for the target GP: squared exponential $k_1$, rational quadratic $k_2$, and periodic $k_3$:
$$ k_1(x, x') := l_1^2 \exp\left( -\frac{|x - x'|^2}{2 s_1^2} \right) $$
$$ k_2(x, x') := l_2^2 \left( 1 + \frac{|x - x'|^2}{2 \alpha_2 s_2^2} \right)^{-\alpha_2} $$
$$ k_3(x, x') := l_3^2 \exp\left( -\frac{2}{s_3^2} \sin^2\left( \frac{\pi |x - x'|}{p_3} \right) \right) $$
where the hyperparameters $l_1$, $s_1$, $l_2$, $\alpha_2$, $s_2$, $l_3$ and $p_3$ were each fixed to values in $(0, 1)$ and $s_3 = 0.75$. The sample paths from the BNN with the ridgelet prior are displayed in Figure 14 as a function of $N$, and the associated BNN covariance functions are displayed in Figure 15. It is perhaps not surprising that the periodic covariance model, being the most complex, appears to be the most challenging to approximate with a BNN.
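For concreteness, the three target covariance models can be implemented as follows. The default hyperparameter values shown are placeholders rather than the values used in the experiments, which are not reproduced in this appendix apart from $s_3 = 0.75$.

```python
import numpy as np

# Target GP covariance models; r = |x - x'| and all defaults are illustrative.
def k_se(r, l=0.5, s=0.5):              # squared exponential
    return l**2 * np.exp(-r**2 / (2 * s**2))

def k_rq(r, l=0.5, s=0.5, alpha=0.5):   # rational quadratic
    return l**2 * (1 + r**2 / (2 * alpha * s**2)) ** (-alpha)

def k_per(r, l=0.5, s=0.75, p=0.5):     # periodic
    return l**2 * np.exp(-2 * np.sin(np.pi * r / p)**2 / s**2)
```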
Figure 14: Sample paths of the BNN for different GP covariance models; squared exponential, rational quadratic, and periodic.

Figure 15: MRMSE and BNN covariance for different GP covariance models; squared exponential (SE), rational quadratic (RQ), and periodic.