A Role for Symmetry in the Bayesian Solution of Differential Equations
Junyang Wang, Jon Cockayne, Chris. J. Oates
Newcastle University, UK; Alan Turing Institute, UK
September 24, 2019
Abstract
The interpretation of numerical methods, such as finite difference methods for differential equations, as point estimators suggests that formal uncertainty quantification can also be performed in this context. Competing statistical paradigms can be considered and Bayesian probabilistic numerical methods (PNMs) are obtained when Bayesian statistical principles are deployed. Bayesian PNMs have the appealing property of being closed under composition, such that uncertainty due to different sources of discretisation in a numerical method can be jointly modelled and rigorously propagated. Despite recent attention, no exact Bayesian PNM for the numerical solution of ordinary differential equations (ODEs) has been proposed. This raises the fundamental question of whether exact Bayesian methods for (in general nonlinear) ODEs even exist. The purpose of this paper is to provide a positive answer for a limited class of ODEs. To this end, we work at a foundational level, where a novel Bayesian PNM is proposed as a proof-of-concept. Our proposal is a synthesis of classical Lie group methods, to exploit underlying symmetries in the gradient field, and non-parametric regression in a transformed solution space for the ODE. The procedure is presented in detail for first and second order ODEs and relies on a certain strong technical condition – existence of a solvable Lie algebra – being satisfied. Numerical illustrations are provided.
Numerical methods underpin almost all of scientific, engineering and industrial output. In the abstract, a numerical task can be formulated as the approximation of a quantity of interest Q : Y → Q, subject to a finite computational budget. The true underlying state y† ∈ Y is typically high- or infinite-dimensional, so that only limited information

A : Y → A   (1)

is provided and exact computation of Q(y†) is prohibited. For example, numerical integration aims to approximate an integral Q(y†) = ∫ y†(t) dt given the values A(y†) = {(x_i, y†(x_i))}_{i=1}^n of the integrand y† on a finite number of abscissae {x_i}_{i=1}^n. Similarly, a numerical approximation to the solution Q(y†) = y† of a differential equation dy/dx = f(x, y(x)), y(x_0) = y_0, must be based on at most a finite number of evaluations of f, the gradient field. In this viewpoint a numerical method corresponds to a map b : A → Q, as depicted in Figure 1a, where b(a) represents an approximation to the solution of the differential equation based on the information a ∈ A.

The increasing ambition and complexity of contemporary applications is such that the computational budget can be extremely small compared to the precision that is required at the level of the quantity of interest. As such, in many important problems it is not possible to reduce the numerical error to a negligible level. Fields acutely associated with this challenge include climate forecasting (Wedi, 2014), computational cardiology (Chabiniok et al., 2016) and molecular dynamics (Perilla et al., 2015). In the presence of non-negligible numerical error, it is unclear how scientific interpretation of the output of computation can proceed. For example, a posteriori analysis of traditional numerical methods can be used to establish hard upper bounds on the numerical error, but these bounds typically depend on an unknown global constant.
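To make the abstraction concrete, here is a minimal sketch (our own illustration, not taken from the paper) of the quadrature example: the information operator A evaluates the latent integrand at finitely many abscissae, and a numerical method is a map b : A → Q, here chosen to be the trapezoidal rule.

```python
import numpy as np

def info(y, xs):
    """Information operator A: evaluate the latent integrand y at finitely many abscissae."""
    return [(x, y(x)) for x in xs]

def b(a):
    """A numerical method is a map b : A -> Q; here b is the trapezoidal rule."""
    xs = np.array([x for x, _ in a])
    ys = np.array([v for _, v in a])
    return float(np.sum((ys[1:] + ys[:-1]) / 2 * np.diff(xs)))

# Latent state y†(x) = x²; quantity of interest Q(y†) = ∫₀¹ y†(t) dt = 1/3.
a = info(np.square, np.linspace(0.0, 1.0, 101))
estimate = b(a)
```

Any other quadrature rule is an equally valid choice of b; the point is only that b acts on the finite information a, never on y† itself.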
In the case of ODEs, this may be the maximum value of a norm ‖f‖ of the gradient field (see e.g. Estep, 1995). If ‖f‖ were known, it would be possible to provide a hard bound on numerical error. However, in the typical numerical context all that is known is that ‖f‖ is finite. One could attempt to approximate ‖f‖ with cubature, but that itself requires a numerical cubature method whose error is required to obey a known bound. In general, therefore, there are no hard error bounds without global information being a priori provided (Larkin, 1974). Our aim in this paper is to consider, as an alternative to traditional numerical analysis, an exact Bayesian framework for solution uncertainty quantification in the ordinary differential equation context.

The field of probabilistic numerics dates back to Larkin (1972) and a modern perspective is provided in Hennig et al. (2015); Oates and Sullivan (2019). Under the abstract framework just described, numerical methods can be interpreted as point estimators in a statistical context, where the state y† can be thought of as a latent variable in a statistical model, and the 'data' consist of information A(y†) that does not fully determine the quantity of interest Q(y†) but is indirectly related to it. Hennig et al. (2015) provide an accessible introduction and survey of the field. In particular, they illustrated how PNM can be used to quantify uncertainty due to discretisation in important scientific problems, such as astronomical imaging.

[Figure 1 diagrams: (a) a commutative triangle relating Y, A and Q through the maps A, Q and b; (b) its probabilistic counterpart relating P_Y, A and P_Q through the maps a ↦ µ^a and a ↦ B(µ, a).]

Figure 1: Diagrams for a numerical method. (a) The traditional viewpoint of a numerical method is equivalent to a map b from a finite-dimensional information space A to the space of the quantity of interest Q.
(b) The probabilistic viewpoint treats approximation of Q(y†) in a statistical context, described by a map B(µ, ·) from A to the space of probability distributions on Q. The probabilistic numerical method (A, B) is Bayesian if and only if (b) is a commutative diagram.

Let the notation Σ_Y denote a σ-algebra on the space Y and let P_Y denote the set of probability measures on (Y, Σ_Y). A probabilistic numerical method (PNM) is a procedure which takes as input a 'belief' distribution µ ∈ P_Y, representing epistemic uncertainty with respect to the true (but unknown) value y†, along with a finite amount of information, A(y†) ∈ A. The output is a distribution B(µ, A(y†)) ∈ P_Q on (Q, Σ_Q), representing epistemic uncertainty with respect to the quantity of interest Q(y†) after the information A(y†) has been processed. For example, a PNM for an ordinary differential equation (ODE) takes an initial belief distribution defined on the solution space of the differential equation, together with information arising from a finite number of evaluations of the gradient field, plus the initial condition of the ODE, to produce a distribution over either the solution space of the ODE, or perhaps some derived quantity of interest. In this paper, the measurability of A and Q will be assumed.

Despite computational advances in this emergent field, until recently there had not been an attempt to establish rigorous statistical foundations for PNM. In Cockayne et al. (2019) the authors argued that Bayesian principles can be adopted. In brief, this framework requires that the input belief distribution µ carries the semantics of a Bayesian agent's prior belief and the output of a PNM agrees with the inferences drawn when the agent is rational. To be more precise recall that, in this paper, information is provided in a deterministic manner through (1) and thus Bayesian inference corresponds to conditioning of µ on the level sets of A.
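As a toy finite-dimensional illustration of this conditioning (a hypothetical example of ours, not from the paper), take Y = R² with a bivariate Gaussian prior µ, information A(y) = y_1 and quantity of interest Q(y) = y_2. The disintegration element µ^a is then the usual Gaussian conditional distribution, and B(µ, a) = Q_#(µ^a) is available in closed form.

```python
import numpy as np

# Illustrative Gaussian prior µ on the latent state y = (y1, y2).
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])

def belief_update(a):
    """B(µ, a) = Q_#(µ^a): condition µ on the level set {y : A(y) = y1 = a},
    then push the Gaussian conditional forward through Q(y) = y2."""
    post_mean = mean[1] + cov[1, 0] / cov[0, 0] * (a - mean[0])
    post_var = cov[1, 1] - cov[1, 0] ** 2 / cov[0, 0]
    return post_mean, post_var

m, v = belief_update(a=1.0)  # belief about Q(y†) after observing A(y†) = 1.0
```

Because both the conditioning and the push-forward are exact here, the pair (A, B) satisfies Definition 1 trivially; the difficulty discussed in the sequel is that for ODEs the analogue of this conditioning step is not available in closed form.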
Let Q_# : P_Y → P_Q denote the push-forward map associated to Q, i.e. Q_#(µ)(S) = µ(Q^{-1}(S)) for all S ∈ Σ_Q. Let {µ^a}_{a ∈ A} ⊂ P_Y denote the disintegration, assumed to exist, of µ ∈ P_Y along the map A. (It is of course possible to perform Bayesian inference in the noisy-data context, but for the ODEs considered in this paper we assume that one can obtain noiseless evaluations of the gradient field. The reader unfamiliar with the concept of a disintegration can treat µ^a as a technical notion of the 'conditional distribution of y given A(y) = a' when reading this work. The disintegration theorem, Thm. 1 of Chang and Pollard (1997), guarantees existence and uniqueness up to an A_#µ-null set under the weak requirement that Y is a metric space, Σ_Y is the Borel σ-algebra, µ is Radon, Σ_A is countably generated and Σ_A contains all singletons {a} for a ∈ A.)

Definition 1. A probabilistic numerical method (A, B) with A : Y → A and B : P_Y × A → P_Q for a quantity of interest Q : Y → Q is Bayesian if and only if B(µ, a) = Q_#(µ^a) for all µ ∈ P_Y and all a ∈ A.

This definition is intuitive; the output of the PNM should coincide with the marginal distribution for Q(y†) according to the disintegration element µ^a ∈ P_Y, based on the information a ∈ A that was provided. The definition is equivalent to the statement that Figure 1b is a commutative diagram. In Cockayne et al. (2019) the map A was termed an information operator and the map B was termed a belief update operator; we adhere to these definitions in our work. The Bayesian approach to PNM confers several important benefits:

• The input µ and output B(µ, a) belief distributions can be interpreted, respectively, as a prior and (marginal) posterior. As such, they automatically inherit the stronger formal semantics and philosophical foundations that underpin the Bayesian framework and, in this sense, are well-understood (see e.g. Gelman and Shalizi, 2013).
• The definition of Bayesian PNM is operational. Thus, if we are presented with a prior µ and information a then there is a unique Bayesian PNM and it is constructively defined.

• The class of Bayesian PNM is closed under composition, such that uncertainty due to different sources of discretisation can be jointly modelled and rigorously propagated. This point will not be discussed further in this work, but we refer the interested reader to Section 5 of Cockayne et al. (2019).

Nevertheless, the strict definition of Bayesian PNM limits scope to design convenient computational algorithms and indeed several proposed PNM are not Bayesian (see Table 1 in Cockayne et al., 2019). The challenge is two-fold; for a Bayesian PNM, the elicitation of an appropriate prior distribution µ and the exact computation of its disintegration {µ^a}_{a ∈ A} must both be addressed. In the next section we argue that – perhaps as a consequence of these constraints – a strictly Bayesian PNM for the numerical solution of an ODE does not yet exist.

The first PNM (of any flavour) for the numerical solution of ODEs, of which we are aware, was due to Skilling (1992). Two decades later, this problem is receiving renewed critical attention as part of the active development of PNM. The aim of this section is to provide a high-level overview of existing work and to argue that existing methods do not adhere to the definition of Bayesian PNM. (Indeed, if the set Y_a = {y ∈ Y : A(y) = a} is not of measure zero under µ, then µ^a is the conditional distribution defined by restricting µ to the subset Y_a; µ^a(y) = 1[y ∈ Y_a] µ(y)/µ(Y_a). The theory of disintegrations generalises the conditional distribution µ^a to cases where Y_a is a null set.)

Notation: The notational convention used in this paper is that the non-italicised y denotes a generic function, whereas the italicised y denotes a scalar value taken by the function y. The notation y† is reserved for the true solution to an ODE.
Throughout, the underlying state space Y is taken to be a space occupied by the true solution of the ODE, i.e. y† ∈ Y.

The first paper on this topic, of which we are aware, was Skilling (1992). This will serve as a prototypical PNM for the numerical solution of an ODE. Originally described as 'Bayesian' by the author, we will argue that, at least in the strict sense of Definition 1, it is not a Bayesian PNM. Consider a generic univariate first-order initial value problem

dy/dx = f(x, y(x)),  x ∈ [x_0, x_T],  y(x_0) = y_0.   (2)

Throughout this paper all ODEs that we consider will be assumed to be well-defined and admit a unique solution y† ∈ Y where Y is some pre-specified set. In this paper the quantity of interest Q(y†) will either be the solution curve y† itself or the value y†(x_T) of the solution at a specific input (in this section it will be the former). The approach outlined in Skilling (1992) allows for a general prior µ ∈ P_Y. The gradient field f is treated as a 'black box' oracle that can be queried at a fixed computational cost. Thus we are provided with evaluations of the gradient field [f(x_0, y_0), ..., f(x_n, y_n)]^⊤ ∈ R^{n+1} for certain input pairs {(x_i, y_i)}_{i=0}^n. This approach of treating evaluations of the gradient field as 'data' will be seen to be a common theme in existing PNM for ODEs and theoretical support for this framework is rooted in the field of information-based complexity (Traub and Woźniakowski, 1992). Let a_i = f(x_i, y_i) and a^i = [a_0, ..., a_i]. The selection of the input pairs (x_i, y_i) on which f is evaluated is not constrained and several possibilities, of increasing complexity, were discussed in Skilling (1992).
To fix ideas, the simplest such approach is to proceed iteratively as follows:

(0.1) The first pair (x_0, y_0) is fully determined by the initial condition of the ODE.

(0.2) The oracle then provides one piece of information, a_0 = f(x_0, y_0).

(0.3) The prior µ is updated according to a_0, leading to a belief distribution µ_0 which is just the disintegration element µ^{a_0}.

(1) A discrete time step x_1 = x_0 + h, where h = (x_T − x_0)/n > 0, is performed and a particular point estimate y_1 = ∫ y(x_1) dµ_0(y) for the unknown true value y†(x_1) is obtained. This specifies the second pair (x_1, y_1).

The process continues similarly, such that at time step i − 1 we have µ_{i−1} = B(µ, a^{i−1}) ∈ P_Y, where the general belief update operator B is yet to be defined, and the following step is performed:

(i) Let x_i = x_{i−1} + h and set y_i = ∫ y(x_i) dµ_{i−1}(y).

The final output is a probability distribution µ_n = B(µ, a^n) ∈ P_Y. Now, strictly speaking, the method just described is not a PNM in the concrete sense that we have defined. Indeed, the final output µ_n is a deterministic function of the values a^n of the gradient field that were obtained. However, in the absence of additional assumptions on the global smoothness of the gradient field, the values of f(x, y) outside any open neighbourhood of the true solution curve C = {(x, y) : y = y†(x), x ∈ [x_0, x_T]} do not determine the solution of the ODE and, conversely, the solution of the ODE provides no information about the values of the gradient field outside any open neighbourhood of the true solution curve C. Thus it is not possible, in general, to write down an information operator A : Y → A that reproduces the information a^n when applied to the solution curve y†(·) of the ODE.

The approach taken in Skilling (1992) was therefore to posit an approximate information operator Â and a particular belief update operator B, which are now described. The approximate information operator is motivated by the intuition that if y†(x_i) is well-approximated by y_i at the abscissa x_i then dy†/dx(x_i) should be well-approximated by f(x_i, y_i). That is, the following approximate information operator Â was constructed:

Â(y) = [dy/dx(x_1), ..., dy/dx(x_n)]^⊤.   (3)

Of course, Â(y†) ≠ a^n in general.
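The iterative scheme above can be sketched with a Gaussian process prior (a minimal illustration of ours; Skilling's construction is more general, and the squared-exponential kernel, length-scale and step size below are arbitrary choices). Value and derivative observations are handled through the appropriate kernel derivatives.

```python
import numpy as np

ELL = 1.0  # squared-exponential length-scale (arbitrary for this sketch)

def k(x1, d1, x2, d2):
    """Prior covariance between observations of y, where d = 0 tags a value
    observation y(x) and d = 1 a derivative observation dy/dx(x)."""
    r = x1 - x2
    base = np.exp(-r ** 2 / (2 * ELL ** 2))
    if (d1, d2) == (0, 0):
        return base
    if (d1, d2) == (1, 0):
        return -r / ELL ** 2 * base
    if (d1, d2) == (0, 1):
        return r / ELL ** 2 * base
    return (1 / ELL ** 2 - r ** 2 / ELL ** 4) * base

def posterior_mean(x, obs):
    """Mean of y(x) under the GP conditioned on obs = [(x_i, d_i, value_i)]."""
    K = np.array([[k(xi, di, xj, dj) for xj, dj, _ in obs] for xi, di, _ in obs])
    kx = np.array([k(x, 0, xj, dj) for xj, dj, _ in obs])
    vals = np.array([v for _, _, v in obs])
    return float(kx @ np.linalg.solve(K + 1e-8 * np.eye(len(obs)), vals))

# Steps (0.1)-(0.3): initial condition plus one oracle call, for dy/dx = -y, y(0) = 1.
f = lambda x, y: -y
x0, y0, h, n = 0.0, 1.0, 0.1, 10
obs = [(x0, 0, y0), (x0, 1, f(x0, y0))]
y = y0
for i in range(1, n + 1):                 # step (i) of the iteration
    x = x0 + i * h
    y = posterior_mean(x, obs)            # point estimate y_i for y†(x_i)
    obs.append((x, 1, f(x, y)))           # query the oracle at the pair (x_i, y_i)
```

At x_T = 1 the point estimate y should be close to the exact value y†(1) = e^{−1}, but, as discussed next, the pairs (x_i, y_i) visited by the loop need not lie on the true solution curve.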
To acknowledge the approximation error, Skilling (1992) proposed to model the information with an approximate likelihood:

dµ_n/dµ_0 (y) = ∏_{i=1}^n dµ_i/dµ_{i−1} (y)   (4)

dµ_i/dµ_{i−1} (y) ∝ exp( −(1/(2σ²)) (dy/dx(x_i) − f(x_i, y_i))² )   (5)

This was referred to in Skilling (1992) as simply a "likelihood" and, together with µ_0 = µ^{a_0}, the output µ_n is completely specified. Here σ is a fixed positive constant, however in principle a non-diagonal covariance matrix can also be considered. The negative consequences of basing inferences on an approximate information operator Â are potentially twofold. First, recall that values of the gradient field that are not contained on the true solution curve of the ODE do not, in principle, determine the true solution curve y†. It is therefore unclear if these values should be taken into account at all. Second, in the special case where the gradient field f does not depend on its second argument, the quantities dy/dx(x_i) and f(x_i, y_i) are identical. From this perspective, µ_n represents inference under a mis-specified likelihood, since information is treated as erroneous when it is in fact exact. The use of a mis-specified likelihood violates the likelihood principle and implies violation of the Bayesian framework. This confirms, through a different argument, that the approach of Skilling (1992) cannot be Bayesian in the strict sense of Definition 1.

1.2.2 Schober et al. (2014, 2019); Teymur et al. (2016, 2018)

After Skilling (1992), several authors have proposed improvements to the above method. The approach of Schober et al. (2014) considered Eq. (5) in the σ ↓ 0 limit, in which µ was restricted to be a k-times integrated Wiener measure on the solution space of the ODE.
The tractability of the integrated Wiener measure leads to a closed-form characterisation of the posterior and enables computation to be cast as a Kalman filter.

This direction of research can be motivated by the following fact: For k ∈ {1, 2} the authors prove that if the input pair (x_1, y_1) is taken as y_1 = ∫ y(x_1) dµ_0(y), as indicated in Section 1.2.1, then the smoothing estimate ŷ_1 = ∫ y(x_1) dµ_1(y), i.e. the posterior mean for y(x_1) based on information a^1, coincides with the deterministic approximation to y†(x_1) that would be provided by a k-th order Runge-Kutta method. As such, theoretical guarantees such as local convergence order are inherited. For k = 3 it was shown that the same conclusion can be made to hold, provided that the input pair (x_1, y_1) is selected in a manner that is no longer obviously related to µ_0. In all cases the identification with a classical Runge-Kutta method does not extend beyond iteration n = 1. Similar connections to multistep methods of Nordsieck and Adams form were identified, respectively, in Schober et al. (2019) and Teymur et al. (2016, 2018). The approach of Schober et al. (2014) is not Bayesian in the sense of Definition 1, which can again be deduced from dependence on values of the gradient field away from the true solution curve, so that the likelihood principle is violated.

The subsequent work of Kersting and Hennig (2016) attempted to elicit an appropriate non-zero covariance matrix for use in Eq. (5), in order to encourage uncertainty estimates to be better calibrated. Their proposal consisted of the approximate likelihood

dµ_i/dµ_{i−1} (y) ∝ exp( −(1/(2σ_i²)) (dy/dx(x_i) − m_i)² )   (6)

m_i = ∫ f(x_i, y(x_i)) dµ_{i−1}(y)   (7)

σ_i² = ∫ (f(x_i, y(x_i)) − m_i)² dµ_{i−1}(y).   (8)

This can be viewed as the predictive marginal likelihood for the value f(x_i, y(x_i)) based on µ_{i−1}. From a practical perspective, the approach is somewhat circular as the integrals in Eq.
(7) and (8) involve the black-box gradient field f and therefore cannot be computed. The authors suggested a number of ways that these quantities could be numerically approximated, which involve evaluating f(x_i, y_i) at one or more values y_i that must be specified. One such method is
Bayesian quadrature, another PNM wherein the integrand f is modelled as uncertain until it is evaluated. This raises separate philosophical challenges, as one must then ensure that the statistical models used for y(·) and f(x_i, ·) are logically consistent; in Kersting and Hennig (2016) these functions were simply modelled as independent.

The original work of Chkrebtii et al. (2016) is somewhat related to Kersting and Hennig (2016), however instead of using the mean of the current posterior as input to the gradient field, the input pair (x_i, y_i) was selected by sampling y_i from the marginal distribution for y(x_i) implied by µ_{i−1}. The approximate likelihood in this approach was taken as follows:

dµ_i/dµ_{i−1} (y) ∝ exp( −(1/(2σ_i²)) (dy/dx(x_i) − f(x_i, y_i))² )

m_i = ∫ dy/dx(x_i) dµ_{i−1}(y)

σ_i² = ∫ (dy/dx(x_i) − m_i)² dµ_{i−1}(y).

Compared to Eq. (6), (7) and (8), this approach does not rely on integrals over the unknown gradient field. However, the approach also relies on the approximate information operator in Eq. (3) and is thus not Bayesian according to Definition 1.
The approaches proposed in Conrad et al. (2017); Abdulle and Garegnani (2018) are not motivated in the Bayesian framework, but instead seek to introduce a stochastic perturbation into a classical numerical method. Both methods focus on the quantity of interest Q(y†) = y†(x_T). In the simple context of Eq. (2), the method of Conrad et al. (2017) augments the explicit Euler method with a stochastic perturbation:

y_i = y_{i−1} + h f(x_{i−1}, y_{i−1}) + h ε_i,  x_i = x_{i−1} + h,  i = 1, ..., n.

The distribution of the sequence (ε_i)_{i=1}^n must be specified. In the simplest case where the ε_i are modelled as independent, say with ε_i ∼ ρ, the canonical flow map Φ_i : R → R of the explicit Euler method, defined as Φ_i(z) = z + h f(x_i, z), is replaced by the probabilistic counterpart Ψ_i : P_R → P_R given by

Ψ_i(ν)(dz) = ∫ ρ( (dz − Φ_i(z̃))/h ) ν(dz̃)

through which stochasticity can be propagated. The output of the method of Conrad et al. (2017) is then B = Ψ_n ∘ ··· ∘ Ψ_1 δ(y_0), where δ(z) denotes an atomic distribution on z ∈ R. If ρ has zero mean, it can be shown that the mean of B equals Φ_n ∘ ··· ∘ Φ_1(y_0), which is exactly the deterministic approximation produced with the explicit Euler method.

This framework can be practically problematic, since ε_i is charged with modelling the extrapolation error and such errors are not easily modelled as independent random variables – Section 2.8 of Higham (2002) is devoted to this point. Thus, if for example f(x, y) = y, the true linearisation error at step i is e^{x_i} − e^{x_{i−1}}, so that the 'true' sequence (ε_i)_{i=1}^n in this case is monotonic and exponentially unbounded.
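A minimal simulation of this perturbed Euler scheme (our own sketch; Conrad et al. (2017) couple the scale of ε_i to the step size and order of the method, whereas here the scale is simply set to h):

```python
import numpy as np

def perturbed_euler(f, x0, y0, h, n, scale, rng):
    """Explicit Euler with an additive stochastic perturbation:
    y_i = y_{i-1} + h f(x_{i-1}, y_{i-1}) + h eps_i, with eps_i iid N(0, scale^2)."""
    y = y0
    for i in range(n):
        y = y + h * f(x0 + i * h, y) + h * rng.normal(0.0, scale)
    return y

# dy/dx = y, y(0) = 1 on [0, 1], so Q(y†) = y†(1) = e.
rng = np.random.default_rng(0)
h, n = 0.01, 100
samples = [perturbed_euler(lambda x, y: y, 0.0, 1.0, h, n, scale=h, rng=rng)
           for _ in range(500)]
# With zero-mean eps_i, the Monte Carlo mean recovers the deterministic
# explicit Euler value Phi_n ∘ ... ∘ Phi_1(y_0) = (1 + h)^n.
euler_value = (1.0 + h) ** n
```

The spread of `samples` is the method's uncertainty quantification; the criticism above is that the independent, identically distributed ε_i cannot reproduce the monotone, exponentially growing error sequence of this example.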
The challenge of designing a stochastic model for the sequence (ε_i)_{i=1}^n that reflects the highly structured nature of the error remains unresolved. On the other hand, the mathematical properties of this method are now well-understood (Lie et al., 2018, 2019). The proposal of Abdulle and Garegnani (2018) was to instead consider randomisation of the inputs {x_i}_{i=0}^T in the context of a classical numerical method, also outside of the Bayesian framework.

The survey just presented begs the question of whether a Bayesian PNM for ODEs can exist at all. A first step toward this goal was taken in Cockayne et al. (2019), where it was argued that an information operator can be constructed if the vector field f is brought to the left-hand-side in Eq. (2). Specifically, they proposed the information operator

Ã(y) = [dy/dx(x_0) − f(x_0, y(x_0)), ..., dy/dx(x_n) − f(x_n, y(x_n))]^⊤

for which the 'data' are trivial; ã^n = 0. It was rigorously established that the approximate likelihood

dµ_{i,σ}/dµ_{i−1,σ} (y) = exp( −(1/(2σ²)) (dy/dx(x_i) − f(x_i, y(x_i)))² )

leads to an exact Bayesian PNM in the limit: µ_{n,σ} →_F µ^{ã^n} as σ ↓ 0, for A_#µ-almost all ã^n ∈ R^{n+1}. Here →_F denotes convergence in an integral probability metric defined by a suitable set F of test functions (see Sec. 4 of Cockayne et al., 2019). However, the dependence of the information operator Ã on f means that this cannot be used as the basis for a practical method. Indeed, unless f depends linearly on its second argument and conjugacy properties of the prior can be exploited, the posterior cannot easily be characterised. Approximate techniques from nonlinear filtering were proposed to address this challenge in Tronarp et al. (2019).

The comprehensive literature review in the previous section reveals not only that no Bayesian PNM has yet been proposed, but also that such an endeavour may be fundamentally difficult.
Indeed, a theme that has emerged with existing PNM for ODEs, which can be traced back to Skilling (1992), is the use of approximate and subjective forms for the likelihood. The complex, implicit relationship between the latent ODE solution y† and the data f(x_i, y_i) arising from the gradient field appears to preclude use of an exact likelihood. Of course, violation of the likelihood principle is not traditionally a concern in the design of a numerical method, yet if the strictly richer output that comes with a Bayesian PNM is desired, then clearly adherence to the likelihood principle is important. It is therefore natural to ask the question, "under what conditions can exact Bayesian inference for ODEs be made?".

This paper presents a proof-of-concept PNM for the numerical solution of a (limited) class of ODEs that is both (a) Bayesian in the sense of Definition 1 and (b) can in principle be implemented. The method being proposed is indicated in Figure 2 and its main properties are as follows:

• The classical theory of Lie groups is exploited, for the first time in the context of PNM, to understand when an ODE of the form in Eq. (2) can be transformed into an ODE whose gradient field is a function of the independent state variable only, reducing the ODE to an integral.

• For ODEs that admit a solvable Lie algebra, our proposal can be shown to simultaneously perform exact Bayesian inference on both the original and the Lie-transformed ODE. Crucially, as we explain later, to identify a Lie algebra only high-level a priori information about the ODE is required. The case of first- and second-order ODEs is presented in detail, but the method itself is general.

• In general the specification of prior belief can be difficult. The prior distributions that we construct are guaranteed to respect aspects of the structure of the ODE. As such, our priors are, to some extent, automatically adapted to the ODE at hand as opposed to being arbitrarily posited.
• In addition to the benefits conferred in the Bayesian framework, detailed in Section 1.1 and in Cockayne et al. (2019), the method being proposed can be computationally realised. On the other hand, there is a cost in terms of the run-time of the method that is substantially larger than existing, non-Bayesian approaches (especially classical numerical methods). As such, we consider this work to be a proof-of-concept rather than an applicable Bayesian PNM.

The remainder of the paper is structured as follows: Section 2 is dedicated to a succinct review of Lie group methods for ODEs. In Section 3, Lie group methods are exploited to construct priors over the solution space of the ODE whenever a solvable Lie algebra is admitted and exact Bayesian inference is performed on a transformed version of the original ODE which takes the form of an integral. Numerical experiments are reported in Section 4. Finally, some conclusions and recommendations for future research are drawn in Section 5.

[Figure 2 schematic: the ODE dy/dx = f(x, y(x)) is mapped by the Lie transform (x, y) ↦ (r, s) to ds/dr = g(r), to which an exact Bayesian PNM is applied; the result is mapped back by the inverse transform (r, s) ↦ (x, y).]

Figure 2: Schematic of our proposed approach. An n-th order ODE that admits a solvable Lie algebra can be transformed into n integrals, to which exact Bayesian probabilistic numerical methods can be applied. The posterior measure on the transformed space is then pushed back through the inverse transformation onto the original domain of interest.

This section provides a succinct overview of classical Lie group methods, introduced in the 19th century by Sophus Lie in the differential equation context (Hawkins, 2012). Lie developed the fundamental notion of a
Lie group of transformations, which roughly corresponds to maps that take one solution of the ODE to another. This provided a formal generalisation of certain algebraic techniques, such as dimensional analysis and transformations based on spatial symmetries, that can sometimes be employed to algebraically reduce the order of an ODE.

This section proceeds as follows: First, in Section 2.1 we introduce a one-parameter Lie group of transformations and then, in Section 2.2, we explain what it means for a curve or a surface to be transformation-invariant. In Section 2.3 we recall consequences of Lie's theory in the ODE context. Last, in Section 2.4 the generalisation to multi-parameter Lie groups is indicated. Our development is heavily influenced by Bluman and Anco (2002) and we refer the reader to their book when required.
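Before the formal development, the transform–solve–invert pipeline of Figure 2 can be illustrated on a toy problem (our own example, not one from the paper): the ODE dy/dx = y/x + 1 is invariant under the scaling (x, y) ↦ (λx, λy), and the change of variables (r, s) = (x, y/x) reduces it to ds/dr = 1/r, a gradient field depending on the independent variable alone, i.e. an integral.

```python
import numpy as np

# Original ODE: dy/dx = y/x + 1 on [1, 2] with y(1) = 1; exact y(x) = x (1 + ln x).
g = lambda r: 1.0 / r  # transformed gradient field: ds/dr = g(r)

# Solve the transformed problem s(r) = s(1) + ∫_1^r g(t) dt by quadrature
# (plain trapezoidal rule here, standing in for the Bayesian quadrature step).
rs = np.linspace(1.0, 2.0, 201)
gs = g(rs)
s = 1.0 + np.concatenate(([0.0],
                          np.cumsum((gs[1:] + gs[:-1]) / 2 * np.diff(rs))))

# Invert the transform: (r, s) -> (x, y) = (r, r s).
xs, ys = rs, rs * s
```

In the method proposed later, the quadrature step is replaced by exact Bayesian inference, and the resulting posterior measure (rather than a single curve) is pushed back through the inverse transform.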
The purpose of this section is to recall essential definitions, together with the first fundamental theorem of Lie, which relates a Lie group of transformations to its infinitesimal generator. In what follows we consider a fixed domain D ⊂ R^d and denote a generic state variable as x = (x_1, ..., x_d) ∈ D.

Definition 2 (One-Parameter Group of Transformations). A one-parameter group of transformations on D is a map X : D × S → D, defined on D × S for some set S ⊂ R, together with a bivariate map φ : S × S → S, such that the following hold:
(1) For each ε ∈ S, the transformation X(·, ε) is a bijection on D.

(2) (S, φ) forms a group with law of composition φ.

(3) If ε_0 is the identity element in (S, φ), then X(·, ε_0) is the identity map on D.

(4) For all x ∈ D and ε, δ ∈ S, if x* = X(x, ε) and x** = X(x*, δ), then x** = X(x, φ(ε, δ)).

In what follows we continue to use the shorthand notation x* = X(x, ε). The notion of a Lie group additionally includes smoothness assumptions on the maps that constitute a group of transformations. Recall that a real-valued function is analytic if it can be locally expressed as a convergent power series.
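These axioms can be checked numerically for a concrete case; the sketch below (ours, not from the paper) uses rotations of the plane, for which the composition law is φ(ε, δ) = ε + δ.

```python
import numpy as np

def X(x, eps):
    """One-parameter transformation: rotate the point x ∈ R² by angle eps."""
    c, s = np.cos(eps), np.sin(eps)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

phi = lambda eps, delta: eps + delta  # group composition law on S = R

x = np.array([0.3, -1.2])
eps, delta = 0.7, -0.4
identity_ok = bool(np.allclose(X(x, 0.0), x))              # axiom (3): eps = 0 acts as identity
composition_ok = bool(np.allclose(X(X(x, eps), delta),     # axiom (4): composing actions
                                  X(x, phi(eps, delta))))  # matches the group law
```

Each X(·, ε) is also a bijection (rotate back by −ε), so axioms (1)–(4) all hold for this family.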
Definition 3 (One-Parameter Lie Group of Transformations). Let X, together with φ, form a one-parameter group of transformations on D. Then we say that X, together with φ, form a one-parameter Lie group of transformations on D if, in addition, the following hold:

(5) S is a (possibly unbounded) interval in R.

(6) For each ε ∈ S, X(·, ε) is infinitely differentiable in D.

(7) For each x ∈ D, X(x, ·) is an analytic function on S.

(8) φ is analytic in S × S.

Without loss of generality it will be assumed, through re-parametrisation if required, that S contains the origin and ε = 0 is the identity element in (S, φ). The definition is illustrated through three examples:
Example 1 (Translation in the x_1-Axis). The one-parameter transformation x_1* = x_1 + ε, x_2* = x_2, for ε ∈ R, forms a Lie group of transformations on D = R² with group composition law φ(ε, δ) = ε + δ.

Example 2 (Rotation Group). The one-parameter transformation x_1* = x_1 cos(ε) − x_2 sin(ε), x_2* = x_1 sin(ε) + x_2 cos(ε), for ε ∈ R, again forms a Lie group of transformations on D = R² with group composition law φ(ε, δ) = ε + δ.

Example 3 (Cyclic Group C_p). Let D = {0, 1, 2, ..., p − 1}. Let S = Z. For n ∈ D and m ∈ S, let X(n, m) = n + m (mod p). Then X, together with φ(a, b) = a + b, defines a one-parameter group of transformations on D, but is not a Lie group of transformations since (5) is violated.

The first fundamental theorem of Lie establishes that a Lie group of transformations can be characterised by its infinitesimal generator, defined next:
Definition 4 (Infinitesimal Transformation). Let X be a one-parameter Lie group of transformations. Then the transformation

x* = x + ε ξ(x),  ξ(x) = ∂X(x, ε)/∂ε |_{ε=0},

is called the infinitesimal transformation associated to X and the map ξ is called an infinitesimal.

Definition 5 (Infinitesimal Generator). The infinitesimal generator of a one-parameter Lie group of transformations X is defined to be the operator X = ξ · ∇, where ξ is the infinitesimal associated to X and ∇ = (∂/∂x_1, ∂/∂x_2, ..., ∂/∂x_d) is the gradient.

Example 4 (Ex. 1, continued). For Ex. 1, we have

ξ(x) = ( dx_1*/dε |_{ε=0}, dx_2*/dε |_{ε=0} ) = (1, 0),

so the infinitesimal generator for translation in the x_1-axis is X = ∂/∂x_1.

Example 5 (Ex. 2, continued). Similarly, the infinitesimal generator for the rotation group is
X = − x ∂∂x + x ∂∂x . The first fundamental theorem of Lie provides a constructive route to obtain the infinitesimalgenerator from the transformation itself:
Theorem 1 (First Fundamental Theorem of Lie; see pages 39-40 of Bluman and Anco (2002)). A one-parameter Lie group of transformations X is characterised by the initial value problem
dx*/dτ = ξ(x*),   x* = x when τ = 0,   (9)
where τ(ε) is a re-parametrisation of ε which satisfies τ(0) = 0 and, for ε ≠ 0,
τ(ε) = ∫_0^ε ∂φ(a, b)/∂b |_{(a,b) = (δ^{-1}, δ)} dδ.
Here δ^{-1} denotes the group inverse element for δ.
Since Eq. (9) is translation-invariant in τ, it follows that without loss of generality we can assume a parametrisation τ(ε) such that the group action becomes φ(τ_1, τ_2) = τ_1 + τ_2 and, in particular, τ^{-1} = −τ. In the remainder of the paper, for convenience, we assume that all Lie groups are parametrised such that the group action is φ(ε_1, ε_2) = ε_1 + ε_2.
The next result can be viewed as a converse to Theorem 1, as it shows how to obtain the transformation from the infinitesimal generator. All proofs are reserved for Supplemental Section A.1.
Theorem 2.
A one-parameter Lie group of transformations with infinitesimal generator X is equivalent to x* = e^{εX} x, where e^{εX} x = Σ_{k=0}^∞ (ε^k / k!) X^k x.
The following is immediate from the proof of Theorem 2:
Corollary 1. If F is infinitely differentiable, then F(x*) = e^{εX} F(x).
2.2 Invariance Under Transformation
In this section we explain what it means for a curve or a surface to be invariant under a Lie group of transformations, and how this notion relates to the infinitesimal generator.
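As a concrete check of Theorem 2 and Corollary 1, the following sketch (our own illustration; the helper names are not from the paper) uses sympy to sum the Lie series e^{εX} x term by term for the rotation group of Ex. 2 and compares the truncation against the Taylor expansion of the exact rotated coordinate.

```python
# Sketch: verify Theorem 2 / Corollary 1 for the rotation group of Ex. 2 by
# summing the Lie series e^{eps X} F = sum_k eps^k/k! X^k F and comparing
# with the Taylor expansion of the exactly transformed quantity.
import sympy as sp

x1, x2, eps = sp.symbols("x1 x2 epsilon")

def X(F):
    """Infinitesimal generator of the rotation group: X = -x2 d/dx1 + x1 d/dx2."""
    return -x2 * sp.diff(F, x1) + x1 * sp.diff(F, x2)

def lie_series(F, order):
    """Truncation of e^{eps X} F at the given order in eps."""
    term, total = F, F
    for k in range(1, order + 1):
        term = X(term)                       # X^k F, built recursively
        total += eps**k / sp.factorial(k) * term
    return sp.expand(total)

N = 8
# Exact rotated coordinate x1* = x1 cos(eps) - x2 sin(eps), expanded to order N.
exact = sp.series(x1 * sp.cos(eps) - x2 * sp.sin(eps), eps, 0, N + 1).removeO()
assert sp.simplify(lie_series(x1, N) - sp.expand(exact)) == 0
```

Applying the same series to the invariant F(x) = x_1² + x_2² (for which X F = 0) returns F unchanged, which is Corollary 1 in its simplest form.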
Definition 6 (Invariant Function). A function F : D → R is said to be invariant under a one-parameter Lie group of transformations x* = X(x, ε) if F(x*) = F(x) for all x ∈ D and ε ∈ S.
Based on the results in Section 2.1, one might expect that invariance under a transformation can be expressed in terms of the infinitesimal generator of the transformation. This is indeed the case:
Theorem 3.
A differentiable function F : D → R is invariant under a one-parameter Lie group of transformations with infinitesimal generator X if and only if X F(x) = 0 for all x ∈ D.
Theorem 4. For a function F : D → R and a one-parameter Lie group of transformations x* = X(x, ε), the relation F(x*) = F(x) + ε holds for all x ∈ D and ε ∈ S if and only if X F(x) = 1 for all x ∈ D.
The following definition is fundamental to the method proposed in Section 3:
Definition 7 (Canonical Coordinates). Consider a coordinate system r = (r_1(x), ..., r_n(x)) on D. Then any one-parameter Lie group of transformations x* = X(x, ε) induces a transformation of the coordinates, r_i* = r_i(x*). The coordinate system r is called canonical for the transformation if r_1* = r_1, ..., r_{n−1}* = r_{n−1} and r_n* = r_n + ε.
Example 6 (Ex. 2, continued). For the rotation group in Ex. 2, we have canonical coordinates r_1(x_1, x_2) = (x_1² + x_2²)^{1/2}, r_2(x_1, x_2) = arctan(x_2 / x_1).
In canonical coordinates, a one-parameter Lie group of transformations can be viewed as a straightforward translation in the r_n-axis. The existence of canonical coordinates is established in Thm. 2.3.5-2 of Bluman and Anco (2002). Note that Thms. 3 and 4 imply that X r_i = 0 for i = 1, 2, ..., n − 1, and X r_n = 1.
Definition 8 (Invariant Surface). For a function F : D → R, a surface defined by F(x) = 0 is said to be invariant under a one-parameter Lie group of transformations x* = X(x, ε) if and only if F(x*) = 0 whenever F(x) = 0, for all x ∈ D and ε ∈ S.
The invariance of a surface, as for a function, can be cast in terms of an infinitesimal generator:
Corollary 2.
A surface F(x) = 0 is invariant under a one-parameter Lie group of transformations with infinitesimal generator X if and only if X F(x) = 0 whenever F(x) = 0.
2.3 Symmetry Methods for ODEs
The aim of this section is to relate Lie transformations to the ODEs for which these transformations are admitted. These techniques form the basis for our proposed method in Section 3. For an ODE of the form in Eq. (2), one can consider the action of a transformation on the coordinates (x, y); i.e. a special case of the above framework in which the generic coordinates x_1 and x_2 are respectively the independent (x) and dependent (y) variables of the ODE. It is clear that such a transformation also implies some kind of transformation of the derivatives y_m := d^m y / dx^m. Indeed, consider a one-parameter Lie group of transformations (x*, y*) = (X(x, y; ε), Y(x, y; ε)). Then it follows from the chain rule that y_m* := d^m y* / d(x*)^m is a function of x, y, y_1, ..., y_m, and we denote y_m* = Y_m(x, y, y_1, ..., y_m; ε). As an explicit example:
y_1* = dy*/dx* = [ ∂Y(x, y; ε)/∂x + y_1 ∂Y(x, y; ε)/∂y ] / [ ∂X(x, y; ε)/∂x + y_1 ∂X(x, y; ε)/∂y ] =: Y_1(x, y, y_1; ε).
In general:
y_m* = [ ∂y*_{m−1}/∂x + y_1 ∂y*_{m−1}/∂y + y_2 ∂y*_{m−1}/∂y_1 + ... + y_m ∂y*_{m−1}/∂y_{m−1} ] / [ ∂X(x, y; ε)/∂x + y_1 ∂X(x, y; ε)/∂y ] =: Y_m(x, y, y_1, ..., y_m; ε).
In this sense a transformation defined on (x, y) can be naturally extended to a transformation on (x, y, y_1, y_2, ...), as required.
Definition 9 (Admitted Transformation). An m-th order ODE F(x, y, y_1, ..., y_m) = 0 is said to admit a one-parameter Lie group of transformations (x*, y*) = (X(x, y; ε), Y(x, y; ε)) if the surface F defined by the ODE is invariant under the Lie group of transformations, i.e. if F(x*, y*, y_1*, ..., y_m*) = 0 whenever F(x, y, y_1, ..., y_m) = 0.
Example 7.
Clearly any ODE of the form dy/dx = F(x) admits the transformation (x*, y*) = (x, y + ε).
Our next task is to understand how the infinitesimal generator of a transformation can be extended to act on the derivatives y_m.
Definition 10 (Extended Infinitesimal Transformation). The m-th extended infinitesimals of a one-parameter Lie group of transformations (x*, y*) = (X(x, y; ε), Y(x, y; ε)) are defined as the functions ξ, η, η^(1), ..., η^(m) for which the following equations hold:
x* = X(x, y; ε) = x + ε ξ(x, y) + O(ε²)
y* = Y(x, y; ε) = y + ε η(x, y) + O(ε²)
y_1* = Y_1(x, y, y_1; ε) = y_1 + ε η^(1)(x, y, y_1) + O(ε²)
...
y_m* = Y_m(x, y, y_1, ..., y_m; ε) = y_m + ε η^(m)(x, y, y_1, y_2, ..., y_m) + O(ε²).
It can be shown straightforwardly, via induction, that
η^(m)(x, y, y_1, y_2, ..., y_m) = d^m η/dx^m − Σ_{k=1}^{m} [ m! / ((m − k)! k!) ] y_{m−k+1} d^k ξ/dx^k,   (10)
where d/dx denotes the full derivative with respect to x, i.e. d/dx = ∂/∂x + y_1 ∂/∂y + Σ_{k=2}^{m+1} y_k ∂/∂y_{k−1}. It follows that η^(m) is a polynomial in y_1, y_2, ..., y_m whose coefficients are linear combinations of ξ, η and their partial derivatives up to m-th order.
Definition 11 (Extended Infinitesimal Generator). The m-th extended infinitesimal generator is defined as
X^(m) = ξ_m(x, y, y_1, ..., y_m) · ∇_m = ξ(x, y) ∂/∂x + η(x, y) ∂/∂y + η^(1)(x, y, y_1) ∂/∂y_1 + ... + η^(m)(x, y, y_1, ..., y_m) ∂/∂y_m,
where ∇_m = (∂/∂x, ∂/∂y, ∂/∂y_1, ..., ∂/∂y_m) is the extended gradient.
The following corollaries are central to the actual computation of the admitted Lie groups of an ODE.
Corollary 3.
A differentiable function F : D_m → R, where D_m is the phase space containing elements of the form (x, y, y_1, ..., y_m), is invariant under a one-parameter Lie group of transformations with extended infinitesimal generator X^(m) if and only if X^(m) F(x, y, y_1, ..., y_m) = 0 for all (x, y, y_1, ..., y_m) ∈ D_m.
Corollary 4 (Infinitesimal Criterion for Symmetries Admitted by an ODE). A one-parameter Lie group of transformations is admitted by the m-th order ODE F(x, y, y_1, ..., y_m) = 0 if and only if its extended infinitesimal generator X^(m) satisfies X^(m) F(x, y, y_1, ..., y_m) = 0 whenever F(x, y, y_1, ..., y_m) = 0.
To leverage the full power of Lie symmetry methods for ODEs of order m ≥ 2, we need to consider multiple Lie symmetries, which are collectively described by a Lie algebra. Fortunately, the notion of a multi-parameter Lie group of transformations is a natural generalisation of the one-parameter case. Thus, this last section of background material concerns the generalisation of the definitions in Section 2.1 to the case of a multi-parameter Lie group. The associated Lie algebra will also be defined.
Definition 12 (Multi-Parameter Lie Group of Transformations). The set of transformations x* = X(x, ε), where x_i* = X_i(x, ε) and ε = (ε_1, ε_2, ..., ε_r) ∈ S ⊂ R^r, is called an r-parameter Lie group of transformations if it satisfies the same axioms as in the one-parameter case, but with law of composition φ(ε, δ) = (φ_1(ε, δ), ..., φ_r(ε, δ)) and (without loss of generality) ε = (0, 0, ..., 0) as the group identity element.
Definition 13 (Infinitesimal Matrix). The appropriate generalisation of the infinitesimal transformation is the infinitesimal matrix Ξ = [ξ_ij], whose entries are defined as ξ_ij(x) = ∂X_j(x, ε)/∂ε_i |_{ε=0}.
Definition 14 (Infinitesimal Generator). An r-parameter Lie group of transformations is associated with r infinitesimal generators, X_i, defined as X_i = X_i(x) = Σ_{j=1}^{n} ξ_ij(x) ∂/∂x_j.
The first fundamental theorem of Lie can be generalised to the multi-parameter case. In particular, it can be shown that an r-parameter Lie group of transformations is characterised by the set of its r infinitesimal generators. The generalisation is straightforward and so, for brevity, we refer the reader to pages 39-40 of Bluman and Anco (2002).
Next we explain how the collection of infinitesimal generators forms a Lie algebra. This relies on the basic facts that the set 𝒟 of differential operators on D is a vector space over R (i.e. λX + μY ∈ 𝒟 for all X, Y ∈ 𝒟 and λ, μ ∈ R) and that differential operators can be composed (i.e. XY ∈ 𝒟 for all X, Y ∈ 𝒟).
Definition 15 (Commutator). The commutator of two infinitesimal generators X_i and X_j is defined as [X_i, X_j] = X_i X_j − X_j X_i.
Theorem 5 (Second Fundamental Theorem of Lie; see page 78 of Bluman and Anco (2002)). Consider an r-parameter Lie group of transformations and let L denote the linear span of the infinitesimal generators X_1, ..., X_r in 𝒟. Let [·, ·] : L × L → 𝒟 denote the unique bilinear operator that agrees with Def. 15 on the set of infinitesimal generators, i.e.
[ Σ_{i=1}^{r} λ_i X_i , Σ_{j=1}^{r} μ_j X_j ] = Σ_{i=1}^{r} Σ_{j=1}^{r} λ_i μ_j (X_i X_j − X_j X_i).   (11)
Then [·, ·] maps into L; i.e. the right-hand side of Eq. (11) belongs to L for all λ, μ ∈ R^r.
Example 8.
Consider the two-parameter group of transformations on D = R² given by (x*, y*) = (x + xε + x²δ, y + yε + y²δ). The infinitesimal generators corresponding to δ and ε, respectively, are
X_1 = x² ∂/∂x + y² ∂/∂y,   X_2 = x ∂/∂x + y ∂/∂y.
It can be directly verified that [X_1, X_2] = −X_1.
The space L, defined in Thm. 5, satisfies the properties of an r-dimensional Lie algebra, defined next:
Definition 16 (Lie Algebra). An r-dimensional vector space L over R, together with a bilinear operator [·, ·] : L × L → L, is called an r-dimensional Lie algebra if the following hold:
(1) Alternativity: [X, X] = 0 for all X ∈ L.
(2) Jacobi identity: [X, [Y, Z]] + [Y, [Z, X]] + [Z, [X, Y]] = 0 for all X, Y, Z ∈ L.
In general, for the methods presented in Section 3 to be applied, existence of an n-parameter Lie group of transformations is not in itself sufficient; we require the existence of an n-dimensional solvable Lie sub-algebra, defined next:
Definition 17 (Normal Lie Sub-algebra). Consider a Lie sub-algebra J of a Lie algebra L with bilinear operator [·, ·]; i.e. a subset J ⊂ L which, when equipped with the restriction of [·, ·] to J × J, is itself a Lie algebra so that, in particular, [X, Y] ∈ J for all X, Y ∈ J. Then J is said to be normal if, in addition, [X, Y] ∈ J for all X ∈ J, Y ∈ L.
Definition 18 (Solvable Lie Algebra). An r-dimensional Lie algebra L is called solvable if there exists a chain of sub-algebras L_1 ⊂ L_2 ⊂ ... ⊂ L_{r−1} ⊂ L_r =: L such that L_{i−1} is a normal sub-algebra of L_i for i = 2, 3, ..., r.
For low-order ODEs, the existence requirement for an admitted Lie group of transformations is more restrictive than the requirement that the associated Lie algebra is solvable. Indeed, we have the following result:
Theorem 6.
All two-dimensional Lie algebras are solvable.
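Theorem 6 can be illustrated on a concrete two-dimensional algebra. The sketch below is our own illustration, taking generators of the form appearing in Ex. 8, X_1 = x² ∂/∂x + y² ∂/∂y and X_2 = x ∂/∂x + y ∂/∂y, and verifying symbolically that [X_1, X_2] = −X_1; since the commutator lies in span{X_1}, the chain span{X_1} ⊂ span{X_1, X_2} exhibits solvability.

```python
# Sketch: verify [X1, X2] = -X1 for a two-dimensional Lie algebra of the kind
# in Ex. 8, by applying the operators to a generic smooth function f(x, y).
import sympy as sp

x, y = sp.symbols("x y")
f = sp.Function("f")(x, y)

def X1(F):
    """Generator X1 = x^2 d/dx + y^2 d/dy."""
    return x**2 * sp.diff(F, x) + y**2 * sp.diff(F, y)

def X2(F):
    """Generator X2 = x d/dx + y d/dy (the scaling generator)."""
    return x * sp.diff(F, x) + y * sp.diff(F, y)

# Commutator [X1, X2] f = X1(X2 f) - X2(X1 f); it should equal -X1 f.
comm = sp.simplify(X1(X2(f)) - X2(X1(f)))
assert sp.simplify(comm + X1(f)) == 0
```

Because f is a generic undefined function, the identity holds as an operator identity, not merely for one test function.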
This completes our review of background material. The exact Bayesian PNM developed in Section 3 for an n-th order ODE requires the existence of an admitted n-parameter Lie group of transformations with a solvable Lie algebra. In practice we therefore require some high-level information on the gradient field f, in order to establish which transformations of the ODE may be admitted. In addition, the requirement of a solvable Lie algebra limits the class of ODEs for which our exact Bayesian methods can be employed. Nevertheless, this class of ODEs is sufficiently broad to have merited extensive theoretical research (Bluman and Anco, 2002) and the development of software (Baumann, 2013).
In this section our novel Bayesian PNM is presented. The method relies on high-level information about the gradient field f and, in Section 3.1, we discuss how such information can be exploited to identify any Lie transformations that are admitted by the ODE. In the case of a first order ODE, any non-trivial transformation is sufficient for our method, and an explicit information operator is provided for this case, together with recommendations for prior construction, in Section 3.2. Together, the prior and the information operator uniquely determine a Bayesian PNM, as explained in Section 1.1. In the general case of an m-th order ODE, we require that an m-dimensional solvable Lie algebra is admitted by the ODE. The special case m = 2 is treated in detail, with an explicit information operator and guidance for prior construction provided in Section 3.3. In Supplemental Section A.2 the selection of input pairs (x_i, y_i) to the gradient field is discussed.
For the methods proposed in this paper, transformations admitted by the ODE, together with their infinitesimal generators, must first be obtained. The algorithm for obtaining infinitesimal generators follows as a consequence of Cor. 4. Indeed, suppose we have an m-th order ODE of the form y_m − f(x, y, y_1, ..., y_{m−1}) = 0.
Then, by Cor. 4, a transformation with infinitesimal generator X is admitted by the ODE if and only if
X^(m) ( y_m − f(x, y, y_1, ..., y_{m−1}) ) = 0 whenever y_m = f(x, y, y_1, ..., y_{m−1}).   (12)
In infinitesimal notation, Eq. (12) is equivalent to
η^(m)(x, y, y_1, ..., y_{m−1}, y_m) = ξ ∂f/∂x + η ∂f/∂y + Σ_{k=1}^{m−1} η^(k) ∂f/∂y_k.   (13)
The direct solution of Eq. (13) recovers any transformations that are admitted.
In the common scenario where f(x, y, y_1, ..., y_{m−1}) is a polynomial in y_1, y_2, ..., y_{m−1}, the algorithm just described, for identification of admitted transformations, can be fully automated (c.f. Baumann, 2013). Indeed, from Def. 10 it follows that the extended infinitesimals η^(k), for k ∈ {1, 2, ..., m}, are polynomial in y_1, y_2, ..., y_k. Thus, after substituting y_m = f(x, y, y_1, ..., y_{m−1}), Eq. (12) too must be a polynomial in y_1, y_2, ..., y_{m−1}. Moreover, the coefficients of this polynomial must vanish, because (12) holds for arbitrary values of x, y, y_1, ..., y_{m−1}. This argument, of setting coefficients to zero, leads to a system of linear partial differential equations (overdetermined when m ≥ 2) for ξ(x, y) and η(x, y), which can be exactly solved to retrieve all the infinitesimal generators of the ODE. The same strategy can often be applied beyond the polynomial case, and explicit worked examples of this procedure are now provided:
Example 9 (First Order ODE). Consider the class of all first order ODEs of the form dy/dx = f(x, y(x)), f(x, y) = F(y/x). From Eq. (10) we have η^(1) = η_x + (η_y − ξ_x) y_1 − ξ_y (y_1)², so Eq. (13) becomes η_x + (η_y − ξ_x) f − ξ_y f² = ξ ∂f/∂x + η ∂f/∂y, and thus
η_x + (η_y − ξ_x) F(y/x) − ξ_y F(y/x)² = −ξ F′(y/x) y/x² + η F′(y/x)/x.
For this equation to hold for general F, the coefficients of F, F² and F′ must each vanish:
η_x = 0,  η_y − ξ_x = 0,  ξ_y = 0,  −ξ y/x² + η/x = 0.
This is now a linear system of partial differential equations in (ξ, η) which is easily solved; namely ξ = x, η = y. The associated infinitesimal generator is X = x ∂/∂x + y ∂/∂y.
Example 10 (Second Order ODE). The infinitesimal generators for the second order ODE
(x − y(x)) d²y/dx² + 2 (dy/dx)(dy/dx + 1) + (dy/dx)^{3/2} = 0   (14)
are derived in Supplementary Section A.1.
In this section we present our approach for a first order ODE. This allows some of the more technical details associated with the general case to be omitted, owing to the fact that any one-dimensional Lie algebra is trivial. The main result that will allow us to construct an exact Bayesian PNM is as follows:
Theorem 7 (Reduction of a First Order ODE to an Integral). If a first order ODE
dy/dx = f(x, y(x))   (15)
admits a one-parameter Lie group of transformations, then there exist coordinates r(x, y), s(x, y) such that
ds/dr = G(r)   (16)
for some explicit function G(r).
Note that the transformed ODE in Eq. (16) is nothing more than an integral, for which exact Bayesian PNM have already been developed (e.g.
Briol et al., 2019; Karvonen et al., 2018). At a high level, as indicated in Fig. 2, our proposed Bayesian PNM performs inference for the solution s(r) of Eq. (16) and then transforms the resultant posterior back into the original (x, y)-coordinate system. Our PNM is therefore based on the information operator
A(y) = [G(r_0), ..., G(r_n)]^⊤ ∈ A = R^{n+1},   (17)
which corresponds indirectly to n + 1 evaluations of the original gradient field f at certain input pairs (x_i, y_i). The selection of the inputs r_i is discussed in Section A.2. The transformation of a first order ODE is clearly illustrated in the following:
Example 11 (Ex. 9, continued). Consider the first order ODE dy/dx = f(x, y(x)), f(x, y) = F(y/x). Recall from Ex. 9 that this ODE admits the one-parameter Lie group of transformations x* = αx, y* = αy for α ∈ R, and that the associated infinitesimal generator is X = x ∂/∂x + y ∂/∂y. Solving the pair of partial differential equations X r = 0, X s = 1 yields the canonical coordinates s = log y, r = y/x. The transformed ODE is then
ds/dr = F(r) / (−r² + r F(r)) =: G(r).
Thus an evaluation of G(r) corresponds to an evaluation of f(x, y) at an input (x, y) such that r = y/x.
Two important points must now be addressed. First, the approach just described cannot be Bayesian unless it corresponds to a well-defined prior distribution μ ∈ P_Y in the original coordinate system Y. This precludes standard (e.g. Gaussian process) priors in general, as such priors assign mass to functions in (r, s)-space that do not correspond to well-defined functions in (x, y)-space (see Fig. 3). Second, any prior that is used ought to be consistent with the Lie group of transformations that the ODE is known to admit. To address each of these important points, we propose two general principles for prior construction in this work.
The first principle is the implicit prior principle. This ensures that a prior specified in the transformed coordinates (r, s) can be safely transformed into a well-defined distribution on Y. For such an implicit prior to be well-defined, we need to understand when a function in (r, s)-space maps to a well-defined function in the original (x, y) domain of interest. Let S denote the image of Y under the canonical coordinate transformation.
Principle 1 (Implicit Prior). A distribution ν ∈ P_S on the transformed solution space S corresponds to a well-defined implicit prior μ ∈ P_Y provided that x(r, s(r)) is strictly monotone as a function of r.
Figure 3: Illustration of the implicit prior principle: a prior elicited for the function s(r) in the transformed coordinate system (r, s) must be supported on functions s(r) that correspond to well-defined functions y(x) in the original coordinate system (x, y). Thus the situation depicted would not be allowed.
Example 12 (Ex. 11, continued). For the ODE in Ex. 11, with canonical coordinates s = log y, r = y/x, if x ∈ [x_0, x_T] = [1, x_T] and y ∈ (0, ∞), then the region in the (r, s) plane corresponding to [1, x_T] × (0, ∞) in the (x, y) plane is (0, ∞) × R. Now,
dx(r, s(r))/dr = ∂x/∂r + (∂x/∂s) ds(r)/dr = s′(r) exp(s(r))/r − exp(s(r))/r².
Thus dx/dr > 0 if and only if s′(r) > 1/r, and the implicit prior principle requires that we respect the constraint log(r) ≤ s(r) ≤ log(r) + log(x_T) for all r > 0. The set S must therefore consist of differentiable functions s defined on r ∈ (0, ∞) and satisfying log(r) ≤ s(r) ≤ log(r) + log(x_T).
Now we turn to the second important point, namely that the prior ought to encode knowledge of any Lie transformations that are known to be admitted by the ODE. In working in the transformed space S, it becomes clear how to construct a prior measure in which this knowledge is encoded.
Our second principle for prior specification states that equal prior weight should be afforded to all curves that are identical up to a Lie transformation:
Principle 2 (Invariant Prior). A distribution ν ∈ P_S on the transformed solution space S is said to be invariant provided that ν(S) = ν(S + ε), where the elements of S + ε are the elements of S after a vertical translation, i.e. s(·) ↦ s(·) + ε, and both S, S + ε ∈ Σ_S.
Our recommendation is that, when possible, both the implicit prior principle and the invariant prior principle should be enforced. However, in practice it seems difficult to satisfy both principles, and our empirical results in Section 4 are based on implicit priors that are not invariant.
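The computations in Exs. 9 and 11 can be verified symbolically. The following sketch (our own illustration, using sympy) checks that r = y/x, s = log y are canonical coordinates for the scaling generator X = x ∂/∂x + y ∂/∂y, and that along any solution of dy/dx = F(y/x) the transformed gradient takes the form G(r) = F(r)/(r F(r) − r²).

```python
# Sketch: verify the canonical coordinates and transformed ODE of Ex. 11.
import sympy as sp

x = sp.symbols("x", positive=True)
y = sp.Function("y", positive=True)(x)
F = sp.Function("F")

r = y / x          # canonical coordinates for X = x d/dx + y d/dy
s = sp.log(y)

def X(G):
    """Apply the scaling generator to an expression in (x, y)."""
    yy = sp.symbols("yy", positive=True)
    G = G.subs(y, yy)
    out = x * sp.diff(G, x) + yy * sp.diff(G, yy)
    return sp.simplify(out.subs(yy, y))

assert X(r) == 0 and X(s) == 1            # Xr = 0, Xs = 1 (Def. 7)

# Along a solution of dy/dx = F(y/x), compute ds/dr = (ds/dx)/(dr/dx).
rhs = F(y / x)
ds_dx = sp.diff(s, x).subs(sp.Derivative(y, x), rhs)
dr_dx = sp.diff(r, x).subs(sp.Derivative(y, x), rhs)
rho = sp.symbols("rho", positive=True)
G_of_r = sp.simplify((ds_dx / dr_dx).subs(y, rho * x))   # substitute y = r x
assert sp.simplify(G_of_r - F(rho) / (rho * F(rho) - rho**2)) == 0
```

The same pattern (solve X r = 0, X s = 1, then divide the transformed derivatives) applies to any admitted one-parameter group, which is exactly the content of Theorem 7.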
In this section we present our approach for a second order ODE. The study of second order ODEs is particularly important, since Newtonian mechanics is based on ODEs of second order. The presentation is again simplified relative to the general case of an m-th order ODE, this time by virtue of the fact that any two-dimensional Lie algebra is guaranteed to be solvable (Thm. 6). The main result that will allow us to construct an exact Bayesian PNM is as follows:
Theorem 8 (Reduction of a Second Order ODE to Two Integrals). If a second order ODE
d²y/dx² = f(x, y(x), dy/dx)   (18)
admits a two-parameter Lie group of transformations, then there exist coordinates r(x, y), s(x, y) such that
ds/dr = G(r)   (19)
for some implicitly defined function G. The function G is explicitly related to the solution of a second equation of the form
ds̃/dr̃ = H(r̃)   (20)
for some explicit function H(r̃).
Note that the ODE in Eq. (18) is reduced to two integrals, namely Eq. (19) and Eq. (20). At a high level, our proposed Bayesian PNM performs inference for the solution s(r) of Eq. (19) and then transforms the resultant posterior back into the original (x, y)-coordinate system. However, because G in Eq. (19) depends on the solution s̃(r̃) of Eq. (20), we must also estimate s̃(r̃), and for this we need to evaluate H. Our PNM is therefore based on the information operator
A(y) = [G(r_0), ..., G(r_n), H(r̃_0), ..., H(r̃_n)]^⊤ ∈ A = R^{2(n+1)},
which corresponds indirectly to 2(n + 1) evaluations of f, the original gradient field. The extension of our approach to a general m-th order ODE proceeds analogously, with A = R^{m(n+1)}. The use of Thm. 8 is illustrated in Example 14 in the Supplement.
The two principles of prior construction that we advocated in the case of a first order ODE apply equally to the case of a second- and higher-order ODE.
It therefore remains only to discuss the selection of the specific inputs r_i (and r̃_i, in the case of a second order ODE) that are used to define the information operator A. This discussion is again reserved for Supplemental Section A.2.
In this section the proposed Bayesian PNM is empirically illustrated. Recall that we are not advocating these methods for practical use; rather, they serve as a proof-of-concept, demonstrating that exact Bayesian inference can in principle be performed for ODEs, albeit at considerable effort; a non-trivial finding that helps to shape ongoing research and discussion in this nascent field.
The case of a first order ODE is considered in Section 4.1 and the second order case is contained in Section 4.2. In both cases, scope is limited to verifying the correctness of the procedures, as well as indicating how conjugate prior distributions can be constructed.
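To make the constrained priors used below concrete, the following is a crude stand-in for the basis-function construction of López-Lopera et al. (2018) that the experiments actually use (the grid, the increment distribution and the normalisation here are our own choices, for illustration only): a random nondecreasing function ζ with ζ = 0 at the left endpoint and 0 ≤ ζ < 1 yields a sample path s(r) = log r + log(x_T) ζ(r) respecting the constraint in Eq. (22), which then maps back to a well-defined curve in (x, y).

```python
# Sketch: draw one sample path from a crude monotone, bounded prior on zeta,
# form s(r) = log r + log(x_T) * zeta(r), and map back via x = e^s/r, y = e^s.
import numpy as np

rng = np.random.default_rng(0)
x_T = 5.0
r = np.linspace(1.0, x_T, 50)          # illustrative grid; r0 = y0/x0 = 1 here

inc = rng.gamma(shape=1.0, scale=1.0, size=r.size - 1)   # positive increments
zeta = np.concatenate([[0.0], np.cumsum(inc)])
zeta = zeta / (zeta[-1] + 1.0)          # pins zeta[0] = 0 and 0 <= zeta < 1

s = np.log(r) + np.log(x_T) * zeta      # sample path in (r, s) coordinates
xx = np.exp(s) / r                      # back-transform: x = e^s / r
yy = np.exp(s)                          # y = e^s

# The path respects the bound of Eq. (22) and x(r) is strictly increasing,
# so the curve (xx, yy) is a well-defined function y(x).
assert np.all((np.log(r) <= s) & (s <= np.log(r) + np.log(x_T)))
assert np.all(np.diff(xx) > 0)
```

The López-Lopera et al. construction replaces the cumulative-sum device with a finite basis expansion whose coefficients are constrained Gaussian, which is what makes posterior computation tractable in the experiments.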
This section illustrates the empirical performance of the proposed method for a first order ODE.
ODE:
To limit scope, we consider first order ODEs of the form
dy/dx = F(y(x)/x),  x ∈ [1, x_T],  y(1) = y_0.   (21)
Note that the admitted transformation and associated canonical coordinates for this class of ODE have already been derived in Ex. 9, Ex. 11 and Ex. 12.
Prior:
In constructing a prior μ ∈ P_Y we refer to the implicit prior principle in Sec. 3.2. Indeed, recall from Ex. 11 that the ODE in Eq. (21) can be transformed into an ODE of the form
ds/dr = G(r),  r ∈ (0, ∞),  s(y_0) = log(y_0).
Our approach then constructs a distribution ν ∈ P_S where, from Ex. 12, S is the set of differentiable functions s defined on r ∈ (0, ∞) and satisfying
log(r) ≤ s(r) ≤ log(r) + log(x_T).   (22)
To ensure monotonicity in the implicit prior principle, we take dx/dr > 0, which translates into the requirement that
ds/dr > 1/r.   (23)
If Eq. (23) holds, then ν induces a well-defined distribution μ ∈ P_Y. Note that the constraints in Eq. (22) and Eq. (23) preclude the direct use of standard prior models, such as Gaussian processes. However, it is nevertheless possible to design priors that are convenient for a given set of canonical coordinates. Indeed, for the canonical coordinates r, s in our example, we can consider a prior of the form
s(r) = log(r) + log(x_T) ζ(r),
where the function ζ : (0, ∞) → R satisfies
ζ(y_0) = 0,  0 ≤ ζ(r) ≤ 1,  dζ/dr ≥ 0.   (24)
For this experiment, the approach of López-Lopera et al. (2018) was used as a prior model for the monotone, bounded function ζ; this requires that a number, N, of basis functions is specified; for brevity we defer the detail to Appendix A.3.
Figure 4: Experimental results, first order ODE: the black curves represent samples from the posterior, whilst the exact solution is indicated in red. The blue curves represent a constraint on the (r, s) domain that arises when the implicit prior principle is applied. The number n of gradient evaluations is indicated. Top: results in the (r, s) domain. Bottom: results in the (x, y) domain.
The prior just described incorporates the symmetric structure of the ODE, in the sense that the independent variable r = y/x is the first canonical coordinate of the infinitesimal generator of the Lie group of transformations of the original ODE in Ex. 11. In other words, r is a variable fixed by the Lie group of transformations of the ODE (in this case x* = αx, y* = αy, so r* = r). Because the prior is defined on functions s(r) of r, the prior itself is also unchanged by the Lie group of transformations of the ODE, so that the prior effectively incorporates the symmetric structure of the ODE.
Results:
To obtain empirical results we consider the ODE with F(r) = r^{-1} + r and y_0 = 1, x_T = 5. The posterior distributions obtained as the number n of data points was increased were sampled and plotted in the (r, s) and (x, y) planes in Fig. 4. In each case a basis of size N = 2n was used. Observe that the implicit prior principle ensures that all curves in the (x, y) plane are well-defined functions (i.e. there is at most one y value for each x value). Observe also that the posterior mass appears to contract to the true solution y† of the ODE as the number of evaluations n of the gradient field is increased.
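The reduction underlying these results can be sanity-checked numerically. In the sketch below, F is an arbitrary illustrative choice (not necessarily the F used in the experiment above): integrating the transformed problem ds/dr = G(r), G(r) = F(r)/(r F(r) − r²), and mapping back through x = e^s/r, y = e^s reproduces, to solver tolerance, the solution obtained by integrating dy/dx = F(y/x) directly.

```python
# Sketch: numerical check that the transformed integral of Ex. 11 reproduces
# the solution of the original ODE dy/dx = F(y/x).
import numpy as np
from scipy.integrate import solve_ivp

F = lambda rho: rho + rho**2                 # illustrative choice of F(y/x)
G = lambda r: F(r) / (r * F(r) - r**2)       # transformed gradient (Ex. 11)

# Integrate ds/dr = G(r) from (r, s) = (1, 0), i.e. (x0, y0) = (1, 1).
R = 2.0
sol_s = solve_ivp(lambda r, s: [G(r)], (1.0, R), [0.0], rtol=1e-10, atol=1e-12)
s_R = sol_s.y[0, -1]
x_R, y_R = np.exp(s_R) / R, np.exp(s_R)      # map back to (x, y) coordinates

# Integrate the original ODE up to x = x_R and compare the endpoint values.
sol_y = solve_ivp(lambda x, y: [F(y[0] / x)], (1.0, x_R), [1.0],
                  rtol=1e-10, atol=1e-12, dense_output=True)
assert abs(sol_y.sol(x_R)[0] - y_R) < 1e-6

# For this F the original ODE has closed form y(x) = x / (1 - log x).
assert abs(y_R - x_R / (1.0 - np.log(x_R))) < 1e-6
```

In the paper's method the quadrature step is of course replaced by Bayesian inference for s(r), but the deterministic pipeline above is exactly the map that the posterior samples in Fig. 4 are pushed through.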
ODE:
Consider again the second order nonlinear ODE in Eqn. 31, together with the initial conditions y(x_0) = y_0, dy/dx(x_0) = y_0′.
It is shown in Ex. 14 in the Supplement that Eqn. 31 can be reduced to a first order ODE in (s, r) with −x_0 − r ≤ s ≤ −x_T − r. The implicit prior principle in this case requires that ds/dr > −1. Thus we are led to consider a parametrisation of the form
s(r) = −x_0 − r + (x_0 − x_T) ζ(r),
where the function ζ again satisfies the conditions in Eq. (24). The approach of López-Lopera et al. (2018) was therefore again used as a prior model.
For this example an additional level of analytic tractability is possible, as described in detail in Ex. 14 in the Supplement. Thus we need only consider an information operator of the form A(y) = [G(r_0), ..., G(r_n)].
Results:
The posterior distributions that were obtained are plotted in the (r, s) plane and the (x, y) plane in Fig. 5. A basis of size N = 2n was used, with initial condition [y_0, y_0′] and domain [x_0, x_T], x_0 = 5, as in Ex. 14. Observe that all curves in the (x, y) plane are well-defined functions (i.e. there is at most one y value for each x value). The true solution appears to be smoother than the samples, even for 50 gradient evaluations, which suggests that the prior was somewhat conservative in this context.
Figure 5: Experimental results, second order ODE: the black curves represent samples from the posterior in the (r, s) plane (left) and (x, y) plane (right), whilst the exact solution is indicated in red. The blue curves represent a constraint on the domain that arises when the implicit prior principle is applied. The number of gradient evaluations was n = 50.
This paper presented a foundational perspective on PNM. It was first argued that there did not exist a Bayesian PNM for the numerical solution of ODEs. Then, to address this gap, a prototypical Bayesian PNM was developed. The Bayesian perspective that we have put forward sheds light on foundational issues which will need to be addressed going forward:
Foundation of PNM:
As explained in Section 1.2, existing PNM for ODEs each take the underlying state space Y to be the solution space of the ODE. This appears to be problematic, in the sense that a generic evaluation f(x_i, y_i) of the gradient field cannot be cast as information A(y†) about the solution y† of the ODE unless the point (x_i, y_i) lies exactly on the solution curve {(x, y†(x)) : x ∈ [x_0, x_T]}. As a consequence, all existing PNM of which we are aware violate the likelihood principle and are therefore not strictly Bayesian. The assumption of a solvable Lie algebra, used in this work, can be seen as a mechanism to ensure the existence of an exact information operator A, so that the likelihood principle can be obeyed. However, for a general ODE it might be more natural to take the underlying state space to be a set F of permitted gradient fields and the quantity of interest Q(f) to map a gradient field f to the solution of the associated ODE. This would make the information operator A trivial, but evaluation of the push-forward Q_# μ_a would require the exact solution operator of the ODE. The reliance on access to an oracle solver Q makes this philosophically somewhat distinct from PNM.
The proposed method was intended as a proof-of-concept and it is therefore useful to highlight the aspects in which it is limited. First, when an $m$th order ODE admits an $r$-parameter Lie group of transformations with $r > m$, there is an arbitrariness to the particular $m$-dimensional sub-group of transformations that is selected. Second, the route to obtain transformations admitted by the ODE demands that some aspects of the gradient field $f$ are known, in contrast to other work in which $f$ is treated as a black-box. For instance, in Ex. 11 we used the fact that $f$ can be expressed as $f(x, y) = F(y/x)$, although knowledge of the form of $F$ was not required. Third, the class of ODEs for which a solvable Lie algebra is admitted is relatively small. On the other hand, references such as Bluman and Anco (2002) document important cases where our method could be applied. Fourth, the principles for prior construction that we identified do not entail a unique prior and, as such, the question of prior elicitation must still be addressed.

Outlook:
The goal of providing rigorous and exact statistical uncertainty quantification for the solution of an ODE is, we believe, important and will continue to be addressed. Traditional numerical methods have benefitted from a century of research effort and, in comparison, Bayesian PNM is an under-developed field. For example, the limited existing work on PNM for ODEs, such as Skilling (1992); Schober et al. (2014); Chkrebtii et al. (2016); Kersting and Hennig (2016); Schober et al. (2019); Kersting et al. (2018); Tronarp et al. (2019), does not attempt to provide adaptive error control (though we note promising ongoing research in that direction by Chkrebtii and Campbell, 2019). Nevertheless, the case for developing Bayesian numerical methods, which shares some parallels with the case for Bayesian statistics as opposed to other inferential paradigms, is clear, as argued in Diaconis (1988) and Hennig et al. (2015). The insights we have provided in this paper serve to highlight the foundational issues pertinent to Bayesian PNM for ODEs. Indeed, our proof-of-concept highlights that performing exact Bayesian inference for ODEs may be extremely difficult. This in turn provides motivation for the continued development of ‘approximately Bayesian’ approaches to PNM, which in Sec. 1.2 we surveyed in detail.
Acknowledgements:
The authors are grateful to Mark Craddock, François-Xavier Briol and Tim Sullivan for discussion of this work, as well as to an Associate Editor and two Reviewers for their challenging but constructive feedback. JW was supported by the EPSRC Centre for Doctoral Training in Cloud Computing for Big Data at Newcastle University, UK. CJO was supported by the Lloyd's Register Foundation programme on data-centric engineering at the Alan Turing Institute, UK. This material was based upon work partially supported by the National Science Foundation under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References
Assyr Abdulle and Giacomo Garegnani. Random time step probabilistic methods for uncertainty quantification in chaotic and geometric numerical integration. arXiv:1801.01340, 2018.

Gerd Baumann. Symmetry Analysis of Differential Equations with Mathematica®. Springer Science & Business Media, 2013.

G.W. Bluman and S.C. Anco. Symmetry and Integration Methods for Differential Equations. Springer, 2002.

François-Xavier Briol, Chris J Oates, Mark Girolami, Michael A Osborne, and Dino Sejdinovic. Probabilistic integration: A role in statistical computation? (with discussion and rejoinder). Statistical Science, 34(1):1–42, 2019.

Radomir Chabiniok, Vicky Y Wang, Myrianthi Hadjicharalambous, Liya Asner, Jack Lee, Maxime Sermesant, Ellen Kuhl, Alistair A Young, Philippe Moireau, Martyn P Nash, et al. Multiphysics and multiscale modelling, data–model fusion and integration of organ physiology in the clinic: ventricular cardiac mechanics. Interface Focus, 6(2):20150083, 2016.

Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273–304, 1995.

Joseph T Chang and David Pollard. Conditioning as disintegration. Statistica Neerlandica, 51(3):287–317, 1997.

O.A. Chkrebtii and D.A. Campbell. Adaptive step-size selection for state-space based probabilistic differential equation solvers. Statistics and Computing, 2019. To appear.

Oksana Chkrebtii, David A Campbell, Mark A Girolami, and Ben Calderhead. Bayesian uncertainty quantification for differential equations (with discussion). Bayesian Analysis, 11(4):1239–1267, 2016.

Jon Cockayne, Chris Oates, Tim Sullivan, and Mark Girolami. Probabilistic meshless methods for partial differential equations and Bayesian inverse problems. arXiv:1605.07811, 2016.

Jon Cockayne, Chris Oates, Tim Sullivan, and Mark Girolami. Bayesian probabilistic numerical methods. SIAM Review, 2019. To appear.

Patrick R Conrad, Mark Girolami, Simo Särkkä, Andrew Stuart, and Konstantinos Zygalakis. Statistical analysis of differential equations: introducing probability measures on numerical solutions. Statistics and Computing, 27(4):1065–1082, 2017.

P Diaconis. Bayesian numerical analysis. In Statistical Decision Theory and Related Topics IV(1), 1988.

D. Estep. A posteriori error bounds and global error control for approximation of ordinary differential equations. SIAM Journal on Numerical Analysis, 32(1):1–48, 1995.

Andrew Gelman and Cosma Rohilla Shalizi. Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1):8–38, 2013.

Thomas Hawkins. Emergence of the Theory of Lie Groups: An Essay in the History of Mathematics 1869–1926. Springer Science & Business Media, 2012.

Philipp Hennig, Michael A Osborne, and Mark Girolami. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society A, 471(2179):20150142, 2015.

Nicholas J Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2002.

Toni Karvonen, Chris J Oates, and Simo Särkkä. A Bayes–Sard cubature method. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS 2018), pages 5882–5893, 2018.

H. Kersting, T.J. Sullivan, and P. Hennig. Convergence rates of Gaussian ODE filters. arXiv:1807.09737, 2018.

Hans Kersting and Philipp Hennig. Active uncertainty calibration in Bayesian ODE solvers. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI 2016), pages 309–318, 2016.

F. M. Larkin. Probabilistic error estimates in spline interpolation and quadrature. In Information Processing 74 (Proc. IFIP Congress, Stockholm, 1974), pages 605–609, 1974.

F. M. Larkin. Gaussian measure in Hilbert space and applications in numerical analysis. The Rocky Mountain Journal of Mathematics, 2:379–421, 1972.

Han Cheng Lie, T.J. Sullivan, and Aretha L Teckentrup. Random forward models and log-likelihoods in Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification, 6(4):1600–1629, 2018.

H.C. Lie, A.M. Stuart, and T.J. Sullivan. Strong convergence rates of probabilistic integrators for ordinary differential equations. Statistics and Computing, 2019. To appear.

Andrés F López-Lopera, François Bachoc, Nicolas Durrande, and Olivier Roustant. Finite-dimensional Gaussian approximation with linear inequality constraints. SIAM/ASA Journal on Uncertainty Quantification, 6(3):1224–1255, 2018.

C J Oates and T J Sullivan. A modern retrospective on probabilistic numerics. Statistics and Computing, 2019. To appear.

C J Oates, J Cockayne, D Prangle, T J Sullivan, and M Girolami. Optimality criteria for probabilistic numerical methods. In Multivariate Algorithms and Information-Based Complexity. De Gruyter, Berlin/Boston, 2019. To appear.

Antony Overstall, David Woods, and Ben Parker. Bayesian optimal design for ordinary differential equation models with application in biological science. Journal of the American Statistical Association, 2019. To appear.

A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23:518–542, 2014.

Juan R Perilla, Boon Chong Goh, C Keith Cassidy, Bo Liu, Rafael C Bernardi, Till Rudack, Hang Yu, Zhe Wu, and Klaus Schulten. Molecular dynamics simulations of large macromolecular complexes. Current Opinion in Structural Biology, 31:64–74, 2015.

Michael Schober, David K Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with Runge–Kutta means. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS 2014), pages 739–747, 2014.

Michael Schober, Simo Särkkä, and Philipp Hennig. A probabilistic model for the numerical solution of initial value problems. Statistics and Computing, 2019. To appear.

John Skilling. Bayesian solution of ordinary differential equations. In Maximum Entropy and Bayesian Methods, pages 23–37. Springer, 1992.

Onur Teymur, Kostas Zygalakis, and Ben Calderhead. Probabilistic linear multistep methods. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), pages 4321–4328, 2016.

Onur Teymur, Han Cheng Lie, Tim Sullivan, and Ben Calderhead. Implicit probabilistic integrators for ODEs. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS 2018), 2018.

J.F. Traub and H. Woźniakowski. Perspectives on information-based complexity. Bulletin of the American Mathematical Society, 26(1):29–52, 1992.

Filip Tronarp, Hans Kersting, Simo Särkkä, and Philipp Hennig. Probabilistic solutions to ordinary differential equations as non-linear Bayesian filtering: A new perspective. Statistics and Computing, 2019. To appear.

Nils P Wedi. Increasing horizontal resolution in numerical weather prediction and climate simulations: illusion or panacea? Philosophical Transactions of the Royal Society A, 372(2018):20130289, 2014.
Supplementary material
This electronic supplement to the paper by Wang, Cockayne and Oates contains proofs for theoretical results in the main text (Sec. A.1), a brief discussion of Bayesian experimental design in the context of designing a Bayesian probabilistic numerical method (Sec. A.2), and computational detail that was suppressed from the main text (Sec. A.3).
A.1 Proof of Theoretical Results
Proof of Theorem 2.
From Taylor's theorem we have that
$$x^* = X(x, \epsilon) = \sum_{k=0}^{\infty} \frac{\epsilon^k}{k!} \left. \frac{\partial^k X(x, \epsilon)}{\partial \epsilon^k} \right|_{\epsilon = 0}.$$
For any differentiable function $F$ we have that
$$\frac{d F(x^*)}{d\epsilon} = \sum_{i=1}^{d} \frac{\partial F(x^*)}{\partial x_i^*} \frac{d x_i^*}{d\epsilon} = \sum_{i=1}^{d} \xi_i(x^*) \frac{\partial F(x^*)}{\partial x_i^*} = X F(x^*)$$
and similarly $\frac{d^k F(x^*)}{d\epsilon^k} = X^k F(x^*)$. Thus
$$\left. \frac{\partial^k X(x, \epsilon)}{\partial \epsilon^k} \right|_{\epsilon = 0} = X^k x$$
so that the stated result is recovered.

Proof of Theorem 3.
The result is established as follows:
$$\begin{aligned}
F \text{ invariant} &\Leftrightarrow F(x^*) = 0 \text{ whenever } F(x) = 0 \\
&\Leftrightarrow e^{\epsilon X} F(x) = 0 \text{ whenever } F(x) = 0 \quad \text{(Cor. 1)} \\
&\Leftrightarrow F(x) + \epsilon X F(x) + O(\epsilon^2) = 0 \text{ whenever } F(x) = 0 \quad \text{(Taylor)} \\
&\Leftrightarrow X F(x) = 0 \text{ whenever } F(x) = 0
\end{aligned}$$
where the last line follows since the coefficient of the $O(\epsilon)$ term in the Taylor expansion must vanish. This completes the proof.

Proof of Theorem 4.
From Cor. 1, we have that $F(x^*) = F(x) + \epsilon X F(x) + O(\epsilon^2)$. The result follows from inspection of the $\epsilon$ coefficient.

Proof of Theorem 6. Suppose $L$ is a two-dimensional Lie algebra generated by linearly independent infinitesimal generators $X_1$ and $X_2$. Let $Y = [X_1, X_2] = a X_1 + b X_2$ and let $J$ be the one-dimensional subalgebra generated by $Y$. Suppose $Z = c X_1 + d X_2$ is some element of $L$; then
$$[Y, Z] = [Y, c X_1 + d X_2] = c [Y, X_1] + d [Y, X_2] = cb [X_2, X_1] + da [X_1, X_2] = (ad - bc) Y \in J.$$
So $J \subset L$ is normal, thus $L$ is solvable, as claimed.

Proof of Theorem 7.
Let the infinitesimal generator associated with the Lie group of transformations be denoted $X$. From the remarks after Def. 7, we can obtain canonical coordinates by solving the pair of first order partial differential equations $X r = 0$, $X s = 1$. By the chain rule we have
$$\frac{ds}{dr} = \frac{s_x + s_y y'}{r_x + r_y y'} =: G(r, s).$$
From the definition of canonical coordinates, the Lie group of transformations is $r^* = r$, $s^* = s + \epsilon$ in the transformed coordinate system, so
$$\frac{ds^*}{dr^*} = G(r^*, s^*) \implies \frac{ds}{dr} = G(r, s + \epsilon)$$
for all $\epsilon$, which implies $G(r, s) = G(r)$ and thus Eq. (15) becomes
$$\frac{ds}{dr} = G(r)$$
as claimed.

Proof of Theorem 8.
Let the infinitesimal generators associated with the Lie group of transformations be denoted $X_1$ and $X_2$. Recall from Thm. 6 that any two-dimensional Lie algebra is solvable. Thus, without loss of generality, we may assume
$$[X_1, X_2] = \lambda X_1 \qquad (25)$$
for some $\lambda \in \mathbb{R}$. The infinitesimal generators $X_1$ and $X_2$ each correspond to a one-parameter Lie group of transformations, denoted $x^* = X_1(x, \epsilon)$ and $x^\dagger = X_2(x, \epsilon)$. Let $v(x, y)$, $w(x, y, y')$ be invariant functions of $X_1$ and its extension $X_1^{(1)}$, respectively, so that $v(x^*, y^*) = v(x, y)$ and $w(x^*, y^*, y'^*) = w(x, y, y')$, where $w$ has a non-trivial dependence on its third argument. It follows from the definition of invariance that
$$\frac{dw^*}{dv^*} = \frac{dw}{dv}, \qquad \text{hence} \qquad X_1^{(2)} \frac{dw}{dv} = 0 \qquad (26)$$
by Cor. 3. Now, because Eq. (26) is a homogeneous partial differential equation, the general solution $\frac{dw}{dv}$ can be expressed as a function of the two solutions $v(x, y)$ and $w(x, y, y')$. Therefore
$$\frac{dw}{dv} = Z(v, w) \qquad (27)$$
for some undetermined function $Z$.

Since Eq. (18) admits $X_2$, and Eq. (27) is the same ODE when expressed in terms of $x, y, y'$, it must be the case that
$$X_2^{(2)} \left( \frac{dw}{dv} - Z(v, w) \right) = 0 \quad \text{whenever} \quad \frac{dw}{dv} = Z(v, w).$$
Then from Cor. 4 it follows that $X_2^{(1)}$ is admitted by the first order ODE in Eq. (27). Thus we are now faced with a first order ODE that admits a one-parameter Lie group of transformations, as in Thm. 7.

Now, from Eq. (25), we have $X_1 X_2 v = X_2 X_1 v + \lambda X_1 v = 0$. Thus $X_2 v$ is an invariant of $X_1$ and so $X_2 v = h(v)$ for some function $h$. Similarly $X_1^{(1)} X_2^{(1)} w = X_2^{(1)} X_1^{(1)} w + \lambda X_1^{(1)} w = 0$, so that $X_2^{(1)} w = g(v, w)$ for some function $g$. This implies
$$X_2^{(1)} = h(v) \frac{\partial}{\partial v} + g(v, w) \frac{\partial}{\partial w}.$$
Proceeding as in Thm. 7, denote the canonical coordinates of $X_2^{(1)} = h(v) \frac{\partial}{\partial v} + g(v, w) \frac{\partial}{\partial w}$ by $\tilde{r}(v, w)$, $\tilde{s}(v, w)$, such that $X_2^{(1)} \tilde{r} = 0$, $X_2^{(1)} \tilde{s} = 1$. In canonical coordinates, Eq.
(27) becomes:
$$\frac{d\tilde{s}}{d\tilde{r}} = H(\tilde{r}) \qquad (28)$$
This is again an integral, with solution
$$\tilde{s}(\tilde{r}) = \int^{\tilde{r}} H(t) \, dt + C. \qquad (29)$$
We can rewrite Eq. (29) in terms of $v, w$ to obtain an equation of the form
$$I(v, w) = 0 \qquad (30)$$
which satisfies $X_1^{(1)}(I(v, w)) = 0$ whenever $I(v, w) = 0$, since recall $v, w$ are invariants of $X_1^{(1)}$. For the final step, we recall that $v = v(x, y)$ and $w = w(x, y, y')$, so that Eq. (30) represents a first order ODE in $y$, which admits $X_1$. Thus we can apply Thm. 15 a second time to obtain canonical coordinates $r(x, y)$, $s(x, y)$ for $X_1$. In these coordinates, Eq. (30) reduces into the form
$$\frac{ds}{dr} = G(r)$$
where $G$ is implicitly defined.

Example 13 (Deriving the Infinitesimal Generators for the Second Order ODE in Eq. 10). Consider the second order nonlinear ODE
$$(x - y(x)) \frac{d^2 y}{dx^2} + 2 \frac{dy}{dx} \left( \frac{dy}{dx} + 1 \right) + \left( \frac{dy}{dx} \right)^{3/2} = 0. \qquad (31)$$
Using Corollary 4, we have:
$$\left( \xi \frac{\partial}{\partial x} + \eta \frac{\partial}{\partial y} + \eta^{(1)} \frac{\partial}{\partial y'} + \eta^{(2)} \frac{\partial}{\partial y''} \right) \left( y'' + \frac{2 y' (y' + 1) + y'^{3/2}}{x - y} \right) = 0$$
which implies
$$- \xi \, \frac{2 y'(y' + 1) + y'^{3/2}}{(x - y)^2} + \eta \, \frac{2 y'(y' + 1) + y'^{3/2}}{(x - y)^2} + \eta^{(1)} \, \frac{4 y' + 2 + \frac{3}{2} y'^{1/2}}{x - y} + \eta^{(2)} = 0.$$
Recall
$$\eta^{(1)} = \eta_x + (\eta_y - \xi_x) y' - \xi_y y'^2$$
and
$$\eta^{(2)} = \eta_{xx} + (2 \eta_{xy} - \xi_{xx}) y' + (\eta_{yy} - 2 \xi_{xy}) y'^2 - \xi_{yy} y'^3 + (\eta_y - 2 \xi_x) y'' - 3 \xi_y y' y''.$$
Also notice we can replace $y''$ via the original differential equation, i.e.
$$y'' = - \frac{2 y'(y' + 1) + y'^{3/2}}{x - y}.$$
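The closed-form expressions for $\eta^{(1)}$ and $\eta^{(2)}$ quoted above can be checked mechanically against the standard recursions $\eta^{(1)} = D_x \eta - y' D_x \xi$ and $\eta^{(2)} = D_x \eta^{(1)} - y'' D_x \xi$, where $D_x$ is the total derivative. A minimal symbolic sketch (ours, not part of the paper's implementation; sympy is assumed available):

```python
import sympy as sp

x, y, p, q = sp.symbols('x y yp ypp')  # p stands for y', q for y''
xi = sp.Function('xi')(x, y)
eta = sp.Function('eta')(x, y)

def Dx(F):
    # total derivative along solutions: D_x = d/dx + y' d/dy + y'' d/dy'
    return sp.diff(F, x) + p * sp.diff(F, y) + q * sp.diff(F, p)

eta1 = sp.expand(Dx(eta) - p * Dx(xi))    # first prolongation
eta2 = sp.expand(Dx(eta1) - q * Dx(xi))   # second prolongation

# closed forms as stated in the text
eta1_closed = sp.diff(eta, x) + (sp.diff(eta, y) - sp.diff(xi, x)) * p - sp.diff(xi, y) * p**2
eta2_closed = (sp.diff(eta, x, 2) + (2 * sp.diff(eta, x, y) - sp.diff(xi, x, 2)) * p
               + (sp.diff(eta, y, 2) - 2 * sp.diff(xi, x, y)) * p**2
               - sp.diff(xi, y, 2) * p**3
               + (sp.diff(eta, y) - 2 * sp.diff(xi, x)) * q
               - 3 * sp.diff(xi, y) * p * q)

assert sp.simplify(eta1 - eta1_closed) == 0
assert sp.simplify(eta2 - eta2_closed) == 0
```

The check confirms that the two quoted prolongation formulas agree with the recursive definition.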
Substituting for $\eta^{(1)}$, $\eta^{(2)}$ and $y''$ via the above expressions, multiplying both sides by $(x - y)^2$ and rearranging the terms as powers of $y'$ yields the rather long equation:
$$\begin{aligned}
&\left[ 2(x - y)\eta_x + (x - y)^2 \eta_{xx} \right] \\
&\quad + y' \left[ 2(\eta - \xi) + 4(x - y)\eta_x + 2(x - y)\xi_x + (x - y)^2 (2\eta_{xy} - \xi_{xx}) \right] \\
&\quad + y'^2 \left[ 2(\eta - \xi) + 2(x - y)\eta_y + 4(x - y)\xi_y + (x - y)^2 (\eta_{yy} - 2\xi_{xy}) \right] \\
&\quad + y'^3 \left[ 2(x - y)\xi_y - (x - y)^2 \xi_{yy} \right] \\
&\quad + y'^{1/2} \cdot \tfrac{3}{2}(x - y)\eta_x \\
&\quad + y'^{3/2} \left[ (\eta - \xi) + \tfrac{1}{2}(x - y)\eta_y + \tfrac{1}{2}(x - y)\xi_x \right] \\
&\quad + y'^{5/2} \cdot \tfrac{3}{2}(x - y)\xi_y = 0.
\end{aligned}$$
This expression on the left hand side must vanish identically, so comparing the coefficients of powers of $y'$ gives the determining equations:
$$2(x - y)\eta_x + (x - y)^2 \eta_{xx} = 0 \qquad (32)$$
$$2(\eta - \xi) + 4(x - y)\eta_x + 2(x - y)\xi_x + (x - y)^2 (2\eta_{xy} - \xi_{xx}) = 0$$
$$2(\eta - \xi) + 2(x - y)\eta_y + 4(x - y)\xi_y + (x - y)^2 (\eta_{yy} - 2\xi_{xy}) = 0$$
$$2(x - y)\xi_y - (x - y)^2 \xi_{yy} = 0 \qquad (33)$$
$$\tfrac{3}{2}(x - y)\eta_x = 0 \qquad (34)$$
$$(\eta - \xi) + \tfrac{1}{2}(x - y)\eta_y + \tfrac{1}{2}(x - y)\xi_x = 0$$
$$\tfrac{3}{2}(x - y)\xi_y = 0 \qquad (35)$$
It is immediately obvious from (34) that $\eta_x = 0$ and from (35) that $\xi_y = 0$. Consequently (32) and (33) vanish.
The remaining determining equations simplify to:
$$2(\eta - \xi) + 2(x - y)\xi_x - (x - y)^2 \xi_{xx} = 0 \qquad (36)$$
$$2(\eta - \xi) + 2(x - y)\eta_y + (x - y)^2 \eta_{yy} = 0 \qquad (37)$$
$$(\eta - \xi) + \tfrac{1}{2}(x - y)\eta_y + \tfrac{1}{2}(x - y)\xi_x = 0 \qquad (38)$$
These remaining partial differential equations in $\xi(x, y)$ and $\eta(x, y)$ are linear and, recall, $\eta(x, y)$ is independent of $x$ and $\xi(x, y)$ is independent of $y$. To solve these partial differential equations we can therefore express $\xi(x) = \sum_{n=0}^{\infty} a_n x^n$ and $\eta(y) = \sum_{m=0}^{\infty} b_m y^m$. Consequently (36) becomes:
$$-2\sum_{n=0}^{\infty} a_n x^n + 2\sum_{m=0}^{\infty} b_m y^m + 2\sum_{n=1}^{\infty} n a_n x^n - 2y \sum_{n=1}^{\infty} n a_n x^{n-1} - \sum_{n=2}^{\infty} n(n-1) a_n x^n + 2y \sum_{n=2}^{\infty} n(n-1) a_n x^{n-1} - y^2 \sum_{n=2}^{\infty} n(n-1) a_n x^{n-2} = 0.$$
Comparing the constant terms implies $b_0 = a_0$. Comparing the terms containing $y$ implies:
$$2 b_1 - 2\sum_{n=1}^{\infty} n a_n x^{n-1} + 2\sum_{n=2}^{\infty} n(n-1) a_n x^{n-1} = 0.$$
Comparing coefficients of $x^{n-1}$ gives $b_1 = a_1$ and, for $n \geq 2$, either $n = n(n-1)$ or $a_n = 0$. Of course, $n = n(n-1)$ has solution $n = 2$, so $a_n = 0$ for $n \geq 3$. Comparing the terms containing $y^2$ implies $b_2 = a_2$. Notice that (37) is symmetric with (36), in the sense that swapping $\xi$ with $\eta$ and $x$ with $y$ in (36) gives (37). So by symmetry (37) gives $b_0 = a_0$, $b_1 = a_1$, $b_2 = a_2$ and $b_n = 0$ for $n \geq 3$. (38) gives no additional constraints. Therefore, the example ODE admits a three parameter Lie group of transformations with infinitesimals
$$\xi = a_0 + a_1 x + a_2 x^2, \qquad \eta = a_0 + a_1 y + a_2 y^2$$
where $a_0$, $a_1$ and $a_2$ are arbitrary constants. The infinitesimal generators corresponding to $a_1$, $a_2$ and $a_0$ are respectively
$$X_1 = x \frac{\partial}{\partial x} + y \frac{\partial}{\partial y} \qquad (39)$$
$$X_2 = x^2 \frac{\partial}{\partial x} + y^2 \frac{\partial}{\partial y}$$
$$X_3 = \frac{\partial}{\partial x} + \frac{\partial}{\partial y}, \qquad (40)$$
which generate a three dimensional Lie algebra.

Example 14 (Ex. 13, continued). Recall from Ex. 13 that the second order nonlinear ODE in Eq. (31) admits a three parameter Lie group of transformations with infinitesimal generators $X_1$, $X_2$, $X_3$ defined in Eqs. (39)–(40).
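As a quick check on the Lie algebra structure of $X_1$, $X_2$, $X_3$, the commutators of the corresponding vector fields can be computed symbolically. This is a minimal sympy sketch of ours (the helper names are not from the paper):

```python
import sympy as sp

x, y = sp.symbols('x y')

def apply(X, f):
    # X = (xi, eta) acts as xi d/dx + eta d/dy
    xi, eta = X
    return xi * sp.diff(f, x) + eta * sp.diff(f, y)

def bracket(X, Y):
    # [X, Y] f = X(Y f) - Y(X f); computed componentwise on the coefficients
    return tuple(sp.expand(apply(X, c2) - apply(Y, c1)) for c1, c2 in zip(X, Y))

X1 = (x, y)                           # X1 = x d/dx + y d/dy
X2 = (x**2, y**2)                     # X2 = x^2 d/dx + y^2 d/dy
X3 = (sp.Integer(1), sp.Integer(1))   # X3 = d/dx + d/dy

assert bracket(X1, X2) == X2              # [X1, X2] = X2
assert bracket(X1, X3) == (-1, -1)        # [X1, X3] = -X3
assert bracket(X2, X3) == (-2*x, -2*y)    # [X2, X3] = -2 X1
```

In particular the spans of $\{X_1, X_2\}$ and of $\{X_1, X_3\}$ are closed under the bracket, so each is a two-dimensional sub-algebra.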
These generators can be verified to satisfy $[X_1, X_2] = X_2$, $[X_1, X_3] = -X_3$, $[X_2, X_3] = -2 X_1$. The pairs $X_1, X_2$ and $X_1, X_3$ each form a two-dimensional (and therefore solvable, by Thm. 6) Lie sub-algebra and can be used as the basis for our method. For the derivations below we proceed with the arbitrary choice $X_2, X_1$, which play the roles of $X_1, X_2$ in the proof of Thm. 8.

Following the proof of Thm. 8, first we seek a solution $v = v(x, y)$ to the first order linear PDE $X_2 v = 0$, i.e. we must solve
$$x^2 \frac{\partial v}{\partial x} + y^2 \frac{\partial v}{\partial y} = 0.$$
This has general solution $v = f\!\left( \frac{y - x}{xy} \right)$ for some arbitrary function $f$, and we pick a particular solution $v(x, y) = \frac{y - x}{xy}$. Next we seek a solution $w = w(x, y, y')$ to the first order linear PDE $X_2^{(1)} w = 0$, i.e. we must solve
$$x^2 \frac{\partial w}{\partial x} + y^2 \frac{\partial w}{\partial y} + 2(y - x) y' \frac{\partial w}{\partial y'} = 0.$$
Again, we pick a particular solution $w(x, y, y') = y' \left( \frac{x}{y} \right)^2$. In accordance with Eq. (27), we can re-write the original ODE (31) in terms of the coordinates $v$ and $w$ to obtain
$$\frac{dw}{dv} = \begin{cases} \dfrac{w^{3/2} + 2w(w + 1)}{v(w - 1)}, & \text{for } xy \geq 0 \\[2mm] \dfrac{-w^{3/2} + 2w(w + 1)}{v(w - 1)}, & \text{for } xy < 0 \end{cases} \qquad (41)$$
Next we express $X_1^{(1)}$ in terms of $v$ and $w$ and find its canonical coordinates $\tilde{r}(v, w)$, $\tilde{s}(v, w)$. To this end, we have
$$X_1^{(1)} = x \frac{\partial}{\partial x} + y \frac{\partial}{\partial y} = -v \frac{\partial}{\partial v},$$
which has canonical coordinates $\tilde{r}(v, w) = w$, $\tilde{s}(v, w) = -\log(v)$. Re-writing Eq. (41) in terms of $\tilde{r}$, $\tilde{s}$ leads to the analogue of Eq. (28):
$$\frac{d\tilde{s}}{d\tilde{r}} = \frac{1 - \tilde{r}}{\pm \tilde{r}^{3/2} + 2\tilde{r}(\tilde{r} + 1)} =: H(\tilde{r}) \qquad (42)$$
This example exhibits the convenient feature that Eq. (42) can be directly integrated to give $\tilde{s}(\tilde{r}) = -\log(2\tilde{r} \pm \sqrt{\tilde{r}} + 2) + \log(\tilde{r})/2 + C$, which can be re-written in terms of $x, y, y'$ to give
$$\log\left( \frac{xy}{y - x} \right) = \log \frac{\sqrt{x^2 y' / y^2}}{2 x^2 y' / y^2 \pm \sqrt{x^2 y' / y^2} + 2} + C \qquad (43)$$
for some integration constant $C$.

The final step, to remove the residual dependence on $y'$, requires canonical coordinates for $X_2$. These can be selected as $r(x, y) = \frac{y - x}{xy}$, $s(x, y) = -\frac{1}{y}$. Then Eq. (43) becomes
$$\frac{r \mp \exp(-C)}{2 \exp(-C)} = \left( \frac{ds/dr}{ds/dr - 1} \right)^{1/2} + \left( \frac{ds/dr}{ds/dr - 1} \right)^{-1/2}$$
which is equivalent to Eq. (19) for some function $G$.
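The reduction above can be sanity-checked numerically: once $y''$ is eliminated via the ODE, the identity $dw/dv = (\pm w^{3/2} + 2w(w+1)) / (v(w-1))$ must hold at every point $(x, y, y')$. A minimal numpy sketch of ours, with sampling ranges loosely matching the experiment of Fig. 5 (our choice, for illustration only):

```python
import numpy as np

def check_reduction(x, y, p):
    # p = y'; eliminate y'' via (x - y) y'' + 2 y'(y' + 1) + y'^{3/2} = 0
    q = -(2 * p * (p + 1) + p ** 1.5) / (x - y)
    v = 1.0 / x - 1.0 / y                 # v = (y - x)/(xy)
    w = p * x ** 2 / y ** 2               # w = y' (x/y)^2
    dv_dx = -1.0 / x ** 2 + p / y ** 2
    dw_dx = 2 * x * p / y ** 2 - 2 * x ** 2 * p ** 2 / y ** 3 + x ** 2 * q / y ** 2
    lhs = dw_dx / dv_dx                   # dw/dv along any solution through (x, y, p)
    sign = 1.0 if x * y >= 0 else -1.0    # branch of the w^{3/2} term
    rhs = (sign * w ** 1.5 + 2 * w * (w + 1)) / (v * (w - 1))
    return lhs, rhs

rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.uniform(4.0, 6.0)      # x > 0
    y = rng.uniform(-11.0, -7.0)   # y < 0, so the xy < 0 branch applies; x != y
    p = rng.uniform(0.1, 1.0)      # y' > 0 so that y'^{3/2} is real; keeps w != 1 here
    lhs, rhs = check_reduction(x, y, p)
    assert abs(lhs - rhs) < 1e-9 * max(1.0, abs(rhs))
```

The assertion passing over random points confirms that the coordinates $v$ and $w$ transform the ODE as stated in Eq. (41).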
A.2 Design of the Training Set
The performance of the proposed Bayesian PNM is not our main focus in this work, as we consider the method to be (only) a proof-of-concept. However, for completeness we acknowledge that performance will depend on the locations at which the gradient field is evaluated; the so-called training set. In this section we discuss how these inputs could be optimally selected. To simplify the presentation, we focus on the case of a first order ODE, as in Eq. (15), where the inputs $r_1, \ldots, r_n$ must be selected.

The design of a PNM can be viewed as an instance of statistical experimental design (Chaloner and Verdinelli, 1995). In Sec. 3 of Cockayne et al. (2019) a connection between PNM and decision-theoretic experimental design was exposed. Such methods require that a loss function $L : Q \times Q \to \mathbb{R}$ is provided, where $L(q, q^\dagger)$ quantifies the loss when $q$ is used as an estimate for the true quantity of interest $q^\dagger$. Further detail was provided in Oates et al. (2019). To avoid repetition, in the remainder we focus instead on approximate experimental design, where a loss function is not explicitly needed.

Recall that the output of a PNM is the distribution $\mu_n = B(\mu, a_n) \in \mathcal{P}_Q$. Then one can specify a functional $\ell : \mathcal{P}_Q \to \mathbb{R}$ and compute
$$\tau(r_1, \ldots, r_n) = \int \ell(B(\mu, A(y; r_1, \ldots, r_n))) \, d\mu(y) \qquad (44)$$
where $A(\cdot \,; r_1, \ldots, r_n)$ is the information operator in Eq. (17) with the dependence on $r_1, \ldots, r_n$ made explicit. For the choice $\ell(\nu) = \log \det(\mathrm{Cov}_{\tilde{Q} \sim \nu}[\tilde{Q}])$, a configuration $(r_1, \ldots, r_n)$ for which $\tau(r_1, \ldots, r_n)$ is minimised is said to be D-optimal. The functional $\ell$ plays the role of an approximation to posterior expected loss, and other choices for $\ell$ lead to other approximate notions of optimal experimental design. For instance, an
A-optimal design was used for the Bayesian solution of a partial differential equation in Cockayne et al. (2016). For further background on experimental design we refer the reader to Chaloner and Verdinelli (1995).

Importantly, Eq. (44) does not depend on the information $A(y^\dagger)$ and can therefore be evaluated prior to the experiment being performed. However, in general the numerical approximation of Eq. (44), and the task of finding a minimal configuration, is practically difficult. The reader is referred to Overstall et al. (2019) for further discussion of experimental design in the PNM context.

A.3 Computational Detail
In this Appendix we set out in detail the prior construction that was used for the numerical illustrations of Sec. 4 in the main text. Recall that for both the first order ODE example in Sec. 4.1 and the second order ODE example in Sec. 4.2 we required a non-parametric prior over functions $\zeta$ which satisfy the constraints given in Eq. (24), namely that
$$\zeta(r_0) = 0, \qquad \zeta(r) \leq 1, \qquad \frac{d\zeta}{dr} \geq 0. \qquad (45)$$
Moreover, bearing in mind the posterior computation that is to follow, we require in addition that the prior conveniently facilitates the conditioning calculations involved. It is clear that standard non-parametric priors such as Gaussian processes do not satisfy the boundedness or monotonicity constraints, whilst a nonlinear transformation of such a process would fail to make conditioning on data straight-forward. In fact, the construction of such flexible priors remains an active area of research.

To proceed, we adopted an approach recently proposed in López-Lopera et al. (2018). In brief, the main idea is to construct an $N$-dimensional parametric distribution over functions for which Eq. (45) is satisfied. This distribution, being finite-dimensional, allows for the possibility of tractable conditioning operations, whilst the flexibility to take $N$ arbitrarily large provides a means of ensuring that the salient uncertainty is accurately represented. More specifically, the function $\zeta$ is parametrised as
$$\zeta(r) = \sum_{j=1}^{N} z_j \phi_j(r) \qquad (46)$$
where the $\phi_j$ are basis functions
$$\phi_j(r) = \begin{cases} 1 - \left| \frac{r - t_j}{h} \right|, & \left| \frac{r - t_j}{h} \right| \leq 1 \\ 0, & \text{otherwise} \end{cases}$$
for equally spaced points $t_j$ with increment $h$, as recommended in López-Lopera et al. (2018). A prior on $\zeta$ can be induced via a prior on the coefficients $z_1, \ldots, z_N$, with $N$ taken to be substantially larger than the number $n$ of datapoints on which the $z_i$ are to be conditioned. The specific construction of a prior on the coefficients is required to encode the constraints in Eq. (45) and to admit tractable conditioning; these issues are discussed in the remainder.

First, we consider the boundedness and monotonicity constraints in Eq. (45). At the level of the coefficients, it is straight-forward to check that this requires that the prior support is restricted to the set
$$Z = \{ z \in \mathbb{R}^N : 0 < z_1 \leq z_2 \leq \cdots \leq z_N \leq 1 \}.$$
For convenience, we elected to use a prior that was obtained by restricting a standard Gaussian measure $N(0, I)$ on $\mathbb{R}^N$ to the set $Z$.

Second, we consider how to condition on a dataset. Recall that information is provided on the values of the gradient $\zeta'(r_i) = b_i$ of the function $\zeta$, evaluated at a finite number of locations $r_i$ of the canonical coordinate $r$, together with the initial condition $\zeta(r_0) = b_0$. Thus the information can be described by the linear system of constraints $\Phi z = b$ where
$$\Phi = \begin{bmatrix} \phi_1(r_0) & \ldots & \phi_N(r_0) \\ \phi_1'(r_1) & \ldots & \phi_N'(r_1) \\ \vdots & & \vdots \\ \phi_1'(r_n) & \ldots & \phi_N'(r_n) \end{bmatrix}, \qquad b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_n \end{bmatrix}.$$
The posterior can therefore be characterised as the restriction of $N(0, I)$ to the set $Z \cap D$ where $D = \{ z \in \mathbb{R}^N : \Phi z = b \}$.

Finally, we discuss how posterior computation was performed. The key observation is that an equivalent characterisation of the posterior is first to restrict $N(0, I)$ to $D$ and then to further restrict to $Z$. This is advantageous since the linear nature of the data implies that the restriction $z | D$ of $N(0, I)$ to $D$ is again a Gaussian with a closed form, denoted $N(\mu, \Sigma)$. It is important to note that $\Sigma$ is singular (rank $\rho = N - n - 1$) and so $\Sigma = U \Lambda U^\top$, where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix with $\rho$ non-zero entries on the diagonal. Thus we can express $z | D$ in the form
$$z = \mu + U \Lambda^{1/2} \tilde{z}$$
where $\tilde{z} \sim N(0, I)$ is a standard Gaussian on $\mathbb{R}^\rho$ (retaining only the $\rho$ columns of $U \Lambda^{1/2}$ associated with non-zero eigenvalues). Let $M = U \Lambda^{1/2}$ and let $m_i = [m_{i,1}, \ldots, m_{i,\rho}]$ denote the $i$th row of $M$. Then we have the relation $z \in Z \iff \tilde{z} \in \tilde{Z}$ where
$$\tilde{Z} = \{ \tilde{z} \in \mathbb{R}^\rho : 0 \leq m_1 \tilde{z} + \mu_1, \; 0 \leq (m_{i+1} - m_i)\tilde{z} + (\mu_{i+1} - \mu_i) \text{ for } i = 1, \ldots, N-1, \; 0 \leq -m_N \tilde{z} + (1 - \mu_N) \}$$
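The construction described in this Appendix can be sketched in a few lines of numpy. The basis size, knot locations and data values below are illustrative only (our choices, not the paper's), and the final restriction to the monotonicity set Z, for which exact samplers for truncated Gaussians such as that of Pakman and Paninski (2014) can be used, is omitted:

```python
import numpy as np

N, h = 12, 1.0 / 11
t = np.linspace(0.0, 1.0, N)               # equally spaced knots t_j

def phi(r):
    # hat basis: phi_j(r) = 1 - |r - t_j|/h on |r - t_j| <= h, else 0
    return np.clip(1.0 - np.abs(r - t) / h, 0.0, None)

def dphi(r):
    # derivative of the hat basis (valid away from the knots)
    on = np.abs(r - t) < h
    return np.where(on, -np.sign(r - t) / h, 0.0)

# information: zeta(r0) = 0 plus n gradient values zeta'(r_i) = b_i >= 0
r0 = 0.0
r_obs = np.array([0.23, 0.51, 0.87])       # chosen off the knots
b_obs = np.array([0.5, 1.2, 0.3])
Phi = np.vstack([phi(r0)] + [dphi(r) for r in r_obs])
b = np.concatenate([[0.0], b_obs])

# restriction of N(0, I) to D = {z : Phi z = b} is the Gaussian N(mu, Sigma)
K = Phi @ Phi.T
mu = Phi.T @ np.linalg.solve(K, b)
Sigma = np.eye(N) - Phi.T @ np.linalg.solve(K, Phi)

# Sigma is singular with rank rho = N - n - 1; factor it as M = U Lambda^{1/2}
lam, U = np.linalg.eigh(Sigma)
keep = lam > 1e-10
rho = int(np.sum(keep))
M = U[:, keep] * np.sqrt(lam[keep])

rng = np.random.default_rng(0)
z = mu + M @ rng.standard_normal(rho)      # one draw from z | D
assert rho == N - len(b_obs) - 1
assert np.allclose(Phi @ z, b)             # every draw satisfies the data exactly
```

Draws from z | D satisfy the linear constraints by construction; the further restriction to the ordered set Z would then be imposed on the low-dimensional variable z-tilde.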