A Role for Symmetry in the Bayesian Solution of Differential Equations
Junyang Wang, Jon Cockayne, Chris. J. Oates
Newcastle University, UK; Alan Turing Institute, UK
September 24, 2019
Abstract
The interpretation of numerical methods, such as finite difference methods for differential equations, as point estimators suggests that formal uncertainty quantification can also be performed in this context. Competing statistical paradigms can be considered and Bayesian probabilistic numerical methods (PNMs) are obtained when Bayesian statistical principles are deployed. Bayesian PNMs have the appealing property of being closed under composition, such that uncertainty due to different sources of discretisation in a numerical method can be jointly modelled and rigorously propagated. Despite recent attention, no exact Bayesian PNM for the numerical solution of ordinary differential equations (ODEs) has been proposed. This raises the fundamental question of whether exact Bayesian methods for (in general nonlinear) ODEs even exist. The purpose of this paper is to provide a positive answer for a limited class of ODEs. To this end, we work at a foundational level, where a novel Bayesian PNM is proposed as a proof-of-concept. Our proposal is a synthesis of classical Lie group methods, to exploit underlying symmetries in the gradient field, and non-parametric regression in a transformed solution space for the ODE. The procedure is presented in detail for first and second order ODEs and relies on a certain strong technical condition – existence of a solvable Lie algebra – being satisfied. Numerical illustrations are provided.
Numerical methods underpin almost all of scientific, engineering and industrial output. In the abstract, a numerical task can be formulated as the approximation of a quantity of interest Q : Y → Q, subject to a finite computational budget. The true underlying state y† ∈ Y is typically high- or infinite-dimensional, so that only limited information

A : Y → A   (1)

is provided and exact computation of Q(y†) is prohibited. For example, numerical integration aims to approximate an integral Q(y†) = ∫ y†(t) dt given the values A(y†) = {(x_i, y†(x_i))}_{i=1}^n of the integrand y† on a finite number of abscissae {x_i}_{i=1}^n. Similarly, a numerical approximation to the solution Q(y†) = y† of a differential equation dy/dx = f(x, y(x)), y(x_0) = y_0, must be based on at most a finite number of evaluations of f, the gradient field. In this viewpoint a numerical method corresponds to a map b : A → Q, as depicted in Figure 1a, where b(a) represents an approximation to the solution of the differential equation based on the information a ∈ A.

The increasing ambition and complexity of contemporary applications is such that the computational budget can be extremely small compared to the precision that is required at the level of the quantity of interest. As such, in many important problems it is not possible to reduce the numerical error to a negligible level. Fields acutely associated with this challenge include climate forecasting (Wedi, 2014), computational cardiology (Chabiniok et al., 2016) and molecular dynamics (Perilla et al., 2015). In the presence of non-negligible numerical error, it is unclear how scientific interpretation of the output of computation can proceed. For example, a posteriori analysis of traditional numerical methods can be used to establish hard upper bounds on the numerical error, but these bounds typically depend on an unknown global constant.
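To make the abstraction concrete, here is a minimal sketch (our own illustration, not taken from the paper) of the quadrature example: the information operator A evaluates the latent integrand at finitely many abscissae, and a numerical method is a map b : A → Q, here chosen to be the trapezoidal rule.

```python
import numpy as np

def info(y, xs):
    """Information operator A: evaluate the latent integrand y at finitely many abscissae."""
    return [(x, y(x)) for x in xs]

def b(a):
    """A numerical method is a map b : A -> Q; here b is the trapezoidal rule."""
    xs = np.array([x for x, _ in a])
    ys = np.array([v for _, v in a])
    return float(np.sum((ys[1:] + ys[:-1]) / 2 * np.diff(xs)))

# Latent state y†(x) = x²; quantity of interest Q(y†) = ∫₀¹ y†(t) dt = 1/3.
a = info(np.square, np.linspace(0.0, 1.0, 101))
estimate = b(a)
```

Any other quadrature rule is an equally valid choice of b; the point is only that b acts on the finite information a, never on y† itself.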
In the case of ODEs, this may be the maximum value of a norm ‖f‖ of the gradient field (see e.g. Estep, 1995). If ‖f‖ were known, it would be possible to provide a hard bound on numerical error. However, in the typical numerical context all that is known is that ‖f‖ is finite. One could attempt to approximate ‖f‖ with cubature, but that itself requires a numerical cubature method whose error is required to obey a known bound. In general, therefore, there are no hard error bounds without global information being a priori provided (Larkin, 1974). Our aim in this paper is to consider, as an alternative to traditional numerical analysis, an exact Bayesian framework for solution uncertainty quantification in the ordinary differential equation context.

The field of probabilistic numerics dates back to Larkin (1972) and a modern perspective is provided in Hennig et al. (2015); Oates and Sullivan (2019). Under the abstract framework just described, numerical methods can be interpreted as point estimators in a statistical context, where the state y† can be thought of as a latent variable in a statistical model, and the 'data' consist of information A(y†) that does not fully determine the quantity of interest Q(y†) but is indirectly related to it. Hennig et al. (2015) provide an accessible introduction and survey of the field. In particular, they illustrated how PNM can be used to quantify uncertainty due to discretisation in important scientific problems, such as astronomical imaging.

[Figure 1 diagrams: (a) a commutative triangle relating Y, A and Q through the maps A, Q and b; (b) its probabilistic counterpart relating P_Y, A and P_Q through the maps a ↦ µ^a and a ↦ B(µ, a).]

Figure 1: Diagrams for a numerical method. (a) The traditional viewpoint of a numerical method is equivalent to a map b from a finite-dimensional information space A to the space of the quantity of interest Q.
(b) The probabilistic viewpoint treats approximation of Q(y†) in a statistical context, described by a map B(µ, ·) from A to the space of probability distributions on Q. The probabilistic numerical method (A, B) is Bayesian if and only if (b) is a commutative diagram.

Let the notation Σ_Y denote a σ-algebra on the space Y and let P_Y denote the set of probability measures on (Y, Σ_Y). A probabilistic numerical method (PNM) is a procedure which takes as input a 'belief' distribution µ ∈ P_Y, representing epistemic uncertainty with respect to the true (but unknown) value y†, along with a finite amount of information, A(y†) ∈ A. The output is a distribution B(µ, A(y†)) ∈ P_Q on (Q, Σ_Q), representing epistemic uncertainty with respect to the quantity of interest Q(y†) after the information A(y†) has been processed. For example, a PNM for an ordinary differential equation (ODE) takes an initial belief distribution defined on the solution space of the differential equation, together with information arising from a finite number of evaluations of the gradient field, plus the initial condition of the ODE, to produce a distribution over either the solution space of the ODE, or perhaps some derived quantity of interest. In this paper, the measurability of A and Q will be assumed.

Despite computational advances in this emergent field, until recently there had not been an attempt to establish rigorous statistical foundations for PNM. In Cockayne et al. (2019) the authors argued that Bayesian principles can be adopted. In brief, this framework requires that the input belief distribution µ carries the semantics of a Bayesian agent's prior belief and the output of a PNM agrees with the inferences drawn when the agent is rational. To be more precise recall that, in this paper, information is provided in a deterministic manner through (1) and thus Bayesian inference corresponds to conditioning of µ on the level sets of A.
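As a toy finite-dimensional illustration of this conditioning (a hypothetical example of ours, not from the paper), take Y = R² with a bivariate Gaussian prior µ, information A(y) = y_1 and quantity of interest Q(y) = y_2. The disintegration element µ^a is then the usual Gaussian conditional distribution, and B(µ, a) = Q_#(µ^a) is available in closed form.

```python
import numpy as np

# Illustrative Gaussian prior µ on the latent state y = (y1, y2).
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])

def belief_update(a):
    """B(µ, a) = Q_#(µ^a): condition µ on the level set {y : A(y) = y1 = a},
    then push the Gaussian conditional forward through Q(y) = y2."""
    post_mean = mean[1] + cov[1, 0] / cov[0, 0] * (a - mean[0])
    post_var = cov[1, 1] - cov[1, 0] ** 2 / cov[0, 0]
    return post_mean, post_var

m, v = belief_update(a=1.0)  # belief about Q(y†) after observing A(y†) = 1.0
```

Because both the conditioning and the push-forward are exact here, the pair (A, B) satisfies Definition 1 trivially; the difficulty discussed in the sequel is that for ODEs the analogue of this conditioning step is not available in closed form.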
Let Q_# : P_Y → P_Q denote the push-forward map associated to Q, i.e. Q_#(µ)(S) = µ(Q^{-1}(S)) for all S ∈ Σ_Q. Let {µ^a}_{a ∈ A} ⊂ P_Y denote the disintegration, assumed to exist, of µ ∈ P_Y along the map A. (It is of course possible to perform Bayesian inference in the noisy-data context, but for the ODEs considered in this paper we assume that one can obtain noiseless evaluations of the gradient field. The reader unfamiliar with the concept of a disintegration can treat µ^a as a technical notion of the 'conditional distribution of y given A(y) = a' when reading this work. The disintegration theorem, Thm. 1 of Chang and Pollard (1997), guarantees existence and uniqueness up to an A_#µ-null set under the weak requirement that Y is a metric space, Σ_Y is the Borel σ-algebra, µ is Radon, Σ_A is countably generated and Σ_A contains all singletons {a} for a ∈ A.)

Definition 1. A probabilistic numerical method (A, B) with A : Y → A and B : P_Y × A → P_Q for a quantity of interest Q : Y → Q is Bayesian if and only if B(µ, a) = Q_#(µ^a) for all µ ∈ P_Y and all a ∈ A.

This definition is intuitive; the output of the PNM should coincide with the marginal distribution for Q(y†) according to the disintegration element µ^a ∈ P_Y, based on the information a ∈ A that was provided. The definition is equivalent to the statement that Figure 1b is a commutative diagram. In Cockayne et al. (2019) the map A was termed an information operator and the map B was termed a belief update operator; we adhere to these definitions in our work. The Bayesian approach to PNM confers several important benefits:

• The input µ and output B(µ, a) belief distributions can be interpreted, respectively, as a prior and (marginal) posterior. As such, they automatically inherit the stronger formal semantics and philosophical foundations that underpin the Bayesian framework and, in this sense, are well-understood (see e.g. Gelman and Shalizi, 2013).
• The definition of Bayesian PNM is operational. Thus, if we are presented with a prior µ and information a then there is a unique Bayesian PNM and it is constructively defined.

• The class of Bayesian PNM is closed under composition, such that uncertainty due to different sources of discretisation can be jointly modelled and rigorously propagated. This point will not be discussed further in this work, but we refer the interested reader to Section 5 of Cockayne et al. (2019).

Nevertheless, the strict definition of Bayesian PNM limits scope to design convenient computational algorithms and indeed several proposed PNM are not Bayesian (see Table 1 in Cockayne et al., 2019). The challenge is two-fold; for a Bayesian PNM, the elicitation of an appropriate prior distribution µ and the exact computation of its disintegration {µ^a}_{a ∈ A} must both be addressed. In the next section we argue that – perhaps as a consequence of these constraints – a strictly Bayesian PNM for the numerical solution of an ODE does not yet exist.

The first PNM (of any flavour) for the numerical solution of ODEs, of which we are aware, was due to Skilling (1992). Two decades later, this problem is receiving renewed critical attention as part of the active development of PNM. The aim of this section is to provide a high-level overview of existing work and to argue that existing methods do not adhere to the definition of Bayesian PNM. (Indeed, if the set Y_a = {y ∈ Y : A(y) = a} is not of measure zero under µ, then µ^a is the conditional distribution defined by restricting µ to the subset Y_a; µ^a(y) = 1[y ∈ Y_a] µ(y)/µ(Y_a). The theory of disintegrations generalises the conditional distribution µ^a to cases where Y_a is a null set.)

Notation: The notational convention used in this paper is that the non-italicised y denotes a generic function, whereas the italicised y denotes a scalar value taken by the function y. The notation y† is reserved for the true solution to an ODE.
Throughout, the underlying state space Y is taken to be a space occupied by the true solution of the ODE, i.e. y† ∈ Y.

The first paper on this topic, of which we are aware, was Skilling (1992). This will serve as a prototypical PNM for the numerical solution of an ODE. Originally described as 'Bayesian' by the author, we will argue that, at least in the strict sense of Definition 1, it is not a Bayesian PNM. Consider a generic univariate first-order initial value problem

dy/dx = f(x, y(x)),  x ∈ [x_0, x_T],  y(x_0) = y_0.   (2)

Throughout this paper all ODEs that we consider will be assumed to be well-defined and admit a unique solution y† ∈ Y where Y is some pre-specified set. In this paper the quantity of interest Q(y†) will either be the solution curve y† itself or the value y†(x_T) of the solution at a specific input (in this section it will be the former). The approach outlined in Skilling (1992) allows for a general prior µ ∈ P_Y. The gradient field f is treated as a 'black box' oracle that can be queried at a fixed computational cost. Thus we are provided with evaluations of the gradient field [f(x_0, y_0), ..., f(x_n, y_n)]^⊤ ∈ R^{n+1} for certain input pairs {(x_i, y_i)}_{i=0}^n. This approach of treating evaluations of the gradient field as 'data' will be seen to be a common theme in existing PNM for ODEs and theoretical support for this framework is rooted in the field of information-based complexity (Traub and Woźniakowski, 1992). Let a_i = f(x_i, y_i) and a^i = [a_0, ..., a_i]. The selection of the input pairs (x_i, y_i) on which f is evaluated is not constrained and several possibilities, of increasing complexity, were discussed in Skilling (1992).
To fix ideas, the simplest such approach is to proceed iteratively as follows:

(0.1) The first pair (x_0, y_0) is fully determined by the initial condition of the ODE.

(0.2) The oracle then provides one piece of information, a_0 = f(x_0, y_0).

(0.3) The prior µ is updated according to a_0, leading to a belief distribution µ_0 which is just the disintegration element µ^{a_0}.

(1) A discrete time step x_1 = x_0 + h, where h = (x_T − x_0)/n > 0, is performed and a particular point estimate y_1 = ∫ y(x_1) dµ_0(y) for the unknown true value y†(x_1) is obtained. This specifies the second pair (x_1, y_1).

The process continues similarly, such that at time step i − 1 we have µ_{i−1} = B(µ, a^{i−1}) ∈ P_Y, where the general belief update operator B is yet to be defined, and the following step is performed:

(i) Let x_i = x_{i−1} + h and set y_i = ∫ y(x_i) dµ_{i−1}(y).

The final output is a probability distribution µ_n = B(µ, a^n) ∈ P_Y. Now, strictly speaking, the method just described is not a PNM in the concrete sense that we have defined. Indeed, the final output µ_n is a deterministic function of the values a^n of the gradient field that were obtained. However, in the absence of additional assumptions on the global smoothness of the gradient field, the values of f(x, y) outside any open neighbourhood of the true solution curve C = {(x, y) : y = y†(x), x ∈ [x_0, x_T]} do not determine the solution of the ODE and, conversely, the solution of the ODE provides no information about the values of the gradient field outside any open neighbourhood of the true solution curve C. Thus it is not possible, in general, to write down an information operator A : Y → A that reproduces the information a^n when applied to the solution curve y†(·) of the ODE.

The approach taken in Skilling (1992) was therefore to posit an approximate information operator Â and a particular belief update operator B, which are now described. The approximate information operator is motivated by the intuition that if y†(x_i) is well-approximated by y_i at the abscissa x_i then dy†/dx(x_i) should be well-approximated by f(x_i, y_i). That is, the following approximate information operator Â was constructed:

Â(y) = [dy/dx(x_1), ..., dy/dx(x_n)]^⊤.   (3)

Of course, Â(y†) ≠ a^n in general.
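The iterative scheme above can be sketched with a Gaussian process prior (a minimal illustration of ours; Skilling's construction is more general, and the squared-exponential kernel, length-scale and step size below are arbitrary choices). Value and derivative observations are handled through the appropriate kernel derivatives.

```python
import numpy as np

ELL = 1.0  # squared-exponential length-scale (arbitrary for this sketch)

def k(x1, d1, x2, d2):
    """Prior covariance between observations of y, where d = 0 tags a value
    observation y(x) and d = 1 a derivative observation dy/dx(x)."""
    r = x1 - x2
    base = np.exp(-r ** 2 / (2 * ELL ** 2))
    if (d1, d2) == (0, 0):
        return base
    if (d1, d2) == (1, 0):
        return -r / ELL ** 2 * base
    if (d1, d2) == (0, 1):
        return r / ELL ** 2 * base
    return (1 / ELL ** 2 - r ** 2 / ELL ** 4) * base

def posterior_mean(x, obs):
    """Mean of y(x) under the GP conditioned on obs = [(x_i, d_i, value_i)]."""
    K = np.array([[k(xi, di, xj, dj) for xj, dj, _ in obs] for xi, di, _ in obs])
    kx = np.array([k(x, 0, xj, dj) for xj, dj, _ in obs])
    vals = np.array([v for _, _, v in obs])
    return float(kx @ np.linalg.solve(K + 1e-8 * np.eye(len(obs)), vals))

# Steps (0.1)-(0.3): initial condition plus one oracle call, for dy/dx = -y, y(0) = 1.
f = lambda x, y: -y
x0, y0, h, n = 0.0, 1.0, 0.1, 10
obs = [(x0, 0, y0), (x0, 1, f(x0, y0))]
y = y0
for i in range(1, n + 1):                 # step (i) of the iteration
    x = x0 + i * h
    y = posterior_mean(x, obs)            # point estimate y_i for y†(x_i)
    obs.append((x, 1, f(x, y)))           # query the oracle at the pair (x_i, y_i)
```

At x_T = 1 the point estimate y should be close to the exact value y†(1) = e^{−1}, but, as discussed next, the pairs (x_i, y_i) visited by the loop need not lie on the true solution curve.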
To acknowledge the approximation error, Skilling (1992) proposed to model the information with an approximate likelihood:

dµ_n/dµ_0 (y) = ∏_{i=1}^n dµ_i/dµ_{i−1} (y)   (4)

dµ_i/dµ_{i−1} (y) ∝ exp( −(1/(2σ²)) (dy/dx(x_i) − f(x_i, y_i))² )   (5)

This was referred to in Skilling (1992) as simply a "likelihood" and, together with µ_0 = µ^{a_0}, the output µ_n is completely specified. Here σ is a fixed positive constant, however in principle a non-diagonal covariance matrix can also be considered. The negative consequences of basing inferences on an approximate information operator Â are potentially twofold. First, recall that values of the gradient field that are not contained on the true solution curve of the ODE do not, in principle, determine the true solution curve y†. It is therefore unclear if these values should be taken into account at all. Second, in the special case where the gradient field f does not depend on its second argument, the quantities dy/dx(x_i) and f(x_i, y_i) are identical. From this perspective, µ_n represents inference under a mis-specified likelihood, since information is treated as erroneous when it is in fact exact. The use of a mis-specified likelihood violates the likelihood principle and implies violation of the Bayesian framework. This confirms, through a different argument, that the approach of Skilling (1992) cannot be Bayesian in the strict sense of Definition 1.

1.2.2 Schober et al. (2014, 2019); Teymur et al. (2016, 2018)

After Skilling (1992), several authors have proposed improvements to the above method. The approach of Schober et al. (2014) considered Eq. (5) in the σ ↓ 0 limit, in which µ was restricted to be a k-times integrated Wiener measure on the solution space of the ODE.
The tractability of the integrated Wiener measure leads to a closed-form characterisation of the posterior and enables computation to be cast as a Kalman filter.

This direction of research can be motivated by the following fact: For k ∈ {1, 2} the authors prove that if the input pair (x_1, y_1) is taken as y_1 = ∫ y(x_1) dµ_0(y), as indicated in Section 1.2.1, then the smoothing estimate ŷ_1 = ∫ y(x_1) dµ_1(y), i.e. the posterior mean for y(x_1) based on information a^1, coincides with the deterministic approximation to y†(x_1) that would be provided by a k-th order Runge-Kutta method. As such, theoretical guarantees such as local convergence order are inherited. For k = 3 it was shown that the same conclusion can be made to hold, provided that the input pair (x_1, y_1) is selected in a manner that is no longer obviously related to µ_0. In all cases the identification with a classical Runge-Kutta method does not extend beyond iteration n = 1. Similar connections to multistep methods of Nordsieck and Adams form were identified, respectively, in Schober et al. (2019) and Teymur et al. (2016, 2018). The approach of Schober et al. (2014) is not Bayesian in the sense of Definition 1, which can again be deduced from dependence on values of the gradient field away from the true solution curve, so that the likelihood principle is violated.

The subsequent work of Kersting and Hennig (2016) attempted to elicit an appropriate non-zero covariance matrix for use in Eq. (5), in order to encourage uncertainty estimates to be better calibrated. Their proposal consisted of the approximate likelihood

dµ_i/dµ_{i−1} (y) ∝ exp( −(1/(2σ_i²)) (dy/dx(x_i) − m_i)² )   (6)

m_i = ∫ f(x_i, y(x_i)) dµ_{i−1}(y)   (7)

σ_i² = ∫ (f(x_i, y(x_i)) − m_i)² dµ_{i−1}(y).   (8)

This can be viewed as the predictive marginal likelihood for the value f(x_i, y(x_i)) based on µ_{i−1}. From a practical perspective, the approach is somewhat circular as the integrals in Eq.
(7) and (8) involve the black-box gradient field f and therefore cannot be computed. The authors suggested a number of ways that these quantities could be numerically approximated, which involve evaluating f(x_i, y_i) at one or more values y_i that must be specified. One such method is
Bayesian quadrature, another PNM wherein the integrand f is modelled as uncertain until it is evaluated. This raises separate philosophical challenges, as one must then ensure that the statistical models used for y(·) and f(x_i, ·) are logically consistent; in Kersting and Hennig (2016) these functions were simply modelled as independent.

The original work of Chkrebtii et al. (2016) is somewhat related to Kersting and Hennig (2016), however instead of using the mean of the current posterior as input to the gradient field, the input pair (x_i, y_i) was selected by sampling y_i from the marginal distribution for y(x_i) implied by µ_{i−1}. The approximate likelihood in this approach was taken as follows:

dµ_i/dµ_{i−1} (y) ∝ exp( −(1/(2σ_i²)) (dy/dx(x_i) − f(x_i, y_i))² )

m_i = ∫ dy/dx(x_i) dµ_{i−1}(y)

σ_i² = ∫ (dy/dx(x_i) − m_i)² dµ_{i−1}(y).

Compared to Eq. (6), (7) and (8), this approach does not rely on integrals over the unknown gradient field. However, the approach also relies on the approximate information operator in Eq. (3) and is thus not Bayesian according to Definition 1.
The approaches proposed in Conrad et al. (2017); Abdulle and Garegnani (2018) are not motivated in the Bayesian framework, but instead seek to introduce a stochastic perturbation into a classical numerical method. Both methods focus on the quantity of interest Q(y†) = y†(x_T). In the simple context of Eq. (2), the method of Conrad et al. (2017) augments the explicit Euler method with a stochastic perturbation:

y_i = y_{i−1} + h f(x_{i−1}, y_{i−1}) + h ε_i,  x_i = x_{i−1} + h,  i = 1, ..., n.

The distribution of the sequence (ε_i)_{i=1}^n must be specified. In the simplest case where the ε_i are modelled as independent, say with ε_i ∼ ρ, the canonical flow map Φ_i : R → R of the explicit Euler method, defined as Φ_i(z) = z + h f(x_i, z), is replaced by the probabilistic counterpart Ψ_i : P_R → P_R given by

Ψ_i(ν)(dz) = ∫ ρ( (dz − Φ_i(z̃))/h ) ν(dz̃)

through which stochasticity can be propagated. The output of the method of Conrad et al. (2017) is then B = Ψ_n ∘ ··· ∘ Ψ_1 δ(y_0), where δ(z) denotes an atomic distribution on z ∈ R. If ρ has zero mean, it can be shown that the mean of B equals Φ_n ∘ ··· ∘ Φ_1(y_0), which is exactly the deterministic approximation produced with the explicit Euler method.

This framework can be practically problematic, since ε_i is charged with modelling the extrapolation error and such errors are not easily modelled as independent random variables – Section 2.8 of Higham (2002) is devoted to this point. Thus, if for example f(x, y) = y, the true linearisation error at step i is e^{x_i} − e^{x_{i−1}}, so that the 'true' sequence (ε_i)_{i=1}^n in this case is monotonic and exponentially unbounded.
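A minimal simulation of this perturbed Euler scheme (our own sketch; Conrad et al. (2017) couple the scale of ε_i to the step size and order of the method, whereas here the scale is simply set to h):

```python
import numpy as np

def perturbed_euler(f, x0, y0, h, n, scale, rng):
    """Explicit Euler with an additive stochastic perturbation:
    y_i = y_{i-1} + h f(x_{i-1}, y_{i-1}) + h eps_i, with eps_i iid N(0, scale^2)."""
    y = y0
    for i in range(n):
        y = y + h * f(x0 + i * h, y) + h * rng.normal(0.0, scale)
    return y

# dy/dx = y, y(0) = 1 on [0, 1], so Q(y†) = y†(1) = e.
rng = np.random.default_rng(0)
h, n = 0.01, 100
samples = [perturbed_euler(lambda x, y: y, 0.0, 1.0, h, n, scale=h, rng=rng)
           for _ in range(500)]
# With zero-mean eps_i, the Monte Carlo mean recovers the deterministic
# explicit Euler value Phi_n ∘ ... ∘ Phi_1(y_0) = (1 + h)^n.
euler_value = (1.0 + h) ** n
```

The spread of `samples` is the method's uncertainty quantification; the criticism above is that the independent, identically distributed ε_i cannot reproduce the monotone, exponentially growing error sequence of this example.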
The challenge of designing a stochastic model for the sequence (ε_i)_{i=1}^n that reflects the highly structured nature of the error remains unresolved. On the other hand, the mathematical properties of this method are now well-understood (Lie et al., 2018, 2019). The proposal of Abdulle and Garegnani (2018) was to instead consider randomisation of the inputs {x_i}_{i=0}^T in the context of a classical numerical method, also outside of the Bayesian framework.

The survey just presented begs the question of whether a Bayesian PNM for ODEs can exist at all. A first step toward this goal was taken in Cockayne et al. (2019), where it was argued that an information operator can be constructed if the vector field f is brought to the left-hand-side in Eq. (2). Specifically, they proposed the information operator

Ã(y) = [dy/dx(x_0) − f(x_0, y(x_0)), ..., dy/dx(x_n) − f(x_n, y(x_n))]^⊤

for which the 'data' are trivial; ã^n = 0. It was rigorously established that the approximate likelihood

dµ_{i,σ}/dµ_{i−1,σ} (y) = exp( −(1/(2σ²)) (dy/dx(x_i) − f(x_i, y(x_i)))² )

leads to an exact Bayesian PNM in the limit: µ_{n,σ} →_F µ^{ã^n} as σ ↓ 0, for A_#µ-almost all ã^n ∈ R^{n+1}. Here →_F denotes convergence in an integral probability metric defined by a suitable set F of test functions (see Sec. 4 of Cockayne et al., 2019). However, the dependence of the information operator Ã on f means that this cannot be used as the basis for a practical method. Indeed, unless f depends linearly on its second argument and conjugacy properties of the prior can be exploited, the posterior cannot easily be characterised. Approximate techniques from nonlinear filtering were proposed to address this challenge in Tronarp et al. (2019).

The comprehensive literature review in the previous section reveals not only that no Bayesian PNM has yet been proposed, but also that such an endeavour may be fundamentally difficult.
Indeed, a theme that has emerged with existing PNM for ODEs, which can be traced back to Skilling (1992), is the use of approximate and subjective forms for the likelihood. The complex, implicit relationship between the latent ODE solution y† and the data f(x_i, y_i) arising from the gradient field appears to preclude use of an exact likelihood. Of course, violation of the likelihood principle is not traditionally a concern in the design of a numerical method, yet if the strictly richer output that comes with a Bayesian PNM is desired, then clearly adherence to the likelihood principle is important. It is therefore natural to ask the question, "under what conditions can exact Bayesian inference for ODEs be made?".

This paper presents a proof-of-concept PNM for the numerical solution of a (limited) class of ODEs that is both (a) Bayesian in the sense of Definition 1 and (b) can in principle be implemented. The method being proposed is indicated in Figure 2 and its main properties are as follows:

• The classical theory of Lie groups is exploited, for the first time in the context of PNM, to understand when an ODE of the form in Eq. (2) can be transformed into an ODE whose gradient field is a function of the independent state variable only, reducing the ODE to an integral.

• For ODEs that admit a solvable Lie algebra, our proposal can be shown to simultaneously perform exact Bayesian inference on both the original and the Lie-transformed ODE. Crucially, as we explain later, to identify a Lie algebra only high-level a priori information about the ODE is required. The case of first- and second-order ODEs is presented in detail, but the method itself is general.

• In general the specification of prior belief can be difficult. The prior distributions that we construct are guaranteed to respect aspects of the structure of the ODE. As such, our priors are, to some extent, automatically adapted to the ODE at hand as opposed to being arbitrarily posited.
• In addition to the benefits conferred in the Bayesian framework, detailed in Section 1.1 and in Cockayne et al. (2019), the method being proposed can be computationally realised. On the other hand, there is a cost in terms of the run-time of the method that is substantially larger than existing, non-Bayesian approaches (especially classical numerical methods). As such, we consider this work to be a proof-of-concept rather than an applicable Bayesian PNM.

The remainder of the paper is structured as follows: Section 2 is dedicated to a succinct review of Lie group methods for ODEs. In Section 3, Lie group methods are exploited to construct priors over the solution space of the ODE whenever a solvable Lie algebra is admitted and exact Bayesian inference is performed on a transformed version of the original ODE which takes the form of an integral. Numerical experiments are reported in Section 4. Finally, some conclusions and recommendations for future research are drawn in Section 5.

[Figure 2 schematic: the ODE dy/dx = f(x, y(x)) is mapped by the Lie transform (x, y) ↦ (r, s) to ds/dr = g(r), to which an exact Bayesian PNM is applied; the result is mapped back by the inverse transform (r, s) ↦ (x, y).]

Figure 2: Schematic of our proposed approach. An n-th order ODE that admits a solvable Lie algebra can be transformed into n integrals, to which exact Bayesian probabilistic numerical methods can be applied. The posterior measure on the transformed space is then pushed back through the inverse transformation onto the original domain of interest.

This section provides a succinct overview of classical Lie group methods, introduced in the 19th century by Sophus Lie in the differential equation context (Hawkins, 2012). Lie developed the fundamental notion of a
Lie group of transformations, which roughly corresponds to maps that take one solution of the ODE to another. This provided a formal generalisation of certain algebraic techniques, such as dimensional analysis and transformations based on spatial symmetries, that can sometimes be employed to algebraically reduce the order of an ODE.

This section proceeds as follows: First, in Section 2.1 we introduce a one-parameter Lie group of transformations and then, in Section 2.2, we explain what it means for a curve or a surface to be transformation-invariant. In Section 2.3 we recall consequences of Lie's theory in the ODE context. Last, in Section 2.4 the generalisation to multi-parameter Lie groups is indicated. Our development is heavily influenced by Bluman and Anco (2002) and we refer the reader to their book when required.
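Before the formal development, the transform–solve–invert pipeline of Figure 2 can be illustrated on a toy problem (our own example, not one from the paper): the ODE dy/dx = y/x + 1 is invariant under the scaling (x, y) ↦ (λx, λy), and the change of variables (r, s) = (x, y/x) reduces it to ds/dr = 1/r, a gradient field depending on the independent variable alone, i.e. an integral.

```python
import numpy as np

# Original ODE: dy/dx = y/x + 1 on [1, 2] with y(1) = 1; exact y(x) = x (1 + ln x).
g = lambda r: 1.0 / r  # transformed gradient field: ds/dr = g(r)

# Solve the transformed problem s(r) = s(1) + ∫_1^r g(t) dt by quadrature
# (plain trapezoidal rule here, standing in for the Bayesian quadrature step).
rs = np.linspace(1.0, 2.0, 201)
gs = g(rs)
s = 1.0 + np.concatenate(([0.0],
                          np.cumsum((gs[1:] + gs[:-1]) / 2 * np.diff(rs))))

# Invert the transform: (r, s) -> (x, y) = (r, r s).
xs, ys = rs, rs * s
```

In the method proposed later, the quadrature step is replaced by exact Bayesian inference, and the resulting posterior measure (rather than a single curve) is pushed back through the inverse transform.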
The purpose of this section is to recall essential definitions, together with the first fundamental theorem of Lie, which relates a Lie group of transformations to its infinitesimal generator. In what follows we consider a fixed domain D ⊂ R^d and denote a generic state variable as x = (x_1, ..., x_d) ∈ D.

Definition 2 (One-Parameter Group of Transformations). A one-parameter group of transformations on D is a map X : D × S → D, defined on D × S for some set S ⊂ R, together with a bivariate map φ : S × S → S, such that the following hold:
(1) For each ε ∈ S, the transformation X(·, ε) is a bijection on D.

(2) (S, φ) forms a group with law of composition φ.

(3) If ε_0 is the identity element in (S, φ), then X(·, ε_0) is the identity map on D.

(4) For all x ∈ D and ε, δ ∈ S, if x* = X(x, ε) and x** = X(x*, δ), then x** = X(x, φ(ε, δ)).

In what follows we continue to use the shorthand notation x* = X(x, ε). The notion of a Lie group additionally includes smoothness assumptions on the maps that constitute a group of transformations. Recall that a real-valued function is analytic if it can be locally expressed as a convergent power series.
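These axioms can be checked numerically for a concrete case; the sketch below (ours, not from the paper) uses rotations of the plane, for which the composition law is φ(ε, δ) = ε + δ.

```python
import numpy as np

def X(x, eps):
    """One-parameter transformation: rotate the point x ∈ R² by angle eps."""
    c, s = np.cos(eps), np.sin(eps)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

phi = lambda eps, delta: eps + delta  # group composition law on S = R

x = np.array([0.3, -1.2])
eps, delta = 0.7, -0.4
identity_ok = bool(np.allclose(X(x, 0.0), x))              # axiom (3): eps = 0 acts as identity
composition_ok = bool(np.allclose(X(X(x, eps), delta),     # axiom (4): composing actions
                                  X(x, phi(eps, delta))))  # matches the group law
```

Each X(·, ε) is also a bijection (rotate back by −ε), so axioms (1)–(4) all hold for this family.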
Definition 3 (One-Parameter Lie Group of Transformations). Let X, together with φ, form a one-parameter group of transformations on D. Then we say that X, together with φ, form a one-parameter Lie group of transformations on D if, in addition, the following hold:

(5) S is a (possibly unbounded) interval in R.

(6) For each ε ∈ S, X(·, ε) is infinitely differentiable in D.

(7) For each x ∈ D, X(x, ·) is an analytic function on S.

(8) φ is analytic in S × S.

Without loss of generality it will be assumed, through re-parametrisation if required, that S contains the origin and ε = 0 is the identity element in (S, φ). The definition is illustrated through three examples:
Example 1 (Translation in the x_1-Axis). The one-parameter transformation x_1* = x_1 + ε, x_2* = x_2, for ε ∈ R, forms a Lie group of transformations on D = R² with group composition law φ(ε, δ) = ε + δ.

Example 2 (Rotation Group). The one-parameter transformation x_1* = x_1 cos(ε) − x_2 sin(ε), x_2* = x_1 sin(ε) + x_2 cos(ε), for ε ∈ R, again forms a Lie group of transformations on D = R² with group composition law φ(ε, δ) = ε + δ.

Example 3 (Cyclic Group C_p). Let D = {0, 1, 2, ..., p − 1}. Let S = Z. For n ∈ D and m ∈ S, let X(n, m) = n + m (mod p). Then X, together with φ(a, b) = a + b, defines a one-parameter group of transformations on D, but is not a Lie group of transformations since (5) is violated.

The first fundamental theorem of Lie establishes that a Lie group of transformations can be characterised by its infinitesimal generator, defined next:
Definition 4 (Infinitesimal Transformation). Let X be a one-parameter Lie group of transformations. Then the transformation

x* = x + ε ξ(x),  ξ(x) = ∂X(x, ε)/∂ε |_{ε=0},

is called the infinitesimal transformation associated to X and the map ξ is called an infinitesimal.

Definition 5 (Infinitesimal Generator). The infinitesimal generator of a one-parameter Lie group of transformations X is defined to be the operator X = ξ · ∇, where ξ is the infinitesimal associated to X and ∇ = (∂/∂x_1, ∂/∂x_2, ..., ∂/∂x_d) is the gradient.

Example 4 (Ex. 1, continued). For Ex. 1, we have

ξ(x) = ( dx_1*/dε |_{ε=0}, dx_2*/dε |_{ε=0} ) = (1, 0),

so the infinitesimal generator for translation in the x_1-axis is X = ∂/∂x_1.

Example 5 (Ex. 2, continued). Similarly, the infinitesimal generator for the rotation group is
X = − x ∂∂x + x ∂∂x . The first fundamental theorem of Lie provides a constructive route to obtain the infinitesimalgenerator from the transformation itself:
Theorem 1 (First Fundamental Theorem of Lie; see pages 39-40 of Bluman and Anco (2002)). A one-parameter Lie group of transformations X is characterised by the initial value problem
dx*/dτ = ξ(x*),   x* = x when τ = 0,   (9)
where τ(ε) is a re-parametrisation of ε which satisfies τ(0) = 0 and, for ε ≠ 0,
τ(ε) = ∫_0^ε ∂φ(a, b)/∂b |_{(a,b) = (δ^{-1}, δ)} dδ.
Here δ^{-1} denotes the group inverse element for δ.
Since Eq. (9) is translation-invariant in τ, it follows that without loss of generality we can assume a parametrisation τ(ε) such that the group action becomes φ(τ_1, τ_2) = τ_1 + τ_2 and, in particular, τ^{-1} = −τ. In the remainder of the paper, for convenience, we assume that all Lie groups are parametrised such that the group action is φ(ε_1, ε_2) = ε_1 + ε_2.
The next result can be viewed as a converse to Theorem 1, as it shows how to obtain the transformation from the infinitesimal generator. All proofs are reserved for Supplemental Section A.1.
Theorem 2.
A one-parameter Lie group of transformations with infinitesimal generator X is equivalent to x* = e^{εX} x, where e^{εX} x = Σ_{k=0}^∞ (ε^k / k!) X^k x.
The following is immediate from the proof of Theorem 2:
Corollary 1. If F is infinitely differentiable, then F(x*) = e^{εX} F(x).
2.2 Invariance Under Transformation
In this section we explain what it means for a curve or a surface to be invariant under a Lie group of transformations, and how this notion relates to the infinitesimal generator.
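As a concrete check of Theorem 2 and Corollary 1, the following sketch (our own illustration; the helper names are not from the paper) uses sympy to sum the Lie series e^{εX} x term by term for the rotation group of Ex. 2 and compares the truncation against the Taylor expansion of the exact rotated coordinate.

```python
# Sketch: verify Theorem 2 / Corollary 1 for the rotation group of Ex. 2 by
# summing the Lie series e^{eps X} F = sum_k eps^k/k! X^k F and comparing
# with the Taylor expansion of the exactly transformed quantity.
import sympy as sp

x1, x2, eps = sp.symbols("x1 x2 epsilon")

def X(F):
    """Infinitesimal generator of the rotation group: X = -x2 d/dx1 + x1 d/dx2."""
    return -x2 * sp.diff(F, x1) + x1 * sp.diff(F, x2)

def lie_series(F, order):
    """Truncation of e^{eps X} F at the given order in eps."""
    term, total = F, F
    for k in range(1, order + 1):
        term = X(term)                       # X^k F, built recursively
        total += eps**k / sp.factorial(k) * term
    return sp.expand(total)

N = 8
# Exact rotated coordinate x1* = x1 cos(eps) - x2 sin(eps), expanded to order N.
exact = sp.series(x1 * sp.cos(eps) - x2 * sp.sin(eps), eps, 0, N + 1).removeO()
assert sp.simplify(lie_series(x1, N) - sp.expand(exact)) == 0
```

Applying the same series to the invariant F(x) = x_1² + x_2² (for which X F = 0) returns F unchanged, which is Corollary 1 in its simplest form.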
Definition 6 (Invariant Function). A function F : D → R is said to be invariant under a one-parameter Lie group of transformations x* = X(x, ε) if F(x*) = F(x) for all x ∈ D and ε ∈ S.
Based on the results in Section 2.1, one might expect that invariance under a transformation can be expressed in terms of the infinitesimal generator of the transformation. This is indeed the case:
Theorem 3.
A differentiable function F : D → R is invariant under a one-parameter Lie group of transformations with infinitesimal generator X if and only if X F(x) = 0 for all x ∈ D.
Theorem 4. For a function F : D → R and a one-parameter Lie group of transformations x* = X(x, ε), the relation F(x*) = F(x) + ε holds for all x ∈ D and ε ∈ S if and only if X F(x) = 1 for all x ∈ D.
The following definition is fundamental to the method proposed in Section 3:
Definition 7 (Canonical Coordinates). Consider a coordinate system r = (r_1(x), ..., r_n(x)) on D. Then any one-parameter Lie group of transformations x* = X(x, ε) induces a transformation of the coordinates, r_i* = r_i(x*). The coordinate system r is called canonical for the transformation if r_1* = r_1, ..., r_{n−1}* = r_{n−1} and r_n* = r_n + ε.
Example 6 (Ex. 2, continued). For the rotation group in Ex. 2, we have canonical coordinates r_1(x_1, x_2) = (x_1² + x_2²)^{1/2}, r_2(x_1, x_2) = arctan(x_2 / x_1).
In canonical coordinates, a one-parameter Lie group of transformations can be viewed as a straightforward translation in the r_n-axis. The existence of canonical coordinates is established in Thm. 2.3.5-2 of Bluman and Anco (2002). Note that Thms. 3 and 4 imply that X r_i = 0 for i = 1, 2, ..., n − 1, and X r_n = 1.
Definition 8 (Invariant Surface). For a function F : D → R, a surface defined by F(x) = 0 is said to be invariant under a one-parameter Lie group of transformations x* = X(x, ε) if and only if F(x*) = 0 whenever F(x) = 0, for all x ∈ D and ε ∈ S.
The invariance of a surface, as for a function, can be cast in terms of an infinitesimal generator:
Corollary 2.
A surface F(x) = 0 is invariant under a one-parameter Lie group of transformations with infinitesimal generator X if and only if X F(x) = 0 whenever F(x) = 0.
2.3 Symmetry Methods for ODEs
The aim of this section is to relate Lie transformations to the ODEs for which these transformations are admitted. These techniques form the basis for our proposed method in Section 3. For an ODE of the form in Eq. (2), one can consider the action of a transformation on the coordinates (x, y); i.e. a special case of the above framework in which the generic coordinates x_1 and x_2 are respectively the independent (x) and dependent (y) variables of the ODE. It is clear that such a transformation also implies some kind of transformation of the derivatives y_m := d^m y / dx^m. Indeed, consider a one-parameter Lie group of transformations (x*, y*) = (X(x, y; ε), Y(x, y; ε)). Then it follows from the chain rule that y_m* := d^m y* / d(x*)^m is a function of x, y, y_1, ..., y_m, and we denote y_m* = Y_m(x, y, y_1, ..., y_m; ε). As an explicit example:
y_1* = dy*/dx* = [ ∂Y(x, y; ε)/∂x + y_1 ∂Y(x, y; ε)/∂y ] / [ ∂X(x, y; ε)/∂x + y_1 ∂X(x, y; ε)/∂y ] =: Y_1(x, y, y_1; ε).
In general:
y_m* = [ ∂y*_{m−1}/∂x + y_1 ∂y*_{m−1}/∂y + y_2 ∂y*_{m−1}/∂y_1 + ... + y_m ∂y*_{m−1}/∂y_{m−1} ] / [ ∂X(x, y; ε)/∂x + y_1 ∂X(x, y; ε)/∂y ] =: Y_m(x, y, y_1, ..., y_m; ε).
In this sense a transformation defined on (x, y) can be naturally extended to a transformation on (x, y, y_1, y_2, ...), as required.
Definition 9 (Admitted Transformation). An m-th order ODE F(x, y, y_1, ..., y_m) = 0 is said to admit a one-parameter Lie group of transformations (x*, y*) = (X(x, y; ε), Y(x, y; ε)) if the surface F defined by the ODE is invariant under the Lie group of transformations, i.e. if F(x*, y*, y_1*, ..., y_m*) = 0 whenever F(x, y, y_1, ..., y_m) = 0.
Example 7.
Clearly any ODE of the form dy/dx = F(x) admits the transformation (x*, y*) = (x, y + ε).
Our next task is to understand how the infinitesimal generator of a transformation can be extended to act on the derivatives y_m.
Definition 10 (Extended Infinitesimal Transformation). The m-th extended infinitesimals of a one-parameter Lie group of transformations (x*, y*) = (X(x, y; ε), Y(x, y; ε)) are defined as the functions ξ, η, η^(1), ..., η^(m) for which the following equations hold:
x* = X(x, y; ε) = x + ε ξ(x, y) + O(ε²)
y* = Y(x, y; ε) = y + ε η(x, y) + O(ε²)
y_1* = Y_1(x, y, y_1; ε) = y_1 + ε η^(1)(x, y, y_1) + O(ε²)
...
y_m* = Y_m(x, y, y_1, ..., y_m; ε) = y_m + ε η^(m)(x, y, y_1, y_2, ..., y_m) + O(ε²).
It can be shown straightforwardly, via induction, that
η^(m)(x, y, y_1, y_2, ..., y_m) = d^m η/dx^m − Σ_{k=1}^{m} [ m! / ((m − k)! k!) ] y_{m−k+1} d^k ξ/dx^k,   (10)
where d/dx denotes the full derivative with respect to x, i.e. d/dx = ∂/∂x + y_1 ∂/∂y + Σ_{k=2}^{m+1} y_k ∂/∂y_{k−1}. It follows that η^(m) is a polynomial in y_1, y_2, ..., y_m whose coefficients are linear combinations of ξ, η and their partial derivatives up to m-th order.
Definition 11 (Extended Infinitesimal Generator). The m-th extended infinitesimal generator is defined as
X^(m) = ξ_m(x, y, y_1, ..., y_m) · ∇_m = ξ(x, y) ∂/∂x + η(x, y) ∂/∂y + η^(1)(x, y, y_1) ∂/∂y_1 + ... + η^(m)(x, y, y_1, ..., y_m) ∂/∂y_m,
where ∇_m = (∂/∂x, ∂/∂y, ∂/∂y_1, ..., ∂/∂y_m) is the extended gradient.
The following corollaries are central to the actual computation of the admitted Lie groups of an ODE.
Corollary 3.
A differentiable function F : D_m → R, where D_m is the phase space containing elements of the form (x, y, y_1, ..., y_m), is invariant under a one-parameter Lie group of transformations with extended infinitesimal generator X^(m) if and only if X^(m) F(x, y, y_1, ..., y_m) = 0 for all (x, y, y_1, ..., y_m) ∈ D_m.
Corollary 4 (Infinitesimal Criterion for Symmetries Admitted by an ODE). A one-parameter Lie group of transformations is admitted by the m-th order ODE F(x, y, y_1, ..., y_m) = 0 if and only if its extended infinitesimal generator X^(m) satisfies X^(m) F(x, y, y_1, ..., y_m) = 0 whenever F(x, y, y_1, ..., y_m) = 0.
To leverage the full power of Lie symmetry methods for ODEs of order m ≥ 2, we need to consider multiple Lie symmetries, which are collectively described by a Lie algebra. Fortunately, the notion of a multi-parameter Lie group of transformations is a natural generalisation of the one-parameter case. Thus, this last section of background material concerns the generalisation of the definitions in Section 2.1 to the case of a multi-parameter Lie group. The associated Lie algebra will also be defined.
Definition 12 (Multi-Parameter Lie Group of Transformations). The set of transformations x* = X(x, ε), where x_i* = X_i(x, ε) and ε = (ε_1, ε_2, ..., ε_r) ∈ S ⊂ R^r, is called an r-parameter Lie group of transformations if it satisfies the same axioms as in the one-parameter case, but with law of composition φ(ε, δ) = (φ_1(ε, δ), ..., φ_r(ε, δ)) and (without loss of generality) ε = (0, 0, ..., 0) as the group identity element.
Definition 13 (Infinitesimal Matrix). The appropriate generalisation of the infinitesimal transformation is the infinitesimal matrix Ξ = [ξ_ij], whose entries are defined as ξ_ij(x) = ∂X_j(x, ε)/∂ε_i |_{ε=0}.
Definition 14 (Infinitesimal Generator). An r-parameter Lie group of transformations is associated with r infinitesimal generators, X_i, defined as X_i = X_i(x) = Σ_{j=1}^{n} ξ_ij(x) ∂/∂x_j.
The first fundamental theorem of Lie can be generalised to the multi-parameter case. In particular, it can be shown that an r-parameter Lie group of transformations is characterised by the set of its r infinitesimal generators. The generalisation is straightforward and so, for brevity, we refer the reader to pages 39-40 of Bluman and Anco (2002).
Next we explain how the collection of infinitesimal generators forms a Lie algebra. This relies on the basic facts that the set 𝒟 of differential operators on D is a vector space over R (i.e. λX + μY ∈ 𝒟 for all X, Y ∈ 𝒟 and λ, μ ∈ R) and that differential operators can be composed (i.e. XY ∈ 𝒟 for all X, Y ∈ 𝒟).
Definition 15 (Commutator). The commutator of two infinitesimal generators X_i and X_j is defined as [X_i, X_j] = X_i X_j − X_j X_i.
Theorem 5 (Second Fundamental Theorem of Lie; see page 78 of Bluman and Anco (2002)). Consider an r-parameter Lie group of transformations and let L denote the linear span of the infinitesimal generators X_1, ..., X_r in 𝒟. Let [·, ·] : L × L → 𝒟 denote the unique bilinear operator that agrees with Def. 15 on the set of infinitesimal generators, i.e.
[ Σ_{i=1}^{r} λ_i X_i , Σ_{j=1}^{r} μ_j X_j ] = Σ_{i=1}^{r} Σ_{j=1}^{r} λ_i μ_j (X_i X_j − X_j X_i).   (11)
Then [·, ·] maps into L; i.e. the right-hand side of Eq. (11) belongs to L for all λ, μ ∈ R^r.
Example 8.
Consider the two-parameter group of transformations on D = R² given by (x*, y*) = (x + xε + x²δ, y + yε + y²δ). The infinitesimal generators corresponding to δ and ε, respectively, are
X_1 = x² ∂/∂x + y² ∂/∂y,   X_2 = x ∂/∂x + y ∂/∂y.
It can be directly verified that [X_1, X_2] = −X_1.
The space L, defined in Thm. 5, satisfies the properties of an r-dimensional Lie algebra, defined next:
Definition 16 (Lie Algebra). An r-dimensional vector space L over R, together with a bilinear operator [·, ·] : L × L → L, is called an r-dimensional Lie algebra if the following hold:
(1) Alternativity: [X, X] = 0 for all X ∈ L.
(2) Jacobi identity: [X, [Y, Z]] + [Y, [Z, X]] + [Z, [X, Y]] = 0 for all X, Y, Z ∈ L.
In general, for the methods presented in Section 3 to be applied, existence of an n-parameter Lie group of transformations is not in itself sufficient; we require the existence of an n-dimensional solvable Lie sub-algebra, defined next:
Definition 17 (Normal Lie Sub-algebra). Consider a Lie sub-algebra J of a Lie algebra L with bilinear operator [·, ·]; i.e. a subset J ⊂ L which, when equipped with the restriction of [·, ·] to J × J, is itself a Lie algebra so that, in particular, [X, Y] ∈ J for all X, Y ∈ J. Then J is said to be normal if, in addition, [X, Y] ∈ J for all X ∈ J, Y ∈ L.
Definition 18 (Solvable Lie Algebra). An r-dimensional Lie algebra L is called solvable if there exists a chain of sub-algebras L_1 ⊂ L_2 ⊂ ... ⊂ L_{r−1} ⊂ L_r =: L such that L_{i−1} is a normal sub-algebra of L_i for i = 2, 3, ..., r.
For low-order ODEs, the existence requirement for an admitted Lie group of transformations is more restrictive than the requirement that the associated Lie algebra is solvable. Indeed, we have the following result:
Theorem 6.
All two-dimensional Lie algebras are solvable.
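Theorem 6 can be illustrated on a concrete two-dimensional algebra. The sketch below is our own illustration, taking generators of the form appearing in Ex. 8, X_1 = x² ∂/∂x + y² ∂/∂y and X_2 = x ∂/∂x + y ∂/∂y, and verifying symbolically that [X_1, X_2] = −X_1; since the commutator lies in span{X_1}, the chain span{X_1} ⊂ span{X_1, X_2} exhibits solvability.

```python
# Sketch: verify [X1, X2] = -X1 for a two-dimensional Lie algebra of the kind
# in Ex. 8, by applying the operators to a generic smooth function f(x, y).
import sympy as sp

x, y = sp.symbols("x y")
f = sp.Function("f")(x, y)

def X1(F):
    """Generator X1 = x^2 d/dx + y^2 d/dy."""
    return x**2 * sp.diff(F, x) + y**2 * sp.diff(F, y)

def X2(F):
    """Generator X2 = x d/dx + y d/dy (the scaling generator)."""
    return x * sp.diff(F, x) + y * sp.diff(F, y)

# Commutator [X1, X2] f = X1(X2 f) - X2(X1 f); it should equal -X1 f.
comm = sp.simplify(X1(X2(f)) - X2(X1(f)))
assert sp.simplify(comm + X1(f)) == 0
```

Because f is a generic undefined function, the identity holds as an operator identity, not merely for one test function.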
This completes our review of background material. The exact Bayesian PNM developed in Section 3 for an n-th order ODE requires the existence of an admitted n-parameter Lie group of transformations with a solvable Lie algebra. In practice we therefore require some high-level information on the gradient field f, in order to establish which transformations of the ODE may be admitted. In addition, the requirement of a solvable Lie algebra limits the class of ODEs for which our exact Bayesian methods can be employed. Nevertheless, this class of ODEs is sufficiently broad to have merited extensive theoretical research (Bluman and Anco, 2002) and the development of software (Baumann, 2013).
In this section our novel Bayesian PNM is presented. The method relies on high-level information about the gradient field f and, in Section 3.1, we discuss how such information can be exploited to identify any Lie transformations that are admitted by the ODE. In the case of a first order ODE, any non-trivial transformation is sufficient for our method, and an explicit information operator is provided for this case, together with recommendations for prior construction, in Section 3.2. Together, the prior and the information operator uniquely determine a Bayesian PNM, as explained in Section 1.1. In the general case of an m-th order ODE, we require that an m-dimensional solvable Lie algebra is admitted by the ODE. The special case m = 2 is treated in detail, with an explicit information operator and guidance for prior construction provided in Section 3.3. In Supplemental Section A.2 the selection of input pairs (x_i, y_i) to the gradient field is discussed.
For the methods proposed in this paper, transformations admitted by the ODE, together with their infinitesimal generators, must first be obtained. The algorithm for obtaining infinitesimal generators follows as a consequence of Cor. 4. Indeed, suppose we have an m-th order ODE of the form y_m − f(x, y, y_1, ..., y_{m−1}) = 0.
Then, by Cor. 4, a transformation with infinitesimal generator X is admitted by the ODE if and only if
X^(m) ( y_m − f(x, y, y_1, ..., y_{m−1}) ) = 0 whenever y_m = f(x, y, y_1, ..., y_{m−1}).   (12)
In infinitesimal notation, Eq. (12) is equivalent to
η^(m)(x, y, y_1, ..., y_{m−1}, y_m) = ξ ∂f/∂x + η ∂f/∂y + Σ_{k=1}^{m−1} η^(k) ∂f/∂y_k.   (13)
The direct solution of Eq. (13) recovers any transformations that are admitted.
In the common scenario where f(x, y, y_1, ..., y_{m−1}) is a polynomial in y_1, y_2, ..., y_{m−1}, the algorithm just described, for identification of admitted transformations, can be fully automated (c.f. Baumann, 2013). Indeed, from Def. 10 it follows that the extended infinitesimals η^(k), for k ∈ {1, 2, ..., m}, are polynomial in y_1, y_2, ..., y_k. Thus, after substituting y_m = f(x, y, y_1, ..., y_{m−1}), Eq. (12) too must be a polynomial in y_1, y_2, ..., y_{m−1}. Moreover, the coefficients of this polynomial must vanish, because (12) holds for arbitrary values of x, y, y_1, ..., y_{m−1}. This argument, of setting coefficients to zero, leads to a system of linear partial differential equations (overdetermined when m ≥ 2) for ξ(x, y) and η(x, y), which can be exactly solved to retrieve all the infinitesimal generators of the ODE. The same strategy can often be applied beyond the polynomial case, and explicit worked examples of this procedure are now provided:
Example 9 (First Order ODE). Consider the class of all first order ODEs of the form dy/dx = f(x, y(x)), f(x, y) = F(y/x). From Eq. (10) we have η^(1) = η_x + (η_y − ξ_x) y_1 − ξ_y (y_1)², so Eq. (13) becomes η_x + (η_y − ξ_x) f − ξ_y f² = ξ ∂f/∂x + η ∂f/∂y, and thus
η_x + (η_y − ξ_x) F(y/x) − ξ_y F(y/x)² = −ξ F′(y/x) y/x² + η F′(y/x)/x.
For this equation to hold for general F, the coefficients of F, F² and F′ must each vanish:
η_x = 0,  η_y − ξ_x = 0,  ξ_y = 0,  −ξ y/x² + η/x = 0.
This is now a linear system of partial differential equations in (ξ, η) which is easily solved; namely ξ = x, η = y. The associated infinitesimal generator is X = x ∂/∂x + y ∂/∂y.
Example 10 (Second Order ODE). The infinitesimal generators for the second order ODE
(x − y(x)) d²y/dx² + 2 (dy/dx)(dy/dx + 1) + (dy/dx)^{3/2} = 0   (14)
are derived in Supplementary Section A.1.
In this section we present our approach for a first order ODE. This allows some of the more technical details associated with the general case to be omitted, owing to the fact that any one-dimensional Lie algebra is trivial. The main result that will allow us to construct an exact Bayesian PNM is as follows:
Theorem 7 (Reduction of a First Order ODE to an Integral). If a first order ODE
dy/dx = f(x, y(x))   (15)
admits a one-parameter Lie group of transformations, then there exist coordinates r(x, y), s(x, y) such that
ds/dr = G(r)   (16)
for some explicit function G(r).
Note that the transformed ODE in Eq. (16) is nothing more than an integral, for which exact Bayesian PNM have already been developed (e.g.
Briol et al., 2019; Karvonen et al., 2018). At a high level, as indicated in Fig. 2, our proposed Bayesian PNM performs inference for the solution s(r) of Eq. (16) and then transforms the resultant posterior back into the original (x, y)-coordinate system. Our PNM is therefore based on the information operator
A(y) = [G(r_0), ..., G(r_n)]^⊤ ∈ A = R^{n+1},   (17)
which corresponds indirectly to n + 1 evaluations of the original gradient field f at certain input pairs (x_i, y_i). The selection of the inputs r_i is discussed in Section A.2. The transformation of a first order ODE is clearly illustrated in the following:
Example 11 (Ex. 9, continued). Consider the first order ODE dy/dx = f(x, y(x)), f(x, y) = F(y/x). Recall from Ex. 9 that this ODE admits the one-parameter Lie group of transformations x* = αx, y* = αy for α ∈ R, and that the associated infinitesimal generator is X = x ∂/∂x + y ∂/∂y. Solving the pair of partial differential equations X r = 0, X s = 1 yields the canonical coordinates s = log y, r = y/x. The transformed ODE is then
ds/dr = F(r) / (−r² + r F(r)) =: G(r).
Thus an evaluation of G(r) corresponds to an evaluation of f(x, y) at an input (x, y) such that r = y/x.
Two important points must now be addressed. First, the approach just described cannot be Bayesian unless it corresponds to a well-defined prior distribution μ ∈ P_Y in the original coordinate system Y. This precludes standard (e.g. Gaussian process) priors in general, as such priors assign mass to functions in (r, s)-space that do not correspond to well-defined functions in (x, y)-space (see Fig. 3). Second, any prior that is used ought to be consistent with the Lie group of transformations that the ODE is known to admit. To address each of these important points, we propose two general principles for prior construction in this work.
The first principle is the implicit prior principle. This ensures that a prior specified in the transformed coordinates (r, s) can be safely transformed into a well-defined distribution on Y. For such an implicit prior to be well-defined, we need to understand when a function in (r, s)-space maps to a well-defined function in the original (x, y) domain of interest. Let S denote the image of Y under the canonical coordinate transformation.
Principle 1 (Implicit Prior). A distribution ν ∈ P_S on the transformed solution space S corresponds to a well-defined implicit prior μ ∈ P_Y provided that x(r, s(r)) is strictly monotone as a function of r.
Figure 3: Illustration of the implicit prior principle: a prior elicited for the function s(r) in the transformed coordinate system (r, s) must be supported on functions s(r) that correspond to well-defined functions y(x) in the original coordinate system (x, y). Thus the situation depicted would not be allowed.
Example 12 (Ex. 11, continued). For the ODE in Ex. 11, with canonical coordinates s = log y, r = y/x, if x ∈ [x_0, x_T] = [1, x_T] and y ∈ (0, ∞), then the region in the (r, s) plane corresponding to [1, x_T] × (0, ∞) in the (x, y) plane is (0, ∞) × R. Now,
dx(r, s(r))/dr = ∂x/∂r + (∂x/∂s) ds(r)/dr = s′(r) exp(s(r))/r − exp(s(r))/r².
Thus dx/dr > 0 if and only if s′(r) > 1/r, and the implicit prior principle requires that we respect the constraint log(r) ≤ s(r) ≤ log(r) + log(x_T) for all r > 0. The set S must therefore consist of differentiable functions s defined on r ∈ (0, ∞) and satisfying log(r) ≤ s(r) ≤ log(r) + log(x_T).
Now we turn to the second important point, namely that the prior ought to encode knowledge of any Lie transformations that are known to be admitted by the ODE. In working in the transformed space S, it becomes clear how to construct a prior measure in which this knowledge is encoded.
Our second principle for prior specification states that equal prior weight should be afforded to all curves that are identical up to a Lie transformation:
Principle 2 (Invariant Prior). A distribution ν ∈ P_S on the transformed solution space S is said to be invariant provided that ν(S) = ν(S + ε), where the elements of S + ε are the elements of S after a vertical translation, i.e. s(·) ↦ s(·) + ε, and both S, S + ε ∈ Σ_S.
Our recommendation is that, when possible, both the implicit prior principle and the invariant prior principle should be enforced. However, in practice it seems difficult to satisfy both principles, and our empirical results in Section 4 are based on implicit priors that are not invariant.
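The computations in Exs. 9 and 11 can be verified symbolically. The following sketch (our own illustration, using sympy) checks that r = y/x, s = log y are canonical coordinates for the scaling generator X = x ∂/∂x + y ∂/∂y, and that along any solution of dy/dx = F(y/x) the transformed gradient takes the form G(r) = F(r)/(r F(r) − r²).

```python
# Sketch: verify the canonical coordinates and transformed ODE of Ex. 11.
import sympy as sp

x = sp.symbols("x", positive=True)
y = sp.Function("y", positive=True)(x)
F = sp.Function("F")

r = y / x          # canonical coordinates for X = x d/dx + y d/dy
s = sp.log(y)

def X(G):
    """Apply the scaling generator to an expression in (x, y)."""
    yy = sp.symbols("yy", positive=True)
    G = G.subs(y, yy)
    out = x * sp.diff(G, x) + yy * sp.diff(G, yy)
    return sp.simplify(out.subs(yy, y))

assert X(r) == 0 and X(s) == 1            # Xr = 0, Xs = 1 (Def. 7)

# Along a solution of dy/dx = F(y/x), compute ds/dr = (ds/dx)/(dr/dx).
rhs = F(y / x)
ds_dx = sp.diff(s, x).subs(sp.Derivative(y, x), rhs)
dr_dx = sp.diff(r, x).subs(sp.Derivative(y, x), rhs)
rho = sp.symbols("rho", positive=True)
G_of_r = sp.simplify((ds_dx / dr_dx).subs(y, rho * x))   # substitute y = r x
assert sp.simplify(G_of_r - F(rho) / (rho * F(rho) - rho**2)) == 0
```

The same pattern (solve X r = 0, X s = 1, then divide the transformed derivatives) applies to any admitted one-parameter group, which is exactly the content of Theorem 7.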
In this section we present our approach for a second order ODE. The study of second order ODEs is particularly important, since Newtonian mechanics is based on ODEs of second order. The presentation is again simplified relative to the general case of an m-th order ODE, this time by virtue of the fact that any two-dimensional Lie algebra is guaranteed to be solvable (Thm. 6). The main result that will allow us to construct an exact Bayesian PNM is as follows:
Theorem 8 (Reduction of a Second Order ODE to Two Integrals). If a second order ODE
d²y/dx² = f(x, y(x), dy/dx)   (18)
admits a two-parameter Lie group of transformations, then there exist coordinates r(x, y), s(x, y) such that
ds/dr = G(r)   (19)
for some implicitly defined function G. The function G is explicitly related to the solution of a second equation of the form
ds̃/dr̃ = H(r̃)   (20)
for some explicit function H(r̃).
Note that the ODE in Eq. (18) is reduced to two integrals, namely Eq. (19) and Eq. (20). At a high level, our proposed Bayesian PNM performs inference for the solution s(r) of Eq. (19) and then transforms the resultant posterior back into the original (x, y)-coordinate system. However, because G in Eq. (19) depends on the solution s̃(r̃) of Eq. (20), we must also estimate s̃(r̃), and for this we need to evaluate H. Our PNM is therefore based on the information operator
A(y) = [G(r_0), ..., G(r_n), H(r̃_0), ..., H(r̃_n)]^⊤ ∈ A = R^{2(n+1)},
which corresponds indirectly to 2(n + 1) evaluations of f, the original gradient field. The extension of our approach to a general m-th order ODE proceeds analogously, with A = R^{m(n+1)}. The use of Thm. 8 is illustrated in Example 14 in the Supplement.
The two principles of prior construction that we advocated in the case of a first order ODE apply equally to the case of a second- and higher-order ODE.
It therefore remains only to discuss the selection of the specific inputs r_i (and r̃_i, in the case of a second order ODE) that are used to define the information operator A. This discussion is again reserved for Supplemental Section A.2.
In this section the proposed Bayesian PNM is empirically illustrated. Recall that we are not advocating these methods for practical use; rather, they serve as a proof-of-concept, demonstrating that exact Bayesian inference can in principle be performed for ODEs, albeit at considerable effort; a non-trivial finding that helps to shape ongoing research and discussion in this nascent field.
The case of a first order ODE is considered in Section 4.1 and the second order case is contained in Section 4.2. In both cases, scope is limited to verifying the correctness of the procedures, as well as indicating how conjugate prior distributions can be constructed.
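To make the constrained priors used below concrete, the following is a crude stand-in for the basis-function construction of López-Lopera et al. (2018) that the experiments actually use (the grid, the increment distribution and the normalisation here are our own choices, for illustration only): a random nondecreasing function ζ with ζ = 0 at the left endpoint and 0 ≤ ζ < 1 yields a sample path s(r) = log r + log(x_T) ζ(r) respecting the constraint in Eq. (22), which then maps back to a well-defined curve in (x, y).

```python
# Sketch: draw one sample path from a crude monotone, bounded prior on zeta,
# form s(r) = log r + log(x_T) * zeta(r), and map back via x = e^s/r, y = e^s.
import numpy as np

rng = np.random.default_rng(0)
x_T = 5.0
r = np.linspace(1.0, x_T, 50)          # illustrative grid; r0 = y0/x0 = 1 here

inc = rng.gamma(shape=1.0, scale=1.0, size=r.size - 1)   # positive increments
zeta = np.concatenate([[0.0], np.cumsum(inc)])
zeta = zeta / (zeta[-1] + 1.0)          # pins zeta[0] = 0 and 0 <= zeta < 1

s = np.log(r) + np.log(x_T) * zeta      # sample path in (r, s) coordinates
xx = np.exp(s) / r                      # back-transform: x = e^s / r
yy = np.exp(s)                          # y = e^s

# The path respects the bound of Eq. (22) and x(r) is strictly increasing,
# so the curve (xx, yy) is a well-defined function y(x).
assert np.all((np.log(r) <= s) & (s <= np.log(r) + np.log(x_T)))
assert np.all(np.diff(xx) > 0)
```

The López-Lopera et al. construction replaces the cumulative-sum device with a finite basis expansion whose coefficients are constrained Gaussian, which is what makes posterior computation tractable in the experiments.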
This section illustrates the empirical performance of the proposed method for a first order ODE.
ODE:
To limit scope, we consider first order ODEs of the form
dy/dx = F(y(x)/x),  x ∈ [1, x_T],  y(1) = y_0.   (21)
Note that the admitted transformation and associated canonical coordinates for this class of ODE have already been derived in Ex. 9, Ex. 11 and Ex. 12.
Prior:
In constructing a prior μ ∈ P_Y we refer to the implicit prior principle in Sec. 3.2. Indeed, recall from Ex. 11 that the ODE in Eq. (21) can be transformed into an ODE of the form
ds/dr = G(r),  r ∈ (0, ∞),  s(y_0) = log(y_0).
Our approach then constructs a distribution ν ∈ P_S where, from Ex. 12, S is the set of differentiable functions s defined on r ∈ (0, ∞) and satisfying
log(r) ≤ s(r) ≤ log(r) + log(x_T).   (22)
To ensure monotonicity in the implicit prior principle, we take dx/dr > 0, which translates into the requirement that
ds/dr > 1/r.   (23)
If Eq. (23) holds, then ν induces a well-defined distribution μ ∈ P_Y. Note that the constraints in Eq. (22) and Eq. (23) preclude the direct use of standard prior models, such as Gaussian processes. However, it is nevertheless possible to design priors that are convenient for a given set of canonical coordinates. Indeed, for the canonical coordinates r, s in our example, we can consider a prior of the form
s(r) = log(r) + log(x_T) ζ(r),
where the function ζ : (0, ∞) → R satisfies
ζ(y_0) = 0,  0 ≤ ζ(r) ≤ 1,  dζ/dr ≥ 0.   (24)
For this experiment, the approach of López-Lopera et al. (2018) was used as a prior model for the monotone, bounded function ζ; this requires that a number, N, of basis functions is specified; for brevity we defer the detail to Appendix A.3.
Figure 4: Experimental results, first order ODE: the black curves represent samples from the posterior, whilst the exact solution is indicated in red. The blue curves represent a constraint on the (r, s) domain that arises when the implicit prior principle is applied. The number n of gradient evaluations is indicated. Top: results in the (r, s) domain. Bottom: results in the (x, y) domain.
The prior just described incorporates the symmetric structure of the ODE, in the sense that the independent variable r = y/x is the first canonical coordinate of the infinitesimal generator of the Lie group of transformations of the original ODE in Ex. 11. In other words, r is a variable fixed by the Lie group of transformations of the ODE (in this case x* = αx, y* = αy, so r* = r). Because the prior is defined on functions s(r) of r, the prior itself is also unchanged by the Lie group of transformations of the ODE, so that the prior effectively incorporates the symmetric structure of the ODE.
Results:
To obtain empirical results we consider the ODE with F(r) = r^{-1} + r and y_0 = 1, x_T = 5. The posterior distributions obtained as the number n of data points was increased were sampled and plotted in the (r, s) and (x, y) planes in Fig. 4. In each case a basis of size N = 2n was used. Observe that the implicit prior principle ensures that all curves in the (x, y) plane are well-defined functions (i.e. there is at most one y value for each x value). Observe also that the posterior mass appears to contract to the true solution y† of the ODE as the number of evaluations n of the gradient field is increased.
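The reduction underlying these results can be sanity-checked numerically. In the sketch below, F is an arbitrary illustrative choice (not necessarily the F used in the experiment above): integrating the transformed problem ds/dr = G(r), G(r) = F(r)/(r F(r) − r²), and mapping back through x = e^s/r, y = e^s reproduces, to solver tolerance, the solution obtained by integrating dy/dx = F(y/x) directly.

```python
# Sketch: numerical check that the transformed integral of Ex. 11 reproduces
# the solution of the original ODE dy/dx = F(y/x).
import numpy as np
from scipy.integrate import solve_ivp

F = lambda rho: rho + rho**2                 # illustrative choice of F(y/x)
G = lambda r: F(r) / (r * F(r) - r**2)       # transformed gradient (Ex. 11)

# Integrate ds/dr = G(r) from (r, s) = (1, 0), i.e. (x0, y0) = (1, 1).
R = 2.0
sol_s = solve_ivp(lambda r, s: [G(r)], (1.0, R), [0.0], rtol=1e-10, atol=1e-12)
s_R = sol_s.y[0, -1]
x_R, y_R = np.exp(s_R) / R, np.exp(s_R)      # map back to (x, y) coordinates

# Integrate the original ODE up to x = x_R and compare the endpoint values.
sol_y = solve_ivp(lambda x, y: [F(y[0] / x)], (1.0, x_R), [1.0],
                  rtol=1e-10, atol=1e-12, dense_output=True)
assert abs(sol_y.sol(x_R)[0] - y_R) < 1e-6

# For this F the original ODE has closed form y(x) = x / (1 - log x).
assert abs(y_R - x_R / (1.0 - np.log(x_R))) < 1e-6
```

In the paper's method the quadrature step is of course replaced by Bayesian inference for s(r), but the deterministic pipeline above is exactly the map that the posterior samples in Fig. 4 are pushed through.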
ODE:
Consider again the second order nonlinear ODE in Eqn. 31, together with the initial conditions y(x_0) = y_0, dy/dx(x_0) = y_0′.
It is shown in Ex. 14 in the Supplement that Eqn. 31 can be reduced to a first order ODE in (s, r) with −x_0 − r ≤ s ≤ −x_T − r. The implicit prior principle in this case requires that ds/dr > −1. Thus we are led to consider a parametrisation of the form
s(r) = −x_0 − r + (x_0 − x_T) ζ(r),
where the function ζ again satisfies the conditions in Eq. (24). The approach of López-Lopera et al. (2018) was therefore again used as a prior model.
For this example an additional level of analytic tractability is possible, as described in detail in Ex. 14 in the Supplement. Thus we need only consider an information operator of the form A(y) = [G(r_0), ..., G(r_n)].
Results:
The posterior distributions that were obtained are plotted in the (r, s) plane and the (x, y) plane in Fig. 5. A basis of size N = 2n was used, with initial condition [y_0, y_0′] and domain [x_0, x_T], x_0 = 5, as in Ex. 14. Observe that all curves in the (x, y) plane are well-defined functions (i.e. there is at most one y value for each x value). The true solution appears to be smoother than the samples, even for 50 gradient evaluations, which suggests that the prior was somewhat conservative in this context.
Figure 5: Experimental results, second order ODE: the black curves represent samples from the posterior in the (r, s) plane (left) and (x, y) plane (right), whilst the exact solution is indicated in red. The blue curves represent a constraint on the domain that arises when the implicit prior principle is applied. The number of gradient evaluations was n = 50.
This paper presented a foundational perspective on PNM. It was first argued that there did not exist a Bayesian PNM for the numerical solution of ODEs. Then, to address this gap, a prototypical Bayesian PNM was developed. The Bayesian perspective that we have put forward sheds light on foundational issues which will need to be addressed going forward:
Foundation of PNM:
As explained in Section 1.2, existing PNM for ODEs each take the underlying state space Y to be the solution space of the ODE. This appears to be problematic, in the sense that a generic evaluation f(x_i, y_i) of the gradient field cannot be cast as information A(y†) about the solution y† of the ODE unless the point (x_i, y_i) lies exactly on the solution curve {(x, y†(x)) : x ∈ [x_0, x_T]}. As a consequence, all existing PNM of which we are aware violate the likelihood principle and are therefore not strictly Bayesian. The assumption of a solvable Lie algebra, used in this work, can be seen as a mechanism to ensure the existence of an exact information operator A, so that the likelihood principle can be obeyed. However, for a general ODE it might be more natural to take the underlying state space to be a set F of permitted gradient fields and the quantity of interest Q(f) to map a gradient field f to the solution of the associated ODE. This would make the information operator A trivial, but evaluation of the push-forward Q_# μ_a would require the exact solution operator of the ODE. The reliance on access to an oracle solver Q makes this philosophically somewhat distinct from PNM.
The proposed method was intended as a proof-of-concept and it is therefore useful to highlight the aspects in which it is limited. First, when an $m$th order ODE admits an $r$-parameter Lie group of transformations with $r > m$, there is an arbitrariness to the particular $m$-dimensional sub-group of transformations that is selected. Second, the route to obtain transformations admitted by the ODE demands that some aspects of the gradient field $f$ are known, in contrast to other work in which $f$ is treated as a black-box. For instance, in Ex. 11 we used the fact that $f$ can be expressed as $f(x, y) = F(y/x)$, although knowledge of the form of $F$ was not required. Third, the class of ODEs for which a solvable Lie algebra is admitted is relatively small. On the other hand, references such as Bluman and Anco (2002) document important cases where our method could be applied. Fourth, the principles for prior construction that we identified do not entail a unique prior and, as such, the question of prior elicitation must still be addressed.

Outlook:
The goal of providing rigorous and exact statistical uncertainty quantification for the solution of an ODE is, we believe, important and will continue to be addressed. Traditional numerical methods have benefitted from a century of research effort and, in comparison, Bayesian PNM is an under-developed field. For example, the limited existing work on PNM for ODEs, such as Skilling (1992); Schober et al. (2014); Chkrebtii et al. (2016); Kersting and Hennig (2016); Schober et al. (2019); Kersting et al. (2018); Tronarp et al. (2019), does not attempt to provide adaptive error control (though we note promising ongoing research in that direction by Chkrebtii and Campbell, 2019). Nevertheless, the case for developing Bayesian numerical methods, which shares some parallels with the case for Bayesian statistics as opposed to other inferential paradigms, is clear, as argued in Diaconis (1988) and Hennig et al. (2015). The insights we have provided in this paper serve to highlight the foundational issues pertinent to Bayesian PNM for ODEs. Indeed, our proof-of-concept highlights that performing exact Bayesian inference for ODEs may be extremely difficult. This in turn provides motivation for the continued development of ‘approximately Bayesian’ approaches to PNM, which in Sec. 1.2 we surveyed in detail.
Acknowledgements:
The authors are grateful to Mark Craddock, François-Xavier Briol and Tim Sullivan for discussion of this work, as well as to an Associate Editor and two Reviewers for their challenging but constructive feedback. JW was supported by the EPSRC Centre for Doctoral Training in Cloud Computing for Big Data at Newcastle University, UK. CJO was supported by the Lloyd's Register Foundation programme on data-centric engineering at the Alan Turing Institute, UK. This material was based upon work partially supported by the National Science Foundation under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References
Assyr Abdulle and Giacomo Garegnani. Random time step probabilistic methods for uncertainty quantification in chaotic and geometric numerical integration. arXiv:1801.01340, 2018.

Gerd Baumann. Symmetry Analysis of Differential Equations with Mathematica®. Springer Science & Business Media, 2013.

G.W. Bluman and S.C. Anco. Symmetry and Integration Methods for Differential Equations. Springer, 2002.

François-Xavier Briol, Chris J Oates, Mark Girolami, Michael A Osborne, and Dino Sejdinovic. Probabilistic integration: A role in statistical computation? (with discussion and rejoinder). Statistical Science, 34(1):1–42, 2019.

Radomir Chabiniok, Vicky Y Wang, Myrianthi Hadjicharalambous, Liya Asner, Jack Lee, Maxime Sermesant, Ellen Kuhl, Alistair A Young, Philippe Moireau, Martyn P Nash, et al. Multiphysics and multiscale modelling, data–model fusion and integration of organ physiology in the clinic: ventricular cardiac mechanics. Interface Focus, 6(2):20150083, 2016.

Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273–304, 1995.

Joseph T Chang and David Pollard. Conditioning as disintegration. Statistica Neerlandica, 51(3):287–317, 1997.

O.A. Chkrebtii and D.A. Campbell. Adaptive step-size selection for state-space based probabilistic differential equation solvers. Statistics and Computing, 2019. To appear.

Oksana Chkrebtii, David A Campbell, Mark A Girolami, and Ben Calderhead. Bayesian uncertainty quantification for differential equations (with discussion). Bayesian Analysis, 11(4):1239–1267, 2016.

Jon Cockayne, Chris Oates, Tim Sullivan, and Mark Girolami. Probabilistic meshless methods for partial differential equations and Bayesian inverse problems. arXiv:1605.07811, 2016.

Jon Cockayne, Chris Oates, Tim Sullivan, and Mark Girolami. Bayesian probabilistic numerical methods. SIAM Review, 2019. To appear.

Patrick R Conrad, Mark Girolami, Simo Särkkä, Andrew Stuart, and Konstantinos Zygalakis. Statistical analysis of differential equations: introducing probability measures on numerical solutions. Statistics and Computing, 27(4):1065–1082, 2017.

P Diaconis. Bayesian numerical analysis. In Statistical Decision Theory and Related Topics IV(1), 1988.

D. Estep. A posteriori error bounds and global error control for approximation of ordinary differential equations. SIAM Journal on Numerical Analysis, 32(1):1–48, 1995.

Andrew Gelman and Cosma Rohilla Shalizi. Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1):8–38, 2013.

Thomas Hawkins. Emergence of the Theory of Lie Groups: An Essay in the History of Mathematics 1869–1926. Springer Science & Business Media, 2012.

Philipp Hennig, Michael A Osborne, and Mark Girolami. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society A, 471(2179):20150142, 2015.

Nicholas J Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2002.

Toni Karvonen, Chris J Oates, and Simo Särkkä. A Bayes–Sard cubature method. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS 2018), pages 5882–5893, 2018.

H. Kersting, T.J. Sullivan, and P. Hennig. Convergence rates of Gaussian ODE filters. arXiv:1807.09737, 2018.

Hans Kersting and Philipp Hennig. Active uncertainty calibration in Bayesian ODE solvers. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI 2016), pages 309–318, 2016.

F. M. Larkin. Probabilistic error estimates in spline interpolation and quadrature. In Information Processing 74 (Proc. IFIP Congress, Stockholm, 1974), pages 605–609, 1974.

F. M. Larkin. Gaussian measure in Hilbert space and applications in numerical analysis. The Rocky Mountain Journal of Mathematics, 2:379–421, 1972.

Han Cheng Lie, T.J. Sullivan, and Aretha L Teckentrup. Random forward models and log-likelihoods in Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification, 6(4):1600–1629, 2018.

H.C. Lie, A.M. Stuart, and T.J. Sullivan. Strong convergence rates of probabilistic integrators for ordinary differential equations. Statistics and Computing, 2019. To appear.

Andrés F López-Lopera, François Bachoc, Nicolas Durrande, and Olivier Roustant. Finite-dimensional Gaussian approximation with linear inequality constraints. SIAM/ASA Journal on Uncertainty Quantification, 6(3):1224–1255, 2018.

C J Oates and T J Sullivan. A modern retrospective on probabilistic numerics. Statistics and Computing, 2019. To appear.

C J Oates, J Cockayne, D Prangle, T J Sullivan, and M Girolami. Optimality criteria for probabilistic numerical methods. In Multivariate Algorithms and Information-Based Complexity. De Gruyter, Berlin/Boston, 2019. To appear.

Antony Overstall, David Woods, and Ben Parker. Bayesian optimal design for ordinary differential equation models with application in biological science. Journal of the American Statistical Association, 2019. To appear.

A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23:518–542, 2014.

Juan R Perilla, Boon Chong Goh, C Keith Cassidy, Bo Liu, Rafael C Bernardi, Till Rudack, Hang Yu, Zhe Wu, and Klaus Schulten. Molecular dynamics simulations of large macromolecular complexes. Current Opinion in Structural Biology, 31:64–74, 2015.

Michael Schober, David K Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with Runge–Kutta means. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS 2014), pages 739–747, 2014.

Michael Schober, Simo Särkkä, and Philipp Hennig. A probabilistic model for the numerical solution of initial value problems. Statistics and Computing, 2019. To appear.

John Skilling. Bayesian solution of ordinary differential equations. In Maximum Entropy and Bayesian Methods, pages 23–37. Springer, 1992.

Onur Teymur, Kostas Zygalakis, and Ben Calderhead. Probabilistic linear multistep methods. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), pages 4321–4328, 2016.

Onur Teymur, Han Cheng Lie, Tim Sullivan, and Ben Calderhead. Implicit probabilistic integrators for ODEs. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS 2018), 2018.

J.F. Traub and H. Woźniakowski. Perspectives on information-based complexity. Bulletin of the American Mathematical Society, 26(1):29–52, 1992.

Filip Tronarp, Hans Kersting, Simo Särkkä, and Philipp Hennig. Probabilistic solutions to ordinary differential equations as non-linear Bayesian filtering: A new perspective. Statistics and Computing, 2019. To appear.

Nils P Wedi. Increasing horizontal resolution in numerical weather prediction and climate simulations: illusion or panacea? Philosophical Transactions of the Royal Society A, 372(2018):20130289, 2014.
Supplementary material
This electronic supplement to the paper by Wang, Cockayne and Oates contains proofs for theoretical results in the main text (Sec. A.1), a brief discussion of Bayesian experimental design in the context of designing a Bayesian probabilistic numerical method (Sec. A.2), and computational detail that was suppressed from the main text (Sec. A.3).
A.1 Proof of Theoretical Results
Proof of Theorem 2.
From Taylor's theorem we have that
$$x^* = X(x, \epsilon) = \sum_{k=0}^{\infty} \frac{\epsilon^k}{k!} \left. \frac{\partial^k X(x, \epsilon)}{\partial \epsilon^k} \right|_{\epsilon = 0}.$$
For any differentiable function $F$ we have that
$$\frac{d F(x^*)}{d\epsilon} = \sum_{i=1}^{d} \frac{\partial F(x^*)}{\partial x_i^*} \frac{d x_i^*}{d\epsilon} = \sum_{i=1}^{d} \xi_i(x^*) \frac{\partial F(x^*)}{\partial x_i^*} = X F(x^*)$$
and similarly $\frac{d^k F(x^*)}{d\epsilon^k} = X^k F(x^*)$. Thus
$$\left. \frac{\partial^k X(x, \epsilon)}{\partial \epsilon^k} \right|_{\epsilon = 0} = X^k x$$
so that the stated result is recovered.

Proof of Theorem 3.
The result is established as follows:
$$\begin{aligned}
F \text{ invariant} &\Leftrightarrow F(x^*) = 0 \text{ whenever } F(x) = 0 \\
&\Leftrightarrow e^{\epsilon X} F(x) = 0 \text{ whenever } F(x) = 0 \quad \text{(Cor. 1)} \\
&\Leftrightarrow F(x) + \epsilon X F(x) + O(\epsilon^2) = 0 \text{ whenever } F(x) = 0 \quad \text{(Taylor)} \\
&\Leftrightarrow X F(x) = 0 \text{ whenever } F(x) = 0
\end{aligned}$$
where the last line follows since the coefficient of the $O(\epsilon)$ term in the Taylor expansion must vanish. This completes the proof.

Proof of Theorem 4.
From Cor. 1, we have that $F(x^*) = F(x) + \epsilon X F(x) + O(\epsilon^2)$. The result follows from inspection of the $\epsilon$ coefficient.

Proof of Theorem 6. Suppose $L$ is a two-dimensional Lie algebra generated by linearly independent infinitesimal generators $X_1$ and $X_2$. Let $Y = [X_1, X_2] = a X_1 + b X_2$ and let $J$ be the one-dimensional subalgebra generated by $Y$. Suppose $Z = c X_1 + d X_2$ is some element of $L$; then
$$[Y, Z] = [Y, c X_1 + d X_2] = c [Y, X_1] + d [Y, X_2] = cb [X_2, X_1] + da [X_1, X_2] = (ad - bc) Y \in J.$$
So $J \subset L$ is normal, thus $L$ is solvable, as claimed.

Proof of Theorem 7.
Let the infinitesimal generator associated with the Lie group of transformations be denoted $X$. From the remarks after Def. 7, we can obtain canonical coordinates by solving the pair of first order partial differential equations $X r = 0$, $X s = 1$. By the chain rule we have
$$\frac{ds}{dr} = \frac{s_x + s_y y'}{r_x + r_y y'} =: G(r, s).$$
From the definition of canonical coordinates, the Lie group of transformations is $r^* = r$, $s^* = s + \epsilon$ in the transformed coordinate system, so
$$\frac{ds^*}{dr^*} = G(r^*, s^*) \implies \frac{ds}{dr} = G(r, s + \epsilon)$$
for all $\epsilon$, which implies $G(r, s) = G(r)$ and thus Eq. (15) becomes
$$\frac{ds}{dr} = G(r)$$
as claimed.

Proof of Theorem 8.
Let the infinitesimal generators associated with the Lie group of transformations be denoted $X_1$ and $X_2$. Recall from Thm. 6 that any two-dimensional Lie algebra is solvable. Thus, without loss of generality, we may assume
$$[X_1, X_2] = \lambda X_1 \qquad (25)$$
for some $\lambda \in \mathbb{R}$. The infinitesimal generators $X_1$ and $X_2$ each correspond to a one-parameter Lie group of transformations, denoted $x^* = X_1(x, \epsilon)$ and $x^\dagger = X_2(x, \epsilon)$. Let $v(x, y)$, $w(x, y, y')$ be invariant functions of $X_1$ and its extension $X_1^{(1)}$, respectively, so that $v(x^*, y^*) = v(x, y)$ and $w(x^*, y^*, y'^*) = w(x, y, y')$, where $w$ has a non-trivial dependence on its third argument. It follows from the definition of invariance that
$$\frac{dw^*}{dv^*} = \frac{dw}{dv}, \qquad \text{hence} \qquad X_1^{(2)} \frac{dw}{dv} = 0 \qquad (26)$$
by Cor. 3. Now, because Eq. (26) is a homogeneous partial differential equation, the general solution $\frac{dw}{dv}$ can be expressed as a function of the two solutions $v(x, y)$ and $w(x, y, y')$. Therefore
$$\frac{dw}{dv} = Z(v, w) \qquad (27)$$
for some undetermined function $Z$.

Since Eq. (18) admits $X_2$, and Eq. (27) is the same ODE when expressed in terms of $x, y, y'$, it must be the case that
$$X_2^{(2)} \left( \frac{dw}{dv} - Z(v, w) \right) = 0 \quad \text{whenever} \quad \frac{dw}{dv} = Z(v, w).$$
Then from Cor. 4 it follows that $X_2^{(1)}$ is admitted by the first order ODE in Eq. (27). Thus we are now faced with a first order ODE that admits a one-parameter Lie group of transformations, as in Thm. 7.

Now, from Eq. (25), we have $X_1 X_2 v = X_2 X_1 v + \lambda X_1 v = 0$. Thus $X_2 v$ is an invariant of $X_1$ and so $X_2 v = h(v)$ for some function $h$. Similarly $X_1^{(1)} X_2^{(1)} w = X_2^{(1)} X_1^{(1)} w + \lambda X_1^{(1)} w = 0$, so that $X_2^{(1)} w = g(v, w)$ for some function $g$. This implies
$$X_2^{(1)} = h(v) \frac{\partial}{\partial v} + g(v, w) \frac{\partial}{\partial w}.$$
Proceeding as in Thm. 7, denote the canonical coordinates of $X_2^{(1)} = h(v) \frac{\partial}{\partial v} + g(v, w) \frac{\partial}{\partial w}$ by $\tilde{r}(v, w)$, $\tilde{s}(v, w)$, such that $X_2^{(1)} \tilde{r} = 0$, $X_2^{(1)} \tilde{s} = 1$. In canonical coordinates, Eq.
(27) becomes:
$$\frac{d\tilde{s}}{d\tilde{r}} = H(\tilde{r}) \qquad (28)$$
This is again an integral, with solution
$$\tilde{s}(\tilde{r}) = \int^{\tilde{r}} H(t) \, dt + C. \qquad (29)$$
We can rewrite Eq. (29) in terms of $v, w$ to obtain an equation of the form
$$I(v, w) = 0 \qquad (30)$$
which satisfies $X_1^{(1)}(I(v, w)) = 0$ whenever $I(v, w) = 0$, since recall $v, w$ are invariants of $X_1^{(1)}$. For the final step, we recall that $v = v(x, y)$ and $w = w(x, y, y')$, so that Eq. (30) represents a first order ODE in $y$, which admits $X_1$. Thus we can apply Thm. 15 a second time to obtain canonical coordinates $r(x, y)$, $s(x, y)$ for $X_1$. In these coordinates, Eq. (30) reduces into the form
$$\frac{ds}{dr} = G(r)$$
where $G$ is implicitly defined.

Example 13 (Deriving the Infinitesimal Generators for the Second Order ODE in Eq. 10). Consider the second order nonlinear ODE
$$(x - y(x)) \frac{d^2 y}{dx^2} + 2 \frac{dy}{dx} \left( \frac{dy}{dx} + 1 \right) + \left( \frac{dy}{dx} \right)^{3/2} = 0. \qquad (31)$$
Using Corollary 4, we have:
$$\left( \xi \frac{\partial}{\partial x} + \eta \frac{\partial}{\partial y} + \eta^{(1)} \frac{\partial}{\partial y'} + \eta^{(2)} \frac{\partial}{\partial y''} \right) \left( y'' + \frac{2 y' (y' + 1) + y'^{3/2}}{x - y} \right) = 0$$
which implies
$$- \xi \, \frac{2 y'(y' + 1) + y'^{3/2}}{(x - y)^2} + \eta \, \frac{2 y'(y' + 1) + y'^{3/2}}{(x - y)^2} + \eta^{(1)} \, \frac{4 y' + 2 + \frac{3}{2} y'^{1/2}}{x - y} + \eta^{(2)} = 0.$$
Recall
$$\eta^{(1)} = \eta_x + (\eta_y - \xi_x) y' - \xi_y y'^2$$
and
$$\eta^{(2)} = \eta_{xx} + (2 \eta_{xy} - \xi_{xx}) y' + (\eta_{yy} - 2 \xi_{xy}) y'^2 - \xi_{yy} y'^3 + (\eta_y - 2 \xi_x) y'' - 3 \xi_y y' y''.$$
Also notice we can replace $y''$ via the original differential equation, i.e.
$$y'' = - \frac{2 y'(y' + 1) + y'^{3/2}}{x - y}.$$
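The closed-form expressions for $\eta^{(1)}$ and $\eta^{(2)}$ quoted above can be checked mechanically against the standard recursions $\eta^{(1)} = D_x \eta - y' D_x \xi$ and $\eta^{(2)} = D_x \eta^{(1)} - y'' D_x \xi$, where $D_x$ is the total derivative. A minimal symbolic sketch (ours, not part of the paper's implementation; sympy is assumed available):

```python
import sympy as sp

x, y, p, q = sp.symbols('x y yp ypp')  # p stands for y', q for y''
xi = sp.Function('xi')(x, y)
eta = sp.Function('eta')(x, y)

def Dx(F):
    # total derivative along solutions: D_x = d/dx + y' d/dy + y'' d/dy'
    return sp.diff(F, x) + p * sp.diff(F, y) + q * sp.diff(F, p)

eta1 = sp.expand(Dx(eta) - p * Dx(xi))    # first prolongation
eta2 = sp.expand(Dx(eta1) - q * Dx(xi))   # second prolongation

# closed forms as stated in the text
eta1_closed = sp.diff(eta, x) + (sp.diff(eta, y) - sp.diff(xi, x)) * p - sp.diff(xi, y) * p**2
eta2_closed = (sp.diff(eta, x, 2) + (2 * sp.diff(eta, x, y) - sp.diff(xi, x, 2)) * p
               + (sp.diff(eta, y, 2) - 2 * sp.diff(xi, x, y)) * p**2
               - sp.diff(xi, y, 2) * p**3
               + (sp.diff(eta, y) - 2 * sp.diff(xi, x)) * q
               - 3 * sp.diff(xi, y) * p * q)

assert sp.simplify(eta1 - eta1_closed) == 0
assert sp.simplify(eta2 - eta2_closed) == 0
```

The check confirms that the two quoted prolongation formulas agree with the recursive definition.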
Substituting for $\eta^{(1)}$, $\eta^{(2)}$ and $y''$ via the above expressions, multiplying both sides by $(x - y)^2$ and rearranging the terms as powers of $y'$ yields the rather long equation:
$$\begin{aligned}
&\left[ 2(x - y)\eta_x + (x - y)^2 \eta_{xx} \right] \\
&\quad + y' \left[ 2(\eta - \xi) + 4(x - y)\eta_x + 2(x - y)\xi_x + (x - y)^2 (2\eta_{xy} - \xi_{xx}) \right] \\
&\quad + y'^2 \left[ 2(\eta - \xi) + 2(x - y)\eta_y + 4(x - y)\xi_y + (x - y)^2 (\eta_{yy} - 2\xi_{xy}) \right] \\
&\quad + y'^3 \left[ 2(x - y)\xi_y - (x - y)^2 \xi_{yy} \right] \\
&\quad + y'^{1/2} \cdot \tfrac{3}{2}(x - y)\eta_x \\
&\quad + y'^{3/2} \left[ (\eta - \xi) + \tfrac{1}{2}(x - y)\eta_y + \tfrac{1}{2}(x - y)\xi_x \right] \\
&\quad + y'^{5/2} \cdot \tfrac{3}{2}(x - y)\xi_y = 0.
\end{aligned}$$
This expression on the left hand side must vanish identically, so comparing the coefficients of powers of $y'$ gives the determining equations:
$$2(x - y)\eta_x + (x - y)^2 \eta_{xx} = 0 \qquad (32)$$
$$2(\eta - \xi) + 4(x - y)\eta_x + 2(x - y)\xi_x + (x - y)^2 (2\eta_{xy} - \xi_{xx}) = 0$$
$$2(\eta - \xi) + 2(x - y)\eta_y + 4(x - y)\xi_y + (x - y)^2 (\eta_{yy} - 2\xi_{xy}) = 0$$
$$2(x - y)\xi_y - (x - y)^2 \xi_{yy} = 0 \qquad (33)$$
$$\tfrac{3}{2}(x - y)\eta_x = 0 \qquad (34)$$
$$(\eta - \xi) + \tfrac{1}{2}(x - y)\eta_y + \tfrac{1}{2}(x - y)\xi_x = 0$$
$$\tfrac{3}{2}(x - y)\xi_y = 0 \qquad (35)$$
It is immediately obvious from (34) that $\eta_x = 0$ and from (35) that $\xi_y = 0$. Consequently (32) and (33) vanish.
The remaining determining equations simplify to:
$$2(\eta - \xi) + 2(x - y)\xi_x - (x - y)^2 \xi_{xx} = 0 \qquad (36)$$
$$2(\eta - \xi) + 2(x - y)\eta_y + (x - y)^2 \eta_{yy} = 0 \qquad (37)$$
$$(\eta - \xi) + \tfrac{1}{2}(x - y)\eta_y + \tfrac{1}{2}(x - y)\xi_x = 0 \qquad (38)$$
These remaining partial differential equations in $\xi(x, y)$ and $\eta(x, y)$ are linear and, recall, $\eta(x, y)$ is independent of $x$ and $\xi(x, y)$ is independent of $y$. To solve these partial differential equations we can therefore express $\xi(x) = \sum_{n=0}^{\infty} a_n x^n$ and $\eta(y) = \sum_{m=0}^{\infty} b_m y^m$. Consequently (36) becomes:
$$-2\sum_{n=0}^{\infty} a_n x^n + 2\sum_{m=0}^{\infty} b_m y^m + 2\sum_{n=1}^{\infty} n a_n x^n - 2y \sum_{n=1}^{\infty} n a_n x^{n-1} - \sum_{n=2}^{\infty} n(n-1) a_n x^n + 2y \sum_{n=2}^{\infty} n(n-1) a_n x^{n-1} - y^2 \sum_{n=2}^{\infty} n(n-1) a_n x^{n-2} = 0.$$
Comparing the constant terms implies $b_0 = a_0$. Comparing the terms containing $y$ implies:
$$2 b_1 - 2\sum_{n=1}^{\infty} n a_n x^{n-1} + 2\sum_{n=2}^{\infty} n(n-1) a_n x^{n-1} = 0.$$
Comparing coefficients of $x^{n-1}$ gives $b_1 = a_1$ and, for $n \geq 2$, either $n = n(n-1)$ or $a_n = 0$. Of course, $n = n(n-1)$ has solution $n = 2$, so $a_n = 0$ for $n \geq 3$. Comparing the terms containing $y^2$ implies $b_2 = a_2$. Notice that (37) is symmetric with (36), in the sense that swapping $\xi$ with $\eta$ and $x$ with $y$ in (36) gives (37). So by symmetry (37) gives $b_0 = a_0$, $b_1 = a_1$, $b_2 = a_2$ and $b_n = 0$ for $n \geq 3$. (38) gives no additional constraints. Therefore, the example ODE admits a three parameter Lie group of transformations with infinitesimals
$$\xi = a_0 + a_1 x + a_2 x^2, \qquad \eta = a_0 + a_1 y + a_2 y^2$$
where $a_0$, $a_1$ and $a_2$ are arbitrary constants. The infinitesimal generators corresponding to $a_1$, $a_2$ and $a_0$ are respectively
$$X_1 = x \frac{\partial}{\partial x} + y \frac{\partial}{\partial y} \qquad (39)$$
$$X_2 = x^2 \frac{\partial}{\partial x} + y^2 \frac{\partial}{\partial y}$$
$$X_3 = \frac{\partial}{\partial x} + \frac{\partial}{\partial y}, \qquad (40)$$
which generate a three dimensional Lie algebra.

Example 14 (Ex. 13, continued). Recall from Ex. 13 that the second order nonlinear ODE in Eq. (31) admits a three parameter Lie group of transformations with infinitesimal generators $X_1$, $X_2$, $X_3$ defined in Eqs. (39)–(40).
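As a quick check on the Lie algebra structure of $X_1$, $X_2$, $X_3$, the commutators of the corresponding vector fields can be computed symbolically. This is a minimal sympy sketch of ours (the helper names are not from the paper):

```python
import sympy as sp

x, y = sp.symbols('x y')

def apply(X, f):
    # X = (xi, eta) acts as xi d/dx + eta d/dy
    xi, eta = X
    return xi * sp.diff(f, x) + eta * sp.diff(f, y)

def bracket(X, Y):
    # [X, Y] f = X(Y f) - Y(X f); computed componentwise on the coefficients
    return tuple(sp.expand(apply(X, c2) - apply(Y, c1)) for c1, c2 in zip(X, Y))

X1 = (x, y)                           # X1 = x d/dx + y d/dy
X2 = (x**2, y**2)                     # X2 = x^2 d/dx + y^2 d/dy
X3 = (sp.Integer(1), sp.Integer(1))   # X3 = d/dx + d/dy

assert bracket(X1, X2) == X2              # [X1, X2] = X2
assert bracket(X1, X3) == (-1, -1)        # [X1, X3] = -X3
assert bracket(X2, X3) == (-2*x, -2*y)    # [X2, X3] = -2 X1
```

In particular the spans of $\{X_1, X_2\}$ and of $\{X_1, X_3\}$ are closed under the bracket, so each is a two-dimensional sub-algebra.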
These generators can be verified to satisfy $[X_1, X_2] = X_2$, $[X_1, X_3] = -X_3$, $[X_2, X_3] = -2 X_1$. The pairs $X_1, X_2$ and $X_1, X_3$ each form a two-dimensional (and therefore solvable, by Thm. 6) Lie sub-algebra and can be used as the basis for our method. For the derivations below we proceed with the arbitrary choice $X_2, X_1$, which play the roles of $X_1, X_2$ in the proof of Thm. 8.

Following the proof of Thm. 8, first we seek a solution $v = v(x, y)$ to the first order linear PDE $X_2 v = 0$, i.e. we must solve
$$x^2 \frac{\partial v}{\partial x} + y^2 \frac{\partial v}{\partial y} = 0.$$
This has general solution $v = f\!\left( \frac{y - x}{xy} \right)$ for some arbitrary function $f$, and we pick a particular solution $v(x, y) = \frac{y - x}{xy}$. Next we seek a solution $w = w(x, y, y')$ to the first order linear PDE $X_2^{(1)} w = 0$, i.e. we must solve
$$x^2 \frac{\partial w}{\partial x} + y^2 \frac{\partial w}{\partial y} + 2(y - x) y' \frac{\partial w}{\partial y'} = 0.$$
Again, we pick a particular solution $w(x, y, y') = y' \left( \frac{x}{y} \right)^2$. In accordance with Eq. (27), we can re-write the original ODE (31) in terms of the coordinates $v$ and $w$ to obtain
$$\frac{dw}{dv} = \begin{cases} \dfrac{w^{3/2} + 2w(w + 1)}{v(w - 1)}, & \text{for } xy \geq 0 \\[2mm] \dfrac{-w^{3/2} + 2w(w + 1)}{v(w - 1)}, & \text{for } xy < 0 \end{cases} \qquad (41)$$
Next we express $X_1^{(1)}$ in terms of $v$ and $w$ and find its canonical coordinates $\tilde{r}(v, w)$, $\tilde{s}(v, w)$. To this end, we have
$$X_1^{(1)} = x \frac{\partial}{\partial x} + y \frac{\partial}{\partial y} = -v \frac{\partial}{\partial v},$$
which has canonical coordinates $\tilde{r}(v, w) = w$, $\tilde{s}(v, w) = -\log(v)$. Re-writing Eq. (41) in terms of $\tilde{r}$, $\tilde{s}$ leads to the analogue of Eq. (28):
$$\frac{d\tilde{s}}{d\tilde{r}} = \frac{1 - \tilde{r}}{\pm \tilde{r}^{3/2} + 2\tilde{r}(\tilde{r} + 1)} =: H(\tilde{r}) \qquad (42)$$
This example exhibits the convenient feature that Eq. (42) can be directly integrated to give $\tilde{s}(\tilde{r}) = -\log(2\tilde{r} \pm \sqrt{\tilde{r}} + 2) + \log(\tilde{r})/2 + C$, which can be re-written in terms of $x, y, y'$ to give
$$\log\left( \frac{xy}{y - x} \right) = \log \frac{\sqrt{x^2 y' / y^2}}{2 x^2 y' / y^2 \pm \sqrt{x^2 y' / y^2} + 2} + C \qquad (43)$$
for some integration constant $C$.

The final step, to remove the residual dependence on $y'$, requires canonical coordinates for $X_2$. These can be selected as $r(x, y) = \frac{y - x}{xy}$, $s(x, y) = -\frac{1}{y}$. Then Eq. (43) becomes
$$\frac{r \mp \exp(-C)}{2 \exp(-C)} = \left( \frac{ds/dr}{ds/dr - 1} \right)^{1/2} + \left( \frac{ds/dr}{ds/dr - 1} \right)^{-1/2}$$
which is equivalent to Eq. (19) for some function $G$.
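The reduction above can be sanity-checked numerically: once $y''$ is eliminated via the ODE, the identity $dw/dv = (\pm w^{3/2} + 2w(w+1)) / (v(w-1))$ must hold at every point $(x, y, y')$. A minimal numpy sketch of ours, with sampling ranges loosely matching the experiment of Fig. 5 (our choice, for illustration only):

```python
import numpy as np

def check_reduction(x, y, p):
    # p = y'; eliminate y'' via (x - y) y'' + 2 y'(y' + 1) + y'^{3/2} = 0
    q = -(2 * p * (p + 1) + p ** 1.5) / (x - y)
    v = 1.0 / x - 1.0 / y                 # v = (y - x)/(xy)
    w = p * x ** 2 / y ** 2               # w = y' (x/y)^2
    dv_dx = -1.0 / x ** 2 + p / y ** 2
    dw_dx = 2 * x * p / y ** 2 - 2 * x ** 2 * p ** 2 / y ** 3 + x ** 2 * q / y ** 2
    lhs = dw_dx / dv_dx                   # dw/dv along any solution through (x, y, p)
    sign = 1.0 if x * y >= 0 else -1.0    # branch of the w^{3/2} term
    rhs = (sign * w ** 1.5 + 2 * w * (w + 1)) / (v * (w - 1))
    return lhs, rhs

rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.uniform(4.0, 6.0)      # x > 0
    y = rng.uniform(-11.0, -7.0)   # y < 0, so the xy < 0 branch applies; x != y
    p = rng.uniform(0.1, 1.0)      # y' > 0 so that y'^{3/2} is real; keeps w != 1 here
    lhs, rhs = check_reduction(x, y, p)
    assert abs(lhs - rhs) < 1e-9 * max(1.0, abs(rhs))
```

The assertion passing over random points confirms that the coordinates $v$ and $w$ transform the ODE as stated in Eq. (41).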
A.2 Design of the Training Set
The performance of the proposed Bayesian PNM is not our main focus in this work, as we consider the method to be (only) a proof-of-concept. However, for completeness we acknowledge that performance will depend on the locations at which the gradient field is evaluated; the so-called training set. In this section we discuss how these inputs could be optimally selected. To simplify the presentation, we focus on the case of a first order ODE, as in Eq. (15), where the inputs $r_1, \ldots, r_n$ must be selected.

The design of a PNM can be viewed as an instance of statistical experimental design (Chaloner and Verdinelli, 1995). In Sec. 3 of Cockayne et al. (2019) a connection between PNM and decision-theoretic experimental design was exposed. Such methods require that a loss function $L : Q \times Q \to \mathbb{R}$ is provided, where $L(q, q^\dagger)$ quantifies the loss when $q$ is used as an estimate for the true quantity of interest $q^\dagger$. Further detail was provided in Oates et al. (2019). To avoid repetition, in the remainder we focus instead on approximate experimental design, where a loss function is not explicitly needed.

Recall that the output of a PNM is the distribution $\mu_n = B(\mu, a_n) \in \mathcal{P}_Q$. Then one can specify a functional $\ell : \mathcal{P}_Q \to \mathbb{R}$ and compute
$$\tau(r_1, \ldots, r_n) = \int \ell(B(\mu, A(y; r_1, \ldots, r_n))) \, d\mu(y) \qquad (44)$$
where $A(\cdot \,; r_1, \ldots, r_n)$ is the information operator in Eq. (17) with the dependence on $r_1, \ldots, r_n$ made explicit. For the choice $\ell(\nu) = \log \det(\mathrm{Cov}_{\tilde{Q} \sim \nu}[\tilde{Q}])$, a configuration $(r_1, \ldots, r_n)$ for which $\tau(r_1, \ldots, r_n)$ is minimised is said to be D-optimal. The functional $\ell$ plays the role of an approximation to posterior expected loss, and other choices for $\ell$ lead to other approximate notions of optimal experimental design. For instance, an
A-optimal design was used for the Bayesian solution of a partial differential equation in Cockayne et al. (2016). For further background on experimental design we refer the reader to Chaloner and Verdinelli (1995).

Importantly, Eq. (44) does not depend on the information $A(y^\dagger)$ and can therefore be evaluated prior to the experiment being performed. However, in general the numerical approximation of Eq. (44), and the task of finding a minimal configuration, is practically difficult. The reader is referred to Overstall et al. (2019) for further discussion of experimental design in the PNM context.

A.3 Computational Detail
In this Appendix we set out in detail the prior construction that was used for the numerical illustrations of Sec. 4 in the main text. Recall that for both the first order ODE example in Sec. 4.1 and the second order ODE example in Sec. 4.2 we required a non-parametric prior over functions $\zeta$ which satisfy the constraints given in Eq. (24), namely that
$$\zeta(r_0) = 0, \qquad \zeta(r) \leq 1, \qquad \frac{d\zeta}{dr} \geq 0. \qquad (45)$$
Moreover, bearing in mind the posterior computation that is to follow, we require in addition that the prior conveniently facilitates the conditioning calculations involved. It is clear that standard non-parametric priors such as Gaussian processes do not satisfy the boundedness or monotonicity constraints, whilst a nonlinear transformation of such a process would fail to make conditioning on data straight-forward. In fact, the construction of such flexible priors remains an active area of research.

To proceed, we adopted an approach recently proposed in López-Lopera et al. (2018). In brief, the main idea is to construct an $N$-dimensional parametric distribution over functions for which Eq. (45) is satisfied. This distribution, being finite-dimensional, allows for the possibility of tractable conditioning operations, whilst the flexibility to take $N$ arbitrarily large provides a means of ensuring that the salient uncertainty is accurately represented. More specifically, the function $\zeta$ is parametrised as
$$\zeta(r) = \sum_{j=1}^{N} z_j \phi_j(r) \qquad (46)$$
where the $\phi_j$ are basis functions
$$\phi_j(r) = \begin{cases} 1 - \left| \frac{r - t_j}{h} \right|, & \left| \frac{r - t_j}{h} \right| \leq 1 \\ 0, & \text{otherwise} \end{cases}$$
for equally spaced points $t_j$ with increment $h$, as recommended in López-Lopera et al. (2018). A prior on $\zeta$ can be induced via a prior on the coefficients $z_1, \ldots, z_N$, with $N$ taken to be substantially larger than the number $n$ of datapoints on which the $z_i$ are to be conditioned. The specific construction of a prior on the coefficients is required to encode the constraints in Eq. (45) and to admit tractable conditioning; these issues are discussed in the remainder.

First, we consider the boundedness and monotonicity constraints in Eq. (45). At the level of the coefficients, it is straight-forward to check that this requires that the prior support is restricted to the set
$$Z = \{ z \in \mathbb{R}^N : 0 < z_1 \leq z_2 \leq \cdots \leq z_N \leq 1 \}.$$
For convenience, we elected to use a prior that was obtained by restricting a standard Gaussian measure $N(0, I)$ on $\mathbb{R}^N$ to the set $Z$.

Second, we consider how to condition on a dataset. Recall that information is provided on the values of the gradient $\zeta'(r_i) = b_i$ of the function $\zeta$, evaluated at a finite number of locations $r_i$ of the canonical coordinate $r$, together with the initial condition $\zeta(r_0) = b_0$. Thus the information can be described by the linear system of constraints $\Phi z = b$ where
$$\Phi = \begin{bmatrix} \phi_1(r_0) & \ldots & \phi_N(r_0) \\ \phi_1'(r_1) & \ldots & \phi_N'(r_1) \\ \vdots & & \vdots \\ \phi_1'(r_n) & \ldots & \phi_N'(r_n) \end{bmatrix}, \qquad b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_n \end{bmatrix}.$$
The posterior can therefore be characterised as the restriction of $N(0, I)$ to the set $Z \cap D$ where $D = \{ z \in \mathbb{R}^N : \Phi z = b \}$.

Finally, we discuss how posterior computation was performed. The key observation is that an equivalent characterisation of the posterior is first to restrict $N(0, I)$ to $D$ and then to further restrict to $Z$. This is advantageous since the linear nature of the data implies that the restriction $z | D$ of $N(0, I)$ to $D$ is again a Gaussian with a closed form, denoted $N(\mu, \Sigma)$. It is important to note that $\Sigma$ is singular (rank $\rho = N - n - 1$) and so $\Sigma = U \Lambda U^\top$, where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix with $\rho$ non-zero entries on the diagonal. Thus we can express $z | D$ in the form
$$z = \mu + U \Lambda^{1/2} \tilde{z}$$
where $\tilde{z} \sim N(0, I)$ is a standard Gaussian on $\mathbb{R}^\rho$ (retaining only the $\rho$ columns of $U \Lambda^{1/2}$ associated with non-zero eigenvalues). Let $M = U \Lambda^{1/2}$ and let $m_i = [m_{i,1}, \ldots, m_{i,\rho}]$ denote the $i$th row of $M$. Then we have the relation $z \in Z \iff \tilde{z} \in \tilde{Z}$ where
$$\tilde{Z} = \{ \tilde{z} \in \mathbb{R}^\rho : 0 \leq m_1 \tilde{z} + \mu_1, \; 0 \leq (m_{i+1} - m_i)\tilde{z} + (\mu_{i+1} - \mu_i) \text{ for } i = 1, \ldots, N-1, \; 0 \leq -m_N \tilde{z} + (1 - \mu_N) \}$$
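The construction described in this Appendix can be sketched in a few lines of numpy. The basis size, knot locations and data values below are illustrative only (our choices, not the paper's), and the final restriction to the monotonicity set Z, for which exact samplers for truncated Gaussians such as that of Pakman and Paninski (2014) can be used, is omitted:

```python
import numpy as np

N, h = 12, 1.0 / 11
t = np.linspace(0.0, 1.0, N)               # equally spaced knots t_j

def phi(r):
    # hat basis: phi_j(r) = 1 - |r - t_j|/h on |r - t_j| <= h, else 0
    return np.clip(1.0 - np.abs(r - t) / h, 0.0, None)

def dphi(r):
    # derivative of the hat basis (valid away from the knots)
    on = np.abs(r - t) < h
    return np.where(on, -np.sign(r - t) / h, 0.0)

# information: zeta(r0) = 0 plus n gradient values zeta'(r_i) = b_i >= 0
r0 = 0.0
r_obs = np.array([0.23, 0.51, 0.87])       # chosen off the knots
b_obs = np.array([0.5, 1.2, 0.3])
Phi = np.vstack([phi(r0)] + [dphi(r) for r in r_obs])
b = np.concatenate([[0.0], b_obs])

# restriction of N(0, I) to D = {z : Phi z = b} is the Gaussian N(mu, Sigma)
K = Phi @ Phi.T
mu = Phi.T @ np.linalg.solve(K, b)
Sigma = np.eye(N) - Phi.T @ np.linalg.solve(K, Phi)

# Sigma is singular with rank rho = N - n - 1; factor it as M = U Lambda^{1/2}
lam, U = np.linalg.eigh(Sigma)
keep = lam > 1e-10
rho = int(np.sum(keep))
M = U[:, keep] * np.sqrt(lam[keep])

rng = np.random.default_rng(0)
z = mu + M @ rng.standard_normal(rho)      # one draw from z | D
assert rho == N - len(b_obs) - 1
assert np.allclose(Phi @ z, b)             # every draw satisfies the data exactly
```

Draws from z | D satisfy the linear constraints by construction; the further restriction to the ordered set Z would then be imposed on the low-dimensional variable z-tilde.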