Bayesian optimal design using stochastic gradient optimisation and Fisher information gain
Sophie Harbisher, Colin S. Gillespie and Dennis Prangle
School of Mathematics, Statistics and Physics, Newcastle University, UK
Abstract
Finding high dimensional designs is increasingly important in applications of experimental design, but is computationally demanding under existing methods. We introduce an efficient approach applying recent advances in stochastic gradient optimisation. To allow rapid gradient calculations we work with a computationally convenient utility function, the trace of the Fisher information. We provide a decision theoretic justification for this utility, analogous to work by Bernardo (1979) on the Shannon information gain. Due to this similarity we refer to our utility as the Fisher information gain. We compare our optimisation scheme, SGO-FIG, to existing state-of-the-art methods and show our approach is quicker at finding designs which maximise expected utility, allowing designs with hundreds of choices to be produced in under a minute in one example.
Introduction

Selecting a good design for an experiment can be crucial to extracting useful information and controlling costs. Applications include medical studies (Amzal et al., 2006), epidemic modelling (Cook et al., 2008), pharmacokinetics (Ryan et al., 2014; Overstall and Woods, 2017) and ecology (Gillespie and Boys, 2019). In modern applications it is increasingly feasible to take a large number of measurements, for example placing sensors (Krause et al., 2009) or making observations in a numerical integration problem (Oates et al., 2019). Therefore high dimensional designs are increasingly relevant. However most methods for optimal design are expensive in the high dimensional setting, so there is a need for more efficient and scalable methods.

We focus on the Bayesian approach to optimal experimental design, which takes into account existing knowledge and uncertainty about the process being studied before the experiment is undertaken. In this framework an experimenter must select a design. They then receive some utility based on the outcome of the experiment. The aim is to select the design which maximises expected utility given the experimenter's beliefs before the experiment takes place. Mathematically this is an optimisation problem. A practical challenge is that the expected utility to be optimised usually cannot be calculated exactly. Instead only random estimates can be produced through simulation.

We approach the problem by applying methods from the machine learning literature for high dimensional optimisation in similar settings: automatic differentiation and adaptive variants of stochastic gradient descent. These rely on being able to estimate the gradient of the expected utility with respect to the design. Therefore our methods are relevant to problems where there is a continuous space of possible designs.

For many utility functions it is expensive to estimate utilities or their gradients.
For instance, a single utility evaluation often requires performing some aspect of Bayesian inference for simulated data using a Monte Carlo calculation. Therefore we focus on a utility function which is particularly computationally convenient as it is often available in closed form: the trace of the Fisher information. This is commonly used in classical optimal design, but is often criticised in the Bayesian experimental design literature as effectively only relying on an approximation to the posterior (see Section 2 for more discussion). However Walker (2016) shows that it has an information theoretic justification. We provide a further justification of this utility as a Bayesian approach, by noting that there is a derivation directly from the first principles of decision theory. The argument is analogous to that of Bernardo (1979) supporting the use of the Shannon information gain utility. Due to this similarity, we refer to our utility as the Fisher information gain (FIG). Our derivation is based on judging the quality of a parameter estimate through the Hyvärinen score (Hyvärinen, 2005), rather than through the logarithmic score as in Bernardo (1979).

We compare our optimisation scheme, SGO-FIG, to existing state-of-the-art methods such as the algorithm of Müller (1999) and ACE (Overstall and Woods, 2017), also using the FIG utility. Our approach is quicker at finding designs which are local maxima of expected utility, and allows designs with hundreds of choices to be produced in under a minute. A potential drawback of our method is that in one example it often converges to poor local maxima. We show how post-processing methods from Overstall and Woods (2017) can be used to find the overall optimal design. In our conclusion, Section 8, we present general recommendations on how to combine SGO-FIG with post-processing.

Similar gradient-based optimisation approaches to ours have been explored previously.
Pronzato and Walter (1985) optimise the expected determinant of the Fisher information using analytically derived gradients. Huan and Marzouk (2013, 2014) optimise expected Shannon information using gradients (either derived analytically or based on finite differences) for a biased numerical approximation to the utility. The novelty of our approach is to use recently developed adaptive stochastic gradient algorithms and automatic differentiation frameworks, as well as a utility function chosen to allow cheap calculation of unbiased gradient estimates.

To summarise, the main contribution of our paper is an efficient algorithm for high dimensional Bayesian optimal design using stochastic gradient optimisation. The algorithm relies on a particular utility function: Fisher information gain. A secondary contribution of our work is a decision theoretic justification for its use. To save space, the latter contribution is mainly discussed in the appendix.

The remainder of the paper is structured as follows. Section 2 presents background on Bayesian experimental design, including common utility functions, the role of decision theory and existing computational methods. Section 3 describes the utility function we use, Fisher information gain. Section 4 presents our algorithm for optimal design. Sections 5-7 present three example applications, including a comparison of our algorithm to existing methods. Code to illustrate all of these applications is available at github.com/SophieHarbisher/SGO-FIG. Finally, Section 8 concludes with a discussion.
Bayesian experimental design
Optimal experimental design concerns the following problem. An experimenter must select a design τ. The experiment produces data y with likelihood f(y | θ; τ), where θ is a vector of unknown parameters. The goal is to select the design which optimises some notion of the quality of the experiment, typically based on its informativeness and its cost.

The Bayesian approach to experimental design involves selecting a function U = U(τ, θ, y), giving the utility of selecting design τ given observations y and true parameters θ. (As we shall see, some choices of U do not depend on all these possible inputs.) We try to maximise the expected utility of τ, i.e. the prior predictive utility

J(τ) = E_{θ∼π(θ), y∼f(y|θ;τ)} [U(τ, θ, y)],   (1)

where π(θ) is the prior density for θ. The optimal design τ* is that maximising J(τ). Throughout we consider the case where a unique maximum exists, although much of the methodology will remain useful when this is not the case.

This section reviews relevant details of Bayesian experimental design. See Chaloner and Verdinelli (1995) and Ryan et al. (2016) for more comprehensive surveys. First, Section 2.1 introduces some useful notation. Section 2.2 summarises some common choices of utility function, and Section 2.3 describes a particularly principled approach: deriving it with decision theory. Section 2.4 reviews existing computational methods for estimating the optimal design.

As usual in Bayesian statistics, we will make use of the posterior density and the prior predictive density for y. In our setting both depend on the experimental design τ:

π(θ | y; τ) = π(θ) f(y | θ; τ) / π(y; τ),   (2)
π(y; τ) = E_{θ∼π(θ)} [f(y | θ; τ)].   (3)

We will also make use of the Fisher information matrix,

I(θ; τ) = E_{y∼f(y|θ;τ)} [u(y, θ; τ) u(y, θ; τ)^T],   (4)

which is based on the score function,

u(y, θ; τ) = ∇_θ log f(y | θ; τ).   (5)
We will focus on models where both of these are well defined.

We concentrate on the case where τ = (τ_1, ..., τ_d) ∈ T ⊆ R^d, i.e. a design is a fixed number, d, of real-valued quantities. Such a design typically represents times or locations of measurements. We will typically assume there are p parameters θ_1, θ_2, ..., θ_p and that y is a vector of n observations y_1, y_2, ..., y_n.

Ideally a specific utility function for the situation at hand could be chosen, perhaps by eliciting preferences over different (τ, θ, y) combinations from the experimenter (e.g. Wolfson et al., 1996). However this is often infeasible in practice. Instead several generic choices of utility have been proposed, including: scalar summaries of posterior precision or the difference between θ and E[θ | y] (the posterior mean); information theoretic choices; and utilities based on predicting future observations. Ryan et al. (2016) describe these utilities in more detail, and refer to them as fully Bayesian as they are functionals of the posterior distribution π(θ | y; τ).

Producing good estimates of posterior quantities can be computationally expensive. Hence another set of generic utility choices are based on cruder posterior approximations, in particular Gaussian approximations using I(θ; τ)^{-1}, the inverse of the Fisher information matrix (4), as the variance matrix (Chaloner and Verdinelli, 1995). The utility can then be taken to be a scalar summary of I(θ; τ). This corresponds to using alphabet optimality criteria from classical experimental design (Box, 1982). Such utilities include tr I(θ; τ) (T-optimality), det I(θ; τ) (D-optimality) and −tr I(θ; τ)^{-1} (A-optimality). Ryan et al. (2016) refer to these as pseudo-Bayesian as they are not functionals of the posterior.

Remark 1.
The distinction between pseudo and fully Bayesian utility functions is not as clear cut as it appears. In particular, Section 3 will present an example where a fully Bayesian and a pseudo-Bayesian utility (under the preceding definitions) are equivalent, in the sense of always producing the same expected utility function up to an additive constant, and hence the same optimal design. We explore this issue further in Appendix A.
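To make the definitions above concrete, the following sketch (our illustration, not code from the paper) checks the Monte Carlo average of squared scores against the known Fisher information for a toy model y ∼ N(θτ, 1), whose score is u(y, θ; τ) = τ(y − θτ) and whose Fisher information is exactly τ².

```python
import random

# Toy illustration (not from the paper): for y ~ N(theta * tau, 1) the score
# is u(y, theta; tau) = tau * (y - theta * tau), and the Fisher information
# I(theta; tau) = E[u^2] equals tau^2 exactly.
def fisher_info_mc(theta, tau, n_sims=200_000, seed=1):
    """Monte Carlo estimate of I(theta; tau) = E[u u^T] (scalar parameter case)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        y = rng.gauss(theta * tau, 1.0)   # simulate y ~ f(y | theta; tau)
        u = tau * (y - theta * tau)       # score function
        total += u * u                    # outer product reduces to a square here
    return total / n_sims

print(fisher_info_mc(0.5, 2.0))  # close to tau^2 = 4
```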
Shannon information gain (SIG),

U_SIG(τ, θ, y) = log π(θ | y; τ) − log π(θ)   (6)
             = log f(y | θ; τ) − log π(y; τ).   (7)

A common SIG estimate replaces π(y; τ) in (7) with a Monte Carlo estimate

π̂(y; τ) = (1/L) Σ_{ℓ=1}^{L} f(y | θ^{(ℓ)}; τ),   (8)

where the θ^{(ℓ)} are independent draws from the prior. A typical choice of L is 1000 (Overstall and Woods, 2017), which makes each utility evaluation somewhat computationally expensive. Furthermore, some approximation error is introduced. In general, numerical estimation of most fully Bayesian utility functions involves a similar mixture of computational cost and approximation error (see e.g. Ryan et al., 2016 and Overstall and Woods, 2017 for details).

Lindley (1972), following Raiffa and Schlaifer (1961), proposed viewing experimental design as a decision problem. As in the general framework at the start of Section 2, the experimenter selects a design τ and the experiment produces data y given parameters θ with an assumed prior distribution. Following this, the experimenter selects an action a based on observing y (but not θ). Their preferences are represented by a function V(a, θ, y, τ), which we will refer to as the base utility.

The utility U required for Bayesian experimental design can be derived by assuming that the optimal action a (which we assume exists) is always taken, giving

U = max_a E_{θ∼π(θ|y;τ)} [V(a, θ, y, τ)].

(Note this is a case where the utility U depends on τ and y only.) In principle the base utility could be elicited from the experimenter's preferences. However this is often not possible and instead a generic function can be used. For instance a simple possibility is to let a be a point estimate of θ and use V = −(a − θ)^T(a − θ), i.e. negative squared error.
See Chaloner and Verdinelli (1995) for further discussion.

Bernardo (1979) proposed that a instead be an estimated density for θ. The utility can then be based on a strictly proper scoring rule, a functional for evaluating the quality of density estimates. Bernardo showed that this framework allows a decision theoretic derivation of Shannon information gain. We summarise the argument in Appendix A.

Below we discuss two popular algorithms for Bayesian experimental design, which we use in our comparisons: the Müller and ACE algorithms. Many other algorithms have been proposed. For example, Ryan et al. (2014) look at high dimensional designs with a low dimensional parametric form, Price et al. (2018) use evolutionary algorithms, and Gillespie and Boys (2019) perform a sophisticated search on a discrete grid of designs.
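Before turning to these algorithms, it is worth illustrating why nested Monte Carlo utilities are costly. The sketch below (ours, not the paper's code) implements the SIG estimate (8) for a toy model with a θ ∼ N(0, 1) prior and y | θ ∼ N(θτ, 1); each utility evaluation requires L likelihood evaluations, in contrast to a closed-form FIG evaluation.

```python
import math, random

# Minimal sketch (ours) of the nested Monte Carlo SIG estimate (7)-(8) for a
# toy model: theta ~ N(0, 1) prior, y | theta ~ N(theta * tau, 1).
def log_lik(y, theta, tau):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y - theta * tau) ** 2

def sig_estimate(y, theta, tau, L=1000, rng=None):
    rng = rng or random.Random(0)
    draws = [rng.gauss(0.0, 1.0) for _ in range(L)]   # theta^(l) ~ prior
    # Monte Carlo estimate of the prior predictive density pi(y; tau)
    pi_hat = sum(math.exp(log_lik(y, t, tau)) for t in draws) / L
    return log_lik(y, theta, tau) - math.log(pi_hat)

rng = random.Random(1)
theta = rng.gauss(0.0, 1.0)
y = rng.gauss(theta * 2.0, 1.0)
print(sig_estimate(y, theta, 2.0))   # one noisy utility evaluation, cost O(L)
```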
Müller algorithm
Müller (1999) performs optimal design using Markov chain Monte Carlo (see Amzal et al., 2006 and Kück et al., 2006 for similar approaches using sequential Monte Carlo). The target density is

h(τ, θ_1, ..., θ_J, y_1, ..., y_J) ∝ ∏_{j=1}^{J} U(τ, θ_j, y_j) π(θ_j) f(y_j | θ_j; τ).

The marginal density for τ is proportional to a power of the expected utility, J(τ)^J. Hence the mode of τ under this marginal is the optimal design. Estimating the mode from MCMC samples can be non-trivial, especially for high dimensional designs. Taking larger values of J makes the mode easier to identify, but increases the computational cost of each iteration.

The Müller algorithm uses Metropolis-Hastings to sample from the target distribution. To draw a proposal, first τ* is sampled from a proposal kernel. Then (θ*_j, y*_j) pairs are sampled independently from π(θ) f(y | θ; τ*). In our implementation we sample τ* ∼ N(τ, σ²_RW I). Therefore the tuning choices we require are J and σ_RW.

Approximate co-ordinate exchange (ACE)
The ACE algorithm (Overstall and Woods, 2017) consists of two phases. Phase I is referred to as coordinate exchange. It loops over the components of the design, updating each in turn. To perform an update, designs are selected from the one dimensional search space (in which the current component only is updated) and Monte Carlo estimates of expected utility calculated. A Gaussian process is fitted to the expected utility estimates and used to propose an improved value for the design component under consideration. This is accepted or rejected based on a Bayesian test of whether it improves expected utility, using a large number of simulations under the current and proposed designs.

Phase II is referred to as point exchange. It considers whether the design output by phase I can be improved by replacing some components of the design with replicates of other components. New designs are proposed in a greedy fashion, replicating the design point which would yield the largest improvement in estimated expected utility, then removing the point which would result in the least reduction of the estimated expected utility. Similarly to phase I, the proposed design is then accepted based on a Bayesian test of whether expected utility is improved, after sampling utilities under the existing and candidate designs.

ACE can converge to local optima so the authors suggest running the algorithm multiple times and selecting the design which returns the highest expected utility. These runs can be performed in parallel to reduce computation time. The algorithm is implemented in the acebayes
R package (Overstall et al., 2017). The package allows ACE phase II to be run separately, so it can be used to post-process designs from any algorithm.
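The coordinate-descent idea behind ACE phase I can be sketched in a heavily simplified form (ours, for illustration): no Gaussian process emulation and no Bayesian acceptance test, just a grid search over one design component at a time, holding the others fixed.

```python
import math

# Heavily simplified sketch of coordinate exchange (ours): sweep over design
# components, replacing each with the best value on a one dimensional grid
# while the other components are held fixed.
def coordinate_exchange(J, tau, grid, sweeps=5):
    tau = list(tau)
    for _ in range(sweeps):
        for i in range(len(tau)):                      # loop over components
            _, best_t = max((J(tau[:i] + [t] + tau[i + 1:]), t) for t in grid)
            tau[i] = best_t                            # exchange component i
    return tau

# Toy separable expected utility with per-component optimum at tau_j = 2
J = lambda tau: sum(t * t * math.exp(-t) for t in tau)
grid = [j * 0.1 for j in range(101)]
print(coordinate_exchange(J, [0.5, 9.0, 4.0], grid))   # each component ends near 2
```

The real algorithm replaces the grid search with a Gaussian process emulator fitted to noisy Monte Carlo utility estimates, which is essential when each utility evaluation is expensive.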
Fisher information gain

Section 3.1 introduces the Fisher information gain utility and discusses its properties, with further details given in Appendix A. Section 3.2 discusses evaluating it computationally.
Walker (2016) proposes maximising the following objective function for optimal design:

J_FIG(τ) = E_{θ∼π(θ)} [tr I(θ; τ)].   (9)

Under the framework of Section 2.2, the utility function used is

U_FIG = tr I(θ; τ),   (10)

corresponding to classical T-optimality. We will refer to this as the Fisher information gain (FIG) due to an analogy with SIG discussed in Appendix A.

Remark 2.
One potential drawback of the FIG utility is that it is not invariant to reparameterisation. We discuss this further in Section 4.4.
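A toy demonstration of this non-invariance (ours, not from the paper): for y ∼ N(θ_1 τ_1 + θ_2 τ_2, 1) the FIG is tr I = τ_1² + τ_2², which is constant on the circle τ_1² + τ_2² = 1, but after rescaling ϑ_1 = w θ_1 it becomes (τ_1/w)² + τ_2², so the constrained optimal design changes with w.

```python
import math

# For y ~ N(theta_1 * tau_1 + theta_2 * tau_2, 1), tr I = tau_1^2 + tau_2^2,
# which ties across the constraint circle; after rescaling vartheta_1 = w*theta_1
# the FIG becomes (tau_1 / w)^2 + tau_2^2 and the constrained optimum moves.
def fig(tau, w=1.0):
    return (tau[0] / w) ** 2 + tau[1] ** 2

# Candidate designs on the constraint tau_1^2 + tau_2^2 = 1
designs = [(math.cos(a), math.sin(a))
           for a in (k * math.pi / 200 for k in range(101))]

best_w10 = max(designs, key=lambda t: fig(t, w=10.0))
print(best_w10)   # puts all effort on the tau_2 direction: close to (0, 1)
```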
Since the right hand side of (10) does not involve y, the FIG utility is not a functional of the posterior, and therefore is pseudo-Bayesian under the terminology of Ryan et al. (2016). However Walker (2016) shows that J_FIG also results (up to an additive constant) from using utilities which are functionals of the posterior (detailed in Section A.4 of the appendix). Hence, as noted in Remark 1, an expected utility equivalent to J_FIG can be derived from both pseudo-Bayesian and fully Bayesian utility functions.

A contribution of our paper is to provide stronger theoretical support for J_FIG, by extending the theoretical argument of Bernardo (1979) supporting SIG to produce the following results.
Result 1.
The expected utility J_FIG can be derived from Bayesian decision theory using the negative Hyvärinen score (Hyvärinen, 2005) as the base utility.
The background and proof of this result are presented in Appendix A. This also discusses how the various utilities discussed by Walker (2016) correspond to information theoretic quantities derived from the Hyvärinen score. The most computationally convenient to use in practice is U_FIG, as discussed in the next section.
An advantage of the FIG utility is that for many models the Fisher information is available in closed form. This avoids the computational cost and approximation error associated with many alternative utilities, discussed in Section 2.2. (When a closed form is not available, unbiased Monte Carlo estimation is possible. See Appendix E for details of this case.)

We give details here of one general case which we will use in some later examples. Consider a model with observation vector y ∼ N(x(θ, τ), Σ(τ)). That is, the observations are true values x(θ, τ) plus normal noise, which may be correlated. The entry in row i, column j of the associated Fisher information matrix is (see e.g. Miller, 1974, section V, equation (4.4))

I_{i,j}(θ; τ) = v_i(θ, τ)^T Σ(τ)^{-1} v_j(θ, τ),

where v_i is the vector ∂x(θ, τ)/∂θ_i of elementwise derivatives of x. Thus

tr I(θ; τ) = Σ_{i=1}^{p} v_i(θ, τ)^T Σ(τ)^{-1} v_i(θ, τ).   (11)

Remark 3.
Under the normal observation model, replacing Σ(τ) with κΣ(τ) for a scalar κ > 0 only affects the FIG utility by a constant of proportionality, and so does not change the optimal design.

We have described how the FIG utility can easily be numerically evaluated. As we shall see, gradient calculations are also straightforward. This is in contrast to many other utility functions for Bayesian experimental design, where every utility evaluation involves high computational cost or approximation error (see Section 2.2), and gradient information is not easily available. These helpful properties of the FIG utility mean it can be used with powerful stochastic gradient optimisation methods to maximise the expected utility J_FIG(τ), as we outline in this section.

Section 4.1 presents our algorithm for Bayesian optimal design and background on the methods it uses. The algorithm requires unbiased estimates of the gradient of J_FIG(τ), and Section 4.2 discusses calculating these. Section 4.3 is on optimisation under constraints to the space of designs, and Section 4.4 is about optimising a weighted version of the FIG, and how this is helpful to learn information about all the parameters rather than a subset of them. Note that throughout the paper we use ∇ to represent the gradient with respect to τ. When differentiation with respect to another vector is required we add a subscript, e.g. ∇_θ.

Algorithm 1 summarises our SGO-FIG algorithm to maximise J_FIG(τ). We typically run this for a fixed number of iterations. Alternatively it could be run until a convergence condition is reached. For example, J_FIG(τ) could be estimated at each iteration and a moving average recorded, with the algorithm terminating if the minimum moving average value is not beaten for a prespecified number of iterations.

Algorithm 1 SGO-FIG: Stochastic gradient optimisation of expected Fisher information gain

Input: Initial design τ_0, number of iterations n, batch size K (used in gradient estimation).
Loop over t = 0, 1, 2, ..., n − 1:

1. Compute g_t, an estimate of ∇J_FIG(τ) at τ = τ_t, using (12) from Section 4.2. (This requires a closed form expression for I(θ; τ). See Appendix E for alternatives when this is not available.)

2. Get τ_{t+1} by calling the update rule of a stochastic gradient optimisation algorithm with current state τ_t and gradient estimate g_t. We use the update rule of the Adam algorithm (Kingma and Ba, 2015).

The remainder of this section briefly discusses the background to the optimisation methods used in step 2 of SGO-FIG. Our aim is to maximise J_FIG(τ) although we cannot compute this function exactly. This is possible using iterative steps of stochastic gradient optimisation methods. A straightforward iterative update is

τ_{t+1} = τ_t + a_t g_t,

where g_t is an unbiased estimate of ∇J_FIG(τ) at τ = τ_t and a_t is a decreasing learning rate sequence. For unbiased gradient estimates and an appropriate a_t sequence, convergence to a local optimum is guaranteed. See Kushner and Yin (2003) and Bottou et al. (2018) for an overview of the theory. However, selecting the learning rate sequence in advance is a considerable tuning challenge which has a large effect on the speed of convergence.

In the last decade many adaptive stochastic gradient optimisation algorithms have been proposed (Ruder, 2016). These select the learning rate adaptively based on the observed g_t's to speed convergence. Furthermore, a separate learning rate is used for each entry of the τ vector. Various algorithms of this form are in common use and we use the popular Adam ("adaptive moments") algorithm, which produces state-of-the-art empirical performance on optimisation problems in deep learning with millions of unknowns or more (Kingma and Ba, 2015). (Adam is technically a minimisation method, so we use it to minimise −J_FIG(τ).) An appealing feature of Adam is that it often does not require tuning, with the default choices performing well in many situations.
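The SGO loop with a hand-coded Adam update can be sketched as follows (our illustration, not the paper's TensorFlow implementation). It maximises a toy expected FIG J(τ) = (τ e^{−τ})², whose exact gradient we perturb with noise to mimic a Monte Carlo estimate g_t.

```python
import math, random

# Minimal sketch (ours) of SGO-FIG with the Adam update rule, on a toy
# objective J(tau) = tau^2 * exp(-2*tau) with optimum at tau* = 1.
def noisy_grad(tau, rng):
    exact = 2 * tau * math.exp(-2 * tau) * (1 - tau)   # d/dtau of tau^2 e^{-2 tau}
    return exact + rng.gauss(0.0, 0.01)                # unbiased noisy estimate g_t

def sgo_adam(tau0, n_iters=5000, alpha=0.01,
             beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    rng = random.Random(seed)
    tau, m, v = tau0, 0.0, 0.0
    for t in range(1, n_iters + 1):
        g = noisy_grad(tau, rng)             # step 1: gradient estimate
        m = beta1 * m + (1 - beta1) * g      # step 2: Adam moment updates
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)         # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        tau += alpha * m_hat / (math.sqrt(v_hat) + eps)   # "+" since we ascend
    return tau

print(sgo_adam(3.0))   # converges near the optimum tau* = 1
```

Note the per-coordinate normalisation by the square root of v_hat, which is what removes most of the learning rate tuning mentioned above.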
We experimented with varying the default tuning choices for the examples in this paper and found no improvement.

SGO-FIG requires unbiased estimates of ∇J_FIG(τ). From the definition of J_FIG(τ), (9), and assuming weak regularity conditions (see Appendix D) to allow interchange of differentiation and expectation,

∇J_FIG(τ) = E_{θ∼π(θ)} [∇ tr I(θ; τ)].

A closed form of the Fisher information is often available (see Appendix E for when this is not the case). In this case an unbiased Monte Carlo gradient estimate is

∇̂J_FIG(τ) = (1/K) Σ_{k=1}^{K} ∇ tr I(θ^{(k)}; τ),   (12)

where the θ^{(k)} are independent draws from the prior. Using a larger Monte Carlo batch size K reduces the variance of the estimates but increases computational cost.

Typically we can calculate this gradient estimate using automatic differentiation (Baydin et al., 2017). Our code does this using the Tensorflow framework (Abadi et al., 2016). We manually compute the function to be differentiated, tr I(θ; τ). See Sections 5-7 for some examples. Deriving tr I(θ; τ) can itself involve lengthy differentiation, and there may be scope for future work to avoid this using advanced automatic differentiation methods.

We often wish to optimise the expected utility under a constraint: τ ∈ T ⊂ R^d. For the application in this paper, constrained optimisation was possible by the simple pragmatic approach of adding a large penalty to expected utilities for τ ∉ T, whose gradient moves designs back towards the feasible space T. In more complex settings this penalisation method may not suffice.
A more sophisticated alternative is to compose each stochastic gradient optimisation update with a projection operation into T (Kushner and Yin, 2003).

Optimisation using weights

The FIG utility can be written as

tr I(θ; τ) = Σ_{i=1}^{p} E_{y∼f(y|θ;τ)} [u_i(y, θ; τ)²],   (13)

where u_i = ∂ log f(y | θ; τ)/∂θ_i is the ith element of the score function u. An alternative utility is to use a weighted version of the sum in (13). We now show that this is equivalent to rescaling the parameters. (This shows that the FIG is not invariant under reparameterisation, as mentioned in Remark 2.)

Consider ϑ = Wθ where W is a diagonal weight matrix with diagonal entries w_1, w_2, ..., w_p, all positive. Let w represent the vector of these weights. Then it is easy to show that

tr I(ϑ; τ) = tr[I(θ; τ) W^{-2}] = Σ_{i=1}^{p} E_{y∼f(y|θ;τ)} [u_i(y, θ; τ)² / w_i²].   (14)

The corresponding expected utility is

J_FIG(τ; w) = E_{θ∼π(θ)} tr[I(θ; τ) W^{-2}].

When there is no natural parameter scale to use, we argue it is reasonable to weight the parameters so that each contributes a comparable amount to the sum in (14). Otherwise optimal design may concentrate on maximising informativeness for a subset of parameters: those with the largest contributions (see Table 1 later for an example).

In Appendix B we present Algorithm 2, which adaptively learns weights with the property just described, and optimises (14). We found this algorithm has good empirical performance. However, its convergence is not guaranteed. To ensure convergence, one can use Algorithm 2 to pick reasonable weights, and then run Algorithm 1 using rescaled parameters ϑ. See Appendix B for further details.

Death model example

Several authors have investigated experimental design for the simple death process (Renshaw, 1993; Cook et al., 2008; Gillespie and Boys, 2019).
To illustrate our method, we consider this setting with a single observation time, τ. In this scenario the observation model is Y ∼ Bin(N, λ) where λ = exp(−θτ). Here N is a known constant and θ is the parameter of interest.

The Fisher information for this model can be derived from two standard results. First, the Fisher information for Y ∼ Bin(N, φ) is I_φ(φ) = N / [φ(1 − φ)]. Secondly, a reparameterisation ϕ = ϕ(φ) produces I_ϕ(ϕ) = I_φ(φ) (dφ/dϕ)². Hence for the death model we have

I_θ(θ) = N τ² λ / (1 − λ) = N τ² exp(−θτ) / [1 − exp(−θτ)].

Since this is scalar, the expected FIG utility is J_FIG(τ) = E_θ[I_θ(θ)]. Following Cook et al. (2008), we take N = 50 and a log normal LN(0.005, 0.01) prior distribution for θ. In this example the expected utility is a univariate integral, so near-exact calculation by numerical integration is possible. Figure 1a shows the resulting utility surface, and the optimal observation time τ* ≈ 1.61.

We ran SGO-FIG with initial τ values of 0, 1, 2, ..., 10, and a batch size of K = 1. (In practice rather than τ = 0 we used a small positive value to avoid numerical issues.) Regardless of the initial τ value, SGO-FIG quickly locates the optimal design. As would be expected, the closer the initial value is to τ*, the quicker the algorithm converges. Each analysis uses 10,000 iterations, which runs in only a few seconds for this simple example.

Pharmacokinetic example

This section contains simulation studies using a pharmacokinetic model. The goal is to investigate the performance of our optimisation method and compare it with existing methods. We focus on also using the FIG utility in the other methods to give a fair comparison.

Throughout we compare the methods based on how many utility evaluations they perform. For SGO-FIG each utility evaluation also has an associated gradient evaluation, performed using automatic differentiation. Due to this extra cost we also comment on the time taken for
the different methods, although this is an imperfect comparison as it is influenced by implementation details of the methods, such as what programming language was used.

Figure 1: Death model plots. (a) The utility surface. (b) Trace plots of observation time τ against computational cost (measured in utility gradient evaluations) for 11 independent runs of SGO-FIG. The optimal design τ* is shown by the dotted line.

Model
We assume that drug concentration, y_j, at time τ_j (in hours) is distributed as y_j ∼ N(x(θ, τ_j), σ²), where

x(θ, τ_j) = [D θ_2 (exp(−θ_1 τ_j) − exp(−θ_2 τ_j))] / [θ_3 (θ_2 − θ_1)],

and D = 400. Concentrations at different times are assumed to be independent. This is a modification of a model from Ryan et al. (2014) and Overstall and Woods (2017), removing a multiplicative noise term for simplicity. Following this earlier work we assume independent log normal prior distributions

θ_1 ∼ LN(log 0.1, 0.05), θ_2 ∼ LN(log 1, 0.05), θ_3 ∼ LN(log 20, 0.05),

and σ² = 0.1.

[Table 1: percentage change in each parameter's (θ_1, θ_2, θ_3) contribution to the FIG after optimisation, under unweighted and weighted objectives; unweighted optimisation increases one contribution (+218%) while decreasing the other two.]

Methods
The FIG utility for this model can be calculated using (11). See Appendix C for details. When implementing the algorithms, we found no need to use constrained optimisation as the designs remained in the interval [0, 24] in any case.

The three terms in the sum representation of the FIG (13) are on widely different scales for this example. Table 1 shows that using SGO-FIG directly on this utility only increases the utility contribution for one parameter. Hence we use a weighted FIG (14) instead. After a short pilot run using adaptive weights (i.e. Algorithm 2), weights of w = (5 × 10^…, … × 10^…, 70) were selected. In the remainder of this section all methods use parameters scaled by these weights. Table 1 shows that SGO-FIG on the weighted FIG objective increases the utility contribution for all parameters.

Table 2 contains details of the algorithms we compare and their tuning choices. We sampled 100 initial designs from a uniform distribution over the search space. Each algorithm is run from each of these initial designs. We also investigate post-processing the results from each method using the ACE phase II algorithm (under its default tuning choices from Overstall and Woods, 2017).
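As an illustration of how cheap each FIG utility evaluation is here, the sketch below (ours, not the paper's TensorFlow code) evaluates tr I(θ; τ) for the pharmacokinetic model via (11) with Σ(τ) = σ²I, approximating the derivative vectors v_i by central finite differences rather than automatic differentiation; the parameter value (the prior medians above) and the design are illustrative.

```python
import math

# Sketch (ours) of one FIG utility evaluation for the pharmacokinetic model
# via (11): with Sigma = sigma^2 I, tr I(theta; tau) = sum_i ||v_i||^2 / sigma^2,
# where v_i = d x(theta, tau) / d theta_i, approximated by finite differences.
D, SIGMA2 = 400.0, 0.1   # dose; noise variance (illustrative value)

def x(theta, t):
    """Mean drug concentration at time t."""
    th1, th2, th3 = theta
    return D * th2 * (math.exp(-th1 * t) - math.exp(-th2 * t)) / (th3 * (th2 - th1))

def trace_fisher(theta, design, h=1e-6):
    total = 0.0
    for i in range(len(theta)):             # loop over parameters theta_i
        tp, tm = list(theta), list(theta)
        tp[i] += h
        tm[i] -= h
        for t in design:                    # accumulate v_i^T Sigma^{-1} v_i
            v_it = (x(tp, t) - x(tm, t)) / (2 * h)
            total += v_it * v_it / SIGMA2
    return total

theta = [0.1, 1.0, 20.0]                    # prior medians (illustrative)
design = [1.0, 2.0, 4.0, 8.0, 16.0]         # observation times in [0, 24]
print(trace_fisher(theta, design))          # a single cheap utility evaluation
```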
Results
First we investigated the cost of evaluating the FIG and SIG utilities. As discussed earlier, we expect SIG evaluation to be slower as it requires computing a Monte Carlo estimate (8). In ACE, the mean time to produce 1000 utility evaluations was 0.008 seconds for FIG compared to 0.082 seconds for SIG. Hence FIG produces roughly a 10 times speed-up in utility evaluation compared to SIG. (This is despite our FIG calculations within ACE being performed in user-supplied R code, while more efficient built-in C code is used for SIG.)

Method        | Tuning choices
ACE (phase I) | Defaults, as given by Overstall and Woods (2017)
SGO-FIG       | 1.8 × 10^5 iterations, K = 1 (other K values discussed below)
Müller 1      | J = 1, 1.8 × 10^5 iterations, σ_RW = 1 × 10^{−…}
Müller 2      | J = 2, 9.0 × 10^4 iterations, σ_RW = 2 × 10^{−…}

Table 2: Settings for the algorithms used in the pharmacokinetic model simulation studies. The number of utility evaluations used in each aims to equal the default computational cost of the ACE algorithm (1.8 × 10^5 evaluations). See Section 2.4 for the interpretation of the Müller tuning parameters.

Figure 2 shows expected utilities returned by each optimal design method. Müller makes only a small improvement on the initial designs. SGO-FIG makes a larger improvement but does worse than ACE. Post-processing helps all the methods, with SGO-FIG results now generally the best, but occasionally extremely poor (about 10% of results have expected utility close to 2.5).

Speed of convergence is also of interest. Figure 3 is a trace plot of utilities during the execution of the ACE and SGO-FIG algorithms. SGO-FIG converges much quicker than ACE (taking only around 10^3 iterations), while ACE does not appear to have converged after the full computational budget. However SGO-FIG can converge to several different expected utilities, typically achieving a worse value than ACE.

For a fixed number of utility evaluations ACE was much quicker to execute than the other methods.
For one particular initial design, approximate times were (in seconds): ACE 130, SGO-FIG 7,800, Müller 1 9,000, Müller 2 4,500. (The cost of our implementation of the Müller algorithm is dominated by non-utility calculations for each iteration, which is why computation time is halved when using half as many iterations.) However SGO-FIG takes only 40s to reach 10^ utility evaluations, at which point it has effectively converged.

Figure 4 summarises the designs returned by the algorithms. The Müller algorithm produces designs approximately uniformly spaced across the design
Figure 2: Expected utilities of designs returned by various optimal design methods for the pharmacokinetic example. Each method was run from the same set of 100 randomly sampled initial designs. The bottom four rows show results after post-processing with ACE phase II. For each returned design, the expected utility was calculated via a Monte Carlo estimate using 2 × 10^ utility evaluations to allow for a fair comparison. A box plot showing the expected utility at the initial designs is included for comparison, as well as a vertical line giving the expected utility of a uniformly spaced design over the search space.

Figure 3: Trace plots of expected utility for the pharmacokinetic example using (a) ACE and (b) SGO-FIG. Traces are shown for 100 randomly sampled initial designs, the same set for each method. The expected utility estimates shown for ACE are by-products of its other calculations. Those for SGO-FIG are averages of utility estimates from recent steps of Algorithm 1. Darker lines in (b) correspond to more frequent trace paths.

space. SGO-FIG converges to designs that have repeated observations at times close to 1 and 8. Across the 100 runs only the proportions at these two times change, indicating these designs are local maxima. ACE also returns designs where all times are close to 1 and 8. However there is more variability in the observation times, suggesting that these algorithms have not yet converged to a local maximum. Also, ACE places more observation times near to 1 than SGO-FIG. This suggests it is better at avoiding poor local maxima.

Post-processing usually selects all observation times to be close to 1 rather than close to 8. Since post-processing improves the expected utility, as shown by Figure 2, it appears that this is the optimal design.
The SGO-FIG algorithm has 10 runs where the design returned places all observations around 8, and post-processing cannot move points to other times to improve the utility. These correspond to the outlying points in Figure 2 where SGO-FIG plus post-processing achieves a poor expected utility close to 2.5.

Finally, we investigate the choice of the batch size K in SGO-FIG. The trace plots in Figure 5 show that for K = 1,
10 or 100, SGO-FIG converges in roughly the same number of iterations. Hence K = 1 is most efficient in terms of computational cost.

Summary
Our simulation study shows that FIG is much quicker to compute than SIG. Furthermore, SGO-FIG finds utility maxima using fewer utility evaluations than the other algorithms investigated. However it is slower than ACE in terms of time taken to run a fixed number of utility evaluations. Overall, this does not prevent SGO-FIG from being quicker to converge to its final design. We expect the advantage to be more pronounced in examples with more expensive utility evaluations.

SGO-FIG is more susceptible than ACE to finding poor local maxima. However combining SGO-FIG with post-processing by ACE phase II usually finds the global maximum. In a small number of cases the SGO-FIG designs reached a local maximum that the post-processing algorithm could not improve significantly.

Overall we found SGO-FIG followed by post-processing was the most efficient way to find the global maximum. However multiple runs are required to avoid the possibility of being stuck at a local maximum.

The optimal design in this setting is somewhat counter-intuitive – all observations are made at the same time! However it has been argued that
Figure 4: Designs returned by different optimal design methods for the pharmacokinetic example: (a) SGO-FIG, (b) ACE, (c) Müller. Designs are shown for 100 randomly sampled initial designs, the same set for each method. Each design is sorted in increasing order. Darker points indicate higher frequencies of observations around a particular time. For SGO-FIG, text labels have been added to make the frequencies clearer.
Figure 5: Trace plots of SGO-FIG for the pharmacokinetic example using various choices of batch size K (K = 1, 10 and 100). Traces are shown for 100 randomly sampled initial designs, the same set for each choice. The expected utility estimates shown are averages of utility estimates from recent steps of Algorithm 1. The top row shows the number of SGO-FIG iterations on the x-axis and the bottom row instead shows the number of utility evaluations.

replicated observation times can be highly informative (Binois et al., 2019). We discuss to what extent our choice of utility encourages replication in Section 8.

Geostatistical regression example

This section investigates an example requiring hundreds of design choices, to see how our method scales up to higher dimensional applications.
Model
Consider the following geostatistical regression model. Here a design τ is a d × 2 matrix whose i-th row specifies the location of a measurement. (For the purposes of running the optimal design algorithms, τ can be flattened into a vector.) The model assumes normal observations with a linear trend and a squared exponential covariance function with a nugget effect, giving

y ∼ N(x(θ, τ), Σ(τ)),
x_i = θ₁ τ_{i1} + θ₂ τ_{i2},
Σ = σ₁² I + σ₂² R(τ),
R_{ij} = exp[ −Σ_{k=1}^{2} (τ_{ik} − τ_{jk})² / ℓ² ].

For simplicity we assume that σ₁², σ₂² (observation variance components) and ℓ (covariance length scale) are known. Hence the unknown parameters are θ₁ and θ₂ (trends). An alternative parameterisation which will be useful shortly introduces κ = σ₁² + σ₂² and γ = σ₁² / (σ₁² + σ₂²), so that Σ = κ[γI + (1 − γ)R].

FIG utility
Using (11) the FIG utility is

tr I(τ) = Σ_{i=1}^{2} τ_{(i)}^T Σ(τ)^{-1} τ_{(i)} = tr[τ^T Σ(τ)^{-1} τ],

where τ_{(i)} is the i-th column of τ, i.e. the vector ∂x/∂θ_i. Note that in this setting the FIG does not depend on θ. Hence J_FIG(τ) simply equals tr I(τ). Also note that κ only affects J_FIG as a constant of proportionality, so it does not change the optimal design. (Recall Remark 3 – changing Σ to κΣ does not affect the optimal design.)
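The FIG utility above can be computed in a few lines. The sketch below assumes illustrative values for the variance components and length scale (the paper's exact settings are not all shown here), and uses a linear solve rather than an explicit matrix inverse.

```python
import numpy as np

def fig_utility(tau, sigma1_sq, sigma2_sq, length):
    """FIG utility tr I(tau) = tr[tau^T Sigma(tau)^{-1} tau] for the
    geostatistical regression model (parameter values are illustrative)."""
    d = tau.shape[0]
    # Squared exponential correlation: R_ij = exp(-sum_k (tau_ik - tau_jk)^2 / l^2)
    sq_dists = np.sum((tau[:, None, :] - tau[None, :, :]) ** 2, axis=-1)
    R = np.exp(-sq_dists / length ** 2)
    # Nugget term plus spatially correlated term
    Sigma = sigma1_sq * np.eye(d) + sigma2_sq * R
    # Solve Sigma X = tau instead of inverting Sigma, for numerical stability
    return np.trace(tau.T @ np.linalg.solve(Sigma, tau))

# Sanity check: a single measurement at (1, 1) gives a scalar Sigma equal to
# sigma1_sq + sigma2_sq, so tr I = (1 + 1) / (0.25 + 0.75) = 2
tau_single = np.array([[1.0, 1.0]])
single_value = fig_utility(tau_single, 0.25, 0.75, 0.1)
```

The d = 1 case can be checked by hand, which makes it a convenient unit test for the covariance construction.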
Simulation study
We performed various simulation studies to search for d = 100 design locations restricted to a unit square centred at the origin.

Figure 6: Designs returned by SGO-FIG for various choices of γ and ℓ (the ℓ values are negative powers of 10). The design space is a unit square centred at the origin.

Throughout we use Algorithm 1, with a batch size of K = 1. We implemented constrained optimisation by adding a large penalty to designs outside the unit square. Adaptive weights were not considered as the contributions of the parameters to the expected utility are similar by symmetry. Also for simplicity we did not post-process the results, although this may improve the designs in cases where the design points fall into a few small clusters.

Figure 6 shows the resulting designs under various choices of γ and ℓ. For both large and small ℓ values, the design points cluster in the corners. In between these extremes, the points are more uniformly spaced in the unit square.

For a comparison with ACE we fix γ = 0. and ℓ = 10^−. Both algorithms were started from 100 initial designs sampled uniformly over the design space. For fair comparison, the initial designs used were common to both algorithms. The default settings for ACE were used, noting that the expected utility is deterministic so it was not necessary to estimate it using Monte Carlo. SGO-FIG used 5 × 10^ iterations and a batch size of K = 1. These settings result in computational budgets of ≈ .×10^ and 5 × 10^ utility evaluations per run for ACE and SGO-FIG respectively. Each SGO-FIG run took roughly 45 seconds, while the ACE analyses took roughly 4 minutes each. Figure 7 shows that SGO-FIG converges after using many fewer utility evaluations than ACE, and returns designs with higher expected utility values.
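The optimisation loop just described can be sketched as follows. The paper's implementation uses Tensorflow's automatic differentiation and the Adam update rule; the sketch below writes the Adam ascent step directly in NumPy, takes an analytic gradient supplied by the caller, and uses a quadratic hinge penalty as a stand-in for the paper's out-of-square penalty. The penalty form, penalty weight and step sizes are all assumptions for illustration.

```python
import numpy as np

def adam_ascent(grad, x0, iters=500, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Adam ascent (maximisation) given a gradient function `grad`."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, iters + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g            # first moment estimate
        v = b2 * v + (1 - b2) * g ** 2       # second moment estimate
        m_hat = m / (1 - b1 ** t)            # bias corrections
        v_hat = v / (1 - b2 ** t)
        x = x + lr * m_hat / (np.sqrt(v_hat) + eps)  # ascent step
    return x

def penalised_grad(objective_grad, lam=100.0, half_width=0.5):
    """Gradient of objective minus a quadratic hinge penalty applied to
    coordinates outside [-half_width, half_width]."""
    def grad(x):
        excess = np.maximum(np.abs(x) - half_width, 0.0)
        return objective_grad(x) - lam * 2.0 * excess * np.sign(x)
    return grad

# Toy check: maximise -||x - c||^2 with c partly outside the square; the
# penalty pulls the second coordinate back towards the boundary at 0.5.
c = np.array([0.3, 0.8])
x_opt = adam_ascent(penalised_grad(lambda x: -2.0 * (x - c)), np.zeros(2))
```

In the paper the gradient passed in would be that of the FIG objective with respect to the flattened design, obtained by automatic differentiation rather than by hand.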
Figure 7: Comparison of optimal design algorithms for the geostatistical regression example. (a) Expected utilities of designs over 100 replications from different initial conditions. (b) Trace plots of the utility over the computation of the algorithms (generated as in Figure 3). The horizontal line indicates the utility for a uniformly spaced grid over the design space.
Discussion
We have presented a stochastic gradient optimisation algorithm, SGO-FIG, for Bayesian experimental design which can quickly find high dimensional designs under a particular fast-to-compute utility function, the Fisher information gain. We also provide a novel decision theoretic justification for this utility. In our simulation studies SGO-FIG finds local maxima of the expected utility function faster than other state-of-the-art methods. We ran our experiments on a CPU, but our Tensorflow code can easily make use of GPU parallelisation, allowing for further speed improvements.
Recommendations
We recommend using SGO-FIG as follows. Perform multiple runs of SGO-FIG from random initial designs. If any of the resulting designs contain replicated points, then post-process using ACE phase II. Estimate the expected utility for each of the resulting designs and select the best performing.

Our recommended approach is motivated by one of our simulation studies (Section 6). Here SGO-FIG without post-processing typically converged to sub-optimal local maxima, but this was usually resolved by applying ACE phase II. Multiple runs are recommended as a small number of runs did not find the best design even after post-processing.
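The recommended workflow can be sketched as a small driver routine. All three callables below (`sgo_fig`, `ace_phase2`, `estimate_utility`) are hypothetical stand-ins for the paper's actual components, and the replicated-points check, dimensions and search bounds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def recommended_workflow(sgo_fig, ace_phase2, estimate_utility,
                         n_runs=100, dim=15, low=0.0, high=24.0):
    """Run SGO-FIG from many random initial designs, post-process with ACE
    phase II whenever a returned design contains replicated points, and keep
    the design with the best estimated expected utility."""
    best_design, best_utility = None, -np.inf
    for _ in range(n_runs):
        tau = sgo_fig(rng.uniform(low, high, size=dim))
        # Crude replicated-points check (tolerance is an assumption)
        if len(np.unique(np.round(tau, 3))) < len(tau):
            tau = ace_phase2(tau)
        u = estimate_utility(tau)
        if u > best_utility:
            best_design, best_utility = tau, u
    return best_design, best_utility

# Toy check with stand-in components (not the paper's actual algorithms)
best, best_u = recommended_workflow(
    sgo_fig=lambda t: np.clip(t, 1.0, 8.0),
    ace_phase2=lambda t: np.full_like(t, 1.0),
    estimate_utility=lambda t: -np.sum((t - 1.0) ** 2),
    n_runs=20, dim=4)
```

The final selection step matters because individual runs may converge to different local maxima, as observed in Section 5.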
Future directions
There may be scope for future work modifying generic stochastic gradient optimisation methods to avoid local maxima in optimal design problems, e.g. using tempering methods, or non-local updates such as line search, as used in ACE.

In Section 6 we also observed that maximising the expected FIG often produces designs with repeated points. A similar finding was reported by Pronzato and Walter (1985). We speculate that the issue may be as follows. Repeated observation times can produce highly concentrated posteriors under some observed data, at the cost of less informative results otherwise. Overall the expected FIG rewards such trade-offs, as Fisher information is an approximation to posterior precision. In other words, FIG can be viewed as a risk seeking utility function. (For more on risk attitudes of utility functions see e.g. French and Insua, 2000.) This suggests modifying the utility to give decreasing returns to posterior concentration – e.g. using a utility log tr I(θ; τ) – effectively introducing more risk aversion. It would be interesting to explore whether there are variations on the FIG along these lines which avoid designs with repeated points while retaining its computational convenience and decision theoretic support.

More generally, stochastic gradient optimisation could also be used for other utility functions whose gradients can be calculated, such as other functionals of the Fisher information matrix, e.g. det I(θ; τ) and tr[I(θ; τ)^{-1}]. However the cost of computing determinants or inverses of I(θ; τ) is higher than simply computing the trace, and there is arguably less theoretical backing for these utility functions.

Acknowledgements
The authors thank Kevin Wilson, Chris Oates, John Matthews and Philip Dawid for helpful conversations, and gratefully acknowledge an EPSRC studentship supporting Sophie Harbisher.
References
Abadi, M. et al. (2016). Tensorflow: A system for large-scale machine learning. In , pages 265–283.

Amzal, B., Bois, F. Y., Parent, E., and Robert, C. P. (2006). Bayesian-optimal design via interacting particle systems. Journal of the American Statistical Association, 101(474):773–785.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–153.

Bernardo, J. M. (1979). Expected information as expected utility. The Annals of Statistics, 7(3):686–690.

Binmore, K. (2007). Playing for real: a text on game theory. Oxford University Press.

Binois, M., Huang, J., Gramacy, R. B., and Ludkovski, M. (2019). Replication or exploration? Sequential design for stochastic simulation experiments. Technometrics, 61(1):7–23.

Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311.

Box, G. E. P. (1982). Choice of response surface design and alphabetic optimality. Utilitas Math. B, 21:11–55.

Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, pages 273–304.

Cook, A. R., Gibson, G. J., and Gilligan, C. A. (2008). Optimal observation times in experimental epidemic processes. Biometrics, 64(3):860–868.

Figurnov, M., Mohamed, S., and Mnih, A. (2018). Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pages 441–452.

French, S. and Insua, D. R. (2000). Statistical decision theory. Wiley.

Gillespie, C. S. and Boys, R. J. (2019). Efficient construction of Bayes optimal designs for stochastic process models. Statistics and Computing (online preview).

Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.

Huan, X. and Marzouk, Y. M. (2013). Simulation-based optimal Bayesian experimental design for nonlinear systems. Journal of Computational Physics, 232(1):288–317.

Huan, X. and Marzouk, Y. M. (2014). Gradient-based stochastic optimization methods in Bayesian experimental design. International Journal for Uncertainty Quantification, 4(6).

Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709.

Jankowiak, M. and Obermeyer, F. (2018). Pathwise derivatives beyond the reparameterization trick. In International Conference on Machine Learning, pages 2240–2249.

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations.

Krause, A., Rajagopal, R., Gupta, A., and Guestrin, C. (2009). Simultaneous placement and scheduling of sensors. In Proceedings of the 2009 International Conference on Information Processing in Sensor Networks, pages 181–192. IEEE Computer Society.

Kück, H., de Freitas, N., and Doucet, A. (2006). SMC samplers for Bayesian optimal nonlinear design. In , pages 99–102. IEEE.

Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications. Springer Science & Business Media.

Lindley, D. (1972). Bayesian statistics, a review. SIAM.

Maddison, C. J., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. International Conference on Learning Representations.

Mescheder, L., Nowozin, S., and Geiger, A. (2017). The numerics of GANs. In Advances in Neural Information Processing Systems, pages 1825–1835.

Miller, K. S. (1974). Complex stochastic processes: an introduction to theory and application. Addison Wesley Publishing Company.

Müller, P. (1999). Simulation-based optimal design. In Bayesian Statistics 6: Proceedings of the Sixth Valencia International Meeting, pages 459–474. Oxford University Press.

Oates, C. J., Cockayne, J., Prangle, D., Sullivan, T. J., and Girolami, M. (2019). Optimality criteria for probabilistic numerical methods. arXiv preprint arXiv:1901.04326.

Overstall, A., Woods, D., and Adamou, M. (2017). acebayes: An R package for Bayesian optimal design of experiments via approximate coordinate exchange. arXiv preprint arXiv:1705.08096.

Overstall, A. M. and Woods, D. C. (2017). Bayesian design of experiments using approximate coordinate exchange. Technometrics, 59(4):458–470.

Parry, M., Dawid, A. P., and Lauritzen, S. (2012). Proper local scoring rules. The Annals of Statistics, 40(1):561–592.

Price, D. J., Bean, N. G., Ross, J. V., and Tuke, J. (2018). An induced natural selection heuristic for finding optimal Bayesian experimental designs. Computational Statistics & Data Analysis, 126:112–124.

Pronzato, L. and Walter, É. (1985). Robust experiment design via stochastic approximation. Mathematical Biosciences, 75(1):103–120.

Raiffa, H. and Schlaifer, R. (1961). Applied statistical decision theory. Division of Research, Harvard Business School.

Renshaw, E. (1993). Modelling Biological Populations in Space and Time. Cambridge University Press.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286.

Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.

Ryan, E. G., Drovandi, C. C., McGree, J. M., and Pettitt, A. N. (2016). A review of modern computational algorithms for Bayesian optimal design. International Statistical Review, 84(1):128–154.

Ryan, E. G., Drovandi, C. C., Thompson, M. H., and Pettitt, A. N. (2014). Towards Bayesian experimental design for nonlinear models that require a large number of sampling times. Computational Statistics & Data Analysis, 70:45–60.

Shewry, M. C. and Wynn, H. P. (1987). Maximum entropy sampling. Journal of Applied Statistics, 14(2):165–170.

Walker, S. G. (2016). Bayesian information in an experiment and the Fisher information distance. Statistics & Probability Letters, 112:5–9.

Wolfson, L. J., Kadane, J. B., and Small, M. J. (1996). Expected utility as a policy-making tool: an environmental health example. Statistics Textbooks and Monographs, 151:261–278.
A Decision theoretic support
This appendix justifies Result 1 of the main text, i.e. that the procedure of selecting the design maximising expected Fisher information gain can be derived from Bayesian decision theory. Our argument extends that of Bernardo (1979), who gives a decision theoretic derivation for maximising expected Shannon information gain. First, Section A.1 reviews background material on scoring rules. Section A.2 proves a general result linking scoring rules and utilities in experimental design. Then Section A.3 relates this result to Shannon information gain and Section A.4 to Fisher information gain, proving Result 1.
A.1 Scoring rules

A scoring rule S(q, θ) measures the quality of a distribution used to model an uncertain quantity, given a realised value θ. For the purposes of this paper we represent the distribution by its density q(θ). Low scores represent a good match.

A scoring rule is strictly proper if it is always true that E_{θ∼p(θ)}[S(q, θ)] is uniquely minimised by q = p. An interpretation of this property is as follows. Consider a decision problem where a density q must be reported for a quantity θ ∼ p(θ) and the loss function is a scoring rule. The scoring rule is strictly proper if and only if the action minimising expected loss is to report the actual density, p(θ). The expected loss resulting from this action is referred to as the entropy,

H[p(θ)] = E_{θ∼p(θ)}[S(p(θ), θ)].

The extra expected loss from making a non-optimal action q is known as the divergence,

D[p(θ), q(θ)] = E_{θ∼p(θ)}[S(q(θ), θ) − S(p(θ), θ)].

Table 3 summarises the quantities just described for two strictly proper scoring rules, the logarithmic score and the Hyvärinen score (Hyvärinen, 2005).

              Logarithmic score                    Hyvärinen score
Scoring rule  −log q(θ)                            2Δ log q(θ) + ||∇ log q(θ)||²
Entropy       −E_{θ∼p(θ)}[log p(θ)]                −E_{θ∼p(θ)}[||∇ log p(θ)||²]
Divergence    E_{θ∼p(θ)}[log p(θ) − log q(θ)]      E_{θ∼p(θ)}[||∇ log p(θ) − ∇ log q(θ)||²]

Table 3: Summary of two scoring rules and related quantities. Here p(θ) is the true density of an unobserved quantity and q(θ) is a generic density. Note that Δ is the Laplacian operator (sum of second partial derivatives). For a derivation of the Hyvärinen divergence see Appendix A of Hyvärinen (2005). The other derivations are straightforward.

The Hyvärinen results rely on the following regularity conditions:

1. p(θ) and q(θ) are twice differentiable with respect to all θ_i.
2. E_{θ∼p(θ)}[||∇ log p(θ)||²] and E_{θ∼q(θ)}[||∇ log q(θ)||²] are finite.
3.
∇p(θ) → 0 and ∇q(θ) → 0 as ||θ|| → ∞.

Throughout this appendix ∇ represents the gradient with respect to θ, and ||·|| represents the L₂ norm, i.e. ||x|| = √(x^T x). For more discussion of scoring rules see for example Gneiting and Raftery (2007) and Parry et al. (2012).

A.2 General result
Recall the Bayesian decision theory framework of Section 2.3. After picking a design τ and observing data y, the experimenter must take an action a. They then receive some base utility V(a, θ, y, τ), involving the true parameter value θ. Their utility U for the design τ given data y is

U = max_a E_{θ∼π(θ|y;τ)}[V(a, θ, y, τ)],

i.e. the expected base utility under the posterior distribution for θ, assuming that the optimal a is selected. (Throughout this appendix U has arguments τ, y, but these are suppressed to simplify notation.)

Following Bernardo (1979), suppose that a is an estimated density for θ. Let the base utility be the negative of a strictly proper scoring rule. Then, by the discussion in the previous section, the experimenter's utility U is the negative of the entropy associated with the scoring rule, evaluated on π(θ|y; τ).

The following lemma shows that two other utility functions produce equivalent expected utilities to the negative entropy. The proof is based on the common observation that the divergence contains a constant term which can be ignored under maximisation. Similar results have appeared in the experimental design literature previously for a logarithmic score function (e.g. Shewry and Wynn, 1987). This lemma simply generalises them to a general score function.

Lemma 1.
Given an underlying strictly proper score function S, the following choices of utility function produce the same expected utility in Bayesian experimental design:

U_entropy-diff = H[π(θ)] − H[π(θ|y; τ)],
U_divergence = D[π(θ|y; τ), π(θ)].

Furthermore the utility

U_entropy = −H[π(θ|y; τ)]

produces the same expected utility up to an additive constant and hence shares the same optimal design.

Proof. In the framework of Section 2, Bayesian experimental design is concerned with optimising the expected utility E_{θ,y∼π(θ,y)}[U] where π(θ, y) = π(θ)f(y|θ; τ). Using (2) and (3), we also have π(θ, y) = π(θ|y; τ)π(y; τ). From the definitions of Section A.1,

U_divergence = U_entropy + E_{θ∼π(θ|y;τ)}[S(π(θ), θ)].

Hence

E_{(θ,y)∼π(θ,y)}[U_divergence] = E_{(θ,y)∼π(θ,y)}[U_entropy] + E_{(θ,y)∼π(θ,y)}[S(π(θ), θ)]
                               = E_{(θ,y)∼π(θ,y)}[U_entropy-diff].

So U_divergence and U_entropy-diff produce the same expected utility. Furthermore the expected utility of U_entropy only differs by an additive constant, since E_{(θ,y)∼π(θ,y)}[S(π(θ), θ)] does not depend on τ.

A.3 Logarithmic score and Shannon information gain
In the case of the logarithmic score function the equivalent utilities from Lemma 1 are

U_entropy = E_{θ∼π(θ|y;τ)}[log π(θ|y; τ)],   (negative Shannon entropy)
U_entropy-diff = E_{θ∼π(θ|y;τ)}[log π(θ|y; τ)] − E_{θ∼π(θ)}[log π(θ)],
U_divergence = E_{θ∼π(θ|y;τ)}[log π(θ|y; τ) − log π(θ)].   (Kullback-Leibler divergence)

Another equivalent utility is the quantity within the expectation in U_divergence,

U_SIG = log π(θ|y; τ) − log π(θ),

which is the Shannon information gain (6). Hence Lemma 1 provides decision theoretic support for this utility. It does so by essentially following the argument of Bernardo (1979).

A.4 Hyvärinen score and Fisher information gain
Under the Hyvärinen score function the equivalent utilities from Lemma 1 are

U_entropy = E_{θ∼π(θ|y;τ)}[||∇ log π(θ|y; τ)||²],
U_entropy-diff = E_{θ∼π(θ|y;τ)}[||∇ log π(θ|y; τ)||²] − E_{θ∼π(θ)}[||∇ log π(θ)||²],
U_divergence = E_{θ∼π(θ|y;τ)}[||∇ log π(θ|y; τ) − ∇ log π(θ)||²].

Two further equivalent utilities are the quantity within the expectation in U_divergence,

U_score-diff = ||∇ log π(θ|y; τ) − ∇ log π(θ)||² = ||∇ log f(y|θ; τ)||²,

and its expectation

U_FIG = E_{y∼f(y|θ;τ)}[||∇ log f(y|θ; τ)||²],

i.e. the trace of the Fisher information. This argument proves Result 1 from the main text. We refer to U_FIG as Fisher information gain in a rough analogy to Shannon information gain. While a closer analogy would be to refer to U_score-diff as the Fisher information gain, we prefer to reserve the term for the more computationally convenient form U_FIG.

Walker (2016) proved directly that U_FIG, U_entropy-diff and U_divergence give the same expected utility. We have shown that the same result arises naturally from a decision theory characterisation.

B Adaptive weights algorithm
Here we present our algorithm for adaptively maximising the weighted FIG utility (14), while learning suitable weights. We also describe the motivation behind the algorithm, and examine its convergence properties.

As discussed in Section 4.4, (14) is equivalent to scaling each parameter θ_i by a weight w_i. For notational convenience below, we introduce ˜w_i = w_i² and let ˜w be the corresponding vector of squared weights. The weighted FIG can then be written

U_FIG(τ, θ; ˜w) = diag(I(θ; τ))^T ˜w^{−1} = Σ_{i=1}^p E_{y∼f(y|θ;τ)}[u_i(y, θ; τ)² / ˜w_i],

where diag maps a matrix to its leading diagonal and ˜w^{−1} is the vector of ˜w_i^{−1} values. The resulting expected utility is J_FIG(τ; ˜w) = E_{θ∼π(θ)}[U_FIG(τ, θ; ˜w)].

As discussed in Section 4.4, we wish to select (1) a design maximising the weighted FIG utility and (2) weights which give each parameter an equal contribution to J_FIG(τ; ˜w). More precisely, our goal is to find (τ*, ˜w*) such that:

1. τ* maximises J_FIG(τ; ˜w*),
2. ˜w* = g(τ*) where g(τ) = E_{θ∼π(θ)}[diag(I(θ; τ))].

Equivalently, ˜w* maximises K(˜w; τ*) where K(˜w; τ) = −||˜w − g(τ)||².

Algorithm 2 is an algorithm for optimal design with adaptive weights. It operates by applying stochastic gradient optimisation updates to τ and ˜w. An unbiased Monte Carlo gradient estimate of ∇_τ J_FIG(τ; ˜w) is

∇̂J_FIG(τ; ˜w) = (1/K) Σ_{k=1}^K ∇_τ [diag(I(θ^(k); τ))^T ˜w^{−1}],   (15)

where the θ^(k) are independent draws from the prior. This corresponds to (12) for the unweighted case.

An unbiased Monte Carlo gradient estimate of ∇_˜w K(˜w; τ) is

∇̂K(˜w; τ) = (1/K) Σ_{k=1}^K diag(I(θ^(k); τ)) − ˜w.   (16)

The same θ^(k) draws used for (15) can be reused in (16).

Algorithm 2 Stochastic gradient optimisation of expected Fisher information gain with adaptive weights
Input: Initial design τ₀, number of iterations n, batch size K to use in gradient estimation, initial squared weights ˜w₀.

Loop over t = 0, 1, 2, ..., n − 1:

1. Compute an estimated gradient for τ, using (15), and an estimated gradient for ˜w, using (16).
2. Get τ_{t+1}, ˜w_{t+1} by calling the update rule of a stochastic gradient optimisation algorithm with current state τ_t, ˜w_t and the gradient estimates from step 1. We use the update rule of the Adam algorithm.

Algorithm 2 can be viewed as a simultaneous gradient ascent approach as it makes simultaneous gradient ascent steps for two objective functions. It is well known that such algorithms do not always converge (see e.g. Mescheder et al., 2017). The issue is that improving one objective function may reduce the other, resulting in dynamics of (τ_t, ˜w_t) which form a cycle rather than converging to an optimum. Furthermore, it is difficult to guarantee the existence of a solution to our adaptive optimisation problem, as discussed in Section B.1. Nonetheless we find reasonable performance of the algorithm in a simulation study (see Table 1). Also, as mentioned in the main text, the algorithm can be used simply to select reasonable weights, i.e. to increase the value of K(˜w; τ) compared to using unit weights. Then Algorithm 1 can be run on parameters rescaled by these weights.

B.1 Existence of a solution
Lemma 2 shows that a solution to the adaptive weights optimisation problem described above can be guaranteed under a few conditions. However one of these, item 4, is hard to verify: h(˜w), the solution to a maximisation problem defined by ˜w, must be a continuous function of ˜w. Therefore it is difficult to guarantee the existence of a solution in practice.

Lemma 2.
There exists a solution to

τ* = argmax_τ J_FIG(τ; ˜w*),   ˜w* = argmax_˜w K(˜w; τ*)

given the following conditions:

1. The set T ⊆ R^d of possible designs is compact and convex.
2. g(τ) is continuous.
3. ||g(τ)|| is bounded above.
4. There exists a continuous function h(˜w) outputting a τ value maximising J_FIG(τ; ˜w).

The joint optimisation problem here is equivalent to the definition of a Nash equilibrium in game theory. To prove existence of a solution we follow the game theory literature by using Brouwer's fixed point theorem (see e.g. Binmore, 2007).
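For intuition about the fixed-point structure, the map (τ, ˜w) ↦ (h(˜w), g(τ)) can be iterated directly in a toy one-dimensional example. The surrogate g below is a made-up stand-in for E_θ[diag(I(θ; τ))], chosen so that both parameter contributions are maximised at τ = 1; h is then constant (and trivially continuous), so the iteration converges immediately. None of these functions come from the paper.

```python
import numpy as np

# Hypothetical per-parameter utility contributions g(tau), standing in for
# E_theta[diag(I(theta; tau))]; both components are maximised at tau = 1.
def g(tau):
    return np.array([2.0 - (tau - 1.0) ** 2, 1.0 + tau - 0.5 * tau ** 2])

# For this toy, J(tau; w) = sum_i g_i(tau) / w_i has derivative
# (1 - tau) * (2 / w_1 + 1 / w_2), so it is maximised at tau = 1 for every
# positive w. Hence h(w) = 1 identically, which is continuous (condition 4).
def h(w):
    return 1.0

# Direct fixed-point iteration of f(tau, w) = (h(w), g(tau))
tau, w = 0.2, np.array([1.0, 1.0])
for _ in range(10):
    tau, w = h(w), g(tau)
# Fixed point: tau* = h(w*) = 1 and w* = g(1) = [2.0, 1.5]
```

In the actual algorithm h is not available in closed form, which is why Algorithm 2 replaces this iteration with simultaneous stochastic gradient steps.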
Proof.
Let W = {˜w : ||˜w|| ≤ k} where k is an upper bound of ||g(τ)||. Define f : T × W → T × W by f(τ, ˜w) = (h(˜w), g(τ)). It follows from the assumptions that this is a continuous function from a compact convex set to itself. Hence by Brouwer's fixed point theorem, there exists some (τ*, ˜w*) ∈ T × W which is a fixed point of f. Therefore τ* = h(˜w*) and ˜w* = g(τ*) as required.

C Fisher information gain for the pharmacokinetic example
Recall from Section 3.2 that for a model y ∼ N(x(θ, τ), Σ(τ)),

tr I(θ; τ) = Σ_{i=1}^p v_i(θ, τ)^T Σ(τ)^{−1} v_i(θ, τ),

where v_i is the vector ∂x(θ, τ)/∂θ_i of elementwise derivatives of x. For the pharmacokinetic model Σ = σ²I, so that

tr I(θ; τ) = σ^{−2} Σ_{i=1}^{3} Σ_{j=1}^n v_{ij}(θ, τ)²,

where v_{ij} = ∂x(θ, τ_j)/∂θ_i. It remains to compute the v_{ij} terms, which are

v_{1j} = x(θ, τ_j)/(θ₂ − θ₁) − [Dθ₂/(θ₃(θ₂ − θ₁))] τ_j exp(−θ₁τ_j),
v_{2j} = −[θ₁/(θ₂(θ₂ − θ₁))] x(θ, τ_j) + [Dθ₂/(θ₃(θ₂ − θ₁))] τ_j exp(−θ₂τ_j),
v_{3j} = −x(θ, τ_j)/θ₃.

D Regularity conditions
Under weak regularity conditions on g(θ, τ) it is possible to interchange differentiation and expectation, so that

∇_τ E_{θ∼π(θ)}[g(θ, τ)] = E_{θ∼π(θ)}[∇_τ g(θ, τ)].

(This result was required in Section 4.2 with g(θ, τ) = tr I(θ; τ).) A sufficient regularity condition is that E_{θ∼π(θ)}[|∇_τ g(θ, τ)|] is finite, where the absolute value |·| acts elementwise. It follows that the interchange is possible by Fubini's theorem.

E More on estimation of FIG and its gradient
This appendix considers estimating the expected FIG utility, J_FIG(τ), and its gradient when the Fisher information is not available in closed form. First note we can express

J_FIG(τ) = E_{θ∼π(θ), y∼f(y|θ;τ)}[||u(y, θ; τ)||²],   (17)

where u is the score function, (5). Hence a simple unbiased Monte Carlo estimate is

(1/K) Σ_{k=1}^K ||u(y^(k), θ^(k); τ)||²,

where the (θ^(k), y^(k)) are independent draws from the prior and model.

However, on taking the gradient of (17) with respect to τ, it is generally not possible to exchange the order of integration and differentiation. The reason is that the distribution for y depends on τ. This complicates obtaining a Monte Carlo estimate.

A standard approach is to use the "reparameterisation trick" (Rezende et al., 2014). The idea is to express y as a transformation y(ε, θ, τ) of a random variable ε with fixed density p(ε). Then

J_FIG(τ) = E_{θ∼π(θ), ε∼p(ε)}[||u(y(ε, θ, τ), θ; τ)||²].

Now it is possible to exchange integration and differentiation under mild regularity conditions (see Appendix D) to get

∇J_FIG(τ) = E_{θ∼π(θ), ε∼p(ε)}[∇||u(y(ε, θ, τ), θ; τ)||²].

Hence an unbiased Monte Carlo estimate is

(1/K) Σ_{k=1}^K ∇||u(y(ε^(k), θ^(k), τ), θ^(k); τ)||²,

where the (θ^(k), ε^(k)) are independent draws from π(θ)p(ε). Unbiased estimates of the quantities required for Algorithm 2 can be derived similarly.

In some cases y cannot be represented as a suitable transformation of ε. This includes the case of discrete y