Flexible Prior Elicitation via the Prior Predictive Distribution
Marcelo Hartmann, Georgi Agiashvili, Paul Bürkner, Arto Klami
Marcelo Hartmann, Georgi Agiashvili, Paul Bürkner and Arto Klami
Department of Computer Science, University of Helsinki, Helsinki, Finland
Department of Computer Science, Aalto University, Espoo, Finland
E-mail: [email protected], [email protected], [email protected], [email protected]
Summary. The prior distribution for the unknown model parameters plays a crucial role in the process of statistical inference based on Bayesian methods. However, specifying suitable priors is often difficult even when detailed prior knowledge is available in principle. The challenge is to express quantitative information in the form of a probability distribution. Prior elicitation addresses this question by extracting subjective information from an expert and transforming it into a valid prior. Most existing methods, however, require information to be provided on the unobservable parameters, whose effect on the data generating process is often complicated and hard to understand. We propose an alternative approach that only requires knowledge about the observable outcomes, knowledge which is often much easier for experts to provide. Building upon a principled statistical framework, our approach utilizes the prior predictive distribution implied by the model to automatically transform experts' judgements about plausible outcome values into suitable priors on the parameters. We also provide computational strategies to perform inference and guidelines to facilitate practical use.
1. INTRODUCTION
The Bayesian approach for statistical inference is widely used both in statistical modeling and in general-purpose machine learning. It builds on the simple and intuitive rule that allows updating one's prior beliefs about the state of the world through newly made observations (i.e., data) to obtain posterior beliefs in a fully probabilistic manner. Nowadays, the Bayesian approach can routinely be used in a vast number of applications due to the combination of powerful inference algorithms and probabilistic programming languages (Meent et al., 2018), such as Stan (Carpenter et al., 2017).

Despite the available computational tools, the task of designing and building the model can still be difficult. Often, the user building the model can safely be assumed to have good knowledge of the phenomenon they are modeling. However, they additionally need to have sufficient statistical knowledge in order to formulate the domain assumptions in terms of probabilistic models which are sensible enough to obtain valid inference. This is by no means an easy task for the majority of users. Hence, the model building process is often highly iterative, requiring frequent modifications of modeling assumptions, for example, based on predictive checks and model comparisons; see Daee et al. (2017), Schad et al. (2019) and Sarma and Kay (2020) for attempts at formalising the modeling workflow.

We focus on one particular stage of the modeling process, namely the problem of specifying priors for the model parameters. The prior distribution lies at the heart of the Bayesian paradigm and must be designed coherently to make Bayesian inference operational (e.g., see Kadane and Wolfson, 1998). The practical difficulty, though, even for more experienced users, is the encoding of one's actual prior beliefs in the form of parametric distributions. The parameters may not even have a direct interpretation, and the effect of the prior on the data generating mechanism can be quite involved and show large disparity with respect to what the user's prior beliefs over the data distribution could be (Kadane et al., 1980).

The existing literature addresses this issue via expert knowledge elicitation. This is understood as the process of extracting the expert's information (knowledge or opinion) related to quantities or events that are uncertain, and expressing it in the form of a probability distribution, the prior. See, for example, the works by Lindley (1983), Genest and Schervish (1985), and Gelfand et al. (1995) for early ideas and introduction. See Garthwaite et al. (2005) and O'Hagan (2019) for detailed reviews of expert elicitation procedures and guidelines.

The majority of the knowledge elicitation literature is on eliciting information with respect to the parameters of the model, that is, asking the expert to make statements about plausible values of the parameters. The early works do this within specific parametric prior families, whereas more recently, O'Hagan and Oakley (2004), Gosling (2005) and Oakley and O'Hagan (2007) have proposed nonparametric approaches based on Gaussian processes (O'Hagan, 1978), allowing more flexibility.
Even though the prior itself can be of flexible form, the elicitation process is typically carried out on a parameter-by-parameter basis so that each parameter receives its own independent univariate prior. As a result, the implied joint prior on the whole set of parameters is often unreasonable. Although Moala and O'Hagan (2010) generalized the approach of Gosling (2005) to multivariate priors, the resulting process is difficult for experts, since they are required to express high-dimensional joint probabilities. Hence, its practical use is basically limited to just two dimensions.

Independently of whether we assign individual or joint priors on the model parameters, any prior can only be understood in the context of the model it is part of (e.g., Gelman et al., 2017; Simpson et al., 2017). This point may be obvious but its practical implications are far reaching. Subject matter experts, who may understandably lack in-depth knowledge of statistical modeling, are left with the task of assigning sensible priors on parameters whose scale and real-world implications are hard to grasp even for statistical experts.

For this reason, Kadane et al. (1980) and Akbarov (2009) argue that prior elicitation should be conducted using observable quantities, by asking statements related to the prior predictive distribution, that is, the distribution of the data as predicted by the model conditioned on the parameters' prior, instead of directly referring to the prior on the unobservable parameters. After eliciting the prior predictive distribution, the information can then be transformed into priors on the parameters by a suitable methodology.

The logic of using the prior predictive distribution is that the expert should always have an understanding about plausible values of the observable variables based on their own domain knowledge, even if they may not fully understand the statistical model and the role of the parameters used to represent the underlying data generating mechanism. After all, what is an expert if they do not understand their own data?

From a predictive viewpoint, Kadane et al. (1980), Kadane and Wolfson (1998), Geisser (1993), and Akbarov (2009) present practical methods for recovering the prior distribution via the expert's information on the prior predictive distribution. Those methods are based on specifying particular moments of the prior predictive distribution for a Gaussian linear regression model, or on providing prior predictive probabilities for fixed subregions of the sample space where the prior distribution is assumed to be univariate. In the latter case, the strategy is to perform least-squares minimization between theoretical probabilities and those probabilities quantified by the expert.
However, in the sense of O'Hagan and Oakley (2004), these approaches neglect the fact that the expert's information itself can be uncertain, and they provide no measure of whether the chosen predictive model is able to reproduce the expert's probabilistic judgements well enough. That is to say, existing methods do not take into account imprecisions in probabilistic judgements when constructing the prior predictive distribution, nor do they provide a principled framework which would guide the experts to select a predictive model and/or prior distribution matching their knowledge (Jeffreys and Zellner, 1980; Winkler, 1967).

Our contribution addresses the question of prior elicitation via prior predictive distributions using a principled statistical framework which 1) makes prior elicitation independent of the specific structure of the probabilistic model from the users' viewpoint, 2) handles complex models with many parameters and potentially multivariate priors, 3) fully accounts for uncertainty in the experts'/users' probabilistic judgements on the data, and 4) provides a formal quality measure indicating whether the chosen predictive model is able to reproduce the experts' probabilistic judgements. Our work provides both the theoretical basis as well as flexible tools that allow the modeller to express their knowledge in terms of the probability of the data while taking into account the uncertainty in their judgements.

In Section 2, we establish the basic notation and explain why the prior predictive distribution is better suited to represent the expert's opinions. Sections 3 and 4 introduce the methodology to tackle imprecise probabilistic judgements via a principled statistical framework, and general computational procedures to recover the hyperparameters of a prior distribution. The development is interleaved with practical examples illustrating the core concepts and demonstrating their practical use, via concrete instantiations for multivariate prior elicitation for generalized linear models and a small-scale user study comparing the proposed methodology with classical prior elicitation directly on model parameters. We close the paper in Section 5, where conclusions and potential future directions are presented.
2. NOTATION AND PRELIMINARIES
The process of performing Bayesian statistical inference usually starts by building a joint probability distribution of observable variables/measurements Y and unobservable parameters θ. The corresponding marginal distribution with respect to θ is referred to as the prior distribution and the marginal distribution with respect to Y is referred to as the prior predictive distribution. According to the Bayesian paradigm, the prior distribution should be designed independently of the measurement outcomes, that is to say, it must reflect our prior knowledge about the parameters θ before seeing the actual independent measurements y_1, y_2, ... (i.e., realizations of Y) obtained in the experiments (Berger, 1993; O'Hagan, 2004). After having obtained the measurements, the posterior distribution of θ arises from the joint distribution by conditioning on y_1, y_2, ... (O'Hagan, 2004).

Let Y = [Y_1 ... Y_S] be an S-dimensional vector of observable variables and denote the sample space Ω as a subset of R^S. Hereafter we denote by Y | θ ∼ π_{Y|θ} our data probability distribution conditioned on the parameters. We also write θ ∼ π_θ, where θ ∈ Θ ⊆ R^D and π_θ belongs to a given family of parametric distributions, say F_λ, indexed by a hyperparameter vector λ. Then, by marginalizing out the parameters θ, the prior predictive distribution is given by

π_Y(y | λ) = ∫_Θ π_{Y|θ}(y | θ) π_θ(θ | λ) dθ.   (1)

The prior predictive distribution is not to be confused with the marginal likelihood of observed data, which is obtained by marginalization over θ of the observed data's sampling distribution times the prior (e.g., Jeffreys and Zellner, 1980).

Given any subset A ⊆ Ω, the prior predictive probability of A, denoted as P(Y ∈ A | λ), can be obtained by exchanging the order of integration via the Fubini-Tonelli theorem (Folland, 2013) as

P_{A|λ} := ∫_A π_Y(y | λ) dy = E_θ( P_{Y|θ}(Y ∈ A | θ) ).   (2)

See the supplementary materials for details. The hyperparameter vector λ, which defines a particular prior from the set of all priors F_λ, will be treated as constant. Hence, no prior needs to be assigned to it. Instead, the values of λ will be obtained during the prior predictive elicitation method presented below.
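To make equation (2) concrete, the prior predictive probability of a set A can be approximated by averaging the conditional probability P_{Y|θ}(Y ∈ A | θ) over draws from the prior. The following is a minimal sketch for a simple Normal data model with a Normal prior; the model and all numerical values are illustrative assumptions, not settings used in the paper.

```python
import numpy as np
from scipy.stats import norm

def prior_predictive_prob(a, b, mu0, sigma0, sigma_y, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(Y in (a, b] | lambda) for
    Y | theta ~ N(theta, sigma_y^2) with prior theta ~ N(mu0, sigma0^2)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(mu0, sigma0, size=n_draws)          # draws from the prior
    # P(Y in (a, b] | theta) is available in closed form for each draw
    p_given_theta = norm.cdf(b, theta, sigma_y) - norm.cdf(a, theta, sigma_y)
    return p_given_theta.mean()

# For this model the prior predictive is N(mu0, sigma_y^2 + sigma0^2),
# so the Monte Carlo estimate can be checked against the exact value.
est = prior_predictive_prob(a=-1.0, b=1.0, mu0=0.5, sigma0=1.0, sigma_y=1.0)
exact = norm.cdf(1.0, 0.5, np.sqrt(2.0)) - norm.cdf(-1.0, 0.5, np.sqrt(2.0))
print(est, exact)
```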
3. PRIOR PREDICTIVE ELICITATION
Our approach follows Oakley and O'Hagan (2007) and Gosling (2005) by approaching the elicitation process as a problem of statistical inference where the information to be provided by the expert is in the form of probabilistic judgements about the data. However, the solution itself is novel. From a high-level perspective, our elicitation methodology for any Bayesian model can be summarized as follows:

(a) Define the parametric generative model for observable data Y, composed of a probabilistic model conditioned on the parameters θ and a (potentially multivariate) prior distribution for the parameters. The prior distribution depends on hyperparameters λ, essentially defining the prior which we seek to obtain (see Section 2).

(b) Partition the data space into exhaustive and mutually exclusive data categories. For each of these categories, ask the expert what they believe is the probability of the data falling in that category.

(c) Model the elicited probabilities from Step (b) as a function of the hyperparameters λ from Step (a), while taking into account that the expert information is itself of probabilistic nature and has inherent uncertainty.

(d) Perform iterative optimization of the model from Step (c) to obtain an estimate for λ describing the expert opinion best within the chosen parametric family of prior distributions.

(e) Evaluate how well the predictions obtained from the optimal prior distribution of Step (d) can describe the elicited expert opinion.

In the remainder of this section, we first introduce the basic formalism for modelling the users' beliefs in Section 3.1, provide a key consistency result in Section 3.2, then demonstrate how it can be applied to predictive problems in Section 3.3, and finally discuss the interfaces for the actual knowledge elicitation procedure in Section 3.4. Each part is concluded by an example illustrating the concept.

Our assumption is that the elicitation procedure provides information as probabilistic assignments regarding the data vector Y falling within a fixed set of mutually exclusive and exhaustive events A. Such a collection of assignments can be considered as the data available for inferring the prior, and is not to be confused with actual measurement data following the generative model. Our focus here is on the mathematical machinery required for converting this information into prior distributions, without taking any stance on how the information is collected from the expert. However, we will briefly discuss the elicitation process itself in Section 3.4.

Let A = {A_1, ..., A_n} be a partition of the sample space Ω. Throughout the elicitation procedure, the expert supplies their expected opinions regarding the quantities P_{A_i|λ} for all i = 1, ..., n. The expert's judgements themselves are not fully deterministic and retain some uncertainty. Also, the expert may be more comfortable making statements for certain partitions of Ω than for others.

To account for the uncertainty in the probability quantifications of P_{A_i|λ}, we assume that the obtained judgements p follow a Dirichlet distribution (Ferguson, 1973) with base measure given by the prior predictive probabilities P_{A_i|λ} and precision parameter α.
Hence, for any chosen partition A of size n, we denote the distribution of p as

p | α, λ ∼ D(α, [P_{A_1|λ} ··· P_{A_n|λ}]),   (3)

where D(·) stands for the Dirichlet distribution, whose multivariate density function reads

D(p | α, λ) = [ Γ(α) / ∏_{i=1}^n Γ(α P_{A_i|λ}) ] ∏_{i=1}^n p_i^{α P_{A_i|λ} − 1}.   (4)

Naturally, we require Σ_{i=1}^n P_{A_i|λ} = 1. The Dirichlet density (4) accounts for the uncertainty inherent to the numerical quantification of the probability vector p due to, for example, biases introduced through the mechanisms of elicitation processes (the way in which questions are asked), practical imperfection (imprecision) of experts' judgements in probabilistic terms, or poor judgements on the effect of parameters in the output model. For details and in-depth discussion, see O'Hagan and Oakley (2004), O'Hagan (2019) and Sarma and Kay (2020).

The hyperparameter α measures how well the prior predictive probability model is able to represent (or reproduce) the probability data provided in the elicitation process. The larger the value of α, the smaller the variance around the expected value P_{A_i|λ}. For practical use of this principle, we can find the maximum likelihood estimate (MLE) α̂ of α, which can be directly understood in terms of the deviance between the prior predictive probability and the expert's opinion. More specifically, we have

α̂ ≈ (n/2 − 1/2) / KL(P_{A|λ} ‖ p),   (5)

where P_{A|λ} = [P_{A_1|λ} ··· P_{A_n|λ}]ᵀ and KL(P_{A|λ} ‖ p) is the Kullback-Leibler divergence between the two distributions. The practical interpretation is that for small KL values, we would not be able to discriminate the prior predictive probability from the probability data provided by the expert. See the supplementary materials for the proof of Equation (5).
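As a small illustration of equation (5), the approximate MLE of α can be computed directly from the model probabilities P_{A|λ} and the elicited probabilities p. The sketch below is a minimal helper with made-up numbers.

```python
import numpy as np

def alpha_hat(P, p):
    """Approximate MLE of the precision alpha, eq. (5):
    alpha_hat ~ (n/2 - 1/2) / KL(P || p)."""
    P = np.asarray(P, dtype=float)
    p = np.asarray(p, dtype=float)
    kl = np.sum(P * np.log(P / p))     # Kullback-Leibler divergence KL(P || p)
    n = len(P)
    return (n / 2 - 0.5) / kl

# Example: model probabilities close to the expert's judgements give a large alpha_hat.
P = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
p = np.array([0.12, 0.18, 0.41, 0.19, 0.10])
print(alpha_hat(P, p))
```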
Example: Consider a generative model given by Y | θ ∼ N(θ, σ²) and θ ∼ ½ N(µ_1, σ_1²) + ½ N(µ_2, σ_2²). This yields the prior predictive distribution Y ∼ ½ N(µ_1, σ² + σ_1²) + ½ N(µ_2, σ² + σ_2²) with hyperparameters λ = [µ_1, µ_2, σ, σ_1, σ_2]ᵀ. For a set A = (a, b] ⊂ R, the prior predictive probability is

P_{A|λ} = Σ_{k=1}^{2} ½ [ Φ((b − µ_k)/√(σ² + σ_k²)) − Φ((a − µ_k)/√(σ² + σ_k²)) ].

Figure 1 illustrates the effect of the α parameter for a given partition A with n = 10. For each of six values of α we generated p by sampling from (3), using fixed hyperparameter values of µ_1 = −µ_2 = 2 and σ = σ_1 = σ_2 = 1.

Fig. 1. Illustration of the role of the concentration parameter α. Large values correspond to scenarios where the prior predictive distribution (solid line) is able to represent the expert's opinions (bars) accurately. That is, α provides an accuracy diagnostic for our method, with higher values indicating higher accuracy. (Panels: α = 8.12 | KL = 0.93; α = 20.84 | KL = 0.37; α = 36.8 | KL = 0.21; α = 145.86 | KL = 0.06; α = 631.67 | KL = 0.02; α = 1330.75 | KL = 0.01.)

Even though we work in a Bayesian context looking to recover a prior distribution, the core procedure of our method applies classical statistical inference. Given a numerical vector of probabilities from the elicitation process, the goal is to show that we are able to find the value of certain parameters (in this case the hyperparameters λ and the concentration parameter α) of the Dirichlet probabilistic model (3) which would have most likely generated this particular data (of the user's subjective knowledge). In other words, we are aiming to obtain the maximum likelihood estimator (MLE).

To study the MLE, we consider the limit where the partitioning is made increasingly more fine grained by increasing n towards infinity. However, we still only obtain information from the user once (i.e., for a single partitioning). That is, the user is providing more and more information about the probabilities, but does not repeat the procedure multiple times. As we will show below, the MLE is consistent under these circumstances, providing the true λ when n → ∞, under reasonable assumptions.

Recall that equations (3) and (4) represent the probabilistic model of p conditioned on the parameters η = (λ, α). Suppose the implied true prior distribution of the expert has hyperparameter values λ_0 and denote η_0 = (λ_0, α_0). Take the size of the partition n to be large and denote the log-likelihood as T_η(p) = log D(p | α, λ), with expectation Q_{η_0}(η) = E_D(T_η(p)).

We show that the expected log-likelihood is maximized at η_0. By Jensen's inequality, we know that

E_D[ −log( D(p | α, λ) / D(p | α_0, λ_0) ) ] > −log E_D[ D(p | α, λ) / D(p | α_0, λ_0) ] = 0,   (6)

yielding

Q_{η_0}(η_0) = E_D(T_{η_0}(p)) > E_D(T_η(p)) = Q_{η_0}(η),
which holds for all η ≠ η_0. The expectation E_D(·) is taken with respect to the distribution (4). The technical condition to ensure uniqueness of the MLE is that the probabilistic model (4) must be identifiable.† That is, equality of likelihoods must imply equality of parameters: D(p | α_1, λ_1) = D(p | α_2, λ_2) for all p implies η_1 = η_2. Otherwise we may encounter multiple maxima and thus the prior distribution in the set F_λ is not unique.

† In practice, this may not be an issue when fitting the model. However, we believe it is important to understand the theoretical properties of the inference process so that we can avoid problems in the optimisation procedures.

Fig. 2. Consistency of the MLE for λ. On the right: all six hyperparameter values converge to the true values as the number of partitions n increases (each line corresponds to one hyperparameter), here converging already roughly for n = 10. On the left and middle: both the estimated prior distribution (left) and the corresponding prior predictive distribution (middle) converge towards the respective true distributions, depicted as black lines. (Panels: prior distribution estimate; prior predictive; hyperparameter estimates against partition size n, for n = 3, 5, 10, 20, 30.)
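To make the estimation step (Step (d) above) concrete, the following sketch maximizes the logarithm of the Dirichlet likelihood (4) with a gradient-free optimizer, using a single Normal prior instead of the mixture of the earlier example to keep the code short. The "expert" probabilities are simulated from (3), and all numerical values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln
from scipy.stats import norm

edges = np.array([-np.inf, -2, -1, 0, 1, 2, np.inf])   # partition of the sample space
sigma_y = 1.0                                           # known observation noise

def bin_probs(mu0, sigma0):
    """Prior predictive probabilities of the bins for
    Y | theta ~ N(theta, sigma_y^2), theta ~ N(mu0, sigma0^2)."""
    s = np.sqrt(sigma_y**2 + sigma0**2)
    return np.diff(norm.cdf(edges, mu0, s))

def neg_log_lik(params, p):
    """Negative log of the Dirichlet likelihood (4), parametrised by
    (mu0, log sigma0, log alpha) so the optimizer works on an unconstrained space."""
    mu0, log_sigma0, log_alpha = params
    P = bin_probs(mu0, np.exp(log_sigma0))
    alpha = np.exp(log_alpha)
    return -(gammaln(alpha) - np.sum(gammaln(alpha * P))
             + np.sum((alpha * P - 1.0) * np.log(p)))

# Simulated "expert" probabilities: one draw from (3) under a chosen true (mu0, sigma0, alpha).
rng = np.random.default_rng(1)
p = rng.dirichlet(200.0 * bin_probs(0.8, 1.5))

fit = minimize(neg_log_lik, x0=np.array([0.0, 0.0, np.log(50.0)]),
               args=(p,), method="Nelder-Mead")
mu0_hat, sigma0_hat, alpha_hat = fit.x[0], np.exp(fit.x[1]), np.exp(fit.x[2])
print(mu0_hat, sigma0_hat, alpha_hat)
```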
Example: Extending the earlier example, consider a more general generative model where the prior distribution is now θ ∼ w_1 N(µ_1, σ_1²) + w_2 N(µ_2, σ_2²), yielding the prior predictive distribution Y ∼ w_1 N(µ_1, σ² + σ_1²) + w_2 N(µ_2, σ² + σ_2²), where w_1 and w_2 are weights summing up to 1 and the hyperparameters are given by λ = [µ_1, µ_2, σ, σ_1, σ_2, w_1, w_2].

Suppose α is fixed and the true prior distribution has hyperparameters λ_0. We run an experiment where probability vectors are generated from (3) with increasing partition sizes. Figure 2 shows that, as the partition size increases, the estimates λ̂ converge to λ_0, which means the prior distribution is recovered from a single-sample elicitation of probability data.

Next, we demonstrate how the proposed approach can be used for concrete modelling problems, by detailing the procedure for the widely-used family of generalized linear models (GLM; Nelder and Wedderburn, 1972). As GLMs typically have several parameters, one parameter per predicting covariate plus an intercept and potentially a dispersion parameter, direct specification of the parameters' joint prior is often difficult. However, our prior predictive approach can handle this situation elegantly.

In the case of a GLM, our elicitation method requires the selection of sets of covariate values for which the expert is comfortable to express probability judgements about plausible realizations of Y. More formally, for each set of covariates x_j = [x_{j,1} ··· x_{j,C}], j = 1, ..., J, the expert provides probability judgements p_j = [p_{j,1} ··· p_{j,n_j}] with Σ_{i_j=1}^{n_j} p_{j,i_j} = 1, where n_j is the partition size for covariate set j, implying the partition A_j = {A_{j,1}, ..., A_{j,n_j}}. Under the assumption of the judgements p_j being pairwise conditionally independent, we can express the likelihood function of α and λ as

D(p_1, ..., p_J | α, λ) = [ Γ(α)^J / ∏_{j=1}^J ∏_{i_j=1}^{n_j} Γ(α P_{A_{j,i_j}|λ,x_j}) ] ∏_{j=1}^J ∏_{i_j=1}^{n_j} p_{j,i_j}^{α P_{A_{j,i_j}|λ,x_j} − 1},   (7)

where P_{A_{j,i_j}|λ,x_j} is the prior predictive probability for the set A_{j,i_j} related to covariate set x_j.

Importantly, there is no need for the partitions themselves or their size to be the same throughout the sets of covariate values: for each j, the expert can create any partition they are most comfortable with making judgements about. This feature provides much more freedom to the expert in expressing their knowledge of the data compared to alternative methods. For example, to obtain a prior distribution for a logistic regression model, the method of Bedrick et al. (1997) requires the user to provide a fixed number of probabilities, just enough to make the Jacobians appearing in their method invertible.
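A sketch of how the joint likelihood (7) can be assembled in code from per-covariate-set probabilities. The function prior_pred_probs is a hypothetical placeholder for whatever routine returns the prior predictive probabilities P_{A_{j,i_j}|λ,x_j} of the chosen model; it is not part of any existing library.

```python
import numpy as np
from scipy.special import gammaln

def glm_log_lik(alpha, lam, x_sets, p_sets, prior_pred_probs):
    """Log of the joint Dirichlet likelihood (7).

    x_sets : list of covariate vectors x_j
    p_sets : list of elicited probability vectors p_j (each summing to one;
             the partition size n_j may differ between covariate sets)
    prior_pred_probs(lam, x_j, n_j): returns the model probabilities
        [P_{A_{j,1}|lam,x_j}, ..., P_{A_{j,n_j}|lam,x_j}] for that covariate set.
    """
    log_lik = 0.0
    for x_j, p_j in zip(x_sets, p_sets):
        P_j = prior_pred_probs(lam, x_j, len(p_j))
        log_lik += (gammaln(alpha)
                    - np.sum(gammaln(alpha * P_j))
                    + np.sum((alpha * P_j - 1.0) * np.log(p_j)))
    return log_lik
```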
Example: Here we consider a generative model for binary data in the presence of a vector of covariates. The observable variable conditioned on the parameters is distributed according to a Bernoulli model, and we take a multivariate Gaussian distribution as the prior distribution for the vector of parameters in the predictor function. This can be formalized as

Y | θ ∼ B( Φ(xᵀθ) ),   θ ∼ N_D(µ, Σ),   (8)

yielding the prior predictive distribution

Y ∼ B( p(x, λ) )   (9)

with p(x, λ) = Φ( xᵀµ / √(1 + xᵀΣx) ). The notation N_D(·, ·) stands for a D-dimensional Gaussian distribution and B(·) for the Bernoulli distribution. The hyperparameter vector λ = [µ, Σ] consists of the prior means µ = [µ_1, ..., µ_D] and the prior covariance matrix Σ. We fix the partitioning throughout the covariate sets as A_{j,1} = {0}, A_{j,2} = {1}, since Y ∈ Ω = {0, 1}. Equation (2) simplifies to P_{A_1|λ} = 1 − p(x, λ) and P_{A_2|λ} = p(x, λ).

The parametrisation of the covariance matrix follows the separation strategy suggested by Barnard et al. (2000), on an unconstrained space as presented by Kurowicka and Cooke (2003). That is, the covariance matrix is rewritten as Σ = diag(σ_1, ..., σ_D) R diag(σ_1, ..., σ_D), where σ_1, ..., σ_D are the standard deviations and R is the correlation matrix.
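The closed-form success probability p(x, λ) in (9) can be checked against a Monte Carlo average of Φ(xᵀθ) over prior draws. A minimal sketch, with arbitrary illustrative values for µ, Σ and x:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu = np.array([0.5, -1.0, 0.25])                 # illustrative prior mean
L = np.array([[1.0, 0.0, 0.0],
              [0.3, 0.8, 0.0],
              [-0.2, 0.1, 0.5]])
Sigma = L @ L.T                                  # illustrative prior covariance
x = np.array([1.0, 0.7, -1.2])                   # one covariate vector

# Closed form: P(Y = 1 | x, lambda) = Phi(x'mu / sqrt(1 + x'Sigma x))
p_closed = norm.cdf(x @ mu / np.sqrt(1.0 + x @ Sigma @ x))

# Monte Carlo: average Phi(x'theta) over draws theta ~ N(mu, Sigma)
theta = rng.multivariate_normal(mu, Sigma, size=200_000)
p_mc = norm.cdf(theta @ x).mean()
print(p_closed, p_mc)
```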
Fig. 3. Convergence of the covariance matrix estimates for multivariate prior elicitation for binary linear regression as a function of the number of covariate sets J for which the user provides probability estimates, measured using the logarithm of the Frobenius norm of the difference between the true covariance matrix and the estimate. The coloured lines refer to the dimensionality D of the prior distribution (D = 2, ..., 6), showing that we can effectively elicit multivariate priors of reasonable dimensionality, with naturally increasing difficulty for larger D.

In the simulation experiment, we vary the dimension D ∈ {2, 3, 4, 5, 6} and the number of sets of covariates J. For each D we randomly pick a true value for λ, and for each covariate set we draw random probabilities of success/failure from the Dirichlet probability model. Hence, the likelihood is given by (7). We repeat the procedure for each D and J, where the hyperparameters λ are fixed with respect to J. To show the convergence of the estimates of Σ obtained from the expert judgements, we compare the logarithm of the Frobenius norm between the estimated covariance matrix and the true covariance matrix (Fig. 3). For sufficiently large J, roughly from J = 15 onwards, we are able to accurately elicit multivariate priors of up to 5-6 dimensions; this is a significant improvement over earlier methods that have been limited to univariate or at most bivariate priors (Moala and O'Hagan, 2010). For increasing D from 2, 3, 4, 5 to 6, the respective number of hyperparameters in the vector λ becomes 5, 9, 14, 20 and 27, explaining the increased elicitation difficulty for large D.

Using the machinery above requires obtaining the probability judgements p from the user. The method itself is general, and can be used as part of any practical Bayesian modelling workflow when linked to any particular elicitation interface. We have implemented an extension of the SHELF interface (Oakley and O'Hagan, 2019) as a reference, by replacing the direct parameter elicitation components with variants that query the user for the prior predictive probabilities. This readily provides practical elicitation methods for the user to specify probabilities by utilizing probability quantiles or roulette chips. This means that probability ratios for events are provided and then individual probabilities are recovered under the natural constraint Σ_{i=1}^n p_i = 1. Hence, the user can choose the way of providing information they feel most comfortable with. Besides graphical interfaces, the elicitation can be carried out by the modeller interviewing a domain expert. Experienced modellers may also choose to simply express some particular priors via providing p while designing the model.
Example: To evaluate the applicability of our method in practice, we conducted a small user study with N = 5 doctoral students of computer science with reasonable statistical knowledge. The task was to elicit priors of a human growth model (see Preece and Baines, 1978, model 1, Section 2) with a six-dimensional hyperparameter vector λ. We queried the users for n_j = 6 probabilities at J = 4 covariate values, each corresponding to the stature distribution of males at one of four ages t (in years). We chose this model because everyone can be expected to have a rough understanding of the observed data and hence can act as an expert. As a baseline, we used a standard elicitation procedure which queries the prior distributions for each parameter directly (again with n = 6). Some of these parameters are intuitive (e.g., stature as an adult) while some control the quantitative behaviour of the model in a non-trivial way. The model was implemented in brms (Bürkner, 2017) to demonstrate compatibility with existing modelling tools. Gradient-free optimization (see next section) was used for converting the elicited probabilities into priors. Table 1 shows, for one example user, how the prior predictive distribution corresponding to the λ elicited with the proposed method matches well with the results of Preece and Baines (1978). When applying direct parameter elicitation, the match was clearly worse, because the user was unable to provide reasonable estimates for parameters without an intuitive meaning, despite being provided an explanation of the model and its parameters. In a standardized interview, all users reported that they were more comfortable providing probability judgements for the observables than for the parameters, and that they were more confident that the resulting prior matches their actual subjective prior. See the supplementary materials for details of the model and user study, as well as results for all users.
4. ON THE LEARNING ALGORITHMS
Having characterized the problem itself and its asymptotic properties, we now turn our attention to the computational problem of estimating the hyperparameter vector λ and the uncertainty parameter α in practice. We start with basic notions about the types of models and properties our method is able to accommodate, and then systematise general-purpose, model-independent computational algorithms.

The methodology presented in Section 3 supports both discrete and continuous components in the observable variables Y, or combinations of both. It also works for any data dimension S and any parameter dimension D. An interesting case is S = 1 and D > 1, meaning that, as we have shown previously, we can recover a multivariate prior distribution from probability judgements about a 1-dimensional observable variable. This is novel in the recent literature.
Table 1. Result of a real prior elicitation experiment for one user, characterized by statistics of the prior distribution. The proposed approach (Predictive) better matches the parameters found by fitting the model to actual data (Preece and Baines, 1978; Reference), compared to direct parameter elicitation (Parametric). This is visible in the lower α estimate as well. The reference column excludes b due to their use of a non-probabilistic model. (Rows: h_1, h_{t*}, s_0, s_1, t*, b and α; columns: Reference, and E[·], V(·) under the Predictive and Parametric approaches.)

For arbitrary S, where we would possibly work with a multivariate distribution over a vector of observable variables, probabilities for a generic rectangular set A = ×_{s=1}^S (a_s, b_s] can be formulated via the cumulative distribution function of the prior predictive distribution (1) as follows. Let I = (a, b] be an interval, g some function with g : R^S → R, and Δ_I^s the difference operator over the s-th argument, Δ_I^s g = g(y_1, ..., y_{s−1}, b) − g(y_1, ..., y_{s−1}, a). Then, equation (2) takes the general form

P_{A|λ} = ∫_{a_1}^{b_1} ··· ∫_{a_S}^{b_S} π_{Y|λ}(y_1, ..., y_S) dy_1 ... dy_S = Δ_{I_1}^1 Δ_{I_2}^2 ··· Δ_{I_S}^S F_{Y|λ}(y_1, ..., y_S),   (10)

where F_{Y|λ}(·) is the cumulative distribution function of the prior predictive distribution (1). Cases in which S > 1 can thus be handled through the same representation.
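For S = 2, the representation (10) reduces to the familiar inclusion-exclusion identity in terms of the bivariate CDF. The sketch below evaluates it for a model whose prior predictive distribution is itself Gaussian (Y | θ ∼ N(θ, I) with θ ∼ N(µ, Σ), so that Y ∼ N(µ, Σ + I)); the numbers are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                      # illustrative prior mean
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])                 # illustrative prior covariance
cov = Sigma + np.eye(2)                        # prior predictive covariance of Y

a = np.array([-1.0, 0.0])                      # lower corners a_s
b = np.array([1.0, 2.0])                       # upper corners b_s

F = lambda y: multivariate_normal.cdf(y, mean=mu, cov=cov)

# Delta^1 Delta^2 F = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)
P_A = (F([b[0], b[1]]) - F([a[0], b[1]])
       - F([b[0], a[1]]) + F([a[0], a[1]]))
print(P_A)
```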
Natural gradients for closed-form cases:
If equation (10) is available in closed form, usual gradient-based optimisation algorithms are applicable. We recommend using natural gradients (Amari, 1998), which have been widely applied in statistical machine learning problems (e.g., see Girolami and Calderhead, 2011; Salimbeni et al., 2018). In this case, the Fisher information matrix for λ can be computed in closed form using results from the original parametrisation of the Dirichlet distribution (Ferguson, 1973) as

H_λ = (d/dλ P_{A|λ})ᵀ H_{P_{A|λ}} (d/dλ P_{A|λ}),   (11)

where H_{P_{A|λ}} = α²( diag(ψ'(α P_{A|λ})) − ψ'(α) 1 1ᵀ ), with 1 the vector of ones, is the Fisher information matrix of the standard Dirichlet distribution, P_{A|λ} = [P_{A_1|λ} ··· P_{A_n|λ}]ᵀ, and d/dλ P_{A|λ} = [ d/dλ_1 P_{A|λ} ··· d/dλ_M P_{A|λ} ]ᵀ. The function ψ'(·) is the derivative of the digamma function and d/dλ_m P is the derivative of the vector P with respect to an element of the vector of hyperparameters λ. Due to the closed-form expression, we can use natural gradients with almost no additional computational cost. The only extra step is the calculation of d/dλ_m P, which can be obtained easily with automatic differentiation regardless of the chosen generative model.
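A sketch of how the Fisher information (11) can be assembled once the probabilities P_{A|λ} and their Jacobian with respect to λ are available; here both are simply passed in as arrays (in practice the Jacobian could come from automatic differentiation). The Dirichlet block H_{P_{A|λ}} follows the form quoted above.

```python
import numpy as np
from scipy.special import polygamma

def fisher_info_lambda(alpha, P, J):
    """Fisher information H_lambda of eq. (11).

    alpha : Dirichlet precision
    P     : prior predictive probabilities, shape (n,)
    J     : Jacobian dP/dlambda, shape (n, M)
    """
    psi1 = lambda x: polygamma(1, x)              # trigamma function, psi'
    n = len(P)
    H_P = alpha**2 * (np.diag(psi1(alpha * P)) - psi1(alpha) * np.ones((n, n)))
    return J.T @ H_P @ J                          # (M, M) matrix

# A natural-gradient step then preconditions the ordinary gradient g of the
# log-likelihood by H_lambda, e.g. (a sketch, step size chosen by the user):
# lam_new = lam + step * np.linalg.solve(fisher_info_lambda(alpha, P, J), g)
```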
Stochastic natural gradients optimization: If (10) cannot be expressed in closed form but equation (4) or (7) is differentiable with respect to λ, one can use gradient-based optimization with reparametrisation gradients and automatic differentiation. The elements of P are expected values with respect to the prior distribution (2), and the goal is then to find a pivotal function for the prior (see Casella and Berger, 2001, page 427, Section 9.2.2) and obtain Monte Carlo estimates of it (which is not difficult once we can use the representation (10)) and of the gradients d/dλ_m P, with very low computational cost according to Figurnov et al. (2018).

When the generative model has a higher-level hierarchical structure, such as Y | θ_1 ∼ π(y | θ_1), θ_1 | θ_2 ∼ π(θ_1 | θ_2), ..., θ_L | λ ∼ π(θ_L | λ), we can show that the elements of P_{A|λ} and d/dλ_m P_{A|λ} can also be computed efficiently, together with a stochastic estimate of the hyperparameters' Fisher information matrix. That is,

P_{A|λ} = E_{X_L}( E_{X_{L−1}} ··· ( E_{X_1}( P_{A|f_1(λ)} ) ) ),   (12)

where the X_ℓ are pivotal quantities with respect to the distributions π(θ_ℓ | θ_{ℓ+1}) for ℓ = 1, ..., L and f_1(λ) is a function which depends only on the hyperparameters λ. Gradients are estimated similarly as

d/dλ_m P_{A|λ} = E_{X_L}( E_{X_{L−1}} ··· ( E_{X_1}( (df_1/dλ_m) (d/df_1) P_{A|f_1(λ)} ) ) ).   (13)

The equations above can be plugged into (11) to obtain an estimate of the hyperparameters' Fisher information matrix. The proof and detailed explanations are provided in the supplementary materials.
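For a Normal prior the pivotal function is the usual location-scale map θ = µ_0 + σ_0 X with X ∼ N(0, 1), so P_{A|λ} and its gradient with respect to µ_0 can be estimated from the same pivotal draws. A minimal sketch of this reparametrisation-gradient idea, with illustrative numbers:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
a, b, sigma_y = -1.0, 1.0, 1.0                   # set A = (a, b], known noise scale
mu0, sigma0 = 0.5, 1.2                           # current hyperparameter values
x = rng.standard_normal(100_000)                 # pivotal draws, independent of lambda
theta = mu0 + sigma0 * x                         # reparametrised prior draws

# P(Y in A | theta) and its derivative with respect to theta
# (chain rule: d theta / d mu0 = 1, so this is also the derivative w.r.t. mu0)
p_theta = norm.cdf(b, theta, sigma_y) - norm.cdf(a, theta, sigma_y)
dp_dmu0 = norm.pdf(a, theta, sigma_y) - norm.pdf(b, theta, sigma_y)

P_A = p_theta.mean()                             # Monte Carlo estimate of P_{A|lambda}
grad_mu0 = dp_dmu0.mean()                        # reparametrisation-gradient estimate
print(P_A, grad_mu0)
```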
Gradient-free optimization: Finally, for completely arbitrary models, we can step outside of gradient-based optimization and use general-purpose global optimization tools for determining λ. Methods such as Bayesian optimization and Nelder-Mead only require the ability to evaluate the objective (10), and many practical optimization libraries (e.g. optimR) provide an extensive range of practical alternatives. For models with a relatively small number of hyperparameters, we have found such tools to work well in practice. However, whenever either of the gradient-based methods described above is applicable, we recommend using them due to their substantially better efficiency.
Optimization of α: Finally, besides λ, we usually want to estimate α as well, which quantifies the uncertainty as explained in Section 3.1. One can either directly optimise (4) for (λ, α) together, or alternate between optimising (4) for λ with fixed α and optimising (4) for α with fixed λ. This may be easier since we have an approximate closed-form expression for α, provided in the supplementary materials.
5. DISCUSSION AND CONCLUSIONS
Prior elicitation is an important stage in the Bayesian modeling workflow (Schad et al., 2019), especially for hierarchical models whose parameters have a complex relationship with the observed data. Standard prior elicitation strategies, such as O'Hagan and Oakley (2004) and Moala and O'Hagan (2010), do not really help in such scenarios, since the expert still needs to express information in terms of the probability distribution of the model's parameters. The idea of eliciting knowledge in terms of the observable data is not new; in fact, it dates back to Kadane et al. (1980). However, to our knowledge we have proposed the first practical formulation that accounts for uncertainty in the expert's judgements of the prior predictive distribution, with an easy, general, and complete implementation that allows eliciting both univariate and multivariate prior distributions more efficiently.

We demonstrated the general formalism in several practical contexts, ranging from simple conceptual illustrations and technical verifications to real elicitation examples. In particular, we showed that multivariate priors (of reasonable dimensionality) can be elicited in the context of generalized linear models based on a relatively small collection of probability judgements for different covariate sets. The approach can be coupled with existing modelling tools and used for eliciting prior information from real users, as demonstrated for the human growth model of Preece and Baines (1978) implemented in brms (Bürkner, 2017). Even though we only carried out a simplified and small-scale experiment, the results already indicate that even users familiar with statistical modelling were more comfortable expressing knowledge of the observed data rather than of the model parameters, and that the resulting priors better matched their beliefs.

An obvious continuation of this work would consider tighter integration of the method into a principled Bayesian workflow, coupled with more extensive user studies. We also intend to extend our method to cases of multiple experts' opinions about the same observable variables. As a first attempt, we could consider the same predictive model and distinct α's for multiple experts. However, more work is needed in that regard.

Acknowledgements
This work was supported by the Academy of Finland (Flagship programme: Finnish Center for Artificial Intelligence, FCAI; Grants 320181, 320182, 320183).

References
Akbarov, A. (2009) Probability elicitation: Predictive approach. Ph.D. thesis, University of Salford.
Amari, S. (1998) Natural gradient works efficiently in learning. Neural Computation, 251–276.
Barnard, J., McCulloch, R. and Meng, X.-L. (2000) Modelling covariance matrices in terms of standard deviations and correlations, with applications to shrinkage. Statistica Sinica, 1281–1311.
Bedrick, E. J., Christensen, R. and Johnson, W. (1997) Bayesian binomial regression: Predicting survival at a trauma center. The American Statistician, 211–218.
Berger, J. O. (1993) Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. Springer-Verlag, 2nd edn.
Bürkner, P.-C. (2017) brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 1–28.
Calderhead, B. (2012) Differential geometric MCMC methods and applications. Ph.D. thesis, University of Glasgow.
Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. and Riddell, A. (2017) Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 1–32.
Casella, G. and Berger, R. L. (2001) Statistical Inference. Duxbury Press, 2nd edn.
Daee, P., Peltola, T., Soare, M. and Kaski, S. (2017) Knowledge elicitation via sequential probabilistic inference for high-dimensional prediction. Machine Learning, 1599–1620.
Ferguson, T. S. (1973) A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 209–230.
Figurnov, M., Mohamed, S. and Mnih, A. (2018) Implicit reparameterization gradients. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, 439–450.
Folland, G. (2013) Real Analysis: Modern Techniques and Their Applications. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts. Wiley.
Garthwaite, P. H., Kadane, J. B. and O'Hagan, A. (2005) Statistical methods for eliciting probability distributions. Journal of the American Statistical Association, 680–701.
Geisser, S. (1993) Predictive Inference: An Introduction. Springer.
Gelfand, A. E., Mallick, B. K. and Dey, D. K. (1995) Modeling expert opinion arising as a partial probabilistic specification. Journal of the American Statistical Association, 598–604.
Gelman, A., Simpson, D. and Betancourt, M. (2017) The prior can often only be understood in the context of the likelihood. Entropy, 555.
Genest, C. and Schervish, M. J. (1985) Modeling expert judgments for Bayesian updating. The Annals of Statistics, 1198–1212.
Girolami, M. and Calderhead, B. (2011) Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society B, 123–214.
Gosling, J. (2005) Elicitation: A nonparametric view. Ph.D. thesis, University of Sheffield.
Jeffreys, H. and Zellner, A. (1980) Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Harold Jeffreys. Studies in Bayesian Econometrics. North-Holland Pub. Co.
Kadane, J. and Wolfson, L. J. (1998) Experiences in elicitation. Journal of the Royal Statistical Society: Series D (The Statistician), 3–19.
Kadane, J. B., Dickey, J. M., Winkler, R. L., Smith, W. S. and Peters, S. C. (1980) Interactive elicitation of opinion for a normal linear model. Journal of the American Statistical Association, 845–854.
Kijima, M. (1997) Markov Processes for Stochastic Modeling. Stochastic Modeling Series. Taylor & Francis.
Kurowicka, D. and Cooke, R. (2003) A parameterization of positive definite matrices in terms of partial correlation vines. Linear Algebra and its Applications, 225–251.
Lawless, J. (2011) Statistical Models and Methods for Lifetime Data. Wiley Series in Probability and Statistics. Wiley.
Lindley, D. (1983) Reconciliation of probability distributions. Operations Research, 866–880.
Meent, J.-W. V. D., Paige, B., Yang, H. and Wood, F. (2018) An introduction to probabilistic programming. ArXiv.
Moala, F. and O'Hagan, A. (2010) Elicitation of multivariate prior distributions: A nonparametric Bayesian approach. Journal of Statistical Planning and Inference, 1635–1655.
Mohamed, S., Rosca, M., Figurnov, M. and Mnih, A. (2019) Monte Carlo gradient estimation in machine learning. arXiv e-prints.
Nelder, J. A. and Wedderburn, R. W. M. (1972) Generalized linear models. Journal of the Royal Statistical Society. Series A (General), 370–384.
Oakley, J. E. and O'Hagan, A. (2007) Uncertainty in prior elicitations: A nonparametric approach. Biometrika.
— (2019) SHELF: The Sheffield Elicitation Framework (Version 4.0). School of Mathematics and Statistics, University of Sheffield, UK (http://tonyohagan.co.uk/shelf).
O'Hagan, A. (1978) Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society B, 1–42.
— (2004) Kendall's Advanced Theory of Statistics: Bayesian Inference. Oxford University Press.
— (2019) Expert knowledge elicitation: Subjective but scientific. The American Statistician, 69–81.
O'Hagan, A. and Oakley, J. E. (2004) Probability is perfect, but we can't elicit it perfectly. Reliability Engineering & System Safety, 239–248.
Preece, M. A. and Baines, M. J. (1978) A new family of mathematical models describing the human growth. Annals of Human Biology, 1–24.
Salimbeni, H., Eleftheriadis, S. and Hensman, J. (2018) Natural gradients in practice: Non-conjugate variational inference in Gaussian process models. In Proc. of the 21st AISTATS, vol. 84 of Proceedings of Machine Learning Research, 689–697. PMLR.
Sarma, A. and Kay, M. (2020) Prior setting in practice: Strategies and rationales used in choosing prior distributions for Bayesian analysis.
Schad, D. J., Betancourt, M. and Vasishth, S. (2019) Toward a principled Bayesian workflow in cognitive science.
Simpson, D., Rue, H., Martins, T., Riebler, A. and Sørbye, S. (2017) Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 1–28.
Winkler, R. L. (1967) The assessment of prior distributions in Bayesian analysis. Journal of the American Statistical Association, 776–800.
SUPPLEMENTARY MATERIALS
In this section we highlight the steps to obtain the prior predictive probability, by rewriting it as an expected value w.r.t. the prior distribution as follows. Given that the probabilistic model π_{Y|θ}(y | θ) and the prior π_θ are positive functions, we can rearrange the order of the integration (see Folland, 2013, Fubini-Tonelli theorem). Hence we have

P_{A|λ} := ∫_A π_Y(y | λ) dy = ∫_A ∫_Θ π_{Y|θ}(y | θ) π(θ | λ) dθ dy
        = ∫_Θ ∫_A π_{Y|θ}(y | θ) π(θ | λ) dy dθ
        = ∫_Θ P_{Y|θ}(Y ∈ A | θ) π(θ | λ) dθ
        = E_θ( P_{Y|θ}(Y ∈ A | θ) ).   (14)

Here we show the approximate behaviour of the precision parameter α for the general case when covariates are present. The simplification to other cases is straightforward. Recall that the likelihood function of λ given the expert data reads

D(p_1, ..., p_J | α, λ) = [ Γ(α)^J / ∏_{j=1}^J ∏_{i_j=1}^{n_j} Γ(α P_{A_{j,i_j}|λ}) ] ∏_{j=1}^J ∏_{i_j=1}^{n_j} p_{j,i_j}^{α P_{A_{j,i_j}|λ} − 1}.   (15)

Consider Stirling's approximation to the Γ(·) function given by

Γ(x) ≈ √(2π/x) (x/e)^x.   (16)

Rewriting the likelihood function in terms of the above approximation and removing terms that do not depend on α, with a simplified notation we get

D(p | α, λ) ≈ [ √(2π/α) (α/e)^α ]^J / ∏_{j,i_j} [ √(2π/(α P_{A_{j,i_j}|λ})) (α P_{A_{j,i_j}|λ}/e)^{α P_{A_{j,i_j}|λ}} ] exp( Σ_{i,j} α (P_{A_{j,i_j}|λ} − 1) log p_{j,i_j} )
           ≈ α^{Σ_j n_j/2 − J/2} ∏_{i,j} P_{A_{j,i_j}|λ}^{1/2} exp( −α Σ_{i,j} P_{A_{j,i_j}|λ} log( P_{A_{j,i_j}|λ} / p_{j,i_j} ) ).   (17)

(Stirling's approximation is precise here.) Take the logarithm of the above function and the derivative w.r.t. α. Setting it to zero and solving for α we obtain

α̂ ≈ ( Σ_j n_j/2 − J/2 ) / Σ_j KL( P_j ‖ p_j ),   (18)

where P_j = [P_{A_{j,1}|λ} ··· P_{A_{j,n_j}|λ}]ᵀ and KL(P ‖ Q) denotes the Kullback-Leibler divergence in this order.

The Fisher information matrix for the unknown hyperparameters can be obtained in closed form by the fact that, in the original parametrisation of the Dirichlet distribution, the Fisher information is already known. In the original parametrisation and in its basic form, the probability density function reads

D(p | α, P) = [ Γ(α) / ∏_{i=1}^n Γ(α P_i) ] ∏_{i=1}^n p_i^{α P_i − 1},   (19)

where P = [P_1 ··· P_n]ᵀ. Also knowing that the Dirichlet distribution belongs to the exponential family, the Fisher information matrix reads

H_P = α²( diag(ψ'(α P)) − ψ'(α) 1 1ᵀ ),   (20)

whose inverse is given in closed form as

H_P^{-1} = (1/α²) [ diag(ψ'(α P))^{-1} + diag(ψ'(α P))^{-1} 1 1ᵀ diag(ψ'(α P))^{-1} / ( 1/ψ'(α) − 1ᵀ diag(ψ'(α P))^{-1} 1 ) ],   (21)

where 1 is the n × 1 vector of ones. In our case the base measure P of the Dirichlet distribution is written as a function of λ. Using the change of variables for a new parametrisation (see Calderhead, 2012; Girolami and Calderhead, 2011, page 64, Section 3.2.5, equation 3.27), the Fisher information matrix with respect to λ can be obtained directly (bypassing any need to recalculate integrals) as

H_λ = [ d/dλ_1 P ··· d/dλ_M P ]ᵀ H_P [ d/dλ_1 P ··· d/dλ_M P ],   (22)

where the vector d/dλ_m P = [ d/dλ_m P_1 ··· d/dλ_m P_n ]ᵀ (the Jacobian matrix). Note that H_P is invertible and positive-definite, and so is H_λ. Hence H_λ is also invertible and its Cholesky decomposition is stable to compute.
Presence of covariates (inputs): When sets of covariates are present, we have to consider that different partitions are provided. Since the likelihood function still factorises over distinct covariate sets, see equation (15), the resulting Fisher information matrix is the sum of the per-set Fisher information matrices (Casella and Berger, 2001). Hence, we can write

H_λ = Σ_j [ d/dλ_1 P_j ··· d/dλ_M P_j ]ᵀ H_{P_j} [ d/dλ_1 P_j ··· d/dλ_M P_j ].   (23)
For the case where P_j does not have a closed-form expression, we can estimate P_j and its derivatives w.r.t. λ using reparametrisation gradients and automatic differentiation. The main idea is to find a pivotal function (see Casella and Berger, 2001, page 427, Section 9.2.2) and obtain Monte Carlo estimates of P_j and of the gradients d/dλ_m P_j with low computational cost, according to Figurnov et al. (2018) and Mohamed et al. (2019).

With a simplified notation, recall the prior distribution π_θ and that the prior predictive probability can be rewritten as an expected value

P_{A|λ} = E_θ( P(Y ∈ A | θ) ),   (24)

which depends on λ. Here the expression P(Y ∈ A | θ) depends only on θ. Then, find a pivotal function X = T(θ) such that the distribution of X does not depend on λ. We can then rewrite the expectation as

P_{A|λ} = E_X( P(Y ∈ A | T_X^{-1}(λ)) ).   (25)

The gradients can be computed by interchanging the order of integration and differentiation,

d/dλ_m P_{A|λ} = E_X( d/dλ_m P(Y ∈ A | T_X^{-1}(λ)) ),   (26)

where T_X^{-1}(·) is the inverse function of T and depends on X and λ. The important notion here is that there is no need for resampling X, since the distribution π_X(·) is free of λ by definition.
Hierarchical structures: Assume a hierarchical probabilistic model defined in the form of layers, as in the representation Y ← θ_1 ← ··· ← θ_L ← λ, where L indicates the number of hierarchical layers. Formally one could write the hierarchical probabilistic model as

Y | θ_1 ∼ π(y | θ_1),
θ_1 | θ_2 ∼ π(θ_1 | θ_2),
...
θ_L | λ ∼ π(θ_L | λ),   (27)

whose prior predictive probability reads

P_{A|λ} = ∫_Θ P(Y ∈ A | θ_1) ∏_{ℓ=1}^{L−1} π(θ_ℓ | θ_{ℓ+1}) π(θ_L | λ) dθ
        = ∫_{Θ_L} π(θ_L | λ) ∫_{Θ_{L−1}} π(θ_{L−1} | θ_L) ··· ∫_{Θ_1} π(θ_1 | θ_2) P(Y ∈ A | θ_1) dθ_1 ... dθ_L,   (28)

where Θ = Θ_1 × ··· × Θ_L and θ_ℓ ∈ Θ_ℓ. Note that the above equation can be rewritten via the tower property by applying it sequentially due to the model hierarchy,

P_{A|λ} = E_{θ_L}( E_{θ_{L−1}} ··· ( E_{θ_1}( P_{A|θ_1} ) ) ),   (29)

with the shortened notation P(Y ∈ A | θ_1) = P_{A|θ_1}.

In this case, to apply the reparametrisation gradient technique, first find a pivotal function X_ℓ = T_ℓ(θ_ℓ) for each layer ℓ, whose inverse function is denoted as θ_ℓ = T_{X_ℓ}^{-1}(θ_{ℓ+1}). Note that when we assume a pivotal quantity for every layer ℓ, by definition the distribution π_{X_ℓ}(x_ℓ) = π_{θ_ℓ|θ_{ℓ+1}}(T^{-1}(x_ℓ)) |det J(T^{-1}(x_ℓ))| does not depend on any θ_{ℓ+1} or λ. Hence, define the composite of inverse functions for each layer as

θ_ℓ = f_ℓ(λ) = ( T_{X_ℓ}^{-1} ∘ T_{X_{ℓ+1}}^{-1} ∘ ··· ∘ T_{X_L}^{-1} )(λ).

This way, the above expected value as a function of λ can be rewritten as

P_{A|λ} = E_{X_L}( E_{X_{L−1}} ··· ( E_{X_1}( P_{A|f_1(λ)} ) ) ).   (30)

To estimate P_{A|λ} via Monte Carlo, first remember that λ is fixed. Sample from π_{X_ℓ} for each ℓ and obtain the value of the function f_ℓ(λ) for each ℓ. Calculate the sample mean of P_{A|f_1(λ)}. Gradients of P_{A|λ} w.r.t. λ can be obtained similarly; the extra step needed is the calculation of the following expression,

d/dλ_m P_{A|λ} = E_{X_L}( E_{X_{L−1}} ··· ( E_{X_1}( (df_1/dλ_m) (d/df_1) P_{A|f_1(λ)} ) ) ),   (31)

where the notation of the expectations E_X(·) is the same as in (30), but shortened. The first derivative on the right-hand side of the equation above then reads

df_1/dλ_m = ( ∏_{r=1}^{L−1} dT_{X_r}^{-1}/dT_{X_{r+1}}^{-1} ) dT_{X_L}^{-1}/dλ_m.   (32)

In cases where the derivative of the inverse function T_{X_ℓ}^{-1} above cannot be obtained in closed form, we proceed similarly to Figurnov et al. (2018), equation (6). Knowing that T_ℓ is a one-to-one function, we can write

X_ℓ = T_ℓ( T_{X_ℓ}^{-1}(θ_{ℓ+1}) ).   (33)

Take implicit and explicit derivatives (total derivative) with respect to θ_{ℓ+1} to get

0 = dT_ℓ/dθ_{ℓ+1} |_explicit + dT_ℓ/dθ_{ℓ+1} |_implicit = ∂T_ℓ/∂θ_{ℓ+1} + (∂T_ℓ/∂θ_ℓ)(dθ_ℓ/dθ_{ℓ+1}).   (34)

Identifying the notation θ_ℓ = T_{X_ℓ}^{-1} for all ℓ and solving for dθ_ℓ/dθ_{ℓ+1} yields

dT_{X_ℓ}^{-1}/dT_{X_{ℓ+1}}^{-1} = −( ∂T_ℓ/∂T_{X_ℓ}^{-1} )^{-1} ∂T_ℓ/∂T_{X_{ℓ+1}}^{-1}.   (35)

We can now plug (35) into (32) to estimate (31), and in turn to obtain the estimate of the hyperparameters' Fisher information matrix in (22) and (23).
Hence, we can proceed with stochastic natural gradient descent to estimate the hyperparameters λ for general types of probabilistic models.
The probabilistic model for the observed data (stature of a male human being) is specified as follows:

Y_t | θ, b ∼ W( h(t; θ), b ),
b ∼ G(a_0, b_0),
θ_d ∼ LN(a_d, b_d), independently for each d,   (36)

where Y_t is univariate (S = 1) and denotes the stature of the human being at time t. The parameters of the growth model h(t; θ) are denoted as θ = [θ_1 θ_2 θ_3 θ_4 θ_5]ᵀ = [h_1, h_{t*}, t*, s_0, s_1]ᵀ, where h_1 is the average stature of an adult human, h_{t*} is the average stature at the "growth spurt" event (Preece and Baines, 1978), t* is when that event happens, and s_0 and s_1 are rate constants of the model. The parameter b controls the variance of the variable Y_t around h(t; θ): the larger the value of b, the smaller the variance around h(t; θ), and vice versa. W, G and LN stand for the Weibull, Gamma and log-Normal distributions, respectively.

We used the Weibull distribution in the mean parametrisation, which means that the probability distribution of Y_t | θ, b is given by

π_{Y_t|θ,b}(y) = ( b Γ(1 + 1/b) / exp(h(t; θ)) ) ( y Γ(1 + 1/b) / exp(h(t; θ)) )^{b−1} exp( −( y Γ(1 + 1/b) / exp(h(t; θ)) )^b ).   (37)

The other distributions used for the prior are used in their standard parametrisations, scale-shape for the Gamma and mean-variance for the log-Normal distribution. The vector of hyperparameters is λ = {a_m, b_m, m = 0, . . . }. The human growth model was obtained by Preece and Baines (1978) and is given in Section 2, Model 1 of their paper. In our notation this growth model reads

h(t; θ) = h_1 − 2(h_1 − h_{t*}) / ( exp[s_0(t − t*)] + exp[s_1(t − t*)] ).   (38)

The only general background information provided to the participants was the following brief description characterizing the overall growth process and providing general numerical values as reminders:

"During the early stages of life the stature of female and male are about the same, but their statures start to clearly differ during growth and in the later stages of life. Male and female are born with roughly the same stature. Around the age of 2.5 years, both male and female present the highest growth rate (centimetres per year); it is the time they grow the fastest. During this period, males have a higher growth rate compared to females. For both male and female there is a growth spurt in pre-adulthood. For males, this phase of fast growth occurs between 13-17 years of age, and for females between 11-15. Also, males tend to keep growing at a roughly constant rate until the age of 17-18, while females until the age of 15-16. After this period of life their statures tend to stabilise around their respective adult values."

Given the background information, we asked each user to provide the distribution of statures of males at four given ages t_1, t_2, t_3, t_4 in the form of probabilistic assessments. For eliciting the probabilities we asked them to provide the thresholds y_1 < y_2 < ... < y_5 that partition the sample space at five fixed cumulative probability levels P(Y_t ≤ y_i), the largest of which was P(Y_t ≤ y_5) = 0.90. The data used at each t_j were hence the six induced interval probabilities p_{j,1}, ..., p_{j,6} for the events Y_{t_j} ∈ (0, y_1], (y_1, y_2], ..., (y_5, ∞).
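To give a feel for the prior predictive distribution implied by (36)-(38), the sketch below simulates statures at a single age and bins them into elicitation intervals. The hyperparameter values, the age, the thresholds, and the exact parametrisations of the Gamma and log-Normal samplers are illustrative assumptions only; they are not the settings used in the user study, and the growth curve is treated here as modelling log-stature so that exp(h) is the Weibull mean in centimetres.

```python
import numpy as np
from scipy.special import gamma as gamma_fn

rng = np.random.default_rng(4)

def growth_curve(t, h1, h_ts, t_star, s0, s1):
    """Preece-Baines model 1 (eq. 38)."""
    return h1 - 2.0 * (h1 - h_ts) / (np.exp(s0 * (t - t_star)) + np.exp(s1 * (t - t_star)))

def simulate_stature(t, n_draws=50_000):
    # Illustrative independent log-normal priors on theta = (h1, h_ts, t_star, s0, s1)
    h1     = rng.lognormal(np.log(5.15), 0.01, n_draws)   # log of adult stature (~172 cm)
    h_ts   = rng.lognormal(np.log(5.05), 0.01, n_draws)
    t_star = rng.lognormal(np.log(14.0), 0.05, n_draws)
    s0     = rng.lognormal(np.log(0.1), 0.10, n_draws)
    s1     = rng.lognormal(np.log(1.2), 0.10, n_draws)
    b      = rng.gamma(shape=50.0, scale=1.0, size=n_draws)   # illustrative Gamma prior on b
    mean = np.exp(growth_curve(t, h1, h_ts, t_star, s0, s1))  # Weibull mean, cf. eq. (37)
    scale = mean / gamma_fn(1.0 + 1.0 / b)
    return scale * rng.weibull(b, n_draws)                    # draws of Y_t

y = simulate_stature(t=18.0)
edges = np.array([0.0, 160, 170, 175, 180, 190, 250.0])       # illustrative thresholds y_1..y_5
probs = np.histogram(y, bins=edges)[0] / len(y)               # prior predictive bin probabilities
print(probs.round(3))
```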
Results for the prior predictive elicitation

The main manuscript provided the results for one example user. The results for the other four users are provided here in Tables 2 to 5. The general trend of the prior predictive elicitation matching the data-dependent values of Preece and Baines (1978) better remains, and for some users the direct parameter elicitation approach resulted in a very poor prior (e.g. h_{t*} for User 3).

Table 2. User 2.
Table 3. User 3.
Table 4. User 4.
Table 5. User 5.
(Each table has the same layout as Table 1: rows h_1, h_{t*}, s_0, s_1, t*, b and α; columns Reference, and E[·], V(·) under the Predictive and Parametric approaches.)