Approximate Bayesian Computations to fit and compare insurance loss models
Pierre-Olivier Goffard and Patrick J. Laub
Univ Lyon, Université Lyon 1, LSAF EA2429
July 9, 2020
Abstract
Approximate Bayesian Computation (ABC) is a statistical learning technique to calibrate and select models by comparing observed data to simulated data. This technique bypasses the use of the likelihood and requires only the ability to generate synthetic data from the models of interest. We apply ABC to fit and compare insurance loss models using aggregated data. We present along the way how to use ABC for the more common claim counts and claim sizes data. A state-of-the-art ABC implementation in Python is proposed. It uses sequential Monte Carlo to sample from the posterior distribution and the Wasserstein distance to compare the observed and synthetic data.
MSC 2010: 60G55, 60G40, 12E10.
Keywords: Bayesian statistics, approximate Bayesian computation, likelihood-free inference, risk management.
Over a fixed time period, an insurance company experiences a random number of claims, called the claim frequency, and each claim requires the payment of a randomly sized compensation, called the claim severity. The claim frequency is a counting random variable while the claim sizes are non-negative continuous random variables. Let us say that the claim frequency and the claim severity distributions are specified by the parameters $\theta_{\text{freq}}$ and $\theta_{\text{sev}}$ respectively, with $\theta = (\theta_{\text{freq}}, \theta_{\text{sev}})$. For each time $s = 1, \dots, t$, the number of claims $n_s$ and the claim sizes $\mathbf{u}_s := (u_{s,1}, \dots, u_{s,n_s})$ are distributed as
$$ n_s \sim p_N(n; \theta_{\text{freq}}) \quad\text{and}\quad (\mathbf{u}_s \mid n_s) \sim f_{\mathbf{U}}(\mathbf{u}; n, \theta_{\text{sev}}). $$
We wish to fit these distributions; however, we assume that these independent and identically distributed (i.i.d.) values $\{(n_1, \mathbf{u}_1), \dots, (n_t, \mathbf{u}_t)\}$ are unobservable. Instead, we only have access to some real-valued summaries of the claim data at each time, denoted by
$$ x_s = \Psi(n_s, \mathbf{u}_s) \quad\text{for } s = 1, \dots, t. \qquad (1) $$
The summaries could be the aggregated claims if $\Psi(n, \mathbf{u}) = \sum_{i=1}^{n} u_i$, or the maximum claim if $\Psi(n, \mathbf{u}) = \max_{1 \le i \le n} u_i$. Our problem is to take some observations of these summaries $\mathbf{x} = (x_1, \dots, x_t)$ and find the $\theta$ which best explains them for a given parametric model.

Such incomplete data situations arise in reinsurance, see the monograph of Albrecher et al. [1, Chapter I, Section 3]. For instance, within a global non-proportional reinsurance agreement, the reinsurance company covers the risk that the insurer's total claim amount exceeds a threshold $c > 0$. The reinsurer then only observes its payout at each time period, $x_s = \left(\sum_{i=1}^{n_s} u_{s,i} - c\right)_+$. Being able to infer the parameters of the claim frequency and the claim severity distributions would help the reinsurer to better understand the risk they have underwritten.
Remark 1.1.
When the summary is the aggregated loss $\Psi(n, \mathbf{u}) = \sum_{i=1}^{n} u_i$, we effectively decompound the random sum. Traditionally, a decompounding method builds a non-parametric estimate of the claim severity distribution based on the observations of the aggregated sums, see Buchmann and Grübel [7] or Bøgsted and Pitts [6]. A popular application is the study of discretely observed compound Poisson processes, see for instance van Es et al. [32], Coca [8] and Gugushvili et al. [17], where a Bayesian non-parametric approach is used.

A Bayesian approach to estimating $\theta$ is to treat $\theta$ as a random variable and find (or approximate) the posterior distribution $\pi(\theta \mid \mathbf{x})$. Bayes' theorem tells us that
$$ \pi(\theta \mid \mathbf{x}) \propto p(\mathbf{x} \mid \theta)\, \pi(\theta), \qquad (2) $$
where $p(\mathbf{x} \mid \theta)$ is the likelihood and $\pi(\theta)$ is the prior distribution. The prior represents our beliefs about $\theta$ before seeing any of the observations and is informed by our domain-specific expertise. The posterior distribution is a very valuable piece of information that gathers our knowledge over the parameters. A point estimate $\hat\theta$ may be derived by taking the mean or mode of the posterior. For an overview of Bayesian statistics, we refer to the book of Gelman et al. [14]. The posterior distribution (2) rarely admits a closed-form expression, so it is approximated by an empirical distribution of samples from $\pi(\theta \mid \mathbf{x})$. Posterior samples are typically obtained using Markov Chain Monte Carlo (MCMC), yet a requirement for MCMC sampling is the ability to evaluate (at least up to a constant) the likelihood function $p(\mathbf{x} \mid \theta)$. Given the definition of $\mathbf{x}$ in (1), there is little hope of finding an expression for the likelihood function even in simple cases (e.g. when the claim sizes are i.i.d.). If the claim sizes are not i.i.d., or if the number of claims influences their amount, then the chance that a tractable likelihood for $\mathbf{x}$ exists is extremely low. Even when a simple expression for the likelihood exists, it can be prohibitively difficult to compute (such as in a big data regime), and so a likelihood-free approach can be beneficial.

We advertise here a likelihood-free estimation method known as approximate Bayesian computation (ABC). This technique has attracted a lot of attention recently due to its wide range of applicability and its intuitive underlying principle. One resorts to ABC when the model at hand is too complicated to write down the likelihood function but still simple enough to generate artificial data. Given some observations $\mathbf{x}$, the basic principle consists in iterating the following steps:

(i) generate a potential parameter from the prior distribution, $\theta^\ast \sim \pi(\theta)$;
(ii) simulate "fake data" $\mathbf{x}^\ast$ from the likelihood, $(\mathbf{x}^\ast \mid \theta^\ast) \sim p(\mathbf{x} \mid \theta^\ast)$;
(iii) if $\|\mathbf{x} - \mathbf{x}^\ast\| \le \epsilon$, where $\epsilon > 0$ is an acceptance threshold and $\|\cdot\|$ denotes a distance measure, then store $\theta^\ast$.

The algorithm provides us with a sample of $\theta$'s whose distribution is close to the posterior distribution $\pi(\theta \mid \mathbf{x})$. The ABC algorithm presented in this work allows us to consider a wide variety of $\Psi$ functions in (1) without imposing common simplifying assumptions, such as assuming that the claim amounts are i.i.d. and independent of the claim frequency. In addition to parameter estimation, ABC allows us to perform model selection in a Bayesian manner; this direction is also investigated. For a comprehensive overview of ABC, we refer to the monograph of Sisson et al. [29].
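As a concrete illustration of steps (i)–(iii), the following minimal Python sketch implements the plain accept–reject loop for a generic model. The prior sampler `sample_prior`, the data simulator `simulate_data` and the distance `dist` are hypothetical placeholders to be supplied by the user; this is not the paper's implementation.

```python
import numpy as np

def abc_rejection(x_obs, sample_prior, simulate_data, dist, eps, n_samples=1000,
                  rng=np.random.default_rng()):
    """Plain ABC accept-reject: keep theta* whenever the simulated data
    falls within distance eps of the observations."""
    accepted = []
    while len(accepted) < n_samples:
        theta = sample_prior(rng)            # (i) draw from the prior
        x_fake = simulate_data(theta, rng)   # (ii) simulate synthetic data
        if dist(x_obs, x_fake) <= eps:       # (iii) accept if close enough
            accepted.append(theta)
    return np.array(accepted)
```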
In finance and insurance, ABC has been considered in the context of operational risk management [21] and for reserving purposes [22].

The rest of the paper is organized as follows. Section 2 provides a gentle introduction to ABC algorithms. We start by presenting the ABC routines used on count data and continuous data, then show how to use ABC to fit an insurance loss model based on aggregated data. Section 3 explains how to adapt the ABC algorithm to compare models by computing the a posteriori model probability of each competing model. The performance of our ABC implementation is illustrated on simulated data in Section 4 and on a real-world insurance data set in Section 5.

ABC is a method for approximating the posterior probability $\pi(\theta \mid \mathbf{x})$ without using the likelihood function. The implementation of ABC is tied to the nature of the data at hand. In our problem, the frequency data is discrete, the individual claim sizes are continuous, and the aggregated data is a mixture of discrete and continuous (due to the atom at 0). We take advantage of this fact to introduce ABC algorithms for discrete data in Section 2.1, continuous data in Section 2.2, and mixed data in Section 2.3. The acceptance–rejection algorithm laid out in the introduction most often leads to considerable computing time, so Section 2.4 explains how to speed up ABC using sequential Monte Carlo (SMC). Section 2.5 shows the validity of our ABC implementation on an illustrative example.

Consider some count data $n_1, \dots, n_t \in \mathbb{N}$ which are i.i.d. with probability mass function (p.m.f.) $p_N(n \mid \theta)$; for example, the $n_s$'s could be claim frequencies. The likelihood of such data is $p(\mathbf{n} \mid \theta) = \prod_{s=1}^{t} p_N(n_s \mid \theta)$, where $\mathbf{n} = (n_1, \dots, n_t)$. For common discrete distributions, such as the Poisson or negative binomial, the likelihood function is tractable and may be plugged into an MCMC sampling algorithm to produce samples from the posterior distribution $\pi(\theta \mid \mathbf{n}) \propto p(\mathbf{n} \mid \theta)\, \pi(\theta)$. Alternatively, we can sample from the posterior $\pi(\theta \mid \mathbf{n})$ in a likelihood-free way by acceptance–rejection, which is detailed in Algorithm 1.

Algorithm 1
Acceptance–rejection sampling of the posterior for count data.

  input: observations $\mathbf{n} = (n_1, \dots, n_t)$
  for $k = 1 \to K$ do
    repeat
      generate $\theta_k \sim \pi(\theta)$
      generate $\tilde{\mathbf{n}}_k \sim p(\mathbf{n} \mid \theta_k)$
    until $\tilde{\mathbf{n}}_k = \mathbf{n}$, then store $\theta_k$
  end for
  return $\{\theta_1, \dots, \theta_K\}$, which are i.i.d. samples from $\pi(\theta \mid \mathbf{n})$

Algorithm 1 gives samples from exactly the desired posterior distribution $\pi(\theta \mid \mathbf{n})$:
$$ \pi(\theta \mid \mathbf{n}) \propto \pi(\theta) \int_{\mathbb{N}^t} \mathbb{I}_{\{\mathbf{n} = \tilde{\mathbf{n}}\}}\, p(\tilde{\mathbf{n}} \mid \theta)\, \mathrm{d}\tilde{\mathbf{n}} = \pi(\theta)\, p(\mathbf{n} \mid \theta), \quad\text{where } \mathbb{I}_{\{\mathbf{n} = \tilde{\mathbf{n}}\}} = \begin{cases} 1, & \text{if } \mathbf{n} = \tilde{\mathbf{n}}, \\ 0, & \text{otherwise.} \end{cases} $$
As we collect more data, the probability of seeing an exact match $\{\mathbf{n} = \tilde{\mathbf{n}}\}$ decreases exponentially. This, combined with a diffuse prior distribution, results in a cumbersome waiting time before getting a posterior sample. A natural refinement is to require an exact correspondence between the samples sorted in ascending order. The acceptance rate may still be too low to be practical, and in this case an approximate match between the observed and fake data must be considered. We discuss this matter within the continuous data case in the following section.
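The exact-matching scheme of Algorithm 1, with the sorted-sample refinement mentioned above, is easy to express in code. The sketch below is a toy illustration rather than the paper's implementation: it targets the posterior of a Poisson rate from a handful of observed counts, under a Gamma(1, 1) prior chosen purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = np.array([2, 0, 3, 1, 2])          # observed claim counts (toy data)

def abc_exact_match_poisson(n_obs, K=200):
    """Algorithm 1 for Poisson counts with a Gamma(1, 1) prior on the rate."""
    samples = []
    while len(samples) < K:
        lam = rng.gamma(shape=1.0, scale=1.0)          # theta ~ prior
        n_fake = rng.poisson(lam, size=n_obs.size)     # fake counts given theta
        if np.array_equal(np.sort(n_fake), np.sort(n_obs)):  # sorted exact match
            samples.append(lam)
    return np.array(samples)

posterior_sample = abc_exact_match_poisson(n_obs)
print(posterior_sample.mean())
```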
Let $u_1, \dots, u_n \in \mathbb{R}$ be an i.i.d. sample of continuous data with a probability density function (p.d.f.) denoted $f_U(u \mid \theta)$. An example of such data would be the claim sizes. With the notation $\mathbf{u} = (u_1, \dots, u_n)$, we can write the likelihood as $p(\mathbf{u} \mid \theta) = \prod_{i=1}^{n} f_U(u_i \mid \theta)$. If the data is fitted to a standard probability model, say gamma or normal, then we can sample from the posterior distribution $\pi(\theta \mid \mathbf{u}) \propto \pi(\theta)\, p(\mathbf{u} \mid \theta)$ with an MCMC scheme. If the likelihood is unavailable, then we can adapt Algorithm 1 to the case of continuous data, for which exact correspondence between observed and fake data is not possible. Synthetic samples are then accepted whenever they fall sufficiently close to the observed data, that is, whenever the dissimilarity between the two samples, assessed by a norm $\|\cdot\|$ on $\mathbb{R}^n$, is smaller than some tolerance threshold $\epsilon > 0$, see Algorithm 2.

Algorithm 2
ABC acceptance–rejection sampling for continuous data.

  input: observations $\mathbf{u} = (u_1, \dots, u_n)$, threshold $\epsilon > 0$, norm $\|\cdot\|$
  for $k = 1 \to K$ do
    repeat
      generate $\theta_k \sim \pi(\theta)$
      generate $\tilde{\mathbf{u}}_k \sim p(\mathbf{u} \mid \theta_k)$
    until $\|\mathbf{u} - \tilde{\mathbf{u}}_k\| < \epsilon$, then store $\theta_k$
  end for
  return $\{\theta_1, \dots, \theta_K\}$, which are approximately $\pi(\theta \mid \mathbf{u})$ distributed

The procedure depicted in Algorithm 2 allows us to sample from an approximation of the posterior distribution given by
$$ \pi_\epsilon(\theta \mid \mathbf{u}) \propto \pi(\theta) \int_{\mathbb{R}^n} \mathbb{I}_{\{\|\mathbf{u} - \tilde{\mathbf{u}}\| < \epsilon\}}\, p(\tilde{\mathbf{u}} \mid \theta)\, \mathrm{d}\tilde{\mathbf{u}}, \qquad (3) $$
where $\mathbb{I}_{\{\|\mathbf{u} - \tilde{\mathbf{u}}\| < \epsilon\}}$ equals 1 if $\|\mathbf{u} - \tilde{\mathbf{u}}\| < \epsilon$ and 0 otherwise. Distribution (3) is called the ABC posterior, and it has the desirable theoretical property of converging toward the standard posterior $\pi(\theta \mid \mathbf{u})$ as $\epsilon$ tends to 0, see Rubio and Johansen [26], Prangle et al. [24] or Bernton et al. [3].

The ABC procedure suffers from the so-called curse of dimensionality [4]. Specifically, if one takes the Euclidean distance, or some variation of it, to measure the dissimilarity between observed and fake data, then the odds of getting an acceptable match plummet as the number of observations, i.e. the dimension of $\mathbf{u}$, increases. The dimensionality curse can be alleviated by replacing $\mathbf{u} \in \mathbb{R}^n$ with summary statistics $S(\mathbf{u}) \in \mathbb{R}^d$, where $d < n$, in Algorithm 2 (specifically, the acceptance condition becomes $\|S(\mathbf{u}) - S(\tilde{\mathbf{u}}_k)\| < \epsilon$). While the choice of the summary statistics $S : \mathbb{R}^n \to \mathbb{R}^d$ is arbitrary, it is desirable to have $d \ll n$ while limiting the information loss. This is difficult. When the model at hand admits sufficient statistics, then these should be taken; in fact, the only statistics which uphold the convergence of $\pi_\epsilon(\theta \mid \mathbf{u})$ to $\pi(\theta \mid \mathbf{u})$ as $\epsilon \to 0$ are the sufficient statistics. These summary statistics $S$ are not to be confused with the $\Psi$ summaries in Section 1!

Remark 2.1.
When dealing with frequency data (see Section 2.1), it is possible to define a map $S : \mathbb{N}^t \to \mathbb{R}^d$ which reduces the dimension and lets us adopt the ABC procedure for continuous data. Consider, for instance, the case where the claim frequencies are Poisson distributed and the map $S$ corresponds to the empirical mean.

Most often, we will not be able to find sufficient statistics. Many research papers in the ABC literature have been dedicated to designing ad hoc summary statistics; we refer to the survey of Blum et al. [5]. The problem is that they always imply a loss of information, along with convergence toward the posterior distribution conditional on the summary statistics instead of the true posterior. We illustrate the use of summary statistics in Section 2.5, but do not use this technique in the other examples.

Bernton et al. [3] recommend the Wasserstein distance to measure the dissimilarity between two samples. The Wasserstein distance is deemed difficult to compute, but for real-valued i.i.d. observations it reduces to
$$ W_p(\mathbf{u}, \tilde{\mathbf{u}}) = \left( \frac{1}{n} \sum_{k=1}^{n} \left| u_{(k)} - \tilde{u}_{(k)} \right|^p \right)^{1/p}, \quad p \ge 1, $$
where $u_{(1)} \le \dots \le u_{(n)}$ and $\tilde{u}_{(1)} \le \dots \le \tilde{u}_{(n)}$ denote the order statistics of the observed and synthetic data respectively. The use of the order statistics as summary statistics is not new; it was investigated, for instance, in the work of Sousa et al. [30] and Fearnhead and Prangle [11]. Now that we have reviewed the use of ABC in the case of discrete and continuous data, we turn to the case of mixed data, which is of primary interest for the actuarial application at the center of this work.
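For one-dimensional samples of equal size this distance is straightforward to compute via order statistics. The small helper below is an illustrative sketch (not taken from the paper's package):

```python
import numpy as np

def wasserstein_1d(u, u_tilde, p=1):
    """p-Wasserstein distance between two equally sized 1-D samples,
    computed by matching order statistics."""
    u_sorted = np.sort(np.asarray(u, dtype=float))
    v_sorted = np.sort(np.asarray(u_tilde, dtype=float))
    return (np.mean(np.abs(u_sorted - v_sorted) ** p)) ** (1.0 / p)

# Example: distance between an observed and a synthetic claim-size sample.
rng = np.random.default_rng(1)
print(wasserstein_1d(rng.gamma(2.0, 3.0, size=50), rng.gamma(2.0, 3.0, size=50)))
```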
We return to the problem of fitting a model to aggregated insurance data. Recall that, for each time period, a random number $n \in \mathbb{N}$ of claims are filed. The claim frequencies form an i.i.d. sample from the p.m.f. $p_N(n \mid \theta_{\text{freq}})$. Given $n$, the associated claim sizes $\mathbf{u} = (u_1, \dots, u_n)$ have a joint p.d.f. denoted by $f_{\mathbf{U} \mid N}(\mathbf{u} \mid n, \theta_{\text{sev}})$. The distribution of the available information $x := \Psi(n, \mathbf{u})$ is parametrized by $\theta = (\theta_{\text{freq}}, \theta_{\text{sev}})$ and admits a point mass $p_X(0 \mid \theta)$ at 0. Zeros can occur if no claims are filed ($n = 0$), which happens with probability $p_N(0 \mid \theta_{\text{freq}})$, or because of censoring effects as in the non-proportional reinsurance treaty case, see Section 1. The continuous part of $x$'s distribution is characterized by the conditional p.d.f. $[1 - p_X(0 \mid \theta)]\, f_{X \mid X > 0}(x \mid \theta)$ for $x > 0$.

For a data history $\mathbf{x} = (x_1, \dots, x_t)$ of $t$ time periods, we separate the zeros from the non-zero data points, so
$$ \mathbf{x} = (\mathbf{x}^0, \mathbf{x}^+) = (\underbrace{0, \dots, 0}_{t_0 \text{ zeros}}, \underbrace{x^+_1, \dots, x^+_{t - t_0}}_{t - t_0 \text{ non-zeros}}). $$
The likelihood function may be written as
$$ p(\mathbf{x} \mid \theta) = p_X(0 \mid \theta)^{t_0}\, [1 - p_X(0 \mid \theta)]^{t - t_0} \prod_{s=1}^{t - t_0} f_{X \mid X > 0}(x^+_s \mid \theta) = p_X(0 \mid \theta)^{t_0}\, [1 - p_X(0 \mid \theta)]^{t - t_0}\, p(\mathbf{x}^+ \mid \theta). \qquad (4) $$
To evaluate the conditional p.d.f. $f_{X \mid X > 0}$ in (4), we must consider all possible values of $n$, which often leads to an infinite series without closed-form expression, as illustrated in Example 1.

Example 1. Consider the case where we only observe the aggregate claim sizes $x_s = \sum_{i=1}^{n_s} u_{s,i}$ for $s = 1, \dots, t$, i.e., $\Psi$ is the sum operator. If the claim sizes are i.i.d. and independent of the claim frequency, which is a common assumption in the actuarial science literature, the conditional p.d.f. of $X$ taking positive values is
$$ f_{X \mid X > 0}(x \mid \theta) = \frac{1}{1 - p_N(0 \mid \theta_{\text{freq}})} \sum_{n=1}^{\infty} f_U^{(\ast n)}(x \mid \theta_{\text{sev}})\, p_N(n \mid \theta_{\text{freq}}), \qquad (5) $$
where $f_U^{(\ast n)}(x \mid \theta_{\text{sev}})$ denotes the $n$-fold convolution of $f_U(x \mid \theta_{\text{sev}})$ with itself. A closed-form expression for (5) is available only in a few cases. For the remaining cases, quite some energy has been dedicated by actuarial scientists to finding convenient numerical approximations. Note that none of the aforementioned numerical routines would be suited to the repeated evaluations of the conditional p.d.f. required for Bayesian inference, or for maximum likelihood inference via some optimization algorithm. We begin our numerical illustration of the ABC method on some cases where a closed-form expression of (5) is available, as we will then be able to sample from the true posterior via an MCMC simulation scheme. Point estimates may also be compared to frequentist estimators such as the maximum likelihood or the method of moments estimators; the latter has been used in a similar situation in the work of Goffard et al. [15].

The lack of an analytical expression for the likelihood function justifies the use of a likelihood-free inference method such as ABC. The distribution of $x$ is of mixed type, which means we cannot directly apply Algorithm 2 as we would lose the convergence toward the standard posterior distribution. To address this issue, we require that the number of zeros in the synthetic sample, $\tilde{t}_0$, matches the number of zeros in the observed data, $t_0$, and we treat the non-zero data points as i.i.d. continuous data. So, in Algorithm 3 we retain synthetic samples that belong to the set
$$ B_{\epsilon, \mathbf{x}} = \left\{ \tilde{\mathbf{x}} \in \mathbb{R}^t \,;\, \mathbf{x}^0 = \tilde{\mathbf{x}}^0 \text{ and } \|\mathbf{x}^+ - \tilde{\mathbf{x}}^+\| < \epsilon \right\}. \qquad (6) $$

Algorithm 3
ABC acceptance–rejection sampling for mixed data.

  input: observations $\mathbf{x} = (x_1, \dots, x_t)$, threshold $\epsilon > 0$, norm $\|\cdot\|$
  for $k = 1 \to K$ do
    repeat
      generate $\theta_k \sim \pi(\theta)$
      generate $\tilde{\mathbf{x}}_k \sim p(\mathbf{x} \mid \theta_k)$
    until $\tilde{\mathbf{x}}_k \in B_{\epsilon, \mathbf{x}}$, then store $\theta_k$   ($B_{\epsilon, \mathbf{x}}$ defined by (6))
  end for
  return $\{\theta_1, \dots, \theta_K\}$, which are approximately $\pi(\theta \mid \mathbf{x})$ distributed

Algorithm 3 samples from the approximate posterior distribution
$$ \pi_\epsilon(\theta \mid \mathbf{x}) \propto \pi(\theta) \int_{\mathbb{R}^t} \mathbb{I}_{B_{\epsilon, \mathbf{x}}}(\tilde{\mathbf{x}})\, p(\tilde{\mathbf{x}} \mid \theta)\, \mathrm{d}\tilde{\mathbf{x}}, \qquad (7) $$
where $\mathbb{I}_{B_{\epsilon, \mathbf{x}}}(\tilde{\mathbf{x}})$ equals 1 if $\mathbf{x}^0 = \tilde{\mathbf{x}}^0$ and $\|\mathbf{x}^+ - \tilde{\mathbf{x}}^+\| < \epsilon$, and 0 otherwise. The following result shows the convergence of $\pi_\epsilon$ toward the standard posterior as $\epsilon$ approaches 0.
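In code, membership of the set $B_{\epsilon, \mathbf{x}}$ amounts to two checks: the synthetic sample must contain exactly as many zeros as the observed one, and the positive parts must be close in the chosen distance. The sketch below is illustrative and reuses the `wasserstein_1d` helper defined earlier rather than the paper's package.

```python
import numpy as np

def in_acceptance_region(x_obs, x_fake, eps, dist=wasserstein_1d):
    """Check whether x_fake belongs to B_{eps, x_obs}: same number of zeros,
    and positive parts within distance eps of each other."""
    x_obs, x_fake = np.asarray(x_obs), np.asarray(x_fake)
    if np.sum(x_obs == 0) != np.sum(x_fake == 0):
        return False
    return dist(x_obs[x_obs > 0], x_fake[x_fake > 0]) < eps
```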
Proposition 1. Suppose that $\sup_{(\tilde{\mathbf{x}}, \theta) \in B_{\epsilon, \mathbf{x}} \times \Theta} p(\tilde{\mathbf{x}} \mid \theta) < \infty$ for some $\epsilon > 0$. Then, for each $\theta \in \Theta$, we have $\pi_\epsilon(\theta \mid \mathbf{x}) \longrightarrow \pi(\theta \mid \mathbf{x})$ as $\epsilon \to 0$.

Proof. The modified posterior $\pi_\epsilon(\theta \mid \mathbf{x})$ is defined as
$$ \pi_\epsilon(\theta \mid \mathbf{x}) = \frac{\pi(\theta) \int_{\mathbb{R}^t} \mathbb{I}_{B_{\epsilon, \mathbf{x}}}(\tilde{\mathbf{x}})\, p(\tilde{\mathbf{x}} \mid \theta)\, \mathrm{d}\tilde{\mathbf{x}}}{\int_\Theta \pi(\theta) \int_{\mathbb{R}^t} \mathbb{I}_{B_{\epsilon, \mathbf{x}}}(\tilde{\mathbf{x}})\, p(\tilde{\mathbf{x}} \mid \theta)\, \mathrm{d}\tilde{\mathbf{x}}\, \mathrm{d}\theta} = \frac{\pi(\theta)\, p_\epsilon(\mathbf{x} \mid \theta)}{\int_\Theta \pi(\theta)\, p_\epsilon(\mathbf{x} \mid \theta)\, \mathrm{d}\theta}, \qquad (8) $$
where $p_\epsilon(\mathbf{x} \mid \theta)$ is an approximation of the likelihood,
$$ p_\epsilon(\mathbf{x} \mid \theta) = \frac{\int_{\mathbb{R}^t} \mathbb{I}_{B_{\epsilon, \mathbf{x}}}(\tilde{\mathbf{x}})\, p(\tilde{\mathbf{x}} \mid \theta)\, \mathrm{d}\tilde{\mathbf{x}}}{\int_{\mathbb{R}^t} \mathbb{I}_{B_{\epsilon, \mathbf{x}}}(\tilde{\mathbf{x}})\, \mathrm{d}\tilde{\mathbf{x}}}. \qquad (9) $$
Since the data is i.i.d., we rearrange the vectors $\mathbf{x}$ and $\tilde{\mathbf{x}}$ to set aside the zeros in the data, so $\mathbf{x} = (\mathbf{x}^0, \mathbf{x}^+)$ and $\tilde{\mathbf{x}} = (\tilde{\mathbf{x}}^0, \tilde{\mathbf{x}}^+)$, respectively. This allows us to write the indicator function in (9) as the product
$$ \mathbb{I}_{B_{\epsilon, \mathbf{x}}}(\tilde{\mathbf{x}}) = \mathbb{I}_{\{\mathbf{x}^0 = \tilde{\mathbf{x}}^0\}} \cdot \mathbb{I}_{\{\|\mathbf{x}^+ - \tilde{\mathbf{x}}^+\| \le \epsilon\}}. \qquad (10) $$
Inserting (10) into the quasi-likelihood (9) leads to
$$ p_\epsilon(\mathbf{x} \mid \theta) = p_X(0 \mid \theta)^{t_0}\, [1 - p_X(0 \mid \theta)]^{t - t_0}\, \frac{\int_{\mathbb{R}^{t - t_0}} \mathbb{I}_{\{\|\mathbf{x}^+ - \tilde{\mathbf{x}}^+\| \le \epsilon\}}\, p(\tilde{\mathbf{x}}^+ \mid \theta)\, \mathrm{d}\tilde{\mathbf{x}}^+}{\int_{\mathbb{R}^{t - t_0}} \mathbb{I}_{\{\|\mathbf{x}^+ - \tilde{\mathbf{x}}^+\| \le \epsilon\}}\, \mathrm{d}\tilde{\mathbf{x}}^+} \;\xrightarrow[\epsilon \to 0]{}\; p_X(0 \mid \theta)^{t_0}\, [1 - p_X(0 \mid \theta)]^{t - t_0}\, p(\mathbf{x}^+ \mid \theta) = p(\mathbf{x} \mid \theta), \qquad (11) $$
where the limit in (11) follows from applying Proposition 1 of Rubio and Johansen [26], see also Bernton et al. [3, Proposition 2]. Taking the limit as $\epsilon$ tends to 0 in (8) yields the announced result.

Following up on the discussion in Section 2.2, we take the Wasserstein distance to evaluate the dissimilarities between the non-zero portions of the fake and observed data. When comparing the non-zero data points, a small $\epsilon$ leads to an accurate but potentially slow ABC algorithm. The combination of a small $\epsilon$ and a prior more diffuse than the posterior distribution makes ABC rejection sampling inefficient, as acceptance almost never occurs. We therefore move from the acceptance–rejection simulation scheme to a Sequential Monte Carlo (SMC) scheme inspired by the work of Del Moral et al. [9].

Sequential Monte Carlo ABC (ABC-SMC) is an ABC approach where a sequence of distributions is constructed by gradually decreasing the tolerance $\epsilon$ through a sequence $(\epsilon_g)_{g \ge 1}$. The ABC-SMC algorithm starts by sampling a finite number of parameter sets (particles) from the prior distribution, and each intermediate distribution (called a generation) is obtained as a weighted sample approximated via a multivariate kernel density estimator (KDE).

We start by setting the number of generations $G$ and the number of particles $K$. For the first generation ($g = 1$), the tolerance level is set to $\epsilon_1 = \infty$. Particles are proposed from the prior distribution, $\theta_k \sim \pi(\theta)$, and retained if the synthetic data $\tilde{\mathbf{x}}_k \sim p(\mathbf{x} \mid \theta_k)$ satisfies $\tilde{\mathbf{x}}_k \in B_{\infty, \mathbf{x}}$. This goes on until $K$ particles are selected. Note that the condition $\tilde{\mathbf{x}}_k \in B_{\infty, \mathbf{x}}$ simply means that the number of zeros in the fake data matches the number of zeros in the observed data. A first approximation of the posterior distribution follows from fitting a multivariate KDE $K_h$ to the first generation of particles,
$$ \hat\pi_{\epsilon_1}(\theta \mid \mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} K_h\!\left( \|\theta - \theta_k\| \right), $$
where $h$ denotes the bandwidth.
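The role of the KDE is to turn the weighted particle cloud into a proposal distribution for the next generation. A minimal way to sample from such a weighted Gaussian-kernel KDE, following the covariance choice of Beaumont et al. [2] recalled below, is sketched here (illustrative, not the paper's code):

```python
import numpy as np

def sample_from_weighted_kde(particles, weights, size, rng):
    """Draw new parameters from a Gaussian-kernel KDE fitted to weighted particles.
    The kernel covariance is twice the weighted sample covariance."""
    particles = np.asarray(particles, dtype=float)
    if particles.ndim == 1:                       # allow scalar parameters
        particles = particles[:, None]
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    cov = 2.0 * np.cov(particles.T, aweights=weights)
    idx = rng.choice(len(particles), size=size, p=weights)   # resample a particle
    noise = rng.multivariate_normal(np.zeros(particles.shape[1]),
                                    np.atleast_2d(cov), size=size)
    return particles[idx] + noise                 # jitter it with the Gaussian kernel
```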
For a given generation $g > 1$, we hold an approximation $\hat\pi_{\epsilon_{g-1}}(\theta \mid \mathbf{x})$ of the posterior distribution based on the $(g-1)$th generation of particles. New particles $\theta^g_k$ are proposed by sampling repeatedly from $\hat\pi_{\epsilon_{g-1}}(\theta \mid \mathbf{x})$ until the synthetic data $\tilde{\mathbf{x}}_k \sim p(\mathbf{x} \mid \theta^g_k)$ satisfies $\tilde{\mathbf{x}}_k \in B_{\infty, \mathbf{x}}$. This goes on until $K$ particles are selected; the synthetic data is also kept. An acceptance threshold $\epsilon_g$ is then defined as the empirical quantile of order $\alpha \in (0, 1)$ of the distances $\|\mathbf{x}^+ - \tilde{\mathbf{x}}^+_k\|$, $k = 1, \dots, K$. Each particle is assigned a weight
$$ w^g_k \propto \frac{\pi(\theta^g_k)}{\hat\pi_{\epsilon_{g-1}}(\theta^g_k \mid \mathbf{x})}\, \mathbb{I}_{B_{\epsilon_g, \mathbf{x}}}(\tilde{\mathbf{x}}_k), \quad k = 1, \dots, K, $$
which is used to update the posterior approximation to
$$ \hat\pi_{\epsilon_g}(\theta \mid \mathbf{x}) = \sum_{k=1}^{K} w^g_k\, K_h\!\left( \|\theta - \theta^g_k\| \right). $$
The pseudocode of the algorithm is provided in Appendix A, see Algorithm 5. A common choice for the kernel is the multivariate Gaussian kernel with covariance matrix set to twice the empirical covariance matrix assessed over the cloud of weighted particles $\{(\theta^g_k, w^g_k)\}_{k = 1, \dots, K}$, see Beaumont et al. [2].

The behavior of the algorithm can be investigated by calculating the Effective Sample Size (ESS), defined in Del Moral et al. [9] as
$$ \text{ESS}_g = \left[ \sum_{k=1}^{K} (w^g_k)^2 \right]^{-1}, \quad g = 1, \dots, G. $$
The effective sample size ranges from 1 to $K$ and indicates whether the algorithm is efficient in sampling from the targeted distribution; an ESS falling below a certain threshold, typically $K/2$, indicates that the particle cloud is degenerating.
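Computing the ESS from a vector of normalised particle weights is a one-liner; the small helper below only illustrates the formula above.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS of a weighted particle cloud; weights are normalised to sum to one."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size(np.ones(1000)))   # a uniform cloud has ESS = K = 1000
```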
2.5 Illustrations on total claim amounts data

Let the claim frequency be geometrically distributed,
$$ n_1, \dots, n_t \overset{\text{i.i.d.}}{\sim} \text{Geom}(p), $$
with p.m.f. given by $p_N(n; p) = (1 - p)\, p^n$, $n \in \mathbb{N}$. Assume that the claim amounts are exponentially distributed,
$$ u_{s,1}, \dots, u_{s,n_s} \overset{\text{i.i.d.}}{\sim} \text{Exp}(\delta = 5), \quad s = 1, \dots, t, $$
with p.d.f. $f(x; \delta) = (1/\delta)\, e^{-x/\delta}$, $x > 0$, irrespective of the claim frequency. The available data is the aggregated claim sizes
$$ x_s = \sum_{k=1}^{n_s} u_{s,k}, \quad s = 1, \dots, t, $$
and we assume that $t = 100$ data points are available to conduct the inference. The likelihood function of the data is given by
$$ p(\mathbf{x} \mid \theta) = (1 - p)^{t} \left( \frac{p}{\delta} \right)^{t - t_0} \exp\!\left[ - \frac{1 - p}{\delta} \sum_{s=1}^{t - t_0} x^+_s \right], $$
and allows us to sample from the standard posterior distribution via an MCMC scheme. This compound geometric–exponential model admits $t_0$ (the number of zeros in the data) and $\sum_{s=1}^{t - t_0} x^+_s$ (the sum of the positive data points) as sufficient statistics, which in turn allows us to sample from an ABC posterior based on sufficient summary statistics. We place uniform priors over the parameters of the $\text{Geom}(p)$–$\text{Exp}(\delta)$ model we want to fit. We set the number of generations to $G = 10$, the number of particles to $K = 1000$, and fix the order $\alpha$ of the quantile used to decrease the tolerance. Figure 1 compares the prior, the two ABC posteriors (with and without sufficient summary statistics) and the true posterior, the latter being sampled by MCMC with the PyMC3
Python library, see Salvatier et al. [27].
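To make the set-up concrete, the following sketch simulates $t = 100$ aggregated observations from a compound geometric–exponential model and computes the two sufficient statistics; the value `p_true = 0.8` is an arbitrary placeholder, not necessarily the value used in the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(42)
t, delta = 100, 5.0
p_true = 0.8        # placeholder success probability, not necessarily the paper's value

# One summary per period: x_s = sum of n_s ~ Geom(p) many Exp(delta) claim sizes.
n = rng.geometric(1.0 - p_true, size=t) - 1      # numpy's geometric counts trials, so shift
x = np.array([rng.exponential(delta, size=k).sum() for k in n])

# Sufficient statistics of the compound geometric-exponential model.
t0, pos_sum = np.sum(x == 0), np.sum(x[x > 0])
print(t0, pos_sum)
```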
Figure 1: Fitting a Geom(p)–Exp(δ) model to simulated data (δ = 5). The ABC posterior, the ABC posterior based on sufficient summary statistics, and the true posterior (obtained by MCMC) coincide very well, and are considerably narrower than the prior.

When it comes to modeling claim data, one has plenty of options for both the claim frequency and the claim sizes, see for instance the book of Klugman et al. [19, Chapters V & VI]. A decision must be made to find the most suitable model among a set of candidates $\{1, \dots, M\}$. The Bayesian approach to model selection and hypothesis testing consists in defining a categorical random variable $m$ with state space $\{1, \dots, M\}$ and a priori distribution $\pi(m)$. The a posteriori model evidence is then given by
$$ \pi(m \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid m)\, \pi(m)}{\sum_{\tilde{m} = 1}^{M} p(\mathbf{x} \mid \tilde{m})\, \pi(\tilde{m})}, \quad m \in \{1, \dots, M\}. $$
One often compares two models, say 1 and 2, by computing the Bayes factor $B = \pi(2 \mid \mathbf{x}) / \pi(1 \mid \mathbf{x})$. For an overview of Bayesian model selection and Bayes factors, we refer the reader to Kass and Raftery [18]. The marginal likelihood of the data according to a given model $m \in \{1, \dots, M\}$ is defined by
$$ p(\mathbf{x} \mid m) = \int_{\Theta_m} p(\mathbf{x} \mid m, \theta)\, \pi(\theta \mid m)\, \mathrm{d}\theta, \quad m \in \{1, \dots, M\}, \qquad (12) $$
where $\Theta_m$ denotes the parameter space of model $m$. The evaluation of (12) is challenging from a computational point of view, even when the likelihood is available. The acceptance–rejection implementation of ABC proposed in Grelaud et al. [16] reduces to adding a layer to the standard Algorithm 3 by first drawing a model from $\pi(m)$. The posterior probability of a model is then proportional to the number of times this model was selected, see Algorithm 4.

Algorithm 4
Acceptance–rejection sampling to compute the model evidence.

  for $k = 1 \to K$ do
    repeat
      generate $m_k \sim \pi(m)$
      generate $\theta_k \sim \pi(\theta \mid m_k)$
      generate $\tilde{\mathbf{x}}_k \sim p(\mathbf{x} \mid m_k, \theta_k)$
    until $\tilde{\mathbf{x}}_k \in B_{\epsilon, \mathbf{x}}$, then store $(m_k, \theta_k)$
  end for

The spirit of Algorithm 4 relates to the Monte Carlo approach to the computation of a model's marginal likelihood, see for instance McCulloch and Rossi [20]. Namely, the model evidence is evaluated by
$$ p(\mathbf{x} \mid m) \approx \frac{1}{K} \sum_{k=1}^{K} p(\mathbf{x} \mid m, \theta_k), \quad \theta_1, \dots, \theta_K \sim \pi(\theta \mid m). $$
This procedure can be inefficient, as most of the $\theta_k$ have small likelihoods when the posterior is more concentrated than the prior distribution. Importance sampling strategies have been proposed to address this issue. The sequential Monte Carlo idea used in Algorithm 5 has been adapted in the works of Toni and Stumpf [31] and Prangle et al. [23] to improve the sampling efficiency. Our implementation is described hereafter.
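Under Algorithm 4, the model evidence estimate is simply the acceptance frequency of each model. A toy sketch of this counting step is given below; the `(sample_prior, simulate_data)` pairs are hypothetical placeholders, and the acceptance test reuses the `in_acceptance_region` helper sketched in Section 2.3.

```python
import numpy as np

def abc_model_evidence(x_obs, models, eps, K=1000, rng=np.random.default_rng()):
    """models: list of (sample_prior, simulate_data) pairs, one per candidate.
    Returns the proportion of accepted particles coming from each model."""
    counts = np.zeros(len(models))
    accepted = 0
    while accepted < K:
        m = rng.integers(len(models))                  # m_k ~ uniform prior pi(m)
        sample_prior, simulate_data = models[m]
        theta = sample_prior(rng)                      # theta_k ~ pi(theta | m_k)
        x_fake = simulate_data(theta, rng)             # fake data from model m_k
        if in_acceptance_region(x_obs, x_fake, eps):   # keep (m_k, theta_k)
            counts[m] += 1
            accepted += 1
    return counts / K
```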
We fix the number of generations $G$ and the number of particles $K$. When several models are competing, a particle is a combination of a model and its parameters.

For the first generation ($g = 1$), for each particle $k = 1, \dots, K$, a model $m_k$ is drawn from $\pi(m)$ with parameter $\theta_k$ sampled from the prior distribution $\pi(\theta \mid m_k)$, until the synthetic data $\tilde{\mathbf{x}}_k \sim p(\mathbf{x} \mid m_k, \theta_k)$ satisfies $\tilde{\mathbf{x}}_k \in B_{\epsilon_1, \mathbf{x}}$, where $\epsilon_1 = \infty$. A first approximation of the posterior model probability is given by
$$ \hat\pi_{\epsilon_1}(m \mid \mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{I}_{\{m_k = m\}}. $$
A multivariate kernel density estimator $K_h$ with bandwidth $h$ is then fitted to the parameter values associated with each model,
$$ \hat\pi_{\epsilon_1}(\theta \mid m, \mathbf{x}) = \frac{1}{K\, \hat\pi_{\epsilon_1}(m \mid \mathbf{x})} \sum_{k=1}^{K} K_h\!\left( \|\theta - \theta_k\| \right) \mathbb{I}_{\{m_k = m\}}, \quad m \in \{1, \dots, M\}. $$
At a given generation $g \in \{2, \dots, G\}$ and for each model $m \in \{1, \dots, M\}$, we hold an approximation of the posterior model probability $\hat\pi_{\epsilon_{g-1}}(m \mid \mathbf{x})$ and of the posterior distribution of the parameters $\hat\pi_{\epsilon_{g-1}}(\theta \mid m, \mathbf{x})$. New particles $(m^g_k, \theta^g_k)$ are proposed by sampling from $\pi(m)$ and $\hat\pi_{\epsilon_{g-1}}(\theta \mid m^g_k, \mathbf{x})$ until the synthetic data $\tilde{\mathbf{x}}_k \sim p(\mathbf{x} \mid m^g_k, \theta^g_k)$ satisfies $\tilde{\mathbf{x}}_k \in B_{\epsilon_{g-1}, \mathbf{x}}$; sampling is repeated until $K$ particles are selected. The acceptance threshold $\epsilon_g$ becomes the empirical quantile of order $\alpha \in (0, 1)$ of the distances $\|\mathbf{x}^+ - \tilde{\mathbf{x}}^+_k\|$, $k = 1, \dots, K$. Each particle is assigned a weight
$$ w^g_k \propto \frac{\pi(\theta^g_k \mid m^g_k)}{\hat\pi_{\epsilon_{g-1}}(\theta^g_k \mid m^g_k, \mathbf{x})}\, \mathbb{I}_{B_{\epsilon_g, \mathbf{x}}}(\tilde{\mathbf{x}}_k), \quad k = 1, \dots, K. $$
The model probability is then updated as
$$ \hat\pi_{\epsilon_g}(m \mid \mathbf{x}) = \sum_{k=1}^{K} w^g_k\, \mathbb{I}_{\{m^g_k = m\}}, $$
along with the posterior distribution of the parameters associated with each model,
$$ \hat\pi_{\epsilon_g}(\theta \mid m, \mathbf{x}) = \sum_{k=1}^{K} \frac{w^g_k}{\hat\pi_{\epsilon_g}(m \mid \mathbf{x})}\, K_h\!\left( \|\theta - \theta^g_k\| \right) \mathbb{I}_{\{m^g_k = m\}}, \quad m = 1, \dots, M. $$
The algorithm is summarized in Algorithm 6 of Appendix A.

Our ABC implementation for evaluating posterior model probabilities is tested on a simple example where we aim at fitting individual claim sizes generated from a lognormal distribution,
$$ u_1, \dots, u_n \overset{\text{i.i.d.}}{\sim} \text{LogNorm}(\mu = 0, \sigma = 1), $$
with associated p.d.f.
$$ f(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\!\left[ - \frac{(\ln x - \mu)^2}{2 \sigma^2} \right], \quad x > 0. $$
The lognormal model is compared to a gamma model
$\text{Gamma}(r, m)$, with p.d.f.
$$ f(x; r, m) = \frac{e^{-x/m}\, x^{r - 1}}{m^r\, \Gamma(r)}, \quad x > 0, $$
and to a Weibull model $\text{Weib}(k, \beta)$, with p.d.f.
$$ f(x; k, \beta) = \frac{k}{\beta} \left( \frac{x}{\beta} \right)^{k - 1} \exp\!\left[ - \left( \frac{x}{\beta} \right)^k \right], \quad x > 0. $$
Uniform priors are set over the parameters of all three models: over $\mu$ and $\sigma$ for the lognormal model, over $r$ and $m$ for the gamma model, and over $k$ and $\beta$ for the Weibull model. The likelihood function of the data $\mathbf{u} = (u_1, \dots, u_n)$ may be computed for these loss models, and the model probability can then be estimated with the Sequential Monte Carlo sampler of the PyMC3 library. The computation of model probabilities via ABC is more demanding than parameter estimation alone: the number of iterations must be larger to obtain an accurate model probability estimate. We therefore set the number of generations to $G = 25$. The model evidences of all three models are reported in Table 1 for sample sizes ranging from 25 to 200.

              PyMC3                        ABC
sample size   Gamma   LogNorm   Weib       Gamma   LogNorm   Weib
25            0.42    0.18      0.40       0.51    0.15      0.34
50            0.25    0.64      0.11       0.31    0.51      0.18
75            0.04    0.95      0.01       0.15    0.79      0.07
100           0.01    0.99      0.00       0.07    0.91      0.02
150           0.00    1.00      0.00       0.01    0.99      0.00
200           0.00    1.00      0.00       0.00    1.00      0.00

Table 1: Model evidence for individual claim sizes simulated from a LogNorm(µ = 0, σ = 1) model. The model evidences computed via ABC fare well compared to the model evidences computed by relying on the likelihood function.

Further approximate Bayesian model evidence computations are proposed in Section 4 and Section 5, where the data at hand is aggregated.
This section aims at studying the finite-sample behavior of our ABC implementation on two case studies based on simulated data. In Section 4.1, we assume that the claim sizes are independent of the claim frequency and that the insurer only has access to the right-truncated aggregated sums. In Section 4.2, we consider a model in which the average claim size depends on the number of claims and the insurer has access to the total claim amount of each time period.

Our goal is to check whether our ABC sampling algorithm manages to return a posterior sample that concentrates around the true value when the model is well specified. Another question is how the ABC posterior behaves when the model is misspecified; in that case, the ABC posterior samples are compared to the maximum likelihood estimates of the parameters. Finally, we assume that the claim frequency data is available in addition to the aggregated data. The number of claims is then input directly into our ABC implementation to specify how many claim sizes should be generated for each time period. This reduces the computing time and allows us to drop the parametric assumption on the claim frequency distribution and focus on the claim amounts distribution.

In both examples, the number of generations for ABC is set to $G = 7$, with $K = 1000$ particles per generation, when only one model is considered and the claim frequency is not available. Knowing the number of claims leads to a reduction in computation time, which in turn allows us to bring the number of generations up to $G = 10$.

Let the claim frequency be negative binomial distributed,
$$ n_1, \dots, n_t \overset{\text{i.i.d.}}{\sim} \text{NegBin}(\alpha = 4, p), $$
with p.m.f.
$$ p_N(n; \alpha, p) = \binom{\alpha + n - 1}{n}\, p^\alpha (1 - p)^n, \quad n \ge 0, $$
while the claim sizes are Weibull distributed,
$$ u_{s,1}, \dots, u_{s,n_s} \overset{\text{i.i.d.}}{\sim} \text{Weib}(k, \beta = 1), \quad s = 1, \dots, t. $$
The available data are the aggregated claim amounts in excess of a threshold $c$, given by
$$ x_s = \left( \sum_{i=1}^{n_s} u_{s,i} - c \right)_+, \quad s = 1, \dots, t. \qquad (13) $$
This corresponds to the data available to a reinsurance company within the frame of a global non-proportional treaty over a non-life insurance portfolio. The cases $t = 50$ and $t = 250$ are considered. Uniform prior distributions are set over the four parameters $\alpha$, $p$, $k$ and $\beta$. (14)
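A sketch of how such right-truncated summaries can be simulated is given below; the values of `p_true`, `k_true` and the priority `c` are illustrative placeholders rather than the study's settings.

```python
import numpy as np

rng = np.random.default_rng(7)
t, alpha, beta = 250, 4.0, 1.0
p_true, k_true, c = 0.5, 1.5, 1.0   # illustrative placeholders, not the study's values

# Per-period payout of eq. (13): NegBin claim counts, Weibull claim sizes,
# aggregated and truncated at the priority c.
# numpy's negative_binomial(alpha, p) counts failures before the alpha-th success,
# which matches the p.m.f. given in the text.
n = rng.negative_binomial(alpha, p_true, size=t)
x = np.array([max(beta * rng.weibull(k_true, size=n_s).sum() - c, 0.0) for n_s in n])

print(np.mean(x == 0))   # fraction of periods with no reinsurance payout
```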
Figure 2 displays the ABC posterior samples when only the aggregated data (13) are used.

Figure 2: ABC posterior samples of a NegBin(α, p)–Weib(k, β) model fitted to data simulated from a NegBin(α = 4, p)–Weib(k, β = 1) model. The posteriors are based on 50 and 250 observations of the $x_s$ summaries as in (13).

The $p$ and $k$ posteriors are quite informative, whereas the scale parameters $\alpha$ and $\beta$ are skewed in opposite directions and seem to compensate for each other. We then include the claim frequencies $n_s$ in the input data of our ABC algorithm to see whether this helps in getting posterior samples closer to the target. Figure 3 displays the ABC posterior samples of the claim sizes model when the claim frequency data is available in addition to the summaries (13).

Figure 3: ABC posterior samples of a Weib(k, β) model fitted to data simulated from a NegBin(α = 4, p)–Weib(k, β = 1) model. The data includes each summary $x_s$ as in (13) and each frequency $n_s$. The posterior with 250 observations is a slight improvement over the one with 50 observations.

The ABC posteriors are very strongly concentrated around the true values of $k$ and $\beta = 1$ compared to those of Figure 2.

We now turn to the case where the model is misspecified. The same data simulated from the NegBin(α = 4, p)–Weib(k, β = 1) model is used to fit a NegBin(α, p)–Gamma(r, m) model. Uniform prior distributions are placed over the four parameters $\alpha$, $p$, $r$ and $m$. (15) The true values for the gamma distribution parameters are replaced by the maximum likelihood estimates based on a large sample of Weibull distributed individual losses. Figure 4 displays the ABC posterior samples when only the aggregated data (13) are used.
Figure 4: ABC posterior samples of a NegBin(α, p)–Gamma(r, m) model fitted to data simulated from a NegBin(α = 4, p)–Weib(k, β = 1) model. The data only includes the summaries $x_s$ as in (13). The target values are the true values for $\alpha$ and $p$, and the MLE estimates for $r$ and $m$ given the claim sizes.

The ABC posterior distributions are informative regarding $p$, $r$ and $m$; however, the algorithm does not improve significantly on the prior assumption over $\alpha$. Figure 5 displays the ABC posterior samples for the parameters of the gamma distribution when the claim frequency data is available in addition to the summaries (13).

Figure 5: ABC posterior samples of a Gamma(r, m) model fitted to data simulated from a NegBin(α = 4, p)–Weib(k, β = 1) model. The data includes each summary $x_s$ as in (13) and each frequency $n_s$.

The posterior sample for $m$ does not seem to center around the maximum likelihood estimate. Note that the situation improves greatly when considering a larger sample, of size 500 say. Also note that, by fitting a gamma model to the individual losses, the mean a posteriori of $m$ is around 40, which may explain why our ABC posterior somewhat misses the target.

To perform model selection, we specify to our ABC algorithm the Weibull and the gamma distributions as competing models for the claim sizes, and we set uniform priors as in (14) and (15) over the parameters. The model evidences computed via ABC are reported in Table 2.
                 Frequency model
Sample size      Negative binomial     Observed frequencies
50               0.51                  0.88
250              0.44                  1.00

Table 2: Model evidence in favor of the Weib(k, β) model when compared against a Gamma(r, m) model, for data simulated from a NegBin(α = 4, p)–Weib(k, β = 1) model. The values should increase to 1 as the sample size increases.

When only the summaries $x_s$ are available and the claim frequency is modeled by a negative binomial distribution, ABC cannot decide between the Weibull and the gamma distributions. When the claim counts $n_s$ are also available, ABC greatly favors the Weibull model for the claim sizes.

4.2 Dependence between the claim frequency and severity

Let the claim frequency be Poisson distributed,
$$ n_1, \dots, n_t \overset{\text{i.i.d.}}{\sim} \text{Poisson}(\lambda = 4), $$
with p.m.f.
$$ p_N(k; \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k \ge 0. $$
The claim sizes are assumed to be exponentially distributed, with a scale parameter depending on the observed claim frequency:
$$ u_{s,1}, \dots, u_{s,n_s} \mid n_s \overset{\text{i.i.d.}}{\sim} \text{Exp}(\mu = \beta e^{\delta n_s}), \quad s = 1, \dots, t. $$
We denote this $\mathbf{u}_s \sim \text{DepExp}(n_s; \beta, \delta)$, and take $\beta = 2$ and $\delta = 0.2$.
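The frequency-dependent severities are straightforward to simulate. The sketch below (illustrative, not the paper's code) draws one period's worth of data under this DepExp model.

```python
import numpy as np

rng = np.random.default_rng(11)
lam, beta, delta = 4.0, 2.0, 0.2

def simulate_dep_exp_period(rng):
    """One time period of the Poisson-DepExp model: the exponential scale of the
    claim sizes grows with the observed number of claims."""
    n_s = rng.poisson(lam)
    scale = beta * np.exp(delta * n_s)          # mu = beta * exp(delta * n_s)
    u_s = rng.exponential(scale, size=n_s)      # claim sizes given n_s
    return n_s, u_s.sum()                       # frequency and aggregated amount x_s

t = 250
data = [simulate_dep_exp_period(rng) for _ in range(t)]
n_obs, x_obs = map(np.array, zip(*data))
```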
The resulting conditional p.d.f. is
$$ f_U(x \mid n; \beta, \delta) = \frac{1}{\beta e^{\delta n}} \exp\!\left( - \frac{x}{\beta e^{\delta n}} \right), \quad x > 0. $$
This dependence structure relates to the insurance ratemaking practice, where premiums are computed using the average claim frequency and severity predicted by generalized linear models (GLMs). In the classical setting, the claim frequency is assumed to be Poisson distributed and the claim sizes are gamma distributed. The GLMs are then fitted independently for the claim frequency and the claim severity; we refer to Renshaw [25]. Empirical studies, like the one conducted in Frees et al. [12], have shown how the claim sizes may vary with the claim frequency. A standard practice is then to include the predicted claim frequency as a covariate within the claim sizes model, see for instance Shi et al. [28]. This amounts to bumping the expectation of the severity by a factor $e^{\delta n_s}$. Our case study is inspired by Garrido et al. [13, Example 3.1]. The available data is the aggregated claim sizes
$$ x_s = \sum_{k=1}^{n_s} u_{s,k}, \quad s = 1, \dots, t. \qquad (16) $$
We consider data histories of length $t = 50$ and $250$. Uniform prior distributions are set over the model parameters $\lambda$, $\beta$ and $\delta$. Figure 6 displays the posterior samples of $\lambda$, the parameter of the Poisson distribution, $\beta$, the scale parameter of the exponential distribution, and $\delta$, the frequency/severity dependence parameter.

Figure 6: ABC posterior samples of a Poisson(λ)–DepExp(n; β, δ) model fitted to data simulated from a Poisson(λ = 4)–DepExp(n; β = 2, δ = 0.2) model. The data only includes the summaries $x_s$ as in (16).

The algorithm does a tremendous job on this example, even without including the claim count information of each time period. Figure 7 displays the ABC posterior samples associated with the claim sizes distribution DepExp(n; β, δ) when the frequency information is included in addition to the summaries (16).

Figure 7: ABC posterior samples of a DepExp(n; β, δ) model fitted to data simulated from a Poisson(λ = 4)–DepExp(n; β = 2, δ = 0.2) model. The data includes each summary $x_s$ as in (16) and each frequency $n_s$.

As already noted, the inclusion of the claim frequency information improves the ABC posterior samples.
We consider an open source insurance data set named ausautoBI8999, consisting of 22,036 settled personal injury insurance claims in Australia; the first five observations are displayed in Table 3. The data is accessible from the R package CASdatasets, see Dutang and Charpentier [10]. The data is then aggregated monthly by reporting the number of claims along with the sum of all the compensations associated with each month, see Table 4. Descriptive statistics for the claim sizes, the claim frequencies and the aggregated claim sizes are reported in Table 5.

Date          Month    Claim Severity
1993-10-01    52       87.75
1994-02-01    56       353.62
1994-02-01    56       688.83
1994-05-01    59       172.80
1994-09-01    63       43.29

Table 3: ausautoBI8999 personal injury claim data.
Month    Claim Frequency    Total Claim Severity
49       149                1.55e+06
50       188                3.21e+06
51       196                4.81e+06
52       203                4.22e+06
53       226                5.27e+06

Table 4: Monthly aggregated data.
Statistics    Claim Severity    Claim Frequency    Total Claim Severity
Count         2.20e+04          6.90e+01           6.90e+01
Mean          3.84e+04          3.19e+02           1.23e+07
Std           9.10e+04          1.09e+02           5.22e+06
Min           9.96e+00          9.40e+01           1.55e+06
25%           6.30e+03          2.31e+02           8.21e+06
50%           1.39e+04          3.12e+02           1.20e+07
75%           3.51e+04          3.81e+02           1.55e+07
Max           4.49e+06          6.06e+02           2.63e+07

Table 5: Descriptive statistics of the claim data.
We are going to use ABC to fit and compare loss models using only the monthly aggregated data in Table 4. We would like to know whether the results differ from fitting the same loss models using the individual claim sizes data in Table 3. We start by studying the individual loss distribution. We fit a gamma, a lognormal and a Weibull model to the data shown in Table 3 using maximum likelihood estimation. The parameter estimates are given in Table 6 and will serve as benchmarks for our ABC posterior samples.

Table 6: Maximum likelihood estimates and BIC values of the gamma (parameters $r$, $m$), Weibull ($k$, $\beta$) and lognormal ($\mu$, $\sigma$) distributions fitted to the individual claim sizes data.

The lognormal distribution seems to provide the best fit when looking at the values of the Bayesian Information Criterion (BIC). This result is visually confirmed by the quantile–quantile plots displayed in Figure 8.

Figure 8: Quantile–quantile plots associated with the gamma, Weibull and lognormal models fitted to the individual claim sizes data.
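The maximum likelihood fits and BIC values behind Table 6 can be reproduced along the following lines with scipy. This sketch assumes the individual claim severities are available in a NumPy array `severities` and fixes the location parameters at zero, which may differ from the exact fitting choices made in the paper.

```python
import numpy as np
from scipy import stats

def fit_and_bic(severities):
    """Fit gamma, Weibull and lognormal models by MLE (location fixed at 0)
    and return their parameters and BIC values."""
    candidates = {"gamma": stats.gamma, "weibull": stats.weibull_min, "lognorm": stats.lognorm}
    n = len(severities)
    results = {}
    for name, dist in candidates.items():
        params = dist.fit(severities, floc=0)                     # MLE
        loglik = np.sum(dist.logpdf(severities, *params))
        k = len(params) - 1                                       # free parameters (loc fixed)
        results[name] = {"params": params, "BIC": k * np.log(n) - 2 * loglik}
    return results

# Example with synthetic severities; replace by the ausautoBI8999 claim sizes.
severities = np.random.default_rng(3).lognormal(mean=10, sigma=1.5, size=1000)
for name, res in fit_and_bic(severities).items():
    print(name, round(res["BIC"], 1))
```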
We then investigate the stationarity of the individual loss distribution by fitting the three loss models to the data associated with each time period separately. Figures 9 to 11 display the parameters of the gamma, Weibull and lognormal distributions respectively, depending on the time period considered.

Figure 9: Parameters of the gamma model.
Figure 10: Parameters of the Weibull model.
Figure 11: Parameters of the lognormal model.

The parameters of the Weibull and gamma distributions exhibit a high variability, see Figures 9 and 10, while the parameters of the lognormal distribution are more stable, see Figure 11. The model evidences, displayed in Figure 12, are computed using the Schwarz criterion, which approximates the Bayes factor using the maximum likelihood estimates and the BIC.

Figure 12: Model evidence for the gamma, lognormal and Weibull models.

The model probabilities mostly favor the lognormal model.
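Approximating the model probabilities from the per-period BIC values (the Schwarz criterion mentioned above) can be done as in the following sketch; the exponentiation of $-\text{BIC}/2$ stands in for the marginal likelihood of each model under equal prior model weights.

```python
import numpy as np

def bic_model_probabilities(bics):
    """Approximate posterior model probabilities from BIC values under equal
    prior model weights: pi(m | x) is proportional to exp(-BIC_m / 2)."""
    bics = np.asarray(bics, dtype=float)
    weights = np.exp(-(bics - bics.min()) / 2.0)   # shift for numerical stability
    return weights / weights.sum()

# Example: three hypothetical BIC values for the gamma, Weibull and lognormal fits.
print(bic_model_probabilities([2051.3, 2049.8, 2042.1]))
```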
We use ABC to fit a NegBin(α, p)–LogNorm(µ, σ) model to the total claim severities data in Table 4, which consists of $t = 69$ summaries of the form
$$ x_s = \sum_{k=1}^{n_s} u_{s,k}, \quad s = 1, \dots, t. \qquad (17) $$
We consider two sets of prior assumptions over the parameters:

1. uniform priors on $\alpha$, $p$ and $\sigma$, together with a uniform prior on $\mu$ whose support includes negative values;
2. the same uniform priors on $\alpha$, $p$ and $\sigma$, with a uniform prior on $\mu$ restricted to positive values.

We opt for a more intensive ABC calibration compared to that of Section 4: the number of generations is fixed at $G = 20$ when the claim frequencies are known and at $G = 15$ when they are not. The ABC posterior samples of the NegBin(α, p)–LogNorm(µ, σ) model using only the summaries $x_s$ in (17) are shown in Figure 13.

Figure 13: ABC posterior samples of a NegBin(α, p)–LogNorm(µ, σ) model fitted to a real-world insurance data set. The data includes the total claim severities (17) in Table 4. The posterior samples are closer to the MLE estimates with prior 2 than with prior 1.

The results with prior settings 1 and 2 are noticeably different. More specifically, the ABC posteriors are tighter and more centered around the MLE estimates with prior 2, at least when it comes to estimating the parameters $p$, $\mu$ and $\sigma$.
The ABC posterior samples when the claim frequency information is included are shown in Figure 14; we keep the same prior assumptions over $\mu$ and $\sigma$.

Figure 14: ABC posterior samples of a LogNorm(µ, σ) model fitted to a real-world insurance data set. The data includes the total claim severities and the claim frequencies in Table 4. When the $x_s$'s and $n_s$'s are both observed, the posterior samples with prior 1 and prior 2 almost totally overlap and are reasonably close to the MLE estimates.

Including the claim frequency data makes the results consistent from one prior setting to the other.

We now turn to the problem of selecting a model for the claim sizes. We specify a negative binomial distribution NegBin(α, p) for the claim frequency, with uniform prior distributions over $\alpha$ and $p$, and consider the following competing severity models:

• Weib(k, β), with uniform prior distributions over $k$ and $\beta$;
• Gamma(r, m), with uniform prior distributions over $r$ and $m$;
• LogNorm(µ, σ), with a uniform prior on $\mu$ bounded below by 5 and a uniform prior on $\sigma$.

The bounds of the uniform distributions are set to reflect the variability of the parameters in Figures 9 to 11. The model evidences are reported in Table 7.

                          Severity model
Frequency model           Gamma    Lognormal    Weibull
Negative binomial         0.92     0.01         0.07
Observed frequencies      0.00     0.49         0.51

Table 7: ABC model evidence with the claim frequency and the aggregated claim sizes data.

We see that ABC strongly favors the gamma model when the claim frequency is assumed to have a negative binomial distribution. When including the claim counts, ABC discards the gamma model but is unable to decide between the Weibull and the lognormal models. This result is of course a little disappointing, but it probably means that ABC would need more than 69 observations to pick the right model.
This paper is a case study of an ABC application in insurance. We showed how to use this method to calibrate insurance loss models with limited information (one data point per time period). The fact that the method does not require knowledge of the likelihood function makes it possible to go beyond the classical setting where independence is assumed between the claim frequency and the claim sizes. An ABC routine essentially relies on two things: (i) an efficient sampling strategy and (ii) a reliable measure of dissimilarity between samples of data. We put together an ABC routine that implements a parallel sequential Monte Carlo sampler and uses the Wasserstein distance to compare the synthetic data to the observed one. The Python code may be downloaded from the following GitHub repository: https://github.com/LaGauffre/ABCFitLoMo. ABC has become over the years a common practice in a variety of fields ranging from ecology to genetics. We believe that ABC could also be applied to a wide range of sophisticated models that arise in finance and insurance.
Acknowledgments
Patrick J. Laub conducted this research within the DAMI – Data Analytics andModels for Insurance – Chair under the aegis of the Fondation du Risque, ajoint initiative by UCBL and BNP Paribas Cardif.
References

[1] Hansjörg Albrecher, Jan Beirlant, and Jozef L. Teugels. Reinsurance: Actuarial and Statistical Aspects. Wiley Series in Probability and Statistics. John Wiley & Sons, Chichester, 2017.
[2] Mark A. Beaumont, Jean-Marie Cornuet, Jean-Michel Marin, and Christian P. Robert. Adaptive approximate Bayesian computation. Biometrika, 96(4):983–990, 2009.
[3] Espen Bernton, Pierre E. Jacob, Mathieu Gerber, and Christian P. Robert. Approximate Bayesian computation with the Wasserstein distance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(2):235–269, 2019.
[4] Michael G. B. Blum. Approximate Bayesian computation: a nonparametric perspective. Journal of the American Statistical Association, 105(491):1178–1187, 2010.
[5] Michael G. B. Blum, Maria Antonieta Nunes, Dennis Prangle, and Scott A. Sisson. A comparative review of dimension reduction methods in approximate Bayesian computation. Statistical Science, 28(2):189–208, 2013.
[6] Martin Bøgsted and Susan M. Pitts. Decompounding random sums: a nonparametric approach. Annals of the Institute of Statistical Mathematics, 62(5):855–872, 2010.
[7] Boris Buchmann and Rudolf Grübel. Decompounding: an estimation problem for Poisson random sums. Annals of Statistics, 31(4):1054–1074, 2003. doi: 10.1214/aos/1059655905.
[8] Alberto J. Coca. Efficient nonparametric inference for discretely observed compound Poisson processes. Probability Theory and Related Fields, 170(1-2):475–523, 2018.
[9] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing, 22(5):1009–1020, 2012.
[10] C. Dutang and A. Charpentier. CASdatasets: Insurance datasets (official website). http://cas.uqam.ca/, 2016.
[11] Paul Fearnhead and Dennis Prangle. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):419–474, 2012.
[12] Edward W. Frees, Jie Gao, and Marjorie A. Rosenberg. Predicting the frequency and amount of health care expenditures. North American Actuarial Journal, 15(3):377–392, 2011. doi: 10.1080/10920277.2011.10597626.
[13] J. Garrido, C. Genest, and J. Schulz. Generalized linear models for dependent frequency and severity of insurance claims. Insurance: Mathematics and Economics, 70:205–215, 2016. doi: 10.1016/j.insmatheco.2016.06.006.
[14] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
[15] Pierre-Olivier Goffard, S. Rao Jammalamadaka, and Simos G. Meintanis. Goodness-of-fit tests for compound distributions with applications in insurance. 2019.
[16] Aude Grelaud, Christian P. Robert, Jean-Michel Marin, François Rodolphe, and Jean-François Taly. ABC likelihood-free methods for model choice in Gibbs random fields. Bayesian Analysis, 4(2):317–335, 2009.
[17] Shota Gugushvili, Frank van der Meulen, and Peter Spreij. A non-parametric Bayesian approach to decompounding from high frequency data. Statistical Inference for Stochastic Processes, 21(1):53–79, 2018.
[18] Robert E. Kass and Adrian E. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.
[19] Stuart A. Klugman, Harry H. Panjer, and Gordon E. Willmot. Loss Models: From Data to Decisions, volume 715. John Wiley & Sons, 2012.
[20] Robert McCulloch and Peter E. Rossi. A Bayesian approach to testing the arbitrage pricing theory. Journal of Econometrics, 49(1-2):141–168, 1991.
[21] Gareth Peters and Scott Sisson. Bayesian inference, Monte Carlo sampling and operational risk. Journal of Operational Risk, 1(3), 2006.
[22] Gareth W. Peters, Mario V. Wüthrich, and Pavel V. Shevchenko. Chain ladder method: Bayesian bootstrap versus classical bootstrap. Insurance: Mathematics and Economics, 47(1):36–51, 2010.
[23] Dennis Prangle, Paul Fearnhead, Murray P. Cox, Patrick J. Biggs, and Nigel P. French. Semi-automatic selection of summary statistics for ABC model choice. Statistical Applications in Genetics and Molecular Biology, 13(1):67–82, 2014.
[24] Dennis Prangle, Richard G. Everitt, and Theodore Kypraios. A rare event approach to high-dimensional approximate Bayesian computation. Statistics and Computing, 28(4):819–834, 2018.
[25] Arthur E. Renshaw. Modelling the claims process in the presence of covariates. ASTIN Bulletin, 24(2):265–285, 1994. doi: 10.2143/AST.24.2.2005070.
[26] F. J. Rubio and Adam M. Johansen. A simple approach to maximum intractable likelihood estimation. Electronic Journal of Statistics, 7:1632–1654, 2013.
[27] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, 2016.
[28] Peng Shi, Xiaoping Feng, and Anastasia Ivantsova. Dependent frequency–severity modeling of insurance claims. Insurance: Mathematics and Economics, 64:417–428, 2015. doi: 10.1016/j.insmatheco.2015.07.006.
[29] Scott A. Sisson, Yanan Fan, and Mark Beaumont. Handbook of Approximate Bayesian Computation. Chapman and Hall/CRC, 2018.
[30] Vitor C. Sousa, Marielle Fritz, Mark A. Beaumont, and Lounès Chikhi. Approximate Bayesian computation without summary statistics: the case of admixture. Genetics, 181(4):1507–1519, 2009.
[31] Tina Toni and Michael P. H. Stumpf. Simulation-based model selection for dynamical systems in systems and population biology. Bioinformatics, 26(1):104–110, 2009. doi: 10.1093/bioinformatics/btp619.
[32] Bert van Es, Shota Gugushvili, and Peter Spreij. A kernel type nonparametric density estimator for decompounding. Bernoulli, 13(3):672–694, 2007. doi: 10.3150/07-BEJ6091.
Algorithmic details
Algorithm 5
Sequential Monte Carlo Approximate Bayesian Computation (ABC-SMC).

  for $k = 1 \to K$ do
    repeat
      generate $\theta_k \sim \pi(\theta)$
      generate $\tilde{\mathbf{x}}_k \sim p(\mathbf{x} \mid \theta_k)$
    until $\tilde{\mathbf{x}}_k \in B_{\infty, \mathbf{x}}$
  end for
  compute $\hat\pi_{\epsilon_1}(\theta \mid \mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} K_h(\|\theta - \theta_k\|)$
  for $g = 2 \to G$ do
    for $k = 1 \to K$ do
      repeat
        generate $\theta^g_k \sim \hat\pi_{\epsilon_{g-1}}(\theta \mid \mathbf{x})$
        generate $\tilde{\mathbf{x}}_k \sim p(\mathbf{x} \mid \theta^g_k)$
      until $\tilde{\mathbf{x}}_k \in B_{\epsilon_{g-1}, \mathbf{x}}$
    end for
    set $\epsilon_g = \text{Quantile}\!\left( \|\mathbf{x}^+ - \tilde{\mathbf{x}}^+_1\|, \dots, \|\mathbf{x}^+ - \tilde{\mathbf{x}}^+_K\|;\, \alpha \right)$
    for $k = 1 \to K$ do
      set $w^g_k \propto \dfrac{\pi(\theta^g_k)}{\hat\pi_{\epsilon_{g-1}}(\theta^g_k \mid \mathbf{x})}\, \mathbb{I}_{B_{\epsilon_g, \mathbf{x}}}(\tilde{\mathbf{x}}_k)$
    end for
    compute $\hat\pi_{\epsilon_g}(\theta \mid \mathbf{x}) = \sum_{k=1}^{K} w^g_k\, K_h(\|\theta - \theta^g_k\|)$
  end for

Algorithm 6

ABC-SMC for model selection.

  for $k = 1 \to K$ do
    repeat
      generate $m_k \sim \pi(m)$
      generate $\theta_k \sim \pi(\theta \mid m_k)$
      generate $\tilde{\mathbf{x}}_k \sim p(\mathbf{x} \mid m_k, \theta_k)$
    until $\tilde{\mathbf{x}}_k \in B_{\infty, \mathbf{x}}$
  end for
  for $m = 1, \dots, M$ do
    compute $\hat\pi_{\epsilon_1}(m \mid \mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{I}_{\{m_k = m\}}$
    compute $\hat\pi_{\epsilon_1}(\theta \mid m, \mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{\hat\pi_{\epsilon_1}(m \mid \mathbf{x})}\, K_h(\|\theta - \theta_k\|)\, \mathbb{I}_{\{m_k = m\}}$
  end for
  for $g = 2 \to G$ do
    for $k = 1 \to K$ do
      repeat
        generate $m^g_k \sim \pi(m)$
        generate $\theta^g_k \sim \hat\pi_{\epsilon_{g-1}}(\theta \mid m^g_k, \mathbf{x})$
        generate $\tilde{\mathbf{x}}_k \sim p(\mathbf{x} \mid m^g_k, \theta^g_k)$
      until $\tilde{\mathbf{x}}_k \in B_{\epsilon_{g-1}, \mathbf{x}}$
    end for
    set $\epsilon_g = \text{Quantile}\!\left( \|\mathbf{x}^+ - \tilde{\mathbf{x}}^+_1\|, \dots, \|\mathbf{x}^+ - \tilde{\mathbf{x}}^+_K\|;\, \alpha \right)$
    for $k = 1 \to K$ do
      set $w^g_k \propto \dfrac{\pi(\theta^g_k \mid m^g_k)}{\hat\pi_{\epsilon_{g-1}}(\theta^g_k \mid m^g_k, \mathbf{x})}\, \mathbb{I}_{B_{\epsilon_g, \mathbf{x}}}(\tilde{\mathbf{x}}_k)$
    end for
    for $m = 1, \dots, M$ do
      compute $\hat\pi_{\epsilon_g}(m \mid \mathbf{x}) = \sum_{k=1}^{K} w^g_k\, \mathbb{I}_{\{m^g_k = m\}}$
      compute $\hat\pi_{\epsilon_g}(\theta \mid m, \mathbf{x}) = \sum_{k=1}^{K} \frac{w^g_k}{\hat\pi_{\epsilon_g}(m \mid \mathbf{x})}\, K_h(\|\theta - \theta^g_k\|)\, \mathbb{I}_{\{m^g_k = m\}}$
    end for
  end for