SynthETIC: an individual insurance claim simulator with feature control

Abstract

Recent years have seen a rapid increase in the application of machine learning to insurance loss reserving. These methods yield most value when applied to large data sets, such as individual claims or large claim triangles. In short, they are likely to be useful in the analysis of any data set whose volume is sufficient to obscure a naked-eye view of its features. Unfortunately, such large data sets are in short supply in the actuarial literature. Accordingly, one needs to turn to synthetic data. Although the ultimate objective of these methods is application to real data, the use of synthetic data containing features commonly observed in real data is also to be encouraged. While there are a number of claims simulators in existence, each valuable within its own context, the inclusion of a number of desirable (but complicated) data features requires further development. Accordingly, in this paper we review those desirable features, and propose a new simulator of individual claim experience called SynthETIC. Our simulator is publicly available, open source, and fills a gap in the non-life actuarial toolkit. The simulator specifically allows for desirable (but optionally complicated) data features typically occurring in practice, such as variations in rates of settlement and development patterns, superimposed inflation, and various discontinuities, and also enables various dependencies between variables. The user has full control of the mechanics of the evolution of an individual claim. As a result, the complexity of the data set generated (meaning the level of difficulty of analysis) may be dialled anywhere from extremely simple to extremely complex.


arXiv [q-fin.RM]

SynthETIC: an individual insurance claim simulator with feature control

Benjamin Avanzi (a), Greg Taylor (∗, b), Melantha Wang (b), Bernard Wong (b)

(a) Centre for Actuarial Studies, Department of Economics, University of Melbourne VIC 3010, Australia
(b) School of Risk and Actuarial Studies, UNSW Australia Business School, UNSW Sydney NSW 2052, Australia

Abstract

A simulator of individual claim experience called SynthETIC is described. It is publicly available, open source and fills a gap in the non-life actuarial toolkit. It simulates, for each claim, occurrence, notification, the timing and magnitude of individual partial payments, and closure. Inflation, including (optionally) superimposed inflation, is incorporated in payments. Superimposed inflation may occur over calendar or accident periods. The claim data are summarized by accident and payment “periods” whose duration is an arbitrary choice (e.g. month, quarter, etc.) available to the user. The code is structured as eight modules (occurrence, notification, etc.), any one or more of which may be re-designed according to the user's requirements. The default version is parameterized so as to include a broad (though not numerically precise) resemblance to the major features of experience of a specific (but anonymous) Auto Bodily Injury portfolio. It thus reflects a number of desirable (but complicated) data features, but the general structure is suitable for most lines of business, with some amendment of modules. The structure of the simulator enables the inclusion of a number of important dependencies between the variables related to an individual claim, e.g. dependence of notification delay on claim size, of the size of a partial payment on the sizes of those preceding, etc. The user has full control of the mechanics of the evolution of an individual claim. As a result, the complexity of the data set generated (meaning the level of difficulty of analysis) may be dialled anywhere from extremely simple to extremely complex. At the extremely simple end would be chain-ladder-compatible data, and so the alternative data structures available enable proposed loss reserving models to be tested against more challenging data sets. Indeed, the user may generate a collection of data sets that provide a spectrum of complexity, and the collection may be used to present a model under test with a steadily increasing challenge.

Key words: granular models, individual claims, individual claim simulator, loss reserving, partial payments, simulated losses, superimposed inflation, SynthETIC, synthetic losses

MSC classes: 91G70, 91G60, 62P05

1. Introduction

Recent years have seen a rapid increase in the application of machine learning (ML) to insurance loss reserving. A relatively up-to-date bibliography is contained in Richman (2018). These ML methods are hungry for data. They yield most value when applied to large data sets, such as individual claim, even transactional, data sets. They may also be useful in the analysis of large claim triangles, e.g. 40 × 40. In short, they are likely to be useful in the analysis of any data set whose volume is sufficient to obscure a naked-eye view of its features.

Unfortunately, such large data sets are in short supply in the actuarial literature, and so there is a shortfall in the material required to test ML methods. Accordingly, one needs to turn to synthetic data. Although the ultimate objective of these methods is application to real data, the use of synthetic data containing features commonly observed in real data is also to be encouraged.

Indeed, there is a school of thought that proposed claim models should always be tested against synthetic data, as well as real data if available. The reason is that the features contained in the synthetic data set will be known, by construction. One is then able to test the extent to which the proposed model is successful in identifying them.

There are several individual claim simulators in existence, and they are briefly reviewed in Section 5.1. Although each of these is valuable within its own context, the authors believe that there is scope for a further simulator. The reasons for this are discussed in Section 5. It is against this background that the individual claim simulator SynthETIC (Synthetic Experience Tracking Insurance Claims) is introduced.

The code is open source (see Section 8) and is, quite deliberately, in modular form. The claim process is defined by 8 modules (see Section 4.1), any of which may be unplugged and replaced by an alternative module of the user's choosing. The current default parameterization broadly resembles a specific real Auto Liability data set that contains a number of features that render its modelling demanding.

After some notation in Section 3, the architecture of SynthETIC is described in Section 4. Section 5 discusses its relation to prior literature, while Section 6 gives detail of the default parameterization just mentioned. Section 7 discusses the circumstances and manner in which SynthETIC might be applied, Section 8 gives the web address of the code and the example data set used in Section 6, and Section 9 contains some closing comments.

∗ Corresponding author.
Email addresses: [email protected] (Benjamin Avanzi), [email protected] (Greg Taylor), [email protected] (Melantha Wang), [email protected] (Bernard Wong)
September 10, 2020

2. Desirable data features

Much of the actuarial literature on loss reserving is focused on the chain ladder. This model requires a very specific parametric structure in which the expectation of the target variable is equal to the product of a row parameter and a column parameter (see, for example, Taylor, 2000; Wüthrich and Merz, 2008).

However, the same literature is dotted with examples of data sets that do not fit this prescription. Indeed, various models within the literature have been devised with the express purpose of addressing such non-conforming data sets.

This failure to conform with the chain ladder structure can arise in a number of ways. First, the rate of settlement of claims can change from one accident period to another. Models that take account of this feature have a long pedigree, e.g. Fisher and Lange (1973); Reid (1978) through to the present day, e.g. McGuire et al. (2018).

More generally, development patterns may change from one accident period to another. This may be due to changing rates of settlement, changing mixture of risks, or other causes. Again, models that seek to address this feature have a long pedigree, e.g. Berquist and Sherman (1977) up to recent times, e.g. Meyers (2015).

Superimposed inflation is a further potential disruptor of claim experience. The chain ladder model can accommodate inflation that occurs at a constant rate over time, but not variable-rate inflation. Examples of real data sets that appear to exhibit variable rates of superimposed inflation appear in e.g. Taylor (2000) and McGuire et al. (2018).

A claim triangle may experience a discontinuity as one passes from one accident period to another. This might result from a change in legislation governing claim conditions, or a sudden change in the type of business underwritten. An example is discussed by McGuire (2007).

Models that have been created to address data features such as the above require data sets containing those features if they are to be tested effectively. A major objective of SynthETIC is to provide this facility, and to our knowledge, it is the only simulator (at the time of writing this paper) that includes all those features natively.
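The multiplicative structure required by the chain ladder, and the way a change in development pattern breaks it, can be illustrated with a small numerical sketch. The row and column parameters below are hypothetical values chosen only for illustration; they are not taken from the paper or from SynthETIC.

```python
# Expected incremental payments under the chain ladder structure:
# E[Y_ij] = (row parameter of accident period i) * (column parameter of development period j).
row_params = [1000.0, 1200.0, 900.0]   # hypothetical accident period levels
col_params = [0.5, 0.3, 0.2]           # hypothetical development pattern


def age_to_age_factors(increments):
    """Ratios of successive cumulative totals for one accident period."""
    cumulative = []
    total = 0.0
    for amount in increments:
        total += amount
        cumulative.append(total)
    return [cumulative[j + 1] / cumulative[j] for j in range(len(cumulative) - 1)]


# Conforming data: every accident period shares the same development pattern,
# so the age-to-age factors are identical across accident periods.
conforming = [[a * b for b in col_params] for a in row_params]
conforming_factors = [age_to_age_factors(row) for row in conforming]

# Non-conforming data: the last accident period settles more slowly
# (hypothetical pattern), so its factors differ from those of earlier periods.
slower_pattern = [0.4, 0.3, 0.3]
non_conforming = conforming[:2] + [[row_params[2] * b for b in slower_pattern]]
non_conforming_factors = [age_to_age_factors(row) for row in non_conforming]
```

In the conforming case a single set of factors describes every accident period; in the second case no single set does, which is the kind of feature the models cited above were designed to handle.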

3. Notation

SynthETIC works with exact transaction times, so time will be measured continuously. Calendar time $\tau = 0$ denotes the first date on which there is exposure to occurrence of a claim. The time scale is arbitrary; a unit of time might be a quarter, a year, or any other selected period. The length of a period in years is specified by the user as a global parameter. The user needs to ensure that all input parameters are compatible with the chosen time unit.

For certain purposes (see Section 4.1), it will be useful to partition time into discrete periods. These are unit periods according to the chosen time scale. These periods will be of two types:
• occurrence periods (or accident periods), numbered 1, 2, . . . , where occurrence period 1 corresponds to the calendar time interval (0, 1] [...]
• payment periods, numbered 1, 2, . . . , I − [...] I past periods and a further I − [...]

[...] partial payments. The claim will be regarded as settled immediately after the final partial payment. The delays between successive partial payments are referred to as inter-partial delays.

All payments are subject to inflation. They are initially simulated without allowance for inflation, and an inflation adjustment added subsequently. Any quantity described as “without allowance for inflation” is expressed in constant dollar values, specifically those of calendar time $\tau = 0$.

Inflation occurs in two types:
(a) Base inflation: which represents, in some sense, “normal” community inflation (e.g. price inflation, wage inflation) that would apply to claim sizes in the absence of extraordinary considerations; and
(b) Superimposed inflation: which represents the differential (positive or negative) between claim inflation and base inflation.

It is assumed that base inflation may be represented by a vector of quarterly inflation rates for both past and future calendar periods. The input inflation rates need to be expressed as quarterly effective rates irrespective of the length of calendar periods adopted. The inflation rates are used to construct an inflation index whose values are obtained:
• at quarterly points from calendar time 0, by compounding the quarterly rates; and
• at intra-quarterly points, by exponential interpolation between the quarter ends immediately prior and subsequent.

It is also assumed that superimposed inflation occurs in two sub-types:
(i) Payment period superimposed inflation: which operates over payment periods; and
(ii) Occurrence period superimposed inflation: which operates over occurrence periods.

By default, both will be applied jointly, but this can be controlled by the user.

The following notation is used throughout:

Int($x$) denotes the integral part of $x$
$\lceil x \rceil$ = the ceiling function, that is, the smallest integer $n$ such that $n - 1 \le x \le n$
$i$ = occurrence period
$t$ = continuous calendar time with origin at the beginning of occurrence period 1
$\lceil t \rceil$ = payment period containing calendar time $t$
$E_i$ = (annual effective) exposure in occurrence period $i$
$\lambda_i$ = expected claim frequency (per unit exposure) in occurrence period $i$
$f(t)$ = base inflation index, representing the ratio of dollar values at calendar time $t$ to those at calendar time 0, constructed from the input base inflation rates
$g_P(t \mid s)$ = payment period superimposed inflation index, representing the ratio of dollar values at calendar time $t$ to those at calendar time 0
$g_O(i \mid s)$ = occurrence period superimposed inflation index, representing the ratio of dollar values at occurrence period $i$ to those at occurrence time 0
$n_i$ = number of claims occurring in occurrence period $i$
$r$ = identification number of claims occurring in occurrence period $i$ ($r = 1, 2, \ldots, N_i$)
$u_{ir}$ = occurrence time of claim $r$ of occurrence period $i$
$s_{ir}$ = size of claim $r$ of occurrence period $i$ without allowance for inflation
$v_{ir}$ = delay from occurrence to notification of claim $r$ of occurrence period $i$ (N.B. the notification time is $u_{ir} + v_{ir}$)
$w_{ir}$ = delay from notification to settlement of claim $r$ of occurrence period $i$ (N.B. the settlement time is $u_{ir} + v_{ir} + w_{ir}$)
$m_{ir}$ = number of partial payments in respect of claim $r$ of occurrence period $i$
$s^{(m)}_{ir}$ = size of the $m$-th partial payment in respect of claim $r$ of occurrence period $i$, $m = 1, 2, \ldots, m_{ir}$
$p^{(m)}_{ir} = s^{(m)}_{ir} / s_{ir}$ = proportion of claim amount $s_{ir}$ paid in the $m$-th partial payment
$d^{(m)}_{ir}$ = the inter-partial delay from the epoch of the $(m-1)$-th to that of the $m$-th partial payment of claim $r$ of occurrence period $i$, with the convention that $d^{(0)}_{ir} = 0$, corresponding to notification date (the epoch of the $m$-th partial payment is $u_{ir} + v_{ir} + d^{(1)}_{ir} + \cdots + d^{(m)}_{ir}$)

All of these quantities from $n_i$ onward, except $r$, are realizations of random variables. The random variables themselves are denoted in the same way but with the primary symbol in upper case. For example, $S_{ir}$ denotes the random variable whose realization is $s_{ir}$.
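The construction of the base inflation index described above (compounding at quarter ends, exponential interpolation within a quarter) can be sketched as follows. This is a minimal illustration and not the SynthETIC code: the function name, its arguments and the sample rates are hypothetical, and calendar time is assumed to be measured in quarters with the index equal to 1 at time 0.

```python
import math


def base_inflation_index(quarterly_rates, t_quarters):
    """Base inflation index at calendar time t (in quarters), equal to 1 at time 0.

    quarterly_rates[k] is the quarterly effective rate applying over quarter (k, k + 1].
    """
    whole = int(math.floor(t_quarters))
    fraction = t_quarters - whole
    index = 1.0
    for rate in quarterly_rates[:whole]:
        # Compound the rates of all completed quarters.
        index *= 1.0 + rate
    if fraction > 0.0:
        # Exponential interpolation between the adjacent quarter ends:
        # f(whole)^(1 - fraction) * f(whole + 1)^fraction = f(whole) * (1 + rate)^fraction
        index *= (1.0 + quarterly_rates[whole]) ** fraction
    return index


rates = [0.01, 0.02, 0.005, 0.0]                   # hypothetical quarterly effective rates
index_at_2 = base_inflation_index(rates, 2.0)      # 1.01 * 1.02
index_at_3 = base_inflation_index(rates, 3.0)      # 1.01 * 1.02 * 1.005
index_at_2_5 = base_inflation_index(rates, 2.5)    # geometric mean of the two values above
```

If the chosen time unit is not a quarter, a calendar time would first need to be converted to quarters using the global period length before the index is evaluated.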

4. Architecture

The claim process for claim $r$ of occurrence period $i$ is envisaged as consisting of the following modules:
• claim occurrence;
• claim size;
• claim notification;
• claim settlement (closure);
• number of partial payments;
• sizes of partial payments (without allowance for inflation);
• timing of partial payments;
• claim inflation.

The modules are listed in the order in which they are simulated in SynthETIC rather than temporal order. This architecture more or less follows that set out in the early papers on claim micro-models (Arjas, 1989; Jewell, 1989; Norberg, 1993, 1999; Hesselager, 1994). The following paragraphs provide full detail of the architecture of each module, though each may be re-coded by the user if desired (see Section 4.2).

4.1.1. Claim occurrence date

Occurrence period $i$ covers calendar period $(i-1, i]$. It is assumed that $N_i \sim \text{Poisson}(E_i \lambda_i)$, where $E_i$, $\lambda_i$ are input parameters. The number of claims $n_i$ is a random drawing of $N_i$. As indicated in Section 3, $E_i$ is an annual effective parameter. The example implementation in Section 6 is specified in Appendix A to work with quarterly periods, and with $E_i = 12000$, $\lambda_i = 0.03$. The exposure for a single occurrence quarter will therefore be 3000 (exposure-years), and the expected number of claims $3000 \times 0.03 = 90$.

No trend in claim frequency over the occurrence period has been assumed, and so the occurrence time of claim $r$ is a realization $u_{ir}$ of a random variable $U_{ir}$ simulated as $U_{ir} \sim \text{Uniform}(i-1, i]$, $r = 1, 2, \ldots, N_i$.

In principle, the assumption of uniformity would conflict with fractional $E_i$ if the position of these exposures within the occurrence period were known. This conflict has been disregarded for the purpose of the current version of SynthETIC, but is capable of modification if the conflict is material.

The current selections of $E_i$, $\lambda_i$ are set out in Appendix A.

4.1.2. Claim size

Claim size is represented as a multiple of a reference claim size, which is defined as a global parameter. This enables the simulator to switch conveniently between currencies, e.g. the reference claim size might be 1,000 USD or 1,000,000 KRW. Alternatively, some schemes of insurance define entitlements as multiples of a regulated reference claim size, and this latter may be used as the SynthETIC global parameter.

Claim size $s_{ir}$ is the realization of a random variable $S_{ir}$ with df $F_S(s)$, independent of $i$ in this version of SynthETIC, specified on input as a function of $s$. This is sampled according to $S_{ir} = F_S^{-1}(Z^{(S)}_{ir})$, where $Z^{(S)}_{ir} \sim \text{Uniform}(0, 1)$. The current selection of $F_S(s)$ is set out in Appendix A.

4.1.3. Notification delay

Notification delay $v_{ir}$ is the realization of a random variable $V_{ir}$ with df $F_{V|i,s}(v; i, s)$, specified on input as a function of $v$, and possibly dependent on occurrence period $i$ and claim size $s$. This is sampled according to $V_{ir} = F_{V|i,s}^{-1}(Z^{(V)}_{ir})$, where $Z^{(V)}_{ir} \sim \text{Uniform}(0, 1)$. The current selection of $F_{V|i,s}(v; i, s)$ is set out in Appendix A.

4.1.4. Settlement delay

Settlement delay $w_{ir}$ is the realization of a survival random variable $W_{ir}$ with survival function $\bar{F}_{W|i,s}(w; i, s)$, specified on input as a function of $w$, and possibly dependent on occurrence period $i$ and claim size $s$, where $\bar{F}_{W|i,s} = 1 - F_{W|i,s}$ for df $F_{W|i,s}$. This is sampled according to $W_{ir} = \bar{F}_{W|i,s}^{-1}(Z^{(W)}_{ir})$, where $Z^{(W)}_{ir} \sim \text{Uniform}(0, 1)$. The current selection of $\bar{F}_{W|i,s}(w; i, s)$ is set out in Appendix A.

4.1.5. Number of partial payments

The number of partial payments $m_{ir}$ is the realization of a random variable $M_{ir}$ with df $F_{M|s}(m; s)$, specified on input as a function of $m$, and possibly dependent on claim size $s > 0$. This is sampled according to $M_{ir} = F_{M|s}^{-1}(Z^{(M)}_{ir})$, where $Z^{(M)}_{ir} \sim \text{Uniform}(0, 1)$. The current selection of $F_{M|s}(m; s)$ is set out in Appendix A.

4.1.6. Sizes of partial payments

Consider claims from an Auto Liability line of business, for example. Such a claim will usually consist of:
(1) (possibly) some small payments such as police reports, medical consultations and reports;
(2) some more substantial payments such as hospitalization, specialist medical procedures, equipment (e.g. prosthetics);
(3) a final settlement with the claimant;
(4) a smaller final payment, usually covering legal costs.

The settlement with the claimant will typically account for the bulk of the claim cost, and the final payment will usually increase (not necessarily linearly) with the size of the settlement.

Claims in a number of other lines of business exhibit a similar structure, albeit with possible differences in the types of payment made. For example, in Auto Collision Damage, (1) might include tow-truck costs, and (2) might include replacement hire car costs. Payments of type (4) might be negative and account for third party recoveries and salvage.

Some claims may be trivial, involving only preliminary costs of types (1) and (2). It is assumed that these claims are characterized by $m_{ir} \le 3$, and that the more substantial claims, with payments of types (3) and (4), are characterized by $m_{ir} \ge 4$.

Case $m_{ir} \ge 4$. In accordance with the above commentary, payments of types (3) and (4), i.e. the last two payments in respect of the claim, are simulated first. Initially the sum of the two is simulated, and then this amount is apportioned between the two payments.

The sum of the two payments is sampled, as a proportion of claim size $s_{ir}$, according to
$$1 - \left[ P^{(m_{ir}-1)}_{ir} + P^{(m_{ir})}_{ir} \right] \sim F_{P|s}^{-1}\left( Z^{(P)}_{ir} \right),$$
where $Z^{(P)}_{ir} \sim \text{Uniform}(0, 1)$. The current selection of $F_{P|s}$ is set out in Appendix A.

Let the proportion of the total represented by the settlement be denoted
$$Q_{ir} = \frac{P^{(m_{ir}-1)}_{ir}}{P^{(m_{ir}-1)}_{ir} + P^{(m_{ir})}_{ir}},$$
which is sampled according to $Q_{ir} \sim F_{Q|s}^{-1}(Z^{(Q)}_{ir})$, where $Z^{(Q)}_{ir} \sim \text{Uniform}(0, 1)$. The current selection of $F_{Q|s}$ is set out in Appendix A.

This leaves the proportion $1 - [P^{(m_{ir}-1)}_{ir} + P^{(m_{ir})}_{ir}]$ of $s_{ir}$ to be accounted for by partial payments $s^{(1)}_{ir}, s^{(2)}_{ir}, \ldots, s^{(m_{ir}-2)}_{ir}$.

The proportions $p^{(m)}_{ir}$, $m = 1, \ldots, m_{ir} - 2$, must satisfy
$$\sum_{m=1}^{m_{ir}-2} p^{(m)}_{ir} = 1 - \left( p^{(m_{ir}-1)}_{ir} + p^{(m_{ir})}_{ir} \right).$$
They are first simulated in unnormalized form $\hat{p}^{(m)}_{ir}$ and these quantities then normalized. Thus, $\hat{p}^{(m)}_{ir}$ is the realization of a random variable $\hat{P}^{(m)}_{ir} \sim F_{\hat{P}|m}^{-1}(Z^{\hat{P}|m}_{ir})$, where $Z^{\hat{P}|m}_{ir} \sim \text{Uniform}(0, 1)$. The current selection of $F_{\hat{P}|m}$ is set out in Appendix A.

Then normalization takes the form
$$p^{(m)}_{ir} = \left[ 1 - \left( p^{(m_{ir}-1)}_{ir} + p^{(m_{ir})}_{ir} \right) \right] \frac{\hat{p}^{(m)}_{ir}}{\hat{p}^{(1)}_{ir} + \cdots + \hat{p}^{(m_{ir}-2)}_{ir}}, \quad m = 1, \ldots, m_{ir} - 2.$$

A summary of the construction of partial payments is as follows:
• Simulate $p^{(m_{ir}-1)}_{ir} + p^{(m_{ir})}_{ir}$.
• Simulate $q_{ir}$.
• Calculate $p^{(m_{ir}-1)}_{ir} = q_{ir} ( p^{(m_{ir}-1)}_{ir} + p^{(m_{ir})}_{ir} )$, $p^{(m_{ir})}_{ir} = (1 - q_{ir}) ( p^{(m_{ir}-1)}_{ir} + p^{(m_{ir})}_{ir} )$.
• Simulate $\hat{p}^{(m)}_{ir}$, $m = 1, \ldots, m_{ir} - 2$.
• Normalize to obtain $p^{(m)}_{ir}$, $m = 1, \ldots, m_{ir} - 2$.
• Calculate $s^{(m)}_{ir} = p^{(m)}_{ir} s_{ir}$, $m = 1, \ldots, m_{ir}$.

Case $m_{ir} = 2$ or $3$. The calculations proceed as above, except without settlement and associated partial payments. Thus,
• for actual $m_{ir} = 2$, proceed as if $m_{ir} = 4$ with $p^{(m_{ir}-1)}_{ir} + p^{(m_{ir})}_{ir} = 0$;
• for actual $m_{ir} = 3$, proceed as if $m_{ir} = 5$ with $p^{(m_{ir}-1)}_{ir} + p^{(m_{ir})}_{ir} = 0$.

Case $m_{ir} = 1$. For this degenerate case, $s^{(1)}_{ir} = s_{ir}$ of course.

4.1.7. Timing of partial payments

As noted in Section 3, the epoch of the $m$-th partial payment is $u_{ir} + v_{ir} + d^{(1)}_{ir} + \cdots + d^{(m)}_{ir}$. Let this be denoted $t^{(m)}_{ir}$. The payment period in which this payment falls is $\lceil u_{ir} + v_{ir} + d^{(1)}_{ir} + \cdots + d^{(m)}_{ir} \rceil$.

The $d^{(m)}_{ir}$ require normalization so that $d^{(1)}_{ir} + \cdots + d^{(m_{ir})}_{ir} = w_{ir}$, and so their simulation parallels that of the $p^{(m)}_{ir}$ just above. Thus, $\hat{d}^{(m)}_{ir}$ is the realization of a random variable $\hat{D}^{(m)}_{ir} \sim F_{\hat{D}|m}^{-1}(Z^{\hat{D}|m}_{ir})$, where $Z^{\hat{D}|m}_{ir} \sim \text{Uniform}(0, 1)$. The current selection of $F_{\hat{D}|m}$ is set out in Appendix A.

Normalization takes the form
$$d^{(m)}_{ir} = w_{ir} \frac{\hat{d}^{(m)}_{ir}}{\hat{d}^{(1)}_{ir} + \cdots + \hat{d}^{(m_{ir})}_{ir}}, \quad m = 1, \ldots, m_{ir}.$$

4.1.8. Claim inflation

The actual dollar value of a constant dollar partial payment is
$$s^{*(m)}_{ir} = s^{(m)}_{ir} \, \frac{f\left(t^{(m)}_{ir}\right)}{f(1)} \, \frac{g_O(i \mid s_{ir})}{g_O(1 \mid s_{ir})} \, \frac{g_C\left(t^{(m)}_{ir} \mid s_{ir}\right)}{g_C(1 \mid s_{ir})},$$
where $g_C$ is the payment period superimposed inflation index defined in Section 3.

4.1.9. Out-of-bounds transactions

For the purpose of the present sub-section, a transaction includes occurrence, notification, settlement, or a payment. Occasionally, simulated transactions will be out-of-bounds, i.e. take place beyond the end of development period $I$, where development periods are numbered 1, 2, . . . .

In these cases, the simulated epoch of occurrence of the transaction is maintained throughout the simulation of details of the claim concerned, other than in the exceptions noted below.
For example, if settlement occurs at development time $j > I$, and delays between partial payments depend on settlement delay, then the simulated value of $j$ will be used in the simulation of inter-partial delays. Only at the stage where transactions are assigned to development periods for the purpose of either tabulation or addition of inflation is the epoch of occurrence varied. In these circumstances, the transaction is deemed to have occurred at the end of development period $I$.

This convention will cause some concentration of transactions at the end of development period $I$. Usually, the concentration will be small to the extent of virtual immateriality. If the effect is material, this may indicate that the user's selection of parameters determining transaction times is not well matched to the maximum number of development periods allowed, and consideration might be given to changing one or the other.

4.2. Modularity

SynthETIC has been structured so as to generate a portfolio of claims that loosely resemble those from an Auto Liability portfolio with which one of the authors is familiar. The latter is referred to here as the “reference portfolio”, and has been discussed in various earlier papers (Taylor et al., 2003; Taylor and McGuire, 2004; McGuire, 2007; McGuire et al., 2018).

This structure is quite general, and many users should be able to adopt it with changes to parameters but not algebraic structure. However, there will undoubtedly be other cases in which some change of structure is required.

For this reason, SynthETIC has been coded in modular form. Its eight modules are listed as 4.1.1 to 4.1.8. The coding of any one is independent of all others so that the user may unplug any one and replace it with a version modified to his/her own purpose.

It is essential to note, however, that the module sequence 4.1.1 to 4.1.8 should be maintained. The reason is that this sequence reflects assumed dependencies whereby any specific quantities generated by any specific module may be dependent on those from prior modules. Examples are given in [4.3.1] to [4.3.7].
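The module sequence, and the way later modules consume quantities generated by earlier ones, can be sketched in a few lines of Python. This is an illustrative sketch only and not the SynthETIC code: every distribution and parameter value below (exponential claim sizes and delays, the rule for the number of partial payments, the uniform ranges for the last two payments) is a hypothetical stand-in for the selections of Appendix A, and the inflation module is omitted. Only the quarterly exposure and claim frequency are taken from Section 4.1.1.

```python
import math
import random


def poisson_draw(rng, mean):
    """Number of arrivals of a rate-`mean` Poisson process during one unit of time."""
    count = 0
    elapsed = rng.expovariate(mean)
    while elapsed <= 1.0:
        count += 1
        elapsed += rng.expovariate(mean)
    return count


def inverse_exponential_df(z, mean):
    """Inverse df of an exponential distribution (hypothetical stand-in for F^{-1})."""
    return -mean * math.log(1.0 - z)


def rescale(raw, total):
    """Normalize positive raw values so that they sum to `total`."""
    raw_sum = sum(raw)
    return [total * value / raw_sum for value in raw]


def simulate_claim(rng, i):
    """Simulate one claim of occurrence period i, module by module."""
    u = i - rng.random()                                    # occurrence time in (i - 1, i]
    s = inverse_exponential_df(rng.random(), 50_000.0)      # claim size
    v = inverse_exponential_df(rng.random(), 0.5)           # notification delay
    # Settlement delay: mean increases with claim size (a hypothetical dependency).
    w = inverse_exponential_df(rng.random(), 2.0 + s / 50_000.0)
    # Number of partial payments: inverse df of a discrete uniform on 1..max_m.
    max_m = 3 if s < 10_000.0 else 6
    m = 1 + int(rng.random() * max_m)

    # Proportions of claim size paid in each partial payment.
    if m >= 4:
        last_two = rng.uniform(0.6, 0.9)   # settlement plus final payment
        q = rng.uniform(0.8, 0.95)         # share of that sum paid at settlement
        early = rescale([1.0 - rng.random() for _ in range(m - 2)], 1.0 - last_two)
        proportions = early + [q * last_two, (1.0 - q) * last_two]
    else:
        proportions = rescale([1.0 - rng.random() for _ in range(m)], 1.0)
    payments = [p * s for p in proportions]

    # Inter-partial delays, normalized to sum to the settlement delay w.
    delays = rescale([1.0 - rng.random() for _ in range(m)], w)
    epochs, epoch = [], u + v
    for d in delays:
        epoch += d
        epochs.append(epoch)
    return {"u": u, "s": s, "v": v, "w": w, "payments": payments, "epochs": epochs}


rng = random.Random(2020)
exposure, frequency = 12000 / 4, 0.03   # quarterly exposure and claim frequency
n_claims = poisson_draw(rng, exposure * frequency)
claims = [simulate_claim(rng, i=1) for _ in range(n_claims)]
```

Because each step is a separate function call on quantities already simulated, any one step could be replaced without touching the others, provided the order of the calls is kept.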

4.3. Features of the reference portfolio

The reference portfolio is complex, containing a number of features that can cause a certain degree of awkwardness in modelling. These are as follows.

[4.3.1] Distribution of settlement delay depends on claim size.
[4.3.2] Distribution of settlement delay also, conditional on claim size, varies from one occurrence period to another.
[4.3.3] Claim sizes are subject to payment period superimposed inflation.
[4.3.4] Rates of payment period superimposed inflation vary from one payment period to another.
[4.3.5] Rates of payment period superimposed inflation also vary with claim size.
[4.3.6] A legislative change occurred that affected the claim experience of subsequent occurrence periods (occurrence period superimposed inflation).
[4.3.7] The legislative change affected claims differentially according to pre-change claim size.

All of these features have been incorporated in SynthETIC.

5. Relation to prior literature

The literature contains a few predecessor simulators (CAS Loss Simulation Model Working Party, 2007; Harej et al., 2017 (ASTIN Working Party on Individual Claim Development with Machine Learning); Gabrielli and Wüthrich, 2018). These are discussed in the following two sub-sections, and the major differences between them and SynthETIC identified.

The simulator of the CAS Loss Simulation Model Working Party (2007) incorporates controls on the following features:
• Notification delay;
• Settlement delay;
• Claim size.

Each of these three may be specified by selecting from a prescribed set of standard distributions. There are options to modify the claim size distribution to allow for either or both of zero claims and the effect of deductibles.

The simulator is open source. It is calibrated against the Auto experience of a number of US states, although the user may over-ride this calibration.

The program includes dependencies between:
• claim size and settlement delay; and
• notification delay and settlement delay.

In both cases, dependency is expressed in terms of correlation, which differs from the present simulator. Claim inflation is included by means of a user-defined “severity index”. No provision for partial payments is included.

The simulator of Harej et al. (2017) generates samples of both paid and incurred amounts. The latter is not of concern for present purposes. Although the model of claim payments is expressed in terms of continuous time, it is implemented in discrete time (years).

Let $j$ denote continuous development time, with origin at the commencement of the occurrence period. Cumulative claim payments to development time $j$, denoted $P(j)$, are defined as $P(j) = U F_P(j)$, where $U$ is the ultimate claim size and $F_P(j)$ is the proportion of it paid by development time $j$.

Claim size $U$ is sampled from a log normal distribution with prescribed location and dispersion parameters. For each $j = 1, 2, \ldots$, $F_P(j)$ is drawn from a log normal distribution also with prescribed location and dispersion parameters. In particular, the location parameter at development time $j$ takes the form $\alpha \exp(-((j - \tau)/\lambda)^{[\ldots]})$ with the quantities $\alpha, \tau, \lambda$ used to calibrate the payment pattern to short and long tail forms.

This simulator does not deal with notification or settlement events. It does not include explicit allowance for inflation.

Since $F_P(j)$ is a continuous function of $j$, the simulator envisages a claim as paid continuously by infinitesimal amounts. As pointed out above, however, it is implemented in discrete time, with payments occurring in each development year.

In this sense, allowance is made for partial payments. On the other hand, each claim is just a miniature version of aggregate payment experience, subject to random perturbation. This would not be a realistic representation of partial payments for most lines of business.

The simulator of Gabrielli and Wüthrich (2018) generates claim payments in discrete time (years) over 12 development years. It is calibrated against a specific set of real data from four lines of business. The lines of business are not stated, though the fact that the data include “labor sector of the injured” and “part of the body injured” suggests Casualty lines, possibly Employers Liability. The code is available on-line.

The simulator incorporates controls on the following features:
• Notification delay (in development years);
• Settlement delay (in development years);
• Claim size, including the proportion of zero-cost claims;
• Number of development years with payment;
• Claim payments by development year.

Less usual features included in the simulator are:
• Covariates (accident year, accident quarter, line of business, labor sector of the injured, age of the injured, part of the body injured) which differentiate claims experience;
• Claim recoveries; and
• Claim re-openings.

The inclusion of accident year and quarter in covariates enables the model to accommodate (non-linear) time trends and seasonalities across accident periods in all model components. The case of claim size is of special interest in this context. This is included as a log normal variate with location parameter expressed as a linear function of the output of the prior layer of neurons.

Hence the logarithm of claim size incorporates a non-linear trend across accident periods. The simulator does not include explicit allowance for inflation but, according to the trend just discussed, includes implicit accident year inflation. It does not include calendar period inflation.

Dependencies between some simulated quantities are included in the model. For example, the number of development years containing payment depends on the notification delay; claim size depends on notification delay and number of development years containing payment.

Dependencies between payments of different development years are included partially by virtue of the simulation of number of development years with payment. For any fixed number of these development years, a deterministic payment pattern is assumed, whence part of the dependency between payments of different development years is excluded. The authors point out that the selection of a payment pattern could be made stochastic, but this would require careful engineering of the dependencies between development years.

The parametric structure of the simulator is fixed. The output always resembles a drawing from the data set against which the simulator is calibrated. The modeller may sometimes seek greater challenge than provided by this data set.

SynthETIC makes useful contributions to the already excellent simulator of Gabrielli and Wüthrich (2018). First, the latter appears relatively short-tailed; Table 5 of Gabrielli and Wüthrich (2018) indicates that more than 50% of an accident year's cost is paid in the first development year, and roughly 90% within the first three development years. Second, the data appear to conform reasonably well with the simple chain ladder structure.

The chain ladder will often perform well in relation to data that conform with its multiplicative cross-classified parametric structure; correspondingly, it will often perform poorly in relation to data that do not so conform. Thus, the purpose of proposed alternative models will often be to fill the gap left by the chain ladder, i.e. analysis of those data sets that are awkward for the chain ladder. Testing of these alternative models will then require data sets that are poorly adapted to the chain ladder. SynthETIC was explicitly designed to facilitate such an analysis.

Similar comments can be made about the length of payment tail of a data set. Analysis is usually most difficult in the case of long-tailed data sets, and so there is a need for the simulation of these. SynthETIC allows the user to easily specify arbitrarily long tails.

One family of models likely to thirst for testing data is that of granular (or micro-) models. These, by their nature, often attempt to model the minutiae of individual claims, and so are closely related to SynthETIC, which generates this detail.

A reasonably up-to-date summary of the literature of these models is given by De Felice and Moriconi (2019). They are also discussed by Taylor (2019), where it is pointed out that the elaborateness of the granular model with its many parameters is likely to cause essential intra-model dependencies to be overlooked or replaced by bland and unrealistic assumptions. That reference gives a couple of specific examples, one of which relates to claim payments.

Section 6.2.2 illustrates this particular issue numerically. Section 4.3 lists a number of other dependencies built into SynthETIC.

6. Example implementation of SynthETIC

An example simulation has been performed in accordance with the detailed specification set out in Appendix A. The generated experience covers 40 occurrence quarters, each tracked for 40 development quarters.

The principal features of the experience are similar to those of the reference portfolio, as set out in [4.3.1] to [4.3.7], though slightly less extreme than that case. All of those features are present in the simulated experience. Some specific detail of relevance is the following:

6.1.1. Settlement delays, in addition to depending on claim size, decline gradually by 15% over the first 20 occurrence quarters, but are stable over subsequent quarters with the exception noted in 6.1.4.
6.1.2. Base inflation occurs at 2% per annum.
6.1.3. Payment period superimposed inflation is very high (30% per annum) for the smallest claims, but zero for claims exceeding $200,000 in dollar values of payment quarter 1. The rate of inflation varies linearly between claim sizes of zero and $200,000.
6.1.4. There was a legislative change at the end of occurrence quarter 20, which, in the main, affected smaller claims in all subsequent occurrence periods. As a result, settlement delays of claims up to $20,000 (in dollar values of payment quarter 1), already reduced by 15% (see 6.1.1.), immediately decline by a further 20%, but this effect is gradually eroded over the next 10 occurrence periods. At the same time, claims of up to $50,000 reduce in size. The reduction is 40% for the smallest claims, nil for claims of $50,000, and the reduction varies linearly between these claim sizes.
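The inflation and claim size features in 6.1.2 to 6.1.4 can be written as simple functions of claim size and period. The following Python sketch is illustrative only and is not the SynthETIC R package: the function names are hypothetical, time is assumed to be measured in elapsed quarters, annual rates are assumed to convert to quarterly rates by compounding, and the per-quarter superimposed inflation rate is assumed to be linear in claim size.

```python
# Illustrative sketch; names and conventions are assumptions, not the
# SynthETIC package API.

QUARTERS_PER_YEAR = 4


def per_quarter(annual_rate: float) -> float:
    """Convert an annual compound rate to the equivalent quarterly rate."""
    return (1.0 + annual_rate) ** (1.0 / QUARTERS_PER_YEAR) - 1.0


def base_inflation_index(quarters_elapsed: float) -> float:
    """6.1.2: base inflation of 2% per annum."""
    return (1.0 + per_quarter(0.02)) ** quarters_elapsed


def superimposed_index(quarters_elapsed: float, claim_size: float) -> float:
    """6.1.3: payment period superimposed inflation.

    The quarterly rate equals the equivalent of 30% p.a. at claim size
    zero and falls linearly to nil at a claim size of $200,000.
    """
    gamma = per_quarter(0.30)
    beta = gamma * max(0.0, 1.0 - claim_size / 200_000.0)
    return (1.0 + beta) ** quarters_elapsed


def legislative_size_factor(occurrence_quarter: int, claim_size: float) -> float:
    """6.1.4: claim size reduction after occurrence quarter 20.

    The reduction is 40% at claim size zero and falls linearly to nil
    at a claim size of $50,000.
    """
    if occurrence_quarter <= 20:
        return 1.0
    return 1.0 - 0.40 * max(0.0, 1.0 - claim_size / 50_000.0)


if __name__ == "__main__":
    for size in (0.0, 25_000.0, 100_000.0, 200_000.0):
        print(
            size,
            round(superimposed_index(4, size), 4),
            round(legislative_size_factor(21, size), 4),
        )
```

Under these assumptions a full year of superimposed inflation multiplies the smallest payments by 1.30 and leaves payments on claims of $200,000 or more unchanged.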

The data features set out in [4.3.1] to [4.3.7] and 6.1.1. to 6.1.4. create dramatic breaches of the chain ladder assumptions under which the expectation of any occurrence/development period cell is simply the product of an occurrence period parameter and a development period parameter.

This opens the chain ladder to forecast error. This has been investigated as follows:
(a) The part of the simulated individual claim data relating to payment quarters 1 to 40 has been aggregated into a 40 × 40 payment triangle.
(b) The chain ladder has been applied to this triangle, and a forecast of outstanding claims up to development quarter 40 obtained (in fact, there is little claim activity beyond development quarter 40).
(c) This forecast has been compared with the "actual" amount of outstanding claims, simulated for payment quarters 41 to 79.

The results are set out in Table 6.1 where the chain ladder is seen to perform very poorly. This is perhaps unsurprising in view of the data features and the extent to which they breach chain ladder assumptions. Data sets such as this are useful for testing models that endeavour to represent data outside the scope of the chain ladder. See Section 7 for further comment.
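Steps (a) to (c) can be sketched on a toy scale. The Python code below is a minimal illustration under stated assumptions, not the analysis behind Table 6.1: the records are made up, the function names are hypothetical, and the chain ladder shown is the common volume-weighted variant with no tail factor.

```python
def aggregate(records, n):
    """Step (a): sum (occurrence, development, amount) records into an
    n x n array of incremental payments (periods are 1-indexed)."""
    cells = [[0.0] * n for _ in range(n)]
    for occurrence, development, amount in records:
        cells[occurrence - 1][development - 1] += amount
    return cells


def chain_ladder_reserves(cells, n):
    """Step (b): volume-weighted chain ladder using only the observed
    upper triangle (0-indexed cells with i + j <= n - 1)."""
    cumulative = []
    for i in range(n):
        row, running = [], 0.0
        for j in range(n - i):
            running += cells[i][j]
            row.append(running)
        cumulative.append(row)

    # Age-to-age factors from rows where both periods are observed.
    factors = []
    for j in range(n - 1):
        rows = range(n - 1 - j)
        numerator = sum(cumulative[i][j + 1] for i in rows)
        denominator = sum(cumulative[i][j] for i in rows)
        factors.append(numerator / denominator)

    reserves = []
    for i in range(n):
        latest = cumulative[i][-1]
        ultimate = latest
        for j in range(n - 1 - i, n - 1):
            ultimate *= factors[j]
        reserves.append(ultimate - latest)
    return reserves


def actual_outstanding(cells, n):
    """Step (c): simulated payments falling in the unobserved lower triangle."""
    return [sum(cells[i][j] for j in range(n - i, n)) for i in range(n)]


def pct_excess(estimate, target):
    """Percentage by which an estimate exceeds its target."""
    return 100.0 * (estimate / target - 1.0)


# Made-up records in which later occurrence periods pay out faster,
# which breaches the chain ladder assumptions.
records = [
    (1, 1, 60.0), (1, 1, 40.0), (1, 2, 50.0), (1, 3, 50.0),
    (2, 1, 120.0), (2, 2, 60.0), (2, 3, 20.0),
    (3, 1, 150.0), (3, 2, 40.0), (3, 3, 10.0),
]
cells = aggregate(records, 3)
forecast = chain_ladder_reserves(cells, 3)
actual = actual_outstanding(cells, 3)
print(forecast, actual, pct_excess(sum(forecast), sum(actual)))
```

Applied to the totals reported in Table 6.1 (657.1 against 450.0), `pct_excess` gives about 46, which agrees with the final column of that table.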

Table 6.1: Forecast claims costs based on synthetic data set

Occurrence    Estimated loss reserve               Ratio: chain ladder
quarters      Target (simulator)   Chain ladder    to target
                                   estimate
              $M                   $M              %
1 to 10         9.2                  9.5              3
11 to 20       26.2                 54.6            109
21 to 25       26.1                 68.1            161
26 to 30       61.8                103.4             67
31             20.5                 22.4              9
32             24.4                 34.8             43
33             27.8                 35.5             27
34             30.1                 44.5             48
35             26.4                 53.2            101
36             36.2                 47.8             32
37             33.0                 66.4            101
38             43.1                 44.9              4
39             39.0                 39.5              1
40             46.2                 32.4            -30
Total         450.0                657.1             46

The general structure of payments in respect of a specific claim is described in Section 4.1. Here, larger claims are envisaged as involving a settlement payment, usually shortly before closure, and with this payment tending to dominate other payments in respect of this claim. Smaller claims do not exhibit this feature.

One effect of this is that, if a large m-th partial payment is experienced, then the likelihood that the next payment in respect of the same claim will also be large is very much reduced at larger values of m. Likewise, if only small or medium payments have been observed, then the likelihood of a larger payment in the future is increased.
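Dependency between successive partial payments can be examined in simulated output by averaging the (m + 1)-th partial payment within size bands of the m-th payment, which is the kind of summary reported in Table 6.2. The Python sketch below is a minimal illustration on a handful of made-up claims; the function name, the band boundaries and the amounts are all hypothetical and are not SynthETIC output.

```python
from collections import defaultdict


def next_payment_averages(claims, bands, m_values):
    """Average size of the (m + 1)-th payment, grouped by the size band
    of the m-th payment.

    claims   : list of per-claim payment lists, in payment order
    bands    : list of (low, high) size bands, with low <= size < high
    m_values : payment numbers m to tabulate (1-indexed)
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for payments in claims:
        for m in m_values:
            if len(payments) <= m:  # this claim has no (m + 1)-th payment
                continue
            current, following = payments[m - 1], payments[m]
            for band in bands:
                low, high = band
                if low <= current < high:
                    sums[(band, m)] += following
                    counts[(band, m)] += 1
                    break
    return {key: sums[key] / counts[key] for key in sums}


# Made-up partial payments in $K, one list per claim.
claims = [
    [0.5, 0.6, 0.4],
    [0.8, 1.0],
    [5.0, 30.0, 2.0],
    [4.0, 50.0, 1.0],
    [60.0, 3.0],
]
bands = [(0, 1), (1, 10), (10, 100)]
averages = next_payment_averages(claims, bands, m_values=[1, 2])
for (band, m), value in sorted(averages.items()):
    print(band, m, value)
```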

This dependency structure stands in contrast with some granular models, which assume independence between the sizes of partial payments. This creates difficulties for the granular model, since the incorporation of the required dependencies may be awkward without the assumption of a claim payment structure such as that used in the simulator under discussion here.

The type of dependency described above is illustrated in Table 6.2, which is an abridged summary of the simulated data set. It displays, for selected values of m and various size ranges of m-th partial payments, the average size of the (m + 1)-th partial payment, where inflation has been excluded throughout. The table confirms the expected dependencies between the sizes of successive partial payments.

Table 6.2: Simulated sizes of successive partial payments (excluding inflation)

Size of the m-th    Size of (m + 1)-th partial payment for m =
partial payment     1           2           3       5       7       9
$K                  $K          $K          $K      $K      $K      $K
0 to 1              0.5         0.9         1.9     6.6     0.9     3.9
1 to 2              1.6         3.8         6.0     8.0     12.7    20.8
4 to 5              4.5         12.0        21.8    31.4    36.4    53.6
8 to 10             8.8         31.5        48.9    94.3    133.0   124.8
20 to 50            27.3        253.9       55.6    25.5    4.6     4.4
50 to 100           58.7        1099.0      8.1     8.0     8.9     10.1
over 100            no claims   no claims   32.2    31.8    33.3    37.8

7. Application of SynthETIC

The synthetic data set with the features described in Section 6.1 contains substantial breaches of the chain ladder assumptions, as noted in Section 6.2.1, and, in consequence, requires complex modelling. It is, however, possible to generate simple data sets using SynthETIC. Indeed, such data sets may be easily rendered compatible with the chain ladder.

To do so, one simply notes that compatibility is achieved if the expected distribution of claim payments across development periods is the same for all occurrence periods. Hence, SynthETIC will generate chain-ladder-compatible data if:
(a) all of its components 4.1.1 to 4.1.7 are defined to be independent of occurrence period;
(b) base inflation and calendar period superimposed inflation in 4.1.8 occur at a constant rate per period; and
(c) occurrence period superimposed inflation in 4.1.8 is independent of all other components, but otherwise can be arbitrary.

Condition (b) is the case because it is known that insertion of a constant rate of calendar period superimposed inflation in a claim triangle will change the estimated distribution of claim payments over development periods, but will also introduce a compensating change in accident period parameters so as to leave the estimated loss reserve invariant (Taylor, 2000).

Occurrence period superimposed inflation is directly reflected in the chain ladder row parameters, which can be arbitrary. Hence condition (c).

The above demonstrates that SynthETIC may be used to generate very simple or very complex data sets. There is obviously a vast range of intermediate cases, and so it follows that SynthETIC may be used to generate a collection of data sets that provide a spectrum of complexity. Such a collection may be used to present a model under test with a steadily increasing challenge. This idea is used in McGuire et al. (2018).

Different collections may be constructed with some components held constant and others subject to complex variation. In this way, one may explore the effectiveness of a contending model in the presence of different sources of complexity.
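The reasoning behind conditions (b) and (c) can be checked numerically on noise-free expected values. The Python sketch below is a minimal illustration with hypothetical function names and made-up parameters: arbitrary occurrence period parameters and a common development pattern are combined with a calendar period inflation index, and each occurrence row is then tested for a common distribution of payments over development periods.

```python
def expected_cells(row_params, dev_pattern, calendar_index):
    """Expected incremental payments a_i * b_j * index[i + j], where
    i + j is the (0-indexed) calendar period of the cell."""
    return [
        [a * b * calendar_index[i + j] for j, b in enumerate(dev_pattern)]
        for i, a in enumerate(row_params)
    ]


def development_shares(cells):
    """Each occurrence row's payments as shares of that row's total."""
    return [[x / sum(row) for x in row] for row in cells]


def is_chain_ladder_compatible(cells, tol=1e-12):
    """True when every occurrence row has the same distribution of
    payments across development periods."""
    shares = development_shares(cells)
    first = shares[0]
    return all(
        abs(s - f) <= tol for row in shares for s, f in zip(row, first)
    )


row_params = [100.0, 250.0, 80.0]   # arbitrary, as condition (c) permits
dev_pattern = [0.5, 0.3, 0.2]       # common to all occurrence periods

# Condition (b): a constant inflation rate per calendar period.
constant_index = [1.02 ** k for k in range(5)]
constant_cells = expected_cells(row_params, dev_pattern, constant_index)

# A breach of (b): inflation jumps in one calendar period only.
step_index = [1.0, 1.0, 1.0, 1.3, 1.3]
step_cells = expected_cells(row_params, dev_pattern, step_index)

print(is_chain_ladder_compatible(constant_cells))
print(is_chain_ladder_compatible(step_cells))
print(development_shares(constant_cells)[0])
```

With the constant rate, the shared development distribution differs from `dev_pattern`, but it is the same in every row, so the cells remain a product of a row parameter and a column parameter.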

The architecture of SynthETIC depends heavily on claim closures. Payments in respect of claims other than small tend to be concentrated toward the settlement date in the example reported in Section 6, because the settlement payment occurs on average three months ahead of closure (see Appendix A).

This means that claim closure date is a strong marker for the occurrence of a large payment. This tends to be realistic for Casualty lines that do not pay claims as annuities (such as Workers Compensation), and even for some major Property lines (such as Auto and Home).

Reserving models that give prominence to closure count data are discussed in Taylor (1986, 2000) and also in Huang et al. (2016). The sequence of papers by Taylor et al. (2003), Taylor and McGuire (2004), McGuire (2007) and McGuire et al. (2018), based on the same data set, also used this type of model.

These models rely, of course, on accuracy of the count data, and this is sometimes lacking. However, Taylor and Xu (2016) applied one of these models to US data from the NAIC (Meyers and Shi, 2011), which consisted of a variety of claim count qualities. They were able to demonstrate that the use of closure counts produced tighter forecasts than obtained without it.

Despite this, the literature has largely passed over such models. Some modelling is based on both claim amounts and claim notification counts, e.g. the double chain ladder (Martinez-Miranda et al., 2013). Notification counts are of obvious value in enabling estimation of ultimate numbers of incurred claims, which can serve as measures of exposures over accident years.

These exposures would usually be multiplicative in the modelling of claim amounts, with expected claim payments in a triangle cell assumed proportional to number of claims incurred in the relevant accident period. This is similar to the Payments per Claim Incurred model discussed in Taylor (1986, 2000).

However, notification counts are less adapted to modelling of any departure from this proportionality. In longer tailed lines, notification and settlement may be separated by substantial numbers of years, so the contribution of notification counts is likely to be oblique at best, and usually inferior to that of closure counts (assuming the latter to be reliable). SynthETIC, by its construction, linking claim size and partial payments to settlement delay, provides a suitable test environment for claim models based on claim settlement counts.

The ultimate form of claim data for ML models is transactional, which requires the modelling of partial payments. Relatively few contributions to the literature confront this issue, and those that do are sometimes bedevilled by unrealistic assumptions of independence of the sort discussed in Section 6.2.2. SynthETIC includes partial payments in a realistic manner, recognizing the dependencies between them, and so provides a suitable test environment for claim models at a transactional level.

8. SynthETIC repository

SynthETIC is published as an open-source R package on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=SynthETIC (Avanzi et al., 2020). The package is licensed under GPL-3. The source code is fully available and users are free to copy, modify, or redistribute the program or any of its derivative versions. The re-distribution must not impose any further restrictions on the rights granted by the GPL.

SynthETIC has functions to sequentially simulate each of the eight modules as outlined in Section 4. The default setting assumes probability distributions of the quantities as detailed in Appendix A (except in the case of base inflation), but the distributional assumptions can be easily modified by users to match their specific claims experiences. Users can choose to output their simulated claims in the form of a chain-ladder square by occurrence and development periods, or alternatively a structured data set at either a claim or a payment level. The data set at the payment level can then be used to visualise the claims development over time. A test data set generated under the current specification is also available as part of the package. A full demonstration of SynthETIC's functionalities can be accessed by running RShowDoc("SynthETIC-demo", package = "SynthETIC") in the R console after the installation of the package.

Users can install SynthETIC from the CRAN repository by running install.packages("SynthETIC") in R. A development version of the program is also available on https://github.com/agi-lab/SynthETIC. The GitHub repository contains, in addition to the package code,
• A PDF version of the package reference manual (also available on https://CRAN.R-project.org/package=SynthETIC);
• A chain-ladder analysis of the test data set (discussed in Section 6), in an Excel spreadsheet.

9. Conclusions

The foregoing sections describe an individual claim simulator, newly introduced to the literature. Its code is open source, and it is modular with unpluggable and re-pluggable modules for the convenience of the user. Default modules are provided, based broadly on real data, and incorporating the data features highlighted in [4.3.1] to [4.3.7].

Three existing simulators are discussed in Section 5.1. These are all useful within specific contexts, but none contains all features [4.3.1] to [4.3.7]. The Gabrielli-Wüthrich simulator (Section 5.1.3) is perhaps the most extensive in its allowance for dependencies between the various observations on a single claim, but it is calibrated to a single data set, and so its simulations always mirror that data set.

The simulator proposed here reflects a number of desirable (but complicated) data features, and is flexible in the form of model used to generate claim experience. Data features within its modules are easily adjusted, or the modules completely replaced. This enables the generation of a collection of data sets providing a spectrum of complexity with which to challenge a proposed model (Section 7.1).

SynthETIC may be of especial value in testing granular models. These sometimes include unrealistic assumptions of independence between different variates within the model. SynthETIC, on the other hand, contains built-in dependencies, e.g. between different partial payments.

Acknowledgements

This research was supported under Australian Research Council's Linkage (LP130100723) and Discovery (DP200101859) Projects funding schemes. Melantha Wang acknowledges financial support from UNSW Australia Business School. The views expressed herein are those of the authors and are not necessarily those of the supporting organisations.

References

Arjas, E., 1989. The claims reserving problem in non-life insurance: Some structural ideas. ASTIN Bulletin 19 (2), 139–152.
Avanzi, B., Taylor, G., Wang, M., Wong, B., 2020. SynthETIC: Synthetic experience tracking insurance claims. https://CRAN.R-project.org/package=SynthETIC.
Berquist, J. R., Sherman, R. E., 1977. Loss reserve adequacy testing: A comprehensive, systematic approach. In: Proceedings of the Casualty Actuarial Society. Vol. 64. pp. 123–184.
CAS Loss Simulation Model Working Party, 2007. Parameterizing the loss simulation model.
De Felice, M., Moriconi, F., 2019. Claim watching and individual claims reserving using classification and regression trees. Risks 7 (4), 1–36.
Fisher, W. H., Lange, J. T., 1973. Loss reserve testing: a report year approach. Proceedings of the Casualty Actuarial Society 60, 189–207.
Gabrielli, A., Wüthrich, M. V., 2018. Individual claims history simulation machine. Risks 6 (2), 29.
Harej, B., Gächter, R., Jamal, S., 2017. Individual claim development with machine learning.
Hesselager, O., 1994. A Markov model for loss reserving. ASTIN Bulletin 24, 183–93.
Huang, J., Wu, X., Zhou, X., 2016. Asymptotic behaviors of stochastic reserving: aggregate versus individual models. European Journal of Operational Research 249, 657–666.
Jewell, W. S., 1989. Predicting IBNR events and delays. ASTIN Bulletin 19 (1).
Martinez-Miranda, M. D., Nielsen, J. P., Verrall, R. J., 2013. Double chain ladder and Bornhuetter-Ferguson. North American Actuarial Journal 17 (2), 101–113.
McGuire, G., 2007. Individual claim modelling of CTP data. In: XIth Accident Compensation Seminar. Institute of Actuaries of Australia, Sydney, Australia.
McGuire, G., Taylor, G., Miller, H., 2018. Self-assembling insurance claim models using regularized regression and machine learning. Variance (in press). Also in SSRN.
Meyers, G., 2015. Stochastic Loss Reserving Using Bayesian MCMC Models. Vol. 1 of CAS Monograph Series. Arlington, USA: Casualty Actuarial Society.
Meyers, G., Shi, P., September 2011. Loss Reserving Data Pulled From NAIC Schedule P.
Norberg, R., 1993. Prediction of Outstanding Liabilities in Non-Life Insurance. ASTIN Bulletin 23 (1), 95–115.
Norberg, R., 1999. Prediction of Outstanding Liabilities - II Model Variations and Extensions. ASTIN Bulletin 29 (1), 5–27.
Reid, D. H., 1978. Claim reserves in general insurance. Journal of the Institute of Actuaries 105, 211–296.
Richman, R., 2018. AI in actuarial science. SSRN.
Taylor, G., 2000. Loss Reserving: An Actuarial Perspective. Huebner International Series on Risk, Insurance and Economic Security. Kluwer Academic Publishers.
Taylor, G., 2019. Claim models: granular and machine learning forms. Risks 7 (3), 82.
Taylor, G., McGuire, G., 2004. Loss reserving with GLMs: a case study. In: Spring 2004 Meeting of the Casualty Actuarial Society, Colorado Springs, Colorado.
Taylor, G., McGuire, G., Greenfield, A., 2003. Loss reserving: Past, present and future, University of Melbourne Research Paper.
Taylor, G. C., 1986. Claims reserving in non-life insurance. North-Holland, Amsterdam, Netherlands.
Taylor, G. C., Xu, J., 2016. An empirical investigation of the value of finalisation count information to loss reserving. Variance in press.
Wüthrich, M. V., Merz, M., 2008. Stochastic claims reserving methods in insurance. John Wiley & Sons, Chichester.

Appendix A. Parametrizations

The following table displays the formal parameterization of modules 4.1.1 to 4.1.8 for the example of Section 6.

Parameter type / Functional form / Numerical parameters

Global
  Time unit = 1 / ,

Claim occurrence
  I = 40
  E_i = 12000
  λ_i = 0 .

Claim size
  Power-normal
  S . ir ∼ N(9 . ,

Claim notification
  Weibull
  Mean = min(3 , max(1 , − ln( s_ir /

Claim closure
  Weibull
  Mean = a(i) min(25 , max(1 , s_ir /
  a(i) = max(0 . , − . i ), subject to the overriding condition that, for s_ir < i ≥ a(i) = min(0 . , .65 + 0 . i −

Partial payments: Number
  For s_ir ≤ M_ir = 1) = Prob( M_ir = 2) = 1 /
  ≤ s_ir ≤ M_ir = 2) = 1 / , Prob( M_ir = 3) = 2 /
  s_ir > M_ir is geometric, with minimum 4 and mean = min(8 , s_ir /

Partial payments: Amounts
  F_{P|s} = Beta
    Mean = 1 − min(0 . , .75 + 0.04 ln( s_ir /
  F_{Q|s} = Beta
    Mean = 0.9
    Coefficient of variation = 3%
  F_{P̂|m} = Beta
    Mean = ( − ( p ( m_ir − ir + p ( m_ir ) ir ) ) / ( m_ir −

Distribution of payments over time
  F_{D̂|m} = Weibull
  For m_ir ≥ m = m_ir ,
    Mean = 3 months (converted to the relevant time units)
    Coefficient of variation = 20%
  For m_ir ≥ m < m_ir , or m_ir < m_ir
    Coefficient of variation = 35%

Base inflation
  f(t) = (1 + α)^t, where α is equivalent to an increase of 2% p.a., allowing for the relevant time units

Superimposed inflation
  g_O(u | s) = 1 for u ≤ 20
             = 1 − . , − s/ u >
  g_C(t | s) = (1 + β(s))^t with β = γ max(0 , − s/ γ is equivalent to a 30% p.a. inflation rate, allowing for the relevant time units

Notes:
1. This component is defined in terms of claim size. The definition here displays claim size in raw (uninflated) units. The inputs to the example application of SynthETIC, on the other hand, express claim sizes as multiples of a reference claim size equal to 200,000. For example, the amount of 100,000 that appears in the definition of claim notification delay is expressed in SynthETIC as 0 . × ,,