Towards Explainability of Machine Learning Models in Insurance Pricing
Kevin Kuo a,∗, Daniel Lupton b

a Kasa AI, 3040 78th Ave SE
b Taylor & Mulder, 10508 Rivers Bend Lane, Potomac, MD 20854, USA

∗ Corresponding author. Email addresses: [email protected] (Kevin Kuo), [email protected] (Daniel Lupton)
Abstract
Machine learning methods have garnered increasing interest among actuaries in recent years. However, their adoption by practitioners has been limited, partly due to the lack of transparency of these methods as compared to generalized linear models. In this paper, we discuss the need for model interpretability in property & casualty insurance ratemaking, propose a framework for explaining models, and present a case study to illustrate the framework.
1. Introduction
Risk classification for property & casualty (P&C) insurance rating has traditionally been done with one-way, or univariate, analysis techniques. In recent years, many insurers have moved towards using generalized linear models (GLM), a multivariate predictive modeling technique, which addresses many shortcomings of univariate approaches and is currently considered the gold standard in insurance risk classification. At the same time, machine learning (ML) techniques such as deep neural networks have gained popularity in many industries due to their superior predictive performance over linear models (LeCun, Bengio, and Hinton 2015). In fact, there is a fast growing body of literature on applying ML to P&C reserving (Kuo 2019; Wüthrich 2018; Gabrielli, Richman, and Wüthrich 2019; Gabrielli 2019). However, these ML techniques, often considered to be completely "black box", have been less successful in gaining adoption in pricing, which is a regulated discipline and requires a certain amount of transparency in models.

If insurers can gain more insight into how ML models behave in risk classification contexts, it would increase their ability to reassure regulators and the public that accepted ratemaking principles are met. Being able to charge more accurate premiums would, in turn, make the risk transfer system more efficient and contribute to the betterment of society. In this paper, we aim to take a step towards liberating actuaries from the confines of linear models in pricing projects, by proposing a framework for explaining ML models for ratemaking that regulators, practitioners, and researchers in actuarial science can build upon.

The rest of this paper is organized as follows: Section 2 provides an overview of P&C ratemaking, Section 3 discusses the importance of interpretation, Section 4 discusses model interpretability in the context of ratemaking and proposes specific tasks for model explanation, Section 5 describes current model interpretation techniques and applies them to the tasks defined in the previous section, and Section 6 concludes.
2. Property and Casualty Ratemaking
Early classification ratemaking procedures were typically univariate in nature. For example, Lange (1966) notes that (at that time) most major lines of insurance used univariate methods based around the same principle: distributing an overall indication to territorial relativities or classification relativities based on the extent to which they deviated from the average experience.

Bailey and Simon (1960) introduced minimum bias methods, which were expanded throughout the 60s, 70s, and 80s. As computing power developed, minimum bias began to give way to GLMs, with
papers such as Brown (1988) and Mildenhall (1999) bridging the gap between the methods.

Arguably, GLMs predate minimum bias procedures by a significant margin. The term was coined by Nelder and Wedderburn (1972), but generalizations of least squares linear regression date back at least to the 1930s. Like minimum bias methods, GLMs did not become mainstream in actuarial science for some time. For example, the syllabus of basic education of the Casualty Actuarial Society (CAS) does not seem to include any mention of GLMs prior to Brown (1988) in the 1990 syllabus. From there, GLMs seem to have received only passing mention until 2006 with the introduction of Anderson et al. (2005) to the syllabus. Beginning in 2016, the CAS introduced Goldburd, Khare, and Tevet (2016) to the syllabus, which offers a comprehensive guide to GLMs.

Paralleling the development of GLMs was the development of machine learning algorithms throughout the middle part of the 20th century. Detailed histories of machine learning may be found in sources such as Nilsson (2009) and Wang and Raj (2017). As with GLMs, machine learning was relatively unpopular in actuarial science until the last ten years, as computing power has become cheaper and more easily available and as machine learning software packages have obviated the need for developing analyses from scratch each time an analysis is performed. Due to the breadth of machine learning as a field, it is difficult to identify the first time it entered the CAS syllabus; however, cluster analysis (in the form of k-means) seems to have been first included in 2011 with Robertson (2009). More recently, the CAS MAS-I and MAS-II exams introduced in 2018 have included machine learning explicitly.

Within the area of ratemaking, machine learning is still in its infancy. A significant portion of machine learning applications to ratemaking has been in the context of automobile telematics, such as Gao, Meng, and Wuthrich (2018), Gao and Wuthrich (2018), Gao and Wuthrich (2019), Roel, Antonio, and Claeskens (2018), or Wuthrich (2017). Presumably this focus has been a result of the high dimensionality and complexity of telematics data, making it a field in which the unique abilities of machine learning techniques give a clear advantage over traditional approaches.

Outside of telematics, Yang, Qian, and Zou (2018) use a gradient tree-boosting approach to capture non-linearities that would be a challenge for GLMs. Henckaerts et al. (2018) make use of generalized additive models (GAM) to improve predictions of GLMs. Many researchers, in an apparent effort to demonstrate the range of possibilities and advantages of machine learning, have approached the topic by comparing many different machine learning algorithms within a single study, such as in Dugas et al. (2003), Noll, Salzmann, and Wuthrich (2018), and Spedicato, Dutang, and Petrini (2018). These studies make use of such varied techniques as regression trees, boosting machines, support vector machines, and neural networks.
Regardless of the method employed for determining the risk of various classifications, the actual process of setting rate relativities typically involves some variation of the following steps:

1. Obtain relevant policy-level data
2. Prepare data for analysis
3. Perform analysis on the data, employing desired method(s) to estimate needed rates
4. Select final rates based on rate indications
5. Present rates to the regulator, including explanation of the steps followed to derive the rates
6. Answer questions from regulators regarding the method employed

The focus of this paper is on steps 5 and 6. In many states, rate filings that exceed certain thresholds for magnitude of rate changes, or filings that make use of new or sophisticated predictive models, may be subject to particular regulatory scrutiny. In these cases, it is necessary to be able to explain the results of the modeling process in a way that is understandable without sacrificing statistical rigor.

It should be noted that communicating results is not simply a method of passing regulatory muster. Generating interpretable modeling output is an important - even essential - facet of model checking. Actuaries are bound by relevant standards to be able to exercise appropriate judgment in selecting risk characteristics as part of a risk classification system per Actuarial Standard of Practice 12 ("Risk Classification"). Therefore, the techniques discussed in this paper may be viewed through the lens of providing useful information to regulators, but they should also be considered as part of a thorough vetting of any rating model.

Although the focus of this paper is on communication to regulators, it should be said that selecting final rates based on indications (step 4 in the list above) may pose a unique challenge for black-box models. This, too, provides strong motivation for techniques that could add to the modeler's - or any stakeholder's - understanding of the model, such as the relative importance of variables or the shapes of response curves. Such techniques could be usefully employed in making decisions about how best to select rates.

Similarly, although the focus of this paper is on communication in a pricing context, the techniques explored in this paper (and many of the concerns discussed) may also be relevant to other contexts, such as claim-level reserving or analytics, or other applications of machine learning to the insurance industry.
3. The Need to See Inside the Black Box
Within the actuarial profession, Actuarial Standard of Practice 41 ("Actuarial Communications") notes that ". . . another actuary qualified in the same practice area [should be able to] make an objective appraisal of the reasonableness of the actuary's work as presented in the actuarial report." ("Actuarial Standard of Practice No. 41 - Actuarial Communications" 2010) Underlying this requirement is an assumption that the hypothetical other actuary qualified in the same practice area is adequately familiar with the relevant techniques employed. Although the syllabus of basic education is constantly changing, there has at times been an assumption that all techniques and assumptions that have ever been a part of the syllabus of basic education needn't be explained from first principles in general actuarial communications, and that an actuary practicing in the same field should be able to make an objective appraisal of the results from the methods found in the syllabus. This is notable because, beginning with the introduction of the CAS MAS-I and MAS-II examinations in July of 2018, several machine learning models were formally included in the syllabus of basic education. These exams cover a wide range of topics, such as splines, clustering algorithms, decision trees, boosting, and principal components analysis (Casualty Actuarial Society 2018).

Nevertheless, machine learning poses something of a special challenge for ASOP 41 for several reasons:

1. Machine learning models can be very ad hoc compared to traditional statistical models.
2. Because many machine learning models do not assume an underlying probability distribution or stochastic process, they may not admit of standard metrics for model comparison (e.g., it is not straightforward to calculate an AIC over a neural network).
3. Machine learning methods are often combined into ensembles that may not be easily separated and that may, as a collection, cease to resemble a single standard version of a model.
4. Machine learning models can be "black boxes" insofar as the final form of the response curve cannot be easily predicted and may depend heavily on the available data (which may not, in turn, be available to the reviewer).

This last item raises a final interesting issue. GLMs and their ilk are often fitted using one of a handful of standard and well-understood approaches (e.g., maximum likelihood estimation). However, this is not possible in general with machine learning models, as machine learning algorithms often use loss surfaces so complex that it may not be feasible to calculate the global minimum of the surface. Certainly, closed form representations of the loss surfaces are not generally available. For this reason, the training phase of a machine learning model is, in many ways, just as important to one's understanding as the model form and the data on which the model is fitted. Because the final model result is inseparable from these three components (training method, model form, and data), it is not generally adequate to know just the method employed in order to make an objective appraisal of the reasonableness of the result.

These issues also pose particular challenges with respect to other standards. For instance, as discussed previously, ASOP 12 requires actuaries to be able to exercise appropriate judgment about risk classification systems. The recent ASOP 56 ("Models") speaks to more general concerns in all practice areas that might make use of models.
ASOP 56 requires the actuary to "make reasonable efforts to confirm that the model structure, data, assumptions, governance and controls, and model testing and output validation are consistent with the intended purpose." ("Actuarial Standard of Practice No. 56 - Modeling" 2019) All of these efforts may be hampered if it is not possible to peer into the black box of the model.

It should also be noted that these comments only apply within the actuarial profession. Outside of the actuarial profession, communication of results may be more challenging. A 2017 survey conducted by the Casualty Actuarial and Statistical Task Force of the National Association of Insurance Commissioners (NAIC) found that the plurality of responding regulators identified "Filing complexity and/or a lack of resources or expertise" as a key challenge that impedes their ability to review GLMs or other predictive models (National Association of Insurance Commissioners 2017). Given that machine learning algorithms are generally regarded as more complex than GLMs, this implies that the challenge of communicating machine learning model results is significant.

In response to the same survey, 33 state regulators noted that it would be helpful or very helpful for the NAIC to develop information and tools to assist in reviewing rate filings based on GLMs, and 34 noted that it would be helpful to develop similar items to assist in reviewing "Other Advanced Modeling Techniques." One outgrowth of this need was the development of a white paper, National Association of Insurance Commissioners (2019), on best practices for regulatory review of predictive models. The white paper focuses on review of GLMs, particularly with respect to private passenger automobile and homeowners' insurance. Some of the guidance offered in this regard is therefore not strictly applicable to the review of machine learning models. For example, as previously noted, p-values are not a concept that translates well to deterministic machine learning algorithms. However, among the guidance applicable to machine learning algorithms are the following:

• Determine the extent to which the model causes premium disruption for individual policyholders and how the insurer will explain the disruption to individual consumers that inquire about it.
• Determine that individual input characteristics to a predictive model are related to the expected loss or expense differences in risk. Each input characteristic should have an intuitive or demonstrable actual relationship to expected loss or expense.
• Determine that individual outputs from a predictive model and their associated selected relativities are not unfairly discriminatory.

The last of these items is an entire topic unto itself. The methods and concepts introduced in this paper are useful for exploring the question of whether rates are appropriately related to risk of loss as defined by the variables used in the model, but there are many other aspects of avoiding unfair discrimination that are outside the scope of this paper. The methods in this paper may help in understanding the model, which is a necessary precursor to addressing the question of unfair discrimination.

The items in this list are by no means exhaustive, but they pertain to the concept of model interpretability for ratemaking that we develop next.
4. Interpretability in the Ratemaking Context
In this section, we attempt to develop a working definition of interpretability for ratemaking applications. While we will not provide a comprehensive survey of the prolific and fast evolving ML interpretability literature, we draw from it as appropriate in setting the stage for our discussion. Even among researchers in the subject, there is not a consensus on the definition of interpretability; here are a few from frequently cited papers:

• Ability to explain or to present in understandable terms to a human (Doshi-Velez and Kim 2017);
• The degree to which an observer can understand the cause of a decision (Biran and Cotton 2017); and
• A method is interpretable if a user can correctly and efficiently predict the method's results (Kim, Khanna, and Koyejo 2016).

We motivate our discussion by considering several aspects of interpretability. As we proceed through the points below, we aim to arrive at a more scoped and relevant definition of what it means for a pricing model to be interpretable. In the remainder of this section, we clarify a couple of concepts regarding interpretable classes of models and the computational transparency of ML models, outline frameworks for understanding the communication goals of interpretability, then discuss a potential framework for implementing ML interpretability in practice.
In the actuarial science literature, the GLM is probably the most oft-cited example of an easily interpretable model. Given a set of inputs, we can easily reason about what the output of the model is. As an illustrative example, consider a claim severity model with driver age, sex, and vehicle age as predictors; assuming a log link function and letting $Y$ denote the response, we have

$$\log(E[Y]) = \beta_0 + \beta_1 \cdot \mathrm{age} + \beta_2 \cdot \mathrm{vehicle\_age} + \beta_3 \cdot \mathrm{sex\_male}. \qquad (1)$$

Here, we can tell, for example, what the model would predict for the expected severity if we were to increase age by a certain amount, all else being equal, because the relationship between the predictor and the response is simply multiplication by the coefficient $\beta_1$ and applying the inverse link function (a code sketch at the end of this subsection makes this concrete).

Another commonly cited example of an interpretable model is a decision tree. An illustrative example is shown in Figure 1. Here, the prediction is arrived at by following a sequence of if-else decisions.

[Figure 1: A simple decision tree for loss cost prediction.]

Now, it is worth pointing out that, when declaring that GLMs or decision trees are interpretable models, we are implicitly assuming that we are considering only a handful of predictors. In fact, the ease with which we can reason about a model declines as the number of predictors, transformations of them, and interactions increase, as in the following (somewhat pathological) example:

$$\log(E[Y]) = \beta_0 + \beta_1 \cdot \mathrm{age} + \beta_2 \cdot \mathrm{vehicle\_age} + \beta_3 \cdot \mathrm{vehicle\_age}^2 + \beta_4 \cdot \mathrm{age} \cdot \mathrm{vehicle\_age} + \beta_5 \cdot \mathrm{sex\_male} + \beta_6 \cdot \mathrm{sex\_male} \cdot \mathrm{age}. \qquad (2)$$

Similarly, one can see in Figure 2 that larger trees are tough to reason about. In other words, even when working within the framework of an "interpretable" class of models, we may still end up with something that many would consider "black box."

Another occasional misconception is that we have no visibility into how some ML models compute predictions, which renders them uninterpretable. Outside of proprietary algorithms, all common ML models, including neural networks, gradient boosted trees, and random forests, are well studied and have large bodies of literature documenting their inner workings. As an example, a fitted feedforward neural network is simply a composition of linear transformations followed by nonlinear activation functions. As in Equation 2, one can write down the mathematical equation for calculating the prediction given some inputs, but it may be difficult for a human to reason about it. We show later that we can still provide explanations of completely "black box" models, but it is important to note that ML model predictions are still governed by mathematical rules, and are deterministic in most cases.
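As referenced above, the following is a minimal R sketch of fitting and reading a model like Equation (1). The data frame `claims` and its column names are hypothetical illustrations, not objects from the paper's case study; a Gamma family is assumed as a common severity choice:

```r
# Fit a claim severity GLM with a log link, as in Equation (1).
severity_fit <- glm(
  severity ~ age + vehicle_age + sex,
  family = Gamma(link = "log"),
  data = claims  # assumed data frame with these columns
)

# With a log link, exponentiated coefficients act multiplicatively:
# increasing age by one year scales the expected severity by exp(beta_1),
# all else being equal.
exp(coef(severity_fit))
```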
Hilton (1990) proposed a framework, later interpreted by Miller (2017) in the context of ML, for understanding model explanations as conversations or social interactions. One consequence of this identification is that explanations need to be relevant to the audience. This framework is consistent with ASOP 41, which formulates a similar requirement in terms of an intended user of the actuarial communication. In developing, filing, and operationalizing a pricing model, one needs to accommodate a variety of stakeholders, each of whom has a different set of questions, assumptions, and technical capacity. First, there are internal stakeholders at the company, which include management and underwriters. While some of the individuals in this audience may be technical, they are likely less familiar with predictive modeling techniques than the actuaries and data scientists who build the models. Next, we have the regulators, who may have limited resources to review the models, and will focus on a specific list of questions motivated by statute and public policy. Finally, we have potential policyholders, who have an interest (perhaps more so than the other parties) as they are responsible for paying the premiums.

[Figure 2: A more complex decision tree. This is still much simpler than typical realistic examples.]
It is interesting to note that the modelers, who are most familiar with the models, tend to be the same people designing and communicating the explanations. This poses a challenge that Miller, Howe, and Sonenberg (2017) call "inmates running the asylum", where the modelers design explanations for themselves rather than the intended audience. For example, they may be interested in technical questions, such as extrapolation behavior, and shape the explanations accordingly, which may be irrelevant to a prospective policyholder.

Another point outlined in Miller's survey (Miller 2017) is that explanations are contrastive. In other words, people are often interested not in why something happened, but rather why it happened instead of something else. For example, policyholders might not care exactly how their auto premiums are computed, but would like to know why they are being charged more than their coworkers who drive similar vehicles. As an extension, policyholders may want to know what they can change in order to obtain lower premiums.
With the above considerations in mind, we propose a potential framework for interpreting ML models for insurance pricing: the actuarial profession, in collaboration with regulators and representatives of the public, defines a set of questions to be answered by explanations accompanying ML models, along with acceptance criteria and examples of successful explanations. In other words, interpretability for our purposes is defined as the ability of a model's explanations to answer the posed questions.

It should be noted that no ideal set of questions exists that would encompass all potential models. Rather, the actuary must consider what aspects of the model would raise questions from the perspective of the model's intended users. We propose that relevant stakeholders, by providing example questions and answers, would inherently provide guidance by which actuaries can reasonably anticipate the kinds of specific questions most important to those stakeholders and address them proactively. These questions should relate to existing guidelines, such as those described in National Association of Insurance Commissioners (2019) and outlined in Section 3, standards of practice, and regulation, and in fact should not be specific only to ML models. By conceptualizing a set of questions, we reduce the burden on both companies and regulators; this is especially important for the latter, who are already resource constrained while facing an increasing variety of models being filed. This format should also be familiar to actuaries who are accustomed to adhering to specific guidelines in, for example, ASOPs. Like the ASOPs, we envision that these questions and guidelines will be continually updated to reflect feedback obtained and advances in research.

While the realization of a set of such guidelines is an ambitious undertaking beyond the scope of this paper, we present in the next section a sample set of questions and techniques one can leverage to answer them. The goal of these case studies is twofold: to more concretely illustrate the proposed framework, and to expose the actuarial audience to modern ML interpretation techniques.

5. Applying Model Interpretation Techniques
Now that we have established a framework for model interpretation in the form of asking and answering relevant questions, we demonstrate examples of such exchanges via an illustrative case study. Analytically, our starting point is a fitted deep neural network model for predicting loss costs. As the modeling details are of secondary importance, they are available in Appendix A. The questions that we ask of the model are as follows:

1. What are the most important predictors in the model? Put another way, to what extent do the predictors improve the accuracy of the model?
2. How does the predicted loss cost change, on average, as we change an input?
3. For a particular policyholder, how does each characteristic contribute to the loss cost prediction?

The techniques we utilize to answer these questions are permutation variable importance, partial dependence plots, and additive variable attributions, respectively. In our discussion, we adopt the organization of techniques and some notation presented in Molnar and others (2018) and Biecek and Burzykowski (2019), which are comprehensive references on the most established ML interpretation techniques.
Before we dive into answering the questions, we present a brief taxonomy of ML interpretation techniques. Rather than attempting an exhaustive classification, the goal is to orient ourselves among broad categories of techniques, so we can map them to the tasks indicated by the questions being asked. For our purposes, model interpretation techniques can be categorized across two dimensions: intrinsic vs. post-hoc and global vs. local.
Intrinsic model interpretation draws conclusions from the structure of the fitted model and is what we typically associate with "interpretable" classes of models. This is only viable for models with simple structures, such as the sparse linear model and shallow decision tree we see in Section 4, where we arrive at explanations by reading off parameter estimates or a few decision rules. For algorithms that produce models with complex structure that do not lend themselves easily to intrinsic exploration, we can appeal to post-hoc techniques. This class of techniques interrogates the model by presenting it with data for scoring and observing the prediction behavior of the model. These techniques are concerned with only the inputs and outputs, and hence are agnostic of the model itself, which means they can also be applied to simple models. Since most useful ML models have a level of complexity beyond the threshold of intrinsic interpretability, we focus on model-agnostic techniques in our case study. As we will see later on, the data that we present to the models are usually perturbed variations of test data.
Along the other dimension, we categorize model interpretations as global, or model-level, and local, or instance-level. The former class provides insights with respect to the model as a whole. Some examples of these explanations include variable importances and sensitivities, on average, of the predicted response with respect to individual predictors. Note that these methods may be compared to the methods described in Goldburd, Khare, and Tevet (2016), Chapter 7, which focus on global interpretation of GLMs. In our case study, questions 1 and 2 are associated with global interpretations. On the other hand, question 3 pertains to an individual prediction, which would fall in the local, or instance-level, category. In addition to individual variable attribution, we can also inquire about what would happen to the current predicted response if we were to perturb specific predictor variables.
Having aligned the questions with the categories of interpretation techniques, we now introduce a selection of appropriate techniques to answer them.

"What are the most important predictors in the model?"

For linear models and their generalizations, and some ML models, measures of variable importance can be obtained from the fitted model structure. In the case of GLMs, one might observe the magnitudes of the estimated coefficients or $t$-statistics, whereas for random forests, one might use out-of-bag errors (Breiman 2001). For more complex models, such as the neural network in our case study, we need to devise another approach.

We follow the methodology of permutation feature importance as described in Fisher, Rudin, and Dominici (2018), and utilize the notation introduced by Biecek and Burzykowski (2019). The gist of the technique is as follows: to see how important a variable is, we make predictions without it and see how much worse off we are in terms of accuracy. One way to achieve this would be to re-fit the model many times (as many times as the number of variables). However, this may be intractable with lengthy model training times or large numbers of variables, so a more popular approach is to instead keep the same fitted model but permute the values of each predictor.

More formally, let $y$ denote the vector of responses, $X$ denote the matrix of predictor variables, $\hat{f}$ denote the fitted model, and $L = L(\hat{f}(X), y)$, where $\hat{f}$ applies to $X$ rowwise, denote the value of the loss function, which is mean squared error in the case of regression. Now, if $\tilde{X}^j$ denotes the predictor matrix where the $j$th variable has been permuted, then we can compute the loss with the permuted dataset as $L^{-j} = L(\hat{f}(\tilde{X}^j), y)$. Here, by permuting a variable, we mean that we randomly rearrange the values in the column of data associated with the variable. With this, we define the variable importance $VI_j$ as $L^{-j} - L$; a code sketch of this computation appears at the end of this subsection.

In Figure 3, we show a plot of variable importances. In our particular example, we see that the "make" variable contributes most to the accuracy of the model, with "sex" contributing the least. This provides a way for the audience to quickly glance at the most relevant variables, and ask further questions as necessary.

[Figure 3: Permutation feature importances for the neural network model.]

Note that these measures do not provide information regarding the directional sensitivity of the predictors on the response. Also, when there are correlated variables, one should be careful about interpretation, as the result may be biased by unrealistic records in the permuted dataset. Another ramification of a group of correlated variables is that their inclusion may cause each to appear less important than if only one is included in the model.
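The following is a minimal from-scratch sketch of the permutation importance computation described above, not the implementation used for the paper's figures. The names and arguments (`pred_fun`, a prediction wrapper; `X`, a data frame of predictors; `y`, the response vector) are assumptions for illustration:

```r
# Permutation feature importance: VI_j = L^{-j} - L, with mean squared
# error as the loss L, as in the regression setting described above.
permutation_importance <- function(pred_fun, X, y, seed = 42) {
  set.seed(seed)
  mse <- function(pred) mean((pred - y)^2)
  base_loss <- mse(pred_fun(X))          # loss L on the intact data
  vi <- sapply(names(X), function(j) {
    X_perm <- X
    X_perm[[j]] <- sample(X_perm[[j]])   # randomly rearrange column j only
    mse(pred_fun(X_perm)) - base_loss    # increase in loss from permuting j
  })
  sort(vi, decreasing = TRUE)
}
```

In practice one would average over several permutations per variable to reduce the noise introduced by any single shuffle.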
"How does the predicted loss cost change, on average, as we change an input?"

For this question, we again consider first how it would be answered in the GLM setting. When the input predictor in question is continuous, we can answer the question by looking at the estimated coefficient, which provides the change in the response per unit change in the predictor (on the scale of the linear predictor). For non-parametric models and neural networks, where no coefficients are available, we can appeal to partial dependence plots (PDP), first proposed by Friedman (2001) for gradient boosting machines (GBM).

To describe PDP, we need to introduce some additional notation. Let $x^j$ denote the input variable of interest. Then we define the partial dependence function as

$$h(z) = E_{X^{-j}}\left[\hat{f}(x \mid x^j = z)\right], \qquad (3)$$

where the expectation is taken over the distribution of the other predictor variables. In other words, we marginalize them out so we can focus on the relationship between the predicted response and the variable of interest. Empirically, we estimate $h$ by

$$\hat{h}(z) = \frac{1}{N} \sum_{i=1}^{N} \hat{f}(x_i \mid x_i^j = z), \qquad (4)$$

where $N$ is the number of records in the dataset. A code sketch of this estimator appears at the end of this subsection.

In Figure 4, we exhibit the PDP for the "vehicle age" variable. We see that the average predicted loss cost decreases with vehicle age until the latter is around 18. Note that the accompanying histogram shows that the data is quite thin for vehicle age greater than 18, so the apparent upward trend to the right is driven by just a few data points. This information allows the modeler and stakeholders to consider whether it is reasonable for the anticipated loss cost to follow this shape.

[Figure 4: Partial dependence plot for the neural network model.]

The question posed here is particularly important for regulators, who would like to know whether each variable affects the prediction in the direction that is expected, based on intuition, experience, and existing models. During the model development stage, PDP can also be used as a reasonableness test for candidate models by identifying unexpected relationships for the analyst to investigate.

As with permutation feature importance, one should be careful when interpreting PDP when there are strongly correlated variables. Since we average over the marginal distribution of the rest of the variables, we may take into account unrealistic data (e.g., high vehicle age for a model that is brand new). To address this drawback, alternative visualization techniques have been proposed, such as accumulated local effect (ALE) plots, which take expectations over conditional, rather than marginal, distributions (Apley and Zhu 2016).
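As a companion to Equation (4), here is a minimal sketch of the empirical partial dependence estimator, again using the assumed `pred_fun`/`X` conventions from the permutation importance sketch rather than the paper's actual code:

```r
# Empirical partial dependence, Equation (4): for each grid value z,
# set x^j = z for every record and average the model's predictions.
partial_dependence <- function(pred_fun, X, var, grid = NULL) {
  if (is.null(grid)) grid <- sort(unique(X[[var]]))
  h_hat <- sapply(grid, function(z) {
    X_mod <- X
    X_mod[[var]] <- z        # overwrite the variable of interest
    mean(pred_fun(X_mod))    # average prediction over all records
  })
  data.frame(value = grid, partial_dependence = h_hat)
}
```

Plotting `partial_dependence` against `value` yields a curve like the one described for vehicle age above.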
"For a particular policyholder, how does each characteristic contribute to the loss cost prediction?"

In the previous two examples, we looked at model-level explanations; now we move on to one where we investigate one particular prediction instance. As before, we consider how we would approach the question for linear models. For a GLM with a log link, common in ratemaking applications, we start with the base rate, and then the exponentiated coefficients have multiplicative effects on the final rate. Similar to the previous examples, for ML models in general we do not have directly interpretable weights. Instead, one way to arrive at variable contributions is to calculate the change in the expected model prediction for each predictor, conditioned on the other predictors.

Formally, for a fitted model $\hat{f}$, a given ordering of the variables $X^1, \ldots, X^p$, where $p$ is the number of predictor variables, and a specific instance $x_*$, we would like to decompose the model prediction $\hat{f}(x_*)$ into

$$\hat{f}(x_*) = v_0 + \sum_{j=1}^{p} v(j, x_*), \qquad (5)$$

where $v_0$ denotes the average model response, and $v(j, x_*)$ denotes the contribution of the $j$th variable in instance $x_*$, defined as

$$v(j, x_*) = E_X\left[\hat{f}(X) \mid X^1 = x_*^1, \ldots, X^j = x_*^j\right] - E_X\left[\hat{f}(X) \mid X^1 = x_*^1, \ldots, X^{j-1} = x_*^{j-1}\right]. \qquad (6)$$

Hence, the contribution of the $j$th variable to the prediction is the incremental change in the expected model prediction when we set $X^j = x_*^j$, assuming the other variables take their values in $x_*$. Note here that this definition implies that the order in which we consider the variables affects the results. Empirically, the expectations in (6) are calculated by sampling the test dataset; a code sketch follows at the end of this subsection.

In Figure 5, we exhibit a waterfall plot of variable contributions. The "intercept" value denotes the average model prediction and represents the $v_0$ term in Equation (5). The predicted loss cost for this particular policyholder is slightly less than average; the characteristics that make this policyholder more risky are the fact that he is a male between the ages of 18 and 25; counteracting the risky driver characteristics are the vehicle properties: it is a GM vehicle built domestically and is seven years old.

[Figure 5: Variable contribution plot for the neural network model.]

Instance-level explanations are useful for investigating specific problematic predictions generated by the model. Regulators and model reviewers may be interested in variable contributions for the safest and riskiest policyholders to see if they conform to intuition. A policyholder with a particularly high premium may wish to find out which of their characteristics contribute to it, and may follow up with a question about how to lower it, which would require another type of explanation.

As noted earlier, the ordering of variables has an impact on the contributions calculated, especially for models that are non-additive, which could cause inconsistent explanations. There are several approaches to ameliorate this phenomenon, including selecting variables with the largest contributions first, including interaction terms, and averaging over possible orderings. The last of these ideas is implemented by Lundberg and Lee (2017) using Shapley values from cooperative game theory, and is referred to as Shapley additive explanations (SHAP). These approaches are discussed further in Biecek and Burzykowski (2019) and its references. Shapley values were also used in Mango (1998), which has appeared on the CAS syllabus starting in 2004, in the context of determining how to allocate catastrophe risk loads between multiple accounts.
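Below is a rough sketch of the additive break-down in Equations (5) and (6), computed by successively conditioning on the instance's values and averaging predictions over the data. As with the earlier sketches, `pred_fun`, `X`, and `x_star` are assumed placeholders, and this is not the implementation behind Figure 5:

```r
# Additive variable attributions: v_0 plus one contribution v(j, x_star)
# per variable, for a fixed variable ordering (order matters, as noted).
variable_contributions <- function(pred_fun, X, x_star, order = names(X)) {
  expected <- numeric(length(order) + 1)
  expected[1] <- mean(pred_fun(X))            # v_0: average model response
  X_cond <- X
  for (j in seq_along(order)) {
    X_cond[[order[j]]] <- x_star[[order[j]]]  # condition on X^j = x_star^j
    expected[j + 1] <- mean(pred_fun(X_cond)) # E[f(X) | first j vars fixed]
  }
  setNames(diff(expected), order)             # successive differences = v(j, .)
}
```

By construction, `expected[1]` plus the sum of the contributions equals the prediction for `x_star`, which is what makes a waterfall plot like Figure 5 possible.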
5.3. Other Techniques

In this paper, we demonstrate just a few model-agnostic ML interpretation techniques. These represent a small subset of existing techniques, each of which has additional variations. In the remainder of this section, we point out a few common techniques not covered in our case study.

Individual conditional expectation (ICE) plots disaggregate PDPs into their instance-level components for a more granular view into predictor sensitivities (Goldstein et al. 2015). To accommodate correlated variables in PDP, accumulated local effect (ALE) plots compute expected changes in model response over the conditional, rather than marginal, distribution of the other variables (Apley and Zhu 2016).

Local interpretable model-agnostic explanations (LIME) (Ribeiro, Singh, and Guestrin 2016) build simple surrogate models using model predictions, with higher training weights given to the point of interest, in effect replacing the complex ML model with easily interpretable linear regressions or decision trees in neighborhoods of specific points for the purpose of explanation. Taking the concept further, one can also train a global surrogate model across the entire domain of interest.

6. Conclusion
Actuarial standards of practice, most notably ASOP 41, place responsibility on the actuary to clearly communicate actuarial work products, including insurance pricing models. These responsibilities create special challenges for communicating machine learning models, which are often seen as "black boxes" due in part to their complexity, nonlinearity, flexible construction, and ad hoc nature.

In this paper, we discuss particular questions of model validation that are of key importance in communicating a model and that may present particular difficulty for machine learning models compared to GLMs or traditional pricing models. Specifically,

• How does the model impact individual insurance consumers?
• How are the predictor variables related to expected losses?

We contextualize these questions in terms of different frameworks for defining interpretability. We conceptualize interpretability in terms of the ability of a model (or modeler) to answer a set of idealized questions that would be refined over time. We then offer potential (families of) model-agnostic techniques for providing answers to these questions.

Much work remains to be done in terms of defining the role of machine learning algorithms in actuarial practice. Lack of interpretability has been a key barrier preventing wider adoption and exploration of these techniques. The methods proposed in this paper could therefore represent important strides in unlocking the potential of machine learning within the insurance industry.
Acknowledgments
We thank Navdeep Gill, Daniel Falbel, Morgan Bugbee, the volunteers of the CAS project oversight group, and two anonymous reviewers for helpful discussions and/or feedback. This work is supported by the Casualty Actuarial Society.
References
Abadi, Martín, et al. 2015. "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems." https://www.tensorflow.org/.

"Actuarial Standard of Practice No. 41 - Actuarial Communications." 2010. Actuarial Standards Board.

"Actuarial Standard of Practice No. 56 - Modeling." 2019. Actuarial Standards Board.

Anderson, Duncan, Sholom Feldblum, Claudine Modlin, Doris Schirmacher, Ernesto Schirmacher, and Neeza Thandi. 2005. "A Practitioner's Guide to Generalized Linear Models." Casualty Actuarial Society Discussion Paper Program.

Apley, Daniel W., and Jingyu Zhu. 2016. "Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models." arXiv:1612.08468 [Stat].

Bailey, Robert A., and LeRoy J. Simon. 1960. "Two Studies in Automobile Insurance Ratemaking." ASTIN Bulletin 1 (4): 192–217.

Biecek, Przemyslaw, and Tomasz Burzykowski. 2019. Explanatory Model Analysis. E-book. https://pbiecek.github.io/ema/.

Biran, Or, and Courtenay Cotton. 2017. "Explanation and Justification in Machine Learning: A Survey." IJCAI-17 Workshop on Explainable AI (XAI), 8:1.

Breiman, Leo. 2001. "Random Forests." Machine Learning 45 (1): 5–32.

Brown, Robert L. 1988. "Minimum Bias with Generalized Linear Models." Proceedings of the Casualty Actuarial Society LXXV.

Casualty Actuarial Society. 2018. Syllabus of Basic Education.

Doshi-Velez, Finale, and Been Kim. 2017. "Towards a Rigorous Science of Interpretable Machine Learning." arXiv:1702.08608 [Cs, Stat], February. http://arxiv.org/abs/1702.08608.

Dugas, Charles, Y. Bengio, Nicolas Chapados, P. Vincent, G. Denoncourt, and C. Fournier. 2003. "Statistical Learning Algorithms Applied to Automobile Insurance Ratemaking," December. https://doi.org/10.1142/9789812794246_0004.

Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. 2018. "All Models Are Wrong, but Many Are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously." arXiv:1801.01489 [Stat], January. http://arxiv.org/abs/1801.01489.

Friedman, Jerome H. 2001. "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics, 1189–1232.

Gabrielli, Andrea. 2019. "A Neural Network Boosted Double over-Dispersed Poisson Claims Reserving Model." SSRN Scholarly Paper ID 3365517. Rochester, NY: Social Science Research Network.

Gabrielli, Andrea, Ronald Richman, and Mario V. Wüthrich. 2019. "Neural Network Embedding of the over-Dispersed Poisson Reserving Model." Scandinavian Actuarial Journal, 1–29.

Gao, Guangyuan, Shengwang Meng, and Mario V. Wuthrich. 2018. "Claims Frequency Modeling Using Telematics Car Driving Data." Scandinavian Actuarial Journal.

Gao, Guangyuan, and Mario Wuthrich. 2018. "Feature Extraction from Telematics Car Driving Heatmaps." European Actuarial Journal.

Gao, Guangyuan, and Mario Wuthrich. 2019. "Convolutional Neural Network Classification of Telematics Car Driving Data." Risks.

Goldburd, Mark, Anand Khare, and Dan Tevet. 2016. Generalized Linear Models for Insurance Rating. Casualty Actuarial Society, CAS Monographs Series, no. 5.

Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. "Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation." Journal of Computational and Graphical Statistics 24 (1): 44–65.

Henckaerts, Roel, Katrien Antonio, Maxime Clijsters, and Roel Verbelen. 2018. "A Data Driven Binning Strategy for the Construction of Insurance Tariff Classes." Scandinavian Actuarial Journal.

Hilton, Denis J. 1990. "Conversational Processes and Causal Explanation." Psychological Bulletin 107 (1): 65.

Kim, Been, Rajiv Khanna, and Oluwasanmi O. Koyejo. 2016. "Examples Are Not Enough, Learn to Criticize! Criticism for Interpretability." In Advances in Neural Information Processing Systems, 2280–8.

Kuo, Kevin. 2019. "DeepTriangle: A Deep Learning Approach to Loss Reserving." Risks 7 (3): 97.

Lange, Jeffrey T. 1966. "General Liability Insurance Ratemaking." Proceedings of the Casualty Actuarial Society LIII: 26–53.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. "Deep Learning." Nature 521 (7553): 436.

Lundberg, Scott M., and Su-In Lee. 2017. "A Unified Approach to Interpreting Model Predictions." In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4765–74. Curran Associates, Inc. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.

Mango, Donald F. 1998. "An Application of Game Theory: Property Catastrophe Risk Load." In Proceedings of the Casualty Actuarial Society, LXXXV:157–86.

Mildenhall, Stephen J. 1999. "A Systematic Relationship Between Minimum Bias and Generalized Linear Models." Proceedings of the Casualty Actuarial Society LXXXVI: 393–487.

Miller, Tim. 2017. "Explanation in Artificial Intelligence: Insights from the Social Sciences." arXiv:1706.07269 [Cs], June. http://arxiv.org/abs/1706.07269.

Miller, Tim, Piers Howe, and Liz Sonenberg. 2017. "Explainable AI: Beware of Inmates Running the Asylum or: How I Learnt to Stop Worrying and Love the Social and Behavioural Sciences." arXiv preprint arXiv:1712.00547.

Molnar, Christoph, and others. 2018. "Interpretable Machine Learning: A Guide for Making Black Box Models Explainable." E-book. https://christophm.github.io/interpretable-ml-book/.

National Association of Insurance Commissioners. 2017. Predictive modeling survey of state insurance regulators, Casualty Actuarial and Statistical Task Force.

National Association of Insurance Commissioners. 2019. "Regulatory Review of Predictive Models." White paper, Casualty Actuarial and Statistical Task Force.

Nelder, J. A., and R. W. M. Wedderburn. 1972. "Generalized Linear Models." Journal of the Royal Statistical Society, Series A (General), 135 (3): 370–84.

Nilsson, Nils J. 2009. The Quest for Artificial Intelligence. Cambridge University Press.

Noll, Alexander, Robert Salzmann, and Mario Wuthrich. 2018. "Case Study: French Motor Third-Party Liability Claims." SSRN Electronic Journal.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. ""Why Should I Trust You?": Explaining the Predictions of Any Classifier." In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44. ACM.

Robertson, J. P. 2009. "NCCI's 2007 Hazard Group Mapping." Variance.

Roel, Verbelen, Katrien Antonio, and Gerda Claeskens. 2018. "Unravelling the Predictive Power of Telematics Data in Car Insurance Pricing." Journal of the Royal Statistical Society, Series C (Applied Statistics).

Spedicato, Giorgio, Christophe Dutang, and Leonardo Petrini. 2018. "Machine Learning Methods to Perform Pricing Optimization: A Comparison with Standard Generalized Linear Models." Variance 12 (1).

Wang, Haohan, and Bhiksha Raj. 2017. "On the Origin of Deep Learning." arXiv:1702.07800v4 [cs.LG].

Wuthrich, Mario V. 2017. "Covariate Selection from Telematics Car Driving Data." European Actuarial Journal.

Wüthrich, Mario V. 2018. "Machine Learning in Individual Claims Reserving." Scandinavian Actuarial Journal.

Yang, Yi, Wei Qian, and Hui Zou. 2018. "Insurance Premium Prediction via Gradient Tree-Boosted Tweedie Compound Poisson Models." Journal of Business and Economic Statistics 36 (3): 456–70.
Appendix A. Model Development

Table A.1: Input variables and their transformations.

Variable          Type         Transformation
Age range         Categorical  One-hot encode
Sex               Categorical  One-hot encode
Vehicle category  Categorical  One-hot encode
Make              Categorical  Embed in $\mathbb{R}^2$
Vehicle age       Numeric      Center and scale
Region            Categorical  Embed in $\mathbb{R}^2$
In this appendix, we describe the ML model and the data used to train it. Note that, for our paper, the ultimate goal of the modeling procedure is to develop something that can produce predictions. As a result, we do not follow standard practices for tuning and validation. However, for the sake of completeness and reproducibility, we include an overview of the process here. Implementation is done using the R (R Core Team 2019) interface to TensorFlow (Abadi et al. 2015). The model explanation visualizations utilize the implementation by Biecek and Burzykowski (2019), and the code to reproduce them is available on GitHub (https://github.com/kasaai/explain-ml-pricing).

Appendix A.1. Data
We use data from the AUTOSEG ("Automobile Statistics System") of Brazil's Superintendence of Private Insurance (SUSEP). The organization maintains policy-characteristics-level data, including claim counts and amounts, for all insured automobiles in Brazil. The data contains variables ranging from policyholder characteristics to losses by peril. We use the records from the first half of 2012, a total of 1,707,651 records. One-fifth of the data is reserved for testing; the remainder is further split into 3/4 for analysis and 1/4 for assessment, the latter used to determine early stopping.

Appendix A.2. Model

Table A.1 shows the input variables to our model and their associated transformations. For "make" and "region", we map each level to a point in $\mathbb{R}^2$.
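As an illustration of the embedding transformation, the following keras-for-R sketch maps a categorical rating variable into $\mathbb{R}^2$. It is a hypothetical fragment under assumed names (`n_makes`, `make_input`), not the paper's actual architecture:

```r
library(keras)

# Embed the "make" variable into R^2: each of the n_makes levels
# (integer-encoded 0..n_makes-1) is mapped to a learned 2-dimensional point.
n_makes <- 50  # placeholder count of distinct makes
make_input <- layer_input(shape = 1, name = "make")
make_embedded <- make_input %>%
  layer_embedding(input_dim = n_makes, output_dim = 2) %>%
  layer_flatten()  # shape (batch, 2): the point in R^2
```

In a full model, the learned coordinates would typically be concatenated with the other (one-hot encoded and scaled numeric) inputs and fed through the network's dense layers.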