Towards Explainability of Machine Learning Models in Insurance Pricing
Kevin Kuo a,∗, Daniel Lupton b

a Kasa AI, 3040 78th Ave SE
b Taylor & Mulder, 10508 Rivers Bend Lane, Potomac, MD 20854, USA

∗ Corresponding author. Email addresses: [email protected] (Kevin Kuo), [email protected] (Daniel Lupton)
Abstract
Machine learning methods have garnered increasing interest among actuaries in recent years. However, their adoption by practitioners has been limited, partly due to the lack of transparency of these methods as compared to generalized linear models. In this paper, we discuss the need for model interpretability in property & casualty insurance ratemaking, propose a framework for explaining models, and present a case study to illustrate the framework.
1. Introduction
Risk classification for property & casualty (P&C) insurance rating has traditionally been done with one-way, or univariate, analysis techniques. In recent years, many insurers have moved towards using generalized linear models (GLM), a multivariate predictive modeling technique, which addresses many shortcomings of univariate approaches and is currently considered the gold standard in insurance risk classification. At the same time, machine learning (ML) techniques such as deep neural networks have gained popularity in many industries due to their superior predictive performance over linear models (LeCun, Bengio, and Hinton 2015). In fact, there is a fast growing body of literature on applying ML to P&C reserving (Kuo 2019; Wüthrich 2018; Gabrielli, Richman, and Wüthrich 2019; Gabrielli 2019). However, these ML techniques, often considered to be completely "black box", have been less successful in gaining adoption in pricing, which is a regulated discipline and requires a certain amount of transparency in models.

If insurers can gain more insight into how ML models behave in risk classification contexts, it would increase their ability to reassure regulators and the public that accepted ratemaking principles are met. Being able to charge more accurate premiums would, in turn, make the risk transfer system more efficient and contribute to the betterment of society. In this paper, we aim to take a step towards liberating actuaries from the confines of linear models in pricing projects, by proposing a framework for explaining ML models for ratemaking that regulators, practitioners, and researchers in actuarial science can build upon.

The rest of this paper is organized as follows: Section 2 provides an overview of P&C ratemaking, Section 3 discusses the importance of interpretation, Section 4 discusses model interpretability in the context of ratemaking and proposes specific tasks for model explanation, Section 5 describes current model interpretation techniques and applies them to the tasks defined in the previous section, and Section 6 concludes.
2. Property and Casualty Ratemaking
Early classification ratemaking procedures were typically univariate in nature. For example, Lange (1966) notes that (at that time) most major lines of insurance used univariate methods based around the same principle: distributing an overall indication to territorial relativities or classification relativities based on the extent to which they deviated from the average experience.

Bailey and Simon (1960) introduced minimum bias methods, which were expanded throughout the 60s, 70s, and 80s. As computing power developed, minimum bias began to give way to GLMs, with
papers such as Brown (1988) and Mildenhall (1999) bridging the gap between the methods.

Arguably, GLMs predate minimum bias procedures by a significant margin. The term was coined by Nelder and Wedderburn (1972), but generalizations of least squares linear regression date back at least to the 1930s. Like minimum bias methods, GLMs did not become mainstream in actuarial science for some time. For example, the syllabus of basic education of the Casualty Actuarial Society (CAS) does not seem to include any mention of GLMs prior to Brown (1988) in the 1990 syllabus. From there, GLMs seem to have received only passing mention until 2006 with the introduction of Anderson et al. (2005) to the syllabus. Beginning in 2016, the CAS introduced Goldburd, Khare, and Tevet (2016) to the syllabus, which offers a comprehensive guide to GLMs.

Paralleling the development of GLMs was the development of machine learning algorithms throughout the middle part of the 20th century. Detailed histories of machine learning may be found in sources such as Nilsson (2009) and Wang and Raj (2017). As with GLMs, machine learning was relatively unpopular in actuarial science until the last ten years, as computing power has become cheaper and more easily available and as machine learning software packages have obviated the need for developing analyses from scratch each time an analysis is performed. Due to the breadth of machine learning as a field, it is difficult to identify the first time it entered the CAS syllabus; however, cluster analysis (in the form of k-means) seems to have been first included in 2011 with Robertson (2009). More recently, the CAS MAS-I and MAS-II exams introduced in 2018 have included machine learning explicitly.

Within the area of ratemaking, machine learning is still in its infancy. A significant portion of machine learning applications to ratemaking has been in the context of automobile telematics, such as Gao, Meng, and Wuthrich (2018), Gao and Wuthrich (2018), Gao and Wuthrich (2019), Roel, Antonio, and Claeskens (2018), or Wuthrich (2017). Presumably this focus has been a result of the high dimensionality and complexity of telematics data, making it a field in which the unique abilities of machine learning techniques give a clear advantage over traditional approaches.

Outside of telematics, Yang, Qian, and Zou (2018) use a gradient tree-boosting approach to capture non-linearities that would be a challenge for GLMs. Henckaerts et al. (2018) make use of generalized additive models (GAM) to improve predictions of GLMs. Many researchers, in an apparent effort to demonstrate the range of possibilities and advantages of machine learning, have approached the topic by comparing many different machine learning algorithms within a single study, such as in Dugas et al. (2003), Noll, Salzmann, and Wuthrich (2018), and Spedicato, Dutang, and Petrini (2018). These studies make use of such varied techniques as regression trees, boosting machines, support vector machines, and neural networks.
Regardless of the method employed for determining the risk of various classifications, the actual process of setting rate relativities typically involves some variation of the following steps:

1. Obtain relevant policy-level data
2. Prepare data for analysis
3. Perform analysis on the data, employing desired method(s) to estimate needed rates
4. Select final rates based on rate indications
5. Present rates to the regulator, including explanation of the steps followed to derive the rates
6. Answer questions from regulators regarding the method employed

The focus of this paper is on steps 5 and 6. In many states, rate filings that exceed certain thresholds for magnitude of rate changes, or filings that make use of new or sophisticated predictive models, may be subject to particular regulatory scrutiny. In these cases, it is necessary to be able to explain the results of the modeling process in a way that is understandable without sacrificing statistical rigor.

It should be noted that communicating results is not simply a method of passing regulatory muster. Generating interpretable modeling output is an important - even essential - facet of model checking. Actuaries are bound by relevant standards to be able to exercise appropriate judgment in selecting risk characteristics as part of a risk classification system per Actuarial Standard of Practice 12 ("Risk Classification"). Therefore, the techniques discussed in this paper may be viewed through the lens of providing useful information to regulators, but they should also be considered as part of a thorough vetting of any rating model.

Although the focus of this paper is on communication to regulators, it should be said that selecting final rates based on indications (step 4 in the list above) may pose a unique challenge for black-box models. This, too, provides strong motivation for techniques that could add to the modeler's - or any stakeholder's - understanding of the model, such as the relative importance of variables or the shapes of response curves. Such techniques could be usefully employed in making decisions about how best to select rates.

Similarly, although the focus of this paper is on communication in a pricing context, the techniques explored in this paper (and many of the concerns discussed) may also be relevant to other contexts, such as claim-level reserving or analytics, or other applications of machine learning to the insurance industry.
3. The Need to See Inside the Black Box
Within the actuarial profession, Actuarial Standard of Practice 41 ("Actuarial Communications") notes that ". . . another actuary qualified in the same practice area [should be able to] make an objective appraisal of the reasonableness of the actuary's work as presented in the actuarial report." ("Actuarial Standard of Practice No. 41 - Actuarial Communications" 2010) Underlying this requirement is an assumption that the hypothetical other actuary qualified in the same practice area is adequately familiar with the relevant techniques employed. Although the syllabus of basic education is constantly changing, there has at times been an assumption that all techniques and assumptions that have ever been a part of the syllabus of basic education needn't be explained from first principles in general actuarial communications, and that an actuary practicing in the same field should be able to make an objective appraisal of the results from the methods found in the syllabus. This is notable because, beginning with the introduction of the CAS MAS-I and MAS-II examinations in July of 2018, several machine learning models were formally included in the syllabus of basic education. These exams cover a wide range of topics, such as splines, clustering algorithms, decision trees, boosting, and principal components analysis (Casualty Actuarial Society 2018).

Nevertheless, machine learning poses something of a special challenge for ASOP 41 for several reasons:

1. Machine learning models can be very ad hoc compared to traditional statistical models.
2. Because many machine learning models do not assume an underlying probability distribution or stochastic process, they may not admit of standard metrics for model comparison (e.g., it is not straightforward to calculate an AIC over a neural network).
3. Machine learning methods are often combined into ensembles that may not be easily separated and that may, as a collection, cease to resemble a single standard version of a model.
4. Machine learning models can be "black boxes" insofar as the final form of the response curve cannot be easily predicted and may depend heavily on the available data (which may not, in turn, be available to the reviewer).

This last item raises a final interesting issue. GLMs and their ilk are often fitted using one of a handful of standard and well-understood approaches (e.g., maximum likelihood estimation). However, this is not possible in general with machine learning models, as machine learning algorithms often use loss surfaces so complex that it may not be feasible to calculate the global minimum of the surface. Certainly, closed form representations of the loss surfaces are not generally available. For this reason, the training phase of a machine learning model is, in many ways, just as important to one's understanding as the model form and the data on which the model is fitted. Because the final model result is inseparable from these three components (training method, model form, and data), it is not generally adequate to know just the method employed in order to make an objective appraisal of the reasonableness of the result.

These issues also pose particular challenges with respect to other standards. For instance, as discussed previously, ASOP 12 requires actuaries to be able to exercise appropriate judgment about risk classification systems. The recent ASOP 56 ("Models") speaks to more general concerns in all practice areas that might make use of models.
ASOP 56 requires the actuary to "make reasonable efforts to confirm that the model structure, data, assumptions, governance and controls, and model testing and output validation are consistent with the intended purpose." ("Actuarial Standard of Practice No. 56 - Modeling" 2019) All of these efforts may be hampered if it is not possible to peer into the black box of the model.

It should also be noted that these comments only apply within the actuarial profession. Outside of the actuarial profession, communication of results may be more challenging. A 2017 survey conducted by the Casualty Actuarial and Statistical Task Force of the National Association of Insurance Commissioners (NAIC) found that the plurality of responding regulators identified "Filing complexity and/or a lack of resources or expertise" as a key challenge that impedes their ability to review GLMs or other predictive models (National Association of Insurance Commissioners 2017). Given that machine learning algorithms are generally regarded as more complex than GLMs, this implies that the challenge of communicating machine learning model results is significant.

In response to the same survey, 33 state regulators noted that it would be helpful or very helpful for the NAIC to develop information and tools to assist in reviewing rate filings based on GLMs, and 34 noted that it would be helpful to develop similar items to assist in reviewing "Other Advanced Modeling Techniques." One outgrowth of this need was the development of a white paper, National Association of Insurance Commissioners (2019), on best practices for regulatory review of predictive models. The white paper focuses on review of GLMs, particularly with respect to private passenger automobile and homeowners' insurance. Some of the guidance offered in this regard is therefore not strictly applicable to the review of machine learning models. For example, as previously noted, p-values are not a concept that translates well to deterministic machine learning algorithms. However, among the guidance applicable to machine learning algorithms are the following:

• Determine the extent to which the model causes premium disruption for individual policyholders and how the insurer will explain the disruption to individual consumers that inquire about it.
• Determine that individual input characteristics to a predictive model are related to the expected loss or expense differences in risk. Each input characteristic should have an intuitive or demonstrable actual relationship to expected loss or expense.
• Determine that individual outputs from a predictive model and their associated selected relativities are not unfairly discriminatory.

The last of these items is an entire topic unto itself. The methods and concepts introduced in this paper are useful for exploring the question of whether rates are appropriately related to risk of loss as defined by the variables used in the model, but there are many other aspects of avoiding unfair discrimination that are outside the scope of this paper. The methods in this paper may help in understanding the model, which is a necessary precursor to addressing the question of unfair discrimination.

The items in this list are by no means exhaustive, but they pertain to the concept of model interpretability for ratemaking that we develop next.
4. Interpretability in the Ratemaking Context
In this section, we attempt to develop a working definition of interpretability for ratemaking applications. While we will not provide a comprehensive survey of the prolific and fast evolving ML interpretability literature, we draw from it as appropriate in setting the stage for our discussion. Even among researchers in the subject, there is not a consensus on the definition of interpretability; here are a few from frequently cited papers:

• Ability to explain or to present in understandable terms to a human (Doshi-Velez and Kim 2017);
• The degree to which an observer can understand the cause of a decision (Biran and Cotton 2017); and
• A method is interpretable if a user can correctly and efficiently predict the method's results (Kim, Khanna, and Koyejo 2016).

We motivate our discussion by considering several aspects of interpretability. As we proceed through the points below, we aim to arrive at a more scoped and relevant definition of what it means for a pricing model to be interpretable. In the remainder of this section, we clarify a couple of concepts regarding interpretable classes of models and the computational transparency of ML models, outline frameworks for understanding the communication goals of interpretability, then discuss a potential framework for implementing ML interpretability in practice.
In the actuarial science literature, the GLM is probably the most oft-cited example of an easily interpretable model. Given a set of inputs, we can easily reason about what the output of the model is. As an illustrative example, consider a claim severity model with driver age, sex, and vehicle age as predictors; assuming a log link function and letting $Y$ denote the response, we have

$$\log(E[Y]) = \beta_0 + \beta_1 \cdot \mathrm{age} + \beta_2 \cdot \mathrm{vehicle\_age} + \beta_3 \cdot \mathrm{sex\_male}. \qquad (1)$$

Here, we can tell, for example, what the model would predict for the expected severity if we were to increase age by a certain amount, all else being equal, because the relationship between the predictor and the response is simply multiplication by the coefficient $\beta_1$ and applying the inverse link function (a code sketch at the end of this subsection makes this concrete).

Another commonly cited example of an interpretable model is a decision tree. An illustrative example is shown in Figure 1. Here, the prediction is arrived at by following a sequence of if-else decisions.

[Figure 1: A simple decision tree for loss cost prediction.]

Now, it is worth pointing out that, when declaring that GLMs or decision trees are interpretable models, we are implicitly assuming that we are considering only a handful of predictors. In fact, the ease with which we can reason about a model declines as the number of predictors, transformations of them, and interactions increase, as in the following (somewhat pathological) example:

$$\log(E[Y]) = \beta_0 + \beta_1 \cdot \mathrm{age} + \beta_2 \cdot \mathrm{vehicle\_age} + \beta_3 \cdot \mathrm{vehicle\_age}^2 + \beta_4 \cdot \mathrm{age} \cdot \mathrm{vehicle\_age} + \beta_5 \cdot \mathrm{sex\_male} + \beta_6 \cdot \mathrm{sex\_male} \cdot \mathrm{age}. \qquad (2)$$

Similarly, one can see in Figure 2 that larger trees are tough to reason about. In other words, even when working within the framework of an "interpretable" class of models, we may still end up with something that many would consider "black box."

Another occasional misconception is that we have no visibility into how some ML models compute predictions, which renders them uninterpretable. Outside of proprietary algorithms, all common ML models, including neural networks, gradient boosted trees, and random forests, are well studied and have large bodies of literature documenting their inner workings. As an example, a fitted feedforward neural network is simply a composition of linear transformations followed by nonlinear activation functions. As in Equation 2, one can write down the mathematical equation for calculating the prediction given some inputs, but it may be difficult for a human to reason about it. We show later that we can still provide explanations of completely "black box" models, but it is important to note that ML model predictions are still governed by mathematical rules, and are deterministic in most cases.
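As referenced above, the following is a minimal R sketch of fitting and reading a model like Equation (1). The data frame `claims` and its column names are hypothetical illustrations, not objects from the paper's case study; a Gamma family is assumed as a common severity choice:

```r
# Fit a claim severity GLM with a log link, as in Equation (1).
severity_fit <- glm(
  severity ~ age + vehicle_age + sex,
  family = Gamma(link = "log"),
  data = claims  # assumed data frame with these columns
)

# With a log link, exponentiated coefficients act multiplicatively:
# increasing age by one year scales the expected severity by exp(beta_1),
# all else being equal.
exp(coef(severity_fit))
```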
Hilton (1990) proposed a framework, later interpreted by Miller (2017) in the context of ML, for understanding model explanations as conversations or social interactions. One consequence of this identification is that explanations need to be relevant to the audience. This framework is consistent with ASOP 41, which formulates a similar requirement in terms of an intended user of the actuarial communication. In developing, filing, and operationalizing a pricing model, one needs to accommodate a variety of stakeholders, each of whom has a different set of questions, assumptions, and technical capacity. First, there are internal stakeholders at the company, which include management and underwriters. While some of the individuals in this audience may be technical, they are likely less familiar with predictive modeling techniques than the actuaries and data scientists who build the models. Next, we have the regulators, who may have limited resources to review the models, and will focus on a specific list of questions motivated by statute and public policy. Finally, we have potential policyholders, who have an interest (perhaps more so than the other parties) as they are responsible for paying the premiums.

[Figure 2: A more complex decision tree. This is still much simpler than typical realistic examples.]
It is interesting to note that the modelers, who are most familiar with the models, tend to be the same people designing and communicating the explanations. This poses a challenge that Miller, Howe, and Sonenberg (2017) call "inmates running the asylum", where the modelers design explanations for themselves rather than the intended audience. For example, they may be interested in technical questions, such as extrapolation behavior, and shape the explanations accordingly, which may be irrelevant to a prospective policyholder.

Another point outlined in Miller's survey (Miller 2017) is that explanations are contrastive. In other words, people are often interested not in why something happened, but rather why it happened instead of something else. For example, policyholders might not care exactly how their auto premiums are computed, but would like to know why they are being charged more than their coworkers who drive similar vehicles. As an extension, policyholders may want to know what they can change in order to obtain lower premiums.
With the above considerations in mind, we propose a potential framework for interpreting ML models for insurance pricing: the actuarial profession, in collaboration with regulators and representatives of the public, defines a set of questions to be answered by explanations accompanying ML models, along with acceptance criteria and examples of successful explanations. In other words, interpretability for our purposes is defined as the ability of a model's explanations to answer the posed questions.

It should be noted that no ideal set of questions exists that would encompass all potential models. Rather, the actuary must consider what aspects of the model would raise questions from the perspective of the model's intended users. We propose that relevant stakeholders, by providing example questions and answers, would inherently provide guidance by which actuaries can reasonably anticipate the kinds of specific questions most important to those stakeholders and address them proactively. These questions should relate to existing guidelines, such as those described in National Association of Insurance Commissioners (2019) and outlined in Section 3, standards of practice, and regulation, and in fact should not be specific only to ML models. By conceptualizing a set of questions, we reduce the burden on both companies and regulators; this is especially important for the latter, who are already resource constrained while facing an increasing variety of models being filed. This format should also be familiar to actuaries who are accustomed to adhering to specific guidelines in, for example, ASOPs. Like the ASOPs, we envision that these questions and guidelines will be continually updated to reflect feedback obtained and advances in research.

While the realization of a set of such guidelines is an ambitious undertaking beyond the scope of this paper, we present in the next section a sample set of questions and techniques one can leverage to answer them. The goal of these case studies is twofold: to more concretely illustrate the proposed framework, and to expose the actuarial audience to modern ML interpretation techniques.

5. Applying Model Interpretation Techniques
Now that we have established a framework for model interpretation in the form of asking and answering relevant questions, we demonstrate examples of such exchanges via an illustrative case study. Analytically, our starting point is a fitted deep neural network model for predicting loss costs. As the modeling details are of secondary importance, they are available in Appendix A. The questions that we ask of the model are as follows:

1. What are the most important predictors in the model? Put another way, to what extent do the predictors improve the accuracy of the model?
2. How does the predicted loss cost change, on average, as we change an input?
3. For a particular policyholder, how does each characteristic contribute to the loss cost prediction?

The techniques we utilize to answer these questions are permutation variable importance, partial dependence plots, and additive variable attributions, respectively. In our discussion, we adopt the organization of techniques and some notation presented in Molnar and others (2018) and Biecek and Burzykowski (2019), which are comprehensive references on the most established ML interpretation techniques.
Before we dive into answering the questions, we present a brief taxonomy of ML interpretation techniques. Rather than attempting an exhaustive classification, the goal is to orient ourselves among broad categories of techniques, so we can map them to the tasks indicated by the questions being asked. For our purposes, model interpretation techniques can be categorized across two dimensions: intrinsic vs. post-hoc and global vs. local.
Intrinsic model interpretation draws conclusions from the structure of the fitted model and is what we typically associate with "interpretable" classes of models. This is only viable for models with simple structures, such as the sparse linear model and shallow decision tree we see in Section 4, where we arrive at explanations by reading off parameter estimates or a few decision rules. For algorithms that produce models with complex structure that do not lend themselves easily to intrinsic exploration, we can appeal to post-hoc techniques. This class of techniques interrogates the model by presenting it with data for scoring and observing the prediction behavior of the model. These techniques are concerned with only the inputs and outputs, and hence are agnostic of the model itself, which means they can also be applied to simple models. Since most useful ML models have a level of complexity beyond the threshold of intrinsic interpretability, we focus on model-agnostic techniques in our case study. As we will see later on, the data that we present to the models are usually perturbed variations of test data.
Along the other dimension, we categorize model interpretations as global, or model-level, and local, or instance-level. The former class provides insights with respect to the model as a whole. Some examples of these explanations include variable importances and sensitivities, on average, of the predicted response with respect to individual predictors. Note that these methods may be compared to the methods described in Goldburd, Khare, and Tevet (2016), Chapter 7, which focus on global interpretation of GLMs. In our case study, questions 1 and 2 are associated with global interpretations. On the other hand, question 3 pertains to an individual prediction, which would fall in the local, or instance-level, category. In addition to individual variable attribution, we can also inquire about what would happen to the current predicted response if we were to perturb specific predictor variables.
Having aligned the questions with the categories of interpretation techniques, we now introduce a selection of appropriate techniques to answer them.

"What are the most important predictors in the model?"

For linear models and their generalizations, and some ML models, measures of variable importance can be obtained from the fitted model structure. In the case of GLMs, one might observe the magnitudes of the estimated coefficients or $t$-statistics, whereas for random forests, one might use out-of-bag errors (Breiman 2001). For more complex models, such as the neural network in our case study, we need to devise another approach.

We follow the methodology of permutation feature importance as described in Fisher, Rudin, and Dominici (2018), and utilize the notation introduced by Biecek and Burzykowski (2019). The gist of the technique is as follows: to see how important a variable is, we make predictions without it and see how much worse off we are in terms of accuracy. One way to achieve this would be to re-fit the model many times (as many times as the number of variables). However, this may be intractable with lengthy model training times or large numbers of variables, so a more popular approach is to instead keep the same fitted model but permute the values of each predictor.

More formally, let $y$ denote the vector of responses, $X$ denote the matrix of predictor variables, $\hat{f}$ denote the fitted model, and $L = L(\hat{f}(X), y)$, where $\hat{f}$ applies to $X$ rowwise, denote the value of the loss function, which is mean squared error in the case of regression. Now, if $\tilde{X}^j$ denotes the predictor matrix where the $j$th variable has been permuted, then we can compute the loss with the permuted dataset as $L^{-j} = L(\hat{f}(\tilde{X}^j), y)$. Here, by permuting a variable, we mean that we randomly rearrange the values in the column of data associated with the variable. With this, we define the variable importance $VI_j$ as $L^{-j} - L$; a code sketch of this computation appears at the end of this subsection.

In Figure 3, we show a plot of variable importances. In our particular example, we see that the "make" variable contributes most to the accuracy of the model, with "sex" contributing the least. This provides a way for the audience to quickly glance at the most relevant variables, and ask further questions as necessary.

[Figure 3: Permutation feature importances for the neural network model.]

Note that these measures do not provide information regarding the directional sensitivity of the predictors on the response. Also, when there are correlated variables, one should be careful about interpretation, as the result may be biased by unrealistic records in the permuted dataset. Another ramification of a group of correlated variables is that their inclusion may cause each to appear less important than if only one is included in the model.
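The following is a minimal from-scratch sketch of the permutation importance computation described above, not the implementation used for the paper's figures. The names and arguments (`pred_fun`, a prediction wrapper; `X`, a data frame of predictors; `y`, the response vector) are assumptions for illustration:

```r
# Permutation feature importance: VI_j = L^{-j} - L, with mean squared
# error as the loss L, as in the regression setting described above.
permutation_importance <- function(pred_fun, X, y, seed = 42) {
  set.seed(seed)
  mse <- function(pred) mean((pred - y)^2)
  base_loss <- mse(pred_fun(X))          # loss L on the intact data
  vi <- sapply(names(X), function(j) {
    X_perm <- X
    X_perm[[j]] <- sample(X_perm[[j]])   # randomly rearrange column j only
    mse(pred_fun(X_perm)) - base_loss    # increase in loss from permuting j
  })
  sort(vi, decreasing = TRUE)
}
```

In practice one would average over several permutations per variable to reduce the noise introduced by any single shuffle.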
"How does the predicted loss cost change, on average, as we change an input?"

For this question, we again consider first how it would be answered in the GLM setting. When the input predictor in question is continuous, we can answer the question by looking at the estimated coefficient, which provides the change in the response per unit change in the predictor (on the scale of the linear predictor). For non-parametric models and neural networks, where no coefficients are available, we can appeal to partial dependence plots (PDP), first proposed by Friedman (2001) for gradient boosting machines (GBM).

To describe PDP, we need to introduce some additional notation. Let $x^j$ denote the input variable of interest. Then we define the partial dependence function as

$$h(z) = E_{X^{-j}}\left[\hat{f}(x \mid x^j = z)\right], \qquad (3)$$

where the expectation is taken over the distribution of the other predictor variables. In other words, we marginalize them out so we can focus on the relationship between the predicted response and the variable of interest. Empirically, we estimate $h$ by

$$\hat{h}(z) = \frac{1}{N} \sum_{i=1}^{N} \hat{f}(x_i \mid x_i^j = z), \qquad (4)$$

where $N$ is the number of records in the dataset. A code sketch of this estimator appears at the end of this subsection.

In Figure 4, we exhibit the PDP for the "vehicle age" variable. We see that the average predicted loss cost decreases with vehicle age until the latter is around 18. Note that the accompanying histogram shows that the data is quite thin for vehicle age greater than 18, so the apparent upward trend to the right is driven by just a few data points. This information allows the modeler and stakeholders to consider whether it is reasonable for the anticipated loss cost to follow this shape.

[Figure 4: Partial dependence plot for the neural network model.]

The question posed here is particularly important for regulators, who would like to know whether each variable affects the prediction in the direction that is expected, based on intuition, experience, and existing models. During the model development stage, PDP can also be used as a reasonableness test for candidate models by identifying unexpected relationships for the analyst to investigate.

As with permutation feature importance, one should be careful when interpreting PDP when there are strongly correlated variables. Since we average over the marginal distribution of the rest of the variables, we may take into account unrealistic data (e.g., high vehicle age for a model that is brand new). To address this drawback, alternative visualization techniques have been proposed, such as accumulated local effect (ALE) plots, which take expectations over conditional, rather than marginal, distributions (Apley and Zhu 2016).
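As a companion to Equation (4), here is a minimal sketch of the empirical partial dependence estimator, again using the assumed `pred_fun`/`X` conventions from the permutation importance sketch rather than the paper's actual code:

```r
# Empirical partial dependence, Equation (4): for each grid value z,
# set x^j = z for every record and average the model's predictions.
partial_dependence <- function(pred_fun, X, var, grid = NULL) {
  if (is.null(grid)) grid <- sort(unique(X[[var]]))
  h_hat <- sapply(grid, function(z) {
    X_mod <- X
    X_mod[[var]] <- z        # overwrite the variable of interest
    mean(pred_fun(X_mod))    # average prediction over all records
  })
  data.frame(value = grid, partial_dependence = h_hat)
}
```

Plotting `partial_dependence` against `value` yields a curve like the one described for vehicle age above.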
"For a particular policyholder, how does each characteristic contribute to the loss cost prediction?"

In the previous two examples, we looked at model-level explanations; now we move on to one where we investigate one particular prediction instance. As before, we consider how we would approach the question for linear models. For a GLM with a log link, common in ratemaking applications, we start with the base rate, and then the exponentiated coefficients have multiplicative effects on the final rate. Similar to the previous examples, for ML models in general we do not have directly interpretable weights. Instead, one way to arrive at variable contributions is to calculate the change in the expected model prediction for each predictor, conditioned on the other predictors.

Formally, for a fitted model $\hat{f}$, a given ordering of the variables $X^1, \ldots, X^p$, where $p$ is the number of predictor variables, and a specific instance $x_*$, we would like to decompose the model prediction $\hat{f}(x_*)$ into

$$\hat{f}(x_*) = v_0 + \sum_{j=1}^{p} v(j, x_*), \qquad (5)$$

where $v_0$ denotes the average model response, and $v(j, x_*)$ denotes the contribution of the $j$th variable in instance $x_*$, defined as

$$v(j, x_*) = E_X\left[\hat{f}(X) \mid X^1 = x_*^1, \ldots, X^j = x_*^j\right] - E_X\left[\hat{f}(X) \mid X^1 = x_*^1, \ldots, X^{j-1} = x_*^{j-1}\right]. \qquad (6)$$

Hence, the contribution of the $j$th variable to the prediction is the incremental change in the expected model prediction when we set $X^j = x_*^j$, assuming the other variables take their values in $x_*$. Note here that this definition implies that the order in which we consider the variables affects the results. Empirically, the expectations in (6) are calculated by sampling the test dataset; a code sketch follows at the end of this subsection.

In Figure 5, we exhibit a waterfall plot of variable contributions. The "intercept" value denotes the average model prediction and represents the $v_0$ term in Equation (5). The predicted loss cost for this particular policyholder is slightly less than average; the characteristics that make this policyholder more risky are the fact that he is a male between the ages of 18 and 25; counteracting the risky driver characteristics are the vehicle properties: it is a GM vehicle built domestically and is seven years old.

[Figure 5: Variable contribution plot for the neural network model.]

Instance-level explanations are useful for investigating specific problematic predictions generated by the model. Regulators and model reviewers may be interested in variable contributions for the safest and riskiest policyholders to see if they conform to intuition. A policyholder with a particularly high premium may wish to find out which of their characteristics contribute to it, and may follow up with a question about how to lower it, which would require another type of explanation.

As noted earlier, the ordering of variables has an impact on the contributions calculated, especially for models that are non-additive, which could cause inconsistent explanations. There are several approaches to ameliorate this phenomenon, including selecting variables with the largest contributions first, including interaction terms, and averaging over possible orderings. The last of these ideas is implemented by Lundberg and Lee (2017) using Shapley values from cooperative game theory, and is referred to as Shapley additive explanations (SHAP). These approaches are discussed further in Biecek and Burzykowski (2019) and its references. Shapley values were also used in Mango (1998), which has appeared on the CAS syllabus starting in 2004, in the context of determining how to allocate catastrophe risk loads between multiple accounts.
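Below is a rough sketch of the additive break-down in Equations (5) and (6), computed by successively conditioning on the instance's values and averaging predictions over the data. As with the earlier sketches, `pred_fun`, `X`, and `x_star` are assumed placeholders, and this is not the implementation behind Figure 5:

```r
# Additive variable attributions: v_0 plus one contribution v(j, x_star)
# per variable, for a fixed variable ordering (order matters, as noted).
variable_contributions <- function(pred_fun, X, x_star, order = names(X)) {
  expected <- numeric(length(order) + 1)
  expected[1] <- mean(pred_fun(X))            # v_0: average model response
  X_cond <- X
  for (j in seq_along(order)) {
    X_cond[[order[j]]] <- x_star[[order[j]]]  # condition on X^j = x_star^j
    expected[j + 1] <- mean(pred_fun(X_cond)) # E[f(X) | first j vars fixed]
  }
  setNames(diff(expected), order)             # successive differences = v(j, .)
}
```

By construction, `expected[1]` plus the sum of the contributions equals the prediction for `x_star`, which is what makes a waterfall plot like Figure 5 possible.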
5.3. Other Techniques

In this paper, we demonstrate just a few model-agnostic ML interpretation techniques. These represent a small subset of existing techniques, each of which has additional variations. In the remainder of this section, we point out a few common techniques not covered in our case study.

Individual conditional expectation (ICE) plots disaggregate PDPs into their instance-level components for a more granular view into predictor sensitivities (Goldstein et al. 2015). To accommodate correlated variables in PDP, accumulated local effect (ALE) plots compute expected changes in model response over the conditional, rather than marginal, distribution of the other variables (Apley and Zhu 2016).

Local interpretable model-agnostic explanations (LIME) (Ribeiro, Singh, and Guestrin 2016) build simple surrogate models using model predictions, with higher training weights given to the point of interest, in effect replacing the complex ML model with easily interpretable linear regressions or decision trees in neighborhoods of specific points for the purpose of explanation. Taking the concept further, one can also train a global surrogate model across the entire domain of interest.

6. Conclusion
Actuarial standards of practice, most notably ASOP 41, place responsibility on the actuary to clearly communicate actuarial work products, including insurance pricing models. These responsibilities create special challenges for communicating machine learning models, which are often seen as "black boxes" due in part to their complexity, nonlinearity, flexible construction, and ad hoc nature.

In this paper, we discuss particular questions of model validation that are of key importance in communicating a model and that may present particular difficulty for machine learning models compared to GLMs or traditional pricing models. Specifically,

• How does the model impact individual insurance consumers?
• How are the predictor variables related to expected losses?

We contextualize these questions in terms of different frameworks for defining interpretability. We conceptualize interpretability in terms of the ability of a model (or modeler) to answer a set of idealized questions that would be refined over time. We then offer potential (families of) model-agnostic techniques for providing answers to these questions.

Much work remains to be done in terms of defining the role of machine learning algorithms in actuarial practice. Lack of interpretability has been a key barrier preventing wider adoption and exploration of these techniques. The methods proposed in this paper could therefore represent important strides in unlocking the potential of machine learning within the insurance industry.
Acknowledgments
We thank Navdeep Gill, Daniel Falbel, Morgan Bugbee, the volunteers of the CAS project oversight group, and two anonymous reviewers for helpful discussions and/or feedback. This work is supported by the Casualty Actuarial Society.
References
Abadi, Martín, et al. 2015. "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems." https://www.tensorflow.org/.

"Actuarial Standard of Practice No. 41 - Actuarial Communications." 2010. Actuarial Standards Board.

"Actuarial Standard of Practice No. 56 - Modeling." 2019. Actuarial Standards Board.

Anderson, Duncan, Sholom Feldblum, Claudine Modlin, Doris Schirmacher, Ernesto Schirmacher, and Neeza Thandi. 2005. "A Practitioner's Guide to Generalized Linear Models." Casualty Actuarial Society Discussion Paper Program.

Apley, Daniel W., and Jingyu Zhu. 2016. "Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models." arXiv:1612.08468 [Stat].

Bailey, Robert A., and LeRoy J. Simon. 1960. "Two Studies in Automobile Insurance Ratemaking." ASTIN Bulletin 1 (4): 192–217.

Biecek, Przemyslaw, and Tomasz Burzykowski. 2019. Explanatory Model Analysis. E-book. https://pbiecek.github.io/ema/.

Biran, Or, and Courtenay Cotton. 2017. "Explanation and Justification in Machine Learning: A Survey." IJCAI-17 Workshop on Explainable AI (XAI), 8:1.

Breiman, Leo. 2001. "Random Forests." Machine Learning 45 (1): 5–32.

Brown, Robert L. 1988. "Minimum Bias with Generalized Linear Models." Proceedings of the Casualty Actuarial Society LXXV.

Casualty Actuarial Society. 2018. Syllabus of Basic Education.

Doshi-Velez, Finale, and Been Kim. 2017. "Towards a Rigorous Science of Interpretable Machine Learning." arXiv:1702.08608 [Cs, Stat], February. http://arxiv.org/abs/1702.08608.

Dugas, Charles, Y. Bengio, Nicolas Chapados, P. Vincent, G. Denoncourt, and C. Fournier. 2003. "Statistical Learning Algorithms Applied to Automobile Insurance Ratemaking," December. https://doi.org/10.1142/9789812794246_0004.

Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. 2018. "All Models Are Wrong, but Many Are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously." arXiv:1801.01489 [Stat], January. http://arxiv.org/abs/1801.01489.

Friedman, Jerome H. 2001. "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics, 1189–1232.

Gabrielli, Andrea. 2019. "A Neural Network Boosted Double over-Dispersed Poisson Claims Reserving Model." SSRN Scholarly Paper ID 3365517. Rochester, NY: Social Science Research Network.

Gabrielli, Andrea, Ronald Richman, and Mario V. Wüthrich. 2019. "Neural Network Embedding of the over-Dispersed Poisson Reserving Model." Scandinavian Actuarial Journal, 1–29.

Gao, Guangyuan, Shengwang Meng, and Mario V. Wuthrich. 2018. "Claims Frequency Modeling Using Telematics Car Driving Data." Scandinavian Actuarial Journal.

Gao, Guangyuan, and Mario Wuthrich. 2018. "Feature Extraction from Telematics Car Driving Heatmaps." European Actuarial Journal.

Gao, Guangyuan, and Mario Wuthrich. 2019. "Convolutional Neural Network Classification of Telematics Car Driving Data." Risks.

Goldburd, Mark, Anand Khare, and Dan Tevet. 2016. Generalized Linear Models for Insurance Rating. Casualty Actuarial Society, CAS Monographs Series, no. 5.

Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. "Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation." Journal of Computational and Graphical Statistics 24 (1): 44–65.

Henckaerts, Roel, Katrien Antonio, Maxime Clijsters, and Roel Verbelen. 2018. "A Data Driven Binning Strategy for the Construction of Insurance Tariff Classes." Scandinavian Actuarial Journal.

Hilton, Denis J. 1990. "Conversational Processes and Causal Explanation." Psychological Bulletin 107 (1): 65.

Kim, Been, Rajiv Khanna, and Oluwasanmi O. Koyejo. 2016. "Examples Are Not Enough, Learn to Criticize! Criticism for Interpretability." In Advances in Neural Information Processing Systems, 2280–8.

Kuo, Kevin. 2019. "DeepTriangle: A Deep Learning Approach to Loss Reserving." Risks 7 (3): 97.

Lange, Jeffrey T. 1966. "General Liability Insurance Ratemaking." Proceedings of the Casualty Actuarial Society LIII: 26–53.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. "Deep Learning." Nature 521 (7553): 436.

Lundberg, Scott M., and Su-In Lee. 2017. "A Unified Approach to Interpreting Model Predictions." In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4765–74. Curran Associates, Inc. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.

Mango, Donald F. 1998. "An Application of Game Theory: Property Catastrophe Risk Load." In Proceedings of the Casualty Actuarial Society, LXXXV:157–86.

Mildenhall, Stephen J. 1999. "A Systematic Relationship Between Minimum Bias and Generalized Linear Models." Proceedings of the Casualty Actuarial Society LXXXVI: 393–487.

Miller, Tim. 2017. "Explanation in Artificial Intelligence: Insights from the Social Sciences." arXiv:1706.07269 [Cs], June. http://arxiv.org/abs/1706.07269.

Miller, Tim, Piers Howe, and Liz Sonenberg. 2017. "Explainable AI: Beware of Inmates Running the Asylum or: How I Learnt to Stop Worrying and Love the Social and Behavioural Sciences." arXiv preprint arXiv:1712.00547.

Molnar, Christoph, and others. 2018. "Interpretable Machine Learning: A Guide for Making Black Box Models Explainable." E-book. https://christophm.github.io/interpretable-ml-book/.

National Association of Insurance Commissioners. 2017. Predictive modeling survey of state insurance regulators, Casualty Actuarial and Statistical Task Force.

National Association of Insurance Commissioners. 2019. "Regulatory Review of Predictive Models." White paper, Casualty Actuarial and Statistical Task Force.

Nelder, J. A., and R. W. M. Wedderburn. 1972. "Generalized Linear Models." Journal of the Royal Statistical Society, Series A (General), 135 (3): 370–84.

Nilsson, Nils J. 2009. The Quest for Artificial Intelligence. Cambridge University Press.

Noll, Alexander, Robert Salzmann, and Mario Wuthrich. 2018. "Case Study: French Motor Third-Party Liability Claims." SSRN Electronic Journal.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. ""Why Should I Trust You?": Explaining the Predictions of Any Classifier." In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44. ACM.

Robertson, J. P. 2009. "NCCI's 2007 Hazard Group Mapping." Variance.

Roel, Verbelen, Katrien Antonio, and Gerda Claeskens. 2018. "Unravelling the Predictive Power of Telematics Data in Car Insurance Pricing." Journal of the Royal Statistical Society, Series C (Applied Statistics).

Spedicato, Giorgio, Christophe Dutang, and Leonardo Petrini. 2018. "Machine Learning Methods to Perform Pricing Optimization: A Comparison with Standard Generalized Linear Models." Variance 12 (1).

Wang, Haohan, and Bhiksha Raj. 2017. "On the Origin of Deep Learning." arXiv:1702.07800v4 [cs.LG].

Wuthrich, Mario V. 2017. "Covariate Selection from Telematics Car Driving Data." European Actuarial Journal.

Wüthrich, Mario V. 2018. "Machine Learning in Individual Claims Reserving." Scandinavian Actuarial Journal.

Yang, Yi, Wei Qian, and Hui Zou. 2018. "Insurance Premium Prediction via Gradient Tree-Boosted Tweedie Compound Poisson Models." Journal of Business and Economic Statistics 36 (3): 456–70.
Appendix A. Model Development

Table A.1: Input variables and their transformations.

Variable          Type         Transformation
Age range         Categorical  One-hot encode
Sex               Categorical  One-hot encode
Vehicle category  Categorical  One-hot encode
Make              Categorical  Embed in $\mathbb{R}^2$
Vehicle age       Numeric      Center and scale
Region            Categorical  Embed in $\mathbb{R}^2$
In this appendix, we describe the ML model and the data used to train it. Note that, for our paper, the ultimate goal of the modeling procedure is to develop something that can produce predictions. As a result, we do not follow standard practices for tuning and validation. However, for the sake of completeness and reproducibility, we include an overview of the process here. Implementation is done using the R (R Core Team 2019) interface to TensorFlow (Abadi et al. 2015). The model explanation visualizations utilize the implementation by Biecek and Burzykowski (2019), and the code to reproduce them is available on GitHub (https://github.com/kasaai/explain-ml-pricing).

Appendix A.1. Data
We use data from the AUTOSEG ("Automobile Statistics System") of Brazil's Superintendence of Private Insurance (SUSEP). The organization maintains policy-characteristics-level data, including claim counts and amounts, for all insured automobiles in Brazil. The data contains variables ranging from policyholder characteristics to losses by peril. We use the records from the first half of 2012, a total of 1,707,651 records. One-fifth of the data is reserved for testing; the remainder is further split into 3/4 for analysis and 1/4 for assessment, the latter used to determine early stopping.

Appendix A.2. Model

Table A.1 shows the input variables to our model and their associated transformations. For "make" and "region", we map each level to a point in $\mathbb{R}^2$.
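As an illustration of the embedding transformation, the following keras-for-R sketch maps a categorical rating variable into $\mathbb{R}^2$. It is a hypothetical fragment under assumed names (`n_makes`, `make_input`), not the paper's actual architecture:

```r
library(keras)

# Embed the "make" variable into R^2: each of the n_makes levels
# (integer-encoded 0..n_makes-1) is mapped to a learned 2-dimensional point.
n_makes <- 50  # placeholder count of distinct makes
make_input <- layer_input(shape = 1, name = "make")
make_embedded <- make_input %>%
  layer_embedding(input_dim = n_makes, output_dim = 2) %>%
  layer_flatten()  # shape (batch, 2): the point in R^2
```

In a full model, the learned coordinates would typically be concatenated with the other (one-hot encoded and scaled numeric) inputs and fed through the network's dense layers.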