Rethinking Representations in P&C Actuarial Science with Deep Neural Networks
Christopher Blier-Wong, Jean-Thomas Baillargeon, Hélène Cossette, Luc Lamontagne, Etienne Marceau
École d'actuariat, Université Laval, Québec, Canada
Département d'informatique et de génie logiciel, Université Laval, Québec, Canada
Centre de recherche en données massives, Université Laval, Québec, Canada
Centre interdisciplinaire en modélisation mathématique, Université Laval, Québec, Canada

*Corresponding author: Etienne Marceau, [email protected]
February 12, 2021
Abstract
Insurance companies gather a growing variety of data for use in the insurance process, but most traditional ratemaking models are not designed to support them. In particular, many emerging data sources (text, images, sensors) may complement traditional data to provide better insights to predict the future losses in an insurance contract. This paper presents some of these emerging data sources and presents a unified framework for actuaries to incorporate these in existing ratemaking models. Our approach stems from representation learning, whose goal is to create representations of raw data. A useful representation will transform the original data into a dense vector space where the ultimate predictive task is simpler to model. Our paper presents methods to transform non-vectorial data into vectorial representations and provides examples for actuarial science.
Keywords:
Feature learning, emerging data, unstructured data, embeddings, neural networks, machine learning, property and casualty insurance, pricing
1 Introduction

Actuaries play essential roles in Property and Casualty (P&C) insurance companies. From predicting future claims to managing enterprise risk, they combine their domain expertise with statistical models to achieve the company's objectives. At the center of these models are data: constructing predictive models requires historical experience, see Figure 8.1 in [43].

This paper is about data. It is not about modeling, but rather about the crucial raw material that is data. Insurance companies acknowledge that they sit on a precious natural resource: data. Data is everywhere and has a significant impact on financial institutions. Abundant collections of data may improve the understanding of the composition of risks in a portfolio and provide more accurate predictions of losses (pricing), claim development (reserving) and dependence (risk management). Specifically, emerging data sources will reduce model heterogeneity by segmenting customers into more homogeneous risk classes. New external data sources also provide more accurate insurance premiums for new contracts in business segments with lower exposures.

This emerging data source is sometimes called Big data and consists of structured, unstructured and semi-structured data. Most existing ratemaking models are designed to use vectorial variables, a type of structured data that spreadsheets can neatly store. Actuaries expect underwriters to ask questions upon quoting the contract, and each answer has a dedicated column in the database, making statistical modeling with generalized linear models (GLMs) straightforward. Non-vectorial data, such as text, voice, images, sensors and other big data, require more processing before being used in such a statistical model. A recent series on predictive models in actuarial science [20, 21] presents one application of unstructured data for pricing (a model for pricing telematics driving), indicating a potential lack of a framework for dealing with unstructured data for P&C ratemaking.

This paper provides a framework to use non-vectorial data within traditional ratemaking models using representation learning embeddings. The representation of an observation corresponds to the information that is used to produce a prediction. In classical ratemaking, the representation of an a priori loss estimate is the input variables provided by the potential customer during the quoting process. As companies collect and organize emerging data, these are added to existing representations, making the combined representations non-vectorial and unsuitable for existing ratemaking models. Representations operate on a more recent scientific paradigm where quantitative methods and massive emerging data can produce better predictions with a simpler model. The approach we take in this paper is to create vectorial representations from emerging data.

It is worthwhile to differentiate between variables and features, which have different meanings in this paper. By variables, we mean the raw information collected from the insured. The amount and type of rating variables depend on the line of business, the insurer history and the actuary or underwriter's preference. By features, we mean the information that is used as inputs to the ratemaking model. A feature could be a variable itself but could also include modified or transformed versions of existing variables.
For instance, the underwriter may ask the insured's age (or date of birth), which would be a variable. A feature typically used in a ratemaking model is the insured's age, so in this case, the variable is also a feature. An actuary could also consider other features in the model, including polynomial transformations of the age (like age squared) or the interaction between age and gender. Another example includes the marital status of the insured. The variable would be the qualitative (nominal) variable, while the feature would be a quantitative transformation (like the one-hot encoding of the class, see subsection 2.1 for details).
The typical approach in ratemaking is to use statistical models with variables as inputs. We call this approach variable-based learning since the inputs to the models are the unmodified variables. Consider a dataset of $n$ observations, each with a vector of variables $\boldsymbol{x}_i$ for $i = 1, \dots, n$. If the variables are vectorial, then the length of the vector $\boldsymbol{x}_i$ will be the same for all observations $i = 1, \dots, n$. For instance, if $p$ variables are available for each observation, then $\boldsymbol{x}_i = (1, x_{i1}, x_{i2}, \dots, x_{ip})$, $i = 1, \dots, n$, is a $(p+1)$-dimensional vector. In this paper, we focus on machine learning algorithms that act as a function on the input data. We seek a function $f$ such that
$$r(Y_i) = f(\boldsymbol{x}_i) + \varepsilon_i, \quad i = 1, \dots, n,$$
where $f : \mathbb{R}^p \to \mathbb{R}$, $Y$ is some response variable, $r$ is a known function and $\varepsilon$ is a random residual with zero mean and finite variance. A popular approach in actuarial science includes generalized linear models [41, 42], where
$$g(\mathrm{E}[Y_i]) = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j = \boldsymbol{x}_i \boldsymbol{\beta}, \quad i = 1, \dots, n, \qquad (1)$$
where $g$ is the link function and $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)'$, such that $f = g^{-1}(\boldsymbol{x}_i \boldsymbol{\beta})$. This model is linear, ignoring interactions and non-linear transformations between features.

Actuaries improve the pricing performance by adding hand-crafted features (through transformations or non-linear interactions). We will note the vector of these transformations with an asterisk, $\boldsymbol{x}^*$. A ratemaking model using the variables and the created features could have the same shape as (1), but the vector of predictors would also include these transformations.

The novel approaches for ratemaking with machine learning involve flexible end-to-end models that transform the input data into predictions, see [6] for a review. In our experience of training end-to-end models on P&C insurance data, the optimal hyperparameters selected were not much more flexible than GLMs (not much depth in the neural network), meaning there were not enough labeled response variables (claim frequency or costs) to increase the model flexibility vastly.

In this paper, we detail a framework to automatically construct the feature vector $\boldsymbol{x}^*$ to use in a regression model. To accomplish this, we decompose the ratemaking process in two steps:

step 1: a representation model to construct the embedding vector $\boldsymbol{x}^*$, and
step 2: a regression model (typically a GLM) to model the claim frequency, claim severity or aggregate costs for an insurance contract, using $\boldsymbol{x}^*$ as inputs.

Figure 1 outlines the process of the approach. Our focus is solely on the representation model. Throughout the paper, we will provide different methods, from various emerging data, to construct a vectorial representation vector $\boldsymbol{x}^*$ that is useful.
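To make the GLM of equation (1) concrete, here is a minimal sketch of fitting such a model in Python with the statsmodels package, assuming a Poisson claim-count response with a log link; the data are randomly generated placeholders.

```python
import numpy as np
import statsmodels.api as sm

n, p = 1000, 3
X = np.random.rand(n, p)                  # n observations of p variables
y = np.random.poisson(lam=1.0, size=n)    # placeholder claim counts

# GLM of equation (1): the added constant column gives the intercept beta_0.
fit = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
print(fit.params)                         # estimates of (beta_0, ..., beta_p)
```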
[Figure 1 (diagram): traditional variables (TV) and emerging variable sources (EV1, ..., EVm) feed traditional features (TF) and emerging features (EF1, ..., EFm), which are combined into features x* (CF) and passed to a predictive model (PM) that outputs a prediction; Step 1 is the representation model, Step 2 is the predictive model, and together they form an end-to-end model.]

Figure 1: Ratemaking process deconstructed. Orange: traditional ratemaking steps. Green: emerging data. Violet: representation steps.

A useful representation transforms the input data such that the transformed feature improves the performance of a supervised learning model. If we create high-quality representations in step 1, it may be unnecessary to use more complex machine learning models in the regression model [2], only requiring GLMs. For instance, if a generalized additive model learns a polynomial relationship for a feature, then a good representation would also capture this polynomial relationship within x*, so a GLM using x* would perform as well as a GAM using x. With the same idea, a good representation could capture the non-linear transformations and interactions that tree-based models and deep neural networks do, all while staying within the GLM framework.

By using our approach, we can create more flexible models, in some instances, than a fully-supervised approach, by training the representation model using a larger dataset of unlabeled data (although strategic pretraining of neural networks could replicate this increased flexibility for the fully-supervised approach).
Our unified approach to ratemaking with multisource data is influenced by [10], which has a similar scope as this paper but for natural language processing (NLP). Our work fits in the field of representation learning, see [2, 27] for an overview of this topic.

The representations in this paper use deep neural networks, a tool that many researchers in actuarial science have used in recent years. See [6] for a recent review of machine learning in P&C ratemaking and reserving. In [25], PCA (principal component analysis) and autoencoders are used to extract features from velocity-acceleration heatmaps created from telematics trips. These representations are then used in a regression model in [24]. Autoencoders and PCA are used in a tutorial to extract representations of categorical features in [46]. Then, [5] compare PCA and autoencoders, and propose a convolutional neural network to extract spatial representations to classify the severity of car accidents.

Our unified approach extracts features to use in a GLM model. The relationship between GLMs and deep neural networks was also recognized by [52]. Based on this relationship, they proposed the combined actuarial neural network (CANN). In this model, a GLM captures linear effects, and a neural network captures residual effects. We can interpret CANN as boosting the classical GLM model, but it leaves the GLM framework. [29] also use GLMs: the authors extract features of a gradient boosting machine using partial dependence effects and use them in a GLM. This approach aims to approximate a complex model with GLMs while retaining feature interpretability.

The idea of extracting intermediate representations is used in many fields; we cover a few below. [3] and [10] propose models to extract a distributed representation for words, a tool used in many applications of NLP and explained further in Section 5. Representations of electronic health records are created in [40, 45], generating simplified representations of patients used to predict medical outcomes. [33] employ autoencoders to create representations used to identify molecular fingerprints with predefined anticancer properties. Even YouTube, a video streaming service, creates user profile representations with similar techniques to determine which video to recommend, see [12] for details.
The remainder of this paper is structured as follows. Section 2 presents the different formats and types of data that we will study in this paper. In Section 3, we structure modeling paradigms along a scale based on the degree of feature abstractness. We study the representation approach in detail in Section 4 and explain different approaches to extract representations from non-vectorial data. Sections 5 to 7 provide more detail on our approach for emerging non-vectorial data in actuarial science, including text, images and spatial data respectively. Section 8 concludes the paper.
2 Emerging data

In this section, we present the emerging sources of data that have imminent potential in P&C actuarial science, particularly for ratemaking. The data may be quantitative or qualitative. The data can also be structured into different formats, including vectorial data and image (matrix) data. Finally, the data can also be indexed to form a sequence. This section describes each and presents how they are stored for programming purposes.

Traditional data for P&C insurance are described extensively in, e.g., [43], [15], [42]. We provide an overview of traditional data in Table 1, along with emerging sources.

| Line of business | Traditional | Emerging |
|---|---|---|
| All P&C | basic customer information (age, gender) | granular customer information, IoT |
| Auto | type of car, miles driven | telematics, image of car, external data (weather, crime, traffic, vehicle identification number) |
| Home | construction characteristics, fire protection, distance to amenities, location | sensors, image of dwelling, external data (weather, crime, census) |
| Liability | type of business, experience, credentials | business reviews, lawsuit texts, underwriter comments |

Table 1: Traditional and emerging data for P&C insurance
2.1 Types of data

Data can be quantitative or qualitative. Quantitative data consists of variables that can be measured on a discrete or continuous scale. We will often omit ordinal numbers and nominal numbers since mathematical operations do not make sense with these numbers. Since the data is numerical, it is straightforward to compute other numerical transformations such as summary statistics. Most statistical models exclusively deal with quantitative data since these numerical transformations are required to produce predictions.

Many emerging variables are qualitative, also called categorical, and are usually classified into a fixed number of categories. We will use the term categorical data for the rest of this paper. Examples of categorical variables currently used in actuarial science include the following:

• residential territory (zip / postal code, city / county, state / province);
• house siding material (wood, vinyl, stone, cement);
• car types (sedan, truck, sports).

Statistical models are not designed to take categorical data as features. Categorical variables must first be transformed into quantitative features. The common method to do this is called one-hot encoding in the machine learning nomenclature and is related to indicator features or dummy coding in statistical nomenclature. The one-hot encoding function transforms a single categorical variable into a vector that can be used in a statistical model. Consider a categorical variable that can take any category $j$ from 1 to $k$. The generated one-hot encoding is a vector $\boldsymbol{e}_j$ of length $k$ filled with zeros, except at the $j$th position where the value is 1. Consider a variable composed of five categories {Cat 1, Cat 2, Cat 3, Cat 4, Cat 5}, i.e. $k = 5$. An observation with Cat 2 would generate the one-hot encoding $(0, 1, 0, 0, 0)'$, also noted $\boldsymbol{e}_2$; see the code sketch at the end of this subsection.

One-hot encoding is intuitive but has a few limitations when used in statistical models. First, as the number of categories increases, the dimension of the encoding increases, making the resulting features sparse.

Footnotes: Nominal numbers are used for identification and act as a one-to-one function between a number and a category or object, so they typically represent categorical variables. The main difference between one-hot encoding and dummy coding is that one-hot encoding has no base category. Statistical models use dummy coding because using all categories yields non-unique solutions. This is not an issue for modern machine learning since the solutions are typically non-unique and yield local optima.
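As a minimal illustration, the one-hot encoding function can be written in a few lines of NumPy; the five-category example above is reproduced below.

```python
import numpy as np

def one_hot(j, k):
    """Return e_j: a length-k vector of zeros with a 1 at position j (1-based)."""
    e = np.zeros(k)
    e[j - 1] = 1.0
    return e

print(one_hot(2, 5))   # Cat 2 among k = 5 categories: [0. 1. 0. 0. 0.]
```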
2.2 Formats of data

In this subsection, we present formats that individual variables can take for a single observation.
2.2.1 Vectorial data

The most straightforward data format is vectorial. This is the data type that spreadsheets handle efficiently. With vectorial data, the variables fit into a numeric vector. A consequence of having a dedicated element for each variable in the vector is that it does not naturally support non-fixed-length variables. Since most emerging data sources have varying sizes (see later in this section for examples), these cannot be stored in tabular (record-oriented) datasets.

Most machine learning models use vector data as inputs, including popular regression models like GLM/GAM, tree-based methods (CART, random forest, gradient boosting, see [22], [15]), and fully-connected neural networks. The unified approach for modeling non-fixed-size variables in this paper is to introduce an intermediate step of vectorizing (creating vector features of fixed size) before being used for supervised regression.
2.2.2 Image data

This section briefly introduces the image format, although image handling will be revisited in Sections 4 and 6. An image is a two-dimensional perception of a scene, usually containing one or more objects or subjects. In computers, they are projected and stored in a grid of pixels (assuming no compression). Consider an image with $N \times M$ pixels, for $N, M \in \mathbb{N}$. Each pixel has associated values corresponding to an intensity (brightness). Images are stored in a computer as a matrix, where the intensity of each pixel is an integer from 0 to 255 or a floating point number from 0 to 1. A grayscale (black and white) image has one value (channel) associated with each pixel, with 1 corresponding to white and 0 corresponding to black. Figure 2 presents an example of a 28 × 28 grayscale image of a handwritten symbol along with the matrix of values. A typical color picture has three values (channels) for every pixel, representing intensities for the red, green and blue (RGB) channels. The corresponding tensor for a color picture will have size $N \times M \times 3$.

Figure 2: Left: grayscale image of a handwritten symbol; right: matrix representation of the channel.

2.3 Indexing of data

Indexing data leads to a sequence of variables, so the order of observations is important. We will use the notation $\{X_t, t \in T\}$ for realizations of the sequence and $|X|$ for the size of each observation. This includes time series, which are indexed at discrete times ($T \subseteq \mathbb{N}$) or in continuous time ($T \subseteq \mathbb{R}^+$). They can also be a sequence of ordered variables not indexed by time, which we will also note $T \subseteq \mathbb{N}$. Each variable could be of any type (quantitative or categorical) and of any format. Examples include:

• individual reserving at the event level or snapshot (discrete-time interval) level;
• aggregate reserving (triangular data);
• stream data (including telematics), audio and video;
• textual data, which are sequences of categorical data (words) from a large vocabulary;
• time series tracking asset prices, where the sequence can be considered discrete or continuous.

We can use sequential data as inputs for many predictive tasks. On the one hand, a sequence can predict a single response variable (for instance, using a sequence of individual reserving payments to predict if a claim is open or closed). Models will have the shape
$$r(Y_i \mid X_{iT}) = f(X_{iT}, \boldsymbol{x}_i) + \varepsilon_i, \quad i = 1, \dots, n,$$
where $X_{iT}$ represents the sequence in the index set $T$, $f : \mathbb{R}^m \to \mathbb{R}$ is a predictive model and $m = |T| \cdot |X| + |\boldsymbol{x}_i|$, for $i = 1, \dots, n$. Models can also predict future observations in a sequence, like predicting the future payments in an individual reserving model in P&C or generating the end of a sentence in NLP. These models use observations in $T$ to predict the unobserved observations in $\widehat{T}$:
$$r\left(X_{i\widehat{T}} \mid X_{iT}\right) = f(X_{iT}, \boldsymbol{x}_i) + \varepsilon_i, \quad i = 1, \dots, n.$$
A numeric sequence of fixed length could be considered as vectorial data, with a column associated with each index in the sequence, as the sketch below illustrates. Therefore, we can use any algorithm for vectorial data to model sequential data, ignoring the variables' sequential nature. However, statisticians typically use models that deal explicitly with sequential data, see Section 4.1.2.
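The following minimal sketch, with made-up numbers, illustrates the last point: fixed-length sequences stack directly into a design matrix, while variable-length sequences (such as telematics trips) do not, motivating the representation step of Section 4.1.2.

```python
import numpy as np

# Three sequences of fixed length T = 4: each row doubles as a feature vector.
fixed = np.array([[1.2, 0.8, 1.1, 0.9],
                  [0.5, 0.7, 0.6, 0.8],
                  [2.0, 1.5, 1.7, 1.9]])
design_matrix = fixed                  # directly usable in a vectorial model

# Variable-length sequences cannot be stacked into a rectangular matrix;
# they require a sequence model or an embedding of fixed length.
ragged = [np.array([1.2, 0.8]),
          np.array([0.5, 0.7, 0.6, 0.8, 1.0])]
```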
3 Degrees of feature abstraction

In this section, we classify different modeling paradigms based on the degree of feature abstractness, i.e. how abstract the features are compared to the initial variables. The scale does not determine which method performs the best: this paper remains about data. Features with no abstraction use the variables themselves in an interpretable model, while features with the highest abstraction are implicit features obtained using black-box machine learning models.
3.1 Degree 0: variables as features

The basic degree on the scale is to use the variables as features themselves. This means that quantitative variables are used as features directly, and qualitative variables are transformed into one-hot vectors and used directly. At this degree, the representation of the characteristics of an insured directly corresponds to its variables. Domain knowledge serves as an inspiration to identify new variables. In the process of Figure 1, the data (TV, EV1, ..., EVm) are the input to a linear predictive model (PM).

3.2 Degree 1: hand-crafted non-linear transformations & interactions
Insurers who want to improve predictive performance without the costly process of collecting new variables could create new features. At the first degree of abstraction, an insured's representation is its features and the new ones created. In the process of Figure 1, we use data (TV, EV1, ..., EVm) to construct features by hand (TF, EF1, ..., EFm, CF) and use these in a linear predictive model (PM). Consider again the GLM from (1), where $g(\mathrm{E}[Y_i])$ is a linear relationship between input features. The linear constraint implies that the model does not capture non-linear transformations of individual features or interactions between features.

For the model to consider non-linear transformations or interactions, actuaries can manually create features that capture them. This process is referred to as feature engineering. For instance, to capture the quadratic relationship of the variable age, the feature age² must manually be added to the feature set. For the model to capture the effect of young male drivers, the actuary can create a feature age × $\boldsymbol{e}_j$, where $j$ is the category for male (see the code sketch at the end of this subsection).

We can also construct manual features from non-vectorial data. We could extract relevant statistics from raw telematics data, including average speed / acceleration, number of trips / hard brakes, and time of trips (peak or off-peak traffic). Credit scores are another example of such features, summarizing financial and credit data as a fiscal responsibility proxy.

This technique implies injecting a priori knowledge within the data based on the domain knowledge of actuaries or underwriters. Much of actuarial work (and data science in general) is dedicated to creating these manual features. Although these features add much predictive performance to models, their construction is very time-consuming. Finally, since there are infinitely many candidate non-linear transformations and interactions between features, it is improbable that the statistician will identify the best combinations by trial and error.

Three main problems arise from using degree 0 or degree 1 methods with non-vectorial data or vectorial data with lots of features (where $p$ is large).

• An individual feature may be insignificant when studying univariate statistics, but significant when studying multivariate statistics.
• In variable-based learning, there is typically a feature selection step. The size $p$ of features from degree 0 or degree 1 may be very large, making feature selection tedious. While statistical techniques could select a subset of the variables to consider (see, e.g., [44] to discard variables uncorrelated with the response variable, or LASSO, which implicitly selects significant features), variable selection methods have the following drawbacks:
  1. Variable selection is computationally intensive, and greedy methods such as backward, forward and stepwise selection are non-optimal, especially for large $p$.
  2. Optimal variables may change if the modeling task changes, requiring repeated statistical analyses and computational variable selection. For example, if there are different models for two lines of business, variable selection must be repeated for each analysis.
• The size $p$ of the vector of variables may not be constant for every observation, so (1) does not hold. For instance, an observation could have a telematics history, while another does not have one. Two insureds will also drive different amounts, so the lengths of their trip histories differ.
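As a minimal sketch of the hand-crafted features mentioned above (the column names and data are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 63],
                   "gender": ["male", "female", "male"]})

df["age_sq"] = df["age"] ** 2                    # quadratic effect of age
gender = pd.get_dummies(df["gender"])            # one-hot encoding of gender
df["age_x_male"] = df["age"] * gender["male"]    # age x gender interaction
```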
3.3 Degree 2: representation learning

Representation learning is a method to circumvent the issues of degrees 0 and 1. The idea of degree 2 representations is to create dense and compact (shorter than the original dimension, with non-zero entries) numerical vectors. These are called embeddings, and we construct them to capture structural and semantic information from variables. This degree of representations targets steps TF, EF1, ..., EFm and CF of Figure 1, building features automatically instead of manually. This degree is the focus of Section 4, and we provide specific examples in Sections 5 to 7.

In modern machine learning (gradient boosting and neural networks), all variables are fed into a model that learns complex interactions between variables. This solves the first two points above (multivariate effects and variable selection). These models have excellent predictive performance but are often considered black boxes. The models also have a large number of parameters, so diagnostic tools ignore model complexity. For this reason, we do not consider them as final pricing methods here and prefer using GLMs as predictive models. However, we can still use the intuitions behind deep neural network regressors and GLMs to explain embedding-based learning ideas. Consider a deep neural network for regression (see [6], [14] or [51] for details of deep neural network regression). The example in Figure 3 has three hidden layers, and the last hidden layer contains three neurons. We can interpret the blue arrows and nodes as a GLM since
$$\hat{y} = g\left(\sum_{j=1}^{3} w^{(3)}_j x^*_j + b^{(3)}\right),$$
which has the same shape as (1), and the features in the GLM are the hidden units of the final hidden layer of the neural network. This means that we can interpret the beginning of the network as a feature generator, applying non-linear transformations and interactions between input features to create a feature vector $\boldsymbol{x}^*$ that is ultimately used as input to a GLM.

[Figure 3 (diagram): (a) a deep neural network for regression, drawn end-to-end from the input layer through three hidden layers to the output layer; (b) the same network deconstructed in two steps, where the input and embedding layers form the representation model (step 1) and the mapping from the embedding layer to the output is the predictive model (step 2).]
Figure 3: A deep neural network for regression, interpreted as a feature generator and a GLM.

Footnotes: Techniques like LIME and SHAP exist to partially interpret the effects of inputs on the prediction. When the response variable is quantitative, we can interpret the neural network as being a regression GLM stacked on a feature map. When the response variable is categorical, we can interpret the model as being a logistic regression stacked on a feature map. Recall that although neural networks have the same shape as GLMs, they do not share the maximum likelihood properties that GLMs have.

Using this intuition, we can keep the weights from the initial part of the model, creating embeddings of the input data, and train a new GLM for other tasks (see Figure 3b). We can construct these two models sequentially. The first step will be creating representations, also called pretraining embeddings, and the final step will be a GLM on the output of the first step. We explain how to construct embeddings in Section 4.
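The two-step deconstruction of Figure 3 can be sketched as follows; this is a minimal illustration (placeholder data, invented layer sizes), assuming PyTorch for the network and statsmodels for the GLM.

```python
import numpy as np
import torch
import torch.nn as nn
import statsmodels.api as sm

# A small regression network; the slice net[:-1] is the feature generator.
net = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 3), nn.ReLU(),    # last hidden layer: 3 neurons = x*
    nn.Linear(3, 1),                # output layer, with the shape of a GLM
)
# ... train `net` end-to-end here, then keep only the representation part.
feature_map = net[:-1]

X = torch.randn(500, 10)                       # placeholder variables
x_star = feature_map(X).detach().numpy()       # step 1: embeddings

y = np.random.poisson(1.0, size=500)           # placeholder claim counts
glm = sm.GLM(y, sm.add_constant(x_star),       # step 2: GLM on x*
             family=sm.families.Poisson()).fit()
```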
Pretrained embeddings, constructed outside of the regression model, have many advantages for actuarial science, including the following.

Transform non-vectorial data into vectorial data. Much emerging data is not structured neatly; we cannot easily organize it in a row/column database. It often comes in unstructured formats, like text, images, sound, and geolocalised data. Statistical modeling of unstructured data is complex. An advantage of dense embeddings is that we can change the data type and format during construction. Many representation learning models are based on neural networks, with architectures that may handle multidimensional data (convolutional neural networks) and time-series data (recurrent neural networks). Using these models, we may extract structured representations of unstructured data and use this structured information in a traditional statistical model.

Transform a categorical variable into a continuous one. Representation models can compress categorical variables into dense vectors. When a categorical variable has many possible classes, the dimension of the one-hot encoding is also large (increasing $p$). The actuary chooses the embedding dimension $\ell$ with dense representations, typically selecting a dimension much smaller than the feature's cardinality. Fully-connected representations (see Section 4.1.1) are beneficial for categorical variables, which data scientists typically represent using one-hot encodings. Returning to our example of Section 2.1, consider a variable with $k = 5$ categories. The resulting one-hot vector has dimension 5. A fully-connected neural network applied exclusively to this category would project this sparse vector into a dense vector with a dimension of our choice. In Figure 4, we select a projection space in $\mathbb{R}^2$. A useful encoding will produce vectors of similar categories that lie in the same region of the vector space. For instance, {Cat 2, Cat 5}, {Cat 1, Cat 4} and {Cat 3} would each be clusters of similar categories. In addition, representations learned from datasets external to ratemaking could create vectors for new categories in the regression task (see Cat 6 in gray in Figure 4). With one-hot encodings, the new category would increase the dimension to 6, while embedding-based learning remains in 2 dimensions.
[Figure 4 (diagram): (a) embedding table with columns Category, Dim. 1, Dim. 2 for Cat 1 through Cat 6; (b) embedding plot of the six categories in the two-dimensional space.]

Figure 4: Black: embeddings project one-hot encodings of dimension 5 into dense vectors of dimension 2. Gray: when adding a 6th category, the embedding dimension remains two.
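A sketch of such a projection with a (randomly initialized) PyTorch embedding layer; in practice the embedding table would be learned as part of a representation model.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=5, embedding_dim=2)  # 5 categories -> R^2

cat2 = torch.tensor([1])     # 0-based index of "Cat 2"
print(emb(cat2))             # dense 2-dimensional vector for this category
# A 6th category adds a row to the table; the dimension stays 2.
```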
Reduce data dimension. Modern actuarial models use many features, but they may not all be statistically significant. Statisticians use feature selection to determine which features the model should use. Greedy algorithmic selection methods based on heuristics (e.g., backward, forward, stepwise) must train a model with many combinations of features, which is not feasible for a large number of features. Expert selection is flawed because of expert bias, and features may accidentally be omitted, putting pressure on domain experts [35]. Representation learning methods allow selecting the final dimension of the embeddings, typically decreasing $p$.
Generates reusable representations. Representations learned using techniques from Section 4 are agnostic to the ultimate regression task. Once we create embeddings, we can use them in statistical models for related domains without modifying the embeddings. The actuary does not need to know how the embeddings are created.
Learn non-linear transformations and interactions.
When using deep neural networks to construct embeddings, non-linear functions are used between each layer. This introduces non-linear transformations of combinations of input features. The representations are more abstract than the input features but capture useful transformations, and adding depth creates more representative embeddings [2]. For models like GLMs (which do not create transformations and interactions), providing transformed features is essential. Furthermore, more flexible models like gradient boosting and neural networks may not be necessary for the regression task: representation learning creates useful representations of the input data such that simpler models may be used in downstream tasks.
Decrease regression model complexity. Related to the previous point, if we break down the modeling process in two steps, a representation model followed by a regression model, then the complexity of the regression model does not depend on the complexity of the representation model. Consider a representation model creating embeddings $\boldsymbol{x}^* \in \mathbb{R}^{\ell}$ with a high number of parameters but which does not use the response variable $Y$ during training. Then, a GLM regression model with $Y$ as a response variable and using $\boldsymbol{x}^*$ as input features will only have $\ell + 1$ parameters, irrespective of the representation model complexity.

In our experience of training predictive models, unsupervised embeddings produced statistically significant features even if hand-crafted features using the same underlying variables were already included in the model, meaning that unsupervised embeddings captured significant non-linear transformations and interactions. Representations add value even with data already used within models.

3.4 Degree 3: implicit representations

The third degree of abstraction is implicit representations, meaning we never observe nor extract the representations. End-to-end learning removes the feature construction step, instead building a predictive model that will implicitly learn the appropriate representations. A conceptual difference is that the model architecture is specifically tailored to a task. It is more the model architecture that can be considered the representation, not the data itself. Examples of models at this degree include [23], who have specific network components to capture claim counts and claim severities, and [13], who build a model with six neural networks to capture different sources of payments or recoveries within the claims process.

This degree of representation is the most flexible since it exits the linear predictive model framework. These models can perform well in situations where domain knowledge leads to network architecture designs that capture unique characteristics of the predictive tasks (e.g. reserving in P&C insurance) and in situations where a substantial volume of labeled data is available. When labeled data is limited, end-to-end models do not perform well in our experience. By exiting the GLM framework, goodness-of-fit statistics typically ignore model complexity, and constructing confidence intervals that capture parameter uncertainty is tedious.

Table 2 presents a summary of properties for each degree. Since degree 2 representations separate the ratemaking process in two steps, we break down the properties for the representation model (construct) and the predictive model (use). Representation learning shares the simplicity of degree 0, while keeping (and sometimes improving) the flexibility of degree 3.

| Properties | 0 | 1 | 2 (construct) | 2 (use) | 3 |
|---|---|---|---|---|---|
| Supports emerging data | ✗ | ✗ | ✓ | N/A | ✓ |
| Supports non-linear transformations | ✗ | ✓ | ✓ | ✓ | ✓ |
| Requires no domain knowledge | ✓ | ✗ | ✗ | ✓ | ✗ |
| Requires no machine learning knowledge | ✓ | ✓ | ✗ | ✓ | ✗ |
| Not time consuming | ✓ | ✗ | ✗ | ✓ | ✗ |
| Stays in the GLM framework | ✓ | ✓ | N/A | ✓ | ✗ |

Table 2: Summary of certain properties of models at different levels of abstraction
4 Learning representations
An embedding is a dense representation of input features used to perform a predictive task. This section details the four steps to construct a representation model that generates embeddings (colored violet). There are two components in a representation model: the encoder (colored red) and the decoder or predictor (colored blue). The first step is to construct the encoder, an architecture that projects raw variables into embeddings. The second step is to construct the decoder, which will determine which domain knowledge the embeddings capture. The encoder and decoder perform the processes (TF, EF1, ..., EFm) in the process from Figure 1. The third step combines representations from multiple data sources (step CF of Figure 1), while the fourth step is to evaluate the representations.

The training procedures for neural networks (estimating the model's parameters) are out of this paper's scope since any backpropagation algorithm can perform this task. The unfamiliar reader can refer to [27] for details on training neural networks.
4.1 Step 1: constructing the encoder

The first step in constructing representations is to set up the compression architecture by answering what goes in the model and what are the hidden steps in the encoder. For this step, neural networks are a popular choice since they offer a lot of architectural flexibility and predictive performance. We explain three methods to extract vectorial representations for emerging data. For each method, we present the particular operations that capture the unique data characteristics and a topological framework, and explain how the variables become embeddings.
4.1.1 Fully-connected neural networks

The basic deep neural networks are fully-connected. This is the model presented in Figure 3b, and it supports vectorial data as input. Following [6], let $h^{(L-1)}_j$, $j = 1, \dots, J^{(L-1)}$, be the last hidden layer, with $L$ being the number of layers and $J^{(L-1)}$ the size of the last hidden layer. Then, we can select $\boldsymbol{x}^* = \boldsymbol{h}^{(L-1)}$ as features generated by the function $f : \mathbb{R}^p \to \mathbb{R}^{J^{(L-1)}}$, which transforms the input features $\boldsymbol{x}$ into the embedding vector $\boldsymbol{x}^*$.

4.1.2 Recurrent neural networks

Recurrent neural networks (RNNs) naturally capture the sequential nature of data. The network iteratively updates its state with each element of a sequence, see Figure 5 for a simple example. Let the superscript $\langle t \rangle$ denote the state at time $t$, $t = 1, \dots, T$. Let $\boldsymbol{x}^{\langle t \rangle} \in \mathbb{R}^p$ be the vector of input features at time $t$. Let $\boldsymbol{h}^{\langle t \rangle} \in \mathbb{R}^{\ell}$ be the vector of the (hidden) cell state at time $t$ and $\boldsymbol{o}^{\langle t \rangle} \in \mathbb{R}^{J^{(o)}}$ the output features at time $t$, for $t = 1, \dots, T$.

The basic idea behind RNNs is that the hidden state $\boldsymbol{h}^{\langle t \rangle}$ depends on the previous hidden state $\boldsymbol{h}^{\langle t-1 \rangle}$ and the current input vector $\boldsymbol{x}^{\langle t \rangle}$. This enables some sort of memory, so the hidden states $\boldsymbol{h}^{\langle t \rangle}$ depend on all previous observations in the sequence $\boldsymbol{x}^{\langle u \rangle}$, $u = 1, \dots, t$. The relationship between the previous hidden state and the new input vector is $\boldsymbol{h}^{\langle t \rangle} = g_h(\boldsymbol{z}^{\langle t \rangle})$, where $g_h$ is an activation function and
$$\boldsymbol{z}^{\langle t \rangle} = \underbrace{\boldsymbol{x}^{\langle t \rangle} W^{[x]} + \boldsymbol{b}^{[x]}}_{\text{new information}} + \underbrace{\boldsymbol{h}^{\langle t-1 \rangle} W^{[h]} + \boldsymbol{b}^{[h]}}_{\text{memory}},$$
where the first term depends on $\boldsymbol{x}^{\langle t \rangle}$ and represents new information, while the second term depends on $\boldsymbol{h}^{\langle t-1 \rangle}$ and represents the memory from past states, and $W^{[x]} \in \mathbb{R}^{p \times \ell}$, $\boldsymbol{b}^{[x]} \in \mathbb{R}^{\ell}$, $W^{[h]} \in \mathbb{R}^{\ell \times \ell}$, $\boldsymbol{b}^{[h]} \in \mathbb{R}^{\ell}$ are trainable weight parameters. Finally, one or more fully connected layers are usually used to produce an output prediction. For one layer, the output score is $\boldsymbol{z}^{\langle t,o \rangle} = \boldsymbol{h}^{\langle t \rangle} W^{[o]} + \boldsymbol{b}^{[o]}$, where $W^{[o]} \in \mathbb{R}^{\ell \times J^{(o)}}$ and $\boldsymbol{b}^{[o]} \in \mathbb{R}^{J^{(o)}}$, and the output value is $\boldsymbol{o}^{\langle t \rangle} = g_o(\boldsymbol{z}^{\langle t,o \rangle})$, where $g_o$ is the output activation function.

Figure 5: A simple RNN architecture. Left: recurrent representation of the architecture. Right: unrolled representation.

The hidden state vectors $\boldsymbol{h}^{\langle t \rangle}$, $t = 1, \dots, T$, can be considered as representations of the sequence because a single hidden state represents all input features from prior observations in the sequence. The architecture from Figure 5 has $T$ hidden states; we can select the embedding representation as the hidden state after the most recent observation in the sequence, i.e. $\boldsymbol{x}^* = \boldsymbol{h}^{\langle T \rangle}$. Since the dimension of the hidden layer $\boldsymbol{h}^{\langle t \rangle}$, $t = 1, \dots, T$, is constant (chosen as a hyperparameter of the neural network), the embedding is always of fixed length. Therefore, for different lengths $T$, the representation can be considered as vectorial data, i.e. the representation model is $f : \mathbb{R}^{T \times p} \to \mathbb{R}^{\ell}$.

Our simple recurrent neural network's hidden state was the sum of a hidden state and new information, but modern RNN models use more complex hidden states. For instance, long short-term memory (LSTM, [30]) and gated recurrent units (GRU, [9]) include gates to determine to what extent new information should be included or excluded in future hidden states.
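A minimal sketch of extracting such a sequence embedding with an (untrained) GRU in PyTorch; the sequence length and dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

p, l = 5, 16                        # features per time step, embedding size
gru = nn.GRU(input_size=p, hidden_size=l, batch_first=True)

seq = torch.randn(1, 12, p)         # one sequence with T = 12 observations
out, h_T = gru(seq)                 # out: all hidden states; h_T: last state
x_star = h_T.squeeze(0)             # embedding x* = h^<T>, length l for any T
```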
4.1.3 Convolutional neural networks

Convolutional neural networks (CNNs) rely on the convolution operation to capture order and distance in text, sound and image data. The convolution operation of a filter applied to these data sources creates local features. These features are useful since they capture inherent meaning from sections of the image or of the sequence. Figure 6 provides an illustration of this operation. For details on the discrete convolution for images, see [27] or [18].

$$\begin{pmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{pmatrix} * \begin{pmatrix} w & x \\ y & z \end{pmatrix} = \begin{pmatrix} aw+bx+ey+fz & bw+cx+fy+gz & cw+dx+gy+hz \\ ew+fx+iy+jz & fw+gx+jy+kz & gw+hx+ky+lz \\ iw+jx+my+nz & jw+kx+ny+oz & kw+lx+oy+pz \end{pmatrix}$$

Figure 6: Example convolution for an image of size 4 × 4 and a filter of size 2 × 2.

Consider a color image of dimension $N \times M$. To create vectorial representations of image data with CNNs, we first pass the input image through convolution layers. Then, we transform a three-dimensional feature map into one dimension by taking the convolutional representation and unrolling the feature map into a vector. That is, after the convolution operations, the feature space is a square cuboid with three dimensions $(q, q, c)$, where $q, c \in \mathbb{N}$. This feature space is unrolled (from left to right, top to bottom, front to back) into a vector $\boldsymbol{h} \in \mathbb{R}^{q \cdot q \cdot c}$. This unrolled vector can be followed by fully-connected layers, and the last fully-connected layer before the prediction is considered the embedding vector $\boldsymbol{x}^*$. The purpose of adding a fully-connected layer after the unroll stage is to reduce the spatial autocorrelation that is still present in the last convolution feature map. Figure 7 presents an example with a single fully-connected layer between the unrolled representation and the embedding. Then, $\boldsymbol{x}^* = g(\boldsymbol{z})$, where $g$ is the activation function and $\boldsymbol{z} = \boldsymbol{h} W + \boldsymbol{b}$, with $W \in \mathbb{R}^{(q \cdot q \cdot c) \times \ell}$ and $\boldsymbol{b} \in \mathbb{R}^{\ell}$.

Figure 7: Creating an embedding with convolutional neural networks (input, convolutions, unroll, embedding, output).

Figure 7 presents a CNN architecture to create vectorial embeddings from images. We can extract a function $f : \mathbb{R}^{N \times M \times 3} \to \mathbb{R}^{\ell}$ that maps a color image to an embedding vector $\boldsymbol{x}^*$ of dimension $\ell$.

We note that we can also represent sequences with one-dimensional convolutional neural networks. These models are used for text modeling and telematics modeling [26], but RNNs more naturally capture sequential properties.
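A minimal sketch of the convolution-unroll-embed pipeline of Figure 7, assuming PyTorch; the image size, channel counts and embedding dimension are invented.

```python
import torch
import torch.nn as nn

conv = nn.Sequential(                         # convolution layers
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)
img = torch.randn(1, 3, 64, 64)               # one 64 x 64 color image
h = conv(img).flatten(start_dim=1)            # unroll feature map into h
fc = nn.Linear(h.shape[1], 32)                # fully-connected layer
x_star = torch.relu(fc(h))                    # embedding x* = g(hW + b)
```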
4.2 Step 2: constructing the decoder

The decoding or predictive step is vital since it determines the knowledge the representation model induces in the embeddings. This subsection presents a few approaches.

4.2.1 Autoencoders

An unsupervised method to construct representations is to build a model to predict the input features themselves. Using a large quantity of information, we first construct vectors $\boldsymbol{x}^*$ projecting input data in a latent space of smaller dimension. Undercomplete (bottleneck) autoencoders accomplish dimension reduction by building an encoder and a decoder. The encoder compresses input features into a smaller dimension than the input dimension, and the decoder attempts to project the compressed data back by reconstructing the input features. Let $f : \mathbb{R}^p \to \mathbb{R}^{\ell}$ be the encoder and $g : \mathbb{R}^{\ell} \to \mathbb{R}^p$ be the decoder, where $p$ is the input dimension and $\ell$ is the latent space (embedding, or representation) dimension. To perform dimension reduction, we choose $\ell \ll p$. To build this model, we seek an encoder $f$ and a decoder $g$ such that the reconstruction error is minimized,
$$\arg\min_{g,f} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( g(f(\boldsymbol{x}_i))_j - x_{i,j} \right)^2. \qquad (2)$$
Multiple classes of functions exist for $f$ and $g$ to perform this optimization.

In principal component analysis (PCA), $\boldsymbol{x}^*_i = f(\boldsymbol{x}_i) = W'\boldsymbol{x}_i$, $i = 1, \dots, n$, where $W \in \mathbb{R}^{p \times \ell}$ is an orthonormal matrix corresponding to the first eigenvectors of the covariance matrix of $\boldsymbol{x}_i$. Although proofs for principal components usually rely on a variance maximization objective, we can also show that the principal components minimize the reconstruction error in (2) subject to $W$ being an orthonormal basis. The embedding layer contains the principal components of the data; see Figure 8a.

Figure 8: Bottleneck architectures of PCA and autoencoders. (a) Principal component analysis: input layer, embedding layer, reconstructed layer. (b) Autoencoder: input layer, hidden encoder layer, embedding layer, hidden decoder layer, reconstructed layer.
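Before contrasting PCA with deeper models, here is a minimal sketch of an undercomplete autoencoder minimizing (2), assuming PyTorch; dimensions and data are placeholders.

```python
import torch
import torch.nn as nn

p, l = 20, 4                       # input dimension p, embedding dimension l << p
encoder = nn.Sequential(nn.Linear(p, 10), nn.ReLU(), nn.Linear(10, l))
decoder = nn.Sequential(nn.Linear(l, 10), nn.ReLU(), nn.Linear(10, p))

x = torch.randn(1000, p)           # placeholder data
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
for epoch in range(100):
    opt.zero_grad()
    loss = ((decoder(encoder(x)) - x) ** 2).sum()   # objective (2)
    loss.backward()
    opt.step()

x_star = encoder(x).detach()       # embeddings
```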
The representation ability of PCA is restricted to linear combinations. With deep learning, the representations are more flexible. Adding more depth may create more abstract representations (typically more useful in models), but these are harder to train [2].

Autoencoders optimize (2) using a fully-connected neural network as an encoder and a fully-connected neural network as a decoder. When $\ell \ll p$, encoders compress input data to a lower-dimension layer, and the decoder attempts to reconstruct (decompress) the input data. If the compressed information is mostly reconstructed, the compressed features retain the input data's useful information. Figure 8b contains a graphical representation of an autoencoder with $p$ input features, one hidden layer of $m$ neurons and an embedding layer of size $\ell$. The values $h^e_j$ and $h^d_j$, $j = 1, \dots, m$, are hidden layers of the encoder and of the decoder.

4.2.2 Transfer learning

Transfer learning creates representations from similar tasks. If the tasks are similar, the representations will be useful in both cases. Consider a car insurer that launches a new business line in motorcycle insurance. The insurance company cannot create motorcycle drivers' representations due to the lack of data, but it can create car drivers' representations. Since car insurance claim histories include domain knowledge on the risk level, the car driver representation will be useful as a motorcycle driver representation. Transfer learning induces domain information from related tasks into a representation [7].

Embeddings extracted from transfer learning are typically the last hidden layer before the output layer, see Figure 3. This layer is the farthest from the input layer (thus highly non-linear) and close to the prediction, such that the hidden layer contains useful information for prediction.
4.3 Step 3 (optional): combining representations

Now that we have constructed embeddings of many emerging sources of data, we must combine these embeddings into one feature vector that will be used in a predictive model. To combine representations, we can use the encoder and decoder techniques above.

For example, consider a model with a vectorial representation of dimension $\ell_1$, a sequential representation of dimension $\ell_2$ and an image representation of dimension $\ell_3$. Concatenating these representations generates a single vector of dimension $\ell_1 + \ell_2 + \ell_3$. We can optionally mix these embeddings by adding fully-connected layers after the concatenation and train a new combined embedding with transfer learning (4.2.2) or by constructing an autoencoder with the concatenated representation. We could also combine the three pre-trained encoder models into a single encoder and fine-tune the weights for a combined decoder.

4.4 Step 4 (optional): evaluating representations

The last optional step is to evaluate the representations. There are two types of methods to perform this task. The first is intrinsic, where we determine if the representations make sense. This method usually consists of comparing the distance between two representations. A popular distance measure is the cosine distance between vectors,
$$\text{cosine}(\boldsymbol{u}, \boldsymbol{v}) = \frac{\boldsymbol{u} \cdot \boldsymbol{v}}{\|\boldsymbol{u}\|\,\|\boldsymbol{v}\|}, \qquad (3)$$
where $\|\boldsymbol{u}\|$ is the Euclidean norm. We can perform intrinsic evaluations of representations by selecting a vector, computing the cosine distance with other vectors and studying the observations with the smallest and largest cosine distances. The representations make sense if small cosine distances occur with similar observations, and large cosine distances occur with dissimilar observations. Researchers use benchmark datasets to evaluate the quality of word embeddings in fields like NLP, comparing cosine distances with human-assigned similarity scores like wordsim353 [19].

The second type of representation evaluation is extrinsic evaluation. In this method, we use the representations in downstream tasks, i.e. use them as features in a predictive model. When comparing two representation models, the one that performs the best on the ratemaking prediction is preferred. A data scientist who constructs representations should select downstream tasks beforehand and select a performance threshold at which the embeddings are acceptable for use.

5 Textual representations

Our first example deals with textual representations. The field of computer science studying text is called natural language processing (NLP). A comprehensive introduction to NLP is given in [32]. Individual words are categorical variables, usually represented as one-hot encodings. Texts (sentences, paragraphs or documents) are a series of words, so they have a categorical format and sequence type. This combination of types and formats makes textual documents challenging to store in a typical design matrix. In this section, we present sources of documents for P&C insurance. Then, we expand on the numerical analysis of words, including word embeddings. Then, we explain methods to extract representations of documents using word embeddings.
5.1 Sources of documents

A non-exhaustive list of sources of documents in P&C actuarial science includes insurance contracts, insurance endorsements (amendments) modifying basic insurance contracts, claims adjuster notes, underwriter comments and email exchanges between the insurance company's agents and its clients. More textual information could be available to commercial insurance actuaries by scraping website information and customer reviews. Most applications of text in P&C insurance research focus on claims adjuster notes, including classification of injury severity and type [49], classification of peril type from descriptions [36], identifying large losses from descriptions [1], [36], [47], and identifying fraudulent claims [50].
5.2 Words and word embeddings

Following [32], a dataset containing textual documents is called a corpus. A distinct word in a corpus is called a word type. The set of word types is called the vocabulary $V$, and its size is noted $|V|$. A word in a corpus is called a token, and the number of tokens in a corpus is noted $N$. For example, consider a text composed of the word types {fire, home, cancer, flood, car}, so $|V| = 5$. We may use these features in machine learning models through one-hot encoding. As the corpus grows, the vocabulary size $|V|$ will also increase (the size is typically over 10 000), making these vectors very sparse. Additionally, we cannot perform mathematical operations on the vectors to determine correlations or similarities between words. The distributional hypothesis [28] established the link between a word's context and its meaning. A better representation of a word would enable computational comparisons between similar words. For instance, we expect perils like {fire, flood} to be more similar to each other than to the types of insurance {car, home} or to life insurance preconditions like {cancer}.

Word embeddings accommodate the distributional hypothesis by assigning a dense vector to each word. This vector contains continuous values, and its dimension is much smaller than $|V|$. A useful representation of a vocabulary in two dimensions would cluster perils and types of coverages separately. Returning to Figure 4, we observe clusters of perils {Cat 1, Cat 4}, types of insurance {Cat 2, Cat 5} and an unrelated word {Cat 3}. Useful word embeddings trained on a large corpus would also produce a vector for a new coverage liability as a 6th class close to the peril cluster.

Creating these representation vectors by hand (feature engineering) would be a tedious task. Instead, we use representation learning tools to extract useful representations from the data automatically. The typical method to model sequential variables is to construct a model using past data as input and predict the next observation in the sequence, i.e. a function of the shape $x_{T+1} = f(x_t, t = 1, \dots, T)$. To construct word embeddings, following [10], we also look at the future context to gather meaning, creating a function of the shape $x_t = f(x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2})$.

The first widespread word embedding models were word2vec, a pair of models presented in [37, 39] and illustrated in Figure 9. The diagram in Figure 9a is the continuous bag-of-words (CBOW) model, where the context predicts a word. The diagram in Figure 9b presents the skipgram model, where a word predicts the context. The inputs and outputs of the models are one-hot encodings of individual words. Unlike the representation models presented in Section 3, the embeddings in word2vec are extracted from the parameter weight matrix (represented by arrows in Figure 9). In the examples of Figure 9, there are two context words before and two after. The size of the context window will determine the type of word embeddings that are generated. Typical word embeddings have a dimension $\ell$ between 100 and 300, much shorter than $|V|$.

Word2vec is a classical approach to construct word embeddings, and we limit the scope of this paper to these models. For further explanations of word embeddings in actuarial science, see [36].

[Figure 9 (diagram): (a) CBOW architecture, where the context words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ pass through a hidden layer to predict $\widehat{w}_t$; (b) skipgram architecture, where the word $w_t$ passes through a hidden layer to predict the context words.]
Figure 9: word2vec embedding models.

Training word embeddings is an active area of NLP research; the interested reader can look at [16] and variants for modern word-embedding models using transformers.
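As an illustration, a skipgram word2vec model with a context window of two words on each side can be trained in a few lines with the gensim package; this is a minimal sketch on an invented three-document corpus.

```python
from gensim.models import Word2Vec

corpus = [["fire", "damaged", "the", "kitchen"],
          ["flood", "damaged", "the", "basement"],
          ["the", "car", "was", "stolen"]]       # toy tokenized documents

model = Word2Vec(sentences=corpus, vector_size=100,
                 window=2, sg=1, min_count=1)    # sg=1 selects skipgram
vec = model.wv["fire"]                           # 100-dimensional embedding
print(model.wv.most_similar("fire"))             # neighbors by cosine distance
```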
5.3 Document representations

In NLP, word embeddings have had great success as they compress knowledge from many NLP tasks into dense and compact vectors usable in modeling tasks [11]. Today, most applications of NLP rely on pre-trained embeddings [38]. In this section, we will explain how to use these word embeddings to create document representations.

Let $d$ be a document, consisting of $|d|$ word tokens. We will use the notation $w^{\langle i \rangle}$ to represent the $i$th word in the document, $i = 1, \dots, |d|$. In addition, let $\boldsymbol{w}^{\langle i \rangle *}$ be the word embedding of word $w^{\langle i \rangle}$, $i = 1, \dots, |d|$. We can summarize a document by calculating the document centroid, the average embedding vector in a document, given by
$$\boldsymbol{d}^* = \frac{1}{|d|} \sum_{i=1}^{|d|} \boldsymbol{w}^{\langle i \rangle *}.$$
Document centroids are useful to compare how similar two documents are, by using the cosine distance (3) between two documents, $\text{cosine}(\boldsymbol{d}^*_1, \boldsymbol{d}^*_2)$; a short code sketch follows at the end of this subsection. Applications of document similarity include information retrieval, plagiarism detection and news recommender systems [32].

We can also create document representations for specific classification or regression tasks using the methods from Section 3. For example, consider an insurance company that writes many unique amendments to their primary insurance contracts. Calculating the amendments' effect on the frequency of claims is not as simple as with vectorial data since we cannot group amendments. In this case, we could train a model to predict a claim's occurrence, given the endorsement text. Since there is a sequence of word inputs and a single output, we can use a many-to-one recurrent neural network as presented in Figure 10.

To price a future insurance contract, we can use the output $\boldsymbol{o}^{\langle T \rangle}$ from Figure 10 as an additive or multiplicative effect on the claim frequency (end-to-end learning), or extract an embedding from the hidden state $\boldsymbol{h}^{\langle T \rangle}$ and use it as features in the ratemaking model (representation learning).

Figure 10: Many-to-one recurrent neural network to predict the occurrence of a claim. Input: pre-trained word embeddings; output: estimated probability.
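A minimal sketch of document centroids and their cosine comparison, with placeholder embeddings of dimension 100:

```python
import numpy as np

def cosine(u, v):
    """Cosine measure of equation (3)."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def centroid(word_vectors):
    """Document centroid: average of the word embeddings in a document."""
    return np.mean(word_vectors, axis=0)

d1 = centroid(np.random.rand(50, 100))   # document with |d| = 50 tokens
d2 = centroid(np.random.rand(80, 100))   # document with |d| = 80 tokens
print(cosine(d1, d2))
```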
A ratemaking model could use any image with a relationship to the insured product or the customer. Due to the potential for discrimination when using a customer's image, we restrict our analysis to images of the insured product.

For homeowners insurance, using a house's image as an input to the regression model could be beneficial. There are many methods to obtain the images, including asking the insured to provide a picture of the house. However, this may be cumbersome and discourage the customer from proceeding with the quoting process, so we seek an automatic method to extract this image from the internet. The two most readily available sources are aerial images (orthorectified images from satellites or planes) available through a provider like Google Satellite, and facade images available from Google Streetview. These images are simple to obtain through a web API (Application Programming Interface). For instance, to download a Google Streetview image, we call the web API with the following command:

https://maps.googleapis.com/maps/api/streetview?size=600x600&location=LOCATION&key=KEY

where LOCATION is an address and KEY is an API key provided by Google. Therefore, when a potential customer provides his or her address during the quoting process, the ratemaking model can automatically perform a call to this API, get the image and use it to provide the quote.

For car insurance, we could consider using the image of a car as an input to the model. However, structured vehicle characteristics are more readily available than home information, so images may be unnecessary.
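For illustration, the same call can be scripted; the minimal sketch below uses Python's requests library, and the address and API key are placeholders.

```python
# Minimal sketch of the Google Streetview API call described above, using the
# requests library. The address and API key below are placeholders.
import requests

LOCATION = "2325 Rue de l'Universite, Quebec, QC"  # placeholder address
KEY = "YOUR_API_KEY"                               # placeholder API key

response = requests.get(
    "https://maps.googleapis.com/maps/api/streetview",
    params={"size": "600x600", "location": LOCATION, "key": KEY},
)
response.raise_for_status()

# The response body is the JPEG image of the facade at that address.
with open("facade.jpg", "wb") as f:
    f.write(response.content)
```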
Two main preprocessing steps to use images in our context are identifying occlusions (by vehicles or trees) and centering the dwelling. To help with these steps, we can use semantic segmentation techniques to identify instances in an image, for instance by applying [54, 55]. Figure 11 presents an image and its semantic instances using the ResNet50dilated + PPM_deepsup model.

Figure 11: Image and semantic segmentation. Brown: building (59.82%). Dark green: tree (11.73%). Blue: sky (9.92%). Light green: grass (9.02%). Gray: road (6.34%). Other instances: <5%. Original image: Simon Pierre Barrette, Marie-Sirois house, 25 May 2019, CC BY-SA 4.0.
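One use of the predicted instances, detailed in the next paragraph, is to filter out images where the building covers too small a share of the frame. The minimal sketch below illustrates this filter; the mask format, class index and threshold are illustrative assumptions.

```python
# Minimal sketch of filtering images by the share of pixels that a semantic
# segmentation model labels as "building". The mask format, class index and
# threshold are illustrative assumptions.
import numpy as np

BUILDING_CLASS = 1         # placeholder index of the "building" class
MIN_BUILDING_SHARE = 0.30  # placeholder threshold

def keep_image(mask: np.ndarray) -> bool:
    """mask: 2-D array of per-pixel class indices predicted by the segmenter."""
    building_share = np.mean(mask == BUILDING_CLASS)
    return building_share >= MIN_BUILDING_SHARE

# Example: a 600x600 mask where ~60% of pixels are building, as in Figure 11.
mask = np.full((600, 600), BUILDING_CLASS)
mask[:240, :] = 0          # relabel 40% of pixels as another class
print(keep_image(mask))    # True
```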
The predicted instances can be used to filter images (discarding those where the house or building instance falls below a threshold, as sketched above), to center an image, or as a fourth channel in the input image, letting the representation model determine how to treat the different instances.

The first method to create representations of images is convolutional autoencoders, as presented in Figure 12. Like the autoencoders we presented in Section 3, the input and the output of the model are the images. In this model, the encoder is composed of convolutional operations. The final feature map is unrolled into a vector and followed by fully-connected layers. The model then rolls the representation back into a grid, followed by deconvolution and max unpooling steps; see [18] for the arithmetic details of these operations.

Figure 12: Convolutional autoencoder to extract image embeddings (input → embedding → reconstruction)
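The following is a minimal sketch of such a convolutional autoencoder; the use of PyTorch, the 64×64 input size, the layer sizes and the embedding dimension are illustrative assumptions rather than a prescribed architecture.

```python
# Minimal sketch of a convolutional autoencoder for image embeddings, in the
# spirit of Figure 12. PyTorch, the 64x64 input size, the layer sizes and the
# embedding dimension (l = 64) are illustrative assumptions.
import torch
from torch import nn

class ConvAutoencoder(nn.Module):
    def __init__(self, embedding_dim: int = 64):
        super().__init__()
        # Encoder: convolutions, then unroll the feature map into a vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 16 x 32 x 32
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 32 x 16 x 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, embedding_dim),                 # the embedding
        )
        # Decoder: roll the vector back into a grid, then deconvolve.
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, 32 * 16 * 16),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),  # pixel intensities in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
images = torch.rand(8, 3, 64, 64)  # placeholder batch of house images
loss = nn.functional.mse_loss(model(images), images)  # reconstruction loss
embeddings = model.encoder(images)  # 8 embeddings of dimension 64
```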
Another method to create image representations is transfer learning. Consider a CNN trained to classify the content of an image on a large dataset like ImageNet. The layer right before the prediction step then corresponds to a semantic representation of the image's content, and we can extract an embedding from this layer (i.e., by discarding the final layer of the CNN).
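A minimal sketch of this extraction follows, assuming torchvision's pre-trained ResNet-50 as the ImageNet classifier; the choice of model and input size are illustrative.

```python
# Minimal sketch of transfer learning: reuse an ImageNet classifier as a
# feature extractor by discarding its final classification layer. The choice
# of torchvision's ResNet-50 is an illustrative assumption (downloads weights).
import torch
from torch import nn
from torchvision import models

cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# Keep every layer except the final fully-connected classification layer.
feature_extractor = nn.Sequential(*list(cnn.children())[:-1])
feature_extractor.eval()

images = torch.rand(8, 3, 224, 224)  # placeholder batch, ImageNet input size
with torch.no_grad():
    embeddings = feature_extractor(images).flatten(1)  # 8 x 2048 embeddings
```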
Spatial representations
Geographical information is useful in insurance since it helps contextualize risks. For weather-related risks such as flooding, geographical information is crucial since location is a primary factor in assessing claim frequency. Insurance companies might also want to limit the number of houses they insure on the same street since, in the event a flood occurs, they will need to pay many claims simultaneously, putting them in a difficult financial situation. For socio-demographic risks such as driving, habits depend on where drivers live: drivers are less likely to have accidents in rural areas since they use rarely frequented roads, but when they do have accidents, they tend to generate higher claims since crashes are more severe than in cities.

Insurers with a high volume of risk exposure in a territory may rely on historical loss data to predict losses, using their experience to smooth previous losses and estimate risk relativities based on the one-hot representation of base territories. Insurers with low exposure in a territory must use external spatial information to predict future losses. This creates two issues:

1. much spatial information may be needed to model spatial effects adequately, and
2. spatial effect relativities must vary smoothly in space.

Spatial representations of risks address these issues. This section provides the ideas behind spatial representations, followed by an overview of the convolutional regional autoencoder (CRAE) [5], which designs a representation of geographical information. This model is based on convolutional autoencoders, with an architecture similar to Figure 12. Using spatial representations, we can use vectorial ratemaking models instead of spatial models, therefore not requiring historical experience in a territory to produce a prediction.
Many sources of geographic information could improve P&C ratemaking. Experts estimate that 80% of data is geographic [8]. While we can represent spatial data in many formats (see [4] for details in actuarial science), we limit our study to geo-localized features (point patterns). These include most emerging external data like weather, crime, traffic and census information.
Territories are categorical variables, typically stored as one-hot encodings. For instance, consider a set of five cities {Matane, Montréal, Kuujjuaq, Sept-Îles, Québec}. An observation from Montréal would have the one-hot vector $e_2$. Montréal and Québec are large urban areas with heterogeneous populations and abundant services; for this reason, we expect these territories to be similar. Sept-Îles and Matane are smaller remote population centers with homogeneous populations. On the other hand, Kuujjuaq is a northern village composed of different populations and fewer services. The one-hot representation does not capture these similarities. When projecting territories in two dimensions, a good representation would create clusters of similar territories, as in Figure 4.

Creating these representations by hand would be tedious. Spatial embeddings learn these representations from data automatically. This section presents a method to construct these embeddings using census data, although other approaches are possible; see [48, 53, 17, 31]. An added advantage of using spatial embeddings is that the embedding values vary smoothly: neighboring territories will have similar embedding values. This is desirable for insurance ratemaking since we expect neighbors to exhibit similar risk levels. Spatial embeddings are particularly interesting for insurers with little or no data in a territory. This could be the case for regions with low populations, a territory in which the insurer wishes to grow its portfolio, or a territory where the insurer establishes a new line of business.

Consider an insurer with a portfolio in the province of Québec who wishes to expand to the province of Ontario. When using spatial embeddings, we may predict quantities of interest for regions in Ontario using a model trained in the province of Québec. Toronto (new category 6) is a large urban region with similar characteristics to Montréal (category 2). Thus, we expect the spatial effects that generate spatial risk to be similar in both regions (more similar than between Toronto and Kuujjuaq (category 3), for instance). The traditional actuarial technique to model spatial effects would be to increase the one-hot encoding dimension by one. However, we have no historical loss data to estimate the regression parameter associated with this new dimension and no way to let the model determine that this new dimension is similar to Montréal. Dense representations would keep the embedding dimension the same and learn similar vectors for Montréal and Toronto in an unsupervised way by using external data. Since their embedding values are similar, statistical models using dense embeddings as inputs will extrapolate Montréal effects to the new region of Toronto. Returning to Figure 4, Toronto could have an embedding value of (3, 4.5), placing it closest to Montréal but farthest from Kuujjuaq.

Spatial embeddings based on the one-hot encoding of territories cannot predict spatial effects in unobserved territories. Instead, we create dense representations based on external datasets with spatial information available across the entire territory of an insurer's portfolio. Consider a country performing census studies, providing summary statistics for territories across the country. We can use the information in this census as external data to construct the spatial representations. We can use the insured's address to determine the summary information from citizens in the same geographic area. Suppose we have $p$ features in the census dataset. The vector of external geo-localized information for observation $i$ is denoted $\gamma_i \in \mathbb{R}^p$, consisting of every variable available in the census dataset.
We can concatenate this vector with the original features for observation $i$ within a regression model. However, $p$ is usually large (over 2000 for Canada), so a dimension reduction procedure is required, using PCA, autoencoders or transfer learning techniques. Figure 13 presents a map of a representation created by applying PCA on $\gamma_i$, $i = 1, \ldots, n$, to create an embedding vector $x_i^*$. Each point corresponds to a postal code, and the color of the point is the intensity of one of the principal component values of $x_i^*$ at that location.

Figure 13: One-dimensional embeddings
Figure 14: Two-dimensional embeddings

We notice two problems from the plot of embeddings in Figure 13. The first is that, switching from one base territory to the next, there is a sudden jump in the embedding value, whereas we expect spatial effects to vary smoothly in space. With one-dimensional methods like PCA or autoencoders, neighboring information is not used to create the embeddings. These jumps are undesirable for insurance pricing since two neighboring customers could pay significantly different premiums if they live on opposing sides of the border between two base territories. The second problem is that geocoding is not reliable data. We can notice this problem in Figure 13 by observing individual points with embedding values different from those around them. One reason for this is that many datasets converting postal codes or addresses into GPS coordinates are crowd-sourced and not validated by central authorities. Another reason is that postal codes may not stay constant: some may be created, merged or moved, all changing the embedding values or geocoding. Two-dimensional representation models based on GPS coordinates circumvent these issues.

One implementation of dense geographic representations is the convolutional regional autoencoder (CRAE), introduced in [5]. The main idea of CRAE is to transform spatial point pattern data into grid data by spanning a grid around a central location. This process creates an image with many channels that can be modeled using the same techniques as in Section 6. Suppose again that we have external data $\gamma_i \in \mathbb{R}^p$, $i = 1, \ldots, n$. Our objective remains to create an embedding vector $x_i^* \in \mathbb{R}^\ell$ for $i = 1, \ldots, n$. Instead of creating a representation model $f: \mathbb{R}^p \to \mathbb{R}^\ell$ from the location's own information only, we also use external data from the surrounding locations.

To create a representation of the spatial information for the location of risk $i$ (noted by a star in Figure 15), we use information from $\nu_i$, a set of geographic coordinates for observations around the location of risk $i$. A straightforward method to generate the set $\nu_i$ is to create a $q \times q$ grid around the location of risk $i$, $i = 1, \ldots, n$, so that the size of the set is $|\nu_i| = q^2$. Since every element of the neighbor set belongs to a base territory (represented by polygons in Figure 15), we can extract the spatial information $\gamma_j$ for all $j \in \nu_i$, constructing a geographic data square cuboid; see Figure 16.

Figure 15: Grid around a point (longitude/latitude axes)
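To make the dimension-reduction step behind Figure 13 concrete, the following minimal sketch applies PCA with scikit-learn; the census matrix, the number of postal codes and the embedding dimension are random placeholders.

```python
# Minimal sketch of the PCA-based embeddings underlying Figure 13: reduce the
# p census features of each postal code to an l-dimensional embedding. The
# matrix below is a random placeholder for the census features gamma_i.
import numpy as np
from sklearn.decomposition import PCA

n, p, l = 1000, 2000, 10  # postal codes, census features, embedding dimension
rng = np.random.default_rng(0)
gamma = rng.standard_normal((n, p))  # placeholder rows gamma_i

pca = PCA(n_components=l)
x_star = pca.fit_transform(gamma)    # n x l embedding vectors x_i*

# Mapping the first principal component at each postal code's coordinates
# would reproduce a map like Figure 13.
first_component = x_star[:, 0]
```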
Figure 16: Data square cuboid (depth axis: features $1, 2, 3, \ldots, d-1, d$)

The data square cuboid contains the census information (represented by the depth $p$ of the cuboid) for each neighbor in $\nu_i$ (represented by the grid). This new data representation can be interpreted the same way as an image, but the number of channels is $p$ instead of 1 (grayscale) or 3 (color). The model presented in [5] uses a convolutional autoencoder as presented in Figure 12, and the embeddings are extracted using the encoder $f: \mathbb{R}^{q \times q \times p} \to \mathbb{R}^\ell$.

CRAE has proven useful as a source of predictors for downstream regression tasks. In Figure 14, we intrinsically evaluate the quality of the representations by plotting the values of one dimension of $x^*$ generated by CRAE. The values are much smoother than in Figure 13, and there is no spatial discontinuity. The mislocated points are gone since our embedding method uses coordinates and not base territories.

Conclusion

In this paper, we presented a unified framework for P&C ratemaking in actuarial science with multisource data. To accomplish this, we split the P&C ratemaking process into two steps: an encoder to create representations (the so-called representation model) and a regression model to perform actuarial tasks. A good encoder will create representations that are useful for the insurance process, such that the GLM regression model can be simple. We explained many advantages of this approach, including staying within the GLM framework (and retaining the statistical properties of GLMs) while remaining as flexible as modern machine learning models.

This framework can accept vectorial, image and sequential variables. We presented special cases of the framework for textual documents, images and geographic data. In forthcoming papers, we apply our representation framework in an actuarial pricing context.

While most machine learning applications to ratemaking have focused on improving end-to-end predictive models, our paper focuses on data. By creating useful representations of data, we can ultimately improve the performance of predictive models and retain the statistical properties of GLMs. While this approach is not novel in computer science, we believe this framework will help actuaries leverage better insights from data.
Acknowledgments
The first author gratefully acknowledges support through fellowships from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Chaire en actuariat de l'Université Laval. This research was funded by NSERC (Cossette: 04273, Marceau: 05605) and the Chaire en actuariat de l'Université Laval (FO502320). We thank Cyril Blanc for discussions on the use of images in actuarial science, especially for the semantic segmentation step. We also thank Professor Thierry Duchesne for helpful comments.
References

[1] J.-T. Baillargeon, L. Lamontagne, and E. Marceau. Mining actuarial risk predictors in accident descriptions using recurrent neural networks. Risks, 9(1):7, 2021.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
[4] C. Blier-Wong. Correction of ratemaking errors in the presence of spatial dependence. Master's thesis, 2018.
[5] C. Blier-Wong, J.-T. Baillargeon, H. Cossette, L. Lamontagne, and E. Marceau. Encoding neighbor information into geographical embeddings using convolutional neural networks. In The Thirty-Third International FLAIRS Conference, 2020.
[6] C. Blier-Wong, H. Cossette, L. Lamontagne, and E. Marceau. Machine learning in P&C insurance: A review for pricing and reserving. Risks, 9(1):26, 2021.
[7] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[8] K.-T. Chang. Introduction to Geographic Information Systems, volume 4. McGraw-Hill, Boston, 2008.
[9] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[10] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167, 2008.
[11] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.
[12] P. Covington, J. Adams, and E. Sargin. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198, 2016.
[13] L. Delong, M. Lindholm, and M. V. Wüthrich. Collective reserving using individual claims data. Available at SSRN, 2020.
[14] M. Denuit, D. Hainaut, and J. Trufin. Effective Statistical Learning Methods for Actuaries, volume III. Springer, 2019.
[15] M. Denuit, D. Hainaut, and J. Trufin. Effective Statistical Learning Methods for Actuaries, volume II. Springer, 2020.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[17] J. Du, Y. Zhang, P. Wang, J. Leopold, and Y. Fu. Beyond geo-first law: Learning spatial representations via integrated autocorrelations and complementarity. In , pages 160–169. IEEE, 2019.
[18] V. Dumoulin and F. Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.
[19] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131, 2002.
[20] E. W. Frees, R. A. Derrig, and G. Meyers. Predictive Modeling Applications in Actuarial Science, volume 1. Cambridge University Press, 2014.
[21] E. W. Frees, R. A. Derrig, and G. Meyers. Predictive Modeling Applications in Actuarial Science, volume 2. Cambridge University Press, 2014.
[22] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, 2001.
[23] A. Gabrielli. A neural network boosted double overdispersed Poisson claims reserving model. ASTIN Bulletin: The Journal of the IAA, 50(1):25–60, 2020.
[24] G. Gao, S. Meng, and M. V. Wüthrich. Claims frequency modeling using telematics car driving data. Scandinavian Actuarial Journal, 2019(2):143–162, 2019.
[25] G. Gao and M. V. Wüthrich. Feature extraction from telematics car driving heatmaps. European Actuarial Journal, 8(2):383–406, 2018.
[26] G. Gao and M. V. Wüthrich. Convolutional neural network classification of telematics car driving data. Risks, 7(1):6, 2019.
[27] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[28] Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
[29] R. Henckaerts, K. Antonio, and M.-P. Côté. Model-agnostic interpretable and data-driven surrogates suited for highly regulated industries. arXiv preprint arXiv:2007.06894, 2020.
[30] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[31] B. Hui, D. Yan, W.-S. Ku, and W. Wang. Predicting economic growth by region embedding: A multigraph convolutional network approach. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 555–564, 2020.
[32] D. Jurafsky. Speech & Language Processing. Pearson, 2000.
[33] A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, and A. Zhavoronkov. druGAN: An advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Molecular Pharmaceutics, 14(9):3098–3104, 2017.
[34] K. Kita and Ł. Kidziński. Google Street View image of a house predicts car accident risk of its resident. arXiv preprint arXiv:1904.05270, 2019.
[35] M. Kuhn and K. Johnson. Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press, 2019.
[36] G. Y. Lee, S. Manski, and T. Maiti. Actuarial applications of word embedding models. ASTIN Bulletin: The Journal of the IAA, 50(1):1–24, 2020.
[37] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[38] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405, 2017.
[39] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
[40] R. Miotto, L. Li, B. A. Kidd, and J. T. Dudley. Deep Patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 6(1):1–10, 2016.
[41] J. A. Nelder and R. W. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3):370–384, 1972.
[42] E. Ohlsson and B. Johansson. Non-life Insurance Pricing with Generalized Linear Models, volume 174. Springer, 2010.
[43] P. Parodi. Pricing in General Insurance. CRC Press, 2014.
[44] F. Pechon, J. Trufin, and M. Denuit. Preliminary selection of risk factors in P&C ratemaking. Variance, 2018.
[45] A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun, et al. Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine, 1(1):18, 2018.
[46] S. Rentzmann and M. V. Wüthrich. Unsupervised learning: What is a sports car? Available at SSRN 3439358, 2019.
[47] I. C. Sabban, O. Lopez, and Y. Mercuzot. Automatic analysis of insurance reports through deep neural networks to identify severe claims. Preprint, 2020.
[48] M. Saeidi, S. Riedel, and L. Capra. Lower dimensional representations of city neighbourhoods. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[49] A. J.-P. Tixier, M. R. Hallowell, B. Rajagopalan, and D. Bowman. Automated content analysis for construction safety: A natural language processing system to extract precursors and outcomes from unstructured injury reports. Automation in Construction, 62:45–56, 2016.
[50] Y. Wang and W. Xu. Leveraging deep learning with LDA-based text analytics to detect automobile insurance fraud. Decision Support Systems, 105:87–95, 2018.
[51] M. V. Wüthrich and C. Buser. Data analytics for non-life insurance pricing. Preprint, 2017.
[52] M. V. Wüthrich. From generalized linear models to neural networks, and back. Available at SSRN 3491790, 2019.
[53] Y. Zhang, Y. Fu, P. Wang, X. Li, and Y. Zheng. Unifying inter-region autocorrelation and intra-region structures for spatial embedding via collective adversarial learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1700–1708, 2019.
[54] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[55] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127(3):302–321, 2019.