Applications of Clustering with Mixed Type Data in Life Insurance
Shuang Yin, Guojun Gan, Emiliano A. Valdez, Jeyaraj Vadiveloo
January 27, 2021
Abstract
Death benefits are generally the largest cash flow item that affects the financial statements of life insurers, yet some insurers still do not have a systematic process to track and monitor their death claims experience. In this article, we explore data clustering to examine and understand how actual death claims differ from expected, an early stage of developing a monitoring system crucial for risk management. We extend the k-prototypes clustering algorithm to draw inference from a life insurance dataset using only the insured's characteristics and policy information, without regard to known mortality. This clustering has the feature to efficiently handle categorical, numerical, and spatial attributes. Using gap statistics, the optimal clusters obtained from the algorithm are then used to compare actual to expected death claims experience of the life insurance portfolio. Our empirical data contains observations, during 2014, of approximately 1.14 million policies with a total insured amount of over 650 billion dollars. For this portfolio, the algorithm produced three natural clusters, with each cluster having actual death claims lower than expected but with differing variability. The analytical results provide management a process to identify policyholders' attributes that dominate significant mortality deviations, and thereby enhance decision making for taking necessary actions.

Keywords: k-prototypes clustering; geospatial attributes; gap statistics; tracking and monitoring death claims.

∗ Department of Statistics, University of Connecticut, 215 Glenbrook Road, Storrs, CT 06269-4120, USA. Email: [email protected].
† Department of Mathematics, University of Connecticut, 341 Mansfield Road, Storrs, CT 06269-1009, USA. Email: [email protected].
‡ Department of Mathematics, University of Connecticut, 341 Mansfield Road, Storrs, CT 06269-1009, USA. Email: [email protected].
§ Department of Mathematics, University of Connecticut, 341 Mansfield Road, Storrs, CT 06269-1009, USA. Email: [email protected].

1 Introduction and motivation
According to the Insurance Information Institute, the life insurance industry paid a total of nearly $76 billion as death benefits in 2019. Life insurance is in the business of providing a benefit in the event of premature death, one that is understandably difficult to predict with certainty. Claims arising from mortality are not surprisingly the largest cash flow item that affects both the income statement and the balance sheet of a life insurer. Life insurance contracts are generally considered long duration, where the promised benefit could be outstanding for an extended period of time before being realized. In effect, not only do life insurers pay out death claims in aggregate on a periodic basis, they are also obligated to have sufficient assets set aside as reserves to fulfill this long term obligation. See Dickson et al. (2013).

Every life insurer must have in place a systematic process of tracking and monitoring its death claims experience. This tracking and monitoring system is an important risk management tool. It should involve not only identifying statistically significant deviations of actual from expected experience, but also being able to understand and explain the effects of patterns. Some deviations might be considered normal patterns that are anomalies over short durations, while of more considerable importance are deviations considered to follow a trend over longer durations.

Prior to sale, insurance companies exercise underwriting to identify the degree of mortality risk of applicants. As a consequence, there is a selection effect on the underlying mortality of life insurance policyholders; normally, the mortality of policyholders is considered better than that of the general population. However, this mortality selection wears off over time, and in spite of this selection, it is undeniably important for a life insurance company to have a monitoring system. Vadiveloo et al. (2014) listed some of the benefits of such a system, and we reiterate their importance as follows:

1. A tracking and monitoring system is a risk management tool that can assist insurers to take actions necessary to mitigate the economic impact of mortality deviations.
2. It is a tool for improved understanding of the emergence of death claims experience, thereby helping an insurer in product design, underwriting, marketing, pricing, reserving, and financial planning.
3. It provides a proactive tool for dealing with regulators, credit analysts, investors, and rating agencies who may be interested in reasons for any volatility in earnings as a result of death claims fluctuations.
4. A better understanding of the company's emergence of death claims experience helps to improve its claims predictive models.
5. The results of a tracking and monitoring system provide the company a benchmark for its death claims experience that can be compared with that of other companies in the industry.

The k-means clustering algorithm (MacQueen, 1967) is perhaps the simplest, most straightforward, and most popular method that efficiently partitions a data set into k clusters. With k initial centroids arbitrarily set, the k-means algorithm finds locally optimal solutions by gradually minimizing the clustering error calculated according to numerical attributes.
The technique has been applied in several disciplines including life insurance, e.g., Thiprungsri and Vasarhelyi (2011), Devale and Kulkarni (2012), Gan (2013), and Gan and Valdez (2016). Despite its popularity, the algorithm has drawbacks that present challenges to our life insurance dataset: (i) it is particularly sensitive to the initial cluster assignment, which is randomly picked, and (ii) it is unable to handle categorical attributes. While the k-prototypes clustering algorithm is lesser known, it provides the advantage of being able to handle mixed data types, including numerical and categorical attributes. For numerical attributes, the distance measure used may still be Euclidean; for categorical attributes, the distance measure used is based on the number of matching categories.

This paper extends the use of the k-prototypes algorithm proposed by Huang (1997) to provide insights and draw inference from a real-life dataset of death claims experience obtained from a portfolio of contracts of a life insurance company. The k-prototypes algorithm has been applied in marketing for segmenting customers to better understand product demands (Hsu and Chen, 2007) and in medical statistics for understanding hospital care practices (Najjar et al., 2014). This algorithm integrates the procedures of k-means and k-modes to efficiently cluster datasets that contain, as said earlier, numerical and categorical variables; the nature of our data, however, also includes a geospatial variable. The k-means algorithm can only handle numerical attributes, while k-modes can only handle categorical attributes. We therefore improve the k-prototypes clustering by adding a distance measurement to the cost function so that it can also deal with the geodetic distance between latitude-longitude spatial data points. Latitude is a numerical measure of how far north or south of the equator a location lies; longitude is a numerical measure of how far east or west of the prime meridian it lies. Some work related to geospatial data clustering can be found in the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm (Ester et al., 1996) and in ontology-based frameworks (Wang et al., 2010).

Our empirical data has been drawn from the life insurance portfolio of a major insurer and contains observations, during the third quarter of 2014, of approximately 1.14 million policies with a total insured amount of over 650 billion dollars. Using our empirical data, we applied the k-prototypes algorithm, which ultimately yields three optimal clusters determined using the concept of gap statistics. Shown to be an effective method for determining the optimal number of clusters, the gap statistic is based on evaluating "the change in within-cluster dispersion with that expected under an appropriate reference null distribution" (Tibshirani et al., 2001).

To provide further insights into the death claims experience of our life insurance dataset, we compared the aggregated actual to expected deaths for each of the optimal clusters. For a life insurance contract, it is most sensible to measure the magnitude of deaths based on the face amount, and thus, we computed the ratio of the aggregated actual face amounts of those who died to the face amounts of expected deaths for each optimal cluster.
Under some mild regularity conditions necessary to prove normality, we are able to construct statistical confidence intervals of this ratio for each of the clusters, thereby allowing us to draw inference as to the significant statistical deviations of the mortality experience for each of the optimal clusters. We provide details of the proofs for the asymptotic development of these confidence intervals in the Appendix. Each cluster showed different patterns of mortality deviation, and we can deduce the dominant characteristics of the policies from this cluster-based analysis. The motivation is to assist the life insurance company in gaining a better understanding of potential favorable and unfavorable clusters.

The rest of this paper is organized as follows. In Section 2, we briefly describe the real data set from an insurance company, including the data elements and the preprocessing of the data in preparation for cluster analysis. In Section 3, we provide details of the k-prototypes clustering algorithm and discuss how the balance weight parameter is estimated and how to choose the optimal number of clusters. In Section 4, we present the clustering results and discuss their implications and applications to monitoring the company's death claims experience. We conclude in Section 5.

2 Description of the data

We illustrate the k-prototypes clustering algorithm based on the data set we obtained from an insurance company. This data set contains 1,137,857 life insurance policies issued in the third quarter of 2014. Each policy is described by 8 attributes: 5 categorical and 2 numerical data elements, together with longitude-latitude coordinates. Table 1 shows the description and basic summary statistics of each variable.
Table 1: Description of variables in the mortality dataset

Categorical Variables    Description                         Proportions
Gender                   Insured's sex                       Female 34.1%; Male 65.9%
Smoker Status            Insured's smoking status            Smoker 4.14%; Nonsmoker 95.86%
Underwriting Type        Type of underwriting requirement    Term conversion 4.52%; Underwritten 95.48%
Substandard Indicator    Indicator of substandard policies   Yes 7.76%; No 92.24%
Plan                     Plan type                           Term 74.28%; ULS 14.55%; VLS 11.17%

Continuous Variables     Description                         Minimum   Mean      Maximum
Issue Age                Policyholder's age at issue         0         43.62     90
Face Amount              Amount of sum insured at issue      215       529,636   100,000,000

Figure 1 provides a visualization of the distribution of the policies across the states. We only kept the policies issued in the continental United States, and therefore excluded the policies issued in Alaska, Hawaii, and Guam. First, the frequency of policies observed from these states is not materially large. Second, since these states or territories are outside the mainland United States, geodetic measurements are distorted and clustering results may become less meaningful. A saturated color indicates a high frequency of policies in a particular state. The distribution of the policy count is highly skewed, with New York, New Jersey, California, and Pennsylvania having significantly more insureds than other states. The spatial attributes are represented by latitude and longitude coordinate pairs.

[Figure 1: U.S. heatmap of policy frequency.]

The insured's sex indicator,
Gender, is a discrete variable with two levels, Female and Male, with the number of males almost twice that of females. Smoker Status indicates the insured's smoking status, with 95.86% nonsmokers and the remaining 4.14% smokers. The variable Underwriting Type reflects two types of underwriting: 95.48% of the policies were fully underwritten at issue, while the remaining 4.52% are term conversions. Term conversions refer to those policies originally with a fixed maturity (or term) that were converted into permanent policies at a later date from issue, without any additional underwriting. The variable
Substandard Indicator indicates whether a policy has been issued as substandard. Substandard policies are issued after underwriting is performed and have expected mortality worse than standard policies; they come with an extra premium. In our dataset, about 7.76% of the policies are considered substandard and the remaining 92.24% are standard. The variable Plan has three levels: Term Insurance Plan (Term), Universal Life with Secondary Guarantees (ULS), and Variable Life with Secondary Guarantees (VLS).

In our dataset, there are two continuous variables. The variable Issue Age refers to the policyholder's age at the time of issue; the range of issue ages is from as young as a newborn to as old as 90 years, with an average of about 44 years. The variable
Face Amount refers to the amount of sum insured, either fixed at policy issue or accumulated to this level at the most recent time of valuation. As is common with data clustering, we standardized these two continuous variables by rescaling their values to lie in the range [0, 1]:
$$x_{\text{new}} = \frac{x - \min(x)}{\max(x) - \min(x)},$$
where $x$ is the original value and $x_{\text{new}}$ is the standardized (or normalized) value. However, for the variable Face Amount, we find a few extreme values that may further distort the spread or range of possible values. To address this additional concern, we take the logarithm of the original values before applying the normalization formula:
$$x_{\text{new}} = \frac{\log(x) - \min(\log(x))}{\max(\log(x)) - \min(\log(x))}.$$
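To make this preprocessing step concrete, the following short Python sketch applies the two rescaling formulas above to the Issue Age and Face Amount columns. It is only an illustration, not the authors' code; the data frame, column names, and sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def min_max(x):
    """Rescale a numeric array to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def log_min_max(x):
    """Log-transform first to dampen extreme values, then rescale to [0, 1]."""
    logx = np.log(np.asarray(x, dtype=float))
    return (logx - logx.min()) / (logx.max() - logx.min())

# Hypothetical data frame with the columns described in Table 1.
policies = pd.DataFrame({"issue_age": [35, 44, 60],
                         "face_amount": [50_000, 500_000, 5_000_000]})
policies["issue_age_std"] = min_max(policies["issue_age"])
policies["face_amount_std"] = log_min_max(policies["face_amount"])
```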
3 Data clustering

Data clustering refers to the process of dividing a set of objects into homogeneous groups or clusters (Gan et al., 2007; Gan, 2011) using some similarity criterion. Objects in the same cluster are more similar to each other than to objects from other clusters. Data clustering is an unsupervised learning process and is often used as a preliminary step for data analytics. In bioinformatics, for example, data clustering is used to identify the patterns hidden in gene expression data (MacCuish and MacCuish, 2010). In big data analytics, data clustering is used to produce good-quality clusters or summaries of big data to address storage and analytical issues (Fahad et al., 2014). In actuarial science, data clustering is also used to select representative insurance policies from a large pool of policies in order to build predictive models (Gan, 2013; Gan and Lin, 2015; Gan and Valdez, 2016).

[Figure 2: A typical data clustering process: pattern representation, dissimilarity measure definition, clustering, data abstraction, and output assessment.]

Figure 2 shows a typical clustering process as described in Jain et al. (1999). The clustering process consists of five major steps: pattern representation, dissimilarity measure definition, clustering, data abstraction, and output assessment. In the pattern representation step, the task is to determine the number and type of the attributes of the objects to be clustered. In this step, we may extract, select, and transform features to identify the most effective subset of the original attributes to use in clustering. In the dissimilarity measure definition step, we select a distance measure that is appropriate to the data domain. In the clustering step, we apply a clustering algorithm to divide the data into a number of meaningful clusters. In the data abstraction step, we extract one or more prototypes from each cluster to help comprehend the clustering results. In the final step, we use some criteria to assess the clustering results.

Clustering algorithms can be divided into two categories: partitional and hierarchical clustering algorithms. A partitional clustering algorithm divides a dataset into a single partition, while a hierarchical clustering algorithm divides a dataset into a sequence of nested partitions. In general, partitional algorithms are more efficient than hierarchical algorithms because the latter usually require calculating the pairwise distances between all the data points.

3.1 The k-prototypes algorithm

The k-prototypes algorithm (Huang, 1998) is an extension of the well-known k-means algorithm for clustering mixed-type data. In the k-prototypes algorithm, the prototype is the center of a cluster, just as the mean is the center of a cluster in the k-means algorithm. To describe the k-prototypes algorithm, let $\{X_{ij}\}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, d$, denote a dataset containing $n$ observations. Each observation is described by $d$ variables, including $d_1$ numerical variables, $d_2 - d_1$ categorical variables, and $d - d_2 = 2$ spatial variables. Without loss of generality, we assume that the first $d_1$ variables are numerical, the next $d_2 - d_1$ variables are categorical, and the last two variables are spatial. Then the dissimilarity measure between two points $\mathbf{x}$ and $\mathbf{y}$ used by the k-prototypes algorithm is defined as follows:
$$D(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{d_1} (x_j - y_j)^2 + \lambda_1 \sum_{j=d_1+1}^{d_2} \delta_1(x_j, y_j) + \lambda_2\, \delta_2(\mathbf{x}^*, \mathbf{y}^*), \qquad (1)$$
where $\lambda_1$ and $\lambda_2$ are balancing weights, relative to the numerical attributes, used to avoid favoring one type of variable over another, $\delta_1(\cdot, \cdot)$ is the simple-matching distance defined as
$$\delta_1(x_j, y_j) = \begin{cases} 1, & \text{if } x_j \neq y_j, \\ 0, & \text{if } x_j = y_j, \end{cases}$$
and $\delta_2(\cdot, \cdot)$ returns the spatial distance between two points with latitude-longitude coordinates, $\mathbf{x}^* = (x_{d_2+1}, x_{d_2+2})$ and $\mathbf{y}^* = (y_{d_2+1}, y_{d_2+2})$, using a great-circle (WGS84 ellipsoid) method. The calculation uses the Earth's radius $r = 6378137$ m from the WGS84 axis (Carter, 2002) and, after a sequence of trigonometric transformations of the coordinate differences yielding a central angle $\Theta_{1,2}$, returns $\delta_2(\mathbf{x}^*, \mathbf{y}^*) = r(1 - f)\,\Theta_{1,2}$, where $f$ is the flattening of the Earth (1/298.257223563).
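The mixed-type dissimilarity of Equation (1) can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the geodetic term uses the spherical haversine approximation rather than the exact WGS84 ellipsoid formula described above, and the index arguments and weight values in the example are hypothetical.

```python
import math

R_EARTH = 6378137.0  # WGS84 semi-major axis, in metres

def great_circle(p, q):
    """Haversine great-circle distance (metres) between two (lat, lon) points in degrees;
    a spherical approximation of the geodetic distance used in the paper."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * R_EARTH * math.asin(math.sqrt(a))

def dissimilarity(x, y, num_idx, cat_idx, spa_idx, lam1, lam2):
    """Equation (1): squared Euclidean on numerical positions, simple matching on
    categorical positions, and a weighted geodetic distance on the (lat, lon) pair."""
    d_num = sum((x[j] - y[j]) ** 2 for j in num_idx)
    d_cat = sum(1 for j in cat_idx if x[j] != y[j])
    d_spa = great_circle((x[spa_idx[0]], x[spa_idx[1]]), (y[spa_idx[0]], y[spa_idx[1]]))
    return d_num + lam1 * d_cat + lam2 * d_spa

# Hypothetical records: (issue_age_std, face_amount_std, gender, plan, lat, lon)
x = (0.48, 0.55, "F", "Term", 41.8, -72.3)
y = (0.31, 0.60, "M", "Term", 40.7, -74.0)
print(dissimilarity(x, y, num_idx=(0, 1), cat_idx=(2, 3), spa_idx=(4, 5), lam1=0.5, lam2=1e-6))
```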
The k-prototypes algorithm aims to minimize the following objective (cost) function:
$$P(U, Z) = \sum_{i=1}^{n} \sum_{l=1}^{k} u_{il}\, D(\mathbf{x}_i, \mathbf{z}_l), \qquad (2)$$
where $U = (u_{il})_{i=1:n,\, l=1:k}$ is an $n \times k$ partition matrix, $Z = \{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_k\}$ is a set of prototypes, and $k$ is the desired number of clusters. The k-prototypes algorithm employs an iterative process to minimize this objective function. The algorithm starts with $k$ initial prototypes selected randomly from the dataset. Given the set of prototypes $Z$, the algorithm then updates the partition matrix as follows:
$$u_{il} = \begin{cases} 1, & \text{if } D(\mathbf{x}_i, \mathbf{z}_l) = \min_{1 \le s \le k} D(\mathbf{x}_i, \mathbf{z}_s), \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$
Given the partition matrix $U$, the algorithm updates the prototypes as follows:
$$z_{lj} = \frac{\sum_{i=1}^{n} u_{il}\, x_{ij}}{\sum_{i=1}^{n} u_{il}}, \qquad 1 \le j \le d_1, \qquad (4a)$$
$$z_{lj} = \mathrm{mode}\{x_{ij} : u_{il} = 1\}, \qquad d_1 + 1 \le j \le d_2, \qquad (4b)$$
$$(z_{l,d_2+1}, z_{l,d_2+2}) = \big\{ (x_{i,d_2+1}, x_{i,d_2+2}) \;\big|\; \delta_2(\mathbf{x}^*_i, \mathbf{z}^*_l) \text{ is minimal} \big\}, \qquad (4c)$$
where $\mathbf{x}^*_i = (x_{i,d_2+1}, x_{i,d_2+2})$ and $\mathbf{z}^*_l = (z_{l,d_2+1}, z_{l,d_2+2})$. When $\delta_2$ is calculated, we exclude the previous spatial prototype. In other words, the numerical components of the prototype of a cluster are updated to the means, the categorical components are updated to the modes, and the new spatial prototype is the member coordinate closest to the previous one.

Algorithm 1 shows the pseudo-code of the k-prototypes algorithm. A major advantage of the k-prototypes algorithm is that it is easy to implement and efficient for large datasets. A drawback of the algorithm is that it is sensitive to the initial prototypes, especially when $k$ is large.

Algorithm 1: Pseudo-code of the k-prototypes algorithm.
Input: a dataset X and the number of clusters k
Output: k clusters
Initialize z_1, z_2, ..., z_k by randomly selecting k points from X;
repeat
    Calculate the distance between x_i and z_j for all 1 ≤ i ≤ n and 1 ≤ j ≤ k;
    Update the partition matrix U according to Equation (3);
    Update the cluster centers Z according to Equation (4);
until no further changes of cluster membership;
Return the partition matrix U and the cluster centers Z.
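A compact Python sketch of the assignment and update steps of Algorithm 1 is given below, reusing the great_circle and dissimilarity helpers from the earlier sketch. It is meant only to illustrate the iteration, not to reproduce the authors' implementation; the function name and arguments are hypothetical.

```python
import random
from collections import Counter

def k_prototypes(data, k, num_idx, cat_idx, spa_idx, lam1, lam2, max_iter=100, seed=0):
    """Sketch of Algorithm 1 on a list of mixed-type tuples."""
    rng = random.Random(seed)
    protos = rng.sample(data, k)           # k initial prototypes drawn from the data
    assign = [0] * len(data)
    for _ in range(max_iter):
        # Assignment step (Equation (3)): each point joins its nearest prototype.
        new_assign = [min(range(k),
                          key=lambda l: dissimilarity(x, protos[l], num_idx, cat_idx,
                                                      spa_idx, lam1, lam2))
                      for x in data]
        if new_assign == assign:
            break
        assign = new_assign
        # Update step (Equations (4a)-(4c)).
        for l in range(k):
            members = [x for x, a in zip(data, assign) if a == l]
            if not members:
                continue
            proto = list(protos[l])
            for j in num_idx:                                   # means on numerical parts
                proto[j] = sum(x[j] for x in members) / len(members)
            for j in cat_idx:                                   # modes on categorical parts
                proto[j] = Counter(x[j] for x in members).most_common(1)[0][0]
            # Spatial part: the member coordinate closest to the previous spatial prototype
            # (the paper additionally excludes the previous prototype point itself).
            old_spa = (protos[l][spa_idx[0]], protos[l][spa_idx[1]])
            best = min(members, key=lambda x: great_circle((x[spa_idx[0]], x[spa_idx[1]]), old_spa))
            proto[spa_idx[0]], proto[spa_idx[1]] = best[spa_idx[0]], best[spa_idx[1]]
            protos[l] = tuple(proto)
    return assign, protos
```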
3.2 Estimation of $\lambda_1$ and $\lambda_2$

The cost function in Equation (2) can be further rewritten as
$$P(U, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il} \left\{ \sum_{j=1}^{d_1} (x_{ij} - z_{lj})^2 + \lambda_1 \sum_{j=d_1+1}^{d_2} \delta_1(x_{ij}, z_{lj}) + \lambda_2\, \delta_2(\mathbf{x}^*_i, \mathbf{z}^*_l) \right\},$$
where $\mathbf{x}^*_i = (x_{i,d_2+1}, x_{i,d_2+2})$, and the inner term
$$D_l = \sum_{i=1}^{n} u_{il} \left\{ \sum_{j=1}^{d_1} (x_{ij} - z_{lj})^2 + \lambda_1 \sum_{j=d_1+1}^{d_2} \delta_1(x_{ij}, z_{lj}) + \lambda_2\, \delta_2(\mathbf{x}^*_i, \mathbf{z}^*_l) \right\} = D_l^{n} + D_l^{c} + D_l^{s}$$
is the total cost when $X$ is assigned to cluster $l$. Note that we can subdivide these measurements into
$$D_l^{n} = \sum_{i=1}^{n} u_{il} \sum_{j=1}^{d_1} (x_{ij} - z_{lj})^2, \qquad
D_l^{c} = \sum_{i=1}^{n} u_{il}\, \lambda_1 \sum_{j=d_1+1}^{d_2} \delta_1(x_{ij}, z_{lj}), \qquad
D_l^{s} = \sum_{i=1}^{n} u_{il}\, \lambda_2\, \delta_2(\mathbf{x}^*_i, \mathbf{z}^*_l),$$
which represent the total cost from the numerical, categorical, and spatial attributes, respectively.

It is easy to show that the total cost $D_l$ is minimized by individually minimizing $D_l^{n}$, $D_l^{c}$, and $D_l^{s}$ (Huang, 1997). $D_l^{n}$ can be minimized through Equation (4a). $D_l^{c}$, the total cost from the categorical attributes of $X$, can be rewritten as
$$D_l^{c} = \lambda_1 \sum_{i=1}^{n} u_{il} \sum_{j=d_1+1}^{d_2} \delta_1(x_{ij}, z_{lj})
= \lambda_1 \sum_{i=1}^{n} \sum_{j=d_1+1}^{d_2} \big\{ 1 \cdot (1 - P(x_{ij} = z_{lj} \mid l)) + 0 \cdot P(x_{ij} = z_{lj} \mid l) \big\}
= \lambda_1 \sum_{j=d_1+1}^{d_2} n_l \big\{ 1 - P(z_{lj} \in A_j \mid l) \big\},$$
where $A_j$ is the set of all unique levels of the $j$th categorical attribute of $X$ and $P(z_{lj} \in A_j \mid l)$ denotes the probability that the $j$th categorical attribute of prototype $\mathbf{z}_l$ occurs given cluster $l$.

$\lambda_1$ and $\lambda_2$ are chosen to prevent over-emphasizing either the categorical or the spatial attributes relative to the numerical attributes, and hence they depend on the distributions of the numerical attributes (Huang, 1997). In the R package clustMixType, Szepannek (2017) suggested the value of $\lambda_1$ as the ratio of the average variance of the numerical variables to the average concentration of the categorical variables:
$$\hat{\lambda}_1 = \frac{\frac{1}{d_1} \sum_{j=1}^{d_1} \mathrm{Var}(x_j)}{\frac{1}{d_2 - d_1} \sum_{j=d_1+1}^{d_2} \sum_{k} q_{jk}(1 - q_{jk})}
= \frac{\frac{1}{d_1} \sum_{j=1}^{d_1} \mathrm{Var}(x_j)}{\frac{1}{d_2 - d_1} \sum_{j=d_1+1}^{d_2} \big(1 - \sum_{k} q_{jk}^2\big)},$$
where $q_{jk}$ is the frequency of the $k$th level of the $j$th categorical variable. See also Szepannek (2019). For each categorical variable, we consider it to have a distribution in which the probability of each level is the frequency of that level. For example, the categorical data element Plan has three levels: Term, Universal Life with Secondary Guarantees (ULS), and Variable Life with Secondary Guarantees (VLS). The concentration of Plan can then be measured by the Gini impurity $\sum_{k} q_{jk}(1 - q_{jk}) = 1 - \sum_{k} q_{jk}^2$. Therefore, under the condition that all the variables are independent, the total Gini impurity for the categorical variables is $\sum_{j=d_1+1}^{d_2} (1 - \sum_{k} q_{jk}^2)$, since $\sum_{k} q_{jk} = 1$. The average of the total variance of the numerical variables, $\frac{1}{d_1} \sum_{j=1}^{d_1} \mathrm{Var}(x_j)$, can be considered an estimate of the population variance. Subsequently, $\hat{\lambda}_1$ becomes a reasonable estimate and is easy to calculate.

Similarly,
$$\hat{\lambda}_2 = \frac{\frac{1}{d_1} \sum_{j=1}^{d_1} \mathrm{Var}(x_j)}{\mathrm{Var}\big(\delta_2(\mathbf{x}^*, \text{center})\big)},$$
where the concentration of the spatial attributes is estimated by the variance of the great-circle distances between $\mathbf{x}^*$ and the center of all the longitude-latitude coordinates.
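The weight estimates just described can be sketched in Python as follows, again reusing the great_circle helper from the earlier sketch. The data-frame and column arguments are hypothetical, and this is an illustration of the formulas rather than the authors' code.

```python
import numpy as np
import pandas as pd

def estimate_lambdas(df, num_cols, cat_cols, lat_col, lon_col):
    """lambda_1: average variance of numerical variables over average Gini impurity of
    categorical variables; lambda_2: same numerator over the variance of the great-circle
    distances to the centre of the coordinates (haversine approximation)."""
    avg_num_var = np.mean([df[c].var() for c in num_cols])
    # Gini impurity 1 - sum_k q_jk^2 for each categorical variable.
    ginis = [1.0 - np.sum(df[c].value_counts(normalize=True).values ** 2) for c in cat_cols]
    lam1 = avg_num_var / np.mean(ginis)
    centre = (df[lat_col].mean(), df[lon_col].mean())
    dists = [great_circle((la, lo), centre) for la, lo in zip(df[lat_col], df[lon_col])]
    lam2 = avg_num_var / np.var(dists)
    return lam1, lam2
```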
As alluded to in Section 1, the gap statistic is used to determine the optimal number of clusters. The data $X = \{X_{ij}\}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, d$, consist of $d$ features measured on $n$ independent observations. Let $D_{ij}$ denote the distance, defined in Equation (1), between observations $i$ and $j$. Suppose that we have partitioned the data into $k$ clusters $C_1, \ldots, C_k$ with $n_l = |C_l|$. Let
$$D_l^{w} = \sum_{i, j \in C_l} D_{ij}$$
be the sum of the pairwise distances for all points within cluster $l$, and set
$$W_k(X) = \sum_{l=1}^{k} \frac{1}{2 n_l} D_l^{w}.$$
The idea of the approach is to standardize the comparison of $\log(W_k)$ with its expectation under an appropriate null reference distribution of the data. We define
$$\mathrm{Gap}(k) = \mathrm{E}[\log(W_k(X^*))] - \log(W_k(X)),$$
where $\mathrm{E}[\log(W_k(X^*))]$ denotes the average of $\log(W_k)$ over samples $X^*$ generated from the reference distribution with predefined $k$. The gap statistic can be calculated by the following steps:

• Set $k = 1, 2, \ldots, 10$. Run the k-prototypes algorithm and calculate $\log(W_k)$ under each $k$ for the original data $X$;
• For each $b = 1, 2, \ldots, B$, generate a reference data set $X^*_b$ with sample size $n$. Run the clustering algorithm under the candidate $k$ values and compute $\mathrm{E}[\log(W_k(X^*))] = \frac{1}{B} \sum_{b=1}^{B} \log(W_k(X^*_b))$ and $\mathrm{Gap}(k)$;
• Define $s(k) = \sqrt{1 + 1/B} \times \mathrm{sd}(k)$, where $\mathrm{sd}(k) = \sqrt{\frac{1}{B} \sum_{b=1}^{B} \big(\log(W_k(X^*_b)) - \mathrm{E}[\log(W_k(X^*))]\big)^2}$; and
• Choose the optimal number of clusters as the smallest $k$ such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s(k+1)$.

This estimate is broadly applicable to any clustering method and distance measure $D_{ij}$. We use $B = 50$ and randomly draw 10% of the data set using stratified sampling to keep the same proportion of each attribute. The gap statistic and the quantity $\mathrm{Gap}(k) - (\mathrm{Gap}(k+1) - s(k+1))$ against the number of clusters $k$ are shown in Figure 3. The gap statistic clearly peaks at $k = 3$, and the criterion for choosing $k$ is displayed in the right panel: $k = 3$ is the smallest $k$ for which the quantity $\mathrm{Gap}(k) - (\mathrm{Gap}(k+1) - s(k+1))$ becomes positive.

[Figure 3: (a) Gap statistics in terms of the corresponding number of clusters and (b) results of choosing the optimal number of clusters.]

A possible drawback is the high sensitivity to the initial choice of prototypes. In order to minimize the impact, we run the k-prototypes algorithm with the chosen $k = 3$ starting from 20 different initializations and then choose the one with the smallest total sum of squared errors.
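The selection procedure described above can be sketched generically in Python. This is a skeleton rather than the authors' code: the three function arguments are placeholders for routines the reader would supply, e.g., a k-prototypes fit, the $W_k$ computation above, and a draw from the chosen reference distribution.

```python
import numpy as np

def gap_statistic(X, k_max, fit_clusters, within_dispersion, make_reference, B=50, seed=0):
    """fit_clusters(X, k) -> labels; within_dispersion(X, labels) -> W_k;
    make_reference(X, rng) -> one reference dataset of the same size."""
    rng = np.random.default_rng(seed)
    log_w = np.empty(k_max)
    log_w_ref = np.empty((B, k_max))
    for k in range(1, k_max + 1):
        log_w[k - 1] = np.log(within_dispersion(X, fit_clusters(X, k)))
    for b in range(B):
        X_ref = make_reference(X, rng)
        for k in range(1, k_max + 1):
            log_w_ref[b, k - 1] = np.log(within_dispersion(X_ref, fit_clusters(X_ref, k)))
    gap = log_w_ref.mean(axis=0) - log_w
    s = np.sqrt(1.0 + 1.0 / B) * log_w_ref.std(axis=0)
    # Optimal k: the smallest k with Gap(k) >= Gap(k+1) - s(k+1).
    for k in range(1, k_max):
        if gap[k - 1] >= gap[k] - s[k]:
            return k, gap, s
    return k_max, gap, s
```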
4 Implications and applications of numerical results

Using our mortality dataset with eight different attributes of mixed type (numerical, categorical, and spatial), we concluded, as detailed in the previous section, that three clusters are formed. Table 2 displays the size and membership degree of each cluster. Cluster 3 has the largest membership, with nearly 57% of the total observations, while Clusters 1 and 2 are partitioned with 30.1% and 13.0% memberships, respectively.

Table 2: Size and percentage for each of the three optimal clusters

                         Cluster 1   Cluster 2   Cluster 3
Number of observations   342,518     147,561     647,778
Percentage               30.10%      12.97%      56.93%

Let us describe some dominating features for each of the clusters. The outputs are visualized in Figure 4 and Figure 5. Additional details of these dominating features are summarized in Table 4, showing the cluster distribution of the categorical variables, Table 3, which lists the cluster proportions of the variable States in descending order, and Table 5, with the distributions of the numerical variables. These tables are provided in the Appendix.
Cluster 1
• Its gender make-up is predominantly the females in the entire portfolio. There is a larger percentage of Term plans and a smaller percentage of Substandard policies than in Clusters 2 and 3. The violin plots for the numerical attributes show that the youngest group, with the smallest amount of insurance coverage, is in this cluster. Geographically, the insureds in this cluster are mostly distributed in the northeastern region, such as New Jersey, New York, Rhode Island, and New Hampshire.
Cluster 2
• This cluster has an interesting gender make-up. While Clusters 1 and 3 each have a dominating gender, Cluster 2 has 30% female and 70% male. It also has the largest proportion of Smokers, Term conversion underwriting type, and Substandard policies when compared with the other clusters. However, when it comes to Plan type, 91% of them have Universal Life contracts and almost no Term plans. With respect to issue age and amount of insurance coverage, this cluster of policies has the most senior people and, not surprisingly, also lower face amounts. Geographically, with the exception of a few states dominating the cluster, there is an almost uniform distribution across the rest of the states in this cluster. Cluster 2 has the states with the lowest proportion of insured policies among all the clusters.

Cluster 3
• Male policyholders dominate this cluster, and Cluster 3 has the smallest proportion of Smokers and Term conversion underwriting type among all clusters. More than 90% of the policyholders purchased a Term plan, and most of them also have generally larger face amounts than the other two clusters. According to the violin plots, the policyholders in this cluster are middle-aged compared with the other clusters. The policyholders in this cluster are more geographically scattered in Arkansas, Alabama, Mississippi, Tennessee, and Oregon; interestingly, Cluster 3 has the largest proportion of policies among all clusters.
[Figure 4: Distribution of the variable States in each of the optimal clusters.]

[Figure 5: Distribution of the numerical and categorical attributes (Gender, Smoker Status, Underwriting Type, Substandard, Plan, Issue Age, log(Face Amount)) in each of the optimal clusters.]

4.2 Analysis of mortality deviation
We now compare these clusters with respect to their deviations of actual from expected mortality. It is a typical practice in the life insurance industry, when analyzing and understanding such deviations, to compare the actual-to-expected (A/E) death experiences.

To illustrate how we made the comparison, consider one particular cluster containing $n$ policies. We computed the actual amount of death claims for this entire cluster by adding up all the face amounts of those who died during the quarter. Let $\mathrm{FA}_i$ be the face amount of policyholder $i$ in this particular cluster. Thus, the aggregated actual face amount among those who died is equal to
$$\sum_{i=1}^{n} A_i = \sum_{i=1}^{n} \mathrm{FA}_i \times I_i,$$
where $I_i = 1$ indicates the policyholder died, and the aggregated expected face amount is
$$\sum_{i=1}^{n} E_i = \sum_{i=1}^{n} \mathrm{FA}_i \times q_i,$$
where the expected mortality rate $q_i$ is based on the latest 2015 Valuation Basic Table (VBT), using smoker-distinct and ALB (age-last-birthday) rates. The measure of deviation, $R$, is then defined to be
$$R = \frac{\sum_{i=1}^{n} A_i}{\sum_{i=1}^{n} E_i}.$$
Clearly, a ratio $R < 1$ indicates actual mortality better than expected, while $R > 1$ indicates actual mortality worse than expected. Here $I_i$ is a Bernoulli distributed random variable with parameter $q_i$, which represents the probability of death or, loosely speaking, the mortality rate. For large $n$, i.e., as $n \to \infty$, the ratio $R$ converges in distribution to a normal random variable with mean 1 and variance $\sum_{i=1}^{n} \mathrm{FA}_i^2\, q_i(1 - q_i) \big/ \big(\sum_{i=1}^{n} E_i\big)^2$. The details of the proofs for this convergence are provided in the Appendix.

Based on the results of this convergence, we are able to construct 90% and 95% confidence intervals of the ratio $R$, or the A/E of mortality. Figure 6(a) and Figure 6(b) depict the differences in the A/E ratio for the three different clusters, based on 90% and 95% confidence intervals, respectively.
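To make the construction concrete, a small Python sketch (hypothetical function name; not the authors' code) that computes $R$ and a normal-approximation confidence interval for one cluster is shown below; the cluster's mortality is significantly better than expected at level $\alpha$ when the interval's upper endpoint lies below 1.

```python
import numpy as np
from scipy.stats import norm

def ae_ratio_ci(face_amount, death_indicator, q, alpha=0.05):
    """A/E ratio R for one cluster and its (1 - alpha) normal-approximation interval,
    using the asymptotic variance derived in the Appendix."""
    face_amount = np.asarray(face_amount, dtype=float)
    death_indicator = np.asarray(death_indicator, dtype=float)
    q = np.asarray(q, dtype=float)
    expected = np.sum(face_amount * q)                        # sum of E_i
    R = np.sum(face_amount * death_indicator) / expected      # sum of A_i over sum of E_i
    var_R = np.sum(face_amount ** 2 * q * (1.0 - q)) / expected ** 2
    z = norm.ppf(1.0 - alpha / 2.0)
    return R, (R - z * np.sqrt(var_R), R + z * np.sqrt(var_R))
```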
Based on this company's claims experience, these figures provide some good news overall. The observed A/E ratios for all clusters are below 1, which, as said earlier, indicates that the actual mortality is better than expected for all three clusters. There are some peculiar observations that we can draw from the clusters:

• Cluster 1 has the most favorable A/E ratio among all the clusters and is significantly less than 1 at the 10% significance level, with moderate variability. This can be explained by the dominant features of this cluster noted earlier: predominantly female policyholders with younger issue ages and smaller face amounts.

• Cluster 2 has an A/E ratio of 0.68 and is not significantly less than 1 at either the 5% or the 10% significance level; it has the largest variability of this ratio among all clusters. Cluster 2 therefore has the most unfavorable A/E ratio from a statistical perspective. The characteristics of this cluster can be captured by these dominant features: (i) its gender make-up is a mix of male and female, with more males than females; (ii) it has the largest proportion of Smokers, Term conversion underwriting type, and Substandard policies compared with the other clusters; (iii) however, when it comes to plan type, 91% of them have Universal Life contracts and no Term policies; (iv) with respect to issue age and amount of insurance coverage, this cluster has the largest proportion of elderly people and, therefore, lower face amounts. All these dominating features help explain a generally worse mortality and a larger variability of deviations. For example, the older group has a higher mortality rate than the younger group, and together with the largest proportion of smokers, this explains the compounded mortality. To some extent, the largest proportions of Term conversion underwriting types and Substandard policies reasonably indicate a more inferior mortality experience.

• Cluster 3 has the A/E ratio most significantly less than 1, even though it has the worst A/E ratio among all the clusters. The characteristics can be captured by some dominating features in the cluster: male policyholders dominate this cluster, and it has the smallest proportion of Smokers and Term conversion underwriting type among the three clusters. More than 90% of the policyholders purchased a Term plan, and most of them have larger face amounts than the other clusters. According to the violin plots, the policyholders in this cluster are in the middle age groups compared with the other clusters. The policyholders are more geographically scattered in Arkansas, Alabama, Mississippi, Tennessee, and Oregon. We generally know that smoker mortality is worse than nonsmoker mortality, relatively younger age groups have a lower mortality rate than other age groups, and Term plans generally have fixed terms and are more subject to frequent underwriting. The small variability can be explained by the larger number of policies giving enough information and hence a much more predictable mortality.
[Figure 6: Actual to expected mortality ratios based on face amounts, with (a) 90% and (b) 95% confidence intervals, by cluster.]
5 Conclusions
In this paper, we investigated the use of the k-prototypes clustering algorithm to provide insights into the death claims experience of a portfolio of contracts from a life insurance company. Developing a tracking and monitoring system of death claims is an important part of managing a portfolio of life insurance policies. We explored how the results from the k-prototypes clustering algorithm can help us detect peculiar characteristics of our life insurance portfolio in order to have an improved understanding of mortality deviations. The k-prototypes algorithm integrates the procedures of k-means and k-modes to efficiently cluster our data set that contains numerical, categorical, and spatial attributes. Our data set consists of a life insurance company's death claims experience observed during the third quarter of 2014, with approximately 1.14 million unique policies and a total insured amount of over 650 billion dollars. The optimal number of clusters is obtained using gap statistics; the algorithm produced three dominating natural clusters in this insurance portfolio. We then used the clusters to compare and monitor actual to expected death claims experience. Each cluster has actual death claims lower than expected but with differing variabilities, and each optimal cluster showed patterns of mortality deviation from which we are able to deduce the dominant characteristics of the policies within a cluster. We also find that the additional information drawn from the spatial nature of the policies contributed to an explanation of the deviation of mortality experience from expected. The results can help facilitate decision making because of an improved understanding of potential favorable and unfavorable clusters.
We thank the Society of Actuaries for its financial support through its Centers of Actuarial Excellence (CAE) grant for our research project on Applying Data Mining Techniques in Actuarial Science. We also express our gratitude to Professor Dipak Dey, a faculty member of the Department of Statistics at our university, who provided guidance, especially to Shuang Yin, in the completion of this work.
Appendix A. Convergence of A/E ratio
Define $S_n = X_1 + \cdots + X_n$ and $B_n^2 = \mathrm{Var}(S_n) = \sum_{k=1}^{n} \sigma_k^2$, and for $\epsilon > 0$ let
$$L_n(\epsilon) = \frac{1}{B_n^2} \sum_{k=1}^{n} \mathrm{E}\Big[ (X_k - \mu_k)^2\, \mathbf{1}_{\{|X_k - \mu_k| > \epsilon B_n\}} \Big].$$

Lindeberg-Feller Theorem: Let $\{X_n\}_{n \ge 1}$ be a sequence of independent random variables with means $\mu_n$ and variances $0 < \sigma_n^2 < \infty$. If $L_n(\epsilon) \to 0$ for every $\epsilon > 0$, then
$$\frac{S_n - \mathrm{E}[S_n]}{B_n} \xrightarrow{d} N(0, 1).$$

Lyapunov Theorem: Assume that $\mathrm{E}|X_k|^{2+\delta} < \infty$ for some $\delta > 0$ and all $k = 1, 2, \ldots$. If
$$\frac{1}{B_n^{2+\delta}} \sum_{k=1}^{n} \mathrm{E}|X_k - \mu_k|^{2+\delta} \to 0,$$
then
$$\frac{S_n - \mathrm{E}[S_n]}{B_n} \xrightarrow{d} N(0, 1).$$

Proof. For $\delta > 0$,
$$L_n(\epsilon) = \frac{1}{B_n^2} \sum_{k=1}^{n} \mathrm{E}\Big[ (X_k - \mu_k)^2\, \mathbf{1}_{\{|X_k - \mu_k| > \epsilon B_n\}} \Big]
\le \frac{1}{\epsilon^{\delta} B_n^{2+\delta}} \sum_{k=1}^{n} \mathrm{E}|X_k - \mu_k|^{2+\delta} \to 0 \quad \text{as } n \to \infty.$$
Then by the Lindeberg-Feller Theorem, $(S_n - \mathrm{E}[S_n])/B_n \xrightarrow{d} N(0, 1)$.

Corollary: Suppose $\{X_n\}_{n \ge 1}$ is a sequence of independent random variables such that $0 < \inf_n \mathrm{Var}(X_n)$ and $\sup_n \mathrm{E}|X_n|^3 < \infty$. Then $(S_n - \mathrm{E}[S_n])/B_n \xrightarrow{d} N(0, 1)$.

Proof. Suppose that $X_n$ has mean $\mu_n$ and variance $\sigma_n^2 < \infty$. Taking $\delta = 1$ in the Lyapunov Theorem,
$$\frac{\sum_{k=1}^{n} \mathrm{E}|X_k - \mu_k|^3}{B_n^3}
\le \frac{n \cdot \sup_n \mathrm{E}|X_n - \mu_n|^3}{\big( n \cdot \inf_n \mathrm{Var}(X_n) \big)^{3/2}}
= \frac{\sup_n \mathrm{E}|X_n - \mu_n|^3}{\big( \inf_n \mathrm{Var}(X_n) \big)^{3/2} \sqrt{n}} \to 0 \quad \text{as } n \to \infty,$$
where $\sup_n \mathrm{E}|X_n - \mu_n|^3 < \infty$ and $0 < \inf_n \mathrm{Var}(X_n) < \infty$. Therefore, by the Lyapunov Theorem, $(S_n - \mathrm{E}[S_n])/B_n \xrightarrow{d} N(0, 1)$.

We now apply this result to the death indicators $I_i$, which are Bernoulli distributed with probability of death $q_{x_i}$. Assume that each policy's death probability is observable and fixed, not random, so that $q_{x_i}$ is fixed and does not vary with the data. Within cluster $c$ with total number of policies $n_c$, let $\mathrm{FA}_i$, $A_i$, and $E_i$ be the face amount, actual death payment, and expected death payment for each policy, respectively. When policy $i$ is observed to die, $I_i = 1$; otherwise, $I_i = 0$. Thus, $A_i = \mathrm{FA}_i \cdot I_i$ and $E_i = \mathrm{FA}_i \cdot q_{x_i}$. Let $Y_i = c_i I_i$, where $c_i = \mathrm{FA}_i / \sum_{k=1}^{n_c} E_k$. Since $I_i \sim \mathrm{Bernoulli}(q_{x_i})$, $\mathrm{E}(Y_i) = c_i q_{x_i}$ and $\mathrm{Var}(Y_i) = c_i^2 q_{x_i}(1 - q_{x_i})$. For our data, we calculate that $\inf_n \mathrm{Var}(Y_n)$ is positive and finite and that $0 < \sup_n \mathrm{E}|Y_n| < \infty$, so these two conditions are satisfied, and the $Y_i$'s are independently distributed.

Let $R_c = \sum_{i=1}^{n_c} Y_i = \sum_{i=1}^{n_c} A_i \big/ \sum_{i=1}^{n_c} E_i$ denote the measure of mortality deviation for cluster $c$. By the Lyapunov Theorem, we have
$$\frac{\sum_{i=1}^{n_c} Y_i - \mathrm{E}\big(\sum_{i=1}^{n_c} Y_i\big)}{\sqrt{\mathrm{Var}\big(\sum_{i=1}^{n_c} Y_i\big)}} \xrightarrow{d} N(0, 1)
\;\Longrightarrow\;
R_c \xrightarrow{d} N\left( 1,\; \frac{\sum_{i=1}^{n_c} \mathrm{FA}_i^2\, q_{x_i}(1 - q_{x_i})}{\big( \sum_{i=1}^{n_c} E_i \big)^2} \right),$$
where
$$\mathrm{E}(R_c) = \sum_{i=1}^{n_c} \mathrm{E}(Y_i) = \sum_{i=1}^{n_c} c_i q_{x_i} = \frac{\sum_{i=1}^{n_c} \mathrm{FA}_i\, q_{x_i}}{\sum_{i=1}^{n_c} E_i} = \frac{\sum_{i=1}^{n_c} E_i}{\sum_{i=1}^{n_c} E_i} = 1$$
and
$$\mathrm{Var}(R_c) = \sum_{i=1}^{n_c} \mathrm{Var}(Y_i) = \sum_{i=1}^{n_c} c_i^2\, q_{x_i}(1 - q_{x_i}) = \frac{\sum_{i=1}^{n_c} \mathrm{FA}_i^2\, q_{x_i}(1 - q_{x_i})}{\big( \sum_{i=1}^{n_c} E_i \big)^2}.$$
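For completeness, the confidence intervals used in Section 4.2 follow directly from this limiting distribution; one way to write the resulting $(1 - \alpha)$ interval (a sketch, since the text does not display it explicitly) is
$$R_c \;\pm\; z_{1-\alpha/2}\, \sqrt{\widehat{\mathrm{Var}}(R_c)}
= \frac{\sum_{i=1}^{n_c} A_i}{\sum_{i=1}^{n_c} E_i}
\;\pm\; z_{1-\alpha/2}\,
\frac{\sqrt{\sum_{i=1}^{n_c} \mathrm{FA}_i^2\, q_{x_i}(1 - q_{x_i})}}{\sum_{i=1}^{n_c} E_i},$$
with $z_{0.95} \approx 1.645$ for the 90% interval and $z_{0.975} \approx 1.96$ for the 95% interval.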
Appendix B. Tables that summarize the distribution of clusters

Table 3: Proportions of each cluster in the variable States

Cluster 1           Cluster 2           Cluster 3
State  Proportion   State  Proportion   State  Proportion
NJ     36.36%       WV     21.25%       AR     74.78%
NY     34.54%       DE     19.40%       AL     74.19%
RI     34.35%       PA     19.26%       MS     73.84%
NH     34.09%       OH     18.50%       TN     71.36%
ME     33.98%       IN     16.40%       OR     70.64%
MA     33.64%       RI     16.21%       ID     69.03%
DE     32.70%       ME     15.79%       OK     68.16%
CA     32.46%       SD     15.48%       KY     68.02%
NV     32.25%       IL     15.45%       TX     66.90%
MD     31.90%       NJ     14.17%       WA     66.29%
IL     31.82%       NY     14.12%       UT     64.42%
PA     31.63%       SC     13.84%       GA     64.34%
DC     31.54%       MD     13.54%       CO     64.29%
CT     30.82%       IA     12.78%       WY     63.83%
MT     29.96%       MO     12.78%       ND     63.20%
NM     29.94%       KS     12.69%       NE     62.92%
IN     29.55%       ND     12.51%       NC     62.69%
FL     29.19%       LA     12.32%       KS     61.93%
MN     28.60%       VT     12.22%       VA     61.80%
AZ     28.47%       MA     12.15%       IA     61.74%
WI     28.34%       FL     12.07%       LA     61.48%
MI     27.94%       NH     11.93%       MT     61.02%
VT     27.93%       MN     11.65%       MO     60.85%
OH     27.61%       WI     11.65%       AZ     60.69%
WY     27.12%       MI     11.64%       MI     60.42%
UT     27.01%       NM     11.59%       WI     60.01%
VA     26.91%       CT     11.53%       SD     59.86%
SC     26.86%       NE     11.51%       VT     59.85%
WA     26.61%       NC     11.49%       MN     59.75%
MO     26.37%       VA     11.29%       SC     59.31%
LA     26.20%       CA     11.26%       FL     58.74%
GA     26.10%       AZ     10.83%       DC     58.66%
NC     25.82%       NV     10.34%       NM     58.47%
CO     25.66%       CO     10.05%       CT     57.65%
NE     25.57%       KY     9.98%        NV     57.40%
IA     25.48%       DC     9.80%        CA     56.28%
WV     25.48%       GA     9.56%        MD     54.56%
KS     25.38%       OK     9.08%        MA     54.21%
SD     24.66%       WY     9.05%        IN     54.05%
ND     24.29%       MT     9.02%        NH     53.98%
TX     24.24%       TX     8.86%        OH     53.90%
ID     22.84%       AL     8.74%        WV     53.27%
OK     22.77%       MS     8.62%        IL     52.73%
OR     22.24%       UT     8.57%        NY     51.34%
KY     22%          ID     8.13%        ME     50.23%
TN     20.71%       TN     7.93%        NJ     49.47%
AR     18.04%       AR     7.18%        RI     49.44%
MS     17.54%       OR     7.12%        PA     49.10%
AL     17.07%       WA     7.10%        DE     47.90%
Table 4: Distribution of the categorical variables within the 3 optimal clusters

Categorical Variables   Levels            Cluster 1   Cluster 2   Cluster 3
Gender                  Female            100%        30.43%      0.09%
                        Male              0%          69.57%      99.91%
Smoker Status           Smoker            4.43%       6.53%       3.45%
                        Nonsmoker         95.57%      93.47%      96.55%
Underwriting Type       Term conversion   3.59%       22.79%      0.85%
                        Underwritten      96.41%      77.21%      99.15%
Substandard Indicator   Yes               6%          11.58%      7.82%
                        No                94%         88.42%      92.18%
Plan                    Term              73.71%      0%          91.51%
                        ULS               8.95%       90.86%      0.12%
                        VLS               17.34%      9.14%       8.37%
Table 5: Data summary for numerical variables within the 3 optimal clusters
Continuous Variables   Cluster     Minimum   1st Quartile   Mean      3rd Quartile   Maximum
Issue Age              Cluster 1   –         –              –         –              –
                       Cluster 2   –         –              –         –              –
                       Cluster 3   –         –              –         –              –
Face Amount            Cluster 1   215       100,000        375,066   500,000        13,000,000
                       Cluster 2   –         –              –         –              –
                       Cluster 3   –         –              –         –              –
References

Carter, C. (2002). Great circle distances.
Devale, A. and Kulkarni, R. (2012). Applications of data mining techniques in life insurance. International Journal of Data Mining & Knowledge Management Process, 2(4):31–40.
Dickson, D. C., Hardy, M. R., and Waters, H. R. (2013). Actuarial Mathematics for Life Contingent Risks, 2nd edition. Cambridge, United Kingdom: Cambridge University Press.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Knowledge Discovery in Databases (KDD), volume 96, pages 226–231.
Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Zomaya, A. Y., Khalil, I., Foufou, S., and Bouras, A. (2014). A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing, 2(3):267–279.
Gan, G. (2011). Data Clustering in C++: An Object-Oriented Approach. Data Mining and Knowledge Discovery Series. Chapman & Hall/CRC Press, Boca Raton, FL, USA.
Gan, G. (2013). Application of data clustering and machine learning in variable annuity valuation. Insurance: Mathematics and Economics, 53(3):795–801.
Gan, G. and Lin, X. S. (2015). Valuation of large variable annuity portfolios under nested simulation: a functional data approach. Insurance: Mathematics and Economics, 62:138–150.
Gan, G., Ma, C., and Wu, J. (2007). Data Clustering: Theory, Algorithms and Applications. ASA-SIAM Series on Statistics and Applied Probability. SIAM Press, Philadelphia, PA, USA.
Gan, G. and Valdez, E. A. (2016). An empirical comparison of some experimental designs for the valuation of large variable annuity portfolios. Dependence Modeling, 4(1):382–400.
Hsu, C.-C. and Chen, Y.-C. (2007). Mining of mixed data with application to catalog marketing. Expert Systems with Applications, 32:12–23.
Huang, Z. (1997). Clustering large data sets with mixed numeric and categorical values. In Proceedings of The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 21–34. Singapore.
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3):283–304.
Jain, A., Murty, M., and Flynn, P. (1999). Data clustering: a review. ACM Computing Surveys, 31(3):264–323.
MacCuish, J. D. and MacCuish, N. E. (2010). Clustering in Bioinformatics and Drug Discovery. CRC Press, Boca Raton, FL.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA.
Najjar, A., Gagne, C., and Reinharz, D. (2014). A novel mixed values k-prototypes algorithm with application to health care data mining. IEEE Symposium on Computational Intelligence in Healthcare and e-health (CICARE), pages 1–8.
Szepannek, G. (2019). clustMixType: user-friendly clustering of mixed-type data in R. The R Journal, 10(2):200–208.
Szepannek, G. (2017). R: k-Prototypes Clustering for Mixed Variable-Type Data. R Foundation for Statistical Computing, Vienna, Austria.
Thiprungsri, S. and Vasarhelyi, M. A. (2011). Cluster analysis for anomaly detection in accounting data: an audit approach. International Journal of Digital Accounting Research, 11.
Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423.
Vadiveloo, J., Niu, G., Xu, J., Shen, X., and Song, T. (2014). Tracking and monitoring claims experience: a practical application of risk management. Risk Management, pages 12–15.
Wang, X., Gu, W., Ziebelin, D., and Hamilton, H. (2010). An ontology-based framework for geospatial clustering.