Modeling surrender risk in life insurance: theoretical and experimental insight
Mark Kiermayer
Department of Natural Sciences, University of Applied Sciences Ruhr West
[email protected]
January 28, 2021

Abstract
Surrender poses one of the major risks to life insurance, and a sound modeling of its true probability has direct implications on the risk capital demanded by the Solvency II directive. We add to the existing literature by performing extensive experiments that present highly practical results for various modeling approaches, including XGBoost and neural networks. Further, we detect shortcomings of prevalent model assessments, which are in essence based on a confusion matrix. Our results indicate that accurate label predictions and a sound modeling of the true probability can be opposing objectives. We illustrate this with the example of resampling. While resampling is capable of improving label prediction in rare event settings, such as surrender, and thus is commonly applied, we show theoretically and numerically that models trained on resampled data predict significantly biased event probabilities. Following a probabilistic perspective on surrender, we further propose time-dependent confidence bands on predicted mean surrender rates as a complementary assessment and demonstrate their benefit. This evaluation takes a very practical, going concern perspective, which respects that the composition of a portfolio might change over time.

Keywords: confidence predictions, confidence bands, resampling, ensemble techniques, neural networks, XGBoost

1 Introduction

Managing risks is at the core of the insurance business. With the publication of the Solvency II directive in 2009, European insurers are required to work towards a holistic view of risk. In particular, it states individual modules for the underwriting risk, market risk and counterparty default risk. For life insurance business, the Solvency II directive requires the underwriting risk to explicitly indicate the risk of individual sub-components like mortality, longevity, morbidity, life-expense, revision, lapse and life-catastrophe.
In the German market, traditional risks such as mortality and longevity are generally addressed by using safety buffers on mortality assumptions, see e.g. the German 2008T and 2004R tables, as well as their comparison to international approaches in [12]. The present work looks at a sound estimation of surrender risk. We place a particular focus on the distributional effect of resampling schemes that are commonly used to mitigate the rare event character of surrender.

Lapse risk has been identified as a major risk to life insurance business by the Quantitative Impact Study QIS5 of the European Insurance and Occupational Pensions Authority and related research, see e.g. [5, 33]. In [18], lapse risk is officially defined as "all legal or contractual policyholder options which can significantly change the value of the future cash-flows". Hence, lapse risk includes not only a premature and full termination of contracts, but for example also partial terminations, changes to the frequency or quantity of premium payments, altered benefits or extensions of coverage. For accuracy, we use the term surrender to specifically refer to a premature, full termination of a contract induced by the policyholder. Colloquially, however, surrender and lapse are often used interchangeably.

Overall, the effects of surrender are manifold. Arguably most harmful to the insurer are early surrenders that cost the insurer its initial expenses, or adverse selection that alters the composition of the portfolio and compromises a sound diversification. The review in [16] provides an extensive overview of past research on surrender risk and divides the respective literature into two subsets. Given a theoretical framework, we can determine a fair value estimate of the embedded options, see e.g. [5, 33, 34]. This theoretical perspective is not limited to risk-neutral individuals, but can be extended to risk-averse and non-rational consumer behavior, see [16].
On the other hand, there exists extensive empirical research, which aims to identify the main risk drivers for lapse events and to test for modeling hypotheses, as well as the quality of modeling approaches to capture the underlying risk, see e.g. [2, 7, 15, 29, 41, 52]. The most common modeling classes include logistic regressions and tree based methods such as CART and random forests, where the evaluation is typically performed with metrics on the respective confusion matrix or the related receiver operating characteristics (ROC) curve for assessment and model choice. The works in [2, 7, 37, 38] indicate how the quality of empirical lapse models translates to economic measures. In [4] we see the inclusion of contagion effects in the modeling of lapse risk. We note that the majority of empirical modeling approaches focus on a single-period lapse prediction, which is to a large extent motivated by the one-year 99.5% value-at-risk measure imposed by the Solvency II directive. A multi-period perspective on lapse dynamics, including a proposition for lapse tables that can be used in similar fashion to standard mortality tables, can be found in [40]. The authors of [40] apply survival analytical methods, e.g. the Fine & Gray model, see [20], in order to distinguish between competing risks that can lead to the termination of a contract.

In the present work, we provide extensive numerical experiments that investigate the quality of common modeling approaches for surrender risk, namely logistic regression, tree based methods and neural networks, each in a bagged and a boosted version. All models are analyzed on four different portfolios of endowment policies, each showing surrender behaviours that replicate findings reported in the literature, see [8, 15, 41]. Overall, we find highly competitive results, where XGBoost consistently displays superior performance. For model evaluation, we look at the latent, true surrender probabilities, as well as observable realizations of surrender events.
Accessing the true probabilities allows us to determine the exact bias and variance and to monitor whether accessible concepts for model assessment reliably capture the latent dynamics of surrender. In view of the bias of a model, our results indicate a conflict between accurate label predictions and sound confidence predictions. We illustrate this with the example of resampling, which is commonly used to obtain more accurate label predictions of the minority class. Although we can confirm that common resampling techniques improve the F-score, we theoretically and numerically show that resampling comes at the cost of a significant bias of estimated surrender probabilities. Following this probabilistic perspective, we introduce time-dependent confidence bands for the predicted mean surrender rate as an alternative evaluation, which indicate the uncertainty of predicted surrender probabilities. This allows us to assess the quality of a model from a very practical going concern perspective, where the composition of the underlying portfolio and the predominant risk drivers might change over time. In particular, confidence bands highlight that adding a risk buffer to a naive baseline does not cover the surrender risk sufficiently.

The remainder of this paper is organized as follows. We start by clarifying the objective in Section 2. In Section 3, we review common, frequentistic concepts to evaluate binary classifiers and contrast them with probabilistic alternatives. Next, in Section 4 we discuss the rare event setting of surrender risk and analyze the distributional effect of common resampling methods. Section 5 presents the simulation and preparation of all data used in our numerical experiments. In Section 6, we then describe the specific parameterization of all classifiers and report numerical results. Lastly, we conclude in Section 7 with an outlook on future work.

2 Objective
Let the probability space (Ω, F, P) describe our stochastic environment. We denote an arbitrary insurance contract at time t ∈ N_0 by X_t : Ω → R^n, n ∈ N. The time series (X_0, X_1, X_2, ..., X_T) then represents the evolution of the contract up to the random period of termination T : Ω → N_{≥1}, recording e.g. the age of the policyholder or the assured benefits. As the termination can have multiple causes, like surrender, death or maturity, we record the respective state at the end of period t by J_t : Ω → N_0, where J_t = 0 for t = 0, ..., T−1 indicates an active contract. All states J_T ≠ 0 are terminal, absorbing states. They describe competing, censoring events, in the sense that observing {J_T = i} and {J_T = j}, i ≠ j, are mutually exclusive; each ends the observation of the time series and prevents a realization of an alternative state. For more detail on censoring and competing risks we refer the reader to [1].

Single period setting.
In this work we consider three terminal states, namely surrender, death and maturity. Surrender is represented by {J_T = 1}. Death and maturity are indicated by {J_T = 2} and {J_T = 3}, but will not be modeled explicitly. Further, in line with the Solvency II directive, we focus on a single period setting. Hence, we define targets Y_t by Y_t := 1_{J_t = 1}. Given a realization x_t of X_t, our objective is to estimate the probability of the active contract x_t to surrender within the period [t, t+1), i.e. to obtain an estimate for the conditional probability

p(1|x_t) := P(Y_t = 1 | X_t = x_t),   (1)

or, equivalently, to model the random variable

Y_t | (X_t = x_t) ∼ Ber(p(1|x_t)).   (2)

It is important to note that (1) implicitly imposes the Markov property that the random state J_t and the past represented by X_{t−k} = x_{t−k} are independent for k = 1, ..., t. In the literature, it is common to assume that the time horizon influencing surrender decisions of policyholders is restricted to one period, see e.g. [2, 41]. Also, conditioning on an active contract x_t concurrently implies T ≥ t.

Overall, our objective is a common setup for supervised learning, where we aim to obtain a model ˆp : R^n → [0,1], x ↦ ˆp(1|x), such that it minimizes the loss function ℓ : [0,1] × {0,1} → R, i.e.

ˆp := argmin_{p̄} ℓ( p̄(1|X_t), Y_t ).   (3)

We would like to draw attention to the non-binary output of the model ˆp. The contrast between binary and non-binary prediction will be a major topic throughout this paper. In the style of [45], we call estimates ˆp(1|x_t) ∈ [0,1] confidence predictions. In contrast, we refer to binary predictions ŷ_t ∈ {0,1}, which purely focus on the realization of Y_t | (X_t = x_t), as label predictions. Naturally, any confidence prediction ˆp(1|x_t) can be transformed into a label prediction based on some threshold c ∈ [0,1], i.e.
ŷ_t = 1_{ˆp(1|x_t) ≥ c}.

In practice, obtaining a sound model ˆp is complicated by the rare event character of surrender. We discuss rare events separately in Section 4, with a specific focus on their effect on confidence predictions and the distributional effect of common resampling strategies. First, we would like to conclude this section by commenting on the setting at hand, before we review approaches to evaluate classifiers in Section 3.

Context for alternative settings.
The presented objective is customized to the Solvency II directive and a one-year perspective, which disregards the exact time of surrender within the year. Other settings, e.g. a risk neutral product design, might ask for a more general setup where the risks of the individual events {J_T = i}, i ∈ N, are estimated jointly in a multi-period setting. In particular, this requires a more general, real-valued random time of termination T : Ω → R_{≥0} and random state variable J_t̃ : Ω → N_0, 0 ≤ t̃ ≤ T. Given a realization x_t of X_t, the objective then changes to estimating the cause-specific hazard rate α_{T,i}(t̃|x_t) of the terminal events i ∈ J_T(Ω) at time t̃ ≥ t. According to [1], the quantity α_{T,i}(t̃|x_t) is defined by

α_{T,i}(t̃|x_t) := lim_{Δ→0} (1/Δ) P( t̃ ≤ T ≤ t̃ + Δ, J_T = i | T ≥ t̃, x_t ).   (4)

Note that conditioning on an active contract x_t also implies that J_t̃ = 0 for all t̃ < t, as in our context all states J_t̃ ≠ 0 are terminal and competing risks. For a detailed discussion of α_{T,i}(t̃|x_t) and its estimation we refer the reader to [1, 40]. Instead, we want to comment on the practical implications of (1) and (4).

The quantities in (4) can be thought of as time-dependent transition intensities from the active state to a terminal state i in a Markov model. Cause-specific hazard rates provide a more detailed insight into the likelihood of states J_T to be realized and might be used to form a table of lapse probabilities, see e.g. [40]. It is well known that simply ignoring competing risks, e.g. death or maturity in the case of surrender, biases the estimated hazards α_{T,i}(t̃|x_t), which are otherwise asymptotically unbiased, see e.g. Section 3.4 in [1]. On the other hand, modeling a Bernoulli variable as in (1) translates to asking how likely it is to observe a surrender event of contract x_t within the next period. Here, all non-surrender states J_t ≠ 1 are combined by introducing the target Y_t.
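As a minimal illustration of this collapsing of competing-risk states into a binary target, the following sketch (our own toy code, not the authors' data pipeline) builds one-period surrender targets from a contract's state history, using the state codes defined above:

```python
# Sketch: turn a multi-state contract history into one-period binary
# surrender targets. State codes as in the text: 0 = active,
# 1 = surrender, 2 = death, 3 = maturity (terminal, absorbing).
SURRENDER = 1

def one_period_targets(states):
    """For every period t in which the contract enters as active,
    emit y_t = 1 iff the state at the end of period t is surrender.
    Death/maturity yield y_t = 0; they are competing, censoring events
    that end the observation of the time series."""
    targets = []
    for t in range(len(states) - 1):
        if states[t] != 0:   # observation ends at the first terminal state
            break
        targets.append((t, 1 if states[t + 1] == SURRENDER else 0))
    return targets

# A contract active for three periods, then surrendered:
print(one_period_targets([0, 0, 0, 1]))   # [(0, 0), (1, 0), (2, 1)]
# A contract ending by death: the surrender target is 0 in every period.
print(one_period_targets([0, 0, 2]))      # [(0, 0), (1, 0)]
```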
Further, in our one-period perspective we have complete observations, as either surrender or non-surrender is realized. In contrast, if we do not observe contract x entering a terminal state J_T > 0, the observation on x yields a right-censoring event for the estimation of (4).

For notational convenience, in the remainder of this work we drop the time index t in (X_t, Y_t) and denote realizations (x, y) ∼ (X, Y) by the use of small letters. The progression of time is implicitly recorded by features like the age of the policyholder. Consequently, we denote the true surrender probability of realization x by p(1|x), the confidence prediction by ˆp(1|x) and the label prediction by ŷ. In the following, we use indices to count data in our data set, e.g. x_k, and exponents if we explicitly want to refer to components, e.g. x_k^{(j)}.

3 Evaluation of binary classifiers

Usually, binary classifiers are evaluated based on their label predictions. In this section, we review classical approaches for evaluating classifiers, discuss shortcomings of their frequentistic focus and provide probabilistic alternatives. Our reasoning is dictated by the objective of sound confidence predictions, in contrast to accurate binary label predictions that model actions or definite class memberships.
Frequentistic evaluation.
In a binary classification task, the label prediction ŷ for a record x with label y can either be true, i.e. ŷ = y, or false, i.e. ŷ ≠ y. Based on whether the prediction ŷ is positive (ŷ = 1) or negative (ŷ = 0), we refer to a prediction as true, respectively false, positive or negative. A common evaluation is to present the counts of all predictions ŷ w.r.t. their correct label y in a 2 × 2 matrix, a so-called confusion matrix, see [2, 19, 41]. We provide an illustration of a confusion matrix in the Appendix A.4, see Figure 6. Based on the absolute frequencies in a confusion matrix, we can then formulate a variety of performance measures. Common examples include accuracy, precision, recall, specificity or the F_β-score for β > 0. Table 3 in the Appendix A.5 summarizes the definitions of well-known performance measures.

We argue that a confusion matrix, and the respective performance measures, do not provide a sound evaluation. Most performance measures solely look at individual aspects of our classifier and individually provide a deficient evaluation, as they are invariant towards specific changes in the confusion matrix. For example, precision or specificity each disregards either label predictions of type ŷ = 0 or data labeled with y = 1 altogether. On the other hand, accuracy simultaneously measures correct classifications for both classes y ∈ {0,1}, but has little explanatory power when there is an imbalance between data with label y = 0 and y = 1. For imbalanced data, a naive prediction of exclusively the majority class will indicate a particularly high accuracy if data of the minority class is very rare. Similarly, the F_β-score takes a broader perspective by quantifying the trade-off between precision and recall, yet is invariant to the count of correct predictions ŷ = 0. A more detailed analysis of individual performance measures, with a particular focus on invariance under permutations in the confusion matrix, is provided in [47].
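For illustration, these measures can be computed directly from the four cells of the confusion matrix. The following sketch (our own toy code with hypothetical labels, not the paper's evaluation code) also demonstrates the accuracy pitfall described above for a 1% minority share:

```python
# Sketch: performance measures derived from a 2x2 confusion matrix.

def confusion(y_true, y_pred):
    """Return counts (tp, fp, fn, tn) of a binary confusion matrix."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return tp, fp, fn, tn

def scores(y_true, y_pred, beta=1.0):
    """Accuracy, precision, recall and F_beta from the confusion matrix."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f_beta = ((1 + beta**2) * precision * recall
              / (beta**2 * precision + recall)) if precision + recall else 0.0
    return accuracy, precision, recall, f_beta

# Rare-event caveat: predicting the majority class everywhere already
# yields 99% accuracy on data with a 1% minority share, at zero recall.
y_true = [1] + [0] * 99
acc, prec, rec, f1 = scores(y_true, [0] * 100)
print(acc, rec)   # 0.99 0.0
```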
At the core of our critique lies the use of label predictions, given a probabilistic objective. For confidence predictions p(1|x) ∈ [0,1], the underlying confusion matrix is based on a single threshold c for label predictions ŷ = 1_{p(1|x) > c}. This implies that our evaluation is highly insensitive towards changes in p(1|x).

One way to examine a classifier's performance more holistically is to check how performance measures change for varying values of c ∈ [0,1]. A common choice is to look at the trade-off between recall, alias true positive rate, and false positive rate. The resulting tuples can be plotted in [0,1]^2 and the result is known as the receiver operating characteristics (ROC) curve. Note that recall and false positive rate are intertwined in such a way that randomly classifying the share p ∈ [0,1] of all records as positive, i.e. ŷ = 1, results in an expected recall and false positive rate of p. Hence, a ROC curve above the diagonal in [0,1]^2 indicates that the classifier performs better than random guessing. The notion of 'better' is formalized by the area under the curve (AUC), the integral of the ROC curve on [0,1], where AUC > 0.5 indicates a classifier superior to random guessing. Alternatively, the convex hull of the ROC curve can be utilized to determine optimal values of c or even to compare different classifiers, see e.g. [19, 43].

Although the ROC curve and its AUC are common evaluations, we note that recall and false positive rate each focus on the evaluation of one label. Thus, a ROC approach is independent of the relative frequencies of the two classes y = 0 and y = 1, i.e. insensitive to imbalanced data. The precision-recall curve presents an alternative model evaluation, which follows the same logic as ROC, see [19]. However, the precision considers both actual labels, y = 0 and y = 1, and therefore is sensitive to the balance of data.

Even when looking at multiple thresholds c simultaneously, the discussed frequentistic performance measures evaluate binary label predictions.
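The ROC curve can be traced by sweeping the threshold c over a grid, as the following sketch illustrates (our own toy code with hypothetical scores; a plain trapezoidal rule approximates the AUC):

```python
# Sketch: ROC curve and AUC from confidence predictions, obtained by
# evaluating label predictions y_hat = 1{p_hat >= c} on a threshold grid.

def roc_points(y_true, p_hat, thresholds):
    """Return sorted (false positive rate, recall) tuples, one per threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    pts = []
    for c in thresholds:
        tp = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p >= c)
        fp = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p >= c)
        pts.append((fp / neg, tp / pos))
    return sorted(pts)

def auc(points):
    """Trapezoidal rule over the (fpr, tpr) tuples."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

y_true = [0, 0, 1, 1]
p_hat = [0.1, 0.4, 0.35, 0.8]
pts = roc_points(y_true, p_hat, [i / 100 for i in range(101)])
print(auc(pts))  # 0.75, better than the 0.5 of random guessing
```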
In the very simplistic example of constant surrender rates, i.e. p(1|x) ≡ p, which we assume to model perfectly, i.e. ˆp(1|x) ≡ p, any label prediction ŷ = 1_{ˆp(1|x) ≥ c}, c ∈ [0,1], predicts only one outcome for all observations x. In order to avoid this loss of information, we look at more probabilistic measures of performance.

Probabilistic evaluation.
Ideally, we would like to compare the confidence predictions ˆp(1|x) with the true event probabilities p(1|x), e.g. by employing standard p-q plots. Given data {(x_i, y_i)}_{i=1}^N, we denote the mean absolute difference of these quantities by

mae(p, ˆp) := (1/N) Σ_i | p(1|x_i) − ˆp(1|x_i) |.   (5)

In our experiments, we construct a setup where we can access the true surrender probabilities p(1|x), in order to look at the model fit from a probabilistic point of view. However, this is a purely theoretical perspective, as in practice we do not observe p(1|x) but only the binary realization y from Ber(p(1|x)). Thus, information theory provides a more practical approach. One can show that the maximum likelihood estimator ˆp(1|x) corresponds to minimizing the Kullback-Leibler divergence, respectively equivalently minimizing the cross-entropy, see [25]. Using standard notation, we denote the empirical estimate of the (binary) cross-entropy by

H(ˆp) := −(1/N) Σ_i [ y_i log ˆp(1|x_i) + (1 − y_i) log(1 − ˆp(1|x_i)) ].   (6)

Hence, minimizing the binary cross-entropy provides a maximum-likelihood consistent estimator, which is why the cross-entropy is the most commonly chosen loss function in classification. Unfortunately, however, the cross-entropy lacks a comprehensive, practical interpretation for surrender. Therefore, we introduce a third option to evaluate confidence predictions, based on the central limit theorem of Lindeberg-Feller, see e.g. [32], Theorem 15.43.

Proposition 1.
Let x_1, ..., x_N be realizations of contract X. Let Z_1, ..., Z_N describe random, independent surrender events, where Z_i ∼ Ber(p(1|x_i)), i = 1, ..., N. Then, if there exists ε > 0 such that p(1|x) ∈ [ε, 1−ε] for any contract x ∼ X,

(1/σ(S_N)) Σ_i ( Z_i − p(1|x_i) ) →^d N(0, 1), as N → ∞,   (7)

with

σ(S_N) := sqrt( Var( Σ_i Z_i ) ) = sqrt( Σ_i p(1|x_i)(1 − p(1|x_i)) )   (8)

holds.

Proof.
See Appendix A.1.

Assuming the Z_i := (Y | X = x_i) in Proposition 1 to be independent means that, given circumstances x_i, policyholders act independently of each other. If x_i and x_j present observations on the same contract, i.e. the same policyholder, at different times, this assumption is implausible. Hence, in application we restrict the x_i in Proposition 1 to stem from the portfolio of the insurer for a fixed calendar year. Features that can affect the surrender risk of all policyholders holistically, such as a rising interest rate, can be recorded as components of the contracts x_i. Thus, a holistic increase of the surrender risk does not violate the independence of the surrender decisions Z_i. Note that the technical restriction p(1|x_i) ∈ [ε, 1−ε] does not impact the asymptotic distribution in Proposition 1. We can think of it as assuming a minimum risk for surrender and non-surrender. For any finite data set {(x_i, y_i)}_{i=1}^N with estimator ˆp(1|x_i) ∈ (0, 1), there exists ε > 0 which satisfies ˆp(1|x_i) ∈ [ε, 1−ε].

Corollary 1.
Let the assumptions of Proposition 1 hold and let ˆp : x ↦ ˆp(1|x) denote an estimator of p(1|x). Given the confidence level α ∈ (0, 1), we can then construct a confidence interval for the mean surrender rate by

( (1/N) Σ_i ˆp(1|x_i) ) ± z_{1−α/2} sqrt( Σ_i ˆp(1|x_i)(1 − ˆp(1|x_i)) ) / sqrt( N(N−1) ),   (9)

where ˆp(1|x_i) presents the model estimate for p(1|x_i) and z_{1−α/2} the respective percentile of the standard normal distribution. The best point estimate for the mean surrender rate is given by (1/N) Σ_i ˆp(1|x_i).

Proof.
See Appendix A.1.

If we were to work with label predictions ŷ_i := 1_{ˆp(1|x_i) ≥ 0.5} ∈ {0,1}, the quantity (1/N) Σ_i ŷ_i would also provide an estimator for E( (1/N) Σ_i Z_i ). However, without an estimate of the distribution of the Z_i we cannot express any level of confidence for our predictions. In Section 6, we apply Corollary 1 for each calendar year separately and obtain confidence bands on our time series. This approach relaxes the assumptions of [41], where the authors construct confidence intervals under the assumption of homogeneous and identically distributed subgroups of contracts.

4 Rare events and resampling

We now combine the nature of rare events with the objective of Section 2 and look at the effect of frequentistic performance measures in view of the probabilistic objective. We start with a definition of rare events, motivate resampling and examine it from a purely theoretical point of view. Thereafter, we interpret these results for common classifiers in practice.

In binary classification, we generally refer to data {(x_i, y_i)}_{i=1}^N drawn from (X, Y) as balanced data if both classes contain about equally many observations, i.e. Σ_i y_i ≈ Σ_i (1 − y_i). As this heuristic definition refers to the data itself, data imbalance might be introduced during the collection of data. Concerns about biased collection of data are discussed e.g. in [30]. Overall, it has been widely acknowledged in the literature, see [39, 46, 49, 50], that imbalanced data can lead to classifiers which are biased towards predicting the majority class, and that standard metrics, e.g. accuracy, are inappropriate. Here, the notion of a bias refers solely to label predictions. In the particular case of the logistic regression, imbalanced data leads to low values in the Hessian matrix, see [26].
Hence, estimated regression parameters are highly unstable, despite being asymptotically unbiased.

In contrast to imbalanced data, we characterize rare events by their low, true probability P(Y = 1) < ε, where the value of ε can be domain specific, see [44]. In practice, values as low as a few percent or less are commonly found in e.g. surrender of insurance contracts [2, 40, 41] or fraud detection [36, 42]. In this work, we will assume perfect data in the sense that any imbalance between surrender and non-surrender events stems from the rare event character of the task at hand and is not the consequence of improper data collection.

To address the concern of biased classifiers on imbalanced data, the literature proposes two main techniques, cost-sensitive learning and resampling, see e.g. [17, 30, 39, 46, 49, 50]. Both have empirically been shown to be capable of improving classifiers regarding their recall, F_β, AUC or the geometric mean of the true positive and true negative rates, see e.g. [10, 46, 49, 51]. In [45], we see that popular boosting schemes like 'AdaBoost' can also be interpreted as adaptive resampling. For completeness we note that research in data mining indicates that classifiers can also benefit from removing outliers or accounting for noisy measurements close to the decision boundary. For a more comprehensive review we refer the reader to [9, 27]. In the following, we focus on common resampling schemes, as resampling data and applying weights to the empirical loss function are conceptually equivalent. This argument is supported by the authors in [49], who nevertheless contend that sampling approaches will generally outperform cost-sensitive learning.

We start with a purely theoretical perspective on resampling. Therefore, we denote the marginal, respectively joint, density functions of (X, Y) at (x, y) by p(x) and p(y), respectively p(x, y). Note that the specific letter indicates the underlying random vector, as is common in Bayesian literature, see e.g. [48].
Further, we introduce the functions

p(y|x) := p(x, y) / p(x) if p(x) > 0, and 0 if p(x) = 0;   p(x|y) := p(x, y) / p(y) if p(y) > 0, and 0 if p(y) = 0.   (10)

Based on the concept of regular conditional distributions, see [14, 32], both functions in (10) are well defined on the given probability space (Ω, F, P) as functions w.r.t. either x or y. Hence, we are justified to interpret p(y|x) as a conditional density given the condition x, or as a measurable function w.r.t. its condition x.

Heuristically, over- and undersampling aim to mitigate the imbalance between classes by either dropping samples of the majority class or resampling from the minority class. We call this procedure random over-, respectively undersampling, if the resampling, respectively dropping, is performed randomly. In practice, these sampling procedures are usually performed until the data is balanced, as illustrated in the following example.

Example 1 (Random resampling). Let us assume two random variables
G, K : Ω → R^n, n ∈ N, that present two different events. In terms of the previous notation, we can think of G as G := (X | Y = 0) and analogously K := (X | Y = 1). Further, let g_1, ..., g_100 and k_1, ..., k_10 be realizations of G and K.

Then, random undersampling corresponds to randomly drawing an index set I ⊂ {1, 2, ..., 100} with |I| = 10 and P(i ∈ I) = 1/10 for i = 1, ..., 100. The resampled data set contains the collection of data points ⋃_{i ∈ I} {g_i} and ⋃_{i=1}^{10} {k_i}. Conversely, random oversampling generates a data set consisting of ⋃_{i=1}^{100} {g_i} and ⋃_{i=1}^{100} {k_{s_i}}, where the s_i present random i.i.d. indices with P(s_i = j) = 1/10 for j = 1, ..., 10.

Formally, we define resampling with the purpose of increasing the share of the minority class 1, without specifying the target ratio of minority and majority class, as follows.

Definition 1.
Let D = {(x_i, y_i)}_{i=1}^N be i.i.d. samples generated from (X, Y), where Y | (X = x) ∼ Ber(p(1|x)). Let the data be imbalanced with minority class Y = 1, such that Σ_i y_i < Σ_i (1 − y_i). We then call S a resampling scheme if it generates data D_S = {(x_i^S, y_i^S)}_{i=1}^{N_S} from D with N_S > N, respectively N_S < N, for which

Σ_{1 ≤ i ≤ N} (1 − y_i) = Σ_{1 ≤ i ≤ N_S} (1 − y_i^S) and {(x_i, y_i) ∈ D : y_i = 0} ⊂ D_S,   (oversampling)

respectively

Σ_{1 ≤ i ≤ N} y_i = Σ_{1 ≤ i ≤ N_S} y_i^S and {(x_i, y_i) ∈ D : y_i = 1} ⊂ D_S,   (undersampling)

hold.

To indicate that we refer to empirical data, we write ˆp(·), ˆp(·,·) and ˆp(·|·) for the densities induced by the original data {(x_i, y_i)}_{i=1}^N. Conversely, ˆp_S(·), ˆp_S(·,·) and ˆp_S(·|·) are derived from the resampled data {(x_i^S, y_i^S)}_{i=1}^{N_S}. Next, we impose that any proper resampling scheme should not alter the distribution of features within a given class, which is natural as any estimator ˆp : x ↦ ˆp(1|x) relies on the features x.

Definition 2 (Consistent resampling). We call a resampling scheme S consistent if it preserves the observed probabilities for features X within every class Y = y, i.e. with y ∈ {0,1} fixed, ˆp_S(x|y) = ˆp(x|y) holds for arbitrary x ∈ X(Ω).

Remark 1 (Examples for consistent resampling). Random undersampling and oversampling, as well as the popular resampling algorithm SMOTE, see [10], are all natural approaches to building a consistent resampling method. All approaches alter data x only for either the minority or the majority class. We provide arguments for the consistency of these resampling techniques, focusing on the class altered by the specific algorithm.

1. In random undersampling, we iteratively and randomly drop observations (x, 0) ∈ {(x_i, y_i)}_{i=1}^N.
In the first iteration, the probability to drop (x, 0) is characterized by the share of observations x_i = x in class y = 0, which poses an estimator of ˆp(x | y = 0). Assuming a sufficiently large sample size for the majority class, i.e. y = 0, we can approximate the probability of subsequent observations (x′, 0) to be dropped by ˆp(x′ | y = 0).

2. In random oversampling, we iteratively draw observations (x, 1) ∈ {(x_i, y_i)}_{i=1}^N with replacement. Hence, each x is drawn with probability ˆp(x | y = 1).

3. SMOTE oversampling, see [10], also draws observations (x, 1) ∈ {(x_i, y_i)}_{i=1}^N with replacement and with probability ˆp(x | y = 1). In contrast to random oversampling, it does not sample x but x′ = x + u · (x̃ − x), where u ∼ U(0, 1) and x̃, with (x̃, 1) ∈ {(x_i, y_i)}_{i=1}^N, is a randomly chosen k-nearest neighbour of x. By default, SMOTE employs k = 5. This approach samples new data points by assuming ˆp(x | y = 1) to be locally constant. The notion of locality is dictated by the distance metric used in the k-nearest-neighbour algorithm.

According to the work in [17], the effect of consistent resampling can be formalized as follows.

Theorem 1 (Probabilistic view on resampling, see [17]). Let S be a consistent resampling scheme and (x, y) arbitrary data which satisfy ˆp(x) > 0 and ˆp_S(x) > 0. Then the equation

ˆp_S(1|x) = ˆp(1|x) ˆp_S(y = 1) (1 − ˆp(y = 1)) / [ ˆp(y = 1) (1 − ˆp(1|x)) + ˆp_S(y = 1) (ˆp(1|x) − ˆp(y = 1)) ]

holds.

In Theorem 1, the assumption of positive densities ˆp(x) > 0 and ˆp_S(x) > 0 is crucial to apply Bayes' rule, p(1|x) = p(x | y = 1) p(y = 1) p(x)^{−1}, in the proof of Theorem 1 for both ˆp and ˆp_S. In the simplest case, where the estimators ˆp and ˆp_S are based on relative frequencies, this certainly holds true for data (x, y) that are in both the original and the resampled dataset.
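The distortion stated in Theorem 1 is easy to evaluate numerically. The following sketch (our own illustration; the base rates are hypothetical) plugs an original base rate of 3% and a balanced resampled base rate into the formula:

```python
# Sketch of Theorem 1: given an unbiased confidence prediction
# p = p_hat(1|x), an original base rate pi = p_hat(y=1) and a resampled
# base rate pi_s = p_hat_S(y=1), a model fit on resampled data reports:
def resampled_confidence(p, pi, pi_s):
    num = p * pi_s * (1.0 - pi)
    den = pi * (1.0 - p) + pi_s * (p - pi)
    return num / den

pi, pi_s = 0.03, 0.5        # 3% surrender rate, resampled to balance
for p in (0.01, 0.03, 0.10):
    print(p, round(resampled_confidence(p, pi, pi_s), 3))
# Every probability is inflated, as stated in Lemma 1 below: e.g. a
# true 3% surrender probability is reported as 50% after balancing.
```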
Next, we discuss the assumptions of Theorem 1 for common classifiers.

Generalized linear models, tree based classifiers and neural networks can all be used to construct confidence predictions ˆp(1|x) which are, at the very least, locally continuous. For example, a logistic regression or a neural network with a final sigmoid activation both result in P-almost-surely differentiable ˆp(1|x) ∈ (0, 1) for all contracts x and, thus, p(x) > 0, see (10). In its standard form, see [26], a classification tree applies relative frequencies per node, i.e. a locally constant estimator ˆp(1|x). In pure nodes, where ˆp(1|x) ∈ {0, 1}, we find ˆp(x, y) = 0 and ˆp(x) > 0, see again (10). Hence, the stated models all assume p(x) > 0 globally, which shows the generality of Theorem 1 in practice.

Observe that for any resampling scheme as in Definition 1, (1/N_S) Σ_i y_i^S > (1/N) Σ_i y_i holds. Hence, it is natural to assume that ˆp_S(y = 1) > ˆp(y = 1) holds for modeling approaches in practice. Then, we can establish the effect of resampling from a distributional perspective.

Lemma 1 (Bias of traditional resampling). Let S be a consistent resampling scheme. Further, consider a modeling approach with ˆp_S(y = 1) > ˆp(y = 1) and arbitrary data (x, y) with ˆp(x) > 0 and ˆp_S(x) > 0. Then, S increases the conditional probability of the minority class 1, such that ˆp_S(1|x) > ˆp(1|x) holds.

Proof.
See Appendix A.1.

Following the standard rules of differentiation, we can further analyse the bias indicated by Lemma 1.
Lemma 2.
Let S be a consistent resampling scheme and (x, y) arbitrary data with p̂(x) > 0 and p̂_S(x) > 0. Then

∂p̂_S(1 | x) / ∂p̂(1 | x) > 0   and   ∂²p̂_S(1 | x) / ∂p̂(1 | x)² < 0

holds.

Lemma 1 immediately implies that if p̂(1 | x) is a consistent estimator, i.e. it P-almost-surely converges to the true, latent probability p(1 | x) as the sample size N → ∞, then p̂_S(1 | x) formed on resampled data will not be consistent. In particular, if we look at Theorem 1 as a function of p̂(1 | x), we can explicitly retrieve the relative bias introduced by resampling. As recorded in Lemma 2, resampling in particular alters low probabilities p̂(1 | x) and results, if plotted against the non-resampled classifier p̂, in a concave curve strictly above the line defined by the identity p̂_S(1 | x) = p̂(1 | x).

Let us assume p̂(1 | x) to be consistent. In the special case of perfectly balanced data, i.e. p̂_S(y = 1) = 1/2, with an original base rate p̂(y = 1) and the confidence predictions p̂_S(1 | x), we can then explicitly reconstruct unbiased confidence predictions by

p̂(1 | x) = p̂(y = 1) ( p̂(y = 1) + (1 − p̂_S(1 | x)) (1 − p̂(y = 1)) / p̂_S(1 | x) )^{−1} .   (11)

We conclude this section with remarks on resampling.

Remark 2. i) In a rare event setting the event probability p(1 | x) is generally low, leading to a small region in the feature space X(Ω) to be classified as the minority event. Lemmas 1 and 2 illustrate that resampling increases this region and, thus, allows to shift the decision boundary of a classifier. For label predictions, this has empirically been shown to improve frequentistic metrics, see [10, 51].

ii) If the given data represent the ground truth in the sense that the data {(x_i, y_i)}_{i=1}^N are i.i.d. realizations from (X, Y) that allow for a sound estimation, in the sense of E_{x∼X}[p(1 | x) − p̂(1 | x)] = 0, then resampling introduces a bias to our confidence predictions p̂_S(y | x).
In particular, any risk measure, e.g. the value-at-risk, applied to the distribution induced by p̂_S(y | x) will reflect this bias.

iii) It is obvious that random undersampling disregards most of the information contained in the data of the majority class. A natural approach to mitigate the loss of information is to use bootstrapping to obtain multiple balanced, random-undersampled training data sets and apply ensemble techniques, which have shown superior performance e.g. in [49].

Data
There exists a large body of empirical work that aims to describe the nature of the surrender probability p(1 | x), where x includes macroeconomic, microeconomic and contractual factors, see [2, 7, 15, 29, 41, 52]. As the nature of p(1 | x) seems to vary between countries and types of policy, we do not aim to construct an optimal estimator p̂(1 | x) for all life insurance business. Our experiments focus on endowment contracts only. Further, we look at individual findings of the literature to construct four latent surrender models p of varying nature and complexity, which we call surrender profiles. Then, for each surrender profile we estimate p with a particular focus on the consistency of the estimator p̂ and the effect of resampling. In the following, we summarize how the data {(x_i, y_i)}_{i=1}^N ∼ (X, Y) are generated.

Simulation of contracts x. Each endowment contract x_i = (x_i^(1), ..., x_i^(n)) is identified by the current age of the policyholder, the face amount, the duration of the contract, the elapsed duration, the remaining duration, the annual premium amount, the frequency of premium payments and the current calendar year. We start by generating a portfolio of N = 30'000 endowment contracts at calendar year 0, where we infer a realistic distribution of X from [40]. In [40], the authors provide detailed statistics on a portfolio of US whole life contracts. Given the similarity between endowment and whole life insurance, the portfolio in [40] provides a realistic basis for our experiments. In Figure 1 we provide an illustration of the marginal distributions of the portfolio at calendar year 0. Face amounts are calibrated such that annual premiums are consistent with [40]. To compute premiums we apply the equivalence principle, see [13, 21], under the following assumptions.

• The actuarial interest rate i is constant.
• Expenses of the insurer for acquisition, administration and amortization of each contract are represented by cost parameters (α, β, γ), see [7].

• Premiums are paid up to the age of 67, the German age of retirement. Up-front premium payments are annualized linearly by the remaining time of premium payments.

• Mortality is described by a parametric survival model based on the Makeham law, see [13]. The t-year survival probability t_p_a of an individual aged a is defined by

t_p_a := exp( −A t − (B / log(c)) c^a (c^t − 1) ) ,   (12)

where we adopt the parameters of [13] by setting the baseline hazard A = 0.00022 and the age-related factors B = 2.7 · 10^−6 and c = 1.124.

More detail on the simulation of data at calendar year 0 can be found in Appendix A.2.

[Figure 1: Marginal distribution of the portfolio at calendar year 0, e.g. premium payment frequency (up-front, annual, monthly) and annual premium amount.]

Footnotes: A whole life insurance is equivalent to an endowment insurance with infinite duration. To be precise, the age of retirement in Germany varies between 65 and 67 based on the date of birth; the basic retirement age of 67 applies for individuals born on 1 January 1964 or later, see [35].

All contracts are then iterated forward by increasing the age of the policyholder and the elapsed duration of the contract by the period length
of one year. All other features are assumed to remain constant, i.e. changes to policies are not admissible. Additionally, at every iteration we simulate new business at a fixed share of the existing business, a magnitude that has empirically been observed in the German market, see [23]. Contracts of the new business are simulated analogously to the portfolio at calendar year 0, naturally with the restriction that the elapsed duration equals zero. The time frame of our data is set to one-year periods.

Meta model p(1 | ·). Given the input variable x = (x^(1), ..., x^(n)), we use a logistic regression model to obtain the true surrender probability p(1 | x) by

p(1 | x) := ( 1 + exp(−β_0 − β_x′ x) )^{−1} ,   (13)

where the regression coefficients β_0 ∈ R and β_x : R^n → R^n are specified by the respective surrender profile. In contrast to its standard formulation, see [26], we denote the coefficients β_x as a function of x. We do so purely for notational convenience, as it allows us to set the effect β_x^(i) x^(i) of the i-th feature as piecewise constant without having to introduce one-hot-encoded auxiliary variables. This will be useful when describing our surrender profiles. Additionally, we note that the use of categorized risk drivers increases the complexity of the model p(1 | x), which is in line with the complexity of surrender activities.

In [8, 15, 41], the authors report specific values of β_x^(i) x^(i) for a variety of categorized risk drivers x^(i). We adopt and combine these values in four different surrender profiles, see Figure 2 for an illustration.
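To make (13) concrete, the following minimal sketch evaluates the meta model in plain Python; the concrete coefficient function `toy_beta_x` is a made-up illustration with piecewise constant age effects, not one of the actual surrender profiles:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def surrender_probability(x, beta_0, beta_x):
    """Meta model (13): p(1|x) = (1 + exp(-beta_0 - beta_x(x)' x))^(-1).

    beta_x maps a contract x to its coefficient vector, so a feature's effect
    may depend on the category (e.g. age band) the feature falls into.
    """
    return sigmoid(beta_0 + sum(b * xi for b, xi in zip(beta_x(x), x)))

# Made-up example: the age effect is piecewise constant in two age bands,
# all other features carry no effect.
def toy_beta_x(x):
    age = x[0]
    return [0.02 if age < 40 else -0.01] + [0.0] * (len(x) - 1)

p = surrender_probability([35.0, 1000.0], -4.0, toy_beta_x)
# p is a small one-year surrender probability for this toy contract
```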
We consider the current age of a policyholder, the duration elapsed since underwriting the contract, the maximum duration of the contract, as well as the remaining duration until maturity, the frequency of premium payments and the annualized premium amount. In surrender profile 4, the most complex setting, we additionally include a time factor represented by calendar years, which can be interpreted as a proxy of the economic environment at that time or, alternatively, as noise. The range of profiles covers a variety of plausible dynamics, for example elevated surrender at young or higher ages, increased surrender at an early stage of contracts, for short-term contracts in general, or higher surrender for higher amounts of annual premiums. The intercept β_0 of each profile has been adjusted such that observed surrender rates per calendar year fall into realistic ranges of 0.01 up to 0.05. In Appendix A.3 we provide additional detail on the formulation of all surrender profiles.

Simulation of surrender activity y. We observe a contract x only up to its termination T, in our case either maturity, death or surrender. Maturity events are easily identified by the duration of a contract, unless death or surrender occurs prior. To simulate the death of a policyholder with current age a, we numerically invert t_p_a as a function of t and obtain the time of termination by T = (t_p_a)^{−1}(u), u ∼ U(0, 1). Naturally, we do so only once per policyholder. More detail on inverse transform sampling can be found in [24]. Surrender events are computed iteratively. A policyholder with contract x realizes a surrender event y = 1 within the next year if and only if p(1 | x) ≥ v for v ∼ U(0, 1). If surrender and an alternative termination event are modeled to occur in the same year, we assume uniformity of the events, similar to fractional survival probabilities in [21].
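The mortality simulation combines the Makeham survival probability (12) with inverse transform sampling; a sketch in plain Python, where the bisection inverts the survival function numerically (the parameter values and the limiting age are placeholders of realistic magnitude, not necessarily those used in our experiments):

```python
import math
import random

# Hypothetical Makeham parameters of realistic magnitude (placeholders).
A, B, C = 2.2e-4, 2.7e-6, 1.124

def survival(t, a):
    """t-year survival probability t_p_a under the Makeham law, eq. (12)."""
    return math.exp(-A * t - B / math.log(C) * C**a * (C**t - 1.0))

def sample_death_time(a, u=None, t_max=130.0, tol=1e-8, rng=random):
    """Inverse transform sampling: solve survival(T, a) = u for T by bisection."""
    if u is None:
        u = rng.random()
    lo, hi = 0.0, t_max - a  # assume death before a limiting age t_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if survival(mid, a) > u:  # survival is decreasing in t
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A usage example: `sample_death_time(40.0)` draws one remaining lifetime for a 40-year-old; passing an explicit `u` makes the draw reproducible.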
Given the time t ∈ (0, 1) until the alternative termination event occurs, we assume that the surrender event occurs with probability t prior to the competing risk, i.e. maturity or death.

Data preparation.
We one-hot encode the categorical feature 'frequency of premium payments' and apply min-max scaling to scale all contracts x to the range [0, 1]^n. For the sake of readability, both the raw and the scaled contract are denoted by x. Next, we split our data {(x_i, y_i)}_{i=1}^N in time, at the calendar year by which a fixed share of the data has been observed. Depending on the specific surrender profile, this split occurs at slightly different calendar years t and results in training data D_train and test data D_test. Note that a split in time imitates a practical setting, where we test the model from a going concern perspective. A random train-test split would result in training on future data. Moreover, a split in time enables us to detect a potential bias of our model as the composition of the portfolio changes. Although we model new business to follow a stationary distribution, respectively all features but the elapsed duration and the current calendar year, the inclusion of old business causes the composition of the portfolio to be non-stationary, e.g. with an increasing share of older policyholders or older contracts. Last, we randomly set aside a portion of the training data for validation.
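The split in time can be sketched as follows (plain Python; `share` stands in for the observation threshold used in our experiments, whose exact value we treat as a free parameter here):

```python
def split_in_time(data, share=0.5):
    """Split observations (t, x, y) at the first calendar year t_split by which
    at least a share `share` of all observations has been recorded; we train on
    years <= t_split and test on later years (a going concern perspective)."""
    years = sorted({t for t, _, _ in data})
    n = len(data)
    for t_split in years:
        if sum(1 for t, _, _ in data if t <= t_split) >= share * n:
            break
    train = [obs for obs in data if obs[0] <= t_split]
    test = [obs for obs in data if obs[0] > t_split]
    return train, test
```

In contrast to a random split, no future observation can leak into the training set.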
[Figure 2: Collection of distinct surrender profiles in our experiments. Panels (a)-(d) show, for profiles 1-4, the effects β_x^(i) x^(i) of the considered features, e.g. current age, duration, elapsed and remaining duration, premium frequency, annual premium and, for profile 4, calendar time.]

In Appendix A.5, see Table 4, we additionally provide a summary of the number of one-year observations in training and test set, the imbalance of surrender and non-surrender events, and the size of the training data set after random undersampling or SMOTE-resampling to perfect balance. Without resampling, surrender events contribute only a small share of all observations, which highlights the rare event character of our setting.

Numerical experiments
We present results for modeling p(1 | x), given the four different surrender profiles and the data {(x_i, y_i)}_{i=1}^N discussed in Section 5. We apply three model types, namely logistic regression, tree-based classifiers and neural networks, in their bagged and boosted forms. As a substitute for a thorough exploratory data analysis, for each surrender profile we restrict the input to its estimator p̂(1 | x) to the components of x_i that actually are part of the latent surrender model p(1 | x). Let us briefly comment on the specifics of the estimators.

Baseline model.
To set a baseline, we introduce a classifier with a constant surrender rate. Due to its simplicity, this provides practical supervision of whether subsequent classifiers are plausible and whether their increased complexity is justified. Given a set of training data D_train, which differs for each surrender profile, we define the confidence prediction p̂ of the baseline model by

p̂(1 | x′) ≡ (1 / |D_train|) Σ_{(x,y) ∈ D_train} y ,

where x′ denotes an arbitrary contract.

Logistic regression.
The structure of the logistic regression is given by p̂(1 | x) := ( 1 + exp(−β_0 − β′ x) )^{−1}, with intercept β_0 ∈ R and coefficients β ∈ R^p. Note that in contrast to the meta model p(1 | x), the coefficients β are constants. Providing the estimator with the true categories of all components of x, e.g. indicating ages 20-30 to be highly susceptible to surrender, seems unrealistic in a practical rare event setting. Instead, we feature-engineer higher-degree inputs (x_i^(j))^k for each feature x_i^(j) and increasing k = 1, 2, ..., as long as the binary cross-entropy on the validation set decreases. More detail on greedy forward-stepwise selection can be found in [26]. Note that the surrender profiles do not consider interactions between features, which is why we do not engineer inputs of the form (x_i^(l))^k (x_i^(n))^m. To avoid overfitting, we additionally apply the L2-regularization Σ_{i>0} (β^(i))² when minimizing the cross-entropy loss. At last, we combine N_bag = 10 estimators into an ensemble model, since a single logistic estimator is highly volatile in a rare event setting, see again [26].

Tree based classifiers.
A tree-based classifier provides a split of the feature space X(Ω) into disjoint regions R_1, ..., R_m. The surrender probability p̂(1 | x) is then formed by relative frequencies per region R_k. For an arbitrary data point x′ ∈ R_k we look at the subset D_{R_k} = {(x, y) ∈ D_train : x ∈ R_k} and assign

p̂(1 | x′) = (1 / |D_{R_k}|) Σ_{(x,y) ∈ D_{R_k}} y .

For an individual classification and regression tree (
CART), see [26], the regions R_k are the result of recursive binary splits of the region X(Ω). At each recursion, a CART searches for the feature x^(i) and the hyperplane H := {x : x^(i) = c}, c ∈ R, such that the respective split by H leads to a maximum reduction of the impurity. We measure the impurity of a region R by the binary cross-entropy. To avoid overfitting, we perform a best-parameter search for the maximum depth T of a single CART based on the cross-entropy loss on the validation set and eventually set T = 5, i.e. allowing for up to 2^5 = 32 disjoint regions. Finally, we use ensemble techniques and form a random forest classifier with N_bag = 100 trees as in [26] and a gradient boosting decision tree with N_boost = 100 trees, the XGBoost classifier as in [11]. Both ensemble techniques utilize uniform bootstrapping with a fixed sample size |D_train|, leading to varying training data for every CART.

Remark 3.
In practice, a more recent gradient boosting decision tree such as LightGBM, which is designed for high-dimensional feature spaces and large data sets, see [22], might be beneficial. In our experiments with a low-dimensional feature space, we observe a speed-up of LightGBM compared to an already fast XGBoost model, but no improvement in model performance. For completeness, we also tested adaptive boosting as in [45] on polynomial features, but dropped it due to poor performance. We note that the alternative
Gini index, see [26], yielded comparable results in our experiments.

Neural network classifiers. The bagged model consists of neural networks p̂^(k)(1 | ·), which are trained independently for k = 1, ..., N_bag with N_bag = 5. All models p̂^(k) have the same architecture. We choose feed-forward networks with 3 hidden layers of decreasing widths, with two ReLU and one tanh activation function. The output layer has a single unit and a sigmoid activation σ. The bagged estimator is then formed by p̂(1 | x) := (1 / N_bag) Σ_k p̂^(k)(1 | x). Note that the feature 'premium frequency' is one-hot encoded, resulting in effectively n + 1 inputs to the neural network.

Additionally, we construct a boosted neural network. Motivated by experiments and [3], we choose shallow neural networks p̂^(k), k = 1, ..., N_boost, as weak learners with N_boost = 20. For k > 1, each network p̂^(k) has one hidden layer with 10 units for each input feature in the respective surrender profile, a ReLU activation in the hidden layer and a single output unit with a linear activation. We start the boosting procedure with the baseline rate, i.e. p̂^(1)(1 | x) := σ^{−1}( (1 / |D_train|) Σ_i y_i ) for training data (x_i, y_i) ∈ D_train. We then iteratively add and interdependently train weak learners p̂^(k) until we obtain the final boosted model p̂ : R^p → (0, 1) by

p̂(1 | x) := σ( Σ_{k=1}^{N_boost} p̂^(k)(1 | x) ) .   (14)

All neural networks are trained with stochastic gradient descent on the cross-entropy loss, implemented by the adam algorithm, see [31], using a batch size of 1'024 and a maximum of 2'000 epochs. To mitigate overfitting, we apply early stopping if the validation loss does not decrease for 25 epochs.
For the bagged estimator, in addition to training the p̂^(k) individually, we perform a final fine-tuning where we train p̂(1 | x) collectively, as done e.g. in [28]. For the boosted estimator, we include a corrective step after selected iterations i, in which we collectively fine-tune the model σ( Σ_{k ≤ i} p̂^(k) ), see [3]. Note that in contrast to [3], which utilizes gradient boosting and hence a Taylor approximation, we work with the exact cross-entropy loss. While this is computationally more expensive, as we do not train on residuals, it allows us to use the sigmoid function σ in (14) to ensure sound confidence predictions p̂(1 | x) ∈ (0, 1).

Numerical results.
Let us now evaluate the classifiers mentioned above for all four surrender profiles. We start with a hypothetical analysis, where we assume that we can access the true probability of surrender within the next year of an arbitrary contract x. While this quantity p(1 | x) is not available in practice, it provides instructive insight into the classifiers' performance and allows us to compare the bias of resampling numerically to Lemmas 1 and 2. We start with a qualitative analysis of bias and variance of the models before we provide statistics to quantify our observations and the effect of resampling. Lastly, we look at the modeled mean surrender rate over calendar time, including confidence bands, for all classifiers and all profiles. In summary, we will see XGBoost to be the superior model, closely followed by the neural classifiers, and confirm the bias of resampling numerically. Additionally, we will notice that model evaluation by a single value, e.g. the cross-entropy, is insufficient as it ignores the non-stationarity of our data. Further, mean surrender rates with confidence bands as in Corollary 1 will provide a more comprehensive perspective.

In Figure 3 we consider test data (x, y) ∈ D_test and plot the confidence predictions p̂(1 | x) against the true, latent surrender probability p(1 | x) for all discussed models and all four surrender profiles. We observe that XGBoost is the only classifier that is approximately unbiased for all surrender profiles, yet its predictions p̂(1 | x) exhibit a comparably high variance. In contrast, predictions of the random forest and the bagged logistic classifier indicate a lower variance, but consistently underestimate the lapse probability of contracts x that possess a comparably high risk of surrender. Recall that two lapse events ŷ_i, ŷ_j are realizations of different random variables (Y | X = x), given the underlying contracts x_i, x_j do not coincide.
Hence, this bias refers to a specific type of high-risk contracts, but not to the tail of a single distribution. Lastly, both neural network approaches show a slight bias towards underestimating the lapse probability of high-risk contracts (profiles 1 and 3), see Figures 3a and 3c, with less variance in p̂(1 | x) than XGBoost. For risk profile 2, see Figure 3b, both neural networks suggest a performance superior to XGBoost, as they are approximately unbiased with lower variance.

[Figure 3: p-p̂ plots of the discussed classifiers for surrender profiles 1-4; panels (a)-(d) correspond to profiles 1-4.]

In profile 4, the bagged network consistently underestimates the surrender probability of contracts with comparably high surrender risk and the boosted estimator provides a classifier with high variance. Recall that in profile 4 we add non-stationary noise with regard to the calendar year, see Appendix A.3. While it might initially seem like our neural network approaches are overfitting on the training data in profile 4, we note that profile 4 is particularly challenging, as the observations on the input feature 'calendar year' differ between training and test data. Moreover, the effect of the calendar year t on surrender is increasingly negative in the training data, whereas the reverse holds for the test data, see Figure 2d. This setting forces classifiers to extrapolate the effect of the calendar year. While this adverse effect might be mitigated by selecting the validation data D_val also with respect to calendar time, instead of randomly, we take this use case to eventually highlight the benefit of a probabilistic evaluation.
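The latent evaluation quantities reported in Table 1, the mean absolute error of the confidence predictions and the variance of the errors |p(1|x) − p̂(1|x)|, amount to the following minimal sketch (only computable in our simulation setting, where the true probabilities are known by construction):

```python
def latent_error_stats(p_true, p_hat):
    """mae(p, p_hat) and the variance of the latent errors |p(1|x) - p_hat(1|x)|."""
    errors = [abs(p - q) for p, q in zip(p_true, p_hat)]
    n = len(errors)
    mae = sum(errors) / n
    var = sum((e - mae) ** 2 for e in errors) / n
    return mae, var
```

Note that this is the variance of the errors, not the variance of the predictions p̂(1 | x) themselves, a distinction discussed below.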
Recall that the illustrated p-p̂ plot cannot be used for tuning hyperparameters, as the true probabilities p(1 | x) are unknown.

Next, we quantify our observations in Table 1, where we provide latent quantities, namely the mean absolute error mae(p, p̂) (in short: mae) and the variance (Var) of the errors |p(1 | x) − p̂(1 | x)|, but also observable quantities such as the accuracy, the F1-score and the binary cross-entropy. All results are provided for training and test data separately. In each column we highlight the best value. Note that computing the variance of the errors |p(1 | x) − p̂(1 | x)| is fundamentally different from computing the variance of the predictions p̂(1 | x). While the latter is observable and gives the intuition that e.g. the logistic regression in Figure 3a has little variance, it disregards the fact that every contract x induces an individual random variable (Y | X = x), which is to be modeled by p̂(1 | x). Hence, the variance of the predictions p̂(1 | x) holds little value, as it penalizes e.g. an unbiased classifier if the true probabilities p(1 | x) evenly populate [0, 1].

For profiles 1 and 4, see Tables 1a and 1d, XGBoost shows the lowest variance and mean absolute error on both train and test data. The low variance of the latent errors |p(1 | x) − p̂(1 | x)| is in contrast to the comparably high variance of the observable predictions p̂(1 | x). For profiles 2 and 3, see Tables 1b and 1c, the neural classifiers perform equally well as XGBoost, respectively even slightly outperform it. The other classifiers, bagged logistic regression and random forest, also show improvements in mean absolute error and variance over the naive baseline. Depending on the desired bias-variance trade-off, one can even argue the logistic regression to be the best choice for profile 2. Overall, in each profile the best classifier improves markedly on the mean absolute error of the baseline, most notably so for profile 1.
Later in this section, when we take a time series perspective to examine observed surrender rates per calendar year, we will observe that simply adding a constant risk buffer of e.g. 0.01 to our baseline model is not sufficient to cover our exposure to surrender risk. For now, let us look at the actually observable evaluation metrics in Table 1.

(a) profile 1
Data          |                Train.               |                 Test
Evaluation    | acc.   F1     entropy mae    Var    | acc.   F1     entropy mae    Var
Baseline      | 0.9710 0.0000 0.1311  0.0495 0.0101 | 0.9546 0.0000 0.1890  0.0641 0.0158
Logist. Regr. | 0.9710 0.0000 0.1008  0.0445 0.0084 | 0.9546 0.0000 0.1352  0.0520 0.0128
Random Forest | 0.9732 0.2003 0.0532  0.0120 0.0007 | 0.9558 0.0867 0.0770  0.0154 0.0010
XGBoost       | 0.9818 0.6866 0.0429  0.0058 0.0006 | 0.9628 0.5934 0.0719  0.0080 0.0008
NN - bagging  | 0.9739 0.4440 0.0532  0.0090 0.0014 | 0.9581 0.2479 0.0733  0.0088 0.0009
NN - boosting | 0.9732 0.3780 0.0607  0.0156 0.0025 | 0.9571 0.1823 0.0765  0.0118 0.0013

(b) profile 2
Data          |                Train.               |                 Test
Evaluation    | acc.   F1     entropy mae    Var    | acc.   F1     entropy mae    Var
Baseline      | 0.9835 0.0000 0.0842  0.0199 0.0011 | 0.9869 0.0    0.0703  0.0200 0.0011
Logist. Regr. | 0.9835 0.0000 0.0721  0.0110 0.0006 | 0.9869 0.0    0.0614  0.0106 0.0006
Random Forest | 0.9835 0.0000 0.0704  0.0119 0.0008 | 0.9869 0.0    0.0608  0.0122 0.0009
XGBoost       | 0.9838 0.0485 0.0631  0.0117 0.0010 | 0.9869 0.0    0.0610  0.0125 0.0011
NN - bagging  | 0.9835 0.0000 0.0713  0.0105 0.0007 | 0.9869 0.0    0.0609  0.0104 0.0007
NN - boosting | 0.9835 0.0000 0.0717  0.0116 0.0009 | 0.9869 0.0    0.0608  0.0115 0.0009

(c) profile 3
Data          |                Train.               |                 Test
Evaluation    | acc.   F1     entropy mae    Var    | acc.   F1     entropy mae    Var
Baseline      | 0.9799 0.0000 0.0984  0.0254 0.0016 | 0.9838 0.0    0.0831  0.0242 0.0013
Logist. Regr. | 0.9799 0.0000 0.0865  0.0205 0.0012 | 0.9838 0.0    0.0728  0.0187 0.0011
Random Forest | 0.9799 0.0000 0.0835  0.0172 0.0011 | 0.9838 0.0    0.0710  0.0162 0.0010
XGBoost       | 0.9801 0.0181 0.0724  0.0140 0.0010 | 0.9838 0.0    0.0689  0.0141 0.0011
NN - bagging  | 0.9799 0.0009 0.0806  0.0148 0.0010 | 0.9838 0.0    0.0693  0.0141 0.0010
NN - boosting | 0.9799 0.0018 0.0819  0.0158 0.0010 | 0.9838 0.0    0.0710  0.0147 0.0010

(d) profile 4
Data          |                Train.               |                 Test
Evaluation    | acc.   F1     entropy mae    Var    | acc.   F1     entropy mae    Var
Baseline      | 0.9790 0.0000 0.1017  0.0219 0.0011 | 0.9841 0.0    0.0823  0.0190 0.0007
Logist. Regr. | 0.9790 0.0000 0.0890  0.0173 0.0008 | 0.9841 0.0    0.0720  0.0143 0.0005
Random Forest | 0.9790 0.0000 0.0867  0.0136 0.0007 | 0.9841 0.0    0.0707  0.0108 0.0004
XGBoost       | 0.9792 0.0162 0.0742  0.0096 0.0005 | 0.9841 0.0    0.0679  0.0087 0.0003
NN - bagging  | 0.9790 0.0000 0.0843  0.0130 0.0007 | 0.9841 0.0    0.0740  0.0149 0.0007
NN - boosting | 0.9791 0.0013 0.0849  0.0144 0.0007 | 0.9841 0.0    0.0829  0.0151 0.0007

Table 1: Statistics of the discussed classifiers for surrender profiles 1-4.

We notice that the binary cross-entropy (in short: entropy) and the latent mean absolute error are highly correlated, yet not perfectly aligned, see e.g. Table 1b. From a frequentistic perspective, for profiles 2-4 all classifiers yield the identical accuracy and F1-score on the test set. Since for these profiles the true surrender probability p(1 | x) does not exceed 0.5, recall Figure 3, which is the default threshold c ∈ [0, 1] to obtain label predictions ŷ = 1{p̂(1 | x) ≥ c}, any sound classifier will show this behavior. As p(1 | x) is not accessible in practice and cannot be used for evaluation or to check for plausibility, we follow the common strategy of resampling to improve the F1-score.

We retrained all models discussed above on balanced data, altered by random undersampling or SMOTE-resampling. As the quality of our results was equal regardless of the specific surrender profile or the type of resampling applied, we restrict our presentation to SMOTE-resampling on surrender profile 2.
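The interpolation step of SMOTE used for this retraining can be sketched as follows (plain Python; a minimal, unoptimized illustration of the scheme described earlier, not the reference implementation of [10]):

```python
import math
import random

def smote_sample(minority, k=5, rng=None):
    """Draw one synthetic minority observation via SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats) with label y = 1.
    Pick a minority point x at random, pick one of its k nearest minority
    neighbours x_tilde, and return x' = x + u * (x_tilde - x), u ~ U(0, 1).
    """
    rng = rng or random.Random()
    x = rng.choice(minority)
    # k nearest neighbours of x among the other minority points (Euclidean)
    others = [z for z in minority if z is not x]
    others.sort(key=lambda z: math.dist(x, z))
    x_tilde = rng.choice(others[:k])
    u = rng.random()
    return [xi + u * (ti - xi) for xi, ti in zip(x, x_tilde)]
```

Each synthetic point lies on the segment between a minority observation and one of its neighbours, which is exactly the local-constancy assumption on p̂(x | y = 1) noted above.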
Profile 2 is arguably the hardest profile on which to form accurate label predictions, as even on the training data all models except XGBoost showed a zero F1-score, see Table 1b.

In Table 2 we revisit the evaluation of our classifiers, which have now been trained on resampled data. We evaluate all classifiers on the actually observed training and test data. We notice that resampling indeed improves the F1-score of all classifiers on the training as well as the test data. This means that SMOTE-resampling positively affects the balance between correct label predictions ŷ = 1 and correctly classified observations y = 1. While this might be the objective in situations where binary actions have to be taken, e.g. the decision to investigate for fraud, Table 2 also shows its downside. First, there is a trade-off between F1-score and accuracy. Secondly, and arguably more importantly, we observe that the cross-entropy as well as the mean and variance of the latent errors |p(1 | x) − p̂_S(1 | x)| have increased for all classifiers on training and test data, compared to training without SMOTE-resampling, see Table 1b. We can uncover the latter even further by looking at p-p̂_S plots, see Figure 4.

Data          |                Train.               |                 Test
Evaluation    | acc.   F1     entropy mae    Var    | acc.   F1     entropy mae    Var
Baseline      | 0.9835 0.0000 0.0842  0.0199 0.0011 | 0.9869 0.0000 0.0703  0.0200 0.0011
Logist. Regr. | 0.7512 0.0922 0.5119  0.3245 0.0471 | 0.7714 0.0749 0.4850  0.3084 0.0451
Random Forest | 0.7232 0.0874 0.4944  0.3252 0.0435 | 0.7530 0.0720 0.4551  0.3042 0.0398
XGBoost       | 0.8202 0.1185 0.4151  0.2653 0.0497 | 0.8723 0.0933 0.3038  0.2022 0.0377
NN - bagging  | 0.7640 0.0997 0.4549  0.2816 0.0596 | 0.7824 0.0754 0.4165  0.2581 0.0563
NN - boosting | 0.7663 0.0985 0.4778  0.2944 0.0562 | 0.7829 0.0770 0.4489  0.2731 0.0552

Table 2: Statistics on the discussed classifiers with SMOTE-resampling for surrender profile 2.

[Figure 4: p-p̂_S plots of the discussed classifiers with SMOTE-resampling for surrender profile 2.]
Here, we clearly observe that resampling (with SMOTE) results in confidence predictions p̂_S(1 | x) that in general highly overestimate the true surrender probability p(1 | x) of a contract x ∈ D_test. Further, the quantities p̂_S(1 | x) approximately form a concave curve with respect to the true probabilities p(1 | x). This numerically affirms Lemmas 1 and 2. Note that both lemmata describe the relation between p̂ and p̂_S, e.g. p̂_S(1 | x) > p̂(1 | x), and that prior to resampling the classifiers occasionally underestimate the true probability p(1 | x) by p̂(1 | x), see again Figure 3b. Ignoring the numerical nature of training, this is why we naturally observe a few estimates p̂_S(1 | x) below the line defined by the identity p̂_S = p.

Remark 4.
Alternatively, to improve the F1-score one might alter the threshold c ∈ [0, 1] on which the label predictions ŷ = 1{p̂(1 | x) ≥ c} are based. For completeness, we also investigated ROC and precision-recall curves. However, we found no significant benefit of resampling in terms of the classifiers' AUC values, which is why we do not report results on this. Another downside of resampling, and oversampling in particular, is its more demanding run time. In our experiments, random undersampling and SMOTE-resampling lead to an increased run time by a factor of up to 6.3, resp. 8.5, see Table 5 in Appendix A.5.

Remark 5.
The bias of resampling schemes indicated by Figure 4 naturally leads to mean surrender rates (1 / |D|) Σ_{(x,y) ∈ D} p̂_S(1 | x) that highly and confidently overestimate the observed surrender rate (1 / |D|) Σ_{(x,y) ∈ D} y for data D of a specific calendar year. If we were to work with label predictions, we could alternatively estimate mean surrender rates by relative frequencies (1 / |D|) Σ_{(x,y) ∈ D} 1{p̂(1 | x) ≥ c}. However, in our experiments this led to performances unsuitable for application, as we still considerably overestimate the observed surrender rate. Also, binary label predictions prevent us from constructing confidence bands as described in Corollary 1.

At last, we omit resampling and evaluate all classifiers in a practical setting, where we cannot access the latent probabilities p(1 | x). We now take a time-series perspective and look at predicted mean surrender rates and their confidence bands over calendar time, as proposed in Corollary 1. This concept uncovers the quality of the proposed classifiers in a going concern setting and respects that our insurance portfolio might change over time, leading to non-constant risk exposure, e.g. in terms of a value-at-risk measure. The results are illustrated in Figure 5, where the split in calendar time between train and test data is indicated by a vertical, gray line. Upper and lower confidence bands at level α, as in Corollary 1, are displayed by up- respectively down-facing triangles.

First, we observe that the mean surrender rates are non-stationary. Profiles 2-4 show mean surrender rates that decline over time. In contrast, in profile 1 we see an increase of mean surrender activity with a notable drop at calendar year 4.
Recall that surrender profile 1 as in [41] exhibits a significant increase in surrender activity just prior to an elapsed duration of 4 years, followed by a sharp decline.

For all surrender profiles, the naive baseline provides an insufficient classifier, which naturally does not capture any of the non-stationarity. Hence, a constant risk buffer does not appropriately compensate the simplicity of the model. In profile 1, we observe that all classifiers except the logistic regression capture the trend of mean surrender, including the characteristic, sharp drop at calendar year 4. However, XGBoost is the only classifier for which the true rate lies consistently within the confidence band of its predictions. In profiles 2 and 3, for most years the quality of the proposed classifiers, excluding the naive baseline, cannot be distinguished based on their confidence bands. Arguably, random forest and logistic regression indicate a performance inferior to the neural classifiers and XGBoost, which is again the only model that confidently predicts the true surrender rate for all calendar years. This is actually surprising, given that in profile 2 for both neural classifiers the mean error $|p(1|x) - \hat p(1|x)|$ is lower than for XGBoost, recall Table 1b. This highlights the importance of a calendar-year, respectively time-series, perspective on surrender. In profile 4 we expect our neural classifiers to show poor performance, given that we forced them to extrapolate the feature 'calendar year', recall Figure 3d and its reasoning. However, this expectation is purely based on accessing the true surrender probability $p(1|x)$, which is latent in practice. In Figure 5d we now observe precisely this behaviour. The mean surrender rates of both neural classifiers, as well as the logistic regression, drift away from the true surrender rate on the test set.
Hence, predicted mean surrender rates and in particular their confidence bands successfully uncover the previously latent weakness of models. In that sense, results on profile 4 also indicate that tree-based classifiers provide more stable estimates when the characteristics of training and test set differ.

Figure 5: Mean surrender rates over calendar time of the classifiers for surrender profiles 1-4 (panels (a)-(d)), including confidence bands based on Corollary 1. Vertical lines indicate the train-test split.

In the present work, we perform extensive numerical experiments, where we look at four different surrender profiles, each motivated by empirical research in the literature. Our modeling approaches include the logistic regression, tree-based classifiers and neural networks, each in a bagged and a boosted version. The results generally indicate a highly practical performance, where XGBoost is the superior model across all surrender profiles. During the evaluation we place a particular focus on bias and variance, which we compute explicitly by accessing the latent, true surrender probability of each contract. These latent model assessments are additionally compared to evaluations based on observable quantities. With a view to low bias, our results indicate a tradeoff between accurate label prediction for rare events, such as surrender, and a sound estimation of the true surrender probabilities.
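This bias mechanism can be illustrated numerically with the analytic prior shift that underlies the proof of Lemma 1. The sketch below is not the paper's code; the function name and the chosen rates are hypothetical, and the correction simply applies the same map with the priors swapped.

```python
def shift_prior(p, pi_old, pi_new):
    """Map a conditional probability p, valid under class prior pi_old, to the
    probability implied under a new class prior pi_new (Bayes reweighting of
    the odds; cf. the decomposition used in the proof of Lemma 1)."""
    num = p * pi_new * (1.0 - pi_old)
    den = pi_old * (1.0 - p) + pi_new * (p - pi_old)
    return num / den

pi = 0.03      # hypothetical portfolio surrender rate (original prior)
pi_bal = 0.5   # class prior after perfectly balanced resampling

p_resampled = shift_prior(0.06, pi, pi_bal)          # what a resampled model reports
p_corrected = shift_prior(p_resampled, pi_bal, pi)   # prior correction
```

Here a true probability of 6% is inflated to roughly 67% by balanced resampling, while the back transformation recovers it; without such a correction, resampled predictions severely overstate the underlying risk.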
We numerically affirm earlier findings in the literature that common resampling techniques can indeed improve frequentist evaluations of our classifiers, such as the F-score. However, our theoretical and numerical results also show that this comes along with a significant bias of the predicted surrender probabilities. Hence, resampling severely skews the underlying risk and is impractical, e.g. with regard to a value-at-risk assessment of surrender risk.

Further, we discuss that each observed surrender event is a realization of a distinct conditional random variable. Therefore, the variance of the latent errors of a model is significantly different from the variance perceived by its predictions. To promote sound model evaluation that also factors in the uncertainty of predicted surrender probabilities, we consult the central limit theorem of Lindeberg-Feller, which drops the assumption of random variables being identically distributed. Thereby, we derive practical confidence bands on mean surrender rates with respect to calendar time. This allows us to identify poor performance in a going concern perspective, where the composition of a portfolio or the predominant risk drivers might change over time. Based on mean surrender rates and their confidence bands, our experiments in particular highlight that adding a risk buffer to a naive baseline does not cover the surrender risk sufficiently. This observation indicates the importance of time for a practical model evaluation and provides support for a probabilistic, time-series perspective on surrender risk.

Further research should focus on the application of our findings to high-dimensional, real data sets. We are interested in the benefit of confidence bands for mean surrender rates on data sets where prior analysis was based on frequentist concepts and the application of resampling.
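As a sketch, the band of Corollary 1 can be computed from predicted probabilities alone. This is a minimal illustration, not the paper's implementation; the 2% rate and the portfolio size below are hypothetical stand-ins for a classifier's output $\hat p(1|x)$ on one calendar year's portfolio.

```python
import math
from statistics import NormalDist

def mean_rate_band(p_hat, alpha=0.05):
    """Two-sided confidence band for the mean event rate (1/N) * sum_i Z_i,
    following the normal approximation of Corollary 1. Uses the asymptotically
    unbiased sample estimate of sigma(S_N) from the predicted probabilities."""
    n = len(p_hat)
    mean = sum(p_hat) / n
    sigma_hat = math.sqrt(n / (n - 1)) * math.sqrt(sum(p * (1 - p) for p in p_hat))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * sigma_hat / n
    return mean - half_width, mean + half_width

# Hypothetical portfolio of 10'000 contracts with a 2% predicted surrender rate.
p_hat = [0.02] * 10_000
lower, upper = mean_rate_band(p_hat)
```

An observed surrender rate falling outside (lower, upper) in a given calendar year flags the model, in the spirit of Figure 5.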
Confidence bands could also be used to monitor whether a given model needs to be recalibrated as either the composition of the portfolio or the behaviour of policyholders changes. Further, it would be interesting to extend our results by including the loss-given-surrender, or an appropriate deterministic proxy, and thereby evaluate models based on their estimate of the economic capital at risk due to surrender. The presented concept of mean event probabilities and confidence bands is not restricted to surrender or one-year time horizons, but is valid for general Bernoulli-type events. Hence, it can also be used to monitor e.g. the quality of a model for expected fraud rates or one-year death events.

Acknowledgements
The author wants to thank his supervisor Christian Weiß for his valuable feedback and numerous helpful discussions. Further, the author is grateful to the 'Ministerium für Kultur und Wissenschaft des Landes Nordrhein-Westfalen' for supporting his research by the grant 'FH BASIS 2019' (reference 1908fhb005). All numerical experiments presented in this work were conducted on a server funded by the respective grant.
References

[1] O. O. Aalen, Ø. Borgan, M. Gail, H. K. Gjessing, K. Krickeberg, J. Samet, A. Tsiatis, and W. Wong. Survival and Event History Analysis: A Process Point of View. Statistics for Biology and Health. New York: Springer, 2008.
[2] M. Aleandri. "Modeling Dynamic Policyholder Behaviour through Machine Learning Techniques". Submitted to Scuola de Scienze Statistiche (2017).
[3] S. Badirli, X. Liu, Z. Xing, A. Bhowmik, K. Doan, and S. S. Keerthi. "Gradient Boosting Neural Networks: GrowNet" (2020). Available on arXiv: 2002.07971.
[4] F. Barsotti, X. Milhaud, and Y. Salhi. "Lapse risk in life insurance: Correlation and contagion effects among policyholders' behaviors". Insurance: Mathematics and Economics 71 (2016), 317–331.
[5] D. Bauer, R. Kiesel, A. Kling, and J. Ruß. "Risk-neutral valuation of participating life insurance contracts". Insurance: Mathematics and Economics.
[6] Probability Theory. Universitext. London: Springer, 2013.
[7] T. Burkhart. "Surrender Risk in the Context of the Quantitative Assessment of Participating Life Insurance Contracts under Solvency II". Risks.
[8] Rome, 2008.
[9] V. Chandola, A. Banerjee, and V. Kumar. "Anomaly detection: A survey". ACM Computing Surveys 41.3 (2009), 1–58.
[10] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. "SMOTE: Synthetic Minority Over-sampling Technique". Journal of Artificial Intelligence Research 16 (2002), 321–357.
[11] T. Chen and C. Guestrin. "XGBoost: A Scalable Tree Boosting System". Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, 785–794.
[12] DAV. Herleitung der Sterbetafel DAV 2008 T für Lebensversicherungen mit Todesfallcharakter. Ed. by Deutsche Aktuarvereinigung e.V. https://aktuar.de/unsere-themen/lebensversicherung/sterbetafeln/2018-10-05_DAV-Richtlinie_Herleitung_DAV2008T.pdf. Accessed: 2020-06-29.
[13] D. C. Dickson, M. R. Hardy, and H. R. Waters. Actuarial Mathematics for Life Contingent Risks. International Series on Actuarial Science. Cambridge University Press, 2009.
[14] R. Durrett. Probability: Theory and Examples. 4th ed. Cambridge University Press, 2010.
[15] M. Eling and D. Kiesenbauer. "What Policy Features Determine Life Insurance Lapse? An Analysis of the German Market". The Journal of Risk and Insurance.
[16] The Journal of Risk Finance.
[17] In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence.
[18] QIS5 Technical Specifications. 2010.
[19] T. Fawcett. "An introduction to ROC analysis". Pattern Recognition Letters.
[20] Journal of the American Statistical Association.
[21] Einführung in die Lebensversicherungsmathematik. Karlsruhe: Verlag Versicherungswirtschaft, 2006.
[22] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". NIPS. 2017.
[23] GDV. Statistical Yearbook of German Insurance 2018. GDV, 2019.
[24] P. Glasserman. Monte Carlo Methods in Financial Engineering. New York: Springer, 2010.
[25] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[26] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. New York: Springer, 2017.
[27] V. Hodge and J. Austin. "A Survey of Outlier Detection Methodologies". Artificial Intelligence Review.
[28] Scandinavian Actuarial Journal.
[29] North American Actuarial Journal.
[30] Political Analysis.
[31] International Conference on Learning Representation (2015).
[32] A. Klenke. Probability Theory. London: Springer, 2014.
[33] A. Kling, F. Ruez, and J. Ruß. "The impact of policyholder behavior on pricing, hedging, and hedge efficiency of withdrawal benefit guarantees in variable annuities". European Actuarial Journal.
[34] Annals of Actuarial Science.
[35] Old-age pensions. Accessed: 2019-06-27.
[36] S.-H. Li, D. C. Yen, W.-H. Lu, and C. Wang. "Identifying the signs of fraudulent accounts using data mining techniques". Computers in Human Behavior.
[37] European Journal of Operational Research.
[39] Knowledge-Based Systems 59 (2014), 142–148.
[40] X. Milhaud and C. Dutang. "Lapse tables for lapse risk management in insurance: a competing risk approach". European Actuarial Journal.
[41] Bulletin Français d'Actuariat.
[42] Machine Learning.
[44] Rare Event Simulation Using Monte Carlo Methods. Chichester: Wiley, 2009.
[45] R. E. Schapire and Y. Singer. "Improved Boosting Algorithms Using Confidence-rated Predictions". Machine Learning.
[46] 2007, 132–139.
[47] M. Sokolova and G. Lapalme. "A systematic analysis of performance measures for classification tasks". Information Processing & Management.
[48] Bayesian and Frequentist Regression Methods. New York: Springer, 2013.
[49] B. C. Wallace, K. Small, C. E. Brodley, and T. A. Trikalinos. "Class Imbalance, Redux". IEEE 11th International Conference on Data Mining. 2011, 754–763.
[50] G. M. Weiss. "Mining with rarity". ACM SIGKDD Explorations Newsletter.
[51] The Geneva Papers on Risk and Insurance – Issues and Practice.
A Appendix
A.1 Proofs
Proof of Proposition 1.
In order to apply the central limit theorem of Lindeberg-Feller, it remains to show that the Lindeberg condition holds, see Theorem 15.43 in [32]. Hence, for all $\delta > 0$ we have to show that

$$L_N(\delta) := \frac{1}{\sigma^2(S_N)} \sum_{i=1}^{N} E\left[ \left( Z_i - p(1|x_i) \right)^2 \mathbb{1}\left\{ |Z_i - p(1|x_i)| > \delta\,\sigma(S_N) \right\} \right] \xrightarrow{N\to\infty} 0. \qquad (15)$$

Observe that $|Z_i - p(1|x_i)|$ is bounded, while $\sigma^2(S_N) = \sum_{i=1}^{N} p(1|x_i)(1 - p(1|x_i)) \xrightarrow{N\to\infty} \infty$, given that $p(1|x_i) \in [\varepsilon, 1-\varepsilon]$ for all $i \in \mathbb{N}$. Hence, for every $\delta > 0$ there is $\bar N_\delta \in \mathbb{N}$ such that $|Z_i - p(1|x_i)| < \delta\,\sigma(S_N)$ for all $N > \bar N_\delta$ and for all $i \in \mathbb{N}$. Therefore, the sum in (15) contains at most $\bar N_\delta$ bounded summands and we conclude $L_N(\delta) \xrightarrow{N\to\infty} 0$.

Proof of Corollary 1.
By Proposition 1, we know

$$\frac{1}{\sigma(S_N)} \sum_i \left( Z_i - p(1|x_i) \right) \xrightarrow{d} \mathcal{N}(0, 1). \qquad (16)$$

Then, for large values of $N$,

$$1 - \frac{\alpha}{2} \overset{!}{=} P\left( \frac{1}{N} \sum_i Z_i \le z \right) \approx \Phi\left( \frac{Nz - \sum_i p(1|x_i)}{\sqrt{\sum_i p(1|x_i)(1 - p(1|x_i))}} \right) \qquad (17)$$

holds, and therefore

$$z \approx \frac{1}{N} \sum_i p(1|x_i) + \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \frac{\sqrt{\sum_i p(1|x_i)(1 - p(1|x_i))}}{N}. \qquad (18)$$

Let $\hat p : x \mapsto \hat p(1|x)$ describe the estimator for the true Bernoulli probabilities $p : x \mapsto p(1|x)$. Then, substituting $\frac{1}{N}\sum_i p(1|x_i)$ by its estimate $\frac{1}{N}\sum_i \hat p(1|x_i)$ and the standard deviation $\sigma(S_N) = \sqrt{\sum_i p(1|x_i)(1 - p(1|x_i))}$ by its asymptotically unbiased sample estimate $\hat\sigma(S_N) = \sqrt{\frac{N}{N-1}} \sqrt{\sum_i \hat p(1|x_i)(1 - \hat p(1|x_i))}$ yields the statement.

Proof of Lemma 1.
Using the decomposition proposed in Theorem 1 and $\hat p_S(y{=}1) > \hat p(y{=}1)$ yields

$$\hat p_S(1|x) = \frac{\hat p(1|x)\, \hat p_S(y{=}1)\left(1 - \hat p(y{=}1)\right)}{\hat p(y{=}1)\left(1 - \hat p(1|x)\right) + \hat p_S(y{=}1)\left(\hat p(1|x) - \hat p(y{=}1)\right)} > \frac{\hat p(1|x)\, \hat p_S(y{=}1)\left(1 - \hat p(y{=}1)\right)}{\hat p_S(y{=}1)\left(1 - \hat p(1|x) + \hat p(1|x) - \hat p(y{=}1)\right)} = \hat p(1|x).$$

A.2 Simulation of endowment policies
We present the simulation scheme for the portfolio at the initial calendar year. New business at subsequent calendar years is simulated analogously with the condition that the currently elapsed duration equals zero. Let us denote an arbitrary contract $X = (X^{(1)}, \ldots, X^{(n)})$ with $n = 8$. The features correspond to the calendar year ($X^{(1)}$), the current age of the policyholder ($X^{(2)}$), the face amount of the contract ($X^{(3)}$), the duration ($X^{(4)}$), the elapsed duration ($X^{(5)}$), the remaining duration ($X^{(6)}$), the frequency of premium payments ($X^{(7)}$) and the annual amount of premium payments ($X^{(8)}$). For the simulation we use a gamma distribution $\Gamma_{\alpha,\lambda}$ with shape $\alpha > 0$ and rate $\lambda > 0$, see [6], as well as the uniform distribution $U(0,1)$. The gamma distribution provides a flexible, right-skewed distribution. For a random variable $\zeta \sim \Gamma_{\alpha,\lambda}$, the expectation and variance of $\zeta$ are given by $E(\zeta) = \alpha/\lambda$ and $\mathrm{Var}(\zeta) = \alpha/\lambda^2$.

In [40], the reported shares of policyholders of ages 0-34, 35-54 and 55-84 indicate a right-skewed distribution for a policyholder's age. Further, most lapse events, including maturity, are indicated to occur within 15 years. The authors in [40] also report the respective shares of annual, infra-annual and supra-annual premium payments. The annual premium amount is reported to be highly right-skewed. We calibrate the marginal distributions of features $X^{(i)}$, $i \ge 2$, to closely imitate these statistics. We define $X^{(2)}$ and $X^{(4)}$ via gamma distributions, $X^{(3)}$ via a shifted gamma distribution and $X^{(5)} \sim \min\left( X^{(4)} \cdot U(0,1),\, X^{(2)} \right)$.
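The sampling scheme can be sketched as follows. The gamma parameters and frequency weights below are illustrative placeholders rather than the calibrated values of this appendix, and note that Python's `random.gammavariate` expects the shape and the scale $1/\lambda$, not the rate.

```python
import random

def simulate_contract(rng: random.Random) -> dict:
    """Draw one endowment contract (sketch of the scheme in Appendix A.2).
    All distribution parameters are illustrative, not the paper's values."""
    age = rng.gammavariate(9.0, 5.0)          # X^(2): right-skewed current age
    duration = rng.gammavariate(4.0, 4.0)     # X^(4): contract duration
    face = 10_000 + rng.gammavariate(2.0, 20_000.0)  # X^(3): shifted gamma
    elapsed = min(duration * rng.random(), age)      # X^(5): capped by the age
    remaining = duration - elapsed                   # X^(6): deterministic
    frequency = rng.choices(["upfront", "annual", "monthly"],
                            weights=[0.15, 0.55, 0.30])[0]  # X^(7)
    return {"age": age, "duration": duration, "face_amount": face,
            "elapsed": elapsed, "remaining": remaining, "frequency": frequency}

rng = random.Random(42)
portfolio = [simulate_contract(rng) for _ in range(1_000)]
```

New business at later calendar years would reuse the same draw with `elapsed` fixed to zero.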
As we do not permit negative ages at the underwriting of the contract, the elapsed duration $X^{(5)}$ is engineered to not exceed the current age. The remaining duration $X^{(6)}$ of a contract is then obtained deterministically by $X^{(6)} := X^{(4)} - X^{(5)}$. Further, the premium frequency $X^{(7)}$ is a categorical variable taking the values upfront, annual and monthly, with probabilities that mirror [40]. Last, we determine the fair premium by the equivalence principle, see [13, 21], and annualize it linearly based on the premium frequency and the number of remaining premium payments up to maturity, respectively at most up to the last premium payment at retirement, i.e. at age 67. The resulting annual premium $X^{(8)}$ is then used to calibrate the parameters of the face amount $X^{(3)}$. The reported parametrization of $X^{(3)}$ leads to a mean annual premium which inherits the right skew from $X^{(3)}$.

A.3 Surrender profiles
All surrender profiles used for the experiments in this work are based on findings in the literature, see [8, 15, 41], which all employ logistic regressions and report quantitative results on $\beta_x^{(i)} x^{(i)}$ for surrender. Empirical results in [41] are obtained on endowment insurance contracts in Spain, while [8] and [16] consider life insurance contracts of multiple types in the Italian and German market. Given surrender behaviour on various markets, we do not aim to combine them into a holistic surrender model. Instead, we create multiple surrender profiles that capture a variety of the observed risk characteristics. To plausibly combine multiple risk drivers $\beta_x^{(i)} x^{(i)}$, we adjust the baseline risk $\beta_0$ such that the observed mean surrender rates of each risk profile fall into the practical range of 0.01 to 0.05.

Profiles 1 and 2.
In [41] the authors use odds ratios to identify the quality and severity of risk drivers for endowment insurance products. To compute the odds ratio of a single feature $i$, we consider two contracts $x$ and $\tilde x$ which only differ in feature $i$, i.e. $x^{(j)} = \tilde x^{(j)}$ for $j \ne i$. The odds ratio is then computed by

$$\frac{p(1|x)\,/\,(1 - p(1|x))}{p(1|\tilde x)\,/\,(1 - p(1|\tilde x))} = \exp\left\{ \beta_x^{(i)} x^{(i)} - \beta_{\tilde x}^{(i)} \tilde x^{(i)} \right\}. \qquad (19)$$

If we set $\beta_{\tilde x}^{(i)} \tilde x^{(i)} := 0$ as the baseline risk of feature $i$, we can extract $\beta_x^{(i)} x^{(i)}$ for all features $i$. Consequently, we construct surrender profile 1 based on empirical odds ratios and surrender profile 2 based on modeled odds ratios, both reported in [41]. The risk drivers are restricted to the current age of the policyholder, the elapsed duration of the contract, the frequency of premium payments, i.e. monthly, annually or as a lump sum, and the annualized premium amount. Due to data confidentiality in [41], the actual premium amounts are omitted. Hence, we transfer the odds ratio for the savings premium in [41] to our meta model by assuming the indicated jumps to occur at plausible premium levels.
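A sketch of this construction: coefficients are read off reported odds ratios via equation (19), and the baseline risk $\beta_0$ is then bisected so that the mean surrender rate hits a practical target. The odds ratios, scores and the 3% target below are hypothetical.

```python
import math

def coef_from_odds_ratio(odds_ratio: float) -> float:
    """Invert equation (19): with the baseline risk of feature i set to 0,
    a reported odds ratio directly yields the contribution beta^(i) x^(i)."""
    return math.log(odds_ratio)

def mean_rate(beta0: float, scores) -> float:
    """Mean surrender rate of a logistic profile with baseline beta0 and
    per-contract risk scores s(x) = sum_i beta^(i) x^(i)."""
    return sum(1.0 / (1.0 + math.exp(-(beta0 + s))) for s in scores) / len(scores)

def calibrate_beta0(scores, target, lo=-20.0, hi=5.0, tol=1e-12):
    """Bisect for the baseline risk beta0 such that the profile's observed
    mean surrender rate matches a practical target (e.g. 0.01 to 0.05)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mean_rate(mid, scores) < target else (lo, mid)
    return 0.5 * (lo + hi)

# Hypothetical odds ratios for a small portfolio and a 3% target mean rate.
scores = [coef_from_odds_ratio(r) for r in (1.0, 1.5, 0.7, 2.0, 1.2, 0.5)]
beta0 = calibrate_beta0(scores, target=0.03)
```

Bisection suffices here because the mean rate is monotone increasing in the baseline risk.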
Profiles 3 and 4.
We keep the effect of the elapsed duration and the annualized premium amount fixed to the values in surrender profile 2. For the premium frequency, as well as the remaining duration, we employ results in [15], including regression coefficients estimated on a per-contract basis. These results indicate a reduced risk of surrender for single premium payments and an increased risk for a high remaining duration. Further, in [8] we find a mitigating effect of mid-range ages, as well as an increased surrender activity for contracts with a low duration, both reported in the form of $\beta_x^{(i)} x^{(i)}$. This completes surrender profile 3. Lastly, for surrender profile 4 we additionally include the calendar year as a risk driver and impose the coefficients reported in [8] for the respective calendar years. The calendar year can be viewed as a proxy for the economic environment or alternatively as noise to the surrender activity, as the input variables $x$ in our data do not include economic features.

A.4 Figures

                predicted label ŷ = 1    predicted label ŷ = 0
actual y = 1    True Positive (TP)       False Negative (FN)
actual y = 0    False Positive (FP)      True Negative (TN)

Figure 6: Confusion matrix.

A.5 Tables
Measure               Formula                                                      Interpretation
Accuracy              (TP + TN) / (TP + FP + FN + TN)                              Overall share of correctly classified labels
Precision             TP / (TP + FP)                                               Share of correct label predictions ŷ = 1
Recall                TP / (TP + FN)                                               Share of correctly classified data with y = 1
Specificity           TN / (FP + TN)                                               Share of correctly classified data with y = 0
False positive rate   FP / (FP + TN)                                               Share of misclassified data with y = 0
F_β-score             (1 + β²) · precision · recall / (β² · precision + recall)    Weighted balance (β > 0) between recall and precision

Table 3: Measures for binary classification, adapted from [47].

profile   |D_train|   imbalance   RUS     SMOTE    |D_test|   imbalance
0         189285      0.0290      10966   367604   75414      0.0454
1         212978      0.0165      7042    418914   81113      0.0131
2         217515      0.0201      8738    426292   89671      0.0162
3         216734      0.0210      9082    424386   89105      0.0159

Table 4: Overview of the number of 1-year observations and the imbalance in the training and test sets of all surrender profiles. Additionally, we report the number of observations in a perfectly balanced training set resulting from random undersampling (RUS) and SMOTE-resampling.

Profile   Logist. Regr.   Random Forest   XGBoost   NN - bagging   NN - boosting
0         215             10              3         1245           3152
1         177             12              4         1392           3512
2         280             13              4         2710           3411
3         465             12              5         1589           3160
(a) No resampling.

Profile   Logist. Regr.   Random Forest   XGBoost   NN - bagging   NN - boosting
0         13              23              31        3333           15353
1         10              27              32        8992           18975
2         26              31              35        7594           21738
3         29              30              35        10114          24501
(b) Random undersampling.

Profile   Logist. Regr.   Random Forest   XGBoost   NN - bagging   NN - boosting
0         621             23              8         2937           13915
1         1452            29              11        7756           18133
2         4045            31              13        8598           22152
3         5952            27              13        9937           27057
(c) SMOTE-resampling.

Table 5: Run times of all classifiers per surrender profile for (a) no resampling, (b) random undersampling and (c) SMOTE-resampling.
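The measures of Table 3 can be computed directly from the confusion-matrix counts of Figure 6; a small sketch with hypothetical counts:

```python
def classification_measures(tp, fp, fn, tn, beta=1.0):
    """Binary classification measures of Table 3 from confusion-matrix counts.
    F_beta weights recall over precision for beta > 1 (beta > 0 required)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (fp + tn),
        "false_positive_rate": fp / (fp + tn),
        "f_beta": (1 + beta ** 2) * precision * recall
                  / (beta ** 2 * precision + recall),
    }

# Hypothetical rare-event setting: 50 true events among 1'000 observations.
m = classification_measures(tp=30, fp=10, fn=20, tn=940)
```

Note that the accuracy here is 0.97 although the recall is only 0.6, which illustrates why purely confusion-matrix-based assessments can mislead in imbalanced settings.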