Accurate and Interpretable Machine Learning for Transparent Pricing of Health Insurance Plans
Rohun Kshirsagar, Li-Yen Hsu, Vatshank Chaturvedi, Charles H. Greenberg, Matthew McClelland, Anushadevi Mohan, Wideet Shende, Nicolas P. Tilmans, Renzo Frigato, Min Guo, Ankit Chheda, Meredith Trotter, Shonket Ray, Arnold Lee, Miguel Alvarado
Lumiata Inc., 489 S. El Camino Real, San Mateo, CA 94402 USA
Corresponding Author: Rohun Kshirsagar - [email protected]
Abstract
Health insurance companies cover half of the United States population through commercial employer-sponsored health plans and pay 1.2 trillion US dollars every year to cover medical expenses for their members. The actuary and underwriter roles at a health insurance company serve to assess which risks to take on and how to price those risks to ensure profitability of the organization. While Bayesian hierarchical models are the current standard in the industry to estimate risk, interest in machine learning as a way to improve upon these existing methods is increasing. Lumiata, a healthcare analytics company, ran a study with a large health insurance company in the United States. We evaluated the ability of machine learning models to predict the per member per month cost of employer groups in their next renewal period, especially those groups who will cost less than 95% of what an actuarial model predicts (groups with "concession opportunities"). We developed a sequence of two models, an individual patient-level and an employer-group-level model, to predict the annual per member per month allowed amount for employer groups, based on a population of 14 million patients. Our models performed 20% better than the insurance carrier's existing pricing model and identified 84% of the concession opportunities. This study demonstrates the application of a machine learning system to compute an accurate and fair price for health insurance products, and analyzes how explainable machine learning models can exceed actuarial models' predictive accuracy while maintaining interpretability.
Introduction
The recent explosion of available electronic health record (EHR) and insurance claims data sets, coupled with the democratization of statistical learning algorithms, has set the stage for machine learning (ML) applications to fundamentally transform the healthcare industry. Employer-sponsored health insurance (ESI) currently covers 150 million Americans (Kirzinger et al. 2019). With numerous subsidies in place to make ESI more affordable (Buchmueller, Carey, and Levy 2013; White 2017), it is by far the most popular option for obtaining health insurance in the United States (CBO 2019). Since the passage of the Affordable Care Act (ACA) in 2010, premiums for single and family ESI plans have increased 50% and deductibles have doubled, making affordability of care a major issue for many Americans. Thirty-four percent of patients on ESI plans are reportedly unable to pay an unexpected bill of $500, and over 50% skip or postpone medical care and prescription fills due to cost (KFF 2019). Making healthcare more affordable increases the chance that patients will receive needed medical care and refill medications in a timely manner. Increasing affordability improves patient health outcomes and quality of life, reducing the familial strain of medical debt, the postponement of major household spending, and the need to hold multiple jobs (KFF 2019).

However, ESI premiums are the dominant source of revenue for US-based health insurance companies, many of which are among the Fortune 500. As such, accurate rate-setting is a crucial component of revenue and membership growth for insurance companies; setting inadequate rates can mean the difference between profitability and unprofitability (Steenwyk 2007). Traditionally, health insurance companies set rates using a combination of actuarial science and underwriting judgement.
Actuaries apply statistical modeling to claims data to set premiums for each employer group; underwriters use the actuary's predicted rate alongside non-claims data (e.g. health questionnaires) to decide which groups to cover and what their rates will be. Accurate rate setting is essential to balance customer retention against business viability, because the insurer needs to retain a group for several years before the account becomes profitable. Hence, insurers are willing to reduce rates in the near term in exchange for the chance of a longer-term relationship (i.e. greater persistency) (Lyons et al. 1961). Good financial standing allows insurers to focus on growing their business, influencing patient health outcomes, improving customer experience, and increasing efficiency (Schaudel et al. 2018). Reduced renewal premiums can align patients' financial interests and the insurer's strategic interests.

Within the ESI market, the <500 employer group segment (employer groups with fewer than 500 enrollees) is highly transactional, particularly during peak season (near January 1st of each year). The largest insurance carriers process tens of thousands of new and renewal business quotes. For instance, the carrier in this pilot study averaged 130 presale quotes and 70 renewal quotes per underwriter. Some aspects unique to the <500 market are: (1) a complex array of funding arrangements, including fully-insured, level-funding Administrative Services Only (ASO, i.e. self-insured), pay-as-you-go, and monthly shared-risk models (Finn et al. 2017; IBC 2016); (2) the high-risk, high-reward nature of this segment: the <500 market comprises one third of medical insurance customers but yields one half of earnings, providing opportunities to yield higher profit margins; and (3) lower persistency compared to larger clients (the average account length is 5 years for the <500 market). Given this transactional, high-volume environment, underwriters could benefit from an additional highly accurate signal to increase the efficiency of their work. This signal would need to be easily interpretable and broadly applied for measurability. The unique circumstances of the <500 market motivated this pilot study.

Materials and Methods
Participants and Setting
We drew from a 14 million member population representative of Delphi's entire fully-insured employer-group customer base (i.e. their book of business) from 2015 to 2017. We used this population to train and tune our models and then predicted the annual cost of groups in a separate "holdout" set. The holdout set consisted of 648 employer groups (referred to here simply as "groups") with renewal dates varying from 05/01/2016 to 04/01/2017. There were a total of 349,715 members still actively enrolled as of their respective group's renewal date. Using the holdout set, we evaluated our model by its performance in predicting cost incurred during the 12-month period starting from each group's renewal date (the "projection period").

In the holdout set, Delphi censored the data recorded four months prior to the renewal date for each group (the "blackout period"); group underwriting is usually done several months before the new contract year in order to create the renewal quotes presented to employer groups. The 12 months before the blackout period comprise the "experience period", ending on the "slice date". For example, if a group's renewal date was 05/01/2017, the experience period was the entire 2016 calendar year and 05/01/2017 to 04/30/2018 was the cost projection period (Figure 1).

Figure 1: Holdout set "dynamic time-slicing". This picture shows the way the holdout set groups are time-sliced according to their renewal dates, and thus have different lengths of historical data (shown as "Features") and different projection periods (shown as "Targets") with respect to each other. Lumiata was blinded from the information in blackout and projection periods before submitting predictions to Delphi.

For this pilot study, we sliced the training data to select groups and members in a similar, but simpler way. Instead of dynamically slicing the groups, we imposed a fixed renewal date of 01/01/2017 and used only the data recorded until 08/31/2016 for feature extraction. Therefore, the members and groups for which we predicted cost were those eligible as of the 08/31/2016 slice date. We filtered the training data for eligibility before training.

Data Sources
We built our models using medical, capitation and pharmacy claims, and lab and eligibility tables for Delphi's patients and groups. The medical claims tables contained cost information, International Classification of Disease (ICD-9 and ICD-10) diagnostic codes, and Current Procedural Terminology (CPT) procedure codes at the claim level. The capitation tables contained only cost information. All tables reported each claim's care setting: inpatient, outpatient, ancillary, emergency, primary care, or specialty care. The pharmacy claims tables contained National Drug Code (NDC) medication codes and cost for each drug prescribed to a patient, and the written and fill dates for the drug prescriptions. The costs associated with claims were given as "allowed" amounts (the amount paid by the insurer plus the member's cost share). Lab tables contained Logical Observation Identifiers Names and Codes (LOINC) lab test codes. Eligibility tables captured each patient's health plans, enrollment time periods, and plan benefits.
Models
Actuarial Models
To compare model performance between Delphi's actuarial models and Lumiata's ML models, we followed the best practices of actuarial science. Actuarial models estimate group cost on a per member per month (pmpm) basis. The normalization unit is called a "member month", which is defined as one month of enrollment for one member. We used member months to normalize cost because the predicted pmpm cost for a group translates to the monthly premium charged to each member in this group.

Figure 2: "Splitting" and "time-slicing" for the training data. This figure shows how we split and sliced the data of groups and members enrolled on 08/31/2016 to train and validate our models. Groups were split using a 70:20:10 train:test:evaluate ratio.

Actuarial models estimate group-level cost, treating each member month of medical history as independent across members and within a member. The following example shows the utility and limitation of this perspective. Consider three hypothetical groups X, Y, and Z, each having 100 members and costing $1 million during the 2017 calendar year:

- Group X: each member is enrolled for 10 months - the pmpm cost equals $1 million/(10 months × 100 members) = $1,000 pmpm
- Group Y: each member is enrolled for five months - the pmpm cost equals $1 million/(5 months × 100 members) = $2,000 pmpm
- Group Z: each member is enrolled for five months and one member costs $900,000, while the rest of the members cost $100,000 total - the pmpm cost equals $1 million/(5 months × 100 members) = $2,000 pmpm

As a result of the pmpm formulation, two groups with the same cost but different member months will have a different pmpm cost (Group X vs. Y or Z). Whether the group cost is highly concentrated on one individual or evenly distributed amongst the members of the group is not necessarily reflected in the pmpm cost (Group Y vs. Z). In contrast, a member-level cost prediction model views Groups Y and Z differently, and therefore better models cost at the group level.

Actuarial predictive models used for quoting renewal business rely on dozens of rating factors (input variables) to build a predicted rate for a given group. The factors rely on pre-computed demographic, medical trend, pharmacy trend, and other actuarial coefficients derived from patients across a large (usually exogenous) population using regression-based methods. These rating factor "priors" are then used to assemble a predicted trend value for a particular group (called the "manual rate", MR). Historical claims for the group are used to create the "experience rate" (ER), which is then blended with the manual rate to create the final prediction. The proportion of blending between the two quantities is called "credibility" (c, where 0 ≤ c ≤ 1; Atkinson 2019). Thus, the predicted total cost of a group can be expressed as:

predicted cost = c · ER + (1 − c) · MR    (1)

This formula is a linear Bayesian hierarchical model, and is the optimal linear least-squares solution for estimating the annual pmpm cost of an employer group, called the Buhlmann-Straub method (Schmidli 2013). This approach is the industry standard, underlying most models in production at insurance companies; applications include pricing, plan design, and reserve setting (Bluhm et al. 2007; Fuhrer 2015). As a group becomes more credible, the actuarial model can rely more on the medical claims history of the group as an indicator of future expenses. In the absence of credibility, the safer bet is to rely on population-level cost estimations using only age and sex (the manual rate). Actuaries use a group's member months to parameterize credibility (Atkinson 2019; CMS 2018). A larger group enrolled for a shorter period of time (e.g. 1000 members × 6 months = 6000 member months) can have the same member months as a smaller group enrolled for longer (e.g. 200 members × 30 months = 6000 member months), making them equally credible.

The two equations below are examples of the type of experience rating and manual rating models used by Delphi:

ER = [(TC − TSC)(1 + AT)^m · x_m · x_b · x_d + n_s · x_p + BC_p · (1 + AT_L)^m · x_ph · x_gp · x_dp · x_ip] · mm

MR = [BC_med · (1 + AT_med)^m · x_gm · x_dm · x_im · x_udm + BC_cap · (1 + AT_med)^m + BC_ph · (1 + AT_ph)^m · x_gph · x_dph · x_iph · x_udph] · mm

The definitions of the independent variables in these two equations are shown in Table 1. The experience rate is linear in terms of the total claims (TC), and is combined with the manual rate and the group's member months (Eqn. 1).

To improve accuracy, we modeled cost at both the individual and group levels. We used a sequence of two models: (1) an individual-level model predicting per month cost for a given member, and (2) a group-level model predicting per member per month (pmpm) cost for a given group. Our approach contrasts with traditional actuarial methods, which are heavily focused on group-level cost and lack individual-level information within each group.
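The member-month normalization and the credibility blend of Eqn. 1 can be sketched in a few lines of Python; the function names and toy numbers below are illustrative, not from the study.

```python
def pmpm(total_cost, member_months):
    """Per-member-per-month cost: total allowed amount / member months."""
    return total_cost / member_months

def blended_prediction(experience_rate, manual_rate, credibility):
    """Buhlmann-Straub style blend: c * ER + (1 - c) * MR, with 0 <= c <= 1."""
    assert 0.0 <= credibility <= 1.0
    return credibility * experience_rate + (1 - credibility) * manual_rate

# Groups X and Y from the example above: same $1M total cost, different member months.
group_x = pmpm(1_000_000, 10 * 100)  # $1,000 pmpm
group_y = pmpm(1_000_000, 5 * 100)   # $2,000 pmpm

# A fully credible group (c = 1) is priced on its own experience;
# a non-credible group (c = 0) falls back to the manual rate.
rate = blended_prediction(experience_rate=2000, manual_rate=1500, credibility=0.6)
```

Note how two groups with identical total cost receive different pmpm values purely through member months, which is the property the Group X/Y/Z example illustrates.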
Feature Engineering
We reshaped the claims and eligibility tables into longitudinal patient records per our proprietary data format, the "Lumiata Data Model" (LDM - see Appendix). From the LDM, we created member-level features (input variables) using information before the blackout period, based on techniques from the literature (e.g. Razavian et al. 2015; Tamang et al. 2017).

Table 1: Definitions of all the independent variables in the manual rate and experience rate equations. The most recent census was used for lookups. All x variables are factors.

Variable | Definition | Comment
AT | annual trend | medical and pharmacy trends combined
AT_L | leveraged annual trend | adjusted for the effects of pooling-point leveraging
AT_med | medical annual trend |
AT_ph | pharmacy annual trend |
BC_cap | base capitation claims | pmpm cost of capitation claims during the experience period
BC_med | base medical claims | pmpm cost of medical claims during the experience period
BC_p | base pooled claims | pmpm cost of pooled claims during the experience period
BC_ph | base pharmacy claims | pmpm cost of pharmacy claims during the experience period
TC | total claims | total cost during the experience period
TSC | total shock claims | total cost of claims over the pooling level during the experience period
m | midpoint months | number of months between the midpoints of the experience and projection periods
mm | member months | member months during the experience period
n_s | number of shock claims | number of claims over the pooling level
x_b | benefit |
x_d | demographic | based on the experience period
x_dm | demographic - medical | based on census
x_dp | demographic - pooling | based on census
x_dph | demographic - pharmacy | based on census
x_gm | geographic area - medical | based on census
x_gp | geographic area - pooling | based on census
x_gph | geographic area - pharmacy | based on census
x_im | industry - medical |
x_ip | industry - pooling |
x_iph | industry - pharmacy |
x_m | maturation |
x_p | pooling level | the cost threshold defining a shock claim (typically $100,000)
x_ph | pharmacy load |
x_udm | utilization dampening - medical | of the form c1 · e^(−c2 · S), where S is medical cost share
x_udph | utilization dampening - pharmacy | determined from a table using medical cost share

Our demographic features were age and sex; all other features were time-dependent. Our time windows to compute a variety of features (e.g. diagnosis, medication, procedure, lab, and revenue codes, and cost and coverage) were: "last three months", "last six months", "last one year", and "anytime" prior to the blackout date.

In addition to ICD-9, ICD-10, CPT, NDC, and LOINC codes, we transformed the codes into their grouped counterparts based on organ type (SNOMED), condition categories (HCUP, CMS-HCC and HHS-HCC), drug molecule (RxNorm and ATC), and our proprietary clinical grouper (Lumiata disease code).
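One way to realize the time-windowed, per-code features (the paper uses log counts of each code within each window) is sketched below; the window lengths match the text, while the data layout and variable names are illustrative.

```python
import math
from datetime import date, timedelta

# Illustrative claim records: (code, service_date).
claims = [
    ("E11.9", date(2016, 7, 2)),   # ICD-10: type 2 diabetes
    ("E11.9", date(2016, 3, 15)),
    ("I10",   date(2015, 11, 1)),  # ICD-10: essential hypertension
]

slice_date = date(2016, 8, 31)
windows = {
    "last_3m": timedelta(days=91),
    "last_6m": timedelta(days=182),
    "last_1y": timedelta(days=365),
    "anytime": None,  # everything before the blackout date
}

features = {}
for name, span in windows.items():
    start = slice_date - span if span else date.min
    counts = {}
    for code, d in claims:
        if start <= d <= slice_date:
            counts[code] = counts.get(code, 0) + 1
    # Log count of every unique code observed in this window
    # (log1p keeps the transform well-defined for small counts).
    for code, n in counts.items():
        features[f"{name}:{code}"] = math.log1p(n)
```

Only codes observed in a window produce a feature, which is why the resulting member-by-feature matrix is extremely sparse.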
We derived features from these codes by calculating the log count of every unique code in each time window, and the summary statistics of each code system (total count, unique code count, minimum count, maximum count, mean count, etc.) in the "anytime" window. The presence of revenue codes (binary) was computed in the "anytime" window.

We computed features using the observed lab interpretations or values from the LOINC codes. These include (1) log counts of interpretations, i.e. "high", "low", "abnormal", and "normal", (2) whether the value was increasing, decreasing, or flat across time points for the same test (one-hot encoded), and (3) whether the interpretation was fluctuating across time (binary). For (2), the calculation was based on the t-test p-value of a simple linear regression's slope.

Cost features are the most powerful features for predicting future cost. In addition to the cumulative allowed cost in different time windows, we computed cost attributed to different care settings. The length of coverage was computed for all time windows. In total, we constructed more than 5 million possible features per patient. The resulting feature matrix was very sparse; most of the columns had no values at all. We reduced the dimension of the feature space before training a model using the feature selection techniques discussed below.

Individual-Level Model
We regressed our first model on the allowed amount per month during the projection period for each member. We trained a gradient boosting tree that optimized the mean squared error (MSE) using the LightGBM package (Ke et al. 2017) in the Python programming language (Guttag 2016). To speed up training time and reduce over-fitting, we tested a variety of feature prevalence thresholds to reduce the feature set.

Aggregation
Our individual model predicts the per month cost in the projection period for the members who were enrolled in groups at the end of the experience period. For our train, test, and evaluate sets, this date was 08/31/2016. For the holdout set, the slice date depended on the group's renewal date. We then aggregated these predictions based on the members active on this date to obtain the mean of member-level cost predictions for each group. This quantity has a unit of pmpm and became the input of our group-level model described below.
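The aggregation step can be sketched as below. The member-level predictions would come from the gradient-boosted tree; here they are supplied directly, and all names are illustrative.

```python
def aggregate_group_prediction(member_predictions, active_on_slice_date):
    """Mean of member-level per-month cost predictions over the members
    still active on the slice date. The result has units of pmpm and
    feeds the group-level adjustment model."""
    active = [member_predictions[m] for m in active_on_slice_date]
    return sum(active) / len(active)

# Member-level predicted monthly cost (e.g. from LightGBM), keyed by member id.
preds = {"a": 250.0, "b": 4000.0, "c": 310.0, "d": 90.0}
# Member "d" left the group before the slice date, so is excluded.
group_pmpm_input = aggregate_group_prediction(preds, active_on_slice_date=["a", "b", "c"])
```

A single high-cost member (here "b") pulls the group mean up sharply, which is exactly the Group Y vs. Z distinction the member-level model is meant to capture.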
Group-Level Adjustment
The aggregated mean prediction of a group can be thought of as predicted pmpm cost. However, this quantity only considers the members enrolled at the end of the experience period and assumes enrollment remains constant throughout the projection period. In reality, a group can grow or shrink during the blackout and projection periods, affecting the true pmpm. We regressed a second model on the true group-level pmpm cost to adjust the aggregated predictions. We experimented with different group-level features in this model and used the following (per group): (i) mean cost of member-level predictions, (ii) mean member age, (iii) total number of member months for the group during the experience period, (iv) a "growth" feature, defined as the change in the number of members during the experience period divided by the total number of member months, (v) average length of member coverage, (vi) fraction of experience period costs that were incurred during the final four months before the blackout period, and (vii) fraction of high-cost members, defined as those whose cost falls within the top 10% of all members in the training set. We then trained a LightGBM model that optimized pmpm Mean Absolute Error (MAE) and used the test set to perform hyper-parameter tuning and early stopping. The mean of member-level cost predictions for each group highly correlates with the overall target pmpm cost for the group. With the additional features, the group-level model improved pmpm MAE by ∼10% compared to the individual-level model alone.

Similar to the individual model, we trained the group model on groups still active during the projection period, because only the groups that remained active are included when evaluating the results. This operation was only done in model training and was not performed when doing inference using the trained model.
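A few of the group-level inputs listed above can be sketched as follows; the per-member row layout and the function name are ours, and only a subset of features (i), (ii), (iii), and (vii) is shown.

```python
def group_features(member_rows, top_decile_cost):
    """Build a subset of the group-level features described in the text.
    Each row: (predicted_pmpm, age, member_months, experience_cost).
    `top_decile_cost` is the training-set 90th-percentile member cost."""
    n = len(member_rows)
    return {
        # (i) mean cost of member-level predictions
        "mean_member_prediction": sum(r[0] for r in member_rows) / n,
        # (ii) mean member age
        "mean_age": sum(r[1] for r in member_rows) / n,
        # (iii) total member months during the experience period
        "member_months": sum(r[2] for r in member_rows),
        # (vii) fraction of members whose cost is in the training top 10%
        "high_cost_fraction": sum(r[3] >= top_decile_cost for r in member_rows) / n,
    }

rows = [(300.0, 42, 12, 2_000.0), (900.0, 55, 12, 60_000.0)]
feats = group_features(rows, top_decile_cost=50_000.0)
```

These per-group vectors, rather than raw member records, are what the second LightGBM model consumes.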
Model Evaluation
ML Metrics Evaluation
We adapted the standard metrics R-squared (R²), MAE, and Gini index (Frees, Meyers, and Cummings 2011) to predicted pmpm cost; evaluation metrics are measured by comparing "true pmpm cost" and "predicted pmpm cost". Lumiata received claims data at the allowed amount level to preserve Delphi's pharmaceutical and provider reimbursement rates, whereas Delphi's production models predict the "paid amount" for employer groups. As a result, Delphi and Lumiata modeled two different types of cost, which are not directly comparable because the allowed amount is usually greater than the paid amount (though they are correlated; see Hileman and Steele 2016). Both teams agreed on two solutions to make the predictions comparable. First, we computed a "normalized" pmpm MAE, which equals pmpm MAE divided by the "global pmpm cost". The global pmpm cost is the total (allowed or paid) amount divided by the total number of member months across all groups. Second, we computed the trend versions of our predictions. The allowed trend was defined as the pmpm allowed amount in the projection period divided by the pmpm allowed amount in the experience period; the "paid trend" was defined similarly. Trend calculations are ubiquitous in actuarial science (SOA 2017).

Lift Plot and Concession Opportunities

While ML performance metrics like MAE and R² are ubiquitous in the tech industry, they often lack direct connection to concrete business-level Key Performance Indicators (KPIs).

Table 3: Performance metrics (MAE, R², Gini index) for Delphi's and Lumiata's models; Delphi's model: MAE 0.239, R² 0.265.
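The normalization and trend definitions above can be sketched directly; the function names and the toy figures are illustrative.

```python
def normalized_pmpm_mae(true_pmpm, pred_pmpm, total_amount, total_member_months):
    """pmpm MAE divided by the global pmpm cost (total amount over all
    member months), which makes allowed- and paid-amount models comparable."""
    mae = sum(abs(t - p) for t, p in zip(true_pmpm, pred_pmpm)) / len(true_pmpm)
    global_pmpm = total_amount / total_member_months
    return mae / global_pmpm

def trend(projection_pmpm, experience_pmpm):
    """Trend: projection-period pmpm divided by experience-period pmpm."""
    return projection_pmpm / experience_pmpm

nmae = normalized_pmpm_mae(
    [500.0, 400.0], [450.0, 480.0],
    total_amount=1_080_000, total_member_months=2400,
)
allowed_trend = trend(projection_pmpm=525.0, experience_pmpm=500.0)  # 1.05
```

Dividing by the global pmpm turns MAE into a unitless ratio, which is why a single normalized MAE (e.g. the 0.239 reported for Delphi) can be compared across the two cost conventions.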
Results
Model Performance
Lumiata's model was 20% better in normalized pmpm MAE and 26% better in pmpm R² than Delphi's model, and 2% lower in Gini index than Delphi's model (Table 3). Lumiata correctly identified 84% of the groups in the holdout set that had concession opportunities at the 5% level.

Lumiata's predicted pmpm allowed trend had a 65% precision and 84% recall in identifying concession opportunities at the 5% level, and a 56% precision and 85% recall in identifying concession opportunities at the 10% level. For comparison, an "oracle" model had an 84% precision and 96% recall at the 5% level, and a 73% precision and 97% recall at the 10% level. This similarity in precision and recall indicates Lumiata's model was near optimal for predicting concession opportunities at the 5% level.

Figure 3: Flow-chart of the "stop-light" process implementing Lumiata's model output. "UW" stands for underwriting.
Practical Application
Operationalizing Lumiata's model relied on the "stop-light principle" to make the model output interpretable to non-data scientists. The decision process derived from Lumiata's model is: (1) if Lumiata's predicted trend is less than Delphi's predicted trend, then the model suggests an underwriter give a concession of at least 5% on that group's renewal quote (Green); (2) if Lumiata's predicted trend is equal to or greater than Delphi's predicted trend, no action is taken (Yellow or Red; see Figure 3).

The lift plot in Figure 4 shows the result of using Lumiata's predicted pmpm allowed trend in this decision process. The average A/E per decile of Delphi's model varies with the decile of the Lumiata trend ratio (Lumiata's predicted allowed trend divided by Delphi's predicted paid trend) and the true allowed trend ratio (true allowed trend divided by Delphi's predicted paid trend), respectively. The A/E of the bottom five Lumiata trend ratio deciles is below 0.95, meaning Lumiata's trend ratio can select groups for a rate drop of ≥5% with good accuracy. The median trend ratio was 0.89, which yielded higher precision than a decision rule of 1.0.

Had Delphi implemented our model for this pricing period, their underwriters could have dropped the renewal quote 5% or more for approximately half of the groups while retaining profitability. An "oracle" model (blue line in Figure 4) identified the same number of decile concession opportunities at the 5% level as our model (five deciles).
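The stop-light decision process above reduces to a single comparison; a minimal sketch, with the return strings chosen by us for illustration:

```python
def stoplight_recommendation(lumiata_trend, delphi_trend):
    """'Stop-light' rule from the text: recommend a concession of at least
    5% when Lumiata's predicted trend is below Delphi's; otherwise no
    underwriting action (Yellow/Red)."""
    if lumiata_trend < delphi_trend:
        return "green: concede >= 5% on the renewal quote"
    return "yellow/red: no underwriting action"

rec_concede = stoplight_recommendation(lumiata_trend=0.89, delphi_trend=1.02)
rec_hold = stoplight_recommendation(lumiata_trend=1.10, delphi_trend=1.02)
```

Because the rule only ever lowers quotes (never raises them), it can be layered on top of the existing actuarial pricing without changing the upside of Delphi's model.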
Deployment
The goals of Lumiata's pilot study with Delphi were to prove that Lumiata's (1) ML could improve over an established industry methodology, and (2) tech stack could deliver monthly predictions for Delphi's renewal business groups. We designed a Kaggle-style competition between the two companies, with two holdout sets - a preliminary holdout set (658 groups) and a final holdout set (which we call the "holdout set" throughout this paper - 648 groups). We had one chance to compare our predictions to Delphi's on the final holdout set; the success or failure of the pilot study was predicated on whose model had the best group-level cost predictions.

Figure 4: Lift plot for the holdout set using Lumiata (purple) and true allowed (blue) trend ratios. The trend ratio of a model is defined as its predicted trend divided by Delphi's predicted trend. The solid black line indicates A/E = 1. The dotted black lines indicate where the A/E = 1.05 or 0.95. Concession opportunity groups are in deciles below the 0.95 dotted line.
Quality Assurance
Due to several delicate calculations needed to assemble the predicted/true allowed trend predictions, we computed non-prediction fields to rule out non-data-science confounding factors before running the results analysis. These fields included (per group): (i) number of members enrolled at the end of the experience period, (ii) number of member months in the experience period, (iii) true allowed amount in the experience period, and (iv) predicted allowed amount in the projection period. After receiving the censored information, we computed the: (i) number of members at the beginning of the projection period, (ii) number of member months in the projection period, and (iii) true allowed amount in the projection period. Due to a strong data model built off of FHIR and solid compute infrastructure, we were able to iterate and fix bugs quickly, until our calculations of the experience period non-prediction columns matched Delphi's within 5%. Completion of this analysis allowed quick and self-evident comparison of Lumiata's and Delphi's model performance metrics. Additionally, the non-prediction fields found their way into the roll-out plan below.

Data Challenges
For model comparison, we had to address discrepancies between Lumiata's and Delphi's claims data sets. Two major differences in the claims data were: (1) Delphi's models used paid amount while Lumiata used allowed amount, and (2) Delphi used the "paid date", but requested Lumiata use the "encounter date" for allowed amounts (Delphi felt the "encounter date" was more appropriate for patient-level cost predictions). To resolve these differences, we computed the "allowed trend" for comparison against the "paid trend" (Figure 4). Calculating the "allowed trend" required that we compute three other quantities per group ("number of members at the end of the experience period", "member months in the experience period", and "allowed amount in the experience period"), plus the predicted allowed amount for the group. Accurately computing these quantities was more difficult than expected due to the consequences of dual paid/allowed conventions and time-dependent patient enrollment:

(i) Calculating the "allowed amount in the experience period" differs depending on whether the "paid date" or "encounter date" is used. Using the paid vs. encounter date segmented claims differently into a group's experience, blackout, and projection periods. For example, some claims were denied, or paid claims could be reversed (see Appendix Figure A1). Hence, claims data filtered on encounter date in the holdout set experience period were sometimes absent from the unblinded holdout set's experience period.

(ii) The number of members enrolled in a group at the end of the experience period could change during the blackout and projection periods. For example, members can shift between different groups due to job or spousal health plan changes. Also, enrollment was updated on the 15th of the month, but we calculated "number of members at end of experience period" and "member months in the experience period" for the first of each month, making our enrollment data two weeks out of date.
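Of the quantities above, member months is the one most sensitive to the month-granularity conventions described in (ii). A minimal sketch of that calculation, assuming first-of-month counting and an illustrative span layout:

```python
from datetime import date

def member_months(enroll_spans, period_start, period_end):
    """Count member months within a period: one month of enrollment for one
    member, at whole-month granularity (first-of-month convention)."""
    def month_index(d):
        return d.year * 12 + (d.month - 1)
    total = 0
    for start, end in enroll_spans:
        lo = max(month_index(start), month_index(period_start))
        hi = min(month_index(end), month_index(period_end))
        total += max(0, hi - lo + 1)
    return total

# Two members; experience period = calendar year 2016.
spans = [
    (date(2015, 6, 1), date(2016, 3, 31)),   # contributes Jan-Mar 2016: 3 months
    (date(2016, 1, 1), date(2016, 12, 31)),  # contributes all of 2016: 12 months
]
mm = member_months(spans, date(2016, 1, 1), date(2016, 12, 31))
```

A mid-month enrollment change (e.g. an update on the 15th) moves a member by a full month under this convention, which is exactly how the two-week staleness described in the text arises.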
Roll-out Strategy
Delphi requested Lumiata provide monthly group-level cost predictions and concession opportunities for groups up for renewal within the next four months (Delphi renews groups throughout the year). Delphi asked for a seven-day cost-prediction turnaround time and a fixed model feature set each quarter. In response, we built a platform that can create LDMs for 1 TB of claims data in under an hour and produce highly optimized group-level cost predictions in under four hours.

To streamline deployment, we proposed the following plan to Delphi: (1a) Each month, Delphi sends a moving three-year snapshot of their claims and eligibility tables. (1b) Delphi sends their non-prediction field calculations and their paid trend group cost predictions for groups up for renewal. (2) Lumiata updates the existing patients' LDMs and adds new patients from the eligibility file. (3) Lumiata updates the group cost prediction model, training using the prior 12 months as the projection period, the four months prior as the blackout period, and the prior 14 months as the experience period. (4) Lumiata creates feature vectors including the new projection/blackout period claims information. (5) Lumiata applies the updated member- and group-level models to the claims data to produce group-level allowed trend predictions. (6) Using Delphi's paid trend predictions from (1b) and the "stop-light" principle (Figure 3), Lumiata recommends whether to drop the rate by 5% for each group. (7) Lumiata sends its non-prediction attributes and the allowed trend predictions to Delphi with one recommendation per group up for renewal. All non-prediction fields must agree within the 5% tolerance established during quality assurance.

Transparent Rate Setting
Actuarial models have an es-sential property: they are “explainable” because a predictioncan be decomposed into discrete multiplicative factors withn inherent interpretation. For example, “geographic areafactor” = 0.9 means people in a particular zip code cost 10%less than the mean, so the base rate is adjusted (multiplied)by 0.9 for members from this zip code. This degree of ex-plainability is crucial because actuaries need to file rates an-nually for individual/small group markets with the state in-surance commissioner to ensure the factors used to producethe rate are compliant with legal guidelines. Furthermore,an underwriter needs to be able to explain how she arrivedat a particular rate to a customer. A critique of ML modelsis that they lack explainability in terms of what input vari-ables may have contributed to a particular prediction (Rudin2019). However, explainable ML in healthcare is must-have,touching upon fundamental issues of bias, transparency, andreasonableness of ML model predictions.Shapely values, a game theoretic algorithm, was devel-oped to enable common machine-learning algorithms to out-put a set of feature weights specific to a prediction (Lund-berg and Lee 2017). We applied the Python package SHAP to our LightGBM models to yield member-level explana-tions. The weights are interpreted as the dollar value pmpmamounts and the sum of the values equals the predictionmade by the original model. Similar to an actuarial for-mula, the rate predicted by the group-level model can be ex-pressed in terms of aggregated member-level SHAP valuesfor member-level features. We often found that ≤
500 fea-tures’ SHAP values account for 95% of a cost prediction,for each group. However, the specific features involved var-ied by group.The transparency afforded by the group-level SHAP val-ues provides the opportunity to explain a rate adjustmentto a customer in dollar pmpm amounts using the specificrisk drivers for that group and modify a rate given by theML model by the expected change in cost for specific drugsand services. For instance, if the price of Glipizide, a drugto treat type-2 diabetes, will drop for an insurer next yearby 20%, the insurer can multiply the SHAP values corre-sponding to Glipizide-related pharmacy costs by 0.80 forall the members on Glipizide, thus lowering the projectedrate. These mechanics are similar to current actuarial meth-ods, making them easy to implement. Furthermore, greatermodel transparency could increase patient adherence to pre-scribed medications. As drug prices rise and more patientspurchase high-deductible plans, patients have to pay higherout of pocket costs for drug treatment, and patient medica-tion adherence declines (Callaghan et al. 2019). Insurers canuse the SHAP values from the patient-level model to iden-tify drugs driving up projected cost for that group, and sug-gest the prescribing doctor offer a lower cost alternative drugwith similar efficacy. This provides a win-win opportunity,lowering drug cost for the payer, and improving patient ad-herence to the cheaper drug through increased affordability.
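The Glipizide adjustment above can be sketched numerically. This is a minimal sketch assuming member-level SHAP values (in dollar pmpm units) have already been computed; the feature names, base value, and SHAP matrix are hypothetical, not taken from the study's models.

```python
import numpy as np

# Hypothetical member-level SHAP matrix: rows = members, columns = features,
# values in dollar pmpm units. base_value is the model's mean prediction.
feature_names = ["glipizide_pharmacy_cost", "inpatient_visits", "age_factor"]
shap_values = np.array([
    [40.0, 10.0, 5.0],    # member on Glipizide
    [0.0, 120.0, -8.0],   # member not on Glipizide
])
base_value = 300.0  # assumed mean pmpm cost across the book of business

def group_pmpm(shap_values, base_value):
    """Group rate = base value plus the average of member-level SHAP sums."""
    return base_value + shap_values.sum(axis=1).mean()

def adjust_drug_cost(shap_values, feature_names, feature, factor):
    """Scale one drug-cost feature's SHAP contribution for all members."""
    adjusted = shap_values.copy()
    adjusted[:, feature_names.index(feature)] *= factor
    return adjusted

original = group_pmpm(shap_values, base_value)
# Glipizide's price drops 20% next year: multiply its contribution by 0.80.
adjusted = group_pmpm(
    adjust_drug_cost(shap_values, feature_names,
                     "glipizide_pharmacy_cost", 0.80),
    base_value,
)
print(original, adjusted)  # 383.5 379.5
```

Because the adjustment is a simple multiplicative scaling of an additive decomposition, it mirrors how an actuary would apply a trend factor to one component of a rate.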
Discussion
Here, we demonstrate that: (1) ML approaches can significantly improve the accuracy and efficiency of group health insurance underwriting and (2) ML models can offer comparable interpretability to traditional actuarial methods.

Figure 5: Normalized pmpm MAE of Lumiata's and Delphi's models by employer group size segment (segments from ≤ 100 members upward; 148, 228, 113, 124, 28, and 7 groups per segment). Lumiata's model is better across all segments.

Our contributions provide clear direction for how to improve the efficiency and predictive performance of underwriting for employer-based insurance and how to lower the cost for members in groups of any size. Our ML-based approach improved MAE over actuarial models across the book of business, for groups both > 500 and ≤
500 members, and/or where the group claims experience is relatively short. Our improvement in the Gini index was more modest than our gains in R², MAE, and the lift plot (Table 3 and Figure 4). This discrepancy can occur because the Gini index is a ranking-based metric, whereas a regression model minimizes the prediction errors. One difficulty is that the Gini index is not a differentiable quantity; future work should develop algorithms to address this problem.

Data quality was crucial to our success. Alignment on non-prediction fields between Lumiata and Delphi ruled out errors in the data, pipeline, or output, improving communication across teams and increasing efficiency. These calculations must be automated when developing and productionalizing a medical underwriting ML application, due to the large size of the data sets and the rapid turnaround of results.

Due to its highly applied nature, some operational realities limit our study's evaluation. A challenge for validating our predictions is the long feedback cycle (20 months). Also, not all of Lumiata's concession recommendations could be granted, due to a variety of quantitative and judgment factors under the underwriter's and insurer's discretion.

Additionally, we could not determine whether our individual-level model was racially biased, because we did not receive patient ethnicity data. Avoiding racial bias is important, as previous studies have found evidence of racial bias in commercial cost prediction models used for clinical management (Obermeyer et al. 2019). Historically, poorer minorities under-utilized healthcare services due to mistrust of the system and confusion about how to navigate it (Obermeyer et al. 2019). However, because our response variable is not a clinical outcome but a financial one, we think this effect on pricing may be less significant.
More work will be needed to better understand the effect of pricing insurance more affordably for minority patients, predicated on their less frequent utilization of the healthcare system.

In practice, ML approaches can help insurers be more competitive, avoiding adverse risk. They can result in the design of more "exotic" funding arrangements due to better predictive power of patient health, following the industry trend towards capitated payments (Lee, Majerol, and Burke 2019). Unlike previous ML models in healthcare (Rudin 2019), our model output is interpretable by a non-technical user, simplifying operationalization (Figure 3). A user does not need to understand the inner workings of our algorithm to apply our output as a multiplicative adjustment factor to their existing actuarial models, and the system can output the most important group-specific risk factors.

Conclusions
Machine learning on insurance claims data provides a powerful tool to improve the efficiency and affordability of plans and care offered to patients enrolled in employer-sponsored health plans. With more accurate rate-setting, health insurance companies can design nuanced plan attributes, reducing the cost of care for their members. Our ML model achieved 20% improved accuracy in absolute predictive performance over traditional actuarial methods and was able to identify over 80% of new concession opportunities available to Delphi. This allows underwriters to better price and retain their <
500-member employer group customers. This study can be used by payers to give underwriters improved pricing guidance, retaining business and giving a better and more affordable experience to members.
Acknowledgements
We thank our counterparts at Delphi for collaboratively working with us to validate our model against a production-grade model with real data. Additionally, we thank the following people for their contributions: Kim Branson, Vatshank Chaturvedi, Hilaf Hasson, Renzo Frigato, Dilawar Syed, Laika Kayani, Wil Yu, Shahab Hassani, Alexandra Pettet, Derek Gordon, Arnold Lee, Ash Damle, Thomas Watson, Leon Barovic, Diana Rypkema, and our investors at Blue Cross Blue Shield Venture Fund and Khosla Ventures.
Notes

https://fortune.com/fortune500/2019/
https://wiki.hl7.org/FHIR
https://github.com/slundberg/shap

References
Atkinson, D. B. 2019. Credibility Methods Applied to Life, Health, and Pensions: Credibility Applications for Life and Health Insurers and Pension Plans. Society of Actuaries.

Bluhm, W.; Skwire, D.; Kaczmarek, S.; and Bohn, K. 2007. Group Insurance, 6th Edition. ACTEX Publications, Inc.

Buchmueller, T.; Carey, C.; and Levy, H. 2013. Will Employers Drop Health Insurance Coverage because of the Affordable Care Act? Health Affairs 32: 1522–30.

Callaghan, B.; Reynolds, E.; Banerjee, M.; Kerber, K.; Skolarus, L.; Magliocco, B.; Esper, G.; and Burke, J. 2019. Out-of-pocket costs are on the rise for commonly prescribed neurologic medications. Neurology 92: e2604–e2613.

CBO. 2019. Congressional Budget Office: Federal Subsidies for Health Insurance Coverage for People Under Age 65: 2019 to 2029. Technical report.

CMS. 2018. Centers for Medicare and Medicaid Office of the Actuary – Claims Credibility Guidelines. Technical report.

Finn, P.; Gupta, A.; Lin, S.; and Onitskansky, E. 2017. Growing employer interest in innovative ways to control healthcare costs. Technical report.

Frees, E.; Meyers, G.; and Cummings, A. D. 2011. Summarizing Insurance Scores Using a Gini Index. Journal of the American Statistical Association.

Society of Actuaries – Health Section Research Committee.

Guttag, J. V. 2016. Introduction to Computation and Programming Using Python: With Application to Understanding Data.

Henke, N.; Levine, J.; and McInerney, P. 2018. Analytics Translator: The new must-have role. Technical report.

Hileman, G.; and Steele, S. 2016. Accuracy of Claims-Based Risk Scoring Models. Society of Actuaries – Health Section Research Committee.

IBC. 2016. Independence Blue Cross Underwriting Department Large Group Underwriting Guidelines. Technical report.

Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; and Liu, T. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Neural Information Processing Systems.

KFF. 2019. Kaiser Family Foundation 2019 Employer Health Benefits Survey. Technical report.

Kirzinger, A.; Munana, C.; Wu, B.; and Brodie, M. 2019. Data Note: Americans' Challenges with Health Care Costs. Technical report.

Lee, J.; Majerol, M.; and Burke, J. D. 2019. Addressing the social determinants of health for Medicare and Medicaid enrollees.

Lundberg, S.; and Lee, S. 2017. A Unified Approach to Interpreting Model Predictions. Neural Information Processing Systems.

Lyons, H.; Shaw, S. E.; Kittredge, J.; Watson, G.; Moran, J.; and Millman, R. 1961. Persistency of Group Health Insurance. Transactions of Society of Actuaries.

Obermeyer, Z.; Powers, B.; Vogeli, C.; and Mullainathan, S. 2019. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366: 447–453.

Powers, D. M. W. 2011. Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation. Journal of Machine Learning Technologies 2: 37–63.

Rajkomar, A.; Oren, E.; Chen, K.; Dai, A.; Hajaj, N.; Hardt, M.; Liu, P.; Liu, X.; Marcus, J.; Sun, M.; Sundberg, P.; Yee, H.; Zhang, K.; Zhang, Y.; Flores, G.; Duggan, G.; Irvine, J.; Le, Q.; Litsch, K.; Mossin, A.; Tansuwan, J.; Wang, D.; Wexler, J.; Wilson, J.; Ludwig, D.; Volchenboum, S.; Chou, K.; Pearson, M.; Madabushi, S.; Shah, N.; Butte, A.; Howell, M.; Cui, C.; Corrado, G.; and Dean, J. 2018. Scalable and accurate deep learning with electronic health records. npj Digital Medicine.

Big Data 3: 277–287.

Rudin, C. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence.

∼schmidli/rt.pdf.

SOA. 2017. Society of Actuaries: Pricing, Reserving and Forecasting Module. Society of Actuaries – FSA Group and Health Track.

Rating and Underwriting for Health Plans. Xlibris Corporation.

Tamang, S.; Milstein, A.; Sorensen, H.; Pedersen, L.; Mackey, L.; Betterton, J.; Janson, L.; and Shah, N. 2017. Predicting patient cost blooms in Denmark: a longitudinal population-based study. BMJ Open.

White, J. 2017. The Tax Exclusion for Employer-Sponsored Insurance Is Not Regressive-But What Is It? J Health Polit Policy Law 42: 697–708.

Zhou, H.; Yang, Y.; and Qian, W. 2018. Tweedie Gradient Boosting for Extremely Unbalanced Zero-inflated Data. arXiv preprint arXiv:1811.10192.

Appendix
Data Quality
The following describes some implications of modeling individual-level cost when the data model, from which the data is derived, considers the employer group as the primary entity.

• The cost of a claim depends entirely on what date you are inspecting the claim. The example in Figure A1 shows how, depending on the time period, a claim which has been reversed has a material impact on the total allowed amount we calculate. A corollary is that calculations done on a claims dataset which has been filtered on a date (so that all the claims occur prior to that date) can change as data after the filter date is added back in.

• We must consider "alteration of training set" implications due to modeling at the group level. Members can move between groups because they change jobs or move to their partner's health plan. This raises the potential of a member being in both the training and testing data. As above, the solution is to remember that enrollment is only relevant with respect to a particular date, since enrollment can change.

• Finally, a rigorous accounting of the (1) count of members, (2) count of groups, and (3) total cost per group, from the insurance carrier's side, to receipt of raw data, to generating cost predictions from the pipeline, and finally to sending predictions back to the insurance carrier, is crucial to ensure the credibility of the performance estimates asserted by the model training process.
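The reconciliation in the last bullet can be sketched as a stage-by-stage summary that must match exactly before any performance estimate is trusted. The record layout and field names below are assumptions for illustration, not the production schema.

```python
# Sketch of pipeline-stage reconciliation: member count, group count, and
# total cost per group must agree at every stage (field names are assumed).

def summarize(claims):
    """claims: list of dicts with 'member_id', 'group_id', and 'allowed'."""
    members = {c["member_id"] for c in claims}
    cost_per_group = {}
    for c in claims:
        cost_per_group[c["group_id"]] = (
            cost_per_group.get(c["group_id"], 0.0) + c["allowed"]
        )
    return {"n_members": len(members),
            "n_groups": len(cost_per_group),
            "cost_per_group": cost_per_group}

raw = [
    {"member_id": "m1", "group_id": "g1", "allowed": 100.0},
    {"member_id": "m2", "group_id": "g1", "allowed": 50.0},
    {"member_id": "m3", "group_id": "g2", "allowed": 75.0},
]
after_pipeline = list(raw)  # in production: the post-ingestion copy

# The summaries must agree between stages; any mismatch flags a data issue.
assert summarize(raw) == summarize(after_pipeline)
print(summarize(raw)["n_members"], summarize(raw)["n_groups"])  # 3 2
```

Automating checks of this shape at each hand-off (carrier extract, raw receipt, pipeline output, returned predictions) is what makes mismatches surface early rather than in the final evaluation.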
Claim ID   Allowed Amount   Start Date    Paid Date
1          $3000            06/25/2016    07/20/2016   (experience period)
1          -$3000           06/25/2016    08/15/2016   (censored)
Figure A1: An example of a reversed claim that is seen differently due to censored data. In this example, the blackout period begins on 08/01/2016, so all the claims with paid dates on or after this date are removed before we received the data. In reality, claim "1" was originally incurred and paid before 08/01/2016 but was reversed on a paid date after 08/01/2016. However, because the data were censored based on paid dates, we saw that claim "1" had an allowed amount of $3000. Once we receive the uncensored full claims data, we would see that claim "1" had actually been reversed and therefore cost $0.
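The reversal example in Figure A1 can be sketched as a paid-date filter over the claim records. This is a minimal sketch with hypothetical record tuples; it shows how the same claim nets to $3000 under censored data and $0 under full data.

```python
from datetime import date

# Hypothetical claim records: (claim_id, allowed_amount, paid_date).
claims = [
    ("1", 3000.0, date(2016, 7, 20)),   # original claim
    ("1", -3000.0, date(2016, 8, 15)),  # reversal, paid after blackout start
]

def total_allowed(claims, as_of):
    """Total allowed amount using only claims paid before `as_of`.

    Mimics a dataset censored on paid date: reversals paid on or after
    `as_of` are invisible, inflating the apparent cost.
    """
    return sum(amount for _, amount, paid in claims if paid < as_of)

# Censored at the blackout start (08/01/2016): the reversal is invisible.
print(total_allowed(claims, date(2016, 8, 1)))   # 3000.0
# With the full, uncensored data the claim nets to zero.
print(total_allowed(claims, date(2017, 1, 1)))   # 0.0
```

The corollary from the first bullet above follows directly: any aggregate computed from a paid-date-filtered dataset can change once later-paid records are added back in.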
Lumiata Data Model