Ustun and Rudin
Learning Optimized Risk Scores
Berk Ustun [email protected]
Center for Research in Computation and Society, Harvard University
Cynthia Rudin [email protected]
Department of Computer Science, Duke University
Abstract
Risk scores are simple classification models that let users make quick risk predictions by adding and subtracting a few small numbers. These models are widely used in medicine and criminal justice, but are difficult to learn from data because they need to be calibrated, sparse, use small integer coefficients, and obey application-specific operational constraints. In this paper, we present a new machine learning approach to learn risk scores. We formulate the risk score problem as a mixed integer nonlinear program, and present a new cutting plane algorithm for non-convex settings to efficiently recover its optimal solution. We improve our algorithm with specialized techniques to generate feasible solutions, narrow the optimality gap, and reduce data-related computation. Our approach can fit risk scores in a way that scales linearly in the number of samples, provides a certificate of optimality, and obeys real-world constraints without parameter tuning or post-processing. We illustrate the performance benefits of this approach through an extensive set of numerical experiments, where we compare risk scores built using our approach to those built using heuristic approaches. We also discuss the practical benefits of our approach through a real-world application where we build a customized risk score for ICU seizure prediction in collaboration with the Massachusetts General Hospital.
Keywords: scoring systems; classification; constraints; calibration; interpretability; cutting plane methods; discrete optimization; mixed integer nonlinear programming.
1. Introduction
Risk scores are linear classification models that let users assess risk by adding, subtracting, and multiplying a few small numbers (see Figure 1). These models are widely used to support decision-making in domains such as:

• Medicine: to assess the risk of mortality in intensive care (e.g., Moreno et al., 2005), critical physical conditions (e.g., adverse cardiac events, Six et al., 2008; Than et al., 2014), and mental illnesses (e.g., adult ADHD in Kessler et al. 2005; Ustun et al. 2017).

• Criminal Justice: to assess the risk of recidivism when setting bail, sentencing, and release on parole (see e.g., Latessa et al., 2009; Austin et al., 2010; Pennsylvania Bulletin, 2017).

• Finance: to assess the risk of default on a loan (see e.g., credit scores in FICO, 2011; Siddiqi, 2012), and to inform financial investments (Piotroski, 2000; Beneish et al., 2013).

The widespread adoption of risk scores in these domains stems from the fact that decision-makers often find them easy to use and understand. In comparison to other kinds of classification models, risk scores let users make quick predictions through simple arithmetic, without a computer or calculator. Users can gauge the effect of changing multiple input variables on the predicted outcome, and override predictions in an informed manner if needed. In comparison to scoring systems for decision-making (see e.g., the models considered in Ustun and Rudin, 2016; Zeng et al., 2016; Carrizosa et al., 2016; Van Belle et al., 2013; Billiet et al., 2016, 2017; Sokolovska et al., 2017, 2018), which predict a yes-or-no outcome at a fixed operating point, risk scores output risk predictions at multiple operating points. Thus, users have the ability to choose an operating point after the model has been deployed. Further, they are given risk estimates that, when calibrated, can inform this choice and support decision-making in other ways (see e.g., Shah et al., 2018).
We provide more background on risk scores in Appendix C.

Figure 1: CHADS₂ score to assess stroke risk: 1. Congestive Heart Failure (1 point); 2. Hypertension (1 point); 3. Age ≥ 75 (1 point); 4. Diabetes Mellitus (1 point); 5. Prior Stroke or Transient Ischemic Attack (2 points). Users add the points of each item to compute a SCORE, then convert the SCORE into an estimate of RISK.

Although risk scores have existed for nearly a century (see e.g., a parole violation model in Burgess, 1928), many of them are built ad hoc. This is partly because risk scores are often used in applications where models must satisfy requirements related to qualities such as interpretability, usability, or fairness (see e.g., requirements on "face validity" and "user friendliness" in Than et al., 2014). Building a risk score that satisfies these requirements necessitates precise control over multiple model properties, such as monotonicity (Gupta et al., 2016), group sparsity (Kessler et al., 2005), prediction (Reilly and Evans, 2006), and calibration (Pleiss et al., 2017). Since existing classification methods do not provide control over such a diverse set of properties off-the-shelf, risk scores are built by combining them with heuristics and expert judgment (e.g., preliminary feature selection, logistic regression, scaling, and rounding as in Antman et al., 2000). In some cases, risk scores are hand-crafted by a panel of experts (see e.g., the CHADS₂ score in Figure 1, or the National Early Warning Score of McGinley and Pearse 2012). As we will show, such approaches may produce a model that violates requirements, or that performs poorly relative to the best risk score that can be built using the same dataset. The lack of a formal guarantee complicates model development: when a risk score performs poorly, a practitioner cannot tell if this is due to heuristics used in the training pipeline or overly restrictive constraints.

In this paper, we present a new machine learning approach to learn risk scores from data. We consider a mixed-integer nonlinear program (MINLP), which minimizes the logistic loss for calibration and AUC, penalizes the ℓ₀-norm for sparsity, and restricts coefficients to small integers.
We refer to this optimization problem as the risk score problem, and refer to the risk score built from its solution as a Risk-calibrated Supersparse Linear Integer Model (RiskSLIM). We aim to recover a certifiably optimal solution (i.e., a global optimum and a certificate of optimality). This approach requires solving a computationally difficult optimization problem, but has three major benefits for our setting:

(i) Performance: Since the approach can directly optimize, penalize, and constrain discrete quantities, it can produce a risk score that is fully optimized for feature selection and small integer coefficients. Thus, models are guaranteed not to suffer in training performance due to heuristic post-processing.

(ii) Direct Customization: Practitioners can address application-specific requirements by adding discrete constraints to the MINLP formulation, which can be solved with a generic solver. In this way, they can fit customized risk scores without parameter tuning, post-processing, or implementing a new algorithm for each application.

(iii) Understanding the Impact of Constraints: It can pair risk scores with a certificate of optimality. By design, a globally optimal solution to the risk score problem attains the best performance among risk scores that satisfy a given set of constraints. Once we recover a certifiably optimal solution, we therefore end up with a risk score with acceptable performance, or a risk score with unacceptable performance and a certificate proving that the constraints were overly restrictive. In settings where performance generalizes, the certificate of optimality provides the ability to evaluate the effect of constraints on predictive performance. By comparing certifiably optimal risk scores for different sets of constraints, practitioners can make informed choices between models that satisfy different sets of requirements.

In light of these potential benefits, a major goal of this work is to recover certifiably optimal solutions to the risk score problem for the largest possible datasets. As we will show, solving the risk score problem with a MINLP solver is time-consuming even on small datasets, as generic MINLP algorithms are slowed down by excessive data-related computation. Accordingly, we aim to solve the risk score problem with a cutting plane algorithm, which reduces data-related computation by iteratively solving a surrogate problem with a linear approximation of the loss function that is much cheaper to evaluate. Cutting plane algorithms have an impressive track record on large-scale supervised learning problems, as they scale linearly with the number of samples and provide precise control over data-related computation (see e.g., Teo et al., 2009; Franc and Sonnenburg, 2009; Joachims et al., 2009). Prior cutting plane algorithms were designed under the assumption that the surrogate optimization problem can be solved to optimality at each iteration.
This assumption is perfectly reasonable in a convex setting, but leads cutting plane algorithms to stall on non-convex problems such as ours, as the time to solve the surrogate problem to optimality increases exponentially with each iteration. To overcome this issue, we present a new cutting plane algorithm for non-convex settings. We then improve its performance through fast techniques to generate good feasible solutions, narrow the optimality gap, and reduce data-related computation. Our approach extends the benefits of cutting plane algorithms to discrete optimization problems, allowing us to train risk scores in a way that can address real-world constraints, provide a certificate of optimality, and scale linearly with the number of samples in a dataset.

Contributions
The main contributions of this paper are as follows.

• We present a new machine learning approach to build risk scores. Our approach can train models that: (i) are fully optimized for feature selection and small integer coefficients; (ii) handle application-specific constraints without parameter tuning or post-processing; (iii) provide a certificate of optimality.

• We develop a new cutting plane algorithm, called the lattice cutting plane algorithm (LCPA). LCPA retains the benefits of cutting plane algorithms for convex empirical risk minimization problems, but does not stall on problems with non-convex regularizers or constraints. It can be easily implemented using a MIP solver (e.g., CPLEX, Gurobi, or CBC), and used to train customized risk scores (and other kinds of models) in a way that scales linearly with the number of samples in a dataset.

• We design new techniques that allow LCPA to quickly recover a risk score with good performance and a small optimality gap, namely: rounding and polishing heuristics; fast bound tightening and initialization procedures; and strategies to reduce data-related computation. Our techniques can be adapted to improve LCPA for other problems, and used to improve the performance of risk scores built via heuristic post-processing.

• We present an extensive set of experiments comparing methods to learn risk scores on publicly available datasets. Our results show that our approach can consistently train risk scores with best-in-class performance in minutes. We highlight pitfalls of approaches that are often used in practice, and present new heuristic methods that address these issues.

• We present results from a collaboration with the Massachusetts General Hospital where we built a customized risk score for ICU seizure prediction. Our results highlight the practical benefits of our approach when training models that obey real-world constraints, and illustrate the performance gains of certifiably optimal risk scores in such settings.

• We provide a software package to build optimized risk scores in Python, available online at http://github.com/ustunb/risk-slim.
Organization
In the remainder of Section 1, we discuss related work. In Section 2, we formally define the risk score problem. In Section 3, we present our cutting plane algorithm, LCPA. In Section 4, we present techniques to improve LCPA. In Section 5, we benchmark methods to build risk scores. In Section 6, we discuss an application to ICU seizure prediction.

The supplement to our paper includes: proofs of all theorems (Appendix A); a primer on how risk scores are used and developed in practice (Appendix C); and supporting material for the computational experiments in Sections 3 and 4 (Appendix D), the performance benchmark in Section 5 (Appendix E), and the seizure prediction problem in Section 6 (Appendix F).
Prior Work
Our paper extends work that was first published in KDD (Ustun and Rudin, 2017). Real-world applications of RiskSLIM include building a screening tool for adult ADHD from a short self-reported questionnaire (Ustun et al., 2017) and building a risk score for ICU seizure prediction (Struck et al., 2017). Applications of this work have been discussed in a paper that was a finalist for the 2017 INFORMS Wagner Prize (Rudin and Ustun, 2018).

Scoring Systems
While several methods have been proposed to learn scoring systems for decision-making (see, e.g., Ustun and Rudin, 2016; Zeng et al., 2016; Carrizosa et al., 2016; Van Belle et al., 2013; Billiet et al., 2016, 2017; Sokolovska et al., 2017, 2018), the method in this work aims to produce scoring systems for risk assessment (i.e., risk scores). Risk scores represent the majority of scoring systems that are currently used in medicine, criminal justice, and credit scoring. In practice, the performance objective of these models is to produce calibrated risk estimates, which are used to choose an operating point and to inform decision-making in other ways (see e.g., Section 6 and Van Calster and Vickers, 2015; Alba et al., 2017, for a discussion of how miscalibrated risk estimates can lead to harmful decisions in medicine). As we will show in Section 5.3, building risk scores that output calibrated risk estimates is challenging, and heuristics used in risk score development (e.g., rounding and scaling) can compromise calibration in ways that are difficult to fix.
RiskSLIM risk scores are the risk assessment counterpart to SLIM scoring systems (Ustun et al., 2013; Ustun and Rudin, 2016; Zeng et al., 2016; Ustun et al., 2016). Both RiskSLIM and SLIM use score functions that are optimized for feature selection and small integer coefficients, and that can be directly customized to obey application-specific constraints. RiskSLIM models are designed for risk assessment and optimize the logistic loss. In contrast, SLIM models are designed for decision-making and minimize the 0–1 loss. SLIM models do not output probability estimates, and their scores will not necessarily have high AUC. However, they perform well at the single point on the ROC curve that they were optimized for. Optimizing the 0–1 loss is also NP-hard, so training SLIM models may not scale to datasets with large sample sizes as is the case here. In practice, RiskSLIM is better-suited for applications where models must output calibrated risk estimates and/or perform well at multiple operating points over the ROC curve.
Machine Learning
Our cutting-plane algorithm can be used with machine learning methods where models are trained by solving an empirical risk minimization problem with a convex loss function, a non-convex penalty, and non-convex constraints. These include methods to learn: scoring systems for decision-making (Carrizosa et al., 2016; Van Belle et al., 2013; Billiet et al., 2016, 2017; Sokolovska et al., 2017); sparse Boolean classifiers (e.g., Chevaleyre et al., 2013; Malioutov and Varshney, 2013; Goh and Rudin, 2014; Wang et al., 2017); sparse decision trees (Angelino et al., 2018; Letham et al., 2015; Lakkaraju et al., 2016; Wang and Rudin, 2015); and other ℓ₀-regularized models (Sato et al., 2015, 2016; Bertsimas et al., 2016). For each of these methods, our cutting-plane algorithm can train models that optimize the same objective function and obey the same constraints, but in a way that recovers a globally optimal solution, handles application-specific constraints, and scales linearly with the number of samples.

Our approach is a promising option to build models that obey constraints related to, for example, interpretability (see e.g. Caruana et al., 2015; Gupta et al., 2016; Chen et al., 2018; Li et al., 2018, where interpretability is addressed through constraints on model form), safety (Amodei et al., 2016), credibility (Wang et al., 2018), and fairness (Kamishima et al., 2011; Zafar et al., 2017). Such qualities depend on multiple model properties, which change significantly across applications and result in unclear performance trade-offs. Current approaches often aim to address a specific kind of constraint for generic models via pre- or post-processing (see e.g., Goh et al., 2016; Calmon et al., 2017). In contrast, our approach can address such constraints directly for a specific model class. When these models belong to a simple hypothesis class (e.g., risk scores), we can expect model performance on training data to generalize, and can check generalization empirically (e.g., using cross-validation). In this way, users can assess the impact of their requirements on predictive performance and make informed choices between models.

Our work is part of a recent stream of research on integer programming in supervised learning (Carrizosa et al., 2016; Liu and Wu, 2007; Goldberg and Eckstein, 2012; Guan et al., 2009; Nguyen and Franke, 2012; Sato et al., 2015, 2016; Bertsimas et al., 2016; Bertsimas and Dunn, 2017). The unique aspect of our work on the risk score problem is that our models are certifiably optimal or have small optimality gaps (see also Angelino et al., 2018). Our work shows that certifiably optimal models not only perform better, but are also useful for applications where models must satisfy constraints (see Section 6).
Optimization
We train risk scores by solving a MINLP with three main components: (i) a convex loss function; (ii) a non-convex feasible region (i.e., small integer coefficients and application-specific constraints); (iii) a non-convex penalty function (i.e., the ℓ₀-penalty).

In Section 3.3, we show that this MINLP requires a specialized algorithm, as off-the-shelf MINLP solvers fail to solve instances for small datasets. Accordingly, we propose solving the risk score problem with a cutting plane algorithm; cutting planes have been extensively studied by the optimization community (see e.g., Kelley, 1960), and used to solve convex empirical risk minimization problems (Teo et al., 2007, 2009; Franc and Sonnenburg, 2008, 2009; Joachims, 2006; Joachims et al., 2009); except, as we discuss, these methods do not typically apply to our problem because they stall in the discrete setting.

Our cutting plane algorithm (LCPA) builds a cutting plane approximation while performing branch-and-bound search. It can be easily implemented using a MIP solver with control callbacks (see e.g., Bai and Rubin, 2009; Naoum-Sawaya and Elhedhli, 2010, for similar uses of control callbacks in the optimization literature).
LCPA retains the key benefits of existing cutting plane algorithms on empirical risk minimization problems, but does not stall on problems with non-convex regularizers or constraints. As we discuss in Section 3.1, stalling affects many cutting plane algorithms, including variants that are not considered in machine learning (see Boyd and Vandenberghe, 2004, for a list).

Given that LCPA does not stall, it may provide a practical alternative to solve the kinds of optimization problems that have motivated recent work in the optimization community (see e.g., Park and Boyd, 2015a; Hübner and Schöbel, 2014, who propose other approaches to solve special cases that minimize a convex function over integers).
2. Risk Score Problem
In what follows, we formalize the problem of learning a risk score as in Figure 1. We start with a dataset of n i.i.d. training examples (x_i, y_i)_{i=1}^n, where x_i ∈ ℝ^{d+1} denotes a vector of features [1, x_{i,1}, . . . , x_{i,d}]^⊤ and y_i ∈ {±1} denotes a class label. We represent the score as a linear function s(x) = ⟨λ, x⟩, where λ ∈ ℝ^{d+1} is a vector of d + 1 coefficients [λ_0, λ_1, . . . , λ_d]^⊤, and λ_0 is an intercept. In this setup, coefficient λ_j represents the points for feature j. Given an example with features x_i, a user first tallies the points to compute a score s_i = ⟨λ, x_i⟩, then uses the score to obtain an estimate of predicted risk. We estimate the predicted risk that example i is positive through the logistic link function as:

p_i = Pr(y_i = +1 | x_i) = 1 / (1 + exp(−⟨λ, x_i⟩)).

Model Desiderata
Our goal is to train a risk score that is sparse, has small integer coefficients, and performs well in terms of the following measures:

1. Calibration: A calibrated model outputs risk predictions that match their observed risk. We assess the calibration of a model using a reliability diagram (see DeGroot and Fienberg, 1983), which shows how the predicted risk (x-axis) at each score matches the observed risk (y-axis). We estimate the observed risk for a score of s as

p̄_s = (1 / |{i : s_i = s}|) Σ_{i : s_i = s} 1[y_i = +1].

We summarize the calibration of a model over the full reliability diagram using the expected calibration error (Naeini et al., 2014):

CAL = (1/n) Σ_s Σ_{i : s_i = s} |p_i − p̄_s|.

2. Rank Accuracy: A rank-accurate model outputs scores that can correctly rank examples according to their true risk. We assess the rank accuracy of a model using the area under the ROC curve:

AUC = (1 / (n⁺ n⁻)) Σ_{i : y_i = +1} Σ_{k : y_k = −1} 1[s_i > s_k],

where n⁺ = |{i : y_i = +1}| and n⁻ = |{i : y_i = −1}|.

As discussed in Section 1.1, calibration is the primary performance objective when building a risk score. In principle, good calibration should ensure good rank accuracy. Nevertheless, we report AUC as an auxiliary performance metric because trivial risk scores (i.e., models that assign the same score to all examples) can have low CAL on datasets with class imbalance (see Section 5.3 for an example).
1. Users can also obtain a predicted label ŷ_i ∈ {±1} by setting a risk threshold (e.g., predict ŷ_i = +1 if predicted risk ≥ 50%).

Optimization Problem
We learn the values of the coefficients by solving a mixed integer nonlinear program (MINLP), which we refer to as the risk score problem or RiskSlimMINLP.

Definition 1 (Risk Score Problem, RiskSlimMINLP) The risk score problem is a discrete optimization problem with the form:

min_λ  l(λ) + C‖λ‖₀  s.t.  λ ∈ L,    (1)

where:
• l(λ) = (1/n) Σ_{i=1}^n log(1 + exp(−⟨λ, y_i x_i⟩)) is the normalized logistic loss function;
• ‖λ‖₀ = Σ_{j=1}^d 1[λ_j ≠ 0] is the ℓ₀-seminorm;
• L ⊂ ℤ^{d+1} is a set of feasible coefficient vectors (user-provided);
• C > 0 is a trade-off parameter to balance fit and sparsity (user-provided).

RiskSlimMINLP is formulated to capture the exact constraints of a risk score. The objective minimizes the logistic loss for calibration and AUC and penalizes the ℓ₀-norm for sparsity. The trade-off parameter C controls the balance between these competing objectives, and represents the maximum log-likelihood that is sacrificed to remove a feature from the optimal model. The feasible region restricts coefficients to a small set of bounded integers, such as L = {−5, −4, . . . , 4, 5}^{d+1}, and may be customized to address exact model requirements, such as those in Table 1.
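To make the objective concrete, here is a direct evaluation of V(λ) = l(λ) + C‖λ‖₀ on a toy dataset (the data, coefficients, and value of C below are hypothetical):

```python
import numpy as np

def objective(lam, X, y, C):
    """V(lam) = normalized logistic loss + C * (number of nonzero coefficients)."""
    margins = y * (X @ lam)                        # <lam, y_i x_i> for each example
    loss = np.mean(np.log1p(np.exp(-margins)))     # normalized logistic loss l(lam)
    l0 = np.count_nonzero(lam[1:])                 # intercept lam_0 is not penalized
    return loss + C * l0

# Hypothetical data; the first column of X is the intercept feature (all ones).
X = np.array([[1.0, 2.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
lam = np.array([0.0, 1.0, -1.0])                   # small integer coefficients
V = objective(lam, X, y, C=1e-2)
```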
Feature Selection: Choose between 5 to 10 total features.
Group Sparsity: Include either male or female in the model, but not both.
Optimal Thresholding: Use at most 3 thresholds for a set of indicator variables: Σ_k 1[age ≤ k] ≤ 3.
Logical Structure: If male is in the model, then include hypertension or bmi ≥ 30 as a control.
Side Information: Predict Pr(y = +1 | x) ≥ 0.90 when male = TRUE and hypertension = TRUE.
Fairness: Limit disparate impact between groups A and B to 80%: Pr(ŷ = +1 | i ∈ A) / Pr(ŷ = +1 | i ∈ B) ≥ 0.8.

Table 1: Model requirements that can be addressed by adding operational constraints to RiskSlimMINLP.

A Risk-calibrated Supersparse Linear Integer Model (RiskSLIM) is a risk score built using an optimal solution to (1). By definition, the optimal solution to
RiskSlimMINLP attains the lowest value of the logistic loss among feasible models on the training data, provided that C is small enough (see Appendix B for a more precise statement and a proof). Thus, RiskSLIM is the maximum likelihood logit model that satisfies all constraints required for a risk score.

Our empirical results in Section 5 show that models with lower loss typically attain better calibration and AUC on the training data (see also Caruana and Niculescu-Mizil, 2004), and that this generalizes to test data due to the simplicity of our hypothesis space. There are some theoretical results to explain why minimizing the logistic loss leads to good calibration and AUC. In particular, the logistic loss is a strictly proper loss (Reid and Williamson, 2010; Ertekin and Rudin, 2011), which yields calibrated risk estimates under the parametric assumption that the true risk can be modeled using a logistic link function (see Menon et al., 2012). Further, the work of Kotlowski et al. (2011) shows that a "balanced" version of the logistic loss forms a lower bound on 1 − AUC, so minimizing the logistic loss indirectly maximizes a surrogate of AUC.
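The properness of the logistic loss can be checked numerically: among examples that share the same score, the constant risk estimate minimizing the loss is the observed positive rate. A small sketch (the data are made up for illustration):

```python
import numpy as np

# 7 positive and 3 negative examples with identical features (hypothetical).
y = np.array([1] * 7 + [-1] * 3)

def mean_logistic_loss(p):
    """Mean logistic loss when every example receives the constant risk estimate p."""
    score = np.log(p / (1 - p))                  # the score whose predicted risk is p
    return np.mean(np.log1p(np.exp(-y * score)))

# Minimize over a grid of risk estimates; the minimizer is the positive rate 0.70.
grid = np.linspace(0.01, 0.99, 99)
p_best = grid[np.argmin([mean_logistic_loss(p) for p in grid])]
```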
Trade-off Parameter
The trade-off parameter can be restricted to values C ∈ [0, l(0)]. Setting C > l(0) will produce a trivial model where λ* = 0. Using an exact formulation provides an alternative way to set the trade-off parameter C:

• If we are given a limit on model size (e.g., ‖λ‖₀ ≤ R), we can add it as a constraint in the formulation and set C to a small value (e.g., C = 10⁻⁸). In this case, the optimal solution corresponds to the best model that obeys the model size constraint, provided that C is small enough (see Appendix B).

• If we wish to set the model size based on cross-validated performance, we can repeat the previous process for ‖λ‖₀ ≤ R for R = 1 . . . d. This lets us fit risk scores across the full ℓ₀-regularization path by solving d instances of RiskSlimMINLP. In comparison, a standard approach (i.e., where we treat C as a hyperparameter and define a grid of values) would require solving at least d instances, as we cannot determine (in advance) d values of C that would return the full range of risk scores.
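The ℓ₀-path procedure can be sketched on a tiny problem, with brute-force enumeration over a small coefficient lattice standing in for the MINLP solver (the data and lattice below are hypothetical):

```python
import itertools
import numpy as np

# Hypothetical training data; the first column is the intercept feature.
X = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0], [1.0, 0.0, 0.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
d = X.shape[1] - 1
lattice = range(-2, 3)                    # coefficients restricted to {-2,...,2}

def loss(lam):
    return np.mean(np.log1p(np.exp(-y * (X @ np.asarray(lam, dtype=float)))))

# One size-constrained problem per model size R, solved here by enumeration.
path = {}
for R in range(1, d + 1):
    feasible = (lam for lam in itertools.product(lattice, repeat=d + 1)
                if np.count_nonzero(lam[1:]) <= R)   # ||lam||_0 <= R (no intercept)
    path[R] = min(feasible, key=loss)
```

Relaxing the size limit can only help the fit, so the loss along the path is non-increasing in R.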
Computational Complexity

Optimizing RiskSlimMINLP is a difficult computational task given that ℓ₀-regularization, minimization over integers, and MINLP problems are all NP-hard (Bonami et al., 2012). These are worst-case complexity results, which mean that finding an optimal solution to RiskSlimMINLP may be intractable for high-dimensional datasets. As we will show, however, RiskSlimMINLP can be solved to optimality for many real-world datasets in minutes, and in a way that scales linearly in the sample size.
Notation, Assumptions, and Terminology
We denote the objective function of RiskSlimMINLP as V(λ) = l(λ) + C‖λ‖₀, and an optimal solution as λ* ∈ argmin_{λ ∈ L} V(λ). We bound the optimal values of the objective, loss, and ℓ₀-norm as V(λ*) ∈ [V^min, V^max], l(λ*) ∈ [L^min, L^max], and ‖λ*‖₀ ∈ [R^min, R^max], respectively. We denote the set of feasible coefficients for feature j as L_j, and define Λ_j^min = min_{λ_j ∈ L_j} λ_j and Λ_j^max = max_{λ_j ∈ L_j} λ_j.

For clarity of exposition, we assume that: (i) the coefficient set contains the null vector, 0 ∈ L, which ensures that RiskSlimMINLP is always feasible; (ii) the intercept is not regularized, which means that the more precise version of the RiskSlimMINLP objective function is V(λ) = l(λ) + C‖λ_{[1,d]}‖₀, where λ = [λ_0, λ_{[1,d]}].

We measure the optimality of a feasible solution λ′ ∈ L in terms of its optimality gap, defined as (V(λ′) − V^min) / V(λ′). Given an algorithm to solve RiskSlimMINLP, we denote the best feasible solution that the algorithm returns in a fixed time as λ^best ∈ L. The optimality gap of λ^best is computed using an upper bound set as V^max = V(λ^best), and a lower bound V^min that is provided by the algorithm. We say that the algorithm has solved RiskSlimMINLP to optimality if λ^best has an optimality gap of ε = 0, in which case it has recovered an optimal solution to RiskSlimMINLP and produced a lower bound V^min = V(λ*).
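The bookkeeping above reduces to a one-line computation; the bound values here are hypothetical:

```python
# Hypothetical bounds returned by an algorithm after a fixed time budget.
V_max = 0.42                        # V(lam_best), objective value of the incumbent
V_min = 0.40                        # lower bound proved by the algorithm

gap = (V_max - V_min) / V_max       # optimality gap of lam_best
solved_to_optimality = (gap == 0.0)
```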
3. Methodology
In this section, we discuss the cutting plane algorithm that we use to solve the risk score problem. We start with a brief introduction to cutting plane algorithms to describe their practical benefits and to explain why existing algorithms stall on non-convex problems (Section 3.1). In Section 3.2, we present a new cutting plane algorithm that does not stall. In Section 3.3, we compare the performance of cutting plane algorithms to a commercial MINLP solver on difficult instances of the risk score problem.
In Algorithm 1, we present a simple cutting plane algorithm to solve RiskSlimMINLP that we call CPA.

CPA recovers the optimal solution to RiskSlimMINLP by repeatedly solving a surrogate problem that optimizes a linear approximation of the loss function l(λ). The approximation is built using cutting planes or cuts. Each cut is a supporting hyperplane to the loss function at a fixed point λ^t ∈ L:

l(λ^t) + ⟨∇l(λ^t), λ − λ^t⟩.

Here, l(λ^t) ∈ ℝ₊ and ∇l(λ^t) ∈ ℝ^{d+1} are cut parameters that can be computed by evaluating the value and gradient of the loss at the point λ^t:

l(λ^t) = (1/n) Σ_{i=1}^n log(1 + exp(−⟨λ^t, y_i x_i⟩)),   ∇l(λ^t) = (1/n) Σ_{i=1}^n −y_i x_i / (1 + exp(⟨λ^t, y_i x_i⟩)).   (3)

As shown in Figure 2, we can construct a cutting plane approximation of the loss function by taking the pointwise maximum of multiple cuts. In what follows, we denote the cutting plane approximation of the loss function built using k cuts as:

l̂^k(λ) = max_{t=1...k} [ l(λ^t) + ⟨∇l(λ^t), λ − λ^t⟩ ].

Figure 2: A convex loss function l(λ) and its cutting plane approximation l̂(λ).

On iteration k, CPA solves a surrogate mixed-integer program (MIP) that minimizes the cutting plane approximation l̂^k, namely RiskSlimMIP(l̂^k). CPA uses the optimal solution to the surrogate MIP, (L^k, λ^k), in two ways: (i) it computes a new cut at λ^k to improve the cutting plane approximation; (ii) it computes bounds on the optimal value of RiskSlimMINLP to check for convergence.
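The cut parameters in (3) and the pointwise-maximum approximation can be sketched as follows (the training data and cut points below are hypothetical):

```python
import numpy as np

# Hypothetical training data; the first column is the intercept feature.
X = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])

def cut_at(lam):
    """Return the cut parameters (l(lam), grad l(lam)) at a fixed point lam."""
    z = y * (X @ lam)                                           # <lam, y_i x_i>
    value = np.mean(np.log1p(np.exp(-z)))
    grad = np.mean(-(y[:, None] * X) / (1.0 + np.exp(z))[:, None], axis=0)
    return value, grad

# Build an approximation from cuts at two fixed points.
points = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
cuts = [(p, *cut_at(p)) for p in points]

def l_hat(lam):
    """Pointwise maximum of the cuts: a lower bound on the true loss."""
    return max(v + g @ (lam - p) for p, v, g in cuts)
```

By convexity, each cut lies below the loss everywhere and touches it at its own cut point, so l_hat underestimates l and is exact at the points where cuts were computed.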
Algorithm 1: Cutting Plane Algorithm (CPA)

Input: (x_i, y_i)_{i=1}^n  training data
       L  coefficient set
       C  ℓ₀ penalty parameter
       ε_stop ∈ [0, 1]  maximum optimality gap of acceptable solution

Initialize: k ← 0  (iteration counter)
            l̂⁰(λ) ← 0  (cutting plane approximation, initially no cuts)
            (V^min, V^max) ← (0, ∞)  (bounds on the optimal value of RiskSlimMINLP)
            ε ← ∞  (optimality gap)

1:  while ε > ε_stop do
2:      (L^k, λ^k) ← provably optimal solution to RiskSlimMIP(l̂^k)
3:      compute cut parameters l(λ^k) and ∇l(λ^k)
4:      l̂^{k+1}(λ) ← max{ l̂^k(λ), l(λ^k) + ⟨∇l(λ^k), λ − λ^k⟩ }  (update approximation for all λ ∈ L)
5:      V^min ← L^k + C‖λ^k‖₀  (optimal value of RiskSlimMIP is a lower bound)
6:      if V(λ^k) < V^max then
7:          V^max ← V(λ^k)  (update upper bound)
8:          λ^best ← λ^k  (update incumbent)
9:      end if
10:     ε ← 1 − V^min / V^max  (update optimality gap)
11:     k ← k + 1
12: end while

Output: λ^best  (ε-optimal solution to RiskSlimMINLP)
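The sketch below runs the CPA loop end-to-end on a tiny instance; brute-force enumeration over a small coefficient lattice plays the role of the provably optimal MIP solve in line 2, and the data, lattice, and tolerance are all hypothetical:

```python
import itertools
import numpy as np

# Hypothetical data; the first column of X is the intercept feature.
X = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 2.0], [1.0, -2.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
C = 1e-3
lattice = [np.array(p, dtype=float)
           for p in itertools.product(range(-3, 4), repeat=2)]

def loss(lam):
    return np.mean(np.log1p(np.exp(-y * (X @ lam))))

def grad(lam):
    z = y * (X @ lam)
    return np.mean(-(y[:, None] * X) / (1.0 + np.exp(z))[:, None], axis=0)

def l0(lam):
    return np.count_nonzero(lam[1:])          # intercept is not penalized

cuts, V_min, V_max, lam_best = [], 0.0, np.inf, None
while True:
    def surrogate(lam):                       # max-of-cuts approximation + C*||.||_0
        lhat = max((v + g @ (lam - p) for p, v, g in cuts), default=0.0)
        return max(lhat, 0.0) + C * l0(lam)
    lam_k = min(lattice, key=surrogate)       # stand-in for the RiskSlimMIP solve
    V_min = surrogate(lam_k)                  # optimal surrogate value = lower bound
    V_k = loss(lam_k) + C * l0(lam_k)         # true objective at lam_k
    if V_k < V_max:
        V_max, lam_best = V_k, lam_k          # update incumbent and upper bound
    if V_max - V_min <= 1e-6:
        break
    cuts.append((lam_k, loss(lam_k), grad(lam_k)))
```

Because the lattice is finite and cuts are exact at their own cut points, the loop must terminate: revisiting an already-cut point closes the gap immediately.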
RiskSlimMIP (ˆ l k ) is a surrogate problem for RiskSlimMINLP that minimizes a cuttingplane approximation ˆ l k of the loss function l :min L, λ L + C (cid:107) λ (cid:107) s.t. L ≥ ˆ l ( λ ) λ ∈ L . (2)We present a MIP formulation for RiskSlimMIP (ˆ l k ) in Definition 2, described shortly. stun and Rudin to check for convergence. Here, the upper bound is set as the objective value of the bestsolution across all iterations: V max = min t =1 ...k (cid:104) l ( λ t ) + C (cid:107) λ t (cid:107) (cid:105) . The lower bound is set as the optimal value to the surrogate problem at the current iteration: V min = ˆ l k ( λ k ) + C (cid:107) λ k (cid:107) . CPA converges to an optimal solution of
RiskSlimMINLP in a finite number of iterations (see e.g., Kelley, 1960, for a proof). In particular, the cutting plane approximation of a convex loss function improves monotonically with each cut:

    l̂^k(λ) ≤ l̂^{k+m}(λ) ≤ l(λ)   for all λ ∈ L and k, m ∈ ℕ.

Since the cuts added at each iteration are not redundant, the lower bound improves monotonically with each iteration. Once the optimality gap ε falls below the stopping threshold ε_stop, CPA terminates and returns an ε-optimal solution λ^best to RiskSlimMINLP.

RiskSlimMIP Formulation
Definition 2 (RiskSlimMIP)
Given a finite coefficient set L ⊂ ℤ^{d+1}, trade-off parameter C₀ > 0, and cutting plane approximation l̂^k: ℝ^{d+1} → ℝ₊ with cut parameters {l(λ^t), ∇l(λ^t)}, t = 1,...,k, the surrogate optimization problem RiskSlimMIP(l̂^k) can be formulated as the mixed integer program:

    min_{L,λ,α}  V
    s.t.  V = L + C₀R                                          objective value          (4a)
          L ≥ l(λ^t) + ⟨∇l(λ^t), λ − λ^t⟩    t = 1,...,k       cut constraints          (4b)
          R = Σ_{j=1}^{d} α_j                                  ℓ₀-norm                  (4c)
          λ_j ≤ Λ_j^max α_j                   j = 1,...,d       ℓ₀ indicator constraints (4d)
          λ_j ≥ Λ_j^min α_j                   j = 1,...,d       ℓ₀ indicator constraints (4e)
          V ∈ [V^min, V^max]                                   bounds on objective value (4f)
          L ∈ [L^min, L^max]                                   bounds on loss value      (4g)
          R ∈ {R^min, ..., R^max}                              bounds on ℓ₀-norm        (4h)
          λ_j ∈ {Λ_j^min, ..., Λ_j^max}       j = 1,...,d       coefficient bounds       (4i)
          α_j ∈ {0, 1}                        j = 1,...,d       ℓ₀ indicator variables

The formulation in Definition 2 contains 2d + 3 variables and k + 2d + 2 constraints (excluding bounds on the variables). Here, the cutting plane approximation is represented via the cut constraints in (4b) and the auxiliary variable L ∈ ℝ₊. The ℓ₀-norm is computed using the indicator variables α_j = 1[λ_j ≠ 0] set in constraints (4d) and (4e). The λ_j are restricted to a bounded set of integers in constraints (4i).

The formulation includes two additional auxiliary variables: V, defined as the objective value in (4a); and R, defined as the ℓ₀-norm in (4c). These variables will be useful for implementing the techniques in Section 4. In particular, by including V and R, we can use a control callback in a MIP solver to set bounds on these quantities during branch-and-bound without adding new constraints to the formulation.

Why Solve the Risk Score Problem with a Cutting Plane Algorithm
CPA has three important properties that motivate why we want to use a cutting plane algorithm to solve the risk score problem:

(i) Scalability in the Sample Size: Cutting plane algorithms only use the training data when computing cut parameters, and not while solving RiskSlimMIP. Since the parameters in (3) can be computed using elementary matrix-vector operations in O(nd) time at each iteration, running time scales linearly in n for fixed d (see Figure 3).

(ii) Control over Data-related Computation: Since cutting plane algorithms compute cut parameters in a single isolated step (e.g., Step 3 in Algorithm 1), users can reduce data-related computation by customizing their implementation to compute cut parameters efficiently (e.g., via distributed computing, or techniques that exploit structural properties of a specific model class as in Section 4.4).

(iii) Ability to use a MIP Solver: Cutting plane algorithms have a special benefit in our setting since the surrogate problem can be solved with a MIP solver (rather than a MINLP solver). MIP solvers provide a fast implementation of branch-and-bound search and other features to speed up the search process (e.g., built-in heuristics, preprocessing and cut generation procedures, lazy evaluation of cut constraints, and control callbacks that let us customize the search with specialized techniques). As we show in Figure 6, using a MIP solver can substantially improve our ability to solve RiskSlimMINLP, despite the fact that one may have to solve multiple MIPs.

Note that (i) and (ii) are well-known benefits of cutting plane algorithms for convex empirical risk minimization problems (see e.g., Teo et al., 2007).

Figure 3:
Runtime of CPA on synthetic datasets with d = 10 and a range of sample sizes n (see Appendix D for details). As n increases, the runtime for the solver (grey) remains roughly constant. The total runtime (black) scales at O(n), which reflects the scalability of the matrix-vector operations used to compute cut parameters.

Stalling in Non-Convex Settings
Cutting plane algorithms for empirical risk minimization (Joachims, 2006; Franc and Sonnenburg, 2008; Teo et al., 2009) are similar to CPA in that they solve a surrogate optimization problem at each iteration (e.g., Step 2 of Algorithm 1). When these algorithms are used to solve convex risk minimization problems, the surrogate is convex and therefore tractable. When the algorithms are used to solve risk minimization problems with non-convex regularizers or constraints, however, the surrogate is non-convex. In these settings, cutting plane algorithms will typically stall as they eventually reach an iteration where the surrogate problem cannot be solved to optimality within a fixed time limit.

In Figure 4, we illustrate the stalling behavior of
CPA on a difficult instance of
RiskSlimMINLP for a synthetic dataset where d = 20 (see also Figure 6). As shown, the first iterations terminate quickly because the surrogate problem RiskSlimMIP contains only a trivial approximation of the loss. Since the surrogate becomes increasingly difficult to optimize with each iteration, however, the time to solve
RiskSlimMIP increases exponentially, leading
CPA to stall at iteration k = 86. In this case, the solution returned by CPA after 6 hours has a large optimality gap and a highly suboptimal loss. This is unsurprising, as the solution was obtained by optimizing a low-fidelity approximation of the loss (i.e., an 85-cut approximation of a 20-dimensional function). Since the value of the loss is tied to the performance of the model (see Section 5), the solution corresponds to a risk score with poor performance.

There is no simple fix to prevent cutting plane algorithms such as
CPA from stalling on non-convex problems. This is because they need a certifiably optimal solution at each iteration to compute a valid lower bound. In non-convex risk minimization problems, this requires finding the optimal solution of a non-convex surrogate and certifying that this solution has an optimality gap of 0.0%. If, for example, CPA only solved the surrogate until it found a feasible solution with a non-zero optimality gap, the lower bound computed in Step 5 could exceed the true optimal value, which would lead the algorithm to terminate prematurely and return a suboptimal solution with invalid bounds.

Seeing how the stalling behavior of
CPA is related to the mechanism used to check convergence, a tempting (but flawed) solution is to use an algorithm that constructs a cutting plane approximation by computing cuts at central points of RiskSlimMIP, such as the center of gravity algorithm of Levin (1965), or the analytic center algorithm of Atkinson and Vaidya (1995). Such algorithms are guaranteed to converge in a fixed number of iterations and do not require computing a lower bound. In this case, however, they would still stall on high-dimensional problems, as they would need to compute central points by solving a non-convex optimization problem to optimality at each iteration.
To avoid stalling in non-convex settings, we solve the risk score problem using the lattice cutting plane algorithm (LCPA) shown in Algorithm 2.
LCPA has the same benefits as other cutting plane algorithms for the risk score problem, such as scalability in the sample size, control over data-related computation, and the ability to use a MIP solver. As shown in Figure 5, however, LCPA does not stall. This is because it can add cuts and compute a lower bound without having to optimize a non-convex surrogate.

Figure 4:
Performance profile of
CPA on RiskSlimMINLP for a synthetic dataset with n = 50,000 and d = 20 (see Appendix D for details). We plot the time per iteration (left, in log-scale) and the optimality gap (right) for each iteration over 6 hours. CPA stalls on iteration 86, at which point the time to solve
RiskSlimMIP to optimality increases exponentially. The best solution obtained after 6 hours corresponds to a risk score with poor performance.
Figure 5:
Performance profile of
LCPA (red) and
CPA (black) on the
RiskSlimMINLP instance in Figure 4. Unlike
CPA , LCPA does not stall.
LCPA recovers a high-quality risk score (i.e., whose objective value is within
10% of the optimal value) in 9 minutes after adding 4,655 cuts, and the optimal risk score in 234 minutes after adding 11,665 cuts. The remaining time is used to reduce the optimality gap.

LCPA recovers the optimal solution to
RiskSlimMINLP via branch-and-bound (B&B) search. The search process recursively splits the feasible region of RiskSlimMINLP, discarding parts that are infeasible or provably suboptimal. LCPA solves a surrogate linear program (LP) over each region. It updates the cutting plane approximation when the surrogate LP yields an integer feasible solution, and sets the lower bound for the risk score problem as the smallest lower bound of the surrogate LP over unexplored regions.
Algorithm 2
Lattice Cutting Plane Algorithm (LCPA)
Input: training data (x_i, y_i), i = 1,...,n; coefficient set L; ℓ₀ penalty parameter C₀; ε_stop ∈ [0, 1], maximum optimality gap of acceptable solution; RemoveNode, a procedure to remove a node from a node set (provided by MIP solver); SplitRegion, a procedure to split a region into disjoint subsets (provided by MIP solver); RiskSlimLP(l̂, R), the LP relaxation of RiskSlimMIP(l̂) over the region R ⊆ conv(L) (see Definition 3)
Initialize: k ← 0 (number of cuts); l̂⁰(λ) ← {0} (cutting plane approximation); (V^min, V^max) ← (0, ∞) (bounds on the optimal value of RiskSlimMINLP); R⁰ ← conv(L) (initial region is the convex hull of the coefficient set); v⁰ ← 0 (lower bound of the objective value of the surrogate LP at R⁰); N ← {(R⁰, v⁰)} (node set); ε ← ∞ (optimality gap)
1:  while ε > ε_stop do
2:      (R^t, v^t) ← RemoveNode(N)   (t is the index of the removed node)
3:      solve RiskSlimLP(l̂^k, R^t)
4:      λ^t ← coefficients from optimal solution to RiskSlimLP(l̂^k, R^t)
5:      V^t ← optimal value of RiskSlimLP(l̂^k, R^t)
6:      if optimal solution is integer feasible then
7:          compute cut parameters l(λ^t) and ∇l(λ^t)
8:          l̂^{k+1}(λ) ← max{l̂^k(λ), l(λ^t) + ⟨∇l(λ^t), λ − λ^t⟩}   (update approximation for all λ ∈ L)
9:          if V^t < V^max then
10:             V^max ← V^t   (update upper bound)
11:             λ^best ← λ^t   (update best solution)
12:             N ← N \ {(R^s, v^s) | v^s ≥ V^max}   (prune suboptimal nodes)
13:         end if
14:         k ← k + 1
15:     else if optimal solution is not integer feasible then
16:         (R′, R″) ← SplitRegion(R^t, λ^t)   (R′, R″ are disjoint subsets of R^t)
17:         N ← N ∪ {(R′, V^t), (R″, V^t)}   (V^t is the lower bound of RiskSlimLP for child regions R′, R″)
18:     end if
19:     V^min ← min_{s=1,...,|N|} v^s   (smallest lower bound among nodes in N)
20:     ε ← 1 − V^min/V^max   (update optimality gap)
21: end while
Output: λ^best, an ε-optimal solution to RiskSlimMINLP

Definition 3 (RiskSlimLP)
Given a bounded convex region
R ⊆ conv(L), trade-off parameter C₀ > 0, and cutting plane approximation l̂^k: ℝ^{d+1} → ℝ₊ with cut parameters {l(λ^t), ∇l(λ^t)}, t = 1,...,k, the surrogate optimization problem RiskSlimLP(l̂^k, R) can be formulated as the linear program:

    min_{L,λ,α}  V
    s.t.  V = L + C₀R                                          objective value
          R = Σ_{j=1}^{d} α_j                                  relaxed ℓ₀-norm
          L ≥ l(λ^t) + ⟨∇l(λ^t), λ − λ^t⟩    t = 1,...,k       cut constraints
          λ_j ≤ Λ_j^max α_j                   j = 1,...,d       ℓ₀ indicator constraints
          λ_j ≥ Λ_j^min α_j                   j = 1,...,d       ℓ₀ indicator constraints
          λ ∈ R                                                feasible region
          V ∈ [V^min, V^max]                                   objective bounds
          L ∈ [L^min, L^max]                                   loss bounds
          R ∈ [R^min, R^max]                                   relaxed ℓ₀ bounds
          α_j ∈ [0, 1]                        j = 1,...,d       relaxed ℓ₀ indicators

Branch-and-Bound Search
In Algorithm 2, we represent the state of the B&B search process using a B&B tree. We refer to each leaf of the tree as a node, and denote the set of all nodes as N. Each node (R^t, v^t) ∈ N consists of: a region of the convex hull of the coefficient set, R^t ⊆ conv(L); and a lower bound v^t on the objective value of the surrogate LP over this region.

Each iteration of LCPA removes a node (R^t, v^t) ∈ N from the node set, then solves the surrogate LP for the corresponding region: RiskSlimLP(l̂^k, R^t). Subsequent steps of the algorithm are determined by the solution status of the surrogate LP:

• If RiskSlimLP(l̂^k, R^t) has an integer solution, LCPA updates the cutting plane approximation l̂^k with a new cut at λ^t in Step 8.

• If RiskSlimLP(l̂^k, R^t) has a real-valued solution, LCPA adds two child nodes (R′, v^t) and (R″, v^t) to the node set N in Step 17. The child nodes are produced by applying a splitting rule, which splits R^t into disjoint regions R′ and R″. The lower bound for each child node is set as the optimal value V^t of the surrogate LP.

• If RiskSlimLP(l̂^k, R^t) is infeasible, then LCPA discards the node from the node set.

The B&B search is governed by two procedures that are implemented in a MIP solver:

• RemoveNode, which removes a node (R^t, v^t) from the node set N (e.g., the node with the smallest lower bound v^t).

• SplitRegion, which splits R^t into disjoint subsets (e.g., splitting on a fractional component λ^t_j, which returns R′ = {λ ∈ R^t | λ_j ≥ ⌈λ^t_j⌉} and R″ = {λ ∈ R^t | λ_j ≤ ⌊λ^t_j⌋}).

The output conditions for SplitRegion must ensure that: the regions at each node remain disjoint; the total number of nodes remains finite; and the total search region shrinks even when the surrogate LP has a real-valued solution.

Convergence
LCPA certifies that it has recovered a globally optimal solution to the risk score problem by using bounds on the objective value of RiskSlimMINLP. The upper bound V^max is set by the objective value of the best integer feasible solution in Step 10. The lower bound V^min is set by the smallest objective value among all nodes in Step 19. The value of V^min can be viewed as a lower bound on the objective value of the surrogate LP over the remaining search region ∪_t R^t (i.e., V^min is a lower bound on the objective value of RiskSlimLP(l̂^k, ∪_t R^t)). Thus, V^min will increase when we reduce the remaining search region or add cuts.

Each iteration of LCPA reduces the remaining search region by either finding an integer feasible solution, identifying an infeasible region, or splitting a region into disjoint subsets. Thus, V^min increases monotonically as the search region becomes smaller and cuts are added at integer feasible solutions. Likewise, V^max decreases monotonically, as it is set as the objective value of the best integer feasible solution. Since there are a finite number of nodes in the worst case, LCPA terminates after a finite number of iterations, returning an optimal solution to the risk score problem.
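The search loop of Algorithm 2 can be sketched in Python as follows. This is our own simplified illustration, not the paper's implementation: regions are axis-aligned boxes, the surrogate LP is solved with scipy's `linprog`, the ℓ₀ penalty is omitted (C₀ = 0) for brevity, and a node is re-queued after its integer solution yields a cut (a MIP solver with lazy cuts does this implicitly).

```python
import heapq
import numpy as np
from scipy.optimize import linprog

def lcpa(X, y, bound=3, eps_stop=0.01, max_lps=500):
    """Sketch of Algorithm 2 over box regions, with C0 = 0 (loss only)."""
    d = X.shape[1]
    cuts = []                                   # cut parameters (l(λ^t), ∇l(λ^t), λ^t)
    V_max, lam_best = np.inf, None
    nodes = [(0.0, 0, -bound * np.ones(d), bound * np.ones(d))]   # node set N
    tick = 1                                    # unique tiebreaker for the heap
    for _ in range(max_lps):
        if not nodes:
            break
        V_min = nodes[0][0]                     # smallest lower bound among nodes
        if 1.0 - V_min / V_max <= eps_stop:     # optimality gap check
            break
        v, _, lb, ub = heapq.heappop(nodes)     # RemoveNode: best-first
        if v >= V_max:
            continue                            # prune provably suboptimal region
        # surrogate LP over [lb, ub]: minimize L subject to the cut constraints
        c = np.r_[np.zeros(d), 1.0]             # variables (λ_1..λ_d, L)
        A = [np.r_[g, -1.0] for _, g, _ in cuts] or None
        b = [g @ lk - l for l, g, lk in cuts] or None
        res = linprog(c, A_ub=A, b_ub=b,
                      bounds=[(lb[j], ub[j]) for j in range(d)] + [(0, None)])
        if not res.success:
            continue                            # infeasible region: discard node
        lam, V_t = res.x[:d], res.fun
        frac = np.abs(lam - np.round(lam))
        if frac.max() < 1e-6:                   # integer feasible: Steps 7-12
            lam = np.round(lam)
            z = y * (X @ lam)
            loss = float(np.mean(np.log1p(np.exp(-z))))
            grad = -(X * (y / (1.0 + np.exp(z)))[:, None]).mean(axis=0)
            cuts.append((loss, grad, lam))      # add cut at λ^t
            if loss < V_max:
                V_max, lam_best = loss, lam     # update incumbent
            heapq.heappush(nodes, (V_t, tick, lb, ub))   # re-examine with new cut
        else:                                   # SplitRegion on a fractional λ_j
            j = int(frac.argmax())
            ub1, lb2 = ub.copy(), lb.copy()
            ub1[j], lb2[j] = np.floor(lam[j]), np.ceil(lam[j])
            heapq.heappush(nodes, (V_t, tick, lb, ub1))
            heapq.heappush(nodes, (V_t, tick + 1, lb2, ub))
            tick += 1
        tick += 1
    return lam_best, V_max
```

The key structural points of Algorithm 2 survive the simplification: the training data is touched only when a cut is computed, the LP over each region supplies the lower bounds, and the heap's smallest key plays the role of V^min.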
Remark 4 (Worst-Case Data-Related Computation for
LCPA) Given any training dataset (x_i, y_i), i = 1,...,n, any trade-off parameter C₀ > 0, and any finite coefficient set L ⊂ ℤ^{d+1}, LCPA returns an optimal solution to the risk score problem after computing at most |L| cutting planes, and processing at most 2|L| − 1 nodes.

Implementation with a MIP Solver with Lazy Cut Evaluation
LCPA can easily be implemented using a MIP solver (e.g., CPLEX, Gurobi, SCIP, or GLPK) with control callbacks. In this approach, the solver handles all B&B-related steps of Algorithm 2, and users only need to write the code for a control callback that updates the cutting plane approximation. In a basic implementation, the solver calls the control callback whenever it finds an integer feasible solution (i.e., Step 6). The callback retrieves the integer feasible solution, computes the cut parameters, adds a cut to the surrogate LP, and returns control to the solver at Step 9.

A key benefit of using a MIP solver is the ability to add cuts as lazy constraints. In practice, if we were to add cuts as generic constraints to the surrogate LP, the time to solve the surrogate LP would increase with each cut, which would progressively slow down
LCPA. When we add cuts as lazy constraints, the solver branches using a surrogate LP that contains a subset of relevant cuts, and only evaluates the complete set of cuts when
LCPA finds an integer feasible solution. In this case,
LCPA still returns the optimal solution. However, computation is significantly reduced, as the surrogate LP is much faster to solve in the vast majority of cases, where it is infeasible or yields a real-valued solution. From a design perspective, lazy cut evaluation reduces the marginal computational cost of adding cuts, which allows us to add cuts liberally (i.e., without having to worry about slowing down
LCPA by adding too many cuts).

In what follows, we benchmark
CPA and
LCPA (without the improvements in Section 4) against three MINLP algorithms as implemented in a state-of-the-art commercial MINLP solver (Artelys Knitro 9.0, which is an updated version of the solver in Byrd et al., 2006). In Figure 6, we show the performance of the algorithms on difficult instances of the risk score problem for synthetic datasets with d dimensions and n samples (see Appendix D for details). We consider the following performance metrics: (i) the time to find a near-optimal solution; (ii) the optimality gap of the best solution at termination; and (iii) the proportion of time spent on data-related computation. Since all three MINLP algorithms behaved similarly, we show only the best one in Figure 6 (ActiveSetMINLP) and include results for the remaining algorithms in Appendix D.3.

As shown,
LCPA finds an optimal or near-optimal solution for almost all instances of the risk score problem, and pairs the solution with a small optimality gap.
CPA performs similarly to
LCPA on low-dimensional instances. On instances with d ≥
15, however,
CPA stalls after a few iterations and returns a highly suboptimal solution (i.e., a risk score with poor performance).

In comparison to the cutting plane algorithms, the MINLP algorithms can only handle instances with small n or d. On larger instances, the solver is slowed down by operations that involve data-related computation, fails to converge within the 6-hour time limit, and fails to recover high-quality solutions. Seeing how MINLP solvers are designed to solve a diverse set of optimization problems, we do not believe that they can identify and exploit the structure of the risk score problem in the same way as a cutting plane algorithm.
3. Knitro is one of several MINLP solvers that can solve the risk score problem directly (see Bussieck and Vigerske, 2010, for others). We chose Knitro because it let us control factors that would otherwise lead an MINLP solver to perform poorly in benchmarks, namely: (i) Knitro provides the ability to solve LP subproblems with a third-party LP solver, which let us ensure that all algorithms used the same LP solver (i.e., CPLEX 12.6.3, ILOG, 2017); (ii) Knitro provides the ability to compute the objective function of the MINLP with a user-defined function handle, which let us monitor and control data-related computation by using the same functions to evaluate the objective, its gradient, and its Hessian.
Time to Train a Good Risk Score: the time for an algorithm to find a solution whose loss is within 10% of the optimal loss. This reflects the time to obtain a risk score with good calibration without a proof of optimality. (Legend: <1 min, <10 min, <1 hour, <6 hours, 6+ hours.)

Optimality Gap of Best Solution at Termination: (V^max − V^min)/V^max, where V^max is the objective value of the best solution found at termination. A gap of 0.0% means an algorithm has found the optimal solution and provided a proof of optimality within 6 hours. (Legend: 0%, 0−20%, 20−50%, 50−90%, 90−100%.)

% Time Spent on Data-Related Computation: the proportion of total runtime that an algorithm spends computing the value, gradient, or Hessian of the loss function.

Figure 6:
Performance of
LCPA , CPA , and a commercial MINLP solver on difficult instances of
RiskSlimMINLP for synthetic datasets with d dimensions and n samples (see Appendix D for details). ActiveSetMINLP fails to produce good risk scores on instances with large n or d, as it struggles with data-related computation. CPA and
LCPA scale linearly in n when d is fixed: if they solve an instance for a given d, then they can solve instances for larger n in O(n) additional time. CPA stalls when d ≥
15 and returns a low-quality risk score when d ≥
20. In contrast,
LCPA consistently recovers a good model without stalling. Results reflect the performance of a basic
LCPA implementation without the improvements in Section 4. We show results for two other MINLP algorithms in Appendix D.
4. Algorithmic Improvements
In this section, we describe specialized techniques to improve the performance of the lattice cutting plane algorithm (
LCPA) on the risk score problem. They include:

• Polishing Heuristic. We present a technique called discrete coordinate descent (
DCD; Section 4.1.1), which we use to polish integer solutions found by
LCPA (Step 6).
DCD aims to improve the objective value of all integer solutions, which produces stronger upper bounds over the course of
LCPA, and reduces the time to recover a good risk score.

• Rounding Heuristic. We present a rounding technique called
SequentialRounding (Section 4.1.2) to generate integer solutions. We use
SequentialRounding to round real-valued solutions to the surrogate LP in Step 15, and polish the rounded solution with
DCD. Rounded solutions may improve the best solution found by
LCPA, producing stronger upper bounds and reducing the time to recover a good risk score.

• Bound Tightening Procedure. We design a fast procedure to strengthen bounds on the optimal values of the objective, the loss, and the number of non-zero coefficients, called
ChainedUpdates (Section 4.2). We call
ChainedUpdates whenever the solver updates the upper bound in Step 10 or the lower bound in Step 19.
ChainedUpdates improves the lower bound, and reduces the optimality gap of the final risk score.

• Initialization Procedure. Since solution quality is affected by the fidelity of the approximate loss function, and
LCPA only adds cuts at integer solutions, the initial set of incumbent solutions from
LCPA may correspond to low-quality risk scores. To mitigate this issue, we devise an initialization procedure that quickly generates a set of cutting planes to warm-start
LCPA (Section 4.3). Using this procedure improves both the upper and the lower bound over the course of
LCPA, reducing the time to obtain a good risk score and the final optimality gap.

• Data-Related Computation. We describe two techniques to reduce data computation. The first technique reduces all data-related computation in
LCPA and heuristic operations by exploiting the fact that risk scores have discrete and bounded coefficients. The second technique uses loss-sensitive rounding heuristics such as
SequentialRounding. By reducing data computation, both techniques improve scalability with respect to sample size.

Our techniques can be adapted to solve other risk minimization problems with similar properties (see Section 1.1 for a list).
We present two heuristics to generate and polish integer solutions for the risk score problem. We use these techniques to improve the performance of
LCPA. Both heuristics can be used to create calibrated risk scores from logistic regression models with real-valued coefficients (see Sections 5 and 6).

Our heuristics can be implemented using control callbacks in a MIP solver, and are designed to run quickly. The procedures that we present can handle sparsity constraints and monotonicity constraints, but may not yield a feasible solution under other kinds of constraints (e.g., a constraint on the false positive rate).

Discrete coordinate descent (DCD) is a technique to polish an integer solution (Algorithm 3). It takes as input an integer solution λ = [λ₀, ..., λ_d]^⊤ ∈ L and iteratively descends along a single coordinate j to attain an integer solution with a better objective value. The descent direction at each iteration is chosen as the coordinate that minimizes the objective value: j ∈ argmin_j V(λ + δ_j e_j).

DCD terminates once it can no longer strictly improve the objective value along any coordinate. This eliminates the potential of cycling, and thereby guarantees that the procedure will terminate in a finite number of iterations. The polished solution returned by
DCD satisfies a type of local optimality guarantee for discrete optimization problems: formally, it is 1-opt with respect to the objective, meaning that one cannot improve the objective value by changing any single coefficient (see e.g., Park and Boyd, 2015b, for a technique to find a 1-opt point for a different optimization problem).

In practice, the most expensive computation in
DCD is Step 5, where we determine a step-size δ_j ∈ ∆_j to minimize the objective along coordinate j. We can significantly reduce this computation by using golden section search. This approach requires nd log|L_j| flops per iteration, compared to the nd|L_j| flops per iteration required by a brute-force approach (i.e., one that evaluates the loss for all λ_j ∈ L_j).

Algorithm 3
Discrete Coordinate Descent (
DCD)
Input: training data (x_i, y_i), i = 1,...,n; coefficient set L; ℓ₀ penalty parameter C₀; integer solution λ ∈ L to RiskSlimMINLP; valid descent directions J ⊆ {0, ..., d}
1:  repeat
2:      V ← V(λ)   (objective value at current solution)
3:      for j ∈ J do
4:          ∆_j ← {δ ∈ ℤ | λ + δe_j ∈ L}   (list feasible moves along dim j)
5:          δ_j ← argmin_{δ ∈ ∆_j} V(λ + δe_j)   (find best move in dim j)
6:          v_j ← V(λ + δ_j e_j)   (store objective value for best move in dim j)
7:      end for
8:      m ← argmin_{j ∈ J} v_j   (descend along dim that minimizes objective)
9:      λ ← λ + δ_m e_m
10: until v_m ≥ V
Output: λ, a solution that is 1-opt with respect to the objective of RiskSlimMINLP
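A direct Python rendering of Algorithm 3 (our own sketch; the toy objective, C₀, and coefficient bound are illustrative, and the line search in Step 5 is done by brute force rather than the golden section search discussed above):

```python
import numpy as np

def objective(lam, X, y, C0=0.01):
    """RiskSlimMINLP objective: normalized logistic loss + C0 * ||λ||_0."""
    return (np.mean(np.log1p(np.exp(-y * (X @ lam))))
            + C0 * np.count_nonzero(lam))

def dcd(lam, X, y, C0=0.01, bound=5):
    """Discrete coordinate descent: polish an integer solution until it is 1-opt,
    i.e. no change to any single coefficient improves the objective."""
    lam = np.array(lam, float)
    while True:
        V = objective(lam, X, y, C0)
        best_v, best_move = V, None
        for j in range(len(lam)):                    # Steps 3-7: best move per dim
            for val in range(-bound, bound + 1):     # brute-force line search
                if val == lam[j]:                    # (the paper accelerates this
                    continue                         #  with golden section search)
                trial = lam.copy()
                trial[j] = val
                v = objective(trial, X, y, C0)
                if v < best_v:
                    best_v, best_move = v, (j, val)
        if best_move is None:                        # no strict improvement: 1-opt
            return lam
        j, val = best_move                           # Step 9: descend
        lam[j] = val
```

Because each accepted move strictly decreases the objective over a finite lattice, the loop cannot cycle and must terminate at a 1-opt point.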
In Figure 7, we show how
DCD improves the performance of
LCPA when we use it to polish feasible solutions found by the MIP solver (i.e., in Step 6 of Algorithm 2).

Figure 7:
Performance profile of
LCPA in a basic implementation (black) and with
DCD (red). We use
DCD to polish every integer solution found by the MIP solver whose objective value is within 10% of the current upper bound. We plot the total number of nodes processed by
LCPA (x-axis) against the upper bound (y-axis; left) and the optimality gap (y-axis; right). We mark iterations where
LCPA updates the incumbent solution. Results reflect performance on
RiskSlimMINLP for a synthetic dataset with d = 30 and n = 50,000 (see Appendix D for details).

SequentialRounding (Algorithm 4) is a rounding heuristic to generate integer solutions for the risk score problem. In comparison to naïve rounding, which returns the closest rounding from a set of 2^{d+1} possible roundings, SequentialRounding returns a rounding that optimizes the objective value of the risk score problem.

Given a real-valued solution λ^real ∈ conv(L), the procedure iteratively rounds one component (up or down) in a way that minimizes the objective of RiskSlimMINLP. On step k, it has already rounded k components of λ^real, and must round one of the remaining d − k + 1 components to ⌈λ_j^real⌉ or ⌊λ_j^real⌋. To this end, it computes the objective for all feasible (component, direction) pairs and chooses the best one. Formally, the minimization on step k requires Σ_{i=1}^{d−k+1} i = (d − k + 1)(d − k + 2)/2 evaluations of the loss function. Thus, given that there are d + 1 steps, SequentialRounding terminates after (1/6)·d(d² + 3d + 2) evaluations of the loss function.

In Figure 8, we show the impact of using SequentialRounding in LCPA. Here, we apply
SequentialRounding to the non-integer solution to
RiskSlimLP when the lower bound changes (i.e., Step 3 of Algorithm 2), then polish the rounded solution using
DCD. As shown, this strategy can reduce the time required for
LCPA to find a high-quality risk score, and attain a lower optimality gap.
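A minimal Python sketch of the procedure (our own illustration, not the paper's code; the logistic-loss objective and C₀ value stand in for the RiskSlimMINLP objective V(λ)):

```python
import numpy as np

def sequential_rounding(lam_real, X, y, C0=0.01):
    """Round a real-valued solution one component at a time, always fixing the
    (component, direction) pair that minimizes the RiskSlimMINLP objective."""
    lam = np.array(lam_real, float)

    def objective(w):
        loss = np.mean(np.log1p(np.exp(-y * (X @ w))))
        return loss + C0 * np.count_nonzero(np.round(w, 10))

    J = [j for j in range(len(lam)) if lam[j] != np.floor(lam[j])]
    while J:                                  # one component is fixed per pass
        best = None                           # (objective, component, rounded value)
        for j in J:
            for val in (np.floor(lam[j]), np.ceil(lam[j])):
                trial = lam.copy()
                trial[j] = val
                v = objective(trial)
                if best is None or v < best[0]:
                    best = (v, j, val)
        _, j, val = best
        lam[j] = val                          # fix the best (component, direction)
        J.remove(j)
    return lam
```

Unlike naïve rounding, which commits to the nearest integer in every coordinate at once, each pass here evaluates the loss of every remaining rounding choice before fixing a single component.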
SequentialRounding can be viewed as a special case of
DCD. In particular, instead of conducting a line search over all feasible values at each iteration,
SequentialRounding rounds each real-valued component to a nearby integer value. Likewise, instead of terminating at a 1-opt point,
SequentialRounding terminates in d + 1 iterations after all components are rounded. In that case, one can see why DCD might perform better than
SequentialRounding since it considers a larger set of integer coefficients and is not restricted to only d + 1iterations. In comparison to DCD , SequentialRounding terminates after a fixed number ofsteps and can be accelerated with a subsampling technique presented in Section 4.4.2. stun and Rudin Algorithm 4
SequentialRounding
Input: training data (x_i, y_i), i = 1,...,n; coefficient set L; ℓ₀ penalty parameter C₀; non-integer infeasible solution λ ∈ conv(L) from RiskSlimLP
Initialize: J^real ← {j : λ_j ≠ ⌊λ_j⌋}   (index set of real-valued coefficients)
1:  repeat
2:      λ^{j,up} ← (λ₀, ..., ⌈λ_j⌉, ..., λ_d) for all j ∈ J^real
3:      λ^{j,down} ← (λ₀, ..., ⌊λ_j⌋, ..., λ_d) for all j ∈ J^real
4:      v^up ← min_{j ∈ J^real} V(λ^{j,up})
5:      v^down ← min_{j ∈ J^real} V(λ^{j,down})
6:      if v^up < v^down then
7:          k ← argmin_{j ∈ J^real} V(λ^{j,up})
8:          λ_k ← ⌈λ_k⌉
9:      else
10:         k ← argmin_{j ∈ J^real} V(λ^{j,down})
11:         λ_k ← ⌊λ_k⌋
12:     end if
13:     J^real ← J^real \ {k}
14: until J^real = ∅
Output: λ ∈ L, an integer solution

Figure 8:
Performance profile of
LCPA in a basic implementation (black) and with
SequentialRounding and
DCD polishing (red). We call
SequentialRounding to round non-integer solutions to
RiskSlimLP in Step 15, and then polish the integer solution with
DCD . We plot large points to show when
LCPA updates the incumbent solution. Results reflect performance on
RiskSlimMINLP for a synthetic dataset with d = 30 and n = 50,000 (see Appendix D for details). Here, SequentialRounding and
DCD reduce the upper bound and optimality gap of
LCPA compared to a basic implementation (although not as much as
DCD alone).

We describe a fast bound tightening technique called
ChainedUpdates (Algorithm 5). This technique iteratively bounds the optimal values of the objective, the loss, and the ℓ₀-penalty by updating the values of V^min, V^max, L^min, L^max, and R^max in RiskSlimLP. Bounding these quantities over the course of B&B restricts the search region without discarding the optimal solution, thereby improving the lower bound and reducing the optimality gap.
Initial Bounds on Objective Terms
We initialize
ChainedUpdates with values of V^min, V^max, L^min, L^max, and R^max that can be computed using only the training data (x_i, y_i), i = 1,...,n, and the coefficient set L. We start with Proposition 5, which provides initial values for L^min and L^max using the fact that the coefficient set L is bounded (see Appendix A for a proof).

Proposition 5 (Bounds on the Logistic Loss over a Bounded Coefficient Set)
Given a training dataset $(x_i, y_i)_{i=1}^n$ where $x_i \in \mathbb{R}^{d+1}$ and $y_i \in \{\pm 1\}$ for $i = 1, \ldots, n$, consider the normalized logistic loss of a linear classifier with coefficients $\lambda$:
$$l(\lambda) = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-\langle \lambda, y_i x_i \rangle)).$$
If the coefficients belong to a bounded set $\mathcal{L}$, then the value of the normalized logistic loss must obey $l(\lambda) \in [L^{min}, L^{max}]$ for all $\lambda \in \mathcal{L}$, where
$$L^{min} = \frac{1}{n} \sum_{i: y_i = +1} \log\left(1 + \exp(-s_i^{max})\right) + \frac{1}{n} \sum_{i: y_i = -1} \log\left(1 + \exp(s_i^{min})\right),$$
$$L^{max} = \frac{1}{n} \sum_{i: y_i = +1} \log\left(1 + \exp(-s_i^{min})\right) + \frac{1}{n} \sum_{i: y_i = -1} \log\left(1 + \exp(s_i^{max})\right),$$
$$s_i^{min} = \min_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle \quad \text{and} \quad s_i^{max} = \max_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle \quad \text{for } i = 1, \ldots, n.$$

The value of $L^{min}$ in Proposition 5 represents the "best-case" loss in a separable setting where we assign each positive example its maximal score $s_i^{max}$ and each negative example its minimal score $s_i^{min}$. Conversely, $L^{max}$ represents the "worst-case" loss when we assign each positive example its minimal score $s_i^{min}$ and each negative example its maximal score $s_i^{max}$. Both $L^{min}$ and $L^{max}$ can be computed in O(n) flops using only the training data and the coefficient set by evaluating $s_i^{min}$ and $s_i^{max}$ as follows (where $\Lambda_j^{min}$ and $\Lambda_j^{max}$ denote the smallest and largest feasible values of coefficient $j$):
$$s_i^{min} = \min_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle = \sum_{j=0}^{d} \mathbb{1}[x_{ij} > 0]\, x_{ij} \Lambda_j^{min} + \mathbb{1}[x_{ij} < 0]\, x_{ij} \Lambda_j^{max}, \tag{6}$$
$$s_i^{max} = \max_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle = \sum_{j=0}^{d} \mathbb{1}[x_{ij} > 0]\, x_{ij} \Lambda_j^{max} + \mathbb{1}[x_{ij} < 0]\, x_{ij} \Lambda_j^{min}. \tag{7}$$

We initialize the bounds on the number of non-zero coefficients trivially as $R \in \{0, \ldots, d\}$. In some cases, these bounds may be stronger due to operational constraints (e.g., we can set $R \in \{0, \ldots, 5\}$ if models are required to use at most 5 features). In such cases, $L^{min}$ and $L^{max}$ can be further tightened since $s_i^{min}$ and $s_i^{max}$ are bounded by the number of non-zero coefficients.
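As a concrete illustration, the per-example score bounds in Equations (6) and (7) can be sketched as follows (a minimal sketch; the function and argument names are ours, with `Lambda_min[j]` and `Lambda_max[j]` standing in for $\Lambda_j^{min}$ and $\Lambda_j^{max}$):

```python
def score_bounds(x, Lambda_min, Lambda_max):
    """Per-example score bounds from Equations (6)-(7): each term x_j * lambda_j
    is minimized by pairing a positive feature with the smallest feasible
    coefficient (and a negative feature with the largest), and vice versa."""
    s_min = sum(xj * (lo if xj > 0 else hi)
                for xj, lo, hi in zip(x, Lambda_min, Lambda_max))
    s_max = sum(xj * (hi if xj > 0 else lo)
                for xj, lo, hi in zip(x, Lambda_min, Lambda_max))
    return s_min, s_max
```

For binary features $x_{ij} \in \{0, 1\}$, $s_i^{min}$ reduces to the sum of $\Lambda_j^{min}$ over the example's active features.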
Once again, $s_i^{min}$ (or $s_i^{max}$) can still be computed efficiently in O(n) flops by choosing the $R^{max}$ smallest (or largest) terms in the right-hand side of Equation (6) (or (7)). Having initialized $L^{min}$, $L^{max}$, $R^{min}$ and $R^{max}$, we set the bounds on the optimal objective value as $V^{min} = L^{min} + C_0 R^{min}$ and $V^{max} = L^{max} + C_0 R^{max}$, respectively.

Dynamic Bounds on Objective Terms
In Propositions 6 to 8, we present bounds that use information from the MIP solver in LCPA to strengthen the values of $L^{min}$, $L^{max}$, $R^{max}$, $V^{min}$ and $V^{max}$ (see Appendix A for proofs).

Proposition 6 (Upper Bound on Optimal Number of Non-Zero Coefficients)
Given an upper bound on the optimal objective value $V^{max} \geq V(\lambda^*)$, and a lower bound on the optimal loss $L^{min} \leq l(\lambda^*)$, the optimal number of non-zero coefficients is at most $R^{max} \geq \|\lambda^*\|_0$ where
$$R^{max} = \left\lfloor \frac{V^{max} - L^{min}}{C_0} \right\rfloor.$$

Proposition 7 (Upper Bound on Optimal Loss)
Given an upper bound on the optimal objective value $V^{max} \geq V(\lambda^*)$, and a lower bound on the optimal number of non-zero coefficients $R^{min} \leq \|\lambda^*\|_0$, the optimal loss is at most $L^{max} \geq l(\lambda^*)$ where $L^{max} = V^{max} - C_0 R^{min}$.

Proposition 8 (Lower Bound on Optimal Loss)
Given a lower bound on the optimal objective value $V^{min} \leq V(\lambda^*)$, and an upper bound on the optimal number of non-zero coefficients $R^{max} \geq \|\lambda^*\|_0$, the optimal loss is at least $L^{min} \leq l(\lambda^*)$ where $L^{min} = V^{min} - C_0 R^{max}$.

Implementation
In Algorithm 5, we present a fast bound-tightening procedure to strengthen the values of $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, and $R^{max}$ in RiskSlimLP using Propositions 6 to 8.

Propositions 6 to 8 impose dependencies between $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, $R^{min}$ and $R^{max}$ that may produce a complex "chain" of updates. As shown in Figure 9, ChainedUpdates can update multiple terms and can update the same term more than once. Consider a case where we call ChainedUpdates after LCPA improves $V^{min}$. Say the procedure updates $L^{min}$ in Step 4. If ChainedUpdates updates $R^{max}$ in Step 6, then it will also update $V^{max}$, $L^{min}$, $L^{max}$, and $V^{min}$. However, if ChainedUpdates does not update $R^{max}$ in Step 6, then it will not update $V^{max}$, $L^{min}$, $L^{max}$, or $V^{min}$, and will terminate.

Algorithm 5
ChainedUpdates
Input: $\ell_0$ penalty parameter $C_0$; initial bounds $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, $R^{min}$, $R^{max}$ on $V(\lambda^*)$, $l(\lambda^*)$ and $\|\lambda^*\|_0$

1: repeat
2:   $V^{min} \leftarrow \max(V^{min},\ L^{min} + C_0 R^{min})$  (update lower bound on $V(\lambda^*)$)
3:   $V^{max} \leftarrow \min(V^{max},\ L^{max} + C_0 R^{max})$  (update upper bound on $V(\lambda^*)$)
4:   $L^{min} \leftarrow \max(L^{min},\ V^{min} - C_0 R^{max})$  (update lower bound on $l(\lambda^*)$)
5:   $L^{max} \leftarrow \min(L^{max},\ V^{max} - C_0 R^{min})$  (update upper bound on $l(\lambda^*)$)
6:   $R^{max} \leftarrow \min\left(R^{max},\ \left\lfloor \frac{V^{max} - L^{min}}{C_0} \right\rfloor\right)$  (update upper bound on $\|\lambda^*\|_0$)
7: until there are no more bound updates due to Steps 2 to 6

Output: $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, $R^{min}$, $R^{max}$

In light of these dependencies, Algorithm 5 applies Propositions 6 to 8 until it can no longer improve $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$ or $R^{max}$. This ensures that ChainedUpdates will return the strongest possible bounds regardless of the term that was first updated.
Figure 9:
All possible "chains" of updates in ChainedUpdates. Circles represent "source" terms that can be updated by LCPA to trigger ChainedUpdates. The path from each source term shows all bounds that can be updated by the procedure. The number on each arrow references the update step in Algorithm 5.
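To illustrate, the fixed-point computation in Algorithm 5 might be sketched as follows (our own minimal Python sketch, not the authors' implementation; the bound names follow Algorithm 5):

```python
import math

def chained_updates(b, C0):
    """Apply Steps 2-6 of Algorithm 5 repeatedly until no bound changes.
    b: dict with keys V_min, V_max, L_min, L_max, R_min, R_max."""
    changed = True
    while changed:
        changed = False

        def tighten(key, value, lower):
            # Raise a lower bound (lower=True) or reduce an upper bound.
            nonlocal changed
            new = max(b[key], value) if lower else min(b[key], value)
            if new != b[key]:
                b[key], changed = new, True

        tighten("V_min", b["L_min"] + C0 * b["R_min"], lower=True)    # Step 2
        tighten("V_max", b["L_max"] + C0 * b["R_max"], lower=False)   # Step 3
        tighten("L_min", b["V_min"] - C0 * b["R_max"], lower=True)    # Step 4
        tighten("L_max", b["V_max"] - C0 * b["R_min"], lower=False)   # Step 5
        tighten("R_max", math.floor((b["V_max"] - b["L_min"]) / C0),
                lower=False)                                          # Step 6
    return b
```

For example, with $C_0 = 0.1$, an improved upper bound $V^{max} = 0.5$ first tightens $L^{max}$ (Step 5) and then $R^{max}$ (Step 6) in a single chain.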
In our implementation, we call ChainedUpdates whenever LCPA improves $V^{max}$ or $V^{min}$ (i.e., Steps 10 or 19 of Algorithm 2). If ChainedUpdates improves any bounds, we pass this information back to the MIP solver by updating the bounds on the auxiliary variables in the RiskSlimLP formulation in Definition 3. As shown in Figure 10, this strategy can considerably improve the lower bound and optimality gap over the course of LCPA. In particular, the bounds from ChainedUpdates restrict the feasible region of the surrogate LP. This ensures that LCPA produces stronger lower bounds on the optimal value of the risk score problem, which reduces the optimality gap and may benefit the search process in other ways (e.g., by pruning suboptimal nodes).

Figure 10:
Performance profile of LCPA in a basic implementation (black) and with ChainedUpdates (red). Results reflect performance on a RiskSlimMINLP instance for a synthetic dataset with d = 30 and n = 50,000 (see Appendix D).

In Algorithm 6, we present an initialization procedure for LCPA. The procedure aims to speed up LCPA by generating: a collection of cutting planes; a good integer solution; and non-trivial bounds on the values of the objective, loss, and number of non-zero coefficients. It combines all of the techniques from this section, as follows:

1. Run CPA on RiskSlimLP: We apply traditional CPA to solve RiskSlimLP until a user-specified stopping condition is met. We store: (i) the cuts from RiskSlimLP (which will be used as an initial cutting-plane approximation for LCPA); (ii) the lower bound from CPA on the objective value of RiskSlimLP (which represents a lower bound $V^{min}$ on the optimal value of RiskSlimMINLP).

2. Sequential Rounding and Polishing: We collect the solutions produced at each iteration of CPA. For each solution, we run SequentialRounding to obtain an integer solution for RiskSlimMINLP, and then polish it using DCD. We use the best solution found so far to update the upper bound $V^{max}$ on the optimal value of RiskSlimMINLP.

3. Chained Updates: Having obtained non-trivial bounds on $V^{min}$ and $V^{max}$, we update the bounds on key quantities using ChainedUpdates.

In Figure 11, we show how the initialization procedure in Algorithm 6 improves the lower bound and the optimality gap over the course of LCPA.

Algorithm 6 Initialization Procedure for LCPA
Input: training data $(x_i, y_i)_{i=1}^n$; coefficient set $\mathcal{L}$; $\ell_0$ penalty parameter $C_0$; initial bounds $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, $R^{min}$, $R^{max}$ on $V(\lambda^*)$, $l(\lambda^*)$ and $\|\lambda^*\|_0$; time limit $T^{max}$ for CPA on RiskSlimLP

Initialize: $\hat{l}(\lambda) \leftarrow \{0\}$ (initial approximation of loss function); $P^{int} \leftarrow \emptyset$ (initial collection of integer solutions)

Step I: Solve RiskSlimLP with CPA
  Solve RiskSlimLP$(\hat{l}, \operatorname{conv}(\mathcal{L}))$ using CPA (Algorithm 1)
  $k \leftarrow$ number of cuts added by CPA within time limit $T^{max}$
  $\hat{l}^{initial} \leftarrow \hat{l}^k$ (store cuts from each iteration)
  $P^{real} \leftarrow \{\lambda^t\}_{t=1}^k$ (store solutions from each iteration)
  $V^{min} \leftarrow$ lower bound from CPA (a lower bound for RiskSlimLP is a lower bound for RiskSlimMINLP)

Step II: Round and Polish Non-Integer Solutions from CPA
  for each $\lambda^{real} \in P^{real}$ do
    $\lambda^{sr} \leftarrow$ SequentialRounding$(\lambda^{real}, \mathcal{L}, C_0)$ (Algorithm 4)
    $\lambda^{dcd} \leftarrow$ DCD$(\lambda^{sr}, \mathcal{L}, C_0)$ (Algorithm 3)
    $P^{int} \leftarrow P^{int} \cup \{\lambda^{dcd}\}$ (store polished integer solutions)
  end for
  $\lambda^{best} \leftarrow \operatorname{argmin}_{\lambda \in P^{int}} V(\lambda)$
  $V^{max} \leftarrow V(\lambda^{best})$ (best integer solution produces upper bound for $V(\lambda^*)$)

Step III: Update Bounds on Objective Terms
  $(V^{min}, \ldots, R^{max}) \leftarrow$ ChainedUpdates$(V^{min}, \ldots, R^{max}, C_0)$ (Algorithm 5)

Output: $\lambda^{best}$, $\hat{l}^{initial}(\lambda)$, $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, $R^{min}$, $R^{max}$

Figure 11:
Performance profile of LCPA in a basic implementation (black) and with the initialization procedure in Algorithm 6 (red). Results reflect performance on a RiskSlimMINLP instance for a synthetic dataset with d = 30 and n = 50,000 (see Appendix D for details).

In this section, we present techniques to reduce data-related computation for the risk score problem.
The first technique is designed to speed up the evaluation of the loss function and its gradient, which reduces runtime when we compute cut parameters (3) and call the rounding and polishing procedures in Section 4.1. The technique requires that the features $x_i$ and coefficients $\lambda$ belong to sets that are bounded, discrete, and regularly spaced, such as $x_i \in \mathcal{X} \subseteq \{0, 1\}^d$ and $\lambda \in \mathcal{L} \subseteq \{-5, \ldots, 5\}^{d+1}$.

Evaluating the logistic loss, $\log(1 + \exp(-\langle \lambda, y_i x_i \rangle))$, is a relatively expensive operation because it involves exponentiation and must be carried out in multiple steps to avoid numerical overflow/underflow when the scores $s_i = \langle \lambda, y_i x_i \rangle$ are too small or too large. When the training data and coefficients belong to discrete bounded sets, the scores $s_i = \langle \lambda, y_i x_i \rangle$ belong to a discrete and bounded set $S = \{ \langle \lambda, y_i x_i \rangle \mid i = 1, \ldots, n \text{ and } \lambda \in \mathcal{L} \}$. If the elements of the feature set $\mathcal{X}$ and the coefficient set $\mathcal{L}$ are regularly spaced, then the scores belong to the set of integers $S \subseteq \mathbb{Z} \cap [s^{min}, s^{max}]$ where:
$$s^{min} = \min_{i, \lambda} \{ \langle \lambda, y_i x_i \rangle \mid (x_i, y_i) \in \mathcal{D} \text{ and } \lambda \in \mathcal{L} \},$$
$$s^{max} = \max_{i, \lambda} \{ \langle \lambda, y_i x_i \rangle \mid (x_i, y_i) \in \mathcal{D} \text{ and } \lambda \in \mathcal{L} \}.$$

Thus, we can precompute and store all possible values of the loss function in a lookup table with $s^{max} - s^{min} + 1$ rows, where row $m$ contains the value of $\log(1 + \exp(-(m + s^{min} - 1)))$. The lookup table is cached in memory, which yields a substantial speedup. In addition, when $R^{max}$ is updated over the course of LCPA, the lookup table can be further reduced by recomputing $s^{min}$ and $s^{max}$, and limiting the entries to values between $s^{min}$ and $s^{max}$. The values of $s^{min}$ and $s^{max}$ can be computed in O(n) time, so the update is not expensive.

4. The value of $\exp(s)$ can be computed reliably using IEEE 754 double-precision floating-point numbers for $s$ roughly in $[-745, 709]$; $\exp(s)$ underflows to 0 when $s$ is below this range and overflows to $\infty$ when $s$ is above it.

We now describe a subsampling technique to reduce computation for rounding heuristics that require multiple evaluations of the loss function (e.g.,
SequentialRounding). Ideally, we would want to run such heuristics as frequently as possible because each run may output a solution that updates the incumbent solution (i.e., the current best solution to the risk score problem). In practice, however, this may slow down LCPA since each run requires multiple evaluations of the loss, and runs that fail to update the incumbent amount to wasted computation. If, for example, we ran SequentialRounding each time the MIP solver found a set of non-integer coefficients in LCPA (i.e., Step 15 of Algorithm 2), many rounded solutions would not update the incumbent, and we would have wasted too much time rounding without necessarily finding a better solution.

Our technique aims to reduce the overhead of calling heuristics by running them on a smaller dataset $\mathcal{D}_m$ built by sampling $m$ points without replacement from the training dataset $\mathcal{D}_n$. In what follows, we present probabilistic guarantees to choose the number of samples $m$ so that an incumbent update using $\mathcal{D}_m$ guarantees an incumbent update using $\mathcal{D}_n$. To clarify when the loss and objective are computed using $\mathcal{D}_m$ or $\mathcal{D}_n$, we let $l_i(\lambda) = \log(1 + \exp(-\langle \lambda, y_i x_i \rangle))$ and define:
$$l_m(\lambda) = \frac{1}{m} \sum_{i=1}^m l_i(\lambda), \qquad V_m(\lambda) = l_m(\lambda) + C_0 \|\lambda\|_0,$$
$$l_n(\lambda) = \frac{1}{n} \sum_{i=1}^n l_i(\lambda), \qquad V_n(\lambda) = l_n(\lambda) + C_0 \|\lambda\|_0.$$

Consider a case where a heuristic returns a promising solution $\lambda^{hr}$ such that:
$$V_m(\lambda^{hr}) < V^{max}. \tag{8}$$
In this case, we compute the objective on the full training dataset $\mathcal{D}_n$ by evaluating the loss for each of the $n - m$ points that were not included in $\mathcal{D}_m$. We then update the incumbent solution if $\lambda^{hr}$ attains an objective value that is less than the current upper bound:
$$V_n(\lambda^{hr}) < V^{max}. \tag{9}$$
Although this strategy requires evaluating the loss for the full training dataset to validate an incumbent update, it still reduces data-related computation since rounding heuristics typically require multiple evaluations of the loss (e.g., SequentialRounding, which requires $d^2 + 3d + 2$ evaluations).

To guarantee that any solution that updates the incumbent when the objective is evaluated with $\mathcal{D}_m$ will also update the incumbent when the objective is evaluated with $\mathcal{D}_n$ (i.e., that any solution that satisfies (8) will also satisfy (9)), we can use the generalization bound from Theorem 9.

Theorem 9 (Generalization of Sampled Loss on Finite Coefficient Set)
Let $\mathcal{D}_n = (x_i, y_i)_{i=1}^n$ denote a training dataset with $n$ points, $\mathcal{D}_m = (x_i, y_i)_{i=1}^m$ denote a sample of $m$ points drawn without replacement from $\mathcal{D}_n$, and $\lambda$ denote the coefficients of a linear classifier from a finite set $\mathcal{L}$. For all $\varepsilon > 0$, it holds that
$$\Pr\left( \max_{\lambda \in \mathcal{L}} \left( l_n(\lambda) - l_m(\lambda) \right) \geq \varepsilon \right) \leq |\mathcal{L}| \exp\left( \frac{-2\varepsilon^2}{\left(\frac{1}{m}\right)\left(1 - \frac{m}{n}\right) \Delta_{max}^2(\mathcal{L}, \mathcal{D}_n)} \right),$$
where $\Delta_{max}(\mathcal{L}, \mathcal{D}_n) = \max_{\lambda \in \mathcal{L}} \left( \max_{i=1,\ldots,n} l_i(\lambda) - \min_{i=1,\ldots,n} l_i(\lambda) \right)$.

Theorem 9 is derived from a concentration inequality for sampling without replacement, called the Hoeffding-Serfling inequality (see Bardenet and Maillard, 2015). The Hoeffding-Serfling inequality can be significantly tighter than the classical Hoeffding inequality as it ensures that $\Pr(l_n(\lambda) - l_m(\lambda) \geq \varepsilon) \to 0$ as $m \to n$ for all $\varepsilon > 0$. Here, $\Delta_{max}(\mathcal{L}, \mathcal{D}_n)$ is a normalization term that represents the maximum range of the loss, and it can be computed quickly using the training dataset $\mathcal{D}_n$ and coefficient set $\mathcal{L}$ as shown in Proposition 5 in Section 4.2.

In general machine learning settings, the $|\mathcal{L}|$ term in Theorem 9 would yield a vacuous bound. In this setting, however, rounding ensures that $\mathcal{L}$ contains at most $2^d$ elements, which produces a well-defined bound on the difference between $l_n(\lambda)$ and $l_m(\lambda)$. As a result, Theorem 9 can be used to assess the probability that a proposed incumbent update leads to an actual incumbent update (see Corollary 10). Alternatively, it can be used to set the sample size $m$ so that an incumbent update on $\mathcal{D}_m$ is likely to yield an incumbent update on $\mathcal{D}_n$. In practice, the bound in Theorem 9 can be tightened by recomputing the normalization term $\Delta_{max}(\mathcal{L}, \mathcal{D}_n)$ over the course of LCPA. This can be done for each real-valued solution $\rho$, or when the MIP solver restricts the set of feasible coefficients $\mathcal{L}$.

Corollary 10 (Update Probabilities of Rounding Heuristics on Subsampled Data)
Consider a rounding heuristic that takes as input a vector of real-valued coefficients $\rho = (\rho_0, \ldots, \rho_d) \in \operatorname{conv}(\mathcal{L})$ and outputs a vector of rounded coefficients $\lambda \in \mathcal{L}_\rho$ where
$$\mathcal{L}_\rho = \left\{ \lambda \in \mathcal{L} \mid \lambda_j \in \{\lceil \rho_j \rceil, \lfloor \rho_j \rfloor\} \text{ for } j = 1, \ldots, d \right\}.$$
If we evaluate the rounding heuristic using $m$ points $\mathcal{D}_m = (x_i, y_i)_{i=1}^m$ drawn without replacement from $\mathcal{D}_n = (x_i, y_i)_{i=1}^n$ and the rounded coefficients $\lambda \in \mathcal{L}_\rho$ attain an objective value $V_m(\lambda)$, then for any $\delta$, with probability at least $1 - \delta$, we have that
$$V_m(\lambda) < V^{max} - \varepsilon_\delta \implies V_n(\lambda) \leq V^{max},$$
where
$$\varepsilon_\delta = \Delta_{max}(\mathcal{L}_\rho, \mathcal{D}_n) \sqrt{ \frac{\log(1/\delta) + d \log(2)}{2} \left( \frac{1}{m} \right) \left( 1 - \frac{m}{n} \right) }.$$
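For intuition, the tolerance $\varepsilon_\delta$ in Corollary 10 can be computed as follows (a sketch with names of our own choosing; `delta_max` stands in for $\Delta_{max}(\mathcal{L}_\rho, \mathcal{D}_n)$):

```python
import math

def epsilon_delta(delta, d, m, n, delta_max):
    """Tolerance from Corollary 10: if a rounded solution beats the incumbent
    objective V_max by more than this margin on the m-point subsample, then
    with probability >= 1 - delta it also beats V_max on all n points."""
    return delta_max * math.sqrt(
        (math.log(1.0 / delta) + d * math.log(2.0)) / 2.0  # covers |L_rho| <= 2^d candidates
        * (1.0 / m)
        * (1.0 - m / n)                                    # Hoeffding-Serfling factor
    )
```

Note that $\varepsilon_\delta \to 0$ as $m \to n$, so validating on larger subsamples requires a smaller safety margin.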
5. Experiments
In this section, we compare the performance of methods to create risk scores. We have three goals: (i) to benchmark the performance and computation of our approach on real-world datasets; (ii) to highlight pitfalls of traditional approaches used in practice; and (iii) to present new approaches that address the pitfalls of traditional approaches.
We considered 6 publicly available datasets shown in Table 2. We chose these datasets to allow for comparisons with other work, and to see how methods are affected by factors such as class imbalance, the number of features, and feature encoding. For each dataset, we fit risk scores using RiskSLIM and 6 baseline methods that post-processed the coefficients of the best logistic regression model built using Lasso, Ridge or Elastic Net. We used each method to fit a risk score with small integer coefficients $\lambda_j \in \{-5, \ldots, 5\}$ that obeys the model size constraint $\|\lambda\|_0 \leq R^{max}$. We considered target model sizes $R^{max} \in \{2, \ldots, 10\}$ to benchmark each method across a range of model sizes used in practice.

Dataset         n       d     Pr(y_i = 1)   Conditions for y_i = 1      Reference
income          32561   36
mammo           961     14    46.3%         person has breast cancer    Elter et al. (2007)
mushroom        8124    113
rearrest        22530   48
spambase        4601    57
telemarketing   41188   57
Table 2:
Datasets used in Section 5. All datasets are available on the UCI repository (Bache and Lichman, 2013), other than rearrest, which must be requested from ICPSR. We processed each dataset by dropping examples with missing values, and by binarizing categorical variables and some real-valued variables. We provide all processed datasets and the code to process rearrest at http://github.com/ustunb/risk-slim.
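The processing described in the caption (dropping examples with missing values, then binarizing categorical variables into indicator features) can be sketched generically as follows (an illustration of the steps, not the authors' processing code; all names are ours):

```python
def preprocess(examples, categorical_keys):
    """Drop examples with missing values, then expand each categorical
    variable into one 0/1 indicator feature per observed category."""
    complete = [ex for ex in examples if None not in ex.values()]
    # Observed categories per categorical variable, in a stable order.
    categories = {k: sorted({ex[k] for ex in complete}) for k in categorical_keys}
    processed = []
    for ex in complete:
        row = {k: v for k, v in ex.items() if k not in categorical_keys}
        for k in categorical_keys:
            for c in categories[k]:
                row[f"{k}_{c}"] = 1 if ex[k] == c else 0
        processed.append(row)
    return processed
```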
RiskSLIM
We formulated an instance of RiskSlimMINLP with the constraints: $\lambda_0 \in \{-10, \ldots, 10\}$, $\lambda_j \in \{-5, \ldots, 5\}$, and $\|\lambda\|_0 \leq R^{max}$. We set the trade-off parameter to a small value $C_0 = 10^{-6}$ to recover the best model under these constraints (see Appendix B). We solved each instance for at most 20 minutes on a single 3.33 GHz CPU with 16 GB RAM using CPLEX 12.6.3 (ILOG, 2017). Our LCPA implementation includes the improvements in Section 4 and is available online at http://github.com/ustunb/risk-slim.
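For reference, the objective that each RiskSlimMINLP instance minimizes can be evaluated directly for a candidate integer coefficient vector (a minimal sketch under the conventions used here, not the solver itself: each $x_i$ includes a leading 1 for the intercept $\lambda_0$, which is excluded from the $\ell_0$ penalty):

```python
import math

def risk_slim_objective(lam, X, y, C0):
    """V(lam) = normalized logistic loss + C0 * ||lam||_0 (intercept excluded)."""
    n = len(y)
    loss = sum(
        math.log1p(math.exp(-yi * sum(lj * xj for lj, xj in zip(lam, xi))))
        for xi, yi in zip(X, y)
    ) / n
    num_nonzero = sum(1 for lj in lam[1:] if lj != 0)
    return loss + C0 * num_nonzero
```

A production implementation would also guard against overflow in the exponential for large margins, e.g., via the lookup-table technique described earlier in Section 4.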
Penalized Logistic Regression

PLR is the best logistic regression model produced over the full regularization path using a weighted combination of the $\ell_1$ and $\ell_2$ penalties (i.e., the best model produced by Lasso, Ridge or Elastic Net). We train PLR models using the glmnet package of Friedman et al. (2010). The coefficients of each model are the solution to the optimization problem:
$$\min_{\lambda \in \mathbb{R}^{d+1}} \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-\langle \lambda, y_i x_i \rangle)) + \gamma \cdot \left( \alpha \|\lambda\|_1 + (1 - \alpha) \|\lambda\|_2^2 \right),$$
where $\alpha \in [0, 1]$ is the elastic-net mixing parameter and $\gamma \geq 0$ is a regularization penalty. We fit PLR models by choosing 1,100 combinations of $(\alpha, \gamma)$: 11 values of $\alpha \in \{0.0, 0.1, \ldots, 0.9, 1.0\}$ × 100 values of $\gamma$ (chosen automatically by glmnet for each $\alpha$). This free parameter grid ensures that the 1,100 PLR models include models produced by the following variants of logistic regression:

• Lasso ($\ell_1$-penalty), which corresponds to PLR when $\alpha = 1.0$.
• Ridge ($\ell_2$-penalty), which corresponds to PLR when $\alpha = 0.0$.
• Standard Logistic Regression, which corresponds to PLR when $\alpha = 0.0$ and $\gamma$ is small.

Traditional Approaches
While there is considerable variation in how risk scores are developed in practice, many models are developed using a two-step approach: (i) fit a sparse logistic regression model with real-valued coefficients; (ii) convert this model into a risk score with integer coefficients. We consider three methods that adopt this approach. Each method first trains a PLR model (i.e., the one that maximizes the 5-CV AUC and obeys the model size constraint), and then converts this model into a risk score by applying a common rounding heuristic:

• PLR≫Rd (Rounding): We round each coefficient to the nearest integer in $\{-5, \ldots, 5\}$ by setting $\lambda_j \leftarrow \lceil \min(\max(\lambda_j, -5), 5) \rfloor$, and round the intercept as $\lambda_0 \leftarrow \lceil \lambda_0 \rfloor$ (where $\lceil \cdot \rfloor$ denotes rounding to the nearest integer).

• PLR≫Unit (Unit Weighting): We round each coefficient to $\pm 1$ by setting $\lambda_j \leftarrow \operatorname{sign}(\lambda_j) \cdot 1[\lambda_j \neq 0]$. Unit weighting is a common heuristic in medicine and criminal justice (see e.g., Antman et al., 2000; Kessler et al., 2005; US Department of Justice, 2005; Duwe and Kim, 2016), and is sometimes called the Burgess method (as it was first proposed by Burgess, 1928).

• PLR≫RsRd (Rescaled Rounding): We first rescale coefficients so that the largest coefficient is $\pm 5$, then round each coefficient to the nearest integer (i.e., $\lambda_j \rightarrow \lceil \gamma \lambda_j \rfloor$ where $\gamma = 5 / \max_j |\lambda_j|$). Rescaling is commonly used to avoid rounding small coefficients to zero, which happens when $|\lambda_j| < 0.5$.

New Pooled Approaches
We also propose three new methods that use a pooling strategy and the loss-minimizing heuristics from Section 4. Each method generates a pool of PLR models with real-valued coefficients, applies the same post-processing procedure to each model in the pool, then selects the best risk score among feasible risk scores. The methods include:

• PooledRd (Pooled PLR + Rounding): We fit a pool of 1,100 models using PLR. For each model in the pool, we round each coefficient to the nearest integer in $\{-5, \ldots, 5\}$ by setting $\lambda_j \leftarrow \lceil \min(\max(\lambda_j, -5), 5) \rfloor$, and round the intercept as $\lambda_0 \leftarrow \lceil \lambda_0 \rfloor$.

• PooledRd* (Pooled PLR + Rounding + Polishing): We fit a pool of 1,100 models using PooledRd. For each model in the pool, we polish the rounded coefficients using DCD.

• PooledSeqRd* (Pooled PLR + Sequential Rounding + Polishing): We fit a pool of 1,100 models using PLR. For each model in the pool, we round the coefficients using SequentialRounding and then polish the rounded coefficients using DCD.

To ensure that the polishing step in PooledRd* and PooledSeqRd* does not increase the number of non-zero coefficients (which would violate the model size constraint), we run DCD on the set of non-zero coefficients $\{j \mid \lambda_j \neq 0\}$.

While other pooled methods can be designed by combining rounding and polishing methods, we omit certain variations for the sake of clarity, namely: (i) methods that use rescaled rounding and unit weighting (as they perform worse than naïve rounding when we use pooling); (ii) methods that use SequentialRounding without DCD polishing (as it performs worse than PooledSeqRd*). We report results for some of these variations on a different real-world dataset in Section 6 (i.e., PooledRsRd, PooledRsRd*, PooledSeqRd).

Notes on the Choice of Methods
We also considered training risk scores by directly solving RiskSlimMINLP with a commercial MINLP solver, but could not recover models that performed well on any dataset due to computational issues (discussed also in Section 3.3). We did not consider methods to fit scoring systems for decision-making (see Section 1.1 for a list) since these models do not output risk predictions. Ustun and Rudin (2016) and Zeng et al. (2016) present experiments comparing the decision-making performance of scoring systems and other popular classifiers on these datasets.
Reliability Diagrams
We evaluate the calibration of each risk score by plotting a reliability diagram, which shows how the predicted risk (x-axis) matches the observed risk (y-axis) for each distinct score (DeGroot and Fienberg, 1983). We compute the observed risk at a score of $s$ as
$$\bar{p}_s = \frac{1}{|\{i : s_i = s\}|} \sum_{i : s_i = s} 1[y_i = +1].$$
If a model has over 30 distinct scores, we group them into 10 bins before plotting the reliability diagram. A model with perfect calibration outputs predictions that are perfectly aligned with observed risk, as shown by a reliability diagram where all points lie on the $x = y$ line.

Summary Statistics
We report the following summary statistics for each risk score:

• Calibration Error, computed as $\text{CAL} = \frac{1}{n} \sum_s \sum_{i : s_i = s} |p_i - \bar{p}_s|$, where $p_i$ is the predicted risk of example $i$, and $\bar{p}_s$ is the observed risk for all examples with a score of $s$. CAL is the expected calibration error over the reliability diagram (Naeini et al., 2014).

• Area under the ROC curve, computed as $\text{AUC} = \frac{1}{n^+ n^-} \sum_{i : y_i = +1} \sum_{k : y_k = -1} 1[s_i > s_k]$, where $n^+ = |\{i : y_i = +1\}|$ and $n^- = |\{i : y_i = -1\}|$. Note that trivial models (i.e., models that predict one class) achieve the best possible CAL (0.0%) but poor AUC (0.5).

• Logistic Loss, computed as $\text{Loss} = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-s_i))$. The loss reflects the objective value of the risk score problem when $C_0$ is small. We report the loss to see if minimizing the objective value of the risk score problem improves CAL and AUC.

• Model Size: the number of non-zero coefficients excluding the intercept, $\sum_{j=1}^d 1[\lambda_j \neq 0]$.

Model Selection
We use nested cross-validation to estimate the test performance of PooledRd, PooledRd*, and PooledSeqRd*, as these methods tune the parameters of a final risk score using 5-CV statistics. RiskSLIM can produce an unbiased performance estimate using standard 5-CV because it does not require parameter tuning. This is also true for traditional methods such as PLR≫Rd, PLR≫RsRd, and PLR≫Unit, as the parameter tuning step occurs before post-processing.
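For reference, the CAL and AUC statistics defined under Summary Statistics can be computed as follows (a minimal sketch; the function names are ours, and ties between a positive and a negative score are counted as 1/2 so that a trivial single-score model gets AUC 0.5 as noted above):

```python
def cal(scores, y, predicted_risk):
    """Expected calibration error: mean |p_i - pbar_s|, where pbar_s is the
    observed fraction of positives among examples with score s."""
    outcomes = {}
    for s, yi in zip(scores, y):
        outcomes.setdefault(s, []).append(1 if yi == +1 else 0)
    pbar = {s: sum(v) / len(v) for s, v in outcomes.items()}
    return sum(abs(p - pbar[s]) for s, p in zip(scores, predicted_risk)) / len(y)

def auc(scores, y):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, yi in zip(scores, y) if yi == +1]
    neg = [s for s, yi in zip(scores, y) if yi == -1]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))
```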
On the Performance of Risk Scores
We compare the performance of RiskSLIM to traditional approaches in Figure 12 and to our pooled approaches in Figure 13. These results show that RiskSLIM models consistently attain better calibration and AUC than alternatives. These results are reflected more precisely for risk scores with a target model size of $R^{max} = 5$ in Table 3. Here, RiskSLIM has the best 5-CV mean test CAL on 5/6 datasets, the best 5-CV mean test AUC on 5/6 datasets, and no method has better test CAL and test AUC than RiskSLIM.

We make two observations to explain the empirical performance of risk scores: (i) models that attain low values of the logistic loss have good calibration (see Figures 12 and 13 and the empirical results of e.g., Caruana and Niculescu-Mizil 2004, 2006); (ii) since we are fitting models from a simple class, risk scores tend to generalize (see the test CAL/AUC and training CAL/AUC of RiskSLIM models in Figures 16 to 21, and of other risk scores in Table 9 in Appendix E).

Since RiskSLIM models optimize the loss under exact constraints on model form, they attain minimal or near-minimal values of the loss. Thus, they perform well in terms of training CAL/AUC as per (i) and test CAL/AUC as per (ii). These observations also explain why methods that use loss-minimizing heuristics produce risk scores with better CAL and AUC than those that do not (e.g., PooledRd* has better test CAL/AUC than PooledRd since DCD polishing can only reduce the loss).
5. It is well-known that when 5-CV statistics are used to set the free parameters of the final model, these statistics produce an overly optimistic estimate of test performance (see Cawley and Talbot, 2010). To avoid this bias, the 5 fold-based models trained to construct 5-CV estimates must have their parameters set in the same way as the final model. Explicitly, for each of the 5 fold-based models, the free parameters must be set by running an inner 5-CV, and choosing a free parameter instance that satisfies the model size constraint and maximizes the inner 5-CV mean test AUC.
[Figure 12: grid of plots with one row per dataset (income, mammo, mushroom, rearrest, spambase, telemarketing) and one column per metric: training loss (optimization metric), 5-CV mean test CAL (performance metric), and 5-CV mean test AUC (selection metric), each plotted against target model size for PLR≫Rd, PLR≫RsRd, PLR≫Unit, and RiskSLIM.]

Figure 12:
Summary statistics for risk scores built using RiskSLIM and traditional approaches. Each point represents the best risk score with integer coefficients $\lambda_j \in \{-5, \ldots, 5\}$ and model size $\|\lambda\|_0 \leq R^{max}$ for $R^{max} \in \{2, \ldots, 10\}$. We show the variation in 5-CV mean test CAL and AUC for each method by shading the range between the 5-CV minimum and maximum. The black line in each plot is a baseline, which shows the performance of a single PLR model with real-valued coefficients and no model size constraint.
[Figure 13: grid of plots with the same layout as Figure 12, comparing PooledRd, PooledRd*, PooledSeqRd*, and RiskSLIM.]

Figure 13:
Summary statistics for risk scores built using RiskSLIM and pooled approaches. Each point represents the best risk score with integer coefficients $\lambda_j \in \{-5, \ldots, 5\}$ and model size $\|\lambda\|_0 \leq R^{max}$ for $R^{max} \in \{2, \ldots, 10\}$. We show the variation in 5-CV mean test CAL and AUC for each method by shading the range between the 5-CV minimum and maximum. The black line in each plot is a baseline, which shows the performance of a single PLR model with real-valued coefficients and no model size constraint.
[Table 3: for each dataset (income: n = 32561, d = 36; mammo: n = 961, d = 14; mushroom: n = 8124, d = 113; rearrest: n = 22530, d = 48; spambase: n = 4601, d = 57; telemarketing: n = 41188, d = 57), rows report the test cal, test auc, loss value, model size, and opt. gap of the traditional approaches (PLR≫Rd, PLR≫RsRd, PLR≫Unit), the pooled approaches (PooledRd, PooledRd*, PooledSeqRd*), and RiskSLIM.]

Table 3:
Summary statistics for risk scores with integer coefficients λ_j ∈ {−5, …, 5} for a model size constraint ‖λ‖₀ ≤ 5. Here: test cal is the 5-CV mean test CAL; test auc is the 5-CV mean test AUC; model size and loss value pertain to a final model fit using the entire dataset. For each dataset, we highlight the method that attains the best test cal, auc, and loss value in green. We also highlight methods that produce trivial models in red.

On the Caveats of CAL
The PLR▷Rd risk score for telemarketing in Table 3 highlights a key shortcoming of CAL that illustrates why we report AUC: trivial and near-trivial models can have misleadingly low CAL. This also shows why we choose free parameters that maximize the 5-CV mean test AUC rather than CAL: choosing free parameters to minimize the 5-CV mean test CAL can result in a trivial model. Here, PLR▷Rd rounds all coefficients other than the intercept to zero, producing a model that trivially assigns a constant score to all examples, s_i = ⟨λ, x_i⟩ = λ₀. Since there is only one score, the predicted risk is the same for all points, p_i ≈ 11%, which is close to the prevalence p̄ = Pr(y_i = +1).

On Calibration Issues of Risk Scores
The reliability diagrams in Figure 14 highlight two issues with respect to the calibration of risk scores that are difficult to capture using a summary statistic:
• Monotonicity violations in observed risk. For example, the reliability diagrams for PLR▷Rd on spambase, or PooledRd and PooledSeqRd* on mushroom, show that the observed risk does not increase monotonically with predicted risk.
• Irregular spacing and coverage of predicted risk. For example, the PLR▷Rd risk score for income outputs risk predictions that range between only 20% and 60%, and the PLR▷RsRd risk score for rearrest produces risk predictions that are clustered at the end points.
The results in Figure 14 suggest that such issues can be mitigated by optimizing the logistic loss (see e.g., the calibration of risk scores built using RiskSLIM, PooledSeqRd, and PooledSeqRd*, where integer coefficients are determined by directly optimizing the logistic loss). In contrast, such issues are difficult to address by additional post-processing. Consider, for example, using Platt scaling (Platt et al., 1999) to improve the calibration of the PLR▷Rd and PLR▷RsRd risk scores in Figure 14. As shown in Figure 15, Platt scaling improves calibration by centering and spreading risk estimates over the reliability diagram. However, it does not fix issues that were introduced by earlier heuristics, such as monotonicity violations, a lack of coverage in risk predictions, and low AUC.
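The two calibration tools discussed here, a CAL-style calibration error and Platt scaling, can be sketched in a few lines. This is our own illustrative code, not the paper's implementation; the exact definition of CAL may differ in detail, and the function names are our own.

```python
import numpy as np

def cal_error(predicted, y):
    """CAL-style calibration error: group examples by distinct predicted
    risk, then average |observed risk - predicted risk| weighted by group
    size. (An assumption: the paper's exact metric may differ slightly.)"""
    predicted = np.asarray(predicted, dtype=float)
    y01 = (np.asarray(y) > 0).astype(float)   # map labels {-1,+1} to {0,1}
    err = 0.0
    for p in np.unique(predicted):
        mask = predicted == p
        err += mask.sum() * abs(y01[mask].mean() - p)
    return err / len(y01)

def platt_scale(scores, y, n_iter=2000, lr=0.5):
    """Platt scaling: fit p = sigmoid(a*score + b) by gradient descent on
    the logistic loss, then use (a, b) to recalibrate the scores."""
    scores = np.asarray(scores, dtype=float)
    y01 = (np.asarray(y) > 0).astype(float)
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        a -= lr * np.mean((p - y01) * scores)
        b -= lr * np.mean(p - y01)
    return a, b

# An under-confident model: at score +1 the observed risk is 0.9, at score
# -1 it is 0.1, but sigmoid(+/-1) only predicts 0.73/0.27.
scores = np.array([1.0] * 10 + [-1.0] * 10)
y = np.array([+1] * 9 + [-1] + [+1] + [-1] * 9)
a, b = platt_scale(scores, y)   # a approaches log(9) ~ 2.20, b approaches 0
```

Note that Platt scaling only rescales and shifts the score before the sigmoid; it can spread risk estimates over the reliability diagram but cannot reorder them, which is why monotonicity violations in observed risk survive the transformation.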
[Figure 14 shows reliability diagrams (observed risk vs. predicted risk) for each dataset (income, mammo, mushroom, rearrest, spambase, telemarketing) and each method (PLR▷Rd, PLR▷RsRd, PLR▷Unit, PooledRd, PooledRd*, PooledSeqRd*, RiskSLIM).]

Figure 14:
Reliability diagrams for risk scores with integer coefficients λ_j ∈ {−5, …, 5} for a model size constraint ‖λ‖₀ ≤ 5. We plot results for models from each fold on test data in grey, and for the final model on training data in black.
[Figure 15 shows reliability diagrams for income and rearrest, comparing PLR▷Rd (raw and with Platt scaling), PLR▷RsRd (raw and with Platt scaling), and RiskSLIM.]

Figure 15:
Reliability diagrams for PLR▷Rd and PLR▷RsRd risk scores with and without Platt scaling, and for RiskSLIM. Platt scaling improves calibration by centering and spreading risk predictions. However, it cannot overcome calibration issues introduced by rounding heuristics, such as a lack of decision points (e.g., PLR▷Rd on income) or monotonicity violations (e.g., PLR▷RsRd on rearrest).

On the Pitfalls of Traditional Approaches
Our results in Figure 12 and Table 3 show that risk scores built using traditional approaches perform poorly in terms of CAL and AUC. In particular:
• Rounding (PLR▷Rd) produces risk scores with low AUC when it eliminates features from the model by rounding small coefficients |λ_j| < 0.5 to zero.
• Rescaled rounding (PLR▷RsRd) greatly reduces CAL since the logistic loss is not scale-invariant (see e.g., the reliability diagram for PLR▷RsRd in Figure 14).
• Unit weighting (PLR▷Unit) results in poor calibration and unpredictable behavior (e.g., risk scores with more features can perform worse, as seen in the PLR▷Unit models for income and telemarketing in Figure 12).
The effects of rescaled rounding and unit weighting on calibration are reflected by highly suboptimal values of the loss in Table 3 and Figure 12. These issues are often overlooked as their effect on AUC is far less severe (e.g., rescaling is recommended by US Department of Justice, 2005; Pennsylvania Commission on Sentencing, 2012).
Our baseline methods may not match the exact methods used in practice, as they do not reflect the significant human input used in risk score development (e.g., domain experts perform preliminary feature selection, round coefficients, or choose a scaling factor before rounding, manually and/or without validation, as shown in Appendix C). Nevertheless, these results highlight two major pitfalls of traditional approaches, namely:
• Traditional approaches heuristically post-process a single model. This means that they fail whenever a heuristic dramatically changes CAL or AUC.
• Traditional approaches use heuristics that are oblivious to the value of the loss function, which results in poor calibration.

On Pooled Approaches
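Both the traditional post-processing heuristics and the loss-sensitive SequentialRounding heuristic used by the pooled approaches can be sketched in a few lines. This is our own simplified reimplementation for illustration, not the reference code:

```python
import numpy as np

def naive_round(coefs):
    """Rounding (PLR -> Rd): round each coefficient to the nearest integer.
    Small coefficients are rounded to zero, which can remove features."""
    return np.round(coefs)

def rescaled_round(coefs, max_coef=5):
    """Rescaled rounding (PLR -> RsRd): scale so the largest coefficient
    maps to max_coef, then round. The logistic loss is not scale-invariant,
    so this can badly miscalibrate the model."""
    coefs = np.asarray(coefs, dtype=float)
    return np.round(max_coef * coefs / np.max(np.abs(coefs)))

def unit_weight(coefs):
    """Unit weighting (PLR -> Unit): keep only the signs of coefficients."""
    return np.sign(coefs)

def sequential_round(coefs, loss_fn):
    """Sketch of a loss-sensitive SequentialRounding heuristic: round one
    coefficient at a time, always picking the coordinate and direction
    (floor or ceil) that yields the smallest value of loss_fn."""
    lam = np.array(coefs, dtype=float)
    def with_value(j, v):
        out = lam.copy()
        out[j] = v
        return out
    to_round = [j for j in range(len(lam)) if lam[j] != np.round(lam[j])]
    while to_round:
        j, v = min(((j, v) for j in to_round
                    for v in (np.floor(lam[j]), np.ceil(lam[j]))),
                   key=lambda jv: loss_fn(with_value(*jv)))
        lam[j] = v
        to_round.remove(j)
    return lam

coefs = np.array([1.8, 0.4, -0.3])
naive_round(coefs)      # the two small coefficients vanish
rescaled_round(coefs)   # scaled to [5., 1., -1.]
```

Pooling simply applies one of these post-processors to every model in a pool of PLR fits and keeps the best resulting risk score, which is why a single badly rounded model no longer sinks the whole method.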
Our results suggest that risk scores built using our pooled approaches attain considerably better calibration and rank accuracy than those built using traditional approaches. These methods aim to overcome the pitfalls of traditional approaches using two strategies:
• Pooling, which generates a pool of PLR models, post-processes each model to produce a pool of risk scores, and selects the best risk score within the pool. Pooling provides robustness when post-processing with heuristics that can dramatically alter performance (e.g., rounding, as it is very unlikely that the coefficients of all models in the pool will be rounded to zero). The performance gain due to pooling can be seen by comparing the results for PLR▷Rd to PooledRd in Table 3.
• Loss-Sensitive Heuristics, such as SequentialRounding and DCD, which produce a pool of risk scores that attain lower values of the loss, and thereby let us select a risk score with better CAL and AUC. The performance gain due to loss-sensitive heuristics can be seen by comparing the results for PooledRd to PooledRd* in Table 3.
The main shortcomings of pooled methods are that they are difficult to implement, do not provide formal optimality or feasibility guarantees (particularly when handling non-trivial constraints), and may require massive computation depending on the post-processing heuristics. We discuss these shortcomings in greater detail in Section 6.
The fact that RiskSLIM produces risk scores with lower loss than the pooled approaches shows that direct optimization can efficiently recover models that may not be found via massive post-processing (i.e., where we train all possible ℓ₁- and ℓ₂-penalized logistic regression models, and convert them to risk scores by applying specially designed heuristics). Here, we have shown that the latter strategy may still produce risk scores that perform well. In Section 6, we show that the difference in performance is significant in the presence of non-trivial constraints.

On Computation
Although the risk score problem is NP-hard, we trained RiskSLIM models that were certifiably optimal or had small optimality gaps for all datasets in under 20 minutes, using an LCPA implementation with the improvements in Section 4. Even when LCPA did not recover a certifiably optimal solution, it produced a risk score that performed well and did not exhibit the calibration issues of models built heuristically.
In general, the time spent computing cutting planes is a small portion of the overall runtime for LCPA. Since this data-related computation scales linearly with sample size, we expect to obtain similar results even if the datasets had far more samples.
Our experiments revealed several factors that affect the time to recover a certifiably optimal solution, namely:
• Highly Correlated Features: Subsets of redundant features produce multiple optima, which increases the size of the B&B tree.
• Feature Encoding: In particular, the problem is harder when the dataset includes real-valued variables, like those in the spambase dataset.
• Difficulty of the Learning Problem: On separable problems such as mushroom, it is easy to recover a certifiably optimal solution since many solutions perform well and produce a near-optimal lower bound.
[Figure 16 risk score for rearrest:
1. Prior Arrests ≥ …: … point(s)
2. Prior Arrests ≥ …: … point(s)
3. Prior Arrests for Local Ordinance: 1 point
4. Age at Release between 18 and 24: 1 point
5. Age at Release ≥ 40: −1 point
SCORE ∈ {−1, 0, 1, 2, 3, 4} mapped to RISK]

Figure 16: RiskSLIM model for rearrest. RISK is the predicted probability that a prisoner is arrested within 3 years of release from prison. This model has a 5-CV mean test CAL/AUC of 1.7%/0.697 and training CAL/AUC of 2.6%/0.701.
[Figure 17 risk score for income:
1. Married: 3 points
2. Reported Capital Gains: 2 points
3. Age 22 to 29: −1 point
4. Highest Level of Education is High School Diploma: −2 points
5. No High School Diploma: −3 points
SCORE ∈ {−4 to −1, 0, 1, 2, 3, 4, 5} mapped to RISK]

Figure 17: RiskSLIM model for income. RISK is the predicted probability that a US resident earns over $50,000. This model has a 5-CV mean test CAL/AUC of 2.4%/0.854 and training CAL/AUC of 4.1%/0.860.
[Figure 18 risk score for telemarketing:
1. Call between January and March: 1 point
2. Called Previously: 1 point
3. Previous Call was Successful: 1 point
4. Employment Indicator < …: … point(s)
5. 3 Month Euribor Rate ≥ 100: −1 point
SCORE ∈ {−1, 0, 1, 2, 3, 4} mapped to RISK]

Figure 18: RiskSLIM model for telemarketing. RISK is the predicted probability that a client opens a new bank account after a marketing call. This model has a 5-CV mean test CAL/AUC of 1.3%/0.760 and a training CAL/AUC of 1.1%/0.760.
[Figure 19 risk score for mushroom:
1. odor = foul: 5 points
2. gill size = broad: −3 points
3. odor = almond: −5 points
4. odor = anise: −5 points
5. odor = none: −5 points
SCORE ∈ {−8, −5, −3, 2 to 5} mapped to RISK]

Figure 19: RiskSLIM model for mushroom. RISK is the predicted probability that a mushroom is poisonous. This model has a 5-CV mean test CAL/AUC of 1.8%/0.989 and a training CAL/AUC of 1.0%/0.990.
[Figure 20 risk score for mammo:
1. IrregularShape: 1 point
2. Age ≥ 60: 1 point
3. OvalShape: −1 point
4. ObscuredMargin: −1 point
5. CircumscribedMargin: −2 points
SCORE ∈ {−3, −2, −1, 0, 1, 2} mapped to RISK]

Figure 20: RiskSLIM model for mammo. RISK is the predicted probability that a mammogram pertains to a patient with breast cancer. This model has a 5-CV mean test CAL/AUC of 5.0%/0.843 and a training CAL/AUC of 3.1%/0.849.
[Figure 21 risk score for spambase (integer points multiply real-valued frequency features):
1. CharacterFrequency DollarSign × … points
2. …
3. WordFrequency Free × … points
4. WordFrequency HP × −2 points
5. WordFrequency George × −5 points
SCORE bins (≤ −1.2, −1.2 to −0.4, −0.4 to 0.2, 0.2 to 0.6, 0.6 to 1.0, 1.0 to 1.4, 1.4 to 1.8, 1.9 to 2.4, 2.4 to 3.2, ≥ …) mapped to RISK]

Figure 21: RiskSLIM model for spambase. RISK is the predicted probability that an e-mail is spam. This model has a 5-CV mean test CAL/AUC of 11.7%/0.928 and a training CAL/AUC of 12.3%/0.935.
6. ICU Seizure Prediction
In this section, we describe a collaboration with the Massachusetts General Hospital where we built a customized risk score for ICU seizure prediction. Our goal is to discuss the performance and practicality of our approach on a real-world problem with non-trivial constraints.
Patients who suffer from traumatic brain injury are monitored via continuous electroencephalography (cEEG). Based on current clinical standards, neurologists undergo extensive training to recognize a large set of patterns in cEEG output (see Hirsch et al., 2013, and Figure 22). They consider the presence and characteristics of cEEG patterns along with other medical information to evaluate a patient's risk of seizure. These risk estimates are used to decide if a patient can be discharged from the ICU or requires further monitoring, and whether to prescribe a medical intervention in order to prevent additional brain injury.
Figure 22: cEEG displays electrical activity at 16 standardized locations in a patient's brain using electrodes placed on the scalp. We show two cEEG patterns: a Generalized Periodic Discharge (GPD), which occurs on both sides of the brain (left); and a Lateralized Periodic Discharge (LPD), which occurs on one side of the brain (right). These figures were reproduced from a presentation at a training module by the American Clinical Neurophysiology Society (2012).
Dataset
Our dataset was derived from an extensive set of cEEG recordings from 41 hospitals, curated by the Critical Care EEG Monitoring Research Consortium. It contains n = 5,427 recordings and d = 87 input variables (see Appendix F for a list). The outcome is defined as y_i = +1 if patient i, who has been in the ICU for the past 24 hours, will have a seizure in the next 24 hours. There is significant class imbalance as Pr(y_i = +1) = 12.5%. The input variables include information on patient medical history, secondary neurological symptoms, and the presence and characteristics of 5 standard cEEG patterns: Lateralized Periodic Discharges (LPD); Lateralized Rhythmic Delta (LRDA); Generalized Periodic Discharges (GPD); Generalized Rhythmic Delta (GRDA); and Bilateral Periodic Discharges (BiPD).

Model Requirements
Our collaborators wanted a risk score to quickly predict seizure risk using the presence and characteristics of cEEG patterns. It was critical for the model to output calibrated risk predictions, since physicians would use the predicted risk to choose between multiple treatment options (i.e., patients may be prescribed different medication based on their predicted risk; see also Van Calster and Vickers, 2015; Shah et al., 2018, for a discussion on how miscalibrated risk predictions can result in harmful decisions). In order to be used and accepted by physicians, it was also important to produce a model that could be easily validated by domain experts and aligned with domain expertise.
In Figure 23, we present a RiskSLIM risk score for this problem that satisfies all of these requirements (see Struck et al., 2017, for details on model development). This risk score outputs calibrated risk estimates at several decision points; obeys a model size constraint (‖λ‖₀ ≤ 6) to let physicians easily validate the model; and obeys monotonicity constraints to ensure that the signs of coefficients are aligned with domain knowledge.
[Figure 23 risk score for seizure prediction:
1. Any cEEG Pattern with Frequency > 2 Hz: 1 point
2. Epileptiform Discharges: 1 point
3. Patterns include LPD or LRDA or BiPD: 1 point
4. Patterns Superimposed with Fast or Sharp Activity: 1 point
5. Prior Seizures: 1 point
6. Brief Rhythmic Discharges: 2 points
SCORE mapped to RISK: <5%, 12%, 27%, 50%, 73%, 88%, >…]

Figure 23: RiskSLIM risk score for ICU seizure prediction (see Struck et al., 2017, for details on model development). This model has a 5-CV mean test CAL/AUC of 2.7%/0.819.
As a follow-up to our work in Struck et al. (2017), our collaborators wanted to see if we could improve the usability of the risk score in Figure 23 without sacrificing too much calibration or rank accuracy. Seeing how the risk score in Figure 23 includes features that could require a physician to check a large number of patterns (e.g., MaxFrequencyOfAnyPattern or PatternsIncludeLPDorLRDAorBIPD), our collaborators sought to improve usability by restricting the model size and specifying operational constraints on feature composition and feature encoding. In turn, our goal was to produce the best risk score that satisfied these constraints, so that our collaborators could correctly evaluate the loss in predictive performance due to their requirements, and make an informed choice regarding model deployment. We present a complete list of their constraints in Appendix F. They can be grouped as follows:
• Limited Model Size: The model had to use at most 4 input variables, so that it would be easy to validate and use in an ICU.
• Monotonicity: The model had to obey monotonicity constraints for well-known risk factors for seizures (e.g., it could not suggest that having prior seizures lowers seizure risk).
• No Redundancy between Categorical Variables: The model had to include variables that were linearly independent (e.g., it could include Male or Female but not both).
• Specific cEEG Patterns or Any cEEG Pattern: The dataset had variables for specific cEEG patterns (e.g., MaxFrequencyLPD) and variables for any cEEG pattern (e.g., MaxFrequencyAnyPattern). The model had to use variables for specific patterns or any pattern, but not both.
• Frequency in Continuous or Thresholded Form: The dataset had two kinds of variables related to the frequency of a cEEG pattern: (i) a real-valued variable (e.g., MaxFrequencyLPD); and (ii) binary threshold variables (e.g., indicators of the form MaxFrequencyLPD ≤ t). Models had to use the real-valued variable or the binary variables, not both.
• Limited Thresholds: To prevent clinicians from having to check multiple thresholds, the model could include at most 2 binary threshold variables for a given cEEG pattern.
Training Setup
We used the same training setup as in Section 5.1, which we adapted to address constraints as follows. We trained a RiskSLIM model by solving an instance of RiskSlimMINLP with all the constraints. This MINLP had 20 additional constraints and 2 additional variables, and was solved to optimality in under 20 minutes. The baseline methods had built-in mechanisms to address monotonicity constraints, but needed tuning to handle the remaining constraints. We used nested 5-CV to evaluate the predictive performance of all models. For each method, we trained a final model using all of the training data for an instance of the free parameters that obeys all constraints and maximizes the mean 5-CV test AUC.
On Performance and Usability in a Constrained Setting
The results in Table 4 show the potential performance benefits of training an optimized risk score for problems with non-trivial constraints. Here, the RiskSLIM model has a 5-CV mean test CAL/AUC of 2.5%/0.801, while the best risk score built using a heuristic method has a 5-CV mean test CAL/AUC of 2.8%/0.754.
Unlike the experiments in Section 5, only RiskSLIM and our pooled methods can produce a feasible risk score. Traditional methods (e.g., PLR▷Rd, PLR▷RsRd, and PLR▷Unit) violate one or more constraints after rounding. As shown in Table 6, these methods cannot produce a risk score with performance comparable to the RiskSLIM risk score even when these constraints are relaxed. If we only consider simple constraints on model size and monotonicity, then risk scores produced by these methods have a test AUC of at most 0.761 (PLR▷RsRd) and a test CAL of at best 7.0% (PLR▷Rd).
As shown in Figures 24 to 30, risk scores that have similar test CAL can still exhibit important differences in terms of calibration. Here, the risk predictions of the RiskSLIM model are monotonic and within the boundaries of the risk predictions of the fold-based models. In comparison, the risk predictions of other models do not increase monotonically with observed risk, and vary significantly across test folds (e.g., PooledRsRd, PooledRsRd*, PooledSeqRd, and PooledSeqRd*). As noted by our collaborators, these issues can affect how physicians will use the model and whether they will accept it (e.g., the monotonicity violations of the PooledSeqRd model suggest that patients with a lower score can have a higher observed risk).
[Table 4 lists, for RiskSLIM, PooledRd, PooledRsRd, PooledSeqRd, PooledRd*, PooledRsRd*, and PooledSeqRd*: test CAL, test AUC, model size, loss value, optimality gap, train CAL, and train AUC.]

Table 4: Performance of risk scores for seizure prediction that satisfy all constraints. We report the 5-CV mean test CAL and 5-CV mean test AUC. The ranges in each cell represent the 5-CV minimum and maximum. We present the risk scores built using each method in Figures 24 to 30.
[Table 5 lists, for RiskSLIM, PooledRd, PooledRd*, PooledRsRd, PooledRsRd*, PooledSeqRd, and PooledSeqRd*: training requirements and the % of instances that satisfy each type of constraint.]

Table 5: Training requirements and constraint violations for the methods in Table 4. Each instance is a unique combination of free parameters.
[Table 6 (columns: Method, Constraints Violated, Test CAL, Test AUC, Model Size, Loss Value, Train CAL, Train AUC):

Method | Constraints Violated | Test CAL | Model Size
PLR | Model Size, Integrality, Monotonicity, Operational | 2.6% | 20 to 35
PLR | Model Size, Integrality, Operational | 2.7% | 16 to 41
PLR | Integrality, Operational | 4.4% |
PLR▷Rd | Operational | 7.0% |
PLR▷RsRd | Operational | 12.4% |
PLR▷Unit | Operational | 24.6% | ]

Table 6: Performance of baseline models that violate one or more of the constraints.
The RiskSLIM model requires physicians to scan cEEG output for 3 patterns (that is, LPD, BriefRhythmicDischarges, and EpiletiformDischarge). In comparison, other risk scores may not necessarily be more usable, since they all include PatternsIncludeBiPDorLRDAorLPD, which can require physicians to check cEEG output for 3 patterns (in the worst case). The PooledRsRd, PooledSeqRd, and PooledSeqRd* risk scores also include MaxFrequencyLPD, which requires recording the frequency of LPD and therefore takes more time. Note that it is possible, as a simple extension of our approach, to fine-tune usability by penalizing each feature based on how difficult it is for a physician to evaluate it.
On the Practical Value of a Certificate of Optimality
The results in Tables 4 and 6 illustrate how heuristics can lead practitioners to overestimate the trade-offs between performance and feasibility with respect to real-world constraints. Here: three traditional methods could not output a feasible risk score; six pooled methods produced feasible risk scores with suboptimal AUC and calibration issues; and a baseline PLR model with real-valued coefficients has an AUC of 0.742 (see Table 6). Based on these results, a practitioner could easily conclude that there did not exist a feasible risk score that achieves a test AUC of 0.801.
In contrast, RiskSLIM models are paired with an optimality gap. In practice, a small optimality gap suggests that we have trained the best possible risk score that satisfies a specific set of constraints. Thus, if a risk score with a small optimality gap performs poorly on training data, and the model generalizes (as evidenced by comparing training performance to K-CV performance), then a practitioner can attribute the performance deficit of the model to restrictive constraints, and improve performance by relaxing them.
Our approach provides a general mechanism to evaluate the effect of constraints on predictive performance. Say, for example, that our collaborators were not satisfied with the performance or usability of our model; then we could train certifiably optimal risk scores for different sets of constraints. By comparing the performance of certifiably optimal risk scores, our collaborators could evaluate the impact of their requirements on predictive performance, and navigate these trade-offs in an informed manner. This approach was useful when our collaborators were interested in choosing between a model with 4 features or 5 features. Here, we trained a RiskSLIM risk score with 5 features, which has a 5-CV test CAL/AUC of 3.4%/0.816. However, the slight improvement in test AUC did not outweigh the fact that the model included a feature that would increase the worst-case evaluation time (i.e., MaxFreqFactorAnyPattern).

On the Challenges of Handling Operational Constraints
Table 5 shows some of the practical benefits of methods that can address constraints without the need for parameter tuning or post-processing. Since our approach can directly incorporate these constraints into the MINLP formulation, all RiskSLIM models are feasible with respect to these constraints. Thus, we can produce a feasible risk score and estimate its predictive performance by training 6 models: 1 final model trained on the full dataset for deployment, and 5 models trained on subsets of the training data to produce an unbiased performance estimate for the final model via 5-fold CV.
In contrast, the pooled methods produce a feasible model by post-processing a large pool of models and discarding models that are infeasible. Since we must then choose between feasible models on the basis of 5-CV performance, we must use a nested CV setup to pair any model with an unbiased performance estimate. As shown in Table 5, this requires fitting a total of 33,000 models (i.e., a nested CV setup with 5 outer folds, 5 inner folds, and 1,100 free parameter instances requires fitting 1,100 × 5 × (5 + 1) = 33,000 models). There is no guarantee that pooled methods will produce a feasible risk score. As shown in Table 5, for instance, only 12% of the instances for the pooled methods satisfied all constraints.
Our results highlight several other complications with methods that aim to address hard constraints by parameter tuning. Since such constraints also affect model performance, these parameters should ideally be set on the basis of predictive performance, using a K-fold CV setup. In this case, we would train models on K validation folds for each instance of the free parameters, choose the instance of the free parameters that maximizes the mean K-CV test AUC among the instances that satisfied all constraints, and then train a "final model" for this instance. Unfortunately, there is no guarantee that the final model will obey all constraints.

On the Benefits of Risk Scores with Small Integer Coefficients
Figures 24 to 30 illustrate some of the practical benefits of risk scores with small integer coefficients. When input variables belong to a small discrete set, scores also belong to a small discrete set. This reduces the number of operating points on the ROC curve and reliability diagram, which makes it easier to pick an operating point. Further, when input variables are binary, the decision rule at each operating point can be represented as a Boolean function. For the RiskSLIM model in Figure 24, for example, the decision rule at one operating point can be written as:

ŷ_i = +1 if BriefRhythmicDischarge ∨ PatternsIncludeLPD ∨ EpiletiformDischarge

Small integer coefficients let users extract such rules by listing the conditions under which the score exceeds the threshold. This is more challenging when a model uses real-valued coefficients, as shown by the score function of the PLR model from Table 4:

score = λ₀ + λ₁ · PatternsIncludeBiPDorLRDAorLPD + λ₂ · PriorSeizure + λ₃ · MaxFrequencyLPD,

where the λ_j are real-valued. In this case, extracting a Boolean function is difficult, as computing the score involves arithmetic with real-valued coefficients and the real-valued variable MaxFrequencyLPD.
[Figure 24 risk score:
1. BriefRhythmicDischarge: 2 points
2. PatternsIncludeLPD: 2 points
3. PriorSeizure: 1 point
4. EpiletiformDischarge: 1 point
SCORE (0 to 6) mapped to RISK]

Figure 24: RiskSLIM risk score (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for the final model on training data in black, and the 5 fold models on test data in grey. This model has a 5-CV mean test CAL/AUC of 2.5%/0.801.
[Figure 25 risk score:
1. BriefRhythmicDischarge: 1 point
2. PatternsIncludeBiPDorLRDAorLPD: 1 point
SCORE (0 to 2) mapped to RISK]

Figure 25: PooledRd risk score (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for the final model on training data in black, and results for the fold models on test data in grey. This model has a 5-CV mean test CAL/AUC of 5.3%/0.740.
[Figure 26 risk score:
1. BriefRhythmicDischarge: 3 points
2. PatternsIncludeBiPDorLRDAorLPD: 2 points
SCORE ∈ {0, 2, 3, 5} mapped to RISK]

Figure 26: PooledRd* risk score (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for the final model on training data in black, and results for the fold models on test data in grey. This model has a 5-CV mean test CAL/AUC of 3.0%/0.745.
[Figure 27 risk score:
1. PatternsIncludeBiPDorLRDAorLPD: 5 points
2. PriorSeizure: 1 point
3. MaxFreqFactorLPD × … points
SCORE mapped to RISK]

Figure 27: PooledRsRd model (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for fold models on test data in grey, and for the final model on training data in black. This model has a 5-CV mean test CAL/AUC of 12.2%/0.736.
[Figure 28 risk score:
1. PatternsIncludeBiPDorLRDAorLPD: 5 points
2. PriorSeizure: 3 points
3. MaxFreqFactorLPD × … points
SCORE mapped to RISK]

Figure 28: PooledRsRd* model (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for fold models on test data in grey, and for the final model on training data in black. This model has a 5-CV mean test CAL/AUC of 6.2%/0.731.
[Figure 29 risk score:
1. PriorSeizure: 1 point
2. PatternsIncludeBiPDorLRDAorLPD: 1 point
3. MaxFreqFactorLPD × … points
SCORE mapped to RISK]

Figure 29: PooledSeqRd model (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for fold models on test data in grey, and for the final model on training data in black. This model has a 5-CV mean test CAL/AUC of 4.2%/0.738.
[Figure 30: PooledSeqRd* model (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. Score items shown include: 1. PriorSeizure (1 point); 2. PatternsIncludeBiPD or LRDA or LPD (1 point); 3. MaxFreqFactorLPD (×). We plot results for fold models on test data in grey, and for the final model on training data in black. This model has a 5-CV mean test CAL/AUC of 2.8%/0.745.]
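A risk score like the ones above is used by adding up points for the active items and mapping the total score to a predicted risk through the logistic function. A minimal sketch (the item names echo the tables above, but the point values shown and the intercept are hypothetical, since the fitted score-to-risk tables are not reproduced here):

```python
import math

def predicted_risk(points, active_features, intercept=-4):
    """Score a patient: add points for active features, map the score to a risk.

    `points` maps feature name -> integer points; `intercept` is a hypothetical
    offset (in a fitted risk score it is learned from data).
    """
    score = sum(points[f] for f in active_features if f in points)
    return 1.0 / (1.0 + math.exp(-(score + intercept)))

# Illustrative two-item score loosely modeled on the tables above.
points = {"PatternsIncludeBiPD_LRDA_LPD": 2, "BriefRhythmicDischarge": 3}
low = predicted_risk(points, [])
high = predicted_risk(points, ["PatternsIncludeBiPD_LRDA_LPD", "BriefRhythmicDischarge"])
```

Every additional point moves the patient up the same monotone score-to-risk curve, which is why the printed tables in the figures only need one row per score value.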
7. Discussion
Risk scores are simple models that are widely used to inform consequential decisions, ranging from mortality prediction to sentencing. In domains such as medicine and criminal justice, practitioners build risk scores ad hoc, by implementing training pipelines that combine existing machine learning methods with simple heuristics (see Appendix C for the pipeline used in Antman et al., 2000). When a pipeline fails to produce a risk score that satisfies important requirements, practitioners resolve these issues by making manual adjustments. Alternatively, a feasible risk score is specified by a panel of experts, and data is only used to evaluate its performance (see e.g., Gage et al., 2001; McGinley and Pearse, 2012).

Ad hoc approaches do not produce risk scores with performance guarantees, and may therefore result in the deployment of risk scores with poor calibration or rank accuracy (which impact the quality of decision-making when these tools are used in consequential applications). Ad hoc model development also has significant practical costs: the need to implement a custom training pipeline slows down model development, and hinders the ability to perform follow-up analyses to inform stakeholders as to whether a risk score should be deployed (see e.g., the development and analysis of the sentencing risk score in Pennsylvania Bulletin, 2017, which was commissioned in 2010).

Our goal in this paper was to design a machine learning approach to build risk scores that standardizes model development, reduces the need for domain experts to specify models manually, and pairs models with useful formal guarantees. Our proposed approach trains risk scores by solving an empirical risk minimization problem that performs exact feature selection while restricting coefficients to small integers and enforcing application-specific constraints. Since commercial solvers did not handle this problem well off-the-shelf, we solved it using a new cutting plane algorithm (LCPA), which allows training to scale linearly with the number of samples, and can be used to train models for other problems with non-convex regularizers and non-convex constraints.

As shown in Sections 5 and 6, LCPA can train certifiably optimal risk scores for real-world datasets within minutes. These models attain best-in-class calibration and rank accuracy, and avoid pitfalls of heuristics used in practice. Our approach simplifies and standardizes model development by allowing practitioners to build customized risk scores without parameter tuning or post-processing. In addition, it pairs models with a certificate of optimality, which can be used to understand how application-specific constraints affect predictive performance.
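The cutting-plane idea at the heart of this approach can be illustrated in a few lines: the loss is replaced by a piecewise-linear lower bound built from supporting hyperplanes (cuts), and the surrogate is minimized over the integer-feasible set. The toy sketch below illustrates this mechanism only; it uses brute-force enumeration in place of a MIP solver and is not the LCPA algorithm from the paper (data and coefficient bounds are hypothetical):

```python
import itertools
import math

# Toy data: two features, labels in {-1, +1}, logistic loss over integer coefficients.
X = [(1.0, 0.5), (-1.0, 1.0), (0.5, -1.0), (-0.5, -0.5)]
y = [1, -1, 1, -1]
grid = list(itertools.product(range(-5, 6), repeat=2))  # integer-feasible set

def loss_and_grad(lam):
    n, g, val = len(X), [0.0, 0.0], 0.0
    for xi, yi in zip(X, y):
        s = yi * (lam[0] * xi[0] + lam[1] * xi[1])
        val += math.log1p(math.exp(-s)) / n
        p = -1.0 / (1.0 + math.exp(s)) / n  # d/ds log(1 + exp(-s)), averaged
        g[0] += p * yi * xi[0]
        g[1] += p * yi * xi[1]
    return val, g

cuts = []      # each cut: loss(lam_t) + <grad_t, lam - lam_t>, a lower bound on the loss
lam = (0, 0)
for _ in range(10):
    v, g = loss_and_grad(lam)
    cuts.append((v, g, lam))
    # Minimize the surrogate (max over cuts) over the integer grid;
    # enumeration stands in for the MIP solve used in practice.
    lam = min(grid, key=lambda z: max(
        c[0] + c[1][0] * (z[0] - c[2][0]) + c[1][1] * (z[1] - c[2][1]) for c in cuts))

true_opt = min(loss_and_grad(z)[0] for z in grid)
```

Because each cut under-estimates the convex loss, the surrogate minimum is a valid lower bound while the best visited loss is an upper bound; the gap between the two is exactly the kind of optimality certificate discussed above.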
Acknowledgments
We gratefully acknowledge support from Siemens, Philips, and Wistron. We would like to thank Paul Rubin for helpful discussions, and our collaborators Aaron Struck and Brandon Westover for their guidance on the seizure prediction application.
References
Alba, Ana Carolina, Thomas Agoritsas, Michael Walsh, Steven Hanna, Alfonso Iorio, PJ Devereaux, Thomas McGinn, and Gordon Guyatt. Discrimination and calibration of clinical prediction models: Users' guides to the medical literature. JAMA, 318(14):1377–1384, 2017.

American Clinical Neurophysiology Society. Standardized Critical Care EEG Terminology Training Module, 2012.

Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Angelino, Elaine, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning certifiably optimal rule lists for categorical data. Journal of Machine Learning Research, 18(234):1–78, 2018.

Antman, Elliott M, Marc Cohen, Peter JLM Bernink, Carolyn H McCabe, Thomas Horacek, Gary Papuchis, Branco Mautner, Ramon Corbalan, David Radley, and Eugene Braunwald. The TIMI risk score for unstable angina/non–ST elevation MI. The Journal of the American Medical Association, 284(7):835–842, 2000.

Atkinson, David S and Pravin M Vaidya. A cutting plane algorithm for convex programming that uses analytic centers. Mathematical Programming, 69(1-3):1–43, 1995.

Austin, James, Roger Ocker, and Avi Bhati. Kentucky pretrial risk assessment instrument validation. Bureau of Justice Statistics, 2010.

Bache, K. and M. Lichman. UCI Machine Learning Repository, 2013.

Bai, Lihui and Paul A Rubin. Combinatorial Benders cuts for the minimum tollbooth problem. Operations Research, 57(6):1510–1522, 2009.

Bardenet, Rémi and Odalric-Ambrym Maillard. Concentration inequalities for sampling without replacement. Bernoulli, 21(3):1361–1385, 2015.

Beneish, Messod D, Charles MC Lee, and D Craig Nichols. Earnings manipulation and expected returns. Financial Analysts Journal, 2013.

Bertsimas, Dimitris and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.

Bertsimas, Dimitris, Angela King, Rahul Mazumder, and others. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.

Billiet, Lieven, Sabine Van Huffel, and Vanya Van Belle. Interval coded scoring index with interaction effects. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods, pages 33–40. SCITEPRESS, 2016.

Billiet, Lieven, Sabine Van Huffel, and Vanya Van Belle. Interval coded scoring extensions for larger problems. In Computers and Communications (ISCC), 2017 IEEE Symposium on, pages 198–203. IEEE, 2017.

Bobko, Philip, Philip L Roth, and Maury A Buster. The usefulness of unit weights in creating composite scores: A literature review, application to content validity, and meta-analysis. Organizational Research Methods, 10(4):689–709, 2007.

Bonami, Pierre, Mustafa Kilinç, and Jeff Linderoth. Algorithms and software for convex mixed integer nonlinear programs. In Mixed Integer Nonlinear Programming, pages 1–39. Springer, 2012.

Boyd, Stephen P and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Burgess, Ernest W. Factors determining success or failure on parole. The Workings of the Indeterminate Sentence Law and the Parole System in Illinois, pages 221–234, 1928.

Bussieck, Michael R and Stefan Vigerske. MINLP solver software. Wiley Encyclopedia of Operations Research and Management Science, 2010.

Byrd, Richard H, Jorge Nocedal, and Richard A Waltz. KNITRO: An integrated package for nonlinear optimization. In Large-Scale Nonlinear Optimization, pages 35–59. Springer, 2006.

Calmon, Flavio, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, pages 3995–4004, 2017.

Carrizosa, Emilio, Amaya Nogales-Gómez, and Dolores Romero Morales. Strongly agree or strongly disagree?: Rating features in support vector machines. Information Sciences, 329:256–273, 2016.

Caruana, Rich and Alexandru Niculescu-Mizil. Data mining in metric space: An empirical analysis of supervised learning performance criteria. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–78. ACM, 2004.

Caruana, Rich and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161–168. ACM, 2006.

Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730. ACM, 2015.

Cawley, Gavin C and Nicola LC Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079–2107, 2010.

Chen, Chaofan, Oscar Li, Alina Barnett, Jonathan Su, and Cynthia Rudin. This looks like that: Deep learning for interpretable image recognition. arXiv, 2018.

Chevaleyre, Yann, Frederic Koriche, and Jean-Daniel Zucker. Rounding methods for discrete linear classification. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 651–659, 2013.

Cranor, Lorrie Faith and Brian A LaMacchia. Spam! Communications of the ACM, 41(8):74–83, 1998.

Dawes, Robyn M. The robust beauty of improper linear models in decision making. American Psychologist, 34(7):571–582, 1979.

DeGroot, Morris H and Stephen E Fienberg. The comparison and evaluation of forecasters. The Statistician, pages 12–22, 1983.

Duwe, Grant and KiDeuk Kim. Sacrificing accuracy for transparency in recidivism risk assessment: The impact of classification method on predictive performance. Corrections, pages 1–22, 2016.

Einhorn, Hillel J and Robin M Hogarth. Unit weighting schemes for decision making. Organizational Behavior and Human Performance, 13(2):171–192, 1975.

Elter, M, R Schulz-Wendtland, and T Wittenberg. The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Medical Physics, 34:4164, 2007.

Ertekin, Şeyda and Cynthia Rudin. On equivalence relationships between classification and ranking algorithms. The Journal of Machine Learning Research.

Credit Scoring, Response Modeling, and Insurance Rating: A Practical Guide to Forecasting Consumer Behavior. Palgrave Macmillan, 2012.

Franc, Vojtěch and Soeren Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

Franc, Vojtěch and Sören Sonnenburg. Optimized cutting plane algorithm for large-scale risk minimization. The Journal of Machine Learning Research, 10:2157–2192, 2009.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

Gage, Brian F, Amy D Waterman, William Shannon, Michael Boechler, Michael W Rich, and Martha J Radford. Validation of clinical classification schemes for predicting stroke. The Journal of the American Medical Association, 285(22):2864–2870, 2001.

Goel, Sharad, Justin M Rao, and Ravi Shroff. Precinct or prejudice? Understanding racial disparities in New York City's stop-and-frisk policy. Annals of Applied Statistics, 10(1):365–394, 2016.

Goh, Gabriel, Andrew Cotter, Maya Gupta, and Michael P Friedlander. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems, pages 2415–2423, 2016.

Goh, Siong Thye and Cynthia Rudin. Box drawings for learning with imbalanced data. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 333–342. ACM, 2014.

Goldberg, Noam and Jonathan Eckstein. Sparse weighted voting classifier selection and its linear programming relaxations. Information Processing Letters, 112:481–486, 2012.

Gottfredson, Don M and Howard N Snyder. The mathematics of risk classification: Changing data into valid instruments for juvenile courts. NCJ 209158. Office of Juvenile Justice and Delinquency Prevention, Washington, D.C., 2005.

Guan, Wei, Alex Gray, and Sven Leyffer. Mixed-integer support vector machine. In NIPS Workshop on Optimization for Machine Learning, 2009.

Gupta, Maya, Andrew Cotter, Jan Pfeifer, Konstantin Voevodski, Kevin Canini, Alexander Mangylov, Wojciech Moczydlowski, and Alexander Van Esbroeck. Monotonic calibrated interpolated look-up tables. The Journal of Machine Learning Research, 17(1):3790–3836, 2016.

Hirsch, LJ, SM LaRoche, N Gaspard, E Gerard, A Svoronos, ST Herman, R Mani, H Arif, N Jette, Y Minazad, and others. American Clinical Neurophysiology Society's standardized critical care EEG terminology: 2012 version. Journal of Clinical Neurophysiology, 30(1):1–27, 2013.

Holte, Robert C. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63–90, 1993.

Holte, Robert C. Elaboration on two points raised in "Classifier Technology and the Illusion of Progress". Statistical Science, 21(1):24–26, February 2006.

Hübner, Ruth and Anita Schöbel. When is rounding allowed in integer nonlinear optimization? European Journal of Operational Research.

Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

Joachims, Thorsten, Thomas Finley, and Chun-Nam John Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.

Kamishima, Toshihiro, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. Pages 643–650. IEEE, 2011.

Kelley, James E, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.

Kessler, Ronald C, Lenard Adler, Minnie Ames, Olga Demler, Steve Faraone, Eva Hiripi, Mary J Howes, Robert Jin, Kristina Secnik, Thomas Spencer, and others. The World Health Organization adult ADHD self-report scale (ASRS): A short screening scale for use in the general population. Psychological Medicine, 35(02):245–256, 2005.

Kohavi, Ron. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, pages 202–207, 1996.

Kotlowski, Wojciech, Krzysztof J Dembczynski, and Eyke Huellermeier. Bipartite ranking through minimization of univariate loss. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1113–1120, 2011.

Lakkaraju, Himabindu, Stephen H Bach, and Jure Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1675–1684. ACM, 2016.

Latessa, Edward, Paula Smith, Richard Lemke, Matthew Makarios, and Christopher Lowenkamp. Creation and validation of the Ohio Risk Assessment System: Final report, 2009.

Le Gall, Jean-Roger, Stanley Lemeshow, and Fabienne Saulnier. A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. The Journal of the American Medical Association, 270(24):2957–2963, 1993.

Letham, Benjamin, Cynthia Rudin, Tyler H. McCormick, and David Madigan. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Annals of Applied Statistics, 9(3):1350–1371, 2015.

Levin, A Yu. On an algorithm for the minimization of convex functions. In Soviet Mathematics Doklady, volume 160, pages 1244–1247, 1965.

Li, Oscar, Hao Liu, Chaofan Chen, and Cynthia Rudin. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Proceedings of AAAI, 2018.

Liu, Yufeng and Yichao Wu. Variable selection via a combination of the L0 and L1 penalties. Journal of Computational and Graphical Statistics, 16(4), 2007.

Malioutov, Dmitry and Kush Varshney. Exact rule learning via boolean compressed sensing. In Proceedings of the 30th International Conference on Machine Learning, pages 765–773, 2013.

Mangasarian, Olvi L, W Nick Street, and William H Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4):570–577, 1995.

McGinley, Ann and Rupert M Pearse. A national early warning score for acutely ill patients, 2012.

Menon, Aditya Krishna, Xiaoqian J Jiang, Shankar Vembu, Charles Elkan, and Lucila Ohno-Machado. Predicting accurate probabilities with a ranking loss. In Machine Learning: Proceedings of the International Conference, volume 2012, page 703. NIH Public Access, 2012.

Moreno, Rui P, Philipp GH Metnitz, Eduardo Almeida, Barbara Jordan, Peter Bauer, Ricardo Abizanda Campos, Gaetano Iapichino, David Edbrooke, Maurizia Capuzzo, and Jean-Roger Le Gall. SAPS 3 - From evaluation of the patient to evaluation of the intensive care unit. Part 2: Development of a prognostic model for hospital mortality at ICU admission. Intensive Care Medicine, 31(10):1345–1355, 2005.

Moro, Sérgio, Paulo Cortez, and Paulo Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22–31, 2014.

Naeini, Mahdi Pakdaman, Gregory F Cooper, and Milos Hauskrecht. Binary classifier calibration: Non-parametric approach. arXiv preprint arXiv:1401.3390, 2014.

Naoum-Sawaya, Joe and Samir Elhedhli. An interior-point Benders based branch-and-cut algorithm for mixed integer programs. Annals of Operations Research, 210(1):33–55, November 2010.

Nguyen, Hai Thanh and Katrin Franke. A general Lp-norm support vector machine via mixed 0-1 programming. In Machine Learning and Data Mining in Pattern Recognition, pages 40–49. Springer, 2012.

Park, Jaehyun and Stephen Boyd. Concave quadratic cuts for mixed-integer quadratic problems. arXiv preprint arXiv:1510.06421, 2015a.

Park, Jaehyun and Stephen Boyd. A semidefinite programming method for integer convex quadratic minimization. arXiv preprint arXiv:1504.07672, 2015b.

Pennsylvania Bulletin. Sentence Risk Assessment Instrument, April 2017.

Pennsylvania Commission on Sentencing. Interim Report 4: Development of Risk Assessment Scale. Technical report, June 2012.

Piotroski, Joseph D. Value investing: The use of historical financial statement information to separate winners from losers. Journal of Accounting Research, pages 1–41, 2000.

Platt, John and others. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

Pleiss, Geoff, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems, pages 5684–5693, 2017.

Reid, Mark D and Robert C Williamson. Composite binary losses. The Journal of Machine Learning Research, 11:2387–2422, 2010.

Reilly, Brendan M and Arthur T Evans. Translating clinical research into clinical practice: Impact of using prediction rules to make decisions. Annals of Internal Medicine, 144(3):201–209, 2006.

Rudin, Cynthia and Berk Ustun. Optimized scoring systems: Towards trust in machine learning for healthcare and criminal justice. Interfaces (Forthcoming), 2018.

Sato, Toshiki, Yuichi Takano, and Ryuhei Miyashiro. Piecewise-linear approximation for feature subset selection in a sequential logit model. arXiv preprint arXiv:1510.05417, 2015.

Sato, Toshiki, Yuichi Takano, Ryuhei Miyashiro, and Akiko Yoshise. Feature subset selection for logistic regression via mixed integer optimization. Computational Optimization and Applications, pages 1–16, 2016.

Schlimmer, Jeffrey Curtis. Concept acquisition through representational adjustment. 1987.

Shah, Nilay D, Ewout W Steyerberg, and David M Kent. Big data and predictive analytics: Recalibrating expectations. JAMA, 2018.

Siddiqi, Naeem. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, volume 3. John Wiley & Sons, 2012.

Six, AJ, BE Backus, and JC Kelder. Chest pain in the emergency room: Value of the HEART score. Netherlands Heart Journal, 16(6):191–196, 2008.

Sokolovska, Nataliya, Yann Chevaleyre, Karine Clément, and Jean-Daniel Zucker. The fused lasso penalty for learning interpretable medical scoring systems. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 4504–4511. IEEE, 2017.

Sokolovska, Nataliya, Yann Chevaleyre, and Jean-Daniel Zucker. A provable algorithm for learning interpretable scoring systems. In International Conference on Artificial Intelligence and Statistics, pages 566–574, 2018.

Struck, Aaron F, Berk Ustun, Andres Rodriguez Ruiz, Jong Woo Lee, Suzette M LaRoche, Lawrence J Hirsch, Emily J Gilmore, Jan Vlachy, Hiba Arif Haider, Cynthia Rudin, and others. Association of an electroencephalography-based risk score with seizure probability in hospitalized patients. JAMA Neurology, 74(12):1419–1424, 2017.

Teo, Choon Hui, Alex Smola, SVN Vishwanathan, and Quoc Viet Le. A scalable modular convex solver for regularized risk minimization. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 727–736. ACM, 2007.

Teo, Choon Hui, S Vishwanathan, Alex Smola, and Quoc V Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 1:55, 2009.

Than, Martin, Dylan Flaws, Sharon Sanders, Jenny Doust, Paul Glasziou, Jeffery Kline, Sally Aldous, Richard Troughton, Christopher Reid, William A Parsonage, and others. Development and validation of the Emergency Department Assessment of Chest Pain Score and 2 h accelerated diagnostic protocol. Emergency Medicine Australasia, 26(1):34–44, 2014.

US Department of Justice. The Mathematics of Risk Classification: Changing Data into Valid Instruments for Juvenile Courts, 2005.

Ustun, Berk and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349–391, 2016.

Ustun, Berk and Cynthia Rudin. Optimized risk scores. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.

Ustun, Berk, Stefano Traca, and Cynthia Rudin. Supersparse linear integer models for predictive scoring systems. In AAAI Late-Breaking Developments, 2013.

Ustun, Berk, M.B. Westover, Cynthia Rudin, and Matt T. Bianchi. Clinical prediction models for sleep apnea: The importance of medical history over symptoms. Journal of Clinical Sleep Medicine, 12(2):161–168, 2016.

Ustun, Berk, Lenard A Adler, Cynthia Rudin, Stephen V Faraone, Thomas J Spencer, Patricia Berglund, Michael J Gruber, and Ronald C Kessler. The World Health Organization Adult Attention-Deficit/Hyperactivity Disorder Self-Report Screening Scale for DSM-5. JAMA Psychiatry, 74(5):520–526, 2017.

Van Belle, Vanya, Patrick Neven, Vernon Harvey, Sabine Van Huffel, Johan AK Suykens, and Stephen Boyd. Risk group detection and survival function estimation for interval coded survival methods. Neurocomputing, 112:200–210, 2013.

Van Calster, Ben and Andrew J Vickers. Calibration of risk prediction models: Impact on decision-analytic performance. Medical Decision Making, 35(2):162–169, 2015.

Wang, Fulton and Cynthia Rudin. Falling rule lists. In Proceedings of Artificial Intelligence and Statistics (AISTATS), 2015.

Wang, Jiaxuan, Jeeheh Oh, Haozhu Wang, and Jenna Wiens. Learning credible models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, pages 2417–2426, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5552-0.

Wang, Tong, Cynthia Rudin, Finale Doshi-Velez, Yimin Liu, Erica Klampfl, and Perry MacNeille. A Bayesian framework for learning rule sets for interpretable classification. Journal of Machine Learning Research.

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 962–970, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.

Zeng, Jiaming, Berk Ustun, and Cynthia Rudin. Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society: Series A, 2016.

Appendix A. Omitted Proofs
Proof (Remark 4). We first explain why LCPA attains the optimal objective value, and then justify the upper bounds on the number of cuts and the number of nodes.

Observe that LCPA finds the optimal solution to RiskSlimMINLP through an exhaustive search over the feasible region $\mathcal{L}$. Thus, LCPA is bound to encounter the optimal solution, since it only discards a node $(v^t, \mathcal{R}^t)$ if: (i) the surrogate problem is infeasible over $\mathcal{R}^t$ (in which case RiskSlimMINLP is also infeasible over $\mathcal{R}^t$); or (ii) the surrogate problem has an objective value that exceeds $V^{\max}$ (in which case any integer feasible solution in $\mathcal{R}^t$ is also suboptimal).

The bound on the number of cuts follows from the fact that Algorithm 2 only adds cuts at integer feasible solutions, of which there are at most $|\mathcal{L}|$. The bound on the number of processed nodes represents a worst-case limit produced by bounding the depth of the branch-and-bound tree. To do this, we exploit the fact that the splitting rule SplitRegion splits a partition into two mutually exclusive subsets by adding integer-valued bounds such as $\lambda_j \geq \lceil \lambda_j \rceil$ and $\lambda_j \leq \lfloor \lambda_j \rfloor - 1$ on a single coefficient. Consider applying SplitRegion a total of $\Lambda^{\max}_j - \Lambda^{\min}_j + 1$ times in succession on a fixed dimension $j$. This results in a total of $\Lambda^{\max}_j - \Lambda^{\min}_j + 1$ nodes, where each node fixes the coefficient in dimension $j$ to an integer value $\lambda_j \in \{\Lambda^{\min}_j, \ldots, \Lambda^{\max}_j\}$. Pick any node and repeat this process for the coefficients in the remaining dimensions. The resulting B&B tree will have at most $|\mathcal{L}|$ leaf nodes, where $\lambda$ is restricted to integer feasible solutions.

Proof (Proposition 5). Since the coefficient set $\mathcal{L}$ is bounded, the data $(x_i, y_i)_{i=1}^n$ are discrete, and the normalized logistic loss function $l(\lambda)$ is continuous, it follows that the value of $l(\lambda)$ is bounded: $l(\lambda) \in [\min_{\lambda \in \mathcal{L}} l(\lambda),\; \max_{\lambda \in \mathcal{L}} l(\lambda)]$ for all $\lambda \in \mathcal{L}$. Thus we need only show that $L^{\min} \leq \min_{\lambda \in \mathcal{L}} l(\lambda)$ and $L^{\max} \geq \max_{\lambda \in \mathcal{L}} l(\lambda)$.
For the lower bound, we observe that:

$$
\begin{aligned}
\min_{\lambda \in \mathcal{L}} l(\lambda)
&= \min_{\lambda \in \mathcal{L}} \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-\langle \lambda, y_i x_i \rangle)\right) \\
&= \min_{\lambda \in \mathcal{L}} \left[ \frac{1}{n} \sum_{i: y_i = +1} \log\left(1 + \exp(-\langle \lambda, x_i \rangle)\right) + \frac{1}{n} \sum_{i: y_i = -1} \log\left(1 + \exp(\langle \lambda, x_i \rangle)\right) \right] \\
&\geq \frac{1}{n} \sum_{i: y_i = +1} \min_{\lambda \in \mathcal{L}} \log\left(1 + \exp(-\langle \lambda, x_i \rangle)\right) + \frac{1}{n} \sum_{i: y_i = -1} \min_{\lambda \in \mathcal{L}} \log\left(1 + \exp(\langle \lambda, x_i \rangle)\right) \\
&= \frac{1}{n} \sum_{i: y_i = +1} \log\left(1 + \exp\left(-\max_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle\right)\right) + \frac{1}{n} \sum_{i: y_i = -1} \log\left(1 + \exp\left(\min_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle\right)\right) \\
&= \frac{1}{n} \sum_{i: y_i = +1} \log\left(1 + \exp(-s_i^{\max})\right) + \frac{1}{n} \sum_{i: y_i = -1} \log\left(1 + \exp(s_i^{\min})\right) \\
&= L^{\min}.
\end{aligned}
$$

The upper bound can be derived in a similar manner.
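This lower bound is straightforward to check numerically: enumerate a small coefficient set, compute $s_i^{\max}$ and $s_i^{\min}$ for each example, and compare $L^{\min}$ to the exhaustive minimum of the loss. A sketch with synthetic data (dataset and coefficient range are hypothetical); the final two lines also evaluate the model-size bound of Proposition 6 from hypothetical certificate values:

```python
import itertools
import math
import random

random.seed(0)
n, d = 50, 2
X = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
y = [random.choice([-1, 1]) for _ in range(n)]
L = list(itertools.product(range(-3, 4), repeat=d))  # small integer coefficients

def loss(lam):  # normalized logistic loss l(lambda)
    return sum(math.log1p(math.exp(-yi * sum(l * x for l, x in zip(lam, xi))))
               for xi, yi in zip(X, y)) / n

# Extreme scores over L for each example, then L_min as in the derivation.
smax = [max(sum(l * x for l, x in zip(lam, xi)) for lam in L) for xi in X]
smin = [min(sum(l * x for l, x in zip(lam, xi)) for lam in L) for xi in X]
L_min = (sum(math.log1p(math.exp(-smax[i])) for i in range(n) if y[i] == 1) +
         sum(math.log1p(math.exp(smin[i])) for i in range(n) if y[i] == -1)) / n

best_loss = min(loss(lam) for lam in L)  # exhaustive minimum: L_min <= best_loss

# Proposition 6 with hypothetical V_max and C: a bound on the optimal model size.
C, V_max = 1e-2, best_loss + 5 * 1e-2
k_bound = math.floor((V_max - L_min) / C)
```

The same per-example minimization trick is what makes $L^{\min}$ cheap to compute before solving the full problem.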
Proof (Proposition 6). We are given that $V^{\max} \geq V(\lambda^*)$, where $V(\lambda^*) := l(\lambda^*) + C \|\lambda^*\|_0$ by definition. Thus, we can recover the upper bound in Proposition 6 as follows:

$$
\begin{aligned}
l(\lambda^*) + C \|\lambda^*\|_0 &\leq V^{\max}, \\
\|\lambda^*\|_0 &\leq \frac{V^{\max} - l(\lambda^*)}{C}, \\
\|\lambda^*\|_0 &\leq \frac{V^{\max} - L^{\min}}{C}, \qquad (10) \\
\|\lambda^*\|_0 &\leq \left\lfloor \frac{V^{\max} - L^{\min}}{C} \right\rfloor. \qquad (11)
\end{aligned}
$$

Here, (10) follows from the fact that $L^{\min} \leq l(\lambda^*)$ by definition, and (11) follows from the fact that the number of non-zero coefficients is a natural number.

Proof (Proposition 7). We are given that $V^{\max} \geq V(\lambda^*)$, where $V(\lambda^*) := l(\lambda^*) + C \|\lambda^*\|_0$ by definition. Thus, we can recover the upper bound in Proposition 7 as follows:

$$
\begin{aligned}
l(\lambda^*) + C \|\lambda^*\|_0 &\leq V^{\max}, \\
l(\lambda^*) &\leq V^{\max} - C \|\lambda^*\|_0, \\
l(\lambda^*) &\leq V^{\max} - C R^{\min}.
\end{aligned}
$$

Here, the last line follows from the fact that $R^{\min} \leq \|\lambda^*\|_0$ by definition.

Proof (Proposition 8). We are given that $V^{\min} \leq V(\lambda^*)$, where $V(\lambda^*) := l(\lambda^*) + C \|\lambda^*\|_0$ by definition. Thus, we can recover the lower bound in Proposition 8 as follows:

$$
\begin{aligned}
l(\lambda^*) + C \|\lambda^*\|_0 &\geq V^{\min}, \\
l(\lambda^*) &\geq V^{\min} - C \|\lambda^*\|_0, \\
l(\lambda^*) &\geq V^{\min} - C R^{\max}.
\end{aligned}
$$

Here, the last line follows from the fact that $R^{\max} \geq \|\lambda^*\|_0$ by definition.

Proof (Theorem 9). For a fixed coefficient vector $\lambda \in \mathcal{L}$, consider a sample of $n$ points composed of the values of the loss function $l_i(\lambda)$ for each example in the full training dataset $D_n = (x_i, y_i)_{i=1}^n$.
Let $\bar{l}_n(\lambda) = \frac{1}{n} \sum_{i=1}^n l_i(\lambda)$ and $\bar{l}_m(\lambda) = \frac{1}{m} \sum_{i=1}^m l_i(\lambda)$. Then the Hoeffding-Serfling inequality (see e.g., Theorem 2.4 in Bardenet and Maillard, 2015) guarantees the following for all $\varepsilon > 0$:

$$
\Pr\left( \bar{l}_n(\lambda) - \bar{l}_m(\lambda) \geq \varepsilon \right) \leq \exp\left( \frac{-2\varepsilon^2 m}{\left(1 - \frac{m}{n}\right)\left(1 + \frac{m}{n}\right) \Delta(\lambda, D_n)^2} \right),
$$

where $\Delta(\lambda, D_n) = \max_{i=1,\ldots,n} l_i(\lambda) - \min_{i=1,\ldots,n} l_i(\lambda)$.

We recover the desired inequality by generalizing this bound to hold for all $\lambda \in \mathcal{L}$ as follows:

$$
\begin{aligned}
\Pr\left( \max_{\lambda \in \mathcal{L}} \left( \bar{l}_n(\lambda) - \bar{l}_m(\lambda) \right) \geq \varepsilon \right)
&= \Pr\left( \bigcup_{\lambda \in \mathcal{L}} \left\{ \bar{l}_n(\lambda) - \bar{l}_m(\lambda) \geq \varepsilon \right\} \right) \\
&\leq \sum_{\lambda \in \mathcal{L}} \Pr\left( \bar{l}_n(\lambda) - \bar{l}_m(\lambda) \geq \varepsilon \right) \qquad (12) \\
&\leq \sum_{\lambda \in \mathcal{L}} \exp\left( \frac{-2\varepsilon^2 m}{(1 - \frac{m}{n})(1 + \frac{m}{n}) \Delta(\lambda, D_n)^2} \right) \qquad (13) \\
&\leq |\mathcal{L}| \exp\left( \frac{-2\varepsilon^2 m}{(1 - \frac{m}{n})(1 + \frac{m}{n}) \Delta^{\max}(\mathcal{L}, D_n)^2} \right). \qquad (14)
\end{aligned}
$$

Here, (12) follows from the union bound, (13) follows from the Hoeffding-Serfling inequality, and (14) follows from the fact that $\Delta(\lambda, D_n) \leq \Delta^{\max}(\mathcal{L}, D_n)$ for all $\lambda \in \mathcal{L}$.

Proof (Corollary 10). We first show that for any tolerance $\delta > 0$, the prescribed choice of $\varepsilon_\delta$ ensures that $V_n(\lambda) - V_m(\lambda) \leq \varepsilon_\delta$ with probability at least $1 - \delta$. Restating the result of Theorem 9, we have that for any $\varepsilon > 0$:

$$
\Pr\left( \max_{\lambda \in \mathcal{L}} \left( \bar{l}_n(\lambda) - \bar{l}_m(\lambda) \right) \geq \varepsilon \right) \leq |\mathcal{L}| \exp\left( \frac{-2\varepsilon^2 m}{(1 - \frac{m}{n})(1 + \frac{m}{n}) \Delta^{\max}(\mathcal{L}, D_n)^2} \right). \qquad (15)
$$

Note that $\bar{l}_n(\lambda) - \bar{l}_m(\lambda) = V_n(\lambda) - V_m(\lambda)$ for any fixed $\lambda$. In addition, note that the set of rounded coefficients $\mathcal{L}(\rho)$ contains at most $|\mathcal{L}(\rho)| \leq 2^d$ coefficient vectors. Therefore, in this setting, (15) implies that for any $\varepsilon > 0$:

$$
\Pr\left( V_n(\lambda) - V_m(\lambda) \geq \varepsilon \right) \leq 2^d \exp\left( \frac{-2\varepsilon^2 m}{(1 - \frac{m}{n})(1 + \frac{m}{n}) \Delta(\mathcal{L}(\rho), D_n)^2} \right). \qquad (16)
$$

By setting $\varepsilon = \varepsilon_\delta$ and simplifying the terms on the right hand side of (16), we can see that $\Pr\left( V_n(\lambda) - V_m(\lambda) \geq \varepsilon_\delta \right) \leq \delta$.
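The prescribed tolerance $\varepsilon_\delta$ is obtained by inverting this deviation bound. A sketch that computes it numerically, assuming the bound takes the Hoeffding-Serfling form $2^d \exp(-2\varepsilon^2 m / ((1 - \frac{m}{n})(1 + \frac{m}{n}) \Delta^2))$ with $|\mathcal{L}(\rho)| \leq 2^d$; all input values below are hypothetical:

```python
import math

def epsilon_delta(m, n, d, delta, delta_max):
    """Smallest tolerance eps for which the union bound equals delta.

    Solves 2^d * exp(-2 eps^2 m / ((1 - m/n)(1 + m/n) Delta^2)) = delta for eps,
    where Delta = delta_max is the range of the per-example loss.
    """
    slack = (1 - m / n) * (1 + m / n) * delta_max ** 2
    return math.sqrt(slack * (d * math.log(2) - math.log(delta)) / (2 * m))

# Hypothetical setting: a 10% subsample of n = 10,000 points, d = 20 features.
eps = epsilon_delta(m=1000, n=10000, d=20, delta=0.01, delta_max=1.0)
```

Larger subsamples shrink the tolerance, which is why sampling-based bounds of this kind become useful when data-related computation dominates training.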
Thus, the prescribed value of $\varepsilon_\delta$ ensures that $V_n(\lambda) - V_m(\lambda) \leq \varepsilon_\delta$ with probability at least $1 - \delta$.

Since we have set $\varepsilon_\delta$ so that $V_n(\lambda) - V_m(\lambda) \leq \varepsilon_\delta$ with probability at least $1 - \delta$, we now need only show that any $\lambda$ satisfying $V_m(\lambda) < V^{\max} - \varepsilon_\delta$ will also satisfy $V_n(\lambda) < V^{\max}$ to complete the proof. To see this, observe that:

$$
\begin{aligned}
V_n(\lambda) - V_m(\lambda) &\leq \varepsilon_\delta, \\
V_n(\lambda) &\leq V_m(\lambda) + \varepsilon_\delta, \\
V_n(\lambda) &< V^{\max}. \qquad (17)
\end{aligned}
$$

Here, (17) follows from the fact that $V_m(\lambda) < V^{\max} - \varepsilon_\delta \implies V_m(\lambda) + \varepsilon_\delta < V^{\max}$.

Appendix B. Small Regularization Parameters do not Influence Accuracy
In Section 2, we state that if the trade-off parameter $C$ in the objective of the risk score problem is sufficiently small, then its optimal solution will attain the best possible trade-off between logistic loss and sparsity. In what follows, we formalize this statement. For clarity, we omit the intercept term and explicitly show the regularization parameter $C$ in the RiskSLIM objective, so that $V(\lambda; C) := l(\lambda) + C \|\lambda\|_0$. We use some new notation, shown in Table 7.

$M = \arg\min_{\lambda \in \mathcal{L}} l(\lambda)$ : minimizers of the logistic loss
$\mathcal{L}(k) = \{\lambda \in \mathcal{L} \mid \|\lambda\|_0 \leq k\}$ : feasible coefficients of models with model size at most $k$
$M(k) = \arg\min_{\lambda \in \mathcal{L}(k)} l(\lambda)$ : minimizers of the logistic loss among models of size at most $k$
$L(k) = \min_{\lambda \in \mathcal{L}(k)} l(\lambda)$ : logistic loss of minimizers with model size at most $k$
$\lambda^{\text{opt}} \in \arg\min_{\lambda \in M} \|\lambda\|_0$ : sparsest minimizer among all minimizers of the logistic loss
$k^{\text{opt}} = \|\lambda^{\text{opt}}\|_0$ : model size of the sparsest minimizer

Table 7: Notation used in Remarks 11 and 12.
Remark 11 (Minimizers of the Risk Score Problem)
Any optimal solution to the risk score problem will achieve a logistic loss of L(k) for some k ≥ 0:

    min_{λ ∈ ℒ} V(λ; C) = min_{k ∈ {0,1,...,k_opt}} L(k) + Ck.

Remark 12 (Small Trade-Off Parameters Do Not Influence Accuracy)
There exists an integer z ≥ 1 such that if C < (1/z) [ L(k_opt − z) − L(k_opt) ], then λ_opt ∈ argmin_{λ ∈ ℒ} V(λ; C).

Remark 12 states that as long as C is sufficiently small, the regularized objective is minimized by any model λ_opt that has the smallest size among the most accurate feasible models. In other words, there exists a small enough value of C to guarantee that we will obtain the best possible solution. Since we do not know z in advance, and since it is just as difficult to compute L(·) as it is to solve the risk score problem, we cannot know in advance how small C must be to avoid sacrificing sparsity. However, it does not matter which value of C we choose as long as it is sufficiently small. In practice, we set C = 10⁻⁸, which is the smallest value that we can use without running into numerical issues with the MIP solver.

Proof (Remark 11). Since L(k) is the minimal value of the logistic loss over all models with at most k non-zero coefficients, we have:

    L(k) + Ck ≤ l(λ) + Ck for any λ ∈ ℒ(k).    (18)

Denote a feasible minimizer of V(λ; C) as λ′ ∈ argmin_{λ ∈ ℒ} V(λ; C), and let k′ = ‖λ′‖₀. Since λ′ ∈ ℒ(k′), we have that:

    L(k′) + Ck′ ≤ V(λ′; C).    (19)

Taking the minimum of the left-hand side of (19) over k:

    min_{k ∈ {0,1,...,k_opt}} L(k) + Ck ≤ L(k′) + Ck′.    (20)

Combining (19) and (20), we get:

    min_{k ∈ {0,1,...,k_opt}} L(k) + Ck ≤ L(k′) + Ck′ ≤ min_{λ ∈ ℒ} V(λ; C).

If these inequalities were not all equalities, we would have a contradiction with the definition of λ′ and k′ as a minimizer of V(λ; C) and its size. So all must hold with equality, which proves the statement.

Proof (Remark 12). Note that if C = 0, then some minimizer of V(λ; C) has k_opt non-zero coefficients.
Consider increasing the value of C starting from zero until a threshold value C_min0, defined as the smallest value at which the minimizer can sacrifice some loss to remove at least one non-zero coefficient. Let z ≥ 1 be the number of non-zero coefficients removed. At the threshold value C_min0, where we choose the smaller model rather than the one with size k_opt, we have:

    L(k_opt) + C_min0 k_opt ≥ L(k_opt − z) + C_min0 (k_opt − z).

Simplifying, we obtain:

    (1/z) [ L(k_opt − z) − L(k_opt) ] ≤ C_min0.

Thus,
RiskSLIM does not sacrifice sparsity for logistic loss when:

    (1/z) [ L(k_opt − z) − L(k_opt) ] > C.

Here, we know that the value on the left-hand side is greater than 0 because L(·) is decreasing in its argument; if there were no strict decrease, we would have a contradiction with the definition of k_opt as the smallest number of terms of an optimal model. In particular, L(k_opt − z) = L(k_opt) would imply that:

    min_λ V(λ; C_min0) = L(k_opt) + C_min0 k_opt = L(k_opt − z) + C_min0 k_opt ≥ L(k_opt − z) + C_min0 (k_opt − z) ≥ min_λ V(λ; C_min0).

Thus all inequalities are equalities, which implies that z = 0 and contradicts z ≥ 1.

Appendix C. Background on Risk Scores
In Table 8, we present quotes from authors in medicine, criminal justice, and finance to support the claim that risk scores are used because they are easy to use, understand, and validate.
Domain | Reference | Quote
Medicine | Than et al. (2014) | “Ease of use might be facilitated by presenting a rule developed from logistic regression as a score, where the original predictor weights have been converted to integers that are easy to add together... Though less precise than the original regression formula, such presentations are less complex, easier to apply by memory and usable without electronic assistance.”
Criminal Justice | Duwe and Kim (2016) | “It is commonplace... for fine-tuned regression coefficients to be replaced with a simple point system... to promote the easy implementation, transparency, and interpretability of risk-assessment instruments.”
Finance | Finlay (2012) | “presenting a linear model in the form of a scorecard is attractive because it’s so easy to explain and use. In particular, the score can be calculated using just addition to add up the relevant points that someone receives”
Table 8:
Quotes on why risk scores with small integer coefficients are used in different domains.
Importance of Operational Constraints
The approaches used to create risk scores vary significantly across problems. There is no standard approach within a given domain (see e.g., the different techniques proposed in criminal justice by Gottfredson and Snyder, 2005; Bobko et al., 2007; Duwe and Kim, 2016), or even within a given application (see e.g., the different approaches used to create risk scores for cardiac illness: Six et al., 2008; Antman et al., 2000; Than et al., 2014).

A key reason for the lack of a standardized approach is that risk scores in domains such as medicine and criminal justice need to obey additional operational constraints to be used and accepted. In some cases, these constraints can be explicitly stated. Reilly and Evans (2006), for example, describe the requirements put forth by physicians when building a model to detect major cardiac complications for patients with chest pain:

“
Our physicians... insisted that a new left bundle-branch block be considered as an electrocardiographic predictor of acute ischemia. In addition, they argued that patients who are stratified as low risk by the prediction rule could inappropriately include patients presenting with acute pulmonary edema, ongoing ischemic pain despite maximal medical therapy, or unstable angina after recent coronary revascularization (52). They insisted that such emergent clinical presentations be recommended for coronary care unit admission, not telemetry unit admission.”

In other cases, however, operational constraints may depend on qualities that are difficult to define a priori. Consider, for example, the following statement in Than et al. (2014), which describes the importance of sensibility for deployment:

“
An important consideration during development is the clinical sensibility of the resulting prediction rule [...] Evaluation of sensibility requires judgment rather than statistical methods. A sensible rule is easy to use, and has content and face validities. Prediction rules are unlikely to be applied in practice if they are not considered sensible by the end-user, even if they are accurate.”

Approaches to Model Development
Common heuristics used in model development include:

• Heuristic Feature Selection: Many approaches use heuristic feature selection to reduce the number of variables in the model. Model development pipelines can often involve multiple rounds of feature selection, and may use different heuristics at each stage (e.g., Antman et al. 2000 use a significance test to remove weak predictors, then use approximate feature selection via forward stepwise regression).

• Heuristic Rounding: Many approaches use rounding heuristics to produce models with integer coefficients. In the simplest case, this involves scaling and rounding the coefficients of a logistic regression model (Goel et al., 2016) or a linear probability model (US Department of Justice, 2005). The SAPS II score (Le Gall et al., 1993), for example, was built in this way (“the general rule was to multiply the β for each range by 10 and round off to the nearest integer.”).

• Expert Judgement: A common approach to model development involves having a panel of experts build a model by hand, and using data to validate the model after it is built (e.g., the CHADS₂ score for stroke prediction of Gage et al. 2001, and the National Early Warning Score to assess acute illness in the ICU of McGinley and Pearse 2012). Expert judgement can also be used in data-driven approaches. In developing the EDACS score (Than et al., 2014), for example, expert judgement was used to: (i) determine a scaling factor for model coefficients (“The beta coefficients were multiplied by eight, which was the smallest common multiplication factor possible to obtain a sensible score that used whole numbers and facilitated clinical ease of use.”); and (ii) convert a continuous variable into binary variables (“
Age was the only continuous variable to be included in the final score. It was converted to a categorical variable, using 5-year age bands with increasing increments of +2 points.”).

• Unit Weighting: This technique produces a score by adding up all variables that are significantly correlated with the outcome of interest. Unit weighting is prevalent in criminal justice (see e.g., Bobko et al., 2007; Duwe and Kim, 2016), where it is referred to as the
Burgess method (as it was first proposed in Burgess, 1928). The use of this technique is frequently motivated by empirical work showing that linear models with unit weights may perform surprisingly well (see e.g., Einhorn and Hogarth, 1975; Dawes, 1979; Holte, 1993, 2006; Bobko et al., 2007).
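Two of the heuristics above, scale-and-round and unit weighting, are simple enough to sketch directly. The scaling factor, correlation threshold, coefficient values, and function names below are illustrative assumptions, not any published score's recipe:

```python
import numpy as np

def scale_and_round(beta, factor=10):
    """Heuristic rounding: multiply real-valued coefficients by a scaling
    factor and round to the nearest integer (in the spirit of the SAPS II recipe)."""
    return np.rint(np.asarray(beta) * factor).astype(int)

def unit_weights(correlations, threshold=0.1):
    """Burgess-style unit weighting: give every variable whose correlation
    with the outcome exceeds a threshold (in magnitude) a weight of +/-1."""
    c = np.asarray(correlations)
    return np.where(np.abs(c) >= threshold, np.sign(c).astype(int), 0)

# hypothetical logistic regression coefficients and outcome correlations
beta = [0.231, -0.049, 0.418]
print(scale_and_round(beta))              # values 2, 0, 4
print(unit_weights([0.3, -0.02, -0.25]))  # values 1, 0, -1
```

Note how the second coefficient rounds to zero: small but informative coefficients can vanish under heuristic rounding, which is one reason such pipelines can lose accuracy.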
Critical Analysis of a Real-World Training Pipeline
Many risk scores are built using sequential training pipelines that combine traditional statistical techniques, heuristics, and expert judgement. The TIMI Risk Score of Antman et al. (2000), for example, was built as follows:
1. “A total of 12 baseline characteristics arranged in a dichotomous fashion were screened as candidate predictor variables of risk of developing an end-point event.”
2. “After each factor was tested independently in a univariate logistic regression model, those that achieved a significance level of p < 0.20 were [retained].”
3. “[The remaining factors were]... selected for testing in a multivariate step-wise (backward elimination) logistic regression model. Variables associated with p < 0.05 were retained in the final model.”
4. “After development of the multivariate model, the [risk predictions were determined]... for the test cohort using those variables that had been found to be statistically significant predictors of events in the multivariate analysis.”
5. “The score was then constructed by a simple arithmetic sum of the number of variables present.”

Although this pipeline uses several established statistical techniques (e.g., stepwise regression, significance testing), it is unlikely to output a risk score that attains the best possible performance because:

• Decisions involving feature selection and rounding are made sequentially (e.g., Steps 1–3 involve feature selection, and Step 5 involves rounding).
• The objective function that is optimized at each step differs from the performance metric of interest (i.e., the calibration error, which measures the reliability of risk estimates).
• Some steps optimize conflicting objective functions (e.g., backward elimination typically optimizes the AIC or BIC, while the final model is fit to optimize the logistic loss).
• Some steps do not fully optimize their own objective function (i.e., backward elimination does not return a globally optimal feature set).
• Some steps depend on free parameters that are set without validation (e.g., the threshold significance level of p < 0.20 used in Step 2).
• The final model is not trained using all the available training data. Here, Steps 4 and 5 use data from a “test cohort” that could have been used to improve the fit of the final model.
• The final model uses an empirical risk estimate for each score (i.e., the predicted risk for each score simply represents the fraction of patients in the test cohort with y_i = +1).
• The coefficients of the final model are set to +1 and not optimized (that is, the model uses unit weights).
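A loose numerical analogue of such a screening-then-unit-weights pipeline can be written in a few lines. Everything here, including the synthetic cohort, the correlation-based screen standing in for univariate logistic regression, and the 0.10 threshold, is an illustrative assumption, not the actual TIMI procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical cohort: 500 patients, 12 dichotomous candidate predictors
n, d = 500, 12
X = rng.integers(0, 2, size=(n, d))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, n) > 1.5).astype(int)

# Steps 1-2 analogue: univariate screening (correlation magnitude in place
# of a univariate logistic regression p-value cutoff)
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
keep = np.abs(corr) >= 0.10

# Step 5 analogue: the score is a simple arithmetic sum of retained variables
score = X[:, keep].sum(axis=1)

# empirical risk estimate per score value, as in Step 4
for s in np.unique(score):
    print(s, round(y[score == s].mean(), 3))
```

Every design decision here (the screen, the threshold, the unit weights) is made greedily and in sequence, which is precisely the structure the bullet points above criticize.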
Appendix D. Details on Computational Experiments
In this appendix, we provide additional details on the computational experiments in Sections 3.3 and 4.
D.1 Simulation Procedure for Synthetic Datasets
We ran the computational experiments in Sections 3.3 and 4 using a collection of synthetic datasets that we generated from the breastcancer dataset of Mangasarian et al. (1995), which contains n = 683 samples and d = 9 features x_ij ∈ {0, . . . , 10}. This choice was based on the fact that the breastcancer dataset produces a RiskSlimMINLP instance that can be solved with many algorithms, and that the data have been extensively studied in the literature and can be obtained from the UCI ML repository (Bache and Lichman, 2013).

We show the simulation procedure in Algorithm 7. Given the original dataset, this procedure generates a collection of nested synthetic datasets in two steps. First, it generates the largest dataset (with n_max = 5 × 10⁶ and d_max = 30) by replicating features and samples from the original dataset and adding normally distributed noise. Second, it produces smaller datasets by taking nested subsets of the samples and features. This ensures that a synthetic dataset with d features and n samples contains the same features and examples as a smaller synthetic dataset (i.e., with d′ < d features and n′ < n samples).

The resulting collection of synthetic datasets has two properties that are useful for benchmarking the performance of algorithms for RiskSlimMINLP:

1. They produce difficult instances of the risk score problem: the RiskSlimMINLP instances for synthetic datasets with large d cannot always be solved to optimality within the time limit.

2. The nested structure lets us bound the optimal values of RiskSlimMINLP instances that we may not be able to solve. Say, for example, that we could not solve an instance of the risk score problem for a synthetic dataset with d = 20 features, but could solve an instance for a smaller synthetic dataset with d = 10 features and the same number of samples. In this case, we know that the optimal value of the RiskSlimMINLP instance with d = 20 must be less than or equal to the optimal value of the RiskSlimMINLP instance with d = 10, because the d = 20 dataset contains all of the features of the d = 10 dataset.

Algorithm 7
Simulation Procedure to Generate Nested Synthetic Datasets
Input:
  X_original ← [x_ij], i = 1...n_original, j = 1...d_original  (feature matrix of original dataset)
  Y_original ← [y_i], i = 1...n_original  (label vector of original dataset)
  d_1, . . . , d_max  (dimensions for synthetic datasets, in increasing order)
  n_1, . . . , n_max  (sample sizes for synthetic datasets, in increasing order)

Initialize:
  J_original ← [1, . . . , d_original]  (index array for original features)
  J_max ← []  (index array of features for largest synthetic dataset)
  m_full ← ⌊d_max / d_original⌋
  m_remainder ← d_max − m_full · d_original

Step I: Generate Largest Dataset
  for m = 1, . . . , m_full do
    J_max ← [J_max, RandomPermute(J_original)]
  end for
  J_max ← [J_max, RandomSampleWithoutReplacement(J_original, m_remainder)]
  for i = 1, . . . , n_max do
    sample l with replacement from 1, . . . , n_original
    y_max_i ← y_l
    for j = 1, . . . , d_max do
      k ← J_max[j]
      sample ε from a normal distribution with mean 0
      x_max_ij ← ⌊x_lk + ε⌉  (new features are noisy versions of original features)
      x_max_ij ← min(10, max(0, x_max_ij))  (new features have the same bounds as the original features)
    end for
  end for
  X_max ← [x_max_ij], i = 1...n_max, j = 1...d_max
  Y_max ← [y_max_i], i = 1...n_max

Step II: Generate Smaller Datasets
  for d = d_1, . . . , d_max do
    for n = n_1, . . . , n_max do
      X^(n,d) ← X_max[1:n, 1:d]
      Y^n ← Y_max[1:n]
    end for
  end for

Output: synthetic datasets (X^(n,d), Y^n) for all n = n_1, . . . , n_max and d = d_1, . . . , d_max.

D.2 Setup on the Performance Comparison in Section 3.3
We consider an instance of RiskSlimMINLP where C = 10⁻⁸, the intercept λ₀ is restricted to a bounded set of consecutive integers, and each coefficient λ_j for j = 1, . . . , d is restricted to a small set of consecutive integers. We solved this instance on a 3.33 GHz Intel Xeon CPU with 16GB of RAM for up to 6 hours using the following algorithms:

(i) CPA, as described in Algorithm 1;
(ii) LCPA, as described in Algorithm 2;
(iii) ActiveSetMINLP, an active set MINLP algorithm;
(iv) InteriorMINLP, an interior point MINLP algorithm;
(v) InteriorCGMINLP, an interior point MINLP algorithm where the primal-dual KKT system is solved with a conjugate gradient method.

All MINLP algorithms were implemented in a state-of-the-art commercial solver (i.e., Artelsys Knitro 9.0, which is an updated version of the solver described in Byrd et al. 2006). If an algorithm did not return a certifiably optimal solution for a particular instance within the 6-hour time limit, we report results for the best feasible solution. Both CPA and LCPA were implemented using CPLEX 12.6.3.
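To see what CPA-style cutting planes do, consider a toy one-dimensional version in which brute-force enumeration over a small integer coefficient set stands in for the MIP solver. The data, coefficient range, and stopping tolerance are illustrative assumptions; this is a sketch of the idea rather than the paper's implementation:

```python
import numpy as np

# toy data: one feature, labels in {-1, +1}
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def loss(lam):
    """Normalized logistic loss l(lambda) = (1/n) sum log(1 + exp(-y x lambda))."""
    return np.mean(np.log1p(np.exp(-y * X * lam)))

def grad(lam):
    return np.mean(-y * X / (1.0 + np.exp(y * X * lam)))

grid = np.arange(-5, 6)   # integer coefficient set {-5, ..., 5}
cuts = []                 # tangent cuts as (value, slope, point)
lam = 0                   # initial iterate

for _ in range(20):
    cuts.append((loss(lam), grad(lam), lam))
    # master problem: minimize the piecewise-linear lower bound over the grid
    lb = np.array([max(v + g * (t - p) for v, g, p in cuts) for t in grid])
    lam = grid[np.argmin(lb)]
    if loss(lam) - lb.min() < 1e-6:   # optimality gap is small: stop
        break

print("optimal integer coefficient:", lam)
```

Each iteration adds a tangent cut at the current iterate; the master problem minimizes the piecewise-linear lower bound over the feasible coefficients, and the loop stops once the gap between the true loss and the bound is small.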
D.3 Results for MINLP Algorithms
In Figure 31, we plot results for all MINLP algorithms that we benchmarked against CPA and LCPA: ActiveSetMINLP, InteriorMINLP, and InteriorCGMINLP. We reported results for ActiveSetMINLP in Section 3.3 because it solved the largest number of instances to optimality.
[Figure 31 reports three metrics for ActiveSetMINLP, InteriorMINLP, and InteriorCGMINLP on each synthetic dataset: (i) the time to train a good risk score, i.e., the time for an algorithm to find a solution whose loss is close to the optimal loss, which reflects the time to obtain a risk score with good calibration without a proof of optimality (binned as <1 min, <10 min, <1 hour, <6 hours, 6+ hours); (ii) the optimality gap of the best solution at termination, i.e., (V_max − V_min)/V_max, where V_max is the objective value of the best solution found at termination; a gap of 0.0% means an algorithm has found the optimal solution and provided a proof of optimality within 6 hours (binned as 0%, 0−20%, 20−50%, 50−90%, 90−100%); and (iii) the % of time spent on data-related computation, i.e., the proportion of total runtime that an algorithm spends computing the value, gradient, or Hessian of the loss function.]

Figure 31:
Performance of MINLP algorithms on difficult instances of
RiskSlimMINLP for synthetic datasets with varying dimensions d and sample sizes n. All algorithms perform similarly. We report results for ActiveSetMINLP in Section 3.3 because it solves the most instances to optimality.
Appendix E. Additional Experimental Results
[Table 9 reports, for each dataset and method, the metrics test cal, test auc, train cal, and train auc. Methods span Traditional Approaches (PLR▷Rd, PLR▷RsRd, PLR▷Unit) and Pooled Approaches (PooledRd, PooledRd*, PooledSeqRd*), as well as RiskSLIM. Datasets: bank (n = 41188, d = 57); census (n = 32561, d = 36); mammo (n = 961, d = 14); mushroom (n = 8124, d = 113); rearrest (n = 22530, d = 48); spambase (n = 4601, d = 57).]

Table 9:
Summary statistics of risk scores with integer coefficients λ_j ∈ {−5, . . . , 5} and model size ‖λ‖₀ ≤
5. Here: test cal and test auc are the 5-CV mean test CAL and AUC; train cal and train auc are the CAL and AUC of the final model fit using the entire dataset.
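The CAL and AUC metrics in Table 9 can be computed from predicted risks and binary labels. A minimal sketch, with the caveat that the decile binning and the 0/1 label encoding here are our assumptions (the paper defines CAL precisely in the main text):

```python
import numpy as np

def auc(scores, labels):
    # probability that a random positive outranks a random negative (ties count 1/2)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def cal(pred_risk, labels, n_bins=10):
    # mean absolute gap between predicted risk and observed event rate,
    # averaged over risk bins and weighted by bin size
    bins = np.minimum((pred_risk * n_bins).astype(int), n_bins - 1)
    err, n = 0.0, len(labels)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.sum() / n * abs(pred_risk[mask].mean() - labels[mask].mean())
    return err

risk = np.array([0.1, 0.2, 0.8, 0.9])
y = np.array([0, 0, 1, 1])
print(auc(risk, y))  # 1.0 (perfect ranking)
print(cal(risk, y))  # about 0.15 (each predicted risk is 0.1 or 0.2 off)
```

A model can have a high AUC (good ranking) and still a poor CAL (unreliable risk estimates), which is why the table reports both.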
Appendix F. Details and Results for Seizure Prediction
In this appendix, we provide supporting material for the seizure prediction application in Section 6. In Table 10, we list all input variables in the training dataset.
Input Variable Values Sign
Male  {0, 1}
Female  {0, 1}
PriorSeizure  {0, 1}  +
PosteriorDominantRhythmPresent  {0, 1}  −
BriefRhythmicDischarge  {0, 1}  +
NoReactivityToStimulation  {0, 1}
EpileptiformDischarges  {0, 1}  +
SecondaryDXIncludesMentalStatusFirst  {0, 1}
SecondaryDXIncludesCNSInfection  {0, 1}
SecondaryDXIncludesCNSInflammatoryDisease  {0, 1}
SecondaryDXIncludesCNSNeoplasm  {0, 1}
SecondaryDXIncludesHypoxisIschemicEncephalopathy  {0, 1}
SecondaryDXIncludesIntracerebralHemorrhage  {0, 1}
SecondaryDXIncludesIntraventricularHemorrhage  {0, 1}
SecondaryDXIncludesMetabolicEncephalopathy  {0, 1}
SecondaryDXIncludesIschemicStroke  {0, 1}
SecondaryDXIncludesSubarachnoidHemmorage  {0, 1}
SecondaryDXIncludesSubduralHematoma  {0, 1}
SecondaryDXIncludesTraumaticBrainInjury  {0, 1}
SecondaryDXIncludesHydrocephalus  {0, 1}
PatternIsStimulusInducedAny  {0, 1}
PatternIsStimulusInducedBiPD  {0, 1}
PatternIsStimulusInducedGPD  {0, 1}
PatternIsStimulusInducedGRDA  {0, 1}
PatternIsStimulusInducedLPD  {0, 1}
PatternIsStimulusInducedLRDA  {0, 1}
PatternIsSuperImposedAny  {0, 1}  +
PatternIsSuperImposedBiPD  {0, 1}  +
PatternIsSuperImposedGPD  {0, 1}  +
PatternIsSuperImposedGRDA  {0, 1}  +
PatternIsSuperImposedLPD  {0, 1}  +
PatternIsSuperImposedLRDA  {0, 1}  +
PatternsInclude BiPD  {0, 1}  +
PatternsInclude GPD  {0, 1}  +
PatternsInclude GRDA  {0, 1}  +
PatternsInclude LPD  {0, 1}  +
PatternsInclude LRDA  {0, 1}  +
PatternsInclude GRDA or GPD  {0, 1}  +
PatternsInclude BiPD or LRDA or LPD  {0, 1}  +
MaxFrequencyAnyPattern  discrete frequencies (Hz)  +
MaxFrequencyBiPD  discrete frequencies (Hz)  +
MaxFrequencyGPD  discrete frequencies (Hz)  +
MaxFrequencyLPD  discrete frequencies (Hz)  +
MaxFrequencyLRDA  discrete frequencies (Hz)  +
MaxFrequencyAnyPattern = 0.0 Hz  {0, 1}
MaxFrequencyAnyPattern ≥ t Hz (six thresholds t)  {0, 1}  +
MaxFrequencyBiPD = 0.0 Hz  {0, 1}
MaxFrequencyBiPD ≥ t Hz (six thresholds t)  {0, 1}  +
MaxFrequencyGPD = 0.0 Hz  {0, 1}
MaxFrequencyGPD ≥ t Hz (six thresholds t)  {0, 1}  +
MaxFrequencyGRDA = 0.0 Hz  {0, 1}
MaxFrequencyGRDA ≥ t Hz (six thresholds t)  {0, 1}  +
MaxFrequencyLPD = 0.0 Hz  {0, 1}
MaxFrequencyLPD ≥ t Hz (six thresholds t)  {0, 1}  +
MaxFrequencyLRDA = 0.0 Hz  {0, 1}
MaxFrequencyLRDA ≥ t Hz (six thresholds t)  {0, 1}  +
Names, values, and sign constraints for input variables in the seizure dataset.
F.1 Operational Constraints

No Redundant Categorical Variables

1. Use either Male or Female.
2. Use either PatternsInclude GRDA or GPD or any one of (PatternsInclude GRDA, PatternsInclude GPD).
3. Use either PatternsInclude BiPD or LRDA or LPD or any one of (PatternsInclude BiPD, PatternsInclude LRDA, PatternsInclude LPD).
4. Use either MaxFrequencyAnyPattern = 0.0 or MaxFrequencyAnyPattern ≥ t, for its lowest threshold t.
5. Use either MaxFrequencyLPD = 0.0 or MaxFrequencyLPD ≥ t, for its lowest threshold t.
6. Use either MaxFrequencyGPD = 0.0 or MaxFrequencyGPD ≥ t, for its lowest threshold t.
7. Use either MaxFrequencyGRDA = 0.0 or MaxFrequencyGRDA ≥ t, for its lowest threshold t.
8. Use either MaxFrequencyBiPD = 0.0 or MaxFrequencyBiPD ≥ t, for its lowest threshold t.
9. Use either MaxFrequencyLRDA = 0.0 or MaxFrequencyLRDA ≥ t, for its lowest threshold t.

Frequency in Continuous Encoding or Thresholded Encoding

10. Choose between MaxFrequencyAnyPattern or its thresholded versions (MaxFrequencyAnyPattern = 0.0, . . . , MaxFrequencyAnyPattern ≥ t for each threshold t).
11. Choose between MaxFrequencyGPD or its thresholded versions (MaxFrequencyGPD = 0.0, . . . , MaxFrequencyGPD ≥ t for each threshold t).
12. Choose between MaxFrequencyLPD or its thresholded versions (MaxFrequencyLPD = 0.0, . . . , MaxFrequencyLPD ≥ t for each threshold t).
13. Choose between MaxFrequencyGRDA or its thresholded versions (MaxFrequencyGRDA = 0.0, . . . , MaxFrequencyGRDA ≥ t for each threshold t).
14. Choose between MaxFrequencyBiPD or its thresholded versions (MaxFrequencyBiPD = 0.0, . . . , MaxFrequencyBiPD ≥ t for each threshold t).
15. Choose between MaxFrequencyLRDA or its thresholded versions (MaxFrequencyLRDA = 0.0, . . . , MaxFrequencyLRDA ≥ t for each threshold t).

Limited Number of Thresholds

16. Use at most 2 of: MaxFrequencyAnyPattern = 0.0, MaxFrequencyAnyPattern ≥ t, . . . , over all thresholds t.
17. Use at most 2 of: MaxFrequencyLPD = 0.0, MaxFrequencyLPD ≥ t, . . . , over all thresholds t.
18. Use at most 2 of: MaxFrequencyGPD = 0.0, MaxFrequencyGPD ≥ t, . . . , over all thresholds t.
19. Use at most 2 of: MaxFrequencyGRDA = 0.0, MaxFrequencyGRDA ≥ t, . . . , over all thresholds t.
20. Use at most 2 of: MaxFrequencyBiPD = 0.0, MaxFrequencyBiPD ≥ t, . . . , over all thresholds t.
21. Use at most 2 of: MaxFrequencyLRDA = 0.0, MaxFrequencyLRDA ≥ t, . . . , over all thresholds t.

Specific cEEG Patterns or Any cEEG Pattern

22. Use either PatternIsStimulusInducedAny or any of (PatternIsStimulusInducedBiPD, PatternIsStimulusInducedGRDA, PatternIsStimulusInducedGPD, PatternIsStimulusInducedLPD, PatternIsStimulusInducedLRDA).
23. Use either PatternIsSuperImposedAny or any of (PatternIsSuperImposedBiPD, PatternIsSuperImposedGPD, PatternIsSuperImposedGRDA, PatternIsSuperImposedLPD, PatternIsSuperImposedLRDA).
24. Use either MaxFrequencyAnyPattern (or its thresholded versions) or any of