Ustun and Rudin
Learning Optimized Risk Scores
Berk Ustun [email protected]
Center for Research in Computation and Society, Harvard University
Cynthia Rudin [email protected]
Department of Computer Science, Duke University
Abstract
Risk scores are simple classification models that let users make quick risk predictions by adding and subtracting a few small numbers. These models are widely used in medicine and criminal justice, but are difficult to learn from data because they need to be calibrated, sparse, use small integer coefficients, and obey application-specific operational constraints. In this paper, we present a new machine learning approach to learn risk scores. We formulate the risk score problem as a mixed integer nonlinear program, and present a new cutting plane algorithm for non-convex settings to efficiently recover its optimal solution. We improve our algorithm with specialized techniques to generate feasible solutions, narrow the optimality gap, and reduce data-related computation. Our approach can fit risk scores in a way that scales linearly in the number of samples, provides a certificate of optimality, and obeys real-world constraints without parameter tuning or post-processing. We illustrate the performance benefits of this approach through an extensive set of numerical experiments, where we compare risk scores built using our approach to those built using heuristic approaches. We also discuss the practical benefits of our approach through a real-world application where we build a customized risk score for ICU seizure prediction in collaboration with the Massachusetts General Hospital.
Keywords: scoring systems; classification; constraints; calibration; interpretability; cutting plane methods; discrete optimization; mixed integer nonlinear programming.
1. Introduction
Risk scores are linear classification models that let users assess risk by adding, subtracting, and multiplying a few small numbers (see Figure 1). These models are widely used to support decision-making in domains such as:

• Medicine: to assess the risk of mortality in intensive care (e.g., Moreno et al., 2005), critical physical conditions (e.g., adverse cardiac events, Six et al., 2008; Than et al., 2014), and mental illnesses (e.g., adult ADHD in Kessler et al. 2005; Ustun et al. 2017).

• Criminal Justice: to assess the risk of recidivism when setting bail, sentencing, and release on parole (see e.g., Latessa et al., 2009; Austin et al., 2010; Pennsylvania Bulletin, 2017).

• Finance: to assess the risk of default on a loan (see e.g., credit scores in FICO, 2011; Siddiqi, 2012), and to inform financial investments (Piotroski, 2000; Beneish et al., 2013).

The widespread adoption of risk scores in these domains stems from the fact that decision-makers often find them easy to use and understand. In comparison to other kinds of classification models, risk scores let users make quick predictions through simple arithmetic, without a computer or calculator. Users can gauge the effect of changing multiple input variables on the predicted outcome, and override predictions in an informed manner if needed. In comparison to scoring systems for decision-making (see e.g., the models considered in Ustun and Rudin, 2016; Zeng et al., 2016; Carrizosa et al., 2016; Van Belle et al., 2013; Billiet et al., 2016, 2017; Sokolovska et al., 2017, 2018), which predict a yes-or-no outcome at a fixed operating point, risk scores output risk predictions at multiple operating points. Thus, users have the ability to choose an operating point after the model has been deployed. Further, they are given risk estimates that, when calibrated, can inform this choice and support decision-making in other ways (see e.g., Shah et al., 2018).
We provide more background on risk scores in Appendix C.

Figure 1: CHADS₂ score to assess stroke risk: 1. Congestive Heart Failure (1 point); 2. Hypertension (1 point); 3. Age ≥ 75 (1 point); 4. Diabetes Mellitus (1 point); 5. Prior Stroke or Transient Ischemic Attack (2 points). Users add the points of each item to compute a SCORE, then convert the SCORE into an estimate of RISK.

Although risk scores have existed for nearly a century (see e.g., a parole violation model in Burgess, 1928), many of them are built ad hoc. This is partly because risk scores are often used in applications where models must satisfy requirements related to qualities such as interpretability, usability, or fairness (see e.g., requirements on "face validity" and "user friendliness" in Than et al., 2014). Building a risk score that satisfies these requirements necessitates precise control over multiple model properties, such as monotonicity (Gupta et al., 2016), group sparsity (Kessler et al., 2005), prediction (Reilly and Evans, 2006), and calibration (Pleiss et al., 2017). Since existing classification methods do not provide control over such a diverse set of properties off-the-shelf, risk scores are built by combining them with heuristics and expert judgment (e.g., preliminary feature selection, logistic regression, scaling, and rounding as in Antman et al., 2000). In some cases, risk scores are hand-crafted by a panel of experts (see e.g., the CHADS₂ score in Figure 1, or the National Early Warning Score of McGinley and Pearse 2012). As we will show, such approaches may produce a model that violates requirements, or that performs poorly relative to the best risk score that can be built using the same dataset. The lack of a formal guarantee complicates model development: when a risk score performs poorly, a practitioner cannot tell if this is due to heuristics used in the training pipeline or overly restrictive constraints.

In this paper, we present a new machine learning approach to learn risk scores from data. We consider a mixed-integer nonlinear program (MINLP), which minimizes the logistic loss for calibration and AUC, penalizes the ℓ₀-norm for sparsity, and restricts coefficients to small integers.
We refer to this optimization problem as the risk score problem, and refer to the risk score built from its solution as a Risk-calibrated Supersparse Linear Integer Model (RiskSLIM). We aim to recover a certifiably optimal solution (i.e., a global optimum and a certificate of optimality). This approach requires solving a computationally difficult optimization problem, but has three major benefits for our setting:

(i) Performance: Since the approach can directly optimize, penalize, and constrain discrete quantities, it can produce a risk score that is fully optimized for feature selection and small integer coefficients. Thus, models are guaranteed not to suffer in training performance due to heuristic post-processing.

(ii) Direct Customization: Practitioners can address application-specific requirements by adding discrete constraints to the MINLP formulation, which can be solved with a generic solver. In this way, they can fit customized risk scores without parameter tuning, post-processing, or implementing a new algorithm for each application.

(iii) Understanding the Impact of Constraints: It can pair risk scores with a certificate of optimality. By design, a globally optimal solution to the risk score problem attains the best performance among risk scores that satisfy a given set of constraints. Once we recover a certifiably optimal solution, we therefore end up with a risk score with acceptable performance, or a risk score with unacceptable performance and a certificate proving that the constraints were overly restrictive. In settings where performance generalizes, the certificate of optimality provides the ability to evaluate the effect of constraints on predictive performance. By comparing certifiably optimal risk scores for different sets of constraints, practitioners can make informed choices between models that satisfy different sets of requirements.

In light of these potential benefits, a major goal of this work is to recover certifiably optimal solutions to the risk score problem for the largest possible datasets. As we will show, solving the risk score problem with a MINLP solver is time-consuming even on small datasets, as generic MINLP algorithms are slowed down by excessive data-related computation. Accordingly, we aim to solve the risk score problem with a cutting plane algorithm, which reduces data-related computation by iteratively solving a surrogate problem with a linear approximation of the loss function that is much cheaper to evaluate. Cutting plane algorithms have an impressive track record on large-scale supervised learning problems, as they scale linearly with the number of samples and provide precise control over data-related computation (see e.g., Teo et al., 2009; Franc and Sonnenburg, 2009; Joachims et al., 2009). Prior cutting plane algorithms were designed under the assumption that the surrogate optimization problem can be solved to optimality at each iteration.
This assumption is perfectly reasonable in a convex setting, but leads cutting plane algorithms to stall on non-convex problems such as ours, as the time to solve the surrogate problem to optimality increases exponentially with each iteration. To overcome this issue, we present a new cutting plane algorithm for non-convex settings. We then improve its performance through fast techniques to generate good feasible solutions, narrow the optimality gap, and reduce data-related computation. Our approach extends the benefits of cutting plane algorithms to discrete optimization problems, allowing us to train risk scores in a way that can address real-world constraints, provide a certificate of optimality, and scale linearly with the number of samples in a dataset.

Contributions
The main contributions of this paper are as follows.

• We present a new machine learning approach to build risk scores. Our approach can train models that: (i) are fully optimized for feature selection and small integer coefficients; (ii) handle application-specific constraints without parameter tuning or post-processing; (iii) provide a certificate of optimality.

• We develop a new cutting plane algorithm, called the lattice cutting plane algorithm (LCPA). LCPA retains the benefits of cutting plane algorithms for convex empirical risk minimization problems, but does not stall on problems with non-convex regularizers or constraints. It can be easily implemented using a MIP solver (e.g., CPLEX, Gurobi, or CBC), and used to train customized risk scores (and other kinds of models) in a way that scales linearly with the number of samples in a dataset.

• We design new techniques that allow LCPA to quickly recover a risk score with good performance and a small optimality gap, namely: rounding and polishing heuristics; fast bound tightening and initialization procedures; and strategies to reduce data-related computation. Our techniques can be adapted to improve LCPA for other problems, and used to improve the performance of risk scores built via heuristic post-processing.

• We present an extensive set of experiments comparing methods to learn risk scores on publicly available datasets. Our results show that our approach can consistently train risk scores with best-in-class performance in minutes. We highlight pitfalls of approaches that are often used in practice, and present new heuristic methods that address these issues.

• We present results from a collaboration with the Massachusetts General Hospital where we built a customized risk score for ICU seizure prediction. Our results highlight the practical benefits of our approach when training models that obey real-world constraints, and illustrate the performance gains of certifiably optimal risk scores in such settings.

• We provide a software package to build optimized risk scores in Python, available online at http://github.com/ustunb/risk-slim.
Organization
In the remainder of Section 1, we discuss related work. In Section 2, we formally define the risk score problem. In Section 3, we present our cutting plane algorithm, LCPA. In Section 4, we present techniques to improve LCPA. In Section 5, we benchmark methods to build risk scores. In Section 6, we discuss an application to ICU seizure prediction.

The supplement to our paper includes: proofs of all theorems (Appendix A); a primer on how risk scores are used and developed in practice (Appendix C); and supporting material for the computational experiments in Sections 3 and 4 (Appendix D), the performance benchmark in Section 5 (Appendix E), and the seizure prediction problem in Section 6 (Appendix F).
Prior Work
Our paper extends work that was first published in KDD (Ustun and Rudin, 2017). Real-world applications of RiskSLIM include building a screening tool for adult ADHD from a short self-reported questionnaire (Ustun et al., 2017) and building a risk score for ICU seizure prediction (Struck et al., 2017). Applications of this work have been discussed in a paper that was a finalist for the 2017 INFORMS Wagner Prize (Rudin and Ustun, 2018).

Scoring Systems
While several methods have been proposed to learn scoring systems for decision-making (see, e.g., Ustun and Rudin, 2016; Zeng et al., 2016; Carrizosa et al., 2016; Van Belle et al., 2013; Billiet et al., 2016, 2017; Sokolovska et al., 2017, 2018), the method in this work aims to produce scoring systems for risk assessment (i.e., risk scores). Risk scores represent the majority of scoring systems that are currently used in medicine, criminal justice, and credit scoring. In practice, the performance objective of these models is to produce calibrated risk estimates, which are used to choose an operating point and to inform decision-making in other ways (see e.g., Section 6 and Van Calster and Vickers, 2015; Alba et al., 2017, for a discussion of how miscalibrated risk estimates can lead to harmful decisions in medicine). As we will show in Section 5.3, building risk scores that output calibrated risk estimates is challenging, and heuristics used in risk score development (e.g., rounding and scaling) can compromise calibration in ways that are difficult to fix.
RiskSLIM risk scores are the risk assessment counterpart to SLIM scoring systems (Ustun et al., 2013; Ustun and Rudin, 2016; Zeng et al., 2016; Ustun et al., 2016). Both RiskSLIM and SLIM use score functions that are optimized for feature selection and small integer coefficients, and that can be directly customized to obey application-specific constraints. RiskSLIM models are designed for risk assessment and optimize the logistic loss. In contrast, SLIM models are designed for decision-making and minimize the 0–1 loss. SLIM models do not output probability estimates, and their scores will not necessarily have high AUC. However, they perform well at the single point on the ROC curve that they were optimized for. Optimizing the 0–1 loss is also NP-hard, so training SLIM models may not scale to datasets with large sample sizes as is the case here. In practice, RiskSLIM is better-suited for applications where models must output calibrated risk estimates and/or perform well at multiple operating points over the ROC curve.
Machine Learning
Our cutting-plane algorithm can be used with machine learning methods where models are trained by solving an empirical risk minimization problem with a convex loss function, a non-convex penalty, and non-convex constraints. These include methods to learn: scoring systems for decision-making (Carrizosa et al., 2016; Van Belle et al., 2013; Billiet et al., 2016, 2017; Sokolovska et al., 2017); sparse Boolean classifiers (e.g., Chevaleyre et al., 2013; Malioutov and Varshney, 2013; Goh and Rudin, 2014; Wang et al., 2017); sparse decision trees (Angelino et al., 2018; Letham et al., 2015; Lakkaraju et al., 2016; Wang and Rudin, 2015); and other ℓ₀-regularized models (Sato et al., 2015, 2016; Bertsimas et al., 2016). For each of these methods, our cutting-plane algorithm can train models that optimize the same objective function and obey the same constraints, but in a way that recovers a globally optimal solution, handles application-specific constraints, and scales linearly with the number of samples.

Our approach is a promising option to build models that obey constraints related to, for example, interpretability (see e.g. Caruana et al., 2015; Gupta et al., 2016; Chen et al., 2018; Li et al., 2018, where interpretability is addressed through constraints on model form), safety (Amodei et al., 2016), credibility (Wang et al., 2018), and fairness (Kamishima et al., 2011; Zafar et al., 2017). Such qualities depend on multiple model properties, which change significantly across applications and result in unclear performance trade-offs. Current approaches often aim to address a specific kind of constraint for generic models via pre- or post-processing (see e.g., Goh et al., 2016; Calmon et al., 2017). In contrast, our approach can address such constraints directly for a specific model class. When these models belong to a simple hypothesis class (e.g., risk scores), we can expect model performance on training data to generalize, and can check generalization empirically (e.g., using cross-validation). In this way, users can assess the impact of their requirements on predictive performance and make informed choices between models.

Our work is part of a recent stream of research on integer programming in supervised learning (Carrizosa et al., 2016; Liu and Wu, 2007; Goldberg and Eckstein, 2012; Guan et al., 2009; Nguyen and Franke, 2012; Sato et al., 2015, 2016; Bertsimas et al., 2016; Bertsimas and Dunn, 2017). The unique aspect of our work on the risk score problem is that our models are certifiably optimal or have small optimality gaps (see also Angelino et al., 2018). Our work shows that certifiably optimal models not only perform better, but are also useful for applications where models must satisfy constraints (see Section 6).
Optimization
We train risk scores by solving a MINLP with three main components: (i) a convex loss function; (ii) a non-convex feasible region (i.e., small integer coefficients and application-specific constraints); (iii) a non-convex penalty function (i.e., the ℓ₀-penalty).

In Section 3.3, we show that this MINLP requires a specialized algorithm, as off-the-shelf MINLP solvers fail to solve instances for small datasets. Accordingly, we propose solving the risk score problem with a cutting plane algorithm; cutting planes have been extensively studied by the optimization community (see e.g., Kelley, 1960), and used to solve convex empirical risk minimization problems (Teo et al., 2007, 2009; Franc and Sonnenburg, 2008, 2009; Joachims, 2006; Joachims et al., 2009); except, as we discuss, these methods do not typically apply to our problem because they stall in the discrete setting.

Our cutting plane algorithm (LCPA) builds a cutting plane approximation while performing branch-and-bound search. It can be easily implemented using a MIP solver with control callbacks (see e.g., Bai and Rubin, 2009; Naoum-Sawaya and Elhedhli, 2010, for similar uses of control callbacks in the optimization literature).
LCPA retains the key benefits of existing cutting plane algorithms on empirical risk minimization problems, but does not stall on problems with non-convex regularizers or constraints. As we discuss in Section 3.1, stalling affects many cutting plane algorithms, including variants that are not considered in machine learning (see Boyd and Vandenberghe, 2004, for a list).

Given that LCPA does not stall, it may provide a practical alternative to solve the kinds of optimization problems that have motivated recent work in the optimization community (see e.g., Park and Boyd, 2015a; Hübner and Schöbel, 2014, who propose other approaches to solve special cases that minimize a convex function over integers).
2. Risk Score Problem
In what follows, we formalize the problem of learning a risk score as in Figure 1. We start with a dataset of n i.i.d. training examples (x_i, y_i)_{i=1}^n, where x_i ∈ ℝ^{d+1} denotes a vector of features [1, x_{i,1}, . . . , x_{i,d}]^⊤ and y_i ∈ {±1} denotes a class label. We represent the score as a linear function s(x) = ⟨λ, x⟩, where λ ∈ ℝ^{d+1} is a vector of d + 1 coefficients [λ_0, λ_1, . . . , λ_d]^⊤, and λ_0 is an intercept. In this setup, coefficient λ_j represents the points for feature j. Given an example with features x_i, a user first tallies the points to compute a score s_i = ⟨λ, x_i⟩, then uses the score to obtain an estimate of predicted risk. We estimate the predicted risk that example i is positive through the logistic link function as:

p_i = Pr(y_i = +1 | x_i) = 1 / (1 + exp(−⟨λ, x_i⟩)).

Model Desiderata
Our goal is to train a risk score that is sparse, has small integer coefficients, and performs well in terms of the following measures:

1. Calibration: A calibrated model outputs risk predictions that match their observed risk. We assess the calibration of a model using a reliability diagram (see DeGroot and Fienberg, 1983), which shows how the predicted risk (x-axis) at each score matches the observed risk (y-axis). We estimate the observed risk for a score of s as

p̄_s = (1 / |{i : s_i = s}|) Σ_{i : s_i = s} 1[y_i = +1].

We summarize the calibration of a model over the full reliability diagram using the expected calibration error (Naeini et al., 2014):

CAL = (1/n) Σ_s Σ_{i : s_i = s} |p_i − p̄_s|.

2. Rank Accuracy: A rank-accurate model outputs scores that can correctly rank examples according to their true risk. We assess the rank accuracy of a model using the area under the ROC curve:

AUC = (1 / (n⁺ n⁻)) Σ_{i : y_i = +1} Σ_{k : y_k = −1} 1[s_i > s_k],

where n⁺ = |{i : y_i = +1}| and n⁻ = |{i : y_i = −1}|.

As discussed in Section 1.1, calibration is the primary performance objective when building a risk score. In principle, good calibration should ensure good rank accuracy. Nevertheless, we report AUC as an auxiliary performance metric because trivial risk scores (i.e., models that assign the same score to all examples) can have low CAL on datasets with class imbalance (see Section 5.3 for an example).
1. Users can also obtain a predicted label ŷ_i ∈ {±1} by setting a risk threshold (e.g., predict ŷ_i = +1 if predicted risk ≥ 50%).

Optimization Problem
We learn the values of the coefficients by solving a mixed integer nonlinear program (MINLP), which we refer to as the risk score problem or RiskSlimMINLP.

Definition 1 (Risk Score Problem, RiskSlimMINLP) The risk score problem is a discrete optimization problem with the form:

min_λ  l(λ) + C‖λ‖₀  s.t.  λ ∈ L,    (1)

where:
• l(λ) = (1/n) Σ_{i=1}^n log(1 + exp(−⟨λ, y_i x_i⟩)) is the normalized logistic loss function;
• ‖λ‖₀ = Σ_{j=1}^d 1[λ_j ≠ 0] is the ℓ₀-seminorm;
• L ⊂ ℤ^{d+1} is a set of feasible coefficient vectors (user-provided);
• C > 0 is a trade-off parameter to balance fit and sparsity (user-provided).

RiskSlimMINLP is formulated to capture the exact constraints of a risk score. The objective minimizes the logistic loss for calibration and AUC and penalizes the ℓ₀-norm for sparsity. The trade-off parameter C controls the balance between these competing objectives, and represents the maximum log-likelihood that is sacrificed to remove a feature from the optimal model. The feasible region restricts coefficients to a small set of bounded integers, such as L = {−5, −4, . . . , 4, 5}^{d+1}, and may be customized to address exact model requirements, such as those in Table 1.
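To make the objective concrete, here is a direct evaluation of V(λ) = l(λ) + C‖λ‖₀ on a toy dataset (the data, coefficients, and value of C below are hypothetical):

```python
import numpy as np

def objective(lam, X, y, C):
    """V(lam) = normalized logistic loss + C * (number of nonzero coefficients)."""
    margins = y * (X @ lam)                        # <lam, y_i x_i> for each example
    loss = np.mean(np.log1p(np.exp(-margins)))     # normalized logistic loss l(lam)
    l0 = np.count_nonzero(lam[1:])                 # intercept lam_0 is not penalized
    return loss + C * l0

# Hypothetical data; the first column of X is the intercept feature (all ones).
X = np.array([[1.0, 2.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
lam = np.array([0.0, 1.0, -1.0])                   # small integer coefficients
V = objective(lam, X, y, C=1e-2)
```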
Feature Selection: Choose between 5 to 10 total features.
Group Sparsity: Include either male or female in the model, but not both.
Optimal Thresholding: Use at most 3 thresholds for a set of indicator variables: Σ_k 1[age ≤ k] ≤ 3.
Logical Structure: If male is in the model, then include hypertension or bmi ≥ 30 as a control.
Side Information: Predict Pr(y = +1 | x) ≥ 0.90 when male = TRUE and hypertension = TRUE.
Fairness: Limit disparate impact between groups A and B to 80%: Pr(ŷ = +1 | i ∈ A) / Pr(ŷ = +1 | i ∈ B) ≥ 0.8.

Table 1: Model requirements that can be addressed by adding operational constraints to RiskSlimMINLP.

A Risk-calibrated Supersparse Linear Integer Model (RiskSLIM) is a risk score built using an optimal solution to (1). By definition, the optimal solution to
RiskSlimMINLP attains the lowest value of the logistic loss among feasible models on the training data, provided that C is small enough (see Appendix B for a more precise statement and a proof). Thus, RiskSLIM is the maximum likelihood logit model that satisfies all constraints required for a risk score.

Our empirical results in Section 5 show that models with lower loss typically attain better calibration and AUC on the training data (see also Caruana and Niculescu-Mizil, 2004), and that this generalizes to test data due to the simplicity of our hypothesis space. There are some theoretical results to explain why minimizing the logistic loss leads to good calibration and AUC. In particular, the logistic loss is a strictly proper loss (Reid and Williamson, 2010; Ertekin and Rudin, 2011), which yields calibrated risk estimates under the parametric assumption that the true risk can be modeled using a logistic link function (see Menon et al., 2012). Further, the work of Kotlowski et al. (2011) shows that a "balanced" version of the logistic loss forms a lower bound on 1 − AUC, so minimizing the logistic loss indirectly maximizes a surrogate of AUC.
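The properness of the logistic loss can be checked numerically: among examples that share the same score, the constant risk estimate minimizing the loss is the observed positive rate. A small sketch (the data are made up for illustration):

```python
import numpy as np

# 7 positive and 3 negative examples with identical features (hypothetical).
y = np.array([1] * 7 + [-1] * 3)

def mean_logistic_loss(p):
    """Mean logistic loss when every example receives the constant risk estimate p."""
    score = np.log(p / (1 - p))                  # the score whose predicted risk is p
    return np.mean(np.log1p(np.exp(-y * score)))

# Minimize over a grid of risk estimates; the minimizer is the positive rate 0.70.
grid = np.linspace(0.01, 0.99, 99)
p_best = grid[np.argmin([mean_logistic_loss(p) for p in grid])]
```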
Trade-off Parameter
The trade-off parameter can be restricted to values C ∈ [0, l(0)]. Setting C > l(0) will produce a trivial model where λ* = 0. Using an exact formulation provides an alternative way to set the trade-off parameter C:

• If we are given a limit on model size (e.g., ‖λ‖₀ ≤ R), we can add it as a constraint in the formulation and set C to a small value (e.g., C = 10⁻⁸). In this case, the optimal solution corresponds to the best model that obeys the model size constraint, provided that C is small enough (see Appendix B).

• If we wish to set the model size based on cross-validated performance, we can repeat the previous process for ‖λ‖₀ ≤ R for R = 1 . . . d. This lets us fit risk scores across the full ℓ₀-regularization path by solving d instances of RiskSlimMINLP. In comparison, a standard approach (i.e., where we treat C as a hyperparameter and define a grid of values) would require solving at least d instances, as we cannot determine (in advance) d values of C that would return the full range of risk scores.
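The ℓ₀-path procedure can be sketched on a tiny problem, with brute-force enumeration over a small coefficient lattice standing in for the MINLP solver (the data and lattice below are hypothetical):

```python
import itertools
import numpy as np

# Hypothetical training data; the first column is the intercept feature.
X = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0], [1.0, 0.0, 0.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
d = X.shape[1] - 1
lattice = range(-2, 3)                    # coefficients restricted to {-2,...,2}

def loss(lam):
    return np.mean(np.log1p(np.exp(-y * (X @ np.asarray(lam, dtype=float)))))

# One size-constrained problem per model size R, solved here by enumeration.
path = {}
for R in range(1, d + 1):
    feasible = (lam for lam in itertools.product(lattice, repeat=d + 1)
                if np.count_nonzero(lam[1:]) <= R)   # ||lam||_0 <= R (no intercept)
    path[R] = min(feasible, key=loss)
```

Relaxing the size limit can only help the fit, so the loss along the path is non-increasing in R.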
Computational Complexity

Optimizing RiskSlimMINLP is a difficult computational task given that ℓ₀-regularization, minimization over integers, and MINLP problems are all NP-hard (Bonami et al., 2012). These are worst-case complexity results, which mean that finding an optimal solution to RiskSlimMINLP may be intractable for high-dimensional datasets. As we will show, however, RiskSlimMINLP can be solved to optimality for many real-world datasets in minutes, and in a way that scales linearly in the sample size.
Notation, Assumptions, and Terminology
We denote the objective function of RiskSlimMINLP as V(λ) = l(λ) + C‖λ‖₀, and an optimal solution as λ* ∈ argmin_{λ ∈ L} V(λ). We bound the optimal values of the objective, loss, and ℓ₀-norm as V(λ*) ∈ [V^min, V^max], l(λ*) ∈ [L^min, L^max], and ‖λ*‖₀ ∈ [R^min, R^max], respectively. We denote the set of feasible coefficients for feature j as L_j, and define Λ_j^min = min_{λ_j ∈ L_j} λ_j and Λ_j^max = max_{λ_j ∈ L_j} λ_j.

For clarity of exposition, we assume that: (i) the coefficient set contains the null vector, 0 ∈ L, which ensures that RiskSlimMINLP is always feasible; (ii) the intercept is not regularized, which means that the more precise version of the RiskSlimMINLP objective function is V(λ) = l(λ) + C‖λ_{[1,d]}‖₀, where λ = [λ_0, λ_{[1,d]}].

We measure the optimality of a feasible solution λ′ ∈ L in terms of its optimality gap, defined as (V(λ′) − V^min) / V(λ′). Given an algorithm to solve RiskSlimMINLP, we denote the best feasible solution that the algorithm returns in a fixed time as λ^best ∈ L. The optimality gap of λ^best is computed using an upper bound set as V^max = V(λ^best), and a lower bound V^min that is provided by the algorithm. We say that the algorithm has solved RiskSlimMINLP to optimality if λ^best has an optimality gap of ε = 0, in which case it has recovered an optimal solution to RiskSlimMINLP and produced a lower bound V^min = V(λ*).
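The bookkeeping above reduces to a one-line computation; the bound values here are hypothetical:

```python
# Hypothetical bounds returned by an algorithm after a fixed time budget.
V_max = 0.42                        # V(lam_best), objective value of the incumbent
V_min = 0.40                        # lower bound proved by the algorithm

gap = (V_max - V_min) / V_max       # optimality gap of lam_best
solved_to_optimality = (gap == 0.0)
```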
3. Methodology
In this section, we discuss the cutting plane algorithm that we use to solve the risk score problem. We start with a brief introduction to cutting plane algorithms to describe their practical benefits and to explain why existing algorithms stall on non-convex problems (Section 3.1). In Section 3.2, we present a new cutting plane algorithm that does not stall. In Section 3.3, we compare the performance of cutting plane algorithms to a commercial MINLP solver on difficult instances of the risk score problem.
In Algorithm 1, we present a simple cutting plane algorithm to solve RiskSlimMINLP that we call CPA.

CPA recovers the optimal solution to RiskSlimMINLP by repeatedly solving a surrogate problem that optimizes a linear approximation of the loss function l(λ). The approximation is built using cutting planes or cuts. Each cut is a supporting hyperplane to the loss function at a fixed point λ^t ∈ L:

l(λ^t) + ⟨∇l(λ^t), λ − λ^t⟩.

Here, l(λ^t) ∈ ℝ₊ and ∇l(λ^t) ∈ ℝ^{d+1} are cut parameters that can be computed by evaluating the value and gradient of the loss at the point λ^t:

l(λ^t) = (1/n) Σ_{i=1}^n log(1 + exp(−⟨λ^t, y_i x_i⟩)),   ∇l(λ^t) = (1/n) Σ_{i=1}^n −y_i x_i / (1 + exp(⟨λ^t, y_i x_i⟩)).   (3)

As shown in Figure 2, we can construct a cutting plane approximation of the loss function by taking the pointwise maximum of multiple cuts. In what follows, we denote the cutting plane approximation of the loss function built using k cuts as:

l̂^k(λ) = max_{t=1...k} [ l(λ^t) + ⟨∇l(λ^t), λ − λ^t⟩ ].

Figure 2: A convex loss function l(λ) and its cutting plane approximation l̂(λ).

On iteration k, CPA solves a surrogate mixed-integer program (MIP) that minimizes the cutting plane approximation l̂^k, namely RiskSlimMIP(l̂^k). CPA uses the optimal solution to the surrogate MIP, (L^k, λ^k), in two ways: (i) it computes a new cut at λ^k to improve the cutting plane approximation; (ii) it computes bounds on the optimal value of RiskSlimMINLP to check for convergence.
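The cut parameters in (3) and the pointwise-maximum approximation can be sketched as follows (the training data and cut points below are hypothetical):

```python
import numpy as np

# Hypothetical training data; the first column is the intercept feature.
X = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])

def cut_at(lam):
    """Return the cut parameters (l(lam), grad l(lam)) at a fixed point lam."""
    z = y * (X @ lam)                                           # <lam, y_i x_i>
    value = np.mean(np.log1p(np.exp(-z)))
    grad = np.mean(-(y[:, None] * X) / (1.0 + np.exp(z))[:, None], axis=0)
    return value, grad

# Build an approximation from cuts at two fixed points.
points = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
cuts = [(p, *cut_at(p)) for p in points]

def l_hat(lam):
    """Pointwise maximum of the cuts: a lower bound on the true loss."""
    return max(v + g @ (lam - p) for p, v, g in cuts)
```

By convexity, each cut lies below the loss everywhere and touches it at its own cut point, so l_hat underestimates l and is exact at the points where cuts were computed.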
Algorithm 1: Cutting Plane Algorithm (CPA)

Input: (x_i, y_i)_{i=1}^n  training data
       L  coefficient set
       C  ℓ₀ penalty parameter
       ε_stop ∈ [0, 1]  maximum optimality gap of acceptable solution

Initialize: k ← 0  (iteration counter)
            l̂⁰(λ) ← 0  (cutting plane approximation, initially no cuts)
            (V^min, V^max) ← (0, ∞)  (bounds on the optimal value of RiskSlimMINLP)
            ε ← ∞  (optimality gap)

1:  while ε > ε_stop do
2:      (L^k, λ^k) ← provably optimal solution to RiskSlimMIP(l̂^k)
3:      compute cut parameters l(λ^k) and ∇l(λ^k)
4:      l̂^{k+1}(λ) ← max{ l̂^k(λ), l(λ^k) + ⟨∇l(λ^k), λ − λ^k⟩ }  (update approximation for all λ ∈ L)
5:      V^min ← L^k + C‖λ^k‖₀  (optimal value of RiskSlimMIP is a lower bound)
6:      if V(λ^k) < V^max then
7:          V^max ← V(λ^k)  (update upper bound)
8:          λ^best ← λ^k  (update incumbent)
9:      end if
10:     ε ← 1 − V^min / V^max  (update optimality gap)
11:     k ← k + 1
12: end while

Output: λ^best  (ε-optimal solution to RiskSlimMINLP)
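The sketch below runs the CPA loop end-to-end on a tiny instance; brute-force enumeration over a small coefficient lattice plays the role of the provably optimal MIP solve in line 2, and the data, lattice, and tolerance are all hypothetical:

```python
import itertools
import numpy as np

# Hypothetical data; the first column of X is the intercept feature.
X = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 2.0], [1.0, -2.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
C = 1e-3
lattice = [np.array(p, dtype=float)
           for p in itertools.product(range(-3, 4), repeat=2)]

def loss(lam):
    return np.mean(np.log1p(np.exp(-y * (X @ lam))))

def grad(lam):
    z = y * (X @ lam)
    return np.mean(-(y[:, None] * X) / (1.0 + np.exp(z))[:, None], axis=0)

def l0(lam):
    return np.count_nonzero(lam[1:])          # intercept is not penalized

cuts, V_min, V_max, lam_best = [], 0.0, np.inf, None
while True:
    def surrogate(lam):                       # max-of-cuts approximation + C*||.||_0
        lhat = max((v + g @ (lam - p) for p, v, g in cuts), default=0.0)
        return max(lhat, 0.0) + C * l0(lam)
    lam_k = min(lattice, key=surrogate)       # stand-in for the RiskSlimMIP solve
    V_min = surrogate(lam_k)                  # optimal surrogate value = lower bound
    V_k = loss(lam_k) + C * l0(lam_k)         # true objective at lam_k
    if V_k < V_max:
        V_max, lam_best = V_k, lam_k          # update incumbent and upper bound
    if V_max - V_min <= 1e-6:
        break
    cuts.append((lam_k, loss(lam_k), grad(lam_k)))
```

Because the lattice is finite and cuts are exact at their own cut points, the loop must terminate: revisiting an already-cut point closes the gap immediately.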
RiskSlimMIP (ˆ l k ) is a surrogate problem for RiskSlimMINLP that minimizes a cuttingplane approximation ˆ l k of the loss function l :min L, λ L + C (cid:107) λ (cid:107) s.t. L ≥ ˆ l ( λ ) λ ∈ L . (2)We present a MIP formulation for RiskSlimMIP (ˆ l k ) in Definition 2, described shortly. stun and Rudin to check for convergence. Here, the upper bound is set as the objective value of the bestsolution across all iterations: V max = min t =1 ...k (cid:104) l ( λ t ) + C (cid:107) λ t (cid:107) (cid:105) . The lower bound is set as the optimal value to the surrogate problem at the current iteration: V min = ˆ l k ( λ k ) + C (cid:107) λ k (cid:107) . CPA converges to an optimal solution of
RiskSlimMINLP in a finite number of iterations (see e.g., Kelley, 1960, for a proof). In particular, the cutting plane approximation of a convex loss function improves monotonically with each cut:

    l̂^k(λ) ≤ l̂^{k+m}(λ) ≤ l(λ)   for all λ ∈ L and k, m ∈ ℕ.

Since the cuts added at each iteration are not redundant, the lower bound improves monotonically with each iteration. Once the optimality gap ε falls below the stopping threshold ε_stop, CPA terminates and returns an ε-optimal solution λ^best to RiskSlimMINLP.

RiskSlimMIP Formulation
Definition 2 (RiskSlimMIP)
Given a finite coefficient set L ⊂ ℤ^{d+1}, trade-off parameter C₀ > 0, and cutting plane approximation l̂^k: ℝ^{d+1} → ℝ₊ with cut parameters {l(λ^t), ∇l(λ^t)}, t = 1,...,k, the surrogate optimization problem RiskSlimMIP(l̂^k) can be formulated as the mixed integer program:

    min_{L,λ,α}  V
    s.t.  V = L + C₀R                                          objective value          (4a)
          L ≥ l(λ^t) + ⟨∇l(λ^t), λ − λ^t⟩    t = 1,...,k       cut constraints          (4b)
          R = Σ_{j=1}^{d} α_j                                  ℓ₀-norm                  (4c)
          λ_j ≤ Λ_j^max α_j                   j = 1,...,d       ℓ₀ indicator constraints (4d)
          λ_j ≥ Λ_j^min α_j                   j = 1,...,d       ℓ₀ indicator constraints (4e)
          V ∈ [V^min, V^max]                                   bounds on objective value (4f)
          L ∈ [L^min, L^max]                                   bounds on loss value      (4g)
          R ∈ {R^min, ..., R^max}                              bounds on ℓ₀-norm        (4h)
          λ_j ∈ {Λ_j^min, ..., Λ_j^max}       j = 1,...,d       coefficient bounds       (4i)
          α_j ∈ {0, 1}                        j = 1,...,d       ℓ₀ indicator variables

The formulation in Definition 2 contains 2d + 3 variables and k + 2d + 2 constraints (excluding bounds on the variables). Here, the cutting plane approximation is represented via the cut constraints in (4b) and the auxiliary variable L ∈ ℝ₊. The ℓ₀-norm is computed using the indicator variables α_j = 1[λ_j ≠ 0] set in constraints (4d) and (4e). The λ_j are restricted to a bounded set of integers in constraints (4i).

The formulation includes two additional auxiliary variables: V, defined as the objective value in (4a); and R, defined as the ℓ₀-norm in (4c). These variables will be useful for implementing the techniques in Section 4. In particular, by including V and R, we can use a control callback in a MIP solver to set bounds on these quantities during branch-and-bound without adding new constraints to the formulation.

Why Solve the Risk Score Problem with a Cutting Plane Algorithm
CPA has three important properties that motivate why we want to use a cutting plane algorithm to solve the risk score problem:

(i) Scalability in the Sample Size: Cutting plane algorithms only use the training data when computing cut parameters, and not while solving RiskSlimMIP. Since the parameters in (3) can be computed using elementary matrix-vector operations in O(nd) time at each iteration, running time scales linearly in n for fixed d (see Figure 3).

(ii) Control over Data-related Computation: Since cutting plane algorithms compute cut parameters in a single isolated step (e.g., Step 3 in Algorithm 1), users can reduce data-related computation by customizing their implementation to compute cut parameters efficiently (e.g., via distributed computing, or techniques that exploit structural properties of a specific model class as in Section 4.4).

(iii) Ability to use a MIP Solver: Cutting plane algorithms have a special benefit in our setting since the surrogate problem can be solved with a MIP solver (rather than a MINLP solver). MIP solvers provide a fast implementation of branch-and-bound search and other features to speed up the search process (e.g., built-in heuristics, preprocessing and cut generation procedures, lazy evaluation of cut constraints, and control callbacks that let us customize the search with specialized techniques). As we show in Figure 6, using a MIP solver can substantially improve our ability to solve RiskSlimMINLP, despite the fact that one may have to solve multiple MIPs.

Note that (i) and (ii) are well-known benefits of cutting plane algorithms for convex empirical risk minimization problems (see e.g., Teo et al., 2007).

Figure 3:
Runtime of CPA on synthetic datasets with d = 10 and a range of sample sizes n (see Appendix D for details). As n increases, the runtime for the solver (grey) remains roughly constant. The total runtime (black) scales at O(n), which reflects the scalability of the matrix-vector operations used to compute cut parameters.

Stalling in Non-Convex Settings
Cutting plane algorithms for empirical risk minimization (Joachims, 2006; Franc and Sonnenburg, 2008; Teo et al., 2009) are similar to CPA in that they solve a surrogate optimization problem at each iteration (e.g., Step 2 of Algorithm 1). When these algorithms are used to solve convex risk minimization problems, the surrogate is convex and therefore tractable. When the algorithms are used to solve risk minimization problems with non-convex regularizers or constraints, however, the surrogate is non-convex. In these settings, cutting plane algorithms will typically stall as they eventually reach an iteration where the surrogate problem cannot be solved to optimality within a fixed time limit.

In Figure 4, we illustrate the stalling behavior of
CPA on a difficult instance of
RiskSlimMINLP for a synthetic dataset where d = 20 (see also Figure 6). As shown, the first iterations terminate quickly because the surrogate problem RiskSlimMIP contains only a trivial approximation of the loss. Since the surrogate becomes increasingly difficult to optimize with each iteration, however, the time to solve
RiskSlimMIP increases exponentially, leading
CPA to stall at iteration k = 86. In this case, the solution returned by CPA after 6 hours has a large optimality gap and a highly suboptimal loss. This is unsurprising, as the solution was obtained by optimizing a low-fidelity approximation of the loss (i.e., an 85-cut approximation of a 20-dimensional function). Since the value of the loss is tied to the performance of the model (see Section 5), the solution corresponds to a risk score with poor performance.

There is no simple fix to prevent cutting plane algorithms such as
CPA from stalling on non-convex problems. This is because they need a certifiably optimal solution at each iteration to compute a valid lower bound. In non-convex risk minimization problems, this requires finding the optimal solution of a non-convex surrogate and certifying that this solution has an optimality gap of 0.0%. If, for example, CPA only solved the surrogate until it found a feasible solution with a non-zero optimality gap, the lower bound computed in Step 5 could exceed the true optimal value, which would lead the algorithm to terminate prematurely and return a suboptimal solution with invalid bounds.

Seeing how the stalling behavior of
CPA is related to the mechanism used to check convergence, a tempting (but flawed) solution is to use an algorithm that constructs a cutting plane approximation by computing cuts at central points of RiskSlimMIP, such as the center of gravity algorithm of Levin (1965), or the analytic center algorithm of Atkinson and Vaidya (1995). Such algorithms are guaranteed to converge in a fixed number of iterations and do not require computing a lower bound. In this case, however, they would still stall on high-dimensional problems, as they would need to compute central points by solving a non-convex optimization problem to optimality at each iteration.
To avoid stalling in non-convex settings, we solve the risk score problem using the lattice cutting plane algorithm (LCPA) shown in Algorithm 2.
LCPA has the same benefits as other cutting plane algorithms for the risk score problem, such as scalability in the sample size, control over data-related computation, and the ability to use a MIP solver. As shown in Figure 5, however, LCPA does not stall. This is because it can add cuts and compute a lower bound without having to optimize a non-convex surrogate.

Figure 4:
Performance profile of
CPA on RiskSlimMINLP for a synthetic dataset with n = 50,000 and d = 20 (see Appendix D for details). We plot the time per iteration (left, in log-scale) and the optimality gap (right) for each iteration over 6 hours. CPA stalls on iteration 86, at which point the time to solve
RiskSlimMIP to optimality increases exponentially. The best solution obtained after 6 hours corresponds to a risk score with poor performance.
Figure 5:
Performance profile of
LCPA (red) and
CPA (black) on the
RiskSlimMINLP instance in Figure 4. Unlike
CPA , LCPA does not stall.
LCPA recovers a high-quality risk score (i.e., whose objective value is within
10% of the optimal value) in 9 minutes after adding 4,655 cuts, and the optimal risk score in 234 minutes after adding 11,665 cuts. The remaining time is used to reduce the optimality gap.

LCPA recovers the optimal solution to
RiskSlimMINLP via branch-and-bound (B&B) search. The search process recursively splits the feasible region of RiskSlimMINLP, discarding parts that are infeasible or provably suboptimal. LCPA solves a surrogate linear program (LP) over each region. It updates the cutting plane approximation when the surrogate LP yields an integer feasible solution, and sets the lower bound for the risk score problem as the smallest lower bound of the surrogate LP over unexplored regions.
Algorithm 2
Lattice Cutting Plane Algorithm (LCPA)
Input: training data (x_i, y_i), i = 1,...,n; coefficient set L; ℓ₀ penalty parameter C₀; ε_stop ∈ [0, 1], maximum optimality gap of acceptable solution; RemoveNode, a procedure to remove a node from a node set (provided by MIP solver); SplitRegion, a procedure to split a region into disjoint subsets (provided by MIP solver); RiskSlimLP(l̂, R), the LP relaxation of RiskSlimMIP(l̂) over the region R ⊆ conv(L) (see Definition 3)
Initialize: k ← 0 (number of cuts); l̂⁰(λ) ← {0} (cutting plane approximation); (V^min, V^max) ← (0, ∞) (bounds on the optimal value of RiskSlimMINLP); R⁰ ← conv(L) (initial region is the convex hull of the coefficient set); v⁰ ← 0 (lower bound of the objective value of the surrogate LP at R⁰); N ← {(R⁰, v⁰)} (node set); ε ← ∞ (optimality gap)
1:  while ε > ε_stop do
2:      (R^t, v^t) ← RemoveNode(N)   (t is the index of the removed node)
3:      solve RiskSlimLP(l̂^k, R^t)
4:      λ^t ← coefficients from optimal solution to RiskSlimLP(l̂^k, R^t)
5:      V^t ← optimal value of RiskSlimLP(l̂^k, R^t)
6:      if optimal solution is integer feasible then
7:          compute cut parameters l(λ^t) and ∇l(λ^t)
8:          l̂^{k+1}(λ) ← max{l̂^k(λ), l(λ^t) + ⟨∇l(λ^t), λ − λ^t⟩}   (update approximation for all λ ∈ L)
9:          if V^t < V^max then
10:             V^max ← V^t   (update upper bound)
11:             λ^best ← λ^t   (update best solution)
12:             N ← N \ {(R^s, v^s) | v^s ≥ V^max}   (prune suboptimal nodes)
13:         end if
14:         k ← k + 1
15:     else if optimal solution is not integer feasible then
16:         (R′, R″) ← SplitRegion(R^t, λ^t)   (R′, R″ are disjoint subsets of R^t)
17:         N ← N ∪ {(R′, V^t), (R″, V^t)}   (V^t is the lower bound of RiskSlimLP for child regions R′, R″)
18:     end if
19:     V^min ← min_{s=1,...,|N|} v^s   (smallest lower bound among nodes in N)
20:     ε ← 1 − V^min/V^max   (update optimality gap)
21: end while
Output: λ^best, an ε-optimal solution to RiskSlimMINLP

Definition 3 (RiskSlimLP)
Given a bounded convex region
R ⊆ conv(L), trade-off parameter C₀ > 0, and cutting plane approximation l̂^k: ℝ^{d+1} → ℝ₊ with cut parameters {l(λ^t), ∇l(λ^t)}, t = 1,...,k, the surrogate optimization problem RiskSlimLP(l̂^k, R) can be formulated as the linear program:

    min_{L,λ,α}  V
    s.t.  V = L + C₀R                                          objective value
          R = Σ_{j=1}^{d} α_j                                  relaxed ℓ₀-norm
          L ≥ l(λ^t) + ⟨∇l(λ^t), λ − λ^t⟩    t = 1,...,k       cut constraints
          λ_j ≤ Λ_j^max α_j                   j = 1,...,d       ℓ₀ indicator constraints
          λ_j ≥ Λ_j^min α_j                   j = 1,...,d       ℓ₀ indicator constraints
          λ ∈ R                                                feasible region
          V ∈ [V^min, V^max]                                   objective bounds
          L ∈ [L^min, L^max]                                   loss bounds
          R ∈ [R^min, R^max]                                   relaxed ℓ₀ bounds
          α_j ∈ [0, 1]                        j = 1,...,d       relaxed ℓ₀ indicators

Branch-and-Bound Search
In Algorithm 2, we represent the state of the B&B search process using a B&B tree. We refer to each leaf of the tree as a node, and denote the set of all nodes as N. Each node (R^t, v^t) ∈ N consists of: a region of the convex hull of the coefficient set, R^t ⊆ conv(L); and a lower bound v^t on the objective value of the surrogate LP over this region.

Each iteration of LCPA removes a node (R^t, v^t) ∈ N from the node set, then solves the surrogate LP for the corresponding region: RiskSlimLP(l̂^k, R^t). Subsequent steps of the algorithm are determined by the solution status of the surrogate LP:

• If RiskSlimLP(l̂^k, R^t) has an integer solution, LCPA updates the cutting plane approximation l̂^k with a new cut at λ^t in Step 8.

• If RiskSlimLP(l̂^k, R^t) has a real-valued solution, LCPA adds two child nodes (R′, v^t) and (R″, v^t) to the node set N in Step 17. The child nodes are produced by applying a splitting rule, which splits R^t into disjoint regions R′ and R″. The lower bound for each child node is set as the optimal value V^t of the surrogate LP.

• If RiskSlimLP(l̂^k, R^t) is infeasible, then LCPA discards the node from the node set.

The B&B search is governed by two procedures that are implemented in a MIP solver:

• RemoveNode, which removes a node (R^t, v^t) from the node set N (e.g., the node with the smallest lower bound v^t).

• SplitRegion, which splits R^t into disjoint subsets (e.g., splitting on a fractional component λ^t_j, which returns R′ = {λ ∈ R^t | λ_j ≥ ⌈λ^t_j⌉} and R″ = {λ ∈ R^t | λ_j ≤ ⌊λ^t_j⌋}).

The output conditions for SplitRegion must ensure that: the regions at each node remain disjoint; the total number of nodes remains finite; and the total search region shrinks even when the surrogate LP has a real-valued solution.

Convergence
LCPA certifies that it has recovered a globally optimal solution to the risk score problem by using bounds on the objective value of RiskSlimMINLP. The upper bound V^max is set by the objective value of the best integer feasible solution in Step 10. The lower bound V^min is set by the smallest objective value among all nodes in Step 19. The value of V^min can be viewed as a lower bound on the objective value of the surrogate LP over the remaining search region ∪_t R^t (i.e., V^min is a lower bound on the objective value of RiskSlimLP(l̂^k, ∪_t R^t)). Thus, V^min will increase when we reduce the remaining search region or add cuts.

Each iteration of LCPA reduces the remaining search region by either finding an integer feasible solution, identifying an infeasible region, or splitting a region into disjoint subsets. Thus, V^min increases monotonically as the search region becomes smaller and cuts are added at integer feasible solutions. Likewise, V^max decreases monotonically, as it is set as the objective value of the best integer feasible solution. Since there are a finite number of nodes in the worst case, LCPA terminates after a finite number of iterations, returning an optimal solution to the risk score problem.
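The search loop of Algorithm 2 can be sketched in Python as follows. This is our own simplified illustration, not the paper's implementation: regions are axis-aligned boxes, the surrogate LP is solved with scipy's `linprog`, the ℓ₀ penalty is omitted (C₀ = 0) for brevity, and a node is re-queued after its integer solution yields a cut (a MIP solver with lazy cuts does this implicitly).

```python
import heapq
import numpy as np
from scipy.optimize import linprog

def lcpa(X, y, bound=3, eps_stop=0.01, max_lps=500):
    """Sketch of Algorithm 2 over box regions, with C0 = 0 (loss only)."""
    d = X.shape[1]
    cuts = []                                   # cut parameters (l(λ^t), ∇l(λ^t), λ^t)
    V_max, lam_best = np.inf, None
    nodes = [(0.0, 0, -bound * np.ones(d), bound * np.ones(d))]   # node set N
    tick = 1                                    # unique tiebreaker for the heap
    for _ in range(max_lps):
        if not nodes:
            break
        V_min = nodes[0][0]                     # smallest lower bound among nodes
        if 1.0 - V_min / V_max <= eps_stop:     # optimality gap check
            break
        v, _, lb, ub = heapq.heappop(nodes)     # RemoveNode: best-first
        if v >= V_max:
            continue                            # prune provably suboptimal region
        # surrogate LP over [lb, ub]: minimize L subject to the cut constraints
        c = np.r_[np.zeros(d), 1.0]             # variables (λ_1..λ_d, L)
        A = [np.r_[g, -1.0] for _, g, _ in cuts] or None
        b = [g @ lk - l for l, g, lk in cuts] or None
        res = linprog(c, A_ub=A, b_ub=b,
                      bounds=[(lb[j], ub[j]) for j in range(d)] + [(0, None)])
        if not res.success:
            continue                            # infeasible region: discard node
        lam, V_t = res.x[:d], res.fun
        frac = np.abs(lam - np.round(lam))
        if frac.max() < 1e-6:                   # integer feasible: Steps 7-12
            lam = np.round(lam)
            z = y * (X @ lam)
            loss = float(np.mean(np.log1p(np.exp(-z))))
            grad = -(X * (y / (1.0 + np.exp(z)))[:, None]).mean(axis=0)
            cuts.append((loss, grad, lam))      # add cut at λ^t
            if loss < V_max:
                V_max, lam_best = loss, lam     # update incumbent
            heapq.heappush(nodes, (V_t, tick, lb, ub))   # re-examine with new cut
        else:                                   # SplitRegion on a fractional λ_j
            j = int(frac.argmax())
            ub1, lb2 = ub.copy(), lb.copy()
            ub1[j], lb2[j] = np.floor(lam[j]), np.ceil(lam[j])
            heapq.heappush(nodes, (V_t, tick, lb, ub1))
            heapq.heappush(nodes, (V_t, tick + 1, lb2, ub))
            tick += 1
        tick += 1
    return lam_best, V_max
```

The key structural points of Algorithm 2 survive the simplification: the training data is touched only when a cut is computed, the LP over each region supplies the lower bounds, and the heap's smallest key plays the role of V^min.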
Remark 4 (Worst-Case Data-Related Computation for
LCPA) Given any training dataset (x_i, y_i), i = 1,...,n, any trade-off parameter C₀ > 0, and any finite coefficient set L ⊂ ℤ^{d+1}, LCPA returns an optimal solution to the risk score problem after computing at most |L| cutting planes, and processing at most 2|L| − 1 nodes.

Implementation with a MIP Solver with Lazy Cut Evaluation
LCPA can easily be implemented using a MIP solver (e.g., CPLEX, Gurobi, SCIP, or GLPK) with control callbacks. In this approach, the solver handles all B&B-related steps of Algorithm 2, and users only need to write the code for a control callback that updates the cutting plane approximation. In a basic implementation, the solver calls the control callback whenever it finds an integer feasible solution (i.e., Step 6). The callback retrieves the integer feasible solution, computes the cut parameters, adds a cut to the surrogate LP, and returns control to the solver at Step 9.

A key benefit of using a MIP solver is the ability to add cuts as lazy constraints. In practice, if we were to add cuts as generic constraints to the surrogate LP, the time to solve the surrogate LP would increase with each cut, which would progressively slow down
LCPA. When we add cuts as lazy constraints, the solver branches using a surrogate LP that contains a subset of relevant cuts, and only evaluates the complete set of cuts when
LCPA finds an integer feasible solution. In this case,
LCPA still returns the optimal solution. However, computation is significantly reduced, as the surrogate LP is much faster to solve in the vast majority of cases, where it is infeasible or yields a real-valued solution. From a design perspective, lazy cut evaluation reduces the marginal computational cost of adding cuts, which allows us to add cuts liberally (i.e., without having to worry about slowing down
LCPA by adding too many cuts).

In what follows, we benchmark
CPA and
LCPA (without the improvements in Section 4) against three MINLP algorithms as implemented in a state-of-the-art commercial MINLP solver (Artelys Knitro 9.0, which is an updated version of the solver in Byrd et al., 2006). In Figure 6, we show the performance of the algorithms on difficult instances of the risk score problem for synthetic datasets with d dimensions and n samples (see Appendix D for details). We consider the following performance metrics: (i) the time to find a near-optimal solution; (ii) the optimality gap of the best solution at termination; and (iii) the proportion of time spent on data-related computation. Since all three MINLP algorithms behaved similarly, we show only the best one in Figure 6 (ActiveSetMINLP) and include results for the remaining algorithms in Appendix D.3.

As shown,
LCPA finds an optimal or near-optimal solution for almost all instances of the risk score problem, and pairs the solution with a small optimality gap.
CPA performs similarly to
LCPA on low-dimensional instances. On instances with d ≥
15, however,
CPA stalls after a few iterations and returns a highly suboptimal solution (i.e., a risk score with poor performance).

In comparison to the cutting plane algorithms, the MINLP algorithms can only handle instances with small n or d. On larger instances, the solver is slowed down by operations that involve data-related computation, fails to converge within the 6-hour time limit, and fails to recover high-quality solutions. Seeing how MINLP solvers are designed to solve a diverse set of optimization problems, we do not believe that they can identify and exploit the structure of the risk score problem in the same way as a cutting plane algorithm.
3. Knitro is one of several MINLP solvers that can solve the risk score problem directly (see Bussieck and Vigerske, 2010, for others). We chose Knitro because it let us control factors that would otherwise lead an MINLP solver to perform poorly in benchmarks, namely: (i) Knitro provides the ability to solve LP subproblems with a third-party LP solver, which let us ensure that all algorithms used the same LP solver (i.e., CPLEX 12.6.3, ILOG, 2017); (ii) Knitro provides the ability to compute the objective function of the MINLP with a user-defined function handle, which let us monitor and control data-related computation by using the same functions to evaluate the objective, its gradient, and its Hessian.
Time to Train a Good Risk Score: the time for an algorithm to find a solution whose loss is within 10% of the optimal loss. This reflects the time to obtain a risk score with good calibration without a proof of optimality. (Legend: <1 min, <10 min, <1 hour, <6 hours, 6+ hours.)

Optimality Gap of Best Solution at Termination: (V^max − V^min)/V^max, where V^max is the objective value of the best solution found at termination. A gap of 0.0% means an algorithm has found the optimal solution and provided a proof of optimality within 6 hours. (Legend: 0%, 0−20%, 20−50%, 50−90%, 90−100%.)

% Time Spent on Data-Related Computation: the proportion of total runtime that an algorithm spends computing the value, gradient, or Hessian of the loss function.

Figure 6:
Performance of
LCPA , CPA , and a commercial MINLP solver on difficult instances of
RiskSlimMINLP for synthetic datasets with d dimensions and n samples (see Appendix D for details). ActiveSetMINLP fails to produce good risk scores on instances with large n or d, as it struggles with data-related computation. CPA and
LCPA scale linearly in n when d is fixed: if they solve an instance for a given d, then they can solve instances for larger n in O(n) additional time. CPA stalls when d ≥
15 and returns a low-quality risk score when d ≥
20. In contrast,
LCPA consistently recovers a good model without stalling. Results reflect the performance of a basic
LCPA implementation without the improvements in Section 4. We show results for two other MINLP algorithms in Appendix D.
4. Algorithmic Improvements
In this section, we describe specialized techniques to improve the performance of the lattice cutting plane algorithm (
LCPA) on the risk score problem. They include:

• Polishing Heuristic. We present a technique called discrete coordinate descent (
DCD; Section 4.1.1), which we use to polish integer solutions found by
LCPA (Step 6).
DCD aims to improve the objective value of all integer solutions, which produces stronger upper bounds over the course of
LCPA, and reduces the time to recover a good risk score.

• Rounding Heuristic. We present a rounding technique called
SequentialRounding (Section 4.1.2) to generate integer solutions. We use
SequentialRounding to round real-valued solutions to the surrogate LP in Step 15, and polish the rounded solution with
DCD. Rounded solutions may improve the best solution found by
LCPA, producing stronger upper bounds and reducing the time to recover a good risk score.

• Bound Tightening Procedure. We design a fast procedure to strengthen bounds on the optimal values of the objective, the loss, and the number of non-zero coefficients, called
ChainedUpdates (Section 4.2). We call
ChainedUpdates whenever the solver updates the upper bound in Step 10 or the lower bound in Step 19.
ChainedUpdates improves the lower bound, and reduces the optimality gap of the final risk score.

• Initialization Procedure. Since solution quality is affected by the fidelity of the approximate loss function, and
LCPA only adds cuts at integer solutions, the initial set of incumbent solutions from
LCPA may correspond to low-quality risk scores. To mitigate this issue, we devise an initialization procedure that quickly generates a set of cutting planes to warm-start
LCPA (Section 4.3). Using this procedure improves both the upper and the lower bound over the course of
LCPA, reducing the time to obtain a good risk score and the final optimality gap.

• Data-Related Computation. We describe two techniques to reduce data computation. The first technique reduces all data-related computation in
LCPA and heuristic operations by exploiting the fact that risk scores have discrete and bounded coefficients. The second technique uses loss-sensitive rounding heuristics such as
SequentialRounding. By reducing data computation, both techniques improve scalability with respect to sample size.

Our techniques can be adapted to solve other risk minimization problems with similar properties (see Section 1.1 for a list).
We present two heuristics to generate and polish integer solutions for the risk score problem. We use these techniques to improve the performance of
LCPA. Both heuristics can be used to create calibrated risk scores from logistic regression models with real-valued coefficients (see Sections 5 and 6).

Our heuristics can be implemented using control callbacks in a MIP solver, and are designed to run quickly. The procedures that we present can handle sparsity constraints and monotonicity constraints, but may not yield a feasible solution under other kinds of constraints (e.g., a constraint on the false positive rate).

Discrete coordinate descent (DCD) is a technique to polish an integer solution (Algorithm 3). It takes as input an integer solution λ = [λ₀, ..., λ_d]^⊤ ∈ L and iteratively descends along a single coordinate j to attain an integer solution with a better objective value. The descent direction at each iteration is chosen as the coordinate that minimizes the objective value: j ∈ argmin_j V(λ + δ_j e_j).

DCD terminates once it can no longer strictly improve the objective value along any coordinate. This eliminates the potential of cycling, and thereby guarantees that the procedure will terminate in a finite number of iterations. The polished solution returned by
DCD satisfies a type of local optimality guarantee for discrete optimization problems: formally, it is 1-opt with respect to the objective, meaning that one cannot improve the objective value by changing any single coefficient (see e.g., Park and Boyd, 2015b, for a technique to find a 1-opt point for a different optimization problem).

In practice, the most expensive computation in
DCD is Step 5, where we determine a step-size δ_j ∈ ∆_j to minimize the objective along coordinate j. We can significantly reduce this computation by using golden section search. This approach requires nd log|L_j| flops per iteration, compared to the nd|L_j| flops per iteration required by a brute-force approach (i.e., one that evaluates the loss for all λ_j ∈ L_j).

Algorithm 3
Discrete Coordinate Descent (
DCD)
Input: training data (x_i, y_i), i = 1,...,n; coefficient set L; ℓ₀ penalty parameter C₀; integer solution λ ∈ L to RiskSlimMINLP; valid descent directions J ⊆ {0, ..., d}
1:  repeat
2:      V ← V(λ)   (objective value at current solution)
3:      for j ∈ J do
4:          ∆_j ← {δ ∈ ℤ | λ + δe_j ∈ L}   (list feasible moves along dim j)
5:          δ_j ← argmin_{δ ∈ ∆_j} V(λ + δe_j)   (find best move in dim j)
6:          v_j ← V(λ + δ_j e_j)   (store objective value for best move in dim j)
7:      end for
8:      m ← argmin_{j ∈ J} v_j   (descend along dim that minimizes objective)
9:      λ ← λ + δ_m e_m
10: until v_m ≥ V
Output: λ, a solution that is 1-opt with respect to the objective of RiskSlimMINLP
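A direct Python rendering of Algorithm 3 (our own sketch; the toy objective, C₀, and coefficient bound are illustrative, and the line search in Step 5 is done by brute force rather than the golden section search discussed above):

```python
import numpy as np

def objective(lam, X, y, C0=0.01):
    """RiskSlimMINLP objective: normalized logistic loss + C0 * ||λ||_0."""
    return (np.mean(np.log1p(np.exp(-y * (X @ lam))))
            + C0 * np.count_nonzero(lam))

def dcd(lam, X, y, C0=0.01, bound=5):
    """Discrete coordinate descent: polish an integer solution until it is 1-opt,
    i.e. no change to any single coefficient improves the objective."""
    lam = np.array(lam, float)
    while True:
        V = objective(lam, X, y, C0)
        best_v, best_move = V, None
        for j in range(len(lam)):                    # Steps 3-7: best move per dim
            for val in range(-bound, bound + 1):     # brute-force line search
                if val == lam[j]:                    # (the paper accelerates this
                    continue                         #  with golden section search)
                trial = lam.copy()
                trial[j] = val
                v = objective(trial, X, y, C0)
                if v < best_v:
                    best_v, best_move = v, (j, val)
        if best_move is None:                        # no strict improvement: 1-opt
            return lam
        j, val = best_move                           # Step 9: descend
        lam[j] = val
```

Because each accepted move strictly decreases the objective over a finite lattice, the loop cannot cycle and must terminate at a 1-opt point.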
In Figure 7, we show how
DCD improves the performance of
LCPA when we use it to polish feasible solutions found by the MIP solver (i.e., in Step 6 of Algorithm 2).

Figure 7:
Performance profile of
LCPA in a basic implementation (black) and with
DCD (red). We use
DCD to polish every integer solution found by the MIP solver whose objective value is within 10% of the current upper bound. We plot the total number of nodes processed by
LCPA (x-axis) against the upper bound (y-axis; left) and the optimality gap (y-axis; right). We mark iterations where
LCPA updates the incumbent solution. Results reflect performance on
RiskSlimMINLP for a synthetic dataset with d = 30 and n = 50,000 (see Appendix D for details).

SequentialRounding (Algorithm 4) is a rounding heuristic to generate integer solutions for the risk score problem. In comparison to naïve rounding, which returns the closest rounding from a set of 2^{d+1} possible roundings, SequentialRounding returns a rounding that optimizes the objective value of the risk score problem.

Given a real-valued solution λ^real ∈ conv(L), the procedure iteratively rounds one component (up or down) in a way that minimizes the objective of RiskSlimMINLP. On step k, it has already rounded k components of λ^real, and must round one of the remaining d − k + 1 components to ⌈λ_j^real⌉ or ⌊λ_j^real⌋. To this end, it computes the objective for all feasible (component, direction) pairs and chooses the best one. Formally, the minimization on step k requires Σ_{i=1}^{d−k+1} i = (d − k + 1)(d − k + 2)/2 evaluations of the loss function. Thus, given that there are d + 1 steps, SequentialRounding terminates after (1/6)·d(d² + 3d + 2) evaluations of the loss function.

In Figure 8, we show the impact of using SequentialRounding in LCPA. Here, we apply
SequentialRounding to the non-integer solution to
RiskSlimLP when the lower bound changes (i.e., Step 3 of Algorithm 2), then polish the rounded solution using
DCD. As shown, this strategy can reduce the time required for
LCPA to find a high-quality risk score, and attain a lower optimality gap.
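A minimal Python sketch of the procedure (our own illustration, not the paper's code; the logistic-loss objective and C₀ value stand in for the RiskSlimMINLP objective V(λ)):

```python
import numpy as np

def sequential_rounding(lam_real, X, y, C0=0.01):
    """Round a real-valued solution one component at a time, always fixing the
    (component, direction) pair that minimizes the RiskSlimMINLP objective."""
    lam = np.array(lam_real, float)

    def objective(w):
        loss = np.mean(np.log1p(np.exp(-y * (X @ w))))
        return loss + C0 * np.count_nonzero(np.round(w, 10))

    J = [j for j in range(len(lam)) if lam[j] != np.floor(lam[j])]
    while J:                                  # one component is fixed per pass
        best = None                           # (objective, component, rounded value)
        for j in J:
            for val in (np.floor(lam[j]), np.ceil(lam[j])):
                trial = lam.copy()
                trial[j] = val
                v = objective(trial)
                if best is None or v < best[0]:
                    best = (v, j, val)
        _, j, val = best
        lam[j] = val                          # fix the best (component, direction)
        J.remove(j)
    return lam
```

Unlike naïve rounding, which commits to the nearest integer in every coordinate at once, each pass here evaluates the loss of every remaining rounding choice before fixing a single component.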
SequentialRounding can be viewed as a special case of
DCD. In particular, instead of conducting a line search over all feasible values at each iteration,
SequentialRounding rounds each real-valued component to a nearby integer value. Likewise, instead of terminating at a 1-opt point,
SequentialRounding terminates in d + 1 iterations after all components are rounded. In that case, one can see why DCD might perform better than
SequentialRounding since it considers a larger set of integer coefficients and is not restricted to only d + 1iterations. In comparison to DCD , SequentialRounding terminates after a fixed number ofsteps and can be accelerated with a subsampling technique presented in Section 4.4.2. stun and Rudin Algorithm 4
SequentialRounding
Input: training data (x_i, y_i), i = 1,...,n; coefficient set L; ℓ₀ penalty parameter C₀; non-integer infeasible solution λ ∈ conv(L) from RiskSlimLP
Initialize: J^real ← {j : λ_j ≠ ⌊λ_j⌋}   (index set of real-valued coefficients)
1:  repeat
2:      λ^{j,up} ← (λ₀, ..., ⌈λ_j⌉, ..., λ_d) for all j ∈ J^real
3:      λ^{j,down} ← (λ₀, ..., ⌊λ_j⌋, ..., λ_d) for all j ∈ J^real
4:      v^up ← min_{j ∈ J^real} V(λ^{j,up})
5:      v^down ← min_{j ∈ J^real} V(λ^{j,down})
6:      if v^up < v^down then
7:          k ← argmin_{j ∈ J^real} V(λ^{j,up})
8:          λ_k ← ⌈λ_k⌉
9:      else
10:         k ← argmin_{j ∈ J^real} V(λ^{j,down})
11:         λ_k ← ⌊λ_k⌋
12:     end if
13:     J^real ← J^real \ {k}
14: until J^real = ∅
Output: λ ∈ L, an integer solution

Figure 8:
Performance profile of
LCPA in a basic implementation (black) and with
SequentialRounding and
DCD polishing (red). We call
SequentialRounding to round non-integer solutions to
RiskSlimLP in Step 15, and then polish the integer solution with
DCD . We plot large points to show when
LCPA updates the incumbent solution. Results reflect performance on
RiskSlimMINLP for a synthetic dataset with d = 30 and n = 50,000 (see Appendix D for details). Here, SequentialRounding and
DCD reduce the upper bound and optimality gap of
LCPA compared to a basic implementation (although not as much as
DCD alone).

We describe a fast bound tightening technique called
ChainedUpdates (Algorithm 5). This technique iteratively bounds the optimal values of the objective, the loss, and the ℓ₀-penalty by updating the values of V^min, V^max, L^min, L^max, and R^max in RiskSlimLP. Bounding these quantities over the course of B&B restricts the search region without discarding the optimal solution, thereby improving the lower bound and reducing the optimality gap.
Initial Bounds on Objective Terms
We initialize
ChainedUpdates with values of V^min, V^max, L^min, L^max, and R^max that can be computed using only the training data (x_i, y_i), i = 1,...,n, and the coefficient set L. We start with Proposition 5, which provides initial values for L^min and L^max using the fact that the coefficient set L is bounded (see Appendix A for a proof).

Proposition 5 (Bounds on the Logistic Loss over a Bounded Coefficient Set)
Given a training dataset $(x_i, y_i)_{i=1}^n$ where $x_i \in \mathbb{R}^{d+1}$ and $y_i \in \{\pm 1\}$ for $i = 1, \ldots, n$, consider the normalized logistic loss of a linear classifier with coefficients $\lambda$:
$$l(\lambda) = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-\langle \lambda, y_i x_i \rangle)).$$
If the coefficients belong to a bounded set $\mathcal{L}$, then the value of the normalized logistic loss must obey $l(\lambda) \in [L^{min}, L^{max}]$ for all $\lambda \in \mathcal{L}$, where
$$L^{min} = \frac{1}{n} \sum_{i: y_i = +1} \log\left(1 + \exp(-s_i^{max})\right) + \frac{1}{n} \sum_{i: y_i = -1} \log\left(1 + \exp(s_i^{min})\right),$$
$$L^{max} = \frac{1}{n} \sum_{i: y_i = +1} \log\left(1 + \exp(-s_i^{min})\right) + \frac{1}{n} \sum_{i: y_i = -1} \log\left(1 + \exp(s_i^{max})\right),$$
$$s_i^{min} = \min_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle \quad \text{and} \quad s_i^{max} = \max_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle \quad \text{for } i = 1, \ldots, n.$$

The value of $L^{min}$ in Proposition 5 represents the "best-case" loss in a separable setting where we assign each positive example its maximal score $s_i^{max}$ and each negative example its minimal score $s_i^{min}$. Conversely, $L^{max}$ represents the "worst-case" loss when we assign each positive example its minimal score $s_i^{min}$ and each negative example its maximal score $s_i^{max}$. Both $L^{min}$ and $L^{max}$ can be computed in O(n) flops using only the training data and the coefficient set by evaluating $s_i^{min}$ and $s_i^{max}$ as follows (where $\Lambda_j^{min}$ and $\Lambda_j^{max}$ denote the smallest and largest feasible values of coefficient $j$):
$$s_i^{min} = \min_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle = \sum_{j=0}^{d} \mathbb{1}[x_{ij} > 0]\, x_{ij} \Lambda_j^{min} + \mathbb{1}[x_{ij} < 0]\, x_{ij} \Lambda_j^{max}, \tag{6}$$
$$s_i^{max} = \max_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle = \sum_{j=0}^{d} \mathbb{1}[x_{ij} > 0]\, x_{ij} \Lambda_j^{max} + \mathbb{1}[x_{ij} < 0]\, x_{ij} \Lambda_j^{min}. \tag{7}$$

We initialize the bounds on the number of non-zero coefficients trivially as $R \in \{0, \ldots, d\}$. In some cases, these bounds may be stronger due to operational constraints (e.g., we can set $R \in \{0, \ldots, 5\}$ if models are required to use at most 5 features). In such cases, $L^{min}$ and $L^{max}$ can be further tightened since $s_i^{min}$ and $s_i^{max}$ are bounded by the number of non-zero coefficients.
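As a concrete illustration, the per-example score bounds in Equations (6) and (7) can be sketched as follows (a minimal sketch; the function and argument names are ours, with `Lambda_min[j]` and `Lambda_max[j]` standing in for $\Lambda_j^{min}$ and $\Lambda_j^{max}$):

```python
def score_bounds(x, Lambda_min, Lambda_max):
    """Per-example score bounds from Equations (6)-(7): each term x_j * lambda_j
    is minimized by pairing a positive feature with the smallest feasible
    coefficient (and a negative feature with the largest), and vice versa."""
    s_min = sum(xj * (lo if xj > 0 else hi)
                for xj, lo, hi in zip(x, Lambda_min, Lambda_max))
    s_max = sum(xj * (hi if xj > 0 else lo)
                for xj, lo, hi in zip(x, Lambda_min, Lambda_max))
    return s_min, s_max
```

For binary features $x_{ij} \in \{0, 1\}$, $s_i^{min}$ reduces to the sum of $\Lambda_j^{min}$ over the example's active features.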
Once again, $s_i^{min}$ (or $s_i^{max}$) can still be computed efficiently in O(n) flops by choosing the $R^{max}$ smallest (or largest) terms in the right-hand side of Equation (6) (or (7)). Having initialized $L^{min}$, $L^{max}$, $R^{min}$ and $R^{max}$, we set the bounds on the optimal objective value as $V^{min} = L^{min} + C_0 R^{min}$ and $V^{max} = L^{max} + C_0 R^{max}$, respectively.

Dynamic Bounds on Objective Terms
In Propositions 6 to 8, we present bounds that use information from the MIP solver in LCPA to strengthen the values of $L^{min}$, $L^{max}$, $R^{max}$, $V^{min}$ and $V^{max}$ (see Appendix A for proofs).

Proposition 6 (Upper Bound on Optimal Number of Non-Zero Coefficients)
Given an upper bound on the optimal objective value $V^{max} \geq V(\lambda^*)$, and a lower bound on the optimal loss $L^{min} \leq l(\lambda^*)$, the optimal number of non-zero coefficients is at most $R^{max} \geq \|\lambda^*\|_0$ where
$$R^{max} = \left\lfloor \frac{V^{max} - L^{min}}{C_0} \right\rfloor.$$

Proposition 7 (Upper Bound on Optimal Loss)
Given an upper bound on the optimal objective value $V^{max} \geq V(\lambda^*)$, and a lower bound on the optimal number of non-zero coefficients $R^{min} \leq \|\lambda^*\|_0$, the optimal loss is at most $L^{max} \geq l(\lambda^*)$ where $L^{max} = V^{max} - C_0 R^{min}$.

Proposition 8 (Lower Bound on Optimal Loss)
Given a lower bound on the optimal objective value $V^{min} \leq V(\lambda^*)$, and an upper bound on the optimal number of non-zero coefficients $R^{max} \geq \|\lambda^*\|_0$, the optimal loss is at least $L^{min} \leq l(\lambda^*)$ where $L^{min} = V^{min} - C_0 R^{max}$.

Implementation
In Algorithm 5, we present a fast bound-tightening procedure to strengthen the values of $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, and $R^{max}$ in RiskSlimLP using Propositions 6 to 8.

Propositions 6 to 8 impose dependencies between $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, $R^{min}$ and $R^{max}$ that may produce a complex "chain" of updates. As shown in Figure 9, ChainedUpdates can update multiple terms and can update the same term more than once. Consider a case where we call ChainedUpdates after LCPA improves $V^{min}$. Say the procedure updates $L^{min}$ in Step 4. If ChainedUpdates updates $R^{max}$ in Step 6, then it will also update $V^{max}$, $L^{min}$, $L^{max}$, and $V^{min}$. However, if ChainedUpdates does not update $R^{max}$ in Step 6, then it will not update $V^{max}$, $L^{min}$, $L^{max}$, or $V^{min}$, and will terminate.

Algorithm 5
ChainedUpdates
Input: $\ell_0$ penalty parameter $C_0$; initial bounds $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, $R^{min}$, $R^{max}$ on $V(\lambda^*)$, $l(\lambda^*)$ and $\|\lambda^*\|_0$

1: repeat
2:   $V^{min} \leftarrow \max(V^{min},\ L^{min} + C_0 R^{min})$  (update lower bound on $V(\lambda^*)$)
3:   $V^{max} \leftarrow \min(V^{max},\ L^{max} + C_0 R^{max})$  (update upper bound on $V(\lambda^*)$)
4:   $L^{min} \leftarrow \max(L^{min},\ V^{min} - C_0 R^{max})$  (update lower bound on $l(\lambda^*)$)
5:   $L^{max} \leftarrow \min(L^{max},\ V^{max} - C_0 R^{min})$  (update upper bound on $l(\lambda^*)$)
6:   $R^{max} \leftarrow \min\left(R^{max},\ \left\lfloor \frac{V^{max} - L^{min}}{C_0} \right\rfloor\right)$  (update upper bound on $\|\lambda^*\|_0$)
7: until there are no more bound updates due to Steps 2 to 6

Output: $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, $R^{min}$, $R^{max}$

In light of these dependencies, Algorithm 5 applies Propositions 6 to 8 until it can no longer improve $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$ or $R^{max}$. This ensures that ChainedUpdates will return the strongest possible bounds regardless of the term that was first updated.
Figure 9:
All possible "chains" of updates in ChainedUpdates. Circles represent "source" terms that can be updated by LCPA to trigger ChainedUpdates. The path from each source term shows all bounds that can be updated by the procedure. The number on each arrow references the update step in Algorithm 5.
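To illustrate, the fixed-point computation in Algorithm 5 might be sketched as follows (our own minimal Python sketch, not the authors' implementation; the bound names follow Algorithm 5):

```python
import math

def chained_updates(b, C0):
    """Apply Steps 2-6 of Algorithm 5 repeatedly until no bound changes.
    b: dict with keys V_min, V_max, L_min, L_max, R_min, R_max."""
    changed = True
    while changed:
        changed = False

        def tighten(key, value, lower):
            # Raise a lower bound (lower=True) or reduce an upper bound.
            nonlocal changed
            new = max(b[key], value) if lower else min(b[key], value)
            if new != b[key]:
                b[key], changed = new, True

        tighten("V_min", b["L_min"] + C0 * b["R_min"], lower=True)    # Step 2
        tighten("V_max", b["L_max"] + C0 * b["R_max"], lower=False)   # Step 3
        tighten("L_min", b["V_min"] - C0 * b["R_max"], lower=True)    # Step 4
        tighten("L_max", b["V_max"] - C0 * b["R_min"], lower=False)   # Step 5
        tighten("R_max", math.floor((b["V_max"] - b["L_min"]) / C0),
                lower=False)                                          # Step 6
    return b
```

For example, with $C_0 = 0.1$, an improved upper bound $V^{max} = 0.5$ first tightens $L^{max}$ (Step 5) and then $R^{max}$ (Step 6) in a single chain.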
In our implementation, we call ChainedUpdates whenever LCPA improves $V^{max}$ or $V^{min}$ (i.e., Steps 10 or 19 of Algorithm 2). If ChainedUpdates improves any bounds, we pass this information back to the MIP solver by updating the bounds on the auxiliary variables in the RiskSlimLP formulation in Definition 3. As shown in Figure 10, this strategy can considerably improve the lower bound and optimality gap over the course of LCPA. In particular, the bounds from ChainedUpdates restrict the feasible region of the surrogate LP. This ensures that LCPA produces stronger lower bounds on the optimal value of the risk score problem, which reduces the optimality gap and may benefit the search process in other ways (e.g., by pruning suboptimal nodes).

Figure 10:
Performance profile of LCPA in a basic implementation (black) and with ChainedUpdates (red). Results reflect performance on a RiskSlimMINLP instance for a synthetic dataset with d = 30 and n = 50,000 (see Appendix D).

In Algorithm 6, we present an initialization procedure for LCPA. The procedure aims to speed up LCPA by generating: a collection of cutting planes; a good integer solution; and non-trivial bounds on the values of the objective, loss, and number of non-zero coefficients. It combines all of the techniques from this section, as follows:

1. Run CPA on RiskSlimLP: We apply traditional CPA to solve RiskSlimLP until a user-specified stopping condition is met. We store: (i) the cuts from RiskSlimLP (which will be used as an initial cutting-plane approximation for LCPA); (ii) the lower bound from CPA on the objective value of RiskSlimLP (which represents a lower bound $V^{min}$ on the optimal value of RiskSlimMINLP).

2. Sequential Rounding and Polishing: We collect the solutions produced at each iteration of CPA. For each solution, we run SequentialRounding to obtain an integer solution for RiskSlimMINLP, and then polish it using DCD. We use the best solution found so far to update the upper bound $V^{max}$ on the optimal value of RiskSlimMINLP.

3. Chained Updates: Having obtained non-trivial bounds on $V^{min}$ and $V^{max}$, we update the bounds on key quantities using ChainedUpdates.

In Figure 11, we show how the initialization procedure in Algorithm 6 improves the lower bound and the optimality gap over the course of LCPA.

Algorithm 6 Initialization Procedure for LCPA
Input: training data $(x_i, y_i)_{i=1}^n$; coefficient set $\mathcal{L}$; $\ell_0$ penalty parameter $C_0$; initial bounds $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, $R^{min}$, $R^{max}$ on $V(\lambda^*)$, $l(\lambda^*)$ and $\|\lambda^*\|_0$; time limit $T^{max}$ for CPA on RiskSlimLP

Initialize: $\hat{l}(\lambda) \leftarrow \{0\}$ (initial approximation of loss function); $P^{int} \leftarrow \emptyset$ (initial collection of integer solutions)

Step I: Solve RiskSlimLP with CPA
  Solve RiskSlimLP$(\hat{l}, \operatorname{conv}(\mathcal{L}))$ using CPA (Algorithm 1)
  $k \leftarrow$ number of cuts added by CPA within time limit $T^{max}$
  $\hat{l}^{initial} \leftarrow \hat{l}^k$ (store cuts from each iteration)
  $P^{real} \leftarrow \{\lambda^t\}_{t=1}^k$ (store solutions from each iteration)
  $V^{min} \leftarrow$ lower bound from CPA (a lower bound for RiskSlimLP is a lower bound for RiskSlimMINLP)

Step II: Round and Polish Non-Integer Solutions from CPA
  for each $\lambda^{real} \in P^{real}$ do
    $\lambda^{sr} \leftarrow$ SequentialRounding$(\lambda^{real}, \mathcal{L}, C_0)$ (Algorithm 4)
    $\lambda^{dcd} \leftarrow$ DCD$(\lambda^{sr}, \mathcal{L}, C_0)$ (Algorithm 3)
    $P^{int} \leftarrow P^{int} \cup \{\lambda^{dcd}\}$ (store polished integer solutions)
  end for
  $\lambda^{best} \leftarrow \operatorname{argmin}_{\lambda \in P^{int}} V(\lambda)$
  $V^{max} \leftarrow V(\lambda^{best})$ (best integer solution produces upper bound for $V(\lambda^*)$)

Step III: Update Bounds on Objective Terms
  $(V^{min}, \ldots, R^{max}) \leftarrow$ ChainedUpdates$(V^{min}, \ldots, R^{max}, C_0)$ (Algorithm 5)

Output: $\lambda^{best}$, $\hat{l}^{initial}(\lambda)$, $V^{min}$, $V^{max}$, $L^{min}$, $L^{max}$, $R^{min}$, $R^{max}$

Figure 11:
Performance profile of LCPA in a basic implementation (black) and with the initialization procedure in Algorithm 6 (red). Results reflect performance on a RiskSlimMINLP instance for a synthetic dataset with d = 30 and n = 50,000 (see Appendix D for details).

In this section, we present techniques to reduce data-related computation for the risk score problem.
The first technique is designed to speed up the evaluation of the loss function and its gradient, which reduces runtime when we compute cut parameters (3) and call the rounding and polishing procedures in Section 4.1. The technique requires that the features $x_i$ and coefficients $\lambda$ belong to sets that are bounded, discrete, and regularly spaced, such as $x_i \in \mathcal{X} \subseteq \{0, 1\}^d$ and $\lambda \in \mathcal{L} \subseteq \{-5, \ldots, 5\}^{d+1}$.

Evaluating the logistic loss, $\log(1 + \exp(-\langle \lambda, y_i x_i \rangle))$, is a relatively expensive operation because it involves exponentiation and must be carried out in multiple steps to avoid numerical overflow/underflow when the scores $s_i = \langle \lambda, y_i x_i \rangle$ are too small or too large. When the training data and coefficients belong to discrete bounded sets, the scores $s_i = \langle \lambda, y_i x_i \rangle$ belong to a discrete and bounded set $S = \{ \langle \lambda, y_i x_i \rangle \mid i = 1, \ldots, n \text{ and } \lambda \in \mathcal{L} \}$. If the elements of the feature set $\mathcal{X}$ and the coefficient set $\mathcal{L}$ are regularly spaced, then the scores belong to the set of integers $S \subseteq \mathbb{Z} \cap [s^{min}, s^{max}]$ where:
$$s^{min} = \min_{i, \lambda} \{ \langle \lambda, y_i x_i \rangle \mid (x_i, y_i) \in \mathcal{D} \text{ and } \lambda \in \mathcal{L} \},$$
$$s^{max} = \max_{i, \lambda} \{ \langle \lambda, y_i x_i \rangle \mid (x_i, y_i) \in \mathcal{D} \text{ and } \lambda \in \mathcal{L} \}.$$

Thus, we can precompute and store all possible values of the loss function in a lookup table with $s^{max} - s^{min} + 1$ rows, where row $m$ contains the value of $\log(1 + \exp(-(m + s^{min} - 1)))$. The lookup table is cached in memory, which yields a substantial speedup. In addition, when $R^{max}$ is updated over the course of LCPA, the lookup table can be further reduced by recomputing $s^{min}$ and $s^{max}$, and limiting the entries to values between $s^{min}$ and $s^{max}$. The values of $s^{min}$ and $s^{max}$ can be computed in O(n) time, so the update is not expensive.

4. The value of $\exp(s)$ can be computed reliably using IEEE 754 double-precision floating-point numbers for $s$ roughly in $[-745, 709]$; $\exp(s)$ underflows to 0 when $s$ is below this range and overflows to $\infty$ when $s$ is above it.

We now describe a subsampling technique to reduce computation for rounding heuristics that require multiple evaluations of the loss function (e.g.,
SequentialRounding). Ideally, we would want to run such heuristics as frequently as possible because each run may output a solution that updates the incumbent solution (i.e., the current best solution to the risk score problem). In practice, however, this may slow down LCPA since each run requires multiple evaluations of the loss, and runs that fail to update the incumbent amount to wasted computation. If, for example, we ran SequentialRounding each time the MIP solver found a set of non-integer coefficients in LCPA (i.e., Step 15 of Algorithm 2), many rounded solutions would not update the incumbent, and we would have wasted too much time rounding without necessarily finding a better solution.

Our technique aims to reduce the overhead of calling heuristics by running them on a smaller dataset $\mathcal{D}_m$ built by sampling $m$ points without replacement from the training dataset $\mathcal{D}_n$. In what follows, we present probabilistic guarantees to choose the number of samples $m$ so that an incumbent update using $\mathcal{D}_m$ guarantees an incumbent update using $\mathcal{D}_n$. To clarify when the loss and objective are computed using $\mathcal{D}_m$ or $\mathcal{D}_n$, we let $l_i(\lambda) = \log(1 + \exp(-\langle \lambda, y_i x_i \rangle))$ and define:
$$l_m(\lambda) = \frac{1}{m} \sum_{i=1}^m l_i(\lambda), \qquad V_m(\lambda) = l_m(\lambda) + C_0 \|\lambda\|_0,$$
$$l_n(\lambda) = \frac{1}{n} \sum_{i=1}^n l_i(\lambda), \qquad V_n(\lambda) = l_n(\lambda) + C_0 \|\lambda\|_0.$$

Consider a case where a heuristic returns a promising solution $\lambda^{hr}$ such that:
$$V_m(\lambda^{hr}) < V^{max}. \tag{8}$$
In this case, we compute the objective on the full training dataset $\mathcal{D}_n$ by evaluating the loss for each of the $n - m$ points that were not included in $\mathcal{D}_m$. We then update the incumbent solution if $\lambda^{hr}$ attains an objective value that is less than the current upper bound:
$$V_n(\lambda^{hr}) < V^{max}. \tag{9}$$
Although this strategy requires evaluating the loss for the full training dataset to validate an incumbent update, it still reduces data-related computation since rounding heuristics typically require multiple evaluations of the loss (e.g., SequentialRounding, which requires $d^2 + 3d + 2$ evaluations).

To guarantee that any solution that updates the incumbent when the objective is evaluated with $\mathcal{D}_m$ will also update the incumbent when the objective is evaluated with $\mathcal{D}_n$ (i.e., that any solution that satisfies (8) will also satisfy (9)), we can use the generalization bound from Theorem 9.

Theorem 9 (Generalization of Sampled Loss on Finite Coefficient Set)
Let $\mathcal{D}_n = (x_i, y_i)_{i=1}^n$ denote a training dataset with $n$ points, $\mathcal{D}_m = (x_i, y_i)_{i=1}^m$ denote a sample of $m$ points drawn without replacement from $\mathcal{D}_n$, and $\lambda$ denote the coefficients of a linear classifier from a finite set $\mathcal{L}$. For all $\varepsilon > 0$, it holds that
$$\Pr\left( \max_{\lambda \in \mathcal{L}} \left( l_n(\lambda) - l_m(\lambda) \right) \geq \varepsilon \right) \leq |\mathcal{L}| \exp\left( \frac{-2\varepsilon^2}{\left(\frac{1}{m}\right)\left(1 - \frac{m}{n}\right) \Delta_{max}^2(\mathcal{L}, \mathcal{D}_n)} \right),$$
where $\Delta_{max}(\mathcal{L}, \mathcal{D}_n) = \max_{\lambda \in \mathcal{L}} \left( \max_{i=1,\ldots,n} l_i(\lambda) - \min_{i=1,\ldots,n} l_i(\lambda) \right)$.

Theorem 9 is derived from a concentration inequality for sampling without replacement, called the Hoeffding-Serfling inequality (see Bardenet and Maillard, 2015). The Hoeffding-Serfling inequality can be significantly tighter than the classical Hoeffding inequality as it ensures that $\Pr(l_n(\lambda) - l_m(\lambda) \geq \varepsilon) \to 0$ as $m \to n$ for all $\varepsilon > 0$. Here, $\Delta_{max}(\mathcal{L}, \mathcal{D}_n)$ is a normalization term that represents the maximum range of the loss, and it can be computed quickly using the training dataset $\mathcal{D}_n$ and coefficient set $\mathcal{L}$ as shown in Proposition 5 in Section 4.2.

In general machine learning settings, the $|\mathcal{L}|$ term in Theorem 9 would yield a vacuous bound. In this setting, however, rounding ensures that $\mathcal{L}$ contains at most $2^d$ elements, which produces a well-defined bound on the difference between $l_n(\lambda)$ and $l_m(\lambda)$. As a result, Theorem 9 can be used to assess the probability that a proposed incumbent update leads to an actual incumbent update (see Corollary 10). Alternatively, it can be used to set the sample size $m$ so that an incumbent update on $\mathcal{D}_m$ is likely to yield an incumbent update on $\mathcal{D}_n$. In practice, the bound in Theorem 9 can be tightened by recomputing the normalization term $\Delta_{max}(\mathcal{L}, \mathcal{D}_n)$ over the course of LCPA. This can be done for each real-valued solution $\rho$, or when the MIP solver restricts the set of feasible coefficients $\mathcal{L}$.

Corollary 10 (Update Probabilities of Rounding Heuristics on Subsampled Data)
Consider a rounding heuristic that takes as input a vector of real-valued coefficients $\rho = (\rho_0, \ldots, \rho_d) \in \operatorname{conv}(\mathcal{L})$ and outputs a vector of rounded coefficients $\lambda \in \mathcal{L}_\rho$ where
$$\mathcal{L}_\rho = \left\{ \lambda \in \mathcal{L} \mid \lambda_j \in \{\lceil \rho_j \rceil, \lfloor \rho_j \rfloor\} \text{ for } j = 1, \ldots, d \right\}.$$
If we evaluate the rounding heuristic using $m$ points $\mathcal{D}_m = (x_i, y_i)_{i=1}^m$ drawn without replacement from $\mathcal{D}_n = (x_i, y_i)_{i=1}^n$ and the rounded coefficients $\lambda \in \mathcal{L}_\rho$ attain an objective value $V_m(\lambda)$, then for any $\delta$, with probability at least $1 - \delta$, we have that
$$V_m(\lambda) < V^{max} - \varepsilon_\delta \implies V_n(\lambda) \leq V^{max},$$
where
$$\varepsilon_\delta = \Delta_{max}(\mathcal{L}_\rho, \mathcal{D}_n) \sqrt{ \frac{\log(1/\delta) + d \log(2)}{2} \left( \frac{1}{m} \right) \left( 1 - \frac{m}{n} \right) }.$$
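For intuition, the tolerance $\varepsilon_\delta$ in Corollary 10 can be computed as follows (a sketch with names of our own choosing; `delta_max` stands in for $\Delta_{max}(\mathcal{L}_\rho, \mathcal{D}_n)$):

```python
import math

def epsilon_delta(delta, d, m, n, delta_max):
    """Tolerance from Corollary 10: if a rounded solution beats the incumbent
    objective V_max by more than this margin on the m-point subsample, then
    with probability >= 1 - delta it also beats V_max on all n points."""
    return delta_max * math.sqrt(
        (math.log(1.0 / delta) + d * math.log(2.0)) / 2.0  # covers |L_rho| <= 2^d candidates
        * (1.0 / m)
        * (1.0 - m / n)                                    # Hoeffding-Serfling factor
    )
```

Note that $\varepsilon_\delta \to 0$ as $m \to n$, so validating on larger subsamples requires a smaller safety margin.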
5. Experiments
In this section, we compare the performance of methods to create risk scores. We have three goals: (i) to benchmark the performance and computation of our approach on real-world datasets; (ii) to highlight pitfalls of traditional approaches used in practice; and (iii) to present new approaches that address the pitfalls of traditional approaches.
We considered 6 publicly available datasets shown in Table 2. We chose these datasets to allow for comparisons with other work, and to see how methods are affected by factors such as class imbalance, the number of features, and feature encoding. For each dataset, we fit risk scores using RiskSLIM and 6 baseline methods that post-processed the coefficients of the best logistic regression model built using Lasso, Ridge or Elastic Net. We used each method to fit a risk score with small integer coefficients $\lambda_j \in \{-5, \ldots, 5\}$ that obeys the model size constraint $\|\lambda\|_0 \leq R^{max}$. We considered target model sizes $R^{max} \in \{2, \ldots, 10\}$ to benchmark each method across a range of model sizes used in practice.

Dataset         n       d     Pr(y_i = 1)   Conditions for y_i = 1      Reference
income          32561   36
mammo           961     14    46.3%         person has breast cancer    Elter et al. (2007)
mushroom        8124    113
rearrest        22530   48
spambase        4601    57
telemarketing   41188   57
Table 2:
Datasets used in Section 5. All datasets are available on the UCI repository (Bache and Lichman, 2013), other than rearrest, which must be requested from ICPSR. We processed each dataset by dropping examples with missing values, and by binarizing categorical variables and some real-valued variables. We provide all processed datasets and the code to process rearrest at http://github.com/ustunb/risk-slim.
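The processing described in the caption (dropping examples with missing values, then binarizing categorical variables into indicator features) can be sketched generically as follows (an illustration of the steps, not the authors' processing code; all names are ours):

```python
def preprocess(examples, categorical_keys):
    """Drop examples with missing values, then expand each categorical
    variable into one 0/1 indicator feature per observed category."""
    complete = [ex for ex in examples if None not in ex.values()]
    # Observed categories per categorical variable, in a stable order.
    categories = {k: sorted({ex[k] for ex in complete}) for k in categorical_keys}
    processed = []
    for ex in complete:
        row = {k: v for k, v in ex.items() if k not in categorical_keys}
        for k in categorical_keys:
            for c in categories[k]:
                row[f"{k}_{c}"] = 1 if ex[k] == c else 0
        processed.append(row)
    return processed
```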
RiskSLIM
We formulated an instance of RiskSlimMINLP with the constraints: $\lambda_0 \in \{-10, \ldots, 10\}$, $\lambda_j \in \{-5, \ldots, 5\}$, and $\|\lambda\|_0 \leq R^{max}$. We set the trade-off parameter to a small value $C_0 = 10^{-6}$ to recover the best model under these constraints (see Appendix B). We solved each instance for at most 20 minutes on a single 3.33 GHz CPU with 16 GB RAM using CPLEX 12.6.3 (ILOG, 2017). Our LCPA implementation includes the improvements in Section 4 and is available online at http://github.com/ustunb/risk-slim.
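For reference, the objective that each RiskSlimMINLP instance minimizes can be evaluated directly for a candidate integer coefficient vector (a minimal sketch under the conventions used here, not the solver itself: each $x_i$ includes a leading 1 for the intercept $\lambda_0$, which is excluded from the $\ell_0$ penalty):

```python
import math

def risk_slim_objective(lam, X, y, C0):
    """V(lam) = normalized logistic loss + C0 * ||lam||_0 (intercept excluded)."""
    n = len(y)
    loss = sum(
        math.log1p(math.exp(-yi * sum(lj * xj for lj, xj in zip(lam, xi))))
        for xi, yi in zip(X, y)
    ) / n
    num_nonzero = sum(1 for lj in lam[1:] if lj != 0)
    return loss + C0 * num_nonzero
```

A production implementation would also guard against overflow in the exponential for large margins, e.g., via the lookup-table technique described earlier in Section 4.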
Penalized Logistic Regression

PLR is the best logistic regression model produced over the full regularization path using a weighted combination of the $\ell_1$ and $\ell_2$ penalties (i.e., the best model produced by Lasso, Ridge or Elastic Net). We train PLR models using the glmnet package of Friedman et al. (2010). The coefficients of each model are the solution to the optimization problem:
$$\min_{\lambda \in \mathbb{R}^{d+1}} \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-\langle \lambda, y_i x_i \rangle)) + \gamma \cdot \left( \alpha \|\lambda\|_1 + (1 - \alpha) \|\lambda\|_2^2 \right),$$
where $\alpha \in [0, 1]$ is the elastic-net mixing parameter and $\gamma \geq 0$ is a regularization penalty. We fit PLR models by choosing 1,100 combinations of $(\alpha, \gamma)$: 11 values of $\alpha \in \{0.0, 0.1, \ldots, 0.9, 1.0\}$ × 100 values of $\gamma$ (chosen automatically by glmnet for each $\alpha$). This free parameter grid ensures that the 1,100 PLR models include models produced by the following variants of logistic regression:

• Lasso ($\ell_1$-penalty), which corresponds to PLR when $\alpha = 1.0$.
• Ridge ($\ell_2$-penalty), which corresponds to PLR when $\alpha = 0.0$.
• Standard Logistic Regression, which corresponds to PLR when $\alpha = 0.0$ and $\gamma$ is small.

Traditional Approaches
While there is considerable variation in how risk scores are developed in practice, many models are developed using a two-step approach: (i) fit a sparse logistic regression model with real-valued coefficients; (ii) convert this model into a risk score with integer coefficients. We consider three methods that adopt this approach. Each method first trains a PLR model (i.e., the one that maximizes the 5-CV AUC and obeys the model size constraint), and then converts this model into a risk score by applying a common rounding heuristic:

• PLR≫Rd (Rounding): We round each coefficient to the nearest integer in $\{-5, \ldots, 5\}$ by setting $\lambda_j \leftarrow \lceil \min(\max(\lambda_j, -5), 5) \rfloor$, and round the intercept as $\lambda_0 \leftarrow \lceil \lambda_0 \rfloor$ (where $\lceil \cdot \rfloor$ denotes rounding to the nearest integer).

• PLR≫Unit (Unit Weighting): We round each coefficient to $\pm 1$ by setting $\lambda_j \leftarrow \operatorname{sign}(\lambda_j) \cdot 1[\lambda_j \neq 0]$. Unit weighting is a common heuristic in medicine and criminal justice (see e.g., Antman et al., 2000; Kessler et al., 2005; US Department of Justice, 2005; Duwe and Kim, 2016), and is sometimes called the Burgess method (as it was first proposed by Burgess, 1928).

• PLR≫RsRd (Rescaled Rounding): We first rescale coefficients so that the largest coefficient is $\pm 5$, then round each coefficient to the nearest integer (i.e., $\lambda_j \rightarrow \lceil \gamma \lambda_j \rfloor$ where $\gamma = 5 / \max_j |\lambda_j|$). Rescaling is commonly used to avoid rounding small coefficients to zero, which happens when $|\lambda_j| < 0.5$.

New Pooled Approaches
We also propose three new methods that use a pooling strategy and the loss-minimizing heuristics from Section 4. Each method generates a pool of PLR models with real-valued coefficients, applies the same post-processing procedure to each model in the pool, then selects the best risk score among feasible risk scores. The methods include:

• PooledRd (Pooled PLR + Rounding): We fit a pool of 1,100 models using PLR. For each model in the pool, we round each coefficient to the nearest integer in $\{-5, \ldots, 5\}$ by setting $\lambda_j \leftarrow \lceil \min(\max(\lambda_j, -5), 5) \rfloor$, and round the intercept as $\lambda_0 \leftarrow \lceil \lambda_0 \rfloor$.

• PooledRd* (Pooled PLR + Rounding + Polishing): We fit a pool of 1,100 models using PooledRd. For each model in the pool, we polish the rounded coefficients using DCD.

• PooledSeqRd* (Pooled PLR + Sequential Rounding + Polishing): We fit a pool of 1,100 models using PLR. For each model in the pool, we round the coefficients using SequentialRounding and then polish the rounded coefficients using DCD.

To ensure that the polishing step in PooledRd* and PooledSeqRd* does not increase the number of non-zero coefficients (which would violate the model size constraint), we run DCD on the set of non-zero coefficients $\{j \mid \lambda_j \neq 0\}$.

While other pooled methods can be designed by combining rounding and polishing methods, we omit certain variations for the sake of clarity, namely: (i) methods that use rescaled rounding and unit weighting (as they perform worse than naïve rounding when we use pooling); (ii) methods that use SequentialRounding without DCD polishing (as it performs worse than PooledSeqRd*). We report results for some of these variations on a different real-world dataset in Section 6 (i.e., PooledRsRd, PooledRsRd*, PooledSeqRd).

Notes on the Choice of Methods
We also considered training risk scores by directly solving RiskSlimMINLP with a commercial MINLP solver, but could not recover models that performed well on any dataset due to computational issues (discussed also in Section 3.3). We did not consider methods to fit scoring systems for decision-making (see Section 1.1 for a list) since these models do not output risk predictions. Ustun and Rudin (2016) and Zeng et al. (2016) present experiments comparing the decision-making performance of scoring systems and other popular classifiers on these datasets.
Reliability Diagrams
We evaluate the calibration of each risk score by plotting a reliability diagram, which shows how the predicted risk (x-axis) matches the observed risk (y-axis) for each distinct score (DeGroot and Fienberg, 1983). We compute the observed risk at a score of $s$ as
$$\bar{p}_s = \frac{1}{|\{i : s_i = s\}|} \sum_{i : s_i = s} 1[y_i = +1].$$
If a model has over 30 distinct scores, we group them into 10 bins before plotting the reliability diagram. A model with perfect calibration outputs predictions that are perfectly aligned with observed risk, as shown by a reliability diagram where all points lie on the $x = y$ line.

Summary Statistics
We report the following summary statistics for each risk score:

• Calibration Error, computed as $\text{CAL} = \frac{1}{n} \sum_s \sum_{i : s_i = s} |p_i - \bar{p}_s|$, where $p_i$ is the predicted risk of example $i$, and $\bar{p}_s$ is the observed risk for all examples with a score of $s$. CAL is the expected calibration error over the reliability diagram (Naeini et al., 2014).

• Area under the ROC curve, computed as $\text{AUC} = \frac{1}{n^+ n^-} \sum_{i : y_i = +1} \sum_{k : y_k = -1} 1[s_i > s_k]$, where $n^+ = |\{i : y_i = +1\}|$ and $n^- = |\{i : y_i = -1\}|$. Note that trivial models (i.e., models that predict one class) achieve the best possible CAL (0.0%) but poor AUC (0.5).

• Logistic Loss, computed as $\text{Loss} = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-s_i))$. The loss reflects the objective value of the risk score problem when $C_0$ is small. We report the loss to see if minimizing the objective value of the risk score problem improves CAL and AUC.

• Model Size: the number of non-zero coefficients excluding the intercept, $\sum_{j=1}^d 1[\lambda_j \neq 0]$.

Model Selection
We use nested cross-validation to estimate the test performance of PooledRd, PooledRd*, and PooledSeqRd*, as these methods tune the parameters of a final risk score using 5-CV statistics. RiskSLIM can produce an unbiased performance estimate using standard 5-CV because it does not require parameter tuning. This is also true for traditional methods such as PLR≫Rd, PLR≫RsRd, and PLR≫Unit, as the parameter tuning step occurs before post-processing.
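For reference, the CAL and AUC statistics defined under Summary Statistics can be computed as follows (a minimal sketch; the function names are ours, and ties between a positive and a negative score are counted as 1/2 so that a trivial single-score model gets AUC 0.5 as noted above):

```python
def cal(scores, y, predicted_risk):
    """Expected calibration error: mean |p_i - pbar_s|, where pbar_s is the
    observed fraction of positives among examples with score s."""
    outcomes = {}
    for s, yi in zip(scores, y):
        outcomes.setdefault(s, []).append(1 if yi == +1 else 0)
    pbar = {s: sum(v) / len(v) for s, v in outcomes.items()}
    return sum(abs(p - pbar[s]) for s, p in zip(scores, predicted_risk)) / len(y)

def auc(scores, y):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, yi in zip(scores, y) if yi == +1]
    neg = [s for s, yi in zip(scores, y) if yi == -1]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))
```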
On the Performance of Risk Scores
We compare the performance of RiskSLIM to traditional approaches in Figure 12 and to our pooled approaches in Figure 13. These results show that RiskSLIM models consistently attain better calibration and AUC than alternatives. These results are reflected more precisely for risk scores with a target model size of $R^{max} = 5$ in Table 3. Here, RiskSLIM has the best 5-CV mean test CAL on 5/6 datasets, the best 5-CV mean test AUC on 5/6 datasets, and no method has better test CAL and test AUC than RiskSLIM.

We make two observations to explain the empirical performance of risk scores: (i) models that attain low values of the logistic loss have good calibration (see Figures 12 and 13 and the empirical results of e.g., Caruana and Niculescu-Mizil 2004, 2006); (ii) since we are fitting models from a simple class, risk scores tend to generalize (see the test CAL/AUC and training CAL/AUC of RiskSLIM models in Figures 16 to 21, and of other risk scores in Table 9 in Appendix E).

Since RiskSLIM models optimize the loss under exact constraints on model form, they attain minimal or near-minimal values of the loss. Thus, they perform well in terms of training CAL/AUC as per (i) and test CAL/AUC as per (ii). These observations also explain why methods that use loss-minimizing heuristics produce risk scores with better CAL and AUC than those that do not (e.g., PooledRd* has better test CAL/AUC than PooledRd since DCD polishing can only reduce the loss).
5. It is well-known that when 5-CV statistics are used to set the free parameters of the final model, these statistics produce an overly optimistic estimate of test performance (see Cawley and Talbot, 2010). To avoid this bias, the 5 fold-based models trained to construct 5-CV estimates must have their parameters set in the same way as the final model. Explicitly, for each of the 5 fold-based models, the free parameters must be set by running an inner 5-CV, and choosing a free parameter instance that satisfies the model size constraint and maximizes the inner 5-CV mean test AUC.
[Figure 12: grid of plots with one row per dataset (income, mammo, mushroom, rearrest, spambase, telemarketing) and one column per metric: training loss (optimization metric), 5-CV mean test CAL (performance metric), and 5-CV mean test AUC (selection metric), each plotted against target model size for PLR≫Rd, PLR≫RsRd, PLR≫Unit, and RiskSLIM.]

Figure 12:
Summary statistics for risk scores built using RiskSLIM and traditional approaches. Each point represents the best risk score with integer coefficients $\lambda_j \in \{-5, \ldots, 5\}$ and model size $\|\lambda\|_0 \leq R^{max}$ for $R^{max} \in \{2, \ldots, 10\}$. We show the variation in 5-CV mean test CAL and AUC for each method by shading the range between the 5-CV minimum and maximum. The black line in each plot is a baseline, which shows the performance of a single PLR model with real-valued coefficients and no model size constraint.
[Figure 13: grid of plots with the same layout as Figure 12, comparing PooledRd, PooledRd*, PooledSeqRd*, and RiskSLIM.]

Figure 13:
Summary statistics for risk scores built using RiskSLIM and pooled approaches. Each point represents the best risk score with integer coefficients $\lambda_j \in \{-5, \ldots, 5\}$ and model size $\|\lambda\|_0 \leq R^{max}$ for $R^{max} \in \{2, \ldots, 10\}$. We show the variation in 5-CV mean test CAL and AUC for each method by shading the range between the 5-CV minimum and maximum. The black line in each plot is a baseline, which shows the performance of a single PLR model with real-valued coefficients and no model size constraint.
[Table 3: for each dataset (income: n = 32561, d = 36; mammo: n = 961, d = 14; mushroom: n = 8124, d = 113; rearrest: n = 22530, d = 48; spambase: n = 4601, d = 57; telemarketing: n = 41188, d = 57), rows report the test cal, test auc, loss value, model size, and opt. gap of the traditional approaches (PLR≫Rd, PLR≫RsRd, PLR≫Unit), the pooled approaches (PooledRd, PooledRd*, PooledSeqRd*), and RiskSLIM.]

Table 3:
Summary statistics for risk scores with integer coefficients λ_j ∈ {−5, …, 5} for a model size constraint ‖λ‖₀ ≤ 5. Here: test cal is the 5-CV mean test CAL; test auc is the 5-CV mean test AUC; model size and loss value pertain to a final model fit using the entire dataset. For each dataset, we highlight the method that attains the best test cal, auc, and loss value in green. We also highlight methods that produce trivial models in red.

On the Caveats of CAL
The PLR▷Rd risk score for telemarketing in Table 3 highlights a key shortcoming of CAL that illustrates why we report AUC: trivial and near-trivial models can have misleadingly low CAL. This also shows why we choose free parameters that maximize the 5-CV mean test AUC rather than CAL: choosing free parameters to minimize the 5-CV mean test CAL can result in a trivial model. Here, PLR▷Rd rounds all coefficients other than the intercept to zero, producing a model that trivially assigns a constant score to all examples, s_i = ⟨λ, x_i⟩ = λ₀. Since there is only one score, the predicted risk is the same for all points, p_i ≈ 11%, which is close to the prevalence p̄ = Pr(y_i = +1).

On Calibration Issues of Risk Scores
The reliability diagrams in Figure 14 highlight two issues with respect to the calibration of risk scores that are difficult to capture using a summary statistic:
• Monotonicity violations in observed risk. For example, the reliability diagrams for PLR▷Rd on spambase, or PooledRd and PooledSeqRd* on mushroom, show that the observed risk does not increase monotonically with predicted risk.
• Irregular spacing and coverage of predicted risk. For example, the PLR▷Rd risk score for income outputs risk predictions that range between only 20% and 60%, and the PLR▷RsRd risk score for rearrest produces risk predictions that are clustered at the end points.
The results in Figure 14 suggest that such issues can be mitigated by optimizing the logistic loss (see e.g., the calibration of risk scores built using RiskSLIM, PooledSeqRd, and PooledSeqRd*, where integer coefficients are determined by directly optimizing the logistic loss). In contrast, such issues are difficult to address by additional post-processing. Consider, for example, using Platt scaling (Platt et al., 1999) to improve the calibration of the PLR▷Rd and PLR▷RsRd risk scores in Figure 14. As shown in Figure 15, Platt scaling improves calibration by centering and spreading risk estimates over the reliability diagram. However, it does not fix issues that were introduced by earlier heuristics, such as monotonicity violations, a lack of coverage in risk predictions, and low AUC.
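The two calibration tools discussed here, a CAL-style calibration error and Platt scaling, can be sketched in a few lines. This is our own illustrative code, not the paper's implementation; the exact definition of CAL may differ in detail, and the function names are our own.

```python
import numpy as np

def cal_error(predicted, y):
    """CAL-style calibration error: group examples by distinct predicted
    risk, then average |observed risk - predicted risk| weighted by group
    size. (An assumption: the paper's exact metric may differ slightly.)"""
    predicted = np.asarray(predicted, dtype=float)
    y01 = (np.asarray(y) > 0).astype(float)   # map labels {-1,+1} to {0,1}
    err = 0.0
    for p in np.unique(predicted):
        mask = predicted == p
        err += mask.sum() * abs(y01[mask].mean() - p)
    return err / len(y01)

def platt_scale(scores, y, n_iter=2000, lr=0.5):
    """Platt scaling: fit p = sigmoid(a*score + b) by gradient descent on
    the logistic loss, then use (a, b) to recalibrate the scores."""
    scores = np.asarray(scores, dtype=float)
    y01 = (np.asarray(y) > 0).astype(float)
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        a -= lr * np.mean((p - y01) * scores)
        b -= lr * np.mean(p - y01)
    return a, b

# An under-confident model: at score +1 the observed risk is 0.9, at score
# -1 it is 0.1, but sigmoid(+/-1) only predicts 0.73/0.27.
scores = np.array([1.0] * 10 + [-1.0] * 10)
y = np.array([+1] * 9 + [-1] + [+1] + [-1] * 9)
a, b = platt_scale(scores, y)   # a approaches log(9) ~ 2.20, b approaches 0
```

Note that Platt scaling only rescales and shifts the score before the sigmoid; it can spread risk estimates over the reliability diagram but cannot reorder them, which is why monotonicity violations in observed risk survive the transformation.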
[Figure 14 shows reliability diagrams (observed risk vs. predicted risk) for each dataset (income, mammo, mushroom, rearrest, spambase, telemarketing) and each method (PLR▷Rd, PLR▷RsRd, PLR▷Unit, PooledRd, PooledRd*, PooledSeqRd*, RiskSLIM).]

Figure 14:
Reliability diagrams for risk scores with integer coefficients λ_j ∈ {−5, …, 5} for a model size constraint ‖λ‖₀ ≤ 5. We plot results for models from each fold on test data in grey, and for the final model on training data in black.
[Figure 15 shows reliability diagrams for income and rearrest, comparing PLR▷Rd (raw and with Platt scaling), PLR▷RsRd (raw and with Platt scaling), and RiskSLIM.]

Figure 15:
Reliability diagrams for PLR▷Rd and PLR▷RsRd risk scores with and without Platt scaling, and for RiskSLIM. Platt scaling improves calibration by centering and spreading risk predictions. However, it cannot overcome calibration issues introduced by rounding heuristics, such as a lack of decision points (e.g., PLR▷Rd on income) or monotonicity violations (e.g., PLR▷RsRd on rearrest).

On the Pitfalls of Traditional Approaches
Our results in Figure 12 and Table 3 show that risk scores built using traditional approaches perform poorly in terms of CAL and AUC. In particular:
• Rounding (PLR▷Rd) produces risk scores with low AUC when it eliminates features from the model by rounding small coefficients |λ_j| < 0.5 to zero.
• Rescaled rounding (PLR▷RsRd) greatly reduces CAL since the logistic loss is not scale-invariant (see e.g., the reliability diagram for PLR▷RsRd in Figure 14).
• Unit weighting (PLR▷Unit) results in poor calibration and unpredictable behavior (e.g., risk scores with more features can perform worse, as seen in the PLR▷Unit models for income and telemarketing in Figure 12).
The effects of rescaled rounding and unit weighting on calibration are reflected by highly suboptimal values of the loss in Table 3 and Figure 12. These issues are often overlooked as their effect on AUC is far less severe (e.g., rescaling is recommended by US Department of Justice, 2005; Pennsylvania Commission on Sentencing, 2012).
Our baseline methods may not match the exact methods used in practice, as they do not reflect the significant human input used in risk score development (e.g., domain experts perform preliminary feature selection, round coefficients, or choose a scaling factor before rounding, manually and/or without validation, as shown in Appendix C). Nevertheless, these results highlight two major pitfalls of traditional approaches, namely:
• Traditional approaches heuristically post-process a single model. This means that they fail whenever a heuristic dramatically changes CAL or AUC.
• Traditional approaches use heuristics that are oblivious to the value of the loss function, which results in poor calibration.

On Pooled Approaches
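Both the traditional post-processing heuristics and the loss-sensitive SequentialRounding heuristic used by the pooled approaches can be sketched in a few lines. This is our own simplified reimplementation for illustration, not the reference code:

```python
import numpy as np

def naive_round(coefs):
    """Rounding (PLR -> Rd): round each coefficient to the nearest integer.
    Small coefficients are rounded to zero, which can remove features."""
    return np.round(coefs)

def rescaled_round(coefs, max_coef=5):
    """Rescaled rounding (PLR -> RsRd): scale so the largest coefficient
    maps to max_coef, then round. The logistic loss is not scale-invariant,
    so this can badly miscalibrate the model."""
    coefs = np.asarray(coefs, dtype=float)
    return np.round(max_coef * coefs / np.max(np.abs(coefs)))

def unit_weight(coefs):
    """Unit weighting (PLR -> Unit): keep only the signs of coefficients."""
    return np.sign(coefs)

def sequential_round(coefs, loss_fn):
    """Sketch of a loss-sensitive SequentialRounding heuristic: round one
    coefficient at a time, always picking the coordinate and direction
    (floor or ceil) that yields the smallest value of loss_fn."""
    lam = np.array(coefs, dtype=float)
    def with_value(j, v):
        out = lam.copy()
        out[j] = v
        return out
    to_round = [j for j in range(len(lam)) if lam[j] != np.round(lam[j])]
    while to_round:
        j, v = min(((j, v) for j in to_round
                    for v in (np.floor(lam[j]), np.ceil(lam[j]))),
                   key=lambda jv: loss_fn(with_value(*jv)))
        lam[j] = v
        to_round.remove(j)
    return lam

coefs = np.array([1.8, 0.4, -0.3])
naive_round(coefs)      # the two small coefficients vanish
rescaled_round(coefs)   # scaled to [5., 1., -1.]
```

Pooling simply applies one of these post-processors to every model in a pool of PLR fits and keeps the best resulting risk score, which is why a single badly rounded model no longer sinks the whole method.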
Our results suggest that risk scores built using our pooled approaches attain considerably better calibration and rank accuracy than those built using traditional approaches. These methods aim to overcome the pitfalls of traditional approaches using two strategies:
• Pooling, which generates a pool of PLR models, post-processes each model to produce a pool of risk scores, and selects the best risk score within the pool. Pooling provides robustness when post-processing with heuristics that can dramatically alter performance (e.g., rounding, as it is very unlikely that the coefficients of all models in the pool will be rounded to zero). The performance gain due to pooling can be seen by comparing the results for PLR▷Rd to PooledRd in Table 3.
• Loss-Sensitive Heuristics, such as SequentialRounding and DCD, which produce a pool of risk scores that attain lower values of the loss, and thereby let us select a risk score with better CAL and AUC. The performance gain due to loss-sensitive heuristics can be seen by comparing the results for PooledRd to PooledRd* in Table 3.
The main shortcomings of pooled methods are that they are difficult to implement, do not provide formal optimality or feasibility guarantees (particularly when handling non-trivial constraints), and may require massive computation depending on the post-processing heuristics. We discuss these shortcomings in greater detail in Section 6.
The fact that RiskSLIM produces risk scores with lower loss than the pooled approaches shows that direct optimization can efficiently recover models that may not be found via massive post-processing (i.e., where we train all possible ℓ₁- and ℓ₂-penalized logistic regression models, and convert them to risk scores by applying specially designed heuristics). Here, we have shown that the latter strategy may still produce risk scores that perform well. In Section 6, we show that the difference in performance is significant in the presence of non-trivial constraints.

On Computation
Although the risk score problem is NP-hard, we trained RiskSLIM models that were certifiably optimal or had small optimality gaps for all datasets in under 20 minutes, using an LCPA implementation with the improvements in Section 4. Even when LCPA did not recover a certifiably optimal solution, it produced a risk score that performed well and did not exhibit the calibration issues of models built heuristically.
In general, the time spent computing cutting planes is a small portion of the overall runtime for LCPA. Since this data-related computation scales linearly with sample size, we expect to obtain similar results even if the datasets had far more samples.
Our experiments revealed several factors that affect the time to recover a certifiably optimal solution, namely:
• Highly Correlated Features: Subsets of redundant features produce multiple optima, which increases the size of the B&B tree.
• Feature Encoding: In particular, the problem is harder when the dataset includes real-valued variables, like those in the spambase dataset.
• Difficulty of the Learning Problem: On separable problems such as mushroom, it is easy to recover a certifiably optimal solution since many solutions perform well and produce a near-optimal lower bound.
[Figure 16 risk score for rearrest:
1. Prior Arrests ≥ …: … point(s)
2. Prior Arrests ≥ …: … point(s)
3. Prior Arrests for Local Ordinance: 1 point
4. Age at Release between 18 and 24: 1 point
5. Age at Release ≥ 40: −1 point
SCORE ∈ {−1, 0, 1, 2, 3, 4} mapped to RISK]

Figure 16: RiskSLIM model for rearrest. RISK is the predicted probability that a prisoner is arrested within 3 years of release from prison. This model has a 5-CV mean test CAL/AUC of 1.7%/0.697 and training CAL/AUC of 2.6%/0.701.
[Figure 17 risk score for income:
1. Married: 3 points
2. Reported Capital Gains: 2 points
3. Age 22 to 29: −1 point
4. Highest Level of Education is High School Diploma: −2 points
5. No High School Diploma: −3 points
SCORE ∈ {−4 to −1, 0, 1, 2, 3, 4, 5} mapped to RISK]

Figure 17: RiskSLIM model for income. RISK is the predicted probability that a US resident earns over $50,000. This model has a 5-CV mean test CAL/AUC of 2.4%/0.854 and training CAL/AUC of 4.1%/0.860.
[Figure 18 risk score for telemarketing:
1. Call between January and March: 1 point
2. Called Previously: 1 point
3. Previous Call was Successful: 1 point
4. Employment Indicator < …: … point(s)
5. 3 Month Euribor Rate ≥ 100: −1 point
SCORE ∈ {−1, 0, 1, 2, 3, 4} mapped to RISK]

Figure 18: RiskSLIM model for telemarketing. RISK is the predicted probability that a client opens a new bank account after a marketing call. This model has a 5-CV mean test CAL/AUC of 1.3%/0.760 and a training CAL/AUC of 1.1%/0.760.
[Figure 19 risk score for mushroom:
1. odor = foul: 5 points
2. gill size = broad: −3 points
3. odor = almond: −5 points
4. odor = anise: −5 points
5. odor = none: −5 points
SCORE ∈ {−8, −5, −3, 2 to 5} mapped to RISK]

Figure 19: RiskSLIM model for mushroom. RISK is the predicted probability that a mushroom is poisonous. This model has a 5-CV mean test CAL/AUC of 1.8%/0.989 and a training CAL/AUC of 1.0%/0.990.
[Figure 20 risk score for mammo:
1. IrregularShape: 1 point
2. Age ≥ 60: 1 point
3. OvalShape: −1 point
4. ObscuredMargin: −1 point
5. CircumscribedMargin: −2 points
SCORE ∈ {−3, −2, −1, 0, 1, 2} mapped to RISK]

Figure 20: RiskSLIM model for mammo. RISK is the predicted probability that a mammogram pertains to a patient with breast cancer. This model has a 5-CV mean test CAL/AUC of 5.0%/0.843 and a training CAL/AUC of 3.1%/0.849.
[Figure 21 risk score for spambase (integer points multiply real-valued frequency features):
1. CharacterFrequency DollarSign × … points
2. …
3. WordFrequency Free × … points
4. WordFrequency HP × −2 points
5. WordFrequency George × −5 points
SCORE bins (≤ −1.2, −1.2 to −0.4, −0.4 to 0.2, 0.2 to 0.6, 0.6 to 1.0, 1.0 to 1.4, 1.4 to 1.8, 1.9 to 2.4, 2.4 to 3.2, ≥ …) mapped to RISK]

Figure 21: RiskSLIM model for spambase. RISK is the predicted probability that an e-mail is spam. This model has a 5-CV mean test CAL/AUC of 11.7%/0.928 and a training CAL/AUC of 12.3%/0.935.
6. ICU Seizure Prediction
In this section, we describe a collaboration with the Massachusetts General Hospital where we built a customized risk score for ICU seizure prediction. Our goal is to discuss the performance and practicality of our approach on a real-world problem with non-trivial constraints.
Patients who suffer from traumatic brain injury are monitored via continuous electroencephalography (cEEG). Based on current clinical standards, neurologists undergo extensive training to recognize a large set of patterns in cEEG output (see Hirsch et al., 2013, and Figure 22). They consider the presence and characteristics of cEEG patterns along with other medical information to evaluate a patient's risk of seizure. These risk estimates are used to decide if a patient can be discharged from the ICU or requires further monitoring, and whether to prescribe a medical intervention in order to prevent additional brain injury.
Figure 22: cEEG displays electrical activity at 16 standardized locations in a patient's brain using electrodes placed on the scalp. We show two cEEG patterns: a Generalized Periodic Discharge (GPD), which occurs on both sides of the brain (left); and a Lateralized Periodic Discharge (LPD), which occurs on one side of the brain (right). These figures were reproduced from a presentation at a training module by the American Clinical Neurophysiology Society (2012).
Dataset
Our dataset was derived from an extensive set of cEEG recordings from 41 hospitals, curated by the Critical Care EEG Monitoring Research Consortium. It contains n = 5,427 recordings and d = 87 input variables (see Appendix F for a list). The outcome is defined as y_i = +1 if patient i, who has been in the ICU for the past 24 hours, will have a seizure in the next 24 hours. There is significant class imbalance as Pr(y_i = +1) = 12.5%. The input variables include information on patient medical history, secondary neurological symptoms, and the presence and characteristics of 5 standard cEEG patterns: Lateralized Periodic Discharges (LPD); Lateralized Rhythmic Delta (LRDA); Generalized Periodic Discharges (GPD); Generalized Rhythmic Delta (GRDA); and Bilateral Periodic Discharges (BiPD).

Model Requirements
Our collaborators wanted a risk score to quickly predict seizure risk using the presence and characteristics of cEEG patterns. It was critical for the model to output calibrated risk predictions, since physicians would use the predicted risk to choose between multiple treatment options (i.e., patients may be prescribed different medication based on their predicted risk; see also Van Calster and Vickers, 2015; Shah et al., 2018, for a discussion on how miscalibrated risk predictions can result in harmful decisions). In order to be used and accepted by physicians, it was also important to produce a model that could be easily validated by domain experts and aligned with domain expertise.
In Figure 23, we present a RiskSLIM risk score for this problem that satisfies all of these requirements (see Struck et al., 2017, for details on model development). This risk score outputs calibrated risk estimates at several decision points; obeys a model size constraint (‖λ‖₀ ≤ 6) to let physicians easily validate the model; and obeys monotonicity constraints to ensure that the signs of coefficients are aligned with domain knowledge.
[Figure 23 risk score for seizure prediction:
1. Any cEEG Pattern with Frequency > 2 Hz: 1 point
2. Epileptiform Discharges: 1 point
3. Patterns include LPD or LRDA or BiPD: 1 point
4. Patterns Superimposed with Fast or Sharp Activity: 1 point
5. Prior Seizures: 1 point
6. Brief Rhythmic Discharges: 2 points
SCORE mapped to RISK: <5%, 12%, 27%, 50%, 73%, 88%, >…]

Figure 23: RiskSLIM risk score for ICU seizure prediction (see Struck et al., 2017, for details on model development). This model has a 5-CV mean test CAL/AUC of 2.7%/0.819.
As a follow-up to our work in Struck et al. (2017), our collaborators wanted to see if we could improve the usability of the risk score in Figure 23 without sacrificing too much calibration or rank accuracy. Seeing how the risk score in Figure 23 includes features that could require a physician to check a large number of patterns (e.g., MaxFrequencyOfAnyPattern or PatternsIncludeLPDorLRDAorBIPD), our collaborators sought to improve usability by restricting the model size and specifying operational constraints on feature composition and feature encoding. In turn, our goal was to produce the best risk score that satisfied these constraints, so that our collaborators could correctly evaluate the loss in predictive performance due to their requirements, and make an informed choice regarding model deployment. We present a complete list of their constraints in Appendix F. They can be grouped as follows:
• Limited Model Size: The model had to use at most 4 input variables, so that it would be easy to validate and use in an ICU.
• Monotonicity: The model had to obey monotonicity constraints for well-known risk factors for seizures (e.g., it could not suggest that having prior seizures lowers seizure risk).
• No Redundancy between Categorical Variables: The model had to include variables that were linearly independent (e.g., it could include Male or Female but not both).
• Specific cEEG Patterns or Any cEEG Pattern: The dataset had variables for specific cEEG patterns (e.g., MaxFrequencyLPD) and variables for any cEEG pattern (e.g., MaxFrequencyAnyPattern). The model had to use variables for specific patterns or any pattern, but not both.
• Frequency in Continuous or Thresholded Form: The dataset had two kinds of variables related to the frequency of a cEEG pattern: (i) a real-valued variable (e.g., MaxFrequencyLPD); and (ii) binary threshold variables (e.g., indicators of the form MaxFrequencyLPD ≤ t). Models had to use the real-valued variable or the binary variables, not both.
• Limited Thresholds: To prevent clinicians from having to check multiple thresholds, the model could include at most 2 binary threshold variables for a given cEEG pattern.
Training Setup
We used the same training setup as in Section 5.1, which we adapted to address constraints as follows. We trained a RiskSLIM model by solving an instance of RiskSlimMINLP with all the constraints. This MINLP had 20 additional constraints and 2 additional variables, and was solved to optimality in under 20 minutes. The baseline methods had built-in mechanisms to address monotonicity constraints, but needed tuning to handle the remaining constraints. We used nested 5-CV to evaluate the predictive performance of all models. For each method, we trained a final model using all of the training data for an instance of the free parameters that obeys all constraints and maximizes the mean 5-CV test AUC.
On Performance and Usability in a Constrained Setting
The results in Table 4 show the potential performance benefits of training an optimized risk score for problems with non-trivial constraints. Here, the RiskSLIM model has a 5-CV mean test CAL/AUC of 2.5%/0.801, while the best risk score built using a heuristic method has a 5-CV mean test CAL/AUC of 2.8%/0.754.
Unlike the experiments in Section 5, only RiskSLIM and our pooled methods can produce a feasible risk score. Traditional methods (e.g., PLR▷Rd, PLR▷RsRd, and PLR▷Unit) violate one or more constraints after rounding. As shown in Table 6, these methods cannot produce a risk score with performance comparable to the RiskSLIM risk score even when these constraints are relaxed. If we only consider simple constraints on model size and monotonicity, then risk scores produced by these methods have a test AUC of at most 0.761 (PLR▷RsRd) and a test CAL of at best 7.0% (PLR▷Rd).
As shown in Figures 24 to 30, risk scores that have similar test CAL can still exhibit important differences in terms of calibration. Here, the risk predictions of the RiskSLIM model are monotonic and within the boundaries of the risk predictions of the fold-based models. In comparison, the risk predictions of other models do not increase monotonically with observed risk, and vary significantly across test folds (e.g., PooledRsRd, PooledRsRd*, PooledSeqRd, and PooledSeqRd*). As noted by our collaborators, these issues can affect how physicians will use the model and whether they will accept it (e.g., the monotonicity violations of the PooledSeqRd model suggest that patients with a lower score can have a higher observed risk).
[Table 4 lists, for RiskSLIM, PooledRd, PooledRsRd, PooledSeqRd, PooledRd*, PooledRsRd*, and PooledSeqRd*: test CAL, test AUC, model size, loss value, optimality gap, train CAL, and train AUC.]

Table 4: Performance of risk scores for seizure prediction that satisfy all constraints. We report the 5-CV mean test CAL and 5-CV mean test AUC. The ranges in each cell represent the 5-CV minimum and maximum. We present the risk scores built using each method in Figures 24 to 30.
[Table 5 lists, for RiskSLIM, PooledRd, PooledRd*, PooledRsRd, PooledRsRd*, PooledSeqRd, and PooledSeqRd*: training requirements and the % of instances that satisfy each type of constraint.]

Table 5: Training requirements and constraint violations for the methods in Table 4. Each instance is a unique combination of free parameters.
[Table 6 (columns: Method, Constraints Violated, Test CAL, Test AUC, Model Size, Loss Value, Train CAL, Train AUC):

Method | Constraints Violated | Test CAL | Model Size
PLR | Model Size, Integrality, Monotonicity, Operational | 2.6% | 20 to 35
PLR | Model Size, Integrality, Operational | 2.7% | 16 to 41
PLR | Integrality, Operational | 4.4% |
PLR▷Rd | Operational | 7.0% |
PLR▷RsRd | Operational | 12.4% |
PLR▷Unit | Operational | 24.6% | ]

Table 6: Performance of baseline models that violate one or more of the constraints.
The RiskSLIM model requires physicians to scan cEEG output for 3 patterns (that is, LPD, BriefRhythmicDischarges, and EpiletiformDischarge). In comparison, other risk scores may not necessarily be more usable, since they all include PatternsIncludeBiPDorLRDAorLPD, which can require physicians to check cEEG output for 3 patterns (in the worst case). The PooledRsRd, PooledSeqRd, and PooledSeqRd* risk scores also include MaxFrequencyLPD, which requires recording the frequency of LPD and therefore takes more time. Note that it is possible, as a simple extension of our approach, to fine-tune usability by penalizing each feature based on how difficult it is for a physician to evaluate it.
On the Practical Value of a Certificate of Optimality
The results in Tables 4 and 6 illustrate how heuristics can lead practitioners to overestimate the trade-offs between performance and feasibility with respect to real-world constraints. Here: three traditional methods could not output a feasible risk score; six pooled methods produced feasible risk scores with suboptimal AUC and calibration issues; and a baseline PLR model with real-valued coefficients has an AUC of 0.742 (see Table 6). Based on these results, a practitioner could easily conclude that there did not exist a feasible risk score that achieves a test AUC of 0.801.
In contrast, RiskSLIM models are paired with an optimality gap. In practice, a small optimality gap suggests that we have trained the best possible risk score that satisfies a specific set of constraints. Thus, if a risk score with a small optimality gap performs poorly on training data, and the model generalizes (as evidenced by comparing training performance to K-CV performance), then a practitioner can attribute the performance deficit of the model to restrictive constraints, and improve performance by relaxing them.
Our approach provides a general mechanism to evaluate the effect of constraints on predictive performance. Say, for example, that our collaborators were not satisfied with the performance or usability of our model; then we could train certifiably optimal risk scores for different sets of constraints. By comparing the performance of certifiably optimal risk scores, our collaborators could evaluate the impact of their requirements on predictive performance, and navigate these trade-offs in an informed manner. This approach was useful when our collaborators were interested in choosing between a model with 4 features or 5 features. Here, we trained a RiskSLIM risk score with 5 features, which has a 5-CV test CAL/AUC of 3.4%/0.816. However, the slight improvement in test AUC did not outweigh the fact that the model included a feature that would increase the worst-case evaluation time (i.e., MaxFreqFactorAnyPattern).

On the Challenges of Handling Operational Constraints
Table 5 shows some of the practical benefits of methods that can address constraints without the need for parameter tuning or post-processing. Since our approach can directly incorporate these constraints into the MINLP formulation, all RiskSLIM models are feasible with respect to these constraints. Thus, we can produce a feasible risk score and estimate its predictive performance by training 6 models: 1 final model trained on the full dataset for deployment, and 5 models trained on subsets of the training data to produce an unbiased performance estimate for the final model via 5-fold CV.
In contrast, the pooled methods produce a feasible model by post-processing a large pool of models and discarding models that are infeasible. Since we must then choose between feasible models on the basis of 5-CV performance, we must use a nested CV setup to pair any model with an unbiased performance estimate. As shown in Table 5, this requires fitting a total of 33,000 models (i.e., a nested CV setup with 5 outer folds, 5 inner folds, and 1,100 free parameter instances requires fitting 1,100 × 5 × (5 + 1) = 33,000 models). There is no guarantee that pooled methods will produce a feasible risk score. As shown in Table 5, for instance, only 12% of the instances for the pooled methods satisfied all constraints.
Our results highlight several other complications with methods that aim to address hard constraints by parameter tuning. Since such constraints also affect model performance, these parameters should ideally be set on the basis of predictive performance, using a K-fold CV setup. In this case, we would train models on K validation folds for each instance of the free parameters, choose the instance of the free parameters that maximizes the mean K-CV test AUC among the instances that satisfied all constraints, and then train a "final model" for this instance. Unfortunately, there is no guarantee that the final model will obey all constraints.

On the Benefits of Risk Scores with Small Integer Coefficients
Figures 24 to 30 illustrate some of the practical benefits of risk scores with small integer coefficients. When input variables belong to a small discrete set, scores also belong to a small discrete set. This reduces the number of operating points on the ROC curve and reliability diagram, which makes it easier to pick an operating point. Further, when input variables are binary, the decision rule at each operating point can be represented as a Boolean function. For the RiskSLIM model in Figure 24, for example, the decision rule at one operating point can be written as:

ŷ_i = +1 if BriefRhythmicDischarge ∨ PatternsIncludeLPD ∨ EpiletiformDischarge

Small integer coefficients let users extract such rules by listing the conditions under which the score exceeds the threshold. This is more challenging when a model uses real-valued coefficients, as shown by the score function of the PLR model from Table 4:

score = λ₀ + λ₁ · PatternsIncludeBiPDorLRDAorLPD + λ₂ · PriorSeizure + λ₃ · MaxFrequencyLPD,

where the λ_j are real-valued. In this case, extracting a Boolean function is difficult, as computing the score involves arithmetic with real-valued coefficients and the real-valued variable MaxFrequencyLPD.
[Figure 24 risk score:
1. BriefRhythmicDischarge: 2 points
2. PatternsIncludeLPD: 2 points
3. PriorSeizure: 1 point
4. EpiletiformDischarge: 1 point
SCORE (0 to 6) mapped to RISK]

Figure 24: RiskSLIM risk score (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for the final model on training data in black, and the 5 fold models on test data in grey. This model has a 5-CV mean test CAL/AUC of 2.5%/0.801.
[Figure 25 risk score:
1. BriefRhythmicDischarge: 1 point
2. PatternsIncludeBiPDorLRDAorLPD: 1 point
SCORE (0 to 2) mapped to RISK]

Figure 25: PooledRd risk score (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for the final model on training data in black, and results for the fold models on test data in grey. This model has a 5-CV mean test CAL/AUC of 5.3%/0.740.
[Figure 26 risk score:
1. BriefRhythmicDischarge: 3 points
2. PatternsIncludeBiPDorLRDAorLPD: 2 points
SCORE ∈ {0, 2, 3, 5} mapped to RISK]

Figure 26: PooledRd* risk score (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for the final model on training data in black, and results for the fold models on test data in grey. This model has a 5-CV mean test CAL/AUC of 3.0%/0.745.
[Figure 27 risk score:
1. PatternsIncludeBiPDorLRDAorLPD: 5 points
2. PriorSeizure: 1 point
3. MaxFreqFactorLPD × … points
SCORE mapped to RISK]

Figure 27: PooledRsRd model (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for fold models on test data in grey, and for the final model on training data in black. This model has a 5-CV mean test CAL/AUC of 12.2%/0.736.
[Figure 28 risk score:
1. PatternsIncludeBiPDorLRDAorLPD: 5 points
2. PriorSeizure: 3 points
3. MaxFreqFactorLPD × … points
SCORE mapped to RISK]

Figure 28: PooledRsRd* model (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for fold models on test data in grey, and for the final model on training data in black. This model has a 5-CV mean test CAL/AUC of 6.2%/0.731.
[Figure 29 risk score:
1. PriorSeizure: 1 point
2. PatternsIncludeBiPDorLRDAorLPD: 1 point
3. MaxFreqFactorLPD × … points
SCORE mapped to RISK]

Figure 29: PooledSeqRd model (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. We plot results for fold models on test data in grey, and for the final model on training data in black. This model has a 5-CV mean test CAL/AUC of 4.2%/0.738.
[Figure 30: PooledSeqRd* model (top), reliability diagram (bottom left), and ROC curve (bottom right) for the seizure dataset. Score items shown include: 1. PriorSeizure (1 point); 2. PatternsIncludeBiPD or LRDA or LPD (1 point); 3. MaxFreqFactorLPD (×). We plot results for fold models on test data in grey, and for the final model on training data in black. This model has a 5-CV mean test CAL/AUC of 2.8%/0.745.]
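A risk score like the ones above is used by adding up points for the active items and mapping the total score to a predicted risk through the logistic function. A minimal sketch (the item names echo the tables above, but the point values shown and the intercept are hypothetical, since the fitted score-to-risk tables are not reproduced here):

```python
import math

def predicted_risk(points, active_features, intercept=-4):
    """Score a patient: add points for active features, map the score to a risk.

    `points` maps feature name -> integer points; `intercept` is a hypothetical
    offset (in a fitted risk score it is learned from data).
    """
    score = sum(points[f] for f in active_features if f in points)
    return 1.0 / (1.0 + math.exp(-(score + intercept)))

# Illustrative two-item score loosely modeled on the tables above.
points = {"PatternsIncludeBiPD_LRDA_LPD": 2, "BriefRhythmicDischarge": 3}
low = predicted_risk(points, [])
high = predicted_risk(points, ["PatternsIncludeBiPD_LRDA_LPD", "BriefRhythmicDischarge"])
```

Every additional point moves the patient up the same monotone score-to-risk curve, which is why the printed tables in the figures only need one row per score value.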
7. Discussion
Risk scores are simple models that are widely used to inform consequential decisions, ranging from mortality prediction to sentencing. In domains such as medicine and criminal justice, practitioners build risk scores ad hoc, by implementing training pipelines that combine existing machine learning methods with simple heuristics (see Appendix C for the pipeline used in Antman et al., 2000). When a pipeline fails to produce a risk score that satisfies important requirements, practitioners resolve these issues by making manual adjustments. Alternatively, a feasible risk score is specified by a panel of experts, and data is only used to evaluate its performance (see e.g., Gage et al., 2001; McGinley and Pearse, 2012).

Ad hoc approaches do not produce risk scores with performance guarantees, and may therefore result in the deployment of risk scores with poor calibration or rank accuracy (which impact the quality of decision-making when these tools are used in consequential applications). Ad hoc model development also has significant practical costs: the need to implement a custom training pipeline slows down model development, and hinders the ability to perform follow-up analyses to inform stakeholders as to whether a risk score should be deployed (see e.g., the development and analysis of the sentencing risk score in Pennsylvania Bulletin, 2017, which was commissioned in 2010).

Our goal in this paper was to design a machine learning approach to build risk scores that standardizes model development, reduces the need for domain experts to specify models manually, and pairs models with useful formal guarantees. Our proposed approach trains risk scores by solving an empirical risk minimization problem that performs exact feature selection while restricting coefficients to small integers and enforcing application-specific constraints. Since commercial solvers did not handle this problem well off-the-shelf, we solved it using a new cutting plane algorithm (LCPA), which allows training to scale linearly with the number of samples, and can be used to train models for other problems with non-convex regularizers and non-convex constraints.

As shown in Sections 5 and 6, LCPA can train certifiably optimal risk scores for real-world datasets within minutes. These models attain best-in-class calibration and rank accuracy, and avoid pitfalls of heuristics used in practice. Our approach simplifies and standardizes model development by allowing practitioners to build customized risk scores without parameter tuning or post-processing. In addition, it pairs models with a certificate of optimality, which can be used to understand how application-specific constraints affect predictive performance.
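The cutting-plane idea at the heart of this approach can be illustrated in a few lines: the loss is replaced by a piecewise-linear lower bound built from supporting hyperplanes (cuts), and the surrogate is minimized over the integer-feasible set. The toy sketch below illustrates this mechanism only; it uses brute-force enumeration in place of a MIP solver and is not the LCPA algorithm from the paper (data and coefficient bounds are hypothetical):

```python
import itertools
import math

# Toy data: two features, labels in {-1, +1}, logistic loss over integer coefficients.
X = [(1.0, 0.5), (-1.0, 1.0), (0.5, -1.0), (-0.5, -0.5)]
y = [1, -1, 1, -1]
grid = list(itertools.product(range(-5, 6), repeat=2))  # integer-feasible set

def loss_and_grad(lam):
    n, g, val = len(X), [0.0, 0.0], 0.0
    for xi, yi in zip(X, y):
        s = yi * (lam[0] * xi[0] + lam[1] * xi[1])
        val += math.log1p(math.exp(-s)) / n
        p = -1.0 / (1.0 + math.exp(s)) / n  # d/ds log(1 + exp(-s)), averaged
        g[0] += p * yi * xi[0]
        g[1] += p * yi * xi[1]
    return val, g

cuts = []      # each cut: loss(lam_t) + <grad_t, lam - lam_t>, a lower bound on the loss
lam = (0, 0)
for _ in range(10):
    v, g = loss_and_grad(lam)
    cuts.append((v, g, lam))
    # Minimize the surrogate (max over cuts) over the integer grid;
    # enumeration stands in for the MIP solve used in practice.
    lam = min(grid, key=lambda z: max(
        c[0] + c[1][0] * (z[0] - c[2][0]) + c[1][1] * (z[1] - c[2][1]) for c in cuts))

true_opt = min(loss_and_grad(z)[0] for z in grid)
```

Because each cut under-estimates the convex loss, the surrogate minimum is a valid lower bound while the best visited loss is an upper bound; the gap between the two is exactly the kind of optimality certificate discussed above.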
Acknowledgments
We gratefully acknowledge support from Siemens, Philips, and Wistron. We would like to thank Paul Rubin for helpful discussions, and our collaborators Aaron Struck and Brandon Westover for their guidance on the seizure prediction application.
References
Alba, Ana Carolina, Thomas Agoritsas, Michael Walsh, Steven Hanna, Alfonso Iorio, PJ Devereaux, Thomas McGinn, and Gordon Guyatt. Discrimination and calibration of clinical prediction models: Users' guides to the medical literature. JAMA, 318(14):1377–1384, 2017.

American Clinical Neurophysiology Society. Standardized Critical Care EEG Terminology Training Module, 2012.

Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Angelino, Elaine, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning certifiably optimal rule lists for categorical data. Journal of Machine Learning Research, 18(234):1–78, 2018.

Antman, Elliott M, Marc Cohen, Peter JLM Bernink, Carolyn H McCabe, Thomas Horacek, Gary Papuchis, Branco Mautner, Ramon Corbalan, David Radley, and Eugene Braunwald. The TIMI risk score for unstable angina/non–ST elevation MI. The Journal of the American Medical Association, 284(7):835–842, 2000.

Atkinson, David S and Pravin M Vaidya. A cutting plane algorithm for convex programming that uses analytic centers. Mathematical Programming, 69(1-3):1–43, 1995.

Austin, James, Roger Ocker, and Avi Bhati. Kentucky pretrial risk assessment instrument validation. Bureau of Justice Statistics, 2010.

Bache, K. and M. Lichman. UCI Machine Learning Repository, 2013.

Bai, Lihui and Paul A Rubin. Combinatorial Benders cuts for the minimum tollbooth problem. Operations Research, 57(6):1510–1522, 2009.

Bardenet, Rémi and Odalric-Ambrym Maillard. Concentration inequalities for sampling without replacement. Bernoulli, 21(3):1361–1385, 2015.

Beneish, Messod D, Charles MC Lee, and D Craig Nichols. Earnings manipulation and expected returns. Financial Analysts Journal, 2013.

Bertsimas, Dimitris and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.

Bertsimas, Dimitris, Angela King, Rahul Mazumder, and others. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.

Billiet, Lieven, Sabine Van Huffel, and Vanya Van Belle. Interval coded scoring index with interaction effects. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods, pages 33–40. SCITEPRESS, 2016.

Billiet, Lieven, Sabine Van Huffel, and Vanya Van Belle. Interval coded scoring extensions for larger problems. In Computers and Communications (ISCC), 2017 IEEE Symposium on, pages 198–203. IEEE, 2017.

Bobko, Philip, Philip L Roth, and Maury A Buster. The usefulness of unit weights in creating composite scores: A literature review, application to content validity, and meta-analysis. Organizational Research Methods, 10(4):689–709, 2007.

Bonami, Pierre, Mustafa Kilinç, and Jeff Linderoth. Algorithms and software for convex mixed integer nonlinear programs. In Mixed Integer Nonlinear Programming, pages 1–39. Springer, 2012.

Boyd, Stephen P and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Burgess, Ernest W. Factors determining success or failure on parole. The Workings of the Indeterminate Sentence Law and the Parole System in Illinois, pages 221–234, 1928.

Bussieck, Michael R and Stefan Vigerske. MINLP solver software. Wiley Encyclopedia of Operations Research and Management Science, 2010.

Byrd, Richard H, Jorge Nocedal, and Richard A Waltz. KNITRO: An integrated package for nonlinear optimization. In Large-Scale Nonlinear Optimization, pages 35–59. Springer, 2006.

Calmon, Flavio, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, pages 3995–4004, 2017.

Carrizosa, Emilio, Amaya Nogales-Gómez, and Dolores Romero Morales. Strongly agree or strongly disagree?: Rating features in support vector machines. Information Sciences, 329:256–273, 2016.

Caruana, Rich and Alexandru Niculescu-Mizil. Data mining in metric space: An empirical analysis of supervised learning performance criteria. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–78. ACM, 2004.

Caruana, Rich and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161–168. ACM, 2006.

Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730. ACM, 2015.

Cawley, Gavin C and Nicola LC Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079–2107, 2010.

Chen, Chaofan, Oscar Li, Alina Barnett, Jonathan Su, and Cynthia Rudin. This looks like that: Deep learning for interpretable image recognition. arXiv, 2018.

Chevaleyre, Yann, Frederic Koriche, and Jean-Daniel Zucker. Rounding methods for discrete linear classification. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 651–659, 2013.

Cranor, Lorrie Faith and Brian A LaMacchia. Spam! Communications of the ACM, 41(8):74–83, 1998.

Dawes, Robyn M. The robust beauty of improper linear models in decision making. American Psychologist, 34(7):571–582, 1979.

DeGroot, Morris H and Stephen E Fienberg. The comparison and evaluation of forecasters. The Statistician, pages 12–22, 1983.

Duwe, Grant and KiDeuk Kim. Sacrificing accuracy for transparency in recidivism risk assessment: The impact of classification method on predictive performance. Corrections, pages 1–22, 2016.

Einhorn, Hillel J and Robin M Hogarth. Unit weighting schemes for decision making. Organizational Behavior and Human Performance, 13(2):171–192, 1975.

Elter, M, R Schulz-Wendtland, and T Wittenberg. The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Medical Physics, 34:4164, 2007.

Ertekin, Şeyda and Cynthia Rudin. On equivalence relationships between classification and ranking algorithms. The Journal of Machine Learning Research.

Credit Scoring, Response Modeling, and Insurance Rating: A Practical Guide to Forecasting Consumer Behavior. Palgrave Macmillan, 2012.

Franc, Vojtěch and Soeren Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

Franc, Vojtěch and Sören Sonnenburg. Optimized cutting plane algorithm for large-scale risk minimization. The Journal of Machine Learning Research, 10:2157–2192, 2009.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

Gage, Brian F, Amy D Waterman, William Shannon, Michael Boechler, Michael W Rich, and Martha J Radford. Validation of clinical classification schemes for predicting stroke. The Journal of the American Medical Association, 285(22):2864–2870, 2001.

Goel, Sharad, Justin M Rao, and Ravi Shroff. Precinct or prejudice? Understanding racial disparities in New York City's stop-and-frisk policy. Annals of Applied Statistics, 10(1):365–394, 2016.

Goh, Gabriel, Andrew Cotter, Maya Gupta, and Michael P Friedlander. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems, pages 2415–2423, 2016.

Goh, Siong Thye and Cynthia Rudin. Box drawings for learning with imbalanced data. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 333–342. ACM, 2014.

Goldberg, Noam and Jonathan Eckstein. Sparse weighted voting classifier selection and its linear programming relaxations. Information Processing Letters, 112:481–486, 2012.

Gottfredson, Don M and Howard N Snyder. The mathematics of risk classification: Changing data into valid instruments for juvenile courts. NCJ 209158. Office of Juvenile Justice and Delinquency Prevention, Washington, D.C., 2005.

Guan, Wei, Alex Gray, and Sven Leyffer. Mixed-integer support vector machine. In NIPS Workshop on Optimization for Machine Learning, 2009.

Gupta, Maya, Andrew Cotter, Jan Pfeifer, Konstantin Voevodski, Kevin Canini, Alexander Mangylov, Wojciech Moczydlowski, and Alexander Van Esbroeck. Monotonic calibrated interpolated look-up tables. The Journal of Machine Learning Research, 17(1):3790–3836, 2016.

Hirsch, LJ, SM LaRoche, N Gaspard, E Gerard, A Svoronos, ST Herman, R Mani, H Arif, N Jette, Y Minazad, and others. American Clinical Neurophysiology Society's standardized critical care EEG terminology: 2012 version. Journal of Clinical Neurophysiology, 30(1):1–27, 2013.

Holte, Robert C. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63–90, 1993.

Holte, Robert C. Elaboration on two points raised in "Classifier Technology and the Illusion of Progress". Statistical Science, 21(1):24–26, February 2006.

Hübner, Ruth and Anita Schöbel. When is rounding allowed in integer nonlinear optimization? European Journal of Operational Research.

Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

Joachims, Thorsten, Thomas Finley, and Chun-Nam John Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.

Kamishima, Toshihiro, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. Pages 643–650. IEEE, 2011.

Kelley, James E, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.

Kessler, Ronald C, Lenard Adler, Minnie Ames, Olga Demler, Steve Faraone, Eva Hiripi, Mary J Howes, Robert Jin, Kristina Secnik, Thomas Spencer, and others. The World Health Organization adult ADHD self-report scale (ASRS): A short screening scale for use in the general population. Psychological Medicine, 35(02):245–256, 2005.

Kohavi, Ron. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, pages 202–207, 1996.

Kotlowski, Wojciech, Krzysztof J Dembczynski, and Eyke Huellermeier. Bipartite ranking through minimization of univariate loss. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1113–1120, 2011.

Lakkaraju, Himabindu, Stephen H Bach, and Jure Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1675–1684. ACM, 2016.

Latessa, Edward, Paula Smith, Richard Lemke, Matthew Makarios, and Christopher Lowenkamp. Creation and validation of the Ohio Risk Assessment System: Final report, 2009.

Le Gall, Jean-Roger, Stanley Lemeshow, and Fabienne Saulnier. A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. The Journal of the American Medical Association, 270(24):2957–2963, 1993.

Letham, Benjamin, Cynthia Rudin, Tyler H. McCormick, and David Madigan. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Annals of Applied Statistics, 9(3):1350–1371, 2015.

Levin, A Yu. On an algorithm for the minimization of convex functions. In Soviet Mathematics Doklady, volume 160, pages 1244–1247, 1965.

Li, Oscar, Hao Liu, Chaofan Chen, and Cynthia Rudin. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Proceedings of AAAI, 2018.

Liu, Yufeng and Yichao Wu. Variable selection via a combination of the L0 and L1 penalties. Journal of Computational and Graphical Statistics, 16(4), 2007.

Malioutov, Dmitry and Kush Varshney. Exact rule learning via boolean compressed sensing. In Proceedings of the 30th International Conference on Machine Learning, pages 765–773, 2013.

Mangasarian, Olvi L, W Nick Street, and William H Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4):570–577, 1995.

McGinley, Ann and Rupert M Pearse. A national early warning score for acutely ill patients, 2012.

Menon, Aditya Krishna, Xiaoqian J Jiang, Shankar Vembu, Charles Elkan, and Lucila Ohno-Machado. Predicting accurate probabilities with a ranking loss. In Machine Learning: Proceedings of the International Conference, volume 2012, page 703. NIH Public Access, 2012.

Moreno, Rui P, Philipp GH Metnitz, Eduardo Almeida, Barbara Jordan, Peter Bauer, Ricardo Abizanda Campos, Gaetano Iapichino, David Edbrooke, Maurizia Capuzzo, and Jean-Roger Le Gall. SAPS 3 - From evaluation of the patient to evaluation of the intensive care unit. Part 2: Development of a prognostic model for hospital mortality at ICU admission. Intensive Care Medicine, 31(10):1345–1355, 2005.

Moro, Sérgio, Paulo Cortez, and Paulo Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22–31, 2014.

Naeini, Mahdi Pakdaman, Gregory F Cooper, and Milos Hauskrecht. Binary classifier calibration: Non-parametric approach. arXiv preprint arXiv:1401.3390, 2014.

Naoum-Sawaya, Joe and Samir Elhedhli. An interior-point Benders based branch-and-cut algorithm for mixed integer programs. Annals of Operations Research, 210(1):33–55, November 2010.

Nguyen, Hai Thanh and Katrin Franke. A general Lp-norm support vector machine via mixed 0-1 programming. In Machine Learning and Data Mining in Pattern Recognition, pages 40–49. Springer, 2012.

Park, Jaehyun and Stephen Boyd. Concave quadratic cuts for mixed-integer quadratic problems. arXiv preprint arXiv:1510.06421, 2015a.

Park, Jaehyun and Stephen Boyd. A semidefinite programming method for integer convex quadratic minimization. arXiv preprint arXiv:1504.07672, 2015b.

Pennsylvania Bulletin. Sentence Risk Assessment Instrument, April 2017.

Pennsylvania Commission on Sentencing. Interim Report 4: Development of Risk Assessment Scale. Technical report, June 2012.

Piotroski, Joseph D. Value investing: The use of historical financial statement information to separate winners from losers. Journal of Accounting Research, pages 1–41, 2000.

Platt, John and others. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

Pleiss, Geoff, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems, pages 5684–5693, 2017.

Reid, Mark D and Robert C Williamson. Composite binary losses. The Journal of Machine Learning Research, 11:2387–2422, 2010.

Reilly, Brendan M and Arthur T Evans. Translating clinical research into clinical practice: Impact of using prediction rules to make decisions. Annals of Internal Medicine, 144(3):201–209, 2006.

Rudin, Cynthia and Berk Ustun. Optimized scoring systems: Towards trust in machine learning for healthcare and criminal justice. Interfaces (Forthcoming), 2018.

Sato, Toshiki, Yuichi Takano, and Ryuhei Miyashiro. Piecewise-linear approximation for feature subset selection in a sequential logit model. arXiv preprint arXiv:1510.05417, 2015.

Sato, Toshiki, Yuichi Takano, Ryuhei Miyashiro, and Akiko Yoshise. Feature subset selection for logistic regression via mixed integer optimization. Computational Optimization and Applications, pages 1–16, 2016.

Schlimmer, Jeffrey Curtis. Concept acquisition through representational adjustment. 1987.

Shah, Nilay D, Ewout W Steyerberg, and David M Kent. Big data and predictive analytics: Recalibrating expectations. JAMA, 2018.

Siddiqi, Naeem. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, volume 3. John Wiley & Sons, 2012.

Six, AJ, BE Backus, and JC Kelder. Chest pain in the emergency room: Value of the HEART score. Netherlands Heart Journal, 16(6):191–196, 2008.

Sokolovska, Nataliya, Yann Chevaleyre, Karine Clément, and Jean-Daniel Zucker. The fused lasso penalty for learning interpretable medical scoring systems. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 4504–4511. IEEE, 2017.

Sokolovska, Nataliya, Yann Chevaleyre, and Jean-Daniel Zucker. A provable algorithm for learning interpretable scoring systems. In International Conference on Artificial Intelligence and Statistics, pages 566–574, 2018.

Struck, Aaron F, Berk Ustun, Andres Rodriguez Ruiz, Jong Woo Lee, Suzette M LaRoche, Lawrence J Hirsch, Emily J Gilmore, Jan Vlachy, Hiba Arif Haider, Cynthia Rudin, and others. Association of an electroencephalography-based risk score with seizure probability in hospitalized patients. JAMA Neurology, 74(12):1419–1424, 2017.

Teo, Choon Hui, Alex Smola, SVN Vishwanathan, and Quoc Viet Le. A scalable modular convex solver for regularized risk minimization. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 727–736. ACM, 2007.

Teo, Choon Hui, S Vishwanathan, Alex Smola, and Quoc V Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 1:55, 2009.

Than, Martin, Dylan Flaws, Sharon Sanders, Jenny Doust, Paul Glasziou, Jeffery Kline, Sally Aldous, Richard Troughton, Christopher Reid, William A Parsonage, and others. Development and validation of the Emergency Department Assessment of Chest Pain Score and 2 h accelerated diagnostic protocol. Emergency Medicine Australasia, 26(1):34–44, 2014.

US Department of Justice. The Mathematics of Risk Classification: Changing Data into Valid Instruments for Juvenile Courts, 2005.

Ustun, Berk and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349–391, 2016.

Ustun, Berk and Cynthia Rudin. Optimized risk scores. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.

Ustun, Berk, Stefano Traca, and Cynthia Rudin. Supersparse linear integer models for predictive scoring systems. In AAAI Late-Breaking Developments, 2013.

Ustun, Berk, M.B. Westover, Cynthia Rudin, and Matt T. Bianchi. Clinical prediction models for sleep apnea: The importance of medical history over symptoms. Journal of Clinical Sleep Medicine, 12(2):161–168, 2016.

Ustun, Berk, Lenard A Adler, Cynthia Rudin, Stephen V Faraone, Thomas J Spencer, Patricia Berglund, Michael J Gruber, and Ronald C Kessler. The World Health Organization Adult Attention-Deficit/Hyperactivity Disorder Self-Report Screening Scale for DSM-5. JAMA Psychiatry, 74(5):520–526, 2017.

Van Belle, Vanya, Patrick Neven, Vernon Harvey, Sabine Van Huffel, Johan AK Suykens, and Stephen Boyd. Risk group detection and survival function estimation for interval coded survival methods. Neurocomputing, 112:200–210, 2013.

Van Calster, Ben and Andrew J Vickers. Calibration of risk prediction models: Impact on decision-analytic performance. Medical Decision Making, 35(2):162–169, 2015.

Wang, Fulton and Cynthia Rudin. Falling rule lists. In Proceedings of Artificial Intelligence and Statistics (AISTATS), 2015.

Wang, Jiaxuan, Jeeheh Oh, Haozhu Wang, and Jenna Wiens. Learning credible models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, pages 2417–2426, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5552-0.

Wang, Tong, Cynthia Rudin, Finale Doshi-Velez, Yimin Liu, Erica Klampfl, and Perry MacNeille. A Bayesian framework for learning rule sets for interpretable classification. Journal of Machine Learning Research.

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 962–970, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.

Zeng, Jiaming, Berk Ustun, and Cynthia Rudin. Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society: Series A, 2016.

Appendix A. Omitted Proofs
Proof (Remark 4). We first explain why LCPA attains the optimal objective value, and then justify the upper bounds on the number of cuts and the number of nodes.

Observe that LCPA finds the optimal solution to RiskSlimMINLP through an exhaustive search over the feasible region $\mathcal{L}$. Thus, LCPA is bound to encounter the optimal solution, since it only discards a node $(v^t, \mathcal{R}^t)$ if: (i) the surrogate problem is infeasible over $\mathcal{R}^t$ (in which case RiskSlimMINLP is also infeasible over $\mathcal{R}^t$); or (ii) the surrogate problem has an objective value that exceeds $V^{\max}$ (in which case any integer feasible solution in $\mathcal{R}^t$ is also suboptimal).

The bound on the number of cuts follows from the fact that Algorithm 2 only adds cuts at integer feasible solutions, of which there are at most $|\mathcal{L}|$. The bound on the number of processed nodes represents a worst-case limit produced by bounding the depth of the branch-and-bound tree. To do this, we exploit the fact that the splitting rule SplitRegion splits a partition into two mutually exclusive subsets by adding integer-valued bounds such as $\lambda_j \geq \lceil \lambda_j \rceil$ and $\lambda_j \leq \lfloor \lambda_j \rfloor - 1$ on a single coefficient. Consider applying SplitRegion a total of $\Lambda^{\max}_j - \Lambda^{\min}_j + 1$ times in succession on a fixed dimension $j$. This results in a total of $\Lambda^{\max}_j - \Lambda^{\min}_j + 1$ nodes, where each node fixes the coefficient in dimension $j$ to an integer value $\lambda_j \in \{\Lambda^{\min}_j, \ldots, \Lambda^{\max}_j\}$. Pick any node and repeat this process for the coefficients in the remaining dimensions. The resulting B&B tree will have at most $|\mathcal{L}|$ leaf nodes, where $\lambda$ is restricted to integer feasible solutions.

Proof (Proposition 5). Since the coefficient set $\mathcal{L}$ is bounded, the data $(x_i, y_i)_{i=1}^n$ are discrete, and the normalized logistic loss function $l(\lambda)$ is continuous, it follows that the value of $l(\lambda)$ is bounded: $l(\lambda) \in [\min_{\lambda \in \mathcal{L}} l(\lambda),\; \max_{\lambda \in \mathcal{L}} l(\lambda)]$ for all $\lambda \in \mathcal{L}$. Thus we need only show that $L^{\min} \leq \min_{\lambda \in \mathcal{L}} l(\lambda)$ and $L^{\max} \geq \max_{\lambda \in \mathcal{L}} l(\lambda)$.
For the lower bound, we observe that:

$$
\begin{aligned}
\min_{\lambda \in \mathcal{L}} l(\lambda)
&= \min_{\lambda \in \mathcal{L}} \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-\langle \lambda, y_i x_i \rangle)\right) \\
&= \min_{\lambda \in \mathcal{L}} \left[ \frac{1}{n} \sum_{i: y_i = +1} \log\left(1 + \exp(-\langle \lambda, x_i \rangle)\right) + \frac{1}{n} \sum_{i: y_i = -1} \log\left(1 + \exp(\langle \lambda, x_i \rangle)\right) \right] \\
&\geq \frac{1}{n} \sum_{i: y_i = +1} \min_{\lambda \in \mathcal{L}} \log\left(1 + \exp(-\langle \lambda, x_i \rangle)\right) + \frac{1}{n} \sum_{i: y_i = -1} \min_{\lambda \in \mathcal{L}} \log\left(1 + \exp(\langle \lambda, x_i \rangle)\right) \\
&= \frac{1}{n} \sum_{i: y_i = +1} \log\left(1 + \exp\left(-\max_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle\right)\right) + \frac{1}{n} \sum_{i: y_i = -1} \log\left(1 + \exp\left(\min_{\lambda \in \mathcal{L}} \langle \lambda, x_i \rangle\right)\right) \\
&= \frac{1}{n} \sum_{i: y_i = +1} \log\left(1 + \exp(-s_i^{\max})\right) + \frac{1}{n} \sum_{i: y_i = -1} \log\left(1 + \exp(s_i^{\min})\right) \\
&= L^{\min}.
\end{aligned}
$$

The upper bound can be derived in a similar manner.
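This lower bound is straightforward to check numerically: enumerate a small coefficient set, compute $s_i^{\max}$ and $s_i^{\min}$ for each example, and compare $L^{\min}$ to the exhaustive minimum of the loss. A sketch with synthetic data (dataset and coefficient range are hypothetical); the final two lines also evaluate the model-size bound of Proposition 6 from hypothetical certificate values:

```python
import itertools
import math
import random

random.seed(0)
n, d = 50, 2
X = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
y = [random.choice([-1, 1]) for _ in range(n)]
L = list(itertools.product(range(-3, 4), repeat=d))  # small integer coefficients

def loss(lam):  # normalized logistic loss l(lambda)
    return sum(math.log1p(math.exp(-yi * sum(l * x for l, x in zip(lam, xi))))
               for xi, yi in zip(X, y)) / n

# Extreme scores over L for each example, then L_min as in the derivation.
smax = [max(sum(l * x for l, x in zip(lam, xi)) for lam in L) for xi in X]
smin = [min(sum(l * x for l, x in zip(lam, xi)) for lam in L) for xi in X]
L_min = (sum(math.log1p(math.exp(-smax[i])) for i in range(n) if y[i] == 1) +
         sum(math.log1p(math.exp(smin[i])) for i in range(n) if y[i] == -1)) / n

best_loss = min(loss(lam) for lam in L)  # exhaustive minimum: L_min <= best_loss

# Proposition 6 with hypothetical V_max and C: a bound on the optimal model size.
C, V_max = 1e-2, best_loss + 5 * 1e-2
k_bound = math.floor((V_max - L_min) / C)
```

The same per-example minimization trick is what makes $L^{\min}$ cheap to compute before solving the full problem.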
Proof (Proposition 6). We are given that $V^{\max} \geq V(\lambda^*)$, where $V(\lambda^*) := l(\lambda^*) + C \|\lambda^*\|_0$ by definition. Thus, we can recover the upper bound in Proposition 6 as follows:

$$
\begin{aligned}
l(\lambda^*) + C \|\lambda^*\|_0 &\leq V^{\max}, \\
\|\lambda^*\|_0 &\leq \frac{V^{\max} - l(\lambda^*)}{C}, \\
\|\lambda^*\|_0 &\leq \frac{V^{\max} - L^{\min}}{C}, \qquad (10) \\
\|\lambda^*\|_0 &\leq \left\lfloor \frac{V^{\max} - L^{\min}}{C} \right\rfloor. \qquad (11)
\end{aligned}
$$

Here, (10) follows from the fact that $L^{\min} \leq l(\lambda^*)$ by definition, and (11) follows from the fact that the number of non-zero coefficients is a natural number.

Proof (Proposition 7). We are given that $V^{\max} \geq V(\lambda^*)$, where $V(\lambda^*) := l(\lambda^*) + C \|\lambda^*\|_0$ by definition. Thus, we can recover the upper bound in Proposition 7 as follows:

$$
\begin{aligned}
l(\lambda^*) + C \|\lambda^*\|_0 &\leq V^{\max}, \\
l(\lambda^*) &\leq V^{\max} - C \|\lambda^*\|_0, \\
l(\lambda^*) &\leq V^{\max} - C R^{\min}.
\end{aligned}
$$

Here, the last line follows from the fact that $R^{\min} \leq \|\lambda^*\|_0$ by definition.

Proof (Proposition 8). We are given that $V^{\min} \leq V(\lambda^*)$, where $V(\lambda^*) := l(\lambda^*) + C \|\lambda^*\|_0$ by definition. Thus, we can recover the lower bound in Proposition 8 as follows:

$$
\begin{aligned}
l(\lambda^*) + C \|\lambda^*\|_0 &\geq V^{\min}, \\
l(\lambda^*) &\geq V^{\min} - C \|\lambda^*\|_0, \\
l(\lambda^*) &\geq V^{\min} - C R^{\max}.
\end{aligned}
$$

Here, the last line follows from the fact that $R^{\max} \geq \|\lambda^*\|_0$ by definition.

Proof (Theorem 9). For a fixed coefficient vector $\lambda \in \mathcal{L}$, consider a sample of $n$ points composed of the values of the loss function $l_i(\lambda)$ for each example in the full training dataset $D_n = (x_i, y_i)_{i=1}^n$.
Let $\bar{l}_n(\lambda) = \frac{1}{n} \sum_{i=1}^n l_i(\lambda)$ and $\bar{l}_m(\lambda) = \frac{1}{m} \sum_{i=1}^m l_i(\lambda)$. Then the Hoeffding-Serfling inequality (see e.g., Theorem 2.4 in Bardenet and Maillard, 2015) guarantees the following for all $\varepsilon > 0$:

$$
\Pr\left( \bar{l}_n(\lambda) - \bar{l}_m(\lambda) \geq \varepsilon \right) \leq \exp\left( \frac{-2\varepsilon^2 m}{\left(1 - \frac{m}{n}\right)\left(1 + \frac{m}{n}\right) \Delta(\lambda, D_n)^2} \right),
$$

where $\Delta(\lambda, D_n) = \max_{i=1,\ldots,n} l_i(\lambda) - \min_{i=1,\ldots,n} l_i(\lambda)$.

We recover the desired inequality by generalizing this bound to hold for all $\lambda \in \mathcal{L}$ as follows:

$$
\begin{aligned}
\Pr\left( \max_{\lambda \in \mathcal{L}} \left( \bar{l}_n(\lambda) - \bar{l}_m(\lambda) \right) \geq \varepsilon \right)
&= \Pr\left( \bigcup_{\lambda \in \mathcal{L}} \left\{ \bar{l}_n(\lambda) - \bar{l}_m(\lambda) \geq \varepsilon \right\} \right) \\
&\leq \sum_{\lambda \in \mathcal{L}} \Pr\left( \bar{l}_n(\lambda) - \bar{l}_m(\lambda) \geq \varepsilon \right) \qquad (12) \\
&\leq \sum_{\lambda \in \mathcal{L}} \exp\left( \frac{-2\varepsilon^2 m}{(1 - \frac{m}{n})(1 + \frac{m}{n}) \Delta(\lambda, D_n)^2} \right) \qquad (13) \\
&\leq |\mathcal{L}| \exp\left( \frac{-2\varepsilon^2 m}{(1 - \frac{m}{n})(1 + \frac{m}{n}) \Delta^{\max}(\mathcal{L}, D_n)^2} \right). \qquad (14)
\end{aligned}
$$

Here, (12) follows from the union bound, (13) follows from the Hoeffding-Serfling inequality, and (14) follows from the fact that $\Delta(\lambda, D_n) \leq \Delta^{\max}(\mathcal{L}, D_n)$ for all $\lambda \in \mathcal{L}$.

Proof (Corollary 10). We first show that for any tolerance $\delta > 0$, the prescribed choice of $\varepsilon_\delta$ ensures that $V_n(\lambda) - V_m(\lambda) \leq \varepsilon_\delta$ with probability at least $1 - \delta$. Restating the result of Theorem 9, we have that for any $\varepsilon > 0$:

$$
\Pr\left( \max_{\lambda \in \mathcal{L}} \left( \bar{l}_n(\lambda) - \bar{l}_m(\lambda) \right) \geq \varepsilon \right) \leq |\mathcal{L}| \exp\left( \frac{-2\varepsilon^2 m}{(1 - \frac{m}{n})(1 + \frac{m}{n}) \Delta^{\max}(\mathcal{L}, D_n)^2} \right). \qquad (15)
$$

Note that $\bar{l}_n(\lambda) - \bar{l}_m(\lambda) = V_n(\lambda) - V_m(\lambda)$ for any fixed $\lambda$. In addition, note that the set of rounded coefficients $\mathcal{L}(\rho)$ contains at most $|\mathcal{L}(\rho)| \leq 2^d$ coefficient vectors. Therefore, in this setting, (15) implies that for any $\varepsilon > 0$:

$$
\Pr\left( V_n(\lambda) - V_m(\lambda) \geq \varepsilon \right) \leq 2^d \exp\left( \frac{-2\varepsilon^2 m}{(1 - \frac{m}{n})(1 + \frac{m}{n}) \Delta(\mathcal{L}(\rho), D_n)^2} \right). \qquad (16)
$$

By setting $\varepsilon = \varepsilon_\delta$ and simplifying the terms on the right hand side of (16), we can see that $\Pr\left( V_n(\lambda) - V_m(\lambda) \geq \varepsilon_\delta \right) \leq \delta$.
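The prescribed tolerance $\varepsilon_\delta$ is obtained by inverting this deviation bound. A sketch that computes it numerically, assuming the bound takes the Hoeffding-Serfling form $2^d \exp(-2\varepsilon^2 m / ((1 - \frac{m}{n})(1 + \frac{m}{n}) \Delta^2))$ with $|\mathcal{L}(\rho)| \leq 2^d$; all input values below are hypothetical:

```python
import math

def epsilon_delta(m, n, d, delta, delta_max):
    """Smallest tolerance eps for which the union bound equals delta.

    Solves 2^d * exp(-2 eps^2 m / ((1 - m/n)(1 + m/n) Delta^2)) = delta for eps,
    where Delta = delta_max is the range of the per-example loss.
    """
    slack = (1 - m / n) * (1 + m / n) * delta_max ** 2
    return math.sqrt(slack * (d * math.log(2) - math.log(delta)) / (2 * m))

# Hypothetical setting: a 10% subsample of n = 10,000 points, d = 20 features.
eps = epsilon_delta(m=1000, n=10000, d=20, delta=0.01, delta_max=1.0)
```

Larger subsamples shrink the tolerance, which is why sampling-based bounds of this kind become useful when data-related computation dominates training.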
Thus, the prescribed value of $\varepsilon_\delta$ ensures that $V_n(\lambda) - V_m(\lambda) \leq \varepsilon_\delta$ with probability at least $1 - \delta$.

Since we have set $\varepsilon_\delta$ so that $V_n(\lambda) - V_m(\lambda) \leq \varepsilon_\delta$ with probability at least $1 - \delta$, we now need only show that any $\lambda$ satisfying $V_m(\lambda) < V^{\max} - \varepsilon_\delta$ will also satisfy $V_n(\lambda) < V^{\max}$ to complete the proof. To see this, observe that:

$$
\begin{aligned}
V_n(\lambda) - V_m(\lambda) &\leq \varepsilon_\delta, \\
V_n(\lambda) &\leq V_m(\lambda) + \varepsilon_\delta, \\
V_n(\lambda) &< V^{\max}. \qquad (17)
\end{aligned}
$$

Here, (17) follows from the fact that $V_m(\lambda) < V^{\max} - \varepsilon_\delta \implies V_m(\lambda) + \varepsilon_\delta < V^{\max}$.

Appendix B. Small Regularization Parameters do not Influence Accuracy
In Section 2, we state that if the trade-off parameter $C$ in the objective of the risk score problem is sufficiently small, then its optimal solution will attain the best possible trade-off between logistic loss and sparsity. In what follows, we formalize this statement. For clarity, we omit the intercept term and explicitly show the regularization parameter $C$ in the RiskSLIM objective, so that $V(\lambda; C) := l(\lambda) + C \|\lambda\|_0$. We use some new notation, shown in Table 7.

$M = \arg\min_{\lambda \in \mathcal{L}} l(\lambda)$ : minimizers of the logistic loss
$\mathcal{L}(k) = \{\lambda \in \mathcal{L} \mid \|\lambda\|_0 \leq k\}$ : feasible coefficients of models with model size at most $k$
$M(k) = \arg\min_{\lambda \in \mathcal{L}(k)} l(\lambda)$ : minimizers of the logistic loss among models of size at most $k$
$L(k) = \min_{\lambda \in \mathcal{L}(k)} l(\lambda)$ : logistic loss of minimizers with model size at most $k$
$\lambda^{\text{opt}} \in \arg\min_{\lambda \in M} \|\lambda\|_0$ : sparsest minimizer among all minimizers of the logistic loss
$k^{\text{opt}} = \|\lambda^{\text{opt}}\|_0$ : model size of the sparsest minimizer

Table 7: Notation used in Remarks 11 and 12.
Remark 11 (Minimizers of the Risk Score Problem)
Any optimal solution to the risk score problem will achieve a logistic loss of L(k) for some k ≥ 0:

    min_{λ ∈ ℒ} V(λ; C) = min_{k ∈ {0,1,...,k_opt}} L(k) + Ck.

Remark 12 (Small Trade-Off Parameters Do Not Influence Accuracy)
There exists an integer z ≥ 1 such that if C < (1/z) [ L(k_opt − z) − L(k_opt) ], then λ_opt ∈ argmin_{λ ∈ ℒ} V(λ; C).

Remark 12 states that as long as C is sufficiently small, the regularized objective is minimized by any model λ_opt that has the smallest size among the most accurate feasible models. In other words, there exists a small enough value of C to guarantee that we will obtain the best possible solution. Since we do not know z in advance, and since it is just as difficult to compute L(·) as it is to solve the risk score problem, we cannot know in advance how small C must be to avoid sacrificing sparsity. However, it does not matter which value of C we choose as long as it is sufficiently small. In practice, we set C = 10⁻⁸, which is the smallest value that we can use without running into numerical issues with the MIP solver.

Proof (Remark 11). Since L(k) is the minimal value of the logistic loss over all models with at most k non-zero coefficients, we have:

    L(k) + Ck ≤ l(λ) + Ck for any λ ∈ ℒ(k).    (18)

Denote a feasible minimizer of V(λ; C) as λ′ ∈ argmin_{λ ∈ ℒ} V(λ; C), and let k′ = ‖λ′‖₀. Since λ′ ∈ ℒ(k′), we have that:

    L(k′) + Ck′ ≤ V(λ′; C).    (19)

Taking the minimum of the left-hand side of (19) over k:

    min_{k ∈ {0,1,...,k_opt}} L(k) + Ck ≤ L(k′) + Ck′.    (20)

Combining (19) and (20), we get:

    min_{k ∈ {0,1,...,k_opt}} L(k) + Ck ≤ L(k′) + Ck′ ≤ min_{λ ∈ ℒ} V(λ; C).

If these inequalities were not all equalities, we would have a contradiction with the definition of λ′ and k′ as a minimizer of V(λ; C) and its size. So all must hold with equality, which proves the statement.

Proof (Remark 12). Note that if C = 0, then some minimizer of V(λ; C) has k_opt non-zero coefficients.
Consider increasing the value of C starting from zero until a threshold value C_min0, defined as the smallest value at which the minimizer can sacrifice some loss to remove at least one non-zero coefficient. Let z ≥ 1 be the number of non-zero coefficients removed. At the threshold value C_min0, where we choose the smaller model rather than the one with size k_opt, we have:

    L(k_opt) + C_min0 k_opt ≥ L(k_opt − z) + C_min0 (k_opt − z).

Simplifying, we obtain:

    (1/z) [ L(k_opt − z) − L(k_opt) ] ≤ C_min0.

Thus,
RiskSLIM does not sacrifice sparsity for logistic loss when:

    (1/z) [ L(k_opt − z) − L(k_opt) ] > C.

Here, we know that the value on the left-hand side is greater than 0 because L(·) is decreasing in its argument; if there were no strict decrease, we would have a contradiction with the definition of k_opt as the smallest number of terms of an optimal model. In particular, L(k_opt − z) = L(k_opt) would imply that:

    min_λ V(λ; C_min0) = L(k_opt) + C_min0 k_opt = L(k_opt − z) + C_min0 k_opt ≥ L(k_opt − z) + C_min0 (k_opt − z) ≥ min_λ V(λ; C_min0).

Thus all inequalities are equalities, which implies that z = 0 and contradicts z ≥ 1.

Appendix C. Background on Risk Scores
In Table 8, we present quotes from authors in medicine, criminal justice, and finance to support the claim that risk scores are used because they are easy to use, understand, and validate.
Domain | Reference | Quote
Medicine | Than et al. (2014) | “Ease of use might be facilitated by presenting a rule developed from logistic regression as a score, where the original predictor weights have been converted to integers that are easy to add together... Though less precise than the original regression formula, such presentations are less complex, easier to apply by memory and usable without electronic assistance.”
Criminal Justice | Duwe and Kim (2016) | “It is commonplace... for fine-tuned regression coefficients to be replaced with a simple point system... to promote the easy implementation, transparency, and interpretability of risk-assessment instruments.”
Finance | Finlay (2012) | “presenting a linear model in the form of a scorecard is attractive because it’s so easy to explain and use. In particular, the score can be calculated using just addition to add up the relevant points that someone receives”
Table 8:
Quotes on why risk scores with small integer coefficients are used in different domains.
Importance of Operational Constraints
The approaches used to create risk scores vary significantly across problems. There is no standard approach within a given domain (see e.g., the different techniques proposed in criminal justice by Gottfredson and Snyder, 2005; Bobko et al., 2007; Duwe and Kim, 2016), or even within a given application (see e.g., the different approaches used to create risk scores for cardiac illness: Six et al., 2008; Antman et al., 2000; Than et al., 2014).

A key reason for the lack of a standardized approach is that risk scores in domains such as medicine and criminal justice need to obey additional operational constraints to be used and accepted. In some cases, these constraints can be explicitly stated. Reilly and Evans (2006), for example, describe the requirements put forth by physicians when building a model to detect major cardiac complications for patients with chest pain:

“
Our physicians... insisted that a new left bundle-branch block be considered as an electrocardiographic predictor of acute ischemia. In addition, they argued that patients who are stratified as low risk by the prediction rule could inappropriately include patients presenting with acute pulmonary edema, ongoing ischemic pain despite maximal medical therapy, or unstable angina after recent coronary revascularization (52). They insisted that such emergent clinical presentations be recommended for coronary care unit admission, not telemetry unit admission.”

In other cases, however, operational constraints may depend on qualities that are difficult to define a priori. Consider, for example, the following statement in Than et al. (2014), which describes the importance of sensibility for deployment:

“
An important consideration during development is the clinical sensibility of the resulting prediction rule [...] Evaluation of sensibility requires judgment rather than statistical methods. A sensible rule is easy to use, and has content and face validities. Prediction rules are unlikely to be applied in practice if they are not considered sensible by the end-user, even if they are accurate.”

Approaches to Model Development
Common heuristics used in model development include:

• Heuristic Feature Selection: Many approaches use heuristic feature selection to reduce the number of variables in the model. Model development pipelines can often involve multiple rounds of feature selection, and may use different heuristics at each stage (e.g., Antman et al. 2000 use a significance test to remove weak predictors, then use approximate feature selection via forward stepwise regression).

• Heuristic Rounding: Many approaches use rounding heuristics to produce models with integer coefficients. In the simplest case, this involves scaling and rounding the coefficients of a logistic regression model (Goel et al., 2016) or a linear probability model (US Department of Justice, 2005). The SAPS II score (Le Gall et al., 1993), for example, was built in this way (“the general rule was to multiply the β for each range by 10 and round off to the nearest integer.”).

• Expert Judgement: A common approach to model development involves having a panel of experts build a model by hand, and using data to validate the model after it is built (e.g., the CHADS₂ score for stroke prediction of Gage et al. 2001, and the National Early Warning Score to assess acute illness in the ICU of McGinley and Pearse 2012). Expert judgement can also be used in data-driven approaches. In developing the EDACS score (Than et al., 2014), for example, expert judgement was used to: (i) determine a scaling factor for model coefficients (“The beta coefficients were multiplied by eight, which was the smallest common multiplication factor possible to obtain a sensible score that used whole numbers and facilitated clinical ease of use.”); and (ii) convert a continuous variable into binary variables (“
Age was the only continuous variable to be included in the final score. It was converted to a categorical variable, using 5-year age bands with increasing increments of +2 points.”).

• Unit Weighting: This technique produces a score by adding up all variables that are significantly correlated with the outcome of interest. Unit weighting is prevalent in criminal justice (see e.g., Bobko et al., 2007; Duwe and Kim, 2016), where it is referred to as the
Burgess method (as it was first proposed in Burgess, 1928). The use of this technique is frequently motivated by empirical work showing that linear models with unit weights may perform surprisingly well (see e.g., Einhorn and Hogarth, 1975; Dawes, 1979; Holte, 1993, 2006; Bobko et al., 2007).
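Two of the heuristics above, scale-and-round and unit weighting, are simple enough to sketch directly. The scaling factor, correlation threshold, coefficient values, and function names below are illustrative assumptions, not any published score's recipe:

```python
import numpy as np

def scale_and_round(beta, factor=10):
    """Heuristic rounding: multiply real-valued coefficients by a scaling
    factor and round to the nearest integer (in the spirit of the SAPS II recipe)."""
    return np.rint(np.asarray(beta) * factor).astype(int)

def unit_weights(correlations, threshold=0.1):
    """Burgess-style unit weighting: give every variable whose correlation
    with the outcome exceeds a threshold (in magnitude) a weight of +/-1."""
    c = np.asarray(correlations)
    return np.where(np.abs(c) >= threshold, np.sign(c).astype(int), 0)

# hypothetical logistic regression coefficients and outcome correlations
beta = [0.231, -0.049, 0.418]
print(scale_and_round(beta))              # values 2, 0, 4
print(unit_weights([0.3, -0.02, -0.25]))  # values 1, 0, -1
```

Note how the second coefficient rounds to zero: small but informative coefficients can vanish under heuristic rounding, which is one reason such pipelines can lose accuracy.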
Critical Analysis of a Real-World Training Pipeline
Many risk scores are built using sequential training pipelines that combine traditional statistical techniques, heuristics, and expert judgement. The TIMI Risk Score of Antman et al. (2000), for example, was built as follows:
1. “A total of 12 baseline characteristics arranged in a dichotomous fashion were screened as candidate predictor variables of risk of developing an end-point event.”
2. “After each factor was tested independently in a univariate logistic regression model, those that achieved a significance level of p < 0.20 were [retained].”
3. “[The remaining factors were]... selected for testing in a multivariate step-wise (backward elimination) logistic regression model. Variables associated with p < 0.05 were retained in the final model.”
4. “After development of the multivariate model, the [risk predictions were determined]... for the test cohort using those variables that had been found to be statistically significant predictors of events in the multivariate analysis.”
5. “The score was then constructed by a simple arithmetic sum of the number of variables present.”

Although this pipeline uses several established statistical techniques (e.g., stepwise regression, significance testing), it is unlikely to output a risk score that attains the best possible performance because:

• Decisions involving feature selection and rounding are made sequentially (e.g., Steps 1–3 involve feature selection, and Step 5 involves rounding).
• The objective function that is optimized at each step differs from the performance metric of interest (i.e., the calibration error, which measures the reliability of risk estimates).
• Some steps optimize conflicting objective functions (e.g., backward elimination typically optimizes the AIC or BIC, while the final model is fit to optimize the logistic loss).
• Some steps do not fully optimize their own objective function (i.e., backward elimination does not return a globally optimal feature set).
• Some steps depend on free parameters that are set without validation (e.g., the threshold significance level of p < 0.20 used in Step 2).
• The final model is not trained using all the available training data. Here, Steps 4 and 5 use data from a “test cohort” that could have been used to improve the fit of the final model.
• The final model uses an empirical risk estimate for each score (i.e., the predicted risk for each score simply represents the fraction of patients in the test cohort with y_i = +1).
• The coefficients of the final model are set to +1 and not optimized (that is, the model uses unit weights).
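A loose numerical analogue of such a screening-then-unit-weights pipeline can be written in a few lines. Everything here, including the synthetic cohort, the correlation-based screen standing in for univariate logistic regression, and the 0.10 threshold, is an illustrative assumption, not the actual TIMI procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical cohort: 500 patients, 12 dichotomous candidate predictors
n, d = 500, 12
X = rng.integers(0, 2, size=(n, d))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, n) > 1.5).astype(int)

# Steps 1-2 analogue: univariate screening (correlation magnitude in place
# of a univariate logistic regression p-value cutoff)
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
keep = np.abs(corr) >= 0.10

# Step 5 analogue: the score is a simple arithmetic sum of retained variables
score = X[:, keep].sum(axis=1)

# empirical risk estimate per score value, as in Step 4
for s in np.unique(score):
    print(s, round(y[score == s].mean(), 3))
```

Every design decision here (the screen, the threshold, the unit weights) is made greedily and in sequence, which is precisely the structure the bullet points above criticize.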
Appendix D. Details on Computational Experiments
In this appendix, we provide additional details on the computational experiments in Sections 3.3 and 4.
D.1 Simulation Procedure for Synthetic Datasets
We ran the computational experiments in Sections 3.3 and 4 using a collection of synthetic datasets that we generated from the breastcancer dataset of Mangasarian et al. (1995), which contains n = 683 samples and d = 9 features x_ij ∈ {0, . . . , 10}. This choice was based on the fact that the breastcancer dataset produces a RiskSlimMINLP instance that can be solved with many algorithms, and that the data have been extensively studied in the literature and can be obtained from the UCI ML repository (Bache and Lichman, 2013).

We show the simulation procedure in Algorithm 7. Given the original dataset, this procedure generates a collection of nested synthetic datasets in two steps. First, it generates the largest dataset (with n_max = 5 × 10⁶ and d_max = 30) by replicating features and samples from the original dataset and adding normally distributed noise. Second, it produces smaller datasets by taking nested subsets of the samples and features. This ensures that a synthetic dataset with d features and n samples contains the same features and examples as a smaller synthetic dataset (i.e., with d′ < d features and n′ < n samples).

The resulting collection of synthetic datasets has two properties that are useful for benchmarking the performance of algorithms for RiskSlimMINLP:

1. They produce difficult instances of the risk score problem: the RiskSlimMINLP instances for synthetic datasets with large d cannot always be solved to optimality within the time limit.

2. The nested structure lets us bound the optimal values of RiskSlimMINLP instances that we may not be able to solve. Say, for example, that we could not solve an instance of the risk score problem for a synthetic dataset with d = 20 features, but could solve an instance for a smaller synthetic dataset with d = 10 features and the same number of samples. In this case, we know that the optimal value of the RiskSlimMINLP instance with d = 20 must be less than or equal to the optimal value of the RiskSlimMINLP instance with d = 10, because the d = 20 dataset contains all of the features of the d = 10 dataset.

Algorithm 7
Simulation Procedure to Generate Nested Synthetic Datasets
Input:
  X_original ← [x_ij], i = 1...n_original, j = 1...d_original  (feature matrix of original dataset)
  Y_original ← [y_i], i = 1...n_original  (label vector of original dataset)
  d_1, . . . , d_max  (dimensions for synthetic datasets, in increasing order)
  n_1, . . . , n_max  (sample sizes for synthetic datasets, in increasing order)

Initialize:
  J_original ← [1, . . . , d_original]  (index array for original features)
  J_max ← []  (index array of features for largest synthetic dataset)
  m_full ← ⌊d_max / d_original⌋
  m_remainder ← d_max − m_full · d_original

Step I: Generate Largest Dataset
  for m = 1, . . . , m_full do
    J_max ← [J_max, RandomPermute(J_original)]
  end for
  J_max ← [J_max, RandomSampleWithoutReplacement(J_original, m_remainder)]
  for i = 1, . . . , n_max do
    sample l with replacement from 1, . . . , n_original
    y_max_i ← y_l
    for j = 1, . . . , d_max do
      k ← J_max[j]
      sample ε from a normal distribution with mean 0
      x_max_ij ← ⌊x_lk + ε⌉  (new features are noisy versions of original features)
      x_max_ij ← min(10, max(0, x_max_ij))  (new features have the same bounds as the original features)
    end for
  end for
  X_max ← [x_max_ij], i = 1...n_max, j = 1...d_max
  Y_max ← [y_max_i], i = 1...n_max

Step II: Generate Smaller Datasets
  for d = d_1, . . . , d_max do
    for n = n_1, . . . , n_max do
      X^(n,d) ← X_max[1:n, 1:d]
      Y^n ← Y_max[1:n]
    end for
  end for

Output: synthetic datasets (X^(n,d), Y^n) for all n = n_1, . . . , n_max and d = d_1, . . . , d_max.

D.2 Setup on the Performance Comparison in Section 3.3
We consider an instance of RiskSlimMINLP where C = 10⁻⁸, the intercept λ₀ is restricted to a bounded set of consecutive integers, and each coefficient λ_j for j = 1, . . . , d is restricted to a small set of consecutive integers. We solved this instance on a 3.33 GHz Intel Xeon CPU with 16GB of RAM for up to 6 hours using the following algorithms:

(i) CPA, as described in Algorithm 1;
(ii) LCPA, as described in Algorithm 2;
(iii) ActiveSetMINLP, an active set MINLP algorithm;
(iv) InteriorMINLP, an interior point MINLP algorithm;
(v) InteriorCGMINLP, an interior point MINLP algorithm where the primal-dual KKT system is solved with a conjugate gradient method.

All MINLP algorithms were implemented in a state-of-the-art commercial solver (i.e., Artelsys Knitro 9.0, which is an updated version of the solver described in Byrd et al. 2006). If an algorithm did not return a certifiably optimal solution for a particular instance within the 6-hour time limit, we report results for the best feasible solution. Both CPA and LCPA were implemented using CPLEX 12.6.3.
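To see what CPA-style cutting planes do, consider a toy one-dimensional version in which brute-force enumeration over a small integer coefficient set stands in for the MIP solver. The data, coefficient range, and stopping tolerance are illustrative assumptions; this is a sketch of the idea rather than the paper's implementation:

```python
import numpy as np

# toy data: one feature, labels in {-1, +1}
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def loss(lam):
    """Normalized logistic loss l(lambda) = (1/n) sum log(1 + exp(-y x lambda))."""
    return np.mean(np.log1p(np.exp(-y * X * lam)))

def grad(lam):
    return np.mean(-y * X / (1.0 + np.exp(y * X * lam)))

grid = np.arange(-5, 6)   # integer coefficient set {-5, ..., 5}
cuts = []                 # tangent cuts as (value, slope, point)
lam = 0                   # initial iterate

for _ in range(20):
    cuts.append((loss(lam), grad(lam), lam))
    # master problem: minimize the piecewise-linear lower bound over the grid
    lb = np.array([max(v + g * (t - p) for v, g, p in cuts) for t in grid])
    lam = grid[np.argmin(lb)]
    if loss(lam) - lb.min() < 1e-6:   # optimality gap is small: stop
        break

print("optimal integer coefficient:", lam)
```

Each iteration adds a tangent cut at the current iterate; the master problem minimizes the piecewise-linear lower bound over the feasible coefficients, and the loop stops once the gap between the true loss and the bound is small.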
D.3 Results for MINLP Algorithms
In Figure 31, we plot results for all MINLP algorithms that we benchmarked against CPA and LCPA: ActiveSetMINLP, InteriorMINLP, and InteriorCGMINLP. We reported results for ActiveSetMINLP in Section 3.3 because it solved the largest number of instances to optimality.
[Figure 31 reports three metrics for ActiveSetMINLP, InteriorMINLP, and InteriorCGMINLP on each synthetic dataset: (i) the time to train a good risk score, i.e., the time for an algorithm to find a solution whose loss is close to the optimal loss, which reflects the time to obtain a risk score with good calibration without a proof of optimality (binned as <1 min, <10 min, <1 hour, <6 hours, 6+ hours); (ii) the optimality gap of the best solution at termination, i.e., (V_max − V_min)/V_max, where V_max is the objective value of the best solution found at termination; a gap of 0.0% means an algorithm has found the optimal solution and provided a proof of optimality within 6 hours (binned as 0%, 0−20%, 20−50%, 50−90%, 90−100%); and (iii) the % of time spent on data-related computation, i.e., the proportion of total runtime that an algorithm spends computing the value, gradient, or Hessian of the loss function.]

Figure 31:
Performance of MINLP algorithms on difficult instances of
RiskSlimMINLP for synthetic datasets with varying dimensions d and sample sizes n. All algorithms perform similarly. We report results for ActiveSetMINLP in Section 3.3 because it solves the most instances to optimality.
Appendix E. Additional Experimental Results
[Table 9 reports, for each dataset and method, the metrics test cal, test auc, train cal, and train auc. Methods span Traditional Approaches (PLR▷Rd, PLR▷RsRd, PLR▷Unit) and Pooled Approaches (PooledRd, PooledRd*, PooledSeqRd*), as well as RiskSLIM. Datasets: bank (n = 41188, d = 57); census (n = 32561, d = 36); mammo (n = 961, d = 14); mushroom (n = 8124, d = 113); rearrest (n = 22530, d = 48); spambase (n = 4601, d = 57).]

Table 9:
Summary statistics of risk scores with integer coefficients λ_j ∈ {−5, . . . , 5} and model size ‖λ‖₀ ≤
5. Here: test cal and test auc are the 5-CV mean test CAL and AUC; train cal and train auc are the CAL and AUC of the final model fit using the entire dataset.
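The CAL and AUC metrics in Table 9 can be computed from predicted risks and binary labels. A minimal sketch, with the caveat that the decile binning and the 0/1 label encoding here are our assumptions (the paper defines CAL precisely in the main text):

```python
import numpy as np

def auc(scores, labels):
    # probability that a random positive outranks a random negative (ties count 1/2)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def cal(pred_risk, labels, n_bins=10):
    # mean absolute gap between predicted risk and observed event rate,
    # averaged over risk bins and weighted by bin size
    bins = np.minimum((pred_risk * n_bins).astype(int), n_bins - 1)
    err, n = 0.0, len(labels)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.sum() / n * abs(pred_risk[mask].mean() - labels[mask].mean())
    return err

risk = np.array([0.1, 0.2, 0.8, 0.9])
y = np.array([0, 0, 1, 1])
print(auc(risk, y))  # 1.0 (perfect ranking)
print(cal(risk, y))  # about 0.15 (each predicted risk is 0.1 or 0.2 off)
```

A model can have a high AUC (good ranking) and still a poor CAL (unreliable risk estimates), which is why the table reports both.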
Appendix F. Details and Results for Seizure Prediction
In this appendix, we provide supporting material for the seizure prediction application in Section 6. In Table 10, we list all input variables in the training dataset.
Input Variable Values Sign
Male  {0, 1}
Female  {0, 1}
PriorSeizure  {0, 1}  +
PosteriorDominantRhythmPresent  {0, 1}  −
BriefRhythmicDischarge  {0, 1}  +
NoReactivityToStimulation  {0, 1}
EpileptiformDischarges  {0, 1}  +
SecondaryDXIncludesMentalStatusFirst  {0, 1}
SecondaryDXIncludesCNSInfection  {0, 1}
SecondaryDXIncludesCNSInflammatoryDisease  {0, 1}
SecondaryDXIncludesCNSNeoplasm  {0, 1}
SecondaryDXIncludesHypoxisIschemicEncephalopathy  {0, 1}
SecondaryDXIncludesIntracerebralHemorrhage  {0, 1}
SecondaryDXIncludesIntraventricularHemorrhage  {0, 1}
SecondaryDXIncludesMetabolicEncephalopathy  {0, 1}
SecondaryDXIncludesIschemicStroke  {0, 1}
SecondaryDXIncludesSubarachnoidHemmorage  {0, 1}
SecondaryDXIncludesSubduralHematoma  {0, 1}
SecondaryDXIncludesTraumaticBrainInjury  {0, 1}
SecondaryDXIncludesHydrocephalus  {0, 1}
PatternIsStimulusInducedAny  {0, 1}
PatternIsStimulusInducedBiPD  {0, 1}
PatternIsStimulusInducedGPD  {0, 1}
PatternIsStimulusInducedGRDA  {0, 1}
PatternIsStimulusInducedLPD  {0, 1}
PatternIsStimulusInducedLRDA  {0, 1}
PatternIsSuperImposedAny  {0, 1}  +
PatternIsSuperImposedBiPD  {0, 1}  +
PatternIsSuperImposedGPD  {0, 1}  +
PatternIsSuperImposedGRDA  {0, 1}  +
PatternIsSuperImposedLPD  {0, 1}  +
PatternIsSuperImposedLRDA  {0, 1}  +
PatternsInclude BiPD  {0, 1}  +
PatternsInclude GPD  {0, 1}  +
PatternsInclude GRDA  {0, 1}  +
PatternsInclude LPD  {0, 1}  +
PatternsInclude LRDA  {0, 1}  +
PatternsInclude GRDA or GPD  {0, 1}  +
PatternsInclude BiPD or LRDA or LPD  {0, 1}  +
MaxFrequencyAnyPattern  discrete frequencies (Hz)  +
MaxFrequencyBiPD  discrete frequencies (Hz)  +
MaxFrequencyGPD  discrete frequencies (Hz)  +
MaxFrequencyLPD  discrete frequencies (Hz)  +
MaxFrequencyLRDA  discrete frequencies (Hz)  +
MaxFrequencyAnyPattern = 0.0 Hz  {0, 1}
MaxFrequencyAnyPattern ≥ t Hz (six thresholds t)  {0, 1}  +
MaxFrequencyBiPD = 0.0 Hz  {0, 1}
MaxFrequencyBiPD ≥ t Hz (six thresholds t)  {0, 1}  +
MaxFrequencyGPD = 0.0 Hz  {0, 1}
MaxFrequencyGPD ≥ t Hz (six thresholds t)  {0, 1}  +
MaxFrequencyGRDA = 0.0 Hz  {0, 1}
MaxFrequencyGRDA ≥ t Hz (six thresholds t)  {0, 1}  +
MaxFrequencyLPD = 0.0 Hz  {0, 1}
MaxFrequencyLPD ≥ t Hz (six thresholds t)  {0, 1}  +
MaxFrequencyLRDA = 0.0 Hz  {0, 1}
MaxFrequencyLRDA ≥ t Hz (six thresholds t)  {0, 1}  +
Names, values, and sign constraints for input variables in the seizure dataset.
F.1 Operational Constraints

No Redundant Categorical Variables

1. Use either Male or Female.
2. Use either PatternsInclude GRDA or GPD or any one of (PatternsInclude GRDA, PatternsInclude GPD).
3. Use either PatternsInclude BiPD or LRDA or LPD or any one of (PatternsInclude BiPD, PatternsInclude LRDA, PatternsInclude LPD).
4. Use either MaxFrequencyAnyPattern = 0.0 or MaxFrequencyAnyPattern ≥ t, for its lowest threshold t.
5. Use either MaxFrequencyLPD = 0.0 or MaxFrequencyLPD ≥ t, for its lowest threshold t.
6. Use either MaxFrequencyGPD = 0.0 or MaxFrequencyGPD ≥ t, for its lowest threshold t.
7. Use either MaxFrequencyGRDA = 0.0 or MaxFrequencyGRDA ≥ t, for its lowest threshold t.
8. Use either MaxFrequencyBiPD = 0.0 or MaxFrequencyBiPD ≥ t, for its lowest threshold t.
9. Use either MaxFrequencyLRDA = 0.0 or MaxFrequencyLRDA ≥ t, for its lowest threshold t.

Frequency in Continuous Encoding or Thresholded Encoding

10. Choose between MaxFrequencyAnyPattern or its thresholded versions (MaxFrequencyAnyPattern = 0.0, . . . , MaxFrequencyAnyPattern ≥ t for each threshold t).
11. Choose between MaxFrequencyGPD or its thresholded versions (MaxFrequencyGPD = 0.0, . . . , MaxFrequencyGPD ≥ t for each threshold t).
12. Choose between MaxFrequencyLPD or its thresholded versions (MaxFrequencyLPD = 0.0, . . . , MaxFrequencyLPD ≥ t for each threshold t).
13. Choose between MaxFrequencyGRDA or its thresholded versions (MaxFrequencyGRDA = 0.0, . . . , MaxFrequencyGRDA ≥ t for each threshold t).
14. Choose between MaxFrequencyBiPD or its thresholded versions (MaxFrequencyBiPD = 0.0, . . . , MaxFrequencyBiPD ≥ t for each threshold t).
15. Choose between MaxFrequencyLRDA or its thresholded versions (MaxFrequencyLRDA = 0.0, . . . , MaxFrequencyLRDA ≥ t for each threshold t).

Limited Number of Thresholds

16. Use at most 2 of: MaxFrequencyAnyPattern = 0.0, MaxFrequencyAnyPattern ≥ t, . . . , over all thresholds t.
17. Use at most 2 of: MaxFrequencyLPD = 0.0, MaxFrequencyLPD ≥ t, . . . , over all thresholds t.
18. Use at most 2 of: MaxFrequencyGPD = 0.0, MaxFrequencyGPD ≥ t, . . . , over all thresholds t.
19. Use at most 2 of: MaxFrequencyGRDA = 0.0, MaxFrequencyGRDA ≥ t, . . . , over all thresholds t.
20. Use at most 2 of: MaxFrequencyBiPD = 0.0, MaxFrequencyBiPD ≥ t, . . . , over all thresholds t.
21. Use at most 2 of: MaxFrequencyLRDA = 0.0, MaxFrequencyLRDA ≥ t, . . . , over all thresholds t.

Specific cEEG Patterns or Any cEEG Pattern

22. Use either PatternIsStimulusInducedAny or any of (PatternIsStimulusInducedBiPD, PatternIsStimulusInducedGRDA, PatternIsStimulusInducedGPD, PatternIsStimulusInducedLPD, PatternIsStimulusInducedLRDA).
23. Use either PatternIsSuperImposedAny or any of (PatternIsSuperImposedBiPD, PatternIsSuperImposedGPD, PatternIsSuperImposedGRDA, PatternIsSuperImposedLPD, PatternIsSuperImposedLRDA).
24. Use either MaxFrequencyAnyPattern (or its thresholded versions) or any of