Classification using Ensemble Learning under Weighted Misclassification Loss
Yizhen Xu, Tao Liu, Michael J. Daniels, Rami Kantor, Ann Mwangi, Joseph W. Hogan
1. Department of Biostatistics, Brown University, 121 S. Main Street, Providence, RI, U.S.A.
2. Department of Statistics and Data Sciences, University of Texas at Austin, Austin, TX, U.S.A.
3. Division of Infectious Diseases, Brown University, Providence, RI, U.S.A.
4. Academic Model Providing Access to Healthcare (AMPATH), Eldoret, Kenya
5. College of Health Sciences, School of Medicine, Moi University, Eldoret, Kenya
* [email protected]
* This is part of the thesis work of the first author.
Abstract
Binary classification rules based on covariates typically depend on simple loss functions such as zero-one misclassification. Some cases may require more complex loss functions. For example, individual-level monitoring of HIV-infected individuals on antiretroviral therapy (ART) requires periodic assessment of treatment failure, defined as having a viral load (VL) value above a certain threshold. In some resource-limited settings, VL tests may be limited by cost or technology, and diagnoses are based on other clinical markers. Depending on the scenario, a higher premium may be placed on avoiding false positives, which bring greater cost and reduced treatment options. Here, the optimal rule is determined by minimizing a weighted misclassification loss/risk. We propose a method for finding and cross-validating optimal binary classification rules under weighted misclassification loss. We focus on rules comprising a prediction score and an associated threshold, where the score is derived using an ensemble learner. Simulations and examples show that our method, which derives the score and threshold jointly, more accurately estimates overall risk and has better operating characteristics than methods that derive the score first and the cutoff conditionally on the score, especially in finite samples.
This work was supported by NIH (Grant NO: R01 AI108441, P30 AI042853, K24 AI134359, R01 AI066922).

1. Introduction
Development of accurate binary classification rules is important in many areas of biomedical research and clinical practice. The problem can be described as follows: suppose each individual in a population of interest has a binary outcome Y and a p-dimensional covariate vector X = (X_1, . . . , X_p) taking values in X. We would like to find a classification rule Q : X → {0, 1} that takes X ∈ X as input and generates a classification action a ∈ {0, 1} as output. The criterion for classification accuracy is given in terms of a loss function L(Y, a); for example, L(Y, a) = 1{a ≠ Y} is simple misclassification loss. For a given loss criterion, the optimal rule is the one that minimizes expected loss over the population of interest.

Loosely speaking, methods for binary and categorical classification can be divided into those that perform classification directly and those that use a score-and-threshold approach. Direct classification methods include K-nearest-neighbors (KNN) (Cover & Hart, 1967; Duda, Hart, et al., 1973) and tree-based techniques (Breiman, Friedman, Stone, & Olshen, 1984). In score-based methods, classification is carried out by comparing a scalar, real-valued risk score Ψ(X) to a threshold c, yielding rules of the form Q(X; Ψ(·), c) = 1{Ψ(X) ≥ c}. The risk score Ψ(X), a function mapping X to IR, can represent the class membership probability P(Y = 1 | X) or can be a more general ranking measure, i.e., one where Ψ(X_1) > Ψ(X_2) implies that Y_1 is stochastically greater than Y_2. Moreover, the risk score can be derived using a single model, such as a regression model, or using an ensemble method that combines scores from multiple models (van der Laan, Polley, & Hubbard, 2007).

Score-based methods are often used in practice, especially in medical applications. Examples include the VACS score (Tate et al., 2013) for risk prediction of all-cause mortality among HIV-infected individuals and the Nottingham prognostic index (Haybittle et al., 1982) for breast cancer. The main motivation here is to build a risk score for classification prediction of virological failure, which can be applied to various stages of individual-level HIV monitoring such as viral load (VL) pooled testing. This is important because VL testing is limited in some resource-limited settings (RLS), and accurate classification based on common clinical markers can reduce cost and mortality.

A classification rule is optimal if it minimizes average loss over a population of interest. The process of deriving optimal classification rules and estimating their operating characteristics relies on a training step and a testing step. In the training step, optimal classification rules are estimated; in the testing step, out-of-sample error rates associated with the rules are calculated. Finding optimal score-based classification rules requires estimation of both the score function and the associated threshold. In this paper, we focus on score-based classification methods under weighted misclassification loss, where the score Ψ(X) is derived using the Super Learner ensemble (van der Laan et al., 2007). In Super Learner (SL), a library of K learners generates risk scores Ψ_1(X), . . . , Ψ_K(X), which are then combined into a single score using the convex combination Ψ(X; α) = Σ_{k=1}^K α_k Ψ_k(X), where α = (α_1, . . . , α_K), α_k ≥ 0, and Σ_k α_k = 1. The α_k's are estimated based on cross-validated library learner predictions, typically using squared-error, likelihood-based or ROC-based loss.
van der Laan et al. (2007) showed through applications that SL outperforms the individual candidate learners in a given library.

A conditional thresholding approach is to first derive a risk score Ψ̂(X) = Ψ(X; α̂) and, conditionally on Ψ̂(X), find the ĉ that minimizes the misclassification loss. In this paper, we use the term conditional thresholding for any derivation of the decision threshold that conditions directly on the risk estimated from the same training data. Because it is intuitively straightforward, conditional thresholding has been used in various studies. Birkner et al. (2007) used conditional thresholding in predicting 30-day mortality following stroke in a rural Indian population; from a set of candidate learners, they selected the optimal model as the estimator with the smallest cross-validated risk, and chose the cut-off value from the estimated score on the training data to be the smallest value achieving at least 90% sensitivity. Kruppa, Ziegler, and König (2012) derived a classification rule using conditional thresholding for a GWA study on rheumatoid arthritis; several risk scores were estimated on the training data and the thresholds for binary classification were selected by maximizing the Youden index for each risk score in the training data. Shim et al. (2018) developed a risk score model to identify patients at high risk of negative effects after total knee arthroplasty, and derived the optimal threshold for binary classification directly from the estimated score based on Youden's index, which is also a conditional thresholding derivation. We show using simulation studies and empirical examples that this approach may lead to over-estimation of misclassification error and yield classification rules that are sub-optimal.

As an alternative we propose joint thresholding, which combines the estimation of the coefficients α and the threshold c in one step to gain better overall performance in terms of out-of-sample prediction risk. It uses information on out-of-sample classification through one cross-validation procedure, and naturally integrates threshold estimation into the SL framework. Simulation and application results under weighted misclassification loss show that our simultaneous estimation approach generally has lower out-of-sample risk than conditional thresholding. In the data applications, the threshold estimated using our method is approximately the optimal Bayes threshold when the ensemble score is close to the true class probability, as would be expected under weighted misclassification loss (Buja, Stuetzle, & Shen, 2005).

Our approach involves minimization of an empirical weighted misclassification risk function. The minimization can be difficult because the weighted misclassification risk is composed of step functions that count the misclassifications, making it a nonconvex, nonsmooth and NP-hard optimization problem (Chen & Mangasarian, 1996). Conditional on a risk score, estimating c by minimizing the misclassification risk is relatively easy and can be accomplished by a line search. In Section 4, we demonstrate simultaneous estimation of the ensemble weights and threshold using bounded controlled random search (Kaelo & Ali, 2006). This method gives an approximation to the minimizer of the empirical weighted misclassification risk and is easy to implement.

The paper is structured as follows. Section 2 provides motivating examples and defines weighted misclassification loss. Section 3 describes ensemble learning and the conditional thresholding approach. Section 4 explains our proposal of joint thresholding, discusses related challenges and existing methods for minimization of the weighted misclassification loss/risk, and introduces an approach to joint thresholding by controlled random search. Simulations and data applications are described in Sections 5 and 6 respectively. Section 7 gives conclusions and suggestions for further work.
2. Weighted Misclassification Loss

Total misclassification loss is often the primary criterion for classification, where penalties for false positives and false negatives are the same. However, there are many circumstances where false positive and false negative classifications have different consequences and should be weighted differently.

An instance of this is individual-level HIV treatment monitoring in RLS. Failure of antiretroviral treatment (ART) is indicated by a VL value above a certain threshold. In RLS, VL assessment may be limited by logistics, cost and technology (Wang, Xu, & Demirci, 2010); hence, other clinical markers are used at times to predict viral failure. In this situation, false positive classification of VL failure leads to early switching to a second- or third-line ART, which may have higher toxicity and lower adherence, incurs significantly greater cost, and limits treatment options over the long term. False negative classification of viral failure results in lack of regimen switch, risk of drug resistance accumulation, and increased morbidity and mortality. These considerations may motivate prioritization of avoiding either false negative or false positive classifications. For example, if avoiding unnecessary switches to second-line therapy is prioritized, the loss function would assign greater weight to false positive misclassifications.

Another example is initial diagnosis of breast cancer malignancy from digitized images. Breast cancer is one of the largest causes of cancer deaths among women (Siegel, Naishadham, & Jemal, 2012). Patient survival depends largely on early detection and accurate diagnosis. Fine needle aspiration (FNA) (Wolberg, Street, & Mangasarian, 1994) is a minimally invasive procedure that allows measurement of individual cellular characteristics, allowing algorithm-based cell image analysis for diagnosis. FNA is usually carried out when there is a breast lump previously detected by self-examination or mammography. Positive findings may lead to confirmation of breast cancer through surgical biopsy, which is accurate but requires significantly more recovery time and expense, involves pain, and carries risks associated with surgical procedures such as scarring and infection. It may be of interest to place a premium on avoiding false positives over false negatives depending on available tools and expertise, for example, when pathology requirements are not met or availability of anesthesia is limited in RLS (Saghir et al., 2011; Shyyan et al., 2006).

Our goal is to develop classification rules of the form Q(X) = Q(X; Ψ(·), c) = 1{Ψ(X) ≥ c} based on the weighted misclassification loss

   L_λ(Y, Q(X)) = λ 1{Q(X) = 0, Y = 1} + (1 − λ) 1{Q(X) = 1, Y = 0}
                = λ 1{Ψ(X) < c, Y = 1} + (1 − λ) 1{Ψ(X) ≥ c, Y = 0},

where λ ∈ (0, 1) is a user-defined weight. The loss function gives weight λ to false negative classification (misclassifying Y = 1 as 0) and 1 − λ to false positive classification (misclassifying Y = 0 as 1). At λ = 0.5 the two types of error are penalized equally; the choice of λ depends on the specific application and goals of classification.

Weighted misclassification risk is the expected loss over the underlying joint distribution of X and Y,

   R_λ(Ψ, c) = E_{X,Y}{L_λ(Y, Q(X))}
             = λ p P{Ψ(X) < c | Y = 1} + (1 − λ)(1 − p) P{Ψ(X) ≥ c | Y = 0}
             = λ p FNR(Ψ, c) + (1 − λ)(1 − p) FPR(Ψ, c),                                  (1)

where p = P(Y = 1) is the prevalence, FNR(Ψ, c) = P{Ψ(X) < c | Y = 1} the false negative rate, and FPR(Ψ, c) = P{Ψ(X) ≥ c | Y = 0} the false positive rate.

The objective of the inference problem is to find Ψ(·) and c that minimize the risk. Given a sample of data (X_1, Y_1), . . . , (X_n, Y_n), this can be operationalized as finding Ψ(·) and c that minimize the empirical risk function

   R̂_λ(Y, X; Ψ, c) = (1/n) Σ_{i=1}^n [ λ 1{Ψ(X_i) < c, Y_i = 1} + (1 − λ) 1{Ψ(X_i) ≥ c, Y_i = 0} ].     (2)
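To make the estimand concrete, the following base-R sketch evaluates the empirical risk (2) for a given score vector and threshold. The function and argument names (wrisk, score, c0, lambda) are ours, introduced only for illustration, and the example data are simulated.

# Empirical weighted misclassification risk in (2); a minimal sketch in base R.
# `score` holds the risk scores Psi(X_i), `y` the 0/1 outcomes, `c0` the
# threshold, and `lambda` the false-negative weight.
wrisk <- function(score, y, c0, lambda) {
  fn <- (score <  c0) & (y == 1)   # false negatives
  fp <- (score >= c0) & (y == 0)   # false positives
  mean(lambda * fn + (1 - lambda) * fp)
}

# Example: risk of thresholding a toy score at 0.5 with lambda = 0.2
set.seed(1)
y     <- rbinom(100, 1, 0.3)
score <- runif(100)
wrisk(score, y, c0 = 0.5, lambda = 0.2)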
3. Ensemble Learning and Conditional Thresholding

Super Learner is an ensemble learner that combines predictions produced by candidate learners from a user-defined library L of K learners. Our definition of ensemble refers to learners that combine different base algorithms; for example, we consider bagging and boosting to be single learners, since each generates one strong learner by using a single base algorithm to obtain a collection of weak learners, through bootstrap aggregation and sample re-weighting respectively. As a kind of stacked generalization (Wolpert, 1992), SL combines different learning algorithms over the same data, and algorithms such as bagging, boosting, and random forest can all be included as candidate learners in an SL library.

Consider a library L = (Ψ_1, . . . , Ψ_K) where each candidate learner Ψ_k ∈ L is a mapping from X to IR, and the prediction score Ψ(X) = Σ_{k=1}^K α_k Ψ_k(X) is a convex combination of predictions from the library of learners. Deriving an optimal classification rule requires finding the value of (α, c) that minimizes (2).

An intuitive approach to determining the rule is conditional thresholding; that is, first derive a prediction score Ψ̂(X) = Σ_{k=1}^K α̂_k Ψ̂_k(X) using SL, and then identify an optimal threshold value ĉ based on the score Ψ̂(X). In SL, the α coefficients are derived by cross validation against a loss function such as squared-error loss or an ROC-based loss. The conditional thresholding approach estimates the threshold c by plugging Ψ̂(X) into (2) and minimizing over c. However, the procedure does not reflect out-of-sample performance because it does not use the cross-validated weighted misclassification risk.

Conditional thresholding for classification based on SL proceeds as follows (a code sketch of the threshold search in Step 5 is given after the algorithm):

1. Fit each learner Ψ_k ∈ L to the entire data set {(Y_i, X_i)}_{i=1}^n and generate score predictions Ψ̂_k(X_i) for k = 1, . . . , K and i = 1, . . . , n.

2. Carry out D-fold cross validation. Partition the data into D parts indexed by d = 1, . . . , D. For the d-th partition, T(d) and V(d) are the training and validation data splits respectively. Fit each candidate learner Ψ_k to T(d), yielding Ψ̂_{k,T(d)}, and generate its prediction on V(d), written as Ψ̂_{k,T(d)}(X_{V(d)}), for k = 1, . . . , K and d = 1, . . . , D.

3. For each candidate learner Ψ_k, stack together the D fold-specific predictions Ψ̂_{k,T(d)}(X_{V(d)}), d = 1, . . . , D, to get Z_k = {Ψ̂_{k,T(d)}(X_{V(d)}), d = 1, . . . , D}, an n × 1 vector. For the i-th observation, define d_i = d(X_i) as its validation fold index, i.e., d_i = 2 if X_i ∈ X_{V(2)}. Write the n × K cross-validated prediction matrix as Z = {Z_1, . . . , Z_K}, where the element in the i-th row and k-th column, Z_{ik} = Ψ̂_{k,T(d_i)}(X_i), is an out-of-sample prediction made by the k-th candidate learner for the i-th observation, a member of validation fold V(d_i). Estimate α in m(Z; α) = Σ_{k=1}^K α_k Z_k by minimizing a risk E{L̃(Y, m(Z; α))} for a convex loss L̃; e.g., ordinary linear regression corresponds to quadratic loss with E(Y | Z; α) = m(Z; α).

4. Combine α̂ from Step 3 with the full-data predictions Ψ̂_k(X_i), k = 1, . . . , K, from Step 1, and obtain the SL score Ψ̂_SL(X_i; α̂) = Σ_{k=1}^K α̂_k Ψ̂_k(X_i).
5. Estimate the classification threshold c by minimizing the empirical risk function (2) using the SL score as the risk score,

   ĉ = argmin_c Σ_{i=1}^n [ λ 1{Ψ̂_SL(X_i; α̂) < c, Y_i = 1} + (1 − λ) 1{Ψ̂_SL(X_i; α̂) ≥ c, Y_i = 0} ].

6. For any x ∈ X, the classification rule is Q̂(x) = Q(x; Ψ̂_SL(·; α̂), ĉ) = 1{Ψ̂_SL(x; α̂) ≥ ĉ}.
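Because the empirical risk in Step 5 is piecewise constant in c, its minimum can be found by evaluating the risk at the observed score values. The sketch below (our own helper, not part of any package) implements this line search, reusing the wrisk() function and toy data from the sketch in Section 2.

# Line search for the threshold in Step 5: the empirical risk only changes at
# observed score values, so it suffices to evaluate candidate cutoffs there.
best_cut <- function(score, y, lambda) {
  cand  <- sort(unique(score))                                    # candidate thresholds
  risks <- vapply(cand, function(c0) wrisk(score, y, c0, lambda), numeric(1))
  cand[which.min(risks)]
}

# Example: conditional thresholding of an already-fitted score (here the toy
# `score` from the earlier sketch plays the role of the SL prediction)
c_hat <- best_cut(score, y, lambda = 0.2)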
4. Joint Thresholding

As we will show in empirical examples and simulations, conditional thresholding may over-estimate the actual risk. We propose instead to estimate the classification threshold c based on the cross-validated predictions Z (defined in Step 3 of Section 3) within the SL algorithm. In our approach, Steps 1 and 2 are the same as above. Steps 3 and 5 are replaced by simultaneous estimation of α and c satisfying

   (α̃, c̃) = argmin_{(α, c)} Σ_{i=1}^n [ λ 1{m(Z_i; α) < c, Y_i = 1} + (1 − λ) 1{m(Z_i; α) ≥ c, Y_i = 0} ].     (3)

The SL score in Step 4 and the classification rule in Step 6 are then updated to Ψ̂_SL(X_i; α̃) = Σ_{k=1}^K α̃_k Ψ̂_k(X_i) and Q̂(x) = Q(x; Ψ̂_SL(·; α̃), c̃) = 1{Ψ̂_SL(x; α̃) ≥ c̃} accordingly.

Optimizing the empirical risk in equation (3) is complicated by the discontinuities introduced by the indicator functions. Common optimization methods such as Newton-Raphson cannot be applied because they require the existence of first- or second-order derivatives. The lack of smoothness and convexity makes other optimization methods difficult as well. Moreover, the minimizer of the objective function is not unique, due to the non-convexity of the weighted misclassification loss.

According to the Bayes rule, the optimal threshold for (3) is 1 − λ when m(Z; α) = P(Y = 1 | X). In practice, the threshold c = 1 − λ is valid only when the risk score is a consistent estimate of P(Y = 1 | X) and the sample size is sufficiently large. However, the true underlying data-generating mechanism is very complicated in most applications. Furthermore, whether or not the ensemble score is a probability estimate depends on the investigators' intention and study goal, and a good risk score for classification does not have to be a probability estimate (e.g., SVM).
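To see why the Bayes threshold equals 1 − λ, write p(X) = P(Y = 1 | X) and compare the conditional expected losses of the two actions: classifying X as positive costs (1 − λ){1 − p(X)}, while classifying it as negative costs λ p(X). The positive action is preferred whenever

   (1 − λ){1 − p(X)} ≤ λ p(X)   ⟺   1 − λ ≤ p(X){λ + (1 − λ)} = p(X),

so the optimal rule thresholds the true class probability at 1 − λ.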
One approach to minimizing the empirical weighted misclassification risk is to approximate the weighted misclassification loss with a smooth, tractable loss function. Buja et al. (2005) used integrals of beta distribution functions to approximate the indicator functions in the loss, in principle enabling the use of standard optimization algorithms. However, in practice the nonconvexity of the smooth approximation can undermine the invertibility of the Hessian in Newton updates and cause the optimization procedure to fail.

Another approach is to reformulate the minimization problem as a linear program with equilibrium constraints (LPEC), a special case of hierarchical mathematical programming that consists of two levels of optimization. Mangasarian (1994) studied total misclassification loss and suggested a Frank-Wolfe type iterative algorithm that approximates the minimum by moving the solution towards the minimum of a linear approximation of the objective function over the same domain. Chen and Mangasarian (1996) proposed a hybrid algorithm as an accelerated approximation to the algorithm in Mangasarian (1994), which is computationally costly. The hybrid algorithm iteratively estimates α by replacing the indicator function 1(x > 0) with the convex surrogate max(1 + x, 0) at a fixed c, and estimates c by minimizing the objective function at a fixed α. Solving an LPEC is costly and computationally intensive because the minimization problem is NP-hard (Chen & Mangasarian, 1996).

We consider two options for solving equation (3) within SL. One is to approximate the solution in two separate steps: (1) use non-negative least squares linear regression to estimate α̃ and then normalize it to sum to one, and (2) conduct a line search to estimate c̃ conditional on the estimated α̃. We refer to this procedure as Two-Step Minimization in our simulations and data applications. It can be extended further by using any convex and continuous surrogate L̃(Y, m(Z; α)) for the estimation of α̃. The process can be described as follows (a code sketch appears after the two steps):

(3a) Estimate α̃ in m(Z; α̃) = Σ_{k=1}^K α̃_k Z_k by argmin_α E{L̃(Y, m(Z; α))}; e.g., if L̃ is squared-error loss, we use ordinary least squares regression of Y on Z.

(3b) Estimate c̃ by conditional minimization using the cross-validated predictions Z,

   c̃ = argmin_c Σ_{i=1}^n [ λ 1{m(Z_i; α̃) < c, Y_i = 1} + (1 − λ) 1{m(Z_i; α̃) ≥ c, Y_i = 0} ].

When the surrogate loss L̃ in Step 3a involves the threshold c, an iterative procedure similar to Chen and Mangasarian (1996) can be used to estimate (α̃, c̃). This two-step procedure provides flexibility in that the user can choose the surrogate loss L̃ based on context. If minimizing the surrogate loss produces a risk score that discriminates the data well, the resulting classification rule will be a good approximation to a minimizer of the weighted misclassification risk.
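A minimal sketch of Two-Step Minimization with a squared-error surrogate is given below. It uses nnls::nnls() for the non-negative least squares fit in Step 3a and reuses best_cut() and wrisk() from the earlier sketches; the helper name two_step and the toy prediction matrix Z are ours, not code from the paper's appendix.

# Two-Step Minimization (sketch): (3a) non-negative least squares for the
# ensemble weights, normalized to sum to one; (3b) line search for the cutoff.
library(nnls)

two_step <- function(Z, y, lambda) {
  fit   <- nnls::nnls(as.matrix(Z), y)        # Step 3a: alpha >= 0
  alpha <- fit$x
  alpha <- alpha / sum(alpha)                 # normalize to sum to one
  sc    <- as.vector(as.matrix(Z) %*% alpha)  # combined cross-validated score
  list(alpha = alpha, c = best_cut(sc, y, lambda))   # Step 3b
}

# Example with a toy 2-column cross-validated prediction matrix
Z <- cbind(score, pmin(pmax(score + rnorm(100, 0, 0.1), 0), 1))
two_step(Z, y, lambda = 0.2)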
The second option is to estimate α and c by bounded-region optimization, performing a controlled random search over the bounded region. First, note that the inequality Σ_{k=1}^K Ψ_k(·) α_k > c can be written as Σ_{k=1}^K Ψ_k(·) α*_k > c*, with α*_k = α_k/α_1 and c* = c/α_1 when α_1 > 0. In our analysis, the coefficients (α_1, . . . , α_K) are constrained to be nonnegative and normalized to sum to one, and without loss of generality α_1 is designated as the coefficient with the largest estimated value from the initialization described in Step 1 below.

Assume that the coefficient estimates from an initialization based on some convex loss function are close to the solutions that minimize the weighted misclassification risk. Then, with the initialized value of α* in [0, 1]^K, we can estimate α* and c* by searching for the optima in an enlarged bounded region. One way to search for the optima is to randomly generate a large, user-specified number of initial points in the bounded region and run a controlled random search (using the crs2lm function in the optimization package nloptr (Johnson, 2014) in R). We recommend controlled random search because it does not rely on analytic properties of the objective function for global optimization, hence avoiding the issues caused by nonconvexity and nonsmoothness. Other direct search methods are also applicable here, such as the simplex algorithm (Nelder & Mead, 1965) and differential evolution (Storn & Price, 1997). This procedure is referred to as CRS Minimization in Sections 5 and 6, and proceeds as follows (a code sketch follows the three steps):

1. Initialize the controlled random search at (α̂*(0), ĉ*(0)), calculated as follows. Obtain α̂ = (α̂_1, . . . , α̂_K)^T by regressing Y on Z (defined in Step 3 of Section 3) under squared-error loss with a nonnegativity constraint. Locate the estimated coefficient with the largest value; without loss of generality, assume α̂_1 = max_{k ∈ {1,...,K}} α̂_k. Define α̂*(0) = (1, α̂_2/α̂_1, . . . , α̂_K/α̂_1)^T and estimate ĉ*(0) = argmin_c Σ_{i=1}^n [ λ 1{Y_i = 1, Z_i α̂*(0) < c} + (1 − λ) 1{Y_i = 0, Z_i α̂*(0) ≥ c} ] by line search.

2. Apply the controlled random search on an enlarged nonnegative bounded region; in the following sections the enlarged region is chosen empirically based on the magnitude of the α̂*'s. Obtain α̂* = (α̂*_1, . . . , α̂*_K) and ĉ* from the controlled random search as estimates of a minimizer of equation (3).

3. Normalize both the coefficient and threshold estimates by Σ_{k=1}^K α̂*_k, so that the coefficient estimates sum to one.

The controlled random search does not give an unbiased or efficient estimate of the class probability, even if the resulting scores are scaled to lie between zero and one. The estimated scores need not have a probabilistic interpretation and the solutions may not be unique, but they are all valid approximations in terms of minimizing the weighted misclassification loss.
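The following sketch carries out the three steps with nloptr's crs2lm(), reusing wrisk(), best_cut() and two_step() from the earlier sketches. The packing of the parameter vector as (c*, α*), the width of the search region (upper) and the evaluation budget (maxeval) are illustrative choices of ours, not the settings used in our analysis.

# CRS Minimization (sketch): jointly search over (alpha*, c*) with crs2lm().
library(nloptr)

crs_min <- function(Z, y, lambda, upper = 2) {
  Z <- as.matrix(Z); K <- ncol(Z)

  obj <- function(theta) {                       # empirical risk (3) at theta
    c0 <- theta[1]; alpha <- theta[-1]
    wrisk(as.vector(Z %*% alpha), y, c0, lambda)
  }

  init   <- two_step(Z, y, lambda)               # Step 1: initialize from NNLS fit
  alpha0 <- init$alpha / max(init$alpha)         # rescale so the largest weight is 1
  c0     <- best_cut(as.vector(Z %*% alpha0), y, lambda)

  sol  <- crs2lm(x0 = c(c0, alpha0), fn = obj,   # Step 2: controlled random search
                 lower = rep(0, K + 1), upper = rep(upper, K + 1),
                 maxeval = 20000)
  norm <- sum(sol$par[-1])                       # Step 3: normalize
  list(alpha = sol$par[-1] / norm, c = sol$par[1] / norm, risk = sol$value)
}

# Example on the toy data from the previous sketch
crs_min(Z, y, lambda = 0.2)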
For theoretical completeness, we show the asymptotic optimality of SL with joint thresholding under the weighted misclassification loss via the following theorem.

Theorem. Let S represent a random data split, stochastically independent of the observations, resulting in a training set and a test set of non-negligible size. For the weighted misclassification loss L_λ(Y, Q(X)) at a given λ ∈ (0, 1), and classifiers in Q = {Q_θ : θ ∈ Θ_n}, where θ = (α, c), Q_θ(x) = 1{Σ_{k=1}^K α_k Ψ_k(x) ≥ c}, and Θ_n is a bounded, discretized parameter space for θ, we have

   E_S R_λ(Q_θ̂(P_S^0)) ≤ E_S R_λ(Q_θ̃(P_S^0)) + O( log(n)/√n ),

where n is the sample size of the entire data set, and E_S R_λ(Q_θ̂(P_S^0)) and E_S R_λ(Q_θ̃(P_S^0)) are the risks, averaged over data splits, of SL with joint thresholding and of an oracle classifier, respectively.

The web appendix (A3) provides details of the notation and a proof of the theorem. The oracle classifier is the best classifier among all classifiers of the form Q_θ(x) estimated from the data, in the sense that it is closest in risk to the unparametrized true minimizer of the weighted misclassification risk. The theorem says that, under the weighted misclassification loss, SL with parameters estimated from equation (3) converges to the oracle in terms of average risk as the sample size grows to infinity.

5. Simulation Studies

The key objective of our simulation study is to compare our approach with the conditional thresholding approach in terms of out-of-sample weighted misclassification risk, both when the true model can be easily recovered and when it cannot. The simulations are set up so that the optimal classification rule is known and can be compared with the rules obtained from conditional and joint thresholding. We first simulate two datasets, D_1 and D_2, of equal size n. For weighted misclassification loss at a fixed λ ∈ (0, 1), we fit the methods on D_1 and use them to estimate three classification rules. For each rule and each value of λ, the corresponding out-of-sample weighted misclassification risk is calculated by applying the derived rule to D_2.

Two data generating mechanisms, adapted from Kang and Schafer (2007), are considered. For each, the observed binary outcome Y_i is generated from an underlying score Ỹ_i via Y_i = 1(Ỹ_i ≥ c_0), where the cutoff c_0 is chosen to guarantee a 30% prevalence. The underlying score is generated by Ỹ_i = b_0 + U_i b + ε_i, where b_0 = 210 and b = (27.4, 13.7, 13.7, 13.7)^T; for i = 1, . . . , n, ε_i ∼ N(0, 1) and U_i = (U_{i1}, U_{i2}, U_{i3}, U_{i4}) ∼ N(0, I_{4×4}).

In the first setting we treat U as observed covariates and use (Y, U) to derive classification rules. In the second setting, instead of using U, we observe X = g(U), where g : IR^4 → IR^4 is given by g_1(u) = exp(u_1/2), g_2(u) = u_2/(1 + exp(u_1)) + 10, g_3(u) = (u_1 u_3/25 + 0.6)^3, and g_4(u) = (u_2 + u_4 + 20)^2. More detailed rationale for this parameterization is given in Kang and Schafer (2007).
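The data-generating mechanism for the second setting can be written compactly in R. The sketch below follows the description above (coefficient values as in Kang and Schafer, 2007); imposing the 30% prevalence through the empirical 70th percentile of the latent score, and the illustrative sample size, are our own choices.

# Simulate one dataset under the second (transformed-covariate) setting.
sim_data <- function(n) {
  U  <- matrix(rnorm(n * 4), n, 4)
  b0 <- 210; b <- c(27.4, 13.7, 13.7, 13.7)
  ytilde <- b0 + U %*% b + rnorm(n)                     # latent score
  y  <- as.integer(ytilde >= quantile(ytilde, 0.70))    # ~30% prevalence
  X  <- cbind(exp(U[, 1] / 2),
              U[, 2] / (1 + exp(U[, 1])) + 10,
              (U[, 1] * U[, 3] / 25 + 0.6)^3,
              (U[, 2] + U[, 4] + 20)^2)                 # observed covariates g(U)
  data.frame(y = y, X1 = X[, 1], X2 = X[, 2], X3 = X[, 3], X4 = X[, 4])
}

d1 <- sim_data(5000)   # training data (size chosen for illustration only)
d2 <- sim_data(5000)   # test data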
Under either setting, the true probability score is Ψ_0(U) = P(Y = 1 | U) = P(ε ≥ c_0 − b_0 − U b | U). The optimal Bayes classification rule based on the true probability score is Q_0(U) = 1{Ψ_0(U) ≥ 1 − λ}, which provides a reference standard for assessing classification accuracy. The out-of-sample risk for the optimal Bayes classification rule is approximated by

   R̂_λ(D_2; Q_0) = (1/n) Σ_{(U_i, Y_i) ∈ D_2} [ λ 1{Q_0(U_i) = 0, Y_i = 1} + (1 − λ) 1{Q_0(U_i) = 1, Y_i = 0} ].

We therefore use R̂_λ(D_2; Q_0) as the reference for the relative differences displayed in Table 1. The relative difference in estimated weighted misclassification risk for an estimated classification rule Q̂(·) at penalty λ is defined as {R̂_λ(D_2; Q̂) − R̂_λ(D_2; Q_0)} / R̂_λ(D_2; Q_0).

We consider using four and eight candidate algorithms for SL. In the case of K = 4, the SL library L includes random forest (Breiman, 2001), logistic regression, a generalized additive model (Hastie & Tibshirani, 1990) and CART (Breiman et al., 1984). For K = 8, four additional candidate algorithms are added to L: 10-nearest-neighbors, generalized boosting (Friedman, 2001), support vector machine (Hsu, Chang, Lin, et al., 2003) and bagging classification (Breiman, 1996). Logistic regression and the generalized additive model are fitted using maximum likelihood without penalization; we use linear main-effect terms for logistic regression and quadratic main-effect splines for the generalized additive model.

Relative risk differences for each of the SL classification methods discussed above are summarized in Table 1. In the first setting, when the covariates U are used to develop the classification rule, conditional thresholding and the joint thresholding rules (two-step and CRS minimization) have, as expected, similar out-of-sample performance, differing from the optimal classification rule by less than 2% in relative terms across all values of λ. In the second setting, where classification is based on X, joint thresholding clearly outperforms conditional thresholding for most λ values. Relative to the optimal Bayes classification rule based on the true probability score Ψ_0(U), conditional thresholding is worse by 5 to 23% across λ, while joint thresholding differs by 2.3% or less.

6. Data Applications

In this section we illustrate the proposed methods on Kenyan clinical HIV data and the Wisconsin diagnostic breast cancer data. We use the same SL libraries with four and eight learners as in the simulations. The first data set was used in Liu et al. (2017) and the second data set is available in the UCI machine learning repository (Lichman, 2013).

The Kenyan HIV data were derived from three studies conducted at the Academic Model Providing Access to Healthcare (AMPATH) in Eldoret, Kenya: (i) the “Crisis” study (n=191) (Mann et al., 2013), conducted in 2009-2011 to investigate the impact of the 2007-2008 post-election violence in Kenya on ART failure and drug resistance; (ii) the “second-line” study (n=394) (Diero et al., 2014), conducted in 2011-2012 to investigate ART failure and drug resistance on second-line ART; and (iii) the “TDF” study (n=333) (Brooks et al., 2016), conducted in 2012-2013 to investigate the impact of WHO guideline changes to TDF-based first-line ART on HIV treatment failure and drug resistance.
The data include covariate information on age, gender, nadir CD4 count, CD4 count, CD4 percent, adherence to ART, time since starting current ART, and slope of CD4 percent progression, together with the outcome VL. Our interest is to develop a classification rule to predict HIV virological failure (VL above a pre-specified threshold) from these markers. Table 2 reports the cross-validated weighted misclassification risks: joint thresholding attains lower risk at λ = 0.2, 0.5 and 0.8 than the conditional thresholding approach for both data illustrations. In addition, the difference in cross-validated risks between conditional and joint thresholding is smaller for eight learners than for four learners. Table 3 shows, for the Kenyan HIV data, the estimated coefficients and thresholds (α̂, ĉ) at λ = 0.2 and 0.8 for conditional thresholding and our proposed approaches. In our analysis, both conditional thresholding and two-step minimization derive coefficients by nonnegative least squares linear regression normalized to sum to one, hence their coefficient estimates are the same and stay fixed across λ. For CRS minimization, the coefficient-estimating procedure involves optimizing a function of λ, so the estimated values may vary with λ. Threshold estimation depends on λ for all approaches. Conditional thresholding and two-step minimization share the same estimated risk scores for all λ values, but their threshold estimates differ because of the different thresholding methods.

Figure 1 provides the cross-validated risk comparison on a finer scale for the Kenyan HIV data and the Wisconsin diagnostic breast cancer data, considering both four and eight candidate learners. Differences in cross-validated risks between conditional and joint thresholding across λ are larger when there are fewer candidate learners in the SL library, and the joint thresholding approaches generally achieve better cross-validated risks than conditional thresholding.

Figure 2 illustrates why misclassification differs between the thresholding methods. The vertical lines in Figure 2 indicate threshold values estimated at λ = 0.8 based on the SL predictions, the cross-validated SL predictions, and the cross-validated Z α̂ predictions within SL, correspondingly from left to right. Under conditional thresholding, the rule is conditional on the SL prediction, which may be subject to over-fitting (the first panel). By contrast, two-step and CRS minimization use cross-validated predictions for selecting the threshold, and do this within the SL (the third panel), whose distribution is similar to that of the cross-validated prediction of the entire SL (middle panel). As expected from the risk analysis of the methods, the estimated thresholds are very different for conditional and joint thresholding. In Figure 2, the threshold estimates from joint thresholding and from cross-validating the entire SL are very similar, which further explains the out-performance of joint thresholding relative to conditional thresholding. Nonetheless, threshold calculation by cross-validating the entire SL requires substantially more computation time and complexity in both estimation and evaluation, and may over-cross-validate the data when the sample size is not sufficiently large. Furthermore, none of the estimated thresholds is close to the optimal Bayes threshold of 1 − λ = 0.2, implying that none of the three types of predictions derived from SL closely matches the unknown underlying true class probability, and underscoring the need to derive the classification rule by direct optimization.
7. Summary and Discussion
Weighted misclassification risk is often used to evaluate predictions when false positives and false negatives need to be prioritized differently. However, inference and rule derivation using weighted misclassification risk are less common because of the associated numerical difficulties. For binary classification using SL, we aimed to optimally estimate both the threshold and the ensemble weights associated with the candidate learners by minimizing the weighted misclassification risk. Through simulations and data examples, we showed that conditional thresholding may generate sub-optimal classification rules. We proposed two options for joint thresholding, both embedding the threshold estimation procedure within SL, and showed that our proposal performs similarly to or outperforms the conditional thresholding approach in determining optimal classification rules.

Our method presents a new way to estimate the classification rule. We showed that rules developed under either two-step or CRS minimization generally have lower error rates than conditional thresholding. From the comparison of density curves in Figure 2 among the SL prediction, the cross-validated SL prediction, and Zα within SL in the data application, we can see that conditional thresholding tends to overfit the data, resulting in an incorrectly estimated risk. We therefore expect the risk from conditional thresholding to be closer to the true risk in two situations: first, when the training data distribution represents the true underlying data distribution well, so that there is little difference in distribution between the training and test sets; and second, when the ensemble risk score discriminates the data well for both the training and the test set. Consistent with this, Figure 1 shows that the difference between conditional and joint thresholding becomes smaller with more candidate learners in the SL library.

Our work also provides a general framework for using ensemble learners for binary classification. Although we consider weighted loss functions as in (3), our method can potentially be extended to more general threshold-based classifications, with the numerical optimization method adapted to the nature of the threshold-based classification loss. We expect similar properties to hold for other classification problems that involve threshold estimation, when the classification loss is a measurable function of the classifiers.

Furthermore, from Figure 2, we anticipate the performance of our method to be comparable to threshold estimation based on cross-validated SL predictions. This is important for settings where computational complexity is high, or where the SL library and the data are large, as threshold estimation based on cross-validated SL predictions may require a substantial amount of computation, adding complexity and difficulty to method evaluation.

Code for the analysis was written in R (R Core Team, 2016) and is available in the web appendix.

Table 1: Out-of-sample weighted misclassification risk of the optimal Bayes classification rule (%) in the first row, and relative difference in out-of-sample weighted misclassification risk (%) at λ = 0.2, 0.5 and 0.8 for the simulation studies described in Section 5, stratified by estimation method and number of learners in the SL library. The relative difference for a derived classification rule Q̂(·) at λ is (R̂_λ(D_2; Q̂) − R̂_λ(D_2; Q_0)) / R̂_λ(D_2; Q_0), where Q_0 is the optimal Bayes classification rule based on the true probability score Ψ_0(U).
                                      4 Learners              8 Learners
                                 λ = 0.2   0.5    0.8     λ = 0.2   0.5    0.8
  R̂_λ(D_2; Q_0) (%)                6.1    14.8   12.9       6.1    14.8   12.9
  Simulation 1
    Conditional Thresholding        0.0     0.0    1.6       0.0     0.0    1.6
    Two-Step Minimization           0.0     0.0    0.8       0.0     0.0    1.6
    CRS Minimization                0.0     0.0    0.8       0.0     0.0    1.6
  Simulation 2
    Conditional Thresholding       16.4    12.8   11.6       8.2     6.1    9.3
    Two-Step Minimization           0.0     0.0    2.3       0.0     0.0    2.3
    CRS Minimization                0.0     0.0    2.3       0.0     0.0    2.3

Table 2: Cross-validated weighted misclassification risk (%) at λ = 0.2, 0.5 and 0.8 for the two data examples, stratified by estimation method and number of learners (4 and 8) in the SL library.

Table 3: Estimated (α̂, ĉ) at λ = 0.2 and 0.8 for the Kenyan HIV data, under the estimation methods Conditional Thresholding (CT), Two-Step Minimization (2-Step) and CRS Minimization (CRS); rows give the coefficients α̂ for random forest, logistic regression, quadratic splines, CART, 10-nearest-neighbors, generalized boosting, SVM and bagging, and the threshold ĉ. The asterisk (*) indicates that the α̂ are the same for CT and 2-Step regardless of λ.

Figure 1: Comparison of cross-validated weighted misclassification risk as a function of λ for the two data applications, stratified by number of learners in the SL library.

Figure 2: Density curves of the SL prediction, the 10-fold cross-validated prediction of SL (CV-SL) and the combined cross-validated prediction within SL (Zα) for the breast cancer data, using least squares regression under nonnegativity and sum-to-one constraints on the coefficients for combining the eight library prediction scores. The vertical line represents the estimated threshold based on the corresponding type of prediction at λ = 0.8.
Web Appendix

A1. Application to SECOM Data
The following analysis uses the large-p SECOM data set (Dheeru & Karra Taniskidou, 2017) from the UCI machine learning repository. Each sample in the SECOM data set represents a single production entity from a modern semi-conductor manufacturing process; the outcome represents a pass/fail yield for in-house line testing and the features are measured signals from the monitoring system. The original data set has 1567 samples, 591 features, and 104 fails in its outcome. The number of features is over one third of the sample size, and standard small-p methods such as linear regression and additive splines do not converge for these data.

We cleaned the data so that all features have more than one unique value and each feature has either no missing values or at least 30 missing values. After accounting for missingness by creating missing indicators and filling in 0's for missing measurements, there are 1436 samples, 484 features, and 100 fails.

We used three candidate learners for this large-p application: random forest, lasso, and leekasso. Leekasso fits a linear model using the top 10 most significant predictors, identified by fitting a univariate model for each covariate (a code sketch is given at the end of this subsection). These three candidate learners were chosen for convergence and computation-speed considerations.

The cross-validated risk plot below shows that the proposed joint thresholding can be applied to large-p problems with properly selected candidate learners, and that our method still performs better than conditional thresholding.

Figure 3: Comparison of cross-validated weighted misclassification risk as a function of λ for the SECOM data application between conditional thresholding and joint thresholding (CRS and two-step minimization), under the following candidate learners in the SL library: random forest, lasso, and leekasso.
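As referenced above, a minimal base-R sketch of the leekasso learner is shown here; the function name and the use of p-values from per-covariate linear models are our own rendering of the description above, not code from the paper's web appendix.

# Leekasso (sketch): keep the 10 covariates with the smallest univariate
# p-values, then fit an ordinary linear model on those covariates only.
leekasso <- function(X, y, k = 10) {
  X <- as.data.frame(X)
  pvals <- vapply(X, function(xj) {
    summary(lm(y ~ xj))$coefficients[2, 4]   # p-value of the single predictor
  }, numeric(1))
  keep <- order(pvals)[seq_len(min(k, ncol(X)))]
  list(fit = lm(y ~ ., data = X[, keep, drop = FALSE]), keep = keep)
}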
A2. Comparison with Candidate Learners – SECOM Data

Because the candidate learners return risk estimates, which are continuous scores, each individual candidate learner also needs a threshold in order to assess its classification performance under the weighted misclassification loss. For a single learner, the joint thresholding method reduces to deriving a threshold based on the cross-validated predictions of that learner. To see how much additional gain the ensemble learner achieves, we used the SECOM data example again. The 10-fold cross-validated risks are shown in Figure 4, and Table 4 gives the values of the curves in Figure 4 at λ = 0.1, 0.2, . . . , 0.9.
From Figure 4 we can see that the performance of SL with joint thresholding is similar to that of its best candidate learner, which in this case is the random forest with joint thresholding.

Figure 4: Comparison of cross-validated weighted misclassification risk with joint thresholding as a function of λ for the SECOM data application, among the SL (CRS minimization and two-step minimization) and the individual candidate learners (random forest, lasso, and leekasso).

Table 4: Cross-validated weighted misclassification risk with joint thresholding at λ from 0.1 to 0.9 for the SECOM data application, stratified by estimation method.
A3. Theoretical Justification
Our theoretical justification follows a similar road map to van der Laan et al. (2007) and van der Vaart, Dudoit, and van der Laan (2009). We first introduce and revise some notation.

Given the false negative penalty λ ∈ (0, 1), the optimal classifier is

   Q_0 = argmin_Q E_{(Y,X)} L_λ(Y, Q(X)) = argmin_Q R_λ(Q),

where L_λ(Y, Q(X)) = λ 1{Q(X) = 0, Y = 1} + (1 − λ) 1{Q(X) = 1, Y = 0} and

   R_λ(Q) = ∫ L_λ(y, Q(x)) dP(x, y).

In the above expression, P is the unknown true joint distribution of the outcome Y and the covariates X.

The distance between any two classifiers Q_1 and Q_2 is defined as the absolute difference between their risks, d_λ(Q_1, Q_2) = |R_λ(Q_1) − R_λ(Q_2)|.

For SL, there is a collection of scores from the K candidate learners, (Ψ_1, . . . , Ψ_K). We parameterize the classification rule as Q_θ(·), where θ = (α, c), α is the vector of coefficients that linearly combines the K scores, and c is the threshold applied to the combined score for binary classification,

   Q_θ(x) = 1{ Σ_{k=1}^K α_k Ψ_k(x) ≥ c }.

Without loss of generality, we assume the coefficients in α are constrained so that α_k ∈ [0, 1], k = 1, . . . , K, and Σ_{k=1}^K α_k = 1. Denote by Ω the range of the combined score predictions, assumed to be bounded; the parameter space of θ can then be written as Θ = [0, 1]^K × Ω. We consider a grid Θ_n of θ values in the bounded parameter space Θ, and let K(n) = |Θ_n| be the number of grid points, with K(n) ≤ n^q for some constant q < ∞. We then consider Q = {Q_θ : θ ∈ Θ_n} as the collection of candidate classifiers.

Next, we formalize cross-validation as in van der Vaart et al. (2009) and describe our estimators accordingly. Let S = (S_1, . . . , S_n) ∈ {0, 1}^n be a random vector independent of the data samples X_1, . . . , X_n; S_i = 0 indicates that the i-th sample belongs to the training set, and S_i = 1 that it belongs to the test set. We can then define the empirical distributions of the training and test sets by

   P_S^j = (1/n_j) Σ_{i: S_i = j} δ_{X_i},   n_j = Σ_{i=1}^n 1{S_i = j},   j = 0, 1,

where δ_{X_i}(x) = 1{x ≥ X_i} for i = 1, . . . , n. Q_θ(P_S^0) is a classifier estimated from the training set,

   Q_θ(P_S^0)(x) = 1{ Σ_{k=1}^K α_k Ψ_k(P_S^0)(x) ≥ c }.

An oracle selector of θ is one whose corresponding classifier, estimated from the training set, minimizes the risk under the unknown distribution P averaged over the splits,

   θ̃ = argmin_{θ ∈ Θ_n} E_S [ ∫ L_λ(y, Q_θ(P_S^0)(x)) dP(x, y) ] = argmin_{θ ∈ Θ_n} E_S [ R_λ(Q_θ(P_S^0)) − R_λ(Q_0) ].

Its corresponding classifier is Q_θ̃; among all classifiers of this form estimated from the training set, it is the closest in distance to the true risk minimizer Q_0.

Cross validation replaces P by the test-set distribution P_S^1 and uses θ̂_n as the cross-validated selector of θ,

   θ̂_n = argmin_{θ ∈ Θ_n} E_S [ ∫ L_λ(y, Q_θ(P_S^0)(x)) dP_S^1(x, y) ].

The realization of θ̂_n is the parameter estimate for SL with joint thresholding as presented in the paper,

   θ̂_n ≡ argmin_{θ ∈ Θ_n} (1/n) Σ_{i=1}^n [ λ 1{Σ_{k=1}^K α_k Z_{ik} < c, Y_i = 1} + (1 − λ) 1{Σ_{k=1}^K α_k Z_{ik} ≥ c, Y_i = 0} ],

where Z = {Ψ̂_{k,T(d)}(X_{V(d)}), k = 1, . . . , K, d = 1, . . . , D} is the stacked matrix of cross-validated predictions defined in Sections 3 and 4. In Section 4, the (α̂, ĉ) obtained at the end of our proposed procedure is the solution of, or an approximation to, θ̂_n.

The goal of this section is to prove that the risk, averaged over data splits, of the classifier estimator Q_θ̂_n converges asymptotically to that of the oracle classifier Q_θ̃.
van der Vaart et al. (2009, Theorem 2.3) established the following inequality, relating the risks, averaged over data splits, of the cross-validated selector and the oracle selector.

Theorem 1. (van der Vaart et al., 2009) For Q ∈ Q, let (M(Q), v(Q)) be a Bernstein pair for the measurable function z ↦ L(z, Q), and assume that R(Q) = ∫ L(z, Q) dP(z) ≥ 0 for every Q ∈ Q. Then for any δ > 0 and 1 ≤ p ≤ 2,

   E_S R(Q_θ̂(P_S^0)) ≤ (1 + 2δ) E_S R(Q_θ̃(P_S^0))
        + (1 + δ) E_S[(c_0/n_1)^{1/p}] × log(1 + |Q|) × sup_{Q ∈ Q} [ M(Q) n_1^{1/p − 1} + ( v(Q)/R(Q)^{2−p} )^{1/p} ((1 + δ)/δ)^{2/p − 1} ],

where c_0 is an absolute constant.

Recall that for a measurable function f : X → IR, (M(f), v(f)) is a pair of Bernstein numbers if

   2 M(f)^2 P( e^{|f|/M(f)} − 1 − |f|/M(f) ) ≤ v(f),

and it was shown in van der Vaart et al. (2009) that if f is uniformly bounded, then (||f||_∞, 1.5 P f^2) is a pair of Bernstein numbers.
Lemma 1. For the weighted misclassification loss (x, y) ↦ L_λ(y, Q(x)), where Q : x ↦ {0, 1} and λ ∈ (0, 1), a Bernstein pair (M(Q), v(Q)) is given by

   M(Q) = max{λ, 1 − λ}   and   v(Q) = 1.5 [ λ^2 P(Q(X) = 0, Y = 1) + (1 − λ)^2 P(Q(X) = 1, Y = 0) ].

Furthermore, v(Q) ≤ 1.5 × max{λ, 1 − λ} × R_λ(Q).

Proof. The loss function L_λ(y, Q(x)) = λ 1{Q(x) = 0, y = 1} + (1 − λ) 1{Q(x) = 1, y = 0} has range {0, λ, 1 − λ} and hence is bounded by max{λ, 1 − λ}. By definition, the risk can be written as R_λ(Q) = λ P(Q(X) = 0, Y = 1) + (1 − λ) P(Q(X) = 1, Y = 0). Moreover,

   E_{(x,y)}[ L_λ^2(y, Q(x)) ] = E_{(x,y)}[ λ^2 1{Q(x) = 0, y = 1} + (1 − λ)^2 1{Q(x) = 1, y = 0} ]
                               = λ^2 P(Q(X) = 0, Y = 1) + (1 − λ)^2 P(Q(X) = 1, Y = 0)
                               ≤ max{λ, 1 − λ} × R_λ(Q).

Therefore, v(Q) = 1.5 E_{(x,y)}[ L_λ^2(y, Q(x)) ] ≤ 1.5 × max{λ, 1 − λ} × R_λ(Q).

Applying Theorem 1 (van der Vaart et al., 2009) together with the lemma, we obtain the following result for the weighted misclassification loss.

Theorem 2. For classifiers in Q = {Q_θ : θ ∈ Θ_n} and the weighted misclassification loss L_λ(Y, Q(X)) at a given λ ∈ (0, 1),

   E_S R_λ(Q_θ̂(P_S^0)) ≤ E_S R_λ(Q_θ̃(P_S^0)) + O( log(n)/√n ).

Proof. Write C = max{λ, 1 − λ}. Recall that we assumed the cardinality of Θ_n satisfies K(n) ≤ n^q, and by definition |Q| = K(n); hence log(1 + |Q|) ≤ q log(n). Therefore, applying the inequality in Theorem 1 with the Bernstein pair of Lemma 1, for any δ > 0 and 1 ≤ p ≤ 2,

   E_S R_λ(Q_θ̂(P_S^0)) ≤ (1 + 2δ) E_S R_λ(Q_θ̃(P_S^0))
        + (1 + δ) E_S[(c_0/n_1)^{1/p}] × log(1 + |Q|) × sup_{Q ∈ Q} [ C n_1^{1/p − 1} + ( 1.5 C × R_λ(Q)/R_λ(Q)^{2−p} )^{1/p} ((1 + δ)/δ)^{2/p − 1} ].

Furthermore, since R_λ(Q) ≤ C and |Q| ≤ n^q,

   E_S R_λ(Q_θ̂(P_S^0)) ≤ (1 + 2δ) E_S R_λ(Q_θ̃(P_S^0))
        + (1 + δ) E_S[(c_0/n_1)^{1/p}] × q log(n) × [ C n_1^{1/p − 1} + (1.5 C^p)^{1/p} ((1 + δ)/δ)^{2/p − 1} ]
      ≤ (1 + 2δ) E_S R_λ(Q_θ̃(P_S^0))
        + (1 + δ) E_S[(c_0/n_1)^{1/p}] × q log(n) × [ C n_1^{1/p − 1} + 1.5^{1/p} C ((1 + δ)/δ)^{2/p − 1} ].

The sample size of the test set is approximately a fixed proportion of the entire data, so n_1 is of the same order as n. Let p = 2 and δ = 1/√n; then the above inequality becomes

   E_S R_λ(Q_θ̂(P_S^0)) ≤ E_S R_λ(Q_θ̃(P_S^0)) + 2δ + (1 + δ) E_S[(c_0/n_1)^{1/2}] × q log(n) × [ C n_1^{−1/2} + 1.5^{1/2} C ]
                        = E_S R_λ(Q_θ̃(P_S^0)) + O( log(n)/√n ).

The remainder term log(n)/√n goes to zero asymptotically. As long as

   log(n) / { √n E_S R_λ(Q_θ̃(P_S^0)) } → 0   as n → ∞,     (4)

Q_θ̂_n is asymptotically equivalent to the oracle estimator Q_θ̃ in terms of their true risks, averaged over data splits, when the estimators are fit on the training set:

   E_S R_λ(Q_θ̂(P_S^0)) / E_S R_λ(Q_θ̃(P_S^0)) → 1   as n → ∞.

When equation (4) does not hold, Q_θ̂(P_S^0) achieves the log(n)/√n rate:

   E_S R_λ(Q_θ̂(P_S^0)) = O( log(n)/√n ).

This section shows that the performance of the SL classification rule with joint thresholding of Section 4 is asymptotically equivalent to that of the oracle under the stated conditions.
References
Birkner, M. D., Kalantri, S. P., Solao, V., Badam, P., Joshi, R., Goel, A., . . . Hubbard, A. E. (2007, June). Creating diagnostic scores using data-adaptive regression: An application to prediction of 30-day mortality among stroke victims in a rural hospital in India. Therapeutics and Clinical Risk Management, (3), 475.
Breiman, L. (1996). Bagging predictors. Machine Learning, (2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, (1), 5–32.
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and Regression Trees. CRC Press.
Brooks, K., Diero, L., DeLong, A., Balamane, M., Reitsma, M., Kemboi, E., . . . others (2016). Treatment failure and drug resistance in HIV-positive patients on tenofovir-based first-line antiretroviral therapy in western Kenya. Journal of the International AIDS Society, (1).
Buja, A., Stuetzle, W., & Shen, Y. (2005). Loss functions for binary class probability estimation and classification: Structure and applications.
Chen, C., & Mangasarian, O. L. (1996). Hybrid misclassification minimization. Advances in Computational Mathematics, (1), 127–136.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, (1), 21–27.
Dheeru, D., & Karra Taniskidou, E. (2017). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. Retrieved from http://archive.ics.uci.edu/ml
Diero, L., DeLong, A., Schreier, L., Kemboi, E., Orido, M., & Rono, M. (2014). High HIV resistance and mutation accrual at low viral loads upon 2nd-line failure in western Kenya. In Conference on Retroviruses and Opportunistic Infections. Boston, MA.
Duda, R. O., Hart, P. E., et al. (1973). Pattern Classification and Scene Analysis (Vol. 3). Wiley New York.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1189–1232.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized Additive Models (Vol. 43). CRC Press.
Haybittle, J. L., Blamey, R. W., Elston, C. W., Johnson, J., Doyle, P. J., Campbell, F. C., . . . Griffiths, K. (1982). A prognostic index in primary breast cancer. British Journal of Cancer, (3), 361.
Hsu, C., Chang, C., Lin, C., et al. (2003). A practical guide to support vector classification.
Johnson, S. G. (2014). The NLopt nonlinear-optimization package. (R package version 1.0.4)
Kaelo, P., & Ali, M. M. (2006). Some variants of the controlled random search algorithm for global optimization. Journal of Optimization Theory and Applications, (2), 253–264.
Kang, J. D., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 523–539.
Kruppa, J., Ziegler, A., & König, I. R. (2012, October). Risk estimation and risk prediction using machine-learning methods. Human Genetics, (10), 1639–1654. doi: 10.1007/s00439-012-1194-y
Lichman, M. (2013). UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml
Liu, T., Hogan, J. W., Daniels, M. J., Coetzer, M., Xu, Y., Bove, G., . . . Kantor, R. (2017, August). Improved HIV-1 viral load monitoring capacity using pooled testing with marker-assisted deconvolution. JAIDS Journal of Acquired Immune Deficiency Syndromes, (5), 580. doi: 10.1097/QAI.0000000000001424
Mangasarian, O. L. (1994). Misclassification minimization. Journal of Global Optimization, (4), 309–323.
Mann, M., Diero, L., Kemboi, E., Mambo, F., Rono, M., Injera, W., . . . others (2013). Antiretroviral treatment interruptions induced by the Kenyan postelection crisis are associated with virological failure. Journal of Acquired Immune Deficiency Syndromes, (2), 220.
Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. The Computer Journal, (4), 308–313.
R Core Team. (2016). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria.
Saghir, N. S. E., Adebamowo, C. A., Anderson, B. O., Carlson, R. W., Bird, P. A., Corbex, M., . . . others (2011). Breast cancer management in low resource countries (LRCs): Consensus statement from the Breast Health Global Initiative. The Breast, S3–S11.
Shim, J., Mclernon, D. J., Hamilton, D., Simpson, H. A., Beasley, M., & Macfarlane, G. J. (2018, July). Development of a clinical risk score for pain and function following total knee arthroplasty: Results from the TRIO study. Rheumatology Advances in Practice, (2). doi: 10.1093/rap/rky021
Shyyan, R., Masood, S., Badwe, R. A., Errico, K. M., Liberman, L., Ozmen, V., . . . Vass, L. (2006). Breast cancer in limited-resource countries: Diagnosis and pathology. The Breast Journal, (s1), S27–S37.
Siegel, R., Naishadham, D., & Jemal, A. (2012). Cancer statistics, 2012. CA: A Cancer Journal for Clinicians, (1), 10–29.
Storn, R., & Price, K. (1997). Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, (4), 341–359.
Tate, J. P., Justice, A. C., Hughes, M. D., Bonnet, F., Reiss, P., Mocroft, A., . . . others (2013). An internationally generalizable risk index for mortality after one year of antiretroviral therapy. AIDS, (4), 563.
van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, (1).
van der Vaart, A. W., Dudoit, S., & van der Laan, M. J. (2009). Oracle inequalities for multi-fold cross validation. Statistics & Decisions, (3), 351–371. doi: 10.1524/stnd.2006.24.3.351
Wang, S., Xu, F., & Demirci, U. (2010). Advances in developing HIV-1 viral load assays for resource-limited settings. Biotechnology Advances, (6), 770–781.
Wolberg, W. H., Street, W. N., & Mangasarian, O. L. (1994). Machine learning techniques to diagnose breast cancer from image-processed nuclear features of fine needle aspirates. Cancer Letters, (2), 163–171.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, (2), 241–259.

Sample Codes

# normalize to make coefficients sum up to one
bcrs = bcrs / norm
cutoff = crssol$par[1] / norm