Classification using Ensemble Learning under Weighted Misclassification Loss
Yizhen Xu, Tao Liu, Michael J. Daniels, Rami Kantor, Ann Mwangi, Joseph W. Hogan
1. Department of Biostatistics, Brown University, 121 S. Main Street, Providence, RI, U.S.A.
2. Department of Statistics and Data Sciences, University of Texas at Austin, Austin, TX, U.S.A.
3. Division of Infectious Diseases, Brown University, Providence, RI, U.S.A.
4. Academic Model Providing Access to Healthcare (AMPATH), Eldoret, Kenya
5. College of Health Sciences, School of Medicine, Moi University, Eldoret, Kenya
* [email protected]
* This is part of the thesis work of the first author.
Abstract
Binary classification rules based on covariates typically depend on simple loss functions such as zero-one misclassification. Some cases may require more complex loss functions. For example, individual-level monitoring of HIV-infected individuals on antiretroviral therapy (ART) requires periodic assessment of treatment failure, defined as having a viral load (VL) value above a certain threshold. In some resource-limited settings, VL tests may be limited by cost or technology, and diagnoses are based on other clinical markers. Depending on the scenario, a higher premium may be placed on avoiding false positives, which bring greater cost and reduced treatment options. Here, the optimal rule is determined by minimizing a weighted misclassification loss/risk. We propose a method for finding and cross-validating optimal binary classification rules under weighted misclassification loss. We focus on rules comprising a prediction score and an associated threshold, where the score is derived using an ensemble learner. Simulations and examples show that our method, which derives the score and threshold jointly, more accurately estimates overall risk and has better operating characteristics than methods that derive the score first and the cutoff conditionally on the score, especially in finite samples.
This work was supported by NIH (Grant NO: R01 AI108441, P30 AI042853, K24 AI134359, R01 AI066922).

1. Introduction
Development of accurate binary classification rules is important in many areas of biomedical research and clinical practice. The problem can be described as follows: suppose each individual in a population of interest has a binary outcome Y and a p-dimensional covariate vector X = (X_1, . . . , X_p) taking values in X. We would like to find a classification rule Q : X → {0, 1} that takes X ∈ X as input and generates a classification action a ∈ {0, 1} as output. The criterion for classification accuracy is given in terms of a loss function L(Y, a); for example, L(Y, a) = 1{a ≠ Y} is simple misclassification loss. For a given loss criterion, the optimal rule is the one that minimizes expected loss over the population of interest.

Loosely speaking, methods for binary and categorical classification can be divided into those that perform classification directly and those that use a score-and-threshold approach. Direct classification methods include K-nearest-neighbors (KNN) (Cover & Hart, 1967; Duda, Hart, et al., 1973) and tree-based techniques (Breiman, Friedman, Stone, & Olshen, 1984). In score-based methods, classification is carried out by comparing a scalar, real-valued risk score Ψ(X) to a threshold c, yielding rules of the form Q(X; Ψ(·), c) = 1{Ψ(X) ≥ c}. The risk score Ψ(X), a function mapping X to IR, can represent the class membership probability P(Y = 1 | X) or can be a more general ranking measure, i.e., one where Ψ(X_1) > Ψ(X_2) implies that Y_1 is stochastically greater than Y_2. Moreover, the risk score can be derived using a single model, such as a regression model, or using an ensemble method that combines scores from multiple models (van der Laan, Polley, & Hubbard, 2007).

Score-based methods are often used in practice, especially in medical applications. Examples include the VACS score (Tate et al., 2013) for risk prediction of all-cause mortality among HIV-infected individuals and the Nottingham prognostic index (Haybittle et al., 1982) for breast cancer. The main motivation here is to build a risk score for classification prediction of virological failure, which can be applied to various stages of individual-level HIV monitoring such as viral load (VL) pooled testing. This is important because VL testing is limited in some resource-limited settings (RLS), and accurate classification based on common clinical markers can reduce cost and mortality.

A classification rule is optimal if it minimizes average loss over a population of interest. The process of deriving optimal classification rules and estimating their operating characteristics relies on a training step and a testing step. In the training step, optimal classification rules are estimated; in the testing step, out-of-sample error rates associated with the rules are calculated. Finding optimal score-based classification rules requires estimation of both the score function and the associated threshold. In this paper, we focus on score-based classification methods under weighted misclassification loss, where the score Ψ(X) is derived using the Super Learner ensemble (van der Laan et al., 2007). In Super Learner (SL), a library of K learners generates risk scores Ψ_1(X), . . . , Ψ_K(X), which are then combined into a single score using the convex combination Ψ(X; α) = Σ_{k=1}^K α_k Ψ_k(X), where α = (α_1, . . . , α_K), α_k ≥ 0, and Σ_k α_k = 1. The α_k's are estimated based on cross-validated library learner predictions, typically using squared-error, likelihood-based or ROC-based loss.
van der Laan et al. (2007) showed through applications that SL outperforms the individual candidate learners in a given library.

A conditional thresholding approach is to first derive a risk score Ψ̂(X) = Ψ(X; α̂) and, conditionally on Ψ̂(X), find the ĉ that minimizes the misclassification loss. In this paper, we use the term conditional thresholding for any derivation of the decision threshold that conditions directly on the risk estimated from the same training data. Because it is intuitively straightforward, conditional thresholding has been used in various studies. Birkner et al. (2007) used conditional thresholding in predicting 30-day mortality following stroke in a rural Indian population; from a set of candidate learners, they selected the optimal model as the estimator with the smallest cross-validated risk, and chose the cut-off value from the estimated score on the training data to be the smallest value achieving at least 90% sensitivity. Kruppa, Ziegler, and König (2012) derived a classification rule using conditional thresholding for a GWA study on rheumatoid arthritis; several risk scores were estimated on the training data and the thresholds for binary classification were selected by maximizing the Youden index for each risk score in the training data. Shim et al. (2018) developed a risk score model to identify patients at high risk of negative effects after total knee arthroplasty, and derived the optimal threshold for binary classification directly from the estimated score based on Youden's index, which is also a conditional thresholding derivation. We show using simulation studies and empirical examples that this approach may lead to over-estimation of misclassification error and yield classification rules that are sub-optimal.

As an alternative we propose joint thresholding, which combines the estimation of the coefficients α and the threshold c in one step to gain better overall performance in terms of out-of-sample prediction risk. It uses information on out-of-sample classification through one cross-validation procedure, and naturally integrates threshold estimation into the SL framework. Simulation and application results under weighted misclassification loss show that our simultaneous estimation approach generally has lower out-of-sample risk than conditional thresholding. In the data applications, the threshold estimated using our method is approximately the optimal Bayes threshold when the ensemble score is close to the true class probability, as would be expected under weighted misclassification loss (Buja, Stuetzle, & Shen, 2005).

Our approach involves minimization of an empirical weighted misclassification risk function. The minimization can be difficult because the weighted misclassification risk is composed of step functions that count the misclassifications, making it a nonconvex, nonsmooth and NP-hard optimization problem (Chen & Mangasarian, 1996). Conditional on a risk score, estimating c by minimizing the misclassification risk is relatively easy and can be accomplished by a line search. In Section 4, we demonstrate simultaneous estimation of the ensemble weights and threshold using bounded controlled random search (Kaelo & Ali, 2006). This method gives an approximation to the minimizer of the empirical weighted misclassification risk and is easy to implement.

The paper is structured as follows. Section 2 provides motivating examples and defines weighted misclassification loss. Section 3 describes ensemble learning and the conditional thresholding approach. Section 4 explains our proposal of joint thresholding, discusses related challenges and existing methods for minimization of the weighted misclassification loss/risk, and introduces an approach to joint thresholding by controlled random search. Simulations and data applications are described in Sections 5 and 6 respectively. Section 7 gives conclusions and suggestions for further work.
2. Weighted Misclassification Loss

Total misclassification loss is often the primary criterion for classification, where penalties for false positives and false negatives are the same. However, there are many circumstances where false positive and false negative classifications have different consequences and should be weighted differently.

An instance of this is individual-level HIV treatment monitoring in RLS. Failure of antiretroviral treatment (ART) is indicated by a VL value above a certain threshold. In RLS, VL assessment may be limited by logistics, cost and technology (Wang, Xu, & Demirci, 2010); hence, other clinical markers are used at times to predict viral failure. In this situation, false positive classification of VL failure leads to early switching to a second- or third-line ART, which may have higher toxicity and lower adherence, incurs significantly greater cost, and limits treatment options over the long term. False negative classification of viral failure results in lack of regimen switch, risk of drug resistance accumulation, and increased morbidity and mortality. These considerations may motivate prioritization of avoiding either false negative or false positive classifications. For example, if avoiding unnecessary switches to second-line therapy is prioritized, the loss function would assign greater weight to false positive misclassifications.

Another example is initial diagnosis of breast cancer malignancy from digitized images. Breast cancer is one of the largest causes of cancer deaths among women (Siegel, Naishadham, & Jemal, 2012). Patient survival depends largely on early detection and accurate diagnosis. Fine needle aspiration (FNA) (Wolberg, Street, & Mangasarian, 1994) is a minimally invasive procedure that allows measurement of individual cellular characteristics, allowing algorithm-based cell image analysis for diagnosis. FNA is usually carried out when there is a breast lump previously detected by self-examination or mammography. Positive findings may lead to confirmation of breast cancer through surgical biopsy, which is accurate but requires significantly more recovery time and expense, involves pain, and carries risks associated with surgical procedures such as scarring and infection. It may be of interest to place a premium on avoiding false positives over false negatives depending on available tools and expertise, for example, when pathology requirements are not met or availability of anesthesia is limited in RLS (Saghir et al., 2011; Shyyan et al., 2006).

Our goal is to develop classification rules of the form Q(X) = Q(X; Ψ(·), c) = 1{Ψ(X) ≥ c} based on the weighted misclassification loss

   L_λ(Y, Q(X)) = λ 1{Q(X) = 0, Y = 1} + (1 − λ) 1{Q(X) = 1, Y = 0}
                = λ 1{Ψ(X) < c, Y = 1} + (1 − λ) 1{Ψ(X) ≥ c, Y = 0},

where λ ∈ (0, 1) is a user-defined weight. The loss function gives weight λ to false negative classification (misclassifying Y = 1 as 0) and 1 − λ to false positive classification (misclassifying Y = 0 as 1). At λ = 0.5 the two types of error are penalized equally; the choice of λ depends on the specific application and goals of classification.

Weighted misclassification risk is the expected loss over the underlying joint distribution of X and Y,

   R_λ(Ψ, c) = E_{X,Y}{L_λ(Y, Q(X))}
             = λ p P{Ψ(X) < c | Y = 1} + (1 − λ)(1 − p) P{Ψ(X) ≥ c | Y = 0}
             = λ p FNR(Ψ, c) + (1 − λ)(1 − p) FPR(Ψ, c),                                  (1)

where p = P(Y = 1) is the prevalence, FNR(Ψ, c) = P{Ψ(X) < c | Y = 1} the false negative rate, and FPR(Ψ, c) = P{Ψ(X) ≥ c | Y = 0} the false positive rate.

The objective of the inference problem is to find Ψ(·) and c that minimize the risk. Given a sample of data (X_1, Y_1), . . . , (X_n, Y_n), this can be operationalized as finding Ψ(·) and c that minimize the empirical risk function

   R̂_λ(Y, X; Ψ, c) = (1/n) Σ_{i=1}^n [ λ 1{Ψ(X_i) < c, Y_i = 1} + (1 − λ) 1{Ψ(X_i) ≥ c, Y_i = 0} ].     (2)
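To make the estimand concrete, the following base-R sketch evaluates the empirical risk (2) for a given score vector and threshold. The function and argument names (wrisk, score, c0, lambda) are ours, introduced only for illustration, and the example data are simulated.

# Empirical weighted misclassification risk in (2); a minimal sketch in base R.
# `score` holds the risk scores Psi(X_i), `y` the 0/1 outcomes, `c0` the
# threshold, and `lambda` the false-negative weight.
wrisk <- function(score, y, c0, lambda) {
  fn <- (score <  c0) & (y == 1)   # false negatives
  fp <- (score >= c0) & (y == 0)   # false positives
  mean(lambda * fn + (1 - lambda) * fp)
}

# Example: risk of thresholding a toy score at 0.5 with lambda = 0.2
set.seed(1)
y     <- rbinom(100, 1, 0.3)
score <- runif(100)
wrisk(score, y, c0 = 0.5, lambda = 0.2)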
3. Ensemble Learning and Conditional Thresholding

Super Learner is an ensemble learner that combines predictions produced by candidate learners from a user-defined library L of K learners. Our definition of ensemble refers to learners that combine different base algorithms; for example, we consider bagging and boosting to be single learners, since each generates one strong learner by using a single base algorithm to obtain a collection of weak learners, through bootstrap aggregation and sample re-weighting respectively. As a kind of stacked generalization (Wolpert, 1992), SL combines different learning algorithms over the same data, and algorithms such as bagging, boosting, and random forest can all be included as candidate learners in an SL library.

Consider a library L = (Ψ_1, . . . , Ψ_K) where each candidate learner Ψ_k ∈ L is a mapping from X to IR, and the prediction score Ψ(X) = Σ_{k=1}^K α_k Ψ_k(X) is a convex combination of predictions from the library of learners. Deriving an optimal classification rule requires finding the value of (α, c) that minimizes (2).

An intuitive approach to determining the rule is conditional thresholding; that is, first derive a prediction score Ψ̂(X) = Σ_{k=1}^K α̂_k Ψ̂_k(X) using SL, and then identify an optimal threshold value ĉ based on the score Ψ̂(X). In SL, the α coefficients are derived by cross validation against a loss function such as squared-error loss or an ROC-based loss. The conditional thresholding approach estimates the threshold c by plugging Ψ̂(X) into (2) and minimizing over c. However, the procedure does not reflect out-of-sample performance because it does not use the cross-validated weighted misclassification risk.

Conditional thresholding for classification based on SL proceeds as follows (a code sketch of the threshold search in Step 5 is given after the algorithm):

1. Fit each learner Ψ_k ∈ L to the entire data set {(Y_i, X_i)}_{i=1}^n and generate score predictions Ψ̂_k(X_i) for k = 1, . . . , K and i = 1, . . . , n.

2. Carry out D-fold cross validation. Partition the data into D parts indexed by d = 1, . . . , D. For the d-th partition, T(d) and V(d) are the training and validation data splits respectively. Fit each candidate learner Ψ_k to T(d), yielding Ψ̂_{k,T(d)}, and generate its prediction on V(d), written as Ψ̂_{k,T(d)}(X_{V(d)}), for k = 1, . . . , K and d = 1, . . . , D.

3. For each candidate learner Ψ_k, stack together the D fold-specific predictions Ψ̂_{k,T(d)}(X_{V(d)}), d = 1, . . . , D, to get Z_k = {Ψ̂_{k,T(d)}(X_{V(d)}), d = 1, . . . , D}, an n × 1 vector. For the i-th observation, define d_i = d(X_i) as its validation fold index, i.e., d_i = 2 if X_i ∈ X_{V(2)}. Write the n × K cross-validated prediction matrix as Z = {Z_1, . . . , Z_K}, where the element in the i-th row and k-th column, Z_{ik} = Ψ̂_{k,T(d_i)}(X_i), is an out-of-sample prediction made by the k-th candidate learner for the i-th observation, a member of validation fold V(d_i). Estimate α in m(Z; α) = Σ_{k=1}^K α_k Z_k by minimizing a risk E{L̃(Y, m(Z; α))} for a convex loss L̃; e.g., ordinary linear regression corresponds to quadratic loss with E(Y | Z; α) = m(Z; α).

4. Combine α̂ from Step 3 with the full-data predictions Ψ̂_k(X_i), k = 1, . . . , K, from Step 1, and obtain the SL score Ψ̂_SL(X_i; α̂) = Σ_{k=1}^K α̂_k Ψ̂_k(X_i).
5. Estimate the classification threshold c by minimizing the empirical risk function (2) using the SL score as the risk score,

   ĉ = argmin_c Σ_{i=1}^n [ λ 1{Ψ̂_SL(X_i; α̂) < c, Y_i = 1} + (1 − λ) 1{Ψ̂_SL(X_i; α̂) ≥ c, Y_i = 0} ].

6. For any x ∈ X, the classification rule is Q̂(x) = Q(x; Ψ̂_SL(·; α̂), ĉ) = 1{Ψ̂_SL(x; α̂) ≥ ĉ}.
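Because the empirical risk in Step 5 is piecewise constant in c, its minimum can be found by evaluating the risk at the observed score values. The sketch below (our own helper, not part of any package) implements this line search, reusing the wrisk() function and toy data from the sketch in Section 2.

# Line search for the threshold in Step 5: the empirical risk only changes at
# observed score values, so it suffices to evaluate candidate cutoffs there.
best_cut <- function(score, y, lambda) {
  cand  <- sort(unique(score))                                    # candidate thresholds
  risks <- vapply(cand, function(c0) wrisk(score, y, c0, lambda), numeric(1))
  cand[which.min(risks)]
}

# Example: conditional thresholding of an already-fitted score (here the toy
# `score` from the earlier sketch plays the role of the SL prediction)
c_hat <- best_cut(score, y, lambda = 0.2)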
4. Joint Thresholding

As we will show in empirical examples and simulations, conditional thresholding may over-estimate the actual risk. We propose instead to estimate the classification threshold c based on the cross-validated predictions Z (defined in Step 3 of Section 3) within the SL algorithm. In our approach, Steps 1 and 2 are the same as above. Steps 3 and 5 are replaced by simultaneous estimation of α and c satisfying

   (α̃, c̃) = argmin_{(α, c)} Σ_{i=1}^n [ λ 1{m(Z_i; α) < c, Y_i = 1} + (1 − λ) 1{m(Z_i; α) ≥ c, Y_i = 0} ].     (3)

The SL score in Step 4 and the classification rule in Step 6 are then updated to Ψ̂_SL(X_i; α̃) = Σ_{k=1}^K α̃_k Ψ̂_k(X_i) and Q̂(x) = Q(x; Ψ̂_SL(·; α̃), c̃) = 1{Ψ̂_SL(x; α̃) ≥ c̃} accordingly.

Optimizing the empirical risk in equation (3) is complicated by the discontinuities introduced by the indicator functions. Common optimization methods such as Newton-Raphson cannot be applied because they require the existence of first- or second-order derivatives. The lack of smoothness and convexity makes other optimization methods difficult as well. Moreover, the minimizer of the objective function is not unique, due to the non-convexity of the weighted misclassification loss.

According to the Bayes rule, the optimal threshold for (3) is 1 − λ when m(Z; α) = P(Y = 1 | X). In practice, the threshold c = 1 − λ is valid only when the risk score is a consistent estimate of P(Y = 1 | X) and the sample size is sufficiently large. However, the true underlying data-generating mechanism is very complicated in most applications. Furthermore, whether or not the ensemble score is a probability estimate depends on the investigators' intention and study goal, and a good risk score for classification does not have to be a probability estimate (e.g., SVM).
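To see why the Bayes threshold equals 1 − λ, write p(X) = P(Y = 1 | X) and compare the conditional expected losses of the two actions: classifying X as positive costs (1 − λ){1 − p(X)}, while classifying it as negative costs λ p(X). The positive action is preferred whenever

   (1 − λ){1 − p(X)} ≤ λ p(X)   ⟺   1 − λ ≤ p(X){λ + (1 − λ)} = p(X),

so the optimal rule thresholds the true class probability at 1 − λ.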
One approach to minimizing the empirical weighted misclassification risk is to approximate the weighted misclassification loss with a smooth, tractable loss function. Buja et al. (2005) used integrals of beta distribution functions to approximate the indicator functions in the loss, in principle enabling the use of standard optimization algorithms. However, in practice the nonconvexity of the smooth approximation can undermine the invertibility of the Hessian in Newton updates and cause the optimization procedure to fail.

Another approach is to reformulate the minimization problem as a linear program with equilibrium constraints (LPEC), a special case of hierarchical mathematical programming that consists of two levels of optimization. Mangasarian (1994) studied total misclassification loss and suggested a Frank-Wolfe type iterative algorithm that approximates the minimum by moving the solution towards the minimum of a linear approximation of the objective function over the same domain. Chen and Mangasarian (1996) proposed a hybrid algorithm as an accelerated approximation to the algorithm in Mangasarian (1994), which is computationally costly. The hybrid algorithm iteratively estimates α by replacing the indicator function 1(x > 0) with the convex surrogate max(1 + x, 0) at a fixed c, and estimates c by minimizing the objective function at a fixed α. Solving an LPEC is costly and computationally intensive because the minimization problem is NP-hard (Chen & Mangasarian, 1996).

We consider two options for solving equation (3) within SL. One is to approximate the solution in two separate steps: (1) use non-negative least squares linear regression to estimate α̃ and then normalize it to sum to one, and (2) conduct a line search to estimate c̃ conditional on the estimated α̃. We refer to this procedure as Two-Step Minimization in our simulations and data applications. It can be extended further by using any convex and continuous surrogate L̃(Y, m(Z; α)) for the estimation of α̃. The process can be described as follows (a code sketch appears after the two steps):

(3a) Estimate α̃ in m(Z; α̃) = Σ_{k=1}^K α̃_k Z_k by argmin_α E{L̃(Y, m(Z; α))}; e.g., if L̃ is squared-error loss, we use ordinary least squares regression of Y on Z.

(3b) Estimate c̃ by conditional minimization using the cross-validated predictions Z,

   c̃ = argmin_c Σ_{i=1}^n [ λ 1{m(Z_i; α̃) < c, Y_i = 1} + (1 − λ) 1{m(Z_i; α̃) ≥ c, Y_i = 0} ].

When the surrogate loss L̃ in Step 3a involves the threshold c, an iterative procedure similar to Chen and Mangasarian (1996) can be used to estimate (α̃, c̃). This two-step procedure provides flexibility in that the user can choose the surrogate loss L̃ based on context. If minimizing the surrogate loss produces a risk score that discriminates the data well, the resulting classification rule will be a good approximation to a minimizer of the weighted misclassification risk.
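A minimal sketch of Two-Step Minimization with a squared-error surrogate is given below. It uses nnls::nnls() for the non-negative least squares fit in Step 3a and reuses best_cut() and wrisk() from the earlier sketches; the helper name two_step and the toy prediction matrix Z are ours, not code from the paper's appendix.

# Two-Step Minimization (sketch): (3a) non-negative least squares for the
# ensemble weights, normalized to sum to one; (3b) line search for the cutoff.
library(nnls)

two_step <- function(Z, y, lambda) {
  fit   <- nnls::nnls(as.matrix(Z), y)        # Step 3a: alpha >= 0
  alpha <- fit$x
  alpha <- alpha / sum(alpha)                 # normalize to sum to one
  sc    <- as.vector(as.matrix(Z) %*% alpha)  # combined cross-validated score
  list(alpha = alpha, c = best_cut(sc, y, lambda))   # Step 3b
}

# Example with a toy 2-column cross-validated prediction matrix
Z <- cbind(score, pmin(pmax(score + rnorm(100, 0, 0.1), 0), 1))
two_step(Z, y, lambda = 0.2)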
The second option is to estimate α and c by bounded-region optimization, performing a controlled random search over the bounded region. First, note that the inequality Σ_{k=1}^K Ψ_k(·) α_k > c can be written as Σ_{k=1}^K Ψ_k(·) α*_k > c*, with α*_k = α_k/α_1 and c* = c/α_1 when α_1 > 0. In our analysis, the coefficients (α_1, . . . , α_K) are constrained to be nonnegative and normalized to sum to one, and without loss of generality α_1 is designated as the coefficient with the largest estimated value from the initialization described in Step 1 below.

Assume that the coefficient estimates from an initialization based on some convex loss function are close to the solutions that minimize the weighted misclassification risk. Then, with the initialized value of α* in [0, 1]^K, we can estimate α* and c* by searching for the optima in an enlarged bounded region. One way to search for the optima is to randomly generate a large, user-specified number of initial points in the bounded region and run a controlled random search (using the crs2lm function in the optimization package nloptr (Johnson, 2014) in R). We recommend controlled random search because it does not rely on analytic properties of the objective function for global optimization, hence avoiding the issues caused by nonconvexity and nonsmoothness. Other direct search methods are also applicable here, such as the simplex algorithm (Nelder & Mead, 1965) and differential evolution (Storn & Price, 1997). This procedure is referred to as CRS Minimization in Sections 5 and 6, and proceeds as follows (a code sketch follows the three steps):

1. Initialize the controlled random search at (α̂*(0), ĉ*(0)), calculated as follows. Obtain α̂ = (α̂_1, . . . , α̂_K)^T by regressing Y on Z (defined in Step 3 of Section 3) under squared-error loss with a nonnegativity constraint. Locate the estimated coefficient with the largest value; without loss of generality, assume α̂_1 = max_{k ∈ {1,...,K}} α̂_k. Define α̂*(0) = (1, α̂_2/α̂_1, . . . , α̂_K/α̂_1)^T and estimate ĉ*(0) = argmin_c Σ_{i=1}^n [ λ 1{Y_i = 1, Z_i α̂*(0) < c} + (1 − λ) 1{Y_i = 0, Z_i α̂*(0) ≥ c} ] by line search.

2. Apply the controlled random search on an enlarged nonnegative bounded region; in the following sections the enlarged region is chosen empirically based on the magnitude of the α̂*'s. Obtain α̂* = (α̂*_1, . . . , α̂*_K) and ĉ* from the controlled random search as estimates of a minimizer of equation (3).

3. Normalize both the coefficient and threshold estimates by Σ_{k=1}^K α̂*_k, so that the coefficient estimates sum to one.

The controlled random search does not give an unbiased or efficient estimate of the class probability, even if the resulting scores are scaled to lie between zero and one. The estimated scores need not have a probabilistic interpretation and the solutions may not be unique, but they are all valid approximations in terms of minimizing the weighted misclassification loss.
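The following sketch carries out the three steps with nloptr's crs2lm(), reusing wrisk(), best_cut() and two_step() from the earlier sketches. The packing of the parameter vector as (c*, α*), the width of the search region (upper) and the evaluation budget (maxeval) are illustrative choices of ours, not the settings used in our analysis.

# CRS Minimization (sketch): jointly search over (alpha*, c*) with crs2lm().
library(nloptr)

crs_min <- function(Z, y, lambda, upper = 2) {
  Z <- as.matrix(Z); K <- ncol(Z)

  obj <- function(theta) {                       # empirical risk (3) at theta
    c0 <- theta[1]; alpha <- theta[-1]
    wrisk(as.vector(Z %*% alpha), y, c0, lambda)
  }

  init   <- two_step(Z, y, lambda)               # Step 1: initialize from NNLS fit
  alpha0 <- init$alpha / max(init$alpha)         # rescale so the largest weight is 1
  c0     <- best_cut(as.vector(Z %*% alpha0), y, lambda)

  sol  <- crs2lm(x0 = c(c0, alpha0), fn = obj,   # Step 2: controlled random search
                 lower = rep(0, K + 1), upper = rep(upper, K + 1),
                 maxeval = 20000)
  norm <- sum(sol$par[-1])                       # Step 3: normalize
  list(alpha = sol$par[-1] / norm, c = sol$par[1] / norm, risk = sol$value)
}

# Example on the toy data from the previous sketch
crs_min(Z, y, lambda = 0.2)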
For theoretical completeness, we show the asymptotic optimality of SL with joint thresholding under the weighted misclassification loss via the following theorem.

Theorem. Let S represent a random data split, stochastically independent of the observations, resulting in a training set and a test set of non-negligible size. For the weighted misclassification loss L_λ(Y, Q(X)) at a given λ ∈ (0, 1), and classifiers in Q = {Q_θ : θ ∈ Θ_n}, where θ = (α, c), Q_θ(x) = 1{Σ_{k=1}^K α_k Ψ_k(x) ≥ c}, and Θ_n is a bounded, discretized parameter space for θ, we have

   E_S R_λ(Q_θ̂(P_S^0)) ≤ E_S R_λ(Q_θ̃(P_S^0)) + O( log(n)/√n ),

where n is the sample size of the entire data set, and E_S R_λ(Q_θ̂(P_S^0)) and E_S R_λ(Q_θ̃(P_S^0)) are the risks, averaged over data splits, of SL with joint thresholding and of an oracle classifier, respectively.

The web appendix (A3) provides details of the notation and a proof of the theorem. The oracle classifier is the best classifier among all classifiers of the form Q_θ(x) estimated from the data, in the sense that it is closest in risk to the unparametrized true minimizer of the weighted misclassification risk. The theorem says that, under the weighted misclassification loss, SL with parameters estimated from equation (3) converges to the oracle in terms of average risk as the sample size grows to infinity.

5. Simulation Studies

The key objective of our simulation study is to compare our approach with the conditional thresholding approach in terms of out-of-sample weighted misclassification risk, both when the true model can be easily recovered and when it cannot. The simulations are set up so that the optimal classification rule is known and can be compared with the rules obtained from conditional and joint thresholding. We first simulate two datasets, D_1 and D_2, of equal size n. For weighted misclassification loss at a fixed λ ∈ (0, 1), we fit the methods on D_1 and use them to estimate three classification rules. For each rule and each value of λ, the corresponding out-of-sample weighted misclassification risk is calculated by applying the derived rule to D_2.

Two data generating mechanisms, adapted from Kang and Schafer (2007), are considered. For each, the observed binary outcome Y_i is generated from an underlying score Ỹ_i via Y_i = 1(Ỹ_i ≥ c_0), where the cutoff c_0 is chosen to guarantee a 30% prevalence. The underlying score is generated by Ỹ_i = b_0 + U_i b + ε_i, where b_0 = 210 and b = (27.4, 13.7, 13.7, 13.7)^T; for i = 1, . . . , n, ε_i ∼ N(0, 1) and U_i = (U_{i1}, U_{i2}, U_{i3}, U_{i4}) ∼ N(0, I_{4×4}).

In the first setting we treat U as observed covariates and use (Y, U) to derive classification rules. In the second setting, instead of using U, we observe X = g(U), where g : IR^4 → IR^4 is given by g_1(u) = exp(u_1/2), g_2(u) = u_2/(1 + exp(u_1)) + 10, g_3(u) = (u_1 u_3/25 + 0.6)^3, and g_4(u) = (u_2 + u_4 + 20)^2. More detailed rationale for this parameterization is given in Kang and Schafer (2007).
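The data-generating mechanism for the second setting can be written compactly in R. The sketch below follows the description above (coefficient values as in Kang and Schafer, 2007); imposing the 30% prevalence through the empirical 70th percentile of the latent score, and the illustrative sample size, are our own choices.

# Simulate one dataset under the second (transformed-covariate) setting.
sim_data <- function(n) {
  U  <- matrix(rnorm(n * 4), n, 4)
  b0 <- 210; b <- c(27.4, 13.7, 13.7, 13.7)
  ytilde <- b0 + U %*% b + rnorm(n)                     # latent score
  y  <- as.integer(ytilde >= quantile(ytilde, 0.70))    # ~30% prevalence
  X  <- cbind(exp(U[, 1] / 2),
              U[, 2] / (1 + exp(U[, 1])) + 10,
              (U[, 1] * U[, 3] / 25 + 0.6)^3,
              (U[, 2] + U[, 4] + 20)^2)                 # observed covariates g(U)
  data.frame(y = y, X1 = X[, 1], X2 = X[, 2], X3 = X[, 3], X4 = X[, 4])
}

d1 <- sim_data(5000)   # training data (size chosen for illustration only)
d2 <- sim_data(5000)   # test data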
Under either setting, the true probability score is Ψ_0(U) = P(Y = 1 | U) = P(ε ≥ c_0 − b_0 − U b | U). The optimal Bayes classification rule based on the true probability score is Q_0(U) = 1{Ψ_0(U) ≥ 1 − λ}, which provides a reference standard for assessing classification accuracy. The out-of-sample risk for the optimal Bayes classification rule is approximated by

   R̂_λ(D_2; Q_0) = (1/n) Σ_{(U_i, Y_i) ∈ D_2} [ λ 1{Q_0(U_i) = 0, Y_i = 1} + (1 − λ) 1{Q_0(U_i) = 1, Y_i = 0} ].

We therefore use R̂_λ(D_2; Q_0) as the reference for the relative differences displayed in Table 1. The relative difference in estimated weighted misclassification risk for an estimated classification rule Q̂(·) at penalty λ is defined as {R̂_λ(D_2; Q̂) − R̂_λ(D_2; Q_0)} / R̂_λ(D_2; Q_0).

We consider using four and eight candidate algorithms for SL. In the case of K = 4, the SL library L includes random forest (Breiman, 2001), logistic regression, a generalized additive model (Hastie & Tibshirani, 1990) and CART (Breiman et al., 1984). For K = 8, four additional candidate algorithms are added to L: 10-nearest-neighbors, generalized boosting (Friedman, 2001), support vector machine (Hsu, Chang, Lin, et al., 2003) and bagging classification (Breiman, 1996). Logistic regression and the generalized additive model are fitted using maximum likelihood without penalization; we use linear main-effect terms for logistic regression and quadratic main-effect splines for the generalized additive model.

Relative risk differences for each of the SL classification methods discussed above are summarized in Table 1. In the first setting, when the covariates U are used to develop the classification rule, conditional thresholding and the joint thresholding rules (two-step and CRS minimization) have, as expected, similar out-of-sample performance, differing from the optimal classification rule by less than 2% in relative terms across all values of λ. In the second setting, where classification is based on X, joint thresholding clearly outperforms conditional thresholding for most λ values. Relative to the optimal Bayes classification rule based on the true probability score Ψ_0(U), conditional thresholding is worse by 5 to 23% across λ, while joint thresholding differs by 2.3% or less.

6. Data Applications

In this section we illustrate the proposed methods on Kenyan clinical HIV data and the Wisconsin diagnostic breast cancer data. We use the same SL libraries with four and eight learners as in the simulations. The first data set was used in Liu et al. (2017) and the second data set is available in the UCI machine learning repository (Lichman, 2013).

The Kenyan HIV data were derived from three studies conducted at the Academic Model Providing Access to Healthcare (AMPATH) in Eldoret, Kenya: (i) the “Crisis” study (n=191) (Mann et al., 2013), conducted in 2009-2011 to investigate the impact of the 2007-2008 post-election violence in Kenya on ART failure and drug resistance; (ii) the “second-line” study (n=394) (Diero et al., 2014), conducted in 2011-2012 to investigate ART failure and drug resistance on second-line ART; and (iii) the “TDF” study (n=333) (Brooks et al., 2016), conducted in 2012-2013 to investigate the impact of WHO guideline changes to TDF-based first-line ART on HIV treatment failure and drug resistance.
The data include covariate information on age, gender, nadir CD4 count, CD4 count, CD4 percent, adherence to ART, time since starting current ART, and slope of CD4 percent progression, together with the outcome VL. Our interest is to develop a classification rule to predict HIV virological failure (VL above a pre-specified threshold) from these markers. Table 2 reports the cross-validated weighted misclassification risks: joint thresholding attains lower risk at λ = 0.2, 0.5 and 0.8 than the conditional thresholding approach for both data illustrations. In addition, the difference in cross-validated risks between conditional and joint thresholding is smaller for eight learners than for four learners. Table 3 shows, for the Kenyan HIV data, the estimated coefficients and thresholds (α̂, ĉ) at λ = 0.2 and 0.8 for conditional thresholding and our proposed approaches. In our analysis, both conditional thresholding and two-step minimization derive coefficients by nonnegative least squares linear regression normalized to sum to one, hence their coefficient estimates are the same and stay fixed across λ. For CRS minimization, the coefficient-estimating procedure involves optimizing a function of λ, so the estimated values may vary with λ. Threshold estimation depends on λ for all approaches. Conditional thresholding and two-step minimization share the same estimated risk scores for all λ values, but their threshold estimates differ because of the different thresholding methods.

Figure 1 provides the cross-validated risk comparison on a finer scale for the Kenyan HIV data and the Wisconsin diagnostic breast cancer data, considering both four and eight candidate learners. Differences in cross-validated risks between conditional and joint thresholding across λ are larger when there are fewer candidate learners in the SL library, and the joint thresholding approaches generally achieve better cross-validated risks than conditional thresholding.

Figure 2 illustrates why misclassification differs between the thresholding methods. The vertical lines in Figure 2 indicate threshold values estimated at λ = 0.8 based on the SL predictions, the cross-validated SL predictions, and the cross-validated Z α̂ predictions within SL, correspondingly from left to right. Under conditional thresholding, the rule is conditional on the SL prediction, which may be subject to over-fitting (the first panel). By contrast, two-step and CRS minimization use cross-validated predictions for selecting the threshold, and do this within the SL (the third panel), whose distribution is similar to that of the cross-validated prediction of the entire SL (middle panel). As expected from the risk analysis of the methods, the estimated thresholds are very different for conditional and joint thresholding. In Figure 2, the threshold estimates from joint thresholding and from cross-validating the entire SL are very similar, which further explains the out-performance of joint thresholding relative to conditional thresholding. Nonetheless, threshold calculation by cross-validating the entire SL requires substantially more computation time and complexity in both estimation and evaluation, and may over-cross-validate the data when the sample size is not sufficiently large. Furthermore, none of the estimated thresholds is close to the optimal Bayes threshold of 1 − λ = 0.2, implying that none of the three types of predictions derived from SL closely matches the unknown underlying true class probability, and underscoring the need to derive the classification rule by direct optimization.
7. Summary and Discussion
Weighted misclassification risk is often used to evaluate predictions when false positives and false negatives need to be prioritized differently. However, inference and rule derivation using weighted misclassification risk are less common because of the associated numerical difficulties. For binary classification using SL, we aimed to optimally estimate both the threshold and the ensemble weights associated with the candidate learners by minimizing the weighted misclassification risk. Through simulations and data examples, we showed that conditional thresholding may generate sub-optimal classification rules. We proposed two options for joint thresholding, both embedding the threshold estimation procedure within SL, and showed that our proposal performs similarly to or outperforms the conditional thresholding approach in determining optimal classification rules.

Our method presents a new way to estimate the classification rule. We showed that rules developed under either two-step or CRS minimization generally have lower error rates than conditional thresholding. From the comparison of density curves in Figure 2 among the SL prediction, the cross-validated SL prediction, and Zα within SL in the data application, we can see that conditional thresholding tends to overfit the data, resulting in an incorrectly estimated risk. We therefore expect the risk from conditional thresholding to be closer to the true risk in two situations: first, when the training data distribution represents the true underlying data distribution well, so that there is little difference in distribution between the training and test sets; and second, when the ensemble risk score discriminates the data well for both the training and the test set. Consistent with this, Figure 1 shows that the difference between conditional and joint thresholding becomes smaller with more candidate learners in the SL library.

Our work also provides a general framework for using ensemble learners for binary classification. Although we consider weighted loss functions as in (3), our method can potentially be extended to more general threshold-based classifications, with the numerical optimization method adapted to the nature of the threshold-based classification loss. We expect similar properties to hold for other classification problems that involve threshold estimation, when the classification loss is a measurable function of the classifiers.

Furthermore, from Figure 2, we anticipate the performance of our method to be comparable to threshold estimation based on cross-validated SL predictions. This is important for settings where computational complexity is high, or where the SL library and the data are large, as threshold estimation based on cross-validated SL predictions may require a substantial amount of computation, adding complexity and difficulty to method evaluation.

Code for the analysis was written in R (R Core Team, 2016) and is available in the web appendix.

Table 1: Out-of-sample weighted misclassification risk of the optimal Bayes classification rule (%) in the first row, and relative difference in out-of-sample weighted misclassification risk (%) at λ = 0.2, 0.5 and 0.8 for the simulation studies described in Section 5, stratified by estimation method and number of learners in the SL library. The relative difference for a derived classification rule Q̂(·) at λ is (R̂_λ(D_2; Q̂) − R̂_λ(D_2; Q_0)) / R̂_λ(D_2; Q_0), where Q_0 is the optimal Bayes classification rule based on the true probability score Ψ_0(U).
                                      4 Learners              8 Learners
                                 λ = 0.2   0.5    0.8     λ = 0.2   0.5    0.8
  R̂_λ(D_2; Q_0) (%)                6.1    14.8   12.9       6.1    14.8   12.9
  Simulation 1
    Conditional Thresholding        0.0     0.0    1.6       0.0     0.0    1.6
    Two-Step Minimization           0.0     0.0    0.8       0.0     0.0    1.6
    CRS Minimization                0.0     0.0    0.8       0.0     0.0    1.6
  Simulation 2
    Conditional Thresholding       16.4    12.8   11.6       8.2     6.1    9.3
    Two-Step Minimization           0.0     0.0    2.3       0.0     0.0    2.3
    CRS Minimization                0.0     0.0    2.3       0.0     0.0    2.3

Table 2: Cross-validated weighted misclassification risk (%) at λ = 0.2, 0.5 and 0.8 for the two data examples, stratified by estimation method and number of learners (4 and 8) in the SL library.

Table 3: Estimated (α̂, ĉ) at λ = 0.2 and 0.8 for the Kenyan HIV data, under the estimation methods Conditional Thresholding (CT), Two-Step Minimization (2-Step) and CRS Minimization (CRS); rows give the coefficients α̂ for random forest, logistic regression, quadratic splines, CART, 10-nearest-neighbors, generalized boosting, SVM and bagging, and the threshold ĉ. The asterisk (*) indicates that the α̂ are the same for CT and 2-Step regardless of λ.

Figure 1: Comparison of cross-validated weighted misclassification risk as a function of λ for the two data applications, stratified by number of learners in the SL library.

Figure 2: Density curves of the SL prediction, the 10-fold cross-validated prediction of SL (CV-SL) and the combined cross-validated prediction within SL (Zα) for the breast cancer data, using least squares regression under nonnegativity and sum-to-one constraints on the coefficients for combining the eight library prediction scores. The vertical line represents the estimated threshold based on the corresponding type of prediction at λ = 0.8.
Web Appendix

A1. Application to SECOM Data
The following analysis uses the large-p SECOM data set (Dheeru & Karra Taniskidou, 2017) from the UCI machine learning repository. Each sample in the SECOM data set represents a single production entity from a modern semi-conductor manufacturing process; the outcome represents a pass/fail yield for in-house line testing and the features are measured signals from the monitoring system. The original data set has 1567 samples, 591 features, and 104 fails in its outcome. The number of features is over one third of the sample size, and standard small-p methods such as linear regression and additive splines do not converge for these data.

We cleaned the data so that all features have more than one unique value and each feature has either no missing values or at least 30 missing values. After accounting for missingness by creating missing indicators and filling in 0's for missing measurements, there are 1436 samples, 484 features, and 100 fails.

We used three candidate learners for this large-p application: random forest, lasso, and leekasso. Leekasso fits a linear model using the top 10 most significant predictors, identified by fitting a univariate model for each covariate (a code sketch is given at the end of this subsection). These three candidate learners were chosen for convergence and computation-speed considerations.

The cross-validated risk plot below shows that the proposed joint thresholding can be applied to large-p problems with properly selected candidate learners, and that our method still performs better than conditional thresholding.

Figure 3: Comparison of cross-validated weighted misclassification risk as a function of λ for the SECOM data application between conditional thresholding and joint thresholding (CRS and two-step minimization), under the following candidate learners in the SL library: random forest, lasso, and leekasso.
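As referenced above, a minimal base-R sketch of the leekasso learner is shown here; the function name and the use of p-values from per-covariate linear models are our own rendering of the description above, not code from the paper's web appendix.

# Leekasso (sketch): keep the 10 covariates with the smallest univariate
# p-values, then fit an ordinary linear model on those covariates only.
leekasso <- function(X, y, k = 10) {
  X <- as.data.frame(X)
  pvals <- vapply(X, function(xj) {
    summary(lm(y ~ xj))$coefficients[2, 4]   # p-value of the single predictor
  }, numeric(1))
  keep <- order(pvals)[seq_len(min(k, ncol(X)))]
  list(fit = lm(y ~ ., data = X[, keep, drop = FALSE]), keep = keep)
}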
A2. Comparison with Candidate Learners – SECOM Data

Because the candidate learners return risk estimates, which are continuous scores, each individual candidate learner also needs a threshold in order to assess its classification performance under the weighted misclassification loss. For a single learner, the joint thresholding method reduces to deriving a threshold based on the cross-validated predictions of that learner. To see how much additional gain the ensemble learner achieves, we used the SECOM data example again. The 10-fold cross-validated risks are shown in Figure 4, and Table 4 gives the values of the curves in Figure 4 at λ = 0.1, 0.2, . . . , 0.9.
From Figure 4 we can see that the performance of SL with joint thresholding is similar to that of its best candidate learner, which in this case is the random forest with joint thresholding.

Figure 4: Comparison of cross-validated weighted misclassification risk with joint thresholding as a function of λ for the SECOM data application, among the SL (CRS minimization and two-step minimization) and the individual candidate learners (random forest, lasso, and leekasso).

Table 4: Cross-validated weighted misclassification risk with joint thresholding at λ from 0.1 to 0.9 for the SECOM data application, stratified by estimation method.
A3. Theoretical Justification
Our theoretical justification follows a similar road map to van der Laan et al. (2007) and van der Vaart, Dudoit, and van der Laan (2009). We first introduce and revise some notation.

Given the false negative penalty λ ∈ (0, 1), the optimal classifier is

   Q_0 = argmin_Q E_{(Y,X)} L_λ(Y, Q(X)) = argmin_Q R_λ(Q),

where L_λ(Y, Q(X)) = λ 1{Q(X) = 0, Y = 1} + (1 − λ) 1{Q(X) = 1, Y = 0} and

   R_λ(Q) = ∫ L_λ(y, Q(x)) dP(x, y).

In the above expression, P is the unknown true joint distribution of the outcome Y and the covariates X.

The distance between any two classifiers Q_1 and Q_2 is defined as the absolute difference between their risks, d_λ(Q_1, Q_2) = |R_λ(Q_1) − R_λ(Q_2)|.

For SL, there is a collection of scores from the K candidate learners, (Ψ_1, . . . , Ψ_K). We parameterize the classification rule as Q_θ(·), where θ = (α, c), α is the vector of coefficients that linearly combines the K scores, and c is the threshold applied to the combined score for binary classification,

   Q_θ(x) = 1{ Σ_{k=1}^K α_k Ψ_k(x) ≥ c }.

Without loss of generality, we assume the coefficients in α are constrained so that α_k ∈ [0, 1], k = 1, . . . , K, and Σ_{k=1}^K α_k = 1. Denote by Ω the range of the combined score predictions, assumed to be bounded; the parameter space of θ can then be written as Θ = [0, 1]^K × Ω. We consider a grid Θ_n of θ values in the bounded parameter space Θ, and let K(n) = |Θ_n| be the number of grid points, with K(n) ≤ n^q for some constant q < ∞. We then consider Q = {Q_θ : θ ∈ Θ_n} as the collection of candidate classifiers.

Next, we formalize cross-validation as in van der Vaart et al. (2009) and describe our estimators accordingly. Let S = (S_1, . . . , S_n) ∈ {0, 1}^n be a random vector independent of the data samples X_1, . . . , X_n; S_i = 0 indicates that the i-th sample belongs to the training set, and S_i = 1 that it belongs to the test set. We can then define the empirical distributions of the training and test sets by

   P_S^j = (1/n_j) Σ_{i: S_i = j} δ_{X_i},   n_j = Σ_{i=1}^n 1{S_i = j},   j = 0, 1,

where δ_{X_i}(x) = 1{x ≥ X_i} for i = 1, . . . , n. Q_θ(P_S^0) is a classifier estimated from the training set,

   Q_θ(P_S^0)(x) = 1{ Σ_{k=1}^K α_k Ψ_k(P_S^0)(x) ≥ c }.

An oracle selector of θ is one whose corresponding classifier, estimated from the training set, minimizes the risk under the unknown distribution P averaged over the splits,

   θ̃ = argmin_{θ ∈ Θ_n} E_S [ ∫ L_λ(y, Q_θ(P_S^0)(x)) dP(x, y) ] = argmin_{θ ∈ Θ_n} E_S [ R_λ(Q_θ(P_S^0)) − R_λ(Q_0) ].

Its corresponding classifier is Q_θ̃; among all classifiers of this form estimated from the training set, it is the closest in distance to the true risk minimizer Q_0.

Cross validation replaces P by the test-set distribution P_S^1 and uses θ̂_n as the cross-validated selector of θ,

   θ̂_n = argmin_{θ ∈ Θ_n} E_S [ ∫ L_λ(y, Q_θ(P_S^0)(x)) dP_S^1(x, y) ].

The realization of θ̂_n is the parameter estimate for SL with joint thresholding as presented in the paper,

   θ̂_n ≡ argmin_{θ ∈ Θ_n} (1/n) Σ_{i=1}^n [ λ 1{Σ_{k=1}^K α_k Z_{ik} < c, Y_i = 1} + (1 − λ) 1{Σ_{k=1}^K α_k Z_{ik} ≥ c, Y_i = 0} ],

where Z = {Ψ̂_{k,T(d)}(X_{V(d)}), k = 1, . . . , K, d = 1, . . . , D} is the stacked matrix of cross-validated predictions defined in Sections 3 and 4. In Section 4, the (α̂, ĉ) obtained at the end of our proposed procedure is the solution of, or an approximation to, θ̂_n.

The goal of this section is to prove that the risk, averaged over data splits, of the classifier estimator Q_θ̂_n converges asymptotically to that of the oracle classifier Q_θ̃.
van der Vaart et al. (2009, Theorem 2.3) established the following inequality, relating the risks, averaged over data splits, of the cross-validated selector and the oracle selector.

Theorem 1. (van der Vaart et al., 2009) For Q ∈ Q, let (M(Q), v(Q)) be a Bernstein pair for the measurable function z ↦ L(z, Q), and assume that R(Q) = ∫ L(z, Q) dP(z) ≥ 0 for every Q ∈ Q. Then for any δ > 0 and 1 ≤ p ≤ 2,

   E_S R(Q_θ̂(P_S^0)) ≤ (1 + 2δ) E_S R(Q_θ̃(P_S^0))
        + (1 + δ) E_S[(c_0/n_1)^{1/p}] × log(1 + |Q|) × sup_{Q ∈ Q} [ M(Q) n_1^{1/p − 1} + ( v(Q)/R(Q)^{2−p} )^{1/p} ((1 + δ)/δ)^{2/p − 1} ],

where c_0 is an absolute constant.

Recall that for a measurable function f : X → IR, (M(f), v(f)) is a pair of Bernstein numbers if

   2 M(f)^2 P( e^{|f|/M(f)} − 1 − |f|/M(f) ) ≤ v(f),

and it was shown in van der Vaart et al. (2009) that if f is uniformly bounded, then (||f||_∞, 1.5 P f^2) is a pair of Bernstein numbers.
Lemma 1. For the weighted misclassification loss (x, y) ↦ L_λ(y, Q(x)), where Q : x ↦ {0, 1} and λ ∈ (0, 1), a Bernstein pair (M(Q), v(Q)) is given by

   M(Q) = max{λ, 1 − λ}   and   v(Q) = 1.5 [ λ^2 P(Q(X) = 0, Y = 1) + (1 − λ)^2 P(Q(X) = 1, Y = 0) ].

Furthermore, v(Q) ≤ 1.5 × max{λ, 1 − λ} × R_λ(Q).

Proof. The loss function L_λ(y, Q(x)) = λ 1{Q(x) = 0, y = 1} + (1 − λ) 1{Q(x) = 1, y = 0} has range {0, λ, 1 − λ} and hence is bounded by max{λ, 1 − λ}. By definition, the risk can be written as R_λ(Q) = λ P(Q(X) = 0, Y = 1) + (1 − λ) P(Q(X) = 1, Y = 0). Moreover,

   E_{(x,y)}[ L_λ^2(y, Q(x)) ] = E_{(x,y)}[ λ^2 1{Q(x) = 0, y = 1} + (1 − λ)^2 1{Q(x) = 1, y = 0} ]
                               = λ^2 P(Q(X) = 0, Y = 1) + (1 − λ)^2 P(Q(X) = 1, Y = 0)
                               ≤ max{λ, 1 − λ} × R_λ(Q).

Therefore, v(Q) = 1.5 E_{(x,y)}[ L_λ^2(y, Q(x)) ] ≤ 1.5 × max{λ, 1 − λ} × R_λ(Q).

Applying Theorem 1 (van der Vaart et al., 2009) together with the lemma, we obtain the following result for the weighted misclassification loss.

Theorem 2. For classifiers in Q = {Q_θ : θ ∈ Θ_n} and the weighted misclassification loss L_λ(Y, Q(X)) at a given λ ∈ (0, 1),

   E_S R_λ(Q_θ̂(P_S^0)) ≤ E_S R_λ(Q_θ̃(P_S^0)) + O( log(n)/√n ).

Proof. Write C = max{λ, 1 − λ}. Recall that we assumed the cardinality of Θ_n satisfies K(n) ≤ n^q, and by definition |Q| = K(n); hence log(1 + |Q|) ≤ q log(n). Therefore, applying the inequality in Theorem 1 with the Bernstein pair of Lemma 1, for any δ > 0 and 1 ≤ p ≤ 2,

   E_S R_λ(Q_θ̂(P_S^0)) ≤ (1 + 2δ) E_S R_λ(Q_θ̃(P_S^0))
        + (1 + δ) E_S[(c_0/n_1)^{1/p}] × log(1 + |Q|) × sup_{Q ∈ Q} [ C n_1^{1/p − 1} + ( 1.5 C × R_λ(Q)/R_λ(Q)^{2−p} )^{1/p} ((1 + δ)/δ)^{2/p − 1} ].

Furthermore, since R_λ(Q) ≤ C and |Q| ≤ n^q,

   E_S R_λ(Q_θ̂(P_S^0)) ≤ (1 + 2δ) E_S R_λ(Q_θ̃(P_S^0))
        + (1 + δ) E_S[(c_0/n_1)^{1/p}] × q log(n) × [ C n_1^{1/p − 1} + (1.5 C^p)^{1/p} ((1 + δ)/δ)^{2/p − 1} ]
      ≤ (1 + 2δ) E_S R_λ(Q_θ̃(P_S^0))
        + (1 + δ) E_S[(c_0/n_1)^{1/p}] × q log(n) × [ C n_1^{1/p − 1} + 1.5^{1/p} C ((1 + δ)/δ)^{2/p − 1} ].

The sample size of the test set is approximately a fixed proportion of the entire data, so n_1 is of the same order as n. Let p = 2 and δ = 1/√n; then the above inequality becomes

   E_S R_λ(Q_θ̂(P_S^0)) ≤ E_S R_λ(Q_θ̃(P_S^0)) + 2δ + (1 + δ) E_S[(c_0/n_1)^{1/2}] × q log(n) × [ C n_1^{−1/2} + 1.5^{1/2} C ]
                        = E_S R_λ(Q_θ̃(P_S^0)) + O( log(n)/√n ).

The remainder term log(n)/√n goes to zero asymptotically. As long as

   log(n) / { √n E_S R_λ(Q_θ̃(P_S^0)) } → 0   as n → ∞,     (4)

Q_θ̂_n is asymptotically equivalent to the oracle estimator Q_θ̃ in terms of their true risks, averaged over data splits, when the estimators are fit on the training set:

   E_S R_λ(Q_θ̂(P_S^0)) / E_S R_λ(Q_θ̃(P_S^0)) → 1   as n → ∞.

When equation (4) does not hold, Q_θ̂(P_S^0) achieves the log(n)/√n rate:

   E_S R_λ(Q_θ̂(P_S^0)) = O( log(n)/√n ).

This section shows that the performance of the SL classification rule with joint thresholding of Section 4 is asymptotically equivalent to that of the oracle under the stated conditions.
References
Birkner, M. D., Kalantri, S. P., Solao, V., Badam, P., Joshi, R., Goel, A., . . . Hubbard, A. E. (2007, June). Creating diagnostic scores using data-adaptive regression: An application to prediction of 30-day mortality among stroke victims in a rural hospital in India. Therapeutics and Clinical Risk Management, (3), 475.
Breiman, L. (1996). Bagging predictors. Machine Learning, (2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, (1), 5–32.
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and Regression Trees. CRC Press.
Brooks, K., Diero, L., DeLong, A., Balamane, M., Reitsma, M., Kemboi, E., . . . others (2016). Treatment failure and drug resistance in HIV-positive patients on tenofovir-based first-line antiretroviral therapy in western Kenya. Journal of the International AIDS Society, (1).
Buja, A., Stuetzle, W., & Shen, Y. (2005). Loss functions for binary class probability estimation and classification: Structure and applications.
Chen, C., & Mangasarian, O. L. (1996). Hybrid misclassification minimization. Advances in Computational Mathematics, (1), 127–136.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, (1), 21–27.
Dheeru, D., & Karra Taniskidou, E. (2017). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. Retrieved from http://archive.ics.uci.edu/ml
Diero, L., DeLong, A., Schreier, L., Kemboi, E., Orido, M., & Rono, M. (2014). High HIV resistance and mutation accrual at low viral loads upon 2nd-line failure in western Kenya. In Conference on Retroviruses and Opportunistic Infections. Boston, MA.
Duda, R. O., Hart, P. E., et al. (1973). Pattern Classification and Scene Analysis (Vol. 3). Wiley New York.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1189–1232.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized Additive Models (Vol. 43). CRC Press.
Haybittle, J. L., Blamey, R. W., Elston, C. W., Johnson, J., Doyle, P. J., Campbell, F. C., . . . Griffiths, K. (1982). A prognostic index in primary breast cancer. British Journal of Cancer, (3), 361.
Hsu, C., Chang, C., Lin, C., et al. (2003). A practical guide to support vector classification.
Johnson, S. G. (2014). The NLopt nonlinear-optimization package. (R package version 1.0.4)
Kaelo, P., & Ali, M. M. (2006). Some variants of the controlled random search algorithm for global optimization. Journal of Optimization Theory and Applications, (2), 253–264.
Kang, J. D., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 523–539.
Kruppa, J., Ziegler, A., & König, I. R. (2012, October). Risk estimation and risk prediction using machine-learning methods. Human Genetics, (10), 1639–1654. doi: 10.1007/s00439-012-1194-y
Lichman, M. (2013). UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml
Liu, T., Hogan, J. W., Daniels, M. J., Coetzer, M., Xu, Y., Bove, G., . . . Kantor, R. (2017, August). Improved HIV-1 viral load monitoring capacity using pooled testing with marker-assisted deconvolution. JAIDS Journal of Acquired Immune Deficiency Syndromes, (5), 580. doi: 10.1097/QAI.0000000000001424
Mangasarian, O. L. (1994). Misclassification minimization. Journal of Global Optimization, (4), 309–323.
Mann, M., Diero, L., Kemboi, E., Mambo, F., Rono, M., Injera, W., . . . others (2013). Antiretroviral treatment interruptions induced by the Kenyan postelection crisis are associated with virological failure. Journal of Acquired Immune Deficiency Syndromes, (2), 220.
Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. The Computer Journal, (4), 308–313.
R Core Team. (2016). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria.
Saghir, N. S. E., Adebamowo, C. A., Anderson, B. O., Carlson, R. W., Bird, P. A., Corbex, M., . . . others (2011). Breast cancer management in low resource countries (LRCs): Consensus statement from the Breast Health Global Initiative. The Breast, S3–S11.
Shim, J., Mclernon, D. J., Hamilton, D., Simpson, H. A., Beasley, M., & Macfarlane, G. J. (2018, July). Development of a clinical risk score for pain and function following total knee arthroplasty: Results from the TRIO study. Rheumatology Advances in Practice, (2). doi: 10.1093/rap/rky021
Shyyan, R., Masood, S., Badwe, R. A., Errico, K. M., Liberman, L., Ozmen, V., . . . Vass, L. (2006). Breast cancer in limited-resource countries: Diagnosis and pathology. The Breast Journal, (s1), S27–S37.
Siegel, R., Naishadham, D., & Jemal, A. (2012). Cancer statistics, 2012. CA: A Cancer Journal for Clinicians, (1), 10–29.
Storn, R., & Price, K. (1997). Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, (4), 341–359.
Tate, J. P., Justice, A. C., Hughes, M. D., Bonnet, F., Reiss, P., Mocroft, A., . . . others (2013). An internationally generalizable risk index for mortality after one year of antiretroviral therapy. AIDS, (4), 563.
van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, (1).
van der Vaart, A. W., Dudoit, S., & van der Laan, M. J. (2009). Oracle inequalities for multi-fold cross validation. Statistics & Decisions, (3), 351–371. doi: 10.1524/stnd.2006.24.3.351
Wang, S., Xu, F., & Demirci, U. (2010). Advances in developing HIV-1 viral load assays for resource-limited settings. Biotechnology Advances, (6), 770–781.
Wolberg, W. H., Street, W. N., & Mangasarian, O. L. (1994). Machine learning techniques to diagnose breast cancer from image-processed nuclear features of fine needle aspirates. Cancer Letters, (2), 163–171.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, (2), 241–259.

Sample Codes

# normalize to make coefficients sum up to one
bcrs = bcrs / norm
cutoff = crssol$par[1] / norm