Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost Functions
Vladimir Golkov, Alexander Becker, Daniel T. Plop, Daniel Čuturilo, Neda Davoudi, Jeffrey Mendenhall, Rocco Moretti, Jens Meiler, Daniel Cremers
Computer Vision Group, Technical University of Munich, Germany
Center for Structural Biology, Vanderbilt University, USA
Institute for Drug Discovery, Leipzig University, Germany
{vladimir.golkov, alexander.becker, daniel.plop, daniel.cuturilo, neda.davoudi, cremers}@tum.de, {jeffrey.l.mendenhall, jens.meiler}@vanderbilt.edu, [email protected]
Abstract
Computer-aided drug discovery is an essential component of modern drug development. Therein, deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features. Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets. In this work we argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance, its ability to compromise over different decision thresholds, certain freedom to influence the relative weights in this compromise, fidelity to typical benchmarking measures, and equivalence to positive/unlabeled learning. We also propose new training schemes (coherent mini-batch arrangement, and usage of out-of-batch samples) for cost functions based on the ROC, as well as a cost function based on the logAUC metric that facilitates early enrichment (i.e. improves performance at high decision thresholds, as often desired when synthesizing predicted hit compounds). We demonstrate that these approaches outperform standard deep learning approaches on a series of PubChem high-throughput screening datasets that represent realistic and diverse drug discovery campaigns on major drug target families.
1 Introduction

Drug discovery is a long, complex, and expensive process, through which new chemical compounds can be identified and that, if successful, may lead to new pharmaceutical drugs. In the testing process of compounds to be identified as potential drugs, we generally have many more negative samples than positive ones. A preselection in silico (virtual screening) using machine learning must select only a small percentage among millions of candidates, because running subsequent in vitro experiments on each preselected compound is expensive. In other words, machine learning for virtual screening is used with high decision thresholds. A particularly appropriate family of cost functions for this task is based on the receiver operating characteristic (ROC) curve. These functions are inherently robust to class imbalance, and some of them are specifically optimized for high decision thresholds. The ROC curve for a binary classification problem plots the true positive rate (TPR) as a function of the false positive rate (FPR). Each point on the curve is obtained by choosing a classification threshold. The area under the ROC curve (AUC) is a measure of classifier performance.

1.1 Reasons to use ROC-based cost functions in drug discovery
In the following we outline five important reasons to use ROC-based cost functions (i.e. to optimize the AUC statistic or similar statistics relatively directly) in virtual screening, instead of using the typical cross-entropy cost function.
Class imbalance
Typical cost functions such as cross-entropy do not work well when applied to datasets where class sizes are strongly imbalanced and classes are not easily separable. The minority class has little contribution to the cross-entropy cost function, making it inexpensive for the classifier to misclassify most of the minority class. In contrast, AUC-based cost functions sum over positive-negative sample pairs (rather than over individual samples), resulting in every class appearing equally often in the cost term. This makes AUC-based cost functions immune to class imbalance.
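As a concrete illustration of this pairwise view, the AUC can be computed directly as the fraction of correctly ranked positive-negative pairs, which makes its insensitivity to class proportions explicit. A minimal NumPy sketch (function names are ours, not from the paper):

```python
import numpy as np

def auc_pairwise(scores_pos, scores_neg):
    """AUC as the fraction of positive-negative pairs ranked correctly.

    Each of the m*n pairs contributes equally, so multiplying the number of
    negatives does not change the expected value: the statistic is
    insensitive to class imbalance.
    """
    diff = scores_pos[:, None] - scores_neg[None, :]  # all m*n pairwise differences
    return ((diff > 0) + 0.5 * (diff == 0)).mean()    # ties count half

pos = np.array([0.9, 0.8, 0.4])       # scores for positive samples
neg = np.array([0.7, 0.3, 0.2, 0.1])  # scores for negative samples
print(auc_pairwise(pos, neg))         # 11 of 12 pairs ranked correctly
```

The same quantity could be obtained by sweeping thresholds and integrating the resulting ROC curve; the pairwise form is what the cost functions below build on.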
High decision thresholds
Due to the high cost of in vitro experiments, only a small percentage of samples can be selected, i.e. the decision threshold must be high. The left part of the ROC corresponds to such high decision thresholds. Hence, optimizing this part of the ROC specifically targets our goals. This is why advanced quality metrics for virtual screening focus on the left part of the ROC [Mysinger and Shoichet, 2010]. In Section 2.2.2, we introduce a novel ROC-based cost function that focuses on maximizing the area under the left part of the ROC curve.
Not knowing the exact threshold in advance
In virtual screening we may not know the decision threshold for classification in advance. If this is the case, it is reasonable to consider all realistic thresholds when measuring or optimizing classifier performance. The ROC (and AUC) consolidate classifier quality over all possible thresholds. The left part of the ROC (and quality metrics that focus on it) consolidate classifier quality across various high decision thresholds.
Better benchmarking results
Evaluation metrics typically reported in method benchmarks are based on the AUC for the reasons listed above. Thus, if many benchmarks use ROC-based quality metrics and one wants to surpass current methods in these benchmarks, one can ideally use cost functions that directly aim to optimize those quality metrics.
PU learning
The machine learning task where only positive and unlabeled samples are available for training is referred to as positive/unlabeled learning (PU learning). An example application of PU learning in drug discovery is the classification of small molecules into drug-like and non-drug-like compounds. While we can assemble lists of drugs and known drug-like compounds as positive training samples, it is difficult to come up with lists of definitely non-drug-like compounds, or at least none that are not trivially non-drug-like. However, we can come up with sets of unlabeled compounds with unknown drug-likeness. Zhang and Lee [2008] show that PU learning is equivalent to treating the unlabeled samples as negative while optimizing an AUC-based cost function. Ren et al. [2018] show the effectiveness of AUC maximization for highly imbalanced PU learning for the special case where a part of the training data is mislabeled and/or some features are redundant.
While there are important reasons to optimize the AUC, it also gives rise to some challenging problems. Firstly, the gradient of AUC with respect to network weights is zero almost everywhere, because AUC changes only when the ranks of predictions for a positive sample and a negative sample are swapped. Therefore, AUC cannot be optimized directly by first-order optimization methods. Secondly, the AUC is a sum over pairs of samples rather than over individual samples, making its direct computation slow. In the following we outline typical solutions to these problems.
Approximations to AUC with non-zero gradient
Different approximations to the AUC have been proposed that have non-zero gradients (on more than only a null set of weight space), allowing gradient-based optimization.

A sum of sigmoidal functions with the arguments scaled by a pre-defined value is a good approximation to the AUC [Yan et al., 2003]. However, it comes at the cost of creating very steep gradients depending on the choice of this scaling value, making optimization difficult. One of the differentiable AUC approximations introduced by Yan et al. [2003] does not have the issue of steep gradients. The proposed objective function dynamically adjusts a sample pair's contribution to the loss, based on the score difference between the positive and the negative sample. More specifically, if the difference is larger than a specified margin, then the contribution of a given pair to the loss is zero; otherwise it is a positive value that changes smoothly with the magnitude of the score difference. This formulation makes the optimization focus on maximizing the number of pairs that have a pairwise difference larger than a given margin, enhancing generalization performance. For these reasons, we base our proposed objectives on this loss function (named R by Yan et al. [2003]).

RankOpt [Herschtal and Raskutti, 2004] is a linear classifier which uses sigmoidal functions in the cost function, with the scaling value being calculated from the data instead of fixing it a priori. The algorithm is also computationally efficient, being linear in the number of samples as opposed to the quadratic run time of other methods. However, being a linear classifier, RankOpt is not directly compatible with deep learning.

The AUC takes only ranks into account and ignores the exact prediction scores, which might contain valuable additional information about the model's quality: for example, a model that separates positives from negatives by a wide score margin might be more promising than one that separates them only barely, even if the ranking is identical.
(This limitation of the AUC also causes the zero-gradient problem.) To overcome this issue, a method called scored ROC was proposed [Wu et al., 2007], which is based on reducing the scores for positives by a number between 0 and 1, the so-called margin. To this end, the scored ROC curve (not to be confused with the ROC) plots margins against the corresponding AUC. The area under the scored ROC curve, called scored AUC, measures how quickly the AUC declines when classifier outputs for positives are reduced. This metric has non-zero gradients with respect to sample scores.

Calders and Jaroszewicz [2007] use a polynomial (rather than sigmoidal) approximation of the step function. Their approximation allows the sum over pairs of samples to be reformulated into a sum over individual samples, thus reducing the runtime of an epoch from quadratic to linear complexity. We achieve such a runtime reduction with a sigmoidal approximation by using a lookup table that maps decision thresholds to false positive rates (see Section 2.2.2). Ferri et al. [2005] use a linear (rather than sigmoidal or polynomial) approximation of the step function. A study of the bias of several AUC approximations was performed by Vanderlooy and Hüllermeier [2008].

Direct optimization of AUC without gradient-based methods
Optimizing AUC directly using coordinate descent (i.e. optimizing one model parameter at a time, which is feasible despite the zero gradient) yields good results for certain machine learning methods that were designed specifically for genetics applications [Zhu et al., 2017]. LeDell et al. [2016] introduce an ensemble approach based on Super Learner [van der Laan et al., 2007] which adjusts the combination of scores from several individual classifiers in favor of a higher AUC. They show that even though none of the base classifiers is specifically trained to maximize AUC, the Super Learner ensemble outperforms the top base algorithm, especially on data with high class imbalance.
Online AUC optimization
We use out-of-batch predictions (see Section 2.2.1) and a lookup table (see Section 2.2.2). These techniques are related to online AUC optimization (i.e. training where data are not available all at once), which rewrites the sum of losses over sample pairs into a sum of losses of individual samples and uses buffers for positive and negative training samples [Zhao et al., 2011] or stores the first- and second-order statistics of training data [Gao et al., 2013].
In this section we will first introduce the AUC-based cost functions upon which our work relies; then we introduce the novel cost functions. The main contributions are: identification of five reasons for using ROC-based cost functions in virtual screening (Section 1.1), a novel algorithm
AUC-prev that uses out-of-batch predictions for overall prediction improvement in AUC optimization (Section 2.2.1), a new cost function L_logAUC that optimizes the ROC curve using a novel reweighting scheme for different decision thresholds, along with a lookup table for fast computation and a stop-gradient operator that prevents degenerate solutions (Section 2.2.2), and a comparison of five cost functions using four quality metrics and nine representative drug discovery datasets.

2.1 Approximation of the AUC

The AUC is the fraction of all positive-negative pairs where the positive has a higher prediction than the negative. In other words,

$$\mathrm{AUC} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} H(x_i - y_j), \qquad (1)$$

where $H(z) = [z > 0]$ is the Heaviside step function, the $x_i$ are the predictions for all $m$ positive samples, and the $y_j$ are the predictions for all $n$ negative samples.

Yan et al. [2003] proposed an approximation to $H(x_i - y_j)$ with partially non-zero gradients:

$$f(x_i, y_j) = \begin{cases} \left(-(x_i - y_j - \gamma)\right)^p & \text{if } x_i - y_j < \gamma, \\ 0 & \text{otherwise,} \end{cases} \qquad (2)$$

where 0 < γ ≤ 1 and p > 1 (usually 2 or 3) are hyperparameters.

When comparing two classifiers that have the same AUC value, one of them may be better in the sense that it separates the positive scores from the negative scores by a larger margin. Equation (2) incorporates the margin γ to distinguish between these cases. The average of f(x_i, y_j) taken over all pairs (x_i, y_j), i.e.

$$L_{\mathrm{AUC}} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} f(x_i, y_j), \qquad (3)$$

is a cost function that encourages positive samples to have a score that is higher by at least γ than the score for negative samples. If this condition is not met, then that particular positive-negative pair contributes to the loss.

The left part of the ROC changes if positive-negative pairs that have relatively high scores swap ranks. This portion of the ROC curve describes the classifier performance at high decision thresholds, i.e.
when selecting only the top few percent of candidates, which is usually the case in drug discovery due to the high cost of subsequent in vitro experiments. To optimize for such situations, Yan et al. [2003] further modify Eq. (3) to transform the scores of positive and negative samples by a function g(·) before passing the pair to the cost function:

$$L_{\mathrm{leftAUC}} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} f(g(x_i), g(y_j)), \qquad (4)$$

where

$$g(s) = \begin{cases} (s - \beta\mu_s)^\alpha & \text{if } s > \beta\mu_s, \\ 0 & \text{otherwise,} \end{cases} \qquad (5)$$

with hyperparameters α > 1 but close to 1, and β ≥ 1, where µ_s is the mean value of the classifier scores over all samples. Positive samples that have a high score will be mapped to a value that depends on the magnitude of the difference between the score of that sample and the classifier mean score. This has the effect of pushing these samples even further in the direction of high positive classification.

Based on the AUC and its approximation, Eq. (3), we propose two new objective functions for training any parametric classifier to directly optimize the ROC curve using gradient-based methods. We use them with a multilayer perceptron with softmax outputs for applications in drug discovery.
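A minimal NumPy sketch of Eqs. (2)-(5) follows; the hyperparameter defaults are our assumptions, and in practice these losses would be written in a deep learning framework so that gradients flow through the scores:

```python
import numpy as np

def l_auc(scores_pos, scores_neg, gamma=0.3, p=2):
    """Eq. (3): mean of f(x_i, y_j) from Eq. (2) over all positive-negative pairs.
    Pairs already separated by a margin of at least gamma contribute zero."""
    diff = scores_pos[:, None] - scores_neg[None, :]      # x_i - y_j for all pairs
    return np.where(diff < gamma, (gamma - diff) ** p, 0.0).mean()

def l_leftauc(scores_pos, scores_neg, gamma=0.3, p=2, alpha=1.1, beta=1.0):
    """Eq. (4): same pairwise loss, but scores are first passed through the
    transform g of Eq. (5), which emphasizes samples scoring above beta * mean."""
    mu = np.concatenate([scores_pos, scores_neg]).mean()  # mean classifier score
    g_pos = np.maximum(scores_pos - beta * mu, 0.0) ** alpha  # g(x_i)
    g_neg = np.maximum(scores_neg - beta * mu, 0.0) ** alpha  # g(y_j)
    diff = g_pos[:, None] - g_neg[None, :]
    return np.where(diff < gamma, (gamma - diff) ** p, 0.0).mean()

pos, neg = np.array([0.9, 0.6]), np.array([0.5, 0.1])
print(l_auc(pos, neg))  # only the (0.6, 0.5) pair violates the 0.3 margin
```

Note that `np.maximum(s - beta * mu, 0.0) ** alpha` is equivalent to the case distinction in Eq. (5) while avoiding fractional powers of negative numbers.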
2.2.1 Out-of-batch predictions (AUC-prev)

A mini-batch that consists of a fraction 1/k of all positive and negative samples computes 1/k of the overall cost for usual (sample-wise) cost functions, but only 1/k² (if pairs are coherent, see Section 2.3) or less of a pair-wise cost. (The datasets we use have very few negative samples, not only in terms of relative class imbalance but also in terms of absolute numbers, i.e. the advantages of the following method can be expected to be even more pronounced on larger, including imbalanced, datasets.) Therefore, the most recent predictions for all samples (including out-of-batch ones) are also used for loss computation, but are considered constants that do not depend on the network weights. The proposed mini-batch-wise objective function is

$$L_{\text{AUC-prev}} = L_{\mathrm{AUC}}(X_{\mathrm{curr}}, Y_{\mathrm{curr}}) + L_{\mathrm{AUC}}(X_{\mathrm{curr}}, Y_{\mathrm{prev}}) + L_{\mathrm{AUC}}(X_{\mathrm{prev}}, Y_{\mathrm{curr}}), \qquad (6)$$

where X_curr, Y_curr are current predictions for positive resp. negative samples from the current mini-batch and X_prev, Y_prev are the most recent predictions for all positive resp. negative samples. With this cost function, the network weights are optimized by considering the loss contribution not only of positive-negative pairs from the current mini-batch, but also of pairs which consist of a current-mini-batch sample and an out-of-batch sample. Compared to other approaches, this procedure allows the loss to be based upon more samples than are present in the mini-batch, reducing noise in the loss and making the training more stable.

2.2.2 Optimizing logAUC for early enrichment

If the goal is to optimize for early enrichment, i.e. not missing out on good candidates at high decision thresholds, then a suitable performance measure is the area under a part of the lin-log ROC curve, called logAUC [Mysinger and Shoichet, 2010]. This quality measure shares some properties with the AUC statistic (e.g. robustness to class imbalance) but is biased towards early enrichment. The logAUC metric is often used for measuring the quality of methods that were trained with other metrics such as cross-entropy.
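The bookkeeping behind Eq. (6) can be sketched as follows. This is a simplified NumPy mock-up in which the caches merely accumulate predictions; in the actual setting, the caches would hold the detached most recent prediction for every training sample:

```python
import numpy as np

class AUCPrevLoss:
    """Sketch of Eq. (6): pair current-batch predictions with cached
    out-of-batch predictions, which are treated as constants."""

    def __init__(self, gamma=0.3, p=2):
        self.gamma, self.p = gamma, p
        self.x_prev = np.empty(0)  # cached predictions for positives
        self.y_prev = np.empty(0)  # cached predictions for negatives

    def _pair_loss(self, x, y):
        """L_AUC of Eq. (3) over all pairs between x and y."""
        if x.size == 0 or y.size == 0:
            return 0.0
        d = x[:, None] - y[None, :]
        return np.where(d < self.gamma, (self.gamma - d) ** self.p, 0.0).mean()

    def __call__(self, x_curr, y_curr):
        loss = (self._pair_loss(x_curr, y_curr)          # in-batch pairs
                + self._pair_loss(x_curr, self.y_prev)   # vs. cached negatives
                + self._pair_loss(self.x_prev, y_curr))  # vs. cached positives
        # In a real framework the caches would hold detached (gradient-free)
        # copies of the most recent prediction per sample, not a growing list.
        self.x_prev = np.concatenate([self.x_prev, x_curr])
        self.y_prev = np.concatenate([self.y_prev, y_curr])
        return loss
```

In an autograd framework, gradients would flow only through `x_curr` and `y_curr`, matching the "considered constants" treatment of the out-of-batch predictions.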
Defining quality differently during training than during evaluation is suboptimal. We develop a new cost function targeted at directly optimizing the logAUC metric.

We observe that the AUC, Eq. (1), can be interpreted as integration over the ROC curve, which consists of n stripes of equal width:

$$\mathrm{AUC} = \frac{1}{mn}\sum_{j=1}^{n}\sum_{i=1}^{m} H(x_i - y_j) \qquad (7)$$
$$= \sum_{j=1}^{n}\left[\frac{j+1}{n} - \frac{j}{n}\right]\frac{1}{m}\sum_{i=1}^{m} H(x_i - y_j), \qquad (8)$$

where we assume the negatives to be sorted in order of descending classifier output. The term $\frac{1}{m}\sum_{i=1}^{m} H(x_i - y_j)$ corresponds to the TPR at the decision threshold $y_j$, i.e. to the height of the $j$th stripe, and $\left[\frac{j+1}{n} - \frac{j}{n}\right] = \frac{1}{n}$ is the stripe width ($\frac{j}{n}$ and $\frac{j+1}{n}$ are the left and right coordinates of the stripe, respectively). So $\left[\frac{j+1}{n} - \frac{j}{n}\right]\frac{1}{m}\sum_{i=1}^{m} H(x_i - y_j)$ is the area of the $j$th stripe, and the outer sum goes over all $n$ stripes.

Computing the area under the part of the curve where FPR is in [λ; 1] (with e.g. λ = 0.001) instead of [0; 1] can be done as follows:

$$\mathrm{AUC}_\lambda = \sum_{j=1}^{n}\left[\mathrm{clip}_{[\lambda;1]}\!\left(\frac{j+1}{n}\right) - \mathrm{clip}_{[\lambda;1]}\!\left(\frac{j}{n}\right)\right]\frac{1}{m}\sum_{i=1}^{m} H(x_i - y_j), \qquad (9)$$

where $\mathrm{clip}_{[a;b]}(z) = \min\{\max\{z, a\}, b\}$ clips the abscissa coordinates to the interval [λ; 1].

The area logAUC under the log-transformed ROC curve can be computed by transforming the abscissa coordinates (in square brackets) to logarithmic scale:

$$\mathrm{logAUC}_\lambda = \sum_{j=1}^{n}\left[\log\!\left(\mathrm{clip}_{[\lambda;1]}\!\left(\frac{j+1}{n}\right)\right) - \log\!\left(\mathrm{clip}_{[\lambda;1]}\!\left(\frac{j}{n}\right)\right)\right]\frac{1}{m}\sum_{i=1}^{m} H(x_i - y_j). \qquad (10)$$

Clipping to [λ; 1] is essential because otherwise logAUC would be infinite.

Analogously to the explanation for AUC above, logAUC has gradient zero almost everywhere. In order to optimize logAUC with gradient-based methods, we approximate it by replacing the step function H in Eq. (10) by a smooth function, as was done for AUC in Eqs. (1)-(3).
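The stripe decomposition of Eqs. (8)-(10) can be sketched in NumPy as follows (function names are ours; the default λ = 0.001 follows Mysinger and Shoichet [2010]):

```python
import numpy as np

def stripe_widths_log(n, lam=0.001):
    """The bracketed terms of Eq. (10): log-space stripe widths, with the
    abscissa (FPR) coordinates clipped to [lam, 1]."""
    j = np.arange(1, n + 1)
    left = np.clip(j / n, lam, 1.0)
    right = np.clip((j + 1) / n, lam, 1.0)
    return np.log(right) - np.log(left)

def log_auc(scores_pos, scores_neg, lam=0.001):
    """Eq. (10): area under the lin-log ROC curve."""
    y = np.sort(scores_neg)[::-1]                          # negatives, descending
    tpr = (scores_pos[:, None] > y[None, :]).mean(axis=0)  # TPR at threshold y_j
    return (stripe_widths_log(len(y), lam) * tpr).sum()

# A perfectly separating classifier on 2 negatives: TPR = 1 on every stripe
print(log_auc(np.array([0.9, 0.8]), np.array([0.2, 0.1])))  # = log(2) here
```

Replacing the indicator comparison by the smooth f of Eq. (2), with the stripe widths held constant, yields the surrogate used in Eq. (11).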
Defining

$$w_j := \log\!\left(\mathrm{clip}_{[\lambda;1]}\!\left(\frac{j+1}{n}\right)\right) - \log\!\left(\mathrm{clip}_{[\lambda;1]}\!\left(\frac{j}{n}\right)\right),$$

we obtain the logAUC objective function

$$L_{\mathrm{logAUC}} = \sum_{j=1}^{n} w_j \sum_{i=1}^{m} f(x_i, y_j) = \sum_{j=1}^{n}\sum_{i=1}^{m} w_j f(x_i, y_j), \qquad (11)$$

where the weighting factor $w_j$ from Eq. (10) was pulled into the inner sum such that each positive-negative pair has a weighting factor for batch-based training.

This cost function directly optimizes the logAUC metric by individually scaling each of the n equal-width stripes of which the AUC is composed (like the logAUC metric does by log-transforming the abscissa of the ROC), thus giving more importance to the left part of the ROC curve.

Acceleration of rank computation
There are a couple of implementation details worth mentioning. First, the usage of the weighting factor w_j requires computing the rank j of each negative sample (see the assumption under Eq. (8)) from its prediction y_j, which in turn requires sorting all negatives by their predictions. To save time, we construct a lookup table which maps thresholds to FPRs, allowing us to then estimate a negative's rank j. The lookup table is updated after every epoch (pass through all pairs) and uses equidistant keypoints in threshold space. Lookup uses linear interpolation. We found a small step size to be a good trade-off between accuracy and time requirements.

This imprecise rank computation leads to imprecise cost gradients and degenerate solutions. Details and a remedy are described in the following. The score of each sample in a mini-batch lies between two adjacent entries in the lookup table, and we interpolate linearly between these two thresholds. Smaller predictions for negatives lead (by changing the interpolation weights) to higher estimates of sample ranks (because the prediction-to-rank mapping is monotonically decreasing). This causes the estimates of the sample weights w_j (estimated stripe widths in log space) to decrease. Decreasing estimates of w_j lead to a decreasing estimate of the overall loss, cf. Eq. (11). As a smaller loss is preferred by the optimization algorithm, and gradients can flow through this entire pipeline, predicted scores for negatives are incentivized to become smaller and converge to zero very quickly. The scores for positives also converge to zero, apparently as a side effect, because the network does not have sufficient time/incentive to learn to distinguish them and simply learns to always output zero. The remedy is to conceal, during an update step, the strictly monotonic dependence of rank estimates on predictions, mimicking the zero gradient of the piecewise constant dependence of actual ranks on predictions.
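The lookup-table idea can be sketched as follows (the class name and the number of keypoints are our assumptions, since the paper's exact step size did not survive extraction):

```python
import numpy as np

class FPRLookup:
    """Maps decision thresholds to FPR estimates via a table that is refreshed
    once per epoch, avoiding a full sort of the negatives at every step."""

    def __init__(self, n_keypoints=101):
        self.keys = np.linspace(0.0, 1.0, n_keypoints)  # equidistant thresholds
        self.fpr = np.zeros(n_keypoints)

    def refresh(self, scores_neg):
        # FPR(t) = fraction of negatives predicted above threshold t
        self.fpr = (scores_neg[None, :] > self.keys[:, None]).mean(axis=1)

    def __call__(self, t):
        # linear interpolation between the two adjacent tabulated thresholds;
        # a negative's rank j is then approximately FPR(y_j) * n
        return np.interp(t, self.keys, self.fpr)
```

The estimated w_j is then derived from this interpolated FPR; as described above, gradients must not be propagated through this estimate.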
To this end, we use the stop-gradient operator sg[·], which sets gradients to zero when back-propagating through it. We replace w_j by sg[w_j]. This prevents degenerate solutions.

Datasets

Results are reported on nine large Quantitative Structure–Activity Relationship (QSAR) benchmark datasets [Butkiewicz et al., 2013] comprising small molecules labeled as active or inactive for nine protein targets. Each dataset was derived from a single screening effort in the PubChem database, and label accuracy for actives was confirmed via validating screens. As a cohesive set, these datasets avoid the construction biases often seen in other datasets, which can result in trivial decision boundaries and poor generalization (something which also affects typical docking datasets such as DUD-E [Chen et al., 2019]). They also display the realistically large numbers of diverse compounds and the class imbalances (a few hundred active molecules among far more numerous inactives) typically seen in practical drug development projects, properties often lacking in model systems. See Butkiewicz et al. [2013] for details. As input features, we used the 391 descriptors from the Reduced Short Range descriptor set, which has previously been identified as sufficiently informative and compact [Mendenhall and Meiler, 2016, Vu et al., 2019]. We normalized features using z-score scaling, which was found to be the most effective for these datasets [Mendenhall and Meiler, 2016].

Network architecture
For our experiments we adopt many of the architecture design choices that have proven to be useful [Mendenhall and Meiler, 2016] on the same PubChem QSAR datasets with standard cost functions. Throughout the experiments we use a two-layer feed-forward neural network.

Table 1: Evaluation using "AUC" as a quality metric. The loss functions L_AUC and L_AUC-prev proposed for optimizing this quality metric are highlighted in black, other loss functions in grey. Results that significantly (p < 0.05) outperform the baseline method L_cross-entropy are marked bold. On all 4 datasets on which L_AUC-prev outperforms L_cross-entropy, it also outperforms L_AUC, indicating advantages of our proposed usage of out-of-batch samples over usual batch-wise training.

Training procedure
We train on mini-batches which are constructed as follows: each mini-batch contains coherent positive-negative pairs, i.e. we randomly uniformly draw positive and negative samples from the training set and construct all possible pairs between these. The use of coherent mini-batches has many advantages, such as being memory-efficient, easy to implement, and parallelizable.

All parameters are initialized using the scheme by He et al. [2015]. For training, we used the Adam optimization algorithm [Kingma and Ba, 2014] with different learning rates for each task, and with exponential decay rates for the first and second moment estimates of 0.9 and 0.999, respectively. The optimal learning rates were found by a K-fold cross-validation and testing approach as in [Korjus et al., 2016] with K = 4. The same learning rate was best for all nine tasks and all objective functions, except for L_logAUC on dataset ID 2258, where a different learning rate was best.

Similar to [Mysinger and Shoichet, 2010], model performance is evaluated by computing logAUC for FPR in [0.001; 1]. This quality metric is very popular for chemistry-related problems such as drug discovery, where we focus on high decision thresholds. In addition to this quality metric we report two other metrics that focus on high decision thresholds, namely AUC for FPR in [0; 0.1] and logAUC for FPR in [0.001; 0.1], as well as AUC.

Tables 1–4 report these four quality metrics for each of the four ROC-based objective functions. Additionally, results are compared to the baseline method, i.e. the cross-entropy objective. Each row represents one of the nine datasets. Bold-faced numbers indicate that the corresponding objective function significantly (p < 0.05) outperforms the baseline for that particular dataset. Confidence intervals for each metric were computed by bootstrapping the test set with replacement 200 times [Mendenhall and Meiler, 2016]. For each metric, the results of the cost functions proposed specifically for that metric are shown in black.
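The bootstrapped confidence intervals can be sketched as follows (a percentile bootstrap; the function name and the percentile choice are our assumptions):

```python
import numpy as np

def bootstrap_ci(scores, labels, metric, n_boot=200, alpha=0.05, seed=0):
    """Percentile confidence interval for a ranking metric, obtained by
    resampling the test set with replacement."""
    rng = np.random.default_rng(seed)
    vals = []
    n = len(scores)
    while len(vals) < n_boot:
        idx = rng.integers(0, n, size=n)   # resample test set with replacement
        s, l = scores[idx], labels[idx]
        if l.min() == l.max():             # ranking metrics need both classes
            continue
        vals.append(metric(s[l == 1], s[l == 0]))
    vals = np.sort(np.array(vals))
    return vals[int(alpha / 2 * n_boot)], vals[int((1 - alpha / 2) * n_boot) - 1]

def auc(scores_pos, scores_neg):
    d = scores_pos[:, None] - scores_neg[None, :]
    return ((d > 0) + 0.5 * (d == 0)).mean()

scores = np.array([0.9] * 10 + [0.1] * 10)
labels = np.array([1] * 10 + [0] * 10)
print(bootstrap_ci(scores, labels, auc))  # both CI endpoints equal 1.0 here
```

Non-overlapping intervals (or a paired comparison over the same resamples) then support significance statements of the kind reported in Tables 1-4.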
The results of other cost functions are also shown, but greyed out.

Table 2: Evaluation using "AUC for FPR in [0; 0.1]" as a quality metric. The loss function L_leftAUC designed to optimize this quality metric is highlighted in black. It significantly (p < 0.05, bold) outperforms L_cross-entropy on 4 out of 9 datasets. L_logAUC, designed for a similar purpose, significantly outperforms L_cross-entropy on 5 out of 9 datasets.

Table 3: Evaluation using "logAUC for FPR in [0.001; 1]" as a quality metric. The loss function L_logAUC designed to optimize this quality metric significantly (p < 0.05, bold) outperforms L_cross-entropy on 7 out of 9 datasets.

Table 1 shows results using the AUC quality metric. The cost functions designed specifically to optimize this metric are L_AUC and L_AUC-prev. The cost function L_AUC outperforms the baseline method L_cross-entropy at a significance level of α = 0.05 on 5 out of the 9 datasets. Our procedure L_AUC-prev of using out-of-batch predictions outperforms the baseline method significantly (α = 0.05) on 4 out of 9 datasets, including one dataset on which L_AUC does not outperform L_cross-entropy. Moreover, on all 4 datasets on which L_AUC-prev outperforms L_cross-entropy, it also outperforms L_AUC (with p < 0.05 on two of the datasets), indicating advantages of our proposed usage of out-of-batch samples.

Results in Table 2 are evaluated using the "AUC for FPR in [0; 0.1]" quality metric. The function L_leftAUC was designed specifically for optimizing this metric. The results show that L_leftAUC outperforms the cross-entropy baseline on 4 out of 9 datasets at a significance level of α = 0.05.
Our objective function L_logAUC (designed for a similar purpose) significantly outperforms L_cross-entropy on 5 out of 9 datasets. This indicates that an ROC-based cost function that maximizes the area under the left part of the ROC can improve classifier performance compared to typical cost functions (such as cross-entropy) when the goal is to optimize the performance at high decision thresholds. In addition, L_logAUC significantly outperforms L_leftAUC on 4 out of 9 datasets.

Table 3 demonstrates results for the "logAUC for FPR in [0.001; 1]" quality metric. Our proposed objective function L_logAUC was specifically designed for this metric and performs significantly better (α = 0.05) than cross-entropy on 7 out of 9 datasets. The results also show that L_logAUC outperforms other ROC-based objective functions on many of the datasets. Specifically, it significantly outperforms L_AUC on 5 out of 9 datasets.

Table 4: Evaluation using "logAUC for FPR in [0.001; 0.1]" as a quality metric. The loss function L_logAUC designed to optimize this quality metric significantly (p < 0.05, bold) outperforms L_cross-entropy on 7 out of 9 datasets.

Table 4 shows results for the "logAUC for FPR in [0.001; 0.1]" quality metric, which is also optimized by our L_logAUC objective. The results demonstrate that, again, our objective function outperforms the cross-entropy baseline on 7 out of 9 datasets and L_AUC on 5 out of 9 datasets, both at a significance level of α = 0.05.

One important aspect to mention is that when using L_cross-entropy we oversample the minority class (positives) in order to compare the AUC-based cost functions against cross-entropy under ideal circumstances. Thus we should expect even more favorable results compared with cross-entropy if no oversampling is performed. The AUC-based cost functions are robust towards class imbalance, i.e.
they perform equally well without oversampling. Thus, they do not require tuning the oversampling ratio, as the cross-entropy loss does. On the other hand, they have additional hyperparameters that require tuning. Luckily, a wide range of values works well in practice [Yan et al., 2003].

Conclusions

We listed a series of special properties of virtual screening datasets, such as class imbalance, all of which can be addressed by using ROC-based cost functions. Such cost functions, in turn, have peculiarities such as the necessity for techniques to avoid zero gradients (which we borrowed from the literature), and a quadratic rather than linear number of summands (which we addressed by proposing to use out-of-batch samples and coherent mini-batches). Moreover, to optimize performance specifically for the high decision thresholds that are used in virtual screening, and to more directly optimize the logAUC quality metric that is popular in this domain, we proposed an approximation L_logAUC to logAUC with nonzero gradients. To accelerate its computation, we replaced precise computation by a lookup table, and to prevent the wrong gradient caused by this replacement from leading to degenerate solutions, we used a stop-gradient operator. Our methods outperformed cross-entropy in many scenarios in a benchmark of realistic, diverse datasets. We do not claim that these AUC losses are perfect for all situations; rather, we exemplify how losses can be aligned with project-specific goals. We encourage active exploration of loss options in virtual screening and in other applications, instead of following the old tradition of resorting to cross-entropy "by default".

Acknowledgements
We thank Benjamin P. Brown for helpful discussions.

References
M. Butkiewicz, E. W. Lowe, R. Mueller, J. L. Mendenhall, P. L. Teixeira, C. D. Weaver, and J. Meiler. Benchmarking ligand-based virtual high-throughput screening with the PubChem database. Molecules, 18(1):735–756, 2013.

T. Calders and S. Jaroszewicz. Efficient AUC optimization for classification. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 42–53. Springer, 2007.

L. Chen, A. Cruz, S. Ramsey, C. J. Dickson, J. S. Duca, V. Hornak, D. R. Koes, and T. Kurtzman. Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening. PLoS ONE, 14(8), 2019.

C. Ferri, P. Flach, J. Hernández-Orallo, and A. Senad. Modifying ROC curves to incorporate predicted probabilities. In Proceedings of the ICML 2005 Workshop on ROC Analysis in Machine Learning, 2005.

W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In International Conference on Machine Learning, pages 906–914, 2013.

K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

A. Herschtal and B. Raskutti. Optimising area under the ROC curve using gradient descent. In Proceedings of the Twenty-First International Conference on Machine Learning, page 49. ACM, 2004.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

K. Korjus, M. N. Hebart, and R. Vicente. An efficient data partitioning to improve classification performance while keeping parameters interpretable. PLoS ONE, 11(8):e0161788, 2016.

E. LeDell, M. J. van der Laan, and M. Peterson. AUC-maximizing ensembles through metalearning. The International Journal of Biostatistics, 12(1):203–218, 2016.

J. Mendenhall and J. Meiler. Improving quantitative structure–activity relationship models using artificial neural networks trained with dropout. Journal of Computer-Aided Molecular Design, 30(2):177–189, 2016.

M. M. Mysinger and B. K. Shoichet. Rapid context-dependent ligand desolvation in molecular docking. Journal of Chemical Information and Modeling, 50(9):1561–1573, 2010.

K. Ren, H. Yang, Y. Zhao, M. Xue, H. Miao, S. Huang, and J. Liu. A robust AUC maximization framework with simultaneous outlier detection and feature selection for positive-unlabeled classification. CoRR, abs/1803.06604, 2018. URL http://arxiv.org/abs/1803.06604.

M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1), Jan. 2007. doi: 10.2202/1544-6115.1309. URL https://doi.org/10.2202/1544-6115.1309.

S. Vanderlooy and E. Hüllermeier. A critical analysis of variants of the AUC. Machine Learning, 72(3):247–262, 2008.

O. Vu, J. Mendenhall, D. Altarawy, and J. Meiler. BCL::Mol2D: a robust atom environment descriptor for QSAR modeling and lead optimization. Journal of Computer-Aided Molecular Design, 33(5):477–486, 2019.

S. Wu, P. Flach, and C. Ferri. An improved model selection heuristic for AUC. pages 478–489, 2007. doi: 10.1007/978-3-540-74958-5_44.

L. Yan, R. H. Dodier, M. Mozer, and R. H. Wolniewicz. Optimizing classifier performance via an approximation to the Wilcoxon–Mann–Whitney statistic. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 848–855, 2003.

D. Zhang and W. S. Lee. Learning classifiers without negative examples: A reduction approach. In Third International Conference on Digital Information Management (ICDIM 2008), pages 638–643. IEEE, 2008.

P. Zhao, S. C. Hoi, R. Jin, and T. Yang. Online AUC maximization. 2011.

L. Zhu, H.-B. Zhang, and D.-S. Huang. Direct AUC optimization of regulatory motifs. Bioinformatics, 33(14):i243–i251, 2017.