A critical assessment of conformal prediction methods applied in binary classification settings
Damjan Krstajic
Research Centre for Cheminformatics, Jasenova 7, 11030 Beograd, Serbia
§ Corresponding author
Email address: DK: [email protected]
Abstract
In recent years there has been an increase in the number of scientific papers that suggest using conformal predictions in drug discovery. We consider that some versions of conformal predictions applied in binary settings are embroiled in pitfalls, not obvious at first sight, and that it is important to inform the scientific community about them. In the paper we first introduce the general theory of conformal predictions and follow with an explanation of the versions currently dominant in drug discovery research today. Finally, we provide cases for their critical assessment in binary classification settings.

Introduction
Conformal predictions (CP) is a very active research field in statistics and machine learning. It was introduced in Vovk et al. [1] and Saunders et al. [2] and further developed in Vovk et al. [3]. Since then several of its aspects have been developed [4]-[5], while in drug discovery it was pioneered by Norinder et al. [6]. As CP uses past experience to determine levels of confidence in new predictions, it is an approach with the potential of being useful in drug discovery projects where the issue of the applicability domain (AD) [7] is acute. Therefore, it is not surprising that in recent years there has been an increase in the number of scientific papers that suggest using CP in drug discovery [6],[8]-[15]. We consider that some of its versions applied in binary classification settings are embroiled in pitfalls, not obvious at first sight, and that it is important to inform the scientific community about them. In the paper we first introduce the general theory of CP and follow with an explanation of the versions of CP dominant in drug discovery research today. Finally, we provide the cases for their critical assessment in binary settings.

Damjan Krstajic [16] has published a criticism of a comparison between CP and QSAR applications. We will repeat his main point as it is applicable to binary classification models based on CP.

Binary classification models
A binary classification statistical model is a predictive model F() that predicts a binary variable Y using values of variables X1,…,Xm. It can be viewed as the relationship Y = F(X1,…,Xm). It is created using previously known values (Yi, Xi1,…,Xim), i = 1,…,N, which we refer to as the training data. As we are examining binary outputs, Y has only two values, which we shall refer to as positive and negative. The quality of the predictive model F() is measured by how well it predicts a previously unseen set of samples, which we refer to as the test data. There are various common measures of quality for binary classification models, such as misclassification error, specificity, sensitivity, etc. Furthermore, in binary classification settings it is common for a predictive model F() not only to predict whether something is positive or negative, but also to estimate its probability of being positive. The most common measure for assessing probabilistic classification models is the area under the ROC curve (AUROC).
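As an illustration, the quality measures mentioned above can be computed in a few lines. This is a minimal sketch with hypothetical data; the positive class is coded as 1 and the negative class as 0.

```python
# Minimal sketch of the quality measures mentioned above, using hypothetical
# data. Labels are coded 1 (positive) and 0 (negative); y_prob is the model's
# estimated probability of the positive class.

def sensitivity(y_true, y_pred):
    """Fraction of positive samples predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tp / sum(y_true)

def specificity(y_true, y_pred):
    """Fraction of negative samples predicted as negative."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tn / (len(y_true) - sum(y_true))

def auroc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative
    (ties count one half), which equals the area under the ROC curve."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test data
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

misclassification_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
```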
Conformal predictions in binary settings
Shafer and Vovk [17] designed CP for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. This means that there is only one test sample. After the test sample is predicted, its true value is revealed and it is incorporated into the training set. The new predictive model is then built and ready to predict the next test sample. In such an environment they have defined several features of CP which make it different from other statistical modelling approaches:

a) CP produces an N% prediction region, which contains possible predicted values with a probability of at least N%. In CP, binary classifiers may return the following four prediction regions for a single test sample: {positive}, {negative}, {positive and negative} and {null}. The last two predictions are usually referred to as ‘both’ and ‘empty’.

b) CP requires a nonconformity measure to be specified. Given the nonconformity measure, the conformal algorithm produces a prediction region for any specified N%.

c) CP defines a new concept of validity for prediction with confidence. As CP is defined in an on-line setting, it is repeatedly applied to an accumulating data set, and not to independent data sets. Therefore, Shafer and Vovk [17] refer to an N% prediction region in the on-line method for binary classifiers as valid if N% of these predictions contain the correct label.

d) In addition to the validity of a method for producing N% prediction regions, Shafer and Vovk [17] discuss their efficiency. They say that a prediction region is efficient if it is usually relatively small and therefore informative. In classification it is desirable to see an N% prediction region so small that it contains only a single predicted label, i.e. not ‘both’ nor ‘empty’.

e) There are other features, like exchangeability and the ability to be applied to all point estimates, that make CP very attractive, but they are not relevant for our discussion.

We think that the causes of problematic applications of CP in binary classification settings, which we will point to later in the text, arise from a misunderstanding or misrepresentation of the above concepts.

What does it mean in practice that a binary classifier provides a valid N% prediction region?
In the binary classification setting it does not mean that N% of the predictions will be correct, but that N% of the predictions will contain the correct label. This means that if we obtain 950 'both' predictions and 50 'empty' predictions from 1000 test samples, our predictions would be 95% valid, because each 'both' prediction contains the correct label. We do not see anything wrong with the definition of validity as such, but we question its practical value in the binary classification setting.

If our CP model produces an 86% valid prediction region it could be that we obtain, for example:
86% 'both' predictions and 14% 'empty'
50% 'both', 36% correct, 10% false and 4% 'empty'
15% 'both', 71% correct, 2% false and 12% 'empty'
86% correct and 14% false single predictions

We see a substantial difference in the practical value of the above cases, all having the same valid 86% prediction region.
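The point can be made concrete with a short sketch. The data below are hypothetical, constructed to match the four mixes of outcomes listed above, with all 100 true labels taken to be positive; each mix yields exactly the same 86% validity.

```python
# Sketch showing that the four very different mixes of outcomes listed above
# all yield the same 86% validity. Data are hypothetical, constructed to match
# the listed percentages; all 100 true labels are taken to be 'positive'.

def validity(regions, y_true):
    """Fraction of prediction regions that contain the correct label."""
    return sum(1 for r, y in zip(regions, y_true) if y in r) / len(regions)

y = ['positive'] * 100
both, empty = {'positive', 'negative'}, set()
correct, false = {'positive'}, {'negative'}

cases = [
    [both] * 86 + [empty] * 14,                                 # case 1
    [both] * 50 + [correct] * 36 + [false] * 10 + [empty] * 4,  # case 2
    [both] * 15 + [correct] * 71 + [false] * 2 + [empty] * 12,  # case 3
    [correct] * 86 + [false] * 14,                              # case 4
]
print([validity(c, y) for c in cases])  # [0.86, 0.86, 0.86, 0.86]
```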
How can we choose the nonconformity measure?
The nonconformity measure is a starting point for CP. Shafer and Vovk [17] define the nonconformity measure as d(ẑ(B), z), where z is a new example (test sample) and ẑ(B) is a point prediction for a new example obtained from a bag B of old examples (training samples).

Initially, Shafer and Vovk [17] provide the distance to the nearest neighbours for classification as an example of nonconformity measures in binary settings. Later, when they applied CP to Ronald A. Fisher's Iris dataset [18], they also used two other nonconformity measures: the distance to the average of each species, and a support vector machine. Here we will only present the distance to the nearest neighbours for classification as an example of a nonconformity measure d(ẑ(B), z):

d(ẑ(B), z) = (distance of z's nearest neighbour ∈ B with the same label) / (distance of z's nearest neighbour ∈ B with a different label)
As one can see in the above definition of their example of the nonconformity measure, Shafer and Vovk [17] use the values of the output binary variable Y in the training samples when calculating the nonconformity measure. We do not see anything wrong with the way Shafer and Vovk [17] have defined the nonconformity measure as such, but we would like to point out that it is not the same as the distance measures in AD [7], where only values of input variables are used, i.e. (X1,…,Xm).

Shafer and Vovk [17] say that a nonconformity measure is a real-valued function which measures how different a test sample is from training samples. In some research fields outside of CP, like AD [7], when someone "measures how different a test sample is from training samples" it is presumed that only values of input variables {X1,…,Xm} are used for calculating the measure, while in CP that is not the case.
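The nearest-neighbour nonconformity measure above can be sketched in a few lines. This is our own illustration with made-up points; note that it needs the training labels Y, not just the input variables X.

```python
# Sketch of the nearest-neighbour nonconformity measure described above: the
# distance from z to its nearest neighbour in the bag B with the same label,
# divided by the distance to its nearest neighbour with a different label.
# Note that the labels of the training samples are used, not just the inputs.
import math

def nn_nonconformity(bag, z_x, z_y):
    """bag: list of (features, label) pairs; (z_x, z_y): the new example."""
    same = min(math.dist(x, z_x) for x, y in bag if y == z_y)
    diff = min(math.dist(x, z_x) for x, y in bag if y != z_y)
    return same / diff

# Made-up training bag with two well-separated classes
bag = [((0.0, 0.0), 'negative'), ((0.1, 0.0), 'negative'),
       ((1.0, 1.0), 'positive'), ((0.9, 1.1), 'positive')]

# A point near the negative cluster conforms to 'negative' (score below 1) ...
low = nn_nonconformity(bag, (0.2, 0.1), 'negative')
# ... and is nonconforming when tentatively labelled 'positive' (score above 1).
high = nn_nonconformity(bag, (0.2, 0.1), 'positive')
```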
Conformal predictions in drug discovery
Our understanding is that Norinder et al. [6] introduced the use of CP in drug discovery research. There are, in our opinion, two major issues in their method as well as in their presentation. First, we question their choice of the nonconformity measure and, second, we demonstrate that their example is not useful in practice.

Cortés-Ciriano and Bender [19] summarise underlying concepts and practical applications of CP with a particular focus on drug discovery processes. They describe various versions of CP, and they list 28 drug discovery studies in which CP was implemented. Even though they describe current limitations in the field, our view is that Cortés-Ciriano and Bender [19] omitted to present major pitfalls and misapplications in the field. Norinder et al. [6] nicely explained the process of defining prediction regions with an example, while Cortés-Ciriano and Bender [19] just presented an example test set consisting of 10 compounds with their nonconformity scores.

We focus on explanations and results from Norinder et al. [6], because they are the only authors, as far as we are aware, who properly and in full detail explain the process of applying CP in binary settings. Even though we are critical of their approach, we find their explanations to be clear and understandable, which is not the case with other authors.

Choice of the nonconformity measure in Norinder et al. [6]
For the binary classification case, Norinder et al. [6] defined the nonconformity score to be the probability for the prediction from the decision trees in the random forest, i.e. the score of a new compound is equal to the percentage of correct predictions given by the individual decision trees in the random forest. They reference a paper by Devetyarov and Nouretdinov [20] for using such a nonconformity score. Devetyarov and Nouretdinov [20] mention three types of nonconformity measures, of which the first one is equal to the percentage of correct predictions given for the sample by decision trees. However, the experiments and results in Devetyarov and Nouretdinov [20] are all for the other two types of nonconformity measures, which means that, apart from mentioning it, Devetyarov and Nouretdinov [20] do not provide any practical justification for using the nonconformity measure as defined by Norinder et al. [6].

What exactly is the nonconformity score defined by Norinder et al. [6]? Our understanding is that it is a predicted probability generated by a random forest. Norinder et al. [6] do not provide any rationale why probabilities generated by the random forest are a good choice for a nonconformity score. In what way is a probability of a test sample produced by an RF model a measure of how different the test sample is from training samples?

Classification example in Norinder et al. [6]
Norinder et al. [6] applied a variant of CP called Mondrian Conformal Predictions (MCP) [5]. In MCP, a training set is randomly divided into a proper training set and a calibration set. Norinder et al. [6] use a 70% (proper training set) and 30% (calibration set) split. The proper training set was used for model fitting, and the calibration set for constructing the prediction region.

In Figure 1 we show the predicted probabilities of classes A and B, which are exactly the same as Figure 1 in Norinder et al. [6]. They are an example of results on a calibration set consisting of 21 compounds that the authors used for explaining how the prediction region is created.

Figure 1.
class A: 0.002, 0.15, 0.23, 0.40, 0.48, 0.70, 0.75, 0.80, 0.95, 0.98, 0.95
class B: 0.01, 0.08, 0.21, 0.36, 0.43, 0.51, 0.64, 0.72, 0.75, 0.80

We are not questioning their explanation of the way the prediction region is created, but rather their choice of an example and its consequences. As we are dealing with a binary classification, for ease of presentation, let's say that 'class A' is the negative class and 'class B' the positive class. We calculated the AUROC for the 21 predicted probabilities on the calibration set and found it to be 0.527. Furthermore, if we take 0.5 to be the threshold for predicting labels, then the accuracy of the predictions is 0.524. We would like to point out that Norinder et al. [6] do not inform the readers regarding the AUROC nor the accuracy of their predictions on the calibration set.

The problem of presenting an example with almost random predictions is in our opinion two-fold. First, we doubt that anybody would use such a model in practice. Second, when we instead consider an example with good predictions, we would see that a number of otherwise correct predictions (in the binary sense) would become 'both' or 'empty'.
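The AUROC and accuracy quoted above can be checked directly from the Figure 1 probabilities. A short sketch of our calculation: treating class A as negative and class B as positive, a class A compound's probability of being positive is one minus its listed value.

```python
# Check of the AUROC (0.527) and accuracy (0.524) quoted above for the 21
# calibration compounds in Figure 1. Class A is the negative class and class B
# the positive class, so for a class A compound the probability of being
# positive is one minus its listed probability.

class_a = [0.002, 0.15, 0.23, 0.40, 0.48, 0.70, 0.75, 0.80, 0.95, 0.98, 0.95]
class_b = [0.01, 0.08, 0.21, 0.36, 0.43, 0.51, 0.64, 0.72, 0.75, 0.80]

neg = [1 - p for p in class_a]  # P(positive) for the 11 class A compounds
pos = class_b                   # P(positive) for the 10 class B compounds

# AUROC as the fraction of (positive, negative) pairs ranked correctly
auroc = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))
# Accuracy with a 0.5 threshold on P(positive)
correct = sum(p > 0.5 for p in pos) + sum(n <= 0.5 for n in neg)
accuracy = correct / (len(pos) + len(neg))

print(round(auroc, 3), round(accuracy, 3))  # 0.527 0.524
```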
Issues when comparing CP with binary predictions
In binary classification models there are two possible prediction outcomes, {positive, negative}, while in CP there are four possible prediction regions: {positive}, {negative}, 'both' and 'empty'. As we have described earlier, one may calculate, for example, the misclassification error, specificity and sensitivity of a binary classification model. However, which error statistics may one use for prediction regions generated by a CP model? Bosc et al. [15] described a test study that directly compares CP with binary classification models in a QSAR setting. Their approach treats 'empty' prediction regions as false predictions, while for 'both' prediction regions they analyse cases when 'both' is considered correctly classified as well as when 'both' is treated as a false prediction.

Damjan Krstajic [16] has published a criticism of the approach presented by Bosc et al. [15]. Here we will only repeat the main point of his criticism, which Bosc et al. [21] omitted to comment on in their reply. Half of the comparisons between CP and QSAR presented in Bosc et al. [15] examine situations when predictions assigned to 'both' are considered correctly classified. How can someone in practice transform 'both' predictions into correct classifications? Thus, if a sample has a positive output value and it is predicted as 'both' it would be treated as correctly predicted. However, if it has a negative output value it would again be treated as correctly predicted. This implies that if we have a CP model with all 'both' predictions we would have 100% correct predictions. In our opinion, this does not make sense.

Discussion
We would like to reiterate that we are not criticising the CP theory, but its presentations and mostly its applications in computational drug discovery. We are not questioning the results of any authors, but their scientific value. There are some important details in the CP theory that have implications different from what some authors present them to be. Here we will summarise them.

1) An N% prediction region for a binary classifier is valid if N% of these predictions contain the correct label. However, this does not make it possible to limit the number of false positives, as Cortés-Ciriano and Bender [19] suggest, because there is not a clear notion of false positives in CP. Saying that N% of predictions contain the correct label without explaining that the 'both' prediction, i.e. {positive, negative}, contains the correct label but is not the correct label, might lead readers to misunderstand the true meaning of validity in CP. We would like to point out that Shafer and Vovk [17] nicely and fully explain the pros and cons of the validity measure in CP.

2) Shafer and Vovk [17] say that a nonconformity measure is a real-valued function which measures how different a test sample is from training samples. In our opinion, there is a need for more clarification here. As we have shown, in the examples that Shafer and Vovk [17] present, the calculation of nonconformity measures presumes the use of the output binary variable Y, as well as the input variables X1,…,Xm. We are not aware of any example in the CP literature where the nonconformity measure is based on the knowledge of input variables X1,…,Xm alone, i.e. without the use of the output binary variable Y.
We think that such clarification is necessary, because in some research fields, such as AD, one presumes that only input variables X1,…,Xm are used when assessing how different a test sample is from training samples.

3) How can the percentage of correct predictions given by the individual decision trees in a random forest be a measure of how different a test sample is from training samples? Cortés-Ciriano and Bender [19] present 18 articles which use such a measure. We cannot find any theoretical or empirical evidence which would support using the percentage of correct predictions as a measure of how different a test sample is from training samples.

4) In our opinion, there is still an unresolved problem in CP as to how to deal with 'both' and 'empty' prediction regions. Bosc et al. [15] presume that 'both' predictions may be treated as correct classifications. We think that such practice is not logical. How can someone in practice treat a {positive, negative} prediction region, i.e. a 'both' prediction, as a correct classification? How can it be useful in science to examine situations in which we assume that we know something which we cannot know? We are puzzled as to how this practice is accepted in the scientific community.
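Point 4 can be made concrete with a few lines. The following sketch, using hypothetical labels, shows that scoring 'both' regions as correct makes a model that always predicts {positive, negative} look perfect on any data.

```python
# Sketch of the degenerate case in point 4: if 'both' prediction regions are
# scored as correct classifications, a CP model that outputs {positive,
# negative} for every sample reaches 100% "correct" predictions on any data.

def fraction_correct_both_as_correct(regions, y_true):
    """Score a region as correct whenever it contains the true label."""
    return sum(1 for r, y in zip(regions, y_true) if y in r) / len(regions)

y_true = ['positive', 'negative', 'positive', 'negative']  # hypothetical labels
all_both = [{'positive', 'negative'}] * len(y_true)
print(fraction_correct_both_as_correct(all_both, y_true))  # 1.0
```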
Conclusion
We have presented here our critical assessment of CP methods applied in binary classification settings. We would like to point out that we do not have anything against any of the authors whose methods we have criticised here. Our intention is mainly to inform the scientific community of a different view which is currently not present.
Acknowledgments
The author would like to thank his mother, Linda Louise Woodall Krstajic, for correcting English typos and for improving the language of the text.
Declarations
Competing interests
The author declares that he has no competing interests.

Funding
No funding received.
References
1. Vovk, Volodya, Alexander Gammerman, and Craig Saunders. "Machine-learning applications of algorithmic randomness." (1999): 444-453.
2. Saunders, Craig, Alexander Gammerman, and Volodya Vovk. "Transduction with confidence and credibility." (1999): 722-726.
3. Vovk, Vladimir, Alex Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer Science & Business Media, 2005.
4. Vovk, Vladimir. "Cross-conformal predictors." Annals of Mathematics and Artificial Intelligence.
5. Technical Report (2003).
6. Norinder, Ulf, et al. "Introducing conformal prediction in predictive modeling. A transparent and flexible alternative to applicability domain determination." Journal of Chemical Information and Modeling.
7. SAR and QSAR in Environmental Research.
8. Annals of Mathematics and Artificial Intelligence.
9. IFIP International Conference on Artificial Intelligence Applications and Innovations. Springer, Berlin, Heidelberg, 2012.
10. Norinder, Ulf, and Scott Boyer. "Binary classification of imbalanced datasets using conformal prediction." Journal of Molecular Graphics and Modelling 72 (2017): 256-265.
11. Sun, Jiangming, et al. "Applying mondrian cross-conformal prediction to estimate prediction confidence on large imbalanced bioactivity data sets." Journal of Chemical Information and Modeling.
12. Molecular Informatics.
13. Toxicology Research.
14. Chemical Research in Toxicology.
15. Journal of Cheminformatics.
16. J Cheminform, 65 (2019). https://doi.org/10.1186/s13321-019-0387-y
17. Shafer, Glenn, and Vladimir Vovk. "A tutorial on conformal prediction." Journal of Machine Learning Research.
18. Annals of Eugenics.
19. arXiv preprint arXiv:1908.03569 (2019).
20. Devetyarov, Dmitry, and Ilia Nouretdinov. "Prediction with confidence based on a random forest classifier." IFIP International Conference on Artificial Intelligence Applications and Innovations. Springer, Berlin, Heidelberg, 2010.
21. Bosc, N., Atkinson, F., Félix, E. et al. Reply to "Missed opportunities in large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery". J Cheminform 11,