Impact of non-normal error distributions on the benchmarking and ranking of Quantum Machine Learning models
Pascal Pernot, Bing Huang and Andreas Savin
Tuesday 7th April, 2020

Institut de Chimie Physique, UMR8000, CNRS, Université Paris-Saclay, 91405 Orsay, France. Contact: [email protected]
Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL), Department of Chemistry, University of Basel, Klingelbergstrasse 80, 4056 Basel, Switzerland.
Laboratoire de Chimie Théorique, CNRS and UPMC Université Paris 06, Sorbonne Universités, 75252 Paris, France.
Abstract
Quantum machine learning models have been gaining significant traction within atomistic simulation communities. Conventionally, relative model performances are assessed and compared using learning curves (prediction error vs. training set size). This article illustrates the limitations of using the Mean Absolute Error (MAE) for benchmarking, which are particularly relevant in the case of non-normal error distributions. We analyze more specifically the prediction error distribution of kernel ridge regression with the SLATM representation and L2 distance metric (KRR-SLATM-L2) for effective atomization energies of QM7b molecules calculated at the CCSD(T)/cc-pVDZ level of theory. Error distributions of HF and MP2 with the same basis set, referenced to CCSD(T) values, were also assessed and compared to the KRR model. We show that the true performance of the KRR-SLATM-L2 method over the QM7b dataset is poorly assessed by the Mean Absolute Error, and can be notably improved after adaptation of the learning set.

Introduction
For users to appreciate the accuracy of a computational chemistry method, one should ideally provide them with statistics enabling easy estimation of the probability of getting errors below a threshold corresponding to their requirements. As is usual for physical measurements, the simplest such statistic would be a prediction uncertainty attached to a method, u(M) [1]. If the distribution of residual errors for a benchmark dataset is zero-centered normal, statistics such as the mean absolute error (MAE), the root mean squared deviation (RMSD) or the 95th quantile of the absolute errors distribution (Q95) are redundant and can be used to infer u(M): u(M) ≈ RMSD = √(π/2) MAE ≈ Q95/2. If the distribution is not normal, more information is necessary to provide the user with probabilistic diagnostics [2]. It is therefore essential to test the normality of the distribution and, in case of non-normality, to consider data transformations that might bring the distribution closer to normality. For instance, the use of intensive properties [3, 2], the choice between errors or relative errors [4], and the correction of trends [1] might have a notable impact. The previous analysis has been applied to computational chemistry methods, but to our knowledge, the error distributions of ML algorithms have not been scrutinized for their ability to deliver a reliable prediction uncertainty, and the general use of the MAE as a benchmark statistic for ML methods [5, 6] has to be evaluated.
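For a zero-centered normal error distribution, the relations quoted above (RMSD = √(π/2) MAE, and Q95 ≈ 1.96 RMSD, hence u(M) ≈ Q95/2) can be checked numerically. A minimal Python sketch (the paper's own analysis uses R):

```python
import numpy as np

# large sample of zero-centered normal "errors"
rng = np.random.default_rng(5)
e = rng.normal(loc=0.0, scale=2.0, size=200_000)

mae = np.abs(e).mean()                # mean absolute error
rmsd = np.sqrt((e**2).mean())         # root mean squared deviation
q95 = np.percentile(np.abs(e), 95)    # 95th percentile of absolute errors

print(rmsd / mae)   # ~ sqrt(pi/2) ~ 1.2533
print(q95 / rmsd)   # ~ 1.96, so u(M) ~ RMSD ~ Q95 / 2
```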
A problem arises notably when comparing methods with different error distribution shapes, as MAE-based ranking might become arbitrary, occulting important considerations about the risk of large errors for some of the methods [2, 7]. This is the main topic of the present paper, where we analyze the prediction errors for effective atomization energies of QM7b molecules calculated at the CCSD(T)/cc-pVDZ level of theory by kernel ridge regression with CM and SLATM representations and an L2 distance metric [6]. The ML error distributions are compared with the ones obtained from computational chemistry methods (HF and MP2) on the same reference dataset.

In the next section, we present the statistical tools we use to characterize error sets. These are then applied to the HF, MP2, CM-L2 and SLATM-L2 error sets (Section 3). The strong non-normality of the SLATM-L2 errors is then scrutinized, and the systems having large errors are analyzed as outliers. The impact on the prediction error distribution of including these outliers in the learning set is evaluated. The main findings are discussed in Section 4.

Statistical tools

Statistical benchmarking of a method M is based on the estimation of errors (E_M = {e_{M,i}}_{i=1}^N) for a set of N calculated (C_M = {c_{M,i}}_{i=1}^N) and reference data (R = {r_i}_{i=1}^N), where

e_{M,i} = r_i − c_{M,i}    (1)

In the present study, reference data are calculated by a quantum chemistry method, and the uncertainties on the calculated and reference values are considered to be negligible compared to the errors.
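The error definition of Eq. 1 and the usual summary statistics can be sketched as follows (a Python illustration; the function name error_stats is ours, not from the paper's R code):

```python
import numpy as np

def error_stats(reference, calculated):
    """Summary statistics of the error set e_i = r_i - c_i (Eq. 1)."""
    e = np.asarray(reference) - np.asarray(calculated)
    return {
        "MSE":  e.mean(),                # mean signed error (bias)
        "MAE":  np.abs(e).mean(),        # mean absolute error
        "RMSD": np.sqrt((e**2).mean()),  # root mean squared deviation
    }

# synthetic example: a "method" with unbiased noise of ~0.5 kcal/mol
rng = np.random.default_rng(0)
r = rng.normal(0.0, 10.0, 1000)        # reference values
c = r + rng.normal(0.0, 0.5, 1000)     # calculated values
s = error_stats(r, c)
print(s)
```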
Reliable use of statistical tests of normality typically requires at least N = 100 points [8, 9]. For a given sample size, the Shapiro-Wilk W statistic has been shown to have good properties [8], and it is used in this study. The values of W range between 0 and 1, with values close to 1 supporting normality. When W lies below a critical value W_c, depending on the sample size and the chosen level of type I errors α (typically 0.05), the normality hypothesis cannot be assumed to hold [9]. It might also be useful to assess normality by visual tools: normal quantile-quantile plots (QQ-plots) [4], where the quantiles of the scaled and centered error sample are plotted against the theoretical quantiles of a standard normal distribution (in the normal case, all points should lie on the unit line); or comparison of the histogram of errors with a gaussian curve having the same mean, estimated by the mean signed error (MSE), and same standard deviation, estimated by the RMSD.

Two other statistics are helpful in characterizing the departure from normality. The skewness (Skew), or third standardized moment of the distribution, quantifies its asymmetry (Skew = 0 for a symmetric distribution). The kurtosis (Kurt), or fourth standardized moment, quantifies the concentration of data in the tails of the distribution.
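These diagnostics are readily available in standard statistical libraries; a sketch with SciPy (the paper itself used R's shapiro.test and the moments package):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = {
    "normal":       rng.normal(0.0, 1.0, 2000),
    "heavy-tailed": rng.standard_t(df=3, size=2000),  # leptokurtic example
}

for name, e in samples.items():
    W, p = stats.shapiro(e)               # W close to 1 supports normality
    sk = stats.skew(e)                    # 0 for a symmetric distribution
    ku = stats.kurtosis(e, fisher=False)  # 3 for a normal distribution
    print(f"{name}: W={W:.3f} Skew={sk:.2f} Kurt={ku:.1f}")
```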
The kurtosis of a normal distribution is equal to 3; distributions with excess kurtosis (Kurt > 3) are called leptokurtic; those with Kurt < 3 are named platykurtic. For error distributions which are non-symmetric (Skew ≠ 0), quantifying the accuracy by a single dispersion-related statistic is not reliable, and one should provide probability intervals, or accept losing the information on the sign and use a statistic based on absolute errors, such as Q95 presented below. For platykurtic and leptokurtic error distributions, the normal probabilistic interpretation of dispersion statistics such as the RMSD is lost, and one should rely on dispersion measures providing explicit probability coverage, such as u95, recommended in the thermochemistry literature [10]. u95 is an enlarged uncertainty which enables definition of a symmetric 95% probability interval around the estimated value. If the distribution is also skewed, one cannot rely on a symmetric interval based on u95. A major problem with leptokurtic error distributions is that the excess of data in the tails is related to a risk of large prediction errors, which cannot be appreciated from the usual MAE or RMSD statistics. A solution is again to use Q95 (Section 2.2). For ranking studies, problems might occur when comparing distributions with different shapes, which might lead to conflicts between different ranking statistics [2]. A striking example is provided below (Section 3).

Pernot and Savin [2] have shown that the non-normality of error distributions prevents the probabilistic interpretation of usual benchmark statistics.
However, direct probabilistic information can be extracted from the empirical cumulative distribution function (ECDF) of absolute errors ε_i = |e_i|:

C(η) = (1/N) Σ_{i=1}^{N} 1(ε_i ≤ η)    (2)

where η is a threshold value to be chosen, and 1(x) is the indicator function, with value 1 if x is true and 0 otherwise. Two statistics based on the ECDF have been proposed as fulfilling the needs of end-users to choose a method suited to their purpose:

• C(η) is the probability that the absolute errors lie below a chosen threshold η, such as the 1 kcal/mol chemical accuracy.

• Q95, the 95th percentile of the absolute error distribution, gives the amplitude of errors that there is a 5% probability to exceed [2]. The choice of this specific percentile was based on several considerations. When the error distribution is normal, Q95 is identical to the enlarged uncertainty u95 recommended for thermochemical calculations and data [10]. Besides, Q95 is much less sensitive to outliers than the maximal error, or even Q99 for small datasets. Thakkar et al. [11] proposed a similar statistic (P90) based on the 90th percentile.

Note that when Kurt ≠ 3, the interval µ ± σ built on the mean µ (MSE) and standard deviation σ (RMSD) is not a 67% probability interval.

As mentioned previously, the use of such probabilistic scores for non-normal distributions relieves the ambiguity attached to the MAE and RMSD. For instance, in all the cases of error sets that Pernot and Savin [2, 7] have studied previously, the probability for an absolute error to be larger than the MAE lay between 0.2 and 0.45 (it should be about 0.42 for a zero-centered normal distribution). The small values are typically associated with skewed and leptokurtic distributions. Several cases have also been observed where two methods have similar MAE values and very different probabilities of large errors, due to their different error distributions. This non-ambiguity is essential if one wants to assess a risk of large prediction errors.
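The two ECDF-based statistics can be sketched in a few lines of Python (function names are ours):

```python
import numpy as np

def C(abs_errors, eta):
    """Empirical probability that an absolute error lies below eta (Eq. 2)."""
    return float(np.mean(np.asarray(abs_errors) <= eta))

def Q95(abs_errors):
    """95th percentile of the absolute error distribution."""
    return float(np.percentile(np.asarray(abs_errors), 95))

# for standard normal errors one expects C(1) ~ 0.68 and Q95 ~ 1.96
rng = np.random.default_rng(2)
eps = np.abs(rng.normal(0.0, 1.0, 100_000))
print(C(eps, 1.0), Q95(eps))
```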
Note that the transition from descriptive statistics to predictive statistics for risk assessment requires further constraints on the reference dataset and error distributions. Ideally, the reference dataset has to be representative enough to cover future prediction cases, and there should be no notable trend in the errors with respect to the values calculated by the method of interest (in the statistical prediction framework, the predictor variable is the calculated value of a property, from which one wants to infer a credible interval for its true value). Several statistical trend correction methods have been proposed in the computational chemistry literature, from the simple scaling of the calculated values [12, 13], or linear corrections [14, 15, 1, 4, 16], to more complex, ML-based corrections, such as ∆-ML [17, 18, 6] or Gaussian Processes [19].

There is no unique method to identify outliers for a non-normal distribution. One might, for instance, use visual tools, such as QQ-plots [4], or automatic selection tools, such as selecting points for which the absolute error is larger than the 95th percentile (Q95), or another percentile corresponding to a predefined error threshold. In cases where the error distribution seems heterogeneous, one can attempt to analyze it as a mixture of normal distributions. One might for instance use a bi-normal fit

E(x) = Σ_{i=1}^{2} w_i N(x; µ_i, σ_i)    (3)

where N(x; µ, σ) is a normal distribution of x with mean value µ and standard deviation σ. Outliers are then defined as the points which lie outside of a µ ± n × σ interval, where µ and σ are the parameters of the most concentrated component.
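The outlier-selection rule based on Eq. 3 can be sketched on synthetic data. The component parameters below are illustrative (of the same order as the fit reported later for SLATM-L2); in practice they would come from a mixture fit (mixtools in R, or an equivalent EM implementation):

```python
import numpy as np

# illustrative bi-normal parameters: a concentrated and a wide component
mu1, sigma1, w1 = 0.0, 0.87, 0.83
mu2, sigma2 = 0.0, 5.5            # weight of wide component: 1 - w1

# draw a synthetic error set from the mixture
rng = np.random.default_rng(3)
n = 6211
from_core = rng.random(n) < w1
errors = np.where(from_core,
                  rng.normal(mu1, sigma1, n),
                  rng.normal(mu2, sigma2, n))

# outliers: points beyond mu1 +/- n_sig * sigma1 of the concentrated component
n_sig = 5
outliers = np.abs(errors - mu1) > n_sig * sigma1
print(f"{outliers.sum()} outliers ({100 * outliers.mean():.1f}% of the set)")
```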
The enlargement factor n should be chosen large enough to exclude as many points from this component as possible, but small enough to capture enough points of the wider component to enable tests of chemical or physical hypotheses.

Implementation

All calculations have been made in the R language [20], using several packages, notably for bootstrap (boot [21]), skewness and kurtosis (moments [22]) and mixture analysis (mixtools [23, 24]). Bootstrap estimates are based on 1000 draws. The R implementation of the Shapiro-Wilk test (function shapiro.test) has a limit of dataset size at 5000 [8]. For the large datasets used in the application, 5000 points are randomly chosen to evaluate W. The code and datasets to reproduce the tables and figures of this article are available at Zenodo (https://doi.org/10.5281/zenodo.3733272). The datasets can also be analyzed with the ErrView code (https://doi.org/10.5281/zenodo.3628489), or its web interface (http://upsa.shinyapps.io/ErrView).

Datasets

The data are issued from the study by Zaspel et al. [6]. The effective atomization energies (E*) for the QM7b dataset [25], for 7211 molecules with up to 7 heavy atoms (C, N, O, S or Cl), are available for several basis sets (STO-3g, 6-31g, and cc-pVDZ), three quantum chemistry methods (HF, MP2 and CCSD(T)) and four machine learning algorithms (CM-L1, CM-L2, SLATM-L1 and SLATM-L2). The ML methods have been trained over a random sample of 1000 CCSD(T) energies (learning set), and the test set contains the prediction errors for the 6211 remaining systems [6]. We consider here the results for the cc-pVDZ basis set and the CM-L2 and SLATM-L2 methods. Besides, the errors for HF and MP2 have been estimated on the same reference dataset as for the ML methods. They provide an interesting contrast in terms of error distribution.

Summary statistics and their uncertainties have been estimated by bootstrap [26]. They are reported in Table 1.
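The bootstrap estimation of statistic uncertainties (1000 draws in the paper, via R's boot package) can be sketched in Python as:

```python
import numpy as np

def bootstrap_se(x, stat, n_boot=1000, seed=0):
    """Standard uncertainty of stat(x) by nonparametric bootstrap."""
    x = np.asarray(x)
    rng = np.random.default_rng(seed)
    reps = [stat(rng.choice(x, size=x.size, replace=True))
            for _ in range(n_boot)]
    return float(np.std(reps, ddof=1))

# synthetic absolute-error set; report MAE with its bootstrap uncertainty
rng = np.random.default_rng(4)
abs_errors = np.abs(rng.normal(0.0, 1.0, 5000))
mae = abs_errors.mean()
u_mae = bootstrap_se(abs_errors, np.mean)
print(f"MAE = {mae:.3f} +/- {u_mae:.3f}")
```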
In this part, we concentrate on the unmodified methods, HF, MP2, CM-L2 and SLATM-L2. Considering the MAE alone, one might conclude that SLATM-L2 produces slightly smaller errors than MP2 and a significant improvement over all other methods. The MSE tells us that there is no important bias when compared to the MAE or RMSD (all distributions are nearly centered on zero). It is striking that MP2 has a smaller RMSD than SLATM-L2. As the RMSD is often taken as an estimator of prediction uncertainty [1], this contradicts the MAE ranking. This is also a strong clue that not all the error distributions are normal: in the case of normal distributions, and considering that MSE ≈ 0, the RMSD should confirm the MAE ranking (RMSD = √(π/2) × MAE for a zero-centered normal distribution). Indeed, the Shapiro-Wilk normality test rejects the normality of all the error sets, with a critical value W_c for N = 5000 and a type I error rate of 0.05. However, the W statistic shows that MP2 is much closer to normal than SLATM-L2 (W = 0.71). This is reflected in Q95, which informs us that 5% of the absolute errors lie above 3.3 kcal/mol for MP2 and above 4.7 kcal/mol for SLATM-L2. Despite a smaller MAE, one thus runs a non-negligible risk of getting larger errors with SLATM-L2 than with MP2. Note that in parallel with large errors, there is an excess of small errors: we estimated that the probability of absolute errors smaller than the RMSD, C(RMSD), exceeds the value expected for a normal distribution.

Figure 1: Error distributions for the effective atomization energies E* with respect to CCSD(T) of (a) HF, (b) MP2, (c) CM-L2, (d) SLATM-L2, (e) SLATM-L2(+o) and (f) SLATM-L2(+r). In each sub-figure, the left panel presents the histogram of the errors, with its fit by a normal distribution, and the right panel a scatterplot of the errors vs. the value calculated by the corresponding method. The red lines represent the linear trend with its 95% confidence interval, obtained by ordinary least squares linear regression.

Table 1: Error summary statistics on effective atomization energies for the QM7b dataset. All calculations are done with the cc-pVDZ basis set, and the reference data are CCSD(T) values. Methods with a 'lc-' prefix include linear correction. For the meaning of the -o, +o and +r suffixes, see Sections 3.3 and 3.4. MAE: mean absolute/unsigned error; MSE: mean signed error; RMSD: root mean squared deviation; Q95: 95th percentile of the absolute errors distribution; Skew, Kurt: skewness and kurtosis of the errors distribution; C(η): probability of absolute errors below η = 1 kcal/mol; W: Shapiro-Wilk normality statistic. For a sample of size 5000, normality is rejected when W < W_c. Standard uncertainties estimated by bootstrap are reported in parenthetical notation.

Trends.
Plots of the error sets (Fig. 1(a-d)) show that the ML errors present weak (linear) trends, whereas HF and MP2 are more affected. For comparison of their performances, the data sets have been linearly corrected to remove the trends, according to Pernot et al. [1]. Summary statistics for corrected methods (with a 'lc-' prefix) are reported in Table 1, showing a small impact on all statistics. The Shapiro-Wilk normality statistic is improved for HF and MP2, and the error distribution for lc-MP2 is practically normal. After linear correction, and considering the small uncertainty on the linear correction parameters due to the large size of the dataset, the prediction uncertainty can be estimated by the RMSD [1]. This enables assessment of the better performance of MP2 in terms of prediction uncertainty.
Normality.
Comparison of the histograms with their gaussian fits in Fig. 1(a-d) shows clearly that the distributions for HF and MP2, albeit trended, are nearly normal. Note that this is not a general trend of quantum chemistry methods [2]. On the contrary, the error distribution for SLATM-L2 is notably non-normal, with a narrow central peak and wide tails. Because of the large number of points concentrated near the center, the tails of the distribution are not visible on the histogram in Fig. 1(d). They are much more apparent on the QQ-plot in Fig. 4(a), which will be introduced in Section 3.3.

Figure 2: ECDFs of the absolute errors of the effective atomization energies E* with respect to CCSD(T): (a) original methods; (b) comparison with SLATM-L2 trained on a learning set augmented with 350 systems (+o: outliers of SLATM-L2; +r: random systems). The dashed lines signal the Q95 statistic for each set, and the dotted lines the corresponding MAE.

ECDFs.
Considering Fig. 2, one sees that the absolute error ECDFs for SLATM-L2 and MP2 intersect around 3 kcal/mol. Below this value, SLATM-L2 presents smaller absolute errors, with the opposite above. It is clearly visible here why SLATM-L2 has a larger Q95 than MP2. Indeed, one sees also in Figs. 1(b,d) that SLATM-L2 has a non-negligible set of errors much larger than the largest MP2 errors. A similar effect occurs even between HF and SLATM-L2, the latter presenting larger absolute errors above approximately 8 kcal/mol. Computational time aside, the choice between MP2 and SLATM-L2 as a predictor for CCSD(T) energies depends on the acceptance by the user of a small percentage of large errors. SLATM-L2 would largely benefit from a reduction of such large errors. In the following, we focus on the SLATM-L2 error set to identify underlying features linked to large prediction errors.

In order to get a chemical insight into the error trends in the present dataset, the absolute errors have been analyzed as functions of several compositional data, such as the number of individual species in the composition (H, C, N, O, S, Cl) and the Double Bond Equivalent (DBE), which estimates the degree of unsaturation (number of double bonds and rings) [27]

DBE = n_C − (n_H + n_Cl)/2 + n_N/2 + 1    (4)

where n_X is the number of X atoms in the molecule. Note that in this formula divalent atoms do not contribute to the DBE.

Figure 3: Error distributions of SLATM-L2 (a) and MP2 (b) and outliers of SLATM-L2 (c) as functions of the number of H atoms and Double Bond Equivalent (DBE) of the molecules. The circle diameters are on a common scale for all figures and are proportional to the absolute errors. The distribution of SLATM-L2 errors for each DBE class is shown in (d), where the horizontal dashed line represents the 5σ selection threshold for outliers.

Sensitivity of the errors to the various variables has been estimated by using rank correlation coefficients.
The largest sensitivity has been found with respect to the number of hydrogen atoms in the molecule (n_H) and the DBE. Note that these indicators are not independent (DBE is anti-correlated with n_H), and that n_H is strongly correlated with the number of atoms in the molecule. The distribution of the absolute errors sorted by DBE class is shown in Fig. 3(d). Distributions of the absolute errors according to n_H and DBE are presented in Fig. 3(a,b), where each datum is represented by a circle of diameter proportional to its absolute error. The scale is identical for all the series of plots. The imprint of the points cloud reflects the correlation between n_H and DBE. For SLATM-L2, some systems with DBE = 3-5 present very large errors, and the maximal absolute error increases almost linearly from DBE = 0 to 4. All compositions with DBE = 6 are correctly predicted, although, except for C H N, none of those compositions are found in the learning set. For molecules with DBE = 0, all heavy atoms are sp3-hybridized, their bonding environments being the most local, and thus transferable and easy to learn by QML models. For molecules with DBE = 6, bond environments are non-local, but could be easily covered by the training set. Therefore, low prediction errors are expected in these cases. For MP2, the points cloud in Fig. 3(b) presents no strong feature, but the errors seem smaller for DBE = 0, and for each DBE class, there seems to be a maximum around the mid-range of admissible n_H values.
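Eq. 4 is straightforward to implement; a small Python sketch (the function name is ours):

```python
def dbe(n_C, n_H, n_N=0, n_Cl=0):
    """Double Bond Equivalent (Eq. 4); divalent atoms (O, S) do not count."""
    return n_C - (n_H + n_Cl) / 2 + n_N / 2 + 1

print(dbe(n_C=6, n_H=6))          # benzene, C6H6: 3 double bonds + 1 ring -> 4.0
print(dbe(n_C=2, n_H=6))          # ethane, C2H6: saturated -> 0.0
print(dbe(n_C=1, n_H=5, n_N=1))   # methylamine, CH5N: saturated -> 0.0
```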
The non-normality of the error distribution for SLATM-L2 is clearly visible on the QQ-plot in Fig. 4(a). There is a near-linear section at the center of the distribution, seen by comparison with the interquartile line, and also linear sections in the tails of the distribution. The core of the distribution is therefore dominated by a normal component, as well as the tails, albeit with different widths (and centers). Indeed, a bi-normal fit (Eq. 3) provides a good representation of the distribution. Note that to get a perfect fit, as many as four components are needed, but the bi-normal representation is sufficient for the present purpose of outlier analysis. The graphical results and optimal parameters are reported in Fig. 4(b). This decomposition provides a strong component (weight 0.83) with a sub-kcal/mol standard deviation (σ = 0.87 kcal/mol), and a weaker component (weight 0.17) with a much larger dispersion (σ = 5.5 kcal/mol). Outliers were selected as points lying at more than 5σ from the center (µ) of the concentrated component. This filter selects 350 molecules (about 6% of the test set population). The chemical formulae of these systems are analyzed in terms of n_H and DBE (Fig. 3(c)). Systems with DBE = 0 or 6 are practically absent from the list of outliers (at the exception of a few DBE = 0 ones). The outliers with the largest absolute errors are molecules with some unsaturation (DBE ≥ 3), with maximal absolute errors of 27.9 kcal/mol and 27.4 kcal/mol (SMILES: c1cnn2c1CC2), the latter composition having 17 configurations in the learning set and 60 in the test set, 10 of which are outliers.

Figure 4: Normal QQ-plot (a) and bi-normal decomposition (b) of the SLATM-L2 error distribution, with the parameters of the bi-normal fit (µ_i, σ_i, w_i) and the 5σ threshold for outlier identification.

One can see in this list that most outlier compositions present some isomers in the learning set. Unsurprisingly, it also contains a few compositions absent from the learning set, e.g. C H OS (SMILES: c1c2c(cs1)OC2), but as seen above for the DBE = 6 compounds, this is not systematically associated with poor predictions. The impact on the error statistics of the removal of the 350 outliers from the test set is reported in Table 1, as method SLATM-L2(-o). The improvement is noticeable in all aspects: better normality of the distribution, reduced bias (MSE ≈ 0), and notably smaller RMSD and Q95.

Having identified 350 systems for which SLATM-L2 provides exceedingly large errors, we want to check if their transfer to the learning set has a positive impact on prediction performances. To validate the impact of this learning set augmentation, a list of 350 systems has been transferred from the test set to the learning set according to two selections: (1) the outliers identified above (SLATM-L2(+o)), and (2) a random choice (SLATM-L2(+r)), both resulting in a learning set of size 1350. The second scenario is a control to assess the size effect of the learning set.
Note that the present random selection contains 22 of the 350 outliers (≈ 6%).

[Table 2 rows: Composition | DBE | medAE | maxAE | nOutl/nTest | nLearn]
Table 2: Composition of the 5σ outliers with maximum absolute error above 10 kcal/mol. For each composition, one defines medAE: median absolute error over outlier isomers; maxAE: maximum absolute error over outlier isomers; nOutl: number of isomers in the outlier set; nTest: number of isomers in the test set; nLearn: number of isomers in the learning set.

The results are presented in Fig. 1(e,f), Fig. 2(b) and Fig. 6. A first aspect to consider is how the inclusion of the outliers in the learning set (SLATM-L2(+o)) affects the prediction errors with respect to the pruned error set (SLATM-L2(-o)) presented in the previous section. Globally, the improvement is minor, but consistent. The MSE and RMSD are identical, the MAE is barely decreased, and the Q95 is slightly improved, from 2.70 to 2.49 kcal/mol. The learning on the outliers has been transferred to some of the predictions. Then, we compare the gain in prediction quality due to the augmentation of the learning set by outliers versus random systems. The MAE values of SLATM-L2(+o) (0.83 kcal/mol) and SLATM-L2(+r) (0.95 kcal/mol) indicate a statistically significant, but rather weak, effect of the outlier selection, whereas the sheer augmentation of the learning set size notably improves the MAE of the original method, by 0.3 kcal/mol. The MSE shows that the slight bias of the original SLATM-L2 has been essentially corrected in both cases.

Figure 5: DBE distribution in (a) the learning set and (b) the test set; (c) proportion of each DBE class in the learning set with respect to the distribution in the full data set (learning + test); (d) proportion of SLATM-L2 outliers in each DBE class with respect to the distribution in the test set. The horizontal dashed lines represent the mean proportion over the reference dataset.

The impact on the RMSD is larger than on the MAE (1.20 kcal/mol for '+o' vs.
1.89 kcal/mol for '+r'), with a notable reduction of 1.24 kcal/mol from the initial RMSD. Similarly, the decrease in Q95 is important for SLATM-L2(+o), from 4.7 to 2.49 kcal/mol, and much smaller for SLATM-L2(+r), to 3.15 kcal/mol, which reaches the level of MP2 (3.34 kcal/mol). One sees also that the normality of the distribution has been improved for SLATM-L2(+o) (Kurt from 29 to 8.1; W from 0.71 to 0.935), which is not the case for the random selection (Kurt = 46; W = 0.66). Without looking at the plots of the distributions, this set of statistics points to a notable improvement of the error distribution on all criteria for the outlier-augmented learning set. By contrast, the improvement by random selection of the 350 systems, although significant, does not improve the quality of the error distribution, which stays strongly leptokurtic. This can be appreciated on the QQ-plots for both error distributions in Fig. 6(a,b). The ECDFs of the error distributions illustrate clearly the analysis of the summary statistics (Fig. 2(b)). The augmented SLATM-L2 versions have more concentrated absolute errors than SLATM-L2. In the case of SLATM-L2(+o), the large errors have essentially vanished and the overall performance is better than that of SLATM-L2, and even than that of MP2. Although SLATM-L2(+r) is also more concentrated, it still presents a set of systems with large errors, exceeding those of MP2. The comparison of the n_H/DBE plots in Fig. 6(c,d) depicts precisely this point.

Figure 6: Normal QQ-plots of the error distributions for (a) SLATM-L2(+o) and (b) SLATM-L2(+r) (see Fig. 4 for details); (c,d): absolute error distributions as functions of the number of H atoms and Double Bond Equivalent (DBE) of the molecules. The circle diameters are on a common scale for all figures and are proportional to the absolute errors.
Discussion

We analyzed the prediction error distribution of a ML method (KRR-SLATM-L2) for effective atomization energies of QM7b molecules calculated at the CCSD(T)/cc-pVDZ level of theory. Error distributions of standard computational chemistry methods (HF and MP2) with the same basis set and for the same reference dataset were also estimated for comparison. We have shown that the shapes of error distributions should be carefully considered when one attempts to quantify the prediction performance of a method. MAE-based benchmarks neglect crucial information that is better rendered by probabilistic estimators such as Q95. Similar values of the MAE might hide very different error distributions, with a strong impact on the assessment and ranking of methods. In particular, ML prediction error distributions can be strongly non-normal, notably leptokurtic, which poses a problem of prediction reliability, as they might have a non-negligible probability of producing very large errors. Identification of systems with large prediction errors and their inclusion into the learning set significantly improves the prediction error distribution. These main points are further discussed below.

The QM7b dataset has a nice feature of diversity, although it is small in total size. A natural yet undesired consequence is that the dataset is highly inhomogeneous in chemical space; the conclusions drawn above from the results on QM7b may therefore not extend to other datasets, such as QM9 [29, 30] (less diverse in composition, yet larger in total set size). Considering these properties of QM9, one might expect the prediction error distribution for SLATM-L2 to be less leptokurtic than for QM7b, and the use of the MAE for performance assessment to be less problematic. This has however to be checked, but at the moment, one has no access to hierarchical properties for the full QM9, as for QM7b. In this paper we assessed the prediction of energy, which is extensive in nature.
Intensive properties, which are typically more difficult to learn, also exist. The generalizability of our observations to such properties should be confirmed in future studies.
A major feature of the SLATM-L2 error distribution is the presence of a strong peak of small errors and wide tails of large errors. This leptokurtic distribution has been successfully modeled as a mixture of two normal distributions, both practically centered on zero, with standard deviations 0.87 and 5.5 kcal/mol and a mixture ratio of 83:17. Based on this analysis, we explored the chemical identity of the systems with large errors. It is not possible to unscramble the two populations in the area where the most concentrated population is dominant, but if one goes far enough into the tails (beyond 5 times 0.87 kcal/mol) one finds a large majority of systems belonging to the population with large errors. 350 such systems were identified and tagged as outliers. A chemical analysis of these systems based on their DBE revealed that compositions with high levels of unsaturation (DBE = 3-5) are overrepresented with respect to their abundance in the test set. It seems therefore that the large prediction errors are not randomly distributed over the QM7b molecules, which might provide insights for improved molecular descriptors or a better design of the learning set. In fact, if the learning set is augmented with these outliers (SLATM-L2(+o)), one gets better prediction performances than if one augments the learning set with 350 randomly chosen systems (SLATM-L2(+r)). This suggests an iterative method for the design of learning sets that might deserve further consideration.
For ML methods to be used trustfully as replacements for quantum chemistry methods, one needs a reliable measure of their accuracy. Ideally, this should be quantified by a single number characterizing the predictive ability of a method. This requires a finite-variance, homogeneous error distribution (with a dispersion that does not vary strongly over the calculated property range) that is zero-centered and symmetric. Moreover, we have seen that leptokurtic error distributions with a strong core of small errors and tails of large errors, as described above by a bi-normal distribution, might not be properly summarized by a single dispersion statistic. The MAE is not a good choice in such cases, and Q95 presents a more reliable alternative, which quantifies the risk of large prediction errors, even for error distributions that are not zero-centered or not symmetric.

We have shown that inclusion of outliers in the learning set of the SLATM-L2 method could result in a notable improvement of the shape of its prediction error distribution. One might also presume that the use of much larger learning sets should improve the structure of the prediction error distribution.

Whatever the pertinence of this question, it cannot be answered by considering the MAE, contrary to the still too common practice in the benchmarking literature of using it as an "accuracy" measure. The example of the SLATM-L2(+o) method shows that a sub-chemical-accuracy MAE (0.86 kcal/mol) can hide a non-negligible risk (about 30 %) of exceeding this limit.

A direct answer to the question is provided by the ECDF of the errors and C(η) with η = 1 kcal/mol. The values reported in Table 1 tell us that none of the studied methods achieves the 1 kcal/mol threshold with a high probability. The best performers would be the SLATM-L2 methods with augmented learning sets, with C(1) = 0.743 for the (+r) version. Even in this case, one has no firm guarantee of reaching chemical accuracy for any new prediction, the risk of exceeding the threshold being higher than 25 %. Note that this is a notable improvement over MP2, for which the risk is about 55 %. Besides, this qualifies chemical accuracy with respect to CCSD(T), not with respect to experimental values.
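The statistic C(η) is simply the ECDF of the absolute errors evaluated at the threshold η. The sketch below (Python, stdlib only; illustrative, not the code used in this work) computes it on synthetic, near-normal errors with the dispersion reported for MP2 (RMSD about 1.7 kcal/mol); the resulting risk of exceeding 1 kcal/mol is indeed close to the 55 % quoted above.

```python
import random

def C(errors, eta):
    """ECDF of the absolute errors evaluated at eta: the empirical probability
    that a prediction error is smaller than eta in magnitude."""
    return sum(abs(e) < eta for e in errors) / len(errors)

random.seed(3)
# synthetic near-normal errors with MP2-like dispersion (RMSD ~ 1.7 kcal/mol);
# these are simulated values, not the actual MP2 error set
mp2_like = [random.gauss(0, 1.7) for _ in range(50000)]
c1 = C(mp2_like, 1.0)
print(f"C(1 kcal/mol) = {c1:.2f}; "
      f"risk of exceeding chemical accuracy = {1 - c1:.0%}")
```

Unlike the MAE, C(η) answers the user's question directly: with what probability does a new prediction meet my accuracy requirement η?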
In Section 3.1, we outlined that the basic summary statistics (MAE, RMSD) provided contradictory information about the SLATM-L2 prediction performance, notably when compared to MP2. SLATM-L2 has a slightly smaller MAE than MP2 (1.26 kcal/mol), but a larger Q95 value (3.34 kcal/mol for MP2 vs. 4.7 kcal/mol for SLATM-L2), meaning that SLATM-L2 is susceptible to larger errors than MP2.

Given this information, and leaving computation time aside, one would most certainly pick MP2 over SLATM-L2 as a reasonable predictor of CCSD(T) energies, with a prediction uncertainty of approximately 1.7 kcal/mol for E* (RMSD in Table 1), and with added confidence from the near-normality of the distribution, reducing the risk of rogue predictions that might be expected from SLATM-L2. Finally, the modified SLATM-L2(+o) method, for which the learning set was augmented with SLATM-L2 outliers, clearly outperforms both SLATM-L2 and MP2 for all statistics.

This illustrates once more that the MAE is not sensitive to major differences in the shape of the error distributions, as was shown by Pernot and Savin [2]: even for normal distributions, many combinations of MSE and RMSD can produce the same MAE. The case is even worse for non-normal distributions, where the MAE loses its ability to quantify prediction uncertainty. It is clear that the MAE should not be used as a unique scoring statistic, and that it should at least be complemented (or replaced) by probabilistic indicators such as Q95.

Acknowledgments

The authors are grateful to Anatole von Lilienfeld for fruitful discussions and for his help with assembling the dataset.
References

[1] P. Pernot, B. Civalleri, D. Presti, and A. Savin. Prediction uncertainty of density functional approximations for properties of crystals with cubic symmetry. J. Phys. Chem. A, 119:5288–5304, 2015. doi:10.1021/jp509980w.

[2] P. Pernot and A. Savin. Probabilistic performance estimators for computational chemistry methods: the empirical cumulative distribution function of absolute errors. J. Chem. Phys., 148:241707, 2018. doi:10.1063/1.5016248.

[3] J. P. Perdew, J. Sun, A. J. Garza, and G. E. Scuseria. Intensive atomization energy: Rethinking a metric for electronic structure theory methods. Z. Phys. Chem., 230:737–742, 2016. doi:10.1515/zpch-2015-0713.

[4] K. Lejaeghere, L. Vanduyfhuys, T. Verstraelen, V. V. Speybroeck, and S. Cottenier. Is the error on first-principles volume predictions absolute or relative? Comput. Mater. Sci., 117:390–396, 2016. doi:10.1016/j.commatsci.2016.01.039.

[5] F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. von Lilienfeld. Prediction errors of molecular machine learning models lower than hybrid DFT error. J. Chem. Theory Comput., September 2017. doi:10.1021/acs.jctc.7b00577.

[6] P. Zaspel, B. Huang, H. Harbrecht, and O. A. von Lilienfeld. Boosting quantum machine learning models with a multilevel combination technique: Pople diagrams revisited. J. Chem. Theory Comput., 15(3):1546–1559, 2019. doi:10.1021/acs.jctc.8b00832.

[7] P. Pernot and A. Savin. Probabilistic performance estimators for computational chemistry methods: Systematic improvement probability and ranking probability matrix. II. Applications. arXiv:2003.01572, 2020. URL: https://arxiv.org/abs/2003.01572.

[8] N. Mohd Razali and B. Yap. Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. J. Stat. Model. Analytics, 2:21–33, 2011. URL: https://pdfs.semanticscholar.org/dcdc/0a0be7d65257c4e6a9117f69e246fb227423.pdf.

[9] K. Klauenberg, G. Wübbeler, and C. Elster. About not correcting for systematic effects. Meas. Sci. Rev., 19:204–208, 2019. doi:10.2478/msr-2019-0026.

[10] B. Ruscic. Uncertainty quantification in thermochemistry, benchmarking electronic structure computations, and active thermochemical tables. Int. J. Quantum Chem., 114:1097–1101, 2014. doi:10.1002/qua.24605.

[11] A. J. Thakkar and T. Wu. How well do static electronic dipole polarizabilities from gas-phase experiments compare with density functional and MP2 computations? J. Chem. Phys., 143:144302, 2015. doi:10.1063/1.4932594.

[12] A. P. Scott and L. Radom. Harmonic vibrational frequencies: An evaluation of Hartree-Fock, Møller-Plesset, quadratic configuration interaction, density functional theory, and semiempirical scale factors. J. Phys. Chem., 100(41):16502–16513, 1996.

[13] P. Pernot and F. Cailliez. Comment on "Uncertainties in scaling factors for ab initio vibrational zero-point energies" [J. Chem. Phys. 130, 114102 (2009)] and "Calibration sets and the accuracy of vibrational scaling factors: A case study with the X3LYP hybrid functional" [J. Chem. Phys. 133, 114109 (2010)]. J. Chem. Phys., 134:167101, 2011. arXiv:1010.5669, doi:10.1063/1.3581022.

[14] K. Lejaeghere, J. Jaeken, V. V. Speybroeck, and S. Cottenier. Ab initio based thermal property predictions at a low cost: An error analysis. Phys. Rev. B, 89:014304, 2014. doi:10.1103/physrevb.89.014304.

[15] K. Lejaeghere, V. Van Speybroeck, G. Van Oost, and S. Cottenier. Error estimates for solid-state density-functional theory predictions: An overview by means of the ground-state elemental crystals. Crit. Rev. Solid State Mater. Sci., 39:1–24, 2014. doi:10.1080/10408436.2013.772503.

[16] J. Proppe and M. Reiher. Reliable estimation of prediction uncertainty for physicochemical property models. J. Chem. Theory Comput., 13:3297–3317, 2017. doi:10.1021/acs.jctc.7b00235.

[17] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld. Big data meets quantum chemistry approximations: The Δ-machine learning approach. J. Chem. Theory Comput., 11:2087–2096, 2015. doi:10.1021/acs.jctc.5b00099.

[18] L. Ward, B. Blaiszik, I. Foster, R. S. Assary, B. Narayanan, and L. Curtiss. Machine learning prediction of accurate atomization energies of organic molecules from low-fidelity quantum chemical calculations. arXiv:1906.03233, 2019. URL: https://arxiv.org/abs/1906.03233.

[19] J. Proppe, S. Gugler, and M. Reiher. Gaussian process-based refinement of dispersion corrections. arXiv:1906.09342, 2019. URL: https://arxiv.org/abs/1906.09342.

[20] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019. Version 3.6.1.

[21] A. Canty and B. Ripley. boot: Bootstrap Functions (Originally by Angelo Canty for S), 2019. R package version 1.3-22. URL: https://CRAN.R-project.org/package=boot.

[22] L. Komsta and F. Novomestky. moments: Moments, cumulants, skewness, kurtosis and related tests, 2015. R package version 0.14. URL: https://CRAN.R-project.org/package=moments.

[23] D. Young, T. Benaglia, D. Chauveau, and D. Hunter. mixtools: Tools for Analyzing Finite Mixture Models, 2020. R package version 1.2.0. URL: https://CRAN.R-project.org/package=mixtools.

[24] T. Benaglia, D. Chauveau, D. R. Hunter, and D. Young. mixtools: An R package for analyzing finite mixture models. J. Stat. Softw., 32:1–29, 2009. doi:10.18637/jss.v032.i06.

[25] G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, and O. Anatole von Lilienfeld. Machine learning of molecular electronic properties in chemical compound space. New J. Phys., 15:095003, 2013. doi:10.1088/1367-2630/15/9/095003.

[26] P. Pernot and A. Savin. Probabilistic performance estimators for computational chemistry methods: Systematic improvement probability and ranking probability matrix. I. Theory. arXiv:2003.00987, 2020. URL: https://arxiv.org/abs/2003.00987.

[27] V. Pellegrin. Molecular formulas of organic compounds: the nitrogen rule and degree of unsaturation. J. Chem. Educ., 60:626, 1983. doi:10.1021/ed060p626.

[28] D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci., 28:31–36, 1988. doi:10.1021/ci00057a005.

[29] L. Ruddigkeit, R. van Deursen, L. C. Blum, and J.-L. Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model., 52:2864–2875, 2012. doi:10.1021/ci300415d.

[30] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1:140022, 2014. doi:10.1038/sdata.2014.22.