The Problem with Assessing Statistical Methods
Abigail A. Arnold, Department of Statistics, Western University, London, ON N6A 3K7, CANADA
Jason L. Loeppky, Statistics, University of British Columbia, Kelowna, BC V1V 1V7, CANADA ([email protected])
Abstract
In this paper, we investigate the problem of assessing statistical methods and effectively summarizing results from simulations. Specifically, we consider problems of the type where multiple methods are compared on a reasonably large test set of problems. These simulation studies are typically used to provide advice on an effective method for analyzing future untested problems. Most of these simulation studies never apply statistical methods to find which method(s) are expected to perform best. Instead, conclusions are based on a qualitative assessment of poorly chosen graphical and numerical summaries of the results. We illustrate that the Empirical Cumulative Distribution Function, when used appropriately, is an extremely effective tool for assessing what matters in large-scale statistical simulations.

KEYWORDS: Empirical CDF, benchmarking, simulation
1 Introduction

Statistical models are used throughout the sciences and engineering to model complex relationships. Typically, models are chosen on a problem-to-problem basis. While for most examples this is completely justified, there are many cases, especially during model development, where models are being evaluated to provide advice to the practitioner as to which model should be fit. Obviously, there is no one model that will universally outperform the rest. Recognizing the "No Free Lunch" theorem, the logical question to ask is whether one model will perform best over a given class of problems. Again, we feel that the answer to this question is of course no. But we do feel that certain methods will have a better chance than others. The current statistics literature is rife with examples of simulations comparing various methods on problems or data. The presentations of these results often range from tables summarizing means and variances of simulations to a large number of boxplots or dotplots for various subsets of the data.

A brief look into recent papers in the Journal of Statistical Computation and Simulation, Journal of Computational and Graphical Statistics, and Canadian Journal of Statistics shows a handful of papers that provide large tables of results, or results averaged over many trials, that make it difficult to quickly determine the best method (Shen and Yang, 2014; Yang et al., 2014; Gramacy and Apley, 2014; Schaubel et al., 2014). Gramacy and Apley (2014) apply a Monte Carlo experiment to borehole data and average the RMSE values over thirty trials for each of the twelve methods being compared. They find that the best method is not statistically better than the second best. The reader of the paper would benefit from seeing a graphical representation of all of the RMSE values to better compare the performance of the two best methods.
Mbah and Paothong (2014) do somewhat better, representing their data graphically reasonably well; however, by averaging values over 50,000 trials, they do not account for variability in their plots. We find that examples like these persist throughout the literature and feel this is problematic. Motivated by the benchmarking literature in optimization, this paper argues for the use of the Empirical Cumulative Distribution Function (ECDF) to display experimental results, which allows the practitioner to quickly assess the appropriateness of a given model.

This paper is outlined as follows. In Section 2, we present a small simulation study related to the choice of experimental design for computer experiments and show some summaries of the results. In Section 3, we review standard techniques for comparing optimization algorithms and argue that the same approach should be used to compare statistical methods. In Section 4, we show various uses of plots adopted from benchmarking to assess the quality of a given statistical method. Finally, we make some concluding remarks in Section 5.

2 Working Example
As a motivating example, we consider the problem of choosing a design for a computer experiment, which was discussed in Gustafsson (2012). Computer models are extensively used to simulate complex physical processes across all areas of science and engineering. In many cases the computer code is slow to run and a statistical model, such as a Gaussian Process (GP), is used to approximate the code (Sacks et al., 1989; Currin et al., 1991; Santner et al., 2003). Building such a numerical approximation requires the user to specify an experimental design where the code is run and then to build an appropriate GP emulator.

For our working example, we consider some of the tests performed in Gustafsson (2012) to choose an appropriate design for a computer experiment. The simulation environment selected 2 different versions of 4 generic problems that are evaluated for three different input dimensions. This is equivalent to having 24 different test problems, and each test problem is evaluated for 3 different choices of sample size (5d, 10d, and 15d), where d is the number of input dimensions. For a given input design, 50 equivalent designs are constructed by permuting the columns of the original design, which is roughly equivalent to 50 replicates of each of the 72 experimental settings. Each experiment evaluated 7 different experimental designs, which we will label as Methods 1 through 7 (M1–M7). For each method and test problem we record

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2} \quad \text{and} \quad \mathrm{AME} = \max_{i=1,\dots,m} |y_i - \hat{y}_i|,$$

where $y_i$ is the true value of the computer code at input $x_i$ from the test data and $\hat{y}_i$ is the prediction of the GP at the same input. Both of these measures are used to assess overall model performance.

Chen et al. (2015) discuss the importance of having extensive simulations and multiple data sets to validate a new statistical procedure.
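As a concrete illustration, the two accuracy measures can be computed from held-out test data as follows. This is a minimal sketch, not the simulation code used in the paper; the function and variable names are ours, and the RMSE uses the usual root-mean-square convention.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of the emulator over m test points."""
    m = len(y_true)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / m)

def ame(y_true, y_pred):
    """Absolute maximum error: the worst-case prediction error."""
    return max(abs(y - yh) for y, yh in zip(y_true, y_pred))

# Toy data: true code outputs and hypothetical GP predictions at 4 test inputs.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 4.0]
print(rmse(y_true, y_pred))  # ~0.1225
print(ame(y_true, y_pred))   # ~0.2
```

RMSE averages over all test points, while AME picks out the single worst prediction, so the two can rank methods differently.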
This working example and an incomplete list of other papers (Loeppky et al., 2009, 2010; Williams et al., 2011; Gramacy and Lee, 2012; Gramacy and Apley, 2014; Butler et al., 2014; Lam and Notz, 2008) all have a similar structure: a reasonably comprehensive set of simulations and a number of methods that need to be assessed. The papers above typically show the results as a series of side-by-side boxplots or side-by-side dotcharts for each method, with one plot for each test function and sample size. Conclusions are then drawn from looking at a handful of boxplots, which often look very cluttered and usually do not provide clear evidence as to the best method(s). Alternatively, the results will be summarized in a table of average performance for each method, separated by function and sample size. These tables are usually overwhelming to look at, and interpreting them is incredibly inefficient. Using our working example, the plot in Figure 1 illustrates this currently widely used approach.

Figure 1 shows the boxplots of RMSE values corresponding to run sizes of 50, 100, and 150, respectively, for one of the 24 test functions under consideration. Within each plot, the seven boxes correspond to the seven different methods used to fit the data. Note that the scales of the plots do not correspond. The RMSE values in the first plot are much larger than those of the second, and the same can be seen between the second and third. If we were to plot them all on the same y-axis, the details of the second and third plots would not be clearly seen. Although these plots certainly provide information on the fit of the seven models, it is not possible to determine which method(s) are performing best overall. As an example, for the middle plot, cases could be made to support Methods 2, 3, and 6. Furthermore, we would typically be looking at similar plots for multiple test functions and trying to draw conclusions from, say, ten or twelve plots similar to just the one shown.
In our opinion, these plots, just like many of the plots in the literature, do not provide a clear picture of the evidence for or against a method.

[Figure 1: Box plots of RMSE values for one function with sample sizes 50 (left), 100 (middle), and 150 (right) fit by each of the seven methods.]

In the papers above and other work, we find the results of various methods to appear extremely similar in box or dot plots and tables, and it is nearly impossible to choose a clear winner for a single function, let alone a clear winner overall. We find the lack of an easy way to visualize the results from these simulations to be a large problem. The goal of using these boxplots (or dotcharts and tables), which is to provide overall evidence as to the usefulness of a chosen procedure, is not being met.

The need for better summaries and for better-designed simulation studies is of course not new. Chipman (2015) discussed the need for statistically analyzing simulation results and even for using principles of experimental design to reduce the number of simulations being performed. We agree with this approach; however, we focus attention on the need for simple graphical methods that allow for quick assessments of what is practically significant. In many simulation studies, the number of replicates can be very high, which will often lead to very small residual errors that might unduly inflate the significance of the results. In what follows we discuss a very simple graphical method to assess which methods are expected to do well in practice.
Motivated by benchmarking in the optimization literature, we seek a simple tool that allows one to quickly assess the overall performance of a set of methods. In many simulation studies, there is a primary factor of interest, and multiple other factors are introduced simply to create more experimental cases to be tested. Additionally, these are often replicated a large number of times. The other factors and the replicates in this case are nuisance variables that are not of primary interest. Many of the common plots of course display results for a fixed level of one of these factors and only show replicates and methods in a single plot, such as those seen in Figure 4. This leaves the reader, and the authors in many cases, completely baffled as to which method is a clear winner. In our opinion the reasons for this are often twofold: first, the graphical (or tabular) displays used are simply inadequate, and second, we often have inadequate methods for judging performance. Although optimization benchmarking deals with very different problems, many of the arguments made in the benchmarking literature can be directly applied to assessing the quality of a statistical method.

Benchmarking in optimization attempts to answer the question of which optimization routine (solver) is the best overall at optimizing a large class of test problems (Moré and Wild, 2009). This involves having a very large test set of problems and then running each numerical solver on each of these test problems. For a set of p = 1, ..., P problems and s = 1, ..., S solvers, the goal is to find the solver s* which is "best" at solving all problems. The hope of course is that the best solver on this test set will also be the best solver on a set of untested problems.
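Concretely, the benchmark results form a P × S table of performance values, and one crude first summary of "best overall" is how often each solver wins. A minimal sketch with made-up numbers (all names and values are ours, for illustration only):

```python
# Performance of S = 3 solvers on P = 4 problems (e.g., time to solve).
results = [
    [3.0, 1.0, 2.0],
    [2.5, 1.2, 1.1],
    [4.0, 2.0, 2.1],
    [1.0, 0.9, 3.0],
]

def win_counts(results):
    """Count, for each solver, how many problems it solves best (smallest value)."""
    S = len(results[0])
    wins = [0] * S
    for row in results:
        wins[row.index(min(row))] += 1  # ties go to the first best solver
    return wins

print(win_counts(results))  # [0, 3, 1]
```

Counting wins discards how close the losing solvers came, which is exactly the information the profile plots discussed next are designed to retain.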
Given an appropriate test set of functions, we do not view this as being contradictory to the goals of assessing a statistical method.

Two common plots in benchmarking that warrant some discussion are performance profiles and data profiles (Moré and Wild, 2009). Performance profiles define the measure of interest for each solver to be the time taken to solve a given problem. Each solver is assessed by comparing the ECDFs of the time taken to solve all problems. On the other hand, data profiles display the proportion of problems solved by a given method for a fixed time budget. Several modifications of these plots also exist that display the proportion of problems solved within a given accuracy. In order for either of these plots to be effective, it is important to have a standardized measure of performance. For example, in performance profiles, CPU time is typically used as opposed to the number of function evaluations, since some functions take longer to evaluate than others and some algorithms run longer than others. When displaying accuracy it is typical to compute a relative error for a problem, given by F_p = (f_p − f*)/f*, where f* is the known solution and f_p is the solution found by the solver. When the solution f* is unknown, it is common to replace f* with the best solution found by all solvers. Standardizing the results allows for easy use of the ECDF to quickly assess which solver is best for a particular problem.

Drawing the parallels between statistical problems and benchmarking, most statistical simulation studies consist of m = 1, ..., M methods that need to be compared. The comparisons are done over a wide array of factors that represent possible scenarios of interest, typically with many replicates; all of these combinations can be viewed as the set of p = 1, ..., P problems over which the comparisons are performed.
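The relative-error standardization above is straightforward to sketch; when f* is unknown we substitute the best value any solver found, as described. This is our own illustrative helper, assuming a minimization problem with f* > 0:

```python
def relative_errors(solver_values, f_star=None):
    """Relative errors F_p = (f_p - f*) / f* for one problem.

    If the true solution f* is unknown, fall back to the best
    (smallest) value found by any solver.
    """
    if f_star is None:
        f_star = min(solver_values)
    return [(f - f_star) / f_star for f in solver_values]

# Three solvers on one minimization problem; true optimum unknown,
# so the best value found (1.0) plays the role of f*.
print(relative_errors([2.0, 1.0, 1.5]))  # [1.0, 0.0, 0.5]
```

With this scaling, the winning solver always sits at 0 and every other solver's error is expressed as a fraction of the best value, making results comparable across problems with very different magnitudes.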
Many papers tend to do a separate analysis of each of the P problems and then try to draw general conclusions on the appropriateness of the given method. This is of course often difficult and results in an overwhelming number of plots or tables to be compared. To avoid these difficulties, and following the procedures from benchmarking, the simple-minded approach to choosing the best method would be to plot the ECDF of a suitable measure of accuracy for each of the P problems analyzed. There are two perceived issues with this approach. First, the suggestion is to collapse over all of the different problems P, which results in a perceived loss of any functional information. In what follows, we show that one does not need to collapse over all problems; additionally, we show that significant benefit is gained by collapsing. This benefit is amplified when one takes the viewpoint of choosing a method that is expected to perform well on a variety of problems. The second concern is defining a suitable measure of accuracy. One obvious choice is the standardized measure F_p shown above. In the next section, we illustrate the use of this tool on the data from the motivating example and discuss an alternative measure for standardizing the results.

In order to define a suitable measure of accuracy that is comparable across all of the different problems, we consider two approaches. For now, we will focus on using either the RMSE or AME obtained by fitting the GP for each of the 7 different methods and dividing that value by the RMSE or AME from the model using the trivial predictor ȳ. This allows us to easily plot very different RMSE or AME values on the same plot and also allows for a quick assessment of quality, since values of this ratio larger than, say, 0.5 would be classified as an extremely poorly fitting GP. Since these measures of accuracy are all relatively close to one another, we typically take a log base 10 transformation to induce some spread in the results.
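A sketch of this standardization (ours, not the authors' code): divide each method's RMSE by the trivial predictor's RMSE and take log base 10, so each unit below zero corresponds to roughly one decimal digit of accuracy gained over simply predicting ȳ.

```python
import math

def standardized_score(method_rmse, trivial_rmse):
    """log10 of the RMSE ratio against the trivial predictor ybar.

    Scores near 0 mean the GP barely beats predicting the mean;
    more negative scores mean more digits of accuracy gained.
    """
    return math.log10(method_rmse / trivial_rmse)

# A GP with RMSE 0.02 against a trivial-predictor RMSE of 2.0
# gains two decimal digits of accuracy, giving a score of -2.
print(standardized_score(0.02, 2.0))  # ~-2.0
```

The same transformation applies verbatim to AME values; in either case the ratio makes scores comparable across problems whose raw error scales differ by orders of magnitude.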
Other transformations could be used, including the natural log, but we use log base 10 since this provides the negative of the number of decimal digits of accuracy achieved by the method.

The plot in Figure 2 shows the log10-transformed RMSE values displayed as both boxplots in the top row and as ECDFs in the bottom row for the same data displayed in Figure 1. Note that using logged RMSE values in the boxplots still results in multiple confusing plots that do not allow a clear winner to be identified. For the ECDFs, since we are taking the log of values less than one, we end up with negative values. The more negative the value, the smaller the RMSE. Therefore, the line that sits farthest left represents the method that performs best on the problems. We can clearly see that for sample sizes 50 and 150 (the left and right plots) Method 2 performs best. For sample size 100, it is not as clear. Method 3 is the best for mid-range RMSE values, but as RMSE values increase, Methods 2 and 6 both outperform 3. We argue that the tail end of the ECDF is most important. Differences at the more negative values correspond to small differences in RMSE, while the larger values correspond to more noticeable differences in performance at a point where performance is hardly better, if at all, than the trivial predictor. For this reason, Methods 2 and 6 are more attractive than 3. Clearly, we can see much more about the performance of the methods in the ECDFs than in the minimal information provided by boxplots. We can very quickly see that Methods 4 and 5 are not performing well at all compared to the other methods and that Method 2 is consistently performing well out of the seven.
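The ECDF itself takes only one line of arithmetic per method: sort the standardized scores and pair them with cumulative proportions. A minimal sketch (plotting omitted; the scores are made up for illustration):

```python
def ecdf(values):
    """Return sorted values and cumulative proportions i/n for an ECDF plot."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

# Standardized log10 RMSE scores for one method over four problems.
scores = [-1.2, -1.5, -0.9, -1.1]
xs, ys = ecdf(scores)
print(xs)  # [-1.5, -1.2, -1.1, -0.9]
print(ys)  # [0.25, 0.5, 0.75, 1.0]
```

Overlaying one such curve per method (for example with a step plot in matplotlib) reproduces the kind of display in the bottom row of Figure 2: the curve sitting farthest to the left is the best-performing method.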
[Figure 2: Box plots and ECDFs for seven methods on one test function for sample sizes 50, 100, and 150 from left to right.]

In the case above, we considered one problem and separate plots for three different sample sizes. We can also collapse the three sample sizes into one plot, as in Figure 3. It is still clear which method is doing best overall. Additionally, the approximate locations in the plot where RMSE values shift from one run size to the next should be noticeable upon visual inspection. If it is not easy to identify these locations, vertical lines can be added to the plot to show roughly where these boundaries are. Although not needed in this case, we have included such vertical lines. The advantage of this plot over the set of ECDF plots above is that all of the information is clearly condensed into one display. At once, we can see if there is one clear winner or if for some run sizes the winner is not as obvious. From the plots it is easy to see that Methods 4 and 5, the cosine-transformed designs (Dette and Pepelyshev, 2010), get progressively worse as the sample size increases. Additionally, we see that Method 2, the maximin LHS, does much better at smaller sample sizes, which is perhaps not that surprising since it is focused on space-filling, which becomes less important as the density of points increases.
[Figure 3: RMSE values for one function collapsed over sample sizes 50, 100, and 150, with approximate boundaries indicated by vertical lines.]

Finally, we can also collapse over many functions at once. The left plot of Figure 4 includes RMSE values for all 24 test functions evaluated for sample size 10d. Therefore, we have 24 problems, each with 50 replicates, evaluated by each of the seven methods. We notice that overall Methods 1, 2, 3, 6, and 7 are performing about equally well. The middle plot of Figure 4 again uses all 24 problems but shows standardized AME values for each of the 50 replicates rather than RMSE data. We draw similar conclusions from this plot, supporting those from the RMSE data.
[Figure 4: ECDFs for RMSE values collapsed over all 10d simulations (left plot), AME values collapsed over 10d simulations (middle), and alternate standardized RMSE values for 10d simulations (right).]

Until now, we have been standardizing RMSE and AME values by the corresponding RMSE and AME values obtained by fitting the trivial predictor. It is important to note that, depending on the problem, the standardization used may differ; to produce useful ECDFs, care must be taken in this step. We present here an alternate form of standardization for our problem set to show another possibility and also to verify that we obtain consistent results. The right plot in Figure 4 uses the standardization F_p = (f_p − f*)/f*, where f* is the lowest RMSE value among the seven methods for each replicate. Again, we see similar results between this and the other two plots, although this form of standardization shows a larger discrepancy in the performance of Methods 4 and 5 compared to the rest. The advantage of this plot may be that we can see for what percentage of the problems each method has the best solution. For example, we see that Method 6 has the best solution nearly 25% of the time.

In this paper we have introduced a very simple graphical method to assess the performance of a given statistical procedure. We have additionally argued that in many cases significant benefits are gained by plotting multiple problems on the same graph, so that one can quickly and reliably decide which methods are being outperformed on a large class of problems. Compared to plotting boxplots or dotplots for all of the results of one method, the ECDF has the advantage of displaying every single data point in two dimensions and allows for an easy interpretation. One can read off what percentage of problems have been solved within a given accuracy.
When this accuracy is converted to a standardized performance measure, it also allows one to see which method results in the best fit over all of the different problems. Although the ECDF is a very simple tool, it is far more sophisticated than a laundry list of tables or pages of boxplots or dotplots displaying results for one test at a time. In our opinion, displays of that variety do a far better job of obfuscating the relevant information and confusing authors and readers than of indicating which method performs best.

The ECDF allows for quick assessments of methods over a large array of problems to get an overall view, while of course not precluding comparisons on individual functions. The major advantage of the ECDF is its ability to quickly show similarities and differences that are not as easy to see with dotplots or boxplots, which tend to be cluttered. We hope that readers of this paper agree with our opinions, and we strongly encourage everyone to rely on the ECDF, at least as a starting point, to display relevant statistical information from simulations.
ACKNOWLEDGEMENTS
The research of Loeppky was supported by a Natural Sciences and Engineering Research Council of Canada Discovery Grant (RGPIN-2015-03895). The research of Arnold was conducted at the University of British Columbia with the support of a Natural Sciences and Engineering Research Council of Canada Undergraduate Research Award.
References
Butler, A., Haynes, R. D., Humphries, T. D., and Ranjan, P. (2014), "Efficient Optimization of the Likelihood Function in Gaussian Process Modelling," Computational Statistics & Data Analysis.

Currin, C., Mitchell, T., Morris, M., and Sacks, J. (1991), "Bayesian Prediction of Deterministic Functions, with Applications to the Design and Analysis of Computer Experiments," Journal of the American Statistical Association, 86, 953–963.

Dette, H. and Pepelyshev, A. (2010), "Generalized Latin Hypercube Design for Computer Experiments," Technometrics, 52, 421–429.

Gramacy, R. B. and Apley, D. W. (2014), "Local Gaussian process approximation for large computer experiments," Journal of Computational and Graphical Statistics, 1–28.

Gramacy, R. B. and Lee, H. K. H. (2012), "Cases for the Nugget in Modeling Computer Experiments," Statistics and Computing, 22, 713–722.

Gustafsson, E. (2012), "Designs for Computer Experiments: Random versus Rational," Master's thesis, Statistics, University of British Columbia, Okanagan, Kelowna, British Columbia.

Iman, R. L. and Conover, W. J. (1982), "A Distribution-Free Approach to Inducing Rank Correlation Among Input Variables," Communications in Statistics - Simulation and Computation, 11, 311–334.

Lam, C. Q. and Notz, W. (2008), "Sequential Adaptive Designs in Computer Experiments for Response Surface Model Fit," Statistics and Applications, 6, 207–233.

Loeppky, J. L., Moore, L. M., and Williams, B. J. (2010), "Batch Sequential Designs for Computer Experiments," Journal of Statistical Planning and Inference, 140, 1452–1464.

Loeppky, J. L., Sacks, J., and Welch, W. J. (2009), "Choosing the Sample Size of a Computer Experiment: A Practical Guide," Technometrics, 51, 366–376.

Mbah, A. K. and Paothong, A. (2014), "Shapiro–Francia test compared to other normality tests using expected p-value," Journal of Statistical Computation and Simulation, 1–15.

McKay, M. D., Beckman, R. J., and Conover, W. J. (1979), "A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code," Technometrics, 21, 239–245.

Moré, J. J. and Wild, S. M. (2009), "Benchmarking derivative-free optimization algorithms," SIAM Journal on Optimization, 20, 172–191.

Morris, M. D. and Mitchell, T. J. (1995), "Exploratory Designs for Computational Experiments," Journal of Statistical Planning and Inference, 43, 381–402.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989), "Design and Analysis of Computer Experiments (with Discussion)," Statistical Science, 4, 409–435.

Santner, T. J., Williams, B. J., and Notz, W. I. (2003), The Design and Analysis of Computer Experiments, New York: Springer.

Schaubel, D. E., Zhang, H., Kalbfleisch, J. D., and Shu, X. (2014), "Semiparametric methods for survival analysis of case-control data subject to dependent censoring," Canadian Journal of Statistics, 42, 365–383.

Shen, Y. and Yang, Z. (2014), "Bias-correction for Weibull common shape estimation," Journal of Statistical Computation and Simulation, 1–30.

Sobol, I. M. (1967), "Distribution of points in a cube and approximate evaluation of integrals," USSR Computational Mathematics and Mathematical Physics, 7, 86–112.

Williams, B. J., Loeppky, J. L., Moore, L. M., and Macklem, M. S. (2011), "Batch Sequential Design to Achieve Predictive Maturity with Calibrated Computer Models," Reliability Engineering & System Safety, 96, 1208–1219.

Yang, H., Lv, J., and Guo, C. (2014), "Robust variable selection in modal varying-coefficient models with longitudinal data," Journal of Statistical Computation and Simulation.