[PDF] Bayesian Bootstrap Inference for the ROC Surface

Abstract

Accurate diagnosis of disease is of great importance in clinical practice and medical research. The receiver operating characteristic (ROC) surface is a popular tool for evaluating the discriminatory ability of continuous diagnostic test outcomes when there exist three ordered disease classes (e.g., no disease, mild disease, advanced disease). We propose the Bayesian bootstrap, a fully nonparametric method, for conducting inference about the ROC surface and its functionals, such as the volume under the surface. The proposed method is based on a simple, yet interesting, representation of the ROC surface in terms of placement variables. Results from a simulation study demonstrate the ability of our method to successfully recover the true ROC surface and to produce valid inferences in a variety of complex scenarios. An application to data from the Trail Making Test to assess cognitive impairment in Parkinson's disease patients is provided.

Full PDF

BBayesian Bootstrap Inference for the ROC Surface

Vanda

In´acio de Carvalho , Miguel de Carvalho , and Adam

Branscum

Abstract

Accurate diagnosis of disease is of great importance in clinical practice and medical research. The receiveroperating characteristic (ROC) surface is a popular tool for evaluating the discriminatory ability of continuousdiagnostic test outcomes when there exist three ordered disease classes (e.g., no disease, mild disease, advanceddisease). We propose the Bayesian bootstrap, a fully nonparametric method, for conducting inference about theROC surface and its functionals, such as the volume under the surface. The proposed method is based on a simple,yet interesting, representation of the ROC surface in terms of placement variables. Results from a simulation studydemonstrate the ability of our method to successfully recover the true ROC surface and to produce valid inferencesin a variety of complex scenarios. An application to data from the Trail Making Test to assess cognitive impairmentin Parkinson’s disease patients is provided. key words: Diagnostic test; Bayesian bootstrap; ROC surface; Volume under the surface; Trail Making test INTRODUCTION

Evaluating and ranking the performance of medical diagnostic tests is of fundamental importance in healthcare. Before a test is approved for routine use in practice, its ability to distinguish between diﬀerent diseasestages or conditions must be rigorously evaluated through statistical analysis. The receiver operating charac-teristic (ROC) curve is a popular tool for evaluating the accuracy of continuous outcome diagnostic tests thatclassify subjects into two groups: diseased and nondiseased. However, disease progression can be regardedas a dynamic process and, in clinical practice, physicians often face situations that require a decision amongthree or more diagnostic alternatives. Patients may advance through one or more transitional stages prior tofull disease onset, and this is especially true for neurological disorders. For instance, in Section 5 we presentan assessment of the discriminatory ability of the Trail Making Test, a widely used test to detect cognitiveimpairment associated with dementia, to distinguish between Parkinson’s disease patients who present normalcognition, mild cognitive impairment, and dementia/severe impairment. As a direct generalisation of ROC

Vanda In´acio de Carvalho is Lecturer in Statistics, School of Mathematics, University of Edinburgh, Scotland, UK(

[email protected] ). Miguel de Carvalho is Lecturer in Statistics, School of Mathematics, University of Edinburgh, Scotland,UK (

[email protected] ). Adam J. Branscum is Professor, College of Public Health and Human Sciences, Oregon StateUniversity, USA ( [email protected] ). a r X i v : . [ s t a t . M E ] M a y urves, the ROC surface has been used to evaluate diagnostic accuracy in ordered three-class classiﬁcationproblems (Yang and Carlin, 2000; Nakas and Yiannoutsos, 2004); see also Nakas (2014) for an insightful reviewof three-class ROC surface analysis. The volume under the ROC surface (VUS) has been proposed as a scalarsummary measure of diagnostic accuracy, analogous to the area under the ROC curve in the two-class setting.There is a vast literature on parametric, semiparametric, and nonparametric ROC curve analysis; see Pepe(2003) and Zhou et al. (2011) for an overview. The amount of existing research on ROC surface analysis is, bycomparison, limited. Li and Zhou (2009) developed a frequentist nonparametric approach to estimating theROC surface based on the empirical distribution function and a semiparametric approach that attempts togeneralise the parametric (normal) functional form of the ROC surface but that, as pointed out by the authors,relies heavily on the normality assumption. In´acio et al. (2011) developed a nonparametric Bayesian methodbased on ﬁnite mixtures of Polya trees to estimate the ROC surface, while Zhang and Li (2011) developedmethods for combining multiple biomarkers, Yu (2012) considered rare diseases and high-throughput data, Liand Fine (2012) considered ROC surface regression analysis, Kang and Tian (2013) developed nonparametricestimators based on kernel methods, and Coolen-Maturi and Coolen (2014) considered a frequentist nonpara-metric predictive approach. An empirical likelihood approach (Wan, 2012) and inverse probability weighting(Zhang and Alonzo, 2016) have been used to estimate the volume under the ROC surface, and test statisticswere developed by Hong and Cho (2015).In this paper, we develop a computationally appealing, fully nonparametric estimator of the ROC surfacebased on the Bayesian bootstrap. The Bayesian bootstrap (Rubin, 1981) is the Bayesian analog of the fre-quentist bootstrap. A key diﬀerence is the Bayesian bootstrap is based on simulation rather than resampling(details about the Bayesian bootstrap are given in Section 2). Our nonparametric estimator of the ROCsurface is robust and smooth. Compared to a kernel approach, it does not depend on a smoothing parameter,the choice of which is a nontrivial issue in practice and has a great impact on inference. Our method is hence awidely applicable approach to inference for the ROC surface that can be used for many populations and for alarge number of diseases and continuous diagnostic measures. Moreover, point estimates and credible intervalsfor the ROC surface and its corresponding VUS are obtained in a single integrated framework. Recent devel-opments of ﬂexible Bayesian models that have been successfully applied in medical diagnostic testing researchabound (e.g., Erkanli et al. 2006; Branscum et al. 2008, 2013; In´acio de Carvalho et al. 2013; Rodr´ıguez andMart´ınez 2013; Branscum et al. 2015; Hwang and Chen 2015; Zhao et al. 2016; In´acio de Carvalho et al. 2017;Carvalho and Branscum 2018).The remainder of the paper is organised as follows. In the next section we introduce background mate-2ial on ROC surfaces and the Bayesian bootstrap. Section 3 presents our novel nonparametric approach toestimating the ROC surface. The performance of our method is assessed in Section 4 using simulated data,Section 5 applies our approach to data from a Trail Making Test to detect cognitive impairment, and Section 6concludes the paper. The Trail Making Test data and an R function for implementing our method are providedin the Supplementary Materials. PRELIMINARIES

ROC surfaces

We assume that each subject in the population belongs to one of three ordered diagnostic groups (e.g., nodisease, mild disease, advanced disease) and that a diagnostic test with continuous scale outcomes is used forclassiﬁcation into one of the three groups. We further assume that the group to which each subject belongsto is known without error due to the existence of a gold standard test. Without loss of generality, we assumethat individuals from group 3 tend to have higher test outcomes than individuals in group 2 who tend tohave higher test values than group 1 individuals. Let Y , Y , and Y be continuous random variables denotingtest outcomes in groups 1, 2, and 3, respectively, with F , F , and F being the corresponding cumulativedistribution functions. For any pair of ordered thresholds ( c , c ), with c < c , the probabilities of correctclassiﬁcation into each group are given by p ( c , c ) = Pr( Y ≤ c ) = F ( c ) ,p ( c , c ) = Pr( c < Y ≤ c ) = F ( c ) − F ( c ) ,p ( c , c ) = Pr( Y > c ) = 1 − F ( c ) . The ROC surface is then the three-dimensional plot in the unit cube depicting the probabilities of correctclassiﬁcation into each group as the thresholds c and c vary { ( p ( c , c ) , p ( c , c ) , p ( c , c )) : c < c } = { ( F ( c ) , F ( c ) − F ( c ) , − F ( c )) : c < c } . For notational simplicity, hereafter we drop the dependence of p , p , and p on c and c .By writing c = F − ( p ) and c = F − (1 − p ), we obtain the functional form of the ROC surfaceROCS( p , p ) =  F ( F − (1 − p )) − F ( F − ( p )) , if F − ( p ) < F − (1 − p ) , , otherwise . (cid:90) (cid:90) ROCS( p , p )d p d p = (cid:90) (cid:90) − F ( F − ( p ))0 (cid:8) F ( F − (1 − p )) − F ( F − ( p )) (cid:9) d p d p = Pr( Y < Y < Y ) . When the three distributions completely overlap, and thus the test has no discriminatory ability, the VUS takesthe value 1 /

6, whereas a VUS of 1 corresponds to a test that perfectly discriminates between the three groups.Other values of VUS correspond to diﬀerent degrees of overlap/stochastic ordering between f /F , f /F , and f /F ; the closer the VUS is to 1, the better the classiﬁcation accuracy (see Figure 1 of the SupplementaryMaterials). Bayesian bootstrap

The Bayesian bootstrap (BB) was introduced by Rubin (1981) as a Bayesian counterpart of the originalbootstrap proposed by Efron (1979) and it is based on simulation rather than resampling. Let ( y , . . . , y n )be a random sample from an unknown distribution F and suppose that the parameter of interest is F itself,which is represented as F ( z ) = n (cid:88) i =1 ω i I ( y i ≤ z ) , where I ( · ) denotes the indicator function, ω i is the weight associated to observation y i , with ω i ≥ (cid:80) ni =1 ω i = 1. In the classic nonparametric bootstrap, inference about F is obtained by repeatedly generatingbootstrap samples, where each bootstrap sample is drawn with replacement from the data. In the b th bootstrapreplicate, F ( b ) is computed as F ( b ) ( z ) = n (cid:88) i =1 ω ( b ) i I ( y i ≤ z ) , where ω ( b ) i is the number of times y i appears in the b th bootstrap sample, with ω ( b ) i taking values on thediscrete set { , /n, . . . , n/n } . For the Bayesian bootstrap, the weights are considered unknown and theirposterior distribution is derived. Rubin (1981) used a diﬀuse prior, (cid:81) ni =1 ω − i , which when combined with the(multinomial) likelihood for the data, results in a Dirichlet( n ; 1 , . . . ,

1) distribution for the posterior distributionof the weights. Thus, the weights in the BB are smoother than those from the classical bootstrap. Note that inthe BB the data are regarded as ﬁxed, so we do not resample from it. The BB has connections to the Dirichlet4rocess (Ferguson, 1973); speciﬁcally, it can be regarded as a non-informative version of the Dirichlet Process(Gasparini, 1995, Theorem 2). For a further explanation of the diﬀerent views of the BB we refer the readerto Kim et al. (2005, p. 971). PROPOSED ESTIMATOR

Our estimator extends to the three class-setting the method proposed by Gu et al. (2008) for the ROC curve.It is motivated by a simple, yet interesting and computational appealing representation of the ROC surfacethat is based on the notion of a placement variable (Pepe, 2003, Chapter 5). A placement variable is simply astandardisation of test outcomes with respect to a reference population. Consider the following two variables U = F ( Y ) , U = 1 − F ( Y ) , with U being the proportion of class 1 subjects with test outcomes less than or equal to Y and U being theproportion of class 3 subjects with test outcomes greater than Y . Here, group 2 is the reference group. Thevariables U and U quantify the degree of separation of the test outcome distributions in the three groupsof patients. Speciﬁcally, U quantiﬁes the degree of separation between the test outcomes in groups 1 and 2,whereas U quantiﬁes the degree of separation between groups 2 and 3. For instance, if the test outcomes inthe three groups are highly separated, the placement of most group 2 subjects is at the upper tail of the group1 distribution and at the lower tail of the group 3 distribution, so that most group 2 subjects will have large U and U values. On the other hand, when the three distributions of test outcomes completely overlap, both U and U will have a Uniform(0 ,

1) distribution.Interestingly, the ROC surface is the diﬀerence between the survival distribution of U and the cumulativedistribution of U . Speciﬁcally, if F − ( p ) < F − (1 − p ), we haveROCS( p , p ) = F ( F − (1 − p )) − F ( F − ( p ))= Pr( Y ≤ F − (1 − p )) − Pr( Y ≤ F − ( p ))= Pr(1 − F ( Y ) > p ) − Pr( F ( Y ) ≤ p )= Pr( U > p ) − Pr( U ≤ p ) . (1)Let ( y , . . . , y n ), ( y , . . . , y n ), and ( y , . . . , y n ) be independent (within and between groups) samplesof size n , n , and n from groups 1, 2, and 3, respectively. The result in (1) provides the rationale for thefollowing computational algorithm. Let B denote the number of iterations.5 tep 1: Computation of placement variables based on the BB. For b = 1 , . . . , B , let U ( b )1 j = F ( b )1 ( y j ) = n (cid:88) i =1 v ( b )1 i I ( y i ≤ y j ) , and U ( b )3 j = 1 − F ( b )3 ( y j ) = n (cid:88) (cid:96) =1 v ( b )3 (cid:96) I ( y (cid:96) > y j ) , where j = 1 , . . . , n , ( v ( b )11 , . . . , v ( b )1 n ) ∼ Dirichlet( n ; 1 , . . . , v ( b )31 , . . . , v ( b )3 n ) ∼ Dirichlet( n ; 1 , . . . , Step 2: Generate a random realisation of the ROC surface.

Based on (1), generate a realisation of ROCS ( b ) ( p , p ), the diﬀerence between the survival function of( U ( b )31 , . . . , U ( b )3 n ) and the distribution function of ( U ( b )11 , . . . , U ( b )1 n ), i.e., ROCS ( b ) ( p , p ) = (cid:80) n j =1 w ( b )3 j I ( U ( b )3 j > p ) − (cid:80) n j =1 w ( b )1 j I ( U ( b )1 j ≤ p ) , if (cid:80) n j =1 w ( b )3 j I ( U ( b )3 j > p ) > (cid:80) n j =1 w ( b )1 j I ( U ( b )1 j ≤ p ) , , otherwise , where p and p span grids over [0 , w ( b )11 , . . . , w ( b )1 n ) ∼ Dirichlet( n ; 1 , . . . , w ( b )31 , . . . , w ( b )3 n ) ∼ Dirichlet( n ; 1 , . . . , (cid:92) ROCS( p , p ) is obtained by averaging over the ensemble ofROC surfaces (cid:110) ROCS (1) ( p , p ) , . . . , ROCS ( B ) ( p , p ) (cid:111) , that is, (cid:92) ROCS( p , p ) = 1 B B (cid:88) b =1 ROCS ( b ) ( p , p ) . Similarly, the posterior mean for the VUS can be computed as (cid:91)

VUS = 1 B B (cid:88) b =1 VUS ( b ) , VUS ( b ) = (cid:90) (cid:90) ROCS ( b ) ( p , p )d p d p . A credible interval for the VUS can be obtained from the percentiles of the ensemble (cid:16)

VUS (1) , . . . ,

VUS ( B ) (cid:17) . SIMULATION STUDY

A simulation study was conducted to evaluate the performance of our approach to conduct inference aboutthe ROC surface and its associated VUS. 6 .1 Simulation scenarios

We considered four scenarios as listed in Table 1. Scenario 1 corresponds to the case where test outcomesfrom the three groups follow normal distributions. In Scenario 2, data from the three groups follow non-normal distributions from the same family, namely the family of gamma distributions. In Scenario 3, testoutcomes arise from diﬀerent distributional families. Lastly, Scenario 4 considers mixtures of distributions fortest outcome data, a setting that is common in practice.

For our BB estimator we only need to specify the number of BB iterates B , which we set equal to 2000. Forthe grid of values for p and p , the probabilities of correct classiﬁcation in group 1 and 3, respectively, weused 50 equidistant points on [0 , F d by its empirical distribution function. For the kernel estimator, the cumulative distributionfunction in each group is estimated as (cid:98) F d ( y ) = 1 n d n d (cid:88) i =1 Φ (cid:18) y − y di h d (cid:19) , d ∈ { , , } , where Φ( · ) stands for the standard normal distribution function. For the bandwidth h d , which controls theamount of smoothing, we considered two options. The ﬁrst option was h d = 0 . { SD( y d ) , IQR( y d ) / . } n − . d , (2)where SD( y d ) and IQR( y d ) are the standard deviation and interquantile range, respectively, of y d = ( y d , . . . , y dn d ).This is the default choice in the R statistical software (R Core Team, 2017) and it is implemented in the function bw.nrd0 . It is well-known (e.g., Wand and Jones, 1994, p. 61) that this ‘rule’, although providing reasonablebandwidths for non-normal data, is ‘optimal’ when the data distribution is normal. For this reason, we havealso considered a bandwidth selected by least squares cross-validation (Wand and Jones, 1994, Chapter 3),which is implemented in R by the function bw.ucv . Estimation of VUS for the empirical and kernel approaches7sed the following closed form expressions (see e.g., Kang and Tian, 2013): (cid:91) VUS e = 1 n n n n (cid:88) i =1 n (cid:88) j =1 n (cid:88) (cid:96) =1 I ( y i < y j < y (cid:96) ) , (cid:91) VUS k = 1 n n n n (cid:88) i =1 n (cid:88) j =1 n (cid:88) (cid:96) =1 Φ (cid:32) y j − y i (cid:112) h + h (cid:33) Φ (cid:32) y (cid:96) − y j (cid:112) h + h (cid:33) , where VUS e and VUS k stand, respectively, for the empirical and kernel VUS.For Scenario 1 (where test outcomes in each group follow a normal distribution), we also included a com-parison to a model involving independent parametric normal distributions, in order to assess the eﬃciency ofour nonparametric estimator in this context. For each of the four scenarios described in Section 4.1, 300 datasets were generated using sample sizes of( n , n , n ) = (50 , , n , n , n ) = (100 , , n , n , n ) = (200 , , n p n p n p (cid:88) u =1 n p (cid:88) u =1 (cid:110) (cid:92) ROCS( p u , p u ) − ROCS( p u , p u ) (cid:111) ≈ (cid:90) (cid:90) (cid:110) (cid:92) ROCS( p , p ) − ROCS( p , p ) (cid:111) , where n p = n p = 50. The estimated VUS and the EMSEs for Scenarios 1–4 are presented in Figures 1to 4. Speciﬁcally, for each scenario and sample size considered, we present a boxplot of the VUS estimates(along with the true value) and EMSEs produced by each method. In addition, the estimated ROC surfaces,along with the true surfaces, are shown in Figures 2–5 of the Supplementary Materials. In Scenario 1, we canappreciate a minor loss in eﬃciency of our BB estimator, which is a small price to pay for the beneﬁt of therobustness that leads to accurate data driven estimates under increasingly complex scenarios (Figures 2 to4). The BB estimator outperformed, in terms of the empirical mean squared error, the empirical estimatorfor most of the scenarios, especially for the sample size ( n , n , n ) = (50 , , .

95 and 1, which shows the validity of our inferences. APPLICATION

The Trail Making Test (TMT) is a neuropsychological test that provides information about visual searchspeed, scanning, speed of processing, as well as, executive functioning. The TMT test is commonly used asa diagnostic test of cognitive impairment associated with dementia. The TMT comprises two parts, bothconsisting of 25 circles distributed over a sheet of paper or on a computer screen. In Part A, the circles arenumbered from 1 to 25 and patients are tasked with connecting them in a sequential order (1–2–3, etc). Inpart B, the patient alternates between numbers and letters (1–A–2–B, etc). The goal is to ﬁnish both partsof the test as quickly as possible, and completion times are used as the primary performance metrics. WhilePart A is primarily used to assess cognitive processing speed, Part B is used to examine executive functioning.We analysed TMT Part A completion times for 245 patients with Parkinson’s disease (Bantis et al., 2017).Based on a battery of tests for characterising cognitive impairment, 170 patients were diagnosed as unimpaired(U), 52 patients were diagnosed as having mild cognitive impairment (MCI), and 23 patients were diagnosed ashaving dementia (D). Parkinson’s disease patients who have dementia were expected to have slower completiontimes than those with MCI, and patients with MCI were expected to have slower completion times than thosewith no cognitive impairment, that is, the anticipated ordering of completion times is U < MCI < D. Figure5 shows histograms and boxplots of the completion times for each group. We can observe a very reasonableseparation between completion times in the three groups, with an almost non-existing overlap between com-pletion times in the unimpaired and dementia group.We applied our BB methodology to the TMT Part A data. We used 5000 iterations and, as in Section 4, p and p lie on a grid of 50 even points on [0 , .

75 (0 . , . .

70 (0 . , . .

67 (0 . , . .

74 (0 . , .

83) for the kernel methodwith bandwidth calculated using equation (2), the kernel method with bandwidth selected by cross validation,and the empirical method, respectively. The empirical VUS is similar to our BB estimate, whereas the kernelVUS, for both bandwidths, is slightly lower. This is in agreement to what has been reported by Bantis et al.(2017). Overall, all estimates are similar and suggest that TMT Part A completion time is an accurate testfor dementia stage in Parkinson’s disease patients. CONCLUDING REMARKS

We have developed a ﬂexible, nonparametric method based on the Bayesian bootstrap and on the notion ofplacement value for conducting inference about the ROC surface and its functionals. In addition to providingpoint and interval estimates in a single integrated framework, our method is computationally easy to implementand very fast. A simulation study illustrated the ability of our approach to produce accurate estimates fora variety of data-generating distributions, and it demonstrated that our estimator is a viable alternative tocurrent nonparametric surface estimators. Furthermore, the validity of our inferences, in terms of frequentistprobability of coverage, was demonstrated. The TMT data analysis revealed high accuracy for distinguishingbetween Parkinson’s disease patients who present normal cognition, mild cognitive impairment and dementia.An interesting avenue for future research is the potential use of the Bayesian bootstrap for learning about theROC surface of tests subject to a limit of detection.

References

Bantis, L. E., Nakas, C. T., Reiser, B., Myall, D., and Dalrymple-Alford, J. C. (2017). Construction of jointconﬁdence regions for the optimal true class fractions of receiver operating characteristic (ROC) surfacesand manifolds.

Statistical Methods in Medical Research , 26(3):1429–1442.10ranscum, A. J., Johnson, W. O., and Baron, A. T. (2013). Robust medical test evaluation using ﬂexibleBayesian semiparametric regression models.

Epidemiology Research International , 2013.Branscum, A. J., Johnson, W. O., Hanson, T. E., and Baron, A. T. (2015). Flexible regression models forROC and risk analysis, with or without a gold standard.

Statistics in Medicine , 34(30):3997–4015.Branscum, A. J., Johnson, W. O., Hanson, T. E., and Gardner, I. A. (2008). Bayesian semiparametric ROCcurve estimation and disease diagnosis.

Statistics in Medicine , 27(13):2474–2496.Carvalho, V. I. d. and Branscum, A. J. (2018). Bayesian nonparametric inference for the three-class Youdenindex and its associated optimal cutoﬀ points.

Statistical Methods in Medical Research , 27(3):689–700.Coolen-Maturi, T., E. F. and Coolen, F. (2014). Three-group ROC analysis: A nonparametric predictiveapproach.

Computational Statistics and Data Analysis , 78:69–81.Efron, B. (1979). Bootstrap methods: Another look at the jackknife.

Annals of Statistics , 7(1):1–26.Erkanli, A., Sung, M., Jane Costello, E., and Angold, A. (2006). Bayesian semi-parametric ROC analysis.

Statistics in Medicine , 25(22):3905–3928.Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems.

The Annals of Statistics ,1(2):209–230.Gasparini, M. (1995). Exact multivariate Bayesian bootstrap distributions of moments.

The Annals of Statis-tics , 23(3):762–768.Gu, J., Ghosal, S., and Roy, A. (2008). Bayesian bootstrap estimation of ROC curve.

Statistics in Medicine ,27(26):5407–5420.Hong, C. and Cho, M. (2015). Test statistics for volume under the ROC surface and hypervolume under theROC manifold.

Communications for Statistical Applications and Methods , 22:377–387.Hwang, B. S. and Chen, Z. (2015). An integrated Bayesian nonparametric approach for stochastic and vari-ability orders in ROC curve estimation: an application to endometriosis diagnosis.

Journal of the AmericanStatistical Association , 110(511):923–934.In´acio, V., Turkman, A. A., Nakas, C. T., and Alonzo, T. A. (2011). Nonparametric Bayesian estimation ofthe three-way receiver operating characteristic surface.

Biometrical Journal , 53(6):1011–1024.11n´acio de Carvalho, V., de Carvalho, M., and Branscum, A. J. (2017). Nonparametric Bayesian covariate-adjusted estimation of the Youden index.

Biometrics , 73(4):1279–1288.In´acio de Carvalho, V., Jara, A., Hanson, T. E., and de Carvalho, M. (2013). Bayesian nonparametric ROCregression modeling.

Bayesian Analysis , 8(3):623–646.Kang, L. and Tian, L. (2013). Estimation of the volume under the ROC surface with three ordinal diagnosticcategories.

Computational Statistics & Data Analysis , 62:39–51.Kim, Y., Lee, J., and Kim, J. (2005). Bayesian bootstrap analysis of doubly censored data using Gibbs sampler.

Statistica Sinica , pages 969–980.Li, J. and Zhou, X.-H. (2009). Nonparametric and semiparametric estimation of the three way receiveroperating characteristic surface.

Journal of Statistical Planning and Inference , 139(12):4133–4142.Li, J.L., Z. X. and Fine, J. (2012). A regression approach to ROC surface, with applications to Alzheimer’sdisease.

Biostatistics , 55:1583–1595.Nakas, C. T. (2014). Developments in ROC surface analysis and assessment of diagnostic markers in three-classclassiﬁcation problems.

REVSTAT , 12:43–65.Nakas, C. T. and Yiannoutsos, C. T. (2004). Ordered multiple-class ROC analysis with continuous measure-ments.

Statistics in Medicine , 23:3437–3449.Pepe, M. S. (2003).

The Statistical Evaluation of Medical Tests for Classiﬁcation and Prediction . OxfordUniversity Press, New York.R Core Team (2017).

R: A Language and Environment for Statistical Computing . R Foundation for StatisticalComputing, Vienna, Austria.Rodr´ıguez, A. and Mart´ınez, J. C. (2013). Bayesian semiparametric estimation of covariate-dependent ROCcurves.

Biostatistics , 15(2):353–369.Rubin, D. B. (1981). The Bayesian bootstrap.

The Annals of Statistics , 9(1):130–134.Wan, S. (2012). An empirical likelihood conﬁdence interval for the volume under ROC surface.

Statistics &Probability Letters , 82:1463–1467. 12and, M. P. and Jones, M. C. (1994).

Kernel Smoothing . CRC/Chapman Hall, Boca Raton.Yang, H. and Carlin, D. (2000). ROC surface: a generalization of ROC curve analysis.

Journal of Biophar-maceutical Statistics , 10:183–196.Yu, T. (2012). ROCS: Receiver operating characteristic surface for class-skewed high-throughput data.

PLOSOne , 7(7).Zhang, Y. and Alonzo, T. (2016). Inverse probability weighting estimation of the volume under the ROCsurface in the presence of veriﬁcation bias.

Biometrical Journal , 58:1338–1356.Zhang, Y. and Li, J. (2011). Combining multiple markers for multi-category classiﬁcation: An ROC surfaceapproach.

Australian & New Zealand Journal of Statistics , 53:63–78.Zhao, L., Feng, D., Chen, G., and Taylor, J. M. (2016). A uniﬁed Bayesian semiparametric approach to assessdiscrimination ability in survival analysis.

Biometrics , 72(2):554–562.Zhou, X.-H., McClish, D. K., and Obuchowski, N. A. (2011).

Statistical Methods in Diagnostic Medicine . JohnWiley & Sons, New York, 2nd edition. 13 llllllllllllllllllll llllllllllllllllll llllllllllllllllll llllllllllllllllll lllllllllllllll E M SE Scenario 1, n =n =n =50 llllllllllllll llllllllll llllllllllllllll llllllllllllll llllllllll E M SE Scenario 1, n =n =n =100 lllllllllllllllll lllllllllllll lllllllllllllllllllllll llllllllllllllllllllllllll llllllllll E M SE Scenario 1, n =n =n =200 ll ll l ll V U S Scenario 1, n =n =n =50 l l lll lll lll V U S Scenario 1, n =n =n =100 lllll llllllll lllll llllll V U S Scenario 1, n =n =n =200 Figure 1.

Scenario 1. Boxplots summarising simulation results for the EMSE (top row) and estimates of the VUS (bottom row).The solid red line corresponds to the true VUS. Here Kernel denotes the kernel estimate with bandwidth calculated using equation(2) and Kernel-CV stands for the kernel estimate with the bandwidth selected by least squares cross-validation. llllllllllllllllllll lllllllllllllll lllllllllllll lllllllllllllllllll E M SE Scenario 2, n =n =n =50 llllllllllllllll llllllllllllllllll llllllllllllllllllllll lllllllllllllllll E M SE Scenario 2, n =n =n =100 llllllllllllllllllll lllllllllllllllllllllllllllllll llllllllllllllllllllll lllllllllllllllllllll E M SE Scenario 1, n =n =n =200 ll ll l ll V U S Scenario 2, n =n =n =50 l ll l l V U S Scenario 2, n =n =n =100 lll lll llll lll V U S Scenario 2, n =n =n =200 Figure 2.

Scenario 2. Boxplots summarising simulation results for the EMSE (top row) and estimates of the VUS (bottom row).The solid red line corresponds to the true VUS. Here Kernel denotes the kernel estimate with bandwidth calculated using equation(2) and Kernel-CV stands for the kernel estimate with the bandwidth selected by least squares cross-validation. lllllllll llllllllllllllllllll lllllllllllllll llllllllll E M SE Scenario 3, n =n =n =50 lllllllllllll llllllllllllllllllll lllllllllllllllllll lllllllllll E M SE Scenario 3, n =n =n =100 lllllllllllllllll llllllllllllllll llllllllllllllll llllllllllllllllll E M SE Scenario 3, n =n =n =200 ll lll ll ll V U S Scenario 3, n =n =n =50 lll ll lllll lll V U S Scenario 3, n =n =n =100 lll lll lllll lll V U S Scenario 3, n =n =n =200 Figure 3.

Scenario 3. Boxplots summarising simulation results for the EMSE (top row) and estimates of the VUS (bottom row).The solid red line corresponds to the true VUS. Here Kernel denotes the kernel estimate with bandwidth calculated using equation(2) and Kernel-CV stands for the kernel estimate with the bandwidth selected by least squares cross-validation. llllllllllllll lllllllllllllll llllllllllllllll lllllllllllllll E M SE Scenario 4, n =n =n =50 llllllllllll lllllllllllllll lllllllllllllllllllllll lllllllllll E M SE Scenario 4, n =n =n =100 lllllllllll llllllllllll llllllllll lllllllllll E M SE Scenario 4, n =n =n =200 lllll lll lll lllll V U S Scenario 4, n =n =n =50 llll llll llll llll V U S Scenario 4, n =n =n =100 llll lll lll llll V U S Scenario 4, n =n =n =200 Figure 4.

Scenario 4. Boxplots summarising simulation results for the EMSE (top row) and estimates of the VUS (bottom row).The solid red line corresponds to the true VUS. Here Kernel denotes the kernel estimate with bandwidth calculated using equation(2) and Kernel-CV stands for the kernel estimate with the bandwidth selected by least squares cross-validation. F r equen cy U 0510152025 0 100 200Completion time F r equen cy MCI0510152025 0 100 200Completion time F r equen cy D llll lll l C o m p l e t i on t i m e Figure 5.

Trail Making Test Part A data: histograms and variable-width boxplots of the completion times (in seconds) in eachgroup. p1 p 3 p2 (a) F r equen cy (b) p1 p 3 p2 (c) p1 p 3 p2 (d) p1 p 3 p2 (e) Figure 6.

Trail Making Test Part A data. Top row: BB estimate of the ROC surface (a) and histogram of the 5000 BBsampled VUS (b). Bottom row: estimated ROC surfaces using the kernel approach with bandwidths computed as in (2) (c), kernelapproach with bandwidths selected by cross-validation (d), and the empirical approach (e). Y Y Y , ) N(1 . , ) N(3 , )2 Gamma(2 ,

1) Gamma(3 ,

1) Gamma(5 , t Beta(2 , χ . , ) + 0 . , ) 0 . , ) + 0 . , . ) 0 . , ) + 0 . , )Table 1. Scenarios considered for the simulation study.

Scenario ( n , n , n ) = (50 , ,

50) ( n , n , n ) = (100 , , n , n , n ) = (200 , , .

95 0 .

98 0 .

972 0 .

99 1 0 .

993 0 .

95 0 .

97 0 .

954 0 .

98 0 .

98 0 . VUS 95% coverage probabilities. SUPPLEMENTARY MATERIALS D en s i t y D en s i t y D en s i t y D en s i t y D i s t r i bu t i on f un c t i on D i s t r i bu t i on f un c t i on D i s t r i bu t i on f un c t i on D i s t r i bu t i on f un c t i on p1 p 3 p2 VUS=0.170 p1 p 3 p2 VUS=0.337 p1 p 3 p2 VUS=0.702 p1 p 3 p2 VUS=1

Figure 1.

Diﬀerent degrees of overlap/stochastic ordering between f1/ F1, f2/ F2, and f3/ F3 for hypothetical test outcomesdistributions and corresponding ROC surfaces. Solid lines represent test outcomes in group 1, dashed lines in group 2, and dottedlines in group 3. p1 p 3 p2 True p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical

Figure 2.

Scenario 1. True ROC surface and mean across the 300 estimated ROC surfaces. First row: n = n = n =50. Secondrow: n = n = n =100. Third row: n = n = n =200. Here Kernel denotes the kernel estimate with bandwidth calculatedusing equation (2) of the main manuscript and Kernel-CV stands for the kernel estimate with the bandwidth selected by leastsquares cross-validation. p1 p 3 p2 True p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical

Figure 3.

Scenario 2. True ROC surface and mean across the 300 estimated ROC surfaces. First row: n = n = n =50. Secondrow: n = n = n =100. Third row: n = n = n =200. Here Kernel denotes the kernel estimate with bandwidth calculatedusing equation (2) of the main manuscript and Kernel-CV stands for the kernel estimate with the bandwidth selected by leastsquares cross-validation. p1 p 3 p2 True p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical

Figure 4.

Scenario 3. True ROC surface and mean across the 300 estimated ROC surfaces. First row: n = n = n =50. Secondrow: n = n = n =100. Third row: n = n = n =200. Here Kernel denotes the kernel estimate with bandwidth calculatedusing equation (2) of the main manuscript and Kernel-CV stands for the kernel estimate with the bandwidth selected by leastsquares cross-validation. p1 p 3 p2 True p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical p1 p 3 p2 BB p1 p 3 p2 Kernel p1 p 3 p2 Kernel−CV p1 p 3 p2 Empirical

Figure 5.

Scenario 4. True ROC surface and mean across the 300 estimated ROC surfaces. First row: n = n = n =50. Secondrow: n = n = n =100. Third row: n = n = n =200. Here Kernel denotes the kernel estimate with bandwidth calculatedusing equation (2) of the main manuscript and Kernel-CV stands for the kernel estimate with the bandwidth selected by leastsquares cross-validation. .000.250.500.751.00 0 100 200Completion time D i s t r i bu t i on f un c t i on U 0.000.250.500.751.00 0 100 200Completion time D i s t r i bu t i on f un c t i on MCI0.000.250.500.751.00 0 100 200Completion time D i s t r i bu t i on f un c t i on D 0.000.250.500.751.00 0 100 200Completion time D i s t r i bu t i on f un c t i on Figure 6.

TMT Part A data: BB estimate and 95% pointwise probability band of (a) F (unimpaired group), (b) F (mildcognitive impairment group), and (c) F (dementia group). U=c(34, 58, 18, 29, 30, 37, 41, 36, 15, 36, 40, 36, 32, 26, 28, 25, 40,34, 27, 27, 35,17, 56, 31, 29, 34, 46, 29, 44, 38, 31, 29, 50, 50, 41, 28, 34, 44, 43, 34, 67, 76,33, 28,51, 45, 61, 36, 47, 30, 35, 39, 42, 40, 42, 41,17, 25, 48, 61, 48, 34, 31, 35,48, 30, 33, 34, 34, 58, 28, 28, 24,55, 21, 21, 37, 25, 38, 40, 55, 35, 39, 34,28, 37,37, 46, 37, 51, 37, 30, 46, 37, 24, 38, 23, 52, 40, 34, 29,44, 30, 24, 35, 21, 48, 47,16, 34, 30, 28,35, 36, 34, 27, 31, 37, 26, 50, 44, 42, 32, 42, 48, 43, 49, 23, 49, 16,26, 52, 34, 55, 51, 46, 63, 42, 41, 53,38, 21, 68, 56, 46, 31, 33, 52, 33, 30, 50, 71,29, 48, 63, 39, 31, 32, 32, 43, 26, 35, 40, 39, 31, 31, 30, 24, 47, 30) CI=c(66, 34, 44, 56, 75, 45, 48, 43, 62, 68, 85, 107, 34, 82, 68, 103, 51, 57, 50, 30,38, 59, 31, 68, 65, 62, 51, 74, 46, 70, 40, 54, 51, 56, 40, 72, 123, 62, 64, 76,77, 75, 55, 94, 44, 51, 62, 33, 58, 53, 39, 55)D=c(182, 63, 166, 143, 94 ,155, 78, 91, 239, 261, 101, 129, 73, 214, 82, 72, 107,129, 128, 52, 94, 71, 101)y1=U; y2=MCI; y3=Dp=seq(0.0001,0.9999,len=50) rocsbb=function(y1,y2,y3,p1,p3,B) { np=length(p1); n1=length(y1); n2=length(y2); n3=length(y3)rocbb=array(0,c(np,np,B)); vusb=numeric(B)for(b in 1:B) { aux3=rexp(n3,1); v3=aux3/sum(aux3)aux1=rexp(n1,1); v1=aux1/sum(aux1)u3=numeric(n2); u1=numeric(n2)for(j in 1:n2) { u3[j]=sum(v3*(y3>y2[j]))u1[j]=sum(v1*(y1<=y2[j])) } aux3a=rexp(n2,1); omega3=aux3a/sum(aux3a)aux1a=rexp(n2,1); omega1=aux1a/sum(aux1a)for(i in 1:np) { for(j in 1:np) { rocbb[i,j,b]=sum(omega3*(u3>p3[j]))-sum(omega1*(u1<=p1[i]))if(rocbb[i,j,b]<0) rocbb[i,j,b]=0 }} vusb[b]=sum(rocbb[,,b])/(np*np) } rocbbf=apply(rocbb,1:2,mean); vusbf=mean(vusb); int=quantile(vusb,c(0.025,0.975)) eturn(list(rocbbf,vusbf,int)) } res=rocsbb(y1=y1,y2=y2,y3=y3,p1=p,p3=p,B=5000) persp(p,p,res[[1]],phi=30,theta=60, xlab = " \ n \ n p1", ylab = " \ n \ n p3", zlab = " \ n \ n p2",ticktype="detailed",cex.lab=1.4) vusm=res[[2]]; vusci=res[[3]]vusm; vuscivusm=res[[2]]; vusci=res[[3]]vusm; vusci