Reliability Analysis of Artificial Intelligence Systems Using Recurrent Events Data from Autonomous Vehicles
Yili Hong, Jie Min, Caleb B. King, and William Q. Meeker

Department of Statistics, Virginia Tech, Blacksburg, VA 24061; JMP Division, SAS, Cary, NC 27513; Department of Statistics, Iowa State University, Ames, IA 50011
Abstract
Artificial intelligence (AI) systems have become increasingly common, and the trend will continue. Examples of AI systems include autonomous vehicles (AV), computer vision, natural language processing, and AI medical experts. To allow for safe and effective deployment of AI systems, the reliability of such systems needs to be assessed. Traditionally, reliability assessment is based on reliability test data and the subsequent statistical modeling and analysis. The availability of reliability data for AI systems, however, is limited because such data are typically sensitive and proprietary. The California Department of Motor Vehicles (DMV) oversees and regulates an AV testing program, in which many AV manufacturers are conducting AV road tests. Manufacturers participating in the program are required to report recurrent disengagement events to the California DMV. This information is being made available to the public. In this paper, we use recurrent disengagement events as a representation of the reliability of the AI system in AV, and propose a statistical framework for modeling and analyzing the recurrent events data from AV driving tests. We use traditional parametric models in software reliability and propose a new nonparametric model based on monotonic splines to describe the event process. We develop inference procedures for selecting the best models, quantifying uncertainty, and testing heterogeneity in the event process. We then analyze the recurrent events data from four AV manufacturers, and make inferences on the reliability of the AI systems in AV. We also describe how the proposed analysis can be applied to assess the reliability of other AI systems.
Key Words:
Disengagement Events; Fractional Random Weight Bootstrap; Gompertz Model; Monotonic Splines; Software Reliability; Self-driving Cars.

1 Introduction
With the rapid development of new technology, artificial intelligence (AI) systems are emerging in many areas. Typical applications of AI systems include autonomous vehicles (AV), computer vision, speech recognition, and AI medical experts. The reliability and safety of AI systems need to be assessed before massive deployment in the field. For example, the reliability of AV needs to be demonstrated so that people can use them with confidence. Traditionally, reliability assessment of products and systems is based on reliability test data collected from the laboratory and the field. Reliability information is then extracted from statistical modeling and analysis of the data.

Commonly used data types for reliability analysis are time-to-event data, degradation data, and recurrent events data. Reliability data collected by manufacturers are highly sensitive and are usually not publicly available. In the area of AV, however, the California Department of Motor Vehicles (DMV) has launched an AV driving program. More details of the study are given in Section 2. Many AV manufacturers are participating in the program and are conducting AV road tests in California. Manufacturers are required to report disengagement events and mileage information to the California DMV. The reported events are available for public access. Because of the availability of these recurrent events data, we focus on the reliability analysis of the AI systems in AV units in this paper.

A disengagement event happens when the AI system and/or the backup driver determines that the driver needs to take over the driving. The recurrence rates of disengagement events can be used as a proxy for the reliability of the AI system in AV. A lower occurrence rate (event rate) of disengagement events would indicate a more reliable AI system in the AV. In the reliability literature, parametric forms have been used to describe the event rate through a nonhomogeneous Poisson process (NHPP) model.
In practice, some specific questions arise in the analysis of the recurrent disengagement events data:

• How to model the event process, and what kind of parametric forms should be used?

• Does the parametric form provide an adequate fit to the data, and are there any other flexible forms for modeling?

• Is there any population heterogeneity in the event processes from multiple test units?

To answer those practical questions, we develop a statistical framework for modeling and analyzing the recurrent events data from AV driving tests. Specifically, we use the NHPP model to describe the disengagement event process with adjustment for the time-varying mileage data, using traditional parametric models in software reliability for the intensity functions. We also propose a new nonparametric model based on monotonic splines to describe the event process. The spline model is flexible enough to fit various patterns in the event process and can also be used to assess if the fully-parametric model provides an adequate fit. We develop inference procedures for selecting the best models and quantifying uncertainty using the fractional-random-weight bootstrap. The parametric models and spline models can be used as complementary tools. In addition, we use the gamma frailty model to quantify and assess heterogeneity in the event process.

From the California driving study, we use data from four manufacturers that have been conducting extensive AV driving tests in California. We apply the developed methods to analyze the recurrent events data from the four AV manufacturers. Based on the modeling, we make inferences on the reliability of the AI systems in AV, and summarize interesting findings on the reliability of the AI systems in AV. Although our analysis focuses on the AI systems in AV, the statistical analysis can also be applied to assess the reliability of other AI systems.
There is only a small amount of literature on reliability analysis of AI systems. Amodei et al. (2016) provided a general discussion on AI safety and outlined five concrete areas for AI safety research. Bosio et al. (2019) conducted a reliability analysis of deep convolutional neural networks (CNN) developed for automotive applications using fault injections. Goldstein et al. (2020) investigated the impact of transient faults on the reliability of compressed deep CNN. Zhao et al. (2020) proposed a safety framework based on Bayesian inference for critical systems using deep learning models. Alshemali and Kalita (2020) provided a review on improving the reliability of natural language processing. Due to the limited availability of test data, statistical reliability analysis of AI reliability/safety is in an emerging stage.

As an example of the modeling of the reliability and safety of AV, Kalra and Paddock (2016) used a statistical hypothesis testing approach to determine the needed miles of driving to demonstrate AV safety. Åsljung, Nilsson, and Fredriksson (2017) used extreme value theory to model the safety of AV. Michelmore et al. (2019) designed a statistical framework to evaluate the safety of deep neural network controllers and assessed the safety of AV. Burton et al. (2020) provided a multi-disciplinary perspective on the safety of AV from engineering, ethics, and law aspects. Most existing modeling frameworks, however, do not involve large-scale field-testing data, which is essential for reliability assessment.

The California driving test data provide unique opportunities for data analysis. Regarding the analysis of the California driving data, Dixit, Chand, and Nair (2016) and Favarò, Eurich, and Nader (2018) analyzed the causes of disengagement events using the California driving test data up to 2017 and showed the relationship between disengagement events per mile and cumulative miles. Lv et al.
(2018) performed a descriptive analysis of the causes of disengagement events using the California driving test data from 2014 to 2015, and concluded that software issues and limitations were the most common reasons for disengagement events. Banerjee et al. (2018) used linear regressions to model the relationship between disengagements per mile and cumulative miles using the data from 2016 to 2017. Merkel (2018) conducted data analysis on the California driving data from 2015 to 2017 with aggregated counts data and least-squares fits. Zhao et al. (2019) proposed a general conservative Bayesian inference method to estimate the rate of events (crashes and fatalities) and illustrated it with the California driving test data. Boggs, Wali, and Khattak (2020) did an exploratory analysis for AV crashes from the California driving study. So far, there is no comprehensive statistical treatment for the analysis of AI reliability, and especially for AV reliability. Starting in 2018, the exact disengagement event time can be extracted from the California DMV reports, which makes it possible to apply recurrent events modeling techniques.

In the reliability literature, NHPP models are widely used to analyze recurrent events. Zuo, Meeker, and Wu (2008), and Hong, Li, and Osborn (2015) analyzed recurrent events data with window-observations, which have a data type similar to that of the disengagement events data from the California driving study. Parametric models, such as the Musa-Okumoto model (e.g., Musa and Okumoto 1984) and the Gompertz model (e.g., Huang, Lyu, and Kuo 2003), are commonly used in software reliability applications (e.g., Wood 1996). Ehrlich et al. (1998) used accelerated testing methods to study software reliability. Burke, Jones, and Noufaily (2020) proposed flexible parametric models for time-to-event data analysis. Useful reference books on reliability data analysis for researchers in the AI reliability area include Meeker and Escobar (1998), and Lawless (2003).
Overall, parametric models are common in analyzing recurrent events data in the context of reliability studies. We propose to use monotonic splines (Ramsay 1988, and Meyer 2008) as a nonparametric method to model the event process. Although monotonic splines are used in some degradation settings (Xie et al. 2018), the application of monotonic splines to recurrent events in reliability is new. To model population heterogeneity, the gamma frailty model is used. An early use of the gamma frailty model in reliability is found in Lawless (1995). More recently, Shan, Hong, and Meeker (2020) used the gamma frailty model to describe seasonal warranty return data. Duchateau and Janssen (2008) provide a comprehensive review of frailty models.

Due to the complicated structure of the window-observed recurrent events data in our study, we use the fractional-random-weight bootstrap as a convenient way to generate bootstrap samples for statistical inference. The idea of the fractional-random-weight bootstrap was introduced in Rubin (1981), and some theoretical properties are shown in Jin, Ying, and Wei (2001). Xu et al. (2020) demonstrated the use of the fractional-random-weight bootstrap in many complex applications in reliability, survival analysis, and logistic regression. A simultaneous confidence band (SCB) can be used to assess if a parametric model is adequate for fitting the data. Hong, Escobar, and Meeker (2010) showed that an SCB could be obtained from a simultaneous confidence region (SCR) for parameters. However, in the context of the bootstrap, it is not straightforward to construct an SCR for parameter estimators with multiple dimensions. Hence, we use the idea of the equal-precision band in Nair (1984) and use bootstrap samples to calibrate pointwise confidence intervals to provide an SCB to quantify statistical uncertainty.

In summary, we provide a general analytic framework by integrating existing methods and proposing new methods for reliability analysis of data from an AI study.
The parametric and nonparametric models, and the statistical interval and testing procedures, will be useful tools for practitioners working in the area of AI reliability.
The rest of the paper is organized as follows. Section 2 describes the California autonomous vehicle driving study and introduces the data. Section 3 describes the parametric model, the spline model, and the gamma frailty model that are used to describe the recurrent disengagement events data. Section 4 describes the parameter estimation procedures and the inference procedures for the various models. Section 5 conducts a simulation study to show the statistical performance of the estimation procedures. Section 6 conducts the data analysis and summarizes interesting findings. Section 7 contains some concluding remarks and areas for future research.
2 The California Autonomous Vehicle Driving Study

This paper presents reliability modeling and analysis of autonomous vehicles (AV) using data from the California Department of Motor Vehicles (DMV) autonomous vehicle tester program, which has been in operation since 2014. The tester program allows manufacturers to test AV on California public roads with a human in the driver seat who can take control of the vehicle if necessary. Up to July 1, 2020, 62 manufacturers were permitted to perform AV drive testing. Manufacturers are required to report disengagement events annually, and collisions of AV within 10 days of the accident. Before December 1, 2017, only the aggregated number of disengagement events per month was reported; after that date, the exact date of each event is reported. Thus, we focus on the analysis of the data after December 1, 2017.

Since it can be difficult to determine responsibility in collisions, a disengagement event is typically used as an alternative for determining unsafe auto-driving in the literature (e.g., Merkel 2018). During the period from December 1, 2017, to November 30, 2019, 34 manufacturers reported disengagement events. We use the data from disengagement events reported by Waymo, Cruise, Pony AI, and Zoox, as these four manufacturers performed extensive on-road testing during this time period.

Here we provide a brief introduction to those four manufacturers. Waymo began as the Google Self-Driving Car Project in 2009, testing autonomous vehicles on public roads across six states in the United States. Waymo joined the California tester program in 2015. Cruise is an autonomous vehicle company founded in 2013. It joined the California tester program in 2016 and tested AV in the urban environment of San Francisco. Pony AI was founded in 2016, developing autonomous driving technology globally. Pony AI started AV testing on California public roads in June 2017. Zoox is an autonomous vehicle company founded in 2013.
They joined the California tester program in 2017 and tested AV in downtown San Francisco.
The California DMV requires manufacturers to report when their vehicles disengage from autonomous mode during tests. A disengagement event occurs when there is an autonomous technology failure, or when a situation requires the test driver to take manual control of the vehicle to operate safely. Disengagement events can be initiated by a warning from the autonomous vehicle system, or by test drivers when the driver thinks it is not safe to continue auto driving. Disengagement reports are provided annually, and include the ID numbers of vehicles in testing, the location and date of disengagement events, a description of the cause of each disengagement, the monthly autonomous mileage of each testing vehicle, and the annual total autonomous miles of each testing vehicle.

The starting date for the data used for this paper is December 1, 2017. The study period is from December 1, 2017, to November 30, 2019, which is a 24-month or 2-year study period. The data for the period from December 1, 2018, to November 30, 2019, are available in csv format from the California DMV website, while the data for the period from December 1, 2017, to November 30, 2018, are available in pdf format that needs to be manually converted to csv format.

After data cleaning, the event time is computed as the number of days since the starting date. The monthly mileage is divided by the number of days in that month to obtain the daily mileage, under the approximation that the daily mileage is constant over each month. The unit for the mileage is thousands of miles (k-miles). Figure 1 shows a subset of the recurrent events data and mileage data from manufacturer Waymo. Figure 1(a) shows the recurrent events data with the crosses representing the event times and the blue segments showing the active months (i.e., events can only be recorded within the observation windows). Figure 1(b)
plots the mileage as a function of time for five representative units. Table 1 shows a summary of the recurrent events data and mileage data from the four manufacturers. We can see that Waymo and Cruise have driven more than 1 million miles, and the disengagement event rate is around 0.1 events per k-mile during the 24-month testing period. Pony AI and Zoox have driven fewer miles, and the event rates are around 0.2 events per k-mile and 0.6 events per k-mile, respectively.

Figure 1: Visualization of a subset of the recurrent events data and mileage data from manufacturer Waymo. (a) The recurrent events data, with the crosses showing the event times and the blue segments showing the active months. (b) A plot of the mileage as a function of time for five representative units. Note that the miles driven are in units of thousands of miles (k-miles).
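The monthly-to-daily mileage conversion described above can be sketched as follows; the monthly totals are made-up numbers for one hypothetical unit, not values from the DMV reports.

```python
import numpy as np

# Hypothetical monthly autonomous miles for one unit, assuming a study that
# starts on 2017-12-01; these totals are invented for illustration only.
monthly_miles = {"2017-12": 3100, "2018-01": 2480, "2018-02": 2800}
days_in_month = {"2017-12": 31, "2018-01": 31, "2018-02": 28}

# Daily mileage in k-miles, held constant within each month (the step
# function x_i(t) used in the paper).
daily_kmiles = {m: monthly_miles[m] / days_in_month[m] / 1000.0
                for m in monthly_miles}

# Expand to a per-day step function over the first 90 days of the study.
x_i = np.concatenate([np.full(days_in_month[m], daily_kmiles[m])
                      for m in ["2017-12", "2018-01", "2018-02"]])
```

Summing the daily values over a month recovers that month's total in k-miles, consistent with the constant-within-month approximation.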
The number of AV testing units (vehicles) in the fleet is denoted by n. The total follow-up time is τ = 730 days (i.e., two years). Let t_ij, i = 1, ..., n, j = 1, ..., n_i, be event time j for unit i. Here t_ij records the number of days since December 1, 2017, and n_i is the number of events for unit i. Note that it is possible that n_i = 0, indicating that there were no events observed for unit i.

Let x_i(t), 0 < t ≤ τ, be the mileage driven for unit i at time (day) t. The unit of x_i(t) is k-miles. Because only monthly mileage was reported, the daily average was used for x_i(t). Thus, x_i(t) is a step function, which can be represented as,

x_i(t) = Σ_{l=1}^{n_τ} x_{il} 1(τ_{l−1} < t ≤ τ_l).   (1)

Here, n_τ = 24 is the number of months in the follow-up period, x_{il} is the daily mileage for unit i during month l, τ_l is the ending day, counted from the start of the study, for month l, and 1(·) is an indicator function. Let x_i(t) = {x_i(s) : 0 < s ≤ t} be the history of the mileage driven for unit i.

Table 1: Summary of the recurrent events data and mileage data from the four manufacturers over the 24-month study period.

Manufacturer | No. of Vehicles | Active Months | Active Months per Vehicle | No. of Events | Total in k-miles | No. of Events per k-mile
Waymo        |             123 |          1550 |                    12.602 |           224 |         2710.136 |                     0.083
Cruise       |             304 |          2079 |                     6.839 |           154 |         1278.661 |                     0.120
Pony AI      |              23 |           179 |                     7.783 |            43 |          190.871 |                     0.225
Zoox         |              32 |           280 |                     8.750 |            58 |           97.780 |                     0.593

3 Statistical Models

Recurrent events processes are commonly modeled by a nonhomogeneous Poisson process (NHPP). The event intensity function for unit i is modeled as,

λ_i[t; x_i(t), θ] = λ_0(t; θ) x_i(t).   (2)

Here, λ_0(t; θ) = λ_0(t) is the baseline intensity function (BIF) with parameter vector θ. Because x_i(t) is the mileage driven, λ_i[t; x_i(t), θ] is the mileage-adjusted event intensity. The BIF can be interpreted as the event rate per k-mile at time t when x_i(t) = 1. The baseline
The baselinecumulative intensity function (BCIF) is,Λ ( t ; θ ) = Λ ( t ) = (cid:90) t λ ( s ; θ ) ds. (3)Note that Λ (0; θ ) = 0 and Λ ( t ; θ ) is a non-decreasing function of t . The BCIF Λ ( t ) can beinterpreted as the expected number of events from time 0 to t when x ( t ) = 1 for all t . Thecumulative intensity function (CIF) for unit i is,Λ i [ t ; x i ( t ) , θ ] = (cid:90) t λ ( s ; θ ) x i ( s ) ds. (4)8able 2: List of commonly used parametric models and their BIFs, BCIFs, and parameters.Model BCIF Λ ( t ; θ ) BIF λ ( t ; θ ) ParametersMusa-Okumoto θ − log(1 + θ θ t ) θ (1 + θ θ t ) − θ = ( θ , θ ) (cid:48) θ > , θ > θ θ θ t − θ θ θ θ t θ θ t log( θ ) log( θ ) θ = ( θ , θ , θ ) (cid:48) θ > , < θ , θ < θ [1 − exp( − θ t θ )] θ θ θ t ( θ − exp( − θ t θ ) θ = ( θ , θ , θ ) (cid:48) θ > , θ > , θ > ( t ; θ ).Note that the BIF can be obtained by taking the derivative of BCIF with respect to t .That is, λ ( t ; θ ) = d Λ ( t ; θ ) /dt. The commonly used models for Λ ( t ; θ ) are Musa-Okumoto,Gompertz, and the Weibull models (e.g., Merkel 2018). Table 2 lists their BIFs, BCIFs, andparameters. Note that the Weibull BCIF is similar to the Weibull cumulative distributionfunction (cdf) but with an asymptote θ as t goes to ∞ . Although parametric models can fit certain trends in the event process, they may not be flex-ible enough to describe the event process for AV testing, as the evolution of the AI technologyin an AV system can be complicated, which motivates us to propose the spline model fordescribing the BCIF. In the spline model, the BCIF is represented as a linear combination ofspline bases. That is, Λ ( t ; θ ) = n s (cid:88) l =1 β l γ l ( t ) , β l ≥ , l = 1 , . . . , n s , (5)which provides a nonparametric method to describe the BCIF. Here θ = ( β , . . . , β n s ) (cid:48) is thevector for the spline coefficients, γ l ( t )’s are the spline bases, and n s is the number of splinebases. The BIF can be obtained by taking a derivative with respect to t . 
That is,

λ_0(t; θ) = dΛ_0(t; θ)/dt = Σ_{l=1}^{n_s} β_l dγ_l(t)/dt.   (6)
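A monotone basis like the γ_l(t) in (5)–(6) can be sketched numerically. The helper below is an illustrative stand-in, not the paper's implementation: it builds monotone bases by numerically integrating B-splines (a numerical analogue of the closed-form I-splines of Ramsay 1988) and uses equally spaced interior knots; the function name and defaults are assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

def ispline_basis(t, tau=730.0, n_interior=3, degree=3):
    """Monotone spline bases on [0, tau]: each column rises from 0 to 1."""
    order = degree + 1
    interior = np.linspace(0.0, tau, n_interior + 2)[1:-1]
    # Boundary knots repeated 'order' times, as usual for B-spline bases.
    knots = np.concatenate([np.zeros(order), interior, np.full(order, tau)])
    n_basis = len(knots) - order
    grid = np.linspace(0.0, tau, 2001)     # fine grid for numeric integration
    out = np.empty((len(t), n_basis))
    for l in range(n_basis):
        c = np.zeros(n_basis)
        c[l] = 1.0                         # isolate the l-th B-spline
        b = np.nan_to_num(BSpline(knots, c, degree, extrapolate=False)(grid))
        # Cumulative trapezoid integral: monotone, and zero at t = 0.
        cum = np.concatenate(
            [[0.0], np.cumsum(0.5 * (b[1:] + b[:-1]) * np.diff(grid))])
        out[:, l] = np.interp(t, grid, cum / cum[-1])  # normalize to end at 1
    return out

G = ispline_basis(np.linspace(0.0, 730.0, 50))
```

A BCIF sketch is then `G @ beta` for any non-negative coefficient vector `beta`; by construction it is non-decreasing and equals zero at t = 0, matching the constraints on Λ_0(t; θ).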
Figure 2: Examples of I-spline basis functions.

Because of the constraints that Λ_0(0; θ) = 0 and that Λ_0(t; θ) is a non-decreasing function of t, some special considerations are needed in the spline model. We use the I-splines in Ramsay (1988). Figure 2 shows examples of I-spline basis functions (i.e., the γ_l(t)). We can see that each spline basis takes the value zero at t = 0 and is monotonically increasing. By taking non-negative coefficients (i.e., β_l ≥ 0), a non-decreasing Λ_0(t; θ) is obtained.

A brief introduction to the construction of the I-spline basis is as follows. The boundary knots are 0 and τ. The b interior knots are denoted by t_{h+1}, ..., t_{h+b} for splines of order h. The complete sequence of knots is denoted by 0 = t_1 = ··· = t_h < t_{h+1} < ··· < t_{h+b}

4 Parameter Estimation and Inference

This section presents parameter estimation procedures for the parametric and spline models, and the corresponding statistical inference procedures. This section also contains parameter estimation and hypothesis testing for the gamma frailty model.

We use maximum likelihood (ML) methods for parameter estimation. The likelihood function is,

L(θ) = ∏_{i=1}^n { ∏_{j=1}^{n_i} λ_i[t_ij; x_i(t_ij), θ] } × exp{ −Λ_i[τ; x_i(τ), θ] },   (8)

with the convention that ∏_{j=1}^0 (·) = 1. Here, the intensity function and the CIF are defined in (2) and (4), respectively. The log-likelihood function is obtained by taking the logarithm of L(θ) in (8). That is,

l(θ) = Σ_{i=1}^n ( Σ_{j=1}^{n_i} { log[x_i(t_ij)] + log[λ_0(t_ij; θ)] } ) − Σ_{i=1}^n Λ_i[τ; x_i(τ), θ]   (9)
     = Σ_{i=1}^n Σ_{j=1}^{n_i} { log[x_i(t_ij)] + log[λ_0(t_ij; θ)] } − Σ_{i=1}^n Σ_{l=1}^{n_τ} { x_{il} · [Λ_0(τ_l; θ) − Λ_0(τ_{l−1}; θ)] },

with the convention that Σ_{j=1}^0 (·) = 0.
The last step in (9) is obtained by substituting the expression for x_i(t) in (1).

For parametric models, the functional forms of Λ_0(t) and λ_0(t) in Table 2 can be substituted into (9) to evaluate the log-likelihood function. The ML estimate of θ, denoted by θ̂, can be obtained by finding the value of θ that maximizes l(θ). For parametric models, the dimension of θ is typically 2 or 3. We used the R optim() function with the "Nelder-Mead" option to do the optimization.

For the spline model, the functional forms of Λ_0(t) and λ_0(t) in (5) and (6), respectively, can be substituted into (9) to evaluate the log-likelihood function. To estimate the parameter θ for the spline model, one needs to specify the locations and the number of knots, and address the non-negativity constraints on θ as indicated in (5).

We use I-splines of degree 3, which are generally smooth enough to fit the BCIF. The boundary knots are set to be 0 on the left side and τ = 730 on the right side. The interior knots are set to be equally spaced sample quantiles of the observed event times. For example, if three interior knots are needed, the interior knots are set to the 0.25, 0.5, and 0.75 sample quantiles of the observed event times. After setting the knot locations, the spline bases can be computed.

To select the best number of knots, we use the Akaike information criterion (AIC), which is computed as

AIC = −2 l(θ̂) + 2 df.

Here, df is the number of degrees of freedom of the model. For the spline model, df is the number of non-zero coefficients. To address the non-negativity constraints on θ, we use the "L-BFGS-B" option in the R optim() function. The "L-BFGS-B" algorithm allows users to specify an interval for the variable to be optimized. We set the interval to be [0, ∞) for the spline coefficients. Using the "L-BFGS-B" algorithm, some of the elements of the estimate θ̂ can be set to zero.
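As an illustration of evaluating and maximizing the log-likelihood (9), the sketch below fits the Gompertz model of Table 2 to a tiny made-up dataset. The paper uses R optim(); this Python/scipy version, the toy event times, the 73-day blocks, the mileage values, and the starting values are all assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Gompertz BCIF and BIF from Table 2 (parameterization as reconstructed there).
def Lam0(t, th):                    # Lambda_0(t; theta)
    t1, t2, t3 = th
    return t1 * (t2 ** (t3 ** t) - t2)

def lam0(t, th):                    # d Lambda_0 / dt
    t1, t2, t3 = th
    return t1 * (t2 ** (t3 ** t)) * (t3 ** t) * np.log(t2) * np.log(t3)

# Toy window-observed data: event days t_ij and daily mileage x_il (k-miles)
# over ten 73-day blocks for two units; purely illustrative numbers.
events = [np.array([40.0, 200.0, 350.0]), np.array([120.0])]
tau_l = np.linspace(0.0, 730.0, 11)          # block boundaries tau_0..tau_10
x = [np.full(10, 0.12), np.full(10, 0.08)]

def negloglik(log_th):
    th = np.exp(log_th)                      # enforce theta_1 > 0
    if th[1] >= 1.0 or th[2] >= 1.0:         # need 0 < theta_2, theta_3 < 1
        return np.inf
    ll = 0.0
    for t_i, x_i in zip(events, x):
        blk = np.searchsorted(tau_l, t_i, side="left") - 1  # block per event
        ll += np.sum(np.log(x_i[blk]) + np.log(lam0(t_i, th)))
        ll -= np.sum(x_i * np.diff(Lam0(tau_l, th)))        # x_il * increments
    return -ll

fit = minimize(negloglik, x0=np.log([5.0, 0.5, 0.995]), method="Nelder-Mead")
theta_hat = np.exp(fit.x)
```

Working on the log scale plus the in-bounds check is one simple way to respect the parameter constraints under a derivative-free optimizer; the increment term mirrors the last line of (9).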
For inference based on parametric models, the normal approximation based on large-sample theory can be used. Because the inference for parametric models is relatively straightforward, the focus of this section is on the inference for the spline model. For the spline model, the normal approximation is inappropriate because the parameter estimates can occur at the boundary of the parameter space (i.e., β̂_l can be zero). Thus, we use the fractional-random-weight bootstrap (e.g., Xu et al. 2020). Different from other bootstrap methods, the fractional-random-weight bootstrap provides a convenient way to quantify uncertainty for data with a complex structure. In our case, we need to handle recurrent events with window observations and time-varying mileage information, which complicate the structure.

The implementation of the fractional-random-weight bootstrap is straightforward. The log-likelihood function in (9) is re-weighted as

l*(θ) = Σ_{i=1}^n w_i Σ_{j=1}^{n_i} { log[x_i(t_ij)] + log[λ_0(t_ij; θ)] } − Σ_{i=1}^n w_i Σ_{l=1}^{n_τ} { x_{il} · [Λ_0(τ_l; θ) − Λ_0(τ_{l−1}; θ)] },   (10)

where the random weights w_i are independently generated from an exponential distribution with mean one. The bootstrap algorithm for generating bootstrap versions of the estimate of the BCIF, Λ̂_0*(t), is summarized as follows.

Algorithm 1: Bootstrap Algorithm for Generating Λ̂_0*(t).

1. Generate random weights w_i, i = 1, ..., n, independently from an exponential distribution with mean one.

2. Construct the re-weighted log-likelihood function as in (10).

3. Use AIC to select the best number of knots based on the log-likelihood function in (10).

4. Based on the best number of knots chosen in step 3, the corresponding spline bases, and the estimated coefficients, compute the estimate of the BCIF as Λ̂_0*(t).

5. Repeat steps 1 to 4 to obtain B copies of Λ̂_0*(t), denoted by Λ̂_{0b}*(t), b = 1, . . .
, B.

Based on the bootstrap estimates Λ̂_{0b}*(t), b = 1, ..., B, one can construct an approximate 100(1 − α_p)% pointwise confidence interval (PCI) for Λ_0(t) at a given t,

[ Λ̂_0^{*([B α_p/2])}(t), Λ̂_0^{*([B(1 − α_p/2)])}(t) ].

Here, Λ̂_0^{*(b)}(t) is the ordered version of Λ̂_{0b}*(t), and [·] is the rounding function.

Based on the spline model, one can construct a simultaneous confidence band (SCB) for the BCIF, which can be used to assess the adequacy of the parametric models in fitting the data. An approximate 100(1 − α)% SCB for Λ_0(t), t_L ≤ t ≤ t_U, is

[ Λ̰_0(t), Λ̃_0(t) ], t_L ≤ t ≤ t_U,   (11)

where Λ̰_0(t) and Λ̃_0(t) are the lower and upper band limits, and t_L and t_U are boundaries to be specified. If an estimated parametric model is contained in the SCB in (11), the parametric model is statistically consistent with the data (i.e., there is no statistical evidence to reject the parametric model). Thus, the SCB constructed from the spline model provides a tool to check if a parametric model is adequate.

The 100(1 − α_p)% PCIs provide a structure for computing SCBs for Λ_0(t) for all t in the interval [t_L, t_U]. We use the bootstrap samples to calibrate the PCIs so that they approximately provide the nominal 100(1 − α)% coverage probability (CP), similar to the idea of the equal-precision SCB for a cdf in Nair (1984). The CP can be estimated as

CP(α_p) = (1/B) Σ_{b=1}^B 1( Λ̂_0^{*([B α_p/2])}(t) ≤ Λ̂_{0b}*(t) ≤ Λ̂_0^{*([B(1 − α_p/2)])}(t), for all t ∈ [t_L, t_U] ).

By setting CP(α_p) = 1 − α, one can find the solution, denoted by α_c. Thus, the SCB in (11) can be computed as

[ Λ̂_0^{*([B α_c/2])}(t), Λ̂_0^{*([B(1 − α_c/2)])}(t) ], t_L ≤ t ≤ t_U,

which is time-efficient because the bootstrap samples have already been obtained.
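The weight generation and the calibration of α_c can be sketched as follows. The bootstrap curves here are synthetic stand-ins (a random scaling of a fixed shape plus noise), since refitting the spline model B times is beyond a short example; all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2020)

# Step 1 of Algorithm 1: one exponential(1) fractional random weight per unit.
w = rng.exponential(scale=1.0, size=123)      # e.g., a fleet of n = 123 units

# Synthetic stand-ins for B bootstrap BCIF estimates on a time grid
# (rows = bootstrap draws, columns = time points t in [t_L, t_U]).
B, grid = 2000, np.linspace(1.0, 730.0, 50)
curves = (np.outer(1.0 + 0.15 * rng.standard_normal(B), np.sqrt(grid))
          + 0.05 * rng.standard_normal((B, grid.size)) * np.sqrt(grid))

def coverage(alpha_p):
    """Fraction of bootstrap curves inside the pointwise band at every t."""
    lo = np.quantile(curves, alpha_p / 2.0, axis=0)
    hi = np.quantile(curves, 1.0 - alpha_p / 2.0, axis=0)
    return np.mean(np.all((curves >= lo) & (curves <= hi), axis=1))

# Bisection for alpha_c with CP(alpha_c) about 1 - alpha = 0.95;
# coverage() is decreasing in alpha_p, so bisection converges.
lo_a, hi_a = 1e-4, 0.5
for _ in range(40):
    mid = 0.5 * (lo_a + hi_a)
    lo_a, hi_a = (mid, hi_a) if coverage(mid) >= 0.95 else (lo_a, mid)
alpha_c = lo_a
```

Because the curves vary somewhat independently across time points, the calibrated pointwise level α_c comes out smaller than 0.05, i.e., the simultaneous band must be wider than the naive pointwise 95% band.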
To estimate the gamma frailty model in (7), one needs to calculate the marginal likelihood function. The marginal likelihood function is,

L(θ, φ) = ∏_{i=1}^n ∫_0^∞ { ∏_{j=1}^{n_i} u_i λ_i[t_ij; x_i(t_ij), θ] } × exp{ −u_i Λ_i[τ; x_i(τ), θ] } f(u_i) du_i   (12)
        = ∏_{i=1}^n { ∏_{j=1}^{n_i} λ_i[t_ij; x_i(t_ij), θ] } × ∫_0^∞ u_i^{n_i} exp(−u_i c_i) f(u_i) du_i
        = ∏_{i=1}^n { ∏_{j=1}^{n_i} λ_i[t_ij; x_i(t_ij), θ] } × φ^{−1/φ} Γ(n_i + 1/φ) / [ Γ(1/φ) (c_i + 1/φ)^{n_i + 1/φ} ],

where c_i = Σ_{l=1}^{n_τ} x_{il} · [Λ_0(τ_l; θ) − Λ_0(τ_{l−1}; θ)]. The ML estimates of θ and φ are obtained by finding the values that maximize the log-likelihood function, log[L(θ, φ)]. We again use the R optim() function with the "Nelder-Mead" option to do the optimization.

To check if there is population heterogeneity in the event process, one can use the likelihood ratio test. The test statistic is constructed as,

−2 { l(θ̂) − log[L(θ̂, φ̂)] },   (13)

which has a χ²₁ distribution under the null hypothesis that φ = 0. Here, l(θ̂) is obtained by evaluating (9) at θ̂. The null hypothesis is rejected if the test statistic is larger than χ²_{1,(1−α)}, indicating that there is population heterogeneity.

5 Simulation Study

The purpose of the simulation study is to show the properties of the ML estimator for the spline model, to check the CP of the SCB procedure based on the spline model, and to see if the SCB can correctly accept or reject a particular parametric model.

In the simulation study, the spline model is considered as the true underlying model. The spline bases are shown in Figure 2. We consider three scenarios. Figure 3 shows the true BCIFs used in the three simulation scenarios. To simplify the setting, we only consider one parametric model, the Gompertz model.
• Scenario 1: we choose θ = (6, …)′ as the coefficients. Using the spline bases in Figure 2, the true BCIF is shown as the black/solid line in Figure 3. The Gompertz model fits this BCIF perfectly.

• Scenario 2: we choose θ = (8, …)′ as the coefficients. The corresponding true BCIF is shown as the red/dash line in Figure 3. The Gompertz model fits this BCIF well except in the late stage.

• Scenario 3: we choose θ = (5, …)′ as the coefficients. The corresponding true BCIF is shown as the green/dot-dash line in Figure 3. The Gompertz model does not fit this BCIF well due to various changes in slope over time.

The number of bootstrap samples is B = 5000 and the number of simulated datasets (i.e., repeats) is N = 1000. The mileage-driven history is sampled with replacement from the historical data from Waymo. The average number of events per unit is around 1.8. The sample sizes (i.e., the numbers of AV units) considered in the simulation are 50, 100, 200, 500, and 1000.

We evaluate the relative root mean squared error (RMSE) of the BCIF estimator, the CP of the SCB, and the acceptance probability of the parametric model. In particular, the relative RMSE (RelRMSE) at time t is computed as

RelRMSE(t) = { Σ_{l=1}^N [Λ̂_{0l}(t) − Λ_0(t)]² / N }^{1/2} / Λ_0(t),

where Λ̂_{0l}(t) is the estimated BCIF using the spline model based on the l-th simulated dataset, and Λ_0(t) is the true BCIF. We use the relative RMSE to remove the scale effect of Λ_0(t). That is, the RMSE tends to be large if Λ_0(t) is large at a particular t.

Figure 3: Plot of the true BCIFs used in the three simulation scenarios.
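The RelRMSE computation at a fixed t is a one-liner; in this sketch the N estimates are synthetic draws around a made-up true value rather than actual spline fits.

```python
import numpy as np

rng = np.random.default_rng(7)
true_bcif = 2.0                          # stand-in for Lambda_0(t) at a fixed t
estimates = true_bcif + 0.1 * rng.standard_normal(1000)   # N = 1000 "fits"

# RelRMSE(t): root mean squared error scaled by the true BCIF value.
rel_rmse = np.sqrt(np.mean((estimates - true_bcif) ** 2)) / true_bcif
```

With estimation error of standard deviation 0.1 around a true value of 2.0, rel_rmse is close to 0.05, i.e., roughly a 5% relative error, independent of the scale of Λ_0(t).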
The CP is estimated by the proportion of times that the SCB constructed using the spline model captures the true BCIF. The acceptance probability is estimated by the proportion of times that the spline-based SCB captures the estimated BCIF from the Gompertz model.

Figure 4 shows the RelRMSE as a function of time when the spline model and the Gompertz model are fit to the data under the three scenarios. As we can see from the figure, the RelRMSE for the spline model generally decreases as the sample size increases, across all three scenarios. When the sample size is large, the RelRMSE for the spline model estimator stays within a small range.

The behavior of the RelRMSE for the Gompertz model depends on the scenario. For Scenario 1, in which the Gompertz model fits the true model well, the spline model tends to have a higher RelRMSE because it has more parameters (and thus more variability in the estimates) than the parametric model. For Scenario 2, in which there is some departure at the late stage, the Gompertz model has a smaller RelRMSE than the spline model, but the advantage diminishes as the sample size increases because bias begins to dominate variance. For Scenario 3, in which there is a large departure from the true model, the Gompertz model tends to have a larger RelRMSE than the spline model, and its RelRMSE does not decrease much as the sample size increases due to the effect of large bias.

In summary, the ML estimator for the spline model works as expected. When a parametric BCIF is adequate, it tends to have higher statistical efficiency. When the parametric model is not adequate, there can be a large RelRMSE due to estimation bias. The results show that it is important to assess the adequacy of parametric models. The spline model, in contrast, is flexible, at the price of losing some statistical efficiency.

Figure 5 shows the CP and the acceptance probability as a function of sample size under the three scenarios.
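The CP and acceptance-probability estimates are simple containment proportions over the simulated repeats. A synthetic Monte Carlo sketch (the bands and curves below are placeholders, not output of the spline SCB procedure):

```python
import numpy as np

# Monte Carlo estimate of SCB coverage: for each simulated dataset we have
# a band [center - half, center + half]; coverage is the fraction of
# repeats whose band contains the true BCIF at EVERY grid point.
rng = np.random.default_rng(7)
t = np.linspace(1.0, 700.0, 40)
truth = 6.0 * (1.0 - np.exp(-t / 300.0))     # hypothetical true BCIF

N = 500                                       # simulated repeats
center = truth + rng.normal(0.0, 0.03 * truth, size=(N, t.size))
half = 0.12 * truth                           # hypothetical band half-width
covered = np.all((center - half <= truth) & (truth <= center + half), axis=1)
cp = covered.mean()
print(round(cp, 3))
```

The acceptance probability is estimated the same way, with the true BCIF replaced by the estimated BCIF from the fitted parametric model.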
From Figure 5(a), the CP for the SCB procedure based on the spline model and the bootstrap is quite similar across the three scenarios. The CP improves as the sample size increases: it is less than nominal when the sample size is small and approaches the nominal level when the sample size is larger than 200. From the plot of the acceptance probability in Figure 5(b), the parametric model is generally accepted when the sample size is small and there is little or no departure from the true model. When there is a large departure from the true model, as in Scenario 3, the SCB does not capture the estimated Gompertz model with high probability when the sample size is large.

Data Analysis

In this section, we present the analysis of the recurrent disengagement events data. We start by fitting the parametric models in Table 2 and the spline model to the disengagement-events data from the four manufacturers: Waymo, Cruise, Pony AI, and Zoox. The model fitting uses the ML estimation procedures described in Section 4. Table 3 shows the AIC values for the selected models fitted to the data from the four manufacturers; the numbers in bold font indicate the lowest AIC values among the parametric models. The spline model, in general, yields much lower AIC values, except for the Zoox data, showing that the spline model is flexible in fitting the recurrent events data. For Zoox, the AIC value for the Musa-Okumoto model is small because the model has only two parameters. Based on the AIC values, the best parametric model for Waymo and Cruise is the Gompertz model, the best parametric model for Pony AI is the Weibull model, and the best parametric model for Zoox is the Musa-Okumoto model.

To visualize the model estimation results, Figure 6 shows the estimated BCIFs for the four manufacturers based on the spline model and the parametric models, together with the 95% SCB based on the spline model.
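Model selection by AIC, as used for Table 3, is a one-liner once the maximized log-likelihoods are available: AIC = −2 × (log-likelihood) + 2 × (number of parameters), and the smallest value wins. A minimal sketch with hypothetical values (not the fitted values from Table 3):

```python
# AIC-based model selection. The log-likelihood values and parameter counts
# below are illustrative placeholders, not results from the paper.
fits = {                       # model: (maximized log-likelihood, n. params)
    "Musa-Okumoto": (-1390.2, 2),
    "Gompertz":     (-1382.9, 3),
    "Weibull":      (-1387.4, 2),
    "Spline":       (-1375.1, 5),
}
aic = {m: -2.0 * ll + 2.0 * p for m, (ll, p) in fits.items()}
best = min(aic, key=aic.get)
print(best, round(aic[best], 1))
```

Note how the penalty term matters: a flexible model must buy its extra parameters with a sufficiently higher log-likelihood, which is exactly why the two-parameter Musa-Okumoto model can win for a small dataset such as Zoox's.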
For Waymo and Cruise, all the parametric models are within the SCB and agree quite well with the estimated BCIF from the spline model.

Figure 4: Plots of relative RMSE as a function of time using the spline model and Gompertz model to fit the data under the three scenarios. Panels: (a) Scenario 1, spline model; (b) Scenario 1, Gompertz; (c) Scenario 2, spline model; (d) Scenario 2, Gompertz; (e) Scenario 3, spline model; (f) Scenario 3, Gompertz.

Figure 5: Plots of coverage probability and acceptance probability as a function of sample size under the three scenarios. Panels: (a) coverage probability; (b) acceptance probability.

Table 3: The values of AIC for fitting various models to the data from the four manufacturers. The numbers in bold font indicate the lowest AIC values among the parametric models.

Manufacturer   Musa-Okumoto   Gompertz   Weibull   Spline Model
Waymo          2769.78

For Pony AI, the mileage rate can be as low as x_i(t) = 0.01; a high number of events with a low mileage driven leads to a high event rate. The SCB is asymmetric because it was built using the fractional-random-weight bootstrap and the distribution of Λ̂(t) is heavily skewed. The fact that all events come from two units leads to a large amount of variability in the SCB, because changing the weights of those two units has a large influence on the re-weighted log-likelihood.
We also note that the best parametric model (the Weibull) does not agree well with the spline model, although it tracks the trend. For Zoox, the SCB is also relatively wide due to the small number of test units. All parametric models are within the SCB, indicating that all parametric models are statistically acceptable.

To further check how well the models fit the data, Figure 7 shows the expected versus the observed number of events for the four manufacturers based on the spline model and the parametric models, together with the 95% PCIs based on the spline model. The expected number of events is computed from the specific model with an adjustment for the mileage history of all units; the PCIs for the expected number of events are based on bootstrap samples. The shape of the cumulative-number-of-events curve differs from the shape of the BCIF because the former is adjusted by the mileage rate x_i(t), which is time-varying and depends on the driving pattern, while the BCIF corresponds to the case x_i(t) = 1. In all cases, the spline model tracks the cumulative number of observed events well. For Waymo, Cruise, and Zoox, the Gompertz and Weibull models also track the counts well, but visually we can see some departures for the Musa-Okumoto model. For Pony AI, all three parametric models show significant departures in the plot, indicating that they are not flexible enough to describe that event process.

Table 4 shows the parameter estimates, standard errors, and approximate 95% confidence intervals (CIs) for the best parametric models for the four manufacturers. The CIs for the parameters are based on a normal (Wald) approximation, and some transformations (e.g., a logarithm transformation for positive parameters) are used to improve the performance.
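A Wald CI for a positive parameter is typically computed on the log scale and back-transformed, which keeps the lower endpoint positive; by the delta method, the standard error of log θ̂ is se(θ̂)/θ̂. A sketch with hypothetical numbers (not the estimates in Table 4):

```python
import math

# Log-scale Wald 95% CI for a positive parameter, then back-transformed.
# The point estimate and standard error below are hypothetical.
z = 1.959963984540054               # 97.5% standard normal quantile
theta_hat, se = 0.32, 0.07          # hypothetical MLE and std. error
se_log = se / theta_hat             # delta method: se of log(theta_hat)
lo = theta_hat * math.exp(-z * se_log)
up = theta_hat * math.exp(+z * se_log)
print(round(lo, 4), round(up, 4))
```

The resulting interval is asymmetric around θ̂ (geometrically symmetric: lo × up = θ̂²), which typically gives better finite-sample coverage than the naive θ̂ ± z·se interval for positive parameters.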
Note that the estimate of θ for the Zoox/Musa-Okumoto model is set to zero because the model is degenerate at θ = 0, indicating a constant-rate situation.

We also tested for population heterogeneity using the procedure in Section 4.3. Table 5 shows the summary of the gamma frailty models for the four manufacturers. All of the p-values are close to 1, indicating little population heterogeneity among the event processes for the four manufacturers. Hence, it is reasonable to use a model in which all units have the same event process with the same BCIF.

Figure 6: Plots of the estimated BCIFs for the four manufacturers ((a) Waymo, (b) Cruise, (c) Pony AI, (d) Zoox) based on the spline model and parametric models, together with the 95% SCB based on the spline model.

Figure 7: Plots of the expected versus the observed number of events for the four manufacturers ((a) Waymo, (b) Cruise, (c) Pony AI, (d) Zoox) based on the spline model and parametric models, together with the 95% PCIs based on the spline model.
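The degeneracy noted above for the Zoox/Musa-Okumoto fit can be seen from one common parameterization of the Musa-Okumoto (logarithmic Poisson) model, with BCIF Λ(t) = (1/θ) log(1 + βθt) and intensity λ(t) = β/(1 + βθt): as θ → 0 the intensity reduces to the constant rate β. The parameter values below are illustrative, not fitted values from the paper.

```python
# One common Musa-Okumoto parameterization (an assumption, since the paper
# does not print its parameterization here): lambda(t) = beta/(1 + beta*theta*t).
def mo_intensity(t, beta, theta):
    return beta / (1.0 + beta * theta * t)

beta = 0.6                                      # events per k-miles, hypothetical
# With theta near 0 the rate is essentially flat over the study period,
# which is the degenerate constant-rate case; with theta > 0 it decreases.
print(round(mo_intensity(0.0, beta, 1e-9), 3),
      round(mo_intensity(700.0, beta, 1e-9), 3),
      round(mo_intensity(700.0, beta, 0.5), 5))
```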
Table 5 also lists the names of the best parametric models for each manufacturer.

To further visualize the results, Figure 8 plots the BIFs from the best parametric models for each of the four manufacturers. From the plot, we can see the trends for the different manufacturers; a decreasing trend means an improvement in AI technology and an increase in AV reliability. Waymo, Cruise, and Pony AI display a decreasing trend, while Zoox displays a constant rate of about 0.6 events per k-miles. Waymo starts at an event rate near 0.1 events per k-miles and decreases over the two-year period. Cruise starts at 0.2 events per k-miles and shows a decreasing trend. Pony AI starts at a high rate of 10 events per k-miles and shows a rapidly improving rate. By the end of the study period (i.e., November 30, 2019), the event rate for Waymo and Cruise is around 0.05 events per k-miles, and the event rate for Pony AI is around 0.1 events per k-miles. This pattern indicates substantial improvement in the reliability of the Pony AI driving system. From the analysis conducted in this section for the different manufacturers, we can see that the overall AV reliability is improving over the two-year period.

As a comparison, Figure 9 shows the estimated BIFs based on the spline model and the parametric models, together with the 95% PCIs based on the spline model. The spline model shows more variation, but the general trends are the same as for the parametric models. The estimates for Zoox have a large amount of variability due to the limited sample size, as discussed previously.

Conclusions

This paper focuses on the reliability analysis of AV technology using the recurrent disengagement events from the California driving study. We propose a statistical framework for modeling and analyzing the recurrent events data from AV driving tests. We use both parametric models and a nonparametric model to describe the event processes.
Based on the spline model, we can select the best model, quantify uncertainty, and test for heterogeneity in the event process. We want to point out that the parametric models and the spline model are proposed as complementary tools for modeling and inference.

Through the simulation and the data analysis, we show that the proposed spline model is flexible enough to describe the recurrent events data from the four AV manufacturers, and that parametric models are adequate for the data from most manufacturers. It is worth noting that the best parametric model can differ across manufacturers. The population heterogeneity in the event process is also low. From the data analysis, we found that the overall AV reliability is improving over the two-year study period.

Table 4: Parameter estimates, standard errors, and approximate 95% CIs for the best parametric models for the four manufacturers.

Table 5: Summary of the gamma frailty models for the four manufacturers.

Manufacturer   Model          Test statistic   p-value
Waymo          Gompertz       0.0001           0.9721
Cruise         Gompertz       0.0000           0.9972
Pony AI        Weibull        0.0000           0.9916
Zoox           Musa-Okumoto   0.0197           0.8400

Figure 8: Plot of BIFs from the best parametric models for the four manufacturers.

The currently available data do not include covariates, such as the driving speed when the event occurred, the test environment (e.g., busy street versus freeway), and the vehicle model. In the future, it would be interesting and useful to collect more covariate information. Our proposed modeling framework can be extended to analyze recurrent disengagement events data with covariates. The CA DMV recently started a driverless program, in which cars on the road are not required to have a driver.
It would be interesting to analyze the driverless study data in the future, when enough event data become available.

Because reliability is a property that evolves over time, all kinds of AI systems need to be tested over time to quantify their reliability. Although our analysis focuses on AV reliability, our proposed data-analytic framework can also be applied to assess the reliability of other AI systems in which departures from desired behavior provide recurrent events data. With an appropriate definition of the time scale and events, the parametric and spline models discussed in this paper can be extended to analyze data from the reliability testing of other AI systems.

Computing hardware reliability is also an important aspect of AI reliability. For example, GPUs are widely used in AI model computing. Ostrouchov et al. (2020) considered the lifetimes of GPUs used in supercomputers. It would be interesting to study GPU reliability, or more broadly computing hardware reliability, and its relationship to AI reliability.

Figure 9: Plots of the estimated BIFs for the four manufacturers ((a) Waymo, (b) Cruise, (c) Pony AI, (d) Zoox) based on the spline model and parametric models, together with the 95% PCIs based on the spline model.

Acknowledgments

The authors acknowledge the Advanced Research Computing program at Virginia Tech for providing computational resources. The work by Hong was partially supported by National Science Foundation Grant CMMI-1904165 to Virginia Tech.

References

Alshemali, B. and J. Kalita (2020).
Improving the reliability of deep neural networks in NLP: A review. Knowledge-Based Systems 191, 105210.

Amodei, D., C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016). Concrete problems in AI safety. arXiv: 1606.06565.

Åsljung, D., J. Nilsson, and J. Fredriksson (2017). Using extreme value theory for vehicle level safety validation and implications for autonomous vehicles. IEEE Transactions on Intelligent Vehicles 2, 288–297.

Banerjee, S. S., S. Jha, J. Cyriac, Z. T. Kalbarczyk, and R. K. Iyer (2018). Hands off the wheel in autonomous vehicles?: A systems perspective on over a million miles of field data. In , pp. 586–597. IEEE.

Boggs, A. M., B. Wali, and A. J. Khattak (2020). Exploratory analysis of automated vehicle crashes in California: A text analytics & hierarchical Bayesian heterogeneity-based approach. Accident Analysis and Prevention 135, 105354.

Bosio, A., P. Bernardi, A. Ruospo, and E. Sanchez (2019). A reliability analysis of a deep neural network. In , pp. 1–6.

Burke, K., M. C. Jones, and A. Noufaily (2020). A flexible parametric modelling framework for survival analysis. Journal of the Royal Statistical Society: Series C 69, 429–457.

Burton, S., I. Habli, T. Lawton, J. McDermid, P. Morgan, and Z. Porter (2020). Mind the gaps: Assuring the safety of autonomous systems from an engineering, ethical, and legal perspective. Artificial Intelligence 279, 103201.

California Department of Motor Vehicles. Autonomous vehicle tester program. [Online]. Available: , accessed: September 01, 2020.

Cook, R. J. and J. F. Lawless (2007). The Statistical Analysis of Recurrent Events. New York: Springer-Verlag.

Cruise. [Online]. Available: , accessed: September 01, 2020.

Dixit, V. V., S. Chand, and D. J. Nair (2016). Autonomous vehicles: disengagements, accidents and reaction times. PLoS One 11, e0168054.

Duchateau, L. and P. Janssen (2008). The Frailty Model. New York: Springer-Verlag.

Ehrlich, W. K., V. N. Nair, M. S. Alam, W. H. Chen, and M. Engel (1998).
Software reliability assessment using accelerated testing methods. Journal of the Royal Statistical Society: Series C 47, 15–30.

Favarò, F., S. Eurich, and N. Nader (2018). Autonomous vehicles disengagements: Trends, triggers, and regulatory limitations. Accident Analysis & Prevention 110, 136–148.

Goldstein, B. F., S. Srinivasan, D. Das, K. Banerjee, L. Santiago, V. C. Ferreira, A. S. Nery, S. Kundu, and F. M. G. França (2020). Reliability evaluation of compressed deep learning models. In , pp. 1–5.

Hong, Y., L. A. Escobar, and W. Q. Meeker (2010). Coverage probabilities of simultaneous confidence bands and regions for log-location-scale distributions. Statistics & Probability Letters 80, 733–738.

Hong, Y., M. Li, and B. Osborn (2015). System unavailability analysis based on window-observed recurrent event data. Applied Stochastic Models in Business and Industry 31, 122–136.

Huang, C.-Y., M. R. Lyu, and S.-Y. Kuo (2003). A unified scheme of some nonhomogeneous Poisson process models for software reliability estimation. IEEE Transactions on Software Engineering 29, 261–269.

Jin, Z., Z. Ying, and L. J. Wei (2001). A simple resampling method by perturbing the minimand. Biometrika 88, 381–390.

Kalra, N. and S. M. Paddock (2016). Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice 94, 182–193.

Lawless, J. F. (1995). The analysis of recurrent events for multiple subjects. Journal of the Royal Statistical Society: Series C 44, 487–498.

Lawless, J. F. (2003). Statistical Models and Methods for Lifetime Data (2nd ed.). Hoboken, New Jersey: John Wiley & Sons, Inc.

Lv, C., D. Cao, Y. Zhao, D. J. Auger, M. Sullman, H. Wang, L. M. Dutka, L. Skrypchuk, and A. Mouzakitis (2018). Analysis of autopilot disengagements occurring during autonomous vehicle testing. IEEE/CAA Journal of Automatica Sinica 5, 58–68.

Meeker, W. Q. and L. A. Escobar (1998). Statistical Methods for Reliability Data.
New York: John Wiley & Sons, Inc.

Merkel, R. (2018). Software reliability growth models predict autonomous vehicle disengagement events. arXiv: 1812.08901.

Meyer, M. C. (2008). Inference using shape-restricted regression splines. The Annals of Applied Statistics 2, 1013–1033.

Michelmore, R., M. Wicker, L. Laurenti, L. Cardelli, Y. Gal, and M. Kwiatkowska (2019). Uncertainty quantification with statistical guarantees in end-to-end autonomous driving control. arXiv: 1909.09884.

Musa, J. D. and K. Okumoto (1984). A logarithmic Poisson execution time model for software reliability measurement. In Proceedings of the 7th International Conference on Software Engineering, pp. 230–238. IEEE Press.

Nair, V. N. (1984). Confidence bands for survival functions with censored data: a comparative study. Technometrics 26, 265–275.

Ostrouchov, G., D. Maxwell, R. A. Ashraf, C. Engelmann, M. Shankar, and J. H. Rogers (2020). GPU lifetimes on Titan supercomputer: Survival analysis and reliability. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20), New York, NY. Association for Computing Machinery.

Pony AI. [Online]. Available: , accessed: September 01, 2020.

Ramsay, J. O. (1988). Monotone regression splines in action. Statistical Science 3, 425–441.

Rubin, D. B. (1981). The Bayesian bootstrap. Annals of Statistics 9, 130–134.

Shan, Q., Y. Hong, and W. Q. Meeker (2020). Seasonal warranty prediction based on recurrent event data. Annals of Applied Statistics 14, 929–955.

Waymo. [Online]. Available: https://waymo.com/, accessed: September 01, 2020.

Wood, A. (1996). Software reliability growth models. Tandem Technical Report 96.1, Tandem Computers, Cupertino, CA.

Xie, Y., C. B. King, Y. Hong, and Q. Yang (2018). Semi-parametric models for accelerated destructive degradation test data analysis. Technometrics 60, 222–234.

Xu, L., C. Gotwalt, Y. Hong, C. B. King, and W. Q. Meeker (2020).
Applications of the fractional-random-weight bootstrap. The American Statistician 74, 345–358.

Zhao, X., A. Banks, J. Sharp, V. Robu, D. Flynn, M. Fisher, and X. Huang (2020). A safety framework for critical systems utilising deep neural networks. arXiv: 2003.05311.

Zhao, X., V. Robu, D. Flynn, K. Salako, and L. Strigini (2019). Assessing the safety and reliability of autonomous vehicles from road testing. In , pp. 13–23.

Zoox. [Online]. Available: https://zoox.com/, accessed: September 01, 2020.

Zuo, J., W. Q. Meeker, and H. Wu (2008). Analysis of window-observation recurrence data.