Speeding up bootstrap computations: a vectorized implementation for statistics based on sample moments
Elias Chaibub Neto e-mail: [email protected], Sage Bionetworks
Abstract:
In this note we propose a vectorized implementation of the non-parametric bootstrap for statistics based on sample moments. Basically, we adopt the multinomial sampling formulation of the non-parametric bootstrap, and compute bootstrap replications of sample moment statistics by simply weighting the observed data according to multinomial counts, instead of evaluating the statistic on a re-sampled version of the observed data. Using this formulation we can generate a matrix of bootstrap weights and compute the entire vector of bootstrap replications with a few matrix multiplications. Vectorization is particularly important for matrix-oriented programming languages such as R, where matrix/vector calculations tend to be faster than scalar operations implemented in a loop. We illustrate the gain in computational speed achieved by the vectorized implementation in real and simulated data sets, when bootstrapping Pearson's sample correlation coefficient.
1. Introduction
In this note we propose a vectorized implementation of the non-parametric bootstrap for statistics based on sample moments. Our approach is based on the multinomial sampling formulation of the non-parametric bootstrap. This formulation is described in the next section but, in essence, follows from the fact that the bootstrap distribution of any sample moment statistic (generated by sampling $N$ data points with replacement from a population of $N$ values, and evaluating the statistic on the re-sampled version of the observed data) can also be generated by weighting the observed data according to multinomial category counts sampled from a multinomial distribution defined over $N$ categories (i.e., data points), assigning probability $1/N$ to each one of them.

The practical advantage of this multinomial sampling formulation is that, once we generate a matrix of bootstrap weights, we can compute the entire vector of bootstrap replications using a few matrix multiplication operations. The usual re-sampling formulation, on the other hand, is not amenable to such vectorization of computations, since for each bootstrap replication one needs to generate a re-sampled version of the data. Vectorization is particularly important for matrix-oriented programming languages such as R [6] and Matlab [5], where matrix/vector computations tend to be faster than scalar operations implemented in a loop.

This note is organized as follows. Section 2: (i) presents notation and background on the standard data re-sampling approach for the non-parametric bootstrap; (ii) describes the multinomial sampling formulation of the bootstrap; and (iii) explains how to vectorize the bootstrap calculations. Section 3 reports a comparison of computation times (in R) required by the vectorized and standard approaches, when bootstrapping Pearson's sample correlation coefficient in real and simulated data sets. Finally, Section 4 presents some final remarks, and points out that the Bayesian bootstrap computations can also be easily vectorized.
2. The vectorized non-parametric bootstrap
Let $X$ represent a random variable distributed according to an unknown probability distribution $F$, and let $\boldsymbol{x} = (x_1, \ldots, x_N)^t$ be an observed random sample from $F$. The goal of the bootstrap is to estimate a parameter of interest, $\theta$, based on a statistic $\hat{\theta} = s(\boldsymbol{x})$.

Let $\hat{F}$ represent the empirical distribution of the observed data, $\boldsymbol{x}$, assigning probability $1/N$ to each observed value $x_i$, $i = 1, \ldots, N$. A bootstrap sample, $\boldsymbol{x}^* = (x_1^*, \ldots, x_N^*)^t$, corresponds to a random sample of size $N$ drawn from $\hat{F}$. Operationally, sampling from $\hat{F}$ is equivalent to sampling $N$ data points with replacement from the population of $N$ objects $(x_1, \ldots, x_N)$. The star notation indicates that $\boldsymbol{x}^*$ is not the actual data set $\boldsymbol{x}$ but rather a re-sampled version of $\boldsymbol{x}$. The sampling distribution of the estimator $\hat{\theta}$ is then estimated from $B$ bootstrap replications of $\hat{\theta}^* = s(\boldsymbol{x}^*)$.

Now, consider the estimation of the first moment of the unknown probability distribution $F$,
$$\theta = E_F(X), \qquad (1)$$
on the basis of the observed data $\boldsymbol{x}$. If no further information (other than the observed sample $\boldsymbol{x}$) is available about $F$, then it follows that the best estimator of $\theta$ is the plug-in estimate (see page 36 of [2]),
$$\hat{\theta} = E_{\hat{F}}(X) = \sum_{i=1}^{N} x_i \, P_{\hat{F}}(X = x_i) = \sum_{i=1}^{N} \frac{x_i}{N} = s(\boldsymbol{x}), \qquad (2)$$
and the bootstrap distribution of $\hat{\theta}$ is generated from $B$ bootstrap replications of $\hat{\theta}^* = s(\boldsymbol{x}^*) = N^{-1} \sum_{i=1}^{N} x_i^*$. Algorithm 1 summarizes the approach.
Algorithm 1 (non-parametric bootstrap for $E_{\hat{F}}(X)$ via data re-sampling). For $b = 1, \ldots, B$:
- Draw a bootstrap sample $\boldsymbol{x}^* = (x_1^*, \ldots, x_N^*)^t$ from the empirical distribution of the observed data, that is, sample $N$ data points with replacement from the population of $N$ objects $\boldsymbol{x} = (x_1, \ldots, x_N)^t$.
- Compute the bootstrap replication $\hat{\theta}^* = N^{-1} \sum_{i=1}^{N} x_i^*$.

Alternatively, let $n_i^*$ represent the number of times that data point $x_i$ appears in the bootstrap sample $\boldsymbol{x}^*$, and let $w_i^* = n_i^*/N$. Then the category counts, $\boldsymbol{n}^* = (n_1^*, \ldots, n_N^*)^t$, of the bootstrap sample $\boldsymbol{x}^*$ are distributed according to the multinomial distribution,
$$\boldsymbol{n}^* = N \boldsymbol{w}^* \sim \mathrm{Multinomial}\!\left(N, \, N^{-1} \boldsymbol{1}_N\right), \qquad (3)$$
where the vector $\boldsymbol{1}_N = (1, \ldots, 1)^t$ has length $N$. Now, since
$$\sum_{i=1}^{N} x_i^* = \sum_{i=1}^{N} n_i^* x_i, \qquad (4)$$
it follows that the bootstrap replication of the first sample moment of the observed data can be re-expressed, in terms of the bootstrap weights $w_i^*$, as
$$\hat{\theta}^* = \frac{1}{N} \sum_{i=1}^{N} x_i^* = \sum_{i=1}^{N} w_i^* x_i, \qquad (5)$$
so that we can generate bootstrap replicates using Algorithm 2.
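The identity in (4)-(5), namely that averaging a re-sampled data set is the same as weighting the original data by its multinomial counts, is easy to verify numerically. The following minimal sketch uses Python/NumPy purely for illustration (the paper's setting is R; the identity itself is language-agnostic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)            # observed sample, N = 10
N = len(x)

# Draw one bootstrap sample the usual way: indices sampled with replacement.
idx = rng.integers(0, N, size=N)
x_star = x[idx]

# Count how many times each data point was selected (n*_i), and form w*_i = n*_i / N.
n_star = np.bincount(idx, minlength=N)
w_star = n_star / N

# Identity (4)-(5): the re-sampled mean equals the weighted mean of the original data.
assert np.isclose(x_star.mean(), np.dot(w_star, x))
```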
Algorithm 2 (non-parametric bootstrap for $E_{\hat{F}}(X)$ via multinomial sampling). For $b = 1, \ldots, B$:
- Draw a bootstrap count vector $\boldsymbol{n}^* = (n_1^*, \ldots, n_N^*)^t \sim \mathrm{Multinomial}\!\left(N, \, N^{-1} \boldsymbol{1}_N\right)$.
- Compute the bootstrap weights $\boldsymbol{w}^* = (n_1^*/N, \ldots, n_N^*/N)^t$.
- Compute the bootstrap replication $\hat{\theta}^* = \sum_{i=1}^{N} w_i^* x_i$.

The main advantage of this multinomial sampling formulation of the non-parametric bootstrap is that it allows the vectorization of the computation. Explicitly, Algorithm 2 can be vectorized as follows:
1. Draw $B$ bootstrap count vectors, $\boldsymbol{n}^*$, from (3), using a single call of a multinomial random vector generator (e.g., rmultinom in R).
2. Divide the sampled bootstrap count vectors by $N$ in order to obtain an $N \times B$ bootstrap weights matrix, $\boldsymbol{W}^*$.
3. Generate the entire vector of bootstrap replications,
$$\hat{\boldsymbol{\theta}}^* = \boldsymbol{x}^t \boldsymbol{W}^*, \qquad (6)$$
in a single computation.

It is clear from equation (4) that this multinomial sampling formulation is available for statistics based on any arbitrary sample moment (that is, statistics defined as functions of arbitrary sample moments). For instance, the sample correlation between data vectors $\boldsymbol{z} = (z_1, \ldots, z_N)^t$ and $\boldsymbol{y} = (y_1, \ldots, y_N)^t$ is a function of the sample moments
$$\frac{1}{N}\sum_{i=1}^{N} z_i, \quad \frac{1}{N}\sum_{i=1}^{N} y_i, \quad \frac{1}{N}\sum_{i=1}^{N} z_i^2, \quad \frac{1}{N}\sum_{i=1}^{N} y_i^2, \quad \frac{1}{N}\sum_{i=1}^{N} z_i y_i, \qquad (7)$$
and the bootstrap replication,
$$\hat{\theta}^* = \frac{N \sum_i z_i^* y_i^* - \left(\sum_i z_i^*\right)\left(\sum_i y_i^*\right)}{\sqrt{\left[N \sum_i z_i^{*2} - \left(\sum_i z_i^*\right)^2\right]\left[N \sum_i y_i^{*2} - \left(\sum_i y_i^*\right)^2\right]}}, \qquad (8)$$
can be re-expressed in terms of bootstrap weights as
$$\hat{\theta}^* = \frac{\sum_i w_i^* z_i y_i - \left(\sum_i w_i^* z_i\right)\left(\sum_i w_i^* y_i\right)}{\sqrt{\left[\sum_i w_i^* z_i^2 - \left(\sum_i w_i^* z_i\right)^2\right]\left[\sum_i w_i^* y_i^2 - \left(\sum_i w_i^* y_i\right)^2\right]}}, \qquad (9)$$
and in vectorized form as
$$\hat{\boldsymbol{\theta}}^* = \frac{(\boldsymbol{z} \bullet \boldsymbol{y})^t \boldsymbol{W}^* - (\boldsymbol{z}^t \boldsymbol{W}^*) \bullet (\boldsymbol{y}^t \boldsymbol{W}^*)}{\sqrt{\left[(\boldsymbol{z}^2)^t \boldsymbol{W}^* - (\boldsymbol{z}^t \boldsymbol{W}^*)^2\right] \bullet \left[(\boldsymbol{y}^2)^t \boldsymbol{W}^* - (\boldsymbol{y}^t \boldsymbol{W}^*)^2\right]}}, \qquad (10)$$
where the $\bullet$ operator represents the Hadamard product of two vectors (that is, the element-wise product of the vectors' entries), and the square and square root operations in the denominator of (10) are also performed entry-wise.
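To make the three vectorization steps and equation (10) concrete, here is a compact sketch in Python/NumPy. It is only an illustrative translation (the paper's own implementation uses R and rmultinom), and the function names boot_weights and boot_corr are hypothetical:

```python
import numpy as np

def boot_weights(N, B, rng):
    """N x B bootstrap weights matrix W*: each column is n*/N with
    n* ~ Multinomial(N, (1/N, ..., 1/N)), as in equation (3)."""
    counts = rng.multinomial(N, np.full(N, 1.0 / N), size=B)  # shape (B, N)
    return counts.T / N                                       # shape (N, B)

def boot_corr(z, y, B, rng):
    """Entire vector of bootstrap replications of Pearson's correlation,
    computed with a few matrix products, as in equation (10)."""
    W = boot_weights(len(z), B, rng)
    mz, my = z @ W, y @ W                           # weighted first moments (length B)
    mzz, myy, mzy = (z * z) @ W, (y * y) @ W, (z * y) @ W
    return (mzy - mz * my) / np.sqrt((mzz - mz**2) * (myy - my**2))

rng = np.random.default_rng(1)
z = rng.normal(size=82)
y = z + rng.normal(size=82)
theta_star = boot_corr(z, y, B=2_000, rng=rng)  # 2,000 bootstrap replications at once
```

Note that no loop over $b$ appears anywhere: a single multinomial call produces all the count vectors, and each weighted moment in (7) is one matrix-vector product.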
3. Illustrations
In this section, we illustrate the gain in computational efficiency achieved by the vectorized multinomial sampling bootstrap, relative to two versions of the standard data re-sampling approach: (i) a straightforward version based on a for loop; and (ii) a more sophisticated version, implemented in the bootstrap R package [1], where the for loop is replaced by a call to the apply R function. In the following, we refer to these two versions as "loop" and "apply", respectively.

We bootstrapped Pearson's sample correlation coefficient using the American law school data (page 21 of [2]) provided in the law82 data object of the bootstrap R package. The data is composed of two measurements (class mean score on a national law test, LSAT, and class mean undergraduate grade point average, GPA) on the entering class of 1973 for $N = 82$ American law schools. Figure 1 presents the results.

The top left panel of Figure 1 shows the time (in seconds) required to generate $B$ bootstrap replications of $\hat{\theta}^* = \widehat{\mathrm{cor}}(\mathrm{LSAT}^*, \mathrm{GPA}^*)$, for $B$ varying from 1,000 to 1,000,000. The red, brown, and blue lines show, respectively, the computation time required by the "apply", "loop", and vectorized multinomial sampling approaches. The center and bottom left panels show the computation time ratio of the data re-sampling approaches versus the vectorized approach as a function of $B$. The plots clearly show that the vectorized bootstrap was considerably faster than the data re-sampling implementations for all $B$ tested. The right panels show the $\hat{\theta}^*$ distributions for the three bootstrap approaches.

[Fig 1. Comparison between the data re-sampling approaches and the vectorized multinomial sampling bootstrap, in the American law school data. The top left panel compares the approaches' time expenditures. The center and bottom left panels show the computation time ratio of the data re-sampling versus the vectorized approaches. The right panels present the $\hat{\theta}^* = \widehat{\mathrm{cor}}(\mathrm{LSAT}^*, \mathrm{GPA}^*)$ distributions.]

Figure 2 presents the results of the same comparison based on the subset of $N = 15$ samples from the American law school data (page 19 of [2]), provided in the law data object of the bootstrap R package. This time, the vectorized implementation was remarkably faster than the data re-sampling versions. The center and bottom left panels of Figure 2 show that the vectorized implementation was roughly 50 times faster than the re-sampling versions, whereas in the previous example it was about 8 times faster (center and bottom left panels of Figure 1).

[Fig 2. Comparison between the data re-sampling approaches and the vectorized multinomial sampling bootstrap, in a subset of the American law school data. The top left panel compares the approaches' time expenditures. The center and bottom left panels show the computation time ratio of the data re-sampling versus the vectorized approaches. The right panels present the $\hat{\theta}^* = \widehat{\mathrm{cor}}(\mathrm{LSAT}^*, \mathrm{GPA}^*)$ distributions.]

The performance difference observed in these two examples suggests that the gain in speed achieved by the vectorized implementation decreases as a function of the sample size.
In order to confirm this observation, we used simulated data to compare the bootstrap implementations across a grid of 10 distinct sample sizes (varying from 15 to 915), for $B$ equal to 10,000, 100,000, and 1,000,000. Figure 3 reports the results.

[Fig 3. Comparison of the bootstrap implementations, when bootstrapping Pearson's sample correlation, $\hat{\theta}^* = \widehat{\mathrm{cor}}(\boldsymbol{x}_1^*, \boldsymbol{x}_2^*)$, for sample sizes varying from 15 to 915. The data were simulated according to $x_{1i} = \epsilon_{1i}$, $x_{2i} = x_{1i} + \epsilon_{2i}$, with $\epsilon_{ji}$, $j = 1, 2$, independent mean-zero normal errors. Results based on 10,000 (top panels), 100,000 (center panels), and 1,000,000 (bottom panels) bootstrap replications. The center and right panels show the computation time ratios (in log scale) comparing the "loop" vs vectorized and the "apply" vs vectorized implementations, respectively.]

The left panels of Figure 3 show computation times as a function of increasing sample sizes.
In all cases tested, the vectorized implementation (blue line) outperformed the "apply" version (red line), whereas the "loop" version (brown line) outperformed the vectorized implementation for large sample sizes (but was considerably slower for small and moderate sample sizes). The central and right panels show the computation time ratios (in log scale) comparing, respectively, the "loop" vs vectorized and the "apply" vs vectorized implementations. The horizontal line is set at zero and represents the threshold below which the data re-sampling approach outperforms the vectorized implementation. Note that log time ratios equal to 4, 3, 2, 1, and 0.5 correspond to speed gains of 54.60, 20.09, 7.39, 2.72, and 1.65, respectively.

All the timings in this section were measured on an Intel Core i7-3610QM (2.3 GHz), 24 GB RAM, Windows 7 Enterprise (64-bit) platform.
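The comparison above can be reproduced in spirit with a small timing harness. The sketch below uses Python/NumPy rather than the R code timed in the paper, so the absolute numbers are not comparable to Figures 1-3; it simply contrasts a loop over re-sampled data with the single-weights-matrix version of equation (10):

```python
import time
import numpy as np

rng = np.random.default_rng(2)
N, B = 82, 5_000
z = rng.normal(size=N)
y = z + rng.normal(size=N)

# "Loop" style: re-sample the data and re-evaluate the statistic B times.
t0 = time.perf_counter()
loop_reps = np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, size=N)
    loop_reps[b] = np.corrcoef(z[idx], y[idx])[0, 1]
t_loop = time.perf_counter() - t0

# Vectorized multinomial-weights style: one weights matrix, a few matrix products.
t0 = time.perf_counter()
W = rng.multinomial(N, np.full(N, 1.0 / N), size=B).T / N  # N x B weights matrix
mz, my = z @ W, y @ W
num = (z * y) @ W - mz * my
den = np.sqrt(((z * z) @ W - mz**2) * ((y * y) @ W - my**2))
vec_reps = num / den
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s, vectorized: {t_vec:.3f}s")
```

On typical hardware the vectorized version is much faster at this sample size, consistent with the small-$N$ results above, though the exact ratio depends on the platform and linear algebra backend.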
4. Discussion
In this note we showed how the multinomial sampling formulation of the non-parametric bootstrap can be easily implemented in vectorized form. We illustrated the gain in computational speed (in the R programming language) using real and simulated data sets.

Our examples provide several interesting insights. First, the re-sampling implementation based on the for loop was generally faster than the implementation provided by the bootstrap R package, which employs apply in place of the for loop (compare the red and brown curves on the top left panels of Figures 1 and 2, and on the left panels of Figure 3). This result illustrates the fact that apply is not always faster than a for loop (see [4] for further discussion and examples). Second, the vectorized implementation outperformed the data re-sampling implementation provided in the bootstrap R package in all cases tested (left panels in Figures 1 and 2, and right panels in Figure 3). Third, the gain in speed achieved by the vectorized implementation decreases as a function of the sample size (Figure 3). This decrease is likely due to the increase in memory requirements for generating and performing operations on the larger bootstrap weight matrices associated with the larger sample sizes. We point out, however, that even though optimized BLAS (Basic Linear Algebra Subprograms) libraries could potentially increase the execution speed of our vectorized operations on large matrices/vectors [4], our examples still show remarkable/considerable gains for small/moderate sample sizes even without using any optimized BLAS library (as illustrated by Figures 1 and 2).

For the sake of clarity, the exposition in Section 2 focused on statistics based on sample moments of the observed data, $\boldsymbol{x}$. We point out, however, that the multinomial sampling formulation of the bootstrap is available for any statistic satisfying the more general relation
$$\sum_{i=1}^{N} f(x_i^*) = \sum_{i=1}^{N} n_i^* f(x_i), \qquad (11)$$
where the $k$th sample moment represents the particular case $f(x_i) = x_i^k / N$. Clearly, the left hand side of equation (11) still represents a bootstrap replication of the first sample moment of the transformed variable $u_i^* = N f(x_i^*)$.

The multinomial sampling formulation of the non-parametric bootstrap is not new. It is actually a key piece in the demonstration of the connection between the non-parametric bootstrap and Bayesian inference, described in [7] and in Section 8.4 of [3], where the non-parametric bootstrap is shown to closely approximate the posterior distribution of the quantity of interest generated by the Bayesian bootstrap [7]. This close connection also implies that the Bayesian bootstrap can be easily implemented in vectorized form.
As a matter of fact, instead of generating the bootstrap weights from bootstrap count vectors sampled from a multinomial distribution, the Bayesian bootstrap samples the weights directly from a Dirichlet$(\boldsymbol{1}_N)$ distribution. We point out, however, that, to the best of our knowledge, the multinomial sampling formulation has not been explored before for vectorizing bootstrap computations.

References
[1] Tibshirani, R. (S original, from StatLib); R port by Leisch, F. (2014) bootstrap: Functions for the Book "An Introduction to the Bootstrap". R package version 2014.4. http://CRAN.R-project.org/package=bootstrap

[2] Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap. Chapman & Hall, Boca Raton.

[3] Hastie, T., Tibshirani, R., and Friedman, J. (2008) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer Verlag, New York.

[4] Ligges, U. and Fox, J. (2008) R help desk: how can I avoid this loop or make it faster? R News, 46-50.

[5] MATLAB and Statistics Toolbox Release 2012. The MathWorks, Inc., Natick, Massachusetts, United States.

[6] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

[7] Rubin, D. B. (1981) The Bayesian bootstrap. Annals of Statistics, 9.