Randomized Controlled Trials with Minimal Data Retention
Winston Chou (Senior Data Scientist, Apple)

February 8, 2021

For helpful comments, thanks are due to Katherine McCabe, Jasmine Nettiksimmons, Jim Sorensen, and Russ Webb.
Abstract
Amidst rising appreciation for privacy and data usage rights, researchers have increasingly recognized the principle of data minimization, which holds that the accessibility, collection, and retention of subjects' data should be kept to the minimum necessary to answer focused research questions. Applying this principle to randomized controlled trials (RCTs), this paper presents algorithms for drawing precise inferences from RCTs under stringent data retention and anonymization policies. In particular, we show how to use recursive algorithms to construct running estimates of treatment effects in RCTs, thereby allowing individualized records to be deleted or anonymized shortly after collection. Devoting special attention to the case of non-i.i.d. data, we further demonstrate how to draw robust inferences from RCTs by combining recursive algorithms with bootstrap and federated strategies.
1 Introduction

Randomized controlled trials (RCTs) represent the gold standard for causal evidence in medicine, technology, and numerous other fields. In a typical RCT, researchers collect data on n units, which are randomly allocated into treatment and control groups. At the conclusion of the experiment, when a pre-specified time or sample size threshold is met, statistics are computed over these groups and compared for evidence of a causal effect.

Conventional analytic procedures for RCTs require researchers to store all n individual records until the end of the trial. However, such retention is not strictly necessary and carries privacy risks for experiment participants. To reduce such risks, this paper presents methods for computing precise statistics from RCT data that do not require individual records to be stored for the duration of the trial. In particular, our methods are designed to operate under strict data retention policies, which guarantee that subjects' individual-level data will be deleted or anonymized shortly after being collected. Such policies, which have become ubiquitous and are enshrined in data use regulation like the GDPR, embody the principle of data minimization, which holds that the accessibility, collection, and retention of subjects' data should be kept to the bare amount needed to address targeted research questions (Pfitzmann and Hansen, 2010).

Our algorithms build on several families of statistical methods. First, we use recursive estimation to maintain running tallies of parameter estimates, which are updated with each new observation. This means that raw, individualized records can be discarded or anonymized immediately after being incorporated into the collective estimate. Second, to improve statistical precision using pre-experiment data, we extend these methods to linear regression models of RCT data (Deng et al., 2013). As such methods allow researchers to detect causal effects more quickly and with less data, they constitute yet another application of the data minimization principle. Lastly, to address the commonplace challenge of non-i.i.d. records – which create special problems for data deletion and anonymization practices – we show how recursive estimation can be combined with bootstrap and federated methods for robust inference in RCTs.

The paper is structured as follows. In Section 2, we state our motivating problem, namely estimation of causal effects in RCTs with minimal data retention. In Section 3, we present our proposed methodology: first providing a gentle introduction to recursive estimation, then discussing its application to RCTs and linear regression. We turn to cluster-robust inference in Section 4, outlining a bootstrap-based method for dependent data in Section 4.1 and a federated implementation of cluster-robust standard errors in Section 4.2.

2 The Motivating Problem

The motivating problem for this paper is to estimate the average effect of a treatment d on an outcome y in an RCT while minimizing the time that individual-level data is stored for analysis. Raw observations consist of individual records (y_i, x_i), i ∈ {1, . . . , n}, where y_i is unit i's measurement on y and x_i is a column vector of features. x_i includes a binary treatment indicator d_i, which equals 1 if record i is randomly assigned to the treatment group and 0 otherwise.
We assume that records arrive continuously throughout the RCT, which is potentially longer in duration than mandated data retention periods (as, e.g., in long-term holdouts). We further assume that the non-treatment features in x_i are unaffected by the treatment (for example, they may be measured prior to the RCT). We also impose standard regularity conditions on x; e.g., the features should not be linear combinations of each other.

Our estimand is the population average treatment effect (PATE),

\tau = E[\, y_i(1) - y_i(0) \,] , (1)

where y_i(1) is the i-th unit's potential outcome under treatment, y_i(0) is the same unit's potential outcome under control, and the expectation is taken over the population represented by the RCT sample (Imbens and Rubin, 2015). (We omit discussion of sampling issues in this paper.)

Absent data retention limits, estimating the PATE is trivial: simply subtracting the mean of y in the control group from the mean of y in the treatment group will yield an unbiased estimate. Furthermore, we can leverage the additional features in x to achieve more precise estimates of the PATE using linear regression (Deng et al., 2013), among other methods. Yet, conventional implementations of these estimators assume that each individual record can be kept until all records have been observed; our aim is to show how these estimators can be implemented without this assumption.

This paper contributes to the literature on digital experimentation by showing how recursive estimation can be used to overcome common challenges in RCTs – mainly protection of research subjects' privacy through swift data deletion, but also variance reduction, heteroscedasticity, and clustering. A related paper (Coey et al., 2020) similarly contemplates how to analyze cluster-randomized experiments with limited data retention, demonstrating that the pseudo-random sampling technique in Bakshy and Eckles (2013) can address this problem (see also Section 4.1 of this paper). Among other contributions, we extend their methodology to linear regression; identify a potential privacy risk; and develop an alternative strategy, based on federated computation, to remove this risk.

3 Proposed Methodology

3.1 A Gentle Introduction to Recursive Estimation

The basic idea behind our proposed algorithms is to keep a running tally of estimates, which is updated whenever a new record is observed. Only the most recent set of estimates, along with the new observation, are needed to update the tally, meaning that individual records can be discarded immediately after being incorporated into the collective estimate. To illustrate, we begin with the simple example of computing the mean of a sequence of measurements, denoted z_1, z_2, . . . , z_n. At any t ∈ {1, . . . , n}, the arithmetic mean of the z_i's observed up to that point can be computed as:

\bar{z}_t = \frac{1}{t} \sum_{i=1}^{t} z_i . (2)

Note that we do not need to retain each individual value of z_i, i < t, to compute \bar{z}_t. Instead, we need only keep track of the number of observations and the sum of the z_i's prior to t. Then, once the t-th observation is made, we can add z_t to the running sum and divide by t to compute the mean.

In fact, Equation 2 can be written recursively to express this update in even fewer steps:

\bar{z}_t = \bar{z}_{t-1} + \frac{1}{t} ( z_t - \bar{z}_{t-1} ) . (3)

As Equation 3 makes clear, retaining the individual records z_1, . . . , z_{t-1} is unnecessary to compute \bar{z}_t. The information contained in these observations is wholly encapsulated in \bar{z}_{t-1}, which was itself computed by updating \bar{z}_{t-2} with z_{t-1}, and so on. As such, we can discard each record immediately after incorporating it into the running tally.
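To make the recursion concrete, the update in Equation 3 can be written in a few lines of Python. This is a minimal sketch; the class and variable names are ours, not part of the paper:

class RunningMean:
    """Maintains the mean of a stream without retaining individual records (Equation 3)."""

    def __init__(self):
        self.t = 0       # number of records incorporated so far
        self.mean = 0.0  # current running mean

    def update(self, z):
        # Incorporate the new record, then discard it: only t and mean persist.
        self.t += 1
        self.mean += (z - self.mean) / self.t  # innovation scaled by the gain 1/t
        return self.mean

# Usage: rm = RunningMean(); [rm.update(z) for z in (2.0, 1.0, 4.0)]; rm.mean == 7/3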
Equation 3 conveys other useful concepts. Upon observing a new datum z_t, we apply a transformation to z_t that represents its surprisingness or "innovation" compared to past experience. The innovation is represented in Equation 3 by z_t − \bar{z}_{t-1}. We also apply a scaling factor, represented by \frac{1}{t}, which converges to zero as more data is accumulated. This factor is called the Kalman gain.

Vitally, the variance of z_i can also be estimated recursively. Letting s_{t-1} denote the sum of squares after t − 1 observations,

s_{t-1} = \sum_{i=1}^{t-1} ( z_i - \bar{z}_{t-1} )^2 , (4)

s_t is computed recursively upon observing an additional datum as:

s_t = s_{t-1} + \frac{t-1}{t} ( z_t - \bar{z}_{t-1} )^2 . (5)

Then an unbiased estimate of the variance of z_i is given by \widehat{\nu}_t = s_t / (t - 1).

(Equation 3 can be generalized to batch updates. Let t' denote the total record count after observing the batch and ∆ = \sum_{i=t+1}^{t'} z_i denote the batch sum. Then \bar{z}_{t'} = \bar{z}_t + \frac{1}{t'} [ ∆ − (t' − t) \bar{z}_t ]. The analogue to Equation 5 for batch updates is s_{t'} = s_t + s_∆ + \frac{t}{t'} (t' − t) ( \bar{∆} − \bar{z}_t )^2, where \bar{∆} is the batch mean ∆ / (t' − t) and s_∆ is the batch sum of squares, \sum_{i=t+1}^{t'} ( z_i − \bar{∆} )^2 (Chan et al., 1982).)

From Equation 3, it is clear that we can obtain an unbiased estimate of the PATE from an RCT without needing to store any individual records for the duration of the trial. Let z_i = d_i y_i / \pi - (1 - d_i) y_i / (1 - \pi), where \pi = E[d_i] is the probability of being assigned to the treatment group. Upon observing all n records, the recursive estimate of E[z_i] is equivalent to the arithmetic mean, which in turn equals:

\bar{z}_n = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{d_i y_i}{\pi} - \frac{(1 - d_i) y_i}{1 - \pi} \right) (6)
          = \frac{1}{n} \left( \sum_{i=1}^{n} \frac{d_i y_i}{\pi} - \sum_{i=1}^{n} \frac{(1 - d_i) y_i}{1 - \pi} \right) . (7)

Taking expectations, we have that:

E[\bar{z}_n] = \frac{1}{n} \left( n E[y_i \mid d_i = 1] - n E[y_i \mid d_i = 0] \right) (8)
             = E[y_i \mid d_i = 1] - E[y_i \mid d_i = 0] (9)
             = E[\, y_i(1) - y_i(0) \,] , (10)

where the last equality is implied by random assignment. Additionally, \widehat{\nu}_n / n yields an unbiased estimate of the asymptotic sampling variance of \bar{z}_n, and can be used to construct t-statistics and confidence intervals.
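Combining Equations 3 and 5 with the transformation z_i defined above gives a streaming PATE estimator that never stores raw records. The following sketch is our own illustration (the class and method names are hypothetical):

import math

class StreamingPATE:
    """Running estimate of the PATE and its sampling variance (Equations 3, 5, and 6)."""

    def __init__(self, pi):
        self.pi = pi     # treatment assignment probability, E[d_i]
        self.t = 0
        self.mean = 0.0  # running mean of z_i
        self.ss = 0.0    # running sum of squares, s_t

    def update(self, y, d):
        z = d * y / self.pi - (1 - d) * y / (1 - self.pi)
        self.t += 1
        delta = z - self.mean
        self.mean += delta / self.t
        self.ss += (self.t - 1) / self.t * delta ** 2
        # (y, d) can now be deleted or anonymized.

    def estimate(self):
        # Requires t >= 2 so that the variance is defined.
        var_hat = self.ss / (self.t - 1)   # unbiased estimate of Var(z_i)
        se = math.sqrt(var_hat / self.t)   # standard error of the running mean
        return self.mean, se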
3.2 Recursive Least Squares

Having sketched a gentle introduction to recursive estimation in the context of RCTs, we now turn to linear regression analysis of RCTs. Specifically, we consider the linear regression model:

y_i = x_i^\top \beta + e_i , (11)

where, as before, x_i = [1, d_i, . . . ]^\top is a k-length feature vector that contains the treatment indicator d_i as well as a constant offset; \beta is the column vector of linear regression coefficients [\beta_1, . . . , \beta_k]^\top, whose second component \beta_2 corresponds to the marginal treatment effect (i.e., the coefficient on d_i); and e_i is a stochastic error term having mean 0. Note that Equation 11 implies that the conditional mean of the control potential outcome, E[y_i(0) \mid x_i], is a linear function of the non-treatment features in x.

The benefit of fitting this model to RCT data is that, when the non-treatment features in x are strongly correlated with the outcome y, the statistical variance of the estimated treatment effect is considerably reduced (Deng et al., 2013). As a result, less data is needed to detect the treatment effect and/or measure it precisely. Moreover, when the treatment is randomly assigned (as in an RCT), the least squares estimate of \beta_2 in this model converges to the PATE in large samples, even if the relationship between the outcome and other features is not truly linear (Imbens and Rubin, 2015; Lin, 2013). The popular CUPED algorithm is a special case of this model that adds a lagged value of the dependent variable as the sole additional feature (Deng et al., 2013).

In classical linear regression, the error terms e_i are assumed to be i.i.d. normal with variance \sigma^2. Under this assumption, the sampling variance of \hat{\beta} is consistently estimated by:

\widehat{\Sigma}_{IID} = \left( \sum_{i=1}^{n} x_i x_i^\top \right)^{-1} \widehat{\sigma}^2 , (12)

where:

\widehat{\sigma}^2 = \frac{1}{n - k - 1} \sum_{i=1}^{n} \hat{e}_i^2 = \frac{1}{n - k - 1} \sum_{i=1}^{n} ( x_i^\top \hat{\beta} - y_i )^2 . (13)

As with the univariate mean and variance, \widehat{\Sigma}_{IID} can be used to compute asymptotically valid t-statistics and confidence intervals for \hat{\beta}.

To fit \hat{\beta} and \widehat{\Sigma}_{IID} with minimal data retention, we propose the method of recursive least squares (Harvey, 1990). Letting \hat{\beta}_t denote the OLS estimate of \beta after observing t records and (y_{t+1}, x_{t+1}) denote a new record, the updated estimate \hat{\beta}_{t+1} is given by:

\hat{\beta}_{t+1} = \hat{\beta}_t + ( X_{t+1}^\top X_{t+1} )^{-1} x_{t+1} ( y_{t+1} - x_{t+1}^\top \hat{\beta}_t ) , (14)

where X_m denotes the m × k matrix of features corresponding to the initial m ≤ n observations. As before, we can decompose the right-hand side of Equation 14 into the innovation, (y_{t+1} − x_{t+1}^\top \hat{\beta}_t), and the Kalman gain, (X_{t+1}^\top X_{t+1})^{-1} x_{t+1}. (Equation 14 is equivalent to stochastic gradient descent (SGD) with the learning rate determined by (X_{t+1}^\top X_{t+1})^{-1}. SGD, while also a recursive algorithm, is unnecessary here since the least squares solution can be found in each iteration.)

Note that we could also have maintained a running estimate of \hat{\beta} by keeping a running tally of the following matrices,

X_t^\top X_t = \sum_{i=1}^{t} x_i x_i^\top (15)
X_t^\top Y_t = \sum_{i=1}^{t} x_i y_i , (16)

and solving the normal equations:

( X_t^\top X_t ) \hat{\beta}_t = X_t^\top Y_t . (17)

This approach is analogous to the first recursion method for the mean, i.e., tracking the sum of the z_i's and dividing by t, and can be used with batch updates. However, its drawback is that it is computationally expensive: repeatedly solving the normal equations requires inversion of X_t^\top X_t at each t for which we want an estimate of \beta. Fortunately, (X_t^\top X_t)^{-1} can be updated directly:

( X_{t+1}^\top X_{t+1} )^{-1} = ( X_t^\top X_t )^{-1} - \frac{ ( X_t^\top X_t )^{-1} x_{t+1} x_{t+1}^\top ( X_t^\top X_t )^{-1} }{ 1 + x_{t+1}^\top ( X_t^\top X_t )^{-1} x_{t+1} } , (18)

which also simplifies the task of computing Equation 14.

The final step is to compute the residual variance \widehat{\sigma}^2, which constitutes the other piece of \widehat{\Sigma}_{IID}. Letting S_t denote the sum of squared residuals after t observations,

S_t = \sum_{i=1}^{t} ( x_i^\top \hat{\beta}_t - y_i )^2 , (19)

S_{t+1} is computed as:

S_{t+1} = S_t + \frac{ ( x_{t+1}^\top \hat{\beta}_t - y_{t+1} )^2 }{ 1 + x_{t+1}^\top ( X_t^\top X_t )^{-1} x_{t+1} } . (20)
Brown et al. (1975) offer a more intuitive expression for S_{t+1}:

S_{t+1} = \underbrace{ \sum_{i=1}^{t} ( y_i - x_i^\top \hat{\beta}_t )^2 }_{S_t} + ( y_{t+1} - x_{t+1}^\top \hat{\beta}_t )^2 - \sum_{i=1}^{t+1} ( \hat{\beta}_{t+1} - \hat{\beta}_t )^\top x_i x_i^\top ( \hat{\beta}_{t+1} - \hat{\beta}_t ) . (21)

Equation 21 re-expresses the update to S_t as the combination of two components. First, we add the squared "look-ahead residual" of the new observation based on the previous estimate of \beta. Second, we adjust the entire sum of residuals given the updated estimate of \beta. However, as Equation 20 shows, we do not actually need to make use of the previous records to do so.

Putting these pieces together yields the recursive least squares (RLS) algorithm (Algorithm 1).
Algorithm 1: Recursive least squares (RLS) (Harvey, 1990)

Initialize estimates of \hat{\beta}, (X^\top X)^{-1}, and S, for example using the first m ≥ k records
for each record i = m + 1, . . . , n do
    S ← S + ( x_i^\top \hat{\beta} - y_i )^2 / ( 1 + x_i^\top ( X^\top X )^{-1} x_i )
    ( X^\top X )^{-1} ← ( X^\top X )^{-1} - ( X^\top X )^{-1} x_i x_i^\top ( X^\top X )^{-1} / ( 1 + x_i^\top ( X^\top X )^{-1} x_i )
    \hat{\beta} ← \hat{\beta} + ( X^\top X )^{-1} x_i ( y_i - x_i^\top \hat{\beta} )
end

3.3 The Online Bootstrap

Inference in RCTs does not only consist of point estimates of the treatment effect; it also consists of quantifying the uncertainty of such estimates. Above, we showed how to estimate the sampling variances of \bar{z}_n and \hat{\beta}_n using normal approximations. However, the reliability of such approximations will depend on the sampling distribution of the error terms e_i, which can depart significantly from normality in practice. (We can also estimate heteroscedasticity-robust standard errors (HRSEs) when the e_i are independent but have different variances (Wooldridge, 2010). HRSEs are commonly applied to RCTs, since treatment effect heterogeneity implies heteroscedasticity. We show how to estimate HRSEs recursively in the Appendix.)

A more robust approach is to use bootstrap methods, which estimate the sampling variability by resampling the original records with replacement (Efron and Hastie, 2016). However, the conventional bootstrap assumes that we have kept the individual-level records for analysis, which is exactly what our recursive algorithms seek to avoid.

Fortunately, recursive estimation works well with the online bootstrap (Owen and Eckles, 2012), which does not require all of the data to be collected in advance (Chamandy et al., 2012). In the online bootstrap, each new record is assigned a random vector of B weights, w_i = [w_i^{(1)}, . . . , w_i^{(B)}]^\top, where each weight is drawn independently from a distribution with unit mean and variance. The resulting set of B statistics – with the i-th unit contributing to the b-th statistic in proportion to w_i^{(b)} – is then used as an approximation to the sampling distribution of that statistic.

To tailor the online bootstrap to recursive estimation, we first initialize B bootstrap sample statistics \bar{z}_0 = [\bar{z}_0^{(1)}, . . . , \bar{z}_0^{(B)}]^\top. Once the t-th record arrives, a B-length vector of weights, denoted w_t, is sampled from a Poisson(1) distribution. The record then updates all B statistics in proportion to the components of w_t. Once all n records are observed, the vector \bar{z}_n = [\bar{z}_n^{(1)}, . . . , \bar{z}_n^{(B)}]^\top is treated as the bootstrap distribution for \bar{z}. (See Chamandy et al. (2012) for theoretical justification of this approach for the Poisson(1) distribution.)

We provide a simple numerical example based on the PATE. Suppose we are running an RCT with π = 0.50.
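As an illustration, Algorithm 1 translates directly into NumPy. The sketch below initializes with batch OLS on the first m records; it is our own rendering under those assumptions, not a reference implementation:

import numpy as np

def rls(stream, m):
    """Recursive least squares (Algorithm 1).
    `stream` is an iterator yielding (y_i, x_i) pairs; the first m records
    initialize beta, (X'X)^{-1}, and S by ordinary batch least squares."""
    first = [next(stream) for _ in range(m)]
    X0 = np.array([x for _, x in first], dtype=float)
    y0 = np.array([y for y, _ in first], dtype=float)
    XtX_inv = np.linalg.inv(X0.T @ X0)
    beta = XtX_inv @ X0.T @ y0
    S = float(np.sum((X0 @ beta - y0) ** 2))
    n, k = m, X0.shape[1]
    for y, x in stream:
        x = np.asarray(x, dtype=float)
        resid = float(x @ beta - y)                 # look-ahead residual
        denom = 1.0 + float(x @ XtX_inv @ x)
        S += resid ** 2 / denom                     # Equation 20
        XtX_inv -= np.outer(XtX_inv @ x, x @ XtX_inv) / denom   # Equation 18
        beta += XtX_inv @ x * (y - float(x @ beta))              # Equation 14
        n += 1
        # (y, x) can now be discarded or anonymized.
    sigma2 = S / (n - k - 1)          # residual variance, as in Equation 13
    return beta, XtX_inv * sigma2     # coefficients and Sigma_IID

On exit, the returned quantities match the batch OLS estimates in Equations 12–14 computed on the full sample.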
The first record is a treated observation having y_1 = 2. In the conventional bootstrap methodology, we would store y_1 individually until all n records have arrived, then resample those records B times, such that the first record appears (say) once in the first resample, twice in the second resample, and not at all in the B-th resample. These counts can be represented as the vector of weights w_1 = [1, 2, . . . , 0]^\top.

In the online bootstrap, rather than resample the full dataset, the weight vectors are drawn from a probability distribution as records arrive. Accordingly, suppose that the same weight vector w_1 were randomly sampled instead. Upon observing y_1 = 2, d_1 = 1, and w_1 = [1, 2, . . . , 0]^\top, the updated vector of bootstrap statistics is \bar{z}_1 = [4, 4, . . . , 0]^\top,
where z_1 = d_1 y_1 / \pi - (1 - d_1) y_1 / (1 - \pi) = 4: each component with a nonzero weight jumps to z_1, while the component with weight zero is left at its initial value of zero. Having computed \bar{z}_1, we can discard the raw information (y_1, d_1) and instead update \bar{z} using the following recurrence relation for the (weighted) mean:

\bar{z}_t = \bar{z}_{t-1}                                          if w_t = 0
\bar{z}_t = \bar{z}_{t-1} + \frac{w_t}{n_t} ( z_t - \bar{z}_{t-1} )  otherwise, (22)

where we have redefined n_t as \sum_{i=1}^{t} w_i, the sum of weights up to the t-th observation. For example, if we next observed a control record with y_2 = 1 (so z_2 = −2) and a first weight component of 2, the first component of \bar{z} would update to 4 + (2/3)(−2 − 4) = 0, and the remaining components would update analogously given their own weights. Repeating this procedure (Algorithm 2) for all n observations yields B estimates – constituting the bootstrap distribution of \bar{z}_n – of the PATE.
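The two-record example can be checked numerically against the recurrence in Equation 22. The toy sketch below (helper name is ours) tracks a single bootstrap component:

def weighted_update(z_bar, n, z, w):
    """One step of the weighted running mean (Equation 22)."""
    if w == 0:
        return z_bar, n
    n += w
    return z_bar + w / n * (z - z_bar), n

pi = 0.5
z_bar, n = 0.0, 0
# Record 1: treated, y = 2, weight 1  ->  z_1 = 2 / 0.5 = 4
z_bar, n = weighted_update(z_bar, n, 2 / pi, 1)        # z_bar is now 4.0
# Record 2: control, y = 1, weight 2  ->  z_2 = -1 / 0.5 = -2
z_bar, n = weighted_update(z_bar, n, -1 / (1 - pi), 2)  # z_bar is now 0.0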
Algorithm 2: Online bootstrap for the PATE

Initialize B bootstrap sample means \bar{z}^{(1)}, . . . , \bar{z}^{(B)} and pseudocounts n^{(1)}, . . . , n^{(B)} at zero
for each record i = 1, . . . , n do
    z_i ← d_i y_i / \pi - (1 - d_i) y_i / (1 - \pi)
    Sample B weights w_i^{(1)}, . . . , w_i^{(B)} from Poisson(1)
    for each bootstrap sample b = 1, . . . , B in parallel do
        n^{(b)} ← n^{(b)} + w_i^{(b)}
        \bar{z}^{(b)} ← \bar{z}^{(b)} if w_i^{(b)} = 0 else \bar{z}^{(b)} + ( w_i^{(b)} / n^{(b)} ) ( z_i - \bar{z}^{(b)} )
    end
end
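A vectorized sketch of Algorithm 2, with all B replicates updated in parallel (our own NumPy rendering, assuming Poisson(1) weights as in the text):

import numpy as np

def online_bootstrap_pate(stream, pi, B=200, seed=0):
    """Online bootstrap for the PATE (Algorithm 2). `stream` yields (y_i, d_i) pairs."""
    rng = np.random.default_rng(seed)
    z_bar = np.zeros(B)    # B bootstrap sample means
    counts = np.zeros(B)   # B pseudocounts (running sums of weights)
    for y, d in stream:
        z = d * y / pi - (1 - d) * y / (1 - pi)
        w = rng.poisson(1.0, size=B)
        counts += w
        active = w > 0
        z_bar[active] += w[active] / counts[active] * (z - z_bar[active])
        # (y, d) can now be deleted or anonymized.
    return z_bar   # approximate bootstrap distribution of the PATE estimate

The standard deviation of the returned vector then estimates the standard error of the PATE estimate.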
It is possible to extend Algorithm 2 to linear regression by replacing the inner loop with weighted RLS. (The standalone weighted RLS algorithm is presented in the Appendix.) However, this raises the tricky question of how to initialize the recursion: with very few observations, X^\top X is not invertible, and so the starting values of \hat{\beta} will be undefined. As a workaround, in Algorithm 3 we retain the first
m records and fit starting values using ordinary batch methods. These values are then used to initialize all B recursions. In the following section, we show that this algorithm underestimates the variance, but that the underestimation shrinks as n increases relative to m.
Algorithm 3: Online bootstrap for weighted RLS

Estimate \hat{\beta} and Z ≡ (X^\top X)^{-1} using the first m records
Initialize B bootstrap coefficients \hat{\beta}^{(1)}, . . . , \hat{\beta}^{(B)} and matrices Z^{(1)}, . . . , Z^{(B)} at \hat{\beta} and Z, respectively
for each record i = m + 1, . . . , n do
    Sample B weights w_i^{(1)}, . . . , w_i^{(B)} from Poisson(1)
    for each bootstrap sample b = 1, . . . , B in parallel do
        Z^{(b)} ← Z^{(b)} - w_i^{(b)} Z^{(b)} x_i x_i^\top Z^{(b)} / ( 1 + w_i^{(b)} x_i^\top Z^{(b)} x_i )
        \hat{\beta}^{(b)} ← \hat{\beta}^{(b)} + Z^{(b)} w_i^{(b)} x_i ( y_i - x_i^\top \hat{\beta}^{(b)} )
    end
end
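Each replicate's inner update in Algorithm 3 is a weighted Sherman–Morrison step. A sketch of a single replicate's update, assuming \hat{\beta}^{(b)} and Z^{(b)} were initialized from the first m records as described above:

import numpy as np

def weighted_rls_step(beta, Z, x, y, w):
    """One bootstrap replicate's update in Algorithm 3.
    Z is the replicate's running (X'X)^{-1}; w is its Poisson weight for this record."""
    if w == 0:
        return beta, Z              # the record does not enter this replicate
    denom = 1.0 + w * float(x @ Z @ x)
    Z = Z - w * np.outer(Z @ x, x @ Z) / denom
    beta = beta + Z @ x * w * (y - float(x @ beta))
    return beta, Z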
[Figure 1 appears here: three panels – least squares, bootstrap least squares, and bootstrap RLS – showing the estimated standard-error width (top row) and 95% confidence-interval coverage (bottom row) for sample sizes 500, 1,000, and 2,000.]
Figure 1: Simulation study results. Bootstrap RLS underestimates the sampling variability of \hat{\beta}. The underestimation shrinks as the sample size grows relative to the number of records used to initialize the recursion.

3.4 Simulation Study

This section presents a small-scale simulation study on some practical differences between recursive and batch estimators of treatment effects. We emphasize that, in many cases, these approaches will yield numerically identical results. Algorithm 3 presents an exception: here, we use the first m records to initialize the recursion of the bootstrap distribution. Because the final bootstrap distribution (i.e., after all n records have been observed) does not account for the sampling variability of these initial records, it will underestimate the variance of \hat{\beta}, especially if m is large relative to n.

To illustrate this, we use simulation to compare the performance of three estimators:

1. The ordinary (batch) least squares estimator of \hat{\beta} and \widehat{\Sigma}.
2. The ordinary least squares estimator of \hat{\beta}, but with the online bootstrap estimator of the variance.
3. RLS with the online bootstrap estimator of the variance (Algorithm 3), using the first m = 100 records to initialize all starting values.

Having verified that our implementations of these estimators yielded identical estimates of \hat{\beta}, in our simulations we focus on differences in the estimated variance of \hat{\beta}. The simulation parameters were as follows:
x_i ∼ Exponential(1/10) (23)
y_i(0) = 0. x_i − 0. x_i^2 + e_i (24)
e_i ∼ Student's t(0, 2) (25)
y_i(1) = y_i(0) + τ_i (26)
τ_i ∼ Student's t(1, 10) (27)
d_i ∼ Bernoulli(0.5) . (28)

That is, we first simulated a single feature x_i, which is nonlinearly related to the control potential outcome y_i(0). (Despite this nonlinearity, we fit the linear regression model in Equation 11.) We then sampled the individual treatment effects τ_i from a t-distribution centered on 1 with 10 degrees of freedom. Lastly, we drew the treatment assignment from a Bernoulli distribution with probability 0.5,
which determined the observed outcome y_i.

In Figure 1, we plot the mean estimated standard error of \hat{\beta} and the coverage of 95% confidence intervals over 10,000 Monte Carlo simulations for samples of size 500, 1,000, and 2,000. Again, our focus is on the variance estimation because the estimated coefficients are numerically identical for all three estimators. As the figure shows, the least squares and bootstrap least squares methods yield highly similar results; in particular, both estimators generate confidence intervals with the appropriate coverage. On the other hand, the RLS estimator yields standard errors that understate the true variability, particularly when the sample size is small (relative to m). As m shrinks as a fraction of the total sample size (i.e., from 20% to 10% to 5%), the estimated standard error and coverage converge to the correct values.

4 Cluster-Robust Inference

In addition to non-normality, another common challenge in RCTs is that records are not independent but instead clustered into higher-level units. For example, users in an online experiment might return to the same page multiple times throughout the course of the trial. Such clustering is typically modeled as a correlation between records (e.g., page visits) that stem from the same higher-level unit (e.g., users). Intracluster correlation tends to inflate the variance of parameter estimates, as less information is gleaned from correlated records – meaning that naive estimates of the variance are likely to understate the true variability.

In addition to these statistical challenges, clustering also complicates data minimization. This is because conventional methods for addressing clustering assume that we can associate records from the same units over time, meaning that records cannot be deleted or anonymized during the RCT.

We outline two strategies for handling these issues. First, we present a cluster-robust version of the online bootstrap for recursive estimates (Coey et al., 2020). The "trick" to this approach is to use pseudorandom sampling to ensure that records from the same unit are assigned identical vectors of bootstrap weights (Bakshy and Eckles, 2013). The method emulates the canonical cluster bootstrap, but without requiring individual records to be stored over time. Second and alternatively, we propose to distribute the computation of the variance-covariance matrix to the higher-level units themselves. In this framework, the units (e.g., mobile phones) independently compute their contributions to the variance-covariance matrix for \hat{\beta}, which can be condensed into k × 1 vectors.

4.1 Online Cluster Bootstrap

While there are a number of ways to adjust variance estimates for clustering, one of the most robust is the cluster bootstrap. In the typical implementation of this method, clusters rather than individual records are resampled with replacement. Note that this requires individual records to be stored throughout the RCT period. Moreover, it requires that we are able to stitch together records from the same higher-level unit, meaning that the records cannot be anonymized, let alone deleted, until all records have arrived.

To circumvent these requirements, we propose to use deterministic sampling in the online bootstrap algorithm in order to ensure that records from the same unit receive the same vector of weights (Bakshy and Eckles, 2013; Coey et al., 2020).
That is, rather than sample a distinct weight vector for each record, we can seed the random weight generator with the unit identifier for the i-th record, denoted by j[i], such that every record belonging to j[i] will receive an identical vector of weights w_{j[i]}. This emulates the conventional cluster bootstrap, as each record will appear with the same frequency in each bootstrap resample as all of the other records in the same cluster.

When combined with recursive estimation, this yields a straightforward algorithm for estimating sampling variances with clustered observations. Algorithm 4 exposits in the case of the PATE. Note that the only difference from Algorithm 2 is that the weights are no longer i.i.d. across records but generated using j[i] as a seed. Accordingly, we can extend this method to RLS by seeding the random weight generator in Algorithm 3.
Algorithm 4: Online cluster bootstrap for the PATE

Initialize at zero B bootstrap sample means \bar{z}^{(1)}, . . . , \bar{z}^{(B)} and pseudocounts n^{(1)}, . . . , n^{(B)}
for each record i = 1, . . . , n do
    Sample B weights w_{j[i]}^{(1)}, . . . , w_{j[i]}^{(B)}, using j[i] as a seed
    z_i ← d_i y_i / \pi - (1 - d_i) y_i / (1 - \pi)
    for each bootstrap sample b = 1, . . . , B in parallel do
        n^{(b)} ← n^{(b)} + w_{j[i]}^{(b)}
        \bar{z}^{(b)} ← \bar{z}^{(b)} if w_{j[i]}^{(b)} = 0 else \bar{z}^{(b)} + ( w_{j[i]}^{(b)} / n^{(b)} ) ( z_i - \bar{z}^{(b)} )
    end
end
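One way to realize the deterministic weights in Algorithm 4 is to seed the weight generator with a hash of the unit identifier, so that every record from the same unit draws the same vector. The sketch below is our own illustration; the hashing scheme and the experiment_salt parameter are assumptions, not prescribed by the paper:

import hashlib
import numpy as np

def unit_weights(unit_id, B, experiment_salt="exp-001"):
    """Deterministic Poisson(1) weight vector for a higher-level unit.
    All records sharing unit_id receive the same B weights."""
    digest = hashlib.sha256(f"{experiment_salt}:{unit_id}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "little"))
    return rng.poisson(1.0, size=B)

def online_cluster_bootstrap_pate(stream, pi, B=200):
    """Online cluster bootstrap for the PATE (Algorithm 4). `stream` yields (y, d, unit_id)."""
    z_bar, counts = np.zeros(B), np.zeros(B)
    for y, d, unit_id in stream:
        z = d * y / pi - (1 - d) * y / (1 - pi)
        w = unit_weights(unit_id, B)
        counts += w
        active = w > 0
        z_bar[active] += w[active] / counts[active] * (z - z_bar[active])
    return z_bar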
A potentially significant flaw of Algorithm 4 is that, when records are anonymized but not deleted, the bootstrap weights w_{j[i]} can be used to implicitly associate records from the same higher-level unit – that is, they serve as an implicit identifier. Even if the weights themselves are not stored, they may be inferred from the traceplots of the bootstrap estimates. This risk is illustrated in Figure 2. Suppose the weights are determined by 2 × W_{j[i]}, where W_{j[i]} ∼ Binomial(1, 1/2) ("double-or-nothing" weights). For an anonymized record i with j[i] unknown, we can infer the weight vector from the non-zero components left after subtracting the (i − 1)-th from the i-th bootstrap estimates (first row). Then, if a non-anonymized record (i' with j[i'] known) shares the same weight vector (second row), we can infer j[i] = j[i'] with high probability. (For instance, the probability of two distinct units sharing the same vector of double-or-nothing weights is 2^{-B}.) Hence, even if a record has nominally been anonymized, it can be deanonymized by the weight vector.

Figure 2: Reidentification risk of the online cluster bootstrap. If either the weights or the history of bootstrap estimates is stored, they can be used to re-identify anonymized records. This is because the weights can serve as a hashed identifier that implicitly "fingerprints" records from the same higher-level unit.

To be sure, this "fingerprinting" risk is not unique to the online cluster bootstrap: all methods of adjusting for intracluster correlation assume some ability to associate records from the same cluster. Again, this association is important for drawing valid inferences from RCTs, but greatly complicates data deletion and anonymization practices.

4.2 Federated Cluster-Robust Standard Errors

To overcome this challenge, we propose an alternative, federated approach to cluster-robust inference. In federated computation, research subjects (typically called clients) store and perform computations on their individual records (Konečný et al., 2016). These independent computations are collected by the server, possibly without client identifiers, and immediately aggregated into collective statistics. Consequently, raw, individualized records need neither be communicated to nor retained on the server. Of course, federated computation implies the existence of clients that are capable of storing and performing computations on their own data (e.g., mobile phones).

Although the online cluster bootstrap can be implemented in a federated fashion, we pursue an alternative, communication-efficient approach, which is better suited to experimentation with mobile phones. In such experiments, communication costs (e.g., upload bandwidth) compel us to limit the size and number of transmissions, ideally to a fixed length and frequency (Konečný et al., 2016). We therefore turn to cluster-robust standard errors (CRSEs) for cluster-robust inference (Liang and Zeger, 1986).
As the name implies, CRSEs adjust \widehat{\Sigma}_{IID}, the estimated variance-covariance matrix of \hat{\beta}, to account for intracluster correlation of records from the same unit. (CRSEs are appropriate whenever the treatment is assigned to higher-level units, but raw observations are collected at a finer granularity (Abadie et al., 2017). More precisely, CRSEs are appropriate when the structural mean model in Equation 11 is correct, but the error terms e_i are correlated within each cluster.) Usefully, each unit's contribution to the CRSEs can be computed independently and communicated to the server as a k-length vector, itself an aggregation of each unit's records.

Our proposed implementation of CRSEs is as follows. As in Section 3.2, records are streamed to the server during the RCT to generate recursive estimates of \hat{\beta} and (X^\top X)^{-1}. These can be quickly discarded and/or anonymized (or, in contrast with the online cluster bootstrap, collected without subject identifiers) on the server. Each client retains its own records for the duration of the experiment. At the conclusion of the RCT, the server pushes the final estimates of the linear regression coefficients \hat{\beta}_n to the clients. The clients then compute their independent contributions to the variance of \hat{\beta}_n, given below. These contributions are collected and averaged on the server to produce the final estimate \widehat{\Sigma}_n (see Algorithm 5).

This procedure emulates the following batch estimator of the variance-covariance matrix of \hat{\beta}:

\widehat{\Sigma}_{CL} = ( X^\top X )^{-1} X^\top e e^\top X ( X^\top X )^{-1} , (29)

where e = [\hat{e}_1, . . . , \hat{e}_n]^\top is the vector of n linear regression residuals. \widehat{\Sigma}_{CL} is often called the "sandwich estimator" because it sandwiches a "meat" matrix, X^\top e e^\top X, between two "bread" matrices, (X^\top X)^{-1} (Wooldridge, 2010). Under clustering, e e^\top is treated as a block-diagonal matrix, which means the meat matrix can be rewritten as

X^\top e e^\top X = \sum_{j} X_j^\top e_j e_j^\top X_j , (30)

where X_j is the n_j × k matrix consisting of the j-th cluster's feature vectors and e_j is the column vector of its regression residuals. (We assume, without loss of generality since reordering the data is trivial in the batch processing case, that records are ordered by cluster.)

As Equation 30 demonstrates, each unit's contribution to \widehat{\Sigma}_{CL} (that is, X_j^\top e_j e_j^\top X_j) can be computed independently and anonymously aggregated to form X^\top e e^\top X, which is in turn combined with the recursive estimates of (X^\top X)^{-1} to compute \widehat{\Sigma}_{CL}.

Figure 3 illustrates. In the first stage, the server pushes the final estimates of \beta to the clients. Each client then computes the vector X_j^\top e_j = \sum_{i=1}^{n_j} x_i ( x_i^\top \hat{\beta}_n - y_i ) and communicates it to the server, possibly without identifiers. Lastly, the server computes the outer products of these vectors and sums the resulting k × k matrices to compute X^\top e e^\top X and \widehat{\Sigma}_{CL}.

Figure 3: Federated computation of cluster-robust standard errors. Once the server pushes the final estimates of \beta to the clients, each client independently computes its contribution to \widehat{\Sigma}_{SW}, then passes these, possibly anonymously, to the server for aggregation. All messages consist of k numbers.

This computation bears some resemblance to federated learning for linear regression (McMahan et al., 2017).
For instance, in both algorithms, the client updates consist of the squared-error loss gradient computed on the client's own data. However, because our focus is on estimating the variance of the fitted model parameters \hat{\beta}_n, and not how these parameters are fitted, we require just one round of back-and-forth communication between the server and clients.
Algorithm 5: Federated computation of cluster-robust standard errors

for each client j = 1, . . . , J in parallel do
    for each record i = 1, . . . , n_j in parallel do
        e_i ← x_i^\top \hat{\beta} - y_i
    end
    u_j ← \sum_{i=1}^{n_j} x_i e_i
    return u_j to the server
end
\widehat{\Sigma}_{SW} ← ( \sum_{i=1}^{n} x_i x_i^\top )^{-1} ( \sum_{j=1}^{J} u_j u_j^\top ) ( \sum_{i=1}^{n} x_i x_i^\top )^{-1}
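For concreteness, the two halves of Algorithm 5 might be sketched as follows (function names are ours; XtX_inv denotes the recursive estimate of (X^\top X)^{-1} maintained on the server during the trial):

import numpy as np

def client_contribution(X_j, y_j, beta_hat):
    """Run on client j: condense its n_j records into the k-length vector u_j = X_j' e_j."""
    e_j = X_j @ beta_hat - y_j     # residuals under the final coefficients pushed by the server
    return X_j.T @ e_j             # only these k numbers are transmitted

def server_crse(XtX_inv, client_vectors):
    """Run on the server: assemble the sandwich estimator from the clients' u_j vectors."""
    meat = sum(np.outer(u, u) for u in client_vectors)   # X' e e' X under block-diagonal clustering
    return XtX_inv @ meat @ XtX_inv                      # cluster-robust covariance of beta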
Conclusion

We have outlined methods for recursively computing precise statistics from RCTs. Because recursive estimation eliminates the need to store raw, individualized records for analysis, it is an attractive option for working with data that is sensitive in nature and/or subject to stringent retention or anonymization policies. While not a panacea for all problems – in particular, when records from the same unit need to be associated over time for clustering – recursive methods can be combined with bootstrap or federated computation methods to address these challenges.

The driving motivation behind our methods is to enable the swift deletion of RCT data. There are, of course, fundamental tensions between this goal and open science principles like data transparency and replicability. Yet, there are also points of alignment; in particular, committing to a recursive strategy implies pre-specifying an analysis plan and can greatly limit the scope for post hoc specification search and p-hacking (Munafò et al., 2017). Future research should explore methods of enabling data transparency and replicability that also respect research subjects' fundamental privacy and data usage rights.

The methods outlined in this paper can be extended in a number of directions, of which we emphasize three. First, our recursive methods can be extended to more powerful models than linear regression, which may achieve greater variance reduction. For example, recursion is routinely used in gradient descent algorithms to fit more complicated models than linear regression, although such models are not commonly applied to RCTs. Second, we have not explored how to initialize the recursion in detail. This can be based on an initial batch of records, as in Algorithms 1 and 3, or it can reflect a Bayesian prior, which can also yield precision gains. Third, our federated implementation of cluster-robust standard errors would dovetail with federated learning of the treatment effect or regression model itself. This would have the benefit of exempting clients from sending any individualized records to the server for learning. That being said, combining federated computation with recursive estimation may strike a useful compromise between data minimization, communication costs, and use of client resources.

References
Alberto Abadie, Susan Athey, Guido W. Imbens, and Jeffrey Wooldridge. 2017. When Should You Adjust Standard Errors for Clustering? Technical Report. National Bureau of Economic Research.

Eytan Bakshy and Dean Eckles. 2013. Uncertainty in Online Experiments With Dependent Data: An Evaluation of Bootstrap Methods. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1303–1311.

Robert L. Brown, James Durbin, and James M. Evans. 1975. Techniques for Testing the Constancy of Regression Relationships Over Time. Journal of the Royal Statistical Society: Series B (Methodological) 37, 2 (1975), 149–163.

Nicholas Chamandy, Omkar Muralidharan, Amir Najmi, and Siddartha Naidu. 2012. Estimating Uncertainty for Massive Data Streams. Technical Report. Google Research.

Tony F. Chan, Gene Howard Golub, and Randall J. LeVeque. 1982. Updating Formulae and a Pairwise Algorithm for Computing Sample Variances. In COMPSTAT 1982 5th Symposium. Springer, 30–41.

Dominic Coey, Felipe Coirolo, Daniel Haimovich, and Kevin Liou. 2020. Privacy-preserving Methods for Experiments with Repeated Measurements. Conference on Digital Experimentation (CODE).

Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. 123–132.

Bradley Efron and Trevor Hastie. 2016. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press.

Andrew C. Harvey. 1990. The Econometric Analysis of Time Series. MIT Press.

Guido W. Imbens and Donald B. Rubin. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.

Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated Learning: Strategies for Improving Communication Efficiency. In NIPS Workshop on Private Multi-Party Machine Learning.

Kung-Yee Liang and Scott L. Zeger. 1986. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika 73, 1 (1986), 13–22.

Winston Lin. 2013. Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman's Critique. The Annals of Applied Statistics 7, 1 (2013), 295–318.

H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2017. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR, 1273–1282.

Marcus R. Munafò, Brian A. Nosek, Dorothy V.M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P.A. Ioannidis. 2017. A Manifesto for Reproducible Science. Nature Human Behaviour 1, 21 (2017), 1–9.

Art B. Owen and Dean Eckles. 2012. Bootstrapping Data Arrays of Arbitrary Order. The Annals of Applied Statistics 6, 3 (2012), 895–927.

Andreas Pfitzmann and Marit Hansen. 2010. A Terminology for Talking About Privacy by Data Minimization: Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity, and Identity Management. Technical Report.

Jeffrey M. Wooldridge. 2010. Econometric Analysis of Cross Section and Panel Data. MIT Press.
Appendix

A Weighted Recursive Least Squares Algorithm
Algorithm 6 solves the following weighted least squares problem:

\arg\min_{\beta} \sum_{i=1}^{n} w_i ( y_i - x_i^\top \beta )^2 , (31)

where w_i is the weight associated with the i-th observation and W is the diagonal weight matrix.

Algorithm 6: Weighted recursive least squares (wRLS)

Initialize estimates of \hat{\beta}, (X^\top W X)^{-1}, and S, for instance with the first m ≥ k records
for each record i = m + 1, . . . , n do
    S ← S + ( x_i^\top \hat{\beta} - y_i )^2 / ( w_i^{-1} + x_i^\top ( X^\top W X )^{-1} x_i )
    ( X^\top W X )^{-1} ← ( X^\top W X )^{-1} - ( X^\top W X )^{-1} x_i x_i^\top ( X^\top W X )^{-1} / ( w_i^{-1} + x_i^\top ( X^\top W X )^{-1} x_i )
    \hat{\beta} ← \hat{\beta} + ( X^\top W X )^{-1} w_i x_i ( y_i - x_i^\top \hat{\beta} )
end
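A sketch of one iteration of Algorithm 6 in NumPy, assuming strictly positive weights and initialization from the first m records (function name is ours):

import numpy as np

def wrls_update(beta, XtWX_inv, S, x, y, w):
    """One iteration of weighted recursive least squares (Algorithm 6)."""
    if w == 0:
        return beta, XtWX_inv, S            # zero-weight records do not enter the fit
    denom = 1.0 / w + float(x @ XtWX_inv @ x)
    S = S + (float(x @ beta) - y) ** 2 / denom
    XtWX_inv = XtWX_inv - np.outer(XtWX_inv @ x, x @ XtWX_inv) / denom
    beta = beta + XtWX_inv @ x * w * (y - float(x @ beta))
    return beta, XtWX_inv, S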
B Robust Standard Errors

Under heteroscedasticity, a consistent estimator of the variance of \hat{\beta} is given by (Wooldridge, 2010):

\widehat{\Sigma}_{HR} = ( X^\top X )^{-1} \sum_{i=1}^{n} \hat{e}_i^2 x_i x_i^\top ( X^\top X )^{-1} . (32)

A simple method for computing \widehat{\Sigma}_{HR} recursively is to keep running tallies of the following matrices:

A_y^{(t)} = \sum_{i=1}^{t} y_i^2 x_i x_i^\top (33)
A_j^{(t)} = -2 \sum_{i=1}^{t} y_i x_{ij} x_i x_i^\top , j = 1, . . . , k (34)
A_{jj}^{(t)} = \sum_{i=1}^{t} x_{ij}^2 x_i x_i^\top , j = 1, . . . , k (35)
A_{jl}^{(t)} = 2 \sum_{i=1}^{t} x_{ij} x_{il} x_i x_i^\top , 1 ≤ j < l ≤ k . (36)

In total, we maintain 1 + k + k + k(k − 1)/2 matrices. Gathering these into the tensor A_{jkl}^{(t)}, the meat matrix in the interior of Equation 32 is computed for a given t as:

\sum_{i=1}^{t} \hat{e}_i^2 x_i x_i^\top = \tilde{\beta}_t^\top A_{jkl}^{(t)} , (37)

where \tilde{\beta}_t = [1, \hat{\beta}_1, . . . , \hat{\beta}_k, \hat{\beta}_1^2, . . . , \hat{\beta}_k^2, \hat{\beta}_1 \hat{\beta}_2, . . . , \hat{\beta}_{k-1} \hat{\beta}_k]^\top. (For readability, we suppress the t subscripts on the individual coefficients in \tilde{\beta}_t.) Note that the bread matrices (X^\top X)^{-1} and regression coefficients \hat{\beta} may be computed as in Algorithm 1.

Then, after observing all n records, \widehat{\Sigma}_{HR} is computed as:

\widehat{\Sigma}_{HR} = ( X^\top X )^{-1} \tilde{\beta}_n^\top A_{jkl}^{(n)} ( X^\top X )^{-1} . (38)
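The tallies in Equations 33–36 amount to tracking the running moments \sum_i y_i^2 x_i x_i^\top, \sum_i y_i x_{ia} x_i x_i^\top, and \sum_i x_{ia} x_{id} x_i x_i^\top. The sketch below keeps those moments as tensors and rebuilds the meat matrix for any coefficient vector; it is an equivalent arrangement of the same bookkeeping, not the paper's exact layout, and stores on the order of k^4 numbers, the same order as the 1 + k + k + k(k − 1)/2 matrices above:

import numpy as np

class RunningHRSE:
    """Running moments sufficient to rebuild sum_i e_i^2 x_i x_i' at any time."""

    def __init__(self, k):
        self.Ay = np.zeros((k, k))      # sum_i y_i^2 x_i x_i'
        self.C = np.zeros((k, k, k))    # C[a, b, c] = sum_i y_i x_ia x_ib x_ic
        self.T = np.zeros((k, k, k, k)) # T[a, d, b, c] = sum_i x_ia x_id x_ib x_ic

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        self.Ay += y ** 2 * np.outer(x, x)
        self.C += y * np.einsum('a,b,c->abc', x, x, x)
        self.T += np.einsum('a,d,b,c->adbc', x, x, x, x)
        # the raw record (x, y) can now be discarded

    def meat(self, beta):
        # sum_i (x_i' beta - y_i)^2 x_i x_i', expanded in beta
        return (np.einsum('a,d,adbc->bc', beta, beta, self.T)
                - 2.0 * np.einsum('a,abc->bc', beta, self.C)
                + self.Ay)

def hrse(XtX_inv, running, beta):
    """Heteroscedasticity-robust covariance of beta (Equation 32)."""
    return XtX_inv @ running.meat(beta) @ XtX_inv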