[PDF] Application of the hierarchical bootstrap to multi-level data in neuroscience

Abstract

A common feature in many neuroscience datasets is the presence of hierarchical data structures, most commonly recording the activity of multiple neurons in multiple animals across multiple trials. Accordingly, the measurements constituting the dataset are not independent, even though the traditional statistical analyses often applied in such cases (e.g., Students t-test) treat them as such. The hierarchical bootstrap has been shown to be an effective tool to accurately analyze such data and while it has been used extensively in the statistical literature, its use is not widespread in neuroscience - despite the ubiquity of hierarchical datasets. In this paper, we illustrate the intuitiveness and utility of this approach to analyze hierarchically nested datasets. We use simulated neural data to show that traditional statistical tests can result in a false positive rate of over 45%, even if the Type-I error rate is set at 5%. While summarizing data across non-independent points (or lower levels) can potentially fix this problem, this approach greatly reduces the statistical power of the analysis. The hierarchical bootstrap, when applied sequentially over the levels of the hierarchical structure, keeps the Type-I error rate within the intended bound and retains more statistical power than summarizing methods. We conclude by demonstrating the effectiveness of the method in two real-world examples, first analyzing singing data in male Bengalese finches (Lonchura striata var. domestica) and second quantifying changes in behavior under optogenetic control in flies (Drosophila melanogaster).

Full PDF

OO R I G I N A L A R T I C L E

NBDT } A n a l y s i s

Application of the hierarchical bootstrap tomulti-level data in neuroscience

Varun Saravanan | Gordon J Berman | Samuel JSober Neuroscience Graduate Program,Graduate Division of Biological andBiomedical Sciences, Laney GraduateSchool, Emory University, 30322 Department of Biology, Emory University,30322 Department of Physics, Emory University,30322

Correspondence

Samuel J Sober, 1510 Clifton Road NE,Room 2006, Atlanta, GA, 30322Email: [email protected]

Funding information

The work for this project was funded by NIHNINDS F31 NS100406, NIH NINDS R01NS084844, NIH NIBIB R01 EB022872, NIHNIMH R01 MH115831-01, NSF 1456912,Research Corporation for ScienceAdvancement no. 25999 and The SimonsFoundation.

A common feature in many neuroscience datasets is thepresence of hierarchical data structures, most commonlyrecording the activity of multiple neurons in multiple ani-mals across multiple trials. Accordingly, the measurementsconstituting the dataset are not independent, even thoughthe traditional statistical analyses often applied in such cases(e.g., Student’s t-test) treat them as such. The hierarchicalbootstrap has been shown to be an effective tool to accu-rately analyze such data and while it has been used exten-sively in the statistical literature, its use is not widespread inneuroscience - despite the ubiquity of hierarchical datasets.In this paper, we illustrate the intuitiveness and utility ofthis approach to analyze hierarchically nested datasets. Weuse simulated neural data to show that traditional statisticaltests can result in a false positive rate of over 45%, even ifthe Type-I error rate is set at 5%. While summarizing dataacross non-independent points (or lower levels) can poten-tially ﬁx this problem, this approach greatly reduces the sta-tistical power of the analysis. The hierarchical bootstrap,when applied sequentially over the levels of the hierarchicalstructure, keeps the Type-I error rate within the intended a r X i v : . [ q - b i o . N C ] J u l S ARAVANAN ET AL . bound and retains more statistical power than summarizingmethods. We conclude by demonstrating the effectivenessof the method in two real-world examples, ﬁrst analyzingsinging data in male Bengalese ﬁnches ( Lonchura striata var. domestica ) and second quantifying changes in behavior un-der optogenetic control in ﬂies (

Drosophila melanogaster ). K E Y W O R D S

Hierarchical Bootstrap, Multi-level datasets | INTRODUCTION

It is commonplace for studies in neuroscience to collect multiple samples from within a category (e.g., multiple neuronsfrom one animal) to boost sample sizes. A recent survey found that of 314 papers published in prominent journalscovering neuroscience research over an 18 month period in 2013-14, roughly 53% of those studies had nested datasetsfeaturing hierarchical data [1]. When data are collected in this manner, the resulting data are not independent. Com-monly deployed statistical tests like the Student’s t-test and ANOVA, however, treat all data points as independent.This assumption results in an underestimation of uncertainty in the dataset and a corresponding underestimationof the p-value [2, 3, 4]. Pseudoreplication [5, 6], where variance within a group and variance between groups is notappropriately accounted for, is often the cause of such biases. For example, consider a hypothetical example in whichone measures changes in dendritic spine size during learning. Since one can typically only measure from a few animalseach in different treatment conditions, researchers usually increase sample sizes by measuring multiple spines fromeach neuron and by measuring multiple neurons within an animal. The hierarchical nature of such datasets can result indifferent samples being statistically dependent on each other: spines measured from the same neuron are likely moresimilar than spines measured across different neurons, even more so than spines measured from different animalswithin the same treatment condition.Linear Mixed Models (LMMs) can be used to account for the variance across different levels [1, 7] and have recentlybeen used to do so in several studies [8, 9, 10, 11]. However, LMMs assume that all hierarchical structure present islinear, which is often not the case. Additionally, the parameters returned by LMM ﬁts may exhibit bias and be unreliablewhen the number of clusters is small, as is also often the case in neuroscience datasets [12, 13, 14]. Finally LMMsprovide a great deal of ﬂexibility in the various implementation choices a user can make to build models appropriate fortheir datasets. However, recent ﬁndings have suggested that LMMs can be highly sensitive to user choices, particularlyfor random effects, which in turn can impact the convergence and interpretation of results from LMMs [15].The hierarchical bootstrap [16, 17, 18, 19] is a statistical method that has been applied successfully to a wide varietyof clustered datasets, including census and polling results, educational and psychological data, and phylogenetic treeinformation [20, 21, 14]. Unlike LMMs, the hierarchical bootstrap is relatively agnostic to the underlying structurepresent in the data and has consistently performed better at quantifying uncertainty and identifying signal than tradi-tional statistics [22, 21, 23], though some concerns have been raised that the bootstrap may be excessively conservativein a limited subset of cases [24, 25]. However, the use of the hierarchical bootstrap in neuroscience has been limited,even though its application is increasingly warranted.This paper is divided into two parts. In the ﬁrst, we simulate a typical dataset studied in neuroscience and use it toillustrate how the Type-I error is inﬂated in hierarchical datasets when applying traditional statistical methods but can

ARAVANAN ET AL . 3 be averted using the hierarchical bootstrap. In the second, we demonstrate the use of the hierarchical bootstrap in tworeal-world examples using singing data from songbirds [26] and optogenetic control of behavior in ﬂies [27]. In bothcases, the data have a strong hierarchical structure and our analyses highlight the need to use appropriate statisticaltests when analyzing hierarchical datasets in neuroscience. | MATERIALS AND METHODS

The simulations for this paper were run in the Jupyter Notebooks environment using Python (version 3.7.2) andimporting the following libraries: NumPy (version 1.15.4), SciPy (version 1.1.0), Matplotlib (version 3.0.2), StatsModels(version 0.11.0) and Pandas (version 0.23.4). Reanalysis of data from Hoffmann and Sober (2014) was performedusing MATLAB (version 2017a). The codes for both simulation and the data analysis are available on Github here:https://github.com/soberlab/Hierarchical-Bootstrap-Paper. | Traditional vs Summarized vs Bootstrap

Throughout this paper, we compare 4 statistical methods that we refer to by shorthand as “Traditional”, “Summarized”,“Bootstrap” and "LMM" respectively. Here, when we refer to the “Bootstrap” method, we mean a hierarchical non-parametric bootstrap procedure. We will detail what the ﬁrst three of those terms mean here (see Fig. 1 for schematicsof each), and we will describe LMMs in detail in a subsequent section. For the sake of clarity, let us consider a ﬁctitiousexample. Suppose our dataset involves recording the neural activity of neurons in the amygdala when subjects wereexposed to an aversive auditory cue either in the presence or absence of a drug of interest believed to reduce anxiety.For each subject, neural activity was recorded from one hundred neurons during exposure to the auditory cue and theexperiment was repeated with several hundreds of subjects in both the presence and absence of the drug (see Fig. 1a).Note that while a more typical example would involve only a small number of subjects, the differences between methodsare easier to comprehend in an example with a large number of subjects (we demonstrate differences in a small numberof subjects in Fig. 4). We could add a layer of complexity by considering that the experiment was repeated for severaltrials for each neuron recording but for the sake of simplicity, let us assume that all the data were collected from a singletrial per neuron. In the “Traditional” method, every data point (e.g., the ﬁring rate of every neuron to every instance ofthe auditory cue) is treated as independent, regardless of the hierarchical structure present in the dataset (see Fig. 1b).All the data points are used to calculate the mean and the uncertainty in the estimate of the mean, namely the standarderror of the mean (SEM) and a Student’s t-test is used to ascertain statistically signiﬁcant differences between the meanﬁring rate of the neurons in the presence versus absence of the drug of interest. This is fairly common since, unlikethe previous ﬁctitious example, experiments are often repeated only in a small sample of subjects. The “Summarized”method, on the other hand, acknowledges the possibility that different neurons within the same subject may be moresimilar to each other than neurons across subjects. As a result, the mean ﬁring rate for each subject is calculated ﬁrstand the mean of the group is calculated as the mean of the population of mean ﬁring rates for each subject in the groupand the SEM is computed from this population of means (see Fig. 1c). Note that the mean for each group in this case isequal to that in the “Traditional” case if and only if the number of neurons recorded for each subject within a group isrepresented equally. A Student’s t-test is thus applied to the population of mean ﬁring rates between the two groups.An additional complication that we circumvent in our toy example by not considering trials per neuron is the decision asto which level one must summarize the data. In the case of multiple trials per neuron, one may summarize either at thelevel of individual neurons or individual subjects. While summarizing at the level of subjects is the most appropriate way S ARAVANAN ET AL . An overview of Hierarchical Datasetsa)

Drug Control Level 1:Experimental ConditionLevel 2:SubjectsLevel 3:Neurons b) “Traditional” StatisticsDrugMean + SEM ControlMean + SEM

Compare withTwo samplet-test c) “Summarized” Statistics

Calculate Mean for each Subject

DrugMean + SEM ControlMean + SEM

Compare withTwo samplet-test

The “Hierarchical Bootstrap” Methodd)

Drug

Repeat N bootstrap timesfor each conditionResample each level with replacement

Control

Mean across all neurons Mean across all neuronsPopulation of N bootstrap means Population of N bootstrap means

Compare withStatisticaltests

F I G U R E 1 a) An example of a hierarchical dataset. Here the dataset is divided into 3 levels, with the ﬁrst levelcontaining the experimental groups to be compared, the second containing the individual subjects and the thirdcontaining the individual neurons per subject. Each subject is color coded and the neurons per subject are distinguishedby the position of the colored diamond. b) In “Traditional” statistics, the means for each group is computed across all theneurons and are then compared using a two sample t-test. c) In “Summarized” statistics, the mean for each subject iscomputed ﬁrst. These means are then used to compute an overall mean for each group and the groups are comparedusing a two-sample t-test. d) In the “Hierarchical Bootstrap” method, we create new datasets N bootstrap times byresampling with replacement ﬁrst at the level of subjects followed by neurons within a subject. We then compute themean across all neurons every time we perform resampling. The ﬁnal statistic is computed on this population ofresampled means (see Methods for details).

ARAVANAN ET AL . 5 to avoid non-independence between data points, it can seriously reduce sample size and therefore power especially ifthe number of subjects is small. In the “Bootstrap” method, we perform the hierarchical bootstrap on the two highestlevel groups (drug versus control group in Fig 1d) to compute posterior distributions of the range of means possiblefrom each group, as follows. First, we sample with replacement (i.e., we sample from the current distribution in such away that replications of previously drawn samples are allowed) from the subjects in the group. Then, for the subjectsselected, we then sample with replacement from the individual neurons for the number of neurons that were recorded.We then compute the mean ﬁring rate across the group for that resampled population and repeat the entire processN bootstrap times (N bootstrap =10 for all instances in this paper unless otherwise noted). The mean for each group in thiscase is computed the same way as is done in the “Traditional” method. The 67% conﬁdence interval (or equivalently, thestandard deviation) of the population of N bootstrap means gives an accurate estimate of the uncertainty in the mean.Note that the mean is a special case where this uncertainty is more commonly referred to as the Standard Error of theMean or SEM. We can then compute the probability of one group being different from the other using the population ofresampled means obtained above for each group (see Hypothesis testing with Bootstrap below for complete details). | Hypothesis testing using Bootstrap

We described above how bootstrap samples can be used to compute the uncertainty in measuring the mean of apopulation. However, the bootstrap can be used more broadly to measure the uncertainty in any metric of interestas long as it obeys the law of large numbers and scales linearly in the probability space. In addition, the bootstrapis used to compute posterior distributions of the range of values possible for the metric of interest from the data oflimited sample size. As a result, the distribution of bootstrap samples can be used to compute probabilities that the datasupports particular hypotheses of interest directly as opposed to p-values (see below for a detailed description). Wewill describe below how this can be done with an example. Note that while this is not the only way of hypothesis testingusing bootstrapping, we found this to be a particularly simple and effective way of doing so.As will be performed several times in this paper, suppose we wish to evaluate the support for the hypothesis thatthe mean of particular sample was signiﬁcantly different from a ﬁxed constant - zero for our example. In order to do so,we would compute the proportion of means in our population of bootstrapped means of the sample that were greaterthan or equal to zero. If we set the acceptable false positive (Type-I error) rate to α ( α = 0.05 throughout this paper),then if the computed proportion was greater than 1 – α /2 (or p>0.975 for α = 0.05), we would conclude that the sampleof interest had a mean signiﬁcantly greater than zero. Alternatively, if the computed proportion was less than α /2 (orp<0.025 for α = 0.05) then we would conclude that the sample of interest had a mean signiﬁcantly less than zero. Anyproportion α / ≤ p ≤ ( − α / ) would indicate a relative lack of support for the hypothesis that the mean of the sampleof interest is different from zero. In the case of comparing the population of bootstrapped means to several constants(multiple comparisons), we use the Bonferroni correction to adjust the threshold for signiﬁcance accordingly. We wouldalso like to make a distinction between the probabilities we referred to above and the p-values typically associated withstatistical tests. p-values refer to the probability of obtaining a result as extreme or more extreme than those obtainedunder the assumption that the null hypothesis is true. As such, they do not provide a direct measure of support forthe hypothesis one truly wishes to test. The ‘p’ referred to in the bootstrapping procedure above, however, directlyprovides a probability of the tested hypothesis being true. For the rest of this paper, in order to distinguish the directprobabilities obtained using the bootstrapping procedure from p-values reported from traditional statistical tests, wewill use ‘p boot ’ to refer to bootstrap probabilities and ‘p’ to refer to p-values from other tests.The procedure described above can also be used to compare the means of two different groups using their re-spective samples (although this variant of the analysis is not performed in this manuscript). In this case, we would S ARAVANAN ET AL . compute a joint probability distribution of the two samples with each sample forming the two axes of a 2-D plot. Thenull hypothesis would thus be a symmetric Gaussian with the line y = x as a diameter. Therefore, to test if the two groupsare different, one would compute the total density of the joint probability distribution on one side of the unity line. If thevolume computed is greater than 1 – α /2 then the ﬁrst group is signiﬁcantly greater than or equal to the second while ifthe volume computed is less than α /2, the second group is signiﬁcantly greater than or equal to the ﬁrst with all othervolumes indicating no signiﬁcant differences between the groups. We can also extend this formulation to comparisonsbetween multiple groups by performing pairwise-comparisons between the groups and adjusting the threshold forsigniﬁcance accordingly (using Bonferroni corrections [28], for example). | Choosing the number of times to resample using Bootstrap

As we mentioned in the Introduction, the bootstrap is relatively agnostic to the structure of the data. However, thisstatement is not to say that there are no assumptions in its use. As mentioned previously, the bootstrap cannot beused on a metric that does not obey the law of large numbers (i.e., the central limit theorem) or is not linear in theprobability space (e.g., depends on the product or logarithm of probabilities like entropy for example). Additionally, howone chooses to resample their data deﬁnes assumptions that must match the true data distribution in order to provideaccurate results. A common procedure is to resample from a population equal to the number of samples drawn in theoriginal dataset. While this may seem obvious at higher levels of hierarchical datasets, it is not so clear at the lowerlevels. To take the example of songbird vocal data, one bird may simply have sung far fewer times than the rest of thebirds in the experimental group. Should that bird’s data always be less represented in all resampled datasets? What ifthe imbalance is in the number of repetitions of individual syllables within a bird? In each case, the way to resamplemay and should change based on the hypothesis being tested. The crucial rule to follow when choosing resamplingparameters is to come as close as possible to replicating what one might expect if they had physically repeated thatexperiment several hundreds of times instead of resampling using bootstrapping. For our own work, we have previouslygiven equal weight to each syllable independent of its frequency of occurrence in the actual dataset [29] while in theexperimental results analyzed here, each syllable is weighted by its frequency of occurrence due to the differences inthe nature of experiments and questions being analyzed. | Linear Mixed Models (LMMs)

Linear Mixed Models have been suggested and successfully used as a powerful technique to account for various datasetsthat have a non-independent structure including hierarchical datasets and time series data [30, 31]. We have thereforeincluded results from the use of LMMs in our simulations and reanalysis as we believe it will provide useful context asto when one may choose to use one technique over the other. LMMs are extensions of linear regressions in that theyconsist of a “ﬁxed effect” component and a “random effect” component. The “ﬁxed effect” component is the conventionallinear regression part and typically is what one uses to quantify statistical differences between groups. The “randomeffects” on the other hand can vary across units (neurons, subjects, etc.) in the dataset allowing one to build in thehierarchical structure of the data into their model. In our simulations, we used the treatment condition as the ﬁxedeffect and subject identity as the random effect. The proportion of signiﬁcant results was obtained by extracting thet-statistic and corresponding p-value for the ﬁxed effect coefﬁcient. The SEM estimate was drawn from the standarderror in the intercept for the model. We describe the LMMs used in our experimental work in greater detail in thecorresponding section under Results.

ARAVANAN ET AL . 7 | Design Effect (DEFF)

In scientiﬁc studies, the effect size is an important metric that helps deﬁne the effectiveness of the treatment andthe number of subjects required to adequately power the study. It is therefore important to quantify changes in theeffect size due to the presence of a hierarchical structure in the dataset. When one analyzes data from hierarchicaldatasets, the unique information provided by each additional data point at the lowest level of the hierarchy dependson the average number of samples in the cluster (neurons within a subject form a cluster in our toy example) and therelative variance within and between clusters. This relationship was mathematically quantiﬁed using the Intra-clustercorrelation (ICC). ICC is a useful metric that provides a quantitative measure of how similar data points are to each otherwithin an individual cluster in a hierarchical dataset [32, 33]. While there are some differences in how it is calculated, ingeneral it is deﬁned as the following ratio:

I C C or ρ = s between s between + s within (1)Here s between represents the variance across the means of individual clusters and s within represents the variance withinclusters. Thus, the ICC is a metric that varies from zero to one, where a measure of zero represents no clustering ofdata and every data point being independent and a measure of one represents a perfect reproduction of samples withinclusters (i.e., all points within a cluster are exactly the same). Kish further formalized the relationship between ICC andthe adjusted effect size that was termed the “Design Effect” or DEFF with a corresponding correction to be applied tothe standard error of the mean computed from the dataset termed DEFT, deﬁned as the square root of DEFF [33, 34].Formally, DEFF is deﬁned as: DE F F = σ ( dat a ) σ ( dat a if i ndependent ) = 1 + ρ ∗ ( ¯ n j − ) (2)Where ¯n j represents the average sample size within each cluster and ρ is the ICC. Hence, as the number of sampleswithin a cluster increases, the DEFF increases, resulting in a need for a larger correction (increase) to the standarderrors. Conversely, as the number of samples within clusters increase, the standard error of the mean is underestimatedpotentially resulting in underestimation of the p-values and inﬂation of the Type-I error rate. | RESULTS

Our results are organized into two sub-sections: Simulations and Examples. In the Simulations sub-section, we showresults from simulations that illustrate the utility of the hierarchical bootstrap and in the Examples sub-section, wehighlight the differences in results when analyzing two example datasets with and without the hierarchical bootstrapand also compare to LMMs. Throughout the results section, we will compare statistical tests we refer to by shorthandas “Traditional”, “Summarized”, “Bootstrap” and "LMMs" respectively (see

Traditional vs Summarized vs Bootstrap and

Linear Mixed Models (LMMs) in Materials and Methods and Figure 1 for a detailed description). Also note that wheneverwe refer to the “Bootstrap” in this paper, we mean a hierarchical non-parametric bootstrap unless otherwise speciﬁed. S ARAVANAN ET AL . | Simulations

We used simulations of neuronal ﬁring in order to highlight the key characteristics of the hierarchical bootstrap asapplied to nested data in neuroscience and the differences between the bootstrap and other, more commonly usedstatistical tests. Since it has been a concern raised previously [24, 25], we were interested in whether the bootstrapdisplayed a conservative bias for independent (i.e., all data points are as dissimilar to each other as to every other datapoint) and non-independent (i.e., some data points are more similar to each other than others) datasets, as well as inquantifying the bootstrap’s statistical power compared to other techniques. While the results of these simulations maybe predicted by other mathematical results previously published [35, 19], we found them instructive to depict explicitly. | The hierarchical bootstrap does not have a conservative bias in a hierarchical dataset

If the bootstrap had a strong conservative bias regardless of the nature of the data (hierarchical or independent), itmay not be the right metric with which to address the problem of statistical analysis in hierarchical datasets. For ourﬁrst simulation, we wished to evaluate whether the bootstrap displayed a conservative bias when processing a typicalhierarchical dataset one might encounter in neuroscience research. Speciﬁcally, we simulated a condition in which werecorded the activity of 100 neurons each from 1000 subjects. The subjects would then be split randomly into twogroups of 500 each and the mean ﬁring rate across groups would be compared. Each neuron had a mean ﬁring rate of5Hz, simulated using a Poisson process. However, in order to introduce hierarchical structure into the dataset, eachsubject’s neuronal ﬁring rate was offset by a constant drawn from Gaussian noise of width 3Hz. This constant was thesame for all 100 neurons simulated for each subject but varied between subjects. Since subjects were split randomlyinto two groups, there should be no statistical difference between the groups. Therefore, any signiﬁcant differences wesee would be false positives and we would expect to see them at a rate of α (here set to 0.05). This has been depictedgraphically in Figure 2a where the two groups have identical mean ﬁring rates but the individual subjects have an offsetthat is constant across all their neurons.We also varied the number of neurons per subject to study its effect on the false positive rate. We simulated theexperiment 1000 times for each value of number of neurons and computed the false positive rate from each. We usedbootstrapping on the obtained results to estimate error bars and to test for signiﬁcant differences away from 0.05, theexpected false positive rate. Given the relationship between the number of points within a cluster to the Design Effect(DEFF; see Design Effect in Materials and Methods), we would expect the false positive rate to increase with the numberof neurons per subject in the traditional analysis but not using the other methods [30, 36, 1].As shown in Figure 2b, the false positive rate for the traditional method does increase with the number of neuronsper subject rising from around 46% for 10 trials to almost 96% in the case of 3000 trials per neuron (probability ofresampled proportions being greater than or equal to 0.05 was p boot > 0.9999 in all cases; limit due to resampling 10 times). On the other hand, the summarized, bootstrap and LMM methods stayed remarkably similar in value and werenot signiﬁcantly different from 0.05, the expected false positive rate, in all cases (adjusting for threshold of signiﬁcancewith Bonferroni corrections for 3 comparisons).We also computed the estimate for the SEM, i.e., the uncertainty in computing the mean ﬁring rate for each group,using all 4 cases for each number of neurons simulated, and the result is shown in Figure 2c. As shown, the SEM estimateremains fairly constant for the summarized, bootstrap and LMM methods but decreases with an increase in the numberof trials per neuron in the traditional case. Furthermore, the SEM estimate for the traditional case starts out much lowerthan the other cases suggesting that the increased false positive rate is at least partially due to the underestimation oferror in the traditional case. ARAVANAN ET AL . 9

An example of Non-Independent Neuronsa)

Group 1 Group 2 Level 1:Experimental ConditionLevel 2:SubjectsLevel 3:Neurons

Number of neurons per subject P r o p o r t i o n o f S i g n i fi c a n t R e s u l t s Proportion of false positives with number of neuronsb)

5% levelTraditionalSummarizedBootstrapped

10 30 100 300 1000 30000.00.20.40.60.81.0

Number of neurons per subject S E M q u a n t i fi e d Size of error bars with number of neuronsc)

10 30 100 300 1000 30000.000.020.040.060.080.100.120.14

LMM 5% levelTraditionalSummarizedBootstrappedLMM

F I G U R E 2

False positive rate and size of error bars quantiﬁed for the simulation in which there was no differencebetween the groups but the points were not independent. a) A graphical representation of the experimental condition.The ﬁring rate is not different between the two groups but each subject has a constant offset away from the mean forthe group that is consistent across all their neurons. b) The proportion of signiﬁcant results using each statisticalmethod as a function of the number of neurons per subject. As expected, the false positive rate for the traditionalmethod rises with increasing number of samples within each subject. The Summarized, Bootstrap and LMM methods onthe other hand have almost identical false positive rates not signiﬁcantly different from the theoretical 5% in all cases. c)The size of SEMs (uncertainty in computing the mean ﬁring rate for each group) computed for all 4 methods as afunction of the number of neurons per subject. While those for the summarized, bootstrap and LMM stay roughly thesame, those for the traditional method reduces with increasing number of trials. Note that for both traces, since theBootstrap, Summarized and LMM almost perfectly overlap, the red trace has been thickened and dashed forvisualization. | The hierarchical bootstrap is more conservative than Traditional and Summarizedmethods for independent data points but LMM is not

In the previous simulation, we reported that the hierarchical bootstrap does not have a conservative bias for hierarchicaldatasets. However, it has been claimed that the bootstrap has a conservative bias [24, 25], resulting in larger error bars

ARAVANAN ET AL . than strictly necessary for the chosen threshold of Type-I error α (here set to 0.05). It has also been argued that this isnot a bug or bias in the algorithm, but rather a more generic property of hypothesis testing by resampling [37, 20], andnewer algorithms have claimed to reduce bias further [38, 39]. Here we tested the conservative bias of the hierarchicalbootstrap in a similar situation as the ﬁrst simulation above but where all the data points were independent. Given thatwe set α to 0.05, we would expect a 5% false positive rate if there were no bias in the algorithm. An example of Independent Neuronsa)

Group 1 Group 2 Level 1:Experimental ConditionLevel 2:SubjectsLevel 3:Neurons b) Traditional Summarized Bootstrap0.000.020.04

False Positive rate over 10,000 simulations P r o p o r t i o n o f S i g n i fi c a n t R e s u l t s Traditional Summarized Bootstrap0.0000.0040.0080.012

SEM range over 10,000 simulationsc) S E M q u a n t i fi e d LMM LMM

F I G U R E 3

Results from the simulation in which all data points were independent. a) A graphical representation ofthe experimental condition. As shown by the shades of group, subject and neuron means all being the same, the pointsare all independent in spite of an apparent hierarchical structure. b) Proportion of signiﬁcant results when comparingthe 2 groups with each statistical method at α of 0.05. As expected, the Traditional, Summarized and LMM methods giveroughly 5% false positive results (black dashed line). However, the bootstrap gives a much smaller proportion ofsigniﬁcant results suggesting a conservative bias. c) The size of SEMs computed using each of the methods. Thebootstrap does give an error bar roughly 1.4 times that of the other metrics.As before we simulated a situation in which we recorded the activity of 100 neurons each over 1000 subjects. Inorder to make each neuron independent though, we removed the Gaussian noise term for each subject. The neuronswere thus simulated using only a Poisson random number generator with an average ﬁring rate of 5Hz (each trial wasthought to be 1 second of activity). This has been depicted graphically in Figure 3a where the group means, subject ARAVANAN ET AL . 11 means and trials all are set to the same ﬁring rate of 5Hz. As is evident from the ﬁgure, even though there is a hierarchicalstructure present, the points can be rearranged in any fashion without affecting the results. In other words, the pointsare all independent. Since the data points are independent, we would not expect differences between the Traditionaland Summarized methods. We then split these 1000 subjects into two groups of 500 each randomly and computed themean ﬁring rate for each group. We then tested whether the means were signiﬁcantly different from each other usingthe Traditional, Summarized, Bootstrap and LMM methods. We repeated this analysis 10000 times and plotted theproportion of simulations that resulted in signiﬁcant differences for each of the methods in Figure 3b. The error barswere computed by bootstrapping the results obtained from the simulation runs. As shown in the ﬁgure, the Traditional,Summarized and LMM methods resulted in a proportion of signiﬁcant results close to and not signiﬁcantly differentfrom 5% as expected (Traditional – 4.80 ± boot = 0.18; Summarized – 4.99 ± boot = 0.48; LMM – 4.80 ± boot = 0.18). By contrast, when using the bootstrap method, the proportion ofsigniﬁcant results was signiﬁcantly lower than the expected 5% at 0.66 ± boot < 10 -4 ; limit due to resampling 10 times). This was a markeddeparture from Figure 2b where we saw that the bootstrap had no signiﬁcant conservative bias for hierarchical datasets.We also computed the standard error of the mean (SEM) in each case and reported the results in Figure 3c. Asshown, the error bars for the Traditional, Summarized and LMM methods are almost identical at 1.002 ± -2 for Traditional, 0.994 ± -2 for Summarized and 1.001638 ± -2 for LMM respectively. The errorbars computed using the Bootstrap method are roughly 1.4 times larger at 1.407 ± -2 . Since the effect sizeis inversely proportional to the uncertainty in the dataset [40], which is captured here by the error bars, we concludethat the larger error bars do partially account for the drop in proportion of signiﬁcant results observed and that thebootstrap seems to have a conservative bias for independent datasets. | The Bootstrap and LMM balance intended Type-I error rate and statistical powerbetter than the Traditional or Summarized methods at low sample sizes

As we saw in the previous section, the summarized, bootstrap and LMM methods bind the Type-I error rate at theintended 5% and the estimate of the SEM is roughly the same for all three methods (Fig. 2). What then is the advantageof the bootstrap or LMM over simply using the summarized method? The answer lies in the fact that the summarizedmethods result in a loss of statistical power (the ability to detect a statistical difference when one truly exists), partic-ularly for low sample sizes of the upper hierarchical levels and for small effect sizes (a situation commonly found inneuroscience datasets). We used simulations to calculate the power for each of the four methods and the results areshown in Figure 4.Since power depends on the effect size and the number of samples, we chose to examine the change in power withrespect to the number of subjects per group (N) and the effect size for these simulations ( Δ mean). In order to do so, wevaried N between 1 and 16, keeping the number of neurons per subject constant at 100 each. As shown in Figure 4a,we kept the mean ﬁring rates of neurons of one group of subjects at 5Hz as before and varied the mean ﬁring rate ofthe other group by Δ mean, adding an additional 3Hz random Gaussian noise term to each subject in both groups. Asbefore, this term is constant for all neurons within a subject and varies between subjects. Since the previous simulationsdid not estimate the false positive rate when the number of subjects was as low, we ﬁrst kept the mean ﬁring rate forboth groups of subjects equal at 5Hz, simulating 1000 times for each value for the number of subjects per group. Theresult is shown in Figure 4b. As shown, the false positive rate for the traditional method stays around or above 80% ARAVANAN ET AL . An example of Groups with Mean difference ∆ a) Group 1 Group 2 Level 1:Experimental ConditionLevel 2:SubjectsLevel 3:Neurons

55 + ∆ Mean Firing Rates (Hz) 2 4 6 8 10 12 14 16Number of subjects per groups0.00.20.40.60.81.0 P r o p o r t i o n o f S i g n i fi c a n t R e s u l t s Power with number of neurons and effect size P r o p o r t i o n o f S i g n i fi c a n t R e s u l t s Proportion of false positives with number of subjects ∆ = 0

5% levelTraditionalSummarizedBootstrapLMM b) c)

Traditional Summarized Bootstrap ∆ (Mean) = 1.5Hz ∆ (Mean) = 4.5Hz ∆ (Mean) = 3Hz LMM F I G U R E 4

Change of power with number of neurons and effect size. a) A graphical representation of theexperimental conditions. As shown, there is now a difference in the mean between groups with individual subjects alsodisplaying variations about the mean. b) The false positive rate when there was no difference between the mean ﬁringrates for the two groups. c) The proportion of signiﬁcant results or power when the difference in mean ﬁring rates( Δ mean) between the two groups was 1.5Hz (short dashes), 3Hz (long dashes) and 4.5 Hz (solid line) respectively.(blue trace in Fig. 4b), while that for the summarized method hugs the expected 5% line (red trace in Fig. 4b), exceptfor the special case of one neuron per group, where you can never achieve signiﬁcance since you are comparing onlytwo points. The behavior of the bootstrap calculation highlights the fundamental characteristic of the bootstrap and istherefore worth exploring in detail (green trace in Fig. 4b). The essence of the bootstrap is to provide a reliable range foryour metric under the assumption that the limited dataset you have captures the essential dynamics of the underlyingpopulation. When there is only one subject, the bootstrap assumes that neural level data is the true distribution andtherefore has a false positive rate equal to that of the traditional method. As the number of subjects increases, onegets a better sampling of the true underlying distribution, and correspondingly, the bootstrap tends towards a 5% errorrate with increasing number of subjects, as the weight of data points shifts from individual neurons to neurons acrosssubjects with increasing number of subjects. Therefore, if the data collected do not accurately represent the dynamics ARAVANAN ET AL . 13 of the underlying distribution, the bootstrap cannot provide accurate estimates of population metrics. The LMMs alsohave a false positive rate that is higher than the expected 5% rate for very small N which also reduces with increasing N(magenta trace, Fig. 4b). However, it is consistently less than that of the bootstrap. Note that for LMMs, the algorithmdoes not converge reliably for very small group numbers and so we could only plot results for N ≥

4. In spite of this, weveriﬁed that the false positive rate for LMMs at N=1 is 16 ± | Examples

We now present two real-world examples of the utility of the hierarchical bootstrap as applied to behavioral datacollected from experiments in songbirds [26] and ﬂies [27]. These examples provide concrete instances of why oneshould use the appropriate statistical tests depending on the nature of their data and how the popular tests can result inmore false positives or less statistical power than one desires. | The bootstrap highlights the risk of false positives when analyzing hierarchical be-havioral data (vocal generalization in songbirds) using traditional statistical methods

As described above, although the bootstrap provides a better compromise between statistical power and false-positiverate than the Traditional or Summarized methods, its use is not widespread in the neuroscience literature, including insome of our own prior published work. To illustrate the practical importance of these issues, and to encourage otherauthors to critically re-evaluate their prior analyses, we here present a case in which we have used the bootstrap andthe LMM to reexamine one of our prior results – which used both Traditional and Summarized methods – and foundthe choices made can signiﬁcantly affect the outcome of our analyses. As a reminder, when discussing Traditional,Summarized or LMM statistical tests, we will report a p-value denoted by ‘p’ which yields a signiﬁcant result if p < 0.05.When talking about the Bootstrap tests however, we will report a p boot which in turn yields a signiﬁcant result if p boot <0.025 or p boot > 0.975. In addition, p boot provides a direct probability of the hypothesis being true, which is not true ofp-values [41].

ARAVANAN ET AL . a) Bird 1 Bird 3 Level 1:Experimental ConditionLevel 2:SubjectsLevel 3:Syllables b) “Traditional” Statistics c) “Summarized” Statistics

Calculate Mean for each Syllable

GroupBird 2 Level 4:SyllableIterations

Structure of Songbird DataTreat each syllable iteration as independentCompare Mean + SEM to zero Treat each syllable as independentCompare Mean + SEM to zero

F I G U R E 5

Overview of the structure of songbird data and statistical methods used to analyze it. a) A graphicalrepresentation of the hierarchical structure of songbird data. As shown, there are typically multiple experimentalgroups consisting of several birds each. Each bird in turn sings several syllables and each syllable may be sung severalhundreds of times (called iterations). b) In “Traditional” statistics, each syllable iteration is treated as independent andtherefore all syllable iterations across syllables and birds are pooled in order to get the population statistics which isthen used to perform statistical tests. c) In “Summarized” statistics, the mean for each syllable is calculated from thesyllable iterations and the resulting means are pooled across birds in the experimental group. This population of meansis then used to perform statistical tests. As detailed in the results, neither of these approaches is valid since even in the“Summarized” case, we are not ensuring independence between data points since syllables within birds are still moresimilar than syllables between birds.For our ﬁrst example, in songbirds, each bird sings a variety of syllables and each syllable is repeated a differentnumber of times. A graphical representation of this data structure and the “Traditional” and “Summarized” methods wedeployed to analyze this data is shown in Figure 5. As depicted in Figure 5c, we decided to summarize at the level ofsyllables even though one must summarize at the level of birds in order to ensure independence between data points. In

ARAVANAN ET AL . 15 many of our studies, we examine changes in the pitch of these syllables in response to manipulations. In a prior study,we examined the "generalization" (deﬁned below) of vocal learning in songbirds in response to an induced auditory pitchshift on a particular (target) syllable [26]. In these studies, the auditory feedback of one syllable was shifted in pitch andrelayed to the bird through custom-built headphones with very short ( 10 ms) latency, effectively replacing the bird’snatural auditory feedback with the manipulated version [42, 43, 26]. Note that while the headphones provided auditoryfeedback throughout the song, only the feedback for a single targeted syllable was shifted in pitch. We reported thatin addition to birds changing the pitch of the target syllable in response to the pitch shift, the birds also "generalized"by changing the pitch of other syllables that had not been shifted in pitch. Speciﬁcally, we reported that syllables ofthe same-type (acoustic structure) as the target syllable changed pitch in the same direction as the target syllable(“generalization”) while syllables of a different-type than the target syllable changed pitch in the direction opposite tothat of the target syllable (“anti-adaptive generalization”; see Fig. 6a). Since in Hoffmann and Sober (2014) we employedtraditional and summarized (at a syllable level) statistics when analyzing generalization, we decided to reanalyze thedata from that study to ask if the generalization observed was still statistically signiﬁcant when statistical tests werecomputed using the hierarchical bootstrapping procedure. In order to do so, we ﬁrst recapitulated the results reportedby computing statistics on the last 3 days of the shift period using the traditional and summarized methods as wasreported earlier [26]. We focus our reporting on changes in the target syllable and anti-adaptive generalization indifferent-type syllables for the purpose of this example.When we computed the change in pitch over the last 3 days of the shift period for the syllable targeted with auditorypitch shifts, we found that the birds compensated for the pitch shift of 1 semitone by 0.341 ± ± SEM inall cases) semitones with traditional statistics (one sampled t-test comparing to zero; t = 47.3; p < 2*10 -308 ; limit dueto smallest number in MATLAB; red trace in Fig. 6a) and by 0.34 ± ± -125 ;blue trace in Fig. 6a). With summarized statistics, the different-type syllables changed by -0.12 ± ± boot = 0.995; red trace in Fig. 6c). As a reminder, p boot gives the probabilityof the hypothesis tested being true. Therefore, a value of 0.5 indicates minimal support for either the hypothesis (or itsopposite) while values close to 1 (or 0) represent strong support for (or for the opposite of) the hypothesis. Different-type syllables however shifted to a ﬁnal value of -0.09 ± boot = 0.25; blue trace in Fig. 6c). Hence, this result shows that the anti-adaptive generalizationwas too small an effect to detect with the sample size in the original study, suggesting that the generalization effectsobserved were driven largely by a small number of individual birds rather than a population wide effect. This conclusionis consistent with an independent study in songbirds that also did not ﬁnd evidence of anti-adaptive generalization indifferent type syllables [44].We further reanalyzed the data using LMMs. In order to do so, since we were not interested in the effect of the typeof syllable but whether each type of syllable was signiﬁcantly different from zero, we built three separate LMMs (one ARAVANAN ET AL . P i t c h s h i f t ( i n s e m i t o n e s ) a) Traditional Statistics00.30.6-0.3 4 8 12 P i t c h s h i f t ( i n s e m i t o n e s ) c) Bootstrap over LevelsDays of Shift0 Targeted SyllableDifferent type Syllable P i t c h s h i f t ( i n s e m i t o n e s ) b) Summarized over Syllables0 Targeted SyllableDifferent type SyllableTargeted SyllableDifferent type Syllable * p = 0.053 *** N.S.

F I G U R E 6

Reanalysis of generalization in the headphones learning paradigm. a) The results when quantiﬁed usingtraditional statistics. As shown, both target and different type syllables differ signiﬁcantly from zero over the last 3 daysof the shift with target syllable moving adaptively and the different type syllables showing anti-adaptive generalizationrespectively. b) The results when quantiﬁed using summarized statistics when summarized over the syllables. In thiscase, the target syllable is signiﬁcantly different from zero while the different type syllables are just over the thresholdfor signiﬁcance. c) The results when bootstrap is applied over the hierarchical levels. The target syllable is signiﬁcantlydifferent from zero but the different type syllables are not.

ARAVANAN ET AL . 17 for each type of syllable). The model had no ﬁxed effects and the random effects controlled for bird identity and syllableidentity within birds. We assessed the intercept and whether it was signiﬁcantly different from zero. As expected, ourresults indicated that the target syllable shifted robustly by 0.33 ± -5 ). Curiouslyhowever, the LMMs suggested that different type syllables also shifted robustly by -0.12 ± boot = 0.85) or LMMs (t = 1.31; p = 0.19). These results again indicate that we did not perform the generalizationexperiment on a sufﬁcient number of birds to adequately power the study. However it is worth noting that while theresults did not meet the threshold for statistical signiﬁcance, reporting probabilities in support of the hypotheses (p boot )provides more information than simply determining whether or not a statistical threshold was met. In this case, ap boot of 0.85 means that if we measured data from more birds drawn from the same distribution, we will see adaptivegeneralization in 85% of cases which is much higher than chance (50%) and is still useful information. Furthermore,we will note that an independent study [44] did ﬁnd evidence of adaptive generalization for same type syllables insongbirds. | The hierarchical bootstrap captures the signal better than traditional or summa-rized methods in optogenetic control of behavior in ﬂies

We wanted to test the utility of the hierarchical bootstrap in an example where the recorded variables are multi-dimensional, and so we chose to analyze the data from an experiment studying the role of descending neurons in thecontrol of behavior in ﬂies [27]. Studies examining optogenetic effects on behavior are another area where hierarchicaldatasets are the norm. In this example, since each ﬂy can be tracked over extended periods of time, it will exhibit eachbehavior multiple times within the period of observation. Additionally, multiple ﬂies can be tracked simultaneouslyand the behavior is typically averaged across ﬂies across trials for each experimental group. In this study, we usedoptogenetics to activate descending neurons in ﬂies and studied the corresponding changes in behavior displayed. Inorder to do so, we ﬁrst created a two-dimensional representation of the behavior of the ﬂies, as described in detailpreviously [45, 27]. This representation is a probability density function, with each individual peak corresponding toa distinct stereotyped behavior (e.g., running at a given speed, left wing grooming, antennal grooming, etc.). We thenmapped the behavior of both experimental animals and control animals (where there was not a functional light-activatedchannel present) in the presence and absence of light stimulation, onto the behavioral representation. The resultingbehavioral map for one class of descending neurons (neurons G7 and G8, line SS02635, from Figure 2 and Videos 6 and7 in the original study) is shown in Figure 7b. Through both statistical analysis and visual observation, we found that thisstimulating this neuron robustly demonstrated a head grooming phenotype (the upper-most section of the map).In the original study, to assess whether the light stimulation caused a statistically signiﬁcant change in the frequencyof behavior observed, we argued that for a region to be signiﬁcantly increased by the activation, (i) the frequencyof behavior had to be signiﬁcantly greater during optical stimulation than during periods of no stimulation withinexperimental animals and (ii) the frequency of behavior during optical stimulation had to be greater in experimentalanimals than in control animals. We used Wilcoxon rank-sum tests coupled with Sidak corrections [46] for multiplecomparisons across the behavioral space in order to test for statistically signiﬁcant differences for both (i) and (ii). Fulldetails of this approach are enumerated in the original study [27], but the number of comparisons is conservativelyestimated to be 2 (H( ρ )) , where H( ρ ) is the entropy of the two-dimensional probability distribution. This approach wouldfall under the category of traditional methods that we have described previously. We compared the regions obtained ARAVANAN ET AL . Control - Light OFF Level 1:Experimental ConditionLevel 2:SubjectsLevel 3:Trials a)b)

TraditionalBootstrappedMixed ModelSummarized R e l a t i v e D en s i t y ( a . u .) F I G U R E 7 a) A graphical representation of the hierarchical structure present in each experimental group in thestudy. As shown, each experimental group consists of several subjects (ﬂies) and each ﬂy is observed performing severalhundreds to thousands of trials. b) The frequency of behavior mapped onto a two-dimensional space when a particulargroup of descending neurons in the experimental ﬂies were manipulated using optogenetic stimulation. In thisparticular case, the descending neurons targeted controlled head grooming behavior, represented at the top of the map.Thus, the animals display elevated frequencies of head grooming during stimulation when compared to control ﬂies andwhen the light was turned off. The magenta trace shows the statistically signiﬁcant differences after accounting formultiple comparisons when using the traditional method. As shown, the magenta trace overestimates the signal presentand captures some false regions as well as pointed to in the lower right region. The black trace represents the areas ofsigniﬁcant difference as deﬁned by using the hierarchical bootstrap and it matches the region we expected in ouroriginal analysis very well. The summarized method did not return any regions that were statistically different betweengroups even though there was a clear signal present in the data (see videos and other data in [27]). The LMM returnedan area (cyan trace) that was more conservative than any other method and likely included many false negatives, thoughthe false positive rate was small.

ARAVANAN ET AL . 19 from the original analysis with regions we obtained when using summarized or bootstrap statistics on this datasetand the result is shown in Figure 7b. As shown, the traditional method seems to overestimate the region of signiﬁcantdifferences and includes a false positive area that is separate from the main region where signal is present. We areconﬁdent that this area is a false positive because all trials were individually visually inspected, with each one displayingthe head grooming phenotype when stimulated.The summarized method, on the other hand, does not identify any regions as being statistically signiﬁcant, despitevisual evidence suggesting the clear presence of a signal [27]. This lack of an identiﬁed region likely results from thefact that the number of individual ﬂies used was not high (six experimental ﬂies and six control ﬂies), but with 30 trialsfor each animal. Thus, the summarized method did not sufﬁciently account for the robust trial-to-trial response ofeach animal. As is the case for any hierarchical data set, the effective number of samples is not six, but nor is it 180 (sixanimals multiplied by 30 trials) because of the correlations between subsequent trials in the same animal. Taking thesecorrelations into account, however, the hierarchical bootstrap returns a concise region that is more tightly ﬁt aroundthe region that is associated with head grooming in non-stimulated animals (Video 2 in the original study). Thus, even inthis case of abstracted, multi-dimensional data analysis, we see that the hierarchical bootstrap is an effective methodthat accounts for correlation structure within a dataset.We also performed a LMM to test in a similar manner to the birdsong example described previously. Speciﬁcally, weassumed no ﬁxed effects, controlled random effects for ﬂy identity, and we tested for the signiﬁcance of the intercept(using the same multiple hypothsis testing threshold as described above). We performed this analysis in a one-sidedmanner for both the control and the experimental animals, identifying regions as “signiﬁcant” if the ﬁtted intercept forthe experimental animals was positive with a sufﬁciently small p-value and if the control animals did not experiencea signiﬁcant positive shift. As seen in the cyan line of Figure 7b, we see that the LMM was more conservative thanthe Bootstrap method in this instance, picking a smaller, somewhat discontinuous, portion of the map. Based onobservations from Video 2 of the original study, the LMM likely is resulting in many false negatives, however the falsepositive rate is low. | DISCUSSION

The hierarchical bootstrap is a powerful statistical tool that was developed to quantify the relationship between schoolclass sizes and achievement [19] and has since been used to quantify effect sizes in a wide variety of ﬁelds. The use ofthe hierarchical bootstrap in neuroscience however is still limited in spite of the need for it being increasingly clear.Through our simulations, we have shown the utility of the hierarchical bootstrap by examining the shortcomings ofmore common statistical techniques in a typical example one might encounter. While the results of our simulations maybe inferred from other mathematical work on the subject [35, 19, 22, 47], our goal here is to illustrate the application ofthese methods to the hierarchical datasets typically found in neuroscience. We ﬁrst illustrated that the bootstrap doesnot have a conservative bias for hierarchical datasets in which the assumption of independence between data pointsis violated (Fig. 2). We then showed that the bootstrap performs better than summarized statistical measures by notsacriﬁcing as much statistical power especially at low sample sizes and small effect sizes (Fig. 4). Finally, we showed realworld applications of applying the hierarchical bootstrap to two datasets from songbirds and ﬂies to demonstrate theadvantages of the hierarchical bootstrap over other more commonly used statistical analysis techniques in neuroscience(Figs. 6 and 7).A critical assumption of several commonly used statistical tests is that every data point within a dataset is inde-pendent. In other words, every data point would have to be equally different from every other data point for any two

ARAVANAN ET AL . randomly selected data points. “Pseudoreplication” refers to studies in which individual data points may be more similarto some data points than to others within the dataset, and yet the statistical tests performed treated the data points asindependent replicates [5]. While pseudoreplication was ﬁrst extensively reported on in ecological studies [5, 48], ithas since been identiﬁed as a common problem in other ﬁelds, including neuroscience [6]. While resampling methodsincluding bootstrapping were originally suggested as tools by which one could overcome the pseudoreplication prob-lem [49], the bootstrap was argued to have a conservative bias resulting in larger error bars than necessary [24, 25].Since then, however, several versions of the bootstrap algorithm have been developed to apply to hierarchical data andhave been found to be unbiased and more robust for calculation of uncertainty in clustered data than other statisticalmethods [35, 19, 22, 47, 21, 14]. In order to test the bootstrap for any potential bias in a typical example we mightencounter in neuroscience, we performed simulations to quantify differences in mean ﬁring rates between two groups ofneurons when there was no difference between the groups. We illustrated that the bootstrap produced a false positiverate signiﬁcantly below the expected 5% (Fig. 3b) and had larger error bars (Fig. 3c) than other statistical methods whenthe data were independent. However, when the independence between data points was abolished by introducing ahierarchical structure, the bootstrap was not statistically different from the expected 5% false positive rate (green barsin Fig. 2b) and the error bars computed were similar to those computed using summarized statistics (red and green barsin Fig. 2c) demonstrating that the hierarchical bootstrap is robust to bias for applications in neuroscience.Furthermore, we showed that the bootstrap does not lose power to the degree that using summarized statisticsdoes (see green traces versus red traces in Fig. 4c) while also keeping the false positive rate within the intended bound(see green trace in Fig. 4b). Additionally, the hierarchical bootstrap as applied in this paper does not make assumptionsabout the relationships underlying the latent variables that deﬁne the hierarchical structure. However, as we saw inFigure 4b, where we sampled multiple trials from a very small number of neurons, the bootstrap assumes that the datacollected captures essential characteristics of the population distribution. In this case, the bootstrap initially consideredthe trial level data as independent before switching to neuron level data as the number of neurons increased. Hence, onemay have to adjust the resampling procedure to ensure that the distribution of resampled data points most accuratelymatches the population distribution one wishes to study.Among the reasons Linear Mixed Models (LMMs) gained in popularity for statistical testing was the fact thatthey could accommodate hierarchical datasets by controlling for various levels as “random” effects while still using allavailable data, thereby minimizing the loss of statistical power [36, 50, 30, 1, 31]. As shown in Figures 2 and 3, LMMs donot exhibit a conservative bias for independent or hierarchical datasets. In the particular example we chose for oursimulations, LMMs seem to perform better than the bootstrap in terms of eliminating false positives but underperformslightly in retaining statistical power (see magenta versus green traces in Fig. 4b and c). Given that LMMs are consideredby many to be the gold-standard for hierarchical datasets [51, 1, 52], why are we advocating an alternative approach?Given the results from the simulations, we see the bootstrap as a method that performs with a similar rate offalse-postitives and false-negatives as the LMM, but requires fewer implementation choices by the user. Unlike thebootstrap, LMMs require the assumptions of the structure in the dataset to be speciﬁed in the terms used to modelthe “random effects”. If this procedure is performed incorrectly, the results obtained and inferences drawn may beinaccurate [53, 54, 15]. Furthermore, there are some types of hierarchical structures, such as when the relationshipbetween levels is non-linear, that LMMs may not be able to capture the dynamics of the data accurately [55]. Giventhese complications and the steep learning curve associated with using LMMs accurately, it has been argued that aperfectly implemented ANOVA may be more useful than a hastily implemented LMM [15]. A ﬁnal danger associatedwith learning to use LMMs correctly that it could encourage p-hacking since one could potentially obtain the answerthey desire by changing the structure of the random effects.To showcase its utility in analyzing hierarchical datasets in neuroscience, we then used the hierarchical bootstrap ARAVANAN ET AL . 21 on two independent examples. First, we reanalyzed data from Hoffmann and Sober, 2014 in which we used bothTraditional and Summarized statistical analysis to conclude that songbirds generalize changes in pitch targeted on asingle syllable anti-adaptively to syllables of a different type. When reanalyzed with the bootstrap however, we foundthat the anti-adaptive generalization of different type syllables (blue trace in Fig. 6c) did not meet the threshold forstatistical signiﬁcance. This was a striking result, as the original study did report statistically signiﬁcant changes fromzero even while using summarized statistics [26]. A probable reason for the differences between the summarized andbootstrap methods for this dataset stems from a decision point regarding the level to which one must summarize thedata when using summarized statistics (we avoided this decision in the simulations by assuming all data came from asingle subject). The summary statistics reported were summarized at the level of syllables for this dataset. However,in order to truly make sure all points are independent, one must summarize at the highest level, i.e., at the level ofindividual birds in the dataset. The differences in results between the summarized and bootstrap methods here suggestthat the generalization effects were driven largely by strong effects in a subset of birds as opposed to a population-wideeffect and that, by failing to take the hierarchical nature of the dataset into account, we overestimated our statisticalpower and chose too low an N. Further evidence for this interpretation, as opposed to one where incorrect analysismethods call all former results into question, comes from reanalysis of data from a separate study looking at learningbirds display in response to pitch shift of their entire song through custom-built headphones [42] using the hierarchicalbootstrap. Since the changes in pitch were far more systemic across birds in this experiment, we did not see any changesin statistically signiﬁcant results [29].This example also served to highlight the complexities associated with implementing and interpreting results froman LMM accurately. When we reanalyzed the songbird data using LMMs we were surprised to see a signiﬁcant effect ofanti-adaptive generalization in different type syllables but not for adaptive generalization in same type syllables. Whileit is possible that this result may be true in our particular dataset it is highly unlikely as the results are the opposite ofwhat other studies have reported in songbirds [44] and humans [56, 57]. This suggests that in spite of our best effortsto accurately specify LMMs to ﬁt this data, we may have failed in capturing all the essential structure required in therandom effects.In addition to re-examining published data from vocal behavior in songbirds, we applied the hierarchical bootstrapto data from an independent experiment studying the role of descending neurons in controlling behavior in ﬂiesusing optogenetics [27]. As shown in Figure 7b, the hierarchical bootstrap performs better than the traditional andsummarized statistical methods and LMMs in isolating the true signal in the experiment. The traditional method includesareas that are likely false positives and the summarized method does not return any statistically signiﬁcant areas. TheLMM was interestingly much more conservative than the bootstrap in this case and probably had a much larger falsenegative rate though the false positive rate was low.We would also like to reiterate another advantage of the direct probabilities returned by the bootstrap (p boot )over p-values typically reported through the traditional, summarized, and LMM approaches. p-values represent theprobability of obtaining results at least as extreme as the ones obtained under the assumption that the null hypothesisis true. It is a cumbersome deﬁnition that has led to numerous misconceptions regarding what p-values actuallysignify [41, 58]. The value returned by the bootstrap however, p boot , provides a direct probability in support of aparticular hypothesis. As we reported in the songbirds example, we found that the probability of same-type syllablesgeneralizing was 0.85. This means that if we measured data from more birds drawn from the same distribution, we willsee adaptive generalization in 85% of cases which is much higher than chance (50%). Hence, the hierarchical bootstrapmethod can provide a measure of the relative support for the hypothesis which is both easier to understand and can beuseful information for both positive and negative results in research.To conclude, neuroscience research is at an exciting junction. On the one hand, new technologies are being built

ARAVANAN ET AL . promising bigger and more complex datasets to help understand brain function [59, 60, 61]. On the other, we haverising concerns over the incorrect use of statistical tests [62, 63, 64] and the lack of reproducibility of a number of pastﬁndings [65, 66, 67]. We propose the hierarchical bootstrap as a powerful but easy-to-implement method that can bescaled to large and complicated datasets, that returns a direct probability in support of a tested hypothesis reducingthe potential for misinterpretation of p-values and that can be checked for correct implementation through sharing ofanalysis code. As we have shown through this paper, we believe that widespread use of the bootstrap will reduce therate of false positive results and improve the use of appropriate statistical tests across many types of neuroscience data. A C K N O W L E D G E M E N T S

The work for this project was funded by NIH NINDS F31 NS100406, NIH NINDS R01 NS084844, NIH NIBIB R01EB022872, NIH NIMH R01 MH115831-01, NSF 1456912, Research Corporation for Science Advancement no. 25999and The Simons Foundation. C O N FL I C T O F I N T E R E S T

The authors declare no competing ﬁnancial interests.

R E F E R E N C E S [1] Aarts E, Verhage M, Veenvliet JV, Dolan CV, Van Der Sluis S. A solution to dependency: using multilevel analysis toaccommodate nested data. Nature neuroscience 2014;17(4):491.[2] Hahs-Vaughn DL. A primer for using and understanding weights with national datasets. The Journal of ExperimentalEducation 2005;73(3):221–248.[3] Arceneaux K, Nickerson DW. Modeling certainty with clustered data: A comparison of methods. Political analysis2009;17(2):177–190.[4] Musca SC, Kamiejski R, Nugier A, Méot A, Er-Raﬁy A, Brauer M. Data with hierarchical structure: impact of intraclasscorrelation and sample size on Type-I error. Frontiers in Psychology 2011;2:74.[5] Hurlbert SH. Pseudoreplication and the design of ecological ﬁeld experiments. Ecological monographs 1984;54(2):187–211.[6] Lazic SE. The problem of pseudoreplication in neuroscientiﬁc studies: is it affecting your analysis? BMC neuroscience2010;11(1):5.[7] Aarts E, Dolan CV, Verhage M, van der Sluis S. Multilevel analysis quantiﬁes variation in the experimental effect whileoptimizing power and preventing false positives. BMC neuroscience 2015;16(1):94.[8] Arlet M, Jubin R, Masataka N, Lemasson A. Grooming-at-a-distance by exchanging calls in non-human primates. Biologyletters 2015;11(10):20150711.[9] Liang Z, Watson GD, Alloway KD, Lee G, Neuberger T, Zhang N. Mapping the functional network of medial prefrontalcortex by combining optogenetics and fMRI in awake rats. Neuroimage 2015;117:114–123.[10] Machado AS, Darmohray DM, Fayad J, Marques HG, Carey MR. A quantitative framework for whole-body coordinationreveals speciﬁc deﬁcits in freely walking ataxic mice. Elife 2015;4:e07892.

ARAVANAN ET AL . 23 [11] Pleil KE, Helms CM, Sobus JR, Daunais JB, Grant KA, Kash TL. Effects of chronic alcohol consumption on neuronal func-tion in the non-human primate BNST. Addiction biology 2016;21(6):1151–1167.[12] Maas CJ, Hox JJ. Sufﬁcient sample sizes for multilevel modeling. Methodology 2005;1(3):86–92.[13] Gehlbach H, Brinkworth ME, King AM, Hsu LM, McIntyre J, Rogers T. Creating birds of similar feathers: Leveragingsimilarity to improve teacher–student relationships and academic achievement. Journal of Educational Psychology2016;108(3):342.[14] Huang FL. Using cluster bootstrapping to analyze nested data with a few clusters. Educational and psychological mea-surement 2018;78(2):297–318.[15] Seedorff M, Oleson J, McMurray B. Maybe maximal: Good enough mixed models optimize power while controlling TypeI error. PsyArXiv 2019;.[16] Efron B. The jackknife, the bootstrap, and other resampling plans, vol. 38. Siam; 1982.[17] Efron B. Bootstrap methods: another look at the jackknife. In: Breakthroughs in statistics Springer; 1992.p. 569–593.[18] Efron B, Tibshirani RJ. An introduction to the bootstrap. CRC press; 1994.[19] Carpenter JR, Goldstein H, Rasbash J. A novel bootstrap procedure for assessing the relationship between class size andachievement. Journal of the Royal Statistical Society: Series C (Applied Statistics) 2003;52(4):431–443.[20] Efron B, Halloran E, Holmes S. Bootstrap conﬁdence levels for phylogenetic trees. Proceedings of the National Academyof Sciences 1996;93(23):13429–13429.[21] Harden JJ. A bootstrap method for conducting statistical inference with clustered data. State Politics & Policy Quarterly2011;11(2):223–246.[22] Field CA, Welsh AH. Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B (Statistical Method-ology) 2007;69(3):369–390.[23] Thai HT, Mentré F, Holford NH, Veyrat-Follet C, Comets E. A comparison of bootstrap approaches for estimating uncer-tainty of parameters in linear mixed-effects models. Pharmaceutical statistics 2013;12(3):129–140.[24] Hillis DM, Bull JJ. An empirical test of bootstrapping as a method for assessing conﬁdence in phylogenetic analysis.Systematic biology 1993;42(2):182–192.[25] Adams DC, Gurevitch J, Rosenberg MS. Resampling tests for meta-analysis of ecological data. Ecology 1997;78(4):1277–1283.[26] Hoffmann LA, Sober SJ. Vocal generalization depends on gesture identity and sequence. Journal of Neuroscience2014;34(16):5564–5574.[27] Cande J, Namiki S, Qiu J, Korff W, Card GM, Shaevitz JW, et al. Optogenetic dissection of descending behavioral controlin Drosophila. Elife 2018;7:e34275.[28] Weisstein EW. Bonferroni correction. Wolfram Research, Inc 2004;.[29] Saravanan V, Hoffmann LA, Jacob AL, Berman GJ, Sober SJ. Dopamine depletion affects vocal acoustics and disruptssensorimotor adaptation in songbirds. eNeuro 2019;6(3).[30] Snijders TA. Multilevel analysis. Springer; 2011.[31] Hox JJ, Moerbeek M, Van de Schoot R. Multilevel analysis: Techniques and applications. Routledge; 2017.

ARAVANAN ET AL . [32] Walsh JE, et al. Concerning the effect of intraclass correlation on certain signiﬁcance tests. The Annals of MathematicalStatistics 1947;18(1):88–96.[33] Kish L. Survey sampling. No. 04; HN29, K5.; 1965.[34] McCoach DB, Adelson JL. Dealing with dependence (Part I): Understanding the effects of clustered data. Gifted ChildQuarterly 2010;54(2):152–155.[35] Davison AC, Hinkley DV. Bootstrap methods and their application, vol. 1. Cambridge university press; 1997.[36] Snijders TA, Bosker RJ. Standard errors and sample sizes for two-level research. Journal of educational statistics1993;18(3):237–259.[37] Felsenstein J, Kishino H. Is there something wrong with the bootstrap on phylogenies? A reply to Hillis and Bull. System-atic Biology 1993;42(2):193–200.[38] Shimodaira H. An approximately unbiased test of phylogenetic tree selection. Systematic biology 2002;51(3):492–508.[39] Shimodaira H, et al. Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling. TheAnnals of Statistics 2004;32(6):2616–2641.[40] Coe R. It’s the effect size, stupid: What effect size is and why it is important. Education-line 2002;.[41] Halsey LG, Curran-Everett D, Vowler SL, Drummond GB. The ﬁckle P value generates irreproducible results. Naturemethods 2015;12(3):179.[42] Sober SJ, Brainard MS. Adult birdsong is actively maintained by error correction. Nature neuroscience 2009;12(7):927.[43] Hoffmann LA, Kelly CW, Nicholson DA, Sober SJ. A lightweight, headphones-based system for manipulating auditoryfeedback in songbirds. JoVE (Journal of Visualized Experiments) 2012;(69):e50027.[44] Tian LY, Brainard MS. Discrete circuits support generalized versus context-speciﬁc vocal learning in the songbird. Neuron2017;96(5):1168–1177.[45] Berman GJ, Choi DM, Bialek W, Shaevitz JW. Mapping the stereotyped behaviour of freely moving fruit ﬂies. Journal ofThe Royal Society Interface 2014;11(99):20140672.[46] Šidák Z. Rectangular conﬁdence regions for the means of multivariate normal distributions. Journal of the AmericanStatistical Association 1967;62(318):626–633.[47] Goldstein H. Bootstrapping in multilevel models. Handbook of advanced multilevel analysis 2011;p. 163–171.[48] Heffner RA, Butler MJ, Reilly CK. Pseudoreplication revisited. Ecology 1996;77(8):2558–2562.[49] Crowley PH. Resampling methods for computation-intensive data analysis in ecology and evolution. Annual Review ofEcology and Systematics 1992;23(1):405–447.[50] Diez R. A glossary for multilevel analysis. Journal of epidemiology and community health 2002;56(8):588.[51] Baayen RH, Davidson DJ, Bates DM. Mixed-effects modeling with crossed random effects for subjects and items. Journalof memory and language 2008;59(4):390–412.[52] Koerner TK, Zhang Y. Application of linear mixed-effects models in human neuroscience research: a comparison withpearson correlation in two auditory electrophysiology studies. Brain sciences 2017;7(3):26.[53] Barr DJ, Levy R, Scheepers C, Tily HJ. Random effects structure for conﬁrmatory hypothesis testing: Keep it maximal.Journal of memory and language 2013;68(3):255–278. ARAVANAN ET AL . 25. 25