Power contours: optimising sample size and precision in experimental psychology and human neuroscience
Daniel H. Baker, Greta Vilidaite, Freya A. Lygo, Anika K. Smith, Tessa R. Flack, André D. Gouws & Timothy J. Andrews
1. Department of Psychology, University of York, Heslington, York, YO10 5DD, UK
2. York Biomedical Research Institute, University of York, Heslington, York, YO10 5DD, UK
3. School of Psychology, University of Southampton, University Road, Southampton, SO17 1BJ, UK
4. School of Psychology, University of Lincoln, Brayford Pool, Lincoln, LN6 7TS, UK
5. York Neuroimaging Centre, The Biocentre, York Science Park, Heslington, York, YO10 5NY, UK
6. Corresponding author, email: [email protected]
When designing experimental studies with human participants, experimenters must decide how many trials each participant will complete, as well as how many participants to test. Most discussion of statistical power (the ability of a study design to detect an effect) has focussed on sample size, and assumed sufficient trials. Here we explore the influence of both factors on statistical power, represented as a two-dimensional plot on which iso-power contours can be visualised. We demonstrate the conditions under which the number of trials is particularly important, i.e. when the within-participant variance is large relative to the between-participants variance. We then derive power contour plots using existing data sets for eight experimental paradigms and methodologies (including reaction times, sensory thresholds, fMRI, MEG, and EEG), and provide example code to calculate estimates of the within- and between-participant variance for each method. In all cases, the within-participant variance was larger than the between-participants variance, meaning that the number of trials has a meaningful influence on statistical power in commonly used paradigms. An online tool is provided (https://shiny.york.ac.uk/powercontours/) for generating power contours, from which the optimal combination of trials and participants can be calculated when designing future studies.

Keywords: statistical power, sample size, neuroscience, psychophysics, fMRI, MEG, EEG
Introduction
Statistical power is the ability of a study design with a given sample size to detect an effect of a particular magnitude. In recent years, the problems with low statistical power have been increasingly highlighted (Bishop, 2019). Low powered studies are less able to detect a true effect (and so make more Type II errors) compared with high powered studies. Nominally significant findings from low powered studies are less likely to reflect true effects (Button et al., 2013), and because of publication bias (whereby significant findings are more likely to be published than non-significant ones) published low powered studies will also have a high Type I error (false positive) rate. Furthermore, any real effects that are detected are likely to have inflated effect sizes (Colquhoun, 2014; Ioannidis, 2008). These problems are common across many scientific disciplines, and estimates of power across studies in the neurosciences (Button et al., 2013) yield power values in the range 8%-30%, far below the desired level of ≥80%. Low power also contributes to published effects that fail to replicate and may well be spurious (Ioannidis, 2005; Open Science Collaboration, 2015). Most discussion of increasing statistical power has focussed on recruiting larger sample sizes, because for a given effect size, power is a function of sample size (see Figure 1d). However there is a second degree of freedom available to many experimenters at the study design stage: the number of repetitions (or trials) of a given experimental condition by each participant.

When the dependent variable of interest can be estimated with high precision, repeated measurements provide little benefit, and the main source of variance is between participants. This is illustrated by the distribution in Figure 1a, where participants (points) differ according to a normal distribution (curve), but the variance of each individual point is negligible. A more realistic situation for many experimental paradigms is shown in Figure 1b, where the variance of each individual estimate is large, as indicated by the horizontal standard error bars. This has the knock-on effect of increasing the overall standard deviation of the sample (σ_s = 3 in this example). One remedy is to improve the accuracy of each participant's estimated mean by increasing the number of measurements. This is demonstrated in Figure 1c, where each participant's mean is estimated from k = 200 trials (compared with k = 20 in Figure 1b), and the standard deviation of the sample (curve) reduces substantially (to σ_s = 2.1). This matters for effect size measures such as Cohen's d (Cohen, 1988), which depends on the sample mean (or difference in means), and also the sample standard deviation (formally d = M/σ_s). Under parametric assumptions, the number of trials per participant (k) influences the sample standard deviation (Figure 1e), according to the equation:

σ_s = √(σ_b² + σ_w²/k),    (1)

where σ_b and σ_w are the between- and (average) within-participant standard deviations, and k is the number of trials per participant (see also Brandmaier et al., 2018). The sample standard deviation (σ_s) determines the effect size, and subsequently the power (Figure 1f). In domains where the dependent variable is subject to high within-participant variance (as is potentially the case in psychology and neuroscience studies), increasing the precision of the per-participant estimate can therefore greatly increase overall power, perhaps reducing the number of participants required for a study (see Cleary & Linn, 1969; Phillips & Jiang, 2016). Although most active researchers are intuitively aware of this fact (it is common knowledge that running lots of trials delivers 'better' data), and the problem has received mathematical treatment (Kanyongo, Brook, Kyei-Blankson, & Gocmen, 2007; Phillips & Jiang, 2016; Rouder & Haaf, 2018; Westfall, Kenny, & Judd, 2014; Williams & Zimmerman, 1989), there is no widely used procedure for quantitatively determining the appropriate number of trials to run. Instead, studies are typically designed using rules of thumb, prior precedent and guesswork.

In this paper, we advocate a useful representation, the power contour plot: a two-dimensional representation of power as the joint function of sample size (N) and number of trials (k). We provide an online tool for generating power contours in order to estimate the impact of measurement precision (the number of trials conducted) on statistical power. We then use a subsampling method to explore the joint effects of sample size and number of trials on real data sets using common methodologies and paradigms in psychology and neuroscience research. These measures include reaction times, psychophysical thresholds, event-related potentials, steady-state evoked potentials, and fMRI BOLD signals. We make computer code available to demonstrate how power contours were produced, and how estimates of the within- and between-participant variance were calculated for each example.
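To make equation 1 concrete, the following minimal sketch computes a power surface for a one-sample design using the pwr package in R (the same package used for the analyses below). The parameter values (M = 1, σ_b = 2, σ_w = 10) are the illustrative ones from Figure 1h; this is a simplified stand-in for, not a copy of, the script behind the web interface.

```r
# Power as a joint function of sample size (N) and trials (k), via equation 1.
library(pwr)

M    <- 1   # true mean (or mean difference)
sd_b <- 2   # between-participants SD (sigma_b)
sd_w <- 10  # within-participant SD (sigma_w)

Ns <- seq(4, 100, by = 2)  # candidate sample sizes
ks <- seq(2, 200, by = 2)  # candidate trials per participant

power_grid <- outer(Ns, ks, Vectorize(function(N, k) {
  sd_s <- sqrt(sd_b^2 + sd_w^2 / k)  # equation 1: sample SD shrinks with k
  d    <- M / sd_s                   # Cohen's d for these parameters
  pwr.t.test(n = N, d = d, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")$power
}))

# Iso-power contours, including the 80% contour emphasised in the figures
contour(Ns, ks, power_grid, levels = c(0.5, 0.8, 0.95),
        xlab = "Sample size (N)", ylab = "Trials per participant (k)")
```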
Power contours

Consider first the situation described above, in which the dependent variable of interest can be estimated accurately from a single trial, but individuals all express different true values of the variable (formally, the within-participant variance is low, but the between-participant variance is high, σ_w << σ_b). Examples might include variables such as age and height, for which there is low measurement error and minimal variation from moment to moment, or for which tools exist (such as tape measures) to facilitate accurate measurement. In these situations, statistical power is a function of sample size and effect size (Figure 1d), where effect size is Cohen's d. Clearly, in such a situation, testing each participant multiple times should confer no advantage. We can represent the power as a function of both sample size and number of trials using a 2D plot such as the one shown in Figure 1g. Here the lines trace iso-power contours - combinations of sample size and number of trials that result in the same statistical power (this property is sometimes referred to as power equivalence, see von Oertzen, 2010). For this example the power contours are vertical, showing no benefit of repeated testing.

Next consider a more realistic scenario: a situation where the individual measurements are very noisy (high within-participant variance relative to the between-participant variance, σ_w > σ_b). The sample standard deviation decreases as a function of the number of trials (Figure 1e), as the estimated mean for each participant becomes more accurate with repeated measurements. Now power depends on both the number of trials and the sample size, and a series of curved iso-power contours are apparent (Figure 1h; see recent work by Westfall et al. (2014) and Xu, Adam, Fang, and Vogel (2018) for related plots in different scenarios).

These power contours offer a useful summary of the effect of possible experimental designs on statistical power. A given power (e.g. 80%, indicated by the thick blue curves on the power contour plots) can be obtained from multiple combinations of sample size and trial number. This is a useful insight, as study designs can then be optimised depending on other constraints. If relatively few participants are available (perhaps because of financial constraints, or testing of a clinical population) then the number of trials can be increased. Note, however, that beyond a particular number of trials (around k = 50 in Figure 1h), the function asymptotes and further trials are not beneficial. Alternatively, if each participant must be tested very rapidly (e.g. for studies involving children), but many participants are available, the number of trials could be kept relatively low (see Figure 1h), and the sample size increased instead. We have implemented power contour simulations for designs based on a one-sample (or paired) difference in means using an R script, which can be accessed through a web interface at: https://shiny.york.ac.uk/powercontours/

Figure 1. Simulations of standard deviation and statistical power. Panel (a) shows simulated data for 50 individuals, generated using a population mean of M = 0, a within-participants standard deviation of σ_w = 0, a between-participants standard deviation of σ_b = 2, and a sample standard deviation of σ_s = 2. Individual data points have a random vertical offset for display purposes. In panel (b) the within-participant standard deviation was increased to σ_w = 10, and each point is the mean of 20 trials, with horizontal error bars indicating ±1 standard error. Panel (c) shows the effect of increasing to 200 trials per participant. Panel (d) plots traditional power curves for different effect sizes (Cohen's d) as a function of sample size (N). The dashed horizontal line indicates a power of 80%, which is generally considered acceptable. Panel (e) shows how the sample standard deviation (σ_s) depends on the number of trials per participant (k) for a range of within-participant standard deviations (see legend), and a between-participant standard deviation of σ_b = 2. Panel (f) shows the statistical power resulting from the values in panel (e), for a sample size of N = 200 and an underlying mean of M = 0.5. Panels (g,h) show power contours for different combinations of σ_w and σ_b (σ_w = 0, σ_b = 2 in panel g; σ_w = 10, σ_b = 2 in panel h), as described in the text, and a group mean of M = 1. Simulations used normally distributed random numbers, and statistical power was calculated for a two-sided, one-sample t-test comparing to a mean of 0.

To have practical value in the design of experiments, it is necessary to establish empirically whether power does indeed vary with the number of trials in typically used experimental paradigms. To this end, we have reanalysed data from 8 studies, using a range of common methodologies from psychology and cognitive neuroscience, including reaction times, proportional choices, sensory thresholds, EEG, MEG and fMRI. We estimate power contours by subsampling the data, so we aimed to include data sets featuring large sample sizes, in which each participant completed many trials (though it was not always possible to satisfy both criteria). All of these analyses are based on one-sample or paired t-tests, but the same principle applies to more sophisticated statistical techniques (see the Discussion section), and can be implemented using the subsampling technique we describe below. All analysis scripts are available on the project repository at https://osf.io/ebhnk/ and data sets are provided either on the project page or referenced directly throughout the manuscript to allow others to reproduce our analyses, and apply the methods to their own studies. We anticipate that these resources will be most valuable as a guide for performing related subsampling analyses for specific study designs, and suggest that readers short on time might find it most useful to skip ahead to the section reporting data from whichever paradigm they are most familiar with.
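The subsampling procedure is the same for every paradigm that follows: draw a random subset of participants and trials, compute per-participant means, and estimate power analytically from the resulting effect size. A minimal sketch for a one-sample design is shown below; the matrix name trial_data (participants in rows, trials in columns) is illustrative rather than taken from our scripts.

```r
# Estimate power for a given (N, k) cell by repeated subsampling.
library(pwr)

subsample_power <- function(trial_data, N, k, n_iter = 1000) {
  mean(replicate(n_iter, {
    pp <- sample(nrow(trial_data), N)                 # resample participants
    tr <- sample(ncol(trial_data), k)                 # resample trials
    m  <- rowMeans(trial_data[pp, tr, drop = FALSE])  # per-participant means
    d  <- mean(m) / sd(m)                             # effect size for subsample
    pwr.t.test(n = N, d = d, sig.level = 0.05,
               type = "one.sample")$power             # analytic power estimate
  }))
}
```

Evaluating this function over a grid of N and k values, as in the earlier sketch, yields the subsampled power contours shown in the figures below.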
Reaction times

We first analysed reaction time measures from a Posner-style attentional cueing experiment previously reported by Pirrone, Wen, Li, Baker, and Milne (2018). Participants (N = 38) saw a central cue stimulus directing their attention to either the left or the right of fixation. A sine wave grating target was then presented either in the attended location (congruent condition) or the unattended location (incongruent condition). Each participant completed k = 600 congruent trials and k = 200 incongruent trials, with example RT distributions for one participant shown in Figure 2a. At the group level, reaction times were on average 51 ms slower in the incongruent condition (see Figure 2b), and the standard deviation of the differences (σ_s) was 42 ms. For the full data set, this yielded an effect size of d = 1.2. We also estimated the within-participants standard deviation (σ_w) by pooling the variances for the incongruent and congruent reaction times, and averaging across participants. Finally, to estimate σ_b we rearranged equation 1 to give:

σ_b = √(σ_s² − σ_w²/k),    (2)

from which we obtained the between-participants standard deviation.

Figure 2. Summary of reaction time data. Panel (a) shows reaction time distributions for an example participant, with vertical lines giving the means. Panel (b) shows the group level data for mean reaction times across the sample of 38 participants. Panel (c) shows a power contour plot, in which colour represents statistical power (see legend). The thick blue line indicates combinations of sample size and trial number with a power of 80%. The y-axis represents the number of trials in the incongruent condition (the congruent condition contained three times as many trials).

We calculated statistical power by resampling random subsets of trials and participants from the data, and calculating the effect size and power using the mean and standard deviation, for a paired t-test comparing to 0 (using the pwr.t.test function in the pwr package in R). Note that an alternative is simply to calculate a t-test with the resampled data, and calculate the proportion of tests that are significant, but the direct estimation of power is computationally more efficient so we use this where possible. The subsampling procedure was repeated 10,000 times, and the averaged power estimates are shown in Figure 2c. Just as predicted by our simulations (Figure 1h), the iso-power contour for 80% power (shown by the thick blue line) is curved (we confirmed the subsampling result by using the summary statistics calculated above in the power contour Shiny app). High power can be obtained with either a large sample size (N > 20) and small number of trials (k < 10), or a large number of trials (k > 50) and a small sample size. For example, 80% power can be achieved with a sample size of around N = 10, with each participant completing approximately k = 20 trials. Of course, this is for a relatively large effect size with a robust and well-established effect (attentional cueing). Other study designs with smaller sized effects will require larger sample sizes and/or more trials, but it is clear that the same basic pattern should apply for experiments of this type.
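Equation 2 can be applied directly whenever trial-level data are available. The sketch below demonstrates the calculation on synthetic data with a structure loosely resembling the reaction time experiment; all generating values are illustrative.

```r
# Variance decomposition via equations 1 and 2 on synthetic trial data.
set.seed(1)
k  <- 200
mu <- rnorm(38, mean = 50, sd = 40)   # true participant means (so sigma_b = 40)
trials_by_pp <- lapply(mu, function(m) rnorm(k, mean = m, sd = 100))

sigma_s <- sd(sapply(trials_by_pp, mean))    # SD of per-participant means
sigma_w <- mean(sapply(trials_by_pp, sd))    # average within-participant SD
sigma_b <- sqrt(sigma_s^2 - sigma_w^2 / k)   # equation 2

c(sigma_s = sigma_s, sigma_w = sigma_w, sigma_b = sigma_b)
```

Note that when σ_w is large relative to σ_s and k is small, sampling error can push the term under the square root below zero; we return to this issue in the Discussion.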
Proportional choices in the Iowa Gambling Task

We next reanalysed a data set comprising N = 504 participants in the Iowa Gambling Task, as reported by Steingroever et al. (2015), and made available through that publication. In this task, participants choose cards from four decks. Two decks have a greater overall payoff ('good' decks), and the other two have a poorer payoff ('bad' decks). Participants must learn these contingencies during the course of the experiment, and attempt to maximise their payoff. As such, performance changes throughout the experiment, and we discuss the consequences of this learning below, but begin with an analysis of the aggregated (i.e. unordered) trials. Figure 3a shows summary data for a population of participants who each completed k = 100 trials of the task. Averaged across all trials, the mean probability of selecting a card from a 'good' deck was 0.54, yielding an effect size of d = 0.24 when compared with the chance baseline of 0.5 (see Figure 3a). We calculated the standard deviation of individual choices, and averaged across participants to give σ_w = 0.47, implying (via equation 2) a correspondingly smaller between-subjects standard deviation.

We then calculated the effect size and power using the mean and standard deviation, for a one-sample t-test comparing to 0.5 (using the pwr.t.test function in the pwr package in R). This procedure was repeated 10,000 times, and the averaged power estimates are shown in Figure 3b. Consistent with the simulations in Figure 1h, power depends on both sample size and number of trials. With small numbers of trials, even large samples struggle to reach 80% power, whereas by increasing to k = 40 trials, the sample size can be reduced from N = 400 to N = 200 whilst maintaining power. Alternatively, for the largest sample sizes there is little benefit in increasing from k = 40 to k = 100 trials, as the function has reached asymptote.

Figure 3. Summary of proportion data from the Iowa Gambling task. Panel (a) shows a density plot of the mean probability of choosing a card from a 'good' deck for the population of N = 504 participants, each averaged across k = 100 trials. The vertical orange line shows the grand mean, and the dashed vertical line is the probability expected by chance. The black curve (with grey shading showing ±1 standard error) shows how the probability of choosing a 'good' deck changes across trials. Panels (b) and (c) show power contours for the unordered and ordered analyses, respectively.

In the Iowa Gambling Task, the trial contingencies are learned throughout the experiment. The black trace in Figure 3a illustrates that at the start of the experiment participants are more likely to choose cards from the 'bad' decks for around the first 20 trials. Their behaviour then changes as they learn the task contingencies, and for the final 40 trials they are more likely to choose cards from the 'good' decks. This information is lost by randomly sampling trials as we did to generate the power contour plot in Figure 3b. An alternative is to retain the trial order, and resample only across participants. Power contours are shown for this analysis in Figure 3c. Over the first 40 trials, power is high because the mean probability is significantly below 0.5 (see black curve in Figure 3a). As participants start to learn the task contingencies, the mean probability passes through 0.5, and power falls to near zero around 60 trials. Then, as participants begin to reliably choose the 'good' deck, the average probability becomes significantly above 0.5 and power increases again, reaching 80% by around 80 trials with the full sample of participants. This alternative visualisation of the data could be valuable when planning studies using this task, as it shows explicitly how performance (and hence overall power) changes over time.
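The ordered analysis fixes the trial dimension to the first t trials rather than sampling trials at random. A minimal sketch, assuming a participants x trials matrix choices of 0/1 good-deck selections; the name and the synthetic stand-in data are illustrative only.

```r
# Power as a function of the number of ordered trials, full sample retained.
library(pwr)
choices <- matrix(rbinom(504 * 100, 1, 0.54), nrow = 504)  # synthetic stand-in

power_by_trial <- sapply(seq_len(ncol(choices)), function(t) {
  p <- rowMeans(choices[, 1:t, drop = FALSE])   # mean choice up to trial t
  d <- (mean(p) - 0.5) / sd(p)                  # effect size against chance
  pwr.t.test(n = nrow(choices), d = abs(d), sig.level = 0.05,
             type = "one.sample")$power
})
```

Resampling participants within each call (as in the earlier sketch) then yields the full ordered power contour of Figure 3c.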
Sensory thresholds
Psychophysical detection thresholds are typically measured using large numbers of binary trials across stimuli of different intensities. The proportion of correct trials increases monotonically with stimulus intensity, producing a psychometric function (see Figure 4a). Threshold is then estimated at some criterion performance level (often 75% correct) by fitting a continuous ogival function such as a cumulative Gaussian or Weibull distribution. We reanalysed data from a binocular summation experiment (reported by Baker, Lygo, Meese, & Georgeson, 2018), in which contrast detection thresholds were measured in this way for sine wave grating stimuli shown either monocularly or binocularly using a stereo shutter goggle system. Example psychometric functions for a single participant are shown in Figure 4a (fitted using the quickpsy package in R, see Linares & López-Moliner, 2016), where it is clear that equivalent performance requires higher contrast for monocular presentation (blue) than for binocular presentation (yellow). At the group level (see Figure 4b), this produces a ratio of monocular to binocular thresholds between √2 and 2: the binocular summation effect. The mean effect was just over 6 dB.

Figure 4. Summary of threshold psychophysics data. Panel (a) shows psychometric functions for a single participant, with symbol size proportional to the number of trials at each target contrast level. Curves are fitted cumulative Gaussian functions, used to interpolate thresholds at 75% correct (dashed line). Data for the monocular condition (blue) were pooled across the left and right eye conditions before fitting. Panel (b) shows distributions of monocular (blue) and binocular (yellow) detection thresholds across a group of N = 38 participants with normal vision. Panel (c) shows the power contours derived by subsampling the data and refitting the psychometric functions.

We subsampled the data set to produce the power contour plot shown in Figure 4c. Because each participant completed slightly different numbers of trials (owing to the adaptive staircase procedure used to determine contrast levels for each trial), we subsampled at different percentages of trials for each participant, refitting the psychometric function each time. On average, each participant completed 225 trials for the binocular condition, and for the monocular conditions for each eye (left and right eyes were tested separately and their data combined). Summation estimates were rejected when they fell outside of a reasonable range (between factors of 0.12 and 32), as this indicated that something had gone wrong with the fitting procedure. As anticipated, power depended on both sample size and number of trials, and continued to improve over the ranges available in the data set (i.e. the function at 80% power was quite shallow, and did not asymptote over the ranges tested). Indeed, with all trials included, only around six participants were required to reach 80% power (consistent with previous estimates of power for this paradigm, see Baker et al., 2018). Conversely, when all 38 participants were included, only around 15% of trials were required (around 34 trials for each condition). Alternatively, 80% power could be maintained with a sample size of N = 12, with each participant completing around 30% of the total trials.

For this paradigm, estimating the within-participant standard deviation was not straightforward because thresholds were calculated by fitting a psychometric function. So, we generated power contour surfaces for a range of possible σ_w values, and compared these numerically to the surface derived by subsampling (Figure 4c). The best fitting value of σ_w then implies (via equation 2) a corresponding between-participant standard deviation σ_b, both expressed in dB.
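Refitting the psychometric function on every subsample is the step that distinguishes this analysis from the earlier ones. The sketch below uses a probit GLM as a simplified stand-in for the quickpsy fits used in the actual analysis (it ignores the guess and lapse rates that a full fit would model); contrast and correct are illustrative per-trial vectors for one participant and condition.

```r
# Fit a cumulative Gaussian to binary trials and interpolate the
# 75%-correct threshold.
fit_threshold <- function(contrast, correct) {
  fit <- glm(correct ~ contrast, family = binomial(link = "probit"))
  b   <- coef(fit)
  unname((qnorm(0.75) - b[1]) / b[2])   # stimulus level giving 75% correct
}
```

In the full analysis this is applied to each participant and condition on every iteration, and subsamples yielding implausible summation ratios are discarded before power is computed.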
EEG: event-related potentials

We next analysed event-related potentials (ERPs) from a contrast discrimination experiment reported by Vilidaite, Marsh, and Baker (2019), recorded using a 64-channel EEG cap. The stimuli were sine wave gratings with a contrast of 50%, presented sequentially in pairs for 100 ms each, with an interstimulus interval of 400-600 ms. These produced a typical response (see Figure 5a) over occipital electrodes (see inset to Figure 5a), with positive peaks at around 120 and 220 ms (marking stimulus onset and offset), and a later negative region with a trough around 600 ms. The first stimulus of each pair (yellow curve) produced a generally more positive response than the second stimulus (blue curve), in part as a consequence of differential overlap, though the precise cause of the differences is unimportant for this demonstration. Each trial was baselined by subtracting the mean voltage during the 200 ms before stimulus onset. The sample size for this experiment (N = 22) was modest (albeit typical for ERP research), but each participant completed a large number of trials (k = 600 stimulus pairs).

Figure 5. Summary of ERP results. Panel (a) shows grand mean ERPs in response to central presentation of a 50% contrast sine wave grating in two intervals of each trial. Shaded regions surrounding each trace show ±1 standard error across the N = 22 participants. The inset shows the scalp distribution of voltage shortly after stimulus onset, and black symbols mark the electrodes (Oz, O1, O2, POz, PO3 - PO8) over which ERPs were averaged. Panels (b-d) show average peak voltages across a group of N = 22 participants in each time window, for both intervals and their difference. Panels (e-g) show power contours for the peak voltage within each time window.

For each participant, we calculated the peak voltage and latency within three time windows, highlighted grey in Figure 5a. These were 100-150 ms, 200-300 ms and 500-700 ms, and corresponded to the P100, P200 and N600 components. The peak voltages and latencies were compared between the two intervals using a repeated measures approach. The distributions of peak voltages and voltage differences across participants are shown in Figure 5b-d for the three time windows, which produced effect sizes (Cohen's d) of 1.18, 1.11 and 1.32. We performed similar calculations for the latencies, however these were less convincing, with much smaller effect sizes, so we focus on the voltage differences.

We calculated power contours for each of the three peak voltage differences by subsampling trials and participants, and re-estimating the peak for each participant and condition on each of 10,000 iterations. These are shown in Figure 5e-g, and had the expected format in all cases. For the P100 component, power continued to increase across all sample sizes and trial numbers tested. For the N600 component, power was largely determined by sample size, and depended on the number of trials only when relatively few trials were included. Because the peaks were re-estimated on each iteration, we estimated the variance components by fitting power contour surfaces, which gave values ranging from 12 µV to 21 µV for σ_w, and from 1.1 µV to 5.3 µV for σ_b.
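Peak extraction is straightforward to vectorise. A sketch assuming erp is a participants x timepoints matrix (already averaged over the trials in one subsample) and times is in ms; both names, and the synthetic stand-in data, are illustrative.

```r
# Peak voltage and latency within the P100 analysis window (100-150 ms).
times <- seq(-200, 798, by = 2)
erp   <- matrix(rnorm(22 * length(times)), nrow = 22)  # synthetic stand-in

window   <- times >= 100 & times <= 150
peak_amp <- apply(erp[, window], 1, max)                       # per participant
peak_lat <- times[window][apply(erp[, window], 1, which.max)]  # latency of peak
```

The same code with min/which.min and a 500-700 ms window extracts the N600 trough.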
EEG: steady-state evoked potentials

An alternative EEG paradigm is the steady-state method, where a stimulus oscillates at a particular frequency, inducing entrained neural responses at that same frequency. In an experiment reported by Vilidaite et al. (2018), sine wave gratings of different contrasts were flickered at 7 Hz, and shown to a sample of N = 100 participants. Each participant completed 8 trials of 11 seconds per contrast level, from which the first 1 s of EEG data was discarded, and the remaining 10 s were divided into 10 epochs of 1 s each, yielding a total of k = 80 observations per condition. Each epoch was then Fourier transformed, and responses are evident both at the fundamental (flicker) frequency (7 Hz) and its second harmonic (14 Hz), as shown in Figure 6a. For these visual stimuli, the responses are strongest at the occipital pole, near early visual cortex (see inset to Figure 6a).

Responses at the fundamental frequency increase monotonically with maximum stimulus contrast (see Figure 6b) at electrode Oz. For a stimulus contrast of 8% (marked in Figure 6b), incoherent averaging produced an effect size of d = 0.2. However this can be substantially increased (to d = 0.68) by using coherent averaging, in which both the amplitude and phase information are averaged across trials for each individual participant (and the absolute amplitudes are then averaged across participants). The improvement occurs because responses to the stimuli are phase-locked, and therefore should have the same phase on each trial. Any noise at the stimulus frequency has random phase, and so cancels out over multiple repetitions. Example Fourier spectra for both coherent (blue) and incoherent (red) averaging methods are shown in Figure 6a, where it is clear that the coherent method greatly reduces the noise at off-target frequencies. Note in particular that the increase in noise in the alpha band (8-12 Hz) is clear with incoherent averaging (red) but absent with coherent averaging (blue). In the contrast response function (Figure 6b), coherent averaging (blue function) leads to lower amplitudes at low stimulus contrasts, whereas with incoherent averaging (red function) responses must overcome a much higher 'noise floor' before they can be detected. Distributions of voltages for an example participant and for the population are shown in Figure 6c,d.

Figure 6. Summary of SSVEP data. Panel (a) shows Fourier spectra for full 10 s long trials, using either coherent (blue) or incoherent (red) averaging, and the scalp distribution of activity at 7 Hz (inset). Panel (b) shows contrast response functions for both types of averaging. Panel (c) shows the distribution of amplitudes for an example participant, and panel (d) shows averages for the population. Panels (e) and (f) show power contours for coherent and incoherent averaging, respectively.

We calculated power contours via subsampling using both coherent (Figure 6e) and incoherent (Figure 6f) averaging, which further confirmed that coherent averaging results in substantially greater statistical power. The 80% power contour in the coherent condition (thick line in Figure 6e) is relatively shallow, showing that both increasing sample size and adding more trials will improve power over most of the range explored here. For example, halving the sample size from N = 100 to N = 50 requires an increase from approximately k = 20 to k = 40 trials per participant to maintain power at 80%. We confirmed these general findings at the higher stimulus contrasts (not shown). Because the coherent averaging precludes typical calculation of within-participant standard deviations, we again fitted the power contour surfaces for a range of σ_w to the power contours derived by subsampling, giving best fitting values of σ_w = 3.1 µV and σ_b = 0.19 µV (Table 1).
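Coherent averaging operates on the complex Fourier components rather than their magnitudes. A minimal sketch, assuming spec is a trials x frequencies matrix of complex Fourier coefficients for one participant (an illustrative name):

```r
# Coherent vs incoherent averaging of steady-state responses.
coherent   <- abs(colMeans(spec))   # average complex values, then take amplitude
incoherent <- colMeans(abs(spec))   # take amplitudes first, then average

# The phase-locked signal survives both operations; noise with random phase
# cancels in the coherent average, lowering the noise floor.
```

The coherent amplitudes at the 7 Hz bin then enter the same subsampling power calculation as before.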
fMRI: event-related design

A widely-used fMRI paradigm is the event-related design, in which stimuli are presented briefly with a jittered interstimulus interval (ISI). We obtained data from the Cam-CAN repository (available at http://…/datasets/camcan/) for an event-related fMRI experiment detailed by Shafto et al. (2014) and Taylor et al. (2017). In brief, N = 625 participants viewed bilateral checkerboard patterns, presented for 30 ms and repeated k = 124 times. Some stimuli were accompanied by an auditory beep, but this was disregarded for the purposes of our analyses.

We implemented a minimal preprocessing pipeline using FSL (Jenkinson, Beckmann, Behrens, Woolrich, & Smith, 2012). This involved co-registering the functional data to an individual participant's anatomical scan, and then to the standard MNI152 brain. We used the inverse of these transforms to project a probabilistic map of primary visual cortex (V1) obtained from Wang, Mruczek, Arcaro, and Kastner (2015) onto the functional data to use as a region of interest (see Figure 7a). The functional data were corrected for slice timing and participant motion, and high pass filtered at 0.01 Hz. Then the time-course was averaged across the V1 ROI and exported for further analysis. Whilst this anatomically-defined ROI will necessarily include some voxels that were not responsive to the stimulus, we would expect noise from these voxels to average out and not adversely affect the results (e.g. Boynton, Engel, Glover, & Heeger, 1996).

We then constructed general linear models (GLMs) for each data set using the individual trial timings. To simulate experiments with variable numbers of trials, each GLM split the data using random trial allocations into two arbitrary groups: a 'target' condition and a 'non-target' condition. A third condition modelled four auditory-only trials which lacked any visual stimulus. A canonical double gamma haemodynamic response function (Figure 7b) was convolved with each condition using the fmri.stimulus function (part of the fmri package in R, see Tabelow & Polzehl, 2011), and orthogonal second order polynomial drift terms were included in the overall model. We then fit the GLM to determine a regression (beta) weight for the target condition to use as our dependent variable. By varying the number of trials allocated to the target and non-target conditions, we were able to simulate experiments with different numbers of trials, whilst keeping the GLM design balanced (see Figure 7c). To provide a null condition, we repeated the analysis using randomly determined events within the experiment time-course (i.e. not using the true event timings). This generated the sample distributions of beta weights shown in Figure 7d.

We calculated effect sizes across participants for the difference between beta values for the true and null models with different numbers of trials (see Figure 7c), and used these to estimate statistical power. As previously, simulations were repeated 10,000 times with different random sampling of trials and participants to generate power contours (see Figure 7e). As with several previous data sets, power continued to increase across the full range of trial numbers, such that 80% power could be maintained across a wide range of sample sizes by trading participants off against trials. This flexibility allows event-related designs to achieve high statistical power even with relatively modest sample sizes, but it is critical that sufficient trials are included for each condition. It is also straightforward to design a severely underpowered study by including too few trials. As for the SSVEP data, the within-participant standard deviation was estimated by fitting power contour surfaces, giving a best fitting value of σ_w = 515 (in β units; see Table 1).
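The GLM step can be sketched in base R as a stand-in for the fmri-package functions used in the actual analysis. The ROI timecourse v1, the event vector onsets (in volumes) and the TR of 2 s are all assumptions for the sketch.

```r
# Convolve event timings with a double-gamma HRF and fit a GLM.
v1     <- rnorm(200)                 # synthetic ROI-mean timecourse (1 per TR)
onsets <- seq(10, 190, by = 12)      # synthetic event timings (volumes)

TR  <- 2
t   <- seq(0, 30, by = TR)
hrf <- dgamma(t, shape = 6) - dgamma(t, shape = 16) / 6  # double-gamma HRF

stick <- numeric(length(v1))
stick[onsets] <- 1                                       # events as a stick function
x <- convolve(stick, rev(hrf), type = "open")[seq_along(v1)]

drift <- poly(seq_along(v1), 2)             # polynomial drift terms
beta  <- coef(lm(v1 ~ x + drift))["x"]      # beta weight for the condition
```

Varying how many events enter onsets simulates experiments with different numbers of trials, as in Figure 7c.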
Figure 7. Summary of event-related fMRI analysis and results. Panel (a) shows the V1 region of interest on the medial surface of the standard (MNI152) brain, highlighted in blue. Panel (b) shows the canonical double gamma haemodynamic response function used in our general linear models. Panel (c) shows an example time-course from the V1 ROI for one participant (blue), and a general linear model constructed to predict this time-course (black) based on stimulus events (red). The green and purple traces show example GLM components with random subsets of trials. Panel (d) shows the population distributions of beta weights for the full GLM modelling all stimulus events (yellow) or randomly simulated times (blue). Panel (e) shows the power contour plot for these event-related fMRI data.

fMRI: blocked design

Another popular fMRI paradigm is the blocked design, in which stimuli are presented for periods of several seconds, interleaved with periods of no stimulation. Typically, events are scheduled to coincide with the acquisition of functional volumes (the repetition time, or TR). Blocked designs generally have greater power than event-related designs, because the stimulus timing is more closely aligned to the sluggish time constraints of haemodynamic activity, with the longer duration presentations (relative to event-related designs) allowing BOLD signals to sum over time (Boynton et al., 1996).
83 participants,all of whom viewed a series of images of faces, objects,places and scrambled images as part of a functional localiserdescribed by Flack et al. (2015). Stimuli were presented inblocks of 6 s , with a 9 s inter-block interval during whichthe display was blank. Within each block, 5 images wereshown sequentially for 1000 ms each, with a 200 ms inter-stimulus interval. fMRI data were acquired with a TR of 3 s , so a complete cycle (one block plus inter-block interval)lasted for 15 s , or 5 TRs. Each participant completed k = s matching that of the trial cycle. The BOLDresponse peaked 9 s after stimulus onset, as can be seen mostclearly in Figure 8b, which averages the response across all35 blocks for the example participant. The distributions ofBOLD responses at each time point (relative to the start ofa block) are shown in Figure 8c. Panels d-f of Figure 8show comparable data for the population of N =
83 par-ticipants, displaying a similar pattern. In order to generatepower contours for a range of e ff ect sizes, we compared ac-tivity between sequential pairs of sample points. E ff ect sizesincreased from d = .
26 comparing 3 s and 0 s , to d = . s and 3 s . The range of standard deviationsacross these comparisons for σ w was 0.47 - 0.52%, and for σ b was 0.23 - 0.40%. Power contours (see Figure 8g-j) ap- proximately asymptoted for trial numbers above k =
15. Thispattern is somewhat di ff erent from the event-related fMRI re-sults discussed previously (Figure 7), where adding more tri-als continued to increase power across the entire range. Forthe larger e ff ects (Figure 8h-j), power was high even withthe relatively small samples ( N <
20) typical of many neu-roimaging studies (Button et al., 2013). Of course lookingfor responses to visual stimuli in V1 is guaranteed to pro-duce large e ff ect sizes - most fMRI studies are designed totest subtler e ff ects which will inevitably be smaller than inthe examples here. MEG: evoked responses
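Given block-averaged responses, each sequential comparison reduces to a paired effect size. A sketch assuming bold is a participants x 5 matrix of percent signal change at 0, 3, 6, 9 and 12 s after block onset (illustrative name and synthetic values):

```r
# Effect size and power for one sequential timepoint comparison (6 s vs 3 s).
library(pwr)
bold <- matrix(rnorm(83 * 5), nrow = 83)   # synthetic stand-in

diffs <- bold[, 3] - bold[, 2]             # 6 s minus 3 s, per participant
d     <- abs(mean(diffs)) / sd(diffs)      # paired-design effect size
pwr.t.test(n = nrow(bold), d = d, sig.level = 0.05, type = "paired")$power
```

Subsampling blocks before averaging (as in the earlier sketches) extends this to a full power contour.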
MEG: evoked responses

The Cam-CAN data set also contains MEG responses (k = 120 trials) to the same visual stimuli as described in the section on event-related fMRI, recorded using a VectorView system (Elekta Neuromag, Helsinki). We filtered (0.01 - 30 Hz bandpass), baselined and epoched the data from each participant, and then conducted one-sample t-tests at a single sensor (see Figure 9a) comparing activity to zero. We selected three time points very soon after stimulus onset (50, 54 and 58 ms) to leverage the power of this large (N = 637) data set in detecting effects of a similar magnitude to those investigated in typical experiments, where small differences in responses to different stimuli or mental states might be compared.

Evoked responses showed an initial polarisation beginning around 50 ms, followed by a larger peak of opposite polarity at 130 ms (see Figure 9a). Effect sizes at the three time points increased from d = 0.17 at 50 ms to larger values at the later time points, when including all trials and participants. As for previous examples, the within-participant variance (Figure 9b) was clearly greater than the sample variance (Figure 9c): across the time window from 50 - 58 ms, σ_w (of the order of 8 pT/m and above) was substantially larger than σ_b. Subsampled power contours showed the familiar form (see Figure 9d-f), with power only reaching 80% for the 50 ms time-point when the full data set was used. At later time points, iso-power contours show constant power can be maintained, for example when reducing the sample size from N = 400 to N = 200 by increasing the number of trials from k = 20 to k = 60 (at 54 ms).
Discussion

We advocate a representation of statistical power as the joint function of sample size and number of trials: the power contour plot. Example power contours were generated by subsampling data sets from a number of widely used paradigms in experimental psychology and human neuroscience, covering a range of different sample sizes and trial numbers.

Figure 8. Summary of blocked design fMRI data. Panel (a) shows an fMRI timecourse for an example individual, averaged across the V1 ROI (see Figure 7a). Shaded grey regions at the foot of the panel indicate blocks when stimuli were presented. Panel (b) shows the data from panel (a) aligned to each block onset and averaged across all k = 35 blocks (with error bars showing ±1 standard error). Panels (c-f) show the corresponding distributions and population data for the N = 83 participants. Panels (g-j) show power contours for the sequential timepoint comparisons (3 s vs 0 s, 6 s vs 3 s, 9 s vs 6 s, and 9 s vs 12 s differences).

In some paradigms, power asymptoted beyond a certain number of trials. In other paradigms, particularly those where the dependent variable was derived by some form of model fit, power continued to improve with repeated testing, beyond the range that could be assessed with our data sets.

A practical guide to using the power contour approach for study design is as follows. If existing data are available on which to base an analysis, and where these data permit direct estimation of mean difference, σ_w and σ_b (using equation 2), these values can be calculated (or estimated using bootstrapping methods, see Luck, Stewart, Simmons, and Rhemtulla (2019)) and entered directly into the power contour web application. Where direct estimation of these values is not possible, power contours should be generated by subsampling, as we have done for the examples here (and as demonstrated in the code provided). If required, the effective values of σ_w and σ_b can then be estimated by fitting the subsampled power contour surface to simulated surfaces and finding the best fitting values. These methods will be of most use when planning replication studies, or when conducting a series of experiments using a single technique that build upon an initial finding in a well-powered sample.
Figure 9. Summary of MEG results. Panel (a) shows a butterfly plot of evoked responses from 204 planar gradiometers, averaged across all participants (N = 637). The inset shows the scalp distribution of activity at the peak of the black curve (around 130 ms), and the black dot indicates the location of the sensor used for the analysis. Coloured points highlighted on the black curve indicate time points used for power analysis. Panel (b) shows distributions of field strengths at each of the three target time points for an individual participant. Panel (c) shows the same but for the sample population of N = 637. Panels (d-f) show power contours at the three different time-points.

If no relevant data are available, power contours can still be informative if reasonable assumptions can be made about the likely effect size, and ratio of standard deviations. Just as it is common practise in power analysis to calculate power curves for a range of potential effect sizes, it might also prove instructive to compare power contour plots for a range of assumptions about the underlying effect size and variance measures. In all cases, the accuracy of the predictions will be limited by the extent to which the parameters generalise to the new experiment.

In Table 1, we summarise the relevant variables from each paradigm, including the mean effect, and within- and between-participant and sample standard deviations. For several paradigms, including sensory thresholds, SSVEPs, and event-related fMRI, estimates of within-participant standard deviations were not directly available because the process by which trials were combined did not generate one. In these cases (as described above), we simulated power contour surfaces for a range of candidate standard deviations. The estimated value is the within-participant standard deviation (σ_w) that produced the best fit. Although this has no direct relationship to the measured dependent variable, it can be thought of as the SD from an experimental design with identical power (a power equivalent model, see von Oertzen, 2010) but which uses traditional averaging across trials instead of more sophisticated analysis steps. We then calculated the between-participant standard deviation (σ_b) using equation 2. For the SSVEP and event-related fMRI data sets, equation 2 returned an imaginary number because the estimated within-participant SD was very large. Here we assumed that σ_b = σ_s for the purposes of completing Table 1. The analysis scripts used to perform these calculations are available on the project OSF repository (https://osf.io/ebhnk/), and we anticipate that readers might use these resources to perform similar analyses on their own data when planning future studies. However we advise caution in the extent to which variance estimates can be assumed to generalise across different experimental set-ups, laboratories, and participant groups.
Using the values estimated here to perform power analyses for studies using similar methods is likely to be highly inaccurate and we do not recommend it.

A further instructive analysis is to compare the within- and between-participant variances, as these provide insight into the likely gains that can be obtained by conducting more trials on each participant. A situation in which the within-participant variance is very small compared to the between-participants variance will result in a power contour like that shown in Figure 1g, where repeated testing confers no benefit. Figure 10b plots the variances expressed as Fano-factors (variance scaled by the mean) to permit comparison across paradigms with widely differing units. It is clear that for all paradigms considered here, the within-participant variance is substantially above the between-participant variance (all points appear above the diagonal). This property is not a given, and we anticipate that there may exist paradigms where within-participant variance is very low (owing to accurate measurement, or consistency of responses across multiple repetitions; see Nesselroade (1991) for a discussion in the context of developmental research). We note that where multiple estimates were calculated for a single method (such as ERPs at different time points), the Fano-factors appear to cluster together, suggesting a consistent ratio of variances for a given paradigm. However, establishing a generic Fano factor for a particular methodology would require further investigation across multiple studies, and also across different laboratories and equipment (e.g. scanner models, sensor types etc), and would not necessarily apply to individual experiments.

Figure 10. Summary of sample sizes, trial numbers and Fano-factors across experimental paradigms. Each rectangle in (a) covers the range of sample sizes and trial numbers for one of the studies analysed here, with colours defined in the legend in panel (b). Panel (b) plots Fano-factors (variance divided by the mean) derived from the within- and between-participants standard deviations given in Table 1. Note the log-scaled axes for both panels.

From equation 1, the sample standard error can be expressed as:

SE_s = √((σ_b² + σ_w²/k) / N) = √(σ_b²/N + σ_w²/(kN)).    (3)

These expressions make explicit the dependence of measurement precision (and hence power) on both N and k, regardless of effect size. In situations where σ_w > σ_b, running many trials will materially reduce the overall standard error. In situations where σ_w < σ_b, running many trials will confer less benefit, as the standard error is primarily determined by σ_b, and increasing N is more profitable.
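The two forms of equation 3 are algebraically identical, which is easy to confirm numerically (all values illustrative):

```r
# Numerical check of equation 3.
sigma_b <- 2; sigma_w <- 10; k <- 50; N <- 30
se_a <- sqrt((sigma_b^2 + sigma_w^2 / k) / N)      # equation 1, divided by sqrt(N)
se_b <- sqrt(sigma_b^2 / N + sigma_w^2 / (k * N))  # separated components
all.equal(se_a, se_b)                              # TRUE
```

The second form makes clear that the within-participant component of the standard error shrinks with the product kN, whereas the between-participants component shrinks only with N.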
Table 1
Summary of means, standard deviations and effect sizes for different paradigms. The SD ratio is defined as σ_w/σ_b. *Estimated: to estimate a within-participant SD, we ran simulations to optimise the value of this parameter using the full power contour surface, see text for details.

Paradigm                   | Mean effect | σ_w       | σ_b      | SD ratio | σ_s      | Effect size (d)
Reaction times             | 51 ms       |           |          |          | 42 ms    | 1.2
Proportional choices (IGT) | 0.04        | 0.47      |          |          |          | 0.24
Sensory thresholds (dB)    |             | *         |          |          |          |
ERP P100                   |             | 12.0 µV*  | 1.14 µV  | 10.5     | 1.25 µV  | 1.2
ERP P200                   | 1.93 µV     | 13.8 µV*  | 1.64 µV  | 8.4      | 1.74 µV  | 1.1
ERP N600                   | 7.84 µV     | 21.1 µV*  | 5.27 µV  | 4.0      | 5.34 µV  | 1.5
SSVEP 8% vs 0%             | 0.25 µV     | 3.1 µV*   | 0.19 µV  | 16.3     | 0.19 µV  | 0.7
Event-related fMRI (β)     |             | 515*      |          |          |          |
Blocked fMRI 3 s vs 0 s    |             |           |          |          |          | 0.26
Blocked fMRI 6 s vs 3 s    |             |           |          |          |          |
Blocked fMRI 9 s vs 6 s    |             |           |          |          |          |
Blocked fMRI 9 s vs 12 s   |             |           |          |          |          |
MEG 50 ms                  |             | pT/m      | pT/m     |          | pT/m     | 0.17
MEG 54 ms                  |             | pT/m      | pT/m     |          | pT/m     |
MEG 58 ms                  |             | pT/m      | pT/m     |          | pT/m     |

In Table 1 we also calculated the ratio of standard deviations (σ_w/σ_b), as this gives a useful indication of the likely influence that changing k will have on power. Paradigms with a small ratio (such as the blocked fMRI paradigm) produce power contours with the smallest gains from increasing numbers of trials (see Figure 8).

Up until this point, we have implicitly assumed that a fixed value of within-participant standard deviation (σ_w) can be substituted for each participant's individual value. Is this assumption justified, and what impact might different distributions of σ_w have on statistical power? To address this, we simulated power curves assuming a fixed value of σ_w, and both normal and skewed distributions of σ_w (see Figure 11a). The properties of these distributions were derived from the MEG data set (at 58 ms), as described in the Figure 11 caption, and compared with power estimates from the empirical data. For a range of sample sizes (N) and numbers of trials (k), the power estimates for all three artificial distributions were very similar (Figure 11b). However the power estimates derived from the empirical data are somewhat lower, especially with larger numbers of participants. This happens because a small number of outlier participants with higher standard deviations (those in the tail of the grey distribution in Figure 11a) contribute disproportionately to the overall variance. We think that most analysis pipelines will reject such participants (or reject individual trials that are contributing to a noisy participant mean), meaning that the loss of power here is a 'worst case' scenario (we avoided elaborate processing pipelines in the current paper to maximise transparency). In general these simulations suggest that the simplifying assumption of a single within-participant standard deviation is reasonable. For prospective power analyses, the margin of error in estimating effect sizes and variances will most likely subsume any considerations owing to non-normally distributed variances and outliers.

A further factor that influences statistical power in repeated measures designs is the covariance between the two measures. We performed simulations to quantify this, by generating synthetic data sets with different levels of covariance, and performing power calculations on the synthetic data for repeated measures t-tests.
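A minimal sketch of these simulations, using the parameter values given in the Figure 11 caption (mean difference 0.5, σ_b = 2, σ_w = 10); the function name and iteration count are illustrative.

```r
# Power for a paired design as a function of the correlation R between measures.
library(MASS)

simulate_power <- function(R, N = 100, k = 50, n_iter = 2000) {
  sd_s  <- sqrt(2^2 + 10^2 / k)                     # equation 1, per measure
  Sigma <- sd_s^2 * matrix(c(1, R, R, 1), 2)        # constant total variance
  mean(replicate(n_iter, {
    m <- mvrnorm(N, mu = c(0, 0.5), Sigma = Sigma)  # correlated measures
    t.test(m[, 2], m[, 1], paired = TRUE)$p.value < 0.05
  }))
}

sapply(c(0, 0.3, 0.6), simulate_power)  # power increases with R
```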
The simulations in Figure 11c-e show that when R = 0, there is no benefit from the repeated measures design, and power is determined by conventional factors (effect size, alpha level, sample size and number of trials). As the level of correlation increases from zero, power also increases because the covariance between the two measures accounts for a greater proportion of the total variance, and it is discounted by the repeated measures analysis. However the overall shape of the power contours is not affected by the change in covariance - the contours simply shift towards the origin. For paired t-test designs, the covariance can be accounted for by taking the difference between the two measures for each participant, and using these difference scores in a one-sample t-test (which is mathematically equivalent to a paired t-test on the original data). Estimates of effect size and power calculated in this way will incorporate the covariance between repeated measures. For more sophisticated designs, calculating stochastic power contours using existing data, or simulating them with a range of plausible covariance levels, may be more appropriate.

Figure 11. Summary of the influence on power of the distribution of within-participant standard deviations, and the correlation between repeated measures. Panel (a) illustrates possible distributions of within-participant standard deviations. The grey curve shows an empirical distribution derived from the MEG data set (N = 637 at 58 ms). The dashed line gives a fixed value, which is the mean of the empirical distribution excluding extreme values (in pT/m). The blue curve shows a normal distribution, with mean and SD derived from the empirical distribution. Panel (b) compares power estimates for the different distributions. Panels (c-e) show power contours for simulated repeated measures designs with correlations of R = 0, 0.3 and 0.6, assuming a mean difference of 0.5, between participants standard deviation of 2, and within participant standard deviation of 10. The total variance remained constant across the range of correlations.
Of course, we are far from the first to appreciate that multiple measurements can increase effect sizes and power. In the domain of psychometric research, the Spearman-Brown prophecy formula (Brown, 1910; Spearman, 1910) predicts how the reliability of a test (such as a personality test, or an IQ test) increases as more items are added. Rouder and Haaf (2018) also consider the effects of sample size and number of trials on statistical power, in the context of 'stochastic dominance' - the tendency for all participants in an experiment to have a true effect in the same direction. Under these conditions, the distribution of effects in the sample population is unlikely to be normal, and may instead be positively skewed with a mean and variance that are proportional (e.g. a gamma distribution). Simulations show that in this situation power can remain almost constant when trading off participants against trials. Our observation that Fano-factors for a given method appear to cluster together (see Figure 10b) could be taken as evidence that dominance holds for some of the paradigms investigated here, because gamma distributions have a variance that increases in proportion to the mean. Strong empirical evidence to firmly establish the conditions when dominance occurs is currently lacking, although it appears entirely plausible for many tasks in sensory and cognitive research.

Whereas most of the example data sets we consider here involve multiple repetitions of identical stimuli (6/8 data sets), other designs present different stimulus examples on each trial. For example in research on object processing, databases of object images are often used, with multiple exemplars in each object category. This additional source of variability can also be estimated, and further complicates the underlying mathematics of power analysis, as described in detail by Westfall et al. (2014). The power contour representation advocated here is also applicable to these situations (see Figures 2-6 of Westfall et al., 2014), and a linear mixed modelling approach can be used in which variances are explicitly represented at the participant, stimulus item and sample level (see also Brysbaert & Stevens, 2018). In such 'crossed' designs, the maximum power that can be achieved is limited by the item-level variance and number of stimulus examples, even for a hypothetically infinite sample size. For statistical procedures where the item-level variance is not explicitly modelled, it will be subsumed into the within- and between-participant variances, perhaps making power estimates less accurate.

Some studies have used cost functions to attempt to derive a single optimal experimental design, by assuming specific costs (usually in units of experimenter time) required for recruitment and testing of each participant (e.g. Cleary & Linn, 1969; von Oertzen, 2010; von Oertzen & Brandmaier, 2013). In principle these methods might be used to determine a point on the power contour that specifies a particular sample size and number of trials.
We have avoided being prescriptive about this here, as different studies will have different constraints and priorities, and the advantage of visualising the entire power surface is that it permits the experimenter to trade off these two variables against each other without loss of power. However we have built functionality into the Shiny web application to estimate an optimal combination of sample size and number of trials, based on the additional constraint of a per-participant 'recruitment cost', expressed as a notional number of trials. The optimal point is calculated by determining the smallest value of N*(k + cost) that achieves 80% power. We advise caution in the use of this feature.
Application to other statistical tests and approaches

Throughout all examples so far we have deliberately used a basic statistical test to determine power - the t-test. However the subsampling method we develop here can very easily be extended to more advanced statistical methods, including nonparametric statistics, Analysis of Variance (see Smith & Little, 2018, for a related example), correlation, regression and so on. The method of subsampling trials has no specific requirements about the form of the data (as with bootstrapping techniques), provided the assumptions for calculating the relevant test statistic are met. A recent study by Xu et al. (2018) calculated the reliability of working memory measures as a function of both sample size and number of trials, using a similar sub-sampling approach. This produced similar contour plots, but for Cronbach's alpha, Spearman-Brown reliability and standard deviation instead of statistical power. In all cases, these showed a dependency on both sample size and number of trials, consistent with the examples here. Iso-power contours have also been calculated in work on optimal study design using structural equation modelling (e.g. Brandmaier, von Oertzen, Ghisletta, Hertzog, & Lindenberger, 2015; von Oertzen & Brandmaier, 2013).

In Figure 12 we show power contour plots for repeated measures ANOVAs using two of the example data sets from the body of the paper. We conducted a one-way repeated measures ANOVA across the latter three TR times of the blocked design MRI experiment (using all five TR times produced such a large effect that the power contour analysis was uninformative). With the full data set, this produced a substantial significant effect. For the SSVEP data, we constructed a 7x2 repeated measures ANOVA with factors of stimulus contrast and mask level, which produced large significant main effects of both factors, along with a weaker but significant interaction. Power contours for the two main effects and the interaction are shown in Figure 12b-d. The main effect of stimulus contrast was so substantial that high power could be achieved with almost any combination of sample size and number of trials. The main effect of mask and the interaction were weaker, and again show the familiar tradeoff between N and k.
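Extending the subsampling approach to ANOVA only changes the test applied to each subsample. A sketch for the one-way repeated measures case, assuming amp is a participants x conditions matrix of per-participant means computed from one resample of trials (illustrative name and synthetic values):

```r
# Repeated measures ANOVA on one subsample; power is then the proportion
# of resamples with p < 0.05.
amp  <- matrix(rnorm(83 * 3), nrow = 83)   # synthetic stand-in
long <- data.frame(
  y    = as.vector(amp),
  cond = factor(rep(seq_len(ncol(amp)), each = nrow(amp))),
  id   = factor(rep(seq_len(nrow(amp)), times = ncol(amp)))
)

fit <- aov(y ~ cond + Error(id / cond), data = long)
p   <- summary(fit)[["Error: id:cond"]][[1]][1, "Pr(>F)"]   # p-value for 'cond'
```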
In Figure 12 we show power contour plots for repeated measures ANOVAs using two of the example data sets from the body of the paper. We conducted a one-way repeated measures ANOVA across the latter three TR times of the blocked design MRI experiment (using all five TR times produced such a large effect that the power contour analysis was uninformative); with the full data set, this produced a substantial significant effect. We also conducted a 7×2 repeated measures ANOVA on the SSVEP data set, with factors of stimulus contrast and mask level, which produced significant main effects of both factors as well as a significant interaction. Power contour plots for the one-way ANOVA are shown in Figure 12a, and for the two main effects and the interaction in Figure 12b-d. The main effect of stimulus contrast was so substantial that high power could be achieved with almost any combination of sample size and number of trials. The main effect of mask and the interaction were weaker, and again show the familiar trade-off between N and k.

Figure 12. Example power contours for one-way and factorial ANOVAs. Panel (a) shows a power contour plot for a one-way repeated measures ANOVA using three levels from the blocked fMRI data (summarised in Figure 8). Panels (b-d) show power contours for the main effects of contrast (b) and mask level (c), as well as their interaction (d), in a 7×2 repeated measures ANOVA design using the SSVEP data set (summarised in Figure 6).

In practical settings, one should design an experiment to detect the smallest effect of interest with the desired power. For this example, the main effect of mask is the smallest effect, and so a replication of this experiment could use values along the 80% contour in Figure 12c: for example, 100 participants each completing 40 trials, or 75 participants each completing 80 trials. Alternatively, if only the interaction were of theoretical interest, one could base the design on the constraints shown in Figure 12d.

For time-varying data using EEG and MEG (see Figures 5 & 9), it is commonplace to use cluster correction algorithms to control for multiple comparisons (e.g. Maris & Oostenveld, 2007). Informative power contours could in principle be constructed for significant clusters using either the number of trials (as here) or the number of time points included within a cluster. Similar approaches might be applied to fMRI data, where the number of voxels included in a spatial cluster or a region of interest (ROI) will likely affect statistical power.

One limitation of the methods presented here is that they assume trials are random and independent of each other. In many paradigms participants might become better at a task with practice (for example, they could become more accurate, or their reaction times could speed up), or become fatigued after long testing sessions. This will place limits on the improvements gained by running additional trials, though the likely impact will vary across paradigms (see Figure 3c for an example). For large data sets it may be possible to estimate the nonstationarity of σw, and the impact this has on power (see e.g. von Oertzen & Brandmaier, 2013). Other work has explicitly modelled multiple sources of variance in MRI studies using intra-class correlations (Brandmaier et al., 2018). This method permits dissociation of within-participant variance from various sources of measurement noise, such as differences in variance between time of day, scanner model, and so on. Accurate estimates of the relevant sources of variance will improve the overall accuracy of power analysis, which is particularly important given recent meta-analytic evidence (Elliott et al., 2019) that test-retest reliability for task-based fMRI is typically very low (mean intra-class correlation < .4).

Finally, adaptive procedures exist that aim to make data collection more efficient. A well-known example is the QUEST algorithm (Watson & Pelli, 1983), used widely in psychophysics, which chooses the optimal stimulus level on each trial to provide the most information about the location of the threshold. Related methods have also been used to optimise data collection in fMRI experiments (Lorenz, Hampshire, & Leech, 2017). Typically such approaches operate at a per-participant level, and will result in efficient use of the time available. If the ultimate aim is to combine results statistically across participants, then power contours might still be used to optimise the number of trials, in a similar fashion to that shown here for the contrast detection data (Figure 4), which also involved an adaptive (staircase) procedure. On the other hand, if the algorithm is designed to continue until particular conditions are met, traditional power analysis based only on sample size may be more appropriate.
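To illustrate the kind of adaptive procedure referred to above, the sketch below simulates a simple 1-up-2-down staircase (a simpler relative of QUEST, not the QUEST algorithm itself): the stimulus level decreases after two consecutive correct responses and increases after each error, converging on the level yielding roughly 70.7% correct. The psychometric function and all parameter values are arbitrary assumptions.

# Simulate a 1-up-2-down staircase on a detection task.
set.seed(1)
pf <- function(level) pnorm(level, mean = 0, sd = 1)  # assumed psychometric function

level <- 3; step <- 0.5; ncorrect <- 0
track <- numeric(80)
for (t in 1:80) {
  track[t] <- level
  correct  <- runif(1) < pf(level)                # simulate the observer's response
  if (correct) {
    ncorrect <- ncorrect + 1
    if (ncorrect == 2) { level <- level - step; ncorrect <- 0 }  # two down
  } else {
    level <- level + step; ncorrect <- 0          # one up
  }
}
mean(track[41:80])   # crude threshold estimate from the final 40 trials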
Most discussion of power analysis is focussed on studies which involve statistically demonstrating the presence of some effect. However, an alternative approach, common in perceptual and cognitive research, is to explain and predict patterns of response across multiple conditions using a computational model. In this tradition, each participant can be considered an independent 'replication' of the phenomenon under study (see e.g. Smith & Little, 2018), and the emphasis is on improving data quality by conducting many trials for each participant. Power contours might not be especially helpful under such circumstances, though knowledge of the within-participant standard deviation will inform decisions about how many trials to conduct.

Whereas experimental studies of the type we discuss here typically aim to reduce the sample variance (σs) in order to increase effect size, studies using individual differences approaches aim to maximise meaningful variation between participants. However, it is important that the observed variation (σs) is truly a result of individual differences (high σb) and not merely a consequence of poor measurement (high σw and low k). Traditional psychometric instruments, such as tests of personality and ability, typically have high test-retest reliability, which implies low within-participant variance (σw), yet this may not be so for neuroscience and experimental psychology paradigms (e.g. Elliott et al., 2019; Zuo, Xu, & Milham, 2019). Estimating these values explicitly (e.g. using equation 2, as sketched below) may help individual differences researchers using such methods to optimise the number of trials and the sample size for this purpose. We note that since σw > σb for all estimates of these two parameters in the paradigms considered here (Table 1 and Figure 10b), individual differences studies will require sufficient trials to reduce the unwanted influence of intra-individual variability (σw) on the sample variance (σs).
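A minimal sketch of such an estimate, assuming a complete participants × trials matrix of single-trial scores (variable names hypothetical), follows from the decomposition of the sample variance described earlier in the paper, σs² = σb² + σw²/k:

# Estimate within- and between-participants SDs from a trial-level matrix 'dat'.
k       <- ncol(dat)                                 # trials per participant
sigma_w <- sqrt(mean(apply(dat, 1, var)))            # pooled within-participant SD
sigma_s <- sd(rowMeans(dat))                         # SD of the participant means
sigma_b <- sqrt(max(0, sigma_s^2 - sigma_w^2 / k))   # between-participants SD
                                                     # (truncated at zero, since the
                                                     # variance estimate can be
                                                     # negative by chance)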
Conclusions

Here we present the rationale for incorporating the number of measurements (trials) into calculations of statistical power in experimental studies in psychology and human neuroscience. Power contour plots can be generated by subsampling existing data sets, or by using an online tool, and permit researchers to make informed choices at the study design stage about how many participants to test, and for how long to test each one. However, as with all a priori power calculations, the true effect sizes and variances remain speculative until the data have been collected.

Acknowledgements
We are grateful to everyone involved in the collection of the data sets reanalysed here, and particularly to those who made their data publicly available. This work was supported in part by a Wellcome Trust grant (ref: 105624), through the Centre for Chronic Diseases and Disorders (C2D2) at the University of York, awarded to DHB. Data collection and sharing for part of this project was provided by the Cambridge Centre for Ageing and Neuroscience (CamCAN). CamCAN funding was provided by the UK Biotechnology and Biological Sciences Research Council (grant number BB/H008217/1). We also thank those who offered constructive suggestions based on the preprint.

References

Baker, D. H., Lygo, F. A., Meese, T. S., & Georgeson, M. A. (2018). Binocular summation revisited: Beyond √2. Psychological Bulletin, 144(11), 1186-1199. doi: 10.1037/bul0000163
Bishop, D. (2019). Rein in the four horsemen of irreproducibility. Nature, 568(7753), 435. doi: 10.1038/d41586-019-01307-2
Boudewyn, M. A., Luck, S. J., Farrens, J. L., & Kappenman, E. S. (2018). How many trials does it take to get a significant ERP effect? It depends. Psychophysiology, 55(6), e13049. doi: 10.1111/psyp.13049
Boynton, G. M., Engel, S. A., Glover, G. H., & Heeger, D. J. (1996). Linear systems analysis of functional magnetic resonance imaging in human V1. Journal of Neuroscience, 16(13), 4207-4221.
Brandmaier, A. M., von Oertzen, T., Ghisletta, P., Hertzog, C., & Lindenberger, U. (2015). LIFESPAN: A tool for the computer-aided design of longitudinal studies. Frontiers in Psychology, 6, 272. doi: 10.3389/fpsyg.2015.00272
Brandmaier, A. M., Wenger, E., Bodammer, N. C., Kühn, S., Raz, N., & Lindenberger, U. (2018). Assessing reliability in neuroimaging research through intra-class effect decomposition (ICED). eLife, 7, e35718. doi: 10.7554/eLife.35718
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal of Cognition, 1(1), 9, 1-20. doi: 10.5334/joc.10
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. doi: 10.1038/nrn3475
Clayson, P. E., & Miller, G. A. (2017). Psychometric considerations in the measurement of event-related brain potentials: Guidelines for measurement and reporting. International Journal of Psychophysiology, 111, 57-67. doi: 10.1016/j.ijpsycho.2016.09.005
Cleary, T. A., & Linn, R. L. (1969). Error of measurement and the power of a statistical test. British Journal of Mathematical and Statistical Psychology, 22(1), 49–55. doi: 10.1111/j.2044-8317.1969.tb00419.x
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Lawrence Erlbaum Associates.
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), 140216. doi: 10.1098/rsos.140216
Elliott, M. L., Knodt, A. R., Ireland, D., Morris, M. L., Poulton, R., Ramrakha, S., ... Hariri, A. R. (2019). Poor test-retest reliability of task-fMRI: new empirical evidence and a meta-analysis. BioRxiv preprint. doi: 10.1101/
Cortex, 14–23. doi: 10.1016/j.cortex.2015.03.002
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. doi: 10.1371/journal.pmed.0020124
Ioannidis, J. P. A. (2008). Why most discovered true associations are inflated. Epidemiology, 19(5), 640-648. doi: 10.1097/EDE.0b013e31818131e7
Jenkinson, M., Beckmann, C. F., Behrens, T. E. J., Woolrich, M. W., & Smith, S. M. (2012). FSL. NeuroImage, 62(2), 782-790. doi: 10.1016/j.neuroimage.2011.09.015
Kanyongo, G. Y., Brook, G. P., Kyei-Blankson, L., & Gocmen, G. (2007). Reliability and statistical power: How measurement fallibility affects power and required sample sizes for several parametric and nonparametric statistics. Journal of Modern Applied Statistical Methods, 6(1), 81–90. doi: 10.22237/jmasm/
Linares, D., & López-Moliner, J. (2016). quickpsy: An R package to fit psychometric functions for multiple groups. The R Journal, 8(1), 122-131. doi: 10.32614/RJ-2016-008
Lorenz, R., Hampshire, A., & Leech, R. (2017). Neuroadaptive Bayesian optimization and hypothesis testing. Trends in Cognitive Sciences, 21(3), 155-167. doi: 10.1016/j.tics.2017.01.006
Luck, S. J., Stewart, A., Simmons, A., & Rhemtulla, M. (2019). Standardized measurement error as a universal measure of data quality for event-related potentials: An overview. PsyArXiv. doi: 10.31234/osf.io/jc3sd
Maris, E., & Oostenveld, R. (2007). Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods, 164(1), 177–190. doi: 10.1016/j.jneumeth.2007.03.024
Nesselroade, J. R. (1991). The warp and woof of the developmental fabric. In R. M. Downs, L. S. Liben, & D. S. Palermo (Eds.), Visions of aesthetics, the environment & development: The legacy of Joachim F. Wohlwill (pp. 213-240). Hillsdale, NJ: Lawrence Erlbaum Associates.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. doi: 10.1126/science.aac4716
Phillips, G. W., & Jiang, T. (2016). Measurement error and equating error in power analysis. Practical Assessment, Research & Evaluation, 21(9), 12.
Pirrone, A., Wen, W., Li, S., Baker, D. H., & Milne, E. (2018). Autistic traits in the neurotypical population do not predict increased response conservativeness in perceptual decision making. Perception, 47(10-11), 1081-1096. doi: 10.1177/
Rouder, J. N., & Haaf, J. M. (2018). Power, dominance, and constraint: A note on the appeal of different design traditions. Advances in Methods and Practices in Psychological Science, 1(1), 19-26.
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. doi: 10.3758/pbr.16.2.225
Shafto, M. A., Tyler, L. K., Dixon, M., Taylor, J. R., Rowe, J. B., Cusack, R., ... Matthews, F. E. (2014). The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) study protocol: a cross-sectional, lifespan, multidisciplinary examination of healthy cognitive ageing. BMC Neurology, 14, 204. doi: 10.1186/s12883-014-0204-1
Smith, P. L., & Little, D. R. (2018). Small is beautiful: In defense of the small-N design. Psychonomic Bulletin & Review, 25(6), 2083-2101. doi: 10.3758/s13423-018-1451-8
Spearman, C. C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
Steingroever, H., Fridberg, D. J., Horstmann, A., Kjome, K. L., Kumari, V., Lane, S. D., ... Wagenmakers, E.-J. (2015). Data from 617 healthy participants performing the Iowa Gambling Task: A "Many Labs" collaboration. Journal of Open Psychology Data. doi: 10.5334/jopd.ak
Tabelow, K., & Polzehl, J. (2011). Statistical parametric maps for functional MRI experiments in R: The package fmri. Journal of Statistical Software, 44(11), 1-21. doi: 10.18637/jss.v044.i11
Taylor, J. R., Williams, N., Cusack, R., Auer, T., Shafto, M. A., Dixon, M., ... Henson, R. N. (2017). The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) data repository: Structural and functional MRI, MEG, and cognitive data from a cross-sectional adult lifespan sample. NeuroImage, 144, 262–269. doi: 10.1016/j.neuroimage.2015.09.018
Vilidaite, G., Marsh, E., & Baker, D. H. (2019). Internal noise in contrast discrimination propagates forwards from early visual cortex. NeuroImage, 191, 503-517. doi: 10.1016/j.neuroimage.2019.02.049
Vilidaite, G., Norcia, A. M., West, R. J. H., Elliott, C. J. H., Pei, F., Wade, A. R., & Baker, D. H. (2018). Autism sensory dysfunction in an evolutionarily conserved system. Proceedings of the Royal Society B, 285, 20182255. doi: 10.1098/rspb.2018.2255
von Oertzen, T. (2010). Power equivalence in structural equation modelling. British Journal of Mathematical and Statistical Psychology, 63(Pt 2), 257-272. doi: 10.1348/
von Oertzen, T., & Brandmaier, A. M. (2013). Optimal study design with identical power: An application of power equivalence to latent growth curve models. Psychology and Aging, 28(2), 414–428. doi: 10.1037/a0031844
Wang, L., Mruczek, R. E. B., Arcaro, M. J., & Kastner, S. (2015). Probabilistic maps of visual topography in human cortex. Cerebral Cortex, 25(10), 3911-3931. doi: 10.1093/cercor/bhu277
Watson, A. B., & Pelli, D. G. (1983). QUEST: a Bayesian adaptive psychometric method. Perception & Psychophysics, 33(2), 113-120.
Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020-2045. doi: 10.1037/xge0000014
Williams, R. H., & Zimmerman, D. W. (1989). Statistical power analysis and reliability of measurement. The Journal of General Psychology, 116(4), 359–369. doi: 10.1080/
Xu, Z., Adam, K. C. S., Fang, X., & Vogel, E. K. (2018). The reliability and stability of visual working memory capacity. Behavior Research Methods, 50(2), 576-588. doi: 10.3758/s13428-017-0886-6
Zuo, X.-N., Xu, T., & Milham, M. P. (2019). Harnessing reliability for neuroscience research. Nature Human Behaviour. doi: 10.1038/s41562-019-0655-x

Data and scripts are available at: http://dx.doi.org//OSF.IO//