Point estimates, Simpson's paradox and nonergodicity in biological sciences
Madhur Mangalam and Damian G. Kelty-Stephen

Madhur Mangalam is in the Department of Physical Therapy, Movement and Rehabilitation Sciences, Northeastern University, Boston, MA, USA; Damian G. Kelty-Stephen is in the Department of Psychology, Grinnell College, Grinnell, IA, USA. E-mails: [email protected]; [email protected]

Abstract
Modern biomedical, behavioral and psychological inference about cause-effect relationships respects an ergodic assumption, that is, that the mean response of representative samples allows predictions about individual members of those samples. Recent empirical evidence in all of the same fields indicates systematic violations of the ergodic assumption. Indeed, violation of ergodicity in biomedical, behavioral and psychological causes is precisely the inspiration behind our research inquiry. Here, we review the long-term costs to scientific progress in these domains and a practical way forward. Specifically, we advocate the use of statistical measures that can themselves encode the degree and type of nonergodicity in measurements. Taking such steps will lead to a paradigm shift, allowing researchers to investigate the nonstationary, far-from-equilibrium processes that characterize the creativity and emergence of biological and psychological behavior.
Keywords: ergodic, longitudinal modeling, nonergodic, stationarity, statistical analysis, vector autoregression
BOX 1 GLOSSARY

Effect size.
The magnitude of the observed effect of an independent variable on a dependent variable. The effect size obtained for a sample is an estimate of the population effect size.
Gaussian distribution.
A frequency distribution that arises from additive processes, is well defined mathematically, and is assumed to be common in empirical data.
Hurst exponent.
A measure of long-range temporal correlations in time series.
P-value.
The probability of obtaining the same result or one more extreme just by chance if the null hypothesis is true. When dichotomized according to experimenter-determined tolerance for false-positive error, this value serves as an inverse measure of the strength of evidence against the null hypothesis.
Population.
The set of all individuals of a specific type that is to be statistically measured but is too vast to be sampled exhaustively, such that an exact measure of the population cannot be obtained.
Regression.
A family of statistical techniques used to estimate the relationships between a dependent variable (often called the "outcome") and one or more independent variables (often called "predictors").
Replicate.
To repeat a procedure using a new sample from the same population(s).
Reproduce.
To find the same results when a procedure is repeated using a new sample from the same population(s).

(Random) sample.
Observations obtained randomly from a defined population, used to estimate population characteristics.
Sample size.
The number of observations in the sample.
Statistical power.
A measure of the capacity of a statistical test to yield a significant result. It depends on the threshold for significance, the size of the expected effect, variation in the population, the alternative hypothesis (one- or two-sided), the nature of the test (paired or unpaired), and the sample size.
Variance.
The spread between observations in a sample, quantifying how far each observation in the sample is from the mean, and therefore from every other observation in the sample.

BOX 2 ERGODICITY
A stochastic process x(t) can be subjected to two types of averaging: the ensemble average and the time average. The finite-ensemble average of the quantity x at a given time t is

\langle x(t) \rangle_N = \frac{1}{N} \sum_{i=1}^{N} x_i(t),   (1)

where x_i is the i-th realization of x(t) and N is the number of realizations included in the average. The finite-time average of the quantity x(t) is

\bar{x}_{\Delta t} = \frac{1}{\Delta t} \int_{t}^{t+\Delta t} x(s) \, ds.   (2)

If x changes at T = \Delta t / \delta t discrete times t + \delta t, t + 2\delta t, \ldots, then Eq. (2) becomes

\bar{x}_{\Delta t} = \frac{1}{T} \sum_{\tau=1}^{T} x(t + \tau \delta t).   (3)

An observable X is ergodic if its ensemble average converges to its time average with probability one, such that

\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} X_i(t) = \lim_{\Delta t \to \infty} \frac{1}{\Delta t} \int_{t}^{t+\Delta t} X(s) \, ds.   (4)

Hence, a stochastic process is ergodic if any random collection of samples represents the entire process's average statistical properties. Conversely, a stochastic process is nonergodic when the statistical properties of that process change with time. An unbiased random walk is nonergodic, as its time average is a random variable with divergent variance about the expectation value of zero. For instance, let us choose at random one of two fair coins and then toss the selected coin n times. Let the outcome be 1 for heads and 0 for tails. Then the ensemble average is ½(½ + ½) = ½, which is equal to the long-term average of ½ for either coin. Hence, this random process is ergodic. Conversely, let us choose at random one of two coins, one fair and the other two-headed, and then toss the selected coin n times. The ensemble average for this case is ½(½ + 1) = ¾, yet the long-term average is ½ for the fair coin and 1 for the two-headed coin. Hence, this random process is not ergodic. Note that ergodicity is distinct from stationarity. An example of a stationary but nonergodic process is rolling a die today: the function x(t) for all future dates is whatever is rolled today.
x(t) is clearly nonergodic given that its expectation value is 3.5 but its time average is either 1, 2, 3, 4, 5 or 6. Nonetheless, it is stationary. Meanwhile, it is possible to have an ergodic process that is nonstationary, precisely when the expectation value is static but the variance changes over time. For instance, during a jazzy drum solo, the hi-hat cymbal is locked into position on the drum kit, but the intermittent strikes by the drummer will dramatically change the size of the oscillations in the cymbal from beat to beat and measure to measure.

BOX 3 SIMPSON'S PARADOX

Simpson's paradox is a phenomenon in which a statistical trend observed at the individual level (or at the level of groups) either disappears or reverses at the group level (or at the level of a set of groups). The paradox can be resolved when individual relationships are appropriately addressed in the statistical modeling. Simpson's paradox is especially problematic in biomedical sciences, where trends observed at the group level are often fallaciously used to derive inferences about individuals. An often-quoted intuitive example of Simpson's paradox is the correlation between typing speed and typos. At the group level, the correlation is negative: more experienced typists type faster as well as make fewer typos (solid black line in Fig. I). However, at the individual level, the correlation is positive: the faster an individual types, the more typos he/she makes (blue dotted lines in Fig. I). Thus, it would be fallacious to conclude that the relationship between typing speed and typos observed at the group level holds for individuals. Interested readers can refer to some of the excellent resources that discuss statistical methods to prevent, diagnose and treat Simpson's paradox in point data.

Fig. I.
Relationship between typing speed and typos. Simulated data illustrating that despite a negative correlation at the group level, within each individual there exists a positive relationship between typing speed and the frequency of typos.

BOX 4 STATISTICS FOR QUANTIFYING NONERGODICITY
Ergodicity refers to the resemblance of sample variance to population variance. Statistical methods often phrase this question as testing how much subsample variances vary from the bigger sample's variance, that is, the variance of sample variances. This resemblance can be considered in two ways: first, in terms of whether the time-average variance of any single trajectory resembles the ensemble-average variance of multiple trajectories and, second, in terms of whether the time-average variance of a subset within a single trajectory resembles the time-average variance for the entire single trajectory.

Consider a sample of participants, each flipping a coin 200 times. If we assume that each coin is fair and each participant performs the flips similarly, the variances in outcomes should presumably resemble each other in the long run. We can test this assumption by looking at any one participant's flipping, taking the variance for that participant's whole sequence, and comparing it with the variance across all participants. Deviations in that participant's flips might suggest a weighted coin or a systematically different flipping method. If that participant's flipping behavior does not resemble the average behavior across all participants, then ergodic assumptions fail. Of course, each participant is more likely to deviate from a smaller subsample of participants. So, it is possible to examine variance across progressively larger subsamples of individuals (e.g., 1 participant alone, 2 participants, 4 participants, and so forth) and test how quickly subsample-average variance converges towards whole-sample-average variance. Now, we can also consider how well subsamples of any one participant's flipping outcomes resemble the variance across the whole sequence: because small-sample bias likely makes variance unstable, we do not expect all subsamples of 5-flip sequences to yield the same estimated variance as the entire 200-flip sequence.
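The subsample-variance check just described can be sketched in a few lines of Python. This is a minimal illustration using simulated fair-coin flips; the function names are our own, not drawn from any particular library:

```python
import random
import statistics

def flip_sequence(n, p_heads=0.5, rng=None):
    """Simulate n coin flips coded 1 (heads) / 0 (tails)."""
    rng = rng or random.Random()
    return [1 if rng.random() < p_heads else 0 for _ in range(n)]

def mean_subsample_variance(seq, window):
    """Average variance over consecutive, non-overlapping windows."""
    chunks = [seq[i:i + window] for i in range(0, len(seq) - window + 1, window)]
    return statistics.mean(statistics.pvariance(c) for c in chunks)

seq = flip_sequence(200, rng=random.Random(1))   # one participant's 200 flips
total_var = statistics.pvariance(seq)

# For an ergodic process, the average subsample variance converges toward
# the whole-sequence variance as the window grows.
for window in (5, 10, 20, 50, 100):
    print(window, round(mean_subsample_variance(seq, window), 3), round(total_var, 3))
```

Plotting these values against window length gives the convergence curve described above; a weighted coin or a drifting flipping style would slow or prevent the convergence.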
However, we can evaluate the average variance for gradually longer subsets (i.e., for 5-, 10-, 20-flip sequences, and so forth) and evaluate ergodicity as the speed at which the variance across these subsets converges towards the variance across the whole sequence.

To our knowledge, the most precise way to portray ergodicity is through a dimensionless statistic of ergodicity breaking, E_B, also known as the Thirumalai-Mountain (TM) metric, that subtracts the squared total-sample variance from the average squared subsample variance and then divides by the squared total-sample variance (Fig. II):

E_B(x(t)) = \frac{\left\langle [\delta(x(t))]^2 \right\rangle - \left\langle \delta(x(t)) \right\rangle^2}{\left\langle \delta(x(t)) \right\rangle^2},   (5)

where \delta(x(t)) denotes the subsample variance and \langle \cdot \rangle the average across subsamples; E_B is zero in the limit t → ∞ if the process is ergodic. In effect, this statistic phrases the deviation of squared subsample variance from squared total-sample variance as a proportion of squared total-sample variance. Evidence for ergodicity is the rapid convergence of this statistic to zero for progressively larger samples. Slower (or non-) convergence indicates weaker (or non-) ergodicity. Here, we would like to caution about ergodicity breaking using an example from economics. Models of relative wealth (i.e., what proportion of total national wealth an individual owns) often assume ergodicity, that is, fast convergence to a stable asymptotic distribution. This assumption can be upheld against all evidence: it gives answers and puts numbers on parameters. It is only when we work outside this assumption that we find that it is unjustified. Hence, sometimes we may not see fast convergence to any asymptotic distribution, and sometimes no convergence at all.

Fig. II. Ergodicity in additive white Gaussian noise (AWGN). (a) A representative trajectory of white noise and the ensemble average of N = 1024 such trajectories. (b) Variances of individual trajectories, as well as time-averaged and ensemble-averaged variances across all 1024 trajectories (the latter two are the same for AWGN).
(c) Ensemble-averaged variance calculated over differently-sized trajectory subgroups. (d) The ergodicity breaking parameter E_B versus time t.

The universe is not ergodic.
—Sean Carroll

Introduction
Scientific advancement depends on the reproducibility and validation of research findings. Poorly reproducible studies can impede and misdirect scientific progress, jeopardize funding, and lead to harmful clinical applications. Growing awareness among the scientific community about lapses of reproducibility in biomedical sciences, including the psychological sciences and neurosciences, has inspired recent developments like the Nature series entitled "Challenges in irreproducible research" and the Reproducibility Initiative, a global project intended to identify and reward reproducible research (http://validation.scienceexchange.com/). Although we laud these efforts, the focus has mostly been on the fallacies of P-values, small sample sizes, inaccurate estimation graphics, and reporting biases. A fundamental problem pervasively linked to the lack of reproducibility in human-subjects research, inherent in standard analytical techniques, that remains to be considered is nonergodicity: the paucity of group-to-individual generalizability. Crucially, despite the expectation that group treatments will inform individual-level interventions or outcomes, human-subjects research may take ergodicity for granted when it should not. Here, we aim to address how recognizing and quantifying nonergodicity holds great promise for supporting future scientific progress.

Biomedical, behavioral and psychological studies infer from statistical tests conducted on aggregated data. This policy rests on the assumption that group-level statistical effects can be applied to understanding the physiology and psychology of an individual; that is, they assume that the study phenomenon is "ergodic" (Box 2). Ergodicity holds when individual-level variability is homogeneous in resembling variability at the level of the group (or "ensemble" in statistical-physics parlance) and when individual-level variability is stationary, exhibiting homogeneous mean and variance over time.
Unfortunately, the differentiation of form and behavior inherent to biological structure violates the assumptions of homogeneity and stationarity, respectively, inevitably making it nonergodic. This violation is no peripheral nuisance but follows systematically from all of our most unambiguous attempts to distinguish life as an evolving, innovating, and adaptive process distinct from non-living processes. The second condition, stationarity, rules out most physiological and psychological processes with time-varying moments (mean function, sequential covariance function) from being ergodic (e.g., heart rate variability, motor control, developmental processes, learning processes, and transient brain responses). Sample-level statistical modeling that assumes stationarity where it does not hold cloaks artifacts of nonstationarity or fails to articulate the systematic changes producing nonstationarity, obscures any genuine individual differences, and jettisons any generalizable truths we might have gleaned from the very diversity initially intended to represent the population. In these circumstances, deriving inferences from statistical tests conducted on aggregated data might profoundly threaten our goals of scientific consilience and completeness. It certainly makes the suspicious compromise of enforcing similarity and ignoring diversity for the sake of formal convenience.

Here we argue that behavioral, psychological, and neuronal processes unavoidably violate ergodic assumptions, with reference to typically-studied phenomena. We discuss how, for this reason, a significant body of work in these fields falls short on scientific investigation's fundamental objectives, ranging from the failure to reproduce research findings to the inability to test hypotheses about nonlinear far-from-equilibrium dynamics. We explain why finding meaningful ergodic observables is essential for investigating nonergodic processes, and we introduce a few strategies that allow this transformation.
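To make the ergodic/nonergodic contrast of Box 2 concrete, the following minimal Python sketch (our own illustration, standard library only) compares the time averages of individual trajectories with the ensemble average for additive white Gaussian noise, which is ergodic, and for an unbiased random walk, which is not:

```python
import random

def white_noise(n, rng):
    """Ergodic example: independent Gaussian samples."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def random_walk(n, rng):
    """Nonergodic example: cumulative sum of Gaussian steps."""
    x, path = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def time_average(path):
    return sum(path) / len(path)

rng = random.Random(7)
N, T = 500, 500
for label, gen in (("white noise", white_noise), ("random walk", random_walk)):
    paths = [gen(T, rng) for _ in range(N)]
    time_avgs = [time_average(p) for p in paths]
    ensemble_avg = sum(p[-1] for p in paths) / N   # ensemble average at time T
    spread = max(time_avgs) - min(time_avgs)       # disagreement among time averages
    print(f"{label}: ensemble avg {ensemble_avg:+.2f}, spread of time avgs {spread:.2f}")
```

For white noise, every trajectory's time average lands near the ensemble average of zero; for the random walk, the time averages scatter widely across trajectories, which is exactly the divergence of time and ensemble averages that defines nonergodicity.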
Behavioral, physiological, and neuronal processes can violate ergodic assumptions
Despite multiple calls about the perils of violating ergodic assumptions, extant work in biomedical sciences has mostly followed best-practice guidelines based almost exclusively on statistical inferences from data aggregated across large samples. Whether couched in prosaic terms or in formal mathematical theorems, the violation of ergodic assumptions questions the validity of this work. Here we provide a few examples of research areas that are particularly vulnerable to the violation of ergodic assumptions.

Physiological sciences. One area in human physiology where nonergodicity has been shown to be particularly important is the biophysical transport of liquids. Such nonergodic processes include hemodynamics and intracellular and extracellular transport of complex media in biological systems, such as cytoplasm and nucleoplasm. For instance, to reach every cell, the blood must enter every cell continuously in the temporal domain. Due to the fractal organization of the vascular network, blood flow decelerates at small length scales (i.e., in finer capillaries). Blood cells spend more time in small volumes than the volumetric fraction of these volumes would suggest, making these systems nonergodic. Consequently, calculations using the time fraction (corresponding to the time average) do not apply to calculations using the volumetric fraction (corresponding to the ensemble average).

Behavioral and psychological sciences. A well-known phenomenon from cognitive psychology in which the individual- and ensemble-level findings often conflict is the speed-accuracy trade-off. Changes in the speed-accuracy trade-off form the bedrock of motor learning. Although speed and accuracy show a positive correlation across individuals, each individual shows a negative relationship between speed and accuracy, reflecting the differential use of strategies to achieve the task goal.
Indeed, people often use multiple cognitive/motor strategies to achieve the same task goal, exhibiting an adaptive combination of accuracy and speed. This phenomenon is analogous to Simpson's paradox (Box 3), which is ubiquitous in the behavioral and psychological sciences. In a classic study, Siegler identified three factors that can lead to incorrect conclusions about data averaged over strategies: 1) Relative frequency of each strategy: the more frequently a strategy is used, the greater its influence on the average data. 2) Relative variability of performance from each strategy: among strategies producing unequal variance on the dependent variable, the one with higher variance will have a larger influence on the average data; a less frequent strategy can have a greater influence if it leads to much higher variance. 3) Variability in the relationship between independent and dependent variables within and across strategies: a high correlation between the smallest addend and performance can falsely indicate that the strategy keyed to the smallest addend was used more frequently than it was. Hence, psychological studies might overestimate the relationship between a set of variables at one level over other level(s), due to the clustering of strategies, more often than commonly thought.

Neurosciences. In neurophysiological studies, neuronal spikes show temporal fluctuations throughout the stimulation period. The rate-coding paradigm assumes that these fluctuations do not contain any information, so it eliminates such statistical fluctuations by averaging over many trials. Indeed, the assumption is that the central nervous system is obliged to make an ensemble average over many neurons with the same function; that is, neuronal spiking is ergodic. However, arguments supporting ergodicity in spike trains are weak. For instance, one argument posits that approximately 30 neighboring neurons may form a functional unit, effectively encoding the same stimulus.
The central evidence supporting this conjecture is that neighboring neurons receiving shared inputs belong to the same functional unit. However, because neuronal firing unfolds through chain reactions, neurons cannot simply resemble each other. Neurons also receive heterogeneous inputs from neighboring neurons and hence show nonergodicity. For instance, if, in a population of neurons showing a synchronous peak in firing rate at the ensemble level, the timing of the peak drifts from trial to trial, then averaging the single-cell firing rate over trials would not show any peak. In contrast, in a coupled network of heterogeneous neurons, each neuron may show a peak at a different time that does not drift across trials; in this case, the ensemble-average firing rate can be flat because population heterogeneity will mask the peaks of individual neurons. Hence, neurophysiological studies based on ensemble-level averaging of neuronal spiking data might violate ergodic assumptions, portraying an erroneous picture of neuronal function at the single-cell level.

Thus, diverse mechanisms rooted in the geometry of the biological system (e.g., the fractal structure of the vascular network, multiplicative interactions among cognitive processes, and heterogeneity in neuronal connectivity) lead to nonergodicity. Indeed, nonergodicity is more likely to be overlooked when studying such biomedical systems, particularly when obtaining long observations of time series is not feasible. Processes that involve some growth, such as development and aging, and that are naturally susceptible to practical constraints of data collection are particularly vulnerable to such oversight.

Neglecting ergodic assumptions undermines the scientific enterprise
The widespread mistreatment of nonergodicity in biomedical and psychological sciences has substantial epistemic and practical consequences, ranging from none in the case of the few ergodic processes in nature to catastrophic if a process is highly nonergodic. Indeed, neglecting ergodic assumptions might jeopardize the scientific community's many efforts to address the challenges of P-values, small sample sizes, inaccurate estimation graphics, and reporting biases. In many cases, these consequences set back the following four objectives that form the bases of these research areas:

Failure to replicate research findings. Failures to replicate research findings might be baked into the potentially incomplete but broadly sweeping failure of human behavior, and of measurements of this class of systems, to conform to the standard of independent and identically distributed behavior around the mean value. In place of this assumption, to remove uncertainty in the measurement, the standard technique is to average across time to extract the true value. Although this process yields a reliable, temporally stable value, this temporal stability is no guarantee of later reproducibility: performing the same experiment again will yield a different result every time. For instance, we will find that time-averaging Brownian motion over time will yield a progressively smoother trajectory. This result will lead us to believe that it has converged to some fundamentally true value, but repeating this exercise will yield a different value. The ergodic hypothesis is especially apt for processes that visit all possible states in a finite sample space. For instance, a good model of the numbers that show up on a roulette wheel is closely ergodic. The probability distribution of numbers (0 to 36) that have come up in the past is the same as the probability distribution of numbers on the next spin.
In contrast, if a human participant is asked to say a number between 0 and 36, that participant might show systematic bias towards smaller or larger numbers. The bias might also depend on exogenous factors such as the context, surroundings, and time of day. Therefore, a good model of the numbers produced by humans is nonergodic, as human behavior defies the ergodic assumption of independent and identically distributed behavior around the mean value. Applying ergodic statistics to these nonergodic processes leads to wrong conclusions.

Failure to probe nonlinear dynamical principles underlying the study phenomena. Ergodic statistics can at best describe the effects of manipulations on the outcome variable, assuming ergodicity holds. They are not equipped to decipher the nonlinear dynamical principles underlying the study phenomenon. The ergodic assumption implies that a given manipulation for a participant in a study is always mediated by the same components, in the same manner, to link it with some measured output. An alternative view is interaction-dominant dynamics (IDD): component-causal chains are not the causal building blocks of behavior but are themselves soft-assembled; that is, they emerge out of and do not exist independently from task constraints. Ergodic statistics that contrast distinct groups of treatments with a between-participant design do not allow tracing IDD-related effects of time or influences of task changes that could lead to phase transitions between conditions, leading to different interpretations of the same treatment. Consequently, the interdependence of consecutively measured values of behavior or physiology violates the IID assumption, such that the variance of a measured observable cannot be parsed into specific, stable, and generalizable sources of variance reflecting the dynamical principles underlying the study phenomenon.
Indeed, cognitive science has been explicit that when we instruct human participants to generate replicable behaviors in sequence, our instructions are in vain: "When the optimal strategy in a task is to provide a series of independent and identically distributed responses, people often perform suboptimally". So, for instance, if we leave human participants to count off seconds without any feedback from a clock, their behavior will be relatively nonergodic, suggesting that their sense of how long a second lasts will narrow and dilate across time.

Failure to test hypotheses about nonlinear, far-from-equilibrium dynamics. The ergodic assumption of equivalence between the ensemble average and the time average of an observable is a key component of equilibrium dynamics. When this assumption is valid, dynamical descriptions can be replaced with much simpler probabilistic ones, essentially eliminating time from the models. The conditions for validity are often restrictive for non-equilibrium dynamics and even more so for far-from-equilibrium dynamics. Biomedical sciences inevitably deal with far-from-equilibrium systems, specifically with models of growth and stability. For instance, standing quietly and maintaining focus on a target in front of us is the preamble to very many coordinated behaviors: we might lean forward and reach, or track the target's progress and bat it away. However, this starting position is not merely the preamble to action but is already a rich wellspring of action itself, exhibiting a continuous stream of intermittent fluctuations. So long as they do not pitch the bodily center of pressure (CoP) beyond the base of support, these fluctuations are crucial to maintaining quiet stance. Therefore, it is surprising that the prevailing statistical methods make an indiscriminate assumption of ergodicity, feeding the bias to study systems while they are in equilibrium, because only then can the behavior be measured to obtain a point estimate.
But nonlinear dynamical systems reveal themselves when they are far from equilibrium, and this is when the dynamical principles come into play and nonergodicity reigns. We can take the previous example of participants responding every time they think a second has passed. If we now give our second-estimating participants a clock to provide feedback on their accuracy, this task constraint can make behavior look significantly more ergodic. Then again, it is when we pose more open-ended questions, not just removing feedback but giving participants insight problems to prompt more creative responses, that we see that the mind at its most creative looks dramatically less ergodic, with fits and starts, with distractions, and with sudden "aha!" moments.

Failure to articulate new hypotheses based on current findings. A new theory may suggest its initial hypotheses. The scientific program depends on testing these hypotheses and generating new ones in the process. However, if the theory invokes nonlinear, far-from-equilibrium processes like emergence, then it presumes nonergodicity at its foundations. Statistical modeling that assumes ergodicity of raw measurements in these processes risks undermining such theorizing. If we use a linear model assuming ergodicity, we never test the original hypotheses about a nonergodic process. For instance, the emergent coalition model (ECM) of word learning and vocabulary development predicts that the lexicon self-organizes from a coalition of exogenous cues. Testing the effects of a set of cues in a linear model assuming ergodicity will always operate by estimating independent factors assumed to hold invariantly across time and individual participants. The trouble here is that linear modeling is a filter on our measurements. It presupposes ergodicity in the measurements entered into the model, and then it estimates the sizes and directions of effects exclusively in ergodic terms.
Any contingency of effects on time, space, or participant variability must be made explicit to the model in terms of our preparation and coding of the data we feed into it. For instance, an autoregressive moving-average (ARMA) model of a measurement with trends over time will repeatedly fail to converge with stable residuals because the inputs are presumed to be stationary. It requires the scientist either to take the first difference of the trending data before running the ARMA model or to change the modeling strategy to ARIMA (including the "I" for "integration"), which reflects a modeler's explicit coding of the linear trend with time. If the trends are fractional or nonlinear, ARIMA may or may not converge but will definitely fit the wrong trend, estimate effects for variables that may not exist, and underspecify the complexity of the measurement. The linear filter will not know what nonergodicities it may be missing, and so it will give no insight into the presumed non-equilibrium mechanisms driving development. Now, in the case of language learning, when a linear model finds that cue X (e.g., maternal pointing) predicts outcome Y (e.g., learning count nouns), the linear filtering removes any nonlinearity, and suddenly the output looks as linear and ergodic as the theory insisted the phenomenon was not. Nothing about a linear modeling output on learning more count nouns from maternal pointing will then suggest which class of emergent phase shifts is in play, which populations will exhibit which individual differences in terms of which count nouns they learn, or whether the count nouns they learn align with parallel syntax development. The ergodic assumptions suggest that all trajectories are the same on average, and all of these cues and subsets of language performance would be assumed parallel by any linear model.
Of course, we could come up with new hypotheses about how far-from-equilibrium dynamics leads to a given emergent structure, but ergodic models cannot "see" the underlying nonlinearity and temporal variation. No matter the rhetoric about nonlinear processes like emergence, unless we are testing predictions about explicitly nonlinear sources of nonergodicity [e.g., as ARIMA does for linear sources], the linear models and their independent and identically distributed estimates are themselves mute to nonlinear dynamical facts like emergence: not just unable to test the initial hypothesis about non-equilibrium dynamics like emergence but also unable to refine new hypotheses relevant to non-equilibrium dynamics.

Hence, all human-subjects research that ignores the nonergodicity of the study phenomenon falls short on these four fundamental objectives of scientific investigation. In basic research, we cannot overstate how misleading our impressions might be about the effects of physiological/psychological manipulations and their interactions. In the clinical domain, diagnostic tests might be systematically biased, and our classification systems and treatments/interventions may be at least partially invalid. In medicine, this calls for personalization. For instance, a drug might not be effective unless the patient has some particular gene(s). By testing a large population, one would (correctly) conclude that the drug is effective in 90% of cases. It would be incorrect to conclude that an individual would respond to the drug 90% of the time it is tested on that one person; the response may be all-or-none at the individual level. In studying a new phenomenon, we might not even be employing the research designs necessary to adequately test the first line of hypotheses, let alone articulate new theories based on the current findings, as discussed above.
We can do research with the wrong glasses on, and spend our careers finding all sorts of effects and inventing nomenclatures without ever addressing a much more serious, all-encompassing problem. It can be a checkmate situation: irrespective of decades of research, we may have to throw it all away as invalid because it rests on an invalid ergodicity assumption. It is high time that we embrace nonergodicity and teach the next generation of scientists some nonergodic research designs and statistical techniques. Here, we side with Molenaar that the classical ergodic theorems imply that focusing on finer-grained variation is not optional. The systematic violation of these theorems in our raw measurements makes this focus a necessity.

Multiscaled estimates of nonergodic measurements can behave ergodically
Effects that look nonlinear and unstable in linear modeling are often better behaved when the modeling allows causality to unfold across diverse scales. This point does not imply that an observed phenomenon is independent at different scales but that linear models can respect multiscaled texture. There are two ways this can happen and might be relevant. First, mixed-effects modeling allows estimating “fixed” effects of explicit manipulations spanning the whole task and “fixed” effects of explicit manipulations that vary across the task (e.g., blocks with and without treatment), as well as “random” effects capable of absorbing some rudimentary ways in which participants might differ in the intercept of their responses and in the slope of those responses across trials. That said, the labels “fixed” and “random” are misnomers to the degree that very many explicitly randomized manipulations across the task (e.g., levels of an informational variable) can show so-called “fixed” effects. For instance, a given level of an informational variable can show different effects in Block 1 versus Block 2, and that level could have been delivered on a randomly different trial within those two different blocks. However, the availability of the “fixed” effect of that informational variable means that this modeling allows fitting an estimate of the population-level effect of the experimenter-randomized informational variable. Taking this a step further, we can also test the “fixed” effects of endogenous variables that participants bring to each trial within each block, without undermining the fact that participants bring individual differences to the task on each trial. It is only to say that some of the endogenous variety in how participants meet the task constraints could represent causal mechanisms generic to the whole population.
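As a sketch of this logic (not of any particular published analysis), the following simulation gives each participant their own intercept and slope scattered around a population-level effect, and shows that a pooled linear fit still recovers that population-level “fixed” effect; a full mixed-effects model (e.g., lme4 or statsmodels MixedLM) would additionally estimate the random-effect variances:

```python
import numpy as np

rng = np.random.default_rng(3)
n_subj, n_trials = 40, 100
true_slope = 0.5  # hypothetical population-level ("fixed") effect

x_all, y_all = [], []
for _ in range(n_subj):
    intercept = rng.normal(2.0, 1.0)         # participant-specific intercept
    slope = true_slope + rng.normal(0, 0.1)  # participant-specific slope
    x = rng.uniform(0, 10, n_trials)         # e.g., gaze fixations per trial
    y = intercept + slope * x + rng.normal(0, 0.5, n_trials)
    x_all.append(x)
    y_all.append(y)

x_all, y_all = np.concatenate(x_all), np.concatenate(y_all)
A = np.column_stack([np.ones_like(x_all), x_all])
beta, *_ = np.linalg.lstsq(A, y_all, rcond=None)
print(beta[1])  # close to 0.5 despite individual differences
```

The individual differences do not vanish; they are simply absorbed as variance around an estimable population-level effect.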
For instance, the fact that different participants will direct different numbers of gaze fixations at a visual display will not prevent a mixed-effects model from estimating the population-level effect of gaze fixations on the visual-perceptual response.

Again, note the nomenclature: the random walk (process 3) is not ergodic, but its steps (process 1) are. Because its steps are ergodic, so is, for instance, its squared deviation over some fixed interval. Finding meaningful ergodic observables for nonergodic processes is a hugely important part of doing ergodicity economics, and of doing science in general. This brings us to the second way we can make linear models respect multiscaled texture, which is by devising statistics that quantify nonergodicity and can themselves empower ergodicity-requiring models (e.g., vector autoregressive, VAR, models) to model what would otherwise be too nonstationary to model. Consider, for instance, fractal fluctuations in the nervous system. Submitting raw position/physiology/kinematic values into VAR models leads to poor convergence because the residuals are never serially independent. The failure to converge arises because the raw values are serially dependent across time for reasons beyond linear correlations. Yes, there are vector autoregressive fractionally integrated (VARFI) and vector autoregressive fractionally integrated moving-average (VARFIMA) models that can fit vector (i.e., multivariate) autoregression relationships around long-range memory of the fractionally integrated (the “FI” in “VARFI” and “VARFIMA”) sort.
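This nomenclature is easy to verify numerically; a short numpy sketch contrasts the time averages of the walk (process 3) with those of its steps (process 1):

```python
import numpy as np

rng = np.random.default_rng(11)
steps = rng.normal(0, 1, (200, 5000))  # process 1: i.i.d. Gaussian steps
walks = steps.cumsum(axis=1)           # process 3: the random walks

# Steps are ergodic: each walk's time-averaged squared step agrees with the
# ensemble value (the variance, 1), so the spread across walks is tiny.
time_avg_sq_step = (steps**2).mean(axis=1)
print(time_avg_sq_step.std())  # small: time averages agree across walks

# The walk is not: its time-averaged position differs wildly across walks.
time_avg_pos = walks.mean(axis=1)
print(time_avg_pos.std())  # large: no ensemble value represents any one walk
```

The squared step is exactly the kind of ergodic observable of a nonergodic process that the text describes.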
But these solutions are limited: the VARFI and VARFIMA models are computationally expensive, and elegant methods like detrended cross-correlation analysis can only treat two variables and do not estimate the unique effects of either. On the other hand, VAR modeling of fractal and multifractal estimates yields stable residuals in accordance with the ergodic assumptions, and such estimates have repeatedly shown both systematic responses to experimental manipulations and predictive relationships with individual participant responses. Although some moderate failure to have serially independent residuals leaves fractional Gaussian noise (fGn) ergodic, strong failures of serial independence (e.g., long-range memory as in 1/f “pink” noise) dissociate the scaling properties from any ergodic Gaussian properties (Box 4; Fig. 1).

Fig. 1. Nonlinear intermittency in physiological signals is nonergodic. (a) Two representative trajectories of healthy human center of pressure (CoP) displacement, one showing white Gaussian noise (Hurst exponent, H = 0.41) and the other showing long-range temporal correlations or fractality (H = 0.95). Note that a signal shows fractality if 0.5 < H < 1; the closer H is to 1, the stronger the fractality. (b) The ergodicity breaking parameter, EB, converges to zero in the limit t → ∞ if the signal contains white Gaussian noise but may not converge at all if the signal shows fractality. A fuller treatment of the relationship between EB and the Hurst exponent has been presented elsewhere.

This latter point suggests something rather broad about the potential future of negotiating a constructive rapport between our ergodic traditions and our growing grasp of nonergodicity.
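The ergodicity breaking parameter EB of Fig. 1 can be estimated directly. Here is a minimal sketch for the ergodic reference case, Brownian motion built from white Gaussian increments, using the standard definition EB = ⟨(δ²)²⟩/⟨δ²⟩² − 1 over the time-averaged squared displacement δ²:

```python
import numpy as np

rng = np.random.default_rng(5)

def eb_parameter(walks, lag):
    """EB = <ta_msd**2> / <ta_msd>**2 - 1 over an ensemble of trajectories;
    it approaches 0 for ergodic (here, ordinary Brownian) diffusion."""
    disp = walks[:, lag:] - walks[:, :-lag]
    ta_msd = (disp**2).mean(axis=1)  # time-averaged squared displacement
    return (ta_msd**2).mean() / ta_msd.mean()**2 - 1

# Brownian walks from white Gaussian increments (the ergodic case)
walks = rng.normal(0, 1, (300, 4000)).cumsum(axis=1)

print(eb_parameter(walks, lag=10))           # long record: EB near zero
print(eb_parameter(walks[:, :400], lag=10))  # 10x shorter record: larger EB
```

For white-noise-driven diffusion, EB shrinks as the record lengthens, which is the convergence to zero shown in Fig. 1b; strongly fractal signals resist this convergence.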
Here, we see an essential bridge to build between the academic’s luxury to delve into the fine-print entailments of ergodicity and the clinician’s urgency for practical use, that is, between mathematical curiosity and the need for a clear interpretation. Specifically, the ergodic models that adjudicate unique effect sizes and significance testing are ready to incorporate nonergodic aspects of our measurements. The challenge is to identify those ergodic estimates of nonergodicity, which requires the big-picture appreciation that, perhaps, what is nonergodic at the fine grain of our measurements submits to coarser-grained nonlinear-dynamical quantification. Once ergodically described, as in the case of multifractal estimates, nonergodicity can swim amidst the same statistical waters as the classical ergodic measures that clinicians might be more comfortable with. For instance, VAR models can test mutual effects amongst multifractal estimates from the same system. So, at a minimum, ergodic estimates of nonergodicity might be submitted to the same models that clinicians need to identify valid biomarkers.

An immensely useful possibility is that the clinically most relevant measures of nonergodicity might align with the greatest ergodicity. For instance, fractal estimates of multifractal, intermittent processes can hover around a mean. While maintaining all statistical evidence of nonlinear intermittency, prior changes in fractal scaling might predict later corrective effects, hence producing significantly narrower multifractal variety. This outcome appears to correlate clearly with a stable reduction of postural sway. That is, some of these measures of nonergodic phenomena can behave ergodically.

Conclusions
Understanding biological and psychological behavior requires a broad range of approaches and methods and faces the fundamental challenge of deciphering the “choreography” associated with complex behaviors and functions. It involves making “sense” of vast amounts of data of many types at multiple scales in time and space. Basing such a scientific project upon false assumptions about the nature of the data might be catastrophic. We have made a case that part of the reproducibility crisis, and the slow growth of knowledge about nonlinear, far-from-equilibrium dynamics, can be attributed to violations of ergodic assumptions. We have offered the possibility that both of these endeavors could immensely benefit from respecting nonergodicity. The stochastic process involved in the study phenomena needs to be made explicit, and the nonergodic data need to be transformed to find an appropriate ergodic observable. This new observable can be modeled to obtain stable statistical outcomes. Taking such steps will lead to a paradigm shift, allowing researchers to investigate the nonstationary, far-from-equilibrium processes that characterize the creativity and emergence of biological and psychological behavior.
ACKNOWLEDGMENTS
We thank Ole Peters for valuable suggestions that helped solidify some of the ideaspresented in this review.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
REFERENCES
1. Hamaker, E. Why researchers should think “within-person”: A paradigmatic rationale. in
Handbook of Research Methods for Studying Daily Life
Front. Psychol. , 513 (2013).3. Fisher, A. J., Medaglia, J. D. & Jeronimus, B. F. Lack of group-to-individualgeneralizability is a threat to human subjects research. Proc. Natl. Acad. Sci. ,E6106–E6115 (2018).4. Adolf, J., Schuurman, N. K., Borkenau, P., Borsboom, D. & Dolan, C. V. Measurementinvariance within and between individuals: A distinct problem in testing the equivalenceof intra- and inter-individual model structures.
Front. Psychol., 883 (2014).
5. Thirumalai, D., Mountain, R. D. & Kirkpatrick, T. R. Ergodic behavior in supercooled liquids and in glasses. Phys. Rev. A, 3563–3574 (1989).
6. Deng, W. & Barkai, E. Ergodic properties of fractional Brownian-Langevin motion. Phys. Rev. E, 11112 (2009).
7. Berman, Y., Peters, O. & Adamou, A. Wealth inequality and the ergodic hypothesis: Evidence from the United States. Available SSRN 2794830 (2019).
8. Prinz, F., Schlange, T. & Asadullah, K. Believe it or not: How much can we rely on published data on potential drug targets?
Nat. Rev. Drug Discov. , 712 (2011).9. Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J. & Reinero, D. A. Contextualsensitivity in scientific reproducibility. Proc. Natl. Acad. Sci. , 6454–6459 (2016).10. Fang, F. C. & Casadevall, A. Retracted science and the retraction index.
Infect. Immun., 3855–3859 (2011).
11. Mobley, A., Linder, S. K., Braeuer, R., Ellis, L. M. & Zwelling, L. A survey on data reproducibility in cancer research provides insights into our limited ability to translate findings from the laboratory to the clinic. PLoS One, e63221 (2013).
12. Open Science Collaboration. Estimating the reproducibility of psychological science. Science, aac4716 (2015).
13. Gilbert, D. T., King, G., Pettigrew, S. & Wilson, T. D. Comment on “Estimating the reproducibility of psychological science”.
Science, 1037 (2016).
14. Botvinik-Nezer, R. et al.
Variability in the analysis of a single neuroimaging dataset bymany teams.
Nature , 84–88 (2020).15. Eklund, A., Nichols, T. E. & Knutsson, H. Cluster failure: Why fMRI inferences for spatialextent have inflated false-positive rates.
Proc. Natl. Acad. Sci. , 7900–7905 (2016).16. Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F. & Baker, C. I. Circular analysis insystems neuroscience: The dangers of double dipping.
Nat. Neurosci. , 535–540(2009).17. Stikov, N., Trzasko, J. D. & Bernstein, M. A. Reproducibility and the future of MRIresearch. Magn. Reson. Med. , 1981–1983 (2019).18. Wallach, J. D., Boyack, K. W. & Ioannidis, J. P. A. Reproducible research practices,transparency, and open access data in the biomedical literature, 2015–2017. PLOSBiol. , e2006930 (2018).19. Hardwicke, T. E. et al. Data availability, reusability, and analytic reproducibility:Evaluating the impact of a mandatory open data policy at the journal Cognition.
R. Soc.Open Sci. , 180448 (2020).20. Gilmore, R. O., Diaz, M. T., Wyble, B. A. & Yarkoni, T. Progress toward openness,transparency, and reproducibility in cognitive neuroscience. Ann. N. Y. Acad. Sci. ,5–18 (2017).21. Halsey, L. G., Curran-Everett, D., Vowler, S. L. & Drummond, G. B. The fickle P valuegenerates irreproducible results.
Nat. Methods , 179–185 (2015).22. Halsey, L. G. The reign of the p -value is over: What alternative analyses could weemploy to fill the power vacuum? Biol. Lett. , 20190174 (2019).23. Button, K. S. et al. Power failure: Why small sample size undermines the reliability ofneuroscience.
Nat. Rev. Neurosci., 365–376 (2013).
24. Maxwell, S. E., Kelley, K. & Rausch, J. R. Sample size planning for statistical power and accuracy in parameter estimation. Annu. Rev. Psychol., 537–563 (2007).
25. Krzywinski, M. & Altman, N. Error bars. Nat. Methods, 921–922 (2013).
26. Ho, J., Tumkaya, T., Aryal, S., Choi, H. & Claridge-Chang, A. Moving beyond P values: data analysis with estimation graphics. Nat. Methods, 565–566 (2019).
27. Ioannidis, J. P. A., Munafò, M. R., Fusar-Poli, P., Nosek, B. A. & David, S. P. Publication and other reporting biases in cognitive sciences: Detection, prevalence, and prevention. Trends Cogn. Sci., 235–241 (2014).
28. Molenaar, P. C. M. A manifesto on psychology as idiographic science: Bringing the person back into scientific psychology, this time forever. Meas. Interdiscip. Res. Perspect., 201–218 (2004).
29. Molenaar, P. C. M. On the implications of the classical ergodic theorems: Analysis of developmental processes has to focus on intra-individual variation. Dev. Psychobiol., 60–69 (2008).
30. Lowie, W. M. & Verspoor, M. H. Individual differences and the ergodicity problem. Lang. Learn., 184–206 (2019).
31. Medaglia, J. D., Ramanathan, D. M., Venkatesan, U. M. & Hillary, F. G. The challenge of non-ergodicity in network neuroscience. Netw. Comput. Neural Syst., 148–153 (2011).
32. Molenaar, P. C. M., Sinclair, K. O., Rovine, M. J., Ram, N. & Corneal, S. E. Analyzing developmental processes on an individual level using nonstationary time series modeling. Dev. Psychol., 260–271 (2009).
33. Molenaar, P. C. M. & Campbell, C. G. The new person-specific paradigm in psychology. Curr. Dir. Psychol. Sci., 112–117 (2009).
34. Fitelson, B. Confirmation, causation, and Simpson’s paradox. Episteme, 297–309 (2017).
35. Lerman, K. Computational social scientist beware: Simpson’s paradox in behavioral data. J. Comput. Soc. Sci., 49–58 (2018).
36. Bains, W. What do we think life is? A simple illustration and its consequences. Int. J. Astrobiol.
, 101–111 (2014).
37. Hasselman, F. & Bosman, A. M. T. Studying complex adaptive systems with internal states: A recurrence network approach to the analysis of multivariate time-series data representing self-reports of human experience. Front. Appl. Math. Stat., 9 (2020).
38. Hamaker, E. L., Dolan, C. V. & Molenaar, P. C. M. Statistical modeling of the individual: Rationale and application of multivariate stationary time series analysis. Multivariate Behav. Res., 207–233 (2005).
39. Castro-Schilo, L. & Ferrer, E. Comparison of nomothetic versus idiographic-oriented methods for making predictions about distal outcomes from time series data. Multivariate Behav. Res., 175–207 (2013).
40. Kulkarni, A. M., Dixit, N. M. & Zukoski, C. F. Ergodic and non-ergodic phase transitions in globular protein suspensions. Faraday Discuss., 37–50 (2003).
41. Manzo, C. et al.
Weak ergodicity breaking of receptor motion in living cells stemmingfrom random diffusivity.
Phys. Rev. X , 11021 (2015).42. Nosonovsky, M. & Roy, P. Allometric scaling law and ergodicity breaking in the vascularsystem. Microfluid. Nanofluidics , 53 (2020).43. Fitts, P. M. The information capacity of the human motor system in controlling theamplitude of movement. J. Exp. Psychol. , 381–391 (1954).44. Krakauer, J. W., Hadjiosif, A. M., Xu, J., Wong, A. L. & Haith, A. M. Motor Learning. in Comprehensive Physiology
Cogn. Sci. ,211–250 (2011).46. Pacheco, M. M., Lafe, C. W. & Newell, K. M. Search strategies in the perceptual-motorworkspace and the acquisition of coordination, control, and skill. Front. Psychol. ,1874 (2019).47. Siegler, R. S. The perils of averaging data over strategies: An example from children’saddition. J. Exp. Psychol. Gen. , 250–264 (1987).48. Shaw, G. L., Harth, E. & Scheibel, A. B. Cooperativity in brain function: Assemblies ofapproximately 30 neurons.
Exp. Neurol. , 324–358 (1982).49. Vaadia, E. et al. Dynamics of neuronal interactions in monkey cortex in relation tobehavioural events.
Nature , 515–518 (1995).50. Shadlen, M. N. & Newsome, W. T. The variable discharge of cortical neurons:Implications for connectivity, computation, and information coding.
J. Neurosci., 3870–3896 (1998).
51. Masuda, N. & Aihara, K. Ergodicity of spike trains: When does trial averaging make sense? Neural Comput., 1341–1372 (2003).
52. Peters, O. & Maximilian, W. A recipe for irreproducible results. arXiv Curr. Biol., R60–R62 (2008).
54. Hilbert, M. Toward a synthesis of cognitive biases: How noisy information processing can bias human decision making. Psychol. Bull., 211–237 (2012).
55. Wallot, S. & Kelty-Stephen, D. G. Interaction-dominant causation in mind and brain, and its implication for questions of generalization and replication.
Minds Mach. , 353–374(2018).56. Van Orden, G. C., Holden, J. G. & Turvey, M. T. Human cognition and 1/ f scaling. J.Exp. Psychol. Gen. , 117–123 (2005).57. Brown, S. D. & Steyvers, M. Detecting and predicting changes.
Cogn. Psychol. , 49–67 (2009).58. Kelty-Stephen, D. G., Lee, I.-C., Carver, N. S., Newell, K. M. & Mangalam, M. Visualeffort moderates a self-correcting nonlinear postural control policy. bioRxiv bioRxiv Front. Integr. Neurosci. , 62 (2011).61. Richardson, L. F. The analogy between mental images and sparks. Psychol. Rev. ,214–227 (1930).62. Hollich, G., Hirsh-Pasek, K. & Golinkoff, R. Breaking the language barrier: Anemergentist coalition model for the origins of word learning. in Monographs for theSociety for Research in Child Development
65 (3, Serial No. 262) (2000).63. Box, G. E. P. & Pierce, D. A. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models.
J. Am. Stat. Assoc. , 1509–1526(1970).64. Werner, G. Fractals in the nervous system: conceptual implications for theoreticalneuroscience. Front. Physiol. , 15 (2010).65. Kilian, L. & Lütkepohl, H. Structural Vector Autoregressive Analysis . (CambridgeUniversity Press, 2017).66. Podobnik, B. & Stanley, H. E. Detrended cross-correlation analysis: A new method foranalyzing Two nonstationary time series.
Phys. Rev. Lett. , 84102 (2008).67. Mangalam, M., Carver, N. S. & Kelty-Stephen, D. G. Global broadcasting of local fractalfluctuations in a bodywide distributed system supports perception via effortful touch.
Chaos, Solitons & Fractals , 109740 (2020).68. Mangalam, M., Carver, N. S. & Kelty-Stephen, D. G. Multifractal signatures ofperceptual processing on anatomical sleeves of the human body.
J. R. Soc. Interface, 20200328 (2020).
69. Mangalam, M. & Kelty-Stephen, D. G. Multiplicative-cascade dynamics supports whole-body coordination for perception via effortful touch. Hum. Mov. Sci., 102595 (2020).
70. Peters, O. The ergodicity problem in economics. Nat. Phys., 1216–1221 (2019).
71. Peters, O. & Gell-Mann, M. Evaluating gambles using dynamics. Chaos An Interdiscip. J. Nonlinear Sci., 23103 (2016).
72. Kelty-Stephen, D. G. Multifractal evidence of nonlinear interactions stabilizing posture for phasmids in windy conditions: A reanalysis of insect postural-sway data. PLoS One 13