Exploring a strongly non-Markovian animal behavior
Vasyl Alba, Gordon J. Berman, William Bialek, Joshua W. Shaevitz
Joseph Henry Laboratories of Physics and Lewis–Sigler Institute for Integrative Genomics, Princeton University, Princeton NJ 08544
Department of Engineering Science and Applied Mathematics and NSF–Simons Center for Quantitative Biology, Northwestern University, Evanston IL 60208
Departments of Biology and Physics, Emory University, Atlanta GA 30322
Initiative for the Theoretical Sciences, The Graduate Center, City University of New York, 365 Fifth Avenue, New York NY 10016
A freely walking fly visits roughly 100 stereotyped states in a strongly non-Markovian sequence. To explore these dynamics, we develop a generalization of the information bottleneck method, compressing the large number of behavioral states into a more compact description that maximally preserves the correlations between successive states. Surprisingly, preserving these short time correlations with a compression into just two states captures the long ranged correlations seen in the raw data. Having reduced the behavior to a binary sequence, we describe the distribution of these sequences by an Ising model with pairwise interactions, which is the maximum entropy model that matches the two-point correlations. Matching the correlation function at longer and longer times drives the resulting model toward the Ising model with inverse square interactions and near zero magnetic field. The emergence of this statistical physics problem from the analysis of real data on animal behavior is unexpected.
In the twentieth century there were two very distinct approaches to the characterization of animal behavior. Ethologists focused on behavior in its natural context, often describing in qualitative terms phenomena of great complexity [1, 2]. Psychophysicists, in contrast, brought behavior into the laboratory, constraining subjects to choose from a small set of alternatives and quantifying the probabilities of different choices as a function of sensory inputs and task constraints [3, 4]. Recently there has been considerable interest in quantifying unconstrained and more naturalistic behaviors. Examples include exploring the variability of eye movement trajectories in primates [5], and the postural dynamics of freely moving
C. elegans [6, 7], walking flies [8–10], and mice [11, 12]. The combination of high resolution video imaging and efficient AI tools [13, 14] is making these approaches more generally applicable, and these data have focused attention on the wide range of time scales in the behavior of single organisms, from milliseconds to a lifetime [15–19].

The behavior of a fly walking freely in a featureless arena can be described as a sequence of transitions among discrete, stereotyped states [9]. Some of these states correspond to actions with a simple verbal description, such as grooming particular body parts, while other states are not so simple; nonetheless all flies of a single species revisit the same ∼
100 states, and one can recognize the same states in closely related species. Although the description of behavior as a sequence of states often is accompanied by an analysis of transitions from one state to the next, and the (possibly implicit) hypothesis that these transitions are independent of one another, fly behavior dramatically violates this Markovian assumption [10]. We would like to find a simple phenomenological description of behavioral sequences that captures this non-Markovian structure, ideally leading to some insight into the internal states that make it possible.

The non-Markovian character of the fly's trajectory through state space can be seen in several ways. In particular, if we look at the probability that the fly moves from one state to another after a time τ, then in a Markov model all of these transition matrices are powers of the fundamental transition matrix describing the model, and the eigenvalues of this matrix will decay at a constant rate. This predicts that essentially all memory of the initial state will be lost after a few transitions. Instead, memory persists out to thousands of transitions, and the eigenvalues of the transition matrix decay more and more slowly as we look across longer time scales [10]. We can see the same effect in the mutual information between states separated by a time τ, or in the probability that the fly returns to the same state after τ transitions, P_c(τ), shown in Fig 1. In both of these representations we detect correlations in behavioral state extending over thousands of transitions, far beyond what is predicted by a Markov model matched to the measured transition probabilities at τ = 1.
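The Markov benchmark can be made concrete: under a Markov assumption the τ-step transition matrix is the τ-th power of the one-step matrix, so the return probability P_c(τ) relaxes geometrically to its asymptote. A minimal sketch, in which a small random chain is a hypothetical stand-in for the ∼100 behavioral states, not the fly data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the behavioral data: a small Markov chain with
# N states; T1[i, j] = P(next state = j | current state = i).
N = 5
T1 = rng.random((N, N))
T1 /= T1.sum(axis=1, keepdims=True)

# Stationary distribution: the left eigenvector of T1 with eigenvalue 1.
evals, evecs = np.linalg.eig(T1.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

def return_probability(T1, pi, tau):
    """P_c(tau) under the Markov model: probability of revisiting the
    current state after tau transitions, averaged over the stationary state."""
    return float(pi @ np.diag(np.linalg.matrix_power(T1, tau)))

# A Markov model forces P_c(tau) to relax geometrically to sum_i pi_i^2,
# at a rate set by the second-largest eigenvalue of T1: memory of the
# initial state is lost after a few transitions.
baseline = float(pi @ pi)
for tau in (1, 5, 50):
    print(tau, return_probability(T1, pi, tau) - baseline)
```

The contrast with the data is exactly this geometric loss of memory: in Fig 1 the measured P̃_c(τ) stays above the Markov prediction out to thousands of transitions.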
More subtly, the decay of mutual information or return probability is gradual, and does not seem to be characterized by discrete time scales, echoing the slowing of eigenvalue decays.

Figure 1: Probability P_c(τ) that a behavioral state is revisited after τ transitions. We normalize so that P̃_c(τ) = P_c(τ)/P_c(τ → ∞) − 1 decays to zero. Means and standard errors across 59 flies (red) compared with predictions from a Markov model that matches observed transition probabilities (green). In both cases we analyze individual flies and average P̃_c at the end.

The problem that we face in characterizing the non-Markovian structure of behavior is that with more than one hundred states, there are ∼10⁴ possible transitions between pairs of states. If we observe a single fly for one hour, we see roughly this number of transitions [9, 10]. If we make much longer observations on single flies we encounter non-stationarity, while if we merge data from many flies we may obscure individual differences. To make progress we need to simplify our description of the behavioral states, but we need to do this in a way that preserves the long memory seen in the full description. Concretely, we want to map the behavioral state at each moment in time into some reduced or clustered description, x_t → z_t, with the hope that we can preserve the long memories that we see in the state sequence; in general this mapping can be probabilistic, described by P(z_t|x_t). The "information bottleneck" problem [20] is the search for a compression that maintains information about some other variable, which we could take to be a behavioral state in the future [10], but here we want a more symmetric formulation (Fig 2). We choose compressions that preserve temporal correlations by maximizing the mutual information between compressed variables at different times, I(z_t; z_{t+τ}).
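The quantity being preserved can be estimated directly from a discrete sequence with a plug-in histogram estimator. The sketch below uses a synthetic "sticky" two-state chain, not the fly data, to show that the estimate is large at short lags and decays as the lag grows:

```python
import numpy as np

def mutual_information(seq, tau, n_states):
    """Plug-in estimate (in bits) of I(x_t; x_{t+tau}) from one long
    discrete sequence, using the empirical joint histogram at lag tau."""
    a, b = seq[:-tau], seq[tau:]
    joint = np.zeros((n_states, n_states))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / indep[nz])))

# Synthetic check: a "sticky" two-state chain keeps information across
# short lags, and the plug-in estimate decays as tau grows.
rng = np.random.default_rng(1)
seq = np.zeros(200_000, dtype=int)
for t in range(1, len(seq)):
    seq[t] = seq[t - 1] if rng.random() < 0.95 else 1 - seq[t - 1]
print(mutual_information(seq, 1, 2), mutual_information(seq, 50, 2))
```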
To control the complexity of our description, we limit the information that we capture about the original variables, I(z_t; x_t). In principle the mapping x_t → z_t could be probabilistic, so we solve the variational problem

max_{P(z|x)} [I(z_t; z_{t+τ}) − T I(z_t; x_t)],   (1)

where the Lagrange multiplier T imposes the constraint on I(z_t; x_t). As in the original bottleneck problem, T plays the role of a temperature, such that as T → 0 the mapping x → z becomes deterministic. Following the same steps as in Ref [20], we find that the solution to the maximization problem in Eq (1) obeys the self-consistent equation

P(z|x) = [P(z)/Z(x; T)] exp[−F(x, z)/T],   (2)

F(x_t, z_t) = D_KL[P(z_{t+τ}|x_t) ‖ P(z_{t+τ}|z_t)] + D_KL[P(z_{t−τ}|x_t) ‖ P(z_{t−τ}|z_t)],   (3)

where D_KL[P‖Q] is the Kullback–Leibler divergence between the distributions P and Q, and Z(x; T) is a normalization constant; the two terms in Eq (3) treat past and future symmetrically. Again following Ref [20], we can turn the self-consistent equation into an iterative algorithm. We have explored solutions with different cardinality for z_t, and find that they form a hierarchical clustering scheme, as expected from earlier work [10]. Our focus here is on the most severe compression, in which z_t has just two states, and the limit T → 0, where the mapping x → z becomes deterministic. Further, we search for compressions that preserve information from one state to the next, corresponding to τ = 1 in Eq (1). When we compress into just two states we turn behavior into a binary sequence, or equivalently a one-dimensional chain of Ising spins z_t = σ_t = ±1; the structure of this mapping is illustrated in Fig 3.
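The self-consistent condition of Eqs (2)-(3) can be iterated in a few lines. The sketch below is one possible implementation under stated assumptions (the empirical joint P(x_t, x_{t+τ}) as input, random restarts scored by the preserved information I(z_t; z_{t+τ})), not the authors' code; the toy joint with two "sticky" groups of states is hypothetical, standing in for the full set of behavioral states:

```python
import numpy as np

def mi_nats(joint):
    """Mutual information (nats) of a 2D joint distribution."""
    joint = joint / joint.sum()
    indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / indep[nz])))

def bottleneck_once(P_joint, n_z, T, n_iter, rng, eps=1e-12):
    """Iterate q(z|x) ∝ P(z) exp[-F(x,z)/T], with F the sum of KL
    divergences to the forward and backward predictions, Eqs (2)-(3)."""
    n_x = P_joint.shape[0]
    px = P_joint.sum(axis=1)                         # P(x_t)
    fwd = P_joint / (px[:, None] + eps)              # P(x_{t+tau} | x_t)
    bwd = (P_joint / (P_joint.sum(axis=0) + eps)).T  # P(x_{t-tau} | x_t)
    q = rng.dirichlet(np.ones(n_z), size=n_x)        # q[x, z] = P(z | x)
    for _ in range(n_iter):
        pz = px @ q + eps
        pxz = (px[:, None] * q) / pz[None, :]        # P(x | z)
        A = fwd @ q + eps                            # P(z_{t+tau} | x_t)
        B = bwd @ q + eps                            # P(z_{t-tau} | x_t)
        F = pxz.T @ A + eps                          # P(z_{t+tau} | z_t)
        G = pxz.T @ B + eps                          # P(z_{t-tau} | z_t)
        d = (A * np.log(A)).sum(1, keepdims=True) - A @ np.log(F).T \
          + (B * np.log(B)).sum(1, keepdims=True) - B @ np.log(G).T
        logq = np.log(pz)[None, :] - d / T
        q = np.exp(logq - logq.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
    return q

def symmetric_bottleneck(P_joint, n_z, T=0.1, n_iter=300, n_restart=5):
    """Keep the random restart that best preserves I(z_t; z_{t+tau})."""
    best_q, best = None, -np.inf
    for s in range(n_restart):
        q = bottleneck_once(P_joint, n_z, T, n_iter, np.random.default_rng(s))
        score = mi_nats(q.T @ P_joint @ q)
        if score > best:
            best_q, best = q, score
    return best_q

# Hypothetical toy joint P(x_t, x_{t+1}): six states in two "sticky" groups.
n = 6
P = np.full((n, n), 0.05)
P[:3, :3] = 1.0
P[3:, 3:] = 1.0
P /= P.sum()
labels = symmetric_bottleneck(P, n_z=2).argmax(axis=1)
print(labels)  # states in the same group should share a label
```

At small T the rows of q saturate, recovering the deterministic mapping described in the text.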
Each point in the two-dimensional behavioral space represents a short temporal sequence from the original video recording; the probability distribution in this plane has many well resolved peaks, and the dynamics consist of sojourns in the neighborhood of one peak followed by a quick jump to another, allowing us to define the N_s = 123 discrete states and their boundaries [9]. States which are neighbors in these two dimensions are similar by construction, and can be grouped (informally) into the clusters shown in Fig 3a: anterior movements, posterior movements, locomotion gaits, etc. In Fig 3b we have the results of the optimal mapping x → z, which groups together a range of relatively rapid movements into one state σ = +1 while walking and idling are mapped to σ = −1 [21].

Figure 3: (a) The original N_s = 123 states and their boundaries, grouped informally into clusters [9]. (b) The optimal compression into two states, with σ = +1 shown in red and σ = −1 shown in grey.

Once we have reduced behavioral trajectories to a chain of Ising spins, we can measure temporal correlations with the usual spin–spin correlation function [22],

C(τ) = ⟨σ_t σ_{t+τ}⟩ − ⟨σ_t⟩².   (4)

Perhaps surprisingly, although we asked for a compression that preserves information only across one behavioral transition, Fig 4 shows that the resulting binary or Ising variables have measurable correlations out to thousands of transitions. We would like to describe the distribution of these binary sequences, P({σ_t}), going beyond the Markov approximation. We want our approximation to match the observed bias between the two behavioral states, measured by the expectation value ⟨σ_t⟩. We also want to capture the long-ranged correlations in Fig 4, so we insist that the correlation function C(τ) computed from the model P({σ_t}) match the correlation function that we observe experimentally. The minimally structured, or maximum entropy model that matches ⟨σ_t⟩ and C(τ) is

P({σ_t}) = (1/Z) exp[h Σ_t σ_t + ½ Σ_{t,t′} σ_t J(t−t′) σ_{t′}],   (5)

where the interactions J(τ) must be tuned to match the correlation function C(τ), and the field h must be tuned to match the asymmetry ⟨σ_t⟩ [23, 24]. We solve this inverse problem following the same strategy as in previous work [25], using Monte Carlo to estimate C(τ) in the model and adjusting J(τ) in proportion to the difference between this estimate and the measured values.
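The inverse-problem strategy just described (Monte Carlo estimates of C(τ) and ⟨σ_t⟩, with J(τ) and h nudged in proportion to the mismatch) can be sketched as follows. Everything here is a toy: the chain length, learning rate, and the "data" generated from an assumed inverse-square model are illustrative choices, not the authors' parameters:

```python
import numpy as np

def lag_matrix(J_lag, n):
    """Coupling matrix J[t, t'] = J_lag[|t - t'| - 1] for 1 <= |t-t'| <= tau_max."""
    d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    J = np.zeros((n, n))
    mask = (d >= 1) & (d <= len(J_lag))
    J[mask] = np.asarray(J_lag)[d[mask] - 1]
    return J

def sample(J, h, n_sweeps, rng):
    """Metropolis sample from P({σ}) ∝ exp[h Σ σ_t + ½ ΣΣ σ_t J(t-t') σ_t']."""
    n = J.shape[0]
    s = rng.choice([-1.0, 1.0], size=n)
    for _ in range(n_sweeps):
        for t in rng.permutation(n):
            dlogp = -2.0 * s[t] * (h + J[t] @ s)  # log-prob change if σ_t flips
            if dlogp >= 0 or rng.random() < np.exp(dlogp):
                s[t] = -s[t]
    return s

def model_stats(J_lag, h, n, taus, n_samples, n_sweeps, rng):
    """Monte Carlo estimates of <σ_t> and C(τ) = <σ_t σ_{t+τ}> − <σ_t>²."""
    J = lag_matrix(J_lag, n)
    m, corr = 0.0, np.zeros(len(taus))
    for _ in range(n_samples):
        s = sample(J, h, n_sweeps, rng)
        m += s.mean() / n_samples
        corr += [np.mean(s[:-tau] * s[tau:]) / n_samples for tau in taus]
    return m, corr - m**2

rng = np.random.default_rng(2)
n, taus = 48, [1, 2, 3, 4]

# Synthetic "data" from an assumed inverse-square model, J(τ) = 0.3/τ², h = 0.05.
J_true = 0.3 / np.arange(1, 9, dtype=float) ** 2
m_data, C_data = model_stats(J_true, 0.05, n, taus, 80, 25, rng)

# Inverse problem: start from independent spins, adjust J(τ) and h in
# proportion to the mismatch between model and data statistics.
J_fit, h_fit, eta = np.zeros(len(taus)), 0.0, 0.5
for _ in range(2):
    m_model, C_model = model_stats(J_fit, h_fit, n, taus, 80, 25, rng)
    J_fit += eta * (C_data - C_model)
    h_fit += eta * (m_data - m_model)
print(C_data, J_fit, h_fit)
```

In practice one iterates many such update steps, with much longer chains and many more Monte Carlo samples, until the model correlations match the data within errors.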
Even without detailed calculation, we know that generating long-ranged correlations in one dimension (here, time) requires long-ranged interactions, and this is what we find (Fig 5). Specifically, if we try to match the correlations C(τ) for |τ| ≤ τ_max, then J(|τ| > τ_max) = 0, but as we increase τ_max we "uncover" interactions that couple states with longer and longer separations. We see no end to this, within the limits of our data. More subtly, although the asymmetry in the two clusters of behavioral states is large, corresponding to a sizeable "magnetization" ⟨σ_t⟩, the magnetic field that we find is nearly zero when we match correlations out to τ_max = 10. Intuitively, long-ranged correlations imply that sequences of states are moving collectively, and hence a small intrinsic bias is amplified. Over more than two decades in τ, the interactions are very nearly J(τ) = J/τ². The Ising model with such "inverse square" interactions has a fascinating history, and played an important role in the development of scaling ideas that presaged the full development of the renormalization group [26]. It is quite startling to see this model emerge from the analysis of data on animal behavior.

Perhaps the most important prediction of our model is the probability of the behavioral state at time t given the states at all other times t′,

P(σ_t | {σ_{t′≠t}}) = exp[h_eff({σ_{t′≠t}}) σ_t] / (2 cosh[h_eff({σ_{t′≠t}})]),   (6)

h_eff({σ_{t′≠t}}) = h + Σ_{t′≠t} J(t−t′) σ_{t′}.   (7)

Figure 4: Connected correlation function for the binary description of behavior, Eq (4). Data points and error bars (red) as in Fig 1, compared with a Markov model over the two states (green). Solid line is a single exponential decay with correlation time τ_c ≈ 7.
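Equation (6) implies a parameter-free check: collect moments in time with similar h_eff, and the conditional mean behavioral state must lie on ⟨σ⟩ = tanh(h_eff). A self-contained sketch of this binning analysis, with an assumed Gaussian field distribution standing in for the measured, strongly bimodal one:

```python
import numpy as np

rng = np.random.default_rng(3)

# Draw many (h_eff, σ) pairs from the conditional distribution of Eq (6),
# using an assumed broad field distribution, then bin by h_eff and compare
# the conditional mean to tanh(h_eff).
n = 200_000
h_eff = rng.normal(0.0, 1.5, size=n)            # assumed field distribution
p_up = np.exp(h_eff) / (2.0 * np.cosh(h_eff))   # P(σ = +1 | h_eff), Eq (6)
sigma = np.where(rng.random(n) < p_up, 1.0, -1.0)

edges = np.linspace(-3, 3, 13)
centers = 0.5 * (edges[:-1] + edges[1:])
idx = np.digitize(h_eff, edges) - 1
mean_sigma = np.array([sigma[idx == b].mean() for b in range(len(centers))])
print(np.max(np.abs(mean_sigma - np.tanh(centers))))  # should be small
```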
Figure 5: Interactions J(τ) needed to reproduce the correlation function C(τ) for τ < τ_max, with τ_max = 100 (red) and two larger values of τ_max (green and blue points, with errors). Error bars are the standard deviations across models that match C(τ) from randomly chosen halves of the flies in the experimental population. Black dashed line is J(τ) = 1/τ², for comparison.

We can test this prediction by walking through the data and collecting all the moments in time where h_eff falls into some small range, and averaging the behavioral states over these times [27]; we should find ⟨σ⟩ = tanh(h_eff). To avoid any dangers of over-fitting [28], we assume that the system is described exactly by the inverse square Ising model, J(τ) = J/τ², and fit the one parameter J. Results are shown in Fig 6a.

The agreement between theory and experiment that we see in Fig 6a is very good, and the best fit value of J is consistent with the estimate of J(τ = 1) in Fig 5. The largest deviations between theory and experiment are at extreme values of h_eff, where behavior is even more nearly deterministic (⟨σ⟩ → ±1) than predicted by the theory. The distribution of h_eff is strongly bimodal, as if the fly were alternating between strong internal biases, and we see this in the time course of h_eff in Fig 6b.

Equation (6) has the form of a behavioral state responding at each moment in time to a bias h_eff, but these biases are determined by behavioral states at other times through Eq (7). Importantly, the full model in Eq (5) can be rewritten in a form where behavioral states respond independently at each moment in time to a completely internal, fluctuating bias φ_t [19],

P({σ_t}) = ∫ Dφ P({φ_t}) Π_t exp[σ_t(h + φ_t)] / (2 cosh(h + φ_t)).   (8)

The probability distribution of biases is given by

P({φ_t}) = (1/Z) exp[−S({φ_t})],   (9)

S({φ_t}) = ½ Σ_{t,t′} φ_t K(t−t′) φ_{t′} − Σ_t ln cosh(h + φ_t),   (10)

and the kernel K(τ) is the inverse of the interaction J(τ),

Σ_{t″} K(t−t″) J(t″−t′) = δ_{tt′}.   (11)

For models in which J(τ) ∼ 1/τ², at large τ we have K(τ) ∼ −1/τ², so that the internal biases also must have interactions that extend over long spans of time; there must be yet more internal degrees of freedom which carry memory across these spans. To generate the ∼1/τ² behavior exactly requires that the dimensionality of these internal dynamics be effectively infinite [19]. In a different language, if we wanted to generate ∼1/τ² behavior exactly using a hidden Markov model, the number of hidden states would have to be effectively infinite as well. While not suggesting a specific mechanism, the explicit ∼1/τ² interactions provide a simple description of the phenomenology, the minimum required to match the observed correlations C(τ).

To summarize, we have tamed some of the complexity of the behavior in walking flies by compressing the behavioral states into binary variables which preserve correlations from one state to the next. The resulting binary sequence nonetheless captures long ranged correlations in the behavior, with memory detectable out to thousands of transitions. The simplest model that describes this behavior turns out to be an Ising model with almost perfectly inverse square interactions, and this model gives excellent quantitative predictions for the behavioral state at one moment in time given the surrounding sequence of states. Although very simple, this description points to an effectively infinite number of hidden states.

Figure 6: (a) Mean behavioral state at one moment in time vs the effective field determined by states at other times, from Eq (7) with J(τ) = J/τ² and the best-fit value of J. Points are means across 59 flies, and error bars are standard deviations across random halves of this population; line is ⟨σ⟩ = tanh(h_eff). (b) A short segment of h_eff vs time for one fly.

This work was supported in part by the National Science Foundation through the Center for the Physics of Biological Function (PHY–1734030), the Center for the Science of Information (CCF–0939370), the NSF–Simons Center for Quantitative Biology (DMS–1764421), and Grant PHY–1607612; by the National Institutes of Health (NS104889); and by the Simons Foundation.

[1] K von Frisch, Decoding the language of the bee. Science (1974).
[2] Ethology: The Mechanisms and Evolution of Behavior (WW Norton, New York, 1982).
[3] JL Lawson and GE Uhlenbeck, Threshold Signals. MIT Radiation Laboratory Series 24 (McGraw–Hill, New York, 1950).
[4] DM Green and JA Swets, Signal Detection Theory and Psychophysics (Wiley, New York, 1966).
[5] LC Osborne, SG Lisberger, and W Bialek, A sensory source for motor variation. Nature (2005).
[6] GJ Stephens, B Johnson-Kerner, WS Ryu, and W Bialek, Dimensionality and dynamics in the behavior of C. elegans. PLoS Comput Biol e1000028 (2008).
[7] T Ahamed, AC Costa, and GJ Stephens, Capturing the continuous complexity of behaviour in C. elegans. Nat Phys in press (2020); https://doi.org/10.1038/s41567-020-01036-8.
[8] K Branson, AA Robie, J Bender, P Perona, and MH Dickinson, High-throughput ethomics in large groups of Drosophila. Nat Methods (2009).
[9] GJ Berman, DM Choi, W Bialek, and JW Shaevitz, Mapping the stereotyped behaviour of freely moving fruit flies. J R Soc Interface (2014).
[10] GJ Berman, W Bialek, and JW Shaevitz, Predictability and hierarchy in Drosophila behavior. Proc Natl Acad Sci (USA) (2016).
[11] Neuron.
[12] Neuron, in press (2020); https://doi.org/10.1016/j.neuron.2020.11.016.
[13] A Mathis, P Mamidanna, KM Cury, T Abe, VN Murthy, MW Mathis, and M Bethge, DeepLabCut: Markerless pose estimation of user-defined body parts with deep learning. Nat Neurosci (2018).
[14] Nat Methods.
[15] Nat Phys.
[16] Neuron.
[17] Nat Neurosci.
[18] Curr Opin Neurobiol.
[20] N Tishby, FC Pereira, and W Bialek, The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, B Hajek and RS Sreenivas, eds, pp 368–377 (University of Illinois, 1999); arXiv:physics/0004057 (2000).
[21] This result is different from that in Ref [10], which used the (asymmetric) information bottleneck to find compressed representations that preserve the maximum information about the future state, I(z_t; x_{t+τ}). In the asymmetric case the optimal binary representation corresponds almost perfectly with moving vs idle.
[22] In comparing Figs 1 and 4, note that for binary variables we have P̃_c(τ) = C(τ)/[1 + ⟨σ_t⟩²].
[23] ET Jaynes, Information theory and statistical mechanics. Phys Rev (1957).
[24] W Bialek, Biophysics: Searching for Principles (Princeton University Press, Princeton NJ, 2012).
[25] G Tkačik, O Marre, D Amodei, E Schneidman, W Bialek, and MJ Berry II, Searching for collective behavior in a large network of sensory neurons. PLoS Comput Biol e1003408 (2014).
[26] PW Anderson, Basic Notions of Condensed Matter Physics (Benjamin/Cummings, Menlo Park CA, 1984).
[27] For a parallel discussion of states in a network of neurons: L Meshulam, JL Gauthier, CD Brody, DW Tank, and W Bialek, Collective behavior of place and non-place neurons in the hippocampal network. Neuron (2017).