Predictability and hierarchy in Drosophila behavior
Gordon J. Berman, ∗ William Bialek, and Joshua W. Shaevitz
Joseph Henry Laboratories of Physics and Lewis–Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544
(Dated: September 17, 2018)

Even the simplest of animals exhibit behavioral sequences with complex temporal dynamics. Prominent amongst the proposed organizing principles for these dynamics has been the idea of a hierarchy, wherein the movements an animal makes can be understood as a set of nested sub-clusters. Although this type of organization holds potential advantages in terms of motion control and neural circuitry, measurements demonstrating this for an animal's entire behavioral repertoire have been limited in scope and temporal complexity. Here, we use a recently developed unsupervised technique to discover and track the occurrence of all stereotyped behaviors performed by fruit flies moving in a shallow arena. Calculating the optimally predictive representation of the fly's future behaviors, we show that fly behavior exhibits multiple time scales and is organized into a hierarchical structure that is indicative of its underlying behavioral programs and its changing internal states.
I. INTRODUCTION
∗ E-mail: [email protected]. Current address: Department of Biology, Emory University, Atlanta, GA 30322

Animals perform a vast array of behaviors as they go about their daily lives, often in what appear to be repeated and non-random patterns. These sequences of actions, some innate and some learned, have dramatic consequences with respect to survival and reproductive function, from feeding, grooming, and locomotion to mating, child rearing, and the establishment of social structures. Moreover, these patterns of movement can be viewed as the final output of the complicated interactions between an organism's genes, metabolism, and neural signaling. As a result, understanding the principles behind how an animal generates behavioral sequences can provide a window into the biological mechanisms underlying the animal's movements, appetites, and interactions with its environment, as well as broader insights into how behaviors evolve.

The prevailing theory for the temporal organization of behavior, rooted in work from neuroscience, psychology, and evolution, is that the pattern of actions performed by animals is hierarchical [1–3]. In such a framework, actions are nested into modules on many scales, from simple motion primitives to complex behaviors to sequences of actions. Neural architectures related to behavior, such as the motor cortex, are anatomically hierarchical, supporting the idea that animals use a hierarchical representation of behavior in the brain [4–7]. Additionally, hierarchical organization is a hallmark of human design, from the layout of cities to the wiring of the internet, and its potential use in various biological contexts has been proposed as an organizing principle [2].

Despite the theoretical attractiveness of behavioral hierarchy, measurements showing that a particular animal's behavioral repertoire is organized in this manner often are limited in their applicability and scope. Typically, observations of hierarchy in the ordering of movement have considered a single behavioral type, such as grooming, ignoring relationships between more varied behavioral motifs [8–13]. Perhaps more problematic is that most analyses of behavior make use of methods, such as hierarchical clustering, that implicitly or explicitly impose a hierarchical structure onto the data without showing that such a representation is accurate. Lastly, to our knowledge, all measurements of a hierarchical organization of behavior limit their analysis to behavioral dynamics at a single time scale. This scale is often given by the results of fitting a Markov model, where the next step in a behavioral pattern only depends on the animal's current state. Even in the simplest of animals, however, there are many internal states such as hunger, reproductive drive, etc., and sequences of behaviors possess an effective memory of an animal's behavioral state that persists well into the future, a result noted in a wide variety of systems [14–17].

In this paper, we study the behavioral repertoire of fruit flies (
Drosophila melanogaster), attempting to characterize the temporal organization of their movements over the course of an hour. Decomposing the flies' movements into a set of stereotyped behaviors without making any a priori behavioral definitions [18], we find that their behavior exhibits long time scales, far beyond what would be predicted from a Markovian model. Applying methods from information theory, we show that a hierarchical representation of actions optimally predicts the future behavioral state of the fly. The best way to understand how future actions follow from the current behavioral state is to group these current behaviors in a nested manner, with fine-grained partitions being useful in predicting the near future, and coarser partitions being sufficient for predicting the relatively distant future. These results show that these animals control their movement via a hierarchy of behaviors at varying time scales, affirming and making precise a key concept in ethology.

II. EXPERIMENTS AND BEHAVIORAL STATES
As a testbed for probing questions of behavioral organization and hierarchy, we sought to measure the entire behavioral repertoire of a population of
Drosophila melanogaster in a specific environmental context. We probed the behavioral repertoire of individual, ground-based fruit flies in a largely featureless circular arena for one hour using a 100 Hz camera. Under these conditions, flies display many complex behaviors, including locomotion and grooming, that involve multiple parts of their bodies interacting at varying time scales. We recorded videos of 59 male flies using a custom-built tracking setup, producing more than 21 million images [18].

These data were used to generate a two-dimensional map of fly behavior based on an unsupervised approach that automatically identifies stereotyped actions (Fig. 1A; for full details see [18]). Briefly, this approach takes a set of translationally and rotationally aligned images of the flies and decomposes the dynamics of the observed pixel values into a low-dimensional basis set describing the flies' posture. Time series are produced by projecting the original pixel values onto this basis set, and the local spectrogram of these trajectories is then embedded into two dimensions [19]. Each position in the behavioral map corresponds to a unique set of postural dynamics; although this was not required by the analysis, nearby points represent similar motions, i.e. those involving related body parts executing similar temporal patterns.

In the resulting behavioral space, z, we estimate the probability distribution function P(z) and find that it contains a set of peaks corresponding to short segments of movement that are revisited multiple times by multiple individuals (Figure 1A). Pauses in the trajectories through this space, z(t), are interspersed with quick movements between the peaks. These pauses in z(t) at a particular peak correspond to the fly performing one of a large set of distinct, stereotyped behaviors such as right wing grooming, proboscis extension, or alternating tripod locomotion [18].
In all, we identify 117 unique stereotyped actions, with similar behaviors, i.e. those that utilize similar body parts at similar frequencies, located near each other in the behavioral map. A watershed algorithm is used to separate the peaks and, combined with a threshold on dz(t)/dt, to segment each movie into a sequence of discrete, stereotyped behaviors.

In this paper, we treat pauses at these peaks as our states, the lowest level of description of behavioral organization, and investigate the pattern of behavioral transitions among these states over time. We count time in units of the transitions between states, so we have a description of behavior as a discrete variable S(n) that can take on N = 117 different values at each discrete time n. Note that since we count time in units of transitions, we always have S(n + 1) ≠ S(n). Combining data from all 59 flies, we observe ≈ … behavioral transitions in total, or ≈ … per experiment.

FIG. 1. Transition probabilities and behavioral modularity. (A) Behavioral space probability density function (PDF). Here, each peak in the distribution corresponds to a distinct stereotyped movement. (B) One-step Markov transition probability matrix T(τ = 1). The 117 behavioral states are grouped by applying the predictive information bottleneck calculation and allowing 6 clusters (Eq. 4). Black lines denote the cluster boundaries. (C) Transition rates plotted on the behavioral map. Each red point represents the maximum of the local PDF, and the black lines represent the transition probabilities between the regions. Line thicknesses are proportional to the corresponding value of T(τ = 1)_{ij}, and right-handed curvature marks the direction of the transition. For clarity, all lines representing transition probabilities of less than 0.05 are omitted. (D) The clusters found using the information bottleneck approach (colored regions) are contiguous in the behavioral space. Behavioral labels associated with each partitioned graph cluster from B are shown. Black line thickness represents the conditional transition probabilities between clusters. All transition probabilities less than 0.05 are omitted.

III. TRANSITION MATRICES AND NON-MARKOVIAN TIME SCALES
To investigate the temporal pattern of behaviors, we first calculated the behavioral transition matrix over different time scales,

[T(τ)]_{i,j} ≡ p(S(n + τ) = i | S(n) = j),   (1)

which describes the probability that the animal will go from state j to state i after τ transition steps. We expect that this distribution becomes less and less structured as τ increases, because we lose the ability to make predictions of the future state as the horizon of our predictions extends further. In addition, it will be useful to think about these matrices in terms of their eigendecompositions:

[T(τ)]_{i,j} = Σ_µ λ_µ(τ) u^µ_i(τ) v^µ_j(τ),   (2)

where u^µ ≡ {u^µ_i} and v^µ ≡ {v^µ_i} are the left and right eigenvectors, respectively, and λ_µ(τ) is the eigenvalue with the µth largest modulus. Because probability is conserved in the transitions, the largest eigenvalue will always be equal to one, λ_1(τ) = 1, and v^1_i(τ) describes the stationary distribution over states at long times. All the other eigenvalues have magnitudes less than one, |λ_{µ≠1}(τ)| < 1, and describe the loss of predictability over time, as shown in more detail below.

FIG. 2. Long time scale transition matrices and non-Markovian dynamics. (A) Markov model transition matrix for τ = 100, T_M(100), from Eq. (3). (B and C) Transition matrices T(τ) for τ = 100 and τ = 1,000. (D) |λ_µ(τ)| as a function of τ. The curves represent the average over all flies, and thicknesses represent the standard error of the mean. Dashed lines are the predictions for the Markov model T_M(τ). The black line is a noise floor, corresponding to the typical value of the second largest eigenvalue in a transition matrix calculated from random temporal shuffling of our finite data set. (E) Eigenmode decay rates, r_µ(τ) ≡ −log|λ_µ(τ)|/τ, as a function of the number of transitions. Line colors represent the same modes as in (D), and the black line again corresponds to a "noise floor," in this case the largest decay rate that we can resolve above the random structures present in our finite sample.

The matrix T(τ = 1) describes the probability of transitions from one state to the next, the most elementary steps of behavior (Fig. 1B). To the eye, this transition matrix appears modular, with most transitions out of any given state only going to one of a handful of other states. By appropriately organizing the states in Figure 1B, T(τ = 1) takes on a nearly block-diagonal structure, which can be broken up into modular clusters using the information bottleneck formalism (see below). Plotting this matrix on the behavioral map itself (Fig. 1C), we see that the transitions are largely localized, with nearly all large probability transitions occurring between nearby behaviors. Furthermore, the transition clusters are contiguous in the behavioral space, defining gross categories of motion including locomotion, behaviors involving anterior parts of the body, etc. (Fig. 1D).

It is important to note that T(τ = 1) does not directly contain information about the location of behavioral states in the two-dimensional map, and hence any relationship we observe between the transition structure and the patterning of behaviors in the map is a consequence of the animal's behavior and not the way we construct the analysis. We thus conclude that behavioral transitions are mostly restricted to occur between similar actions; e.g., grooming behaviors are typically followed by other grooming behaviors of close-by body parts, and animals transition between locomotion gaits systematically by changing gait speed and velocity. These observations are consistent with classical ideas of postural facilitation and previous observations that transitions largely occur between similar behaviors [9, 20–22].

We begin to see the necessity of looking at longer time scales as we measure the transition matrices for τ ≫ 1, asking whether T(τ = 1) provides a complete characterization of the system.
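The quantities in Eqs. (1) and (2) are straightforward to estimate from a discrete behavioral sequence. The sketch below is our own illustration, not the authors' code: a toy three-state sequence stands in for the N = 117 behavioral states, and the helper name is hypothetical.

```python
import numpy as np

def transition_matrix(states, tau, n_states):
    # Estimate [T(tau)]_{i,j} = p(S(n + tau) = i | S(n) = j), as in Eq. (1)
    counts = np.zeros((n_states, n_states))
    for n in range(len(states) - tau):
        counts[states[n + tau], states[n]] += 1.0
    col_sums = counts.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0  # guard against states never visited
    return counts / col_sums       # each column is a normalized distribution

# Toy sequence; in the paper S(n) is the n-th stereotyped behavior and
# consecutive states always differ, which we do not enforce here.
rng = np.random.default_rng(0)
seq = rng.integers(0, 3, size=10_000)
T1 = transition_matrix(seq, tau=1, n_states=3)

# Eigendecomposition (Eq. 2): probability conservation forces lambda_1 = 1,
# and |lambda_2| sets the slowest Markovian time scale, t_2 = -1/log|lambda_2|.
lam = np.linalg.eigvals(T1)
lam = lam[np.argsort(-np.abs(lam))]
t2 = -1.0 / np.log(np.abs(lam[1]))
```

For a random sequence like this one, |λ₂| is small and t₂ is short; structured behavioral data would instead yield eigenvalues close to one.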
In particular, if the behavior is Markovian, then we can calculate the transition matrix after τ steps just by iterating the matrix from one step:

T_M(τ) ≡ [T(1)]^τ = Σ_µ [λ_µ(1)]^τ u^µ(1) v^µ(1).   (3)

Because |λ_µ(1)| < 1 for all µ > 1, these modes decay away as τ → ∞. For very long times, therefore, T_M(τ) loses all information about the current state and instead reflects the average probabilities of performing any particular behavior. Thus, in a Markovian system, the slowest time scale in the system is determined by |λ_2(1)|, resulting in a characteristic decay time t_2 = −1/log|λ_2(1)|. Calculating these eigenvalues for each fly and averaging, we find ⟨λ_2(1)⟩ = 0.… ± 0.… and ⟨t_2⟩ = 29 ± … transitions. Any memory that persists beyond ≈ 30 transitions into the future is direct evidence for hidden states that carry a memory over longer times and modulate behavior.

Initial evidence for long-time structure in T(τ) comes by comparing the lack of structure within T_M(100) to that within T(τ) for τ = 100 and τ = 1,000 (Fig. 2A-C). After 100 transitions (≈ 3⟨t_2⟩), the Markov model retains essentially no information, as demonstrated by the similarity between all of the rows, implying that all transitions have been randomized. Conversely, although some of the block-diagonal structure from Fig. 1B has dissipated, we see that T(100) and T(1000) retain a great deal of non-randomness.

This observation can be made more precise by looking at the eigenvalue spectra of the transition matrices. In Figure 2D, we plot |λ_µ(τ)| as a function of τ for µ = 2 through 6 (solid colored lines), in addition to the predictions from the Markov model of Eq. (3) based on T(1) (colored dashed lines). In a Markovian system, it would be more natural to plot these results with a logarithmic axis for λ, but here we see that structure extends over such a wide range of time scales that we need a logarithmic axis for τ. We can make this difference more obvious by measuring the apparent decay rate, r_µ(τ) = −log|λ_µ(τ)|/τ, which should be constant for a Markovian system. For the leading mode, the apparent decay rate falls by nearly two orders of magnitude before the corresponding eigenvalue is lost in the noise (Figure 2E). Similar patterns appear in higher modes, but we have more limited dynamic range for observing them.

These results are direct evidence that many time scales are required to model behavioral sequences, even in this simple context where no external stimuli are provided. Accordingly, we can infer that the organism must have internal states that we do not directly observe, even though we are making rather thorough measurements of the motor output. Roughly speaking, the appearance of decay rates ≈ 10⁻³ transitions⁻¹ means that the internal states must hold memory across at least ≈ 10³ behavioral transitions, or approximately 20 minutes, much longer than any time scale apparent in the Markov model.

IV. PREDICTABILITY AND HIERARCHY
The modular structure of the flies' transition matrix, combined with the observed long time scales of behavioral sequences, suggests that we might be able to group the behavioral states into clusters that preserve much of the information that the current behavioral state provides about future actions (predictive information [23]). Furthermore, we should be able to probe whether this results in a hierarchical organization: if the states are grouped into a hierarchy, then increasing the number of clusters will largely subdivide existing clusters rather than mix behaviors from two different clusters.

To make this idea more precise, we hope to map the behaviors into groups, S(n) → Z, that compress our description in a way that preserves information about a state τ transitions in the future, S(n + τ). Mathematically, this means that we should maximize the information about the future, I(Z; S(n + τ)), while holding fixed the information that we keep about the past, I(Z; S(n)). Introducing a Lagrange multiplier to hold I(Z; S(n)) fixed, we wish to maximize

F = I(Z; S(n + τ)) − β I(Z; S(n)).   (4)

At β = 0 we retain the full complexity of the 117 behavioral states, and as we increase β, we are forced to tighten our description into a more and more compressed form, thus losing predictive power. This is an example of the information bottleneck problem [24]. If the compressed description Z involves a fixed number of clusters, then we find solutions that range from soft clustering, where behaviors can be assigned to more than one cluster probabilistically, to hard clustering, where each behavior belongs to only one cluster, as β increases; changing the number of clusters allows us to move along a curve that trades complexity of description against predictive power, as shown in Fig. 3 (see § VI C for details).
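As a concrete illustration, the self-consistent bottleneck updates given in Methods (§ VI C) can be iterated directly. The sketch below is our own minimal implementation on a toy joint distribution, not the authors' code; conventions for β vary, and in this sketch larger β drives the solution toward harder clusterings.

```python
import numpy as np

def information_bottleneck(p_xy, K, beta, n_iter=300, seed=0):
    # Iterate the self-consistent IB equations (cf. Methods, Eqs. 6-8) for a
    # joint distribution p(x, y), K clusters, and trade-off parameter beta.
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_x = p_xy / p_x[:, None]                       # p(y|x)
    q_z_x = rng.dirichlet(np.ones(K), size=len(p_x))  # random init of p(z|x)
    for _ in range(n_iter):
        q_z = q_z_x.T @ p_x                           # p(z) = sum_x p(z|x) p(x)
        q_y_z = (q_z_x * p_x[:, None]).T @ p_y_x / q_z[:, None]  # p(y|z)
        # p(z|x) proportional to p(z) exp(-beta * D_KL(p(y|x) || p(y|z)))
        log_ratio = np.log(p_y_x[:, None, :] + 1e-12) - np.log(q_y_z[None, :, :] + 1e-12)
        d_kl = (p_y_x[:, None, :] * log_ratio).sum(axis=2)
        log_q = np.log(q_z + 1e-12)[None, :] - beta * d_kl
        q_z_x = np.exp(log_q - log_q.max(axis=1, keepdims=True))
        q_z_x /= q_z_x.sum(axis=1, keepdims=True)
    return q_z_x

# Four "behaviors" whose futures come in two flavors: states 0 and 2 have
# identical conditional futures, as do states 1 and 3, so they are forced
# into matching cluster assignments.
p_xy = np.array([[0.25, 0.0], [0.0, 0.25], [0.25, 0.0], [0.0, 0.25]])
q = information_bottleneck(p_xy, K=2, beta=100.0)
```

Sweeping β and the number of clusters, and recording I(Z; S(n)) against I(Z; S(n + τ)) for each solution, traces out trade-off curves of the kind shown in Fig. 3.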
FIG. 3. Optimal trade-off curves for lags from τ = 1 to τ = 5000. For each time lag τ, number of clusters, and β, we optimize Equation 4 and plot the resulting complexity of the partitioning, I(Z; S(n)), versus the predictive information, I(Z; S(n + τ)).

FIG. 4. Information bottleneck partitioning of behavioral space for τ = 67 (approximately twice the longest time scale in the Markov model). Borders from the previous partitions are shown in black. For 25 clusters (bottom right), the partitions, still contiguous, are denoted by dashed lines.

As expected, the optimal curves move downward as the time lag increases, implying that the ability to predict the behavioral state of the animal decreases as we look further into the future. We also observe a relatively rapid decrease in the height of these curves for small τ, followed by increasingly closely spaced optimal curves as the lag length increases. It is this slowing that is indicative of the long time scales in behavior.

Along each of these trade-off curves lie partitions of the behavioral space that contain an increasing number of clusters. We can make several observations about these data. First, in agreement with our investigation of the single-step transition matrix, we find that the clusters are spatially contiguous in the behavioral map, as exemplified in Figure 4 for τ = 67. Thus, even when we add in the long time-scale dynamics, we find that transitions predominantly occur between similar behaviors. Second, these spatially contiguous clusters separate hierarchically as we increase the number of clusters, i.e. new clusters largely result from subdividing existing clusters instead of emerging from multiple existing clusters. One example of this can be seen in Figure 5, where the probability flow between partitions of increasing size subdivides in a tree-like manner.
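Whether new clusters arise by subdividing old ones can be checked directly from the cluster assignments. The sketch below is our own illustration (the helper name flux_matrices is hypothetical): it builds the forward and backward membership maps between a coarse and a fine partition, the basic ingredients of the treeness analysis described in Methods.

```python
import numpy as np

def flux_matrices(coarse_labels, fine_labels):
    # P[i, j]: probability that a state in cluster i of the coarse partition
    # lands in cluster j of the fine partition (forward graph);
    # Q[j, i]: probability that fine cluster j came from coarse cluster i
    # (backward graph).
    joint = np.zeros((coarse_labels.max() + 1, fine_labels.max() + 1))
    for a, b in zip(coarse_labels, fine_labels):
        joint[a, b] += 1.0
    P = joint / joint.sum(axis=1, keepdims=True)
    Q = (joint / joint.sum(axis=0, keepdims=True)).T
    return P, Q

# A perfectly nested refinement: coarse cluster 0 splits into fine clusters
# {0, 2}, while coarse cluster 1 survives intact as fine cluster 1.
coarse = np.array([0, 0, 0, 1, 1])
fine   = np.array([0, 0, 2, 1, 1])
P, Q = flux_matrices(coarse, fine)
# Each fine cluster has exactly one coarse "parent", so the rows of Q are
# one-hot: walking backwards through the tree involves no uncertainty.
```

In a perfectly hierarchical sequence of partitions the backward entropy vanishes, which is the limiting case quantified by the treeness metric.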
It is important to note that these results are not built into the information bottleneck algorithm: we can solve the bottleneck problem for different numbers of clusters independently, and hence (in contrast to hierarchical clustering) this method could have found non-hierarchical evolution, with new clusters comprised of behaviors from many other clusters. That this does not happen is strong evidence that fly behavior is organized hierarchically.

FIG. 5. Hierarchical organization for optimal solutions with lag τ = 100, ranging from 1 cluster to 25. The displayed clusterings are those that have the largest value of I(Z; S(n + τ)) for that number of clusters. The lengths of the vertical bars are proportional to the percentage of time a fly spends in each of the clusters, and the lines flowing horizontally from left to right are proportional in thickness to the flux from the clustering on the left to the clustering on the right. Fluxes less than 0.01 are suppressed for clarity.

We can go beyond this qualitative description, however, by quantifying the degree of hierarchy in our representation as the number of clusters increases using a "treeness" metric, T (Fig. 6). The idea behind this metric, which is similar to the one introduced by Corominas-Murtra et al. [25], is that if our representation is perfectly hierarchical, then each cluster has precisely one "parent" in a partitioning with a smaller number of clusters: the better our ability to distinguish the lineage of a cluster as it splits through increasingly complex partitionings, the higher the value of T. More precisely, the treeness index is given by the relative reduction in entropy going backwards rather than forwards through the tree,

T = (H_f − H_b) / H_f,   (5)

where H_f and H_b are the entropies over all possible paths going forwards and backwards, respectively. This metric is bounded between zero and one, 0 ≤ T ≤ 1, and T = 1 implies a perfect hierarchy.

FIG. 6. Partitionings are tree-like over all measured time scales. (A) Definition of the treeness metric, T; see Methods for details. (B) T as a function of the number of transitions in the future and the number of clusters in the most fine-grained partition. Colored lines represent values of T for partitions at varying times in the future, and black lines are values for randomized graphs generated from partitionings that were assigned randomly.

We find that the partitionings derived from the information bottleneck algorithm are much more tree-like than random partitions of the behavioral space (Fig. 6B). This is true even when we attempt to optimally predict behavioral states thousands of transitions into the future. Thus, by finding optimally predictive representations that best explain the relationship between states over long time scales, we have uncovered a hierarchical ordering of actions, supporting decades-old theory without relying on hierarchical clustering, Markov models, or limiting the measured behavioral repertoire.

V. CONCLUSIONS
We have measured the behavioral repertoires of dozens of fruit flies, paying particular attention to the structure of their behavioral transitions. We find that these transitions exhibit multiple time scales and possess memory that persists across thousands of observable transitions, indicative of slowly varying internal states. Using an information bottleneck approach to find the compressed representations that optimally predict our observed dynamics, we find that behaviors are organized in a hierarchical fashion, with fine-grained representations being able to predict short-time structure and coarser representations being sufficient to predict the fly's actions that are further removed in time. This is fundamentally different from previous measurements of hierarchy in behavior, which were more limited in the types of behaviors they measured, the time scales over which the hierarchy was modeled, and/or relied on hierarchical clustering and other types of analyses that only yield hierarchical outputs.

The type of organization we observe is reminiscent of the functional clustering seen in mouse and primate motor cortex, where groupings of neurons from millimeter scales down to single cells have been found to exhibit increasing temporal correlation as the distance between them decreases [4, 6]. Although no such pattern has been specifically found in
Drosophila, our results suggest that similar neuronal patterns may exist. As circuits for different behavioral modules are uncovered, we anticipate that a corresponding hierarchical neuroanatomical organization will also be found in the fly, serving as a general principle that may apply across organisms to provide insight into how the brain controls behavior and adapts to a complex environment.
ACKNOWLEDGMENTS
We thank Ugne Klibaite, David Schwab, and Thibaud Taillefumier for discussions and suggestions. JWS and GJB also acknowledge the Aspen Center for Physics, where many ideas for this work were formulated. This work was funded through awards from the National Institutes of Health (GM098090, GM071508), the National Science Foundation (PHY-1305525, PHY-1451171, CCF-0939370), the Swartz Foundation, and the Simons Foundation.
VI. METHODS

A. Experiments
We imaged 59 individual male flies (
D. melanogaster, Oregon-R strain) for an hour each, following the protocols originally described in [18]. All flies were within the first two weeks post-eclosion during the filming session. Flies were placed into the arena via aspiration and were subsequently allowed 5 minutes for adaptation before data collection. All recording occurred between the hours of 9:00 AM and 1:00 PM. The temperature during all recordings was 25 ± … °C.
Markovian model data sets were generated by first randomly selecting a state, and then finding another, randomly chosen, instance in the measured data set where the fly was performing that behavior. The behavior performed immediately after that instance is chosen as the next state, and the process is iterated until the generated sequence is equivalent in size to the original data set, similar to the first-order alphabets generated in Shannon's original work on information theory [26].
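This resampling procedure can be sketched as follows (our own toy implementation, not the authors' code; the function name and example sequence are illustrative):

```python
import random

def markov_surrogate(sequence, length, seed=0):
    # First-order surrogate: repeatedly pick a random occurrence of the
    # current behavior in the data and emit whatever behavior followed it,
    # as in Shannon's first-order alphabet construction.
    rng = random.Random(seed)
    followers = {}
    for i in range(len(sequence) - 1):
        followers.setdefault(sequence[i], []).append(sequence[i + 1])
    state = rng.choice(sequence[:-1])
    out = [state]
    while len(out) < length and state in followers:
        state = rng.choice(followers[state])
        out.append(state)
    return out

surrogate = markov_surrogate([0, 1, 2, 0, 1, 0, 2, 1], length=20)
```

By construction, the surrogate reproduces the one-step transition statistics T(τ = 1) of the data but carries no longer-range memory, which is what makes it a useful null model for the non-Markovian structure discussed above.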
C. Predictive Information Bottleneck
The solution to the information bottleneck problem, Eq. (4), obeys a set of self-consistent equations that can be iterated in a manner equivalent to the Blahut-Arimoto algorithm in rate-distortion theory [24, 27]. For a given |Z| = K and inverse temperature β, a random initial condition for p(z|x) is chosen, and the following self-consistent equations are iterated until the convergence criterion ((F_t − F_{t+1})/F_t < 10^{−…}) is met:

p(z|x) = [p(z)/Z(β, x)] exp[−β D_KL(p(y|x) || p(y|z))],   (6)

p(z) = Σ_x p(z|x) p(x),   (7)

p(y|z) = (1/p(z)) Σ_x p(y|x) p(z|x) p(x),   (8)

where x ∈ S(n), y ∈ S(n + τ), z ∈ Z, D_KL is the Kullback-Leibler divergence between two probability distributions, and Z(β, x) is a normalizing function.

Because this study focuses on hard clusterings of the behavioral space, we find solutions by starting at β = 0.… and slowly increasing to β = 500. After starting from a random initial condition at the initial value of β, the optimization is performed at that value until the convergence criterion is met, and that solution is used as the initial condition for the next value of β. All intermediate solutions, p^(n)_ℓ(z|x), are stored so they can potentially be included in the found Pareto front. In addition, we perform 24 replicates of this process with different random initial conditions for K = 2, …, 25 and for 81 time lag values between n = 1 and n = 5,000. Each solution is converted to a hard clustering via p(z|x) = δ(z, argmax_{z′} p(z′|x)), and we recalculate I(Z; S(n)) and I(Z; S(n + τ)) accordingly. We then defined the Pareto front, ξ^(n), as the set of all solutions, p^(n)_ℓ(z|x), such that no other solution for that given lag results in a smaller value for I(Z; S(n)) and a larger value for I(Z; S(n + τ)). Between 150 and 350 solutions were found for all of the fronts. When choosing a clustering for a fixed number of clusters, we always pick the representation along the optimal front that has the highest value of I(Z; S(n + τ)).

D. Treeness Index
To calculate the treeness index, T, we construct a directed, acyclic forward graph that connects the partitions as the number of clusters increases for a given time lag, with values P^(ℓ)_{ij}. These values are the probability that a state contained in one cluster, i, in the partitioning with ℓ clusters also belongs to cluster j in the partitioning with ℓ + 1 clusters. Similarly, we can create the backwards graph, Q^(ℓ)_{ij}, that links clusters in the opposite direction; Q^(ℓ)_{ij} is the probability that a state in cluster i in the partitioning with ℓ + 1 clusters also belongs to cluster j in the partitioning containing ℓ clusters.

Given these two graphs, we can calculate the entropy of picking a path, π^(f), in the forward direction versus the entropy of picking a path, π^(b), in the backwards direction. These probabilities can be calculated via p(π^(f)_v) = Π_{ℓ=1}^{N−1} P^(ℓ)_{v_ℓ, v_{ℓ+1}} and p(π^(b)_v) = Π_{ℓ=1}^{N−1} Q^(ℓ)_{v_{ℓ+1}, v_ℓ}, with v being a chosen sequence of clusters. Thus, we define the forward and backwards entropies as follows:

H_f = −Σ_{v ∈ V} p(π^(f)_v) log p(π^(f)_v),   (9)

H_b = ⟨−Σ_{w ∈ W_r} p(π^(b)_w) log p(π^(b)_w)⟩_r,   (10)

where V is the set of all possible paths and W_r is the set of all paths ending at cluster r in the most fine-grained partitioning; ⟨···⟩_r denotes an average over each end state. T is then calculated as the relative reduction in entropy between backwards and forwards path probability distributions, as given by Equation 5.

[1] N. Tinbergen, The Study of Instinct (Oxford University Press, Oxford, U.K., 1951).
[2] R. Dawkins, in Growing Points in Ethology, edited by P. Bateson and R. Hinde (Cambridge Univ. Press, Cambridge, U.K., 1976), pp. 7–54.
[3] H. A. Simon, in