[PDF] Topological data analysis of C. elegans locomotion and behavior

Abstract

We apply topological data analysis to the behavior of C. elegans, a widely-studied model organism in biology. In particular, we use topology to produce a quantitative summary of complex behavior which may be applied to high-throughput data. Our methods allow us to distinguish and classify videos from various environmental conditions and we analyze the trade-off between accuracy and interpretability. Furthermore, we present a novel technique for visualizing the outputs of our analysis in terms of the input. Specifically, we use representative cycles of persistent homology to produce synthetic videos of stereotypical behaviors.

Full PDF

TTopological data analysis of

C. elegans locomotion and behavior

Ashleigh Thomas, Kathleen Bates, Alex Elchesen, Iryna Hartsock,Hang Lu, and Peter Bubenik

Abstract

Video of nematodes/roundworms was analyzed using persistent homology to studylocomotion and behavior. In each frame, an organism’s body posture was representedby a high-dimensional vector. By concatenating points in ﬁxed-duration segmentsof this time series, we created a sliding window embedding (sometimes called a timedelay embedding) where each point corresponds to a sequence of postures of an or-ganism. Persistent homology on the points in this time series detected behaviors andcomparisons of these persistent homology computations detected variation in theircorresponding behaviors. We used average persistence landscapes and machine learn-ing techniques to study changes in locomotion and behavior in varying environments.

Contents a r X i v : . [ m a t h . A T ] F e b Introduction

Model organisms are indispensable in understanding basic principles of biology. Studiesof model organisms have played a major role in discoveries of disease mechanisms, diseasetreatment, and neuroscience principles, to name a few. The behavior of these model organ-isms can illuminate responses and phenotypes important for understanding the eﬀects ofexperimental conditions on subjects. Behavior can be aﬀected by neuron activity, externalstimuli, and past experiences (learning), so being able to adequately measure and comparebehaviors is a useful evaluation tool for a wide range of experiments.Persistent homology is a possible tool for assessing behavior of

C. elegans , which area widely-used model organism. Persistence has been successfully used to study high-dimensional time series, especially those that exhibit some quasi-periodic behavior likethe undulation of

C. elegans [Tra16; TP18]. But to the authors’ knowledge, persistenthomology has not been previously used to analyze

C. elegans behavior, though it and sim-ilar techniques have been used to study

C. elegans neural data [PS+13; SP+19; HBB20;LG+20; BK+15].In this paper, we use persistent homology to study the locomotion of

C. elegans undertwo experimental conditions. In our initial study (section 3.1), the nematodes move onthe surface of an agar plate. Under these experimental conditions there are no barriersto movement and the locomotion is both smooth and complex. We show that persistenthomology is able to capture various characteristic movements. However, for the large-scaleautomated study of

C. elegans (section 3.2) a more controlled environment is required,in which the nematodes are conﬁned to wells in micro-ﬂuidic devices. Our main resultsstudy

C. elegans ’ locomotion in these aqueous wells in which the solution has variouslevels of viscosity. We show that we are able to use persistent homology, and averagepersistence landscapes in particular, to summarize the locomotion in a way that allows theclassiﬁcation of the viscosity class with a high level of accuracy. Our results indicate thatpersistent homology is a promising tool for quantifying the impact of changes to genotypeand environment on

C. elegans locomotion.

Caenorhabditis elegans is a free-living soil nematode that has been a workhorse geneticmodel system. The nematode’s transparent tissue, simple anatomy, and fast reproductioncontribute to both ease in culture and a literal window into the internal workings of a livingorganism. Its completely sequenced genome contains many genes that are homologous tohuman genes, and importantly the ability to manipulate genes with relative ease makes itan extremely attractive model system. For neuroscience in particular,

C. elegans presentsa unique opportunity with its simple nervous system (just 302 neurons) that is complex2nough to exhibit many sensory modalities, including mechanosensation, chemosensation,and response to heat, osmolarity, and smell.Behavior characterization in

C. elegans was historically qualitative, mainly relying onexperimentalists specifying end-point assessment (e.g. whether the worm chemotaxes toa particular source of odor within a certain amount of time), or experimentalists usingheuristics to assess behavior (e.g. naming worms “ unc ” for uncoordinated). In the lastdecade, machine vision tools ﬁrst replaced human identiﬁcations of worms from images andvideos, which allows much larger dynamic datasets to be annotated and analyzed. In recentyears, further development in quantitative behavior characterization tools such as tracking[YJ+13; SC+11; SG+11; HC+12; PG+19], eigenworms [SJ+08], behavior ‘dictionaries’[BY+13], and t-sne [BBS16; LS+18] have moved the ﬁeld away from merely describingthe outcome to understanding the types of behavior the brain of this simple system cangenerate. While many of these techniques do well in quantitatively describing behaviorand distinguishing diﬀerences in behavior, behavioral dynamics are rich and opportunitiesabound in exploring behavioral dynamics using other mathematical tools.Persistent homology has been used to analyze time series data in many diﬀerent settings.Some earlier work was theoretical and studied the interaction between persistence andsliding window embeddings — which we used in this research — as well as proposed possi-ble applications [PH15; Per16; KM16]. Research into gene expression has used persistenthomology to detect patterns or classify whether a signal is periodic [PD+15; DA+08]. Fre-quently, persistence has been used to study neural data [SHP17; PS+13; SP+19; HBB20;LG+20; BK+15], and in many cases neural data from

C. elegans , but the analysis tendsto rely on clique complexes as the topological space of interest instead of sliding windowembeddings.

The authors would like to thank the anonymous referees whose many comments consid-erably improved our manuscript. The ﬁrst author would also like to thank Kim Le forhelpful conversations.This research was partially supported by the Southeast Center for Mathematics and Biol-ogy, an NSF-Simons Research Center for Mathematics of Complex Biological Systems,under National Science Foundation Grant No. DMS-1764406 and Simons FoundationGrant No. 594594. This material is based upon work supported by, or in part by, theArmy Research Laboratory and the Army Research Oﬃce under contract/grant num-ber W911NF-18-1-0307. Research reported in this publication was supported in part bythe National Institutes of Health under award numbers R01NS096581, R01NS115484, andR01AG056436. The collection of the data used in this research was was partially supportedby the National Institutes of Health’s Ruth L. Kirschstein NRSA award 1F31GM123662.3

Materials and methods

In this section we describe the collection and preprocessing of experimental data (sec-tion 2.1), describe mathematical background (sections 2.2 and 2.3), and describe thepipeline for using topological data analysis on our

C. elegans behavior data (section 2.4).

C. elegans (N2 strain) were cultured at 20 ◦ C under standard conditions on agar platesseeded with OP50

E. Coli . Animals were age-synchronized via hatch-oﬀ and cultured onplate until they reached day 1 of adulthood. For behavior experiments on agar, animalswere prepared, imaged, and tracked as previously described [PG+19]. For behavior exper-iments in methylcellulose media, synchronized populations were then washed oﬀ of cultureplates with M9 buﬀer. Unless otherwise noted, video data was collected on a dissectingmicroscope (Leica MZ16) using a CMOS camera (Thorlabs DCC3240M), with a frame rateof 30 Hz and a magniﬁcation of 1.2x.Behavior data was collected with animals conﬁned with microﬂuidic devices. Importantly,cavities in which worms are loaded in these devices are only slightly greater depth than thewidth of an adult worm, which restricts worms to the focal plane of the microscope andalmost entirely 2-dimensional behavior. Microﬂuidic devices were fabricated as describedpreviously [CZ+11]. Methylcellulose solutions were prepared at concentrations of 0 . Sliding window embeddings turn time series data into a pointcloud in a way that doesnot forget the temporal information of the time series. There are some additional beneﬁtsto sliding window embeddings, including that they “separate” points that intersect eachother in a time series, such as in examples 2.6 and 2.7.For our application, we consider time series that take values in vector spaces.

Deﬁnition 2.1. A time series is a sequence of vectors ( x t ) t ∈ T = ( x t , x t +1 , x t +2 , . . . ) whereeach x t is in the same ﬁnite-dimensional vector space V and T is a totally ordered set. Remark 2.2.

The totally ordered set T , which indexes the time series, can be Z , N , or aﬁnite set like [ N ] = { , , . . . , N } . For many applications including the ones in this paper,the indexing set is ﬁnite and will be omitted in notation for brevity, as in ( x t ) t .Given a time series we can construct a new time series called the sliding window embedding,also known as the time delay embedding. Deﬁnition 2.3.

Given a time series τ = ( x t ) t with vectors x t ∈ V , a sliding windowembedding of window length l of τ is a new time series, τ l = ( (cid:101) x t ) t , with (cid:101) x t = [ x t x t +1 . . . x t + l − ] ∈ V l . That is, the t th vector in the new time series is the concatenation of l consecutive vectorsin the original time series. Remark 2.4.

If the original time series has N points, then the sliding window embeddingof window length l has N − l + 1 points, as you can see in example 2.5. Example 2.5.

Consider the time series τ = ([1 , , [3 , , [5 , , [7 , , [9 , τ of window length l = 3 is τ = ([1 , , , , , , [3 , , , , , , [5 , , , , , − C. elegans behavior data. Degree 1persistent homology detects loops in these sliding window embeddings that loosely corre-spond to individual behaviors. Below we see two examples where persistence would fail tocapture an entire period of a time series in one loop, and how sliding window embeddingscan ﬁx this behavior.

Example 2.6.

In ﬁgure 1 (a) is one period of a 2-dimensional periodic time series withthe property that if linear segments connect successive points, the path of the time seriesself-intersects.One period of this function corresponds to traversing the self-intersecting loop traced bythe time series once. To discover this loop, you might use persistent homology (section 2.3).But in the case of the ﬁgure-eight, persistent homology would detect two distinct loops,each comprising half of the period, and the true information that there is only one loopwould be lost. If we create a sliding window embedding with window length l = 10, weget a 20-dimensional time series that we can project down to two dimensions using PCA.Pictured in (b) is this l = 10 sliding window embedding projected down into the 1 st and3 rd principal components. (c) shows the sliding window embedding with window length of l = 20.Notice that as the window length increases, the height of the loop increases. This increasesthe loop’s persistence and makes it easier for persistent homology to robustly detect.(a) (b) (c)time series τ projection of τ projection of τ Figure 1. (a) A time series in R determines a self-intersecting curve. (b) Thesliding window embedding of window length 10 separates the previously intersectingsegments of the curve. (c) A sliding window embedding with a higher window lengthseparates the intersecting segments even further. Example 2.7.

Example 2.7 (a) shows a 1-dimensional time series that is a discretizationof a sine wave. This periodic behavior creates no loops — in fact, because the points takevalues in R , there is no way for the time series to produce degree 1 homology. at all Buta sliding window embedding, in this case of window length 4, creates a loop that can bedetected by persistent homology. That loop in R is projected down into 2 dimensions inexample 2.7 (b). 6a) (b)time series τ projection of τ Figure 2. (a) A time series that has trivial ﬁrst homology. (b) A sliding windowembedding with l = 4 of the original time series has nontrivial ﬁrst homology. For our applications we will consider the vectors in the sliding window embeddings aspoints in a pointcloud, forgetting the time series structure. Fortunately this point of viewdoes not forget the temporal information of the original time series since that informationis encoded within the vectors themselves.

Persistent homology detects loops and other topological information in a (ﬁltered) topolog-ical space. We can construct a ﬁltered topological space from a point cloud using a radiusparameter (see deﬁnition 2.10 and example 2.12). In our applications, the point cloudwe start with is the vectors of a sliding window embedding of an input time series. Theoutput of persistent homology is a multiset of points in R called a persistence diagram(deﬁnition 2.16), where each point represents the birth and death radii of a topologicalfeature in a ﬁltered topological space. For persistence computations the topological spaceis usually a simplicial complex. Deﬁnition 2.8. A simplicial complex on a set of vertices V is a collection K of subsets of V such that if τ ∈ K and τ (cid:48) ⊂ τ , then τ (cid:48) ∈ K . An element τ ∈ K is called a simplex . An n -simplex is a simplex τ ∈ K with | τ | = n + 1. Deﬁnition 2.9. A ﬁltered simplicial complex K is a nested sequence of simplicial complexes K ⊂ K ⊂ K ⊂ · · · ⊂ K n . Each simplicial complex K i is referred to as a ﬁltered piece .There are many ways to constructed a ﬁltered simplicial complex from a point cloud. Inthis paper we will use the following construction. Deﬁnition 2.10.

Let X ⊂ R d be a ﬁnite set and let r ≥

0. The

Vietoris-Rips complexof X at scale r , denoted R r ( X ), is the simplicial complex with vertex set X and whosesimplices are given as follows. A subset { x , . . . , x n } ⊂ X is an n -simplex in R r ( X ) if andonly if d ( x i , x j ) ≤ r for all i, j ∈ { , . . . , n } . 7 eﬁnition 2.11. The

Vietoris-Rips complex of a ﬁnite set X ⊂ R d is the collection R ( X ) := {R r ( X ) } r ≥ .Note that while the Vietoris-Rips complex of X is parameterized by the non-negativereals, ﬁniteness of X guarantees that R ( X ) consists of only ﬁnitely many distinct simplicialcomplexes. Thus we view the Vietoris-Rips complex as a ﬁnite nested sequence of simplicialcomplexes, i.e., as a ﬁltered simplicial complex as deﬁned in Deﬁnition 2.9. Example 2.12.

Figure 3 shows four ﬁltered pieces of a Vietoris-Rips complex of a point-cloud in R . Notice that each simplicial complex can be included into the next.radius=0 . . . .

90 loops 2 loops 2 loops 0 loops

Figure 3.

Filtered pieces of the Vietoris-Rips ﬁltered simplicial complex of a point-cloud in 2 dimensions.

Deﬁnition 2.13.

Let F be a ﬁeld. A ﬁnite persistence module M is a ﬁnite sequence of F -vector spaces { M j } nj =0 together with a collection of F -linear maps { φ ij } ≤ i ≤ j ≤ n such that φ jk ◦ φ ij = φ ik and φ ii = id M i for all 0 ≤ i ≤ j ≤ k ≤ n .Persistence modules arise when homology with coeﬃcients in a ﬁeld F is applied to aﬁltered simplicial complex. Example 2.14.

In this case of example 2.12, the persistence module has a vector spacefor every distinct ﬁltered piece (of which only four are shown), and those vector spaceshave dimension equal to the number of minimal simple loops. Every linear map is eitherthe zero map or the identity.

Deﬁnition 2.15.

The p th persistent homology (or persistent homology in degree p ) of a ﬁl-tered simplicial complex K is the persistence module H p ( K , F ) = (cid:0) { H p ( K j , F ) } nj =0 , { φ j } nj =1 (cid:1) ,where each H p ( K j , F ) is the p th homology vector space with coeﬃcients in F and each φ j : H p ( K j − ) → H p ( K j ) is the linear map induced by inclusion K j − (cid:44) −→ K j .The fact that H p ( K , F ) is a persistence module follows from the functoriality of homology.For computational purposes, the ﬁeld F is often chosen to be the ﬁeld of two elements Z /

2. Throughout, we will assume this to be the case and use the shorthand H p ( K ) := H p ( K , Z / p ( K )of ordered pairs ( b, d ), with b, d ∈ N and b ≤ d , such that H p ( K ) can be written as a directsum H p ( K ) = (cid:77) ( b,d ) ∈ Dgm p ( K ) I [ b, d ] . Here, I [ b, d ] denotes the interval module supported on [ b, d ], given by I [ b, d ] j = (cid:40) F if j ∈ [ b, d ] , φ ij = id F if i, j ∈ [ b, d ] and φ ij = 0 otherwise.The multiset Dgm p ( K ) is a fundamental invariant and is the output of a persistence ho-mology computation, with which further analyses can be carried out. Deﬁnition 2.16.

The multiset Dgm p ( K ) is called the p th persistence diagram of K .In the case of Vietoris-Rips complex of a pointcloud, persistence tries to ﬁnd topology in apossibly ﬁnite set of points. If those points were sampled from, say, a manifold, persistencecan be used to estimate the homology of the original manifold. Example 2.17.

Figure 4 shows the persistence in degree 1 of example 2.12. Notice thatboth (a) the persistence diagram and (b) the persistence landscape (see deﬁnition 2.18)show two cycles, but because they are born and die at exactly the same radius parametersthey are plotted in the same place. The two cycles are shown in (c).(a) (b) (c)

Figure 4. (a) The degree 1 persistence diagram of the ﬁgure eight in 2 dimensions.Notice that the point has multiplicity 2. (b) The corresponding degree 1 persistencelandscape. Notice that the ﬁrst and second landscapes are nonzero and identical andall other landscapes are trivial. (c) The two loops that generate the homology of theVietoris-Rips complex on the ﬁgure eight.

Unfortunately, persistence diagrams have some undesirable properties. For example, whilewe can compute an average of a set of persistence diagrams, that average is not necessarilyunique [MMH11]. A recourse is to map persistence diagrams into a Hilbert space wherethe tools of statistics and machine learning can be applied. One such mappings is thepersistence landscape (see [Bub15] for the following deﬁnitions and results).9 eﬁnition 2.18.

Let a < b and f a,b : R → R be the piecewise-linear function given by f a,b ( t ) =  t − a, if a ≤ t ≤ a + b b − t, if a + b ≤ t ≤ b , otherwise.Given a persistence diagram Dgm p ( K ), the corresponding k th persistence landscape is thefunction λ k : R → R , where λ k ( t ) is the k th largest value of f a,b ( t ) for all points ( a, b ) ∈ Dgm p ( K ). The persistence landscape is the sequence ( λ k ) k . The parameter k is called the depth of the persistence landscape.Persistence landscapes have unique averages, satisfy the law of large numbers and centrallimit theorems, and can be discretized for computations. Because the sequence of functionsthat make up a landscape are nested, they can all be graphed on the same plot as inexample 2.17 (b).A persistence landscape is a continuous objects in an inﬁnite-dimensional Hilbert space,but can be discretized and turned into a ﬁnite-dimensional vector. Through discretization,each depth of the landscape transforms from a continuous function on R to a vector wherethe i th entry in the vector corresponds to the function value at the i th point in the samplespace. Once the vectors for each depth of the landscape are concatenated together, theyform a single vector in a very high-dimensional vector space, which, equipped with theappropriate metric, is a Hilbert space.The Hilbert space setting lets us use linear algebra-based machine learning techniques likeprincipal component analysis (PCA). The principal components from PCA on discretizedlandscapes can be converted into a format much like a persistence landscape — a sequenceof continuous functions on R — but the principal components are not themselves landscapesbecause the functions fail to be nonnegative. In this section we give the details for the analysis that was done on

C. elegans behaviordata. The input to our system was peicewise linear midlines of nematodes from videorecordings as described in section 2.1. These midlines were parameterized by the 100angles between adjacent segments, so each sample input to our system was a time series τ of 100-dimensional vectors measured in radians. The time domain of this time serieswas divided into overlapping patches of a given size called the patch length , resulting ina collection { τ i } i of smaller time series. For our experiments, a patch length of 300 waschosen, with adjacent patches overlapping by half of the patch length. The sliding windowembeddings of the τ i were then computed, with window length parameter l = 20, resulting10n a new collection { ( τ i ) l } i of time series of length 300 − l + 1 = 381. For each ( τ i ) l ,a persistence landscape PL(( τ i ) l ) was obtained by computing the persistent homology ofthe Vietoris-Rips complex R (( τ i ) l ). The collection { PL(( τ i ) l ) } i of persistence landscapeswas then assembled into a single summary for a given video by averaging the persistencelandscapes across patches to result in a single average persistence landscape associated toeach video. The average persistence landscapes were then discretized, resulting in a singletopological summary in the form of a high-dimensional vector in Euclidean space. We referto this summary as the discretized average landscape or discretized landscape of a sample.For each environment viscosity, the sample discretized landscapes were averaged to givethe discretized landscape of the class.The discretized landscapes for each class were used for analysis of the movement spaceas follows. Distances between each class’ discretized landscapes were computed using theusual Euclidean distance. The pairwise distances were visualized using multidimensionalscaling to give a 2-dimensional visualization of the similarities between the class.Principal component analysis (PCA) was applied to the set of discretized landscapes foreach sample and the ﬁrst two principal components were plotted as sequences of piecewiselinear functions to show their relationships to landscapes. The class discretized landscapeswere projected onto this space as well. These plots visualized similarity between classesand some of the variation within classes.The variation within classes was further studied using two tools: the standard deviationsof each of the coordinates of the discretized landscapes and PCA on the sample discretizedlandscapes in each class. The standard deviations of each coordinate were computed for thesamples in each class. For an individual sample, the standard deviations of each coordinatecorrespond to the square root of the diagonal of the covariance matrix of the correspondingdiscretized landscape. They are plotted as sequences of piecewise linear functions to showhow each standard deviation corresponds to a certain point in each discretized landscape.PCA on the sample discretized landscapes in each class were also computed, along withthe cumulative variances explained by the ﬁrst n principal components for n = 1 . . . , ,

000 permutations for each permutation test and found the percent of permutationsthat separated the samples by a larger distance than was obtained by separating accordingto their ground truth classes. These percents are reported as p-values.We applied multiclass support vector machines (SVM) to classify samples according toviscosity of their environments. We use the ksvm function from the kernlab package in Ron the discretized average persistence landscapes. Accuracy was estimated using 10-foldcross validation with cost set to 10. Cross validation was repeated 20 times and the results11ere averaged. A confusion matrix for one instance of SVM with 10-fold cross validationwas also computed.Finally, as a proof of concept, we used support vector regression (SVR) to approximateviscosity of the worm environments. Viscosities were estimated by partitioning the datasetinto 10 parts, building 10 SVR models, and using the leave-one-out approach to approxi-mate the viscosities on each part. This process was repeated for 10 diﬀerent partitions andthe resulting visocities estimates were averaged.

In this section we summarize our work on two main projects: a case study of a singlesample of behavior data and an experiment on the eﬀects of viscosity of the surroundingenvironment on

C. elegans locomotion and behavior. The case study assesses a signiﬁ-cantly smaller data set and directly links topological features to speciﬁc behaviors. Theexperimental results are more diﬃcult to directly interpret in terms of speciﬁc behaviorsbut show the eﬀectiveness of persistent homology in distinguishing variations in behav-iors nonetheless. More explicit interpretation of persistence output in terms of speciﬁcbehaviors is future work.

The following results were obtained by carefully analyzing a sample of

C. elegans behaviordata from a video of a worm that is crawling on agar. Having a solid surface to providefriction forces slower but more complicated behavior than we would see in an aqueousenvironment, and we take advantage of the resulting clarity of the data.The data we analyzed consists of a 20 second video where the subject exhibits followingbehaviors in chronological order:1. crawl forward,2. crawl backward,3. pause, and4. crawl backward again.Below we analyze both the original time series τ from this data and the correspondingsliding window embedding of window length l = 20 to show the importance of the slidingwindow embedding to robust analysis.To visualize the original data and its sliding window embedding, we apply PCA to eachand project onto the ﬁrst two principal components. Some of these projections are shownin the two left-most columns of ﬁgure 5. The original time series τ has a similar shape12o its sliding window embedding τ , but note that the sliding window embedding has theeﬀect of smoothing the data and making it more robust to noise. Though the two timeseries have similar shapes and corresponding similar persistence diagrams and landscapes,they do diﬀer in one important feature: the pause.PC1 × PC2 PC2 × PC3 diagram landscape(a) τ (b) τ Figure 5.

A (a) time series τ and its (b) sliding window embedding of windowlength l = 20 τ , both projected onto 2-dimensional axes of principal components.The corresponding persistence diagrams and landscapes are to the right. In ﬁgure 6 (a) the points in the sliding window embedding that correspond to frames wherethe worm is performing a speciﬁc behavior are highlighted. The points corresponding tothe pause behavior deviate from the path of points corresponding to crawling backwards.This deviation is small compared to the noisiness of the original time series, so the pausedeviation doesn’t create a topological feature in the graph of the original time series. Thisis reﬂected in the persistence diagrams and landscapes; the original time series diagram andlandscape show only 3 signiﬁcant topological features, while the diagram and landscape ofthe sliding window embedding show 4.We computed representative cycles for persistent homology classes for each of the longest-persisting topological features in the two time series using Dionysus [Mor]. Those longest-persisting features for the sliding window embedding are shown in ﬁgure 6 (b). The ho-mology classes that correspond to each of these representative cycles are highlighted in thediagrams in ﬁgure 6 (c). 13a) forward transition backward 1 pause backward 2(b) forward transition backward pause(c)

Figure 6. (a) Points in the sliding window embedding τ that correspond to eachof the labeled behaviors are highlighted. (b) The representative cycles with longestpersistence from automated persistence software correspond to speciﬁc behaviors. (c)The persistence diagram with the homology class corresponding to the above cyclerepresentative highlighted. In this section we summarize analysis of an experiment where

C. elegans are submerged insolutions with varying viscosities. The viscosity of the solution is correlated with how muchmethylcellulose is added and experimental conditions are labeled with their methylcellulosecontent, usually in order from low to high methylcellulose and viscosity.Average persistence landscapes for each class are shown in ﬁgure 7. The lower viscosityconditions allow for larger depth 1 landscapes ( λ ) but have relatively few non-zero higher-depth landscapes, which means there were a smaller number of larger loops detected inthe sliding window embeddings. This indicates that in lower-viscosity environments, C.elegans exhibit behaviors of higher amplitude but either do fewer distinct behaviors or have14uch less variation between repetitions of behaviors. Conversely, the high-viscosity classesshow many more cycles detected in the sliding window embeddings, with each cycle beingsmall compared to the cycles found in the low-viscosity environments. These observationssuggest that at high-viscosity, behaviors don’t involve large changes in posture and thereis a much more varied set of behaviors in general. From observing the raw video data, it isapparent that in higher-viscosity environments

C. elegans can make much smaller, tighterbody bends, which is consistent with this evidence.We also observed that as viscosity increases, the support of the landscapes stretch furtherto the right and cycles are born at lower radius parameters. The worms seemed to exhibitless varied behaviors in lower-viscosity environments, so perhaps in such environments theywere continuously “retracing their steps” through the sliding window embedding spaceresulting in more densely-sampled curves and thus homology classes will form at lowerradii.0 .

5% methylcellulose 1% methylcellulose 2% methylcellulose 3% methylcellulose

Figure 7.

Average landscapes for each class.

The normalized pairwise distances between the discretized average landscapes for eachclass are shown in table 8. We include the origin — the discretization of the zero land-scape, i.e. the 0 vector — in these distance computations to complete the normalization.Normalization is such that the average distances between each class and the origin is 1.Multidimensional scaling on these distances visualizes the similarity between classes andis shown in ﬁgure 9. From the raw distances and the multidimensional scaling of the dis-tances, we can see that the high-viscosity classes (2% and 3% methylcellulose) together asa pair, the 0 .

5% class, and the 1% class are roughly equidistant from one another.0 .

5% 1% 2% 3%origin 1 . . . . .

5% 0 . . . . . . Table 8.

Normalized pairwise distances between discretized landscapes of each class,where distance is Euclidean distance between vectors in R and the normalizationis such that the average distance to the origin is 1. igure 9. Multidimensional scaling of pairwise distances between classes and theorigin.

Principal component analysis on the set of sample persistence landscapes gives the graphsin ﬁgure 10. In (a), projections of the discretized sample landscapes onto the ﬁrst twoprincipal components are hollow and projections of the discretized class landscapes aresolid. Here we see results similar to those from the multidimensional scaling in ﬁgure 9:the low-viscosity class landscapes are far from each other and the high-viscosity classes,while the high-viscosity class landscapes are quite close. We can also see some of thevariance within classes. The low-viscosity classes have much more variability than thehigh-viscosity classes, with the highest-viscosity class, 3% methylcellulose, seems to havecomparatively very small variability, at least in these ﬁrst two principal components.These conclusions about variation in each of the classes are supported by the standarddeviations of each coordinate corresponding to each class. In ﬁgure 11 the standard de-viations of each coordinate are graphed as sequences of functions on R so that individualstandard deviations can be easily matched up with their corresponding locations on thelandscapes. We concluded that we saw very little variation in the 3% class, slightly morein the 2% class, and much more in the 0 .

5% and 1% classes. The 1% class showed morevariation in higher-depth landscapes than the 0 .

5% class, suggesting that

C. elegans canproduce slightly more complex behaviors in a slightly higher viscosity environment. Thevariances of the 0 .

5% and 1% classes were also distinct; 0 .

5% samples varied more towardsthe lower radius parameters (the left side of the graph), whereas the set of 1% samplesvary more towards higher (more to the right) radius parameters.16a)(b)

Figure 10. (a) Projection of discretized sample average persistence landscapes onto2 principal components, with average persistence landscapes for each experimentalcondition in solid. (b) The ﬁrst two principal components. .

5% methylcellulose 1% methylcellulose 2% methylcellulose 3% methylcellulose

Figure 11.

Standard deviations of each coordinate for each class.

The following analysis was an attempt to understand the complexity of behavior expressedin each class. Figure 12 shows results from using PCA on the samples in each class.The viscosity of the environment is negatively correlated with the percent of varianceexplained by the ﬁrst principal component, which suggests that behaviors in low-viscosityenvironments are simpler than those in high-viscosity environments. Viscosity appearsto be correlated with the number of nonzero landscapes, which also suggests that high-viscosity environments allow for more varied behaviors.17 .

5% methylcellulose 1% methylcellulose 2% methylcellulose 3% methylcellulose(a)(b)(c) 90 .

5% 86 .

9% 71 .

7% 41 . .

3% 7 .

4% 8 .

2% 19 . .

9% 4 .

5% 6 .

0% 10 . Figure 12.

PCA on the samples in each class. For each class: (a) video framesshowing representatively complex postures. (b) The cumulative variance of the ﬁrst n principal components. (c-e) The ﬁrst 3 principal components, labeled with thepercent of the variance described by that component. These results show increasingcomplexity of poses and behavior as environment becomes more viscous. We conducted permutation tests between pairs of classes to determine how much per-sistence can distinguish between samples from diﬀerent classes. The p -values for thesecomputations are shown in table 13 (a). The permutation test gives strong evidence ofstatistical signiﬁcance, i.e. that the topological summaries of samples from each class are18igniﬁcantly diﬀerent.We then used multiclass support vector machines to build a classiﬁer for the samples. Theestimated accuracy of the classiﬁer, computed by averaging accuracies across 20 instantia-tions of the multiclass SVM classiﬁer, was 95.125%. This gives further strong evidence thatpersistent homology is detecting meaningful, distinguishing features from the C. elegans behavior samples. A sample confusion matrix for one instance of SVM is shown in table 13(b). Finally, we used support vector regression to estimate the methylcellulose content inthe environment for each sample. The results are graphed in ﬁgure 14. There are twooutliers on this graph which are estimated as having negative methylcellulose content. Thetwo worms in these samples moved much more quickly than their peers, so we believe thatthe SVR is picking up on the strong negative correlation between viscosity of the environ-ment and speed, and based on these worms’ fast speed, assigning a methylcellulose contentthat is so low it is negative.(a) (b)1% 2% 3%0 .

5% 0 . . . . . . .

5% 1% 2% 3%0 .

5% 10 0 0 01% 0 9 0 02% 0 1 9 03% 0 0 1 10

Table 13. (a) Permutation test results. (b) Confusion matrix for one instance ofSVM with 10-fold cross validation.

Figure 14.

SVR estimates of methylcellulose content for each sample. Horizontallines are at 0 . Discussion

In this paper we analyzed

C. elegans behavior data using persistent homology to studythe eﬀects of changes in the viscosity of the environment on behavior and locomotion.We constructed sliding window embeddings of time series of piecewise linear

C. elegans skeletons and used degree 1 persistent homology to create topological summaries of eachsample in the form of discretized persistence landscapes. Discretized average persistencelandscapes for each viscosity level allowed for analysis based on distance between classes,permutation tests, and multiclass SVM.Our analysis showed that persistent homology can detect the variation of behavior inducedby changes in the viscosity of the environment. It also suggests that persistence can measurecomplexity of behavior, and futhermore is suﬃciently interpretable to match individualbehaviors to topological features. As far as we are aware, this is the ﬁrst application ofpersistence to

C. elegans behavior data.Our analysis has implications for future experimental design. We observed that low-viscosity environments allow for the detection of variation between samples, while high-viscosity environments may allow animals to perform more complex and varying behaviors.Tuning the viscosity of the environment for an experiment or performing experiments inmultiple ﬂuid environments with varying viscosities could allow for more easily assessingresults regarding variations within populations vs variations in behavior.Our results indicate that a high level of viscosity, like that of the 3% methylcelluloseenvironment, is good for studying the behaviour of C. elegans under other experimentalperturbations (e.g. genotype or other environmental changes) for two reasons: 1. There is agreater complexity of locomotion. The data is higher-dimensional, as indicated by the lowerpercent variation explained by the initial principal components. Also, the principal com-ponents are more complicated. 2. There is less variation among the observations/worms.This suggests that fewer observations/worms are need to accurately estimate the averagepersistence landscape.An interesting extension to this experiment would be to include samples from two newenvironmental conditions: buﬀer, which would correspond to 0% methylcellulose and alower viscosity than appears in our current data; and agar, which provides a solid surfacefor the worms to crawl on and surrounding air as opposed to an aqueous environment toswim in. We expect a new buﬀer class to allows for only fast, simple behaviors in line withthe experiments already done, but the agar environment may induce signiﬁcantly diﬀerentbehaviors in the subjects. It would be interesting to see if there is a qualitative diﬀerencebetween behavior in aqueous solid environments, and whether persistence can detect sucha diﬀerence. 20 eferences [BBS16] Gordon J. Berman, William Bialek, and Joshua W. Shaevitz. “Predictabilityand hierarchy in Drosophila behavior”. In:

Proceedings of the National Academyof Sciences of the United States of America issn : 10916490. doi : . arXiv: . url : https://pubmed.ncbi.nlm.nih.gov/27702892/ .[BK+15] M. Backholm, A. K.S. Kasper, R. D. Schulman, W. S. Ryu, and K. Dalnoki-Veress. “The eﬀects of viscosity on the undulatory swimming dynamics of C.elegans”. In: Physics of Fluids issn : 10897666. doi : . url : http://aip.scitation.org/doi/10.1063/1.4931795 .[Bub15] Peter Bubenik. “Statistical topological data analysis using persistence land-scapes”. In: Journal of Machine Learning Research issn : 15337928. arXiv: . url : http://arxiv.org/abs/1207.6437%20http://jmlr.org/papers/v16/bubenik15a.html .[BY+13] Andr´e E.X. Brown, Eviatar I. Yemini, Laura J. Grundy, Tadas Jucikas, andWilliam R. Schafer. “A dictionary of behavioral motifs reveals clusters ofgenes aﬀecting Caenorhabditis elegans locomotion”. In: Proceedings of the Na-tional Academy of Sciences of the United States of America issn : 00278424. doi : . url : https://pubmed.ncbi.nlm.nih.gov/23267063/ .[Cra15] William Crawley-Boevey. “Decomposition of pointwise ﬁnite-dimensional per-sistence modules”. In: Journal of Algebra and Its Applications issn : 0219-4988. doi : . arXiv: . url : .[CZ+11] Kwanghun Chung, Mei Zhan, Jagan Srinivasan, Paul W. Sternberg, EmilyGong, Frank C. Schroeder, and Hang Lu. “Microﬂuidic chamber arrays forwhole-organism behavior-based chemical screening”. In: Lab on a Chip issn : 14730189. doi : . url : https://pubmed.ncbi.nlm.nih.gov/21935539/ .[DA+08] Mary-Lee Dequ´eant, Sebastian Ahnert, Herbert Edelsbrunner, Thomas M. A.Fink, Earl F. Glynn, Gaye Hattem, Andrzej Kudlicki, Yuriy Mileyko, JasonMorton, Arcady R. Mushegian, Lior Pachter, Maga Rowicka, Anne Shiu, BerndSturmfels, and Olivier Pourqui´e. “Comparison of Pattern Detection Methodsin Microarray Time Series of the Segmentation Clock”. In: PLoS ONE issn : 1932-6203. doi :

10 . 1371 / ournal.pone.0002856 . url : https://dx.plos.org/10.1371/journal.pone.0002856 .[HBB20] Alec Helm, Ann S. Blevins, and Danielle S. Bassett. “The growing topology ofthe C. elegans connectome”. In: (Dec. 2020). arXiv: . url : http://arxiv.org/abs/2101.00065 .[HC+12] Steven J. Husson, Wagner Steuer Costa, Cornelia Schmitt, and AlexanderGottschalk. Keeping track of worm trackers. doi : . url : https://pubmed.ncbi.nlm.nih.gov/23436808/ .[KM16] Firas A. Khasawneh and Elizabeth Munch. “Chatter detection in turning us-ing persistent homology”. In: Mechanical Systems and Signal Processing issn : 10961216. doi : .[LG+20] Daniel L¨utgehetmann, Dejan Govc, Jason P. Smith, and Ran Levi. “ComputingPersistent Homology of Directed Flag Complexes”. In: Algorithms issn : 1999-4893. doi : . url : .[LS+18] Mochi Liu, Anuj K. Sharma, Joshua W. Shaevitz, and Andrew M. Leifer. “Tem-poral processing and context dependency in caenorhabditis elegans response tomechanosensation”. In: eLife issn : 2050084X. doi : . url : https://pubmed.ncbi.nlm.nih.gov/29943731/ .[MMH11] Yuriy Mileyko, Sayan Mukherjee, and John Harer. “Probability measures onthe space of persistence diagrams”. In: Inverse Problems issn : 02665611. doi :

10 . 1088/ 0266 - 5611 / 27 / 12/ 124007 . url : http://stacks.iop.org/0266- 5611/27/i=12/a=124007?key=crossref.95e8c511fc5be094303f6bde8cb2bafd .[Mor] Dmitriy Morozov. Dionysus . url : .[PD+15] Jose A. Perea, Anastasia Deckard, Steve B. Haase, and John Harer. “SW1PerS:Sliding windows and 1-persistence scoring; discovering periodicity in gene ex-pression time series data”. In: BMC Bioinformatics issn : 1471-2105. doi : . url : .22Per16] Jose A. Perea. “Persistent homology of toroidal sliding window embeddings”.In: ICASSP, IEEE International Conference on Acoustics, Speech and SignalProcessing - Proceedings . Vol. 2016-May. Institute of Electrical and ElectronicsEngineers Inc., May 2016, pp. 6435–6439. isbn : 9781479999880. doi : .[PG+19] Daniel A. Porto, John Giblin, Yiran Zhao, and Hang Lu. “Reverse-CorrelationAnalysis of the Mechanosensation Circuit and Behavior in C. elegans RevealsTemporal and Spatial Encoding”. In: Scientiﬁc Reports issn :20452322. doi : . url : https://pubmed.ncbi.nlm.nih.gov/30914655/ .[PH15] Jose A. Perea and John Harer. “Sliding Windows and Persistence: An Applica-tion of Topological Methods to Signal Analysis”. In: Foundations of Computa-tional Mathematics doi : . arXiv: . url : https://link.springer.com/article/10.1007 / s10208 - 014 - 9206 - z % 20http : / / link . springer . com / 10 . 1007 /s10208-014-9206-z .[PS+13] Giovanni Petri, Martina Scolamiero, Irene Donato, and Francesco Vaccarino.“Topological Strata of Weighted Complex Networks”. In: PLoS ONE issn : 19326203. doi : . arXiv: . url : .[SC+11] Jeﬀrey N. Stirman, Matthew M. Crane, Steven J. Husson, Sebastian Wabnig,Christian Schultheis, Alexander Gottschalk, and Hang Lu. “Real-time multi-modal optical control of neurons and muscles in freely behaving Caenorhabditiselegans”. In: Nature Methods issn : 15487091. doi : . url : https://pubmed.ncbi.nlm.nih.gov/21240278/ .[SG+11] Nicholas A. Swierczek, Andrew C. Giles, Catharine H. Rankin, and Rex A.Kerr. “High-throughput behavioral analysis in C. elegans”. In: Nature Methods issn : 15487091. doi : . url : https://pubmed.ncbi.nlm.nih.gov/21642964/ .[SHP17] Bernadette J. Stolz, Heather A. Harrington, and Mason A. Porter. “Persis-tent homology of time-dependent functional networks constructed from cou-pled time series”. In: Chaos issn : 10541500. doi : . arXiv: . url : http://aip.scitation.org/doi/10.1063/1.4978997 .[SJ+08] Greg J. Stephens, Bethany Johnson-Kerner, William Bialek, and William S.Ryu. “Dimensionality and Dynamics in the Behavior of C. elegans”. In: PLoSComputational Biology issn :1553-7358. doi : . url : https://dx.plos.org/10.1371/journal.pcbi.1000028 .23SP+19] Ann E. Sizemore, Jennifer E. Phillips-Cremins, Robert Ghrist, and DanielleS. Bassett. “The importance of the whole: Topological data analysis for thenetwork neuroscientist”. In: Network Neuroscience issn : 24721751. doi : . arXiv: . url : https://pubmed.ncbi.nlm.nih.gov/31410372/ .[TP18] Christopher J. Tralie and Jose A. Perea. “(Quasi)periodicity quantiﬁcationin video data, using topology”. In: SIAM Journal on Imaging Sciences issn : 19364954. doi : . arXiv: . url : https://github.com/ctralie/SlidingWindowVideoTDA. .[Tra16] Christopher J. Tralie. “High dimensional geometry of sliding window embed-dings of periodic videos”. In: Leibniz International Proceedings in Informat-ics, LIPIcs issn : 18688969. doi : .[YJ+13] Eviatar Yemini, Tadas Jucikas, Laura J. Grundy, Andr´e E.X. Brown, andWilliam R. Schafer. “A database of Caenorhabditis elegans behavioral phe-notypes”. In: Nature Methods issn : 15487091. doi :

10 . 1038 / nmeth . 2560 . url :