A Stochastic Automata Network Description for Spatial DNA-Methylation Models
AA Stochastic Automata Network Description forSpatial DNA-Methylation Models
Alexander L¨uck and Verena Wolf Department of Computer Science, Saarland University, Saarbr¨ucken, Germany
Abstract.
DNA methylation is an important biological mechanism toregulate gene expression and control cell development. Mechanistic mod-eling has become a popular approach to enhance our understanding ofthe dynamics of methylation pattern formation in living cells. Recentfindings suggest that the methylation state of a cytosine base can beinfluenced by its DNA neighborhood. Therefore, it is necessary to gener-alize existing mathematical models that consider only one cytosine andits partner on the opposite DNA-strand (CpG), in order to include suchneighborhood dependencies. One approach is to describe the system asa stochastic automata network (SAN) with functional transitions. Weshow that single-CpG models can successfully be generalized to multipleCpGs using the SAN description and verify the results by comparingthem to results from extensive Monte-Carlo simulations.
Keywords:
DNA Methylation, Stochastic Automata Networks, SpatialStochastic Model
The field of epigenetics investigates changes in gene function that are not relatedto the organism’s genetic code but rely on epigenetic markers. DNA methylationis one of the main epigenetic markers regulating gene expression and controllingcell fate. The epigenetic code is passed along from one generation to the nextbut environmental factors such as diet, stress, and prenatal nutrition can modifysuch markers resulting, for instance, in the deactivation or activation of certaingenes. In the case of DNA methylation, a marker is set by adding a methyl groupto cytosine (C) bases on the DNA. The conversion of C to its methylated form(5-methylcytosine) is carried out by DNA methyltransferase (Dnmt) enzymesand the resulting methylation pattern determines the program and function ofthe corresponding cell.Together with a subsequent guanine (G) and the corresponding GC-pairon the opposite strand, a cytosine base forms a so-called CpG, which may bemethylated on none, one, or both DNA strands. Mechanistic models based onDiscrete-Time Markov Chains (DTMCs) have been developed to describe thetemporal evolution of the methylation state of a CpG [1]. The goal of suchmodels is to increase the understanding of methylation pattern formation. Dueto the possible influence of the neighboring CpGs [9,12] generalizations of simple, a r X i v : . [ q - b i o . GN ] O c t ingle-CpG models have recently been developed, which take more than one CpGinto account [2,13,14,15]. However, even for a small number of CpGs the size ofthe transition matrices grows rapidly and they are therefore hard to generatewithout suitable methods. One of these methods are the stochastic automatanetworks (SANs) [17,18]. In this work we show how to successfully apply theSAN framework to Markov models for single CpGs in order to generalize them tomultiple CpGs with the aid of suitable, neighbor state dependent rate functions.This paper is organized is follows: In Section 2 we describe the model fora single CpG and show how to generalize it to several CpGs using the SANframework. A comparison with Monte-Carlo simulations in order to verify theresults of our generated matrices can be found in Section 3. In Section 4 weconclude our findings and give ideas for possible future work. In order to describe the methylation dynamics of a CpG in terms of DTMCs, weconsider CpGs on the double stranded DNA where each CpG contains two Cs(one on each strand), each of which can be either methylated or unmethylated.Therefore there are four possible states, i.e., both Cs unmethylated (state 1),only the C on the upper strand methylated (state 2), only the C on the lowerstrand methylated (state 3) and both Cs methylated (state 4). The processes thatlead to state transitions are cell division, maintenance and de novo methylation.During cell division, one strand and its methylation state are kept as theyare (parental strand), while the other strand is newly synthesized (daughterstrand) with all Cs unmethylated. Maintenance methylation adds methylationto the C on the daughter strand if the corresponding C on the parental strandis already methylated in order to reestablish existing methylation patterns. Incontrast to that, de novo methylation may occur on unmethylated Cs on bothstrands independent of the methylation state of the C on the opposite strandand is therefore responsible for creating new patterns. The transition probabilitymatrices for these processes for a single CpG are listed in Tab. 1. In the leftcolumn the matrices concerning the upper strand are listed and in the rightcolumn the matrices for the lower strand.It is biologically plausible to assume that cell division happens first, after-wards maintenance on the daughter strand and in the end de novo on bothstrands takes place. Since the strand that is kept after cell division is chosenrandomly with equal probability, the total transition probability matrix P isgiven by P = 0 . · ( CD · M + CD · M ) · T · T . (1)Given a sequence of L CpGs, each CpG can be described by the aforemen-tioned DTMC, which gives us one automaton of the automata network. Thestructure of each automaton is independent of the automata describing neigh-boring CpGs, however, the transition probabilities may depend on their localstates (functional transitions). A suitable method to combine these automata inorder to capture the dynamics of whole sequences of CpGs is to consider them able 1.
Transition matrices for a single CpG. Note that the transition probabilities f may be functions of the reaction parameters, the CpG position and/or the states ofthe adjacent CpGs. The matrices in the left column represent the transitions on theupper and the matrices in the right column the transitions on the lower strand. Cell Division CD = CD = Maintenance M = − f f M = − f f De Novo T = − f f − f f T = − f f
00 1 − f f as an automata network. Within the SAN framework, the transition matricesof the individual automata are combined via the Kronecker product. Since inour case the transition matrix for one automaton is a product of the transitionmatrices for the different processes, we exploit some properties of the Kroneckerproduct to generate the global transition matrix of the network. From A ⊗ ( B ⊗ C ) = ( A ⊗ B ) ⊗ C (2)( AC ) ⊗ ( BD ) = ( A ⊗ B ) · ( C ⊗ D ) (3)the following properties can be derived [6] (cid:32) N (cid:89) n =1 A n (cid:33) ⊗ (cid:32) N (cid:89) n =1 B n (cid:33) = N (cid:89) n =1 ( A n ⊗ B n ) , (4) M (cid:79) m =1 (cid:32) N (cid:89) n =1 A ( m ) n (cid:33) = N (cid:89) n =1 (cid:32) M (cid:79) m =1 A ( m ) n (cid:33) . (5)Note that in Eqs. (3)-(5) the corresponding matrices have to be compatible un-der the standard matrix product.As a consequence of Eq. (5) it is possible to obtain the total transition matrix P in two ways: First compute a transition matrix for a single CpG (Eq. (1)) andextend the result to several CpGs with the Kronecker product or calculate thetransition matrices for the different processes for several CpGs first via the Kro-necker product and combine them afterwards. Since the transition probabilitiesmay depend on the neighbor states, i.e. the states of the adjacent automata, itis easier to choose the second possibility and construct the individual transitionmatrices for the different processes first. Another advantage is that the matrices L d e n s i t y Fig. 1.
Density of entries for the transition matrix P for different numbers of CpGs L . for the different processes are sparse, while the total transition matrix is quitedense for a single CpG (see Tab. 1 and Fig. 1), such that we apply the Kroneckerproduct to sparse matrices and multiply the (also sparse) results afterwards.If we assume that all CpGs are methylated independent of their neighborhood,then no functional transitions are needed and the transition probabilities are con-stants. The construction of the global transition matrix is then straightforwardby simply applying the Kronecker product. To model dependence, first observethat since only the transition probabilities, but not the transitions themselves,depend on the neighboring states, the structure of the global transition matrixis the same as in the independent case. By using functions instead of constantprobabilities, we are able to capture the effect of the neighbors on the transitionrates. Another advantage of the functions is that we can incorporate differentmodel assumptions (like processivity) by using different functions without alter-ing the structure of the transition matrices.To shape the function f := f ( r , l, s l − , s l +1 ) ∈ [0 ,
1] in the matrices in Tab. 1 toour needs, we use the following inputs: – r is a vector with the reaction parameters, – l ∈ { , . . . , L } is the position of the CpG such that boundary ( l = 1 , L ) andnon-boundary ( l = 2 , . . . , L −
1) CpGs can be distinguished, – s l − is the state of the left neighboring CpG and – s l +1 is the state of the right neighboring CpG.Depending on the methylation event (maintenance or de novo ) different para-meter vectors r can be chosen. Since in general all CpGs may undergo a reaction,the states of the neighboring CpGs that are used (before or after reaction) asan input for the function depends on the underlying assumptions. This will bedemonstrated in the following.We first note that the indices of the matrices in Tab. 1 correspond to thestates before and after transition, i.e. the entry a i,j corresponds to the proba-bility of going from state i to state j . Furthermore, there is a unique relationbetween the indices of the initial matrices and the indices of the result of theirronecker product a r,s · b v,w = ( A ⊗ B ) p ( r − v,q ( s − w , (6)( A ⊗ B ) i,j = a (cid:98) ( i − /p (cid:99) +1 , (cid:98) ( j − /q (cid:99) +1 · b i −(cid:98) ( i − /p (cid:99) p,j −(cid:98) ( j − /q (cid:99) q , (7)where A is a p × q - and B an arbitrary matrix. These formulas can easily begeneralized such that for a Kronecker product of L matrices we know exactly theindices for each of the matrices and therefore the states before and after transi-tion for each CpG. We then use this knowledge to choose the correct transitionprobability depending on our assumptions.For the transition probabilities, we impose, depending on the neighbor states,the following forms [14]: For de novo methylation events (the state of the otherstrand does not matter) we have transition probabilities for non-boundary cases ◦ ◦ ◦ → ◦ • ◦ p = 0 . · ( ψ L + ψ R ) τ, (8) • ◦ ◦ → • • ◦ p = 0 . · ( ψ L + ψ R ) τ + 0 . · (1 − ψ L ) , (9) ◦ ◦ • → ◦ • • p = 0 . · ( ψ L + ψ R ) τ + 0 . · (1 − ψ R ) , (10) • ◦ • → • • • p = 1 − . · ( ψ L + ψ R )(1 − τ ) , (11)where an empty circle represents an unmethylated and a filled circle a methylatedC. The parameters ψ L and ψ R characterize the dependency on the neighborstate, where ψ i = 1 means fully independent and ψ i = 0 full dependency. τ corresponds to the de novo probability in the independent case.For maintenance events, we replace the probability τ by µ and have the ad-ditional requirement that the C on the opposite strand must be methylated.For boundary cases, i.e., the left- and rightmost CpG in a sequence of L CpGs,we have the probabilities? ◦ ◦ → ? • ◦ ˜ p = (1 − ρ ) · p + ρ · p , (12)? ◦ • → ? • • ˜ p = (1 − ρ ) · p + ρ · p , (13) ◦ ◦ ? → ◦ • ? ˜ p = (1 − ρ ) · p + ρ · p , (14) • ◦ ? → • • ? ˜ p = (1 − ρ ) · p + ρ · p , (15)where we use the average methylation level ρ , since we do not have any infor-mation about the CpG on the left/right at the boundaries.For a moderate number of CpGs ( ≈
5) it is possible to explicitely constructthe whole transition matrix with a simple algorithm. We first note that we haveto apply the Kronecker product for the matrices in Tab. 1 L times with them-selves for a sequence of L CpGs. We then apply the following scheme:1. Identify the indices of the non-zero entries of the matrix.2. Calculate the indices of the resulting matrix after applying the Kroneckerproduct with Eq. (6) for the indices from step 1. Iteratively applying Eq. (6) L times leads to the final indices ( u, v ). For each ( u, v ) we get an ordered list (cid:96) containing the indices from the original matrices that lead to this index.3. For each ( u, v ) calculate the matrix entry m u,v = (cid:89) ( i,j ) ∈ (cid:96) a i,j , (16)where a i,j are the entries of the original matrix.. If a i,j contains the function f choose the neighbor states based on the as-sumption and the indices (states) from (cid:96) of the adjacent matrices.Note that for real data we have to ensure that all CpGs of a given sequenceoriginate from the same cell in order to properly investigate the neighborhooddependencies. Real data rarely covers states of more than a couple of successiveCpGs from the same cell with sufficiently deep coverage. Therefore, the numberof contiguous CpGs is usually very limited, such that the explicit constructionof the transition matrix for short CpG sequences is feasible in most cases. Fora possible larger number of CpGs from advanced measurement techniques wehave to resort to more sophisticated methods to obtain the transition matri-ces [3,4,5,18]. Example: Processivity
Since detailed mechanisms about the interaction of the Dnmts with the DNAremain elusive, we would like to test different assumptions such as processive ordistributive methylation mechanisms [8,10,14,16]. For the remainder of this pa-per, we assume processivity from left to right , which is a reasonable assumptionfor Dnmt1, due to its link to the replication machinery. The processivity fromleft to right implies, that a transition already happened at the left neighbor (po-sition l −
1) but not yet at the right neighbor ( l + 1). This means, given the listof indices (cid:96) = [ . . . , ( i l − , j l − ) , ( i l , j l ) , ( i l +1 , j l +1 ) , . . . ] we choose j l − for the leftneighbor state and i l +1 for the right neighbor state as an input for the functionin step 4 of our algorithmic scheme. In order to check the correctness of the matrices that we generate with the Kro-necker product for L CpGs, we compare the resulting distributions with resultsfrom Monte-Carlo (MC) simulations. As initial distribution π we use a discreteuniform distribution which assigns the same probability to all possible methyla-tion patterns. We then compute distributions π ( t ) after t = 30 cell division andsubsequent methylation events via π ( t + 1) = π ( t ) · P, (17)where P is the total transition matrix, where we assume processivity. We performthe corresponding MC simulations with N = 10 runs. The distributions fordifferent parameter sets are shown in Fig. 2. Panels (a) and (b) show the fullydependent case, where the transition probabilities depend only on the neighborstates and not on the actual maintenance and de novo rates µ and τ . In Fig. 2(c) the transition probabilities are totally independent of the neighboring statesand depend solely on µ and τ . This case is equivalent to the case where wereplace the function f by the respective (constant) transition probabilities. Fig. 2(d) shows a case with some dependency on both neighbor states, where thedependence to the left is slightly stronger. Choosing a wrong transition functionin the matrix entries (compared to MC, where it is easier to ensure the correctchoice) would affect the distribution in (a) and (b) the most, since there is afull dependency and hence the largest effect from the neighboring states. For
200 400 600 800 1000 pattern index Z P ( Z ) (a) (0.8,0,0,0.1)
800 825 850 875 900 925 950 975 1000 pattern index Z P ( Z ) (b) Detailed excerpt from (a). pattern index Z P ( Z ) (c) (0.8,1,1,0.1) pattern index Z P ( Z ) (d) (0.8,0.4,0.6,0.1) Fig. 2.
Comparison of distributions obtained from transition matrices generated fromthe SAN description (blue) and from MC simulations (red). The parameters for eachsubfigure are given in the form ( µ, ψ L , ψ R , τ ). the partial dependencies in (d) there should also be an effect if the choices werewrong. In the independent case (c) there can not be a wrong choice, since thetransition function is a constant.In all cases we observe an almost perfect agreement with only small devia-tions on some patterns on a very small scale. In order to exclude a flaw in theconstruction of the transition matrix we compute the Hellinger distance H ( P, Q ) = 1 √ (cid:32) k (cid:88) i =1 ( √ p i − √ q i ) (cid:33) (18)to compare the similarity of the distributions and to check if the deviations stemfrom the finite number of MC simulation runs. From Fig. 3 it is obvious that withan increasing number of runs the distributions become more and more similarsuch that we can indeed exclude a flaw in the matrix construction. The smalldeviations stem from the finite number of runs since for N = 10 there are stillstatistical inaccuracies and hence H is quite large (order of 10 − ). In this paper, we propose an adaptation of the stochastic automata networkframework to solve problems related to the generation of transition probabilitymatrices for a biological application. With the proposed procedure we are able N H Fig. 3.
Hellinger distance H between distribution obtained from numerical SAN solu-tion and from MC simulations with N runs for the parameter set of Fig. 2 (d). to describe the methylation dynamics of a sequence of several CpGs, given thematrices for a single CpG. The transition matrices form the basis for numeri-cal solutions of mechanistic models, which allow to test assumptions about thefunctioning of methylation enzymes. The presented framework allows to considerlonger CpG sequences and hence the proper investigation of larger genomic re-gions. The transition matrix for L CpGs can systematically be generated fromthe small single-CpG matrices and the generation is less prone to errors thanan ad-hoc approach. It is also pretty easy to adapt the model to test differentbiological assumptions by using different functions in the transition matrices. Inour example, we assumed processivity from left to right, but by changing thefunctions other assumptions like processivity from right to left (less biologicallyplausible) or even non-processive (e.g. distributive) behavior can be realized. It isalso easily possible to introduce additional reaction parameters for each individ-ual CpG within this framework to generalize the model. Using the same reactionparameters for all CpGs is a strong assumption, especially for the neighborhooddependencies, which should intuitively be different due to the (in general) differ-ent distances between CpGs or also due to different base sequences in the DNA.Furthermore, it is also straightforward to apply the SAN approach to more com-plex methylation models in order to investigate possible neighborhood depen-dencies. This is especially useful when there are more than four states per CpGas with more states the transition matrix grows rapidly. Introducing additionalhydroxylated Cs as in [7] the number of states per CpG grows from four to eightsuch that the transition matrix grows from 4 L × L to 8 L × L for L CpGs.With even more possible modifications of C, such as the formylated form 5-formylcytosin, the number of possible states and hence the matrix size growseven more (16 L or 16 L × L respectively). In this case, the SAN descriptionbecomes even more useful as it would be very tedious to generate the transitionmatrix in other ways. It is also possible to apply the SAN approach to continu-ous time Markov chains or hybrid models as in [11]. Here, the discrete transitionmatrix was generated with a Kronecker product, while the continuous generatormatrix can be generated with a Kronecker sum. eferenceseferences