A Hybrid HMM Approach for the Dynamics of DNA Methylation
Charalampos Kyriakopoulos, Pascal Giehr, Alexander Lück, Jörn Walter, Verena Wolf
AA Hybrid HMM Approach for the Dynamics ofDNA Methylation
Charalampos Kyriakopoulos , Pascal Giehr , Alexander L¨uck , J¨orn Walter ,and Verena Wolf Department of Computer Science, Saarland University, Saarbr¨ucken, Germany Department of Biological Sciences, Saarland University, Saarbr¨ucken, Germany
Abstract.
The understanding of mechanisms that control epigeneticchanges is an important research area in modern functional biology. Epi-genetic modifications such as DNA methylation are in general very stableover many cell divisions. DNA methylation can however be subject tospecific and fast changes over a short time scale even in non-dividing(i.e. not-replicating) cells. Such dynamic DNA methylation changes arecaused by a combination of active demethylation and de novo methyla-tion processes which have not been investigated in integrated models.Here we present a hybrid (hidden) Markov model to describe the cycleof methylation and demethylation over (short) time scales. Our hybridmodel decribes several molecular events either happening at determinis-tic points (i.e. describing mechanisms that occur only during cell division)and other events occurring at random time points. We test our model onmouse embryonic stem cells using time-resolved data. We predict methy-lation changes and estimate the efficiencies of the different modificationsteps related to DNA methylation and demethylation.
Keywords:
DNA Methylation, Hidden Markov Model, Hybrid Stochas-tic Model
All cells of a multi-cellular organism share the same DNA sequence, yet, depend-ing on location and cell type, display distinct cellular programs as a result ofcontrolled gene expression. Hence, the expression of genes is regulated by epige-netic factors such as DNA methylation. In mammals, the methylation of DNAis restricted to the C5 position of cytosine (C) and mostly appears in a CpGdi-nucleotide sequence [6,7]. The palindromic nature of CpG positions providesa symmetry which, after DNA replication, allows the stable inheritance of themethylation “signal”. Methylation of C to 5-methyl cytosine (5mC) is catalysedby a certain enzyme family, the DNA methyltransferases (Dnmts) [2,28]. Threeconserved and catalytic active family members are associated with the methy-lation of DNA. Dnmt1 is mainly responsible for maintenance methylation afterDNA replication, i.e. the enzyme mainly reestablishes the methylation patternon the newly synthesized daughter strand [15] according to the inherited infor-mation of the parental DNA strand. The enzymes Dnmt3a and Dnmt3b perform a r X i v : . [ q - b i o . GN ] J a n e novo methylation, where new methyl groups are added on unmethylated Cs[27]. However, there is evidence that this separation of tasks is not definite andthat all Dnmts may carry out all tasks to a certain extent [1].Once established, 5mC can be further modified by oxidation to 5-hydroxyme-thyl cytosine (5hmC), which can again be oxidized to 5-formyl cytosine (5fC).Then, 5fC can eventually be converted to 5-carboxy cytosine (5caC). All theseprocesses are carried out by the ten-eleven translocation (Tet) enzymes [16,29]. Aconsiderable level of 5hmC can be found in many cells types and its occurrencehas been connected to gene regulation as well as genome wide loss of DNAmethylation [12,19]. In contrast, 5fC and 5caC are far less abundant and theirparticular functions remain more illusive. Nevertheless, studies suggest that bothoxidized cytosine variants function as intermediates during enzymatic removalof 5mC from the DNA [12].In general, DNA methylation can be removed in two ways. First, after DNAreplication, the absence or blocking of maintenance methylation will result in apassive DNA methylation loss with each cell division ( passive demethylation ).Second, generated 5fC or 5caC is enzymatically removed from the DNA and sub-sequently replaced by unmodified cytosine ( active demethylation ) [4,13,14,24].While DNA replication and the associated maintenance methylation hap-pens only once per cell division cycle, de novo methylation and the modificationprocesses of 5mC via oxidation, as well as active demethylation, may happenat arbitrary time points. With purely discrete hidden Markov models (HMM),as used in [1] and [9], it is difficult to describe multiple instances of de novo methylation or other modification events for a given CpG during one cell di-vision cycle. Here, we introduce a hybrid HMM to describe the dynamics ofactive demethylation. It distinguishes between events at fixed time points (celldivision and maintenance methylation) and events at random time points ( denovo methylation, oxidations, active demethylation). By applying our model toa data set of a single copy gene from mouse embryonic stem cells, we were ableto accurately predict the frequency of the observable CpG states and the lev-els of the hidden states, which correspond to the different modified forms of C.Furthermore, we show how to estimate the enzymatic reaction efficiencies usinga maximum likelihood approach.Compared to previous models for describing DNA methylation, we here pro-pose an approach that takes into account both, the rather fixed and deterministictiming of cell division and the random nature of the enzymatic processes. In thisway, we are able to present a model that realistically describes the dynamics ofDNA methylation and improves our understanding of active demethylation.The paper is organized as follows: In Section 2 we give the necessary biologicaland mathematical background, i.e. we explain passive and active demethylationin more detail, describe the model and explain the parameter estimation pro-cedure. In Section 3 the results are discussed and in Section 4 we conclude ourfindings. m fu η φ δ µ d Fig. 1.
Schematic representation of de novo methylation and the active demethylationloop, where we use the following notation for the methylation states: (Unmethylated)C is denoted by u , 5mC by m , 5hmC by h , and 5fC or 5caC by f . The correspondingenzymatic reaction rates are µ d ( de novo methylation), η (oxidation), φ (formylation),and δ (demethylation). One can distinguish between two different ways of losing methylation at cy-tosines: After cell division a new (daughter) strand of the DNA is synthesized.Initially, all cytosines of the daughter strand are unmethylated, while the methy-lation states of cytosines on the parental strand remain unchanged. Maintenancemethylation, which happens during the replication process, is used to re-establishthe methylation pattern at the newly synthesized strand. However, the absence orinhibition of maintenance methylation causes a loss of 5mC with each replicationstep. This DNA replication dependent loss has been termed passive demethyla-tion .For active demethylation it is assumed that oxidation of 5mC to 5fC or 5caCvia 5hmC and a subsequent enzymatic removal of the oxidative cytosine from theDNA occurs (cf. Fig 1). Since de novo methylation as well as active demethylation are replication independent, the loop depicted in Fig. 1 can be traversed multipletimes within one cell cycle. In measurements, we see Cs in all stages, i.e., eithermethylated, oxidized (5hmC/5fC/5caC) or unmodified.
In this section, we present a model that describes the state changes of a singleCpG over time. It can be seen as a hybrid extension of previous discrete-timemodels [1,9]. The (hidden) states of a CpG correspond to the set of all pairsof the four possibilities in Fig. 1, i.e., { u, m, h, f } , because it contains a C onboth strands of the DNA. We split the transitions of our model into transitions able 1. Cell division matrix D . uu um uh uf mu mm mh mf hu hm hh hf fu fm fh ffuu um uh uf mu mm mh mf hu hm hh hf fu fm fh ff that occur at fixed times and those that occur at random times. This resultsin a mixture of a discrete time Markov chain (DTMC) and a continuous timeMarkov chain (CTMC). In the following we will refer to the events or transitionsthat occur at fixed time points as discrete part of the model, while we refer tothe other events or transitions at the random time points as continuous part ofthe model.We assume that cells divide after a fixed time interval (usually every 24hours). Hence, these events correspond to deterministic transitions at fixed times.During cell division one strand is kept as it is (parental strand) and all methy-lation states and its modifications remain unchanged, while one DNA strand isnewly synthesized (daughter strand) and therefore contains only unmethylatedcytosine. Consequently, after cell divison a CpG that was modified on both sidesbecomes a CpG that is unmodified on one side. Since the parental strand ischosen at random, the probability for each of the two successor states (corre-sponding to the state of the CpG in the two daughter cells) is 0.5. The fulltransition probability matrix D for cell division is shown in Tab. 1. Note thatthe cell division matrix is time independent.Maintenance methylation, i.e. methylation events that occur on hemimethy-lated CpGs to reestablish methylation patterns, is known to be linked to thereplication fork [22]. We therefore consider maintenance to occur together withthe cell division at the same fixed time points. Hence, cell division and main- m μ m p μ m p μ m Fig. 2.
Maintenance methylation events occur at (fixed) times t , t , . . . , t n and belongtherefore to the disrete part of the model. Each state is represented by two coloreddots, one for each C on the two strands of the DNA. Unmethylated C is blue, 5mC red,5hmC yellow and 5fmC cyan. The arrows indicate the possible transitions. Note thatwe omitted the self loops with probability 1 for states where no transition is possible.The four shown self loops have probability 1 − µ m or 1 − ¯ pµ m respectively. tenance can be described by a (discrete-time) Markov chain whose transitionprobability matrix P ( t ) is defined in the sequel in Eq. (1). Maintenance mayhappen at the daughter strand if there is a methylated C on the parental strand,i.e. on hemimethylated CpGs ( um or mu ), with probability µ m ( t ). Since it isreasonable to assume that hemihydroxylated CpGs ( uh or hu ) have differentproperties compared to hemimethylated CpGs in terms of maintaining existingmethylation patterns, we describe the maintenance probability for hemihydrox-ylated CpGs as follows. Let p be the probability that 5hmC is recognized asunmethylated by maintenance enzymes, i.e., the enzyme will not perform main-tenance of a hemihydroxylated CpG. Then, the maintenance probability of sucha CpG is given by ¯ pµ m ( t ), where ¯ p = 1 − p . Note that there is no equivalent prob-ability for uf or f u , i.e. we assume that CpGs with 5fC or 5caC at one strand arenot maintained [17]. The transition probability matrix for maintenance events isillustrated in Fig. 2, where we omitted the time dependency of µ m ( t ). Wheneverno transition is possible, i.e. there is only a self loop for this state (omitted inFig. 2) the corresponding diagonal entry in the matrix is 1. Where a transition ispossible we set the corresponding (off-diagonal) entry in the matrix to its transi-tion probability and the diagonal entry to 1 minus its transition probability. Allother entries in the matrix are 0. One discrete step of the corresponding DTMCcorresponds to one cell division, including maintenance methylation. Hence, its ig. 3. Possible transitions in the continuous part of the model. Each state is repre-sented by two colored dots, one for each C on the two strands of the DNA. Unmethy-lated C is blue, 5mC red, 5hmC yellow and 5fmC cyan. The arrows indicate the possibletransitions, whereupon the colors indicate the different reactions and their rates, i.e.methylation with rate µ d (red), oxidation with rate η (yellow), formylation with rate φ (cyan) and active demethylation with rate δ (blue). transition probability matrix is defined as P ( t ) = D · M ( t ) . (1)Every other event may occur an arbitrary (unknown) number of times be-tween two cell divisions at random time points and will be described by acontinuous-time Markov jump process. These events are de novo methylation( u → m ) with rate µ d ( t ), hydroxylation ( m → h ) with rate η ( t ), formylation( h → f ) with rate φ ( t ) and active demethylation ( f → u ) with rate δ ( t ) (cf.Fig 1). Note that all these events may happen on both strands, independentof the state on the complementary strand. All possible reactions are shown inFig. 3. From these transitions the infinitesimal generator matrix Q ( t ) of the jumpprocess can easily be inferred. For the off-diagonal elements, we set the entries tothe respective reaction rate if a reaction is possible between two states, indicatedby a colored arrow in Fig. 3, and to 0 if no reaction is possible. The diagonalelements are then given by the negative sum of the off-diagonal elements of therespective row.n order to describe the time evolution let us first define the set of timepoints T d = { t , t , . . . , t n } , at which cell division and maintenance occur. Weassume that there are in total n of these events. At these time points t i the timeevolution of the probability distribution of the states is given by π ( t i + ∆t ) = π ( t i ) · P ( t i ) , (2)where ∆t is the duration of cell division and maintenance. After cell division andmaintenance the other events take place at random time points until the timepoint for the next cell division is reached. During this interval [ t i + ∆t, t i +1 ] thetime evolution for π ( t ) is obtained by solving the differential equation ddt π ( t ) = π ( t ) · Q ( t ) . (3)Note that since cell division and maintenance methylation occur at a time in-terval that is much shorter than the time between two cell divisions, we assumehere that these events occur instantaneously, i.e. we let ∆t →
0, which leads toa jump of the distribution at t i . It holds that the left-hand and right-hand limitdo not coincide (except for the special case of only uu at time t i ), i.e.lim t → t − i π ( t ) = π ( t − i ) (cid:54) = π ( t − i ) · P ( t − i ) = π ( t + i ) = lim t → t + i π ( t ) . (4)To resolve the ambiguity we set the value of π at time t i to π ( t i ) := lim t → t + i π ( t ),which means that we assume that at time t i the cell division and maintenancemethylation have already happened. Intuitively, we can obtain the solution for π ( t ) numerically by alternating between multiplication with P (Eq. (2)) andintegration of Eq. (3). As already discussed in [9] the methylation or modification rates may change overtime, i.e. with the assumption of constant rates the behavior of the system cannot be captured correctly. We therefore introduce time dependent rates, whichwe call efficiencies. The simplest way of making the efficiencies time dependentis to impose a linear form. Let r ∈ { µ m , µ d , η, φ, δ } be one of the reaction rates.For each reaction we then define r ( t ) := α r + β r · t, (5)where α r and β r are some new parameters to characterize the linear efficiencyfunction. To further compress the notation we write these new parameters intoa vector v r = ( α r , β r ).Note that in order to ensure identifiability of the parameters we have tointroduce certain constrains. Obviously since µ m ( t ) and p are probabilities, theyhave to be bound between zero and one, i.e. 0 ≤ µ m ( t ) ≤ t ∈ [0 , t max ]and 0 ≤ p ≤
1. The other efficiencies have to be bound by some upper limit b since otherwise the methylation cycle may become arbitrarily fast and theoptimization algorithm runs into identifiability problems. We therefore require0 ≤ r ( t ) ≤ ub for all t ∈ [0 , t max ], where r ( t ) (cid:54) = µ m ( t ). Since typical averageturnover times E [ T turnover ] = µ − d + η − + φ − + δ − of the methylation cycleare in the order of 75 to 120 minutes in certain promotors of human cells [18,25],a viable choice would be, for example, ub = 12, i.e. each modification occurson average not more frequent than 12 times per hour (not faster than every 5minutes). The actual state of a CpG can not be directly observed. We therefore haveto estimate the hidden states from sequencing experiments. Since the differentmodifications of C might lead to the same observable states we perform three dif-ferent kinds of sequencing experiments: All sequencing strategies share a bisulfitetreatment, which usually converts C and its oxidized variants 5fC and 5caC, sum-marized to 5fC ∗ , to uracil. Additionally, in bisulfite sequencing (BS) both 5mCand 5hmC remain unconverted while in oxidative bisulfite sequencing (oxBS)only 5mC is retained as C [3]. A combination of both methods can therefore beused to estimate the amount of 5hmC. In M. SssI assisted bisulfite sequencing (MAB-Seq) at first all Cs in a CpG context are methylated and afterwards BSis applied. Thus, if all conversions would happen without any errors, only 5fC ∗ would be converted to T. In order to capture the methylation pattern of CpGposition at both complementary DNA strands, the distinct chemical treatmentswere combined with hairpin sequencing [10,11,21]. Regular reactions with theirrespective probabilities are marked with solid black arrows in Fig. 4, while thepossible false reactions are depicted with dashed red arrows. Since every CpGconsists of two Cs (one on each strand) with independent conversion errors,we get the conversion error for CpGs by multiplying the individual conversionerrors. A complete overview of all possible combinations for each of the threemethods is shown in Tab. 2. Recall that the set of hidden states is S = { uu, um, uh, uf, mu, mm, mh, mf, hu,hm, hh, hf, f u, f m, f h, f f } . We now define a hidden Markov model (HMM)based on the model presented in Section 2.2. As set of observable states wedefine S obs = { T T, T C, CT, CC } , i.e. we use the results of the sequencing ex-periments (cf. Fig. 4) on both strands. The conversion errors define the corre-sponding emission probabilities. We also define n e ( j, t ) as the number of timesthat state j ∈ S obs has been observed during independent measurements of se-quencing method e ∈ E := { BS, oxBS, MAB-Seq } . The probability distributionover all observable states for experiment e is denoted by π e ( t ), the probabilityof a state j ∈ S obs by π e ( j, t ) and in a similar fashion we denote π ( i, t ) with i ∈ S for the hidden states, with probability distribution π ( t ). The observable S oxBS MAB-Seq
CC 5mC 5hmC 5fC ∗ C 5mC 5hmC 5fC ∗ C 5mC 5hmC 5fC ∗ U 5mC 5hmC 5fU ∗ U 5mC 5fU 5fU ∗ U 5mC 5hmC 5fU ∗ T C C T T C T T T C C T c d e g c d f g c d e gµ
Fig. 4.
Cytosine conversions during chemical treatment and sequencing. The correctreactions with their respective probabilities are marked with black arrows , while thefalse reactions are shown with red dashed arrows. The probability for a false reactionis 1-“rate of correct reaction”. and hidden states for all times t are connected via π e ( t ) = π ( t ) · E e , (6)where E e is the emission matrix for sequencing method e and is listed in Tab. 2for each of the three methods.Our goal is to estimate the efficiencies for the different methylation eventsgiven our hybrid HMM and data from the three different experiments at differenttime points t ∈ T obs via a maximum likelihood estimator (MLE). Since an initialdistribution over the hidden states, which can not directly be observed, is neededin order to initialize the model, we have to employ the MLE twice: First weestimate the initial distribution π (0) over the hidden states by maximizing π (0) ∗ = argmax π (0) L ( π (0)) , (7)under the constrain that (cid:80) i ∈ S π ( i,
0) = 1. The likelihood L ( π (0)) is defined as L ( π (0)) = (cid:89) e ∈ E (cid:89) j ∈S obs π e ( j, n e ( j, . (8)Note that Eq. (8) is independent of the parameters. Given an initial distributionover the hidden states we can now run our model and apply the MLE v ∗ = argmax v L ( v ) , (9)a second time in order to estimate the efficiencies, where L ( v ) = (cid:89) e ∈ E (cid:89) t ∈ T obs \{ } (cid:89) j ∈S obs π e ( j, t ) n e ( j,t ) . (10) able 2. Conversion errors for CpGs, where the rates for single cytosines are defined inFig. 4. We define ¯ x := 1 − x . Note that for MAB-Seq we also define j := µd +(1 − µ )(1 − c ).bisulfite seq. (BS) ox. bisulfite seq. (oxBS) MAB-SeqTT TC CT CC TT TC CT CC TT TC CT CC uu c c · ¯ c c · ¯ c ¯ c c c · ¯ c c · ¯ c ¯ c ¯ j j · ¯ j j · ¯ j j um c · ¯ d c · d ¯ c · ¯ d ¯ c · d c · ¯ d c · d ¯ c · ¯ d ¯ c · d ¯ j · ¯ d ¯ j · d j · ¯ d j · duh c · ¯ e c · e ¯ c · ¯ e ¯ c · e c · f c · ¯ f ¯ c · f ¯ c · ¯ f ¯ j · ¯ e ¯ j · e j · ¯ e j · euf c · g c · ¯ g ¯ c · g ¯ c · ¯ g c · g c · ¯ g ¯ c · g ¯ c · ¯ g ¯ j · g ¯ j · ¯ g j · g j · ¯ gmu c · ¯ d ¯ c · ¯ d c · d ¯ c · d c · ¯ d ¯ c · ¯ d c · d ¯ c · d ¯ j · ¯ d j · ¯ d ¯ j · d j · dmm ¯ d d · ¯ d d · ¯ d d ¯ d d · ¯ d d · ¯ d d ¯ d d · ¯ d d · ¯ d d mh ¯ d · ¯ e ¯ d · e d · ¯ e d · e ¯ d · f ¯ d · ¯ f d · f d · ¯ f ¯ d · ¯ e ¯ d · e d · ¯ e d · emf ¯ d · g ¯ d · ¯ g d · g d · ¯ g ¯ d · g ¯ d · ¯ g d · g d · ¯ g ¯ d · g ¯ d · ¯ g d · g d · ¯ ghu c · ¯ e ¯ c · ¯ e c · e ¯ c · e c · f ¯ c · f c · ¯ f ¯ c · ¯ f ¯ j · ¯ e j · ¯ e ¯ j · e j · ehm ¯ d · ¯ e d · ¯ e ¯ d · e d · e ¯ d · f d · f ¯ d · ¯ f d · ¯ f ¯ d · ¯ e d · ¯ e ¯ d · e d · ehh ¯ e e · ¯ e e · ¯ e e f f · ¯ f f · ¯ f ¯ f ¯ e e · ¯ e e · ¯ e e hf ¯ e · g ¯ e · ¯ g e · g e · ¯ g f · g f · ¯ g ¯ f · g ¯ f · ¯ g ¯ e · g ¯ e · ¯ g e · g e · ¯ gfu c · g ¯ c · g c · ¯ g ¯ c · ¯ g c · g ¯ c · g c · ¯ g ¯ c · ¯ g ¯ j · g j · g ¯ j · ¯ g j · ¯ gfm ¯ d · g d · g ¯ d · ¯ g d · ¯ g ¯ d · g d · g ¯ d · ¯ g d · ¯ g ¯ d · g d · g ¯ d · ¯ g d · ¯ gfh ¯ e · g e · g ¯ e · ¯ g e · ¯ g f · g ¯ f · g f · ¯ g ¯ f · ¯ g ¯ e · g e · g ¯ e · ¯ g e · ¯ gff g g · ¯ g g · ¯ g ¯ g g g · ¯ g g · ¯ g ¯ g g g · ¯ g g · ¯ g ¯ g The vector v = ( v µ m , v µ d , v η , v φ , v δ , p ) contains all unknown parameters for allefficiencies and the probability p of considering a hydroxylated cytosine as un-methylated. Note that applying the MLE twice and independently leads only toan approximation of the true most likely explanation, since the estimated initialdistribution may not lead to the same result in the parameter estimation, asif it would all be done in one estimation. However, we choose this approach toreduce the computational complexity of the optimization. In order to estimatethe standard deviations of the estimated parameters we use the observed Fisherinformation matrix [5]. The Fisher information is defined as J ( v ∗ ) = −H ( v ∗ ),where v ∗ is the maximum likelihood estimate and H ( v ) = ∇∇ T log L ( v ) theHessian matrix of the log-likelihood. The expected Fisher information is thengiven by I ( v ∗ ) = E [ J ( v ∗ )] and its inverse forms a lower bound for the covari-ance matrix. Thus, we can approximate the standard deviation of all estimatedparameters by σ ( v ∗ ) = (cid:112) Var( v ∗ ) ≈ (cid:112) diag( −H − ( v ∗ )).The implementation of the hybrid HMM and its analysis as explained abovehas been integrated into the latest beta version of the H(O)TA tool [20]. H(O)TAprovides results for individual CpGs and also an aggregated profile across allanalyzed CpGs. In the following, we will discuss the results after applying our model to data de-rived from a short region at the single copy gene Afp (alpha fetoprotein), which day f r equen cy BS TTTT pr TCTC pr CTCT prCCCC pr day oxBS day
MAB-Seq day e ff i c i en cy μ m μ d ηϕδ day0 day1 day3 day600.20.40.60.81 l e v e l pe r s t a t e s day0 day1 day3 day600.050.10.150.20.250.30.350.4 h y d r o xy l a t i on l e v e l hm-mhuh-huhh day0 day1 day3 day600.050.10.150.20.250.30.350.4 f C l e v e l mf-fmuf-fuhf-fhff unmethylatedfully methylatedhemimethylatedhydroxyformal (a) (b) (d)(e) (f) (g)(c) Fig. 5.
Results for Afp. (a)-(c) Predicted frequencies (dashed lines) and frequenciesobtained from sequencing experiments (solid lines) of the observable states for all threemethods. (d) Estimated efficiencies with standard deviation. (e) Probabilities of thehidden states. (f) Detailed distribution of hydroxylated CpGs. (g) Detailed distributionof formalized CpGs. contains 5 CpGs. More precisely, we followed the DNA methylation changes dur-ing the adjustment of mouse embryonic stem cells (mESCs) towards 2i mediumafter previous long time cultivation under Serum/LIF conditions [8,9]. The Serum/LIF-to-2i shift is a common model system which induces genome wide demethy-lation in mESCs including the Afp locus.Here, the individual and aggregated H(O)TA results show a very similarbehavior. Hence, we only present the aggregated results in Fig. 5.Fig. 5 (a) - (c) shows a comparison of the actual measurements (solid lines)and the results from the model (dashed lines) of the time evolution for the fourobservable states for all three sequencing models. The predictions and measure-ments are in good agreement. In BS and oxBS over time the frequency of TTincreases, while the frequency of CC decreases. Moreover, the frequencies of TCand CT show a temporal increase. For MAB-Seq the frequencies of all observablestates remain quite constant, except for some small changes in early times. Thisbehavior can be explained by the changes over time in the efficiencies, which areshown in Fig. 5 (d). The maintenance probability µ m remains constant at about0 .
8, as well as the probability p of considering 5hmC as unmethylated, whichis constant by definition. The estimation of p gives p = 1, i.e. 5hmC is alwaysconsidered to be unmethylated and will not be recognized by Dnmt1 after repli-cation which means that 5hmC leads to an impairment of Dnmt1 activity and passive loss of DNA methylation with each replication. The de novo efficiency µ d decreases over time, while the hydroxylation and formylation efficiency η and φ increase. The demethylation efficiency δ is always 0. Note, however, that stan-dard deviation becomes very large for later time points due to insufficient data.The efficiencies explain the behavior of the frequencies of the observable statesand is even more evident for the hidden states shown in Fig. 5 (e): Over time theprobability of being fully or hemimethylated decreases, while the probability ofbeing unmethylated, hydroxylated or formylated increases. A more detailed lookinto hydroxylated and formylated states is shown in Fig. 5 (f) and (g). Note thatthe combination of hydroxylation and formylation is only shown in one of thetwo subplots, namely in (g). The observed increase in 5hmC and 5fC/5caC is inaccordance with the high oxidation efficiency of Tets in form of hydroxylationand formylation. Previously, we showed using a purely discrete HMM, that thepresence of 5hmC leads to a block of Dnmt1 activity after replication, whichwe also observe in the present hybrid model [9]. However, we now also observean increase in higher oxidized cytosine variants, namely 5fC and 5caC whichequally prevents methylation by Dnmt1. Thus, we reason that the impact of Tetmediated oxidation of 5mC on DNA demethylation in the investigated systemplays a much more important role than previously suggested [26].Considering the rather rapid periodic events of de novo methylation andactive demethylation, the chosen measurement time points are not ideal. Thereis only information available at the end of each cell division cycle (one divisionwithin 24h), i.e., no information is given for the times between two cell divisions.With measurements at time points between two cell divisions we would be able todistinguish if a CpG is in a certain state because it initially was, or because it ranthrough the full cycle, possibly multiple times. We would like to emphasize thatwith better data, i.e. with sufficiently many measurements between cell divisions,it will be possible to estimate all reaction efficiencies with better confidence. We proposed a hybrid hidden Markov model which is able to successfully describeboth, events such as cell division and maintenance methylation that occur atfixed times, and events that occur at random times, such as de novo methylation,oxidizations and active demethylation, according to a continuous-time Markovjump process. To the best of our knowledge, this is the first model that describesthe dynamics of active demethylation, i.e., the active removal of the methyl groupthrough several enzymatic steps. We applied our model to data from mouseembryonic stem cells, which undergo a gradually loss of DNA methylation overtime. We were able to accurately predict the frequency of the observable statesand the levels of the hidden states in all cases. We were also able to predictthe enzymatic reaction efficiencies based on a linear assumption for their timebehavior.As future work we plan to apply our model to more informative data suchthat all efficiencies of the active demethylation cycle can be estimated withetter confidence. Moreover, we plan to allow different functional forms for theefficiencies since the linear form is not flexible enough and does not allow tocapture complex behaviors. A suitable choice could be splines of different degreesand with a different number of knots. However, in this case it is also necessaryto perform model selection in order to prevent overfitting. Another possibleextension could be to investigate potential neighborhood dependencies of themodified cytosines [23].