High fidelity epigenetic inheritance: Information theoretic model predicts k-threshold filling of histone modifications post replication

Abstract

Beyond the genetic code, there is another layer of information encoded as chemical modifications on histone proteins positioned along the DNA. Maintaining these modifications is crucial for survival and identity of cells. How the information encoded in the histone marks gets inherited, given that only half the parental nucleosomes are transferred to each daughter chromatin, is a puzzle. We address this problem using ideas from information theory and understanding from recent biological experiments. Mapping the replication and reconstruction of modifications to equivalent problems in communication, we ask how well the enzyme machinery can recover information, if the enzymes were ideal computing machines. Studying a parameter regime where realistic enzymes can function, our analysis predicts that, pragmatically, enzymes may implement a threshold-k filling algorithm which derives from maximum a posteriori probability decoding. Simulations using our method produce modification patterns similar to those observed in recent experiments.

Full PDF

Nithya Ramakrishnan, Sibi Raj B Pillai, and Ranjith Padinhateeri
Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Mumbai, India
Department of Electrical Engineering, Indian Institute of Technology Bombay, Mumbai, India
(Dated: May 15, 2020)

Why do our skin cells behave very differently from our brain cells even though they have the same genetic code? The reason is, beyond the genetic code, there are multiple layers of information encoded by wrapping and folding DNA into chromatin with the help of many proteins [1, 2]. Most of the DNA is wrapped around octamers of histone proteins making chromatin essentially like a string of beads made of nucleosomes (DNA+histones) [3–5].
Each nucleosome carries chemical modifications, like acetylations and methylations, forming a one-dimensional pattern of histone marks along the chromatin polymer contour [2, 6] (see Fig. 1(a) top panel). This pattern encodes a crucial layer of information as described below.

Each nucleosome is constituted by four kinds of histone proteins, namely H3, H4, H2A and H2B [3, 7, 8]. The histone modifications are indicated by names like H4K5ac (acetylation modification on the 5th amino acid, lysine (K), of the H4 protein), H3K27me3 (tri-methylation marks on the 27th amino acid, lysine (K), of the H3 protein), etc. [8, 9]. The presence of each of these modifications encodes some specific information relevant for gene regulation. Even though the entire histone modification code is not deciphered yet, we understand it in parts. For example, H3K27me3 represses reading of the local DNA region where the modification is present, H3K9ac encodes local gene activation, and so on [8, 10, 11]. The activation and repression of genes collectively decide the gene expression pattern and hence determine the function and fate of a cell [12–15].

While preparing to divide, cells copy their genetic code via the DNA replication process. For DNA to be copied, the chromatin has to be unfolded and histone proteins need to be disassembled [16, 17]. This would disrupt the pattern of histone modifications. Recent studies have shown that the (H3-H4) tetramer from the parent remains intact [18] and randomly, with equal probability, gets deposited onto either of the newly synthesised DNA strands [19–21]. That is, daughter strands will have only some (≈ […]

[arXiv preprint, q-bio.GN]

[…] signal gets exposed to noise and is consequently error-prone at the receiving end. The decoder at the receiver detects and corrects these errors using techniques from information theory and coding theory [34, 35].
This viewpoint immediately poses the following questions: Can we use known decoding algorithms from communication theory to analyse chromatin modification loss and retrieval? How well can the best known algorithms correct the missing modifications and re-establish the modification patterns? What is the best possible correction strategy enzymes could use if they were ideal computing machines? Are the algorithms compatible with biological processes that realistic cellular enzymes can conceivably do? In this paper, we address these questions using ideas from Information theory. We consider one of the daughter chromatins to be a noise-corrupted signal created at the replication fork, while the enzymes and other molecular agents help to correct this error using mathematical techniques. In this model, the inheritance of the mother's pattern is approached using Bayesian decoding techniques. Predictions from our model are verified using publicly available experimental data, indicating the relevance of our work in studying real biological datasets [26].

Model and Methods

Consider a region on a mother chromatin having $N$ nucleosomes. We are interested in studying the inheritance of one histone modification at a time. Since many of the repressive marks are known to be inherited immediately after replication, we will consider one such repressive mark (e.g., H3K27me3) and its pattern along a chromatin. This pattern can be represented by a vector $M = \{m_1, m_2, \cdots, m_N\}$, where $m_i$ can have values 1 or 0 indicating the presence or absence of the modification on the $i$-th nucleosome (see Fig. 1(a)). We also need the following notations:

• $m_i^j$ is a vector $(m_i, m_{i+1}, \cdots, m_j)$ with $j > i$; the subscript and superscript respectively indicate the first and last indices. Thus, $M = m_1^N$ is a realisation of the entire mother chromatin modification sequence.
• a row vector of $k$ consecutive ones (zeroes) will be denoted as $1_k$ ($0_k$). Extending this, the vector $(1, 0, \cdots, 0, 1)$ having $k$ consecutive zeroes between two ones will be denoted as $(1, 0_k, 1)$.

We model the sequence $M$ along the mother chromatin as a binary valued random walk, having neighbourhood interactions corresponding to a first order homogeneous Markov chain. More specifically, given the modifications $m_{i-1}$ and $m_{i+1}$, the modification $m_i$ is assumed to be independent of all other modification values. Equivalently, the conditional probability law is

$$P(m_i \mid m_1, \cdots, m_{i-1}) = P(m_i \mid m_{i-1}), \quad i \geq 2, \tag{1}$$

where $m_1$ is the modification on the first nucleosome of the region of our interest. The state-space evolution of the Markov chain $M$ is as follows: given $m_i = 1$, let $\alpha$ and $1 - \alpha$ be the probabilities for obtaining $m_{i+1} = 1$ and $m_{i+1} = 0$ respectively. Similarly, if $m_i = 0$, let $\beta$ and $1 - \beta$ be the probabilities for having $m_{i+1} = 0$ and $m_{i+1} = 1$ respectively. The sequence $M$ can be seen as a random walk on the state space shown in Fig. 1(b).
For example, when $\alpha$ and $\beta$ are close to 1, the pattern would often contain long runs of either 1s (presence of modification) or 0s (absence of modification).

From the mother chromatin $M$, the generation of a daughter chromatin having histone modification sequence $D = d_1^N$ is modelled as follows. During replication, each nucleosome on a daughter chromatin is, with probability $1/2$ each, either directly inherited from its parental counterpart (i.e. $d_i = m_i$) or newly deposited (i.e. $d_i = 0$) from a pool of fresh histones assembled de novo [24]. This process is equivalent to doing a logical AND operation of the mother sequence $M$ with an independent and identically distributed (IID) binary sequence $Z$ (noise), i.e. $D = M \cdot Z$ (see Fig. 1(c)). This biological process leads to a memoryless model with conditional probability

$$P(d_1^N \mid m_1^N) = \prod_{i=1}^{N} P(d_i \mid m_i). \tag{2}$$

In biology, the question is, given a daughter sequence $d_1^N$ soon after replication, how could a cell reconstruct a mother-like sequence $\hat{M} = \hat{m}_1^N$? In other words, the question is how to find a decoder that would reconstruct $\hat{M}$ from $D$ (see Fig. 1(c)). Ideally, a cell would want to choose a binary sequence $\hat{M}$ having the minimum deviation from $M$. The fraction of errors in the reconstructed sequence is a natural deviation metric, given by

$$\Delta(M, \hat{M}) = \frac{1}{N} \sum_{i=1}^{N} (m_i - \hat{m}_i)^2. \tag{3}$$
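To make the model concrete, the following minimal Python sketch samples a mother sequence from the two-state Markov chain, produces a daughter by ANDing it with unbiased IID noise, and evaluates the deviation of Eq. (3). The function names, the choice of starting state 1, the seed, and the sequence length are illustrative assumptions and not part of the original study.

```python
import random

def sample_mother(n, alpha, beta, rng, first=1):
    """Sample n binary marks from the first-order Markov chain.

    alpha = P(m[i+1] = 1 | m[i] = 1), beta = P(m[i+1] = 0 | m[i] = 0).
    The starting state `first` is an assumption made for this sketch.
    """
    m = [first]
    for _ in range(n - 1):
        stay = alpha if m[-1] == 1 else beta
        m.append(m[-1] if rng.random() < stay else 1 - m[-1])
    return m

def sample_daughter(m, rng, keep=0.5):
    """AND the mother with IID noise Z: each mark is kept with probability `keep`."""
    return [mi & (1 if rng.random() < keep else 0) for mi in m]

def deviation(m, m_hat):
    """Fraction of positions at which two binary sequences differ (Eq. 3)."""
    return sum((a - b) ** 2 for a, b in zip(m, m_hat)) / len(m)

rng = random.Random(0)
mother = sample_mother(1000, alpha=0.9, beta=0.95, rng=rng)
daughter = sample_daughter(mother, rng)
print(deviation(mother, daughter))  # error of the uncorrected daughter
```

Because the channel can only erase marks, the uncorrected deviation equals the fraction of nucleosomes whose mark was lost.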

FIG. 1: Schematic description of the problem. (a) Row 1 from top: chromatin as a string of nucleosomes with and without the histone modification (star) of our interest. This can be mapped to a string of binary numbers indicating the presence (1) or absence (0) of the modification (row 2), giving us $M$. Row 3: one typical realization of a daughter chromatin ($D$) produced from $M$ above, via a process mimicking DNA replication, where only a fraction of the modifications (1s) will end up in the daughter chromatin, stochastically; the rest do not have the modification of interest (0). Soon after replication, certain enzymes will insert modifications correcting $D$ to a mother-like sequence $\hat{M}$ (row 4). Since these are stochastic processes, we expect some errors. (b) The mother sequence ($M$) is modelled as a first order Markov chain having a sequence of 0s and 1s. $\alpha$ and $\beta$ are probabilities of finding a 1 followed by a 1, and a 0 followed by a 0, respectively. $1 - \alpha$ and $1 - \beta$ are probabilities of finding a 1 followed by a 0 (note arrowheads), and a 0 followed by a 1, respectively. (c) From an information theory perspective, the daughter sequence ($D$) is obtained by a mother sequence $M$ getting logically ANDed with an independent and identically distributed (IID) binary sequence $Z$ (noise). A mother-like sequence $\hat{M}$ is reconstructed by passing $D$ through a decoder. The plausible ways in which enzymes could act as decoders are the subject of this study.

This deviation metric is effectively the bit error rate (BER) when $N$ becomes large [36]. Thus the chosen $\hat{M}$ should minimize the BER with respect to the actual sequence $M$, while obeying the transition law in Eq. (1). This is similar to data communication through an erroneous channel. It is well known that Bayesian estimation schemes minimize the average detection error probability at the receiver.
In particular, a decoder choosing the input sequence having the Maximum A Posteriori Probability (MAP) is optimal in minimising the communication error [35, 36]. We call this the Sequence MAP (SMAP) decoder, which identifies the most probable sequence $\hat{M} = (\hat{m}_1, \cdots, \hat{m}_N)$ based on the observations $d_1^N$ as

$$(\hat{m}_1, \cdots, \hat{m}_N) = \underset{m_1, \cdots, m_N}{\arg\max}\; P(m_1^N \mid d_1^N). \tag{4}$$

SMAP decoding is known to have near optimal BER performance, and good analytical tractability in many contexts [36]. While the optimal BER performance can be achieved by Bitwise MAP (BMAP) decoding for each modification value separately, the latter scheme is not only computationally more demanding, but also analytically less tractable. It appears that SMAP decoding is a potential candidate for biological cells to reconstruct the epigenetic modifications from the partial data. Notice that SMAP decoding depends on the parameters $\alpha$ and $\beta$ of the Markov chain. Clearly, this algorithm is only targeting a primary reconstruction immediately following DNA replication; secondary mechanisms may further alter the patterns in the long run [19, 37, 38].

Results

In this section, we will be using the ideas developed in the Model and Methods to answer how one can reconstruct a mother-like modification sequence ($\hat{M}$), given the daughter chromatin sequence $D$. We will discuss how well algorithms like the Sequence MAP (SMAP) decoding will compute $\hat{M}$, and whether realistic enzymes can implement this in practice.

Ideal Enzymes Implementing SMAP Decoding
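As a rough illustration of what such an ideal decoder computes, the sketch below runs a two-state Viterbi recursion that maximises the SMAP objective of Eq. (4) for the AND channel of the Model and Methods, using the product of $P(m_i \mid m_{i-1})$ and $P(d_i \mid m_i)$ as the step score. It is a sketch under stated assumptions rather than the authors' implementation: the name `smap_decode`, the uniform prior on the first nucleosome, and the tie-breaking rule are hypothetical choices.

```python
import math

def _log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else -math.inf

def smap_decode(d, alpha, beta, keep=0.5):
    """Most probable mother sequence given the daughter sequence d."""
    # Transition probabilities P(m_i | m_{i-1}), indexed as trans[prev][cur].
    trans = [[beta, 1 - beta], [1 - alpha, alpha]]

    def emit(m, di):
        # P(d_i | m_i) for the AND channel: a 0 always stays 0,
        # a 1 is kept with probability `keep` and erased otherwise.
        if m == 0:
            return 1.0 if di == 0 else 0.0
        return keep if di == 1 else 1 - keep

    # Assumption: uniform prior on the first state.
    score = [_log(0.5) + _log(emit(s, d[0])) for s in (0, 1)]
    back = []
    for di in d[1:]:
        new_score, ptr = [], []
        for cur in (0, 1):
            cands = [score[prev] + _log(trans[prev][cur]) for prev in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1  # ties go to state 0
            ptr.append(best)
            new_score.append(cands[best] + _log(emit(cur, di)))
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

For instance, `smap_decode([1, 0, 0, 1], alpha=0.9, beta=0.95)` fills the two-zero gap, whereas a run of 20 zeros between two ones is left untouched for the same parameters.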

While we do not know exactly how biological enzymes work to retrieve histone modification patterns soon after replication, how well can the communication theory-inspired algorithms reconstruct a mother-like sequence? To test this, one can imagine some ideal enzymes (computing machines) constructed to implement the SMAP algorithm in Eq. (4). That is, these enzymes will maximize the conditional probability $P(\hat{M} \mid D)$ based on the given daughter sequence observations $D = d_1^N$. Applying Bayes' rule [39], along with Eq. (1) and Eq. (2), one gets

$$P(\hat{M} \mid D) = \frac{1}{P(D)} \prod_{i=1}^{N} P(\hat{m}_i \mid \hat{m}_{i-1})\, P(d_i \mid \hat{m}_i). \tag{5}$$

Since $P(D)$ does not play a role in the maximization over $\hat{M}$, it can be ignored for our purposes. Ideal computing machines can now implement the SMAP algorithm using the idea of trellis decoding [40], which is closely related to the well known Viterbi algorithm in coding theory [36]. While the memory and computational power requirement for trellis decoding is high in general, we find that the decoding procedure for our model can be broken down into smaller sub-sequences, each corresponding to a different run of zeroes in $D$. In particular, trellis decoding has to be applied only on sub-sequences of the form $(1, 0_k, 1)$, with $k$ consecutive zeros in between two ones. The following proposition verifies this claim.

Proposition 1

Let $i, j$ be two positions where the daughter sequence has ones, with $j > i$. Then

$$P(m_i, \cdots, m_j \mid d_1^N, m_1^{i-1}, m_{j+1}^N) = P(m_i, \cdots, m_j \mid d_i^j).$$

To prove this, the following chain of equalities is valid under the assumption that $(m_i, m_j) = (d_i, d_j) = (1, 1)$:

$$\begin{aligned}
P(m_{i+1}, \cdots, m_{j-1} \mid d_i^j) &= P(m_{i+1}, \cdots, m_{j-1} \mid d_i^j, m_i, m_j) \\
&= P(m_i^j \mid d_i^j, m_i, m_j, m_1^{i-1}, m_{j+1}^N) \\
&= P(m_i^j \mid d_i^j, m_1^{i-1}, m_{j+1}^N, d_1^{i-1}, d_{j+1}^N) \\
&= P(m_i^j \mid d_1^N, m_1^{i-1}, m_{j+1}^N).
\end{aligned}$$

Notice that the second step used the Markov assumption that given the current state, the future and past states are conditionally independent. The third step used the facts that $(m_i, m_j) = (d_i, d_j)$, and $(d_1^{i-1}, d_{j+1}^N)$ is generated from $(m_1^{i-1}, m_{j+1}^N)$ using random variables independent of $(m_i^j, d_i^j)$.

Thus, while performing SMAP decoding by Eq. (5), the decisions on $\hat{m}_i, \cdots, \hat{m}_j$ are independent of the values $(d_1^{i-1}, d_{j+1}^N)$, once $d_i = d_j = 1$. In other words, to decide on a bit at position $l$ where the daughter has inherited a zero, we need to only consider the smallest daughter segment containing the position $l$, and flanked by ones at both ends. Only daughter segments with at least one intermediate zero are to be considered; otherwise there is nothing to decode. Without loss of generality, we will take $D = (1, 0_k, 1)$ for the rest of the exposition, corresponding to a run of $k$ zeros, and perform trellis decoding on this sequence.

Trellis Decoding:

In a trellis, the idea is to find probabilities of all possible "paths", i.e. sequences of states of the Markov chain. Given that our interest is to examine daughter sequence regions like $(1, 0_k, 1)$, […] $k = 5$). Starting from the initial state 1 (left bottom in Fig. 2), the trellis diagram assigns a conditional probability (branch metric) to each subsequent transition (arrow), based on the transition probability law and observed daughter state. For transitions from the state at $i - 1$ to $i$, we mark the branch metrics as $P(m_i \mid m_{i-1}) P(d_i \mid m_i)$, where $d_i = 0$ for $2 \leq i \leq N - 1$. While the sequence $D$ is easily seen to be generated by a hidden Markov model (HMM), the branch probability metrics can be understood from the box shown in Fig. 2. Notice that we have to find $P(m_i, d_i = 0 \mid m_{i-1})$ for $(m_{i-1}, m_i) \in \{(0, 0), (0, 1), (1, 0), (1, 1)\}$ (see Fig. 2).

$$P(0, 0 \mid 0) = \beta; \quad P(1, 0 \mid 0) = \frac{1 - \beta}{2}; \quad P(0, 0 \mid 1) = 1 - \alpha; \quad P(1, 0 \mid 1) = \frac{\alpha}{2}.$$

FIG. 2: The trellis diagram used for illustrating how the MAP algorithm chooses the path of maximum a posteriori probability given a daughter sequence $(1, 0, \cdots, 0, 1)$. For $d_i = 0$, all possible values $P(m_i, d_i \mid m_{i-1})$ are given in the box, and represent probabilities associated with each arrow. Starting from state 1 at the left bottom, one can choose all possible paths moving along the arrows, and compute the path probabilities (path metrics) using Eq. (5). We choose the path with the largest path metric.

Notice that each possible mother sequence can be identified as a path in the trellis, with the labels identifying the branch metrics. Using Eq. (5), the product of corresponding branch metrics will yield the path metric of each possible sequence, and then the path maximizing the SMAP metric can be chosen. Since $D$ has the form $(1, 0, \cdots, 0, 1)$, […] $\alpha$ and $\beta$ parameters in the Markov model. We then generated several daughter sequences, for each of the mother sequences, by flipping 1s to 0s randomly with probability 0.5; that is, $D = M \cdot Z$, with $Z$ an IID binary sequence generated by an unbiased coin. Each of the daughter sequences was corrected with the trellis-based SMAP decoding to generate the corresponding estimate $\hat{M}$; the error $\Delta(M, \hat{M})$ was computed (Eq. (3)). The average $\bar{\Delta}$ of $\Delta(M, \hat{M})$ (averaged over many realizations), for fixed pairs of $\alpha$ and $\beta$, is presented as a heatmap in Fig. 3. The mean deviation ($\bar{\Delta}$) between mother and corrected mother-like daughter is low for very high values of $\alpha$ and $\beta$, a region dominated by long islands of ones (modified nucleosomes) separated by long islands of zeros (unmodified nucleosomes). There are other regions too where $\bar{\Delta}$ is relatively small. Overall, this result shows how well an ideal computing machine that employs state of the art information theory algorithms can recover the original mother sequence. The remaining question is, can a real enzyme do as well as this computing algorithm?

FIG. 3: The average deviation between the original mother and the mother-like corrected daughter sequences ($\bar{\Delta}$ = ensemble averaged $\Delta(M, M')$) is plotted for different $\alpha$ and $\beta$ values as a heatmap (see color bar on the side). The error is averaged over 300 mother sequences and 200 daughter sequences corresponding to each mother sequence.

Threshold-$k$ model: enzymes filling unmodified islands of size at most $k$ maintain chromatin fidelity

Whether biological enzymes are equipped to do complex SMAP computations like trellis decoding by themselves is debatable. Nevertheless, we argue that in certain biologically relevant parameter regimes, the decoding rule can be simple enough for enzymes to potentially execute. Among the known histone modification patterns, it is common to have regions densely filled by a certain modification (e.g., H3K27me3), and other regions where the modification is totally absent. This corresponds to higher values of $\alpha$ and $\beta$ in our Markov model. Below we show that in this regime, the SMAP algorithm simplifies to tasks that the enzymes may easily carry out.

Consider an island of $k$ unmodified nucleosomes in the daughter chromatin, giving the pattern $(1, 0_k, 1)$. When $\left(\frac{\alpha}{2}\right)^{k+1} > \frac{1}{2}(1 - \alpha)\beta^{k-1}(1 - \beta)$, the probability of having the all-one path $(1, 1_k, 1)$ at the mother is greater than that of $(1, 0_k, 1)$. Thus, whenever $g_k(\alpha, \beta) = \left(\frac{\alpha}{2}\right)^{k+1} - \frac{1}{2}(1 - \alpha)\beta^{k-1}(1 - \beta)$ is positive, an all-ones path is preferred over a run of $k$ zeros by SMAP decoding. However, the question is, for what values of $\alpha$ and $\beta$ does the above condition hold true? The expression $g_k(\alpha, \beta) = 0$ is easy to solve if we take $k$ to be a real value; this yields the root $k^*$ as

$$k^* = \frac{\log\left(\dfrac{(1 - \alpha)(1 - \beta)}{2(\alpha/2)^2}\right)}{\log\left(\dfrac{\alpha}{2\beta}\right)} + 1. \tag{6}$$

Notice that the solution for $k^*$ is unique when $0 < \alpha < 1$ and $0 < \beta < 1$. Since $g_1(\alpha, \beta) = \frac{\alpha^2}{4} - \frac{1}{2}(1 - \alpha)(1 - \beta)$, the condition $g_1(\alpha, \beta) > 0$ means that $(1, 1, 1)$ is preferred over $(1, 0, 1)$. The root $k^*$ will now have the following implications on the unit square region of parameters $(\alpha, \beta) \in (0, 1) \times (0, 1)$:

• When $\alpha < 2\beta$ and $g_1(\alpha, \beta) > 0$, one gets $k^* \geq 1$; we find that $g_k(\alpha, \beta) > 0$ for all $k \leq k^*$ in Eq. (6). This suggests that the SMAP algorithm will replace $(1, 0_k, 1)$ with $(1, 1_k, 1)$ for all $k \leq k^*$.
• When $\alpha > 2\beta$ and $g_1(\alpha, \beta) > 0$, we find that $g_k(\alpha, \beta) > 0$ for all $k$; hence the SMAP algorithm will replace $(1, 0_k, 1)$ with $(1, 1_k, 1)$ for all $k$.

Notice that when every path of at most $k^*$ zeros between two ones has a smaller path metric than the corresponding all-ones path, clearly any possible path other than all ones cannot have the maximum SMAP metric, while decoding sequences of length less than $k^*$.

The above analysis based on trellis decoding suggests two simple ways for enzymes to work. Enzymes of Type A would simply modify all unmodified nucleosomes (0s) between two modified nucleosomes (1s). Such enzymes may be preferred when the modification pattern can be modelled by parameters $\alpha$ and $\beta$ that correspond to region A in Fig. 4(a); notice that this has large $\alpha$ and small $\beta$. An enzyme of Type B would fill all unmodified nucleosomes (0s) if and only if the size of the unmodified region is $\leq k^*$. That is, replace $(1, 0_k, 1)$ by $(1, 1_k, 1)$ only when $k \leq k^*$. Thus long islands of 0s are left unfilled. We call this a threshold-$k$ filling model, which becomes active in region B of Fig. 4(a). Notice that when both $\alpha$ and $\beta$ are close to 1, the modification is expected to have long domains (islands) with its presence, followed by islands with no modification. Biologically, this is a realistic regime for many modifications where enzymes can do threshold-$k$ filling.

FIG. 4: (a) Four (A, B, C, D) regions in $(\alpha, \beta)$ parameter space. The curves shown are $g_1(\alpha, \beta) = 0$ and $\alpha - 2\beta = 0$. In region A ($g_1 > 0$, $\alpha > 2\beta$), the SMAP will replace every 0 with 1. In region B ($g_1 > 0$, $\alpha < 2\beta$), enzymes can implement threshold-$k$ filling (see text). This parameter regime is realistic, biologically. (b) The mean error ($\bar{\Delta}$) when we fill all islands of 0s having size at most $k_t$. All these curves have $(\alpha, \beta)$ values in region B, and $\bar{\Delta}$ is non-monotonic having a finite optimum $k_t = k^*$. (c) $\bar{\Delta}$ for parameter values in region A (red and green curves) is monotonically decreasing, suggesting that the optimal $k_t$ is unbounded; hence the least error would be when all 0s are replaced with 1s. In regimes C, D (blue and violet curves) the mean error is minimal when nothing is filled, suggesting that threshold-$k$ filling is not suitable here. Curves are obtained by averaging over 300 mother sequences, and 200 daughter sequences corresponding to each mother sequence. The standard error is smaller than the size of the points.

We tested the threshold-$k$ filling model on a computer by generating several mother and daughter sequences, for various values of $\alpha$ and $\beta$; each daughter sequence was corrected using the threshold-$k$ filling algorithm, that is, we filled all islands of 0s having size $k \leq k_t$ by 1s; here $k_t$ is taken as a variable. The corresponding mean error ($\bar{\Delta}$), averaged over many realisations, was computed for a given $(\alpha, \beta)$. In Fig. 4(b), all the curves correspond to $(\alpha, \beta)$ in regime B of Fig. 4(a) ($g_1 > 0$, $\alpha < 2\beta$).
In this interesting regime, the mean error has a non-monotonic behaviour with a minimum at a particular value of $k_t = k^*$, where $k^*$ is predicted by Eq. (6). Note that $k^*$ values are around 3 to 6: an enzyme of Type B can fill small unmodified islands having 3 to 6 zeroes, and leave much longer unmodified islands unfilled.

In Fig. 4(c), the two monotonically decreasing curves belong to the parameter regime A ($g_1 > 0$, $\alpha > 2\beta$). The mean error decreases as we increase $k_t$, suggesting that there is no finite $k^*$. If the modification patterns were in this regime, the corresponding enzymes should attempt to fill every unmodified region, however small or big that may be. The other two curves in Fig. 4(c) correspond to regimes C and D in Fig. 4(a). For both the curves, the mean error is increasing, suggesting that the threshold-$k$ filling algorithm is not suitable in these parameter regimes. It is also unlikely that these parameter regimes would be biologically relevant.

$k$-threshold filling model can obtain inheritance patterns similar to what is observed experimentally

To examine the biological relevance of the findings leading to a $k$-threshold filling model, we took publicly available experimental histone modification data, and compared with our simulation results. The inheritance of modification H3K27me3 has been systematically studied recently [26] by measuring modification occupancy before and after DNA replication. Since the experimental data is population-averaged, we used a simple randomized discretization algorithm to generate many binary sequences from the available data. For example, we took a long region from the chromosome 1 (151,495,060 bp to 165,790,665 bp) data, discretized it to obtain the mother vector ($M$), and then generated the daughter chromatin $D = M \cdot Z$. The sequence $D$ was corrected to a mother-like modification sequence $\hat{M}$ using the threshold-$k$ algorithm for different $k = k_t$.
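The threshold rule discussed above can be sketched in a few lines of Python. The sketch assumes that the all-ones path through a run of $k$ zeros has metric $(\alpha/2)^{k+1}$ and that the all-zeros path has metric $\frac{1}{2}(1-\alpha)\beta^{k-1}(1-\beta)$, as follows from the branch metrics $P(m_i \mid m_{i-1})P(d_i \mid m_i)$ with an unbiased AND channel. The function names `g_k`, `k_star` and `threshold_fill` are hypothetical.

```python
import math

def g_k(k, alpha, beta):
    """All-ones path metric minus all-zeros path metric for a run of k zeros."""
    ones = (alpha / 2) ** (k + 1)
    zeros = 0.5 * (1 - alpha) * beta ** (k - 1) * (1 - beta)
    return ones - zeros

def k_star(alpha, beta):
    """Real-valued root of g_k = 0 (assumes alpha != 2 * beta)."""
    num = math.log((1 - alpha) * (1 - beta) / (2 * (alpha / 2) ** 2))
    return num / math.log(alpha / (2 * beta)) + 1

def threshold_fill(d, k_t):
    """Fill every run of at most k_t zeros that is flanked by ones on both sides."""
    out = list(d)
    last_one = None  # index of the most recent 1 seen in d
    for i, x in enumerate(d):
        if x == 1:
            if last_one is not None and 0 < i - last_one - 1 <= k_t:
                for j in range(last_one + 1, i):
                    out[j] = 1
            last_one = i
    return out

print(k_star(0.9, 0.95))
print(threshold_fill([0, 1, 0, 0, 1, 0, 0, 0, 1, 0], k_t=2))
```

With $\alpha = 0.9$ and $\beta = 0.95$ (region B), `k_star` evaluates to roughly 6.9 under these assumptions, so runs of up to 6 zeros would be filled. Zeros at either end of the sequence are never filled because they are not flanked by ones on both sides.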
This was repeated several times (100 $M$ sequences, and 100 $D$ for each $M$), and the mean error ($\bar{\Delta}$) is plotted as a function of $k_t$ in Fig. 5(a). The results show that when $k_t = 5$ the mean error in the corrected H3K27me3 pattern is minimum. Note that this is very similar to the curves in Fig. 4(b) for large values of $\alpha$ and $\beta$. We independently verified that the parameters corresponding to the original mother sequence are $\alpha \approx$ […] and $\beta \approx$ .87, implying that a biologically relevant modification falls in parameter regime B.

In Fig. 5(b), we plot the population-averaged modification pattern for different values of $k_t$. Note that islands of high and low modification occupancy regions are present in both the mother as well as the corrected daughter sequences. This indicates that our threshold-$k$ filling algorithm can reproduce biologically relevant data. Thus, our information theory-inspired algorithm predicts that there might be enzymes that simply fill short segments (4 or 5 nucleosomes) of unmodified regions, but leave the longer unmodified regions (> […]) unfilled. One possibility is that the $k$ value ($k^*$) is hard-wired into such enzymes, via evolution. Another possibility is that other phenomena like local looping, phase separation etc. decide the threshold $k^*$, by bringing unfilled nucleosomes together [41, 42]. These need to be understood in future.

Spatially distinct antagonistic modifications

The threshold-$k$ filling model can be naturally extended to study two (or multiple) modifications that are antagonistic, spatially distinct (the same nucleosome will not have both the modifications simultaneously), and acted upon by very different enzymes. As per our model, for an enzyme-1 responsible for modification-1, the nucleosomes having the second modification are not "visible" and would appear as a long stretch of 0s. Similarly, for enzyme-2, the modification-1 nucleosomes appear as a long stretch of 0s. We generated a sequence having two spatially distinct modifications. A simple extended version of the threshold-$k$ algorithm was applied to each enzyme separately. That is, whenever there is an island of size $k \leq k_t$ between two nucleosomes having modification-$i$, the region was filled with modification-$i$, for $i = 1, 2$. Anything else was left unfilled. This gave us results as shown in Fig. 6. These are occupancies from a typical realisation (not averaged over the population) and hence the values are either 0 or 1.
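A minimal sketch of this two-enzyme extension is given below, encoding unmodified nucleosomes as 0 and the two marks as 1 and 2. The encoding, the function names, and the choice never to overwrite a nucleosome that already carries the other mark are assumptions made for illustration; the text does not specify these details.

```python
def fill_mark(seq, mark, k_t):
    """Threshold-k filling as seen by the enzyme responsible for `mark`.

    Any position not carrying `mark` counts as unmodified for this enzyme.
    """
    out = list(seq)
    last = None  # index of the most recent nucleosome carrying `mark`
    for i, x in enumerate(seq):
        if x == mark:
            if last is not None and 0 < i - last - 1 <= k_t:
                for j in range(last + 1, i):
                    if out[j] == 0:  # assumption: never overwrite the other mark
                        out[j] = mark
            last = i
    return out

def fill_two_marks(seq, k_t):
    """Apply the rule for mark 1 and then for mark 2 to the same sequence."""
    return fill_mark(fill_mark(seq, 1, k_t), 2, k_t)

daughter = [1, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2]
print(fill_two_marks(daughter, k_t=2))
```

In this example the two-zero island between the mark-1 nucleosomes and the one-zero island between the first two mark-2 nucleosomes are filled, while the longer islands are left as they are.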

Discussion and Conclusion

In this work, we proposed that the problem of daughter chromatin retrieving histone modification patterns, to achieve a mother-like chromatin state, can be mapped to a communication theory problem of receiving a noisy signal and correcting it to retrieve the original signal. Using ideas from Information theory, we argued that if enzymes were ideal computing machines, the best they could do is to execute a MAP decoding algorithm to get back a mother-like sequence. We showed how well this algorithm would reconstruct the mother: the error can be as low as 5% in certain parameter regimes. However, the question was whether realistic enzymes can practically execute such complex algorithms. We showed that in a biologically relevant parameter regime, the MAP decoding algorithm is equivalent to a $k$-threshold filling algorithm. That is, the enzymes could simply insert modifications in $k$-sized or shorter unmodified stretches (0s). The fact that a detailed theory simplifies to a process that is potentially executable by enzymes makes this result attractive.

We modelled the mother chromatin as a first order Markov process. This is a reasonable model as there is a minimal number of parameters. One could potentially consider interactions among nucleosomes beyond the immediate neighbours. However, this will involve many more parameters, making the problem more complex and insights more difficult to gain. This may be probed in future work.

We would like to stress that given any experimental data, we do not have to know the parameters $\alpha$ and $\beta$. By simply varying $k_t$, as shown in Figs. 4(b) and 5(a), we can determine the optimal configurations and obtain insights about how enzymes might work. Note that even in the large $\alpha$ regime (region A in Fig. 4(a)), if enzymes settle for $k$-filling with a finite $k_t$ (e.g., 5 or 6), it becomes a pragmatic modification correction solution as the resulting error is relatively low (see red and green curves in Fig. 4(c)).

FIG. 5: (a) Mean deviation ($\bar{\Delta}$) between corrected daughters and corresponding mothers, where the experimental population-averaged parental data for H3K27me3 is from [26] (see database GEO: GSE110354). Error correction was performed using the threshold-$k$ filling algorithm for different $k_t$ values. (b) The population averaged histone modification occupancy for H3K27me3 is plotted for the mother sequence (top), and corrected daughter sequences corresponding to different values of $k_t$.

It is also interesting to examine whether there is noise in the insertion (filling) process itself, and how much it can affect the result. Finally, it would also be of great interest to study how the polymer nature of the chromatin would work in tune with the results from information theory. After all, the epigenetic code might involve an interplay between the one-dimensional histone codes and polymer dynamics of the chromatin. The fact that we have two 1s at the boundaries suggests some potential role of looping or micro phase separation in far-away regions coming together. Our own earlier studies [43] hint to us that small patches of unmodified nucleosomes (newly inserted nucleosomes) may lead to small clusters, influencing the kinetics of the modification process itself. These are questions that await future studies.

FIG. 6: Typical realisations of the modification patterns from the study of two spatially distinct/antagonistic modifications. The red and green regions represent the modifications 1 and 2, respectively, spread over 500 nucleosomes. Corrections were performed using the threshold-$k$ algorithm with an optimum $k_t = 6$.

[1] Goldberg AD, Allis CD, Bernstein E (2007) Epigenetics: a landscape takes shape. Cell 128: 635–638.
[2] Allis CD, Jenuwein T, Reinberg D (2007) Epigenetics, volume 61. CSHL Press.
[3] Kornberg RD (1974) Chromatin structure: a repeating unit of histones and DNA. Science 184: 868–871.
[4] Kornberg RD, Lorch Y (1999) Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome. Cell 98: 285–294.
[5] Van Holde KE (2012) Chromatin. Springer Science & Business Media.
[6] Jenuwein T, Allis CD (2001) Translating the histone code. Science 293: 1074–1080.
[7] Luger K, Mäder AW, Richmond RK, Sargent DF, Richmond TJ (1997) Crystal structure of the nucleosome core particle at 2.8 Å resolution. Nature 389: 251–260.
[8] Alberts B (2014) Molecular Biology of the Cell, 6th edition. New York: Garland Science, Taylor and Francis Group.
[9] Richards EJ, Elgin SC (2002) Epigenetic codes for heterochromatin formation and silencing: rounding up the usual suspects. Cell 108: 489–500.
[10] Kouzarides T (2007) Chromatin modifications and their function. Cell 128: 693–705.
[11] Taverna SD, Ueberheide BM, Liu Y, Tackett AJ, Diaz RL, et al. (2007) Long-distance combinatorial linkage between methylation and acetylation on histone H3 N termini. Proceedings of the National Academy of Sciences 104: 2086–2091.
[12] Rando OJ (2007) Global patterns of histone modifications. Current Opinion in Genetics & Development 17: 94–99.
[13] Weiner A, Hsieh THS, Appleboim A, Chen HV, Rahat A, et al. (2015) High-resolution chromatin dynamics during a yeast stress response. Molecular Cell 58: 371–386.
[14] Zentner GE, Henikoff S (2013) Regulation of nucleosome dynamics by histone modifications. Nature Structural & Molecular Biology 20: 259.
[15] Seligson DB, Horvath S, Shi T, Yu H, Tze S, et al. (2005) Global histone modification patterns predict risk of prostate cancer recurrence. Nature 435: 1262–1266.
[16] Margueron R, Reinberg D (2010) Chromatin structure and the inheritance of epigenetic information. Nature Reviews Genetics 11: 285–296.
[17] Madamba EV, Berthet EB, Francis NJ (2017) Inheritance of histones H3 and H4 during DNA replication in vitro. Cell Reports 21: 1361–1374.
[18] Xu M, Long C, Chen X, Huang C, Chen S, et al. (2010) Partitioning of histone H3-H4 tetramers during DNA replication-dependent chromatin assembly. Science 328: 94–98.
[19] Groth A, Rocha W, Verreault A, Almouzni G (2007) Chromatin challenges during DNA replication and repair. Cell 128:1