[PDF] A dynamical system model for predicting gene expression from the epigenome

Abstract

Gene regulation is an important fundamental biological process. The regulation of gene expression is managed through a variety of methods including epigenetic processes (e.g., DNA methylation). Understanding the role of epigenetic changes in gene expression is a fundamental question of molecular biology. Predictions of gene expression values from epigenetic data have tremendous research and clinical potential. Despite active research, studies to date have focused on using statistical models to predict gene expression from methylation data. In contrast, dynamical systems can be used to generate a model to predict gene expression using epigenetic data and a gene regulatory network (GRN) which can also serve as a mechanistic hypothesis. Here we present a novel stochastic dynamical systems model that predicts gene expression levels from methylation data of genes in a given GRN. We provide an evaluation of the model using real patient data and a GRN created from robust reference sources. Software for dataset preparation, model parameter fitting and prediction generation, and reporting are available at \verb|this https URL|.

Full PDF

11 Epigenome Prediction of Gene Expression using a Dynamical System Approach

James Brunner † Center for Individualized Medicine Microbiome Program, Mayo ClinicRochester, MN 55901, USA † Jacob Kim

Timothy Downing

Eric Mjolsness

Kord M. Kober ‡ Department of Physiological Nursing andBakar Computational Health Sciences Institute, University of California San FranciscoSan Francisco, CA 94143, USA ‡ Gene regulation is an important fundamental biological process. The regulation of gene ex-pression is managed through a variety of methods including epigentic processes (e.g., DNAmethylation). Understanding the role of epigenetic changes in gene expression is a funda-mental question of molecular biology. Predictions of gene expression values from epigeneticdata have tremendous research and clinical potential. Dynamical systems can be used togenerate a model to predict gene expression using epigenetic data and a gene regulatorynetwork (GRN). Here we present a novel stochastic dynamical systems model that predictsgene expression levels from methylation data of genes in a given GRN.

Keywords : Epigenetic Modiﬁcation, Gene Regulatory Networks, Piecewise DeterministicMarkov Processc (cid:13) a r X i v : . [ q - b i o . M N ] A ug

1. Introduction

Gene regulation is an important fundamental biological process. It involves a number ofcomplex sub-processes that are essential for development and adaptation to the environment(e.g., cell diﬀerentiation and response to trauma ). Understanding gene expression patternshas broad scientiﬁc and clinical potential, including providing insight into patterns of funda-mental molecular biology (e.g., gene regulatory networks) and a patient’s response to disease(e.g., HIV infection ) or treatment (e.g., chemotherapy-induced peripheral neuropathy ).The regulation of gene expression is managed through a variety of methods, includingtranscription, post-transcriptional modiﬁcations, and epigenetic processes. One epigeneticprocess, DNA methylation, occurs primarily at the cytosine base of the molecule that isadjacent to guanine (i.e., CpG site). DNA methylation of promoter and gene body regionscan act to regulate gene expression by repressing (e.g., ) or activating (e.g., ) transcription.Understanding the role of epigenetic changes in gene expression is a fundamental questionof molecular biology and predicting gene expression from epigenetic data is an active area ofresearch. Predictions of gene expression values from epigenetic data have tremendous researchand clinical potential. For example, DNA is inexpensive to collect and is easy to store. It oﬀersboth genetic (i.e., genotype) and epigenetic (i.e., methylation status) information in a stableformat. This information is obtainable from a large number of ongoing and previously existingbiobanks. Gene expression (i.e., RNA), however, is more diﬃcult and more expensive to obtain.Given the unique type of information that gene expression can provide (i.e., the presence andquantity of the functional product of a gene), it will be very useful and economical if geneexpression values could be reliably predicted from methylation information.A dynamical systems model can predict gene expression using epigenetic data and a generegulatory network (GRN) by simulating hypothesized mechanisms of transcriptional regu-lation. Such models provide predictions based directly on these biological hypotheses. Wedevelop an interaction network model that depends on epigenetic changes in a GRN.One major advantage to the dynamical systems approach is obtaining a distribution ofgene expression beyond just expression status. Furthermore, a stochastic dynamical systemprovides us with a distribution of gene expression estimates, representing the possibilities thatmay occur within the cell.Recent studies have developed models to predict gene expression levels with deep con-volutional neural networks from genome sequence data and with regression models frommethylation data. Previous studies have developed models to predict expression status (e.g.,on/oﬀ or high.low) with gradient boosting classiﬁers from histone modiﬁcation data, andwith machine learning classiﬁcation methods from methylation data and from methylationand histone data combined. To our knowledge, there are no studies that have taken a dy-namical systems approach to predicting gene expression from methylation data and a GRN.Given the opportunity presented by dynamical systems approaches and the potential prac-tical utility, we present a novel stochastic dynamical systems model for predicting gene ex-pression levels from epigenetic data for a given GRN.

2. Methods2.1.

Assumptions

Our model assumes fundamentally that transcription of DNA is (relatively) fast and doneat a linear rate determined by the bound or unbound state of transcription factor bindingsites. We assume that binding and unbinding of transcription factors is a (relatively) slowand stochastic process, with propensity proportional to availability of transcription factor.Our model is built on the hypothesis that the propensity of transcription factor bindingis inﬂuenced non-linearly by epigenetic modiﬁcation of the binding site. We assume that“background” transcription (i.e. transcription occurring when no transcription factor is bound)is only enough to allow down regulation to exist. In model training and testing, genes withmissing expression data are assigned ’zero’ expression values.

Model Equations

We model gene regulation using a piecewise-deterministic Markov process (PDMP) asintroduced in Davis 1984 (see also ) given by the equations: B i ( t ) = B i (0) + Y i (cid:18)(cid:90) t (1 − B i ( τ )) λ i µ i µ i + ( α i ) ν i ( κ i · g ) dτ (cid:19) − Y i (cid:18)(cid:90) t ˆ λ i B i ( τ ) dτ (cid:19) (1)and ddt g j = γ j + ( φ j · B ) − d j g j (2)where B i ( t ) ∈ { , } , is a boolean random variable representing the bound/unbound state ofa binding site region of DNA and g i is the transcript amount the genes modeled. Equation (1)is given as the sum of two Poisson jump processes Y i ( h i ( t )) and Y ( h i ( t )) which take values in Z ≥ , and are piecewise constant between randomly spaced discrete time points (the bindingand unbinding events). The propensities h i ( t ) and h i ( t ) are taken to be linear functions of the available tran-scription factors, which is assumed to be the same as the transcript variables g j . We take thevalues κ ij ∈ { , } ; these parameters along with the set of φ ji represent the structure of theunderlying gene regulatory network.We include the term µ i µ i + ( α i ) ν i (3)to represent the impact of epigenetic modiﬁcation on the binding propensity of transcriptionfactors. In this term, α i is the measured epigenetic modiﬁcation to a transcription factorbinding site (e.g. percentage of methylated bases). Equation (3) is a sigmoidal function whichis either strictly increasing or strictly decreasing depending on the sign of ν i . If ν i > , then thisterm decreases, implying that epigenetic modiﬁcation decreases transcription factor binding.Conversely, if ν i < , the model implies that epigenetic modiﬁcation increases transcriptionfactor binding.Finally, we use a linear ODE for the value of the transcripts g j . We take φ ji ∈ {− , , } based on the structure of the underlying gene regulatory network. We include baseline tran-scription γ j and decay d j . Because we use a linear ODE in Eq. (2), we can solve exactlybetween jumps of B . Master Equation and Equilibrium Distribution

It is common practice in the study of reaction networks modeled as stochastic jump pro-cesseDetail of ﬁgure 9 in manuscript.s to represent the process using so called “chemicalmaster equation”, which is the Kolmogorov forward equation for the jump process. Thegenerator for a PDMP can be deﬁned (see Aza¨ıs 2014 for details). We deﬁne a density P ( B i , g ( t ) , t ) = P i ( g ) , i = 1 , ..., | B | for each possible state B i of B such that (cid:80) | B | i =1 P i ( g ) = P ( g ) is the probability distribution for the vector g , and each P i satisﬁes dP i ( g , t ) dt = | g | (cid:88) j =1 (cid:0) γ j + ( φ j · B i ) − d j g j (cid:1) ∂P i ( g , t ) ∂g j + (cid:88) ( j : (cid:107) B j − B i (cid:107) =1) | B | (cid:88) k =1 (cid:20) B ik (1 − B jk ) λ k µ k µ k + ( α k ) ν k ( κ k · g ) + ˆ λ k B jk (1 − B ik ) (cid:21) P j ( g , t ) − | B | (cid:88) k =1 (cid:20) (1 − B ik ) λ k µ k µ k + ( α k ) ν k ( κ k · g ) + ˆ λ k B ik (cid:21) P i ( g , t ) . (4) Model Sampling & Equilibrium Distribution Estimation

We can estimate the equilibrium distribution for the PDMP using kernel density estimation(KDE) with a Gaussian kernel. To compute the marginal distributions on the various genevariables, which we can estimate with a 1 dimensional kernel. In a realization of the process,binding or unbinding events occur at times t i such that [ t , t ) (cid:116) [ t , t ) (cid:116) · · · (cid:116) [ t n − , T ) = [ t , T ) and we may compute g ( t ) exactly in each interval. The estimation is then f i ( x ) = n − (cid:88) k =0 (cid:90) t k +1 t k √ πh exp (cid:32) − (cid:0) x − (cid:2) e − d i ( t − t k ) ( g i ( t k ) − S ki ) + S ki (cid:3)(cid:1) h (cid:33) dt (5)where S ki = d i ( γ i + (cid:80) j φ ij B j ) . In Fig. 1, we show schematically how f i ( x ) is estimated from arealization of the process B ( t ) , g ( t ) . For each binding site i α li κ i λ i ˆ λ i µ i ν i B i ( t ) For each gene j φ j d j γ j g j ( t ) f lj ( x ) = R T l K j ( g j ( t ) , x ) dt For each data point l Fig. 1. Plate diagram of the process to estimate the marginal PDF f j ( x ) of each gene’s transcript level according to our model.Parameters in diamonds are read from data, parameters in hexagons are determined by the structure of the network, parametersin circles must be ﬁt to the model by maximizing likelihood over a training data set, and parameters in stars are the state variablesof the dynamical model. Notice that the dynamical model implies that the state variables depend on each other, meaning thisnetwork of dependence is not acyclic . The kernel K j ( x, y ) used to estimate likelihood is Gaussian kernel in only component j (i.e., it is the Guassian kernel after orthogonal projection onto dimension j ). Model Parameter Estimation

While the parameters κ ij , φ ji and γ j are determined by the structure of the underlying generegulatory network, assumed to be known, and the epigenetic parameter α i is assumed mea-surable, we must still estimate the remaining parameters λ i , ˆ λ i , µ i , ν i and d j . We estimate theseparameters using a negative log-likelihood minimization procedure using stochastic gradientdescent. Sample paths used to estimate the gradient of the likelihood are generated using amodiﬁed form of Gillespies algorithm which handles time-dependent jump propensities. Thisprocedure involves approximating the gradient of the map from parameter set to log-likelihoodso that we can use a gradient descent method. In Fig. 2, we give a schematic representationof how ˆ L D is estimated from a set of realizations of the model, each realization correspondingto a single data sample. For each binding site i α li κ i λ i ˆ λ i µ i ν i B i ( t ) For each gene j φ j d j γ j g lj g j ( t ) L l = T l R T l K ( g ( t ) , g l ) dt For each data point l ˆ L D = − P | DataSet | l =1 ln( L l ) Fig. 2. Plate diagram of the process to estimate total likelihood of a data set according to our model. Parameters in diamondsare read from data, parameters in hexagons are determined by the structure of the network, parameters in circles must be ﬁt tothe model by maximizing likelihood over a training data set, and parameters in stars are the state variables of the dynamicalmodel. Notice that the dynamical model implies that the stat variables depend on each other, meaning this network of dependenceis not acyclic . The kernel K ( x, y ) used to estimate likelihood is Gaussian. Gradient Estimation

In order to estimate ∇ ˆ L D, ω for use in optimization, we can use the generator of the system.To do this, we must rewrite the likelihood estimate L g , α ( θ ) as a sum of partial likelihoods L i g , α deﬁned for each possible state of the vector B . Then, because we are estimating the likelihoodat an equilibrium distribution, we may use Eq. (4) and assume that dP i dt = 0 to compute each ∇ L i g , α . See supplemental ﬁle for a derivation of ∇ L i g , α using this method. Creation of model from known GRN

We use a gene regulatory network assumed to be known to create a model ofgene regulation that includes transcription factor binding dynamics. To do this, we as-sociate binding sites with the genes that they regulate, and use these associations tocreate a bipartite graph. Due to the available level of detail in the data set, werely only on associations between binding sites and targets. We therefore assume thatany regulator of a given target binds with every site associated with that target. g g g g g g B x ) B x ) B x ) B x ) B y ) B y ) Fig. 3. (left) Underlying gene regulatory network. (right) Bipartite networkof the underlying GRN.

In order to capture the eﬀectsof diﬀerent regulating transcrip-tion factors that may bind tothe same site, we create “du-plicate” variables that representthe same binding site but boundwith diﬀerent transcription fac-tors. The result is that we haveonly one transcription factor foreach binding site variable. Asshown in Fig. 3, each edge ofa graph representing the originalgene regulatory network is replaced by a path from transcription factor to binding site to tar-get gene. Note that the model as described in Section 2.2 allows a more general topology, andour procedures for parameter ﬁtting and model analysis do not depend on this construction.

3. Model Evaluation3.1.

Gene Regulatory Network

Gene to gene interactions ( κ ) were deﬁned using the Discriminant Regulon ExpressionAnalysis (DoRothEA) framework. Transcription factor (TF) to target interactions were de-ﬁned using the DoRothEA highest conﬁdence interactions and scored are just 1 or -1 for up-regulating and downregulating, respectively. Binding site to target edges ( φ ) were deﬁned byCpG methylation sites which were associated with changes in transcript expression (eCpG). eCpG probes were identiﬁed for genes with which a change in expression was associated witha change in methylation state in participants from the Multi-Ethnic Study of Atherosclerosis(MESA). These CpG probes were then classiﬁed into a status which described the geographicrelationship between CpG probe and the associated gene (e.g., “TRANS”, “Promoter”, “TSS”,“Distal”). The binding site regions used in this study for a gene were deﬁned by the proximalstatus classiﬁcations of the MESA data (i.e., ‘Promoter’ or ‘TSS’) and were scored as the meanof all of CpG probes in the status class. For evaluation, we identiﬁed a set of genes previouslyidentiﬁed as deferentially expressed in individuals with PTSD as compared to controls. Ofthese, we identiﬁed 278 regulator to target maps (kappa) which we then used to identify othertargets to include. The ﬁnal set included 512 gene to gene relationships comprised of 303unique target genes. A GRN was built using these 303 genes as input producing a ﬁnal net-work with 74 genes with 65 sites (network shown in supplemental ﬁle). Of these 74 genes, 29had suﬃcient regulatory information (i.e., an associated binding site and transcription factor)for which parameters could be estimated and expression distributions generated.

Model Training and Testing Data

Matched epigenetic and gene expression data were obtained from the Grady TraumaProject (GTP) study. Methylation and gene expression was measured from whole blood fromAfrican American participants. Methylation data were obtained from the NCBI Gene Expres-sion Omnibus (GEO) (GSE72680) and measured using the Illumina HumanMethylation450

BeadChip. Methylation status was quantiﬁed as a beta score. A total of 19,258 eCpG probeswere used as input. Beta scores for CpG sites within the same region for a gene (i.e., ‘Promoter’or ‘TSS’) were aggregated together as the average. Probes were merged in 1,885 regions. Re-gions that were not detected in any sample (i.e., “NaN”) were removed. Gene expression datawere obtained from GEO (GSE58137) measured with the Illumina HumanHT-12 expressionbeadchip. Two batches of non-overlaping samples were quantiﬁed using two versions of thebeadchip (V3.0: n=243 participants, mean expression intensity = 189.96, IQR = 49.88 to106.60; V4: n=106 participants, mean expression intensity = 321.58, IQR = 88.85 to 139.78).Intensity scores were log2-transformed. Gene expression probes were ﬁrst annotated to EN-TREZ ID and then annotated to the symbol using the HUGO annotation. Genes withmultiple gene expression measurements (i.e., multiple probes) were represented by the lastone in the list when processed. The ﬁnal merged datasets contained n=243 samples for theparticipants measured for expression on the V3.0 beadchip and n=97 for the participantsmeasured for expression on the v4.0 beadchip.

Fig. 4. The mean and variance over time for simulation of gene AK3 for sample 5881.Fig. 5. Negative log-likelihood estimate of training data set forsplit 99x at each iteration of maximum-likelihood gradient de-scent procedure.

Matched data from participants mea-sured for expression on the v3.0 beadchip(n=243) were used for evaluation. Thisprimary dataset was split into trainingand testing datasets, containing 80%/20%(n=195, n=48 samples, respectively). Toavoid the impact of a particular split, werepeated the shuﬄe process 100 times. Foreach split of the data, parameter estima-tion was performed on the training set andequilibrium distributions of the predictedexpression levels were generated using thetesting set. We evaluated the performanceof the method as the diﬀerence between themeasured and predicted value for each gene.

Estimation of equilibrium distribution

Fig. 6. Equilibrium distribution plots for CCM2 for individual (a) 6110 in shuﬄe 43, and (b) 6742 in shuﬄe 77.Table 1. Average root mean square errors for eachgene across the 100 shuﬄes with ﬁtted parametersand random parameters.

GeneSymbol AverageRMSEﬁttedparameters AverageRMSErandomparameters Diﬀerence Relativeperformance

AHR 2.196 6.113 3.918 2.785AK3 0.757 10.459 9.702 13.810ALOX5 2.172 6.797 4.625 3.129BAG3 1.276 7.251 5.976 5.685BAK1 1.637 8.114 6.478 4.958CCM2 0.876 10.842 9.966 12.377CD19 0.925 9.389 8.464 10.148CD4 1.232 8.982 7.750 7.290CTSH 1.262 8.278 7.016 6.561CXCR5 4.425 7.444 3.020 1.683CYP27A1 0.886 8.535 7.650 9.636FBXO32 1.163 8.939 7.777 7.689FCER1A 1.536 9.926 8.391 6.463GSTM1 0.792 11.160 10.369 14.101ICAM4 2.101 7.583 5.482 3.609IRF1 3.357 10.636 7.280 3.168LDHA 2.693 10.870 8.178 4.037LTA4H 3.193 11.846 8.653 3.710MT1X 1.222 9.468 8.247 7.749NR1D2 0.932 7.585 6.654 8.142OAS1 1.162 9.794 8.633 8.431RPL39L 1.211 7.995 6.785 6.602RRM2B 1.085 7.658 6.573 7.057SCP2 4.598 7.354 2.757 1.600SLC20A1 1.877 10.684 8.807 5.691SREBF1 1.784 6.582 4.799 3.691SURF6 1.107 9.236 8.130 8.347VWA5A 2.744 7.310 4.567 2.664ZNF654 3.254 8.015 4.762 2.464

In order to conserve computational resources,we used as a stopping condition for sample pathsimulation the number of random numbers thatwere needed for simulation to some time point.This was number was set to random draws.To evaluate if the gradient descent process was be-ing ended too early, one additional shuﬄe was per-formed (”99x”) which used an existing shuﬄe (”99”)but with parameters allowing for less constrictivestopping conditions maxsteps=10000 StoppingCon-dition=100. While simulating, we saved mean andvariance estimates for the distribution for interme-diate time-points in order to approximate the rateof convergence to an equilibrium distribution. Weuse a ﬁnite diﬀerence estimate for the rate of changeof estimated mean and variance to determine if thestopping condition that we used is appropriate. Wecan see from Fig. 4 that longer simulation time maybe necessary for accurate estimates of equilibrium.

Performance evaluation

To evaluate the performance of our ﬁtting procedure on gene expression predictions wegenerated predictions using a randomly generated parameter set. To avoid the impact ofa particular split, we randomly generated parameters for each of the previously generatedsplits (i.e., the 100 splits previously used for training and testing). Ten random estimateswere generated for each shuﬄe giving 1000 predictions for each gene generated using randomparameters. We evaluated the performance of the parameter ﬁtting method as the ratio ofroot-mean-square error of the predicted value given by randomly chosen and ﬁtted parameters:

Relative performance = Average RMSE(random parameters)Average RMSE(ﬁtted parameters) .

4. Results

Fig. 7. Equilibrium distribution plots generated from random parameters for AK3 for individual (a) 6522 for random parameterset 0 in shuﬄe 0, and (b) 8331 random parameter set 7 in shuﬄe 13.

Estimated parameters

Fig. 8. (Top) Predicted versus ob-served expression values and (Bottom)residuals for the test samples for all 100shuﬄes for AK3. Each shuﬄe is col-ored.

Parameters were estimated for all genes using the proce-dure detailed in Section 2.5. Figure 5 shows the likelihood ofthe training dataset during the course of the gradient descentprocedure.4.0.2.

Equilibrium distributions

Equilibrium distributions were generated for 29 genes forwhich we had suﬃcient connectivity information available.Example equilibrium distributions for the CCM2 gene for twosamples from diﬀerent shuﬄes are shown in Fig. 6. Averageroot mean square errors for each gene across the 100 shuﬄes isreported in Table 1. We observed the highest performance forAK3 (average RSME = 0.757, Fig. 8) and lowest for SCP2(average RSME = 4.598). In this evaluation, our model bi-ases towards underestimating the expression levels (see sup-plemental ﬁle). Relative to a random set of parameters, ourﬁtted parameters improved the predictions by a factor of 1.6to over 14 (Table 1). For example, the predicted expressiondistribution for AK3 generated from random parameters areshow in Fig. 9 and the predicted versus observed values areshown in Fig. 7.

5. Discussion

In this study, we demonstrate that gene expression levelscan be accurately predicted from methylation state of a pro-moter region and a GRN. Our model successfully uses quantitative data describing epigeneticmodiﬁcation of transcription factor binding sites to generate a probability distribution whichdescribes the possible level of transcript. To our knowledge, this is the ﬁrst study to developand evaluate a stochastic dynamical systems model predicting gene expression levels from epigenetic data for a given GRN.By using a dynamical systems approach, our model generates an estimation of gene ex-pression given DNA methylation based on the mechanistic hypothesis of diﬀerential bindingaﬃnity of a transcription factor caused by epigenetic modiﬁcation. Furthermore, the dynam-ical systems approach allows study of complex regulatory networks, including those whichcontains cycles. This method has the potential for broad practical usage. DNA is broadlyavailable in banked tissue in the absence of RNA. By measuring methylation from these spec-imens, RNA expression value can be estimated. In addition to predicting RNA expressionfrom specimens, our model can also be used to evaluate for functional eﬀects of changes inmethylation states at particular sites on gene expression levels. Fig. 9. (Top) Predicted versus ob-served expression values and (Bottom)residuals for the test set generated from10 sets of random parameters for all100 shuﬄes for AK3. Each shuﬄe andparameter set is colored.

Overall we see good performance in the predictions of themodel with ﬁtted parameters (e.g., Fig. 8) and dramatic im-provements to prediction relative to a randomly generatedset of parameters (e.g., Fig. 9). Although the predictions aresomewhat dependent on the selection of training data, and insome cases, the observed value and predicted value are notwell represented by the equilibrium distribution the overallprediction of the occurrence of gene expression remains accu-rate (i.e., on/oﬀ or low/high). Poor predictions could be theresult of slow convergence in the maximum-likelihood param-eter estimation. In order to estimate parameters, we must it-eratively generate model predictions for a range of parametervalues. This is done by generating equilibrium distributionsin a method similar to Markov-Chain Monte Carlo (MCMC)sampling.The estimated ﬁt of the model to training data improvedover iterations of the procedure (e.g., Fig. 5). However, themean and standard deviations from the equilibrium distribu-tions do not converge as quickly as we would like (e.g., Fig. 4).This slow convergence, and the necessity for repeated esti-mations, mean that computational time is a limiting factor.Future analyses should simulate longer to identify the appro-priate cut oﬀs given the data used, and thus improve the ﬁtof the model parameters.We were able to accurately predict gene expression despitethe limited size of our GRN. We expect that model predic-tions will improve with more regulatory information. Although we queried for 302 genes, ourtranscription factor to target and binding site reference data produced a gene regulatory net-work with 74 genes, of which 29 had suﬃcient regulatory information to be predicted. Inparticular, dynamical systems should perform well in gene regulatory networks with complextopologies, including those with loops. We were unable to evaluate a more complicated GRNfrom all reference regulatory data due to computational constraints. As such, we evaluated with the toy model of PTSD data which was acyclic.While the use of a stochastic dynamical system has distinct advantages over morestatistically-driven methods, there are of course limitations to the approach as well. In partic-ular, our model is based on the assumption that epigenetic modiﬁcation eﬀects the propen-sity of the random process of transcription factor binding and unbinding. Furthermore, ourmodel assumes that DNA transcription is a comparatively fast (and so approximated as deter-ministic) process that depends on transcription factor binding. Finally, our model implicitlyassumes that RNA translation to protein product is immediate. In addition to the funda-mental limitations of the model, we also limit the scope of our testing to linear productionof DNA transcript, depending on transcription factor binding status. Future eﬀorts will befocused on improving the prediction accuracy, improving prediction robustness across folds,and evaluating across other gene regulatory networks and gene sets.In conclusion, we have developed a dynamical systems approach which accurately predictsgene expression from methylation state and a GRN. Our results support the idea that methy-lation patterns of cis-promoter regions are associated with gene expression levels. Advances ingene regulatory information will continue to improve the predictions of our model by providingmore structure to the GRNs. In addition, we have a route forward to develop optimizationswhich can step up the ease of use and scaling of our approach. Finally, our approach is broadlyaccessible and can be used for a diverse array of research projects which have DNA samplesavailable but in which gene expression data are missing or limited or in studies evaluating thefunctional eﬀects of changes in methylation state on gene expression.

6. Code availability

Supplemental ﬁles, including further mathematical details, are available at https://doi.org/10.5281/zenodo.3970970 . All code are available at https://github.com/kordk/stoch_epi_lib .

7. Acknowledgments

This project was initially conceived as an interdisciplinary project as part of the “ShortCourse in Systems Biology - a foundation for interdisciplinary careers” at the Center forComplex Biological Systems at the University of California Irvine held Jan. 21 - Feb. 8, 2019in Irvine, CA (NIH GM126365). This study was funded by the National Cancer Institute(NCI, CA233774). Preprint of an article submitted for consideration in Paciﬁc Symposium onBiocomputing 2020 World Scientiﬁc Publishing Company . References

1. Reik W.

Stability and ﬂexibility of epigenetic gene regulation in mammalian development . Nature,vol. 447(7143):pp. 425–32, 2007.2. Cobb JP, et al. Application of genome-wide expression analysis to human health and disease .Proc Natl Acad Sci U S A, vol. 102(13):pp. 4801–6, 2005.3. King MC et al. Evolution at two levels in humans and chimpanzees . Science, vol. 188(4184):pp.107–116, 1975.24. Singh KP, et al. Mechanisms and Measurement of Changes in Gene Expression . Biol Res Nurs,vol. 20(4):pp. 369–382, 2018.5. Bosinger SE, et al. Gene expression proﬁling of host response in models of acute HIV infection .J. Immunol., vol. 173(11):pp. 6858–6863, 2004.6. Kober K, et al. Diﬀerential Methylation and Expression of Genes in the Hypoxia Inducible Factor1 (HIF-1) Signaling Pathway Are Associated With Paclitaxel-Induced Peripheral Neuropathy inBreast Cancer Survivors and with Preclinical Models of Chemotherapy-Induced Neuropathic Pain .Mol Pain, vol. 16:p. 1744806920936502, 2020.7. Stephens KE, et al. Epigenetic regulation and measurement of epigenetic changes . Biol Res Nurs,vol. 15(4):pp. 373–381, 2013.8. Razin A et al. DNA methylation and gene function . Science, vol. 210(4470):pp. 604–610, 1980.9. Eden S et al. Role of DNA methylation in the regulation of transcription . Curr Opin Genet Dev,vol. 4(2):pp. 255–9, 1994.10. Spruijt CG et al. DNA methylation: old dog, new tricks?

Nat Struct Mol Biol, vol. 21(11):pp.949–54, 2014.11. Anderson DF, et al. On classes of reaction networks and their associated polynomial dynamicalsystems . Journal of Mathematical Chemistry, 2020.12. Agarwal V et al. Predicting mRNA Abundance Directly from Genomic Sequence Using DeepConvolutional Neural Networks . Cell Rep, vol. 31(7):p. 107663, 2020.13. Zhong H, et al. Predicting gene expression using DNA methylation in three human populations .PeerJ, vol. 7:p. e6757, 2019.14. Ebert P, et al. Epigenome-based prediction of gene expression across species . bioRxiv, 2018.15. Klett H, et al. Robust prediction of gene regulation in colorectal cancer tissues from DNA methy-lation proﬁles . Epigenetics, vol. 13(4):pp. 386–397, 2018.16. Li J, et al. Using epigenomics data to predict gene expression in lung cancer . BMC Bioinformatics,vol. 16 Suppl 5:p. S10, 2015.17. Davis MH.

Piecewise-deterministic Markov processes: a general class of non-diﬀusion stochasticmodels . Journal of the Royal Statistical Society: Series B (Methodological), vol. 46(3):pp. 353–376, 1984.18. Zeiser S, et al. Simulation of genetic networks modelled by piecewise deterministic Markov pro-cesses . IET systems biology, vol. 2(3):pp. 113–135, 2008.19. Crudu A, et al. Hybrid stochastic simpliﬁcations for multiscale gene networks . BMC systemsbiology, vol. 3(1):p. 89, 2009.20. Anderson DF et al. Stochastic analysis of biochemical systems , vol. 1. Springer, 2015.21. Anderson DF, et al. Product-form stationary distributions for deﬁciency zero chemical reactionnetworks . Bulletin of mathematical biology, vol. 72(8):pp. 1947–1970, 2010.22. Wang Y, et al. Parameter inference for discretely observed stochastic kinetic models using stochas-tic gradient descent . BMC systems biology, vol. 4(1):p. 99, 2010.23. Aza¨ıs R, et al. Piecewise deterministic Markov processrecent results . In

ESAIM: Proceedings ,vol. 44, pp. 276–290. EDP Sciences, 2014.24. Anderson DF.

A modiﬁed next reaction method for simulating chemical systems with time de-pendent propensities and delays . The Journal of chemical physics, vol. 127(21):p. 214107, 2007.25. Garcia-Alonso L, et al. Transcription Factor Activities Enhance Markers of Drug Sensitivity inCancer . Cancer Research, vol. 78(3):pp. 769–780, 2018.26. Varley KE, et al. Dynamic DNA methylation across diverse human cell lines and tissues . GenomeRes., vol. 23(3):pp. 555–567, 2013.27. Breen MS, et al. PTSD Blood Transcriptome Mega-Analysis: Shared Inﬂammatory Pathwaysacross Biological Sex and Modes of Trauma . Neuropsychopharmacology, vol. 43(3):pp. 469–481,2018.328. Yates B, et al. Genenames.org: the HGNC and VGNC resources in 2017et al. Genenames.org: the HGNC and VGNC resources in 2017