Biophysical models of cis-regulation as interpretable neural networks
BBiophysical models of cis-regulation asinterpretable neural networks
Ammar Tareen
Simons Center for Quantitative BiologyCold Spring Harbor LaboratoryCold Spring Harbor, NY 11724 [email protected]
Justin B. Kinney
Simons Center for Quantitative BiologyCold Spring Harbor LaboratoryCold Spring Harbor, NY 11724 [email protected]
Abstract
The adoption of deep learning techniques in genomics has been hindered by the difficulty of mecha-nistically interpreting the models that these techniques produce. In recent years, a variety of post-hocattribution methods have been proposed for addressing this neural network interpretability problemin the context of gene regulation. Here we describe a complementary way of approaching thisproblem. Our strategy is based on the observation that two large classes of biophysical models ofcis-regulatory mechanisms can be expressed as deep neural networks in which nodes and weightshave explicit physiochemical interpretations. We also demonstrate how such biophysical networkscan be rapidly inferred, using modern deep learning frameworks, from the data produced by certaintypes of massively parallel reporter assays (MPRAs). These results suggest a scalable strategy forusing MPRAs to systematically characterize the biophysical basis of gene regulation in a wide rangeof biological contexts. They also highlight gene regulation as a promising venue for the developmentof scientifically interpretable approaches to deep learning.Deep learning – the use of large multi-layer neural networks in machine learning applications – is revolutionizinginformation technology [1]. There is currently a great deal of interest in applying deep learning techniques to problemsin genomics, especially for understanding gene regulation [2–5]. These applications remain somewhat controversial,however, due to the difficulty of mechanistically interpreting neural network models trained on functional genomicsdata. Multiple attribution strategies, which seek to extract meaning post-hoc from neural networks that have rathergeneric architectures, have been proposed for addressing this interpretability problem [6–8]. However, there remains asubstantial gap between the outputs of such attribution methods and fully mechanistic models of gene regulation.Here we advocate for a complementary approach: the inference of neural network models whose architecture reflects ex-plicit biophysical hypotheses for how cis-regulatory sequences function. This strategy is based on two key observations.First, two large classes of biophysical models can be formulated as three-layer neural networks that have a stereotypedform and in which nodes and weights have explicit physiochemical interpretations. This is true of thermodynamicmodels, which rely on a quasi-equilibrium assumption [9–14], as well as kinetic models, which are more complex butdo not make such assumptions [15–17]. Second, existing deep learning frameworks such as TensorFlow [18] are able torapidly infer such models from the data produced by certain classes of MPRAs.The idea of using neural networks to model the biophysics of gene regulation goes back to at least [19]. To ourknowledge, however, the fact that all thermodynamic models of gene regulation can be represented by a simplethree-layer architecture has not been previously reported. We are not aware of any prior work that uses neural networksto represent kinetic models or models involving King-Altman diagrams. There is a growing literature on modelingMPRA data using neural networks [20–24]. However, such modeling has yet to be advocated for MPRAs that dissectthe mechanisms of single cis-regulatory sequences, such as in [25–29], which is our focus here. Finally, some recentwork has used deep learning frameworks to infer parametric models of cis-regulation [30, 31]. These studies did not,however, infer the type of explicit biophysical quantities, such as ∆ G values, that our approach recovers.All thermodynamic models of cis-regulation can be represented as three-layer neural networks as follows. First onedefines a set of molecular complexes, or “states”, which we index using s . Each state has both a Gibbs free energy ∆ G s Presented at the 14th conference on Machine Learning in Computational Biology (MLCB 2019), Vancouver, Canada. a r X i v : . [ q - b i o . M N ] F e b igure 1: A thermodynamic model of transcriptional regulation. (A) Transcriptional activation at the E. coli lac promoter is regulated by two proteins, CRP and σ RNA polymerase (RNAP). CRP is a transcriptional activatorthat up-regulates transcription by stabilizing RNAP-DNA binding. ∆ G C and ∆ G R respectively denote the Gibbsfree energies of the CRP-DNA and RNAP-DNA interactions, while ∆ G I denotes the Gibbs free energy of interactionbetween CRP and RNAP. (B) Like all thermodynamic models of gene regulation, this model consists of a set of states,each state having an associated Gibbs free energy and activity. The probability of each state is assumed to follow theBoltzmann distribution. (C) The corresponding activity predicted by such thermodynamic models is the state-specificactivity averaged together using these Boltzmann probabilities. (D) This model formulated as a three-layer neuralnetwork. First layer nodes represent interaction energies, second layer nodes represent state probabilities, and third layernodes represent transcriptional activity. The values of weights are indicated; gray lines correspond to zero weights. Thesecond layer has a softmin activation, while the third has a linear activation. All thermodynamic models of cis-regulationcan be represented using this general three-layer form.and an associated activity α s . These energies determine the probability P s of each state occurring in thermodynamicequilibrium via the Boltzmann distribution, P s = e − ∆ G s (cid:80) s (cid:48) e − ∆ G s (cid:48) . (1)The energy of each state is, in turn, computed using integral combinations of the individual interaction energies ∆ G j that occur in that state. We can therefore write ∆ G s = (cid:80) j ω sj ∆ G j , where ω sj is the number of times that interaction j occurs in state s . The resulting activity predicted by the model is given by the activities α s of the individual statesaveraged over this distribution, i.e., t = (cid:80) s α s P s .Fig. 1 illustrates a thermodynamic model for transcriptional activation at the E. coli lac promoter. This model involvestwo proteins, CRP and RNAP, as well as three interaction energies: ∆ G C , ∆ G R , and ∆ G I . The rate of transcription t is further assumed to be proportional to the fraction of time that RNAP is bound to DNA (Fig. 1A). This model issummarized by four different states, two of which lead to transcription and two of which do not (Fig. 1B). Fig. 1Cshows the resulting formula for t in terms of model parameters. This model is readily formulated as a feed-forwardneural network (Fig. 1D). Indeed, all thermodynamic models of cis-regulation can be formulated as three-layer neuralnetworks: layer 1 represents molecular interaction energies, layer 2 (which uses a softmin activation) represents stateprobabilities, and layer 3 (using linear activation) represents the biological activity of interest, which in this case istranscription rate.We can infer thermodynamic models like these for a cis-regulatory sequence of interest (the wild-type sequence) fromthe data produced by an MPRA performed on an appropriate sequence library [26]. Indeed, a number of MPRAs havebeen performed with this explicit purpose in mind [26, 27, 33–35]. To this end, such MPRAs are generally performedusing libraries that consist of sequence variants that differ from the wild-type sequence by a small number of singlenucleotide polymorphisms (SNPs). The key modeling assumption that motivates using libraries of this form is that To reduce notational burden, all ∆ G values are assumed to be in thermal units. At ◦ C, one thermal unit is k B T =0 .
62 kcal / mol , where k B is Boltzmann’s constant and T is temperature. E. coli lac promoter was mutagenized at 12% per nucleotide. Variant promoters were then used todrive the expression of GFP. Cells carrying these expression constructs were then sorted using FACS, and the variantsequences in each bin were sequenced. This yielded data on about × variant promoters across 10 bins. (B) Theneural network from Fig. 1D, but with ∆ G C and ∆ G R expressed as linear functions of the DNA sequence (cid:126)x , as well asa dense feed-forward network mapping activity t to bins via a probability distribution p (bin | t ) . Gray lines indicateweights fixed at 0. The weights linking nodes P and P to node t were constrained to have the same value t sat . (C)The parameter values inferred from the MPRA data of [26]. Shown are the CRP energy matrix (cid:126)θ C , the RNAP energymatrix (cid:126)θ R , and the CRP-RNAP interaction energy ∆ G I . Since increasingly negative energy corresponds to strongerbinding, they y -axis in the logo plots is inverted. Logos were generated using Logomaker [32].the assayed sequence variants will form the same molecular complexes as the wild-type sequence, but with Gibbs freeenergies and state activities whose values vary from sequence to sequence. By contrast, variant libraries that containinsertions, deletions, or large regions of random DNA (e.g. [20, 21, 23, 24, 31]) are unlikely to satisfy this modelingassumption.Fig. 2A summarizes the sort-seq MPRA described in [26]. Lac promoter variants were used to drive GFP expressionin
E. coli , cells were sorted into 10 bins using fluorescence-activated cell sorting, and the variant promoters withineach bin were sequenced. This yielded data comprising about × variant lac promoter sequences, each associatedwith one of 10 bins. The authors then fit the biophysical model shown in Fig. 1C, but under the assumption that ∆ G C = (cid:126)θ C · (cid:126)x + µ C and ∆ G R = (cid:126)θ R · (cid:126)x + µ R , where (cid:126)x is a one-hot encoding of promoter DNA sequence.Here we used TensorFlow to infer the same model formulated as a deep neural network. Specifically, we augmented thenetwork in Fig. 1D by making ∆ G C and ∆ G R sequence-dependent as in [26]. To link t to the MPRA measurements,we introduced a feed-forward network with one hidden layer and a softmax output layer corresponding to the 10 binsinto which cells were sorted. Model parameters were then fit to the MPRA dataset using stochastic gradient descent andearly stopping. The results agreed well with those reported in [26]. In particular, the parameters in the energy matricesfor CRP ( (cid:126)θ C ) and RNAP ( (cid:126)θ R ) exhibited respective Pearson correlation coefficients of 0.986 and 0.994 with thosereported in [26]. The protein-protein interaction energy that we found, ∆ G I = − . kcal/mol, was also compatiblewith the previously reported value ∆ G I = − . ± . kcal/mol.A major difference between our results and those of [26] is the ease with which they were obtained. Training of thenetwork in Fig. 2B consistently took about 15 minutes on a standard laptop computer. The model fitting procedurein [26], by contrast, relied on a custom Parallel Tempering Monte Carlo algorithm that took about a week to run on amulti-node computer cluster (personal communication), and more recent efforts to train biophysical models on MPRAdata have encountered similar computational bottlenecks [34, 35].3igure 3: A kinetic model for transcriptional initiation by E. coli
RNAP. (A) In this model, promoter DNA canparticipate in three complexes: unbound, closed, and open [39, 40]. Transitions between these complexes are governedby four rate constants: k , k − , k , and k . (B) A formula for the steady-state rate of mRNA production can be obtainedusing King-Altman diagrams [41, 42]. (C) This formula can be represented using the three-layer neural network shown,where layer 1 represents log transition rates, layer 2 (softmax activation) represents normalized King-Altman diagrams,and layer 3 (linear activation) represents promoter activity. Black lines indicate weight 1; gray lines indicate weight 0.Note that the single nonzero weight connecting layer 2 to layer 3 (orange) is actually the transition rate k from layer 1.All kinetic models of cis-regulation share this general three-layer form.Also of note is the fact that in [26] the authors inferred models using information maximization. Specifically, the authorsfit the parameters of t ( (cid:126)x ) by maximizing the mutual information I [ t ; bin] between model predictions and observed bins.One difficulty with this strategy is the need to estimate mutual information. Instead, we used maximum likelihoodto infer the parameters of t ( (cid:126)x ) as well as the experimental transfer function (i.e., noise model) p (bin | t ) , which wasmodeled by a dense feed-forward network with one hidden layer. These two inference methods, however, are essentiallyequivalent: in the large data regime, the parameters of t that maximize I [ t ; bin] are the same as the parameters oneobtains when maximizing likelihood over the parameters of both t and p (bin | t ) ; see [36–38].A shortcoming of thermodynamic models is that they ignore non-equilibrium processes. Kinetic models address thisproblem by providing a fully non-equilibrium characterization of steady-state activity. Kinetic models are specified bylisting explicit state-to-state transition rates rather than Gibbs free energies. Fig. 3A shows a three-state kinetic modelof transcriptional initiation consisting of unbound promoter DNA, an RNAP-DNA complex in the closed conformation,and the RNAP-DNA complex in the open conformation [39, 40]. The rate k going from the open state to the unboundstate represents transcript initiation. The overall transcription rate in steady state is therefore k times the occupancy ofthe open complex.King-Altman diagrams [41, 42], a technique from mathematical enzymology, provide a straight-forward way to computesteady-state occupancy in kinetic models. Specifically, each state’s occupancy is proportional to the sum of directedspanning trees (a.k.a. King-Altman diagrams) that flow to that state, where each spanning tree’s value is given by theproduct of rates comprising that tree. Fig. 3B illustrates this procedure for the kinetic model in Fig. 3A. Every suchkinetic model can be represented by a three-layer neural network (e.g., Fig. 3C) in which first layer nodes represent logtransition rates, second layer nodes (after a softmax activation) represent normalized King-Altman diagrams, and thirdlayer nodes represent the activities of interest.Here we have shown how both thermodynamic and kinetic models of gene regulation can be formulated as three-layerdeep neural networks in which nodes and weights have explicit biophysical meaning. This represents a new strategyfor interpretable deep learning in the study of gene regulation, one complementary to existing post-hoc attributionmethods. We have further demonstrated that a neural-network-based thermodynamic model can be rapidly inferredfrom MPRA data using TensorFlow. This was done in the context of a well-characterized bacterial promoter becauseprevious studies of this system have established concrete results against which we could compare our inferred model.The same modeling approach, however, should be readily applicable to a wide variety of biological systems amenableto MPRAs, including transcriptional regulation and alternative mRNA splicing in higher eukaryotes. Code Availability and Acknowledgements : The neural network model shown in Fig. 2C, as well as the scripts used toinfer it from the data of [26], are available at https://github.com/jbkinney/19_mlcb . We thank Anand Murugan,Yifei Huang, Peter Koo, Alan Moses, and Mahdi Kooshkbaghi for helpful discussions, as well as and three anonymousreferees for providing constructive criticism. This project was supported by NIH grant 1R35GM133777 and a grantfrom the CSHL/Northwell Health partnership. 4 eferences [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”
Nature , vol. 521, pp. 436–444, May 2015.[2] J. Zhou and O. G. Troyanskaya, “Predicting effects of noncoding variants with deep learning–based sequencemodel,”
Nat Methods , vol. 12, pp. 931–934, Aug. 2015.[3] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, “Predicting the sequence specificities of dna- andrna-binding proteins by deep learning,”
Nat Biotechnol , vol. 33, pp. 831–838, Jul 2015.[4] K. Jaganathan, S. K. Panagiotopoulou, J. F. McRae, S. F. Darbandi, D. Knowles, Y. I. Li, J. A. Kosmicki,J. Arbelaez, W. Cui, G. B. Schwartz, E. D. Chow, E. Kanterakis, H. Gao, A. Kia, S. Batzoglou, S. J. Sanders, andK. K.-H. Farh, “Predicting splicing from primary sequence with deep learning,”
Cell , vol. 176, pp. 535–548.e24,Jan. 2019.[5] G. Eraslan, Ž. Avsec, J. Gagneur, and F. J. Theis, “Deep learning: new computational modelling techniques forgenomics,”
Nat Rev Genet , vol. 20, no. 7, pp. 389–403, 2019.[6] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classificationmodels and saliency maps,” arXiv preprint arXiv:1312.6034 , 2013.[7] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activationdifferences,” in
Proceedings of the 34th International Conference on Machine Learning - Volume 70 , ICML’17,pp. 3145–3153, JMLR.org, 2017.[8] A. Chattopadhyay, P. Manupriya, A. Sarkar, and V. N. Balasubramanian, “Neural network attributions: A causalperspective,” arXiv preprint arXiv:1902.02302 , 2019.[9] G. Ackers, A. Johnson, and M. Shea, “Quantitative model for gene regulation by lambda phage repressor,”
ProcNatl Acad Sci USA , vol. 79, pp. 1129–1133, Feb. 1982.[10] M. A. Shea and G. K. Ackers, “The or control system of bacteriophage lambda. a physical-chemical model forgene regulation,”
J Mol Biol , vol. 181, pp. 211–230, Jan. 1985.[11] L. Bintu, N. E. Buchler, H. G. Garcia, U. Gerland, T. Hwa, J. Kondev, T. Kuhlman, and R. Phillips, “Transcriptionalregulation by the numbers: applications,”
Curr Opin Genet Dev , vol. 15, pp. 125–135, Apr. 2005.[12] L. Bintu, N. E. Buchler, H. G. Garcia, U. Gerland, T. Hwa, J. Kondev, and R. Phillips, “Transcriptional regulationby the numbers: models,”
Curr Opin Genet Dev , vol. 15, pp. 116–124, Apr. 2005.[13] E. Segal and J. Widom, “From dna sequence to transcriptional behaviour: a quantitative approach,”
Nat Rev Genet ,vol. 10, pp. 443–456, July 2009.[14] M. S. Sherman and B. A. Cohen, “Thermodynamic state ensemble models of cis-regulation,”
PLoS Comput Biol ,vol. 8, no. 3, p. e1002407, 2012.[15] J. Estrada, F. Wong, A. DePace, and J. Gunawardena, “Information integration and energy expenditure in generegulation,”
Cell , vol. 166, pp. 234–244, June 2016.[16] C. Scholes, A. H. DePace, and A. Sanchez, “Combinatorial gene regulation through kinetic control of thetranscription cycle,”
Cell Syst , vol. 4, pp. 97–108.e9, Jan. 2017.[17] J. Park, J. Estrada, G. Johnson, B. J. Vincent, C. Ricci-Tam, M. D. Bragdon, Y. Shulgina, A. Cha, Z. Wunderlich,J. Gunawardena, and A. H. DePace, “Dissecting the sharp response of a canonical developmental enhancer revealsmultiple sources of cooperativity,” eLife , vol. 8, p. 2787, June 2019.[18] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. ,“Tensorflow: A system for large-scale machine learning,” in , pp. 265–283, 2016.[19] E. Mjolsness, “On cooperative quasi-equilibrium models of transcriptional regulation,”
J Bioinform Comput Biol ,vol. 5, no. 2b, pp. 467–490, 2007.[20] A. B. Rosenberg, R. P. Patwardhan, J. Shendure, and G. Seelig, “Learning the sequence determinants of alternativesplicing from millions of random sequences,”
Cell , vol. 163, pp. 698–711, Oct. 2015.521] J. T. Cuperus, B. Groves, A. Kuchina, A. B. Rosenberg, N. Jojic, S. Fields, and G. Seelig, “Deep learning ofthe regulatory grammar of yeast 5’ untranslated regions from 500,000 random sequences,”
Genome Res , vol. 27,no. 12, pp. 2015 – 2024, 2017.[22] R. Movva, P. Greenside, G. K. Marinov, S. Nair, A. Shrikumar, and A. Kundaje, “Deciphering regulatory dnasequences and noncoding genetic variants using neural network models of massively parallel reporter assays,”
PLoS ONE , vol. 14, no. 6, p. e0218073, 2019.[23] P. J. Sample, B. Wang, D. W. Reid, V. Presnyak, I. J. McFadyen, D. R. Morris, and G. Seelig, “Human 5’ utrdesign and variant effect prediction from a massively parallel translation assay,”
Nat Biotechnol , vol. 37, no. 7,pp. 803–809, 2019.[24] N. Bogard, J. Linder, A. B. Rosenberg, and G. Seelig, “A deep neural network for predicting and engineeringalternative polyadenylation,”
Cell , vol. 178, no. 1, pp. 91–106.e23, 2019.[25] R. P. Patwardhan, C. Lee, O. Litvin, D. L. Young, D. Pe’er, and J. Shendure, “High-resolution analysis of dnaregulatory elements by synthetic saturation mutagenesis,”
Nat Biotechnol , vol. 27, no. 12, pp. 1173 – 1175, 2009.[26] J. B. Kinney, A. Murugan, C. G. Callan, and E. C. Cox, “Using deep sequencing to characterize the biophysicalmechanism of a transcriptional regulatory sequence,”
Proc Natl Acad Sci USA , vol. 107, pp. 9158–9163, May2010.[27] A. Melnikov, A. Murugan, X. Zhang, T. Tesileanu, L. Wang, P. Rogov, S. Feizi, A. Gnirke, C. G. Callan, J. B.Kinney, M. Kellis, E. S. Lander, and T. S. Mikkelsen, “Systematic dissection and optimization of inducibleenhancers in human cells using a massively parallel reporter assay,”
Nat Biotechnol , vol. 30, pp. 271–277, Feb.2012.[28] J. C. Kwasnieski, I. Mogno, C. A. Myers, J. C. Corbo, and B. A. Cohen, “Complex effects of nucleotide variantsin a mammalian cis-regulatory element,”
Proc Natl Acad Sci USA , vol. 109, pp. 19498–19503, Nov. 2012.[29] R. P. Patwardhan, J. B. Hiatt, D. M. Witten, M. J. Kim, R. P. Smith, D. May, C. Lee, J. M. Andrie, S.-I. Lee, G. M.Cooper, N. Ahituv, L. A. Pennacchio, and J. Shendure, “Massively parallel functional dissection of mammalianenhancers in vivo,”
Nat Biotechnol , vol. 30, no. 3, pp. 265 – 270, 2012.[30] Y. Liu, K. Barr, and J. Reinitz, “Fully interpretable deep learning model of transcriptional control,” bioRxivpreprint doi:10.1101/655639 , May 2019.[31] C. G. d. Boer, E. D. Vaishnav, R. Sadeh, E. L. Abeyta, N. Friedman, and A. Regev, “Deciphering eukaryoticgene-regulatory logic with 100 million random promoters,”
Nat Biotechnol , pp. 1–10, 2019.[32] A. Tareen and J. B. Kinney, “Logomaker: beautiful sequence logos in python,”
Bioinformatics , Dec. 2019. btz921.[33] M. Razo-Mejia, J. Q. Boedicker, D. Jones, A. DeLuna, J. B. Kinney, and R. Phillips, “Comparison of the theoreticaland real-world evolutionary potential of a genetic circuit,”
Phys Biol , vol. 11, p. 026005, Apr. 2014.[34] N. M. Belliveau, S. L. Barnes, W. T. Ireland, D. L. Jones, M. J. Sweredoski, A. Moradian, S. Hess, J. B. Kinney,and R. Phillips, “Systematic approach for dissecting the molecular mechanisms of transcriptional regulation inbacteria,”
Proc Natl Acad Sci USA , vol. 115, pp. E4796–E4805, May 2018.[35] S. L. Barnes, N. M. Belliveau, W. T. Ireland, J. B. Kinney, and R. Phillips, “Mapping dna sequence to transcriptionfactor binding energy in vivo,”
PLoS Comput Biol , vol. 15, p. e1006226, Feb. 2019.[36] J. B. Kinney, G. Tkaˇcik, and C. G. Callan, “Precise physical models of protein–DNA interaction from high-throughput data,”
Proc Natl Acad Sci USA , vol. 104, pp. 501–506, Jan. 2007.[37] J. B. Kinney and G. S. Atwal, “Parametric inference in the large data limit using maximally informative models,”
Neural Comput , vol. 26, pp. 637–653, Apr. 2014.[38] G. S. Atwal and J. B. Kinney, “Learning quantitative sequence–function relationships from massively parallelexperiments,”
J Stat Phys , vol. 162, no. 5, pp. 1203–1243, 2016.[39] W. R. McClure, “Rate-limiting steps in rna chain initiation,”
Proc Natl Acad Sci USA , vol. 77, no. 10, pp. 5634 –5638, 1980. 640] W. R. McClure, “Mechanism and control of transcription initiation in prokaryotes,”
Annu Rev Biochem , vol. 54,no. 1, pp. 171 – 204, 1985.[41] E. King and C. Altman, “A schematic method of deriving the rate laws for enzyme-catalyzed reactions,”
J PhysChem , vol. 60, no. 10, pp. 1375–1378, 1956.[42] T. L. Hill,