A multi-modal neural network for learning cis and trans regulation of stress response in yeast

Abstract

Deciphering gene regulatory networks is a central problem in computational biology. Here, we explore the use of multi-modal neural networks to learn predictive models of gene expression that include cis and trans regulatory components. We learn models of stress response in the budding yeast Saccharomyces cerevisiae. Our models achieve high performance and substantially outperform other state-of-the-art methods such as boosting algorithms that use pre-defined cis-regulatory features. Our model learns several cis and trans regulators including well-known master stress response regulators. We use our models to perform in-silico TF knock-out experiments and demonstrate that in-silico predictions of target gene changes correlate with the results of the corresponding TF knockout microarray experiment.


Boxiang Liu, Nadine Hussami, Avanti Shrikumar, Tyler Shimko, Salil Bhate, Scott Longwell, Stephen Montgomery, and Anshul Kundaje
Departments of Biology, Pathology, Genetics, Statistics, Computer Science, and Bioengineering, Stanford University
{bliu2,nadinehu,avanti,tshimko,bhate,longwell,smontgom,akundaje}@stanford.edu


Gene transcription is regulated by transcription factor (TF) complexes that bind specific sequence motifs encoded in the DNA of cis-regulatory elements. Learning predictive models of gene expression that integrate cis and trans regulatory information is a critical first step to decipher the effects of natural and disease-associated perturbations to regulatory networks. Most approaches for learning transcriptional regulatory models use compendia of known TF binding sequence motifs to represent the DNA sequence of cis-regulatory elements [1, 2, 3, 4, 5, 6]. However, these TF motif compendia are often incomplete and consist of position weight matrices (PWMs), which may not be optimal representations for predictive models. Recently, convolutional neural networks have been used to learn de-novo representations from raw regulatory DNA sequence that can predict TF binding and chromatin accessibility, as well as the effects of non-coding variants on these molecular phenotypes via in-silico mutagenesis [7]. However, these cis-regulatory deep learning models do not model the effects of trans factors and are hence incapable of predicting gene expression in different cellular states.

Here, we present a multi-modal deep neural network architecture that can be used to predict gene expression of any target gene in any cellular state based on the raw cis-regulatory sequence of the gene and the expression levels of trans factors (TFs and signaling molecules) in that cellular state. We learn predictive regulatory programs of stress response in the budding yeast Saccharomyces cerevisiae [8]. Compared to mammals, yeast has a relatively simple cis-regulatory architecture governed primarily by the promoter sequence directly upstream of genes, a far more comprehensive set of known trans factors (TFs and signaling molecules) and TF binding sequence motifs, and an extensive collection of perturbation experiments such as TF knockouts. Yeast is hence an ideal model organism to systematically compare the performance of our deep learning approach to other alternatives that use engineered features and to benchmark the ability of these models to predict the effects of perturbations such as TF knockouts.

arXiv preprint [q-bio.GN], Aug.

Methodology

We formulate the problem as a supervised learning task, where the goal is to learn a model $E_{g,c} = F(S_g, T_c)$ that can predict the gene expression ($E_{g,c}$) of any gene ($g$) in any stress condition ($c$) based on two complementary regulatory inputs: a cis component represented by the raw promoter sequence ($S_g$) of the gene $g$ and a trans component represented by the expression of all trans factors ($T_c$) in condition $c$. We use a multimodal neural network architecture that includes a convolutional sequence module to learn predictive patterns from raw promoter sequences, a dense regulator module to derive features from regulator expression, and an integration module that learns cis-trans interactions.

The input to the sequence module is the raw 1 kb promoter sequence ($S_g$) of a gene ($g$), which is represented using a standard 4-channel (A, C, G, T) one-hot encoding. We use two convolutional layers, each of which contains 50 filters (size 9, stride 1) with ReLU activations. These layers are followed by a max pooling layer (size 4, stride 4). We apply batch normalization before each ReLU activation to mitigate covariate shift during training and to accelerate learning. We train on the forward and reverse complements of each sequence and also use reverse-complement-aware convolutional filters via parameter sharing [9]. The final layer of this module is a dense layer with 512 units.

The input to the regulator module is a vector of expression levels ($T_c$) of 472 known transcription factors and signaling molecules (kinases, phosphatases, receptors) in stress condition $c$. The input feeds into a dense layer of 512 hidden units.

In the integration module, we concatenate the outputs of the cis and trans modules and use 2 dense layers with 512 units to integrate them. The final layer feeds into a linear neuron (for regression) or a softmax layer (for multi-class classification).
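The input encoding and the layer sizes of the sequence module can be illustrated with a small sketch. This is not the authors' code: the helper names are hypothetical, mapping ambiguous bases ('N') to an all-zero row is an assumption, and the length arithmetic assumes unpadded ("valid") convolutions, which the text does not specify.

```python
# Illustrative sketch only; helper names are hypothetical.
CHANNELS = "ACGT"  # channel order stated in the text
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def one_hot(seq):
    """Encode a DNA string as one 4-element row per base, in (A, C, G, T) order.

    Assumption: any base outside ACGT (e.g. 'N') becomes an all-zero row.
    """
    return [[1 if base == ch else 0 for ch in CHANNELS] for base in seq.upper()]


def reverse_complement(seq):
    """Return the reverse complement, used to train on both strands."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def output_length(length, kernel, stride):
    """Length after an unpadded 1D convolution or pooling step (assumption)."""
    return (length - kernel) // stride + 1


def sequence_module_length(promoter_length=1000):
    """Positions left after two conv layers (size 9, stride 1) and one
    max pooling layer (size 4, stride 4), as described in the text."""
    n = output_length(promoter_length, kernel=9, stride=1)
    n = output_length(n, kernel=9, stride=1)
    return output_length(n, kernel=4, stride=4)


if __name__ == "__main__":
    print(one_hot("ACGTN"))
    print(reverse_complement("AGGGG"))
    print(sequence_module_length())
```

Under these assumptions, the 50 filter maps would be flattened before the 512-unit dense layer. With "same" padding the lengths would differ.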

The gene expression dataset is a microarray dataset [8] that spans 6100 genes under 173 diverse stress conditions. The measurements were given as log expression values representing the fold change w.r.t. the untreated reference condition. For the regression models, the expression levels were used as is. For the classification models, we discretized expression into 3 classes in {-1, 0, +1} to represent downregulation, baseline, and upregulation, such that expression values below a negative cutoff are converted to -1, values from the negative cutoff up to the positive cutoff to 0, and values at or above the positive cutoff to +1. We used an 80-10-10 split of the (g, c) pairs spanning all genes and all stress conditions to create training, validation, and test datasets.

We optimize squared loss (for regression models) or softmax loss (for classification models) using SGD with a learning rate of 0.01 and momentum of 0.5. We halve the learning rate every five epochs if no improvement in validation accuracy is observed. All experiments were performed on an Nvidia GeForce GTX 970 using Keras 1.2.2 with the TensorFlow backend.
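The discretization and the split over (g, c) pairs could look roughly like the sketch below. The function names, the `lower` and `upper` cutoff parameters, and the use of a seeded random shuffle are assumptions made for illustration. The cutoffs are left to the caller.

```python
import random


def discretize(log_fc, lower, upper):
    """Map a log fold change to -1 (down), 0 (baseline) or +1 (up).

    `lower` and `upper` are caller-supplied cutoffs (hypothetical parameters):
    values below `lower` map to -1, values in [lower, upper) map to 0,
    and values at or above `upper` map to +1.
    """
    if log_fc < lower:
        return -1
    if log_fc < upper:
        return 0
    return 1


def split_pairs(n_genes, n_conditions, seed=0):
    """Shuffle all (gene, condition) index pairs and split them 80-10-10.

    A uniform random shuffle is an assumption; the text only states the ratio.
    """
    pairs = [(g, c) for g in range(n_genes) for c in range(n_conditions)]
    random.Random(seed).shuffle(pairs)
    n_train = len(pairs) * 8 // 10
    n_valid = len(pairs) // 10
    train = pairs[:n_train]
    valid = pairs[n_train:n_train + n_valid]
    test = pairs[n_train + n_valid:]
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = split_pairs(n_genes=20, n_conditions=10)
    print(len(train), len(valid), len(test))
```

Because the split is over pairs rather than over genes or conditions, the same gene or condition can appear in both training and test sets.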

We compared our model against two state-of-the-art models, GeneClass [5] and BDTree [6], that use the multi-class classification formulation on the same dataset [8] but represent the cis-regulatory sequence as a vector of motif occurrences spanning all known yeast motifs. The GeneClass model is a boosted alternating decision tree, and the BDTree model is a bi-dimensional regression tree. For the classification task, our best model outperformed the previous state-of-the-art by 16.6 percentage points (Table 1). Hence, the neural network model that learns de-novo representations from raw sequence significantly outperforms ensemble models trained using known motifs. In addition, our model achieved a high Pearson correlation of 0.845 for the regression task on the test set (Figure 1a).

Table 1: Classification performance

Method       Accuracy
GeneClass    60.9%
BDTree       62.9%
DNN          79.5%

                    Predicted by DNN
             Down      Baseline    Up
Down         10.14     7.47        0.13
Baseline     3.29      59.77       3.02
Up           0.18      6.42        9.59
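As a consistency check, the overall accuracy can be recovered from the confusion matrix above. The sketch below assumes that the entries are percentages of test examples (they sum to roughly 100) and that the diagonal holds the correctly classified fraction. The function name is hypothetical.

```python
# Confusion matrix values copied from the table above
# (class order: Down, Baseline, Up; columns are DNN predictions).
CONFUSION = [
    [10.14, 7.47, 0.13],
    [3.29, 59.77, 3.02],
    [0.18, 6.42, 9.59],
]


def accuracy_from_confusion(matrix):
    """Return accuracy in percent: diagonal mass divided by total mass."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total


if __name__ == "__main__":
    print(round(accuracy_from_confusion(CONFUSION), 1))
```

The diagonal sums to 79.50, which agrees with a 16.6-point improvement over the 62.9% reported for BDTree.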

Unlike GeneClass and BDTree, which rely on existing motif annotations, our model has the potential to learn known and novel sequence features directly from the raw promoter sequences. We used a method similar to Basset [7] to identify sequence patterns that activate the convolutional filters. For each convolutional filter, we select the 100 sequence segments with the highest activation. We next calculate the PWM based on nucleotide frequency in these sequence segments. To test whether any PWM corresponds to a known motif, we used TomTom (http://meme-suite.org/tools/tomtom) to compare against the YEASTRACT database. We found that our model learns both known (Figure 1b-e) and de novo motifs.

Figure 1: (a) Correlation between predicted and ground truth expression. (b-e) Examples of recovered motifs: (b) DOT6P, (c) RPN4P, (d) SFP1P, (e) STB3P.
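The filter-to-PWM step described above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the function names are hypothetical, the segments are assumed to be equal-length strings already extracted at each filter position, and the matrix holds plain nucleotide frequencies without pseudocounts or log-odds scaling.

```python
def top_segments(segments, activations, k=100):
    """Return the k segments with the highest filter activation."""
    order = sorted(range(len(segments)), key=lambda i: activations[i], reverse=True)
    return [segments[i] for i in order[:k]]


def frequency_matrix(segments):
    """Per-position nucleotide frequencies over equal-length segments.

    Bases outside ACGT are ignored (assumption). Returns one dict per position.
    """
    length = len(segments[0])
    counts = [{base: 0 for base in "ACGT"} for _ in range(length)]
    for seg in segments:
        for pos, base in enumerate(seg.upper()):
            if base in counts[pos]:
                counts[pos][base] += 1
    matrix = []
    for column in counts:
        total = sum(column.values())
        matrix.append({b: (column[b] / total if total else 0.0) for b in "ACGT"})
    return matrix


if __name__ == "__main__":
    segs = ["AGGGG", "AGGGG", "TGGGG", "AGGGC"]
    acts = [3.0, 2.5, 1.0, 0.2]
    for column in frequency_matrix(top_segments(segs, acts, k=3)):
        print(column)
```

The resulting matrix would then be written in a format accepted by TomTom for comparison against a motif database.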

We estimated the importance of each trans-regulator for each training/test example as the gradient of the output w.r.t. the regulator input times the magnitude of the input (G-by-I). We summed the G-by-I values across all genes and conditions to obtain the global estimate of trans-regulator importance. The top-ranking trans-regulators include several well-known regulators of stress response in yeast such as MSN2/4 (master stress response TF), TPK1 (a kinase that phosphorylates MSN2/4 and thereby controls its cellular localization), USV1, PPT1, XBP1 and YVH1.
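A toy version of the G-by-I computation is sketched below. It is only an illustration under stated assumptions: the gradient is approximated by central finite differences on an arbitrary scalar function `f` (a trained network would use backpropagation instead), "magnitude of input" is read as the absolute value, and all function names are hypothetical.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad


def g_by_i(f, x):
    """Gradient times input magnitude for one example."""
    return [g * abs(v) for g, v in zip(numerical_gradient(f, x), x)]


def global_importance(f, examples):
    """Sum per-example G-by-I scores to get one score per input feature."""
    totals = [0.0] * len(examples[0])
    for x in examples:
        for i, score in enumerate(g_by_i(f, x)):
            totals[i] += score
    return totals


if __name__ == "__main__":
    # Toy stand-in for the trained model: a linear function of two regulators.
    def model(x):
        return 2.0 * x[0] - 3.0 * x[1]

    print(global_importance(model, [[1.0, 2.0], [-1.0, 1.0]]))
```

Whether regulators are then ranked by the signed sum or by its absolute value is not stated in the text, so the sketch stops at the summed scores.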

Table 2: Rank of regulator module inputs

Rank    Regulator
1       USV1 / YPL230W
2       DAL80 / YKR034W
3       XBP1 / YIL101C
4       PPT1 / YGR123C
5       LSG1 / YGL099W
6       CIN5 / YOR028C
7       YVH1 / YIR026C
8       TPK1 / YJL164C
9       GAC1 / YOR178C
10      MSN4 / YKL062W

The ultimate test of a predictive model of gene expression is its ability to predict the effect of a previously unseen perturbation. Therefore, we performed an in silico knockout experiment on MSN2/4, where we replaced all instances of its motif 'AGGGG' with neutral 'NNNNN' and reduced the expression level of MSN2/4 by 32-fold. MSN2/4 are activated under heat shock but are inactive under steady-state growth [10]. As expected, we observed that MSN2/4 target genes experience greater change under heat shock conditions than under steady-state growth conditions (Fig. 2a). We also compared our predictions against microarray measurements in an actual MSN2/4 knockout strain and observed significant correlation between the two (Spearman correlation = 0.486, Fig. 2b).

Figure 2: (a) Target gene expression change (y-axis: mean log FC). Known MSN2/4 target genes experience larger expression changes under stress conditions. (b) Predicted vs. ground truth: predicted difference (MSN2/4 − WT @ 37C) against the observed difference in the actual microarray experiment (Spearman correlation = 0.486).
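The two input perturbations used for the in silico knockout can be sketched as below. The function names and the `both_strands` option are hypothetical. The sketch also assumes that expression values are log2 fold changes, so that a 32-fold reduction corresponds to subtracting 5; the text does not state the base of the logarithm.

```python
import math


def mask_motif(seq, motif="AGGGG", both_strands=False):
    """Replace every occurrence of `motif` with the same number of 'N's.

    `both_strands` (an assumption, off by default) also masks the
    reverse complement of the motif.
    """
    masked = seq.upper().replace(motif, "N" * len(motif))
    if both_strands:
        comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
        rc = "".join(comp[b] for b in reversed(motif))
        masked = masked.replace(rc, "N" * len(rc))
    return masked


def knock_down(log2_expression, fold=32.0):
    """Lower a log2 expression value by `fold` (32-fold is 5 log2 units)."""
    return log2_expression - math.log2(fold)


def knock_out_inputs(promoter, regulators, targets=("MSN2", "MSN4")):
    """Return the perturbed (sequence, regulator expression) model inputs."""
    perturbed = dict(regulators)
    for name in targets:
        if name in perturbed:
            perturbed[name] = knock_down(perturbed[name])
    return mask_motif(promoter), perturbed


if __name__ == "__main__":
    seq, expr = knock_out_inputs("TTAGGGGCATAGGGGA", {"MSN2": 1.5, "USV1": 0.3})
    print(seq, expr)
```

The predicted knockout effect for a gene would then be the difference between the model output on the perturbed inputs and on the original inputs.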

We present a multi-modal deep learning architecture that can accurately predict the gene expression response of yeast to various stress conditions as a function of cis-regulatory sequence and trans factor expression. Our model outperforms other approaches that use engineered features. Preliminary analyses of the globally predictive features indicate that our model captures several well-known stress response factors. We are currently exploring the context-specific cis and trans regulators learned by the model at the level of individual genes and gene modules across the diversity of stresses. Further, through in silico knockout of a key stress response factor, we show that our model makes reasonably accurate predictions similar to the true knockout microarray experiment. It is worth noting that while the true knockout experiment measures the direct and indirect effects of the TF knockout, our model is more likely to predict direct effects. We are using in-vivo TF binding maps to distinguish the direct from the indirect targets of MSN2/4 to further investigate this issue. This work lays the foundation for developing novel neural network architectures to model transcriptional regulation in mammalian systems.

References

[1] Harmen J Bussemaker, Hao Li, and Eric D Siggia. Regulatory element detection using correlation with expression. Nature Genetics, 27(2):167–174, February 2001.

[2] Tu Minh Phuong, Doheon Lee, and Kwang Hyung Lee. Regression trees for regulatory element identification. Bioinformatics, 20(5):750–757, March 2004.

[3] Lev A Soinov, Maria A Krestyaninova, and Alvis Brazma. Towards reconstruction of gene networks from expression data by supervised learning. Genome Biology, 4(1):R6, January 2003.

[4] Eran Segal, Michael Shapira, Aviv Regev, Dana Pe'er, David Botstein, Daphne Koller, and Nir Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics, 34(2):166–176, June 2003.

[5] Manuel Middendorf, Anshul Kundaje, Chris Wiggins, Yoav Freund, and Christina Leslie. Predicting genetic regulatory response using classification. Bioinformatics, 20(suppl 1):i232–i240, August 2004.

[6] Jianhua Ruan and Weixiong Zhang. A bi-dimensional regression tree approach to the modeling of gene expression regulation. Bioinformatics, 22(3):332–340, February 2006.

[7] David R Kelley, Jasper Snoek, and John L Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research, 26(7):990–999, July 2016.

[8] A P Gasch, P T Spellman, C M Kao, O Carmel-Harel, M B Eisen, G Storz, D Botstein, and P O Brown. Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell, 11(12):4241–4257, December 2000.

[9] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Reverse-complement parameter sharing improves deep learning models for genomics. bioRxiv, 2017.

[10] A Sadeh, N Movshovich, M Volokh, L Gheber, and A Aharoni. Fine-tuning of the Msn2/4-mediated yeast stress responses as revealed by systematic deletion of Msn2/4 partners.