Deep Motif Dashboard: Visualizing and Understanding Genomic Sequences Using Deep Neural Networks
Jack Lanchantin, Ritambhara Singh, Beilun Wang, and Yanjun Qi
Department of Computer Science, University of Virginia, Charlottesville, VA 22903
{jjl5sw,rs3zz,bw4mw,yq2h}@virginia.edu
Abstract
Deep neural network (DNN) models have recently obtained state-of-the-art prediction accuracy for the transcription factor binding site (TFBS) classification task. However, it remains unclear how these approaches identify meaningful DNA sequence signals and give insights as to why TFs bind to certain locations. In this paper, we propose a toolkit called the Deep Motif Dashboard (DeMo Dashboard) which provides a suite of visualization strategies to extract motifs, or sequence patterns, from deep neural network models for TFBS classification. We demonstrate how to visualize and understand three important DNN models: convolutional, recurrent, and convolutional-recurrent networks. Our first visualization method finds a test sequence's saliency map, which uses first-order derivatives to describe the importance of each nucleotide in making the final prediction. Second, considering that recurrent models make predictions in a temporal manner (from one end of a TFBS sequence to the other), we introduce temporal output scores, indicating the prediction score of a model over time for a sequential input. Lastly, a class-specific visualization strategy finds the optimal input sequence for a given TFBS positive class via stochastic gradient optimization. Our experimental results indicate that the convolutional-recurrent architecture performs the best among the three. The visualization techniques indicate that the CNN-RNN makes its predictions by modeling both motifs and the dependencies among them.
In recent years, there has been an explosion of deep learning models which have led to groundbreaking results in many fields such as computer vision [13], natural language processing [25], and computational biology [2, 19, 27, 11, 14, 22]. However, although these models have proven to be very accurate, they have widely been viewed as "black boxes" due to their complexity, making them hard to understand. This is particularly unfavorable in the biomedical domain, where understanding a model's predictions is extremely important for doctors and researchers trying to use the model. Aiming to open up the black box, we present the "Deep Motif Dashboard" (DeMo Dashboard) to understand the inner workings of deep neural network models for a genomic sequence classification task. (A dashboard normally refers to a user interface that gives a current summary, usually in graphic, easy-to-read form, of key information relating to performance [1].) We do this by introducing a suite of different neural models and visualization strategies to see which ones perform the best and to understand how they make their predictions. We implemented our models in Torch, and the code is made available at deepmotif.org.

Understanding genetic sequences is one of the fundamental tasks of health advancement due to the high correlation of genes with diseases and drugs. An important problem within genetic sequence understanding relates to transcription factors (TFs), which are regulatory proteins that bind to DNA. Each TF binds to specific transcription factor binding sites (TFBSs) on the genome to regulate cell machinery. Given an input DNA sequence, classifying whether or not it contains a binding site for a particular TF is a core task of bioinformatics [24].

For our task, we follow a two-step approach.
First, given a particular TF of interest and a dataset containing samples of positive and negative TFBS sequences, we construct three deep learning architectures to classify the sequences. Section 2 introduces the three different DNN structures that we use: a convolutional neural network (CNN), a recurrent neural network (RNN), and a convolutional-recurrent neural network (CNN-RNN). Once we have trained models to predict binding sites, the second step of our approach is to understand why the models perform the way they do. As explained in Section 3, we do this by introducing three different visualization strategies for interpreting the models:

1. Measuring nucleotide importance with Saliency Maps.
2. Measuring critical sequence positions for the classifier using Temporal Output Scores.
3. Generating class-specific motif patterns with Class Optimization.

We test and evaluate our models and visualization strategies on a large-scale benchmark TFBS dataset. Section 4 provides experimental results for understanding and visualizing the three DNN architectures. We find that the CNN-RNN outperforms the other models. From the visualizations, we observe that the CNN-RNN tends to focus its predictions on the traditional motifs, as well as modeling long-range dependencies among motifs.
TFBS Classification.
Chromatin immunoprecipitation (ChIP-seq) technologies and databases such as ENCODE [5] have made binding site locations available for hundreds of different TFs. Despite these advancements, there are two major drawbacks: (1) ChIP-seq experiments are slow and expensive; (2) although ChIP-seq experiments can find the binding site locations, they cannot find patterns that are common across all of the positive binding sites, which could give insight as to why TFs bind to those locations. Thus, there is a need for large-scale computational methods that can not only make accurate binding site classifications, but also identify and understand patterns that influence the binding site locations. In order to computationally predict TFBSs on a DNA sequence, researchers initially used consensus sequences and position weight matrices to match against a test sequence [24]. Simple neural network classifiers were then proposed to differentiate positive and negative binding sites, but did not show significant improvements over the weight matrix matching methods [9]. Later, SVM techniques outperformed the generative methods by using k-mer features [6, 20], but string kernel based SVM systems are limited by an expensive computational cost proportional to the number of training and testing sequences. Most recently, convolutional neural network models have shown state-of-the-art results on the TFBS task and are scalable to a large number of genomic sequences [2, 14], but it remains unclear which neural architectures work best.
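As a toy illustration of the position weight matrix matching described above, the scan below slides a PWM along a sequence and scores each window by its log-likelihood ratio against a uniform background. This is a minimal sketch, not code from any cited method, and the PWM values are invented for the example:

```python
import math

# Hypothetical 4-position PWM: probability of each nucleotide at each position.
PWM = [
    {"A": 0.80, "C": 0.05, "G": 0.10, "T": 0.05},  # position 1
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},  # position 2
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},  # position 3
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},  # position 4
]
BACKGROUND = 0.25  # uniform background frequency for each nucleotide

def pwm_score(window):
    """Log-likelihood ratio of a window under the PWM vs. the background."""
    return sum(math.log(PWM[i][c] / BACKGROUND) for i, c in enumerate(window))

def best_match(sequence, k=len(PWM)):
    """Slide the PWM along the sequence; return (score, start) of the best window."""
    return max((pwm_score(sequence[i:i + k]), i)
               for i in range(len(sequence) - k + 1))

score, pos = best_match("CCAGTAGGTA")  # best window is "AGTA" starting at index 2
```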
Deep Neural Networks for TFBSs.
To find which neural models work the best on the TFBS classification task, we examine several different types of models. Inspired by their success across different fields, we explore variations of two popular deep learning architectures: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). CNNs have dominated the field of computer vision in recent years, obtaining state-of-the-art results in many tasks due to their ability to automatically extract translation-invariant features. On the other hand, RNNs have emerged as one of the most powerful models for sequential data tasks such as natural language processing due to their ability to learn long-range dependencies. Specifically, on the TFBS prediction task, we explore three distinct architectures: (1) CNN, (2) RNN, and (3) a combination of the two, CNN-RNN. Figure 1 shows an overview of the models.
End-to-end Deep Framework.
While the bodies of the three architectures differ, each implemented model follows a similar end-to-end framework, which lets us easily compare and contrast results. We use the raw nucleotide characters (A, C, G, T) as inputs, where each character is converted into a one-hot encoding (a binary vector with the matching character entry set to 1 and the rest set to 0). This encoding matrix is used as the input to a convolutional, recurrent, or convolutional-recurrent module, each of which outputs a vector of fixed dimension. The output vector of each model is linearly fed to a softmax function as the last layer, which learns the mapping from the hidden space to the output class label space C ∈ {+1, −1}. The final output is a probability indicating whether an input is a positive or a negative binding site (binary classification task). The parameters of the network are trained end-to-end by minimizing the negative log-likelihood over the training set. The minimization of the loss function is obtained via the stochastic gradient algorithm Adam [12], with a mini-batch size of 256 sequences. We use dropout [23] as a regularization method for each model.
Figure 1: Model Architectures. Each model has the same input (one-hot encoded matrix of the raw nucleotide inputs) and the same output (softmax classifier to make a binary prediction). The architectures differ by the middle "module", which is (a) Convolutional, (b) Recurrent, or (c) Convolutional-Recurrent.
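The one-hot input encoding shared by all three models can be sketched in a few lines. This is an independent Python illustration (the authors' implementation was in Torch), using the conventional A, C, G, T column order as an assumption:

```python
NUCLEOTIDES = ["A", "C", "G", "T"]

def one_hot_encode(sequence):
    """Map a DNA string to a T x 4 binary matrix: each row has a 1 in the
    column of its nucleotide and 0s elsewhere."""
    index = {c: i for i, c in enumerate(NUCLEOTIDES)}
    return [[1 if j == index[c] else 0 for j in range(4)] for c in sequence]

X = one_hot_encode("ACGT")  # 4 x 4 identity-like matrix, one row per character
```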
In genomic sequences, it is believed that regulatory mechanisms such as transcription factor binding are influenced by local sequential patterns known as "motifs". Motifs can be viewed as the temporal equivalent of spatial patterns in images, such as eyes on a face, which is what CNNs are able to automatically learn to achieve state-of-the-art results on computer vision tasks. As a result, a temporal convolutional neural network is a fitting model to automatically extract these motifs. A temporal convolution with filter (or kernel) size k takes an input data matrix X of size T × n_in, with length T and input layer size n_in, and outputs a matrix Z of size T × n_out, where n_out is the output layer size. Specifically, convolution(X) = Z, where

  z_{t,i} = σ( B_i + Σ_{j=1}^{n_in} Σ_{z=1}^{k} W_{i,j,z} x_{t+z-1,j} ),    (1)

where W and B are the trainable parameters of the convolution filter, and σ is a function enforcing element-wise nonlinearity. We use rectified linear units (ReLU) as the nonlinearity:

  ReLU(x) = max(0, x).    (2)

After the convolution and nonlinearity, CNNs typically use maxpooling, a dimension reduction technique that provides translation invariance and extracts higher-level features from a wider range of the input sequence. Temporal maxpooling on a matrix Z with a pooling size of m results in an output matrix Y. Formally, maxpool(Z) = Y, where

  y_{t,i} = max_{j=1,…,m} z_{m(t-1)+j, i}.    (3)

Our CNN implementation involves a progression of convolution, nonlinearity, and maxpooling. This is represented as one convolutional layer in the network, and we test up to 4-layer-deep CNNs. The final layer involves a maxpool across the entire temporal domain so that we have a fixed-size vector which can be fed into a softmax classifier. Figure 1 (a) shows our CNN model with two convolutional layers.
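Equations (1)-(3) can be sketched concretely as follows. This is a minimal pure-Python illustration with toy dimensions and made-up filter weights, not the authors' Torch implementation:

```python
def relu(v):
    return max(0.0, v)

def conv1d(X, W, B, k):
    """Temporal convolution per Eqs. (1)-(2): X is T x n_in, W is
    n_out x n_in x k, B has length n_out; returns (T-k+1) x n_out after ReLU."""
    T, n_in, n_out = len(X), len(X[0]), len(W)
    return [[relu(B[i] + sum(W[i][j][z] * X[t + z][j]
                             for j in range(n_in) for z in range(k)))
             for i in range(n_out)]
            for t in range(T - k + 1)]

def maxpool(Z, m):
    """Non-overlapping temporal maxpool per Eq. (3)."""
    return [[max(Z[m * t + j][i] for j in range(m)) for i in range(len(Z[0]))]
            for t in range(len(Z) // m)]

# Toy example: one filter of size 2 that responds to adjacent As (channel 0).
X = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]  # "AACA" one-hot
W = [[[1, 1], [0, 0], [0, 0], [0, 0]]]  # n_out=1, n_in=4, k=2
Z = conv1d(X, W, [0.0], k=2)  # strongest activation where two As are adjacent
Y = maxpool(Z, 2)
```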
The input one-hot encoded matrix is convolved with several filters (not shown) and fed through a ReLU nonlinearity to produce a matrix of convolution activations. We then perform a maxpool on the activation matrix. The output of the first maxpool is fed through another convolution, ReLU, and maxpooled across the entire length, resulting in a vector. This vector is then transposed and fed through a linear and softmax layer for classification.

Designed to handle sequential data, recurrent neural networks (RNNs) have become the main neural model for tasks such as natural language understanding. The key advantage of RNNs over CNNs is that they are able to find long-range patterns in the data which are highly dependent on the ordering of the sequence. Given an input matrix X of size T × n_in, an RNN produces a matrix H of size T × d, where d is the RNN embedding size. At each timestep t, an RNN takes an input column vector x_t ∈ R^{n_in} and the previous hidden state vector h_{t-1} ∈ R^d and produces the next hidden state h_t by applying the following recursive operation:

  h_t = σ( W x_t + U h_{t-1} + b ),    (4)

where W, U, and b are the trainable parameters of the model, and σ is an element-wise nonlinearity. Due to their recursive nature, RNNs can model the full conditional distribution of any sequential data and find dependencies over time, where each position in a sequence is a timestep on an imaginary time coordinate running in a certain direction. To handle the "vanishing gradients" problem of training basic RNNs on long sequences, Hochreiter and Schmidhuber [8] proposed an RNN variant called the Long Short-Term Memory (LSTM) network (for simplicity, we refer to LSTMs as RNNs in this paper), which can handle long-term dependencies by using gating functions. These gates can control when information is written to, read from, and forgotten.
Specifically, LSTM "cells" take inputs x_t, h_{t-1}, and c_{t-1}, and produce h_t and c_t:

  i_t = σ( W_i x_t + U_i h_{t-1} + b_i )
  f_t = σ( W_f x_t + U_f h_{t-1} + b_f )
  o_t = σ( W_o x_t + U_o h_{t-1} + b_o )
  g_t = tanh( W_g x_t + U_g h_{t-1} + b_g )
  c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
  h_t = o_t ⊙ tanh( c_t )

where σ(·), tanh(·), and ⊙ are the element-wise sigmoid, hyperbolic tangent, and multiplication functions, respectively, and i_t, f_t, and o_t are the input, forget, and output gates.

RNNs produce an output vector h_t at each timestep t of the input sequence. In order to use them for classification, we take the mean of all vectors h_t and use the mean vector h_mean ∈ R^d as input to the softmax layer. Since there is no innate direction in genomic sequences, we use a bi-directional LSTM as our RNN model. In the bi-directional LSTM, the input sequence is fed through two LSTM networks, one in each direction, and the output vectors of each direction are concatenated together and fed through a linear classifier. Figure 1 (b) shows our RNN model. The input one-hot encoded matrix is fed through an LSTM in both the forward and backward direction, each of which produces a matrix of column vectors representing the LSTM output embedding at each timestep. These vectors are then averaged to create one vector for each direction representing the LSTM output. The forward and backward output vectors are then concatenated and fed to the softmax for classification.

Considering that convolutional networks are designed to extract motifs and recurrent networks are designed to extract temporal features, we implement a combination of the two in order to find temporal patterns among the motifs. Given an input matrix X ∈ R^{T × n_in}, the output of the CNN is Z ∈ R^{T × n_out}. Each column vector of Z is fed into the RNN one at a time, in the same way that the one-hot encoded vectors are input to the regular RNN model.
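The LSTM cell update used in both the RNN and CNN-RNN modules can be sketched as follows. For brevity this is a scalar-dimension illustration with arbitrary made-up weights (real cells use vectors and weight matrices, and this is not the authors' Torch code); it also shows the mean pooling of hidden states described above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM cell update (scalar input and hidden state for brevity).
    p holds the weights (W_*, U_*, b_*) of the i, f, o, g transforms."""
    i = sigmoid(p["Wi"] * x_t + p["Ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["Wf"] * x_t + p["Uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["Wo"] * x_t + p["Uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["Wg"] * x_t + p["Ug"] * h_prev + p["bg"])  # candidate
    c = f * c_prev + i * g       # new cell state
    h = o * math.tanh(c)         # new hidden state
    return h, c

# Run a toy sequence through the cell and mean-pool the hidden states,
# as done before the softmax layer. Weights here are arbitrary stand-ins.
params = {k: 0.5 for k in
          ["Wi", "Ui", "bi", "Wf", "Uf", "bf", "Wo", "Uo", "bo", "Wg", "Ug", "bg"]}
h, c, hs = 0.0, 0.0, []
for x in [1.0, 0.0, 1.0]:
    h, c = lstm_step(x, h, c, params)
    hs.append(h)
h_mean = sum(hs) / len(hs)  # the pooled vector fed to the classifier
```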
The resulting output of the RNN, H ∈ R^{T × d} where d is the LSTM embedding size, is then averaged across the temporal domain (in the same way as in the regular RNN) and fed to a softmax classifier. Figure 1 (c) shows our CNN-RNN model. The input one-hot encoded matrix is fed through one layer of convolution to produce a convolution activation matrix. This matrix is then input to the LSTM, as done in the regular RNN model with the original one-hot matrix. The output of the LSTM is averaged, concatenated, and fed to the softmax, similar to the RNN.

Visualizing and Understanding Deep Models
The previous section explained the deep models we use for the TFBS classification task, where we can evaluate which models perform the best. While making accurate predictions is important in biomedical tasks, it is equally important to understand why models make their predictions. Accurate but uninterpretable models are often very slow to emerge in practice due to the inability to understand their predictions, making biomedical domain experts reluctant to use them. Consequently, we aim to obtain a better understanding of why certain models work better than others, and we investigate how they make their predictions by introducing several visualization techniques. The proposed DeMo Dashboard allows us to visualize and understand DNNs in three different ways: Saliency Maps, Temporal Output Scores, and Class Optimizations.
For a certain DNA sequence and a model's classification, a logical question is: "which parts of the sequence are most influential for the classification?" To answer this, we seek to visualize the influence of each position (i.e., nucleotide) on the prediction. Our approach is similar to the methods used on images by Simonyan et al. [21] and Baehrens et al. [4]. Given a sequence X of length |X| and a class c ∈ C, a DNN model provides a score function S_c(X). We rank the nucleotides of X based on their influence on the score S_c(X). Since S_c(X) is a highly non-linear function of X in deep neural nets, it is hard to directly see the influence of each nucleotide of X on S_c. Mathematically, around the point X, S_c(X) can be approximated by a linear function by computing the first-order Taylor expansion:

  S_c(X) ≈ w^T X + b = Σ_{i=1}^{|X|} w_i x_i + b,    (5)

where w is the derivative of S_c with respect to the sequence variable X at the point X:

  w = ∂S_c/∂X |_X = saliency map.    (6)

This derivative is simply one step of backpropagation in the DNN model and is therefore easy to compute. We do a pointwise multiplication of the saliency map with the one-hot encoded sequence to get the derivative values for the actual nucleotide characters of the sequence (A, T, C, or G), so we can see the influence of the character at each position on the output score. Finally, we take the element-wise magnitude of the resulting derivative vector to visualize how important each character is regardless of derivative direction. We call the resulting vector a "saliency map" [21] because it tells us which nucleotides need to be changed the least in order to affect the class score the most. As we can see from equation 5, the saliency map is simply a weighted sum of the input nucleotides, where each weight, w_i, indicates the influence of that nucleotide position on the output score.

Since DNA is sequential (i.e., can be read in a certain direction), it can be insightful to visualize the output scores at each timestep (position) of a sequence, which we call the temporal output scores. Here we assume an imaginary time direction running from left to right on a given sequence, so each position in the sequence is a timestep in this imagined time coordinate. In other words, we check the RNN's prediction scores as we vary its input. The input series is constructed from subsequences of an input X running along the imaginary time coordinate, starting from just the first nucleotide (position) and ending with the entire sequence X. This way we can see exactly where in the sequence the recurrent model changes its decision from negative to positive, or vice versa. Since our recurrent models are bi-directional, we also use the same technique on the reversed sequence. CNNs process the entire sequence at once, so we cannot view their output as a temporal sequence; we therefore use this visualization only on the RNN and CNN-RNN.

The previous two visualization methods are specific to a single test sample (i.e., sequence-specific). Now we introduce an approach to extract a class-specific visualization for a DNN model, where we attempt to find the best sequence that maximizes the probability of a positive TFBS, which we call class optimization. Formally, we optimize the following equation, where S_+(X) is the probability (or score) of an input sequence X (a matrix in our case) being a positive TFBS, computed by the softmax equation of our trained DNN model for a specific TF:

  arg max_X  S_+(X) + λ‖X‖²,    (7)

where λ is the regularization parameter. We find a locally optimal X through stochastic gradient descent, where the optimization is with respect to the input sequence. In this optimization, the model weights remain unchanged. This is similar to the method used in Simonyan et al. [21] to optimize toward a specific image class.
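Both the temporal output scores and the class optimization can be sketched with a generic score function standing in for a trained model. This is an illustrative pure-Python sketch, not the authors' code: the score function, step counts, and λ are made-up stand-ins; the penalty is written as a subtraction so the ascent is well-posed (the convention of Simonyan et al. [21]); and the real method differentiates the network by backpropagation rather than the finite differences used here for self-containment:

```python
def temporal_output_scores(score_fn, sequence):
    """Score every prefix of the input, from the first nucleotide alone up to
    the full sequence, to see where the model's decision changes sign."""
    return [score_fn(sequence[:t]) for t in range(1, len(sequence) + 1)]

def class_optimize(score_fn, X0, steps=300, lr=0.1, lam=0.01, eps=1e-4):
    """Gradient ascent on the input (model weights fixed) of the
    l2-regularized class score, using a numerical gradient for brevity."""
    def objective(X):
        return score_fn(X) - lam * sum(x * x for x in X)
    X = list(X0)
    for _ in range(steps):
        base = objective(X)
        grad = []
        for i in range(len(X)):
            X_step = list(X)
            X_step[i] += eps
            grad.append((objective(X_step) - base) / eps)
        X = [x + lr * g for x, g in zip(X, grad)]  # ascend the objective
    return X
```

For instance, with a toy concave score, the optimizer recovers (approximately) the regularized maximizer of the score, illustrating how the class-optimized input is a property of the score function alone, not of any test sequence.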
This visualization method depicts the notion of a positive TFBS class for a particular TF and is not specific to any test sequence.

Our three proposed visualization techniques allow us to manually inspect how the models make their predictions. In order to automatically find patterns from these techniques, we also propose methods to extract motifs, or consensus subsequences that represent the positive binding sites. We extract motifs from each of our three visualization methods in the following ways: (1) From each positive test sequence (thus, 500 total for each TF dataset) we extract a motif from the saliency map by selecting the contiguous length-9 subsequence that achieves the highest sum of saliency map values. (2) For each positive test sequence, we extract a motif from the temporal output scores by selecting the length-9 subsequence that shows the strongest score change from negative to positive output score. (3) For each different TF, we directly use the class-optimized sequence as a motif.
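Methods (1) and (2) above amount to sliding a length-9 window over per-position scores. A sketch, assuming the saliency values and temporal output scores have already been computed; the window-selection rule for (2), taking the window ending at the largest single-step score jump, is one plausible reading of the description rather than the authors' exact rule:

```python
def saliency_motif(sequence, saliency, k=9):
    """Method (1): the length-k subsequence whose contiguous saliency-map
    values have the highest sum."""
    best_start = max(range(len(sequence) - k + 1),
                     key=lambda s: sum(saliency[s:s + k]))
    return sequence[best_start:best_start + k]

def temporal_motif(sequence, temporal_scores, k=9):
    """Method (2), one plausible reading: the length-k window ending at the
    position with the largest one-step jump in the temporal output score.
    temporal_scores[t] is the model's score on the prefix of length t+1."""
    jumps = {t: temporal_scores[t] - temporal_scores[t - 1]
             for t in range(k, len(sequence))}
    end = max(jumps, key=jumps.get)
    return sequence[end - k + 1:end + 1]
```

Usage on a toy sequence with a planted 9-character motif: if the saliency map peaks over the motif, or the temporal score flips from negative to positive as the motif is read, both extractors recover it.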
Neural networks have produced state-of-the-art results on several important benchmark tasks related to genomic sequence classification [2, 27, 19], making them a good candidate to use. However, why these models work well has been poorly understood. Recent works have attempted to uncover the properties of these models, and most of this work has been done on understanding image classifications using convolutional neural networks. Zeiler and Fergus [26] used a "deconvolution" approach to map hidden layer representations back to the input space for a specific example, showing the features of the image which were important for classification. Simonyan et al. [21] explored a similar approach by using a first-order Taylor expansion to linearly approximate the network and find the most relevant input features, and also tried optimizing image classes. Many similar techniques later followed to understand convolutional models [17, 3]. Most importantly, researchers have found that CNNs are able to extract layers of translation-invariant feature maps, which may indicate why CNNs have been successfully used in genomic sequence predictions, which are believed to be triggered by motifs. On text-based tasks, there have been fewer visualization studies for DNNs. Karpathy et al. [10] explored the interpretability of RNNs for language modeling and found that there exist interpretable neurons which are able to focus on certain language structures such as quotes. Li et al. [15] visualized how RNNs achieve compositionality in natural language for sentiment analysis by visualizing RNN embedding vectors as well as measuring the influence of input words on classification. Both studies show examples that can be validated by our understanding of natural language linguistics.
In contrast, we are interested in understanding DNA "linguistics" given DNNs (the opposite direction of Karpathy et al. [10] and Li et al. [15]). The main difference between our work and previous work on images and natural language is that instead of trying to understand DNNs given human understanding of such perception tasks, we attempt to uncover critical signals in DNA sequences given our understanding of DNNs.

For TFBS prediction, Alipanahi et al. [2] were the first to implement a visualization method on a DNN model. They visualize their CNN model by extracting motifs based on the input subsequence corresponding to the strongest activation location of each convolutional filter (which we call convolution activation). Since they only have one convolutional layer, it is trivial to map the activations back to the input, but this method does not work as well with deeper models. We attempted this technique on our models and found that our approach using saliency maps outperforms it in finding motif patterns (details in Section 4). Quang and Xie [19] use the same visualization method on their convolutional-recurrent model for noncoding variant prediction.
In order to evaluate our DNN models and visualizations, we train and test on the 108 K562-cell ENCODE ChIP-seq TF datasets used in Alipanahi et al. [2]. Each TF dataset has an average of 30,819 training sequences (with an even positive/negative split), and each sequence consists of 101 DNA-base characters (A, C, G, T). Every dataset has 1,000 testing sequences (with an even positive/negative split). Positive sequences are extracted from the hg19 genome, centered at the reported ChIP-seq peak. Negative sequences are generated by a dinucleotide-preserving shuffle of the positive sequences. Due to the separate train/test data for each TF, we train a separate model for each individual TF dataset.

Table 1: Variations of DNN Model Hyperparameters

Model           Conv. Layers  Conv. Size (n_out)  Conv. Filter Sizes (k)  Conv. Pool Size (m)  LSTM Layers  LSTM Size (d)
Small RNN       N/A           N/A                 N/A                     N/A                  1            16
Medium RNN      N/A           N/A                 N/A                     N/A                  1            32
Large RNN       N/A           N/A                 N/A                     N/A                  2            32
Small CNN       2             64                  9,5                     2                    N/A          N/A
Medium CNN      3             64                  9,5,3                   2                    N/A          N/A
Large CNN       4             64                  9,5,3,3                 2                    N/A          N/A
Small CNN-RNN   1             64                  5                       N/A                  2            32
Medium CNN-RNN  1             128                 9                       N/A                  1            32
Large CNN-RNN   2             128                 9,5                     2                    1            32

Table 2: Mean AUC scores on the TFBS classification task

Model               Mean AUC  Median AUC  STDEV
MEME-ChIP [16]      0.834     0.868       0.127
DeepBind [2] (CNN)  0.903     0.931       0.091
Small RNN           0.860     0.881       0.106
Med RNN             0.876     0.905       0.116
Large RNN           0.808     0.860       0.175
Small CNN           0.896     0.918       0.098
Med CNN             0.902     0.922       0.085
Large CNN           0.880     0.890       0.093
Small CNN-RNN       0.917     0.943       0.079
Med CNN-RNN
Large CNN-RNN       0.918     0.944       0.081

Table 3: AUC pairwise t-test

Model Comparison  p-value
RNN vs MEME       5.15E-05
CNN vs MEME       1.87E-19
CNN-RNN vs MEME   4.84E-24
CNN vs RNN        5.08E-04
CNN-RNN vs RNN    7.99E-10
CNN-RNN vs CNN    4.79E-22
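The two summary statistics reported above can be computed as follows. This is an illustrative pure-Python sketch, not the authors' evaluation code; it uses the rank (Mann-Whitney) formulation of AUC, and a complete t-test would also convert the t statistic to a two-tailed p-value via the t distribution:

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank formulation: the probability
    that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_t_statistic(auc_a, auc_b):
    """t statistic of a paired t-test over per-TF AUC scores of two models."""
    diffs = [a - b for a, b in zip(auc_a, auc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / (var / n) ** 0.5
```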
Variations of DNN Models.
We implement several variations of each DNN architecture by varying hyperparameters. Table 1 shows the different hyperparameters in each architecture. We trained many different hyperparameter settings for each architecture, but we show the best performing model of each type, surrounded by a larger and a smaller version to show that it is neither underfitting nor overfitting.
Baselines.
We use the "MEME-ChIP [16] sum" results from Alipanahi et al. [2] as one prediction performance baseline. These results are from applying MEME-ChIP to the top 500 positive training sequences, deriving five PWMs, and scoring test sequences using the sum of scores across all five PWMs. We also compare against the CNN model proposed in Alipanahi et al. [2]. To evaluate motif extraction, we compare against the "convolution activation" method used in Alipanahi et al. [2] and Quang and Xie [19], where we map the strongest first-layer convolution filter activation back to the input sequence to find the most influential length-9 subsequence.
Table 2 shows the mean area under the ROC curve (AUC) scores for each of the tested models (from Table 1). As expected, the CNN models outperform the standard RNN models. This validates our hypothesis that positive binding sites are mainly triggered by local patterns, or "motifs", that CNNs can easily find. Interestingly, the CNN-RNN achieves the best performance among the three deep architectures. To check the statistical significance of these comparisons, we apply a pairwise t-test using the AUC scores for each TF and report the two-tailed p-values in Table 3. We apply the t-test to the best performing (by AUC) model of each model type. All deep models are significantly better than the MEME baseline. The CNN is significantly better than the RNN, and the CNN-RNN is significantly better than the CNN. In order to understand why the CNN-RNN performs the best, we turn to the dashboard visualizations.

Figure 2: DeMo Dashboard. Dashboard examples for GATA1, MAFK, and NFYB positive TFBS sequences. The top section of the dashboard contains the Class Optimization (which does not pertain to a specific test sequence, but rather to the class in general). The middle section contains the Saliency Maps for a specific positive test sequence, and the bottom section contains the Temporal Output Scores for the same positive test sequence used in the saliency map. The very top contains known JASPAR motifs, which are highlighted by pink boxes in the test sequences if they contain the motifs.

Table 4: JASPAR motif matches against DeMo Dashboard and baseline motif finding methods using Tomtom
Model     Saliency Map (out of 500)  Conv. Activations [2, 19] (out of 500)  Temporal Output (out of 500)  Class Optimization (out of 57)
CNN
RNN
CNN-RNN
To evaluate the dashboard visualization methods, we first manually inspect the visualizations to look for interpretable signals. Figure 2 shows examples of the DeMo Dashboard for three different TFs and positive TFBS sequences. We apply the visualizations on the best performing models of each of the three DNN architectures. Each dashboard snapshot is for a specific TF and contains (1) JASPAR [18] motifs for that TF, which are the "gold standard" motifs generated by biomedical researchers, (2) the positive TFBS class-optimized sequence for each architecture (for the given TF of interest), (3) the positive TFBS test sequence of interest, where the JASPAR motifs in the test sequences are highlighted using a pink box, (4) the saliency map from each DNN model on the test sequence, and (5) forward and backward temporal output scores from the recurrent architectures on the test sequence. In the saliency maps, the more red a position is, the more influential it is for the prediction. In the temporal outputs, blue indicates a negative TFBS prediction while red indicates positive. The saliency map and temporal output visualizations are on the same positive test sequence (which is shown twice). The numbers next to the model names in the saliency map section indicate the score outputs of that DNN model on the specified test sequence.
Saliency Maps (middle section of dashboard).
By visual inspection, we can see from the saliency maps that CNNs tend to focus on short contiguous subsequences when predicting positive bindings. In other words, CNNs clearly model "motifs" that are the most influential for prediction. The saliency maps of RNNs tend to be spread out more across the entire sequence, indicating that they focus on all nucleotides together and infer relationships among them. The CNN-RNNs have strong saliency map values around motifs, but we can also see that there are other nucleotides further away from the motifs that are influential for the model's prediction. For example, the CNN-RNN model is 99% confident in its GATA1 TFBS prediction, but the prediction is also influenced by nucleotides outside the motif. In the MAFK saliency maps, we can see that the CNN-RNN and RNN focus on a very wide range of nucleotides to make their predictions, and the RNN does not even focus on the known JASPAR motif to make its high-confidence prediction.
Temporal Output Scores (bottom section of dashboard).
For most of the sequences that we tested, the positions that trigger the model to switch from a negative TFBS prediction to positive are near the JASPAR motifs. We did not observe clear differences between the forward and backward temporal output patterns. In certain cases, it is interesting to look at the temporal output scores and saliency maps together. An important case study from our examples is the NFYB example, where the CNN and RNN perform poorly but the CNN-RNN makes the correct prediction. We observe that the CNN-RNN is able to switch its classification from negative to positive, while the RNN never does. To understand why this may have happened, we can see from the saliency maps that the CNN-RNN focuses on two distinct regions, one of which is where it flips its classification from negative to positive. However, the RNN does not focus on either of these areas, which may be the reason why it is never able to classify the sequence as positive. The fact that the CNN is not able to classify it as a positive sequence, yet focuses on the same regions as the CNN-RNN (from the saliency map), may indicate that it is the temporal dependencies between these regions which influence the binding. In addition, the fact that there is no clear JASPAR motif in this sequence may show that the traditional motif approach is not always the best way to model TFBSs.
Class Optimization (top section of dashboard).
Class optimization on the CNN model generates concise representations which often resemble the known motifs for that particular TF. For the recurrent models, the TFBS-positive optimizations are less clear, though some aspects stand out (such as "AT" followed by "TC" in the GATA1 TF for the CNN-RNN). We notice that for certain DNN models, the class-optimized sequences optimize the reverse-complement motif (e.g., the NFYB CNN optimization). The class optimizations can be useful for getting a general idea of what triggers a positive TFBS prediction for a certain TF.
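The class-optimization idea can be sketched as gradient ascent on a real-valued input that maximizes the class score minus an L2 penalty, following Simonyan et al. [21]. As before, an illustrative linear scorer replaces the trained DNN (the "GATA" weights and all hyperparameters are made up for the example):

```python
import math

NUC = "ACGT"

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def class_optimize(W, steps=200, lr=0.5, lam=0.1):
    """Gradient ascent on a real-valued L x 4 input X to maximize
    sigmoid(<W, X>) - lam * ||X||^2, starting from zeros."""
    L = len(W)
    X = [[0.0] * 4 for _ in range(L)]
    for _ in range(steps):
        z = sum(W[i][j] * X[i][j] for i in range(L) for j in range(4))
        g = sigmoid(z) * (1.0 - sigmoid(z))  # d score / d z
        for i in range(L):
            for j in range(4):
                X[i][j] += lr * (g * W[i][j] - 2.0 * lam * X[i][j])
    return X

# Illustrative weights rewarding "GATA" at positions 2..5.
W = [[0.0] * 4 for _ in range(8)]
for i, c in enumerate("GATA", start=2):
    W[i][NUC.index(c)] = 2.0

X = class_optimize(W)
# Read the optimized input back as its per-position argmax nucleotide.
consensus = "".join(NUC[max(range(4), key=lambda j: X[i][j])] for i in range(8))
print(consensus)
```

Note that positions with zero weight receive no gradient and stay at zero, so their argmax is arbitrary; in real models every position gets some gradient, which is one reason the recurrent optimizations above look noisier than the CNN ones.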
Automatic Motif Extraction from Dashboard.
In order to evaluate each DNN's capability to automatically extract motifs, we compare the motifs found by each method (introduced in section 3.4) to the corresponding JASPAR motif for the TF of interest. We perform the comparison using the Tomtom [7] tool, which searches a query motif against a given motif database (and its reverse complements) and returns significant matches ranked by a p-value indicating motif-motif similarity. Table 4 summarizes the motif matching results comparing visualization-derived motifs against known motifs in the JASPAR database. We are limited to 57 of our 108 TF datasets, those for which JASPAR has motifs. We compare four visualization approaches: Saliency Map, Convolution Activation [2, 19], Temporal Output Scores, and Class Optimization. The first three techniques are sequence-specific, so we report the average number of motif matches out of 500 positive sequences (then averaged across the 57 TF datasets). The last technique applies to a particular TFBS-positive class.

We can see from Table 4 that across multiple visualization techniques, the CNN finds motifs best, followed by the CNN-RNN and the RNN. However, since CNNs perform worse than CNN-RNNs by AUC score, we hypothesize that it is also important to model sequential interactions among motifs: in the CNN-RNN combination, the CNN acts as a "motif finder" and the RNN finds dependencies among motifs. This analysis shows that visualizing DNN classifications can lead to a better understanding of DNNs for TFBSs.
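One simple way to turn per-sequence saliency maps into a candidate motif, sketched below, is to take the highest-scoring contiguous window from each positive sequence and accumulate the selected subsequences into a position count matrix that a tool like Tomtom can compare against JASPAR. This is a schematic stand-in, not necessarily the exact procedure of section 3.4; the sequences and saliency values here are fabricated for illustration.

```python
NUC = "ACGT"

def top_window(saliency, k):
    """Start index of the length-k window with the largest total saliency."""
    return max(range(len(saliency) - k + 1),
               key=lambda i: sum(saliency[i:i + k]))

def count_matrix(seqs, saliencies, k):
    """Accumulate the top saliency windows into a k x 4 position count
    matrix (a crude position weight matrix after normalization)."""
    counts = [[0] * 4 for _ in range(k)]
    for seq, sal in zip(seqs, saliencies):
        i = top_window(sal, k)
        for p, c in enumerate(seq[i:i + k]):
            counts[p][NUC.index(c)] += 1
    return counts

# Toy data: two positive sequences whose (fabricated) saliency peaks
# over an embedded "GATA" at different offsets.
seqs = ["CCGATACC", "TTTGATAC"]
sals = [[0, 0, 5, 5, 5, 5, 0, 0],
        [0, 0, 0, 4, 4, 4, 4, 0]]
counts = count_matrix(seqs, sals, k=4)
print(counts)
```

Normalizing each row of `counts` gives per-position nucleotide frequencies, the format in which query motifs are typically handed to Tomtom.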
Deep neural networks (DNNs) have been shown to be the most accurate models for TFBS classification. However, DNN models are hard to interpret, and thus their adoption in practice is slow. In this work, we propose the Deep Motif (DeMo) Dashboard to explore three different DNN architectures on TFBS prediction, and introduce three visualization methods to shed light on how these models work. Although our visualization methods still require a human practitioner to examine the dashboard, they are a start toward understanding these models, and we hope that this work will invoke further studies on visualizing and understanding DNN-based genomic sequence analysis. Furthermore, DNN models have recently been shown to provide excellent results for epigenomic analysis [22]. We plan to extend our DeMo Dashboard to such related applications.
References

[1] Dashboard definition. Accessed: 2016-07-20.
[2] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831–838, 2015.
[3] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.
[4] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11:1803–1831, 2010.
[5] ENCODE Project Consortium et al. An integrated encyclopedia of DNA elements in the human genome. Nature, 489:57–74, 2012.
[6] Mahmoud Ghandi, Dongwon Lee, Morteza Mohammad-Noori, and Michael A Beer. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Computational Biology, 2014.
[7] Shobhit Gupta, John A Stamatoyannopoulos, Timothy L Bailey, and William S Noble. Quantifying similarity between motifs. Genome Biology, 8(2):R24, 2007.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[9] Paul B Horton and Minoru Kanehisa. An assessment of neural network and statistical approaches for prediction of E. coli promoter sites. Nucleic Acids Research, 20:4331–4338, 1992.
[10] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. Visualizing and understanding recurrent networks. arXiv preprint, 2015.
[11] David R Kelley, Jasper Snoek, and John L Rinn. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research, 2016.
[12] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[14] Jack Lanchantin, Ritambhara Singh, Zeming Lin, and Yanjun Qi. Deep motif: Visualizing genomic sequence classifications. arXiv preprint, 2016.
[15] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in NLP. arXiv preprint, 2015.
[16] Philip Machanick and Timothy L Bailey. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics, 27:1696–1697, 2011.
[17] Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, pages 1–23. Springer, 2016.
[18] Anthony Mathelier, Oriol Fornes, David J Arenillas, Chih-yu Chen, Grégoire Denay, Jessica Lee, Wenqiang Shi, Casper Shyr, Ge Tan, Rebecca Worsley-Hunt, et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Research, page gkv1176, 2015.
[19] Daniel Quang and Xiaohui Xie. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. bioRxiv, page 032821, 2015.
[20] Manu Setty and Christina S Leslie. SeqGL identifies context-dependent binding signals in genome-wide regulatory element maps. PLoS Computational Biology, 2015.
[21] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint, 2013.
[22] Ritambhara Singh, Jack Lanchantin, Gabriel Robins, and Yanjun Qi. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics, 32(17):i639–i648, 2016.
[23] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[24] Gary D Stormo. DNA binding sites: representation and discovery. Bioinformatics, 16:16–23, 2000.
[25] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[26] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
[27] Jian Zhou and Olga G Troyanskaya. Predicting effects of noncoding variants with deep learning-based sequence model. Nature Methods, 12(10):931–934, 2015.