Motif Identification using CNN-based Pairwise Subsequence Alignment Score Prediction
Ethan Moyer
School of Biomedical Engineering, Science and Health Systems, Drexel University
Philadelphia, PA
https://orcid.org/0000-0002-8023-3810
Anup Das
College of Engineering, Drexel University
Philadelphia, PA
https://orcid.org/0000-0002-5673-2636
Abstract—A common problem in bioinformatics is identifying gene regulatory regions marked by relatively high frequencies of motifs, or deoxyribonucleic acid sequences that often code for transcription and enhancer proteins. Predicting alignment scores between subsequence k-mers and a given motif enables the identification of candidate regulatory regions in a gene, which correspond to the transcription of these proteins. We propose a one-dimensional (1-D) Convolution Neural Network trained on k-mer formatted sequences interspaced with the given motif pattern to predict pairwise alignment scores between the consensus motif and subsequence k-mers. Our model consists of fifteen layers with three rounds of a one-dimensional convolution layer, a batch normalization layer, a dense layer, and a 1-D maximum pooling layer. We train the model using mean squared error loss on four different data sets, each with a different motif pattern randomly inserted in DNA sequences: the first three data sets have zero, one, and two mutations applied on each inserted motif, and the fourth data set represents the inserted motif as a position-specific probability matrix. We use a novel proposed metric, S_α, based on the Jaccard Index, in order to evaluate the model's performance, and we evaluate our model with 10-fold cross validation. Using S_α, we measure the accuracy of the model by identifying the 15 highest-scoring 15-mer indices of the predicted scores that agree with those of the actual scores within a selected α region. For the best performing data set, our results indicate that on average 99.3% of the top 15 motifs were identified correctly within a one base pair stride (α = 1) in the out-of-sample data. To the best of our knowledge, this is a novel approach that illustrates how data formatted in an intelligent way can be extrapolated using machine learning.

Index Terms—Motif Finding, Convolution Neural Network, Pairwise Sequence Alignment
I. INTRODUCTION

Measuring the similarity of two sequences is a well-known problem called sequence alignment. This topic includes a vast category of methods for identifying regions of high similarity in biological sequences, such as those in deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and protein [7]. Specifically, DNA pairwise sequence alignment (PSA) methods are concerned with finding the best arrangement of two DNA sequences. Some historically notable dynamic programming PSA methods are the Needleman-Wunsch (NW) algorithm for global alignment [1] and the Smith-Waterman (SW) algorithm for local alignment [2]. The main difference
between global and local alignment is related to the difference in length of the two sequences: global alignment attempts to find the highest-scoring end-to-end alignment between two sequences of approximately the same length, and local alignment searches for local regions of high similarity between two sequences with different lengths [8]. Figure 1 shows this difference between local and global DNA alignment with two sequences aligned in a 5' (i.e., five prime) to 3' direction. In molecular biology, this orientation refers to the directionality of the carbon backbone in DNA. The top subfigure displays global alignment, where a query sequence is aligned end-to-end with a reference. The bottom subfigure displays local alignment, where a short query sequence is most optimally aligned with a longer reference sequence. This latter alignment displays how the query sequence is approximately equal to a subsequence of the reference sequence.

Fig. 1: Local vs. Global Alignment. In general, DNA is composed of a permutation of the four nucleotides [adenine (A), thymine (T), cytosine (C), guanine (G)] and an ambiguous base (N).

In this way, local alignment methods recognize approximate subsequence matches of a query sequence with respect to a given reference sequence. One common paradigm utilizing local alignment is to examine similarities between a query sequence and specific k-long subsequences in a given gene, known as k-mers, found within the reference sequence. Traditional local alignment algorithms calculate these scores between the query sequence and each k-mer in the reference sequence. The aim of this research is to identify where the most likely subsequence matches of the query sequence occur in each reference sequence using machine learning methods. One such type of query sequence that is of high biological significance is a sequence motif: a short, recurring subsequence of DNA [5].
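The k-mer decomposition described above can be made concrete with a short Python sketch (the function name is illustrative, not from the paper):

```python
def overlapping_kmers(sequence, k=15):
    """Return all overlapping k-mers of a sequence (stride of one base)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A 1,000 bp reference sequence yields 1000 - 15 + 1 = 986 overlapping
# 15-mers, matching the count used throughout this work.
kmers = overlapping_kmers("A" * 1000, k=15)
```

A local alignment score against the query motif is then computed for each of these k-mers in turn.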
Therefore, this research follows the ability of machine learning methods to gauge the relative enrichment of various representations of motifs (or motif patterns) in independent reference sequences. More specifically, the efficacy of identifying motif enrichment in sequences is explored using a one-dimensional (1-D) convolution neural network (CNN).

Four different data sets are generated, each with a different motif pattern randomly inserted in approximately 10,000 reference sequences: the first three data sets have zero, one, and two mutations applied on each inserted motif, and the fourth data set represents the inserted motif as a position-specific probability matrix (PPM). In this data structure, each nucleotide position corresponds to a frequency of nucleotides [22]. These distinct motif patterns help display how the CNN model can recognize subsequence matches with exact, inexact, and probabilistic motifs. Each sample in a given data set consists of artificial sequences enriched with a given motif pattern at a frequency between five and fifteen occurrences per 1,000 base pairs (bp). These samples are split into 986 overlapping 15-mers, each with a corresponding local alignment score calculated by the BioPython aligner [20]. These scores are then predicted using a CNN with 10-fold cross validation. In order to measure the performance of the model, the average out-of-sample mean squared error (MSE), R2, and accuracy scores are reported.

While the MSE of the model trained on each data set is not representative of the model's effectiveness, the Jaccard Index and S_α, a novel modified version of the Jaccard Index, are better suited to capture the accuracy of the model. The standard MSE is not suitable for this problem because it inherently only displays differences between predicted and actual values. Since our aim is to locate the highest-scoring 15-mers, we need a metric that determines at which positions they occur and with what accuracy (see subsection V-A).
This new metric, S_α, measures the degree of similarity between two sets where each pair of elements can differ by at most α. Because of the plateauing nature of this metric as seen in each data set and the risks involved in increasing α, only S_0 to S_5 are reported.

In implementing this new metric, the accuracy of the model increases dramatically across all four data sets compared to the Jaccard Index. This indicates that while the model is not able to identify the highest-scoring k-mers exactly, it is able to accurately identify their local region. As expected, the model's accuracy is far higher for the data sets with relatively simple inserted motif patterns (non-probabilistic consensus motifs) compared to that of the data set with more complex inserted motif patterns, such as a consensus PPM.

II. BACKGROUND
Clusters of motifs across a genome strongly correlate to gene regulatory regions [18]. These regions are especially important for motif enrichment analysis, where known motifs are identified in the regulatory sequence of a gene in order to determine which proteins (transcription factors and enhancers) control its transcription [6] [19]. Motif enrichment analysis is only relevant given that the regulatory region of a gene is known; otherwise, the sequence under study may be from a non-coding region of an organism's genome or an untranslated region of a gene [9]. Given that the regulatory region of a gene is unknown, one frequently used approach to identifying it is to first locate sequences enriched with highly conserved motifs. Fortunately, many motifs that have been discovered are common amongst genes serving a similar role across organisms, such as a negative regulatory region for eukaryotes [10]. Finding these conserved motifs may facilitate the identification of the regulatory regions in a gene. For that reason, identifying the exact or relative positions of a given motif in a gene or sequence is a relevant inquiry in the process of classifying candidate regulatory regions of a gene.

A software toolkit known as MEME Suite includes three different methods for motif-sequence searching [23]: FIMO (Find Individual Motif Occurrences) [21], GLAM2SCAN (Gapped Local Alignment of Motifs SCAN) [24], and MAST (Motif Alignment and Search Tool) [25].

FIMO focuses on scanning both DNA and protein sequences for a given motif represented as a PPM. This software tool calculates the log-likelihood ratio score, p-value, and q-value (false discovery rate) for each subsequence position in a sequence database [21].

Typically, GLAM2SCAN performs a Waterman-Eggert local alignment between motifs found by GLAM2, its companion motif-finding algorithm, and a sequence database.
These local alignment scores are generated from an aligner programmed with position-specific residue scores, deletion scores, and insertion scores returned from the GLAM2 algorithm. The n highest alignments are returned to the user [24].

MAST locates the n highest-scoring subsequences with respect to a motif described as a position-specific score matrix. Using the QFAST algorithm, MAST calculates the p-value of a group of motif matches. This is accomplished by first finding the p-value of each match ('position p-value') and normalizing it for the length of the motif ('sequence p-value'). Then each of these normalized p-values is multiplied together to find the statistical significance across all located motifs in the database ('combined p-value') [25].

III. DATA ANALYSIS & CURATION
A single data set contains approximately 10,000 randomly generated DNA sequences, each 1,000 bp long. The number of samples varies slightly from one data set to another due to some inconsistencies that are removed in preprocessing. A 15-mer motif is inserted into each sample anywhere from five to fifteen times. Four separate data sets of this structure are created where a different motif pattern is inserted randomly into each sequence. The first three data sets have zero, one, and two mutations applied on each inserted motif. These mutations are applied in order to determine whether the proposed model has the potential to identify consensus motifs and non-exact consensus motifs across many sequences. Since motifs mostly exist as profiles where each base pair position corresponds to a frequency table of nucleotides, the fourth data set is created where the inserted motifs are based off of a PPM [11].

Equation 1 is used to calculate the PPM, indicated by matrix M, given a set of candidate motifs, or sequences that are thought to be from the same motif PPM. This equation counts the number of occurrences of each nucleotide in set γ for each nucleotide position across all motifs, where γ = {A, T, C, G}; I ∈ {0, 1} represents an indicator function, where I(x = γ) is 1 if x = γ and 0 otherwise; i ∈ (1, ..., L), where L is the length of each motif; and j ∈ (1, ..., N), where N is the number of motifs.

M(γ, i) = (1/N) Σ_{j=1}^{N} I(X_{j,i} = γ)   (1)

In order to apply Equation 1 on candidate motifs, the DNA sequence data must be formatted as nucleotide position counts, as shown in Figure 2. This figure illustrates the conversion of a list of candidate motifs to the count matrix M and then to the PPM using Equation 1. While Figure 2 displays this process for five 10-mers (TACAGAGTTG, CCATAGGCGT, TGAACGCTAC, ACGGACGATA, CGAATTTACG), the fourth data set in this work relies on profiles built from ten 15-mers.

Fig. 2: The conversion of five candidate subsequence motifs to a PPM using Equation 1.

IV. FEATURE & OUTPUT SELECTION
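As a concrete illustration of Equation 1 from the previous section, the PPM construction can be sketched in NumPy (a minimal sketch; the five 10-mers are the candidate motifs from Figure 2, and the function name is illustrative):

```python
import numpy as np

def position_probability_matrix(motifs):
    """Build the count matrix M (one row per nucleotide, one column per
    position) and normalize by the number of motifs N, per Equation 1."""
    alphabet = "ATCG"
    length = len(motifs[0])
    counts = np.zeros((len(alphabet), length))
    for motif in motifs:
        for i, base in enumerate(motif):
            counts[alphabet.index(base), i] += 1
    return counts / len(motifs)

# The five candidate 10-mers shown in Figure 2.
motifs = ["TACAGAGTTG", "CCATAGGCGT", "TGAACGCTAC", "ACGGACGATA", "CGAATTTACG"]
ppm = position_probability_matrix(motifs)
# Each column of the PPM sums to 1 by construction.
```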
In order to format the sequence data into a structure that is both recognizable and meaningful to a CNN, we first split each sequence into a list of overlapping 15-mers. Next, we generate a one-hot encoding for each nucleotide in the 15-mers. The resulting feature set is composed of 60 values. Figure 3 displays this process using a small subsequence example formatted as 4-mers.

Fig. 3: DNA subsequence k-mer formatting by one-hot encoding nucleotides.

To obtain the target values, each of these 15-mers is pairwise aligned with the consensus motif for the given data set motif pattern using the SW algorithm. Given two sequences, a of length n and b of length m, this algorithm begins by defining an (n + 1) by (m + 1) matrix H. The first column and first row are assigned 0, and the following recurrence relation is applied to assign the rest of the values in H:

H(i, j) = max{ H(i − 1, j − 1) + σ(a_i, b_j), H(i, j − 1) + W, H(i − 1, j) + W, 0 }

where W is a gap score and σ is a score matrix such that

σ(a_i, b_j) = +1 if a_i = b_j, −1 if a_i ≠ b_j

In the case when a_i = b_j, σ returns a match score of +1, and in the case when a_i ≠ b_j, σ returns a mismatch score of −1. The gap score, W, is assigned a negative penalty. The match, mismatch, and gap scores can be configured for different alignments. These parameters are used because they are the most optimal for this type of local alignment [4]. Once H is assigned its values, the best alignment is obtained by finding the maximum value in H and tracing back the matrix elements that led up to this maximum. In this way, the maximum value in H defines the optimal path in H for the best alignment between sequences a and b [2]. The calculated alignment scores are normalized based on the maximum alignment score in each sample.

V. METHODS
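As a concrete reference for the target-score computation described in the previous section, the SW recurrence can be sketched in pure Python (a minimal sketch under stated assumptions: the paper itself uses the BioPython aligner, and the default gap value here is illustrative):

```python
def smith_waterman_score(a, b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Return the best local alignment score between sequences a and b
    using the Smith-Waterman recurrence with a linear gap penalty."""
    n, m = len(a), len(b)
    # (n+1) x (m+1) matrix H with the first row and column set to 0.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + sigma,   # match/mismatch
                          H[i][j - 1] + gap,          # gap in a
                          H[i - 1][j] + gap,          # gap in b
                          0.0)                        # restart alignment
            best = max(best, H[i][j])
    return best
```

In this work, each overlapping 15-mer is scored against the consensus motif this way, and the scores within a sample are then normalized by the sample's maximum score.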
A. CNN Model Evaluation
Although the MSE loss function is effective at penalizing large differences between predicted and target values, such as outliers in the data, it does not successfully represent the predictive power of the model given the scope of the problem [14]. In the data, the target value from each sample ranges from zero to one. This range already generates an inherently small MSE. Even when the MSE for each sample is normalized, the metric is overshadowed by the overwhelming majority of the predicted values that were approximately equal to the global mean of each sample. In other words, the MSE as a metric does not capture the correct information pertaining to the five to fifteen inserted motif patterns in each sample due to a large unequal distribution of such scores that deviate from the global mean. This problem is analogous to that of an unequal class distribution in a classification problem.

The goal of the model is to score the CNN based on its ability to locate the 15 highest-scoring 15-mers, because we inserted a motif pattern at most 15 times into a single sample. Since this network deals with continuous values instead of discrete classes, initially we cannot be certain of the 15-mer to which a 15-mer score at any index i corresponds. However, a higher-scoring 15-mer has a greater probability of corresponding to that of a motif, whereas the lower-scoring 15-mers carry little information. This is due to the fact that each score in the data is generated from a local alignment between a 15-mer and the given consensus motif. In this way, only the 15 highest-scoring 15-mers are of interest. As previously mentioned, there is an unequal distribution between the number of scores corresponding to that of each inserted motif and the global mean of each sample. Using these observations, we rationalize that we only have to examine the 15 highest-scoring indices.

This generality that the 15 highest-scoring indices correspond to the inserted motif patterns is further supported by the notion that the probability of observing a random 15-mer exactly equal or similar to the inserted motifs is relatively low. Thus, the indices of the predicted 15 highest-scoring 15-mers inherently hold information about the position of possible inserted motif patterns, because it is at these indices that the local alignment is conducted. Due to the low likelihood of observing a false positive (when a 15-mer is identified as a motif but in actuality is not one), we create a one-to-one correspondence between the indices of the actual motifs and those of the predicted motifs using high local alignment scores. The accuracy of this one-to-one correspondence can be measured using the Jaccard Index given in Equation 2.

J(A, B) = |A ∩ B| / |A ∪ B|   (2)

We propose a more generalized index, S_α, in Equation 3, which measures the similarity of two sets with an allowed margin of error of α. Because of the high locality of local alignment score predictions, and because the highest-scoring 15-mers can still be found by examining the immediate region of a prediction, this margin of error serves as a heuristic for motif identification. In this metric, two items are considered identical if they are no more than α away from each other. In the scope of this work, sets A and B contain the indices of the 15 highest-scoring 15-mers of the actual data and predicted data, respectively. When α = 0, S_α(A, B) in Equation 3 is identical to J(A, B) in Equation 2. Conversely, as α increases, the allowed distance between indices in sets A and B increases. For example, when α = 2, a predicted 15-mer index i and actual 15-mer index i + 2 are considered the same.

J(A, B | α) = S_α(A, B) = |⋃_{μ=−α}^{α} (A ∩ {x + μ | x ∈ B})| / |A ∪ B|   (3)

The following process is an algorithm to calculate a modified version of the Jaccard Index.
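Equation 3 can be computed directly from the two index sets. A minimal Python sketch (the function name is illustrative, with the stride applied symmetrically on both sides of each predicted index):

```python
def s_alpha(actual, predicted, alpha=0):
    """Similarity of two index sets where a pair of indices counts as a
    match if they differ by at most alpha (Equation 3). With alpha=0
    this reduces to the Jaccard Index of Equation 2."""
    actual, predicted = set(actual), set(predicted)
    # Expand each predicted index by every shift mu in [-alpha, alpha].
    shifted = {x + mu for x in predicted for mu in range(-alpha, alpha + 1)}
    # The denominator uses the original (unshifted) sets.
    return len(actual & shifted) / len(actual | predicted)

s_alpha({3, 10, 20}, {3, 11, 40}, alpha=0)  # 1/5: only index 3 matches
s_alpha({3, 10, 20}, {3, 11, 40}, alpha=1)  # 2/5: 11 now also matches 10
```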
Using the argsort function in NumPy, we examine the indices that order both the actual outputs and the predicted outputs. In looping through each of the top n indices of the predicted outputs, we count the number of them that are contained in the list of indices of the actual outputs. The process returns the score as this count over the maximum possible value, which in this case is n. This is implemented in Algorithm 1.

Algorithm 1 Measuring the Jaccard Index with stride α
procedure S_α
  n ← number of highest-scoring k-mers to analyze
  score ← 0
  act_outputs ← actual outputs
  pred_outputs ← outputs from CNN
  act_indxs ← indices that would sort act_outputs
  pred_indxs ← indices that would sort pred_outputs
  outer_loop:
  for i := 1 to n do
    pred_indx ← pred_indxs(i)
    for j := 0 to α do
      if pred_indx ∈ act_indxs − j then
        score ← score + 1
        goto outer_loop
      if pred_indx ∈ act_indxs + j then
        score ← score + 1
        goto outer_loop
  normalized_score ← score / n

VI. RESULTS
Each of the four data sets is characterized by 10,000 samples, where each sample contains a sequence that is 1,000 bp in length. In each sample, a motif pattern is inserted randomly anywhere from five to fifteen times. The first three data sets include inserted motif patterns with zero, one, and two mutations. The fourth data set includes an inserted motif pattern represented based on a PPM. Each data set is evaluated using out-of-sample data generated from 10-fold cross validation based on eight metrics: MSE, R2, and S_0 - S_5.

TABLE I: CNN Results. The average out-of-sample MSE, R2, and S_0 - S_5 for each data set.

A fifth analysis is conducted with another data set using a motif representation similar to that of the fourth data set, with the MafK transcription factor from the BATCH1 regulatory gene [26]. This motif is a 15-mer with a less conserved consensus sequence compared to that of the former four data sets. While this data set did not perform as well as the other four data sets, with an S_α of 45.3%, this analysis brought to light the consideration of the aligner scoring matrix as another hyperparameter to this work.

As it turns out, the performance of the model varies greatly with the chosen match score, mismatch score penalty, and gap score penalty for the currently implemented alignment method. For instance, the S_α varies from 33.7% to 52.6% with different scoring hyperparameters. The former result is derived from an aligner with a match score of +2.0, a mismatch score penalty of -3.0, and a gap score penalty of -3.5, whereas the latter result is derived from an aligner with a match score of +2.0, a mismatch score penalty of -4.0, and a gap score penalty of -4.5. It is currently unclear what aligner hyperparameters are most optimal for this more complex data set and the original four data sets explored in the work.
Although there is evidence to suggest that aligner scoring matrices vary with the type of inserted motif pattern, it is unclear whether the most optimal hyperparameters change from motif to motif.

One possible interpretation of the dependence of the model's chosen evaluation metric, S_α, on the aligner hyperparameters is related to the fact that the CNN predicts alignment scores that are normalized within each sample. Therefore, the farther the highest scores are from the global mean, the more likely it is that the proposed metric will be able to recognize inserted motifs. Conversely, when analyzing a data set with a less conserved motif consensus sequence, such as that of the MafK transcription factor, the alignment scores are closer to the global mean of each sample. This in turn makes recognizing the indices of the highest-scoring segments more challenging. It follows that the aligner hyperparameters which capitalize on increasing this difference are most favorable for all motifs, regardless of pattern.

A. Convolution Neural Network (CNN) Architecture
A CNN is a class of deep learning models which can infer patterns from data formatted as a grid structure, such as a set of prices over time for a stock or a grid representation of pixels in an image. These Artificial Neural Networks (ANNs) use a linear mathematical operation called convolution in at least one of their layers [3]. The convolution operation is commonly identified by the following two equations:

s(t) = ∫ x(a) w(t − a) da   (4)

s(t) = (x ∗ w)(t)   (5)

Equation 4 explicitly denotes the equation for convolution, whereas Equation 5 displays how an asterisk can be used to denote the linear operation. In both equations, x is referred to as the input. Typically, this is formatted as a multidimensional array, or a tensor, that matches the size and dimensions of the data. The second argument is w, representing a kernel, which stores parameters for the model, also formatted as a tensor. This argument is adapted throughout the training process of the model. The output of both functions, s, is called the feature map of the convolution layer. This is what is fed into the next layer of the network [3]. Hidden layers are generated by applying a kernel, or filter, of weights over the receptive field of the inputs. More specifically, the hidden layer is computed based on the filter weights and the input layer as the filter strides across the feature space [28]. This operation can either compress or expand the input space depending on the applied kernel [29]. This paradigm is followed by rounds of activations, normalizations, and pooling [29]. The model typically ends with a fully connected layer to compute its outputs [28]. The proposed model is represented in Figure 4.

Fig. 4: The proposed CNN model architecture.

The model is marked by three rounds of a 1-D convolution layer, a batch normalization layer, a dense layer, and a 1-D maximum pooling layer. After these 12 layers, the model finishes with a 50% dropout layer, a flatten layer, and finally a fully connected layer corresponding to the 986 alignment scores for each sample [13] [12].

The model described above is run on all four data sets for 100 epochs with a batch size of 80 and compiled with the Adam optimizer (learning rate=0.001, beta 1=0.9, beta 2=0.999, epsilon=1e-07). Of the 10,000 samples in each data set, 80% is reserved for training the network and the remaining 20% is used for validation after each epoch. For its loss function, the model relies on Mean Squared Error (MSE), which is calculated between predicted values (y_pred) and target values (y_act) with the following formula in Equation 6:

MSE(y_pred, y_act) = (1/n) Σ_{i=1}^{n} (y_pred,i − y_act,i)^2   (6)

VII. DISCUSSION
As displayed in this work, deep learning models, such as a CNN, have the capacity to recognize and predict the positions of an inserted motif with great accuracy. Furthermore, data structures can be devised to take advantage of unequal class distributions in regression problems, as highlighted by the design of the k-mer data representation in this work and the incorporation of S_α as a novel evaluation metric.

In analyzing the results in Table I, there is a characteristic pattern between the accuracy metrics across each data set. For instance, in comparing S_0 - S_5 for the first data set with zero mutations applied on each inserted motif, the score monotonically increases with an increasing α. This is evident for the three other data sets as well. With respect to this particular trend, it is expected that as α increases, the score will also increase, since α relates directly to the allowed margin of error, making S_α less conservative.

Additionally, the model's accuracy is far higher for the data sets with relatively simple inserted motif patterns, such as non-mutated and mutated consensus motifs, compared to that of the fourth data set with a PPM motif pattern. This relationship can be explained by the process by which the scores for each 15-mer are calculated. For a given 15-mer, a score is computed based on its local alignment with a given consensus motif. For the first data set, the local alignment scores are derived from each inserted motif, whereas in the latter three data sets, the scores are not necessarily derived from each data set's consensus motif, since the motif patterns support variable inserted motifs.

In all data sets, the largest increase in S_α appears to be between S_0 and S_1. After this point, the change in S_α plateaus after a given α. With the consideration that the likelihood of observing a false positive is relatively low, this indicates that the addition of stride α is well-advised. This is the case because the increase in α only influences S_α up to a certain point.
It is expected that as α → β, where β is the maximum α on either side of a given motif index, S_α → 1, because every single one of the n indices will be covered by the stride α. In the case that S_α → 1, the certainty for each identified motif decreases with increasing S_α regardless; however, the absence of this limit in the data indicates that the certainty of the identified motifs does not decrease dramatically from S_0 to S_5. Furthermore, the presence of a plateauing S_α supports the thought that the decrease in the certainty of an identified motif is negligible. This analysis can be drawn further in noticing that the point at which S_α plateaus increases as the complexity of the motif pattern increases. In the case of a more complex motif pattern, such as either of the PPMs, a greater α is required to fully encapsulate the accuracy of the model's predictions. Even then, the certainty of such motif identification decreases with increasing α.

In subsection V-A, we draw a one-to-one correspondence between the actual motif indices and those of the predicted motifs by only examining the indices of the 15 highest-scoring 15-mers in both the actual scores and predicted scores. This is not a strong one-to-one correspondence because the number of inserted motifs actually varies randomly from five to fifteen from sample to sample. By design, this is a confounding variable. When S_α is applied on a sample with five inserted motifs, the returned score is predicted to be an underestimate of the model's prediction. This is due to the fact that this function only examines the 15 highest-scoring indices for each sample. In the case of five inserted motifs, there would be ten 15-mers identified as high-scoring motifs, when in reality these are random 15-mers in the sequence. Because those scores are more likely to be present throughout a sequence, there will be less similarity between the indices of the predicted 15 highest-scoring 15-mers and those of the actual 15 highest-scoring 15-mers.
This will most likely lead to a decrease in S_α.

REFERENCES

[1] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[2] Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Massachusetts, USA, 2017.
[4] Ahmad Al Kawam, Sunil Khatri, and Aniruddha Datta. A survey of software and hardware approaches to performing read alignment in next generation sequencing. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 14(6):1202–1213, 2016.
[5] Patrik D'haeseleer. What are DNA sequence motifs? Nature Biotechnology, 24(4):423–425, 2006.
[6] Robert C. McLeay and Timothy L. Bailey. Motif enrichment analysis: a unified framework and an evaluation on ChIP data. BMC Bioinformatics, 11(1):165, 2010.
[7] Waqar Haque, Alex Aravind, and Bharath Reddy. Pairwise sequence alignment algorithms: a survey. In Proceedings of the 2009 Conference on Information Science, Technology and Applications, pages 96–103, 2009.
[8] EMBL-EBI. Pairwise Sequence Alignment, 2020.
[9] Xiaole Liu, Douglas L. Brutlag, and Jun S. Liu. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. In Biocomputing 2001, pages 127–138. World Scientific, 2000.
[10] Jorge A. Iñiguez-Lluhí and David Pearce. A common motif within the negative regulatory regions of multiple factors inhibits their transcriptional synergy. Molecular and Cellular Biology, 20(16):6040–6050, 2000.
[11] Modan K. Das and Ho-Kwok Dai. A survey of DNA motif finding algorithms. In BMC Bioinformatics, volume 8, page S21. Springer, 2007.
[12] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[13] Jürgen Schmidhuber. Deep learning in neural networks: an overview. Neural Networks, 61:85–117, 2015.
[14] Yu Qi, Yueming Wang, Xiaoxiang Zheng, and Zhaohui Wu. Robust feature learning by stacked autoencoder with maximum correntropy criterion. Pages 6716–6720. IEEE, 2014.
[15] Luping Ji, Xiaorong Pu, Hong Qu, and Guisong Liu. One-dimensional pairwise CNN for the global alignment of two DNA sequences. Neurocomputing, 149:505–514, 2015.
[16] Q. Zhang, L. Zhu, W. Bao, and D. Huang. Weakly-supervised convolutional neural network architecture for predicting protein-DNA binding. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 17(2):679–689, 2020.
[17] Gary D. Stormo and George W. Hartzell. Identifying protein-binding sites from unaligned DNA fragments. Proceedings of the National Academy of Sciences, 86(4):1183–1187, 1989.
[18] Martin C. Frith, Michael C. Li, and Zhiping Weng. Cluster-Buster: finding dense clusters of motifs in DNA sequences. Nucleic Acids Research, 31(13):3666–3668, 2003.
[19] Tom Lesluyes, James Johnson, Philip Machanick, and Timothy L. Bailey. Differential motif enrichment analysis of paired ChIP-seq experiments. BMC Genomics, 15(1):752, 2014.
[20] Peter J. A. Cock, Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, and Michiel J. L. de Hoon. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423, 2009.
[21] Charles E. Grant, Timothy L. Bailey, and William Stafford Noble. FIMO: scanning for occurrences of a given motif. Bioinformatics, 27(7):1017–1018, 2011.
[22] Mengchi Wang, David Wang, Kai Zhang, Vu Ngo, Shicai Fan, and Wei Wang. Motto: representing motifs in consensus sequences with minimum information loss. Genetics, 2020.
[23] Timothy L. Bailey, Mikael Boden, Fabian A. Buske, Martin Frith, Charles E. Grant, Luca Clementi, Jingyuan Ren, Wilfred W. Li, and William S. Noble. MEME Suite: tools for motif discovery and searching. Nucleic Acids Research, 37(suppl 2):W202–W208, 2009.
[24] Martin C. Frith, Neil F. W. Saunders, Bostjan Kobe, and Timothy L. Bailey. Discovering sequence motifs with arbitrary insertions and deletions. PLOS Computational Biology, 4(5):e1000071, 2008.
[25] T. L. Bailey and M. Gribskov. Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 14(1):48–54, 1998.
[26] Oriol Fornes, Jaime A. Castro-Mondragon, Aziz Khan, Robin van der Lee, Xi Zhang, Phillip A. Richmond, Bhavi P. Modi, Solenne Correard, Marius Gheorghe, Damir Baranašić, Walter Santana-Garcia, Ge Tan, Jeanne Chèneby, Benoit Ballester, François Parcy, Albin Sandelin, Boris Lenhard, Wyeth W. Wasserman, and Anthony Mathelier. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Research, 48(D1):D87–D92, 2019.
[27] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[28] Jürgen Schmidhuber. Deep learning in neural networks: an overview. Neural Networks, 61:85–117, 2015.
[29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[30] Ethan J. Moyer and Anup Das. Machine learning applications to DNA subsequence and restriction site analysis. arXiv preprint arXiv:2011.03544, 2020.