Comparison of SVM and Spectral Embedding in Promoter Biobricks' Categorizing and Clustering
Shangjie Zou
South China Agricultural University, College of Life Sciences
Guangzhou Institute of Advanced Technology, CAS
Abstract:
Background:
In organisms' genomes, promoters are short DNA sequences located upstream of structural genes that control the transcription of those genes. Promoters can be roughly divided into two classes: constitutive promoters and inducible promoters. Promoters with clear functional annotations are practical synthetic biology biobricks. Many statistical and machine learning methods have been introduced to predict the functions of candidate promoters. Laplacian eigenmaps (spectral embedding) have been reported to be an effective method for clustering biobricks, while the support vector machine (SVM) is a powerful machine learning algorithm, especially when the dataset is small.
Methods:
The two algorithms, spectral embedding and SVM, are applied to the same dataset of 375 prokaryotic promoters. For spectral embedding, a Laplacian matrix is built from edit distances, followed by K-Means clustering. For SVM training, the sequences are represented as numeric vectors.
Results:
SVM achieved a prediction accuracy of 93.07% in 10-fold cross-validation when classifying promoters by transcriptional function. Laplacian eigenmaps (spectral embedding) based on edit distance may not be capable of extracting discriminative features for this task.
Availability: Code, datasets and some important matrices are available on GitHub: https://github.com/shangjieZou/Promoter-transcriptional-predictor/tree/source-code
Keywords: spectral embedding, support vector machine, promoter, transcriptional function
INTRODUCTION
Synthetic biology is a promising research field that provides solutions to many industrial problems. The basic principle of industrial synthetic biology is to create engineered living systems by assembling DNA sequences [1]. These DNA sequences are called DNA parts or "biobricks". iGEM (International Genetically Engineered Machine), a competition established by MIT, maintains a database called the "Registry of Standard Biological Parts" (http://parts.igem.org/), which collects information on thousands of standardized biobricks. Biobricks recorded in this database are mostly tested by former iGEM participants [2].
The promoter is an important category of biobrick, which controls the initiation of transcription. Promoters can be roughly divided into two classes: constitutive promoters, whose activity is not affected by any transcription factors, and inducible promoters, whose activity can be positively or negatively regulated by transcription factors. By choosing different promoters, we can tailor the expression of target genes in genetic circuits [3].
The aim of this study is to evaluate and compare the performance of SVM and spectral embedding in categorizing promoters by transcriptional function (that is, whether the promoter can be regulated by transcription factors). The two methods, one for clustering and one for classification, are applied to the same dataset. Laplacian eigenmap (spectral embedding) is a nonlinear dimensionality reduction method. It has been reported to be effective in discriminating biobricks and assessing the quality of biological databases [4]. In this work, edit distance is used to represent the distance between promoter sequences and to generate the Laplacian matrix. Through eigendecomposition, the eigenvectors and eigenvalues of the Laplacian matrix are extracted for clustering. SVM is a powerful machine learning method that can perform nonlinear classification by applying a kernel function to map samples into a higher-dimensional space and separating them with a decision surface [5]. In recent years, SVM has been introduced to bioinformatics research and has performed well in many applications, such as disease diagnosis [6] and prediction of the function of DNA parts [7].
METHODS AND DATASET
Computational Platform and tools
The project was implemented in Python 3.6 on Microsoft Windows 10. Edit distances between promoter sequences are calculated by dynamic programming. The matrix of edit distances is used to generate the Laplacian matrix, followed by eigendecomposition and K-Means clustering. Before training the SVM model, the promoter sequences are translated into numeric vectors with a one-hot strategy: of the four nucleotides, "A" is represented by [1,0,0,0], "G" by [0,1,0,0], "C" by [0,0,1,0] and "T" by [0,0,0,1]. SVM and K-Means are implemented with the Python module scikit-learn.
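The one-hot strategy described above can be sketched as follows. This is a minimal illustration, not the code from the repository: the function name `one_hot_encode`, the `max_len` parameter, and the zero-padding of short sequences (needed so that every sequence yields a vector of the same length) are assumptions, since the text does not say how sequences of different lengths are aligned.

```python
import numpy as np

# Mapping taken from the text: A, G, C, T in that order.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "G": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(seq: str, max_len: int = 500) -> np.ndarray:
    """Encode a DNA sequence as a flat vector of length 4 * max_len.

    Assumption: positions past the end of the sequence, and any symbol
    other than A/G/C/T, are left as all-zero rows.
    """
    encoded = np.zeros((max_len, 4), dtype=np.int8)
    for pos, base in enumerate(seq.upper()[:max_len]):
        encoded[pos] = ONE_HOT.get(base, (0, 0, 0, 0))
    return encoded.ravel()


# A 4-base sequence padded to 5 positions gives a 20-element vector.
example = one_hot_encode("AGCT", max_len=5)
```

With `max_len=500`, every promoter in the filtered dataset would map to a vector of 2000 entries, which can be stacked row by row into the feature matrix passed to the SVM.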
Dataset
The dataset is obtained from the "Registry of Standard Biological Parts": http://parts.igem.org/Promoters/Catalog. 441 E. coli σ70 promoters are collected with a web crawler as the "whole dataset". After removing sequences longer than 500 bp, 375 sequences remain in the "filtered dataset". There are three categories of promoters: constitutive promoters, positively regulated promoters and repressible promoters. Both positively regulated promoters and repressible promoters are inducible promoters. In this study, two methods are used to split the dataset: Hold-Out and K-Fold. Hold-Out randomly divides the dataset into a training set (80% of the sequences) and a test set (20% of the sequences). Hold-Out is quick and simple, so it is used to divide the dataset for parameter tuning. However, a single Hold-Out split is not well suited to validation, so the K-Fold method is also used. This method divides the dataset into K subsets. Validation is repeated K times, and each time one subset serves as the test set while the others are used for training.
Laplacian Matrix and Spectral Clustering
In this study, edit distance is used to represent the distance between two sequences. The edit distance d(x_i, x_j) is the minimum number of edit operations (deletion, insertion and substitution) needed to transform string x_i into string x_j. Dynamic programming is the most widely used algorithm for calculating edit distance.
The matrix of edit distances is then normalized by equation 1 to generate a matrix of normalized edit distances (denoted as matrix M):
$$M_{ij} = \mathrm{normalized\_d}(x_i, x_j) = \frac{d(x_i, x_j)}{\max(\mathrm{length}(x_i), \mathrm{length}(x_j))} \qquad (1)$$
The next step is constructing the similarity matrix S. Following a former study [4], I adopted a Gaussian kernel with σ = 0.3. The matrix S is calculated by equation 2 (given here in the standard Gaussian form):
$$S_{ij} = e^{-M_{ij}^{2} / (2\sigma^{2})} \qquad (2)$$
A diagonal matrix D is also needed to obtain the Laplacian matrix [8]. Each diagonal entry of D is the sum of the similarities in the corresponding row:
$$D_{ii} = \sum_{j=1}^{n} S_{ij} \qquad (3)$$
Finally, the Laplacian matrix G is obtained by G = D - S. Eigendecomposition is applied to matrix G. The eigenvectors of G are fed to K-Means clustering for classification and visualization.
Support Vector Machine (SVM)
SVM is implemented with scikit-learn. The dataset is the filtered dataset of 375 promoters shorter than 500 bp, consisting of 83 constitutive promoters and 292 inducible promoters. Before performance measurement, the dataset is randomly split into a training set (80%) and a test set (20%). In this study, C and gamma are the hyperparameters to be tuned. To find the optimal combination of parameters, a nested loop is employed. Fig 1 shows the accuracy that different combinations of C and gamma achieved on the test set. Based on the results, the model achieves its highest prediction accuracy of 93.33% on the test set when C is set to 10 and gamma is set to 0.015. The rapid drop in accuracy as C and gamma increase may be a sign of overfitting.
Fig 1. Optimizing parameter C and gamma
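The nested-loop search over C and gamma might look like the sketch below. It is only an illustration under stated assumptions: the data are a synthetic stand-in generated with `make_classification` (not the promoter dataset), the RBF kernel is assumed because gamma is tuned, and the candidate grids `C_grid` and `gamma_grid` are hypothetical values chosen to include the reported optimum.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 375 encoded promoters (assumption).
X, y = make_classification(n_samples=375, n_features=40, random_state=0)

# Hold-out split: 80% for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical candidate values; the grid used in the study is not stated.
C_grid = [0.1, 1, 10, 100]
gamma_grid = [0.001, 0.015, 0.1, 1]

# Nested loop: fit one model per (C, gamma) pair and record test accuracy.
results = {}
for C in C_grid:
    for gamma in gamma_grid:
        model = SVC(kernel="rbf", C=C, gamma=gamma)
        model.fit(X_train, y_train)
        results[(C, gamma)] = model.score(X_test, y_test)

best_params = max(results, key=results.get)
print("best (C, gamma):", best_params, "accuracy:", results[best_params])
```

The `results` dictionary holds one accuracy per parameter pair, which is the information a plot such as Fig 1 would display.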
RESULTS
Spectral Embedding
To comprehensively assess the performance of spectral embedding, two steps of measurement are conducted. Initially, the matrix of edit distances was built from the sequences of the 441 σ70 promoters obtained from the database (90 constitutive and 351 inducible). According to the clustering result (Fig 2), the 441 promoters can be divided into two clusters. However, the two clusters do not reflect the categories of transcriptional function. The bigger cluster (cluster_0) contains 92.22% of the constitutive promoters and 85.14% of the inducible promoters.
My hypothesis is that this phenomenon is caused by the large effect that sequence length has on edit distance. The distribution of sequence lengths in the two clusters (Fig 3) supports this hypothesis.
Fig 2. Spectral clustering result on all 441 sequences
Fig 3. Distribution of lengths for sequences in two clusters
After re-checking the database, I found that most of the sequences assigned to cluster_1 are not actually promoters. For example, BBa_K256018 is labeled as a constitutive promoter, but is actually a complete gene circuit with a coding gene, ribosome binding site and terminator. To avoid the potential effects of these mislabeled promoters, I removed the sequences longer than 500 bp and repeated the clustering. However, in the new clustering result (Fig 4), the two clusters are still separated along a single eigenvector, which was found to represent sequence length. Based on these results, spectral embedding based on edit distance may not be suitable for this task, because sequence length outweighs the other discriminative features.
Fig 4. Clustering result of 375 promoters (sequences longer than 500 bp filtered out). Each axis represents an eigenvector.
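The length effect can be reproduced on toy data. The sketch below follows the pipeline described in the Methods section (normalized edit distance, Gaussian similarity with σ = 0.3, G = D - S, eigendecomposition, K-Means) and is not the code from the repository. Several points are assumptions: the six sequences are made up for illustration, the Gaussian kernel is written in its standard form, the ordinary eigenproblem of G is solved, and only the eigenvector with the smallest non-zero eigenvalue is passed to K-Means.

```python
import numpy as np
from sklearn.cluster import KMeans


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def spectral_embedding(seqs, sigma=0.3, n_components=1):
    """Return (embedding, M): eigenvectors of G = D - S and the distance matrix."""
    n = len(seqs)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = edit_distance(seqs[i], seqs[j])
            M[i, j] = M[j, i] = d / max(len(seqs[i]), len(seqs[j]))
    S = np.exp(-M ** 2 / (2 * sigma ** 2))   # Gaussian kernel, standard form (assumed)
    D = np.diag(S.sum(axis=1))               # row sums on the diagonal
    G = D - S                                # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(G)           # eigenvalues in ascending order
    # Skip the first (constant) eigenvector, keep the next n_components.
    return eigvecs[:, 1:1 + n_components], M


# Toy data (made up): three short sequences that differ by one base each,
# and the same three sequences with 120 extra bases appended.
motif = "TTGACAGCTAGCTCAGTCCTAGGTATAAT"
short_seqs = [motif, motif[:-1] + "A", "A" + motif[1:]]
long_seqs = [s + "ACGT" * 30 for s in short_seqs]

embedding, M = spectral_embedding(short_seqs + long_seqs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(labels)  # the first three labels differ from the last three
```

Each long sequence contains its short counterpart unchanged, yet the two are placed in different clusters, because appending 120 bases alone pushes their normalized edit distance above 0.8.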
Support Vector Machine (SVM)
SVM is applied to the filtered dataset (375 promoters shorter than 500 bp). Before training the model, the optimal parameters are tuned on the training set and test set produced by the hold-out method (see section METHODS AND DATASET). The SVM model with the best combination of parameters (C=10, gamma=0.015) is then evaluated by cross-validation. In this study, cross-validation is conducted from 2-fold to 10-fold. According to Fig 5, the model achieved its highest average accuracy of 93.07% when validated with 10-fold. A likely reason is that 10-fold cross-validation provides the model with larger training sets, which may benefit its predictive performance on the test set.
Fig 5. Cross validation score of optimized SVM model
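The sweep from 2-fold to 10-fold can be sketched with scikit-learn's `cross_val_score`. The data are again a synthetic stand-in (an assumption), so the accuracies printed here say nothing about the promoter dataset; the point is the shape of the loop and how the training-set size grows with K. Note that for a classifier, an integer `cv` makes scikit-learn use stratified folds, which may differ from the splitting used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the 375 encoded promoters (assumption).
X, y = make_classification(n_samples=375, n_features=40, random_state=0)

# Parameters reported in the text; the RBF kernel is assumed.
model = SVC(kernel="rbf", C=10, gamma=0.015)

mean_accuracy = {}
train_sizes = {}
for k in range(2, 11):
    scores = cross_val_score(model, X, y, cv=k)   # one accuracy per fold
    mean_accuracy[k] = scores.mean()
    # Approximate number of training samples per round (rounded down).
    train_sizes[k] = len(y) * (k - 1) // k

for k in sorted(mean_accuracy):
    print(f"{k:2d}-fold: train size ~{train_sizes[k]}, "
          f"mean accuracy {mean_accuracy[k]:.4f}")
```

With 375 samples, each 2-fold round trains on about 187 sequences, while each 10-fold round trains on about 337.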
DISCUSSION
Based on the results, SVM is a powerful method for the functional classification of promoters. Spectral embedding on the basis of edit distance does not perform well in this task.
The biggest shortcoming of the spectral embedding model, and the main cause of its poor performance, is the large effect of sequence length. When the lengths of the input sequences vary, length may outweigh the other discriminative features. As a consequence, the clusters are divided by sequence length, without regard to other features.
Moreover, it is well known that different deoxyribonucleotides have different physicochemical properties, which may contribute to promoter performance. For example, guanine (G) and cytosine (C) make a DNA fragment more stable, while fragments with more adenine (A) and thymine (T) are more likely to unwind and initiate transcription. However, the edit distance algorithm does not weight edit operations according to the properties of the deoxyribonucleotides involved. Consequently, much important information may be lost when generating the Laplacian matrix.
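The length effect can be made explicit with a simple bound. Each edit operation changes the length of a string by at most one, so the edit distance can never be smaller than the difference in lengths. The symbol ℓ_i for the length of sequence x_i and the 50 bp versus 400 bp example are introduced here for illustration only.

```latex
d(x_i, x_j) \ge |\ell_i - \ell_j|
\quad\Longrightarrow\quad
M_{ij} = \frac{d(x_i, x_j)}{\max(\ell_i, \ell_j)}
\ge 1 - \frac{\min(\ell_i, \ell_j)}{\max(\ell_i, \ell_j)}

% Example: \ell_i = 50, \ell_j = 400
M_{ij} \ge 1 - \frac{50}{400} = 0.875
```

A 50 bp sequence and a 400 bp sequence are therefore at a normalized distance of at least 0.875 whatever their content, whereas two sequences of equal length can lie anywhere between 0 and 1.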
References
[1] Schwille, P. (2011). Bottom-up synthetic biology: engineering in a tinkerer's world. Science, 333(6047), 1252-1254.
[2] Erickson, B., Singh, R., and Winters, P. (2011). Synthetic biology: regulating industry uses of new biotechnologies. Science, 333(6047), 1254-1256.
[3] Nielsen, M. T., Madsen, K. M., Seppälä, S., Christensen, U., Riisberg, L., Harrison, S. J., et al. (2015). Assembly of highly standardized gene fragments for high-level production of porphyrins in E. coli. ACS Synthetic Biology, 4(3), 274.
[4] Yang, J., Wang, H., Ding, H., An, N., and Alterovitz, G. (2017). Nonlinear dimensionality reduction methods for synthetic biology biobricks' visualization. BMC Bioinformatics, 18(1), 47.
[5] Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988-999.
[6] Huang, M. W., Chen, C. W., Lin, W. C., Ke, S. W., and Tsai, C. F. (2017). SVM and SVM ensembles in breast cancer prediction. PLoS ONE, 12(1), e0161501.
[7] Meng, H., Ma, Y., Mai, G., Wang, Y., and Liu, C. (2017). Construction of precise support vector machine based models for predicting promoter strength. Quantitative Biology, 5(1), 90-98.
[8] Belkin, M., and Niyogi, P. (2001). Laplacian eigenmaps and spectral techniques for embedding and clustering. In: NIPS Vol. 14. Canada. p. 585-