Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification
Xiangxie Zhang, Ben Beinke, Berlian Al Kindhi, Marco Wiering
Dept. of Artificial Intelligence, University of Groningen, The Netherlands
Dept. of Electrical Automation Engineering, Institut Teknologi Sepuluh Nopember, Indonesia
Abstract
The classification of DNA sequences is a key research area in bioinformatics as it enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art algorithms, namely Convolutional Neural Networks, Deep Neural Networks, and N-gram Probabilistic Models, are used for the task of DNA classification. Furthermore, we introduce a novel feature extraction method based on the Levenshtein distance and randomly generated DNA sub-sequences to compute information-rich features from the DNA sequences. We also use an existing feature extraction method based on 3-grams to represent amino acids and combine both feature extraction methods with a multitude of machine learning algorithms. Four different data sets, each concerning viral diseases such as Covid-19, AIDS, Influenza, and Hepatitis C, are used for evaluating the different approaches. The results of the experiments show that all methods obtain high accuracies on the different DNA datasets. Furthermore, the domain-specific 3-gram feature extraction method leads in general to the best results in the experiments, while the newly proposed technique outperforms all other methods on the smallest Covid-19 dataset.
Keywords: DNA classification; bioinformatics; machine learning; feature extraction; deep learning
The first successful isolation of DNA by Friedrich Miescher in 1869 was a groundbreaking step in biology, as it laid the groundwork for understanding the blueprints of all organic life. DNA, which is short for deoxyribonucleic acid, is a hereditary material that can be found in the cells of all humans and other living organisms. It carries the necessary information which decides the biological traits of our bodies and works as a genetic blueprint for an evolving organism. An isolated DNA sequence can be represented by a character string, which consists of only A, C, G, or T. This format is named FASTA. The analysis of DNA is crucial, as it allows doctors to diagnose diseases, helps in analyzing the spread of new infections, and it can also be used to solve crimes or conduct paternity tests. Therefore, DNA analysis has become a vital interest in computational biology [1].

In traditional biology, primers are essential tools for DNA analysis. Primers are short single-stranded nucleotide sequences important for the initiation phase of DNA synthesis of all living organisms. In molecular biology, synthetic primers are utilized for different purposes such as the detection of viruses [3], bacteria [4], or parasites [5]. Primers, which are often present in human DNA sequences that are infected by a specific type of virus, are utilized for these purposes. With the help of the Polymerase Chain Reaction (PCR) method, the DNA fragment of the existing virus is amplified significantly, and researchers are able to detect the virus.

Primers are also utilized in various DNA classification problems [6, 7] and bioinformatics [8]. For this study, it is important to note that they can be considered as comparison patterns that can be searched for to diagnose diseases.
By calculating the edit distances between an isolated DNA sequence and the primers of a specific virus, the level of the virus being expressed in the human DNA sequence can be obtained, which can then be used to build up the feature vectors. Machine learning algorithms can be trained on the feature vectors of the DNA sequences. The resulting model can be used to detect viruses and diagnose viral diseases.

Contributions.
Synthesizing primers for a particular virus is often difficult and expensive. This paper proposes an alternative method that uses randomly generated DNA sequences to replace the primers. The advantage is that the analysis and processing necessary for finding the primer patterns can be skipped. In the experiments, the performance of feature extraction using primers and random DNA sequences will be compared to several other machine learning approaches. Another feature extraction method, which will be referred to as the 3-gram method throughout this paper, is also developed. Additionally, other state-of-the-art algorithms, namely convolutional neural networks [23, 24] (CNNs) and deep neural networks (DNNs), which extract the features directly from sample DNA sequences, are evaluated. They are compared with the two feature extraction methods combined with machine learning algorithms such as Adaboost [21], support vector machines (SVMs) [17, 2], and others. An additional algorithm, the N-gram probabilistic model [30, 31], which is often used in natural language processing, is also implemented and compared to the other machine learning approaches.

To provide accurate and convincing final results, we conducted each of the experiments and tested all methods on four different data sets. One data set is concerned with the detection of the Hepatitis C virus in human DNA, one with the classification of influenza virus and coronavirus, another with the classification of HIV into HIV type 1 and HIV type 2, and the last with a classification problem based on human DNA samples infected with SARS-Cov-2. A detailed description of the data sets can be found in the subsequent section.
Paper outline.
This paper is organized as follows. In Section 2, each of the four data sets is described, and the decisions that motivated the choices of data are explained. Section 3 explains the used feature extraction methods and machine learning algorithms. This is followed in Section 4 by the description of the experimental setup. The results for each experiment are presented in Section 5, while Section 6 discusses the results and concludes the paper.
The DNA sequence classification methods are tested on four different data sets of various sizes. They include different types of viruses, and different datasets were used for different aims. One data set that is commonly used and might be considered the standard for DNA analysis by some, the Molecular Biology (Splice-junction Gene Sequences) Data Set, was not used. This decision was made because of the length of the samples this data set contains. While the samples of the tested data sets contain up to almost 30,000 characters, the samples of the splice data set are sequences of only 61 characters. This reason, in addition to the age of the data (the data set was created in 1992), led to the decision to use newer data sets with samples more closely resembling data encountered in actual applications.
The hepatitis C virus (HCV) is a single-stranded RNA virus that can infect RNA sequences in the human body. RNA is the messenger that contributes to the formation of DNA. Therefore, if the RNA is infected, the DNA is also modified. Unlike for the hepatitis B virus (HBV), an effective vaccine against HCV has not yet been developed [9]. HCV can cause severe diseases like hepatitis C and liver cancer. Thus it is vital to detect potential infections with HCV as early as possible. This HCV dataset was obtained from the World Gene bank and consists of 500 HCV-positive DNA sequences and 500 HCV-negative DNA sequences. The length of the DNA sequences in this data set varies widely, with the longest sequences being 12,722 characters long and the shortest only 73 characters long. Most sequences, however, fall in the range from 9,000 to 12,000.
The coronavirus is an RNA virus that can infect humans' respiratory tract and cause many different diseases. Potential diseases could be mild like the common cold, but they could also be lethal like SARS, MERS, or Covid-19. On the other hand, the influenza virus is responsible for seasonal flu and has caused many epidemics in history; for example, the Spanish influenza in 1918 and the outbreak of H1N1 in 2009. Influenza viruses and coronaviruses may cause similar symptoms in patients. However, different measures might need to be taken in order to support patients in their recoveries, depending on the type of virus they are infected with. Therefore it is crucial to know which kind of infection a patient has before a decision about the treatment is made. The dataset was obtained from the National Center for Biotechnology Information (NCBI) and consists of 7500 influenza-virus-positive DNA sequences and 7500 coronavirus-positive DNA sequences. The DNA sequences that this data set contains are all between 95 and 2995 characters long, with most sequences falling in the range of 1350 to 2500.
The human immunodeficiency virus (HIV) can attack the human immune system and cause acquired immunodeficiency syndrome (AIDS). The estimated incubation period is around 8 to 9 years, during which there may be no symptoms. However, after this long period, the risk of getting opportunistic infections increases significantly and can cause many diseases. In addition to the immunosuppression, HIV can also directly impact the central nervous system and cause severe mental disorders [10]. There are two subtypes of HIV, namely HIV-1 and HIV-2. HIV-1 has relatively higher virulence and infectivity than HIV-2. An HIV dataset containing 1600 HIV-1-positive DNA sequences and 1600 HIV-2-positive DNA sequences was acquired from NCBI and used to evaluate the algorithms. The sequences in this data set are between 774 and 2961 characters long.
Severe acute respiratory syndrome coronavirus 2 (SARS-Cov-2) is a subtype of coronavirus which causes coronavirus disease 2019 (Covid-19). Varying degrees of illness can be noticed among different people [11]. The outbreak first happened in Wuhan, China, at the end of 2019. A few months later, the virus had spread to many countries. As it spread very rapidly, it caused worldwide lockdowns. However, the virus seems to spread faster in the USA than in other countries, and the USA has the most confirmed cases. It is interesting to examine whether the viruses found in the USA are different from those in the rest of the world. To test this, a SARS-Cov-2 dataset was obtained from NCBI. It contains 166 SARS-Cov-2-positive DNA sequences from the USA and 158 SARS-Cov-2-positive DNA sequences from the rest of the world (China, Hong Kong, Italy, France, Iran, Korea, Spain, Israel, Pakistan, Taiwan, Peru, Colombia, Japan, Vietnam, India, Brazil, Sweden, Nepal, Sri Lanka, Australia, South Africa, Greece, Turkey). The classification task on this dataset was the most challenging one, since the two classes are the same type of virus and there is a limited amount of data available. The DNA sequences of this data set are between 17,205 and 29,945 characters long.
In this section, the feature extraction methods and machine learning algorithms that are used are described. Two feature extraction methods were compared. The first method is based on the edit distance between two DNA strings. The second method relies on the 3-gram method [12], which will be described later in detail. Six machine learning algorithms are combined with the two feature extraction methods. Finally, three state-of-the-art methods, namely a convolutional neural network (CNN), a deep neural network (DNN), and an N-gram probabilistic model, which were fed the unprocessed DNA sequences without prior feature extraction, were tested.
The Levenshtein distance, also known as edit distance, is used to measure the difference between two strings. The smaller the distance, the more similar the two strings are. There are three edit operations: inserting a character into a string, deleting a character from a string, or substituting a character in a string. The Levenshtein distance between strings a and b denotes the minimum number of edit operations that need to be performed on string a in order to transform string a into string b. The Levenshtein distance between the first i characters of string a and the first j characters of string b is denoted by D_ab(i, j), which can be calculated using equation (1):

\[
D_{ab}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0 \\
\min \left\{ D_{ab}(i-1, j) + 1,\; D_{ab}(i, j-1) + 1,\; D_{ab}(i-1, j-1) + 1_{(a_i \neq b_j)} \right\} & \text{otherwise}
\end{cases}
\qquad (1)
\]

In this equation, a_i and b_j represent the i-th and j-th characters in string a and string b. If either string a or string b has no characters, then the Levenshtein distance equals the maximum length among them. This is easy to understand, because if one string is empty, then simply inserting all characters from the other string into the empty string is enough. If both strings are non-empty, then the last characters of both prefixes, namely a_i and b_j, should be examined. If both prefixes have the same terminal character, then both of these characters can be ignored: looking at the first (i-1) characters of string a and the first (j-1) characters of string b is enough, so in this scenario D_ab(i-1, j-1) equals D_ab(i, j). If the terminal characters are different, then the costs of three possible options have to be compared. The first option is deleting the terminal character of string a, so that D_ab(i-1, j), which is the Levenshtein distance between the first (i-1) characters of string a and the first j characters of string b, should be calculated, plus 1 caused by the deletion. The second option is inserting one character, which should be the same as the terminal character of string b, at the end of string a. After the insertion, the terminal character of string b can be ignored, and we only need to calculate D_ab(i, j-1), plus 1 caused by the insertion. The third option is substituting the terminal character of string a by the terminal character of string b. In this scenario, the terminal character in both strings can be ignored, and the cost is denoted by D_ab(i-1, j-1) + 1. Formula (1) suggests a recursive procedure to calculate the Levenshtein distance between two strings. In each recursion, the last character of one or both strings can be ignored. The recursion halts when one string is empty. According to this formula, the Levenshtein distance between two strings a and b is denoted by D_ab(|a|, |b|), where |a| and |b| represent the lengths of strings a and b.

As mentioned in the introduction, primers are short DNA sequences that are used in medical research to detect possible viruses by conducting PCR on the DNA sequence. In bioinformatics, primers can be used to extract features. In order to do so, the Levenshtein distance between the primer and the DNA sequence is calculated. If the distance is small, then the similarity between the DNA sequence and the primer is considerable. In other words, the person is likely to test positive for the virus corresponding to the primer. However, in an infected DNA sequence, the target virogenes are small fragments hidden somewhere in the DNA sequence. Therefore, directly calculating the Levenshtein distance between the whole DNA sequence and the primer is not accurate. In this research, the matching process is done between the primer and a short sub-string of the DNA sequence. Then the window on the DNA sequence slides by one character, and the Levenshtein distance between the primer and the next sub-string of the DNA sequence is calculated. This process is repeated until the whole DNA sequence is traversed. Finally, the minimum of all these calculated distances is taken as the final distance. For example, suppose the given DNA sequence is "TTTGACTCGT" and the window size is 8. The Levenshtein distance between the primer and the first sub-string "TTTGACTC" is calculated, then "TTGACTCG" and "TGACTCGT".
Afterward, the minimum of the three distances is taken.

In reality, many viruses have existed in the world for a long time, and they have mutated severely. Therefore, only calculating the Levenshtein distance between the DNA sequence and one primer is not enough to make the result accurate. However, for many viruses, multiple different primers exist. Thus, the minimum Levenshtein distance between the DNA sequence and various primers can be calculated and combined into a single feature vector. These feature vectors can then be used to train and test different machine learning methods.

Obtaining a virus's primers can be expensive and takes time, especially for a newfound virus like SARS-Cov-2 in early 2020 or for viruses that have a high mutation rate. Our novel method to solve this problem uses randomly generated short DNA sequences to replace the primers. The feature extraction is then achieved by calculating the minimum Levenshtein distance between the randomly generated DNA strings and the DNA sequences needing classification. Since nothing except for the DNA strings used to calculate the minimum Levenshtein distance (primers vs. randomly generated DNA sequences) changed, the resulting feature vectors have the same format. Different machine learning algorithms will be trained and tested using each set of feature vectors in the experiments.

In this method, before the feature vectors are fed into the machine learning algorithms, a normalization step is crucial. It is helpful to consider that the difference between smaller distances is more significant than the difference between larger distances. For example, the difference between distances 3 and 4 should be given more weight than that between distances 30 and 40. Therefore, finding a suitable normalization function is necessary for this method. The elements in the feature vectors were processed with the function shown by formula (2), where x is the computed distance.
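The edit-distance feature extraction described above can be sketched in a few lines of Python: a bottom-up dynamic-programming Levenshtein distance, the sliding-window minimum distance against a probe (a primer or a random sub-sequence), and the normalization f(x) = 1/(1 + x). The function and parameter names are illustrative, not taken from the paper's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    turning string a into string b (bottom-up DP, O(|a|*|b|))."""
    if not a or not b:
        return max(len(a), len(b))          # base case min(i, j) = 0
    prev = list(range(len(b) + 1))          # row D(0, j) = j
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # D(i, 0) = i
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def min_window_distance(sequence: str, probe: str, window: int) -> int:
    """Slide a window over the DNA sequence and keep the smallest
    Levenshtein distance between the probe and any sub-string."""
    return min(levenshtein(sequence[i:i + window], probe)
               for i in range(len(sequence) - window + 1))

def normalize(x: int) -> float:
    """f(x) = 1 / (1 + x): differences between small distances are
    emphasized, large distances are compressed toward 0."""
    return 1.0 / (1.0 + x)

def feature_vector(sequence: str, probes: list, window: int) -> list:
    """One normalized minimum distance per probe."""
    return [normalize(min_window_distance(sequence, p, window)) for p in probes]
```

With the example from the text, `min_window_distance("TTTGACTCGT", "TTGACTCG", 8)` is 0, because the second window matches the probe exactly, so the normalized feature is 1.0. For the proposed random variant, the probes would simply be random strings over the alphabet ACGT.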
\[ f(x) = \frac{1}{1 + x} \qquad (2) \]

Extracting the features of a DNA sequence using what will be referred to as the 3-gram method is based on knowledge from biology and medicine. Before diving into the algorithm, it is helpful to know how the human body builds the proteins that it needs. DNA sequences store the necessary information for building proteins. In biological cells, a DNA sequence is first reformed into an mRNA sequence. This process is called transcription, and the mRNA works as a messenger. After this, the mRNA sequence is translated into a series of amino acids, which build up the proteins. Researchers have found that one amino acid is coded by a group of three nucleobases [13]. The groups which contain three nucleobases are called codons. The corresponding translation between codons and amino acids is illustrated in Fig. 1. There are in total 64 different codons, and 61 of them can be translated into amino acids. The other three, TAA, TAG, and TGA, are stop codons. They mark the halt point of translation. There are 20 possible amino acids, meaning that several different codons can be translated into the same amino acid.

The 3-gram method simulates the process from DNA sequences to amino acids. A window of size three is used to traverse the whole DNA sequence with a sliding unit of 1 at each step. At each step, the group of three nucleobases is acquired from the DNA sequence, and the corresponding amino acid is recorded. Stop codons are neglected. After the whole DNA sequence is traversed, all different types of amino acids are counted. Then the proportion of each amino acid is calculated and put in a histogram. Take the same DNA sequence as an example again. There are eight codons in the DNA sequence "TTTGACTCGT". They are "TTT", "TTG", "TGA", "GAC", "ACT", "CTC", "TCG" and "CGT".
They can be translated into one phenylalanine, two leucines, one aspartic acid, one threonine, one serine, and one arginine.

Figure 1: Various combinations of three successive nucleobases and their corresponding amino acids.

In a nutshell, each DNA sequence is represented by a 20-D feature vector after using the 3-gram method. These feature vectors are then used as the input vectors for machine learning algorithms [12]. Unlike the previously introduced method, which extracts features based on the Levenshtein distance, the feature vectors extracted by the 3-gram method do not need to be normalized before feeding them into the machine learning algorithms. This is simply because the calculation of the proportion of each amino acid itself already normalizes the data.
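The 3-gram method above can be sketched as follows. The codon table is the standard genetic code (built from the conventional TCAG ordering, with '*' marking the stop codons TAA, TAG, TGA); the function names are illustrative.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: aa
               for (a, b, c), aa in zip(((x, y, z) for x in BASES
                                         for y in BASES for z in BASES),
                                        AMINO)}
AMINO_ACIDS = sorted(set(AMINO) - {"*"})    # the 20 amino acids

def amino_acid_histogram(sequence: str) -> list:
    """20-D feature vector: proportion of each amino acid among all
    overlapping 3-grams of the sequence (stop codons neglected)."""
    counts = Counter(CODON_TABLE[sequence[i:i + 3]]
                     for i in range(len(sequence) - 2))
    counts.pop("*", None)                   # neglect stop codons
    total = sum(counts.values()) or 1
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

For the example sequence "TTTGACTCGT", the eight overlapping codons translate to F, L, *, D, T, L, S, R; the stop codon is dropped, so leucine (L) has proportion 2/7 and the five other observed amino acids 1/7 each.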
The comparison between the two previously introduced feature extraction methods was done by training and testing six machine learning models on the feature vectors acquired by using the two feature extraction methods. Since all the experiments consisted of binary classification tasks, each processed vector was labeled with either 1 or 0, depending on its class; for example, whether it was associated with an infected DNA sequence or an uninfected one. The labeled data was then used to train each of the six machine learning methods described in the following subsections. K-fold cross-validation was used to ensure accurate results. The processed samples were separated into K folds randomly. The training and testing process was run K times. Each time, one fold was used as the testing set, and the remaining K-1 folds together formed the training set. By doing so, every fold was used as part of the training set K-1 times and as the testing set once. Therefore, each model was trained and tested K times, and the average test accuracy was used to evaluate the model.
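The K-fold protocol above can be sketched with a small stand-alone helper (the function name and seed handling are illustrative; libraries such as scikit-learn provide equivalent utilities):

```python
import random

def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for K-fold
    cross-validation: shuffle once, split into k folds, and use each
    fold as the test set exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

A model would be trained and evaluated once per yielded split, and the K test accuracies averaged.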
The multi-layer perceptron (MLP) is a supervised learning model which is often used in classification problems. After trying out different architectures, we observed an MLP with three hidden layers to perform best, with the number of neurons being 500, 250, and 125, respectively. The activation function of the neurons in the hidden layers was the Rectified Linear Unit (ReLU). The sigmoid function was used as the activation function of the single neuron in the output layer.

In the training phase, the binary cross-entropy function was used as the loss function. This loss function calculates the error between the actual output and the target label, on which the training and the update of the weights are based [14]. The adaptive moment estimation optimizer (Adam) was used in this research. The learning rate of the optimization decides how fast the model learns. It is critical to the model and should be set carefully. A large learning rate might make the model never find the optimal solution, while a small learning rate causes inefficiency. Mini-batch learning was used during training. This means that several examples were fed into the MLP together before the weights were updated. Mini-batch learning makes sure that the learning process stays on the right track. In each epoch, all data in the training set were used. When the model had been trained on all examples, the next epoch started. In preliminary experiments, it was found that the model overfitted severely. Therefore, the dropout technique was used to prevent overfitting. The key idea of dropout is that during the training phase, some units and their connections are randomly dropped, enabling the model to generalize well [15].
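A minimal numpy sketch of this architecture's forward pass and loss, assuming a 20-D input (e.g. the 3-gram features); the input dimension, weight initialization, and omission of the training loop and dropout are simplifications, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [20, 500, 250, 125, 1]        # assumed input dim; 500/250/125 hidden
weights = [rng.normal(0, 0.05, (m, n))
           for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x: np.ndarray) -> np.ndarray:
    """Forward pass: ReLU on the hidden layers, sigmoid on the output."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ w + b)      # ReLU
    z = x @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-z))         # sigmoid

def bce_loss(p: np.ndarray, y: np.ndarray) -> float:
    """Binary cross-entropy between predictions p and 0/1 labels y."""
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
```

In the actual experiments, these weights would be updated by Adam on mini-batches, with dropout applied to the hidden layers.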
3.2.2 Logistic regression

Logistic regression is a linear model that is used to carry out binary classification. Similar to the MLP model, the sigmoid function was used for the output unit. Additionally, L2 regularization was used to prevent possible overfitting. This method adds a regularization term at the end of the loss function, which is illustrated by formula (3). The hyperparameter C in this term controls to what degree the L2 regularization should be executed; smaller values specify stronger regularization.

\[ Loss \leftarrow Loss + \frac{1}{C} \sum_{i=1}^{n} w_i^2 \qquad (3) \]

Usually, it is assumed that the independent variables in the input vector X have a multivariate normal distribution. However, most of the time this assumption is not satisfied. In such situations, logistic regression is a good alternative model [16].

The support vector machine (SVM) is another linear model for the classification task [17, 2]. An SVM is capable of handling a small amount of data and is less sensitive to noise in a dataset, and therefore it has an excellent generalization ability [18]. The SVM aims to find the hyperplane which maximizes the margin separating the two classes. The solution can be found by using the Lagrange multiplier method. Powerful non-linear SVM models can be trained if kernel functions are appropriately used [17]. Kernel functions create new feature vectors that usually have more dimensions than the original input. The SVM finds the new hyperplane, which is linear in the new feature space. However, in the original feature space, the separation will be non-linear if a non-linear kernel is used.

In the training process of an SVM, the inner product of two samples x_i and x_j needs to be calculated. The kernel method provides a solution that allows the model to get the inner product in the higher-dimensional feature space directly. This idea is illustrated by formula (4), where K is the kernel function.
\[ K(x_i, x_j) = \phi(x_i)^T \cdot \phi(x_j) \qquad (4) \]

Here, φ(x_i) and φ(x_j) are the new feature vectors in the higher-dimensional feature space. By mapping like this, the explicit transformation of each individual sample from the lower-dimensional space to the higher-dimensional space is unnecessary, which saves a lot of memory and computational resources.

In our implementation, the radial basis function kernel (RBF kernel) was used. The RBF kernel function is shown in formula (5). The hyperparameter γ decides the distribution of the feature vectors in the higher-dimensional feature space. The other hyperparameter in the SVM model is C. Similar to the usage of C in logistic regression, the hyperparameter C here decides the regularization, or in other words, a penalty degree.

\[ K(x_i, x_j) = e^{-\frac{||x_i - x_j||^2}{2\sigma^2}}, \qquad \gamma = \frac{1}{2\sigma^2} \qquad (5) \]

Before talking about the random forest algorithm, it is essential to know how a decision tree classifier [19] works. The decision tree is a tree-like model that simulates how humans make decisions. There is a judgment at each node, and the data are classified into different child nodes. The leaf nodes show the final results. The impurity drop is used to evaluate the decisions, and a good query should maximize the impurity drop. The decision tree is also a supervised learning model, in which the training set is used by the model to learn how to make queries and split the data until a specific criterion or threshold is reached.

The decision tree is the fundamental model of the random forest algorithm [20]. As its name implies, a forest is built up from many trees. The random forest model trains multiple different decision trees on different data, and the average output of these trees is taken as the final output. The bootstrap aggregating method is used to create different training datasets. This method samples some data with replacement from the original training data set, and the samples are used to train one individual tree.
The random forest is a simple, easy-to-understand algorithm which is capable of handling complex non-linear classification tasks. Therefore, it is often used in the machine learning field. Two hyperparameters needed to be tuned in our implementation. One of them is the number of estimators; it controls how many trees are created during the experiment. The other one is the maximum depth of each tree. This value should neither be too large nor too small: larger depths may cause overfitting, while lower depths could lead to underfitting.
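The bootstrap-aggregating step behind the random forest can be illustrated with a small sketch (the helper names are hypothetical; a real forest would train a decision tree on each bootstrap sample):

```python
import random

def bootstrap_sample(data: list, seed: int) -> list:
    """Draw len(data) items with replacement: the training set
    for one individual tree."""
    r = random.Random(seed)
    return [data[r.randrange(len(data))] for _ in data]

def majority_vote(predictions: list) -> int:
    """Combine the per-tree 0/1 predictions for one sample
    (ties broken toward 1 here, an arbitrary choice)."""
    return int(sum(predictions) >= len(predictions) / 2)
```

Each tree sees a different resampled view of the training data, and the ensemble output is the vote (or average) over all trees.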
Adaboost [21], which is short for adaptive boosting, is another decision-tree-based algorithm, similar to the random forest. The basic idea of Adaboost is to train multiple weak classifiers whose combination forms a more robust classifier. The similarity between random forest and Adaboost is that they both train multiple classifiers, while the difference lies in the data used to train them. Unlike random forest, which uses part of the data each time, Adaboost uses all data in the original training set to train a single classifier. Those samples which were classified incorrectly are given more weight, and the updated data set is used to train the next classifier. Similar to the random forest algorithm, two hyperparameters needed to be tuned in our implementation: the number of estimators and the maximum depth of the trees.
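The reweighting idea can be sketched as follows; the update rule shown is the standard AdaBoost.M1 scheme, which the paper does not spell out, so the formulas here are an assumption about the usual formulation rather than the authors' exact implementation:

```python
import math

def reweight(weights: list, correct: list):
    """One AdaBoost round: compute the weighted error of the current
    weak classifier, derive its vote weight alpha, upweight the
    misclassified samples, and renormalize."""
    err = sum(w for w, c in zip(weights, correct) if not c)
    err = min(max(err, 1e-12), 1 - 1e-12)           # guard degenerate cases
    alpha = 0.5 * math.log((1 - err) / err)         # classifier weight
    new = [w * math.exp(-alpha if c else alpha)     # mistakes gain weight
           for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha
```

The next weak classifier is then trained with these updated sample weights, so it focuses on the previously misclassified examples.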
XGBoost [22] is another ensemble learning algorithm, like random forest and Adaboost. The difference between these three algorithms lies in how they train an individual decision tree classifier. Random forest uses differently sampled data, while Adaboost manipulates the weights of the data. Different from both of them, XGBoost is built on the idea of the gradient boosting decision tree (GBDT) and developed from there. GBDT trains an individual decision tree to fit the residual from the previous decision tree [22]. This procedure is done by considering the whole decision tree as a function F(x) and calculating the gradient of the loss function with respect to the function F(x). XGBoost introduces a regularization term to the loss function to prevent overfitting. Additionally, each leaf node is given a score, such that the loss function can be computed more efficiently. XGBoost enables researchers to solve large-scale problems in the real world using a relatively small amount of resources [22]. In our implementation, three hyperparameters needed to be tuned: the number of trees that should be trained, the maximum depth of each tree, and the λ value which controls the degree of regularization.

In this section, three more complex and computationally expensive state-of-the-art algorithms are described that will be compared to the simpler machine learning algorithms using the two feature extraction methods. For these more complex algorithms, the DNA sequences are used as feature vectors without the in-between step of feature extraction via one of the previously described methods.
With the development of computational capacity, the convolutional neural network (CNN) [23, 24] has been widely used in many fields such as computer vision and has achieved great success. A CNN is a neural network in which some matrix multiplication operations between layers are replaced by convolutions [25]. The CNN is able to learn to extract features and train a classifier at the same time. Recently, several researchers have applied the CNN model to bioinformatics, especially the task of classifying DNA sequences. The CNN model has been found to be capable of handling the classification of the nucleotides of DNA sequences with A, G, C, and T [26]. Another study proved the feasibility of using CNNs to classify non-coding RNA sequences, and accuracies higher than 95% were achieved on multiple datasets [27]. Based on these previous studies, we developed a CNN model and compared it to the other algorithms.

Before training a CNN model, an individual DNA sequence has to be transformed into a 2-dimensional matrix by using one-hot encoding. In the resulting matrix, each column represents a character in the original DNA sequence. The number 1 appears at the position which stands for the corresponding character, while the other positions in the same column are filled with 0. An example of one-hot encoding is illustrated in Fig. 2. The first to the last position in the column represent A, C, G, and T, respectively. By using the one-hot encoding, each character in the original DNA sequence is represented by 4 channels. The channels are shown below each other in the same column in Fig. 2. Since the DNA sequences have different lengths, all of them are padded with all-zero columns to the same length.

Figure 2: Example of one-hot encoding of the DNA sequence "TTTGACTCGT"

In the convolution layer, a neuron uses a kernel (filter) and performs a convolution operation to compute a single output in the resulting feature map.
Afterward, the filter slides to the next region and repeats the convolution operation. In this way, the features from different parts of the input can be extracted. In order to decrease the size of the feature maps, a max-pooling layer is added after each convolution layer. It reduces the size by only keeping the maximum value among several neighboring values in a feature map.

After experimenting with different CNN architectures, nine convolution layers were finally used for the SARS-Cov-2 dataset and seven for the others. Each convolution layer was followed by a max-pooling layer. One hundred filters were used in each convolution layer. Each filter in all layers received all four channels as input and had a window length of 3. Therefore, a filter did not take only one character at a time but integrated characters together over a width of 3. The stride in all convolution layers was set to 1, while in all pooling layers the stride was set to 3. All the outputs resulting from the last max-pooling layer were fully connected to a dense layer with five hundred neurons. The final output layer with a single output followed the dense layer. Similar to the MLP model that we implemented, ReLU was used as the activation function of the neurons in the convolution layers and the dense layer. The activation function used in the output layer was the sigmoid function. The loss function of this model was the binary cross-entropy function.

Similar to the MLP, three hyperparameters were coarsely tuned: the number of epochs, the batch size, and the learning rate of the gradient descent optimization. These hyperparameters also apply to the DNN algorithm that will be described next.
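The one-hot input encoding described for Fig. 2 can be sketched as follows (the padding length is an illustrative parameter, chosen per data set in practice):

```python
import numpy as np

ROWS = "ACGT"   # row 0 = A, row 1 = C, row 2 = G, row 3 = T

def one_hot(sequence: str, pad_to: int) -> np.ndarray:
    """Return a 4 x pad_to matrix: each column one-hot encodes one
    character; columns beyond the sequence length stay all-zero."""
    m = np.zeros((4, pad_to))
    for j, ch in enumerate(sequence):
        m[ROWS.index(ch), j] = 1.0
    return m
```

Each of the four rows then serves as one input channel to the CNN (or, flattened, as input to the DNN).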
Another approach is to use a multi-layer perceptron with multiple hidden layers, also referred to as a deep neural network (DNN). DNNs have been used increasingly successfully in bioinformatics. An earlier study demonstrated that a DNN could perform at a state-of-the-art level at the task of predicting the species of origin and species classification of short DNA sequences [28]. The training phase of the DNN uses each of the DNA sequences in its entirety, and does not use the features extracted by either taking minimum Levenshtein distances or the 3-gram method. Similar to the CNN, the DNA sequences were transformed into a 2-dimensional matrix using one-hot encoding. After testing multiple different configurations, a model consisting of one dense layer with 40 neurons, followed by another dense layer with 20 neurons, appeared to provide the most accurate results. For this model, the activation function of the neurons in the hidden layers was again ReLU, and the one for the output layer was the sigmoid function. The loss function was the binary cross-entropy function as well.

3.3 N-gram probabilistic model
A third state-of-the-art method that uses the entire DNA sequences as the features it is trained on is an N-gram probabilistic model. This method is commonly used in natural language processing (NLP) [29]. It has also been successfully applied to DNA classification problems [30, 31], with resulting classification accuracies up to 99.6%. An N-gram is a sequence of N items. These items are e.g. words in NLP, or the letters A, C, G, or T, representing nucleobases in DNA sequences in FASTA format. The N-gram probabilistic model can be used to predict the probability of the next item x in a sequence given the history h of the N-1 previous items: P(x | h). It can for example be used to predict the probability of the next item being the letter A in a DNA sequence, given that the previous four letters were "ACGT". This probability would be calculated as P(A | G, T) in an N-gram model with an N-value of 3 and as P(A | A, C, G, T) for one with an N-value of 5.

For our experiments, the N-gram probabilistic model was used as a classifier, so the probabilities of the next item were calculated using the previous N-1 items for each class separately, e.g. P(X_t | X_{t-1}, X_{t-2}, Class) for N=3. Eventually, the prior probability P(Class) can be used in combination with the N-gram probabilistic model(s) to compute the probability of a sequence belonging to a certain class using Bayes' rule:

P(Class | X_1, X_2, ..., X_T) = P(Class) * P(X_2 | X_1, Class) * P(X_3 | X_2, X_1, Class) * ... * P(X_t | X_{t-1}, X_{t-2}, Class) * ... * P(X_T | X_{T-1}, X_{T-2}, Class)

In order to prevent underflow when working with DNA sequences containing more than 30,000 items, the probabilities were not multiplied in this experiment; instead, the logarithmic values of these probabilities were added.

The number of occurrences of each N-gram is counted for each class. These counts are then used to calculate the probabilities.
For testing purposes on a novel DNA sequence, the computed probabilities are used to calculate the log probability of the sequence belonging to each of the two classes. The DNA sequence is then classified according to the class with the highest probability. We tested different values for N, and the best-performing value of N, which was used in all experiments, was 6.
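The counting and log-probability scheme above can be sketched as follows. This is an illustrative version with N=3 and simple add-alpha smoothing (the paper does not specify its smoothing; the reported experiments used N=6):

```python
# Sketch of an N-gram classifier for FASTA sequences: per-class conditional
# probabilities are estimated from N-gram counts, and log-probabilities are
# summed (rather than multiplying probabilities) to avoid underflow on
# sequences with tens of thousands of items.
import math
from collections import Counter

def train_ngram(sequences, n=3):
    """Count n-grams and their (n-1)-gram contexts for one class."""
    grams, contexts = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            grams[seq[i:i + n]] += 1
            contexts[seq[i:i + n - 1]] += 1
    return grams, contexts

def log_likelihood(seq, model, n=3, alpha=1.0):
    """Sum of log P(x_t | previous n-1 items) with add-alpha smoothing."""
    grams, contexts = model
    total = 0.0
    for i in range(len(seq) - n + 1):
        num = grams[seq[i:i + n]] + alpha
        den = contexts[seq[i:i + n - 1]] + alpha * 4  # 4 possible next letters
        total += math.log(num / den)
    return total

def classify(seq, model_pos, model_neg, log_prior_pos=0.0, log_prior_neg=0.0):
    """Pick the class with the highest summed log probability (Bayes' rule)."""
    pos = log_prior_pos + log_likelihood(seq, model_pos)
    neg = log_prior_neg + log_likelihood(seq, model_neg)
    return "positive" if pos > neg else "negative"

model_pos = train_ngram(["ACGACGACGACG"])  # toy training data for each class
model_neg = train_ngram(["TTTTTTTTTTTT"])
label = classify("ACGACG", model_pos, model_neg)
```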
Each machine learning algorithm was trained and tested on the processed feature vectors obtained by using the two different feature extraction methods. Reliable primers could be acquired for HCV. Therefore, when testing the random DNA sequences method on the HCV dataset, the distance-based feature extraction using primers was also performed in order to make comparisons with the random DNA sequences method. Since 37 primers of HCV were acquired, we generated three groups of random DNA sequences, each containing 37 DNA sequences. The lengths of the DNA sequences in the three groups were 25, 100, and 200, respectively. Additionally, a fourth group was generated, containing 100 DNA sequences of length 200. For the other datasets, since primers could not be obtained, the random DNA sequences method was tested using 50 random DNA sequences with lengths of 25, 50, and 100.

Furthermore, each algorithm was trained and tested on each of the four datasets using the DNA sequences in their entirety as features. For every experiment, the accuracy of the binary classification was tested across ten folds of cross-validation. As all datasets consisted of two kinds of DNA sequences, the training and testing procedure was the same for each dataset.

For each experiment, the feature vectors were assigned labels according to their class. During the training phase, the classifiers were trained on each feature vector of the training set and its corresponding label. During the testing phase, each feature vector of the testing set was classified as either 'positive' or 'negative' for the HCV dataset, HIV-1 or HIV-2 for the HIV dataset, 'influenza virus' or 'coronavirus' for the influenza/corona dataset, or as either 'originating from the USA' or 'not originating from the USA' for the SARS-Cov-2 dataset. The result of each classification was recorded and compared to the correct label of each feature vector in the testing set to calculate the accuracy.
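The random-DNA-sequence feature extraction used above can be sketched as follows. This is our illustration of the idea, under the assumption that each feature is the minimum Levenshtein distance between a short reference sequence and same-length windows of the DNA sample; primers, when available, would take the place of the random references:

```python
# Illustrative sketch of the distance-based feature extraction: generate short
# random DNA sequences, then represent each DNA sample by its (minimum,
# windowed) Levenshtein distance to every reference sequence.
import random

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def random_sequences(count, length, seed=0):
    """Randomly generated short DNA sequences, used in place of primers."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length))
            for _ in range(count)]

def min_window_distance(dna, ref):
    """Smallest edit distance between ref and any same-length window of dna."""
    w = len(ref)
    if len(dna) < w:
        return levenshtein(dna, ref)
    return min(levenshtein(dna[i:i + w], ref) for i in range(len(dna) - w + 1))

def extract_features(dna, references):
    """One distance feature per reference sequence."""
    return [min_window_distance(dna, ref) for ref in references]

references = random_sequences(count=50, length=25)  # e.g. 50 sequences of length 25
features = extract_features("ACGTACGTACGTACGTACGTACGTACGT", references)
```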
For each trial, the accuracy was recorded to compute the mean accuracy and standard deviation across the ten folds of the cross-validation.

The whole experiment was divided into two parts. The first part was a preliminary experiment, which was used to tune the hyperparameters and decide on the best set of hyperparameters for each algorithm. This was done by repeating the training and testing procedure using different sets of hyperparameters; the set that gave the highest accuracy was selected. After the optimal hyperparameters were found, the second experiment was conducted using those hyperparameters. The comparison across the different algorithms was based on the results of the second experiment. All the hyperparameters that needed to be tuned have already been discussed in the method section, and Tables (1) to (7) show the best found values for the hyperparameters.
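The two-stage tuning procedure can be sketched generically as follows. The helper names here are hypothetical; `train_and_score` stands for whatever trains one model on nine folds and tests it on the held-out tenth:

```python
# Generic sketch of hyperparameter selection by mean cross-validation
# accuracy: every candidate setting is scored over all folds, and the
# best-scoring setting is kept for the final experiment.
from itertools import product

def grid_search(train_and_score, grid, n_folds=10):
    """grid: dict of hyperparameter name -> list of candidate values.
    train_and_score(params, fold) -> test accuracy on that fold."""
    best_params, best_acc = None, -1.0
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        mean_acc = sum(train_and_score(params, f)
                       for f in range(n_folds)) / n_folds
        if mean_acc > best_acc:
            best_params, best_acc = params, mean_acc
    return best_params, best_acc

# Toy stand-in for a real trainer, to show the mechanics:
grid = {"learning_rate": [0.01, 0.001, 0.0001], "batch_size": [64, 256]}

def fake_score(params, fold):
    return 0.9 if params["learning_rate"] == 0.0001 else 0.5

best_params, best_acc = grid_search(fake_score, grid)
```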
All results presented in this section are the mean accuracy and standard deviation over ten folds of cross-validation. The comparison between using primers and random DNA sequences was only made on the HCV dataset. The results of the experiment on the HCV dataset are shown in Figure (3).

For each data set, the results of all six machine learning algorithms using the random DNA sequence feature extraction method are presented in Table (8), containing the mean accuracy and standard deviation over the ten folds of the cross-validation. For each machine learning algorithm, multiple lengths and amounts of the random DNA sequences were considered; however, only the ones showing the best results are displayed in the table. For the displayed results of the random sequence feature extraction method, 50 randomly generated DNA sequences of length 25 were used on the HIV and influenza/corona data sets. For the SARS-Cov-2 data set, 50 randomly generated DNA sequences of length 50 were used. For the 3-gram feature extraction method, the results of the six machine learning algorithms for each of the four data sets are displayed in Table (9).

An overview of all results, in which the feature extraction methods are compared with the state-of-the-art algorithms, is provided in Table (10). For this table, only the best results of the 3-gram and random DNA sequence feature extraction methods were considered. The best results stem from different machine learning algorithms for different data sets; the exact results for each algorithm on each data set are displayed in Table (8) and Table (9).

Figure (3) suggests that using primers gives the highest accuracy when the classifier is trained using the MLP, Adaboost, or XGBoost algorithm. If the primers are replaced by random DNA sequences, the highest accuracy is obtained using an SVM classifier.
Figure 3: Accuracy of different machine learning algorithms using the primers and the novel random DNA-sequence feature extraction method on the HCV dataset.

Although using primers leads to better results, the results indicate that using primers (M=99.9, SD=0.32) does not have a significantly higher accuracy (t(18)=1.90, p=0.07) than using the random DNA sequences (M=99.3, SD=0.95). It can be concluded that the Levenshtein distance feature extraction yields the best and most consistent results across the six different machine learning algorithms when the distance between a primer and a DNA sequence is taken. However, random DNA sequences can be used to replace primers when the latter are not available.

Furthermore, it can be observed that even though the SVM produces the highest accuracy for three of the four data sets, there is no single machine learning method that consistently yields the best results across all the different lengths of the randomly generated strings, nor across each of the various data sets (see Table (8)). Also, there is no clear indication of the best length for the random DNA strings (for simplicity, we do not show all these results).

For the 3-gram feature extraction method, the results show a similar pattern. Even though here the SVM is among the machine learning algorithms yielding the best results for three out of the four data sets, Adaboost again provides the highest mean accuracy for the HIV data set. A notable difference to the random DNA string feature extraction is that the differences between the machine learning algorithms become much smaller: different algorithms show identical results for two of the four data sets using the 3-gram feature extraction method.

Table (10) shows that overall the 3-gram feature extraction method, combined with either an SVM (for HCV and Inf./Cor.) or Adaboost (for HIV), obtains the highest mean accuracy of all tested methods on 3 out of 4 data sets. For these data sets the other methods also perform very well, especially the CNN, and the differences are quite small. The DNN seems to perform a bit worse on most data sets.

For the SARS-Cov-2 data set, the Levenshtein distance with the random DNA string feature extraction method obtains significantly higher accuracies than the other methods. For this small data set, it outperforms the second-best method (the 3-gram) by around 4.3%.

The above results suggest that the 3-gram method obtains better performance on larger data sets, while the random DNA sequences method might be better at handling relatively smaller data sets. If large amounts of data are not readily available, the results of the random DNA sequence method are promising: it obtains an accuracy as high as 97% with as little as 292 samples to train on.

Table 1: The best hyperparameters of the MLP on the four datasets, using the two feature extraction methods

             Random DNA method                      3-gram method
             Batch size  Epochs  Learning rate     Batch size  Epochs  Learning rate
HCV          64          2000    0.0001            64          1000    0.0001
HIV          256         2000    0.0001            256         2000    0.0001
Inf./Cor.    250         2000    0.0001            64          50      0.0001
SARS-Cov-2   64          10000   0.0001            64          20000   0.00005

Table 2: The best hyperparameters of the SVM on the four datasets, using the two feature extraction methods

             Random DNA method                      3-gram method
             Distribution γ  Regularization C      Distribution γ  Regularization C
HCV          750             7                     750             7
HIV          200             10                    200             10
Inf./Cor.    200             10                    200             10
SARS-Cov-2   1500000         10                    2500000         10

Table 3: The best hyperparameters of logistic regression on the four datasets, using the two feature extraction methods

             Random DNA method     3-gram method
             Regularization C      Regularization C
HCV          5000000               1000
HIV          100000                100000
Inf./Cor.    100000                100000
SARS-Cov-2   10000000              10000000

Table 4: The best hyperparameters of random forest on the four datasets, using the two feature extraction methods

             Random DNA method               3-gram method
             Number of trees  Max depth      Number of trees  Max depth
HCV          50               50             50               50
HIV          50               50             50               50
Inf./Cor.    25               50             25               50
SARS-Cov-2   100              50             100              50

Table 5: The best hyperparameters of Adaboost on the four datasets, using the two feature extraction methods

             Random DNA method               3-gram method
             Number of trees  Max depth      Number of trees  Max depth
HCV          250              3              250              3
HIV          250              3              250              3
Inf./Cor.    150              3              150              3
SARS-Cov-2   250              3              250              3

Table 6: The best hyperparameters of XGBoost on the four datasets, using the two feature extraction methods

             Random DNA method                    3-gram method
             Number of trees  Max depth  λ        Number of trees  Max depth  λ
HCV          200              3          0.25     200              3          0.25
HIV          200              3          0.25     200              3          0.25
Inf./Cor.    100              3          0.5      100              3          0.5
SARS-Cov-2   300              3          0.5      300              3          0.25

Table 7: The best hyperparameters of CNN and DNN on the four datasets

             CNN                                   DNN
             Batch size  Epochs  Learning rate    Batch size  Epochs  Learning rate
HCV          100         200     0.0001           100         170     0.00005
HIV          128         100     0.0001           100         170     0.0001
Inf./Cor.    32          25      0.0001           100         150     0.00005
SARS-Cov-2   32          500     0.00001          150         100     0.0001

Table 8: Mean accuracy ± standard deviation for all methods using the random DNA-sequence feature extraction across the four data sets.

Table 9: Mean accuracy ± standard deviation for all methods using the 3-gram feature extraction across the four data sets.

Table 10: Mean accuracy ± standard deviation for all methods across the four data sets.

This paper aimed to provide an extensive comparison of different methods for DNA sequence classification. Five different methods were compared across four different data sets of various sizes. Examining the proposed novel method, which uses random DNA sequences to extract distance-based features, is one of the main novelties of this paper: we wanted to test whether it is good enough to replace primers.

The results showed that modern state-of-the-art methods from fields like computer vision and natural language processing, such as CNNs or N-gram probabilistic models, can achieve very high accuracies above 99% on DNA sequence classification problems, provided that enough sample data is available. Although the DNN performs slightly worse in some of the experiments, its results are still acceptable. Therefore, we can conclude that these algorithms can be successfully applied to different DNA classification problems.

The results also showed that the use of feature extraction methods helps to obtain the best results. The 3-gram method is quite simple but very effective in handling different datasets. The novel feature extraction method based on random DNA sequences led to the best result on the smallest SARS-Cov-2 dataset and can therefore be promising for DNA classification problems when little data is available.

The potential applications of the proposed methods are numerous. A potential field in which the methods could be deployed is the diagnosis of diseases. Especially the 3-gram feature extraction method seems promising for diagnosing viral infections such as HCV or HIV. For future studies, it would be interesting to investigate further applications of the different methods, for example in ancestral research using genetic samples or in the detection of genetic predispositions.
Whether the same techniques perform similarly well for problems of this kind remains to be determined. Our results also indicate that the SARS-Cov-2 viruses spreading in the USA seem to differ from those in other countries. Therefore, it would be interesting for biologists to further investigate the origin of SARS-Cov-2 with the help of machine learning.