A summary of the prevalence of Genetic Algorithms in Bioinformatics from 2015 onwards
RESEARCH

Mekaal Swerhun*†, Jasmine Foley†, Brandon Mossop† and Vijay Mago*

*Correspondence: [email protected], Department of Computer Science, Lakehead University, 955 Oliver Rd, P7B 5E1, Thunder Bay, Canada. Full list of author information is available at the end of the article.
†Equal contributor
Abstract

Background:
In recent years, machine learning has seen an increasing presence in a large variety of fields, especially in health care and bioinformatics. More specifically, one class of machine learning algorithms that has found particularly wide application in these fields is Genetic Algorithms.
Objective:
The objective of this paper is to conduct a survey of articles published from 2015 onwards that deal with Genetic Algorithms and how they are used in bioinformatics.
Methods:
To achieve this objective, a scoping review was conducted that utilized Google Scholar alongside Publish or Perish and the Scimago Journal & Country Rank to search for reputable sources.
Results:
Upon analyzing 31 articles from the field of bioinformatics, it became apparent that genetic algorithms rarely form a full application on their own; instead, they rely on other vital algorithms such as support vector machines. Indeed, support vector machines were the most prevalent algorithms used alongside genetic algorithms (GA); however, while the usage of such algorithms contributes to the heavy focus on accuracy by GA programs, it often sidelines computation time in the process. In fact, most applications employing GAs for classification and feature selection are nearing, or at, a 100% success rate, and the focus of future GA development should be directed elsewhere.
Conclusion:
Population-based searches, like GA, are often combined with other machine learning algorithms. In this scoping review, genetic algorithms combined with Support Vector Machines were found to perform best. The performance metric evaluated most often was accuracy. Measuring the accuracy avoids measuring the main weakness of GAs, which is computational time. The future of genetic algorithms could be "open-ended" evolutionary algorithms, which attempt to increase complexity and find diverse solutions, rather than optimize a fitness function and converge to a single "best" solution from the initial population of solutions.
Keywords:
Genetic Algorithm; Bioinformatics; Machine Learning; Feature Selection; Datasets
Genetic Algorithms
Genetic Algorithms (GA) belong to the larger class of evolutionary algorithms. A GA is a parallel search heuristic inspired by Charles Darwin's theory of natural selection and modeled on the guiding principle of Survival of the Fittest [1]. The algorithm selects the fittest individuals of the population with the aim of producing offspring for the next generation that inherit the optimal characteristics of the parents. This process continues to iterate, developing sequential populations, until it converges on a generation with the fittest individuals [2]. A GA solves problems by optimizing a single criterion, known as a fitness function. The fitness function estimates importance by assigning each chromosome a value that relates to its ability to solve the problem [2, 3]. A chromosome could be an array of numbers, a binary string, or a list of instances in a database, depending on the problem. Each individual in the population represents a different possible solution. Chromosomes deemed fitter have an increased likelihood of being used in the following generation. The individuals proceed through a process of evolution governed by the principles of mutation, selection, and crossover, all of which impact the fitness value [2, 4]. The most noteworthy benefit of GA is its ability to search sophisticated and massive spaces proficiently and identify near-optimal solutions rapidly [3]. Often, in order to achieve better performance, GA-selected features are applied as input to classifiers [2].
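The loop just described can be made concrete in a few lines. The following is a minimal illustrative sketch, not taken from any surveyed paper: chromosomes are binary strings, selection is a simple two-way tournament, and the fitness function is the classic "one-max" count of 1 bits. The operator choices and parameter values are assumptions for demonstration only.

```python
import random

def genetic_algorithm(fitness, n_bits=20, pop_size=50, generations=100,
                      crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Minimal GA: evolve binary chromosomes that maximize `fitness`."""
    rng = random.Random(seed)
    # Initial population of random chromosomes (binary strings).
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]

        def select():
            # Two-way tournament: fitter individuals reproduce more often.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[a] if scores[a] >= scores[b] else pop[b]

        children = []
        while len(children) < pop_size:
            p1, p2 = select(), select()
            if rng.random() < crossover_rate:  # single-point crossover
                cut = rng.randrange(1, n_bits)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            for c in (c1, c2):  # bit-flip mutation
                for i in range(n_bits):
                    if rng.random() < mutation_rate:
                        c[i] = 1 - c[i]
                children.append(c)
        pop = children[:pop_size]
    return max(pop, key=fitness)

# "One-max" fitness: the number of 1 bits; the GA drives it toward all ones.
best = genetic_algorithm(sum)
```

After a hundred generations the population has all but converged on the all-ones chromosome, illustrating how selection pressure plus crossover and mutation steer the search.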
Popularity of Genetic Algorithms in Biomedical Applications
While the properties accredited to GAs make them desirable to a variety of fields, their use in biomedical applications is far-ranging and well-established, as shall be made evident in this article. In the medical field, GA-based solutions have been proposed for a variety of problems, including symptom and ailment classification [3, 4, 5], visualization [6], and the identification and diagnosis of diseases [2, 7]. GA-based solutions have also increasingly been used at the molecular level in tasks such as handling and predicting transposon-derived piRNAs [8]. Yet the importance of GA-based solutions in the medical field is not limited to problems at the microscopic scale, as applications have been developed to handle larger-scale infrastructure and logistics that can be vital for entire health care systems [9, 10]. Among the most frequent uses of GAs, however, is their role in feature selection, where they help to narrow down the possible features so that a complementary algorithm can achieve far greater performance [7, 11, 12, 13, 14]. At times, a GA-based solution may involve the GA filling multiple of the above-mentioned roles, such as finding usage in both feature selection and classification. Of course, other GA applications beyond what has already been mentioned exist; however, the applications mentioned here show just how important GA has become in the biomedical field, and these are the most common uses found in the papers surveyed in this article.
Key Findings of the Survey
While conducting research, a few key points were discerned that frequently appeared in the papers selected for this survey. These are summarized below.
• Applications often use GA alongside other machine learning algorithms, most commonly classification algorithms.
• Among classification engines used in conjunction with GA, Support Vector Machines (SVM) are the top performers.
• Accuracy is one of the prime evaluation metrics focused on, while computation time is often ignored or under-performing for usage in live biomedical situations.
• In general, applications employing GAs for classification and feature selection are reaching close-to-perfect, and at times even perfect, results.
Structure of the Paper
The following sections of this survey article are organized as follows: Section 2 focuses on the thirty-one papers surveyed for this article. This section first discusses the methodology explaining how the papers were selected before discussing the biomedical issues the papers investigate. Section 2 concludes with a discussion of the common data sets and tools used within the papers. In Section 3, the focus is on how the researchers evaluate their studies, with the various performance metrics used being examined and explained to discern the advantages and disadvantages of prioritizing one metric over another. Next, in Section 4, this article briefly discusses the future of GA. The final section concisely concludes the findings of this survey article.
Paper Selection
The searching procedure proposed in this survey aims to outline a simple yet effective sequence of operations to identify and select high-quality manuscripts published in journals. Utilizing Google Scholar and/or Publish or Perish [15], the first step was to establish the date range of the journals published, starting with 2015 and proceeding onward. This survey focuses on the applications of GA, which yields a wide range of possibilities. Therefore, in order to narrow the scope, additional key search terms were needed. In step two, additional key terms, such as biomedical/medicine and machine learning, were used alongside the main search term. Once a paper was identified, it was added to a list of prospective sources. The quality of the paper was examined in step three by utilizing the Scimago Journal & Country Rank (SJR) [16] to assess the quality of the journal where the paper had been published. Papers published in journals ranked Q2, Q3, or Q4 were immediately removed from the list, and papers published in journals ranked Q1 at the time of publication were kept. Once a paper met the quality criterion for its journal ranking, step four checked whether the GA had a dominant role or was used as a key element in the paper. If the paper had neither, it was removed from the list. Papers in which GA served a dominant role or was used as a key element were kept, further analyzed, and contributed to this survey. Therefore, each paper had to meet all of the above requirements to be selected. The whole process is illustrated as a flowchart in Figure 1. As a result of this searching methodology, a total of 31 papers were selected for this survey; they can be found in Table 1.
Applications of GA in Bioinformatics
Using the searching procedure described above, Table 1 provides a summary containing key information on the papers selected for this survey. In addition, Table 3 shows the extent to which results could be replicated to obtain findings similar to those of the papers studied in this survey. Yet Table 3 also serves to highlight a concerning issue, as it shows how few papers provide the information necessary for others to reproduce their results. All chosen papers discuss possible and proposed biomedical applications, are limited to SJR Q1 rankings, and date from 2015 or later. Key findings included in Table 1 in addition to the biomedical application were examined, the use of GA was noted, and the benefits of the proposed application were identified. Nineteen of the 31 papers surveyed mention the GA playing a key role in feature selection. Feature selection is a data pre-processing technique that reduces the overall number of features by eliminating redundant ones [1]. The task of feature selection is to extract those features that are deemed the most informative and important in predicting the outcome for an individual [2]. This technique is an essential step in reducing the dimensionality of the search space and the computational complexity. Alongside feature selection, GAs are commonly used in classification programs. About half of the papers surveyed, 16 out of 31, use a form of classification. Classification aims to predict outcomes associated with a particular individual given a feature vector describing that individual. GA provides an efficient and robust feature selection algorithm that speeds up the learning process of classifiers and stabilizes the classification accuracy. Within bioinformatics, feature selection and classification both serve vital roles and can often be found within the same program, with the GA selecting features that are then used by a separate algorithm to assign a label, which may be a diagnosis of a general disease or even the identification of symptoms. In recent years, GA-based applications have developed to not only identify ailments but also recommend what treatments should be used to combat an ailment appearing in different patients [5].
GA has also been utilized in non-standard implementations, such as running multiple GAs in parallel [11] or nesting them inside one another as in [7], which has allowed for the diagnosis and identification of different cancer biomarkers. Indeed, non-standard implementations have even allowed for a hybrid GA-based application that can determine the person who would receive the highest quality-of-life improvement from a lung transplant, helping to ensure that any unforeseen bias does not affect the transplant [12]. Additionally, GAs have been used for imaging and visualizing applications, both due to their importance in feature selection and their ability to combine representations of learned information, such as known shapes and relative position, into a single framework that can be used in three-dimensional segmentation [17]. Finally, GAs have been employed to handle logistics, both in managing complex hospital supply chains [9] and in optimizing ambulance dispatches to non-emergency situations [10]. It can therefore be easily seen that bioinformatics research entails many problems that can be solved using machine learning, and that GA is well-suited for such tasks. Yet it is important that research conducted in this area be highly accurate, efficient, and reliable in order for the results to be meaningful. Solutions need to be prompt and able to withstand the volatile situations found in this field, especially since they are becoming prevalent in nearly every aspect of bioinformatics.
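As a rough illustration of the wrapper pattern described above, the sketch below evolves a binary feature mask whose fitness is the in-sample accuracy of a stand-in nearest-centroid classifier, used here in place of the SVMs and other classifiers in the surveyed papers, minus a small penalty per retained feature. The toy dataset, parameter values, and scoring choices are all hypothetical, for demonstration only.

```python
import random

rng = random.Random(1)

# Toy dataset: 8 features, but only features 0 and 3 carry the label signal.
def make_row(label):
    row = [rng.gauss(0, 1) for _ in range(8)]
    row[0] += 3 * label
    row[3] -= 3 * label
    return row

labels = [rng.randint(0, 1) for _ in range(60)]
data = [make_row(y) for y in labels]

def score(mask):
    """Wrapper fitness: nearest-centroid accuracy on the selected features,
    minus a small penalty per feature (a stand-in for a real classifier)."""
    sel = [j for j, b in enumerate(mask) if b]
    if not sel:
        return 0.0
    cent = {}
    for c in (0, 1):
        rows = [r for r, y in zip(data, labels) if y == c]
        cent[c] = [sum(r[j] for r in rows) / len(rows) for j in sel]
    def predict(r):
        return min((sum((r[j] - cent[c][k]) ** 2 for k, j in enumerate(sel)), c)
                   for c in (0, 1))[1]
    acc = sum(predict(r) == y for r, y in zip(data, labels)) / len(data)
    return acc - 0.02 * len(sel)  # penalize large feature subsets

# Generational GA over binary feature masks.
pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(30)]
for _ in range(40):
    scored = sorted(pop, key=score, reverse=True)
    pop = [m[:] for m in scored[:6]]  # elitism: keep the best masks
    while len(pop) < 30:
        p1, p2 = rng.sample(scored[:15], 2)
        cut = rng.randrange(1, 8)
        child = p1[:cut] + p2[cut:]     # single-point crossover
        if rng.random() < 0.3:          # occasionally flip one bit
            i = rng.randrange(8)
            child[i] = 1 - child[i]
        pop.append(child)

best_mask = max(pop, key=score)
```

Because only the two informative columns separate the classes, the evolved mask keeps at least one of them while the penalty term prunes the noise features, which is precisely the dimensionality reduction role the surveyed papers assign to the GA.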
Datasets
In order to learn more about how the papers selected for this survey came to their conclusions, a closer look was given to the data used and its sources. Out of the 31 surveyed papers, not a single one used the exact same raw data. Three general patterns emerge from the diversity of datasets.
The most common method of data acquisition in the 31 papers was accessing digital repositories to find datasets relevant to the topic of the paper. These repositories act as a tool, compiling datasets that are available to the public and therefore allowing researchers to focus on their project immediately rather than having to conduct a multitude of tests just to acquire data for testing. Some examples of repositories seen in the surveyed papers are as follows.
• The UCSC Genome Browser, used by both Li et al. (2016) and Tangherloni et al. (2019), provides access to assembled genomes, including the human genome [8, 18].
• The Gene Expression Omnibus, used by Sayed et al. (2019), provides more specialized data related to genomics and is itself part of the National Center for Biotechnology Information data resources [7].
• The Protein Data Bank, used by Moraes et al. (2017), provides data relating to a wide selection of proteins and related components [19].
Besides acquiring data from public repositories, another method of data acquisition employed by some of the surveyed papers was requesting access to data that is generally kept private. Among the sources for this type of data, private databases curated by institutions were the most common. It is important to note that not all required a paper's authors to be a member of the institution, as is the case in Oztekin et al. (2018), who accessed their data from the United Network for Organ Sharing [12]. In addition, some data sources originate from entities whose primary concern was not data curation, but who could grant access to records of their regular functions. One instance of such data collection can be seen in the work of Fogue et al. (2016), who received their data from an ambulance company based in Huesca, Spain [10].
The final method of data acquisition was used by only a minority of the papers surveyed: creation of the data by the project members [20]. This final method, although necessary in cases where the needed data is not available, does not ensure an unbiased result and consumes significant time for properly compiling the information. Indeed, it would appear that, due to these downsides, this method of data acquisition is far from favoured.
Despite the prevalence of acquiring data from pre-existing sources, the raw data acquired often has to go through preprocessing before it is used. What this entails can differ widely depending on the source of the data and its intended purpose; however, most commonly the goal is to narrow down the raw data into a set deemed usable for the project. Such a process may be necessary because, in some cases, a) the raw dataset does not have enough records, b) not all records are complete, or c) records are not usable (too much noise) [13]. A summary of the datasets used by the 31 surveyed papers and their sources can be found in Table 2.
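The narrowing step described above can be as simple as filtering out records that trip conditions b) or c). The sketch below is purely illustrative, and its field names (`expr`, `label`, `noise`) are hypothetical rather than drawn from any surveyed dataset.

```python
def preprocess(records, required_fields, max_noise=0.2):
    """Keep only records that are complete and whose upstream noise
    estimate is acceptable. Thresholds and fields are hypothetical."""
    usable = []
    for rec in records:
        if any(rec.get(f) is None for f in required_fields):
            continue  # condition b): incomplete record
        if rec.get("noise", 0.0) > max_noise:
            continue  # condition c): too noisy to use
        usable.append(rec)
    return usable

raw = [
    {"id": 1, "expr": 0.8, "label": "A", "noise": 0.05},
    {"id": 2, "expr": None, "label": "B", "noise": 0.01},  # incomplete
    {"id": 3, "expr": 0.4, "label": "A", "noise": 0.35},   # noisy
]
clean = preprocess(raw, ["expr", "label"])
```

Only the first record survives the filter; in practice this step shrinks the raw dataset to the usable subset before the GA ever sees it.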
Tools
In addition to looking at what datasets the surveyed papers use, this paper takes a look at the tools and additional machine learning algorithms employed alongside the GA, although a few papers rely solely on GA. Indeed, when looking at the surveyed papers, it would appear that GA-focused solutions benefit the most when they are supported by complementary tools and algorithms. The use of components is much like that of the datasets mentioned above, in that a wide variety was used across studies to achieve the goal of each particular paper. However, unlike the datasets, a few tools and additional machine learning algorithms were employed across multiple papers fairly regularly. The full selection of tools and machine learning algorithms employed has been compiled in Table 4.
Amongst the 31 surveyed articles, two tools proved to be the most prevalent. The first of these is MATLAB, which is used in [6, 9, 11, 21, 17, 22, 23, 13]. The second tool is Weka, which sees usage in [24, 2, 25, 21]. MATLAB is a fairly well-known and important tool in studies such as signal processing, data analytics, image processing, and machine learning, partially due to its versatility. In fact, even though all the surveyed papers have a focus on GAs, the way that MATLAB is utilized varies from paper to paper. For instance, Soufan et al. (2015) make only limited use of MATLAB, to ensure fairness when evaluating programs [11]. Pławiak (2018) uses MATLAB alongside the library LIBSVM to implement their study [13].
Weka is a more specialized tool that provides an environment for classification, regression, clustering, and feature selection. It accomplishes this by aiding its users in the extraction of information and helping them find suitable algorithms for creating accurate predictive models with that information [26]. Although Weka has a far smaller toolbox, it can be ideal for researchers working in bioinformatics due to its focus. Indeed, both of these tools have proven beneficial for a number of the surveyed articles, as shown by Hashem et al. (2017), who use both tools to perform algorithms such as Particle Swarm Optimization [21].
Throughout the surveyed articles, additional machine learning algorithms are often used alongside the GA, where they predominantly serve as classification algorithms. The goal of such algorithms is to successfully predict the correct outcome associated with a particular occurrence after having received a selection of features that describe the occurrence [26]. A vast number of these algorithms are used in the articles surveyed, including different types of Neural Networks (NN), as seen in Table 4; however, the most common is the Support Vector Machine (SVM). SVMs are frequently used in biomedical applications, and this survey shows that the addition of GA does not change this fact. One of the biggest appeals of SVMs is their near-perfect success rate and their perceived simplicity of simply assigning labels to objects based on which side of a hyperplane they end up on [27]. Computation requirements for the SVM scale quadratically, resulting in longer run times as data inputs increase [27]. This in itself is not necessarily a current negative; however, as applications become more complex, the SVM's quadratic run-time growth should not be ignored in future works employing it alongside GA.
Performance Metrics
A key step in the process of building a machine learning model is to estimate its performance on data that was not part of building the model. The data used to evaluate the performance of the model is called the testing set, while the data used to build the model is called the training set. A primary concern for any machine learning prediction model is avoiding a model with either high bias or high variance. Bias is the error resulting from a wrong assumption. A model with high bias oversimplifies; this is also known as underfitting. It results in a large error between the test set outcome value and the model prediction. Variance is the error from the model being overly sensitive to fluctuations in the training set. High variance can cause an algorithm to model the noise in the data, which results in model overfitting. High variance decreases flexibility and reduces the ability of the model to generalize to unseen instances. A visualization of the trade-offs made between bias and variance can be seen in Figure 2.
The confusion matrix is a key concept related to the performance metrics of a classifier model. The confusion matrix is simply a square matrix that records the counts of the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions of a classifier. The true positive rate (TPR) is calculated as the number of true positives divided by the sum of the true positives and the false negatives,
TPR = TP / (TP + FN)   (1)
The false positive rate (FPR) is calculated as the number of false positives divided by the sum of the false positives and the true negatives,
FPR = FP / (FP + TN)   (2)
One dimension of the confusion matrix represents the instances in a predicted class, while the other dimension represents the instances in the actual class (ground truth). If the predicted class is the same as the ground truth, then the confusion matrix records the sample as a true prediction, otherwise as a false one [28]. The precision is defined as the ratio of the true positives to the sum of the true positives and the false positives,
Precision = TP / (TP + FP)   (3)
The recall is defined as the ratio of the true positives to the sum of the true positives and the false negatives,
Recall = TP / (TP + FN)   (4)
The F score is defined as two divided by the sum of the inverse of the recall and the inverse of the precision,
F = 2 / (recall^-1 + precision^-1)   (5)
Receiver Operating Characteristic (ROC) graphs are useful tools to select models for classification based on performance with respect to the false positive rate (FPR) and true positive rate (TPR), which are computed by shifting the decision threshold of the classifier. The diagonal of an ROC graph represents random guessing (50 percent probability of being correct), and classification models that fall below this line are considered worse than random guessing. A perfect classifier would fall into the top left corner of the graph, with a TPR of 1 and an FPR of 0. Based on the ROC curve, the area under the curve can be computed to characterize the performance of the classification model [28].
The prediction error and accuracy provide general information regarding the performance of the prediction model. The error can be understood as the sum of the false predictions divided by the total number of predictions,
Error = (FP + FN) / (TP + TN + FP + FN)   (6)
The accuracy is calculated as the sum of the correct predictions divided by the total number of predictions. More precisely, accuracy is the ratio of the number of correct predictions (the sum of the true positives and true negatives) to the total number of predictions from the model (the sum of the true positives, true negatives, false positives, and false negatives),
Accuracy = (TP + TN) / (TP + TN + FP + FN)   (7)
There are many methods to evaluate the performance of a model. Each performance metric has certain advantages and disadvantages based on the data, such as the number of classes in the prediction variable, the number of instances of each class or how imbalanced the outcome class happens to be, and the cost of misclassifying a prediction. In medicine, misclassification can be deadly. The discussion of advantages and disadvantages will focus on accuracy, as it was the most common performance metric. Some attention will also be paid to the true positive rate and false positive rate, as they offer more nuanced metrics, especially in relation to biomedical applications. The metrics used by each surveyed paper can be found in Table 5.
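Equations (1) through (7) can all be computed directly from the four confusion-matrix counts, as in the short sketch below; the example counts are invented purely for illustration.

```python
def metrics(tp, tn, fp, fn):
    """Compute the performance metrics of equations (1)-(7) from the
    confusion-matrix counts of a binary classifier."""
    total = tp + tn + fp + fn
    tpr = tp / (tp + fn)                        # (1) true positive rate
    fpr = fp / (fp + tn)                        # (2) false positive rate
    precision = tp / (tp + fp)                  # (3)
    recall = tpr                                # (4) same quantity as TPR
    f_score = 2 / (1 / recall + 1 / precision)  # (5) harmonic mean
    error = (fp + fn) / total                   # (6)
    accuracy = (tp + tn) / total                # (7)
    return {"TPR": tpr, "FPR": fpr, "precision": precision,
            "recall": recall, "F": f_score, "error": error,
            "accuracy": accuracy}

# Hypothetical confusion matrix: 40 TP, 45 TN, 5 FP, 10 FN.
m = metrics(tp=40, tn=45, fp=5, fn=10)
```

For these counts the accuracy is 0.85 and the error 0.15, while the TPR of 0.8 makes visible the ten positive cases the classifier missed, detail that the accuracy alone hides.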
Advantages
Accuracy is a simple performance metric to compute and the most intuitive evaluation method. It is the most common metric, so it is often used for comparison with other models in the literature.
The true positive rate and false positive rate are especially useful for imbalanced class problems. For example, in tumour diagnosis, the detection of malignant tumours is the primary concern, since missing the potential presence of a tumour could have serious implications, like death. However, it is also important to decrease the number of benign tumours that are incorrectly classified as malignant (false positives) so as not to unnecessarily concern a patient. The true positive rate provides useful information about the fraction of positive (or relevant) samples that were correctly identified out of the total number of positives. In medicine, the samples tend to be imbalanced, so the true positive rate and false positive rate will often be the most appropriate performance metrics.
An ROC graph is a useful tool to visualize the true positive rate and false positive rate. Finding the area under the curve is a simple method to determine the performance of the model.
Disadvantages
Accuracy was the primary performance metric used in this scoping review. However, it has some limitations that are important to consider, especially in the medical domain. It is only a reliable performance metric when the number of samples is equal for each class (no imbalance). For example, consider a case where 99 percent of samples belong to class A and only 1 percent to class B. Then it is trivial for the model to obtain 99 percent accuracy by simply predicting every instance to belong to class A. If the identical model is evaluated on a different test set, the accuracy can be significantly reduced. For example, if the test set has 60 percent of its samples from class A and 40 percent from class B, then the accuracy would plummet to 60 percent. This example illuminates the potential for the accuracy metric to be misleading, which can lead to assuming the model is better than it really is. In the medical field, the price of misclassifying a sample has the potential to be extremely costly. If the model is attempting to predict a rare but fatal disease, the cost of failing to diagnose the disease in a sick person is much greater than the cost of sending a healthy person for more tests.
The papers mostly failed to evaluate a major drawback of GA, which is the amount of computation it requires. In traditional machine learning, such as neural networks, the model improves as the amount of training data increases. However, the performance of a GA might degrade before it improves. GAs also keep a population of solutions, instead of a single solution. These requirements of GA are computationally costly and should be evaluated as a performance metric whenever considering a genetic algorithm as a learning algorithm [14].
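The 99/1 scenario above is easy to reproduce: a model that always predicts the majority class scores 99 percent accuracy while detecting none of the rare-class cases.

```python
# Majority-class baseline on an imbalanced test set:
# 99 samples of class "A" (healthy), 1 sample of class "B" (sick).
y_true = ["A"] * 99 + ["B"]
y_pred = ["A"] * 100  # model always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the rare class "B": fraction of actual B's detected.
tp = sum(t == p == "B" for t, p in zip(y_true, y_pred))
recall_b = tp / y_true.count("B")

# accuracy == 0.99, yet recall_b == 0.0: every sick patient is missed.
```

The same model evaluated on a 60/40 split would drop to 60 percent accuracy, which is exactly the instability the text describes.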
The Future of Genetic Algorithms
Some of the founders of computer science, such as Alan Turing, John von Neumann, and Norbert Wiener, were motivated by the idea of providing computer programs with operations like self-replication and adaptation [14]. These motivations have been explored in various areas of research, such as evolution strategies, evolutionary programming, and genetic algorithms. These efforts grew into the field known as evolutionary computation, of which GAs are the most prominent example.
GAs are a powerful tool for solving problems and for simulating natural systems in a wide range of scientific fields, and they are promising approaches for solving challenging technological problems. GAs are an important area of research in machine learning, especially working together with other approaches such as neural networks. GAs are part of a movement in computer science that explores biologically-inspired approaches to computation. These systems are adaptable, parallel, able to handle complexity, able to learn, and even creative [14]. Furthermore, the computing resources that are now widely available, allowing for unprecedented parallel processing, are well-suited to implementing GAs.
GAs attempt to model natural evolution, using operators such as adaptation, selection, crossover, and mutation. This approach retains a population of solutions that converges on the objective, which is a form of black-box optimization. However, natural evolution is a process that ceaselessly creates greater complexity and novelty, rather than a process that converges on a single solution. In fact, evolution on Earth can be thought of as a single run of a single algorithm that invented all of nature [29]. Another term for the notion of a single process inventing massive complexity for near-eternity is "open-ended." Open-endedness has proven impossible to program; presently, no algorithm exists that has the endless, prolific creative potential of natural evolution.
Currently, most evolutionary algorithms (EAs) converge to a solution based on the fitness function that is chosen. The fitness function, which tends to select the "best"-performing individuals in the population of solutions, acts as an objective that is optimized. The optimization consists of selecting more of the fitter solutions on average, while only selecting a minority of other, less fit solutions to maintain some diversity. However, the divergence of natural evolution and its "open-endedness" are not captured by this approach. Natural evolution is not structured like an optimization algorithm, as there is no explicit objective, and organisms are often rewarded for being different rather than just better. For example, organisms that are sufficiently different from their predecessors can establish a new niche in which they benefit from reduced competition and are therefore more likely to survive [30, 31]. In opposition to optimization algorithms that converge to a single "best" solution, natural evolution has a tendency toward divergence. This suggests an alternative perspective in evolutionary computation: that evolution is an algorithm for diversification rather than optimization [32].
An EA inspired by this perspective is novelty search (NS), which searches for behavioural diversity without any explicit objective. In some domains, NS finds the global optimum even when objective-based searches consistently fail [32]. An algorithm that avoids an objective function is able to find solutions that are not reachable by attempting to solve for them directly with objectives.
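A toy sketch of the novelty-search idea follows: individuals are selected for how far their behaviour lies from behaviours already seen (stored in an archive), with no objective fitness anywhere in the loop. The genome, the behaviour descriptor, and every parameter here are invented for illustration and are far simpler than NS implementations in the literature.

```python
import random

rng = random.Random(2)

def behavior(genome):
    """Behaviour descriptor: collapse the genome to a 2-D point."""
    return (sum(genome[::2]), sum(genome[1::2]))

def novelty(b, archive, k=5):
    """Mean distance to the k nearest behaviours seen so far."""
    if not archive:
        return float("inf")
    dists = sorted(((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2) ** 0.5
                   for a in archive)
    return sum(dists[:k]) / min(k, len(dists))

pop = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(20)]
archive = []
for _ in range(30):
    # Rank by novelty, not by any objective fitness function.
    scored = sorted(pop, key=lambda g: novelty(behavior(g), archive),
                    reverse=True)
    archive.extend(behavior(g) for g in scored[:3])  # remember what was seen
    # Reproduce the most novel individuals with small Gaussian mutations.
    pop = [[x + rng.gauss(0, 0.1) for x in g] for g in scored[:10] for _ in (0, 1)]
```

The archive grows with every generation, so the only way to score well is to behave unlike anything recorded before, which is the divergent pressure the text contrasts with fitness-driven convergence.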
This insight has implications beyond GA, such as in the pursuit of "human-level" AI, since it captures what many consider our most human-like quality: creativity.
A potentially fruitful application for open-ended evolutionary algorithms is any sort of creative design. This includes the design of cars, art, medicines, robots, video games, and so on. Open-ended evolutionary algorithms offer the potential to generate endless alternatives in almost any conceivable design domain, in the same way that natural evolution generated endless solutions to the problems of survival and reproduction in nature [29].
There are many potential biomedical applications for open-ended evolutionary algorithms. One would be the development of vaccines. The open-ended algorithm could search the space of possibilities while simultaneously finding solutions that work in each environment. Provided some initial set of rules that describe what is possible biologically, the algorithm could continuously explore this space of possibilities and report any number of potentially useful findings to researchers to investigate further.
Conclusion
Population-based searches like GAs are often combined with other machine learning algorithms. In classification problems, a GA maintains a population of solutions, rather than a single solution. In this scoping review, GAs combined with Support Vector Machines were found to perform best. The performance metric evaluated most often was accuracy, which avoids measuring the main weakness of GA: computational time. In an attempt to better utilize the power of GAs, the future of GAs could be "open-ended" evolutionary algorithms, which attempt to increase complexity and find diverse solutions, rather than optimize a fitness function to find a single "best" solution. This approach attempts to model the most powerful feature of natural evolution, its endless ability to create novel and creative solutions to fit an environment that is constantly changing.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
The first three authors contributed equally to the development of this research article. The last author provided supervision and guidance.
Acknowledgements
The authors would like to acknowledge the infrastructure support provided by the CASES Building at Lakehead University.
References
1. Lu, H., Chen, J., Yan, K., Jin, Q., Xue, Y., Gao, Z.: A hybrid feature selection algorithm for gene expression data classification. Neurocomputing, 56–62 (2017)
2. Aličković, E., Subasi, A.: Breast cancer diagnosis using GA feature selection and rotation forest. Neural Computing and Applications (4), 753–763 (2017)
3. Salem, H., Attiya, G., El-Fishawy, N.: Classification of human cancer diseases by gene expression profiles. Applied Soft Computing, 124–134 (2017)
4. Subasi, A., Kevric, J., Canbaz, M.A.: Epileptic seizure detection using hybrid machine learning methods. Neural Computing and Applications (1), 317–325 (2019)
5. Zhang, P., West, N.P., Chen, P.-Y., Thang, M.W., Price, G., Cripps, A.W., Cox, A.J.: Selection of microbial biomarkers with genetic algorithm and principal component analysis. BMC Bioinformatics (6), 413 (2019)
6. Mohammed, M.A., Ghani, M.K.A., Arunkumar, N., Hamed, R.I., Abdullah, M.K., Burhanuddin, M.: A real time computer aided object detection of nasopharyngeal carcinoma using genetic algorithm and artificial neural network based on Haar feature fear. Future Generation Computer Systems, 539–547 (2018)
7. Sayed, S., Nassef, M., Badr, A., Farag, I.: A nested genetic algorithm for feature selection in high-dimensional cancer microarray datasets. Expert Systems with Applications, 233–243 (2019)
8. Li, D., Luo, L., Zhang, W., Liu, F., Luo, F.: A genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs. BMC Bioinformatics (1), 329 (2016)
9. Khanduzi, R., Sangaiah, A.K.: A fast genetic algorithm for a critical protection problem in biomedical supply chain networks. Applied Soft Computing, 162–179 (2019)
10. Fogue, M., Sanguesa, J.A., Naranjo, F., Gallardo, J., Garrido, P., Martinez, F.J.: Non-emergency patient transport services planning through genetic algorithms. Expert Systems with Applications, 262–271 (2016)
11. Soufan, O., Kleftogiannis, D., Kalnis, P., Bajic, V.B.: DWFS: a wrapper feature selection tool based on a parallel genetic algorithm. PLoS ONE (2) (2015)
12. Oztekin, A., Al-Ebbini, L., Sevkli, Z., Delen, D.: A decision analytic approach to predicting quality of life for lung transplant recipients: A hybrid genetic algorithms-based methodology. European Journal of Operational Research (2), 639–651 (2018)
13. Pławiak, P.: Novel methodology of cardiac health recognition based on ECG signals and evolutionary-neural system. Expert Systems with Applications, 334–349 (2018)
14. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press (1998)
15. Publish or Perish. https://harzing.com/resources/publish-or-perish
16. Scimago Journal & Country Rank.
17. Ghosh, P., Mitchell, M., Tanyi, J.A., Hung, A.Y.: Incorporating priors for medical image segmentation using a genetic algorithm. Neurocomputing, 181–194 (2016)
18. Tangherloni, A., Spolaor, S., Rundo, L., Nobile, M.S., Cazzaniga, P., Mauri, G., Liò, P., Merelli, I., Besozzi, D.: GenHap: a novel computational method based on genetic algorithms for haplotype assembly. BMC Bioinformatics (4), 172 (2019)
19. Moraes, J.P., Pappa, G.L., Pires, D.E., Izidoro, S.C.: GASS-WEB: a web server for identifying enzyme active sites based on genetic algorithms. Nucleic Acids Research (W1), 315–319 (2017)
20. Liu, P., El Basha, M.D., Li, Y., Xiao, Y., Sanelli, P.C., Fang, R.: Deep evolutionary networks with expedited genetic algorithms for medical image denoising. Medical Image Analysis, 306–315 (2019)
21. Hashem, S., Esmat, G., Elakel, W., Habashy, S., Raouf, S.A., Elhefnawi, M., Eladawy, M.I., ElHefnawi, M.: Comparison of machine learning approaches for prediction of advanced liver fibrosis in chronic hepatitis C patients. IEEE/ACM Transactions on Computational Biology and Bioinformatics (3), 861–868 (2017)
22. Hemanth, D.J., Anitha, J.: Modified genetic algorithm approaches for classification of abnormal magnetic resonance brain tumour images. Applied Soft Computing, 21–28 (2019)
23. Tan, M.S., Tan, J.W., Chang, S.-W., Yap, H.J., Kareem, S.A., Zain, R.B.: A genetic programming approach to oral cancer prognosis. PeerJ, 2482 (2016)
24. Al-Rajab, M., Lu, J., Xu, Q.: Examining applying high performance genetic data feature selection and classification algorithms for colon cancer diagnosis. Computer Methods and Programs in Biomedicine, 11–24 (2017)
25. Gangavarapu, T., Patil, N.: A novel filter–wrapper hybrid greedy ensemble approach optimized using the genetic algorithm to reduce the dimensionality of high-dimensional biomedical datasets. Applied Soft Computing, 105538 (2019)
26. Frank, E., Hall, M., Trigg, L., Holmes, G., Witten, I.H.: Data mining in bioinformatics using Weka. Bioinformatics (15), 2479–2481 (2004)
27. Noble, W.S.: What is a support vector machine? Nature Biotechnology (12), 1565–1567 (2006)
28. Chawla, N.V.: Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook, pp. 875–886. Springer (2009)
29. Lehman, J., Stanley, K.O.: Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation (2), 189–223 (2011)
30. Kirschner, M., Gerhart, J.: Evolvability. Proceedings of the National Academy of Sciences (15), 8420–8427 (1998)
31. Lehman, J., Stanley, K.O.: Evolvability is inevitable: Increasing evolvability without the pressure to adapt. PLoS ONE (4) (2013)
32. Pugh, J.K., Soros, L.B., Stanley, K.O.: Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 40 (2016)
33. Ramadan, E., Naef, A., Ahmed, M.: Protein complexes predictions within protein interaction networks using genetic algorithms. BMC Bioinformatics (7), 269 (2016)
34. Lee, N.K., Li, X., Wang, D.: A comprehensive survey on genetic algorithms for DNA motif prediction. Information Sciences, 25–43 (2018)
35. Corus, D., Oliveto, P.S.: Standard steady state genetic algorithms can hillclimb faster than mutation-only evolutionary algorithms. IEEE Transactions on Evolutionary Computation (5), 720–732 (2017)
36. Ansótegui, C., Malitsky, Y., Samulowitz, H., Sellmann, M., Tierney, K.: Model-based genetic algorithms for algorithm configuration. In: Twenty-Fourth International Joint Conference on Artificial Intelligence (2015)
37. Bhardwaj, A., Tiwari, A., Krishna, R., Varma, V.: A novel genetic programming approach for epileptic seizure detection. Computer Methods and Programs in Biomedicine, 2–18 (2016)
38. Tan, C.H., Tan, M.S., Chang, S.W., Yap, K.S., Yap, H.J., Wong, S.Y.: Genetic algorithm fuzzy logic for medical knowledge-based pattern classification. Journal of Engineering Science and Technology, 242–258 (2018)
39. Dashtban, M., Balafar, M.: Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts. Genomics (2), 91–107 (2017)
40. La Cava, W., Silva, S., Danai, K., Spector, L., Vanneschi, L., Moore, J.H.: Multidimensional genetic programming for multiclass classification. Swarm and Evolutionary Computation, 260–272 (2019)
41. Devarriya, D., Gulati, C., Mansharamani, V., Sakalle, A., Bhardwaj, A.: Unbalanced breast cancer data classification using novel fitness functions in genetic programming. Expert Systems with Applications, 112866 (2020)

Figure 1 Finding Quality Papers
Search criteria used to identify papers for the article.
Figure 2 Bias Variance Trade-Offs
Visualization of underfitting and overfitting.
Table 1: Information about the Quality of the Papers. For each article: SJR rank; cites per year; year of publication; how the GA is used; benefits; biomedical applications.

[3] Q1; 23; 2017. GA use: Genetic programming used for cancer disease classification. Benefits: The IG/GA method improves classification accuracy by reducing the number of features and preventing the GA from being trapped by a local optimum. Application: Cancer classification.

[24] Q1; 6.67; 2017. GA use: GA for feature selection; combining GA and PSO for feature selection; classification using GP. Benefits: By selecting fewer genes, the classification algorithm takes less computational time; GA/DT and GA/GP yield the highest classification accuracy. Application: Colon cancer.

[1] Q1; 28.33; 2017. GA use: An Adaptive Genetic Algorithm (AGA) improves the conventional GA by adjusting the values of crossover and mutation probability; the adaptability increases robustness, increasing the chance of finding optimal solutions. Benefits: Combining MIM (Mutual Information Maximization) with the AGA eliminates redundant samples and reduces the dimension of the gene expression data. Application: General applications to biomedical datasets.

[2] Q1; 27.67; 2017. GA use: GA feature selection: extraction of information and significant features. Benefits: Reduces computational complexity and speeds up the data mining process; GA for feature selection combined with Rotation Forest resulted in the highest classification accuracy. Application: Breast cancer diagnosis.

[25] Q1; 2.00; 2019. GA use: GA optimizes the subspace ensembling process. Benefits: Optimizing with the GA outperforms selected base feature selection techniques in terms of prediction accuracy. Application: General applications to biomedical datasets.

[6] Q1; 18.50; 2018. GA use: Machine learning approaches based on the GA for feature selection. Benefits: Reduces overlap between classes, and reduces the number of features to enhance the time cost. Application: Visualizing border points for resection of nasopharyngeal carcinoma.

[9] Q1; 2.00; 2019. GA use: The GA results in high-quality solutions (accuracy and execution time). Benefits: GA-FBC (Fast Branch Cut Method) provides efficient solutions with regard to performance metrics. Application: Biomedical supply chain networks.

[4] Q1; 37.00; 2019. GA use: GA used to determine optimum parameters of the SVM. Benefits: Combining the GA with an SVM offers quick global optimizing ability. Application: Classification of EEG data for epileptic seizure detection.

[11] Q1; 13; 2015. GA use: Feature selection tool developed based on a GA. Benefits: Able to significantly reduce the number of features without sacrificing classification performance. Application: Feature selection for biomedical data.

[8] Q1; 9.25; 2016. GA use: Uses a GA-based weighted ensemble method to predict transposon-derived piRNAs. Benefits: Higher performance and robustness compared to similar methods. Application: Prediction of piRNAs.

[18] Q1; 8.00; 2019. GA use: GAs with tournament selection and elitism. Benefits: Speeds up the required computations, and can take into account datasets produced by third-generation sequencing technologies. Application: Helps solve the haplotyping problem.

[19] Q1; 1.33; 2017. GA use: The GA performs the search of the generated database. Benefits: A freely available method, through a web app, that ranks among the top (4th). Application: Identification of enzyme active sites allowing for non-exact matches.

[5] Q1; 1.00; 2019. GA use: GA used to find a subset of the principal components from a principal component analysis. Benefits: Use of principal component analysis before the GA improves the results of GA selection. Application: Helps identify what treatments should be done for different patients.

[33] Q1; 3.50; 2016. GA use: GA used to identify complexes in protein interaction networks. Benefits: The method allows for identifying clusterings with varying densities; it is more scalable and robust, and it can be tuned. Application: Used to detect dense and sparse protein clusters.

[7] Q1; 11.00; 2019. GA use: Uses two GAs: the outer GA serves as the main algorithm and outputs the subset of genes evaluated by an SVM; the inner GA takes data from DNA methylation and outputs a subset of CpG sites. Benefits: Far higher accuracy compared to other methods, and has been shown to be able to differentiate between lung cancer subtypes. Application: Identification of disease (cancer) biomarkers.

[34] Q1; 4.00; 2018. GA use: Compares the performance of multiple GAs. Benefits: N/A. Application: Guidelines for the development of GA-based solutions for DNA motif prediction.

[12] Q1; 17; 2018. GA use: GA used in feature selection while predicting quality of life. Benefits: The study included all UNOS features (after preprocessing), allowing for their effect to be assessed. Application: Minimize or eliminate personal bias in lung transplants by automation, helping to increase the rate of successful lung transplants.

[35] Q1; 10.67; 2018. GA use: Proves the benefits of crossover in genetic algorithms. Benefits: Established that a GA with crossover is 25 percent faster than mutation alone, with certain parameters. Application: N/A.

[21] Q1; 2.67; 2018. GA use: Finding the best features to predict advanced fibrosis. Benefits: The GA is able to work in parallel. Application: Predicting advanced fibrosis.

[36] Q1; 10.20; 2015. GA use: Automatic algorithm configuration. Benefits: Numerical results show that model-based genetic algorithms significantly improve the ability to effectively configure algorithms automatically. Application: N/A.

[17] Q1; 10.00; 2016. GA use: GA for combining representations of learned information, such as known shapes, regional properties, and relative positions of objects, into a single framework to perform automated three-dimensional segmentation. Benefits: GA-based methods are very useful for medical imaging applications. Application: GA tested for prostate segmentation on pelvic computed tomography and magnetic resonance images.

[22] Q1; 5.00; 2018. GA use: Three different modified genetic algorithm approaches are proposed for feature selection. Benefits: The number of features is reduced, decreasing the dimensionality of the features. Application: Magnetic resonance brain image classification.

[37] Q1; 8.25; 2016. GA use: Classification. Benefits: Proposes a constructive genetic programming approach that increases the number of useful "building blocks". Application: Classifying EEG signals.

[23] Q1; 1.50; 2016. GA use: Feature selection. Benefits: Compared against support vector machines and logistic regression and performed better. Application: Recognition of cancerous cells and gene expression profiling data.
Table 4: Tools Used. For each article: tools; additional ML algorithms utilized / validation.

[3] Tools: does not specify. Validation: 10-fold cross-validation. Classification algorithm: Genetic Programming (GP).

[24] Tools: Weka machine learning package. Validation: leave-one-out cross-validation (LOOCV), k-fold cross-validation. Classification algorithms: Decision Tree, Naive Bayes, Support Vector Machine, Genetic Programming.

[1] Tools: does not specify. Validation: multiple cross-validations. Classification algorithms: Back Propagation Neural Network (BP), Support Vector Machine (SVM), Extreme Learning Machine (ELM), Regularized Extreme Learning Machine (RELM).

[2] Tools: Weka employed to implement algorithms. Validation: 10-fold cross-validation. Classification algorithms: Rotation Forest model, Logistic Regression, Bayesian Network, Multilayer Perceptron (MLP), Radial Basis Function Networks (RBFN), Support Vector Machine (SVM), C4.5 Decision Tree, Random Forest, Rotation Forest.

[25] Tools: all experiments coded in Python 2.7 and Weka 3.8.3 (to implement all the predetermined feature selection methods); the Python scikit-learn package implemented all the classifiers. Validation: 10-fold cross-validation. Classification algorithms: Random Forests, Bootstrap Aggregating with C4.5 Decision Trees, K-Nearest Neighbour.

[6] Tools: MATLAB 2014a utilized for the evaluation of the present approach. Validation: cross-validation. Classification algorithms: Artificial Neural Networks.

[9] Tools: all approaches in the study coded using MATLAB. Additional algorithms: N/A.

[4] Tools: does not specify. Validation: 10-fold cross-validation. Classification algorithm: Support Vector Machine.

[11] Tools: PGAPack software libraries, K-Nearest Neighbour from the AlgLib library, MATLAB R2012b. Classification algorithms: K-Nearest Neighbour, Naive Bayes, and a combination of the two.

[8] Tools: random forest classification engine from the scikit-learn Python package. Validation: 10-fold cross-validation; their weighted ensemble method is constructed using training data. Classification algorithms: Random Forest, Support Vector Machine.

[18] Tools: Message Passing Interface specifications in C++, Roche/454 genome sequencer, PacBio RS II sequencer, General Error-Model based SIMulator toolbox. Additional algorithms: N/A.

[19] Tools: Flask framework for Python; frontend developed using the Bootstrap framework; runs on top of an Apache server with communication made using a Web Server Gateway Interface. Additional algorithms: N/A.

[5] Tools: sequence analysis pipelines such as DADA2, PEAR software v0.9.6, BWA software package v0.7.12, and the stats package in R. Validation: 5-fold cross-validation. Classification algorithm: Logistic Regression.

[33] Tools: GO term finder. Additional algorithms: spectral clustering.

[7] Tools: the biomaRt, GenomicRanges, minfi, and IlluminaHumanMethylation27kanno.ilmn12.hg19 R packages; SVM method from the e1071 package; Gene Ontology; Kyoto Encyclopedia of Genes and Genomes. Validation: 5-fold cross-validation; deep-learning neural network. Classification algorithm: Support Vector Machine.

[34] Tools: local search techniques, Gibbs sampling, expectation maximization; additional non-GA methods/tools mentioned but not shown to be tested (list in the supplementary materials PDF). GA motif discovery methods: PCEA, GAPWM, kmerGA, GAMI, FGMA, Paul and Iba, GADEM, GA-DPAF, GASMEN, MDGA, GALF (GALF-P), GALF-G, GAME, GEMFA, GAPK, iGAPK.

[12] Tools: does not specify. Validation: 5-fold cross-validation; random undersampling. Classification algorithms: k-Nearest Neighbour, Support Vector Machine (SVM), Artificial Neural Network (ANN).

[35] Tools: the ONEMAX benchmark function. Additional algorithms: N/A.

[21] Tools: MedCalc, MATLAB, Weka. Additional algorithms: implemented several types of machine learning techniques for comparison: particle swarm optimization, multi-linear regression, decision tree learning algorithms.

[36] Tools: Comparing Continuous Optimizers (COCO) software. Classification algorithm: Random Trees.

[17] Tools: in preprocessing, the images were improved with the "imadjust" function in MATLAB. Additional algorithms: N/A.

[22] Tools: implemented in MATLAB. Additional algorithms: Neural Network.

[37] Tools: N/A. Additional algorithms: N/A.

[23] Tools: GPLAB, a genetic programming toolbox that runs in the MATLAB environment. Classification algorithms: Support Vector Machine, Logistic Regression.

[38] Tools: N/A. Additional algorithms: Fuzzy Logic.

[39] Tools: does not specify. Validation: LOOCV and 10-fold cross-validation. Classifiers: KNN, Support Vector Machine, Naive Bayes. Filter methods: Laplacian score, Fisher score.

[20] Tools: GA progress is processed on the TensorFlow platform with GeForce GTX TITAN GPUs. Additional algorithms: Convolutional Neural Networks.

[10] Tools: Google Maps API. Additional algorithms: N/A.

[13] Tools: MATLAB R2014b, libsvm library for MATLAB. Validation: 4-fold and 10-fold cross-validation. Classification algorithms: Support Vector Machine, K-Nearest Neighbour, Probabilistic Neural Network, Radial Basis Function Neural Network.

[40] Tools: PyTorch. Additional algorithms: Neural Network, Decision Tree.

[41] Tools: Python packages. Additional algorithms: none.

Table 5: Performance Evaluation. For each article, the metrics reported (drawn from: accuracy; ROC curve; AUC; TP; TN; FP; FN; specificity; sensitivity/recall; precision/PPV; F-measure; average runtime; computational complexity; other).

[3] Accuracy; TP; TN; FP; FN; specificity; sensitivity/recall; computational complexity.
[24] Accuracy; average runtime; computational complexity.
[1] Accuracy.
[2] Accuracy; ROC curve; AUC; TP; FP; F-measure; computational complexity.
[25] Accuracy; computational complexity. Other: feature importance, chi-square test.
[6] Accuracy; ROC curve; TP; FP; specificity; sensitivity/recall.
[9] Accuracy; sensitivity/recall; average runtime.
[4] Accuracy; TP; TN; FP; FN; specificity; sensitivity/recall. Other: fitness classification accuracy.
[11] TP; TN; FP; FN; specificity; sensitivity/recall; precision/PPV; F-measure; average runtime. Other: stability, G-mean.
[8] Accuracy; ROC curve; AUC; TP; TN; FP; FN; specificity; sensitivity/recall.
[18] Accuracy; average runtime. Other: convergence rate for average best fitness.
[19] Accuracy; average runtime.
[5] ROC curve; AUC.
[33] TP; TN; FP; sensitivity/recall; precision/PPV; F-measure. Other: discard ratio.
[7] Accuracy; TP; TN; FP; FN.
[34] TP; FP; FN; sensitivity/recall; precision/PPV; F-measure.
[12] Accuracy; TP; TN; FP; FN; specificity; sensitivity/recall; precision/PPV; F-measure. Other: G-mean.
[35] Average runtime.
[21] Accuracy; ROC curve; AUC; TP; TN; FP; FN; specificity; sensitivity/recall.
[36] Accuracy.
[17] Other: Dice similarity.
[22] Accuracy; ROC curve; AUC; TP; specificity; sensitivity/recall.
[37] Accuracy; ROC curve; AUC; TP; specificity; sensitivity/recall.
[23] Accuracy; ROC curve; AUC.
[38] Accuracy; ROC curve; AUC.
[39] Accuracy; average runtime; computational complexity. Other: Laplacian score, Fisher score.
[20] Accuracy.
[10] Other: ambulance usage, patient waiting time.
[13] Accuracy; TP; TN; FP; FN; specificity; sensitivity/recall; average runtime; computational complexity. Other: sum of errors, k-coefficient, acceptance feature coefficient.
[40] Accuracy; ROC curve; AUC; TP; TN; FP; FN; specificity; sensitivity/recall; precision/PPV; F-measure.
[41] Accuracy; ROC curve; AUC; TP; TN; FP; FN; specificity; sensitivity/recall; precision/PPV; F-measure.
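Most of the metrics tallied in Table 5 derive from the four confusion-matrix counts (TP, TN, FP, FN). The relationships can be sketched as follows; the function name is illustrative.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive the common metrics of Table 5 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value (PPV)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f_measure": f_measure}

# e.g. 80 true positives, 90 true negatives, 10 false positives,
# 20 false negatives -> accuracy (80 + 90) / 200 = 0.85
m = confusion_metrics(tp=80, tn=90, fp=10, fn=20)
```

This makes concrete why accuracy alone can be misleading on the imbalanced datasets common in biomedical classification [28]: a classifier can score high accuracy while sensitivity on the rare positive class remains poor.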