Machine Learning Against Cancer: Accurate Diagnosis of Cancer by Machine Learning Classification of the Whole Genome Sequencing Data
Arash Hooshmand, KTH Royal Institute of Technology
Abstract
Supervised machine learning can precisely identify different cancer tumors at any stage by classifying cancerous and healthy samples based on their genomic profile. We have developed novel methods of MLAC (Machine Learning Against Cancer) achieving perfect results with perfect precision, sensitivity and specificity. We have used the whole genome sequencing data acquired by next generation RNA sequencing techniques in The Cancer Genome Atlas and Genotype-Tissue Expression projects for cancerous and healthy tissues respectively. Indeed, a creative way to work with data and general algorithms has resulted in perfect classification, i.e. precision, sensitivity and specificity all equal to 1 for most tumor types, even with a modest amount of data. Our system can be used in practice because once the classifier is trained, it can be used to classify any new sample from new potential patients. One advantage of our work is that the aforementioned perfect precision and recall are obtained on samples of all stages, including very early stages of cancer; therefore, it is a promising tool for diagnosis of cancers in early stages. Another advantage of our novel model is that it works with normalized values of RNA sequencing data, hence people's private sensitive medical data will remain hidden, protected and safe. This type of analysis will become widespread and economical in the future, and people may even learn to receive their RNA sequencing data and do their own preliminary cancer studies, which has the potential to help healthcare systems.
1. Introduction
Cancer is one of the most common risks threatening people's lives and remains a severe unsolved problem as one of the main leading causes of death [13]. Early diagnosis plays a vital role in cancer treatment and survival. There are dozens of different types of cancers affecting different organs because cancer can start in any part of the body. [14] In fact, cancer starts when cells in the body begin to grow out of control and is usually caused by genetic mutations in different cells. Yet the main underlying reasons that cause these mutations are unknown. In recent years, along with the generation of big data by high-throughput omics technologies, applications of ML (Machine Learning) in the diagnosis, treatment and prognosis of cancers have become hot topics of research. Consequently, computers have turned out to be promising tools and reliable assistants that contribute to new discoveries based on analysis of the big data generated by high-throughput technologies. In this work, we have proposed a novel approach to using genetic transcriptomic data, leading to great results that perfectly distinguish between WES (Whole Exome Sequencing) RNA profiles of 22 different main cancers from TCGA (The Cancer Genome Atlas) [15] and their corresponding healthy tissue samples from the GTEx (Genotype-Tissue Expression) project [16], with samples from different numbers of people as illustrated in Table 1. Table 1 also contains the reported numbers of estimated new cases and estimated deaths for each cancer, out of an estimated 1,806,590 new cases of all cancers with an estimated 606,520 deaths in 2020 in the U.S., for instance. [1] Reusability and transfer learning are among the main advantages of ML: once a machine is well trained and can distinguish cancerous from noncancerous tissues, it can be fed new data from new samples acquired from new people, and the new sample will be classified correctly with high likelihood.
Therefore, the utilization of ML systems that can detect a cancerous genome even at the earliest stages using NGS (Next Generation Sequencing) technology is likely to be a killer application.

Table 1 columns: TCGA abbreviation, Cancer (TCGA), Organ (GTEx).
2. Methods
ML and AI (Artificial Intelligence) are rapidly finding their place in the medical and pharmaceutical sciences. Different ML models have been tested successfully in recent years in many projects, as well as in this work, and have returned decent results. Naïve Bayes, Support Vector Machines, Decision Trees, Random Forest, Logistic Regression and K Nearest Neighbors are examples of general supervised ML algorithms that have reportedly given great results in different projects in different fields of science, and they are analyzed in our project too. In addition to them, an unsupervised ML method, K-Means, is also tested. In this work we came up with a practical approach to applying ML for cancer diagnosis that is effective and robust across the different ML algorithms we have tried. Since they are the most well-known common ML algorithms, we briefly introduce them in the following paragraphs. The WES genetic information obtained by NGS is openly available on TCGA, GTEx and other online public databases. However, we do not review the technical details of RNA sequencing techniques because they are out of the scope of the current paper. The focus of our work in this article is to receive the data from the two aforementioned open databases, train the ML classifiers with them, and validate them. To do this, we have used the following ML techniques:
Bayes' theorem was proposed by the Englishman Thomas Bayes and published in 1763, when he was trying to prove the existence of God by means of statistical inference. [7] Bayesian statistics is used in estimates based on anticipated subjective knowledge. Implementations of this theorem therefore adapt with use and allow combining, or fusing, data from two or more different sources and expressing them in terms of likelihood. The Naive Bayesian Classifier is an implementation of Bayes' theorem with an additional simplifying hypothesis of independence between the predictor variables; hence "Naive" is added to the name of these implementations, because a naive Bayesian classifier assumes that the features of a class/object are not related to each other, i.e. the presence of a particular feature is not related to the presence or absence of another. In this way each feature independently contributes to the probability of a given class. In return, Bayes classifiers can be trained easily, require little data to train, and can classify big data quickly. Despite the fact that naive Bayes classifiers are amazingly simple, they have worked quite well in many real-world situations, including our cancerous/healthy tissue classification. Our naive Bayes classifier requires a small amount of training data and is fast and accurate, as reflected in the Results section. On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator in the sense that one cannot rely on its parameters for extraction of feature importance. [9]
More formally, as shown by equations 1-6, Bayesian classifiers are, indeed, probabilistic classifiers using Bayes' rule, i.e.

P(A|B) = P(A) P(B|A) / P(B)   (1)

For example, A can be cancer and B a positive cancer test result; the posterior probability of cancer given a positive test result is proportional to the prior times the sensitivity, i.e. the chance of a positive result given cancer. Indeed, a naive Bayesian classifier accomplishes statistical inference based on maximum likelihood estimation, i.e. setting the parameters of the probability distribution in a way that maximizes the goodness of fit of the statistical model to the training data via joint probability distributions of the training samples. In technical words, the likelihood function describes a hypersurface whose peak, if it exists, is an arrangement of model parameter values and coefficients that maximizes the probability of drawing the obtained sample. [8] In its more general form, according to the Scikit-Learn documentation, Bayes' theorem states the following relationship, given class variable y and dependent feature vector x_1 through x_n:

P(y | x_1, ..., x_n) = P(y) P(x_1, ..., x_n | y) / P(x_1, ..., x_n)   (2)

Using the naive conditional independence assumption that

P(x_i | y, x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = P(x_i | y)   (3)

for all i, this relationship is simplified to

P(y | x_1, ..., x_n) = P(y) ∏_{i=1}^{n} P(x_i | y) / P(x_1, ..., x_n)   (4)

Since P(x_1, x_2, ..., x_n) is constant given the input, we can use the following classification rule:

P(y | x_1, ..., x_n) ∝ P(y) ∏_{i=1}^{n} P(x_i | y)  =>  ŷ = argmax_y P(y) ∏_{i=1}^{n} P(x_i | y)   (5)

We can use MAP (Maximum A Posteriori) estimation to estimate P(y) and P(x_i | y); the former is then the relative frequency of class y in the training set.
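As a concrete illustration, the MAP rule above with a Gaussian likelihood can be exercised in a few lines with Scikit-Learn's GaussianNB. This is a minimal sketch: the arrays and class centers are illustrative synthetic data, not the TCGA/GTEx expression matrices used in this work.

```python
# Minimal sketch of the MAP classification rule with a Gaussian likelihood.
# Synthetic toy data: class 0 features centered at 0, class 1 at 3.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)),
               rng.normal(3.0, 1.0, size=(50, 5))])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)           # estimates P(y), mu_y and sigma_y
pred = clf.predict([[3, 3, 3, 3, 3]])  # argmax_y P(y) * prod_i P(x_i | y)
```

A point at the class-1 mean is assigned to class 1, exactly as the classification rule in equation 5 prescribes.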
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(x_i | y). [9] GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:

P(x_i | y) = (1 / sqrt(2π σ_y²)) exp(−(x_i − µ_y)² / (2σ_y²))   (6)

where the parameters σ_y and µ_y are estimated using maximum likelihood.

Support Vector Machines (SVMs) are a set of supervised learning algorithms developed by Vladimir Vapnik and his team at AT&T Labs. [17, 18, 19] These methods apply to both classification and regression problems. In classification, given a set of labeled training examples, we can train an SVM to build a model that predicts the class of a new sample. Intuitively, an SVM is a model that represents the sample points in space, separating the classes into regions as distant as possible using a separating hyperplane defined by the points of the two classes closest to it, which are called the support vectors. When new samples are put before the model, depending on the region to which they belong, they can be classified into the right class.
More formally, an SVM builds a set of hyperplanes [20] in a very high (or even infinite) dimensional space that can be used in classification or regression problems. A good separation between the classes allows a correct classification. In this concept of optimal separation resides the fundamental characteristic of SVMs: this type of algorithm searches for the hyperplane that has the maximum distance (margin) to the points that are closest to it. This is why SVMs are also sometimes referred to as maximum margin classifiers. In this way, the points that are labeled with one category will be on one side of the hyperplane and the cases in the other category will be on the other side. SVM algorithms intrinsically belong to the family of linear classifiers. The vector formed by the points closest to the hyperplane is called the support vector. Using tricks such as kernel functions, SVMs can also be an alternative training method for polynomial classifiers, radial basis functions, and multilayer perceptron neural networks. Figure 1 illustrates the SVM mechanism.

Figure 1: SVM separating hyperplanes, from the article "Support vector machine" in Wikipedia (2012). Accessed June 27, 2020.
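This mechanism can be sketched with Scikit-Learn's SVC; the linear kernel and probability flag here match the settings reported later in this article, while the data is synthetic and purely illustrative.

```python
# Linear-kernel SVM sketch on synthetic, linearly separable toy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(40, 3)),
               rng.normal(2, 1, size=(40, 3))])
y = np.array([0] * 40 + [1] * 40)

svm = SVC(kernel="linear", probability=True).fit(X, y)
pred = svm.predict([[2, 2, 2]])         # side of the separating hyperplane
proba = svm.predict_proba([[2, 2, 2]])  # class probabilities, usable for ROC
```

With probability=True, predict_proba returns the per-class probabilities needed to plot a ROC curve.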
A decision tree is a tree-like map of the possible outcomes of a series of decisions and only contains conditional control statements comparing possible actions with each other according to their costs, probabilities and utilities. The goal of a decision tree is to break down all the available data that a system can learn from and group it so that the entries within each group are as similar to each other as possible with respect to the goal metric, while between groups the entries are as different as possible relative to the goal metric (for example, conversion rate). The decision tree takes into account the different variables existing in the training set to determine how to divide the data MECE (mutually exclusive, collectively exhaustive) into these groups or leaves to maximize the goal. A decision tree typically starts with a single node and then branches out into possible outcomes. Each of the outcome nodes creates additional nodes, which branch into other different possibilities, creating a structure similar to that of a tree. There are three different types of nodes: chance nodes, decision nodes, and terminal nodes. A chance node, typically represented by a circle, shows the probabilities of certain outcomes. A decision node, typically represented by a square, shows a decision to be made, and a terminal node, typically represented by a triangle, shows the final result of a decision route. Decision trees are still popular for advantages such as requiring minimal data processing and being easily understood, updated (new options can be added to existing trees), and integrated with other decision-making tools. However, decision trees can become very complex; in those cases, a more compact influence diagram can be a good alternative focusing on fundamental goals, inputs, and decisions. [21]
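The splitting described above can be sketched in a few lines; the feature names and toy values below are hypothetical, chosen only so the learned split is easy to see.

```python
# Decision-tree sketch: the tree learns axis-aligned splits over the features.
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0], [2, 1], [3, 1], [6, 0], [7, 1], [8, 0]]  # [feature_a, feature_b]
y = [0, 0, 0, 1, 1, 1]  # label 1 whenever feature_a is large

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = tree.predict([[6.5, 0], [1.5, 1]])  # follows the feature_a split
```

Because the labels depend only on feature_a, the fitted tree contains a single split on that feature, which is the kind of interpretable structure that keeps decision trees popular.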
By iteratively applying the algorithm that creates decision trees with different parameters to the same data, we get what is called a random forest. This algorithm is one of the most efficient prediction methods for big data, since it averages the performance of many different noisy models and impartially reduces the final variability of the ensemble. In practice, different training and test sets are built on the same data, which generates different decision trees. The union of these trees of different complexities, trained on data of different origin although from the same set, results in a fairly stable random forest whose main characteristic is that it creates more robust models than could be obtained by creating a single decision tree on the same data. In classification, the class that is the mode of the individual trees' outputs is returned. [22, 23]
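A minimal sketch of the ensemble just described, on hypothetical toy data: many trees are fit on bootstrapped versions of the same set, and the predicted class is the mode of their votes.

```python
# Random forest sketch: an ensemble of decision trees on bootstrapped data;
# the predicted class is the mode of the individual trees' votes.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [10, 10], [11, 9], [9, 11]]
y = [0, 0, 0, 1, 1, 1]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = forest.predict([[10, 10], [0, 0]])
```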
Logistic regression is a group of statistical techniques that aim to test hypotheses or causal relationships when the dependent variable is nominal. Despite its name, it is not an algorithm applied in regression problems, in which continuous values are dealt with, but a method for classification problems, in which a binary value, either 0 or 1, is obtained. For example, a classification problem is to identify whether a given tumor is malignant or benign. With logistic regression, the relationship between the dependent variable, i.e. the statement to be predicted, and one or more independent variables, i.e. the set of features available to the model, is determined. To do this, it uses a logistic function that determines the probability of the dependent variable. As previously mentioned, what is sought in these problems is a classification, so the probability must be translated into binary values, for which a threshold value is used. If the probability is above the threshold, the statement is true, and vice versa. Generally this value is 0.5, although it can be increased or decreased to manage the number of false positives or false negatives. [24] The function that relates the dependent variable to the independent ones is usually either the sigmoid function or a similar function such as tanh or softmax. The sigmoid function is an S-shaped curve that can take any value between 0 and 1, but never values outside these limits. The equation that defines the sigmoid function is f(x) = 1/(1 + e^(−x)), where x is a real number. In the equation you can see that when x tends to minus infinity the function tends to zero, and when x tends to infinity the function tends to one. Figure 2 shows a graphical representation of the logistic (sigmoid) function.

Figure 2: Sigmoid function as the logistic curve.

Logistic regression is a technique widely used because of its effectiveness and simplicity.
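The sigmoid thresholding just described can be sketched directly; the 0.5 threshold is the default mentioned above, and the inputs are illustrative.

```python
# Sigmoid function and the default 0.5 decision threshold described above.
import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); tends to 0 as x -> -inf and to 1 as x -> +inf
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, threshold=0.5):
    # Probabilities above the threshold map to class 1, otherwise class 0.
    return 1 if sigmoid(x) > threshold else 0

print(sigmoid(0))                 # 0.5, the midpoint of the S-curve
print(classify(3), classify(-3))  # prints 1 0
```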
As one of its advantages, it does not require large computational resources, neither in training nor in execution. Furthermore, the results are highly interpretable, which is one of its main advantages: the weight of each feature determines the importance it has in the final decision. Therefore, it can be affirmed that the model has made one decision or another based on the existence of one or another feature, which in many applications is highly desired in addition to the model itself. Regarding its disadvantages, it cannot directly solve non-linear problems because the expression that makes the decision is linear. For example, in the event that the probability of a class initially decreases with a feature and subsequently increases, this cannot be captured by a logistic model directly; if necessary, the feature should first be transformed so that the model can capture this non-linear behavior. In these cases, it is better to use other models such as decision trees. Indeed, the important point is that the target variable must be linearly separable; otherwise, the logistic regression model will not classify correctly. In other words, there must be two "regions" with a linear border in the data. Another drawback is the dependency it shows on the features. Logistic regression is not one of the most powerful algorithms that exist; it would easily be surpassed by other, more complex classifiers. Finally, in machine learning there are classifiers that can work with multiple classes, such as Decision Trees or Random Forest, and others that cannot, such as Logistic Regression. However, it is always possible to use tricks to apply logistic regression to classification problems with multiple classes, such as:

• OvA (One versus all): In this strategy, you train as many binary classifiers as there are classes in the data set.
Each of the models predicts the probability that the record belongs to its class. When making a prediction, all classifiers are run and the one with the highest probability is selected.

• OvO (One versus one): In this strategy, as many models are created as there are pairs of possible outcomes, that is, N(N − 1)/2 models have to be trained, where N is the number of possible classes. Each classifier decides only between two possible outcomes. As in the previous case, when making a prediction, all classifiers are run and the one with the highest probability is selected.

K-Nearest-Neighbor is a simple nonparametric instance-based supervised ML algorithm. It can be used to classify new samples (discrete values) or to predict (regression, continuous values). It is essentially used to classify values by searching for the most similar (by proximity) data points learned in the training stage and making guesses for new points based on that prior classification. [25, 26] Unlike K-Means, an unsupervised algorithm where "K" means the number of groups into which we want to cluster, in K-Nearest Neighbor "K" means the number of "neighboring points" we consider in the vicinity to classify among the n groups that are already known in advance. It is a method that simply searches for the observations closest to the point of interest to be classified and classifies it based on the majority of the data that surrounds it. As said before, the K nearest neighbor algorithm is:

• Supervised: we have tagged our training data set with the class or expected results.

• Instance-based: the algorithm does not explicitly learn a model (as Logistic Regression or Decision Trees do). Instead, it memorizes the training instances, which are used as the knowledge base for the prediction phase.

KNN is easy to learn and implement. However, it uses the entire training set to classify each new point and therefore requires a lot of memory and processing resources.
For these reasons, KNN tends to work best on small data sets without a huge number of features. To classify an input by means of KNN one should:

1. Calculate the distance between the item to classify and the other items in the training data set.
2. Select the closest K elements (those with the smallest distance, depending on the function used).
3. Carry out a majority vote among the K points: the class/label held by most of them determines the final decision.

Taking point 3 into account, the value of K is very important in deciding the class of a point, because it defines which points' majority will determine the group each new point belongs to; it is especially critical when new points fall on the borders between groups.
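The three steps above can be written out by hand in a few lines. This is an illustrative NumPy sketch on toy data, not the Scikit-Learn KNeighborsClassifier used elsewhere in this work.

```python
# Hand-written KNN following the three steps described above.
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Distance from the new item to every item in the training set.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Select the K closest elements.
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote among the K labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([5, 5]), k=3)
```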
K-Means is an unsupervised ML algorithm for clustering. It is used when we have a lot of untagged data. The objective of this algorithm is to find K groups (clusters) among the raw data. The algorithm works iteratively to assign each input, such as a genome sample, to one of the K groups based on its features, here the genes; i.e. the inputs are grouped based on the similarity of their features. [27] As a result of executing the algorithm:

• The centroids, i.e. the geometric centers of the groups, become the coordinates of the corresponding K clusters and are used to label new samples.

• Labels are produced for the training data set: each tag belongs to one of the K defined groups.

The groups are defined dynamically, i.e. their positions are adjusted in each iteration of the process until the algorithm converges. Once the centroids are found, they are analyzed to see what their unique features are compared to those of the other groups. These groups are the labels that the algorithm generates. The K-Means clustering algorithm is one of the most used methods to find hidden or theoretically suspected groups in an unlabeled data set. This can serve to confirm or reject hypotheses we have assumed about our data, and it can also help to discover hidden relationships between data sets. Once the algorithm has executed and obtained the labels, it is easy to classify new values or samples among the obtained groups. The algorithm works by pre-selecting a value of K. To find the number of clusters in the data, we must run the algorithm for a range of K values, see the results, and compare the characteristics of the groups obtained. In general, there is currently no exact way to determine the K value, but it can be estimated with acceptable precision using the following technique: one of the metrics used to compare results is the average distance between the data points and their centroid.
As long as the value of this mean keeps decreasing as we increase the value of K, we continue increasing K. The mean distance to the centroid is considered as a function of K, and the goal is to find the elbow point where the rate of descent abruptly flattens.

In supervised ML classification methods, as well as in the K-Means unsupervised clustering algorithm, the input data (the samples) are viewed as p-dimensional vectors (arrays or ordered lists of p numbers, where p in this project is 19627). The classifiers then, based on their criteria, distinguish among different groups formed by close/similar samples; e.g. a Bayesian classifier looks for a hypersurface that maximizes the likelihood of drawing the sample, while an SVM looks for a hyperplane that optimally separates the points of one class from the other, possibly after projecting them to a higher-dimensional space. There have been wrong perceptions in the ML community preventing potential achievements; for instance, people try to decrease the number of features to avoid "the curse of dimensionality". While the curse of dimensionality may truly occur in some problems, it may not be an issue in others such as ours. Deleting features blindly for fear of dimensionality may only result in losing useful information without need. Researchers usually try to reduce by themselves the assumed learning pressure on the machines brought about by highly redundant dimensions and select a subset of features, i.e. genes, to reduce the number of features and dimensions. [10, 11, 12] This may have hurt their results. A strong point of our work is that we consider ML a powerful advanced statistical tool doing heavy statistical analyses that people themselves cannot do. As a result, we gave all the data corresponding to the WES as feature inputs to the ML at once, and it returned almost perfect results quickly and precisely.
We thought of the 19627 different genes not as too many features but as the pixels of a less-than-141×141-pixel photo; it was a very light task for the machine to analyze such a low-resolution image, and it took only seconds to classify the cancerous and noncancerous cells 100% precisely.

2.8 Model optimization and settings
We have employed all the classifiers from Scikit-Learn 0.23.1 with their default settings unless mentioned otherwise. For example, Scikit-Learn's Gaussian Naive Bayes classifier, a simple classifier, has only two parameters, priors equal to None and var_smoothing equal to 1e-9, where var_smoothing is the portion of the largest variance of all features that is added to the variances for calculation stability. We did not touch the defaults, but there were exceptions such as SVM, in which we changed two default settings: we used the "linear" kernel instead of the default "rbf", and we set probability=True in order to obtain predict_proba, a useful attribute for calculating and plotting the ROC curve that is unavailable under the default probability=False. Therefore, except for these two minor changes to the SVM defaults, all models were run with the default settings of Scikit-Learn version 0.23. As for the other settings, for most of the cancers 90% of the specific cancer's samples were used as the training dataset and the remaining 10% as the testing dataset, chosen at random using the train_test_split function of the Scikit-Learn model_selection module with the random seed equal to zero. The only exceptions were bladder and cervix, for which the number of healthy samples was too low; therefore, we used 40% for training and 60% for testing in bladder cancer, and 70% for training and 30% for testing in cervix cancer. However, we analyzed the effect of different data allocation plans from 10% to 90% for the test/validation set, tried other random seeds, and, in particular for the K Nearest Neighbor algorithm, tried different K values. The results were not significantly different, and they are discussed further in the Discussion part of this article. We also decided to mostly publish the results achieved by those classifiers that can do the classification perfectly; however, all six classifiers work well and the imperfect ones also return results close to perfect.
The models take the 19627-gene WES data as input, and after quick and easy model training with no need for data modification, acceptable classification results are obtained; there are at least two classifiers per cancer that distinguish both cancerous and healthy tissues perfectly with no error.
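The training/evaluation flow described above can be sketched as follows: a 90/10 split with random seed zero and a default classifier. Here X stands in for the n_samples × 19627 expression matrix; the small synthetic arrays are illustrative only.

```python
# Sketch of the train/test flow: 90/10 split with random_state=0 and a
# default classifier; synthetic data stands in for the expression matrix.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 20)),   # stand-in "healthy" profiles
               rng.normal(4, 1, size=(100, 20))])  # stand-in "cancerous" profiles
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
model = GaussianNB().fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)  # accuracy on the held-out 10%
```

Once trained this way, the same fitted model object can classify any new sample, which is the reusability property discussed in the Introduction.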
Model evaluation produces measures to approximate a classifier's reliability. To distinguish between cancerous and noncancerous cells, since it is a binary classification, we use accuracy, precision, specificity, sensitivity, f1-score, several averaging techniques and the ROC curve to evaluate the model. We use, indeed, the Scikit-Learn metrics classification report that returns precision, recall and f1-score for each of the two classes. In binary classification, recall of the positive class is called "sensitivity" and recall of the negative class "specificity". In what follows, the principal terms and then the derivations of equations 7-22 based on the confusion matrix, such as accuracy, specificity, sensitivity and f1-score, are given for review and comparison:

• Condition positive (P): the number of real positive cases in the data
• Condition negative (N): the number of real negative cases in the data
• True positive (TP) or hit
• True negative (TN) or correct rejection
• False positive (FP), false alarm or type I error
• False negative (FN), miss or type II error

Sensitivity, recall, hit rate, or true positive rate (TPR):
TPR = TP/P = TP/(TP + FN) = 1 − FNR   (7)

Specificity, selectivity or true negative rate (TNR):

TNR = TN/N = TN/(TN + FP) = 1 − FPR   (8)

Precision or positive predictive value (PPV) is the ratio of the samples correctly labeled by our program to all the labeled ones:

PPV = TP/(TP + FP) = 1 − FDR   (9)

Precision can be calculated only for the positive class, i.e. class 1 that indicates cancer, or can be evaluated for each of the two classes independently, treating each class in turn as the positive class; the latter is done in the Scikit-Learn metrics classification report as shown in Table 1.

Negative predictive value (NPV):

NPV = TN/(TN + FN) = 1 − FOR   (10)

Miss rate or false negative rate (FNR):

FNR = FN/P = FN/(FN + TP) = 1 − TPR   (11)

Fall-out or false positive rate (FPR):

FPR = FP/N = FP/(FP + TN) = 1 − TNR   (12)

False discovery rate (FDR):

FDR = FP/(FP + TP) = 1 − PPV   (13)

False omission rate (FOR):

FOR = FN/(FN + TN) = 1 − NPV   (14)

Accuracy (ACC):

ACC = (TP + TN)/(P + N) = (TP + TN)/(TP + TN + FP + FN)   (15)

The harmonic mean of precision and sensitivity, or f1-score (F1):

F1 = 2·PPV·TPR/(PPV + TPR) = 2·TP/(2·TP + FP + FN)   (16)

Since we are using the Scikit-Learn metrics classification report to show the results as shown in Table 1, we also describe the meaning of the micro avg, macro avg and weighted avg used in the report.

Micro-average of precision (MIAP):

MIAP = (TP1 + TP2)/(TP1 + TP2 + FP1 + FP2)   (17)

Micro-average of recall (MIAR):

MIAR = (TP1 + TP2)/(TP1 + TP2 + FN1 + FN2)   (18)

Micro-average of f-score (MIAF) is the harmonic mean of the two numbers above:

MIAF = 2·MIAP·MIAR/(MIAP + MIAR)   (19)

Macro-average of precision (MAAP):

MAAP = (Precision1 + Precision2)/2   (20)

Macro-average of recall (MAAR):

MAAR = (Recall1 + Recall2)/2   (21)

Macro-average of f-score (MAAF) is the harmonic mean of the two numbers above:

MAAF = 2·MAAP·MAAR/(MAAP + MAAR)   (22)

Macro-average is suitable for knowing how the system performs overall across different sets of data but should not be considered in any specific decision-making, because it calculates metrics for each label and finds their unweighted mean, i.e. it does not take label imbalance into account, while in our case the labels are highly imbalanced in many sets, e.g. 1091 vs. 179. On the other hand, micro-average is a useful tool and returns the measures for our decision-making, especially when coupled healthy-cancerous datasets vary in size, because it calculates metrics globally by counting the total true positives, false negatives and false positives. Finally, weighted-average, according to the Scikit-Learn documentation on the f1-score metric, calculates metrics for each label and finds their average weighted by support (the number of true instances for each label). This alters "macro" to account for label imbalance; consequently, it can result in an F-score that is not between precision and recall.

The ROC (Receiver Operating Characteristic) curve is created by plotting the true positive rate (TPR), or sensitivity, against the false positive rate (FPR), i.e. 1 − specificity, at different threshold settings. Varying the decision threshold from its maximal to its minimal value results in a piecewise linear curve from (0,0) to (1,1), such that each segment has a non-negative slope (Figure 3). This ROC curve is the main tool used in ROC analysis and, in general, can be used to address a range of problems; however, in our illustrated case where the performance is perfect, it is just a visual endorsement of the perfect classification, and the corresponding AUC (Area Under the ROC Curve) is at its maximum, i.e. 1.

Confusion matrix    Predicted 0    Predicted 1
Class 0             TN = 1         FP = 0
Class 1             FN = 0         TP = 1

Table 2: Typical confusion matrix with no confusion for perfect identification of cancerous tumors by any of GNB, SVM, DCT, RFC, LGR and KNN (K=3), where sensitivity, specificity and precision are all 100%.

Figure 3: Typical ROC curve of the perfect classification done by the classifiers on most tumor types.
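The quantities in equations 7-16 can be computed directly from the confusion matrix. The toy labels below are illustrative and reproduce the perfect case of Table 2 (TN=1, FP=0, FN=0, TP=1).

```python
# Metrics of equations 7-16 computed from a confusion matrix; the labels
# reproduce the perfect-classification case of Table 2.
from sklearn.metrics import confusion_matrix

y_true = [0, 1]
y_pred = [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # TPR, eq. (7)
specificity = tn / (tn + fp)  # TNR, eq. (8)
precision = tp / (tp + fp)    # PPV, eq. (9)
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # eq. (16)
```

In the perfect case all four quantities equal 1, matching the AUC of 1 discussed above.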
3. Results
The classification performance on a total of 7971 cancerous WES samples of 22 specific cancers from the TCGA open public database, together with 4798 WES samples of the corresponding healthy tissues from the GTEx project, was studied across more than 150 ML models. Each sample comprised the normalized expression of 19627 genes (in ppm), and the data for the cancerous and healthy samples of each organ were fed separately and directly to the machine, which performed the heavy statistical calculations on the high-dimensional data. As a result, cancerous samples of all 22 tumor types, at all stages, were correctly identified and separated from their corresponding healthy samples. The task was accomplished and compared across six supervised ML classifiers, i.e. GNB, SVM, DCT, RFC, LGR and KNN, all of which showed perfect, and in some cases near-perfect, performance, as shown in Table 3. In addition, Table 3 contains the results for K-Means, an unsupervised clustering technique applied to evaluate the algorithm's ability to distinguish cancerous from noncancerous cells of different organs as its two main clusters. Clustering is in general different from classification, and its algorithms such as K-Means have their own evaluation techniques. However, we applied a trick, using classification accuracy, to see how well a two-cluster K-Means partition matches the difference of interest, i.e. the two classes of healthy and cancerous cells of an organ, or two classes of two different cancers. The results were impressive here too: in some cases, clustering matched our class labels of healthy and cancerous 100%. One tricky point in interpreting these results is that, unlike classification, where higher accuracy is always better, in clustering 50% accuracy is the worst outcome, while very high and very low accuracies are equally good: an accuracy of zero simply means that the clustering algorithm, here K-Means, has labelled all our class 0, i.e.
healthy samples, as cluster 1, and all our class-1 labels, i.e. cancers, as cluster 0. Therefore, any accuracy a and its counterpart 100 − a are equally good, while 50% indicates maximum entropy and the least match with our classes and labels. The clustering performance was impressive in several cases; in particular, cancerous and noncancerous cells of Pancreas and Testis were separated 100% accurately into two distinct clusters, as shown in Table 3. Models were also successfully employed to further distinguish between different types of an organ's cancer, and two types of cancer were likewise separated perfectly. The performance of the supervised and unsupervised methods under different parameters was also studied. For most changes, such as different random seeds, different volumes of data allocated to the training and testing stages, and different values of K in KNN, the resulting differences were negligible, as shown in Tables 42, 43 and 44 respectively. The exception was the GNB classification performance on LIHC, which improved to at least 98% under other settings. However, some parameters can change and deteriorate the results. For example, SVMs with linear and rbf kernels differ significantly: while the linear kernel accomplishes the classification perfectly in 19 out of 22 cancer types, the three exceptions being LGG, COAD and GBM with 99% accuracy and an f1-score of 0.99, the results with the rbf kernel were not acceptable. An example on LUSC is presented in Table 41, to be compared with Table 42, which contains the results for the SVM with linear kernel. The most common typical ROC curve and confusion matrix for most of our cancer classifications are illustrated in Figure 3 and Table 2; consequently, AUC is naturally 1 for most classifiers in the classification of most cancerous-healthy sample pairs of specific organs. We also analyzed these ML classifiers' capability to distinguish between two types of cancer.
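The cluster-label trick described above can be sketched as follows, with synthetic data standing in for the expression matrix; the names and parameters are illustrative assumptions, not the paper's setup:

```python
# Sketch of the cluster-label trick described above, on synthetic stand-in
# data: K-Means assigns arbitrary cluster ids, so an accuracy a and its
# counterpart 1 - a are equally good matches, and 0.5 is the worst case
# (maximum entropy).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
healthy = rng.normal(loc=-3.0, size=(40, 50))   # well-separated synthetic groups
cancer = rng.normal(loc=3.0, size=(40, 50))
X = np.vstack([healthy, cancer])
y = np.array([0] * 40 + [1] * 40)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
acc = accuracy_score(y, clusters)
match = max(acc, 1.0 - acc)    # invariant to the arbitrary cluster numbering
print(match)                   # 1.0 for perfectly separated groups
```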
While no longer 100% perfect as when classifying healthy versus cancerous tissue, the results were still strong, with accuracy and f1-scores above 90%, better than the previous works we are aware of. The classifiers not only separated samples belonging to two different cancers of the brain as well as of the lung, but also classified well the tumors of lung's LUSC and bladder's BLCA, which were previously reported to be very similar and confusing for deep neural network classifiers [11]. For more details, refer to the Discussion section of this paper below.

Accuracy   GNB    SVM    DCT    RFC    LGR    KNN    K-Means unsupervised clustering
ACC        1.0    1.0    0.95   1.0    1.0    1.0    0.14 = 0.86
BLCA       1.0    1.0    0.98   1.0    1.0    1.0    0.64
LGG        0.98   0.99   1.0    0.99   0.99   1.0    0.77
BRCA       1.0    1.0    0.98   1.0    1.0    0.98   0.91
CECS       1.0    1.0    1.0    1.0    1.0    1.0    0.69
LAML       1.0    1.0    1.0    1.0    1.0    1.0    0.44 = 0.56
COAD       1.0    0.99   1.0    0.99   0.99   0.97   0.18 = 0.92
ESCA       0.99   1.0    1.0    0.99   1.0    0.98   0.59
GBM        0.99   0.99   0.99   1.0    1.0    1.0    0.72
KIRC       0.98   1.0    0.98   0.98   1.0    0.98   0.77
LIHC       0.92   1.0    0.98   1.0    1.0    0.96   0.08 = 0.92
LUAD       1.0    1.0    0.99   1.0    1.0    0.94   0.25 = 0.75
LUSC       1.0    1.0    0.99   1.0    1.0    0.99   0.13 = 0.87
OV         1.0    1.0    1.0    1.0    1.0    1.0    0.69
PAAD       1.0    1.0    1.0    1.0    1.0    1.0    1.0
PRAD       0.98   1.0    1.0    1.0    1.0    0.98   0.97
READ       1.0    1.0    1.0    1.0    1.0    0.95   0.58
SKCM       1.0    1.0    1.0    1.0    1.0    0.99   0.65
STAD       0.98   1.0    0.97   1.0    1.0    0.97   0.17 = 0.83
TGCT       1.0    1.0    1.0    1.0    1.0    1.0    0.0 = 1.0
THCA       1.0    1.0    1.0    1.0    1.0    0.95   0.30 = 0.70
UCEC       1.0    1.0    1.0    1.0    1.0    1.0    0.69

Table 3: ML classifier accuracy for the different tumor-type classifications.
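The comparison behind Table 3 can be sketched as the following loop, with synthetic data standing in for the TCGA/GTEx expression matrices; the generator parameters and split are illustrative assumptions, not the paper's actual configuration:

```python
# Sketch of the comparison behind Table 3, with synthetic data standing in for
# the TCGA/GTEx expression matrices: one train/test split evaluated across the
# six supervised classifier families named in the text.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=500, n_informative=50,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

classifiers = {
    "GNB": GaussianNB(),
    "SVM": SVC(kernel="linear"),              # linear kernel, as in the paper
    "DCT": DecisionTreeClassifier(random_state=0),
    "RFC": RandomForestClassifier(random_state=0),
    "LGR": LogisticRegression(max_iter=5000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in classifiers.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```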
4. Discussion
Our work facilitates effective applications of ML in the medical sciences and resulted in excellent classification between cancerous and noncancerous cells of the 22 most common cancers. In this work we did not reduce the dimension of the input data; we left all the statistical analysis to the ML system, which did its job very well, distinguishing cancerous tumors from healthy cells perfectly in most cancer-classifier combinations and almost perfectly in the remaining pairs, as shown in Table 3. More details on the information summarized in Table 3 can be found in Tables 4-45. We learn from this experiment that dimension reduction should be applied only after the famous problem known as "the curse of dimensionality" has actually occurred and the system cannot cope; otherwise it is better to leave the calculations to the machine and not blindly reduce the dimensions when the machine has no trouble handling all the available features. For example, in this study the number of tumor samples ranges from 77 (ACC) to 1091 (BRCA), and the number of healthy tissue samples ranges from only 9 (Bladder) and 10 (Cervix) to 1152 (Brain), yet our classifiers could learn from the large distance between healthy and cancerous bladder tissues in the 19627-dimensional space and correctly classify all cancerous and healthy tissues. For instance, after being trained on 6 healthy samples and 287 cervical squamous cell carcinoma and endocervical adenocarcinoma samples, all classifiers, i.e. GNB, SVM, DCT, RFC, LGR and KNN (K=5), classified the remaining 4 healthy samples and 17 cancerous samples with 100% precision; only KNN with K=3 failed in some cases, with an overall f-score of 0.99 instead of 1, owing to a loss of perfect precision in one class and recall in the other. Meanwhile, the differences among KNN runs with different values of K were negligible.
The important point here is that, despite having little training data, the ML classifiers could learn to classify the two groups perfectly, thanks to the large number of dimensions and the differences they provide. If we decrease the number of features/dimensions while also lacking a large number of samples, we may have thrown away useful data, so the system cannot be trained well or classify as accurately; most previous researchers did exactly this and consequently obtained less excellent results. Our hypothesis is that a large number of features and dimensions can compensate for the lack of a large number of samples. Imagine, for example, points in a 3-dimensional Cartesian space. If we have two classes of points such that all coordinates of one class are positive and all coordinates of the other class are negative, any intelligent system, including an AI system, can quickly learn to separate the two groups even from a small number of samples. We think that our natural 19K+-dimensional space is perfect for distinguishing healthy samples from impaired cancerous samples after seeing dozens of samples; even the results after analyzing only a few samples of each class are remarkable. Another fact to notice is that the samples from TCGA and GTEx are diverse, representing different ages, sexes, races and different stages of cancer; their statistical details are available on the TCGA and GTEx websites. TCGA has not provided stage information for the GBM, LGG, OV, PRAD and UCEC tumors, but all other tumors are categorized into four different stages, and the classifiers work well on all the data, including the early-stage samples. Since our classifiers return perfect results on all of them, the demographic information has no serious effect on performance, and we do not need to deal with its statistical details and factors one by one.
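The 3-dimensional Cartesian toy example above can be made concrete; the sample sizes and coordinate ranges here are illustrative assumptions:

```python
# The 3-D toy example above: one class with all-positive and one with
# all-negative coordinates. A linear classifier separates them from just a
# handful of samples, illustrating how separation in many dimensions can
# offset a small sample count.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
pos = rng.uniform(0.5, 5.0, size=(5, 3))     # 5 points, all coordinates > 0
neg = -rng.uniform(0.5, 5.0, size=(5, 3))    # 5 points, all coordinates < 0
X = np.vstack([pos, neg])
y = np.array([1] * 5 + [0] * 5)

clf = SVC(kernel="linear").fit(X, y)
preds = clf.predict([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]])  # unseen points
print(preds)   # [1 0]
```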
Another interesting fact is that the separation between cancerous and noncancerous cells appears to be a linear classification problem: an SVM with linear kernel can classify almost all cancer types from their corresponding healthy tissue cells perfectly, but when the SVM was tried with its default kernel, i.e. rbf, the results were not good. The difference between the linear and nonlinear separators implies that the samples of the two groups are naturally linearly separable. In comparison, as seen in the result tables in the appendix/supplementary materials summarized in Table 3, the SVM with linear kernel, LGR and RFC were almost perfect and slightly superior to the other classifiers, returning perfect results for almost all tumor types with only a few percentage points of error on the few remaining cancers, while KNN and DCT were usually inferior, although there were cases in which DCT or KNN returned perfectly precise results where the others failed to reach 100%; even KNN, the weakest classifier here, always classified tumors and healthy tissues with no worse than 94% accuracy. GNB, with our most common settings, registered a record-low accuracy of 92% only on liver cancer (LIHC), which appears to be an outlier result, because under any change, even using less training data (70% instead of 90%), its accuracy was always 98% or more. The simulations are illustrated in the tables below. GNB performs particularly well when little data is available and therefore should not be ignored when resources such as data or computational power are limited. Finally, we also examined the performance of the ML classifiers on several pairs of cancers, as shown in the last table, i.e. Table 45. First we applied them to distinguish two types of brain tumor, i.e. LGG and GBM, then two types of lung cancer, i.e. LUAD and LUSC, followed by two cancers of the kidney. All of them were separated with accuracy above 90%, which is remarkable because tumors of the same organ are expected to resemble each other.
The lowest rate of correct classification was for distinguishing Colon Adenocarcinoma (COAD) from Rectum Adenocarcinoma (READ), which could be done at best with about 75% accuracy by RFC, DCT and KNN. This is still notable, because READ and COAD are so similar that Siegel et al. [1] report the estimated deaths from these two cancers together; in practice, many hospitals confuse them and count READ as COAD, treating them as one class of cancer. Therefore, being able to distinguish them with more than 70% accuracy, even by K-Means unsupervised clustering, is encouraging. The successful separations between two similar cancers of one organ, such as LUAD and LUSC shown in Table 45, may open a new approach to cancer diagnosis: it may be better first to detect whether the biopsy sample of an organ such as brain or lung is healthy or a malignant tumor, and then, if it is cancerous, feed the data again to the ML classifiers to detect which type of tumor it is. The benefit of this approach is that in a binary classification between healthy and cancerous samples of an organ there is only one option for the cancerous cells: if, for example, the patient suffers from LUSC but the sample is given to a classifier that decides between healthy and LUAD, it will likely be classified as LUAD. Therefore, it is better either to classify from the beginning as healthy versus cancerous, including all relevant cancers, and then distinguish among the cancers; or, if the sample is recognized as one specific cancer of an organ, to test it again carefully to classify it correctly among the potential cancers of that organ; or to use a multi-class classifier from the beginning instead of binary classifiers. For detailed information refer to Tables 20, 21, 22 and 23 and the last row of Table 45.
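The kernel sensitivity noted earlier can be sketched as follows. On the paper's unnormalized expression profiles the rbf kernel reportedly failed while the linear kernel was perfect; on the well-behaved synthetic stand-in data below both may score highly, so this only demonstrates the mechanics of the comparison, not the paper's result:

```python
# How the linear-vs-rbf kernel comparison can be run, on synthetic stand-in
# data (two Gaussian "tissue" classes in a high-dimensional space).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n, d = 200, 1000
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, d)),   # "healthy"
               rng.normal(1.0, 1.0, size=(n // 2, d))])   # "cancerous"
y = np.array([0] * (n // 2) + [1] * (n // 2))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

accs = {k: SVC(kernel=k).fit(X_tr, y_tr).score(X_te, y_te)
        for k in ("linear", "rbf")}
for k, a in accs.items():
    print(f"{k}: {a:.2f}")
```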
5. Conclusion
These ML systems are now trained and ready to receive any potential patient's data and recognize whether the sampled organ is cancerous. They detect the problem accurately at different stages of cancer and can therefore be helpful in early diagnosis. The limitation of our model is that it needs data from samples taken from organs, and most people do not have easy access to this level of their own personal genetic data. Thus the next step could be finding suitable biomarkers in the blood that can separate healthy people from patients using only blood samples. Furthermore, the world is realizing the importance of creating databases of single-cell WES, which will enable more accurate cancer studies; the corresponding big data can also improve ML systems to work perfectly on all cancers affecting each organ of each patient in personalized medicine, which can employ the most effective treatments for each person based on their specific differences. Our work is nevertheless one step forward, because the best previous work known to us, by Sun et al. [11], implements complicated deep neural networks that are not only computationally more exhausting but also lack clarity, a common shortcoming of deep neural networks, whereas our simple algorithms run faster, reach their results with better precision and recall, work especially well with little data, and their mechanism of action is not a black box, because the logic of their performance is understandable. As evidence for this claim, we specifically tried the classification of a problematic pair of tumor types, i.e. lung's LUSC and bladder's BLCA: our classifiers again classified these two tumors with more than 90% accuracy, a pair on which their deep neural network classifier was confused.
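As an illustration of that interpretability, a linear model's per-gene coefficients can be read off and ranked directly; the data are synthetic and the gene names (GENE_0, GENE_1, ...) are hypothetical placeholders for the real expression matrix:

```python
# Sketch of the interpretability argument above: unlike a deep network, a
# linear model exposes per-gene coefficients, so the features driving the
# decision can be ranked. Synthetic data and hypothetical gene names stand in
# for the real expression matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_genes = 100
genes = [f"GENE_{i}" for i in range(n_genes)]    # hypothetical names
X = rng.normal(size=(80, n_genes))
y = (X[:, 7] > 0).astype(int)                    # label driven by gene 7
X[y == 1, 7] += 2.0                              # "upregulated" in class 1

clf = LogisticRegression(max_iter=5000).fit(X, y)
weights = clf.coef_.ravel()
top = np.argsort(np.abs(weights))[::-1][:5]      # most influential genes
for i in top:
    print(genes[i], round(float(weights[i]), 2))
```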
[11] Most notably, our classifiers are better suited to identifying the important features that play the greatest role in the classification, namely the most upregulated or most downregulated genes between the cancerous and noncancerous groups. The latter provides invaluable information, both directly and indirectly, to be used together with supporting knowledge of the pathways that cause cancers.

Funding Information
Thanks to KTH Royal Institute of Technology and its Library for their Open Access Publication Grant, as well as to the Houshmand family and their companies, especially GholamAbbas, Atash, Shahab, Shahin and Shadab, for their financial support of my studies and projects.
Research Resources
The cancerous sample data are from The Cancer Genome Atlas (TCGA), and the corresponding healthy tissue samples are from the Genotype-Tissue Expression (GTEx) project. All machine learning algorithms are taken from scikit-learn.
Acknowledgments
I'd like to acknowledge all those who have contributed to who I am and where I am, starting from my parents and grandparents, my siblings, all my great teachers, and everybody who has taught me a single word of science, especially those whose knowledge of the life sciences or of computer science and machine learning has had a direct impact on this research. I also particularly thank Mr. Eng. GholamAbbas Houshmand for his great contribution to financing my studies and projects, as well as the KTH Library for supporting the open-access publication of this paper so that it is accessible to all who are interested.
Further Reading
For readers who want more information on the concepts in this article, please look for my upcoming articles; I am developing and testing new ideas and will report them to the scientific community in due course.
References

[1] Siegel, R. L., Miller, K. D., and Jemal, A., Cancer Statistics, 2020, CA: A Cancer Journal for Clinicians 70(1): 7-30 (2020).
[2] Momenimovahed, Z., and Salehiniya, H., Epidemiological characteristics of and risk factors for breast cancer in the world, Breast Cancer: Targets and Therapy 11: 151 (2019).
[3] Tsuji, S., and Aburatani, H., Machine Learning Applications in Cancer Genome Medicine, Gan to kagaku ryoho. Cancer & Chemotherapy 46(3): 423-426 (2019).
[4] Nik-Zainal Abidin, S., Memari, Y., and Davies, H., Holistic cancer genome profiling for every patient, Swiss Medical Weekly 150: w20158 (2020). https://doi.org/10.4414/smw.2020.20158
[5] Asri, H., et al., Using machine learning algorithms for breast cancer risk prediction and diagnosis, Procedia Computer Science 83: 1064-1069 (2016).
[6] Vamathevan, J., et al., Applications of machine learning in drug discovery and development, Nature Reviews Drug Discovery 18(6): 463-477 (2019). https://doi.org/10.1038/s41573-019-0024-5
[7] Streiner, D. L., Clinical medicine and the legacy of the Reverend Bayes, International Journal of Clinical Practice 73(4): e13323 (2019).
[8] Myung, I. J., Tutorial on maximum likelihood estimation, Journal of Mathematical Psychology 47: 90-100 (2003).
[9] Zhang, H., The optimality of naive Bayes, AA 1(2): 3 (2004).
[10] Pes, B., Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains, Neural Computing and Applications: 1-23 (2019).
[11] Sun, Y., et al., Identification of 12 cancer types through genome deep learning, Scientific Reports 9(1): 1-9 (2019).
[12] Abeel, T., et al., Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics 26(3): 392-398 (2010).
[13] Sidney, S., Go, A. S., and Rana, J. S., Transition From Heart Disease to Cancer as the Leading Cause of Death in the United States, Annals of Internal Medicine 171(3): 225 (2019).
[14] Alabsi, A. M., Ali, R., Ali, A. M., Al-Dubai, S. A. R., Harun, H., Abu Kasim, N. H., and Alsalahi, A., Apoptosis induction, cell cycle arrest and in vitro anticancer activity of gonothalamin in a cancer cell lines, Asian Pacific Journal of Cancer Prevention 13(10): 5131-5136 (2012).
[15] Tomczak, K., Czerwinska, P., and Wiznerowicz, M., The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge, Contemporary Oncology 19(1A): A68 (2015).
[16] Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., ... and Foster, B., The genotype-tissue expression (GTEx) project, Nature Genetics 45(6): 580-585 (2013).
[17] Cortes, C., and Vapnik, V., Support-vector networks, Machine Learning 20(3): 273-297 (1995).
[18] Boser, B. E., Guyon, I. M., and Vapnik, V. N., A training algorithm for optimal margin classifiers, in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152 (1992).
[19] Ben-Hur, A., Horn, D., Siegelmann, H. T., and Vapnik, V., Support vector clustering, Journal of Machine Learning Research 2(Dec): 125-137 (2002).
[20] Noble, W. S., What is a support vector machine?, Nature Biotechnology 24(12): 1565-1567 (2006).
[21] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., Classification and Regression Trees, Statistics/Probability Series (1984).
[22] Ho, T. K., Random decision forests, in Proceedings of the 3rd International Conference on Document Analysis and Recognition, Vol. 1, pp. 278-282, IEEE (1995).
[23] Ho, T. K., The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8): 832-844 (1998).
[24] Cox, D. R., The regression analysis of binary sequences, Journal of the Royal Statistical Society: Series B (Methodological) 20(2): 215-232 (1958).
[25] Fix, E., and Hodges, J. L., Discriminatory analysis, nonparametric discrimination (1951).
[26] Peterson, L. E., K-nearest neighbor, Scholarpedia 4(2): 1883 (2009).
[27] MacQueen, J., Some methods for classification and analysis of multivariate observations, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, No. 14, pp. 281-297 (1967).
Author contributions statement
A.H. is the sole author of this paper and has done all the simulations himself.
Additional information
Arash Hooshmand (ORCiD 0000-0002-9263-0282) is the sole and hence corresponding author of this paper and declares that there is no conflict of interest regarding its publication. Please do not hesitate to contact me via [email protected] if you have any questions.