Machine Learning Against Cancer: Accurate Diagnosis of Cancer by Machine Learning Classification of the Whole Genome Sequencing Data
Arash Hooshmand, KTH Royal Institute of Technology
Abstract
Supervised machine learning can precisely identify different cancer tumors at any stage by classifying cancerous and healthy samples based on their genomic profile. We have developed novel methods of MLAC (Machine Learning Against Cancer) achieving perfect results with perfect precision, sensitivity and specificity. We have used the whole genome sequencing data acquired by next generation RNA sequencing techniques in The Cancer Genome Atlas and Genotype-Tissue Expression projects for cancerous and healthy tissues respectively. Indeed, a creative way to work with data and general algorithms has resulted in perfect classification, i.e. precision, sensitivity and specificity all equal to 1 for most tumor types, even with a modest amount of data. Our system can be used in practice because once the classifier is trained, it can be used to classify any new sample from new potential patients. One advantage of our work is that the aforementioned perfect precision and recall are obtained on samples of all stages, including very early stages of cancer; therefore, it is a promising tool for diagnosis of cancers in early stages. Another advantage of our novel model is that it works with normalized values of RNA sequencing data, hence people's private sensitive medical data will remain hidden, protected and safe. This type of analysis will become widespread and economical in the future, and people may even learn to receive their RNA sequencing data and do their own preliminary cancer studies, which has the potential to help healthcare systems.
1. Introduction
Cancer is one of the most common risks threatening people's lives and remains a severe unsolved problem as one of the main leading causes of death [13]. Early diagnosis plays a vital role in cancer treatment and survival. There are dozens of different types of cancers affecting different organs because cancer can start in any part of the body. [14] In fact, cancer starts when cells in the body begin to grow out of control and is usually caused by genetic mutations in different cells. Yet the main underlying reasons that cause these mutations are unknown. In recent years, along with the generation of big data by high-throughput omics technologies, applications of ML (Machine Learning) in the diagnosis, treatment and prognosis of cancers have become hot topics of research. Consequently, computers have turned out to be promising tools and reliable assistants that contribute to new discoveries based on analysis of the big data generated by high-throughput technologies. In this work, we have proposed a novel approach to using genetic transcriptomic data, leading to great results that perfectly distinguish between WES (Whole Exome Sequencing) RNA profiles of 22 different main cancers from TCGA (The Cancer Genome Atlas) [15] and their corresponding healthy tissue samples from the GTEx (Genotype-Tissue Expression) project [16], with samples from different numbers of people as illustrated in Table 1. Table 1 also contains the reported numbers of estimated new cases and estimated deaths for each cancer, out of an estimated 1,806,590 new cases of all cancers with an estimated 606,520 deaths in 2020 in the U.S., for instance. [1] Reusability and transfer learning are among the main advantages of ML: once a machine is well trained and can distinguish cancerous from noncancerous tissues, it can be fed new data from new samples acquired from new people, and the new sample will be classified correctly with high likelihood.
Therefore, the utilization of ML systems that can detect a cancerous genome even at the earliest stages using NGS (Next Generation Sequencing) technology is likely to be a killer application.

Table 1 columns: TCGA abbreviation, Cancer (TCGA), Organ (GTEx).
2. Methods
ML and AI (Artificial Intelligence) are rapidly finding their place in the medical and pharmaceutical sciences. Different ML models have been tested successfully in recent years in many projects, as well as in this work, and have returned decent results. Naïve Bayes, Support Vector Machines, Decision Trees, Random Forest, Logistic Regression and K Nearest Neighbors are examples of general supervised ML algorithms that have reportedly given great results in different projects in different fields of science, and they are analyzed in our project too. In addition to them, an unsupervised ML method, K-Means, is also tested. In this work we came up with a practical approach to applying ML for cancer diagnosis that is effective and robust across the different ML algorithms we have tried. Since they are the most well-known common ML algorithms, we briefly introduce them in the following paragraphs. The WES genetic information obtained by NGS is openly available on TCGA, GTEx and other online public databases. However, we do not review the technical details of RNA sequencing techniques because they are out of the scope of the current paper. The focus of our work in this article is to receive the data from the two aforementioned open databases, train the ML classifiers with them, and validate them. To do this, we have used the following ML techniques:
Bayes' theorem was proposed by the Englishman Thomas Bayes and published in 1763, when he was trying to prove the existence of God by means of statistical inference. [7] Bayesian statistics is used in estimates based on anticipated subjective knowledge. Implementations of this theorem therefore adapt with use and allow combining, or fusing, data from two or more different sources and expressing them in terms of likelihood. The Naive Bayesian Classifier is an implementation of Bayes' theorem with an additional simplifying hypothesis of independence between the predictor variables; hence "Naive" is added to the name of these implementations, because a naive Bayesian classifier assumes that the features of a class/object are not related to each other, i.e. the presence of a particular feature is not related to the presence or absence of another. In this way each feature independently contributes to the probability of a given class. In return, Bayes classifiers can be trained easily, require little data to train, and can classify big data quickly. Despite the fact that naive Bayes classifiers are amazingly simple, they have worked quite well in many real-world situations, including our cancerous/healthy tissue classification. Our naive Bayes classifier requires a small amount of training data and is fast and accurate, as reflected in the Results section. On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator in the sense that one cannot rely on its parameters for extraction of feature importance. [9]
More formally, as shown by equations 1-6, Bayesian classifiers are, indeed, probabilistic classifiers using Bayes' rule, i.e.

P(A|B) = P(A) P(B|A) / P(B)   (1)

For example, A can be cancer and B a positive cancer test result; the posterior probability of cancer given a positive test result is proportional to the prior times the sensitivity, i.e. the chance of a positive result given cancer. Indeed, a naive Bayesian classifier accomplishes statistical inference based on maximum likelihood estimation, i.e. setting the parameters of the probability distribution in a way that maximizes the goodness of fit of the statistical model to the training data via joint probability distributions of the training samples. In technical words, the likelihood function describes a hypersurface whose peak, if it exists, is an arrangement of model parameter values and coefficients that maximizes the probability of drawing the obtained sample. [8] In its more general form, according to the Scikit-Learn documentation, Bayes' theorem states the following relationship, given class variable y and dependent feature vector x_1 through x_n:

P(y | x_1, ..., x_n) = P(y) P(x_1, ..., x_n | y) / P(x_1, ..., x_n)   (2)

Using the naive conditional independence assumption that

P(x_i | y, x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = P(x_i | y)   (3)

for all i, this relationship is simplified to

P(y | x_1, ..., x_n) = P(y) ∏_{i=1}^{n} P(x_i | y) / P(x_1, ..., x_n)   (4)

Since P(x_1, x_2, ..., x_n) is constant given the input, we can use the following classification rule:

P(y | x_1, ..., x_n) ∝ P(y) ∏_{i=1}^{n} P(x_i | y)  =>  ŷ = argmax_y P(y) ∏_{i=1}^{n} P(x_i | y)   (5)

We can use MAP (Maximum A Posteriori) estimation to estimate P(y) and P(x_i | y); the former is then the relative frequency of class y in the training set.
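As a concrete illustration, the MAP rule above with a Gaussian likelihood can be exercised in a few lines with Scikit-Learn's GaussianNB. This is a minimal sketch: the arrays and class centers are illustrative synthetic data, not the TCGA/GTEx expression matrices used in this work.

```python
# Minimal sketch of the MAP classification rule with a Gaussian likelihood.
# Synthetic toy data: class 0 features centered at 0, class 1 at 3.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)),
               rng.normal(3.0, 1.0, size=(50, 5))])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)           # estimates P(y), mu_y and sigma_y
pred = clf.predict([[3, 3, 3, 3, 3]])  # argmax_y P(y) * prod_i P(x_i | y)
```

A point at the class-1 mean is assigned to class 1, exactly as the classification rule in equation 5 prescribes.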
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(x_i | y). [9] GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:

P(x_i | y) = (1 / sqrt(2π σ_y²)) exp(−(x_i − µ_y)² / (2σ_y²))   (6)

where the parameters σ_y and µ_y are estimated using maximum likelihood.

Support Vector Machines (SVMs) are a set of supervised learning algorithms developed by Vladimir Vapnik and his team at AT&T Labs. [17, 18, 19] These methods apply to both classification and regression problems. In classification, given a set of labeled training examples, we can train an SVM to build a model that predicts the class of a new sample. Intuitively, an SVM is a model that represents the sample points in space, separating the classes into regions as distant as possible using a separating hyperplane defined by the points of the two classes closest to it, which are called the support vectors. When new samples are put before the model, depending on the region to which they belong, they can be classified into the right class.
More formally, an SVM builds a set of hyperplanes [20] in a very high (or even infinite) dimensional space that can be used in classification or regression problems. A good separation between the classes allows a correct classification. In this concept of optimal separation resides the fundamental characteristic of SVMs: this type of algorithm searches for the hyperplane that has the maximum distance (margin) to the points that are closest to it. This is why SVMs are also sometimes referred to as maximum margin classifiers. In this way, the points that are labeled with one category will be on one side of the hyperplane and the cases in the other category will be on the other side. SVM algorithms intrinsically belong to the family of linear classifiers. The vector formed by the points closest to the hyperplane is called the support vector. Using tricks such as kernel functions, SVMs can also be an alternative training method for polynomial classifiers, radial basis functions, and multilayer perceptron neural networks. Figure 1 illustrates the SVM mechanism.

Figure 1: SVM separating hyperplanes, from the article "Support vector machine" in Wikipedia (2012). Accessed June 27, 2020.
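This mechanism can be sketched with Scikit-Learn's SVC; the linear kernel and probability flag here match the settings reported later in this article, while the data is synthetic and purely illustrative.

```python
# Linear-kernel SVM sketch on synthetic, linearly separable toy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(40, 3)),
               rng.normal(2, 1, size=(40, 3))])
y = np.array([0] * 40 + [1] * 40)

svm = SVC(kernel="linear", probability=True).fit(X, y)
pred = svm.predict([[2, 2, 2]])         # side of the separating hyperplane
proba = svm.predict_proba([[2, 2, 2]])  # class probabilities, usable for ROC
```

With probability=True, predict_proba returns the per-class probabilities needed to plot a ROC curve.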
A decision tree is a tree-like map of the possible outcomes of a series of decisions and only contains conditional control statements comparing possible actions with each other according to their costs, probabilities and utilities. The goal of a decision tree is to break down all the available data that a system can learn from and group it so that the entries within each group are as similar to each other as possible with respect to the goal metric, while between groups the entries are as different as possible relative to the goal metric (for example, conversion rate). The decision tree takes into account the different variables existing in the training set to determine how to divide the data MECE (mutually exclusive, collectively exhaustive) into these groups or leaves to maximize the goal. A decision tree typically starts with a single node and then branches out into possible outcomes. Each of the outcome nodes creates additional nodes, which branch into other different possibilities, creating a structure similar to that of a tree. There are three different types of nodes: chance nodes, decision nodes, and terminal nodes. A chance node, typically represented by a circle, shows the probabilities of certain outcomes. A decision node, typically represented by a square, shows a decision to be made, and a terminal node, typically represented by a triangle, shows the final result of a decision route. Decision trees are still popular for advantages such as requiring minimal data processing and being easily understood, updated (new options can be added to existing trees), and integrated with other decision-making tools. However, decision trees can become very complex; in those cases, a more compact influence diagram can be a good alternative focusing on fundamental goals, inputs, and decisions. [21]
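The splitting described above can be sketched in a few lines; the feature names and toy values below are hypothetical, chosen only so the learned split is easy to see.

```python
# Decision-tree sketch: the tree learns axis-aligned splits over the features.
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0], [2, 1], [3, 1], [6, 0], [7, 1], [8, 0]]  # [feature_a, feature_b]
y = [0, 0, 0, 1, 1, 1]  # label 1 whenever feature_a is large

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = tree.predict([[6.5, 0], [1.5, 1]])  # follows the feature_a split
```

Because the labels depend only on feature_a, the fitted tree contains a single split on that feature, which is the kind of interpretable structure that keeps decision trees popular.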
By iteratively applying the algorithm that creates decision trees with different parameters to the same data, we get what is called a random forest. This algorithm is one of the most efficient prediction methods for big data, since it averages the performance of many different noisy models and impartially reduces the final variability of the ensemble. In practice, different training and test sets are built on the same data, which generates different decision trees. The union of these trees of different complexities, trained on data of different origin although from the same set, results in a fairly stable random forest whose main characteristic is that it creates more robust models than could be obtained by creating a single decision tree on the same data. In classification, the class that is the mode of the individual trees' outputs is returned. [22, 23]
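A minimal sketch of the ensemble just described, on hypothetical toy data: many trees are fit on bootstrapped versions of the same set, and the predicted class is the mode of their votes.

```python
# Random forest sketch: an ensemble of decision trees on bootstrapped data;
# the predicted class is the mode of the individual trees' votes.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [10, 10], [11, 9], [9, 11]]
y = [0, 0, 0, 1, 1, 1]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = forest.predict([[10, 10], [0, 0]])
```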
Logistic regression is a group of statistical techniques that aim to test hypotheses or causal relationships when the dependent variable is nominal. Despite its name, it is not an algorithm applied in regression problems, in which continuous values are dealt with, but a method for classification problems, in which a binary value, either 0 or 1, is obtained. For example, a classification problem is to identify whether a given tumor is malignant or benign. With logistic regression, the relationship between the dependent variable, i.e. the statement to be predicted, and one or more independent variables, i.e. the set of features available to the model, is determined. To do this, it uses a logistic function that determines the probability of the dependent variable. As previously mentioned, what is sought in these problems is a classification, so the probability must be translated into binary values, for which a threshold value is used. If the probability is above the threshold, the statement is true, and vice versa. Generally this value is 0.5, although it can be increased or decreased to manage the number of false positives or false negatives. [24] The function that relates the dependent variable to the independent ones is usually either the sigmoid function or a similar function such as tanh or softmax. The sigmoid function is an S-shaped curve that can take any value between 0 and 1, but never values outside these limits. The equation that defines the sigmoid function is f(x) = 1/(1 + e^(−x)), where x is a real number. In the equation you can see that when x tends to minus infinity the function tends to zero, and when x tends to infinity the function tends to one. Figure 2 shows a graphical representation of the logistic (sigmoid) function.

Figure 2: Sigmoid function as the logistic curve.

Logistic regression is a technique widely used because of its effectiveness and simplicity.
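The sigmoid thresholding just described can be sketched directly; the 0.5 threshold is the default mentioned above, and the inputs are illustrative.

```python
# Sigmoid function and the default 0.5 decision threshold described above.
import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); tends to 0 as x -> -inf and to 1 as x -> +inf
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, threshold=0.5):
    # Probabilities above the threshold map to class 1, otherwise class 0.
    return 1 if sigmoid(x) > threshold else 0

print(sigmoid(0))                 # 0.5, the midpoint of the S-curve
print(classify(3), classify(-3))  # prints 1 0
```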
As one of its advantages, it does not require large computational resources, neither in training nor in execution. Furthermore, the results are highly interpretable, which is one of its main advantages: the weight of each feature determines the importance it has in the final decision. Therefore, it can be affirmed that the model has made one decision or another based on the existence of one or another feature, which in many applications is highly desired in addition to the model itself. Regarding its disadvantages, it cannot directly solve non-linear problems because the expression that makes the decision is linear. For example, in the event that the probability of a class initially decreases with a feature and subsequently increases, this cannot be captured by a logistic model directly; if necessary, the feature should first be transformed so that the model can capture this non-linear behavior. In these cases, it is better to use other models such as decision trees. Indeed, the important point is that the target variable must be linearly separable; otherwise, the logistic regression model will not classify correctly. In other words, there must be two "regions" with a linear border in the data. Another drawback is the dependency it shows on the features. Logistic regression is not one of the most powerful algorithms that exist; it would easily be surpassed by other, more complex classifiers. Finally, in machine learning there are classifiers that can work with multiple classes, such as Decision Trees or Random Forest, and others that cannot, such as Logistic Regression. However, it is always possible to use tricks to apply logistic regression to classification problems with multiple classes, such as:

• OvA (One versus all): In this strategy, you train as many binary classifiers as there are classes in the data set.
Each of the models predicts the probability that the record belongs to its class. When making a prediction, all classifiers are run and the one with the highest probability is selected.

• OvO (One versus one): In this strategy, as many models are created as there are pairs of possible outcomes, that is, N(N − 1)/2 models have to be trained, where N is the number of possible classes. Each classifier decides only between two possible outcomes. As in the previous case, when making a prediction, all classifiers are run and the one with the highest probability is selected.

K-Nearest-Neighbor is a simple nonparametric instance-based supervised ML algorithm. It can be used to classify new samples (discrete values) or to predict (regression, continuous values). It is essentially used to classify values by searching for the most similar (by proximity) data points learned in the training stage and making guesses for new points based on that prior classification. [25, 26] Unlike K-Means, an unsupervised algorithm where "K" means the number of groups into which we want to cluster, in K-Nearest Neighbor "K" means the number of "neighboring points" we consider in the vicinity to classify among the n groups that are already known in advance. It is a method that simply searches for the observations closest to the point of interest to be classified and classifies it based on the majority of the data that surrounds it. As said before, the K nearest neighbor algorithm is:

• Supervised: we have tagged our training data set with the class or expected results.

• Instance-based: the algorithm does not explicitly learn a model (as Logistic Regression or Decision Trees do). Instead, it memorizes the training instances, which are used as the knowledge base for the prediction phase.

KNN is easy to learn and implement. However, it uses the entire training set to classify each new point and therefore requires a lot of memory and processing resources.
For these reasons, KNN tends to work best on small data sets without a huge number of features. To classify an input by means of KNN one should:

1. Calculate the distance between the item to classify and the other items in the training data set.
2. Select the closest K elements (those with the smallest distance, depending on the function used).
3. Carry out a majority vote among the K points: the class/label held by most of them determines the final decision.

Taking point 3 into account, the value of K is very important in deciding the class of a point, because it defines which points' majority will determine the group each new point belongs to; it is especially critical when new points fall on the borders between groups.
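The three steps above can be written out by hand in a few lines. This is an illustrative NumPy sketch on toy data, not the Scikit-Learn KNeighborsClassifier used elsewhere in this work.

```python
# Hand-written KNN following the three steps described above.
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Distance from the new item to every item in the training set.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Select the K closest elements.
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote among the K labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([5, 5]), k=3)
```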
K-Means is an unsupervised ML algorithm for clustering. It is used when we have a lot of untagged data. The objective of this algorithm is to find K groups (clusters) among the raw data. The algorithm works iteratively to assign each input, such as a genome sample, to one of the K groups based on its features, here the genes; i.e. the inputs are grouped based on the similarity of their features. [27] As a result of executing the algorithm:

• The centroids, i.e. the geometric centers of the groups, become the coordinates of the corresponding K clusters and are used to label new samples.

• Labels are produced for the training data set: each tag belongs to one of the K defined groups.

The groups are defined dynamically, i.e. their positions are adjusted in each iteration of the process until the algorithm converges. Once the centroids are found, they are analyzed to see what their unique features are compared to those of the other groups. These groups are the labels that the algorithm generates. The K-Means clustering algorithm is one of the most used methods to find hidden or theoretically suspected groups in an unlabeled data set. This can serve to confirm or reject hypotheses we have assumed about our data, and it can also help to discover hidden relationships between data sets. Once the algorithm has executed and obtained the labels, it is easy to classify new values or samples among the obtained groups. The algorithm works by pre-selecting a value of K. To find the number of clusters in the data, we must run the algorithm for a range of K values, see the results, and compare the characteristics of the groups obtained. In general, there is currently no exact way to determine the K value, but it can be estimated with acceptable precision using the following technique: one of the metrics used to compare results is the average distance between the data points and their centroid.
As long as the value of this mean keeps decreasing as we increase the value of K, we continue increasing K. The mean distance to the centroid is considered as a function of K, and the goal is to find the elbow point where the rate of descent abruptly flattens.

In supervised ML classification methods, as well as in the K-Means unsupervised clustering algorithm, the input data (the samples) are viewed as p-dimensional vectors (arrays or ordered lists of p numbers, where p in this project is 19627). The classifiers then, based on their criteria, distinguish among different groups formed by close/similar samples; e.g. a Bayesian classifier looks for a hypersurface that maximizes the likelihood of drawing the sample, while an SVM looks for a hyperplane that optimally separates the points of one class from the other, possibly after projecting them to a higher-dimensional space. There have been wrong perceptions in the ML community preventing potential achievements; for instance, people try to decrease the number of features to avoid "the curse of dimensionality". While the curse of dimensionality may truly occur in some problems, it may not be an issue in others such as ours. Deleting features blindly for fear of dimensionality may only result in losing useful information without need. Researchers usually try to reduce by themselves the assumed learning pressure on the machines brought about by highly redundant dimensions and select a subset of features, i.e. genes, to reduce the number of features and dimensions. [10, 11, 12] This may have hurt their results. A strong point of our work is that we consider ML a powerful advanced statistical tool doing heavy statistical analyses that people themselves cannot do. As a result, we gave all the data corresponding to the WES as feature inputs to the ML at once, and it returned almost perfect results quickly and precisely.
We thought of the 19627 different genes not as too many features but as the pixels of a less-than-141×141-pixel photo; it was a very light task for the machine to analyze such a low-resolution image, and it took only seconds to classify the cancerous and noncancerous cells 100% precisely.

2.8 Model optimization and settings
We have employed all the classifiers from Scikit-Learn 0.23.1 with their default settings unless mentioned otherwise. For example, Scikit-Learn's Gaussian Naive Bayes classifier, a simple classifier, has only two parameters, priors equal to None and var_smoothing equal to 1e-9, where var_smoothing is the portion of the largest variance of all features that is added to the variances for calculation stability. We did not touch the defaults, but there were exceptions such as SVM, in which we changed two default settings: we used the "linear" kernel instead of the default "rbf", and we set probability=True in order to obtain predict_proba, a useful attribute for calculating and plotting the ROC curve that is unavailable under the default probability=False. Therefore, except for these two minor changes to the SVM defaults, all models were run with the default settings of Scikit-Learn version 0.23. As for the other settings, for most of the cancers 90% of the specific cancer's samples were used as the training dataset and the remaining 10% as the testing dataset, chosen at random using the train_test_split function of the Scikit-Learn model_selection module with the random seed equal to zero. The only exceptions were bladder and cervix, for which the number of healthy samples was too low; therefore, we used 40% for training and 60% for testing in bladder cancer, and 70% for training and 30% for testing in cervix cancer. However, we analyzed the effect of different data allocation plans from 10% to 90% for the test/validation set, tried other random seeds, and, in particular for the K Nearest Neighbor algorithm, tried different K values. The results were not significantly different, and they are discussed further in the Discussion part of this article. We also decided to mostly publish the results achieved by those classifiers that can do the classification perfectly; however, all six classifiers work well and the imperfect ones also return results close to perfect.
The models take the 19627-gene WES data as input, and after quick and easy model training with no need for data modification, acceptable classification results are obtained; there are at least two classifiers per cancer that distinguish both cancerous and healthy tissues perfectly with no error.
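The training/evaluation flow described above can be sketched as follows: a 90/10 split with random seed zero and a default classifier. Here X stands in for the n_samples × 19627 expression matrix; the small synthetic arrays are illustrative only.

```python
# Sketch of the train/test flow: 90/10 split with random_state=0 and a
# default classifier; synthetic data stands in for the expression matrix.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 20)),   # stand-in "healthy" profiles
               rng.normal(4, 1, size=(100, 20))])  # stand-in "cancerous" profiles
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
model = GaussianNB().fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)  # accuracy on the held-out 10%
```

Once trained this way, the same fitted model object can classify any new sample, which is the reusability property discussed in the Introduction.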
Model evaluation produces measures to approximate a classifier's reliability. To distinguish between cancerous and noncancerous cells, since it is a binary classification, we use accuracy, precision, specificity, sensitivity, f1-score, several averaging techniques and the ROC curve to evaluate the model. We use, indeed, the Scikit-Learn metrics classification report that returns precision, recall and f1-score for each of the two classes. In binary classification, recall of the positive class is called "sensitivity" and recall of the negative class "specificity". In what follows, the principal terms and then the derivations of equations 7-22 based on the confusion matrix, such as accuracy, specificity, sensitivity and f1-score, are given for review and comparison:

• Condition positive (P): the number of real positive cases in the data
• Condition negative (N): the number of real negative cases in the data
• True positive (TP) or hit
• True negative (TN) or correct rejection
• False positive (FP), false alarm or type I error
• False negative (FN), miss or type II error

Sensitivity, recall, hit rate, or true positive rate (TPR):
TPR = TP/P = TP/(TP + FN) = 1 − FNR   (7)

Specificity, selectivity or true negative rate (TNR):

TNR = TN/N = TN/(TN + FP) = 1 − FPR   (8)

Precision or positive predictive value (PPV) is the ratio of the samples correctly labeled by our program to all the labeled ones:

PPV = TP/(TP + FP) = 1 − FDR   (9)

Precision can be calculated only for the positive class, i.e. class 1 that indicates cancer, or can be evaluated for each of the two classes independently, treating each class in turn as the positive class; the latter is done in the Scikit-Learn metrics classification report as shown in Table 1.

Negative predictive value (NPV):

NPV = TN/(TN + FN) = 1 − FOR   (10)

Miss rate or false negative rate (FNR):

FNR = FN/P = FN/(FN + TP) = 1 − TPR   (11)

Fall-out or false positive rate (FPR):

FPR = FP/N = FP/(FP + TN) = 1 − TNR   (12)

False discovery rate (FDR):

FDR = FP/(FP + TP) = 1 − PPV   (13)

False omission rate (FOR):

FOR = FN/(FN + TN) = 1 − NPV   (14)

Accuracy (ACC):

ACC = (TP + TN)/(P + N) = (TP + TN)/(TP + TN + FP + FN)   (15)

The harmonic mean of precision and sensitivity, or f1-score (F1):

F1 = 2·PPV·TPR/(PPV + TPR) = 2·TP/(2·TP + FP + FN)   (16)

Since we are using the Scikit-Learn metrics classification report to show the results as shown in Table 1, we also describe the meaning of the micro avg, macro avg and weighted avg used in the report.

Micro-average of precision (MIAP):

MIAP = (TP1 + TP2)/(TP1 + TP2 + FP1 + FP2)   (17)

Micro-average of recall (MIAR):

MIAR = (TP1 + TP2)/(TP1 + TP2 + FN1 + FN2)   (18)

Micro-average of f-score (MIAF) is the harmonic mean of the two numbers above:

MIAF = 2·MIAP·MIAR/(MIAP + MIAR)   (19)

Macro-average of precision (MAAP):

MAAP = (Precision1 + Precision2)/2   (20)

Macro-average of recall (MAAR):

MAAR = (Recall1 + Recall2)/2   (21)

Macro-average of f-score (MAAF) is the harmonic mean of the two numbers above:

MAAF = 2·MAAP·MAAR/(MAAP + MAAR)   (22)

Macro-average is suitable for knowing how the system performs overall across different sets of data but should not be considered in any specific decision-making, because it calculates metrics for each label and finds their unweighted mean, i.e. it does not take label imbalance into account, while in our case the labels are highly imbalanced in many sets, e.g. 1091 vs. 179. On the other hand, micro-average is a useful tool and returns the measures for our decision-making, especially when coupled healthy-cancerous datasets vary in size, because it calculates metrics globally by counting the total true positives, false negatives and false positives. Finally, weighted-average, according to the Scikit-Learn documentation on the f1-score metric, calculates metrics for each label and finds their average weighted by support (the number of true instances for each label). This alters "macro" to account for label imbalance; consequently, it can result in an F-score that is not between precision and recall.

The ROC (Receiver Operating Characteristic) curve is created by plotting the true positive rate (TPR), or sensitivity, against the false positive rate (FPR), i.e. 1 − specificity, at different threshold settings. Varying the decision threshold from its maximal to its minimal value results in a piecewise linear curve from (0,0) to (1,1), such that each segment has a non-negative slope (Figure 3). This ROC curve is the main tool used in ROC analysis and, in general, can be used to address a range of problems; however, in our illustrated case where the performance is perfect, it is just a visual endorsement of the perfect classification, and the corresponding AUC (Area Under the ROC Curve) is at its maximum, i.e. 1.

Confusion matrix    Predicted 0    Predicted 1
Class 0             TN = 1         FP = 0
Class 1             FN = 0         TP = 1

Table 2: Typical confusion matrix with no confusion for perfect identification of cancerous tumors by any of GNB, SVM, DCT, RFC, LGR and KNN (K=3), where sensitivity, specificity and precision are all 100%.

Figure 3: Typical ROC curve of the perfect classification done by the classifiers on most tumor types.
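The quantities in equations 7-16 can be computed directly from the confusion matrix. The toy labels below are illustrative and reproduce the perfect case of Table 2 (TN=1, FP=0, FN=0, TP=1).

```python
# Metrics of equations 7-16 computed from a confusion matrix; the labels
# reproduce the perfect-classification case of Table 2.
from sklearn.metrics import confusion_matrix

y_true = [0, 1]
y_pred = [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # TPR, eq. (7)
specificity = tn / (tn + fp)  # TNR, eq. (8)
precision = tp / (tp + fp)    # PPV, eq. (9)
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # eq. (16)
```

In the perfect case all four quantities equal 1, matching the AUC of 1 discussed above.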
3. Results
The classification performance on a total of 7971 cancerous WES samples of 22 specific cancers from the TCGA open public database, together with 4798 WES samples of the corresponding healthy tissues from the GTEx project, was studied across more than 150 ML models. Each sample comprised the normalized expression of 19627 genes (in ppm), and the data for the cancerous and healthy samples of each organ were fed separately and directly to the machine, which performed the heavy statistical calculations on the high-dimensional data. As a result, cancerous samples of all 22 tumor types, at all stages, were correctly identified and separated from their corresponding healthy samples. The task was accomplished and compared across six supervised ML classifiers, i.e. GNB, SVM, DCT, RFC, LGR and KNN, all of which showed perfect, and in some cases near-perfect, performance, as shown in Table 3. In addition, Table 3 contains the results for K-Means, an unsupervised clustering technique applied to evaluate the algorithm's ability to distinguish cancerous from noncancerous cells of different organs as its two main clusters. Clustering is in general different from classification, and its algorithms such as K-Means have their own evaluation techniques. However, we applied a trick, using classification accuracy, to see how well a two-cluster K-Means partition matches the difference of interest, i.e. the two classes of healthy and cancerous cells of an organ, or two classes of two different cancers. The results were impressive here too: in some cases, clustering matched our class labels of healthy and cancerous 100%. One tricky point in interpreting these results is that, unlike classification, where higher accuracy is always better, in clustering 50% accuracy is the worst outcome, while very high and very low accuracies are equally good: an accuracy of zero simply means that the clustering algorithm, here K-Means, has labelled all our class 0, i.e.
healthy samples, as cluster 1, and all our class-1 labels, i.e. cancers, as cluster 0. Therefore, any accuracy a and its counterpart 100 − a are equally good, while 50% indicates maximum entropy and the least match with our classes and labels. The clustering performance was impressive in several cases; in particular, cancerous and noncancerous cells of Pancreas and Testis were separated 100% accurately into two distinct clusters, as shown in Table 3. Models were also successfully employed to further distinguish between different types of an organ's cancer, and two types of cancer were likewise separated perfectly. The performance of the supervised and unsupervised methods under different parameters was also studied. For most changes, such as different random seeds, different volumes of data allocated to the training and testing stages, and different values of K in KNN, the resulting differences were negligible, as shown in Tables 42, 43 and 44 respectively. The exception was the GNB classification performance on LIHC, which improved to at least 98% under other settings. However, some parameters can change and deteriorate the results. For example, SVMs with linear and rbf kernels differ significantly: while the linear kernel accomplishes the classification perfectly in 19 out of 22 cancer types, the three exceptions being LGG, COAD and GBM with 99% accuracy and an f1-score of 0.99, the results with the rbf kernel were not acceptable. An example on LUSC is presented in Table 41, to be compared with Table 42, which contains the results for the SVM with linear kernel. The most common typical ROC curve and confusion matrix for most of our cancer classifications are illustrated in Figure 3 and Table 2; consequently, AUC is naturally 1 for most classifiers in the classification of most cancerous-healthy sample pairs of specific organs. We also analyzed these ML classifiers' capability to distinguish between two types of cancer.
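The cluster-label trick described above can be sketched as follows, with synthetic data standing in for the expression matrix; the names and parameters are illustrative assumptions, not the paper's setup:

```python
# Sketch of the cluster-label trick described above, on synthetic stand-in
# data: K-Means assigns arbitrary cluster ids, so an accuracy a and its
# counterpart 1 - a are equally good matches, and 0.5 is the worst case
# (maximum entropy).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
healthy = rng.normal(loc=-3.0, size=(40, 50))   # well-separated synthetic groups
cancer = rng.normal(loc=3.0, size=(40, 50))
X = np.vstack([healthy, cancer])
y = np.array([0] * 40 + [1] * 40)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
acc = accuracy_score(y, clusters)
match = max(acc, 1.0 - acc)    # invariant to the arbitrary cluster numbering
print(match)                   # 1.0 for perfectly separated groups
```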
While no longer 100% perfect as when classifying healthy versus cancerous tissue, the results were still strong, with accuracy and f1-scores above 90%, better than the previous works we are aware of. The classifiers not only separated samples belonging to two different cancers of the brain as well as of the lung, but also classified well the tumors of lung's LUSC and bladder's BLCA, which were previously reported to be very similar and confusing for deep neural network classifiers [11]. For more details, refer to the Discussion section of this paper below.

Accuracy   GNB    SVM    DCT    RFC    LGR    KNN    K-Means unsupervised clustering
ACC        1.0    1.0    0.95   1.0    1.0    1.0    0.14 = 0.86
BLCA       1.0    1.0    0.98   1.0    1.0    1.0    0.64
LGG        0.98   0.99   1.0    0.99   0.99   1.0    0.77
BRCA       1.0    1.0    0.98   1.0    1.0    0.98   0.91
CECS       1.0    1.0    1.0    1.0    1.0    1.0    0.69
LAML       1.0    1.0    1.0    1.0    1.0    1.0    0.44 = 0.56
COAD       1.0    0.99   1.0    0.99   0.99   0.97   0.18 = 0.92
ESCA       0.99   1.0    1.0    0.99   1.0    0.98   0.59
GBM        0.99   0.99   0.99   1.0    1.0    1.0    0.72
KIRC       0.98   1.0    0.98   0.98   1.0    0.98   0.77
LIHC       0.92   1.0    0.98   1.0    1.0    0.96   0.08 = 0.92
LUAD       1.0    1.0    0.99   1.0    1.0    0.94   0.25 = 0.75
LUSC       1.0    1.0    0.99   1.0    1.0    0.99   0.13 = 0.87
OV         1.0    1.0    1.0    1.0    1.0    1.0    0.69
PAAD       1.0    1.0    1.0    1.0    1.0    1.0    1.0
PRAD       0.98   1.0    1.0    1.0    1.0    0.98   0.97
READ       1.0    1.0    1.0    1.0    1.0    0.95   0.58
SKCM       1.0    1.0    1.0    1.0    1.0    0.99   0.65
STAD       0.98   1.0    0.97   1.0    1.0    0.97   0.17 = 0.83
TGCT       1.0    1.0    1.0    1.0    1.0    1.0    0.0 = 1.0
THCA       1.0    1.0    1.0    1.0    1.0    0.95   0.30 = 0.70
UCEC       1.0    1.0    1.0    1.0    1.0    1.0    0.69

Table 3: ML classifier accuracy for the different tumor-type classifications.
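The comparison behind Table 3 can be sketched as the following loop, with synthetic data standing in for the TCGA/GTEx expression matrices; the generator parameters and split are illustrative assumptions, not the paper's actual configuration:

```python
# Sketch of the comparison behind Table 3, with synthetic data standing in for
# the TCGA/GTEx expression matrices: one train/test split evaluated across the
# six supervised classifier families named in the text.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=500, n_informative=50,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

classifiers = {
    "GNB": GaussianNB(),
    "SVM": SVC(kernel="linear"),              # linear kernel, as in the paper
    "DCT": DecisionTreeClassifier(random_state=0),
    "RFC": RandomForestClassifier(random_state=0),
    "LGR": LogisticRegression(max_iter=5000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in classifiers.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```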
4. Discussion
Our work facilitates effective applications of ML in the medical sciences and resulted in excellent classification between cancerous and noncancerous cells of the 22 most common cancers. In this work we did not reduce the dimension of the input data; we left all the statistical analysis to the ML system, which did its job very well, distinguishing cancerous tumors from healthy cells perfectly in most cancer-classifier combinations and almost perfectly in the remaining pairs, as shown in Table 3. More details on the information summarized in Table 3 can be found in Tables 4-45. We learn from this experiment that dimension reduction should be applied only after the famous problem known as "the curse of dimensionality" has actually occurred and the system cannot cope; otherwise it is better to leave the calculations to the machine and not blindly reduce the dimensions when the machine has no trouble handling all the available features. For example, in this study the number of tumor samples ranges from 77 (ACC) to 1091 (BRCA), and the number of healthy tissue samples ranges from only 9 (Bladder) and 10 (Cervix) to 1152 (Brain), yet our classifiers could learn from the large distance between healthy and cancerous bladder tissues in the 19627-dimensional space and correctly classify all cancerous and healthy tissues. For instance, after being trained on 6 healthy samples and 287 cervical squamous cell carcinoma and endocervical adenocarcinoma samples, all classifiers, i.e. GNB, SVM, DCT, RFC, LGR and KNN (K=5), classified the remaining 4 healthy samples and 17 cancerous samples with 100% precision; only KNN with K=3 failed in some cases, with an overall f-score of 0.99 instead of 1, owing to a loss of perfect precision in one class and recall in the other. Meanwhile, the differences among KNN runs with different values of K were negligible.
The important point here is that, despite having little training data, the ML classifiers could learn to classify the two groups perfectly, thanks to the large number of dimensions and the differences they provide. If we decrease the number of features/dimensions while also lacking a large number of samples, we may have thrown away useful data, so the system cannot be trained well or classify as accurately; most previous researchers did exactly this and consequently obtained less excellent results. Our hypothesis is that a large number of features and dimensions can compensate for the lack of a large number of samples. Imagine, for example, points in a 3-dimensional Cartesian space. If we have two classes of points such that all coordinates of one class are positive and all coordinates of the other class are negative, any intelligent system, including an AI system, can quickly learn to separate the two groups even from a small number of samples. We think that our natural 19K+-dimensional space is perfect for distinguishing healthy samples from impaired cancerous samples after seeing dozens of samples; even the results after analyzing only a few samples of each class are remarkable. Another fact to notice is that the samples from TCGA and GTEx are diverse, representing different ages, sexes, races and different stages of cancer; their statistical details are available on the TCGA and GTEx websites. TCGA has not provided stage information for the GBM, LGG, OV, PRAD and UCEC tumors, but all other tumors are categorized into four different stages, and the classifiers work well on all the data, including the early-stage samples. Since our classifiers return perfect results on all of them, the demographic information has no serious effect on performance, and we do not need to deal with its statistical details and factors one by one.
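The 3-dimensional Cartesian toy example above can be made concrete; the sample sizes and coordinate ranges here are illustrative assumptions:

```python
# The 3-D toy example above: one class with all-positive and one with
# all-negative coordinates. A linear classifier separates them from just a
# handful of samples, illustrating how separation in many dimensions can
# offset a small sample count.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
pos = rng.uniform(0.5, 5.0, size=(5, 3))     # 5 points, all coordinates > 0
neg = -rng.uniform(0.5, 5.0, size=(5, 3))    # 5 points, all coordinates < 0
X = np.vstack([pos, neg])
y = np.array([1] * 5 + [0] * 5)

clf = SVC(kernel="linear").fit(X, y)
preds = clf.predict([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]])  # unseen points
print(preds)   # [1 0]
```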
Another interesting fact is that the separation between cancerous and noncancerous cells appears to be a linear classification problem: an SVM with linear kernel can classify almost all cancer types from their corresponding healthy tissue cells perfectly, but when the SVM was tried with its default kernel, i.e. rbf, the results were not good. The difference between the linear and nonlinear separators implies that the samples of the two groups are naturally linearly separable. In comparison, as seen in the result tables in the appendix/supplementary materials summarized in Table 3, the SVM with linear kernel, LGR and RFC were almost perfect and slightly superior to the other classifiers, returning perfect results for almost all tumor types with only a few percentage points of error on the few remaining cancers, while KNN and DCT were usually inferior, although there were cases in which DCT or KNN returned perfectly precise results where the others failed to reach 100%; even KNN, the weakest classifier here, always classified tumors and healthy tissues with no worse than 94% accuracy. GNB, with our most common settings, registered a record-low accuracy of 92% only on liver cancer (LIHC), which appears to be an outlier result, because under any change, even using less training data (70% instead of 90%), its accuracy was always 98% or more. The simulations are illustrated in the tables below. GNB performs particularly well when little data is available and therefore should not be ignored when resources such as data or computational power are limited. Finally, we also examined the performance of the ML classifiers on several pairs of cancers, as shown in the last table, i.e. Table 45. First we applied them to distinguish two types of brain tumor, i.e. LGG and GBM, then two types of lung cancer, i.e. LUAD and LUSC, followed by two cancers of the kidney. All of them were separated with accuracy above 90%, which is remarkable because tumors of the same organ are expected to resemble each other.
The lowest rate of correct classification was for distinguishing Colon Adenocarcinoma (COAD) from Rectum Adenocarcinoma (READ), which could be done at best with about 75% accuracy by RFC, DCT and KNN. This is still notable, because READ and COAD are so similar that Siegel et al. [1] report the estimated deaths from these two cancers together; in practice, many hospitals confuse them and count READ as COAD, treating them as one class of cancer. Therefore, being able to distinguish them with more than 70% accuracy, even by K-Means unsupervised clustering, is encouraging. The successful separations between two similar cancers of one organ, such as LUAD and LUSC shown in Table 45, may open a new approach to cancer diagnosis: it may be better first to detect whether the biopsy sample of an organ such as brain or lung is healthy or a malignant tumor, and then, if it is cancerous, feed the data again to the ML classifiers to detect which type of tumor it is. The benefit of this approach is that in a binary classification between healthy and cancerous samples of an organ there is only one option for the cancerous cells: if, for example, the patient suffers from LUSC but the sample is given to a classifier that decides between healthy and LUAD, it will likely be classified as LUAD. Therefore, it is better either to classify from the beginning as healthy versus cancerous, including all relevant cancers, and then distinguish among the cancers; or, if the sample is recognized as one specific cancer of an organ, to test it again carefully to classify it correctly among the potential cancers of that organ; or to use a multi-class classifier from the beginning instead of binary classifiers. For detailed information refer to Tables 20, 21, 22 and 23 and the last row of Table 45.
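The kernel sensitivity noted earlier can be sketched as follows. On the paper's unnormalized expression profiles the rbf kernel reportedly failed while the linear kernel was perfect; on the well-behaved synthetic stand-in data below both may score highly, so this only demonstrates the mechanics of the comparison, not the paper's result:

```python
# How the linear-vs-rbf kernel comparison can be run, on synthetic stand-in
# data (two Gaussian "tissue" classes in a high-dimensional space).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n, d = 200, 1000
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, d)),   # "healthy"
               rng.normal(1.0, 1.0, size=(n // 2, d))])   # "cancerous"
y = np.array([0] * (n // 2) + [1] * (n // 2))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

accs = {k: SVC(kernel=k).fit(X_tr, y_tr).score(X_te, y_te)
        for k in ("linear", "rbf")}
for k, a in accs.items():
    print(f"{k}: {a:.2f}")
```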
5. Conclusion
These ML systems are now trained and ready to receive any potential patient's data and recognize whether the sampled organ is cancerous. They detect the problem accurately at different stages of cancer and can therefore be helpful in early diagnosis. The limitation of our model is that it needs data from samples taken from organs, and most people do not have easy access to this level of their own personal genetic data. Thus the next step could be finding suitable biomarkers in the blood that can separate healthy people from patients using only blood samples. Furthermore, the world is realizing the importance of creating databases of single-cell WES, which will enable more accurate cancer studies; the corresponding big data can also improve ML systems to work perfectly on all cancers affecting each organ of each patient in personalized medicine, which can employ the most effective treatments for each person based on their specific differences. Our work is nevertheless one step forward, because the best previous work known to us, by Sun et al. [11], implements complicated deep neural networks that are not only computationally more exhausting but also lack clarity, a common shortcoming of deep neural networks, whereas our simple algorithms run faster, reach their results with better precision and recall, work especially well with little data, and their mechanism of action is not a black box, because the logic of their performance is understandable. As evidence for this claim, we specifically tried the classification of a problematic pair of tumor types, i.e. lung's LUSC and bladder's BLCA: our classifiers again classified these two tumors with more than 90% accuracy, a pair on which their deep neural network classifier was confused.
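As an illustration of that interpretability, a linear model's per-gene coefficients can be read off and ranked directly; the data are synthetic and the gene names (GENE_0, GENE_1, ...) are hypothetical placeholders for the real expression matrix:

```python
# Sketch of the interpretability argument above: unlike a deep network, a
# linear model exposes per-gene coefficients, so the features driving the
# decision can be ranked. Synthetic data and hypothetical gene names stand in
# for the real expression matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_genes = 100
genes = [f"GENE_{i}" for i in range(n_genes)]    # hypothetical names
X = rng.normal(size=(80, n_genes))
y = (X[:, 7] > 0).astype(int)                    # label driven by gene 7
X[y == 1, 7] += 2.0                              # "upregulated" in class 1

clf = LogisticRegression(max_iter=5000).fit(X, y)
weights = clf.coef_.ravel()
top = np.argsort(np.abs(weights))[::-1][:5]      # most influential genes
for i in top:
    print(genes[i], round(float(weights[i]), 2))
```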
[11] Most notably, our classifiers are better suited to identifying the important features that play the greatest role in the classification, namely the most upregulated or most downregulated genes between the cancerous and noncancerous groups. The latter provides invaluable information, both directly and indirectly, to be used together with supporting knowledge of the pathways that cause cancers.

Funding Information
Thanks to KTH Royal Institute of Technology and its Library for their Open Access Publication Grant, as well as to the Houshmand family and their companies, especially GholamAbbas, Atash, Shahab, Shahin and Shadab, for their financial support of my studies and projects.
Research Resources
The cancerous sample data are from The Cancer Genome Atlas (TCGA), and the corresponding healthy tissue samples are from the Genotype-Tissue Expression (GTEx) project. All machine learning algorithms are taken from scikit-learn.
Acknowledgments
I'd like to acknowledge all those who have contributed to who I am and where I am, starting from my parents and grandparents, my siblings, all my great teachers, and everybody who has taught me a single word of science, especially those whose knowledge of the life sciences or of computer science and machine learning has had a direct impact on this research. I also particularly thank Mr. Eng. GholamAbbas Houshmand for his great contribution to financing my studies and projects, as well as the KTH Library for supporting the open-access publication of this paper so that it is accessible to all who are interested.
Further Reading
For readers who want more information on the concepts in this article, please look for my upcoming articles; I am developing and testing new ideas and will report them to the scientific community in due course.
References

[1] Siegel, R. L., Miller, K. D., and Jemal, A., Cancer Statistics, 2020, CA: A Cancer Journal for Clinicians 70(1): 7-30 (2020).
[2] Momenimovahed, Z., and Salehiniya, H., Epidemiological characteristics of and risk factors for breast cancer in the world, Breast Cancer: Targets and Therapy 11: 151 (2019).
[3] Tsuji, S., and Aburatani, H., Machine Learning Applications in Cancer Genome Medicine, Gan to kagaku ryoho. Cancer & Chemotherapy 46(3): 423-426 (2019).
[4] Nik-Zainal Abidin, S., Memari, Y., and Davies, H., Holistic cancer genome profiling for every patient, Swiss Medical Weekly 150: w20158 (2020). https://doi.org/10.4414/smw.2020.20158
[5] Asri, H., et al., Using machine learning algorithms for breast cancer risk prediction and diagnosis, Procedia Computer Science 83: 1064-1069 (2016).
[6] Vamathevan, J., et al., Applications of machine learning in drug discovery and development, Nature Reviews Drug Discovery 18(6): 463-477 (2019). https://doi.org/10.1038/s41573-019-0024-5
[7] Streiner, D. L., Clinical medicine and the legacy of the Reverend Bayes, International Journal of Clinical Practice 73(4): e13323 (2019).
[8] Myung, I. J., Tutorial on maximum likelihood estimation, Journal of Mathematical Psychology 47: 90-100 (2003).
[9] Zhang, H., The optimality of naive Bayes, AA 1(2): 3 (2004).
[10] Pes, B., Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains, Neural Computing and Applications: 1-23 (2019).
[11] Sun, Y., et al., Identification of 12 cancer types through genome deep learning, Scientific Reports 9(1): 1-9 (2019).
[12] Abeel, T., et al., Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics 26(3): 392-398 (2010).
[13] Sidney, S., Go, A. S., and Rana, J. S., Transition From Heart Disease to Cancer as the Leading Cause of Death in the United States, Annals of Internal Medicine 171(3): 225 (2019).
[14] Alabsi, A. M., Ali, R., Ali, A. M., Al-Dubai, S. A. R., Harun, H., Abu Kasim, N. H., and Alsalahi, A., Apoptosis induction, cell cycle arrest and in vitro anticancer activity of gonothalamin in a cancer cell lines, Asian Pacific Journal of Cancer Prevention 13(10): 5131-5136 (2012).
[15] Tomczak, K., Czerwinska, P., and Wiznerowicz, M., The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge, Contemporary Oncology 19(1A): A68 (2015).
[16] Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., ... and Foster, B., The genotype-tissue expression (GTEx) project, Nature Genetics 45(6): 580-585 (2013).
[17] Cortes, C., and Vapnik, V., Support-vector networks, Machine Learning 20(3): 273-297 (1995).
[18] Boser, B. E., Guyon, I. M., and Vapnik, V. N., A training algorithm for optimal margin classifiers, in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152 (1992).
[19] Ben-Hur, A., Horn, D., Siegelmann, H. T., and Vapnik, V., Support vector clustering, Journal of Machine Learning Research 2(Dec): 125-137 (2002).
[20] Noble, W. S., What is a support vector machine?, Nature Biotechnology 24(12): 1565-1567 (2006).
[21] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., Classification and Regression Trees, Statistics/Probability Series (1984).
[22] Ho, T. K., Random decision forests, in Proceedings of the 3rd International Conference on Document Analysis and Recognition, Vol. 1, pp. 278-282, IEEE (1995).
[23] Ho, T. K., The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8): 832-844 (1998).
[24] Cox, D. R., The regression analysis of binary sequences, Journal of the Royal Statistical Society: Series B (Methodological) 20(2): 215-232 (1958).
[25] Fix, E., and Hodges, J. L., Discriminatory analysis, nonparametric discrimination (1951).
[26] Peterson, L. E., K-nearest neighbor, Scholarpedia 4(2): 1883 (2009).
[27] MacQueen, J., Some methods for classification and analysis of multivariate observations, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, No. 14, pp. 281-297 (1967).
Author contributions statement
A.H. is the sole author of this paper and has done all the simulations himself.
Additional information
Arash Hooshmand (ORCiD 0000-0002-9263-0282) is the sole and hence corresponding author of this paper and declares that there is no conflict of interest regarding its publication. Please do not hesitate to contact me via [email protected] if you have any questions.