Multiclass Disease Predictions Based on Integrated Clinical and Genomics Datasets
Moeez M. Subhani
College of Engineering and Technology, University of Derby, Derby, England. Email: [email protected]
Ashiq Anjum
College of Engineering and Technology, University of Derby, Derby, England. Email: [email protected]
Abstract—Clinical predictions using clinical data by computational methods are common in bioinformatics. However, clinical predictions that also use information from genomics datasets are not frequently observed in research. Precision medicine research requires information from all available datasets to provide intelligent clinical solutions. In this paper, we have attempted to create a prediction model which uses information from both clinical and genomics datasets. We have demonstrated multiclass disease predictions based on combined clinical and genomics datasets using machine learning methods. We have created an integrated dataset, using a clinical (ClinVar) and a genomics (gene expression) dataset, and trained it using an instance-based learner to predict clinical diseases. We have used an innovative but simple way for multiclass classification, where the number of output classes is as high as 75. We have used Principal Component Analysis for feature selection. The classifier predicted diseases with 73% accuracy on the integrated dataset. The results were consistent and competent when compared with other classification models. The results show that genomics information can be reliably included in datasets for clinical predictions, and it can prove to be valuable in clinical diagnostics and precision medicine.
Keywords – Clinical; Genomics; Data Integration; Machine Learning; Disease Prediction; Classification; Bioinformatics.
I. INTRODUCTION
Medical science is rich with various types of datasets, ranging from clinical to genomics datasets. Clinical datasets are diverse in terms of their nature, format and the information they contain. Genomics datasets, on the other hand, are intrinsically enormous in size and dimensions, and so is the information contained in them [1]. Genomic information can be considered the backbone of clinical information, since the genomic structure drives the physical characteristics of any organism. If the two pieces of information are connected, it may help to improve medical research overall by finding more accurate and advanced clinical diagnostic solutions. The connection essentially means integrating clinical and genomics datasets. This is also a way forward in precision medicine studies, where medical practitioners want to make clinical decisions based on both clinical and genomics parameters, and not just one of them [2][3].

However, research to establish or explore this connection is not commonly pursued in the state of the art [1][4]–[6]. Datasets from clinical and genomics sources are mainly used independently in their respective research domains. From the literature review, it has been observed that most clinical prediction studies have been limited to either clinical datasets [7]–[13] or genomics datasets [14]–[20]. One common factor among these studies is that almost all of them are prediction studies, which establishes that the trend for clinical predictions has long prevailed in research.

Although there are some studies which have attempted inter-domain research, the trend does not seem to be very progressive. For example, [21] used decision trees to predict breast cancer outcomes. Similarly, [22] employed multiple regression and statistical methods to infer associations, and [3] used a graph-based approach to predict cancer clinical outcomes from multi-omics data.
All these studies used integrated datasets for prediction or association studies using various approaches. However, most of these approaches are now outdated due to limitations in their performance or accuracy [21][22]. The approach in [3] (a combination of regression, Bayesian networks, and evolutionary neural networks) is more advanced and promising, but that study is limited to binary classifications and multi-omics data only [23][24].

The research work mentioned above shows that prediction-based studies are common in the literature. The most commonly sought predicting factors are survival rate and disease recurrence rate. However, we could not find any disease prediction model in the literature based on combined clinical and genomics data. A typical disease prediction model, as we define it, takes information from both clinical and genomics datasets and predicts disease(s) in a patient. This can be achieved when we have both clinical and genomics datasets available for a variety of diseases. Hence, we are attempting to design a disease prediction model which aims to predict possible medical condition(s) in a patient using information from both clinical and genomics datasets.

From the ClinVar and Expression Atlas databases, we have been able to construct such a dataset, which contains both clinical parameters and gene expression values in a single dataset for several patients. Since the data retrieved from these databases is in eXtensible Markup Language (XML) format, we can create a very flexible schema for this dataset. Using this dataset, we can train a model to learn the diseases in various subjects.
As an initial attempt to prove the concept, we have used the k-Nearest Neighbours (kNN) algorithm for the learning model, which is an instance-based learner [25]. Considering the size and complexity of the dataset, kNN appears to be a reasonable choice of learning method since it learns the classification function only locally.

Genomics-based clinical diagnosis does not exist in clinical environments. Traditionally, disease predictions are made using regular clinical practices only. Our disease prediction model can provide a genomic signature to verify the existence or possible occurrence of a disease. Hence, this model will not only help medical practitioners gain another step of confidence in clinical diagnosis, but also help advance precision medicine research.

TABLE I. CLINVAR DATASET.

Gene | Condition                               | Clinical Significance | Chromosome No. | Location | Variation ID | Allele ID
AKAP | Long QT Syndrome                        | Benign/Likely benign  | ...            | ...      | ...          | ...
...  | Colorectal Neoplasms                    | Likely pathogenic     | 19             | 40236313 | 376039       | 362918
APC  | Hereditary cancer-predisposing syndrome | Pathogenic            | ...            | ...      | ...          | ...

The rest of the paper is arranged as follows. Section II discusses the challenges for data integration. Section III explains the data integration model. Section IV gives details of the prediction model and the algorithm, along with the implementation details. Section V presents the results, followed by discussion in Section VI and conclusion in Section VII.

II. CLINICAL AND GENOMICS DATA INTEGRATION CHALLENGES
The integration of clinical and genomics datasets is crucial to move towards precision medicine. The medical conditions of each person are transcribed from the underlying genomic structure. Hence, it is critical to bring genomic information forward to play a part in clinical diagnostics [2][3]. The main challenge is to find a way to integrate datasets which are completely different from each other in terms of their nature, size, and properties.

Most biological databases have standardised data storage in XML formats. The European Molecular Biology Laboratory (EMBL) took an initiative in 2000 to provide access to all flat-file data in XML format [26]. XML provides more flexibility in terms of storage, transport and integration of complex biological datasets [27]. The format also provides the advantage that the schema of datasets is extensible and multiple datasets can be mapped together. Our datasets from both sources, ClinVar and Expression Atlas, are accessed in XML format.

The scope of data integration models is vast, as mentioned in the literature review in the previous section. Various data integration models have been discussed by various authors, including [1], [4] and [6]. For our study, we have adopted a meta-dimensional approach, which refers to using multiple datasets simultaneously in the analysis [6]. This involves building a model on top of multiple datasets, which are combined or integrated either before or after building the data model. The approach offers the advantage of fetching information from multiple datasets and including it in the analysis model. However, the integration may also yield complex datasets, resulting in less robust models.

There are multiple methods within the meta-dimensional approach, as mentioned by [1] and [6]. We have adopted a concatenation-based integration method, where different matrices are combined into a large single matrix before building a model.
One advantage of this method is that once it is determined how to concatenate the variables from different datasets into a single matrix, it is relatively easy to build any statistical analysis model on top of it. For example, on a combination of genomics datasets, [8] used a Bayesian model to predict phenotypes, and [28] used a Cox Lasso model to predict time to recurrence.

It may be important to mention here that the integration attempt in this paper is only at the data level. Since the data being retrieved from public repositories is in XML format, we do not need to pre-build a structure to store the data, and we are not dealing with databases either. Therefore, this method provides the advantage of avoiding data structure and storage issues. Hence, the data integration here must not be confused with traditional database-level data integration.
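The paper's pipeline is built in Matlab, so the following is only an illustrative sketch of the concatenation step in Python/pandas, using made-up miniature frames shaped like Tables I and II (column names and values are stand-ins, not the actual datasets):

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets; the fields
# mirror Tables I and II but the rows here are illustrative only.
clinical = pd.DataFrame({
    "Gene": ["AKAP9", "AKT2"],
    "Condition": ["Long QT Syndrome", "Colorectal Neoplasms"],
    "ClinicalSignificance": ["Benign/Likely benign", "Likely pathogenic"],
})

expression = pd.DataFrame({
    "Gene": ["AKAP9", "AKT1", "AKT2"],
    "GSM452573": [3.5636, 10.8863, 5.0059],
    "GSM452571": [3.4524, 10.3492, 4.4639],
})

# Concatenation-based integration: join the two matrices on the gene
# name. An inner join drops genes that lack a feature set on either
# side, matching the filtering described for ds1/ds2 in Section III.
integrated = clinical.merge(expression, on="Gene", how="inner")
print(integrated)
```

AKT1 is dropped by the inner join because it has no clinical record, which is the pandas analogue of removing ds2 examples with no feature sets in ds1.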
TABLE II. GENE EXPRESSION DATASET.

Gene  | GSM452573   | GSM452571   | GSM452642   | ...
AKAP9 | 3.563587736 | 3.45243272  | 3.535150355 | ...
AKT1  | 10.8863402  | 10.34918494 | 9.129441853 | ...
AKT2  | 5.005896122 | 4.463927997 | 4.993673626 | ...
III. DATA INTEGRATION MODEL
We have used completely anonymised clinical and genomics datasets obtained from public sources. The clinical dataset (ds1) has been obtained from ClinVar [29], an open-source database that contains information about genomic variation and links it with phenotype information. For each gene, it provides the diseases it causes and their clinical significance. In addition, it also includes the whereabouts of the gene, such as chromosome number, location, variation ID, etc. A snapshot of the data is illustrated in Table I. The database was searched for 'colorectal cancer', and all the search results were downloaded and saved as XML files.

The genomics dataset (ds2) is a gene expression dataset of primary colorectal tumours (E-GEOD-18105), obtained from the Expression Atlas of the European Bioinformatics Institute (EBI), which is a public resource for gene expression datasets [30]. Gene expression data, as the name indicates, contains information on the expression of gene(s) in particular biological sample(s). The expression data is obtained via microarray technology, which provides parallel processing and monitoring
of tens of thousands of genes, producing large volumes of valuable data [31]. A typical gene expression dataset contains a matrix with genes in rows and samples in columns. The number in each cell of the matrix characterises the expression level of a specific gene in the given sample [32]. Table II shows an example of what gene expression data looks like. After the first column, which is the gene name, the rest of the columns represent samples, and the values represent the expression levels.

The primary reason for selecting these two datasets was that they fulfil the information requirement for this study. The ClinVar data provides information about the clinical condition associated with each gene present in the dataset. It also provides the clinical significance of these conditions [33]. The gene expression data brings information about the activity of those genes in different samples. Hence, the two datasets provide the required information to create an integrated dataset for this model.

TABLE III. INTEGRATED DATASET.

Disease              | Clinical Significance | Chromosome No. | Location | Variation ID | Allele ID | Gene | GSM452573   | GSM452571
Long QT Syndrome     | Benign/Likely benign  | ...            | ...      | ...          | ...       | ...  | ...         | ...
Colorectal Neoplasms | Likely pathogenic     | 19             | 40236313 | 376039       | 362918    | AKT2 | 5.005896122 | 4.463927997
...                  | ...                   | ...            | ...      | ...          | ...       | ...  | ...         | ...
TABLE IV. STATISTICS OF INTEGRATED DATASET.

Output Classes    | I   | II
Unique Classes    | 80  | 76
Feature set       | 117 | 117
Training Examples | 258 | 281
As mentioned previously, we are using meta-dimensional integration, specifically the concatenation method. The datasets were concatenated via gene names. It has to be noted that there were multiple examples for each gene in both datasets. Examples in ds1 with no feature sets available in ds2 were removed. Conversely, for examples in ds2 with no feature sets in ds1, the data was extrapolated in ds1 so that the number of examples for that gene could be increased. Since each parameter in the feature set is independent, extrapolating some points does not affect the accuracy.

Table III shows an example of the integrated dataset, where the clinical and genomics parameters are concatenated via gene names. The statistics of the dataset are shown in Table IV. The data was trained with two different output classes: genes (class-I) and diseases (class-II). There are 80 unique genes and 76 unique diseases in the dataset after removing the outliers.

It can be argued that predicting genes as an output class does not provide much meaning. Predicting disease has more clinical value, since this information is not available in the gene expression data. The reason behind this selection is only to provide an example that the classifier can be used to predict any feature from an integrated dataset without any restriction.

The resulting schema includes clinical and genomics parameters in columns, while each row represents a gene. Hence, each row tells the possible medical condition for a gene if it is active in a sample. This schema is completely flexible and scalable. It can be expanded by adding data from different sources, as long as the new data can be mapped to the existing schema. More data brings more information, which will only help to improve the performance of the classifier by increasing the feature set and the training examples.

IV. PREDICTION MODEL

In this section, we discuss the multiclass classification challenges, followed by the details of our prediction model, comprising the algorithm and the experimental environment.
A. Multiclass Classification
Disease classification is a complicated multiclass classification problem. From a classification perspective, it is relatively easy to classify binary problems, or even a few classes, but with an increasing number of classes the complexity of the dataset gets very high [34]. The data under consideration in this study contains more than 75 different classes. When the number of output classes is that high, the variance in the data is very high as well. In such a case, it is best to have as much data as possible, so that every class has sufficient representation in the training data. This is a minor limitation of our study, because of the limited number of examples available from public datasets.

There is no single classification method that can be suggested as best suited for multiclass classification [34]. Any algorithm can perform better than the rest based on the characteristics and properties of the data. In this study we have used the k-Nearest Neighbours (kNN) algorithm. The reason for selecting kNN instead of Support Vector Machines (SVM), which is a more popular classification algorithm, is our large number of output classes and the random distribution of the data (Figure 1). Unlike SVM, which uses kernels for optimization, kNN determines the label for a given data point based on the nearest data points under the distance metric. Since kNN is a non-parametric algorithm, it does not assume any explicit distribution for the input data (such as Gaussian) [25]. This works well in our case, where the data has no particular distribution and is widespread (Figure 1). Hence, we can avoid algorithmic complexity by using an algorithm which uses local optimization only. Also, kNN performs well on small to medium-sized datasets [25].
B. Classification Algorithm
kNN is a non-parametric supervised learning algorithm [25]. For a given dataset $X$ with labels $Y$, the algorithm calculates the distances between a new data point $z$ and all data points in $X$ to create a distance matrix. Euclidean distance is the most common method for calculating this distance. The Euclidean distance between points $x$ and $y$ is:

$$D(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2} \quad (1)$$

Let $R = (X_i, Y_i)$, where $i = 1, \ldots, N$, be the training set, where $X_i$ is the $p \times q$ feature vector, and $Y_i$ is the $q$-dimensional vector which represents the $m$ output class labels, as we are considering a multiclass classification problem. We presume that the training data has random numeric variables with unknown distribution.

From the training set $R$, the kNN algorithm narrows down to a local sub-region $r(x)$ of the input space, centered on an estimation point $x$. This predicting sub-region $r(x)$ contains the training points $x'$ nearest to $x$, which can be expressed as:

$$r(x) = \{\, x' \mid D(x, x') \le d(k) \,\} \quad (2)$$

where $D(x, x')$ is the distance metric between $x'$ and $x$, and $d(k)$ is the $k$-th order statistic. $k[y]$ denotes the number of samples in the sub-region $r(x)$ which are labelled $y$. The kNN algorithm estimates the posterior probability $p(y \mid x)$ of the estimation point $x$:

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} \cong \frac{k[y]}{k} \quad (3)$$

Generally, when kNN is used for binary classification, label assignment is relatively easy, since the algorithm has to select between two classes only:

$$g(x) = \begin{cases} 1, & k[y = 1] \ge k[y = -1] \\ -1, & k[y = 1] < k[y = -1] \end{cases} \quad (4)$$

We have improvised this functionality for our study, where the output class is non-binary.
In this case, for any estimation point $x$, the decision $g(x)$ for a given label $y$ is estimated by:

$$g_k(x) = y_k \mid \min D_k \quad (5)$$

where $D_k$ is the distance given by (1). Hence, the decision that maximises the posterior probability is assigned as the output label. For a multiclass classification problem, where $y \in \{1, \ldots, k\}$, the kNN algorithm uses the following decision rule:

$$F(x) = \arg\max_k \left[ g_k(x) \right] \quad (6)$$

Thus, for the selected nearest $k$ neighbours, the algorithm calculates the posterior probability of each class, and the class with the highest probability is assigned to $x$. Euclidean distance is the most common method, but there are other distance calculation methods as well, such as standardised Euclidean, Mahalanobis, Spearman, etc. [25].

Figure 1. Distribution of variance in the integrated dataset.
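The paper's experiments use Matlab's classification toolbox; purely as an illustrative sketch, the decision rule of equations (1)–(6) can be written in Python as follows (the training points and labels here are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of point x by majority vote among the k
    training points nearest to x under Euclidean distance."""
    # Euclidean distance from x to every training point (eq. 1).
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k nearest neighbours: the sub-region r(x) (eq. 2).
    nearest = np.argsort(dists)[:k]
    # Vote counts estimate the posterior k[y]/k (eq. 3); the argmax
    # of the counts is the decision rule of eq. (6).
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy multiclass data: three well-separated clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0],
              [5.1, 4.9], [9.0, 0.0], [9.2, 0.1]])
y = np.array(["A", "A", "B", "B", "C", "C"])

print(knn_predict(X, y, np.array([5.05, 5.05]), k=3))  # nearest cluster is "B"
```

Because nothing is fitted beforehand, all work happens at query time, which is the "local only" behaviour the paper relies on.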
C. Performance Measurement
Generally, the performance of machine learning classifiers is measured using various parameters, such as accuracy, sensitivity, specificity, and the Receiver Operating Characteristic (ROC) curve. These parameters are calculated from the true positives, true negatives, false positives and false negatives of the classifier. For binary classes, these parameters are easy to calculate because there is only one positive and one negative class. However, for multiclass classification the problem is more complicated, and it is not easy to calculate each parameter for each class. The ROC in particular, which is a standard measure of classifier performance, is very complicated to calculate for a very large multiclass problem. This problem has been discussed in further detail by Fawcett in [35].

Therefore, calculating each parameter for every class would not only be laborious, but would also produce a mass of results that would be difficult to assemble and explain. To simplify this, we have only used confusion matrices to represent the performance of the classifier, and used the accuracy of each classifier to compare the results for the two classes.
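A multiclass confusion matrix and the overall accuracy derived from it can be sketched in a few lines of Python (a hand-rolled helper over hypothetical disease labels, not the Matlab toolbox output the paper reports):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes; the
    diagonal holds the correctly classified count for each class."""
    idx = {lab: i for i, lab in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[idx[t], idx[p]] += 1
    return cm

# Hypothetical true vs. predicted labels for a tiny 3-class problem.
y_true = ["LQTS", "CRC", "CRC", "HCPS", "LQTS", "CRC"]
y_pred = ["LQTS", "CRC", "HCPS", "HCPS", "LQTS", "CRC"]
labels = ["CRC", "HCPS", "LQTS"]

cm = confusion_matrix(y_true, y_pred, labels)
accuracy = np.trace(cm) / cm.sum()  # diagonal = correct predictions
print(cm)
print(f"accuracy = {accuracy:.3f}")
```

This single matrix summarises per-class behaviour without computing sensitivity, specificity and ROC separately for all 75+ classes, which is the simplification the section argues for.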
D. Experimental Environment
We have used Matlab (R2018a) for all the experiments, which provides built-in libraries for machine learning classifiers. We used the machine learning toolbox to train the classification model using kNN. The toolbox takes the data as input and processes the classification itself using the built-in library functions and selected features. The classification toolbox uses the Euclidean distance by default to compute the distance metrics. The toolbox can be used to reproduce the results.

First, we perform Principal Component Analysis (PCA) for dimensionality reduction. Since our data is multivariate, ranging from gene expression data to phenotypic data, the data points are widespread in the data space. Figure 1 shows the standard deviation distribution of the first 20 data variables from the integrated dataset. It can be seen that the data distribution is very random and does not follow any standard distribution function. Therefore, it is important to reduce the dimension of the integrated dataset. We performed PCA to explain 95% of the variance in the data.
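The paper performs this step in Matlab; an equivalent illustrative sketch in Python (with random stand-in data of the same 258×117 shape as the class-I dataset) keeps the fewest principal components that explain 95% of the variance:

```python
import numpy as np

def pca_95(X, var_explained=0.95):
    """Project X onto the fewest principal components whose
    cumulative explained-variance ratio reaches var_explained."""
    Xc = X - X.mean(axis=0)                       # centre the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = (S ** 2) / (S ** 2).sum()         # variance per component
    n = int(np.searchsorted(np.cumsum(var_ratio), var_explained) + 1)
    return Xc @ Vt[:n].T                          # reduced feature matrix

rng = np.random.default_rng(0)
# Stand-in for the 117-feature integrated dataset: correlated noise.
X = rng.normal(size=(258, 117)) @ rng.normal(size=(117, 117))
X_reduced = pca_95(X)
print(X.shape, "->", X_reduced.shape)
```

The number of retained components depends on how the variance is spread; for widely scattered data like Figure 1 suggests, many components are typically needed to reach the 95% threshold.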
Figure 2. Confusion matrix for class-I.
The results of classification depend highly on the dimensions of the dataset. The relationship between the number of examples and the feature set is critical here to avoid over-fitting [36][37]. The clinical dataset has only 5 features, which is not a large enough set to be used stand-alone for a prediction model. With a feature set of 5, the prediction is neither reliable nor comparable with other datasets. The genomics dataset is large enough in this respect, but it does not contain class-II, so we cannot predict diseases from it. Therefore, we have only used the integrated dataset to train the prediction model explained in the previous section (IV), and then compared it with other classifiers.

The results are validated using 10-fold cross-validation. This means the dataset is divided into 10 parts; one part is held out as test data and the remaining 9 parts are used as training data. This step is repeated 10 times, holding out a different part as test data each time. This way, every example in the data is used as both training and test data. The resulting accuracy is an average over the 10 folds.

V. RESULTS
The performance of a classification model is analysed using a confusion matrix. Figure 2 shows the confusion matrix for class-I prediction. The rows in a confusion matrix represent the true output class, and the columns represent the predicted class. The diagonal cells indicate the true positives (green) and the false negatives; the off-diagonal cells indicate the false positives and the true negatives (red). The bottom-right cell shows the overall accuracy and the loss of the classifier.
A. Classification with our Classifier
The number of neighbours (NN) is a variable in the algorithm which can be tuned to change its performance. We tested the performance of the algorithm over 10 different neighbour counts, from 1 to 10.

As mentioned previously, we trained the integrated dataset for two different classes: genes (class-I) and diseases (class-II). The results are shown in Figure 3. At NN=1, the trained model predicts class-I with 86% accuracy, and class-II with 73% accuracy.
Figure 3. Accuracy of classification model for both classes.
Initially, the accuracy drops almost linearly with the increasing number of neighbours. The drop in accuracy can be attributed to the variation in the data. As the algorithm considers more neighbours, each neighbour brings more variation that affects the prediction accuracy. However, as can be seen in Figure 3, accuracy remains above 50% for up to 3 neighbours for both classes, which can be regarded as good accuracy considering the multivariate training data. Beyond NN=4, the accuracy drops almost exponentially.

This variation over neighbours may be avoided by introducing a weight parameter in the algorithm. This parameter weighs the contribution of each neighbour under consideration based on its distance: the nearest neighbours get higher weights than the distant ones. Matlab's classification tool uses the squared inverse method to calculate the weights, which can be expressed as:

$$w_n = \frac{1}{d(x_n, x_i)^2} \quad (7)$$

where $x_n$ is the neighbour of point $x_i$. To accommodate this weight parameter, eq. (1) is adjusted as follows:

$$D(x, y) = \sqrt{\sum_{i=1}^{k} w_i (x_i - y_i)^2} \quad (8)$$

We tested this updated version by training the integrated dataset, and we observed that the accuracy rose to its maximum (86.1% for class-I and 73.7% for class-II) for all NN's. The results are shown in Figure 4. This is perhaps because the weighted version predicts based on the neighbour with the highest weight. Since the nearest neighbour is most likely to have the highest weight of all neighbours, the classification result is the same every time. This result is not very helpful for our dataset.

Figure 4. Accuracy for both classes with weighted kNN.
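The squared-inverse weighting of eq. (7) can be sketched by extending the plain majority vote to weighted votes. This mimics, but is not, Matlab's implementation, and the toy points below are made up:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3, eps=1e-12):
    """Weighted kNN: each of the k nearest neighbours votes with
    weight 1/d^2 (the squared-inverse rule of eq. 7); the class with
    the largest total weight wins."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    scores = {}
    for i in nearest:
        w = 1.0 / (dists[i] ** 2 + eps)   # eps guards exact duplicates
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + w
    return max(scores, key=scores.get)

X = np.array([[0.0, 0.0], [0.5, 0.0], [2.0, 0.0]])
y = np.array(["A", "A", "B"])

# A plain majority vote over k=3 would answer "A" (two votes), but the
# squared-inverse weighting lets the much closer "B" point dominate.
print(weighted_knn_predict(X, y, np.array([1.6, 0.0]), k=3))  # "B"
```

This also illustrates why the paper saw identical accuracy for all NN values: with 1/d² weights, the single nearest neighbour usually dominates the vote regardless of k.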
However, our model has predicted the diseases with up to 73% accuracy. This accuracy is not as good as for class-I (86%). There can be multiple reasons behind this. The representation of each class label in the data varies, which affects the prediction accuracy. Some classes have sufficient examples in the data, while others have only a few. The higher the representation of a class label in the training data, the better the prediction accuracy for that class. The distribution of class-I labels in the dataset is comparatively more uniform than that of class-II; hence the higher accuracy. Still, achieving 73% accuracy for class-II is a very good result considering the size, shape, and multivariate nature of the dataset.
B. Comparing with other Classifiers
We trained other classifiers on the same integrated dataset in order to compare performance. Using PCA at 95%, we trained all the classifiers available in Matlab's classification toolbox, and then selected the top 10 models (out of 22) to compare classification accuracy for both classes. NN=1 for all the models in the classification toolbox. 10-fold cross-validation was used to avoid over-fitting. The results are shown in Figure 5.
Figure 5. Performance of other classification models for class-I predictions.
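The 10-fold protocol used throughout the comparison can be sketched as follows (an illustrative loop over toy two-cluster data with a 1-NN model, not the paper's Matlab experiments):

```python
import numpy as np

def cross_val_accuracy(predict_fn, X, y, folds=10, seed=0):
    """Average accuracy over k folds: each fold is held out once as
    the test set while the remaining folds form the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    accs = []
    for part in np.array_split(idx, folds):
        test = np.zeros(len(X), dtype=bool)
        test[part] = True
        preds = [predict_fn(X[~test], y[~test], x) for x in X[test]]
        accs.append(np.mean(np.array(preds) == y[test]))
    return float(np.mean(accs))

# 1-NN as the model under evaluation (label of the single nearest point).
def one_nn(X_train, y_train, x):
    d = ((X_train - x) ** 2).sum(axis=1)
    return y_train[np.argmin(d)]

# Two well-separated toy clusters, so 1-NN should score very highly.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
y = np.array(["A"] * 20 + ["B"] * 20)
print(f"10-fold accuracy: {cross_val_accuracy(one_nn, X, y):.2f}")
```

Swapping `one_nn` for any other `predict_fn` is all that is needed to put a different classifier through the same validation, which is how like-for-like accuracies such as those in Figures 5 and 6 are obtained.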
For class-I, the kNN models provided the highest accuracy of 96.9%. kNN was followed by the Tree and SVM models. As we can see, the top three models are all kNN models, providing accuracy above 95%. The accuracy of the SVM models (Quadratic and Cubic) is almost in the same range (95-96%); however, the training time for the SVM models is 200 times higher than for the kNN models. This is because SVM uses cost minimization functions, such as gradient descent or kernel functions, which take much longer to converge. Since kNN does not use any of those functions, it is more robust and provides the same, or rather better, accuracy. To summarise, although both the kNN and SVM models predicted with about the same accuracy, the kNN models are much more robust than the SVM models in terms of performance.

The tree models, except for bagged trees, performed poorly, providing accuracy of about 50% or under. The training time of the tree models is as good as that of the kNN models (a few seconds), but the accuracy is poor. Bagged trees, which is a bootstrapping method, performed quite well. On the other hand, boosted trees provide an accuracy of just about 51%. Although both of them are ensemble methods, meaning they provide an average of multiple models trained on subsets of the data, bagged trees provided a much better result.

The accuracy of these kNN models (Fine kNN, Subspace kNN, and Weighted kNN) is slightly higher than that of our prediction model (Figure 3). The reason for this is that the models in the toolbox are set on different defaults and use different functions than the ones we used. The classification function that we used is primarily for multiclass classification problems. On the other hand, the functions used by the toolbox models are mainly designed for binary problems; hence the difference in accuracy.

Similar results are seen for class-II, shown in Figure 6. The top 10 models selected here are slightly different from those for class-I, but the majority are the same. The highest accuracy achieved for class-II is 73.3%, which is just about the same as achieved by our model (Figure 3). The top 3 models are all kNN models, with bagged trees standing in 4th position with 73% accuracy. All SVM models provide accuracy of less than 50%, with training times over 200 times those of the kNN models. The same is the case for the tree models, except for bagged trees; the same result as for class-I. A plausible explanation for the good performance of bagged trees could be that they perform better on high-dimensional data.

Figure 6. Performance of other classification models for class-II predictions.

VI. DISCUSSION
We have demonstrated a novel way of performing multiclass classification based on integrated clinical and genomics datasets. We have used a concatenation-based data integration model for this purpose, which has been discussed by various researchers before ([1][6]) but not implemented in the area of health care. Hence, this is the first time that this meta-dimensional approach has been used to integrate such datasets.

In the past, various other methods have been used for data integration, such as tree-based models [21], statistical models [22], and graph-based models [3][18][20]. All these models require a considerable amount of effort and time to build the data models first, before creating the data analysis model, such as building the binary trees or creating graph models from datasets. Our method does not involve any of those complex models; it only requires concatenation of all the datasets into a single matrix. Once concatenated, the model transfers the dataset directly to the analysis model and starts training the learning algorithm. Hence, it is far more efficient in terms of time and computational cost compared with other methods.

In terms of analysis, to our knowledge, none of the previous models have been used for multiclass disease classification problems in health care. They have only been demonstrated for binary classifications, and therefore their results cannot be compared with our model, which is a multiclass classification model.

In terms of data models, it would be very difficult to perform multiclass classification based on the previously mentioned models, because they would require building a separate data model (trees or graphs) for each output class before the analysis model. With multiple output classes, the analysis models would get extremely complicated with several input data models.
With our proposed model, as there is only a single concatenated dataset, multiclass classification is less complicated and more manageable, because the dataset has only one data model with a single schema.

Since we could not compare our results with any previous results from other researchers, we have demonstrated a comparison with other classification models. The results shown in Figures 5 and 6 demonstrate that the kNN models can outperform the rest of the classification models in terms of prediction accuracy and performance.

Our proposed approach provides a very flexible and scalable model, along the lines of our previous work as reported in [38]–[41], which can be scaled to adjust to any new dataset and accommodate any analysis model. As long as there is a relational dataset, it can be concatenated to the existing dataset within the same data model and schema. Any analysis model or algorithm, including prediction, classification, and regression models, can be built on top of the dataset. This flexibility enables the approach to be adapted for any research purpose in any domain.

VII. CONCLUSION AND FUTURE DIRECTIONS
The way forward in precision medicine is to use all available data from the clinical and genomics domains in order to provide the best clinical solutions. The datasets need to be intelligently integrated for this purpose. In this paper, we have performed clinical predictions based on clinical and genomics information. We have integrated a clinical (ClinVar) and a genomics (gene expression) dataset, and performed classification for disease predictions. We have designed a multiclass classification model that predicts diseases from integrated datasets. The model, validated by 10-fold cross-validation, has predicted diseases with up to 73% accuracy. We also predicted genes as an extra variable from the same dataset, and achieved up to 86% accuracy. We have compared the results with other classification models and demonstrated that our model outperforms the rest. We can conclude that constructing learning classifiers on top of large-scale inter-domain integrated datasets can provide very good clinical predictions. This can prove to be very beneficial and a stepping stone towards precision medicine.

This research study shows that diseases can be predicted with good accuracy from a patient's dataset if it has both clinical and genomics parameters present. The accuracy will improve further if we train the model with a much larger training set. The reliability of and confidence in the results will increase by incorporating more clinical and genomics information. We have demonstrated with a gene prediction example that, when the dataset is more uniformly distributed among the classes, prediction accuracy is high even on a multiclass classification task.

This study has great potential to expand, including achieving analysis provenance [13]. The more information a dataset contains, the higher the accuracy that can be achieved.
The dataset can be expanded to include more multivariate clinical and genomics datasets, such as clinical trials and multi-omics datasets, respectively. Including clinical information from clinical trials or laboratory tests would have a significant impact on clinical prediction studies.
ACKNOWLEDGMENT