OmiEmbed: a unified multi-task deep learning framework for multi-omics data
OmiEmbed: reconstruct comprehensive phenotypic information from multi-omics data using multi-task deep learning
Xiaoyu Zhang, Kai Sun and Yike Guo
Data Science Institute, Imperial College London, London, SW7 2AZ, UK and Department of Computer Science, Hong Kong Baptist University, Hong Kong, China.
Abstract
Motivation:
High-dimensional omics data contain intrinsic biomedical information that is crucial for personalised medicine. Nevertheless, it is challenging to capture this information from genome-wide data because of the large number of molecular features and the small number of available samples, a problem known as "the curse of dimensionality" in machine learning. To tackle this problem and pave the way for machine-learning-aided precision medicine, we proposed a unified multi-task deep learning framework called OmiEmbed to capture a holistic and relatively precise profile of phenotype from high-dimensional omics data. The deep embedding module of OmiEmbed learnt an omics embedding that mapped multiple omics data types into a latent space with lower dimensionality. Based on the new representation of multi-omics data, different downstream networks of OmiEmbed were trained together with the multi-task strategy to predict the comprehensive phenotype profile of each sample.
Results:
We trained the model on two publicly available omics datasets to evaluate the performance of OmiEmbed. The OmiEmbed model achieved promising results for multiple downstream tasks including dimensionality reduction, tumour type classification, multi-omics integration, demographic and clinical feature reconstruction, and survival prediction. Instead of training and applying different downstream networks separately, the multi-task strategy combined them and conducted multiple tasks simultaneously and efficiently. The model achieved better performance with the multi-task strategy compared to training each task individually. OmiEmbed is a powerful tool to accurately capture comprehensive phenotypic information from high-dimensional omics data and has great potential to facilitate more accurate and personalised clinical decision making.
Availability:
Source code of OmiEmbed is available at https://github.com/zhangxiaoyu11/OmiEmbed
Contact: [email protected]
With the increasingly massive amount of omics data generated from emerging high-throughput technologies, the large-scale, cost-efficient and comprehensive analysis of biological molecules has become an everyday methodology for biomedical researchers (Hasin et al., 2017; Berger et al., 2013). The analysis and assessment of different types of omics data facilitate the integration of molecular features into the standard diagnostic procedure. For instance, in the 2016 World Health Organization (WHO) classification of central nervous system (CNS) tumours (Louis et al., 2016), an integrative method combining both histopathology and molecular information was recommended for the identification of multiple tumour entities. Nevertheless, most of the molecular features designed to aid diagnosis are manually selected biomarkers referring to specific genetic alterations, which neglects the genome-wide patterns correlated with disease prognosis and other phenotypic outcomes. In recent years, instead of focusing on the effect of specific molecular features, many researchers began to delve into the overall picture of genome-wide omics data, trying to obtain a deep understanding of diseases and uncover crucial diagnostic or prognostic information from it (Chaudhary et al., 2017; Capper et al., 2018; Zhang et al., 2019; Amodio et al., 2019).
It is challenging to analyse genome-wide high-dimensional omics data because of the mismatch between the number of molecular features and the number of samples. The dimensionality of genome-wide omics data is fairly high: for example, an RNA-Seq gene expression profile consists of more than 60,000 identifiers, and a HumanMethylation450 (450K) DNA methylation profile has more than 485,000 probes, while the number of available samples in an omics dataset is normally small due to the difficulty of patient recruitment and sample collection. This phenomenon is called "the curse of dimensionality" in machine learning, and it causes massive overfitting and makes samples hard to cluster (Goodfellow et al., 2016). In order to overcome this issue, the number of molecular features used in downstream tasks needs to be reduced significantly. Two common approaches are 1) to manually select a subset of the molecular features related to the downstream task based on domain knowledge; and 2) to apply traditional dimensionality reduction algorithms, e.g., principal component analysis (PCA).
Inspired by the significant success in fields like computer vision (Voulodimos et al., 2018) and natural language processing (Young et al., 2018), deep learning approaches have been applied to analyse the complicated and nonlinear relationships between molecular features of high-dimensional omics data and to detect genome-wide biological patterns from them (Ding et al., 2018; Lopez et al., 2018; Eraslan et al., 2019). With the deep learning mechanism, molecular features can be selected automatically during the training process without manual selection. Multiple downstream tasks have been conducted on different types of high-dimensional omics data, including dimensionality reduction (Ding et al., 2018; Way and Greene, 2018), disease type classification (Zhang et al., 2019; Ma and Zhang, 2019) and survival prediction (Chaudhary et al., 2017; Cheerla and Gevaert, 2019). However, to the best of our knowledge there is no unified deep learning method that is able to conduct all of the aforementioned downstream tasks simultaneously on any combination of omics types.
Here we proposed a unified multi-task deep learning framework for integrated multi-omics analysis named OmiEmbed. The OmiEmbed framework is comprised of two main components: a deep embedding module and a downstream task module. In the deep embedding module, high-dimensional multi-omics data was embedded into a low-dimensional latent space to tackle the challenge of the "curse of dimensionality". The learnt novel representation of each sample was then fed to multiple downstream networks, which were trained simultaneously with a joint loss function and the multi-task training strategy. Downstream tasks already implemented in OmiEmbed include tumour type classification, demographic and clinical feature (e.g., age, gender, race, primary site and disease stage of the sample) reconstruction and prognosis prediction (i.e., predicting the survival function of each input sample). The model was trained and evaluated on two publicly available omics datasets, the Genomic Data Commons (GDC) pan-cancer multi-omics dataset (Grossman et al., 2016) and the GSE109381 brain tumour methylation (BTM) dataset (Capper et al., 2018). Our model achieved promising results for all aforementioned downstream tasks and outperformed the corresponding baseline methods. With the multi-task training strategy, OmiEmbed was able to infer all downstream tasks simultaneously and efficiently. Better results were achieved in the multi-task scenario compared to training and inferring each downstream task separately.
The representation learning ability of deep neural networks has been widely verified by significant breakthroughs in computer vision and natural language processing. Inspired by this achievement, a number of deep learning approaches have been applied to high-dimensional omics data for different downstream tasks in recent years.
The most common downstream task is classification. Danaee et al. (2017) presented a cancer detection model that discriminated breast tumour samples from normal samples using gene expression data based on a stacked denoising autoencoder (SDAE). Lyu and Haque (2018) reshaped high-dimensional RNA-Seq data into images and applied a convolutional neural network (CNN) for tumour type classification on the GDC dataset, achieving an accuracy of 95.59%. Rhee et al. (2018) proposed a hybrid model comprised of a graph convolutional neural network (GCN) and a relation network (RN) for breast tumour subtype classification using gene expression data and protein-protein interaction (PPI) networks. Jurmeister et al. (2019) developed a multi-layer neural network to distinguish metastatic head and neck squamous cell carcinoma (HNSC) from primary squamous cell carcinoma of the lung (LUSC) with an accuracy of 96.4% in the validation cohort. AffinityNet (Ma and Zhang, 2019) was a data-efficient deep learning model that comprised multiple stacked k-nearest neighbours (KNN) attention pooling layers for tumour type classification. OmiVAE (Zhang et al., 2019) was an end-to-end deep learning method for tumour type classification based on the variational autoencoder (VAE), a deep generative model, achieving an accuracy of 97.49% among 33 tumour types and the normal control using gene expression and DNA methylation data from the GDC dataset.
Another typical task that has recently been tackled by deep learning approaches is the prediction of prognosis status from high-dimensional omics data. Chaudhary et al. (2017) applied a vanilla autoencoder (AE) to reduce the dimensionality of multi-omics data comprised of gene expression, DNA methylation and miRNA expression profiles, and used the learned representation to identify two different survival subgroups of liver tumour with the Cox proportional hazards model (CoxPH), K-means clustering and a support vector machine (SVM). In their experiment, a concordance index (C-index) of 0.68 was achieved on the liver tumour subjects from the GDC dataset. The deep learning model applied in this research was not an end-to-end model, and the embedding learned by the network was used separately outside the network for downstream tasks. Huang et al. (2019) implemented a deep learning network with the CoxPH model to predict prognosis for breast tumour using multi-omics data, cancer biomarkers and a gene co-expression network. The aforementioned research focused mostly on tumour samples of a specific primary site and neglected the information shared across different tumour types, which had the potential to improve the performance of survival prediction for each tumour type. Cheerla and Gevaert (2019) constructed a multimodal deep learning network to predict survival of subjects across 20 different tumour types in the GDC dataset, achieving an overall C-index of 0.78 based on additional clinical information and histopathology whole slide images (WSIs) besides the multi-omics data.
There have also been several attempts at applying deep learning methodology to multiple downstream tasks for high-dimensional omics data.
Amodio et al. (2019) presented a deep neural network method named SAUCIE to explore single-cell gene expression data and perform multiple data analysis tasks including clustering, batch correction, imputation and denoising, and visualisation. However, the backbone of SAUCIE was basically an autoencoder used for embedding learning, and most of the downstream tasks had to be conducted outside the network separately, hence the network was not able to perform all of the tasks simultaneously with a single training process. Deepathology (Azarkhalili et al., 2019) was another deep learning method for omics data analysis that adopted the idea of multi-task learning. This model encoded gene expression profiles into a low-dimensional latent vector to predict the tumour type and primary site of the input sample, obtaining an accuracy of 98.1% for primary site prediction and 95.2% for tumour type classification. In spite of the good results on multiple classification tasks, Deepathology was not able to perform the more complicated survival prediction task and did not adopt any state-of-the-art deep multi-task learning optimisation mechanism.
Two publicly available datasets were used as examples to demonstrate the ability of OmiEmbed: the Genomic Data Commons (GDC) pan-cancer multi-omics dataset (Grossman et al., 2016) and the DNA methylation dataset of human central nervous system tumours (GSE109381) (Capper et al., 2018).
The GDC pan-cancer dataset is one of the most comprehensive and widely used multi-omics datasets. It comprises high-dimensional omics data and corresponding phenotype data from two cancer genome programmes: The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET). The TARGET programme mainly focuses on pediatric cancers. Three types of omics data from the GDC dataset were used in our experiments, including RNA-Seq gene expression profiling, DNA methylation profiling and miRNA expression profiling. The dimensionalities of the three types of omics data are 60,483, 485,577 and 1,881 respectively. This dataset consists of 36 different types of tumour samples along with corresponding normal control samples, among which 33 tumour types are from TCGA and 3 tumour types are from TARGET. A wide range of phenotype features is also available in the GDC dataset, including demographics (e.g., age, gender and race), clinical sample information (e.g., primary site and disease stage of the sample) and survival information (recorded time of death or censoring).
The GSE109381 brain tumour methylation (BTM) dataset from the Gene Expression Omnibus (GEO) is one of the largest DNA methylation datasets specifically targeting brain tumours. We integrated both the reference set and the validation set of this dataset, and the whole dataset consists of 3,905 samples covering almost all WHO-defined central nervous system (CNS) tumour entities (Louis et al., 2016) and seven non-neoplastic control CNS regions. The genome-wide DNA methylation profile of each sample was generated using Infinium HumanMethylation450 BeadChip (450K) arrays, the same platform used for the GDC DNA methylation data. Each sample in this dataset has two types of diagnostic label: the histopathological class label defined by the latest 2016 WHO classification of CNS tumours (Louis et al., 2016) and the methylation class label defined by the original paper of this dataset (Capper et al., 2018). Other phenotypic information is also available in this dataset, including age, gender and the disease stage of each sample.
Raw data of the GSE109381 BTM dataset downloaded from GEO were first processed with the Bioconductor R package minfi (Aryee et al., 2014) to get the beta value of each CpG probe. The beta value is the ratio of the methylated signal intensity to the overall signal intensity, which indicates the methylation level of a specific CpG site. The DNA methylation profile generated by the 450K array has 485,577 probes in total. Certain probes were removed during the feature filtering step according to the following criteria: probes targeting the Y chromosome (n = 416), probes containing the dbSNP132Common single-nucleotide polymorphism (SNP) (n = 7,998), probes not mapping uniquely to the human reference genome (hg19) (one mismatch allowed) (n = 3,965), probes not included in the latest Infinium MethylationEPIC BeadChip (EPIC) array (n = 32,260), the SNP assay probes (n = 65), the non-CpG loci probes (n = 3,091) and probes with missing values (N/A) in more than 10% of samples (n = 2). We followed some of the criteria mentioned in the original paper of this dataset (Capper et al., 2018). In total, 46,746 probes were filtered out, which results in a final DNA methylation feature set of 438,831 CpG sites.
For the GDC pan-cancer dataset, the harmonised data of three omics types were downloaded from the UCSC Xena data portal with the original data dimensionality. Each RNA-Seq gene expression profile is comprised of 60,483 identifiers referring to corresponding genes. The expression level is quantified by the Fragments Per Kilobase of transcript per Million mapped reads (FPKM) value, which has been log-transformed. Feature filtering was applied to the gene expression data: genes targeting the Y chromosome (n = 594), genes with zero expression in all samples (n = 1,904) and genes with missing values (N/A) in more than 10% of samples (n = 248) were removed. In total, 2,440 genes were removed, leaving 58,043 molecular features for further analyses. As for the miRNA expression profiles, the expression level of each miRNA stem-loop identifier was measured by the log-transformed Reads Per Million mapped reads (RPM) value. All of the miRNA identifiers were kept in our experiments. For both the gene expression and miRNA expression profiles, the expression values were normalised to the range of 0 to 1 due to the input requirement of the OmiEmbed framework. The DNA methylation data in the GDC dataset were filtered based on the same criteria used for the BTM dataset. Remaining missing values in all datasets mentioned above were imputed with the mean of the corresponding molecular feature.
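As an illustration of the feature filtering, mean imputation and min-max normalisation described above, the following is a minimal pandas sketch for the gene expression data. The file names and the Y-chromosome gene list are hypothetical placeholders, not the exact pipeline released with OmiEmbed.

```python
import pandas as pd

# Hypothetical inputs: FPKM matrix (genes x samples) and a list of Y-chromosome gene IDs
expr = pd.read_csv("gdc_expression_fpkm.tsv", sep="\t", index_col=0)
y_chrom_genes = set(pd.read_csv("y_chromosome_genes.txt", header=None)[0])

# Feature filtering: Y-chromosome genes, all-zero genes, genes with >10% missing values
keep = (
    ~expr.index.isin(y_chrom_genes)
    & (expr.fillna(0).sum(axis=1) > 0)
    & (expr.isna().mean(axis=1) <= 0.10)
)
expr = expr.loc[keep]

# Impute remaining missing values with the per-feature (per-gene) mean
feature_means = expr.mean(axis=1)
expr = expr.T.fillna(feature_means).T

# Min-max normalise each feature to [0, 1], as required by the OmiEmbed input
rng = (expr.max(axis=1) - expr.min(axis=1)).replace(0, 1)
expr_norm = expr.sub(expr.min(axis=1), axis=0).div(rng, axis=0)
expr_norm.T.to_csv("expr_preprocessed.tsv", sep="\t")  # samples x features
```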
Fig. 1. The overall architecture of OmiEmbed, which is comprised of two main components: the VAE deep embedding network and the downstream task networks. The number of omics types and downstream tasks can be modified based on user needs and the requirements of the experiment. E_expr, E_methy and E_miRNA represent the encoders of gene expression, DNA methylation and miRNA expression respectively. Similarly, D_expr, D_methy and D_miRNA represent the decoders of gene expression, DNA methylation and miRNA expression.
OmiEmbed is a unified multi-omics multi-task end-to-end deep learning framework designed for reconstructing a panoramic view of phenotypic information from high-dimensional omics data. The overall architecture of OmiEmbed is comprised of a deep embedding module and one or multiple downstream task modules, as illustrated in Fig. 1.
The role of the deep embedding module in OmiEmbed is to embed high-dimensional multi-omics profiles into a low-dimensional latent space for the downstream task modules. The backbone network we used in the deep embedding module is the variational autoencoder (VAE) (Kingma and Welling, 2013). The VAE is a deep generative model that is also effective at capturing the data manifold of high-dimensional data. We assume each sample $x^{(i)} \in \mathbb{R}^d$ in the multi-omics dataset $D$ can be represented by and generated from a latent vector $z^{(i)} \in \mathbb{R}^p$, where $p \ll d$. In the generation process, each latent vector is first sampled from a prior distribution $p_\theta(z)$, and then the multi-omics data of each sample is generated from the conditional distribution $p_\theta(x \mid z)$, where $\theta$ is the set of learnable parameters of the decoder. In order to address the intractability of the true posterior $p_\theta(z \mid x)$, the variational distribution $q_\phi(z \mid x)$ is introduced to approximate $p_\theta(z \mid x)$, where $\phi$ is the set of learnable parameters of the encoder. As a result, the VAE network is optimised by maximising the variational lower bound formulated as:

$$\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right) \quad (1)$$

where $D_{KL}$ is the Kullback-Leibler (KL) divergence between two probability distributions (Goodfellow et al., 2016).
We applied the VAE framework in our deep embedding module to obtain the low-dimensional latent vector that represents the original high-dimensional omics data in the downstream task modules. For each type of omics data, the input profiles were first encoded into corresponding vectors by type-specific encoders. These vectors of different omics types were then concatenated in the subsequent hidden layer and encoded into one multi-omics vector. Following the VAE formulation, the multi-omics vector was connected to two bottleneck layers to obtain the mean vector $\mu$ and the standard deviation vector $\sigma$. These two vectors define the Gaussian distribution $\mathcal{N}(\mu, \sigma)$ of the latent variable $z$ given the input sample $x$, which is the variational distribution $q_\phi(z \mid x)$. Since sampling $z$ from the learned distribution is not differentiable and therefore not suitable for backpropagation, the reparameterisation trick is applied as follows:

$$z = \mu + \sigma \odot \epsilon \quad (2)$$

where $\epsilon$ is a random variable sampled from the unit normal distribution $\mathcal{N}(0, I)$. The latent variable $z$ was then fed to decoders with a symmetrical network structure to get the reconstructed multi-omics data $x'$. We provided two types of detailed network structure for the encoders and decoders in the deep embedding module: the one-dimensional convolutional neural network (CNN) and the fully-connected neural network (FC). Both network types share the same overall architecture, and other state-of-the-art embedding networks can easily be added to OmiEmbed for further improvement. With the deep embedding module we can obtain the low-dimensional representation of the input omics data. This new representation can directly replace the original omics data as the input of any downstream task.
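To make the deep embedding module concrete, the following is a minimal PyTorch sketch of a two-omics VAE with the reparameterisation trick of equation (2). Layer sizes, class and variable names, and the two-omics restriction are illustrative assumptions rather than the released OmiEmbed implementation, which supports configurable omics types and both FC-type and CNN-type encoders.

```python
import torch
import torch.nn as nn

class MultiOmicsVAE(nn.Module):
    """Minimal sketch: two omics-specific encoders, a shared latent space and symmetric decoders."""
    def __init__(self, dim_expr, dim_methy, hidden=512, latent=128):
        super().__init__()
        # Omics-specific encoders (FC-type); dimensions are illustrative
        self.enc_expr = nn.Sequential(nn.Linear(dim_expr, hidden), nn.ReLU())
        self.enc_methy = nn.Sequential(nn.Linear(dim_methy, hidden), nn.ReLU())
        # Two bottleneck layers giving the mean and log-variance of q(z|x)
        self.fc_mu = nn.Linear(2 * hidden, latent)
        self.fc_logvar = nn.Linear(2 * hidden, latent)
        # Symmetric decoders reconstructing each omics type from z
        self.dec_expr = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                      nn.Linear(hidden, dim_expr), nn.Sigmoid())
        self.dec_methy = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                       nn.Linear(hidden, dim_methy), nn.Sigmoid())

    def reparameterise(self, mu, logvar):
        # z = mu + sigma * epsilon, with epsilon ~ N(0, I)  (equation 2)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def forward(self, x_expr, x_methy):
        h = torch.cat([self.enc_expr(x_expr), self.enc_methy(x_methy)], dim=1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterise(mu, logvar)
        return self.dec_expr(z), self.dec_methy(z), mu, logvar, z
```

The latent vector z returned by the forward pass is what the downstream task networks consume.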
For instance, when the latent dimension is set to 2 or 3, the new representation can be used for visualisation purposes. Nevertheless, we can also attach one or multiple downstream task networks to the bottleneck layer of the deep embedding module to get an end-to-end multi-task model, which is able to guide the embedding step with objectives and to share information among different tasks.
Three main types of end-to-end downstream tasks are provided in the OmiEmbed framework: classification tasks, regression tasks and the survival prediction task. Each downstream task falling into one of these categories can be trained individually or together with other downstream tasks using the coordinated multi-task strategy that we will discuss in later sections. A multi-layer fully-connected network was applied to the classification-type downstream tasks, including diagnostic tasks such as tumour type classification, primary site prediction and disease stage prediction (normal control, primary tumour, recurrent tumour or metastatic tumour), and demographic tasks, e.g., the prediction of gender and race. The output dimension of a classification downstream network was set to the number of classes. For the regression task, a similar network was attached to the deep embedding module, but only one neuron was kept in the output layer to predict the target scalar value (e.g., the age of the subject). The survival prediction downstream network is more complicated and is discussed in a subsequent section. The downstream networks add further regularisation to the low-dimensional latent representation and urge the deep embedding module to learn representations related to the corresponding downstream tasks. With the downstream modules, a single well-trained multi-task OmiEmbed network is able to reconstruct a comprehensive phenotype profile including diagnostic, prognostic and demographic information from omics data.
Like the overall structure, the joint loss function is also comprised of two main components: the loss of the deep embedding and the loss of the downstream tasks. We denote each type of input omics profile as $x_j$ and the corresponding reconstructed profile as $x'_j$, where $j$ is the omics type index and there are $M$ omics types in total. The deep embedding loss can then be defined as:

$$L_{embed} = \frac{1}{M} \sum_{j=1}^{M} \mathrm{BCE}\left(x_j, x'_j\right) + D_{KL}\left(\mathcal{N}(\mu, \sigma) \,\|\, \mathcal{N}(0, I)\right) \quad (3)$$

where BCE is the binary cross-entropy measuring the distance between the input data and the reconstructed data, and the second term is the KL divergence between the learned distribution and a unit Gaussian distribution.
In the downstream modules, each downstream task has its specific loss function $L_{down_k}$ and a corresponding weight $w_k$. For a classification-type task, the loss function is defined as:

$$L_{classification} = \mathrm{CE}(y, y') \quad (4)$$

where $y$ is the ground truth, $y'$ is the predicted label and CE is the cross-entropy loss. Similarly, the loss function of a regression-type task is

$$L_{regression} = \mathrm{MSE}(y, y') \quad (5)$$

where MSE is the mean squared error between the real value and the predicted value. The loss function of the survival prediction task is discussed in the next section. The overall loss function of the downstream modules is the weighted sum of all downstream losses, i.e.,

$$L_{down} = \frac{1}{K} \sum_{k=1}^{K} w_k L_{down_k} \quad (6)$$

where $K$ is the number of downstream tasks, $L_{down_k}$ is the loss of a certain task and $w_k$ is the corresponding weight. $w_k$ can be manually set as a hyperparameter or treated as a learnable parameter during the training process. In conclusion, the joint loss function of the end-to-end OmiEmbed network is

$$L_{total} = \lambda L_{embed} + L_{down} \quad (7)$$

where $\lambda$ balances the two terms in the overall loss function.
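A minimal sketch of how the joint loss of equations (3)-(7) could be assembled in PyTorch is given below, for one omics type and two downstream tasks (one classification, one regression). The function names, the two-task setup and the default weights are illustrative assumptions, not the exact OmiEmbed code.

```python
import torch
import torch.nn.functional as F

def embedding_loss(x, x_recon, mu, logvar):
    # Equation (3): reconstruction (binary cross-entropy) + KL divergence to N(0, I)
    bce = F.binary_cross_entropy(x_recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kl

def joint_loss(x, x_recon, mu, logvar, class_logits, y_class, age_pred, y_age,
               task_weights=(1.0, 1.0), lam=0.01):
    # Downstream losses: cross-entropy for classification (eq. 4), MSE for regression (eq. 5)
    l_cls = F.cross_entropy(class_logits, y_class)
    l_reg = F.mse_loss(age_pred.squeeze(-1), y_age)
    # Weighted mean over the downstream tasks (eq. 6)
    l_down = (task_weights[0] * l_cls + task_weights[1] * l_reg) / 2
    # Total joint loss (eq. 7): lambda * embedding loss + downstream loss
    return lam * embedding_loss(x, x_recon, mu, logvar) + l_down
```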
Based on the aforementioned loss functions, three training phases were designed in OmiEmbed. Phase 1 was the unsupervised phase that only involved the deep embedding module. In this training phase, only the deep embedding loss was backpropagated and only the parameters of the deep embedding network were updated based on the gradients. No label was required in the first training phase, and this phase can be used on its own as a dimensionality reduction or visualisation method. In phase 2, the pre-trained embedding network was fixed whilst the downstream networks were being trained: the joint downstream loss was backpropagated and only the downstream networks were updated during this phase. After the embedding network and the downstream networks had been pre-trained separately, the overall loss function defined in equation (7) was calculated and backpropagated during phase 3. In this final training phase the whole OmiEmbed network, including the deep embedding module and the downstream modules, was fine-tuned to obtain better performance.
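A schematic sketch of this three-phase schedule is shown below. Here embed_net, down_nets, loader and loss_fn are assumed user-supplied objects (the model, a container of downstream heads, a data loader and a callable returning the phase-appropriate loss); the released OmiEmbed code drives the same schedule through its own command line options.

```python
import torch

def set_requires_grad(module, flag):
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def run_phase(phase, embed_net, down_nets, loader, loss_fn, epochs, lr=1e-4):
    # Phase 1: train the embedding only; phase 2: downstream heads only; phase 3: fine-tune all
    set_requires_grad(embed_net, phase in (1, 3))
    set_requires_grad(down_nets, phase in (2, 3))
    params = [p for p in list(embed_net.parameters()) + list(down_nets.parameters())
              if p.requires_grad]
    optimiser = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for batch in loader:
            optimiser.zero_grad()
            # loss_fn returns the embedding loss, downstream loss or joint loss depending on the phase
            loss = loss_fn(phase, embed_net, down_nets, batch)
            loss.backward()
            optimiser.step()

# Usage sketch: run_phase(1, ...), then run_phase(2, ...), then run_phase(3, ...)
```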
Survival prediction is the most complicated downstream task implemented in OmiEmbed. The objective of this task is to obtain the personalised survival function and hazard function of a subject based on the high-dimensional omics data. With the survival prediction task, OmiEmbed is able to reconstruct a more comprehensive phenotype profile of each subject from omics data and potentially facilitate omics-based precision medicine. The survival function can be denoted by

$$S(t) = P[T > t] \quad (8)$$

where $T$ is the time elapsed between the sample collection time and the time of the event occurring. The survival function signifies the probability that the failure event, i.e., death, has not occurred by time $t$. The hazard function is defined as:

$$h(t) = \lim_{dt \to 0} \frac{P[t \le T < t + dt \mid T \ge t]}{dt} \quad (9)$$

which represents the instantaneous rate of occurrence of the failure event. Thus a large hazard value indicates a great risk of death, which is a crucial signal for clinicians.
In order to train a survival prediction downstream network, besides the omics data $X$, two elements of the survival labels are required: the event time $T$ and the event indicator $E$. The indicator was set to 1 when the failure event was observed during the study and to 0 when the event was not observed, which is termed censoring. In the case of censoring, the event time $T$ is the time elapsed between the time when the sample was collected and the time of the last contact with the subject. Both $T$ and $E$ are available in the GDC dataset.
The multi-task logistic regression (MTLR) model (Yu et al., 2011) was applied and adapted to the OmiEmbed framework for the survival prediction downstream task. In the first step, the time axis was divided into $m$ time intervals $\{l_i\}_{i=1}^{m}$. Each time interval was defined as $l_i = [t_{i-1}, t_i)$, where $t_0 = 0$ and $t_m \ge \max(T)$. The number of time intervals $m$ is a hyperparameter: a larger $m$ results in a more fine-grained output but requires more computational resources. We applied a multi-layer fully-connected network as the backbone of our survival prediction network, and the dimension of the output layer is the number of time intervals. As a result, the output of the survival prediction network is an $m$-dimensional vector $y' = [y'_1, y'_2, \ldots, y'_m]$. Similarly, the survival label of each subject was also encoded into an $m$-dimensional vector $y = [y_1, y_2, \ldots, y_m]$, where $y_i$ signifies the survival status of the subject at time point $t_i$. The likelihood of observing $y$ given sample $x$ with the network parameters $\theta$ can be defined as follows:

$$P_\theta(y \mid x) = \frac{\exp\left(\sum_{i=1}^{m} y_i y'_i\right)}{\sum_{j=0}^{m} \exp\left(\sum_{i=j+1}^{m} y'_i\right)} \quad (10)$$

The objective of the survival network is to find a set of parameters $\theta$ that maximises the log likelihood, hence the loss function of the survival prediction downstream task is defined as

$$L_{survival} = -\sum_{i=1}^{m} y_i y'_i + \log \sum_{j=0}^{m} \exp\left(\sum_{i=j+1}^{m} y'_i\right) \quad (11)$$

which can be directly applied to the downstream module and included in the joint loss function of OmiEmbed.
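A sketch of the survival loss of equation (11) in PyTorch is given below. It assumes the m-dimensional label vector y has already been encoded (how censored subjects are encoded, which the full MTLR formulation handles by marginalising over possible death times, is not shown); the function name and shapes are illustrative, not the exact OmiEmbed code.

```python
import torch

def mtlr_survival_loss(pred, target):
    """MTLR-style loss of equation (11).
    pred:   (batch, m) network outputs y'
    target: (batch, m) encoded survival labels y
    """
    # Numerator of eq. (10): sum_i y_i * y'_i
    num = (pred * target).sum(dim=1)
    # Partial sums s_j = sum_{i=j+1}^{m} y'_i for j = 0..m (s_m = 0), via a reversed cumulative sum
    rev_cumsum = torch.flip(torch.cumsum(torch.flip(pred, dims=[1]), dim=1), dims=[1])
    zeros = pred.new_zeros(pred.size(0), 1)
    partial = torch.cat([rev_cumsum, zeros], dim=1)          # shape (batch, m + 1)
    # log of the denominator of eq. (10), computed stably with logsumexp
    log_denom = torch.logsumexp(partial, dim=1)
    return (log_denom - num).mean()
```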
Table 1. The classification performance on the BTM dataset using the histopathological tumour type labels

            Accuracy  Precision  Recall   F1
PCA+SVM     0.8477    0.8268     0.8477   0.8284
KPCA+SVM    0.8452    0.8245     0.8452   0.8261
UMAP+SVM    0.6684    0.5497     0.6684   0.5845
OmiEmbed
Table 2. The classification performance on the BTM dataset using the methylation tumour type labels

            Accuracy  Precision  Recall   F1
PCA+SVM     0.9360    0.9385     0.9360   0.9335
KPCA+SVM    0.9347    0.9352     0.9347   0.9313
UMAP+SVM    0.6389    0.5228     0.6389   0.5603
OmiEmbed
With the joint loss function (6) of the multi-task downstream modules, we aimed to train multiple downstream networks in OmiEmbed simultaneously and efficiently instead of training them separately, so as to obtain a unified network that can reconstruct a comprehensive phenotype profile of the subject. In order to balance the optimisation of the different tasks, we adapted the multi-task optimisation method gradient normalisation (GradNorm) (Chen et al., 2018) to our OmiEmbed framework.
In equation (6), $w_k$ is the weight of each downstream loss, and this weight can also be regarded as a trainable parameter that varies at each training iteration. The idea of GradNorm is to penalise the network if the gradients of any downstream task are too large or too small, which makes all tasks train at similar rates (Chen et al., 2018). First, the gradient norm of each downstream task is calculated as

$$G^{(k)}_\theta = \left\| \nabla_\theta \, w_k L_{down_k} \right\| \quad (12)$$

where $\theta$ denotes the parameters of the last encoding layer in the deep embedding module of OmiEmbed. The average gradient norm over all tasks is then

$$\bar{G}_\theta = \frac{1}{K} \sum_{k=1}^{K} G^{(k)}_\theta \quad (13)$$

where $K$ is the number of downstream tasks. The relative inverse training rate of each task is defined as

$$r_k = \frac{\tilde{L}_{down_k}}{\frac{1}{K} \sum_{k=1}^{K} \tilde{L}_{down_k}} \quad (14)$$

where $\tilde{L}_{down_k} = L_{down_k}^{(t)} / L_{down_k}^{(0)}$ is the ratio of the current loss to the initial loss of downstream task $k$. The GradNorm loss is then defined as

$$L_{grad} = \sum_{k=1}^{K} \left| G^{(k)}_\theta - \bar{G}_\theta \times r_k^{\alpha} \right| \quad (15)$$

where $\alpha$ is a hyperparameter that represents the strength of pulling tasks back to a common training rate. A separate backpropagation process was conducted on $L_{grad}$ during each training iteration, and it was only used to update $w_k$.
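A condensed sketch of one GradNorm-style weight update following equations (12)-(15) is shown below. The function and variable names, learning rate and surrounding training loop are assumptions; the full method in Chen et al. (2018) additionally renormalises the weights after each step so that they sum to K.

```python
import torch

def gradnorm_step(task_losses, init_losses, weights, shared_params, alpha=1.5, lr=0.025):
    """One GradNorm update of the task weights w_k (eqs. 12-15).
    task_losses:   list of current per-task losses (scalar tensors with graph attached)
    init_losses:   list of loss values recorded at the first iteration (floats)
    weights:       torch.nn.Parameter vector holding the task weights w_k
    shared_params: list of parameters of the last shared encoding layer (theta)
    """
    # Gradient norm of each weighted task loss w.r.t. the shared parameters (eq. 12)
    norms = []
    for w, loss in zip(weights, task_losses):
        grads = torch.autograd.grad(w * loss, shared_params,
                                    retain_graph=True, create_graph=True)
        norms.append(torch.norm(torch.cat([g.flatten() for g in grads])))
    norms = torch.stack(norms)
    mean_norm = norms.mean().detach()                                  # eq. (13)
    # Relative inverse training rates (eq. 14), treated as constants
    loss_ratios = torch.stack([l.detach() / l0 for l, l0 in zip(task_losses, init_losses)])
    r = loss_ratios / loss_ratios.mean()
    # GradNorm loss (eq. 15) and a separate update that only touches the weights
    target = (mean_norm * r.pow(alpha)).detach()
    l_grad = (norms - target).abs().sum()
    grad_w = torch.autograd.grad(l_grad, weights)[0]
    with torch.no_grad():
        weights -= lr * grad_w
```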
Fig. 2. The 128D latent space of the BTM dataset learned by the unsupervised phase of OmiEmbed. The scatter graph was visualised using t-SNE. Each sample in the scatter graph was coloured by its methylation class label. Tumour types belonging to the same upper-level class were marked in similar colours.
The OmiEmbed multi-omics multi-task framework was built on the deep learning library PyTorch (Paszke et al., 2019). The code of OmiEmbed has been made open-source on GitHub, and it is convenient to use on any high-dimensional omics dataset for multiple downstream tasks with a set of predefined command line arguments. In the FC-type omics embedding network, CpG sites of DNA methylation profiles were separately connected to different hidden layers based on their targeting chromosomes in order to reduce the number of parameters, prevent overfitting and save GPU memory. This chromosome separation step is processed automatically in OmiEmbed with a built-in DNA methylation annotation if the FC-type embedding is selected. Other deep learning techniques were also applied to prevent overfitting in OmiEmbed, including dropout (Srivastava et al., 2014), batch normalisation (Ioffe and Szegedy, 2015), weight decay regularisation and a learning rate schedule. The model was trained on two NVIDIA Titan X GPUs with 12 gigabytes of memory each. The input dataset of each experiment was randomly separated into training, validation and testing sets (70%, 10% and 20% respectively). The separation was conducted in a stratified manner so as to keep the proportion of each class the same in all three sets. The validation set was used for hyperparameter searching, and the reported performance was evaluated separately on the testing set.

Table 3. The performance of tumour type classification on the GDC multi-omics dataset with different omics type combinations

                       Accuracy  Precision  Recall   F1
Gene expression (a)    0.9680    0.9686     0.9680   0.9679
DNA methylation (b)    0.9659    0.9653     0.9659   0.9650
miRNA expression (c)   0.9618    0.9619     0.9618   0.9609
Multi-omics (a+b)      0.9752    0.9762     0.9752   0.9750
Multi-omics (a+b+c)
OmiEmbed can be regarded as an unsupervised dimensionality reduction method if only training phase 1, described in section 3.4, is applied in the experiment. The high-dimensional multi-omics data can be compressed into a new representation whose target dimensionality is set by a hyperparameter of OmiEmbed. The output file can then be used directly for visualisation or any other downstream task. Here we reduced each sample in the BTM dataset to a 128D latent vector using the unsupervised phase 1 of OmiEmbed. The learned latent space of the BTM dataset was visualised by t-distributed stochastic neighbour embedding (t-SNE) (Van der Maaten and Hinton, 2008), as shown in Figure 2. Samples of different brain tumour types automatically clustered together in the latent space, and tumour types belonging to the same upper-level class (e.g. glioblastoma, embryonal tumour, and ependymal tumour) also formed corresponding upper-level clusters.
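Visualising a saved latent matrix with t-SNE, as in Figure 2, could look like the following sketch. The file names, the label column and the assumption that the latent and label tables share the same sample order are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical outputs: 128D latent vectors (samples x 128) and methylation class labels
latent = pd.read_csv("btm_latent_128d.tsv", sep="\t", index_col=0)
labels = pd.read_csv("btm_labels.tsv", sep="\t", index_col=0)["methylation_class"]

# Project the 128D embedding to 2D with t-SNE for plotting
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(latent.values)

fig, ax = plt.subplots(figsize=(8, 8))
for cls in labels.unique():
    mask = (labels == cls).values          # rows assumed to align with `coords`
    ax.scatter(coords[mask, 0], coords[mask, 1], s=5, label=cls)
ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
fig.savefig("btm_latent_tsne.png", dpi=300)
```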
Instead of using training phase 1 individually as a dimensionality reduction method and separately training the downstream task with other machine learning algorithms, using all three training phases of OmiEmbed in an end-to-end manner is more efficient and yields better results. Here we first tested the classification performance of OmiEmbed on the BTM dataset. There are two tumour type classification systems for this brain tumour dataset: the histopathological tumour type labels defined by the 2016 WHO classification (Louis et al., 2016) and the methylation tumour type labels defined by the original paper of this dataset (Capper et al., 2018). For each of these two classification systems, the 3,905 samples were divided into more than 90 classes including different normal control types.
The classification performance of OmiEmbed on the BTM dataset with the histopathological labels and the methylation labels is shown in Table 1 and Table 2 respectively, measured by four multi-class classification metrics: overall accuracy, weighted precision, weighted recall and weighted F1 score. The results were also compared with the combination of three dimensionality reduction methods and the support vector machine (SVM) classification algorithm. The three dimensionality reduction methods used in the experiment were principal component analysis (PCA), kernel principal component analysis (KPCA) (Schölkopf et al., 1997) and uniform manifold approximation and projection (UMAP) (McInnes et al., 2018). The original data from the BTM dataset were first reduced to 128D by each of these dimensionality reduction methods and then classified by an SVM with a radial basis function (RBF) kernel. As shown in Table 1 and Table 2, OmiEmbed achieved the best classification performance in all four metrics with both classification systems.
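A rough scikit-learn sketch of the baseline pipelines compared against OmiEmbed is given below; the file names, data split and hyperparameters are illustrative assumptions rather than the exact settings used in these experiments. UMAP (from the umap-learn package) can be slotted into the same pipeline in place of PCA or KPCA.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical inputs: samples x features beta-value matrix and tumour type labels
X = np.load("btm_beta.npy")
y = np.load("btm_labels.npy", allow_pickle=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

baselines = {
    "PCA+SVM": make_pipeline(PCA(n_components=128), SVC(kernel="rbf")),
    "KPCA+SVM": make_pipeline(KernelPCA(n_components=128, kernel="rbf"), SVC(kernel="rbf")),
}
for name, model in baselines.items():
    model.fit(X_tr, y_tr)                       # reduce to 128D, then classify with an RBF SVM
    pred = model.predict(X_te)
    print(name, accuracy_score(y_te, pred), f1_score(y_te, pred, average="weighted"))
```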
Table 4. Detailed information for the categorical features predicted by OmiEmbed in the GDC dataset

               Number of classes   Label examples
Tumour type    37                  BRCA, UCEC, KIRC, LGG, LUAD, THCA, HNSC, LUSC, PRAD, Normal control, etc.
Disease stage  7                   Primary tumour, Metastatic tumour, Recurrent tumor, Normal control, etc.
Primary site   29                  Breast, Kidney, Lung, Brain, Colorectal, Uterus, Thyroid, Prostate, etc.
Race           6                   White, Black or African American, Asian, American Indian or Alaska native, etc.
Gender         2                   Male, Female

Table 5. The classification performance of using OmiEmbed to predict the four categorical phenotype features in the GDC dataset

               Accuracy  Precision  Recall   F1
Disease stage  0.9809    0.9773     0.9809   0.9790
Primary site   0.9843    0.9847     0.9843   0.9843
Race           0.9756    0.9719     0.9756   0.9735
Gender         0.9594    0.9595     0.9594   0.9594

Table 6. The regression performance of age prediction on the GDC dataset using different methods

            RMSE     Mean absolute error  Median absolute error
PCA+SVR     12.3285  9.8544               8.1017
KPCA+SVR    12.3269  9.8535               8.0375
UMAP+SVR    12.9859  10.3932              8.8130
OmiEmbed

Table 7. The performance of survival prediction on the GDC dataset using different methods

            Concordance index
PCA+CoxPH   0.7291
KPCA+CoxPH  0.7307
UMAP+CoxPH  0.7169
OmiEmbed
Table 8. Multi-task performance of OmiEmbed with three diverse downstream tasks

                   Survival   Tumour type         Age
                   C-index    Accuracy   F1       RMSE
Multi-task
Single task alone  0.7809     0.9680     0.9679   10.7982
PCA baseline       0.7291     0.9444     0.9391   12.3285
Different types of omics profiles can be integrated into a single latent representation and used for different downstream tasks through the multi-omics deep embedding module of OmiEmbed. In order to test the effect of multi-omics integration on a downstream task, we trained a tumour type classifier using OmiEmbed on the GDC multi-omics dataset. We applied three omics types from the GDC dataset, gene expression, DNA methylation and miRNA expression; the dataset contains 33 tumour types and a normal control class (34 classes in total). We trained the model with each omics type alone and with two different multi-omics combinations. The classification performance in each scenario is shown in Table 3. The performance metrics for each omics type alone were close to each other, and the best metrics were achieved with the combination of all three omics types. This result also indicates that combining multiple omics data types can yield better insights into the underlying mechanisms of diseases.
With both the classification and regression downstream networks built into OmiEmbed, we were able to reconstruct a number of phenotype features from high-dimensional omics data. Here we tested the prediction performance of five phenotype features in the GDC dataset: age, gender, race, and the disease stage and primary site of the clinical sample. Detailed information on each categorical feature is listed in Table 4. The disease stage is the clinical type of the sample, which consists of primary tumour, metastatic tumour, recurrent tumour and normal control tissue. The primary site is the place where the cancer starts growing; samples in the GDC dataset come from 28 different primary sites such as breast, kidney, lung and skin. The race of the subjects in the GDC dataset was divided into six main categories: white, Asian, black or African American, American Indian or Alaska native, native Hawaiian or other Pacific islander, and others. As for the gender of each sample, since the molecular features targeting the Y chromosome were filtered out in the preprocessing stage, the model was required to classify gender based on other molecular features. The classification performance for the four categorical phenotype features is shown in Table 5.
Since the age label is numerical instead of categorical, the regression downstream network in OmiEmbed was applied for the age prediction task. The performance of age prediction was evaluated by three regression metrics: root mean square error (RMSE), mean absolute error and median absolute error, as shown in Table 6. The results were compared with the combination of the aforementioned dimensionality reduction methods and the support vector regression (SVR) algorithm with the RBF kernel. OmiEmbed achieved the best performance with the lowest error in all three metrics.
With the survival prediction downstream network of OmiEmbed described in section 3.5, we are able to predict the survival function of each subject from the corresponding high-dimensional omics data. The performance of the survival prediction downstream task was evaluated by the concordance index (C-index),
which is one of the most commonly used metrics for survival prediction. A C-index of 1 indicates a perfect prediction model, while a value of 0.5 signifies that the model performs no better than random. The performance was compared with baseline methods that first reduced the dimensionality of the input omics data to 128D using PCA, KPCA or UMAP and then fed the 128D data to the standard survival prediction method, the Cox proportional hazards model (CoxPH). OmiEmbed achieved the best C-index (0.78) among these methods, as shown in Table 7.
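The C-index used above can be computed, for example, with the lifelines package; the input arrays below are placeholders, and the risk scores stand in for whatever summary of the predicted survival function is used for ranking.

```python
import numpy as np
from lifelines.utils import concordance_index

# Placeholder inputs: follow-up times, event indicators (1 = death observed, 0 = censored)
# and a per-subject risk score derived from the predicted survival function
durations = np.array([410.0, 1320.0, 250.0, 980.0])
events = np.array([1, 0, 1, 1])
risk_scores = np.array([0.9, 0.2, 0.7, 0.5])

# concordance_index expects scores that increase with survival time,
# so a risk score is negated before being passed in
cindex = concordance_index(durations, -risk_scores, event_observed=events)
print(f"C-index: {cindex:.4f}")
```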
Instead of training each of the aforementioned downstream networks separately, we can also train multiple downstream networks together using the multi-task training strategy described in section 3.6. With the multi-task strategy, OmiEmbed is able to perform diverse downstream tasks simultaneously and reconstruct a comprehensive phenotype profile of each subject from high-dimensional omics data using one unified network in a single forward propagation. In order to test the multi-task performance of OmiEmbed, we selected three typical downstream tasks belonging to three distinct categories: the survival prediction task, the tumour type classification task and the age regression task. The three downstream networks, along with the deep embedding network, were trained together using the joint loss function of equation (7) and the GradNorm loss function of equation (15). As shown in Table 8, the performance was higher on all three downstream tasks when they were trained in a unified multi-task OmiEmbed network compared with being trained separately.
OmiEmbed is an open-source deep learning framework designed for multiple omics data analysis tasks including dimensionality reduction, multi-omics data integration, tumour type classification, disease stage prediction, demographic label reconstruction and prognosis prediction. All of the aforementioned tasks can be performed simultaneously and efficiently by a unified OmiEmbed multi-task network. We achieved promising performance for each downstream task and outperformed state-of-the-art methods using the multi-task training strategy. Our results also indicate that OmiEmbed is able to reconstruct a comprehensive clinical and phenotypic profile from the multi-omics data of each subject, which has great potential to facilitate more accurate and personalised clinical decision making. Since the code of OmiEmbed is publicly available and the unified framework is applicable to multiple omics types and downstream tasks, we believe OmiEmbed will become a useful tool for other researchers analysing high-dimensional omics data with deep learning methodology.
Funding
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement 764281.
References
Amodio, M. et al. (2019). Exploring single-cell data with deep multitasking neural networks. Nature Methods, 1139–1145.
Aryee, M. J. et al. (2014). Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics, (10), 1363–1369.
Azarkhalili, B. et al. (2019). Deepathology: deep multi-task learning for inferring molecular pathology from cancer transcriptome. Scientific Reports.
Berger, B. et al. (2013). Computational solutions for omics data. Nature Reviews Genetics, 333–346.
Capper, D. et al. (2018). DNA methylation-based classification of central nervous system tumours. Nature, (7697), 469–474.
Chaudhary, K. et al. (2017). Deep learning-based multi-omics integration robustly predicts survival in liver cancer. Clinical Cancer Research, 1248–1259.
Cheerla, A. and Gevaert, O. (2019). Deep learning with multimodal representation for pancancer prognosis prediction. Bioinformatics, i446–i454.
Chen, Z. et al. (2018). GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pages 794–803. PMLR.
Danaee, P. et al. (2017). A deep learning approach for cancer detection and relevant gene identification. Pacific Symposium on Biocomputing, 219–229.
Ding, J. et al. (2018). Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nature Communications.
Eraslan, G. et al. (2019). Single-cell RNA-seq denoising using a deep count autoencoder. Nature Communications.
Goodfellow, I. et al. (2016). Deep Learning. MIT Press.
Grossman, R. L. et al. (2016). Toward a shared vision for cancer genomic data. New England Journal of Medicine, (12), 1109–1112.
Hasin, Y. et al. (2017). Multi-omics approaches to disease. Genome Biology.
Huang, Z. et al. (2019). SALMON: survival analysis learning with multi-omics neural networks on breast cancer. Frontiers in Genetics.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.
Jurmeister, P. et al. (2019). Machine learning analysis of DNA methylation profiles distinguishes primary lung squamous cell carcinomas from head and neck metastases. Science Translational Medicine.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Lopez, R. et al. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 1053–1058.
Louis, D. N. et al. (2016). The 2016 World Health Organization classification of tumors of the central nervous system: a summary. Acta Neuropathologica, (6), 803–820.
Lyu, B. and Haque, A. (2018). Deep learning based tumor type classification using gene expression data. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics.
Ma, T. and Zhang, A. (2019). AffinityNet: semi-supervised few-shot learning for disease type prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1069–1076.
McInnes, L. et al. (2018). UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
Paszke, A. et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In NeurIPS.
Rhee, S. et al. (2018). Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification. In IJCAI.
Schölkopf, B. et al. (1997). Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer.
Srivastava, N. et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, (1), 1929–1958.
Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, (11).
Voulodimos, A. et al. (2018). Deep learning for computer vision: a brief review. Computational Intelligence and Neuroscience.
Way, G. P. and Greene, C. (2018). Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pacific Symposium on Biocomputing, 80–91.
Weinstein, J. N. et al. (2013). The Cancer Genome Atlas pan-cancer analysis project. Nature Genetics, (10), 1113.
Young, T. et al. (2018). Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine, 55–75.
Yu, C.-N. et al. (2011). Learning patient-specific cancer survival distributions as a sequence of dependent regressors. Advances in Neural Information Processing Systems, 1845–1853.
Zhang, X. et al. (2019). Integrated multi-omics analysis using variational autoencoders: application to pan-cancer classification. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).