OmiEmbed: a unified multi-task deep learning framework for multi-omics data
OmiEmbed: reconstruct comprehensive phenotypic information from multi-omics data using multi-task deep learning
Xiaoyu Zhang, Kai Sun and Yike Guo
Data Science Institute, Imperial College London, London, SW7 2AZ, UK and Department of Computer Science, Hong Kong Baptist University, Hong Kong, China.
Abstract
Motivation:
High-dimensional omics data contain intrinsic biomedical information that is crucial for personalised medicine. Nevertheless, it is challenging to capture this information from genome-wide data because of the large number of molecular features and the small number of available samples, a problem known as "the curse of dimensionality" in machine learning. To tackle this problem and pave the way for machine-learning-aided precision medicine, we proposed a unified multi-task deep learning framework called OmiEmbed to capture a holistic and relatively precise profile of phenotype from high-dimensional omics data. The deep embedding module of OmiEmbed learnt an omics embedding that mapped multiple omics data types into a latent space with lower dimensionality. Based on the new representation of multi-omics data, different downstream networks of OmiEmbed were trained together with the multi-task strategy to predict the comprehensive phenotype profile of each sample.
Results:
We trained the model on two publicly available omics datasets to evaluate the performance of OmiEmbed. The OmiEmbed model achieved promising results for multiple downstream tasks including dimensionality reduction, tumour type classification, multi-omics integration, demographic and clinical feature reconstruction, and survival prediction. Instead of training and applying different downstream networks separately, the multi-task strategy combined them and conducted multiple tasks simultaneously and efficiently. The model achieved better performance with the multi-task strategy compared to training each task individually. OmiEmbed is a powerful tool to accurately capture comprehensive phenotypic information from high-dimensional omics data and has great potential to facilitate more accurate and personalised clinical decision making.
Availability:
Source code of OmiEmbed is available at https://github.com/zhangxiaoyu11/OmiEmbed
Contact: [email protected]
With the increasingly massive amount of omics data generated from emerging high-throughput technologies, the large-scale, cost-efficient and comprehensive analysis of biological molecules has become an everyday methodology for biomedical researchers (Hasin et al., 2017; Berger et al., 2013). The analysis and assessment of different types of omics data facilitate the integration of molecular features into the standard diagnostic procedure. For instance, in the 2016 World Health Organization (WHO) classification of central nervous system (CNS) tumours (Louis et al., 2016), an integrative method combining both histopathology and molecular information was recommended for the identification of multiple tumour entities. Nevertheless, most of the molecular features designed to aid diagnosis are manually selected biomarkers referring to specific genetic alterations, which neglects the genome-wide patterns correlated with disease prognosis and other phenotypic outcomes. In recent years, instead of focusing on the effect of specific molecular features, many researchers began to delve into the overall picture of genome-wide omics data, trying to obtain a deep understanding of diseases and uncover crucial diagnostic or prognostic information from it (Chaudhary et al., 2017; Capper et al., 2018; Zhang et al., 2019; Amodio et al., 2019).
It is challenging to analyse genome-wide high-dimensional omics data because of the mismatch between the number of molecular features and the number of samples. The dimensionality of genome-wide omics data is fairly high: for example, an RNA-Seq gene expression profile consists of more than 60,000 identifiers, and a HumanMethylation450 (450K) DNA methylation profile has more than 485,000 probes, while the number of available samples in an omics dataset is normally small due to the difficulty of patient recruitment and sample collection. This phenomenon is called "the curse of dimensionality" in machine learning, and it causes massive overfitting and makes samples hard to cluster (Goodfellow et al., 2016). In order to overcome this issue, the number of molecular features used in downstream tasks needs to be reduced significantly. Two common approaches are 1) to manually select a subset of the molecular features related to the downstream task based on domain knowledge; and 2) to apply traditional dimensionality reduction algorithms, e.g., principal component analysis (PCA).
Inspired by the significant success in fields like computer vision (Voulodimos et al., 2018) and natural language processing (Young et al., 2018), deep learning approaches have been applied to analyse the complicated and nonlinear relationships between molecular features of high-dimensional omics data and to detect genome-wide biological patterns from them (Ding et al., 2018; Lopez et al., 2018; Eraslan et al., 2019). With the deep learning mechanism, molecular features can be selected automatically during the training process without manual selection. Multiple downstream tasks have been conducted on different types of high-dimensional omics data, including dimensionality reduction (Ding et al., 2018; Way and Greene, 2018), disease type classification (Zhang et al., 2019; Ma and Zhang, 2019) and survival prediction (Chaudhary et al., 2017; Cheerla and Gevaert, 2019). However, to the best of our knowledge there is no unified deep learning method that is able to conduct all of the aforementioned downstream tasks simultaneously on any combination of omics types.
Here we proposed a unified multi-task deep learning framework for integrated multi-omics analysis named OmiEmbed. The OmiEmbed framework is comprised of two main components: a deep embedding module and a downstream task module. In the deep embedding module, high-dimensional multi-omics data was embedded into a low-dimensional latent space to tackle the challenge of the "curse of dimensionality". The learnt novel representation of each sample was then fed to multiple downstream networks, which were trained simultaneously with a joint loss function and the multi-task training strategy. Downstream tasks already implemented in OmiEmbed include tumour type classification, demographic and clinical feature (e.g., age, gender, race, primary site and disease stage of the sample) reconstruction and prognosis prediction (i.e., predicting the survival function of each input sample). The model was trained and evaluated on two publicly available omics datasets, the Genomic Data Commons (GDC) pan-cancer multi-omics dataset (Grossman et al., 2016) and the GSE109381 brain tumour methylation (BTM) dataset (Capper et al., 2018). Our model achieved promising results for all aforementioned downstream tasks and outperformed the corresponding baseline methods. With the multi-task training strategy, OmiEmbed was able to infer all downstream tasks simultaneously and efficiently. Better results were achieved in the multi-task scenario compared to training and inferring each downstream task separately.
The representation learning ability of deep neural networks has been widely verified by significant breakthroughs in computer vision and natural language processing. Inspired by this achievement, a number of deep learning approaches have been applied to high-dimensional omics data for different downstream tasks in recent years.
The most common downstream task is classification. Danaee et al. (2017) presented a cancer detection model that discriminated breast tumour samples from normal samples using gene expression data based on a stacked denoising autoencoder (SDAE). Lyu and Haque (2018) reshaped high-dimensional RNA-Seq data into images and applied a convolutional neural network (CNN) for tumour type classification on the GDC dataset, achieving an accuracy of 95.59%. Rhee et al. (2018) proposed a hybrid model comprised of a graph convolutional neural network (GCN) and a relation network (RN) for breast tumour subtype classification using gene expression data and protein-protein interaction (PPI) networks. Jurmeister et al. (2019) developed a multi-layer neural network to distinguish metastatic head and neck squamous cell carcinoma (HNSC) from primary squamous cell carcinoma of the lung (LUSC) with an accuracy of 96.4% in the validation cohort. AffinityNet (Ma and Zhang, 2019) was a data-efficient deep learning model that comprised multiple stacked k-nearest neighbours (KNN) attention pooling layers for tumour type classification. OmiVAE (Zhang et al., 2019) was an end-to-end deep learning method for tumour type classification based on the variational autoencoder (VAE), a deep generative model, achieving an accuracy of 97.49% among 33 tumour types and the normal control using gene expression and DNA methylation data from the GDC dataset.
Another typical task that has recently been tackled by deep learning approaches is the prediction of prognosis status from high-dimensional omics data. Chaudhary et al. (2017) applied a vanilla autoencoder (AE) to reduce the dimensionality of multi-omics data comprised of gene expression, DNA methylation and miRNA expression profiles, and used the learned representation to identify two different survival subgroups of liver tumour with the Cox proportional hazards model (CoxPH), K-means clustering and a support vector machine (SVM). In their experiment, a concordance index (C-index) of 0.68 was achieved on the liver tumour subjects from the GDC dataset. The deep learning model applied in this research was not an end-to-end model, and the embedding learned by the network was used separately outside the network for downstream tasks. Huang et al. (2019) implemented a deep learning network with the CoxPH model to predict prognosis for breast tumour using multi-omics data, cancer biomarkers and a gene co-expression network. The aforementioned research focused mostly on tumour samples of a specific primary site and neglected the information shared across different tumour types, which had the potential to improve the performance of survival prediction for each tumour type. Cheerla and Gevaert (2019) constructed a multimodal deep learning network to predict survival of subjects across 20 different tumour types in the GDC dataset, achieving an overall C-index of 0.78 based on additional clinical information and histopathology whole slide images (WSIs) besides the multi-omics data.
There have also been several attempts at applying deep learning methodology to multiple downstream tasks for high-dimensional omics data.
Amodio et al. (2019) presented a deep neural network method named SAUCIE to explore single-cell gene expression data and perform multiple data analysis tasks including clustering, batch correction, imputation and denoising, and visualisation. However, the backbone of SAUCIE was basically an autoencoder used for embedding learning, and most of the downstream tasks had to be conducted outside the network separately, hence the network was not able to perform all of the tasks simultaneously with a single training process. Deepathology (Azarkhalili et al., 2019) was another deep learning method for omics data analysis that adopted the idea of multi-task learning. This model encoded gene expression profiles into a low-dimensional latent vector to predict the tumour type and primary site of the input sample, obtaining an accuracy of 98.1% for primary site prediction and 95.2% for tumour type classification. In spite of the good results on multiple classification tasks, Deepathology was not able to perform the more complicated survival prediction task and did not adopt any state-of-the-art deep multi-task learning optimisation mechanism.
Two publicly available datasets were used as examples to demonstrate the ability of OmiEmbed: the Genomic Data Commons (GDC) pan-cancer multi-omics dataset (Grossman et al., 2016) and the DNA methylation dataset of human central nervous system tumours (GSE109381) (Capper et al., 2018).
The GDC pan-cancer dataset is one of the most comprehensive and widely used multi-omics datasets. It comprises high-dimensional omics data and corresponding phenotype data from two cancer genome programmes: The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET). The TARGET programme mainly focuses on pediatric cancers. Three types of omics data from the GDC dataset were used in our experiments, including RNA-Seq gene expression profiling, DNA methylation profiling and miRNA expression profiling. The dimensionalities of the three types of omics data are 60,483, 485,577 and 1,881 respectively. This dataset consists of 36 different types of tumour samples along with corresponding normal control samples, among which 33 tumour types are from TCGA and 3 tumour types are from TARGET. A wide range of phenotype features is also available in the GDC dataset, including demographics (e.g., age, gender and race), clinical sample information (e.g., primary site and disease stage of the sample) and survival information (recorded time of death or censoring).
The GSE109381 brain tumour methylation (BTM) dataset from the Gene Expression Omnibus (GEO) is one of the largest DNA methylation datasets specifically targeting brain tumours. We integrated both the reference set and the validation set of this dataset, and the whole dataset consists of 3,905 samples covering almost all WHO-defined central nervous system (CNS) tumour entities (Louis et al., 2016) and seven non-neoplastic control CNS regions. The genome-wide DNA methylation profile of each sample was generated using Infinium HumanMethylation450 BeadChip (450K) arrays, the same platform used for the GDC DNA methylation data. Each sample in this dataset has two types of diagnostic label: the histopathological class label defined by the latest 2016 WHO classification of CNS tumours (Louis et al., 2016) and the methylation class label defined by the original paper of this dataset (Capper et al., 2018). Other phenotypic information is also available in this dataset, including age, gender and the disease stage of each sample.
Raw data of the GSE109381 BTM dataset downloaded from GEO were first processed with the Bioconductor R package minfi (Aryee et al., 2014) to get the beta value of each CpG probe. The beta value is the ratio of the methylated signal intensity to the overall signal intensity, which indicates the methylation level of a specific CpG site. The DNA methylation profile generated by the 450K array has 485,577 probes in total. Certain probes were removed during the feature filtering step according to the following criteria: probes targeting the Y chromosome (n = 416), probes containing the dbSNP132Common single-nucleotide polymorphism (SNP) (n = 7,998), probes not mapping uniquely to the human reference genome (hg19) (one mismatch allowed) (n = 3,965), probes not included in the latest Infinium MethylationEPIC BeadChip (EPIC) array (n = 32,260), the SNP assay probes (n = 65), the non-CpG loci probes (n = 3,091) and probes with missing values (N/A) in more than 10% of samples (n = 2). We followed some of the criteria mentioned in the original paper of this dataset (Capper et al., 2018). In total, 46,746 probes were filtered out, which results in a final DNA methylation feature set of 438,831 CpG sites.
For the GDC pan-cancer dataset, the harmonised data of three omics types were downloaded from the UCSC Xena data portal with the original data dimensionality. Each RNA-Seq gene expression profile is comprised of 60,483 identifiers referring to corresponding genes. The expression level is quantified by the Fragments Per Kilobase of transcript per Million mapped reads (FPKM) value, which has been log-transformed. Feature filtering was applied to the gene expression data: genes targeting the Y chromosome (n = 594), genes with zero expression in all samples (n = 1,904) and genes with missing values (N/A) in more than 10% of samples (n = 248) were removed. In total, 2,440 genes were removed, leaving 58,043 molecular features for further analyses. As for the miRNA expression profiles, the expression level of each miRNA stem-loop identifier was measured by the log-transformed Reads Per Million mapped reads (RPM) value. All of the miRNA identifiers were kept in our experiments. For both the gene expression and miRNA expression profiles, the expression values were normalised to the range of 0 to 1 due to the input requirement of the OmiEmbed framework. The DNA methylation data in the GDC dataset were filtered based on the same criteria used for the BTM dataset. Remaining missing values in all datasets mentioned above were imputed with the mean of the corresponding molecular feature.
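As an illustration of the feature filtering, mean imputation and min-max normalisation described above, the following is a minimal pandas sketch for the gene expression data. The file names and the Y-chromosome gene list are hypothetical placeholders, not the exact pipeline released with OmiEmbed.

```python
import pandas as pd

# Hypothetical inputs: FPKM matrix (genes x samples) and a list of Y-chromosome gene IDs
expr = pd.read_csv("gdc_expression_fpkm.tsv", sep="\t", index_col=0)
y_chrom_genes = set(pd.read_csv("y_chromosome_genes.txt", header=None)[0])

# Feature filtering: Y-chromosome genes, all-zero genes, genes with >10% missing values
keep = (
    ~expr.index.isin(y_chrom_genes)
    & (expr.fillna(0).sum(axis=1) > 0)
    & (expr.isna().mean(axis=1) <= 0.10)
)
expr = expr.loc[keep]

# Impute remaining missing values with the per-feature (per-gene) mean
feature_means = expr.mean(axis=1)
expr = expr.T.fillna(feature_means).T

# Min-max normalise each feature to [0, 1], as required by the OmiEmbed input
rng = (expr.max(axis=1) - expr.min(axis=1)).replace(0, 1)
expr_norm = expr.sub(expr.min(axis=1), axis=0).div(rng, axis=0)
expr_norm.T.to_csv("expr_preprocessed.tsv", sep="\t")  # samples x features
```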
Fig. 1. The overall architecture of OmiEmbed, which is comprised of two main components: the VAE deep embedding network and the downstream task networks. The number of omics types and downstream tasks can be modified based on user needs and the requirements of the experiment. E_expr, E_methy and E_miRNA represent the encoders of gene expression, DNA methylation and miRNA expression respectively. Similarly, D_expr, D_methy and D_miRNA represent the decoders of gene expression, DNA methylation and miRNA expression.
OmiEmbed is a unified multi-omics multi-task end-to-end deep learning framework designed for reconstructing a panoramic view of phenotypic information from high-dimensional omics data. The overall architecture of OmiEmbed is comprised of a deep embedding module and one or multiple downstream task modules, as illustrated in Fig. 1.
The role of the deep embedding module in OmiEmbed is to embed high-dimensional multi-omics profiles into a low-dimensional latent space for the downstream task modules. The backbone network we used in the deep embedding module is the variational autoencoder (VAE) (Kingma and Welling, 2013). The VAE is a deep generative model that is also effective at capturing the data manifold of high-dimensional data. We assume each sample $x^{(i)} \in \mathbb{R}^d$ in the multi-omics dataset $D$ can be represented by and generated from a latent vector $z^{(i)} \in \mathbb{R}^p$, where $p \ll d$. In the generation process, each latent vector is first sampled from a prior distribution $p_\theta(z)$, and then the multi-omics data of each sample is generated from the conditional distribution $p_\theta(x \mid z)$, where $\theta$ is the set of learnable parameters of the decoder. In order to address the intractability of the true posterior $p_\theta(z \mid x)$, the variational distribution $q_\phi(z \mid x)$ is introduced to approximate $p_\theta(z \mid x)$, where $\phi$ is the set of learnable parameters of the encoder. As a result, the VAE network is optimised by maximising the variational lower bound formulated as:

$$\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right) \quad (1)$$

where $D_{KL}$ is the Kullback-Leibler (KL) divergence between two probability distributions (Goodfellow et al., 2016).
We applied the VAE framework in our deep embedding module to obtain the low-dimensional latent vector that represents the original high-dimensional omics data in the downstream task modules. For each type of omics data, the input profiles were first encoded into corresponding vectors by type-specific encoders. These vectors of different omics types were then concatenated in the subsequent hidden layer and encoded into one multi-omics vector. Following the VAE formulation, the multi-omics vector was connected to two bottleneck layers to obtain the mean vector $\mu$ and the standard deviation vector $\sigma$. These two vectors define the Gaussian distribution $\mathcal{N}(\mu, \sigma)$ of the latent variable $z$ given the input sample $x$, which is the variational distribution $q_\phi(z \mid x)$. Since sampling $z$ from the learned distribution is not differentiable and therefore not suitable for backpropagation, the reparameterisation trick is applied as follows:

$$z = \mu + \sigma \odot \epsilon \quad (2)$$

where $\epsilon$ is a random variable sampled from the unit normal distribution $\mathcal{N}(0, I)$. The latent variable $z$ was then fed to decoders with a symmetrical network structure to get the reconstructed multi-omics data $x'$. We provided two types of detailed network structure for the encoders and decoders in the deep embedding module: the one-dimensional convolutional neural network (CNN) and the fully-connected neural network (FC). Both network types share the same overall architecture, and other state-of-the-art embedding networks can easily be added to OmiEmbed for further improvement. With the deep embedding module we can obtain the low-dimensional representation of the input omics data. This new representation can directly replace the original omics data as the input of any downstream task.
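To make the deep embedding module concrete, the following is a minimal PyTorch sketch of a two-omics VAE with the reparameterisation trick of equation (2). Layer sizes, class and variable names, and the two-omics restriction are illustrative assumptions rather than the released OmiEmbed implementation, which supports configurable omics types and both FC-type and CNN-type encoders.

```python
import torch
import torch.nn as nn

class MultiOmicsVAE(nn.Module):
    """Minimal sketch: two omics-specific encoders, a shared latent space and symmetric decoders."""
    def __init__(self, dim_expr, dim_methy, hidden=512, latent=128):
        super().__init__()
        # Omics-specific encoders (FC-type); dimensions are illustrative
        self.enc_expr = nn.Sequential(nn.Linear(dim_expr, hidden), nn.ReLU())
        self.enc_methy = nn.Sequential(nn.Linear(dim_methy, hidden), nn.ReLU())
        # Two bottleneck layers giving the mean and log-variance of q(z|x)
        self.fc_mu = nn.Linear(2 * hidden, latent)
        self.fc_logvar = nn.Linear(2 * hidden, latent)
        # Symmetric decoders reconstructing each omics type from z
        self.dec_expr = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                      nn.Linear(hidden, dim_expr), nn.Sigmoid())
        self.dec_methy = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                       nn.Linear(hidden, dim_methy), nn.Sigmoid())

    def reparameterise(self, mu, logvar):
        # z = mu + sigma * epsilon, with epsilon ~ N(0, I)  (equation 2)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def forward(self, x_expr, x_methy):
        h = torch.cat([self.enc_expr(x_expr), self.enc_methy(x_methy)], dim=1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterise(mu, logvar)
        return self.dec_expr(z), self.dec_methy(z), mu, logvar, z
```

The latent vector z returned by the forward pass is what the downstream task networks consume.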
For instance, when the latent dimension is set to 2 or 3, the new representation can be used for visualisation purposes. Nevertheless, we can also attach one or multiple downstream task networks to the bottleneck layer of the deep embedding module to get an end-to-end multi-task model, which is able to guide the embedding step with objectives and to share information among different tasks.
Three main types of end-to-end downstream tasks are provided in the OmiEmbed framework: classification tasks, regression tasks and the survival prediction task. Each downstream task falling into one of these categories can be trained individually or together with other downstream tasks using the coordinated multi-task strategy that we will discuss in later sections. A multi-layer fully-connected network was applied to the classification-type downstream tasks, including diagnostic tasks such as tumour type classification, primary site prediction and disease stage prediction (normal control, primary tumour, recurrent tumour or metastatic tumour), and demographic tasks, e.g., the prediction of gender and race. The output dimension of a classification downstream network was set to the number of classes. For the regression task, a similar network was attached to the deep embedding module, but only one neuron was kept in the output layer to predict the target scalar value (e.g., the age of the subject). The survival prediction downstream network is more complicated and is discussed in a subsequent section. The downstream networks add further regularisation to the low-dimensional latent representation and urge the deep embedding module to learn representations related to the corresponding downstream tasks. With the downstream modules, a single well-trained multi-task OmiEmbed network is able to reconstruct a comprehensive phenotype profile including diagnostic, prognostic and demographic information from omics data.
Like the overall structure, the joint loss function is also comprised of two main components: the loss of the deep embedding and the loss of the downstream tasks. We denote each type of input omics profile as $x_j$ and the corresponding reconstructed profile as $x'_j$, where $j$ is the omics type index and there are $M$ omics types in total. The deep embedding loss can then be defined as:

$$L_{embed} = \frac{1}{M} \sum_{j=1}^{M} \mathrm{BCE}\left(x_j, x'_j\right) + D_{KL}\left(\mathcal{N}(\mu, \sigma) \,\|\, \mathcal{N}(0, I)\right) \quad (3)$$

where BCE is the binary cross-entropy measuring the distance between the input data and the reconstructed data, and the second term is the KL divergence between the learned distribution and a unit Gaussian distribution.
In the downstream modules, each downstream task has its specific loss function $L_{down_k}$ and a corresponding weight $w_k$. For a classification-type task, the loss function is defined as:

$$L_{classification} = \mathrm{CE}(y, y') \quad (4)$$

where $y$ is the ground truth, $y'$ is the predicted label and CE is the cross-entropy loss. Similarly, the loss function of a regression-type task is

$$L_{regression} = \mathrm{MSE}(y, y') \quad (5)$$

where MSE is the mean squared error between the real value and the predicted value. The loss function of the survival prediction task is discussed in the next section. The overall loss function of the downstream modules is the weighted sum of all downstream losses, i.e.,

$$L_{down} = \frac{1}{K} \sum_{k=1}^{K} w_k L_{down_k} \quad (6)$$

where $K$ is the number of downstream tasks, $L_{down_k}$ is the loss of a certain task and $w_k$ is the corresponding weight. $w_k$ can be manually set as a hyperparameter or treated as a learnable parameter during the training process. In conclusion, the joint loss function of the end-to-end OmiEmbed network is

$$L_{total} = \lambda L_{embed} + L_{down} \quad (7)$$

where $\lambda$ balances the two terms in the overall loss function.
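A minimal sketch of how the joint loss of equations (3)-(7) could be assembled in PyTorch is given below, for one omics type and two downstream tasks (one classification, one regression). The function names, the two-task setup and the default weights are illustrative assumptions, not the exact OmiEmbed code.

```python
import torch
import torch.nn.functional as F

def embedding_loss(x, x_recon, mu, logvar):
    # Equation (3): reconstruction (binary cross-entropy) + KL divergence to N(0, I)
    bce = F.binary_cross_entropy(x_recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kl

def joint_loss(x, x_recon, mu, logvar, class_logits, y_class, age_pred, y_age,
               task_weights=(1.0, 1.0), lam=0.01):
    # Downstream losses: cross-entropy for classification (eq. 4), MSE for regression (eq. 5)
    l_cls = F.cross_entropy(class_logits, y_class)
    l_reg = F.mse_loss(age_pred.squeeze(-1), y_age)
    # Weighted mean over the downstream tasks (eq. 6)
    l_down = (task_weights[0] * l_cls + task_weights[1] * l_reg) / 2
    # Total joint loss (eq. 7): lambda * embedding loss + downstream loss
    return lam * embedding_loss(x, x_recon, mu, logvar) + l_down
```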
Based on the aforementioned loss functions, three training phases were designed in OmiEmbed. Phase 1 was the unsupervised phase that only involved the deep embedding module. In this training phase, only the deep embedding loss was backpropagated and only the parameters of the deep embedding network were updated based on the gradients. No label was required in the first training phase, and this phase can be used on its own as a dimensionality reduction or visualisation method. In phase 2, the pre-trained embedding network was fixed whilst the downstream networks were being trained: the joint downstream loss was backpropagated and only the downstream networks were updated during this phase. After the embedding network and the downstream networks had been pre-trained separately, the overall loss function defined in equation (7) was calculated and backpropagated during phase 3. In this final training phase the whole OmiEmbed network, including the deep embedding module and the downstream modules, was fine-tuned to obtain better performance.
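A schematic sketch of this three-phase schedule is shown below. Here embed_net, down_nets, loader and loss_fn are assumed user-supplied objects (the model, a container of downstream heads, a data loader and a callable returning the phase-appropriate loss); the released OmiEmbed code drives the same schedule through its own command line options.

```python
import torch

def set_requires_grad(module, flag):
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def run_phase(phase, embed_net, down_nets, loader, loss_fn, epochs, lr=1e-4):
    # Phase 1: train the embedding only; phase 2: downstream heads only; phase 3: fine-tune all
    set_requires_grad(embed_net, phase in (1, 3))
    set_requires_grad(down_nets, phase in (2, 3))
    params = [p for p in list(embed_net.parameters()) + list(down_nets.parameters())
              if p.requires_grad]
    optimiser = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for batch in loader:
            optimiser.zero_grad()
            # loss_fn returns the embedding loss, downstream loss or joint loss depending on the phase
            loss = loss_fn(phase, embed_net, down_nets, batch)
            loss.backward()
            optimiser.step()

# Usage sketch: run_phase(1, ...), then run_phase(2, ...), then run_phase(3, ...)
```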
Survival prediction is the most complicated downstream task implemented in OmiEmbed. The objective of this task is to obtain the personalised survival function and hazard function of a subject based on the high-dimensional omics data. With the survival prediction task, OmiEmbed is able to reconstruct a more comprehensive phenotype profile of each subject from omics data and potentially facilitate omics-based precision medicine. The survival function can be denoted by

$$S(t) = P[T > t] \quad (8)$$

where $T$ is the time elapsed between the sample collection time and the time of the event occurring. The survival function signifies the probability that the failure event, i.e., death, has not occurred by time $t$. The hazard function is defined as:

$$h(t) = \lim_{dt \to 0} \frac{P[t \le T < t + dt \mid T \ge t]}{dt} \quad (9)$$

which represents the instantaneous rate of occurrence of the failure event. Thus a large hazard value indicates a great risk of death, which is a crucial signal for clinicians.
In order to train a survival prediction downstream network, besides the omics data $X$, two elements of the survival labels are required: the event time $T$ and the event indicator $E$. The indicator was set to 1 when the failure event was observed during the study and to 0 when the event was not observed, which is termed censoring. In the case of censoring, the event time $T$ is the time elapsed between the time when the sample was collected and the time of the last contact with the subject. Both $T$ and $E$ are available in the GDC dataset.
The multi-task logistic regression (MTLR) model (Yu et al., 2011) was applied and adapted to the OmiEmbed framework for the survival prediction downstream task. In the first step, the time axis was divided into $m$ time intervals $\{l_i\}_{i=1}^{m}$. Each time interval was defined as $l_i = [t_{i-1}, t_i)$, where $t_0 = 0$ and $t_m \ge \max(T)$. The number of time intervals $m$ is a hyperparameter: a larger $m$ results in a more fine-grained output but requires more computational resources. We applied a multi-layer fully-connected network as the backbone of our survival prediction network, and the dimension of the output layer is the number of time intervals. As a result, the output of the survival prediction network is an $m$-dimensional vector $y' = [y'_1, y'_2, \ldots, y'_m]$. Similarly, the survival label of each subject was also encoded into an $m$-dimensional vector $y = [y_1, y_2, \ldots, y_m]$, where $y_i$ signifies the survival status of the subject at time point $t_i$. The likelihood of observing $y$ given sample $x$ with the network parameters $\theta$ can be defined as follows:

$$P_\theta(y \mid x) = \frac{\exp\left(\sum_{i=1}^{m} y_i y'_i\right)}{\sum_{j=0}^{m} \exp\left(\sum_{i=j+1}^{m} y'_i\right)} \quad (10)$$

The objective of the survival network is to find a set of parameters $\theta$ that maximises the log likelihood, hence the loss function of the survival prediction downstream task is defined as

$$L_{survival} = -\sum_{i=1}^{m} y_i y'_i + \log \sum_{j=0}^{m} \exp\left(\sum_{i=j+1}^{m} y'_i\right) \quad (11)$$

which can be directly applied to the downstream module and included in the joint loss function of OmiEmbed.
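A sketch of the survival loss of equation (11) in PyTorch is given below. It assumes the m-dimensional label vector y has already been encoded (how censored subjects are encoded, which the full MTLR formulation handles by marginalising over possible death times, is not shown); the function name and shapes are illustrative, not the exact OmiEmbed code.

```python
import torch

def mtlr_survival_loss(pred, target):
    """MTLR-style loss of equation (11).
    pred:   (batch, m) network outputs y'
    target: (batch, m) encoded survival labels y
    """
    # Numerator of eq. (10): sum_i y_i * y'_i
    num = (pred * target).sum(dim=1)
    # Partial sums s_j = sum_{i=j+1}^{m} y'_i for j = 0..m (s_m = 0), via a reversed cumulative sum
    rev_cumsum = torch.flip(torch.cumsum(torch.flip(pred, dims=[1]), dim=1), dims=[1])
    zeros = pred.new_zeros(pred.size(0), 1)
    partial = torch.cat([rev_cumsum, zeros], dim=1)          # shape (batch, m + 1)
    # log of the denominator of eq. (10), computed stably with logsumexp
    log_denom = torch.logsumexp(partial, dim=1)
    return (log_denom - num).mean()
```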
Table 1. The classification performance on the BTM dataset using the histopathological tumour type labels

            Accuracy  Precision  Recall   F1
PCA+SVM     0.8477    0.8268     0.8477   0.8284
KPCA+SVM    0.8452    0.8245     0.8452   0.8261
UMAP+SVM    0.6684    0.5497     0.6684   0.5845
OmiEmbed
Table 2. The classification performance on the BTM dataset using the methylation tumour type labels

            Accuracy  Precision  Recall   F1
PCA+SVM     0.9360    0.9385     0.9360   0.9335
KPCA+SVM    0.9347    0.9352     0.9347   0.9313
UMAP+SVM    0.6389    0.5228     0.6389   0.5603
OmiEmbed
With the joint loss function (6) of the multi-task downstream modules, we aimed to train multiple downstream networks in OmiEmbed simultaneously and efficiently instead of training them separately, so as to obtain a unified network that can reconstruct a comprehensive phenotype profile of the subject. In order to balance the optimisation of the different tasks, we adapted the multi-task optimisation method gradient normalisation (GradNorm) (Chen et al., 2018) to our OmiEmbed framework.
In equation (6), $w_k$ is the weight of each downstream loss, and this weight can also be regarded as a trainable parameter that varies at each training iteration. The idea of GradNorm is to penalise the network if the gradients of any downstream task are too large or too small, which makes all tasks train at similar rates (Chen et al., 2018). First, the gradient norm of each downstream task is calculated as

$$G^{(k)}_\theta = \left\| \nabla_\theta \, w_k L_{down_k} \right\| \quad (12)$$

where $\theta$ denotes the parameters of the last encoding layer in the deep embedding module of OmiEmbed. The average gradient norm over all tasks is then

$$\bar{G}_\theta = \frac{1}{K} \sum_{k=1}^{K} G^{(k)}_\theta \quad (13)$$

where $K$ is the number of downstream tasks. The relative inverse training rate of each task is defined as

$$r_k = \frac{\tilde{L}_{down_k}}{\frac{1}{K} \sum_{k=1}^{K} \tilde{L}_{down_k}} \quad (14)$$

where $\tilde{L}_{down_k} = L_{down_k}^{(t)} / L_{down_k}^{(0)}$ is the ratio of the current loss to the initial loss of downstream task $k$. The GradNorm loss is then defined as

$$L_{grad} = \sum_{k=1}^{K} \left| G^{(k)}_\theta - \bar{G}_\theta \times r_k^{\alpha} \right| \quad (15)$$

where $\alpha$ is a hyperparameter that represents the strength of pulling tasks back to a common training rate. A separate backpropagation process was conducted on $L_{grad}$ during each training iteration, and it was only used to update $w_k$.
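A condensed sketch of one GradNorm-style weight update following equations (12)-(15) is shown below. The function and variable names, learning rate and surrounding training loop are assumptions; the full method in Chen et al. (2018) additionally renormalises the weights after each step so that they sum to K.

```python
import torch

def gradnorm_step(task_losses, init_losses, weights, shared_params, alpha=1.5, lr=0.025):
    """One GradNorm update of the task weights w_k (eqs. 12-15).
    task_losses:   list of current per-task losses (scalar tensors with graph attached)
    init_losses:   list of loss values recorded at the first iteration (floats)
    weights:       torch.nn.Parameter vector holding the task weights w_k
    shared_params: list of parameters of the last shared encoding layer (theta)
    """
    # Gradient norm of each weighted task loss w.r.t. the shared parameters (eq. 12)
    norms = []
    for w, loss in zip(weights, task_losses):
        grads = torch.autograd.grad(w * loss, shared_params,
                                    retain_graph=True, create_graph=True)
        norms.append(torch.norm(torch.cat([g.flatten() for g in grads])))
    norms = torch.stack(norms)
    mean_norm = norms.mean().detach()                                  # eq. (13)
    # Relative inverse training rates (eq. 14), treated as constants
    loss_ratios = torch.stack([l.detach() / l0 for l, l0 in zip(task_losses, init_losses)])
    r = loss_ratios / loss_ratios.mean()
    # GradNorm loss (eq. 15) and a separate update that only touches the weights
    target = (mean_norm * r.pow(alpha)).detach()
    l_grad = (norms - target).abs().sum()
    grad_w = torch.autograd.grad(l_grad, weights)[0]
    with torch.no_grad():
        weights -= lr * grad_w
```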
Fig. 2. The 128D latent space of the BTM dataset learned by the unsupervised phase of OmiEmbed. The scatter graph was visualised using t-SNE. Each sample in the scatter graph was coloured by its methylation class label. Tumour types belonging to the same upper-level class were marked in similar colours.
The OmiEmbed multi-omics multi-task framework was built on the deep learning library PyTorch (Paszke et al., 2019). The code of OmiEmbed has been made open-source on GitHub, and it is convenient to use on any high-dimensional omics dataset for multiple downstream tasks with a set of predefined command line arguments. In the FC-type omics embedding network, CpG sites of DNA methylation profiles were separately connected to different hidden layers based on their targeting chromosomes in order to reduce the number of parameters, prevent overfitting and save GPU memory. This chromosome separation step is processed automatically in OmiEmbed with a built-in DNA methylation annotation if the FC-type embedding is selected. Other deep learning techniques were also applied to prevent overfitting in OmiEmbed, including dropout (Srivastava et al., 2014), batch normalisation (Ioffe and Szegedy, 2015), weight decay regularisation and a learning rate schedule. The model was trained on two NVIDIA Titan X GPUs with 12 gigabytes of memory each. The input dataset of each experiment was randomly separated into training, validation and testing sets (70%, 10% and 20% respectively). The separation was conducted in a stratified manner so as to keep the proportion of each class the same in all three sets. The validation set was used for hyperparameter searching, and the reported performance was evaluated separately on the testing set.

Table 3. The performance of tumour type classification on the GDC multi-omics dataset with different omics type combinations

                       Accuracy  Precision  Recall   F1
Gene expression (a)    0.9680    0.9686     0.9680   0.9679
DNA methylation (b)    0.9659    0.9653     0.9659   0.9650
miRNA expression (c)   0.9618    0.9619     0.9618   0.9609
Multi-omics (a+b)      0.9752    0.9762     0.9752   0.9750
Multi-omics (a+b+c)
OmiEmbed can be regarded as an unsupervised dimensionality reduction method if only training phase 1, described in section 3.4, is applied in the experiment. The high-dimensional multi-omics data can be compressed into a new representation whose target dimensionality is set by a hyperparameter of OmiEmbed. The output file can then be used directly for visualisation or any other downstream task. Here we reduced each sample in the BTM dataset to a 128D latent vector using the unsupervised phase 1 of OmiEmbed. The learned latent space of the BTM dataset was visualised by t-distributed stochastic neighbour embedding (t-SNE) (Van der Maaten and Hinton, 2008), as shown in Figure 2. Samples of different brain tumour types automatically clustered together in the latent space, and tumour types belonging to the same upper-level class (e.g. glioblastoma, embryonal tumour, and ependymal tumour) also formed corresponding upper-level clusters.
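Visualising a saved latent matrix with t-SNE, as in Figure 2, could look like the following sketch. The file names, the label column and the assumption that the latent and label tables share the same sample order are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical outputs: 128D latent vectors (samples x 128) and methylation class labels
latent = pd.read_csv("btm_latent_128d.tsv", sep="\t", index_col=0)
labels = pd.read_csv("btm_labels.tsv", sep="\t", index_col=0)["methylation_class"]

# Project the 128D embedding to 2D with t-SNE for plotting
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(latent.values)

fig, ax = plt.subplots(figsize=(8, 8))
for cls in labels.unique():
    mask = (labels == cls).values          # rows assumed to align with `coords`
    ax.scatter(coords[mask, 0], coords[mask, 1], s=5, label=cls)
ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
fig.savefig("btm_latent_tsne.png", dpi=300)
```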
Instead of using training phase 1 individually as a dimensionality reduction method and separately training the downstream task with other machine learning algorithms, using all three training phases of OmiEmbed in an end-to-end manner is more efficient and yields better results. Here we first tested the classification performance of OmiEmbed on the BTM dataset. There are two tumour type classification systems for this brain tumour dataset: the histopathological tumour type labels defined by the 2016 WHO classification (Louis et al., 2016) and the methylation tumour type labels defined by the original paper of this dataset (Capper et al., 2018). For each of these two classification systems, the 3,905 samples were divided into more than 90 classes including different normal control types.
The classification performance of OmiEmbed on the BTM dataset with the histopathological labels and the methylation labels is shown in Table 1 and Table 2 respectively, measured by four multi-class classification metrics: overall accuracy, weighted precision, weighted recall and weighted F1 score. The results were also compared with the combination of three dimensionality reduction methods and the support vector machine (SVM) classification algorithm. The three dimensionality reduction methods used in the experiment were principal component analysis (PCA), kernel principal component analysis (KPCA) (Schölkopf et al., 1997) and uniform manifold approximation and projection (UMAP) (McInnes et al., 2018). The original data from the BTM dataset were first reduced to 128D by each of these dimensionality reduction methods and then classified by an SVM with a radial basis function (RBF) kernel. As shown in Table 1 and Table 2, OmiEmbed achieved the best classification performance in all four metrics with both classification systems.
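A rough scikit-learn sketch of the baseline pipelines compared against OmiEmbed is given below; the file names, data split and hyperparameters are illustrative assumptions rather than the exact settings used in these experiments. UMAP (from the umap-learn package) can be slotted into the same pipeline in place of PCA or KPCA.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical inputs: samples x features beta-value matrix and tumour type labels
X = np.load("btm_beta.npy")
y = np.load("btm_labels.npy", allow_pickle=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

baselines = {
    "PCA+SVM": make_pipeline(PCA(n_components=128), SVC(kernel="rbf")),
    "KPCA+SVM": make_pipeline(KernelPCA(n_components=128, kernel="rbf"), SVC(kernel="rbf")),
}
for name, model in baselines.items():
    model.fit(X_tr, y_tr)                       # reduce to 128D, then classify with an RBF SVM
    pred = model.predict(X_te)
    print(name, accuracy_score(y_te, pred), f1_score(y_te, pred, average="weighted"))
```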
Table 4. Detailed information for the categorical features predicted by OmiEmbed in the GDC dataset

               Number of classes   Label examples
Tumour type    37                  BRCA, UCEC, KIRC, LGG, LUAD, THCA, HNSC, LUSC, PRAD, Normal control, etc.
Disease stage  7                   Primary tumour, Metastatic tumour, Recurrent tumor, Normal control, etc.
Primary site   29                  Breast, Kidney, Lung, Brain, Colorectal, Uterus, Thyroid, Prostate, etc.
Race           6                   White, Black or African American, Asian, American Indian or Alaska native, etc.
Gender         2                   Male, Female

Table 5. The classification performance of using OmiEmbed to predict the four categorical phenotype features in the GDC dataset

               Accuracy  Precision  Recall   F1
Disease stage  0.9809    0.9773     0.9809   0.9790
Primary site   0.9843    0.9847     0.9843   0.9843
Race           0.9756    0.9719     0.9756   0.9735
Gender         0.9594    0.9595     0.9594   0.9594

Table 6. The regression performance of age prediction on the GDC dataset using different methods

            RMSE     Mean absolute error  Median absolute error
PCA+SVR     12.3285  9.8544               8.1017
KPCA+SVR    12.3269  9.8535               8.0375
UMAP+SVR    12.9859  10.3932              8.8130
OmiEmbed

Table 7. The performance of survival prediction on the GDC dataset using different methods

            Concordance index
PCA+CoxPH   0.7291
KPCA+CoxPH  0.7307
UMAP+CoxPH  0.7169
OmiEmbed
Table 8. Multi-task performance of OmiEmbed with three diverse downstream tasks

                   Survival   Tumour type         Age
                   C-index    Accuracy   F1       RMSE
Multi-task
Single task alone  0.7809     0.9680     0.9679   10.7982
PCA baseline       0.7291     0.9444     0.9391   12.3285
Different types of omics profiles can be integrated into a single latent representation and used for different downstream tasks through the multi-omics deep embedding module of OmiEmbed. In order to test the effect of multi-omics integration on a downstream task, we trained a tumour type classifier using OmiEmbed on the GDC multi-omics dataset. We applied three omics types from the GDC dataset, gene expression, DNA methylation and miRNA expression; the dataset contains 33 tumour types and a normal control class (34 classes in total). We trained the model with each omics type alone and with two different multi-omics combinations. The classification performance in each scenario is shown in Table 3. The performance metrics for each omics type alone were close to each other, and the best metrics were achieved with the combination of all three omics types. This result also indicates that combining multiple omics data types can yield better insights into the underlying mechanisms of diseases.
With both the classification and regression downstream networks built into OmiEmbed, we were able to reconstruct a number of phenotype features from high-dimensional omics data. Here we tested the prediction performance of five phenotype features in the GDC dataset: age, gender, race, and the disease stage and primary site of the clinical sample. Detailed information on each categorical feature is listed in Table 4. The disease stage is the clinical type of the sample, which consists of primary tumour, metastatic tumour, recurrent tumour and normal control tissue. The primary site is the place where the cancer starts growing; samples in the GDC dataset come from 28 different primary sites such as breast, kidney, lung and skin. The race of the subjects in the GDC dataset was divided into six main categories: white, Asian, black or African American, American Indian or Alaska native, native Hawaiian or other Pacific islander, and others. As for the gender of each sample, since the molecular features targeting the Y chromosome were filtered out in the preprocessing stage, the model was required to classify gender based on other molecular features. The classification performance for the four categorical phenotype features is shown in Table 5.
Since the age label is numerical instead of categorical, the regression downstream network in OmiEmbed was applied for the age prediction task. The performance of age prediction was evaluated by three regression metrics: root mean square error (RMSE), mean absolute error and median absolute error, as shown in Table 6. The results were compared with the combination of the aforementioned dimensionality reduction methods and the support vector regression (SVR) algorithm with the RBF kernel. OmiEmbed achieved the best performance with the lowest error in all three metrics.
With the survival prediction downstream network of OmiEmbed described in section 3.5, we are able to predict the survival function of each subject from the corresponding high-dimensional omics data. The performance of the survival prediction downstream task was evaluated by the concordance index (C-index),
which is one of the most commonly used metrics for survival prediction. A C-index of 1 indicates a perfect prediction model, while a value of 0.5 signifies that the model performs no better than random. The performance was compared with baseline methods that first reduced the dimensionality of the input omics data to 128D using PCA, KPCA or UMAP and then fed the 128D data to the standard survival prediction method, the Cox proportional hazards model (CoxPH). OmiEmbed achieved the best C-index (0.78) among these methods, as shown in Table 7.
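The C-index used above can be computed, for example, with the lifelines package; the input arrays below are placeholders, and the risk scores stand in for whatever summary of the predicted survival function is used for ranking.

```python
import numpy as np
from lifelines.utils import concordance_index

# Placeholder inputs: follow-up times, event indicators (1 = death observed, 0 = censored)
# and a per-subject risk score derived from the predicted survival function
durations = np.array([410.0, 1320.0, 250.0, 980.0])
events = np.array([1, 0, 1, 1])
risk_scores = np.array([0.9, 0.2, 0.7, 0.5])

# concordance_index expects scores that increase with survival time,
# so a risk score is negated before being passed in
cindex = concordance_index(durations, -risk_scores, event_observed=events)
print(f"C-index: {cindex:.4f}")
```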
Instead of training each of the aforementioned downstream networks separately, we can also train multiple downstream networks together using the multi-task training strategy described in section 3.6. With the multi-task strategy, OmiEmbed is able to perform diverse downstream tasks simultaneously and reconstruct a comprehensive phenotype profile of each subject from high-dimensional omics data using one unified network in a single forward propagation. In order to test the multi-task performance of OmiEmbed, we selected three typical downstream tasks belonging to three distinct categories: the survival prediction task, the tumour type classification task and the age regression task. The three downstream networks, along with the deep embedding network, were trained together using the joint loss function of equation (7) and the GradNorm loss function of equation (15). As shown in Table 8, the performance was higher on all three downstream tasks when they were trained in a unified multi-task OmiEmbed network compared with being trained separately.
OmiEmbed is an open-source deep learning framework designed for multiple omics data analysis tasks including dimensionality reduction, multi-omics data integration, tumour type classification, disease stage prediction, demographic label reconstruction and prognosis prediction. All of the aforementioned tasks can be performed simultaneously and efficiently by a unified OmiEmbed multi-task network. We achieved promising performance for each downstream task and outperformed state-of-the-art methods using the multi-task training strategy. Our results also indicate that OmiEmbed is able to reconstruct a comprehensive clinical and phenotypic profile from the multi-omics data of each subject, which has great potential to facilitate more accurate and personalised clinical decision making. Since the code of OmiEmbed is publicly available and the unified framework is applicable to multiple omics types and downstream tasks, we believe OmiEmbed will become a useful tool for other researchers analysing high-dimensional omics data with deep learning methodology.
Funding
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement 764281.
References
Amodio, M. et al. (2019). Exploring single-cell data with deep multitasking neural networks. Nature Methods, 1139–1145.
Aryee, M. J. et al. (2014). Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics, (10), 1363–1369.
Azarkhalili, B. et al. (2019). Deepathology: deep multi-task learning for inferring molecular pathology from cancer transcriptome. Scientific Reports.
Berger, B. et al. (2013). Computational solutions for omics data. Nature Reviews Genetics, 333–346.
Capper, D. et al. (2018). DNA methylation-based classification of central nervous system tumours. Nature, (7697), 469–474.
Chaudhary, K. et al. (2017). Deep learning-based multi-omics integration robustly predicts survival in liver cancer. Clinical Cancer Research, 1248–1259.
Cheerla, A. and Gevaert, O. (2019). Deep learning with multimodal representation for pancancer prognosis prediction. Bioinformatics, i446–i454.
Chen, Z. et al. (2018). GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pages 794–803. PMLR.
Danaee, P. et al. (2017). A deep learning approach for cancer detection and relevant gene identification. Pacific Symposium on Biocomputing, 219–229.
Ding, J. et al. (2018). Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nature Communications.
Eraslan, G. et al. (2019). Single-cell RNA-seq denoising using a deep count autoencoder. Nature Communications.
Goodfellow, I. et al. (2016). Deep Learning. MIT Press.
Grossman, R. L. et al. (2016). Toward a shared vision for cancer genomic data. New England Journal of Medicine, (12), 1109–1112.
Hasin, Y. et al. (2017). Multi-omics approaches to disease. Genome Biology.
Huang, Z. et al. (2019). SALMON: survival analysis learning with multi-omics neural networks on breast cancer. Frontiers in Genetics.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.
Jurmeister, P. et al. (2019). Machine learning analysis of DNA methylation profiles distinguishes primary lung squamous cell carcinomas from head and neck metastases. Science Translational Medicine.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Lopez, R. et al. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 1053–1058.
Louis, D. N. et al. (2016). The 2016 World Health Organization classification of tumors of the central nervous system: a summary. Acta Neuropathologica, (6), 803–820.
Lyu, B. and Haque, A. (2018). Deep learning based tumor type classification using gene expression data. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics.
Ma, T. and Zhang, A. (2019). AffinityNet: semi-supervised few-shot learning for disease type prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1069–1076.
McInnes, L. et al. (2018). UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
Paszke, A. et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In NeurIPS.
Rhee, S. et al. (2018). Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification. In IJCAI.
Schölkopf, B. et al. (1997). Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer.
Srivastava, N. et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, (1), 1929–1958.
Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, (11).
Voulodimos, A. et al. (2018). Deep learning for computer vision: a brief review. Computational Intelligence and Neuroscience.
Way, G. P. and Greene, C. (2018). Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pacific Symposium on Biocomputing, 80–91.
Weinstein, J. N. et al. (2013). The Cancer Genome Atlas pan-cancer analysis project. Nature Genetics, (10), 1113.
Young, T. et al. (2018). Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine, 55–75.
Yu, C.-N. et al. (2011). Learning patient-specific cancer survival distributions as a sequence of dependent regressors. Advances in Neural Information Processing Systems, 1845–1853.
Zhang, X. et al. (2019). Integrated multi-omics analysis using variational autoencoders: application to pan-cancer classification. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).