Computational methods for cancer driver discovery: A survey

Abstract

Motivation: Uncovering the genomic causes of cancer, known as cancer driver genes, is a fundamental task in biomedical research. Cancer driver genes drive the development and progression of cancer; identifying them and their regulatory mechanism is therefore crucial to the design of cancer treatment and intervention. Many computational methods, which take advantage of computer science and data science, have been developed to utilise multiple types of genomic data to reveal cancer drivers and their regulatory mechanism behind cancer development and progression. Due to the complexity of the mechanisms by which cancer genes drive cancer and the fast development of the field, a comprehensive review of the current computational methods for discovering different types of cancer drivers is needed.

Results: We survey computational methods for identifying cancer drivers from genomic data. We categorise the methods into three groups: methods for single driver identification, methods for driver module identification, and methods for identifying personalised cancer drivers. We also conduct a case study to compare the performance of the current methods. We further analyse the advantages and limitations of the current methods, and discuss the challenges and future directions of the topic. In addition, we investigate the resources for discovering and validating cancer drivers in order to provide a one-stop reference of the tools to facilitate cancer driver discovery. The ultimate goal of the paper is to help those interested in the topic to establish a solid background to carry out further research in the field.


Briefings in Bioinformatics, doi.10.1093/bioinformatics/xxxxxx, Advance Access Publication Date: Day Month Year, Review Article


Vu Viet Hoang Pham, Lin Liu, Cameron Bracken (2, 3), Gregory Goodall (2, 3), Jiuyong Li, and Thuc Duy Le (∗)

UniSA STEM, University of South Australia, Mawson Lakes, SA 5095, AU; Centre for Cancer Biology, SA Pathology, Adelaide, SA 5000, AU; and Department of Medicine, The University of Adelaide, Adelaide, SA 5005, AU.

∗ To whom correspondence should be addressed.



Keywords: cancer driver, cancer driver discovery, computational method

Contact:

[email protected]

Identifying cancer driver genes (cancer drivers for short) is vital since these genes play a significant role in the development of cancer. Understanding cancer drivers and their regulatory mechanism is crucial to the design of effective cancer treatments.

Classical methods of identifying cancer driver genes are based on detecting the mutations in the DNA sequences of coding genes in wet-lab experiments. There are many mutation types in the genome such as single-nucleotide variants (SNVs), structural variants (SVs), insertions and deletions (indels), and copy number aberrations (CNAs) (Dimitrakopoulos and Beerenwinkel, 2017). These mutations may cause normal cells to transform to tumour cells, resulting in the development of cancer. For example, it has been confirmed that mutations in genes VHL and MET cause kidney cancer (Linehan et al., 2010) and mutations in genes AKT1 and BRCA1 are related to breast cancer (Stephens et al., 2012). However, many mutated genes are not driver genes and may not regulate the progression of cancer. The reason is that not all mutations in the genome contribute to cancer development. Mutations which play a significant role in cancer progression are called driver mutations while mutations which do not have any impact on cancer development are called passenger mutations (Leiserson et al., 2015; Vandin, 2017). Genes which bear cancer driver mutations are considered as cancer drivers (Tokheim et al., 2016). Nevertheless, some cancer drivers may not contain mutations. For example, genes which may not contain mutations but regulate targets to develop cancer are also considered as cancer drivers, e.g. the overexpression of KDM5C decreases p53 expression to enhance the proliferation and invasion of gastric cancer cells and KDM5C is considered as a cancer driver (Xu et al., 2017a). The illustration of cancer drivers and genes with mutations is shown in Figure 1.

© The Author 2017. arXiv [q-bio.GN].

Fig. 1. Cancer drivers and genes with mutations. Genes with driver mutations are cancer drivers. Some genes which do not contain mutations but regulate driver mutations to develop cancer are also considered as cancer drivers. (Diagram labels: Cancer drivers; Genes with mutations; Genes with passenger mutations; Genes with driver mutations; Cancer drivers without mutations.)

Given the complexity of the regulation by cancer drivers and the large number of genes, over ten thousand, detecting cancer driver genes is challenging with wet-lab experiments, and many computational methods utilising multiple types of genomic data have been developed to reveal cancer drivers and their regulatory mechanism behind cancer development (Fattore et al., 2016; Gasparini et al., 2015; Papaemmanuil et al., 2016; Rassenti et al., 2017). Cancer driver discovery methods have become increasingly popular because of the fast development of computer science and the significant revolution in DNA sequencing techniques. Taking advantage of these developments, numerous methods have been proposed to detect cancer driver genes. For example, MutSigCV (Lawrence et al., 2013) investigates the significance of mutations in genes to predict cancer drivers; OncodriveFM (Gonzalez-Perez and Lopez-Bigas, 2012) and OncodriveCLUST (Tamborero et al., 2013) evaluate the functional influence and clustering of gene mutations respectively; and DriverNet (Bashashati et al., 2012), MEMo (Ciriello et al., 2012), and CBNA (Pham et al., 2019) examine the role of genes in gene regulatory networks. Due to the large number of current computational methods for cancer driver discovery, it may take a huge amount of effort to find a good resource covering the state-of-the-art methods, and thus a review is necessary and helpful.

There have been previous works (Dimitrakopoulos and Beerenwinkel, 2017) reviewing the computational methods for identifying single cancer drivers at the population level. However, it is important to gain mechanistic insight into how cancer drivers work together in driving cancer. Besides, the cancer drivers of each patient may differ from those of others: cancer is a heterogeneous disease, each patient has a different genome, and the disease of each patient may be driven by different cancer driver genes. Thus, we also need to consider cancer driver modules and personalised cancer drivers (i.e. cancer drivers for a specific patient). In addition, numerous new cancer driver identification methods have been developed since then. Therefore, a more comprehensive review of the current computational methods for identifying cancer drivers is required.

In this paper, we survey computational methods for discovering both single cancer drivers and cancer driver modules, at the population level and at the individual level as well. We then analyse the advantages and disadvantages of the current methods and identify challenges of the field. To facilitate the development of new computational methods for cancer driver detection, we survey resources which can be used as tools in conducting cancer driver research and validating predicted cancer drivers. In addition, we believe the case study conducted in this paper to compare the performance of the current methods will be useful for researchers who are interested or work in the field when developing their new methods.

The paper is structured as follows. In Section 2, we review computational methods for identifying single cancer drivers and cancer driver modules from genomic data, including cancer drivers for both the population and individuals. In Section 3, we summarise the currently available resources which can be used for conducting cancer driver research as well as validating the results. In Section 4, we carry out a case study. Finally, in Section 5, we analyse the current methods to identify their advantages and limitations, then discuss future directions and challenges of the field.

The current computational methods use a wide range of genomic data types, including mutations, gene expression, pathways, etc. to discover different types of cancer drivers. Thus, we categorise the methods into various categories and sub-categories. The diagram of the categorisation is shown in Figure 2 and the summary of the methods is presented in Table 1.

Fig. 2. Categorisation of cancer driver discovery methods. The methods are categorised in three groups: Single cancer driver identification, Cancer driver module identification, and Personalised cancer driver identification. Single cancer driver identification includes two sub-groups: Mutation-based methods and Network-based methods. Mutation-based methods discover cancer drivers using mutation significance, functional impact of mutations, location of mutations, etc. Most cancer driver module identification methods use the mutual exclusivity of mutations to identify modules of cancer drivers. (Diagram labels: Cancer driver discovery methods; Single cancer driver identification, with Mutation-based methods [Mutation significance, Functional impact of mutations, Location of mutations, Others] and Network-based methods; Cancer driver module identification [Mutual exclusivity, Others]; Personalised cancer driver identification.)

In the categorisation, we differentiate single cancer drivers from modules of cancer drivers since there is evidence showing that some genes work in concert to influence different biological processes (e.g. EMT) (Cursons et al., 2017) and in some biological processes, the regulation of single genes might not have significant impacts but the regulation of groups of genes does. Furthermore, as cancer is a heterogeneous disease, each patient may have a different morphology and clinical outcome. For instance, two patients, who have the same cancer type and receive the same treatment, may experience different outcomes. The reason is that the genome of each patient is different and each patient's disease may be driven by different driver genes, leading to a strong need to study cancer driver genes specific to an individual patient. Thus, we categorise the current computational methods for cancer driver discovery into three groups, including methods to identify single cancer drivers, methods to identify cancer driver modules, and methods to discover personalised cancer drivers (i.e. cancer drivers for a specific patient). In addition, based on the key techniques used in the methods, we divide single cancer driver identification methods into two sub-groups, including mutation-based methods and network-based methods. Mutation-based methods use different characteristics of mutations such as mutation significance, functional impact of mutations, and location of mutations to discover cancer drivers, while network-based methods evaluate the role of genes in biological networks to predict cancer drivers. Most cancer driver module identification methods use the mutual exclusivity of mutations to identify modules of cancer drivers. We will discuss the details of the methods in the following sections.

Table 1. Summary of methods for identifying cancer drivers

Category | Sub-category | Method | Description
Single cancer driver identification | Mutation-based methods (using mutation significance) | MutSigCV (Lawrence et al., 2013) | Assesses the significance of mutations in DNA sequencing in order to discover cancer driver genes.
Single cancer driver identification | Mutation-based methods (using functional impact of mutations) | OncodriveFM (Gonzalez-Perez and Lopez-Bigas, 2012) | Uses the functional impact of mutations of genes to detect cancer drivers with the hypothesis that any bias of variations with a significantly functional impact in genes can be used to identify candidate driver genes.
Single cancer driver identification | Mutation-based methods (using functional impact of mutations) | OncodriveFML (Mularoni et al., 2016) | Uses the functional impact of gene mutations to reveal both coding and non-coding cancer drivers.
Single cancer driver identification | Mutation-based methods (using functional impact of mutations) | DriverML (Han et al., 2019) | Uses the functional impact of mutations to unravel cancer drivers through a supervised machine learning approach.
Single cancer driver identification | Mutation-based methods (using location of mutations) | ActiveDriver (Reimand and Bader, 2013) | Looks at the enrichment of mutations in externally defined regions to uncover cancer driver genes.
Single cancer driver identification | Mutation-based methods (using location of mutations) | OncodriveCLUST (Tamborero et al., 2013) | Detects cancer genes with a large bias in clustering mutations based on the idea that gain-of-function mutations usually cluster in particular protein sections and these mutations contribute to the development of cancer cells.
Single cancer driver identification | Mutation-based methods (others: combining with gene expression, pathways) | IntOGen-mutations (Gonzalez-Perez et al., 2013) | Uses somatic mutations, gene expression, and tumour pathways to identify cancer drivers for various tumour types by combining OncodriveFM (Gonzalez-Perez and Lopez-Bigas, 2012) and OncodriveCLUST (Tamborero et al., 2013).
Single cancer driver identification | Mutation-based methods (others: combining with gene expression, pathways) | PathScan (Wendl et al., 2011) | Combines genomic mutations with the information of genes in known pathways to uncover cancer driver genes.
Single cancer driver identification | Mutation-based methods (others: combining with gene expression, pathways) | Sakoparnig et al. (Sakoparnig et al., 2015) | Introduces a computational method to detect genomic alterations with low occurrence frequencies based on mutation timing.
Single cancer driver identification | Mutation-based methods (others: combining with gene expression, pathways) | CONEXIC (Akavia et al., 2010) | Applies a score-guided search to detect combinations of modulators which reflect the expression of a gene module in a set of tumour samples, then identifies those which have the highest score in amplified or deleted regions.
Single cancer driver identification | Mutation-based methods (others: combining with gene expression, pathways) | ncDriver (Hornshoj et al., 2018) | Screens non-coding mutations with conservations and cancer specificity to reveal non-coding cancer drivers.
Single cancer driver identification | Network-based methods | Vinayagam et al. (Vinayagam et al., 2016) | Applies controllability analysis on the directed network of human protein-protein interaction to identify disease genes.
Single cancer driver identification | Network-based methods | CBNA (Pham et al., 2019) | Identifies coding and miRNA cancer drivers by analysing the controllability of the miRNA-TF-mRNA network and mutation data.
Single cancer driver identification | Network-based methods | DriverNet (Bashashati et al., 2012) | Uncovers cancer drivers by evaluating the influence of mutations on transcriptional networks in cancer.
Cancer driver module identification | Using mutual exclusivity of mutations | CoMEt (Leiserson et al., 2015) | Identifies cancer genes by using the exact statistical test to test mutual exclusivity of genomic events and applies techniques to do simultaneous analysis for mutually exclusive alterations.
Cancer driver module identification | Using mutual exclusivity of mutations | WeSME (Kim et al., 2017) | Discovers cancer drivers by evaluating the mutual exclusivity of mutations of gene pairs.
Cancer driver module identification | Using mutual exclusivity of mutations | MEMo (Ciriello et al., 2012) | Analyses mutual exclusivity of mutated genes in subnetworks to identify mutual exclusivity modules in cancer.
Cancer driver module identification | Others: using mutations, gene expression, gene network | iMCMC (Zhang et al., 2013) | Uses the cancer genomic data including mutations, CNAs, and gene expression from cancer patients to identify mutated core modules in cancer.
Cancer driver module identification | Others: using mutations, gene expression, gene network | NetBox (Cerami et al., 2010) | Uses biological networks to assess network modules statistically and identify core pathways in GBM.
Cancer driver module identification | Others: using mutations, gene expression, gene network | TieDIE (Paull et al., 2013) | Applies network diffusion to discover the relationship of genomic events and changes in cancer subtypes.
Cancer driver module identification | Others: using mutations, gene expression, gene network | Hamilton et al. (Hamilton et al., 2013) | Uses the pan-cancer dataset of TCGA and the miRNA target data of AGO-CLIP to detect a pan-cancer oncogenic miRNA superfamily with a central core seed motif.
Personalised cancer driver identification | - | DawnRank (Hou and Ma, 2014) | A ranking framework which applies PageRank to evaluate the impact of genes in an interaction network to detect cancer drivers.
Personalised cancer driver identification | - | SCS (Guo et al., 2018) | Detects the minimal set of mutated genes controlling the maximal differentially expressed genes as cancer drivers.
Personalised cancer driver identification | - | PNC (Guo et al., 2019) | Identifies cancer drivers as the minimum gene set which covers all the edges based on a bipartite graph.

Most current methods identify single cancer drivers at the population level. In general, they can be grouped into mutation-based methods and network-based methods. Mutation-based methods use the characteristics of mutations (e.g. the significance of mutations in genes, the functional impacts of mutations, the recurrence of mutations in genes, etc.) to identify cancer driver genes, while network-based methods use gene networks to assess the role of genes and then combine this with the mutation information to predict cancer drivers. The general idea of the network-based methods is illustrated in Figure 3.

Fig. 3. Network-based methods. Network-based methods evaluate the role of genes in gene regulatory networks by using different techniques and combine with the mutations of genes to predict cancer drivers. (Diagram labels: Networks; Evaluate genes; Gene roles; Mutations; Combine; Cancer drivers.)

Mutation-based methods use the characteristics of mutations in genes to discover cancer driver genes. Based on the characteristics of mutations used in the methods, we divide them into four sub-groups, including using the significance of mutations in genes, using the functional impacts of mutations, using the recurrence of mutations, and others. Other methods combine the mutation information of genes with gene expression and tumour pathways to detect cancer drivers. The details of methods in the four sub-groups are discussed below.

A. Using the significance of mutations in genes

MutSigCV (Lawrence et al., 2013) is a method to discover cancer drivers by assessing the significance of mutations in genes. Cancer drivers predicted by MutSigCV are mutated more frequently than expected by chance based on inferred background mutation processes. However, MutSigCV has a limitation: some genes are highly mutated, yet their mutations are passenger mutations which do not contribute to cancer development.
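The idea of "mutated more frequently than expected by chance" can be illustrated with a minimal sketch. It assumes a single uniform background mutation rate per base and a Poisson model for the mutation count of a gene; this is a simplification for illustration only and not the published MutSigCV model. The function names, the example gene length, the cohort size, and the background rate are all hypothetical.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term            # accumulate P(X = i)
        term *= lam / (i + 1)  # move to P(X = i + 1)
    return max(0.0, 1.0 - cdf)


def mutation_significance(observed: int, gene_length: int,
                          n_samples: int, background_rate: float) -> float:
    """P-value that a gene carries at least `observed` mutations by chance.

    Assumption (illustrative): mutations arise independently at a uniform
    per-base, per-sample `background_rate`.
    """
    expected = gene_length * n_samples * background_rate
    return poisson_upper_tail(observed, expected)


# Hypothetical gene of 1,000 bases in 100 samples, background rate 1e-5,
# so about one mutation is expected by chance.
p_recurrent = mutation_significance(8, 1000, 100, 1e-5)
p_background = mutation_significance(1, 1000, 100, 1e-5)
```

Under these assumptions a gene with eight mutations receives a very small p-value, whereas a gene with a single mutation does not stand out from the background.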

B. Using the functional impacts of mutations

OncodriveFM (Gonzalez-Perez and Lopez-Bigas, 2012) uses the functional impact of genomic mutations to detect cancer drivers instead of evaluating the significance of mutations in genes like MutSigCV. OncodriveFM hypothesises that any bias of variations (i.e. mutations) in genes towards a significant functional impact may be an indicator for identifying candidate driver genes. The key point of this method is that instead of assessing how many mutations a gene has, it evaluates how biased the gene's mutations are towards high functional impact. Thus, it can detect driver genes whose mutations have low recurrence but play a significant role in cancer development.

Similar to OncodriveFM, OncodriveFML (Mularoni et al., 2016) also uses the functional impact of mutations to discover cancer drivers. However, while OncodriveFM only uses coding gene mutations, OncodriveFML is designed to analyse both coding and non-coding mutations. The OncodriveFML framework is applied to 19 tumour datasets and uncovers well-known coding drivers like TP53, KEAP1, ARID2, and RUNX1 with high functional impacts. It also identifies non-coding drivers such as MALAT1 and MIAT. In particular, MALAT1 is a lncRNA which has been shown to be involved in lung adenocarcinomas and MIAT is a non-protein-coding transcript related to myocardial infarction.

Another method assessing the functional impact of gene mutations to unravel cancer drivers is DriverML (Han et al., 2019). Different from OncodriveFM and OncodriveFML, DriverML assumes that the functional impact of mutations is affected by mutation types. Thus, it proposes a method to detect cancer drivers by scoring functional influences of alterations based on mutation types. The method uses various properties to weight the impact of mutation types and it obtains optimised weight parameters by using a supervised machine learning approach with pan-cancer training data.
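The functional-impact bias idea can be sketched as a simple resampling test. The sketch assumes that each mutation already has a numeric functional impact score (higher meaning more damaging) and compares the mean score of a gene's mutations against random draws from the background of all observed mutations. It illustrates the principle only; it is not the OncodriveFM or OncodriveFML implementation, and the function name, scores, and parameters are hypothetical.

```python
import random
from statistics import mean


def fm_bias_pvalue(gene_scores, background_scores, n_draws=10000, seed=0):
    """Empirical p-value for a bias towards high functional impact.

    gene_scores: impact scores of the mutations observed in one gene.
    background_scores: impact scores of all mutations in the cohort.
    A small p-value means random sets of the same size rarely reach the
    gene's mean impact.
    """
    rng = random.Random(seed)
    observed = mean(gene_scores)
    k = len(gene_scores)
    hits = 0
    for _ in range(n_draws):
        sample = [rng.choice(background_scores) for _ in range(k)]
        if mean(sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_draws + 1)


# Toy background: 90% low-impact and 10% high-impact mutations.
background = [0.1] * 90 + [0.9] * 10
p_biased = fm_bias_pvalue([0.9, 0.9, 0.9], background, n_draws=2000)
p_unbiased = fm_bias_pvalue([0.1, 0.1, 0.1], background, n_draws=2000)
```

Note that the gene with only three mutations is still flagged because all of them are high-impact, which mirrors the point made above about drivers with low recurrence.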

C. Using the recurrence of mutations in genes

Instead of using the functional impact of mutations like OncodriveFM, OncodriveFML, and DriverML, other methods such as ActiveDriver (Reimand and Bader, 2013) and OncodriveCLUST (Tamborero et al., 2013) identify cancer drivers based on the location of mutations. ActiveDriver discovers cancer driver genes by detecting the enrichment of somatic mutations in post-translationally modified sites, including phosphorylation, acetylation, and ubiquitination sites. OncodriveCLUST is based on the observation that gain-of-function mutations usually cluster in particular protein sections and these mutations contribute to the development of cancer cells. Thus, it detects cancer genes with a large bias in clustering mutations. The method is applied to the Catalogue of Somatic Mutations in Cancer (COSMIC) database (Forbes et al., 2010) and the result is then validated with the Cancer Gene Census (CGC) (Futreal et al., 2004). As this method is based on mutation clustering, it cannot identify cancer drivers whose mutations are distributed across the sequence. In addition, it requires a large number of observed mutations to produce a good result. Thus, this method should be used to complement the results of other methods in detecting cancer drivers.
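The clustering bias can be illustrated with a minimal sketch that groups mutated protein positions lying close together and reports the fraction of mutations falling in sizeable clusters. The grouping rule, the `max_gap` and `min_size` parameters, and the example positions are assumptions made for illustration; they are not the scoring scheme of OncodriveCLUST.

```python
def cluster_positions(positions, max_gap=5):
    """Group sorted mutation positions whose neighbours are within max_gap residues."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def clustering_score(positions, max_gap=5, min_size=3):
    """Fraction of mutations that fall in clusters of at least min_size mutations."""
    if not positions:
        return 0.0
    clusters = cluster_positions(positions, max_gap)
    clustered = sum(len(c) for c in clusters if len(c) >= min_size)
    return clustered / len(positions)


# Hypothetical genes: one with mutation hotspots, one with scattered mutations.
hotspot_gene = [12, 12, 13, 12, 61, 61, 60, 146]
scattered_gene = [10, 120, 250, 400, 610]
hotspot_score = clustering_score(hotspot_gene)      # 7 of 8 mutations clustered
scattered_score = clustering_score(scattered_gene)  # no cluster reaches min_size
```

The scattered gene scores zero regardless of how many mutations it carries, which is the limitation noted above for drivers whose mutations are distributed across the sequence.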

D. Others: Combining with gene expression, pathways, etc.

The platform IntOGen-mutations (Gonzalez-Perez et al., 2013) is developed based on OncodriveFM and OncodriveCLUST to discover cancer drivers for various tumour types. This platform uses somatic mutations, gene expression, and tumour pathways as the input parameters. It takes advantage of both methods using the functional impact of mutations and methods using the location of mutations by applying OncodriveFM to identify driver genes which are biased significantly toward mutations with high functional impacts and applying OncodriveCLUST to detect driver genes which have mutations highly concentrated in specific regions of proteins.

Also using mutational information in detecting cancer genes, PathScan (Wendl et al., 2011) combines mutations with the information of genes in known pathways. PathScan tests the scenario in which pathway mutations contribute to the development of tumours. In addition, other methods combine mutations with existing knowledge of gene function or network structure, or find mutually exclusive mutations, etc. For instance, Sakoparnig et al. (Sakoparnig et al., 2015) introduce a computational method to detect genomic alterations with low occurrence frequencies based on mutation timing.

In particular, some methods combine a wide range of data types in order to identify cancer drivers more effectively. For example, in (Akavia et al., 2010), the authors develop a computational framework which uses CNVs and gene expression as the inputs to uncover cancer drivers. The framework is named COpy Number and EXpression In Cancer (CONEXIC). It applies a score-guided search to detect combinations of modulators which reflect the expression of a gene module in a set of tumour samples. Then it identifies those having the highest score in amplified or deleted regions on the chromosome. The authors hypothesise that if the expression of gene A and its copy number are related, the copy number variation likely results in changes in the expression of gene A, and there is a high probability that A is a driver candidate and regulates other genes. The authors apply this framework to a melanoma dataset and correctly detect its known cancer drivers.

ncDriver (Hornshoj et al., 2018) identifies non-coding cancer drivers with a two-stage procedure. The first stage is a mutational recurrence test which uses mutations (including indels and SNVs) and genomic elements as the inputs to detect elements with mutational recurrence. The second stage assesses whether the mutations of each element have a significant cancer-specific distribution and a significant bias for highly conserved positions of the element, and then determines whether the conservation level of the mutations is significantly large compared to the overall conservation distribution. This procedure is applied to the pan-cancer whole-genome dataset to identify cancer drivers, and the significant non-coding drivers identified by the method are MIR142 lncRNA and XRNU5A-1 sncRNA.
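The CONEXIC hypothesis described above, that a gene whose expression follows its copy number is a plausible driver candidate, can be illustrated with a minimal sketch using Pearson correlation across tumour samples. This covers only the copy-number/expression relationship; the score-guided search over modulators is not reproduced. The function name and the toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no signal
    return cov / sqrt(var_x * var_y)


# Toy profiles of one gene across six tumour samples (illustrative values).
copy_number = [2, 2, 3, 4, 4, 5]
expression = [1.0, 1.1, 1.9, 3.2, 2.9, 4.1]
r = pearson(copy_number, expression)
```

A value of `r` close to 1 would mark the gene as a candidate whose expression tracks its copy number; a full analysis would additionally test whether the gene's expression explains the expression of a module of other genes.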

E. Analysis

Although all the methods above are based on mutation data to identify cancer drivers, each has a different approach. MutSigCV evaluates the significance of mutations in genes to detect cancer drivers. However, some genes are mutated significantly but most of their mutations are passenger mutations, which do not progress cancer. Thus, these genes are not cancer driver genes. To eliminate passenger mutations, ActiveDriver and OncodriveCLUST consider the location of mutations. Although these methods can reduce the false positives in predicting driver mutations, they may overlook cancer drivers with mutations distributed across the protein since they only evaluate mutations which are concentrated in particular protein sections. Instead of using the location of mutations, other methods use different strategies. For instance, OncodriveFM, OncodriveFML, and DriverML utilise the functional impact of genomic mutations to evaluate the importance of mutated genes to discover cancer drivers. Sakoparnig et al. (Sakoparnig et al., 2015) is based on the timing of mutations, PathScan combines mutations with pathway data, and CONEXIC combines them with gene expression data. There are also methods which use an integrated approach such as IntOGen-mutations, which considers both the functional impact of mutations and their clustering. Furthermore, since mutations in both coding regions and non-coding regions play a significant role in cancer development, cancer drivers can be coding or non-coding elements. Some methods like OncodriveFML and ncDriver are developed to detect non-coding cancer drivers.

As these methods evaluate different aspects of mutations to identify cancer drivers, they can detect several validated cancer drivers. The novel cancer drivers identified by these methods are candidates which can be tested in wet-lab experiments to confirm their role in cancer progression. However, although these methods can be easily applied to different mutation datasets, mutation databases are incomplete, which limits the applicability of these methods.

In general, network-based methods evaluate the role of genes in biological networks and then combine this with the mutation information of genes to predict cancer drivers. There are three methods in this group, including Vinayagam et al. (Vinayagam et al., 2016), CBNA (Pham et al., 2019), and DriverNet (Bashashati et al., 2012). The details of these methods are discussed below.

A. The details of methods

Vinayagam et al. (Vinayagam et al., 2016) apply controllability analysis to the directed network, i.e. the network with directed edges, of human protein-protein interactions (PPI). The input network includes nodes which are proteins and edges which are interactions between proteins. The controllability analysis categorises nodes into three types, "indispensable", "dispensable", or "neutral", based on their impact on the minimum driver node set (MDS), i.e. the minimum node set driving the whole network. Indispensable nodes are nodes whose removal from the network increases the number of driver nodes in the MDS, while the removal of dispensable nodes decreases it. The removal of neutral nodes from the network has no effect on the number of driver nodes. The study then analyses the controllability of the perturbed network to identify sensitive indispensable nodes, i.e. indispensable nodes in the original network but not in the perturbed network. These sensitive indispensable nodes are the candidate cancer drivers.

Also inspired by network controllability, CBNA (Pham et al., 2019) analyses the controllability of a gene regulatory network to discover cancer drivers. However, the network built by CBNA is a miRNA-TF-mRNA network which consists of microRNAs (miRNAs), Transcription Factors (TFs), and mRNAs. Since this network is constructed from the expression data of miRNAs/mRNAs of cancer patients and the existing gene interaction databases such as PPI (Vinayagam et al., 2011), miRTarBase (Chou et al., 2016), and TransmiR (Wang et al., 2010), it is more reliable and specific to a cancer type. In addition, different from the method of Vinayagam et al. (Vinayagam et al., 2016), CBNA analyses the network controllability to indicate critical nodes of the network, i.e. nodes whose removal from the network increases the size of the minimum node set controlling the whole network, and then combines this with the mutation data to identify cancer drivers. As CBNA uses the miRNA-TF-mRNA network, it can identify both coding and miRNA driver genes. Furthermore, it can also be used to discover drivers for a cancer type or cancer subtype.

Instead of evaluating the controllability of a subset of nodes of a gene network like Vinayagam et al. (Vinayagam et al., 2016) and CBNA (Pham et al., 2019), DriverNet (Bashashati et al., 2012) considers the influence of mutated genes on other genes in a network. DriverNet integrates different data types, including genome data (i.e. non-synonymous SNVs, indels, and copy number variation), an influence graph of biological pathway information, and gene expression. It constructs a bipartite graph of genes to detect the effect of mutated genes on genes which have an outlying expression. The putative drivers are mutated genes which impact a high number of outlying-expression genes in several patients. The method is applied to four cancer datasets, including glioblastoma, breast, triple negative breast, and serous ovarian, and it reveals various candidate cancer drivers related to transcriptional networks.
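The node classification used in the controllability analyses above can be sketched in a few lines. The sketch assumes the commonly used maximum-matching formulation of structural controllability, in which the number of driver nodes of a directed network with N nodes equals N minus the size of a maximum matching (and at least 1). Whether the cited methods compute it exactly this way is an assumption here; the function names and toy networks are hypothetical.

```python
def max_matching_size(nodes, edges):
    """Maximum matching between out-copies and in-copies of the nodes (Kuhn's algorithm)."""
    succ = {u: [] for u in nodes}
    for u, v in edges:
        if u in succ and v in succ:
            succ[u].append(v)
    matched_to = {}  # in-copy v -> out-copy u currently matched to it

    def augment(u, seen):
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in matched_to or augment(matched_to[v], seen):
                matched_to[v] = u
                return True
        return False

    return sum(augment(u, set()) for u in nodes)


def n_driver_nodes(nodes, edges):
    """Number of driver nodes: unmatched nodes, with a minimum of one."""
    return max(len(nodes) - max_matching_size(nodes, edges), 1)


def classify_nodes(nodes, edges):
    """Label each node by how its removal changes the number of driver nodes."""
    base = n_driver_nodes(nodes, edges)
    labels = {}
    for x in nodes:
        rest = [n for n in nodes if n != x]
        kept = [(u, v) for u, v in edges if x not in (u, v)]
        nd = n_driver_nodes(rest, kept)
        if nd > base:
            labels[x] = "indispensable"
        elif nd < base:
            labels[x] = "dispensable"
        else:
            labels[x] = "neutral"
    return labels


# Toy networks: a chain A -> B -> C and a hub H regulating X, Y, Z.
chain = classify_nodes(["A", "B", "C"], [("A", "B"), ("B", "C")])
star = classify_nodes(["H", "X", "Y", "Z"], [("H", "X"), ("H", "Y"), ("H", "Z")])
```

In the chain, removing the middle node B breaks the network into two isolated nodes and therefore increases the number of driver nodes, so B is labelled indispensable. Recomputing the matching for every removed node is adequate for small examples but would be slow on genome-scale networks.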

B. Analysis

The three methods above use biological networks to predict single cancer drivers; other methods which use networks to discover cancer driver modules or personalised cancer drivers are discussed in Sections 2.2 and 2.3 respectively. In general, network-based methods evaluate the role of genes in the whole network to predict cancer drivers. Various techniques are used to analyse the networks, such as network controllability in Vinayagam et al. and CBNA or the influence of genes in DriverNet. These methods can elucidate molecular mechanisms in cancer development at the network level, but they need large datasets to produce reliable results. In addition, the networks used in some methods (i.e. Vinayagam et al. and DriverNet) are not specific to any cancer type, thus they may miss important information which is specific to a cancer type. Another limitation of network-based methods like DriverNet is that they predict genes which affect other genes' expression as cancer drivers, whereas some cancer drivers may not alter the expression of other genes, and some genes may incidentally change other genes' expression although they are not cancer drivers.
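The influence-based idea behind DriverNet, ranking mutated genes by how many outlying-expression events they can explain, can be sketched as a greedy cover over a bipartite graph. The sketch below is a simplification under stated assumptions: the input mapping, the function name, and the gene and patient identifiers are hypothetical, and DriverNet's own scoring and significance testing are not reproduced.

```python
from typing import Dict, List, Set, Tuple

Event = Tuple[str, str]  # (patient id, gene with outlying expression)


def greedy_driver_ranking(explains: Dict[str, Set[Event]]) -> List[Tuple[str, int]]:
    """Rank mutated genes by the number of not-yet-explained outlier events they cover.

    `explains` maps a mutated gene to the outlier events it is linked to in the
    bipartite graph (left side: mutated genes, right side: outlier events).
    """
    remaining = {gene: set(events) for gene, events in explains.items()}
    ranking: List[Tuple[str, int]] = []
    while remaining:
        # Pick the gene covering most uncovered events; ties broken alphabetically.
        best = max(sorted(remaining), key=lambda g: len(remaining[g]))
        covered = remaining.pop(best)
        if not covered:
            break  # nothing left to explain
        ranking.append((best, len(covered)))
        for gene in remaining:
            remaining[gene] -= covered  # events already explained are not counted again
    return ranking
```

Genes that only explain events already covered by a higher-ranked gene drop out of the ranking, which mirrors the limitation noted above: a gene is only proposed as a driver if it appears to change the expression of other genes.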

Recently, several methods have been developed to discover cancer drivers in modules. Most of the methods identifying cancer driver modules use mutual exclusivity of mutations. Thus, we divide methods for identifying cancer driver modules into two sub-groups: those using mutual exclusivity of mutations and others. Other methods use mutations, gene expression, gene networks, etc. to detect cancer driver modules. The details of the methods in the two sub-groups are discussed below.

A. Using mutual exclusivity of mutations

CoMEt (the Combinations of Mutually Exclusive Alterations) (Leiserson et al., 2015) uses the mutual exclusivity technique to detect cancer driver modules. Because different cancer patients have different combinations of genomic alterations which develop the disease, CoMEt detects combinations of alterations (i.e. modules of mutated genes) in the same pathway which are mutually exclusive across samples. The method uses an exact statistical test for mutual exclusivity and simultaneously analyses mutually exclusive alterations specific to cancer subtypes. The advantage of this method is that it has a low computational complexity. Similarly, WeSME (Kim et al., 2017) also assesses the mutual exclusivity of gene mutations to detect cancer drivers. However, instead of evaluating genes in the same pathway, WeSME only considers gene pairs, and the gene pairs whose mutations show significant mutual exclusivity are considered as modular candidate cancer drivers.

MEMo (Mutual Exclusivity Modules) (Ciriello et al., 2012) applies the mutual exclusivity technique in biological networks to identify oncogenic network modules. According to Ciriello et al. (2012), although individual tumours of the same cancer type may have different genomic alterations, these alterations occur in a restricted number of pathways. In addition, alterations in the same pathway are not likely to co-occur in the same patient. Based on these observations, MEMo performs correlation analysis and applies statistical tests to detect network modules based on three criteria: (1) genes in a network module are altered across the samples; (2) member genes tend to participate in the same biological process; (3) alterations in modules are mutually exclusive. The method is applied to the glioblastoma multiforme (GBM) dataset and successfully detects known network modules, i.e., groups of cancer drivers, in GBM.
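To make the notion of mutual exclusivity concrete, the following minimal sketch tests a single gene pair with a one-sided hypergeometric (Fisher-type) test: given how many samples carry a mutation in each gene, it computes the probability of observing as few co-mutated samples as were seen. This is an illustrative simplification, not CoMEt's exact test for gene sets nor WeSME's weighted sampling; the function name and sample identifiers are hypothetical.

```python
from math import comb
from typing import Set


def mutual_exclusivity_pvalue(mut_a: Set[str], mut_b: Set[str], n_samples: int) -> float:
    """P(overlap <= observed) when mutations in two genes are placed independently.

    mut_a, mut_b: ids of samples mutated in gene A and gene B (subsets of the cohort).
    n_samples:    total number of samples in the cohort.
    A small p-value means the two genes are co-mutated less often than expected.
    """
    a, b = len(mut_a), len(mut_b)
    observed = len(mut_a & mut_b)
    lowest = max(0, a + b - n_samples)  # smallest overlap that is possible at all
    total = comb(n_samples, b)
    favourable = sum(
        comb(a, i) * comb(n_samples - a, b - i) for i in range(lowest, observed + 1)
    )
    return favourable / total
```

For example, in a cohort of 10 samples where gene A is mutated in five samples and gene B in the other five, the p-value is 1/252, whereas two genes mutated in exactly the same samples give a p-value of 1.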

B. Others: Using mutations, gene expression, gene network, etc.

iMCMC (an approach to identify Mutated Core Modules in Cancer) (Zhang et al., 2013) is developed to uncover groups of genes driving cancer using genomic data from cancer patients. The method uses somatic mutation, CNV, and gene expression data to build a gene network. Then, it identifies coherent subnetworks (modules) from the network through an optimisation model which selects vertices and edges with high weights. Finally, the significance of the subnetworks is assessed by performing a random test and the mutual exclusivity of the subnetworks is tested by adopting a Markov chain Monte Carlo permutation strategy. The method is applied to the GBM and the ovarian carcinoma (OV) datasets from TCGA. Many discovered core modules are related to known pathways and most of the identified genes are cancer driver genes which have already been reported as related to cancer pathogenesis in other research.

NetBox (Cerami et al., 2010) uses biological networks in studying drivers for GBM. It introduces a network-based method to detect oncogenic processes and cancer driver genes. The hypothesis of the approach is that biological networks include multiple functional modules, and tumours target specific functional modules. The method analyses sequence mutations, CNVs, and an interaction network including both PPIs and signalling pathways to identify and assess network modules statistically.

Another method to identify cancer driver modules is TieDIE (Tied Diffusion Through Interacting Events) (Paull et al., 2013). TieDIE applies network diffusion to discover the relationship between genomic events and changes in cancer subtypes. The approach collects a subnetwork of PPIs, interactions of genomic perturbations, predicted transcription factor-to-target connections, and transcriptomic states from the literature. The method is applied to the breast adenocarcinoma (BRCA) dataset of TCGA and it detects signalling pathways and interlinking genes corresponding to cancer signalling.

The methods above identify coding cancer driver modules. However, because non-coding RNAs (e.g. miRNAs) can modulate tumorigenesis by promoting or suppressing specific genes and various cancer types have overlaps in oncogenic pathways, a group of miRNAs which drives or suppresses tumorigenesis in different tumour types may exist. Hamilton et al. (Hamilton et al., 2013) use the pan-cancer dataset of TCGA and the miRNA target data of Argonaute Crosslinking Immunoprecipitation (AGO-CLIP) (Chi et al., 2009; Hafner et al., 2010, 2012) to detect pan-cancer miRNA drivers. The idea is that the set of cancer miRNA drivers will modulate tumorigenesis and share a central core seed motif. The result shows that an oncogenic miRNA superfamily, which includes miR-17, miR-18, miR-19, miR-93, miR-130, miR-210, and miR-455, coregulates tumour suppressors through a GUGC core motif.

C. Analysis

As can be seen from the methods above, most of the methods use mutual exclusivity of mutations to identify cancer driver modules. With this technique, a mutation in only one member of an identified module is enough to trigger cancer progression (Kim et al., 2017). Thus, the identified drivers in a module may not work together to regulate their targets to drive cancer. However, as discussed above, genes should collaborate to increase their influence on target genes to progress cancer. Therefore, it is necessary to develop novel methods to discover cancer driver groups whose members work in concert to initialise and develop cancer.

The methods discussed in Sections 2.1 and 2.2 discover cancer drivers at the population level. Since different patients possess different genomes and their diseases might be driven by different driver genes, it is necessary to investigate cancer drivers which are specific to an individual patient (i.e. personalised cancer drivers). There are three methods in this group: DawnRank (Hou and Ma, 2014), SCS (Guo et al., 2018), and PNC (Guo et al., 2019). All of them are based on gene regulatory networks to predict personalised cancer drivers. The details of these methods are discussed below.

A. The details of methods

A representative of methods for identifying personalised cancer drivers is DawnRank (Hou and Ma, 2014). In general, the idea of the method is that mutations in genes which have higher connectivity in an interaction network are more impactful. DawnRank uses gene expression and a gene network as the inputs. In particular, it is a ranking framework which applies PageRank (Brin and Page, 1998; Page et al., 1998) to evaluate the impact of genes on the gene network. The impact is expressed in terms of network connectivity and the number of downstream genes expressed differentially. The higher the rank of a gene is, the more downstream genes it affects in the gene network. Ranks of genes are then combined with somatic alteration data like copy number variations to detect driver alterations. Although DawnRank is based on the same gene regulatory network for all patients, it assesses the impact of genes in each patient using the patient's gene expression data to detect personalised cancer drivers. The algorithm has been applied to TCGA datasets and it has shown effectiveness in detecting cancer drivers.

To assess the impact of genes in each patient, DawnRank uses the gene expression data of each patient, but it is based on the same gene regulatory network for all patients. As a result, it may miss important information about gene regulation in each patient. Thus, to detect personalised cancer drivers, SCS (Guo et al., 2018) builds a gene regulatory network for each patient from the patient's gene expression data and its neighbour's gene expression data (i.e. the corresponding normal sample's gene expression data). SCS detects cancer driver genes as the minimal set of mutated genes which affects the maximal number of differentially expressed genes. Like SCS, PNC (Guo et al., 2019) also uses the gene expression data of a patient and its neighbour to construct personalised networks. Nevertheless, PNC only selects edges which are different between the tumour and normal states. It then converts the gene regulatory network to a bipartite graph in which nodes on the top represent genes and nodes on the bottom represent edges. PNC predicts cancer driver genes as the minimum gene set on the top of the bipartite graph which covers all the edges on the bottom.

Table 2. Resources for cancer driver discovery and validation

For input data
  ICGC (Zhang et al., 2011). https://icgc.org/. A data portal of cancer genomics of 50 cancer types.
  cBioPortal (Gao et al., 2013).
  Cancer3D (Porta-Pardo et al., 2015). http://cancer3d.org/search. Contains mutations of more than 14,700 proteins, which are mapped to over 24,300 proteins of the Protein Data Bank (Rose et al., 2013).
  CCLE (Barretina et al., 2012). https://portals.broadinstitute.org/ccle. Includes SNVs, CNAs, and gene expression.
  COSMIC (Forbes et al., 2015). https://cancer.sanger.ac.uk/cosmic. Contains cancer mutations, including manually curated expert data and data from sequencing projects.

For validating predicted results
  CGC (Futreal et al., 2004). https://cancer.sanger.ac.uk/census. Provides a list of cancer genes which has been well established for cancer development.
  AGCOH (Huret et al., 2000). http://atlasgeneticsoncology.org/. Contains about 1,500 cancer genes merged from numerous collaborative projects.
  NCG (An et al., 2016). http://ncg.kcl.ac.uk/. Comprises more than 500 known cancer genes and over 1,000 candidate cancer genes.
  DGIdb (Griffith et al., 2013).

B. Analysis

Although these methods can discover personalised cancer drivers, they still have some limitations. DawnRank is based on the same gene network for all patients. It ignores the network information specific to an individual patient, which may lead to false positives in its results. On the other hand, SCS and PNC use the genetic data of each patient to construct personalised gene networks. However, they require the genetic data of a pair of samples (i.e. a tumour and its neighbour in the normal state), and identifying the neighbour of a tumour is challenging because such a neighbour does not always exist. In addition, these methods only discover coding cancer drivers while non-coding genes (e.g. miRNAs) can also be cancer drivers, as discussed above.
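The PageRank-style ranking that underlies DawnRank, as described above, can be illustrated with a short sketch. It assumes a simplified update in which each gene receives a restart weight proportional to its absolute differential expression in one patient plus a damped share of the ranks of its downstream targets. The update rule, the damping value, and all names here are assumptions for illustration and are not taken from the DawnRank implementation.

```python
from typing import Dict, Iterable, List, Tuple


def rank_genes(
    edges: Iterable[Tuple[str, str]],
    diff_expr: Dict[str, float],
    damping: float = 0.85,
    iters: int = 200,
) -> Dict[str, float]:
    """Rank genes so that regulators of differentially expressed targets score highly.

    edges:     (regulator, target) pairs of a gene network shared by all patients.
    diff_expr: differential expression of each gene for ONE patient.
    """
    edge_list = list(edges)
    genes = sorted({g for edge in edge_list for g in edge} | set(diff_expr))
    targets: Dict[str, List[str]] = {g: [] for g in genes}
    in_degree: Dict[str, int] = {g: 0 for g in genes}
    for regulator, target in edge_list:
        targets[regulator].append(target)
        in_degree[target] += 1

    total = sum(abs(diff_expr.get(g, 0.0)) for g in genes)
    restart = {
        g: abs(diff_expr.get(g, 0.0)) / total if total else 1.0 / len(genes)
        for g in genes
    }

    rank = dict(restart)
    for _ in range(iters):
        # A target shares its rank equally among all of its regulators.
        rank = {
            g: (1 - damping) * restart[g]
            + damping * sum(rank[t] / in_degree[t] for t in targets[g])
            for g in genes
        }
    return rank
```

Because the network is fixed and only `diff_expr` changes between patients, the sketch also shows the limitation discussed above: patient-specific rewiring of the network itself is not captured.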

There are two types of resources for developing computational methods for cancer driver discovery: input data for a method and resources for validation. The input can be gene expression data, network data, mutation data, etc. For validation, the resource can be a database with ground truth or partial ground truth. The resources are summarised in Table 2.

For input data, several databases have been developed from cancer sequencing projects and they provide rich data used in cancer driver identification methods. TCGA (Institute, 2018) is a significant project in this area. The TCGA project profiles and analyses human tumours to uncover molecular aberrations at the DNA, RNA, protein, and epigenetic levels (Institute, 2018). TCGA data can be accessed through the Genomic Data Commons (GDC) data portal (Grossman et al., 2016). The ICGC data portal is also a resource for cancer genomics data and it contains data on the genomic abnormalities of 50 cancer types (Zhang et al., 2011). Another data portal for cancer genomics is cBioPortal (Gao et al., 2013), which provides a web interface for accessing cancer genomic datasets, as well as for analysing and visualising the data online.

There are also some other resources which can be used for cancer driver discovery, such as Cancer3D (Porta-Pardo et al., 2015), the Cancer Cell Line Encyclopedia (CCLE) (Barretina et al., 2012), and the COSMIC database (Forbes et al., 2015). Cancer3D is a database which focuses on the influence of mutations on the structure of proteins and it provides information for users to analyse distribution patterns of mutations and their relationship with changes in drug activity (Porta-Pardo et al., 2015). It contains mutations of more than 14,700 proteins, which are mapped to over 24,300 proteins in the Protein Data Bank (Rose et al., 2013). The CCLE includes SNVs, CNAs, and gene expression (Barretina et al., 2012). The COSMIC database is a large and comprehensive source for investigating the mutational impact in cancer.
It contains records of cancer mutations including both manually curated expert data and data from sequencing projects like TCGA or ICGC (Forbes et al., 2015, 2011). It has more than two million coding point mutations and over six million non-coding mutations (Forbes et al., 2015).

For validating identified cancer drivers, several databases can currently be used, such as the CGC (Futreal et al., 2004) in the COSMIC database. The CGC provides a gene list which has been well established for cancer progression. This list was collected through a census of genes which are mutated or causally implicated in cancer progression (Futreal et al., 2004). These genes are also called cancer genes. Besides the CGC in COSMIC, there are several sources which can be used for validating cancer drivers. The Atlas of Genetics and Cytogenetics in Oncology and Haematology (AGCOH) is another source for this purpose (Huret et al., 2000). It comprises around 1,500 cancer genes which are merged results from numerous collaborative projects (Huret et al., 2000). The Network of Cancer Genes (NCG) is an online database of cancer genes with over 500 known cancer genes and more than 1,000 candidate cancer genes (An et al., 2016). Known cancer genes are genes which have already been confirmed through experiments while candidate cancer genes are those identified using statistical methods. One more database about disease genes is the Drug-Gene Interaction database (DGIdb) (Griffith et al., 2013). It contains not only cancer drivers but also information about drugs and drug-gene interactions (Griffith et al., 2013).

At present, while coding drivers are well established in cancer research, non-coding drivers are not. Wong et al. (2018) have recently introduced OncomiR, which is a resource for investigating miRNA dysregulation in cancer through a web interface.
It performs statistical analyses based on RNA-seq, miRNA-seq, and clinical information from TCGA to discover miRNAs which are related to cancer progression. Although this database may not be used as a ground truth to validate miRNA cancer drivers, it can be used as a channel to explore miRNA dysregulation when detecting miRNA cancer drivers. To validate non-coding cancer drivers now, it is necessary to examine the literature manually (Cuykendall et al., 2017; Poulos et al., 2015).

In this section, we present a comparative study to compare the performance of some of the methods above. As there is no ground truth to compare the results of methods for discovering cancer driver modules, we only select five methods for identifying single cancer drivers and three methods for identifying personalised cancer drivers for the comparison: ActiveDriver (Reimand and Bader, 2013), DawnRank (Hou and Ma, 2014), DriverML (Han et al., 2019), DriverNet (Bashashati et al., 2012), MutSigCV (Lawrence et al., 2013), OncodriveFM (Gonzalez-Perez and Lopez-Bigas, 2012), PNC (Guo et al., 2019), and SCS (Guo et al., 2018). These methods represent different approaches to detecting cancer driver genes. ActiveDriver, DriverML, MutSigCV, and OncodriveFM are mutation-based methods while DawnRank, DriverNet, PNC, and SCS are network-based methods. In addition, DawnRank, PNC, and SCS identify personalised cancer drivers while the other five methods identify cancer drivers at the population level. Although DawnRank, PNC, and SCS detect cancer drivers for each patient, they all have a method to aggregate the results of individual patients to predict cancer drivers for the population. Thus, we can compare these three methods with the others. The comparison is performed based on the results of the eight methods in identifying drivers for breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), kidney renal clear cell carcinoma (KIRC), and head and neck squamous cell carcinoma (HNSC). We obtain the predicted cancer drivers of the eight methods for the selected five cancer types from (Guo et al., 2019).

The cancer drivers predicted by the methods are validated with the CGC from the COSMIC database as this database has catalogued the confirmed cancer drivers. The performance of a method is measured using the F Score based on the number of discovered cancer drivers that are validated by the CGC.
The F Score indicates the enrichment ability of discovered cancer drivers in the gold standard (i.e. the CGC) and it is computed based on Precision P and Recall R as shown in Eq. 1. The higher the F Score a method has, the better the method is.

$$F\ Score = \frac{2 \times P \times R}{P + R} \qquad (1)$$

In Eq. 1, P (Precision) shows the fraction of predicted driver genes in the CGC among the predicted driver genes and R (Recall) indicates the fraction of predicted driver genes in the CGC among the driver genes in the CGC. As the F Score is computed from Precision P and Recall R, it indicates both the ability of a method to predict cancer drivers exactly and its ability to predict many confirmed cancer drivers.

The comparison result is shown in Figure 4 and the details are shown in Table 3. It can be seen that with the four data sets of BRCA, LUAD, LUSC, and KIRC samples, PNC outperforms the other methods and with HNSC, ActiveDriver has the best performance.

Fig. 4. Comparison of F Score of ActiveDriver, DawnRank, DriverML, DriverNet, MutSigCV, OncodriveFM, PNC, and SCS in identifying coding cancer drivers at the population level. The x-axis indicates the eight methods and the y-axis shows the F Score. The results are based on the cancer driver prediction for the five cancer types, including BRCA, LUAD, LUSC, KIRC, and HNSC, of the eight methods.

Table 3. F Score of the eight methods in predicting drivers for the five cancer types

No.  Method        BRCA   LUAD   LUSC   KIRC   HNSC
1    ActiveDriver  0.056  0.029  0.037  0.045  0.080
2    DawnRank      0.045  0.043  0.040  0.040  0.043
3    DriverML      0.077  0.027  0.016  0.052  0.005
4    DriverNet     0.007  0.009  0.013  0.025  0.002
5    MutSigCV      0.066  0.032  0.014  0.016  0.034
6    OncodriveFM   0.023  0.030  0.010  0.015  0.045
7    PNC           0.153  0.153  0.141  0.094  0.025
8    SCS           NA     0.011  0.005  0.008  NA
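The F Score of Eq. 1 can be computed directly from a set of predicted drivers and a gold-standard gene set such as the CGC. The following is a minimal sketch; the function name and the gene identifiers in the usage note are hypothetical and do not reproduce any value in Table 3.

```python
from typing import Set, Tuple


def f_score(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F score) of predicted drivers against a gold standard."""
    validated = len(predicted & gold)  # predicted drivers that are in the gold standard
    precision = validated / len(predicted) if predicted else 0.0
    recall = validated / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For instance, if a method predicts four genes of which two appear in a gold standard of eight genes, then P = 0.5, R = 0.25, and the F Score is 1/3. Because gold standards such as the CGC are much larger than the validated part of most prediction lists, recall is typically small, which helps explain the low absolute values in Table 3.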

Moreover, to see if the methods detect similar cancer drivers, we compare the results of the five methods used for identifying cancer drivers at the population level (i.e. DriverML, ActiveDriver, DriverNet, MutSigCV, and OncodriveFM). Figure 5 shows the overlap between the validated cancer drivers discovered by each pair of the methods, for each of the five cancer types. It can be seen that there is little overlap among the results of the methods. For example, in breast cancer, only one cancer driver (TP53) is identified by all the five methods, two cancer drivers (CDH1 and PIK3CA) are detected by four methods (DriverML, DriverNet, MutSigCV, and OncodriveFM), and eight cancer drivers (GATA3, NCOR1, PTEN, ARID1A, FOXA1, PIK3R1, CTCF, and ERBB2) are predicted by three methods. As the results of these methods are complementary, they should be used together to maximise the overall performance of cancer driver prediction. In addition, it should be pointed out that although the CGC is popular for validating cancer drivers in cancer research, it is incomplete in the sense that the database is constantly being updated when new cancer drivers come to light. Therefore, although some of the predicted cancer drivers cannot be validated with existing knowledge, they may be novel cancer drivers which are worth wet-lab experiments to confirm their roles in cancer progression.

Taking breast cancer as an example, we combine all the breast cancer drivers predicted by the five methods at the population level (i.e. DriverML, ActiveDriver, DriverNet, MutSigCV, and OncodriveFM), which results in 509 cancer drivers altogether. Among them, 63 drivers are predicted by at least two of the five methods. We use Enrichr (Kuleshov et al., 2016) to perform enrichment analysis of these 63 drivers. Table 4 and Table 5 show the GO biological processes and KEGG pathways in which these cancer drivers are significantly enriched (adjusted p-value < 0.05). Among the 63 driver genes, 16 genes (25.4%) are enriched in 7 GO biological processes and 15 genes (23.8%) are enriched in 26 KEGG pathways related to breast cancer. This indicates that the predicted cancer drivers are closely associated with the biological condition of breast cancer and are biologically meaningful.
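The aggregation step described above, pooling the predictions of several methods and keeping the drivers reported by at least two of them, can be sketched as a simple vote count. The function name, the method labels, and the gene identifiers in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def consensus_drivers(predictions: Dict[str, Set[str]], min_methods: int = 2) -> Dict[str, int]:
    """Map each gene predicted by at least `min_methods` methods to its number of votes.

    predictions: method name -> set of genes that the method predicts as drivers.
    """
    votes = Counter(gene for genes in predictions.values() for gene in genes)
    return {gene: n for gene, n in sorted(votes.items()) if n >= min_methods}
```

With three toy methods predicting {g1, g2, g3}, {g2, g3, g4}, and {g3, g5}, the consensus at `min_methods=2` is g2 (two votes) and g3 (three votes); the size of the union of all predictions corresponds to the pooled count reported above.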

Table 4. GO biological processes involved in breast cancer in which the predicted cancer drivers are enriched

Since the predicted cancer driver genes likely cause carcinogenesis, they could be used as biomarkers to classify tumours. To explore this concept, we use the predicted drivers to stratify breast cancer patients. Among the 63 predicted cancer drivers above, there are four significant genes, AKT1, PTEN, CDKN1B, and TP53, which are enriched in both GO biological processes and KEGG pathways. For instance, AKT1 is enriched in two GO biological processes and 25 KEGG pathways, and PTEN is enriched in two GO biological processes and five KEGG pathways. Thus, we use these four genes for this analysis. In addition, we obtain the BRCA gene expression data and clinical data from (Zhang et al., 2019), and use the Similarity Network Fusion (SNF) method (Wang et al., 2014; Xu et al., 2017b), a popular method for discovering the similarities among patients, to cluster cancer patients. SNF takes the expression of these four genes as input and outputs subtypes of cancer patients. We then analyse the survival outcomes of patients in the classified subtypes. The results indicate that the survival of patients in the different classified subtypes is significantly different (p-value = 0.0245), as shown in Figure 6. Furthermore, the clustering display shows the similarity of samples in each identified subtype and the silhouette plot indicates a good clustering with a large average silhouette width (0.76).
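The average silhouette width used above to judge clustering quality can be illustrated with a minimal sketch. For each sample, the silhouette width compares the mean distance to samples in its own cluster (a) with the lowest mean distance to samples in another cluster (b), giving (b - a) / max(a, b). The sketch assumes Euclidean distance on toy points; the distance used in the case study is not stated here, and the function name and data are hypothetical.

```python
from math import dist
from typing import Dict, List, Sequence


def silhouette_widths(points: Sequence[Sequence[float]], labels: Sequence[int]) -> List[float]:
    """Silhouette width of every point, in the same order as `points`."""
    clusters: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    widths: List[float] = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            widths.append(0.0)  # convention for singletons or a single cluster
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths
```

Averaging the returned values gives the average silhouette width; values close to 1 indicate compact, well separated clusters, which is how the reported width of 0.76 is interpreted above.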

Table 5. KEGG pathways involved in breast cancer in which the predicted cancer drivers are enriched

Fig. 5. Overlap among the cancer drivers predicted by different methods. The charts illustrate the overlap among the cancer drivers at the population level predicted by the five methods (DriverML, ActiveDriver, DriverNet, MutSigCV, and OncodriveFM) w.r.t. the five cancer types, including BRCA, LUAD, LUSC, KIRC, and HNSC (one panel per cancer type; axes: Intersection Size and Set Size). In each chart, the horizontal bars at the bottom left show the number of detected cancer drivers validated by the CGC, and the vertical bars and the dotted lines show the overlap of the validated cancer drivers of the methods. If there is no overlap, a single black dot is shown.

Fig. 6. Survival curves, clustering display, and silhouette plot (2 clusters, n = 753; survival time in months against survival probability for Subtype 1 and Subtype 2). Survival curves are for cancer subtypes identified by using the four predicted cancer drivers, including AKT1, PTEN, CDKN1B, and TP53. The survival curves show the significant difference in the survival of patients of the two subtypes (p-value = 0.0245). The clustering display indicates a high-quality clustering with the similarity of samples in each subtype (i.e. light dots show the similarity of samples). The silhouette plot has a large average silhouette width (0.76/1), indicating the clustering validity when using these four genes.

From the discussion above, we see that there is a wide range of computational methods for identifying cancer drivers from genomic data. In this paper, we categorise the methods into three groups: methods for identifying single cancer drivers (including mutation-based methods and network-based methods), methods for identifying cancer driver modules, and methods for identifying personalised cancer drivers. Although these methods have successfully detected various cancer drivers, there are still several gaps in the research of the field.

Firstly, most of the current methods focus on coding mutations to identify coding cancer drivers while non-coding cancer drivers are not fully examined and the number of methods for identifying non-coding drivers is limited. However, non-coding cancer drivers are important because protein-coding regions account for only around two percent of the human genome. A large proportion of mutations exist in non-coding regions and these mutations can regulate the expression of genes and drive cancer (Puente et al., 2015; Weinhold et al., 2014). In addition to the limited number of non-coding cancer driver identification methods, the current methods focus heavily on non-coding mutations, i.e., correlations of mutations in non-coding elements with other factors like survival (Hornshoj et al., 2018). Nevertheless, cancer drivers can be non-coding RNAs which carry no mutations but regulate other genes to progress cancer, thus it is necessary to investigate non-coding RNAs with and without mutations to detect non-coding cancer drivers.

Secondly, some methods have been developed to identify groups of cancer drivers (Ciriello et al., 2012; Zhang et al., 2013), but they are mostly based on mutations to detect mutated modules, called cancer driver modules. Since the mutation of one member of a module is sufficient to develop cancer, the identified drivers in a module may not in fact work together to regulate their targets to drive cancer. However, there is evidence that some genes work in concert to regulate other genes' expression and influence different biological processes, such as the cooperation of miRNAs in EMT, the transformation of epithelial cells into mesenchymal cells (Cursons et al., 2017; Lamouille et al., 2014). In addition, in some biological processes, the regulation of single genes might not have significant impacts, and research has emerged which uses wet-lab experiments to investigate the regulatory role of group-based regulators in biological processes. All of these highlight the importance of studying biological factors in groups, and computational methods which utilise a variety of data and techniques are in demand for investigating groups of cancer drivers.

Finally, although there have been methods for detecting personalised cancer drivers (Guo et al., 2018, 2019; Hou and Ma, 2014), they still have some limitations. Some methods, such as DawnRank, use the gene network of the population to predict personalised cancer drivers. As a result, they may ignore the information of the gene network specific to an individual patient and may produce many false positives. Other methods, such as SCS and PNC, use personal genetic data to build personalised gene networks but they need the genetic data of a sample pair (i.e. a cancer patient and its neighbour in the normal state).
The neighbour of a cancer patient does not always exist. Thus, the application of these methods is limited. Furthermore, these methods only detect coding cancer drivers while it is also necessary to identify non-coding cancer drivers, as discussed above. All of these indicate that there is a strong need to develop novel computational methods for detecting personalised coding/non-coding cancer drivers.

We have investigated a wide range of computational methods for identifying cancer drivers from genomic data. In addition, the advantages and limitations of the surveyed methods are analysed, based on which we identify various opportunities for the development of research in the field. It is clear that research on computational approaches to cancer driver identification is still in its growth phase. Much more work needs to be done and many opportunities exist in this area. Nevertheless, there are also different challenges in advancing the research in cancer driver identification. Identifying exactly which biological factors drive cancer is quite complicated. Future research needs to focus on both coding and non-coding datasets to identify candidate cancer drivers. To improve the accuracy of novel computational methods, we should combine different types of data, such as gene expression, mutations, and clinical information, to detect cancer drivers.

We have also surveyed available resources which can be used in research on discovering cancer drivers. The existing resources are plentiful but they are fragmented. Thus, to utilise cancer data more effectively for research, policies are required to achieve better data sharing. In addition, another difficulty when developing computational methods for uncovering non-coding cancer drivers is validation. The reason is that most of the current databases are for coding cancer drivers and there is none for non-coding cancer drivers. Therefore, we make an urgent call for the building of databases for non-coding drivers given their crucial role in the success of research in the field.

To evaluate the performance of some current methods in detecting cancer drivers, as well as to provide an example of the evaluation of cancer driver discovery methods for researchers who would like to enter the field, a comparative study has been conducted. From the results of the comparative study, it can be seen that each method can uncover different cancer drivers and the overlaps between the results of the methods are small. Therefore, the methods are complementary, and we should use them together to maximise the effectiveness of cancer driver prediction.
The small overlap also indicates that the methods take different approaches; to achieve significant results, novel methods should combine various resources and techniques in detecting cancer drivers.

In conclusion, although there are now numerous computational methods for discovering cancer drivers, various gaps and opportunities remain for advancing research in the field. However, owing to the complexity of cancer initiation and development, identifying cancer drivers faces many challenges. Through this paper, we hope to help researchers who are interested in the field to establish a solid background and to motivate them to tackle the current challenges.

Acknowledgements

This research is supported by the Australian Government Research Training Program (RTP) Scholarship and the Vice Chancellor & President’s Scholarship offered by the University of South Australia.

Funding

The ARC DECRA (No: 200100200) and the Australian Research Council Discovery Grant (No: DP170101306).

Biographical note

Vu Viet Hoang Pham is a PhD student at UniSA STEM. He received his Master of Information Technology from Deakin University in 2017. His research interests are causal inference and its applications in bioinformatics.

Lin Liu is an associate professor at UniSA STEM. She received her bachelor's and master's degrees in Electronic Engineering from Xidian University, China, in 1991 and 1994 respectively, and her PhD degree in computer systems engineering from UniSA in 2006. Her research interests include data mining, causal discovery and their applications in bioinformatics.

Cameron Bracken is a lab head at the Centre for Cancer Biology, an alliance between SA Pathology and the University of South Australia. His research interests are the mechanisms by which non-coding RNAs regulate EMT.

Gregory Goodall is a professor at the Centre for Cancer Biology, an alliance of SA Pathology and the University of South Australia. He is a world leader in the biology of RNA and cancer progression. He has combined innovation with thoroughness to make discoveries that open new areas in RNA biology for development and exploitation. He has made seminal contributions to the understanding of mechanisms governing gene activity in cancer, through control of mRNA activity, regulation of gene expression by microRNAs, and most recently his discovery of the regulation of circular RNAs. These breakthroughs have widespread implications for understanding gene regulation in biology, particularly in immunity and cancer.

Jiuyong Li is a professor at UniSA STEM. He received his PhD degree in computer science from Griffith University, Australia, in 2002. His research interests are in the fields of data mining, privacy preservation and bioinformatics. His research has been supported by six prestigious Australian Research Council Discovery grants since 2005 and he has published more than 100 research papers.

Thuc Duy Le is a senior lecturer at UniSA STEM. He is also an ARC DECRA fellow in Bioinformatics. He received his PhD degree in Computer Science (Bioinformatics) from UniSA in 2014. His research interests are causal inference and its applications in bioinformatics.

Key points

• Providing a comprehensive survey of cancer driver discovery methods
• Categorising methods for identifying cancer drivers into three groups: methods for identifying single cancer drivers, methods for identifying cancer driver modules, and methods for identifying personalised cancer drivers
• Introducing several resources for cancer driver identification research
• Performing a case study to compare the performance of the current methods for identifying cancer drivers and analysing their results
• Analysing the advantages and limitations of the current methods as well as identifying the opportunities and challenges in developing reliable cancer driver discovery methods
