Algorithmic Methods to Infer the Evolutionary Trajectories in Cancer Progression
Giulio Caravagna, Alex Graudenzi, Daniele Ramazzotti, Rebeca Sanz-Pamplona, Luca De Sano, Giancarlo Mauri, Victor Moreno, Marco Antoniotti, Bud Mishra
AAlgorithmic methods to infer the evolutionary trajectoriesin cancer progression
Giulio Caravagna , ,(cid:63) Alex Graudenzi , Daniele Ramazzotti Rebeca Sanz-Pamplona Luca De Sano Giancarlo Mauri , Victor Moreno , Marco Antoniotti , Bud Mishra November 17, 2018 Department of Informatics, Systems and Communication, University of Milan-Bicocca, Milan, Italy. School of Informatics, University of Edinburgh, Edinburgh, UK. Institute of Molecular Bioimaging and Physiology of the Italian National Research Council (IBFM-CNR), Milan,Italy. Unit of Biomarkers and Susceptibility, Cancer Prevention and Control Program, Catalan Institute of Oncology(ICO), IDIBELL & CIBERESP. Hospitalet de Llobregat, Barcelona, Spain. SYSBIO Centre of Systems Biology, Milano, Italy. Department of Clinical Sciences, Faculty of Medicine, University of Barcelona, Barcelona, Spain. Milan Center for Neuroscience, University of Milan-Bicocca, Milan, Italy. Courant Institute of Mathematical Sciences, New York University, New York, USA. (cid:63)
Corresponding author, [email protected] . Contents
PicNic pipeline 6
PicNic ’s implementation for COADREAD samples 31
C.1 TCGA COADREAD project data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31C.2 Driver events selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32C.3 Mutual exclusivity groups of alterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 a r X i v : . [ q - b i o . GN ] M a r .4 CAPRI’s execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 D Statistical validation of the models 34
D.1 Edge p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34D.2 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35D.3 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
E Supplementary Tables and Figures 37
List of Figures
PicNic pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 The
PicNic pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Data processed for MSI-HIGH tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244 Progression of MSS tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 Progression of MSI-HIGH tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26S6
PicNic pipeline processing MSI-HIGH/MSS tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40S7 Mutations annotated by TCGA for the driver genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41S8 Lolliplot diagrams of TCGA mutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42S9 Groups of exclusive alterations for MSS tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43S10 Selected data for MSS tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44S11 Non-parametric bootstrap scores for MSS progression . . . . . . . . . . . . . . . . . . . . . . . . . . . 45S12 Non-parametric bootstrap scores for MSI-HIGH progression . . . . . . . . . . . . . . . . . . . . . . . . 46S13 COADREAD statistics for models confidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47S14 Entropy loss for MSI-HIGH/MSS models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48S15 Prediction error for each parent set of BIC and AIC models of MSS tumors . . . . . . . . . . . . . . . 49S16 Prediction error for each parent set of BIC and AIC models of MSI tumors . . . . . . . . . . . . . . . 50S17 Models and the phenotype that they might explain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 bstract The genomic evolution inherent to cancer relates directly to a renewed focus on the vo-luminous next generation sequencing (NGS) data, and machine learning for the inference ofexplanatory models of how the (epi)genomic events are choreographed in cancer initiation anddevelopment. However, despite the increasing availability of multiple additional -omics data,this quest has been frustrated by various theoretical and technical hurdles, mostly stemmingfrom the dramatic heterogeneity of the disease. In this paper, we build on our recent workson “selective advantage” relation among driver mutations in cancer progression and investi-gate its applicability to the modeling problem at the population level. Here, we introducePiCnIc (Pipeline for Cancer Inference), a versatile, modular and customizable pipeline to ex-tract ensemble-level progression models from cross-sectional sequenced cancer genomes. Thepipeline has many translational implications as it combines state-of-the-art techniques for sam-ple stratification, driver selection, identification of fitness-equivalent exclusive alterations andprogression model inference. We demonstrate PiCnIc’s ability to reproduce much of the currentknowledge on colorectal cancer progression, as well as to suggest novel experimentally verifiablehypotheses.
Keywords:
Cancer evolution; Selective advantage; Bayesian Structural Inference
Statement of significance:
A causality based new machine learning Pipeline for Cancer Infer-ence (
PicNic ) is introduced to infer the underlying somatic evolution of ensembles of tumors fromnext generation sequencing data.
PicNic combines techniques for sample stratification, driver selec-tion and identification of fitness-equivalent exclusive alterations to exploit a novel algorithm basedon Suppes’ probabilistic causation. The accuracy and translational significance of the results arestudied in details, with an application to colorectal cancer.
PicNic pipeline has been made publiclyaccessible for reproducibility, interoperability and for future enhancements.
Since the late seventies evolutionary dynamics, with its interplay between variation and selection,has progressively provided the widely-accepted paradigm for the interpretation of cancer emergenceand development [1–3]. Random alterations of an organism’s (epi)genome can sometimes confera functional selective advantage to certain cells, in terms of adaptability and ability to surviveand proliferate. Since the consequent clonal expansions are naturally constrained by the avail-ability of resources (metabolites, oxygen, etc.), further mutations in the emerging heterogeneoustumor populations are necessary to provide additional fitness of different kinds that allow survivaland proliferation in the unstable micro environment. Such further advantageous mutations willeventually allow some of their sub-clones to outgrow the competing cells, thus enhancing tumor’sheterogeneity as well as its ability to overcome future limitations imposed by the rapidly exhaust-ing resources. Competition, predation, parasitism and cooperation have been in fact theorized asco-present among cancer clones [4].In the well-known vision of Hanahan and Weinberg [5, 6], the phenotypic stages that charac-terize this multistep evolutionary process are called hallmarks . These can be acquired by cancercells in many possible alternative ways, as a result of a complex biological interplay at severalspatio-temporal scales that is still only partially deciphered [7]. In this framework, we distinguish For this and other technical terms commonly used in the statistics and cancer biology communities we providea Glossary in the Supplementary Material. drivers ) by activating oncogenes or in-activating tumor suppressor genes , from those that are transferred to sub-clones without increasingtheir fitness (i.e., passengers ) [8]. Driver identification is a modern challenge of cancer biology, asdistinct cancer types exhibit very different combinations of drivers, some cancers display mutationsin hundreds of genes [9], and the majority of drivers is mutated at low frequencies (“long tail”distribution), hindering their detection only from the statistics of the recurrence at the population-level [10].Cancer clones harbour distinct types of alterations. The somatic (or genetic ) ones involveeither few nucleotides or larger chromosomal regions. They are usually catalogued as mutations - i.e., single nucleotide or structural variants at multiple scales (insertions, deletions, inversions,translocations) – of which only some are detectable as
Copy Number Alterations (CNAs), mostprevalent in many tumor types [11]. Also epigenetic alterations, such as
DNA methylation and chromatin reorganization , play a key role in the process [12]. The overall picture is confoundedby factors such as genetic instability [13], tumor-microenvironment interplay [14, 15], and by theinfluence of spatial organization and tissue specificity on tumor development [16] .Significantly, in many cases, distinct driver alterations can damage in a similar way the same functional pathway , leading to the acquisition of new hallmarks [17–21]. Such alterations individu-ally provide an equivalent fitness gain to cancer cells, as any additional alteration hitting the samepathway would provide no further selective advantage. This dynamic results in groups of driveralterations that form mutually exclusive patterns across tumor samples from different patients (i.e.,the sets of alterations that are involved in the same pathways tend not to occur mutated together).This phenomenon has significant translational consequences.An immediate challenge posed by this state of affairs is the dramatic heterogeneity of cancer, bothat the inter-tumor and at the intra-tumor levels [22]. The former manifests as different patients withthe same cancer type can display few common alterations. This obsersvation led to the developmentof techniques to stratify tumors into subtypes with different genomic signatures, prognoses andresponse to therapy [23]. The latter form of heterogeneity refers to the observed genotypic andphenotypic variability among the cancer cells within a single neoplastic lesion, characterized by thecoexistence of more than one cancer clones with distinct evolutionary histories [24].Cancer heterogeneity poses a serious problem from the diagnostic and therapeutic perspectiveas, for instance, it is now acknowledged that a single biopsy might not be representative of otherparts of the tumor, hindering the problem of devising effective treatment strategies [4]. Therefore,presently the quest for an extensive etiology of cancer heterogeneity and for the identification ofcancer evolutionary trajectories is central to cancer research, which attempts to exploit the massiveamount of sequencing data available through public projects such as The Cancer Genome Atlas ( TCGA ) [25].Such projects involve an increasing number of cross-sectional (epi)genomic profiles collected viasingle biopsies of patients with various cancer types, which might be used to extract trends of cancerevolution across a population of samples . Higher resolution data such as multiple samples collectedfrom the same tumor [24], as well as single-cell sequencing data [26], might be complementarilyused to face the same problem within a specific patient. However, the lack of public data coupled We mention that much attention has been recently casted on newly discovered cancer genes affecting globalprocesses that are apparently not directly related to cancer development, such as cell signaling, chromatin andepigenomic regulation, RNA splicing, protein homeostasis, metabolism and lineage maturation [10]. At the time of this writing, in TCGA, sample sizes per cancer type are in the order of a few hundreds. Suchnumbers are expected to increase in the near future, with a clear benefit for all the statistical approaches to analyzecancer data which currently lack a proper background of data.
4o the problems of accuracy and reliability, currently prevents a straightforward application [27].These different perspectives lead to the different mathematical formulations of the problem of inferring a cancer progression model from genomic data, and a need for versatile computational toolsto analyze data reproducibly – two intertwined issues examined at length in this paper [28]. Indeed,such models and tools can be focused either on characteristics of a population, i.e. ensemble-level ,or on multiple clonality in a single-patient . In general, both problems deal with understanding the temporal ordering of somatic alterations accumulating during cancer evolution, but use orthogonalperspectives and different input data – see Figure 1 for a comparison. This paper proposes anew computational approach to efficiently deal with various aspects of the problem at a patientpopulation level, relegating the other aspects to future publications.
Ensemble-level cancer evolution.
It is thus desirable to extract a probabilistic graphical model explaining the statistical trend of accumulation of somatic alterations in a population of n cross-sectional samples collected from patients diagnosed with a specific cancer. To normalize againstthe experimental conditions in which tumors are sampled, we only consider the list of alterationsdetected per sample – thus, as 0/1 Bernoulli random variables.Much of the difficulty lies in estimating the true and unknown trends of selective advantage among genomic alterations in the data, from such observations. This hurdle is not unsurmountable,if we constrain the scope to only those alterations that are persistent across tumor evolution in allsub-clonal populations , since it yields a consistent model of a temporal ordering of mutations.Therefore, epigenetic and trascriptomic states, such as hyper and hypo-methylations or over andunder expression, could only be used, provided that they are persistent through tumor development[29].Historically, the linear model of colorectal tumor progression by Vogelstein is an instance ofan early solution to the cancer progression problem [30]. That approach was later generalized toaccommodate tree-models of branched evolution [31–34] and later, further generalized to the infer-ence of directed acyclic graph models, with several distinct strategies [35–38]. We contributed tothis research program with the Cancer Progression Extraction with Single Edges ( CAPRESE ) and the
Cancer Progression Inference ( CAPRI ) algorithms, which are currently implemented in
TRONCO , anopen source R package for Translational Oncology available in standard repositories [39–41]. Bothtechniques rely on Suppes’ theory of probabilistic causation to define estimators of selective advan-tage [42], are robust to the presence of noise in the data and perform well even with limited samplesizes. The former algorithm exploits shrinkage-like statistics to extract a tree model of progression,the latter combines bootstrap and maximum likelihood estimation with regularization to extractgeneral directed acyclic graphs that capture branched, independent and confluent evolution. Bothalgorithms represent the current state-of-the-art approach to this problem, as they outperformothers in speed, scale and predictive accuracy.
Clonal architecture in individual patients.
A closely related problem addresses the detectionof clonal signatures and their prevalence in individual tumors, a problem complicated by intra-tumor heterogeneity.Even though this phylogenetic version of the progression inference problem naturally relies ondata produced from single-cell sequencing assays [43, 44], the majority of approaches still make useof bulk sequencing data, usually from multiple biopsies of the same tumors [24, 45]. Indeed, severalapproaches try to extract the clonal signature of single tumors from allelic imbalance proportions ,5 problem made difficult as sequenced samples usually contain a large number of cells belonging toa collection of sub-clones resulting from the complex evolutionary history of the tumor [46–55].We keep the current work focused on the inference of progression models at the ensemble level,and plan to return to this variant to the problem in another publication.
PicNic pipeline
We report on the design, development and evaluation of the
Pipeline for Cancer Inference ( PicNic )to extract ensemble-level cancer progression models from cross-sectional data (Figure 1).
PicNic is versatile, modular and customizable; it exploits state-of-the-art data processing and machinelearning tools to:1. identify tumor subtypes and then in each subtype;2. select (epi)genomic events relevant to the progression;3. identify groups of events that are likely to be observed as mutually exclusive ;4. infer progression models from groups and related data, and annotate them with associatedstatistical confidence.All these steps are necessary to minimize the confounding effects of inter-tumor heterogeneity, whichare likely to lead to wrong results when data is not appropriately pre-processed .In each stage of PicNic different techniques can be employed, alternatively or jointly, accordingto specific research goals, input data, and cancer type. Prior knowledge can be easily accommo-dated into our pipeline, as well as the computational tools discussed in the next subsections andsummarized in Figure 2. The rationale is similar in spirit to workflows implemented by consortiasuch as
TCGA to analyze huge populations of cancer samples [56, 57]. One of the main noveltiesof our approach, is the exploitation of groups of exclusive alterations as a proxy to detect fitness-equivalent trajectories of cancer progression. This strategy is only feasible by the hypothesis-testingfeatures of the recently developed
CAPRI algorithm, an algorithm uniquely addressing this crucialaspect of the ensemble-level progression inference problem [40].In the Results section, we study in details a specific use-case for the pipeline, processing colorectalcancer data from
TCGA , where it is able to re-discover much of the existing body of knowledgeabout colorectal cancer progression. Based on the output of this pipeline, we also propose novelexperimentally-verifiable hypotheses.
In general, for each of n tumors (patients) we assume relevant (epi)genetic data to be available. Wedo not put constraints on data gathering and selection, leaving the user to decide the appropriate“resolution” of the input data. For instance, one might decide whether somatic mutations shouldbe classified by type or by location, or aggregated. Or, one might decide to lift focal CNAs to The genuine selectivity relationship sought to be inferred are subject to the vagaries of Simpson’s paradox; itcan change, or worst reverse, when we try to infer them from data not suitably pre-processed. This effect (due tosuch paradox) manifests as data are sampled from a highly heterogenous mixture of populations of cells [40]. PiCnIcuses various mechanisms to avoid these pitfalls. In this context, it should be pointed out that input bulk sequencingdata suffers also from intra-tumor heterogeneity issues, which are unfortunately intrinsic to the technology. heterogeneous mixture of input samples. In some cases the classification can benefit from clinical biomarkers, such asevidences of certain cell types [59], but in most cases we will have to rely on multiple clustering techniques at once, see, e.g., [56,57]. Many common approaches cluster expression profiles [60], oftenrelying on non-negative matrix factorization techniques [61] or earlier approaches such as k -means,Gaussians mixtures or hierarchical/spectral clustering - see the review in [62]. For glioblastomaand breast cancer, for instance, mRNA expression subtypes provides good correlation with clinicalphenotypes [63–65]. However, this is not always the case as, e.g., in colorectal cancer such clustersmismatch with survival and chemotherapy response [63]. Clustering of full exome mutation profilesor smaller panels of genes might be an alternative as it was shown for ovarian, uterine and lungcancers [66, 67].Using pipelines such as PicNic , we expect that the resulting subtypes will be routinely in-vestigated, eventually leading to distinct progression models, which shall be characteristic of thepopulation-level trends of cancer initiation and progression.
In subtypes detection, it becomes easier to find similarities across input samples when more al-terations are available, as features selection gains precision. In progression inference, instead, onewishes to focus on m (cid:28) n driver alterations, which ensure also an appropriate statistical ratiobetween sample size ( n , here the subtype size) and problem dimension ( m ).Multiple tools filter out driver from passenger mutations. MutSigCV identifies drivers mutatedmore frequently than background mutation rate [68].
OncodriveFM avoids such estimation butlooks for functional mutations [69].
OncodriveCLUST scans mutations clustering in small regionsof the protein sequence [70].
MuSiC uses multiple types of clinical data to establish correlationsamong mutation sites, genes and pathways [71]. Some other tools search for driver CNAs that affectprotein expression [72]. All these approaches use different statistical measures to estimate signs ofpositive selection, and we suggest using them in an orchestrated way, as done by platforms such as
Intogen [73].We anticipate that such tools will run independently on each subtype, as driver genes will likelydiffer across them, mimicking the different molecular properties of each group of samples; also, listsof genes produced by these tools might be augmented with prior knowledge about tumor suppressorsor oncogenes.
When working at the ensemble-level, identification of “groups of mutually exclusive” alterations iscrucial to derive a correct inference. This step of
PicNic is another attempt to resolve part of theinter-tumor heterogeneity, as such alterations could lead to the same phenotype (i.e., hence resulting“equivalent” in terms of progression), despite being genotypically “alternative”, i.e., exclusive, acrossthe input cohort. This information shall be used to detect alternative routes to cancer progressionwhich capture the specificities of individual patients.7 plethora of recent tools can be used to detect groups of fitness equivalent alterations, accordingto the data available for each subtype; greedy approaches [74, 75] or their optimizations, such as
MEMO , which constrain search-space with network priors [76]. This strategy is further improvedin
MUTEX , which scans mutations and focal CNAs for genes with a common downstream effectin a curated signalling network, and selects only those genes that significantly contributes to theexclusivity pattern [77]. Other tools such as
Dendrix , MDPFinder , Multi-Dendrix , CoMEt , MEGSA or ME , employ advanced statistics or generative approaches without priors [78–83].In such groups, we distinguish between hard and soft forms of exclusivity, the former assumingstrict exclusivity among alterations, with random errors accounting for possible overlaps (i.e., themajority of samples do not share alterations from such groups), the latter admitting co-occurrences(i.e., some samples might have common alterations, within a group) [77]. CAPRI is currently the only algorithm which incorporates this type of information, in inferringa model. Each of these groups are in fact associated with a “testable hypothesis” written in the well-known language of propositional Boolean formulas . Consider the following example: we might beinformed that apc and ctnnb1 mutations show a trend of soft-exclusivity in our cohort – i.e., somesamples harbor both mutations, but the majority just one of the two mutated genes. Since suchmutations lead to β -catenin deregulation (the phenotype), we might wonder whether such state ofaffairs could be responsible for progression initiation in the tumors under study. An affermativeresponse would equate, in terms of progression, the two mutations. To test this hypothesis , one mayspell out formula apc ∨ ctnnb1 to CAPRI , which means that we are suggesting to the inferenceengine that, besides the possible evolutionary trajectories that might be inferred by looking at thetwo mutations as independent , trajectories involving such a “composite” event, shall be consideredas well. It is then up to
CAPRI to decide which, of all such trajectories, is significant, in a statisticalsense .In general, formulas allow users to test general hypotheses about complex model structuresinvolving multiple genes and alterations. These are useful in many cases: for instance, wherewe are processing samples which harbour homozygous losses or inactivating mutations in certaingenes (i.e., equally disruptive genomic events), or when we know in advance that certain genes arecontrolling the same pathway, and we might speculate that a single hit in one of those decreases theselection pressure on the others. We note that, with no hypothesis, a model with such alternativetrajectories cannot be analyzed, due to various computational limitations inherent to the inferentialalgorithms (see [40]).From a practical point of view,
CAPRI ’s formulas/hypotheses-testing features “help” the inferenceprocess, but do not “force” it to select a specific model, i.e., the inference is not biased . In thissense, the trajectories inferred by examining these composite model structures (i.e., the formulas) are not given any statistical advantage for inclusion in the final model. However, in spite of a naturaltemptation to generate as many hypotheses as possible, it is prudent to always limit the numberof hypotheses according to the number of samples and alterations. Note that this approach canalso be extended to accommodate, for instance, co-occurrent alterations in significantly mutatedsubnetworks [84, 85]. There, logical connectives such as ⊕ (the logical “xor”) act as a proxy for hard-exclusivity, and ∨ (the logical“disjunction”) for soft one. Besides from exclusivity groups, other connectives such as logical conjunction can beused. .4 Progression inference and confidence estimation We use
CAPRI to reconstruct cancer progression models of each identified molecular subtype, pro-vided that there exist a reasonable list of driver events and the groups of fitness-equivalent exclusivealterations. Since currently
CAPRI represents the state of the art, and supports complex formulasfor groups of alterations detected in the earlier
PicNic step, it was well-suited for the task.
CAPRI ’s input is a binary n × ( m + k ) matrix M with n samples (a subtype size), m driveralteration events (0/1 Bernoulli random variables) and k testable formulas. Each sample in M isdescribed by a binary sequence: the 1’s denote the presence of alterations. CAPRI first performs acomputationally fast scan of M to identify a set S of plausible selective advantage relations amongthe driver alterations and the formulas; then, it reduces S to the most relevant ones, ˆ S ⊂ S
Eachrelation is represented as an edge connecting drivers/formulas in a Graphical Model – which shall betermed Suppes-Bayes Causal Network. This network represents the joint probability distribution ofobserving a set of driver alterations in a cancer genome, subject to constraints imposed by Suppes’ probabilistic causation formalism [42].Set S is built by a statistical procedure. Among any pair of input drivers/formulas x and y , CAPRI postulates that x → y ∈ S could be a selective advantage relation with “ x selecting for y ” ifit estimates that two conditions hold1. “ x is earlier than y ”;2. “ x ’s presence increases the probability of observing y ”.Such claims, grounded in Suppes’ theory of probabilistic causation, are expressed as inequalitiesover marginal and conditional distributions of x and y . These are assessed via a standard Mann-Withney U test after the distributions are estimated from a reasonable number (e.g., 100) of non-parametric bootstrap resamples of M (see Supplementary Material). CAPRI ’s increased performanceover existing methods can be motivated by the reduction of the state space within which modelsare searched, via S .Optimization of S is central to our tolerance to false positives and negatives in ˆ S . We would liketo select only the minimum number of relations which are true and statistically supported, and buildour model from those. CAPRI ’s implementation in
TRONCO [41] selects a subset by optimizing a score function which assigns to a model a real number equal to its log-likelihood (probability ofgenerating data for the model) minus a penalty term for model complexity – a regularization termincreasing with ˆ S ’s size, and hence penalizing overly complex models. It is a standard approach toavoid overfitting, and usually relies on the Akaike or the
Bayesian Information Criterion ( AIC or BIC )as regularizers. Both scores are approximately correct;
AIC is more prone to overfitting but likelyto provide also good predictions from data and is better when false negatives are more misleadingthan positive ones.
BIC is more prone to underfitting errors, thus more parsimonious and better Technically, for a set of m alterations modeled by variables x , . . . , x m , such a network is a Graphical Modelrepresenting the factorization of the joint distribution – P ( x , . . . , x m ) – of observing any of the alterations in agenome (i.e., x i = 1 ). This factorization is made compact as the model encodes the statistical dependencies in itsstructure via P ( x , . . . , x m ) = m (cid:89) i =1 P ( x i | π i ) where π i = { x j | x j → x i ∈ ˆ S} are the “parents” of the i -th node. These are those from which the presence of the i -th alteration is predicted. In our approach these edges are the pictorial representation of the selective advantagerelations where the alterations in π i select for x i .
9n opposite direction. As often done, we suggest approaches that to combine but distinguish whichrelations are selected by
BIC versus
AIC . Details on the algorithm are provided as see SupplementaryMaterial.
Statistical confidence of a model.
In-vitro and in-vivo experiments provide the most convinc-ing validation for the newly suggested selective advantage relations and hypotheses, yet this is outof reach in some cases.Nonetheless, statistical validation approaches can be used almost universally to assess the con-fidence of edges, parent sets and whole models, either via hypothesis-testing or bootstrap and cross-validation scores for Graphical Models. We briefly discuss approaches that are implemented in TRONCO , and refer to the Supplementary Materials for additional details.First,
CAPRI builds S by computing two p-values per edge, for the confidence in condition (1) and (2) . In addition, for each edge x → y , it computes a third p-value via hypergeometric testingagainst the hypothesis that the co-occurrence of x and y is due to chance. These p-values measureconfidence in the direction of each edge and the amount of statistical dependence among x and y .Second, for each model inferred with CAPRI we can estimate ( a posteriori ) how frequently ouredges would be retrieved if we resample from our data ( non-parametric bootstrap), or from themodel itself, assuming its correctness ( parametric bootstrap) [86]. Also, we can measure the biasin
CAPRI ’s construction of S due to the random procedure which estimates the distributions incondition (1) and (2) ( statistical bootstrap).Third, scores can be computed to quantify the consistency for the model against bias in thedata and models. For instance, non-exhaustive k -fold cross-validation can be used to compute the entropy loss for the whole model, and the prediction and posterior classification errors for each edgeor parent set [87]. It is common knowledge that colorectal cancer (CRC) is a heterogeneous disease comprising differentmolecular entities. Indeed, it is currently accepted that colon tumors can be classified accordingto their global genomic status into two main types: microsatellite unstable tumors (MSI), fur-ther classified as high or low, and microsatellite stable (MSS) tumors (also known as tumors with chromosomal instability ). This taxonomy plays a significant role in determining pathologic, clinicaland biological characteristics of CRC tumors [88]. Regarding molecular progression, it is also wellestablished that each subtype arises from a distinctive molecular mechanism. While MSS tumorsgenerally follow the classical adenoma-to-carcinoma progression described in the seminal work byVogelstein and Fearon [89], MSI tumors result from the inactivation of DNA mismatch repair geneslike mlh -1 [90].With the aid of the
TRONCO package, we instantiated
PicNic to process colorectal tumorsfreely available through
TCGA project
COADREAD [56] (see Supplementary Figure S1), and in-ferred models for the MSS and MSI-HIGH tumor subtypes (shortly denoted MSI) annotated by theconsortium. In doing so, we used a combination of background knowledge produced by
TCGA andnew computational predictions; to a different degree, some knowledge comes from manual curationof data and other from tools mentioned in
PicNic ’s description (see Figure 2). Data and exclusiv-10ty groups for MSI tumors are shown in Figure 3, the analogous for MSS tumors is provided asSupplementary Material.For the models inferred, which are shown in Figures 4 and 5, we evaluated various forms ofstatistical confidence measured as p-values, bootstrap scores (in what follows, npb denotes non-parametric bootstrap and the closer to 100 the better), and cross-validation statistics reportedin the Supplementary Material. Many of the postulated selective advantage relations (i.e., modeledges) have very strong statistical support for
COADREAD samples, although events with similarmarginal frequency may lead to ambiguous imputed temporal ordering (i.e., the edge direction). Ingeneral, we observed that overall the estimates are slightly better in the MSS cohort (entropy loss < versus . ), which is expected given the difference in sample size of the two datasets (152versus 27 samples), see Material and Methods for details. Interpretation of the models.
Our models capture the well-known features distinguishing MSSand MSI tumors: for the former apc , kras and tp53 mutations as primary events together withchromosomal aberrations, for the latter braf mutations and lack of chromosomal alterations. Of all33 driver genes, 15 are common to both models - e.g., apc , braf , kras , nras , tp53 and fam123b among others (mapped to pathways like wnt , mapk , apoptosis or activation of T-cell lymphocites),although in different relationships (position in the model), whereas new (previously un-implicated)genes stood out from our analysis and deserve further research. MSS (Microsatellite Stable).
In agreement with the known literature, in addition to kras , tp53 and apc as primary events, we identify pten as a late event in the carcinogenesis, as well as nras and kras converging in igf2 amplification, the former being “selected by” tp53 muta-tions ( npb pik3ca mutations ( npb wnt genes, in agreement with the observation that multiple con-current lesions affecting such pathway confer selective advantage. In this respect, our modelpredicts multiple routes for the selection of alterations in sox9 gene, a transcription factorknown to be active in colon mucosa [91]. Its mutations are directly selected by apc / ctnnb1 alterations (though with low npb score), by arid1a ( npb fbxw7 mutations ( npb ctnnb1 , relatesto sox9 . The sox family of transcription factors have emerged as modulators of canoni-cal wnt / β -catenin signaling in many disease contexts [92]. Also interestingly, fbxw7 hasbeen previously reported to be involved in the malignant transformation from adenoma tocarcinoma [93]. The rightmost part of the model involves genes from various pathways, andoutlines the relation between kras and the pi3k pathway. We indeed find selection of pik3ca mutations by kras ones, as well as selection of the whole MEMO module ( npb pi3k pathway [56]. smad4 proteins relate either to kras ( npb fam123b (through atm ) and tcf7l2 converge in dkk2 or dkk4 ( npb
81, 17and 34%).
MSI-HIGH (Microsatellite Unstable).
In agreement with the current literature, braf is the mostcommonly mutated gene in MSI tumors [94].
CAPRI predicted convergent evolution of tu-mors harbouring fbxw7 or apc mutations towards deletions/mutations of nras gene ( npb
21, 28 and 54%), as well as selection of smad2 or smad4 mutations by fam123b mutations( npb
23 and 46%), for these tumors. Relevant to all MSI tumors seems again the role ofthe pi3k pathway. Indeed, a relation among apc and pik3ca mutations was inferred ( npb apc and the whole
MEMO module ( npb apc and erbb3 select for kras mutations ( npb
51 and 27%),which might point to interesting therapeutic implications. In contrast, mutations in braf mostly select for mutations in acvr1b ( npb smad proteins. It forms receptor complex with acvr2a , a gene mutated in thesetumors that selects for tcf7l2 mutations ( npb tp53 mutationsare those selected by mutations in axin2 ( npb wnt signallingpathway, and related to unstable gastric cancer development [96]. Inactivating mutations inthis gene are important, as it provides serrated adenomas with a mutator phenotype in theMSI tumorigenic pathway [97]. Thus, our results reinforce its putative role as driver gene inthese tumors.By comparing these models we can find similarity in the prediction of a potential new early eventfor CRC formation, fbxw7 , as other authors have recently described [93]. This tumor suppressoris frequently inactivated in human cancers, yet the molecular mechanism by which it exerts itsanti-tumor activity remains unexplained [98], and our models provide a new hypothesis in thisrespect. This paper represents our continued exploration of the nature of somatic evolution in cancer, andits translational exploitation through models of cancer progression, models of drug resistance (andefficacy), left- and right-censoring, sample stratification, and therapy design. Thus this paper em-phasizes the engineering and dissemination of production-quality computational tools as well asvalidation of its applicability via use-cases carried out in collaboration with translational collabo-rators: e.g., colorectal cancer, analyzed jointly with epidemiologists currently studying the diseaseactively. As anticipated, we reasserted that the proposed model of somatic evolution in cancer notonly supports the heterogeneity seen in tumor population, but also suggests a selectivity/causalityrelation that can be used in analyzing (epi)genomic data and exploited in therapy design – which weintroduced in our earlier works [39, 40]. In this paper, we have introduced an open-source pipeline,
PicNic , which minimizes the confounding effects arising from inter-tumor heterogeneity, and we haveshown that
PicNic can be effective in extracting ensemble-level evolutionary trajectories of cancerprogression.When applied to a highly-heterogeneous cancer such as colorectal,
PicNic was able to inferthe role of many known events in colorectal cancer progression (e.g., apc , kras or tp53 in MSStumors, and braf in MSI ones), confirming the validity of our approach . Interestingly, new playersin CRC progression stand out from this analysis such as fbxw7 or axin2 , which deserve furtherinvestigation. In colon carcinogenesis, although each model identifies characteristic early mutationssuggesting different initiation events, both models appear to converge in common pathways andfunctions such as wnt or mapk . As a further investigation for CRC, we leave as future work to check whether the inferred progression are alsorepresentative of other subtyping strategies for colorectal cancer, with particular reference to recent works whichshow marked interconnectivity between different independent classification systems coalescing into four consensusmolecular subtypes [99]. ctnnb1 or in pten , a well-known tumor suppressor gene. Onthe contrary, specific mutations in MSI tumors appear in membrane receptors such as acvr1b , acvr2a , erbb3 , lrp5 , tgfbr1 and tgfbr2 , as well as in secreted proteins like igf2 , possiblysuggesting that such tumors need to disturb cell-cell and/or cell-microenvironment communicationto grow. At the pathway level, genes exclusively appearing in the MSI progression model accumulatein specific pathways such as cytokine-cytokine receptor, endocytosis and tgf - β signaling pathway.On the other hand, genes in MSS progression model are implicated in p53 , mTOR , sodium transportor inositol phosphate metabolism.Our study also highlighted the translational relevance of the models that we can produce with PicNic (see Supplementary Figure S12). The evolutionary trajectories depicted by our models can,for instance, suggest previously-uncharacterized phenotypes, help in finding biomarker moleculespredicting cancer progression and therapy response, explain drug resistant phenotypes and predictmetastatic outcomes. The logical structure of the formulas describing alterations with equivalentfitness (i.e., the exclusivity group) can also point to novel targets of therapeutic interventions.In fact, exclusivity groups that are found to have a role in the progression can be screened for synthetic lethality among such genes – thus explaining why we do not observe phenotypes wheresuch alterations co-occur. In this sense, our models describe also such clonal signatures which,though theoretically possible, are not selected. We call such conspicuously absent phenotypes anti-hallmarks [100].Our models have other applications to both computational and cancer research. Our models,as encoded by Suppes-Bayes Causal Networks could be used as informative generative models forthe genomic profiles for the cancer patients. In fact, as known in machine learning, such generativemodels are extremely useful in creating better representation of data in terms of, e.g., discriminativekernels, such as Fisher [101]. In practice, this change of representations would allow framing commonclassification problems in the domain of our generative structures, i.e., the models, rather than thedata. As a consequence, it is possible to create a new class of more robust classification andprediction systems.One may think of these representations as those bringing us closer to phenotypic (and causal)representation of the patient’s tumor, replacing its genotypic (and mutational) representation. Wesuspect that such representations will improve the accuracy of measurement of the biological clocks,dysregulated in cancer and critically needed to be measured in order to predict survival time, timeto metastasis, time to evolution of drug resistance, etc. We believe that these “phenotypic clocks”can be used immediately to direct the therapeutic intervention.Clearly, applicability and reliability of techniques such as
PicNic is very much dependent onthe background of data available. At the time of this writing, the quality, quantity and reliabilityof (epi)genomic data available, e.g., in public databases, is related to the ever increasing com-putational and technological improvements characterizing the wide area of cancer genomics. Ofsimilar importance is the availability of wet-lab technologies for models validation. Our recent workon SubOptical Mapping technology, for instance, points to the ability to cheaply and accuratelycharacterize translocation, indels and epigenomic modifications at the single molecule and singlecell level [102, 103]. This technology also provides the ability to directly validate (or refute) thehypotheses generated by
PicNic via gene-correction and single cell perturbation approaches.To conclude, the precision of any statistical inference technique, including
PicNic , is influencedby the quality, availability and idiosyncrasies of the input data – the goodness of the outcomesimproving along with the expected advancement in the field. Nevertheless, the strength of the13roposed approach lies in the efficacy in managing possibly noisy/ biased or insufficient data, andin proposing refutable hypotheses for experimental validation.
Processing COADREAD samples with PiCnIc.
We instantiated
PicNic to process clinicallyannotated high MSI-HIGH and MSS colorectal tumors collected from The Cancer Genome Atlasproject “Human Colon and Rectal Cancer” (COADREAD) [56] – see Supplementary Figure S1.Details on the implementation and the source code to replicate this study are available as Supple-mentary Material. COADREAD has enough samples, especially for MSS tumors, to implement aconsistent and significant statistical validation of our findings – see Supplementary Table S1.In brief, we split subtypes by the microsatellite status of each tumor as annotated by the con-sortium (so, step I of
PicNic is done by exploiting background knowledge rather than computationalpredictors). It should be expected that if this step is skipped or this classification is incorrect, theresulting models would noticeably differ. Once split into groups, the input COADREAD data isprocessed to maintain only samples for which both high-quality curated mutation and CNA dataare available; for CNAs we use focal high-level amplifications and homozygous deletions.Then, for each sample we select only alterations (mutations/CNAs) from a list of 33 driver genesmanually annotated to 5 pathways in [56] - wnt , raf , tgf - β , pi3k and p53 (Supplementary FiguresS2 and S3). This list of drivers, step II of PicNic , is produced by TCGA, as a result of manualcuration and running MutSigCV.In the next module of the pipeline, we fetch groups of exclusive alterations. We scanned thesegroups by using the MUTEX tool (Supplementary Table S2), and merged its results with thegroup that TCGA detected by using the MEMO tool, which involves mainly genes from the pi3k pathway. Knowledge on the potential exclusivity among genes in the wnt ( apc , ctnnb1 ) and raf ( kras , nras , braf ) pathways was exploited as well. Groups were then used to create CAPRI’sformulas; we also included hypotheses for genes which harbour mutations and homozygous deletionsacross different samples, see Supplementary Table S3. Data and exclusivity groups for MSS tumorsare shown in Supplementary Figure S4 and S5.CAPRI was run, as the last step of PicNic , on each subtype, by selecting recurrent alterationsfrom the pool of 33 pathway genes and using both AIC/BIC regularizer. Timings to run the relevantsteps of the pipeline are reported in the Supplementary Material. In the models of Figures 4 andFigure 5 each edge mirrors selective advantage among the upstream and downstream nodes, asestimated by CAPRI; Mann-Withney U test is carried out with statistical significance 0.05, after100 non-parametric bootstrap iterations.The significance of the reconstructed models and the input data is assessed by computing all thestatistics/tests discussed in the Main text (temporal priority, probability raising and hypergeometrictesting p-values, bootstrap and cross-validation scores). Motivation and background on each ofthese measures is available in the Supplementary Materials. A table with their values for edgeswith highest non-parametric bootstrap scores is in Supplementary Figure S8.For the MSS cohort all the p-values are strongly significant ( p (cid:28) . ) except for the temporalpriority of the edges connecting mutations in fam123b and atm , and erbb2 alterations (mutationsand amplifications), which leads us to conclude that, even if these pairs of genes seem to undergoselective advantage, the temporal ordering of their occurrence is ambiguous and failed to be imputedcorrectly from the datasets, analyzed here. The same situation occurs in MSI-HIGH tumors, for therelation between kras and erbb3 . Non-parametric and statistical bootstrap estimations are used14o assess the strength of all the findings (Supplementary Figures S6 and S7). Moreover, any biasin the data is finally evaluated by cross-validation (Supplementary Figures S8-S11) and commonstatistics such as entropy loss, posterior classification and prediction errors. In general, most of theselective advantage relations depicted by the inferred models present a strong statistical support,with the MSS cohort presenting the most reliable results.Summary implementation for COADREAD ( PicNic steps, Figure 2): (1) TCGA clinical classi-fication, (2) MutSigCV and TCGA manual curation, (3) MEMO, MUTEX and knowledge of wnt and raf pathways and (4) CAPRI.
Implement your own case study with PiCnIc/TRONCO.
TRONCO started as a projectbefore
PicNic , and is our effort at collecting, in a free R package, algorithms to infer progressionmodels from genomic data. In its current version it offers the implementation of the CAPRIand CAPRESE algorithms, as well as a set of routines to pre-process genomic data. With theinvention of
PicNic , it started accommodating software routines to easily interface CAPRI andCAPRESE to some of the tools that we mention in Figure 2. In particular, in its current 2.0version it supports input/output for the Matlab Network Based Stratification tool (NBS) andthe Java MUTEX tool, as well as the possibility to fetch data available from the cBioPortal forCancer Genomics ( http://cbioportal.org http://cbioportal.org), which provides a Web resourcefor exploring, visualizing, and analyzing multidimensional cancer genomics data.We plan to extend TRONCO in the future to support other similar tools and become an integralpart of daily laboratory routines, thus facilitating application of PiCnIc to additional use cases.
Authors contributions
This work follows up on our earlier project initiated by BM and carriedout by Milan-Bicocca and the Catalan Institute of Oncology, based on a framework discussed atthe 2014 School on Cancer, Systems and Complexity (CSAC).
PicNic was designed and constructedby MA’s Bioinformatics lab at University of Milan-Bicocca, within a project led and supervised byGC. GC, AG and DR designed the pipeline, and GC, DR and LDS coded and executed it. Datagathering and model interpretation was done by GC, LDS, DR, AG together with BM, VM andRSP. GM, MA, VM and BM provided overall organizational guidance and discussion. GC, AG,RSP and BM wrote the original draft of the paper, which all authors reviewed and revised in thefinal form. BM and MA are co-senior authors.
Acknowledgments
MA, GM, GC, AG, DR acknowledge the SysBioNet project, a MIUR initia-tive for the Italian Roadmap of European Strategy Forum on Research Infrastructures (ESFRI) andRegione Lombardia (Italy) for the research projects RetroNet through the ASTIL Program [12-4-5148000-40]; U.A 053 and Network Enabled Drug Design project [ID14546A Rif SAL-7], Fondo Ac-cordi Istituzionali 2009. BM acknowledges founding by the NSF grants CCF-0836649, CCF-0926166and a NCI-PSOC grant. VM and RSP acknowledge the Instituto de Salud Carlos III supported byThe European Regional Development Fund (ERDF) grants PI11-01439, PIE13/00022, the SpanishAssociation Against Cancer (AECC) Scientific Foundation, and the Catalan Government DURSI,grant 2014SGR647. 15e wish to thank the anonymous reviewers for their help in improving the quality and rigor ofthe presentation.
References [1] Nowell PC (1976) The clonal evolution of tumor cell populations.
Science
Cancer Research
Cancer Research
Nature Reviews Cancer
Cell
Cell
Cancer attractors: a systems view of tumors from agene network dynamics and developmental perspective (Elsevier), No. 7, pp 869–876.[8] Futreal PA, et al. (2004) A census of human cancer genes.
Nature Reviews Cancer
Science
Cell
Nature Genetics
Nature Reviews Cancer
The Biology of Cancer (Garland Science).[14] Albini A, Sporn MB (2007) The tumour microenvironment as a target for chemoprevention.
Nature Reviews Cancer
Nature
Proceedingsof the National Academy of Sciences
NatureMedicine
Evolutionary Dynamics (Harvard University Press).[19] Wood LD, et al. (2007) The genomic landscapes of human breast and colorectal cancers.
Science
Science
Science
British Journal of Cancer
Nature
The New England Journal of Medicine https://tcga-data.nci.nih.gov https://tcga-data.nci.nih.gov.[26] Navin N, et al. (2011) Tumour evolution inferred by single-cell sequencing.
Nature
NatureMethods
Systematic biology
Proceedings of the National Academy of Sciences
TheNew England Journal of Medicine
Journal of Computational Biology
Journalof Computational Biology
Mathematical Biosciences
Journal of Computational Biology
Bernoulli pp 893–909.[36] Gerstung M, Baudis M, Moch H, Beerenwinkel N (2009) Quantifying cancer progression withconjunctive Bayesian networks.
Bioinformatics
Proceedings of the National Academy of Sciences
Bioinformatics
PLOS ONE
Bioinformatics
Bioinformatics, .[42] Suppes P (1970)
A Probabilistic Theory of Causality (North-Holland Publishing CompanyAmsterdam).[43] Navin NE (2014) Cancer genomics: one cell at a time.
Genome Biology
Nature
Nature Genetics
Genome Biol
Bioinformatics
PLOS Computational Biology
Nature Methods
BMC Bioinformatics
Cell Reports
PLOS Computatioanl Biology
Nature methods
Bioinformatics
Bioinformatics
Nature
The New England Journal of Medicine
Nature
British Journal of Haematology
Nature
Bioinformatics
BMC Bioinformatics
Nature
Journal of Clinical Oncology
The Lancet
Nature Methods
BMC Genomics
Nature
Nu-cleic Acids Research
Bioinformatics
GenomeResearch
PLOSONE
Nature Methods
The FASEB Journal
BMCMedical Genomics
Genome Research
Genome Biology
Genome Research
Bioinformatics
PLOS Computational Biology
Genome Biology bioRxiv http://dx.doi.org/10.1101/027474 http://dx.doi.org/10.1101/027474 .[83] Szczurek E, Beerenwinkel N (2014) Modeling mutual exclusivity of cancer mutations.
PLoSComputational Biology
Nature Genetics
Journal of Computational Biology
An Introduction to the Bootstrap (CRC press).[87] Koller D, Friedman N (2009)
Probabilistic Graphical Models: Principles and Techniques -Adaptive Computation and Machine Learning (The MIT Press).[88] Ogino S, Goel A (2008) Molecular classification and correlates in colorectal cancer.
TheJournal of Molecular Diagnostics
Cell
Nature reviews Clinical oncology
Oncogene β -cateninsignaling in development and disease. Developmental Dynamics
The Journal of Pathology
World Journal of Gastroenterology
Oncogene
HumanPathology
BMC Cancer
Laboratory Investigation
Naturemedicine, in print .[100] Loohuis LO, Witzel A, Mishra B (2014) Cancer hybrid automata: model, beliefs and therapy.
Information and Computation
Journal of The Royal Society Interface
IEEE Transactions on Information Technology in Biomedicine A. Problem statement. (left) Inference of ensemble-level cancer progression models froma cohort of n independent patients (cross-sectional). By examining a list of somatic mutations orCNAs per patient (0/1 variables) we infer a probabilistic graphical model of the temporal orderingof fixation and accumulation of such alterations in the input cohort. Sample size and tumor het-erogeneity complicate the problem of extracting population-level trends, as this requires accountingfor patients’ specificities such as multiple starting events. (right) For an individual tumor, its clonalphylogeny and prevalence is usually inferred from multiple biopsies or single-cell sequencing data.Phylogeny-tree reconstruction from an underlying statistical model of reads coverage or depths es-timates alterations’ prevalence in each clone, as well as ancestry relations. This problem is mostlyworsened by the high intra-tumor heterogeneity and sequencing issues. B. The PiCnIc pipeline forensemble-level inference includes several sequential steps to reduce tumor heterogeneity, before ap-plying the CAPRI [40] algorithm. Available mutation, expression or methylation data are first usedto stratify patients into distinct tumor molecular subtypes, usually by exploiting clustering tools.Then, subtype-specific alterations driving cancer initiation and progression are identified with sta-tistical tools and on the basis of prior knowledge. Next is the identification of the fitness-equivalentgroups of mutually exclusive alterations across the input population, again done with computa-tional tools or biological priors. Finally, CAPRI processes a set of relevant alterations within suchgroups. Via bootstrap and hypothesis-testing, CAPRI extracts a set of “selective advantage rela-tions” among them, which is eventually narrowed down via maximum likelihood estimation withregularization (with various scores). The ensemble-level progression model is obtained by combiningsuch relations in a graph, and its confidence is assessed via various bootstrap and cross-validationtechniques. 22
Data marked as ✗ can be used when it is persistent (i.e., do not revert back to their original state) during tumor progression. Other: data not common to most tumor types such as fusions or partial tandem duplication. MOTIVATION COMPUTATIONAL OPTIONS AND PRIOR KNOWLEDGE ⌘ EXPECTED OUTPUT AND ACTION TO EXECUTE ✓✗ Groups detection
INPUT DATA * ✓✓✓ M u t a t i o n s O t h e r ✓ C o p y N u m b e r E x p r e ss i o n Determine molecular subtypes likely to progress through different trajectoriesSelect a subset of alterations likely to drive progression M e h t y l a t i o n s ✗ Split the cohort according to each clusterIn each cluster, restrict to consider only driver eventsFor each cluster, for each of its groups, create a logical formula consistent with the statisticValidate statistically or experimentally each one of the inferred models
Cohort subtypingEvents selection
One progression model per subtype
Non-negative Matrix Factorization (NMF), k-Means, Gaussian Mixtures, Hierarchical/Spectral Clustering, Network Based Stratification (NBS)MutSigCV, OncodriveFM, OncodriveCLUST, MuSiC, Oncodrive-CIS, IntogenRatio test, RME, MEMO, MUTEX, Dendrix, MDPFinder, Multi-Dendrix, CoMEt, MEGSA, ME testCAPRI ★ , CAPRESE, Oncotrees, Distance-based, Mixtures, CBN, Resic, BML Known pathway genes with alternative but fitness-equivalent status, or co-occurrently altered ★ CAPRI is the only algorithm to exploit knowledge provided by step 3 via logical formulas hypotheses-testing. Oncotrees, Distance-based, Mixtuers and CAPRESE are constrained to infer at most tree-models of progression.Biomarkers (cell types, known mutations, ….), Clinical Annotations (Mutation Status, Chromosomal Stability, …)Known cancer oncogenes and tumor suppressors, known pathways ✓ Stratified samples (clusters)A rank of genes and their alterations ⌘ Not all tools support all the data that is theoretically usable for a certain step.
Groups satysifying certain statistics (e.g., exclusivity)
Model Inference
Select groups of alterations which should be examined togetherSelect the Graphical Model which explains best the data ✓✓ ✓✓
Figure 2: The PiCnIc pipeline. We do not provide a unique all-encompassing rationale to instantiatePiCnIc as all steps refer to research area currently development, where the optimal approach isoften dependent on the type of data available and prior knowledge about the cancer under study.References are provided for each tool that can be used to instantiate PiCnIc: NMF [61], k-Means,Gaussian Mixtures, Hierarchical/Spectral Clustering [62], NBS [66], MutSigCV [68], OncodriveFM[69], OncodriveCLUST [70], MuSiC [71] Oncodrive-CIS [72] Intogen [73], Ratio [74], RME [75],MEMO [76], MUTEX [77], Dendrix [78], MDPFinder [79], Multi-Dendrix [80], CoMEt [81], MEGSA[82], ME [83], CAPRI [40], CAPRESE [39], Oncotrees [31, 33], Distance-based [32], Mixtures [34],CBN [35, 36], Resic [37] and BML [38]. 23igure 3: A. MSI-HIGH colorectal tumors from the TCGA COADREAD project [56], restrictedto 27 samples with both somatic mutations and high-resolution CNA data available and a selectionout of 33 driver genes annotated to wnt , ras , pi3k , tgf- β and p53 pathways. This datasetis used to infer the model in Figure 5. B. Mutations and CNAs in MSI-HIGH tumors mappedto pathways confirm heterogeneity even at the pathway-level. C. Groups of mutually exclusivealterations were obtained from [56] - which run the MEMO [76] tool - and by MUTEX [77] tool. Inaddition, previous knowledge about exclusivity among genes in the ras pathway was exploited. D. A Boolean formula input to CAPRI tests the hypothesis that alterations in the ras genes kras , nras and braf confer equivalent selective advantage. The formula accounts for hard exclusivityof alterations in nras mutations and deletions, jointly with soft exclusivity with kras and nras alterations. 24igure 4: Selective advantage relations inferred by CAPRI constitute MSS progression; inputdataset in Supplementary Figure S3 and S4. Formulas written on groups of exclusive alterations,e.g., sox9 amplifications and mutations, are displayed in expanded form; their events are connectedby dashed lines with colors representing the type of exclusivity (red for hard, orange for soft), logicalconnectives are squared when the formula is selected, and circular when the formula selects for adownstream node. For this model of MSS tumors in COADREAD, we find strong statistical supportfor many edges (p-values, bootstrap scores and cross-validation statistics shown as SupplementaryMaterial), as well as the overall model. This model captures both current knowledge about CRCprogression – e.g, selection of alterations in pi3k genes by the kras mutations (directed or via theMEMO group, with BIC) – as well as novel interesting testable hypotheses – e.g., selection of sox9 alterations by fbxw7 mutations (with BIC). 25igure 5: A. Selective advantage relations inferred by CAPRI constitute MSI-HIGH progression;input dataset in Figure 3. Formulas written on groups of exclusive alterations are expanded asin Figure 4. For each relation, confidence is estimated as for MSS tumors and reported as Sup-plementary Material. In general, this model is supported by weaker statistics than MSS tumors –possibly because of this small sample size ( n =27 ). Still, we can find interesting relations involving apc mutations which select for pik3ca ones (via BIC) as well as selection of the MEMO group( erbb2 / pik3ca mutations or igf2 deletions) predicted by AIC. Similarly, we find a strong selectiontrend among mutations in erbb2 and kras , despite in this case the temporal precedence amongthose mutations is not disentangled as the two events have the same marginal frequencies ( ). B. Evolutionary trajectories of clonal expansion predicted from two selective advantage relations in themodel. apc -mutated clones shall enjoy expansion, up to acquisition of further selective advantagevia mutations or homozygous deletions in nras . These cases should be representative of differentindividuals in the population, and the ensemble-level interpretation should be that “ apc mutationsselect for nras alterations, in hard exclusivity” as no sample harbour both alterations. A similarargument can show that the clones of patients harbouring distinct alterations in acvr1b – anddifferent upstream events – will enjoy further selective advantage from mutation in the tgfbr2 gene. 267
Reproducing this study
Code availability
The implementation of PiCnIc shown in the Main Text was performed by using, as core,the r language, and other external Java tools which we reference in this document. In r ,much of the data processing and inference is done by exploiting the current version of theopen-source TRanslational ONCOlogy ( TRONCO , [41], version 2.3)package which implements up-to-date statistical algorithms to estimate cancer progressionmodels from a list of genomic lesions (e.g., somatic mutations, copy number variations orpersistent epigenetic states) in a population of independent tumors, or in a single patient.
TRONCO ’s official webpage is reachable from the Software section of our group’s webpage http://bimib.disco.unimib.it/
By navigating to the Case Studies section of
TRONCO ’s official webpage one can find the sourcecode to replicate this study (i.e., the
PicNic ’s implementation) along with the documentationdetailing all the implementation, as well as the data that we used. This should allow easyimplementation of similar studies in different contexts.28
Glossary
This glossary of terms shall be of help to readers not familiar with the concepts mentioned in theMain Text. For clarity, terms are separate in two categories according to the fact that they arecommon to the statistics or the cancer biology communities. Each term which is included in thisglossary appears in color.
Terms common to the statistics communityTerm Meaning
Boolean for-mula
In CAPRI, a formula written with standard logical operators which capture a relation amonga group of alterations. In
PicNic , these are used to detect alternative routes to selectiveadvantage from mutually exclusive alterations . See:
Fitness equivalence . Ensemblelevel progres-sion inference
Detection of the relations of selective advantage across the permanent alterations ina cohort of independent tumors ( cross-sectional data). When aggregated in a graphi-cal model , these shall picture the most common evolutionary trajectories in the popula-tion/cancer under study. See also: inter-tumor heterogeneity . Graphicalmodels
In this context, a direct acyclic graph with nodes ( alterations ) and edges ( selective ad-vantage relations ), as a shorthand to represent the joint probability of observing a set ofalterations in a sample (i.e., a cancer genotype). See also:
Suppes-Bayes Causal Net-work , Model selection . Individuallevel progres-sion inference
Detection of clonal signatures and their prevalence in individual tumors by scanning multi-region or single-cell sequencing data; clones are then displayed in a phylogenetic treestructure . See also: intra-tumor heterogeneity . Model selec-tion
The process of selecting a model which fits data, according to some criterion. In CAPRI, thisis done by balancing model likelihood (a measure of to which degree data can be explained bythe model) and model complexity (the size of the graphical model ). See: regularization . Phylogenetictree
In this context, rooted tree where each node is a clone , and edges represent ancestry relationsamong clones.
Regularization
Common approach to avoid overfitting (false-positives) during model selection – in CAPRIthis is achieved by using the standard AIC/BIC which penalize with different severity graph-ical models which contain many selective advantage relations . Simposon’sparadox
A paradox in statistics, in which a trend appears in different groups of data but disappearsor reverses when these groups are combined. In this context, this shall refer to genuine selective advantage relations which are not inferred unless data coming from differentpopulations is separated before doing inference. See: heterogeneity , subtypes , formulas . Suppes-BayesCausal Net-works
A specific type of graphical model returned by CAPRI algorithm, where each edge satisfiesSuppes’s conditions of probabilistic causation subsuming temporal ordering and positivestatistical dependence – the statistical approach to estimate selective advantage amongthe alterations. erms common to the cancer biology communityTerm Meaning Alterations
Somatic mutations: A change in the genome of a cell that is not inherited from a parent, butis acquired. CNVs: Structural variation of large regions of DNA segments, including deletions,insertions, duplications and complex multi-site variants.
Bulk se-quencing
Genome sequencing from single tumor samples, each containing a large number of cells. Theresulting genomic profiles are derived from a mixture of cells with potentially distinct evolutionaryhistories.
Clones;Clonal ex-pansion
Clone: group of cells sharing an identical genome and that derive from a common ancestor. Clonalexpansion: the production of descendent cells all arising originally from a single cell. In the scenarioof cancer development, tumors develop through a series of clonal expansions, in which the mostfavorable clonal population survives and proliferate.
Cross-sectionaldata
Unique snapshots of data derived from samples that are collected at unknown time points. Usuallyderived from bulk sequencing technologies.
Driver; Pas-senger
Driver: (epi)genetic alteration that provides a selective advantage to a cancer clone . Passen-ger: alteration of a cancer cell that does not increase its fitness . Exclusivityof alter-ations
Group of alterations which manifest few or no co-occurences in a cohort of different samples, andmight be fitness-equivalent for tumor progression. Hard exclusivity: when co-occurrences shall beconsidered the result of random errors. Soft exclusivity: when few co-occurrences shall be possible.See: formulas . Fitness
A cell’s ability of surviving, proliferating and adapting to environmental changes, usually within anenvironment with limited and depleting resources (e.g., oxygen or nutrients).
Fitnessequivalence
Groups of driver alterations , functional to the same pathway or equally dysruptive, that can inde-pendently confer a selective advantage to a cancer cell. Multiple co-occurrence of such alterationsto provide no further advantage, hence leading to mutually exclusive alteration patterns acrossdistinct samples.
Hallmark ofcancer
Common traits or phenotypic properties that are supposed to drive the transformation of normalcells to cancer cells. Anti-hallmark: clonal profiles that are usually not observed, yet beingtheoretically possible.
Inter-tumorheterogene-ity
The phenomenon according to which different patients with the same cancer type usually displaya few common alterations. This is the major problem of inferring ensemble-level cancer pro-gression models . Intra-tumorheterogene-ity
Intra-tumor heterogeneity is related to possible coexistence of different cancer clones, with differentevolutionary histories and different mutational profiles, within the same tumor. This is the majorproblem of inferring individual-level cancer progression models . Multiregionsequencing
Collection of genomic data obtained by processing multiple spatially separated biopsy samples fromthe same individual tumor.
Next Gen-erationSequencing(NGS)
New technologies for sequencing genomes at high speed and low cost, including, e.g., full-genome/exome sequencing, genome resequencing, transcriptome profiling (RNA-Seq), DNA-proteininteractions (ChIP-Seq), and epigenome characterization.
Selective ad-vantage rela-tion
In successive waves of clonal expansions one or more cells of the same clone can (progressively)increase their fitness through the acquisition of additional driver alterations , leading to theemergence and development of a fitter clone. In this case a relation of selective avantage connectsthe earlier to the succeeding alterations . Single-cellsequencing
Recent technology based on the retrieval and analysis of genomic information from individual cells,rather than from mixtures of cells.
Syntheticlethality
The phenomenon according to which two otherwise non-lethal alterations lead the cell death whenthey co-occur within the same cell. See:
Anti-hallmark PicNic ’s implementation for COADREAD samples
Here we detail all the steps implemented to use
PicNic for CRC progression inference.
C.1 TCGA COADREAD project data
COADREAD provides genome-scale analysis of samples with exome sequence, DNA copy number,promoter methylation, messenger RNA and microRNA expression data, which we used to definethe input dataset. In particular, only samples with both mutations and CNAs profiles were used inthe analysis. Supplementary Table S1 details the dataset.
Dataset used to infer models presented in the Main Text.
Samples published in [56] wereused as, to the best of our knowledge, these represent the highest–quality data made available byCOADREAD as of today; for these samples TCGA provides somatic mutation profiles and high-resolution focal CNAs via GISTIC. These are obtained from TCGA data freeze as of 2 February2012, downloaded on 12 March 2015, from repository: https://tcga-data.nci.nih.gov/docs/publications/coadread_2012/
The following files were processed to produce the data:•
TCGA_CRC_Suppl_Table2_Mutations_20120719.xlsx
Somatic mutations profiles obtained via whole-exome sequencing of 224 colorectal tumors byTCGA. Data available consists of 15995 mutations in 228 samples, provided in the
Man-ual Annotation Format (MAF). Samples were selected to univocally match the 224 pa-tients as of the TCGA guidelines for aliquote disambiguation, see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode . All the mutations annotated by TCGA – truncating(
De_novo_Start_OutOfFrame , Frame_Shift_Del , Frame_Shift_In , Nonsense_Mutation , Splice_Site , Frame_Shift_Ins , In_Frame_Del ), silent (
Silent ) and missense (
Missense_Mutation ) – wereconsidered for analysis; notice that the majority of them are missense, see Figures S7 and S8.• crc_gistic.txt.zip
Focal
Copy Number Alterations (CNAs) for 564 patients derived from whole-genome sequenc-ing using the Illumina HiSeq platform.
High-level gains and homozygous deletions were con-sidered for analysis by selecting entries with GISTIC scores ± crc_clinical_sheet.txt Clinical data summary with patient stage and
Micro Satellite Stabe/Unstable (MSS/MSI)status being any of: MSS, MSI-high and MSI-low.The list of patients used was first reduced to those having both
CNAs and somatic mutationdata, and then was split into two groups: MSI-HIGH and MSS. The training cohort has 152 MSSand 27 MSI-HIGH samples; samples flagged as low MSI were excluded from the study as they havenot been shown to differ in their clinicopathologic features or in most molecular features from MSStumors [ ? ]. 31 .2 Driver events selection In the TCGA COADREAD study [56] integrated analysis of mutations, copy number and mRNAexpression changes in 195 tumours with complete data was performed. Part of the analysis wascarried out by using the MutSig tool [68], as well as manual curation. Samples were grouped bymutation status, and recurrent alterations in key CRC pathways were identified in [56] (Fig. 4,Supplementary Fig. 6 and Supplementary Table 1) as a result, we can use the consortium’s list of33 driver genes annotated to 5 pathways and use these to extract our progression models. Theseare well-known cancer genes, frequently reported as relevant to colorectal progression and to themajor pathways involved in CRC. Driver events are alterations in:• wnt genes (14): apc , dkk-4 , tcf7l2 , ctnnb1 , lrp5 , fbxw7 , dkk-1 , fzd10 , arid1a , dkk-2 , fam123b , sox9 , dkk-3 and axin2 ;• rtk / ras genes (5): erbb2 , erbb3 , nras , kras and braf ;• tgf- β genes (5): tgfbr1 , smad3 , tgfbr2 , smad4 , acvr1b , acvr2a and smad2 ;• igf2 / pi3k genes (5): igf2 , irs2 , pik3ca , pik3r1 and pten ;• p53 genes (2): tp53 and atm .In the Main Text, rtk / ras and igf2 / pi3k pathways are shortly denoted as ras and pi3k .The distinct types of mutations detected in these genes are shown in Figure S7, as well as theoverall rate of COADREAD mutations. The spatial distribution (per gene) of such mutations isshown in Figure S8. C.3 Mutual exclusivity groups of alterations.
Groups of alterations showing a trend of mutual exclusivity were scanned with MUTEX and mu-tations and CNA hitting any of the 33 selected genes as input. MUTEX was run independentlyon MSS and MSI-HIGH groups (Supplementary Table S2, running times: approximately 6 and 3.5hours, respectively, on a standard Desktop machine).We selected only groups with score < . , where the score is derived from p-values correctedfor false discovery rate . 3 groups are found for MSI-HIGH tumors and 6 for MSS. For MSI-HIGHtumors, the three predicted groups consists of genes acvr1b , acvr2a , tp53 and erbb2 , of genes braf , nras and tgfbr2 , and of genes kras and braf .Further groups of exclusive alterations were considered consistent with results reported in [56].These include groups derived by consolidated knowledge of colorectal progression: the well-known wnt alterations in apc / ctnnb1 [ ? ], as well as ras alterations in kras , nras and braf genes [74].Similarly, we used also a group collected by scanning non-hypermutated tumors with the MEMOtool in [56] - this group includes pik3ca , pten , erbb2 and igf2 genes. These groups were restrictedto account only for genes actually altered in a certain subtype, e.g., MSI-HIGH tumors lack ctnnb1 mutation, making the Wnt group irrelevant. Groups for MSS tumors are shown as SupplementaryFigure S9, groups for MSI-HIGH tumors are in the Main Text.32 .4 CAPRI’s execution Background on the algorithm.
CAPRI algorithm can be executed in two different modes,originally dubbed as “supervised” when formulas are given in input for testing, and “unsupervised”when this is not the case. This paper deals with the former; see the Main Text for the interpretationof formulas in this context and [40] for a derivation of the algorithm. CAPRI is a three-stepsprocedure, which we briefly recall here.1. CAPRI starts by creating a “lifted” representation of the input data M which includes theinput formulas; each formula – which is written in propositional logic – is so evaluated toyield a new column in the dataset. This is the input processed by the algorithm, which startsby selecting a set of candidate model edges, which are then used to constrain a score-basedBayesian model-selection problem [40].2. The initial set of selective advantage relations S , which determines the model edges x → y ,is computed by evaluating the following inequalities (Temporal priority) p ( x ) > p ( y ) (Probability raising) p ( y | x ) > p ( y | ¬ x ) , where p ( · ) is a marginal probability, p ( · | · ) is a conditional, ¬ x is the negation of x and either x or y is not a formula. It is interesting to observe that probability raising implies that x and y are positively statistically dependent , thus imposing a minimum threshold on theirassociation [39].In each of CAPRI’s executions, the distribution of the observed marginals and conditionals areestimated by K non-parametric bootstrap resamples; practically, this means that we create abootstrapped approximation for each of the four populations ˆ p ( x ) , ˆ p ( y ) , ˆ p ( y | x ) and ˆ p ( y | ¬ x ) .Then, we use a single-tail non-parametric Mann-Withney U test of the difference in mean totest the hypothesis that one of the two populations is more probable than the other, e.g., ˆ p ( x ) > ˆ p ( y ) . With this test, we can compute two p-values, one for each condition. An edge isincluded in S if at least the p-value for probability raising is below a significance threshold p ∗ and an edge is said not to be orientable if its p-value for temporal priority is above thesame threshold. Cycles x → x → . . . x k → x that might appear in S are broken by deletingthe edge with minimum p-value; both K and p ∗ are custom parameters.3. Optimization of S , namely detection of the subset S ∗ of S with the edges that we include inthe final progression model is done by optimizing, via hill climbing or tabu search , the scorewith regularization S ∗ (cid:44) arg min ˆ S⊂S (cid:110) − L ( ˆ S | M )] + θ | ˆ S| (cid:111) , (1)where L ( · ) is the model likelihood and M is the input data; the estimated optimal solution is S ∗ , which is displayed as a Suppes-Bayes Causal Network. The different regularization strate-gies mentioned in the Main Text, BIC and AIC, are obtained by the following parametrization: (Bayesian Information Criterion) θ (cid:44) log( n ) (Akaike Information Criterion) θ (cid:44) . θ which define the conditional probability table of each edge and should be fit from data; these are necessary if one wishes to use a model as a“generator” of further data. For discrete-valued graphical models, for each parent set π x → y ,parameter θ ( y ) = p ( y | π x ) can be taken either as the maximum likelihood estimate from thelifted input, or by using a Bayesian interpretation [87]. Usage in this context.
CAPRI was run, on each group of tumors, by selecting alterations fromthe pool of 33 pathway genes; every alteration on a gene x is included if any of these apply:• the alteration frequency of x - sum of mutation and CNA frequency - is greater than 5%;• x it is part of an exclusivity group.The set of selected events for MSI-HIGH training tumors is shown in the Main Text, the analogousset for MSS tumors is shown as Supplementary Figure S10.CAPRI was executed in its supervised mode by writing formula over groups and genes withmultiple alterations associated, as explained in the Main Text. For instance, for MSI-HIGH tumorswith alterations in ras pathway we grouped hard exclusivity of nras mutations and deletions, withsoft exclusivity of kras and braf mutations. Our aim was to account for a small subset of sampleswith concurrent kras and nras alterations (see Figure 2, Main Text). The list of all Booleanformulas written over groups is in Table S3; this approach was adopted also when a gene harborsmultiple alterations in a subtype, e.g., erbb2 in MSS training samples which shows a trend of softexclusivity between mutations and amplifications. We used both AIC and BIC scores to regularizeinference after 100 non-parametric bootstrap iterations for estimation of the preliminary selectiveadvantage relations – Mann-Whitney U test was performed with a minimum threshold p ∗ = 0 . .In most cases p-values are orders of magnitude below p ∗ - exact values reported as Dataset File S1.CAPRI’s models with such p-values and non-parametric bootstrap confidence are shown in FiguresS11 and S12, statistical validation of the models is discussed in the next section. D Statistical validation of the models
P-values from hypothesis testing, as well as scores from k -fold cross-validation and various boot-strapping techniques can be used to measure the statistical consistency of models and data, each onecapturing different potential errors in the inference process. Approaches such as cross-validationand bootstrap are sometimes also used in the ( ex novo ) generation and inference of models fromdata (see, e.g., bootstrap consensus models [ ? ]), but we only use them here for the a posteriori evaluation of a model’s confidence, and we interpret them as a quantitative measure for the relative assessment of each model’s relation.All the p-values and the scores that we present here are computed within TRONCO . D.1 Edge p-values
As explained in §C.4, for each model edge x → y we get two p-values by assessing temporal priorityand probability raising via Mann-Whitney U testing.For each edge, a p-value for the hypergeometric test of overlap between alteration profiles x and y can be computed. More precisely, we test if there is a difference between the number of samplescontaining both x and y versus the total population of samples with x , y , or both. We would like34he overlap to be significant as those samples – that determine the joint probability of x and y –are those supporting the presence of a selection trend among x and y .An edge is fully supported if all three p-values are below a custom significance threshold, e.g., . or even better . . Some edges might have the p-value for temporal priority above thethreshold. If so, the selection trend might be still significant, but the temporal order of x and y –i.e., the direction of selection – is not supported by the data. D.2 Bootstrap
We used non-parametric and statistical bootstrap techniques to measure the goodness-of-fit , asoriginally proposed in [ ? ]. In this case, we distinguish two type of errors that one could make inthe inference process, estimating the presence or absence of edges in the model:• Type I errors: incorrect rejection of a true H (null-hypthesis), i.e., a “false positive” edgethat we wrongly include.• Type II errors: incorrect acceptance (failure to reject) of a false H , i.e., a “false negative” edge that we miss. Non-parametric bootstrap [86] computes scores to be interpreted here as follows. If a modelcontains an edge x → y that is a true positive, we expect its score to be high. In other words, whenwe sample with repetition subsets of the original data and re-run the inference process we expect tooften find models which contain the edge x → y . Conversely, for a node y without incoming edges,or equivalently for any edge x → y which is correctly excluded from a model – a true negative – wewould expect its bootstrap score to be low. However, this reasoning can be also generalized to wholemodels, where we count how many times we re-infer exactly the same model. Clearly, such scoreswill depend also on the empirical probabilities of the nodes in our data, and their deviation fromthe true probabilities of the phenomenon. So, one might expect rare events to be less frequentlybootstrapped, which results in a lower estimate; however, such counts can be anyway interpretedas measures of repeatability of our findings .The above bootstrap approach depends on two random number generators: one to shuffle data(the a posteriori bootstrap), and one to evaluate CAPRI’s inequalities via hypothesis testing (theinternal CAPRI’s bootstrap, see C.4). Thus, to ensure that no bias is introduced by the randomnumber generators, we performed a statistical bootstrap by holding data fixed, and re-estimatingCAPRI’s inequalities with generators initialized with different seeds. We evaluated the robustnessof our scores for all edges imputed to be genuine and hence, high-scoring.Notice that, in principle, even parametric bootstrap scores could be computed if we used themodel to generate bootstrapped data [86]. However, as the support of the distribution subsumedby a model with n nodes consists of n possible outcomes, sampling uniformly from large modelsmight be computationally hard. For this reason, and because such scores are overestimates of thenon-parametric ones, we did not include them in our computation.The MSS and MSI progression models are annotated with the non-parametric bootstrap scoresin Figures S11 and S12. Non-parametric and statistical bootstrap scores for a set of selected edges Bootstrapping techniques have been widely used to gauge uncertainty in estimates, but also subjects of philo-sophical debate about their precise interpretation, especially when coupled with various significance thresholds –unlike the situation with a p-value for a null hypothesis. An exhaustive review on the topic is provided by Soltisin [ ? ]. We follow the ideas originally developed in the area of phylogenetic analysis [ ? ], suggesting that the scorescan be alternatively interpreted as a measure of accuracy of the method or of the robustness of the data [ ? ]. D.3 Cross-validation
Next we study the sufficiency of the data sizes for model inference and its ability to characterize theunderlying progression ( goodness-of-data ). Thus, we focus on the Type III errors, which occur whenthe sample size is inadequate or the sample is a poor descriptor for the reference phenomenon ,and thus failing to represent the progression. For this purpose, we used cross-validation with thedata used for the models built (see the Main Text), and followed the best practices developed bythe Bayesian Networks community [87]. TRONCO exploits the cross-validation routines implemented in the bnlearn package [ ? ]. Theapproach that we adopt is a k -fold non-exhaustive cross-validation, which we repeat times toaverage its results. Exhaustive strategies might be used for datasets of small sample size. Each runof cross-validation consists in computing a loss function for a model; its steps are the followings:• split randomly the data in k = 10 groups, and then repeat the following two steps, for eachgroup in turn:1. set one of the groups to be the “training” M tr ;2. merge the others k − to be the “test” M te ;3. by holding fixed the model structure (i.e., the edges in S ∗ ), fit the model parameters θ over the training data via maximum likelihood estimates , compute a score over the test M te (see below) and the corresponding loss;• combine the k loss estimates to give an overall loss for data.Let θ tr be the parameters fit from the training set M tr , and S ∗ the edges in the model, threescores are computed with cross-validation:1. the negative entropy of a model – i.e., the negated expected log-likelihood of the test set forthe Bayesian network fitted from the training set, that is eloss ( S ∗ , θ tr ) = − E [ L ( S ∗ , θ tr | M te )] .
2. the prediction error for a single node x and its parents set X , i.e., we measure how preciselywe can predict the values of x by using only the information present in its local distribution θ tr ( x ) . This parameter corresponds to computing the misclassification rates from p te ( x ) , theempirical marginal probability of x estimated from the test.3. the posterior classification error for a single node x and one of its parent node y ∈ X – i.e.,the values of x are predicted using only the information present in y by likelihood weightingand Bayesian posterior estimates. Consider the case where the samples are from two different unknown and heterogeneous groups, with randomchance making low values to be sampled from a group that actually has a majority of high values, and vice versa.In this case, the samples will not be the best descriptor of the two groups. ? ] for a discussion on these loss functions. The first statistics measures the log-likelihood loss when we “forget” some of the samples used to infer a model (indeed, the test samples); see FigureS14. Roughly, we are measuring how the model’s predictive power changes as we look at the datafrom different viewpoints. This is a score for a whole model, it has no scale – so cannot be used tosay how good the models/data are, in any absolute sense. Nonetheless, it can be used to evaluatehow “stable” a model is, for a certain dataset.The second and third statistics measure the accuracy of the parent-set, X , for a child x ; thesecond statistics dealing with the whole parent set as predictor, and the third, the individualcontribution of each of the parents. For these two statistics, we desire the prediction error to below, as a measure of goodness. These are shown in Figure S13 (selected edges), S15 and S16 (alledges, prediction error). E Supplementary Tables and Figures
Datasets (CNAs and mutations provided by TCGA) statistics alteration typecancer † n m | G | mutations amplifications deletions MSI-HIGH 27 16100 13798 11556 2888 1656MSS 152 21317 16371 12417 6925 1975 † Samples were classified as MSI-HIGH/LOW and MSS by TCGA; see flag
MSI_status in clinical data available for the COADREAD project.
Table S1:
COADREAD Data.
Data used in this study, derived from the TCGA COADREADproject [56]. 37
UTEX parametersParameter Value Description signalling-network - MUTEX network † max-group-size maximum size of a result group first-level-random-iteration number of randomisation to estimate null distribu-tion of member p-values in groups second-level-random-iteration number of runs to estimate the null distribution offinal scores fdr-cutoff - false-discovery-rate cutoff maximising the expectedvalue of true positives - false positives is estimatedfrom data search-on-signaling-network TRUE reduce the search space using the signalling network † Manually curated from Pathway Commons, SPIKE and SignaLink databases. Provided with the tool; availablefor download at https://code.google.com/p/mutex/ . MUTEX groups with score < . MSI-HIGH Groups score q -value kras , braf , 0.095 0.48 nras , braf , tgfbr1 erbb2 , tp53 , acvr1b , acvr2a MSS Groups score q -value tp53 , atm , 0.051 0.34 arid1a , tp53 kras , nras , braf , 0.0864 0.1975 ctnnb1 , apc , dkk2 , 0.098 0.144 dkk1 , tp53 , atm , dkk2 pik3ca , tp53 , atm MUTEX: parameters and results.
Top: Parameters used to run MUTEX on theoriginal TCGA MSS/MSI-HIGH datasets with input CNA and somatic mutations in the pathwaygenes described in text. Bottom: MUTEX identified 3 and 6 groups of alterations showing a trendof mutual exclusivity in these groups with score below the suggested cutoff of . .38 ormulas input for testing to CAPRI † MSI-HIGH tumors description ( nras :m ⊕ nras :d) ∨ kras :m ∨ braf :m raf exclusivity pik3ca :m ∨ erbb2 :m ∨ pten :m ∨ igf2 :d MEMO group ( acvr1b :m ⊕ acvr1b :d) ∨ acvr2a :m ∨ tp53 :m ∨ erbb2 :m MUTEX group ( nras :m ⊕ nras :d) ∨ tgfbr1 :m ∨ braf :m MUTEX group kras :m ∨ braf :m MUTEX group acvr1b :m ⊕ acvr1b :a multiple alterations nras :m ⊕ nras :a multiple alterations fbxw7 :m ∨ fbxw7 :a multiple alterations ‡ MSS tumors description ( apc :m ⊕ apc :d) ∨ ctnnb1 :m wnt exclusivity ( kras :m ∨ kras :a) ∨ ( nras :m ⊕ nras :a) ∨ ( braf :m ⊕ braf :a) raf exclusivity and MEMO group pik3ca :m ∨ ( erbb2 :m ∨ erbb2 :a) ∨ ( pten :m ⊕ pten :d) ∨ igf2 :a MEMO group ( tp53 :m ⊕ tp53 :d) ∨ ( atm :m ⊕ atm :d) MUTEX group ( tp53 :m ⊕ tp53 :d) ∨ arid1a :m MUTEX group ( tp53 :m ⊕ tp53 :d) ∨ arid1a :m MUTEX group ( apc :m ⊕ apc :d) ∨ ctnnb1 :m ∨ dkk2 :m MUTEX group ( tp53 :m ⊕ tp53 :d) ∨ ( atm :m ⊕ atm :d) ∨ dkk2 :m ∨ dkk1 :m MUTEX group ( tp53 :m ⊕ tp53 :d) ∨ ( atm :m ⊕ atm :d) ∨ pik3ca :m MUTEX group ( apc :m ⊕ apc :d) multiple alterations ( tp53 :m ⊕ tp53 :d) multiple alterations ( smad4 :m ⊕ smad4 :d) multiple alterations ( tcf7l2 :m ⊕ tcf7l2 :d) multiple alterations ( atm :m ⊕ atm :d) multiple alterations ( nras :m ⊕ nras :d) multiple alterations ( erbb2 :m ∨ erbb2 :a) multiple alterations ( pten :m ⊕ pten :d) multiple alterations ( smad2 :m ⊕ smad2 :a) multiple alterations ( dkk4 :m ⊕ dkk4 :a) multiple alterations ( sox9 :m ⊕ sox9 :d) multiple alterations ( braf :m ⊕ braf :a) multiple alterations † Events type: mutation (m), deletion (d), amplification (a). Hard ( ⊕ ) and soft ( ∨ ) exclusivity. ‡ Formula not included as it creates a duplicated signature in the dataset.
Table S3:
CAPRI formulas from exclusivity groups.
Formulas created for the groups, andinput to CAPRI for testing. These are either derived from exclusivity groups or from genes involvedin different types of alterations. 39igure S6:
PicNic pipeline processing MSI-HIGH/MSS tumors.
We process with
PicNic
Microsatellite Stable and highly unstable tumors collected from the The Cancer Genome Atlasproject “Human Colon and Rectal Cancer” (subtypes annotations provided as clinical data). Weimplement a study on selected somatic mutations and focal CNAs in 33 driver genes manuallyannotated with 5 pathways in the COADREAD project. We scan groups of exclusive alterationswith computational tools run by us and by TCGA, and we exploit previous knowledge on CRC; weselect which alterations we input to CAPRI. Next, inference is performed with various settings ofregularization and confidence. Statistical confidence of the models is assessed with standard tech-niques from the literature (p-values from statistical testing, bootstrap scores and cross-validationstatistics). 40 PC n = 251 CTNNB1 n = 12 DKK1 n = 4 DKK2 n = 7 DKK3 n = 3
DKK4 n = 7 LRP5 n = 9 FZD10 n = 7 FAM123B n = 30 AXIN2 n = 11 TCF7L2 n = 24 FBXW7 n = 50 ARID1A n = 25 SOX9 n = 11 ERBB2 n = 13 ERBB3 n = 15 NRAS n = 21 KRAS n = 95 BRAF n = 22 PIK3CA n = 42 PIK3R1 n = 13 PTEN n = 14 TGFBR1 n = 10 TGFBR2 n = 9 ACVR1B n = 17 ACVR2A n = 14 SMAD2 n = 18 SMAD3 n = 10 SMAD4 n = 31 TP53 n = 126 ATM n = 50 Mutation types
De_novo_Start_OutOfFrame
Splice_Site
Frame_Shift_Ins
In_Frame_Del
Silent
Nonsense_Mutation
Frame_Shift_Del
Missense_Mutation
Missense_Mutation
Frame_Shift_Del
Nonsense_Mutation
Silent
In_Frame_Del
Frame_Shift_Ins
Splice_Site
De_novo_Start_OutOfFrame
Mutations for COADREAD n = 971 Figure S7:
Mutations annotated by TCGA for the driver genes.
Top: the majority ofthe mutations annotated by TCGA for the driver genes that we consider are missense – this inalmost all genes and in all the cohort. Bottom: overall summation of the frequencies determine themutations across all driver genes. 41
500 1000 1500 2000 2500 2843 aa M u t a t i on s APC
R1450
APC_basic EB1.. M u t a t i on s AMER1
WTX M u t a t i on s TCF7L2
CTNNB1_binding HMG_box M u t a t i on s FBXW7
R465C/H
F-b.. WD40 WD40 WD40 WD40 WD40 WD40 WD40 M u t a t i on s ARID1A
R1989
DUF3518 M u t a t i on s ERBB2
V842I
Rec.. Furin-like Rec.. GF_.. Pkinase_Tyr M u t a t i on s NRAS
Q61H/K/L/R
Ras M u t a t i on s KRAS
G12A/C/D/R/S/V
Ras M u t a t i on s BRAF
V600E
RBD C1_1 Pkinase_Tyr M u t a t i on s PIK3CA
E545A/G/K/Q
PI3.. PI3K_rbd PI3K_C2 PI3Ka PI3_PI4_kinase M u t a t i on s PTEN
R130/Q
DSPc PTEN_C2 M u t a t i on s ACVR1B
Activin_recp TGF.. Pkinase M u t a t i on s SMAD2
S464
MH1 MH2 M u t a t i on s SMAD4
MH1 MH2 M u t a t i on s TP53
R175C/H
P53.. P53 P53.. M u t a t i on s ATM
R1730
TAN FAT PI3.. (cid:1)(cid:2)(cid:3)(cid:3)(cid:4)(cid:5)(cid:3)(cid:4) (cid:6)(cid:7)(cid:8)(cid:5)(cid:9)(cid:10)(cid:11)(cid:2)(cid:5)(cid:12) (cid:13)(cid:5)(cid:14)(cid:7)(cid:10)(cid:15)(cid:4)(cid:16)(cid:17)(cid:4)(cid:18)(cid:4)(cid:11)(cid:2)(cid:19)(cid:5)(cid:20)(cid:2)(cid:5)(cid:3)(cid:4)(cid:7)(cid:11)(cid:2)(cid:19)(cid:5) (cid:21)(cid:11)(cid:22)(cid:4)(cid:7)
Figure S8:
Lolliplot diagrams of TCGA mutations.
Diagrams generated from the cBio portalfor the COADREAD project (see ). These display the physical distri-bution of the annotated mutations for each gene. Here we shown only genes with a total mutationcount greater than ; fam123b is called with its synonym amer1 , as in the portal.42 xclusivity groups for MSS tumors event
59% TP537% ARID1A1% TP53 stage
47% KRAS9% NRAS4% BRAF1% NRAS1% BRAF1% KRAS stage
78% APC7% CTNNB11% DKK21% APC stage
59% TP5310% ATM1% DKK21% DKK11% ATM1% TP53 stage
59% TP5315% PIK3CA10% ATM1% ATM1% TP53 stage stage
Stage IStage IIAStage IIIAStage IIIBStage IIICStage IVNA pathway n o n e D e l e t i o n A m p li f i c a t i o n M u t a t i o n Knowledge-based priors: WNT exclusivity stage stage
Knowledge-based priors: RAF exclusivity stage
Computational priors: MEMO group
59% TP5310% ATM1% ATM1% TP53 stage
Computational priors: Mutex groups with score < 0.2
RASWntTGFbP53PI3K
Figure S9:
Groups of exclusive alterations for MSS tumors.
Knowledge-based groups ofexclusive alterations consist of: kras , nras and braf genes ( raf pathway) and apc and ctnnb1 genes ( wnt pathway). The MEMO [76] group identified in [56] in this cohort consists of genes pik3ca , erbb2 , igf2 and pten . Finally, 6 groups are predicted by MUTEX [77] with score below . , one of these is equivalent to the known exclusive alterations in raf pathway.43 CGA MSS colorectal tumors T C G A − AA − A K T C G A − AA − T C G A − A G − A T C G A − AA − T C G A − AA − T C G A − A G − T C G A − AA − A ZT C G A − AA − T C G A − AA − T C G A − A G − T C G A − A G − A X T C G A − A G − T C G A − A G − T C G A − A G − A T C G A − A F − T C G A − A G − A W T C G A − AA − T C G A − A G − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − A FT C G A − A G − T C G A − A G − T C G A − A G − T C G A − A G − T C G A − A G − A C T C G A − A G − A T C G A − A G − T C G A − A G − A T C G A − A G − T C G A − A − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − A TT C G A − A G − T C G A − AA − A FT C G A − AA − T C G A − AA − T C G A − A F − T C G A − A G − A T C G A − AA − A T C G A − A G − T C G A − AA − A D T C G A − AA − T C G A − A − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − A W T C G A − AA − A H T C G A − AA − A J T C G A − A F − T C G A − A G − T C G A − A G − T C G A − A G − T C G A − A G − T C G A − A G − T C G A − A G − A Y T C G A − A G − A T C G A − AY − T C G A − A G − T C G A − AA − A FT C G A − AA − T C G A − AA − A O T C G A − A G − T C G A − AA − T C G A − AA − A I T C G A − A G − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − A G − A T C G A − AA − T C G A − AA − T C G A − A G − T C G A − AA − T C G A − A F − T C G A − A G − A T C G A − AA − T C G A − A G − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − A G − T C G A − A G − A T C G A − AA − A V T C G A − AA − T C G A − AA − T C G A − A G − A T C G A − A G − T C G A − A − T C G A − AA − T C G A − A G − A G T C G A − A G − T C G A − AA − T C G A − A G − T C G A − AA − A T C G A − AA − A U T C G A − AA − T C G A − AA − T C G A − AA − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − A ZT C G A − AA − A J T C G A − AA − T C G A − A G − T C G A − AA − T C G A − A F − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − AA − A D T C G A − A − T C G A − A G − A T C G A − AA − T C G A − AA − T C G A − AA − A FT C G A − AY − T C G A − A G − T C G A − A − T C G A − AA − T C G A − A G − T C G A − AA − A Y T C G A − A G − A H T C G A − AA − T C G A − AA − T C G A − AA − A Q T C G A − AA − A X T C G A − A G − T C G A − AA − T C G A − AA − T C G A − A G − T C G A − AA − T C G A − A G − T C G A − AA − T C G A − AA −
78% APC59% TP5347% KRAS15% PIK3CA12% FBXW711% SMAD410% FAM123B10% ATM9% TCF7L29% NRAS7% ARID1A7% CTNNB15% SMAD25% SOX95% ERBB25% ERBB24% DKK44% BRAF3% PTEN3% PTEN3% SMAD43% IGF22% DKK41% TCF7L21% DKK21% DKK11% APC1% ATM1% TP531% NRAS1% BRAF1% KRAS1% SOX91% SMAD2 hitsstage g r oup stage Stage IStage IIAStage IIIAStage IIIBStage IIICStage IVnone hits group
WntP53RASPI3KTGFbnoneDeletionAmplificationMutation
152 samples34 events21 genes0 patterns
Figure S10:
Selected data for MSS tumors.
Colorectal tumors with Microsatellite Stableclinical status in the TCGA COADREAD project, restricted to 152 samples with both somaticmutations and CNA data available. 33 driver genes annotated with 5 pathways are selected fromthe list published in [56] to automatically detect groups of mutually exclusive alterations. Eventsselected for reconstruction are those involving genes altered in at least 5% of the cases, or part ofgroup of alterations showing an exclusivity trend (see Figure S9). This dataset is used to infer theset of selective advantage relations which constitute the MSS progression model presented in theMain Text. 44igure S11:
Non-parametric bootstrap scores for MSS progression.
Progression model forMSS tumors with confidence shown as edge labels. The first label represents the relation confidenceestimated with 100 non-parametric bootstrap iterations, the second and third are p-values fortemporal priority and probability raising. Red p-values are above the minimum significane thresholdof . . See Figure 4 in the Main Text for an interpretation of this model.45igure S12: Non-parametric bootstrap scores for MSI-HIGH progression.
Progressionmodel for MSI-HIGH tumors with confidence shown as edge labels. The first label represents therelation confidence estimated with 100 non-parametric bootstrap iterations, the second and thirdare p-values for temporal priority and probability raising. Red p-values are above the minimumsignificane threshold of . . See Figure 5 in the Main Text for an interpretation of this model.46 ata selects KRAS PIK3CA
82% 100% 6,20E-92 7,63E-92 2,77E-05 1,51E-01 0,00E+00 1,51E-01 0,00E+00 KRAS MEMO
64% 100% 6,64E-92 3,68E-88 7,68E-03 2,63E-01 0,00E+00 2,63E-01 0,00E+00 FBXW7 XOR_SOX9
49% 100% 3,15E-83 2,57E-85 1,66E-03 5,92E-02 0,00E+00 5,92E-02 0,00E+00
Deletion4
ARID1A SOX9
34% 100% 1,73E-20 6,09E-78 9,25E-04 5,26E-02 0,00E+00 5,26E-02 0,00E+00
Amplification5
PIK3CA TCF7L2
34% 54% 2,05E-92 5,56E-89 0,00E+00 1,32E-02 0,00E+00 1,32E-02 0,00E+00
Mutation6
TCF7L2 DKK4
32% 95% 1,15E-91 1,33E-64 6,34E-04 1,97E-02 0,00E+00 1,97E-02 0,00E+00
Pattern7
FAM123B ATM
48% 48% 4,61E-01 2,70E-91 9,82E-04 1,01E-01 4,44E-03 1,02E-01 4,65E-03 ERBB2 ERBB2
64% 51% 1,27E-01 4,01E-77 5,46E-05 5,99E-02 6,54E-03 6,05E-02 6,05E-03
Data selects
BIC ATM DKK4
81% 70% 3,59E-92 2,97E-68 7,93E-04 2,70E-02 3,73E-03 1,97E-02 0,00E+00
AIC APC
DKK4
53% 100% 4,70E-92 1,21E-104 0,00E+00 3,95E-02 0,00E+00 3,95E-02 0,00E+00 KRAS SMAD4
50% 69% 6,06E-92 3,49E-73 3,72E-02 1,12E-01 0,00E+00 1,12E-01 0,00E+00 ARID1A XOR_SOX9
50% 90% 2,20E-09 1,02E-77 1,60E-03 5,72E-02 4,44E-03 7,24E-02 0,00E+00 TP53 NRAS
49% 91% 5,62E-92 1,35E-69 1,22E-01 8,55E-02 0,00E+00 4,20E-01 0,00E+00 p-values < 0,010,01 - 0,05
Data > 0,05selects APC PIK3CA
66% 100% 2,53E-65 3,81E-72 4,73E-02 2,41E-01 4,36E-02 2,19E-01 3,24E-02 ARID1A DKK2
62% 100% 3,62E-71 3,31E-76 5,72E-03 1,85E-01 0,00E+00 2,00E-01 3,12E-02 PIK3CA DKK4
58% 100% 4,83E-53 2,65E-75 2,85E-02 1,37E-01 3,51E-02 1,30E-01 4,36E-02 APC XOR_NRAS
54% 100% 2,51E-78 5,36E-89 0,00E+00 1,11E-01 0,00E+00 1,11E-01 0,00E+00 APC SMAD3
48% 79% 9,11E-81 8,62E-81 0,00E+00 7,41E-02 0,00E+00 7,41E-02 0,00E+00 ERBB2 SMAD2
48% 92% 3,87E-28 6,69E-68 0,00E+00 1,41E-01 1,56E-02 1,37E-01 2,50E-02 NRAS ACVR1B
48% 68% 2,93E-20 1,88E-40 0,00E+00 7,04E-02 1,17E-02 3,70E-02 0,00E+00 FAM123B SMAD4
46% 100% 4,12E-51 1,98E-66 3,99E-03 1,78E-01 3,40E-02 1,96E-01 3,05E-02 AXIN2 TP53
32% 86% 1,32E-07 9,08E-69 1,28E-02 1,74E-01 3,05E-02 1,59E-01 3,51E-02
LRP5 TGFBR2
30% 69% 8,32E-74 3,42E-84 0,00E+00 8,52E-02 3,51E-02 7,41E-02 0,00E+00
ERBB3 KRAS
27% 50% 3,62E-01 4,18E-77 4,65E-03 1,56E-01 2,34E-02 1,48E-01 0,00E+00
Data selects APC KRAS
51% 100% 8,53E-34 5,84E-75 4,27E-02 1,56E-01 2,34E-02 2,59E-01 0,00E+00 APC MEMO
48% 100% 3,44E-18 1,93E-64 9,10E-02 4,15E-01 4,88E-02 3,96E-01 3,92E-02 error (mean μ, stdev σ)
MICRO-SATELLITE STABLE TUMORSHIGHLY MICRO-SATELLITE INSTABLE TUMORS
Posterior classification
BIC RelationAIC Relation
Bootstrap p-values
Selective Advantage
Bootstrap p-values
Selective Advantage Prediction Posterior classification
AIC Relation
Bootstrap p-values Errors
Selective Advantage Prediction
BIC Relation p-values k-fold cross-validation
Selective Advantage Prediction Posterior classification
Bootstrap k-fold cross-validation
Prediction Posterior classification
Errors
Non-paramertic minimum scoreEvent type approximate
Figure S13:
COADREAD statistics for models confidence.
For BIC models we show statisticsfor edges with non-parametric bootstrap score approximately greater than , for AIC modelsthose greater than s. i ) p-values (100 repetition of non-parametric bootstrap, prior to Wilcoxontesting) for each edge statistics of selective advantage ( direction and statistical dependence, andhypergeometric ). In general, the edges that we selected show very strong support ( p (cid:28) − ),but for those edges connecting events with the same marginal frequencies, where we can not beconfident in the edge direction ( p > . ) but still we find strong statistical dependence. ( ii ) A posteriori model confidence against Type I and II errors estimated with non-parametric andstatistical bootstraps (100 repetitions) – edges annotated in Figures S11 and S12. ( iii ) Values of posterior classification and prediction errors are estimated from 10 repetitions of 10-fold cross-validation . The former reports how much error is due to predicting, for each set of edges X = { x , . . . , x n } → y , the value of y according to the value of each x i ∈ X . The latter reports the samestatistics when we predict y from the whole set of parents X .47 B I C A I C Entropy loss for MSS model
Log-Likelihood Loss (disc.)
BIC score: -2487.12 logLik: -2318.82 loss: 0.663 %
AIC score: -2377.89 logLik: -2294.89 loss: 0.661 % B I C A I C Entropy loss for MSI model
Log-Likelihood Loss (disc.)
BIC score: -466.15 logLik: -385.4 loss: 3.89 %
AIC score: -429.25 logLik: -361.25 loss: 3.844 %
Figure S14:
Entropy loss for MSI-HIGH/MSS models.
Violin plot computed from runsof k -fold cross-validation with k = 10 , where we compute the “loss of log-likelihood” at each fold.In the plot and in the table we report also the overall log-likelihood, as well as the BIC and AICscores for the models. We present the ratio of log-likelihood loss as a measure of stability of thesemodels for these two datasets – we can observe that the MSS models lose < of their likelihood,while the MSI lose slightly more (still, < ), possibly because of the smaller sample size. Froma statistical point of view, the greater (despite small) loss of likelihood by the models regularizedvia BIC confirms its tendency to underestimate the true model (i.e., the model should have falsenegatives, which could be AIC’s edges). 48 PC ATM
TP53
NRAS
BRAF
KRAS
SOX9
SMAD2
TCF7L2
DKK2
DKK1
DKK4
PTEN
SMAD4
IGF2
PTEN
DKK4
BRAF
ERBB2
XOR_BRAF
SMAD2
SOX9
ERBB2
XOR_PTEN
XOR_DKK4
XOR_SOX9
XOR_SMAD2
CTNNB1
ARID1A
OR_ERBB2
NRAS
TCF7L2
XOR_NRAS
FAM123B
ATM
XOR_TCF7L2
XOR_ATM
SMAD4
FBXW7
XOR_SMAD4
PIK3CA
OR_APC_CTNNB1_DKK2
OR_APC_CTNNB1
XOR_APC
APC
OR_ATM_PIK3CA_TP53
OR_ERBB2_IGF2_PIK3CA_PTEN
OR_ATM_DKK1_DKK2_TP53
OR_ATM_TP53
OR_ARID1A_TP53
XOR_TP53
TP53
OR_BRAF_KRAS_NRAS
KRAS . . . . . . MSS Prediction Error (BIC parent set, k-fold cross-validation)
Node type
Pattern
Mutation
Amplification
Deletion
Parent set
Empty
Cardinality . . . . . . . . . . . . . . . . . . . . . . APC
ATM
TP53
NRAS
BRAF
KRAS
SOX9
SMAD2
TCF7L2
DKK2
DKK1
PTEN
SMAD4
IGF2
DKK4
PTEN
DKK4
BRAF
ERBB2
ERBB2
XOR_BRAF
SMAD2
SOX9
XOR_PTEN
XOR_DKK4
XOR_SMAD2
XOR_SOX9
CTNNB1
ARID1A
OR_ERBB2
NRAS
TCF7L2
XOR_NRAS
FAM123B
ATM
XOR_TCF7L2
XOR_ATM
SMAD4
FBXW7
XOR_SMAD4
PIK3CA
OR_APC_CTNNB1_DKK2
OR_APC_CTNNB1
XOR_APC
APC
OR_ATM_PIK3CA_TP53
OR_ERBB2_IGF2_PIK3CA_PTEN
OR_ATM_DKK1_DKK2_TP53
OR_ATM_TP53
OR_ARID1A_TP53
XOR_TP53
TP53
OR_BRAF_KRAS_NRAS
KRAS . . . . . . MSS Prediction Error (AIC parent set, k-fold cross-validation)
Node type
Pattern
Mutation
Amplification
Deletion
Parent set
Empty
Cardinality . . . . . . . . . . . . . . . . . . . . . . . . Figure S15:
Prediction error for each parent set of BIC and AIC models of MSS tumors. GF2
NRAS
ACVR1B
NRAS
TGFBR2
SMAD3
DKK1
XOR_NRAS
SMAD2
TP53
KRAS
ERBB2
TGFBR1
TCF7L2
DKK4
SMAD4
ACVR2A
AXIN2
OR_KRAS_NRAS_BRAF
DKK2
LRP5
PIK3CA
ACVR1B
ATM
OR_BRAF_KRAS
OR_BRAF_NRAS_TGFBR1
ERBB3
XOR_ACVR1B
FAM123B
OR_ACVR1B_ACVR2A_ERBB2_TP53
OR_ERBB2_IGF2_PIK3CA
APC
ARID1A
BRAF
FBXW7 . . . . MSI Prediction Error (BIC parent set, k-fold cross-validation)
Node type
Pattern
Mutation
Amplification
Deletion
Parent set
Empty
Cardinality . . . . . . . . . . . . . . . SMAD2
NRAS
IGF2
ACVR1B
SMAD3
DKK1
TGFBR2
DKK4
XOR_NRAS
NRAS
ERBB2
TGFBR1
KRAS
TP53
SMAD4
ACVR2A
AXIN2
OR_KRAS_NRAS_BRAF
LRP5
TCF7L2
DKK2
PIK3CA
ACVR1B
ATM
OR_BRAF_KRAS
OR_BRAF_NRAS_TGFBR1
ERBB3
XOR_ACVR1B
FAM123B
OR_ACVR1B_ACVR2A_ERBB2_TP53
APC
ARID1A
OR_ERBB2_IGF2_PIK3CA
BRAF
FBXW7 . . . . MSI Prediction Error (AIC parent set, k-fold cross-validation)
Node type
Pattern
Mutation
Amplification
Deletion
Parent set
Empty
Cardinality . . . . . . . . . . . . . . . . Figure S16:
Prediction error for each parent set of BIC and AIC models of MSI-HIGHtumors.
Models and the phenotype that they might explain. (biomarker)
Indepen-dent evolutionary trajectories depicted by a model might share common routes through a certainalteration Y; that could point to a new biomarker harbored by most of the tumors under study. (drug resistance)
When a progression model branches in many independent sub-progressions,each one identified by alterations X, Y and Z, if a certain drug is known to target only a certaintype of such clones (e.g., those where biomarker X is present), we might get insights on which arethe biomarkers which make the drug ineffective for certain patients (e.g., those were cancer evolvesthrough Y and Z). (metastatic)
When a model is extracted from data representative of varioustumor stages, we might discover which “late events” are those conferring a metastatic phenotypeto a tumor – X in the figures. (anti-hallmarks)(anti-hallmarks)