[PDF] Algorithmic Methods to Infer the Evolutionary Trajectories in Cancer Progression

Abstract

The genomic evolution inherent to cancer relates directly to a renewed focus on the voluminous next generation sequencing (NGS) data, and machine learning for the inference of explanatory models of how the (epi)genomic events are choreographed in cancer initiation and development. However, despite the increasing availability of multiple additional -omics data, this quest has been frustrated by various theoretical and technical hurdles, mostly stemming from the dramatic heterogeneity of the disease. In this paper, we build on our recent works on "selective advantage" relation among driver mutations in cancer progression and investigate its applicability to the modeling problem at the population level. Here, we introduce PiCnIc (Pipeline for Cancer Inference), a versatile, modular and customizable pipeline to extract ensemble-level progression models from cross-sectional sequenced cancer genomes. The pipeline has many translational implications as it combines state-of-the-art techniques for sample stratification, driver selection, identification of fitness-equivalent exclusive alterations and progression model inference. We demonstrate PiCnIc's ability to reproduce much of the current knowledge on colorectal cancer progression, as well as to suggest novel experimentally verifiable hypotheses.

Full PDF

AAlgorithmic methods to infer the evolutionary trajectoriesin cancer progression

Giulio Caravagna , ,(cid:63) Alex Graudenzi , Daniele Ramazzotti Rebeca Sanz-Pamplona Luca De Sano Giancarlo Mauri , Victor Moreno , Marco Antoniotti , Bud Mishra November 17, 2018 Department of Informatics, Systems and Communication, University of Milan-Bicocca, Milan, Italy. School of Informatics, University of Edinburgh, Edinburgh, UK. Institute of Molecular Bioimaging and Physiology of the Italian National Research Council (IBFM-CNR), Milan,Italy. Unit of Biomarkers and Susceptibility, Cancer Prevention and Control Program, Catalan Institute of Oncology(ICO), IDIBELL & CIBERESP. Hospitalet de Llobregat, Barcelona, Spain. SYSBIO Centre of Systems Biology, Milano, Italy. Department of Clinical Sciences, Faculty of Medicine, University of Barcelona, Barcelona, Spain. Milan Center for Neuroscience, University of Milan-Bicocca, Milan, Italy. Courant Institute of Mathematical Sciences, New York University, New York, USA. (cid:63)

Corresponding author, [email protected] . Contents

PicNic pipeline 6

PicNic ’s implementation for COADREAD samples 31

C.1 TCGA COADREAD project data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31C.2 Driver events selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32C.3 Mutual exclusivity groups of alterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 a r X i v : . [ q - b i o . GN ] M a r .4 CAPRI’s execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 D Statistical validation of the models 34

D.1 Edge p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34D.2 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35D.3 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

E Supplementary Tables and Figures 37

List of Figures

PicNic pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 The

PicNic pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Data processed for MSI-HIGH tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244 Progression of MSS tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 Progression of MSI-HIGH tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26S6

PicNic pipeline processing MSI-HIGH/MSS tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40S7 Mutations annotated by TCGA for the driver genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41S8 Lolliplot diagrams of TCGA mutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42S9 Groups of exclusive alterations for MSS tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43S10 Selected data for MSS tumors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44S11 Non-parametric bootstrap scores for MSS progression . . . . . . . . . . . . . . . . . . . . . . . . . . . 45S12 Non-parametric bootstrap scores for MSI-HIGH progression . . . . . . . . . . . . . . . . . . . . . . . . 46S13 COADREAD statistics for models conﬁdence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47S14 Entropy loss for MSI-HIGH/MSS models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48S15 Prediction error for each parent set of BIC and AIC models of MSS tumors . . . . . . . . . . . . . . . 49S16 Prediction error for each parent set of BIC and AIC models of MSI tumors . . . . . . . . . . . . . . . 50S17 Models and the phenotype that they might explain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 bstract The genomic evolution inherent to cancer relates directly to a renewed focus on the vo-luminous next generation sequencing (NGS) data, and machine learning for the inference ofexplanatory models of how the (epi)genomic events are choreographed in cancer initiation anddevelopment. However, despite the increasing availability of multiple additional -omics data,this quest has been frustrated by various theoretical and technical hurdles, mostly stemmingfrom the dramatic heterogeneity of the disease. In this paper, we build on our recent workson “selective advantage” relation among driver mutations in cancer progression and investi-gate its applicability to the modeling problem at the population level. Here, we introducePiCnIc (Pipeline for Cancer Inference), a versatile, modular and customizable pipeline to ex-tract ensemble-level progression models from cross-sectional sequenced cancer genomes. Thepipeline has many translational implications as it combines state-of-the-art techniques for sam-ple stratiﬁcation, driver selection, identiﬁcation of ﬁtness-equivalent exclusive alterations andprogression model inference. We demonstrate PiCnIc’s ability to reproduce much of the currentknowledge on colorectal cancer progression, as well as to suggest novel experimentally veriﬁablehypotheses.

Keywords:

Cancer evolution; Selective advantage; Bayesian Structural Inference

Statement of significance:

A causality based new machine learning Pipeline for Cancer Infer-ence (

PicNic ) is introduced to infer the underlying somatic evolution of ensembles of tumors fromnext generation sequencing data.

PicNic combines techniques for sample stratiﬁcation, driver selec-tion and identiﬁcation of ﬁtness-equivalent exclusive alterations to exploit a novel algorithm basedon Suppes’ probabilistic causation. The accuracy and translational signiﬁcance of the results arestudied in details, with an application to colorectal cancer.

PicNic pipeline has been made publiclyaccessible for reproducibility, interoperability and for future enhancements.

Since the late seventies evolutionary dynamics, with its interplay between variation and selection,has progressively provided the widely-accepted paradigm for the interpretation of cancer emergenceand development [1–3]. Random alterations of an organism’s (epi)genome can sometimes confera functional selective advantage to certain cells, in terms of adaptability and ability to surviveand proliferate. Since the consequent clonal expansions are naturally constrained by the avail-ability of resources (metabolites, oxygen, etc.), further mutations in the emerging heterogeneoustumor populations are necessary to provide additional ﬁtness of diﬀerent kinds that allow survivaland proliferation in the unstable micro environment. Such further advantageous mutations willeventually allow some of their sub-clones to outgrow the competing cells, thus enhancing tumor’sheterogeneity as well as its ability to overcome future limitations imposed by the rapidly exhaust-ing resources. Competition, predation, parasitism and cooperation have been in fact theorized asco-present among cancer clones [4].In the well-known vision of Hanahan and Weinberg [5, 6], the phenotypic stages that charac-terize this multistep evolutionary process are called hallmarks . These can be acquired by cancercells in many possible alternative ways, as a result of a complex biological interplay at severalspatio-temporal scales that is still only partially deciphered [7]. In this framework, we distinguish For this and other technical terms commonly used in the statistics and cancer biology communities we providea Glossary in the Supplementary Material. drivers ) by activating oncogenes or in-activating tumor suppressor genes , from those that are transferred to sub-clones without increasingtheir ﬁtness (i.e., passengers ) [8]. Driver identiﬁcation is a modern challenge of cancer biology, asdistinct cancer types exhibit very diﬀerent combinations of drivers, some cancers display mutationsin hundreds of genes [9], and the majority of drivers is mutated at low frequencies (“long tail”distribution), hindering their detection only from the statistics of the recurrence at the population-level [10].Cancer clones harbour distinct types of alterations. The somatic (or genetic ) ones involveeither few nucleotides or larger chromosomal regions. They are usually catalogued as mutations - i.e., single nucleotide or structural variants at multiple scales (insertions, deletions, inversions,translocations) – of which only some are detectable as

Copy Number Alterations (CNAs), mostprevalent in many tumor types [11]. Also epigenetic alterations, such as

DNA methylation and chromatin reorganization , play a key role in the process [12]. The overall picture is confoundedby factors such as genetic instability [13], tumor-microenvironment interplay [14, 15], and by theinﬂuence of spatial organization and tissue speciﬁcity on tumor development [16] .Signiﬁcantly, in many cases, distinct driver alterations can damage in a similar way the same functional pathway , leading to the acquisition of new hallmarks [17–21]. Such alterations individu-ally provide an equivalent ﬁtness gain to cancer cells, as any additional alteration hitting the samepathway would provide no further selective advantage. This dynamic results in groups of driveralterations that form mutually exclusive patterns across tumor samples from diﬀerent patients (i.e.,the sets of alterations that are involved in the same pathways tend not to occur mutated together).This phenomenon has signiﬁcant translational consequences.An immediate challenge posed by this state of aﬀairs is the dramatic heterogeneity of cancer, bothat the inter-tumor and at the intra-tumor levels [22]. The former manifests as diﬀerent patients withthe same cancer type can display few common alterations. This obsersvation led to the developmentof techniques to stratify tumors into subtypes with diﬀerent genomic signatures, prognoses andresponse to therapy [23]. The latter form of heterogeneity refers to the observed genotypic andphenotypic variability among the cancer cells within a single neoplastic lesion, characterized by thecoexistence of more than one cancer clones with distinct evolutionary histories [24].Cancer heterogeneity poses a serious problem from the diagnostic and therapeutic perspectiveas, for instance, it is now acknowledged that a single biopsy might not be representative of otherparts of the tumor, hindering the problem of devising eﬀective treatment strategies [4]. Therefore,presently the quest for an extensive etiology of cancer heterogeneity and for the identiﬁcation ofcancer evolutionary trajectories is central to cancer research, which attempts to exploit the massiveamount of sequencing data available through public projects such as The Cancer Genome Atlas ( TCGA ) [25].Such projects involve an increasing number of cross-sectional (epi)genomic proﬁles collected viasingle biopsies of patients with various cancer types, which might be used to extract trends of cancerevolution across a population of samples . Higher resolution data such as multiple samples collectedfrom the same tumor [24], as well as single-cell sequencing data [26], might be complementarilyused to face the same problem within a speciﬁc patient. However, the lack of public data coupled We mention that much attention has been recently casted on newly discovered cancer genes aﬀecting globalprocesses that are apparently not directly related to cancer development, such as cell signaling, chromatin andepigenomic regulation, RNA splicing, protein homeostasis, metabolism and lineage maturation [10]. At the time of this writing, in TCGA, sample sizes per cancer type are in the order of a few hundreds. Suchnumbers are expected to increase in the near future, with a clear beneﬁt for all the statistical approaches to analyzecancer data which currently lack a proper background of data.

4o the problems of accuracy and reliability, currently prevents a straightforward application [27].These diﬀerent perspectives lead to the diﬀerent mathematical formulations of the problem of inferring a cancer progression model from genomic data, and a need for versatile computational toolsto analyze data reproducibly – two intertwined issues examined at length in this paper [28]. Indeed,such models and tools can be focused either on characteristics of a population, i.e. ensemble-level ,or on multiple clonality in a single-patient . In general, both problems deal with understanding the temporal ordering of somatic alterations accumulating during cancer evolution, but use orthogonalperspectives and diﬀerent input data – see Figure 1 for a comparison. This paper proposes anew computational approach to eﬃciently deal with various aspects of the problem at a patientpopulation level, relegating the other aspects to future publications.

Ensemble-level cancer evolution.

It is thus desirable to extract a probabilistic graphical model explaining the statistical trend of accumulation of somatic alterations in a population of n cross-sectional samples collected from patients diagnosed with a speciﬁc cancer. To normalize againstthe experimental conditions in which tumors are sampled, we only consider the list of alterationsdetected per sample – thus, as 0/1 Bernoulli random variables.Much of the diﬃculty lies in estimating the true and unknown trends of selective advantage among genomic alterations in the data, from such observations. This hurdle is not unsurmountable,if we constrain the scope to only those alterations that are persistent across tumor evolution in allsub-clonal populations , since it yields a consistent model of a temporal ordering of mutations.Therefore, epigenetic and trascriptomic states, such as hyper and hypo-methylations or over andunder expression, could only be used, provided that they are persistent through tumor development[29].Historically, the linear model of colorectal tumor progression by Vogelstein is an instance ofan early solution to the cancer progression problem [30]. That approach was later generalized toaccommodate tree-models of branched evolution [31–34] and later, further generalized to the infer-ence of directed acyclic graph models, with several distinct strategies [35–38]. We contributed tothis research program with the Cancer Progression Extraction with Single Edges ( CAPRESE ) and the

Cancer Progression Inference ( CAPRI ) algorithms, which are currently implemented in

TRONCO , anopen source R package for Translational Oncology available in standard repositories [39–41]. Bothtechniques rely on Suppes’ theory of probabilistic causation to deﬁne estimators of selective advan-tage [42], are robust to the presence of noise in the data and perform well even with limited samplesizes. The former algorithm exploits shrinkage-like statistics to extract a tree model of progression,the latter combines bootstrap and maximum likelihood estimation with regularization to extractgeneral directed acyclic graphs that capture branched, independent and conﬂuent evolution. Bothalgorithms represent the current state-of-the-art approach to this problem, as they outperformothers in speed, scale and predictive accuracy.

Clonal architecture in individual patients.

A closely related problem addresses the detectionof clonal signatures and their prevalence in individual tumors, a problem complicated by intra-tumor heterogeneity.Even though this phylogenetic version of the progression inference problem naturally relies ondata produced from single-cell sequencing assays [43, 44], the majority of approaches still make useof bulk sequencing data, usually from multiple biopsies of the same tumors [24, 45]. Indeed, severalapproaches try to extract the clonal signature of single tumors from allelic imbalance proportions ,5 problem made diﬃcult as sequenced samples usually contain a large number of cells belonging toa collection of sub-clones resulting from the complex evolutionary history of the tumor [46–55].We keep the current work focused on the inference of progression models at the ensemble level,and plan to return to this variant to the problem in another publication.

PicNic pipeline

We report on the design, development and evaluation of the

Pipeline for Cancer Inference ( PicNic )to extract ensemble-level cancer progression models from cross-sectional data (Figure 1).

PicNic is versatile, modular and customizable; it exploits state-of-the-art data processing and machinelearning tools to:1. identify tumor subtypes and then in each subtype;2. select (epi)genomic events relevant to the progression;3. identify groups of events that are likely to be observed as mutually exclusive ;4. infer progression models from groups and related data, and annotate them with associatedstatistical conﬁdence.All these steps are necessary to minimize the confounding eﬀects of inter-tumor heterogeneity, whichare likely to lead to wrong results when data is not appropriately pre-processed .In each stage of PicNic diﬀerent techniques can be employed, alternatively or jointly, accordingto speciﬁc research goals, input data, and cancer type. Prior knowledge can be easily accommo-dated into our pipeline, as well as the computational tools discussed in the next subsections andsummarized in Figure 2. The rationale is similar in spirit to workﬂows implemented by consortiasuch as

TCGA to analyze huge populations of cancer samples [56, 57]. One of the main noveltiesof our approach, is the exploitation of groups of exclusive alterations as a proxy to detect ﬁtness-equivalent trajectories of cancer progression. This strategy is only feasible by the hypothesis-testingfeatures of the recently developed

CAPRI algorithm, an algorithm uniquely addressing this crucialaspect of the ensemble-level progression inference problem [40].In the Results section, we study in details a speciﬁc use-case for the pipeline, processing colorectalcancer data from

TCGA , where it is able to re-discover much of the existing body of knowledgeabout colorectal cancer progression. Based on the output of this pipeline, we also propose novelexperimentally-veriﬁable hypotheses.

In general, for each of n tumors (patients) we assume relevant (epi)genetic data to be available. Wedo not put constraints on data gathering and selection, leaving the user to decide the appropriate“resolution” of the input data. For instance, one might decide whether somatic mutations shouldbe classiﬁed by type or by location, or aggregated. Or, one might decide to lift focal CNAs to The genuine selectivity relationship sought to be inferred are subject to the vagaries of Simpson’s paradox; itcan change, or worst reverse, when we try to infer them from data not suitably pre-processed. This eﬀect (due tosuch paradox) manifests as data are sampled from a highly heterogenous mixture of populations of cells [40]. PiCnIcuses various mechanisms to avoid these pitfalls. In this context, it should be pointed out that input bulk sequencingdata suﬀers also from intra-tumor heterogeneity issues, which are unfortunately intrinsic to the technology. heterogeneous mixture of input samples. In some cases the classiﬁcation can beneﬁt from clinical biomarkers, such asevidences of certain cell types [59], but in most cases we will have to rely on multiple clustering techniques at once, see, e.g., [56,57]. Many common approaches cluster expression proﬁles [60], oftenrelying on non-negative matrix factorization techniques [61] or earlier approaches such as k -means,Gaussians mixtures or hierarchical/spectral clustering - see the review in [62]. For glioblastomaand breast cancer, for instance, mRNA expression subtypes provides good correlation with clinicalphenotypes [63–65]. However, this is not always the case as, e.g., in colorectal cancer such clustersmismatch with survival and chemotherapy response [63]. Clustering of full exome mutation proﬁlesor smaller panels of genes might be an alternative as it was shown for ovarian, uterine and lungcancers [66, 67].Using pipelines such as PicNic , we expect that the resulting subtypes will be routinely in-vestigated, eventually leading to distinct progression models, which shall be characteristic of thepopulation-level trends of cancer initiation and progression.

In subtypes detection, it becomes easier to ﬁnd similarities across input samples when more al-terations are available, as features selection gains precision. In progression inference, instead, onewishes to focus on m (cid:28) n driver alterations, which ensure also an appropriate statistical ratiobetween sample size ( n , here the subtype size) and problem dimension ( m ).Multiple tools ﬁlter out driver from passenger mutations. MutSigCV identiﬁes drivers mutatedmore frequently than background mutation rate [68].

OncodriveFM avoids such estimation butlooks for functional mutations [69].

OncodriveCLUST scans mutations clustering in small regionsof the protein sequence [70].

MuSiC uses multiple types of clinical data to establish correlationsamong mutation sites, genes and pathways [71]. Some other tools search for driver CNAs that aﬀectprotein expression [72]. All these approaches use diﬀerent statistical measures to estimate signs ofpositive selection, and we suggest using them in an orchestrated way, as done by platforms such as

Intogen [73].We anticipate that such tools will run independently on each subtype, as driver genes will likelydiﬀer across them, mimicking the diﬀerent molecular properties of each group of samples; also, listsof genes produced by these tools might be augmented with prior knowledge about tumor suppressorsor oncogenes.

When working at the ensemble-level, identiﬁcation of “groups of mutually exclusive” alterations iscrucial to derive a correct inference. This step of

PicNic is another attempt to resolve part of theinter-tumor heterogeneity, as such alterations could lead to the same phenotype (i.e., hence resulting“equivalent” in terms of progression), despite being genotypically “alternative”, i.e., exclusive, acrossthe input cohort. This information shall be used to detect alternative routes to cancer progressionwhich capture the speciﬁcities of individual patients.7 plethora of recent tools can be used to detect groups of ﬁtness equivalent alterations, accordingto the data available for each subtype; greedy approaches [74, 75] or their optimizations, such as

MEMO , which constrain search-space with network priors [76]. This strategy is further improvedin

MUTEX , which scans mutations and focal CNAs for genes with a common downstream eﬀectin a curated signalling network, and selects only those genes that signiﬁcantly contributes to theexclusivity pattern [77]. Other tools such as

Dendrix , MDPFinder , Multi-Dendrix , CoMEt , MEGSA or ME , employ advanced statistics or generative approaches without priors [78–83].In such groups, we distinguish between hard and soft forms of exclusivity, the former assumingstrict exclusivity among alterations, with random errors accounting for possible overlaps (i.e., themajority of samples do not share alterations from such groups), the latter admitting co-occurrences(i.e., some samples might have common alterations, within a group) [77]. CAPRI is currently the only algorithm which incorporates this type of information, in inferringa model. Each of these groups are in fact associated with a “testable hypothesis” written in the well-known language of propositional Boolean formulas . Consider the following example: we might beinformed that apc and ctnnb1 mutations show a trend of soft-exclusivity in our cohort – i.e., somesamples harbor both mutations, but the majority just one of the two mutated genes. Since suchmutations lead to β -catenin deregulation (the phenotype), we might wonder whether such state ofaﬀairs could be responsible for progression initiation in the tumors under study. An aﬀermativeresponse would equate, in terms of progression, the two mutations. To test this hypothesis , one mayspell out formula apc ∨ ctnnb1 to CAPRI , which means that we are suggesting to the inferenceengine that, besides the possible evolutionary trajectories that might be inferred by looking at thetwo mutations as independent , trajectories involving such a “composite” event, shall be consideredas well. It is then up to

CAPRI to decide which, of all such trajectories, is signiﬁcant, in a statisticalsense .In general, formulas allow users to test general hypotheses about complex model structuresinvolving multiple genes and alterations. These are useful in many cases: for instance, wherewe are processing samples which harbour homozygous losses or inactivating mutations in certaingenes (i.e., equally disruptive genomic events), or when we know in advance that certain genes arecontrolling the same pathway, and we might speculate that a single hit in one of those decreases theselection pressure on the others. We note that, with no hypothesis, a model with such alternativetrajectories cannot be analyzed, due to various computational limitations inherent to the inferentialalgorithms (see [40]).From a practical point of view,

CAPRI ’s formulas/hypotheses-testing features “help” the inferenceprocess, but do not “force” it to select a speciﬁc model, i.e., the inference is not biased . In thissense, the trajectories inferred by examining these composite model structures (i.e., the formulas) are not given any statistical advantage for inclusion in the ﬁnal model. However, in spite of a naturaltemptation to generate as many hypotheses as possible, it is prudent to always limit the numberof hypotheses according to the number of samples and alterations. Note that this approach canalso be extended to accommodate, for instance, co-occurrent alterations in signiﬁcantly mutatedsubnetworks [84, 85]. There, logical connectives such as ⊕ (the logical “xor”) act as a proxy for hard-exclusivity, and ∨ (the logical“disjunction”) for soft one. Besides from exclusivity groups, other connectives such as logical conjunction can beused. .4 Progression inference and conﬁdence estimation We use

CAPRI to reconstruct cancer progression models of each identiﬁed molecular subtype, pro-vided that there exist a reasonable list of driver events and the groups of ﬁtness-equivalent exclusivealterations. Since currently

CAPRI represents the state of the art, and supports complex formulasfor groups of alterations detected in the earlier

PicNic step, it was well-suited for the task.

CAPRI ’s input is a binary n × ( m + k ) matrix M with n samples (a subtype size), m driveralteration events (0/1 Bernoulli random variables) and k testable formulas. Each sample in M isdescribed by a binary sequence: the 1’s denote the presence of alterations. CAPRI ﬁrst performs acomputationally fast scan of M to identify a set S of plausible selective advantage relations amongthe driver alterations and the formulas; then, it reduces S to the most relevant ones, ˆ S ⊂ S

Eachrelation is represented as an edge connecting drivers/formulas in a Graphical Model – which shall betermed Suppes-Bayes Causal Network. This network represents the joint probability distribution ofobserving a set of driver alterations in a cancer genome, subject to constraints imposed by Suppes’ probabilistic causation formalism [42].Set S is built by a statistical procedure. Among any pair of input drivers/formulas x and y , CAPRI postulates that x → y ∈ S could be a selective advantage relation with “ x selecting for y ” ifit estimates that two conditions hold1. “ x is earlier than y ”;2. “ x ’s presence increases the probability of observing y ”.Such claims, grounded in Suppes’ theory of probabilistic causation, are expressed as inequalitiesover marginal and conditional distributions of x and y . These are assessed via a standard Mann-Withney U test after the distributions are estimated from a reasonable number (e.g., 100) of non-parametric bootstrap resamples of M (see Supplementary Material). CAPRI ’s increased performanceover existing methods can be motivated by the reduction of the state space within which modelsare searched, via S .Optimization of S is central to our tolerance to false positives and negatives in ˆ S . We would liketo select only the minimum number of relations which are true and statistically supported, and buildour model from those. CAPRI ’s implementation in

TRONCO [41] selects a subset by optimizing a score function which assigns to a model a real number equal to its log-likelihood (probability ofgenerating data for the model) minus a penalty term for model complexity – a regularization termincreasing with ˆ S ’s size, and hence penalizing overly complex models. It is a standard approach toavoid overﬁtting, and usually relies on the Akaike or the

Bayesian Information Criterion ( AIC or BIC )as regularizers. Both scores are approximately correct;

AIC is more prone to overﬁtting but likelyto provide also good predictions from data and is better when false negatives are more misleadingthan positive ones.

BIC is more prone to underﬁtting errors, thus more parsimonious and better Technically, for a set of m alterations modeled by variables x , . . . , x m , such a network is a Graphical Modelrepresenting the factorization of the joint distribution – P ( x , . . . , x m ) – of observing any of the alterations in agenome (i.e., x i = 1 ). This factorization is made compact as the model encodes the statistical dependencies in itsstructure via P ( x , . . . , x m ) = m (cid:89) i =1 P ( x i | π i ) where π i = { x j | x j → x i ∈ ˆ S} are the “parents” of the i -th node. These are those from which the presence of the i -th alteration is predicted. In our approach these edges are the pictorial representation of the selective advantagerelations where the alterations in π i select for x i .

9n opposite direction. As often done, we suggest approaches that to combine but distinguish whichrelations are selected by

BIC versus

AIC . Details on the algorithm are provided as see SupplementaryMaterial.

Statistical conﬁdence of a model.

In-vitro and in-vivo experiments provide the most convinc-ing validation for the newly suggested selective advantage relations and hypotheses, yet this is outof reach in some cases.Nonetheless, statistical validation approaches can be used almost universally to assess the con-ﬁdence of edges, parent sets and whole models, either via hypothesis-testing or bootstrap and cross-validation scores for Graphical Models. We brieﬂy discuss approaches that are implemented in TRONCO , and refer to the Supplementary Materials for additional details.First,

CAPRI builds S by computing two p-values per edge, for the conﬁdence in condition (1) and (2) . In addition, for each edge x → y , it computes a third p-value via hypergeometric testingagainst the hypothesis that the co-occurrence of x and y is due to chance. These p-values measureconﬁdence in the direction of each edge and the amount of statistical dependence among x and y .Second, for each model inferred with CAPRI we can estimate ( a posteriori ) how frequently ouredges would be retrieved if we resample from our data ( non-parametric bootstrap), or from themodel itself, assuming its correctness ( parametric bootstrap) [86]. Also, we can measure the biasin

CAPRI ’s construction of S due to the random procedure which estimates the distributions incondition (1) and (2) ( statistical bootstrap).Third, scores can be computed to quantify the consistency for the model against bias in thedata and models. For instance, non-exhaustive k -fold cross-validation can be used to compute the entropy loss for the whole model, and the prediction and posterior classiﬁcation errors for each edgeor parent set [87]. It is common knowledge that colorectal cancer (CRC) is a heterogeneous disease comprising diﬀerentmolecular entities. Indeed, it is currently accepted that colon tumors can be classiﬁed accordingto their global genomic status into two main types: microsatellite unstable tumors (MSI), fur-ther classiﬁed as high or low, and microsatellite stable (MSS) tumors (also known as tumors with chromosomal instability ). This taxonomy plays a signiﬁcant role in determining pathologic, clinicaland biological characteristics of CRC tumors [88]. Regarding molecular progression, it is also wellestablished that each subtype arises from a distinctive molecular mechanism. While MSS tumorsgenerally follow the classical adenoma-to-carcinoma progression described in the seminal work byVogelstein and Fearon [89], MSI tumors result from the inactivation of DNA mismatch repair geneslike mlh -1 [90].With the aid of the

TRONCO package, we instantiated

PicNic to process colorectal tumorsfreely available through

TCGA project

COADREAD [56] (see Supplementary Figure S1), and in-ferred models for the MSS and MSI-HIGH tumor subtypes (shortly denoted MSI) annotated by theconsortium. In doing so, we used a combination of background knowledge produced by

TCGA andnew computational predictions; to a diﬀerent degree, some knowledge comes from manual curationof data and other from tools mentioned in

PicNic ’s description (see Figure 2). Data and exclusiv-10ty groups for MSI tumors are shown in Figure 3, the analogous for MSS tumors is provided asSupplementary Material.For the models inferred, which are shown in Figures 4 and 5, we evaluated various forms ofstatistical conﬁdence measured as p-values, bootstrap scores (in what follows, npb denotes non-parametric bootstrap and the closer to 100 the better), and cross-validation statistics reportedin the Supplementary Material. Many of the postulated selective advantage relations (i.e., modeledges) have very strong statistical support for

COADREAD samples, although events with similarmarginal frequency may lead to ambiguous imputed temporal ordering (i.e., the edge direction). Ingeneral, we observed that overall the estimates are slightly better in the MSS cohort (entropy loss < versus . ), which is expected given the diﬀerence in sample size of the two datasets (152versus 27 samples), see Material and Methods for details. Interpretation of the models.

Our models capture the well-known features distinguishing MSSand MSI tumors: for the former apc , kras and tp53 mutations as primary events together withchromosomal aberrations, for the latter braf mutations and lack of chromosomal alterations. Of all33 driver genes, 15 are common to both models - e.g., apc , braf , kras , nras , tp53 and fam123b among others (mapped to pathways like wnt , mapk , apoptosis or activation of T-cell lymphocites),although in diﬀerent relationships (position in the model), whereas new (previously un-implicated)genes stood out from our analysis and deserve further research. MSS (Microsatellite Stable).

In agreement with the known literature, in addition to kras , tp53 and apc as primary events, we identify pten as a late event in the carcinogenesis, as well as nras and kras converging in igf2 ampliﬁcation, the former being “selected by” tp53 muta-tions ( npb pik3ca mutations ( npb wnt genes, in agreement with the observation that multiple con-current lesions aﬀecting such pathway confer selective advantage. In this respect, our modelpredicts multiple routes for the selection of alterations in sox9 gene, a transcription factorknown to be active in colon mucosa [91]. Its mutations are directly selected by apc / ctnnb1 alterations (though with low npb score), by arid1a ( npb fbxw7 mutations ( npb ctnnb1 , relatesto sox9 . The sox family of transcription factors have emerged as modulators of canoni-cal wnt / β -catenin signaling in many disease contexts [92]. Also interestingly, fbxw7 hasbeen previously reported to be involved in the malignant transformation from adenoma tocarcinoma [93]. The rightmost part of the model involves genes from various pathways, andoutlines the relation between kras and the pi3k pathway. We indeed ﬁnd selection of pik3ca mutations by kras ones, as well as selection of the whole MEMO module ( npb pi3k pathway [56]. smad4 proteins relate either to kras ( npb fam123b (through atm ) and tcf7l2 converge in dkk2 or dkk4 ( npb

81, 17and 34%).

MSI-HIGH (Microsatellite Unstable).

In agreement with the current literature, braf is the mostcommonly mutated gene in MSI tumors [94].

CAPRI predicted convergent evolution of tu-mors harbouring fbxw7 or apc mutations towards deletions/mutations of nras gene ( npb

21, 28 and 54%), as well as selection of smad2 or smad4 mutations by fam123b mutations( npb

23 and 46%), for these tumors. Relevant to all MSI tumors seems again the role ofthe pi3k pathway. Indeed, a relation among apc and pik3ca mutations was inferred ( npb apc and the whole

MEMO module ( npb apc and erbb3 select for kras mutations ( npb

51 and 27%),which might point to interesting therapeutic implications. In contrast, mutations in braf mostly select for mutations in acvr1b ( npb smad proteins. It forms receptor complex with acvr2a , a gene mutated in thesetumors that selects for tcf7l2 mutations ( npb tp53 mutationsare those selected by mutations in axin2 ( npb wnt signallingpathway, and related to unstable gastric cancer development [96]. Inactivating mutations inthis gene are important, as it provides serrated adenomas with a mutator phenotype in theMSI tumorigenic pathway [97]. Thus, our results reinforce its putative role as driver gene inthese tumors.By comparing these models we can ﬁnd similarity in the prediction of a potential new early eventfor CRC formation, fbxw7 , as other authors have recently described [93]. This tumor suppressoris frequently inactivated in human cancers, yet the molecular mechanism by which it exerts itsanti-tumor activity remains unexplained [98], and our models provide a new hypothesis in thisrespect. This paper represents our continued exploration of the nature of somatic evolution in cancer, andits translational exploitation through models of cancer progression, models of drug resistance (andeﬃcacy), left- and right-censoring, sample stratiﬁcation, and therapy design. Thus this paper em-phasizes the engineering and dissemination of production-quality computational tools as well asvalidation of its applicability via use-cases carried out in collaboration with translational collabo-rators: e.g., colorectal cancer, analyzed jointly with epidemiologists currently studying the diseaseactively. As anticipated, we reasserted that the proposed model of somatic evolution in cancer notonly supports the heterogeneity seen in tumor population, but also suggests a selectivity/causalityrelation that can be used in analyzing (epi)genomic data and exploited in therapy design – which weintroduced in our earlier works [39, 40]. In this paper, we have introduced an open-source pipeline,

PicNic , which minimizes the confounding eﬀects arising from inter-tumor heterogeneity, and we haveshown that

PicNic can be eﬀective in extracting ensemble-level evolutionary trajectories of cancerprogression.When applied to a highly-heterogeneous cancer such as colorectal,

PicNic was able to inferthe role of many known events in colorectal cancer progression (e.g., apc , kras or tp53 in MSStumors, and braf in MSI ones), conﬁrming the validity of our approach . Interestingly, new playersin CRC progression stand out from this analysis such as fbxw7 or axin2 , which deserve furtherinvestigation. In colon carcinogenesis, although each model identiﬁes characteristic early mutationssuggesting diﬀerent initiation events, both models appear to converge in common pathways andfunctions such as wnt or mapk . As a further investigation for CRC, we leave as future work to check whether the inferred progression are alsorepresentative of other subtyping strategies for colorectal cancer, with particular reference to recent works whichshow marked interconnectivity between diﬀerent independent classiﬁcation systems coalescing into four consensusmolecular subtypes [99]. ctnnb1 or in pten , a well-known tumor suppressor gene. Onthe contrary, speciﬁc mutations in MSI tumors appear in membrane receptors such as acvr1b , acvr2a , erbb3 , lrp5 , tgfbr1 and tgfbr2 , as well as in secreted proteins like igf2 , possiblysuggesting that such tumors need to disturb cell-cell and/or cell-microenvironment communicationto grow. At the pathway level, genes exclusively appearing in the MSI progression model accumulatein speciﬁc pathways such as cytokine-cytokine receptor, endocytosis and tgf - β signaling pathway.On the other hand, genes in MSS progression model are implicated in p53 , mTOR , sodium transportor inositol phosphate metabolism.Our study also highlighted the translational relevance of the models that we can produce with PicNic (see Supplementary Figure S12). The evolutionary trajectories depicted by our models can,for instance, suggest previously-uncharacterized phenotypes, help in ﬁnding biomarker moleculespredicting cancer progression and therapy response, explain drug resistant phenotypes and predictmetastatic outcomes. The logical structure of the formulas describing alterations with equivalentﬁtness (i.e., the exclusivity group) can also point to novel targets of therapeutic interventions.In fact, exclusivity groups that are found to have a role in the progression can be screened for synthetic lethality among such genes – thus explaining why we do not observe phenotypes wheresuch alterations co-occur. In this sense, our models describe also such clonal signatures which,though theoretically possible, are not selected. We call such conspicuously absent phenotypes anti-hallmarks [100].Our models have other applications to both computational and cancer research. Our models,as encoded by Suppes-Bayes Causal Networks could be used as informative generative models forthe genomic proﬁles for the cancer patients. In fact, as known in machine learning, such generativemodels are extremely useful in creating better representation of data in terms of, e.g., discriminativekernels, such as Fisher [101]. In practice, this change of representations would allow framing commonclassiﬁcation problems in the domain of our generative structures, i.e., the models, rather than thedata. As a consequence, it is possible to create a new class of more robust classiﬁcation andprediction systems.One may think of these representations as those bringing us closer to phenotypic (and causal)representation of the patient’s tumor, replacing its genotypic (and mutational) representation. Wesuspect that such representations will improve the accuracy of measurement of the biological clocks,dysregulated in cancer and critically needed to be measured in order to predict survival time, timeto metastasis, time to evolution of drug resistance, etc. We believe that these “phenotypic clocks”can be used immediately to direct the therapeutic intervention.Clearly, applicability and reliability of techniques such as

PicNic is very much dependent onthe background of data available. At the time of this writing, the quality, quantity and reliabilityof (epi)genomic data available, e.g., in public databases, is related to the ever increasing com-putational and technological improvements characterizing the wide area of cancer genomics. Ofsimilar importance is the availability of wet-lab technologies for models validation. Our recent workon SubOptical Mapping technology, for instance, points to the ability to cheaply and accuratelycharacterize translocation, indels and epigenomic modiﬁcations at the single molecule and singlecell level [102, 103]. This technology also provides the ability to directly validate (or refute) thehypotheses generated by

PicNic via gene-correction and single cell perturbation approaches.To conclude, the precision of any statistical inference technique, including

PicNic , is inﬂuencedby the quality, availability and idiosyncrasies of the input data – the goodness of the outcomesimproving along with the expected advancement in the ﬁeld. Nevertheless, the strength of the13roposed approach lies in the eﬃcacy in managing possibly noisy/ biased or insuﬃcient data, andin proposing refutable hypotheses for experimental validation.

Processing COADREAD samples with PiCnIc.

We instantiated

PicNic to process clinicallyannotated high MSI-HIGH and MSS colorectal tumors collected from The Cancer Genome Atlasproject “Human Colon and Rectal Cancer” (COADREAD) [56] – see Supplementary Figure S1.Details on the implementation and the source code to replicate this study are available as Supple-mentary Material. COADREAD has enough samples, especially for MSS tumors, to implement aconsistent and signiﬁcant statistical validation of our ﬁndings – see Supplementary Table S1.In brief, we split subtypes by the microsatellite status of each tumor as annotated by the con-sortium (so, step I of

PicNic is done by exploiting background knowledge rather than computationalpredictors). It should be expected that if this step is skipped or this classiﬁcation is incorrect, theresulting models would noticeably diﬀer. Once split into groups, the input COADREAD data isprocessed to maintain only samples for which both high-quality curated mutation and CNA dataare available; for CNAs we use focal high-level ampliﬁcations and homozygous deletions.Then, for each sample we select only alterations (mutations/CNAs) from a list of 33 driver genesmanually annotated to 5 pathways in [56] - wnt , raf , tgf - β , pi3k and p53 (Supplementary FiguresS2 and S3). This list of drivers, step II of PicNic , is produced by TCGA, as a result of manualcuration and running MutSigCV.In the next module of the pipeline, we fetch groups of exclusive alterations. We scanned thesegroups by using the MUTEX tool (Supplementary Table S2), and merged its results with thegroup that TCGA detected by using the MEMO tool, which involves mainly genes from the pi3k pathway. Knowledge on the potential exclusivity among genes in the wnt ( apc , ctnnb1 ) and raf ( kras , nras , braf ) pathways was exploited as well. Groups were then used to create CAPRI’sformulas; we also included hypotheses for genes which harbour mutations and homozygous deletionsacross diﬀerent samples, see Supplementary Table S3. Data and exclusivity groups for MSS tumorsare shown in Supplementary Figure S4 and S5.CAPRI was run, as the last step of PicNic , on each subtype, by selecting recurrent alterationsfrom the pool of 33 pathway genes and using both AIC/BIC regularizer. Timings to run the relevantsteps of the pipeline are reported in the Supplementary Material. In the models of Figures 4 andFigure 5 each edge mirrors selective advantage among the upstream and downstream nodes, asestimated by CAPRI; Mann-Withney U test is carried out with statistical signiﬁcance 0.05, after100 non-parametric bootstrap iterations.The signiﬁcance of the reconstructed models and the input data is assessed by computing all thestatistics/tests discussed in the Main text (temporal priority, probability raising and hypergeometrictesting p-values, bootstrap and cross-validation scores). Motivation and background on each ofthese measures is available in the Supplementary Materials. A table with their values for edgeswith highest non-parametric bootstrap scores is in Supplementary Figure S8.For the MSS cohort all the p-values are strongly signiﬁcant ( p (cid:28) . ) except for the temporalpriority of the edges connecting mutations in fam123b and atm , and erbb2 alterations (mutationsand ampliﬁcations), which leads us to conclude that, even if these pairs of genes seem to undergoselective advantage, the temporal ordering of their occurrence is ambiguous and failed to be imputedcorrectly from the datasets, analyzed here. The same situation occurs in MSI-HIGH tumors, for therelation between kras and erbb3 . Non-parametric and statistical bootstrap estimations are used14o assess the strength of all the ﬁndings (Supplementary Figures S6 and S7). Moreover, any biasin the data is ﬁnally evaluated by cross-validation (Supplementary Figures S8-S11) and commonstatistics such as entropy loss, posterior classiﬁcation and prediction errors. In general, most of theselective advantage relations depicted by the inferred models present a strong statistical support,with the MSS cohort presenting the most reliable results.Summary implementation for COADREAD ( PicNic steps, Figure 2): (1) TCGA clinical classi-ﬁcation, (2) MutSigCV and TCGA manual curation, (3) MEMO, MUTEX and knowledge of wnt and raf pathways and (4) CAPRI.

Implement your own case study with PiCnIc/TRONCO.

TRONCO started as a projectbefore

PicNic , and is our eﬀort at collecting, in a free R package, algorithms to infer progressionmodels from genomic data. In its current version it oﬀers the implementation of the CAPRIand CAPRESE algorithms, as well as a set of routines to pre-process genomic data. With theinvention of

PicNic , it started accommodating software routines to easily interface CAPRI andCAPRESE to some of the tools that we mention in Figure 2. In particular, in its current 2.0version it supports input/output for the Matlab Network Based Stratiﬁcation tool (NBS) andthe Java MUTEX tool, as well as the possibility to fetch data available from the cBioPortal forCancer Genomics ( http://cbioportal.org http://cbioportal.org), which provides a Web resourcefor exploring, visualizing, and analyzing multidimensional cancer genomics data.We plan to extend TRONCO in the future to support other similar tools and become an integralpart of daily laboratory routines, thus facilitating application of PiCnIc to additional use cases.

Authors contributions

This work follows up on our earlier project initiated by BM and carriedout by Milan-Bicocca and the Catalan Institute of Oncology, based on a framework discussed atthe 2014 School on Cancer, Systems and Complexity (CSAC).

PicNic was designed and constructedby MA’s Bioinformatics lab at University of Milan-Bicocca, within a project led and supervised byGC. GC, AG and DR designed the pipeline, and GC, DR and LDS coded and executed it. Datagathering and model interpretation was done by GC, LDS, DR, AG together with BM, VM andRSP. GM, MA, VM and BM provided overall organizational guidance and discussion. GC, AG,RSP and BM wrote the original draft of the paper, which all authors reviewed and revised in theﬁnal form. BM and MA are co-senior authors.

Acknowledgments

MA, GM, GC, AG, DR acknowledge the SysBioNet project, a MIUR initia-tive for the Italian Roadmap of European Strategy Forum on Research Infrastructures (ESFRI) andRegione Lombardia (Italy) for the research projects RetroNet through the ASTIL Program [12-4-5148000-40]; U.A 053 and Network Enabled Drug Design project [ID14546A Rif SAL-7], Fondo Ac-cordi Istituzionali 2009. BM acknowledges founding by the NSF grants CCF-0836649, CCF-0926166and a NCI-PSOC grant. VM and RSP acknowledge the Instituto de Salud Carlos III supported byThe European Regional Development Fund (ERDF) grants PI11-01439, PIE13/00022, the SpanishAssociation Against Cancer (AECC) Scientiﬁc Foundation, and the Catalan Government DURSI,grant 2014SGR647. 15e wish to thank the anonymous reviewers for their help in improving the quality and rigor ofthe presentation.

References [1] Nowell PC (1976) The clonal evolution of tumor cell populations.

Science

Cancer Research

Nature Reviews Cancer

Cell

Cancer attractors: a systems view of tumors from agene network dynamics and developmental perspective (Elsevier), No. 7, pp 869–876.[8] Futreal PA, et al. (2004) A census of human cancer genes.

Nature Reviews Cancer

Science

Cell

Nature Genetics

Nature Reviews Cancer

The Biology of Cancer (Garland Science).[14] Albini A, Sporn MB (2007) The tumour microenvironment as a target for chemoprevention.

Nature Reviews Cancer

Nature

Proceedingsof the National Academy of Sciences

NatureMedicine

Evolutionary Dynamics (Harvard University Press).[19] Wood LD, et al. (2007) The genomic landscapes of human breast and colorectal cancers.

Science

British Journal of Cancer

Nature

The New England Journal of Medicine https://tcga-data.nci.nih.gov https://tcga-data.nci.nih.gov.[26] Navin N, et al. (2011) Tumour evolution inferred by single-cell sequencing.

Nature

NatureMethods

Systematic biology

Proceedings of the National Academy of Sciences

TheNew England Journal of Medicine

Journal of Computational Biology

Journalof Computational Biology

Mathematical Biosciences

Journal of Computational Biology

Bernoulli pp 893–909.[36] Gerstung M, Baudis M, Moch H, Beerenwinkel N (2009) Quantifying cancer progression withconjunctive Bayesian networks.

Bioinformatics

Proceedings of the National Academy of Sciences

Bioinformatics

PLOS ONE

Bioinformatics

Bioinformatics, .[42] Suppes P (1970)

A Probabilistic Theory of Causality (North-Holland Publishing CompanyAmsterdam).[43] Navin NE (2014) Cancer genomics: one cell at a time.

Genome Biology

Nature

Nature Genetics

Genome Biol

Bioinformatics

PLOS Computational Biology

Nature Methods

BMC Bioinformatics

Cell Reports

PLOS Computatioanl Biology

Nature methods

Bioinformatics

Nature

The New England Journal of Medicine

Nature

British Journal of Haematology

Nature

Bioinformatics

BMC Bioinformatics

Nature

Journal of Clinical Oncology

The Lancet

Nature Methods

BMC Genomics

Nature

Nu-cleic Acids Research

Bioinformatics

GenomeResearch

PLOSONE

Nature Methods

The FASEB Journal

BMCMedical Genomics

Genome Research

Genome Biology

Genome Research

Bioinformatics

PLOS Computational Biology

Genome Biology bioRxiv http://dx.doi.org/10.1101/027474 http://dx.doi.org/10.1101/027474 .[83] Szczurek E, Beerenwinkel N (2014) Modeling mutual exclusivity of cancer mutations.

PLoSComputational Biology

Nature Genetics

Journal of Computational Biology

An Introduction to the Bootstrap (CRC press).[87] Koller D, Friedman N (2009)

Probabilistic Graphical Models: Principles and Techniques -Adaptive Computation and Machine Learning (The MIT Press).[88] Ogino S, Goel A (2008) Molecular classiﬁcation and correlates in colorectal cancer.

TheJournal of Molecular Diagnostics

Cell

Nature reviews Clinical oncology

Oncogene β -cateninsignaling in development and disease. Developmental Dynamics

The Journal of Pathology

World Journal of Gastroenterology

Oncogene

HumanPathology

BMC Cancer

Laboratory Investigation

Naturemedicine, in print .[100] Loohuis LO, Witzel A, Mishra B (2014) Cancer hybrid automata: model, beliefs and therapy.

Information and Computation

Journal of The Royal Society Interface

IEEE Transactions on Information Technology in Biomedicine A. Problem statement. (left) Inference of ensemble-level cancer progression models froma cohort of n independent patients (cross-sectional). By examining a list of somatic mutations orCNAs per patient (0/1 variables) we infer a probabilistic graphical model of the temporal orderingof ﬁxation and accumulation of such alterations in the input cohort. Sample size and tumor het-erogeneity complicate the problem of extracting population-level trends, as this requires accountingfor patients’ speciﬁcities such as multiple starting events. (right) For an individual tumor, its clonalphylogeny and prevalence is usually inferred from multiple biopsies or single-cell sequencing data.Phylogeny-tree reconstruction from an underlying statistical model of reads coverage or depths es-timates alterations’ prevalence in each clone, as well as ancestry relations. This problem is mostlyworsened by the high intra-tumor heterogeneity and sequencing issues. B. The PiCnIc pipeline forensemble-level inference includes several sequential steps to reduce tumor heterogeneity, before ap-plying the CAPRI [40] algorithm. Available mutation, expression or methylation data are ﬁrst usedto stratify patients into distinct tumor molecular subtypes, usually by exploiting clustering tools.Then, subtype-speciﬁc alterations driving cancer initiation and progression are identiﬁed with sta-tistical tools and on the basis of prior knowledge. Next is the identiﬁcation of the ﬁtness-equivalentgroups of mutually exclusive alterations across the input population, again done with computa-tional tools or biological priors. Finally, CAPRI processes a set of relevant alterations within suchgroups. Via bootstrap and hypothesis-testing, CAPRI extracts a set of “selective advantage rela-tions” among them, which is eventually narrowed down via maximum likelihood estimation withregularization (with various scores). The ensemble-level progression model is obtained by combiningsuch relations in a graph, and its conﬁdence is assessed via various bootstrap and cross-validationtechniques. 22

Data marked as ✗ can be used when it is persistent (i.e., do not revert back to their original state) during tumor progression. Other: data not common to most tumor types such as fusions or partial tandem duplication. MOTIVATION COMPUTATIONAL OPTIONS AND PRIOR KNOWLEDGE ⌘ EXPECTED OUTPUT AND ACTION TO EXECUTE ✓✗ Groups detection

INPUT DATA * ✓✓✓ M u t a t i o n s O t h e r ✓ C o p y N u m b e r E x p r e ss i o n Determine molecular subtypes likely to progress through different trajectoriesSelect a subset of alterations likely to drive progression M e h t y l a t i o n s ✗ Split the cohort according to each clusterIn each cluster, restrict to consider only driver eventsFor each cluster, for each of its groups, create a logical formula consistent with the statisticValidate statistically or experimentally each one of the inferred models

Cohort subtypingEvents selection

One progression model per subtype

Non-negative Matrix Factorization (NMF), k-Means, Gaussian Mixtures, Hierarchical/Spectral Clustering, Network Based Stratification (NBS)MutSigCV, OncodriveFM, OncodriveCLUST, MuSiC, Oncodrive-CIS, IntogenRatio test, RME, MEMO, MUTEX, Dendrix, MDPFinder, Multi-Dendrix, CoMEt, MEGSA, ME testCAPRI ★ , CAPRESE, Oncotrees, Distance-based, Mixtures, CBN, Resic, BML Known pathway genes with alternative but fitness-equivalent status, or co-occurrently altered ★ CAPRI is the only algorithm to exploit knowledge provided by step 3 via logical formulas hypotheses-testing. Oncotrees, Distance-based, Mixtuers and CAPRESE are constrained to infer at most tree-models of progression.Biomarkers (cell types, known mutations, ….), Clinical Annotations (Mutation Status, Chromosomal Stability, …)Known cancer oncogenes and tumor suppressors, known pathways ✓ Stratified samples (clusters)A rank of genes and their alterations ⌘ Not all tools support all the data that is theoretically usable for a certain step.

Groups satysifying certain statistics (e.g., exclusivity)

Model Inference

Select groups of alterations which should be examined togetherSelect the Graphical Model which explains best the data ✓✓ ✓✓

Figure 2: The PiCnIc pipeline. We do not provide a unique all-encompassing rationale to instantiatePiCnIc as all steps refer to research area currently development, where the optimal approach isoften dependent on the type of data available and prior knowledge about the cancer under study.References are provided for each tool that can be used to instantiate PiCnIc: NMF [61], k-Means,Gaussian Mixtures, Hierarchical/Spectral Clustering [62], NBS [66], MutSigCV [68], OncodriveFM[69], OncodriveCLUST [70], MuSiC [71] Oncodrive-CIS [72] Intogen [73], Ratio [74], RME [75],MEMO [76], MUTEX [77], Dendrix [78], MDPFinder [79], Multi-Dendrix [80], CoMEt [81], MEGSA[82], ME [83], CAPRI [40], CAPRESE [39], Oncotrees [31, 33], Distance-based [32], Mixtures [34],CBN [35, 36], Resic [37] and BML [38]. 23igure 3: A. MSI-HIGH colorectal tumors from the TCGA COADREAD project [56], restrictedto 27 samples with both somatic mutations and high-resolution CNA data available and a selectionout of 33 driver genes annotated to wnt , ras , pi3k , tgf- β and p53 pathways. This datasetis used to infer the model in Figure 5. B. Mutations and CNAs in MSI-HIGH tumors mappedto pathways conﬁrm heterogeneity even at the pathway-level. C. Groups of mutually exclusivealterations were obtained from [56] - which run the MEMO [76] tool - and by MUTEX [77] tool. Inaddition, previous knowledge about exclusivity among genes in the ras pathway was exploited. D. A Boolean formula input to CAPRI tests the hypothesis that alterations in the ras genes kras , nras and braf confer equivalent selective advantage. The formula accounts for hard exclusivityof alterations in nras mutations and deletions, jointly with soft exclusivity with kras and nras alterations. 24igure 4: Selective advantage relations inferred by CAPRI constitute MSS progression; inputdataset in Supplementary Figure S3 and S4. Formulas written on groups of exclusive alterations,e.g., sox9 ampliﬁcations and mutations, are displayed in expanded form; their events are connectedby dashed lines with colors representing the type of exclusivity (red for hard, orange for soft), logicalconnectives are squared when the formula is selected, and circular when the formula selects for adownstream node. For this model of MSS tumors in COADREAD, we ﬁnd strong statistical supportfor many edges (p-values, bootstrap scores and cross-validation statistics shown as SupplementaryMaterial), as well as the overall model. This model captures both current knowledge about CRCprogression – e.g, selection of alterations in pi3k genes by the kras mutations (directed or via theMEMO group, with BIC) – as well as novel interesting testable hypotheses – e.g., selection of sox9 alterations by fbxw7 mutations (with BIC). 25igure 5: A. Selective advantage relations inferred by CAPRI constitute MSI-HIGH progression;input dataset in Figure 3. Formulas written on groups of exclusive alterations are expanded asin Figure 4. For each relation, conﬁdence is estimated as for MSS tumors and reported as Sup-plementary Material. In general, this model is supported by weaker statistics than MSS tumors –possibly because of this small sample size ( n =27 ). Still, we can ﬁnd interesting relations involving apc mutations which select for pik3ca ones (via BIC) as well as selection of the MEMO group( erbb2 / pik3ca mutations or igf2 deletions) predicted by AIC. Similarly, we ﬁnd a strong selectiontrend among mutations in erbb2 and kras , despite in this case the temporal precedence amongthose mutations is not disentangled as the two events have the same marginal frequencies ( ). B. Evolutionary trajectories of clonal expansion predicted from two selective advantage relations in themodel. apc -mutated clones shall enjoy expansion, up to acquisition of further selective advantagevia mutations or homozygous deletions in nras . These cases should be representative of diﬀerentindividuals in the population, and the ensemble-level interpretation should be that “ apc mutationsselect for nras alterations, in hard exclusivity” as no sample harbour both alterations. A similarargument can show that the clones of patients harbouring distinct alterations in acvr1b – anddiﬀerent upstream events – will enjoy further selective advantage from mutation in the tgfbr2 gene. 267

Reproducing this study

Code availability

The implementation of PiCnIc shown in the Main Text was performed by using, as core,the r language, and other external Java tools which we reference in this document. In r ,much of the data processing and inference is done by exploiting the current version of theopen-source TRanslational ONCOlogy ( TRONCO , [41], version 2.3)package which implements up-to-date statistical algorithms to estimate cancer progressionmodels from a list of genomic lesions (e.g., somatic mutations, copy number variations orpersistent epigenetic states) in a population of independent tumors, or in a single patient.

TRONCO ’s oﬃcial webpage is reachable from the Software section of our group’s webpage http://bimib.disco.unimib.it/

By navigating to the Case Studies section of

TRONCO ’s oﬃcial webpage one can ﬁnd the sourcecode to replicate this study (i.e., the

PicNic ’s implementation) along with the documentationdetailing all the implementation, as well as the data that we used. This should allow easyimplementation of similar studies in diﬀerent contexts.28

Glossary

This glossary of terms shall be of help to readers not familiar with the concepts mentioned in theMain Text. For clarity, terms are separate in two categories according to the fact that they arecommon to the statistics or the cancer biology communities. Each term which is included in thisglossary appears in color.

Terms common to the statistics communityTerm Meaning

Boolean for-mula

In CAPRI, a formula written with standard logical operators which capture a relation amonga group of alterations. In

PicNic , these are used to detect alternative routes to selectiveadvantage from mutually exclusive alterations . See:

Fitness equivalence . Ensemblelevel progres-sion inference

Detection of the relations of selective advantage across the permanent alterations ina cohort of independent tumors ( cross-sectional data). When aggregated in a graphi-cal model , these shall picture the most common evolutionary trajectories in the popula-tion/cancer under study. See also: inter-tumor heterogeneity . Graphicalmodels

In this context, a direct acyclic graph with nodes ( alterations ) and edges ( selective ad-vantage relations ), as a shorthand to represent the joint probability of observing a set ofalterations in a sample (i.e., a cancer genotype). See also:

Suppes-Bayes Causal Net-work , Model selection . Individuallevel progres-sion inference

Detection of clonal signatures and their prevalence in individual tumors by scanning multi-region or single-cell sequencing data; clones are then displayed in a phylogenetic treestructure . See also: intra-tumor heterogeneity . Model selec-tion

The process of selecting a model which ﬁts data, according to some criterion. In CAPRI, thisis done by balancing model likelihood (a measure of to which degree data can be explained bythe model) and model complexity (the size of the graphical model ). See: regularization . Phylogenetictree

In this context, rooted tree where each node is a clone , and edges represent ancestry relationsamong clones.

Regularization

Common approach to avoid overﬁtting (false-positives) during model selection – in CAPRIthis is achieved by using the standard AIC/BIC which penalize with diﬀerent severity graph-ical models which contain many selective advantage relations . Simposon’sparadox

A paradox in statistics, in which a trend appears in diﬀerent groups of data but disappearsor reverses when these groups are combined. In this context, this shall refer to genuine selective advantage relations which are not inferred unless data coming from diﬀerentpopulations is separated before doing inference. See: heterogeneity , subtypes , formulas . Suppes-BayesCausal Net-works

A speciﬁc type of graphical model returned by CAPRI algorithm, where each edge satisﬁesSuppes’s conditions of probabilistic causation subsuming temporal ordering and positivestatistical dependence – the statistical approach to estimate selective advantage amongthe alterations. erms common to the cancer biology communityTerm Meaning Alterations

Somatic mutations: A change in the genome of a cell that is not inherited from a parent, butis acquired. CNVs: Structural variation of large regions of DNA segments, including deletions,insertions, duplications and complex multi-site variants.

Bulk se-quencing

Genome sequencing from single tumor samples, each containing a large number of cells. Theresulting genomic proﬁles are derived from a mixture of cells with potentially distinct evolutionaryhistories.

Clones;Clonal ex-pansion

Clone: group of cells sharing an identical genome and that derive from a common ancestor. Clonalexpansion: the production of descendent cells all arising originally from a single cell. In the scenarioof cancer development, tumors develop through a series of clonal expansions, in which the mostfavorable clonal population survives and proliferate.

Cross-sectionaldata

Unique snapshots of data derived from samples that are collected at unknown time points. Usuallyderived from bulk sequencing technologies.

Driver; Pas-senger

Driver: (epi)genetic alteration that provides a selective advantage to a cancer clone . Passen-ger: alteration of a cancer cell that does not increase its ﬁtness . Exclusivityof alter-ations

Group of alterations which manifest few or no co-occurences in a cohort of diﬀerent samples, andmight be ﬁtness-equivalent for tumor progression. Hard exclusivity: when co-occurrences shall beconsidered the result of random errors. Soft exclusivity: when few co-occurrences shall be possible.See: formulas . Fitness

A cell’s ability of surviving, proliferating and adapting to environmental changes, usually within anenvironment with limited and depleting resources (e.g., oxygen or nutrients).

Fitnessequivalence

Groups of driver alterations , functional to the same pathway or equally dysruptive, that can inde-pendently confer a selective advantage to a cancer cell. Multiple co-occurrence of such alterationsto provide no further advantage, hence leading to mutually exclusive alteration patterns acrossdistinct samples.

Hallmark ofcancer

Common traits or phenotypic properties that are supposed to drive the transformation of normalcells to cancer cells. Anti-hallmark: clonal proﬁles that are usually not observed, yet beingtheoretically possible.

Inter-tumorheterogene-ity

The phenomenon according to which diﬀerent patients with the same cancer type usually displaya few common alterations. This is the major problem of inferring ensemble-level cancer pro-gression models . Intra-tumorheterogene-ity

Intra-tumor heterogeneity is related to possible coexistence of diﬀerent cancer clones, with diﬀerentevolutionary histories and diﬀerent mutational proﬁles, within the same tumor. This is the majorproblem of inferring individual-level cancer progression models . Multiregionsequencing

Collection of genomic data obtained by processing multiple spatially separated biopsy samples fromthe same individual tumor.

Next Gen-erationSequencing(NGS)

New technologies for sequencing genomes at high speed and low cost, including, e.g., full-genome/exome sequencing, genome resequencing, transcriptome proﬁling (RNA-Seq), DNA-proteininteractions (ChIP-Seq), and epigenome characterization.

Selective ad-vantage rela-tion

In successive waves of clonal expansions one or more cells of the same clone can (progressively)increase their ﬁtness through the acquisition of additional driver alterations , leading to theemergence and development of a ﬁtter clone. In this case a relation of selective avantage connectsthe earlier to the succeeding alterations . Single-cellsequencing

Recent technology based on the retrieval and analysis of genomic information from individual cells,rather than from mixtures of cells.

Syntheticlethality

The phenomenon according to which two otherwise non-lethal alterations lead the cell death whenthey co-occur within the same cell. See:

Anti-hallmark PicNic ’s implementation for COADREAD samples

Here we detail all the steps implemented to use

PicNic for CRC progression inference.

C.1 TCGA COADREAD project data

COADREAD provides genome-scale analysis of samples with exome sequence, DNA copy number,promoter methylation, messenger RNA and microRNA expression data, which we used to deﬁnethe input dataset. In particular, only samples with both mutations and CNAs proﬁles were used inthe analysis. Supplementary Table S1 details the dataset.

Dataset used to infer models presented in the Main Text.

Samples published in [56] wereused as, to the best of our knowledge, these represent the highest–quality data made available byCOADREAD as of today; for these samples TCGA provides somatic mutation proﬁles and high-resolution focal CNAs via GISTIC. These are obtained from TCGA data freeze as of 2 February2012, downloaded on 12 March 2015, from repository: https://tcga-data.nci.nih.gov/docs/publications/coadread_2012/

The following ﬁles were processed to produce the data:•

TCGA_CRC_Suppl_Table2_Mutations_20120719.xlsx

Somatic mutations proﬁles obtained via whole-exome sequencing of 224 colorectal tumors byTCGA. Data available consists of 15995 mutations in 228 samples, provided in the

Man-ual Annotation Format (MAF). Samples were selected to univocally match the 224 pa-tients as of the TCGA guidelines for aliquote disambiguation, see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode . All the mutations annotated by TCGA – truncating(

De_novo_Start_OutOfFrame , Frame_Shift_Del , Frame_Shift_In , Nonsense_Mutation , Splice_Site , Frame_Shift_Ins , In_Frame_Del ), silent (

Silent ) and missense (

Missense_Mutation ) – wereconsidered for analysis; notice that the majority of them are missense, see Figures S7 and S8.• crc_gistic.txt.zip

Focal

Copy Number Alterations (CNAs) for 564 patients derived from whole-genome sequenc-ing using the Illumina HiSeq platform.

High-level gains and homozygous deletions were con-sidered for analysis by selecting entries with GISTIC scores ± crc_clinical_sheet.txt Clinical data summary with patient stage and

Micro Satellite Stabe/Unstable (MSS/MSI)status being any of: MSS, MSI-high and MSI-low.The list of patients used was ﬁrst reduced to those having both

CNAs and somatic mutationdata, and then was split into two groups: MSI-HIGH and MSS. The training cohort has 152 MSSand 27 MSI-HIGH samples; samples ﬂagged as low MSI were excluded from the study as they havenot been shown to diﬀer in their clinicopathologic features or in most molecular features from MSStumors [ ? ]. 31 .2 Driver events selection In the TCGA COADREAD study [56] integrated analysis of mutations, copy number and mRNAexpression changes in 195 tumours with complete data was performed. Part of the analysis wascarried out by using the MutSig tool [68], as well as manual curation. Samples were grouped bymutation status, and recurrent alterations in key CRC pathways were identiﬁed in [56] (Fig. 4,Supplementary Fig. 6 and Supplementary Table 1) as a result, we can use the consortium’s list of33 driver genes annotated to 5 pathways and use these to extract our progression models. Theseare well-known cancer genes, frequently reported as relevant to colorectal progression and to themajor pathways involved in CRC. Driver events are alterations in:• wnt genes (14): apc , dkk-4 , tcf7l2 , ctnnb1 , lrp5 , fbxw7 , dkk-1 , fzd10 , arid1a , dkk-2 , fam123b , sox9 , dkk-3 and axin2 ;• rtk / ras genes (5): erbb2 , erbb3 , nras , kras and braf ;• tgf- β genes (5): tgfbr1 , smad3 , tgfbr2 , smad4 , acvr1b , acvr2a and smad2 ;• igf2 / pi3k genes (5): igf2 , irs2 , pik3ca , pik3r1 and pten ;• p53 genes (2): tp53 and atm .In the Main Text, rtk / ras and igf2 / pi3k pathways are shortly denoted as ras and pi3k .The distinct types of mutations detected in these genes are shown in Figure S7, as well as theoverall rate of COADREAD mutations. The spatial distribution (per gene) of such mutations isshown in Figure S8. C.3 Mutual exclusivity groups of alterations.

Groups of alterations showing a trend of mutual exclusivity were scanned with MUTEX and mu-tations and CNA hitting any of the 33 selected genes as input. MUTEX was run independentlyon MSS and MSI-HIGH groups (Supplementary Table S2, running times: approximately 6 and 3.5hours, respectively, on a standard Desktop machine).We selected only groups with score < . , where the score is derived from p-values correctedfor false discovery rate . 3 groups are found for MSI-HIGH tumors and 6 for MSS. For MSI-HIGHtumors, the three predicted groups consists of genes acvr1b , acvr2a , tp53 and erbb2 , of genes braf , nras and tgfbr2 , and of genes kras and braf .Further groups of exclusive alterations were considered consistent with results reported in [56].These include groups derived by consolidated knowledge of colorectal progression: the well-known wnt alterations in apc / ctnnb1 [ ? ], as well as ras alterations in kras , nras and braf genes [74].Similarly, we used also a group collected by scanning non-hypermutated tumors with the MEMOtool in [56] - this group includes pik3ca , pten , erbb2 and igf2 genes. These groups were restrictedto account only for genes actually altered in a certain subtype, e.g., MSI-HIGH tumors lack ctnnb1 mutation, making the Wnt group irrelevant. Groups for MSS tumors are shown as SupplementaryFigure S9, groups for MSI-HIGH tumors are in the Main Text.32 .4 CAPRI’s execution Background on the algorithm.

CAPRI algorithm can be executed in two diﬀerent modes,originally dubbed as “supervised” when formulas are given in input for testing, and “unsupervised”when this is not the case. This paper deals with the former; see the Main Text for the interpretationof formulas in this context and [40] for a derivation of the algorithm. CAPRI is a three-stepsprocedure, which we brieﬂy recall here.1. CAPRI starts by creating a “lifted” representation of the input data M which includes theinput formulas; each formula – which is written in propositional logic – is so evaluated toyield a new column in the dataset. This is the input processed by the algorithm, which startsby selecting a set of candidate model edges, which are then used to constrain a score-basedBayesian model-selection problem [40].2. The initial set of selective advantage relations S , which determines the model edges x → y ,is computed by evaluating the following inequalities (Temporal priority) p ( x ) > p ( y ) (Probability raising) p ( y | x ) > p ( y | ¬ x ) , where p ( · ) is a marginal probability, p ( · | · ) is a conditional, ¬ x is the negation of x and either x or y is not a formula. It is interesting to observe that probability raising implies that x and y are positively statistically dependent , thus imposing a minimum threshold on theirassociation [39].In each of CAPRI’s executions, the distribution of the observed marginals and conditionals areestimated by K non-parametric bootstrap resamples; practically, this means that we create abootstrapped approximation for each of the four populations ˆ p ( x ) , ˆ p ( y ) , ˆ p ( y | x ) and ˆ p ( y | ¬ x ) .Then, we use a single-tail non-parametric Mann-Withney U test of the diﬀerence in mean totest the hypothesis that one of the two populations is more probable than the other, e.g., ˆ p ( x ) > ˆ p ( y ) . With this test, we can compute two p-values, one for each condition. An edge isincluded in S if at least the p-value for probability raising is below a signiﬁcance threshold p ∗ and an edge is said not to be orientable if its p-value for temporal priority is above thesame threshold. Cycles x → x → . . . x k → x that might appear in S are broken by deletingthe edge with minimum p-value; both K and p ∗ are custom parameters.3. Optimization of S , namely detection of the subset S ∗ of S with the edges that we include inthe ﬁnal progression model is done by optimizing, via hill climbing or tabu search , the scorewith regularization S ∗ (cid:44) arg min ˆ S⊂S (cid:110) − L ( ˆ S | M )] + θ | ˆ S| (cid:111) , (1)where L ( · ) is the model likelihood and M is the input data; the estimated optimal solution is S ∗ , which is displayed as a Suppes-Bayes Causal Network. The diﬀerent regularization strate-gies mentioned in the Main Text, BIC and AIC, are obtained by the following parametrization: (Bayesian Information Criterion) θ (cid:44) log( n ) (Akaike Information Criterion) θ (cid:44) . θ which deﬁne the conditional probability table of each edge and should be ﬁt from data; these are necessary if one wishes to use a model as a“generator” of further data. For discrete-valued graphical models, for each parent set π x → y ,parameter θ ( y ) = p ( y | π x ) can be taken either as the maximum likelihood estimate from thelifted input, or by using a Bayesian interpretation [87]. Usage in this context.

CAPRI was run, on each group of tumors, by selecting alterations fromthe pool of 33 pathway genes; every alteration on a gene x is included if any of these apply:• the alteration frequency of x - sum of mutation and CNA frequency - is greater than 5%;• x it is part of an exclusivity group.The set of selected events for MSI-HIGH training tumors is shown in the Main Text, the analogousset for MSS tumors is shown as Supplementary Figure S10.CAPRI was executed in its supervised mode by writing formula over groups and genes withmultiple alterations associated, as explained in the Main Text. For instance, for MSI-HIGH tumorswith alterations in ras pathway we grouped hard exclusivity of nras mutations and deletions, withsoft exclusivity of kras and braf mutations. Our aim was to account for a small subset of sampleswith concurrent kras and nras alterations (see Figure 2, Main Text). The list of all Booleanformulas written over groups is in Table S3; this approach was adopted also when a gene harborsmultiple alterations in a subtype, e.g., erbb2 in MSS training samples which shows a trend of softexclusivity between mutations and ampliﬁcations. We used both AIC and BIC scores to regularizeinference after 100 non-parametric bootstrap iterations for estimation of the preliminary selectiveadvantage relations – Mann-Whitney U test was performed with a minimum threshold p ∗ = 0 . .In most cases p-values are orders of magnitude below p ∗ - exact values reported as Dataset File S1.CAPRI’s models with such p-values and non-parametric bootstrap conﬁdence are shown in FiguresS11 and S12, statistical validation of the models is discussed in the next section. D Statistical validation of the models

P-values from hypothesis testing, as well as scores from k -fold cross-validation and various boot-strapping techniques can be used to measure the statistical consistency of models and data, each onecapturing diﬀerent potential errors in the inference process. Approaches such as cross-validationand bootstrap are sometimes also used in the ( ex novo ) generation and inference of models fromdata (see, e.g., bootstrap consensus models [ ? ]), but we only use them here for the a posteriori evaluation of a model’s conﬁdence, and we interpret them as a quantitative measure for the relative assessment of each model’s relation.All the p-values and the scores that we present here are computed within TRONCO . D.1 Edge p-values

As explained in §C.4, for each model edge x → y we get two p-values by assessing temporal priorityand probability raising via Mann-Whitney U testing.For each edge, a p-value for the hypergeometric test of overlap between alteration proﬁles x and y can be computed. More precisely, we test if there is a diﬀerence between the number of samplescontaining both x and y versus the total population of samples with x , y , or both. We would like34he overlap to be signiﬁcant as those samples – that determine the joint probability of x and y –are those supporting the presence of a selection trend among x and y .An edge is fully supported if all three p-values are below a custom signiﬁcance threshold, e.g., . or even better . . Some edges might have the p-value for temporal priority above thethreshold. If so, the selection trend might be still signiﬁcant, but the temporal order of x and y –i.e., the direction of selection – is not supported by the data. D.2 Bootstrap

We used non-parametric and statistical bootstrap techniques to measure the goodness-of-ﬁt , asoriginally proposed in [ ? ]. In this case, we distinguish two type of errors that one could make inthe inference process, estimating the presence or absence of edges in the model:• Type I errors: incorrect rejection of a true H (null-hypthesis), i.e., a “false positive” edgethat we wrongly include.• Type II errors: incorrect acceptance (failure to reject) of a false H , i.e., a “false negative” edge that we miss. Non-parametric bootstrap [86] computes scores to be interpreted here as follows. If a modelcontains an edge x → y that is a true positive, we expect its score to be high. In other words, whenwe sample with repetition subsets of the original data and re-run the inference process we expect tooften ﬁnd models which contain the edge x → y . Conversely, for a node y without incoming edges,or equivalently for any edge x → y which is correctly excluded from a model – a true negative – wewould expect its bootstrap score to be low. However, this reasoning can be also generalized to wholemodels, where we count how many times we re-infer exactly the same model. Clearly, such scoreswill depend also on the empirical probabilities of the nodes in our data, and their deviation fromthe true probabilities of the phenomenon. So, one might expect rare events to be less frequentlybootstrapped, which results in a lower estimate; however, such counts can be anyway interpretedas measures of repeatability of our ﬁndings .The above bootstrap approach depends on two random number generators: one to shuﬄe data(the a posteriori bootstrap), and one to evaluate CAPRI’s inequalities via hypothesis testing (theinternal CAPRI’s bootstrap, see C.4). Thus, to ensure that no bias is introduced by the randomnumber generators, we performed a statistical bootstrap by holding data ﬁxed, and re-estimatingCAPRI’s inequalities with generators initialized with diﬀerent seeds. We evaluated the robustnessof our scores for all edges imputed to be genuine and hence, high-scoring.Notice that, in principle, even parametric bootstrap scores could be computed if we used themodel to generate bootstrapped data [86]. However, as the support of the distribution subsumedby a model with n nodes consists of n possible outcomes, sampling uniformly from large modelsmight be computationally hard. For this reason, and because such scores are overestimates of thenon-parametric ones, we did not include them in our computation.The MSS and MSI progression models are annotated with the non-parametric bootstrap scoresin Figures S11 and S12. Non-parametric and statistical bootstrap scores for a set of selected edges Bootstrapping techniques have been widely used to gauge uncertainty in estimates, but also subjects of philo-sophical debate about their precise interpretation, especially when coupled with various signiﬁcance thresholds –unlike the situation with a p-value for a null hypothesis. An exhaustive review on the topic is provided by Soltisin [ ? ]. We follow the ideas originally developed in the area of phylogenetic analysis [ ? ], suggesting that the scorescan be alternatively interpreted as a measure of accuracy of the method or of the robustness of the data [ ? ]. D.3 Cross-validation

Next we study the suﬃciency of the data sizes for model inference and its ability to characterize theunderlying progression ( goodness-of-data ). Thus, we focus on the Type III errors, which occur whenthe sample size is inadequate or the sample is a poor descriptor for the reference phenomenon ,and thus failing to represent the progression. For this purpose, we used cross-validation with thedata used for the models built (see the Main Text), and followed the best practices developed bythe Bayesian Networks community [87]. TRONCO exploits the cross-validation routines implemented in the bnlearn package [ ? ]. Theapproach that we adopt is a k -fold non-exhaustive cross-validation, which we repeat times toaverage its results. Exhaustive strategies might be used for datasets of small sample size. Each runof cross-validation consists in computing a loss function for a model; its steps are the followings:• split randomly the data in k = 10 groups, and then repeat the following two steps, for eachgroup in turn:1. set one of the groups to be the “training” M tr ;2. merge the others k − to be the “test” M te ;3. by holding ﬁxed the model structure (i.e., the edges in S ∗ ), ﬁt the model parameters θ over the training data via maximum likelihood estimates , compute a score over the test M te (see below) and the corresponding loss;• combine the k loss estimates to give an overall loss for data.Let θ tr be the parameters ﬁt from the training set M tr , and S ∗ the edges in the model, threescores are computed with cross-validation:1. the negative entropy of a model – i.e., the negated expected log-likelihood of the test set forthe Bayesian network ﬁtted from the training set, that is eloss ( S ∗ , θ tr ) = − E [ L ( S ∗ , θ tr | M te )] .

2. the prediction error for a single node x and its parents set X , i.e., we measure how preciselywe can predict the values of x by using only the information present in its local distribution θ tr ( x ) . This parameter corresponds to computing the misclassiﬁcation rates from p te ( x ) , theempirical marginal probability of x estimated from the test.3. the posterior classiﬁcation error for a single node x and one of its parent node y ∈ X – i.e.,the values of x are predicted using only the information present in y by likelihood weightingand Bayesian posterior estimates. Consider the case where the samples are from two diﬀerent unknown and heterogeneous groups, with randomchance making low values to be sampled from a group that actually has a majority of high values, and vice versa.In this case, the samples will not be the best descriptor of the two groups. ? ] for a discussion on these loss functions. The ﬁrst statistics measures the log-likelihood loss when we “forget” some of the samples used to infer a model (indeed, the test samples); see FigureS14. Roughly, we are measuring how the model’s predictive power changes as we look at the datafrom diﬀerent viewpoints. This is a score for a whole model, it has no scale – so cannot be used tosay how good the models/data are, in any absolute sense. Nonetheless, it can be used to evaluatehow “stable” a model is, for a certain dataset.The second and third statistics measure the accuracy of the parent-set, X , for a child x ; thesecond statistics dealing with the whole parent set as predictor, and the third, the individualcontribution of each of the parents. For these two statistics, we desire the prediction error to below, as a measure of goodness. These are shown in Figure S13 (selected edges), S15 and S16 (alledges, prediction error). E Supplementary Tables and Figures

Datasets (CNAs and mutations provided by TCGA) statistics alteration typecancer † n m | G | mutations ampliﬁcations deletions MSI-HIGH 27 16100 13798 11556 2888 1656MSS 152 21317 16371 12417 6925 1975 † Samples were classiﬁed as MSI-HIGH/LOW and MSS by TCGA; see ﬂag

MSI_status in clinical data available for the COADREAD project.

Table S1:

COADREAD Data.

Data used in this study, derived from the TCGA COADREADproject [56]. 37

UTEX parametersParameter Value Description signalling-network - MUTEX network † max-group-size maximum size of a result group first-level-random-iteration number of randomisation to estimate null distribu-tion of member p-values in groups second-level-random-iteration number of runs to estimate the null distribution ofﬁnal scores fdr-cutoff - false-discovery-rate cutoﬀ maximising the expectedvalue of true positives - false positives is estimatedfrom data search-on-signaling-network TRUE reduce the search space using the signalling network † Manually curated from Pathway Commons, SPIKE and SignaLink databases. Provided with the tool; availablefor download at https://code.google.com/p/mutex/ . MUTEX groups with score < . MSI-HIGH Groups score q -value kras , braf , 0.095 0.48 nras , braf , tgfbr1 erbb2 , tp53 , acvr1b , acvr2a MSS Groups score q -value tp53 , atm , 0.051 0.34 arid1a , tp53 kras , nras , braf , 0.0864 0.1975 ctnnb1 , apc , dkk2 , 0.098 0.144 dkk1 , tp53 , atm , dkk2 pik3ca , tp53 , atm MUTEX: parameters and results.

Top: Parameters used to run MUTEX on theoriginal TCGA MSS/MSI-HIGH datasets with input CNA and somatic mutations in the pathwaygenes described in text. Bottom: MUTEX identiﬁed 3 and 6 groups of alterations showing a trendof mutual exclusivity in these groups with score below the suggested cutoﬀ of . .38 ormulas input for testing to CAPRI † MSI-HIGH tumors description ( nras :m ⊕ nras :d) ∨ kras :m ∨ braf :m raf exclusivity pik3ca :m ∨ erbb2 :m ∨ pten :m ∨ igf2 :d MEMO group ( acvr1b :m ⊕ acvr1b :d) ∨ acvr2a :m ∨ tp53 :m ∨ erbb2 :m MUTEX group ( nras :m ⊕ nras :d) ∨ tgfbr1 :m ∨ braf :m MUTEX group kras :m ∨ braf :m MUTEX group acvr1b :m ⊕ acvr1b :a multiple alterations nras :m ⊕ nras :a multiple alterations fbxw7 :m ∨ fbxw7 :a multiple alterations ‡ MSS tumors description ( apc :m ⊕ apc :d) ∨ ctnnb1 :m wnt exclusivity ( kras :m ∨ kras :a) ∨ ( nras :m ⊕ nras :a) ∨ ( braf :m ⊕ braf :a) raf exclusivity and MEMO group pik3ca :m ∨ ( erbb2 :m ∨ erbb2 :a) ∨ ( pten :m ⊕ pten :d) ∨ igf2 :a MEMO group ( tp53 :m ⊕ tp53 :d) ∨ ( atm :m ⊕ atm :d) MUTEX group ( tp53 :m ⊕ tp53 :d) ∨ arid1a :m MUTEX group ( tp53 :m ⊕ tp53 :d) ∨ arid1a :m MUTEX group ( apc :m ⊕ apc :d) ∨ ctnnb1 :m ∨ dkk2 :m MUTEX group ( tp53 :m ⊕ tp53 :d) ∨ ( atm :m ⊕ atm :d) ∨ dkk2 :m ∨ dkk1 :m MUTEX group ( tp53 :m ⊕ tp53 :d) ∨ ( atm :m ⊕ atm :d) ∨ pik3ca :m MUTEX group ( apc :m ⊕ apc :d) multiple alterations ( tp53 :m ⊕ tp53 :d) multiple alterations ( smad4 :m ⊕ smad4 :d) multiple alterations ( tcf7l2 :m ⊕ tcf7l2 :d) multiple alterations ( atm :m ⊕ atm :d) multiple alterations ( nras :m ⊕ nras :d) multiple alterations ( erbb2 :m ∨ erbb2 :a) multiple alterations ( pten :m ⊕ pten :d) multiple alterations ( smad2 :m ⊕ smad2 :a) multiple alterations ( dkk4 :m ⊕ dkk4 :a) multiple alterations ( sox9 :m ⊕ sox9 :d) multiple alterations ( braf :m ⊕ braf :a) multiple alterations † Events type: mutation (m), deletion (d), ampliﬁcation (a). Hard ( ⊕ ) and soft ( ∨ ) exclusivity. ‡ Formula not included as it creates a duplicated signature in the dataset.

Table S3:

CAPRI formulas from exclusivity groups.

Formulas created for the groups, andinput to CAPRI for testing. These are either derived from exclusivity groups or from genes involvedin diﬀerent types of alterations. 39igure S6:

PicNic pipeline processing MSI-HIGH/MSS tumors.

We process with

PicNic

Microsatellite Stable and highly unstable tumors collected from the The Cancer Genome Atlasproject “Human Colon and Rectal Cancer” (subtypes annotations provided as clinical data). Weimplement a study on selected somatic mutations and focal CNAs in 33 driver genes manuallyannotated with 5 pathways in the COADREAD project. We scan groups of exclusive alterationswith computational tools run by us and by TCGA, and we exploit previous knowledge on CRC; weselect which alterations we input to CAPRI. Next, inference is performed with various settings ofregularization and conﬁdence. Statistical conﬁdence of the models is assessed with standard tech-niques from the literature (p-values from statistical testing, bootstrap scores and cross-validationstatistics). 40 PC n = 251 CTNNB1 n = 12 DKK1 n = 4 DKK2 n = 7 DKK3 n = 3

DKK4 n = 7 LRP5 n = 9 FZD10 n = 7 FAM123B n = 30 AXIN2 n = 11 TCF7L2 n = 24 FBXW7 n = 50 ARID1A n = 25 SOX9 n = 11 ERBB2 n = 13 ERBB3 n = 15 NRAS n = 21 KRAS n = 95 BRAF n = 22 PIK3CA n = 42 PIK3R1 n = 13 PTEN n = 14 TGFBR1 n = 10 TGFBR2 n = 9 ACVR1B n = 17 ACVR2A n = 14 SMAD2 n = 18 SMAD3 n = 10 SMAD4 n = 31 TP53 n = 126 ATM n = 50 Mutation types

De_novo_Start_OutOfFrame

Splice_Site

Frame_Shift_Ins

In_Frame_Del

Silent

Nonsense_Mutation

Frame_Shift_Del

Missense_Mutation

Frame_Shift_Del

Nonsense_Mutation

Silent

In_Frame_Del

Frame_Shift_Ins

Splice_Site

De_novo_Start_OutOfFrame

Mutations for COADREAD n = 971 Figure S7:

Mutations annotated by TCGA for the driver genes.

Top: the majority ofthe mutations annotated by TCGA for the driver genes that we consider are missense – this inalmost all genes and in all the cohort. Bottom: overall summation of the frequencies determine themutations across all driver genes. 41

500 1000 1500 2000 2500 2843 aa M u t a t i on s APC

R1450

APC_basic EB1.. M u t a t i on s AMER1

WTX M u t a t i on s TCF7L2

CTNNB1_binding HMG_box M u t a t i on s FBXW7

R465C/H

F-b.. WD40 WD40 WD40 WD40 WD40 WD40 WD40 M u t a t i on s ARID1A

R1989

DUF3518 M u t a t i on s ERBB2

V842I

Rec.. Furin-like Rec.. GF_.. Pkinase_Tyr M u t a t i on s NRAS

Q61H/K/L/R

Ras M u t a t i on s KRAS

G12A/C/D/R/S/V

Ras M u t a t i on s BRAF

V600E

RBD C1_1 Pkinase_Tyr M u t a t i on s PIK3CA

E545A/G/K/Q

PI3.. PI3K_rbd PI3K_C2 PI3Ka PI3_PI4_kinase M u t a t i on s PTEN

R130/Q

DSPc PTEN_C2 M u t a t i on s ACVR1B

Activin_recp TGF.. Pkinase M u t a t i on s SMAD2

S464

MH1 MH2 M u t a t i on s SMAD4

MH1 MH2 M u t a t i on s TP53

R175C/H

P53.. P53 P53.. M u t a t i on s ATM

R1730

TAN FAT PI3.. (cid:1)(cid:2)(cid:3)(cid:3)(cid:4)(cid:5)(cid:3)(cid:4) (cid:6)(cid:7)(cid:8)(cid:5)(cid:9)(cid:10)(cid:11)(cid:2)(cid:5)(cid:12) (cid:13)(cid:5)(cid:14)(cid:7)(cid:10)(cid:15)(cid:4)(cid:16)(cid:17)(cid:4)(cid:18)(cid:4)(cid:11)(cid:2)(cid:19)(cid:5)(cid:20)(cid:2)(cid:5)(cid:3)(cid:4)(cid:7)(cid:11)(cid:2)(cid:19)(cid:5) (cid:21)(cid:11)(cid:22)(cid:4)(cid:7)

Figure S8:

Lolliplot diagrams of TCGA mutations.

Diagrams generated from the cBio portalfor the COADREAD project (see ). These display the physical distri-bution of the annotated mutations for each gene. Here we shown only genes with a total mutationcount greater than ; fam123b is called with its synonym amer1 , as in the portal.42 xclusivity groups for MSS tumors event

59% TP537% ARID1A1% TP53 stage

47% KRAS9% NRAS4% BRAF1% NRAS1% BRAF1% KRAS stage

78% APC7% CTNNB11% DKK21% APC stage

59% TP5310% ATM1% DKK21% DKK11% ATM1% TP53 stage

59% TP5315% PIK3CA10% ATM1% ATM1% TP53 stage stage

Stage IStage IIAStage IIIAStage IIIBStage IIICStage IVNA pathway n o n e D e l e t i o n A m p li f i c a t i o n M u t a t i o n Knowledge-based priors: WNT exclusivity stage stage

Knowledge-based priors: RAF exclusivity stage

Computational priors: MEMO group

59% TP5310% ATM1% ATM1% TP53 stage

Computational priors: Mutex groups with score < 0.2

RASWntTGFbP53PI3K

Figure S9:

Groups of exclusive alterations for MSS tumors.

Knowledge-based groups ofexclusive alterations consist of: kras , nras and braf genes ( raf pathway) and apc and ctnnb1 genes ( wnt pathway). The MEMO [76] group identiﬁed in [56] in this cohort consists of genes pik3ca , erbb2 , igf2 and pten . Finally, 6 groups are predicted by MUTEX [77] with score below . , one of these is equivalent to the known exclusive alterations in raf pathway.43 CGA MSS colorectal tumors T C G A − AA − A K T C G A − AA − T C G A − A G − A T C G A − AA − T C G A − AA − T C G A − A G − T C G A − AA − A ZT C G A − AA − T C G A − AA − T C G A − A G − T C G A − A G − A X T C G A − A G − T C G A − A G − T C G A − A G − A T C G A − A F − T C G A − A G − A W T C G A − AA − T C G A − A G − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − A FT C G A − A G − T C G A − A G − T C G A − A G − T C G A − A G − T C G A − A G − A C T C G A − A G − A T C G A − A G − T C G A − A G − A T C G A − A G − T C G A − A − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − A TT C G A − A G − T C G A − AA − A FT C G A − AA − T C G A − AA − T C G A − A F − T C G A − A G − A T C G A − AA − A T C G A − A G − T C G A − AA − A D T C G A − AA − T C G A − A − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − A W T C G A − AA − A H T C G A − AA − A J T C G A − A F − T C G A − A G − T C G A − A G − T C G A − A G − T C G A − A G − T C G A − A G − T C G A − A G − A Y T C G A − A G − A T C G A − AY − T C G A − A G − T C G A − AA − A FT C G A − AA − T C G A − AA − A O T C G A − A G − T C G A − AA − T C G A − AA − A I T C G A − A G − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − A G − A T C G A − AA − T C G A − AA − T C G A − A G − T C G A − AA − T C G A − A F − T C G A − A G − A T C G A − AA − T C G A − A G − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − A G − T C G A − A G − A T C G A − AA − A V T C G A − AA − T C G A − AA − T C G A − A G − A T C G A − A G − T C G A − A − T C G A − AA − T C G A − A G − A G T C G A − A G − T C G A − AA − T C G A − A G − T C G A − AA − A T C G A − AA − A U T C G A − AA − T C G A − AA − T C G A − AA − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − T C G A − AA − A ZT C G A − AA − A J T C G A − AA − T C G A − A G − T C G A − AA − T C G A − A F − T C G A − A G − T C G A − AA − T C G A − AA − T C G A − AA − A D T C G A − A − T C G A − A G − A T C G A − AA − T C G A − AA − T C G A − AA − A FT C G A − AY − T C G A − A G − T C G A − A − T C G A − AA − T C G A − A G − T C G A − AA − A Y T C G A − A G − A H T C G A − AA − T C G A − AA − T C G A − AA − A Q T C G A − AA − A X T C G A − A G − T C G A − AA − T C G A − AA − T C G A − A G − T C G A − AA − T C G A − A G − T C G A − AA − T C G A − AA −

78% APC59% TP5347% KRAS15% PIK3CA12% FBXW711% SMAD410% FAM123B10% ATM9% TCF7L29% NRAS7% ARID1A7% CTNNB15% SMAD25% SOX95% ERBB25% ERBB24% DKK44% BRAF3% PTEN3% PTEN3% SMAD43% IGF22% DKK41% TCF7L21% DKK21% DKK11% APC1% ATM1% TP531% NRAS1% BRAF1% KRAS1% SOX91% SMAD2 hitsstage g r oup stage Stage IStage IIAStage IIIAStage IIIBStage IIICStage IVnone hits group

WntP53RASPI3KTGFbnoneDeletionAmplificationMutation

152 samples34 events21 genes0 patterns

Figure S10:

Selected data for MSS tumors.

Colorectal tumors with Microsatellite Stableclinical status in the TCGA COADREAD project, restricted to 152 samples with both somaticmutations and CNA data available. 33 driver genes annotated with 5 pathways are selected fromthe list published in [56] to automatically detect groups of mutually exclusive alterations. Eventsselected for reconstruction are those involving genes altered in at least 5% of the cases, or part ofgroup of alterations showing an exclusivity trend (see Figure S9). This dataset is used to infer theset of selective advantage relations which constitute the MSS progression model presented in theMain Text. 44igure S11:

Non-parametric bootstrap scores for MSS progression.

Progression model forMSS tumors with conﬁdence shown as edge labels. The ﬁrst label represents the relation conﬁdenceestimated with 100 non-parametric bootstrap iterations, the second and third are p-values fortemporal priority and probability raising. Red p-values are above the minimum signiﬁcane thresholdof . . See Figure 4 in the Main Text for an interpretation of this model.45igure S12: Non-parametric bootstrap scores for MSI-HIGH progression.

Progressionmodel for MSI-HIGH tumors with conﬁdence shown as edge labels. The ﬁrst label represents therelation conﬁdence estimated with 100 non-parametric bootstrap iterations, the second and thirdare p-values for temporal priority and probability raising. Red p-values are above the minimumsigniﬁcane threshold of . . See Figure 5 in the Main Text for an interpretation of this model.46 ata selects KRAS PIK3CA

82% 100% 6,20E-92 7,63E-92 2,77E-05 1,51E-01 0,00E+00 1,51E-01 0,00E+00 KRAS MEMO

64% 100% 6,64E-92 3,68E-88 7,68E-03 2,63E-01 0,00E+00 2,63E-01 0,00E+00 FBXW7 XOR_SOX9

49% 100% 3,15E-83 2,57E-85 1,66E-03 5,92E-02 0,00E+00 5,92E-02 0,00E+00

Deletion4

ARID1A SOX9

34% 100% 1,73E-20 6,09E-78 9,25E-04 5,26E-02 0,00E+00 5,26E-02 0,00E+00

Amplification5

PIK3CA TCF7L2

34% 54% 2,05E-92 5,56E-89 0,00E+00 1,32E-02 0,00E+00 1,32E-02 0,00E+00

Mutation6

TCF7L2 DKK4

32% 95% 1,15E-91 1,33E-64 6,34E-04 1,97E-02 0,00E+00 1,97E-02 0,00E+00

Pattern7

FAM123B ATM

48% 48% 4,61E-01 2,70E-91 9,82E-04 1,01E-01 4,44E-03 1,02E-01 4,65E-03 ERBB2 ERBB2

64% 51% 1,27E-01 4,01E-77 5,46E-05 5,99E-02 6,54E-03 6,05E-02 6,05E-03

Data selects

BIC ATM DKK4

81% 70% 3,59E-92 2,97E-68 7,93E-04 2,70E-02 3,73E-03 1,97E-02 0,00E+00

AIC APC

DKK4

53% 100% 4,70E-92 1,21E-104 0,00E+00 3,95E-02 0,00E+00 3,95E-02 0,00E+00 KRAS SMAD4

50% 69% 6,06E-92 3,49E-73 3,72E-02 1,12E-01 0,00E+00 1,12E-01 0,00E+00 ARID1A XOR_SOX9

50% 90% 2,20E-09 1,02E-77 1,60E-03 5,72E-02 4,44E-03 7,24E-02 0,00E+00 TP53 NRAS

49% 91% 5,62E-92 1,35E-69 1,22E-01 8,55E-02 0,00E+00 4,20E-01 0,00E+00 p-values < 0,010,01 - 0,05

Data > 0,05selects APC PIK3CA

66% 100% 2,53E-65 3,81E-72 4,73E-02 2,41E-01 4,36E-02 2,19E-01 3,24E-02 ARID1A DKK2

62% 100% 3,62E-71 3,31E-76 5,72E-03 1,85E-01 0,00E+00 2,00E-01 3,12E-02 PIK3CA DKK4

58% 100% 4,83E-53 2,65E-75 2,85E-02 1,37E-01 3,51E-02 1,30E-01 4,36E-02 APC XOR_NRAS

54% 100% 2,51E-78 5,36E-89 0,00E+00 1,11E-01 0,00E+00 1,11E-01 0,00E+00 APC SMAD3

48% 79% 9,11E-81 8,62E-81 0,00E+00 7,41E-02 0,00E+00 7,41E-02 0,00E+00 ERBB2 SMAD2

48% 92% 3,87E-28 6,69E-68 0,00E+00 1,41E-01 1,56E-02 1,37E-01 2,50E-02 NRAS ACVR1B

48% 68% 2,93E-20 1,88E-40 0,00E+00 7,04E-02 1,17E-02 3,70E-02 0,00E+00 FAM123B SMAD4

46% 100% 4,12E-51 1,98E-66 3,99E-03 1,78E-01 3,40E-02 1,96E-01 3,05E-02 AXIN2 TP53

32% 86% 1,32E-07 9,08E-69 1,28E-02 1,74E-01 3,05E-02 1,59E-01 3,51E-02

LRP5 TGFBR2

30% 69% 8,32E-74 3,42E-84 0,00E+00 8,52E-02 3,51E-02 7,41E-02 0,00E+00

ERBB3 KRAS

27% 50% 3,62E-01 4,18E-77 4,65E-03 1,56E-01 2,34E-02 1,48E-01 0,00E+00

Data selects APC KRAS

51% 100% 8,53E-34 5,84E-75 4,27E-02 1,56E-01 2,34E-02 2,59E-01 0,00E+00 APC MEMO

48% 100% 3,44E-18 1,93E-64 9,10E-02 4,15E-01 4,88E-02 3,96E-01 3,92E-02 error (mean μ, stdev σ)

MICRO-SATELLITE STABLE TUMORSHIGHLY MICRO-SATELLITE INSTABLE TUMORS

Posterior classification

BIC RelationAIC Relation

Bootstrap p-values

Selective Advantage

Bootstrap p-values

Selective Advantage Prediction Posterior classification

AIC Relation

Bootstrap p-values Errors

Selective Advantage Prediction

BIC Relation p-values k-fold cross-validation

Selective Advantage Prediction Posterior classification

Bootstrap k-fold cross-validation

Prediction Posterior classification

Errors

Non-paramertic minimum scoreEvent type approximate

Figure S13:

COADREAD statistics for models conﬁdence.

For BIC models we show statisticsfor edges with non-parametric bootstrap score approximately greater than , for AIC modelsthose greater than s. i ) p-values (100 repetition of non-parametric bootstrap, prior to Wilcoxontesting) for each edge statistics of selective advantage ( direction and statistical dependence, andhypergeometric ). In general, the edges that we selected show very strong support ( p (cid:28) − ),but for those edges connecting events with the same marginal frequencies, where we can not beconﬁdent in the edge direction ( p > . ) but still we ﬁnd strong statistical dependence. ( ii ) A posteriori model conﬁdence against Type I and II errors estimated with non-parametric andstatistical bootstraps (100 repetitions) – edges annotated in Figures S11 and S12. ( iii ) Values of posterior classiﬁcation and prediction errors are estimated from 10 repetitions of 10-fold cross-validation . The former reports how much error is due to predicting, for each set of edges X = { x , . . . , x n } → y , the value of y according to the value of each x i ∈ X . The latter reports the samestatistics when we predict y from the whole set of parents X .47 B I C A I C Entropy loss for MSS model

Log-Likelihood Loss (disc.)

BIC score: -2487.12 logLik: -2318.82 loss: 0.663 %

AIC score: -2377.89 logLik: -2294.89 loss: 0.661 % B I C A I C Entropy loss for MSI model

Log-Likelihood Loss (disc.)

BIC score: -466.15 logLik: -385.4 loss: 3.89 %

AIC score: -429.25 logLik: -361.25 loss: 3.844 %

Figure S14:

Entropy loss for MSI-HIGH/MSS models.

Violin plot computed from runsof k -fold cross-validation with k = 10 , where we compute the “loss of log-likelihood” at each fold.In the plot and in the table we report also the overall log-likelihood, as well as the BIC and AICscores for the models. We present the ratio of log-likelihood loss as a measure of stability of thesemodels for these two datasets – we can observe that the MSS models lose < of their likelihood,while the MSI lose slightly more (still, < ), possibly because of the smaller sample size. Froma statistical point of view, the greater (despite small) loss of likelihood by the models regularizedvia BIC conﬁrms its tendency to underestimate the true model (i.e., the model should have falsenegatives, which could be AIC’s edges). 48 PC ATM

TP53

NRAS

BRAF

KRAS

SOX9

SMAD2

TCF7L2

DKK2

DKK1

DKK4

PTEN

SMAD4

IGF2

PTEN

DKK4

BRAF

ERBB2

XOR_BRAF

SMAD2

SOX9

ERBB2

XOR_PTEN

XOR_DKK4

XOR_SOX9

XOR_SMAD2

CTNNB1

ARID1A

OR_ERBB2

NRAS

TCF7L2

XOR_NRAS

FAM123B

ATM

XOR_TCF7L2

XOR_ATM

SMAD4

FBXW7

XOR_SMAD4

PIK3CA

OR_APC_CTNNB1_DKK2

OR_APC_CTNNB1

XOR_APC

APC

OR_ATM_PIK3CA_TP53

OR_ERBB2_IGF2_PIK3CA_PTEN

OR_ATM_DKK1_DKK2_TP53

OR_ATM_TP53

OR_ARID1A_TP53

XOR_TP53

TP53

OR_BRAF_KRAS_NRAS

KRAS . . . . . . MSS Prediction Error (BIC parent set, k-fold cross-validation)

Node type

Pattern

Mutation

Amplification

Deletion

Parent set

Empty

Cardinality . . . . . . . . . . . . . . . . . . . . . . APC

ATM

TP53

NRAS

BRAF

KRAS

SOX9

SMAD2

TCF7L2

DKK2

DKK1

PTEN

SMAD4

IGF2

DKK4

PTEN

DKK4

BRAF

ERBB2

XOR_BRAF

SMAD2

SOX9

XOR_PTEN

XOR_DKK4

XOR_SMAD2

XOR_SOX9

CTNNB1

ARID1A

OR_ERBB2

NRAS

TCF7L2

XOR_NRAS

FAM123B

ATM

XOR_TCF7L2

XOR_ATM

SMAD4

FBXW7

XOR_SMAD4

PIK3CA

OR_APC_CTNNB1_DKK2

OR_APC_CTNNB1

XOR_APC

APC

OR_ATM_PIK3CA_TP53

OR_ERBB2_IGF2_PIK3CA_PTEN

OR_ATM_DKK1_DKK2_TP53

OR_ATM_TP53

OR_ARID1A_TP53

XOR_TP53

TP53

OR_BRAF_KRAS_NRAS

KRAS . . . . . . MSS Prediction Error (AIC parent set, k-fold cross-validation)

Node type

Pattern

Mutation

Amplification

Deletion

Parent set

Empty

Cardinality . . . . . . . . . . . . . . . . . . . . . . . . Figure S15:

Prediction error for each parent set of BIC and AIC models of MSS tumors. GF2

NRAS

ACVR1B

NRAS

TGFBR2

SMAD3

DKK1

XOR_NRAS

SMAD2

TP53

KRAS

ERBB2

TGFBR1

TCF7L2

DKK4

SMAD4

ACVR2A

AXIN2

OR_KRAS_NRAS_BRAF

DKK2

LRP5

PIK3CA

ACVR1B

ATM

OR_BRAF_KRAS

OR_BRAF_NRAS_TGFBR1

ERBB3

XOR_ACVR1B

FAM123B

OR_ACVR1B_ACVR2A_ERBB2_TP53

OR_ERBB2_IGF2_PIK3CA

APC

ARID1A

BRAF

FBXW7 . . . . MSI Prediction Error (BIC parent set, k-fold cross-validation)

Node type

Pattern

Mutation

Amplification

Deletion

Parent set

Empty

Cardinality . . . . . . . . . . . . . . . SMAD2

NRAS

IGF2

ACVR1B

SMAD3

DKK1

TGFBR2

DKK4

XOR_NRAS

NRAS

ERBB2

TGFBR1

KRAS

TP53

SMAD4

ACVR2A

AXIN2

OR_KRAS_NRAS_BRAF

LRP5

TCF7L2

DKK2

PIK3CA

ACVR1B

ATM

OR_BRAF_KRAS

OR_BRAF_NRAS_TGFBR1

ERBB3

XOR_ACVR1B

FAM123B

OR_ACVR1B_ACVR2A_ERBB2_TP53

APC

ARID1A

OR_ERBB2_IGF2_PIK3CA

BRAF

FBXW7 . . . . MSI Prediction Error (AIC parent set, k-fold cross-validation)

Node type

Pattern

Mutation

Amplification

Deletion

Parent set

Empty

Cardinality . . . . . . . . . . . . . . . . Figure S16:

Prediction error for each parent set of BIC and AIC models of MSI-HIGHtumors.

Models and the phenotype that they might explain. (biomarker)

Indepen-dent evolutionary trajectories depicted by a model might share common routes through a certainalteration Y; that could point to a new biomarker harbored by most of the tumors under study. (drug resistance)

When a progression model branches in many independent sub-progressions,each one identiﬁed by alterations X, Y and Z, if a certain drug is known to target only a certaintype of such clones (e.g., those where biomarker X is present), we might get insights on which arethe biomarkers which make the drug ineﬀective for certain patients (e.g., those were cancer evolvesthrough Y and Z). (metastatic)

When a model is extracted from data representative of varioustumor stages, we might discover which “late events” are those conferring a metastatic phenotypeto a tumor – X in the ﬁgures. (anti-hallmarks)(anti-hallmarks)