[PDF] *K-means and Cluster Models for Cancer Signatures

Abstract

We present *K-means clustering algorithm and source code by expanding statistical clustering methods applied in this https URL to quantitative finance. *K-means is statistically deterministic without specifying initial centers, etc. We apply *K-means to extracting cancer signatures from genome data without using nonnegative matrix factorization (NMF). *K-means' computational cost is a fraction of NMF's. Using 1,389 published samples for 14 cancer types, we find that 3 cancers (liver cancer, lung cancer and renal cell carcinoma) stand out and do not have cluster-like structures. Two clusters have especially high within-cluster correlations with 11 other cancers indicating common underlying structures. Our approach opens a novel avenue for studying such structures. *K-means is universal and can be applied in other fields. We discuss some potential applications in quantitative finance.

Full PDF

**K-means and Cluster Models for Cancer Signatures

Zura Kakushadze §† and Willie Yu (cid:93) § Quantigic (cid:114)

Solutions LLC1127 High Ridge Road † Free University of Tbilisi, Business School & School of Physics240, David Agmashenebeli Alley, Tbilisi, 0159, Georgia (cid:93)

Centre for Computational Biology, Duke-NUS Medical School8 College Road, Singapore 169857 (January 30, 2017)

Abstract

We present *K-means clustering algorithm and source code by expandingstatistical clustering methods applied in https://ssrn.com/abstract=2802753to quantitative ﬁnance. *K-means is statistically deterministic without speci-fying initial centers, etc. We apply *K-means to extracting cancer signaturesfrom genome data without using nonnegative matrix factorization (NMF).*K-means’ computational cost is a fraction of NMF’s. Using 1,389 publishedsamples for 14 cancer types, we ﬁnd that 3 cancers (liver cancer, lung cancerand renal cell carcinoma) stand out and do not have cluster-like structures.Two clusters have especially high within-cluster correlations with 11 other can-cers indicating common underlying structures. Our approach opens a novelavenue for studying such structures. *K-means is universal and can be appliedin other ﬁelds. We discuss some potential applications in quantitative ﬁnance. Zura Kakushadze, Ph.D., is the President of Quantigic (cid:114)

Solutions LLC, and a Full Professorat Free University of Tbilisi. Email: [email protected] Willie Yu, Ph.D., is a Research Fellow at Duke-NUS Medical School. Email: [email protected] DISCLAIMER: This address is used by the corresponding author for no purpose other thanto indicate his professional aﬃliation as is customary in publications. In particular, the contentsof this paper are not intended as an investment, legal, tax or any other such advice, and in noway represent views of Quantigic (cid:114) a r X i v : . [ q - b i o . GN ] J u l Introduction and Summary

Every time we can learn something new about cancer, the motivation goes withoutsaying. Cancer is diﬀerent. Unlike other diseases, it is not caused by “mechani-cal” breakdowns, biochemical imbalances, etc. Instead, cancer occurs at the DNAlevel via somatic alterations in the genome structure. A common type of somaticmutations found in cancer is due to single nucleotide variations (SNVs) or alter-ations to single bases in the genome, which accumulate through the lifespan of thecancer via imperfect DNA replication during cell division or spontaneous cytosinedeamination [Goodman and Fygenson, 1998], [Lindahl, 1993], or due to exposuresto chemical insults or ultraviolet radiation [Loeb and Harris, 2008], [Ananthaswamyand Pierceall, 1990], etc. These mutational processes leave a footprint in the cancergenome characterized by distinctive alteration patterns or mutational signatures.If we can identify all underlying signatures, this could greatly facilitate progressin understanding the origins of cancer and its development. Therapeutically, if thereare common underlying structures across diﬀerent cancer types, then a therapeuticfor one cancer type might be applicable to other cancers, which would be a greatnews. However, it all boils down to the question of usefulness, i.e., is there a smallenough number of cancer signatures underlying all (100+) known cancer types, or isthis number too large to be meaningful or useful? Indeed, there are only 96 SNVs, so we cannot have more than 96 signatures. Even if the number of true underlyingsignatures is, say, of order 50, it is unclear whether they would be useful, especiallywithin practical applications. On the other hand, if there are only a dozen or sounderlying signatures, then we could hope for an order of magnitude simpliﬁcation.To identify mutational signatures, one analyzes SNV patterns in a cohort ofDNA sequenced whole cancer genomes. The data is organized into a matrix G is ,where the rows correspond to the N = 96 mutation categories, the columns corre-spond to d samples, and each element is a nonnegative occurrence count of a givenmutation category in a given sample. Currently, the commonly accepted methodfor extracting cancer signatures from G is [Alexandrov et al , 2013a] is via nonnega-tive matrix factorization (NMF) [Paatero and Tapper, 1994], [Lee and Seung, 1999].Under NMF the matrix G is approximated via G ≈ W H , where W iA is an N × K matrix, H As is a K × d matrix, and both W and H are nonnegative. The appealof NMF is its biologic interpretation whereby the K columns of the matrix W are Another practical application is prevention by pairing the signatures extracted from cancersamples with those caused by known carcinogens (e.g., tobacco, aﬂatoxin, UV radiation, etc). In brief, DNA is a double helix of two strands, and each strand is a string of letters A, C,G, T corresponding to adenine, cytosine, guanine and thymine, respectively. In the double helix,A in one strand always binds with T in the other, and G always binds with C. This is knownas base complementarity. Thus, there are six possible base mutations C > A, C > G, C > T,T > A, T > C, T > G, whereas the other six base mutations are equivalent to these by basecomplementarity. Each of these 6 possible base mutations is ﬂanked by 4 possible bases on eachside thereby producing 4 × × Nonlinearities could undermine this argument. However, again, it all boils down to usefulness. K cancer signatures contribute into the N = 96 mutation categories, and the columns of the matrix H are interpreted asthe exposures to the K signatures in each sample. The price to pay for this is thatNMF, which is an iterative procedure, is computationally costly and depending onthe number of samples d it can take days or even weeks to run it. Furthermore, itdoes not automatically ﬁx the number of signatures K , which must be either guessedor obtained via trial and error, thereby further adding to the computational cost. Some of the aforesaid issues were recently addressed in [Kakushadze and Yu,2016b], to wit: i) by aggregating samples by cancer types, we can greatly improvestability and reduce the number of signatures; ii) by identifying and factoring outthe somatic mutational noise, or the “overall” mode (this is the “de-noising” proce-dure of [Kakushadze and Yu, 2016b]), we can further greatly improve stability and,as a bonus, reduce computational cost; and iii) the number of signatures can be ﬁxedborrowing the methods from statistical risk models [Kakushadze and Yu, 2017b] inquantitative ﬁnance, by computing the eﬀective rank (or eRank) [Roy and Vetterli,2007] for the correlation matrix Ψ ij calculated across cancer types or samples (seebelow). All this yields substantial improvements [Kakushadze and Yu, 2016b].In this paper we push this program to yet another level. The basic idea here isquite simple (but, as it turns out, nontrivial to implement – see below). We wish toapply clustering techniques to the problem of extracting cancer signatures. In fact,we argue in Section 2 that NMF is, to a degree, “clustering in disguise”. This is fortwo main reasons. The prosaic reason is that NMF, being a nondeterministic algo-rithm, requires averaging over many local optima it produces. However, each rungenerally produces a weights matrix W iA with columns (i.e., signatures) not alignedwith those in other runs. Aligning or matching the signatures across diﬀerent runs(before averaging over them) is typically achieved via nondeterministic clusteringsuch as k-means. So, not only is clustering utilized at some layer, the result, evenafter averaging, generally is both noisy and nondeterministic! I.e., if this computa-tionally costly procedure (which includes averaging) is run again and again on thesame data, generally it will yield diﬀerent looking cancer signatures every time!The second, not-so-prosaic reason is that, while NMF generically does not pro-duce exactly null weights, it does produce low weights, such that they are withinerror bars. For all practical purposes we might as well set such weights to zero.NMF requires nonnegative weights. However, we could as reasonably require thatthe weights should be, say, outside error bars (e.g., above one standard deviation – Other issues include: i) out-of-sample instability, i.e., the signatures obtained from non-overlapping sets of samples can be dramatically diﬀerent; ii) in-sample instability, i.e., the signa-tures can have a strong dependence on the initial iteration choice; and iii) samples with low countsor sparsely populated samples (i.e., those with many zeros – such samples are ubiquitous, e.g., inexome data) are usually deemed not too useful as they contribute to the in-sample instability. As a result, now we have the so-aggregated matrix G is , where s = 1 , . . . , d , and d = n is thenumber of cancer types, not of samples. This matrix is much less noisy than the sample data. By “noise” we mean the statistical errors in the weighs obtained by averaging. Typically,such error bars are not reported in the literature on cancer signatures. Usually they are large. W iA will start to have moreand more zeros. It may not exactly have a binary cluster-like structure, but it mayat least have some substructures that are cluster-like. It then begs the question: arethere cluster-like (sub)structures present in W iA or, generally, in cancer signatures?To answer this question, we can apply clustering methods directly to the matrix G is , or, more, precisely, to its de-noised version G (cid:48) is (see below) [Kakushadze andYu, 2016b]. The na¨ıve, brute-force approach where one would simply cluster G is or G (cid:48) is does not work for a variety of reasons, some being more nontrivial or subtlethan others. Thus, e.g., as discussed in [Kakushadze and Yu, 2016b], the counts G is have skewed, long-tailed distributions and one should work with log-counts, or,more precisely, their de-noised versions. This applies to clustering as well. Further,following a discussion in [Kakushadze and Yu, 2016c] in the context of quantitativetrading, it would be suboptimal to cluster de-noised log-counts. Instead, it paysto cluster their normalized variants (see Section 2 hereof). However, taking care ofsuch subtleties does not alleviate one big problem: nondeterminism! If we run avanilla nondeterministic algorithm such as k-means on the data however massagedwith whatever bells and whistles, we will get random-looking disparate results everytime we run k-means with no stability in sight. We need to address nondeterminism!Our solution to the problem is what we term *K-means . The idea behind *K-means, which essentially achieves determinism statistically , is simple. Suppose wehave an N × d matrix X is , i.e., we have N d -vectors X i . If we run k-means with theinput number of clusters K but initially unspeciﬁed centers, every run will generallyproduce a new local optimum. *K-means reduces and in fact essentially eliminatesthis indeterminism via two levels. At level 1 it takes clusterings obtained via M independent runs or samplings. Each sampling produces a binary N × K matrix Ω iA ,whose element equals 1 if X i belongs to the cluster labeled by A , and 0 otherwise.The aggregation algorithm and the source code therefor are given in [Kakushadzeand Yu, 2016c]. This aggregation – for the same reasons as in NMF (see above)– involves aligning clusters across the M runs, which is achieved via k-means, andso the result is nondeterministic. However, by aggregating a large number M ofsamplings, the degree of nondeterminism is greatly reduced. The “catch” is thatsometimes this aggregation yields a clustering with K (cid:48) < K clusters, but this doesnot pose an issue. Thus, at level 2, we take a large number P of such aggregations(each based on M samplings). The occurrence counts of aggregated clusterings arenot uniform but typically have a (sharply) peaked distribution around a few (ormanageable) number of aggregated clusterings. So this way we can pinpoint the“ultimate” clustering, which is simply the aggregated clustering with the highestoccurrence count. This is the gist of *K-means and it works well for genome data. Deterministic (e.g., agglomerative hierarchical) algorithms have their own issues (see below).

Cluster Models

The chief objective of this paper is to introduce a novel approach to identifyingcancer signatures using clustering methods. In fact, as we discuss below in detail,our approach is more than just clustering. Indeed, it is evident from the get-gothat blindly using nondeterministic clustering algorithms, which typically produce(unmanageably) large numbers of local optima, would introduce great variabilityinto the resultant cancer signatures. On the other hand, deterministic algorithmssuch as agglomerative hierarchical clustering typically are (substantially) slowerand require essentially “guessing” the initial clustering, which in practical appli-cations can often turn out to be suboptimal. So, both to motivate and explain ournew approach employing clustering methods, we ﬁrst – so to speak – “break down”the NMF approach and argue that it is in fact a clustering method in disguise! The current “lore” – the commonly accepted method for extracting K cancer signa-tures from the occurrence counts matrix G is (see above) [Alexandrov et al , 2013a]– is via nonnegative matrix factorization (NMF) [Paatero and Tapper, 1994], [Leeand Seung, 1999]. Under NMF the matrix G is approximated via G ≈ W H , where W iA is an N × K matrix of weights, H As is a K × d matrix of exposures, and both W and H are nonnegative. However, not only is the number of signatures K not ﬁxedvia NMF (and must be either guessed or obtained via trial and error), NMF too is anondeterministic algorithm and typically produces a large number of local optima.So, in practice one has no choice but to execute a large number N S of NMF runs –which we refer to as samplings – and then somehow extract cancer signatures fromthese samplings. Absent a guess for what K should be, one executes N S samplingsfor a range of values of K (say, K min ≤ K ≤ K max , where K min and K max are basi-cally guessed based on some reasonable intuitive considerations), for each K extractscancer signatures (see below), and then picks K and the corresponding signatureswith the best overall ﬁt into the underlying matrix G . For a given K , diﬀerent sam-plings generally produce diﬀerent weights matrices W . So, to extract a single matrix W for each value of K one averages over the samplings. However, before averaging,one must match the K cancer signatures across diﬀerent samplings – indeed, in agiven sampling X the columns in the matrix W iA are not necessarily aligned with Such as k-means [Steinhaus, 1957], [Lloyd, 1957], [Forgy, 1965], [MacQueen, 1967], [Hartigan,1975], [Hartigan and Wong, 1979], [Lloyd, 1982]. As we discuss below, in this regard NMF is not dissimilar. E.g., SLINK [Sibson, 1973], etc. (see, e.g., [Murtagh and Contreras, 2011], [Kakushadze andYu, 2016c], and references therein). E.g., splitting the data into 2 initial clusters. Such as quantitative trading, where out-of-sample performance can be objectively measured.There empirical evidence suggests that such deterministic algorithms underperform so long asnondeterministic ones are used thoughtfully [Kakushadze and Yu, 2016c]. W iA in a diﬀerent sampling Y. To align the columns inthe matrices W across the N S samplings, once often uses a clustering algorithm suchas k-means. However, since k-means is nondeterministic, such alignment of the W columns is not guaranteed to – and in fact does not – produce a unique answer. Hereone can try to run multiple samplings of k-means for this alignment and aggregatethem, albeit such aggregation itself would require another level of alignment (withits own nondeterministic clustering such as k-means). And one can do this adinﬁnitum . In practice, one must break the chain at some level of alignment, either ad hoc (essentially by heuristically observing suﬃcient stability and “convergence”)or via using a deterministic algorithm (see fn. 16). Either way, invariably all thisintroduces (overtly or covertly) systematic and statistical errors into the resultantcancer signatures and often it is unclear if they are meaningful without invokingsome kind empirical biologic “experience” or “intuition” (often based on alreadywell-known eﬀects of, e.g., exposure to various well-understood carcinogens such astobacco, ultraviolet radiation, aﬂatoxin, etc.). At the end of the day it all boils downto how useful – or predictive – the resultant method of extracting cancer signaturesis, including signature stability. With NMF, the answer is not at all evident...

So, in practice, under the hood, NMF already uses clustering methods. However, itgoes deeper than that. While NMF generically does not produce vanishing weightsfor a given signature, some weights are (much) smaller than others. E.g., oftenone has several “peaks” with high concentration of weights, with the rest of themutation categories having relatively low weights. In fact, many weights can evenbe within the (statistical plus systematic) error bars. Such weights can for allpractical purposes be set to zero. In fact, we can take this further and ask whetherproliferation of low weights adds any explanatory power. One way to address this isto run NMF with an additional constraint that the weights (obtained via averaging– see above) should be higher than either i) some multiple of the corresponding errorbars or ii) some preset ﬁxed minimum weight. This certainly sounds reasonable,so why is this not done in practice? A prosaic answer appears to be that this wouldcomplicate the already nontrivial NMF algorithm even further, require additionalcoding and computation resources, etc. However, arguendo , let us assume that werequire, say, that the weights be higher than a preset ﬁxed minimum weight w min orelse the weights are set to zero. As we increase w min , the so-modiﬁed NMF wouldproduce more and more zeros. This does not mean that the resulting matrix W iA We should point out that at some level of alignment one may employ a deterministic (e.g.,agglomerative hierarchical – see above) clustering algorithm to terminate the malicious circle, whichcan be a reasonable approach assuming there is enough stability in the data. However, this tooadds a(n often hard to quantify and therefore hidden) systematic error to the resultant signatures. And such error bars are rarely displayed in the prevalent literature... This would require a highly recursive algorithm. binary cluster structure, i.e., that W iA = w i δ G ( i ) ,A , where δ AB is aKronecker delta and G : { , . . . , N } (cid:55)→ { , . . . , K } is a map from N = 96 mutationcategories to K clusters. Put another way, this does not mean that in the resultingmatrix W iA for a given i (i.e., mutation category) we would have a nonzero elementfor one and only one value of A (i.e., signature). However, as we gradually increase w min , generally the matrix W iA is expected to look more and more like having abinary cluster structure, albeit with some “overlapping” signatures (i.e., such that ina given pair of signatures there are nonzero weights for one or more mutations). Wecan achieve a binary structure via a number of ways. Thus, a rudimentary algorithmwould be to take the matrix W iA (equally successfully before or after achieving somezeros in it via nonzero w min ) and for a given value of i set all weights W iA to zeroexcept in the signature A for which W iA = max( W iA | A = 1 , . . . , K ). Note that thismight result in some empty signatures (clusters), i.e., signatures with W iA = 0 forall values of i . This can be dealt with by i) ether simply dropping such signaturesaltogether and having fewer K (cid:48) < K signatures (binary clusters) at the end, or ii)augmenting the algorithm to avoid empty clusters, which can be done in a numberof ways we will not delve into here. The bottom line is that NMF essentially can bemade into a clustering algorithm by reasonably modifying it, including via gettingrid of ubiquitous and not-too-informative low weights. However, the downside wouldbe an even more contrived algorithm, so this is not what we are suggesting here.Instead, we are observing that clustering is already intertwined in NMF and thequestion is whether we can simplify things by employing clustering methods directly. Happily, the answer is yes. Not only can we have much simpler and apparently morestable clustering algorithms, but they are also computationally much less costly thanNMF. As mentioned above, the biggest issue with using popular nondeterministicclustering algorithms such as k-means is that they produce a large number of localoptima. For deﬁniteness in the remainder of this paper we will focus on k-means,albeit the methods described herein are general and can be applied to other suchalgorithms. Fortunately, this very issue has already been addressed in [Kakushadzeand Yu, 2016c] in the context of constructing statistical industry classiﬁcations (i.e.,clustering models for stocks) for quantitative trading, so here we simply borrowtherefrom and further expand and adapt that approach to cancer signatures. A popular clustering algorithm is k-means [Steinhaus, 1957], [Lloyd, 1957], [Forgy,1965], [MacQueen, 1967], [Hartigan, 1975], [Hartigan and Wong, 1979], [Lloyd, 1982].The basic idea behind k-means is to partition N observations into K clusters suchthat each observation belongs to the cluster with the nearest mean. Each of the N Which are preferred over deterministic ones for the reasons discussed above. d -vector, so we have an N × d matrix X is , i = 1 , . . . , N , s = 1 , . . . , d . Let C a be the K clusters, C a = { i | i ∈ C a } , a = 1 , . . . , K . Thenk-means attempts to minimize g = K (cid:88) a =1 (cid:88) i ∈ C a d (cid:88) s =1 ( X is − Y as ) (1)where Y as = 1 n a (cid:88) i ∈ C a X is (2)are the cluster centers (i.e., cross-sectional means), and n a = | C a | is the numberof elements in the cluster C a . In (1) the measure of “closeness” is chosen to be theEuclidean distance between points in R d , albeit other measures are possible.One “drawback” of k-means is that it is not a deterministic algorithm. Generi-cally, there are copious local minima of g in (1) and the algorithm only guaranteesthat it will converge to a local minimum, not the global one. Being an iterativealgorithm, unless the initial centers are preset, k-means starts with a random set ofthe centers Y as at the initial iteration and converges to a diﬀerent local minimum ineach run. There is no magic bullet here: in practical applications, typically, tryingto “guess” the initial centers is not any easier than “guessing” where, e.g., the globalminimum is. So, what is one to do? One possibility is to simply live with the factthat every run produces a diﬀerent answer. In fact, this is acceptable in many appli-cations. However, in the context of extracting cancer signatures this would result inan exercise in futility. We need a way to eliminate or greatly reduce indeterminism. The idea is simple. What if we aggregate diﬀerent clusterings from multiple runs –which we refer to as samplings – into one? The question is how. Suppose we have M runs ( M (cid:29) K clusters. Let Ω ria = δ G r ( i ) ,a , i = 1 , . . . , N , a = 1 , . . . , K (here G r : { , . . . , N } (cid:55)→ { , . . . , K } is the map between– in our case – the mutation categories and the clusters), be the binary matrixfrom each run labeled by r = 1 , . . . , M , which is a convenient way (for our purposeshere) of encoding the information about the corresponding clustering; thus, eachrow of Ω ria contains only one element equal 1 (others are zero), and N ra = (cid:80) Ni =1 Ω ria (i.e., column sums) is nothing but the number of mutations belonging to the clusterlabeled by a (note that (cid:80) Ka =1 N ra = N ). Here we are assuming that somehow weknow how to properly order (i.e., align) the K clusters from each run. This is anontrivial assumption, which we will come back to momentarily. However, assuming,for a second, that we know how to do this, we can aggregate the binary matrices Below we will discuss what X is should be for cancer signatures. Throughout this paper “cross-sectional” refers to “over the index i ”. Note that here the superscript r in Ω ria , G r ( i ) and N ra (see below) is an index, not a power. ria into a single matrix (cid:101) Ω ia = (cid:80) Mr =1 Ω ria . Now, this matrix does not look like abinary clustering matrix. Instead, it is a matrix of occurrence counts, i.e., it countshow many times a given mutation was assigned to a given cluster in the processof M samplings. What we need to construct is a map G such that one and onlyone mutation belongs to each of the K clusters. The simplest criterion is to mapa given mutation to the cluster in which (cid:101) Ω ia is maximal, i.e., where said mutationoccurs most frequently. A caveat is that there may be more than one such clusters.A simple criterion to resolve such an ambiguity is to assign said mutation to thecluster with most cumulative occurrences (i.e., we assign said mutation to the clusterwith the largest (cid:101) N a = (cid:80) Ni =1 (cid:101) Ω ia ). Further, in the unlikely event that there is still anambiguity, we can try to do more complicated things, or we can simply assign sucha mutation to the cluster with the lowest value of the index a – typically, there is somuch noise in the system that dwelling on such minutiae simply does not pay oﬀ.However, we still need to tie up a loose end, to wit, our assumption that theclusters from diﬀerent runs were somehow all aligned. In practice each run produces K clusters, but i) they are not the same clusters and there is no foolproof way ofmapping them, especially when we have a large number of runs; and ii) even if theclusters were the same or similar, they would not be ordered, i.e., the clusters fromone run generally would be in a diﬀerent order than the clusters from another run.So, we need a way to “match” clusters from diﬀerent samplings. Again, there isno magic bullet here either. We can do a lot of complicated and contrived thingswith not much to show for it at the end. A simple pragmatic solution is to usek-means to align the clusters from diﬀerent runs. Each run labeled by r = 1 , . . . , M ,among other things, produces a set of cluster centers Y ras . We can “bootstrap” themby row into a ( KM ) × d matrix (cid:101) Y (cid:101) as = Y ras , where (cid:101) a = a + ( r − K takes values (cid:101) a = 1 , . . . , ( KM ). We can now cluster (cid:101) Y (cid:101) as into K clusters via k-means. This willmap each value of (cid:101) a to { , . . . , K } thereby mapping the K clusters from each of the M runs to { , . . . , K } . So, this way we can align all clusters. The “catch” is thatthere is no guarantee that each of the K clusters from each of the M runs will beuniquely mapped to one value in { , . . . , K } , i.e., we may have some empty clustersat the end of the day. However, this is ﬁne, we can simply drop such empty clustersand aggregate (via the above procedure) the smaller number of K (cid:48) < K clusters.I.e., at the end we will end up with a clustering with K (cid:48) clusters, which might befewer than the target number of clusters K . This is not necessarily a bad thing.The dropped clusters might have been redundant in the ﬁrst place. Another evident“catch” is that even the number of resulting clusters K (cid:48) is not deterministic. If werun this algorithm multiple times, we will get varying values of K (cid:48) . Malicious circle? Not really! There is one other trick up our sleeves we can use to ﬁx the “ultimate”clustering thereby rendering our approach essentially deterministic. The idea aboveis to aggregate a large enough number M of samplings. Each aggregation produces9 clustering with some K (cid:48) ≤ K clusters, and this K (cid:48) varies from aggregation to ag-gregation. However, what if we take a large number P of aggregations (each basedon M samplings)? Typically there will be a relatively large number of diﬀerentclusterings we get this way. However, assuming some degree of stability in the data,this number is much smaller than the number of a priori diﬀerent local minima wewould obtain by running the vanilla k-means algorithm. What is even better, theoccurrence counts of aggregated clusterings are not uniform but typically have a(sharply) peaked distribution around a few (or manageable) number of aggregatedclusterings. In fact, as we will see below, in our empirical genome data we are ableto pinpoint the “ultimate” clustering! So, to recap, what we have done here is this.There are myriad clusterings we can get via vanilla k-means with little to no guid-ance as to which one to pick. We have reduced this proliferation by aggregatinga large number of such clusterings into our aggregated clusterings. We then furtherzoom onto a few or even a unique clustering we consider to be the likely “ultimate”clustering by examining the occurrence counts of such aggregated clusterings, whichturns out to have a (sharply) peaked distribution. Since vanilla k-means is a rela-tively fast-converging algorithm, each aggregation is not computationally taxing andrunning a large number of aggregations is nowhere as time consuming as running asimilar number (or even a fraction thereof) of NMF computations (see below).

So, now that we know how to make clustering work, we need to decide what tocluster, i.e., what to take as our matrix X is in (1). The na¨ıve choice X is = G is issuboptimal for multiple reasons (as discussed in [Kakushadze and Yu, 2016b]).First, the elements of the matrix G is are populated by nonnegative occurrencecounts. Nonnegative quantities with large numbers of samples tend to have skeweddistributions with long tails at higher values. I.e., such distributions are not normalbut (in many cases) roughly log-normal. One simple way to deal with this is toidentify X is with a (natural) logarithm of G is (instead of G is itself). A minor hiccuphere is that some elements of G is can be 0. We can do a lot of complicated and evenconvoluted things to deal with this issue. Here, as in [Kakushadze and Yu, 2016b],we will follow a pragmatic approach and do something simple instead – there is somuch noise in the data that doing convoluted things simply does not pay oﬀ. So, asthe ﬁrst cut, we can take X is = ln (1 + G is ) (3)This takes care of the G is = 0 cases; for G is (cid:29) R is ≈ ln( G is ), as desired.Second, the detailed empirical analysis of [Kakushadze and Yu, 2016b] uncoveredwhat is termed therein the “overall” mode unequivocally present in the occurrencecount data. This “overall” mode is interpreted as somatic mutational noise unrelated This is because things are pretty much random and the only “distribution” at hand is ﬂat. In ﬁnance the analog of this is the so-called “market” mode (see, e.g., [Bouchaud and Potters,2011] and references therein) corresponding to the overall movement of the broad market, which

10o (and in fact obscuring) the true underlying cancer signatures and must thereforebe factored out somehow. Here is a simple way to understand the “overall” mode.Let the correlation matrix Ψ ij = Cor( X is , X js ), where Cor( · , · ) is serial correlation. I.e., Ψ ij = C ij /σ i σ j , where σ i = C ii are variances, and the serial covariance matrix C ij = Cov( X is , X js ) = 1 d − d (cid:88) s =1 Z is Z js (4)where Z is = X is − X i are serially demeaned, while the means X i = d (cid:80) ds =1 X is . Theaverage pair-wise correlation ρ = N ( N − (cid:80) Ni,j =1; i (cid:54) = j Ψ ij between diﬀerent mutationcategories is nonzero and is in fact high for most cancer types we study. This is theaforementioned somatic mutational noise that must be factored out. If we aggregatesamples by cancer types (see below) and compute the correlation matrix Ψ ij for theso-aggregated data (across the n = 14 cancer types we study – see below), theaverage correlation ρ is over whopping 96%. Another way of thinking about thisis that the occurrence counts in diﬀerent samples (or cancer types, if we aggregatesamples by cancer types) are not normalized uniformly across all samples (cancertypes). Therefore, running NMF, a clustering or any other signature-extractionalgorithm on the vanilla matrix G is (or its “log” X is deﬁned in (3)) would amount tomixing apples with oranges thereby obscuring the true underlying cancer signatures.Following [Kakushadze and Yu, 2016b], factoring out the “overall” mode (or“de-noising” the matrix G is ) therefore most simply amount to cross-sectional (i.e.,across the 96 mutation categories) demeaning of the matrix X is . I.e., instead of X is we use X (cid:48) is , which is obtained from X is by demeaning its columns: X (cid:48) is = X is − X s = X is − N N (cid:88) j =1 X js (5)We should note that using X (cid:48) is instead of X is in (1) does not aﬀect clustering.Indeed, g in (1) is invariant under the transformations of the form X is → X is + ∆ s ,where ∆ s is an arbitrary d -vector, as thereunder we also have Y as → Y as + ∆ s , so X is − Y as is unchanged. In fact, this is good: this means that de-noising does notintroduce any additional errors into clustering itself. However, the actual weights in the matrix W iA are aﬀected by de-noising. We discuss the algorithm for ﬁxing W iA below. However, we need one more ingredient before we get to determining theweights, and with this additional ingredient de-noising does aﬀect clustering. aﬀects all stocks (to varying degrees) – cash inﬂow (outﬂow) into (from) the market tends to pushstock prices higher (lower). This is the market risk factor, and to mitigate it one can, e.g., hold adollar-neutral portfolio of stocks (i.e., the same dollar holdings for long and short positions). Throughout this paper “serial” refers to “over the index s”. The overall normalization of C ij , i.e., d − d (maximum likelihoodestimate) in the denominator in the deﬁnition of C ij in (4), is immaterial for our purposes here. So, in this case d = n = 14 in (4). For the reasons discussed above, we should demean X is , not G is . .4.1 Normalizing Log-counts As was discussed in [Kakushadze and Yu, 2016c], clustering X is (or equivalently X (cid:48) is )would be suboptimal. The issue is this. Let σ (cid:48) i be serial standard deviations, i.e.,( σ (cid:48) i ) = Cov( X (cid:48) is , X (cid:48) is ), where, as above, Cov( · , · ) is serial covariance. Here we assumethat samples are aggregated by cancer types, so s = 1 , . . . , d with d = n = 14.Now, σ (cid:48) i are not cross-sectionally uniform and vary substantially across mutationcategories. The density of σ (cid:48) i is depicted in Figure 1 and is skewed (tailed). Thesummary of σ (cid:48) i reads: Min = 0.2196, 1st Qu. = 0.3409, Median = 0.4596, Mean =0.4984, 3rd Qu. = 0.6060, Max = 1.0010, SD = 0.1917, MAD = 0.1859, Skewness= 0.8498. If we simply cluster X (cid:48) is , this variability in σ (cid:48) i will not be accounted for.A simple solution is to cluster normalized demeaned log-counts (cid:101) X (cid:48) is = X (cid:48) is /σ (cid:48) i instead of X (cid:48) is . This way we factor out the nonuniform (and skewed) standarddeviation out of the log-counts. Note that now de-noising does make a diﬀerence inclustering. Indeed, if we use (cid:101) X is = X is /σ i (recall that σ i = Cov( X is , X is )) insteadof (cid:101) X (cid:48) is = X (cid:48) is /σ (cid:48) i in (1) and (2), the quantity g (and also clusterings) will be diﬀerent. Now that we know what to cluster (to wit, (cid:101) X (cid:48) is ) and how to get to the “unique”clustering, we need to ﬁgure out how to ﬁx the (target) number of clusters K , whichis one of the inputs in our algorithm above. In [Kakushadze and Yu, 2016b] itwas argued that in the context of cancer signatures their number can be ﬁxed bybuilding a statistical factor model [Kakushadze and Yu, 2017b], i.e., the numberof signatures is simply the number of statistical factors. So, by the same token,here we identify the (target) number of clusters in our clustering algorithm with thenumber of statistical factors ﬁxed via the method of [Kakushadze and Yu, 2017b].

So, following [Kakushadze and Yu, 2017b] and [Kakushadze and Yu, 2016b], we set K = Round(eRank(Ψ)) (6) More precisely, the discussion of [Kakushadze and Yu, 2016c] is in the ﬁnancial context, towit, quantitative trading, which has its own nuances (see below). However, some of that discussionis quite general and can be adapted to a wide variety of applications. Qu. = Quartile, SD = Standard Deviation, MAD = Mean Absolute Deviation. A variety of methods for ﬁxing the number of clusters have been discussed in other contexts,e.g., [Rousseeuw, 1987], [Pelleg and Moore, 2000], [Steinbach et al , 2000], [Goutte et al , 2001], [Sugarand James, 2003], [Hamerly and Elkan, 2004], [Lleit´ı et al , 2004], [De Amorim and Hennig, 2015]. In the ﬁnancial context, these are known as statistical risk models [Kakushadze and Yu,2017b]. For a discussion and literature on multifactor risk models, see, e.g., [Grinold and Kahn,2000], [Kakushadze and Yu, 2016a] and references therein. For prior works on ﬁxing the numberof statistical risk factors, see, e.g., [Connor and Korajczyk, 1993] and [Bai and Ng, 2002]. Here Round( · ) can be replaced by ﬂoor( · ) = (cid:98)·(cid:99) . Z ) is the eﬀective rank [Roy and Vetterli, 2007] of a symmetric semi-positive-deﬁnite (which suﬃces for our purposes here) matrix Z . It is deﬁned aseRank( Z ) = exp( H ) (7) H = − L (cid:88) a =1 p a ln( p a ) (8) p a = λ ( a ) (cid:80) Lb =1 λ ( b ) (9)where λ ( a ) are the L positive eigenvalues of Z , and H has the meaning of the (Shan-non a.k.a. spectral) entropy [Campbell, 1960], [Yang et al , 2005]. Let us emphasizethat in (6) the matrix Ψ ij is computed based on the demeaned log-counts X (cid:48) is .The meaning of eRank(Ψ ij ) is that it is a measure of the eﬀective dimensionalityof the matrix Ψ ij , which is not necessarily the same as the number L of its positiveeigenvalues, but often is lower. This is due to the fact that many d -vectors X (cid:48) is canbe serially highly correlated (which manifests itself by a large gap in the eigenvalues)thereby further reducing the eﬀective dimensionality of the correlation matrix. The one remaining thing to accomplish is to ﬁgure out how to compute the weights W iA . Happily, in the context of clustering we have signiﬁcant simpliﬁcations com-pared with NMF and computing the weights becomes remarkably simple once weﬁx the clustering, i.e., the matrix Ω iA = δ G ( i ) ,A (or, equivalently, the map G : { i } (cid:55)→{ A } , i = 1 , . . . , N , A = 1 , . . . , K , where for the notational convenience we use K to denote the number of clusters in the “ultimate” clustering – see above). Justas in NMF, we wish to approximate the matrix G is via a product of the weightsmatrix W iA and the exposure matrix H As , both of which must be nonnegative.More precisely, since we must remove the “overall” mode, i.e., de-noise the matrix G is , following [Kakushadze and Yu, 2016b], instead of G is we will approximate there-exponentiated demeaned log-counts matrix X (cid:48) is : G (cid:48) is = exp( X (cid:48) is ) (10)We can include an overall normalization by taking G (cid:48) is = exp(Mean( X is ) + X (cid:48) is ), or G (cid:48) is = exp(Median( X is ) + X (cid:48) is ), or G (cid:48) is = exp(Median( X s ) + X (cid:48) is ) (recall that X s is the vector of column means of X is – see Eq. (5)), etc., to make it look morelike the original matrix G is ; however, this does not aﬀect the extracted signatures. Also, technically speaking, after re-exponentiating we should “subtract” the extra1 we added in the deﬁnition (3) (assuming we include one of the aforesaid overallnormalizations). However, the inherent noise in the data makes this a moot point. Note that using normalized demeaned log-counts (cid:101) X (cid:48) is gives the same Ψ ij . This is because each column of W , being weights, is normalized to add up to 1. G (cid:48) is via a product W H . However, with clusteringwe have W iA = w i δ G ( i ) ,A , i.e., we have a block (cluster) structure where for a givenvalue of A all W iA are zero except for i ∈ J ( A ) = { j | G ( j ) = A } , i.e., for themutation categories labeled by i that belong to the cluster labeled by A . Therefore,our matrix factorization of G is into a product W H now simpliﬁes into a set of K independent factorizations as follows: G (cid:48) is ≈ w i H As , i ∈ J ( A ) , A = 1 . . . , K (11)So, there is no need to run NMF anymore! Indeed, if we can somehow ﬁx H As for agiven cluster, then within this cluster we can determine the corresponding weights w i ( i ∈ J ( A )) via a serial linear regression: G (cid:48) is = ε is + w i H As , i ∈ J ( A ) , A = 1 . . . , K (12)where ε is are the regression residuals. I.e., for each A ∈ { , . . . , K } , we regressthe d × n A matrix [( G (cid:48) ) T ] si ( i ∈ J ( A ), n A = | J ( A ) | ) over the d -vector H As ( s =1 , . . . , d ), and the regression coeﬃcients are nothing but the n A -vector w i ( i ∈ J ( A )),while the residuals are the d × n A matrix [( ε ) T ] si . Note that this regression is run without the intercept. Now, this all makes sense as (for each i ∈ J ( A )) the regressionminimizes the quadratic error term (cid:80) ds =1 ε is . Furthermore, if H As are nonnegative,then the weights w i are automatically nonnegative as they are given by: w i = (cid:80) ds =1 G (cid:48) is H G ( i ) ,s (cid:80) ds =1 H G ( i ) ,s (13)Now, we wish these weights to be normalized: (cid:88) i ∈ J ( A ) w i = 1 (14)This can always be achieved by rescaling H As . Alternatively, we can pick H As without worrying about the normalization, compute w i via (13), rescale them so thatthey satisfy (14), and simultaneously accordingly rescale H As . Mission accomplished! Well, almost... We still need to ﬁgure out how to ﬁx the exposures H As . Thesimplest way to do this is to note that we can use the matrix Ω iA = δ G ( i ) ,A to swapthe index i in G (cid:48) is by the index A , i.e., we can take H As = η A N (cid:88) i =1 Ω iA G (cid:48) is = (cid:101) η A n A (cid:88) i ∈ J ( A ) G (cid:48) is (15) The superscript T denotes matrix transposition. (cid:101) η A (which are ﬁxed via (14)) we simplytake cross-sectional means of G (cid:48) is in each cluster. (Recall that n A = J ( A ).) The so-deﬁned H As are automatically positive as all G (cid:48) is are positive. Therefore, w i deﬁnedvia (13) are also all positive. This is a good news – vanishing w i would amount to anincomplete weights matrix W iA (i.e., some mutations would belong to no cluster.)So, why does (15) make sense? Looking at (12), we can observe that, if theresiduals ε is cross-sectionally, within each cluster labeled by A , are random, then weexpect that (cid:80) i ∈ J ( A ) ε is ≈

0. If we had an exact equality here, then we would have(15) with η A = 1 (i.e., (cid:101) η A = n A ) assuming the normalization (14). In practice, theresiduals ε is are not exactly “random”. First, the number n A of mutation categoriesin each cluster is not large. Second, as mentioned above, there is variability in serialstandard deviations across mutation types. This leads us to consider variations. Above we argued that it makes sense to cluster normalized demeaned log-counts (cid:101) X (cid:48) is = X (cid:48) is /σ (cid:48) i due to the cross-sectional variability (and skewness) in the serialstandard deviations σ (cid:48) i . We may worry about similar eﬀects in G (cid:48) is when computing H As and w i as we did above. This can be mitigated by using normalized quantities (cid:101) G (cid:48) is = G (cid:48) is /ω i , where ω i = Cov( G (cid:48) is , G (cid:48) is ) are serial variances. That is, we can deﬁne H As = (cid:101) η A ν A (cid:88) i ∈ J ( A ) (cid:101) G (cid:48) is = (cid:101) η A ν A (cid:88) i ∈ J ( A ) ω i G (cid:48) is (16) w i = ω i (cid:80) ds =1 (cid:101) G (cid:48) is H G ( i ) ,s (cid:80) ds =1 H G ( i ) ,s = (cid:80) ds =1 G (cid:48) is H G ( i ) ,s (cid:80) ds =1 H G ( i ) ,s (17)where ν A = (cid:80) i ∈ J ( A ) /ω i . So, 1 /ω i are the weights in the averages over the clusters. Here one may wonder, considering the skewed roughly log-normal distribution of G is and henceforth G (cid:48) is , would it make sense to relate the exposures to within-clustercross-sectional averages of demeaned log-counts X (cid:48) is as opposed to those of G (cid:48) is ? Thisis easily achieved. Thus, we can deﬁne (this ensures positivity of H As ):ln( H As ) = ln( (cid:101) η A ) + 1 n A (cid:88) i ∈ J ( A ) X (cid:48) is (18)Exponentiating we get H As = (cid:101) η A  (cid:89) i ∈ J ( A ) G (cid:48) is  /n A (19) I.e., here we assume that ε is /ω i are approximately random in (12). H As that is aﬀected by the weights. So, we canintroduce the weights in the geometric means as follows:ln( H As ) = ln( (cid:101) η A ) + 1 µ A (cid:88) i ∈ J ( A ) (cid:101) X (cid:48) is = ln( (cid:101) η A ) + 1 µ A (cid:88) i ∈ J ( A ) σ (cid:48) i X (cid:48) is (20)where µ A = (cid:80) i ∈ J ( A ) /σ (cid:48) i . Recall that ( σ (cid:48) i ) = Cov( X (cid:48) is , X (cid:48) is ). Thus, we have: H As = (cid:101) η A (cid:89) i ∈ J ( A ) ( G (cid:48) is ) /µ A σ (cid:48) i (21)So, the weights are the exponents 1 /µ A σ (cid:48) i . Other variations are also possible. We are now ready to discuss an actual implementation of the above algorithm,much of the R code for which is already provided in [Kakushadze and Yu, 2016b]and [Kakushadze and Yu, 2016c]. The R source code is given in Appendix A hereof.

In our empirical analysis below we use the same genome data (from published sam-ples only) as in [Kakushadze and Yu, 2016b]. This data is summarized in Table 1(borrowed from [Kakushadze and Yu, 2016b]), which gives total counts, number ofsamples and the data sources, which are as follows: A1 = [Alexandrov et al , 2013b],A2 = [Love et al , 2012], B1 = [Tirode et al , 2014], C1 = [Zhang et al , 2013], D1= [Nik-Zainal et al , 2012], E1 = [Puente et al , 2011], E2 = [Puente et al , 2015],F1 = [Cheng et al , 2016], G1 = [Wang et al , 2014], H1 = [Sung et al , 2012], H2= [Fujimoto et al , 2016], I1 = [Imielinksi et al , 2012], J1 = [Jones et al , 2012], K1= [Patch et al , 2015], L1 = [Waddell et al , 2015], M1 = [Gundem et al , 2015], N1= [Scelo et al , 2014]. Sample IDs with the corresponding publication sources aregiven in Appendix A of [Kakushadze and Yu, 2016b]. In our analysis below weaggregate samples by the 14 cancer types. The resulting data is in Tables 2 and 3.

The underlying data consists of a matrix – call it G is – whose elements are occurrencecounts of mutation types labeled by i = 1 , . . . , N = 96 in samples labeled by s =1 , . . . , d . More precisely, we can work with one matrix G is which combines data fromdiﬀerent cancer types; or, alternatively, we may choose to work with individual16atrices [ G ( α )] is , where: α = 1 , . . . , n labels n diﬀerent cancer types; as before, i = 1 , . . . , N = 96; and s = 1 , . . . , d ( α ). Here d ( α ) is the number of samples for thecancer type labeled by α . The combined matrix G is is obtained simply by appending(i.e., bootstrapping) the matrices [ G ( α )] is together column-wise. In the case of thedata we use here (see above), this “big matrix” turns out to have 1389 columns.Generally, individual matrices [ G ( α )] is and, thereby, the “big matrix”, contain alot of noise. For some cancer types we can have a relatively small number of samples.We can also have “sparsely populated” data, i.e., with many zeros for some mutationcategories. As mentioned above, diﬀerent samples are not necessarily uniformlynormalized. Etc. The bottom line is that the data is noisy. Furthermore, intuitivelyit is clear that the larger the matrix we work with, statistically the more “signatures”(or clusters) we should expect to get with any reasonable algorithm. However, asmentioned above, a large number of signatures would be essentially useless anddefy the whole purpose of extracting them in the ﬁrst place – we have 96 mutationcategories, so it is clear that the number of signatures cannot be more than 96! Ifwe end up with, say, 50+ signatures, what new or useful does this tell us about theunderlying cancers? The answer is likely nothing other than that most cancers havenot much in common with each other, which would be a disappointing result fromthe perspective of therapeutic applications. To mitigate the aforementioned issues,at least to a certain extent, following [Kakushadze and Yu, 2016b], we can aggregatesamples by cancer types. This way we get an N × n matrix, which we also refer toas G is , where the index s = 1 , . . . , d now takes d = n values corresponding to thecancer types. In the data we use n = 14, the aggregated matrix G is is much lessnoisy than the “big matrix”, and we are ready to apply the above machinery to it. The 96 ×

14 matrix G is given in Tables 2 and 3 is what we pass into the function bio.cl.sigs() in Appendix A as the input matrix x . We use: iter.max = 100 (thisis the maximum number of iterations used in the built-in R function kmeans() – wenote that there was not a single instance in our 150 million runs of kmeans() wheremore iterations were required); num.try = 1000 (this is the number of individualk-means samplings we aggregate every time); and num.runs = 150000 (which is thenumber of aggregated clusterings we use to determine the “ultimate” – that is, themost frequently occurring – clustering). So, we ran k-means 150 million times. Moreprecisely, we ran 15 batches with num.runs = 10000 as a sanity check, to make surethat the ﬁnal result based on 150000 aggregated clusterings was consistent with theresults based on smaller batches, i.e., that it was in-sample stable. Based on Table4, we identify Clustering-A as the “ultimate” clustering (cf. Clustering-B/C/D). The R function kmeans() produces a warning if it does not converge within iter.max . We ran these 15 batches consecutively, and each batch produced the same top-10 (by occur-rence counts) clusterings as in Table 4; however, the actual occurrence counts are diﬀerent acrossthe batches with slight variability in the corresponding rankings. The results are pleasantly stable.

17e give the weights for Clustering-A, Clustering-B, Clustering-C and Clustering-D using unnormalized and normalized regressions with exposures computed basedon arithmetic averages (see Subsection 2.6) in Tables 5, 6, 7, 8, 9, 10, 11, 12, andFigures 2 through 55. We give the weights for Clustering-A using unnormalized andnormalized regressions with exposures computed based on geometric averages (seeSubsection 2.6) in Tables 13, 14, and Figures 56 through 69. The actual mutationcategories in each cluster for a given clustering can be read oﬀ the aforesaid Tableswith the weights (the mutation categories with nonzero weights belong to a givencluster), or from the horizontal axis labels in the aforesaid Figures. It is evident thatClustering-A, Clustering-B, Clustering-C and Clustering-D are essentially variationsof each other (Clustering-D has only 6 clusters, while the other 3 have 7 clusters).

So, based on genome data, we have constructed clusterings and weights. Do theywork? I.e., do they reconstruct the input data well? It is evident from the get-gothat the answer to this question may not be binary in the sense that for some cancertypes we might have a nice clustering structure, while for others we may not. Theaim of the following exercise is to sort this all out. Here come the correlations...

We have our de-noised matrix G (cid:48) is . We are approximating this matrix via thefollowing factorized matrix: G ∗ is = K (cid:88) A =1 W iA H As = w i H G ( i ) ,s (22)We can now compute an n × K matrix Θ sA of within-cluster cross-sectional correla-tions between G (cid:48) is and G ∗ is deﬁned via (xCor( · , · ) stands for “cross-sectional correla-tion” to distinguish it from “serial correlation” Cor( · , · ) we use above) Θ sA = xCor( G (cid:48) is , G ∗ is ) | i ∈ J ( A ) = xCor( G (cid:48) is , w i ) | i ∈ J ( A ) (23)We give this matrix for Clustering-A with weights using normalized regressionswith exposures computed based on arithmetic means (see Subsection 2.6) in Table15. Let us mention that, with exposures based on arithmetic means, weights usingnormalized regressions work a bit better than using unnormalized regressions. Usingexposures based on geometric means changes the weights a bit, which in turn slightlyaﬀects the within-cluster correlations, but does not alter the qualitative picture. De-noising per se does not aﬀect cross-sectional correlations. Adding extra 1 in (3) (recallthat we obtain G (cid:48) is by cross-sectionally demeaning X is and then re-exponentiating) has a negligibleeﬀect. So, in the correlations below we can use the original data matrix G is instead of G (cid:48) is . Due to the factorized structure (22), these correlations do not directly depend on H As . .3.2 Overall Correlations Another useful metric, which we use as a sanity check, is this. For each value of s (i.e., for each cancer type), we can run a linear cross-sectional regression (withoutthe intercept) of G (cid:48) is over the matrix W iA . So, we have n = 14 of these regressions.Each regression produces multiple R and adjusted R , which we give in Table 15.Furthermore, we can compute the ﬁtted values (cid:98) G ∗ is based on these regressions, whichare given by (cid:98) G ∗ is = K (cid:88) A =1 W iA F As = w i F G ( i ) ,s (24)where (for each value of s ) F As are the regression coeﬃcients. We can now computethe overall cross-sectional correlations (i.e., the index i runs over all N = 96 mutationcategories) Ξ s = xCor( G (cid:48) is , (cid:98) G ∗ is ) (25)These correlations are also given in Table 15 and measure the overall ﬁt quality. Looking at Table 15 a few things become immediately evident. Clustering works wellfor 10 out the 14 cancer types we study here. The cancer types for which clusteringdoes not appear to work all that well are Breast Cancer (labeled by X4 in Table15), Liver Cancer (X8), Lung Cancer (X9), and Renal Cell Carcinoma (X14). Moreprecisely, for Breast Cancer we do have a high within-cluster correlation for Cl-5 (andalso Cl-4), but the overall ﬁt is not spectacular due to low within-cluster correlationsin other clusters. Also, above 80% within-cluster correlations arise for 5 clusters,to wit, Cl-1, Cl-3, Cl-4, Cl-5 and Cl-6, but not for Cl-2 or Cl-7. Furthermore,remarkably, Cl-1 has high within-cluster correlations for 9 cancer types, and Cl-5for 6 cancer types. These appear to be the leading clusters. Together they havehigh within-cluster correlations in 11 cancer types. So what does all this mean?Additional insight is provided by looking at the within-cluster correlations be-tween the 7 cancer signature extracted in [Kakushadze and Yu, 2016b] and theclusters we ﬁnd here. Let W iα be the weights for the 7 cancer signatures fromTables 13 and 14 of [Kakushadze and Yu, 2016b]. We can compute the followingwithin-cluster correlations ( α = 1 , . . . , αA = xCor( W iα , W iA ) | i ∈ J ( A ) (26)These correlations are given in Table 16. High within-cluster correlations arisefor Cl-1 (with Sig1 and Sig7), Cl-5 (with Sig2) and Cl-6 (with Sig4). And thismakes perfect sense. Indeed, looking at Figures 14 through 20 of [Kakushadze and The 80% cutoﬀ is somewhat arbitrary, but reasonable.

Clustering ideas and techniques have been applied in cancer research in variousincarnations and contexts aplenty – for a partial list of works at least to someextent related to our discussion here, see, e.g, [Chen et al , 2008a], [Chen et al , 2008b],[Kashuba et al , 2009], [Nik-Zainal et al , 2012], [Roberts et al , 2012], [Alexandrov et al ,2013a], [Alexandrov et al , 2013b], [Burns et al , 2013a], [Burns et al , 2013b], [Lawrence20 t al , 2013], [Long et al , 2013] [Roberts et al , 2013], [Taylor et al , 2013], [Xuan et al , 2013], [Alexandrov and Stratton, 2014], [Bacolla et al , 2014], [Bolli et al ,2014], [Caval et al , 2014], [Davis et al , 2014], [Helleday et al , 2014], [Nik-Zainal et al , 2014], [Poon et al , 2014], [Qian et al , 2014], [Roberts and Gordenin, 2014a],[Roberts and Gordenin, 2014b], [Roberts and Gordenin, 2014c], [Sima and Gilbert,2014], [Chan and Gordenin, 2015], [Pettersen et al , 2015] and references therein. Asmentioned above, even in NMF clustering is used at some (perhaps not-so-evident)layer. What is new in our approach – and hence new results – is that: i) following[Kakushadze and Yu, 2016b], we apply clustering to aggregated by cancer types andde-noised data; ii) we use a tried-and-tested in quantitative ﬁnance bag of tricks from[Kakushadze and Yu, 2016c], which improves clustering; and iii) last but not least,we apply our *K-means algorithm to cancer genome data. As mentioned above, *K-means, unlike vanilla k-means or its other commonly used variations, is essentiallydeterministic, and it achieves determinism statistically , not by “guessing” initialcenters or as in agglomerative hierarchical clustering, which basically “guesses” theinitial (e.g., 2-cluster) clustering. Instead, via aggregating a large number of k-meansclusterings and statistical examination of the occurrence counts of such aggregations,*K-means takes a mess of myriad vanilla k-means clusterings and systematicallyreduces randomness and indeterminism without ad hoc initial “guesswork”.As mentioned above, consistently with the results of [Kakushadze and Yu, 2016b]obtained via improved NMF techniques, Liver Cancer, Lung Cancer and Renal CellCarcinoma do not appear to have clustering (sub)structures. This could be bothgood and bad news. It is a good news because we learned something interestingabout these cancer types – and in two complementary ways. However, it could alsobe a bad news from the therapeutic standpoint. Since these cancer types appear tohave little in common with others, it is likely that they would require specializedtherapeutics. On the ﬂipside, we should note that it would make sense to excludethese 3 cancer types when running clustering analysis. However, it would also makesense to include other cancer types by utilizing the International Cancer GenomeConsortium data, which we leave for future studies. (For comparative reasons, herewe used the same data as in [Kakushadze and Yu, 2016b], which was limited todata samples published as of the date thereof.) This paper is not intended to be anexhaustive empirical study but a proof of concept and an opening of a new avenuefor extracting and studying cancer signatures beyond the tools that NMF provides.And we do ﬁnd that 11 out of the 14 cancer types we study here have clusteringstructures substantially embedded in them and clustering overall works well for atleast 10 out of these 11 cancer types. Now, looking at Figure 14 of [Kakushadzeand Yu, 2016b], we see that its “peaks” are located at ACGT, CCGT, GCGT andTCGT. The same “peaks” are present in our cluster Cl-1 (see Figures 2 and 3).Hence the high within-cluster correlation between Cl-1 and Sig1. On the other Breast Cancer possibly being an exception. As mentioned above, it would make sense toexclude Liver Cancer, Lung Cancer and Renal Cell Carcinoma from the analysis, which may aﬀecthow well clustering works for Breast Cancer and possibly also the other 10 cancer types. et al , 2012], [Alexandrov et al , 2013b], which is due tospontaneous cytosine deamination. So, this is what our cluster Cl-1 describes. Next,looking at Figure 15 of [Kakushadze and Yu, 2016b], we see that its “peaks” arelocated at TCAG, TCTG, TCAT and TCTT. The ﬁrst two of these “peaks” TCAGand TCTG are present in our Cl-5 (see Figures 10 and 11), the third “peak” TCATis present in our Cl-1 (see Figures 2 and 3), while the fourth “peak” TCTT is presentin our Cl-4 (see Figures 8 and 9), which is consistent with the high within-clustercorrelations between Sig2 and Cl-4 and Cl-5, albeit its within-cluster correlationwith Cl-1 is poor. Note that Sig2 of [Kakushadze and Yu, 2016b] is essentially thesame as the mutational signatures 2+13 of [Nik-Zainal et al , 2012], [Alexandrov et al ,2013b], which are due to APOBEC mediated cytosine deamination. In fact, it wasreported as a single signature in [Alexandrov et al , 2013b], however, subsequently, itwas split into 2 distinct signatures, which usually appear in the same samples. Ourclustering results indicate that grouping TCAG and TCTG into one signature makessense as they belong to the same cluster Cl-5. However, grouping TCAT and TCTTtogether does not appear to make much sense. Looking at the Figures for Clustering-A, Clustering-B, Clustering-C and Clustering-D, we see that the TCAT “peak”invariably appears together with the ACGT, CCGT, GCGT and TCGT “peaks”as in Cl-1 in Clustering-A, Cl-2 in Clustering-B, Cl-1 in Clustering-C, and Cl-1 inClustering-D, but never with TCTT. So, our clustering approach tells us somethingnew beyond the NMF “intuition”. This may have an important implication forBreast Cancer, which, as mentioned above, is dominated by Sig2. Thus, based onour results in Table 15, we see that Breast Cancer has high within-cluster correlationswith Cl-4 and Cl-5, but not with Cl-1. This may imply that clustering simply doesnot work well for Breast Cancer, which would appear to put it in the same “stand-alone” league as Liver Cancer, Lung Cancer and Renal Cell Carcinoma. In anyevent, clustering invariably suggests that the TCAT “peak” belongs in Cl-1 withthe 4 “peaks” ACGT, CCGT, GCGT and TCGT related to spontaneous cytosinedeamination, rather than those related to APOBEC mediated cytosine deamination.Now, let us check the remaining two signatures of [Kakushadze and Yu, 2016b]with “tall mountain landscapes” (see above), to wit, Sig4 and Sig7. Looking atFigure 17 of [Kakushadze and Yu, 2016b], we see that its “peaks” are at CTTC,TTTC, CTTG and TTTG. The same peaks appear in our Cl-6 (see Figures 12and 13). Hence the high within-cluster correlation between Cl-6 and Sig4. Notethat Sig4 is essentially the same as the mutational signature 17 of [Nik-Zainal et al ,2012], [Alexandrov et al , 2013b], and its underlying mutational process is unknown.Next, looking at Figure 20 of [Kakushadze and Yu, 2016b], we see that its “peaks”for the C > G mutations are essentially the same as in Cl-1. Hence the high within-cluster correlation between Cl-7 and Sig1. So, there are no surprises with Sig1, Sig4and Sig7. However, based on our clustering results, as we discuss above, with Sig2 For detailed comments, see http://cancer.sanger.ac.uk/cosmic/signatures.

22e do ﬁnd – what we feel is a pleasant – surprise, that splitting it into two signatures(see above) might be inadequate and the TCAT “peak” might really belong withthe Sig1 “peaks” (spontaneous v. APOBEC mediated cytosine deamination). Thisis exciting as it might be an indication of the limitations of NMF (or clustering...). In Introduction we promised that we would discuss some potential applicationsof *K-means in quantitative ﬁnance, and so here it is. Let us mention that *K-meansis universal, oblivious to the input data and applicable in a variety of ﬁelds. In quan-titative ﬁnance *K-means a priori can be applied everywhere clustering methods areused with the added bonus of (statistical) determinism. One evident example isstatistical industry classiﬁcations discussed in [Kakushadze and Yu, 2016c], whereone uses clustering methods to classify stocks. In fact, *K-means is an extension ofthe methods discussed in [Kakushadze and Yu, 2016c]. One thing to keep in mindis that in *K-means one sifts through a large number P of aggregations, which canget computationally costly when clustering 2000+ stocks into 100+ clusters. An-other potential application is in the context of combining alphas (trading signals)– see, e.g., [Kakushadze and Yu, 2017a]. Yet another application is when we havea term structure, such as a portfolio of bonds (e.g., U.S. Treasuries or some otherbonds) with varying maturities, or futures (e.g., Eurodollar futures) with varyingdeliveries. These cases resemble the genome data more in the sense that the number N of instruments is relatively small (typically even fewer than the number of mu-tation categories). Another example with a relatively small number of instrumentswould be a portfolio of various futures for diﬀerent FX (foreign exchange) pairs (evenwith the uniform delivery), e.g., USD/EUR, USD/HKD, EUR/AUD, etc., i.e., FXstatistical arbitrage. One approach to optimizing risk in such portfolios is by em-ploying clustering methods and a stable, essentially deterministic algorithm such as*K-means can be useful. Hopefully *K-means will prove a valuable tool in cancerresearch, quantitative ﬁnance as well as various other ﬁelds (e.g., image recognition). A R Source Code The mainfunction is bio.cl.sigs(x, iter.max = 100, num.try = 1000, num.runs = 10000) .Here: x is the N × d occurrence counts matrix G is (where N = 96 is the number of Or both... Alternatively – and that would be truly exciting – perhaps there is a biologicexplanation. In any event, it is too early to tell – yet another possibility is that this is merely anartifact of the dataset we use. More research and analyses on larger datasets (see above) is needed. Albeit with the understanding that it requires additional computational cost. This can be mitigated by employing top-down clustering [Kakushadze and Yu, 2016c]. The source code in Appendix A hereof is not written to be “fancy” or optimized for speed orin any other way. Its sole purpose is to illustrate the algorithms described in the main text in asimple-to-understand fashion. See Appendix B for some important legalese. d is the number of samples; or d = n , where n is the numberof cancer types, when the samples are aggregated by cancer types); iter.max is themaximum number of iterations that are passed into the R built-in function kmeans() ; num.try is the number M of aggregated clusterings (see Subsection 2.3.2); num.runs is the number of runs P used to determine the most frequently occurring clustering(the “ultimate” clustering) obtained via aggregation (see Subsection 2.3.3). Thefunction bio.erank.pc() is deﬁned in Appendix B of [Kakushadze and Yu, 2016b].The function qrm.stat.ind.class() is deﬁned in Appendix A of [Kakushadze andYu, 2016c]. This function internally calls another function qrm.calc.norm.ret() ,which we redeﬁne here via the function bio.calc.norm.ret() . The output is alist, whose elements are as follows: res$ind is an N × K binary matrix Ω iA = δ G ( i ) ,A ( i = 1 , . . . , N , A = 1 , . . . , K , the map G : { , . . . , N } (cid:55)→ { , . . . , K } – see Section 2),which deﬁnes the K clusters in the “ultimate” clustering; res$w is an N -vector ofweights obtained via unnormalized regressions using arithmetic means for comput-ing exposures (i.e., via (13), (14) and (15)); res$v is an N -vector of weights obtainedvia normalized regressions using arithmetic means for computing exposures (i.e., via(17), (14) and (16)); res$w.g is an N -vector of weights obtained via unnormalizedregressions using geometric means for computing exposures (i.e., via (13), (14) and(19)); res$v.g is an N -vector of weights obtained via normalized regressions usinggeometric means for computing exposures (i.e., via (17), (14) and (21)). bio.calc.norm.ret <- function (ret) { s <- apply(ret, 1, sd)x <- ret / sreturn(x) } qrm.calc.norm.ret <- bio.calc.norm.retbio.cl.sigs <- function(x, iter.max = 100,num.try = 1000, num.runs = 10000) { cl.ix <- function(x) match(1, x)y <- log(1 + x)y <- t(t(y) - colMeans(y)) The deﬁnition of qrm.calc.norm.ret() in [Kakushadze and Yu, 2016c] accounts for somepeculiarities and nuances pertinent to quantitative trading, which are not applicable here. The code returns the K clusters ordered such that the number of mutation n A (i.e., thecolumn sum of Ω iA ) in the cluster labeled by A is in the increasing order. It also orders clusterswith identical n A . We note, however, that (for presentational convenience reasons) the order ofsuch clusters in the tables and ﬁgures below is not necessarily the same as what this code returns. .d <- exp(y)k <- ncol(bio.erank.pc(y)$pc)n <- nrow(x)u <- rnorm(n, 0, 1)q <- matrix(NA, n, num.runs)p <- rep(NA, num.runs)for(i in 1:num.runs) { z <- qrm.stat.ind.class(y, k, iter.max = iter.max,num.try = num.try, demean.ret = F)p[i] <- sum((residuals(lm(u ∼ -1 + z)))^2)q[, i] <- apply(z, 1, cl.ix) } p1 <- unique(p)ct <- rep(NA, length(p1))for(i in 1:length(p1))ct[i] <- sum(p1[i] == p)p1 <- p1[ct == max(ct)]i <- match(p1, p)[1]ix <- q[, i]k <- max(ix)z <- matrix(NA, n, k)for(j in 1:k)z[, j] <- as.numeric(ix == j)res <- bio.cl.wts(x.d, z)return(res) } bio.cl.wts <- function (x, ind) { first.ix <- function(x) match(1, x)[1]calc.wts <- function(x, use.wts = F, use.geom = F) { if(use.geom) { if(use.wts) <- apply(log(x), 1, sd)elses <- rep(1, nrow(x))s <- 1 / s / sum(1 / s)fac <- apply(x^s, 2, prod) } else { if(use.wts)s <- apply(x, 1, sd)elses <- rep(1, nrow(x))fac <- colMeans(x / s) } w <- coefficients(lm(t(x) ∼ -1 + fac))w <- 100 * w / sum(w)return(w) } n <- nrow(x)w <- w.g <- v <- v.g <- rep(NA, n)z <- colSums(ind)z <- as.numeric(paste(z, ".", apply(ind, 2, first.ix), sep = ""))dimnames(ind)[[2]] <- names(z) <- 1:ncol(ind)z <- sort(z)z <- names(z)ind <- ind[, z]dimnames(ind)[[2]] <- NULLfor(i in 1:ncol(ind)) { take <- ind[, i] == 1if(sum(take) == 1) { w[take] <- w.g[take] <- 1v[take] <- v.g[take] <- 1next } w[take] <- calc.wts(x[take, ], F, F)w.g[take] <- calc.wts(x[take, ], F, T)v[take] <- calc.wts(x[take, ], T, F) .g[take] <- calc.wts(x[take, ], T, T) } res <- new.env()res$ind <- indres$w <- wres$w.g <- w.gres$v <- vres$v.g <- v.greturn(res) } B DISCLAIMERS

Wherever the context so requires, the masculine gender includes the feminine and/orneuter, and the singular form includes the plural and vice versa . The author of thispaper (“Author”) and his aﬃliates including without limitation Quantigic (cid:114)

Solu-tions LLC (“Author’s Aﬃliates” or “his Aﬃliates”) make no implied or expresswarranties or any other representations whatsoever, including without limitationimplied warranties of merchantability and ﬁtness for a particular purpose, in con-nection with or with regard to the content of this paper including without limitationany code or algorithms contained herein (“Content”).The reader may use the Content solely at his/her/its own risk and the readershall have no claims whatsoever against the Author or his Aﬃliates and the Authorand his Aﬃliates shall have no liability whatsoever to the reader or any third partywhatsoever for any loss, expense, opportunity cost, damages or any other adverseeﬀects whatsoever relating to or arising from the use of the Content by the readerincluding without any limitation whatsoever: any direct, indirect, incidental, spe-cial, consequential or any other damages incurred by the reader, however causedand under any theory of liability; any loss of proﬁt (whether incurred directly orindirectly), any loss of goodwill or reputation, any loss of data suﬀered, cost of pro-curement of substitute goods or services, or any other tangible or intangible loss;any reliance placed by the reader on the completeness, accuracy or existence of theContent or any other eﬀect of using the Content; and any and all other adversitiesor negative eﬀects the reader might encounter in using the Content irrespective ofwhether the Author or his Aﬃliates is or are or should have been aware of suchadversities or negative eﬀects.The R code included in Appendix A hereof is part of the copyrighted R codeof Quantigic (cid:114)

Solutions LLC and is provided herein with the express permission ofQuantigic (cid:114)

Solutions LLC. The copyright owner retains all rights, title and interestin and to its copyrighted source code included in Appendix A hereof and any andall copyrights therefor. 27 eferences

Alexandrov, L.B., Nik-Zainal, S., Wedge, D.C., Campbell, P.J. and Stratton,M.R. (2013a) Deciphering Signatures of Mutational Processes Operative in Hu-man Cancer.

Cell Reports

Nature

Current Opinion in Ge-netics & Development

24: 52-60.Ananthaswamy, H.N. and Pierceall, W.E. (1990) Molecular mechanisms ofultraviolet radiation carcinogenesis.

Photochemistry and Photobiology

Genes

Econometrica

Nature Communications

5: 2997.28ouchaud, J.-P. and Potters, M. (2011) Financial applications of random matrixtheory: a short review. In: Akemann, G., Baik, J. and Di Francesco, P. (eds.)

The Oxford Handbook of Random Matrix Theory.

Oxford, United Kingdom:Oxford University Press.Burns, M.B., Lackey, L., Carpenter, M.A., Rathore, A., Land, A.M., Leonard,B., Refsland, E.W., Kotandeniya, D., Tretyakova, N., Nikas, J.B., Yee, D.,Temiz, N.A., Donohue, D.E., McDougle, R.M., Brown, W.L., Law, E.K., Harris,R.S. (2013a) APOBEC3B is an enzymatic source of mutation in breast cancer.

Nature

Nature Genetics (cid:48)

UTR enhances chromosomal DNA damage.

Nature Communi-cations

5: 5129.Campbell, L.L. (1960) Minimum coeﬃcient rate for stationary random pro-cesses.

Information and Control

Annual Review of Genetics

49: 243-627Chen, Z., Feng, J., Buzin, C.H. and Sommer, S.S. (2008a) Epidemiology ofdoublet/multiplet mutations in lung cancers: evidence that a subset arises bychronocoordinate events.

PloS One

Oncogene

The American Journal of Human Genetics

The Journal of Finance

Cancer Cell

InformationSciences

Biometrics

Nature Genetics

Genetics

Human Brain Mapping

Active Portfolio Management.

New York,NY: McGraw-Hill.Gundem, G., Van Loo, P., Kremeyer, B., Alexandrov, L.B., Tubio, J.M., Pa-paemmanuil, E., Brewer, D.S., Kallio, H.M., H¨ogn¨as, G., Annala, M., Kiv-inummi, K., Goody, V., Latimer, C., O’Meara, S., Dawson, K.J., Isaacs,W., Emmert-Buck, M.R., Nykter, M., Foster, C., Kote-Jarai, Z., Easton, D.,Whitaker, H.C.; ICGC Prostate UK Group, Neal, D.E., Cooper, C.S., Eeles,30.A., Visakorpi, T., Campbell, P.J., McDermott, U., Wedge, D.C., Bova, G.S.(2015) The evolutionary history of lethal metastatic prostate cancer.

Nature

Advances of the Neural Information Processing Systems , Vol. 16. Campridge,MA: MIT Press, pp. 281-289.Hartigan, J.A. (1975) Clustering algorithms. New York, NY: John Wiley &Sons, Inc.Hartigan, J.A. and Wong, M.A. (1979) Algorithm AS 136: A K-Means Clus-tering Algorithm.

Journal of the Royal Statistical Society, Series C (AppliedStatistics)

Nature Reviews Genetics

Cell

Nature

The Journal of Investment Strategies

Phys-ica A

Journalof Risk & Control

Journalof Asset Management

The Journal ofInvestment Strategies

PloS One

Nature

Analytica Chimica Acta

IEEE Transactions onInformation Theory

Cancer Research

Journal of NationalCancer Institute

Nature Genetics

Proceedings of the5th Berkeley Symposium on Mathematical Statistics and Probability.

Berkeley,CA: University of California Press, pp. 281-297.Murtagh, F. and Contreras, P. (2011) Algorithms for hierarchical clustering:An overview.

Wiley Interdisciplinary Reviews: Data Mining and KnowledgeDiscovery

Cell

Nature Genetics

Environmetrics

Nature

Proceedings ofthe 17th International Conference on Machine Learning . San Francisco, CA:Morgan Kaufman: pp. 727-734.Pettersen, H.S., Galashevskaya, A., Doseth, B., Sousa, M.M., Sarno, A., Visnes,T., Aas, P.A., Liabakk, N.B., Slupphaug, G., Sætrom, P., Kavli, B., Krokan,H.E. (2015) AID expression in B-cell lymphomas causes accumulation of ge-nomic uracil and a distinct AID mutational signature.

DNA Repair

25: 60-71.Poon, S., McPherson, J., Tan, P., Teh, B. and Rozen, S. (2014) Mutation sig-natures of carcinogen exposure: genome-wide detection and new opportunitiesfor cancer prevention.

Genome Medicine

Nature

Cell

Molecular Cell

NatureGenetics eLS (Genetics & Disease) . Chichester, UK: John Wiley & Sons, Ltd.Roberts S.A and Gordenin D.A. (2014b) Clustered and genome-wide transientmutagenesis in human cancers: Hypermutation without permanent mutatorsor loss of ﬁtness.

BioEsseys

Nature Reviews Cancer

Journal of Computational and Applied Mathe-matics

European Signal Processing Conference (EUSIPCO).

Pozna´n,Poland (September 3-7), pp. 606-610.Scelo, G., Riazalhosseini, Y., Greger, L., Letourneau, L., Gonz`alez-Porta,M., Wozniak, M.B., Bourgey, M., Harnden, P., Egevad, L., Jackson, S.M.,Karimzadeh, M., Arseneault, M., Lepage, P., How-Kit, A., Daunay, A., Re-nault, V., Blanch´e, H., Tubacher, E., Sehmoun, J., Viksna, J., Celms, E.,Opmanis, M., Zarins, A., Vasudev, N.S., Seywright, M., Abedi-Ardekani, B.,Carreira, C., Selby, P.J., Cartledge, J.J., Byrnes, G., Zavadil, J., Su, J., Holca-tova, I., Brisuda, A., Zaridze, D., Moukeria, A., Foretova, L., Navratilova, M.,Mates, D., Jinga, V., Artemov, A., Nedoluzhko, A., Mazur, A., Rastorguev, S.,Boulygina, E., Heath, S., Gut, M., Bihoreau, M.T., Lechner, D., Foglio, M.,Gut, I.G., Skryabin, K., Prokhortchouk, E., Cambon-Thomsen, A., Rung, J.,Bourque, G., Brennan, P., Tost, J., Banks, R.E., Brazma, A., Lathrop, G.M.(2014) Variation in genomic landscape of clear cell renal cell carcinoma acrossEurope.

Nature Communications

5: 5135.Sibson, R. (1973) SLINK: an optimally eﬃcient algorithm for the single-linkcluster method.

The Computer Journal (British Computer Society)

Current Opinionin Genetics & Developmen

25: 93-100.Steinbach, M., Karypis, G. and Kumar, V. (2000) A comparison of documentclustering techniques.

KDD Workshop on Text Mining

Bull. Acad.Polon. Sci.

Journal of the American StatisticalAssociation

Nature Genetics eLife

2: e00534.Tirode, F., Surdez, D., Ma, X., Parker, M., Le Deley, M.C., Bahrami, A.,Zhang, Z., Lapouble, E., Grossetˆete-Lalami, S., Rusch, M., Reynaud, S., Rio-Frio, T., Hedlund, E., Wu, G., Chen, X., Pierron, G., Oberlin, O., Zaidi, S.,Lemmon, G., Gupta, P., Vadodaria, B., Easton, J., Gut, M., Ding, L., Mardis,E.R., Wilson, R.K., Shurtleﬀ, S., Laurence, V., Michon, J., Marec-B´erard, P.,Gut, I., Downing, J., Dyer, M., Zhang, J., Delattre, O.; St. Jude Children’sResearch Hospital – Washington University Pediatric Cancer Genome Projectand the International Cancer Genome Consortium. (2014) Genomic Landscapeof Ewing Sarcoma deﬁnes an aggressive subtype with co-association of STAG2and TP53 mutations.

Cancer Discovery

Nature

Nature Genetics

Car-cinogenesis

IEEE Transactions on Information Theory

Nature Genetics > W: XYZ.

Mutation X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14ACAA 466 716 59 3024 286 11 24884 51929 15865 801 13831 36 577 10734ACCA 355 528 41 2600 249 3 14361 15227 9217 681 11192 42 422 4952ACGA 43 112 6 446 55 4 2253 51156 3567 115 1804 9 59 1231ACTA 309 577 38 2201 195 12 15833 33385 10038 557 10508 20 380 5573CCAA 420 538 29 2814 238 11 21980 9507 19569 579 10943 59 459 7899CCCA 245 361 30 2149 147 6 19624 26869 16889 424 9720 45 336 6037CCGA 77 75 22 470 35 7 4034 8695 4995 105 1644 29 67 1069CCTA 418 502 26 2380 211 10 43418 2967 15930 544 10037 80 388 6825GCAA 452 624 41 2144 244 16 26007 42508 11084 532 6835 33 419 4795GCCA 234 346 18 1677 149 6 10594 32252 9087 337 6423 33 247 3563GCGA 67 66 7 375 29 5 2649 17426 3607 94 1295 17 67 853GCTA 351 463 13 1684 197 11 21205 33670 7502 345 6543 30 332 3637TCAA 584 628 38 5873 233 11 24288 33919 13358 766 9816 54 582 8973TCCA 313 472 36 3735 145 19 17372 139576 13140 567 9697 53 479 7388TCGA 66 57 7 543 28 4 3168 11569 2385 128 1334 17 87 1097TCTA 916 849 54 4938 376 25 43364 20943 14028 1111 12202 56 817 8611ACAG 232 260 19 2124 162 1 10057 42216 4369 445 8653 18 197 4551ACCG 143 148 11 1307 92 4 4818 6202 2088 256 4506 21 188 2687ACGG 48 82 10 641 32 2 1606 27957 953 121 2172 5 46 837ACTG 290 240 25 2278 113 8 8670 90212 3091 394 8460 23 237 4457CCAG 153 160 13 1726 78 7 4942 8975 3233 175 5559 30 125 2474CCCG 131 154 14 1391 75 2 4117 27928 2902 163 4857 18 142 2601CCGG 31 65 14 597 20 0 1176 87975 1210 122 1769 11 55 532CCTG 213 227 19 2425 92 11 6840 14272 3348 238 7943 28 204 3846GCAG 106 120 8 1128 52 2 3844 53180 2546 196 3621 12 95 1847GCCG 147 148 7 1142 80 3 4174 11783 2406 160 3363 16 112 1630GCGG 25 42 2 525 33 1 776 52658 1113 52 918 5 27 423GCTG 225 146 9 1650 81 5 5116 43761 2187 177 5418 14 135 2045TCAG 391 266 22 16099 79 46 12809 12172 8882 265 13033 49 359 3968TCCG 279 185 18 4896 95 14 7208 41020 4566 244 7977 30 264 4441TCGG 30 52 11 794 13 2 1294 9554 992 64 1423 11 33 553TCTG 660 444 44 21847 202 53 21720 3519 10841 449 20292 75 563 7668ACAT 950 931 97 3557 440 14 46364 28569 6530 1137 13367 66 542 8872ACCT 482 482 49 1875 265 14 16724 47067 3418 590 6543 100 285 4786ACGT 1085 2373 289 4978 792 70 78681 17833 3830 3694 17475 585 1603 9034ACTT 603 628 57 2570 263 11 21545 63195 4179 827 10578 54 380 6567CCAT 729 542 74 3344 294 16 19611 23743 7430 801 8689 79 428 7106CCCT 607 545 78 2443 337 10 18553 52051 6155 810 7554 70 325 6038CCGT 845 1095 180 3489 483 70 53496 7282 4362 1983 11157 498 888 4824CCTT 784 774 124 3468 380 13 22137 29223 7932 854 11714 69 430 8606GCAT 615 531 65 2815 285 16 31548 50399 4166 711 8492 107 420 6069GCCT 583 484 64 2256 295 14 34372 10858 3715 825 7784 120 402 5474GCGT 930 1382 152 3870 546 60 75814 35864 3544 2428 13447 629 1294 5232GCTT 660 585 78 2254 316 15 26554 49322 3836 741 8329 93 378 5765TCAT 1531 761 86 25251 268 69 30161 11611 13414 780 13880 131 681 9723TCCT 1172 685 73 7268 319 25 23799 31779 7136 859 10287 112 544 8833TCGT 628 903 127 4510 281 58 33759 53275 2998 1553 8055 300 708 3772TCTT 1466 813 57 13615 313 43 29570 21922 11432 859 12692 79 662 8933

Mutation X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14ATAA 385 382 22 1375 248 2 9613 35715 3957 315 5776 7 313 9726ATCA 241 274 13 1388 191 6 9419 7570 1917 235 4439 18 214 2973ATGA 232 344 21 1459 211 4 7666 48117 3962 228 5983 18 197 7703ATTA 580 516 29 2262 301 5 30067 32078 2928 590 9131 16 523 5833CTAA 185 217 9 1009 98 3 5551 9978 4331 151 4473 10 175 12963CTCA 211 256 17 1535 118 6 14584 41504 3430 224 6343 21 184 6465CTGA 219 227 26 1406 146 12 8654 8508 7905 210 6473 28 207 14356CTTA 379 408 27 1716 162 6 24273 2400 4255 303 8578 24 221 8077GTAA 180 162 10 797 82 1 4092 36890 3147 104 3254 5 138 5769GTCA 174 147 8 791 80 1 5919 25736 1399 113 3026 10 79 1954GTGA 156 145 9 948 102 10 6717 9623 4021 122 4233 8 135 3924GTTA 181 272 14 1168 129 3 10065 43279 2112 200 5741 10 156 2579TTAA 576 454 29 2192 256 2 20211 13356 4086 510 6936 16 499 11629TTCA 158 202 17 1095 110 3 7296 75515 1766 222 4859 12 142 4966TTGA 179 156 4 774 144 3 5147 5013 2912 140 3493 10 129 7897TTTA 434 467 37 2298 289 3 27521 12174 3117 481 10786 8 417 10311ATAC 910 792 84 3342 519 10 34145 35102 7884 865 14628 36 552 11060ATCC 382 255 32 1439 192 7 14365 4444 1952 354 6071 22 206 2821ATGC 572 526 51 2354 329 14 28630 16789 5325 534 10017 48 331 6561ATTC 792 734 94 3340 457 11 22018 48320 4344 769 13211 38 503 6714CTAC 456 283 30 1378 187 1 21942 9143 4198 301 5972 21 202 4855CTCC 427 295 31 1670 197 5 31474 13443 2361 331 7202 48 163 3092CTGC 531 328 29 1659 222 6 38742 59589 3845 360 6553 60 206 4024CTTC 795 328 48 1828 316 12 73708 10381 3566 383 8088 48 239 4982GTAC 452 378 30 1533 215 4 25980 46065 3605 462 6804 48 345 4411GTCC 404 332 25 981 176 10 15268 17372 1619 299 4123 20 195 1823GTGC 370 284 33 1233 189 8 24550 54718 2912 332 5851 48 211 2962GTTC 511 447 46 1680 269 7 22803 46663 2547 532 7513 36 367 3575TTAC 541 428 48 1819 297 2 22540 16834 3732 497 7741 25 372 6232TTCC 606 383 44 1610 211 2 25505 42903 1979 473 7086 24 280 3847TTGC 306 242 22 1079 185 9 17328 8016 2277 323 4419 24 210 2892TTTC 818 437 60 2187 415 4 48109 2717 3140 617 10866 28 427 13766ATAG 462 133 9 862 229 2 5956 20568 1055 164 3114 5 158 2747ATCG 134 69 3 451 51 0 5214 50363 476 92 1335 10 93 1949ATGG 196 136 11 941 76 3 5505 35331 1134 165 3935 5 116 2726ATTG 580 155 17 1030 166 2 31252 58405 1010 177 3455 11 221 4941CTAG 416 110 4 601 134 3 5035 29104 779 85 1932 3 87 1381CTCG 219 91 16 759 71 4 14376 86003 695 96 2754 15 96 1628CTGG 342 137 7 1165 140 4 12511 13036 1684 142 4488 19 142 2080CTTG 1780 202 21 1560 344 5 142206 27191 1801 187 4424 38 171 2944GTAG 224 82 3 617 78 0 2552 45734 746 87 1792 3 45 896GTCG 132 53 4 667 43 0 3804 10333 542 53 1290 10 47 628GTGG 183 102 12 2989 116 2 5256 21041 2874 171 5467 16 58 1313GTTG 1090 113 14 1194 142 3 37454 46981 1106 174 3439 13 122 1423TTAG 654 199 19 1021 260 1 7674 11096 1235 187 3327 14 158 2799TTCG 167 96 13 725 79 2 8315 32124 740 130 2507 12 125 3312TTGG 265 132 16 1124 121 7 11452 62446 1693 161 4480 8 142 2974TTTG 1349 296 43 2144 403 4 77262 28801 2361 379 7679 39 353 11415 num.try = 1000 in the R function qrm.stat.ind.class() ;also, the target number of clusters is k = 7 ; see Appendix A for details). Thecolumns “Cl-1” through “Cl-7” give the numbers of mutations in each cluster (thetotal number of mutations in each clustering is 96). The entries “–” correspond toclusterings with fewer than 7 clusters. Note that Clustering-C and Clustering-H havethe same numbers of mutations in their 7 clusters; however, these two clusteringsare diﬀerent, i.e., equally-sized clusters contain diﬀerent mutations. While there wasslight variability in the placement (by occurrence counts) of the top 10 clusteringswithin the aforesaid 15 batches of 10,000 runs, Clustering-A through Clustering-Jinvariably were the top 10 in each batch. The weights are given in Tables 5 through12 (for Clustering-A through Clustering-D) and 13 and 14 (for Clustering-A).Name Count Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7Clustering-A 12085 8 8 10 15 16 18 21Clustering-B 10962 7 8 8 12 15 17 29Clustering-C 10788 8 8 11 15 16 17 21Clustering-D 10328 8 8 15 16 18 31 –Clustering-E 6499 7 8 8 12 15 18 28Clustering-F 5451 8 8 15 17 17 31 –Clustering-G 5421 8 8 10 15 17 17 21Clustering-H 5302 8 8 11 15 16 17 21Clustering-I 4602 8 8 10 15 21 34 –Clustering-J 3698 8 8 15 31 34 – –42able 5: Weights (in the units of 1%, rounded to 2 digits) for the ﬁrst 48 mutationcategories (this Table 5 is continued in Table 6 with the next 48 mutation categories)for the 7 clusters in Clustering-A (see Table 4) based on unnormalized (columns 2-8) and normalized (columns 9-15) regressions (see Subsection 2.6 for details). Eachcluster is deﬁned as containing the mutations with nonzero weights. For instance,cluster Cl-2 contains 8 mutations GCGA, TCGA, ACGG, GCCG, GCGG, TCGG,GTCA, GTCG. In each cluster the weights are normalized to add up to 100% (upto 2 digits due to the aforesaid rounding). In Tables 5 through 12 “weights basedon unnormalized regressions” are given by (13), (14) and (15), while “weights basedon normalized regressions” are given by (17), (14) and (16), i.e., the exposures arecalculated based on arithmetic averages (see Subsection 2.6 for details).

Mutation Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7 Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7ACAA 0.00 0.00 0.00 6.55 0.00 0.00 0.00 0.00 0.00 0.00 6.55 0.00 0.00 0.00ACCA 0.00 0.00 0.00 0.00 5.83 0.00 0.00 0.00 0.00 0.00 0.00 6.08 0.00 0.00ACGA 0.00 0.00 0.00 0.00 0.00 0.00 4.06 0.00 0.00 0.00 0.00 0.00 0.00 4.00ACTA 0.00 0.00 0.00 0.00 6.16 0.00 0.00 0.00 0.00 0.00 0.00 6.38 0.00 0.00CCAA 0.00 0.00 0.00 0.00 7.91 0.00 0.00 0.00 0.00 0.00 0.00 8.10 0.00 0.00CCCA 0.00 0.00 0.00 0.00 6.46 0.00 0.00 0.00 0.00 0.00 0.00 6.68 0.00 0.00CCGA 0.00 0.00 7.21 0.00 0.00 0.00 0.00 0.00 0.00 7.23 0.00 0.00 0.00 0.00CCTA 0.00 0.00 0.00 0.00 0.00 6.75 0.00 0.00 0.00 0.00 0.00 0.00 6.79 0.00GCAA 4.05 0.00 0.00 0.00 0.00 0.00 0.00 4.65 0.00 0.00 0.00 0.00 0.00 0.00GCCA 0.00 0.00 0.00 0.00 4.56 0.00 0.00 0.00 0.00 0.00 0.00 4.73 0.00 0.00GCGA 0.00 13.81 0.00 0.00 0.00 0.00 0.00 0.00 13.89 0.00 0.00 0.00 0.00 0.00GCTA 0.00 0.00 0.00 0.00 5.02 0.00 0.00 0.00 0.00 0.00 0.00 5.20 0.00 0.00TCAA 0.00 0.00 0.00 6.26 0.00 0.00 0.00 0.00 0.00 0.00 6.21 0.00 0.00 0.00TCCA 0.00 0.00 0.00 0.00 8.94 0.00 0.00 0.00 0.00 0.00 0.00 9.29 0.00 0.00TCGA 0.00 11.87 0.00 0.00 0.00 0.00 0.00 0.00 12.24 0.00 0.00 0.00 0.00 0.00TCTA 0.00 0.00 0.00 8.05 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.00 0.00ACAG 0.00 0.00 0.00 0.00 3.96 0.00 0.00 0.00 0.00 0.00 0.00 4.18 0.00 0.00ACCG 0.00 0.00 8.07 0.00 0.00 0.00 0.00 0.00 0.00 8.17 0.00 0.00 0.00 0.00ACGG 0.00 12.62 0.00 0.00 0.00 0.00 0.00 0.00 12.22 0.00 0.00 0.00 0.00 0.00ACTG 0.00 0.00 0.00 0.00 4.77 0.00 0.00 0.00 0.00 0.00 0.00 5.03 0.00 0.00CCAG 0.00 0.00 9.26 0.00 0.00 0.00 0.00 0.00 0.00 9.35 0.00 0.00 0.00 0.00CCCG 0.00 0.00 0.00 0.00 0.00 0.00 3.91 0.00 0.00 0.00 0.00 0.00 0.00 4.02CCGG 0.00 0.00 0.00 0.00 0.00 0.00 5.37 0.00 0.00 0.00 0.00 0.00 0.00 5.12CCTG 0.00 0.00 12.46 0.00 0.00 0.00 0.00 0.00 0.00 12.58 0.00 0.00 0.00 0.00GCAG 0.00 0.00 0.00 0.00 0.00 0.00 4.61 0.00 0.00 0.00 0.00 0.00 0.00 4.57GCCG 0.00 14.79 0.00 0.00 0.00 0.00 0.00 0.00 15.62 0.00 0.00 0.00 0.00 0.00GCGG 0.00 15.50 0.00 0.00 0.00 0.00 0.00 0.00 13.92 0.00 0.00 0.00 0.00 0.00GCTG 0.00 0.00 0.00 0.00 0.00 0.00 4.86 0.00 0.00 0.00 0.00 0.00 0.00 4.92TCAG 0.00 0.00 0.00 0.00 10.31 0.00 0.00 0.00 0.00 0.00 0.00 9.03 0.00 0.00TCCG 0.00 0.00 0.00 0.00 5.10 0.00 0.00 0.00 0.00 0.00 0.00 4.95 0.00 0.00TCGG 0.00 8.40 0.00 0.00 0.00 0.00 0.00 0.00 8.65 0.00 0.00 0.00 0.00 0.00TCTG 0.00 0.00 0.00 0.00 14.10 0.00 0.00 0.00 0.00 0.00 0.00 12.53 0.00 0.00ACAT 0.00 0.00 0.00 7.67 0.00 0.00 0.00 0.00 0.00 0.00 7.71 0.00 0.00 0.00ACCT 4.78 0.00 0.00 0.00 0.00 0.00 0.00 5.02 0.00 0.00 0.00 0.00 0.00 0.00ACGT 23.47 0.00 0.00 0.00 0.00 0.00 0.00 23.18 0.00 0.00 0.00 0.00 0.00 0.00ACTT 0.00 0.00 0.00 5.43 0.00 0.00 0.00 0.00 0.00 0.00 5.47 0.00 0.00 0.00CCAT 0.00 0.00 0.00 6.02 0.00 0.00 0.00 0.00 0.00 0.00 6.02 0.00 0.00 0.00CCCT 0.00 0.00 0.00 5.59 0.00 0.00 0.00 0.00 0.00 0.00 5.63 0.00 0.00 0.00CCGT 17.66 0.00 0.00 0.00 0.00 0.00 0.00 17.12 0.00 0.00 0.00 0.00 0.00 0.00CCTT 0.00 0.00 0.00 7.01 0.00 0.00 0.00 0.00 0.00 0.00 7.04 0.00 0.00 0.00GCAT 0.00 0.00 0.00 5.98 0.00 0.00 0.00 0.00 0.00 0.00 6.01 0.00 0.00 0.00GCCT 5.74 0.00 0.00 0.00 0.00 0.00 0.00 5.93 0.00 0.00 0.00 0.00 0.00 0.00GCGT 20.46 0.00 0.00 0.00 0.00 0.00 0.00 19.80 0.00 0.00 0.00 0.00 0.00 0.00GCTT 0.00 0.00 0.00 5.88 0.00 0.00 0.00 0.00 0.00 0.00 5.93 0.00 0.00 0.00TCAT 11.42 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00TCCT 0.00 0.00 0.00 7.81 0.00 0.00 0.00 0.00 0.00 0.00 7.76 0.00 0.00 0.00TCGT 12.42 0.00 0.00 0.00 0.00 0.00 0.00 12.30 0.00 0.00 0.00 0.00 0.00 0.00TCTT 0.00 0.00 0.00 9.47 0.00 0.00 0.00 0.00 0.00 0.00 9.29 0.00 0.00 0.00

Mutation Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7 Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7ATAA 0.00 0.00 0.00 0.00 4.18 0.00 0.00 0.00 0.00 0.00 0.00 4.52 0.00 0.00ATCA 0.00 0.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 10.15 0.00 0.00 0.00 0.00ATGA 0.00 0.00 0.00 0.00 4.02 0.00 0.00 0.00 0.00 0.00 0.00 4.30 0.00 0.00ATTA 0.00 0.00 0.00 0.00 0.00 5.54 0.00 0.00 0.00 0.00 0.00 0.00 5.66 0.00CTAA 0.00 0.00 11.74 0.00 0.00 0.00 0.00 0.00 0.00 11.16 0.00 0.00 0.00 0.00CTCA 0.00 0.00 0.00 0.00 3.79 0.00 0.00 0.00 0.00 0.00 0.00 3.98 0.00 0.00CTGA 0.00 0.00 0.00 0.00 4.88 0.00 0.00 0.00 0.00 0.00 0.00 5.02 0.00 0.00CTTA 0.00 0.00 0.00 0.00 0.00 4.28 0.00 0.00 0.00 0.00 0.00 0.00 4.33 0.00GTAA 0.00 0.00 0.00 0.00 0.00 0.00 4.30 0.00 0.00 0.00 0.00 0.00 0.00 4.35GTCA 0.00 15.20 0.00 0.00 0.00 0.00 0.00 0.00 15.36 0.00 0.00 0.00 0.00 0.00GTGA 0.00 0.00 9.28 0.00 0.00 0.00 0.00 0.00 0.00 9.21 0.00 0.00 0.00 0.00GTTA 0.00 0.00 0.00 0.00 0.00 0.00 5.13 0.00 0.00 0.00 0.00 0.00 0.00 5.19TTAA 0.00 0.00 0.00 0.00 0.00 5.13 0.00 0.00 0.00 0.00 0.00 0.00 5.26 0.00TTCA 0.00 0.00 0.00 0.00 0.00 0.00 6.64 0.00 0.00 0.00 0.00 0.00 0.00 6.58TTGA 0.00 0.00 8.84 0.00 0.00 0.00 0.00 0.00 0.00 8.55 0.00 0.00 0.00 0.00TTTA 0.00 0.00 0.00 0.00 0.00 5.27 0.00 0.00 0.00 0.00 0.00 0.00 5.38 0.00ATAC 0.00 0.00 0.00 7.03 0.00 0.00 0.00 0.00 0.00 0.00 7.06 0.00 0.00 0.00ATCC 0.00 0.00 0.00 0.00 0.00 3.30 0.00 0.00 0.00 0.00 0.00 0.00 3.39 0.00ATGC 0.00 0.00 0.00 4.97 0.00 0.00 0.00 0.00 0.00 0.00 4.98 0.00 0.00 0.00ATTC 0.00 0.00 0.00 6.30 0.00 0.00 0.00 0.00 0.00 0.00 6.34 0.00 0.00 0.00CTAC 0.00 0.00 0.00 0.00 0.00 3.78 0.00 0.00 0.00 0.00 0.00 0.00 3.81 0.00CTCC 0.00 0.00 0.00 0.00 0.00 4.30 0.00 0.00 0.00 0.00 0.00 0.00 4.31 0.00CTGC 0.00 0.00 0.00 0.00 0.00 5.37 0.00 0.00 0.00 0.00 0.00 0.00 5.41 0.00CTTC 0.00 0.00 0.00 0.00 0.00 7.14 0.00 0.00 0.00 0.00 0.00 0.00 6.92 0.00GTAC 0.00 0.00 0.00 0.00 0.00 4.84 0.00 0.00 0.00 0.00 0.00 0.00 4.96 0.00GTCC 0.00 0.00 11.51 0.00 0.00 0.00 0.00 0.00 0.00 11.78 0.00 0.00 0.00 0.00GTGC 0.00 0.00 0.00 0.00 0.00 4.32 0.00 0.00 0.00 0.00 0.00 0.00 4.43 0.00GTTC 0.00 0.00 0.00 0.00 0.00 5.05 0.00 0.00 0.00 0.00 0.00 0.00 5.23 0.00TTAC 0.00 0.00 0.00 0.00 0.00 4.97 0.00 0.00 0.00 0.00 0.00 0.00 5.10 0.00TTCC 0.00 0.00 0.00 0.00 0.00 4.69 0.00 0.00 0.00 0.00 0.00 0.00 4.79 0.00TTGC 0.00 0.00 11.62 0.00 0.00 0.00 0.00 0.00 0.00 11.82 0.00 0.00 0.00 0.00TTTC 0.00 0.00 0.00 0.00 0.00 7.29 0.00 0.00 0.00 0.00 0.00 0.00 7.28 0.00ATAG 0.00 0.00 0.00 0.00 0.00 0.00 3.98 0.00 0.00 0.00 0.00 0.00 0.00 4.09ATCG 0.00 0.00 0.00 0.00 0.00 0.00 3.81 0.00 0.00 0.00 0.00 0.00 0.00 3.70ATGG 0.00 0.00 0.00 0.00 0.00 0.00 3.97 0.00 0.00 0.00 0.00 0.00 0.00 3.99ATTG 0.00 0.00 0.00 0.00 0.00 0.00 7.13 0.00 0.00 0.00 0.00 0.00 0.00 7.08CTAG 0.00 0.00 0.00 0.00 0.00 0.00 3.55 0.00 0.00 0.00 0.00 0.00 0.00 3.56CTCG 0.00 0.00 0.00 0.00 0.00 0.00 6.52 0.00 0.00 0.00 0.00 0.00 0.00 6.31CTGG 0.00 0.00 0.00 0.00 0.00 0.00 3.67 0.00 0.00 0.00 0.00 0.00 0.00 3.83CTTG 0.00 0.00 0.00 0.00 0.00 9.67 0.00 0.00 0.00 0.00 0.00 0.00 8.89 0.00GTAG 0.00 0.00 0.00 0.00 0.00 0.00 3.58 0.00 0.00 0.00 0.00 0.00 0.00 3.49GTCG 0.00 7.80 0.00 0.00 0.00 0.00 0.00 0.00 8.11 0.00 0.00 0.00 0.00 0.00GTGG 0.00 0.00 0.00 0.00 0.00 0.00 3.82 0.00 0.00 0.00 0.00 0.00 0.00 3.98GTTG 0.00 0.00 0.00 0.00 0.00 0.00 7.02 0.00 0.00 0.00 0.00 0.00 0.00 6.97TTAG 0.00 0.00 0.00 0.00 0.00 0.00 4.24 0.00 0.00 0.00 0.00 0.00 0.00 4.43TTCG 0.00 0.00 0.00 0.00 0.00 0.00 3.73 0.00 0.00 0.00 0.00 0.00 0.00 3.75TTGG 0.00 0.00 0.00 0.00 0.00 0.00 6.10 0.00 0.00 0.00 0.00 0.00 0.00 6.06TTTG 0.00 0.00 0.00 0.00 0.00 8.31 0.00 0.00 0.00 0.00 0.00 0.00 8.05 0.00

Mutation Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7 Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7ACAA 0.00 0.00 0.00 0.00 6.55 0.00 0.00 0.00 0.00 0.00 0.00 6.55 0.00 0.00ACCA 0.00 0.00 0.00 6.66 0.00 0.00 0.00 0.00 0.00 0.00 7.03 0.00 0.00 0.00ACGA 0.00 0.00 0.00 0.00 0.00 0.00 3.07 0.00 0.00 0.00 0.00 0.00 0.00 2.91ACTA 0.00 0.00 0.00 6.99 0.00 0.00 0.00 0.00 0.00 0.00 7.33 0.00 0.00 0.00CCAA 0.00 0.00 0.00 9.30 0.00 0.00 0.00 0.00 0.00 0.00 9.73 0.00 0.00 0.00CCCA 0.00 0.00 0.00 7.52 0.00 0.00 0.00 0.00 0.00 0.00 7.93 0.00 0.00 0.00CCGA 0.00 0.00 0.00 0.00 0.00 0.00 2.41 0.00 0.00 0.00 0.00 0.00 0.00 2.51CCTA 0.00 0.00 0.00 8.69 0.00 0.00 0.00 0.00 0.00 0.00 9.10 0.00 0.00 0.00GCAA 0.00 4.05 0.00 0.00 0.00 0.00 0.00 0.00 4.65 0.00 0.00 0.00 0.00 0.00GCCA 0.00 0.00 0.00 5.20 0.00 0.00 0.00 0.00 0.00 0.00 5.46 0.00 0.00 0.00GCGA 0.00 0.00 13.81 0.00 0.00 0.00 0.00 0.00 0.00 13.89 0.00 0.00 0.00 0.00GCTA 0.00 0.00 0.00 5.70 0.00 0.00 0.00 0.00 0.00 0.00 5.97 0.00 0.00 0.00TCAA 0.00 0.00 0.00 0.00 6.26 0.00 0.00 0.00 0.00 0.00 0.00 6.21 0.00 0.00TCCA 0.00 0.00 0.00 9.80 0.00 0.00 0.00 0.00 0.00 0.00 10.21 0.00 0.00 0.00TCGA 0.00 0.00 11.87 0.00 0.00 0.00 0.00 0.00 0.00 12.24 0.00 0.00 0.00 0.00TCTA 0.00 0.00 0.00 0.00 8.05 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.00ACAG 13.14 0.00 0.00 0.00 0.00 0.00 0.00 13.38 0.00 0.00 0.00 0.00 0.00 0.00ACCG 0.00 0.00 0.00 0.00 0.00 0.00 2.76 0.00 0.00 0.00 0.00 0.00 0.00 2.91ACGG 0.00 0.00 12.62 0.00 0.00 0.00 0.00 0.00 0.00 12.22 0.00 0.00 0.00 0.00ACTG 17.06 0.00 0.00 0.00 0.00 0.00 0.00 17.11 0.00 0.00 0.00 0.00 0.00 0.00CCAG 0.00 0.00 0.00 0.00 0.00 0.00 3.16 0.00 0.00 0.00 0.00 0.00 0.00 3.32CCCG 0.00 0.00 0.00 0.00 0.00 0.00 3.22 0.00 0.00 0.00 0.00 0.00 0.00 3.24CCGG 0.00 0.00 0.00 0.00 0.00 0.00 3.64 0.00 0.00 0.00 0.00 0.00 0.00 3.27CCTG 0.00 0.00 0.00 0.00 0.00 0.00 4.32 0.00 0.00 0.00 0.00 0.00 0.00 4.51GCAG 0.00 0.00 0.00 0.00 0.00 0.00 3.48 0.00 0.00 0.00 0.00 0.00 0.00 3.34GCCG 0.00 0.00 14.79 0.00 0.00 0.00 0.00 0.00 0.00 15.62 0.00 0.00 0.00 0.00GCGG 0.00 0.00 15.50 0.00 0.00 0.00 0.00 0.00 0.00 13.92 0.00 0.00 0.00 0.00GCTG 0.00 0.00 0.00 0.00 0.00 0.00 3.82 0.00 0.00 0.00 0.00 0.00 0.00 3.78TCAG 0.00 0.00 0.00 12.20 0.00 0.00 0.00 0.00 0.00 0.00 10.90 0.00 0.00 0.00TCCG 0.00 0.00 0.00 5.76 0.00 0.00 0.00 0.00 0.00 0.00 5.60 0.00 0.00 0.00TCGG 0.00 0.00 8.40 0.00 0.00 0.00 0.00 0.00 0.00 8.65 0.00 0.00 0.00 0.00TCTG 0.00 0.00 0.00 16.63 0.00 0.00 0.00 0.00 0.00 0.00 15.02 0.00 0.00 0.00ACAT 0.00 0.00 0.00 0.00 7.67 0.00 0.00 0.00 0.00 0.00 0.00 7.71 0.00 0.00ACCT 0.00 4.78 0.00 0.00 0.00 0.00 0.00 0.00 5.02 0.00 0.00 0.00 0.00 0.00ACGT 0.00 23.47 0.00 0.00 0.00 0.00 0.00 0.00 23.18 0.00 0.00 0.00 0.00 0.00ACTT 0.00 0.00 0.00 0.00 5.43 0.00 0.00 0.00 0.00 0.00 0.00 5.47 0.00 0.00CCAT 0.00 0.00 0.00 0.00 6.02 0.00 0.00 0.00 0.00 0.00 0.00 6.02 0.00 0.00CCCT 0.00 0.00 0.00 0.00 5.59 0.00 0.00 0.00 0.00 0.00 0.00 5.63 0.00 0.00CCGT 0.00 17.66 0.00 0.00 0.00 0.00 0.00 0.00 17.12 0.00 0.00 0.00 0.00 0.00CCTT 0.00 0.00 0.00 0.00 7.01 0.00 0.00 0.00 0.00 0.00 0.00 7.04 0.00 0.00GCAT 0.00 0.00 0.00 0.00 5.98 0.00 0.00 0.00 0.00 0.00 0.00 6.01 0.00 0.00GCCT 0.00 5.74 0.00 0.00 0.00 0.00 0.00 0.00 5.93 0.00 0.00 0.00 0.00 0.00GCGT 0.00 20.46 0.00 0.00 0.00 0.00 0.00 0.00 19.80 0.00 0.00 0.00 0.00 0.00GCTT 0.00 0.00 0.00 0.00 5.88 0.00 0.00 0.00 0.00 0.00 0.00 5.93 0.00 0.00TCAT 0.00 11.42 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 0.00TCCT 0.00 0.00 0.00 0.00 7.81 0.00 0.00 0.00 0.00 0.00 0.00 7.76 0.00 0.00TCGT 0.00 12.42 0.00 0.00 0.00 0.00 0.00 0.00 12.30 0.00 0.00 0.00 0.00 0.00TCTT 0.00 0.00 0.00 0.00 9.47 0.00 0.00 0.00 0.00 0.00 0.00 9.29 0.00 0.00

Mutation Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7 Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7ATAA 14.61 0.00 0.00 0.00 0.00 0.00 0.00 14.91 0.00 0.00 0.00 0.00 0.00 0.00ATCA 0.00 0.00 0.00 0.00 0.00 0.00 3.46 0.00 0.00 0.00 0.00 0.00 0.00 3.65ATGA 13.85 0.00 0.00 0.00 0.00 0.00 0.00 14.13 0.00 0.00 0.00 0.00 0.00 0.00ATTA 0.00 0.00 0.00 0.00 0.00 5.95 0.00 0.00 0.00 0.00 0.00 0.00 6.08 0.00CTAA 0.00 0.00 0.00 0.00 0.00 0.00 3.93 0.00 0.00 0.00 0.00 0.00 0.00 4.01CTCA 12.75 0.00 0.00 0.00 0.00 0.00 0.00 12.93 0.00 0.00 0.00 0.00 0.00 0.00CTGA 0.00 0.00 0.00 5.54 0.00 0.00 0.00 0.00 0.00 0.00 5.73 0.00 0.00 0.00CTTA 0.00 0.00 0.00 0.00 0.00 4.55 0.00 0.00 0.00 0.00 0.00 0.00 4.62 0.00GTAA 0.00 0.00 0.00 0.00 0.00 0.00 3.44 0.00 0.00 0.00 0.00 0.00 0.00 3.36GTCA 0.00 0.00 15.20 0.00 0.00 0.00 0.00 0.00 0.00 15.36 0.00 0.00 0.00 0.00GTGA 0.00 0.00 0.00 0.00 0.00 0.00 3.13 0.00 0.00 0.00 0.00 0.00 0.00 3.26GTTA 0.00 0.00 0.00 0.00 0.00 0.00 4.04 0.00 0.00 0.00 0.00 0.00 0.00 4.00TTAA 0.00 0.00 0.00 0.00 0.00 5.49 0.00 0.00 0.00 0.00 0.00 0.00 5.64 0.00TTCA 0.00 0.00 0.00 0.00 0.00 0.00 4.99 0.00 0.00 0.00 0.00 0.00 0.00 4.77TTGA 0.00 0.00 0.00 0.00 0.00 0.00 2.94 0.00 0.00 0.00 0.00 0.00 0.00 3.05TTTA 0.00 0.00 0.00 0.00 0.00 5.65 0.00 0.00 0.00 0.00 0.00 0.00 5.77 0.00ATAC 0.00 0.00 0.00 0.00 7.03 0.00 0.00 0.00 0.00 0.00 0.00 7.06 0.00 0.00ATCC 0.00 0.00 0.00 0.00 0.00 3.53 0.00 0.00 0.00 0.00 0.00 0.00 3.64 0.00ATGC 0.00 0.00 0.00 0.00 4.97 0.00 0.00 0.00 0.00 0.00 0.00 4.98 0.00 0.00ATTC 0.00 0.00 0.00 0.00 6.30 0.00 0.00 0.00 0.00 0.00 0.00 6.34 0.00 0.00CTAC 0.00 0.00 0.00 0.00 0.00 4.03 0.00 0.00 0.00 0.00 0.00 0.00 4.08 0.00CTCC 0.00 0.00 0.00 0.00 0.00 4.59 0.00 0.00 0.00 0.00 0.00 0.00 4.61 0.00CTGC 0.00 0.00 0.00 0.00 0.00 5.75 0.00 0.00 0.00 0.00 0.00 0.00 5.79 0.00CTTC 0.00 0.00 0.00 0.00 0.00 7.65 0.00 0.00 0.00 0.00 0.00 0.00 7.41 0.00GTAC 0.00 0.00 0.00 0.00 0.00 5.18 0.00 0.00 0.00 0.00 0.00 0.00 5.32 0.00GTCC 0.00 0.00 0.00 0.00 0.00 0.00 4.18 0.00 0.00 0.00 0.00 0.00 0.00 4.36GTGC 0.00 0.00 0.00 0.00 0.00 4.63 0.00 0.00 0.00 0.00 0.00 0.00 4.76 0.00GTTC 0.00 0.00 0.00 0.00 0.00 5.42 0.00 0.00 0.00 0.00 0.00 0.00 5.62 0.00TTAC 0.00 0.00 0.00 0.00 0.00 5.32 0.00 0.00 0.00 0.00 0.00 0.00 5.47 0.00TTCC 0.00 0.00 0.00 0.00 0.00 5.05 0.00 0.00 0.00 0.00 0.00 0.00 5.16 0.00TTGC 0.00 0.00 0.00 0.00 0.00 0.00 3.98 0.00 0.00 0.00 0.00 0.00 0.00 4.20TTTC 0.00 0.00 0.00 0.00 0.00 7.82 0.00 0.00 0.00 0.00 0.00 0.00 7.82 0.00ATAG 0.00 0.00 0.00 0.00 0.00 0.00 3.18 0.00 0.00 0.00 0.00 0.00 0.00 3.24ATCG 0.00 0.00 0.00 0.00 0.00 0.00 2.70 0.00 0.00 0.00 0.00 0.00 0.00 2.52ATGG 0.00 0.00 0.00 0.00 0.00 0.00 3.09 0.00 0.00 0.00 0.00 0.00 0.00 3.04ATTG 14.44 0.00 0.00 0.00 0.00 0.00 0.00 14.09 0.00 0.00 0.00 0.00 0.00 0.00CTAG 0.00 0.00 0.00 0.00 0.00 0.00 2.68 0.00 0.00 0.00 0.00 0.00 0.00 2.66CTCG 0.00 0.00 0.00 0.00 0.00 0.00 4.59 0.00 0.00 0.00 0.00 0.00 0.00 4.30CTGG 0.00 0.00 0.00 0.00 0.00 0.00 3.07 0.00 0.00 0.00 0.00 0.00 0.00 3.19CTTG 0.00 0.00 0.00 0.00 0.00 10.44 0.00 0.00 0.00 0.00 0.00 0.00 9.55 0.00GTAG 0.00 0.00 0.00 0.00 0.00 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 2.36GTCG 0.00 0.00 7.80 0.00 0.00 0.00 0.00 0.00 0.00 8.11 0.00 0.00 0.00 0.00GTGG 0.00 0.00 0.00 0.00 0.00 0.00 3.18 0.00 0.00 0.00 0.00 0.00 0.00 3.25GTTG 14.14 0.00 0.00 0.00 0.00 0.00 0.00 13.44 0.00 0.00 0.00 0.00 0.00 0.00TTAG 0.00 0.00 0.00 0.00 0.00 0.00 3.48 0.00 0.00 0.00 0.00 0.00 0.00 3.62TTCG 0.00 0.00 0.00 0.00 0.00 0.00 2.91 0.00 0.00 0.00 0.00 0.00 0.00 2.87TTGG 0.00 0.00 0.00 0.00 0.00 0.00 4.62 0.00 0.00 0.00 0.00 0.00 0.00 4.49TTTG 0.00 0.00 0.00 0.00 0.00 8.95 0.00 0.00 0.00 0.00 0.00 0.00 8.65 0.00

Mutation Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7 Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7ACAA 0.00 0.00 0.00 6.55 0.00 0.00 0.00 0.00 0.00 0.00 6.55 0.00 0.00 0.00ACCA 0.00 0.00 0.00 0.00 5.83 0.00 0.00 0.00 0.00 0.00 0.00 6.08 0.00 0.00ACGA 0.00 0.00 0.00 0.00 0.00 0.00 4.06 0.00 0.00 0.00 0.00 0.00 0.00 4.00ACTA 0.00 0.00 0.00 0.00 6.16 0.00 0.00 0.00 0.00 0.00 0.00 6.38 0.00 0.00CCAA 0.00 0.00 0.00 0.00 7.91 0.00 0.00 0.00 0.00 0.00 0.00 8.10 0.00 0.00CCCA 0.00 0.00 0.00 0.00 6.46 0.00 0.00 0.00 0.00 0.00 0.00 6.68 0.00 0.00CCGA 0.00 0.00 6.40 0.00 0.00 0.00 0.00 0.00 0.00 6.40 0.00 0.00 0.00 0.00CCTA 0.00 0.00 0.00 0.00 0.00 6.97 0.00 0.00 0.00 0.00 0.00 0.00 7.02 0.00GCAA 4.05 0.00 0.00 0.00 0.00 0.00 0.00 4.65 0.00 0.00 0.00 0.00 0.00 0.00GCCA 0.00 0.00 0.00 0.00 4.56 0.00 0.00 0.00 0.00 0.00 0.00 4.73 0.00 0.00GCGA 0.00 13.81 0.00 0.00 0.00 0.00 0.00 0.00 13.89 0.00 0.00 0.00 0.00 0.00GCTA 0.00 0.00 0.00 0.00 5.02 0.00 0.00 0.00 0.00 0.00 0.00 5.20 0.00 0.00TCAA 0.00 0.00 0.00 6.26 0.00 0.00 0.00 0.00 0.00 0.00 6.21 0.00 0.00 0.00TCCA 0.00 0.00 0.00 0.00 8.94 0.00 0.00 0.00 0.00 0.00 0.00 9.29 0.00 0.00TCGA 0.00 11.87 0.00 0.00 0.00 0.00 0.00 0.00 12.24 0.00 0.00 0.00 0.00 0.00TCTA 0.00 0.00 0.00 8.05 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.00 0.00ACAG 0.00 0.00 0.00 0.00 3.96 0.00 0.00 0.00 0.00 0.00 0.00 4.18 0.00 0.00ACCG 0.00 0.00 7.23 0.00 0.00 0.00 0.00 0.00 0.00 7.29 0.00 0.00 0.00 0.00ACGG 0.00 12.62 0.00 0.00 0.00 0.00 0.00 0.00 12.22 0.00 0.00 0.00 0.00 0.00ACTG 0.00 0.00 0.00 0.00 4.77 0.00 0.00 0.00 0.00 0.00 0.00 5.03 0.00 0.00CCAG 0.00 0.00 8.26 0.00 0.00 0.00 0.00 0.00 0.00 8.31 0.00 0.00 0.00 0.00CCCG 0.00 0.00 0.00 0.00 0.00 0.00 3.91 0.00 0.00 0.00 0.00 0.00 0.00 4.02CCGG 0.00 0.00 0.00 0.00 0.00 0.00 5.37 0.00 0.00 0.00 0.00 0.00 0.00 5.12CCTG 0.00 0.00 11.12 0.00 0.00 0.00 0.00 0.00 0.00 11.19 0.00 0.00 0.00 0.00GCAG 0.00 0.00 0.00 0.00 0.00 0.00 4.61 0.00 0.00 0.00 0.00 0.00 0.00 4.57GCCG 0.00 14.79 0.00 0.00 0.00 0.00 0.00 0.00 15.62 0.00 0.00 0.00 0.00 0.00GCGG 0.00 15.50 0.00 0.00 0.00 0.00 0.00 0.00 13.92 0.00 0.00 0.00 0.00 0.00GCTG 0.00 0.00 0.00 0.00 0.00 0.00 4.86 0.00 0.00 0.00 0.00 0.00 0.00 4.92TCAG 0.00 0.00 0.00 0.00 10.31 0.00 0.00 0.00 0.00 0.00 0.00 9.03 0.00 0.00TCCG 0.00 0.00 0.00 0.00 5.10 0.00 0.00 0.00 0.00 0.00 0.00 4.95 0.00 0.00TCGG 0.00 8.40 0.00 0.00 0.00 0.00 0.00 0.00 8.65 0.00 0.00 0.00 0.00 0.00TCTG 0.00 0.00 0.00 0.00 14.10 0.00 0.00 0.00 0.00 0.00 0.00 12.53 0.00 0.00ACAT 0.00 0.00 0.00 7.67 0.00 0.00 0.00 0.00 0.00 0.00 7.71 0.00 0.00 0.00ACCT 4.78 0.00 0.00 0.00 0.00 0.00 0.00 5.02 0.00 0.00 0.00 0.00 0.00 0.00ACGT 23.47 0.00 0.00 0.00 0.00 0.00 0.00 23.18 0.00 0.00 0.00 0.00 0.00 0.00ACTT 0.00 0.00 0.00 5.43 0.00 0.00 0.00 0.00 0.00 0.00 5.47 0.00 0.00 0.00CCAT 0.00 0.00 0.00 6.02 0.00 0.00 0.00 0.00 0.00 0.00 6.02 0.00 0.00 0.00CCCT 0.00 0.00 0.00 5.59 0.00 0.00 0.00 0.00 0.00 0.00 5.63 0.00 0.00 0.00CCGT 17.66 0.00 0.00 0.00 0.00 0.00 0.00 17.12 0.00 0.00 0.00 0.00 0.00 0.00CCTT 0.00 0.00 0.00 7.01 0.00 0.00 0.00 0.00 0.00 0.00 7.04 0.00 0.00 0.00GCAT 0.00 0.00 0.00 5.98 0.00 0.00 0.00 0.00 0.00 0.00 6.01 0.00 0.00 0.00GCCT 5.74 0.00 0.00 0.00 0.00 0.00 0.00 5.93 0.00 0.00 0.00 0.00 0.00 0.00GCGT 20.46 0.00 0.00 0.00 0.00 0.00 0.00 19.80 0.00 0.00 0.00 0.00 0.00 0.00GCTT 0.00 0.00 0.00 5.88 0.00 0.00 0.00 0.00 0.00 0.00 5.93 0.00 0.00 0.00TCAT 11.42 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00TCCT 0.00 0.00 0.00 7.81 0.00 0.00 0.00 0.00 0.00 0.00 7.76 0.00 0.00 0.00TCGT 12.42 0.00 0.00 0.00 0.00 0.00 0.00 12.30 0.00 0.00 0.00 0.00 0.00 0.00TCTT 0.00 0.00 0.00 9.47 0.00 0.00 0.00 0.00 0.00 0.00 9.29 0.00 0.00 0.00

Mutation Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7 Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7ATAA 0.00 0.00 0.00 0.00 4.18 0.00 0.00 0.00 0.00 0.00 0.00 4.52 0.00 0.00ATCA 0.00 0.00 8.99 0.00 0.00 0.00 0.00 0.00 0.00 9.10 0.00 0.00 0.00 0.00ATGA 0.00 0.00 0.00 0.00 4.02 0.00 0.00 0.00 0.00 0.00 0.00 4.30 0.00 0.00ATTA 0.00 0.00 0.00 0.00 0.00 5.72 0.00 0.00 0.00 0.00 0.00 0.00 5.85 0.00CTAA 0.00 0.00 10.32 0.00 0.00 0.00 0.00 0.00 0.00 9.83 0.00 0.00 0.00 0.00CTCA 0.00 0.00 0.00 0.00 3.79 0.00 0.00 0.00 0.00 0.00 0.00 3.98 0.00 0.00CTGA 0.00 0.00 0.00 0.00 4.88 0.00 0.00 0.00 0.00 0.00 0.00 5.02 0.00 0.00CTTA 0.00 0.00 0.00 0.00 0.00 4.42 0.00 0.00 0.00 0.00 0.00 0.00 4.47 0.00GTAA 0.00 0.00 0.00 0.00 0.00 0.00 4.30 0.00 0.00 0.00 0.00 0.00 0.00 4.35GTCA 0.00 15.20 0.00 0.00 0.00 0.00 0.00 0.00 15.36 0.00 0.00 0.00 0.00 0.00GTGA 0.00 0.00 8.23 0.00 0.00 0.00 0.00 0.00 0.00 8.16 0.00 0.00 0.00 0.00GTTA 0.00 0.00 0.00 0.00 0.00 0.00 5.13 0.00 0.00 0.00 0.00 0.00 0.00 5.19TTAA 0.00 0.00 0.00 0.00 0.00 5.30 0.00 0.00 0.00 0.00 0.00 0.00 5.43 0.00TTCA 0.00 0.00 0.00 0.00 0.00 0.00 6.64 0.00 0.00 0.00 0.00 0.00 0.00 6.58TTGA 0.00 0.00 7.82 0.00 0.00 0.00 0.00 0.00 0.00 7.57 0.00 0.00 0.00 0.00TTTA 0.00 0.00 0.00 0.00 0.00 5.44 0.00 0.00 0.00 0.00 0.00 0.00 5.55 0.00ATAC 0.00 0.00 0.00 7.03 0.00 0.00 0.00 0.00 0.00 0.00 7.06 0.00 0.00 0.00ATCC 0.00 0.00 10.75 0.00 0.00 0.00 0.00 0.00 0.00 10.92 0.00 0.00 0.00 0.00ATGC 0.00 0.00 0.00 4.97 0.00 0.00 0.00 0.00 0.00 0.00 4.98 0.00 0.00 0.00ATTC 0.00 0.00 0.00 6.30 0.00 0.00 0.00 0.00 0.00 0.00 6.34 0.00 0.00 0.00CTAC 0.00 0.00 0.00 0.00 0.00 3.91 0.00 0.00 0.00 0.00 0.00 0.00 3.94 0.00CTCC 0.00 0.00 0.00 0.00 0.00 4.44 0.00 0.00 0.00 0.00 0.00 0.00 4.45 0.00CTGC 0.00 0.00 0.00 0.00 0.00 5.56 0.00 0.00 0.00 0.00 0.00 0.00 5.61 0.00CTTC 0.00 0.00 0.00 0.00 0.00 7.39 0.00 0.00 0.00 0.00 0.00 0.00 7.16 0.00GTAC 0.00 0.00 0.00 0.00 0.00 5.00 0.00 0.00 0.00 0.00 0.00 0.00 5.14 0.00GTCC 0.00 0.00 10.41 0.00 0.00 0.00 0.00 0.00 0.00 10.61 0.00 0.00 0.00 0.00GTGC 0.00 0.00 0.00 0.00 0.00 4.47 0.00 0.00 0.00 0.00 0.00 0.00 4.59 0.00GTTC 0.00 0.00 0.00 0.00 0.00 5.21 0.00 0.00 0.00 0.00 0.00 0.00 5.40 0.00TTAC 0.00 0.00 0.00 0.00 0.00 5.13 0.00 0.00 0.00 0.00 0.00 0.00 5.26 0.00TTCC 0.00 0.00 0.00 0.00 0.00 4.84 0.00 0.00 0.00 0.00 0.00 0.00 4.96 0.00TTGC 0.00 0.00 10.48 0.00 0.00 0.00 0.00 0.00 0.00 10.62 0.00 0.00 0.00 0.00TTTC 0.00 0.00 0.00 0.00 0.00 7.53 0.00 0.00 0.00 0.00 0.00 0.00 7.52 0.00ATAG 0.00 0.00 0.00 0.00 0.00 0.00 3.98 0.00 0.00 0.00 0.00 0.00 0.00 4.09ATCG 0.00 0.00 0.00 0.00 0.00 0.00 3.81 0.00 0.00 0.00 0.00 0.00 0.00 3.70ATGG 0.00 0.00 0.00 0.00 0.00 0.00 3.97 0.00 0.00 0.00 0.00 0.00 0.00 3.99ATTG 0.00 0.00 0.00 0.00 0.00 0.00 7.13 0.00 0.00 0.00 0.00 0.00 0.00 7.08CTAG 0.00 0.00 0.00 0.00 0.00 0.00 3.55 0.00 0.00 0.00 0.00 0.00 0.00 3.56CTCG 0.00 0.00 0.00 0.00 0.00 0.00 6.52 0.00 0.00 0.00 0.00 0.00 0.00 6.31CTGG 0.00 0.00 0.00 0.00 0.00 0.00 3.67 0.00 0.00 0.00 0.00 0.00 0.00 3.83CTTG 0.00 0.00 0.00 0.00 0.00 10.06 0.00 0.00 0.00 0.00 0.00 0.00 9.27 0.00GTAG 0.00 0.00 0.00 0.00 0.00 0.00 3.58 0.00 0.00 0.00 0.00 0.00 0.00 3.49GTCG 0.00 7.80 0.00 0.00 0.00 0.00 0.00 0.00 8.11 0.00 0.00 0.00 0.00 0.00GTGG 0.00 0.00 0.00 0.00 0.00 0.00 3.82 0.00 0.00 0.00 0.00 0.00 0.00 3.98GTTG 0.00 0.00 0.00 0.00 0.00 0.00 7.02 0.00 0.00 0.00 0.00 0.00 0.00 6.97TTAG 0.00 0.00 0.00 0.00 0.00 0.00 4.24 0.00 0.00 0.00 0.00 0.00 0.00 4.43TTCG 0.00 0.00 0.00 0.00 0.00 0.00 3.73 0.00 0.00 0.00 0.00 0.00 0.00 3.75TTGG 0.00 0.00 0.00 0.00 0.00 0.00 6.10 0.00 0.00 0.00 0.00 0.00 0.00 6.06TTTG 0.00 0.00 0.00 0.00 0.00 8.61 0.00 0.00 0.00 0.00 0.00 0.00 8.36 0.00

Mutation Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6ACAA 0.00 0.00 6.55 0.00 0.00 0.00 0.00 0.00 6.55 0.00 0.00 0.00ACCA 0.00 0.00 0.00 5.83 0.00 0.00 0.00 0.00 0.00 6.08 0.00 0.00ACGA 0.00 0.00 0.00 0.00 0.00 2.75 0.00 0.00 0.00 0.00 0.00 2.63ACTA 0.00 0.00 0.00 6.16 0.00 0.00 0.00 0.00 0.00 6.38 0.00 0.00CCAA 0.00 0.00 0.00 7.91 0.00 0.00 0.00 0.00 0.00 8.10 0.00 0.00CCCA 0.00 0.00 0.00 6.46 0.00 0.00 0.00 0.00 0.00 6.68 0.00 0.00CCGA 0.00 0.00 0.00 0.00 0.00 2.13 0.00 0.00 0.00 0.00 0.00 2.25CCTA 0.00 0.00 0.00 0.00 6.75 0.00 0.00 0.00 0.00 0.00 6.79 0.00GCAA 4.05 0.00 0.00 0.00 0.00 0.00 4.65 0.00 0.00 0.00 0.00 0.00GCCA 0.00 0.00 0.00 4.56 0.00 0.00 0.00 0.00 0.00 4.73 0.00 0.00GCGA 0.00 13.81 0.00 0.00 0.00 0.00 0.00 13.89 0.00 0.00 0.00 0.00GCTA 0.00 0.00 0.00 5.02 0.00 0.00 0.00 0.00 0.00 5.20 0.00 0.00TCAA 0.00 0.00 6.26 0.00 0.00 0.00 0.00 0.00 6.21 0.00 0.00 0.00TCCA 0.00 0.00 0.00 8.94 0.00 0.00 0.00 0.00 0.00 9.29 0.00 0.00TCGA 0.00 11.87 0.00 0.00 0.00 0.00 0.00 12.24 0.00 0.00 0.00 0.00TCTA 0.00 0.00 8.05 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.00 0.00ACAG 0.00 0.00 0.00 3.96 0.00 0.00 0.00 0.00 0.00 4.18 0.00 0.00ACCG 0.00 0.00 0.00 0.00 0.00 2.46 0.00 0.00 0.00 0.00 0.00 2.61ACGG 0.00 12.62 0.00 0.00 0.00 0.00 0.00 12.22 0.00 0.00 0.00 0.00ACTG 0.00 0.00 0.00 4.77 0.00 0.00 0.00 0.00 0.00 5.03 0.00 0.00CCAG 0.00 0.00 0.00 0.00 0.00 2.81 0.00 0.00 0.00 0.00 0.00 2.98CCCG 0.00 0.00 0.00 0.00 0.00 2.88 0.00 0.00 0.00 0.00 0.00 2.92CCGG 0.00 0.00 0.00 0.00 0.00 3.29 0.00 0.00 0.00 0.00 0.00 2.97CCTG 0.00 0.00 0.00 0.00 0.00 3.84 0.00 0.00 0.00 0.00 0.00 4.05GCAG 0.00 0.00 0.00 0.00 0.00 3.13 0.00 0.00 0.00 0.00 0.00 3.02GCCG 0.00 14.79 0.00 0.00 0.00 0.00 0.00 15.62 0.00 0.00 0.00 0.00GCGG 0.00 15.50 0.00 0.00 0.00 0.00 0.00 13.92 0.00 0.00 0.00 0.00GCTG 0.00 0.00 0.00 0.00 0.00 3.44 0.00 0.00 0.00 0.00 0.00 3.42TCAG 0.00 0.00 0.00 10.31 0.00 0.00 0.00 0.00 0.00 9.03 0.00 0.00TCCG 0.00 0.00 0.00 5.10 0.00 0.00 0.00 0.00 0.00 4.95 0.00 0.00TCGG 0.00 8.40 0.00 0.00 0.00 0.00 0.00 8.65 0.00 0.00 0.00 0.00TCTG 0.00 0.00 0.00 14.10 0.00 0.00 0.00 0.00 0.00 12.53 0.00 0.00ACAT 0.00 0.00 7.67 0.00 0.00 0.00 0.00 0.00 7.71 0.00 0.00 0.00ACCT 4.78 0.00 0.00 0.00 0.00 0.00 5.02 0.00 0.00 0.00 0.00 0.00ACGT 23.47 0.00 0.00 0.00 0.00 0.00 23.18 0.00 0.00 0.00 0.00 0.00ACTT 0.00 0.00 5.43 0.00 0.00 0.00 0.00 0.00 5.47 0.00 0.00 0.00CCAT 0.00 0.00 6.02 0.00 0.00 0.00 0.00 0.00 6.02 0.00 0.00 0.00CCCT 0.00 0.00 5.59 0.00 0.00 0.00 0.00 0.00 5.63 0.00 0.00 0.00CCGT 17.66 0.00 0.00 0.00 0.00 0.00 17.12 0.00 0.00 0.00 0.00 0.00CCTT 0.00 0.00 7.01 0.00 0.00 0.00 0.00 0.00 7.04 0.00 0.00 0.00GCAT 0.00 0.00 5.98 0.00 0.00 0.00 0.00 0.00 6.01 0.00 0.00 0.00GCCT 5.74 0.00 0.00 0.00 0.00 0.00 5.93 0.00 0.00 0.00 0.00 0.00GCGT 20.46 0.00 0.00 0.00 0.00 0.00 19.80 0.00 0.00 0.00 0.00 0.00GCTT 0.00 0.00 5.88 0.00 0.00 0.00 0.00 0.00 5.93 0.00 0.00 0.00TCAT 11.42 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 0.00TCCT 0.00 0.00 7.81 0.00 0.00 0.00 0.00 0.00 7.76 0.00 0.00 0.00TCGT 12.42 0.00 0.00 0.00 0.00 0.00 12.30 0.00 0.00 0.00 0.00 0.00TCTT 0.00 0.00 9.47 0.00 0.00 0.00 0.00 0.00 9.29 0.00 0.00 0.00

Mutation Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6ATAA 0.00 0.00 0.00 4.18 0.00 0.00 0.00 0.00 0.00 4.52 0.00 0.00ATCA 0.00 0.00 0.00 0.00 0.00 3.11 0.00 0.00 0.00 0.00 0.00 3.29ATGA 0.00 0.00 0.00 4.02 0.00 0.00 0.00 0.00 0.00 4.30 0.00 0.00ATTA 0.00 0.00 0.00 0.00 5.54 0.00 0.00 0.00 0.00 0.00 5.66 0.00CTAA 0.00 0.00 0.00 0.00 0.00 3.49 0.00 0.00 0.00 0.00 0.00 3.60CTCA 0.00 0.00 0.00 3.79 0.00 0.00 0.00 0.00 0.00 3.98 0.00 0.00CTGA 0.00 0.00 0.00 4.88 0.00 0.00 0.00 0.00 0.00 5.02 0.00 0.00CTTA 0.00 0.00 0.00 0.00 4.28 0.00 0.00 0.00 0.00 0.00 4.33 0.00GTAA 0.00 0.00 0.00 0.00 0.00 3.09 0.00 0.00 0.00 0.00 0.00 3.04GTCA 0.00 15.20 0.00 0.00 0.00 0.00 0.00 15.36 0.00 0.00 0.00 0.00GTGA 0.00 0.00 0.00 0.00 0.00 2.79 0.00 0.00 0.00 0.00 0.00 2.93GTTA 0.00 0.00 0.00 0.00 0.00 3.65 0.00 0.00 0.00 0.00 0.00 3.63TTAA 0.00 0.00 0.00 0.00 5.13 0.00 0.00 0.00 0.00 0.00 5.26 0.00TTCA 0.00 0.00 0.00 0.00 0.00 4.50 0.00 0.00 0.00 0.00 0.00 4.33TTGA 0.00 0.00 0.00 0.00 0.00 2.63 0.00 0.00 0.00 0.00 0.00 2.74TTTA 0.00 0.00 0.00 0.00 5.27 0.00 0.00 0.00 0.00 0.00 5.38 0.00ATAC 0.00 0.00 7.03 0.00 0.00 0.00 0.00 0.00 7.06 0.00 0.00 0.00ATCC 0.00 0.00 0.00 0.00 3.30 0.00 0.00 0.00 0.00 0.00 3.39 0.00ATGC 0.00 0.00 4.97 0.00 0.00 0.00 0.00 0.00 4.98 0.00 0.00 0.00ATTC 0.00 0.00 6.30 0.00 0.00 0.00 0.00 0.00 6.34 0.00 0.00 0.00CTAC 0.00 0.00 0.00 0.00 3.78 0.00 0.00 0.00 0.00 0.00 3.81 0.00CTCC 0.00 0.00 0.00 0.00 4.30 0.00 0.00 0.00 0.00 0.00 4.31 0.00CTGC 0.00 0.00 0.00 0.00 5.37 0.00 0.00 0.00 0.00 0.00 5.41 0.00CTTC 0.00 0.00 0.00 0.00 7.14 0.00 0.00 0.00 0.00 0.00 6.92 0.00GTAC 0.00 0.00 0.00 0.00 4.84 0.00 0.00 0.00 0.00 0.00 4.96 0.00GTCC 0.00 0.00 0.00 0.00 0.00 3.81 0.00 0.00 0.00 0.00 0.00 3.95GTGC 0.00 0.00 0.00 0.00 4.32 0.00 0.00 0.00 0.00 0.00 4.43 0.00GTTC 0.00 0.00 0.00 0.00 5.05 0.00 0.00 0.00 0.00 0.00 5.23 0.00TTAC 0.00 0.00 0.00 0.00 4.97 0.00 0.00 0.00 0.00 0.00 5.10 0.00TTCC 0.00 0.00 0.00 0.00 4.69 0.00 0.00 0.00 0.00 0.00 4.79 0.00TTGC 0.00 0.00 0.00 0.00 0.00 3.61 0.00 0.00 0.00 0.00 0.00 3.80TTTC 0.00 0.00 0.00 0.00 7.29 0.00 0.00 0.00 0.00 0.00 7.28 0.00ATAG 0.00 0.00 0.00 0.00 0.00 2.91 0.00 0.00 0.00 0.00 0.00 2.95ATCG 0.00 0.00 0.00 0.00 0.00 2.46 0.00 0.00 0.00 0.00 0.00 2.30ATGG 0.00 0.00 0.00 0.00 0.00 2.80 0.00 0.00 0.00 0.00 0.00 2.76ATTG 0.00 0.00 0.00 0.00 0.00 4.93 0.00 0.00 0.00 0.00 0.00 4.79CTAG 0.00 0.00 0.00 0.00 0.00 2.47 0.00 0.00 0.00 0.00 0.00 2.43CTCG 0.00 0.00 0.00 0.00 0.00 4.21 0.00 0.00 0.00 0.00 0.00 3.93CTGG 0.00 0.00 0.00 0.00 0.00 2.80 0.00 0.00 0.00 0.00 0.00 2.90CTTG 0.00 0.00 0.00 0.00 9.67 0.00 0.00 0.00 0.00 0.00 8.89 0.00GTAG 0.00 0.00 0.00 0.00 0.00 2.30 0.00 0.00 0.00 0.00 0.00 2.16GTCG 0.00 7.80 0.00 0.00 0.00 0.00 0.00 8.11 0.00 0.00 0.00 0.00GTGG 0.00 0.00 0.00 0.00 0.00 2.85 0.00 0.00 0.00 0.00 0.00 2.93GTTG 0.00 0.00 0.00 0.00 0.00 4.83 0.00 0.00 0.00 0.00 0.00 4.73TTAG 0.00 0.00 0.00 0.00 0.00 3.20 0.00 0.00 0.00 0.00 0.00 3.30TTCG 0.00 0.00 0.00 0.00 0.00 2.64 0.00 0.00 0.00 0.00 0.00 2.60TTGG 0.00 0.00 0.00 0.00 0.00 4.20 0.00 0.00 0.00 0.00 0.00 4.08TTTG 0.00 0.00 0.00 0.00 8.31 0.00 0.00 0.00 0.00 0.00 8.05 0.00

Mutation Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7 Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7ACAA 0.00 0.00 0.00 6.54 0.00 0.00 0.00 0.00 0.00 0.00 6.54 0.00 0.00 0.00ACCA 0.00 0.00 0.00 0.00 6.16 0.00 0.00 0.00 0.00 0.00 0.00 6.20 0.00 0.00ACGA 0.00 0.00 0.00 0.00 0.00 0.00 4.12 0.00 0.00 0.00 0.00 0.00 0.00 4.05ACTA 0.00 0.00 0.00 0.00 6.38 0.00 0.00 0.00 0.00 0.00 0.00 6.44 0.00 0.00CCAA 0.00 0.00 0.00 0.00 8.27 0.00 0.00 0.00 0.00 0.00 0.00 8.27 0.00 0.00CCCA 0.00 0.00 0.00 0.00 6.73 0.00 0.00 0.00 0.00 0.00 0.00 6.77 0.00 0.00CCGA 0.00 0.00 7.32 0.00 0.00 0.00 0.00 0.00 0.00 7.24 0.00 0.00 0.00 0.00CCTA 0.00 0.00 0.00 0.00 0.00 6.77 0.00 0.00 0.00 0.00 0.00 0.00 6.76 0.00GCAA 4.31 0.00 0.00 0.00 0.00 0.00 0.00 4.68 0.00 0.00 0.00 0.00 0.00 0.00GCCA 0.00 0.00 0.00 0.00 4.70 0.00 0.00 0.00 0.00 0.00 0.00 4.75 0.00 0.00GCGA 0.00 13.79 0.00 0.00 0.00 0.00 0.00 0.00 13.76 0.00 0.00 0.00 0.00 0.00GCTA 0.00 0.00 0.00 0.00 5.16 0.00 0.00 0.00 0.00 0.00 0.00 5.22 0.00 0.00TCAA 0.00 0.00 0.00 6.22 0.00 0.00 0.00 0.00 0.00 0.00 6.20 0.00 0.00 0.00TCCA 0.00 0.00 0.00 0.00 8.86 0.00 0.00 0.00 0.00 0.00 0.00 9.08 0.00 0.00TCGA 0.00 11.96 0.00 0.00 0.00 0.00 0.00 0.00 12.13 0.00 0.00 0.00 0.00 0.00TCTA 0.00 0.00 0.00 8.04 0.00 0.00 0.00 0.00 0.00 0.00 8.01 0.00 0.00 0.00ACAG 0.00 0.00 0.00 0.00 4.08 0.00 0.00 0.00 0.00 0.00 0.00 4.16 0.00 0.00ACCG 0.00 0.00 8.12 0.00 0.00 0.00 0.00 0.00 0.00 8.17 0.00 0.00 0.00 0.00ACGG 0.00 12.58 0.00 0.00 0.00 0.00 0.00 0.00 12.32 0.00 0.00 0.00 0.00 0.00ACTG 0.00 0.00 0.00 0.00 4.73 0.00 0.00 0.00 0.00 0.00 0.00 4.88 0.00 0.00CCAG 0.00 0.00 9.34 0.00 0.00 0.00 0.00 0.00 0.00 9.36 0.00 0.00 0.00 0.00CCCG 0.00 0.00 0.00 0.00 0.00 0.00 3.97 0.00 0.00 0.00 0.00 0.00 0.00 4.04CCGG 0.00 0.00 0.00 0.00 0.00 0.00 5.47 0.00 0.00 0.00 0.00 0.00 0.00 5.24CCTG 0.00 0.00 12.56 0.00 0.00 0.00 0.00 0.00 0.00 12.61 0.00 0.00 0.00 0.00GCAG 0.00 0.00 0.00 0.00 0.00 0.00 4.68 0.00 0.00 0.00 0.00 0.00 0.00 4.63GCCG 0.00 14.96 0.00 0.00 0.00 0.00 0.00 0.00 15.53 0.00 0.00 0.00 0.00 0.00GCGG 0.00 15.17 0.00 0.00 0.00 0.00 0.00 0.00 14.18 0.00 0.00 0.00 0.00 0.00GCTG 0.00 0.00 0.00 0.00 0.00 0.00 4.92 0.00 0.00 0.00 0.00 0.00 0.00 4.94TCAG 0.00 0.00 0.00 0.00 9.40 0.00 0.00 0.00 0.00 0.00 0.00 8.99 0.00 0.00TCCG 0.00 0.00 0.00 0.00 4.93 0.00 0.00 0.00 0.00 0.00 0.00 4.90 0.00 0.00TCGG 0.00 8.53 0.00 0.00 0.00 0.00 0.00 0.00 8.60 0.00 0.00 0.00 0.00 0.00TCTG 0.00 0.00 0.00 0.00 13.10 0.00 0.00 0.00 0.00 0.00 0.00 12.56 0.00 0.00ACAT 0.00 0.00 0.00 7.72 0.00 0.00 0.00 0.00 0.00 0.00 7.73 0.00 0.00 0.00ACCT 4.86 0.00 0.00 0.00 0.00 0.00 0.00 5.01 0.00 0.00 0.00 0.00 0.00 0.00ACGT 23.50 0.00 0.00 0.00 0.00 0.00 0.00 23.33 0.00 0.00 0.00 0.00 0.00 0.00ACTT 0.00 0.00 0.00 5.45 0.00 0.00 0.00 0.00 0.00 0.00 5.47 0.00 0.00 0.00CCAT 0.00 0.00 0.00 6.02 0.00 0.00 0.00 0.00 0.00 0.00 6.02 0.00 0.00 0.00CCCT 0.00 0.00 0.00 5.60 0.00 0.00 0.00 0.00 0.00 0.00 5.62 0.00 0.00 0.00CCGT 17.45 0.00 0.00 0.00 0.00 0.00 0.00 17.08 0.00 0.00 0.00 0.00 0.00 0.00CCTT 0.00 0.00 0.00 7.03 0.00 0.00 0.00 0.00 0.00 0.00 7.05 0.00 0.00 0.00GCAT 0.00 0.00 0.00 5.98 0.00 0.00 0.00 0.00 0.00 0.00 6.00 0.00 0.00 0.00GCCT 5.85 0.00 0.00 0.00 0.00 0.00 0.00 5.97 0.00 0.00 0.00 0.00 0.00 0.00GCGT 20.08 0.00 0.00 0.00 0.00 0.00 0.00 19.63 0.00 0.00 0.00 0.00 0.00 0.00GCTT 0.00 0.00 0.00 5.90 0.00 0.00 0.00 0.00 0.00 0.00 5.92 0.00 0.00 0.00TCAT 11.55 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00TCCT 0.00 0.00 0.00 7.77 0.00 0.00 0.00 0.00 0.00 0.00 7.75 0.00 0.00 0.00TCGT 12.39 0.00 0.00 0.00 0.00 0.00 0.00 12.30 0.00 0.00 0.00 0.00 0.00 0.00TCTT 0.00 0.00 0.00 9.35 0.00 0.00 0.00 0.00 0.00 0.00 9.27 0.00 0.00 0.00

Mutation Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7 Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7ATAA 0.00 0.00 0.00 0.00 4.41 0.00 0.00 0.00 0.00 0.00 0.00 4.51 0.00 0.00ATCA 0.00 0.00 10.06 0.00 0.00 0.00 0.00 0.00 0.00 10.15 0.00 0.00 0.00 0.00ATGA 0.00 0.00 0.00 0.00 4.15 0.00 0.00 0.00 0.00 0.00 0.00 4.25 0.00 0.00ATTA 0.00 0.00 0.00 0.00 0.00 5.59 0.00 0.00 0.00 0.00 0.00 0.00 5.64 0.00CTAA 0.00 0.00 11.34 0.00 0.00 0.00 0.00 0.00 0.00 11.10 0.00 0.00 0.00 0.00CTCA 0.00 0.00 0.00 0.00 3.87 0.00 0.00 0.00 0.00 0.00 0.00 3.94 0.00 0.00CTGA 0.00 0.00 0.00 0.00 5.08 0.00 0.00 0.00 0.00 0.00 0.00 5.07 0.00 0.00CTTA 0.00 0.00 0.00 0.00 0.00 4.33 0.00 0.00 0.00 0.00 0.00 0.00 4.31 0.00GTAA 0.00 0.00 0.00 0.00 0.00 0.00 4.33 0.00 0.00 0.00 0.00 0.00 0.00 4.36GTCA 0.00 15.17 0.00 0.00 0.00 0.00 0.00 0.00 15.40 0.00 0.00 0.00 0.00 0.00GTGA 0.00 0.00 9.30 0.00 0.00 0.00 0.00 0.00 0.00 9.24 0.00 0.00 0.00 0.00GTTA 0.00 0.00 0.00 0.00 0.00 0.00 5.18 0.00 0.00 0.00 0.00 0.00 0.00 5.22TTAA 0.00 0.00 0.00 0.00 0.00 5.21 0.00 0.00 0.00 0.00 0.00 0.00 5.21 0.00TTCA 0.00 0.00 0.00 0.00 0.00 0.00 6.73 0.00 0.00 0.00 0.00 0.00 0.00 6.66TTGA 0.00 0.00 8.62 0.00 0.00 0.00 0.00 0.00 0.00 8.51 0.00 0.00 0.00 0.00TTTA 0.00 0.00 0.00 0.00 0.00 5.36 0.00 0.00 0.00 0.00 0.00 0.00 5.35 0.00ATAC 0.00 0.00 0.00 7.07 0.00 0.00 0.00 0.00 0.00 0.00 7.08 0.00 0.00 0.00ATCC 0.00 0.00 0.00 0.00 0.00 3.38 0.00 0.00 0.00 0.00 0.00 0.00 3.40 0.00ATGC 0.00 0.00 0.00 4.99 0.00 0.00 0.00 0.00 0.00 0.00 4.99 0.00 0.00 0.00ATTC 0.00 0.00 0.00 6.34 0.00 0.00 0.00 0.00 0.00 0.00 6.36 0.00 0.00 0.00CTAC 0.00 0.00 0.00 0.00 0.00 3.82 0.00 0.00 0.00 0.00 0.00 0.00 3.81 0.00CTCC 0.00 0.00 0.00 0.00 0.00 4.31 0.00 0.00 0.00 0.00 0.00 0.00 4.32 0.00CTGC 0.00 0.00 0.00 0.00 0.00 5.27 0.00 0.00 0.00 0.00 0.00 0.00 5.35 0.00CTTC 0.00 0.00 0.00 0.00 0.00 7.09 0.00 0.00 0.00 0.00 0.00 0.00 7.01 0.00GTAC 0.00 0.00 0.00 0.00 0.00 4.82 0.00 0.00 0.00 0.00 0.00 0.00 4.90 0.00GTCC 0.00 0.00 11.65 0.00 0.00 0.00 0.00 0.00 0.00 11.80 0.00 0.00 0.00 0.00GTGC 0.00 0.00 0.00 0.00 0.00 4.26 0.00 0.00 0.00 0.00 0.00 0.00 4.36 0.00GTTC 0.00 0.00 0.00 0.00 0.00 5.08 0.00 0.00 0.00 0.00 0.00 0.00 5.18 0.00TTAC 0.00 0.00 0.00 0.00 0.00 5.06 0.00 0.00 0.00 0.00 0.00 0.00 5.09 0.00TTCC 0.00 0.00 0.00 0.00 0.00 4.69 0.00 0.00 0.00 0.00 0.00 0.00 4.76 0.00TTGC 0.00 0.00 11.69 0.00 0.00 0.00 0.00 0.00 0.00 11.81 0.00 0.00 0.00 0.00TTTC 0.00 0.00 0.00 0.00 0.00 7.37 0.00 0.00 0.00 0.00 0.00 0.00 7.31 0.00ATAG 0.00 0.00 0.00 0.00 0.00 0.00 3.94 0.00 0.00 0.00 0.00 0.00 0.00 4.03ATCG 0.00 0.00 0.00 0.00 0.00 0.00 3.83 0.00 0.00 0.00 0.00 0.00 0.00 3.74ATGG 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00 0.00 0.00 0.00 4.01ATTG 0.00 0.00 0.00 0.00 0.00 0.00 6.98 0.00 0.00 0.00 0.00 0.00 0.00 7.00CTAG 0.00 0.00 0.00 0.00 0.00 0.00 3.50 0.00 0.00 0.00 0.00 0.00 0.00 3.52CTCG 0.00 0.00 0.00 0.00 0.00 0.00 6.53 0.00 0.00 0.00 0.00 0.00 0.00 6.37CTGG 0.00 0.00 0.00 0.00 0.00 0.00 3.63 0.00 0.00 0.00 0.00 0.00 0.00 3.76CTTG 0.00 0.00 0.00 0.00 0.00 9.36 0.00 0.00 0.00 0.00 0.00 0.00 9.13 0.00GTAG 0.00 0.00 0.00 0.00 0.00 0.00 3.59 0.00 0.00 0.00 0.00 0.00 0.00 3.51GTCG 0.00 7.84 0.00 0.00 0.00 0.00 0.00 0.00 8.08 0.00 0.00 0.00 0.00 0.00GTGG 0.00 0.00 0.00 0.00 0.00 0.00 3.87 0.00 0.00 0.00 0.00 0.00 0.00 3.97GTTG 0.00 0.00 0.00 0.00 0.00 0.00 6.71 0.00 0.00 0.00 0.00 0.00 0.00 6.77TTAG 0.00 0.00 0.00 0.00 0.00 0.00 4.17 0.00 0.00 0.00 0.00 0.00 0.00 4.32TTCG 0.00 0.00 0.00 0.00 0.00 0.00 3.74 0.00 0.00 0.00 0.00 0.00 0.00 3.76TTGG 0.00 0.00 0.00 0.00 0.00 0.00 6.11 0.00 0.00 0.00 0.00 0.00 0.00 6.09TTTG 0.00 0.00 0.00 0.00 0.00 8.22 0.00 0.00 0.00 0.00 0.00 0.00 8.12 0.00 sA (columns 2-8), theoverall correlations Ξ s (column 11) based on the overall cross-sectional regressions,and multiple R and adjusted R of these regressions (columns 9 and 10). SeeSubsection 3.3 for details. Cancer types are labeled by X1 through X14 as in Table2. All quantities are in the units of 1% rounded to 2 digits. The values above 80%are given in bold font. Cancer Type Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7 r.sq adj.r.sq Overall CorX1 57.66 31.8 75.04 X2 X3 -12.6 39.19 12.59 68.65 17.06 68.74 X4 9.88 16.97 52.94 79.11 X6 X7 X8 -31.6 39.99 65.56 -46.21 -6.95 -3.36 61.8 69.52 67.12 41.88X9 -28.63 53.86 -34.26 46.93 59.88 13.59 -12.39 77.76 76.02 70.18X10

X11

X12 X13

X14 38.93 65.92 17.23 58.54 4.73 35.72 31.27 αA between the weightsfor 7 cancer signatures Sig1 through Sig7 of [Kakushadze and Yu, 2016b] and theweights (using normalized regressions with exposures based on arithmetic averages)for 7 clusters in Clustering A (see Subsection 3.3 for details). All quantities are inthe units of 1% rounded to 2 digits. The values above 80% are given in bold font.Signature Cl-1 Cl-2 Cl-3 Cl-4 Cl-5 Cl-6 Cl-7Sig1 -27.92 -3.34Sig3 -51.53 54.4 -37.16 28.19 32.98 12.37 -17.7Sig4 31.56 11.97 54.43 56.83 -1.17 . . . . . . . V a r i ab ili t y A cc r o ss C an c e r T y pe s Density

Figure 1: Horizontal axis: serial standard deviation σ (cid:48) i for N = 96 mutation cat-egories ( i = 1 , . . . , N ) of cross-sectionally demeaned log-counts X (cid:48) is across n = 14cancer types (for samples aggregated by cancer types, so s = 1 , . . . , d , d = n ).Vertical axis: density using R function density() . See Subsection 2.4.1 for details.55 M u t a t i on s Weights

ACAA ACCA ACGA ACTA CCAA CCCA CCGA CCTA GCAA GCCA GCGA GCTA TCAA TCCA TCGA TCTA ACAG ACCG ACGG ACTG CCAG CCCG CCGG CCTG GCAG GCCG GCGG GCTG TCAG TCCG TCGG TCTG ACAT ACCT ACGT ACTT CCAT CCCT CCGT CCTT GCAT GCCT GCGT GCTT TCAT TCCT TCGT TCTT ATAA ATCA ATGA ATTA CTAA CTCA CTGA CTTA GTAA GTCA GTGA GTTA TTAA TTCA TTGA TTTA ATAC ATCC ATGC ATTC CTAC CTCC CTGC CTTC GTAC GTCC GTGC GTTC TTAC TTCC TTGC TTTC ATAG ATCG ATGG ATTG CTAG CTCG CTGG CTTG GTAG GTCG GTGG GTTG TTAG TTCG TTGG TTTG

Figure 2: Cluster Cl-1 in Clustering-A with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 5, 6. Here and in allFigures below, for comparison and visualization convenience, we show all 96 channelson the horizontal axis even though the weights are nonzero only for the mutationcategories belonging to a given cluster. Thus, in this cluster, only 8 weights arenonzero, to wit, for GCAA, ACCT, ACGT, CCGT, GCCT, GCGT, TCAT, TCGT.56 M u t a t i on s Weights

Figure 3: Cluster Cl-1 in Clustering-A with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.57 M u t a t i on s Weights

Figure 4: Cluster Cl-2 in Clustering-A with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.58 M u t a t i on s Weights

Figure 5: Cluster Cl-2 in Clustering-A with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.59 M u t a t i on s Weights

Figure 6: Cluster Cl-3 in Clustering-A with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.60 M u t a t i on s Weights

Figure 7: Cluster Cl-3 in Clustering-A with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.61 M u t a t i on s Weights

Figure 8: Cluster Cl-4 in Clustering-A with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.62 M u t a t i on s Weights

Figure 9: Cluster Cl-4 in Clustering-A with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.63 M u t a t i on s Weights

Figure 10: Cluster Cl-5 in Clustering-A with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.64 M u t a t i on s Weights

Figure 11: Cluster Cl-5 in Clustering-A with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.65 M u t a t i on s Weights

Figure 12: Cluster Cl-6 in Clustering-A with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.66 M u t a t i on s Weights

Figure 13: Cluster Cl-6 in Clustering-A with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.67 M u t a t i on s Weights

Figure 14: Cluster Cl-7 in Clustering-A with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.68 M u t a t i on s Weights

Figure 15: Cluster Cl-7 in Clustering-A with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 5, 6.69 M u t a t i on s Weights

Figure 16: Cluster Cl-1 in Clustering-B with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.70 M u t a t i on s Weights

Figure 17: Cluster Cl-1 in Clustering-B with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.71 M u t a t i on s Weights

Figure 18: Cluster Cl-2 in Clustering-B with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.72 M u t a t i on s Weights

Figure 19: Cluster Cl-2 in Clustering-B with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.73 M u t a t i on s Weights

Figure 20: Cluster Cl-3 in Clustering-B with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.74 M u t a t i on s Weights

Figure 21: Cluster Cl-3 in Clustering-B with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.75 M u t a t i on s Weights

Figure 22: Cluster Cl-4 in Clustering-B with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.76 M u t a t i on s Weights

Figure 23: Cluster Cl-4 in Clustering-B with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.77 M u t a t i on s Weights

Figure 24: Cluster Cl-5 in Clustering-B with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.78 M u t a t i on s Weights

Figure 25: Cluster Cl-5 in Clustering-B with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.79 M u t a t i on s Weights

Figure 26: Cluster Cl-6 in Clustering-B with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.80 M u t a t i on s Weights

Figure 27: Cluster Cl-6 in Clustering-B with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.81 M u t a t i on s Weights

Figure 28: Cluster Cl-7 in Clustering-B with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.82 M u t a t i on s Weights

Figure 29: Cluster Cl-7 in Clustering-B with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 7, 8.83 M u t a t i on s Weights

Figure 30: Cluster Cl-1 in Clustering-C with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.84 M u t a t i on s Weights

Figure 31: Cluster Cl-1 in Clustering-C with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.85 M u t a t i on s Weights

Figure 32: Cluster Cl-2 in Clustering-C with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.86 M u t a t i on s Weights

Figure 33: Cluster Cl-2 in Clustering-C with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.87 M u t a t i on s Weights

Figure 34: Cluster Cl-3 in Clustering-C with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.88 M u t a t i on s Weights

Figure 35: Cluster Cl-3 in Clustering-C with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.89 M u t a t i on s Weights

Figure 36: Cluster Cl-4 in Clustering-C with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.90 M u t a t i on s Weights

Figure 37: Cluster Cl-4 in Clustering-C with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.91 M u t a t i on s Weights

Figure 38: Cluster Cl-5 in Clustering-C with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.92 M u t a t i on s Weights

Figure 39: Cluster Cl-5 in Clustering-C with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.93 M u t a t i on s Weights

Figure 40: Cluster Cl-6 in Clustering-C with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.94 M u t a t i on s Weights

Figure 41: Cluster Cl-6 in Clustering-C with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.95 M u t a t i on s Weights

Figure 42: Cluster Cl-7 in Clustering-C with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.96 M u t a t i on s Weights

Figure 43: Cluster Cl-7 in Clustering-C with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 9, 10.97 M u t a t i on s Weights

Figure 44: Cluster Cl-1 in Clustering-D with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.98 M u t a t i on s Weights

Figure 45: Cluster Cl-1 in Clustering-D with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.99 M u t a t i on s Weights

Figure 46: Cluster Cl-2 in Clustering-D with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.100 M u t a t i on s Weights

Figure 47: Cluster Cl-2 in Clustering-D with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.101 M u t a t i on s Weights

Figure 48: Cluster Cl-3 in Clustering-D with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.102 M u t a t i on s Weights

Figure 49: Cluster Cl-3 in Clustering-D with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.103 M u t a t i on s Weights

Figure 50: Cluster Cl-4 in Clustering-D with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.104 M u t a t i on s Weights

Figure 51: Cluster Cl-4 in Clustering-D with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.105 M u t a t i on s Weights

Figure 52: Cluster Cl-5 in Clustering-D with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.106 M u t a t i on s Weights

Figure 53: Cluster Cl-5 in Clustering-D with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.107 M u t a t i on s Weights

Figure 54: Cluster Cl-6 in Clustering-D with weights based on unnormalized regres-sions with arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.108 M u t a t i on s Weights

Figure 55: Cluster Cl-6 in Clustering-D with weights based on normalized regressionswith arithmetic means (see Subsection 2.6). See Tables 4, 11, 12.109 M u t a t i on s Weights

Figure 56: Cluster Cl-1 in Clustering-A with weights based on unnormalized regres-sions with geometric means (see Subsection 2.6). See Tables 4, 13, 14.110 M u t a t i on s Weights

Figure 57: Cluster Cl-1 in Clustering-A with weights based on normalized regressionswith geometric means (see Subsection 2.6). See Tables 4, 13, 14.111 M u t a t i on s Weights

Figure 58: Cluster Cl-2 in Clustering-A with weights based on unnormalized regres-sions with geometric means (see Subsection 2.6). See Tables 4, 13, 14.112 M u t a t i on s Weights

Figure 59: Cluster Cl-2 in Clustering-A with weights based on normalized regressionswith geometric means (see Subsection 2.6). See Tables 4, 13, 14.113 M u t a t i on s Weights

Figure 60: Cluster Cl-3 in Clustering-A with weights based on unnormalized regres-sions with geometric means (see Subsection 2.6). See Tables 4, 13, 14.114 M u t a t i on s Weights

Figure 61: Cluster Cl-3 in Clustering-A with weights based on normalized regressionswith geometric means (see Subsection 2.6). See Tables 4, 13, 14.115 M u t a t i on s Weights

Figure 62: Cluster Cl-4 in Clustering-A with weights based on unnormalized regres-sions with geometric means (see Subsection 2.6). See Tables 4, 13, 14.116 M u t a t i on s Weights

Figure 63: Cluster Cl-4 in Clustering-A with weights based on normalized regressionswith geometric means (see Subsection 2.6). See Tables 4, 13, 14.117 M u t a t i on s Weights

Figure 64: Cluster Cl-5 in Clustering-A with weights based on unnormalized regres-sions with geometric means (see Subsection 2.6). See Tables 4, 13, 14.118 M u t a t i on s Weights

Figure 65: Cluster Cl-5 in Clustering-A with weights based on normalized regressionswith geometric means (see Subsection 2.6). See Tables 4, 13, 14.119 M u t a t i on s Weights

Figure 66: Cluster Cl-6 in Clustering-A with weights based on unnormalized regres-sions with geometric means (see Subsection 2.6). See Tables 4, 13, 14.120 M u t a t i on s Weights

Figure 67: Cluster Cl-6 in Clustering-A with weights based on normalized regressionswith geometric means (see Subsection 2.6). See Tables 4, 13, 14.121 M u t a t i on s Weights

Figure 68: Cluster Cl-7 in Clustering-A with weights based on unnormalized regres-sions with geometric means (see Subsection 2.6). See Tables 4, 13, 14.122 M u t a t i on s Weights