Comparing copy-number profiles under multi-copy amplifications and deletions
RESEARCH
Keywords: copy-number evolution; algorithms; cancer phylogenies; NP-hardness
Garance Cordonnier and Manuel Lafond
*Correspondence: [email protected]
Department of Computer Science, Université de Sherbrooke, Sherbrooke, Canada
Full list of author information is available at the end of the article
Abstract

Background: During cancer progression, malignant cells accumulate somatic mutations that can lead to genetic aberrations. In particular, evolutionary events akin to segmental duplications or deletions can alter the copy-number profile (CNP) of a set of genes in a genome. Our aim is to compute the evolutionary distance between two cells for which only CNPs are known. This asks for the minimum number of segmental amplifications and deletions that turn one CNP into another. This was recently formalized into a model where each event is assumed to alter a copy-number by +1 or −1, even though these events can affect large portions of a chromosome.

Results: We propose a general cost framework where an event can modify the copy-number of a gene by larger amounts. We show that any cost scheme that allows segmental deletions of arbitrary length makes computing the distance strongly NP-hard. We then devise a factor 2 approximation algorithm for the problem when copy-numbers are non-zero and provide an implementation called cnp2cnp. We evaluate our approach experimentally by reconstructing simulated cancer phylogenies from the pairwise distances inferred by cnp2cnp, and compare it against two other alternatives, namely the MEDICC distance and the Euclidean distance.

Conclusions: The experimental results show that our distance yields more accurate phylogenies on average than these alternatives if the given CNPs are error-free, but that the MEDICC distance is slightly more robust against errors in the data. In all cases, our experiments show that either our approach or the MEDICC approach should be preferred over the Euclidean distance.
Background
Cancer is widely recognized as an evolutionary process during which cells within a population accumulate aberrant somatic mutations and replicate indefinitely [1]. These cells are divided into subpopulations, called clones, that share common mutational traits and form tumors. A natural problem that arises is to reconstruct the evolution of a set of clones within a tumor. This question has recently led to the development of several phylogenetic algorithms tailored for cancer evolution. Most of them use either information on single nucleotide variants obtained from bulk [2, 3, 4, 5] or single-cell [6, 7, 8] sequencing data, or copy-number alterations [9, 10, 11, 12, 13] (usually in the context of single-cell data). We refer the reader to [14] for a survey of these methods.

In this work, we are interested in the problem of inferring the minimum number of copy-number alteration events that explain how a cell evolved into another. In tumors, several events can make the copy-number of a gene different from the normal diploid two-copy state, thereby creating copy-number aberrations. As an example, the breakage–fusion–bridge (BFB) phenomenon [15] occurs when a region including a telomere breaks off a chromosome. During replication, the two sister chromatids have unterminated ends and they fuse, leading to what is essentially a chromosome portion concatenated with a reversed copy of itself (see [15, 16] for a more thorough explanation). Afterwards, the centromeres of the fused chromatids get pulled in opposite directions, leading to another breakage. This BFB cycle repeats until the chromatids receive a telomere (usually after translocation). Each BFB event potentially doubles the copy-number of a gene, and since these events are known to occur in cycles, a gene copy-number may become significantly higher than normal (i.e. more than double) in a short evolutionary time span.
Other examples of events include focal deletions [17, 18] or missegregation of chromosomes [19].

Desper et al. [20] were among the first to consider copy-number aberrations for phylogenetic reconstructions, using comparative genomic hybridization (CGH) data to reconstruct a mutation hierarchy. In [21], Liu et al. propose a distance-based approach based on CGH data to infer multi-cancer phylogenies. Single-cell phylogenetics then gained widespread attention with an influential paper of Navin et al. [9]. The authors applied single-nucleus sequencing on a breast cancer tumor, obtained the copy-number profile (CNP) of several cells, each represented as a vector of integers, and used the Euclidean distance to compare two CNPs. Later, Schwarz et al. [22] pointed out that a single event can amplify or delete large portions of a chromosome, thereby altering the copy-number of several genes and making the Euclidean distance overestimate the true number of events.

The authors proposed the following methodology to compare two CNPs. First, assuming diploid genomes, the copy-number for the two alleles of each gene (which can differ) is inferred from sequencing data. The correspondence between the copy-numbers and the alleles is unknown, so a phasing step must be applied. This consists of assigning each copy-number to one of the two alleles (this is done under a minimum-evolution principle, see [22] for details). After this step, each chromosome can be represented as a pair of CNPs, and chromosomes from two cells can be compared by computing the distances between the corresponding alleles. The distance proposed is the minimum number of segmental amplification and deletion events required to transform a given CNP into another.

In this work, we focus on the latter step. We assume that the CNP inference and the phasing steps have been performed, and must find a most parsimonious sequence of events explaining two given CNPs.
This is analogous to classical rearrangement problems [23], but the main novelty (and difficulty) of CNP comparison is that only copy-numbers are known, not the ordering of genes. In [22], Schwarz et al. introduced the MEDICC model, which approximates segmental events on a chromosome by events that alter a subinterval of a CNP by +1 or −1. Figure 1 shows an example of turning a CNP u into another CNP v (under our model, where any amount of change is allowed).

Figure 1: Left: two CNPs u and v, represented as integer vectors. The CNP u can be turned into v with three events: two deletions and one amplification. Right: a visual representation of the difference vectors obtained at each step. Note that a 0 remains a 0 even after amplification.

The problem of computing the minimum number of subinterval alterations to transform one CNP into another was solved in exponential time in [22] by modeling CNP events with a finite-state transducer. Zeira et al. [24] gave a linear time algorithm, using a clever trick for computing each row of a quadratic-size dynamic programming table in constant time (similar to the techniques used in [25]). In [11], the large phylogeny problem under this model is shown NP-hard, though solvable with an ILP. They also present the copy-number triplet problem, which, when given two CNPs u and v, asks for a CNP whose sum of distances to u and v is minimized. The problem can be solved in pseudo-polynomial time, i.e. polynomial in n, the CNP size, and N, the maximum copy-number. Other distances and phylogenetic approaches are discussed in [12, 10, 13, 11, 26, 27].

Our results.
The above CNP comparison frameworks limit events to alter copy-numbers by +1 or −1. As we exemplified with BFB, several copies of a gene can be affected by a single event. Moreover, the MEDICC software has a copy-number limit of 4, making it inappropriate for genes attaining copy-numbers in the tens, twenties or even more, as has been reported for e.g. the MYC or EGFR genes [28, 29, 30]. In this work, we address these limitations by generalizing the Copy-Number Transformation problem defined in [22, 24]. We define a distance d_f(u, v) between two CNPs u and v which assigns a weight of f(c, δ) to an event that alters a copy-number of c by an amount of δ. We show that computing d_f(u, v) becomes strongly NP-hard whenever we allow deletions of any amount at unit cost. In the context of our problem, "strongly" means that our hardness holds even if N, the maximum value in u and v, is polynomial in n, the number of elements in our CNPs. This is especially relevant, given that the MEDICC model and the copy-number triplet problem both admit algorithms whose running time is pseudo-polynomial, i.e. polynomial in n and N. Our result implies that such pseudo-polynomial time algorithms are impossible in our case unless P = NP. We then show that if any amount of change is permitted across an interval at unit cost, then a simple linear-time factor 2 approximation algorithm can be devised. We validate our approach by reconstructing simulated phylogenies using neighbor-joining (NJ), and compare them with the MEDICC distance and the Euclidean distance. We perform our experiments on error-free data and noisy data (where the true copy-numbers are altered by a random amount). Using a variety of simulation parameters, we show that both our distance and the MEDICC distance achieve significantly better accuracy than the Euclidean distance. Our distance is slightly more accurate on error-free data, and the MEDICC distance is slightly more tolerant to error.
Results
We first provide the preliminary notions required to state our theoretical results. We then show that computing our copy-number distance is strongly NP-hard, and present our approximation algorithm. Finally, we present our experimental results on reconstructing simulated phylogenies.
Preliminary notions
Throughout the paper, we use the interval notations [n] = {1, 2, . . . , n} and [s, t] = {s, s + 1, . . . , t}. Given a vector u = (u_1, . . . , u_n) of n integers and i ∈ [n], we will always write u_i for the value at the i-th position of u. If u_i = 0, then i is called a null position. We will assume that every vector u of dimension n has special values u_0 = u_{n+1} = 0. We denote by u_{−{i}} the vector obtained by removing position i ∈ [n], i.e. u_{−{i}} = (u_1, . . . , u_{i−1}, u_{i+1}, . . . , u_n). If v is a vector of the same dimension, then u − v = (u_1 − v_1, . . . , u_n − v_n).

We assume that a reference chromosome is partitioned into contiguous subsequences, called positions, each numbered from 1 to n. A copy-number profile (CNP) is a vector u = (u_1, . . . , u_n) of non-negative integers representing the copy-number of each position in a clone. We consider amplification and deletion events, which respectively have the effect of increasing and decreasing the number of copies in a chromosome. As in [22, 24], we assume that events affect a set of positions that are contiguous in the reference chromosome.

An event is a triple e = (s, t, δ) where 1 ≤ s ≤ t ≤ n and δ ∈ Z \ {0}. Here the [s, t] interval depicts the set of affected positions, and δ is the amount of change. The event e is an amplification when δ > 0 and a deletion when δ <
0. A copy-number cannot drop below 0 and cannot increase from a 0 to another value (e.g. new genes cannot be created once completely lost). Applying event e = (s, t, δ) on a CNP u yields another CNP u′ = (u′_1, . . . , u′_n) with, for i ∈ [n], u′_i = max(u_i + δ,
0) if i ∈ [s, t] and u_i > 0, and u′_i = u_i otherwise.

We denote by u⟨e⟩ the CNP obtained by applying event e on a CNP u. More generally, if E = (e_1, . . . , e_k) is an ordered sequence of events, we write u⟨E⟩ = u⟨e_1⟩⟨e_2⟩ · · · ⟨e_k⟩ to denote the CNP obtained by applying each event of E in order. We may also write u⟨e_1 . . . e_k⟩ instead of u⟨(e_1, . . . , e_k)⟩. Given two CNPs u and v of dimension n, we say that E transforms u into v if u⟨E⟩ = v.

We will often use the difference vector of u and v, and usually denote w := u − v. The representation of w as in Figure 1 on the right provides the following intuition: if u⟨E⟩ = v, then the events of E need to "squish" the values of w to 0 to make u equal to v (ensuring that no value u_i of u drops to 0 in the process unless v_i = 0).

Minimum cost transformations
Given two CNPs u and v, our goal is to find a minimum-cost sequence E that transforms u into v. In [22, 24], the cost of an event (s, t, δ) is |δ|. Here, we propose a generalization by defining a cost function f : N × Z → N_{>0} that assigns a positive cost to altering a copy-number c by an amount of δ. That is, if we apply (s, t, δ) on u, each position i ∈ [s, t] has its own corresponding cost f(u_i, δ), which could be interpreted as the plausibility of going from copy-number u_i to max(u_i + δ, 0). We define the cost cost_f(u, e) with respect to f of applying e = (s, t, δ) on u as the maximum cost within [s, t], i.e. cost_f(u, e) = max_{i ∈ [s,t]} f(u_i, δ).

The events proposed in the MEDICC algorithm of Schwarz et al. can be decomposed into |δ| events of unit cost. This can be modeled under our framework with a function mdc defined as mdc(u_i, δ) = 1 if δ ∈ {−1, 1} and mdc(u_i, δ) = ∞ otherwise. Alternatively, one could state that a position with copy-number u_i can hardly more than double in a single event (assuming that amplifications are duplications), but that deletions can suppress any number of copies. We call this the doubling function dbl, defined as dbl(u_i, δ) = 1 if u_i + δ ≤ 2u_i, and dbl(u_i, δ) = ∞ otherwise. Finally, the most permissive cost function any allows any movement without constraint: simply define any(u_i, δ) = 1 for any δ ∈ Z. This can, for instance, be used to model successions of events that can potentially amplify copy-numbers above their double in a short time span, an example of this being BFB cycles.

In this paper, we mostly analyze the any function for its simplicity, but will sometimes use the dbl function for its relevance. Given two CNPs u and v and a cost function f, the cost of a sequence of events E = (e_1, . . . , e_k) satisfying u⟨E⟩ = v is equal to the sum of the costs of applying the successive events of E on u, i.e.
cost_f(u, E) = cost_f(u, e_1) + cost_f(u⟨e_1⟩, e_2) + · · · + cost_f(u⟨e_1 . . . e_{k−1}⟩, e_k).

If cost_f(u, E) ≤ cost_f(u, E′) for any other sequence E′ satisfying u⟨E′⟩ = v, then E is called optimal. The f-distance between u and v, denoted d_f(u, v), is the cost of an optimal sequence of events transforming u into v. Observe that this "distance" is not symmetric (hence the use of double-quotes). For instance, if u = (1,
1) and v = (0, 1), then d_mdc(u, v) = 1 but d_mdc(v, u) is undefined since v cannot be transformed into u. We will therefore usually assume that u does not have any null position. We note here that the median distance, defined as min_w (d_f(w, u) + d_f(w, v)) (where w ranges over Z^n), is symmetric for all the functions mentioned above. However, no efficient algorithm is known for any median distance. Our problem is the following.

The CNP-transformation problem:
Given: a source CNP u, a target CNP v, a cost function f and an integer k;
Question: is d_f(u, v) ≤ k?

We say that f is a unit-cost function if f(c, δ) ∈ {1, ∞} for any c and δ (e.g. the functions mdc, dbl and any). A cost function f is called deletion-permissive if f(u_i, δ) = 1 for any u_i and any δ <
0, i.e. there is no particular constraint on deletions. We will mainly focus on deletion-permissive functions, the rationale being that unlike duplications, deletions could suppress an arbitrary number of copies.
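To make these definitions concrete, here is a minimal sketch of the event semantics and of the three unit-cost functions above (the code and function names are ours, for illustration only; they are not taken from the cnp2cnp implementation):

```python
import math

def apply_event(u, s, t, delta):
    """Apply event (s, t, delta) to CNP u (positions are 1-indexed).
    A copy-number cannot drop below 0, and a 0 stays a 0."""
    return [max(c + delta, 0) if s <= i + 1 <= t and c > 0 else c
            for i, c in enumerate(u)]

def mdc(c, delta):          # MEDICC-style unit events: only +1 / -1
    return 1 if delta in (-1, 1) else math.inf

def dbl(c, delta):          # a copy-number can at most double per event
    return 1 if c + delta <= 2 * c else math.inf

def any_cost(c, delta):     # any amount of change at unit cost
    return 1

def event_cost(u, s, t, delta, f):
    """Cost of applying (s, t, delta) on u: maximum of f over [s, t]."""
    return max(f(u[i - 1], delta) for i in range(s, t + 1))

def sequence_cost(u, events, f):
    """Total cost of an ordered event sequence, applied left to right."""
    total = 0
    for s, t, delta in events:
        total += event_cost(u, s, t, delta, f)
        u = apply_event(u, s, t, delta)
    return total, u
```

For instance, with u = (3, 3, 5, 5, 2), the two deletions (3, 4, −2) and (1, 4, −2) transform u into (1, 1, 1, 1, 2) at total cost 2 under any, while a single −2 deletion has infinite cost under mdc (it would have to be decomposed into two −1 events).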
General properties
Before proceeding with our results on computing f-distances, we present some results of general interest that will be useful later on.

Proposition 1.
For any two CNPs u and v of the same dimension, any position i ∈ [n] and any cost function f, d_f(u, v) ≥ d_f(u_{−{i}}, v_{−{i}}).

We omit the proof details. The idea is that given a sequence of events E transforming u into v, we can apply E on u_{−{i}} by ignoring position i when it is affected.

A sequence of events E is called amp-first if all amplifications appear before all deletions. An amp-first reordering of a sequence E is an amp-first sequence E′ that contains the same events as E. Notice that if E has a amplifications and d deletions, then there are a! d! amp-first reorderings of E.

Proposition 2.
Let u and v be two CNPs with no null positions. If a sequence E satisfies u⟨E⟩ = v, then any amp-first reordering E′ of E satisfies u⟨E′⟩ = v.

Proof
Denote E = ((s_1, t_1, δ_1), . . . , (s_k, t_k, δ_k)). For any position i, the sum of the δ_j over the events whose interval contains i does not change if we reorder the events in E, so u_i should still become v_i after reordering the events and applying them on u. The only danger is that a position drops to 0; since v has no null position, this cannot happen if all amplifications are moved to the front of E.

Given a CNP w of length n, an interval [a, b] is a staircase of w if 0 < w_a < w_{a+1} < · · · < w_b.

Lemma 1. Let u, v be two CNPs with no null positions. If u − v contains a staircase [a, b] of length k, then d_f(u, v) ≥ k for any unit-cost function f.

Strong NP-hardness

We show that the CNP-transformation problem is strongly NP-hard. This result holds for any deletion-permissive unit-cost function f, and even if u and v contain no null position (we note that in [24], null positions make the problem more complex, but not here). In particular, the hardness also holds if only deletions are allowed. We assume that we are given two CNPs u and v and we put w := u − v.

Figure 2: A visual representation of a difference vector w = u − v containing a staircase of length 4 in interval [1, 4], with step values +3, +5, +6 and +8. A smooth deletion sequence turning u into v is shown (last deletion omitted).

Suppose that w contains a staircase in interval [1, k] for some k, and that d_f(u, v) = k. A sequence E = (e_1, . . . , e_k) such that u⟨E⟩ = v is called smooth if, for every i ∈ [k], e_i = (i, b_i, w_{i−1} − w_i) for some b_i ≥ k. Intuitively, E removes the first step, then the second, and so on; see Figure 2. Observe that in a smooth deletion sequence, the positions to the right of k may or may not be affected by deletions.

Lemma 2. Let u and v be two CNPs with no null positions and let f be any unit-cost function.
If u − v contains a staircase in interval [1, k] and d_f(u, v) = k, then there exists a smooth sequence transforming u into v.

Lemma 2 requires the most technical proof of the paper (by far), and we defer it to the Supplementary material. The reduction becomes relatively simple when given this lemma. Our reduction is from the 3-Partition problem. In this problem, we are given a multi-set S = {s_1, . . . , s_n} of n = 3m positive integers. Defining t := (1/m) Σ_{i ∈ [n]} s_i, we are asked whether S can be partitioned into m subsets S_1, . . . , S_m, each of size 3, such that Σ_{s ∈ S_i} s = t for all i ∈ [m]. This problem is known to be strongly NP-hard [31] (i.e. it is hard even if the values of S are O(n^k) for some constant k). The proof can be found in the Supplementary material.

Theorem 1. The CNP-transformation problem is strongly NP-hard for any deletion-permissive unit-cost function, even if the CNPs have no null positions.

Approximation algorithm

In this section, we show that if v does not contain any null position, then d_any(u, v) can be approximated within a factor of 2 in linear time. We discuss practical ways of handling null positions at the end of the section. We now assume that f = any and will write d(u, v) instead of d_any(u, v).

As usual, u and v are the source and target CNPs, respectively, and w := u − v. The idea of the approximation is that if two consecutive positions i and i + 1 have the same difference between u and v, i.e. w_i = w_{i+1}, then their values need to change by the same amount. It might then be a good idea to treat these positions as one and always affect both with the same events. In fact, a whole interval of equal w values can be treated as a single position. We show that the number of distinct equal intervals gives a good bound on d(u, v).
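To illustrate the smooth sequences defined in the previous section, the following sketch builds a smooth deletion sequence for a difference vector w that is a staircase over the whole CNP, taking every b_i = n (this toy code is ours, not part of cnp2cnp):

```python
def smooth_deletion_sequence(w):
    """For w = u - v a staircase on [1, n] (0 < w_1 < ... < w_n),
    return the smooth events e_i = (i, n, w_{i-1} - w_i), i = 1..n."""
    n = len(w)
    ext = [0] + list(w)                      # implicit w_0 = 0
    return [(i, n, ext[i - 1] - ext[i]) for i in range(1, n + 1)]

def apply_events(u, events):
    """Apply an ordered event sequence (copy-numbers stay >= 0,
    and a 0 can never increase again)."""
    u = list(u)
    for s, t, delta in events:
        for i in range(s - 1, t):
            if u[i] > 0:
                u[i] = max(u[i] + delta, 0)
    return u
```

For u = (4, 6, 7, 9) and v = (1, 1, 1, 1), we get w = (3, 5, 6, 8) and the k = 4 smooth deletions (1, 4, −3), (2, 4, −2), (3, 4, −1), (4, 4, −2), matching the lower bound of Lemma 1.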
Approximation by flat intervals

Recall that if w is a vector of n integers, it has implicit values w_0 = w_{n+1} = 0. We say that [a, b], with 0 ≤ a ≤ b < n + 1, is a flat interval if w_i = w_j for every a ≤ i, j ≤ b. If no interval properly containing [a, b] is flat, then [a, b] is a maximal flat interval. In fact, in the remainder, we will omit the term "maximal" and always assume that discussed flat intervals are maximal. We write F_w for the set of flat intervals of w. Note that this set is well-defined and that it partitions [0, n + 1], by the maximality property. The intervals that contain 0 and n + 1 in F_w are called extreme flat intervals, and always have a value of 0 (also, these intervals are possibly [0, 0] and/or [n + 1, n + 1], but not necessarily). The key lemma says that d_f(u, v) is at least about half the number of flat intervals (see Supplementary material).

Lemma 3. Let u, v be two distinct CNPs with no null positions, and let w := u − v. Then for any unit-cost function f, d_f(u, v) ≥ ⌈(|F_w| − 2)/2⌉.

Lemma 3 yields a very simple factor 2 approximation algorithm: compute F_w, and return |F_w| − 2. This corresponds to a solution in which we treat each flat interval separately (ignoring the two extremities) and is guaranteed to be at most twice the optimal number of events. Computing F_w can be done in a single pass through w, by increasing a counter whenever we encounter a position i with w_i ≠ w_{i−1}.

Theorem 2. The CNP-transformation problem can be approximated within factor 2 in linear time for cost function f = any when the CNPs contain no null position.

It is open whether this could be adapted to other functions, e.g. the dbl function.

Improvements to the approximation algorithm

We first observe that the bound in Lemma 3 is essentially tight. This can be seen with any u, v such that u − v = (1, 2, 3, . . . , k − 1, k, k − 1, . . . , 3, 2, 1) for some k.
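The flat-interval count and the resulting naive bound can be sketched as follows (again our own illustrative code, with the implicit boundary values w_0 = w_{n+1} = 0 handled explicitly):

```python
def flat_interval_count(w):
    """|F_w|: number of maximal flat intervals of w, extended with the
    implicit boundary values w_0 = w_{n+1} = 0 (single pass)."""
    ext = [0] + list(w) + [0]
    return 1 + sum(ext[i] != ext[i - 1] for i in range(1, len(ext)))

def naive_event_count(u, v):
    """Treat each non-extreme flat interval of w = u - v as one event:
    |F_w| - 2 events, at most twice the optimum by Lemma 3."""
    if u == v:
        return 0
    w = [a - b for a, b in zip(u, v)]
    return flat_interval_count(w) - 2
```

On the tight example above with u − v = (1, 2, 3, 2, 1), we get |F_w| = 7, so the naive solution uses 5 events while 3 nested deletions suffice; the ratio approaches 2 as k grows.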
Indeed, one can decrease |F_w| by two at each round. On the other hand, our naive 2-approximation is twice as bad as optimal. We show how to improve this in a heuristic fashion by devising an algorithm that can only perform better than the naive one. We leave it as an open problem to determine the approximation guarantees of this algorithm.

Our goal is to apply events that reduce |F_w| by two as many times as possible. In a greedy fashion, we apply the following strategy for our improved 2-approximation: as long as u ≠ v, find an event e that reduces |F_w| by 2, if one exists, and apply it to u. If no such event exists, take the leftmost non-extreme flat interval [a, b] of w and apply the event (a, b, −w_a). Repeat until u = v.

An event (i, j, δ) reduces |F_w| by 2 precisely when w_{i−1} − w_i = w_{j+1} − w_j = δ (i = j is possible). This way we can merge the two flat intervals at the ends of [i, j]. One can find a good interval by checking all the O(n^2) subintervals [i, j] and then, for each of them, checking whether w_{i−1} − w_i = w_{j+1} − w_j. Moreover, we must check whether applying the event (i, j, δ) would make a value of u go below 0. Verifying every possible event in this manner takes polynomial time and, since each round reduces the number of flat intervals, the whole algorithm runs in polynomial time.

This can be improved to O(n log n) overall by finding good events in time O(log n) per round. Due to space constraints, we relegate the detailed analysis of the improved heuristic to the Supplementary material. The idea is to scan w from left to right and store in a treap data structure (see [32]) the set of flat intervals encountered so far, which allows us to detect quickly whether the current flat interval can be matched with another one.

Handling null positions

Our approximation ratio is not guaranteed to hold when there are many null positions. However, we show that in many practical cases, we can simply ignore null positions and remove them.
In particular, we may assume that v has no two consecutive null positions (Lemma 4) and that for any null position i in v, we have w_{i−1} < w_i and w_{i+1} < w_i (Lemma 5). Thus instances with null positions can be reduced to ones where the only null positions remaining are "sandwiched" between non-null positions with a smaller value in w.

Note that our approximation can still perform badly even under these two conditions: there are instances satisfying them on which it applies many more events than necessary. Designing a better approximation for these cases is an open problem.

Lemma 4. Suppose that v_i = v_{i+1} = 0 for some position i. Then removing position i or i + 1, whichever is smaller in u, from u and v preserves the distance between u and v. Formally, for any unit-cost function f, if u_i ≥ u_{i+1}, then d_f(u, v) = d_f(u_{−{i+1}}, v_{−{i+1}}). Similarly, if u_{i+1} ≥ u_i, then d_f(u, v) = d_f(u_{−{i}}, v_{−{i}}).

Lemma 5. Suppose v_i = 0 for some position i and that w_{i−1} ≥ w_i or w_{i+1} ≥ w_i. Then d_f(u, v) = d_f(u_{−{i}}, v_{−{i}}) for any unit-cost function f.

Experiments

We tested our flattening approximation algorithm and its improved version on simulated chromosomes that evolve along a tree through segmental tandem duplications and losses. Chromosomes were represented as strings of genes. Note that we did not simulate CNP evolution under the assumptions of our model. We evolved actual sequences as opposed to integer vectors, and the initial ordering of genes could be broken after several events. Our goal was to reconstruct phylogenies from the distances between the CNPs of the chromosomes at the leaves of the tree. We used the NJ implementation in Phylip [33, 34] and compared four distances: (1) our improved approximation; (2) our flat interval count; (3) the mdc cost, as in the MEDICC model; and (4) the Euclidean distance.
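Since the methods only see copy-numbers, each simulated chromosome string is first reduced to its CNP by counting gene occurrences; a minimal sketch (function name and gene labels are ours):

```python
from collections import Counter

def cnp_of(chromosome, genes):
    """CNP of a chromosome w.r.t. the reference gene order: the i-th
    entry is the number of occurrences of gene i (0 if fully lost)."""
    counts = Counter(chromosome)
    return [counts[g] for g in genes]
```

For example, cnp_of("abbcabb", "abc") returns [2, 4, 1], and a gene absent from the string gets copy-number 0.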
To compute d_mdc, we implemented the dynamic programming algorithm of Zeira, Zehavi and Shamir [24], hereafter called the ZZS algorithm (we could not use the MEDICC software as it only handles copy-numbers up to 4). The Euclidean distance is defined as √(Σ_{i=1}^{n} (u_i − v_i)²), as used in [9]. For the first three distances, we took the minimum of d(u, v) and d(v, u) to get a symmetric distance, removing null positions of u and filtering null positions of v as in Lemmas 4 and 5.

Simulated tree generation

We now describe how the trees were generated. First, we select a rooted binary tree T on l leaves labeled {1, . . . , l} uniformly at random. This is achieved by using the recursive splitting process described by Aldous in [35], which starts with a completely unresolved tree, splits the root in two subtrees chosen uniformly at random, and repeats on these subtrees. We then assign to the root r of T an exemplar chromosome, i.e. any string in which each gene occurs exactly once (note that the initial ordering of genes does not matter for our purposes).

Then, for each branch uv of T from top to bottom, we select a random number of events k chosen uniformly at random in the interval [e_min, e_max], where e_min and e_max are simulation parameters. To introduce some rate heterogeneity among branches, we then multiplied k by a random number chosen from a uniform distribution with mean and standard deviation 1. The chromosome string at node v is obtained by applying k random events on the chromosome string associated with its parent u. Each event is either a tandem duplication with probability ∆ or a deletion with probability 1 − ∆. The starting position of each event is chosen uniformly at random on the chromosome string and, to find the length t of the substring affected, we apply the following process.
Start with t = 1, then apply the following: as long as a random number between 0 and 1 is above a given parameter r, increment t by 1 and repeat. We stop at the first random number below r. We chose to consider only values of r between 0.01 and 0.1. One exception is made during deletions when determining the length t of an event: if incrementing t implies deleting the last occurrence of a gene, we continue the procedure with probability q and stop with probability 1 − q (where q is another simulation parameter). This can be seen as modeling the idea that there may be resistance when attempting to remove every copy of a gene required for survival. Using three increasing values of q (the largest being 0.75), the proportion of null positions stayed in the intervals 2-5%, 6-10% and 15-25%, respectively.

Note that since each possible tree on l leaves is equally likely to be chosen, the root-to-leaf distances in a tree can be significantly different, and hence the trees are not expected to be ultrametric (for instance, a caterpillar can be selected as well as a perfect binary tree).

Since it is difficult to determine the most realistic simulation conditions, we tested several combinations of parameters for the generation of phylogenies. The simulation parameters, along with their default values, are summarized here:

• l is the number of leaves in the tree. The default is l = 100;
• n ∈ {10, 100, 250} is the number of genes (i.e. distinct characters) in the root chromosome (n is also the number of positions in our vectors). The default is n = 100;
• (e_min, e_max) is the range of the possible number of events on each branch. The default has e_min = 5;
• ∆ is the probability that an event is a duplication (and 1 − ∆ the probability that an event is a loss);
• r controls the length of each event: r is the probability that we stop extending the event length, with tested values between 0.01 and 0.1;
• q
is the probability that a deletion suppresses the last copy of a gene during the length extension procedure (i.e. 1 − q is the probability that the extension stops if it would make a copy-number 0).

Tree reconstruction and performance measure

We generated 50 trees for each parameter combination of interest. For each tree, we took the chromosome strings at the leaves, obtained their CNPs and provided them as input to each of the four evaluated methods. We used the normalized Robinson-Foulds (RF) distance as a measure of the performance of each algorithm [36]. That is, for each inferred tree, we compare it with the "true" tree by counting the number of clades that are present in one tree but not the other, divided by 2(l − 3) (the maximum number of clades that can possibly differ, recalling that l is the number of leaves). This yields a number between 0 and 1: the lower the number, the better we consider the reconstruction.

Error tolerance

It should be noted that the above methodology ignores several sources of error. Inferring exact copy-numbers from single-cell sequencing data is a non-trivial task and is still considered an open problem. The inferred CNPs are therefore expected to be noisy, especially with genes having a high copy-number. Moreover, as discussed in [22], assigning copy-numbers to their corresponding allele is also a difficult problem. Here, by only considering single-allele chromosomes, we are supposing that the aforementioned phasing step has been performed correctly, whereas copy-number assignments cannot be assumed to be error-free.

Both of the above problems have the effect of introducing incorrect copy-numbers into the CNPs. To account for this, we gave randomly altered CNPs as input to each method.
More specifically, given an error-rate parameter α, for each CNP u and each position i we changed u_i to a value chosen at random from a normal distribution with mean u_i and standard deviation α · u_i (non-integer values were rounded). We tested values of α ranging from 0 (error-free) to 0.5.

Experimental results

We first ran experiments using the default values for all parameters except one, in order to isolate the impact of each parameter. On error-free data, the most interesting results were obtained when varying l and n; see Figure 3. In most situations, our CNP model slightly improves upon the MEDICC model, and both are significantly better than the Euclidean distance. The number n of genes is quite important: all the results are poor when each CNP has only n = 10 positions, but when n = 250, the trees are more accurate. This might be because when n = 10, there is not enough opportunity for positions to acquire a distinct signature during the evolutionary process, making all distances very similar. This suggests that many genes or segments should be considered when analyzing copy-number variants in tumor clones. The duplication rate does not seem to affect the accuracy of the methods, whereas accuracy tends to decrease as the number of events per branch increases. We also note interesting effects of the q and r parameters when the number of events per branch lies in its default range (see Figure 4). As mentioned before, as q gets closer to 1, the proportion of null positions grows, making accurate distance computation difficult for all methods. As for the r parameter, the accuracy of our approach is better when r = 0.01 but worse when r = 0.1. This tendency can be observed under all parameter combinations (see the Supplementary Material). One should note that accuracy is generally better when the lengths of events are smaller.
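The error-injection step described at the beginning of this section can be sketched as follows. The function name and the clamping of negative draws to 0 are our own assumptions (the text only specifies drawing from a normal distribution and rounding):

```python
import random

def perturb_cnp(cnp, alpha, seed=0):
    """Simulated error model: each copy-number u_i is replaced by a value
    drawn from a normal distribution with mean u_i and standard deviation
    alpha * u_i, rounded to the nearest integer.  Clamping at 0 keeps the
    perturbed copy-numbers non-negative (our assumption)."""
    rng = random.Random(seed)
    return [max(0, round(rng.gauss(ui, alpha * ui))) for ui in cnp]
```

With α = 0 the standard deviation is zero and the CNP is returned unchanged, which corresponds to the error-free setting.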
The four approaches exhibit similar behavior on error-free data under the other parameter combinations; additional plots can be found in the Supplementary Material.

The results on CNPs containing errors are summarized in Figure 5. We observe that the ZZS algorithm achieves slightly more accurate trees whenever the error rate is above zero. One possible explanation is that a single error in a CNP can split a flat interval into three. This can significantly alter the flat interval counts, whereas the ZZS distance is less dependent on flat intervals. The accuracy of the Euclidean distance appears to be the least affected by error rates, and it even performs better when α ≥ 0.5. Observe, however, that accuracy decays rapidly with error rates: when α ≥ 0.25, all approaches yield high average normalized RF distances, much as in the n = 10 setting. Whether the MEDICC model is better than ours or not, we believe that either should be preferred over the Euclidean distance when reconstructing phylogenies from distance matrices as in [9].

Discussion

The results from the experiments section show that our CNP distance performs reasonably well on simulated data. The incorporation of segmental events into the model does not appear to provide a significant advantage over the unitary events of the ZZS model. However, the simulations suggest that both approaches yield better results than the traditional Euclidean distance. This demonstrates that either our method or the ZZS algorithm should be preferred as the CNP comparison component in a single-cell phylogenetic reconstruction pipeline. It should be noted that our algorithms only approximate the true CNP distance, whereas the ZZS algorithm provides an exact solution. In order to evaluate the true performance of our segmental model, exact approaches should be developed in the future, perhaps using techniques from the field of parameterized complexity.
Moreover, our approaches are very sensitive to errors, even more so than ZZS/MEDICC. One possible explanation for this is that both of our algorithms derive their distance from the number of flat intervals. A single error in a copy-number can turn one flat interval into three, and thus even moderate levels of noise can lead to highly incorrect predictions. We believe that the ZZS algorithm is less sensitive to such errors because, when copy-number differences are large enough, a small error only increases the distance by 1, which is small in comparison to all the unit events required to handle the large difference. Therefore, even if the true event distance is overestimated, in a comparative setting the relative distances might be closer to the truth. It would be interesting to consider CNP error correction procedures based on flat intervals. For instance, when performing an analysis of multiple cells, one could detect a potentially incorrect copy-number of a given segment by checking whether, after altering a predicted copy-number by a small amount, several flat intervals get "fixed" when comparing the cell with others.

Another point of interest is that current approaches, including ours and ZZS/MEDICC, ignore rearrangements that change the ordering of segments. Our models assume that the set of contiguous segments remains the same in all cells during evolution. However, when duplications and deletions occur, the relative ordering of genes changes, and the set of contiguous genes affected by the events will differ from that in the reference. Inversions, translocations or even chromothripsis also have the same effect. This is a difficult problem to handle if only CNPs are known, since integer vectors do not contain enough information to determine which genes are contiguous.
One possibility is to ask the following: given two CNPs C_1 and C_2 to compare, choose a genome G_1 whose CNP is C_1 and a genome G_2 whose CNP is C_2 such that the rearrangement distance between G_1 and G_2 is minimized.

Our work also leaves several questions open. From a theoretical perspective, it remains to achieve a constant factor approximation when null positions are present in the input. Moreover, it is unknown whether the dbl function admits good approximation algorithms and, more generally, whether there are other biologically plausible cost functions that should be studied. On another note, it might be interesting to investigate the copy-number triplet problem (see the introduction) under our model, as it allows one to define a symmetric distance between CNPs.

On a practical level, phylogenetic approaches that are not distance-based should be investigated. For instance, we could consider maximum parsimony as in [11], where the objective is to minimize the number of events across the branches of a tree. The recent distance-based approach of Xia et al. [12], which is based on the MEDICC model but adds an extra error correction step, should also be evaluated in our setting. It also remains to test our approach on real data. We have ignored the problem of calling copy-numbers and the aforementioned problem of phasing. These can introduce noise in the data and, as shown in our experiments, all the evaluated methods are sensitive to errors. This motivates the need for new methods to assign copy-numbers to alleles under our model. Also, our CNP comparison framework assumes a single-cell setting, where the CNP of each individual cell is known. Since bulk sequencing is still commonplace, it will be useful to develop methods that are able to compare genomes extracted from samples that contain multiple types of cells.
Conclusion

In this work, we provided a general framework for the comparison of CNPs depicting genomes that evolve by segmental amplifications and deletions. We have shown that if there is no bound on the number of copies that a deletion can affect, then computing the minimum number of events transforming one CNP into another is strongly NP-hard. One important implication of this result is that unless P = NP, one cannot use the fact that copy-numbers are not too large (e.g. under 100) to devise a practical pseudo-polynomial time algorithm, and other solutions must be explored. On the other hand, we proposed two simple and fast approximation algorithms that were shown to perform reasonably well on simulated datasets.

List of abbreviations

BFB: breakage-fusion-bridge; CGH: comparative genomic hybridization; CNP: copy-number profile; NJ: neighbor-joining; RF: Robinson-Foulds; ZZS: Zeira, Zehavi, Shamir

Figure listing

Figure 1: an example of a CNP-to-CNP transformation.
Figure 2: a visual representation of a staircase and a smooth deletion sequence.
Figure 3: average normalized RF distances of the four methods evaluated when varying l, n, ∆ and (e_min, e_max).
Figure 4: average normalized RF distances of the four methods evaluated when varying q and r.
Figure 5: average normalized RF distances of the four methods evaluated with varying error rates.

Declarations

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
GC and ML both participated in writing the manuscript, establishing the theoretical results, performing the experiments and implementing the algorithms. All authors have read and approved the manuscript.

Funding
Publication was funded by the Natural Sciences and Engineering Research Council (NSERC).

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.
Availability of data and materials

The source code and data are available at: https://github.com/AEVO-lab/cnp2cnp. Supplementary file S33-S1.pdf contains all the missing proofs. Supplementary file S33-S2.pdf contains all the additional experimental results.

Author details

Department of Computer Science, École polytechnique, Paris, France. Department of Computer Science, Université de Sherbrooke, Sherbrooke, Canada.

References

1. Peter C Nowell. The clonal evolution of tumor cell populations. Science, 194(4260):23–28, 1976.
2. Wei Jiao, Shankar Vembu, Amit G Deshwar, Lincoln Stein, and Quaid Morris. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinformatics, 15(1):35, 2014.
3. Mohammed El-Kebir, Layla Oesper, Hannah Acheson-Field, and Benjamin J Raphael. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics, 31(12):i62–i70, 2015.
4. Salem Malikic, Andrew W McPherson, Nilgun Donmez, and Cenk S Sahinalp. Clonality inference in multiple tumor samples using phylogeny. Bioinformatics, 31(9):1349–1356, 2015.
5. Ke Yuan, Thomas Sakoparnig, Florian Markowetz, and Niko Beerenwinkel. BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies. Genome Biology, 16(1):36, 2015.
6. Katharina Jahn, Jack Kuipers, and Niko Beerenwinkel. Tree inference for single-cell data. Genome Biology, 17(1):86, 2016.
7. Edith M Ross and Florian Markowetz. OncoNEM: inferring tumor evolution from single-cell sequencing data. Genome Biology, 17(1):69, 2016.
8. Mohammed El-Kebir. SPhyR: tumor phylogeny estimation from single-cell sequencing data under loss and error. Bioinformatics, 34(17):i671–i679, 2018.
9. Nicholas Navin, Jude Kendall, Jennifer Troge, Peter Andrews, Linda Rodgers, Jeanne McIndoo, Kerry Cook, Asya Stepansky, Dan Levy, Diane Esposito, et al. Tumour evolution inferred by single-cell sequencing. Nature, 472(7341):90, 2011.
10. Ryan P Abo, Matthew Ducar, Elizabeth P Garcia, Aaron R Thorner, Vanesa Rojas-Rudilla, Ling Lin, Lynette M Sholl, William C Hahn, Matthew Meyerson, Neal I Lindeman, et al. BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Research, 43(3):e19–e19, 2014.
11. Mohammed El-Kebir, Benjamin J Raphael, Ron Shamir, Roded Sharan, Simone Zaccaria, Meirav Zehavi, and Ron Zeira. Copy-number evolution problems: complexity and algorithms. In International Workshop on Algorithms in Bioinformatics, pages 137–149. Springer, 2016.
12. Ruofan Xia, Yu Lin, Jun Zhou, Tieming Geng, Feng Bing, and Jijun Tang. Phylogenetic reconstruction for copy-number evolution problems. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018.
13. Jun Zhou, Yu Lin, Vaibhav Rajan, William Hoskins, and Jijun Tang. Maximum parsimony analysis of gene copy number changes. In International Workshop on Algorithms in Bioinformatics, pages 108–120. Springer, 2015.
14. Russell Schwartz and Alejandro A Schäffer. The evolution of tumour phylogenetics: principles and practice. Nature Reviews Genetics, 18(4):213, 2017.
15. Anthony WI Lo, Laure Sabatier, Bijan Fouladi, Géraldine Pottier, Michelle Ricoul, and John P Murnane. DNA amplification by breakage/fusion/bridge cycles initiated by spontaneous telomere loss in a human cancer cell line. Neoplasia, 4(6):531–538, 2002.
16. Michael Marotta, Xiongfong Chen, Ayako Inoshita, Robert Stephens, G Thomas Budd, Joseph P Crowe, Joanne Lyons, Anna Kondratova, Raymond Tubbs, and Hisashi Tanaka. A common copy-number breakpoint of ERBB2 amplification in breast cancer colocalizes with a complex block of segmental duplications. Breast Cancer Research, 14(6):R150, 2012.
17. Megha Rajaram, Jianping Zhang, Tim Wang, Jinyu Li, Cem Kuscu, Huan Qi, Mamoru Kato, Vladimir Grubor, Robert J Weil, Aslaug Helland, et al. Two distinct categories of focal deletions in cancer genomes. PLoS One, 8(6):e66264, 2013.
18. Yu Liu, Chong Chen, Zhengmin Xu, Claudio Scuoppo, Cory D Rillahan, Jianjiong Gao, Barbara Spitzer, Benedikt Bosbach, Edward R Kastenhuber, Timour Baslan, et al. Deletions linked to TP53 loss drive cancer through p53-independent mechanisms. Nature, 531(7595):471, 2016.
19. Andrew J Holland and Don W Cleveland. Boveri revisited: chromosomal instability, aneuploidy and tumorigenesis. Nature Reviews Molecular Cell Biology, 10(7):478, 2009.
20. Richard Desper, Feng Jiang, Olli-P Kallioniemi, Holger Moch, Christos H Papadimitriou, and Alejandro A Schäffer. Inferring tree models for oncogenesis from comparative genome hybridization data. Journal of Computational Biology, 6(1):37–51, 1999.
21. Jun Liu, Nirmalya Bandyopadhyay, Sanjay Ranka, Michael Baudis, and Tamer Kahveci. Inferring progression models for CGH data. Bioinformatics, 25(17):2208–2215, 2009.
22. Roland F Schwarz, Anne Trinh, Botond Sipos, James D Brenton, Nick Goldman, and Florian Markowetz. Phylogenetic quantification of intra-tumour heterogeneity. PLoS Computational Biology, 10(4):e1003535, 2014.
23. Guillaume Fertin, Anthony Labarre, Irena Rusu, Stéphane Vialette, and Eric Tannier. Combinatorics of Genome Rearrangements. MIT Press, 2009.
24. Ron Zeira, Meirav Zehavi, and Ron Shamir. A linear-time algorithm for the copy number transformation problem. Journal of Computational Biology, 24(12):1179–1194, 2017.
25. Manuel Lafond, Krister M Swenson, and Nadia El-Mabrouk. An optimal reconciliation algorithm for gene trees with polytomies. In International Workshop on Algorithms in Bioinformatics, pages 106–122. Springer, 2012.
26. Eric Letouzé, Yves Allory, Marc A Bollet, François Radvanyi, and Frédéric Guyon. Analysis of the copy number profiles of several tumor samples from the same patient reveals the successive steps in tumorigenesis. Genome Biology, 11(7):R76, 2010.
27. Letu Qingge, Xiaozhou He, Zhihui Liu, and Binhai Zhu. On the minimum copy number generation problem in cancer genomics. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 260–269. ACM, 2018.
28. Thomas Santarius, Janet Shipley, Daniel Brewer, Michael R Stratton, and Colin S Cooper. A census of amplified and overexpressed human cancer genes. Nature Reviews Cancer, 10(1):59, 2010.
29. Heae Surng Park, Min Hye Jang, Eun Joo Kim, Hyun Jeong Kim, Hee Jin Lee, Yu Jung Kim, Jee Hyun Kim, Eunyoung Kang, Sung-Won Kim, In Ah Kim, et al. High EGFR gene copy number predicts poor outcome in triple-negative breast cancer. Modern Pathology, 27(9):1212, 2014.
30. Kevin Campbell, Julie M Gastier-Foster, Meegan Mann, Arlene H Naranjo, Collin Van Ryn, Rochelle Bagatell, Katherine K Matthay, Wendy B London, Meredith S Irwin, Hiroyuki Shimada, et al. Association of MYCN copy number with clinical features, tumor biology, and outcomes in neuroblastoma: a report from the Children's Oncology Group. Cancer, 123(21):4224–4235, 2017.
31. Michael R Garey and David S Johnson. Computers and Intractability, volume 29. W. H. Freeman, New York, 2002.
32. Raimund Seidel and Cecilia R Aragon. Randomized search trees. Algorithmica, 16(4-5):464–497, 1996.
33. Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425, 1987.
34. Joseph Felsenstein. PHYLIP (phylogeny inference package), version 3.5c. Joseph Felsenstein, 1993.
35. David Aldous. Probability distributions on cladograms. In Random Discrete Structures, pages 1–18. Springer, 1996.
36. David F Robinson and Leslie R Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2):131–147, 1981.

Comparing copy-number profiles under multi-copy amplifications and deletions
Supplementary Material I

Additional proofs

Lemma 1. Let u, v be two CNPs with no null positions.
If u − v contains a staircase [a, b] of length k, then d_f(u, v) ≥ k for any unit-cost function f.

Proof of Lemma 1. We use induction on the length k of the staircase. When k = 1, it is obvious that d_f(u, v) ≥ 1. Now assume the lemma is true for values less than k, that is, for any two vectors u*, v* such that u* − v* contains a staircase of length k′ < k, we have d_f(u*, v*) ≥ k′. Suppose that two given CNPs ũ and ṽ contain a staircase of length k in interval [a, a + k − 1] in their difference vector. Let u = (ũ_a, ..., ũ_{a+k−1}) and v = (ṽ_a, ..., ṽ_{a+k−1}). By Proposition 1, d_f(ũ, ṽ) ≥ d_f(u, v), since we have only removed some positions. Moreover, u − v consists of a staircase in interval [1, k]. Let E = (e_1, ..., e_l) be a sequence of length l := d_f(u, v) satisfying u⟨E⟩ = v (note that l = d_f(u, v) because f is unit-cost). If we show that d_f(u, v) = l ≥ k, then we are done. Let us assume, for the sake of contradiction, that l < k. Under this assumption and the inductive hypothesis, we show two properties of E.

Property 1: no amplification of E affects position k, the last position of u. Assume otherwise, and suppose that some amplification event ê ∈ E affects interval [c, k] for some c ∈ [k]. By Proposition 2, we may take an amp-first reordering of E and assume that ê = e_1 is the first event of E. Let û := u⟨ê⟩, and notice that û − v must contain a staircase of length k − 1, so that k − 1 ≤ d_f(û, v) = d_f(u, v) − 1 ≤ (k − 1) − 1, a contradiction (recall that d_f(u, v) = l < k).

Property 2: every event of E affects at least one position in [1, k − 1]. Assume instead that some event ê of E affects no position in [1, k − 1]. Then ê affects only position k, and therefore we may write ê = (k, k, b). By Property 1, ê must be a deletion.
Moreover, since no amplification ever affects position k, û := u⟨ê⟩ does not have 0 at position k, and we may further assume that ê is the first event of E (since applying the other events will never make position k drop below 0). In other words, d_f(u, v) = d_f(û, v) + 1. But then û − v has a staircase in interval [1, k − 1], and by the same arguments as above, k − 1 ≤ d_f(û, v) = d_f(u, v) − 1 ≤ (k − 1) − 1, again a contradiction.

So far, we know that only deletions affect position k (Property 1), and all these deletions also affect position k − 1 (Property 2). Since u_{k−1} − v_{k−1} < u_k − v_k and v_{k−1} > 0, this implies that some amplification event ê must affect position k − 1 but not position k (otherwise the deletions affecting both positions would make position k − 1 drop below v_{k−1}). Let us assume, again using Proposition 2, that ê is the first event of E, i.e. e_1 = ê. We use the same trick for a third time. That is, let û := u⟨ê⟩ and notice that û − v has a staircase in interval [1, k − 1], so that k − 1 ≤ d_f(û, v) = d_f(u, v) − 1 ≤ (k − 1) − 1, a contradiction. Therefore the assumption l < k is false, which proves the lemma.

Lemma 2. Let u and v be two CNPs with no null positions and let f be any unit-cost function. If u − v contains a staircase in interval [1, k] and d_f(u, v) = k, then there exists a smooth sequence transforming u into v.

Proof of Lemma 2. We prove the lemma by induction over k. As a base case, the statement is easy to see when k = 1, since a single step can only be removed by a deletion, which is smooth. So assume k > 1, and that the statement holds for any pair u, v such that d_f(u, v) = k − 1 and u − v has a staircase of length k − 1 in interval [1, k − 1]: in this case, some smooth sequence transforms u into v. Let E be any sequence of k events such that u⟨E⟩ = v. If E is smooth, then we are done, so assume otherwise. The proof is divided in two parts. Assuming the inductive hypothesis, we first show that there is an optimal sequence Ê containing only deletions such that u⟨Ê⟩ = v. These deletions are not necessarily smooth. We complete the induction in a second step, where we convert this deletion sequence into a smooth one.
For the remainder of the proof, we will denote w := u − v.

Part 1: proof that u can be transformed into v using only deletions. Assume that E = (e_1, ..., e_k) contains some amplification, otherwise we are done proving our first step. We first claim that only deletions affect positions k to n, inclusively. To see this, assume on the contrary that e_i = (a, b, δ) is an amplification where b ≥ k. By Proposition 2, we may assume that e_i = e_1. But u⟨e_1⟩ − v still has a staircase in interval [1, k], and by Lemma 1, d_f(u⟨e_1⟩, v) ≥ k. This is a contradiction, since e_1 should reduce the distance from u to v. Hence our claim holds.

We now claim that, on the other hand, some amplification in E affects position k − 1. This is clearly true if every deletion affecting position k also affects position k − 1. Indeed, we have w_{k−1} < w_k and, without an amplification on k − 1, such deletions would make position k − 1 drop below v_{k−1} (recalling that v_{k−1} > 0). Thus, if we suppose that no amplification affects position k − 1, there must be some deletion e_i = (k, h, d) that affects position k but not k − 1, where here h ≥ k. Let u′ := u⟨e_i⟩. Since no amplification affects any position in [k, h], u′ has no position with value 0. Furthermore, u′ − v contains a staircase of length k − 1 in interval [1, k − 1], and it is clear that d_f(u′, v) = k − 1. By induction, there is a (smooth) deletion sequence E′ such that u′⟨E′⟩ = v. In that case, the sequence formed by e_i followed by E′ transforms u into v and has only deletions, which is what we want. Thus we may assume that our claim saying that some amplification affects k − 1 holds.

Let e_i = (a, k − 1, δ) be an amplification in E that affects position k − 1 (but not position k). Our previous claims show that e_i exists. By Proposition 2, we may assume that e_1 = e_i. Let u′ := u⟨e_1⟩ and w′ := u′ − v. Then w′ has a staircase of length k − 1 in interval [1, k − 1] and d_f(u′, v) = k − 1, but w′ differs from w at position a. Formally, for each i ∈ [k − 1] \ {a}, w′_i − w′_{i−1} = w_i − w_{i−1}, and w′_a − w′_{a−1} = w_a − w_{a−1} + δ. By induction, u′⟨E′⟩ = v for some smooth deletion sequence E′ = (e′_1, ..., e′_{k−1}). Here, for each i ∈ [k − 1], e′_i = (i, b_i, w′_{i−1} − w′_i) for some b_i ≥ k − 1. Let (i_1, b_{i_1}, d_{i_1}), ..., (i_l, b_{i_l}, d_{i_l}) be the deletion events of E′ that affect position k, with i_1 < i_2 < ... < i_l. We distinguish two cases.

Case 1: a ∉ {i_1, ..., i_l}. Then the event (a, b_a, w′_{a−1} − w′_a) of E′ does not affect position k, meaning that b_a = k − 1. Consider the sequence E″ obtained from E′ by replacing the event (a, k − 1, w′_{a−1} − w′_a) by the event (a, k − 1, w_{a−1} − w_a). Since u′⟨E′⟩ − v has a 0 everywhere and w′_a − w′_{a−1} = w_a − w_{a−1} + δ, it follows that u′⟨E″⟩ − v has value 0 everywhere, except at positions from a to k − 1, where it has value δ. But then, the only difference between u and u′ is that positions a to k − 1 are higher by δ in u′. Thus u⟨E″⟩ − v has a value of 0 everywhere (and u never drops below 0, due to the smoothness of E′). This means that u⟨E″⟩ = v, which is a contradiction since E″ has k − 1 events.

Case 2: a = i_h for some h ∈ [l]. Then the deletion of E′ starting at a is (a, b_a, w′_{a−1} − w′_a) = (a, b_a, w_{a−1} − w_a − δ) and affects position k, i.e. b_a ≥ k. Consider the sequence E″ obtained from E′ by replacing the event (a, b_a, w_{a−1} − w_a − δ) by (a, b_a, w_{a−1} − w_a). Then u′⟨E″⟩ − v has a 0 everywhere, except at positions from a to b_a, where it has value δ. Also, u⟨E″⟩ − v has a 0 everywhere, except at positions from k to b_a, where it has value δ. We can apply the deletion (k, b_a, −δ) to u⟨E″⟩ to obtain v. Since E″ has k − 1 events, we have found k deletions transforming u into v.

This concludes the first part. That is, we have shown that if our inductive hypothesis holds, then some deletion sequence of length k transforms u into v.

Part 2: construction of a smooth sequence. Now let Ê = (ê_1, ..., ê_k) be a sequence of k deletions transforming u into v, which exists by Part 1. Let (1, b, δ) be any deletion affecting position 1. Since Ê contains only deletions, it is safe to assume that ê_1 = (1, b, δ). Let u′ := u⟨ê_1⟩ and w′ := u′ − v.
If −δ < w_1, then w′ still contains a staircase of length k and we reach a contradiction, since this implies d_f(u′, v) ≥ k. If −δ > w_1, then u′_1 < v_1, which cannot be repaired since Ê has only deletions. We deduce that −δ = w_1. It follows that w′ has a staircase of length k − 1 in interval [2, k]. No event of Ê can affect position 1 after ê_1, so we can ignore this position in u′ and w′. That is, suppose we remove position 1 from u′ and v, yielding two vectors u″ and v″ of length n − 1. Let w″ := u″ − v″. Then w″ has a staircase of length k − 1 in interval [1, k − 1]. By induction, there is a smooth sequence E″ of length k − 1 transforming u″ into v″. This easily translates into a sequence Ê′ transforming u′ into v: we just "shift" every event to the right to account for position 1. To be specific, we replace any event (s, t, ε) from E″ by the event (s + 1, t + 1, ε) in Ê′. Since E″ is smooth, we can write Ê′ = ((2, b_2, ε_2), ..., (k, b_k, ε_k)) where, for each i ∈ {2, ..., k}, b_i ≥ k and ε_i = w′_{i−1} − w′_i.

We have not shown smoothness yet, because ê_1 might not affect the whole [1, k] interval as we wish. If indeed ê_1 affects position k, i.e. if b ≥ k, then it is easy to see that applying ê_1 followed by Ê′ is a smooth sequence transforming u into v. Thus we may assume that b < k. Observe that w′_i − w′_{i−1} = w_i − w_{i−1} for all i ∈ {2, ..., k} \ {b + 1}, because w′_{b+1} − w′_b = w_{b+1} − w_b + w_1 (recall that −δ = w_1). Let (b + 1, b′, w_b − w_{b+1} − w_1) be the deletion of Ê′ that starts at position b + 1, where b′ ≥ k by smoothness. Suppose that we replace it with the deletion (b + 1, b′, w_b − w_{b+1}) in Ê′, yielding an alternate sequence Ẽ′. Then u′⟨Ẽ′⟩ − v has a 0 everywhere, except at positions b + 1 to b′, where it has value w_1. This means that if, in Ê, we replace ê_1 by ẽ := (1, b′, −w_1) and follow it by Ẽ′, we obtain a sequence transforming u into v. Now, let ũ := u⟨ẽ⟩.
If we remove position 1 from ũ (recalling that ũ_1 = v_1) and from v, we obtain a pair of CNPs whose difference has a staircase at [1, k − 1]. By induction, there is a smooth sequence on these shortened vectors, which we can shift into a sequence Ẽ applicable to ũ (just as we did from E″ to Ê′). It is then straightforward to see that ẽ followed by Ẽ is a smooth deletion sequence turning u into v.

Theorem 1. The CNP-transformation problem is strongly NP-hard for any deletion-permissive unit-cost function, even if the CNPs have no null positions.

Proof of Theorem 1. From a 3-PARTITION instance S = {s_1, ..., s_n} with n = 3m elements and target sum t, construct u and v as follows. First define K := 100n and, for all i ∈ [n], put p_i := Σ_{j=1}^{i} s_j, the idea being that p_i and p_{i−1} differ by an amount of s_i. Then put v as a vector containing only 1s. For u, construct it by adding one position at a time from left to right: first insert the values i + 1 + Kp_i for i = 1..n, and then the values i(Kt + 3) + 1 for i = m..1. That is, let

v = (1, 1, ..., 1)
u = (2 + Kp_1, 3 + Kp_2, ..., n + 1 + Kp_n, m(Kt + 3) + 1, ..., (Kt + 3) + 1).

This can be done in polynomial time in n (in particular, each p_i is polynomial). Observe that we have

w = (1 + Kp_1, ..., n + Kp_n, m(Kt + 3), ..., Kt + 3).

In particular, w has a staircase in interval [1, n], followed by a decreasing staircase in interval [n + 1, n + m]. By Lemma 1, we know that d_f(u, v) ≥ n. We will show that S is a YES-instance of 3-PARTITION if and only if d_f(u, v) = n.

(⇒): Suppose that there exist m triplets S_1, ..., S_m such that Σ_{s∈S_i} s = t for all i ∈ [m]. We may assume that each s_i ∈ S is distinguishable, so that for each s_i there is a unique k such that s_i ∈ S_k. We construct a sequence E = (e_1, ..., e_n) of n deletions such that u⟨E⟩ = v. For each i ∈ [n], put e_i = (i, n + k, w_{i−1} − w_i), where k is the unique integer such that s_i ∈ S_k. Note that the e_i events are allowed because f is deletion-permissive (this is actually the only place where we need this assumption).
One can check that E is a smooth deletion sequence, and it is clear that positions 1 to n become equal to 1 after applying E on u. Now consider the events that end at position n + k, for k ∈ [m]. For each s_i ∈ S_k, there is such an event that decreases all the positions n + 1 to n + k by w_i − w_{i−1} = Ks_i + 1. We get Σ_{s_i∈S_k}(Ks_i + 1) = Kt + 3. Since this is true for every position from n + 1 to n + m, the total decrease for a position n + k, k ∈ [m], will be Σ_{j=k}^{m}(Kt + 3) = (m + 1 − k)(Kt + 3), which is exactly w_{n+k}. Hence u⟨E⟩ = v.

(⇐): Assume that d_f(u, v) = n. Let E = (e_1, ..., e_n) be an optimal sequence of events transforming u into v. By Lemma 2, we may assume that E is smooth. Thus each e_i is a deletion of the form (i, b_i, w_{i−1} − w_i) = (i, b_i, −(Ks_i + 1)), where b_i ∈ [n, n + m]. Let us define S_k := {s_i : b_i = n + k}. We claim that Σ_{s_i∈S_k}(Ks_i + 1) = Kt + 3. For k = m, this must be true since w_{n+m} = Kt + 3. For k < m, we have the difference w_{n+k} − w_{n+k+1} = Kt + 3. This means that the deletions that affect position n + k but not n + k + 1 (i.e. those with b_i = n + k) must incur a total decrease of exactly Kt + 3, as claimed. We now argue that |S_k| = 3 for each k ∈ [m]. Notice that Σ_{s_i∈S_k}(Ks_i + 1) = K·Σ_{s_i∈S_k} s_i + |S_k| = Kt + 3. If Σ_{s_i∈S_k} s_i = t, then |S_k| = 3. Otherwise, by isolating the |S_k| term above, it is not hard to deduce that |S_k| ≥ K. However, this is impossible since |S_k| ≤ n but K > n. We have therefore shown that |S_k| = 3, which in turn implies that Σ_{s_i∈S_k} s_i = t. Therefore S is a YES instance.

Lemma 3. Let u, v be two distinct CNPs with no null positions, and let w := u − v. Then for any unit-cost function f, d_f(u, v) ≥ ⌈(|F_w| − 2)/2⌉.

Proof of Lemma 3. We prove the lemma by induction on d_f(u, v).
As a base case, when d_f(u, v) = 1, F_w has 3 flat intervals: the two extreme ones and the flat interval that gets affected by the single event transforming u into v (recall that we have artificial positions w_0 = 0 and w_{n+1} = 0, which guarantee that there are always two extreme intervals plus another one somewhere in [1, n]). The statement is clearly true in this case, as ⌈(|F_w| − 2)/2⌉ = 1.

Now assume that the lemma holds for any pair of CNPs u′, v′ satisfying d_f(u′, v′) < d_f(u, v). Let E = (e_1, ..., e_k) be an optimal sequence of events such that u⟨E⟩ = v. Let û := u⟨e_1⟩ and ŵ := û − v. Let e_1 = (c, d, x), where x could be negative in case of a deletion. Let F′_w = {[a, b] ∈ F_w : [a, b] ∩ [c, d] ≠ ∅} be the affected flat intervals. Assume that F′_w has l ≥ 1 elements, say F′_w = {[a_1, b_1], ..., [a_l, b_l]}, ordered so that b_i + 1 = a_{i+1} for each i ∈ [l − 1]. Consider an interval [a_i, b_i] with 2 ≤ i ≤ l − 1. Note that [a_i, b_i] cannot be an extreme flat interval in w. We claim that [a_i, b_i] must still be a non-extreme flat interval in ŵ. To see this, observe that ŵ_{a_i−1} = w_{a_i−1} + x and ŵ_{a_i} = w_{a_i} + x. Since w_{a_i−1} ≠ w_{a_i} by maximality, we have ŵ_{a_i−1} ≠ ŵ_{a_i}. By a similar argument, ŵ_{b_i+1} ≠ ŵ_{b_i}. And because all values in [a_i, b_i] have changed by the same amount x, [a_i, b_i] is a (maximal) flat interval (note that we need the assumption of no null positions to argue that all positions change by the same amount). Moreover, [a_i, b_i] cannot be extreme. If instead [a_i, b_i] was in the extreme interval containing w_0, then we would have ŵ_h = 0 for all 0 ≤ h ≤ b_i. In particular, this would imply ŵ_{a_i−1} = ŵ_{a_i}, contrary to what we just argued. The same occurs if we assume that [a_i, b_i] is part of the extreme interval containing w_{n+1}. Now consider any flat interval [a, b] ∈ F_w \ F′_w.
It is easy to see that [a, b] is still a flat interval in ŵ, unless perhaps b + 1 = a_1 or a − 1 = b_l. In these cases, it is possible that ŵ_b = ŵ_{a_1} and/or ŵ_a = ŵ_{b_l}. These have the effect of “merging” two flat intervals, effectively eliminating [a_1, b_1] and/or [a_l, b_l] (note that the argument also holds when [a_1, b_1] or [a_l, b_l] becomes part of an extreme interval). Since every flat interval except these two survives in ŵ, it follows that |F_ŵ| ≥ |F_w| − 2. Then, using induction, d_f(u, v) = 1 + d_f(û, v) ≥ 1 + ⌈(|F_ŵ| − 2)/2⌉ ≥ 1 + ⌈(|F_w| − 4)/2⌉ = ⌈(|F_w| − 2)/2⌉.

Lemma 4. Suppose that v_i = v_{i+1} = 0 for some position i. Then removing position i or i + 1, whichever is smaller in u, from u and v preserves the distance between u and v. Formally, for any unit-cost function f, if u_i ≥ u_{i+1}, then d_f(u, v) = d_f(u_{−{i+1}}, v_{−{i+1}}). Similarly, if u_{i+1} ≥ u_i, then d_f(u, v) = d_f(u_{−{i}}, v_{−{i}}).

Proof of Lemma 4. Assume that u_i ≥ u_{i+1} (the other case is identical). We know that d_f(u, v) ≥ d_f(u_{−{i+1}}, v_{−{i+1}}) by Proposition 1. We consider the converse bound. Take any sequence E = (e_1, ..., e_k) of events transforming u_{−{i+1}} into v_{−{i+1}}. Modify E to transform u into v as follows: each event affects the same positions as before (including those that have shifted after reinserting i + 1), but we ensure that every event affecting position i also affects position i + 1. To be formal, define E′ = (e′_1, ..., e′_k) as follows. If e_h increases the interval [a, b] by δ (which is possibly negative), then make e′_h increase the interval [a′, b′] by δ, where

a′ = a if a ≤ i, and a′ = a + 1 if a > i;
b′ = b if b < i, and b′ = b + 1 if b ≥ i.

This way, ignoring position i + 1 in u and v, every position reaches the same value as before. Also, because u_i ≥ u_{i+1}, position i + 1 reaches 0 after applying E′ on u.

Lemma 5.
Suppose v_i = 0 for some position i and that w_{i−1} ≥ w_i or w_{i+1} ≥ w_i. Then d_f(u, v) = d_f(u_{−{i}}, v_{−{i}}) for any unit-cost function f.

Proof of Lemma 5. The proof is essentially the same as that of Lemma 4. If, without loss of generality, w_{i−1} ≥ w_i, we can take an event sequence from u_{−{i}} to v_{−{i}} and adapt it so that every event affecting position i − 1 also affects position i. This guarantees that position i drops to 0. We omit the technical details.

Finding good events in time O(n log n)

We say that an event e is good if applying it on u reduces |F_w| by 2. Here we present the detailed version of our improved heuristic. The main algorithm below transforms u into v by making calls to the findGoodEvent subroutine, which is defined afterwards.

Data: vectors u, v
Result: a sequence of events that transforms u into v

compute w := u − v;
initialize an empty sequence S;
while u ≠ v do
    if findGoodEvent(u, v, w) returns (i, j, x) then
        add (i, j, x) to S;
        for k = i, ..., j do u_k := max(u_k + x, 0);
    else
        find the first flat interval [i, j] with w_i ≠ 0;
        increase u_i, ..., u_j by −w_i;
        add (i, j, −w_i) to S;
    update w := u − v;
return S

Algorithm 1: Main algorithm

The algorithm findGoodEvent below can be implemented in time O(n log n). Our goal is to find a range of positions [i, j] that verifies w_i − w_{i−1} = w_j − w_{j+1} := −δ. We further need that δ > 0, or that δ < 0 and u_k ≥ −δ for all k ∈ [i, j]: we can then apply the event (i, j, δ). To achieve this, the idea is simply to scan w from left to right. Each time we detect a change w_k − w_{k+1} ≠ 0, we check whether we have encountered the matching amount of change before at some position k′ (this is −δ in the algorithm). If so, we can return the pair (k′, k), since it can be part of a good event.
Otherwise, we map δ = w_{k+1} − w_k to position k + 1, to store the fact that k + 1 is the latest position that could be matched with a change of δ. The last line of the for loop ensures that if we match two positions k′ < k, all the positions in between are sufficiently high to allow a deletion of amount δ.

Data: vectors u, v, w
Result: an event that reduces |F_w| by 2

initialize an empty dictionary R;
for k = 1, ..., n − 1 do
    δ := w_{k+1} − w_k;
    if δ = 0 then continue;
    if −δ ∈ R then
        return (R[−δ], k, δ);
    else
        set R[δ] := k + 1;
    delete all the key/value pairs (x, y) in R with u_k ≤ x;
return “no possible event”

Algorithm 2: findGoodEvent

We argue two things: that findGoodEvent does find a good event if there is one, and that it can be implemented to take time O(n log n).

Proof that Algorithm findGoodEvent returns an event (i, j, δ) that reduces |F_w| by 2 when one exists. Consider an output (i, j, δ). By construction, we had −δ ∈ R, which can only have been inserted with −δ = w_i − w_{i−1} and δ = w_{j+1} − w_j, so w_{i−1} − w_i = w_{j+1} − w_j, in which case it is easy to see that |F_w| is reduced by 2. Furthermore, if δ < 0 and there were some k ∈ [i, j] with u_k ≤ −δ, then the k-th iteration would have deleted the key −δ from R. This means that (i, j, δ) is indeed an event that reduces |F_w| and does not make any u_k drop to 0.

Conversely, if there is an event (i, j, δ) to be found, we want to prove that the algorithm returns something (not necessarily the same event). If the algorithm exits before iteration j, it returns some event that, as we have already proven, must be correct. So assume that we do not exit the loop before iteration j: we have added the key −δ at iteration i − 1, and it is still in R because, by hypothesis, we did not have u_k ≤ −δ for any k ∈ [i, j]. Since −δ is in R and w_{j+1} − w_j = δ, the algorithm returns (i, j, δ).

Complexity.
The complexity of findGoodEvent depends on the following operations: we need to be able to test the existence of a key in a dictionary, to add a key/value pair and, a bit less usual, to delete all the keys that are at least a certain amount (the last line of findGoodEvent). We can use a treap structure (see [1]), a form of binary search tree that allows splitting the stored values into those higher and those lower than a given number in O(log n) time. This gives a total complexity of O(n log n).

References

[1] Raimund Seidel and Cecilia R. Aragon. Randomized search trees. Algorithmica, 16(4-5):464–497, 1996.

Comparing Copy-Number Profiles Under Multi-Copy Amplifications and Deletions - Supplementary Material II

Additional experimental results

For each combination of values of r ∈ {0.01, 0.05, 0.1} and q ∈ {0.25, 0.5, 0.75, 1}, we ran four sets of experiments, each one isolating one of the l, n, ∆ and (e_min, e_max) parameters (as in the main text). We first present the results obtained for each possible r and q on error-free data. We then show the results obtained on noisy data, again for each possible r and q. For each error rate α ∈ {0, 0.1, 0.25, 0.5, 1}, we repeated the same scheme of isolating every parameter. The number of combinations of the parameters r, q, α is 60, and for each combination we again isolated one of l, n, ∆ and (e_min, e_max). We do not list every plot here: we show the results for error rates in {0.1, 0.25, 0.5} with the default r and q (error rate 0 corresponds to the error-free data above, and error rate 1 always results in bad trees). All the missing plots can be accessed through the public data directories (download instructions are provided in the README file of the cnp2cnp git project).

We note that all the experiments can be reproduced by downloading the git project and executing the command

> python3 autosimulation.py

This Python script generates trees, infers reconstructions and evaluates them for all possible parameter values. Expect 1-3 weeks of running time on a regular PC.
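As a concrete companion to Algorithm 2 above, here is an illustrative Python sketch of findGoodEvent (the names and index handling are ours, not taken from the cnp2cnp implementation). For simplicity it keeps the dictionary keys in a sorted list and discards all keys x with u_k ≤ x via bisect; the treap of [1] would make that deletion an O(log n) split, whereas this list-based version is simpler but not asymptotically optimal.

```python
from bisect import bisect_left, insort

def find_good_event(u, v):
    """Sketch of findGoodEvent: look for an event (i, j, delta) with
    w_{i-1} - w_i = w_{j+1} - w_j = delta, i.e. one that removes two flat
    intervals of w = u - v.  Positions are 1-based as in the pseudocode;
    returns None when no good event is found."""
    n = len(u)
    w = [None] + [u[k] - v[k] for k in range(n)]  # w[1..n], 1-based
    R = {}      # change delta -> latest position that could start an event
    keys = []   # sorted keys of R, so we can drop large keys quickly
    for k in range(1, n):                 # k = 1, ..., n-1
        delta = w[k + 1] - w[k]
        if delta != 0:
            if -delta in R:
                return (R[-delta], k, delta)
            if delta not in R:
                insort(keys, delta)
            R[delta] = k + 1
        # drop every stored key x with u_k <= x: a deletion of amount x
        # crossing position k would make u_k drop to 0 or below
        cut = bisect_left(keys, u[k - 1])  # u[k-1] is u_k in the 0-based list
        for x in keys[cut:]:
            del R[x]
        del keys[cut:]
    return None
```

For instance, with u = (3, 3, 5, 5, 3, 3) and v = (3, 3, 3, 3, 3, 3), we get w = (0, 0, 2, 2, 0, 0) and the sketch returns the good deletion (3, 4, −2).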
[Figures 1–12: violin plots on error-free data, one figure per combination of r ∈ {0.01, 0.05, 0.1} and q ∈ {0.25, 0.5, 0.75, 1}. Each figure compares heuristic, flat, ZZS and Euclidean while varying l ∈ {10, 50, 100}, n ∈ {10, 100, 250}, duplication rate ∆ ∈ {0.25, 0.5, 0.75} and events per branch (e_min, e_max) ∈ {(2, 4), (5, 10), (20, 40)}.]

[Figures 13–16: noisy data with r = 0.05 and q ∈ {0.25, 0.5, 0.75, 1}, for error rates 0.1, 0.25 and 0.5.]

[Figures 17–19: violin plots for q = 0.25 and error rates α = 0.1, 0.25 and 0.5, varying the same parameters as above.]

Figure 3: Violin plots of the normalized RF distances for our improved approximation (heuristic), the algorithm that counts flat intervals (flat), the ZZS algorithm for the MEDICC model (ZZS), and the Euclidean distance on error-free data. Each plot summarizes 50 reconstructed trees with, from left to right: (top row) l = 10, 50 and 100 leaves; (second row) n = 10, 100 and 250 genes per chromosome; (third row) duplication rate ∆ = 0.25, 0.5 and 0.75; (fourth row) possible number of events per branch (e_min, e_max) = (2, 4), (5, 10) and (20, 40).

Figure 4: Violin plots of the normalized RF distances for the four same approaches with varying parameters q and r (on error-free data). Other parameters were set to their default values as described in the text. The plot for q = 0.25 is not shown: it is identical to the n = 100 plot from Figure 3.

Figure 5: Violin plots of the average normalized RF distances for the four same approaches with varying error rates α ∈ {0, 0.1, 0.25, 0.5, 1}.