NAPX: A Polynomial Time Approximation Scheme for the Noah's Ark Problem
aa r X i v : . [ c s . D S ] O c t NAPX: A Polynomial Time ApproximationScheme for the Noah’s Ark Problem
Glenn Hickey , Paz Carmi , Anil Maheshwari , and Norbert Zeh School of Computer Science, Carleton University, Ottawa, Ontario, Canada Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada [email protected], [email protected], [email protected],[email protected]
Abstract.
The Noah’s Ark Problem (NAP) is an NP-Hard optimiza-tion problem with relevance to ecological conservation management. Itasks to maximize the phylogenetic diversity (PD) of a set of taxa given afixed budget, where each taxon is associated with a cost of conservationand a probability of extinction. NAP has received renewed interest withthe rise in availability of genetic sequence data, allowing PD to be usedas a practical measure of biodiversity. However, only simplified instancesof the problem, where one or more parameters are fixed as constants,have as of yet been addressed in the literature. We present NAPX, thefirst algorithm for the general version of NAP that returns a 1 − ǫ ap-proximation of the optimal solution. It runs in O „ nB h ( log n +log ǫ ) log (1 − ǫ ) « time where n is the number of species, and B is the total budget and h is the height of the input tree. We also provide improved bounds for itsexpected running time. Key words:
Noah’s Ark Problem, phylogenetic diversity, approxima-tion algorithm
Measures of biodiversity are commonly used as indicators of environmentalhealth. Biodiversity is presently being lost at an alarming rate, due largely to hu-man activity. It is speculated that this loss can lead to disastrous consequencesif left unchecked [9]. Consequently, the discipline of conservation biology hasarisen and a considerable amount of resources are being allocated to researchand implement conservation projects around the world.A conservation strategy will necessarily depend on the measure of biodiversityused. Traditionally, indices based on species richness and abundance have beenused to quantify the biodiversity of an ecosystem [8]. These indices are based oncounting and do not account for genetic variance. Phylogenetic diversity (PD)[3] addresses this issue by taking into account evolutionary relationships de-rived from DNA or protein samples. The use of PD in biological conservationas become increasingly widespread as more phylogenetic information becomesavailable [7]. It is also used to determine diverse sequence samples in comparativegenomics [10].The Noah’s Ark Problem (NAP) [13] is an abstraction of the fundamentalproblem of many conservation projects: how best to allocate a limited amount ofresources to maximally conserve phylogenetic diversity. This is in turn a general-ization of the Knapsack Problem [4] and is therefore NP-Hard. Several algorithmshave been proposed to solve special cases of the problem but, as yet, no non-heuristic solutions have been proposed to solve general instances of NAP. Giventhat NAP itself is an abstraction of realistic scenarios, it is important to have ageneral solution in order to be able to extend this framework to useful applica-tions. For this reason, we present an algorithm that can be used to compute anapproximate solution for NAP in polynomial time, so long as the approximationfactor is held constant, and total budget is polynomial in the input size.
Throughout this paper, we use the following definition of a phylogenetic tree T , with notation consistent with that of [6]. T has a root of degree 2, interiorvertices of degree 3, and n leaves, each associated with a species from set X .If an edge e of T is incident to a leaf, it is called a pendant edge. Otherwise e has exactly two adjacent edges, l and r , below it (not on the path from e tothe root) and these are referred to as e ’s children. λ is a function that assigns anon-negative branch length to each edge in T . The phylogenetic diversity of T ,PD( T ) is defined as P D ( T ) = X e λ ( e ) , (1)where the summation is over each edge e of the tree. Intuitively, this measurecorresponds to the amount of evolutionary history represented by T .The Noah’s Ark Problem has the objective of maximizing the expected PD, E ( P D ), under the following constraints. Each taxon i ∈ X is associated with aninitial survival probability a i , which can be increased to b i at some integer cost c i ; and the total expenditure cannot exceed the budget B . Since B is a factor inthe running time, we assume that the budget and each cost have been dividedby the greatest common divisor of all the costs. In the original formulation ofNAP, each species was also associated with a utility value. However, in [6] itwas shown that these values are redundant as they can be incorporated intothe branch lengths without altering the problem. To avoid accounting for de-generately small probability values, we make the assumption that the conservedsurvival probabilities are not exponentially small in n . In other words, there ex-ists a constant k such that b i ≥ n − k for each i ∈ X . We feel this assumption isreasonable as it is unrealistic that money would be allocated to obtain such anegligible probability of survival.If a species survives, the information represented by its path to the root isconserved. Consequently, the probability that an edge survives is equivalent to2he probability that at least one leaf below it in T survives. Let C e be the setof leaves below e in the tree and S be the set of species selected for protection. E ( P D | S ), can be derived from (1) as follows: E ( P D | S ) = X e λ ( e ) − Y i ∈ C e ∩ S (1 − b i ) Y j ∈ C e − S (1 − a j ) , (2)where the summation is over all edges. NAP asks to maximize E ( P D | S ) subjectto X s ∈ S c s ≤ B. Our algorithm is based on decomposing T into clades which are associated withthe edges of the tree. A clade corresponding to edge e , denoted K e , is the minimalsubtree of T containing e and C e , the set of leaves below it. The E ( P D ) of K e can be computed as in (2) but summing only over edges in the clade. The entiretree can be considered a clade by attaching an edge of length 0 to its root. If e has two descendant edges l and r , then we say K e has two child clades K l and K r . Let a i c i −→ b i NAP refer to the problem as described above, where the survivalprobabilities and cost of each taxon are input variables. Fixing one or more ofthese variables as constants produces a hierarchy of increasingly simpler sub-problems [11]. The simplest, 0 −→ B leaves whose induced subtree (including the root) has maximum PD and can besolved by a greedy algorithm [12] [10]. 0 c i −→ − x i ) −→ (1 − κx i ) for general trees where x i is a variable probability and κ is a constant factor such that 0 ≤ κ ≤ c i −→ c i −→ B is polynomial in n . They also show that any instance of a i c i −→ c i −→ c i −→ K e is b , then there are b + 1 ways to split it across K l and K r . By solving these b + 1 pairs of subproblems, the optimal solution can befound in the pair with maximum total E ( P D ) (plus the expected contribution3f e ). Recursively proceeding in this fashion from the root down would not yieldan efficient algorithm as the number of possible budget divisions increases expo-nentially with each level of the tree. Instead, the clades are processed bottom-upfrom the leaves. All b + 1 scores are computed and stored in a dynamic program-ming table for each clade. Each score can be determined by taking the maximumof b + 1 possible scores of its child clades, which are already computed or com-puted directly from (2) if the clade contains a single leaf. Each table entry cantherefore be computed in O ( B ) time. There are O ( B ) entries per clade and O ( n )clades in the tree giving a total running time of O ( nB ).This procedure does not work for a i c i −→ b i NAP because this version of theproblem does not display the same optimal substructure [11]. In 0 c i −→
1, thedynamic programming algorithm implicitly maximizes the survival probabilityof the clade in addition to its E ( P D ) value. The total score of the tree is afunction of both of these values which is why the algorithm works for this case.In a i c i −→ b i NAP, a budget assignment that maximizes survival probability of theclade does not guarantee that it will have maximal E ( P D ) and vice versa. Thecorrect allocation cannot be made without knowledge of the entire tree; hence,the optimal substructure exploited by [11] for 0 c i −→ B = 3. The optimalsolution is to conserve w and y for E ( P D ) = 225. However, locally computingthe best allocation of budget 1 for the clade containing y and z will select z forconservation, and any chance of obtaining the optimal solution will be lost. Inthis case, it is more important to maximize the survival probability of the claderather than E ( P D ), but there is no way for an algorithm to be aware of thiswithout globally solving the entire tree. B=35100 x
115 5y w zcba =0zzz=0.5=1 c Fig. 1.
An example why the dynamic programming algorithm of [11] does notwork for general instances of NAP. The optimal allocation for the clade contain-ing y and z is not part of a globally optimal solution.4 NAPX Algorithm
In this section we present NAPX, an O (cid:18) nB h ( log n +log ǫ ) log (1 − ǫ ) (cid:19) dynamic program-ming algorithm for a i c i −→ b i NAP that produces a (1 − ǫ )-approximation of theoptimal solution, where h denotes the height of T . As that in [11], our algorithmis only polynomial if B is polynomial in n . This assumption is justifiable if, forexample, B is expressed in millions of dollars and its value will be a reason-ably small integer. Without loss of generality, we also assume that no single costexceeds the budget.NAPX essentially generalizes the dynamic programming table of [11] by com-puting for each clade, each desired survival probability of the clade, and anybudget between 1 and B , the maximum E ( P D ) score achievable while guaran-teeing this survival probability. This way, we need not make the choice betweenmaximizing E ( P D ) or probability as the tables are constructed. From the defi-nition of E ( P D ) in (2), the probability of survival of an edge can be written asa function of its two children. Let P e denote the survival probability of edge e ,and l and r be e ’s children. Then P e = P l + P r − P l P r . (3)In the optimal solution for NAP on T , assume b dollars are assigned to clade K e and e survives with probability p . It follows that i and b − i dollars are assignedto K l and K r respectively where 0 ≤ i ≤ b . These subclades must survive withprobabilities j and p − j − j (or 0 when p = j = 1), for some 0 ≤ j ≤ p , in order tosatisfy (3). Because the probability is continuous, we discretize it into intervalsby rounding it down to the nearest multiple of a chosen constant α . Probabilitiesless than a chosen cutoff value p min are rounded to zero. p ∈ n , α ⌈ log α p min ⌉ , ..., α , α, o If two non-zero probabilities lie in the same interval, their ratio is at most α .If they are in consecutive intervals, their ratio is likewise bounded by α . Fornotational convenience, we define a mapping π ( · ) that rounds a probability tothe lower bound of its corresponding interval. π ( p ) = ( p < p min α ⌈ log α p ⌉ otherwise.We now formally describe our algorithm. For each edge e , we construct a two-dimensional table T e where T e ( b, p ) stores the optimal expected diversity of K e given that b dollars are assigned to it and it survives with a probability that liesno less than p . The table is constructed in the following manner if e is a pendant5dge incident to the leaf for species s . T e ( b, p ) = a s λ ( e ) if b < c s and p = π ( a s ), b s λ ( e ) if b ≥ c s and p = π ( b s ), or −∞ otherwise. (4)Otherwise, T e is computed from the tables of its two children, T l and T r . T e ( b, p ) = pλ ( e ) + max i,j,k { T l ( i, j ) + T r ( b − i, k ) } (5)subject to i ∈ { , , , ..., b } ,j, k ∈ { , α ⌈ log α p min ⌉ , ..., α , α, } ,π ( j + k − jk ) = p The E ( P D ) score for the entire tree can be obtained by attaching an edge e r of length 0 to the root and finding max j { T e r ( B, j ) } . The tables are computedfrom the bottom up, and each time an entry is filled, pointers are kept to the twoentries in the child tables from which it was computed. This way the optimalbudget allocation can be obtained by following the pointers down from the entryfor the optimal score for e r . In this section, we express the worst-case approximation ratio as a function ofthe constants p min and α introduced above, beginning with p min . Note thatsince any species s with c s > B can be transformed into a new species s ′ with c s ′ = 0 , b s ′ = a s and a s ′ = a s without affecting the outcome, we can safelyassume that c s ≤ B for all s ∈ X . Lemma 1.
Let I be an instance of NAP for which there exists a constant k suchthat b i ≥ n − k ≥ p min for all i ∈ S . Consider a transformed instance I ′ whereall a i values in the range (0 , p min ) are rounded to 0. Let OP T ( I ) and OP T ( I ′ ) be the expected PD scores of the optimal solutions to I and I ′ respectively. Thenthe ratio of these scores is bounded as follows: OP T ( I ′ ) ≥ (1 − n k +1 p min ) OP T ( I ) Proof.
Let path( s ) be the set of edges comprising the path from leaf s to theroot. We define w ( s ) as the expected diversity of the path from s to the root if s is conserved: w ( s ) = b s X e ∈ path( s ) λ ( e ) . Let w max = max s ∈ X { w ( s ) } . This value allows us to place a trivial lower boundon the optimal solution (recalling that we can assume that c s ≤ B ). w max ≤ OP T ( I ) . (6)6e also observe that if any species s survives with a non-zero probability smallerthan p min in the optimal solution, its contribution to OPT( I ) will be boundedby p min w ( s ) b s . It follows that OP T ( I ′ ) ≥ OP T ( I ) − X s ∈ X p min w ( s ) b s . Since b s ≥ n − k and w ( s ) ≤ w max , we can express the bound as OP T ( I ′ ) ≥ OP T ( I ) − n p min w max n − k . Dividing by OPT( I ) yields OP T ( I ′ ) OP T ( I ) ≥ − n k +1 p min w max OP T ( I ) . From (6) we obtain
OP T ( I ′ ) OP T ( I ) ≥ − n k +1 p min , which completes the proof. ⊓⊔ The size of the probability intervals in the tables, determined by α , also affectsthe approximation ratio. This relationship is detailed in the following lemma. Lemma 2.
Let
OP T e ( b, p ) denote the optimal expected PD score for clade K e if e survives with probability exactly p and b dollars are allocated to it. Nowconsider an instance of NAP such that all a s and b s are either 0 or at least p min . For any OP T e ( b, p ) where e is at height h in the tree, there exists a tableentry T e ( b, p ′ ) constructed by NAPX such that the following conditions hold: i ) T e ( b, p ′ ) ≥ α h OP T e ( b, p ) ii ) p ′ ≥ α h p Proof. If p = 0, then OP T e ( b, p ) = 0 and the lemma holds. For the remainderof the proof, we assume p ≥ p min . The proof will proceed by induction on h ,the height of e in the tree, beginning with the base case where h = 1 and e is apendant connected to leaf s . We need only consider the cases where the optimalsolution is defined. So without loss of generality, assume we have OP T e ( b, a s ) = λ ( e ) a s . From (4), we know there is an entry T e ( b, π ( a s )) = a s λ ( e ) and thereforeboth i ) and ii ) hold.We now assume that the lemma holds for h ≤ x and consider some edge e atheight x + 1. By definition, OP T e ( b, p ) can be expressed in terms of its children l and r . OP T e ( b, p ) = pλ ( e ) + OP T l ( i, j ) + OP T r ( b − i, k )7here j + k − jk = p . From the induction hypothesis, there exist T l ( i, j ′ ) ≥ α x OP T l ( i, j ) and T r ( b − i, k ′ ) ≥ α x OP T r ( b − i, k )where j ′ ≥ α x j and k ′ ≥ α x k . Let q = j ′ + k ′ − j ′ k ′ . It follows that q ≥ α x j + α x k − α x jk ≥ α x p. (7)The left inequality in (7) holds because j ′ + k ′ − j ′ k ′ increases as j ′ or k ′ increase,so long as their values do not exceed 1. This can be checked by observing thatthe partial derivatives with respect to j ′ and k ′ are 1 − k ′ and 1 − j ′ , respectively. T l ( i, j ′ ) and T r ( b − i, k ′ ) will be considered when computing the entry T e ( b, p ′ )where p ′ = π ( q ). Since q ≥ p min , we have π ( q ) ≥ αq because it simply rounds q to the nearest multiple of α . Therefore, p ′ ≥ α x +1 p and T e ( b, p ′ ) can be expressedas follows. T e ( b, p ′ ) ≥ p ′ λ ( e ) + T l ( i, j ′ ) + T r ( b − i, k ′ ) ≥ α x +1 pλ ( e ) + α x OP T l ( i, j ) + α x OP T ( b − i, k ) ≥ α x +1 ( pλ ( e ) + OP T l ( i, j ) + OP T ( b − i, k ) ≥ α x +1 OP T e ( b, p ) ⊓⊔ Combining Lemmas 1 and 2 allows us to state that NAPX returns a solutionthat is at least a factor of (1 − n k +1 p min ) α h of the optimal solution. In thissection we show that these results also imply that a (1 − ǫ ) approximation canbe obtained in polynomial time for an arbitrary constant ǫ . Lemma 3. O h (cid:0) log n + log ǫ (cid:1) | log(1 − ǫ ) | ! probability intervals are required in the tablein order to obtain a − ǫ approximation.Proof. The number of probability intervals, t , required for the table is boundedby the number of times 1 must be multiplied by α to reach p min . Hence α t ≤ p min and t = (cid:24) log p min log α (cid:25) . (8)From Lemmas 1 and 2 we can obtain the desired approximation ratio by selecting α = q (1 − ǫ ) h and p min = −√ − ǫn k +1 . Plugging these values into (8) gives t = log (cid:16) −√ − ǫn k +1 (cid:17) log (cid:18)q (1 − ǫ ) h (cid:19) = (cid:24) h (log(1 − √ − ǫ ) − ( k + 1) log n )log(1 − ǫ ) (cid:25)
8t can be shown that log(1 − √ − ǫ ) is O (log ǫ ), so multiplying by − − we canexpress t asymptotically as t ∈ O (cid:18) h (log n − log ǫ ) − log(1 − ǫ ) (cid:19) = O h (cid:0) log n + log ǫ (cid:1) | log(1 − ǫ ) | ! . ⊓⊔ Theorem 1.
NAPX is a (1 − ǫ ) -approximation with time complexity O nB h (cid:0) log n + log ǫ (cid:1) log (1 − ǫ ) ! . Proof.
For each table entry T ( b, p ), we must find the maximum of all possiblecombinations of entries in the left and right child tables that satisfy b and p .These combinations correspond to the possible { i, j, k } triples from (5). Thereare O ( Bt ) such combinations as i corresponds to the budget and j and k cor-respond to probability intervals. Furthermore, for fixed values of p and j , thereare potentially O ( t ) different values of k that could satisfy π ( j + k − jk ) due torounding. It follows that a naive algorithm would have to compare all O ( Bt )combinations when computing the maximum in (5) for each table entry.Fortunately, because π ( j + k − jk ) is monotonically nondecreasing with respectto either j or k , we can directly compute for any fixed p and j the interval of k entries that satisfy π ( j + k − jk ) = p : (cid:20)(cid:24) log α (cid:18) αp − j − j (cid:19)(cid:25) , (cid:24) log α (cid:18) p − j − j (cid:19)(cid:25)(cid:19) . Finding the value of k in the interval such that T ( b − i, k ) is maximized iseffectively a range maxima query (RMQ) on an array. Regardless of the size ofthe interval, such a query can be performed in constant time if instead of an array,the values are stored in a RMQ structure as described in [1]. Such structuresare linear both in space and the time they take to create, meaning that we canuse them to store each column in the table (corresponding to budget value i )without adversely affecting the complexity. Now, when given a pair { i, j } , theoptimal value of k can be computed in constant time, bringing the complexityof filling a single table entry to O ( Bt ), the number of combinations of the pair { i, j } .There are O ( Bt ) entries in each table, and a table for each of the O ( n ) edgesin the tree. The space complexity is therefore O ( nBt ) and the time complexity is O ( nB t ). Substituting t for the value that yields a (1 − ǫ ) approximation ratioshown in Lemma 3 gives O nB h (cid:0) log n + log ǫ (cid:1) log (1 − ǫ ) ! . Since in general the height of a phylogenetic tree with n leaves is O ( n ), therunning time derived above is technically cubic in n . Fortunately, for most inputs9e can expect the height to be much smaller. In this section, we will provideimproved running times for trees generated by the two principal random models.Additionally we will show that caterpillar trees, which should be the pathologicalworst-case topology according to the above analysis, actually have a much lowercomplexity.The Yule-Harding model [14][5], also known as the equal-rates-Markov model,assumes that trees are formed by a succession of random speciation events. Theexpected height of trees formed in this way, regardless of the speciation rate, is O (log n ) [2] giving a time complexity of O (cid:18) nB log n ( log n +log ǫ ) log (1 − ǫ ) (cid:19) .A caterpillar tree is a tree where all internal nodes are on a path beginningat the root, and therefore has height n . This implies that every internal edgehas at least one child edge that is incident to a leaf. Suppose edge e has child l that is incident to the leaf for species s . This table only contains two meaningfulvalues: T l (0 , a s ) and T l ( c s , b s ). Therefore to compute entry T e ( b, p ), only O (1)combinations of child table entries need to be compared and the time complexityis improved to O (cid:18) n B ( log n +log ǫ ) | log(1 − ǫ ) | (cid:19) . NAPX is, to our best knowledge, the first polynomial-time algorithm for a i c i −→ b i NAP that places guarantees on the approximation ratio. While there are stillsome limitations, especially for large budgets or tree heights, our algorithm stillsignificantly increases the number of instances of NAP that can be solved. More-over, our expected running time analysis shows that the algorithm will usually bemuch more efficient than its worst-case complexity suggests. This work towardsa more general solution is important if the Noah’s Ark Problem framework isto be used for real conservation projects. Some interesting questions do remain,however. Does NAP remain NP-Hard when the budget is constrained to be poly-nomial in n ? We conjecture that it is, but the usual reduction from Knapsack isclearly no longer valid. We would also like to find an efficient algorithm whosecomplexity is independent of h and/or B . References
1. M.A. Bender and M.I.F. Colton. The LCA Problem Revisited.
LATIN 2000: The-oretical Informatics: 4th Latin American Symposium, Punta Del Este, Uruguay,April 10-14, 2000: Proceedings , 2000.2. P.L. Erdos, M.A. Steel, L.A. Szekely, and T.J. Warnow. A few logs suffice to build(almost) all trees: Part II.
Theoretical Computer Science , 221(1):77–118, 1999.3. D.P. Faith. Conservation evaluation and phylogenetic diversity.
Biological Con-servation , 61:1 – 10, 1992.4. Michael R. Garey and David S. Johnson.
Computers and Intractability: A Guideto the Theory of NP-Completeness . W.H. Freeman and Company, 1979. . E.F. Harding. The Probabilities of Rooted Tree-Shapes Generated by RandomBifurcation. Advances in Applied Probability , 3(1):44–77, 1971.6. K. Hartmann and M. Steel. Maximizing Phylogenetic Diversity in BiodiversityConservation: Greedy Solutions to the Noah’s Ark Problem.
Systematic Biology ,55(4):644–651, 2006.7. S.B. Heard and A.O. Mooers. Phylogenetically patterned speciation rates andextinction risks change the loss of evolutionary history during extinctions.
Proc.R. Soc. Lond. B , 267:613–620, 2000.8. A.E. Magurran.
Measuring Biological Diversity . Blackwell Publishing, 2004.9. S. Nee and R.M. May. Extinction and the Loss of Evolutionary History.
Science ,278(5338):692, 1997.10. F. Pardi and N. Goldman. Species choice for comparative genomics: Being greedyworks.
PLoS Genetics , 1(6):71, 2005.11. F. Pardi and N. Goldman. Resource-aware taxon selection for maximizing phylo-genetic diversity.
Syst Biol , 56(3):431–44, 2007.12. M. Steel. Phylogenetic diversity and the greedy algorithm.
Systematic Biology ,54(4):527 – 529, 2005.13. M.L. Weitzman. The Noah’s Ark problem.
Econometrica , 66:1279 – 1298, 1998.14. G.U. Yule. A Mathematical Theory of Evolution, Based on the Conclusions of Dr.JC Willis, FRS.
Philosophical Transactions of the Royal Society of London. SeriesB, Containing Papers of a Biological Character , 213:21–87, 1925., 213:21–87, 1925.