Optimal Search Trees with 2-Way Comparisons
Marek Chrobak (University of California, Riverside, USA), Mordecai Golin (Hong Kong University of Science and Technology, Hong Kong, China), J. Ian Munro (University of Waterloo, Waterloo, Canada), and Neal E. Young (University of California, Riverside, USA)
Abstract.
In 1971, Knuth gave an O(n^2)-time algorithm for the classic problem of finding an optimal binary search tree. Knuth's algorithm works only for search trees based on 3-way comparisons, but most modern computers support only 2-way comparisons (<, ≤, =, ≥, and >). Until this paper, the problem of finding an optimal search tree using 2-way comparisons remained open: poly-time algorithms were known only for restricted variants. We solve the general case, giving (i) an O(n^4)-time algorithm and (ii) an O(n log n)-time additive-3 approximation algorithm. For finding optimal binary split trees, we (iii) obtain a linear speedup and (iv) prove some previous work incorrect.

1 Introduction

In 1971, Knuth [10] gave an O(n^2)-time dynamic-programming algorithm for a classic problem: given a set K of keys and a probability distribution on queries, find an optimal binary-search tree T. As shown in Fig. 1, a search in such a tree for a given value v compares v to the root key, then (i) recurses left if v is smaller, (ii) stops if v equals the key, or (iii) recurses right if v is larger, halting at a leaf. The comparisons made in the search must suffice to determine the relation of v to all keys in K. (Hence, T must have |K| + 1 leaves.) T is optimal if it has minimum cost, defined as the expected number of comparisons assuming the query v is chosen randomly from the specified probability distribution.

Knuth assumed three-way comparisons at each node. With the rise of higher-level programming languages, most computers began supporting only two-way comparisons (<, ≤, =, ≥, >). In the 2nd edition of Volume 3 of The Art of Computer Programming [11, §6.2.2 ex. 33], Knuth commented:

". . . machines that cannot make three-way comparisons at once . . . will have to make two comparisons . . . it may well be best to have a binary tree whose internal nodes specify either an equality test or a less-than test but not both."
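The 3-way search just described, and the cost notion, can be sketched as follows. This is a minimal illustrative sketch (the node representation and names are not from the paper), with queries weighted by an assumed probability list:

```python
# Minimal sketch of a 3-way-comparison search tree (as in Fig. 1) and its
# cost (expected number of comparisons). Representation is illustrative.

class Node3:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def search3(node, v, comparisons=0):
    """Determine the relation of v to the keys; return (outcome, #comparisons)."""
    if node is None:
        return 'leaf', comparisons          # v lies in a gap between keys
    comparisons += 1
    if v < node.key:                        # (i) recurse left
        return search3(node.left, v, comparisons)
    if v == node.key:                       # (ii) stop: v equals the key
        return 'found', comparisons
    return search3(node.right, v, comparisons)  # (iii) recurse right

def cost(root, queries):
    """Expected comparisons; queries is a list of (value, probability)."""
    return sum(p * search3(root, v)[1] for v, p in queries)

# The tree of Fig. 1 for K = {H, O, W}: root O, children H and W.
fig1 = Node3('O', Node3('H'), Node3('W'))
```

For example, `search3(fig1, 'O')` halts after one comparison, while a query such as 'T' makes two comparisons and halts at the leaf between O and W.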
But Knuth gave no algorithm to find an optimal tree built from two-way comparisons (a 2wcst, as in Fig. 2(a)), and, prior to the current paper, poly-time algorithms
⋆ This is the full version of an extended abstract that appeared in ISAAC 2015 [2].
⋆⋆ Research funded by NSF grants CCF-1217314 and CCF-1536026.
⋆⋆⋆ Research funded by HKUST/RGC grant FSGRF14EG28.
† Research funded by NSERC and the Canada Research Chairs Programme.

Fig. 1. A binary search tree T using 3-way comparisons, for K = {H, O, W}.

Fig. 2. Two 2wcsts for K = {H, O, W}; tree (b) only handles successful queries.

were known only for restricted variants. Most notably, in 2002 Anderson et al. [1] gave an O(n^4)-time algorithm for the successful-queries variant of 2wcst, in which each query v must be a key in K, so only |K| leaves are needed (Fig. 2(b)). The standard problem allows arbitrary queries, so 2|K| + 1 leaves are needed (Fig. 2(a)). For the standard problem, no polynomial-time algorithm was previously known. We give one for a more general problem that we call 2wcst:

Theorem 1. 2wcst has an O(n^4)-time algorithm.

We specify an instance I of 2wcst as a tuple I = (K = {K_1, ..., K_n}, Q, C, α, β). The set C of allowed comparison operators can be any subset of {<, ≤, =, ≥, >}. The set Q specifies the queries. A solution is an optimal 2wcst T among those using operators in C and handling all queries in Q. This definition generalizes both standard 2wcst (let Q contain each key and a value between each pair of keys) and the successful-queries variant (take Q = K and α ≡ 0). It further allows any query set Q between these two extremes, even allowing K ⊄ Q. As usual, β_i is the probability that v equals K_i; α_i is the probability that v falls between keys K_i and K_{i+1} (except α_0 = Pr[v < K_1] and α_n = Pr[v > K_n]).

To prove Thm. 1, we prove Spuler's 1994 "maximum-likelihood" conjecture: in any optimal tree, each equality comparison is to a key in K of maximum likelihood, given the comparisons so far [14, §6.4 Conj. 1]. As Spuler observed, the conjecture implies an O(n^5)-time algorithm; we reduce this to O(n^4) using

As defined here, a 2wcst T must determine the relation of the query v to every key in K.
More generally, one could specify any partition P of Q, and only require T to determine, if at all possible using keys in K, which set S ∈ P contains v. For example, if P = {K, Q \ K}, then T would only need to determine whether v ∈ K. We note without proof that Thm. 1 extends to this more general formulation.

standard techniques and a new perturbation argument. Anderson et al. proved the conjecture for their special case [1, Cor. 3]. We were unable to extend their proof directly; our proof uses a different local-exchange argument.

We also give a fast additive-3 approximation algorithm:

Theorem 2. Given any instance I = (K, Q, C, α, β) of 2wcst, one can compute a tree of cost at most the optimum plus 3, in O(n log n) time.

Comparable results were known for the successful-queries variant (Q = K) [16,1]. We approximately reduce the general case to that case.

Binary split trees "split" each 3-way comparison in Knuth's 3-way-comparison model into two 2-way comparisons within the same node: an equality comparison (which, by definition, must be to the maximum-likelihood key) and a "<" comparison (to any key) [13,3,8,12,6]. The fastest algorithms to find an optimal binary split tree take O(n^5) time: from 1984 for the successful-queries-only variant (Q = K) [8]; from 1986 for the standard problem (Q contains queries in all possible relations to the keys in K) [6]. We obtain a linear speedup:

Theorem 3. Given any instance I = (K = {K_1, ..., K_n}, α, β) of the standard binary-split-tree problem, an optimal tree can be computed in O(n^4) time.

The proof uses our new perturbation argument (Sec. 3.1) to reduce to the case when all β_i's are distinct, then applies a known algorithm [6]. The perturbation argument can also be used to simplify Anderson et al.'s algorithm [1].

Generalized binary split trees (gbsts) are binary split trees without the maximum-likelihood constraint.
Huang and Wong [9] (1984) observe that relaxing this constraint allows cheaper trees (the maximum-likelihood conjecture fails here) and propose an algorithm to find optimal gbsts. We prove it incorrect!

Theorem 4. Lemma 4 of [9] is incorrect: there exists an instance (a query distribution β) for which it does not hold, and on which their algorithm fails.

This flaw also invalidates two algorithms, proposed in Spuler's thesis [15], that are based on Huang and Wong's algorithm. We know of no poly-time algorithm to find optimal gbsts. Of course, optimal 2wcsts are at least as good.

2wcsts without equality tests. Finding an optimal alphabetic encoding has several poly-time algorithms: by Gilbert and Moore (O(n^3) time, 1959) [5]; by Hu and Tucker (O(n log n) time, 1971) [7]; and by Garsia and Wachs (O(n log n) time but simpler, 1979) [4]. The problem is equivalent to finding an optimal 3-way-comparison search tree when the probability of querying any key is zero (β ≡ 0) [11, §6.2.2]. It is also equivalent to finding an optimal 2wcst in the successful-queries variant with only "<" comparisons allowed (C = {<}, Q = K) [1, §5.2]. We generalize this observation to prove Thm. 5:

Theorem 5. Any instance I = (K = {K_1, ..., K_n}, Q, C, α, β) where = is not in C (equality tests are not allowed) can be solved in O(n log n) time.

2 Definitions

Fix an arbitrary instance I = (K, Q, C, α, β). For any node N in any 2wcst T for I, N's query subset, Q_N, contains the queries v ∈ Q such that the search for v reaches N. The weight ω(N) of N is the probability that a random query v (from distribution (α, β)) is in Q_N. The weight ω(T′) of any subtree T′ of T is ω(N), where N is the root of T′. Let ⟨v < K_i⟩ denote an internal node having key K_i and comparison operator < (define ⟨v ≤ K_i⟩ and ⟨v = K_i⟩ similarly). Let ⟨K_i⟩ denote the leaf N such that Q_N = {K_i}.
Abusing notation, ω(K_i) is a synonym for ω(⟨K_i⟩), that is, β_i. Say T is irreducible if, for every node N with parent N′, Q_N ≠ Q_{N′}.

In the remainder of the paper, we assume that only comparisons in {<, ≤, =} are allowed (i.e., C ⊆ {<, ≤, =}). This is without loss of generality, as "v > K_i" and "v ≥ K_i" can be replaced, respectively, by "v ≤ K_i" and "v < K_i" (swapping the yes- and no-branches).

Fix any irreducible, optimal 2wcst T for any instance I = (K, Q, C, α, β).

Theorem 6 (Spuler's conjecture). The key K_a in any equality-comparison node N = ⟨v = K_a⟩ is a maximum-likelihood key: β_a = max_i {β_i : K_i ∈ Q_N}.

The theorem will follow easily from Lemma 1:

Lemma 1. Let internal node ⟨v = K_a⟩ be an ancestor of internal node ⟨v = K_z⟩. Then ω(K_a) ≥ ω(K_z). That is, β_a ≥ β_z.

Proof (Lemma 1). Throughout, "⟨v ≺ K_i⟩" denotes a node in T that does an inequality comparison (≤ or <, not =) to key K_i. Abusing notation, in that context, "x ≺ K_i" (or "x ⊀ K_i") denotes that x passes (or fails) that comparison.

Assumption 1. (i) All nodes on the path from ⟨v = K_a⟩ to ⟨v = K_z⟩ do inequality comparisons. (ii) Along the path, some node ⟨v ≺ K_s⟩ separates key K_a from K_z: either K_a ≺ K_s but K_z ⊀ K_s, or K_z ≺ K_s but K_a ⊀ K_s.

It suffices to prove the lemma assuming (i) and (ii) above. (Indeed, if the lemma holds given (i), then, by transitivity, the lemma holds in general. Given (i), if (ii) doesn't hold, then exchanging the two nodes preserves correctness, changing the cost by (ω(K_a) − ω(K_z)) × d for some d ≥ 1, so ω(K_a) ≥ ω(K_z) and we are done.)

By Assumption 1, the subtree rooted at ⟨v = K_a⟩, call it T′, is as in Fig. 3(a). Let its child ⟨v ≺ K_b⟩, with "yes" subtree T_1 and "no" subtree T_2, be as in Fig. 3.

Lemma 2.
If K_a ≺ K_b, then ω(K_a) ≥ ω(T_2); else ω(K_a) ≥ ω(T_1).

(This and subsequent lemmas in this section are proved in Appendix 7.2. The idea behind this one is that correctness is preserved by replacing T′ by subtree (b) if K_a ≺ K_b, or (c) otherwise, implying the lemma by the optimality of T.)

Fig. 3. (a) The subtree T′ rooted at ⟨v = K_a⟩ and possible replacements (b), (c).

Case 1: Child ⟨v ≺ K_b⟩ separates K_a from K_z. If K_a ≺ K_b, then K_z ⊀ K_b, so descendant ⟨v = K_z⟩ is in T_2, and, by this and Lemma 2, ω(K_a) ≥ ω(T_2) ≥ ω(K_z), and we're done. Otherwise K_a ⊀ K_b, so K_z ≺ K_b, so descendant ⟨v = K_z⟩ is in T_1, and, by this and Lemma 2, ω(K_a) ≥ ω(T_1) ≥ ω(K_z), and we're done.

Case 2: Child ⟨v ≺ K_b⟩ does not separate K_a from K_z. Assume also that descendant ⟨v = K_z⟩ is in T_2. (If descendant ⟨v = K_z⟩ is in T_1, the proof is symmetric, exchanging the roles of T_1 and T_2.) Since descendant ⟨v = K_z⟩ is in T_2, and child ⟨v ≺ K_b⟩ does not separate K_a from K_z, we have K_a ⊀ K_b and two facts:

Fact A: ω(K_a) ≥ ω(T_1) (by Lemma 2), and
Fact B: the root of T_2 does an inequality comparison (by Assumption 1).

By Fact B, the subtree T′ rooted at ⟨v = K_a⟩ is as in Fig. 4(a):

Fig. 4. (a) The subtree T′ in Case 2, and two possible replacements (b), (c).

As in Fig. 4(a), let the root of T_2 be ⟨v ≺ K_c⟩, with "yes" subtree T_3 and "no" subtree T_4.

Lemma 3. (i) ω(T_1) ≥ ω(T_4). (ii) If K_a ⊀ K_c, then ω(K_a) ≥ ω(T_3).
(As replacing T′ by (b) or (c) preserves correctness; proof in Appendix 7.2.)

Case 2.1: K_a ⊀ K_c. By Lemma 3(ii), ω(K_a) ≥ ω(T_3). If descendant ⟨v = K_z⟩ is in T_3, then ω(T_3) ≥ ω(K_z), so transitively ω(K_a) ≥ ω(K_z), and we are done. (If instead ⟨v = K_z⟩ is in T_4, then Fact A and Lemma 3(i) give ω(K_a) ≥ ω(T_1) ≥ ω(T_4) ≥ ω(K_z).)

Case 2.2: K_a ≺ K_c. By Lemma 3(i), ω(T_1) ≥ ω(T_4). By Fact A, ω(K_a) ≥ ω(T_1). If ⟨v = K_z⟩ is in T_4, then ω(T_4) ≥ ω(K_z), and transitively we are done.

In the remaining case, ⟨v = K_z⟩ is in T_3. T's irreducibility implies K_z ≺ K_c. Since K_a ≺ K_c also (Case 2.2), grandchild ⟨v ≺ K_c⟩ does not separate K_a from K_z, and by Assumption 1 the root of subtree T_3 does an inequality comparison. Hence, the subtree rooted at ⟨v ≺ K_b⟩ is as in Fig. 5(a), with the root of T_3, ⟨v ≺ K_d⟩, having subtrees T_5 and T_6:

Fig. 5. (a) The subtree rooted at ⟨v ≺ K_b⟩ in Case 2.2. (b) A possible replacement.

Lemma 4. ω(T_1) ≥ ω(T_3).

(Because replacing (a) by (b) preserves correctness; proof in Appendix 7.2.) Since descendant ⟨v = K_z⟩ is in T_3, Lemma 4 implies ω(T_1) ≥ ω(T_3) ≥ ω(K_z). This and Fact A imply ω(K_a) ≥ ω(K_z). This proves Lemma 1. □

Proposition 1. If any leaf node ⟨K_ℓ⟩'s parent P does not do an equality comparison against key K_ℓ, then changing P so that it does gives an irreducible 2wcst T′ of the same cost.

Proof. Since Q_{⟨K_ℓ⟩} = {K_ℓ} and P's comparison operator is in C ⊆ {<, ≤, =}, it must be that K_ℓ = max Q_P or K_ℓ = min Q_P. So changing P to ⟨v = K_ℓ⟩ (with ⟨K_ℓ⟩ as the "yes" child and the other child as the "no" child) maintains correctness, cost, and irreducibility. □

Proof (Thm. 6). Consider any equality-testing node N = ⟨v = K_a⟩ and any key K_z ∈ Q_N.
Since K_z ∈ Q_N, node N has descendant leaf ⟨K_z⟩. Without loss of generality (by Proposition 1), leaf ⟨K_z⟩'s parent is ⟨v = K_z⟩. That parent is a descendant of ⟨v = K_a⟩, so ω(K_a) ≥ ω(K_z) by Lemma 1. □

3 Proofs of Thm. 1 (2wcst) and Thm. 3

First we prove Thm. 1. Fix an instance I = (K, Q, C, α, β). Assume for now that all probabilities in β are distinct. For any query subset S ⊆ Q, let opt(S) denote the minimum cost of any 2wcst that correctly determines all queries in subset S (using keys in K, comparisons in C, and weights from the appropriate restriction of α and β to S). Let ω(S) be the probability that a random query v is in S. The cost of any tree for S is the weight of the root (= ω(S)) plus the cost of its two subtrees, yielding the following dynamic-programming recurrence:

Lemma 5. For any query set S ⊆ Q not handled by a single-node tree,

opt(S) = ω(S) + min of:
(i) min_k opt(S \ {k}) (if "=" is in C, else ∞),
(ii) min_{k,≺} opt(S_{≺k}) + opt(S \ S_{≺k}),

where k ranges over K, ≺ ranges over the allowed inequality operators (if any), and S_{≺k} = {v ∈ S : v ≺ k}.

Using the recurrence naively to compute opt(Q) yields exponentially many query subsets S, because of line (i). But, by Thm. 6, we can restrict k in line (i) to be the maximum-likelihood key in S. With this restriction, the only subsets S that arise are intervals within Q, minus some most-likely keys. Formally, for each of the O(n^2) key pairs {k_1, k_2} ⊆ K ∪ {−∞, ∞} with k_1 < k_2, define four key intervals:

(k_1, k_2) = {v ∈ Q : k_1 < v < k_2},  [k_1, k_2] = {v ∈ Q : k_1 ≤ v ≤ k_2},
(k_1, k_2] = {v ∈ Q : k_1 < v ≤ k_2},  [k_1, k_2) = {v ∈ Q : k_1 ≤ v < k_2}.

For each of these O(n^2) key intervals I, and each integer h ≤ n, define top(I, h) to contain the h keys in I with the h largest β_i's.
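Before the maximum-likelihood restriction, the recurrence of Lemma 5 can be evaluated directly by memoizing over query subsets. A minimal sketch, assuming C = {<, =} and treating each query as a comparable value (exponentially many subsets arise in general, which is exactly what the restriction avoids; fine for tiny instances):

```python
# Sketch of the Lemma-5 recurrence, memoized over query subsets.
# Illustrative only: it separates the queries in S, with C = {'<', '='}.
from functools import lru_cache

def optimal_cost(Q, K, weight, C=('<', '=')):
    """Minimum expected number of comparisons to separate query set Q,
    where weight maps each query to its probability and K is the key set."""

    @lru_cache(maxsize=None)
    def opt(S):
        if len(S) <= 1:
            return 0.0                       # a single leaf suffices
        w = sum(weight[v] for v in S)        # ω(S): every query in S pays
        best = float('inf')                  # for the comparison at the root
        for k in K:
            if '=' in C and k in S:          # line (i): equality test on k
                best = min(best, opt(S - {k}))
            if '<' in C:                     # line (ii): inequality split
                yes = frozenset(v for v in S if v < k)
                if yes and yes != S:
                    best = min(best, opt(yes) + opt(S - yes))
        return w + best

    return opt(frozenset(Q))
```

For example, with three equally likely key queries, one equality test plus one further comparison gives expected cost 1 + 2/3 = 5/3.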
Define S(I, h) = I \ top(I, h). Applying the restricted recurrence to S(I, h) gives a simpler recurrence:

Lemma 6. If S(I, h) is not handled by a one-node tree, then opt(S(I, h)) equals ω(S(I, h)) plus the minimum of:
(i) opt(S(I, h + 1)) (if equality is in C, else ∞),
(ii) min_{k,≺} opt(S(I_{≺k}, h_{≺k})) + opt(S(I \ I_{≺k}, h − h_{≺k})),

where key interval I_{≺k} = {v ∈ I : v ≺ k}, and h_{≺k} = |top(I, h) ∩ I_{≺k}|.

Now, to compute opt(Q), each query subset that arises is of the form S(I, h), where I is a key interval and 0 ≤ h ≤ n. With care, each of these O(n^3) subproblems can be solved in O(n) time, giving an O(n^4)-time algorithm. In particular, represent each key interval I by its two endpoints. For each key interval I and integer h ≤ n, precompute ω(S(I, h)), top(I, h), and the h'th largest key in I. Given these O(n^3) values (computed in O(n^3 log n) time), the recurrence for opt(S(I, h)) can be evaluated in O(n) time. In particular, for line (ii), one can enumerate all O(n) pairs (k, h_{≺k}) in O(n) time total, and, for each, compute I_{≺k} and I \ I_{≺k} in O(1) time. Each base case can be recognized and handled (by a cost-0 leaf) in O(1) time, giving total time O(n^4). This proves Thm. 1 when all probabilities in β are distinct; Sec. 3.1 finishes the proof.

3.1 Reducing to distinct key probabilities

Here we show that, without loss of generality, in looking for an optimal search tree, one can assume that the key probabilities (the β_i's) are all distinct. Given any instance I = (K, Q, C, α, β), construct instance I′ = (K, Q, C, α, β′), where β′_j = β_j + jε and ε is a positive infinitesimal (or ε can be understood as a sufficiently small positive rational). To compute (and compare) costs of trees with respect to I′, maintain the infinitesimal part of each value separately and extend linear arithmetic component-wise in the natural way: 1.
Compute z × (x_1 + x_2 ε) as (z x_1) + (z x_2)ε, where z, x_1, x_2 are any rationals; 2. compute (x_1 + ε x_2) + (y_1 + ε y_2) as (x_1 + y_1) + (x_2 + y_2)ε; 3. and say x_1 + ε x_2 < y_1 + ε y_2 iff x_1 < y_1, or x_1 = y_1 and x_2 < y_2.

Lemma 7. In the instance I′, all key probabilities β′_i are distinct. If a tree T is optimal w.r.t. I′, then it is also optimal with respect to I.

Proof. Let A be a tree that is optimal w.r.t. I′. Let B be any other tree, and let the costs of A and B under I′ be, respectively, a_1 + a_2 ε and b_1 + b_2 ε. Then their respective costs under I are a_1 and b_1. Since A has minimum cost under I′, a_1 + a_2 ε ≤ b_1 + b_2 ε. That is, either a_1 < b_1, or a_1 = b_1 (and a_2 ≤ b_2). Hence a_1 ≤ b_1: that is, A costs no more than B w.r.t. I. Hence A is optimal w.r.t. I. □

Doing arithmetic this way increases the running time by a constant factor. This completes the proof of Thm. 1. The reduction can also be used to avoid the significant effort that Anderson et al. [1] devote to non-distinct key probabilities. For computing optimal binary split trees for unrestricted queries, the fastest known time is O(n^5), due to [6]. But [6] also gives an O(n^4)-time algorithm for the case of distinct key probabilities. With the above reduction, the latter algorithm gives O(n^4) time for the general case, proving Thm. 3.

4 Additive-3 approximation: proof of Thm. 2

Fix any instance I = (K, Q, C, α, β). If C is {=}, then the optimal tree can be found in O(n log n) time, so assume otherwise. In particular, < and/or ≤ is in C.
Assume that < is in C (the other case is symmetric).

The entropy H_I = −Σ_i β_i log β_i − Σ_i α_i log α_i is a lower bound on opt(I). For the case K = Q and C = {<}, Yeung's O(n)-time algorithm [16] constructs a 2wcst that uses only <-comparisons and whose cost is at most H_I + 2 − β_1 − β_n. We reduce the general case to that one, adding roughly one extra comparison.

Construct I′ = (K′ = K, Q′ = K, C′ = {<}, α′, β′), where each α′_i = 0 and each β′_i = β_i + α_i (except β′_1 = α_0 + β_1 + α_1). Use Yeung's algorithm [16] to construct tree T′ for I′. Tree T′ uses only the < operator, so any query v ∈ Q that reaches a leaf ⟨K_i⟩ in T′ must satisfy K_i ≤ v < K_{i+1} (or v < K_2 if i = 1). To distinguish v = K_i from K_i < v < K_{i+1}, we need only add one additional comparison at each leaf (except, if i = 1, we need two). By Yeung's guarantee, T′ costs at most H_{I′} + 2 − β′_1 − β′_n. The modifications can be done so as to increase the cost by at most 1 + α_0 + α_1, so the final tree costs at most H_{I′} + 3. By standard properties of entropy, H_{I′} ≤ H_I ≤ opt(I), proving Thm. 2.

5 Generalized binary split trees: proof of Thm. 4

A generalized binary split tree (gbst) is a rooted binary tree where each node N has an equality key e_N and a split key s_N. A search for query v ∈ Q starts at the root r. If v = e_r, the search halts. Otherwise, the search recurses on the left subtree (if v < s_r) or the right subtree (if v ≥ s_r). The cost of the tree is the expected number of nodes (including, by convention, leaves) visited for a random query v. Fig. 6 shows two gbsts for a single instance.

For an algorithm that works with linear (or O(1)-degree polynomial) functions of β.
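The probability folding behind the reduction to Yeung's successful-queries case can be sketched as follows; the function names are illustrative (not from the paper), and the inputs are a hypothetical small instance:

```python
# Sketch of the reduction behind Thm. 2: fold the gap probabilities alpha
# into per-key probabilities beta' for the successful-queries instance I'.
# beta'_i = beta_i + alpha_i, except beta'_1 also absorbs alpha_0.
import math

def fold_probabilities(alpha, beta):
    """alpha = [alpha_0, ..., alpha_n]; beta = [beta_1, ..., beta_n]."""
    n = len(beta)
    assert len(alpha) == n + 1
    bp = [beta[i] + alpha[i + 1] for i in range(n)]  # gap to the right of K_i
    bp[0] += alpha[0]                                # leftmost gap joins K_1
    return bp

def entropy(ps):
    """H = -sum p log2 p, the lower bound used for opt(I)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

alpha = [0.05, 0.10, 0.15, 0.20]   # gaps around keys K_1..K_3 (hypothetical)
beta  = [0.10, 0.20, 0.20]         # key probabilities (hypothetical)
bp = fold_probabilities(alpha, beta)   # approximately [0.25, 0.35, 0.40]
```

Since I′ merges probability masses, its entropy can only be smaller, matching the step H_{I′} ≤ H_I in the proof.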
If it is possible to distinguish v = K_i from K_i < v < K_{i+1}, then C must have at least one operator other than <, so we can add either ⟨v = K_i⟩ or ⟨v ≤ K_i⟩.

Fig. 6. Two gbsts for the same instance.
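The gbst search rule just defined can be sketched as follows. This is an illustrative representation (not Huang and Wong's), and for brevity it counts only the internal nodes visited, omitting the leaf-counting convention:

```python
# Sketch of the gbst search rule: each node holds an equality key e and a
# split key s. In a *generalized* split tree, e need not be the
# maximum-likelihood key of the node's query set.
class GNode:
    def __init__(self, e, s, left=None, right=None):
        self.e, self.s, self.left, self.right = e, s, left, right

def gbst_search(node, v, visited=0):
    """Return (found?, number of internal nodes visited)."""
    if node is None:
        return False, visited
    visited += 1
    if v == node.e:                      # equality test first; halt on match
        return True, visited
    nxt = node.left if v < node.s else node.right   # then the "<" split
    return gbst_search(nxt, v, visited)

# A tiny hypothetical gbst: root tests e='B' and splits at s='C'.
root = GNode('B', 'C', GNode('A', 'B'), GNode('D', 'E'))
```

Dropping the maximum-likelihood constraint on e is exactly the relaxation that Huang and Wong observe can yield cheaper trees.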