Optimal Search Trees with 2-Way Comparisons
Marek Chrobak (University of California, Riverside, USA), Mordecai Golin (Hong Kong University of Science and Technology, Hong Kong, China), J. Ian Munro (University of Waterloo, Waterloo, Canada), and Neal E. Young (University of California, Riverside, USA)
Abstract.
In 1971, Knuth gave an O(n^2)-time algorithm for the classic problem of finding an optimal binary search tree. Knuth's algorithm works only for search trees based on 3-way comparisons, but most modern computers support only 2-way comparisons (<, ≤, =, ≥, and >). Until this paper, the problem of finding an optimal search tree using 2-way comparisons remained open: poly-time algorithms were known only for restricted variants. We solve the general case, giving (i) an O(n^4)-time algorithm and (ii) an O(n log n)-time additive-3 approximation algorithm. For finding optimal binary split trees, we (iii) obtain a linear speedup and (iv) prove some previous work incorrect.

1 Introduction

In 1971, Knuth [10] gave an O(n^2)-time dynamic-programming algorithm for a classic problem: given a set K of keys and a probability distribution on queries, find an optimal binary-search tree T. As shown in Fig. 1, a search in such a tree for a given value v compares v to the root key, then (i) recurses left if v is smaller, (ii) stops if v equals the key, or (iii) recurses right if v is larger, halting at a leaf. The comparisons made in the search must suffice to determine the relation of v to all keys in K. (Hence, T must have |K| + 1 leaves.) T is optimal if it has minimum cost, defined as the expected number of comparisons assuming the query v is chosen randomly from the specified probability distribution.

Knuth assumed three-way comparisons at each node. With the rise of higher-level programming languages, most computers began supporting only two-way comparisons (<, ≤, =, ≥, >). In the 2nd edition of Volume 3 of The Art of Computer Programming [11, §6.2.2 ex. 33], Knuth commented:

". . . machines that cannot make three-way comparisons at once . . . will have to make two comparisons . . . it may well be best to have a binary tree whose internal nodes specify either an equality test or a less-than test but not both."
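The 3-way search just described, and the cost notion, can be sketched as follows. This is a minimal illustrative sketch (the node representation and names are not from the paper), with queries weighted by an assumed probability list:

```python
# Minimal sketch of a 3-way-comparison search tree (as in Fig. 1) and its
# cost (expected number of comparisons). Representation is illustrative.

class Node3:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def search3(node, v, comparisons=0):
    """Determine the relation of v to the keys; return (outcome, #comparisons)."""
    if node is None:
        return 'leaf', comparisons          # v lies in a gap between keys
    comparisons += 1
    if v < node.key:                        # (i) recurse left
        return search3(node.left, v, comparisons)
    if v == node.key:                       # (ii) stop: v equals the key
        return 'found', comparisons
    return search3(node.right, v, comparisons)  # (iii) recurse right

def cost(root, queries):
    """Expected comparisons; queries is a list of (value, probability)."""
    return sum(p * search3(root, v)[1] for v, p in queries)

# The tree of Fig. 1 for K = {H, O, W}: root O, children H and W.
fig1 = Node3('O', Node3('H'), Node3('W'))
```

For example, `search3(fig1, 'O')` halts after one comparison, while a query such as 'T' makes two comparisons and halts at the leaf between O and W.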
But Knuth gave no algorithm to find an optimal tree built from two-way comparisons (a 2wcst, as in Fig. 2(a)), and, prior to the current paper, poly-time algorithms
⋆ This is the full version of an extended abstract that appeared in ISAAC 2015 [2].
⋆⋆ Research funded by NSF grants CCF-1217314 and CCF-1536026.
⋆⋆⋆ Research funded by HKUST/RGC grant FSGRF14EG28.
† Research funded by NSERC and the Canada Research Chairs Programme.

Fig. 1. A binary search tree T using 3-way comparisons, for K = {H, O, W}.

Fig. 2. Two 2wcsts for K = {H, O, W}; tree (b) only handles successful queries.

were known only for restricted variants. Most notably, in 2002 Anderson et al. [1] gave an O(n^4)-time algorithm for the successful-queries variant of 2wcst, in which each query v must be a key in K, so only |K| leaves are needed (Fig. 2(b)). The standard problem allows arbitrary queries, so 2|K| + 1 leaves are needed (Fig. 2(a)). For the standard problem, no polynomial-time algorithm was previously known. We give one for a more general problem that we call 2wcst:

Theorem 1. 2wcst has an O(n^4)-time algorithm.

We specify an instance I of 2wcst as a tuple I = (K = {K_1, ..., K_n}, Q, C, α, β). The set C of allowed comparison operators can be any subset of {<, ≤, =, ≥, >}. The set Q specifies the queries. A solution is an optimal 2wcst T among those using operators in C and handling all queries in Q. This definition generalizes both standard 2wcst (let Q contain each key and a value between each pair of keys) and the successful-queries variant (take Q = K and α ≡ 0). It further allows any query set Q between these two extremes, even allowing K ⊄ Q. As usual, β_i is the probability that v equals K_i; α_i is the probability that v falls between keys K_i and K_{i+1} (except α_0 = Pr[v < K_1] and α_n = Pr[v > K_n]).

To prove Thm. 1, we prove Spuler's 1994 "maximum-likelihood" conjecture: in any optimal tree, each equality comparison is to a key in K of maximum likelihood, given the comparisons so far [14, §6.4 Conj. 1]. As Spuler observed, the conjecture implies an O(n^5)-time algorithm; we reduce this to O(n^4) using

As defined here, a 2wcst T must determine the relation of the query v to every key in K.
More generally, one could specify any partition P of Q, and only require T to determine, if at all possible using keys in K, which set S ∈ P contains v. For example, if P = {K, Q \ K}, then T would only need to determine whether v ∈ K. We note without proof that Thm. 1 extends to this more general formulation.

standard techniques and a new perturbation argument. Anderson et al. proved the conjecture for their special case [1, Cor. 3]. We were unable to extend their proof directly; our proof uses a different local-exchange argument.

We also give a fast additive-3 approximation algorithm:

Theorem 2. Given any instance I = (K, Q, C, α, β) of 2wcst, one can compute a tree of cost at most the optimum plus 3, in O(n log n) time.

Comparable results were known for the successful-queries variant (Q = K) [16,1]. We approximately reduce the general case to that case.

Binary split trees "split" each 3-way comparison in Knuth's 3-way-comparison model into two 2-way comparisons within the same node: an equality comparison (which, by definition, must be to the maximum-likelihood key) and a "<" comparison (to any key) [13,3,8,12,6]. The fastest algorithms to find an optimal binary split tree take O(n^5) time: from 1984 for the successful-queries-only variant (Q = K) [8]; from 1986 for the standard problem (Q contains queries in all possible relations to the keys in K) [6]. We obtain a linear speedup:

Theorem 3. Given any instance I = (K = {K_1, ..., K_n}, α, β) of the standard binary-split-tree problem, an optimal tree can be computed in O(n^4) time.

The proof uses our new perturbation argument (Sec. 3.1) to reduce to the case when all β_i's are distinct, then applies a known algorithm [6]. The perturbation argument can also be used to simplify Anderson et al.'s algorithm [1].

Generalized binary split trees (gbsts) are binary split trees without the maximum-likelihood constraint.
Huang and Wong [9] (1984) observe that relaxing this constraint allows cheaper trees (the maximum-likelihood conjecture fails here) and propose an algorithm to find optimal gbsts. We prove it incorrect!

Theorem 4. Lemma 4 of [9] is incorrect: there exists an instance (a query distribution β) for which it does not hold, and on which their algorithm fails.

This flaw also invalidates two algorithms, proposed in Spuler's thesis [15], that are based on Huang and Wong's algorithm. We know of no poly-time algorithm to find optimal gbsts. Of course, optimal 2wcsts are at least as good.

2wcsts without equality tests. Finding an optimal alphabetic encoding has several poly-time algorithms: by Gilbert and Moore (O(n^3) time, 1959) [5]; by Hu and Tucker (O(n log n) time, 1971) [7]; and by Garsia and Wachs (O(n log n) time but simpler, 1979) [4]. The problem is equivalent to finding an optimal 3-way-comparison search tree when the probability of querying any key is zero (β ≡ 0) [11, §6.2.2]. It is also equivalent to finding an optimal 2wcst in the successful-queries variant with only "<" comparisons allowed (C = {<}, Q = K) [1, §5.2]. We generalize this observation to prove Thm. 5:

Theorem 5. Any instance I = (K = {K_1, ..., K_n}, Q, C, α, β) where = is not in C (equality tests are not allowed) can be solved in O(n log n) time.

2 Definitions

Fix an arbitrary instance I = (K, Q, C, α, β). For any node N in any 2wcst T for I, N's query subset, Q_N, contains the queries v ∈ Q such that the search for v reaches N. The weight ω(N) of N is the probability that a random query v (from distribution (α, β)) is in Q_N. The weight ω(T′) of any subtree T′ of T is ω(N), where N is the root of T′. Let ⟨v < K_i⟩ denote an internal node having key K_i and comparison operator < (define ⟨v ≤ K_i⟩ and ⟨v = K_i⟩ similarly). Let ⟨K_i⟩ denote the leaf N such that Q_N = {K_i}.
Abusing notation, ω(K_i) is a synonym for ω(⟨K_i⟩), that is, β_i. Say T is irreducible if, for every node N with parent N′, Q_N ≠ Q_{N′}.

In the remainder of the paper, we assume that only comparisons in {<, ≤, =} are allowed (i.e., C ⊆ {<, ≤, =}). This is without loss of generality, as "v > K_i" and "v ≥ K_i" can be replaced, respectively, by "v ≤ K_i" and "v < K_i" (swapping the yes- and no-branches).

Fix any irreducible, optimal 2wcst T for any instance I = (K, Q, C, α, β).

Theorem 6 (Spuler's conjecture). The key K_a in any equality-comparison node N = ⟨v = K_a⟩ is a maximum-likelihood key: β_a = max_i {β_i : K_i ∈ Q_N}.

The theorem will follow easily from Lemma 1:

Lemma 1. Let internal node ⟨v = K_a⟩ be an ancestor of internal node ⟨v = K_z⟩. Then ω(K_a) ≥ ω(K_z). That is, β_a ≥ β_z.

Proof (Lemma 1). Throughout, "⟨v ≺ K_i⟩" denotes a node in T that does an inequality comparison (≤ or <, not =) to key K_i. Abusing notation, in that context, "x ≺ K_i" (or "x ⊀ K_i") denotes that x passes (or fails) that comparison.

Assumption 1. (i) All nodes on the path from ⟨v = K_a⟩ to ⟨v = K_z⟩ do inequality comparisons. (ii) Along the path, some node ⟨v ≺ K_s⟩ separates key K_a from K_z: either K_a ≺ K_s but K_z ⊀ K_s, or K_z ≺ K_s but K_a ⊀ K_s.

It suffices to prove the lemma assuming (i) and (ii) above. (Indeed, if the lemma holds given (i), then, by transitivity, the lemma holds in general. Given (i), if (ii) doesn't hold, then exchanging the two nodes preserves correctness, changing the cost by (ω(K_a) − ω(K_z)) × d for some d ≥ 1, so ω(K_a) ≥ ω(K_z) and we are done.)

By Assumption 1, the subtree rooted at ⟨v = K_a⟩, call it T′, is as in Fig. 3(a). Let its child ⟨v ≺ K_b⟩, with "yes" subtree T_1 and "no" subtree T_2, be as in Fig. 3.

Lemma 2.
If K_a ≺ K_b, then ω(K_a) ≥ ω(T_2); else ω(K_a) ≥ ω(T_1).

(This and subsequent lemmas in this section are proved in Appendix 7.2. The idea behind this one is that correctness is preserved by replacing T′ by subtree (b) if K_a ≺ K_b, or (c) otherwise, implying the lemma by the optimality of T.)

Fig. 3. (a) The subtree T′ rooted at ⟨v = K_a⟩ and possible replacements (b), (c).

Case 1: Child ⟨v ≺ K_b⟩ separates K_a from K_z. If K_a ≺ K_b, then K_z ⊀ K_b, so descendant ⟨v = K_z⟩ is in T_2, and, by this and Lemma 2, ω(K_a) ≥ ω(T_2) ≥ ω(K_z), and we're done. Otherwise K_a ⊀ K_b, so K_z ≺ K_b, so descendant ⟨v = K_z⟩ is in T_1, and, by this and Lemma 2, ω(K_a) ≥ ω(T_1) ≥ ω(K_z), and we're done.

Case 2: Child ⟨v ≺ K_b⟩ does not separate K_a from K_z. Assume also that descendant ⟨v = K_z⟩ is in T_2. (If descendant ⟨v = K_z⟩ is in T_1, the proof is symmetric, exchanging the roles of T_1 and T_2.) Since descendant ⟨v = K_z⟩ is in T_2, and child ⟨v ≺ K_b⟩ does not separate K_a from K_z, we have K_a ⊀ K_b and two facts:

Fact A: ω(K_a) ≥ ω(T_1) (by Lemma 2), and
Fact B: the root of T_2 does an inequality comparison (by Assumption 1).

By Fact B, the subtree T′ rooted at ⟨v = K_a⟩ is as in Fig. 4(a):

Fig. 4. (a) The subtree T′ in Case 2, and two possible replacements (b), (c).

As in Fig. 4(a), let the root of T_2 be ⟨v ≺ K_c⟩, with "yes" subtree T_3 and "no" subtree T_4.

Lemma 3. (i) ω(T_1) ≥ ω(T_4). (ii) If K_a ⊀ K_c, then ω(K_a) ≥ ω(T_3).
(As replacing T′ by (b) or (c) preserves correctness; proof in Appendix 7.2.)

Case 2.1: K_a ⊀ K_c. By Lemma 3(ii), ω(K_a) ≥ ω(T_3). If descendant ⟨v = K_z⟩ is in T_3, then ω(T_3) ≥ ω(K_z), so transitively ω(K_a) ≥ ω(K_z), and we are done. (If instead ⟨v = K_z⟩ is in T_4, then Fact A and Lemma 3(i) give ω(K_a) ≥ ω(T_1) ≥ ω(T_4) ≥ ω(K_z).)

Case 2.2: K_a ≺ K_c. By Lemma 3(i), ω(T_1) ≥ ω(T_4). By Fact A, ω(K_a) ≥ ω(T_1). If ⟨v = K_z⟩ is in T_4, then ω(T_4) ≥ ω(K_z), and transitively we are done.

In the remaining case, ⟨v = K_z⟩ is in T_3. T's irreducibility implies K_z ≺ K_c. Since K_a ≺ K_c also (Case 2.2), grandchild ⟨v ≺ K_c⟩ does not separate K_a from K_z, and by Assumption 1 the root of subtree T_3 does an inequality comparison. Hence, the subtree rooted at ⟨v ≺ K_b⟩ is as in Fig. 5(a), with the root of T_3, ⟨v ≺ K_d⟩, having subtrees T_5 and T_6:

Fig. 5. (a) The subtree rooted at ⟨v ≺ K_b⟩ in Case 2.2. (b) A possible replacement.

Lemma 4. ω(T_1) ≥ ω(T_3).

(Because replacing (a) by (b) preserves correctness; proof in Appendix 7.2.) Since descendant ⟨v = K_z⟩ is in T_3, Lemma 4 implies ω(T_1) ≥ ω(T_3) ≥ ω(K_z). This and Fact A imply ω(K_a) ≥ ω(K_z). This proves Lemma 1. □

Proposition 1. If any leaf node ⟨K_ℓ⟩'s parent P does not do an equality comparison against key K_ℓ, then changing P so that it does gives an irreducible 2wcst T′ of the same cost.

Proof. Since Q_{⟨K_ℓ⟩} = {K_ℓ} and P's comparison operator is in C ⊆ {<, ≤, =}, it must be that K_ℓ = max Q_P or K_ℓ = min Q_P. So changing P to ⟨v = K_ℓ⟩ (with ⟨K_ℓ⟩ as the "yes" child and the other child as the "no" child) maintains correctness, cost, and irreducibility. □

Proof (Thm. 6). Consider any equality-testing node N = ⟨v = K_a⟩ and any key K_z ∈ Q_N.
Since K_z ∈ Q_N, node N has descendant leaf ⟨K_z⟩. Without loss of generality (by Proposition 1), leaf ⟨K_z⟩'s parent is ⟨v = K_z⟩. That parent is a descendant of ⟨v = K_a⟩, so ω(K_a) ≥ ω(K_z) by Lemma 1. □

3 Proofs of Thm. 1 (2wcst) and Thm. 3

First we prove Thm. 1. Fix an instance I = (K, Q, C, α, β). Assume for now that all probabilities in β are distinct. For any query subset S ⊆ Q, let opt(S) denote the minimum cost of any 2wcst that correctly determines all queries in subset S (using keys in K, comparisons in C, and weights from the appropriate restriction of α and β to S). Let ω(S) be the probability that a random query v is in S. The cost of any tree for S is the weight of the root (= ω(S)) plus the cost of its two subtrees, yielding the following dynamic-programming recurrence:

Lemma 5. For any query set S ⊆ Q not handled by a single-node tree,

opt(S) = ω(S) + min of:
(i) min_k opt(S \ {k}) (if "=" is in C, else ∞),
(ii) min_{k,≺} opt(S_{≺k}) + opt(S \ S_{≺k}),

where k ranges over K, ≺ ranges over the allowed inequality operators (if any), and S_{≺k} = {v ∈ S : v ≺ k}.

Using the recurrence naively to compute opt(Q) yields exponentially many query subsets S, because of line (i). But, by Thm. 6, we can restrict k in line (i) to be the maximum-likelihood key in S. With this restriction, the only subsets S that arise are intervals within Q, minus some most-likely keys. Formally, for each of the O(n^2) key pairs {k_1, k_2} ⊆ K ∪ {−∞, ∞} with k_1 < k_2, define four key intervals:

(k_1, k_2) = {v ∈ Q : k_1 < v < k_2},  [k_1, k_2] = {v ∈ Q : k_1 ≤ v ≤ k_2},
(k_1, k_2] = {v ∈ Q : k_1 < v ≤ k_2},  [k_1, k_2) = {v ∈ Q : k_1 ≤ v < k_2}.

For each of these O(n^2) key intervals I, and each integer h ≤ n, define top(I, h) to contain the h keys in I with the h largest β_i's.
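Before the maximum-likelihood restriction, the recurrence of Lemma 5 can be evaluated directly by memoizing over query subsets. A minimal sketch, assuming C = {<, =} and treating each query as a comparable value (exponentially many subsets arise in general, which is exactly what the restriction avoids; fine for tiny instances):

```python
# Sketch of the Lemma-5 recurrence, memoized over query subsets.
# Illustrative only: it separates the queries in S, with C = {'<', '='}.
from functools import lru_cache

def optimal_cost(Q, K, weight, C=('<', '=')):
    """Minimum expected number of comparisons to separate query set Q,
    where weight maps each query to its probability and K is the key set."""

    @lru_cache(maxsize=None)
    def opt(S):
        if len(S) <= 1:
            return 0.0                       # a single leaf suffices
        w = sum(weight[v] for v in S)        # ω(S): every query in S pays
        best = float('inf')                  # for the comparison at the root
        for k in K:
            if '=' in C and k in S:          # line (i): equality test on k
                best = min(best, opt(S - {k}))
            if '<' in C:                     # line (ii): inequality split
                yes = frozenset(v for v in S if v < k)
                if yes and yes != S:
                    best = min(best, opt(yes) + opt(S - yes))
        return w + best

    return opt(frozenset(Q))
```

For example, with three equally likely key queries, one equality test plus one further comparison gives expected cost 1 + 2/3 = 5/3.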
Define S(I, h) = I \ top(I, h). Applying the restricted recurrence to S(I, h) gives a simpler recurrence:

Lemma 6. If S(I, h) is not handled by a one-node tree, then opt(S(I, h)) equals ω(S(I, h)) plus the minimum of:
(i) opt(S(I, h + 1)) (if equality is in C, else ∞),
(ii) min_{k,≺} opt(S(I_{≺k}, h_{≺k})) + opt(S(I \ I_{≺k}, h − h_{≺k})),

where key interval I_{≺k} = {v ∈ I : v ≺ k}, and h_{≺k} = |top(I, h) ∩ I_{≺k}|.

Now, to compute opt(Q), each query subset that arises is of the form S(I, h), where I is a key interval and 0 ≤ h ≤ n. With care, each of these O(n^3) subproblems can be solved in O(n) time, giving an O(n^4)-time algorithm. In particular, represent each key interval I by its two endpoints. For each key interval I and integer h ≤ n, precompute ω(S(I, h)), top(I, h), and the h'th largest key in I. Given these O(n^3) values (computed in O(n^3 log n) time), the recurrence for opt(S(I, h)) can be evaluated in O(n) time. In particular, for line (ii), one can enumerate all O(n) pairs (k, h_{≺k}) in O(n) time total, and, for each, compute I_{≺k} and I \ I_{≺k} in O(1) time. Each base case can be recognized and handled (by a cost-0 leaf) in O(1) time, giving total time O(n^4). This proves Thm. 1 when all probabilities in β are distinct; Sec. 3.1 finishes the proof.

3.1 Reducing to distinct key probabilities

Here we show that, without loss of generality, in looking for an optimal search tree, one can assume that the key probabilities (the β_i's) are all distinct. Given any instance I = (K, Q, C, α, β), construct instance I′ = (K, Q, C, α, β′), where β′_j = β_j + jε and ε is a positive infinitesimal (or ε can be understood as a sufficiently small positive rational). To compute (and compare) costs of trees with respect to I′, maintain the infinitesimal part of each value separately and extend linear arithmetic component-wise in the natural way: 1.
Compute z × (x_1 + x_2 ε) as (z x_1) + (z x_2)ε, where z, x_1, x_2 are any rationals; 2. compute (x_1 + ε x_2) + (y_1 + ε y_2) as (x_1 + y_1) + (x_2 + y_2)ε; 3. and say x_1 + ε x_2 < y_1 + ε y_2 iff x_1 < y_1, or x_1 = y_1 and x_2 < y_2.

Lemma 7. In the instance I′, all key probabilities β′_i are distinct. If a tree T is optimal w.r.t. I′, then it is also optimal with respect to I.

Proof. Let A be a tree that is optimal w.r.t. I′. Let B be any other tree, and let the costs of A and B under I′ be, respectively, a_1 + a_2 ε and b_1 + b_2 ε. Then their respective costs under I are a_1 and b_1. Since A has minimum cost under I′, a_1 + a_2 ε ≤ b_1 + b_2 ε. That is, either a_1 < b_1, or a_1 = b_1 (and a_2 ≤ b_2). Hence a_1 ≤ b_1: that is, A costs no more than B w.r.t. I. Hence A is optimal w.r.t. I. □

Doing arithmetic this way increases the running time by a constant factor. This completes the proof of Thm. 1. The reduction can also be used to avoid the significant effort that Anderson et al. [1] devote to non-distinct key probabilities. For computing optimal binary split trees for unrestricted queries, the fastest known time is O(n^5), due to [6]. But [6] also gives an O(n^4)-time algorithm for the case of distinct key probabilities. With the above reduction, the latter algorithm gives O(n^4) time for the general case, proving Thm. 3.

4 Additive-3 approximation: proof of Thm. 2

Fix any instance I = (K, Q, C, α, β). If C is {=}, then the optimal tree can be found in O(n log n) time, so assume otherwise. In particular, < and/or ≤ is in C.
Assume that < is in C (the other case is symmetric).

The entropy H_I = −Σ_i β_i log β_i − Σ_i α_i log α_i is a lower bound on opt(I). For the case K = Q and C = {<}, Yeung's O(n)-time algorithm [16] constructs a 2wcst that uses only <-comparisons and whose cost is at most H_I + 2 − β_1 − β_n. We reduce the general case to that one, adding roughly one extra comparison.

Construct I′ = (K′ = K, Q′ = K, C′ = {<}, α′, β′), where each α′_i = 0 and each β′_i = β_i + α_i (except β′_1 = α_0 + β_1 + α_1). Use Yeung's algorithm [16] to construct tree T′ for I′. Tree T′ uses only the < operator, so any query v ∈ Q that reaches a leaf ⟨K_i⟩ in T′ must satisfy K_i ≤ v < K_{i+1} (or v < K_2 if i = 1). To distinguish v = K_i from K_i < v < K_{i+1}, we need only add one additional comparison at each leaf (except, if i = 1, we need two). By Yeung's guarantee, T′ costs at most H_{I′} + 2 − β′_1 − β′_n. The modifications can be done so as to increase the cost by at most 1 + α_0 + α_1, so the final tree costs at most H_{I′} + 3. By standard properties of entropy, H_{I′} ≤ H_I ≤ opt(I), proving Thm. 2.

5 Generalized binary split trees: proof of Thm. 4

A generalized binary split tree (gbst) is a rooted binary tree where each node N has an equality key e_N and a split key s_N. A search for query v ∈ Q starts at the root r. If v = e_r, the search halts. Otherwise, the search recurses on the left subtree (if v < s_r) or the right subtree (if v ≥ s_r). The cost of the tree is the expected number of nodes (including, by convention, leaves) visited for a random query v. Fig. 6 shows two gbsts for a single instance.

For an algorithm that works with linear (or O(1)-degree polynomial) functions of β.
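The probability folding behind the reduction to Yeung's successful-queries case can be sketched as follows; the function names are illustrative (not from the paper), and the inputs are a hypothetical small instance:

```python
# Sketch of the reduction behind Thm. 2: fold the gap probabilities alpha
# into per-key probabilities beta' for the successful-queries instance I'.
# beta'_i = beta_i + alpha_i, except beta'_1 also absorbs alpha_0.
import math

def fold_probabilities(alpha, beta):
    """alpha = [alpha_0, ..., alpha_n]; beta = [beta_1, ..., beta_n]."""
    n = len(beta)
    assert len(alpha) == n + 1
    bp = [beta[i] + alpha[i + 1] for i in range(n)]  # gap to the right of K_i
    bp[0] += alpha[0]                                # leftmost gap joins K_1
    return bp

def entropy(ps):
    """H = -sum p log2 p, the lower bound used for opt(I)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

alpha = [0.05, 0.10, 0.15, 0.20]   # gaps around keys K_1..K_3 (hypothetical)
beta  = [0.10, 0.20, 0.20]         # key probabilities (hypothetical)
bp = fold_probabilities(alpha, beta)   # approximately [0.25, 0.35, 0.40]
```

Since I′ merges probability masses, its entropy can only be smaller, matching the step H_{I′} ≤ H_I in the proof.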
If it is possible to distinguish v = K_i from K_i < v < K_{i+1}, then C must have at least one operator other than <, so we can add either ⟨v = K_i⟩ or ⟨v ≤ K_i⟩.

Fig. 6. Two gbsts for the same instance.
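The gbst search rule just defined can be sketched as follows. This is an illustrative representation (not Huang and Wong's), and for brevity it counts only the internal nodes visited, omitting the leaf-counting convention:

```python
# Sketch of the gbst search rule: each node holds an equality key e and a
# split key s. In a *generalized* split tree, e need not be the
# maximum-likelihood key of the node's query set.
class GNode:
    def __init__(self, e, s, left=None, right=None):
        self.e, self.s, self.left, self.right = e, s, left, right

def gbst_search(node, v, visited=0):
    """Return (found?, number of internal nodes visited)."""
    if node is None:
        return False, visited
    visited += 1
    if v == node.e:                      # equality test first; halt on match
        return True, visited
    nxt = node.left if v < node.s else node.right   # then the "<" split
    return gbst_search(nxt, v, visited)

# A tiny hypothetical gbst: root tests e='B' and splits at s='C'.
root = GNode('B', 'C', GNode('A', 'B'), GNode('D', 'E'))
```

Dropping the maximum-likelihood constraint on e is exactly the relaxation that Huang and Wong observe can yield cheaper trees.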