Conjectures on Optimal Nested Generalized Group Testing Algorithm
Yaakov Malinovsky
Department of Mathematics and Statistics
University of Maryland, Baltimore County, Baltimore, MD 21250, USA
February 28, 2020
Abstract
Consider a finite population of $N$ items, where item $i$ has probability $p_i$ of being defective. The goal is to identify all items by means of group testing. This is the generalized group testing problem (hereafter GGTP). In the case of $p_1 = \cdots = p_N = p$, Yao and Hwang (1990) proved that the pairwise testing algorithm is the optimal nested algorithm, with respect to the expected number of tests, for all $N$ if and only if $p \in [1 - 1/\sqrt{2}, (3 - \sqrt{5})/2]$ (R-range hereafter) (at the boundary values it is an optimal nested algorithm). In this note, we present a result that helps to define the generalized pairwise testing algorithm (hereafter GPTA) for the GGTP. We present two conjectures: (1) when all $p_i$, $i = 1, \ldots, N$, belong to the R-range, GPTA is the optimal procedure among nested procedures applied to the $p_i$ in nondecreasing order; (2) if all $p_i$, $i = 1, \ldots, N$, belong to the R-range, GPTA is the optimal nested procedure, i.e., it minimizes the expected total number of tests with respect to all possible testing orders in the class of nested procedures. Although these conjectures are logically reasonable, we were only able to empirically verify the first one up to a particular level of $N$. We also provide a short survey of GGTP.

Keywords: Individual testing; pairwise testing

AMS Subject Classification:
Introduction

Common $p$ case

Robert Dorfman introduced the concept of group testing in 1943, motivated by the need to administer syphilis tests to millions of individuals drafted into the U.S. Army during World War II. Interesting historical details related to the problem formulation can be found in Du and Hwang (1999). A nice description of the Dorfman (1943) procedure is given by Feller (1950): “A large number, $N$, of people are subject to a blood test. This can be administered in two ways. (i) Each person is tested separately. In this case $N$ tests are required. (ii) The blood samples of $k$ people can be pooled and analyzed together. If the test is negative, this one test suffices for the $k$ people. If the test is positive, each of the $k$ persons must be tested separately, and all $k + 1$ tests are required for the $k$ people. Assume the probability $p$ that the test is positive is the same for all and that people are stochastically independent.” Procedure (ii) is commonly referred to as the Dorfman group testing procedure.

Since then, group testing has found widespread applications. A partial list includes quality control in product testing (Sobel and Groll, 1959), communication networks (Wolf, 1985), American Red Cross screening of blood donations for HIV (Dodd et al., 2002), and identification of rare alleles (Shental et al., 2010), among others.

Consider a set $S$ of $N$ items, where each item has probability $p$ of being defective and probability $q = 1 - p$ of being good, independently of the other items. Following the accepted notation in the group testing literature, we call this set a binomial set (Sobel and Groll, 1959). A group test applied to a subset $x$ is a binary test with two possible outcomes, positive or negative. The outcome is negative if all items in $x$ are good, whereas the outcome is positive if at least one item in $x$ is defective. In the latter case we call such a set defective or contaminated.
The goal is complete identification of all $N$ items with the minimum expected number of tests.

Every reasonable group testing algorithm should satisfy the following properties (Sobel and Groll, 1959; Ungar, 1960): (P1) items that are classified as positive or negative will never be tested again, and (P2) a test is not performed if its outcome can be inferred from previous test results. In addition, if a subset of good items $I'$ is removed from the defective set $I$, then the remaining items $I - I'$ form a defective set, and it follows from (P2) that this defective set should not be tested as a whole group.

A nested class of group testing algorithms was introduced by Sobel and Groll (1959) [see also Hwang (1976) and Yao and Hwang (1990)], and can be described as follows:

(a) At each stage $t$ ($t = 0, 1, \ldots, T$) of the execution of a nested algorithm, the set $S$ is partitioned into disjoint sets $B_t$, $C_t$, and $D_t$, where $C_t$ is a set of classified units, $B_t$ is a binomial set, and $D_t$ is a defective set. At the beginning of the process (stage 0), $B_0 = S$, and both $C_0$ and $D_0$ are empty. At the termination of the process (stage $T$), $C_T = S$, and both $B_T$ and $D_T$ are empty. If at any stage during the process $|D_t| = 1$, then, according to (P1) above, this sole defective item should be moved from set $D_t$ into set $C_t$.

(b) At each stage $t$ of the algorithm execution, if $D_{t-1}$ is not empty, then a proper subset $D'_{t-1}$ of $D_{t-1}$ is tested. If the outcome of testing $D'_{t-1}$ is positive, then $C_t = C_{t-1}$, $D_t = D'_{t-1}$, and $B_t = S - C_t - D_t$ (this follows from Result 1 below); if the outcome of testing $D'_{t-1}$ is negative, then $C_t = C_{t-1} + D'_{t-1}$, $D_t = D_{t-1} - D'_{t-1}$, and $B_t = B_{t-1}$. Otherwise, if $D_{t-1}$ is empty and $B_{t-1}$ is not empty, then a subset $B'_{t-1}$ of $B_{t-1}$ is tested.
If the outcome of testing $B'_{t-1}$ is positive, then $C_t = C_{t-1}$, $D_t = B'_{t-1}$, and $B_t = B_{t-1} - B'_{t-1}$; if the outcome of testing $B'_{t-1}$ is negative, then $C_t = C_{t-1} + B'_{t-1}$, $D_t = D_{t-1}$, and $B_t = B_{t-1} - B'_{t-1}$.

An optimal nested procedure in the form of a dynamic programming algorithm was found by Sobel and Groll (1959). Subsequently, Sobel (1960) and Hwang (1976) improved its computational efficiency. Recently, Zaman and Pippenger (2016) provided an asymptotic analysis of the optimal nested procedure. Finally, different aspects concerning the nested class of group testing procedures were summarized and investigated in Malinovsky and Albert (2019). For $N = 2$, the optimal algorithm coincides with Huffman's (Huffman, 1952) encoding algorithm (Sobel, 1967). However, the optimal nested algorithm is not optimal for $N \ge$ [...] $p < (3 - \sqrt{5})/2$ [...] $N$. For $p \ge (3 - \sqrt{5})/2$ ($1 - q - q^2 \ge 0$, $q = 1 - p$), Ungar (1960) proved that the optimal group testing procedure is individual, one-by-one testing (at the boundary point it is an optimal procedure).

The pairwise nested algorithm belongs to the nested class and was defined by Yao and Hwang (1990). A verbatim definition of it is as follows:

We define the pairwise testing algorithm by the following two rules:

(i) If no contaminated set exists, then always test a pair from the binomial set unless only one item is left, in which case we test that item.

(ii) If a contaminated pair is found, test one item of that pair. If that item is good, we deduce the other is defective. Thus we classify both items and only a binomial set remains to be classified. If the tested item is defective, then by a result of Sobel and Groll (1959), the other item together with the remaining binomial set forms a new binomial set. So, both cases reduce to a binomial set. It is easily verified that at all times the unclassified items belong to either a binomial set or, a contaminated pair. Thus the pairwise testing algorithm is well defined and is nested.
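As an illustration of rules (i) and (ii), the expected number of tests of the pairwise testing algorithm for a common $p$ can be sketched with a short recursion. This sketch is not taken from Yao and Hwang (1990); the function name `pairwise_expected_tests` and the constants `R_LOW` and `R_HIGH` are hypothetical names introduced here. Conditioning on the first pair gives $E(n) = 2 - q^2 + q\,E(n-2) + p\,E(n-1)$ with $E(0) = 0$ and $E(1) = 1$: the pair test is negative with probability $q^2$ (one test, $n - 2$ items remain), the pair is positive and the tested item is good with probability $q(1 - q)$ (two tests, $n - 2$ items remain), and the tested item is defective with probability $p$ (two tests, $n - 1$ items remain).

```python
import math

# Endpoints of the R-range quoted in the text: [1 - 1/sqrt(2), (3 - sqrt(5))/2].
# The names R_LOW and R_HIGH are illustrative, not from the paper.
R_LOW = 1.0 - 1.0 / math.sqrt(2.0)
R_HIGH = (3.0 - math.sqrt(5.0)) / 2.0


def pairwise_expected_tests(n, p):
    """Expected number of tests of the pairwise algorithm for n items, common p.

    Uses E(n) = 2 - q^2 + q*E(n-2) + p*E(n-1), with E(0) = 0 and E(1) = 1.
    """
    q = 1.0 - p
    if n == 0:
        return 0.0
    e_two_back, e_one_back = 0.0, 1.0  # E(0) and E(1)
    for _ in range(2, n + 1):
        e_two_back, e_one_back = (
            e_one_back,
            2.0 - q * q + q * e_two_back + p * e_one_back,
        )
    return e_one_back


if __name__ == "__main__":
    for p in (R_LOW, 0.35, R_HIGH):
        print(p, pairwise_expected_tests(10, p))
```

For $n = 2$ the recursion reduces to $3 - q - q^2$, which equals 2 (the cost of individual testing) exactly at $p = (3 - \sqrt{5})/2$, the upper end of the R-range.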
The following result offers a closed-form design for the optimal nested procedure, which can be resolved without computational effort, provided that all $p_i$, $i = 1, \ldots, N$, belong to the R-range.

Theorem 1. Yao and Hwang (1990)
The pairwise testing algorithm is the unique (up to the substitution of equivalent items) optimal nested algorithm for all $N$ if and only if $1 - 1/\sqrt{2} \le p \le (3 - \sqrt{5})/2$ (at the boundary values the pairwise testing algorithm is an optimal nested algorithm).

The generalized group testing problem (GGTP) concerns $N$ stochastically independent units $u_1, u_2, \ldots, u_N$, where unit $u_i$ has probability $p_i$ ($0 < p_i < 1$) of being defective and probability $q_i = 1 - p_i$ of being good. We assume that the probabilities $p_1, p_2, \ldots, p_N$ are known and that we can decide the order in which the units will be tested. All units have to be classified as good or defective by group testing. The generalized group testing problem was first introduced by Sobel (1960) on page 144. In that work, two (or more) different kinds of units are presented and can be put into the same test group. In the case of two kinds of units with known probabilities $q_1 \ge q_2$, individual testing is optimal if $3 - q_1 - q_1 q_2 > 2$. This result follows from the Huffman (1952) encoding algorithm construction when $N = 2$ (Sobel, 1960). Since its introduction, GGTP has been investigated by several authors (Lee and Sobel (1972); Nebenzahl and Sobel (1973); Katona (1973); Nebenzahl (1975); Hwang (1976); Yao and Hwang (1988a,b); Kurtz and Sidi (1988); Yao and Hwang (1990); Kealy et al. (2014); Malinovsky (2019)). Even for a particular nested group testing algorithm, the optimal regime (or the order in which groups/units will be tested) is known only for the Dorfman procedure (Dorfman, 1943), due to Hwang (1976).

For the GGTP, Hwang (1976) proved that under Dorfman's procedure an optimal partition is an ordered partition (i.e., each pair of subsets has the property that the numbers in one subset are all greater than or equal to every number in the other subset). Dorfman's procedure is then performed on each subset. This allowed Hwang to find the optimal solution using a dynamic programming algorithm with computational effort $O(N^2)$. However, even for a slightly modified Dorfman procedure or the Sterrett (1957) procedure, the ordered partition is not optimal (Malinovsky, 2019). As the total number of possible partitions is the Bell number, it is impossible to use brute-force search to obtain an optimal solution, which is unknown (Hwang, 1981; Malinovsky, 2019). Kurtz and Sidi (1988) provided a dynamic programming (DP) algorithm with computational effort $O(N^3)$ to find an optimal nested procedure for a given order of units $u_1, \ldots, u_N$ (this order should be preserved at all stages of the testing process). In addition, Kurtz and Sidi (1988) used the Ungar (1960) method and extended the Sobel (1960) result from $N = 2$ to general $N$. Namely, they proved that if $3 - q_1 - q_1 q_2 > 2$, where $q_1 \ge \cdots \ge q_N$, then individual testing is optimal. Closely related results were obtained by Yao and Hwang (1988b), and can be summarized as follows:

Theorem 2. Yao and Hwang (1988b)
Assume without loss of generality that $0 < p_1 \le p_2 \le \cdots \le p_N < 1$. Then,

1. If $3 - q_1 - q_1 q_i > 2$, then there exists an optimal algorithm which tests $u_i$ individually.

2. Denote by $k = \sup\{\sup\{i = 1, \ldots, N : 3 - q_1 - q_1 q_i > 2\}, 0\}$, with $\sup\{\emptyset\} = -\infty$. Then there exists an optimal algorithm which tests $u_{k+1}, \ldots, u_N$ individually.

3. If there exists an optimal algorithm in which $u_i$ is tested individually, there exists an optimal algorithm in which $u_j$ is tested individually for all $j$ with $p_j > p_i$.

It is important to note that, in contrast to Ungar (1960), the results of Kurtz and Sidi (1988) and Yao and Hwang (1988b) provide a sufficient, but not necessary, condition. Yao and Hwang (1988b) constructed an example with $N = 3$, $p_1 = [...]$, $p_2 = [...]$, $p_3 = [...]$, in which $3 - q_1 - q_1 q_i < 2$ for a unit $u_i$ that the optimal algorithm nevertheless tests individually. In contrast, if $p_i < (3 - \sqrt{5})/2$ for $i = 1, \ldots, N$, then $3 - q_j - q_j q_i < 2$ for $i, j = 1, \ldots, N$, and therefore no item should be tested individually unless there are no items left to combine. In addition, it was shown in Yao and Hwang (1988a) that $E^*(p_1, \ldots, p_N)$ is nondecreasing in each $p_i$, $i = 1, \ldots, N$, where $E^*(p_1, \ldots, p_N)$ denotes the expected number of tests for an optimal algorithm in GGTP. In combination with Ungar (1960), this result implies that if $p_i > (3 - \sqrt{5})/2$ for $i = 1, \ldots, N$, then the optimal group testing procedure is to perform individual, one-by-one testing.

We want to define the generalized pairwise testing algorithm (GPTA) for the GGTP. The two results below will help us to proceed. The first result is a simple generalization of the Sobel and Groll (1959) result for the common $p$ case to the GGTP (see also Kurtz and Sidi (1988)).

Result 1 (Sobel and Groll (1959)). In the GGTP, given a defective set $I$ and given that a proper subset $I_1$, $I_1 \subset I$, contains at least one defective unit, the posterior distribution of the units in the subset $I - I_1$ is the same as it was before any testing.

The second result describes an optimal rule for nested testing in the case that, at some stage, we have to test two particular units $a$ and $b$.

Result 2. Suppose that a nested procedure is applied. Also suppose that the $n$ units that remain to be tested, $a, b, u_3, \ldots, u_n$, all have unknown status, and the corresponding probabilities of those units being good are $q_a, q_b, q_3, \ldots, q_n$. We start by testing two units together, as a group, with the corresponding probabilities $q_a$ and $q_b$, where $q_a \ge q_b$. Then, under this setting, when the first group test of units $a$ and $b$ is positive, we should next test the unit for which the corresponding probability of being good is largest, i.e. unit $a$ (call it algorithm A). If the outcome of testing unit $a$ is negative, then the second unit is positive by deduction.
Otherwise, if the outcome of testing unit $a$ is positive, then by Result 1 the conditional distribution of the status of the second unit is a Bernoulli distribution with parameter $p_b = 1 - q_b$, and units $b, u_3, \ldots, u_n$ remain to be tested.

Proof. The proof is based on direct comparison of two possible algorithms, namely, algorithms A and B, where, in algorithm B, we first test unit $b$ individually. Denote by $T$ the total number of tests and by $E(p_{i_1}, \ldots, p_{i_k})$ the total expected number of tests of units $i_1, \ldots, i_k$ with the corresponding probabilities $p_{i_1}, \ldots, p_{i_k}$ under a nested procedure. The left branch of each tree below represents a negative test result, and the right branch represents a positive test result.

test $\{a, b\}$
    negative: $T = 1 + E(p_3, \ldots, p_N)$ with prob. $q_a q_b$
    positive: test $\{a\}$
        negative: $T = 2 + E(p_3, \ldots, p_N)$ with prob. $q_a (1 - q_b)$
        positive: $T = 2 + E(p_b, p_3, \ldots, p_N)$ with prob. $1 - q_a$
Figure 1: Algorithm A

test $\{a, b\}$
    negative: $T = 1 + E(p_3, \ldots, p_N)$ with prob. $q_a q_b$
    positive: test $\{b\}$
        negative: $T = 2 + E(p_3, \ldots, p_N)$ with prob. $q_b (1 - q_a)$
        positive: $T = 2 + E(p_a, p_3, \ldots, p_N)$ with prob. $1 - q_b$
Figure 2: Algorithm B

Let $E_A(T)$ and $E_B(T)$ be the expected total number of tests under algorithms A and B, respectively. We have

$E_A(T) = q_a E(p_3, \ldots, p_N) + (1 - q_a) E(p_a, p_3, \ldots, p_N) + 2 - q_a q_b$,
$E_B(T) = q_b E(p_3, \ldots, p_N) + (1 - q_b) E(p_b, p_3, \ldots, p_N) + 2 - q_a q_b$.

Since $E(p_1, p_2, \ldots, p_k)$ is non-decreasing in each $p_i$ for $0 \le p_i \le 1$ and $p_a \le p_b$, we have $E(p_a, p_3, \ldots, p_N) \le E(p_b, p_3, \ldots, p_N)$. Therefore, we obtain

$E_A(T) - E_B(T) \le (q_b - q_a) \left( E(p_b, p_3, \ldots, p_N) - E(p_3, \ldots, p_N) \right) \le 0$.

The last inequality follows from the obvious fact that $E(p_3, \ldots, p_N) \le E(p_b, p_3, \ldots, p_N)$.

Remark 1.
The intuition behind Result 2 is as follows: Suppose that the test of the pair $\{a, b\}$ is positive. Then, if a subsequent individual test of one unit from the set $\{a, b\}$ is negative, we can conclude by deduction (without actual testing) that the second unit is positive (possibility 1). Alternatively, if the subsequent individual test is positive, then the status of the remaining unit is unknown, and this unit will at some stage be tested in a group or individually (possibility 2). Since $q_a \ge q_b$ and we prefer possibility 1 over possibility 2, we should select unit $a$ to be tested first.

Now we are ready to define GPTA for GGTP.

Definition 1.
Let $u_1, u_2, \ldots, u_N$ be the fixed initial order of units to test, for which the corresponding probabilities of being good are $q_1, \ldots, q_N$. We define the generalized pairwise testing algorithm (GPTA) by the following rules:

(a) Test the pair $\{u_1, u_2\}$. If the outcome is negative, then continue by testing the next pair unless only one unit is left, in which case we test that unit.

(b) If the outcome is positive, then test the unit with the greater probability of being good, i.e. unit $u_{j_1}$ where $j_1 = \arg\max(q_1, q_2)$. If unit $u_{j_1}$ is found to be good, then the other unit $u_{j_2}$, where $j_2 = \arg\min(q_1, q_2)$, is defective by deduction. Otherwise, if the tested unit $u_{j_1}$ is defective, then by Result 1 the conditional distribution of the status of $u_{j_2}$ is a Bernoulli distribution with parameter $p_{j_2} = 1 - q_{j_2}$, and units $u_{j_2}, u_3, \ldots, u_N$ remain to be tested. Continue with testing the next pair of units.

Note that GPTA does not necessarily preserve the initial predetermined testing order; i.e. even if a defective unit $u_i$ is tested no later than unit $u_j$, $u_i$ may remain in the testing process even after unit $u_j$ is identified. However, if the initial predetermined testing order follows a nondecreasing order of the $p_i$'s ($p_1 \le p_2 \le \cdots \le p_N$), then GPTA preserves the initial testing order.

It is natural to expect that the result of Yao and Hwang (1990) will hold for GGTP in the case of $1 - 1/\sqrt{2} \le p_i \le (3 - \sqrt{5})/2$, $i = 1, \ldots, N$. The following example helps us to understand this situation. Here, we compare GPTA with an optimal nested procedure from Kurtz and Sidi (1988) for all possible testing orders. Their procedure requires the initial testing order to be preserved throughout the testing process; otherwise its computational complexity would be exponential as a function of $N$. Therefore, the procedure of Kurtz and Sidi (1988) does not necessarily satisfy the optimal rule obtained in Result 2 when two units are tested, unless the $p_i$'s are arranged in nondecreasing order.
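Rules (a) and (b) of Definition 1 can be turned into a recursion for the expected number of tests of GPTA under a given initial testing order. The following is a minimal sketch under that reading of the definition, not the author's implementation; the function name `gpta_expected_tests` and the variable names are hypothetical. For the leading pair with good-probabilities $q_a \ge q_b$ and remaining units $R$, the recursion is $E = 2 - q_a q_b + q_a E(R) + (1 - q_a) E(b, R)$, with $E(\emptyset) = 0$ and $E$ of a single unit equal to 1.

```python
from functools import lru_cache
from itertools import permutations


@lru_cache(maxsize=None)
def gpta_expected_tests(q):
    """Expected number of GPTA tests for the testing order q.

    q is a tuple of probabilities of being good, in the initial testing order.
    """
    if len(q) == 0:
        return 0.0
    if len(q) == 1:
        return 1.0  # a single remaining unit is tested individually
    # Unit a is the member of the leading pair with the larger P(good).
    qa, qb = max(q[0], q[1]), min(q[0], q[1])
    rest = q[2:]
    # Pair negative (prob qa*qb): 1 test, continue with rest.
    # Pair positive, a good (prob qa*(1-qb)): 2 tests, b defective by deduction.
    # Pair positive, a defective (prob 1-qa): 2 tests, b is paired with the next unit.
    return (
        2.0
        - qa * qb
        + qa * gpta_expected_tests(rest)
        + (1.0 - qa) * gpta_expected_tests((qb,) + rest)
    )


if __name__ == "__main__":
    q_example = (0.68, 0.65, 0.62, 0.62)
    # Distinct initial testing orders (the enumeration order here is arbitrary).
    orders = sorted(set(permutations(q_example)), reverse=True)
    for order in orders:
        print(order, round(gpta_expected_tests(order), 4))
```

For the order (0.68, 0.65, 0.62, 0.62) the recursion returns approximately 3.8576, in agreement with the value of $E_P(T)$ reported for permutation 1 in Example 1. The memoized recursion branches twice per pair, so it is intended only for small $N$.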
Example 1.
Suppose $\{q_1, q_2, q_3, q_4\} = \{0.68, 0.65, 0.62, 0.62\}$.

Permutation   Initial testing order     $E_P(T)$   $E_{Ne}(T)$
1             0.68  0.65  0.62  0.62    3.8576     3.8576
2             0.68  0.62  0.65  0.62    [...]      [...]
[...]

Table: The expected total number of tests $E_P(T)$ for all possible initial testing orders under the GPTA and the expected total number of tests $E_{Ne}(T)$ under an optimal nested ordered procedure following the algorithm by Kurtz and Sidi (1988).

Comment 1. (Example 1) The following observations were made:

(a) For the initial ordered testing $q_1 \ge q_2 \ge q_3 \ge q_4$ (permutation 1) both algorithms are identical. This permutation is not optimal.

(b) In all cases except permutation 1, the procedure by Kurtz and Sidi (1988) for the given order differs from GPTA, but in all cases the testing group size under their procedure does not exceed 2 and can be 1. For example, under permutation 2 this procedure is presented below with the corresponding $E_{Ne}(T) = 3.[...]$.

[Tree diagram of the Kurtz and Sidi (1988) procedure under permutation 2. Node labels: test (0.68, 0.62); test (0.65, 0.62), test (0.68); test (0.65, 0.62), test (0.62) + test (0.65, 0.62).]

(c) For permutation 11 the GPTA is not optimal.

(d) The expected length of an optimal prefix Huffman code, which serves as a theoretical and generally non-attainable lower bound (Nebenzahl and Sobel, 1973), is 3.7977344.

(e) Initial testing orders 2, 5, 7, and 9 are optimal, with the corresponding expected number of tests equal to $E_P(T) = 3.[...]$. GPTA corresponding to initial testing order 2 is presented below.

[Tree diagram of GPTA for initial testing order 2.]

We conjecture that the result of Yao and Hwang (1990) (Theorem 1) holds for GGTP. That is, for a given testing order concerning the values $p_1, \ldots, p_N$, an optimal design in closed form can be determined without any computational effort. The precise formulation of this conjecture is presented in the next section.

Conjecture 1.
Given that $u_1, u_2, \ldots, u_N$ are labeled according to a non-decreasing order $p_1 \le p_2 \le \cdots \le p_N$, such that $1 - 1/\sqrt{2} \le p_i \le (3 - \sqrt{5})/2$ for $i = 1, \ldots, N$, GPTA is the optimal nested ordered algorithm (at the boundary values, the pairwise testing algorithm is an optimal nested algorithm).

Conjecture 1 was empirically verified for $N \le [...]$. We generated $N$ values from a continuous uniform $[1 - 1/\sqrt{2}, (3 - \sqrt{5})/2]$ distribution and ordered them such that $p_1 \le p_2 \le \cdots \le p_N$. Then, we applied the optimal ordered (with respect to $u_1, \ldots, u_N$) nested procedure by Kurtz and Sidi (1988) along with the optimal pairwise testing procedure. For this particular order, the optimal rule presented in Result 2 automatically holds for the algorithm of Kurtz and Sidi (1988). In both procedures, the expected total number of tests was calculated to verify that the difference between those expectations equals zero. We repeated this process a number of times. However, since the computational effort of the Kurtz and Sidi (1988) algorithm is proportional to $N^3$, it is not computationally feasible to make many repetitions when $N$ is large. Therefore, the number of repetitions was chosen as a decreasing function of $N$. For the 100 smallest values of $N$, we repeated the process 500 times; for each successive 100 values of $N$, we decreased the number of repetitions by half, ultimately performing only a single repeat for the 100 largest values of $N$.

Conjecture 2.
For all positive integer values of $N$ and all $p_i$, $i = 1, \ldots, N$, in the interval $[1 - 1/\sqrt{2}, (3 - \sqrt{5})/2]$, the generalized pairwise testing algorithm is the optimal nested procedure (at the boundary values, GPTA is an optimal nested algorithm); that is, within the class of nested procedures, this approach minimizes the expected total number of tests with respect to all possible testing orders.

Remark 2.
For $N = 2$ and $1 - 1/\sqrt{2} \le p_i \le (3 - \sqrt{5})/2$, $i = 1, 2$, the optimal nested algorithm is GPTA, and it is also the optimal group testing procedure because it coincides with Huffman's (Huffman, 1952) encoding algorithm.

If Conjecture 2 is true, it is not clear whether the problem of finding the optimal GPTA with respect to all possible testing orders is a computationally tractable problem (Garey and Johnson, 1979). However, it may still be possible to provide a proof of existence.

It was suggested by an anonymous reviewer that for even values of $N$, a guess for an optimal ordering of items may be as follows: Split the units $u_1, u_2, \ldots, u_N$ with corresponding probabilities $p_1 \le p_2 \le \cdots \le p_N$ into subsets $U_1 = \{u_1, \ldots, u_M\}$ and $U_2 = \{u_{M+1}, \ldots, u_N\}$. Then apply GPTA to the testing order $u_1, u_{M+1}, u_2, u_{M+2}, \ldots, u_M, u_N$. This order appears to be optimal for GPTA in the case of $N = 4$, as was empirically verified. However, this ordering is not optimal for the next even value of $N$, i.e. $N = 6$. At this stage, we do not have a good guess for a best ordering.

Acknowledgement.
The author thanks the associate editor and two anonymous reviewers for their exceptionally insightful and helpful reports, which led to significant improvements in the paper.
References
Dodd, R.Y., Notari IV, E.P., Stramer, S.L. (2002). Current prevalence and incidence of infectious disease markers and estimated window-period risk in the American Red Cross blood donor population. Transfusion 42, 975–979.

Dorfman, R. (1943). The detection of defective members of large populations. The Annals of Mathematical Statistics 14, 436–440.

Du, D., Hwang, F. K. (1999). Combinatorial Group Testing and its Applications. World Scientific, Singapore.

Du, D., Hwang, F. K. (2006). Pooling Design and Nonadaptive Group Testing: Important Tools for DNA Sequencing. World Scientific, Singapore.

Feller, W. (1950). An introduction to probability theory and its application. New York: John Wiley & Sons.

Garey, M. R., and Johnson, D. S. (1979). Computers and Intractability. A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., San Francisco, Calif.

Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the I.R.E. 40, 1098–1101.

Hwang, F. K. (1976). An optimal nested procedure in binomial group testing. Biometrics 32, 939–943.

Hwang, F. K. (1981). Optimal Partitions. J. Optim. Theory Appl. 34, 1–10.

Katona, G. O. H. (1973). Combinatorial search problems. In J.N. Srivastava et al., A Survey of Combinatorial Theory, [...].

Kealy et al. (2014). [...] Proc. 52nd Annu. Allerton Conf. Commun. Control Comput., 101–108.

Kurtz, D., and Sidi, M. (1988). Multiple access algorithms via group testing for heterogeneous population of users. IEEE Trans. Commun. 36, 1316–1323.

Lee, J.K., and Sobel, M. (1972). Dorfman and R1-type procedures for a generalized group testing problem. Mathematical Biosciences 15, 317–340.

Malinovsky, Y. (2019). Sterrett procedure for the generalized group testing problem. Methodology and Computing in Applied Probability 21, 829–840.

Malinovsky, Y., Albert, P. S. (2019). Revisiting nested group testing procedures: new results, comparisons, and robustness. The American Statistician 73, 117–125.

Nebenzahl, E., and Sobel, M. (1973). Finite and infinite models for generalized group-testing with unequal probabilities of success for each item. In T. Cacoullos, ed., Discriminant Analysis and Applications, New York: Academic Press Inc., 239–284.

Nebenzahl, E. (1975). Binomial group testing with two different success parameters. Stud. Math. Hung. 10, 61–72.

Shental, N., Amir, A., and Zuk, O. (2010). Identification of rare alleles and their carriers using compressed se(que)nsing. Nucleic Acids Research 38, 1–22.

Sobel, M., Groll, P. A. (1959). Group testing to eliminate efficiently all defectives in a binomial sample. Bell System Tech. J. 38, 1179–1252.

Sobel, M. (1960). Group testing to classify efficiently all defectives in a binomial sample. Information and Decision Processes (R. E. Machol, ed.; McGraw-Hill, New York), pp. 127–161.

Sobel, M. (1967). Optimal group testing. Proc. Colloq. on Information Theory, Bolyai Math. Society, Debrecen, Hungary, 411–488.

Sterrett, A. (1957). On the detection of defective members of large populations. The Annals of Mathematical Statistics 28, 1033–1036.

Ungar, P. (1960). Cutoff points in group testing. Comm. Pure Appl. Math. [...].

Wolf (1985). [...] IEEE Transactions on Information Theory [...].

Yao, Y. C., Hwang, F. K. (1988a). [...] SIAM J. Disc. Math. [...], 256–259.

Yao, Y. C., Hwang, F. K. (1988b). Individual testing of independent items in optimum group testing. Probab. Eng. Inform. Sci. [...], 23–29.

Yao, Y. C., Hwang, F. K. (1990). On optimal nested group testing algorithms. J. Stat. Plan. Inf. 24, 167–175.

Zaman, N., and Pippenger, N. (2016). Asymptotic analysis of optimal nested group-testing procedures.
Prob. Eng. Inform. Sci.3