Improved algorithms for non-adaptive group testing with consecutive positives
Thach V. Bui ∗, Mahdi Cheraghchi †, and Thuc D. Nguyen ∗
∗ University of Science, VNU-HCMC, Ho Chi Minh City, Vietnam, {bvthach,ndthuc}@fit.hcmus.edu.vn
† University of Michigan, Ann Arbor MI, [email protected]
Abstract
The goal of group testing is to efficiently identify a few specific items, called positives, in a large population of items via tests. A test is an action on a subset of items that returns positive if the subset contains at least one positive and negative otherwise. In non-adaptive group testing, all tests are independent, can be performed in parallel, and are represented as a measurement matrix. In this work, we consider non-adaptive group testing with consecutive positives, in which the items are linearly ordered and the positives are consecutive in that order. We propose two improved algorithms for efficiently identifying consecutive positives. In particular, without storing measurement matrices, we can identify up to $d$ consecutive positives with $2\log_2\frac{n}{d} + 2d$ ($4\log_2\frac{n}{d} + 2d$, resp.) tests in $O\left(\log_2^2\frac{n}{d} + d\right)$ ($O\left(\log_2\frac{n}{d} + d\right)$, resp.) time. These results significantly improve the state-of-the-art scheme, which takes $5\log_2\frac{n}{d} + 2d + 21$ tests to identify the positives in $O\left(\frac{n}{d}\log_2\frac{n}{d} + d^2\right)$ time with the measurement matrices associated with the scheme stored somewhere.

I. INTRODUCTION
A. Group testing
The goal of group testing (GT) is to efficiently identify up to $d$ positive items in a large population of $n$ items. Positive items satisfy some specific properties while negative items do not. Introduced by the seminal work of Dorfman [1] during World War II, GT was considered an efficient way to save time and money in identifying syphilitic draftees among a large population of draftees. Currently, with the ongoing Covid-19 pandemic since 2019, GT has been found to be an efficient tool for mass testing to identify infected persons [2], [3]. The strategy of GT is as follows. Instead of testing each item one by one to verify whether it is positive or negative, a group of items is pooled and then tested. In the noiseless setting, the outcome of a test on a group of items is positive if the group has at least one positive and negative otherwise.

There are two basic approaches to designing tests. The first is adaptive group testing, in which the design of a test depends on the designs of the previous tests. This approach usually attains information-theoretic bounds on the number of tests required; however, it takes much time because of its multiple stages. To remedy this drawback, the second approach, non-adaptive group testing (NAGT), designs all tests independently so that they can be performed simultaneously. NAGT has been widely applied in various fields such as computational and molecular biology [4], networking [5], and neuroscience [6]. The focus of the work reported here is on the second approach, i.e., NAGT.

NAGT can be represented by a measurement matrix indicating how each test should be executed, in which the number of rows and the number of columns are the number of tests and the number of items, respectively. An entry equal to 1 at row $i$ and column $j$ indicates that the $j$th item in the input set belongs to test $i$; an entry equal to 0 indicates that it does not.
The procedure to get the matrix is called construction, the procedure to get the outcomes of all tests using the measurement matrix is called encoding, and the procedure to get the positive items from the outcomes is called decoding. Note that the encoding procedure includes the construction procedure.

A measurement matrix is random if it satisfies the preconditions after the construction procedure with some probability. Meanwhile, a measurement matrix is explicit if it can be constructed in $\mathrm{poly}(d, n)$ time. However, random and explicit matrices are not good in practice because they usually have to be stored somewhere before use. A matrix that is a good fit in practice is strongly explicit, meaning that every column of the matrix can be generated in $\mathrm{poly}(d, \log n)$ time. This implies that it is unnecessary to store the matrix. Nonetheless, an explicit matrix can be generated with or without using probability. In the latter case, the matrix is nonrandom.

There are two main requirements in group testing: minimizing the number of tests and efficiently identifying the set of positive items. In combinatorial GT, i.e., when up to $d$ positive items are uniformly distributed in a population of $n$ items, at least $\Omega(d \log(n/d))$ tests are required to identify all positive items [4] in the non-adaptive design. Porat and Rothschild [7] constructed an explicit measurement matrix with $O(d^2 \ln n)$ rows. However, their construction is not associated with any efficient (sublinear in $n$) decoding algorithm. To obtain an efficient decoding algorithm, namely $\mathrm{poly}(d, \ln n)$, while keeping the number of tests as small as possible, namely $O(d^{1+o(1)} \ln^{1+o(1)} n)$, several schemes have been proposed [8]–[12]. Recently, Bondorf et al. [13] presented a bit-mixing coding that achieves asymptotically vanishing error probability with $O(d \log n)$ tests to identify defective items in time $O(d^2 \log d \cdot \log n)$ as $n \to \infty$. For further reading, we refer readers to the survey in [14].

B. Group testing with consecutive positives
Colbourn [15] first considered a specific case of group testing, called group testing with consecutive positives, in which the input items are linearly ordered and the positives are consecutive in that order. Suppose that the number of items is $n$, the population of items contains only consecutive positives, and the number of positives is up to $d$. In this setting, the number of tests required can be reduced to $O(\log_2(dn) + c)$ and $O\left(\log_2\frac{n}{d-1} + d\right)$ for adaptive and non-adaptive designs, respectively, where $c$ is some positive constant. These are much smaller than the bounds $O(d \log n)$ and $\Omega(d \log(n/d))$ in combinatorial group testing. Juan and Chang [16] made the number of tests fall in the very tight interval $[\lceil \log_2(dn) \rceil - 1, \lceil \log_2(dn) \rceil + 1]$ in the adaptive approach.

With the non-adaptive approach, Muller and Jimbo [17] considered the case $d = 2$ and constructed an explicit measurement matrix with $\lceil \log_2 \lceil \frac{n}{d-1} \rceil \rceil + 2d + 1$ rows and $n$ columns. Unfortunately, neither Colbourn nor Muller and Jimbo showed how to efficiently identify positives by using their proposed measurement matrices. Chang et al. [18] later used random measurement matrices with $5\log_2\frac{n}{d} + 2d + 21$ rows (tests) to identify all positives in time $O\left(\frac{n}{d}\log_2\frac{n}{d} + d^2\right)$.

The focus of this work is on non-adaptive group testing with consecutive positives. For the information-theoretic bound, Colbourn [15] showed that any group testing method must employ at least $\log_2(dn) - 1$ tests. On the other hand, he also showed that the minimum number of tests required in any non-adaptive group testing with up to $d$ consecutive positives is $d - 1$. Therefore, the minimum number of tests in non-adaptive group testing with up to $d$ consecutive positives is $\max\{\log_2(dn) - 1, d - 1\}$.

C. Contributions
We have reduced the number of tests and the decoding complexity, without storing measurement matrices, for efficiently identifying up to $d$ consecutive positives. In particular, without storing measurement matrices, we can identify up to $d$ consecutive positives with $2\log_2\frac{n}{d} + 2d$ ($4\log_2\frac{n}{d} + 2d$, resp.) tests in $O\left(\log_2^2\frac{n}{d} + d\right)$ ($O\left(\log_2\frac{n}{d} + d\right)$, resp.) time. These results significantly improve the state-of-the-art scheme, which takes $5\log_2\frac{n}{d} + 2d + 21$ tests to identify the positives in $O\left(\frac{n}{d}\log_2\frac{n}{d} + d^2\right)$ time with the measurement matrices associated with the scheme stored somewhere. Note that the decoding complexity in [18] is linear in the number of items while ours is sublinear. A summary of our comparison is shown in Table I.

Scheme | No. of positives | Design approach | Construction type of measurement matrices | Number of tests $t$ | Decoding time (decoding complexity)
Colbourn [15] | $\le d$ | Adaptive | Not available | $\lceil \log_2(dn) \rceil + c$ | $t$ stages
Juan and Chang [16] | $\le d$ | Adaptive | Not available | $\lceil \log_2(dn) \rceil - 1 \le t \le \lceil \log_2(dn) \rceil + 1$ | $t$ stages
Colbourn [15] | $\le d$ | Non-adaptive | Explicit | $\lceil \log_2 \lceil \frac{n}{d-1} \rceil \rceil + 2d + 1$ | Not available
Muller and Jimbo [17] | $d = 2$ | Non-adaptive | Explicit | $\lceil \log_2 \lceil \frac{n}{d-1} \rceil \rceil + 2d - 1$ | Not available
Chang et al. [18] | $\le d$ | Non-adaptive | Random | $5\log_2\frac{n}{d} + 2d + 21$ | $O\left(\frac{n}{d}\log_2\frac{n}{d} + d^2\right)$
First improved algorithm (Theorem 1) | $\le d$ | Non-adaptive | Nonrandom (strongly explicit without using probability) | $2\lceil \log_2\frac{n}{d} \rceil + 2d$ | $O\left(\log_2^2\frac{n}{d} + d\right)$
Second improved algorithm (Theorem 2) | $\le d$ | Non-adaptive | Nonrandom (strongly explicit without using probability) | $4\lceil \log_2\frac{n}{d} \rceil + 2d$ | $O\left(\log_2\frac{n}{d} + d\right)$

TABLE I: COMPARISON OF IMPROVED ALGORITHMS WITH PREVIOUS ONES. "NOT AVAILABLE" MEANS THAT THE CRITERION DOES NOT HOLD OR IS NOT CONSIDERED FOR THAT SCHEME. PARAMETER $c$ IS SOME CONSTANT.

D. General idea of improved algorithms
Although our improved algorithms reflect Colbourn's strategy [15], we refine every technical detail to attain efficient encoding and decoding procedures. More importantly, Colbourn only designed measurement matrices but not decoding procedures. Here, we propose two improved algorithms to identify up to $d$ consecutive positives.

Colbourn's strategy consists of two simultaneous phases. In the first phase, the author partitioned the $n$ (linearly ordered) items into subpools, which we here call super items, such that there are up to two super positive items and, if two super items are positive, they are consecutive. Hence, the objective of this phase is to locate the super positive items among the (linearly ordered) super items. In the second phase, with a careful design of measurement matrices, the true positives can be identified based on the location of the super positive items.

Our improved algorithms are described here, and more details with illustrations are presented in Section IV. We first create super items with linear order in which each super item contains exactly $d$ consecutive items. Specifically, the $j$th super item contains the items indexed from $(j - 1)d + 1$ to $jd$, for $1 \le j \le n/d$. Naturally, a super item is positive if it contains at least one positive item and negative otherwise.

Our improved algorithms are based on two inseparable compartments: the linear order of the input items, which contain consecutive positives, and nonrandom matrices designed based on that linear order. For the first phase, from the original set of items $N = \{1, 2, \ldots, n\}$, we generate a subset (or two subsets) of super items with linear order and their corresponding measurement matrix (matrices). For the second phase, we simply generate a $2d \times n$ measurement matrix by horizontally placing $2d \times 2d$ identity matrices in a series such that, once the super positive items are located, every item contained in these super positive items is identified as positive or negative by utilizing it.

The decoding procedure is as follows. By using the input set(s) of super items with their corresponding measurement matrix (matrices), given the outcome vector(s) generated from them, we can recover up to two super positive items, i.e., there are up to $2d$ potential positives after using super items. Because the measurement matrix associated with the set of $n$ items is composed of a series of $2d \times 2d$ identity matrices, we can finally determine which potential positives are truly positive.

II. PRELIMINARIES
For simplicity, we assume that $n$ is divisible by $d$, i.e., $n = kd$ for some positive integer $k$. (In case $n$ is not divisible by $d$, we can add $d\lceil n/d \rceil - n$ dummy negative items to the set of items so that the total number of items is $d\lceil n/d \rceil$.) Any set $C = \{c_1, \ldots, c_k\}$ used in this work is equipped with the linear order $c_i \prec c_{i+1}$ for $1 \le i < k$, where $\prec$ is the linear order notation. There are $n$ items indexed from 1 to $n$. Two sets of items are considered throughout this paper: $N = \{1, 2, \ldots, n\}$ and $P = \{2, 3, \ldots, n\}$. We should keep in mind that the index of an item may be different from its position in a set of items containing it, i.e., the $j$th item in a set may not be item $j$. Precisely, the position of an item in $N$ is identical to its index, whereas the position of an item in $P$ is one unit smaller than its index. For example, consider the two sets $N = \{1, 2, 3, 4\}$ and $P = \{2, 3, 4\}$ for $n = 4$. The position of item 3 in set $N$ is 3, which is identical to its index, but its position in set $P$ is 2.

A. Notations
For consistency, we use capital calligraphic letters for matrices, non-capital letters for scalars, bold letters for vectors, and capital letters for sets. All matrix and vector entries are binary. Let $\mathrm{supp}(\mathbf{v}) = \{ j : x_j \neq 0 \}$ for $\mathbf{v} = (x_1, \ldots, x_n)$, and let $\mathrm{vecc}_{[n]}(P) = (x_1, \ldots, x_n)^T$, where $x_j = 1$ for $j \in P$ and $x_j = 0$ otherwise. The main notations are as follows:
1) $n$, $d$, $\mathbf{x} = (x_1, \ldots, x_n)^T$: number of items, maximum number of defective items, binary representation of $n$ items.
2) $P = \{j_1, j_2, \ldots, j_{|P|}\}$: set of positive items; the cardinality of $P$ is $|P| \le d$.
3) $\mathcal{T}_{i,*}$, $\mathcal{T}_{*,j}$, $\mathcal{M}_{i,*}$, $\mathcal{M}_{*,j}$: row $i$ of matrix $\mathcal{T}$, column $j$ of matrix $\mathcal{T}$, row $i$ of matrix $\mathcal{M}$, column $j$ of matrix $\mathcal{M}$.
4) $\mathbf{v}(i)$: the $i$th entry of vector $\mathbf{v}$.
5) $\log x$: the binary logarithm of $x$, i.e., $\log_2 x$.

B. Problem definition
We index the population of $n$ items from 1 to $n$. Let $N = [n] = \{1, 2, \ldots, n\}$ and let $S$ be the defective set, where $|S| \le d$. A test is defined by a subset of items $P \subseteq [n]$. A pool with a negative (positive) outcome is called a negative (positive) pool. The outcome of a test on a subset of items is positive if the subset contains at least one defective item and negative otherwise.

We can model non-adaptive group testing as follows. A $t \times n$ binary matrix $\mathcal{T} = (t_{ij})$ is defined as a measurement matrix, where $n$ is the number of items and $t$ is the number of tests. Vector $\mathbf{x} = (x_1, \ldots, x_n)^T$ is the binary representation vector of the $n$ items, where $|\mathbf{x}| = \sum_{j=1}^{n} x_j \le d$. An entry $x_j = 1$ indicates that item $j$ is defective, and $x_j = 0$ indicates otherwise. The $j$th item corresponds to the $j$th column of the matrix. An entry $t_{ij} = 1$ naturally means that item $j$ belongs to test $i$, and $t_{ij} = 0$ means otherwise. The outcome of all tests is $\mathbf{y} = (y_1, \ldots, y_t)^T$, where $y_i = 1$ if test $i$ is positive and $y_i = 0$ otherwise. The procedure used to get the outcome vector $\mathbf{y}$ is called encoding. The procedure used to identify defective items from $\mathbf{y}$ is called decoding. The outcome vector $\mathbf{y}$ is given by

$$\mathbf{y} = \mathcal{T} \odot \mathbf{x} = \begin{bmatrix} \mathcal{T}_{1,*} \odot \mathbf{x} \\ \vdots \\ \mathcal{T}_{t,*} \odot \mathbf{x} \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_t \end{bmatrix}, \tag{1}$$

where $\odot$ denotes the test operation in non-adaptive group testing; namely, $y_i = \mathcal{T}_{i,*} \odot \mathbf{x} = 1$ if $\sum_{j=1}^{n} x_j t_{ij} \ge 1$ and $y_i = \mathcal{T}_{i,*} \odot \mathbf{x} = 0$ if $\sum_{j=1}^{n} x_j t_{ij} = 0$, for $i = 1, \ldots, t$.

Our objective is to find an efficient encoding and decoding scheme to identify up to $d$ consecutive positives in non-adaptive group testing. Precisely, our task is to minimize the number of rows in matrix $\mathcal{T}$ and the time for recovering $\mathbf{x}$ from $\mathbf{y}$ by using $\mathcal{T}$.

III. IDENTIFICATION OF TWO CONSECUTIVE POSITIVES
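As a concrete companion to the model in (1), on which this section builds, the following minimal Python sketch evaluates the test operation on a toy instance. The function name `run_tests` and the 3 × 4 matrix are illustrative assumptions; they are not taken from any scheme discussed in this paper.

```python
def run_tests(T, x):
    """Noiseless outcome y = T (.) x: y_i = 1 iff row i of T and x share a 1."""
    return [int(any(t and xj for t, xj in zip(row, x))) for row in T]


# Hypothetical toy instance: t = 3 tests on n = 4 items; item 3 is the only positive.
T = [
    [1, 1, 0, 0],  # test 1 pools items 1 and 2
    [0, 1, 1, 0],  # test 2 pools items 2 and 3
    [0, 0, 1, 1],  # test 3 pools items 3 and 4
]
x = [0, 0, 1, 0]

y = run_tests(T, x)
print(y)  # tests 2 and 3 contain item 3, so y = [0, 1, 1]
```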
A. Overview
Our objective is to identify positives in a set of items which contains exactly two consecutive positives or up to one positive. The basic idea of our proposed scheme is to exploit the structure of a nonrandom measurement matrix and the linear order of the $n$ items. We create a nonrandom measurement matrix such that the union of any two consecutive columns in it is different from the union of any other two consecutive columns. Based on this property and the measurement matrix structure, we carefully develop a decoding scheme whose decoding time is up to the square of the number of measurements.

The first encoding and decoding procedures to identify two consecutive positives or up to one positive are described in Algorithm 1 and summarized in Lemma 1.

Lemma 1.
Let $n$ be a positive integer and $N = \{1, 2, \ldots, n\}$ be the set of linearly ordered items. Then there exists a nonrandom $2\lceil \log_2 n \rceil \times n$ measurement matrix such that:
• If $N$ has exactly two positives which are consecutive and the index of the first positive is $1 \le a \le n - 1$, the two positives can be identified with $s = 2\lceil \log_2 n \rceil$ tests in $s$ time if $a$ is odd and in $s^2 = O(\log^2 n)$ time if $a$ is even.
• If $N$ has up to one positive, the decoding complexity is $s = 2\lceil \log_2 n \rceil$.

Since the decoding complexity in Lemma 1 is $O(\log^2 n)$, which is larger than $O(\log n)$, our next objective is to design an encoding procedure whose decoding complexity is just $O(\log n)$ by exploiting properties of consecutive positives. Lemma 1 tells us that if the index of the first positive in a measurement matrix, which is also its position in set $N$, is odd, it can be identified in time $O(\log n)$. Therefore, thanks to the linear order of the input set, we can remove the first item in $N$ to create $P$ and ensure that the position of the first positive in $P$ is odd in case its position in $N$ is even. In particular, it is possible to construct two measurement matrices of size $s \times n$ and $s \times (n - 1)$ such that item $j \ge 2$ is represented by column $j$ and column $j - 1$ in the first and second matrices, respectively, where $s = 2\lceil \log_2 n \rceil$. As a result, we only need $2s$ tests to recover the two consecutive positives in $O(s)$ time. This idea is summarized as follows.

Lemma 2.
Let $n$ be a positive integer and $N = \{1, 2, \ldots, n\}$ be the set of linearly ordered items. There exist two nonrandom measurement matrices of size $2\lceil \log_2 n \rceil \times n$ and $2\lceil \log_2 n \rceil \times (n - 1)$ such that if $N$ has exactly two positives which are consecutive or has up to one positive, they can be identified with $2s = 4\lceil \log_2 n \rceil$ tests in $O(s)$ time.

B. First encoding procedure

Let $\mathcal{S}$ be an $s \times n$ measurement matrix associated with the input set of items $N = \{1, 2, \ldots, n\}$:

$$\mathcal{S} = \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \ldots & \mathbf{b}_n \\ \overline{\mathbf{b}}_1 & \overline{\mathbf{b}}_2 & \ldots & \overline{\mathbf{b}}_n \end{bmatrix} = \begin{bmatrix} \mathcal{S}_1 & \ldots & \mathcal{S}_n \end{bmatrix}, \tag{2}$$

where $s = 2\lceil \log_2 n \rceil$, $\mathbf{b}_j$ is the $\lceil \log_2 n \rceil$-bit binary representation of integer $j - 1$, $\overline{\mathbf{b}}_j$ is the complement of $\mathbf{b}_j$, and $\mathcal{S}_j := \begin{bmatrix} \mathbf{b}_j \\ \overline{\mathbf{b}}_j \end{bmatrix}$ for $j = 1, 2, \ldots, n$. Column $\mathcal{S}_j$ represents the $j$th item of $N$, and the weight of every column in $\mathcal{S}$ is $s/2 = \lceil \log_2 n \rceil$. Furthermore, the $j$th item of $N$, which is also item $j$, is uniquely identified by $\mathbf{b}_j$. For example, if we set $n = 8$, then $s = 2\log_2 n = 6$ and the matrix in (2) becomes:

$$\mathcal{S} = \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \end{bmatrix}. \tag{3}$$

Let $\mathbf{s} = (s_1, \ldots, s_n)^T$ be the binary representation vector of set $N$, in which an entry $s_j = 1$ indicates that the $j$th item in the set $N$ is positive and $s_j = 0$ indicates otherwise. The outcome vector obtained by performing tests on the input set of items $N$ with its measurement matrix $\mathcal{S}$ is $\mathbf{y} = \mathcal{S} \odot \mathbf{s}$.

C. First decoding procedure

The decoding procedure is summarized in Algorithm 1. Steps 1 to 3 identify whether there are no positives in the input set. If there exists at least one positive, we proceed to Step 4. Steps 4 to 7 verify whether there is only one positive in the input set. From Step 8, it suffices to say that the input set has exactly two consecutive positives. Step 8 initializes vector $\mathbf{z}$, which is presumed to be $\mathbf{b}_a \vee \mathbf{b}_{a+1}$ for some integer $1 \le a \le n - 1$, by using the outcome vector $\mathbf{y}$, and Step 9 calculates that $a$. Because $\mathbf{b}_j$ is the $\lceil \log_2 n \rceil$-bit binary representation of integer $j - 1$, we shift $a$ by one unit to $a + 1$ for simple representation in later steps. If $a$ is odd, Steps 10 to 11 recover $a$. However, if $a$ is even, we have to scan every possibility of odd numbers generated from $\mathbf{y}$ by altering one bit in the first half of $\mathbf{y}$. This procedure is done by Steps 13 to 16. Finally, Steps 18 to 20 simply verify whether the value $a$ obtained from Step 17 for an alteration is genuinely the index of the first positive.

Algorithm 1
DecConsecutivePositives($\mathbf{y}$, $\mathcal{S}$): Decoding procedure for up to two consecutive positives.
Input: Outcome vector $\mathbf{y}$, matrix $\mathcal{S}$ of size $s \times n$ defined in (2).
Output: Set of two consecutive positive items.
 1: if $\mathbf{y} \equiv \mathbf{0}$ then
 2:     Return the set $P = \emptyset$.
 3: end if
 4: Calculate an index $a$ with the input as the first half of $\mathbf{y}$. Set $a := a + 1$.
 5: if $\mathcal{S}_a \equiv \mathbf{y}$ then
 6:     Return the set $P = \{a\}$.
 7: end if
 8: Initialize a $1 \times s/2$ vector $\mathbf{z}$ by setting $\mathbf{z}(1) = 0$ and $\mathbf{z}(i) = \mathbf{y}(i)$ for $i = 2, \ldots, s/2$.
 9: Calculate an index $a$ with the input $\mathbf{z}$. Set $a := a + 1$.
10: if $\mathcal{S}_a \vee \mathcal{S}_{a+1} \equiv \mathbf{y}$ then
11:     Return the set $P = \{a, a + 1\}$.
12: else
13:     for $i = 1$ to $s/2$ do
14:         Set $\mathbf{z} = (\mathbf{y}(1), \ldots, \mathbf{y}(s/2))^T$.
15:         if $\mathbf{y}(i) = 1$ then
16:             Set $\mathbf{z}(i) = 0$.
17:             Calculate index $a$ with the input $\mathbf{z}$. Set $a := a + 1$.
18:             if $\mathcal{S}_a \vee \mathcal{S}_{a+1} \equiv \mathbf{y}$ then
19:                 Return the set $P = \{a, a + 1\}$.
20:             end if
21:         end if
22:     end for
23: end if

D. Correctness of Algorithm 1
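Before the correctness argument, it may help to see Algorithm 1 run. The sketch below is a minimal Python rendering under two stated assumptions: entry $\mathbf{b}_j(1)$ is taken to be the least significant bit of $j - 1$ (consistent with the convention, used in this section, that $\mathbf{b}_a(1) = 0$ exactly when $a$ is odd), and explicit range checks on the candidate index are added. The helper names `column`, `to_index`, `union`, and `decode_alg1` are hypothetical.

```python
def column(j, L):
    """Column S_j: L-bit representation of j - 1 (least significant bit first)
    stacked on top of its complement."""
    b = [((j - 1) >> i) & 1 for i in range(L)]
    return b + [1 - v for v in b]


def to_index(bits):
    """Integer whose binary representation (least significant bit first) is `bits`."""
    return sum(v << i for i, v in enumerate(bits))


def union(u, v):
    """Entry-wise OR of two binary vectors."""
    return [p | q for p, q in zip(u, v)]


def decode_alg1(y, n):
    """Return the set of positives encoded by y, or None if y matches no valid input."""
    L = len(y) // 2

    def is_pair(a):
        # Range check added by this sketch: a and a + 1 must both be items of N.
        return 1 <= a < n and union(column(a, L), column(a + 1, L)) == y

    if not any(y):                           # Steps 1-3: no positives
        return set()
    a = to_index(y[:L]) + 1                  # Step 4
    if a <= n and column(a, L) == y:         # Steps 5-7: exactly one positive
        return {a}
    a = to_index([0] + y[1:L]) + 1           # Steps 8-9: guess that a is odd
    if is_pair(a):                           # Steps 10-11
        return {a, a + 1}
    for i in range(L):                       # Steps 13-22: a is even
        if y[i] == 1:
            z = y[:L]                        # fresh copy of the first half of y
            z[i] = 0                         # alter a single bit
            a = to_index(z) + 1
            if is_pair(a):
                return {a, a + 1}
    return None


# Hypothetical usage: n = 8 items, items 2 and 3 are positive.
L = 3
y = union(column(2, L), column(3, L))
print(decode_alg1(y, 8))
```

The loop over the first half of `y` performs at most $s/2$ comparisons of length-$s$ vectors, which mirrors the quadratic cost discussed for an even first index.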
To prove that matrix $\mathcal{S}$ in (2) can be used to identify two positives which are consecutive or up to one positive in a population of $n$ linearly ordered items, we first state the following lemma.

Lemma 3. Given the matrix $\mathcal{S}$ defined in (2), for any two distinct indexes $a \neq b$ in $[n - 1]$, we have $\mathcal{S} \odot \mathbf{a} \not\equiv \mathcal{S} \odot \mathbf{b}$ and $\mathcal{S} \odot \mathbf{a} \not\equiv \mathcal{S}_b$, where $\mathbf{a} = \mathrm{vecc}_{[n]}(\{a, a + 1\})$ and $\mathbf{b} = \mathrm{vecc}_{[n]}(\{b, b + 1\})$.

Proof. To prove $\mathcal{S} \odot \mathbf{a} \not\equiv \mathcal{S} \odot \mathbf{b}$, our task is to show that there exists an index $i$ such that $\mathcal{S}_a(i) \vee \mathcal{S}_{a+1}(i) \neq \mathcal{S}_b(i) \vee \mathcal{S}_{b+1}(i)$. The existence of such an index immediately gives $\mathcal{S} \odot \mathbf{a} \not\equiv \mathcal{S} \odot \mathbf{b}$. The proof of the existence of $i$ proceeds as follows. We consider two cases: $\mathbf{b}_a(1) = \mathbf{b}_b(1) = 0$ and $\mathbf{b}_a(1) \neq \mathbf{b}_b(1)$.

When $\mathbf{b}_a(1) = \mathbf{b}_b(1) = 0$, we must have $\mathbf{b}_{a+1}(1) = \mathbf{b}_{b+1}(1) = 1$, $\mathbf{b}_a(i) = \mathbf{b}_{a+1}(i)$, and $\mathbf{b}_b(i) = \mathbf{b}_{b+1}(i)$ for $i = 2, \ldots, s/2$. Since $a \neq b$ and $\mathbf{b}_a(1) = \mathbf{b}_b(1) = 0$, there exists an index $2 \le i \le \lceil \log_2 n \rceil = s/2$ such that $\mathbf{b}_a(i) \neq \mathbf{b}_b(i)$. Moreover, since $\mathbf{b}_a(i) = \mathbf{b}_{a+1}(i)$ and $\mathbf{b}_b(i) = \mathbf{b}_{b+1}(i)$ for $i = 2, \ldots, s/2$, it follows that $\mathbf{b}_a(i) \vee \mathbf{b}_{a+1}(i) \neq \mathbf{b}_b(i) \vee \mathbf{b}_{b+1}(i)$.

When $\mathbf{b}_a(1) \neq \mathbf{b}_b(1)$, without loss of generality, we assume that $\mathbf{b}_a(1) = 0$ and $\mathbf{b}_b(1) = 1$. Since $\mathbf{b}_a(1) = 0$, we must have $\mathbf{b}_{a+1}(1) = 1$ and $\mathbf{b}_a(i) = \mathbf{b}_{a+1}(i)$ for $i = 2, \ldots, s/2$. On the other hand, since $\mathbf{b}_b(1) = 1$, it suffices to claim that $\mathbf{b}_b(2) \vee \mathbf{b}_{b+1}(2) = 1$ and $\overline{\mathbf{b}}_b(2 + s/2) \vee \overline{\mathbf{b}}_{b+1}(2 + s/2) = 1$. We then consider two cases: $\mathbf{b}_a(2) = \mathbf{b}_{a+1}(2) = 0$ and $\mathbf{b}_a(2) = \mathbf{b}_{a+1}(2) = 1$. If $\mathbf{b}_a(2) = \mathbf{b}_{a+1}(2) = 0$, then $\mathbf{b}_a(2) \vee \mathbf{b}_{a+1}(2) = 0 \neq 1 = \mathbf{b}_b(2) \vee \mathbf{b}_{b+1}(2)$. If $\mathbf{b}_a(2) = \mathbf{b}_{a+1}(2) = 1$, then $\overline{\mathbf{b}}_a(2 + s/2) \vee \overline{\mathbf{b}}_{a+1}(2 + s/2) = 0 \neq 1 = \overline{\mathbf{b}}_b(2 + s/2) \vee \overline{\mathbf{b}}_{b+1}(2 + s/2)$. In both cases, either $i = 2$ or $i = 2 + s/2$ makes $\mathcal{S}_a(i) \vee \mathcal{S}_{a+1}(i) \neq \mathcal{S}_b(i) \vee \mathcal{S}_{b+1}(i)$.

In summary, there exists an index $i$ such that $\mathcal{S}_a(i) \vee \mathcal{S}_{a+1}(i) \neq \mathcal{S}_b(i) \vee \mathcal{S}_{b+1}(i)$.

Regarding the claim $\mathcal{S} \odot \mathbf{a} \not\equiv \mathcal{S}_b$, we prove it by contradiction. Assume $\mathcal{S} \odot \mathbf{a} \equiv \mathcal{S}_b$; we are going to show that $a = a + 1$, which is wrong. Indeed, for any $i \in [1, s/2]$, if $\mathbf{b}_a(i) = 0$ and $\mathbf{b}_{a+1}(i) = 1$, or $\mathbf{b}_a(i) = 1$ and $\mathbf{b}_{a+1}(i) = 0$, then $\mathbf{b}_b(i)$ and $\overline{\mathbf{b}}_a(i) \vee \overline{\mathbf{b}}_{a+1}(i)$ must both equal 1. However, because $\mathbf{b}_b(i) = 1$, we get $\overline{\mathbf{b}}_b(i) = 0$. Because $\mathcal{S} \odot \mathbf{a} \equiv \mathcal{S}_b$, we must have $\overline{\mathbf{b}}_a(i) \vee \overline{\mathbf{b}}_{a+1}(i) = 0$, which contradicts the previous argument that $\overline{\mathbf{b}}_a(i) \vee \overline{\mathbf{b}}_{a+1}(i) = 1$. Hence, we have $\mathbf{b}_a(i) = \mathbf{b}_{a+1}(i)$ for all $i \in [1, s/2]$, i.e., $a = a + 1$, which is wrong. We thus claim that $\mathcal{S} \odot \mathbf{a} \not\equiv \mathcal{S}_b$.

We are now ready to prove the correctness of Algorithm 1 in identifying up to two unknown consecutive positives.

It is obvious that if there are no positives in the input set, the condition in Step 1 holds and the algorithm stops at Step 3. If the algorithm proceeds to Step 4, there exists at least one positive in the input set $N$. Because of Lemma 3, if there is exactly one positive in the input set $N$, Steps 4 to 7 recover the index of that positive.

If the algorithm proceeds to Step 8, there must be two positives in the input set. Because of Lemma 3, it is clear that there exists only one index $a$ such that $\mathbf{y} = \mathcal{S} \odot \mathbf{a}$. We now proceed with two scenarios for $\mathbf{b}_a(1)$: $\mathbf{b}_a(1) = 0$, i.e., $a$ is odd, and $\mathbf{b}_a(1) = 1$, i.e., $a$ is even. Note that $\mathbf{y}(1)$ is always equal to 1 because $\mathbf{a} = \mathrm{vecc}_{[n]}(\{a, a + 1\})$.

We first assume that $\mathbf{b}_a(1) = 0$, i.e., $a$ is odd. In this case, we get $\mathbf{b}_{a+1}(1) = 1$ and $\mathbf{b}_a(i) = \mathbf{b}_{a+1}(i)$ for every $2 \le i \le s/2$.
Moreover, since $\mathbf{b}_a(i) \vee \mathbf{b}_{a+1}(i) = \mathbf{y}(i)$, we get $\mathbf{b}_a(i) = \mathbf{y}(i)$ for every $2 \le i \le s/2$. Because the first half of $\mathcal{S}_a$, namely $\mathbf{b}_a$, is already identified, the second half of $\mathcal{S}_a$, which is the complement of $\mathbf{b}_a$, can be obtained. The last step is to check whether $\mathbf{y}$ is equal to $\mathcal{S}_a \vee \mathcal{S}_{a+1}$. If this is true, then $a$ is identified by using $\mathbf{b}_a$. Otherwise, the assumption $\mathbf{b}_a(1) = 0$ is not true and we proceed to the case $\mathbf{b}_a(1) = 1$. This phase takes $s$ time to complete.

When $\mathbf{b}_a(1) = 1$, i.e., $a$ is even, the decoding procedure becomes more complicated. It works on the principle of propagation as follows. We first prove that there exists only one index $i'$ such that $\mathbf{b}_{a+1}(i') = 1$ and $\mathbf{b}_a(i') = 0$. Indeed, assume that there does not exist such an index $i'$ or that there is more than one index $i'$ satisfying the condition. In the former case, we must have $\mathbf{b}_a \vee \mathbf{b}_{a+1} = \mathbf{b}_a$, i.e., $a + 1$ is equal to or smaller than $a$, which is impossible. In the latter case, $a + 1$ must be larger than $a$ by at least $1 + 2 = 3$ units, which is also wrong.

Because there exists only one index $i'$ such that $\mathbf{b}_{a+1}(i') = 1$ and $\mathbf{b}_a(i') = 0$, the number of positions in which $\mathbf{b}_a \vee \mathbf{b}_{a+1}$ and $\mathbf{b}_a$ disagree is just one. Moreover, because $\mathbf{y}$ is the union of the two consecutive columns $\mathcal{S}_a$ and $\mathcal{S}_{a+1}$, there exists an index $1 \le i \le s/2$ such that $\mathbf{y}(i) = 1$ and $\mathbf{b}_a = (\mathbf{y}(1), \ldots, \mathbf{y}(i - 1), 0, \mathbf{y}(i + 1), \ldots, \mathbf{y}(s/2))$. Based on this fact, we can simply make a decoding procedure for the case $\mathbf{b}_a(1) = 1$ as follows. For each $i = 1, \ldots, s/2$, if $\mathbf{y}(i) = 1$, we assign $\mathbf{b}_a$ as the first half of the vector $\mathbf{y}$ with the entry $\mathbf{y}(i)$ altered to become 0. We then create the complement vector $\overline{\mathbf{b}}_a$ of $\mathbf{b}_a$ and its next column $\mathcal{S}_{a+1}$. If $\mathcal{S}_a \vee \mathcal{S}_{a+1} = \mathbf{y}$, the indexes $a$ and $a + 1$ are identified and we stop the decoding procedure. The decoding complexity is up to $s/2 \times 2s = s^2$.

E. Proof of Lemma 1
Matrix $\mathcal{S}$ is obviously nonrandom because the $j$th column of $\mathcal{S}$ is built from the $\lceil \log_2 n \rceil$-bit binary representation of integer $j - 1$. Steps 1 to 7 identify up to one positive and take only $s$ time. As analyzed in the preceding subsection, once the input set of items has exactly two positives which are consecutive, if the index of the first positive is odd, Steps 10 and 11 are implemented in $s$ time. If the index of the first positive is even, Step 11 is skipped and Steps 13 to 22 are executed. The running time for these steps is $s^2$.

In summary, there exists a nonrandom $2\lceil \log_2 n \rceil \times n$ measurement matrix such that:
• If $N$ has up to one positive, the decoding complexity is $s = 2\lceil \log_2 n \rceil$.
• If $N$ has exactly two positives which are consecutive and the index of the first positive is $1 \le a \le n - 1$, the two positives can be identified with $s = 2\lceil \log_2 n \rceil$ tests in $s$ time if $a$ is odd and in $s^2 = O(\log^2 n)$ time if $a$ is even.

To reduce the decoding complexity of Algorithm 1, we have to use alternative measurement matrices and an alternative decoding procedure. The details are presented below.

F. Second encoding procedure
Let $P = \{2, \ldots, n\}$ be a set of items. Let $\mathcal{P}$ be an $s \times (n - 1)$ measurement matrix:

$$\mathcal{P} = \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \ldots & \mathbf{b}_{n-1} \\ \overline{\mathbf{b}}_1 & \overline{\mathbf{b}}_2 & \ldots & \overline{\mathbf{b}}_{n-1} \end{bmatrix} = \begin{bmatrix} \mathcal{S}_1 & \ldots & \mathcal{S}_{n-1} \end{bmatrix}, \tag{4}$$

where $s = 2\lceil \log_2 n \rceil$, $\mathbf{b}_j$ is the $\lceil \log_2 n \rceil$-bit binary representation of integer $j - 1$, $\overline{\mathbf{b}}_j$ is the complement of $\mathbf{b}_j$, and $\mathcal{S}_j := \begin{bmatrix} \mathbf{b}_j \\ \overline{\mathbf{b}}_j \end{bmatrix}$ is the same as defined in (2) for $j = 1, 2, \ldots, n - 1$. Column $\mathcal{S}_j$ represents item $j + 1$, i.e., the $j$th item in the set $P$, and the weight of every column in $\mathcal{P}$ is $s/2 = \lceil \log_2 n \rceil$. Furthermore, item $j + 1$ in $P$ is uniquely identified by $\mathbf{b}_j$.

The outcome vector is obtained by performing tests on two distinct pairs of input sets of items and their corresponding measurement matrices as follows:

$$\mathbf{z} = \begin{bmatrix} \mathcal{S} \odot \mathbf{s} \\ \mathcal{P} \odot \mathbf{p} \end{bmatrix} = \begin{bmatrix} \mathbf{y} \\ \mathbf{w} \end{bmatrix}, \tag{5}$$

where $\mathbf{y} = \mathcal{S} \odot \mathbf{s}$, $\mathbf{w} = \mathcal{P} \odot \mathbf{p}$, and $\mathbf{s} = (s_1, \ldots, s_n)^T$ and $\mathbf{p} = (p_1, \ldots, p_{n-1})^T$ are the binary representation vectors of sets $N$ and $P$, respectively. An entry $s_j = 1$ ($p_j = 1$, resp.) indicates that the $j$th item in the set $N$ ($P$, resp.) is positive, and $s_j = 0$ ($p_j = 0$, resp.) indicates otherwise.

G. Second decoding procedure
The decoding procedure is summarized in Algorithm 2. Steps 1 to 3 identify whether there are no positives in the input set. If there exists at least one positive, we proceed to Step 4. Since there are two measurement matrices associated with two input sets of items, we need two vectors to recover the index(es) of the positive(s) in the two input sets from the two outcome vectors. Steps 4 and 12 initialize those vectors. Since the first set of items is $N = \{1, 2, \ldots, n\}$, if there is only one positive present and its index is odd, the condition in Step 6 holds. Step 7 naturally returns that index.

If the input set has exactly two consecutive positives, the algorithm proceeds to Step 9. If the index of the first positive is odd, Steps 9 to 11 recover it, and hence the index of the second positive is also obtained. However, if the index of the first positive is even, the condition in Step 9 does not hold but the one in Step 14 does. Step 15 simply returns that index.

Algorithm 2 DecConsecutivePositives($\mathbf{y}$, $\mathcal{S}$, $\mathbf{w}$, $\mathcal{P}$): Decoding procedure for up to two consecutive positives.
Input: Outcome vectors $\mathbf{y}$ and $\mathbf{w}$, matrices $\mathcal{S}$ and $\mathcal{P}$ defined in (2) and (4).
Output: Set of up to two consecutive positives.
 1: if $\mathbf{y} \equiv \mathbf{w} \equiv \mathbf{0}$ then
 2:     Return the set $P = \emptyset$.
 3: end if
 4: Initialize a $1 \times s/2$ vector $\mathbf{y}'$ by setting $\mathbf{y}'(1) = 0$ and $\mathbf{y}'(i) = \mathbf{y}(i)$ for $i = 2, \ldots, s/2$.
 5: Calculate index $a$ with the input $\mathbf{y}'$. Set $a = a + 1$.
 6: if $\mathcal{S}_a \equiv \mathbf{y}$ then
 7:     Return the set $P = \{a\}$.
 8: end if
 9: if $\mathcal{S}_a \vee \mathcal{S}_{a+1} \equiv \mathbf{y}$ then
10:     Return the set $P = \{a, a + 1\}$.
11: end if
12: Initialize a $1 \times s/2$ vector $\mathbf{w}'$ by setting $\mathbf{w}'(1) = 0$ and $\mathbf{w}'(i) = \mathbf{w}(i)$ for $i = 2, \ldots, s/2$.
13: Calculate index $b$ with the input $\mathbf{w}'$. Set $b = b + 1$.
14: if $\mathcal{S}_b \vee \mathcal{S}_{b+1} \equiv \mathbf{w}$ then
15:     Return the set $P = \{b + 1, b + 2\}$.
16: end if

H. Correctness of the second decoding procedure
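To accompany the argument in this subsection, the following Python sketch runs the second decoding procedure on outcome vectors $\mathbf{y}$ and $\mathbf{w}$. It assumes, as in the convention $\mathbf{b}_a(1) = 0$ for odd $a$, that $\mathbf{b}_j(1)$ is the least significant bit of $j - 1$. The listing of Algorithm 2 has no explicit branch for a lone positive with an even index, so the sketch adds a final check of $\mathcal{S}_b$ against $\mathbf{w}$; that branch, the range checks, and all helper names are assumptions of this sketch rather than steps of Algorithm 2.

```python
def column(j, L):
    """Column S_j: L-bit representation of j - 1 (least significant bit first)
    stacked on top of its complement."""
    b = [((j - 1) >> i) & 1 for i in range(L)]
    return b + [1 - v for v in b]


def to_index(bits):
    """Integer whose binary representation (least significant bit first) is `bits`."""
    return sum(v << i for i, v in enumerate(bits))


def union(u, v):
    """Entry-wise OR of two binary vectors."""
    return [p | q for p, q in zip(u, v)]


def outcomes(positives, n):
    """Encoding: y uses matrix S on N = {1..n}; w uses matrix P on {2..n},
    where item j occupies column j - 1."""
    L = (n - 1).bit_length()          # equals ceil(log2 n) for n >= 2
    y, w = [0] * (2 * L), [0] * (2 * L)
    for j in positives:
        y = union(y, column(j, L))
        if j >= 2:
            w = union(w, column(j - 1, L))
    return y, w


def decode_alg2(y, w, n):
    """Return the set of positives, or None if (y, w) matches no valid input."""
    L = len(y) // 2
    if not any(y) and not any(w):                                  # Steps 1-3
        return set()
    a = to_index([0] + y[1:L]) + 1                                 # Steps 4-5
    if a <= n and column(a, L) == y:                               # Steps 6-8
        return {a}
    if a < n and union(column(a, L), column(a + 1, L)) == y:       # Steps 9-11
        return {a, a + 1}
    b = to_index([0] + w[1:L]) + 1                                 # Steps 12-13
    if b + 1 <= n - 1 and union(column(b, L), column(b + 1, L)) == w:  # Steps 14-16
        return {b + 1, b + 2}
    if b <= n - 1 and column(b, L) == w:   # added branch: one positive, even index
        return {b + 1}
    return None


# Hypothetical usage: n = 8 items, items 4 and 5 are positive (first index even).
y, w = outcomes({4, 5}, 8)
print(decode_alg2(y, w, 8))
```

Every branch performs a constant number of length-$s$ vector operations, so the running time of the sketch is linear in $s$, in line with Lemma 2.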
It is obvious that if there are no positives in the input set, the condition in Step 1 holds and the algorithm stops at Step 3. If the algorithm proceeds to Step 4, there exists at least one positive item in the input set $N$.

If the index of the first positive is odd, it can be identified in time $O(\log n)$. Therefore, thanks to the linear order of the input set, we can remove the first item in $N$ to create $P$ and ensure that the position of the first positive in $P$ is odd in case its position in $N$ is even. From the construction of matrices $\mathcal{S}$ and $\mathcal{P}$, item $j \ge 2$ is represented by column $j$ and column $j - 1$ in $\mathcal{S}$ and $\mathcal{P}$, respectively. Note that item 1 is only present in $\mathcal{S}$ and is represented by the first column of $\mathcal{S}$. Step 4 guesses whether the index of the first positive is odd or even. Because $\mathbf{b}_j$ is the $\lceil \log_2 n \rceil$-bit binary representation of integer $j - 1$, Step 5 just shifts our guess to the right index of the first positive. Because of Lemma 3, if there is exactly one positive in the input set $N$, Steps 4 to 7 recover the index of that positive.

Once the algorithm proceeds to Step 9, there must exist exactly two consecutive positives. If there are exactly two positives in the input set $N$ which are consecutive and the index of the first positive is odd, the condition in Step 9 holds. Hence, the set of two consecutive positives is returned by Step 10. The remaining case is that the index of the first positive of the two consecutive positives is even. In this case, Steps 12 to 13 identify that index. Again, because of Lemma 3, the condition in Step 14 holds and Step 15 returns the set of positives.

I. Proof of Lemma 2
The previous section shows that we can use two nonrandom measurement matrices, namely $\mathcal{S}$ and $\mathcal{P}$, of sizes $2\lceil \log_2 n \rceil \times n$ and $2\lceil \log_2 n \rceil \times (n - 1)$, such that if the set of input items has exactly two positives which are consecutive, or has at most one positive, the positives can be identified with $s = 4\lceil \log_2 n \rceil$ tests. Since each if statement and each of Steps 4, 5, 12, and 13 runs in time $s$, the decoding complexity of Algorithm 2 is $O(s)$. Therefore, Lemma 2 is proved.

IV. IMPROVED ALGORITHMS
A. Overview
Colbourn [15] proposed a strategy to identify consecutive positives in two simultaneous phases. In the first phase, the author partitioned the $n$ (linearly ordered) items into subpools, which we call super items here, such that there are up to two super positive items and, if two super items are positive, they are consecutive. Hence, the objective of this phase is to locate the super positive items among the (linearly ordered) super items. We refer to the matrices used in the first phase as filter matrices. In the second phase, with a careful design of measurement matrices, the true positives can be identified based on the location of the super positive items. We refer to the matrices used in the second phase as verification matrices.

Although Colbourn's strategy is a breakthrough, his design in the first phase is not efficient. More importantly, he did not propose any efficient decoding scheme associated with his design. Here, we propose two improved algorithms to identify up to $d$ consecutive positives.

For the first phase, our proposed algorithms rely on two inseparable components: the linear order of the input items, which contain consecutive positives, and nonrandom matrices designed based on that linear order. For the second phase, we simply generate a measurement matrix by horizontally placing a series of $2d \times 2d$ identity matrices. The details of our proposed algorithms are described in the following subsections.

B. Super items
We first create linearly ordered super items, each of which contains exactly $d$ items, except for the last super item, which may contain fewer than $d$ items. In particular, the $n$ items are distributed into $n/d$ subsets (for simplicity, we assume $n$ is divisible by $d$) in which the $j$th subset contains the items indexed from $(j - 1)d + 1$ to $jd$. The $j$th super item is the $j$th subset. A super item is positive if it contains at least one positive item and negative otherwise. There are up to two super positive items, and they are consecutive, because the input items are linearly ordered, the number of positive items is up to $d$, the positive items are consecutive, and each super item contains up to $d$ items. This procedure is illustrated in Fig. 1.
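As a minimal sketch of this partition (the function names are illustrative and do not come from the paper), the following Python snippet maps a 1-based item index to its 1-based super item and lists the positive super items for a given set of positives, assuming every super item holds $d$ items:

```python
def super_item_of(item: int, d: int) -> int:
    """Return the 1-based index of the super item that contains `item` (1-based)."""
    return (item - 1) // d + 1


def positive_super_items(positives: set[int], d: int) -> list[int]:
    """Return the sorted indices of the super items hit by at least one positive."""
    return sorted({super_item_of(p, d) for p in positives})


# Example with d = 3: items 5, 6, 7 are consecutive positives.
# Items 4..6 form super item 2 and items 7..9 form super item 3.
print(positive_super_items({5, 6, 7}, d=3))  # [2, 3]
```

Because a run of at most $d$ consecutive items can straddle at most one super item boundary, the returned list never has more than two entries, and two entries are always adjacent.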
Fig. 1. Creating super items. A super item is a subset of items. In particular, the $n$ items are distributed into $n/d$ subsets (for simplicity, we assume $n$ is divisible by $d$) in which the $j$th subset contains the items indexed from $(j - 1)d + 1$ to $jd$. The $j$th super item is the $j$th subset.

C. Encoding procedure
The encoding procedure includes the first and second phases described in Section IV-A. For the first phase, there are two designs of measurement matrices, corresponding to the two improved algorithms.
1) First phase in the first improved algorithm:
The measurement matrix $\mathcal{S}$ used here is the same as the one in Section III-B, with items replaced by super items and $n$ replaced by $n/d$.

2) First phase in the second improved algorithm:
The measurement matrices used here, namely $\mathcal{S}$ and $\mathcal{P}$, are the same as those in Section III-F, with items replaced by super items and $n$ replaced by $n/d$.
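The two filter matrices can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names are hypothetical, matrices are stored as lists of columns, each column stacks $\mathbf{b}_j$ on its complement as described in the caption of Fig. 2, and the bits of $\mathbf{b}_j$ are written least significant bit first, which is an assumed ordering.

```python
def binary_column(j: int, width: int) -> list[int]:
    """b_j: the `width`-bit binary representation of j - 1 (LSB first, an assumption)."""
    return [((j - 1) >> k) & 1 for k in range(width)]


def filter_matrices(m: int) -> tuple[list[list[int]], list[list[int]]]:
    """Return (S, P) as lists of columns for m super items.

    Column j of S (1-based) is b_j stacked on its complement and stands for super item j.
    P holds the first m - 1 of those columns; its column j stands for super item j + 1.
    """
    width = max(1, (m - 1).bit_length())  # equals ceil(log2(m)) for m >= 2
    columns = []
    for j in range(1, m + 1):
        b = binary_column(j, width)
        columns.append(b + [1 - bit for bit in b])
    return columns, columns[: m - 1]


def outcome(columns: list[list[int]], positive_columns: list[int]) -> list[int]:
    """Bitwise OR of the chosen columns (0-based positions); all zeros if none is chosen."""
    rows = len(columns[0])
    return [int(any(columns[c][r] for c in positive_columns)) for r in range(rows)]


# m = 8 super items, with super items 2 and 3 positive.
S, P = filter_matrices(8)
y = outcome(S, [1, 2])  # super items 2, 3 are columns 2, 3 of S (0-based 1, 2)
w = outcome(P, [0, 1])  # super items 2, 3 are columns 1, 2 of P (0-based 0, 1)
print(y)  # [1, 1, 0, 1, 1, 1]
print(w)  # [1, 0, 0, 1, 1, 1]
```

In this example the first positive super item has an even index, so it is the outcome `w` of the shifted matrix in which the pair starts at an odd column.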
3) Second phase:
A $2d \times n$ measurement matrix $\mathcal{H}$, called a verification matrix, is created as follows:

$$\mathcal{H} = \begin{bmatrix} \mathcal{I}_{2d} & \mathcal{I}_{2d} & \cdots & \mathcal{I}_{2d} & \mathcal{I}\left(:,\, n - 2d\left\lfloor \tfrac{n}{2d} \right\rfloor\right) \end{bmatrix}, \tag{6}$$

where $\mathcal{I}_{2d}$ is a $2d \times 2d$ identity matrix and matrix $\mathcal{I}\left(:, n - 2d\lfloor n/(2d) \rfloor\right)$ contains the first $n - 2d\lfloor n/(2d) \rfloor$ columns of $\mathcal{I}_{2d}$. There are $\lfloor n/(2d) \rfloor$ such $\mathcal{I}_{2d}$ matrices.

The outcome vector obtained by using $\mathcal{H}$ is $\mathbf{h} = \mathcal{H} \odot \mathbf{x}$, where $\mathbf{x} = (x_1, \ldots, x_n)^T$ is the binary representation vector of set $N$, in which $x_j = 1$ indicates that the $j$th item in $N$ is positive and $x_j = 0$ indicates otherwise.
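The structure of the verification matrix means its outcome vector can be computed without storing the matrix. The sketch below assumes identity blocks of size $2d \times 2d$ repeated across the $n$ columns, as drawn in Fig. 2; the function names are illustrative only.

```python
def build_verification_matrix(n: int, d: int) -> list[list[int]]:
    """Explicit 2d x n matrix made of repeated 2d x 2d identity blocks, cut off at n columns."""
    return [[1 if j % (2 * d) == i else 0 for j in range(n)] for i in range(2 * d)]


def verification_outcome(n: int, d: int, positives: set[int]) -> list[int]:
    """Outcome vector h for 1-based positive items, computed without storing the matrix."""
    h = [0] * (2 * d)
    for j in positives:
        if not 1 <= j <= n:
            raise ValueError("item index out of range")
        h[(j - 1) % (2 * d)] = 1  # column j has its only 1 in row ((j - 1) mod 2d) + 1
    return h


# n = 10 items, d = 2, items 4 and 5 positive.
print(verification_outcome(10, 2, {4, 5}))  # [1, 0, 0, 1]
```

The explicit builder is included only to check the shortcut against the matrix definition; a decoder needs just the modular index.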
[Figure 2: the filter matrix $\mathcal{S} = \begin{bmatrix} \mathbf{b}_1 & \cdots & \mathbf{b}_{n/d} \\ \bar{\mathbf{b}}_1 & \cdots & \bar{\mathbf{b}}_{n/d} \end{bmatrix}$ acting on super items $\bar{1}, \ldots, \overline{n/d}$ with outcome $\mathbf{s}$; the filter matrix $\mathcal{P} = \begin{bmatrix} \mathbf{b}_1 & \cdots & \mathbf{b}_{n/d - 1} \\ \bar{\mathbf{b}}_1 & \cdots & \bar{\mathbf{b}}_{n/d - 1} \end{bmatrix}$ acting on super items $\bar{2}, \ldots, \overline{n/d}$ with outcome $\mathbf{p}$; and the verification matrix $\mathcal{H}$ acting on items $1, \ldots, n$ with outcome $\mathbf{h}$.]

Fig. 2. Encoding procedure. Each measurement matrix is associated with a set of items or super items. Vector $\mathbf{b}_j$ is the $\lceil \log_2 (n/d) \rceil$-bit binary representation of integer $j - 1$, and $\bar{\mathbf{b}}_j$ is the complement of $\mathbf{b}_j$ for $j = 1, 2, \ldots, n/d$. Matrix $\mathcal{I}_{2d}$ is a $2d \times 2d$ identity matrix and matrix $\mathcal{I}\left(:, n - 2d\lfloor n/(2d) \rfloor\right)$ contains the first $n - 2d\lfloor n/(2d) \rfloor$ columns of $\mathcal{I}_{2d}$. For a given pair of a measurement matrix and a set of (super) items, each item and each test are represented by a column and a row, respectively. For every entry $a$ at row $i$ and column $j$, test $i$ contains (does not contain, resp.) the $j$th (super) item if $a = 1$ ($a = 0$, resp.). A test on a subset of (super) items is positive if the subset contains at least one (super) positive item.

D. Decoding procedure
The flow of the decoding procedure is illustrated in Fig. 3. We first identify up to two consecutive super positives to get a range of items which contains all positives. The true positives are then identified by using the verification matrix $\mathcal{H}$ and the outcome vector $\mathbf{h}$.

The details of the decoding procedure, which merges the first and second improved algorithms, are given in Algorithm 3. With the input of the first (second, resp.) improved algorithm, Algorithm 3 skips Step 2 (Step 1, resp.). Step 3 returns an empty set of positives because there are no super positives in the input set of items when $T = \emptyset$. Steps 4 and 5 get the first super positive and initialize an empty positive set, respectively. The use of the first phase of the encoding procedure ends here. We now proceed to identify the true positives. Because all positives lie in the index range $(\alpha - 1)d + 1, \ldots, (\alpha + 1)d$, we scan every entry of the outcome vector $\mathbf{h}$ in Step 6, and its corresponding positive is then identified by using the rules in Steps 6 to 17.
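The scanning rules of the second phase can be sketched in Python as below. This is a hedged illustration, not the authors' code: `decode_positives` is a hypothetical name, `alpha` is assumed to be the 1-based index of the first positive super item returned by the first phase, and `h` is assumed to have $2d$ entries.

```python
def decode_positives(alpha: int, h: list[int], d: int) -> set[int]:
    """Map the entries of h to item indices, given the first positive super item alpha."""
    positives = set()
    for i in range(1, 2 * d + 1):        # i is the 1-based row index of h
        if h[i - 1] != 1:
            continue
        if alpha % 2 == 1:               # alpha odd: the window coincides with one identity block
            positives.add((alpha - 1) * d + i)
        elif i <= d:                     # alpha even, top half of h: items of super item alpha + 1
            positives.add(alpha * d + i)
        else:                            # alpha even, bottom half of h: items of super item alpha
            positives.add((alpha - 2) * d + i)
    return positives


# d = 2, items 4 and 5 positive: the first positive super item is alpha = 2
# and the verification outcome is h = [1, 0, 0, 1].
print(sorted(decode_positives(2, [1, 0, 0, 1], d=2)))  # [4, 5]
```

The loop touches each of the $2d$ entries once and does constant work per entry, which is where the additive $d$ term in the decoding time comes from.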
[Figure 3: the outcome vectors of $\mathcal{S}$ (and $\mathcal{P}$) are decoded into the super positive items $\bar{\alpha}$ and $\overline{\alpha + 1}$, which give the potential positives $(\alpha - 1)d + 1, \ldots, (\alpha + 1)d$. The outcome vector $\mathbf{h}$ is then decoded into the true positives with the following rules. When $\alpha$ is odd and $h_i = 1$, item $(\alpha - 1)d + i$ is positive. When $\alpha$ is even, $h_i = 1$, and $i \leq d$, item $\alpha d + i$ is positive. When $\alpha$ is even, $h_i = 1$, and $i > d$, item $(\alpha - 2)d + i$ is positive.]

Fig. 3. Decoding procedure. From the outcome vector(s) generated by super items, we can identify up to two super positive items. Then there are up to $2d$ potential consecutive positives. Every entry in the outcome vector $\mathbf{h}$ is then scanned to identify its corresponding positive by using some specific rules.

Algorithm 3 Decoding procedure for up to $d$ consecutive positives.
Input in the first improved algorithm: Outcome vector $\mathbf{y}$, matrix $\mathcal{S}$ of size $s \times n$ defined in (2).
Input in the second improved algorithm: Outcome vectors $\mathbf{y}$ and $\mathbf{w}$, matrices $\mathcal{S}$ and $\mathcal{P}$ defined in (2) and (4).
Output: Set of consecutive positives.
1: $T = \mathrm{DecConsecutivePositives}(\mathbf{y}, \mathcal{S})$. ▷ First improved algorithm.
2: $T = \mathrm{DecConsecutivePositives}(\mathbf{y}, \mathcal{S}, \mathbf{w}, \mathcal{P})$. ▷ Second improved algorithm.
3: Return $P = \emptyset$ if $T = \emptyset$.
4: Let $\alpha$ be the first item in $T$.
5: Initialize the positive set $P = \emptyset$.
6: for $i = 1$ to $2d$ do
7:     if $\alpha$ is odd and $\mathbf{h}(i) = 1$ then
8:         $P = P \cup \{(\alpha - 1)d + i\}$.
9:     end if
10:     if $\alpha$ is even and $\mathbf{h}(i) = 1$ then
11:         if $i \leq d$ then
12:             $P = P \cup \{\alpha d + i\}$.
13:         else
14:             $P = P \cup \{(\alpha - 2)d + i\}$.
15:         end if
16:     end if
17: end for

E. Correctness of the decoding procedure
Either Step 1 or Step 2 returns the set of super positive items. If $T = \emptyset$, there are no positive items in the input set of items $N$. Therefore, Step 3 returns an empty set of positives. Once Algorithm 3 proceeds to Step 4, there exists at least one positive item in $N$. As analyzed in Section IV-B, there are up to two super positive items, which are consecutive. Moreover, since each super item contains $d$ consecutive items, all positives lie in the index range $(\alpha - 1)d + 1, \ldots, (\alpha + 1)d$, or in a range that ends at $n$ when $n$ is not divisible by $d$ and the range reaches the last (shorter) super item. To facilitate our proof, we can add $\beta$ "dummy negative items" to the set of $n$ items such that $n + \beta$ is divisible by $d$. It is clear that those dummy negative items do not affect the outcome vector $\mathbf{h}$. Therefore, it suffices to say that all positives lie in the index range $(\alpha - 1)d + 1, \ldots, (\alpha + 1)d$ for $1 \leq \alpha \leq \lceil n/d \rceil$.

We now need only consider the pruned matrix of $\mathcal{H}$, called $\mathcal{H}'$, which contains only the columns indexed from $(\alpha - 1)d + 1$ to $(\alpha + 1)d$. There are only two possibilities for $\mathcal{H}'$:

$$\mathcal{H}' = \mathcal{I}_{2d} \quad \text{or} \quad \mathcal{H}' = \begin{bmatrix} \mathbf{0}_{d} & \mathcal{I}_{d} \\ \mathcal{I}_{d} & \mathbf{0}_{d} \end{bmatrix}. \tag{7}$$

Note that there are $(\alpha - 1)d$ negative items before the index range containing all positive indexes. Moreover, all items indexed from 1 to $(\alpha - 1)d$ and from $(\alpha + 1)d + 1$ to $n$ or $n + \beta$ are negative.

The first possibility of $\mathcal{H}'$ occurs when $\alpha$ is odd. Therefore, the positions of the non-zero entries of $\mathbf{h}$, offset by $(\alpha - 1)d$, are also the indexes of the positives in $N$. This case is handled in Step 6 and Steps 7 to 9.

The second possibility of $\mathcal{H}'$ occurs when $\alpha$ is even. Because of the structure of $\mathcal{H}'$, the positions of the positives are identified in Step 6 and Steps 10 to 16.

F. The decoding complexity
It is easy to confirm that the complexity of Steps 3 to 17 is $O(d)$, since the loop scans the $2d$ entries of $\mathbf{h}$ with constant work per entry. Therefore, the decoding complexities of the first and second improved algorithms are determined by the complexities of Steps 1 and 2, respectively. The complexities of Steps 1 and 2 are given by Lemmas 1 and 2, which are $O\left(\log_2^2 \frac{n}{d}\right)$ and $O\left(\log_2 \frac{n}{d}\right)$, respectively. We summarize the results of our two improved algorithms in the following two theorems.

Theorem 1. (The first improved algorithm) Let $n$ be a positive integer and $N = \{1, 2, \ldots, n\}$ be the set of linearly ordered items. Then there exist nonrandom measurement matrices such that up to $d$ consecutive positives can be identified with $2\left\lceil \log_2 \frac{n}{d} \right\rceil + 2d$ tests in $O\left(\log_2^2 \frac{n}{d} + d\right)$ time.

Theorem 2. (The second improved algorithm) Let $n$ be a positive integer and $N = \{1, 2, \ldots, n\}$ be the set of linearly ordered items. Then there exist nonrandom measurement matrices such that up to $d$ consecutive positives can be identified with $4\left\lceil \log_2 \frac{n}{d} \right\rceil + 2d$ tests in $O\left(\log_2 \frac{n}{d} + d\right)$ time.

V. CONCLUSION
In this paper, we have presented two improved algorithms to efficiently identify up to $d$ consecutive positives. In particular, we reduce the decoding complexity in [18] from linear to sublinear time with respect to the number of items. We also reduce the number of tests required. Since the information-theoretic bound shows that we need at least $\max\{\log(nd), d - 1\}$ tests, the $O\left(\left\lceil \log_2 \frac{n}{d} \right\rceil + d\right)$ tests required by our improved algorithms are close to that bound. Extending this work to other group testing settings, such as threshold group testing or complex group testing, remains an open problem.

VI. ACKNOWLEDGMENTS
Thach V. Bui and Thuc D. Nguyen were supported in part by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number NCM2019-18-01.

REFERENCES
[1] R. Dorfman, “The detection of defective members of large populations,” The Annals of Mathematical Statistics, vol. 14, no. 4, pp. 436–440, 1943.
[2] N. Shental, S. Levy, V. Wuvshet, S. Skorniakov, B. Shalem, A. Ottolenghi, Y. Greenshpan, R. Steinberg, A. Edri, R. Gillis, et al., “Efficient high-throughput SARS-CoV-2 testing to detect asymptomatic carriers,” Science Advances, vol. 6, no. 37, p. eabc5961, 2020.
[3] R. Gabrys, S. Pattabiraman, V. Rana, J. Ribeiro, M. Cheraghchi, V. Guruswami, and O. Milenkovic, “AC-DC: Amplification curve diagnostics for COVID-19 group testing,” arXiv preprint arXiv:2011.05223, 2020.
[4] D. Du, F. K. Hwang, and F. Hwang, Combinatorial group testing and its applications, vol. 12. World Scientific, 2000.
[5] A. D’yachkov, N. Polyanskii, V. Shchukin, and I. Vorobyev, “Separable codes for the symmetric multiple-access channel,” in , pp. 291–295, IEEE, 2018.
[6] T. V. Bui, M. Kuribayashi, M. Cheraghchi, and I. Echizen, “A framework for generalized group testing with inhibitors and its potential application in neuroscience,” arXiv preprint arXiv:1810.01086, 2018.
[7] E. Porat and A. Rothschild, “Explicit nonadaptive combinatorial group testing schemes,” IEEE Trans. Inf. Theory, vol. 57, no. 12, pp. –, 2011.
[8] P. Indyk, H. Q. Ngo, and A. Rudra, “Efficiently decodable non-adaptive group testing,” in Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pp. 1126–1142, SIAM, 2010.
[9] H. Q. Ngo, E. Porat, and A. Rudra, “Efficiently decodable error-correcting list disjunct matrices and applications,” in ICALP, pp. 557–568, Springer, 2011.
[10] M. Cheraghchi, “Noise-resilient group testing: Limitations and constructions,” Discrete Applied Mathematics, vol. 161, no. 1-2, pp. 81–95, 2013.
[11] T. V. Bui, M. Kuribayashi, T. Kojima, R. Haghvirdinezhad, and I. Echizen, “Efficient (nonrandom) construction and decoding for non-adaptive group testing,” Journal of Information Processing, vol. 27, pp. 245–256, 2019.
[12] S. Cai, M. Jahangoshahi, M. Bakshi, and S. Jaggi, “GROTESQUE: noisy group testing (quick and efficient),” in Allerton, pp. 1234–1241, 2013.
[13] S. Bondorf, B. Chen, J. Scarlett, H. Yu, and Y. Zhao, “Sublinear-time non-adaptive group testing with O(k log n) tests via bit-mixing coding,” arXiv preprint arXiv:1904.10102, 2019.
[14] M. Aldridge, O. Johnson, and J. Scarlett, “Group testing: an information theory perspective,” arXiv preprint arXiv:1902.06002, 2019.
[15] C. J. Colbourn, “Group testing for consecutive positives,” Annals of Combinatorics, vol. 3, no. 1, pp. 37–41, 1999.
[16] J. S.-T. Juan and G. J. Chang, “Adaptive group testing for consecutive positives,” Discrete Mathematics, vol. 308, no. 7, pp. 1124–1129, 2008.
[17] M. Müller and M. Jimbo, “Consecutive positive detectable matrices and group testing for consecutive positives,” Discrete Mathematics, vol. 279, no. 1-3, pp. 369–381, 2004.
[18] H. Chang, Y.-C. Chiu, and Y.-L. Tsai, “A variation of cover-free families and its applications,” Journal of Computational Biology, vol. 22, no. 7, pp. 677–686, 2015.