Constructions and Comparisons of Pooling Matrices for Pooled Testing of COVID-19
Yi-Jheng Lin, Che-Hao Yu, Tzu-Hsuan Liu, Cheng-Shang Chang, Fellow, IEEE, and Wen-Tsuen Chen, Life Fellow, IEEE
Abstract—In comparison with individual testing, group testing (also known as pooled testing) is more efficient in reducing the number of tests and potentially leading to tremendous cost reduction. As indicated in the recent article posted on the US FDA website [1], the group testing approach for COVID-19 has received a lot of interest lately. There are two key elements in a group testing technique: (i) the pooling matrix that directs samples to be pooled into groups, and (ii) the decoding algorithm that uses the group test results to reconstruct the status of each sample. In this paper, we propose a new family of pooling matrices from packing the pencil of lines (PPoL) in a finite projective plane. We compare their performance with various pooling matrices proposed in the literature, including 2D-pooling [2], P-BEST [3], and Tapestry [4], [5], using the two-stage definite defectives (DD) decoding algorithm. By conducting extensive simulations for a range of prevalence rates up to 5%, our numerical results show that there is no pooling matrix with the lowest relative cost over the whole range of prevalence rates. To optimize the performance, one should choose the right pooling matrix, depending on the prevalence rate. The family of PPoL matrices can dynamically adjust their column weights according to the prevalence rates and could be a better alternative than using a fixed pooling matrix.
Keywords: group testing, perfect difference sets, finite projective planes.

I. INTRODUCTION
The COVID-19 pandemic has deeply affected the daily life of many people in the world. The current strategy for dealing with COVID-19 is to reduce the transmission rate of COVID-19 by preventive measures, such as contact tracing, wearing masks, and social distancing. One problematic characteristic of COVID-19 is that there are asymptomatic infections [6]. As those asymptomatic infections are unaware of their contagious ability, they can infect more people if they have not yet been detected [7]. As shown in the recent paper [8], massive COVID-19 testing in South Korea on Feb. 24, 2020, can greatly reduce the proportion of undetectable infected persons and effectively reduce the transmission rate of COVID-19. Massive testing for a large population is very costly if it is done one at a time. For a population with a low prevalence rate, group testing (or pool testing, pooled testing, batch testing), which tests a group by mixing several samples together, can save testing resources to a great extent. As indicated in the recent article posted on the US FDA website [1], the group testing approach has received a lot of interest lately. Also, the US CDC's guidance for the use of pooling procedures in SARS-CoV-2 testing [9] defines three types of tests: (i) diagnostic testing that is intended to identify occurrence at the individual level and is performed when there is a reason to suspect that an individual may be infected, (ii) screening testing that is intended to identify occurrence at the individual level even if there is no reason to suspect an infection, and (iii) surveillance testing that includes ongoing systematic activities, including collection, analysis, and interpretation of health-related data. The general guidance for diagnostic or screening testing using a pooling strategy in [9] (quoted below) basically follows the two-stage group testing procedure invented by Dorfman in 1943 [10]: "If a pooled test result is negative, then all specimens can be presumed negative with the single test. If the test result is positive or indeterminate, then all the specimens in the pool need to be retested individually."

Y.-J. Lin, C.-H. Yu, T.-H. Liu, C.-S. Chang, and W.-T. Chen are with the Institute of Communications Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan, R.O.C. Email: [email protected]; [email protected]; [email protected]; [email protected]; [email protected].
The Dorfman two-stage algorithm is a very simple group testing strategy. Recently, more sophisticated group testing algorithms have been proposed in the literature; see, e.g., [2]–[5]. Instead of pooling a sample into a single group, these algorithms require diluting a sample and then pooling the diluted samples into multiple groups (pooled samples). Such a procedure is specified by a pooling matrix that directs each diluted sample to be pooled into a specific group. The test results of pooled samples are then used for decoding (reconstructing) the status of each sample. In short, there are two key elements in a group testing strategy: (i) the pooling matrix, and (ii) the decoding algorithm. As COVID-19 is a severe contagious disease, one should be very careful about the decoding algorithm used for reconstructing the testing results of persons. Though decoding algorithms that use soft information for group testing, including various compressed sensing algorithms in [3]–[5], [11], [12], might be more efficient in reducing the number of tests, they are more prone to false positives and false negatives. A false positive might cause a person to be quarantined for 14 days, and thus to lose 14 days of work. On the other hand, a false negative might have an infected person wandering around the neighborhood and cause more people to be infected. In view of this, it is important to have group testing results that are as "definite" as individual testing results (in a noiseless setting). Following the CDC guidance [9], we use the decoding algorithm called the definite defectives (DD) algorithm in the literature (see Algorithm 2.3 of the monograph [13]), which can produce definite testing results. The DD algorithm first identifies negative samples from a negative testing result of a group (as advised by the CDC guidance [9]). Such a step is known as the combinatorial orthogonal matching pursuit (COMP) step in the literature [13].
Then the DD algorithm identifies positive samples if they are in a group with only one positive sample. Not every sample can be decoded by the DD algorithm. As in the Dorfman two-stage algorithm, samples that are not decoded by the DD algorithm go through the second stage, where they are tested individually. We call such an algorithm the two-stage DD algorithm.

One of the main objectives of this paper is to compare the performance of various pooling matrices proposed in the literature, including 2D-pooling [2], P-BEST [3], and Tapestry [4], [5], using the two-stage DD decoding algorithm. In addition to these pooling matrices, we also propose a new construction of a family of pooling matrices from packing the pencil of lines (PPoL) in a finite projective plane. The family of PPoL pooling matrices has very nice properties: (i) both the column correlation and the row correlation are bounded by 1, and (ii) there is freedom to choose the column weights to optimize performance. To measure the amount of saving of a group testing method, we adopt the performance measure called the expected relative cost in [10]. The expected relative cost is defined as the ratio of the expected number of tests required by the group testing technique to the number of tests required by individual testing. We then measure the expected relative costs of these pooling matrices for a range of prevalence rates up to 5%. Some of the main findings of our numerical results are as follows:

(i) There is no pooling matrix that has the lowest relative cost over the whole range of the prevalence rates considered in our experiments. To optimize the performance, one should choose the right pooling matrix, depending on the prevalence rate.

(ii) The expected relative costs of the two pooling matrices used in Tapestry [4], [5] are high compared to the other pooling matrices considered in our experiments. Their performance, in terms of the expected relative cost, is even worse than the (optimized) Dorfman two-stage algorithm. However, Tapestry is capable of decoding most of the samples in the first stage. In other words, the percentages of samples that need to go through the second stage are the smallest among all the pooling matrices considered in our experiments.

(iii) P-BEST [3] has a very low expected relative cost when the prevalence rate is below 1%. However, its expected relative cost increases dramatically when the prevalence rate is above 1.3%.

(iv) 2D-pooling [2] has a low expected relative cost when the prevalence rate is near 5%. Unlike Tapestry and P-BEST, which rely on robots for pipetting, 2D-pooling is relatively easy to implement by humans.

(v) There is a PPoL pooling matrix with column weight 3 that outperforms the P-BEST pooling matrix over the whole range of the prevalence rates considered in our experiments (up to 5%). We suggest using that PPoL pooling matrix up to a prevalence rate of 2% and then switching to other PPoL pooling matrices as the prevalence rate increases. The detailed suggestions are shown in Table III of Section V.

The paper is organized as follows: in Section II, we briefly review the group testing problem, including the mathematical formulation and the DD decoding algorithm. In Section III, we introduce the related works that are used in our comparison study. We then propose the new family of PPoL pooling matrices in Section IV. In Section V, we conduct extensive simulations to compare the performance of various pooling matrices using the two-stage DD algorithm. The paper is concluded in Section VI, where we discuss possible extensions for future works.

II. REVIEW OF GROUP TESTING
A. The problem statement
Consider the group testing problem with M samples (indexed from 1, 2, . . . , M) and N groups (indexed from 1, 2, . . . , N). The M samples are pooled into the N groups (pooled samples) through an N × M binary matrix H = (h_{n,m}) so that the m-th sample is pooled into the n-th group if h_{n,m} = 1 (see Figure 1). Such a matrix is called the pooling matrix in this paper. Note that a pooling matrix corresponds to the biadjacency matrix of an N × M bipartite graph. Let x = (x_1, x_2, . . . , x_M) be the binary state vector of the M samples and y = (y_1, y_2, . . . , y_N) be the binary state vector of the N groups. Then

y = Hx, (1)

where the matrix operation is under Boolean algebra (which replaces the usual addition by the OR operation and the usual multiplication by the AND operation). The main objective of group testing is to decode the vector x given the observation vector y under certain assumptions. In this paper, we adopt the following basic assumptions for binary samples:
(i) Every sample is binary, i.e., it is either positive (1) or negative (0).
(ii) Every group is binary, and a group is positive (1) if at least one sample in that group is positive. On the other hand, a group is negative (0) if all the samples pooled into that group are negative.
If we test each sample one at a time, then the number of tests for M samples is M, and the average number of tests per sample is 1. The key advantage of using group testing is that the number of tests per sample can be greatly reduced. One important performance measure of group testing, called the expected relative cost in [10], is the ratio of the expected number of tests required by the group testing technique to the number of tests required by individual testing. The main objective of this paper is to compare the expected relative costs of various group testing methods.

Fig. 1. Pooled testing represented by a bipartite graph.

B. The definite defectives (DD) decoding algorithm
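To make the setting concrete, the following is a minimal Python sketch (an illustration, not the authors' code) of the Boolean pooling model y = Hx and of the DD decoding steps reviewed in this section.

```python
# Boolean pooling model: group n is positive iff it pools >= 1 positive sample.
def group_results(H, x):
    return [int(any(h and xi for h, xi in zip(row, x))) for row in H]

# DD decoder sketch: 0 = definite negative, 1 = definite positive,
# -1 = un-decoded (un-decoded samples go to the second stage in DD2).
def dd_decode(H, y):
    M = len(H[0])
    state = [-1] * M
    # COMP step: every sample in a negative group is definitely negative.
    for row, yn in zip(H, y):
        if not yn:
            for m in range(M):
                if row[m]:
                    state[m] = 0
    # DD step: a positive group with a single remaining sample decodes it.
    changed = True
    while changed:
        changed = False
        for row, yn in zip(H, y):
            if yn:
                remaining = [m for m in range(M) if row[m] and state[m] != 0]
                if len(remaining) == 1 and state[remaining[0]] == -1:
                    state[remaining[0]] = 1
                    changed = True
    return state

# Toy example: 3 groups, 4 samples; only sample 2 is positive.
H = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
x = [0, 0, 1, 0]
y = group_results(H, x)
print(y, dd_decode(H, y))  # [0, 1, 1] [0, 0, 1, -1]: sample 3 stays un-decoded
```

In the toy run, sample 3 shares its only positive group with the decoded-positive sample 2, so it remains un-decoded, exactly the situation the second stage of the DD2 algorithm resolves.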
In this section, we briefly review the definite defectives (DD) algorithm (see Algorithm 2.3 of [13]). The DD algorithm first identifies negative samples from a negative testing result of a group. Such a step is known as the combinatorial orthogonal matching pursuit (COMP) step. Then the DD algorithm identifies positive samples if they are in a group with only one positive sample. The detailed steps of the DD algorithm are outlined in Algorithm 1.
ALGORITHM 1: The definite defectives (DD) algorithm for binary samples

Input: An N × M pooling matrix H and a binary N-vector y of the group test results.
Output: An M-vector for the test results of the M samples.
0: Initially, every sample is marked "un-decoded."
1: If there is a negative group, then all the samples pooled into that group are decoded to be negative.
2: The edges of samples decoded to be negative in the bipartite graph are removed from the graph.
3: Repeat from Step 1 until there is no negative group.
4: If there is a positive group with exactly one (remaining) sample in that group, then that sample is decoded to be positive.
5: Repeat from Step 4 until no more samples can be decoded.

In Figure 2, we provide an illustrating example for Algorithm 1. In Figure 2(a), the test result of group G_1 is negative, and thus the three samples S_1, S_2, and S_3 are decoded to be negative. In Figure 2(b), the edges that are connected to the samples S_1, S_2, and S_3 are removed from the bipartite graph. In Figure 2(c), the test results of the two groups G_2 and G_3 are positive. As S_4 is the only sample in G_2, S_4 is decoded to be positive.

Note that one might not be able to decode all the samples by the above decoding algorithm. For instance, if a particular sample is pooled into groups that all have at least one positive sample, then there is no way to know whether that sample is positive or negative. As shown in Figure 3, the sample S_1 cannot be decoded by the DD algorithm, as the test results of the three groups are the same no matter whether S_1 is positive or not.

As shown in Lemma 2.2 of [13], one important guarantee of the DD algorithm is that there are no false positives.

Proposition 1: ([13], Lemma 2.2) Assume that all the testing results are correct. Then (i) all the samples that are decoded to be negative in Step 1 of Algorithm 1 are definite negatives, and (ii) all the samples that are decoded to be positive in Step 4 of Algorithm 1 are definite positives.
As such, there are no false positives in Algorithm 1.

In order to resolve all the "un-decoded" samples, we add another stage by individually testing each "un-decoded" sample. This leads to the following two-stage DD algorithm in Algorithm 2.
ALGORITHM 2: The two-stage definite defectives (DD2) algorithm for binary samples

Input: An N × M pooling matrix H and a binary N-vector y of the group test results.
Output: An M-vector for the test results of the M samples.
1: Run the DD algorithm in Algorithm 1.
2: For those "un-decoded" samples, test them one at a time.

III. RELATED WORKS
In [14]–[16], it was shown that a single positive sample canstill be detected even in pools of 5-32 samples for the standardRT-qPCR test of COVID-19. Such an experimental resultprovides supporting evidence for group testing of COVID-19. In the following, we review four group testing strategiesproposed in the literature for COVID-19.
The Dorfman two-stage algorithm [17]: For the case that N = 1, i.e., every sample is pooled into a single group, the DD2 algorithm is simply the original Dorfman two-stage algorithm [10], i.e., if the group of M samples is tested negative, then all the M samples are ruled out. Otherwise, all the M samples are tested individually. Suppose that the prevalence rate is r. Then the expected number of tests to decode the M samples by the Dorfman two-stage algorithm is 1 + (1 − (1 − r)^M) M. As such, the expected relative cost (defined as the ratio of the expected number of tests required by the group testing technique to the number of tests required by individual testing in [10]) is 1/M + 1 − (1 − r)^M. As shown in Table I of [10], the optimal group size M is 11, with an expected relative cost of 20%, when the prevalence rate r is 1%.

Fig. 2. An illustration for the DD algorithm: (a) Step 1: all the samples pooled into the negative group are decoded to be negative; (b) Step 2: the edges of negative samples are removed; (c) Step 4: exactly one sample in a positive group is decoded to be positive.

Fig. 3. An un-decoded sample.

2D-pooling [2]: On a 96-well plate, there are 8 rows and 12 columns. Pool the samples in the same row (column) into a group. This results in 20 groups for 96 samples. One advantage of this simple 2D-pooling strategy is to minimize pipetting errors.

P-BEST [3]: P-BEST [3] uses a 48 × 384 pooling matrix constructed from the Reed-Solomon code [18] for pooled testing of COVID-19. For this pooling matrix, each sample is pooled into 6 groups, and each group contains 48 samples. In [3], the authors proposed using the Gradient Projection for Sparse Reconstruction (GPSR) algorithm for decoding. Though it is claimed in [3] that the GPSR algorithm can detect up to 1% of positive carriers, there is no guarantee that every decoded sample (by the GPSR algorithm) is correct.

Tapestry [4], [5]: The Tapestry scheme [4], [5] uses Kirkman triples to construct its pooling matrices.
For the pooling matrices in [4], [5], each sample is pooled into 3 groups (in their experiments, some samples are only pooled into 2 groups). As such, the pooling matrix is sparser than that used by P-BEST, and it is claimed to be viable not just with low (< 1%) prevalence rates, but even with moderate prevalence rates (5%-10%). One of the restrictions for the pooling matrices constructed from the Kirkman triples is that the column weights must be 3. Such a restriction limits the applicability of optimizing the performance according to the prevalence rate.

IV. PPoL CONSTRUCTIONS OF POOLING MATRICES
In this section, we propose a new family of pooling matrices from packing the pencil of lines (PPoL) in a finite projective plane. Our idea of constructing PPoL pooling matrices was inspired by the constructions of channel hopping sequences in the rendezvous search problem in cognitive radio networks and the constructions of grant-free uplink transmission schedules in 5G networks (see, e.g., [19]–[21]).

A pooling matrix is said to be (d_1, d_2)-regular if there are exactly d_1 (resp. d_2) nonzero elements in each column (resp. row). In other words, the degree of every left-hand (resp. right-hand) node in the corresponding bipartite graph is d_1 (resp. d_2). The total number of edges in the bipartite graph is d_1 M = d_2 N for a (d_1, d_2)-regular pooling matrix H. Define the (compressing) gain

G = M/N = d_2/d_1. (2)

A. Perfect difference sets and finite projective planes
As our construction of the pooling matrix is from packing the pencil of lines in a finite projective plane, we first briefly review the notions of difference sets and finite projective planes.
Definition 2: (Difference sets) Let Z_p = {0, 1, . . . , p − 1}. A set D = {a_0, a_1, . . . , a_{k−1}} ⊂ Z_p is called a (p, k, λ)-difference set if for every (ℓ mod p) ≠ 0, there exist at least λ ordered pairs (a_i, a_j) such that a_i − a_j = (ℓ mod p), where a_i, a_j ∈ D. A (p, k, 1)-difference set is said to be perfect if there exists exactly one ordered pair (a_i, a_j) such that a_i − a_j = (ℓ mod p) for every (ℓ mod p) ≠ 0.

Definition 3: (Finite projective planes)
A finite projective plane of order m, denoted by PG(2, m), is a collection of m^2 + m + 1 lines and m^2 + m + 1 points such that
(P1) every line contains m + 1 points,
(P2) every point is on m + 1 lines,
(P3) any two distinct lines intersect at exactly one point, and
(P4) any two distinct points lie on exactly one line.

When m is a prime power, Singer [22] established the connection between an (m^2 + m + 1, m + 1, 1)-perfect difference set and a finite projective plane of order m through a collineation that maps points (resp. lines) to points (resp. lines) in a finite projective plane. Specifically, suppose that D = {a_0, a_1, . . . , a_m} is an (m^2 + m + 1, m + 1, 1)-perfect difference set with

a_0 = 0 < a_1 = 1 < a_2 < . . . < a_m < m^2 + m + 1. (3)

(i) Let {0, 1, . . . , m^2 + m} be the m^2 + m + 1 points.
(ii) Let p = m^2 + m + 1 and

D_ℓ = {(a_0 + ℓ) mod p, (a_1 + ℓ) mod p, . . . , (a_m + ℓ) mod p}, ℓ = 0, 1, 2, . . . , p − 1,

be the m^2 + m + 1 lines.
Then these m^2 + m + 1 points and m^2 + m + 1 lines form a finite projective plane of order m.

B. The construction algorithm
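Before giving the construction, Singer's correspondence above can be checked for the smallest case m = 2, where D = {0, 1, 3} is a (7, 3, 1)-perfect difference set and the translates D_ℓ form the Fano plane PG(2, 2). The block below is a small self-contained sketch (not from the paper):

```python
from itertools import combinations

m, p = 2, 7           # p = m^2 + m + 1
D = [0, 1, 3]         # candidate (7, 3, 1)-perfect difference set

# Every nonzero residue mod 7 arises exactly once as a difference a_i - a_j.
diffs = sorted((a - b) % p for a in D for b in D if a != b)
assert diffs == list(range(1, p))

# The translates D_l are the lines of PG(2, 2) (the Fano plane).
lines = [{(a + l) % p for a in D} for l in range(p)]
assert all(len(L) == m + 1 for L in lines)                          # (P1)
assert all(len(L1 & L2) == 1 for L1, L2 in combinations(lines, 2))  # (P3)

# The pencil of lines through point 0, with point 0 itself removed from all
# but one line, covers Z_7; this packing property is what the PPoL
# construction below exploits.
pencil = [lines[0]] + [lines[p - a] - {0} for a in D[1:]]
assert set().union(*pencil) == set(range(p))
print("PG(2,2) properties verified")
```

The same checks run unchanged for any prime-power order once a perfect difference set of the right size is supplied.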
In this section, we propose the PPoL algorithm for constructing pooling matrices. For this, one first constructs an (m^2 + m + 1, m + 1, 1)-perfect difference set D = {a_0, a_1, . . . , a_m} with

a_0 = 0 < a_1 = 1 < a_2 < . . . < a_m < m^2 + m + 1. (4)

Let p = m^2 + m + 1 and

D_ℓ = {(a_0 + ℓ) mod p, (a_1 + ℓ) mod p, . . . , (a_m + ℓ) mod p}, ℓ = 0, 1, 2, . . . , p − 1, (5)

be the p lines in the corresponding finite projective plane. It is easy to see that the m + 1 lines in the corresponding finite projective plane that contain point 0 are D_0, D_{p−a_1}, D_{p−a_2}, . . . , D_{p−a_m}. These m + 1 lines are called the pencil of lines that contain point 0 (as the pencil point). As the only intersection of the m + 1 lines is point 0, these m + 1 lines, excluding point 0, are disjoint and thus can be packed into Z_p. This is formally proved in the following lemma.

Lemma 4:
Let D*_{p−a_i} = D_{p−a_i} \ {0}, i = 1, 2, . . . , m. Then {D_0, D*_{p−a_1}, . . . , D*_{p−a_m}} is a partition of Z_p.

Proof.
First, note that {D_0, D_{p−a_1}, . . . , D_{p−a_m}} are the m + 1 lines that contain point 0. As any two distinct lines intersect at exactly one point, we know that for i ≠ 0, D_0 ∩ D*_{p−a_i} = ∅, and that for i ≠ j, D*_{p−a_i} ∩ D*_{p−a_j} = ∅. Thus, they are disjoint. As there are m + 1 points in D_0 and m points in each D*_{p−a_i}, the union D_0 ∪ D*_{p−a_1} ∪ . . . ∪ D*_{p−a_m} contains m + 1 + m^2 points. These m + 1 + m^2 points are exactly the set of m^2 + m + 1 points in the finite projective plane of order m.

In Algorithm 3, we show how one can construct a pooling matrix from a finite projective plane. The idea is to first construct a bipartite graph with the line nodes on the left and the point nodes on the right. There is an edge between a point node and a line node if that point is in that line. Then we start trimming this line-point bipartite graph to achieve the needed compression ratio. Specifically, we select the subgraph with the m^2 line nodes that do not contain point 0 (on the left) and the d_1 m point nodes in the union of d_1 pencils of lines (on the right).

ALGORITHM 3:
The PPoL algorithm
Input: The number of samples M = m^2 with m being a prime power, and the degree of each sample d_1 with 1 ≤ d_1 ≤ m.
Output: An N × M binary pooling matrix H with M = m^2 and N = d_1 m.
1: Let p = m^2 + m + 1 and construct a perfect difference set D = {a_0, a_1, . . . , a_m} in Z_p (with a_0 = 0 and a_1 = 1).
2: For ℓ = 0, 1, . . . , p − 1, let D_ℓ = {(a_0 + ℓ) mod p, (a_1 + ℓ) mod p, . . . , (a_m + ℓ) mod p} be the p lines.
3: Construct a bipartite graph with the p lines on the left and the p points on the right. Add an edge between a point node and a line node if that point is in that line.
4: Remove point 0 and line 0 from the bipartite graph (and the edges attached to these two nodes). Let G = (g_{n,ℓ}) be the (m^2 + m) × (m^2 + m) biadjacency matrix of the trimmed bipartite graph with g_{n,ℓ} = 1 if point n is in D_ℓ.
5: Let D*_{p−a_i} = D_{p−a_i} \ {0}, i = 1, 2, . . . , m, be the m pencils of lines that contain point 0.
6: Remove the (p − a_i)-th column, i = 1, 2, . . . , m, in G to form an (m^2 + m) × m^2 biadjacency matrix G̃. Note that these m columns correspond to the m lines containing point 0.
7: Let B = ∪_{i=1}^{d_1} D*_{p−a_i} (select the first d_1 pencils of lines that contain point 0). Remove the rows of G̃ that are not in B to form a d_1 m × m^2 biadjacency matrix H.

Proposition 5:
The degree of a line node is d_1, and the degree of a point node is m.

Proof. As the remaining lines are the lines not containing point 0, each line intersects with D*_{p−a_i} at exactly one point. Since there are d_1 pencils of lines that contain point 0, each line then intersects with B = ∪_{i=1}^{d_1} D*_{p−a_i} at exactly d_1 points. On the other hand, each of the points in B is on a line that contains point 0. As the lines that contain point 0 are removed, each point in B is in m lines of the remaining m^2 lines.

Proposition 6:
There is at most one common nonzero element in two rows (resp. columns) of the pooling matrix H from Algorithm 3, i.e., the inner product of two row vectors (resp. column vectors) is at most 1.

Proof.
This is because the bipartite graph with the biadjacency matrix H is a subgraph of the line-point bipartite graph corresponding to a finite projective plane. From (P3) and (P4) of Definition 3, any two distinct lines intersect at exactly one point, and any two distinct points lie on exactly one line. Thus, there is at most one common nonzero element in two rows (resp. columns) of H from Algorithm 3.

Corollary 7:
The girth (the minimum length of a cycle) of the bipartite graph with biadjacency matrix H is at least 6.

Proof.
As the length of a cycle in a bipartite graph must be an even number, it suffices to show that there does not exist a cycle of length 4. We prove this by contradiction. Suppose that there is a cycle of length 4, containing two line nodes L_1 and L_2 and two point nodes P_1 and P_2. Then the intersection of the two lines L_1 and L_2 contains the two points P_1 and P_2. This contradicts (P3) in Definition 3.

Theorem 8:
Consider using the d_1 m × m^2 pooling matrix H from Algorithm 3 for a binary state vector x in a noiseless setting. If the number of positive samples in x is not larger than d_1 − 1, then every sample can be correctly decoded by the DD algorithm in Algorithm 1.

Proof.
Suppose that there are at most d_1 − 1 positive samples. We first show that every negative sample can be correctly decoded by the DD algorithm in Algorithm 1. Consider a negative sample. Since there are at most d_1 − 1 positive samples that can be pooled into the d_1 groups of this negative sample, and two different samples can be in a common group at most once (Proposition 6), there must be at least one group without positive samples. Thus, every negative sample can be correctly decoded.

Now consider a positive sample. Since there are at most d_1 − 2 other positive samples that can be pooled into the d_1 groups of this positive sample, and two different samples can be in a common group at most once (Proposition 6), there must be at least one group in which this positive sample is the only positive sample. Thus, every positive sample can be correctly decoded.

We note that there are other methods that can also generate bipartite graphs that satisfy the property in Proposition 6. For instance, in the recent paper [23], Täufer used the shifted transversal design to generate "multipools" (in Definition 1 of [23]) that satisfy the property in Proposition 6 when m is a prime. Both the PPoL constructions and the constructions of "multipools" in [23] are closely related to orthogonal Latin squares [24]. Also, pooling matrices that satisfy the decoding property in Theorem 8 are known as superimposed codes [25].

Fig. 4. Computing the conditional probability p_0 by the tree evaluation method.

C. Probabilistic analysis of the PPoL pooling matrices
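For concreteness, Algorithm 3 can be sketched in a few lines of Python. The block below (an illustration, not the authors' code) builds the PPoL matrix for m = 3 (p = 13, using the perfect difference set {0, 1, 3, 9}) with d_1 = 2, and checks the degree properties of Proposition 5:

```python
from itertools import combinations

# A sketch of Algorithm 3 for m = 3, p = m^2 + m + 1 = 13, with the
# (13, 4, 1)-perfect difference set D = {0, 1, 3, 9}; d1 is the number of
# groups each sample joins (1 <= d1 <= m).
def ppol_matrix(m, p, D, d1):
    lines = [{(a + l) % p for a in D} for l in range(p)]
    through0 = [0] + [p - a for a in D[1:]]             # pencil through point 0
    cols = [l for l in range(p) if l not in through0]   # the m^2 other lines
    rows = sorted(set().union(*(lines[p - a] for a in D[1:d1 + 1])) - {0})
    return [[1 if n in lines[l] else 0 for l in cols] for n in rows]

H = ppol_matrix(3, 13, [0, 1, 3, 9], d1=2)
col_weights = [sum(row[j] for row in H) for j in range(len(H[0]))]
row_weights = [sum(row) for row in H]
print(len(H), len(H[0]))                   # 6 9: a (d1*m) x m^2 matrix
print(set(col_weights), set(row_weights))  # {2} {3}: Proposition 5
```

By Proposition 6, any two columns (or rows) of the resulting H share at most one common nonzero element, which can be verified by taking pairwise inner products.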
In this section, we conduct a probabilistic analysis of the PPoL pooling matrices. We make the following assumption:

(A1) All the samples are i.i.d. Bernoulli random variables. A sample is positive (resp. negative) with probability r_1 (resp. r_0). The probability r_1 is known as the prevalence rate in the literature.

Note that r_0 + r_1 = 1. Also, let q_1 (resp. q_0) be the probability that the group end of a randomly selected edge is positive (resp. negative). Excluding the randomly selected edge, there are d_2 − 1 remaining edges in that group, and thus

q_0 = (r_0)^{d_2 − 1}, (6)
q_1 = 1 − (r_0)^{d_2 − 1}. (7)

Let p_0 be the conditional probability that a sample cannot be decoded, given that the sample is a negative sample. Note that a negative sample can be decoded if at least one of its edges is in a negative group (see Figure 4). Consider a negative sample, called the tagged sample. Since the girth of the bipartite graph of the pooling matrix is 6 (as shown in Corollary 7), the samples in the d_1 groups of the subtree of the tagged sample are distinct (see the tree expansion in Figure 4). Thus,

p_0 = (q_1)^{d_1} = (1 − (r_0)^{d_2 − 1})^{d_1}. (8)

Let p̂_0 be the conditional probability that the sample end of a randomly selected edge cannot be decoded, given that the sample end is a negative sample. Note that the excess degree of a sample (excluding the randomly selected edge) is d_1 − 1. Analogous to the argument for (8) (see the bottom subtree of the tree expansion in Figure 5), we have

p̂_0 = (q_1)^{d_1 − 1} = (1 − (r_0)^{d_2 − 1})^{d_1 − 1}. (9)

Fig. 5. Computing the conditional probability p_1 by the tree evaluation method.

Let p_1 be the conditional probability that a sample cannot be decoded given that the sample is a positive sample. Note that a positive sample can be decoded if at least one of its edges is in a group in which all the edges are removed except the edge of the positive sample.
Since an edge is removed if its sample end is a negative sample and that sample end is decoded to be negative, the probability that an edge is removed is (1 − p̂_0) r_0. If the tree expansion in Figure 5 is actually a tree, then

p_1 = (1 − (r_0 (1 − p̂_0))^{d_2 − 1})^{d_1}. (10)

We note that as the tree expansion in Figure 5 may not be a tree for a PPoL pooling matrix generated from Algorithm 3, the identity in (10) is only an approximation. A sufficient condition for the tree expansion in Figure 5 to be a tree is that the girth of the bipartite graph is larger than 8. Unfortunately, the girth of a PPoL pooling matrix can only be proved to be at least 6.

Since a sample cannot be decoded with probability r_0 p_0 + r_1 p_1, the average number of tests needed for the DD2 algorithm in Algorithm 2 to decode the M samples is N + M (r_0 p_0 + r_1 p_1). The expected relative cost for the DD2 algorithm with an N × M pooling matrix is

(N + M (r_0 p_0 + r_1 p_1))/M = 1/G + r_0 p_0 + r_1 p_1, (11)

where G = M/N is the (compressing) gain of the pooling matrix in (2). Note that for a (d_1, d_2)-regular pooling matrix, we have from (2) that G = d_2/d_1. Thus, we can use (8), (10), and (11) to find the (d_1, d_2)-regular pooling matrix that has the lowest expected relative cost (though (10) is only an approximation for the pooling matrices constructed from the PPoL algorithm). In Table I, we use a grid search to find the (d_1, d_2)-regular pooling matrix with the lowest expected relative cost for various prevalence rates r_1 up to 10%. The search region for the grid search satisfies d_1 ≤ d_2. In the last column of this table, we also show the expected relative cost of the Dorfman two-stage algorithm (Table I of [10]). As shown in this table, using the DD2 algorithm (with the optimal pooling matrices) has significant gains over the Dorfman two-stage algorithm. Unfortunately, not every optimal (d_1, d_2)-regular pooling matrix in Table I can be constructed by using the PPoL algorithm in Algorithm 3.
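As a quick numerical check (our sketch, not the authors' code), the closed-form cost in (6)-(11) can be evaluated directly; the values below reproduce two rows of Table I, together with the Dorfman baseline from Section III.

```python
# Expected relative cost in (11) for a (d1, d2)-regular pooling matrix, with
# p0 from (8), hat-p0 from (9), and the tree approximation p1 from (10);
# the compressing gain is G = d2/d1, so 1/G = d1/d2.
def relative_cost(d1, d2, r1):
    r0 = 1.0 - r1
    q1 = 1.0 - r0 ** (d2 - 1)                              # (7)
    p0 = q1 ** d1                                          # (8)
    p0_hat = q1 ** (d1 - 1)                                # (9)
    p1 = (1.0 - (r0 * (1.0 - p0_hat)) ** (d2 - 1)) ** d1   # (10)
    return d1 / d2 + r0 * p0 + r1 * p1                     # (11)

def dorfman_cost(M, r):   # Dorfman two-stage baseline from Section III
    return 1.0 / M + 1.0 - (1.0 - r) ** M

print(round(relative_cost(3, 31, 0.01), 4))  # 0.1218, Table I row r = 1%
print(round(relative_cost(3, 12, 0.05), 4))  # 0.3678, Table I row r = 5%
print(round(dorfman_cost(11, 0.01), 2))      # 0.20, Dorfman at r = 1%
```

A grid search over (d_1, d_2) pairs with this function recovers the optimal entries of Table I.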
In the next section, we will look for suboptimal pooling matrices that have small performance degradation.

TABLE I
THE (d_1, d_2)-REGULAR POOLING MATRIX WITH THE LOWEST EXPECTED RELATIVE COST FROM (11).

r_1   d_1  d_2  cost (11)  Dorfman [10]
1%    3    31   0.1218     0.20
2%    4    29   0.1881     0.27
3%    4    22   0.2545     0.33
4%    4    17   0.3147     0.38
5%    3    12   0.3678     0.43
6%    3    11   0.4166     0.47
7%    3    10   0.4627     0.50
8%    2    7    0.5035     0.53
9%    2    6    0.5416     0.56
10%   2    6    0.5760     0.59

V. NUMERICAL RESULTS
In this section, we compare the performance of various pooling matrices by using the DD2 algorithm in Algorithm 2. The first four pooling matrices are constructed by using the PPoL algorithm in Algorithm 3 with the parameters (m, d_1) = (31, 3), (23, 4), (13, 3), and (7, 2), respectively. The fifth pooling matrix is the pooling matrix used in P-BEST [3]. The sixth matrix is the 15 × 35 pooling matrix constructed by the Kirkman triples. The next two pooling matrices are used in Tapestry [4], [5]. The last pooling matrix is the 2D-pooling matrix in [2]. In Table II, we show the basic information of these pooling matrices. The size of an N × M pooling matrix indicates that the number of groups is N and the number of samples is M. The parameter d_1 is the number of groups in which a sample is pooled. On the other hand, d_2 is the number of samples in a group. Note that some pooling matrices are not (d_1, d_2)-regular. For instance, in the 2D-pooling matrix, there are 8 groups with 12 samples and 12 groups with 8 samples. Also, both the 16 × 40 matrix and the 24 × 60 matrix used in Tapestry are not (d_1, d_2)-regular. The column marked with row cor. (resp. col. cor.) is the maximum of the inner product of two rows (resp. columns) in a pooling matrix. For a pooling matrix, the column marked with girth is the minimum length of a cycle in the bipartite graph corresponding to that pooling matrix. The column marked with (comp.) gain is the compressing gain G of a pooling matrix, which is the ratio of the number of columns (samples) to the number of rows (groups), i.e., G = M/N. As shown in Table II, both the row correlation and the column correlation of the pooling matrices constructed from the PPoL algorithm in Algorithm 3 are 1. So are those of the 15 × 35 pooling matrix constructed by the Kirkman triples. Such a correlation result is expected from Proposition 6.
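The agreement between the measured p_0's and formula (8), reported below for the PPoL matrices, can be illustrated with a small self-contained sketch (our illustration on a toy 9 × 9 PPoL matrix for m = 3, d_1 = 3; the matrices used in the paper are larger):

```python
import random

random.seed(1)

# PPoL matrix for m = 3 (p = 13, D = {0, 1, 3, 9}) with d1 = 3: a 9 x 9
# (3, 3)-regular toy matrix built as in Algorithm 3.
p, D, d1 = 13, [0, 1, 3, 9], 3
lines = [{(a + l) % p for a in D} for l in range(p)]
through0 = [0] + [p - a for a in D[1:]]
cols = [l for l in range(p) if l not in through0]
rows = sorted(set().union(*(lines[p - a] for a in D[1:d1 + 1])) - {0})
H = [[1 if n in lines[l] else 0 for l in cols] for n in rows]
groups_of = [[i for i in range(len(H)) if H[i][m]] for m in range(len(cols))]

# Monte Carlo estimate of p0: the fraction of negative samples that sit in
# no negative group, i.e., that the COMP step cannot decode.
r1, trials, miss, neg = 0.3, 20000, 0, 0
for _ in range(trials):
    x = [random.random() < r1 for _ in cols]
    y = [any(x[m] for m in range(len(cols)) if H[i][m]) for i in range(len(H))]
    for m, xm in enumerate(x):
        if not xm:
            neg += 1
            if all(y[i] for i in groups_of[m]):
                miss += 1

d2 = sum(H[0])                                  # 3 samples per group
p0_theory = (1 - (1 - r1) ** (d2 - 1)) ** d1    # formula (8)
print(round(miss / neg, 3), round(p0_theory, 3))  # the two agree closely
```

Because the girth of the toy matrix is at least 6, the samples sharing a group with a tagged negative sample are all distinct, so (8) is exact here and the Monte Carlo estimate matches it up to sampling noise.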
On the other hand, the row correlation and the column correlation of the pooling matrix in P-BEST [3] are 6 and 2, respectively. Also, the girth of the pooling matrix in P-BEST is only 4, which is smaller than the other four matrices. The girth of the 16x40 pooling matrix in Tapestry is also 4. This shows that the pooling matrices from the PPoL algorithm are more "spread-out" than the pooling matrix in P-BEST and the 16x40 pooling matrix in Tapestry.

Fig. 6. The conditional probability p_0 (that a sample cannot be decoded given it is a negative sample) as a function of the prevalence rate r for various pooling matrices.

TABLE II
BASIC INFORMATION OF SOME POOLING MATRICES.

H                    size     d_1   d_2    row cor.   col. cor.   girth   (comp.) gain
PPoL-(31,3)          93x961   3     31     1          1           6       10.33
PPoL-(23,4)          92x529   4     23     1          1           6       5.75
PPoL-(13,3)          39x169   3     13     1          1           6       4.33
PPoL-(7,2)           14x49    2     7      1          1           6       3.5
P-BEST [3]           48x384   6     48     6          2           4       8
Kirkman 15x35        15x35    3     7      1          1           6       2.33
Tapestry 16x40 [4]   16x40    3     n/a    n/a        n/a         4       2.5
Tapestry 24x60 [4]   24x60    3     n/a    n/a        n/a         n/a     2.5
2D-pooling [2]       20x96    2     8/12   1          1           8       4.8

To compare the performance of these pooling matrices, we conduct 10,000 independent experiments for each value of the prevalence rate r. Each numerical result is obtained by averaging over these 10,000 independent experiments. In Figure 6, we show the (measured) conditional probability p_0 (that a sample cannot be decoded given it is a negative sample) for these pooling matrices. For the PPoL pooling matrices, the measured p_0's match extremely well with the theoretical results from (8). As shown in this figure, the Kirkman matrix and the two matrices in Tapestry have the best performance. This is because their d_2's (the number of samples in a group) are small (below 9 for these three matrices). As such, the probability that a group is tested negative is higher than for the other pooling matrices. Note that these three matrices also have low (compressing) gains, 2.33-2.5. On the other hand, P-BEST has the worst performance for p_0, as the number of samples in a group for that matrix is 48, which is the largest among all these pooling matrices.

In Figure 7, we show the (measured) conditional probability p_1 (that a sample cannot be decoded given it is a positive sample) for these pooling matrices. Once again, the Kirkman
matrix and the two matrices in Tapestry have the best performance. This is mainly due to the low (compressing) gains of these three matrices. Though not shown in Figure 7, we note that the measured p_1's are very close to those from (10), and thus the tree expansion in Figure 5 is actually tree-like.

Fig. 7. The conditional probability p_1 (that a sample cannot be decoded given it is a positive sample) as a function of the prevalence rate r for various pooling matrices.

As discussed in Section IV-C, the probability that a sample cannot be decoded is (1-r)p_0 + r p_1. Such a probability is also the probability that a sample needs to go through the second stage for individual testing. In Figure 8, we show the probability (1-r)p_0 + r p_1 as a function of the prevalence rate r for various pooling matrices. As shown in this figure, the Kirkman matrix and the two matrices in Tapestry have the best performance. Once again, this is mainly due to the low (compressing) gains of these three matrices. We note that it takes time to do the second test. The numerical results in Figure 8 imply that using the Kirkman matrix (or the two matrices in Tapestry) has the shortest expected time to obtain a testing result.

Fig. 8. The probability (1-r)p_0 + r p_1 (that a sample cannot be decoded at the first stage and should be tested individually at the second stage) as a function of the prevalence rate r for various pooling matrices.

A fair comparison of these pooling matrices is to measure their expected relative costs (defined in [10]). Recall that the expected relative cost is the ratio of the expected number of tests required by the group testing technique to the number of tests required by individual testing. In Figure 9, we show the (measured) expected relative costs for these pooling matrices. In this figure, we also plot the curve for the Dorfman two-stage algorithm (the black curve) with the optimal group size M chosen from Table 1 of [10] for the prevalence rates 1%, 2%, . . . , 10%. To our surprise, the curves for the Kirkman matrix and the two matrices in Tapestry are above the black curve.
This means that the expected relative costs of these three matrices are higher than the (optimized) Dorfman two-stage algorithm. Thus, if the additional amount of time to go through the second stage is not critical, using other pooling matrices could lead to more cost reduction than using these three matrices. There are several pooling matrices that have very low relative costs when the prevalence rates are below 1%. The P-BEST pooling matrix is one of them. However, the relative cost of the P-BEST pooling matrix increases dramatically when the prevalence rates are above 1.3%. Moreover, the P-BEST pooling matrix has a higher relative cost than the (optimized) Dorfman two-stage algorithm when the prevalence rate is above 2.5%. On the other hand, 2D-pooling has a very low relative cost when the prevalence rates are above 2.5%. To summarize, there does not exist a pooling matrix that has the lowest relative cost in the whole range of the prevalence rates considered in our experiments.

To optimize the performance, one should choose the right pooling matrix, depending on the prevalence rate. However, this might be difficult, as the exact prevalence rate of a new outbreak of COVID-19 in a region might not be known in advance. Our suggestion is to use suboptimal PPoL matrices for a range of prevalence rates, as shown in Table III. As shown in this table, the costs computed from the theoretical approximations in (11) and the costs measured from simulations are very close, and they are within 2% of the minimum costs for (d_1, d_2)-regular pooling matrices in Table I. From our numerical results in Figure 9, we suggest using the PPoL matrix with d_1 = 3 and d_2 = 31 when the prevalence rate r is below 2%. In this range of prevalence rates, its expected relative cost is even smaller than that of P-BEST. Moreover, it can achieve an 8-fold reduction in test costs when the prevalence rate is near 1% (as shown in Table III), and most samples can be decoded in the first stage (as shown in Figure 8).
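The Dorfman two-stage baseline used for these comparisons is easy to reproduce. The sketch below (our own illustration, not the paper's code) computes the expected relative cost 1/M + 1 - (1-r)^M of Dorfman testing with group size M (one test per group, plus M individual retests if the group is positive) and minimizes it over M, which matches the Dorfman column of Tables I and III.

```python
# Expected relative cost of the Dorfman two-stage algorithm: a group of size
# M costs 1 test, plus M individual retests when the group tests positive,
# which happens with probability 1 - (1-r)^M. Cost per sample:
def dorfman_cost(r, M):
    return 1.0 / M + 1.0 - (1.0 - r) ** M

def optimal_dorfman(r, max_group=100):
    """Minimize the Dorfman relative cost over group sizes M >= 2."""
    return min((dorfman_cost(r, M), M) for M in range(2, max_group + 1))

for pct in range(1, 11):
    cost, M = optimal_dorfman(pct / 100)
    print(f"r = {pct:2d}%  M* = {M:2d}  relative cost = {cost:.2f}")
# The printed costs (0.20, 0.27, 0.33, 0.38, 0.43, 0.47, 0.50, 0.53, 0.56,
# 0.59) agree with the Dorfman column of Tables I and III.
```

This baseline needs no pooling-matrix design at all, which is why a structured matrix is only worthwhile when its curve in Figure 9 falls below the black Dorfman curve.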
When the prevalence rate r is between 2% and 4%, we suggest using the PPoL matrix with d_1 = 4 and d_2 = 23. In this range of prevalence rates, using such a pooling matrix can still achieve (at least) a 3-fold reduction in test costs. Roughly 17% of samples need to go through the second stage when the prevalence rate is near 4% (as shown in Figure 8). When the prevalence rate r is between 4% and 7%, we suggest using the PPoL matrix with d_1 = 3 and d_2 = 13, and it can still achieve (at least) a 2-fold reduction in test costs. When the prevalence rate r is between 7% and 10%, we suggest using the PPoL matrix with d_1 = 2 and d_2 = 7. Though its expected relative cost is still lower than that of the Dorfman two-stage algorithm, the difference is small.

Fig. 9. The expected relative cost as a function of the prevalence rate r for various pooling matrices.

TABLE III
SUBOPTIMAL PPOL POOLING MATRICES.

r     d_1   d_2   cost (11)   cost (sim)   Dorfman [10]
1%    3     31    0.1218      0.12         0.20
2%    4     23    0.1973      0.20         0.27
3%    4     23    0.2552      0.25         0.33
4%    3     13    0.3170      0.32         0.38
5%    3     13    0.3685      0.37         0.43
6%    3     13    0.4243      0.42         0.47
7%    2     7     0.4651      0.47         0.50
8%    2     7     0.5035      0.50         0.53
9%    2     7     0.5422      0.54         0.56
10%   2     7     0.5809      0.58         0.59

VI. CONCLUSION
In this paper, we proposed a new family of PPoL pooling matrices that have a maximum column correlation and row correlation of 1 for a wide range of column weights. Using the two-stage definite defectives (DD2) decoding algorithm, we compared their performance with various pooling matrices proposed in the literature, including 2D-pooling [2], P-BEST [3], and Tapestry [4], [5]. Our numerical results showed that there is no pooling matrix with the lowest expected relative cost in the whole range of the prevalence rates. To optimize the performance, one should choose the right pooling matrix, depending on the prevalence rate. As the family of PPoL matrices can dynamically adjust their column weights according to the prevalence rates, it seems that using such a family of pooling matrices might lead to better cost reduction than using a fixed pooling matrix.

There are several research directions for future works:
(i) Other decoding algorithms: in this paper, we only evaluated the performance of pooling matrices using the DD2 algorithm. To probe further, we are currently investigating the possibility of using other decoding algorithms, in particular, the GPSR algorithm in [3] and the belief propagation (BP) algorithm in [26].
(ii) Noisy decoding: the DD2 algorithm works very well in the noiseless setting. However, it is not clear whether it can continue to perform well in a noisy setting. There are several noise models proposed in the literature (see, e.g., the monograph [13]). Among them, the dilution noise model is of particular interest to us.
(iii) Ternary samples: in this paper, we only considered binary samples. For ternary samples, there are three test outcomes: negative (0), weakly positive (1), and strongly positive (2). It seems possible to extend the DD2 algorithm for binary samples to the setting with ternary samples by using successive cancellations.

REFERENCES
[1] U.S. Food and Drug Administration, https://www.fda.gov/medical-devices/coronavirus-covid-19-and-medical-devices/pooled-sample-testing-and-screening-testing-covid-19
[2] N. Sinnott-Armstrong, D.
Klein, and B. Hickey, "Evaluation of group testing for SARS-CoV-2 RNA," medRxiv, 2020.
[3] N. Shental, S. Levy, V. Wuvshet, S. Skorniakov, B. Shalem, A. Ottolenghi, Y. Greenshpan, R. Steinberg, A. Edri, R. Gillis et al., "Efficient high-throughput SARS-CoV-2 testing to detect asymptomatic carriers," Science Advances, p. eabc5961, 2020.
[4] S. Ghosh, A. Rajwade, S. Krishna, N. Gopalkrishnan, T. E. Schaus, A. Chakravarthy, S. Varahan, V. Appu, R. Ramakrishnan, S. Ch et al., "Tapestry: A single-round smart pooling technique for COVID-19 testing," medRxiv, 2020.
[5] S. Ghosh, R. Agarwal, M. A. Rehan, S. Pathak, P. Agrawal, Y. Gupta, S. Consul, N. Gupta, R. Goyal, A. Rajwade et al., "A compressed sensing approach to group-testing for COVID-19 detection," arXiv preprint arXiv:2005.07895, 2020.
[6] World Health Organization, https://www.who.int/emergencies/diseases/novel-coronavirus-2019
[7] H. Nishiura, T. Kobayashi, T. Miyama, A. Suzuki, S.-m. Jung, K. Hayashi, R. Kinoshita, Y. Yang, B. Yuan, A. R. Akhmetzhanov et al., "Estimation of the asymptomatic ratio of novel coronavirus infections (COVID-19)," International Journal of Infectious Diseases, vol. 94, p. 154, 2020.
[8] Y.-C. Chen, P.-E. Lu, C.-S. Chang, and T.-H. Liu, "A time-dependent SIR model for COVID-19 with undetectable infected persons," IEEE Transactions on Network Science and Engineering, DOI: 10.1109/TNSE.2020.3024723.
[9] Centers for Disease Control and Prevention, https://www.cdc.gov/coronavirus/2019-ncov/lab/pooling-procedures.html
[10] R. Dorfman, "The detection of defective members of large populations," The Annals of Mathematical Statistics, vol. 14, no. 4, pp. 436-440, 1943.
[11] J. Yi, R. Mudumbai, and W. Xu, "Low-cost and high-throughput testing of COVID-19 viruses and antibodies via compressed sensing: System concepts and computational experiments," arXiv preprint arXiv:2004.05759, 2020.
[12] A. Heidarzadeh and K. R. Narayanan, "Two-stage adaptive pooling with RT-qPCR for COVID-19 screening," arXiv preprint arXiv:2007.02695, 2020.
[13] M. Aldridge, O. Johnson, and J. Scarlett, "Group testing: an information theory perspective," arXiv preprint arXiv:1902.06002, 2019.
[14] S. Lohse, T. Pfuhl, B. Berkó-Göttel, J. Rissland, T. Geißler, B. Gärtner, S. L. Becker, S. Schneitler, and S. Smola, "Pooling of samples for testing for SARS-CoV-2 in asymptomatic people," The Lancet Infectious Diseases, 2020.
[15] B. Abdalhamid, C. R. Bilder, E. L. McCutchen, S. H. Hinrichs, S. A. Koepsell, and P. C. Iwen, "Assessment of specimen pooling to conserve SARS-CoV-2 testing resources," American Journal of Clinical Pathology, vol. 153, no. 6, pp. 715-718, 2020.
[16] I. Yelin, N. Aharony, E. Shaer-Tamar, A. Argoetti, E. Messer, D. Berenbaum, E. Shafran, A. Kuzli, N. Gandali, T. Hashimshony et al., "Evaluation of COVID-19 RT-qPCR test in multi-sample pools," medRxiv, 2020.
[17] C. Gollier and O. Gossner, "Group testing against COVID-19," Covid Economics, vol. 2, 2020.
[18] I. S. Reed and G. Solomon, "Polynomial codes over certain finite fields," Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 2, pp. 300-304, 1960.
[19] C.-S. Chang, W. Liao, and C.-M. Lien, "On the multichannel rendezvous problem: Fundamental limits, optimal hopping sequences, and bounded time-to-rendezvous," Mathematics of Operations Research, vol. 40, no. 1, pp. 1-23, 2015.
[20] C.-S. Chang, D.-S. Lee, and C. Wang, "Asynchronous grant-free uplink transmissions in multichannel wireless networks with heterogeneous QoS guarantees," IEEE/ACM Transactions on Networking, vol. 27, no. 4, pp. 1584-1597, 2019.
[21] C.-S. Chang, J.-P. Sheu, and Y.-J. Lin, "On the theoretical gap of channel hopping sequences with maximum rendezvous diversity in the multichannel rendezvous problem," arXiv preprint arXiv:1908.00198, 2019.
[22] J. Singer, "A theorem in finite projective geometry and some applications to number theory," Transactions of the American Mathematical Society, vol. 43, no. 3, pp. 377-385, 1938.
[23] M. Täufer, "Rapid, large-scale, and effective detection of COVID-19 via non-adaptive testing," bioRxiv, 2020.
[24] L. Euler, Recherches sur une nouvelle espece de quarres magiques. Zeeuwsch Genootschap, 1782.
[25] W. Kautz and R. Singleton, "Nonrandom binary superimposed codes," IEEE Transactions on Information Theory, vol. 10, no. 4, pp. 363-377, 1964.
[26] D. Sejdinovic and O. Johnson, "Note on noisy group testing: Asymptotic bounds and belief propagation reconstruction," in 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2010.