AC-DC: Amplification Curve Diagnostics for Covid-19 Group Testing
Ryan Gabrys*,1, Srilakshmi Pattabiraman*,1, Vishal Rana*,1, João Ribeiro*,2, Mahdi Cheraghchi†,3, Venkatesan Guruswami†,4, and Olgica Milenkovic†,1

1 University of Illinois, Urbana-Champaign; 2 Imperial College London; 3 University of Michigan, Ann Arbor; 4 Carnegie Mellon University
Abstract
The first part of the paper presents a review of the gold-standard testing protocol for Covid-19, real-time reverse transcriptase PCR, and its properties and associated measurement data, such as amplification curves, that can guide the development of appropriate and accurate adaptive group testing protocols. The second part of the paper examines various off-the-shelf group testing methods for Covid-19 and identifies their strengths and weaknesses for the application at hand. The third part of the paper contains a collection of new analytical results for adaptive semiquantitative group testing with probabilistic and combinatorial priors, including performance bounds, algorithmic solutions, and noisy testing protocols. The probabilistic setting is of special importance, as it is designed to be simple to implement by nonexperts and to handle heavy hitters. The worst-case paradigm extends and improves upon prior work on semiquantitative group testing with and without specialized PCR noise models.
I. INTRODUCTION
In less than ten months since the first case reported in the Hubei province of China, Covid-19 has rapidly spread across all continents except Antarctica [1]. The disease has caused more deaths than Ebola, SARS, and the seasonal flu combined (reaching 1,000,000 mortalities in September 2020), disrupted the global economy to an extent not seen since the Great Depression, and altered the lives of hundreds of millions of people across the globe [2].

Many analyses associated with the Covid-19 pandemic have established that widespread population testing is key to effectively containing outbreaks of this and other infectious diseases. In May 2020, the United States was able to test only a limited number of people per day (according to the Covid Tracking Project, this number had increased by August), while countries that have managed to keep the outbreak under control, such as Germany and South Korea, performed millions of tests during the same stage of the spread of the disease. Although there is no general consensus on the exact number of individuals that need to be tested, most experts agree that the reported numbers are highly inadequate and should be at least an order of magnitude higher before the economy can be safely reopened to the pre-pandemic extent [3]. Some universities, such as Yale University and the University of Illinois, currently have a biweekly test schedule in place for all individuals accessing school property [4]. This is believed to be a sufficiently large-scale testing protocol that allows the institutions to operate safely.

To address the need for sustainable high-frequency population testing, a number of countries and states have proposed and implemented group testing schemes in which genetic samples from different individuals are pooled together in a manner that incorporates thresholded real-time reverse transcriptase polymerase chain reaction (RT-PCR) fluorescence signals into the testing scheme. Group testing (GT) is a combinatorial screening method introduced by Dorfman [6] for identifying small groups of soldiers infected by syphilis. His scheme, known as single-pooling, consists of mixing blood samples from five soldiers at a time and running one test for each pool.
For positive test outcomes, the soldiers involved in the test are examined individually in a second round to determine who has the disease. For negative outcomes, all subjects involved are declared healthy and removed from future testing schedules. Given a relatively small number of infected individuals in a population, this scheme provides a significant reduction in the number of tests required when compared to individual testing [7]. The scheme proved ineffective for its original task, as blood sample pooling dilutes the resulting sample to a point below the sensitivity of the tests used.

A number of recent reports suggest using Dorfman's or other mostly off-the-shelf GT schemes for Covid-19 testing [8]–[17]. Most of the proposed schemes do not incorporate relevant biological priors or exploit the highly specific measurement and noise properties of the RT-PCR method in their testing schemes. We argue this is a significant detriment, as in order to properly execute the effort and avoid dangerous failures, testing schemes should be guided both by mathematical considerations and by social, clinical, and genomic side information. This suggests designing Covid-19 group testing schemes that carefully address the following issues:

All junior authors (*) are listed in alphabetical order, and all senior authors (†) are listed in alphabetical order. M. Cheraghchi is supported in part by an NSF grant CCF-2006455. A recent ordinance issued by the governor of Nebraska [5] recommends using group testing for widespread screening for Covid-19, while group testing methods are employed in part in Israel.

1) Selection of adequate primers.
As stated in the CDC SARS-CoV-2 testing guidelines [20], only two primers are recommended for use in the USA for RT-PCR reactions, selected from the N open reading frame (ORF) of the SARS-CoV-2 genome. It is often hard to predict which regions will have small mutations, and it is currently not known how fast the N regions and other primer regions chosen by various countries mutate and how these mutations affect the PCR protocol. To determine the influence of mutations, one first needs to determine which regions will remain mostly unaffected by mutations, as well as the melting temperatures of the primers [21] and their binding affinity to the mutated reference regions. For this purpose, the recent work [22] may be used to guide the primer selection process, while actually recorded mutated N primer regions may be used to estimate the failure rates of the individual PCR tests or to model the errors in group PCR protocols due to mutations. These issues will be examined in Sections II and V.

2)
Selection of (near-)optimal sample mixing strategies with priors.
If not properly designed, GT schemes may lead to errors that are even more detrimental to the population than no tests at all. Since some individuals, such as health workers, may harbor multiple strains of the virus, and since clinical priors are often readily available (e.g., symptom charts, chest X-ray images), the sampling and mixing approaches should be carefully designed to include the right number and combinations of subjects in order to minimize test errors. This is a complicated issue that will be examined elsewhere.

3)
Use of quantitative test outcomes.
RT-PCR experiments provide significantly more information about the viral load or the number of infected individuals within the group than just a simple binary answer, "no infected samples" or "at least one infected sample". Except for a handful of works proposing to use quantitative RT-PCR through Compressed Sensing (CS) [23]–[25], most reported Covid-19 GT schemes assume binary test outcomes (among them the scheme used in Israel and described in [26]). Furthermore, practically tested schemes operate in a nonadaptive setting, which is suboptimal and not justified for large-scale testing strategies that use a limited number of PCR machines. Another important issue is that heavy hitters (individuals with very high viral loads) can "mask" individuals with smaller loads, which makes the use of quantitative information difficult [27], [28]. This masking phenomenon, as well as saturation effects present in RT-PCR outputs, can make CS approaches highly susceptible to errors. The focus of this work is to develop schemes that address these issues in a simple yet efficient manner. Consequently, our main results pertain to scalable, adaptive, and semiquantitative testing methods that can efficiently handle heavy hitters and errors that are specific to RT-PCR systems.

4)
Incorporation of social/contact network information.
Due to the highly heterogeneous response of individuals to the infection and the diverse infection rates across different geographic and communal regions, the best testing practices have to be guided by infection risk assessments and scores. Such "network-guided" schemes are not known in the GT and CS literature, except for a recent work that partitions individuals into unconnected communities and aims to identify all infected members of all communities simultaneously [29]. One can argue that in the early stages of an outbreak it is more relevant to identify the highly infected communities and their neighboring communities than all individuals in each infected community. In this context, assuming that there exist only "infected" and "uninfected" communities, as postulated in [29], appears unrealistic. "Heavily infected" community detection, as well as heavy-hitter problems arising in Covid-19 testing, are discussed in Section IV.

The disease affects people from different age groups, ethnicities, and regions in a highly disparate manner [18]; it has also been reported that mortality rates across different populations can deviate by as much as two orders of magnitude [19]; furthermore, recent studies suggest that women exhibit significantly milder symptoms than men, as they have more responsive immune systems. The World Health Organization (WHO) has also repeatedly issued testing guidelines suggesting that "suspect influenza should be tested with consideration for geographical, gender and age representativeness" in order to contain the spread of the disease in real time.

We argue that the availability of (semi)quantitative test outcomes and the use of adaptive strategies can greatly increase the efficiency and scalability of Covid-19 testing schemes.
In that direction, we generalize the Semiquantitative Group Testing (SQGT) strategy [30]–[32] to an adaptive setting and devise simple probabilistic adaptive GT methods and worst-case adaptive GT schemes that take the specific measurement data noise and quantization properties into account. The SQGT scheme assumes that one cannot tell the exact viral load or number of infected individuals in a pool, but only an interval in which the load or number of defectives lies. The setting is a generalization of GT that includes more than two outcomes, or a quantized version of the adder channel/CS approach [35], [36]. It also represents a generalization of the setting of [37], in which only saturation effects are taken into account within the adder model. It is worth pointing out that ours is also the first approach that uses actual RT-PCR features and postulates rigorous models that allow for relating viral loads to actual fluorescence values and for analyzing the testing schemes rigorously. Other methods that will be reviewed in later sections either completely ignore the RT-PCR measurements or do not properly justify or analyze their proposed models. For an illustration of the differences between various testing schemes (group testing, additive tests, additive tests with saturation, and general SQGT), the reader is referred to Figure 1.

Figure 1: Figure (a) illustrates the classical GT framework. Here, 0 corresponds to a test outcome indicative of no infected individual being present, while 1 corresponds to an outcome indicative of at least one infected individual. Figure (b) represents an additive test output model, in which the underlying assumption is that one can tell the exact number of infected individuals in a test. An instance of the general SQGT is illustrated in Figure (c). In this case, the test outcome ⌈ℓ/τ⌉ = i for i > 0 indicates that (i−1)τ < ℓ ≤ iτ defectives are present. When the number of defectives detected exceeds (m−1)τ, the test reports m. When τ = 1, (c) represents an adder model with saturation.

For the worst-case setting, in which we assume a known number d of defectives but make no assumptions about how they are distributed, and where one can tell whether zero defectives are present or the number of defectives is nonzero and lies in one out of m consecutive intervals of length τ, the number of tests per defective roughly equals (log(n/d) + log log(m+1)) / log(m+1). The savings in the number of tests compared to the classical GT setting, provided by the increased resolution of the levels, is log(m+1)-fold, which even for eight levels amounts to a three-fold saving. Clearly, one has to be able to properly calibrate the RT-PCR readouts and determine adequate thresholds in order to take full advantage of the scheme. This issue will be addressed in Section II.

For the probabilistic setting, where each item is assumed to be independently defective with some probability, our main results include simple-to-implement algorithms for adaptive testing that involve two thresholds and two test stages, and that are also capable of handling heavy hitters (i.e., individuals with high viral loads that may mask other individuals' presence), provided these individuals are not too common.
The qualifier "probabilistic" may refer either to the individuals being ill according to some probability distribution (usually i.i.d. Bernoulli(p), generalized binomial [33], or Poisson [34], where the number of infected individuals has a right-truncated Poisson distribution with parameter λ(n) = o(n); Dorfman's scheme falls into the first category), or to the test matrices having entries dictated by a probability distribution (again, usually i.i.d. Bernoulli(q)). Some papers refer to the former setting as "group testing with probabilistic priors" and to the latter as "group testing with probabilistic tests." Which paradigm we refer to will be clear from the context.

The remainder of the paper is organized as follows. Section II provides an overview of the PCR, RT-PCR, and real-time (quantitative) PCR techniques. The section also addresses key issues that impede the amplification efficiency of the methods currently used for Covid-19 testing and introduces several practical noise models. Section III describes various GT approaches and assesses their utility for Covid-19 testing. Sections IV and V describe the main original results. Section IV describes a probabilistic version of the SQGT model, simplified to account for two rounds of testing and two test thresholds only. This section also introduces test schemes that aim to identify highly virulent individuals, termed heavy hitters. Section V introduces a new worst-case adaptive SQGT technique that is near-optimal and describes a noise model termed the birth-death chain model. Section VI provides preliminary directions towards designing efficient community-aware testing strategies. Section VII reports results obtained using the GISAID database [38] that explain the influence of mutations in the primer regions on the efficiency of the tests, and therefore suggests that noise models previously considered in the literature, which only account for errors in the PCR test, are inadequate.
Section VIII concludes the paper and discusses future work.

II. BACKGROUND
We start our exposition by describing the real-time reverse transcriptase (RT-PCR) testing mechanism. DNA has a double-helix structure, and both strands in the helix are composed of periodic sugar and phosphate groups to which one of four different bases is attached, namely A, T, C, and G. A sugar, phosphate, and base are jointly referred to as a nucleotide. As the sugar is asymmetric in terms of the placement of its carbon atoms with respect to the position of binding to the phosphates, the two strands of the DNA have two different directions: one runs from the 3' to the 5' carbon end, while the other runs from the 5' to the 3' carbon end. The two strands are held together through the stacking of bases and the hydrogen bonds that exist between them. The pairing rule is dictated by Watson-Crick (WC) complementarity, asserting that (with overwhelming probability) only the bases A and T, and G and C, bind to each other, respectively.
A. Reverse Transcriptase PCR
The Reverse-Transcriptase PCR (RT-PCR) technique is used to identify/amplify RNA strands. Since RNA is single-stranded and hence an unstable molecule, RT-PCR first converts the target RNA into its complementary double-stranded DNA (cDNA, as illustrated in Figure 2) and then performs amplification using the standard PCR technique. Note that RNA has three of the same building blocks as DNA, namely A, C, and G, but instead of T (encountered in DNA), RNA contains U (Uracil).

Conversion of RNA into cDNA is accomplished through the use of the reverse transcriptase (RT) enzyme, which stitches "free" nucleotides A, T, C, and G together in the presence of primers that are complementary to a specific part of the target RNA sample (see Figure 2).

B. The Polymerase Chain Reaction (PCR)
The Polymerase Chain Reaction (PCR) is used to amplify specific segments of DNA strands in order to enable a downstream analysis of the segments or to detect the presence of specific DNA content.

Figure 2: Reverse transcription for converting viral RNA into cDNA. (The reverse transcriptase image is from Wikipedia.)

The operating principles of the PCR process are illustrated in Figures 3 and 4. A thermal cycler uses the target DNA, specific primers (short DNA segments that initiate the replication process by allowing the polymerase to bind to the DNA), the Taq polymerase (which actually performs DNA replication after the primers get attached), and free A (Adenine), T (Thymine), C (Cytosine), and G (Guanine) nucleotides needed to amplify the segment of interest through repeated cycles that involve the following steps:
DNA denaturation, annealing, and extension.
1) DNA Denaturation: The DNA sample to be amplified or detected is first heated to roughly 95°C. At this temperature, hydrogen bonds between the bases across the two strands break, producing two complementary single-stranded DNA fragments.
2) Annealing (Hybridization): The sample is subsequently cooled, typically to 50–65°C. This allows the primers to bind to their WC-complementary segments on the two single-stranded DNA targets. The primer that binds to the forward strand is referred to as the forward primer, while the one that binds to the reverse strand is referred to as the reverse primer.
3) Extension: The sample is heated to roughly 72°C to enable the Taq polymerase to extend the primers and form two complete copies of the original double-stranded DNA molecule.

Figure 3: Polymerase Chain Reaction. In any given cycle, the DNA strands in the sample are first denatured into single strands. The two single strands are then extended to form complete DNA double helixes. The primers (short DNA fragments) attach to the single-stranded DNA so that extension proceeds along the template in the 3'-to-5' direction (i.e., the new strand grows 5'-to-3'). At the end of each ideally executed cycle, the number of DNA strands in the sample doubles.

Under ideal conditions, at the end of the Extension step of a cycle the amount of target DNA doubles; this setting is illustrated in Figure 3. However, due to several factors [21], including the efficiency of denaturation, primer annealing affinity, polymerase binding strength, and others, the DNA content may not double during each cycle. For example, denaturation requires heating the sample to a high temperature, which by itself may cause oxidative and other damage to the DNA being amplified. The efficiency of denaturation is measured in terms of the concentration of viable single-stranded DNA present after heating.

During the primer annealing stage, single-stranded DNA strands previously denatured can anneal back, thereby prohibiting access to the primer segments. The primer annealing efficiency is captured by the proportion of single-stranded DNA with bound primers.

When the polymerase binds to the DNA-primer complex, it forms a potentially unstable tertiary complex from which the polymerase can dissociate in a stochastic manner. The polymerase binding efficiency is captured by the fraction of tertiary products in the assay. Tertiary complexes formed during the early stage of a cycle are more likely to result in complete double-stranded DNA than those formed at a later stage of the cycle, due to cycle timing issues. This effect is captured through what is known as the extension efficiency of PCR.

These effects jointly reduce the average efficiency of DNA amplification, which goes down from the expected doubling factor to some value < 2, usually written as (1 + η), where η is referred to as the cycle efficiency. The doubling of the target material at every cycle corresponds to η = 1. At the end of i cycles, a sample with a concentration of x DNA strands is amplified to a sample with concentration x(1 + η)^i. More precisely, the cycle efficiency depends on the cycle number. Consequently, a more accurate amplification model uses a factor η_j for cycle j, so that the amplified concentration after the i-th cycle reads x · ∏_{j=1}^{i} (1 + η_j). It is also known that η_j decreases with j, which may be attributed to the fact that the primers used for amplification become more and more integrated into the DNA products and that the efficiency of the polymerase decreases. At the same time, for a small number of cycles (small i), the DNA products are hard to detect.
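The multiplicative amplification model above, x · ∏_j (1 + η_j) with a decaying cycle efficiency η_j, can be sketched numerically. The decay schedule, initial copy number, and detection level below are illustrative assumptions, not fitted values:

```python
def amplified_concentration(x0, etas):
    """Concentration after len(etas) cycles: x0 * prod_j (1 + eta_j)."""
    x = x0
    for eta in etas:
        x *= 1.0 + eta
    return x

# Hypothetical decaying efficiency schedule: eta_j shrinks with the cycle
# number j, as primers get consumed and the polymerase loses efficiency.
etas = [0.95 * (0.99 ** j) for j in range(40)]

x0 = 10.0                  # illustrative initial number of target strands
detection_threshold = 1e9  # illustrative fluorescence-detectable level

# Concentration trajectory over 40 cycles and the first detectable cycle.
trajectory = [amplified_concentration(x0, etas[:i]) for i in range(41)]
ct = next(i for i, x in enumerate(trajectory) if x >= detection_threshold)
print("product first detectable at cycle", ct)
```

Consistent with the remark that products are hard to detect for small i, the trajectory stays far below the detection level for many initial cycles before the exponential growth crosses it.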
As a result, it is common practice to run 30–40 cycles of PCR, depending on the expected original concentration of the double-stranded DNA to be amplified.

Note that the polymerase can also be active at temperatures below the extension temperature, thereby initiating the extension process. However, the polymerase is nonspecific at lower temperatures and leads to amplification of nonspecific DNA strands. A high concentration of the stronger and more stable GC bonds in the DNA strands hinders effective denaturation at the denaturation temperature. Regions with a high GC bond concentration also form secondary products that prevent primer binding [39]. These phenomena jointly contribute to "noise" in the PCR amplification process that is not associated with the cycle efficiency. Additional sources of noise, such as CCD thermal noise and shot noise, can lead to a further decrease in the reliability of data points at low signal levels [40].

Primers may also fail to attach to the DNA if the corresponding DNA primer regions contain mutations (indels or point mutations). Since the error is caused by the actual DNA sample strand, and not by the PCR process, this phenomenon should not be considered part of the PCR noise model. The results of a simulation that studies the effect of mutations along the primer region on PCR amplification are described in Section VII, using a collection of real genomic datasets retrieved from the GISAID database [41].

Figure 4: Under ideal conditions, every cycle of the PCR process should double the DNA content. Due to various factors described in the main text, not every cycle may result in twice as many strands, and an averaged efficiency factor η < 1 is used to describe the growth rate of the PCR product.

C. Quantitative (real-time) RT-PCR
Quantitative real-time PCR (qPCR) is a technique used for the precise analysis of viral and bacterial samples. As implied by its name, qPCR allows the amplification process to be monitored in real time. This is achieved by introducing fluorescent labels into the DNA products and recording the change in fluorescence with an increasing number of cycles (which also allows for estimating the number of cycles needed to detect an appropriate product). The result of a qPCR experiment is usually given in terms of an amplification curve (an example of such a graph is shown in Figure 7, where real measurements are approximated by low-degree piecewise polynomial fragments). The amplification curve plots the normalized (relative) fluorescence ΔRn against the cycle number. The fluorescence increases with the increase in the target genetic material at every cycle, until the fluorescence saturates. The cycle number at which the fluorescence crosses the detection threshold (which can be defined in several ways) is referred to as the cycle threshold, denoted by C_t. Note that C_t is inversely related to the concentration of the target material in the sample: a low C_t value indicates a higher concentration of the sample we wish to detect, while a high C_t value indicates a low concentration of the same, or spurious amplification results. The slopes of the curves most often show very small variations with the concentration of the subject but may potentially be used as further indicators of the sample load. Real-time qPCR is usually performed using one of the following two approaches:

• Dye-based qPCR.
The dye-based method uses dyes that only fluoresce when bound to double-stranded DNA. Thus, at the end of each extension stage, the fluorescence increases (see Figure 5). The chemistry of the dyes used helps in distinguishing desired and undesired products. However, the dye-based method is often nonspecific, thereby inaccurately quantifying genetic material that is not of interest. As a result, this approach requires highly selective primers and other additional controls to provide accurate amplification curve results.

• Probe-based qPCR.
In this technique, primers specific to the target DNA include two molecules, a fluorescent reporter dye and a quencher, on their two ends. When the quencher is in close proximity to the fluorescent dye, the former molecule inhibits (quenches) the fluorescence of the latter. This is usually the case when the primer is not bound to the target (see Figure 6). However, when the primer is hybridized to the target and the polymerase extends the primer segment, the quencher and reporter separate, and the dye is cleaved and displaced. In its free form, it fluoresces, which leads to detectable signals.
Figure 5:
Dye-based qPCR: The dye attaches to the double-stranded DNA formed at the end of the extension stage and fluoresces. Thus, the measured fluorescence increases with the number of cycles.
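Under the idealized constant-efficiency growth model x0(1+η)^i, the cycle threshold C_t pins down the starting concentration relative to the fixed detection level, so two samples can be compared through their C_t difference alone. A minimal sketch, where η and the C_t values are illustrative assumptions:

```python
def relative_concentration(ct_a, ct_b, eta=0.9):
    """Fold-difference x_a / x_b in starting material implied by two cycle
    thresholds, assuming a constant per-cycle efficiency eta (an
    idealization: x0 * (1 + eta)**Ct equals a fixed detection level F,
    so x_a / x_b = (1 + eta)**(ct_b - ct_a))."""
    return (1.0 + eta) ** (ct_b - ct_a)

# With ideal doubling (eta = 1), a sample crossing the detection level
# about 3.3 cycles earlier has roughly 10x the starting material.
fold = relative_concentration(ct_a=20.0, ct_b=23.3, eta=1.0)
print(f"sample A has ~{fold:.1f}x the starting concentration of sample B")
```

This is the sense in which a low C_t indicates a high target concentration and a high C_t a low one, as discussed above; real amplification curves deviate from this model because η varies with the cycle number.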
D. Amplification Curves and the Viral Load
From Figures 7 and 8, and as already discussed in the previous section, it is clear that one can estimate upper/lower bounds on the viral load of an individual by observing the C_t value and the slope and saturation point of the amplification curve. It is important to point out that the viral load of individuals may vary by up to five orders of magnitude, as shown in the recent study [42]. Viral loads in infected individuals tend to follow the "typical" inverted-V dynamics shown in Figure 9. There, it can be seen that an individual tested a few days after the infection may have a viral load that is large enough to mask any other individual tested by the same test under the GT framework. This is a sensitive issue for SQGT schemes, as the C_t curves may have multiple interpretations: as an example, the same C_t value may correspond to several individuals tested a few days after infection or to a single individual tested later in the course of the disease. There are multiple possible ways to mitigate this problem. First, given that high viral loads very often positively correlate with observable disease symptoms [43], asking individuals about symptoms before scheduling the tests (as is, for example, done at UIUC [44]) allows one to determine whether the individual should be group-tested or not. Another approach is to perform adaptive testing, in which samples with large viral loads are subjected to additional screenings, as is done in one of our proposed methods. Specialized testing strategies for pooled measurements with high viral loads can also be devised using heavy-hitter detection methods [45].

As an abstraction, and only for our worst-case analysis, we assume that each individual is represented by a viral load equal to the expected value over the tested population. In this case, the test outcome can be translated into an interval in which the number of infected individuals lies. Hence, the assumption is that one can convert C_t values into a rough estimate of the number of infected people in the test. For probabilistic testing, we do not have to rely on such assumptions, as the testing scheme itself can easily be adapted to handle heavy hitters.

Figure 6: Probe-based qPCR: When DNA is denatured, a primer specific to the target DNA is attached to a single strand. The primers are then extended by the polymerase. During the extension, the probes are cleaved, and the reporter dye is no longer in the proximity of the quencher molecule, which enables it to fluoresce.

Figure 7: A typical amplification graph, plotting the relative fluorescence versus the number of PCR cycles for various input concentrations of the DNA sample. The dots represent actual fluorescence levels, while the curves represent a low-degree polynomial approximation of the measurements. Since the solid curves are approximations, the fluorescence level for a small number of cycles can be negative, which is clearly not physically possible. Simpler, yet less precise, piecewise linear and quadratic curves will be described when discussing error models for real-time PCR. Also, note that the fluorescence saturates after a sufficiently large number of cycles, which shows that models using only the final-cycle fluorescence cannot distinguish viral loads. Another observation is that, due to the stochastic nature of RT-PCR, it usually takes a number of initial cycles to obtain visible fluorescence, independent of the viral load. Both of these features demonstrate the highly nonlinear relationship between the viral load and the fluorescence.

Figure 8: Amplification curves and quantization regions for the C_t values. Given a number of amplification curves used for calibration in a specific lab, the quantization regions in this example are chosen based on the intersections of the fluorescence detection level and the calibration amplification curves. A C_t value for a particular experiment is placed in the quantization region bounded by the two "closest" calibration amplification curves and their underlying C_t values, i.e., into the corresponding quantization bin. In this particular example, except for the quantization regions corresponding to the early and late cycles, the quantization regions are of nearly uniform length. Note that the larger the C_t value, the lower the viral load. Also, if one were to use only the fluorescence levels observed at the final RT-PCR cycle, different viral loads could not be distinguished.

III. BASICS OF GT

In what follows, we provide concise overviews of all relevant GT schemes used or proposed for potential use for Covid-19 testing: (1) classical nonadaptive and adaptive GT; (2) nonadaptive SQGT; (3) threshold GT; (4) compressive sensing (CS); (5) graph-constrained GT. For all these methods, we describe their potential advantages and drawbacks and then proceed to introduce a new method, which we refer to as adaptive SQGT. Adaptive SQGT with a "curve fitting"-based noise model appears to provide the theoretical state-of-the-art GT results for qPCR test models and is the focus of our subsequent discussions.
A. Nonadaptive and Adaptive GT
The assumptions are as follows: in a group of n individuals, there are d infected people. When a subset of people is tested, the result is positive (e.g., equal to 1) if at least one person in the tested group is infected; otherwise, the test result is negative (e.g., equal to 0). Such a testing scheme is referred to as binary, as the outcomes take one of two values (see Figure 1 (a)). GT aims to find the set of all infected people with the fewest number of binary tests possible and may use nonadaptive and adaptive tests. In the former case, all tests are performed simultaneously, and the outcome of one test cannot be used to inform the selection of the individuals for other tests. In the adaptive setting, one can use multiple stages of testing and different combinations of individuals to best inform the sequentially made test choices.

(According to this study, among the set of infected patients, those who exhibited “severe” symptoms had significantly lower ∆C_t = C_t(sample) − C_t(reference) values than those who exhibited “mild” symptoms.)

Figure 9: A typical viral load dynamics in an infected individual versus the time since infection. The viral load sharply spikes within the first three days of infection and then more gradually decreases. The nonlinear part of the viral load curve can be approximated by a linear component symmetric with respect to the initial linear component. This linear approximation will be used to determine the probability of heavy hitters, i.e., individuals who have an absolute viral load above a certain threshold.

When d ≪ n, it is well known that Ω(d · log(n/d)) tests are required to find all infected individuals. Furthermore, it was shown in [46] that for NAGT, at least Ω(d² · log(n)/log(d)) tests are required. For the same parameter regime, there exist explicit nonadaptive schemes that require O(d · log(n/d)) tests to find the infected group [47]. A four-stage adaptive scheme that uses an optimal number of tests meeting the lower bound was recently described in [48]. Of special interest is the classical binary search result of [49], which established an elegant adaptive scheme that differs from the information-theoretic limit only by an additive O(d) term.

Despite the many proposed applications of this model to Covid-19 testing, it is obvious from the previous discussion that the GT measurement outcomes do not fully use the actually available qPCR results. One could argue that the fluorescence exceeding the detection threshold may correspond to the test outcome 1, but clearly, significantly more information is available, as the detection threshold depends on the concentration of the viral cDNA and hence on the number of infected individuals. This motivates using a more quantitative GT approach, already introduced under the name of SQGT.

B. Nonadaptive SQGT
In SQGT, one is given a collection of thresholds τ_1 < τ_2 < ··· < τ_r, and the outcome of each test is an interval (τ_i, τ_{i+1}], where 0 ≤ i ≤ r − 1 (with τ_0 = 0). The outcome of an experiment cannot specify the actual number of infected individuals, but rather provides a lower and an upper bound on that number, τ_i and τ_{i+1}, respectively. If τ_{i+1} = τ_i + 1 for all values of i and r = d, the scheme is referred to as additive GT, or the adder model [35], [36]. The two models are depicted in Figure 1 (b) and (c). The additive test model described in [36] requires O(n/log n) tests to determine all possible infected individuals, for 0 ≤ d ≤ n.

Another special SQGT case of interest assumes that the test results are additive up to some threshold τ and that, after that, they saturate [37] (see Figure 1). This model is of special interest for Covid-19 testing as it takes the warm-up/saturation information into account and, in addition, under a proper noise model, captures the fact that amplification curves have different C_t values determined by the concentration of the viral load (and hence the approximate number of infected individuals). Furthermore, one can argue that the RT-PCR fluorescence intensity information is inherently semiquantitative [30], as the fluorescence levels and C_t values can be placed into bounded bins determined by the number of cycles. This observation is explained in more detail in the next section, along with new theoretical results pertaining to adaptive SQGT schemes with appropriate noise models. Figure 10:
The birth-death noise model. Here, the assumption is that a C_t value can be corrupted by noise only insofar as it can be mislabeled as belonging to an interval adjacent to the correct interval (except for the values falling into the first and last quantization region or bin).

C. Threshold GT
An extension of the GT problem was introduced by Damaschke in [50]: in this setting, if the number of defectives in any pool is at most the lower threshold ℓ > 0, then the test outcome is negative. If the number of defectives in the pool is at least the upper threshold u > ℓ, then the test outcome is positive. However, if the number of defectives in the pool lies strictly between ℓ and u, the test outcome is arbitrary (i.e., either 0 or 1). Thus, the algorithms for Threshold GT are designed to handle worst-case adversarial model errors. Note that when ℓ = 0 and u = 1, Threshold GT reduces to classical GT. It is known that for nonadaptive threshold GT, O(d^{g+2} · (log d) log(n/d)) tests (where g = u − ℓ − 1) suffice to identify the d infected individuals [51].

The Threshold GT model is partly suitable for modeling the qPCR process, as the lower threshold ℓ can obviously assume the role of the fluorescence-based detection threshold; unfortunately, due to the saturation phenomena, a specialized choice for the upper threshold u does not allow one to accurately assess the number of infected individuals in the pool. The “in-between” threshold results also make the simplistic assumption that, despite the observed fluorescence value being closer to the upper threshold, one can still call the outcome negative (and similarly for small fluorescence levels and the lower threshold).

D. Compressive Sensing
In compressive sensing, the defectives are represented by nonnegative real-valued entries. Thus, quantitative GT represents a special instance of compressive sensing. Compressive sensing assumes that one is given an unknown vector x ∈ R^n in which only d ≪ n entries are nonzero. The vector x is observed through linear measurements formed using a measurement matrix M ∈ R^{m×n}, leading to an observed vector y = Mx + n, where n is the measurement noise (usually taken to be Gaussian with variance σ²). For noiseless support recovery, m = O(d · log(n/d)) measurements are sufficient. For exact support recovery in the noisy setting, a signal-to-noise ratio of Ω(log n) is required for the same number of measurements as needed in the noiseless setting [52]. Compressive sensing reconstruction is possible through linear programming methods or low-complexity greedy approaches [53]–[55].

The recently proposed Tapestry method [56] combines group testing with compressive sensing and uses combinatorial designs (i.e., Kirkman systems) to construct the measurement matrix. However, there are several factors that do not seem to be accounted for in this approach. First, Tapestry assumes a CS framework, which is additive and applies to viral loads. But the PCR measurements report fluorescence levels, and these are nonlinear functions of the viral load. Furthermore, it does not account for the stochasticity of the PCR measurements and the fact that different lab protocols may lead to different C_t values when presented with the same sample mixture. Tapestry also does not take into account the fact that the number of RT-PCR machines/staff members is limited, which inherently suggests using adaptive testing strategies.
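Sparse recovery of a viral-load vector can be illustrated with one of the low-complexity greedy approaches mentioned above. The sketch below implements orthogonal matching pursuit (OMP) with a random Gaussian measurement matrix; all parameter values are hypothetical, and this is not the Tapestry construction, which uses Kirkman systems for the measurement matrix instead.

```python
import numpy as np

def omp(M, y, d):
    """Orthogonal matching pursuit: greedily estimate a d-sparse x with y ~ M x."""
    residual, support, coef = y.copy(), [], np.zeros(0)
    for _ in range(d):
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(M.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares refit on the chosen support, then update the residual
        coef, *_ = np.linalg.lstsq(M[:, support], y, rcond=None)
        residual = y - M[:, support] @ coef
    x_hat = np.zeros(M.shape[1])
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(0)
n, m, d = 40, 25, 2                   # population size, tests, infected (hypothetical)
M = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[[3, 17]] = [5.0, 2.5]               # viral loads of the two infected samples
x_hat = omp(M, M @ x, d)
print(np.flatnonzero(x_hat))
```

In the noiseless, well-conditioned regime, the recovered support typically coincides with the true set of infected samples, and the nonzero entries approximate the corresponding viral loads.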
Finally, the CS methods [56] rely on Gaussian assumptions regarding measurement errors due to cycle inefficiency that are often hard to verify (cycle efficiency decays with the number of cycles and with the number of potential mutations in the primer regions; see Section II). They also do not take into account that PCR tests are performed on samples typically organized in wells, each of which can be used for one (group) test. (For a related question in the context of group testing microarrays and quantized compressive sensing, the interested reader is referred to [57]–[59].) Like many other quantitative methods, Tapestry appears vulnerable to heavy hitters, which are not accounted for in the scheme.

Nevertheless, there seem to be multiple advantages of CS methods for Covid-19 testing: one should be able, in principle, to recover not only the infected individuals but their viral loads as well. In particular, integer and nonnegative CS testing, along with quantized CS approaches, can impose model restrictions on such testing schemes [58], [60] to render them more suitable for the problem at hand.

The only other work that explores quantitative testing for Covid-19 is reported in [61]. There, the same problem setup as in [56] is used to postulate an additive viral load model which does not refer further to the C_t values. The new contribution is a proposal for a two-stage testing scheme that bears a small resemblance to our methods insofar as we also propose two-stage adaptive pooling schemes. However, the techniques used are different, since [61] employs a combination of maximum-likelihood and maximum-a-posteriori estimators to determine the infected individuals in the second stage, while we employ zero-error GT and SQGT techniques to find all infected individuals. Additionally, while [61] reports the number of tests and the conditional false-positive and false-negative rates for the simulation experiments, we supplement our new schemes with theoretical analysis and performance guarantees.

E. Graph-Constrained GT
Let G = (V, E) be a graph with vertex set V, |V| = n, and edge set E, representing a connected network of n people out of whom d are infected. In graph-constrained GT, vertices participating in the same test are restricted to form a path in the graph [62]. This model is relevant as it can be adapted to require that only individuals that did not have contact with each other are tested together (one only has to apply the problem to the dual of the contact graph used in Covid-19 testing). This allows us to identify individuals that fell ill in an “independent” fashion rather than through contact with each other. If T(n) denotes the mixing time of the random walk on the graph, and c = Δ_max/Δ_min denotes the ratio between the maximum degree and the minimum degree of the graph, then no more than O(c² · d² · T²(n) log(n/d)) tests are required to find the set of infected vertices. For example, a complete graph (T(n) = O(1), c = 1) requires no more than O(d² · log(n/d)) tests, since it corresponds to the classical GT regime. Unfortunately, graph-constrained GT requires a significantly higher number of tests than classical GT methods, as the tests are inherently restricted. As a result, despite the fact that this scheme is a natural choice for problems such as network tomography, where these constraints need to be satisfied, it is not a proper choice for Covid-19 testing. Another “community-constrained” scheme (although without an underlying interaction graph) was recently proposed in [29] and is discussed in the next subsection.

F. Community-Aware GT
In order to incorporate the underlying community network information, the authors of [29] introduced a community-based testing paradigm. The goal of the scheme is to devise methods that use community structure to guide the schedule of group tests. More precisely, a community of n members is assumed to have d ≪ n infected individuals. The population is partitioned into F families. In the combinatorial infection model, it is assumed that d_f families have at least one infected individual and that all the members of the remaining families are infection-free. An infected family indexed by j is assumed to have d(j) infected members, so that d = Σ_{j=1}^{F} d(j). The testing scheme can be succinctly described as follows: a representative individual from each family is selected uniformly at random. The representative community members are tested using either an adaptive or a nonadaptive GT algorithm. Family members whose representatives tested positive are tested individually. Members from the remaining families are tested together using either an adaptive or a nonadaptive GT scheme. As proposed, the scheme has several drawbacks. First, in practice, it is advisable to quickly identify heavily infected (heavy-hitter) families and then quarantine members of such communities [63]. Also, testing each person in a community that may have only one infected individual seems overly cautious and prohibitively expensive. We address this issue in terms of what we call the heavy-hitting community detection model described in Section VII. Second, [29] ignores interactions between members of different communities, which are crucial for spreading the disease. Third, the PCR noise is modeled as a Z-channel, in which a positive sample can test negative but a negative sample cannot test positive.
As we will see in Section V, nonspecific hybridization due to either poor primer selection or mutations can actually render negative tests positive. On the other hand, qPCR is very sensitive: even viral fragments are enough to detect fluorescence. Hence, false-negative errors overwhelmingly arise from inadequate sample collection or bad sample collection timing (when the individual is just infected or almost recovered from the disease; see Figure 4). This fact is clearly supported by the study in [64], which established that a large fraction of individual samples have viral loads at least three times higher than the detection threshold, so that false negatives due to PCR are highly unlikely.

Before proceeding with the original contributions, we remark that all the above GT techniques and scheduling models have probabilistic counterparts in which each individual is assumed to be infected with the same probability p [6], or members of different communities are infected with different probabilities p_i, i = 1, . . . , F [33]. The latter setting is especially important when prior information about the individuals is known (for example, their risk groups, potential symptoms, etc.). For an excellent in-depth review of these and some other GT schemes, the interested reader is referred to [7].

IV. AC-DC: NEW AMPLIFICATION CURVE BASED ADAPTIVE SCHEMES – THE PROBABILISTIC SETTING
Next, we introduce two adaptive SQGT schemes, one suitable for probabilistic testing and another that is worst-case and nearly optimal from the information-theoretic perspective. In the former case, considered in this section, a simple two-stage testing scheme is designed and analyzed with the goal of enabling practical implementations of adaptive SQGT. The results are described for two thresholds only, but a generalization is straightforward. This scheme also allows for incorporating heavy hitters into the testing scheme, which is of great practical relevance. In the worst case, considered in the section to follow, the schemes extend the ideas behind Hwang's generalized splitting [49] in two directions that lead to algorithms using what we call parallel search and deep search, respectively. In both settings, the outcomes of the first round of testing inform the choice of the composition of the tests in the rounds to follow. The methods are collectively referred to as the AC-DC schemes, in reference to the use of the information provided by the amplification curve (AC) during the process of diagnostics of Covid-19 (DC). A relevant observation is that the worst-case adaptive schemes allow for using nonuniform amounts of genetic material from different individuals, which may be interpreted as using nonbinary test matrices.

A. Practical Adaptive AC-DC Schemes
We describe next a simple probabilistic two-stage AC-DC scheme that significantly improves upon the original single-pooling scheme of Dorfman and builds upon the SQGT framework. The underlying idea is to follow the same overall strategy as in the single-pooling scheme, but to exploit the SQ information obtained in the first stage to perform better-informed testing in the second stage (i.e., to dispense with individual testing of all individuals that feature in infected pools as part of the second stage).

Consider a scenario where we have access to semiquantitative tests that return one of three values: if no individual featured in the test is positive, the test returns 0. If between 1 and τ individuals are positive, for some threshold τ ≥ 1, the test returns 1. Finally, if more than τ individuals test positive, the test returns 2. This scheme can be interpreted as follows: suppose that C_t is the observed cycle threshold (defined in Section II-C) for a particular test. If C_t > c₂ for some large threshold c₂, we say that the outcome is 0, as the potential viral or viral-like contamination load is too small to claim the presence of an infected individual. If c₁ ≤ C_t ≤ c₂, we say that the output is 1 and, based on the average viral load, convert this into the maximum possible number of infected individuals, τ. If C_t < c₁, we say that the output is 2 and that more than τ individuals in the pool are affected.

For the new single-pooling AC-DC scheme, we assume that the population contains n individuals, each of which is independently positive with some probability p (which can be easily determined based on regional infection rate reports, such as those available for UIUC in September/October 2020 [44]), and proceed as follows:
1) Stage 1: Divide the n individuals into n/s disjoint pools S₁, . . . , S_{n/s}, each of size s.
2) Stage 2:
If a pool S_i tests 0, then immediately set the status of all individuals in S_i as “negative”. If a pool S_i tests 1, then apply a nearly-optimal zero-error nonadaptive group testing scheme to detect the at most τ infected individuals in S_i. (Such a testing scheme is simple to design: it suffices to sample a random binary matrix where all entries are i.i.d. according to some Bernoulli(q) distribution, 0 < q < 1. This is so since the resulting matrix will be a zero-error NAGT scheme with high probability, provided the number of rows is large enough.) If a pool S_i tests 2, then test all individuals in S_i separately.

Given the description above, we can compute the expected number of tests per individual of the testing scheme, T/n, as a function of the probability of infection p, the first-stage pool size s, and the threshold τ. Using the fact that the zero-error nonadaptive GT schemes we use in the second stage can be designed with m(s, τ) = c · τ² log(s/τ) tests, we conclude that

E[T/n] = 1/s + p₁ · (c · τ² log(s/τ))/s + p₂,  (1)

where p₁ = Pr[1 ≤ B(s, p) ≤ τ] and p₂ = Pr[B(s, p) ≥ τ + 1] denote the probabilities that a given pool tests 1 and 2, respectively. Here, B(s, p) stands for a binomial random variable with s trials and success probability p. A particular case of interest pertains to setting τ = 1 in (1). For a small probability of infection p, the optimal threshold τ is close to 1, which justifies this choice. In this case, we have p₁ = s · p · (1 − p)^{s−1} and p₂ = 1 − (1 − p)^s − s · p · (1 − p)^{s−1}.
Moreover, it is well known that, for any s, there exists a simple (and optimal) zero-error nonadaptive scheme for finding 1 defective among s items using m(s, 1) = ⌈log s⌉ tests (namely, set the i-th column of the test matrix to be the binary representation of i, i.e., use a Hamming code parity-check matrix for testing). Combining these observations with (1), we conclude that the expected number of tests per individual when τ = 1 equals

1/s + p · (1 − p)^{s−1} · ⌈log s⌉ + 1 − (1 − p)^s − s · p · (1 − p)^{s−1}.  (2)

On the other hand, the expected number of tests per individual for the basic single-pooling scheme [6] is

1/s + 1 − (1 − p)^s,  (3)

and the expected number of tests per individual for the double-pooling scheme [9] is

2/s + p + (1 − p) · (1 − (1 − p)^{s−1})².  (4)

We compare the optimal expected number of tests per individual (as a function of p) achieved by our semiquantitative single-pooling scheme with τ = 1 (given in (2)) and by the single- and double-pooling schemes (given in (3) and (4), respectively) in Figure 11. Semiquantitative single-pooling outperforms the other methods considered here, as shown in the figure. This is in particular true for small p, which, as we already pointed out, corresponds to the practically relevant parameter regime. Figure 11:
Comparison between the expected number of tests per individual required by Dorfman's single-pooling scheme [6], the Broder–Kumar double-pooling scheme [9], and our semiquantitative single-pooling scheme.
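The expressions (2)–(4) are straightforward to evaluate and minimize over the pool size s; the short sketch below does exactly that (the infection probability used is a hypothetical illustrative value, not one taken from the paper's figures).

```python
import math

def single_pooling(p, s):        # Dorfman, eq. (3)
    return 1 / s + 1 - (1 - p) ** s

def double_pooling(p, s):        # Broder-Kumar, eq. (4)
    return 2 / s + p + (1 - p) * (1 - (1 - p) ** (s - 1)) ** 2

def sq_single_pooling(p, s):     # semiquantitative single pooling with tau = 1, eq. (2)
    return (1 / s + p * (1 - p) ** (s - 1) * math.ceil(math.log2(s))
            + 1 - (1 - p) ** s - s * p * (1 - p) ** (s - 1))

def optimize(rate, p, s_max=500):
    """Minimize the expected number of tests per individual over the pool size s."""
    return min((rate(p, s), s) for s in range(2, s_max + 1))

p = 0.01  # hypothetical infection probability
for name, rate in [("single", single_pooling), ("double", double_pooling),
                   ("SQ single", sq_single_pooling)]:
    tests, s = optimize(rate, p)
    print(f"{name:10s} optimal s = {s:3d}, expected tests/individual = {tests:.4f}")
```

For small p, the semiquantitative variant yields the smallest expected number of tests per individual, consistent with the comparison in Figure 11.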
There are two directions in which to further improve our scheme:
‚ We can easily extend the simple ideas presented above to obtain a semiquantitative version of double-pooling and, more generally, of multi-pooling schemes. The algorithm for this setting is summarized below:
1) Stage 1: Repeat Stage 1 of the previous semiquantitative scheme twice in parallel. We say an individual tests (a, b) if its first pool tests a and its second pool tests b.
2) Stage 2: If an individual tests (0, b) or (a, 0), immediately mark it as negative. If an individual tests (1, 1) or (1, 2), then apply a zero-error nonadaptive GT scheme for τ defectives to its first pool. If an individual tests (2, 1) or (2, 2), test it individually.
‚ We may also improve the performance of our semiquantitative scheme by introducing more (sufficiently small) thresholds τ₁ < τ₂ < ··· < τ_ℓ and extending the original idea in a natural way: if a pool has between τ_{i−1} and τ_i infected individuals, then apply a nearly-optimal zero-error NAGT scheme that detects τ_i infected individuals to the pool in question.

B. Probabilistic SQGT with Variable Viral Load
It is also simple to analyze how the SQGT scheme from the previous section performs when infected individuals may have either low or high viral loads, i.e., it is straightforward to account for heavy hitters. To this end, we consider a simplified model where each individual is independently infected and presents a low viral load at the time of testing with probability p_l, or is infected and presents a high viral load at the time of testing with probability p_h. In particular, each individual is infected (regardless of her/his viral load) with total infection probability p = p_l + p_h < 1.

As already explained, individuals with high viral load are problematic because, based on the SQ output of RT-PCR, pools featuring one such individual may be mistaken for pools with several infected individuals with low-to-average viral load. This phenomenon naturally leads us to consider the following modified version of the testing method studied in Section IV-A: a test applied to a pool of individuals has outcome 0 if there are no infected individuals in the pool, outcome 1 if there exists exactly one infected individual with low viral load, and outcome 2 if either there exists more than one infected individual with low viral load, or at least one infected individual with high viral load. Therefore, as expected, individuals with high viral load obfuscate the test outcomes.

We consider now the SQGT scheme described in Section IV-A with τ = 1 under the heavy-hitter model. The probability that a pool of size s contains exactly one infected individual with low viral load and zero individuals with high viral load (leading to test outcome 1) is

s · p_l · (1 − p_l − p_h)^{s−1} = s · p_l · (1 − p)^{s−1},

while the probability that the pool contains either more than one infected individual with low viral load or at least one individual with high viral load is

1 − s · p_l · (1 − p_l − p_h)^{s−1} − (1 − p_l − p_h)^s = 1 − s · p_l · (1 − p)^{s−1} − (1 − p)^s.
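The two pool-level probabilities derived above are easy to sanity-check by simulation; the sketch below compares the closed-form expressions against a Monte Carlo estimate for hypothetical values of s, p_l, and p_h.

```python
import random

def outcome_probs(s, p_low, p_high):
    """Closed-form probabilities that a pool of size s tests 1 or 2 under the
    heavy-hitter model (1: exactly one low-load infected and no high-load
    individual; 2: any other configuration with at least one infected)."""
    p = p_low + p_high
    prob1 = s * p_low * (1 - p) ** (s - 1)
    prob2 = 1 - prob1 - (1 - p) ** s
    return prob1, prob2

def simulate(s, p_low, p_high, trials=200_000, seed=1):
    rng = random.Random(seed)
    c1 = c2 = 0
    for _ in range(trials):
        low = high = 0
        for _ in range(s):
            u = rng.random()
            if u < p_low:
                low += 1
            elif u < p_low + p_high:
                high += 1
        if low == 1 and high == 0:
            c1 += 1
        elif low + high >= 1:
            c2 += 1
    return c1 / trials, c2 / trials

s, p_low, p_high = 16, 0.02, 0.005   # hypothetical parameters
print(outcome_probs(s, p_low, p_high))
print(simulate(s, p_low, p_high))
```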
Combining these observations with the reasoning from Section IV-A, we conclude that the expected number of tests per individual, as a function of p_l and p_h, is given by

1/s + (s · p_l · (1 − p)^{s−1} · ⌈log s⌉)/s + 1 − s · p_l · (1 − p)^{s−1} − (1 − p)^s,  (5)

where p = p_l + p_h. For fixed p_l and p_h, it is easy to numerically minimize the expression above as a function of s to find the optimal pool size for the scheme under consideration. Figures 12 and 13 compare the expected number of tests per individual required by different schemes for different values of the total infection probability p and the specific infection probabilities p_l and p_h. The most practically relevant pair of parameters can be obtained from Figure 12, under the assumption that heavy hitters are individuals whose viral loads exceed a certain threshold. Thus, by approximating the nonlinear portion of the viral load curve by a linear function, one can easily show that the probability that an infected individual is a heavy hitter is proportional to the area of the highlighted triangle (this value is used in Figure 12). Note that the reduction in the number of tests increases with p, and for realistic infection rates the savings compared to nonquantitative testing are substantial. (Heavy hitters are not problematic for binary group testing, where the test outcomes do not distinguish between one and several infected individuals in the pool.)

Although we considered only an SQ single-pooling scheme in this section, these ideas can be easily extended to upgrade multi-pooling schemes with binary testing (such as the one from [9]) to exploit SQ test information under a variable viral load. This would allow one to further improve on the expected number of tests required by [9]. Figure 12:
Comparison between the expected number of tests per individual required by Dorfman's single-pooling scheme [6], the Broder–Kumar double-pooling scheme [9], and our semiquantitative single-pooling scheme, as a function of the total infection probability p, with p_l and p_h fixed fractions of p. Figure 13:
Comparison between the expected number of tests per individual required by Dorfman's single-pooling scheme [6], the Broder–Kumar double-pooling scheme [9], and our semiquantitative single-pooling scheme, as a function of the total infection probability p, with a different split of p into p_l and p_h than in Figure 12.

C. Adaptive SQGT with Priors
The above-described probabilistic setting can be generalized to account for different priors for different individuals by invoking the generalized binomial group testing scheme. Recall that Dorfman's scheme assumes that each individual has a probability p of being infected, independently of everyone else. The set of individuals is partitioned into groups of size s, and each group is tested once. The group size s is selected to minimize the expected number of tests.

Hwang extended this setting in [33] to account for varying prior probabilities of infection. In this setting, every individual i ∈ {1, 2, . . . , n} is assumed to have probability p_i of being infected, and thereby probability q_i = 1 − p_i of not being infected. Clearly, in this generalized setting, a (possibly random) partitioning of the individuals into groups of equal sizes is no longer the optimal strategy for the first round of testing. To find the optimal partition of the individuals in the generalized binomial setting, one may assume without loss of generality that the individuals are reindexed so that 0 < q₁ ≤ q₂ ≤ . . . ≤ q_n < 1.

Given a subset of individuals to be tested, G ⊆ {1, 2, . . . , n}, let T(G) denote the expected number of tests required to find the set of infected individuals in G by first jointly testing all the individuals in the group G and then testing every member of G individually if the first test is positive. This number equals:

T(G) = 1, if |G| = 1; otherwise, T(G) = 1 + (1 − Π_{j∈G} q_j) · |G|.  (6)

Now, let D(U) = {G₁, G₂, . . . , G_k} denote the optimal partition of the population U to be tested, where |U| = n. Let C(U) denote the total expected number of tests required to run the two-stage testing procedure on the optimal partition. Clearly, C(U) = Σ_{i=1}^{k} T(G_i). Furthermore, let U_m denote the set of m individuals with the highest probability of being infected, i.e., the individuals indexed by {1, 2, . . . , m}. The optimal partition and the corresponding expected number of tests can be found by solving the following optimization problems:

C(U_m) = min_{m−L−1 ≤ i < m−⌈(L+1)/2⌉} { T(U_m \ U_i) + C(U_i) },  2 ≤ m ≤ n,  (7)

where L denotes the size of the largest group/part in D(U_{m−1}). At step m of the optimization procedure, the probabilities {p_j}_{j≤m} and the previously computed values C(U_j) are used to determine D(U_m) and the expected number of tests C(U_m) for the population U_m. As a consequence of the structure of the program, the optimal partition has the following property: if individuals i and j, j > i, are in the same group, then all individuals k such that i < k < j are in the same group as well (a simple induction argument can be used to prove this claim).

Next, assume that the population has only two types of individuals: m individuals that have a high probability p₁ of infection, and the remaining n − m individuals that have a low probability p₂ ≪ p₁ of infection. Exploiting the structure of the optimization program and the fact that only two types of individuals are present in the test pool, the optimal number of tests needed to find the set of infected individuals using the two-stage procedure equals:

min_{s₁, s₂, x, y ∈ N ∪ {0}} [ ((m − x)/s₁) · (1 + (1 − q₁^{s₁}) · s₁) + ((n − m − y)/s₂) · (1 + (1 − q₂^{s₂}) · s₂) + 1_{xy>0} · (1 + (1 − q₁^x q₂^y) · (x + y)) ],

where x and y represent the numbers of individuals with high and low probabilities of infection, respectively, that are tested together, while s₁ and s₂ are the sizes of the groups used for testing individuals with high and low probabilities of infection, respectively. The optimization allows for at most one heterogeneous group, of size x + y. The average number of tests required to find all the infected individuals in this heterogeneous group is given by 1_{xy>0} · (1 + (1 − q₁^x q₂^y) · (x + y)).
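For small instances, the minimization above can be carried out by exhaustive search. The sketch below does this for a hypothetical population with illustrative infection probabilities; note that, consistent with the indicator 1_{xy>0}, a mixed group is only formed when both x and y are positive.

```python
from itertools import product

def two_type_expected_tests(n, m, p1, p2, s_max=40, xy_max=12):
    """Brute-force the two-type optimization: m high-risk individuals (infection
    probability p1) and n - m low-risk individuals (probability p2), with at most
    one heterogeneous group of size x + y."""
    q1, q2 = 1 - p1, 1 - p2
    best = float("inf")
    for s1, s2, x, y in product(range(1, s_max + 1), range(1, s_max + 1),
                                range(xy_max + 1), range(xy_max + 1)):
        # a mixed group needs both high- and low-risk members
        if (x == 0) != (y == 0) or x > m or y > n - m:
            continue
        cost = ((m - x) / s1 * (1 + (1 - q1 ** s1) * s1)
                + (n - m - y) / s2 * (1 + (1 - q2 ** s2) * s2))
        if x * y > 0:
            cost += 1 + (1 - q1 ** x * q2 ** y) * (x + y)
        best = min(best, cost)
    return best

# hypothetical instance: 200 individuals, 20 of them high-risk
print(round(two_type_expected_tests(200, 20, 0.1, 0.005), 2))
```

The returned value is the expected total number of tests; dividing by n gives the per-individual rate comparable to the expressions in Section IV-A.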
The remaining m − x individuals, who have probability p₁ of infection, are divided equally into groups of size s₁, where each group requires, on average, 1 + (1 − q₁^{s₁}) · s₁ tests to determine the set of all infected individuals. Similarly, the n − m − y remaining individuals, who have probability p₂ of infection, are divided equally into groups of size s₂, where each group requires, on average, 1 + (1 − q₂^{s₂}) · s₂ tests to determine the set of all infected individuals.

D. Lower Bounds for Nonadaptive Probabilistic SQGT
We conclude our exposition in this section by presenting a theoretical result that establishes lower bounds for nonadaptive probabilistic GT, which may be used to assess the quality of our adaptive schemes. For this purpose, we adapt an argument by Aldridge [65] for arbitrarily small error probability under a constant probability of infection. More precisely, we consider a setting where each test has m + 1 outcomes for some m ≥ 1: the outcome of a test is i if there are exactly i infected individuals, for i < m, and ≥ m otherwise. This corresponds to the setting introduced in [37], which provides the most informative type of measurements one can expect from the SQGT framework using the amplification curve information. This model accounts for the saturation limit of each test, dictated by m, which is a phenomenon observable from the amplification curve. Moreover, as before, we assume that each individual in the population of size n is infected independently with some constant probability p > 0. We show the following.

Theorem 1. For every m and constant p > 0, there exists a constant ε(m, p) > 0 such that, under the setting described above, nonadaptive testing requires at least n/m tests to achieve error probability less than ε(m, p) in a population of size n.

In contrast, for m = 2, our two-stage scheme uses significantly fewer than n/2 tests, provided p is not very large. Proving Theorem 1 follows by a simple adaptation of an approach by Aldridge [65], who showed that individual testing is required in order to achieve arbitrarily small error in regular nonadaptive probabilistic group testing (which corresponds to m = 1). First, given any nonadaptive testing scheme, we may without loss of generality remove all tests with m or fewer elements, along with all individuals who participate in those tests. This does not affect the lower bound. Then, we show that there are no nonadaptive testing schemes with arbitrarily small error where every test includes at least m + 1 individuals. Combining these two observations immediately yields Theorem 1.

For an individual i, let x_i denote its infection status. Call an individual i (regardless of its infection status) disguised if every test t in which it participates contains at least m other individuals who are infected. If i is disguised, then changing x_i from 0 to 1, or vice versa, does not change the outcome of the testing scheme. As a result, we can do no better than guess x_i, and we will be wrong with probability at least min(p, 1 − p). To finalize the argument, it suffices to show that there is a disguised individual with constant probability. Let D_i denote the event that individual i is disguised, and let D_{t,i} denote the event that individual i is disguised in test t. Since the D_{t,i} are increasing events, the Fortuin–Kasteleyn–Ginibre (FKG) inequality [66] implies that

Pr[D_i] ≥ Π_{t: x_{t,i} = 1} Pr[D_{t,i}],  (8)

where x_{t,i} indicates whether individual i participates in test t.
Moreover, we have
$$\Pr[D_{t,i}] = \Pr[B(w_t - 1, p) \geq m], \qquad (9)$$
where $B(w, p)$ denotes a binomial random variable with $w$ trials and success probability $p$, and $w_t = \sum_{i=1}^{n} x_{t,i}$ is the weight of test $t$. Let
$$L_i = \log \Big( \prod_{t \,:\, x_{t,i} = 1} \Pr[D_{t,i}] \Big) = \sum_{t \,:\, x_{t,i} = 1} \log \Pr[D_{t,i}] = \sum_{t=1}^{T} x_{t,i} \log \Pr[D_{t,i}],$$
where $T$ denotes the total number of tests, which we assume satisfies $T/n < 1$. Then, it suffices to show that there exists some $i^\star$ with $L_{i^\star} > c$ for some constant $c$ independent of $n$. Let $I$ be uniformly distributed over $\{1, 2, \ldots, n\}$, and let $L = \mathbb{E}[L_I]$. We have
$$L = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T} x_{t,i} \log \Pr[D_{t,i}] = \frac{1}{n} \sum_{t=1}^{T} w_t \log \Pr[D_{t,i}] \geq \min_{t = 1, \ldots, T} \, w_t \log \Pr[B(w_t - 1, p) \geq m] \geq \min_{w \geq m+1} \, w \log \Pr[B(w-1, p) \geq m] =: L^\star,$$
where the second equality follows from the fact that $\Pr[D_{t,i}]$ is the same for every $i$ such that $x_{t,i} = 1$, and in the first inequality we use the assumption that $T/n < 1$. It is immediate that there exists some $i^\star$ with $L_{i^\star} \geq L$, which implies that $\Pr[D_{i^\star}] \geq e^{L^\star}$. Therefore, the error probability of the testing scheme is at least $\tilde{\epsilon}(m,p) = \min(p, 1-p) \cdot e^{L^\star}$. Noting that $L^\star$ does not depend on $n$ and is bounded from below for any $m$ and $p$ (since $\lim_{w \to \infty} w \log \Pr[B(w-1, p) \geq m] = 0$) concludes the proof.

V. AC-DC SCHEMES: WORST-CASE MODEL ANALYSIS
As before, we assume that we are given a set of $n$ samples with at most $d$ infected individuals. Our goal is to minimize the number of tests needed to identify all infected individuals, and we do not impose any restrictions on the "simplicity" of our scheme. As a result, we consider a generalization of the model described in the previous section which allows for more than three test outcomes. For simplicity, as well as for practical reasons (we quantize the $C_t$ values, or the phase transition thresholds, according to equally spaced cycle numbers), we focus on equidistant thresholds but allow for warm-up/saturation effects.

Let $\tau, m \in \mathbb{Z}^{+}$ represent the distance between the thresholds and the number of thresholds, respectively. Denote the outcome of a test by a nonnegative integer $t \leq m$. Then,
$$t = \begin{cases} 0, & \text{if every sample in the test is negative}, \\ 1, & \text{if the number of infected individuals is between } 1 \text{ and } \tau, \\ 2, & \text{if the number of infected individuals is between } \tau + 1 \text{ and } 2\tau, \\ \;\vdots & \\ m-1, & \text{if the number of infected individuals is between } (m-2)\tau + 1 \text{ and } (m-1)\tau, \\ m, & \text{if the number of infected individuals is at least } (m-1)\tau + 1. \end{cases} \qquad (10)$$

We seek to identify $d$ infected individuals from a population of size $n$ given that each test returns a value according to (10). We refer to this problem as the $(n,d)$ adaptive SQGT problem, or the $(n,d)$-ASQGT problem for short. Another way of looking at (10) is that if the collection of samples tested contains $d_0$ infected individuals, then the output of the test is $\lceil d_0/\tau \rceil$ when $d_0 \leq m\tau$, and $m$ otherwise. Note that for every test there are $m+1$ possible outcomes, and the output of a test tells us roughly (within at most $\tau$) how many infected samples are part of the tested pool. Remark 2.
Note that this model differs from the model introduced in Section IV since, as $m$ increases, the widths of our thresholds remain the same, whereas in Section IV the widths change as the number of thresholds increases. Despite this difference, and as will be discussed in Example 5, the ideas presented here are applicable to the case where the widths of the thresholds are nonuniform.

Let $2^\beta = m + 1$, so that $\beta = \log(m+1)$ is the number of bits of information provided by a single test. Motivated by practical applications, we will be interested in the case where $\beta = O(1)$. Our main results are two algorithms, which we refer to as parallel search and deep search. Parallel search is applicable in the setting $d > \beta$. In Lemmas 6 and 7, we show that using parallel search it is possible to efficiently identify, from a set of pools (each of size $s = 2^\alpha$ and large enough to contain at least $\beta$ infected individuals), a set of $\beta$ defectives using at most $\alpha$ tests. Note that as a first-step simplification, one may think of $n$ as being approximately equal to $d \cdot 2^\alpha$; the notation involving $\alpha$ is chosen to enable a comparison between our SQGT search scheme and the well-known splitting approach by Hwang [67]. Deep search, discussed in Lemma 10 and applicable in the setting $d < \beta$, shows that it is possible to identify all $d$ infected individuals using roughly $d \cdot \frac{\alpha}{\beta - \log \beta}$ tests. Our main result is Algorithm 1, which for $d = \Omega(n)$ shows that one can identify $d$ infected individuals using at most $\frac{d}{\beta} \cdot (\alpha + 2 + \log \beta)$ tests. These results show that adaptive SQGT roughly provides $\beta$-fold savings in the number of tests when compared to classical adaptive GT. Furthermore, they differ from the information-theoretic lower bound (as it applies to ASQGT) of Lemma 4 by $O(d/\beta)$ tests. It remains an open problem to determine whether it is possible to solve the $(n,d)$-ASQGT problem using fewer tests.

We start with the following obvious claim, which allows us to restrict our attention to the case where $\tau = 1$ and simplifies the problem at hand. Claim 3.
Let $G$ be the set of test subjects and suppose that there are at most $d$ infected individuals within this group. Let $P^{(1)}$ be a pool formed by taking one sample from each individual in $G$, and let $P^{(w)}$ be a pool formed by taking $w$ samples from each individual in $G$. Let $t^{(1)}$ be the output of testing $P^{(1)}$ under the setup $(m, \tau) = (m, 1)$, and let $t^{(w)}$ be the output of testing $P^{(w)}$ under the setup $(m, \tau) = (m, w)$, according to (10). Then, $t^{(1)} = t^{(w)}$.

Next, we present an information-theoretic (counting) lower bound on the number of tests necessary to solve the $(n,d)$-ASQGT problem. The result follows from a simple counting argument and is consistent with Claim 3, as it does not depend on the width $\tau$ of the threshold. Lemma 4.
Let $n = 2^{\alpha+1} \cdot d + 2^{\alpha} \cdot \delta + \Delta$, where $\alpha, \delta, \Delta$ are integers, $\delta < d$, and $\Delta < 2^{\alpha}$. Then, the number of tests $L(n,d,m)$ needed to identify the infected individuals is bounded as: $L(n,d,m) \geq \frac{d}{\beta} \cdot (\alpha + 1)$. Proof:
The number of ways to select at most $d$ infected individuals in a group of $n$ individuals is $\sum_{i=0}^{d} \binom{n}{i}$. Since each test has $m+1$ possible outcomes, we have
$$L(n,d,m) \geq \log_{m+1} \Big( \sum_{i=0}^{d} \binom{n}{i} \Big) \geq \log_{m+1} \binom{n}{d} \geq \log_{m+1} \Big( \frac{n}{d} \Big)^{d} = d \cdot \log_{m+1} \Big( 2^{\alpha+1} + \frac{2^{\alpha} \delta + \Delta}{d} \Big) \geq \frac{d \alpha}{\beta} + \frac{d}{\beta} \cdot \log \Big( 2 + \frac{2^{\alpha} \delta + \Delta}{2^{\alpha} d} \Big) \geq \frac{d}{\beta} \cdot (\alpha + 1).$$

The next example illustrates a simple approach to the ASQGT problem and motivates the analysis that follows. From here on, we write $[[x]] = \{0, 1, \ldots, x-1\}$ and $[x] = \{1, \ldots, x\}$. Example 5.
Suppose that we are given a collection of $n$ individuals with exactly $d$ infected subjects. We start by randomly partitioning the set of $n$ individuals into $d$ groups, each of size $s = n/d = 2^{\alpha}$, where we assume for simplicity that $d \mid n$. The expected number of infected individuals in each group is $1$. Denote the $d$ groups, or pools, by $G_0, G_1, \ldots, G_{d-1}$; all groups have the same size and, from this point on, for simplicity, assume that each group contains exactly one infected subject. For $i \in [[d]]$ we proceed as follows. We partition $G_i$ into $2^{\beta}$ subgroups of equal size, denoted $G_i^{(0)}, G_i^{(1)}, \ldots, G_i^{(2^{\beta}-1)}$. Under this setup, there exists exactly one index $j^{\star}$ such that the number of infected individuals in $G_i^{(j^{\star})}$ equals one, and every other subgroup $G_i^{(j)}$, $j \in [[2^{\beta}]] \setminus \{j^{\star}\}$, is free of infected individuals.

Next, we form a new set of pools, denoted $P_i$, $i \in [[d]]$, comprising $k$ replicas of the samples in $G_i^{(k)}$, for all $k = 0, \ldots, 2^{\beta} - 1$. Let $t_i$ denote the output of the semiquantitative test described in (10) after the pool $P_i$ is tested. Then, it is straightforward to see that the outcome $t_i$ equals $j^{\star}$, and hence we can identify the subgroup containing the single infected individual using only one nonbinary-outcome test. We repeat this procedure for each group $G_i$, $i \in [[d]]$, partitioned into subgroups. It hence follows that it is possible to identify the $d$ infected individuals using only $d \cdot \alpha/\beta$ tests, assuming that each of the $d$ groups of size $2^{\alpha}$ contains exactly one infected subject. $\square$

To make this argument rigorous, we need to account for the fact that not every group will have exactly one infected individual. In this case, upon creating the subpools, we have to recursively test them until we identify a prescribed number of infected individuals. In fact, the approach from the previous example is a special case of what we refer to as deep search, described in Lemma 10. The resulting algorithm is summarized in Algorithm 1, and it requires roughly an additional $O(d/\beta)$ tests compared to the information-theoretic lower bound.

A. Parallel search
We start by introducing some useful notation. Suppose that $G$ is a subgroup of individuals to be tested and that the outcome of a test governed by (10) is $t$. In this case, we say that $G$ is a $t$-infected group. When referring to an ordered collection of groups $(G_0, G_1, \ldots, G_{g-1})$, we say that the collection is a $(t_0, t_1, \ldots, t_{g-1})$-infected group if $t_0 \geq t_1 \geq \cdots \geq t_{g-1}$ and $G_i$ is a $t_i$-infected group, for $i \in [[g]]$. We also say that $(G_0, \ldots, G_{g-1})$ is a $\beta$-minimal group if $\sum_{j=0}^{g-2} t_j < \beta$, but $\sum_{j=0}^{g-1} t_j \geq \beta$.

The following lemma constitutes the key component of one of our approaches to solving the $(n,d)$-ASQGT problem. We refer to the procedure described in the proofs of the next two results as parallel search. In the first lemma below, we make the simplifying assumption that the group is $\beta$-minimal with $g = \beta$. Afterward, in Lemma 7, we consider the case $g < \beta$. Lemma 6.
Let $\alpha$ and $\beta$ be positive integers. Suppose that $(G_0, G_1, \ldots, G_{\beta-1})$ is a $\beta$-minimal group, where $g = \beta$, and that each group has size at most $2^{\alpha}$. Then, we can identify $\beta$ infected individuals in the group using at most $\alpha$ tests. Proof:
We prove the result by induction on $\alpha$, where $2^{\alpha}$, as before, is the size of each subgroup. Recall that under this setup $t_0 = t_1 = \cdots = t_{g-1} = 1$ and $g = \beta$. First, consider the case $\alpha = 1$, for which we have $\beta$ $1$-infected groups of individuals and each group has size $2$. For shorthand, denote the $\beta$ infected groups as $G_0, G_1, \ldots, G_{\beta-1}$. From these $\beta$ groups, we form a "super-pool" of samples which contains a total of $1 + 2 + 4 + \cdots + 2^{\beta-1} = 2^{\beta} - 1$ samples. More precisely, for $i \in [[\beta]]$, the super-pool contains $2^{i}$ samples from one individual of $G_i$. Since $t_0 = t_1 = \cdots = t_{\beta-1} = 1$ and $\tau = 1$, according to (10) the output returned after testing this super-pool of samples is a number $t$ between $0$ and $2^{\beta} - 1$. Let $b_0, b_1, \ldots, b_{\beta-1}$ be the binary representation of the number $t$. It is straightforward to verify that if $b_i = 1$, then the individual selected from $G_i$ is infected. Otherwise, if $b_i = 0$, then the above-described individual is not infected, which implies that the other individual (the one not tested) in group $G_i$ is infected. Thus, we conclude that the statement in the lemma holds when $\alpha = 1$.

For the inductive step, assume that the statement holds when the group size is at most $2^{\alpha}$ and consider the setup where the group size is $2^{\alpha+1}$. We follow the same approach as described for $\alpha = 1$ for creating super-pools. Under this setup, we have $\beta$ $1$-infected groups $G_0, G_1, \ldots, G_{\beta-1}$, each of size $2^{\alpha+1}$. For $i \in [[\beta]]$, let $Q_i \subseteq G_i$ be a subset of $G_i$ of size $2^{\alpha}$. Next, we construct a super-pool that contains $2^{i}$ samples from each individual in $Q_i$, $i \in [[\beta]]$. Let $t$ denote the output of testing this super-pool according to (10), and let $b_0, b_1, \ldots, b_{\beta-1}$ be the binary representation of $t$. As before, if $b_i = 1$, then $Q_i$ contains a single infected individual. Otherwise, if $b_i = 0$, then there is an infected individual in the set $G_i \setminus Q_i$, which also has size $2^{\alpha}$. For $i \in [[\beta]]$, let $G_i' = Q_i$ if $b_i = 1$ and, otherwise, if $b_i = 0$, set $G_i' = G_i \setminus Q_i$. Then, $(G_0', G_1', \ldots, G_{\beta-1}')$ is a $(1, 1, \ldots, 1)$-infected group and we can apply the inductive hypothesis to $(G_0', G_1', \ldots, G_{\beta-1}')$.

For the case $g < \beta$, we use a similar partitioning idea to identify at most $\beta$ subgroups which satisfy the conditions of the lemma. The difference between the approaches is that for $g < \beta$, the number of samples added to the pool is dictated by a mixed-radix representation (in which the numerical base varies from position to position) rather than a binary representation. For simplicity, we assume from now on that $\beta$ is an even integer, although the results hold for odd integers as well. Lemma 7.
Let $\alpha, \beta, g$ be positive integers such that $g < \beta$. Suppose that $(G_0, G_1, \ldots, G_{g-1})$ is a $\beta$-minimal group and that each group has size at most $2^{\alpha}$. Then, we can identify $\beta$ infected individuals using at most $\alpha$ tests. Proof:
We begin with the following claim which we find useful in our subsequent discussion.
Claim 8.
Suppose we are given a sequence $(t_0, \ldots, t_{g-1}) \in [[\beta+1]]^{g}$, where $g < \beta$, and the values $t_0 \geq t_1 \geq \cdots \geq t_{g-1}$ are such that $\sum_{j=0}^{g-1} t_j \geq \beta$, but $\sum_{j=0}^{g-2} t_j < \beta$. Furthermore, let $(n_0, \ldots, n_{g-1}) \in [[t_0+1]] \times [[t_1+1]] \times \cdots \times [[t_{g-1}+1]]$. Then, the number of different choices for $(n_0, \ldots, n_{g-1})$ is at most $2^{\beta}$.

Proof of Claim 8: First, consider the case $g \leq \beta/2 + 1$. Since $\sum_{j=0}^{g-2} t_j < \beta$, it follows that $t_{g-2} \leq \frac{\beta}{g-1}$ and, from the assumptions stated in the claim, $t_{g-1} \leq t_{g-2} \leq \frac{\beta}{g-1}$. The total number of possibilities for $(n_0, \ldots, n_{g-1})$ restricted to the first $g-1$ components is maximized when $t_0 = t_1 = \cdots = t_{g-2}$, which implies that the total number of possible choices for the $i$-th component of the sequence, $i \in [[g-1]]$, equals $\frac{\beta}{g-1} + 1$. Therefore, the total number of possibilities for the constrained sequences is at most
$$\Big( \frac{\beta}{g-1} + 1 \Big)^{g} \leq 2^{\beta/2 + 1},$$
which follows since $\big( \frac{\beta}{g-1} + 1 \big)^{g}$ is increasing in $g$ and $g \leq \beta/2 + 1$. Since $2^{\beta/2+1} \leq 2^{\beta}$ whenever $\beta \geq 2$, we conclude that the result holds for $\beta \geq 2$. For $\beta < 2$, the result can be verified through exhaustive checking. Next, we consider the case $g \geq \beta/2 + 1$. Note that under this setup, since $t_0 \geq t_1 \geq \cdots \geq t_{g-1}$, it follows that $t_{g-1} = 1$: otherwise, if $t_{g-1} = 2$, we would have $\sum_{j=0}^{g-2} t_j \geq \beta$. For this case, we prove the result by induction on $g$. The base case, corresponding to $g = \beta/2 + 1$, follows from the previous paragraph. Therefore, assume that the result holds for all $g < \gamma$ and consider $g = \gamma > \beta/2 + 1$. Since $\gamma > \beta/2 + 1$, we have $t_{\gamma-1} = t_{\gamma-2} = 1$, which implies that $\sum_{j=0}^{\gamma-3} t_j < \beta - 1$, since otherwise, if $\sum_{j=0}^{\gamma-3} t_j = \beta - 1$, then $\sum_{j=0}^{\gamma-2} t_j = \beta$ and we arrive at a contradiction. Thus, we have $\gamma - 1 \geq \beta/2 + 1$, and so we can apply the inductive hypothesis to the first $\gamma - 1$ components of the sequence $(n_0, \ldots, n_{g-1})$ and conclude that there are at most $2^{\beta-1}$ possible choices for the sequence $(n_0, n_1, \ldots, n_{\gamma-2})$. Since $t_{\gamma-1} = 1$, it follows that the total number of different options for the sequence $(n_0, n_1, \ldots, n_{\gamma-1})$ is at most $2 \cdot 2^{\beta-1} = 2^{\beta}$. This completes the proof.

Recall the main idea behind the proof of Lemma 6, where we tacitly assumed that $g = \beta$. There, we used the binary representation of the integer $t$, where $t$ denotes the test outcome of the super-pool, to determine which of the tested subgroups contained infected individuals. In order to make this argument work, we formed the super-pool by adding $2^{i}$ samples from each individual of the subset $Q_i \subseteq G_i$, for $i \in [[\beta]]$, where $|Q_i| = |G_i|/2$. Now, the idea is to add $N_i$ samples from each individual of the $i$-th group, where $N_i$ is chosen by considering a mixed-radix representation of the number $t$. We say that $(b_0, b_1, \ldots, b_{g-1})$ is the $(t_0, t_1, \ldots, t_{g-1})$-mixed-radix representation of $t$ if the following holds. Let $N_0 = 1$ and, for $i \in [g-1]$, let $N_i = (t_{i-1} + 1) \cdot N_{i-1}$. Note that when $t_0 = t_1 = \cdots = t_{g-1} = 1$, we have $N_i = 2^{i}$. The mixed-radix representation of $t$ is of the form $t = \sum_{i=0}^{g-1} b_i \cdot N_i$, where $0 \leq b_i \leq t_i$. Under this setup, since $b_i \leq t_i$, the sequence $(b_0, b_1, \ldots, b_{g-1}) \in [[t_0+1]] \times [[t_1+1]] \times \cdots \times [[t_{g-1}+1]]$ provides a unique representation and is invertible provided that $(t_0, t_1, \ldots, t_{g-1})$ is given. In other words, given the number $t$, we can uniquely determine the $i$-th digit $b_i$ of its $(t_0, t_1, \ldots, t_{g-1})$-mixed-radix representation. Furthermore, as a result of Claim 8, we know that $t \leq 2^{\beta} - 1 = m$.

We are now ready to proceed with the proof. Suppose that $(G_0, G_1, \ldots, G_{g-1})$ is a $\beta$-minimal group. We prove the result by induction and show only the inductive step, since the base case follows from similar ideas. For the inductive step, assume the statement holds when the group size is at most $2^{\alpha}$ and consider the setup where the group size is $2^{\alpha+1}$. Note that we have $(t_0, t_1, \ldots, t_{g-1})$-infected groups $G_0, G_1, \ldots, G_{g-1}$, each of size $2^{\alpha+1}$.
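To make the mixed-radix bookkeeping concrete before the super-pool is formed, the representation can be sketched as follows (a small illustration of ours; the function names are hypothetical):

```python
def radix_weights(ts):
    """Sample multipliers N_i: N_0 = 1 and N_i = (t_{i-1} + 1) * N_{i-1}.
    For t_0 = ... = t_{g-1} = 1 this reduces to binary weights 2^i."""
    weights = [1]
    for i in range(1, len(ts)):
        weights.append((ts[i - 1] + 1) * weights[-1])
    return weights

def encode(digits, ts):
    """Super-pool readout t = sum_i b_i * N_i, with digits 0 <= b_i <= t_i."""
    assert all(0 <= b <= t for b, t in zip(digits, ts))
    return sum(b * n for b, n in zip(digits, radix_weights(ts)))

def decode(t, ts):
    """Invert the (t_0, ..., t_{g-1})-mixed-radix representation digit by digit."""
    weights = radix_weights(ts)
    return [(t // weights[i]) % (ts[i] + 1) for i in range(len(ts))]
```

With all $t_i = 1$ this is exactly the binary encoding of Lemma 6; for example, `decode(5, [1, 1, 1])` recovers the digits `[1, 0, 1]`.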
We form our super-pool as follows. As before, for each $i \in [[g]]$, we select a subset $Q_i \subset G_i$ of size $|Q_i| = 2^{\alpha}$. For each individual in $Q_i$ we add $N_i$ samples to the super-pool, where $N_i$ is as defined in the previous paragraphs. Let $t$ be the output of testing the resulting super-pool according to (10), and let $b_i$ denote the $i$-th digit of the $(t_0, t_1, \ldots, t_{g-1})$-mixed-radix representation of $t$. Based on $t$, we can determine the number of infected individuals in each of the subgroups $Q_0, G_0 \setminus Q_0, \ldots, Q_{g-1}, G_{g-1} \setminus Q_{g-1}$. In particular, since we know $t_i$, and given the digit $b_i$ which can be recovered after testing the super-pool, we know that for all $i \in [[g]]$ the number of infected subjects in $Q_i$ is $b_i$ and the number of infected subjects in $G_i \setminus Q_i$ is $t_i - b_i$. Given this information, we can generate groups $G_0', G_1', \ldots$, where each $G_i' \subseteq Q_i$ or $G_i' \subseteq G_i \setminus Q_i$, such that the collection is a $\beta$-minimal group. Thus, we can apply the inductive hypothesis to this collection. This establishes that we can identify $\beta$ infected individuals using at most $\alpha$ tests and completes the proof.

B. Deep search
Next, we consider the case $d < \beta$, and show that there exists an ASQGT scheme which requires roughly $\frac{d}{\beta - \log(\beta)} \cdot (\alpha + \log(\beta)) + d$ tests. Recall that the main idea behind the parallel search procedure was to simultaneously run a binary search on $g$ subpools, each of size $2^{\alpha}$. In this manner, using $\alpha$ tests we can identify $\beta$ infected individuals. For $d < \beta$, there are not sufficiently many infected individuals to use this method, and so for this setup, rather than performing a binary search in parallel, we test roughly $2^{\beta - \log(\beta)}$ (significantly smaller) subpools at the same time. We refer to this procedure as deep search.

Before proving the relevant lemma, we begin by describing a variant of the well-known Newton's identities. For completeness, we include a proof. Claim 9.
Let $S = \{j_1, \ldots, j_d\} \subset \mathbb{Z}_{\geq 0}$ be a multiset of nonnegative integers, each of which has value at most $p-1$, where $p$ is an odd prime. Define $p_\ell(S) = \sum_{k=1}^{d} j_k^{\ell} \bmod p$, the $\ell$-th power sum of $S$ over the finite field $\mathbb{F}_p$. Then, one can recover $S$ given $(p_1(S), p_2(S), \ldots, p_d(S))$. Figure 14:
Illustration of the parallel search ASQGT procedure. In this example, $m = 7$ and exactly one individual is infectious in each of the three groups. The weights of the samples in each test round are set to $4, 2, 1$, as seen in Frame 1. A binary search procedure is implemented to find the infected individual in each group. In Frame 2, the test outcome for the first round is $2$, implying that there is one infected individual among the probed samples of the second group. Thus, the subgroups from groups $1$ and $3$ that were probed in Frame 1 are discarded, as illustrated in Frame 2. Similarly, the subgroup of group $2$ that was not tested is discarded as well. The subgroups that contain an infected individual are further probed, as seen in Frames 2, 3 and 4. Proof:
We represent $S$ by $S^{(+)}$, the multiset of positive elements of $S$, together with $z \in \mathbb{Z}_{\geq 0}$, the number of zeros in $S$. Given $S^{(+)}$ and $z$, the multiset $S$ is uniquely determined. First, note that Newton's identities can be used to recover the multiset $S^{(+)} = \{i_1, i_2, \ldots, i_{d'}\}$, where $d' = |S^{(+)}|$. To see this, let $\sigma(i_1, i_2, \ldots, i_{d'}) = \prod_{k=1}^{d'} (1 - i_k x) = \sum_{k=0}^{d'} \sigma_k x^{k} \in \mathbb{F}_p[x]$, where the operations are over the polynomial ring $\mathbb{F}_p[x]$ and the elements of $S$ are viewed as elements of $\mathbb{F}_p$. Then, we have
$$\sum_{\ell=1}^{d} p_\ell(S) x^{\ell} \equiv \sum_{\ell=1}^{d} p_\ell(S^{(+)}) x^{\ell} \equiv \sum_{\ell=1}^{d} \sum_{k=1}^{d'} i_k^{\ell} x^{\ell} \equiv \sum_{k=1}^{d'} \sum_{\ell=1}^{d} i_k^{\ell} x^{\ell} \equiv \sum_{k=1}^{d'} \frac{i_k x - i_k^{d+1} x^{d+1}}{1 - i_k x} \equiv \sum_{k=1}^{d'} \frac{i_k x}{1 - i_k x} \pmod{x^{d+1}},$$
which implies
$$\Big( \sum_{\ell=1}^{d} p_\ell(S^{(+)}) x^{\ell} \Big) \cdot \sigma(i_1, \ldots, i_{d'}) \equiv -x \cdot \frac{d}{dx} \sigma(i_1, \ldots, i_{d'}) \pmod{x^{d+1}}.$$
The above equality in turn implies $\sum_{k=0}^{\ell-1} \sigma_k \cdot p_{\ell-k}(S) = -\ell \cdot \sigma_\ell$. Thus, given $p_\ell(S)$, $\ell \in [d]$, one can recover $\sigma(S^{(+)})$ as well as the multiset $S^{(+)}$. The multiset $S$ can subsequently be recovered by noting that the number of zeros in $S$ equals $p_0(S) - |S^{(+)}|$. Lemma 10.
Let $p$ be an odd prime such that $p \geq 2^{L} - 1$ and $(p-1) \cdot d < 2^{\beta}$. Suppose that $G$ is a $d$-infected set of size $2^{\alpha}$, and $d \leq p - 1$. Then we can identify the $d$ infected individuals using at most $d \cdot \frac{\alpha}{L}$ tests. Proof:
For simplicity we assume that $L \mid \alpha$ and, similar to Lemmas 6 and 7, use induction on $\alpha$. For the base case $\alpha = L$, we run $d$ tests, and for each test we design a different test group. For $\ell \in [d]$, test group $\ell$ contains $j^{\ell} \in \mathbb{F}_p$ samples from each individual indexed by $j \in [[2^{L}]]$. Suppose that $D$ is a multiset of elements from $[[2^{L}]]$ such that if index $j$ corresponds to $k$ infected individuals, then $j$ appears $k$ times in $D$. Then, according to the above setup, the output of performing the SQGT test on pool $\ell$ results in the following $\ell$-th power sum: $p_\ell(j_1, j_2, \ldots, j_d) = \sum_{k=1}^{d} j_k^{\ell}$. Note that $j_k^{\ell} < p$ (since by design $j_k^{\ell} \in \mathbb{F}_p$), and so $p_\ell(j_1, j_2, \ldots, j_d) \leq (p-1) d < 2^{\beta}$. Thus, the readouts do not saturate, and for $\ell \in [[d+1]]$ we can recover $p_\ell(j_1, j_2, \ldots, j_d) \bmod p$; here $p_0(j_1, j_2, \ldots, j_d) = d$ follows from the fact that $G$ is a $d$-infected set. From this set of $d+1$ power sums over the field $\mathbb{F}_p$, we can recover the multiset $\{j_1, \ldots, j_d\}$ by Claim 9, which completes the proof of the base case.

For the inductive step, assume the statement holds for group sizes at most $2^{\alpha}$ and consider a group of size $2^{\alpha + L}$. As in the proofs of Lemmas 6 and 7, we work with subgroups. The subgroups are formed by partitioning the set of $2^{\alpha+L}$ individuals into $2^{L}$ subgroups $P_0, P_1, \ldots, P_{2^{L}-1}$, each of size $2^{\alpha}$. Applying the same ideas as before, we form $d$ test groups, where test group $\ell \in [d]$ contains $j^{\ell} \in \mathbb{F}_p$ samples from each individual in subgroup $j \in [[2^{L}]]$. Let $D = \{j_1, j_2, \ldots, j_d\}$ be a multiset of integers such that $j_u$ appears $t$ times in the multiset if and only if subgroup $j_u$ contains $t$ infected individuals. Using the same approach as for the base case, we first recover the power sums $p_\ell(j_1, j_2, \ldots, j_d)$.
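The power-sum decoding step can be sketched as follows, under the assumptions of Claim 9 (all elements of the multiset lie in $\{0, \ldots, p-1\}$ and $d < p$); the root extraction via synthetic division is our own implementation choice, and the names are hypothetical:

```python
def power_sums(S, d, p):
    """First d power sums of the multiset S over F_p (the SQGT readouts
    reduced mod p)."""
    return [sum(pow(s, l, p) for s in S) % p for l in range(1, d + 1)]

def newton_elementary(psums, d, p):
    """Elementary symmetric polynomials e_1..e_d from power sums p_1..p_d via
    Newton's identities k*e_k = sum_{i=1}^{k} (-1)^{i-1} e_{k-i} p_i;
    every k <= d < p is invertible mod p."""
    e = [1] + [0] * d
    for k in range(1, d + 1):
        acc = sum((-1) ** (i - 1) * e[k - i] * psums[i - 1] for i in range(1, k + 1))
        e[k] = acc * pow(k, -1, p) % p
    return e[1:]

def recover_multiset(psums, d, p):
    """Recover the size-d multiset (as a sorted list) from its first d power
    sums. Zeros in the multiset come out as roots 0 of the monic polynomial."""
    e = newton_elementary(psums, d, p)
    # prod_k (x - s_k) = x^d - e_1 x^{d-1} + e_2 x^{d-2} - ...
    coeffs = [1] + [(-1) ** k * e[k - 1] % p for k in range(1, d + 1)]
    roots = []
    for r in range(p):
        while True:
            quot, acc = [], 0
            for c in coeffs:            # synthetic division by (x - r)
                acc = (acc * r + c) % p
                quot.append(acc)
            if quot[-1] != 0 or len(coeffs) == 1:
                break                   # remainder f(r) != 0: r is exhausted
            roots.append(r)             # r is a root; peel off one factor
            coeffs = quot[:-1]
    return sorted(roots)
```

For instance, the multiset $\{0, 2, 2, 5\}$ over $\mathbb{F}_{11}$ has power sums $(9, 0, 9, 8)$ and is recovered exactly from them, including the zero element and the repeated root.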
Then, from Claim 9, we recover the multiset $D$ in the same manner as before, and we apply the inductive hypothesis to the subgroups indexed by $D$. This completes the inductive step and the proof. Remark 11.
For the case $d = 1$, the deep search procedure coincides with the approach described in Example 5. Deep search may be of limited practical value due to the large amounts of sample material required for testing, but it is of theoretical relevance because it generalizes Hwang's splitting method to the SQGT setting for a small number of infected individuals.

C. $(n,d)$-ASQGT schemes

As discussed in the text following Example 5, our general approach to adaptive SQGT is to first partition the set of $n$ individuals into roughly $d/\beta$ subpools and test each subpool separately using either parallel search or deep search, depending on the number of infected individuals in each subpool. Deep search produces the best results when the number of infected individuals across the subpools is smaller than $\beta$, while parallel search gives the best results for the case of a large number of infected individuals.

Let $T_P(n,d)$ denote the number of tests required by our ASQGT scheme, summarized in Algorithm 1, and let $n - d = 2^{\alpha} \cdot d + 2^{\alpha} \cdot \delta + \Delta$, where $\alpha, \delta, \Delta$ are integers such that $\delta < d$ and $\Delta < 2^{\alpha}$. In order to simplify the notation by avoiding floor and ceiling functions, we assume that $\beta \mid d$ and $\beta \mid \delta$. Theorem 12. $T_P(n,d) \leq \frac{d}{\beta} \cdot (\alpha + 2 + \log \beta) + \frac{\delta}{\beta}$. Proof:
Since the first step involves testing $\frac{d}{\beta} + \frac{\delta}{\beta}$ groups, the first step requires $\frac{d}{\beta} + \frac{\delta}{\beta}$ tests. For the next steps, note that each group has size at most $2^{\alpha} \cdot \beta = 2^{\alpha + \log \beta}$. Hence, we can uncover $\beta$ infected individuals using at most $\alpha + \log(\beta)$ tests, according to Lemmas 6 and 7. In Step 3), we use one additional test for every $\beta$ infected individuals. Since there are $d$ infected individuals in total, the number of tests required by Algorithm 1 satisfies
$$T_P(n,d) \leq \Big( \frac{d}{\beta} + \frac{\delta}{\beta} \Big) + \frac{d}{\beta} \cdot (\alpha + \log(\beta)) + \frac{d}{\beta} = \frac{d}{\beta} \cdot (\alpha + 2 + \log(\beta)) + \frac{\delta}{\beta}.$$
As discussed earlier, the parallel search ASQGT scheme requires $O(d/\beta)$ more tests than the information-theoretic lower bound. When $\beta = 1$, our scheme requires $O(d)$ additional tests, which agrees with the traditional adaptive binary setting studied in [49].

Next, we consider the second approach to the ASQGT problem, based on deep search, for the case where $d < \beta$. Let $T_D(n,d)$ denote the number of tests required by our algorithm and, with a slight abuse of parameter definitions, assume that $n - d = 2^{\alpha} \cdot d$. Furthermore, assume as before that $d \mid \beta$ and $d \mid n$. The corresponding approach is described in Algorithm 2. Algorithm 1
Parallel search ASQGT scheme
1) Initialize: Partition the set of $n$ individuals into $\frac{d + \delta}{\beta}$ groups, denoted $G_0, G_1, \ldots, G_{\frac{d+\delta}{\beta}-1}$, each of size at most $\beta \cdot 2^{\alpha} + 1$. Test each subgroup individually. For $i \in [[\frac{d+\delta}{\beta}]]$, suppose that $G_i$ is a $t_i$-infected group, and let $D$ denote the total number of infected subjects across all groups.
2) Parallel Search: Identify a $\beta$-minimal group $(G_{i_0}, \ldots, G_{i_{g-1}})$, and apply parallel search on the group to uncover $\beta$ infected individuals. Remove the $\beta$ infected individuals from their respective groups.
3) Update: Use one additional test to determine the number of infected subjects in $(G_{i_0}, \ldots, G_{i_{g-1}})$ after Step 2). Update $t_{i_0}, \ldots, t_{i_{g-1}}$ and $D$. If $D > 0$, go to Step 2). Algorithm 2
Deep search ASQGT scheme
1) Initialize: Partition the set of $n$ individuals into $d$ groups, denoted $G_0, G_1, \ldots, G_{d-1}$, each of size $2^{\alpha}$. Test each subgroup individually and let $D$ denote the total number of infected subjects across all groups.
2) Deep Search: Identify a $t_i$-infected group $G_i$, and apply deep search to uncover its $t_i$ infected subjects, for some $i \in [[d]]$.
3) Update: Let $D = D - t_i$. If $D > 0$, go to Step 2). Theorem 13.
The number of tests for deep search ASQGT satisfies $T_D(n,d) \leq d \cdot \frac{\alpha}{\beta - \log \beta - 1} + \beta$. Proof:
The first step in Algorithm 2 requires $d < \beta$ tests. According to Lemma 10, Step 2) requires at most $t_i \cdot \frac{\alpha}{\beta - \log(\beta) - 1}$ tests. Hence, the total number of tests is upper bounded as
$$T_D(n,d) \leq \beta + \sum_{i=0}^{d-1} t_i \cdot \frac{\alpha}{\beta - \log \beta - 1} = \beta + d \cdot \frac{\alpha}{\beta - \log \beta - 1}.$$

D. Error-resilient $(n,d)$-ASQGT schemes

We consider next the question of designing ASQGT models that can tolerate a bounded number of birth-death (BD) chain errors. Recall from (10) that in the event that there are no errors, the output of testing a pool of individuals, of which $d_0$ are infected, is an integer $t$ such that $t = d_0$ for $d_0 \leq m$, and $t = m$ whenever $d_0 > m$. Suppose instead that the erroneous output of testing a pool is $\hat{t} \in \{t-1, t+1\}$, with the appropriate boundary conditions. We refer to such an error as a single BD error.

Our main result is described in Theorem 17. We prove that there exists a scheme requiring $\frac{d}{\beta-2} \cdot (\alpha + 2 + \log \beta) + \frac{\delta}{\beta}$ tests that can correct an arbitrary number of test errors. For the case where the number of test errors is a small integer $e$, $\big( \frac{d}{\beta} + e \big) \cdot (\alpha + 2 + \log \beta) + 2 \cdot \frac{d}{\beta} + \frac{\delta}{\beta} + e$ tests suffice, which implies that only $e (\alpha + 2 + \log \beta) + 2 \cdot \frac{d}{\beta} + e$ additional tests are required to correct $e$ errors in Algorithm 1.

The next claim highlights one of the main ideas behind our approach: take multiple copies of the samples from each of the individuals being tested in such a way as to obtain error-free readouts even when errors occur. Here, as before, we assume that $2^{\beta} = m + 1$. Claim 14.
Let $P$ be a pool of individuals and suppose that $P^{(\times 3)}$ is a pool which contains three samples from each individual in $P$. Let $t$ be the output of the test performed on the pool $P^{(\times 3)}$ given that no errors occur, and suppose that $\hat{t}$ is a possibly erroneous output of the test performed on the pool $P^{(\times 3)}$. Given $\hat{t}$, one can determine $t$. Proof:
Since we have taken three samples from each of the individuals in the pool $P$, it follows that $t \equiv 0 \pmod 3$. Thus, if an error occurs, the output of the test under the BD model equals $\hat{t} \in \{t+1, t-1\}$, which implies that $\hat{t} \equiv \pm 1 \pmod 3$. If $\hat{t} \bmod 3 = 1$, then $\hat{t} = t + 1$, and so we can recover $t$ by simply decrementing $\hat{t}$ by one. Similarly, if $\hat{t} \equiv -1 \pmod 3$, then $\hat{t} = t - 1$, and we can recover $t$ by incrementing $\hat{t}$ by one.

Using the idea from the previous claim, we can determine exactly how many infected individuals are present in each of the tested pools despite the fact that testing errors can occur. We describe the underlying method through an example, for which we need the following terminology. We say that $P_i$ is a $t_i$-infected group if the output of testing $P_i^{(\times 3)}$ is in the set $\{3t_i - 1, 3t_i, 3t_i + 1\}$. We also say that $(P_{i_0}, \ldots, P_{i_{g-1}})$ is a $\beta$-minimal group if $t_{i_0} \geq t_{i_1} \geq \cdots \geq t_{i_{g-1}}$, $\sum_{j=0}^{g-2} t_{i_j} < \beta$, but $\sum_{j=0}^{g-1} t_{i_j} \geq \beta$. Example 15.
For simplicity, assume that we have $m = 3(2^{\gamma} - 1)$ thresholds, and suppose that $(P_0, P_1, \ldots, P_{g-1})$ is a $\gamma$-minimal group, where we again make a simplifying assumption, namely $\gamma = g$. We proceed in the same manner as described in Lemma 6: we first form a super-pool, denoted $P$, which consists of $2^{i}$ copies of each sample in $P_i$. Afterward, we generate a larger pool of samples, $P^{(\times 3)}$, which contains $3$ copies of each sample in $P$. Notice that given the output of the test on $P^{(\times 3)}$, we can uniquely determine the number of infected individuals in each of the groups $P_0, P_1, \ldots, P_{g-1}$. Indeed, suppose that $t$ is the output of testing $P$ and that $\hat{t}$ is a possibly erroneous output of testing $P^{(\times 3)}$. From Claim 14, we can recover $3t$, and hence $t$, from $\hat{t}$, and from Lemma 6 it is then possible to determine how many infected individuals are in $P_0$, how many are in $P_1$, and so on.

Another simple way to see how the above scheme overcomes BD noise is to observe that it suffices for the possible test outcomes to differ from one another by at least three. This can easily be accomplished by fixing the coefficients of $2^0$ and $2^1$ in the binary representation of the test outcomes to zero. More precisely, we can artificially introduce two subgroups, so that when $m = 2^{\gamma} - 1$, we collect $2^{i}$ samples from the subgroup labeled $i$ only for $1 < i < \gamma$, with the amounts dictated by the labels. If the observed test outcome is $\hat{t} = \sum_{i=0}^{\gamma-1} \bar{e}_i \cdot 2^{i}$, then the true test outcome is decoded as:
$$e_{\gamma-1} e_{\gamma-2} \ldots e_2 e_1 e_0 = \begin{cases} \bar{e}_{\gamma-1} \bar{e}_{\gamma-2} \ldots \bar{e}_2 0 0, & \text{if } \bar{e}_1 \bar{e}_0 = 00 \text{ or } \bar{e}_1 \bar{e}_0 = 01, \\ \bar{e}_{\gamma-1} \bar{e}_{\gamma-2} \ldots \bar{e}_2 0 0 + 100 \text{ (binary addition)}, & \text{if } \bar{e}_1 \bar{e}_0 = 11. \end{cases} \qquad (11)$$
$\square$ The following claim is straightforward.
Claim 16.
Let $\alpha, \beta \geq g$ be positive integers, where $2^{\beta} = m + 1$ and $g \leq \beta - 2$. Suppose that $(G_0, G_1, \ldots, G_{g-1})$ is a $(\beta-2)$-minimal group and that each group has size at most $2^{\alpha}$. Then we can identify $\beta - 2$ infected individuals using at most $\alpha$ tests. Proof:
The proof follows immediately by applying the procedure described in Example 15 and noting that $3 \cdot (2^{\beta-2} - 1) < 2^{\beta} - 1$ when $\beta \geq 2$.

Next, we turn our attention to a scheme designed for a small number of testing errors $e$. To this end, let $T_N(n,d,e)$ denote the number of tests required by a noisy ASQGT scheme that tolerates up to $e$ BD testing errors (see Algorithm 3). As before, let $n - d = 2^{\alpha} d + 2^{\alpha} \delta + \Delta$, where $\alpha, \delta$ and $\Delta$ are integers such that $\delta < d$ and $\Delta < 2^{\alpha}$. Once again, we assume that $\beta \mid d$ and $\beta \mid \delta$. We prove the correctness of our algorithm in the following theorem. Theorem 17.
Let $\beta \geq 4$. We have
$$T_N(n, d, e) \leq \min\left( \left(\frac{d}{\beta} + e\right) \cdot (\alpha + 1 + \log \beta) + 4 \cdot \frac{d}{\beta} + \frac{\delta}{\beta} + 3e, \; \left(\frac{d}{\beta} - 1\right) \cdot (\alpha + 1 + \log \beta) + \frac{\delta}{\beta} \right).$$

Proof:
The second term under the minimum follows immediately from the parallel search ASQGT scheme, given the use of a robust parallel search. Therefore, in the remainder of the proof we focus our attention on the first term.

Algorithm 3 Noisy search ASQGT scheme

1) Initialize: Partition the set of samples from the $n$ individuals into $\frac{d}{\beta} + \frac{\delta}{\beta}$ groups, denoted by $P_0, P_1, \dots, P_{\frac{d}{\beta} + \frac{\delta}{\beta} - 1}$, each of size at most $\beta^{\alpha+1}$. Test each subgroup $P_i^{(\times 2)}$ individually. For $i \in [\![\frac{d}{\beta} + \frac{\delta}{\beta}]\!]$, suppose that $P_i$ is a $t_i$-infected group, and let $D$ denote the total number of infected subjects in all subgroups.

2) Parallel Search: Identify a $\beta$-minimal group $(P_{i_0}, \dots, P_{i_{g-1}})$ and apply parallel search to uncover $\beta$ potentially infected subjects. Divide the set of $\beta$ potentially infected individuals into two groups of sizes $\lfloor \beta/2 \rfloor$ and $\lceil \beta/2 \rceil$, denoted by $D_1^{(\times 2)}$, $D_2^{(\times 2)}$.

3) Verify: Test $D_1^{(\times 2)}$, $D_2^{(\times 2)}$ to determine the total number of infected individuals recovered. Update $t_{i_0}, \dots, t_{i_{g-1}}$ and $D$.

4) Update Large Group Counts: If only one group is present, $|P_i| \geq \beta$, and $t_i \geq 1$, then test $P_i^{(\times 2)}$ to determine the number of infected individuals in $P_i$. Go back to Step 2).

The first step of our algorithm requires $\frac{d}{\beta} + \frac{\delta}{\beta}$ tests, and each time we execute Step 2) we perform $\alpha + 1 + \log \beta$ tests. Since Step 2) is executed at most $\frac{d}{\beta} + e$ times, the total number of tests required by the first two steps of our procedure is at most
$$\frac{d}{\beta} + \frac{\delta}{\beta} + \left(\frac{d}{\beta} + e\right) \cdot (\alpha + 1 + \log \beta).$$
For Step 3), note that since $\max\left\{ |D_1^{(\times 2)}|, |D_2^{(\times 2)}| \right\} \leq \lceil \beta/2 \rceil$, we have $t \leq 2(\lceil \beta/2 \rceil - 1) < \beta - 1 = m$ when $\beta \geq 4$, and so we can determine exactly how many infected subjects are in each of the sets $D_1^{(\times 2)}$, $D_2^{(\times 2)}$ in Step 3). Each time Step 3) is executed, we require $2$ tests. Since Step 2) is executed at most $\frac{d}{\beta} + e$ times, this step requires at most $2 \cdot \frac{d}{\beta} + 2e$ tests. Finally, since Step 4) is executed at most $\frac{d}{\beta} + e$ times, it follows that the total number of tests is at most
$$\frac{d}{\beta} + \frac{\delta}{\beta} + \left(\frac{d}{\beta} + e\right) \cdot (\alpha + 1 + \log \beta) + 2 \cdot \frac{d}{\beta} + 2e + \frac{d}{\beta} + e,$$
which proves the claimed result.

E. Extensions to non-uniform threshold widths
The next two examples illustrate how the ideas from the previous sections can be extended to the case where the threshold widths increase exponentially. For this case, we only consider small values of $m$ (i.e., $m = 3, 4$).

Example 18.
In the following, we consider a model that mirrors the results from Section IV. Suppose that the test outcomes equal
$$t = \begin{cases} 0, & \text{if there are no infected subjects in the test}, \\ 1, & \text{if the number of infected samples is } 1, \\ 2, & \text{if the number of infected samples is in } [2, 3], \\ 3, & \text{if the number of infected samples is in } [4, 7]. \end{cases} \quad (12)$$

We consider the following extension of the approach discussed earlier. Suppose we have a pool of size $2^{\alpha}$ that contains at least one infected subject. We start by testing this pool to determine the total number of infected individuals. There are two cases to consider: (a) the output of the test is $2$ or $3$, which indicates that there is more than a single infected individual in the pool, or (b) the output of the test is $1$.

Suppose that the outcome is (b). In this case, we run a variant of deep search. In particular, we divide the pool into $4$ subpools and form a superpool from these subpools which contains $0$ samples from the first subpool, $1$ sample from the second, $2$ samples from the third, and $4$ samples from the fourth. It is straightforward to verify that in this case we can determine which of the subpools contains the single infected sample by testing the superpool, and we then repeat this procedure using the subpool that contains the single infected sample.

If the outcome is (a), then we proceed to divide the pool of size $2^{\alpha}$ into two disjoint subpools, each of size $2^{\alpha-1}$. We further select one of the two subpools for testing. If the subpool tested contains a single infected individual, then we continue by applying the procedure discussed in the previous paragraph to the subpool of size $2^{\alpha-1}$ that contains one infected sample. Otherwise, we repeat the procedure from this paragraph on one of the subpools of size $2^{\alpha-1}$ that contains more than a single infected subject.

Using the procedure described above, it is straightforward to verify that recovering an infected individual requires at most $\alpha$ tests, provided we know the number of infected samples in the pool of size $2^{\alpha}$. Now suppose $n = d \cdot 2^{\alpha}$. Then we can recover $d$ infected subjects using at most $2d + d \cdot \alpha$ tests as follows. First, we partition the set of $n$ individuals into $d$ groups, each of size $2^{\alpha}$, and we initially test each of these $d$ groups. Afterward, we search for the infected individuals using the process outlined in this example.
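A minimal sketch of the deep-search step just described, assuming the outcome map (12) and superpool weights 0, 1, 2, 4 (the function names are illustrative):

```python
def quantize(count: int) -> int:
    """Quantized test outcome under the threshold model (12)."""
    if count == 0:
        return 0
    if count == 1:
        return 1
    return 2 if count <= 3 else 3  # [2, 3] -> 2, [4, 7] -> 3

def locate_subpool(infected_subpool: int) -> int:
    """With exactly one infected sample, a single superpool test containing
    0, 1, 2, and 4 copies of the four subpools returns an outcome that
    directly indexes the infected subpool."""
    weights = [0, 1, 2, 4]
    return quantize(weights[infected_subpool])

# Outcomes 0, 1, 2, 3 identify subpools 0, 1, 2, 3, respectively.
assert [locate_subpool(i) for i in range(4)] == [0, 1, 2, 3]
```

The weights are chosen so that each possible copy count (0, 1, 2, 4) falls into a distinct quantization bin, which is what makes one test sufficient.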
We note that despite the fact that we have focused on the case where $m$ is a power of two, the next example shows that in some cases our ideas extend to settings where $m$ is not necessarily a power of two. In the next example, we show an adaptive scheme that requires at most roughly $d + d \cdot \left( \log_2\left(\frac{n}{d}\right) + 1 \right)$ tests. The ideas are similar to the previous example, except that here we only allow two thresholds.

Example 19.
For this example, we assume that $n = d \cdot 2^{\alpha}$. The output of the test is $t$, where
$$t = \begin{cases} 0, & \text{if no infected samples are present in the pool}, \\ 1, & \text{if the number of infected samples equals } 1, \\ 2, & \text{if the number of infected samples is } > 1. \end{cases} \quad (13)$$

The core idea behind the testing strategy is a simple extension of the previous example. Suppose we have a pool of size $2^{\alpha}$ that contains at least one infected individual. First, we test this pool of size $2^{\alpha}$ to determine the total number of infected individuals. There are two cases to consider: (a) the output of the test equals $2$, which indicates that there is more than one infected sample in the pool, or (b) the output of the test equals $1$.

Suppose (b) occurred. We perform the same procedure as before, except that instead of dividing the pool into $4$ subpools, we divide the pool into $3$ subpools of equal size. Next, we form a superpool from these subpools which contains $0$ samples from the first subpool, $1$ sample from the second, and $2$ samples from the third. Similarly as before, we can determine which of the three subpools contains the single infected sample, and we then repeat this procedure using the subpool that contains the single infected sample.

If (a) occurs, then we perform the same procedure as in the previous example. In particular, we divide the pool of size $2^{\alpha}$ into two disjoint subpools, each of size at most $\lceil 2^{\alpha}/2 \rceil$, and perform a single test. If two infected individuals are contained in a single subpool, then we repeat the procedure from this paragraph on the subpool of size at most $\lceil 2^{\alpha}/2 \rceil$ that contains at least two infected samples. Otherwise, we perform the procedure from the previous paragraph on the subpool of size at most $\lceil 2^{\alpha}/2 \rceil$ that contains a single infected sample.

Using this approach, it is straightforward to verify that recovering an infected individual requires at most $\alpha + 1$ tests. Thus, we can recover $d$ infected individuals using at most $d + d \cdot (\alpha + 1)$ tests as follows. First, we partition the set of $n$ individuals into $d$ groups, each of size $2^{\alpha}$.
Afterward, we search for the infected individuals using the process outlined in this example.

Since the model proposed in the previous example is identical to the one used for probabilistic priors and described in Section IV, we now directly compare the two in terms of the number of tests required per individual. Recall that the model in Section IV required on average
$$\frac{1}{s} + p \cdot (1 - p)^{s-1} \cdot \lceil \log_2 s \rceil + \left( 1 - (1 - p)^{s} - s \cdot p \cdot (1 - p)^{s-1} \right) \quad (14)$$
tests per individual, where $s$ represents the size of each subpool used in the first step of the corresponding algorithm. For $n$ large enough, our setup requires approximately
$$2p + p \cdot \log_2\left( \frac{1}{p} \right) \quad (15)$$
tests per individual, where $p = \frac{d}{n}$. Figure 15 compares the number of tests required by (14) and (15). Notice that the ASQGT scheme from the previous example requires only roughly half the tests per individual of the approach from Section IV. However, the latter method is simpler to implement in practice, since it only requires two stages of testing. We also note that since the schemes are not exclusive, it is possible to use a combination of both approaches if needed.
Figure 15:
Comparison of the average number of tests per individual for the two-threshold schemes (14) and (15). The choice of $s$ is optimized for each value of $p$ in the probabilistic priors setting.

VI. A SHORT NOTE ON COMMUNITY-AWARE TESTING
As mentioned previously, in order to formulate effective and optimal testing schemes, the underlying community network structure must be incorporated. Toward that end, we assume that the community labels are known, as well as their sizes (not the entire network, but entities like families or clusters of families in close proximity; this assumption is realistic, as most testing sites require subjects to submit their addresses). The aim is to find efficient strategies that will identify communities with high infection rates, rather than infected individuals, as this would guide effective quarantine strategies.

More precisely, consider a partition of a population of $N$ individuals into communities $A_1, A_2, \dots, A_f$ of sizes $N_1, N_2, \dots, N_f$, respectively, where $N = N_1 + \dots + N_f$ and $f \geq 2$. Each community has some (unknown) number of infected individuals equal to $d_i$, $i = 1, \dots, f$, and $d = \sum_{i=1}^{f} d_i$. The following question is of interest: devise adaptive and nonadaptive GT and SQGT schemes that identify heavy hitter communities, i.e., communities with at least $d/k$ infected individuals, where $k$ is an input parameter.

The naive nonadaptive scheme for this problem corresponds to running within each of the $f$ communities an optimal testing scheme for determining the number of infected individuals. Each such scheme requires $\Theta(\log N_i)$ tests [68], leading to a total of $\Theta\left(\sum_{i=1}^{f} \log N_i\right)$ tests. Note that $\Theta(\log N)$ tests are both necessary and sufficient to estimate the number of defectives in general, but it is conceivable that a better result is possible when we already know an upper bound on this number, which is the case here. This approach is suboptimal, and we describe an alternative nonadaptive binary testing scheme for the heavy hitters problem that requires significantly fewer tests when the total number of infected individuals, $d$, is much smaller than the population size $N$. We leave as an important open problem the construction of semiquantitative testing schemes for the heavy hitters problem with few adaptivity stages that require significantly fewer tests than the scheme we propose below.

First, we note that the heavy hitters problem can be solved with $t = O(k \log f)$ queries that on input a set $T \subseteq [f]$ output $\sum_{i \in T} d_i$ [45], which corresponds to the quantitative GT model. Indeed, this corresponds to the setting of compressed sensing with 0/1 linear tests. We show below how to emulate such a query with error probability $\tilde{\epsilon}$ using $r = O\left(2^{d} \cdot \log(1/\tilde{\epsilon})\right)$ randomized disjunctive queries. Combining the two observations above immediately yields a nonadaptive group testing scheme using $r \cdot t = O\left(k \cdot \log f \cdot 2^{d} \cdot \log(1/\tilde{\epsilon})\right)$ tests that solves the heavy hitters problem with error probability at most $t \cdot \tilde{\epsilon}$ via a union bound.
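A Monte Carlo sketch of this emulation (in Python; the community counts, the number of tests r, and the seed are illustrative, while the inclusion probability 1/2 and the closest-integer estimator follow the Chernoff-based argument of this section):

```python
import math
import random

def emulate_sum_query(d_counts, r, rng):
    """Estimate D = sum(d_counts) over the queried communities using r
    randomized disjunctive (OR) tests. Each individual joins a test
    independently with probability 1/2, so a test is negative iff every
    one of the D infected individuals is left out: Pr[negative] = 2^(-D)."""
    D = sum(d_counts)  # ground truth, hidden from the estimator
    negatives = sum(
        all(rng.random() < 0.5 for _ in range(D))  # every infected excluded
        for _ in range(r)
    )
    frac = max(negatives, 1) / r     # guard against log2(0)
    return round(-math.log2(frac))   # closest integer to -log2 of the fraction

rng = random.Random(0)
d_counts = [2, 0, 1, 2]  # illustrative infected counts per community; D = 5
r = 8192                 # r = O(2^d log(1/eps)) randomized tests
print(emulate_sum_query(d_counts, r, rng))
```

Since the negative-test fraction concentrates around 2^(-D), rounding its negative base-2 logarithm recovers D with high probability once r is large relative to 2^D.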
For example, if $d < \log \log \log N$ and $\tilde{\epsilon} \leq \frac{1}{t \log N}$, then the required number of tests may be significantly smaller than that required by the naive scheme from the previous paragraph.

It remains to show how to emulate the query $\sum_{i \in T} d_i$ with error probability $\tilde{\epsilon}$ using $r$ tests. Consider a randomized test obtained by independently including each individual $j \in \bigcup_{i \in T} A_i$ in the test with probability $1/2$, and let $Y$ denote the test output. Then, we have $\Pr[Y = 0] = \prod_{i \in T} 2^{-d_i} = 2^{-\sum_{i \in T} d_i}$, since the test outputs $0$ if and only if no infected individual is included. Noting that $\sum_{i \in T} d_i \leq d$, a Chernoff bound guarantees that we can determine $\sum_{i \in T} d_i$ with error probability at most $\tilde{\epsilon}$ by independently sampling $r = O\left(2^{d} \log(1/\tilde{\epsilon})\right)$ such tests $Y_1, Y_2, \dots, Y_r$ and setting our estimate to be $\left\lfloor -\log_2\left( \frac{1}{r} \sum_{a=1}^{r} \mathbb{1}\{Y_a = 0\} \right) \right\rceil$, where $\lfloor x \rceil$ denotes the closest integer to $x$. This yields the desired result.

We conclude the discussion with the following remarks:

1) The above scheme provides a reduction from community-aware testing of heavy hitters to the standard heavy hitters problem in compressed sensing. While this provides a proof of concept, it remains to be seen whether a direct approach can provide more effective test designs for the identification of heavy hitter communities via disjunctive queries.

2) An important direction is to construct an analogous approach for the detection of heavy hitter communities using SQGT schemes. Recall that quantitative group testing (equivalently, the compressed sensing model via binary matrices) is an extremal special case of SQGT, whereas standard group testing is another extreme. It is therefore natural to expect that SQGT provides the identification of heavy hitter communities at varying efficiency depending on the granularity of the quantization levels.

3) Furthermore, we ask whether adaptivity can help to identify heavy hitter communities more efficiently than nonadaptive schemes can offer.

VII. MUTATIONS AND RT-PCR NOISE
A number of works have focused on RT-PCR asymmetric error models that assume that positive samples can actually test negative, while the opposite scenario is highly unlikely [29]. As already pointed out, these assumptions are not practically justified, since even a small number of viral cDNA fragments can lead to detectable fluorescence levels after sufficiently many cycles of amplification. Hence, the cause of false negative measurements does not lie in inadequate PCR testing but rather in erroneous sample collection, in addition to possible mutations in the genomic regions used as primers. In both cases, no matter how many times an RT-PCR test is repeated, an infected sample may not be identified (i.e., the sample is masked). As an example, the CDC originally identified three pairs of primers from the N open reading frame (gene) of the SARS-CoV-2 virus for testing, but one pair has since been removed due to the presence of mutations in a larger-than-acceptable fraction of the population. There are further efforts to reduce the problem of false negatives due to mutations, such as using primers from at least two genes [69].

The problem of mutations in primer regions of individuals tested via RT-PCR is of relevance to GT both from the perspective of measurement modeling (as mutations add an additional level of nonlinearity to the measurements that is not captured by current approaches) and from that of error analysis. It is important to observe that mutations or undesired hybridization to nontarget regions may lead to variable test outcomes for the same sample in different tests. As a result, trying to exactly estimate the viral load of the individuals, as proposed in [25], is not possible, and estimates like the ones used in the SQGT framework may be more appropriate.
Furthermore, quantization noise is signal-dependent, which is another desirable feature of the SQGT framework.

To examine the influence of mutations on RT-PCR, we examined the GISAID [41] database of SARS-CoV-2 genomes and identified individuals with mutations in the N1 and N2 primer regions. Using the FastPCR online simulation software [70], we examined the influence of the mutations on primer binding and PCR amplification. The results are summarized in Table I, and the complete primer and DNA sequences are given in Appendix A. There, the symbols 'f' and 'r' refer to the DNA strands' forward and reverse directions, respectively. In the forward direction, the genome and primer have to be an exact match, while in the reverse direction the two strings have to be Watson-Crick complementary. As one can observe, mutations along the forward (reverse) direction of the N1 or N2 regions that destroy the exact match (or the Watson-Crick complement) can severely affect the efficiency of primer/target bonding. For a more precise characterization, the corresponding melting temperatures are given in Table I, along with an estimate of the primer binding efficiency for the N1 and N2 regions. As seen in the results of the simulation, not all samples that have mutations along the primer region are amplified.

VIII. CONCLUSION
We provided a brief overview of existing testing protocols for Covid-19, along with descriptions of state-of-the-art pooling algorithms proposed in the literature. In addition, we presented a collection of new algorithms and testing schemes specifically tailored for RT-PCR testing protocols that improve upon prior models and other GT schemes. Many open problems remain, including:

TABLE I: Simulation results

Patient          Primer Region   Amplification Predicted?   Melting Temperature (in C)   Amplification %
EPI ISL 413609   N1f             Y                          51.6                         100
                 N1r             Y                          56.3                         100
                 N2f             Y                          52.9                         100
                 N2r             Y                          52.2                         97
EPI ISL 415600   N1f             Y                          51.6                         100
                 N1r             N                          51.7                         97
                 N2f             Y                          52.9                         100
                 N2r             Y                          54.7                         100
EPI ISL 416650   N1f             N                          42.6                         97
                 N1r             Y                          56.3                         100
                 N2f             Y                          52.9                         100
                 N2r             Y                          54.7                         100
EPI ISL 417938   N1f             Y                          51.6                         100
                 N1r             Y                          56.3                         100
                 N2f             Y                          47.3                         95
                 N2r             Y                          54.7                         100
EPI ISL 422983   N1f             Y                          44.6                         95
                 N1r             N                          52.1                         95
                 N2f             Y                          52.9                         100
                 N2r             Y                          54.7                         100
EPI ISL 424955   N1f             Y                          51.6                         100
                 N1r             Y                          56.3                         100
                 N2f             N                          43.1                         97
                 N2r             Y                          54.7                         100
EPI ISL 425148   N1f             Y                          51.6                         100
                 N1r             N                          51.7                         97
                 N2f             Y                          52.9                         100
                 N2r             Y                          54.7                         100

• Probabilistic testing schemes for more than two thresholds: In Section IV, we considered the setup where each test generated the output 0, 1, or 2, depending upon the number of defectives in each group. How much can one reduce the number of tests of our schemes if we incorporate additional semi-quantitative information, in the presence of errors?

• Worst-case testing schemes with a constant number of rounds: The schemes described in Section V have the potential drawback that almost every test depends upon the results of prior tests. It has been shown that in the binary group testing setting, the information-theoretic lower bound can be achieved using only two rounds of non-adaptive testing when the number of infected individuals is at most $n^c$ for any constant $c < 1$ [71]. Can a similar result be shown in the setting where there is more than one threshold?

• Practical SQGT schemes resilient to errors:
Our practical two-stage SQGT schemes from Section IV-A can be enhanced with noise-resilience properties in a straightforward manner by repeating each test a prescribed number of times, while keeping the number of testing stages the same. Nevertheless, it would be interesting to find more efficient, and still practical, ways of adding good noise-resilience properties to these schemes.

• Community-Aware Testing: Section VI presented a simple first effort toward the problem of identifying heavy hitter communities. Can such schemes be improved?

ACKNOWLEDGEMENT
The authors are grateful to Sergei Maslov and Nigel Goldenfeld from the University of Illinois for several insightful discussions.

REFERENCES

[5] A. Stone, “Nebraska public health lab begins pool testing COVID-19 samples,” KETV Omaha, 2020.
[6] R. Dorfman, “The detection of defective members of large populations,”
The Annals of Mathematical Statistics , vol. 14, no. 4, pp. 436–440,1943.[7] M. Aldridge, O. Johnson, and J. Scarlett, “Group testing: An information theory perspective,”
Foundations and Trends in Communicationsand Information Theory , vol. 15, no. 3-4, pp. 196–392, 2019.[8] B. Abdalhamid, C. R. Bilder, E. L. McCutchen, S. H. Hinrichs, S. A. Koepsell, and P. C. Iwen, “Assessment of specimenpooling to conserve SARS CoV-2 testing resources,”
American Journal of Clinical Pathology, 04 2020, aqaa064. [Online]. Available: https://doi.org/10.1093/ajcp/aqaa064 [9] A. Z. Broder and R. Kumar, “A note on double pooling tests,” arXiv e-prints, Mar. 2020. [13] K. R. Narayanan, A. Heidarzadeh, and R. Laxminarayan, “On accelerated testing for COVID-19 using group testing,” arXiv e-prints, Apr. 2020. [14] M. Täufer, “Rapid, large-scale, and effective detection of COVID-19 via non-adaptive testing,” medRxiv, 2020.
Chemical Engineering Science, vol. 65, no. 17, pp. 4996–5006, 2010. [22] V. Rana, E. Chien, J. Peng, and O. Milenkovic, “How fast does the SARS-CoV-2 virus really mutate in heterogeneous populations?” medRxiv, 2020. [23] J. Yi, R. Mudumbai, and W. Xu, “Low-cost and high-throughput testing of COVID-19 viruses and antibodies via compressed sensing: System concepts and computational experiments,” arXiv e-prints, Apr. 2020. [24] H. Bernd Petersen, B. Bah, and P. Jung, “Efficient noise-blind ℓ1-regression of nonnegative compressible signals,” arXiv e-prints, Mar. 2020. [25] S. Ghosh, A. Rajwade, S. Krishna, N. Gopalkrishnan, T. E. Schaus, A. Chakravarthy, S. Varahan, V. Appu, R. Ramakrishnan, S. Ch, M. Jindal, V. Bhupathi, A. Gupta, A. Jain, R. Agarwal, S. Pathak, M. A. Rehan, S. Consul, Y. Gupta, N. Gupta, P. Agarwal, R. Goyal, V. Sagar, U. Ramakrishnan, S. Krishna, P. Yin, D. Palakodeti, and M. Gopalkrishnan, “Tapestry: A single-round smart pooling technique for COVID-19 testing,” medRxiv, 2020. [26] N. Shental, S. Levy, V. Wuvshet, S. Skorniakov, B. Shalem, A. Ottolenghi, Y. Greenshpan, R. Steinberg, A. Edri, R. Gillis, M. Goldhirsh, K. Moscovici, S. Sachren, L. M. Friedman, L. Nesher, Y. Shemer-Avni, A. Porgador, and T. Hertz, “Efficient high-throughput SARS-CoV-2 testing to detect asymptomatic carriers,” Science Advances, 2020. [Online]. Available: https://advances.sciencemag.org/content/early/2020/08/20/sciadv.abc5961 [27] M. J. Mina, R. Parker, and D. B. Larremore, “Rethinking Covid-19 test sensitivity – a strategy for containment,”
New England Journal ofMedicine , vol. 0, no. 0, p. null, 0. [Online]. Available: https://doi.org/10.1056/NEJMp2025631[28] R. Arnaout, R. A. Lee, G. R. Lee, C. Callahan, C. F. Yen, K. P. Smith, R. Arora, and J. E. Kirby, “SARS-CoV2 Testing: The limit ofdetection matters,” bioRxiv
IEEE Transactions on Information Theory , vol. 60, no. 8, pp. 4614–4636,2014.[31] ——, “Group testing for non-uniformly quantized adder channels,” in , 2014,pp. 2351–2355.[32] ——, “Code construction and decoding algorithms for semi-quantitative group testing with nonuniform thresholds,”
IEEE Transactions onInformation Theory , vol. 62, no. 4, pp. 1674–1687, 2016.[33] F. Hwang, “A generalized binomial group testing problem,”
Journal of the American Statistical Association , vol. 70, no. 352, pp. 923–926,1975.[34] A. Emad and O. Milenkovic, “Poisson group testing: A probabilistic model for nonadaptive streaming boolean compressed sensing,” in . IEEE, 2014, pp. 3335–3339.[35] J. Wolf, “Born again group testing: Multiaccess communications,”
IEEE Transactions on Information Theory , vol. 31, no. 2, pp. 185–191,1985.[36] B. Lindstrom, “Determining subsets by unramified experiments,”
A Survey of Statistical Design and Linear Models , 1975.[37] A. Dyachkov and V. Rykov, “Generalized superimposed codes and their application to random multiple access,” in
Proc. 6th Int. Symp.Inf. Theory , vol. 1, 1984, pp. 62–64.[38] Y. Shu and J. McCauley, “GISAID: Global initiative on sharing all influenza data–from vision to reality,”
Eurosurveillance
Analytical Biochemistry medRxiv , 2020.[43] Y. Liu, L.-M. Yan, L. Wan, T.-X. Xiang, A. Le, J.-M. Liu, M. Peiris, L. L. M. Poon, and W. Zhang, “Viral dynamics inmild and severe cases of COVID-19,”
The Lancet Infectious Diseases
Proceedings of the 32nd ACMSIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems , 2013, pp. 87–90.[46] A. G. D’yachkov and V. V. Rykov, “Bounds on the length of disjunctive codes,”
Problemy Peredachi Informatsii , vol. 18, no. 3, pp. 7–13,1982.[47] E. Porat and A. Rothschild, “Explicit non-adaptive combinatorial group testing schemes,” in
Automata, Languages and Programming ,L. Aceto, I. Damgård, L. A. Goldberg, M. M. Halldórsson, A. Ingólfsdóttir, and I. Walukiewicz, Eds. Berlin, Heidelberg: Springer BerlinHeidelberg, 2008, pp. 748–759.[48] J. Scarlett, “An efficient algorithm for capacity-approaching noisy adaptive group testing,” in . IEEE, 2019, pp. 2679–2683.[49] F. K. Hwang, “A method for detecting all defective members in a population by group testing,”
Journal of the American StatisticalAssociation , vol. 67, no. 339, pp. 605–608, 1972.[50] P. Damaschke, “Threshold group testing,” in
General theory of information transfer and combinatorics . Springer, 2006, pp. 707–718.[51] M. Cheraghchi, “Improved constructions for non-adaptive threshold group testing,”
Algorithmica , vol. 67, no. 3, pp. 384–417, 2013.[52] S. Aeron, V. Saligrama, and M. Zhao, “Information theoretic bounds for compressed sensing,”
IEEE Transactions on Information Theory ,vol. 56, no. 10, pp. 5111–5130, 2010.[53] R. G. Baraniuk, “Compressive sensing [lecture notes],”
IEEE signal processing magazine , vol. 24, no. 4, pp. 118–121, 2007. [54] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Transactions on Information theory , vol. 50, no. 10, pp.2231–2242, 2004.[55] W. Dai and O. Milenkovic, “Subspace pursuit for compressive sensing signal reconstruction,”
IEEE transactions on Information Theory ,vol. 55, no. 5, pp. 2230–2249, 2009.[56] S. Ghosh, R. Agarwal, M. A. Rehan, S. Pathak, P. Agrawal, Y. Gupta, S. Consul, N. Gupta, R. Goyal, A. Rajwade et al. , “A compressedsensing approach to group-testing for COVID-19 detection,” arXiv preprint arXiv:2005.07895 , 2020.[57] W. Dai, M. A. Sheikh, O. Milenkovic, and R. G. Baraniuk, “Compressive sensing DNA microarrays,”
EURASIP journal on bioinformaticsand systems biology , vol. 2009, no. 1, p. 162824, 2008.[58] W. Dai, H. V. Pham, and O. Milenkovic, “A comparative study of quantized compressive sensing schemes,” in . IEEE, 2009, pp. 11–15.[59] W. Dai and O. Milenkovic, “Information theoretical and algorithmic approaches to quantized compressive sensing,”
IEEE transactions oncommunications , vol. 59, no. 7, pp. 1857–1866, 2011.[60] W. Dai and O. Milenkovic, “Weighted superimposed codes and constrained integer compressed sensing,”
IEEE Transactions on InformationTheory , vol. 55, no. 5, pp. 2215–2229, 2009.[61] A. Heidarzadeh and K. R. Narayanan, “Two-stage adaptive pooling with RT-qPCR for COVID-19 screening,” arXiv preprintarXiv:2007.02695 , 2020.[62] M. Cheraghchi, A. Karbasi, S. Mohajer, and V. Saligrama, “Graph-constrained group testing,”
IEEE Transactions on Information Theory ,vol. 58, no. 1, pp. 248–262, 2012.[63] W. E. Parmet and M. S. Sinha, “Covid-19 – the law and limits of quarantine,”
New England Journal of Medicine , vol. 382, no. 15, p. e28,2020.[64] B. W. Buchan, J. S. Hoff, C. G. Gmehlin, A. Perez, M. L. Faron, L. S. Munoz-Price, and N. A. Ledeboer, “Distribution of SARS-CoV-2PCR cycle threshold values provide practical insight into overall and target-specific sensitivity among symptomatic patients,”
AmericanJournal of Clinical Pathology , vol. 154, no. 4, pp. 479–485, 07 2020. [Online]. Available: https://doi.org/10.1093/ajcp/aqaa133[65] M. Aldridge, “Individual testing is optimal for nonadaptive group testing in the linear regime,”
IEEE Transactions on Information Theory ,vol. 65, no. 4, pp. 2058–2061, 2019.[66] C. M. Fortuin, P. W. Kasteleyn, and J. Ginibre, “Correlation inequalities on some partially ordered sets,”
Communications in MathematicalPhysics , vol. 22, no. 2, pp. 89–103, 1971.[67] D. Du, F. K. Hwang, and F. Hwang,
Combinatorial group testing and its applications . World Scientific, 2000, vol. 12.[68] P. Damaschke and A. S. Muhammad, “Bounds for nonadaptive group tests to estimate the amount of defectives,” in
CombinatorialOptimization and Applications , W. Wu and O. Daescu, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 117–130.[69] M. Park, J. Won, B. Y. Choi, and C. J. Lee, “Optimization of primer sets and detection protocols for SARS-CoV-2 of coronavirus disease2019 (COVID-19) using PCR and real-time PCR,”
Experimental & Molecular Medicine, vol. 52, no. 6, pp. 963–977, 2020. [70] PrimerDigital, “Primerdigital,” http://primerdigital.com/tools/, 2020. [71] M. Hahn-Klimroth and P. Loick, “Optimal adaptive group testing,” arXiv e-prints, Nov. 2019.

APPENDIX
A. Simulation Results: PCR on DNA strands with mutations along primer regions
We present below the results of the PCR simulation run on DNA sequences that contain mutations along the N1and N2 regions of the genome. The notation EPI ISL xxxxxx corresponds to the sample ID. As already indicated,the symbols ‘f’ and ’r’ at the end of the primer regions indicate the DNA strand directions, forward and reverserespectively. The symbol ‘Y’ indicates a successful PCR amplification, while the symbol ‘N’ indicates that PCRamplification cannot be initiated. We also list the percentage of the primer string that is matched by genomicDNA, and the melting temperature T m .EPI ISL 413609N1f5 - g a c c c c a a a a t c a g c g a a a t Y 100% T m =51.6 ˝ C | | | | | | | | | | | | | | | | | | | | t g g a c c c c a a a a t c a g c g a a a t g c a cN1rg t c t a a g t t g a c c g t c a t t g g t c t - 5 Y 100% T m =56.3 ˝ C | | | | | | | | | | | | | | | | | | | | | | | | c t c a g a t t c a a c t g g c a g t a a c c a g a a t g gN2f5 - t t a c a a a c a t t g g c c g c a a a Y 100% T m =52.9 ˝ C | | | | | | | | | | | | | | | | | | | | g a t t a c a a a c a t t g g c c g c a a a t t g cN2ra a g a a g c c t t a c a g c g c g - 5 Y 97% T m =52.2 ˝ C | | | | | | | | | | | | | | | | | :c g t t c t t c g g a a t g t c g c g t a t t gEPI ISL 415600N1f5 - g a c c c c a a a a t c a g c g a a a t Y 100% T m =51.6 ˝ C | | | | | | | | | | | | | | | | | | | | t g g a c c c c a a a a t c a g c g a a a t g c a cN1rg t c t a a g t t g a c c g t c a t t g g t c t - 5 N 97% T m =51.7 ˝ C | | | | | | | | | : | | | | | | | | | | | | | | c t c a g a t t c a a t t g g c a g t a a c c a g a a t g gN2f5 - t t a c a a a c a t t g g c c g c a a a Y 100% T m =52.9 ˝ C | | | | | | | | | | | | | | | | | | | | g a t t a c a a a c a t t g g c c g c a a a t t g cN2ra a g a a g c c t t a c a g c g c g - 5 Y 100% T m =54.7 ˝ C | | | | | | | | | | | | | | | | | | c g t t c t t c g g a a t g t c g c g c a t t gEPI ISL 416650N1f5 - g a c c c c a a a a t c a g c g a a a t N 97% T m =42.6 ˝ C | | | | | | 
| | | | | | | : | | | | | | t g g a c c c c a a a a t c a t c g a a a t g c a cN1rg t c t a a g t t g a c c g t c a t t g g t c t - 5 Y 100% T m =56.3 ˝ C | | | | | | | | | | | | | | | | | | | | | | | | c t c a g a t t c a a c t g g c a g t a a c c a g a a t g gN2f5 - t t a c a a a c a t t g g c c g c a a a Y 100% T m =52.9 ˝ C | | | | | | | | | | | | | | | | | | | | g a t t a c a a a c a t t g g c c g c a a a t t g cN2ra a g a a g c c t t a c a g c g c g - 5 Y 100% T m =54.7 ˝ C | | | | | | | | | | | | | | | | | | c g t t c t t c g g a a t g t c g c g c a t t gEPI ISL 417938N1f5 - g a c c c c a a a a t c a g c g a a a t Y 100% T m =51.6 ˝ C | | | | | | | | | | | | | | | | | | | | t g g a c c c c a a a a t c a g c g a a a t g c a cN1rg t c t a a g t t g a c c g t c a t t g g t c t - 5 Y 100% T m =56.3 ˝ C | | | | | | | | | | | | | | | | | | | | | | | | c t c a g a t t c a a c t g g c a g t a a c c a g a a t g gN2f5 - t t a c a a a c a t t g g c c g c a a a Y 95% T m =47.3 ˝ C | | | : | | | | | | | | | | | | | | | | g a t t a t a a a c a t t g g c c g c a a a t t g cN2ra a g a a g c c t t a c a g c g c g - 5 Y 100% T m =54.7 ˝ C | | | | | | | | | | | | | | | | | | c g t t c t t c g g a a t g t c g c g c a t t gEPI ISL 422983N1f5 - g a c c c c a a a a t c a g c g a a a t Y 95% T m =44.6 ˝ C | | : | | | | | | | | | | | | | | | | | t g g a a c c c a a a a t c a g c g a a a t g c a cN1rg t c t a a g t t g a c c g t c a t t g g t c t - 5 N 95% T m =52.1 ˝ C | | | | | : | | | | | | | | | | | | | | | | | | c t c a g a t a c a a c t g g c a g t a a c c a g a a t g gN2f5 - t t a c a a a c a t t g g c c g c a a a Y 100% T m =52.9 ˝ C | | | | | | | | | | | | | | | | | | | | g a t t a c a a a c a t t g g c c g c a a a t t g cN2ra a g a a g c c t t a c a g c g c g - 5 Y 100% T m =54.7 ˝ C | | | | | | | | | | | | | | | | | | c g t t c t t c g g a a t g t c g c g c a t t gEPI ISL 424955N1f5 - g a c c c c a a a a t c a g c g a a a t Y 100% T m =51.6 ˝ C | | | | | | | | | | | | | | 
| | | | | | t g g a c c c c a a a a t c a g c g a a a t g c a cN1rg t c t a a g t t g a c c g t c a t t g g t c t - 5 Y 100% T m =56.3 ˝ C | | | | | | | | | | | | | | | | | | | | | | | | c t c a g a t t c a a c t g g c a g t a a c c a g a a t g gN2f5 - t t a c a a a c a t t g g c c g c a a a N 97% T m =43.1 ˝ C | | | | | | | | | | | | | | | : | | | | g a t t a c a a a c a t t g g c c t c a a a t t g cN2ra a g a a g c c t t a c a g c g c g - 5 Y 100% T m =54.7 ˝ C | | | | | | | | | | | | | | | | | | c g t t c t t c g g a a t g t c g c g c a t t g EPI ISL425148n1f5 - g a c c c c a a a a t c a g c g a a a t Y 100% T m =51.6 ˝ C | | | | | | | | | | | | | | | | | | | | t g g a c c c c a a a a t c a g c g a a a t g c a cN1rg t c t a a g t t g a c c g t c a t t g g t c t - 5 N 97% T m =51.7 ˝ C | | | | | | | | | : | | | | | | | | | | | | | | c t c a g a t t c a a t t g g c a g t a a c c a g a a t g gN2f5 - t t a c a a a c a t t g g c c g c a a a Y 100% T m =52.9 ˝ C | | | | | | | | | | | | | | | | | | | | g a t t a c a a a c a t t g g c c g c a a a t t g cN2ra a g a a g c c t t a c a g c g c g - 5 Y 100% T m =54.7 ˝ C | | | | | | | | | | | | | | | | | || | | | | | | | | | | | | | | | | |