Adaptive Group Testing on Networks with Community Structure
Surin Ahn, Wei-Ning Chen, and Ayfer Özgür
Department of Electrical Engineering, Stanford University
{surinahn, wnchen, aozgur}@stanford.edu

Abstract
Since the inception of the group testing problem in World War II, the prevailing assumption in the probabilistic variant of the problem has been that individuals in the population are infected by a disease independently. However, this assumption rarely holds in practice, as diseases typically spread through connections between individuals. We introduce an infection model for networks, inspired by characteristics of COVID-19 and similar diseases, which generalizes the traditional i.i.d. model from probabilistic group testing. Under this infection model, we ask whether knowledge of the network structure can be leveraged to perform group testing more efficiently, focusing specifically on community-structured graphs drawn from the stochastic block model. Through both theory and simulations, we show that when the network and infection parameters are conducive to "strong community structure," our proposed adaptive, graph-aware algorithm outperforms the baseline binary splitting algorithm, and is even order-optimal in certain parameter regimes. Finally, we derive novel information-theoretic lower bounds which highlight the fundamental limits of adaptive group testing in our networked setting.
Identifying individuals who are infected by a disease is crucial for curbing epidemics and ensuring the well-being of society. However, due to high costs or limited resources, it is often infeasible to test every member of the population individually. During World War II, when the U.S. military sought to identify soldiers infected with syphilis, Dorfman made a breakthrough by introducing the concept of group testing [1]. He showed that by testing groups or pools of samples rather than individual samples, the infected people in a population of size n can be identified with far fewer than n tests. The key insight was that if the infected population is sparse, then each pooled test is likely to produce a negative result, in which case all individuals included in the test can be deemed "not infected" even though only a single test was performed. Today, group testing schemes are actively being used in the COVID-19 pandemic to identify infected individuals in an efficient and cost-effective manner [2–5]. Group testing is also useful in numerous application domains beyond healthcare, such as wireless communications [6–10], machine learning [11–13], signal processing [14], and data streaming [15].

Dorfman's seminal work, and subsequent works by other authors on the so-called probabilistic group testing problem [6, 16–18], assume that the disease infects individuals in a statistically independent fashion. However, this assumption rarely holds in practice. Diseases typically spread through connections between individuals (e.g., familial, work-related, or other social connections), thereby inducing correlated infections. It is therefore natural to ask whether exploiting information about this connectivity structure can lead to more efficient group testing strategies.
This problem is especially timely given the critical role that group testing is playing in the current COVID-19 pandemic, and the fact that the disease is known to spread through close contact between individuals.

In this work, we study the group testing problem under interaction networks that dictate the spread of a disease through the population, and investigate whether the graphical structure can be leveraged to perform pooled testing more efficiently than without knowledge of the graph. We focus on networks with community structure: those containing clusters of nodes with denser connections within a cluster than between clusters. Such networks are pervasive in the real world – social, biological, and information networks commonly exhibit community structure – and can often be estimated in practice, thanks to the availability of large datasets and network estimation techniques. Additionally, we introduce an infection model for arbitrary networks which generalizes the standard i.i.d. model from the probabilistic group testing literature.

On the algorithmic side, we consider adaptive group testing schemes, where the design of each test can be informed by the previous test results. We compare two different schemes: the standard binary splitting [19] algorithm, which is oblivious to the underlying network structure, and a simple graph-aware algorithm that exploits the community structure of the network. We give precise upper bounds on the expected number of tests performed by each algorithm. Crucially, we show that when the network and infection parameters yield strong community structure (in which case the disease is more likely to be transmitted within a community than between communities), the graph-aware algorithm's average complexity is asymptotically strictly better than that of binary splitting. We corroborate these results with numerical simulations.
Finally, we derive novel information-theoretic lower bounds which asymptotically match the graph-aware algorithm's performance (up to constants) in certain parameter regimes.

We note that our work may be relevant to other settings where the goal is to identify certain objects of interest within a "clustered" population. For example, we may wish to identify the active devices or users in a multiple access network, where devices that are closer together in the network tend to be active or inactive at the same time. Exploring the potential applications of network-oriented group testing to these types of problems is of great interest.

Related Works.
Our work differs from the graph-constrained group testing problem [20–23], in which the tests must conform to a given network topology. In our case, we allow the tests to be arbitrary, but ask whether knowledge of the interaction network can help to reduce the number of required tests. This is similar in spirit to recent work on community-aware group testing [24], though our work departs from it in several ways. First, [24] assumes the population is partitioned into disjoint "families," whereas our work considers more general network structures which allow for transmissions between communities. Second, although we focus on community-structured graphs in this paper, our proposed infection model works on top of arbitrary networks and therefore applies naturally to a broader class of problems. Finally, we give a precise characterization of the improvement provided by our graph-aware algorithm over the baseline, and of the parameter regimes in which our lower bounds are order-optimal.
Paper Organization.
The rest of this paper is organized as follows. In Section 2, we describe the network and infection models, and define our mathematical notation. In Section 3, we provide background and preliminary ideas. In Section 4, we discuss the main algorithms studied in this paper: binary splitting and our proposed graph-aware algorithm. Section 5 gives upper and lower bounds for adaptive group testing on networks consisting of disjoint cliques, and Section 6 generalizes these results to the stochastic block model. Finally, we present the results of our numerical simulations in Section 7, and conclude in Section 8. All omitted proofs are given in the Appendix.
We study the following probabilistic infection model with parameters p, q ∈ [0, 1], which acts upon an arbitrary graph G = (V, E) in two stages (each executed once):

1. Seed Selection:
Each vertex is infected i.i.d. with probability p. These initial infected vertices are called the seeds. They model the introduction of the disease into the population via some external entity (e.g., a traveler carrying the disease into a country).

2. Neighbor Infection:
A seed infects each of its neighbors i.i.d. with probability q. This models how the disease spreads through the population via interactions between carriers and nearby individuals.

Remark 1.
The above stages can be viewed as the "first time step" of a stochastic epidemic model, i.e., the initial spread of an epidemic. It is inspired by diseases such as COVID-19, which are initially introduced into a population from an external source and subsequently transmitted between individuals in close contact. In practice, the specific values of p, q can be tailored to the disease in question (for example, by using contact tracing to estimate the infectiousness of the disease).
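As a concrete reference, the two-stage model can be sketched in a few lines of Python. This is a hypothetical helper, not code from the paper; `adj` is assumed to be an adjacency-list representation of G.

```python
import random

def sample_infections(adj, p, q, rng=random.Random(0)):
    """Two-stage infection model: seeds are chosen i.i.d. w.p. p, then each
    seed infects each of its neighbors independently w.p. q."""
    n = len(adj)
    seeds = [v for v in range(n) if rng.random() < p]
    infected = set(seeds)
    for s in seeds:
        for u in adj[s]:
            if rng.random() < q:
                infected.add(u)
    return seeds, infected
```

Setting q = 0 recovers the i.i.d. model of classical probabilistic group testing, since the neighbor-infection stage then never fires.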
Consider an arbitrary graph with seed selection probability p ∈ [0, 1] and neighbor infection probability q = 0. In this case, our setting reduces to the i.i.d. probabilistic group testing model. Each node is selected as a seed (and thus infected) with probability p, and since transmissions between nodes are not possible, no additional nodes are infected during the neighbor infection phase. It follows that we cannot hope to do any better than classical group testing schemes in this setting.

Proposition 1. Under an arbitrary graph G = (V, E), identifying infected individuals under our infection model with seed selection parameter p ∈ [0, 1] and zero probability of neighbor infection (q = 0) is equivalent to the i.i.d. probabilistic group testing problem with infection probability p.

The same is true of the empty graph (a.k.a. null graph), G = (V, E) where E = ∅, with arbitrary infection parameters p, q ∈ [0, 1]: each node is infected i.i.d. with probability p.

For the rest of this paper, we assume that the underlying network is drawn from the stochastic block model (SBM) [25] – a well-known random graph model with the tendency to produce community-structured graphs. The standard SBM has the following parameters:

• n vertices
• a partition of the vertex set V = {1, 2, ..., n} into m communities C_1, ..., C_m, where ∪_{i∈[m]} C_i = V and C_i ∩ C_j = ∅ for all i ≠ j
• a symmetric matrix P ∈ R^{m×m} of edge probabilities.

The random graph G = (V, E) is then generated in the following way. First, initialize E = ∅. Then for each pair of vertices u ∈ C_i, v ∈ C_j, we add an edge between u and v with probability P_{ij}.

In this paper, we consider a special case of the SBM. We assume the communities are all of size k, where k is a factor of n (so that the number of communities is m = n/k), and that there is a constant edge probability p_1 within communities and probability p_2 between communities. That is, P equals p_1 along the diagonal entries and p_2 on the off-diagonal entries. We further assume that p_1 > p_2, i.e., that edges are more likely to occur within a community than between communities. Finally, we assume that the communities are known to the group testing algorithms in advance, but that the graph itself may not be known.

Stochastic Block Infection Model (SBIM): Our infection model acting upon the SBM can equivalently be studied through a slightly modified infection model which acts upon the complete graph on n vertices: the graph containing all possible n(n − 1)/2 edges. This will reduce the overall number of parameters we have to consider. Our modified model still begins by selecting each node i.i.d. with probability p to be a seed. However, in the neighbor infection phase, each seed infects its neighbors within the same community i.i.d. with probability q_1 and infects those outside its community i.i.d. with probability q_2, where q_1 > q_2. The equivalence of this model and the original model can be seen by setting q_1 = p_1·q and q_2 = p_2·q, where q is the neighbor infection probability in the original model. We call this the Stochastic Block Infection Model, denoted by SBIM(n, k, p, q_1, q_2). Note that SBIM(n, k, p, 0, 0), with k an arbitrary factor of n, is equivalent to the i.i.d. group testing model.

Disjoint k-Cliques Model. Before analyzing the SBIM in full generality in Section 6, we begin in Section 5 by investigating the special case of SBIM(n, k, p, q, 0), the disjoint k-cliques model. Here, we have m = n/k communities of size k, each a complete subgraph on k vertices, with no edges between communities. The transmission rate within a community is q, and no transmissions are possible between communities. Figure 1 illustrates the SBIM and the difference between the disjoint k-cliques model (q_2 = 0) and the general SBIM with q_2 > 0.

[Figure 1: Illustration of SBIM(n, k, p, q_1, q_2). In this example, there are m = 4 communities of size k = 7. Seeds are colored green, and nodes infected by seeds are colored orange. Panels: (a) seed selection stage; (b) neighbor infection with q_2 = 0 (the disjoint k-cliques model), where nodes cannot be infected by seeds outside their own community; (c) neighbor infection with q_2 > 0.]

We now define the mathematical notation used in the rest of this paper.
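Concretely, the SBM sampling procedure just described can be sketched as follows. This is hypothetical code (not from the paper), and it takes the communities to be contiguous blocks of k vertices.

```python
import random

def sample_sbm(n, k, p1, p2, rng=random.Random(0)):
    """Sample the special-case SBM used in the paper: m = n/k equal-size
    communities, edge probability p1 within a community and p2 between
    communities. Returns an adjacency list."""
    assert n % k == 0
    community = [v // k for v in range(n)]  # contiguous blocks of size k
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            prob = p1 if community[u] == community[v] else p2
            if rng.random() < prob:
                adj[u].append(v)
                adj[v].append(u)
    return adj
```

With p_1 = 1 and p_2 = 0 this degenerates to the disjoint k-cliques graph studied in Section 5.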
General notation:

• n: size of the population
• k: size of each community
• m ≜ n/k: number of communities
• [n] ≜ {1, 2, ..., n}
• X ≜ (X_1, ..., X_n) ∈ {0, 1}^n: infection status vector, where X_v = 1 iff vertex v is infected
• X^ℓ ≜ (X_1, ..., X_ℓ), ℓ ∈ [n]
• X_{C_i} ∈ {0, 1}, i ∈ [m]: infection status of community C_i, where X_{C_i} = 1 iff ∃ v ∈ C_i : X_v = 1
• 1_A: indicator function for event A
• H(·): entropy of a discrete random variable (in bits), defined as H(X) ≜ −∑_{x∈X} p(x) log p(x)
• h_b(·): binary entropy function, defined as h_b(p) ≜ −p log p − (1 − p) log(1 − p)
• We write f(x) ≺ g(x) to denote f(x) = o(g(x)), and f(x) ≼ g(x) to denote f(x) = O(g(x)); the relations f(x) ≻ g(x) and f(x) ≽ g(x) are defined analogously for ω(·) and Ω(·)

Graph notation:

• G = (V, E): undirected graph with vertex set V and edge set E
• N(v) ≜ {u ∈ V : (u, v) ∈ E, u ≠ v}: set of neighbors of vertex v
• d(v) ≜ |N(v)|: degree (number of neighbors) of vertex v

In the group testing problem, a test corresponds to a subset of individuals S ⊆ [n]. The test outcome is positive if X_i = 1 for some i ∈ S; that is, if at least one member of S is infected. Otherwise, the test outcome is negative. Equivalently, the outcome is a binary variable Y ∈ {0, 1} given by a boolean OR operation over S:

Y = ⋁_{i∈S} X_i.    (1)

A group testing algorithm or scheme describes how to select subsets S_1, ..., S_T such that the infection statuses X_1, ..., X_n can be determined from the corresponding outcomes Y_1, ..., Y_T. In adaptive schemes, the choice of each S_t is allowed to depend on {S_{t'} : t' < t}. Moreover, due to the underlying randomness in the X_i in our probabilistic setting, the total number of tests T performed by any adaptive scheme is a random variable. In this work, we assume that test outcomes are noiseless (meaning that we get to observe the Y_t as given in (1)), and we require a scheme to exactly recover X_1, ..., X_n (i.e., achieve zero error).

Let G = (V, E) be any finite, undirected graph. For the infection model that we study in this paper, the marginal infection probability of a given vertex v can be characterized in terms of its degree d(v).

Lemma 1.
Let G = (V, E) be a finite, undirected graph. Under G, the infection status of a vertex v ∈ V is X_v ∼ Bernoulli(r_v), where

r_v ≜ P(X_v = 1) = 1 − (1 − p)(1 − pq)^{d(v)}.    (2)

Under a general graph, different nodes may have different degrees and hence different marginal probabilities of infection. From (2), we see that r_v is monotonically non-decreasing in d(v). Note also that the X_v can be correlated.

A fundamental result in probabilistic group testing (see [6] or [17, Theorem 1]) is that any adaptive algorithm which is guaranteed to identify all infected members of the population, assuming noiseless test results, requires a number of tests T satisfying

E[T] ≥ H(X_1, ..., X_n),    (3)

where H(X_1, ..., X_n) is the Shannon entropy of X = (X_1, ..., X_n). This bound highlights the intimate connection between adaptive group testing and source coding. Indeed, the outcomes of the adaptive tests can be viewed as a binary, variable-length source code for X; the lower bound then follows directly from existing results in data compression (see [26, Eqn. 5.38]). Equation (3) will serve as the point of departure for the lower bounds on E[T] that we derive in this paper. The key challenge will be to obtain good approximations to H(X) in the presence of correlated X_v.

Most adaptive group testing algorithms are based on the idea of recursively splitting the population until all infected members are found. The most standard such algorithm is known as binary splitting, which finds one infected member at a time by repeatedly halving the population. This algorithm identifies all infected members using α log n + O(α) adaptive tests (see [27], [19, p. 24], or [28, Theorem 1.2]), where α is the number of infected members. This algorithm works even when α is unknown, and is most effective in the sparse regime α = Θ(n^β), where β ∈ [0, 1).

Lemma 2.
In a population of size n with α infected members, where α ≥ 1, the binary splitting algorithm is guaranteed to identify all infected members using at most α⌈log n⌉ ≤ α log n + α tests.
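Lemma 2's guarantee refers to the classical one-defective-at-a-time scheme; the sketch below implements the closely related recursive-halving variant (a common presentation of binary splitting), in which any pool that tests negative is discarded wholesale. Here `is_positive` plays the role of the pooled OR test in (1); the function and its bookkeeping are our own illustration, not the paper's code.

```python
def binary_splitting(items, is_positive):
    """Recursively halve the population, discarding any pool that tests
    negative; singleton positive pools are declared infected.
    Returns (sorted list of positives, number of pooled tests used)."""
    positives = []
    stack = [list(items)]
    tests_used = 0
    while stack:
        pool = stack.pop()
        if not pool:
            continue
        tests_used += 1
        if not is_positive(pool):
            continue  # everyone in this pool is healthy
        if len(pool) == 1:
            positives.append(pool[0])
            continue
        mid = len(pool) // 2
        stack.append(pool[mid:])
        stack.append(pool[:mid])
    return sorted(positives), tests_used
```

For example, with two infected members among sixteen, `binary_splitting(range(16), lambda pool: any(i in {3, 7} for i in pool))` recovers {3, 7} using far fewer than sixteen tests when the infected set is sparse.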
Adaptive Graph-Aware Algorithm
1. Mix samples within each community.
2. Run binary splitting on the mixed samples to determine which communities contain at least one infected member.
3. For each positive test from Step 2, perform binary splitting within the corresponding community to identify infected members.

Under what circumstances should we expect the graph-aware algorithm to outperform binary splitting? Suppose the underlying interaction network and infection model follow SBIM(n, k, p, q_1, q_2). If the seed selection probability p is small, then we expect only a few of the m = n/k communities to contain a seed. This means that after the neighbor infection stage, several of the communities are likely to contain no infected members at all, especially if q_2 is small. In Step 2 of the graph-aware algorithm, we can efficiently rule out these uninfected communities from consideration. In Step 3, we need only perform group testing within each of the remaining communities (which contain at least one infected member). In contrast, the binary splitting algorithm ignores the community structure (specifically, the fact that entire communities are likely to be uninfected), and is therefore unlikely to enjoy the same benefits as the graph-aware algorithm under these circumstances. We will rigorously verify this intuition in the upcoming sections.

5 Disjoint k-Cliques Model

We first consider the graph consisting of disjoint k-cliques (i.e., complete subgraphs of size k, with no edges between different cliques). That is, we have a graph G = (V, E) with V = [n], where we assume n is divisible by k. There are m ≜ n/k disjoint cliques with k nodes each, denoted by C_1, C_2, ..., C_m, where |C_i| = k for all i ∈ [m]. The seed selection probability is p ∈ (0, 1] and the neighbor infection probability is q ∈ [0, 1].

Recall that E[T] ≥ H(X_1, ..., X_n) for any adaptive group testing algorithm which exactly identifies the infected individuals using T tests. Since the infection statuses across the m disjoint cliques are independent, we have E[T] ≥ m · H(X_1, ..., X_k), where without loss of generality we assume C_1 = [k]. Thus, obtaining a lower bound on E[T] reduces to lower bounding H(X_1, ..., X_k), i.e., the entropy corresponding to a single k-clique. The following lemma lower bounds H(X_1, ..., X_k) in terms of a binomial random variable, which then leads to the asymptotic lower bound given in Theorem 1 below.

Lemma 3.
Under the disjoint k-cliques model, the number of tests T required to identify the infected individuals is lower bounded as

E[T] ≥ m · E_Z[(k − Z) · h_b(1 − (1 − q)^Z)],

where Z ∼ Binom(k, p).

Theorem 1.
Let Z ∼ Binom(k, p) and assume kp ≼ 1 and q ≼ √((1/k) · log(1/(k·p))). Then

E_Z[(k − Z) · h_b(1 − (1 − q)^Z)] ≽ k² · p · q · (log k + log log(1/(k·p))).

Upon combining Lemma 3 with the above theorem, we see that the number of tests T needed to recover all infected members in the disjoint k-cliques graph (in the specified parameter regime) is lower bounded as

E[T] ≽ m · k² · p · q · (log k + log log(1/(kp))).

This bound can be further refined by noting that

E[T] ≥ H(X_1, ..., X_n) ≥ H(X_{C_1}, ..., X_{C_m}) = m · h_b(1 − (1 − p)^k),    (4)

where the second inequality uses the fact that X_{C_1}, ..., X_{C_m} are a function of X_1, ..., X_n. Furthermore, since kp ≼ 1, we have h_b(1 − (1 − p)^k) ≽ k · p · log(1/(kp)). We summarize the refined lower bound in the following corollary:

Corollary 1.
Assume kp ≼ 1 and q ≼ √((1/k) · log(1/(kp))). Then under the disjoint k-cliques model, the number of tests T required to identify the infected individuals is lower bounded as

E[T] ≽ max{ m · k² · p · q · (log k + log log(1/(k·p))),  m · k · p · log(1/(k·p)),  1 }.

The following result bounds the expected number of tests used by the binary splitting algorithm under the disjoint k-cliques model.

Theorem 2.
Under the disjoint k-cliques model, the binary splitting algorithm identifies all infected individuals using T tests, where

E[T] ≤ m · k · (log m + log k + 1) · (1 − (1 − p)(1 − pq)^{k−1}).

Proof.
Let K be the number of infected nodes (which is a random variable in our setting). Then

E[K] = E[∑_{i=1}^n X_i] = ∑_{i=1}^n P(X_i = 1) = n · r,

where r = 1 − (1 − p)(1 − pq)^{k−1} by Lemma 1 (every vertex in a k-clique has degree k − 1). Invoking Lemma 2 yields the result.
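The proof hinges on the marginal r = 1 − (1 − p)(1 − pq)^{k−1} from Lemma 1. As a sanity check, the following Monte Carlo sketch (hypothetical code, simulating a single clique) compares an empirical estimate of P(X_v = 1) against this closed form.

```python
import random

def clique_infection_rate(k, p, q, trials=20000, rng=random.Random(1)):
    """Empirically estimate P(X_v = 1) for a vertex in a single k-clique:
    v is infected if it is a seed, or if some other member is a seed (w.p. p)
    that transmits to v (w.p. q)."""
    hits = 0
    for _ in range(trials):
        seeds = [rng.random() < p for _ in range(k)]
        infected = seeds[0] or any(s and rng.random() < q for s in seeds[1:])
        hits += infected
    return hits / trials
```

For k = 10, p = 0.05, q = 0.3, the closed form gives r = 1 − 0.95 · (1 − 0.015)^9 ≈ 0.171, and the estimate agrees to within sampling error.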
Using Theorem 2, we find that the average complexity of binary splitting is O(m · k² · p · (q + 1/k) · (log m + log k)), since

E[T] ≼ m · k · (log m + log k) · (1 − (1 − p)(1 − pq)^{k−1})
     ≤ m · k · (log m + log k) · (1 − (1 − p)(1 − kpq))    (a)
     = m · k · (log m + log k) · (p + kpq − kp²q)
     ≤ m · k · (log m + log k) · (p + kpq)
     = m · k² · p · (log m + log k) · (q + 1/k),    (5)

where in (a) we use the fact that (1 + x)^k ≥ 1 + kx for x ≥ −1, k ≥ 1.

5.2.2 Graph-Aware Algorithm

Next, we provide an upper bound on the expected number of tests performed by the graph-aware algorithm.
Theorem 3.
Under the disjoint k-cliques model, the graph-aware algorithm identifies all infected individuals using T tests, where

E[T] ≤ m · (log m + 1) · (1 − (1 − p)^k) + n · (log k + 1) · (1 − (1 − p)(1 − pq)^{k−1}).

Asymptotic analysis:
Using Theorem 3 and the fact that (1 + x)^k ≥ 1 + kx for x ≥ −1, k ≥ 1, we have

E[T] ≼ m log m · k · p + m · k² log k · p · (q + 1/k).    (6)

We summarize the expected number of tests of binary splitting and the graph-aware algorithm, as well as the information-theoretic lower bound, in Table 1.

Binary splitting:  m log m · k² · p · (q + 1/k) + m · k² log k · p · (q + 1/k)
Graph-aware:       m log m · k · p + m · k² log k · p · (q + 1/k)
Lower bound:       m · k · p · log(1/(kp)) + m · k² · p · q · (log k + log log(1/(kp))) + 1

Table 1: Upper and lower bounds on the expected number of tests in the disjoint k-cliques model.

Next, we discuss different parameter regimes where 1) the lower bound holds, 2) the graph-aware algorithm is order-optimal (i.e., the lower bound is tight), and 3) the graph-aware algorithm's average complexity is strictly better than binary splitting's. As stated in Corollary 1, the lower bound holds when kp ≼ 1 and q ≼ √((1/k) · log(1/(kp))). The next corollary specifies the regime where the graph-aware algorithm is tight:

Corollary 2.
If the following conditions hold:
1. kp ≼ m^{−α} for some fixed α ∈ (0, 1),
2. 1/k ≼ q ≼ √((1/k) · log(1/(kp))),
then the lower bound is tight, and moreover the graph-aware algorithm is order-optimal.

Proof. Plugging log(1/(kp)) ≽ α · log m into the lower bound and using the fact that k ≽ log(1/(kp)) from the second condition (which implies log k ≽ log log m) yields

E[T] ≽ m log m · k · p + m · k² · p · q · (log k + log log m) + 1 ≽ m log m · k · p + m · k² log k · p · q,

while applying q ≽ 1/k to the upper bound for the graph-aware algorithm yields

E[T] ≼ m log m · k · p + m · k² log k · p · q.

Finally, we specify the regime where the graph-aware algorithm outperforms binary splitting:
Corollary 3.
If the following conditions hold:
1. log m ≻ log k,
2. kq ≻ 1,
then the graph-aware algorithm's average complexity is asymptotically strictly better than binary splitting's, by a factor of min{kq, log m/log k}.

Proof. Under the above conditions, binary splitting's average complexity is

m log m · k² · p · q,

whereas the graph-aware algorithm's average complexity is

max{ m log m · k · p  (a),  m · k² log k · p · q  (b) }.

Both terms are strictly smaller than the binary splitting bound. We see that (a) saves a factor of kq ≻ 1, while (b) saves a factor of log m/log k ≻ 1.

Lower bound conditions:   kp ≼ 1 and q ≼ √((1/k) · log(1/(kp)))
Tightness conditions:     kp ≼ m^{−α} and 1 ≼ kq ≼ √(k · log(1/(kp)))
Improvement conditions:   log m ≻ log k and kq ≻ 1

Table 2: Summary of parameter regimes in the disjoint k-cliques model.

The main takeaway is that the graph-aware algorithm can potentially improve testing efficiency compared to standard binary splitting when (i) there are several moderately sized communities in the network, and (ii) the transmission rate within each clique is "intermediate." Additionally, the graph-aware algorithm is order-optimal when the infected population is sparse. However, note that when q ≼ 1/k, i.e., the intra-clique transmission rate is small, the bounds for binary splitting and the graph-aware algorithm are order-wise equivalent. This suggests that knowledge of the community structure may not help in this regime. Intuitively, this makes sense because when q is small, the infection statuses of the vertices are "mostly independent."

6 Stochastic Block Infection Model
Having studied the disjoint k-cliques model, we now turn to the fully general SBIM(n, k, p, q_1, q_2), where p ∈ (0, 1] and q_1, q_2 ∈ [0, 1].

Similar to Lemma 3 and Theorem 1, we obtain the following lower bounds for adaptive group testing over the SBIM.

Lemma 4.
Under SBIM(n, k, p, q_1, q_2), the number of tests T required to identify the infected individuals is lower bounded as

E[T] ≥ m · E_{Z,Z'}[(k − Z) · h_b(1 − (1 − q_1)^Z (1 − q_2)^{Z'})],

where Z ∼ Binom(k, p) and Z' ∼ Binom(n − k, p) are independent.

Theorem 4.
Let Z ∼ Binom(k, p) and Z' ∼ Binom(n − k, p) be independent, and assume
1. n · p · q_2 ≼ 1,
2. n · p ≽ 1,
3. k · p · q_1 ≼ 1,
4. q_1 ≤ √((1/k) · (log(1/(kp)) + 1)).
Then the following lower bound holds:

E_{Z,Z'}[(k − Z) · h_b(1 − (1 − q_1)^Z (1 − q_2)^{Z'})] ≽ n · k · p · q_2 · log(1/(n·p·q_2)) + k² · p · q_1 · log(1/(q_1 + n·p·q_2)).

Therefore, the number of tests T needed to recover all infected members over SBIM(n, k, p, q_1, q_2), in the parameter regime specified in Theorem 4, is lower bounded as

E[T] ≽ m · n · k · p · q_2 · log(1/(n·p·q_2)) + m · k² · p · q_1 · log(1/(q_1 + n·p·q_2)).    (7)

Remark 2.
Recall that in the disjoint k-cliques model, we obtained an additional lower bound in Equation (4), given by H(X_{C_1}, ..., X_{C_m}), which dominates when kp ≼ m^{−α}. However, under the general SBIM, the variables X_{C_1}, ..., X_{C_m} are no longer mutually independent, rendering the analysis of H(X_{C_1}, ..., X_{C_m}) intractable. Therefore, we suspect that the lower bound given in Theorem 4 is not tight when kp is small.

To analyze binary splitting and the graph-aware algorithm over the SBIM, we begin by extending Lemma 1.
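Extending Lemma 1 to the SBIM amounts to accounting for the k − 1 same-community vertices (each of which is a seed that transmits with probability p·q_1) and the n − k outside vertices (probability p·q_2 each); this is the content of Lemma 5 below. A Monte Carlo sketch of that marginal (hypothetical code, ours):

```python
import random

def sbim_marginal(n, k, p, q1, q2, trials=20000, rng=random.Random(3)):
    """Estimate P(X_v = 1) under SBIM(n, k, p, q1, q2): v escapes infection
    iff it is not a seed, no community member seed-and-transmits (prob p*q1
    each), and no outsider seed-and-transmits (prob p*q2 each)."""
    hits = 0
    for _ in range(trials):
        hit = rng.random() < p  # v itself is a seed
        if not hit:
            hit = any(rng.random() < p * q1 for _ in range(k - 1)) \
               or any(rng.random() < p * q2 for _ in range(n - k))
        hits += hit
    return hits / trials
```

The estimate should match 1 − (1 − p)(1 − p·q_1)^{k−1}(1 − p·q_2)^{n−k} to within sampling error.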
Lemma 5.
The marginal probability of infection for every vertex v under SBIM(n, k, p, q_1, q_2) is given by

P(X_v = 1) = 1 − (1 − p) · (1 − p·q_1)^{k−1} · (1 − p·q_2)^{n−k}.

6.2.1 Binary Splitting

Next, we generalize the bound in Theorem 2 to the SBIM. Notice that in both the Theorem 5 bound and the asymptotic bound derived below, we recover the corresponding bounds from the disjoint k-cliques setting when we set q_1 = q, q_2 = 0.

Theorem 5.
Under SBIM(n, k, p, q_1, q_2), the binary splitting algorithm identifies all infected individuals using T tests, where

E[T] ≤ n · (log n + 1) · (1 − (1 − p) · (1 − p·q_1)^{k−1} · (1 − p·q_2)^{n−k}).

Proof.
Let K be the number of infected nodes. Then

E[K] = E[∑_{i=1}^n X_i] = ∑_{i=1}^n P(X_i = 1) = n · r,

where r = 1 − (1 − p) · (1 − p·q_1)^{k−1} · (1 − p·q_2)^{n−k} by Lemma 5. Invoking Lemma 2 yields the result.

Asymptotic Analysis:
Using the fact that (1 + x)^k ≥ 1 + kx for x ≥ −1, k ≥ 1, we have

E[T] ≼ n · log n · (1 − (1 − p)(1 − k·p·q_1) · (1 − (n − k)·p·q_2))
     ≤ n · log n · ((n − k)·p·q_2 + k·p·q_1 + p + k·(n − k)·p²·q_1·q_2)
     ≤ m · k² · p · (log m + log k) · (1/k + q_1 + m·q_2 + m·k·p·q_1·q_2).    (8)

First, we provide a lemma needed to prove the upper bound for the graph-aware algorithm in Theorem 6. Again, note that by setting q_1 = q, q_2 = 0 in Theorem 6 and the resulting asymptotic bound, we recover the corresponding bounds from the disjoint k-cliques setting.

Lemma 6.
Let X_C be the indicator variable which equals 1 if at least one member of community C is infected. Then under SBIM(n, k, p, q_1, q_2),

P(X_C = 1) = 1 − (1 − p)^k · (1 − p · (1 − (1 − q_2)^k))^{n−k}.

Theorem 6.
Under SBIM(n, k, p, q_1, q_2), the graph-aware algorithm identifies all infected individuals using T tests, where

E[T] ≤ (n/k) · (log(n/k) + 1) · (1 − (1 − p)^k · (1 − p · (1 − (1 − q_2)^k))^{n−k}) + n · (log k + 1) · (1 − (1 − p) · (1 − p·q_1)^{k−1} · (1 − p·q_2)^{n−k}).

Proof.
Same steps as the proof of Theorem 3 (given in the Appendix), except using Lemma 5 and Lemma 6 wherever P(X_v = 1) and P(X_C = 1) are needed, respectively.

Asymptotic Analysis: Let T_1 and T_2 be the first and second terms in the Theorem 6 bound, respectively. Using the fact that (1 − q_2)^k ≥ 1 − kq_2, we have 1 − p · (1 − (1 − q_2)^k) ≥ 1 − p·k·q_2, so

E[T_1] ≼ m log m · (1 − (1 − p)^k · (1 − p(1 − (1 − q_2)^k))^{n−k})
       ≼ m log m · (1 − (1 − p)^k · (1 − p·k·q_2)^{n−k})
       ≼ m log m · (1 − (1 − k·p) · (1 − (n − k)·p·k·q_2))
       ≼ m log m · (k·p + n·p·k·q_2).

Following the previous asymptotic analysis for binary splitting,

E[T_2] ≼ m · k² log k · p · (1/k + q_1 + m·q_2 + m·k·p·q_1·q_2).

Therefore,

E[T] ≼ m log m · k · p · (1 + m·k·q_2) + m · k² log k · p · (1/k + q_1 + m·q_2 + m·k·p·q_1·q_2).    (9)

One regime where the graph-aware algorithm's average complexity is asymptotically strictly better than that of binary splitting is:
1. log m ≻ log k,
2. kq_1 ≻
1,
3. (i) m·k·q_2 ≼ 1, or (ii) 1 ≼ m·k·q_2 ≺ k·q_1 ≼ 1/p.

Suppose conditions 1, 2, and 3(i) hold. Binary splitting's average complexity (8) becomes

m log m · k² · p · q_1,

whereas the graph-aware algorithm's average complexity (9) becomes

max{ m log m · k · p  (a),  m · k² log k · p · q_1  (b) }.

The first term in the graph-aware bound improves upon binary splitting's complexity by a factor of kq_1 ≻ 1, and the second term improves by a factor of log m/log k ≻ 1. These are the same savings
we obtained in Corollary 3 in the disjoint $k$-cliques setting; indeed, the bounds themselves match those in Corollary 3. This is not very surprising, because the SBIM asymptotically behaves like the disjoint $k$-cliques model under condition 3(i), i.e., when $q_2$ is very small.

However, the graph-aware algorithm still yields improvements in a more intermediate regime for $q_2$. Under condition 3(ii), binary splitting's average complexity is the same as above, and the graph-aware algorithm's complexity becomes
$$\max\big\{\, m^2k^2\log m\cdot pq_2,\; mk^2\log k\cdot pq_1 \,\big\},$$
which represents an improvement over binary splitting by a factor of $\min\big\{\tfrac{q_1}{mq_2},\, \tfrac{\log m}{\log k}\big\} \gg 1$.

Figure 2: Performance comparison between binary splitting and the graph-aware algorithm under the SBIM with $n = 1000$, $k = 20$, and different values of $p$, $q_1$, $q_2$. Theoretical upper and lower bounds are also shown.

Simulations

We implemented the binary splitting and graph-aware algorithms and evaluated their performance over random instances of the
SBIM. The population size was set to $n = 1000$, and $p$ was varied over an interval of small values. For each value of $p$, we ran a number of independent trials, where a trial consists of generating an instance from $\mathrm{SBIM}(n,k,p,q_1,q_2)$ and then observing the number of tests used by binary splitting and by the graph-aware algorithm to identify the infected nodes. We estimated the lower bound from Lemma 4 by averaging over many independent samples of $Z\sim\mathrm{Binom}(k,p)$ and $Z'\sim\mathrm{Binom}(n-k,p)$.

Figure 2 shows some representative plots of the estimated $\mathbb{E}[T]$ as a function of $p$, with $k=20$ and different values of $q_1, q_2$. The error bars show $\pm$ one standard deviation of the values of $T$ obtained for a particular value of $p$. For comparison, we also plot the theoretical upper bounds from Theorem 5 and Theorem 6; we find that these bounds remain quite faithful to the empirical results. Additionally, the graph-aware algorithm consistently outperforms binary splitting. For example, in Figure 2b, at $p \approx 0.07$, binary splitting has surpassed the individual-testing threshold, using an average of roughly 1271 tests, while the graph-aware algorithm remains well below this threshold. Figure 3 shows analogous plots for fixed values of $q_1$ and $q_2$ and several community sizes $k$. The graph-aware algorithm seems to perform most favorably for moderate values of $k$, such as $k=20$ (as shown in Figure 2c) or $k=50$, i.e., when there are several moderately sized communities in the network. This is consistent with our earlier theoretical results.

Although the graph-aware algorithm improves significantly upon binary splitting, there is still a sizable gap between the graph-aware bound and the lower bound shown in the plots. This suggests that in the non-asymptotic regime, either the lower bound is not tight or better algorithms exist.

Conclusion

In this paper, we investigated the group testing problem over networks with community structure. Motivated by diseases such as COVID-19, we proposed a network infection model to capture how certain diseases are introduced into a population and subsequently transmitted through close contact between individuals. Our proposed group testing algorithm, which exploits the structure of the underlying graph, provably outperforms the network-oblivious binary splitting algorithm, and is even order-optimal in certain parameter regimes.

We conclude with some practical considerations and future directions. First, we note that the community-structured networks studied in this paper can model populations at different scales: the "communities" can be schools, families, counties, etc. The insights from our work can also be extended to more general real-world networks, where the communities may not be known in advance. In such instances, one might use the following pipeline to efficiently identify infected individuals in the population: 1) estimate the network from data (e.g., Facebook social graph); 2) run a clustering algorithm to identify communities in the network; 3) perform graph-aware group testing using the previously identified communities.
An interesting direction for future work is to explore the efficacy of such an approach. Other directions of interest include designing non-adaptive group testing schemes for networks, studying graph-aware group testing under noisy test outcomes, and extending our infection model to longer time horizons (e.g., SIR- or SIS-type infection models).
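For completeness, the Monte Carlo estimator of the Lemma 4 entropy lower bound used in our simulations can be sketched as follows. This is our own illustrative harness (function names and parameter values are assumptions): it evaluates equation (12) exactly by a double binomial sum and checks that the sampling estimator described in the Simulations section agrees with it:

```python
import math
import random

def hb(x):
    """Binary entropy in bits, with hb(0) = hb(1) = 0."""
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def binom_pmf(n, p, z):
    return math.comb(n, z) * p**z * (1 - p)**(n - z)

def entropy_lb_exact(n, k, p, q1, q2):
    """Evaluate (12) exactly: E[ m (k - Z) hb(1 - (1-q1)^Z (1-q2)^Z') ]."""
    m = n // k
    return sum(
        binom_pmf(k, p, z) * binom_pmf(n - k, p, zp)
        * m * (k - z) * hb(1 - (1 - q1)**z * (1 - q2)**zp)
        for z in range(k + 1) for zp in range(n - k + 1)
    )

def entropy_lb_sampled(n, k, p, q1, q2, trials, rng):
    """Estimate (12) by averaging over samples of Z ~ Binom(k,p), Z' ~ Binom(n-k,p)."""
    m = n // k
    total = 0.0
    for _ in range(trials):
        z = sum(rng.random() < p for _ in range(k))
        zp = sum(rng.random() < p for _ in range(n - k))
        total += m * (k - z) * hb(1 - (1 - q1)**z * (1 - q2)**zp)
    return total / trials

n, k, p, q1, q2 = 100, 10, 0.05, 0.3, 0.01   # illustrative parameters
exact = entropy_lb_exact(n, k, p, q1, q2)
approx = entropy_lb_sampled(n, k, p, q1, q2, 20_000, random.Random(0))
```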
Figure 3: Performance comparison between binary splitting and the graph-aware algorithm under the SBIM with $n = 1000$, fixed values of $q_1$ and $q_2$, and different values of $p$ and $k$. Theoretical upper and lower bounds are also shown.

Acknowledgement
The authors would like to thank Professor Sennur Ulukus for inspiring discussions on this topic. This work was supported in part by an NSF grant.
References

[1] R. Dorfman, "The detection of defective members of large populations," The Annals of Mathematical Statistics, vol. 14, no. 4, pp. 436–440, 1943.
[5] C. A. Hogan, M. K. Sahoo, and B. A. Pinsky, "Sample pooling as a strategy to detect community transmission of SARS-CoV-2," JAMA, vol. 323, no. 19, pp. 1967–1969, 2020.
[6] J. Wolf, "Born again group testing: Multiaccess communications," IEEE Transactions on Information Theory, vol. 31, no. 2, pp. 185–191, 1985.
[7] T. Berger, N. Mehravari, D. Towsley, and J. Wolf, "Random multiple-access communication and group testing," IEEE Transactions on Communications, vol. 32, no. 7, pp. 769–779, 1984.
[8] H. A. Inan, P. Kairouz, and A. Ozgur, "Sparse group testing codes for low-energy massive random access," in Allerton Conference on Communication, Control, and Computing, pp. 658–665, 2017.
[9] H. A. Inan, P. Kairouz, and A. Ozgur, "Energy-limited massive random access via noisy group testing," in IEEE International Symposium on Information Theory (ISIT), pp. 1101–1105, 2018.
[10] H. A. Inan, S. Ahn, P. Kairouz, and A. Ozgur, "A group testing approach to random access for short-packet communication," in IEEE International Symposium on Information Theory (ISIT), pp. 96–100, 2019.
[11] S. Ubaru and A. Mazumdar, "Multilabel classification with group testing and codes," in International Conference on Machine Learning, pp. 3492–3501, 2017.
[12] Y. Zhou, U. Porwal, C. Zhang, H. Q. Ngo, X. Nguyen, C. Ré, and V. Govindaraju, "Parallel feature selection inspired by group testing," Advances in Neural Information Processing Systems, vol. 27, pp. 3554–3562, 2014.
[13] D. Malioutov and K. Varshney, "Exact rule learning via boolean compressed sensing," in International Conference on Machine Learning, pp. 765–773, 2013.
[14] A. C. Gilbert, M. A. Iwen, and M. J. Strauss, "Group testing and sparse signal recovery," in Asilomar Conference on Signals, Systems and Computers, pp. 1059–1063, 2008.
[15] A. Emad and O. Milenkovic, "Poisson group testing: A probabilistic model for nonadaptive streaming boolean compressed sensing," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3335–3339, 2014.
[16] M. Sobel and P. A. Groll, "Group testing to eliminate efficiently all defectives in a binomial sample," Bell System Technical Journal, vol. 38, no. 5, pp. 1179–1252, 1959.
[17] T. Li, C. L. Chan, W. Huang, T. Kaced, and S. Jaggi, "Group testing with prior statistics," in IEEE International Symposium on Information Theory (ISIT), pp. 2346–2350, 2014.
[18] T. Kealy, O. Johnson, and R. Piechocki, "The capacity of non-identical adaptive group testing," in Allerton Conference on Communication, Control, and Computing, pp. 101–108, 2014.
[19] D.-Z. Du and F. K. Hwang, Combinatorial Group Testing and Its Applications, vol. 12. World Scientific, 2000.
[20] N. J. Harvey, M. Patrascu, Y. Wen, S. Yekhanin, and V. W. Chan, "Non-adaptive fault diagnosis for all-optical networks via combinatorial group testing on graphs," in IEEE International Conference on Computer Communications (INFOCOM), pp. 697–705, 2007.
[21] M. Cheraghchi, A. Karbasi, S. Mohajer, and V. Saligrama, "Graph-constrained group testing," IEEE Transactions on Information Theory, vol. 58, no. 1, pp. 248–262, 2012.
[22] A. Karbasi and M. Zadimoghaddam, "Sequential group testing with graph constraints," in IEEE Information Theory Workshop (ITW), pp. 292–296, 2012.
[23] B. Spang and M. Wootters, "Unconstraining graph-constrained group testing," arXiv preprint arXiv:1809.03589, 2018.
[24] P. Nikolopoulos, T. Guo, C. Fragouli, and S. Diggavi, "Community aware group testing," arXiv preprint arXiv:2007.08111, 2020.
[25] P. W. Holland, K. B. Laskey, and S. Leinhardt, "Stochastic blockmodels: First steps," Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
[26] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd Edition. Wiley, 2006.
[27] L. Baldassini, O. Johnson, and M. Aldridge, "The capacity of adaptive group testing," in IEEE International Symposium on Information Theory (ISIT), pp. 2676–2680, 2013.
[28] M. Aldridge, O. Johnson, and J. Scarlett, "Group testing: An information theory perspective," arXiv preprint arXiv:1902.06002, 2019.

Appendix
A Proof of Lemma 1
Let $Y_v$ be the indicator random variable of whether vertex $v$ is a seed. First, we have
$$P(X_v=1) = \underbrace{P(X_v=1\mid Y_v=1)}_{=1}\cdot\underbrace{P(Y_v=1)}_{=p} + P(X_v=1\mid Y_v=0)\cdot P(Y_v=0) = p + (1-p)\cdot P(X_v=1\mid Y_v=0).$$
Given that $v$ is not a seed, $X_v=1$ if and only if $v$ is infected by one of its neighbors. Hence,
\begin{align*}
P(X_v=1\mid Y_v=0) &= P\{v\text{ is infected by a neighbor}\}\\
&= 1 - P\{v\text{ isn't infected by any neighbor}\}\\
&= 1-\prod_{u\in N(v)}P\{v\text{ isn't infected by }u\}\\
&= 1-\prod_{u\in N(v)}\big(1-P\{v\text{ is infected by }u\}\big)\\
&= 1-\prod_{u\in N(v)}\big(1-P\{v\text{ is infected by }u\mid Y_u=1\}\cdot P(Y_u=1)\big)\\
&= 1-\prod_{u\in N(v)}(1-pq)\\
&= 1-(1-pq)^{d(v)}. \qquad\square
\end{align*}

B Lower Bounds for the Disjoint $k$-Cliques Model

B.1 Proof of Lemma 3
Since $H(X_1,\dots,X_n) = m\cdot H(X_1,\dots,X_k)$, it suffices to lower bound $H(X_1,\dots,X_k)$. Notice that
$$H(X_1,\dots,X_k) \ge H(X_1,\dots,X_k\mid Y_1,\dots,Y_k) = \sum_{y^k\in\{0,1\}^k} P\big(Y^k=y^k\big)\cdot H\big(X^k\,\big|\,Y^k=y^k\big).$$
Observe that after conditioning on the locations of the seeds, $X_1,\dots,X_k$ are mutually independent. Moreover, by symmetry, both $P(Y^k=y^k)$ and $H(X^k\mid Y^k=y^k)$ depend only on $\sum_i y_i$ (i.e., the empirical distribution of $y^k$). Indeed, the marginal distribution of $X_i$ can be specified as follows:
$$P\big(X_i=1\mid Y^k=y^k\big) = \begin{cases}1, & \text{if }y_i=1,\\ 1-(1-q)^{\sum_i y_i}, & \text{if }y_i=0,\end{cases}$$
and the conditional entropy is
$$H\big(X^k\,\big|\,Y^k=y^k\big) = \Big(k-\sum_i y_i\Big)\cdot h_b\Big(1-(1-q)^{\sum_i y_i}\Big),$$
where $h_b(\cdot)$ is the binary entropy function. Therefore, by writing $Z = \sum_i Y_i$, we have
$$H\big(X^k\,\big|\,Y^k\big) = \mathbb{E}_Z\big[(k-Z)\cdot h_b\big(1-(1-q)^Z\big)\big], \qquad(10)$$
where $Z\sim\mathrm{Binom}(k,p)$. $\square$

B.2 Proof of Theorem 1
Let $f(q) = \frac{\log q}{\log(1-q)}$, so that $Z=f(q)$ solves $1-(1-q)^Z = 1-q$. Then we bound (10) by
\begin{align*}
\mathbb{E}_Z\big[(k-Z)\, h_b\big(1-(1-q)^Z\big)\big] &\ge \mathbb{E}_Z\big[(k-Z)\, h_b\big(1-(1-q)^Z\big)\cdot\mathbf{1}\{1\le Z\le f(q)\}\big]\\
&\overset{(a)}{\ge} h_b(q)\cdot \mathbb{E}_Z\big[(k-Z)\cdot\mathbf{1}\{1\le Z\le f(q)\}\big]\\
&\ge h_b(q)\,\big( \mathbb{E}_Z[k-Z] - k\,P\{Z=0\} - k\,P\{Z>f(q)\} \big)\\
&= k\, h_b(q)\,\Big((1-p)\big(1-(1-p)^{k-1}\big) - P\{Z>f(q)\}\Big)\\
&\overset{(b)}{\ge} k\, h_b(q)\,\Big((1-p)\big((k-1)p-((k-1)p)^2\big) - P\{Z>f(q)\}\Big)\\
&\overset{(c)}{\gtrsim} k\, h_b(q)\,\big( kp - P\{Z>f(q)\} \big), \qquad(11)
\end{align*}
where (a) is due to the fact that $h_b(x)\ge h_b(q)$ for all $q\le x\le 1-q$, (b) holds since $(1-p)^r\le e^{-pr}$ and $e^{x}\le 1+x+x^2$ for $x\le 1$, and (c) is due to the assumption $p\lesssim 1/k$.

We then upper bound $P\{Z>f(q)\}$ by Hoeffding's inequality:
$$P\{Z>f(q)\} \le \exp\Bigg(-2k\Big(p-\frac{f(q)}{k}\Big)^2\Bigg) \overset{(a)}{\le} \exp\Bigg(-2k\Big(\frac{f(q)}{2k}\Big)^2\Bigg) \le \exp\Big(-\frac{f(q)^2}{2k}\Big) \overset{(b)}{\lesssim} kp,$$
where (a) holds since by assumption $kpq\lesssim 1$, so
$$kp \lesssim \frac{1}{q} \lesssim \frac{1-q}{q}\log\frac{1}{q} \le f(q),$$
and (b) holds due to the assumption
$$q \lesssim \frac{1}{\sqrt{k\log\frac{1}{kp}}}.$$
Plugging into (11) yields
$$\mathbb{E}_Z\big[(k-Z)\, h_b\big(1-(1-q)^Z\big)\big] \gtrsim k^2 p\, q\log\frac{1}{q} \gtrsim k^2 p\, q\Big(\log k + \log\log\frac{1}{kp}\Big),$$
where in the last inequality we use the assumption $q \lesssim \frac{1}{\sqrt{k\log\frac{1}{kp}}}$ again. $\square$

C Proof of Theorem 3
Let $T_1$ and $T_2$ be the number of tests performed, respectively, in Step 2 and Step 3 of the graph-aware algorithm. Specifically, $T_1$ is equal to the number of tests used by binary splitting to identify the infected $k$-cliques, and $T_2$ is the number of tests to identify infected individuals within each infected clique. Note that $T = T_1 + T_2$. We will bound $\mathbb{E}[T_1]$ and $\mathbb{E}[T_2]$ separately.

Let $Y$ be the number of infected $k$-cliques. We have
$$\mathbb{E}[Y] = \frac{n}{k}\cdot P(X_C=1) = \frac{n}{k}\cdot\big(1-(1-p)^k\big).$$
Taking Lemma 2 with $n = n/k$ and $\alpha = Y$ gives $T_1 \le (\log(n/k)+1)\cdot Y$, so that
$$\mathbb{E}[T_1] \le \frac{n}{k}\cdot\big(\log(n/k)+1\big)\cdot\big(1-(1-p)^k\big).$$
For the second stage of the algorithm, let $Z_i$ denote the number of tests used by binary splitting to identify all infected members of the $i$th clique. Since $T_2 = \sum_{i=1}^{n/k} Z_i\cdot\mathbf{1}\{X_{C_i}=1\}$, we have
$$\mathbb{E}[T_2] = \sum_{i=1}^{n/k} \mathbb{E}\big[Z_i\cdot\mathbf{1}\{X_{C_i}=1\}\big] = \frac{n}{k}\cdot \mathbb{E}\big[Z_1\cdot\mathbf{1}\{X_C=1\}\big] = \frac{n}{k}\cdot P(X_C=1)\cdot \mathbb{E}[Z_1\mid X_C=1] = \frac{n}{k}\cdot\big(1-(1-p)^k\big)\cdot \mathbb{E}[Z_1\mid X_C=1].$$
Let $M$ denote the number of infected members of $C$. Then by Lemma 2,
$$\mathbb{E}[Z_1\mid X_C=1] \le (\log k+1)\cdot \mathbb{E}[M\mid X_C=1]$$
and, assuming without loss of generality that $C=[k]$,
$$\mathbb{E}[M\mid X_C=1] = \sum_{j=1}^k P(X_j=1\mid X_C=1) = k\cdot P(X_1=1\mid X_C=1) = k\cdot\frac{P(X_1=1,\,X_C=1)}{P(X_C=1)} = k\cdot\frac{P(X_1=1)}{P(X_C=1)} = k\cdot\frac{1-(1-p)(1-pq)^{k-1}}{1-(1-p)^k},$$
where in the last line we invoke Lemma 1. Putting everything together gives
$$\mathbb{E}[T_2] \le n\cdot(\log k+1)\cdot\big(1-(1-p)(1-pq)^{k-1}\big)$$
and therefore
$$\mathbb{E}[T] \le \frac{n}{k}\cdot\big(\log(n/k)+1\big)\cdot\big(1-(1-p)^k\big) + n\cdot(\log k+1)\cdot\big(1-(1-p)(1-pq)^{k-1}\big). \qquad\square$$

D Lower Bounds for the SBIM
D.1 Proof of Lemma 4
Notice that
$$H(X_1,\dots,X_n) \ge H(X_1,\dots,X_n\mid Y_1,\dots,Y_n) = \sum_{y^n\in\{0,1\}^n} P(Y^n=y^n)\cdot H(X^n\mid Y^n=y^n).$$
Observe that after conditioning on the locations of the seeds, $X_1,\dots,X_n$ are mutually independent. Moreover, for $i\in C_\ell$, the marginal distribution of $X_i$ can be specified as follows:
$$P(X_i=1\mid Y^n=y^n) = \begin{cases}1, & \text{if }y_i=1,\\[2pt] 1-(1-q_1)^{\sum_{j\in C_\ell} y_j}(1-q_2)^{\sum_{j\notin C_\ell} y_j}, & \text{if }y_i=0.\end{cases}$$
Writing $z_\ell \triangleq \sum_{j\in C_\ell} y_j$, the conditional entropy is
$$H(X^n\mid Y^n=y^n) = \sum_{\ell=1}^m (k-z_\ell)\cdot h_b\Big(1-(1-q_1)^{z_\ell}(1-q_2)^{\sum_{\ell'\ne\ell} z_{\ell'}}\Big),$$
where $h_b(\cdot)$ is the binary entropy function. Since $Y_i \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(p)$, we have $Z_\ell \overset{\text{i.i.d.}}{\sim} \mathrm{Binom}(k,p)$ and hence
$$H(X^n\mid Y^n) = \mathbb{E}_{Z,Z'}\Big[m\cdot(k-Z)\cdot h_b\big(1-(1-q_1)^{Z}(1-q_2)^{Z'}\big)\Big], \qquad(12)$$
where $Z\sim\mathrm{Binom}(k,p)$ and $Z'\sim\mathrm{Binom}(n-k,p)$. $\square$

D.2 Proof of Theorem 4
First we assume $npq_2 \lesssim 1$, and let $\epsilon\in(0,1)$ be a value to be specified. Define
$$z^* \triangleq \frac{1/2 - np(1+\epsilon)q_2}{q_1}.$$
Then as long as $Z$ and $Z'$ satisfy the following two conditions:

1. $np(1-\epsilon) \le Z' \le np(1+\epsilon)$,
2. $Z \le z^*$,

we have
$$\frac12 \ge Zq_1 + Z'q_2 \ge 1-(1-q_1)^Z(1-q_2)^{Z'}. \qquad(13)$$
Since $1-(1-q_1)^Z(1-q_2)^{Z'}$ is an increasing function of $Z$ and $Z'$, $h_b\big(1-(1-q_1)^Z(1-q_2)^{Z'}\big)$ must increase with $Z$ and $Z'$ if they satisfy the above conditions. Therefore, we have
\begin{align*}
\mathbb{E}_{Z,Z'}\Big[(k-Z)\,h_b\big(1-(1-q_1)^Z(1-q_2)^{Z'}\big)\Big]
&\ge \mathbb{E}_{Z,Z'}\Big[(k-Z)\,h_b\big(1-(1-q_1)^Z(1-q_2)^{Z'}\big)\cdot\mathbf{1}\{0\le Z\le z^*\}\cdot\mathbf{1}\{np(1-\epsilon)\le Z'\le np(1+\epsilon)\}\Big]\\
&\ge \underbrace{\mathbb{E}_{Z,Z'}\Big[(k-Z)\,h_b\big(1-(1-q_2)^{Z'}\big)\cdot\mathbf{1}\{Z=0\}\cdot\mathbf{1}\{np(1-\epsilon)\le Z'\le np(1+\epsilon)\}\Big]}_{(a)}\\
&\quad + \underbrace{\mathbb{E}_{Z,Z'}\Big[(k-Z)\,h_b\big(1-(1-q_1)^Z(1-q_2)^{Z'}\big)\cdot\mathbf{1}\{1\le Z\le z^*\}\cdot\mathbf{1}\{np(1-\epsilon)\le Z'\le np(1+\epsilon)\}\Big]}_{(b)}. \qquad(14)
\end{align*}
We will pick $\epsilon=1/2$. Then (a) can be bounded by
\begin{align*}
(a) &\ge k\cdot h_b\Big(q_2\,np(1-\epsilon)-\big(q_2\,np(1-\epsilon)\big)^2\Big)\cdot\Big(1-2\exp\Big(-\frac{n\epsilon^2 p}{3}\Big)\Big)\\
&\gtrsim k\cdot npq_2(1-\epsilon)\log\frac{1}{npq_2(1-\epsilon)}\cdot\Big(1-2\exp\Big(-\frac{n\epsilon^2 p}{3}\Big)\Big)\\
&\gtrsim k\cdot npq_2\log\frac{1}{npq_2},
\end{align*}
where in the first inequality we use

1. $Z' \ge np(1-\epsilon)$,
2. $(1-q_2)^{Z'} \le e^{-q_2 Z'} \le 1-q_2 Z' + (q_2 Z')^2$,
3. the Chernoff bound on $Z'$,

and in the third inequality we assume $np \gtrsim 1$. Next, (b) can be bounded by
\begin{align*}
(b) &\ge h_b\Big(q_1+npq_2(1-\epsilon)-\big(q_1+npq_2(1-\epsilon)\big)^2\Big)\cdot \mathbb{E}_Z\big[(k-Z)\,\mathbf{1}\{1\le Z\le z^*\}\big]\cdot\Big(1-2\exp\Big(-\frac{n\epsilon^2 p}{3}\Big)\Big)\\
&\gtrsim (q_1+npq_2)\log\frac{1}{q_1+npq_2}\cdot \mathbb{E}_Z\big[(k-Z)\,\mathbf{1}\{1\le Z\le z^*\}\big].
\end{align*}
We will now lower bound $\mathbb{E}_Z\big[(k-Z)\,\mathbf{1}\{1\le Z\le z^*\}\big]$ as in Theorem 1. Observe that
$$\mathbb{E}_Z\big[(k-Z)\,\mathbf{1}\{1\le Z\le z^*\}\big] \ge \mathbb{E}_Z[k-Z] - kP\{Z=0\} - kP\{Z\ge z^*\} \ge k\Big(1-p-(1-p)^{k} - P\{Z\ge z^*\}\Big) \gtrsim k\big(kp - P\{Z\ge z^*\}\big). \qquad(15)$$
Finally, applying Hoeffding's inequality to $P\{Z\ge z^*\}$ yields
$$P\{Z\ge z^*\} \le \exp\Bigg(-2k\Big(p-\frac{z^*}{k}\Big)^2\Bigg) = \exp\Bigg(-2k\bigg(p-\frac{1/2-npq_2(1+\epsilon)}{q_1 k}\bigg)^2\Bigg) \overset{(1)}{\lesssim} \exp\Bigg(-2k\Big(\frac{1}{4q_1 k}\Big)^2\Bigg) = \exp\Big(-\frac{1}{8kq_1^2}\Big) \overset{(2)}{\le} kp,$$
where (1) holds when $npq_2 \lesssim 1$ and $p \lesssim \frac{1}{q_1 k}$, and (2) holds when
$$q_1 \le \sqrt{\frac{1}{8k\big(\log\frac{1}{kp}+1\big)}}.$$
Plugging into (15) yields
$$\mathbb{E}_Z\big[(k-Z)\,\mathbf{1}\{1\le Z\le z^*\}\big] \gtrsim k^2 p, \qquad(16)$$
and thus by putting together our bounds on (a) and (b) in (14), we arrive at
\begin{align}
\mathbb{E}_{Z,Z'}\Big[(k-Z)\,h_b\big(1-(1-q_1)^Z(1-q_2)^{Z'}\big)\Big] &\gtrsim k\cdot npq_2\log\frac{1}{npq_2} + k^2 p\,(q_1+npq_2)\log\frac{1}{q_1+npq_2} \tag{18}\\
&\gtrsim mk^2 pq_2\log\frac{1}{npq_2} + k^2 p\,q_1\log\frac{1}{q_1+npq_2}. \tag{19}
\end{align}
$\square$

E Proofs of Additional Lemmas
E.1 Proof of Lemma 5
Let $Y_v$ be the indicator random variable of whether vertex $v$ is a seed, and assume without loss of generality that $v\in C$. We have
$$P(X_v=1) = \underbrace{P(X_v=1\mid Y_v=1)}_{=1}\cdot\underbrace{P(Y_v=1)}_{=p} + P(X_v=1\mid Y_v=0)\cdot P(Y_v=0) = p + (1-p)\cdot P(X_v=1\mid Y_v=0)$$
and
\begin{align*}
P(X_v=1\mid Y_v=0) &= P\{v\text{ is infected by a neighbor}\}\\
&= 1-\prod_{u\in N(v)}P\{v\text{ isn't infected by }u\}\\
&= 1-\prod_{u\in N(v)}\big(1-P\{v\text{ is infected by }u\}\big)\\
&= 1-\prod_{u\in N(v)}\big(1-P\{v\text{ is infected by }u\mid Y_u=1\}\cdot P(Y_u=1)\big)\\
&= 1-\Bigg(\prod_{u\in C\setminus\{v\}}(1-pq_1)\Bigg)\cdot\Bigg(\prod_{w\notin C}(1-pq_2)\Bigg)\\
&= 1-(1-pq_1)^{k-1}\cdot(1-pq_2)^{n-k}. \qquad\square
\end{align*}

E.2 Proof of Lemma 6

Let $A$ be the event that no member of community $C$ is selected as a seed, and let $B$ be the event that some member of $C$ is infected by an individual outside $C$. We further denote by $B_u$ the event that vertex $u$ infects some member of $C$, where $u\notin C$. Note that $X_C=1$ if and only if either $A^c$ occurs or $A\cap B$ occurs. Moreover, $A$ and $B$ are independent events. We have that $P(A)=(1-p)^k$, and thus
$$P(X_C=1) = P(A^c) + P(A)\cdot P(B) = 1-(1-p)^k + (1-p)^k\cdot P(B) = 1-(1-p)^k\cdot(1-P(B)).$$
Finally, we compute $P(B)$ as
\begin{align*}
P(B) &= 1-\prod_{u\notin C}P(B_u^c)\\
&= 1-\prod_{u\notin C}\Big(P(B_u^c\mid Y_u=1)\cdot\underbrace{P(Y_u=1)}_{=p} + \underbrace{P(B_u^c\mid Y_u=0)}_{=1}\cdot\underbrace{P(Y_u=0)}_{=1-p}\Big)\\
&= 1-\prod_{u\notin C}\Big(1-p+p\cdot P(B_u^c\mid Y_u=1)\Big)\\
&= 1-\prod_{u\notin C}\Big(1-p+p\cdot(1-q_2)^k\Big)\\
&= 1-\Big(1-p\big(1-(1-q_2)^k\big)\Big)^{n-k}. \qquad\square
\end{align*}
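As a numerical sanity check on Lemma 5 and Lemma 6, the following sketch (our own test harness, not part of the paper; the simulator structure, the parameter values, and the block community assignment $v \mapsto \lfloor v/k \rfloor$ are illustrative assumptions) simulates the one-hop SBIM infection process directly and compares empirical frequencies against the two closed-form expressions:

```python
import random

def sbim_draw(n, k, p, q1, q2, rng):
    """One infection draw: seeds ~ Ber(p); a seed infects each community-mate
    w.p. q1 and each outside vertex w.p. q2 (one-hop transmission)."""
    seeds = [v for v in range(n) if rng.random() < p]
    seed_set = set(seeds)
    X = [0] * n
    for v in range(n):
        if v in seed_set:
            X[v] = 1
            continue
        for u in seeds:                          # lazy per-pair transmission draws
            if rng.random() < (q1 if u // k == v // k else q2):
                X[v] = 1
                break
    return X

def lemma5(n, k, p, q1, q2):
    # P(X_v = 1) = 1 - (1 - p)(1 - p q1)^(k-1) (1 - p q2)^(n-k)
    return 1 - (1 - p) * (1 - p * q1)**(k - 1) * (1 - p * q2)**(n - k)

def lemma6(n, k, p, q1, q2):
    # P(X_C = 1) = 1 - (1 - p)^k (1 - p (1 - (1 - q2)^k))^(n-k)
    return 1 - (1 - p)**k * (1 - p * (1 - (1 - q2)**k))**(n - k)

rng = random.Random(1)
n, k, p, q1, q2 = 60, 6, 0.05, 0.3, 0.02   # illustrative parameters
trials = 40_000
hits_v, hits_C = 0, 0
for _ in range(trials):
    X = sbim_draw(n, k, p, q1, q2, rng)
    hits_v += X[0]            # marginal infection of vertex 0 (Lemma 5)
    hits_C += any(X[:k])      # community containing vertices 0..k-1 (Lemma 6)
```

With these parameters the empirical frequencies should land within a few standard errors of the closed-form values.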