Adaptive Group Testing on Networks with Community Structure
Surin Ahn, Wei-Ning Chen, and Ayfer Özgür
Department of Electrical Engineering, Stanford University
{surinahn, wnchen, aozgur}@stanford.edu

Abstract
Since the inception of the group testing problem in World War II, the prevailing assumption in the probabilistic variant of the problem has been that individuals in the population are infected by a disease independently. However, this assumption rarely holds in practice, as diseases typically spread through connections between individuals. We introduce an infection model for networks, inspired by characteristics of COVID-19 and similar diseases, which generalizes the traditional i.i.d. model from probabilistic group testing. Under this infection model, we ask whether knowledge of the network structure can be leveraged to perform group testing more efficiently, focusing specifically on community-structured graphs drawn from the stochastic block model. Through both theory and simulations, we show that when the network and infection parameters are conducive to "strong community structure," our proposed adaptive, graph-aware algorithm outperforms the baseline binary splitting algorithm, and is even order-optimal in certain parameter regimes. Finally, we derive novel information-theoretic lower bounds which highlight the fundamental limits of adaptive group testing in our networked setting.
Identifying individuals who are infected by a disease is crucial for curbing epidemics and ensuring the well-being of society. However, due to high costs or limited resources, it is often infeasible to test every member of the population individually. During World War II, when the U.S. military sought to identify soldiers infected with syphilis, Dorfman made a breakthrough by introducing the concept of group testing [1]. He showed that by testing groups or pools of samples rather than individual samples, the infected people in a population of size n can be identified with far fewer than n tests. The key insight was that if the infected population is sparse, then each pooled test is likely to produce a negative result, in which case all individuals included in the test can be deemed "not infected" even though only a single test was performed. Today, group testing schemes are actively being used in the COVID-19 pandemic to identify infected individuals in an efficient and cost-effective manner [2–5]. Group testing is also useful in numerous application domains beyond healthcare, such as wireless communications [6–10], machine learning [11–13], signal processing [14], and data streaming [15].

Dorfman's seminal work, and subsequent works by other authors on the so-called probabilistic group testing problem [6, 16–18], assume that the disease infects individuals in a statistically independent fashion. However, this assumption rarely holds in practice. Diseases typically spread through connections between individuals (e.g., familial, work-related, or other social connections), thereby inducing correlated infections. It is therefore natural to ask whether exploiting information about this connectivity structure can lead to more efficient group testing strategies.
This problem is especially timely given the critical role that group testing is playing in the current COVID-19 pandemic, and the fact that the disease is known to spread through close contact between individuals.

In this work, we study the group testing problem under interaction networks that dictate the spread of a disease through the population, and investigate whether the graphical structure can be leveraged to perform pooled testing more efficiently than without knowledge of the graph. We focus on networks with community structure: those containing clusters of nodes with denser connections within a cluster than between clusters. Such networks are pervasive in the real world – social, biological, and information networks commonly exhibit community structure – and can often be estimated in practice, thanks to the availability of large datasets and network estimation techniques. Additionally, we introduce an infection model for arbitrary networks which generalizes the standard i.i.d. model from the probabilistic group testing literature.

On the algorithmic side, we consider adaptive group testing schemes, where the design of each test can be informed by the previous test results. We compare two different schemes: the standard binary splitting [19] algorithm, which is oblivious to the underlying network structure, and a simple graph-aware algorithm that exploits the community structure of the network. We give precise upper bounds on the expected number of tests performed by each algorithm. Crucially, we show that when the network and infection parameters yield strong community structure (in which case the disease is more likely to be transmitted within a community than between communities), the graph-aware algorithm's average complexity is asymptotically strictly better than that of binary splitting. We corroborate these results with numerical simulations.
Finally, we derive novel information-theoretic lower bounds which asymptotically match the graph-aware algorithm's performance (up to constants) in certain parameter regimes.

We note that our work may be relevant to other settings where the goal is to identify certain objects of interest within a "clustered" population. For example, we may wish to identify the active devices or users in a multiple access network, where devices that are closer together in the network tend to be active or inactive at the same time. Exploring the potential applications of network-oriented group testing to these types of problems is of great interest.

Related Works.
Our work differs from the graph-constrained group testing problem [20–23], in which the tests must conform to a given network topology. In our case, we allow the tests to be arbitrary, but ask whether knowledge of the interaction network can help to reduce the number of required tests. This is similar in spirit to recent work on community-aware group testing [24], though our work departs from it in several ways. First, [24] assumes the population is partitioned into disjoint "families," whereas our work considers more general network structures which allow for transmissions between communities. Second, although we focus on community-structured graphs in this paper, our proposed infection model works on top of arbitrary networks and therefore applies naturally to a broader class of problems. Finally, we give a precise characterization of the improvement provided by our graph-aware algorithm over the baseline, and of the parameter regimes in which our lower bounds are order-optimal.
Paper Organization.
The rest of this paper is organized as follows. In Section 2, we describe the network and infection models, and define our mathematical notation. In Section 3, we provide background and preliminary ideas. In Section 4, we discuss the main algorithms studied in this paper: binary splitting and our proposed graph-aware algorithm. Section 5 gives upper and lower bounds for adaptive group testing on networks consisting of disjoint cliques, and Section 6 generalizes these results to the stochastic block model. Finally, we present the results of our numerical simulations in Section 7, and conclude in Section 8. All omitted proofs are given in the Appendix.
We study the following probabilistic infection model with parameters p, q ∈ [0, 1], which acts upon an arbitrary graph G = (V, E) in two stages (each executed once):

1. Seed Selection:
Each vertex is infected i.i.d. with probability p. These initial infected vertices are called the seeds. They model the introduction of the disease into the population via some external entity (e.g., a traveler carrying the disease into a country).

2. Neighbor Infection:
A seed infects each of its neighbors i.i.d. with probability q. This models how the disease spreads through the population via interactions between carriers and nearby individuals.

Remark 1.
The above stages can be viewed as the "first time step" of a stochastic epidemic model, i.e., the initial spread of an epidemic. It is inspired by diseases such as COVID-19, which are initially introduced into a population from an external source and subsequently transmitted between individuals in close contact. In practice, the specific values of p, q can be tailored to the disease in question (for example, by using contact tracing to estimate the infectiousness of the disease).
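As a concrete reference, the two-stage model can be sketched in a few lines of Python. This is a hypothetical helper, not code from the paper; `adj` is assumed to be an adjacency-list representation of G.

```python
import random

def sample_infections(adj, p, q, rng=random.Random(0)):
    """Two-stage infection model: seeds are chosen i.i.d. w.p. p, then each
    seed infects each of its neighbors independently w.p. q."""
    n = len(adj)
    seeds = [v for v in range(n) if rng.random() < p]
    infected = set(seeds)
    for s in seeds:
        for u in adj[s]:
            if rng.random() < q:
                infected.add(u)
    return seeds, infected
```

Setting q = 0 recovers the i.i.d. model of classical probabilistic group testing, since the neighbor-infection stage then never fires.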
Consider an arbitrary graph with seed selection probability p ∈ [0, 1] and neighbor infection probability q = 0. In this case, our setting reduces to the i.i.d. probabilistic group testing model. Each node is selected as a seed (and thus infected) with probability p, and since transmissions between nodes are not possible, no additional nodes are infected during the neighbor infection phase. It follows that we cannot hope to do any better than classical group testing schemes in this setting.

Proposition 1. Under an arbitrary graph G = (V, E), identifying infected individuals under our infection model with seed selection parameter p ∈ [0, 1] and zero probability of neighbor infection (q = 0) is equivalent to the i.i.d. probabilistic group testing problem with infection probability p.

The same is true of the empty graph (a.k.a. null graph), G = (V, E) where E = ∅, with arbitrary infection parameters p, q ∈ [0, 1]: each node is infected i.i.d. with probability p.

For the rest of this paper, we assume that the underlying network is drawn from the stochastic block model (SBM) [25] – a well-known random graph model with the tendency to produce community-structured graphs. The standard SBM has the following parameters:

• n vertices
• a partition of the vertex set V = {1, 2, ..., n} into m communities C_1, ..., C_m, where ∪_{i∈[m]} C_i = V and C_i ∩ C_j = ∅ for all i ≠ j
• a symmetric matrix P ∈ R^{m×m} of edge probabilities.

The random graph G = (V, E) is then generated in the following way. First, initialize E = ∅. Then for each pair of vertices u ∈ C_i, v ∈ C_j, we add an edge between u and v with probability P_{ij}.

In this paper, we consider a special case of the SBM. We assume the communities are all of size k, where k is a factor of n (so that the number of communities is m = n/k), and that there is a constant edge probability p_1 within communities and probability p_2 between communities. That is, P equals p_1 along the diagonal entries and p_2 on the off-diagonal entries. We further assume that p_1 > p_2, i.e., that edges are more likely to occur within a community than between communities. Finally, we assume that the communities are known to the group testing algorithms in advance, but that the graph itself may not be known.

Stochastic Block Infection Model (SBIM): Our infection model acting upon the SBM can equivalently be studied through a slightly modified infection model which acts upon the complete graph on n vertices: the graph containing all possible n(n − 1)/2 edges. This will reduce the overall number of parameters we have to consider. Our modified model still begins by selecting each node i.i.d. with probability p to be a seed. However, in the neighbor infection phase, each seed infects its neighbors within the same community i.i.d. with probability q_1 and infects those outside its community i.i.d. with probability q_2, where q_1 > q_2. The equivalence of this model and the original model can be seen by setting q_1 = p_1·q and q_2 = p_2·q, where q is the neighbor infection probability in the original model. We call this the Stochastic Block Infection Model, denoted by SBIM(n, k, p, q_1, q_2). Note that SBIM(n, k, p, 0, 0), with k an arbitrary factor of n, is equivalent to the i.i.d. group testing model.

Disjoint k-Cliques Model. Before analyzing the SBIM in full generality in Section 6, we begin in Section 5 by investigating the special case of SBIM(n, k, p, q, 0), the disjoint k-cliques model. Here, we have m = n/k communities of size k, each a complete subgraph on k vertices, with no edges between communities. The transmission rate within a community is q, and no transmissions are possible between communities. Figure 1 illustrates the SBIM and the difference between the disjoint k-cliques model (q_2 = 0) and the general SBIM with q_2 > 0.

[Figure 1: Illustration of SBIM(n, k, p, q_1, q_2). In this example, there are m = 4 communities of size k = 7. Seeds are colored green, and nodes infected by seeds are colored orange. Panels: (a) seed selection stage; (b) neighbor infection with q_2 = 0 (the disjoint k-cliques model), where nodes cannot be infected by seeds outside their own community; (c) neighbor infection with q_2 > 0.]

We now define the mathematical notation used in the rest of this paper.
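Concretely, the SBM sampling procedure just described can be sketched as follows. This is hypothetical code (not from the paper), and it takes the communities to be contiguous blocks of k vertices.

```python
import random

def sample_sbm(n, k, p1, p2, rng=random.Random(0)):
    """Sample the special-case SBM used in the paper: m = n/k equal-size
    communities, edge probability p1 within a community and p2 between
    communities. Returns an adjacency list."""
    assert n % k == 0
    community = [v // k for v in range(n)]  # contiguous blocks of size k
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            prob = p1 if community[u] == community[v] else p2
            if rng.random() < prob:
                adj[u].append(v)
                adj[v].append(u)
    return adj
```

With p_1 = 1 and p_2 = 0 this degenerates to the disjoint k-cliques graph studied in Section 5.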
General notation:

• n: size of the population
• k: size of each community
• m ≜ n/k: number of communities
• [n] ≜ {1, 2, ..., n}
• X ≜ (X_1, ..., X_n) ∈ {0, 1}^n: infection status vector, where X_v = 1 iff vertex v is infected
• X^ℓ ≜ (X_1, ..., X_ℓ), ℓ ∈ [n]
• X_{C_i} ∈ {0, 1}, i ∈ [m]: infection status of community C_i, where X_{C_i} = 1 iff ∃ v ∈ C_i : X_v = 1
• 1_A: indicator function for event A
• H(·): entropy of a discrete random variable (in bits), defined as H(X) ≜ −∑_{x∈X} p(x) log p(x)
• h_b(·): binary entropy function, defined as h_b(p) ≜ −p log p − (1 − p) log(1 − p)
• We write f(x) ≺ g(x) to denote f(x) = o(g(x)), and f(x) ≼ g(x) to denote f(x) = O(g(x)); the relations f(x) ≻ g(x) and f(x) ≽ g(x) are defined analogously for ω(·) and Ω(·)

Graph notation:

• G = (V, E): undirected graph with vertex set V and edge set E
• N(v) ≜ {u ∈ V : (u, v) ∈ E, u ≠ v}: set of neighbors of vertex v
• d(v) ≜ |N(v)|: degree (number of neighbors) of vertex v

In the group testing problem, a test corresponds to a subset of individuals S ⊆ [n]. The test outcome is positive if X_i = 1 for some i ∈ S; that is, if at least one member of S is infected. Otherwise, the test outcome is negative. Equivalently, the outcome is a binary variable Y ∈ {0, 1} given by a boolean OR operation over S:

Y = ⋁_{i∈S} X_i.    (1)

A group testing algorithm or scheme describes how to select subsets S_1, ..., S_T such that the infection statuses X_1, ..., X_n can be determined from the corresponding outcomes Y_1, ..., Y_T. In adaptive schemes, the choice of each S_t is allowed to depend on {S_{t'} : t' < t}. Moreover, due to the underlying randomness in the X_i in our probabilistic setting, the total number of tests T performed by any adaptive scheme is a random variable. In this work, we assume that test outcomes are noiseless (meaning that we get to observe the Y_t as given in (1)), and we require a scheme to exactly recover X_1, ..., X_n (i.e., achieve zero error).

Let G = (V, E) be any finite, undirected graph. For the infection model that we study in this paper, the marginal infection probability of a given vertex v can be characterized in terms of its degree d(v).

Lemma 1.
Let G = (V, E) be a finite, undirected graph. Under G, the infection status of a vertex v ∈ V is X_v ∼ Bernoulli(r_v), where

r_v ≜ P(X_v = 1) = 1 − (1 − p)(1 − pq)^{d(v)}.    (2)

Under a general graph, different nodes may have different degrees and hence different marginal probabilities of infection. From (2), we see that r_v is monotonically non-decreasing in d(v). Note also that the X_v can be correlated.

A fundamental result in probabilistic group testing (see [6] or [17, Theorem 1]) is that any adaptive algorithm which is guaranteed to identify all infected members of the population, assuming noiseless test results, requires a number of tests T satisfying

E[T] ≥ H(X_1, ..., X_n),    (3)

where H(X_1, ..., X_n) is the Shannon entropy of X = (X_1, ..., X_n). This bound highlights the intimate connection between adaptive group testing and source coding. Indeed, the outcomes of the adaptive tests can be viewed as a binary, variable-length source code for X; the lower bound then follows directly from existing results in data compression (see [26, Eqn. 5.38]). Equation (3) will serve as the point of departure for the lower bounds on E[T] that we derive in this paper. The key challenge will be to obtain good approximations to H(X) in the presence of correlated X_v.

Most adaptive group testing algorithms are based on the idea of recursively splitting the population until all infected members are found. The most standard such algorithm is known as binary splitting, which finds one infected member at a time by repeatedly halving the population. This algorithm identifies all infected members using α log n + O(α) adaptive tests (see [27], [19, p. 24], or [28, Theorem 1.2]), where α is the number of infected members. This algorithm works even when α is unknown, and is most effective in the sparse regime α = Θ(n^β), where β ∈ [0, 1).

Lemma 2.
In a population of size n with α infected members, where α ≥ 1, the binary splitting algorithm is guaranteed to identify all infected members using at most α⌈log n⌉ ≤ α log n + α tests.
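Lemma 2's guarantee refers to the classical one-defective-at-a-time scheme; the sketch below implements the closely related recursive-halving variant (a common presentation of binary splitting), in which any pool that tests negative is discarded wholesale. Here `is_positive` plays the role of the pooled OR test in (1); the function and its bookkeeping are our own illustration, not the paper's code.

```python
def binary_splitting(items, is_positive):
    """Recursively halve the population, discarding any pool that tests
    negative; singleton positive pools are declared infected.
    Returns (sorted list of positives, number of pooled tests used)."""
    positives = []
    stack = [list(items)]
    tests_used = 0
    while stack:
        pool = stack.pop()
        if not pool:
            continue
        tests_used += 1
        if not is_positive(pool):
            continue  # everyone in this pool is healthy
        if len(pool) == 1:
            positives.append(pool[0])
            continue
        mid = len(pool) // 2
        stack.append(pool[mid:])
        stack.append(pool[:mid])
    return sorted(positives), tests_used
```

For example, with two infected members among sixteen, `binary_splitting(range(16), lambda pool: any(i in {3, 7} for i in pool))` recovers {3, 7} using far fewer than sixteen tests when the infected set is sparse.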
Adaptive Graph-Aware Algorithm
1. Mix samples within each community.
2. Run binary splitting on the mixed samples to determine which communities contain at least one infected member.
3. For each positive test from Step 2, perform binary splitting within the corresponding community to identify infected members.

Under what circumstances should we expect the graph-aware algorithm to outperform binary splitting? Suppose the underlying interaction network and infection model follow SBIM(n, k, p, q_1, q_2). If the seed selection probability p is small, then we expect only a few of the m = n/k communities to contain a seed. This means that after the neighbor infection stage, several of the communities are likely to contain no infected members at all, especially if q_2 is small. In Step 2 of the graph-aware algorithm, we can efficiently rule out these uninfected communities from consideration. In Step 3, we need only perform group testing within each of the remaining communities (which contain at least one infected member). In contrast, the binary splitting algorithm ignores the community structure (specifically, the fact that entire communities are likely to be uninfected), and is therefore unlikely to enjoy the same benefits as the graph-aware algorithm under these circumstances. We will rigorously verify this intuition in the upcoming sections.

5 Disjoint k-Cliques Model

We first consider the graph consisting of disjoint k-cliques (i.e., complete subgraphs of size k, with no edges between different cliques). That is, we have a graph G = (V, E) with V = [n], where we assume n is divisible by k. There are m ≜ n/k disjoint cliques with k nodes each, denoted by C_1, C_2, ..., C_m, where |C_i| = k for all i ∈ [m]. The seed selection probability is p ∈ (0, 1] and the neighbor infection probability is q ∈ [0, 1].

Recall that E[T] ≥ H(X_1, ..., X_n) for any adaptive group testing algorithm which exactly identifies the infected individuals using T tests. Since the infection statuses across the m disjoint cliques are independent, we have E[T] ≥ m · H(X_1, ..., X_k), where without loss of generality we assume C_1 = [k]. Thus, obtaining a lower bound on E[T] reduces to lower bounding H(X_1, ..., X_k), i.e., the entropy corresponding to a single k-clique. The following lemma lower bounds H(X_1, ..., X_k) in terms of a binomial random variable, which then leads to the asymptotic lower bound given in Theorem 1 below.

Lemma 3.
Under the disjoint k-cliques model, the number of tests T required to identify the infected individuals is lower bounded as

E[T] ≥ m · E_Z[(k − Z) · h_b(1 − (1 − q)^Z)],

where Z ∼ Binom(k, p).

Theorem 1.
Let Z ∼ Binom(k, p) and assume kp ≼ 1 and q ≼ √((1/k) · log(1/(k·p))). Then

E_Z[(k − Z) · h_b(1 − (1 − q)^Z)] ≽ k² · p · q · (log k + log log(1/(k·p))).

Upon combining Lemma 3 with the above theorem, we see that the number of tests T needed to recover all infected members in the disjoint k-cliques graph (in the specified parameter regime) is lower bounded as

E[T] ≽ m · k² · p · q · (log k + log log(1/(kp))).

This bound can be further refined by noting that

E[T] ≥ H(X_1, ..., X_n) ≥ H(X_{C_1}, ..., X_{C_m}) = m · h_b(1 − (1 − p)^k),    (4)

where the second inequality uses the fact that X_{C_1}, ..., X_{C_m} are a function of X_1, ..., X_n. Furthermore, since kp ≼ 1, we have h_b(1 − (1 − p)^k) ≽ k · p · log(1/(kp)). We summarize the refined lower bound in the following corollary:

Corollary 1.
Assume kp ≼ 1 and q ≼ √((1/k) · log(1/(kp))). Then under the disjoint k-cliques model, the number of tests T required to identify the infected individuals is lower bounded as

E[T] ≽ max{ m · k² · p · q · (log k + log log(1/(k·p))),  m · k · p · log(1/(k·p)),  1 }.

The following result bounds the expected number of tests used by the binary splitting algorithm under the disjoint k-cliques model.

Theorem 2.
Under the disjoint k-cliques model, the binary splitting algorithm identifies all infected individuals using T tests, where

E[T] ≤ m · k · (log m + log k + 1) · (1 − (1 − p)(1 − pq)^{k−1}).

Proof.
Let K be the number of infected nodes (which is a random variable in our setting). Then

E[K] = E[∑_{i=1}^n X_i] = ∑_{i=1}^n P(X_i = 1) = n · r,

where r = 1 − (1 − p)(1 − pq)^{k−1} by Lemma 1 (every vertex in a k-clique has degree k − 1). Invoking Lemma 2 yields the result.
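The proof hinges on the marginal r = 1 − (1 − p)(1 − pq)^{k−1} from Lemma 1. As a sanity check, the following Monte Carlo sketch (hypothetical code, simulating a single clique) compares an empirical estimate of P(X_v = 1) against this closed form.

```python
import random

def clique_infection_rate(k, p, q, trials=20000, rng=random.Random(1)):
    """Empirically estimate P(X_v = 1) for a vertex in a single k-clique:
    v is infected if it is a seed, or if some other member is a seed (w.p. p)
    that transmits to v (w.p. q)."""
    hits = 0
    for _ in range(trials):
        seeds = [rng.random() < p for _ in range(k)]
        infected = seeds[0] or any(s and rng.random() < q for s in seeds[1:])
        hits += infected
    return hits / trials
```

For k = 10, p = 0.05, q = 0.3, the closed form gives r = 1 − 0.95 · (1 − 0.015)^9 ≈ 0.171, and the estimate agrees to within sampling error.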
Using Theorem 2, we find that the average complexity of binary splitting is O(m · k² · p · (q + 1/k) · (log m + log k)), since

E[T] ≼ m · k · (log m + log k) · (1 − (1 − p)(1 − pq)^{k−1})
     ≤ m · k · (log m + log k) · (1 − (1 − p)(1 − kpq))    (a)
     = m · k · (log m + log k) · (p + kpq − kp²q)
     ≤ m · k · (log m + log k) · (p + kpq)
     = m · k² · p · (log m + log k) · (q + 1/k),    (5)

where in (a) we use the fact that (1 + x)^k ≥ 1 + kx for x ≥ −1, k ≥ 1.

5.2.2 Graph-Aware Algorithm

Next, we provide an upper bound on the expected number of tests performed by the graph-aware algorithm.
Theorem 3.
Under the disjoint k-cliques model, the graph-aware algorithm identifies all infected individuals using T tests, where

E[T] ≤ m · (log m + 1) · (1 − (1 − p)^k) + n · (log k + 1) · (1 − (1 − p)(1 − pq)^{k−1}).

Asymptotic analysis:
Using Theorem 3 and the fact that (1 + x)^k ≥ 1 + kx for x ≥ −1, k ≥ 1, we have

E[T] ≼ m log m · k · p + m · k² log k · p · (q + 1/k).    (6)

We summarize the expected number of tests of binary splitting and the graph-aware algorithm, as well as the information-theoretic lower bound, in Table 1.

Binary splitting:  m log m · k² · p · (q + 1/k) + m · k² log k · p · (q + 1/k)
Graph-aware:       m log m · k · p + m · k² log k · p · (q + 1/k)
Lower bound:       m · k · p · log(1/(kp)) + m · k² · p · q · (log k + log log(1/(kp))) + 1

Table 1: Upper and lower bounds on the expected number of tests in the disjoint k-cliques model.

Next, we discuss different parameter regimes where 1) the lower bound holds, 2) the graph-aware algorithm is order-optimal (i.e., the lower bound is tight), and 3) the graph-aware algorithm's average complexity is strictly better than binary splitting's. As stated in Corollary 1, the lower bound holds when kp ≼ 1 and q ≼ √((1/k) · log(1/(kp))). The next corollary specifies the regime where the graph-aware algorithm is tight:

Corollary 2.
If the following conditions hold:
1. kp ≼ m^{−α} for some fixed α ∈ (0, 1),
2. 1/k ≼ q ≼ √((1/k) · log(1/(kp))),
then the lower bound is tight, and moreover the graph-aware algorithm is order-optimal.

Proof. Plugging log(1/(kp)) ≽ α · log m into the lower bound and using the fact that k ≽ log(1/(kp)) from the second condition (which implies log k ≽ log log m) yields

E[T] ≽ m log m · k · p + m · k² · p · q · (log k + log log m) + 1 ≽ m log m · k · p + m · k² log k · p · q,

while applying q ≽ 1/k to the upper bound for the graph-aware algorithm yields

E[T] ≼ m log m · k · p + m · k² log k · p · q.

Finally, we specify the regime where the graph-aware algorithm outperforms binary splitting:
Corollary 3.
If the following conditions hold:
1. log m ≻ log k,
2. kq ≻ 1,
then the graph-aware algorithm's average complexity is asymptotically strictly better than binary splitting's, by a factor of min{kq, log m/log k}.

Proof. Under the above conditions, binary splitting's average complexity is

m log m · k² · p · q,

whereas the graph-aware algorithm's average complexity is

max{ m log m · k · p  (a),  m · k² log k · p · q  (b) }.

Both terms are strictly smaller than the binary splitting bound. We see that (a) saves a factor of kq ≻ 1, while (b) saves a factor of log m/log k ≻ 1.

Lower bound conditions:   kp ≼ 1 and q ≼ √((1/k) · log(1/(kp)))
Tightness conditions:     kp ≼ m^{−α} and 1 ≼ kq ≼ √(k · log(1/(kp)))
Improvement conditions:   log m ≻ log k and kq ≻ 1

Table 2: Summary of parameter regimes in the disjoint k-cliques model.

The main takeaway is that the graph-aware algorithm can potentially improve testing efficiency compared to standard binary splitting when (i) there are several moderately sized communities in the network, and (ii) the transmission rate within each clique is "intermediate." Additionally, the graph-aware algorithm is order-optimal when the infected population is sparse. However, note that when q ≼ 1/k, i.e., the intra-clique transmission rate is small, the bounds for binary splitting and the graph-aware algorithm are order-wise equivalent. This suggests that knowledge of the community structure may not help in this regime. Intuitively, this makes sense because when q is small, the infection statuses of the vertices are "mostly independent."

6 Stochastic Block Infection Model
Having studied the disjoint k-cliques model, we now turn to the fully general SBIM(n, k, p, q_1, q_2), where p ∈ (0, 1] and q_1, q_2 ∈ [0, 1].

Similar to Lemma 3 and Theorem 1, we obtain the following lower bounds for adaptive group testing over the SBIM.

Lemma 4.
Under SBIM(n, k, p, q_1, q_2), the number of tests T required to identify the infected individuals is lower bounded as

E[T] ≥ m · E_{Z,Z'}[(k − Z) · h_b(1 − (1 − q_1)^Z (1 − q_2)^{Z'})],

where Z ∼ Binom(k, p) and Z' ∼ Binom(n − k, p) are independent.

Theorem 4.
Let Z ∼ Binom(k, p) and Z' ∼ Binom(n − k, p) be independent, and assume
1. n · p · q_2 ≼ 1,
2. n · p ≽ 1,
3. k · p · q_1 ≼ 1,
4. q_1 ≤ √((1/k) · (log(1/(kp)) + 1)).
Then the following lower bound holds:

E_{Z,Z'}[(k − Z) · h_b(1 − (1 − q_1)^Z (1 − q_2)^{Z'})] ≽ n · k · p · q_2 · log(1/(n·p·q_2)) + k² · p · q_1 · log(1/(q_1 + n·p·q_2)).

Therefore, the number of tests T needed to recover all infected members over SBIM(n, k, p, q_1, q_2), in the parameter regime specified in Theorem 4, is lower bounded as

E[T] ≽ m · n · k · p · q_2 · log(1/(n·p·q_2)) + m · k² · p · q_1 · log(1/(q_1 + n·p·q_2)).    (7)

Remark 2.
Recall that in the disjoint k-cliques model, we obtained an additional lower bound in Equation (4), given by H(X_{C_1}, ..., X_{C_m}), which dominates when kp ≼ m^{−α}. However, under the general SBIM, the variables X_{C_1}, ..., X_{C_m} are no longer mutually independent, rendering the analysis of H(X_{C_1}, ..., X_{C_m}) intractable. Therefore, we suspect that the lower bound given in Theorem 4 is not tight when kp is small.

To analyze binary splitting and the graph-aware algorithm over the SBIM, we begin by extending Lemma 1.
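Extending Lemma 1 to the SBIM amounts to accounting for the k − 1 same-community vertices (each of which is a seed that transmits with probability p·q_1) and the n − k outside vertices (probability p·q_2 each); this is the content of Lemma 5 below. A Monte Carlo sketch of that marginal (hypothetical code, ours):

```python
import random

def sbim_marginal(n, k, p, q1, q2, trials=20000, rng=random.Random(3)):
    """Estimate P(X_v = 1) under SBIM(n, k, p, q1, q2): v escapes infection
    iff it is not a seed, no community member seed-and-transmits (prob p*q1
    each), and no outsider seed-and-transmits (prob p*q2 each)."""
    hits = 0
    for _ in range(trials):
        hit = rng.random() < p  # v itself is a seed
        if not hit:
            hit = any(rng.random() < p * q1 for _ in range(k - 1)) \
               or any(rng.random() < p * q2 for _ in range(n - k))
        hits += hit
    return hits / trials
```

The estimate should match 1 − (1 − p)(1 − p·q_1)^{k−1}(1 − p·q_2)^{n−k} to within sampling error.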
Lemma 5.
The marginal probability of infection for every vertex v under SBIM(n, k, p, q_1, q_2) is given by

P(X_v = 1) = 1 − (1 − p) · (1 − p·q_1)^{k−1} · (1 − p·q_2)^{n−k}.

6.2.1 Binary Splitting

Next, we generalize the bound in Theorem 2 to the SBIM. Notice that in both the Theorem 5 bound and the asymptotic bound derived below, we recover the corresponding bounds from the disjoint k-cliques setting when we set q_1 = q, q_2 = 0.

Theorem 5.
Under SBIM(n, k, p, q_1, q_2), the binary splitting algorithm identifies all infected individuals using T tests, where

E[T] ≤ n · (log n + 1) · (1 − (1 − p) · (1 − p·q_1)^{k−1} · (1 − p·q_2)^{n−k}).

Proof.
Let K be the number of infected nodes. Then

E[K] = E[∑_{i=1}^n X_i] = ∑_{i=1}^n P(X_i = 1) = n · r,

where r = 1 − (1 − p) · (1 − p·q_1)^{k−1} · (1 − p·q_2)^{n−k} by Lemma 5. Invoking Lemma 2 yields the result.

Asymptotic Analysis:
Using the fact that (1 + x)^k ≥ 1 + kx for x ≥ −1, k ≥ 1, we have

E[T] ≼ n · log n · (1 − (1 − p)(1 − k·p·q_1) · (1 − (n − k)·p·q_2))
     ≤ n · log n · ((n − k)·p·q_2 + k·p·q_1 + p + k·(n − k)·p²·q_1·q_2)
     ≤ m · k² · p · (log m + log k) · (1/k + q_1 + m·q_2 + m·k·p·q_1·q_2).    (8)

First, we provide a lemma needed to prove the upper bound for the graph-aware algorithm in Theorem 6. Again, note that by setting q_1 = q, q_2 = 0 in Theorem 6 and the resulting asymptotic bound, we recover the corresponding bounds from the disjoint k-cliques setting.

Lemma 6.
Let X_C be the indicator variable which equals 1 if at least one member of community C is infected. Then under SBIM(n, k, p, q_1, q_2),

P(X_C = 1) = 1 − (1 − p)^k · (1 − p · (1 − (1 − q_2)^k))^{n−k}.

Theorem 6.
Under SBIM(n, k, p, q_1, q_2), the graph-aware algorithm identifies all infected individuals using T tests, where

E[T] ≤ (n/k) · (log(n/k) + 1) · (1 − (1 − p)^k · (1 − p · (1 − (1 − q_2)^k))^{n−k}) + n · (log k + 1) · (1 − (1 − p) · (1 − p·q_1)^{k−1} · (1 − p·q_2)^{n−k}).

Proof.
Same steps as the proof of Theorem 3 (given in the Appendix), except using Lemma 5 and Lemma 6 wherever P(X_v = 1) and P(X_C = 1) are needed, respectively.

Asymptotic Analysis: Let T_1 and T_2 be the first and second terms in the Theorem 6 bound, respectively. Using the fact that (1 − q_2)^k ≥ 1 − kq_2, we have 1 − p · (1 − (1 − q_2)^k) ≥ 1 − p·k·q_2, so

E[T_1] ≼ m log m · (1 − (1 − p)^k · (1 − p(1 − (1 − q_2)^k))^{n−k})
       ≼ m log m · (1 − (1 − p)^k · (1 − p·k·q_2)^{n−k})
       ≼ m log m · (1 − (1 − k·p) · (1 − (n − k)·p·k·q_2))
       ≼ m log m · (k·p + n·p·k·q_2).

Following the previous asymptotic analysis for binary splitting,

E[T_2] ≼ m · k² log k · p · (1/k + q_1 + m·q_2 + m·k·p·q_1·q_2).

Therefore,

E[T] ≼ m log m · k · p · (1 + m·k·q_2) + m · k² log k · p · (1/k + q_1 + m·q_2 + m·k·p·q_1·q_2).    (9)

One regime where the graph-aware algorithm's average complexity is asymptotically strictly better than that of binary splitting is:
1. log m ≻ log k,
2. kq_1 ≻
1,
3. (i) m·k·q_2 ≼ 1, or (ii) 1 ≼ m·k·q_2 ≺ k·q_1 ≼ 1/p.

Suppose conditions 1, 2, and 3(i) hold. Binary splitting's average complexity (8) becomes

m log m · k² · p · q_1,

whereas the graph-aware algorithm's average complexity (9) becomes

max{ m log m · k · p  (a),  m · k² log k · p · q_1  (b) }.

The first term in the graph-aware bound improves upon binary splitting's complexity by a factor of kq_1 ≻ 1, and the second term improves by a factor of log m/log k ≻ 1. These are the same savings
we obtained in Corollary 3 in the disjoint $k$-cliques setting; indeed, the bounds themselves match those in Corollary 3. This is not very surprising, because the SBIM asymptotically behaves like the disjoint $k$-cliques model under condition 3(i), i.e., when $q_2$ is very small.

However, the graph-aware algorithm still yields improvements in a more intermediate regime for $q_2$. Under condition 3(ii), binary splitting's average complexity is the same as above, and the graph-aware algorithm's complexity becomes
$$\max\big\{\, m^2k^2\log m\cdot pq_2,\; mk^2\log k\cdot pq_1 \,\big\},$$
which represents an improvement over binary splitting by a factor of $\min\big\{\tfrac{q_1}{mq_2},\, \tfrac{\log m}{\log k}\big\} \gg 1$.

Figure 2: Performance comparison between binary splitting and the graph-aware algorithm under the SBIM with $n = 1000$, $k = 20$, and different values of $p$, $q_1$, $q_2$. Theoretical upper and lower bounds are also shown.

Simulations

We implemented the binary splitting and graph-aware algorithms and evaluated their performance over random instances of the
SBIM. The population size was set to $n = 1000$, and $p$ was varied over an interval of small values. For each value of $p$, we ran a number of independent trials, where a trial consists of generating an instance from $\mathrm{SBIM}(n,k,p,q_1,q_2)$ and then observing the number of tests used by binary splitting and by the graph-aware algorithm to identify the infected nodes. We estimated the lower bound from Lemma 4 by averaging over many independent samples of $Z\sim\mathrm{Binom}(k,p)$ and $Z'\sim\mathrm{Binom}(n-k,p)$.

Figure 2 shows some representative plots of the estimated $\mathbb{E}[T]$ as a function of $p$, with $k=20$ and different values of $q_1, q_2$. The error bars show $\pm$ one standard deviation of the values of $T$ obtained for a particular value of $p$. For comparison, we also plot the theoretical upper bounds from Theorem 5 and Theorem 6; we find that these bounds remain quite faithful to the empirical results. Additionally, the graph-aware algorithm consistently outperforms binary splitting. For example, in Figure 2b, at $p \approx 0.07$, binary splitting has surpassed the individual-testing threshold, using an average of roughly 1271 tests, while the graph-aware algorithm remains well below this threshold. Figure 3 shows analogous plots for fixed values of $q_1$ and $q_2$ and several community sizes $k$. The graph-aware algorithm seems to perform most favorably for moderate values of $k$, such as $k=20$ (as shown in Figure 2c) or $k=50$, i.e., when there are several moderately sized communities in the network. This is consistent with our earlier theoretical results.

Although the graph-aware algorithm improves significantly upon binary splitting, there is still a sizable gap between the graph-aware bound and the lower bound shown in the plots. This suggests that in the non-asymptotic regime, either the lower bound is not tight or better algorithms exist.

Conclusion

In this paper, we investigated the group testing problem over networks with community structure. Motivated by diseases such as COVID-19, we proposed a network infection model to capture how certain diseases are introduced into a population and subsequently transmitted through close contact between individuals. Our proposed group testing algorithm, which exploits the structure of the underlying graph, provably outperforms the network-oblivious binary splitting algorithm, and is even order-optimal in certain parameter regimes.

We conclude with some practical considerations and future directions. First, we note that the community-structured networks studied in this paper can model populations at different scales: the "communities" can be schools, families, counties, etc. The insights from our work can also be extended to more general real-world networks, where the communities may not be known in advance. In such instances, one might use the following pipeline to efficiently identify infected individuals in the population: 1) estimate the network from data (e.g., Facebook social graph); 2) run a clustering algorithm to identify communities in the network; 3) perform graph-aware group testing using the previously identified communities.
An interesting direction for future work is to explore the efficacy of such an approach. Other directions of interest include designing non-adaptive group testing schemes for networks, studying graph-aware group testing under noisy test outcomes, and extending our infection model to longer time horizons (e.g., SIR- or SIS-type infection models).
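For completeness, the Monte Carlo estimator of the Lemma 4 entropy lower bound used in our simulations can be sketched as follows. This is our own illustrative harness (function names and parameter values are assumptions): it evaluates equation (12) exactly by a double binomial sum and checks that the sampling estimator described in the Simulations section agrees with it:

```python
import math
import random

def hb(x):
    """Binary entropy in bits, with hb(0) = hb(1) = 0."""
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def binom_pmf(n, p, z):
    return math.comb(n, z) * p**z * (1 - p)**(n - z)

def entropy_lb_exact(n, k, p, q1, q2):
    """Evaluate (12) exactly: E[ m (k - Z) hb(1 - (1-q1)^Z (1-q2)^Z') ]."""
    m = n // k
    return sum(
        binom_pmf(k, p, z) * binom_pmf(n - k, p, zp)
        * m * (k - z) * hb(1 - (1 - q1)**z * (1 - q2)**zp)
        for z in range(k + 1) for zp in range(n - k + 1)
    )

def entropy_lb_sampled(n, k, p, q1, q2, trials, rng):
    """Estimate (12) by averaging over samples of Z ~ Binom(k,p), Z' ~ Binom(n-k,p)."""
    m = n // k
    total = 0.0
    for _ in range(trials):
        z = sum(rng.random() < p for _ in range(k))
        zp = sum(rng.random() < p for _ in range(n - k))
        total += m * (k - z) * hb(1 - (1 - q1)**z * (1 - q2)**zp)
    return total / trials

n, k, p, q1, q2 = 100, 10, 0.05, 0.3, 0.01   # illustrative parameters
exact = entropy_lb_exact(n, k, p, q1, q2)
approx = entropy_lb_sampled(n, k, p, q1, q2, 20_000, random.Random(0))
```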
Figure 3: Performance comparison between binary splitting and the graph-aware algorithm under the SBIM with $n = 1000$, fixed values of $q_1$ and $q_2$, and different values of $p$ and $k$. Theoretical upper and lower bounds are also shown.

Acknowledgement
The authors would like to thank Professor Sennur Ulukus for inspiring discussions on this topic. This work was supported in part by an NSF grant.
References

[1] R. Dorfman, "The detection of defective members of large populations," The Annals of Mathematical Statistics, vol. 14, no. 4, pp. 436–440, 1943.
[5] C. A. Hogan, M. K. Sahoo, and B. A. Pinsky, "Sample pooling as a strategy to detect community transmission of SARS-CoV-2," JAMA, vol. 323, no. 19, pp. 1967–1969, 2020.
[6] J. Wolf, "Born again group testing: Multiaccess communications," IEEE Transactions on Information Theory, vol. 31, no. 2, pp. 185–191, 1985.
[7] T. Berger, N. Mehravari, D. Towsley, and J. Wolf, "Random multiple-access communication and group testing," IEEE Transactions on Communications, vol. 32, no. 7, pp. 769–779, 1984.
[8] H. A. Inan, P. Kairouz, and A. Ozgur, "Sparse group testing codes for low-energy massive random access," in Allerton Conference on Communication, Control, and Computing, pp. 658–665, 2017.
[9] H. A. Inan, P. Kairouz, and A. Ozgur, "Energy-limited massive random access via noisy group testing," in IEEE International Symposium on Information Theory (ISIT), pp. 1101–1105, 2018.
[10] H. A. Inan, S. Ahn, P. Kairouz, and A. Ozgur, "A group testing approach to random access for short-packet communication," in IEEE International Symposium on Information Theory (ISIT), pp. 96–100, 2019.
[11] S. Ubaru and A. Mazumdar, "Multilabel classification with group testing and codes," in International Conference on Machine Learning, pp. 3492–3501, 2017.
[12] Y. Zhou, U. Porwal, C. Zhang, H. Q. Ngo, X. Nguyen, C. Ré, and V. Govindaraju, "Parallel feature selection inspired by group testing," Advances in Neural Information Processing Systems, vol. 27, pp. 3554–3562, 2014.
[13] D. Malioutov and K. Varshney, "Exact rule learning via boolean compressed sensing," in International Conference on Machine Learning, pp. 765–773, 2013.
[14] A. C. Gilbert, M. A. Iwen, and M. J. Strauss, "Group testing and sparse signal recovery," in Asilomar Conference on Signals, Systems and Computers, pp. 1059–1063, 2008.
[15] A. Emad and O. Milenkovic, "Poisson group testing: A probabilistic model for nonadaptive streaming boolean compressed sensing," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3335–3339, 2014.
[16] M. Sobel and P. A. Groll, "Group testing to eliminate efficiently all defectives in a binomial sample," Bell System Technical Journal, vol. 38, no. 5, pp. 1179–1252, 1959.
[17] T. Li, C. L. Chan, W. Huang, T. Kaced, and S. Jaggi, "Group testing with prior statistics," in IEEE International Symposium on Information Theory (ISIT), pp. 2346–2350, 2014.
[18] T. Kealy, O. Johnson, and R. Piechocki, "The capacity of non-identical adaptive group testing," in Allerton Conference on Communication, Control, and Computing, pp. 101–108, 2014.
[19] D.-Z. Du and F. K. Hwang, Combinatorial Group Testing and Its Applications, vol. 12. World Scientific, 2000.
[20] N. J. Harvey, M. Patrascu, Y. Wen, S. Yekhanin, and V. W. Chan, "Non-adaptive fault diagnosis for all-optical networks via combinatorial group testing on graphs," in IEEE International Conference on Computer Communications (INFOCOM), pp. 697–705, 2007.
[21] M. Cheraghchi, A. Karbasi, S. Mohajer, and V. Saligrama, "Graph-constrained group testing," IEEE Transactions on Information Theory, vol. 58, no. 1, pp. 248–262, 2012.
[22] A. Karbasi and M. Zadimoghaddam, "Sequential group testing with graph constraints," in IEEE Information Theory Workshop (ITW), pp. 292–296, 2012.
[23] B. Spang and M. Wootters, "Unconstraining graph-constrained group testing," arXiv preprint arXiv:1809.03589, 2018.
[24] P. Nikolopoulos, T. Guo, C. Fragouli, and S. Diggavi, "Community aware group testing," arXiv preprint arXiv:2007.08111, 2020.
[25] P. W. Holland, K. B. Laskey, and S. Leinhardt, "Stochastic blockmodels: First steps," Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
[26] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd Edition. Wiley, 2006.
[27] L. Baldassini, O. Johnson, and M. Aldridge, "The capacity of adaptive group testing," in IEEE International Symposium on Information Theory (ISIT), pp. 2676–2680, 2013.
[28] M. Aldridge, O. Johnson, and J. Scarlett, "Group testing: An information theory perspective," arXiv preprint arXiv:1902.06002, 2019.

Appendix
A Proof of Lemma 1
Let $Y_v$ be the indicator random variable of whether vertex $v$ is a seed. First, we have
$$P(X_v=1) = \underbrace{P(X_v=1\mid Y_v=1)}_{=1}\cdot\underbrace{P(Y_v=1)}_{=p} + P(X_v=1\mid Y_v=0)\cdot P(Y_v=0) = p + (1-p)\cdot P(X_v=1\mid Y_v=0).$$
Given that $v$ is not a seed, $X_v=1$ if and only if $v$ is infected by one of its neighbors. Hence,
\begin{align*}
P(X_v=1\mid Y_v=0) &= P\{v\text{ is infected by a neighbor}\}\\
&= 1 - P\{v\text{ isn't infected by any neighbor}\}\\
&= 1-\prod_{u\in N(v)}P\{v\text{ isn't infected by }u\}\\
&= 1-\prod_{u\in N(v)}\big(1-P\{v\text{ is infected by }u\}\big)\\
&= 1-\prod_{u\in N(v)}\big(1-P\{v\text{ is infected by }u\mid Y_u=1\}\cdot P(Y_u=1)\big)\\
&= 1-\prod_{u\in N(v)}(1-pq)\\
&= 1-(1-pq)^{d(v)}. \qquad\square
\end{align*}

B Lower Bounds for the Disjoint $k$-Cliques Model

B.1 Proof of Lemma 3
Since $H(X_1,\dots,X_n) = m\cdot H(X_1,\dots,X_k)$, it suffices to lower bound $H(X_1,\dots,X_k)$. Notice that
$$H(X_1,\dots,X_k) \ge H(X_1,\dots,X_k\mid Y_1,\dots,Y_k) = \sum_{y^k\in\{0,1\}^k} P\big(Y^k=y^k\big)\cdot H\big(X^k\,\big|\,Y^k=y^k\big).$$
Observe that after conditioning on the locations of the seeds, $X_1,\dots,X_k$ are mutually independent. Moreover, by symmetry, both $P(Y^k=y^k)$ and $H(X^k\mid Y^k=y^k)$ depend only on $\sum_i y_i$ (i.e., the empirical distribution of $y^k$). Indeed, the marginal distribution of $X_i$ can be specified as follows:
$$P\big(X_i=1\mid Y^k=y^k\big) = \begin{cases}1, & \text{if }y_i=1,\\ 1-(1-q)^{\sum_i y_i}, & \text{if }y_i=0,\end{cases}$$
and the conditional entropy is
$$H\big(X^k\,\big|\,Y^k=y^k\big) = \Big(k-\sum_i y_i\Big)\cdot h_b\Big(1-(1-q)^{\sum_i y_i}\Big),$$
where $h_b(\cdot)$ is the binary entropy function. Therefore, by writing $Z = \sum_i Y_i$, we have
$$H\big(X^k\,\big|\,Y^k\big) = \mathbb{E}_Z\big[(k-Z)\cdot h_b\big(1-(1-q)^Z\big)\big], \qquad(10)$$
where $Z\sim\mathrm{Binom}(k,p)$. $\square$

B.2 Proof of Theorem 1
Let $f(q) = \frac{\log q}{\log(1-q)}$, so that $Z=f(q)$ solves $1-(1-q)^Z = 1-q$. Then we bound (10) by
\begin{align*}
\mathbb{E}_Z\big[(k-Z)\, h_b\big(1-(1-q)^Z\big)\big] &\ge \mathbb{E}_Z\big[(k-Z)\, h_b\big(1-(1-q)^Z\big)\cdot\mathbf{1}\{1\le Z\le f(q)\}\big]\\
&\overset{(a)}{\ge} h_b(q)\cdot \mathbb{E}_Z\big[(k-Z)\cdot\mathbf{1}\{1\le Z\le f(q)\}\big]\\
&\ge h_b(q)\,\big( \mathbb{E}_Z[k-Z] - k\,P\{Z=0\} - k\,P\{Z>f(q)\} \big)\\
&= k\, h_b(q)\,\Big((1-p)\big(1-(1-p)^{k-1}\big) - P\{Z>f(q)\}\Big)\\
&\overset{(b)}{\ge} k\, h_b(q)\,\Big((1-p)\big((k-1)p-((k-1)p)^2\big) - P\{Z>f(q)\}\Big)\\
&\overset{(c)}{\gtrsim} k\, h_b(q)\,\big( kp - P\{Z>f(q)\} \big), \qquad(11)
\end{align*}
where (a) is due to the fact that $h_b(x)\ge h_b(q)$ for all $q\le x\le 1-q$, (b) holds since $(1-p)^r\le e^{-pr}$ and $e^{x}\le 1+x+x^2$ for $x\le 1$, and (c) is due to the assumption $p\lesssim 1/k$.

We then upper bound $P\{Z>f(q)\}$ by Hoeffding's inequality:
$$P\{Z>f(q)\} \le \exp\Bigg(-2k\Big(p-\frac{f(q)}{k}\Big)^2\Bigg) \overset{(a)}{\le} \exp\Bigg(-2k\Big(\frac{f(q)}{2k}\Big)^2\Bigg) \le \exp\Big(-\frac{f(q)^2}{2k}\Big) \overset{(b)}{\lesssim} kp,$$
where (a) holds since by assumption $kpq\lesssim 1$, so
$$kp \lesssim \frac{1}{q} \lesssim \frac{1-q}{q}\log\frac{1}{q} \le f(q),$$
and (b) holds due to the assumption
$$q \lesssim \frac{1}{\sqrt{k\log\frac{1}{kp}}}.$$
Plugging into (11) yields
$$\mathbb{E}_Z\big[(k-Z)\, h_b\big(1-(1-q)^Z\big)\big] \gtrsim k^2 p\, q\log\frac{1}{q} \gtrsim k^2 p\, q\Big(\log k + \log\log\frac{1}{kp}\Big),$$
where in the last inequality we use the assumption $q \lesssim \frac{1}{\sqrt{k\log\frac{1}{kp}}}$ again. $\square$

C Proof of Theorem 3
Let $T_1$ and $T_2$ be the number of tests performed, respectively, in Step 2 and Step 3 of the graph-aware algorithm. Specifically, $T_1$ is equal to the number of tests used by binary splitting to identify the infected $k$-cliques, and $T_2$ is the number of tests to identify infected individuals within each infected clique. Note that $T = T_1 + T_2$. We will bound $\mathbb{E}[T_1]$ and $\mathbb{E}[T_2]$ separately.

Let $Y$ be the number of infected $k$-cliques. We have
$$\mathbb{E}[Y] = \frac{n}{k}\cdot P(X_C=1) = \frac{n}{k}\cdot\big(1-(1-p)^k\big).$$
Taking Lemma 2 with $n = n/k$ and $\alpha = Y$ gives $T_1 \le (\log(n/k)+1)\cdot Y$, so that
$$\mathbb{E}[T_1] \le \frac{n}{k}\cdot\big(\log(n/k)+1\big)\cdot\big(1-(1-p)^k\big).$$
For the second stage of the algorithm, let $Z_i$ denote the number of tests used by binary splitting to identify all infected members of the $i$th clique. Since $T_2 = \sum_{i=1}^{n/k} Z_i\cdot\mathbf{1}\{X_{C_i}=1\}$, we have
$$\mathbb{E}[T_2] = \sum_{i=1}^{n/k} \mathbb{E}\big[Z_i\cdot\mathbf{1}\{X_{C_i}=1\}\big] = \frac{n}{k}\cdot \mathbb{E}\big[Z_1\cdot\mathbf{1}\{X_C=1\}\big] = \frac{n}{k}\cdot P(X_C=1)\cdot \mathbb{E}[Z_1\mid X_C=1] = \frac{n}{k}\cdot\big(1-(1-p)^k\big)\cdot \mathbb{E}[Z_1\mid X_C=1].$$
Let $M$ denote the number of infected members of $C$. Then by Lemma 2,
$$\mathbb{E}[Z_1\mid X_C=1] \le (\log k+1)\cdot \mathbb{E}[M\mid X_C=1]$$
and, assuming without loss of generality that $C=[k]$,
$$\mathbb{E}[M\mid X_C=1] = \sum_{j=1}^k P(X_j=1\mid X_C=1) = k\cdot P(X_1=1\mid X_C=1) = k\cdot\frac{P(X_1=1,\,X_C=1)}{P(X_C=1)} = k\cdot\frac{P(X_1=1)}{P(X_C=1)} = k\cdot\frac{1-(1-p)(1-pq)^{k-1}}{1-(1-p)^k},$$
where in the last line we invoke Lemma 1. Putting everything together gives
$$\mathbb{E}[T_2] \le n\cdot(\log k+1)\cdot\big(1-(1-p)(1-pq)^{k-1}\big)$$
and therefore
$$\mathbb{E}[T] \le \frac{n}{k}\cdot\big(\log(n/k)+1\big)\cdot\big(1-(1-p)^k\big) + n\cdot(\log k+1)\cdot\big(1-(1-p)(1-pq)^{k-1}\big). \qquad\square$$

D Lower Bounds for the SBIM
D.1 Proof of Lemma 4
Notice that
$$H(X_1,\dots,X_n) \ge H(X_1,\dots,X_n\mid Y_1,\dots,Y_n) = \sum_{y^n\in\{0,1\}^n} P(Y^n=y^n)\cdot H(X^n\mid Y^n=y^n).$$
Observe that after conditioning on the locations of the seeds, $X_1,\dots,X_n$ are mutually independent. Moreover, for $i\in C_\ell$, the marginal distribution of $X_i$ can be specified as follows:
$$P(X_i=1\mid Y^n=y^n) = \begin{cases}1, & \text{if }y_i=1,\\[2pt] 1-(1-q_1)^{\sum_{j\in C_\ell} y_j}(1-q_2)^{\sum_{j\notin C_\ell} y_j}, & \text{if }y_i=0.\end{cases}$$
Writing $z_\ell \triangleq \sum_{j\in C_\ell} y_j$, the conditional entropy is
$$H(X^n\mid Y^n=y^n) = \sum_{\ell=1}^m (k-z_\ell)\cdot h_b\Big(1-(1-q_1)^{z_\ell}(1-q_2)^{\sum_{\ell'\ne\ell} z_{\ell'}}\Big),$$
where $h_b(\cdot)$ is the binary entropy function. Since $Y_i \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(p)$, we have $Z_\ell \overset{\text{i.i.d.}}{\sim} \mathrm{Binom}(k,p)$ and hence
$$H(X^n\mid Y^n) = \mathbb{E}_{Z,Z'}\Big[m\cdot(k-Z)\cdot h_b\big(1-(1-q_1)^{Z}(1-q_2)^{Z'}\big)\Big], \qquad(12)$$
where $Z\sim\mathrm{Binom}(k,p)$ and $Z'\sim\mathrm{Binom}(n-k,p)$. $\square$

D.2 Proof of Theorem 4
First we assume $npq_2 \lesssim 1$, and let $\epsilon\in(0,1)$ be a value to be specified. Define
$$z^* \triangleq \frac{1/2 - np(1+\epsilon)q_2}{q_1}.$$
Then as long as $Z$ and $Z'$ satisfy the following two conditions:

1. $np(1-\epsilon) \le Z' \le np(1+\epsilon)$,
2. $Z \le z^*$,

we have
$$\frac12 \ge Zq_1 + Z'q_2 \ge 1-(1-q_1)^Z(1-q_2)^{Z'}. \qquad(13)$$
Since $1-(1-q_1)^Z(1-q_2)^{Z'}$ is an increasing function of $Z$ and $Z'$, $h_b\big(1-(1-q_1)^Z(1-q_2)^{Z'}\big)$ must increase with $Z$ and $Z'$ if they satisfy the above conditions. Therefore, we have
\begin{align*}
\mathbb{E}_{Z,Z'}\Big[(k-Z)\,h_b\big(1-(1-q_1)^Z(1-q_2)^{Z'}\big)\Big]
&\ge \mathbb{E}_{Z,Z'}\Big[(k-Z)\,h_b\big(1-(1-q_1)^Z(1-q_2)^{Z'}\big)\cdot\mathbf{1}\{0\le Z\le z^*\}\cdot\mathbf{1}\{np(1-\epsilon)\le Z'\le np(1+\epsilon)\}\Big]\\
&\ge \underbrace{\mathbb{E}_{Z,Z'}\Big[(k-Z)\,h_b\big(1-(1-q_2)^{Z'}\big)\cdot\mathbf{1}\{Z=0\}\cdot\mathbf{1}\{np(1-\epsilon)\le Z'\le np(1+\epsilon)\}\Big]}_{(a)}\\
&\quad + \underbrace{\mathbb{E}_{Z,Z'}\Big[(k-Z)\,h_b\big(1-(1-q_1)^Z(1-q_2)^{Z'}\big)\cdot\mathbf{1}\{1\le Z\le z^*\}\cdot\mathbf{1}\{np(1-\epsilon)\le Z'\le np(1+\epsilon)\}\Big]}_{(b)}. \qquad(14)
\end{align*}
We will pick $\epsilon=1/2$. Then (a) can be bounded by
\begin{align*}
(a) &\ge k\cdot h_b\Big(q_2\,np(1-\epsilon)-\big(q_2\,np(1-\epsilon)\big)^2\Big)\cdot\Big(1-2\exp\Big(-\frac{n\epsilon^2 p}{3}\Big)\Big)\\
&\gtrsim k\cdot npq_2(1-\epsilon)\log\frac{1}{npq_2(1-\epsilon)}\cdot\Big(1-2\exp\Big(-\frac{n\epsilon^2 p}{3}\Big)\Big)\\
&\gtrsim k\cdot npq_2\log\frac{1}{npq_2},
\end{align*}
where in the first inequality we use

1. $Z' \ge np(1-\epsilon)$,
2. $(1-q_2)^{Z'} \le e^{-q_2 Z'} \le 1-q_2 Z' + (q_2 Z')^2$,
3. the Chernoff bound on $Z'$,

and in the third inequality we assume $np \gtrsim 1$. Next, (b) can be bounded by
\begin{align*}
(b) &\ge h_b\Big(q_1+npq_2(1-\epsilon)-\big(q_1+npq_2(1-\epsilon)\big)^2\Big)\cdot \mathbb{E}_Z\big[(k-Z)\,\mathbf{1}\{1\le Z\le z^*\}\big]\cdot\Big(1-2\exp\Big(-\frac{n\epsilon^2 p}{3}\Big)\Big)\\
&\gtrsim (q_1+npq_2)\log\frac{1}{q_1+npq_2}\cdot \mathbb{E}_Z\big[(k-Z)\,\mathbf{1}\{1\le Z\le z^*\}\big].
\end{align*}
We will now lower bound $\mathbb{E}_Z\big[(k-Z)\,\mathbf{1}\{1\le Z\le z^*\}\big]$ as in Theorem 1. Observe that
$$\mathbb{E}_Z\big[(k-Z)\,\mathbf{1}\{1\le Z\le z^*\}\big] \ge \mathbb{E}_Z[k-Z] - kP\{Z=0\} - kP\{Z\ge z^*\} \ge k\Big(1-p-(1-p)^{k} - P\{Z\ge z^*\}\Big) \gtrsim k\big(kp - P\{Z\ge z^*\}\big). \qquad(15)$$
Finally, applying Hoeffding's inequality to $P\{Z\ge z^*\}$ yields
$$P\{Z\ge z^*\} \le \exp\Bigg(-2k\Big(p-\frac{z^*}{k}\Big)^2\Bigg) = \exp\Bigg(-2k\bigg(p-\frac{1/2-npq_2(1+\epsilon)}{q_1 k}\bigg)^2\Bigg) \overset{(1)}{\lesssim} \exp\Bigg(-2k\Big(\frac{1}{4q_1 k}\Big)^2\Bigg) = \exp\Big(-\frac{1}{8kq_1^2}\Big) \overset{(2)}{\le} kp,$$
where (1) holds when $npq_2 \lesssim 1$ and $p \lesssim \frac{1}{q_1 k}$, and (2) holds when
$$q_1 \le \sqrt{\frac{1}{8k\big(\log\frac{1}{kp}+1\big)}}.$$
Plugging into (15) yields
$$\mathbb{E}_Z\big[(k-Z)\,\mathbf{1}\{1\le Z\le z^*\}\big] \gtrsim k^2 p, \qquad(16)$$
and thus by putting together our bounds on (a) and (b) in (14), we arrive at
\begin{align}
\mathbb{E}_{Z,Z'}\Big[(k-Z)\,h_b\big(1-(1-q_1)^Z(1-q_2)^{Z'}\big)\Big] &\gtrsim k\cdot npq_2\log\frac{1}{npq_2} + k^2 p\,(q_1+npq_2)\log\frac{1}{q_1+npq_2} \tag{18}\\
&\gtrsim mk^2 pq_2\log\frac{1}{npq_2} + k^2 p\,q_1\log\frac{1}{q_1+npq_2}. \tag{19}
\end{align}
$\square$

E Proofs of Additional Lemmas
E.1 Proof of Lemma 5
Let $Y_v$ be the indicator random variable of whether vertex $v$ is a seed, and assume without loss of generality that $v\in C$. We have
$$P(X_v=1) = \underbrace{P(X_v=1\mid Y_v=1)}_{=1}\cdot\underbrace{P(Y_v=1)}_{=p} + P(X_v=1\mid Y_v=0)\cdot P(Y_v=0) = p + (1-p)\cdot P(X_v=1\mid Y_v=0)$$
and
\begin{align*}
P(X_v=1\mid Y_v=0) &= P\{v\text{ is infected by a neighbor}\}\\
&= 1-\prod_{u\in N(v)}P\{v\text{ isn't infected by }u\}\\
&= 1-\prod_{u\in N(v)}\big(1-P\{v\text{ is infected by }u\}\big)\\
&= 1-\prod_{u\in N(v)}\big(1-P\{v\text{ is infected by }u\mid Y_u=1\}\cdot P(Y_u=1)\big)\\
&= 1-\Bigg(\prod_{u\in C\setminus\{v\}}(1-pq_1)\Bigg)\cdot\Bigg(\prod_{w\notin C}(1-pq_2)\Bigg)\\
&= 1-(1-pq_1)^{k-1}\cdot(1-pq_2)^{n-k}. \qquad\square
\end{align*}

E.2 Proof of Lemma 6

Let $A$ be the event that no member of community $C$ is selected as a seed, and let $B$ be the event that some member of $C$ is infected by an individual outside $C$. We further denote by $B_u$ the event that vertex $u$ infects some member of $C$, where $u\notin C$. Note that $X_C=1$ if and only if either $A^c$ occurs or $A\cap B$ occurs. Moreover, $A$ and $B$ are independent events. We have that $P(A)=(1-p)^k$, and thus
$$P(X_C=1) = P(A^c) + P(A)\cdot P(B) = 1-(1-p)^k + (1-p)^k\cdot P(B) = 1-(1-p)^k\cdot(1-P(B)).$$
Finally, we compute $P(B)$ as
\begin{align*}
P(B) &= 1-\prod_{u\notin C}P(B_u^c)\\
&= 1-\prod_{u\notin C}\Big(P(B_u^c\mid Y_u=1)\cdot\underbrace{P(Y_u=1)}_{=p} + \underbrace{P(B_u^c\mid Y_u=0)}_{=1}\cdot\underbrace{P(Y_u=0)}_{=1-p}\Big)\\
&= 1-\prod_{u\notin C}\Big(1-p+p\cdot P(B_u^c\mid Y_u=1)\Big)\\
&= 1-\prod_{u\notin C}\Big(1-p+p\cdot(1-q_2)^k\Big)\\
&= 1-\Big(1-p\big(1-(1-q_2)^k\big)\Big)^{n-k}. \qquad\square
\end{align*}
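As a numerical sanity check on Lemma 5 and Lemma 6, the following sketch (our own test harness, not part of the paper; the simulator structure, the parameter values, and the block community assignment $v \mapsto \lfloor v/k \rfloor$ are illustrative assumptions) simulates the one-hop SBIM infection process directly and compares empirical frequencies against the two closed-form expressions:

```python
import random

def sbim_draw(n, k, p, q1, q2, rng):
    """One infection draw: seeds ~ Ber(p); a seed infects each community-mate
    w.p. q1 and each outside vertex w.p. q2 (one-hop transmission)."""
    seeds = [v for v in range(n) if rng.random() < p]
    seed_set = set(seeds)
    X = [0] * n
    for v in range(n):
        if v in seed_set:
            X[v] = 1
            continue
        for u in seeds:                          # lazy per-pair transmission draws
            if rng.random() < (q1 if u // k == v // k else q2):
                X[v] = 1
                break
    return X

def lemma5(n, k, p, q1, q2):
    # P(X_v = 1) = 1 - (1 - p)(1 - p q1)^(k-1) (1 - p q2)^(n-k)
    return 1 - (1 - p) * (1 - p * q1)**(k - 1) * (1 - p * q2)**(n - k)

def lemma6(n, k, p, q1, q2):
    # P(X_C = 1) = 1 - (1 - p)^k (1 - p (1 - (1 - q2)^k))^(n-k)
    return 1 - (1 - p)**k * (1 - p * (1 - (1 - q2)**k))**(n - k)

rng = random.Random(1)
n, k, p, q1, q2 = 60, 6, 0.05, 0.3, 0.02   # illustrative parameters
trials = 40_000
hits_v, hits_C = 0, 0
for _ in range(trials):
    X = sbim_draw(n, k, p, q1, q2, rng)
    hits_v += X[0]            # marginal infection of vertex 0 (Lemma 5)
    hits_C += any(X[:k])      # community containing vertices 0..k-1 (Lemma 6)
```

With these parameters the empirical frequencies should land within a few standard errors of the closed-form values.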