Edge Estimation with Independent Set Oracles
Paul Beame, Sariel Har-Peled, Sivaramakrishnan Natarajan Ramamoorthy, Cyrus Rashtchian, Makrand Sinha
September 11, 2018
Abstract
We study the task of estimating the number of edges in a graph with access to only an independent set oracle. Independent set queries draw motivation from group testing and have applications to the complexity of decision versus counting problems. We give two algorithms to estimate the number of edges in an n-vertex graph, using (i) polylog(n) bipartite independent set queries, or (ii) n^{2/3} polylog(n) independent set queries.
1. Introduction
We investigate the problem of estimating the number of edges in a simple, unweighted, undirected graph G = (⟦n⟧, E), where ⟦n⟧ := {1, 2, ..., n} and m = |E|, using only an oracle that answers independent set queries. For a parameter ε > 0, we wish to output an estimate m̃ satisfying (1 − ε)m ≤ m̃ ≤ (1 + ε)m with high probability. We consider randomized, adaptive algorithms with access to one of the two following oracles:

⋆ BIS (bipartite independent set) oracle: Given disjoint subsets
U, V ⊆ ⟦n⟧, a BIS query answers whether there is no edge between U and V in G. Formally, the oracle returns whether m(U, V) = 0, where m(U, V) denotes the number of edges with one endpoint in U and the other in V.

⋆ IS (independent set) oracle: Given a subset U ⊆ ⟦n⟧, an IS query answers whether U satisfies m(U) = 0, where m(U) denotes the number of edges with both endpoints in U.

Previous work on graph parameter estimation has primarily focused on local queries, such as (i) degree queries (which output the degree of a vertex v), (ii) edge existence queries (which answer whether a pair {u, v} forms an edge), or (iii) neighbor queries (which provide the i-th neighbor of a vertex v). However, such queries cannot achieve sub-polynomial query costs on certain graphs identified by Feige [Fei06] and Goldreich and Ron [GR08], since they can only obtain local information about the graph. This motivates an investigation of other queries that may enable efficient parameter estimation. The independent set queries described above generalize an edge existence query, and their non-locality opens the door for sub-polynomial query algorithms for various graph parameter estimation tasks.

∗ A preliminary version of this paper appeared in ITCS 2018 [BHN+].
† Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle. Research supported in part by NSF grant CCF-1524246. [email protected]
‡ Department of Computer Science, University of Illinois, Urbana-Champaign. Supported in part by NSF AF awards CCF-1421231 and CCF-1217462. Work done while visiting the University of Washington on sabbatical in 2017.
§ Supported by the NSF under agreements CCF-1149637, CCF-1420268, CCF-1524251. [email protected]
¶ Work partially completed while the author was at Microsoft Research. [email protected]
‖ Supported by the NSF under agreements CCF-1149637, CCF-1420268, CCF-1524251. [email protected]

The most relevant motivation for
BIS and IS queries comes from the area of sub-linear time algorithms for graph parameter estimation. BIS and IS queries also have interesting connections to the classical area of group testing, to emptiness versus counting questions in computational geometry, and to the complexity of decision versus counting problems.

Graph parameter estimation.
Feige [Fei06] showed how to use O(√n/ε) degree queries to output m̃ that satisfies m ≤ m̃ ≤ (2 + ε)m, where m = |E|. Moreover, he showed that any algorithm achieving better than a 2-approximation must use a nearly linear number of degree queries. Goldreich and Ron [GR08] showed that by using both degree and neighbor queries, the approximation improves to (1 − ε)m ≤ m̃ ≤ (1 + ε)m by using √n · poly(log n, 1/ε) queries. It is worth noting that Feige [Fei06] and Goldreich and Ron [GR08] identified certain hard instances showing that these upper bounds cannot be improved, up to polylog factors.

Related work approximates the number of stars [GRS11], the minimum vertex cover [ORRR12], the number of triangles [Ses15, ELRS17], and the number of k-cliques [ERS17]. A special case of BIS query (where one of the bipartition sets is a singleton) has been used for testing k-colorability of graphs [BKKR13] and edge estimation [WLY13].

Group testing.
A classic estimation problem involves efficiently approximating the number of defective items or infected individuals in a certain collection or population [CS90, Dor43, Swa85]. To query a population, a small group is formed, and all the individuals in the group are tested in one shot. For example, in genome-wide association studies, combined pools of DNA may be tested as a group for certain variants [KZC+].

IS/BIS queries. In the graph setting, group testing suggests testing pairwise interactions between many items or individuals, instead of singular events.
Computational geometry.
Certain geometric applications exhibit the phenomenon that emptiness queries have more efficient algorithms than counting queries. For example, in three dimensions, for a set P of n points, half-space counting queries (i.e., what is the size of the set |P ∩ h|, for a query half-space h) can be answered in O(n^{2/3}) time, after near-linear time preprocessing. On the other hand, emptiness queries (i.e., is the set P ∩ h empty?) can be answered in O(log n) time. Aronov and Har-Peled [AH08] used this to show how to answer approximate counting queries (i.e., estimating |P ∩ h|) with polylogarithmic emptiness queries.

As another geometric example, consider the task of counting edges in disk intersection graphs using GPUs [Fis03]. For these graphs, IS queries decide if a subset of the disks has any intersection (this can be done using sweeping in O(n log n) time [CJ15]). Using a GPU, one could quickly draw the disks and check if the sets share a common pixel. In cases like this – when IS and BIS oracles have fast implementations – algorithms exploiting independent set queries may be useful.

Decision versus counting complexity.
A generalization of IS and BIS queries previously appeared in a line of work investigating the relationship between decision and counting problems [Sto83, Sto85, DL17]. Stockmeyer [Sto83, Sto85] showed how to estimate the number of satisfying assignments for a circuit with queries to an NP oracle. Ron and Tsur [RT16] observed that Stockmeyer implicitly provided an algorithm for estimating set cardinality using subset queries, where a subset query specifies a subset X ⊆ U and answers whether |X ∩ S| = 0 or not. Subset queries generalize IS and BIS queries because S corresponds to the set of edges in the graph and X is any subset of pairs of vertices.

Indeed, consider subset queries in the context of estimating the number of edges in a graph. To this end, fix |S| = m (i.e., the number of edges in the graph) and |U| = (n choose 2) (the number of possible edges). Stockmeyer provided an algorithm using only O(log log m · poly(1/ε)) subset queries to estimate m within a factor of (1 + ε) with a constant success probability. Note that for a high probability bound, which is what we focus on in this paper, the algorithm would naively require O(log n · log log m · poly(1/ε)) queries to achieve success probability at least 1 − 1/n. Falahatgar et al. [FJO+16] gave an improved algorithm that estimates m up to a factor of (1 + ε) with probability 1 − δ using 2 log log m + O((1/ε²) log(1/δ)) subset queries. Nearly matching lower bounds are also known for subset queries [Sto83, Sto85, RT16, FJO+16]. A natural restriction of subset queries, interval queries, has also been studied [RT16], where the universe U is assumed to be ordered and the subsets must be intervals of elements. We view the independent set queries that we study as another natural restriction of subset queries.

Analogous to Stockmeyer's results, a recent work of Dell and Lapinskas [DL17] provides a framework that relates edge estimation using BIS and edge existence queries to a question in fine-grained complexity. They study the relationship between decision and counting versions of problems such as 3SUM and Orthogonal Vectors. They proved that, for a bipartite graph, using O(ε^{−2} polylog(n)) BIS queries and ε^{−2} n polylog(n) edge existence queries, one can output a number m̃ such that, with probability at least 1 − 1/n, we have (1 − ε)m ≤ m̃ ≤ (1 + ε)m.

Dell and Lapinskas [DL17] used edge estimation to obtain approximate counting algorithms for problems in fine-grained complexity. For instance, given an algorithm for 3SUM with runtime T, they obtain an algorithm that estimates the number of YES instances of 3SUM with runtime O(T · ε^{−2} polylog(n)) + ε^{−2} n polylog(n). The relationship is simple. The decision version of 3SUM corresponds to checking if there is at least one edge in a certain bipartite graph. The counting version then corresponds to counting the edges in this graph. We note that in their application, the large number O(n polylog(n)) of edge existence queries does not affect the dominating term in the overall time in their reduction; the larger term in the time is a product of the time to decide 3SUM and the number of BIS queries.
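To make the subset-query primitive concrete, here is a small illustrative sketch in the spirit of the geometric-subsampling idea (our own simplification, not the actual algorithm of [Sto85] or [FJO+16]; all function names are hypothetical): include each universe element in X independently with probability p = 2^{−i} and ask whether X misses S. Since Pr[X ∩ S = ∅] = (1 − p)^{|S|} crosses 1/2 near p ≈ 1/|S|, the level at which empty answers become the majority localizes |S| up to a constant factor.

```python
import random

def subset_query(X, S):
    """Subset-query oracle: does X miss S entirely, i.e., is |X ∩ S| = 0?"""
    return len(X & S) == 0

def coarse_estimate(universe, S, trials=200, rng=random):
    """Constant-factor estimate of |S| from subset queries alone.

    At sampling rate p = 2^{-i}, Pr[X ∩ S = ∅] = (1 - p)^{|S|}, which
    crosses 1/2 near p ≈ 1/|S|; we return 2^i at the first level where
    empty answers become the majority.
    """
    n = len(universe)
    for i in range(n.bit_length() + 1):
        p = 2.0 ** (-i)
        empty = sum(
            subset_query({x for x in universe if rng.random() < p}, S)
            for _ in range(trials)
        )
        if empty >= trials / 2:
            return 2 ** i
    return 2 ** n.bit_length()
```

Each level spends `trials` queries; the actual algorithms binary-search over the level and amplify far more carefully, which is where the log log m bounds come from.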
We describe two new algorithms. Let G = (⟦n⟧, E) be a simple graph with m = |E| edges.

The Bipartite Independence Oracle.
We present an algorithm that uses
BIS queries and computes an estimate m̃ for the number of edges in G, such that (1 − ε)m ≤ m̃ ≤ (1 + ε)m. The algorithm performs poly(log n, 1/ε) BIS queries and succeeds with high probability (see Theorem 4.9 for a precise statement). Since polylog(n) BIS queries can simulate a degree query (see Section 4.4), one can obtain a (2 + ε)-approximation of m by using Feige's algorithm [Fei06], which uses degree queries. This gives an algorithm that uses O(√n · polylog(n)/poly(ε)) BIS queries. Our new algorithm provides significantly better guarantees, in terms of both the approximation and the number of
BIS queries.

  Query types            Approximation   #Queries (up to const. factors)    Reference
  Edge existence         1 + ε           (n²/m) poly(log n, 1/ε)            Folklore (see Section 5.2)
  Degree                 2 + ε           √n log n / ε                       [Fei06]
  Degree + neighbor      1 + ε           √n poly(log n, 1/ε)                [GR08]
  Subset                 1 + ε           poly(log n, 1/ε)                   [Sto85, FJO+16]
  BIS + edge existence   1 + ε           n poly(log n, 1/ε)                 [DL17]
  BIS                    1 + ε           poly(log n, 1/ε)                   This work
  IS                     1 + ε           min(√m, n²/m) poly(log n, 1/ε)     This work

Table 1.1: Comparison of the best known algorithms using a variety of queries for estimating the number of edges m in a graph with n vertices. The bounds stated are for high probability results, with error probability at most 1/n. Constant factors are suppressed for readability.

Compared to the result of Dell and Lapinskas [DL17], our algorithm uses exponentially fewer queries, since we do not spend n polylog(n) edge existence queries. Our improvement does not seem to imply anything for their applications in fine-grained complexity. We leave open the question of finding problems where a more efficient BIS algorithm would lead to new decision versus counting complexity results.
The Ordinary Independence Oracle.
We also present a second algorithm, using only IS queries, to compute a (1 + ε)-approximation. It performs O(min(n²/m, √m) · poly(log n, 1/ε)) IS queries (see Theorem 5.8 for a precise statement). In particular, since min(n²/m, √m) ≤ n^{2/3} for every m, the number of IS queries is bounded by n^{2/3} · poly(log n, 1/ε). The first term in the minimum (i.e., ≈ n²/m) comes from a folklore algorithm for estimating set cardinality using membership queries (see Section 2.3). The second term in the minimum (i.e., ≈ √m) is the number of queries used by our new algorithm.

We observe that BIS queries are surprisingly more effective for estimating the number of edges than IS queries. Shedding light on this dichotomy is one of the main contributions of this work.

Comparison with other queries.
Table 1.1 summarizes the results for estimating the number of edges in a graph in the context of various query types. Given some of the results in Table 1.1 on edge estimation using other types of queries, a natural question is how well BIS and IS queries can simulate such queries. In Section 4.4, we show that poly(log n, 1/ε) BIS queries are sufficient to simulate a degree query. On the other hand, we do not know how to simulate a neighbor query (to find a specific neighbor) with few BIS queries, but a random neighbor of a vertex can be found with O(log n) BIS queries (see [BKKR13]). For IS queries, it turns out that estimating the degree of a vertex v up to a constant factor requires at least Ω(n/deg(v)) IS queries (see Section 5.3).

Notation.
Throughout, log and ln denote the logarithm taken in base two and base e, respectively. For integers u, k, let ⟦k⟧ = {1, ..., k} and ⟦u : k⟧ = {u, ..., k}. The notation x = polylog(n) means x = O(log^c n) for some constant c > 0. A collection of disjoint sets U_1, ..., U_k such that ∪_i U_i = U is a partition of the set U into k parts (a part U_i might be an empty set). In particular, a (uniformly) random partition of U into k parts is chosen by coloring each element of U with a random number in ⟦k⟧ and identifying U_i with the elements colored with i.

Throughout, we use G = (⟦n⟧, E) to denote the input graph. The number of edges in G is denoted by m = |E|. For a set U ⊆ ⟦n⟧, let E(U) = {uv ∈ E | u, v ∈ U} be the set of edges between vertices of U in G. For two disjoint sets U, V ⊆ ⟦n⟧, let E(U, V) denote the set of edges between U and V: E(U, V) = {uv ∈ E | u ∈ U, v ∈ V}. Let m(U) and m(U, V) denote the number of edges in E(U) and E(U, V), respectively. We also abuse notation and let m(H) be the number of edges in a subgraph H (e.g., m(G) = m).

High probability conventions.
Throughout the paper, the randomized algorithms presented succeed with high probability; that is, with probability ≥ 1 − 1/n^{O(1)}. Formally, this means the probability of success is ≥ 1 − 1/n^c, for some arbitrary constant c > 0. For all these algorithms, the value of c can be increased to any arbitrary value (i.e., improving the probability of success of the algorithm) by increasing the asymptotic running time of the algorithm by a constant factor that depends only on c. For the sake of simplicity of exposition, we do not explicitly keep track of these constants (which are relatively well-behaved).

BIS algorithm
Our discussion of the
BIS algorithm follows Figure 1.2, which depicts the main components of one level of our recursive algorithm. Our algorithms rely on several building blocks, as described next.
Exactly count edges.
One can exactly count the edges between two subsets of vertices with a number of queries that scales nearly linearly in the number of such edges. Specifically, a simple deterministic divide-and-conquer algorithm to compute m(U, V) using O(m(U, V) log n) BIS queries is described below in Lemma 4.1.
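A sketch of how such a divide and conquer might look (illustrative only; `has_edge` is a stand-in for the BIS oracle, and Lemma 4.1 is the authoritative version): if a BIS query reports an edge between U and V, split the larger side in half and recurse, so every edge is isolated after O(log n) halvings.

```python
def count_edges_between(U, V, has_edge):
    """Exactly count edges between disjoint vertex lists U and V using only
    BIS-style emptiness queries; has_edge(A, B) answers whether some edge
    joins A and B. Roughly O((m(U, V) + 1) log n) queries are issued, since
    each edge survives in at most O(log n) recursive calls."""
    if not U or not V or not has_edge(U, V):
        return 0
    if len(U) == 1 and len(V) == 1:
        return 1  # a single candidate pair that the oracle confirmed
    if len(U) >= len(V):
        mid = len(U) // 2
        return (count_edges_between(U[:mid], V, has_edge)
                + count_edges_between(U[mid:], V, has_edge))
    mid = len(V) // 2
    return (count_edges_between(U, V[:mid], has_edge)
            + count_edges_between(U, V[mid:], has_edge))
```

A BIS oracle can be simulated from an explicit edge set for testing, as below.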
Sparsify.
The idea is now to sparsify the graph in such a way that the number of remaining edges is a good estimate for the original number of edges (after scaling). Consider sparsifying the graph by coloring the vertices of the graph, and only looking at the edges going between certain pairs of color classes (in our algorithm, these pairs are a matching of the color classes). We prove that it suffices to only count the edges between these color classes, and we can ignore the edges with both endpoints inside a single color class.

For 1 ≤ k ≤ ⌊n/2⌋, let U_1, ..., U_k, V_1, ..., V_k be a uniformly random partition of ⟦n⟧ into 2k parts. Then, we have

  P[ | 2k · Σ_{i=1}^{k} m(U_i, V_i) − m | ≥ c k √m log n ] ≤ 1/n^{O(1)},   (1.1)

where c is some constant. For the proof of this inequality see Section 3. Specifically, if we set G_i to be the induced bipartite subgraph on U_i and V_i, then 2k · Σ_i m(G_i) is a good estimate for m(G).

Now the graph is bipartite.
The above sparsification method implies that we can assume, without loss of generality, that the graph is bipartite. Indeed, invoking the lemma with k = 2, we see that estimating the number of edges between the two color classes is equivalent to estimating the total number of edges, up to a factor of two. For the rest of the discussion, we will consider colorings that respect the bipartition.

Figure 1.2: A depiction of one level of the BIS algorithm. In the first step, we color the vertices and sparsify the graph by only looking at the edges between vertices of the same color. In the second step, we coarsely estimate the number of edges in each colored subgraph. Next, we group these subgraphs based on their coarse estimates, and we subsample from the groups with a relatively large number of edges. In the final step, we exactly count the edges in the sparse subgraphs, and we recurse on the dense subgraphs.
Coarse estimator.
We give an algorithm that coarsely estimates the number of edges in a (bipartite) subgraph, up to an O(log n) factor, using only O(log n) BIS queries.
The subproblems.
After coloring the graph, we have reduced the problem to estimating the total number of edges in a collection of (disjoint) bipartite subgraphs. However, certain subgraphs may still have a large number of edges, and it would be too expensive to directly use the exact counting algorithm on them.

Reducing the number of subgraphs in a collection, via importance sampling.
Using the coarse estimates, we can form O(log n) groups of bipartite subgraphs, where each group contains subgraphs with a comparable number of edges. For the groups with only a polylogarithmic number of edges, we can exactly count edges using polylog(n) BIS queries via the exact count algorithm mentioned above. For the remaining groups, we subsample a polylogarithmic number of subgraphs from each group. Since the groups contain subgraphs with a similar number of edges, the number of edges in the subsampled subgraphs will be proportional to the total number of edges in the group, up to a scaling factor depending on the factors by which sizes in a given group can vary, and this new estimate is a good approximation to the original quantity, with high probability. This corresponds to the technique of importance sampling, which is used for variance reduction when estimating a sum of random variables that have comparable magnitudes.
Sparsify and reduce.
We use the sparsification algorithm on each graph in our collection. This increases the number of subgraphs while reducing (by roughly a factor of k) the total number of edges in these graphs. The number of edges in the new collection is a reliable estimate for the number in the old collection. We will choose k to be a constant so that every sparsification round reduces the number of edges by a constant factor.

If the number of graphs in the collection becomes too large, then we reduce it in one of two ways. For the subgraphs with relatively few edges, we exactly count the number of edges using only polylog(n) queries. For the dense subgraphs, we can apply the above importance sampling technique and retain only polylog(n) subgraphs. Every basic operation in this scheme requires polylog(n) BIS queries, and the number of subgraphs is polylog(n). Therefore, a round can be implemented using polylog(n) BIS queries. Now, since every round reduces the number of edges by a constant factor, the algorithm terminates after O(log n) rounds, resulting in the desired estimate for m using only polylog(n) queries in total. Figure 1.2 depicts the main components of one round.

We have glossed over some details regarding the reweighting of intermediate estimates. Recall that both the sparsification and importance sampling steps involve subsampling and rescaling. To handle this, the algorithm will maintain a weight value for each subgraph in the collection (starting with unit weight). Then, these weights will be updated throughout the execution, and they will be used during coarse estimation. For the final estimate, the algorithm will output a weighted sum of the estimates for the remaining subgraphs, in addition to the weighted version of the exactly counted subgraphs. By using these weights to properly rescale estimates and counts, the algorithm will achieve a good estimate for m with high probability.

IS algorithm

We move on to describe our second algorithm, based on IS queries.
As with the BIS algorithm, the main building block for the IS algorithm is an efficient way to exactly count edges using IS queries. The exact counting algorithm works by first breaking the vertices of the graph into independent sets in a greedy fashion, and then grouping these independent sets into larger independent sets using (yet again) a greedy algorithm. The resulting partition of the graph into independent sets has the property that every two sets have an edge between them, and this partition can be computed using a number of queries that is roughly m. This is beneficial because, when working on the induced subgraph on two independent sets, the IS queries can be interpreted as BIS queries. As such, edges between parts of the partition can be counted using the exact counting algorithm, modified to use IS queries. The end result is that, for a given set U ⊆ ⟦n⟧, one can compute m(U), the number of edges with both endpoints in U, using O(m(U) log n) IS queries. This algorithm is described in Section 5.1.

Now, we can sparsify the graph to reduce the overall number of IS queries. In contrast to the BIS queries, we do not know how to design a coarse estimator using only IS queries (see Section 5.3). This prohibits us from designing a similar algorithm. Instead, we estimate the number of edges in one shot, by coloring the graph with a large number of colors and estimating the number of edges going between a matching of the color classes.

Using a matching of the color classes for sparsification seems somewhat counterintuitive. An initial sparsification attempt might be to count only the edges going between a single pair of colors. If the total number of colors is 2k, then we expect to see m/(2k choose 2) edges between this pair. Therefore, we could set k to be large and invoke Lemma 5.3.
Scaling by a factor of (2k choose 2), we would hope to get an unbiased estimator for m.

Unfortunately, a star graph demonstrates that this approach does not work, due to the large variance of this estimator. If we randomly color the vertices of the star graph with 2k colors, then out of the (2k choose 2) pairs of color classes, only the 2k − 1 pairs involving the center's color class can contain any edges, and such a pair contains roughly m/(2k) edges; any other pair contains none. In both cases, our estimate after scaling by a factor of (2k choose 2) will be far from the truth.

At the other extreme, the vast majority of edges will be present if we look at the edges crossing all pairs of color classes. Indeed, the only edges we miss have both endpoints in a single color class, and this accounts for only a 1/(2k) fraction of the total number of edges. Thus, this does not achieve any substantial sparsification.

By using a matching of the color classes, we simultaneously get a reliable estimate of the number of edges and a sufficiently sparsified graph (see Lemma 3.2). Let U_1, ..., U_k, V_1, ..., V_k be a random partition of the vertices into 2k color classes. This implies that with high probability, the estimator 2k · Σ_{i=1}^{k} m(U_i, V_i) is in the range m ± O(k √m log n). Hence, as long as we choose k to be less than ε√m/polylog(n), we approximate m up to a factor of (1 + O(ε)). We use geometric search to find such a k efficiently.

To get a bound on the number of IS queries, we claim that we can compute Σ_{i=1}^{k} m(U_i, V_i) using Lemma 5.3, with a total of (k + m/k) polylog(n) IS queries. The first term arises since we have to make at least one query for each of the k color pairs (even if there are no edges between them).
For the second term, we pay for both (i) the edges between the color classes and (ii) the total number of edges with both endpoints within a color class (since the number of IS queries in Lemma 5.3 scales with m(U ∪ V)). By the sparsification lemma, we know that (i) is bounded by O(m/k) with high probability, and we can prove an analogous statement for (ii). Hence, plugging in a value of k ≈ ε√m/polylog(n), the total number of IS queries is bounded by √m · polylog(n)/ε.

The rest of the paper is organized as follows. We start in Section 2 by reviewing some necessary tools – concentration inequalities, importance sampling, and set size estimation via membership queries. In Section 3, we prove our sparsification result (Lemma 3.2). In Section 4, we describe the algorithm for edge estimation for the
BIS case. Section 4.1 describes the exact counting algorithm. In Section 4.2, we present the algorithm that uses
BIS queries to coarsely estimate the number of edges between two subsets of vertices (Lemma 4.8). We combine these building blocks to construct our edge estimation algorithm using BIS queries in Section 4.3.

The case of IS queries is tackled in Section 5. In Section 5.1, we formally present the algorithms to exactly count edges between two subsets of vertices (Lemma 5.3). In Section 5.2, we present our algorithm using IS queries. In Section 5.3, we provide some discussion of why the IS case seems to be harder than the BIS case. We conclude in Section 6 and discuss open questions.
2. Preliminaries
Here we present some standard tools that we need later on.
For proofs of the following concentration bounds, see the book by Dubhashi and Panconesi [DP09].
Lemma 2.1 (Hoeffding’s inequality).
Let X_1, ..., X_r be independent random variables satisfying X_i ∈ [a_i, b_i] for i ∈ [r]. Then, for X = X_1 + ··· + X_r and any s > 0, we have

  P[ |X − E[X]| ≥ s ] ≤ 2 exp( −2s² / Σ_{i=1}^{r} (b_i − a_i)² ).

Lemma 2.2 (Chernoff–Hoeffding inequality [DP09, Theorem 1.1]).
Let X_1, ..., X_r be r independent random variables with 0 ≤ X_i ≤ 1, and let X = Σ_{i=1}^{r} X_i. For μ = E[X], let ℓ and u be real numbers such that ℓ ≤ μ ≤ u. Then, we have that:
(A) For any Δ > 0, we have P[X ≤ ℓ − Δ] ≤ exp(−2Δ²/r) and P[X ≥ u + Δ] ≤ exp(−2Δ²/r).
(B) For any 0 ≤ δ < 1, we have P[X ≤ (1 − δ)μ] ≤ exp(−μδ²/2).
(C) For any 0 ≤ δ ≤ 1, we have P[X ≥ (1 + δ)μ] ≤ exp(−μδ²/3).

We need a version of Azuma's inequality that takes into account a rare bad event – the following is a restatement of Theorem 8.3 from [CL06] in a simplified form (that is sufficient for our purposes).
Lemma 2.3 ([CL06]).
Let f be any function of r independent random variables Y_1, ..., Y_r, and let X_i = E[f(Y_1, ..., Y_r) | Y_1, ..., Y_i], for i ∈ ⟦r⟧, and X_0 = E[f(Y_1, ..., Y_r)]. Say that a sequence Y_1, ..., Y_r is bad if there exists an index i such that |X_i − X_{i−1}| > c_i, where c_1, ..., c_r are some nonnegative numbers. Let B be the event that a bad sequence happened, and let S = Σ_{i=1}^{r} c_i². We have that

  P[ |X_r − X_0| ≥ λ ] ≤ 2 exp( −λ² / (2S) ) + P[B].

Importance sampling is a technique for estimating a sum of terms. Assume that for each term in the summation, we can cheaply and quickly get an initial, coarse estimate of its value. Furthermore, assume that better estimates are possible but expensive. Importance sampling shows how to sample terms in the summation, then acquire a better estimate only for the sampled terms, to get a good estimate for the full summation. In particular, the number of samples is bounded independently of the original number of terms, depending instead on the coarseness of the initial estimates, the probability of success, and the quality of the final output estimate.

Lemma 2.4 (Importance Sampling).
Let U = {x_1, ..., x_r} be a set of numbers, all contained in the interval [α/b, αb], for α > 0 and b ≥ 1. Let γ, ε > 0 be parameters. Consider the sum Γ = Σ_{i=1}^{r} x_i. For an arbitrary t ≥ (b⁴/ε²) ln(2/γ), and i = 1, ..., t, let X_i be a random sample chosen uniformly (and independently) from the set U. Then, the estimate Y = (r/t) Σ_{i=1}^{t} X_i for the value of Γ satisfies P[ |Y − Γ| ≥ εΓ ] ≤ γ.

Proof:
Observe that r(α/b) ≤ Γ ≤ rαb, and

  μ = E[Y] = E[(r/t) Σ_{i=1}^{t} X_i] = (r/t) Σ_{i=1}^{t} E[X_i] = (r/t) · t · (Γ/r) = Γ.

Furthermore, we have Z = (t/r)Y = Σ_{i=1}^{t} X_i and E[Z] = (t/r)Γ, and for the length Δ_i of the interval containing X_i, we have Δ_i = (αb − α/b) ≤ αb. Using rα/b ≤ Γ, by Lemma 2.1, we have

  P[ |Y − Γ| ≥ εΓ ] = P[ |(t/r)Y − (t/r)Γ| ≥ tεΓ/r ] ≤ P[ |Σ_{i=1}^{t} X_i − (t/r)Γ| ≥ tε(rα/b)/r ]
    = P[ |Σ_{i=1}^{t} X_i − E[Z]| ≥ tεα/b ] ≤ 2 exp( −2(tεα/b)² / Σ_{i=1}^{t} Δ_i² )
    ≤ 2 exp( −2(tεα/b)² / (t α² b²) ) = 2 exp( −2tε²/b⁴ ) ≤ γ.

The above lemma enables us to reduce a summation with many numbers into a much shorter summation (while introducing some error, naturally). The list/summation reduction algorithm we need is described next.
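Before that, note that in code the estimator of Lemma 2.4 above is a one-liner; the work is in choosing t, which the sketch below leaves to the caller:

```python
import random

def importance_sample_sum(xs, t, rng):
    """Estimate Gamma = sum(xs) by drawing t uniform samples from xs with
    replacement and rescaling the sample sum by r/t, where r = len(xs).
    Per Lemma 2.4, t on the order of b^4 * eps^-2 * log(1/gamma) suffices
    for a (1 +- eps)-estimate when all values lie in [alpha/b, alpha*b]."""
    r = len(xs)
    return (r / t) * sum(rng.choice(xs) for _ in range(t))
```

With values in [1, 4] (so b = 2, α = 2), a few thousand samples already estimate a thousand-term sum to within a few percent.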
Lemma 2.5 (Summation reduction).
Let (H_1, w_1, e_1), ..., (H_r, w_r, e_r) be given, where the H_i's are some structures, and w_i and e_i are numbers, for i = 1, ..., r. Every structure H_i has an associated weight w(H_i) ≥ 0 (the exact value of w(H_i) is not given to us). In addition, let ξ > 0, γ, b, and M be parameters, such that:
(i) for all i, w_i, e_i ≥ 1,
(ii) for all i, e_i/b ≤ w(H_i) ≤ e_i b, and
(iii) Γ = Σ_i w_i · w(H_i) ≤ M.
Then, one can compute a new sequence of triples (H'_1, w'_1, e'_1), ..., (H'_t, w'_t, e'_t) that also complies with the above conditions, such that the estimate Y = Σ_{i=1}^{t} w'_i w(H'_i) is a multiplicative (1 ± ξ)-approximation to Γ, with probability ≥ 1 − γ. The running time of the algorithm is O(r), and the size of the output sequence is t = O(b⁴ ξ^{−2} (log log M + log γ^{−1}) log M).

Proof:
We break the interval [1, M] into log M intervals in the natural way, where the j-th interval is J_j = [2^{j−1}, 2^j), for j = 1, ..., h = ⌈log M⌉, except if M is a power of 2, in which case the last interval is closed and also includes 2^h = M. Input triples are sorted into h groups U_1, ..., U_h, where an input triple (H, w, e) is in U_j if ew ∈ J_j. This mapping can be done in O(r) time.

Let α = O(b⁴ ξ^{−2} ln(h/γ)). For j = 1, ..., h, if |U_j| ≤ α, then set R_j = U_j; otherwise, compute a sample R_j from U_j of size α. We associate the weight W_j = |U_j|/|R_j| with R_j. If a triple (H, w, e) ∈ U_j, then we have that w · w(H) ∈ [2^{j−1}/b, 2^j b].

For all j ∈ ⟦h⟧, let Γ_j = Σ_{(H,w,e) ∈ U_j} w · w(H) be the total weight of structures in the j-th group. By Lemma 2.4, we have, with probability ≥ 1 − γ/h, that

  Y_j = W_j Σ_{(H,w,e) ∈ R_j} w(H) · w ∈ [ (1 − ξ)Γ_j, (1 + ξ)Γ_j ].

Summing these inequalities over all j ∈ ⟦h⟧ implies that Y is the desired approximation with probability ≥ 1 − γ.

Specifically, the output sequence is constructed as follows. For all j ∈ ⟦h⟧, and for every triple (H, w, e) ∈ R_j, we add (H, w · W_j, e) to the output sequence. Clearly, the output sequence has t = hα = O(b⁴ ξ^{−2} (log log M + log γ^{−1}) log M) elements.

Remark. (A) The algorithm of Lemma 2.5 does not use the entities H_i directly at all. In particular, the H'_i's are just copies of some original structures. The only thing that the above lemma uses is the estimates e_1, ..., e_r and the weights w_1, ..., w_r.
(B) The sample size used in Lemma 2.5 can probably be improved by polylog factors by sampling directly from all log M classes simultaneously.

Remark 2.6.
We are going to use Lemma 2.5 with ξ = O(ε/log n), γ = 1/n^{O(1)}, b = O(log n), and M = n². As such, the size of the output list is

L_len = O(log² n · ε⁻² log² n · (log log n + log n) log n) = O(ε⁻² log⁶ n).

We present here a standard tool for estimating the size of a subset via membership oracle queries. This is well known, but we provide the details for the sake of completeness, since we were unable to find a good reference that describes it in this form.
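As a concrete illustration, the bucketing-and-sampling reduction behind Lemma 2.5 can be sketched in Python as follows. The function name, the constant in the sample size, and the use of sampling without replacement are illustrative choices, not the paper's; each triple (H, w, e) is routed to the dyadic class containing w·e, and oversized classes are subsampled and reweighted so that the resulting estimate stays unbiased.

```python
import math
import random

def reduce_triples(triples, xi, gamma, b, M, rng=random):
    """Hedged sketch of the bucket-and-sample reduction of Lemma 2.5.

    `triples` is a list of (H, w, e) where w * e is a coarse proxy for the
    contribution w * w(H); triples are grouped by which dyadic interval
    [2^(j-1), 2^j) contains w * e, and large groups are subsampled.
    """
    h = max(1, math.ceil(math.log2(M)))
    groups = [[] for _ in range(h)]
    for (H, w, e) in triples:
        j = min(h - 1, max(0, math.floor(math.log2(max(1, w * e)))))
        groups[j].append((H, w, e))
    # Sample size alpha = O(b^2 xi^-2 log(h / gamma)); the constant is arbitrary.
    alpha = math.ceil(8 * b * b * xi**-2 * math.log(h / gamma + 2))
    out = []
    for group in groups:
        if len(group) <= alpha:
            out.extend(group)
        else:
            W = len(group) / alpha          # reweighting factor W_j = |U_j|/|R_j|
            for (H, w, e) in rng.sample(group, alpha):
                out.append((H, w * W, e))   # weights scaled so E[Y] = Gamma
    return out
```

Note that, as in Remark (A) above, the structures H are carried along untouched; only the weights change.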
Lemma 2.7.
Consider two (finite) sets B ⊆ U, where n = |U|. Let ε, γ ∈ (0, 1) be parameters. Let g > 0 be a user-provided guess for the size of |B|. Consider a random sample R, taken with replacement from U, of size r = ⌈c ε⁻² (n/g) log γ⁻¹⌉, where c is a sufficiently large constant. Next, consider the estimate Y = (n/r)|R ∩ B| to |B|. Then, we have the following:
(A) If Y < g/2, then |B| < g.
(B) If Y ≥ g/2, then (1 − ε)Y ≤ |B| ≤ (1 + ε)Y.
Both statements above hold with probability ≥ 1 − γ.

Proof: (A) The bad scenario here is that |B| ≥ g, but Y < g/2. Let X_i = 1 ⟺ the i-th sample element is in B. We have that Y = (n/r)X, where X = Σ_{i=1}^r X_i. By assumption, we have

μ = E[X] = r|B|/n ≥ rg/n ≥ c log γ⁻¹ / ε².    (2.1)

As such, by Chernoff's inequality (Lemma 2.2 (B)), we have that

P[Y < g/2] = P[X < rg/(2n)] ≤ P[X < (1 − 1/2)E[X]] ≤ exp(−μ/8) ≤ γ^{c/(8ε² ln 2)},

and this is ≤ γ for c a sufficiently large constant.

(B) We have two cases to consider. First, suppose that |B| < g/4. In this case, if X = Σ_{i=1}^r X_i is the random variable described in part (A), then each X_i is an indicator variable with probability p = |B|/n < g/(4n), and

P[Y ≥ g/2] = P[X ≥ rg/(2n)] ≤ P[X' ≥ rg/(2n)],

where X' is the sum of r independent Bernoulli trials with success probability g/(4n). Now E[X'] = rg/(4n) ≥ c log γ⁻¹ / (4ε²), so

P[Y ≥ g/2] ≤ P[X' ≥ rg/(2n)] = P[X' ≥ (1 + 1)E[X']] ≤ exp(−E[X']/3) ≤ γ^{c/(12ε² ln 2)},

by Chernoff's inequality (Lemma 2.2 (C)), and again this is ≤ γ for c a sufficiently large constant.

For the second case, suppose that |B| ≥ g/4. Then, E[X] ≥ E[X'] ≥ c log γ⁻¹ / (4ε²) and, since Y is a fixed multiple of X, by Chernoff's inequality (Lemma 2.2 (B)), we have

P[Y < (1 − ε)E[Y]] = P[X < (1 − ε)E[X]] ≤ exp(−E[X]ε²/2) ≤ γ^{c/(8 ln 2)},

which is ≤ γ/2 for c sufficiently large. Similarly, by Chernoff's inequality (Lemma 2.2 (C)),

P[Y > (1 + ε)E[Y]] = P[X > (1 + ε)E[X]] ≤ exp(−E[X]ε²/3) ≤ γ^{c/(12 ln 2)},

which is ≤ γ/2 for c sufficiently large. (This last condition on c is the most stringent.) Adding these two failure probabilities together gives a bound of at most γ, as required.

Lemma 2.8.
Consider two sets B ⊆ U, where n = |U|. Let ξ, γ ∈ (0, 1) be parameters, such that γ < 1/log n. Assume that one is given access to a membership oracle that, given an element x ∈ U, returns whether or not x ∈ B. Then, one can compute an estimate s, such that (1 − ξ)|B| ≤ s ≤ (1 + ξ)|B|, and computing this estimate requires O((n/|B|) ξ⁻² log γ⁻¹) oracle queries. The returned estimate is correct with probability ≥ 1 − γ.

Proof: Let g_i = n/2^{i+2}. For i = 1, ..., log n, use the algorithm of Lemma 2.7 with ε = 0.5, with the probability of failure being γ/(8 log n), and let Y_i be the returned estimate. The algorithm stops this loop as soon as Y_i ≥ g_i. Let I be the value of i when the loop stopped. The algorithm now calls Lemma 2.7 again with g_I and ε = ξ, and returns the value of Y as the desired estimate.

Overall, for T = 1 + ⌈log n⌉, the above makes T calls to the subroutine of Lemma 2.7, and the probability that any of them fails is T γ/(8 log n) < γ. Assume that all invocations of Lemma 2.7 were successful. In particular, Lemma 2.7 guarantees that if Y ≥ g_I/2, then the estimate returned is a (1 ± ξ)-approximation to the desired quantity.

Computing Y_i requires r_i = O((n/g_i) log(log n/γ)) = O(2^i log γ⁻¹) oracle membership queries. As such, the number of membership queries performed by the algorithm overall is

Σ_i r_i + O((n/g_I) ξ⁻² log(log n/γ)) = O((n/|B|) ξ⁻² log γ⁻¹).

Consider the variant where we are given a set X ⊆ U. Given a query set Q ⊆ U, we have an emptiness oracle that tells us whether Q ∩ X is empty. Using an emptiness oracle, one can (1 ± ε)-approximate the size of X using relatively few queries. The following result is implied by the work of Aronov and Har-Peled [AH08, Theorem 5.6] and Falahatgar et al. [FJO+16] – the latter result has better bounds if the failure probability is not required to be polynomially small.
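The guess-halving estimator of Lemma 2.8 can be sketched as follows. This is a simplified illustration, not the paper's implementation: the sample-size constants are arbitrary, `member` stands in for the membership oracle, and sampling is with replacement as in Lemma 2.7.

```python
import math
import random

def estimate_set_size(universe, member, xi=0.25, gamma=0.01, rng=random):
    """Hedged sketch of the two-stage estimator of Lemma 2.8.

    Guesses are halved until the coarse estimate of Lemma 2.7 (run with
    eps = 1/2) certifies the current guess; a final refined round with
    eps = xi then produces the estimate.
    """
    n = len(universe)

    def sample_estimate(g, eps):
        # r = O(eps^-2 (n/g) log(1/gamma)) samples with replacement.
        r = math.ceil(16 * eps**-2 * (n / g) * math.log(1.0 / gamma + 2))
        hits = sum(member(rng.choice(universe)) for _ in range(r))
        return (n / r) * hits

    g = float(n)
    for _ in range(max(1, math.ceil(math.log2(n)))):
        y = sample_estimate(g, 0.5)
        if y >= g / 2:          # Lemma 2.7(B): the estimate is now trustworthy
            break
        g /= 2                  # Lemma 2.7(A): |B| < g, so halve the guess
    return sample_estimate(g, xi)
```

The loop mirrors the proof above: a guess that is too large is detected cheaply, and only the final call pays the ξ⁻² sampling cost.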
Lemma 2.9 ([AH08, FJO+16]). Consider a set X ⊆ U, where n = |U|. Let ε ∈ (0, 1) be a parameter. Assume that one is given access to an emptiness oracle that, given a query set Q ⊆ U, returns whether or not X ∩ Q ≠ ∅. Then, one can compute an estimate s such that (1 − ε)|X| ≤ s ≤ (1 + ε)|X|, using O(ε⁻² log n) emptiness queries. The returned estimate is correct with probability ≥ 1 − 1/n^{O(1)}.
We sketch the basic idea of the algorithm used in the above lemma. For a guess g of the size of X, consider a random sample Q where every element of U is picked with probability 1/g. The probability that Q avoids X is α(g) = (1 − 1/g)^{|X|}. The function α(g) is: (i) monotonically increasing, (ii) close to zero when g ≪ |X|, (iii) ≈ 1/e for g = |X|, and (iv) close to 1 if g ≫ |X|. One can estimate the value α(g) by repeated random sampling, checking whether the random sample intersects X using emptiness queries. Given such an estimate, one can then perform an approximate binary search for the value of g such that α(g) = 1/e, which corresponds to g = |X|. See [AH08, FJO+16] for further details.
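The sketch above can be turned into the following toy implementation. The trial count and the search granularity are illustrative, not the bounds of [AH08] or [FJO+16]; `nonempty` stands in for the emptiness oracle.

```python
import math
import random

def estimate_via_emptiness(n, universe, nonempty, trials=600, rng=random):
    """Hedged sketch of the emptiness-oracle estimator: geometric binary
    search for the guess g at which a rate-1/g random sample avoids X
    with probability about 1/e, i.e. g is about |X|.
    """
    def avoid_rate(g):
        # Empirical estimate of alpha(g) = Pr[sample at rate 1/g misses X].
        misses = 0
        for _ in range(trials):
            Q = [x for x in universe if rng.random() < 1.0 / g]
            if not nonempty(Q):
                misses += 1
        return misses / trials

    lo, hi = 1.0, float(n)
    # alpha(g) increases with g; search for alpha(g) ~ 1/e.
    for _ in range(int(math.ceil(math.log2(n))) + 5):
        mid = math.sqrt(lo * hi)        # geometric binary search
        if avoid_rate(mid) < 1.0 / math.e:
            lo = mid                    # samples hit X too often: g too small
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

Each `avoid_rate` call costs `trials` emptiness queries, so a real implementation would reuse samples and tune `trials` to the target accuracy.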
3. Edge sparsification by random coloring
In this section, we show that randomly coloring the vertices, and counting only the edges between specific color classes, provides a reliable estimate for the number of edges in the graph. This is distinct from standard graph sparsification algorithms, which usually sparsify the edges of the graph directly (typically by sampling edges). We need the following technical lemma.
Lemma 3.1.
Let C be a set of r elements, colored randomly by k colors – specifically, for every element x ∈ C, one chooses randomly (independently and uniformly) a color for it from the set [k]. For i ∈ [k], let n_i be the number of elements of C with color i. Let n be a positive integer and c > 0 be an arbitrary constant. Then:
(A) For any color i ∈ [k], we have P[|n_i − r/k| > √((cr/2) ln n)] ≤ 2/n^c.
(B) For any two distinct colors i, j ∈ [k], we have P[|n_i − n_j| > √(2cr ln n)] ≤ 4/n^c.
(C) For any two distinct colors i, j ∈ [k], we have E[|n_i − n_j|] ≤ √(2r/k).

Proof: (A) For ℓ ∈ [r], let X_ℓ be the indicator variable that is 1 with probability 1/k and 0 otherwise. For X = Σ_{ℓ=1}^r X_ℓ, notice that n_i is distributed identically to X, and that E[X] = E[n_i] = r/k. Using Chernoff's inequality (Lemma 2.2 (A)), we have

P[|X − E[X]| > √((cr/2) ln n)] ≤ 2 exp(−(2/r) · (cr/2) ln n) ≤ 2/n^c.

(B) Observe that |n_i − n_j| ≤ |n_i − r/k| + |r/k − n_j|, and the claim follows from (A).

(C) For t = 1, ..., r, let X_t = 1 if the t-th element of C is colored by color i, let X_t = −1 if it is colored by color j, and set X_t = 0 otherwise. Clearly, the desired quantity is μ = E[|X|], where X = Σ_{t=1}^r X_t. We have that P[X_t = 1] = P[X_t = −1] = 1/k, and E[X_t] = 0. Observe that E[X_t²] = 2/k, so

V[X_t] = E[X_t²] − (E[X_t])² = 2/k,

and V[X] = Σ_{t=1}^r V[X_t] = 2r/k. As such, we have

E[X²] = V[X] + (E[X])² = V[X] + 0 = 2r/k.

Furthermore, we have V[|X|] = E[(|X| − μ)²] = E[|X|²] − μ² ≥ 0. As such, μ = E[|X|] ≤ √(E[|X|²]) = √(E[X²]) ≤ √(2r/k).

Lemma 3.2. (A) Let G = ([n], E) be a graph with m edges. For any 1 ≤ k ≤ ⌊n/4⌋, let U_1, ..., U_{2k} be a uniformly random partition of [n]. Then, there is some constant ς > 0, such that

P[|m/(2k) − Σ_{i=1}^k m(U_i, U_{k+i})| ≥ ς√m log n] ≤ 1/n²,

and

P[|m/(2k) − Σ_{i=1}^{2k} m(U_i)| ≥ ς√m log n] ≤ 1/n².

(B) Similarly, for disjoint sets U, V ⊆ [n] and k such that 1 ≤ k ≤ max{|U|, |V|}, let U_1, ..., U_k and V_1, ..., V_k be uniformly random partitions of U and V, respectively. Then, there is some constant ς > 0, such that

P[|m(U, V) − k Σ_{i=1}^k m(U_i, V_i)| ≥ ςk √(m(U, V)) log n] ≤ 1/n².

Proof: (A) Consider the random process that colors vertex t, at time t ∈ [n], with a uniformly random color Y_t ∈ [2k]. The colors correspond to the partition of [n] into the classes U_1, ..., U_{2k}. Define

f(Y_1, ..., Y_n) = Σ_{i=1}^k m(U_i, U_{k+i}).

The probability of a specific edge uv to be counted by f is 1/(2k). Indeed, fix the color of u, and observe that there is only one choice for the color of v such that uv is counted. As such, E[f] = m/(2k) and 0 ≤ f(Y_1, ..., Y_n) ≤ m.

Consider the Doob martingale X_0, X_1, ..., X_n, where X_t = E[f(Y_1, ..., Y_n) | Y_1, ..., Y_t]. We are interested in bounding the quantity |X_t − X_{t−1}|. To this end, fix the value of Y_1, ..., Y_{t−1}, and let g(α) = E[f(Y_1, ..., Y_n) | Y_1, ..., Y_{t−1}, Y_t = α]. We have that X_{t−1} = Σ_{α=1}^{2k} g(α)/(2k). Namely, the value of X_{t−1} is an average of the values in G = {g(1), g(2), ..., g(2k)}. Clearly, X_t ∈ G. As such, we have that |X_t − X_{t−1}| ≤ max_{i,j} |g(i) − g(j)|.

Let N(t) be the set of neighbors of t in the graph, and let deg(t) = |N(t)| be the degree of t. For a color i, let π(i) = ((i + k − 1) mod 2k) + 1 be its matching color. Fix two distinct colors i, j ∈ [2k], and let Δ_t = |g(i) − g(j)|. This quantity depends only on the colors assigned to the neighbors of t and, by Lemma 3.1, it is controlled by the discrepancy between the number of neighbors of t that receive the matching color of i and those that receive the matching color of j. Bounding the Δ_t's in this manner and applying Azuma's inequality to the martingale X_0, ..., X_n yields the stated deviation bound of ς√m log n. The second inequality of (A) follows by the same argument applied to f(Y) = Σ_{i=1}^{2k} m(U_i), and part (B) follows by the same argument applied to the bipartite partition.
Remark 3.3. Given an induced bipartite graph G = (U, V, E) with m edges, coloring it with k colors, and taking the bipartite subgraphs of the resulting matching of the coloring, as done in Lemma 3.2 (B), results in k new disjoint bipartite (induced) subgraphs G_i = (U_i, V_i, E_i), for i = 1, ..., k, with total number of edges Γ = Σ_{i=1}^k m(G_i). Furthermore, we have that k · Γ is a (1 ± ξ)-approximation to m(G), where ξ = (ςk √m log n)/m, with high probability. For our purposes, we need

ξ ≤ ε/(8 log n) ⟺ (ςk √m log n)/m ≤ ε/(8 log n) ⟺ (8ςk log² n)/ε ≤ √m ⟺ m = Ω(k² ε⁻² log⁴ n).

Setting k = 4, the above implies that one can apply the refinement algorithm of Lemma 3.2 if m = Ω(ε⁻² log⁴ n). With high probability, the number of edges in the new k subgraphs (i.e., Γ), scaled by k, is a good estimate (i.e., within a 1 ± ε/(8 log n) factor) for the number of edges in the original graph; furthermore, the number of edges in the new subgraphs is small (formally, E[Γ] ≤ m/4, and with high probability, Γ ≤ m/2).
4. Edge estimation using
BIS queries
Here we show how to get exact and approximate counts for the number of edges in a graph using BIS queries.
Lemma 4.1.
Given two disjoint sets
U, V ⊆ [n], one can (deterministically) compute E(U, V), and thus m(U, V) = |E(U, V)|, using O(1 + m(U, V) log n) BIS queries. Alternatively, given a query budget t = Ω(log n), one can decide whether the given graph has at most t/log n edges (or more) using O(t) BIS queries.

Proof:
We use a recursive divide-and-conquer approach, which intuitively builds a quadtree over thepair (
U, V ). The quadtree construction can also be interpreted in terms of the incidence matrix of theedges E ( U, V ).The algorithm first issues the query
BIS(U, V). If the return value is false, then there are no edges between U and V, and the algorithm sets m(U, V) to zero and returns. If |U| = |V| = 1, then this also determines whether m(U, V) is 0 or 1, and the algorithm returns. The remaining case is that m(U, V) ≠ 0, and the algorithm recurses on the four children of (U, V), which correspond to the pairs (U_1, V_1), (U_1, V_2), (U_2, V_1), and (U_2, V_2), where U_1, U_2 and V_1, V_2 are equipartitions of U and V, respectively. We are using here the identity

m(U, V) = m(U_1, V_1) + m(U_1, V_2) + m(U_2, V_1) + m(U_2, V_2).

If m(U, V) = 0 holds, then the number of queries is exactly one, and the lemma holds in this case. For the rest of the proof we assume that m(U, V) ≥ 1. To bound the number of queries, imagine building the whole quadtree for the adjacency matrix of U × V with entries for E(U, V). Let X₀ be the set of 1 entries in this matrix, and let k = |X₀| (i.e., X₀ corresponds to the set of leaves that are labeled 1 in the quadtree). The height of the quadtree is h = O(max{log |U|, log |V|}). Let X₁ be the set of nodes in the quadtree that are either in X₀ or are ancestors of nodes of X₀. It is not hard to verify that |X₁| = O(k + k log(|U||V|)) = O(k log n). Finally, let X₂ be the set of nodes in the quadtree that are either in X₁, or whose parent is in X₁. Clearly, the algorithm visits only the nodes of X₂ in the recursion, thus implying the desired bound.

As for the budgeted version, run the algorithm until it has accumulated T = O(t/log n) edges in the working set. If this never happens, then the number of edges of the graph is at most T, as desired, and the above analysis applies. Otherwise, the algorithm stops, and applying the same argument as above, we get that the number of BIS queries is bounded by O(T log n) = O(t).

Remark.
The number of
BIS queries made by the algorithm of Lemma 4.1 is at least max{m(U, V), 1}, since every edge with one endpoint in U and the other in V is identified (on its own, explicitly) by such a query.

Though we do not need it in the sequel for our algorithms to estimate the number of edges in a graph, we can use the above algorithm to exactly identify the edges of an arbitrary graph using BIS queries with a cost of O(log n) overhead per edge.

Lemma 4.2.
Given a vertex v ∈ [n], one can compute all the edges adjacent to v in G using O(1 + deg(v) + deg(v) log(n/deg(v))) = O(1 + deg(v) log n) queries.

Proof: We build a binary tree T of BIS queries of the form BIS({v}, U) to determine the entries of the row ρ of the adjacency matrix for G indexed by v. In particular, view the vertices of B = [n] − {v} as an ordered set. The algorithm first queries BIS({v}, B), which returns true iff there is a 1 in ρ. This query is the root of T. For every node ν of T corresponding to a query BIS({v}, U) whose output is true and for which |U| > 1, there are two children ν′ and ν′′, labelled by the queries BIS({v}, U′) and BIS({v}, U′′), where U′ and U′′ are the two halves of U, respectively.

If the answer to the root query is false, then only one BIS query is used. Otherwise, consider the subtree of T labeled by BIS queries that evaluate to true. This subtree has deg(v) leaves at depth ⌈log n⌉, at most 2 deg(v) nodes at depths up to ⌊log deg(v)⌋, and at most deg(v) true nodes at each deeper level; moreover, every query performed is either the root query or a child of a true node. Therefore, the number of BIS queries is at most 1 + 2 deg(v) + 2 deg(v)·(⌈log n⌉ − ⌊log deg(v)⌋), which is O(1 + deg(v)·log n).

Lemma 4.3.
Given a vertex v ∈ [n] and a graph G = ([n], E), let C be the connected component of v in G. The set of edges in C (i.e., E(C)) can be computed using O(1 + m(C) log n) BIS queries, where m(C) is the number of edges in C.

Proof: Do a breadth-first search in G starting from v. Whenever reaching a vertex for the first time, compute its adjacent edges using Lemma 4.2. Clearly, the breadth-first search visits all the vertices in C, and therefore computes all the edges in this connected component. The bound on the number of queries readily follows by observing that Σ_{v ∈ V(C)} deg(v) log n is O(m(C) log n).

Lemma 4.4.
For a graph G = ( (cid:74) n (cid:75) , E ) , one can deterministically compute m ( E ) exactly, using at most O (log n + | E | log n ) BIS queries. Alternatively, given a query budget t = Ω(log n ) , one can decide whetherthe given graph has at most t/ log n edges, or more than this number, using O ( t ) BIS queries.Proof:
For i = 1, ..., T = ⌈log n⌉, define A_i to be the elements of [n] whose i-th bit in their binary representation is 1. Let B_i = [n] \ A_i, and let W_1 = [n].

Iterate for i = 1, ..., T. In the i-th iteration, compute all the edges in E_i = E(A_i ∩ W_i, B_i ∩ W_i) using the algorithm of Lemma 4.1. This requires O(1 + |E_i| log n) queries. Let V_i = V(E_i). As long as there is an unvisited vertex v ∈ V_i, apply the algorithm of Lemma 4.3 to produce all the edges in the connected component of G containing v. Repeat this process until all the vertices in V_i are visited. This computes a collection C_i = ∪_{v ∈ V_i} C(v) of connected components of G, and the number of BIS queries used to compute it is O(1 + m(C_i) log n). The algorithm now sets W_{i+1} = W_i \ V(C_i), and continues to the next iteration. Finally, the algorithm outputs the set ∪_i E(C_i) (or just its size).

For correctness and query count, observe that E_i ⊆ E(C_i), and these edges are no longer considered in later iterations of the algorithm, as we remove the vertices of V(C_i) from W_i to get W_{i+1}. As such, E(C_1), ..., E(C_T) are disjoint sets, and hence the total cost of the iterations is O(T + Σ_{i=1}^T m(C_i) log n) = O(log n + |E| log n), as required. We claim that ∪_i E(C_i) = E(G). To this end, consider any edge uv ∈ E(G), and assume that the first bit on which the vertices u and v differ is the i-th bit. We have two cases: (1) If either u or v is not in W_i, then there is some first iteration j < i in which one of them was removed. But since they are in the same connected component of G, both were removed, and the edge uv ∈ E(C_j). (2) If both u and v are in W_i, then, since they differ in their i-th bit, uv ∈ E_i ⊆ E(C_i), as required.

As for the budgeted version, run the algorithm until τ = Θ(t) BIS queries have been performed. If this never happens, then the algorithm has computed all the edges of the graph (and hence their exact number). Otherwise, the algorithm stops, and we know that the graph must have Ω(τ/log n) = Ω(t/log n) edges, as desired.

Let G = ([n], E) be a graph, and let U, V ⊆ [n] be disjoint subsets of the vertices. The task at hand is to estimate m(U, V) using polylog(n) BIS queries.

For a subset S ⊆ [n], define N(S) to be the union of the neighborhoods of all the vertices in S. For a vertex v, let deg_S(v) denote the number of neighbors of v that lie in S. For i ∈ [log n], define the set of vertices in U with degree into V between 2^i and 2^{i+1} as

U_i = {u ∈ U | 2^i < deg_V(u) ≤ 2^{i+1}},

and let U_0 denote the set of vertices u ∈ U with deg_V(u) ≤ 2.

Claim 4.5.
There exists an α ∈ {0, 1, ..., log n} such that

m(U_α, V) ≥ m(U, V)/(log n + 1)   and   |U_α| ≥ m(U, V)/(2^{α+1}(log n + 1)).

Algorithm 4.1: CheckEstimate(U, V, ẽ)
Input: ((U, V), ẽ), where U, V ⊆ [n] are disjoint and ẽ is a (rough) guess for the value of m(U, V)
for i = 0, 1, ..., log n do
    Sample U′ ⊆ U by choosing each vertex of U with probability min(2^i/ẽ, 1).
    Sample V′ ⊆ V by choosing each vertex of V with probability 1/2^i.
    if m(U′, V′) ≠ 0 then Output accept;
Output reject.

Proof: Since Σ_{i=0}^{log n} m(U_i, V) = m(U, V), the first inequality is stating that there is a term as large as the average. As for the second inequality, observe that for every i, we have |U_i| · 2^i ≤ m(U_i, V) ≤ |U_i| · 2^{i+1}. Hence, using the first inequality,

|U_α| ≥ m(U_α, V)/2^{α+1} ≥ m(U, V)/(2^{α+1}(log n + 1)).

Suppose that we have an estimate ẽ for the number of edges between U and V in the graph. Consider the test CheckEstimate, depicted in Algorithm 4.1, for checking if the estimate ẽ is correct up to polylogarithmic factors using a logarithmic number of BIS queries.
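In code, the test above is a few lines; the sketch below is a direct transcription, with `bis` standing in for the BIS oracle (one oracle call per loop iteration decides whether m(U′, V′) = 0).

```python
import math
import random

def check_estimate(U, V, guess, bis, n, rng=random):
    """Sketch of CheckEstimate (Algorithm 4.1).

    `bis(A, B)` models the BIS oracle: True iff at least one edge joins
    the disjoint vertex sets A and B. Returns True ("accept") when the
    guess for m(U, V) looks too small, per Claims 4.5-4.6.
    """
    for i in range(int(math.floor(math.log2(n))) + 1):
        pU = min(2.0**i / guess, 1.0)
        U1 = [u for u in U if rng.random() < pU]            # each u w.p. min(2^i/guess, 1)
        V1 = [v for v in V if rng.random() < 1.0 / 2.0**i]  # each v w.p. 2^-i
        if U1 and V1 and bis(U1, V1):
            return True    # accept: the sampled pair contains an edge
    return False           # reject
```

A gross underestimate is accepted almost surely (the first iteration keeps everything), while a gross overestimate rarely leaves any edge in the sample.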
Claim 4.6.
Let n ≥ 16. If m(U, V) > 0, then:
(A) if ẽ ≥ 8 m(U, V)(log n + 1), then CheckEstimate(U, V, ẽ) accepts with probability at most 1/8.
(B) if ẽ ≤ m(U, V)/(4 log n), then CheckEstimate(U, V, ẽ) accepts with probability at least 1/2.

Proof: (A) For any value of the loop variable i, the probability that a fixed edge is present in the induced subgraph on U′ and V′ is min(2^i/ẽ, 1) · (1/2^i) ≤ 1/ẽ. Thus, E[m(U′, V′)] ≤ m(U, V)/ẽ ≤ 1/(8(log n + 1)). For a fixed iteration i, by Markov's inequality, we have

P[m(U′, V′) ≠ 0] = P[m(U′, V′) ≥ 1] ≤ E[m(U′, V′)] ≤ 1/(8(log n + 1)).

By a union bound over the log n + 1 values of the loop variable, the probability that the test accepts is at most 1/8.

(B) Consider the iteration i = α, for the α given by Claim 4.5. In this case, we have |U_α| ≥ m(U, V)/(2^{α+1}(log n + 1)), and thus

P[U′ ∩ U_α = ∅] = (1 − 2^α/ẽ)^{|U_α|} ≤ exp(−(2^α/ẽ) · |U_α|) ≤ exp(−(2^α · 4 log n / m(U, V)) · m(U, V)/(2^{α+1}(log n + 1))) = exp(−2 log n/(log n + 1)) ≤ e^{−8/5},

since n ≥ 16. Furthermore, since deg_V(u) ≥ 2^α for all u ∈ U_α, it follows that when U′ ∩ U_α ≠ ∅, then |N(U′ ∩ U_α)| ≥ 2^α. So, we can bound

P[V′ ∩ N(U′ ∩ U_α) = ∅ | U′ ∩ U_α ≠ ∅] ≤ (1 − 1/2^α)^{2^α} ≤ e^{−1}.

From the above, we get

P[m(U′, V′) ≠ 0] ≥ P[U′ ∩ U_α ≠ ∅] · P[V′ ∩ N(U′ ∩ U_α) ≠ ∅ | U′ ∩ U_α ≠ ∅] ≥ (1 − e^{−8/5})(1 − e^{−1}) ≥ 1/2.

Algorithm 4.2: CoarseEstimator(U, V)
Input: (U, V), where
U, V ⊆ [n] are disjoint
Output: An estimate ẽ for the number of edges m(U, V), computed using BIS queries
if m(U, V) = 0 then Output 0;
for j = 2 log n, 2 log n − 1, ..., 0 do
    Run t := 128 log n independent trials of CheckEstimate(U, V, 2^j).
    if at least t/4 of them output accept then Output 2^j;

Armed with the above test, we can easily estimate the number of edges up to an O(log n) factor by doing a search, where we start with ẽ = n² and halve the estimate in each iteration. The algorithm is depicted in Algorithm 4.2.

Claim 4.7.
For n ≥ 16, CoarseEstimator(U, V) outputs ẽ ≤ n² satisfying

m(U, V)/(8 log n) ≤ ẽ ≤ 16 m(U, V) log n,

with probability at least 1 − 4n⁻³ log n. The number of BIS queries made is c_ce log³ n, for a constant c_ce.

Proof: For any fixed value of the loop variable j such that 2^j ≥ 8 m(U, V)(log n + 1), the expected number of accepts is at most t/8, where t = 128 log n. The probability that we see at least t/4 accepts is then at most exp(−t/24) ≤ n⁻³ by Chernoff's inequality (Lemma 2.2 (A)). Taking the union over all such values of j, the probability that the algorithm returns 2^j, for some j with 2^j ≥ 8 m(U, V)(log n + 1), is at most 3n⁻³ log n.

On the other hand, when 2^j ≤ m(U, V)/(4 log n), the expected number of accepts is at least t/2, and the probability that we see fewer than t/4 accepts is at most exp(−t/16) ≤ n⁻³ by Chernoff's inequality (Lemma 2.2 (A)). Hence, conditioned on the event that the algorithm has not already returned a bigger value of j, the probability that it accepts for the unique j that satisfies m(U, V)/(8 log n) ≤ 2^j < m(U, V)/(4 log n) is at least 1 − n⁻³.

Overall, by a union bound, the probability that the estimator outputs an estimate ẽ that does not satisfy (8 log n)⁻¹ ≤ ẽ/m(U, V) ≤ 16 log n is at most 4n⁻³ log n. The number of BIS queries is bounded by O(log³ n), since for each value of j there are t = 128 log n trials of CheckEstimate, each of which makes log n + 1 queries to the BIS oracle.

Summarizing the above, we get the following result.
Lemma 4.8.
For n ≥ 16, and arbitrary disjoint sets U, V ⊆ [n], the randomized algorithm CoarseEstimator(U, V) makes at most c_ce log³ n BIS queries (for a constant c_ce) and outputs ẽ ≤ n² such that, with probability at least 1 − 4n⁻³ log n, we have

(8 log n)⁻¹ ≤ ẽ/m(U, V) ≤ 16 log n.

BIS approximation algorithm
Given a graph G = ([n], E), we describe here an algorithm that makes polylog(n)/ε⁴ BIS queries to estimate the number of edges in the graph within a factor of (1 ± ε).

The algorithm for estimating the number of edges in the graph maintains a data-structure D containing:
(A) An accumulator φ – a counter that maintains an estimate of the number of edges already handled.
(B) A list of triples (U_1, V_1, w_1), ..., (U_u, V_u, w_u), where U_i, V_i ⊆ [n] and w_i > 0. The estimate based on D of the number of edges in the original graph G = ([n], E) is

m(D) = φ + Σ_i w_i · m(U_i, V_i).

The number of active edges in D is m_active(D) = Σ_i m(U_i, V_i).

The algorithm is going to use three subroutines – cleanup, refine, and reduce – described next.

(A)
Cleanup: The cleanup stage removes from D all induced subgraphs that have few edges, by explicitly counting the number of their edges. Let

L_small = Θ(ε⁻² log⁴ n),    (4.1)

as specified by Remark 3.3. Given the data-structure D, the algorithm scans the list of triples (U, V, w) ∈ D. For each triple (U, V, w), using the algorithm of Lemma 4.1, it decides whether m(U, V) ≤ 2L_small. If so, the value of m(U, V) was just computed, and the algorithm adds w · m(U, V) to φ. Finally, it removes this triple from D.

If D has no triples in it, then the algorithm returns φ as the desired approximation.

(B) Refine: We are given the data-structure D, where the graph associated with every triple has at least L_small edges. The algorithm replaces every triple (U, V, w) ∈ D by the four induced subgraphs resulting from 4-coloring the graph G(U, V), as described by Lemma 3.2 (B) (see also Remark 3.3). Specifically, the coloring results in the pairs (U_i, V_i), for i = 1, 2, 3, 4. The triple (U, V, w) is replaced in D by the triples {(U_1, V_1, 4w), ..., (U_4, V_4, 4w)}. This increases the number of triples in D by a factor of four.

(C) Reduce: If D has more than 2L_len triples, where L_len = O(ε⁻² log⁶ n) as specified by Remark 2.6, then the algorithm reduces the number of triples.

To this end, the algorithm first computes, for each triple (U, V, w) ∈ D, a coarse estimate ẽ of the number of edges m(U, V), such that m(U, V)/(8 log n) ≤ ẽ ≤ 16 m(U, V) log n, by using the algorithm of Claim 4.7. This requires O(log³ n) BIS queries per triple.

Next, the algorithm uses the summation reduction algorithm of Lemma 2.5, applied to the list of triples in D, with ξ = ε/(8 log n). This reduces the number of triples in D to at most L_len, while introducing a multiplicative error of (1 ± ξ).

The algorithm input is the graph G = ([n], E), and a parameter ε >
0. Let R = O(ε⁻⁴ log¹² n) be the query bound derived in the analysis below. The algorithm works as follows:
(A) Check if G has at most O(R/log n) edges, using the algorithm of Lemma 4.4, which requires O(R) BIS queries. If so, the algorithm returns the exact number of edges in G, and stops.
(B) Compute a random 2-coloring of the vertices of the graph, creating two sets U ∪ V = [n]; see Lemma 3.2 (A). We now create a data-structure as described above, with D = [φ, (U, V, 2)], where φ is initialized to the value 0.
(C) As long as D contains some triple, the algorithm does the following:
(a) Performs Cleanup on D, as described in Section 4.3.1 (A).
(b) Performs Refine on D, as described in Section 4.3.1 (B).
(c) Performs Reduce on D, as described in Section 4.3.1 (C).
(D) The algorithm now returns the value φ as the desired approximation.

Number of iterations. Initially, the number of active edges is at most m. Every time Refine is executed, this number reduces by a factor of 2 with high probability, using Lemma 3.2 (B) (in expectation, the reduction is by a factor of 4). As such, after ⌈log m⌉ ≤ ⌈log(n²)⌉ ≤ 2 log n iterations there are no active edges, and then the algorithm terminates.

Number of
BIS queries.
Clearly, because Reduce is used on D in each iteration, the algorithm maintains the invariant that the number of triples in D is at most O(L_len), where L_len = O(ε⁻² log⁶ n), as specified by Remark 2.6.

The procedure Cleanup applies the algorithm of Lemma 4.1 to decide whether a triple in the list has at least 2L_small edges associated with it, or fewer, where L_small = Θ(ε⁻² log⁴ n) (see Eq. (4.1) and Remark 3.3). This takes O(L_small log n) BIS queries per triple. Overall, the Cleanup step performs O(L_small L_len log n) queries in each iteration. The procedure Refine does not perform any BIS queries. The procedure Reduce performs O(L_len log³ n) BIS queries in the estimation stage.

As such, overall, the algorithm performs

O(L_small L_len log n) = O(ε⁻² log⁴ n · ε⁻² log⁶ n · log n) = O(ε⁻⁴ log¹¹ n)

BIS queries per iteration. There are O(log n) iterations, and as such, the overall number of BIS queries is R = O(ε⁻⁴ log¹² n), which also bounds the number of BIS queries in the first step of the algorithm.
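Putting the pieces together, a heavily simplified skeleton of the cleanup/refine loop can be sketched as follows. This is only an illustration of the estimator's structure, under loud simplifications: brute-force counting of small pieces stands in for the budgeted quadtree of Lemma 4.1, the Reduce step (Lemma 2.5) and the initial budget check are omitted, and the threshold `small` replaces 2L_small.

```python
import random

def estimate_edges(n, edges, small=8, rng=random):
    """Sketch of the cleanup/refine skeleton of the BIS algorithm.

    Maintains triples (A, B, w) whose weighted cross-edge counts, plus the
    accumulator phi, form an unbiased estimate of the edge count.
    """
    def m(U, V):
        return sum(1 for (a, b) in edges
                   if (a in U and b in V) or (a in V and b in U))

    # Step (B): random 2-coloring; each edge crosses with probability 1/2,
    # so the initial triple carries weight 2.
    U = {v for v in range(n) if rng.random() < 0.5}
    V = set(range(n)) - U
    triples = [(U, V, 2.0)]
    phi = 0.0                                    # edges already settled
    while triples:
        # Cleanup: settle triples with few edges by exact counting.
        keep = []
        for (A, B, w) in triples:
            e = m(A, B)
            if e <= small:
                phi += w * e
            else:
                keep.append((A, B, w))
        # Refine: 4-color each side, keep the matched pairs, weight x4
        # (each edge survives with probability 1/4, as in Lemma 3.2 (B)).
        triples = []
        for (A, B, w) in keep:
            cA = {a: rng.randrange(4) for a in A}
            cB = {b: rng.randrange(4) for b in B}
            for i in range(4):
                triples.append(({a for a in A if cA[a] == i},
                                {b for b in B if cB[b] == i}, 4.0 * w))
    return phi
```

Every step preserves the expectation of m(D), so averaging independent runs converges to the true edge count; the real algorithm instead controls the variance of a single run via the parameter choices above.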
Approximation error.
The initial 2-coloring of the graph, in (B), introduces a (1 ± ε₁)-multiplicative error, by Lemma 3.2 (A), where ε₁ = O((log n)/√m) ≪ ξ = ε/(8 log n), as the algorithm reaches this stage only if m = Ω(R/log n).

Inside each iteration, Cleanup introduces no error. By the choice of parameters, Refine introduces a multiplicative error of at most 1 ± ξ; see Remark 3.3. Similarly, Reduce introduces a multiplicative error bounded by 1 ± ξ; see Remark 2.6. As such, the multiplicative approximation of the algorithm lies in the interval

[(1 − ε₁)(1 − ξ)^{4 log n}, (1 + ε₁)(1 + ξ)^{4 log n}] ⊆ [1 − ε, 1 + ε],

since (1 − ε/(8 log n))^{4 log n} ≥ 1 − ε and (1 + ε/(8 log n))^{4 log n} ≤ 1 + ε, as easy calculations show.

Probability of success.
Throughout this analysis, c will be a constant that can be chosen to be arbitrarily large. The algorithm may fail due to the following reasons: (i) the random two-coloring in Step (B) gives an estimate that is far from its expectation − this probability is at most 1/n^c, using Lemma 3.2 (A); (ii) the Refine step fails − the probability of failure in each iteration is at most 1/n^c, using Lemma 3.2 (B); (iii) the coarse estimation in the Reduce step fails − the probability of failure in each iteration is at most 1/n^c, using Claim 4.7; and lastly, (iv) the summation reduction in the Reduce step fails − the probability of failure in each iteration is at most 1/n^c, using Lemma 2.5. Overall, every step performed by the algorithm has probability at most 1/n^c of failing. The algorithm performs O(polylog(n)) steps with high probability, which implies that the algorithm succeeds with probability at least 1 − 1/n^{O(1)}.

The BIS result

Theorem 4.9.
Let G = (⟦n⟧, E) be an undirected graph. For a parameter ε ∈ (0, 1), one can compute an estimate m̃ for the number of edges in G, such that (1 − ε) m(G) ≤ m̃ ≤ (1 + ε) m(G), where m(G) is the number of edges of G. The algorithm performs ε^{-O(1)} log^{O(1)} n BIS queries and succeeds with probability ≥ 1 − 1/n^{O(1)}.

Degree estimation using BIS queries
We provide an auxiliary degree estimation result, connecting
BIS queries to local queries (e.g., [Fei06, GR08]).
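To make this connection concrete, here is a minimal sketch under an assumed `bis(U, V)` oracle interface (the paper fixes no API): a single BIS query on singletons answers an edge existence query, and n − 1 such queries recover a degree query exactly.

```python
# Sketch: simulating local queries with BIS queries.
# The interface `bis(U, V)` is hypothetical; it returns True
# iff there is NO edge between the disjoint sets U and V.

def make_bis_oracle(edges):
    """Build a BIS oracle from an explicit edge set (for illustration only)."""
    edge_set = {frozenset(e) for e in edges}
    def bis(U, V):
        assert not (set(U) & set(V)), "BIS queries require disjoint sets"
        return all(frozenset((u, v)) not in edge_set for u in U for v in V)
    return bis

def edge_exists(bis, u, v):
    # An edge existence query is a single BIS query on singletons.
    return not bis({u}, {v})

def degree(bis, v, n):
    # deg(v) via n - 1 singleton BIS queries: the trivial simulation.
    # (Lemma 4.10 below does exponentially better for approximate degrees.)
    return sum(edge_exists(bis, v, u) for u in range(n) if u != v)
```

The point of the contrast is that BIS queries subsume edge existence queries at no overhead, while the converse simulation (local queries answering a BIS query) is expensive.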
Lemma 4.10.
Given a graph G = (⟦n⟧, E), a parameter ε ∈ (0, 1), and a vertex v ∈ ⟦n⟧, one can (1 ± ε)-approximate deg(v) in G using O(ε^{-O(1)} log n) BIS queries. The approximation is correct with high probability.

Proof:
Let N(v) = { i | vi ∈ E } be the set of neighbors of v, and let E_v = { vi | vi ∈ E } be the corresponding set of edges. We have deg(v) = |N(v)| = |E_v|. Given a set of edges E_Q ⊆ { vi | i ∈ ⟦n⟧ }, the corresponding set of vertices is Q = { i | vi ∈ E_Q }. In particular, Q ∩ N(v) ≠ ∅ ⟺ E_Q ∩ E_v ≠ ∅. Deciding whether E_Q ∩ E_v ≠ ∅ is equivalent to deciding whether any of the edges adjacent to v is in E_Q, and this is answered by the BIS query ({v}, Q). Namely, the BIS oracle can function as an emptiness oracle for N(v) ⊆ ⟦n⟧. Now, using the algorithm of Lemma 2.9, we can (1 ± ε)-approximate |N(v)| using O(ε^{-O(1)} log n) queries, as claimed.
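The algorithm of Lemma 2.9 is not reproduced in this excerpt; as a stand-in, the following sketch (with an assumed `bis(U, V)` oracle interface, all names illustrative) shows how the emptiness-oracle view already yields a constant-factor degree estimate by a geometric search over sampling rates. The (1 ± ε) guarantee of Lemma 2.9 refines the same idea.

```python
import math
import random

def estimate_degree(bis, v, n, trials=400, rng=None):
    """Constant-factor estimate of deg(v) using only emptiness queries
    of the form bis({v}, Q).  For sampling rate p, a random Q misses
    N(v) entirely with probability (1 - p)^deg(v); the rate p at which
    this probability crosses 1/2 determines deg(v) up to constants."""
    rng = rng or random.Random(0)          # seeded for reproducibility
    others = [u for u in range(n) if u != v]
    if bis({v}, set(others)):              # N(v) is empty
        return 0
    p = 1.0
    while True:
        empty = sum(
            bis({v}, {u for u in others if rng.random() < p})
            for _ in range(trials)
        )
        if empty >= trials / 2:            # (1 - p)^deg(v) ≈ 1/2
            return math.log(2) / p         # so deg(v) ≈ ln(2) / p
        p /= 2
```

The O(trials · log n) emptiness queries here buy only a constant-factor estimate; driving the error down to (1 ± ε) is exactly the content of Lemma 2.9.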
5. Edge estimation using IS queries

This section describes and analyzes our IS query algorithm (Theorem 5.8). At the end, we also discuss limitations of IS queries, suggesting that IS queries may indeed be weaker than BIS queries.

Exact edge counting using IS queries

We start with an exact edge counting algorithm for IS queries. At a high level, we use Lemma 4.1 after efficiently computing a suitable decomposition of our graph.

Lemma 5.1.
Given disjoint sets of vertices
U, V ⊆ ⟦n⟧, such that both U and V are independent sets, one can compute the number of edges m(U ∪ V) using O(m(U ∪ V) log n) IS queries, assuming m(U, V) > 0.

Proof: Since U and V are disjoint and independent, we have that m(U ∪ V) = m(U, V). Furthermore, for any U′ ⊆ U and V′ ⊆ V, the query BIS(U′, V′) is equivalent to the query IS(U′ ∪ V′). As such, we can use the algorithm of Lemma 4.1, with IS queries as a replacement for the BIS queries, yielding the result.

The next step is to break the set of interest U into independent sets.

Lemma 5.2.
Given a set U ⊆ ⟦n⟧, one can decompose it into disjoint independent sets V_1, V_2, ..., V_t, such that
(a) U = V_1 ∪ ... ∪ V_t, and
(b) for any i, j ∈ ⟦t⟧, with i < j, we have m(V_i, V_j) > 0.
Furthermore, computing this decomposition uses only O(1 + m(U) log n) IS queries.

Proof: Order the elements of U = {u_1, ..., u_k} in an arbitrary order. Compute the largest i ∈ ⟦k⟧ such that u_1, ..., u_i is an independent set. Using binary search, this can be done using O(log n) IS queries; let U_1 be this first set. Continue decomposing U \ U_1 in the same fashion. This results in a decomposition of U into disjoint independent sets U_1, U_2, ..., U_σ, and requires O(σ log n) queries. Observe, however, that for every i, the set U_i in this decomposition has at least one edge between one of its vertices and the first vertex that was not yet added to any set. As such, we can charge the O(log n) IS queries used in computing U_i to this guaranteed edge, implying that this stage used O(m(U) log n) queries.

In the second stage of the algorithm, we group these independent sets together. In the τth iteration, for τ = 1, ..., σ, the algorithm does the following:
(A) Assume that the algorithm already computed the independent sets V_1, ..., V_{f(τ)} (initially, f(1) = 0).
(B) For i = 1, ..., f(τ), check using an IS query whether the set V_i ∪ U_τ is an independent set. If it is, then set f(τ + 1) = f(τ), and V_i ← V_i ∪ U_τ. The algorithm then continues to the next (outer) iteration.
(C) Otherwise, set f(τ + 1) = f(τ) + 1, and set V_{f(τ+1)} = U_τ.
Clearly, the resulting decomposition V_1, ..., V_{f(σ+1)} of U has the desired properties. As for the number of IS queries, observe that every time U_τ gets rejected, in the ith iteration of the inner loop, this is because of an edge present in the set E(V_i, U_τ). We charge the IS query to such an edge.
An edge gets charged at most once by this process. We conclude that the algorithm performs at most O(m(U)) queries for this part.

Lemma 5.3.
Given U ⊆ ⟦n⟧, one can deterministically compute E(U) using O(1 + m(U) log n) IS queries. Alternatively, given a budget t > 0 and a set U ⊆ ⟦n⟧, one can decide whether m(U) > t using O(t log n) IS queries.

Proof: Using the algorithm of Lemma 5.2, compute the decomposition of U into independent sets V_1, ..., V_t. By construction, for any i < j, we have that m(V_i, V_j) ≥
1, as some vertex of V_i is connected to some vertex in V_j. As such, going over all 1 ≤ i < j ≤ t, compute the set of edges E(V_i, V_j) using the algorithm of Lemma 5.1. This requires O(m(V_i, V_j) log n) IS queries. As such, the total number of IS queries used by this algorithm is O(m(U) log n + Σ_{i<j} m(V_i, V_j) log n) = O(1 + m(U) log n).

Lemma 5.4. Let L_base = ⌈c ε^{-2} log² n⌉, where c is some sufficiently large constant. Given a set U, one can decide if m(U) ≤ L_base, and if so obtain the exact value of m(U), using O(ε^{-2} log³ n) IS queries.

Lemma 5.5. Given parameters t, ε ∈ (0, 1), and a set U ⊆ ⟦n⟧, such that m(U) ≥ max(L_base, t²), an algorithm can decide whether m(U) > 2t², or alternatively return a (1 ± ε)-approximation to m(U) if t² ≤ m(U) ≤ 2t². The algorithm uses O(ε^{-1} t log² n) IS queries and succeeds with probability 1 − 1/n^{O(1)}.

Proof: We color the vertices in U randomly using k = ⌈tε/(ς log n)⌉ colors, for a constant ς to be specified shortly, and let U_1, ..., U_k be the resulting partition. By Lemma 3.2, we have for the estimate Γ = Σ_{i=1}^{k} m(U_i) that

|m(U) − k · Γ| ≤ ς k √(m(U)) log n,

and this holds with probability ≥ 1 − n^{−c}, where c is an arbitrarily large constant, and ς is a constant that depends only on c. For this to be a (1 ± ε)-approximation, we need that

ς k √(m(U)) log n / m(U) ≤ ε ⟺ m(U) ≥ (ς k log n / ε)² = t²,

which holds because of the assumption m(U) ≥ max{L_base, t²} in the statement.

To proceed, the algorithm starts computing the terms in the summation defining Γ, using the algorithm of Lemma 5.3. If at any point in time the summation exceeds M = 8(t²/k) = O(ε^{-1} t log n), then the algorithm stops and reports that m(U) > 2t². Otherwise, the algorithm returns the computed count k · Γ as the desired approximation. In both cases we are correct with high probability by Lemma 3.2. We now bound the number of IS queries.
If the algorithm computed Γ by determining exact edge counts m(U_i) for all i ∈ ⟦k⟧, then the number of queries would be Σ_{i=1}^{k} O(1 + m(U_i) log n). However, the choice of stopping early once the summation exceeds M = O(ε^{-1} t log n) implies that the total number of queries is bounded by O(k + M log n) = O(ε^{-1} t log² n).

Lemma 5.6. Given ε ∈ (0, 1) and a set U ⊆ ⟦n⟧, one can compute a (1 ± ε)-approximation for m(U). The algorithm uses at most O(ε^{-2} log³ n + ε^{-1} √(m(U)) log² n) IS queries and succeeds with probability 1 − 1/n^{O(1)}.

Proof: The algorithm starts by checking whether the number of edges m(U) is at most L_base = O(ε^{-2} log² n), using the algorithm of Lemma 5.4. Otherwise, in the ith iteration, the algorithm sets t_i = √2 · t_{i−1}, where t_1 = √(L_base), and invokes the algorithm of Lemma 5.5 with t_i as the threshold parameter. If the algorithm succeeds in approximating the right size, we are done. Otherwise, we continue to the next iteration. Taking a union bound over the iterations, we have that the algorithm stops with high probability before t_α > √(m(U)), where α is the minimal index for which this holds. The number of IS queries performed by the algorithm is O(Σ_{i=1}^{α} t_i ε^{-1} log² n) = O(ε^{-1} √(m(U)) log² n), since this is a geometric sum.

Shrinking search

We are given a graph G = (⟦n⟧, E) and a set U ⊆ ⟦n⟧. The task at hand is to approximate m(U). Let N = |U|. Given an oracle that answers IS queries, we can decide whether a specific edge uv exists in the set E(U) by performing an IS query on {u, v}. We can treat such IS queries as membership oracle queries for the set E of edges in the graph, where the ground set is the set of all possible edges Z = { ij | i < j and i, j ∈ U }, where |Z| = N(N − 1)/2.
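This membership-oracle view can be sketched directly in code: sample pairs from Z uniformly, test each with a pairwise IS query, and scale. This is a hedged illustration of the sampling idea behind Lemmas 2.7 and 2.8, not the paper's algorithm; the `is_empty` oracle interface is an assumption made for the sketch.

```python
import random

def estimate_edges_by_sampling(is_empty, U, samples=3000, rng=None):
    """Estimate m(U) by uniform sampling over the ground set
    Z = { {u, v} : u < v, u, v in U }, using the pairwise IS query
    is_empty({u, v}) as a membership oracle for the edge set.
    Standard concentration gives a (1 ± ε)-estimate once roughly
    (|Z| / m(U)) / ε² pairs have been sampled."""
    rng = rng or random.Random(0)          # seeded for reproducibility
    U = sorted(U)
    Z_size = len(U) * (len(U) - 1) // 2    # |Z| = N(N - 1)/2
    hits = 0
    for _ in range(samples):
        u, v = rng.sample(U, 2)            # uniform pair from Z
        if not is_empty({u, v}):           # IS query: is uv an edge?
            hits += 1
    return Z_size * hits / samples
```

The sample size needed grows with |Z|/m(U), which is why the lemma below bounds the budget in terms of N²/t rather than fixing it globally.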
Invoking the algorithm of Lemma 2.8 in this case, with γ = 1/n^{O(1)}, implies that one can (1 ± ε)-approximate m(U) using O((N²/m(U)) ε^{-2} log n) IS queries. For our purposes, however, we need a budgeted version of this.

Lemma 5.7. Given parameters t > 0, ξ ∈ (0, 1), and a set U ⊆ ⟦n⟧, with N = |U|, an algorithm can either: (a) report that m(U) ≤ N²/(2t), or (b) return a (1 ± ξ)-approximation to m(U). The algorithm uses O(t log n) IS queries in case (a), and O(t ξ^{-2} log n) in case (b). The returned result is correct with high probability.

Proof: The idea is to use sampling as done in Lemma 2.7, with g = N²/(16t) and ε = 1/2, over the ground set Z, with E(U) ⊆ Z. The sample R used is of size O((N²/g) log n) = O(t log n), and we check for each one of the sampled pairs whether it is an edge of the graph by using an IS query. If the returned estimate is at most g/2, then the algorithm returns that it is in case (a). Otherwise, we invoke the algorithm of Lemma 2.7 again, with ε = ξ, to get the desired approximation, which is case (b).

The IS search algorithm

Theorem 5.8. Given a graph G = (⟦n⟧, E), with access to the edges of the graph via an IS oracle, one can (1 ± ε)-approximate the number of edges in G, denoted by m = |E|. The algorithm uses O(ε^{-2} log³ n + min(√m, n²/m) ε^{-2} log² n) IS queries, and it succeeds with probability ≥ 1 − 1/n^{O(1)}.

Proof: Let t_0 = ⌈c ε^{-2} log² n⌉, for some constant c. Using the algorithm of Lemma 5.4, we can decide whether m ≤ t_0, and if so, return the (just computed) exact value of m.

The algorithm now loops over i = 1, 2, 3, .... In the ith iteration, it does the following:
(A) If i = 1, then let t_1 = √(t_0); otherwise, set t_i = 2 t_{i−1}.
(B) Using the algorithm of Lemma 5.5, decide whether m ≤ 2t_i², and if so return the desired (1 ± ε)-approximation to m.
This uses O(t_i ε^{-1} log² n) IS queries.
(C) Using the algorithm of Lemma 5.7, decide whether m ≤ n²/(2t_i), and if so continue to the next iteration. This uses O(t_i log n) IS queries. Otherwise, the algorithm of Lemma 5.7 returned the desired (1 ± ε)-approximation, using O(t_i ε^{-2} log n) IS queries.

Combining the two bounds on the IS queries, we get that the ith iteration uses O(t_i ε^{-2} log² n) IS queries. The algorithm stops in the ith iteration if t_i ≥ √(m/2) or t_i ≥ n²/m. In particular, for the stopping iteration I, we have t_I = O(min(√m, n²/m)). As such, the total number of IS queries in all iterations except the last one is bounded by O(Σ_{i=1}^{I} t_i ε^{-2} log² n) = O(t_I ε^{-2} log² n). The stopping iteration uses O(t_I ε^{-2} log² n) IS queries. Each bound holds with high probability, and a union bound implies the same for the final result.

Corollary 5.9. For a graph G = (⟦n⟧, E), with access to G via IS queries, and a parameter ε > 0, one can (1 ± ε)-approximate m using O(ε^{-2} log³ n + n^{2/3} ε^{-2} log² n) IS queries.

Proof: Follows readily, as min(√m, n²/m) ≤ n^{2/3} for any value of m between 0 and n².

Limitations of IS queries

In this section, we discuss several ways in which IS queries seem more restricted than BIS queries.

Simulating degree queries with IS queries. A degree query can be (approximately, for constant ε) simulated by O(log n) BIS queries; see Lemma 4.10. In contrast, here we provide a graph instance where Ω(n/deg(v)) IS queries are needed to simulate a degree query. In particular, we show that IS queries may be no better than edge existence queries for the task of degree estimation. Since it is easy to see that Ω(n/deg(v)) edge existence queries are needed to estimate deg(v), this lower bound also applies to IS queries.

For the lower bound instance, consider a graph which is a clique along with a separate vertex v whose neighbors are a subset of the clique.
We claim that IS queries involving v are essentially equivalent to edge existence queries. Any edge existence query can be simulated by an IS query. On the other hand, any IS query on the union of v and at least two clique vertices will always detect a clique edge. Thus, the only informative IS queries involve exactly two vertices.

Coarse estimator with IS queries. It is natural to wonder whether it is possible to replace the coarse estimator (Lemma 4.8) with an analogous algorithm that makes polylog(n) IS queries. This would immediately imply an algorithm making polylog(n) · poly(1/ε) IS queries that estimates the number of edges. We do not know if this is possible, but one barrier is a graph consisting of a clique U on O(√m) vertices along with a set V of n − O(√m) isolated vertices. We claim that for this graph, the algorithm CoarseEstimator(U, V) from Section 4.2, using IS queries instead of BIS queries, will output an estimate m̃ that differs from m by a factor of Θ(n^{1/3}). Consider the execution of CheckEstimate(U, V, ẽ) from Algorithm 4.1. A natural way to simulate this with IS queries would be to use an IS query on U′ ∪ V′ instead of a BIS query on (U′, V′). Assume for the sake of argument that m = n^{4/3} and |U| = √m = n^{2/3}. Consider when the estimate ẽ satisfies ẽ = c n^{5/3} for a small constant c. In the CheckEstimate execution, there will be a value i = Θ(log n) such that, with constant probability, U′ ⊆ U will contain at least two vertices and V′ ⊆ V will contain at least one vertex. In this case, m(U′ ∪ V′) ≠ 0 even though m(U′, V′) = 0. Thus, using IS queries will lead to incorrectly accepting such a sample, and this would lead to CoarseEstimator outputting the estimate ẽ = Θ(n^{5/3}) even though the true number of edges is m = n^{4/3}.

6.
Conclusions

In this paper, we explored the task of using either BIS or IS queries to estimate the number of edges in a graph. We presented randomized algorithms giving a (1 + ε)-approximation using polylog(n) · poly(1/ε) BIS queries and min{n²/m, √m} · polylog(n) · poly(1/ε) IS queries. Our algorithms estimate the number of edges by first sparsifying the original graph and then exactly counting edges spanning certain bipartite subgraphs. Below we describe a few open directions for future research.

An obvious unresolved question is whether there is an algorithm to estimate the number of edges with o(√m) IS queries when m = o(n^{4/3}). In this context, proving a lower bound of Ω(√m) IS queries would also be interesting. The arguments in Section 5.3 lend some support to the possibility that a non-trivial lower bound might hold for the case of IS queries.

Other open questions include using a polylogarithmic number of BIS queries to estimate the number of cliques in a graph (see [ERS17] for an algorithm using degree, neighbor, and edge existence queries) or to sample a uniformly random edge (see [ER18] for an algorithm using degree, neighbor, and edge existence queries). In general, other graph estimation problems may benefit from BIS or IS queries, possibly in combination with standard queries (such as neighbor queries). Finally, it would be interesting to know what other oracles, besides subset queries, enable estimating graph parameters with a polylogarithmic number of queries.

Acknowledgments. We thank the anonymous referees for helpful comments about improving the presentation of our paper and for pointing out relevant references.

References

[AH08] B. Aronov and S. Har-Peled. On approximating the depth and related problems. SIAM J. Comput., 38(3):899–921, 2008.

[BHN+18] P. Beame, S. Har-Peled, S. Natarajan Ramamoorthy, C. Rashtchian, and M. Sinha. Edge estimation with independent set oracles. In Anna R. Karlin, editor, Innov. Theo. Comp. Sci.
(ITCS), volume 94 of LIPIcs, pages 38:1–38:21, 2018.

[BKKR13] I. Ben-Eliezer, T. Kaufman, M. Krivelevich, and D. Ron. Comparing the strength of query types in property testing: The case of k-colorability. Computational Complexity, 22(1):89–135, 2013.

[CJ15] S. Cabello and M. Jejčič. Shortest paths in intersection graphs of unit disks. Computational Geometry, 48(4):360–367, 2015.

[CL06] F. Chung and L. Lu. Concentration inequalities and martingale inequalities: A survey. Internet Math., 3(1):79–127, 2006.

[CS90] C. L. Chen and W. H. Swallow. Using group testing to estimate a proportion, and to test the binomial model. Biometrics, 46(4):1035–1046, December 1990.

[DL17] H. Dell and J. Lapinskas. Fine-grained reductions from approximate counting to decision. CoRR, abs/1707.04609, 2017.

[Dor43] R. Dorfman. The detection of defective members of large populations. Ann. Math. Statist., 14(4):436–440, 1943.

[DP09] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.

[ELRS17] T. Eden, A. Levi, D. Ron, and C. Seshadhri. Approximately counting triangles in sublinear time. SIAM J. Comput., 46(5):1603–1646, 2017.

[ER18] T. Eden and W. Rosenbaum. On sampling edges almost uniformly. In Symp. Simplicity in Algorithms (SOSA), volume 61 of OpenAccess Series in Informatics (OASIcs), pages 7:1–7:9. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.

[ERS17] T. Eden, D. Ron, and C. Seshadhri. On approximating the number of k-cliques in sublinear time. CoRR, abs/1707.04858, July 2017.

[Fei06] U. Feige. On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM J. Comput., 35(4):964–984, 2006.

[Fis03] A. V. Fishkin. Disk graphs: A short survey. In Int. Workshop Approx. and Online Alg. (WAOA), pages 260–264. Springer, 2003.

[FJO+16] M. Falahatgar, A. Jafarpour, A. Orlitsky, V. Pichapati, and A. T. Suresh. Estimating the number of defectives with group testing. In IEEE Int. Symp. Inf.
Theo. (ISIT), pages 1376–1380. IEEE, 2016.

[GR08] O. Goldreich and D. Ron. Approximating average parameters of graphs. Random Struct. Algo., 32(4):473–493, 2008.

[GRS11] M. Gonen, D. Ron, and Y. Shavitt. Counting stars and other small subgraphs in sublinear-time. SIAM J. Discrete Math., 25(3):1365–1411, 2011.

[KZC+05] R. J. Klein, C. Zeiss, E. Chew, J.-Y. Tsai, R. S. Sackler, C. Haynes, A. K. Henning, J. P. SanGiovanni, S. M. Mane, S. T. Mayne, R. B. Bracken, F. L. Ferris, J. Ott, C. Barnstable, and J. Noh. Complement factor H polymorphism in age-related macular degeneration. Science, 308(5720):385–389, 2005.

[ORRR12] K. Onak, D. Ron, M. Rosen, and R. Rubinfeld. A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In Proc. 23rd ACM-SIAM Sympos. Discrete Algs. (SODA), pages 1123–1131, 2012.

[RT16] D. Ron and G. Tsur. The power of an example: Hidden set size approximation using group queries and conditional sampling. ACM Trans. Comp. Theo., 8(4):15:1–15:19, 2016.

[Ses15] C. Seshadhri. A simpler sublinear algorithm for approximating the triangle count. CoRR, abs/1505.01927, May 2015.

[Sto83] L. Stockmeyer. The complexity of approximate counting (preliminary version). In Proc. 15th Annu. ACM Sympos. Theory Comput. (STOC), pages 118–126, Boston, Massachusetts, 1983.

[Sto85] L. Stockmeyer. On approximation algorithms for #P. SIAM J. Comput., 14(4):849–861, 1985.

[Swa85] W. H. Swallow. Group testing for estimating infection rates and probabilities of disease transmission. Phytopathology, 75(8):882, 1985.

[WLY13] J. Wang, E. Lo, and M. L. Yiu. Identifying the most connected vertices in hidden bipartite graphs using group testing.