Dense Subgraphs in Random Graphs
Paul Balister, Béla Bollobás, Julian Sahasrabudhe, Alexander Veremyev
Paul Balister ∗, Béla Bollobás †, Julian Sahasrabudhe ‡, Alexander Veremyev §
November 5, 2018
Abstract
For a constant $\gamma \in [0,1]$ and a graph $G$, let $\omega_\gamma(G)$ be the largest integer $k$ for which there exists a $k$-vertex subgraph of $G$ with at least $\gamma\binom{k}{2}$ edges. We show that if $0 < p < \gamma < 1$ then $\omega_\gamma(G_{n,p})$ is concentrated on a set of two integers. More precisely, with $\alpha(\gamma,p) = \gamma\log\frac{\gamma}{p} + (1-\gamma)\log\frac{1-\gamma}{1-p}$, we show that $\omega_\gamma(G_{n,p})$ is one of the two integers closest to $\frac{2}{\alpha(\gamma,p)}\left(\log n - \log\log n + \log\frac{e\alpha(\gamma,p)}{2}\right) + \frac{1}{2}$, with high probability. While this situation parallels that of cliques in random graphs, a new technique is required to handle the more complicated ways in which these "quasi-cliques" may overlap.

Let $G = (V(G), E(G))$ be a simple, undirected graph where $V(G)$ denotes the set of vertices of $G$ (sometimes called nodes) and $E(G)$ denotes the set of edges. A graph $G$ is said to be complete if all possible edges are present: if $\{i,j\} \in E(G)$ for all $i, j \in V(G)$, $i \neq j$. For a subset $S \subseteq V(G)$, we denote by $G[S]$ the subgraph of $G$ induced by $S$: the graph with vertex set $S$ and edge set $\{\{i,j\} : i,j \in S\} \cap E(G)$. A clique $C$ is a subset of $V(G)$ for which $G[C]$ is a complete graph [22].

Cliques are an indispensable concept in the theory of graphs and have been extensively studied in various contexts, reaching back to the 1930s with the celebrated results of Ramsey [28] and Turán [33]. In random graphs, cliques have also been a central topic of study

∗ Department of Mathematical Sciences, University of Memphis, Memphis TN 38152, USA. Partially supported by NSF grant DMS 1600742.
† Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge CB3 0WB, UK and Department of Mathematical Sciences, University of Memphis, Memphis TN 38152, USA and London Institute for Mathematical Sciences, 35a South St., Mayfair, London W1K 2XF, UK. Partially supported by NSF grant DMS 1600742 and MULTIPLEX grant no. 317532.
‡ Instituto Nacional de Matemática Pura e Aplicada (IMPA), Estr. Dona Castorina, 110 Jardim Botânico, Rio de Janeiro RJ 22460-320, Brasil and Peterhouse, University of Cambridge, Cambridge CB2 1RD, UK.
§ Industrial Engineering & Management Systems, University of Central Florida, Orlando, Florida, USA. Supported, in part, by the AFRL Mathematical Modeling and Optimization Institute.

all edges are present in a particular subset, but only that the set is very "well connected", in some appropriate sense. Consequently, a number of relaxations of the notion of "clique" have appeared in the literature in recent years [27, 19].

One of the most popular and widely used clique relaxation models is the $\gamma$-quasi-clique, where $\gamma \in [0,1]$ is a parameter [2]. In particular, for $\gamma \in [0,1]$, a subset $S \subseteq V(G)$ of a graph $G$ is a $\gamma$-quasi-clique if the graph $G[S]$, induced by $S$, has at least $\gamma\binom{|S|}{2}$ edges. This concept was first defined by Abello, Pardalos & Resende [1], who were interested in quasi-cliques in graphs representing telecommunications data. Later, the idea of "dense clusters" (a more general concept which includes $\gamma$-quasi-cliques) was studied in the context of molecular interaction networks described by Hartwell, Hopfield, Leibler and Murray [14] and further analyzed by Spirin and Mirny [31]. They reported that dense subgraphs in molecular interaction networks correspond to meaningful modules or building blocks of molecular networks such as protein complexes or dynamic functional units. The problem of finding large dense subgraphs has also appeared in a number of other domains including biology [3, 9, 16, 4], social network analysis [10, 20, 36], finance [5, 17, 29, 30] and data mining [25, 32].

Given the myriad of instances for which the notion is useful, one would like to efficiently compute solutions to basic questions about quasi-cliques in a given graph: for example, "what is the largest $\gamma$-quasi-clique in (a given graph) $G$?" However, it comes as no surprise that finding the largest quasi-clique in a given graph (along with many other such questions) is a hard computational problem, in general [26], similar to the sister problem of finding large cliques in graphs [18, 15]. Moreover, the literature on exact computational methods for this class of problems is extremely sparse, and most existing work focuses on the development and application of heuristic methods.
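As a concrete illustration of the definition above, the following minimal Python sketch checks whether a vertex subset is a $\gamma$-quasi-clique and computes $\omega_\gamma(G)$ by exhaustive search. The function names and the edge-list representation are assumptions made for this illustration only. The search is exponential in the number of vertices, in line with the hardness noted above, and it is not the method used for the experiments in this paper.

```python
from itertools import combinations
from math import comb


def induced_edge_count(edges, subset):
    """Number of edges of the graph with both end-points in `subset`."""
    members = set(subset)
    return sum(1 for u, v in edges if u in members and v in members)


def is_quasi_clique(edges, subset, gamma):
    """True if `subset` induces at least gamma * C(|subset|, 2) edges."""
    k = len(subset)
    return induced_edge_count(edges, subset) >= gamma * comb(k, 2)


def largest_quasi_clique_size(n, edges, gamma):
    """Brute-force omega_gamma for a graph on vertices 0..n-1.

    Tries subset sizes from n downwards and returns the first size k
    for which some k-subset is a gamma-quasi-clique.
    """
    for k in range(n, 0, -1):
        for subset in combinations(range(n), k):
            if is_quasi_clique(edges, subset, gamma):
                return k
    return 0


# A 4-cycle with one chord: 5 of the 6 possible edges are present.
example_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
size_at_08 = largest_quasi_clique_size(4, example_edges, 0.8)
size_at_09 = largest_quasi_clique_size(4, example_edges, 0.9)
```

In the example, the whole vertex set qualifies at $\gamma = 0.8$ (it has 5 edges and needs at least 4.8), while at $\gamma = 0.9$ only a triangle does. Stepping the size downward and stopping at the first hit is valid because, as argued later in the proof of Lemma 3, a quasi-clique on more vertices always contains one on fewer.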
It is therefore natural to study quasi-cliques in "random" or "typical" graphs, which may suffice for most applications, while allowing us to avoid the many hard computational barriers blocking the general problem.

To this end, we study the order of the largest $\gamma$-quasi-clique in the binomial random graph, a project initiated in a paper of Veremyev and Boginski [34]. For a graph $G$, we let $\omega_\gamma(G)$ be the size of the largest subset of vertices of $G$ that induces a $\gamma$-quasi-clique. Of course, $\omega_1(G)$ is the classical "clique number" of $G$, often denoted by $\omega(G)$.

We prove that $\omega_\gamma(G_{n,p})$ is concentrated on two explicitly determined points, with high probability as $n \to \infty$, provided $0 < p < \gamma < 1$. We also carried out computational experiments on graphs with $n = 50$ and $n = 100$ vertices sampled from the $G_{n,p}$ model. For the results of these experiments, see Section 5.

As usual write $[n]$ for the set $\{1, \ldots, n\}$ and $G_{n,p}$ for the binomial random graph on vertex set $[n]$ with edge probability $p \in (0,1)$. We use $O_n(1)$ to denote a quantity that is bounded by a constant as $n$ tends to infinity and we use $o_n(1)$ to denote a quantity that tends to zero as $n$ tends to infinity. We say that a sequence of events $E_n$ holds with high probability (henceforth whp) if $\mathbb{P}(E_n) = 1 - o_n(1)$. For a graph $G$ we let $e(G)$ denote the number of edges in the graph.

A complete subgraph on $k$ vertices will be called a $k$-clique, and we define the clique number $\omega(G)$ of a graph $G$ to be the largest integer $k$ for which $G$ contains a $k$-clique. The clique number of $G_{n,p}$ was first carefully studied by Matula [23], who noticed that it is concentrated on a small set of values. These results were later strengthened by Grimmett and McDiarmid [12] and then Bollobás and Erdős [7], who showed that for fixed $0 < p < 1$ the clique number of $G_{n,p}$ takes one of at most two values, whp (see also Theorem 11.1 in [6]). We prove that a similar phenomenon persists for $\gamma$-quasi-cliques. However, a significant difficulty arises when controlling the concentration of the count of $\gamma$-quasi-cliques directly.
We tackle this issue by instead controlling a closely related random variable, which is more naturally handled.

We call an $n$-vertex graph $G$ a $\gamma$-quasi-clique if $e(G) \geq \gamma\binom{n}{2}$. For a graph $G$, we define $\omega_\gamma(G)$ to be the largest integer $k$ for which there exists a $\gamma$-quasi-clique subgraph of order $k$. For $0 < p < \gamma < 1$, we show that $\omega_\gamma(G_{n,p})$ is concentrated on two points whp as $n \to \infty$.

Theorem 1. Let $0 < p < \gamma < 1$ and $\varepsilon > 0$ be fixed and define
\[ \alpha(\gamma,p) := \gamma\log\frac{\gamma}{p} + (1-\gamma)\log\frac{1-\gamma}{1-p}. \]
Then
\[ \omega_\gamma(G_{n,p}) - \frac{2}{\alpha(\gamma,p)}\left(\log n - \log\log n + \log\frac{e\alpha(\gamma,p)}{2}\right) \in (-\varepsilon, 1+\varepsilon), \]
whp. In particular, $\omega_\gamma(G_{n,p})$ is one of the two integers closest to
\[ \frac{2}{\alpha(\gamma,p)}\left(\log n - \log\log n + \log\frac{e\alpha(\gamma,p)}{2}\right) + \frac{1}{2}, \]
whp.

As usual, the binary entropy function for $\gamma \in (0,1)$ is
\[ h(\gamma) := \gamma\log\frac{1}{\gamma} + (1-\gamma)\log\frac{1}{1-\gamma}. \]
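The quantities in Theorem 1 are straightforward to evaluate numerically. The sketch below is a minimal illustration, assuming natural logarithms throughout; the function names are hypothetical and not taken from the paper. It computes $\alpha(\gamma,p)$, the binary entropy $h(\gamma)$, the centre value from the "in particular" part of Theorem 1, and the two integers closest to that value.

```python
from math import e, floor, log


def alpha(gamma, p):
    """alpha(gamma, p) = gamma*log(gamma/p) + (1-gamma)*log((1-gamma)/(1-p))."""
    return gamma * log(gamma / p) + (1 - gamma) * log((1 - gamma) / (1 - p))


def binary_entropy(gamma):
    """h(gamma) = gamma*log(1/gamma) + (1-gamma)*log(1/(1-gamma))."""
    return gamma * log(1 / gamma) + (1 - gamma) * log(1 / (1 - gamma))


def omega_center(n, gamma, p):
    """(2/alpha) * (log n - log log n + log(e*alpha/2)) + 1/2."""
    a = alpha(gamma, p)
    return (2 / a) * (log(n) - log(log(n)) + log(e * a / 2)) + 0.5


def two_point_candidates(n, gamma, p):
    """The two integers closest to omega_center (for a non-integer centre)."""
    c = omega_center(n, gamma, p)
    return floor(c), floor(c) + 1


candidates = two_point_candidates(100, 0.9, 0.5)
```

Two remarks on the design. First, $\alpha(\gamma,p)$ is the Kullback-Leibler divergence between Bernoulli($\gamma$) and Bernoulli($p$), so it is strictly positive whenever $\gamma \neq p$ and satisfies $\alpha(\gamma,p) = -h(\gamma) - \gamma\log p - (1-\gamma)\log(1-p)$. Second, Theorem 1 is an asymptotic statement, so for small $n$ the returned pair is best read as an estimate rather than a guarantee.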
We use the following consequence of Stirling's formula. If $\gamma \in (0,1)$ is fixed, we have
\[ \binom{n}{\gamma n + O_n(1)} = e^{n h(\gamma) - \frac{1}{2}\log(n\gamma(1-\gamma)) + O_n(1)}. \tag{1} \]

We first set out to give an upper bound for $\omega_\gamma(G_{n,p})$ which holds with high probability. Let $X_k = X_{k,\gamma}(G_{n,p})$ be the random variable which counts the number of subgraphs of $G_{n,p}$ that are $\gamma$-quasi-cliques on $k$ vertices. We easily obtain an upper bound on $\omega_\gamma(G_{n,p})$ by bounding $\mathbb{E}X_k$. In preparation, we state a basic fact about binomial random variables.

Lemma 2. Let $0 < p < \gamma < 1$ be fixed and $N \to \infty$. We have
\[ \mathbb{P}(\mathrm{Bin}(N,p) = \lceil\gamma N\rceil) = e^{-N\alpha(\gamma,p) + O(\log N)}, \]
and
\[ \mathbb{P}(\mathrm{Bin}(N,p) \geq \gamma N) = e^{-N\alpha(\gamma,p) + O(\log N)}. \]

Proof. We have
\[ \mathbb{P}(\mathrm{Bin}(N,p) = \lceil\gamma N\rceil) = \binom{N}{\lceil\gamma N\rceil} p^{\lceil\gamma N\rceil}(1-p)^{\lfloor(1-\gamma)N\rfloor} = e^{N h(\gamma) - \frac{1}{2}\log(N\gamma(1-\gamma)) + N(\gamma\log p + (1-\gamma)\log(1-p)) + O(\log\frac{p}{1-p})} = e^{-N\alpha(\gamma,p) + O(\log N)}. \]
The second result follows as, for $r \geq \gamma N > pN$, $\mathbb{P}(\mathrm{Bin}(N,p) = r)$ is decreasing in $r$, and hence
\[ \mathbb{P}(\mathrm{Bin}(N,p) = \lceil\gamma N\rceil) \leq \mathbb{P}(\mathrm{Bin}(N,p) \geq \gamma N) \leq (N+1)\,\mathbb{P}(\mathrm{Bin}(N,p) = \lceil\gamma N\rceil). \]

We may now establish an upper bound on $\omega_\gamma(G_{n,p})$ that holds whp, thus proving one of the inequalities implicit in the statement of Theorem 1. In the following sections, we go on to show that the distribution of quasi-cliques (actually a subclass of these quasi-cliques) is sufficiently concentrated to prove Theorem 1.

Lemma 3.
Let $0 < p < \gamma < 1$ and $\varepsilon > 0$ be fixed. Then as $n \to \infty$
\[ \omega_\gamma(G_{n,p}) < \frac{2}{\alpha(\gamma,p)}\left(\log n - \log\log n + \log\frac{e\cdot\alpha(\gamma,p)}{2}\right) + 1 + \varepsilon, \]
whp.

Proof. With $X_k = X_{k,\gamma}(G_{n,p})$ and $S = \binom{k}{2}$ we have
\[ \mathbb{E}X_k = \binom{n}{k}\mathbb{P}(\mathrm{Bin}(S,p) \geq \gamma S) \leq \frac{n^k}{k!}\,e^{-S\alpha(\gamma,p) + O(\log S)} = e^{k(\log n - \alpha(\gamma,p)(k-1)/2 - \log(k/e) + o_k(1))}. \]
Let $\kappa = \frac{2}{\alpha(\gamma,p)}\left(\log n - \log\log n + \log\frac{e\cdot\alpha(\gamma,p)}{2}\right) + 1 + \varepsilon$. If $k = \lceil\kappa\rceil$ then
\[ \log n - \alpha(\gamma,p)(k-1)/2 - \log(k/e) + o_k(1) < -\frac{\varepsilon\cdot\alpha(\gamma,p)}{2} + o_k(1) \]
is negative for large enough $n$, and hence the expectation must tend to zero. Thus we have $\mathbb{P}(X_k > 0) \leq \mathbb{E}X_k = o_n(1)$. The existence of a $\gamma$-quasi-clique on $j \geq k$ vertices implies, by a simple averaging argument, that there exists a $\gamma$-quasi-clique subgraph on $k$ vertices. Thus if $X_k = 0$ then $X_j = 0$ for all $j \geq k$. Hence $\omega_\gamma(G_{n,p}) < \kappa$ with high probability.

γ-flat subgraphs

To show that $G_{n,p}$ contains a $\gamma$-quasi-clique of order roughly $\frac{2}{\alpha(\gamma,p)}\log n$ whp, we count a slightly restricted class of subgraphs. The advantage of working with this restricted class is that the second moment of their count is controlled more naturally. Roughly speaking, we say that a $\gamma$-quasi-clique $G$ is $\gamma$-flat if every induced subgraph of $G$ is close to being a $\gamma$-quasi-clique.

To make this precise, we need a few definitions. First, for a graph $G$ and a subset $A$ of the vertex set of $G$, let us define $e(A)$ to be the number of edges with both end-points in $A$.

Now, for γ ∈ (0,
1) and $\ell \in [k]$, we define $S = \binom{k}{2}$, $T = \binom{\ell}{2}$, and set
\[ D_k(\ell) = \min(T, S-T)\,\ell^{-1/2}\log k. \]
Call a $k$-vertex graph $G$ $\gamma$-flat if $e(G) = \lceil\gamma\binom{k}{2}\rceil$ and for all $A \subseteq V(G)$ with $\ell = |A| \in [2, k-1]$, we have $e(A) \leq \gamma\binom{\ell}{2} + D_k(\ell)$. We note that $\min(T, S-T)$ is clearly an upper bound on $e(A) - \gamma\binom{\ell}{2}$ when $e(G) = \lceil\gamma\binom{k}{2}\rceil$, so this is only a restriction on $e(A)$ when $|A| = \ell > (\log k)^2$.

We shall show that if a subset of $k$ vertices in $G_{n,p}$ has $\lceil\gamma\binom{k}{2}\rceil$ edges then it is reasonably likely that it will also be $\gamma$-flat, and hence the two notions are "typically" interchangeable. For positive integers $n$, $m$, $0 \leq m \leq \binom{n}{2}$, we define the Erdős–Rényi random graph $G(n,m)$ as the uniform probability space that is supported on all $n$-vertex graphs with exactly $m$ edges.

Lemma 4. Let $G = G(k, \lceil\gamma\binom{k}{2}\rceil)$ and let $\gamma$ be fixed and $k \to \infty$. Then $G$ is $\gamma$-flat with high probability.

Proof. Let $G = G(k, \lceil\gamma\binom{k}{2}\rceil)$ be realized on the vertex set $[k]$ and fix a subset $A \subseteq [k]$ with $\ell = |A| \in [2, k-1]$. We show that
\[ \binom{k}{\ell}\,\mathbb{P}\left(e(A) > \gamma\binom{\ell}{2} + D_k(\ell)\right) \leq k^{-2}. \tag{2} \]
Set $S = \binom{k}{2}$, $T = \binom{\ell}{2}$, $R = S - T$, and put $C(L) = \mathbb{P}(e(A) = L)$. Note that
\[ C(L) = \binom{T}{L}\binom{R}{\lceil\gamma S\rceil - L}\binom{S}{\lceil\gamma S\rceil}^{-1} \]
and for $0 \leq L < T$, we have
\[ Q(L) := \frac{C(L+1)}{C(L)} = \frac{T-L}{L+1}\left(\frac{\lceil\gamma S\rceil - L}{R - \lceil\gamma S\rceil + L + 1}\right). \tag{3} \]
From (3) we see that $Q(L)$ is strictly decreasing as $L$ increases. Let $L = \lceil\gamma T\rceil + r \leq T$. Then if $r \geq 1$,
\[ Q(L) \leq Q(\gamma T + r) \leq \left(\frac{(1-\gamma)T - r}{\gamma T + r + 1}\right)\left(\frac{\gamma R + 1 - r}{(1-\gamma)R + r}\right) = \left(\frac{1 - \frac{r}{(1-\gamma)T}}{1 + \frac{r+1}{\gamma T}}\right)\left(\frac{1 - \frac{r-1}{\gamma R}}{1 + \frac{r}{(1-\gamma)R}}\right) \leq \min\left\{1 - \frac{r}{(1-\gamma)T},\ 1 - \frac{r-1}{\gamma R}\right\} \leq e^{-c(r-1)/\min(R,T)}, \]
where $c = 1/\max(\gamma, 1-\gamma) > 1$. Thus
\[ C(L) = C(\lceil\gamma T\rceil + r) \leq C(\lceil\gamma T\rceil + 1)\prod_{s=1}^{r} e^{-c(s-1)/\min(R,T)} \leq e^{-cr(r-1)/(2\min(R,T))}, \tag{4} \]
where we have used the (trivial) fact that $C(\lceil\gamma T\rceil + 1) \leq 1$. Now $T\ell^{-1/2}\log k \geq \frac{1}{2}\log k$ and $R\ell^{-1/2}\log k \geq (k-1)^{1/2}\log k$, so $D_k(\ell) \to \infty$ uniformly in $\ell$ as $k \to \infty$. Thus for large $k$ we have $cr(r-1)/(2\min(R,T)) \geq c'\min(R,T)\,\ell^{-1}(\log k)^2$ for some $c' > 0$ and all $r \geq D_k(\ell) - 1$. Hence
\[ \mathbb{P}\left(e(A) \geq \gamma T + D_k(\ell)\right) = \sum_{\gamma T + D_k(\ell) \leq L \leq T} C(L) \leq \ell^2 e^{-c'\min(R,T)\,\ell^{-1}(\log k)^2} \tag{5} \]
for large enough $k$.

Consider the case when $R < T$. Then $R = (k-\ell)(k+\ell-1)/2 \geq \ell(k-\ell)/2$, so
\[ c'\min(R,T)\,\ell^{-1}(\log k)^2 \geq 5(k-\ell)\log k \geq (k-\ell)\log k + 4\log k \]
for large enough $k$. Now $\binom{k}{\ell} = \binom{k}{k-\ell} \leq k^{k-\ell}$, so
\[ \binom{k}{\ell}\,\mathbb{P}\left(e(A) > \gamma\binom{\ell}{2} + D_k(\ell)\right) \leq \binom{k}{\ell} k^2 e^{-(k-\ell)\log k - 4\log k} \leq k^{-2}, \]
as required. Now suppose $R \geq T$. Then
\[ c'\min(R,T)\,\ell^{-1}(\log k)^2 \geq 3\ell\log k \geq \ell\log k + 4\log k \]
when $k$ is large enough. Now $\binom{k}{\ell} \leq k^{\ell}$, so
\[ \binom{k}{\ell}\,\mathbb{P}\left(e(A) > \gamma\binom{\ell}{2} + D_k(\ell)\right) \leq \binom{k}{\ell} k^2 e^{-\ell\log k - 4\log k} \leq k^{-2}, \]
as required. Hence (2) holds for all $\ell \in [2, k-1]$.

For $2 \leq \ell \leq k-1$, let $Y_\ell$ be the random variable counting the number of subsets $A$ of order $\ell$ which induce more than $\gamma\binom{\ell}{2} + D_k(\ell)$ edges. By (2) we have
\[ \mathbb{P}(Y_\ell > 0) \leq \mathbb{E}(Y_\ell) = \binom{k}{\ell}\,\mathbb{P}\left(e(A) > \gamma\binom{\ell}{2} + D_k(\ell)\right) \leq k^{-2}, \]
for large enough $k$. So the probability that $Y_\ell > 0$ for any of the fewer than $k$ choices for $\ell$ is at most $k^{-1} = o_k(1)$.

Let $Z_k = Z_{k,n}$ be the random variable counting the number of copies of $\gamma$-flat subgraphs of order $k$ in $G_{n,p}$, with $p$ fixed and $n \to \infty$. We now easily bound $\mathbb{E}Z_k$, by using Lemma 4 to relate it to the quantity $\mathbb{E}X_k$.

Lemma 5.
Let $\varepsilon > 0$ and $k \leq \frac{2}{\alpha(\gamma,p)}\left(\log n - \log\log n + \log\frac{e\cdot\alpha(\gamma,p)}{2}\right) + 1 - \varepsilon$ with $k \to \infty$ as $n \to \infty$. Then $\mathbb{E}Z_k \to \infty$.

Proof. We apply Lemma 4 to deduce that
\[ \mathbb{E}Z_k = \binom{n}{k}\mathbb{P}(G(k,p) \text{ is } \gamma\text{-flat}) \geq (1 + o_k(1))\binom{n}{k}\mathbb{P}(e(G(k,p)) = \lceil\gamma S\rceil). \]
Now $k = O(\log n)$ by assumption, so
\[ \binom{n}{k} = \frac{n^k}{k!}\left(1 - O(k^2/n)\right) = (1 + o_k(1))\frac{n^k}{k!}. \]
Hence
\[ \mathbb{E}Z_k = (1 + o_k(1))\frac{n^k}{k!}\,\mathbb{P}(\mathrm{Bin}(S,p) = \lceil\gamma S\rceil) = \frac{n^k}{k!}\,e^{-S\alpha(\gamma,p) + O(\log S)} = e^{k(\log n - (k-1)\alpha(\gamma,p)/2 - \log(k/e) + o_k(1))}. \]
The exponent in the last line tends to infinity when $k \to \infty$ and $k \leq \frac{2}{\alpha(\gamma,p)}\left(\log n - \log\log n + \log\frac{e\alpha(\gamma,p)}{2}\right) + 1 - \varepsilon$.

In the next section we turn to estimate the variance of $Z_k$.

The second moment
To prove our lower bound on $\omega_\gamma(G_{n,p})$, we count the number of $\gamma$-flat subsets of order $k$ in $G_{n,p}$, where $k$ is roughly $\frac{2}{\alpha(\gamma,p)}\log n$. For $k \in [n]$, recall that $Z_k$ is the random variable which counts the number of $\gamma$-flat subsets of $G(n,p)$. To apply Chebyshev's inequality, we aim to estimate the fraction
\[ F = \frac{\operatorname{Var} Z_k}{(\mathbb{E}Z_k)^2} = \frac{\mathbb{E}Z_k^2 - (\mathbb{E}Z_k)^2}{(\mathbb{E}Z_k)^2}. \tag{6} \]
In particular, we shall show $F = o(1)$, as both $k$ and $n$ tend to infinity. Let $A, B \subseteq [n]$ with $|A| = |B| = k$ and $|A \cap B| = \ell$. We think of $\ell \in [2, k-1]$ and treat the degenerate cases $\ell \in \{0, 1, k\}$ separately. Put $S = \binom{k}{2}$, $T = \binom{\ell}{2}$, $R = S - T$ and let $g_\ell(L)$ denote the probability that $e(A) = \lceil\gamma S\rceil$, $e(B) = \lceil\gamma S\rceil$ and $e(A \cap B) = L$. We note that
\[ g_\ell(L) = \binom{T}{L}\binom{R}{\lceil\gamma S\rceil - L}^{2} p^{2\lceil\gamma S\rceil - L}(1-p)^{2\lfloor(1-\gamma)S\rfloor - T + L} \]
and consider the ratio
\[ R_\ell(L) = \frac{g_\ell(L)}{\mathbb{P}(e(A) = \lceil\gamma S\rceil)^2} = \binom{T}{L}\binom{R}{\lceil\gamma S\rceil - L}^{2}\binom{S}{\lceil\gamma S\rceil}^{-2} p^{-L}(1-p)^{L-T}. \tag{7} \]
The following lemma gives us a suitable way of estimating the quantity $R_\ell(L)$, for our purposes. For the remainder of the section, we maintain the assumption that $0 < p < \gamma < 1$ and $k \to \infty$.

Lemma 6.
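Let $2 \leq \ell \leq k-1$, let $r \geq 0$ be an integer and set $\lambda = 2\cdot\frac{\gamma}{1-\gamma}\cdot\frac{1-p}{p}$. Then
\[ R_\ell(\lfloor\gamma T\rfloor + r) \leq \lambda^r e^{T\alpha(\gamma,p) + O_k(1)} \tag{8} \]
and
\[ R_\ell(\lfloor\gamma T\rfloor - r) \leq R_\ell(\lfloor\gamma T\rfloor). \tag{9} \]

Proof.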
We first bound R ℓ ( ⌊ γT ⌋ ). Note that R > k − R T = R ( S − R ) > S for 2 ℓ k −
1. Since S = (cid:0) k (cid:1) → ∞ and γ is fixed, we may bound line (7) by usingequation (1), to obtain R ℓ ( ⌊ γT ⌋ ) = e T h ( γ )+2( S − T ) h ( γ ) − Sh ( γ )+ log S γ (1 − γ ) R T + O k (1) p −⌊ γT ⌋ (1 − p ) −⌈ (1 − γ ) T ⌉ e − T h ( γ )+ O k (1) p −⌊ γT ⌋ (1 − p ) −⌈ (1 − γ ) T ⌉ = e T α ( γ,p )+ O k (1) . C ( L ) = R ℓ ( L + 1) /R ℓ ( L ) and observe that C ( L ) can be written as1 − pp · T − LL + 1 (cid:18) ⌈ γS ⌉ − LR − ⌈ γS ⌉ + L + 1 (cid:19) . From this expression, we see that C ( L ) strictly decreases as L increases and therefore C ( ⌊ γT ⌋ + r ) C ( ⌊ γT ⌋ ) − pp · γ − γ (cid:18) γR (cid:19) . Now note that since ℓ < k , by assumption, we have that R > k − R tends toinfinity with k . Hence, for large k , C ( ⌊ γT ⌋ + r ) − pp γ − γ = λ. We now apply this inequality r times to obtain R ℓ ( ⌊ γT ⌋ + r ) λ r R ℓ ( ⌊ γT ⌋ ) , which holds for k sufficiently large, but independently of r . This proves the inequality (8).To prove the inequality (9) we note that C ( L ) is strictly decreasing and C ( ⌊ γT ⌋ − > − pp γ − γ · (cid:18) − γ ) T (cid:19) (cid:18) γT (cid:19) > − pp γ − γ > . Thus R ℓ ( ⌊ γT ⌋ − r ) R ℓ ( ⌊ γT ⌋ ).We are now in a position to show that F = o (1) as n and k tend to infinity. Lemma 7.
Let $k \leq \frac{2}{\alpha(\gamma,p)}\left(\log n - \log\log n + \log\frac{e\alpha(\gamma,p)}{2}\right) + 1 - \varepsilon$. We have $\mathbb{E}Z_k^2 = (1 + o_k(1))(\mathbb{E}Z_k)^2$.

Proof.
We consider the fraction $F$, from equation (6). We keep with the convention that $S = \binom{k}{2}$, $T = \binom{\ell}{2}$, and $R = S - T$. Let $E_A$ and $E_B$ denote the events that $A$, resp. $B$, induces a $\gamma$-flat subgraph. Let $E'_A$, resp. $E'_B$, denote the event that $A$, resp. $B$, induces exactly $\lceil\gamma S\rceil$ edges. Note that $E_A \subseteq E'_A$ and $E_B \subseteq E'_B$. Now write $t(\ell) = t_{n,k}(\ell) = \binom{k}{\ell}\binom{n-k}{k-\ell}\binom{n}{k}^{-1}$. We now turn to bound $F$. We may expand $Z_k$ as a sum of indicators
\[ Z_k = \sum_{A \subset V(G),\ |A| = k} \mathbf{1}(E_A). \]
Hence $\mathbb{E}Z_k = \binom{n}{k}\mathbb{P}(E_A)$, and thus
\[ F = \frac{\mathbb{E}Z_k^2 - (\mathbb{E}Z_k)^2}{(\mathbb{E}Z_k)^2} = \sum_{A,B}\binom{n}{k}^{-2}\frac{\mathbb{P}(E_A \cap E_B) - \mathbb{P}(E_A)\mathbb{P}(E_B)}{\mathbb{P}(E_A)^2}. \]
We now divide the sum with respect to $|A \cap B| = \ell$ to obtain
\[ F = \sum_{\ell=0}^{k}\binom{n}{k}\binom{k}{\ell}\binom{n-k}{k-\ell}\binom{n}{k}^{-2}\frac{\mathbb{P}(E_A \cap E_B) - \mathbb{P}(E_A)\mathbb{P}(E_B)}{\mathbb{P}(E_A)\mathbb{P}(E_B)} = \sum_{\ell=2}^{k-1} t(\ell)\cdot\frac{\mathbb{P}(E_A \cap E_B) - \mathbb{P}(E_A)^2}{\mathbb{P}(E_A)^2} + o_n(1), \tag{10} \]
where we have eliminated the first two terms in the above sum as $E_A$ and $E_B$ are independent events when $|A \cap B| \leq 1$. We have also eliminated the last term in the sum, i.e. when $E_A = E_B$. This is justified, as this term is at most $\left(\binom{n}{k}\mathbb{P}(E_A)\right)^{-1} = (\mathbb{E}Z_k)^{-1} = o_k(1)$, by Lemma 5. Let us denote the $\ell$th term in the sum at (10) as $F(\ell)$.

Lemma 4 implies that
\[ \frac{\mathbb{P}(E_A \cap E_B) - \mathbb{P}(E_A)^2}{\mathbb{P}(E_A)^2} \leq (1 + o_k(1))\frac{\mathbb{P}(E_A \cap E_B)}{\mathbb{P}(E'_A)^2}. \]
For $\ell \in [2, k-1]$, the flatness condition on $A$ applies and hence
\begin{align*}
\frac{\mathbb{P}(E_A \cap E_B)}{\mathbb{P}(E'_A)^2} &= \mathbb{P}(E'_A)^{-2}\sum_{L \leq \gamma T + D_k(\ell)} \mathbb{P}(E_A \cap E_B \mid e(A \cap B) = L)\,\mathbb{P}(e(A \cap B) = L) \\
&\leq \mathbb{P}(E'_A)^{-2}\sum_{L \leq \gamma T + D_k(\ell)} \mathbb{P}(E'_A \cap E'_B \mid e(A \cap B) = L)\,\mathbb{P}(e(A \cap B) = L) \\
&= \sum_{L \leq \gamma T + D_k(\ell)} R_\ell(L) \\
&= \sum_{L < \gamma T} R_\ell(L) + \sum_{\gamma T \leq L \leq \gamma T + D_k(\ell)} R_\ell(L) \\
&\leq T\lambda^{D_k(\ell)} e^{T\alpha(\gamma,p) + O_k(1)}.
\end{align*}
This last inequality follows from applying the inequality (8) (from Lemma 6) to each term in the right sum and applying the inequality (9) (again from Lemma 6) to the left sum. So we may bound the $\ell$th term in the sum (10) as
\[ F(\ell) \leq t(\ell)\,T\lambda^{D_k(\ell)} e^{T\alpha(\gamma,p) + O_k(1)}. \]

We first consider the case when $R < T$. Write $\delta := k - \ell$. Now
\[ t(\ell) = \binom{k}{\delta}\binom{n-k}{\delta}\binom{n}{k}^{-1} \leq (kn)^{\delta}\binom{n}{k}^{-1} \]
and $\mathbb{E}Z_k = \binom{n}{k}e^{-S\alpha(\gamma,p) + O(\log k)} \to \infty$. Also $D_k(\ell) = R\ell^{-1/2}\log k = o_k(R)$, as $R < T$ implies $\ell > k/2$. Thus
\[ F(\ell)\,\mathbb{E}Z_k \leq e^{\delta\log(kn) - R\alpha(\gamma,p) + o_k(R)}. \]
But $R = \delta(k+\ell-1)/2 \geq 3k\delta/4$ and $k\alpha(\gamma,p) \sim 2\log n$. Thus $F(\ell) \leq (\mathbb{E}Z_k)^{-1}e^{-(1/2 - o_k(1))\delta\log n}$. In particular, $\sum_{\ell : R < T} F(\ell) = o(1)$.
Now suppose $R \geq T$. Then $\ell \leq 3k/4$. Thus $(\ell-1)\alpha(\gamma,p)/2 \leq (3/4 + o_k(1))\log n$. Hence $F(\ell) \leq e^{-(1/2 - o(1))\log n}$ and so $\sum_{\ell : R \geq T} F(\ell) = o(1)$.

After these preparations, it is only a small step to finish the proof of Theorem 1.

Proof of Theorem 1.
Let $\varepsilon > 0$. The upper bound on $\omega_\gamma(G_{n,p})$ follows from Lemma 3. For the lower bound, assume $k \leq \frac{2}{\alpha(\gamma,p)}\left(\log n - \log\log n + \log\frac{e\alpha(\gamma,p)}{2}\right) + 1 - \varepsilon$. From Lemma 5 we know that $\mathbb{E}Z_k \to \infty$, so for sufficiently large $n$ we have $\mathbb{E}Z_k > 0$. It remains to show that $\mathbb{P}(X_k = 0)$ is small. We have
\[ \mathbb{P}(X_k = 0) \leq \mathbb{P}(Z_k = 0) \leq \mathbb{P}(|Z_k - \mathbb{E}Z_k| \geq \mathbb{E}Z_k) \leq \operatorname{Var}(Z_k)/\mathbb{E}(Z_k)^2 = F = o(1), \]
where we have used the fact that every $\gamma$-flat set is a $\gamma$-quasi-clique for the first inequality. The third inequality is Chebyshev's inequality and the bound on $F$ is the content of Lemma 7.

Here, we note that the bounds obtained from Theorem 1 are actually quite accurate in practice, even for relatively small values of $n$. To illustrate, we performed a small set of computational experiments for graphs of size $n = 50$ and $n = 100$ and different values of $p$. For each pair $n, p$ we generated 100 instances of graphs sampled according to the corresponding $G_{n,p}$ model. We also selected various values of $\gamma$. For each $\gamma$, $n$, $p$, in Table 1 we report the minimum $\omega^{\gamma}_{\min}$, maximum $\omega^{\gamma}_{\max}$ and average $\omega^{\gamma}_{avg}$ cardinalities of the largest $\gamma$-quasi-cliques and compare these to $\omega^{\gamma}_{th}$, the "theoretical" value obtained from the formula in Theorem 1. That is,
\[ \omega^{\gamma}_{th}(n) = \frac{2}{\alpha(\gamma,p)}\left(\log n - \log\log n + \log\frac{e\alpha(\gamma,p)}{2}\right) + \frac{1}{2}. \tag{11} \]

[Table 1 body not recoverable from the source. Columns: $\gamma$, $\omega^{\gamma}_{\min}$, $\omega^{\gamma}_{\max}$, $\omega^{\gamma}_{avg}$, $\omega^{\gamma}_{th}$, reported for $n = 50$ and $n = 100$ at three values of $p$ each.]

Table 1: Largest quasi-cliques in graphs generated according to the $G_{n,p}$ model. For each $n, p$, the minimum $\omega^{\gamma}_{\min}$, maximum $\omega^{\gamma}_{\max}$ and average $\omega^{\gamma}_{avg}$ cardinalities of the largest quasi-cliques identified in 100 instances are reported. These values are compared against the values given by the formula for $\omega^{\gamma}_{th}$ at (11).

Observe that the obtained formula provides an accurate estimate of the $\gamma$-quasi-clique number $\omega_\gamma(G)$ in graph instances generated according to the binomial random graph $G_{n,p}$, even for relatively small values of $n$.

To identify the largest $\gamma$-quasi-clique in these experiments, we used the so-called feasibility check version of formulation F4 in [35] (or AlgF4). Previous experimental work has suggested this algorithm to be the best performing on instances generated from $G_{n,p}$.

References

[1] J. Abello, P. M. Pardalos, and M. G. C. Resende. On maximum clique problems in very large graphs. In J. Abello and J. Vitter, editors,
External Memory Algorithms and Visualization, pages 119–130. American Mathematical Society, Boston, 1999.
[2] J. Abello, M. G. C. Resende, and S. Sudarsky. Massive quasi-clique detection. In S. Rajsbaum, editor, LATIN 2002: Theoretical Informatics, pages 598–612, London, 2002. Springer-Verlag.
[3] G. D. Bader and C. W. Hogue. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4(1):2, 2003.
[4] S. Bastkowski, V. Moulton, A. Spillner, and T. Wu. The minimum evolution problem is hard: a link between tree inference and graph clustering problems. Bioinformatics, page 623, 2015.
[5] V. Boginski, S. Butenko, and P. M. Pardalos. Statistical analysis of financial networks. Computational Statistics & Data Analysis, 48(2):431–443, 2005.
[6] B. Bollobás. Random Graphs. Cambridge University Press, 2001.
[7] B. Bollobás and P. Erdős. Cliques in random graphs. Mathematical Proceedings of the Cambridge Philosophical Society, 80:419–427, 1976.
[8] I. M. Bomze, M. Budinich, P. M. Pardalos, and M. Pelillo. The maximum clique problem. In Handbook of Combinatorial Optimization, volume 4, pages 1–74. Kluwer Academic Publishers, 1999.
[9] D. Bu, Y. Zhao, L. Cai, H. Xue, X. Zhu, H. Lu, J. Zhang, S. Sun, L. Ling, N. Zhang, G. Li, and R. Chen. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Research, 31(9):2443–2450, May 2003.
[10] M. A. Crenson. Social networks and political processes in urban neighborhoods. American Journal of Political Science, 22(3):578–594, 1978.
[11] P. Erdős. Some remarks on the theory of graphs. Bull. Amer. Math. Soc., 53:292–294, 1947.
[12] G. R. Grimmett and C. J. H. McDiarmid. On colouring random graphs. Mathematical Proceedings of the Cambridge Philosophical Society, 77:313–324, 1975.
[13] F. Harary and I. C. Ross. A procedure for clique detection using the group matrix. Sociometry, 20:205–215, 1957.
[14] L. H. Hartwell, J. J. Hopfield, S. Leibler, and A. W. Murray. From molecular to modular cell biology. Nature, 402:C47–C52, 1999.
[15] J. Håstad. Clique is hard to approximate within $n^{1-\varepsilon}$. Acta Mathematica, 182(1):105–142, 1999.
[16] H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou. Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics, 21(suppl 1):i213–i221, 2005.
[17] W.-Q. Huang, X.-T. Zhuang, and S. Yao. A network analysis of the Chinese stock market. Physica A: Statistical Mechanics and its Applications, 388(14):2956–2964, 2009.
[18] R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103, New York, New York, USA, 1972. Plenum.
[19] C. Komusiewicz. Multivariate algorithmics for finding cohesive subnetworks. Algorithms, 9(1):21, 2016.
[20] P. Lee and L. V. S. Lakshmanan. Query-driven maximum quasi-clique search. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 522–530. SIAM, 2016.
[21] R. D. Luce. Connectivity and generalized cliques in sociometric group structure. Psychometrika, 15(2):169–190, 1950.
[22] R. D. Luce and A. D. Perry. A method of matrix analysis of group structure. Psychometrika, 14(2):95–116, 1949.
[23] D. W. Matula. Combinatory Mathematics and its Applications. Chapel Hill, North Carolina, 1972.
[24] R. J. Mokken. Cliques, clubs and clans. Quality and Quantity, 13(2):161–173, 1979.
[25] T. Nguyen, H. W. Lauw, and P. Tsaparas. Micro-review synthesis for multi-entity summarization. Data Mining and Knowledge Discovery, pages 1–29, 2017.
[26] J. Pattillo, A. Veremyev, S. Butenko, and V. Boginski. On the maximum quasi-clique problem. Discrete Applied Mathematics, 161(1–2):244–257, 2013.
[27] J. Pattillo, N. Youssef, and S. Butenko. On clique relaxation models in network analysis. European Journal of Operational Research, 226(1):9–18, 2013.
[28] F. P. Ramsey. On a problem of formal logic. Proc. London Math. Soc., 30:264–286, 1930.
[29] D. Saban, F. Bonomo, and N. E. Stier-Moses. Analysis and models of bilateral investment treaties using a social networks approach. Physica A: Statistical Mechanics and its Applications, 389(17):3661–3673, 2010.
[30] K. Sim, J. Li, V. Gopalkrishnan, and G. Liu. Mining maximal quasi-bicliques to co-cluster stocks and financial ratios for value investment. In Proceedings of the Sixth International Conference on Data Mining, ICDM '06, pages 1059–1063, Washington, DC, USA, 2006. IEEE Computer Society.
[31] V. Spirin and L. A. Mirny. Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences, 100(21):12123–12128, 2003.
[32] C. Tsourakakis, F. Bonchi, A. Gionis, F. Gullo, and M. Tsiarli. Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 104–112. ACM, 2013.
[33] P. Turán. On an extremal problem in graph theory. Matematikai és Fizikai Lapok, 48:436–452, 1941.
[34] A. Veremyev, V. Boginski, P. A. Krokhmal, and D. E. Jeffcoat. Dense percolation in large-scale mean-field random networks is provably explosive. PLoS ONE, 7(12):e51883, 2012.
[35] A. Veremyev, O. A. Prokopyev, S. Butenko, and E. L. Pasiliao. Exact MIP-based approaches for finding maximum quasi-cliques and dense subgraphs. Computational Optimization and Applications, 64(1):177–214, 2016.
[36] S. Wasserman and K. Faust.