Technical Report: A Generalized Matching Pursuit Approach for Graph-Structured Sparsity
Feng Chen, Baojian Zhou
Computer Science Department, University at Albany -- SUNY, 1400 Washington Avenue, Albany, NY, USA
{fchen5, bzhou6}@albany.edu

Abstract
Sparsity-constrained optimization is an important and challenging problem that has wide applicability in data mining, machine learning, and statistics. In this paper, we focus on sparsity-constrained optimization in cases where the cost function is a general nonlinear function and, in particular, the sparsity constraint is defined by a graph-structured sparsity model. Existing methods explore this problem only in the context of sparse estimation in linear models. To the best of our knowledge, this is the first work to present an efficient approximation algorithm, namely GRAPH-structured Matching Pursuit (GRAPH-MP), that optimizes a general nonlinear function subject to graph-structured constraints. We prove that our algorithm enjoys strong guarantees analogous to those designed for linear models in terms of convergence rate and approximation accuracy. As a case study, we specialize GRAPH-MP to optimize a number of well-known graph scan statistic models for the connected subgraph detection task, and empirical evidence demonstrates that our general algorithm performs superior over state-of-the-art methods that are designed specifically for the task of connected subgraph detection.

1 Introduction

In recent years, there has been a growing demand for efficient computational methods for analyzing high-dimensional data in a variety of applications such as bioinformatics, medical imaging, social networks, and astronomy. In many settings, sparsity has been shown to be effective for modeling latent structure in high-dimensional data while at the same time remaining a mathematically tractable concept. Beyond the ordinary, extensively studied sparsity model, a variety of structured sparsity models have been proposed in the literature, such as the sparsity models defined through trees [Hegde et al., 2014b], groups [Jacob et al., 2009], clusters [Huang et al., 2011], paths [Asteris et al., 2015], and connected subgraphs [Hegde et al., 2015b]. These sparsity models are designed to capture the interdependence of the locations of the non-zero components via prior knowledge, and are considered in the general sparsity-constrained optimization problem:

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.} \quad \text{supp}(x) \in \mathbb{M}, \qquad (1)$$

where $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is a differentiable cost function and the sparsity model $\mathbb{M}$ is defined as a family of structured supports: $\mathbb{M} = \{S_1, S_2, \cdots, S_L\}$, where $S_i \subseteq [n]$ satisfies a certain structure property (e.g., trees, groups, clusters). The original $k$-sparse recovery problem corresponds to the particular case where the model is $\mathbb{M} = \{S \subseteq [n] \mid |S| \le k\}$.

The methods that focus on general nonlinear cost functions fall into two major categories, structured sparsity-inducing norm based and model-projection based, both of which often assume that the cost function $f(x)$ satisfies a certain convexity/smoothness condition, such as Restricted Strong Convexity/Smoothness (RSC/RSS) or
Stable Mode-Restricted Hessian (SMRH). In particular, the methods in the first category replace the structured sparsity model with a regularization by a sparsity-inducing norm that is typically non-smooth and non-Euclidean [Bach et al., 2012]. The methods in the second category decompose Problem (1) into an unconstrained subproblem and a model projection oracle that finds the best approximation of an arbitrary $x$ in the model $\mathbb{M}$:

$$P(x) = \arg\min_{x' \in \mathbb{R}^n} \|x - x'\|_2^2 \quad \text{s.t.} \quad \text{supp}(x') \in \mathbb{M}.$$

A number of methods have been proposed specifically for the $k$-sparsity model $\mathbb{M} = \{S \subseteq [n] \mid |S| \le k\}$, including the forward-backward algorithm [Zhang, 2009], the gradient descent algorithm [Tewari et al., 2011], the gradient hard-thresholding algorithms [Yuan et al., 2013; Bahmani et al., 2013; Jain et al., 2014], and the Newton greedy pursuit algorithm [Yuan and Liu, 2014]. A limited number of methods have been proposed for other types of structured sparsity models via projected gradient descent, such as the union of subspaces [Blumensath, 2013] and the union of nested subsets [Bahmani et al., 2016].

In this paper, we focus on general nonlinear optimization subject to graph-structured sparsity constraints. Our approach applies to data with an underlying graph structure in which the nodes corresponding to $\text{supp}(x)$ form a small number of connected components. By a proper choice of the underlying graph, several other structured sparsity models, such as the "standard" $k$-sparsity, block sparsity, cluster sparsity, and tree sparsity, can be encoded as special cases of graph-structured sparsity [Hegde et al., 2015a].

We have two key observations: 1) Sparsity-inducing norms. There is no known sparsity-inducing norm that is able to capture graph-structured sparsity. The most relevant norm is the generalized fused lasso [Xin et al., 2014], which enforces smoothness between neighboring entries in $x$ but does not have fine-grained control over the number of connected components. Hence, existing methods based on sparsity-inducing norms are not directly applicable to the problem to be optimized. 2) Model projection oracle. There is no exact model projection oracle for a graph-structured sparsity model, as the exact projection problem is NP-hard due to a reduction from the classical Steiner tree problem [Hegde et al., 2015b]. As most existing model-projection based methods assume an exact model projection oracle, they are not directly applicable here either. To the best of our knowledge, there is only one recent approach that admits inexact projections for a graph-structured sparsity model by assuming "head" and "tail" approximations of the projections, but it is only applicable to linear regression problems [Hegde et al., 2015b]. This paper generalizes this approach to optimize general nonlinear functions. The main contributions of our study are summarized as follows:

• Design of an efficient approximation algorithm.
A new and efficient algorithm, namely GRAPH-MP, is developed to approximately solve Problem (1) with a differentiable cost function and a graph-structured sparsity model. We show that GRAPH-MP reduces to a state-of-the-art algorithm for graph-structured compressive sensing and linear models, namely GRAPH-COSAMP, when $f(x)$ is a least-squares loss function.

• Theoretical analysis and connections to existing methods. The convergence rate and accuracy of the proposed GRAPH-MP are analyzed under a condition on $f(x)$ that is weaker than popular conditions such as RSC/RSS and SMRH. We demonstrate that GRAPH-MP enjoys strong guarantees analogous to those of GRAPH-COSAMP on both convergence rate and accuracy.

• Comprehensive experiments to validate the effectiveness and efficiency of the proposed techniques. The proposed GRAPH-MP is applied to optimize a variety of graph scan statistic models for the task of connected subgraph detection. Extensive experiments demonstrate that GRAPH-MP performs superior over state-of-the-art methods that are customized for the task of connected subgraph detection, on both running time and accuracy.

The rest of this paper is organized as follows. Section 2 introduces the graph-structured sparsity model. Section 3 formalizes the problem and presents the efficient algorithm GRAPH-MP. Sections 4, 5, and 6 present theoretical analysis under the SRL, RSC/RSS, and WRSC conditions, respectively. Section 7 gives the applications of GRAPH-MP. Experiments are presented in Section 8, and Section 9 describes future work.

2 Graph-Structured Sparsity Model

Given an underlying graph $\mathbb{G} = (\mathbb{V}, \mathbb{E})$ defined on the coefficients of the unknown vector $x$, where $\mathbb{V} = [n]$ and $\mathbb{E} \subseteq \mathbb{V} \times \mathbb{V}$, a graph-structured sparsity model has the form:

$$\mathbb{M}(k, g) = \{S \subseteq \mathbb{V} \mid |S| \le k, \gamma(S) = g\}, \qquad (2)$$

where $k$ refers to an upper bound on the sparsity (total number of nodes) of $S$ and $\gamma(S) = g$ refers to the maximum number of connected components formed by the forest induced by $S$: $\mathbb{G}_S = (S, \mathbb{E}_S)$, where $\mathbb{E}_S = \{(i, j) \mid i, j \in S, (i, j) \in \mathbb{E}\}$. The corresponding model projection oracle is defined as

$$P(x) = \arg\min_{x' \in \mathbb{R}^n} \|x - x'\|_2^2 \quad \text{s.t.} \quad \text{supp}(x') \in \mathbb{M}(k, g). \qquad (3)$$

Solving Problem (3) exactly is NP-hard due to a reduction from the classical Steiner tree problem. Instead of solving (3) exactly, two nearly-linear time approximation algorithms with the following complementary approximation guarantees are proposed in [Hegde et al., 2015b]:

• Tail approximation ($T(x)$): find $S \in \mathbb{M}(k_T, g)$ such that

$$\|x - x_S\|_2 \le c_T \cdot \min_{S' \in \mathbb{M}(k, g)} \|x - x_{S'}\|_2, \qquad (4)$$

where $c_T = \sqrt{7}$ and $k_T = 5k$.

• Head approximation ($H(x)$): find $S \in \mathbb{M}(k_H, g)$ such that

$$\|x_S\|_2 \ge c_H \cdot \max_{S' \in \mathbb{M}(k, g)} \|x_{S'}\|_2, \qquad (5)$$

where $c_H = \sqrt{1/14}$ and $k_H = 2k$.

If $c_T = c_H = 1$, then $T(x) = H(x) = S$ provides the exact solution of the model projection oracle, $P(x) = x_S$, which indicates that the approximations stem from the fact that $c_T > 1$ and $c_H < 1$. We note that these two approximations originally involve additional budgets ($B$) based on edge weights, which are ignored in this paper by setting unit edge weights and $B = k - g$.
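To make the model concrete, the following minimal Python sketch tests membership of a candidate support in $\mathbb{M}(k, g)$; the helper name in_model and its inputs are illustrative and not part of any released implementation.

```python
import networkx as nx

# Membership test for the graph-structured sparsity model M(k, g) of (2):
# a support S belongs to M(k, g) when |S| <= k and the subgraph of G
# induced by S forms at most g connected components.
def in_model(G: nx.Graph, S: set, k: int, g: int) -> bool:
    if len(S) > k:
        return False
    return nx.number_connected_components(G.subgraph(S)) <= g
```

For instance, on a path graph a support consisting of two separated segments has $\gamma(S) = 2$ and is rejected when $g = 1$.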
Generalization: The above graph-structured sparsity model is defined based on the number of connected components in the forest induced by $S$. The model can be generalized to graph-structured sparsity models that are defined based on other graph topology constraints, such as density, k-core, radius, cut, and various others, as long as their corresponding head and tail approximations are available.

3 Problem Formulation and Algorithm

Given the graph-structured sparsity model $\mathbb{M}(k, g)$ as defined above, the sparsity-constrained optimization problem to be studied is formulated as:

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.} \quad \text{supp}(x) \in \mathbb{M}(k, g), \qquad (6)$$

where $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is a differentiable cost function, and the upper bound on the sparsity $k$ and the maximum number of connected components $g$ are predefined by users.
Hegde et al. propose GRAPH-COSAMP, a variant of COSAMP [Hegde et al., 2015b], to optimize the least-squares cost function $f(x) = \frac{1}{2}\|y - Ax\|_2^2$ based on the head and tail approximations. The authors show that GRAPH-COSAMP achieves an information-theoretically optimal sample complexity for a wide range of parameters. In this paper, we generalize GRAPH-COSAMP and propose a new algorithm, named GRAPH-MP, for Problem (6), as shown in Algorithm 1. The first step (Line 3) in each iteration, $g = \nabla f(x^i)$, evaluates the gradient of the cost function at the current estimate. Then a subset of nodes is identified via the head approximation, $\Gamma = H(g)$, which returns a support set whose head value is at least a constant fraction of the optimal head value, and in which pursuing the minimization will be most effective. This subset is then merged with the support of the current estimate to obtain the merged subset $\Omega$, over which the function $f$ is minimized to produce an intermediate estimate, $b = \arg\min_{x \in \mathbb{R}^n} f(x)$ s.t. $x_{\Omega^c} = 0$. Then a subset of nodes is identified via the tail approximation, $B = T(b)$, which returns a support set whose tail value is at most a constant times larger than the optimal tail value. The iterations terminate when the halting condition holds. There are two popular options for the halting condition: 1) the change of the cost function from the previous iteration is less than a threshold ($|f(x^{i+1}) - f(x^i)| \le \epsilon$); and 2) the change of the estimated minimum from the previous iteration is less than a threshold ($\|x^{i+1} - x^i\|_2 \le \epsilon$), where $\epsilon$ is a small predefined threshold.

Algorithm 1 GRAPH-MP
1: $i = 0$, $x^i = 0$;
2: repeat
3:   $g = \nabla f(x^i)$;
4:   $\Gamma = H(g)$;
5:   $\Omega = \Gamma \cup \text{supp}(x^i)$;
6:   $b = \arg\min_{x \in \mathbb{R}^n} f(x)$ s.t. $x_{\Omega^c} = 0$;
7:   $B = T(b)$;
8:   $x^{i+1} = b_B$;
9: until the halting condition holds
10: return $x^{i+1}$
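The loop of Algorithm 1 translates directly into code. The sketch below is a minimal Python rendering under stated assumptions: grad_f (the gradient), argmin_f (the restricted minimization of Line 6, returning a full vector that is zero off the given support), and the oracles head_approx and tail_approx (returning supports satisfying (5) and (4)) are user-supplied callables, not part of a specific library.

```python
import numpy as np

def graph_mp(grad_f, argmin_f, head_approx, tail_approx, n,
             eps=1e-3, max_iter=50):
    """Sketch of Algorithm 1 (GRAPH-MP); oracles assumed to satisfy (4)/(5)."""
    x = np.zeros(n)
    for _ in range(max_iter):
        g = grad_f(x)                            # Line 3: gradient at x^i
        gamma = set(head_approx(g))              # Line 4: head support
        omega = gamma | set(np.flatnonzero(x))   # Line 5: merge supports
        b = argmin_f(omega)                      # Line 6: minimize f, b zero off omega
        support = list(tail_approx(b))           # Line 7: tail support
        x_next = np.zeros(n)
        x_next[support] = b[support]             # Line 8: x^{i+1} = b_B
        if np.linalg.norm(x_next - x) <= eps:    # halting condition 2)
            return x_next
        x = x_next
    return x
```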
4 GRAPH-MP under the SRL Condition

In this section, we give the definition of Stable Restricted Linearization (SRL) [Bahmani et al., 2013], and we show that our GRAPH-MP algorithm enjoys a theoretical approximation guarantee under this SRL condition.

Definition 1 (Restricted Bregman Divergence [Bahmani et al., 2013]). We denote the restricted Bregman divergence of $f$ by $B_f(\cdot \| \cdot)$. The restricted Bregman divergence of $f: \mathbb{R}^p \rightarrow \mathbb{R}$ between points $x$ and $y$ is defined as

$$B_f(x \| y) = f(x) - f(y) - \langle \widetilde{\nabla} f(y), x - y \rangle, \qquad (7)$$

where $\widetilde{\nabla} f(y)$ gives a restricted subgradient of $f$. We say a vector $\widetilde{\nabla} f(x)$ is a restricted subgradient of $f: \mathbb{R}^p \rightarrow \mathbb{R}$ at point $x$ if

$$f(x + y) - f(x) \ge \langle \widetilde{\nabla} f(x), y \rangle \qquad (8)$$

holds for all $k$-sparse vectors $y$.

Definition 2 (Stable Restricted Linearization (SRL) [Bahmani et al., 2013]). Let $x$ be a $k$-sparse vector in $\mathbb{R}^p$. For a function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ we define the functions

$$\alpha_k(x) = \sup \left\{ \frac{B_f(x + y \| x)}{\|y\|_2^2} \,\Big|\, y \ne 0 \text{ and } |\text{supp}(x) \cup \text{supp}(y)| \le k \right\} \qquad (9)$$

and

$$\beta_k(x) = \inf \left\{ \frac{B_f(x + y \| x)}{\|y\|_2^2} \,\Big|\, y \ne 0 \text{ and } |\text{supp}(x) \cup \text{supp}(y)| \le k \right\}. \qquad (10)$$

Then $f(\cdot)$ is said to have a Stable Restricted Linearization with constant $\mu_k$, or $\mu_k$-SRL, if $\alpha_k(x)/\beta_k(x) \le \mu_k$ for all $k$-sparse vectors $x$.
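For intuition about Definition 2, consider the quadratic loss $f(x) = \frac{1}{2}\|y - Ax\|_2^2$, for which $B_f(x + u \| x) = \frac{1}{2}\|Au\|_2^2$, so $\alpha_k(x)/\beta_k(x)$ is governed by the extreme singular values of the column submatrices of $A$. A small numeric sketch with random, illustrative data:

```python
import numpy as np
from itertools import combinations

# For f(x) = 0.5*||y - A x||_2^2, B_f(x+u||x) = 0.5*||A u||_2^2, so on each
# support S with |S| = k the ratio alpha_k/beta_k equals the squared condition
# number of the column submatrix A_S; mu_k is the worst case over supports.
rng = np.random.default_rng(0)
m, n, k = 20, 8, 3
A = rng.normal(size=(m, n)) / np.sqrt(m)
mu_k = max(np.linalg.cond(A[:, list(S)]) ** 2
           for S in combinations(range(n), k))
print("empirical SRL constant mu_k:", mu_k)
```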
Lemma 4.1. Denote $\Delta = x_1 - x_2$ and $\Delta' = \widetilde{\nabla} f(x_1) - \widetilde{\nabla} f(x_2)$, and let $r \ge |\text{supp}(x_1) \cup \text{supp}(x_2)|$, $\bar{\alpha}_l(x_1, x_2) = \alpha_l(x_1) + \alpha_l(x_2)$, $\bar{\beta}_l(x_1, x_2) = \beta_l(x_1) + \beta_l(x_2)$, and $\bar{\gamma}_l(x_1, x_2) = \bar{\alpha}_l(x_1, x_2) - \bar{\beta}_l(x_1, x_2)$. For any $R' \subseteq R = \text{supp}(x_1 - x_2)$, we have

$$\|\Delta'_{R'}\|_2 \le \bar{\alpha}_r \|\Delta_{R'}\|_2 + \bar{\gamma}_r \|\Delta\|_2, \qquad (11)$$

$$\|\Delta'_{R'}\|_2 \ge \bar{\beta}_r \|\Delta_{R'}\|_2 - \bar{\gamma}_r \|\Delta_{R \setminus R'}\|_2. \qquad (12)$$

Proof. From [Bahmani et al., 2013] we have the following properties, where $R'$ is any subset of $R = \text{supp}(\Delta)$:

$$\left| \bar{\alpha}_r \|\Delta_{R'}\|_2^2 - \langle \Delta', \Delta_{R'} \rangle \right| \le \bar{\gamma}_r \|\Delta_{R'}\|_2 \|\Delta\|_2, \qquad (13)$$

$$\left| \|\Delta'_{R'}\|_2^2 - \bar{\alpha}_r \langle \Delta', \Delta_{R'} \rangle \right| \le \bar{\gamma}_r \|\Delta'_{R'}\|_2 \|\Delta\|_2. \qquad (14)$$

It follows from (13) and (14) that

$$\|\Delta'_{R'}\|_2^2 - \bar{\alpha}_r^2 \|\Delta_{R'}\|_2^2 = \|\Delta'_{R'}\|_2^2 - \bar{\alpha}_r \langle \Delta', \Delta_{R'} \rangle + \bar{\alpha}_r \left[ \langle \Delta', \Delta_{R'} \rangle - \bar{\alpha}_r \|\Delta_{R'}\|_2^2 \right] \le \bar{\gamma}_r \|\Delta'_{R'}\|_2 \|\Delta\|_2 + \bar{\alpha}_r \bar{\gamma}_r \|\Delta_{R'}\|_2 \|\Delta\|_2.$$

This can be reformulated as

$$\|\Delta'_{R'}\|_2^2 - \bar{\gamma}_r \|\Delta'_{R'}\|_2 \|\Delta\|_2 \le \bar{\alpha}_r^2 \|\Delta_{R'}\|_2^2 + \bar{\alpha}_r \bar{\gamma}_r \|\Delta_{R'}\|_2 \|\Delta\|_2,$$

and adding $\frac{1}{4}\bar{\gamma}_r^2 \|\Delta\|_2^2$ to both sides yields

$$\left( \|\Delta'_{R'}\|_2 - \tfrac{1}{2}\bar{\gamma}_r \|\Delta\|_2 \right)^2 \le \left( \bar{\alpha}_r \|\Delta_{R'}\|_2 + \tfrac{1}{2}\bar{\gamma}_r \|\Delta\|_2 \right)^2. \qquad (15)$$

Hence, we have $\|\Delta'_{R'}\|_2 \le \bar{\alpha}_r \|\Delta_{R'}\|_2 + \bar{\gamma}_r \|\Delta\|_2$. Inequality (12) follows directly from [Bahmani et al., 2013].
Theorem 4.2. Suppose that $f$ satisfies $\mu_{8k}$-SRL with $\frac{2 + c_H}{3} < \mu_{8k} < \frac{2 + c_H}{2}$. Furthermore, suppose that for $\beta_{8k}$ in Definition 2 there exists some $\epsilon > 0$ such that $\beta_{8k}(x) \ge \epsilon$ holds for all $8k$-sparse vectors $x$. Then, for any true $x \in \mathbb{R}^n$ with $\text{supp}(x) \in \mathbb{M}(k, g)$, the iterates of Algorithm 1 obey

$$\|r^{i+1}\|_2 \le \sigma \|r^i\|_2 + \nu \|\nabla_I f(x)\|_2, \qquad (16)$$

where $r^{i+1} = x^{i+1} - x$, $I = \arg\max_{S \in \mathbb{M}(8k, g)} \|\nabla_S f(x)\|_2$, $\sigma = (1 + c_T)\sqrt{\mu_{8k}^2 - (c_H + 2 - 2\mu_{8k})^2}$, and $\nu = \frac{(1 + c_T)\left[(1 + c_T)(2 + c_H - 2\mu_{8k})(1 + c_H) + \sigma\right]}{2\epsilon\sigma}$.

Proof. $\|r^{i+1}\|_2$ is upper bounded as

$$\|r^{i+1}\|_2 = \|x^{i+1} - x\|_2 \le \|x^{i+1} - b\|_2 + \|x - b\|_2 \le c_T \|x - b\|_2 + \|x - b\|_2 = (1 + c_T)\|x - b\|_2,$$

where the first inequality follows from the triangle inequality and the second from the tail approximation. Since
$\Omega = \Gamma \cup \text{supp}(x^i)$ and $b = \arg\min_{x \in \mathbb{R}^n} f(x)$ s.t. $x_{\Omega^c} = 0$, we have

$$\|x - b\|_2 \le \|(x - b)_{\Omega^c}\|_2 + \|(x - b)_{\Omega}\|_2 = \|x_{\Omega^c}\|_2 + \|(x - b)_{\Omega}\|_2 = \|(x - x^i)_{\Omega^c}\|_2 + \|(x - b)_{\Omega}\|_2 = \|r^i_{\Omega^c}\|_2 + \|(x - b)_{\Omega}\|_2.$$

Since $b$ satisfies $b = \arg\min_{x \in \mathbb{R}^n} f(x)$ s.t. $x_{\Omega^c} = 0$, we must have $\widetilde{\nabla} f(b)|_{\Omega} = 0$. It then follows from Corollary 2 in [Bahmani et al., 2013] that

$$\|(\widetilde{\nabla} f(x) - \widetilde{\nabla} f(b))_{\Omega}\|_2 \ge \bar{\beta}_{8k} \|(x - b)_{\Omega}\|_2 - \bar{\gamma}_{8k} \|(x - b)_{\Omega^c}\|_2,$$
$$\|\widetilde{\nabla}_{\Omega} f(x)\|_2 \ge \bar{\beta}_{8k} \|(x - b)_{\Omega}\|_2 - \bar{\gamma}_{8k} \|(x - x^i)_{\Omega^c}\|_2,$$

where $\bar{\alpha}_{8k}(x, b) = \alpha_{8k}(x) + \alpha_{8k}(b)$, $\bar{\beta}_{8k}(x, b) = \beta_{8k}(x) + \beta_{8k}(b)$, and $\bar{\gamma}_{8k}(x, b) = \bar{\alpha}_{8k}(x, b) - \bar{\beta}_{8k}(x, b)$; these quantities apply because $|\text{supp}(x - b)| \le 8k$. Here, with $R = \text{supp}(x - b)$, note that $\Omega \cap R$ is a subset of $R$, that $\|(\widetilde{\nabla} f(x) - \widetilde{\nabla} f(b))_{\Omega}\|_2 \ge \|(\widetilde{\nabla} f(x) - \widetilde{\nabla} f(b))_{\Omega \cap R}\|_2$, and that $(x - b)_{\Omega} = (x - b)_{\Omega \cap R}$ and $(x - b)_{\Omega^c} = (x - b)_{R \setminus (\Omega \cap R)}$; the second inequality above also uses $\widetilde{\nabla}_{\Omega} f(b) = 0$, $b_{\Omega^c} = 0$, and $x^i_{\Omega^c} = 0$. Therefore, $\|x - b\|_2$ can be further upper bounded as

$$\|x - b\|_2 \le \|r^i_{\Omega^c}\|_2 + \frac{\bar{\gamma}_{8k}}{\bar{\beta}_{8k}} \|(x - x^i)_{\Omega^c}\|_2 + \frac{\|\widetilde{\nabla}_{\Omega} f(x)\|_2}{\bar{\beta}_{8k}} = \left[1 + \frac{\bar{\gamma}_{8k}}{\bar{\beta}_{8k}}\right] \|r^i_{\Omega^c}\|_2 + \frac{\|\widetilde{\nabla}_{\Omega} f(x)\|_2}{\bar{\beta}_{8k}}. \qquad (17)$$

Now let $R = \text{supp}(x^i - x)$ and $\Gamma = H(\nabla f(x^i)) \in \mathbb{M}^+ = \{H \cup T \mid H \in \mathbb{M}(k_H, g), T \in \mathbb{M}(k_T, g)\}$, and note that $R \in \mathbb{M}^+$. The component $\|\widetilde{\nabla}_{\Gamma} f(x^i)\|_2$ can be lower bounded as

$$\|\widetilde{\nabla}_{\Gamma} f(x^i)\|_2 \ge c_H \|\widetilde{\nabla}_R f(x^i)\|_2 \ge c_H \|\widetilde{\nabla}_R f(x^i) - \widetilde{\nabla}_R f(x)\|_2 - c_H \|\widetilde{\nabla}_R f(x)\|_2 \ge c_H \bar{\beta}_{8k} \|r^i\|_2 - c_H \|\nabla_I f(x)\|_2. \qquad (18)$$

The first inequality follows from the head approximation and $R \in \mathbb{M}^+$, the second from the triangle inequality, and the third from Lemma 4.1. The component $\|\widetilde{\nabla}_{\Gamma} f(x^i)\|_2$ can also be upper bounded as

$$\|\widetilde{\nabla}_{\Gamma} f(x^i)\|_2 \le \|\widetilde{\nabla}_{\Gamma \setminus R^c} f(x^i) - \widetilde{\nabla}_{\Gamma \setminus R^c} f(x)\|_2 + \|\widetilde{\nabla}_{\Gamma \cap R^c} f(x^i) - \widetilde{\nabla}_{\Gamma \cap R^c} f(x)\|_2 + \|\widetilde{\nabla}_{\Gamma} f(x)\|_2 \le \bar{\alpha}_{8k} \|r^i_{\Gamma \setminus R^c}\|_2 + \bar{\gamma}_{8k} \|r^i\|_2 + \bar{\gamma}_{8k} \|r^i\|_2 + \|\nabla_I f(x)\|_2. \qquad (19)$$

The first inequality follows from the triangle inequality and the decomposition $\Gamma = (\Gamma \setminus R^c) \cup (\Gamma \cap R^c)$. The second follows from Lemma 4.1, which gives $\|\widetilde{\nabla}_{\Gamma \setminus R^c} f(x^i) - \widetilde{\nabla}_{\Gamma \setminus R^c} f(x)\|_2 \le \bar{\alpha}_{8k} \|r^i_{\Gamma \setminus R^c}\|_2 + \bar{\gamma}_{8k} \|r^i\|_2$, together with $\|(\widetilde{\nabla} f(x^i) - \widetilde{\nabla} f(x))_{R'}\|_2 \le \bar{\gamma}_{8k} \|x^i - x\|_2$ for any $R' \subseteq R^c$. Combining (18) and (19) and using $\|r^i_{\Gamma \setminus R^c}\|_2 \le \|r^i_{\Gamma}\|_2$, we have

$$c_H \bar{\beta}_{8k} \|r^i\|_2 - c_H \|\nabla_I f(x)\|_2 \le \bar{\alpha}_{8k} \|r^i_{\Gamma}\|_2 + 2\bar{\gamma}_{8k} \|r^i\|_2 + \|\nabla_I f(x)\|_2.$$

Dividing by $\bar{\beta}_{8k}$ and using $\bar{\alpha}_{8k}/\bar{\beta}_{8k} \le \mu_{8k}$, $\bar{\gamma}_{8k}/\bar{\beta}_{8k} \le \mu_{8k} - 1$, and $\bar{\beta}_{8k} \ge 2\epsilon$ yields

$$\mu_{8k} \|r^i_{\Gamma}\|_2 \ge (c_H + 2 - 2\mu_{8k}) \|r^i\|_2 - \frac{1 + c_H}{2\epsilon} \|\nabla_I f(x)\|_2.$$

Finally, we get

$$\|r^i_{\Gamma}\|_2 \ge \left( \frac{c_H + 2}{\mu_{8k}} - 2 \right) \|r^i\|_2 - \frac{1 + c_H}{2\epsilon \mu_{8k}} \|\nabla_I f(x)\|_2.$$

Under the assumption $\frac{2 + c_H}{3} < \mu_{8k} < \frac{2 + c_H}{2}$, the coefficient of $\|r^i\|_2$ lies in $(0, 1)$, and the same computing procedure as in Lemma 9 of [Hegde et al., 2015] gives

$$\|r^i_{\Gamma^c}\|_2 \le \eta \|r^i\|_2 + \frac{(2 + c_H - 2\mu_{8k})(1 + c_H)}{2\epsilon \mu_{8k}^2 \eta} \|\nabla_I f(x)\|_2, \qquad (20)$$

where $\eta = \sqrt{1 - \left( \frac{c_H + 2}{\mu_{8k}} - 2 \right)^2}$.

Combining these bounds, and using $\Omega^c \subseteq \Gamma^c$, we have

$$\|x - b\|_2 \le \left[1 + \frac{\bar{\gamma}_{8k}}{\bar{\beta}_{8k}}\right] \|r^i_{\Omega^c}\|_2 + \frac{\|\widetilde{\nabla}_{\Omega} f(x)\|_2}{\bar{\beta}_{8k}} \le \mu_{8k} \|r^i_{\Gamma^c}\|_2 + \frac{\|\nabla_I f(x)\|_2}{2\epsilon} \le \mu_{8k} \eta \|r^i\|_2 + \frac{(2 + c_H - 2\mu_{8k})(1 + c_H) + \mu_{8k}\eta}{2\epsilon \mu_{8k} \eta} \|\nabla_I f(x)\|_2. \qquad (21)$$

Multiplying both sides by $(1 + c_T)$ and recalling $\|r^{i+1}\|_2 \le (1 + c_T)\|x - b\|_2$ yields (16) with $\sigma = (1 + c_T)\mu_{8k}\eta = (1 + c_T)\sqrt{\mu_{8k}^2 - (c_H + 2 - 2\mu_{8k})^2}$ and the stated $\nu$. Hence, we prove the theorem.
Theorem 4.3. Let the true parameter be $x \in \mathbb{R}^n$ such that $\text{supp}(x) \in \mathbb{M}(k, g)$, and let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be a cost function that satisfies the SRL condition of Theorem 4.2, with $\sigma < 1$. GRAPH-MP returns an $\hat{x}$ such that $\text{supp}(\hat{x}) \in \mathbb{M}(5k, g)$ and $\|x - \hat{x}\|_2 \le c \|\nabla_I f(x)\|_2$, where $c = 1 + \frac{\nu}{1 - \sigma}$ and $I = \arg\max_{S \in \mathbb{M}(8k, g)} \|\nabla_S f(x)\|_2$. The parameters $\sigma$ and $\nu$ are the fixed constants defined in Theorem 4.2. Moreover, GRAPH-MP runs in time

$$O\left( (T + |\mathbb{E}| \log n) \log(\|x\|_2 / \|\nabla_I f(x)\|_2) \right), \qquad (22)$$

where $T$ is the time complexity of one execution of the subproblem in Line 6 of GRAPH-MP. In particular, if $T$ scales linearly with $n$, then GRAPH-MP scales nearly linearly with $n$.

Proof. The $i$-th iterate of GRAPH-MP satisfies

$$\|x - x^i\|_2 \le \sigma^i \|x\|_2 + \frac{\nu}{1 - \sigma} \|\nabla_I f(x)\|_2. \qquad (23)$$

After $t = \left\lceil \log\left( \frac{\|x\|_2}{\|\nabla_I f(x)\|_2} \right) / \log(1/\sigma) \right\rceil$ iterations, GRAPH-MP returns an estimate $\hat{x}$ satisfying $\|x - \hat{x}\|_2 \le \left(1 + \frac{\nu}{1 - \sigma}\right) \|\nabla_I f(x)\|_2$, since $\sigma < 1$ and the geometric sum satisfies $\sum_{j=0}^{i-1} \nu \sigma^j = \frac{\nu(1 - \sigma^i)}{1 - \sigma} \le \frac{\nu}{1 - \sigma}$. The time complexities of both the head approximation and the tail approximation are $O(|\mathbb{E}| \log n)$. The time complexity of one iteration of GRAPH-MP is therefore $O(T + |\mathbb{E}| \log n)$, the total number of iterations is $\left\lceil \log\left( \frac{\|x\|_2}{\|\nabla_I f(x)\|_2} \right) / \log(1/\sigma) \right\rceil$, and hence the overall running time follows.

5 GRAPH-MP under the RSC/RSS Condition

Definition 3 (Restricted Strong Convexity/Smoothness, $(m_k, M_k, \mathbb{M})$-RSC/RSS [Yuan et al., 2013]). For any integer $k > 0$, we say $f(x)$ is restricted $m_k$-strongly convex and $M_k$-strongly smooth if there exist $m_k, M_k > 0$ such that

$$\frac{m_k}{2} \|x - y\|_2^2 \le f(x) - f(y) - \langle \nabla f(y), x - y \rangle \le \frac{M_k}{2} \|x - y\|_2^2, \quad \forall \|x - y\|_0 \le k. \qquad (24)$$
Lemma 5.1. Let $S$ be any index set with cardinality $|S| \le k$ and $S \in \mathbb{M}(k, g)$. If $f$ is $(m_k, M_k, \mathbb{M})$-RSC/RSS, then $f$ satisfies the following property:

$$\left\| x - y - \frac{m_k}{M_k^2} \left( \nabla_S f(x) - \nabla_S f(y) \right) \right\|_2 \le \sqrt{1 - \left(\frac{m_k}{M_k}\right)^2} \, \|x - y\|_2. \qquad (25)$$

Proof. By adding two copies of inequality (24), with the roles of $x$ and $y$ exchanged, we have

$$m_k \|x - y\|_2^2 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle \le M_k \|x - y\|_2^2, \quad \forall \|x - y\|_0 \le k. \qquad (26)$$

By Theorem 2.1.5 in [Nesterov, 2013], we have $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|_2^2$, which means

$$\|\nabla_S f(x) - \nabla_S f(y)\|_2^2 \le \|\nabla f(x) - \nabla f(y)\|_2^2 \le M_k L \|x - y\|_2^2. \qquad (27)$$

Let $L = M_k$; then $\|\nabla_S f(x) - \nabla_S f(y)\|_2 \le M_k \|x - y\|_2$. The left side of inequality (26) gives

$$m_k \|x - y\|_2^2 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle = (x - y)^T \left( \nabla_S f(x) - \nabla_S f(y) \right), \qquad (28)$$

where the last equation of (28) follows from $x - y = (x - y)_S$. For any $a$ and $b$, we have $\|a - b\|_2^2 = \|a\|_2^2 + \|b\|_2^2 - 2a^T b$. Taking $a = x - y$ and $b = \frac{m_k}{M_k^2}(\nabla_S f(x) - \nabla_S f(y))$, we have

$$\left\| x - y - \frac{m_k}{M_k^2} \left( \nabla_S f(x) - \nabla_S f(y) \right) \right\|_2^2 = \|x - y\|_2^2 + \frac{m_k^2}{M_k^4} \|\nabla_S f(x) - \nabla_S f(y)\|_2^2 - \frac{2 m_k}{M_k^2} (x - y)^T \left( \nabla_S f(x) - \nabla_S f(y) \right) \qquad (29)$$
$$\le \left( 1 + \frac{m_k^2}{M_k^2} - \frac{2 m_k^2}{M_k^2} \right) \|x - y\|_2^2 = \left( 1 - \frac{m_k^2}{M_k^2} \right) \|x - y\|_2^2. \qquad (30)$$

Taking the square root of both sides of (30) proves the result. Alternatively, if one follows Lemma 1 in [Yuan et al., 2013] with the step size replaced by $\frac{m_k}{M_k^2}$ and $\rho_s$ replaced by $\sqrt{1 - (m_k/M_k)^2}$, one obtains the same result.
Theorem 5.2. Consider the graph-structured sparsity model $\mathbb{M}(k, g)$ for some $k, g \in \mathbb{N}$ and a cost function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ that satisfies the $(m_k, M_k, \mathbb{M}(8k, g))$-RSC/RSS condition. If $\alpha = c_H - \sqrt{1 - \left(\frac{m_k}{M_k}\right)^2} \cdot (1 + c_H) > 0$, then for any true $x \in \mathbb{R}^n$ with $\text{supp}(x) \in \mathbb{M}(k, g)$, the iterates of Algorithm 1 obey

$$\|x^{i+1} - x\|_2 \le \frac{M_k (1 + c_T) \sqrt{1 - \alpha^2}}{M_k - \sqrt{M_k^2 - m_k^2}} \|x^i - x\|_2 + \frac{m_k (1 + c_T)}{M_k^2 - M_k \sqrt{M_k^2 - m_k^2}} \left( \frac{1 + c_H}{\alpha} + \frac{\alpha (1 + c_H)}{\sqrt{1 - \alpha^2}} + 1 \right) \|\nabla_I f(x)\|_2,$$

where $I = \arg\max_{S \in \mathbb{M}(8k, g)} \|\nabla_S f(x)\|_2$.

Proof. Let $r^{i+1} = x^{i+1} - x$. $\|r^{i+1}\|_2$ is upper bounded as

$$\|r^{i+1}\|_2 = \|x^{i+1} - x\|_2 \le \|x^{i+1} - b\|_2 + \|x - b\|_2 \le c_T \|x - b\|_2 + \|x - b\|_2 = (1 + c_T)\|x - b\|_2,$$

which follows from the definition of the tail approximation. The component $\|(x - b)_{\Omega}\|_2$ is upper bounded as

$$\|(x - b)_{\Omega}\|_2^2 = \langle b - x, (b - x)_{\Omega} \rangle = \left\langle b - x - \frac{m_k}{M_k^2} \nabla_{\Omega} f(b) + \frac{m_k}{M_k^2} \nabla_{\Omega} f(x), (b - x)_{\Omega} \right\rangle - \left\langle \frac{m_k}{M_k^2} \nabla_{\Omega} f(x), (b - x)_{\Omega} \right\rangle \le \sqrt{1 - \frac{m_k^2}{M_k^2}} \|b - x\|_2 \|(b - x)_{\Omega}\|_2 + \frac{m_k}{M_k^2} \|\nabla_{\Omega} f(x)\|_2 \|(b - x)_{\Omega}\|_2,$$

where the second equality follows from the fact that $\nabla_{\Omega} f(b) = 0$, since $b$ is the solution to the problem in Line 6 of Algorithm 1, and the last inequality follows from Lemma 5.1. After simplification, we have

$$\|(x - b)_{\Omega}\|_2 \le \sqrt{1 - \frac{m_k^2}{M_k^2}} \|b - x\|_2 + \frac{m_k}{M_k^2} \|\nabla_{\Omega} f(x)\|_2.$$

It follows that

$$\|x - b\|_2 \le \|(x - b)_{\Omega}\|_2 + \|(x - b)_{\Omega^c}\|_2 \le \sqrt{1 - \frac{m_k^2}{M_k^2}} \|b - x\|_2 + \frac{m_k}{M_k^2} \|\nabla_{\Omega} f(x)\|_2 + \|(x - b)_{\Omega^c}\|_2.$$

After rearrangement we obtain

$$\|b - x\|_2 \le \frac{M_k}{M_k - \sqrt{M_k^2 - m_k^2}} \left( \|(b - x)_{\Omega^c}\|_2 + \frac{m_k}{M_k^2} \|\nabla_{\Omega} f(x)\|_2 \right) = \frac{M_k}{M_k - \sqrt{M_k^2 - m_k^2}} \left( \|x_{\Omega^c}\|_2 + \frac{m_k}{M_k^2} \|\nabla_{\Omega} f(x)\|_2 \right) = \frac{M_k}{M_k - \sqrt{M_k^2 - m_k^2}} \left( \|r^i_{\Omega^c}\|_2 + \frac{m_k}{M_k^2} \|\nabla_{\Omega} f(x)\|_2 \right) \le \frac{M_k}{M_k - \sqrt{M_k^2 - m_k^2}} \left( \|r^i_{\Gamma^c}\|_2 + \frac{m_k}{M_k^2} \|\nabla_{\Omega} f(x)\|_2 \right),$$

where the first equality follows from the fact that $\text{supp}(b) \subseteq \Omega$, and the second equality and the last inequality follow from the fact that $\Omega = \Gamma \cup \text{supp}(x^i)$. Combining the above inequalities, we obtain

$$\|r^{i+1}\|_2 \le \frac{M_k (1 + c_T)}{M_k - \sqrt{M_k^2 - m_k^2}} \left( \|r^i_{\Gamma^c}\|_2 + \frac{m_k}{M_k^2} \|\nabla_I f(x)\|_2 \right).$$

From Lemma 5.3, we have

$$\|r^i_{\Gamma^c}\|_2 \le \sqrt{1 - \alpha^2} \|r^i\|_2 + \left[ \frac{\beta}{\alpha} + \frac{\alpha \beta}{\sqrt{1 - \alpha^2}} \right] \|\nabla_I f(x)\|_2. \qquad (31)$$

Combining the above inequalities proves the theorem.
Lemma 5.3. Let $r^i = x^i - x$ and $\Gamma = H(\nabla f(x^i))$. Then

$$\|r^i_{\Gamma^c}\|_2 \le \sqrt{1 - \alpha^2} \|r^i\|_2 + \left[ \frac{\beta}{\alpha} + \frac{\alpha \beta}{\sqrt{1 - \alpha^2}} \right] \|\nabla_I f(x)\|_2, \qquad (32)$$

where $\alpha = c_H - \sqrt{1 - \left(\frac{m_k}{M_k}\right)^2} \cdot (1 + c_H)$, $\beta = \frac{m_k (1 + c_H)}{M_k^2}$, and $I = \arg\max_{S \in \mathbb{M}(8k, g)} \|\nabla_S f(x)\|_2$. We assume that $c_H$ and $\sqrt{1 - (m_k/M_k)^2}$ are such that $\alpha > 0$.

Proof. Denote $\Phi = \text{supp}(x) \in \mathbb{M}(k, g)$, $\Gamma = H(\nabla f(x^i)) \in \mathbb{M}(2k, g)$, $r^i = x^i - x$, and $\Omega = \text{supp}(r^i) \in \mathbb{M}(6k, g)$. The component $\|\nabla_{\Gamma} f(x^i)\|_2$ can be lower bounded as

$$\|\nabla_{\Gamma} f(x^i)\|_2 \ge c_H \left( \|\nabla_{\Phi} f(x^i) - \nabla_{\Phi} f(x)\|_2 - \|\nabla_{\Phi} f(x)\|_2 \right) \ge c_H \frac{M_k^2 - M_k \sqrt{M_k^2 - m_k^2}}{m_k} \|r^i\|_2 - c_H \|\nabla_I f(x)\|_2,$$

where the last inequality follows from Lemma 5.1. The component $\|\nabla_{\Gamma} f(x^i)\|_2$ can also be upper bounded as

$$\|\nabla_{\Gamma} f(x^i)\|_2 \le \frac{M_k^2}{m_k} \left\| \frac{m_k}{M_k^2} \nabla_{\Gamma} f(x^i) - \frac{m_k}{M_k^2} \nabla_{\Gamma} f(x) \right\|_2 + \|\nabla_{\Gamma} f(x)\|_2 \le \frac{M_k^2}{m_k} \left\| \frac{m_k}{M_k^2} \nabla_{\Gamma} f(x^i) - \frac{m_k}{M_k^2} \nabla_{\Gamma} f(x) - r^i_{\Gamma} + r^i_{\Gamma} \right\|_2 + \|\nabla_{\Gamma} f(x)\|_2 \le \frac{M_k^2}{m_k} \left\| \frac{m_k}{M_k^2} \nabla_{\Gamma \cup \Omega} f(x^i) - \frac{m_k}{M_k^2} \nabla_{\Gamma \cup \Omega} f(x) - r^i_{\Gamma \cup \Omega} \right\|_2 + \frac{M_k^2}{m_k} \|r^i_{\Gamma}\|_2 + \|\nabla_{\Gamma} f(x)\|_2 \le \frac{M_k \sqrt{M_k^2 - m_k^2}}{m_k} \|r^i\|_2 + \frac{M_k^2}{m_k} \|r^i_{\Gamma}\|_2 + \|\nabla_I f(x)\|_2,$$

where the last inequality follows from the $(m_k, M_k, \mathbb{M}(8k, g))$-RSC/RSS condition via Lemma 5.1 and from the fact that $r^i_{\Gamma \cup \Omega} = r^i$. Combining the two bounds and grouping terms, we have

$$\|r^i_{\Gamma}\|_2 \ge \alpha \cdot \|r^i\|_2 - \beta \cdot \|\nabla_I f(x)\|_2, \qquad (33)$$

where $\alpha = \left[ c_H - \sqrt{1 - \left(\frac{m_k}{M_k}\right)^2} \cdot (1 + c_H) \right]$ and $\beta = \frac{m_k (1 + c_H)}{M_k^2}$. We assume that the constant $\tilde{\delta} = \sqrt{1 - (m_k/M_k)^2}$ is small enough that $c_H > \frac{\tilde{\delta}}{1 - \tilde{\delta}}$, which guarantees $\alpha > 0$. We consider two cases.

Case 1: the value of $\|r^i\|_2$ satisfies $\alpha \|r^i\|_2 \le \beta \|\nabla_I f(x)\|_2$. Then, for the vector $r^i_{\Gamma^c}$, we have

$$\|r^i_{\Gamma^c}\|_2 \le \|r^i\|_2 \le \frac{\beta}{\alpha} \|\nabla_I f(x)\|_2.$$

Case 2: the value of $\|r^i\|_2$ satisfies $\alpha \|r^i\|_2 \ge \beta \|\nabla_I f(x)\|_2$. From (33) we get

$$\|r^i_{\Gamma}\|_2 \ge \|r^i\|_2 \left( \alpha - \frac{\beta \|\nabla_I f(x)\|_2}{\|r^i\|_2} \right).$$

Moreover, we also have $\|r^i\|_2^2 = \|r^i_{\Gamma}\|_2^2 + \|r^i_{\Gamma^c}\|_2^2$. Therefore, we obtain

$$\|r^i_{\Gamma^c}\|_2 \le \|r^i\|_2 \sqrt{1 - \left( \alpha - \frac{\beta \|\nabla_I f(x)\|_2}{\|r^i\|_2} \right)^2}.$$

For a given $0 \le u \le 1$ and a free parameter $0 < \omega < 1$, a straightforward calculation yields

$$\sqrt{1 - u^2} \le \sqrt{1 - \omega^2} - \frac{\omega (u - \omega)}{\sqrt{1 - \omega^2}}.$$

Substituting this into the bound for $\|r^i_{\Gamma^c}\|_2$ with $u = \alpha - \beta \|\nabla_I f(x)\|_2 / \|r^i\|_2$, we get

$$\|r^i_{\Gamma^c}\|_2 \le \|r^i\|_2 \left( \sqrt{1 - \omega^2} - \frac{\omega}{\sqrt{1 - \omega^2}} \left( \alpha - \frac{\beta \|\nabla_I f(x)\|_2}{\|r^i\|_2} - \omega \right) \right) \qquad (34)$$
$$= \frac{1 - \omega \alpha}{\sqrt{1 - \omega^2}} \|r^i\|_2 + \frac{\omega \beta}{\sqrt{1 - \omega^2}} \|\nabla_I f(x)\|_2. \qquad (35)$$

The coefficient preceding $\|r^i\|_2$ determines the overall convergence rate, and its minimum value is attained by setting $\omega = \alpha$. Substituting, we obtain

$$\|r^i_{\Gamma^c}\|_2 \le \sqrt{1 - \alpha^2} \|r^i\|_2 + \left[ \frac{\beta}{\alpha} + \frac{\alpha \beta}{\sqrt{1 - \alpha^2}} \right] \|\nabla_I f(x)\|_2, \qquad (36)$$

which, together with Case 1, proves the lemma.

6 GRAPH-MP under the WRSC Condition

In order to demonstrate the accuracy of the estimates produced by Algorithm 1, we require a variant of the
Restricted Strong Convexity/Smoothness (RSC/RSS) conditions proposed in [Yuan et al., 2013] to hold. The RSC condition basically characterizes cost functions that have quadratic bounds on the derivative of the objective function when restricted to model-sparse vectors. The condition we rely on, the Weak Restricted Strong Convexity (WRSC), can be formally defined as follows:

Definition 4 (Weak Restricted Strong Convexity Property (WRSC)). A function $f(x)$ satisfies the condition $(\xi, \delta, \mathbb{M})$-WRSC if, $\forall x, y \in \mathbb{R}^n$ and $\forall S \in \mathbb{M}$ with $\text{supp}(x) \cup \text{supp}(y) \subseteq S$, the following inequality holds for some $\xi > 0$ and $0 < \delta < 1$:

$$\|x - y - \xi \nabla_S f(x) + \xi \nabla_S f(y)\|_2 \le \delta \|x - y\|_2. \qquad (37)$$
Remark 1. 1) In the special case where $f(x) = \frac{1}{2}\|y - Ax\|_2^2$ and $\xi = 1$, the condition $(\xi, \delta, \mathbb{M})$-WRSC reduces to the well-known Restricted Isometry Property (RIP) condition in compressive sensing. 2) The RSC and RSS conditions imply the WRSC condition, which indicates that the WRSC condition is no stronger than the RSC and RSS conditions [Yuan et al., 2013].

Lemma 6.1 ([Yuan et al., 2013]). Assume that $f$ is a differentiable function. If $f$ satisfies the condition $(\xi, \delta, \mathbb{M})$-WRSC, then $\forall x, y \in \mathbb{R}^n$ with $\text{supp}(x) \cup \text{supp}(y) \subset S \in \mathbb{M}$, the following two inequalities hold:

$$\frac{1 - \delta}{\xi} \|x - y\|_2 \le \|\nabla_S f(x) - \nabla_S f(y)\|_2 \le \frac{1 + \delta}{\xi} \|x - y\|_2,$$

$$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{1 + \delta}{2\xi} \|x - y\|_2^2.$$
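Remark 1(1) can be checked numerically: for $f(x) = \frac{1}{2}\|y - Ax\|_2^2$ and $\xi = 1$, the left-hand side of (37) on a support $S$ equals $\|(I - A_S^T A_S)(x - y)_S\|_2$, so the best constant $\delta$ on $S$ is the spectral norm of $I - A_S^T A_S$, i.e., the RIP constant of $A$ restricted to $S$. A small sketch with illustrative data:

```python
import numpy as np

# For f(x) = 0.5*||y - A x||_2^2 and xi = 1, the tightest WRSC constant on a
# support S is the spectral norm of I - A_S^T A_S (the RIP constant on S).
rng = np.random.default_rng(1)
m, n = 50, 10
A = rng.normal(size=(m, n)) / np.sqrt(m)
S = [0, 3, 7]
AS = A[:, S]
delta_S = np.linalg.norm(np.eye(len(S)) - AS.T @ AS, 2)
print("WRSC/RIP constant on S:", delta_S)
```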
Lemma 6.2. Let $r^i = x^i - x$ and $\Gamma = H(\nabla f(x^i))$. Then

$$\|r^i_{\Gamma^c}\|_2 \le \sqrt{1 - \eta^2} \|r^i\|_2 + \left[ \frac{\xi (1 + c_H)}{\eta} + \frac{\xi \eta (1 + c_H)}{\sqrt{1 - \eta^2}} \right] \|\nabla_I f(x)\|_2,$$

where $\eta = c_H (1 - \delta) - \delta$ and $I = \arg\max_{S \in \mathbb{M}(8k, g)} \|\nabla_S f(x)\|_2$. We assume that $c_H$ and $\delta$ are such that $\eta > 0$.

Proof. Denote $\Phi = \text{supp}(x) \in \mathbb{M}(k, g)$, $\Gamma = H(\nabla f(x^i)) \in \mathbb{M}(2k, g)$, $r^i = x^i - x$, and $\Omega = \text{supp}(r^i) \in \mathbb{M}(6k, g)$. The component $\|\nabla_{\Gamma} f(x^i)\|_2$ can be lower bounded as

$$\|\nabla_{\Gamma} f(x^i)\|_2 \ge c_H \left( \|\nabla_{\Phi} f(x^i) - \nabla_{\Phi} f(x)\|_2 - \|\nabla_{\Phi} f(x)\|_2 \right) \ge \frac{c_H (1 - \delta)}{\xi} \|r^i\|_2 - c_H \|\nabla_I f(x)\|_2,$$

where the last inequality follows from Lemma 6.1. The component $\|\nabla_{\Gamma} f(x^i)\|_2$ can also be upper bounded as

$$\|\nabla_{\Gamma} f(x^i)\|_2 \le \frac{1}{\xi} \|\xi \nabla_{\Gamma} f(x^i) - \xi \nabla_{\Gamma} f(x)\|_2 + \|\nabla_{\Gamma} f(x)\|_2 \le \frac{1}{\xi} \|\xi \nabla_{\Gamma} f(x^i) - \xi \nabla_{\Gamma} f(x) - r^i_{\Gamma} + r^i_{\Gamma}\|_2 + \|\nabla_{\Gamma} f(x)\|_2 \le \frac{1}{\xi} \|\xi \nabla_{\Gamma \cup \Omega} f(x^i) - \xi \nabla_{\Gamma \cup \Omega} f(x) - r^i_{\Gamma \cup \Omega}\|_2 + \frac{1}{\xi} \|r^i_{\Gamma}\|_2 + \|\nabla_{\Gamma} f(x)\|_2 \le \frac{\delta}{\xi} \|r^i\|_2 + \frac{1}{\xi} \|r^i_{\Gamma}\|_2 + \|\nabla_I f(x)\|_2,$$

where the last inequality follows from the condition $(\xi, \delta, \mathbb{M}(8k, g))$-WRSC and the fact that $r^i_{\Gamma \cup \Omega} = r^i$. Let $\eta = c_H (1 - \delta) - \delta$. Combining the two bounds and grouping terms, we have

$$\|r^i_{\Gamma}\|_2 \ge \eta \|r^i\|_2 - \xi (1 + c_H) \|\nabla_I f(x)\|_2.$$

After a number of algebraic manipulations similar to those in the proof of Lemma 5.3 (see also [Hegde et al., 2014a], page 11), we prove the lemma.
Theorem 6.3. Consider the graph-structured sparsity model $\mathbb{M}(k, g)$ for some $k, g \in \mathbb{N}$ and a cost function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ that satisfies the condition $(\xi, \delta, \mathbb{M}(8k, g))$-WRSC. If $\eta = c_H (1 - \delta) - \delta > 0$, then for any true $x \in \mathbb{R}^n$ with $\text{supp}(x) \in \mathbb{M}(k, g)$, the iterates of Algorithm 1 obey

$$\|x^{i+1} - x\|_2 \le \alpha \|x^i - x\|_2 + \beta \|\nabla_I f(x)\|_2, \qquad (38)$$

where $\beta = \frac{\xi (1 + c_T)}{1 - \delta} \left[ \frac{1 + c_H}{\eta} + \frac{\eta (1 + c_H)}{\sqrt{1 - \eta^2}} + 1 \right]$, $\alpha = \frac{1 + c_T}{1 - \delta} \sqrt{1 - \eta^2}$, and $I = \arg\max_{S \in \mathbb{M}(8k, g)} \|\nabla_S f(x)\|_2$.

Proof. Let $r^{i+1} = x^{i+1} - x$. $\|r^{i+1}\|_2$ is upper bounded as

$$\|r^{i+1}\|_2 = \|x^{i+1} - x\|_2 \le \|x^{i+1} - b\|_2 + \|x - b\|_2 \le c_T \|x - b\|_2 + \|x - b\|_2 = (1 + c_T)\|x - b\|_2,$$

which follows from the definition of the tail approximation. The component $\|(x - b)_{\Omega}\|_2$ is upper bounded as

$$\|(x - b)_{\Omega}\|_2^2 = \langle b - x, (b - x)_{\Omega} \rangle = \langle b - x - \xi \nabla_{\Omega} f(b) + \xi \nabla_{\Omega} f(x), (b - x)_{\Omega} \rangle - \langle \xi \nabla_{\Omega} f(x), (b - x)_{\Omega} \rangle \le \delta \|b - x\|_2 \|(b - x)_{\Omega}\|_2 + \xi \|\nabla_{\Omega} f(x)\|_2 \|(b - x)_{\Omega}\|_2,$$

where the second equality follows from the fact that $\nabla_{\Omega} f(b) = 0$, since $b$ is the solution to the problem in Line 6 of Algorithm 1, and the last inequality follows from the condition $(\xi, \delta, \mathbb{M}(8k, g))$-WRSC. After simplification, we have

$$\|(x - b)_{\Omega}\|_2 \le \delta \|b - x\|_2 + \xi \|\nabla_{\Omega} f(x)\|_2. \qquad (39)$$

It follows that

$$\|x - b\|_2 \le \|(x - b)_{\Omega}\|_2 + \|(x - b)_{\Omega^c}\|_2 \le \delta \|b - x\|_2 + \xi \|\nabla_{\Omega} f(x)\|_2 + \|(x - b)_{\Omega^c}\|_2.$$

After rearrangement we obtain

$$\|b - x\|_2 \le \frac{\|(b - x)_{\Omega^c}\|_2}{1 - \delta} + \frac{\xi \|\nabla_{\Omega} f(x)\|_2}{1 - \delta} = \frac{\|x_{\Omega^c}\|_2}{1 - \delta} + \frac{\xi \|\nabla_{\Omega} f(x)\|_2}{1 - \delta} = \frac{\|(x - x^i)_{\Omega^c}\|_2}{1 - \delta} + \frac{\xi \|\nabla_{\Omega} f(x)\|_2}{1 - \delta} = \frac{\|r^i_{\Omega^c}\|_2}{1 - \delta} + \frac{\xi \|\nabla_{\Omega} f(x)\|_2}{1 - \delta} \le \frac{\|r^i_{\Gamma^c}\|_2}{1 - \delta} + \frac{\xi \|\nabla_{\Omega} f(x)\|_2}{1 - \delta},$$

where the first equality follows from the fact that $\text{supp}(b) \subseteq \Omega$, and the second equality and the last inequality follow from the fact that $\Omega = \Gamma \cup \text{supp}(x^i)$. Combining the above inequalities, we obtain

$$\|r^{i+1}\|_2 \le \frac{(1 + c_T) \|r^i_{\Gamma^c}\|_2}{1 - \delta} + \frac{(1 + c_T) \xi \|\nabla_I f(x)\|_2}{1 - \delta}.$$

From Lemma 6.2, we have

$$\|r^i_{\Gamma^c}\|_2 \le \sqrt{1 - \eta^2} \|r^i\|_2 + \left[ \frac{\xi (1 + c_H)}{\eta} + \frac{\xi \eta (1 + c_H)}{\sqrt{1 - \eta^2}} \right] \|\nabla_I f(x)\|_2.$$

Combining the above inequalities proves the theorem.
As indicated by Theorem 6.3, under proper conditions the estimation error of GRAPH-MP is determined by the multiplier of $\|\nabla_I f(x)\|_2$, and the convergence rate before reaching this error level is geometric. In particular, if the true $x$ is sufficiently close to an unconstrained minimum of $f$, then the estimation error is negligible, because $\|\nabla_I f(x)\|_2$ has a small magnitude. In the ideal case where $\nabla f(x) = 0$, it is guaranteed that we can recover the true $x$ to arbitrary precision. If we further assume that $\alpha = \frac{(1 + c_T)\sqrt{1 - \eta^2}}{1 - \delta} < 1$, then exact recovery can be achieved in finitely many iterations.

The shrinkage rate $\alpha < 1$ controls the convergence of GRAPH-MP, and it implies that when $\delta$ is very small, the approximation factors $c_H$ and $c_T$ must satisfy

$$c_H > \sqrt{1 - 1/(1 + c_T)^2}. \qquad (40)$$

We note that the head and tail approximation algorithms designed in [Hegde et al., 2015b], with $c_T = \sqrt{7}$ and $c_H = \sqrt{1/14}$, do not satisfy the above condition. However, as proved in [Hegde et al., 2015b], the approximation factor $c_H$ of any given head approximation algorithm can be boosted to any arbitrary constant $c'_H < 1$, such that the above condition is satisfied. Empirically, it is not necessary to boost the head approximation algorithm as strongly as suggested by the analysis [Hegde et al., 2014a].
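As a quick numeric check of condition (40), assuming the unboosted constants $c_T = \sqrt{7}$ and $c_H = \sqrt{1/14}$ stated in Section 2:

```python
import numpy as np

# Condition (40): c_H must exceed sqrt(1 - 1/(1 + c_T)^2) when delta is small.
c_T, c_H = np.sqrt(7.0), np.sqrt(1.0 / 14.0)
threshold = np.sqrt(1.0 - 1.0 / (1.0 + c_T) ** 2)
print(c_H, threshold, c_H > threshold)  # ~0.267 vs ~0.962 -> condition fails
```

The unboosted $c_H$ falls far short of the threshold, which is why the head-boosting step is invoked in theory, even if it is rarely needed in practice.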
Theorem 6.4. Let $x \in \mathbb{R}^n$ be a true optimum such that $\text{supp}(x) \in \mathbb{M}(k, g)$, and let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be a cost function that satisfies the condition $(\xi, \delta, \mathbb{M}(8k, g))$-WRSC. Assuming that $\alpha < 1$, GRAPH-MP returns an $\hat{x}$ such that $\text{supp}(\hat{x}) \in \mathbb{M}(5k, g)$ and $\|x - \hat{x}\|_2 \le c \|\nabla_I f(x)\|_2$, where $c = 1 + \frac{\beta}{1 - \alpha}$ is a fixed constant. Moreover, GRAPH-MP runs in time

$$O\left( (T + |\mathbb{E}| \log n) \log(\|x\|_2 / \|\nabla_I f(x)\|_2) \right), \qquad (41)$$

where $T$ is the time complexity of one execution of the subproblem in Line 6. In particular, if $T$ scales linearly with $n$, then GRAPH-MP scales nearly linearly with $n$.

Proof. The $i$-th iterate of Algorithm 1 satisfies

$$\|x - x^i\|_2 \le \alpha^i \|x\|_2 + \frac{\beta}{1 - \alpha} \|\nabla_I f(x)\|_2. \qquad (42)$$

After $t = \left\lceil \log\left( \frac{\|x\|_2}{\|\nabla_I f(x)\|_2} \right) / \log(1/\alpha) \right\rceil$ iterations, Algorithm 1 returns an estimate $\hat{x}$ satisfying $\|x - \hat{x}\|_2 \le \left(1 + \frac{\beta}{1 - \alpha}\right) \|\nabla_I f(x)\|_2$. The time complexities of both the head and tail approximations are $O(|\mathbb{E}| \log n)$. The time complexity of one iteration of Algorithm 1 is therefore $O(T + |\mathbb{E}| \log n)$, the total number of iterations is $\left\lceil \log\left( \frac{\|x\|_2}{\|\nabla_I f(x)\|_2} \right) / \log(1/\alpha) \right\rceil$, and the overall time complexity follows.
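The logarithmic iteration bound of Theorem 6.4 is easy to evaluate; a worked instance with illustrative numbers:

```python
import numpy as np

# Iteration count t = ceil(log(||x|| / ||grad_I f(x)||) / log(1/alpha)) from
# the proof of Theorem 6.4 (alpha is the shrinkage rate; numbers illustrative).
alpha, x_norm, grad_norm = 0.5, 10.0, 1e-2
t = int(np.ceil(np.log(x_norm / grad_norm) / np.log(1.0 / alpha)))
print(t)  # 10 iterations for these values
```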
Remark 2. The previous algorithm GRAPH-COSAMP [Hegde et al., 2015b] for compressive sensing is a special case of GRAPH-MP. Assume $f(x) = \frac{1}{2}\|y - Ax\|_2^2$. 1) Reduction. The gradient in Line 3 of GRAPH-MP has the form $\nabla f(x^i) = -A^T (y - A x^i)$, and an analytical form of $b$ in Line 6 can be obtained as $b_{\Omega} = A_{\Omega}^{+} y$ and $b_{\Omega^c} = 0$, where $A_{\Omega}^{+} = (A_{\Omega}^T A_{\Omega})^{-1} A_{\Omega}^T$, which indicates that GRAPH-MP reduces to GRAPH-COSAMP in this scenario. 2) Shrinkage rate. The shrinkage rate $\alpha$ of GRAPH-MP is analogous to that of GRAPH-COSAMP, even though the shrinkage rate of GRAPH-COSAMP is optimized based on RIP-type sufficient constants. In particular, they are identical when $\delta$ is very small. 3) Constant component. Assume that $\xi = 1$. The condition $(\xi, \delta, \mathbb{M}(k, g))$-WRSC then reduces to the RIP condition in compressive sensing. Let $e = y - Ax$. The component $\|\nabla_I f(x)\|_2 = \|A_I^T e\|_2$ is upper bounded by $\sqrt{1 + \delta} \|e\|_2$ [Hegde et al., 2014a]. The constant $\beta \|\nabla_I f(x)\|_2$ is then upper bounded by $\frac{\xi (1 + c_T) \sqrt{1 + \delta}}{1 - \delta} \left[ \frac{1 + c_H}{\eta} + \frac{\eta (1 + c_H)}{\sqrt{1 - \eta^2}} + 1 \right] \|e\|_2$, which is analogous to the corresponding constant of GRAPH-COSAMP; the two are identical when $\delta$ is very small.
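In the least-squares scenario of Remark 2, Line 6 of Algorithm 1 admits the closed form above; a minimal Python sketch with hypothetical inputs:

```python
import numpy as np

# Closed-form Line 6 for f(x) = 0.5*||y - A x||_2^2: the minimizer restricted
# to a support Omega is b_Omega = A_Omega^+ y, with zeros elsewhere.
def restricted_least_squares(A: np.ndarray, y: np.ndarray, omega) -> np.ndarray:
    omega = sorted(omega)
    b = np.zeros(A.shape[1])
    b[omega] = np.linalg.pinv(A[:, omega]) @ y
    return b
```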
7 Applications of GRAPH-MP

In this section, we specialize GRAPH-MP to optimize a number of graph scan statistic models for the task of connected subgraph detection. We are given a graph $\mathbb{G} = (\mathbb{V}, \mathbb{E})$, where $\mathbb{V} = [n]$, $\mathbb{E} \subseteq \mathbb{V} \times \mathbb{V}$, and each node $v$ is associated with a vector of features $c(v) \in \mathbb{R}^p$. Let $S \subseteq \mathbb{V}$ be a connected subset of nodes. A graph scan statistic,

$$F(S) = \log \frac{\text{Prob}(\text{Data} \mid H_1(S))}{\text{Prob}(\text{Data} \mid H_0)},$$

corresponds to the generalized likelihood ratio test (GLRT) that verifies the null hypothesis ($H_0$): $c(v) \sim \mathbb{D}_0, \forall v \in \mathbb{V}$, where $\mathbb{D}_0$ refers to a predefined background distribution, against the alternative hypothesis ($H_1(S)$): $c(v) \sim \mathbb{D}_1, \forall v \in S$ and $c(v) \sim \mathbb{D}_0, \forall v \in \mathbb{V} \setminus S$, where $\mathbb{D}_1$ refers to a predefined signal distribution. The detection problem is formulated as

$$\min_{S \subseteq \mathbb{V}} -F(S) \quad \text{s.t.} \quad |S| \le k \text{ and } S \text{ is connected}, \qquad (43)$$

where $k$ is a predefined bound on the size of $S$.

Taking the elevated mean scan (EMS) statistic as an example, it aims to decide between $H_0$: $c(v) \sim N(0, 1), \forall v \in \mathbb{V}$, and $H_1(S)$: $c(v) \sim N(\mu, 1), \forall v \in S$ and $c(v) \sim N(0, 1), \forall v \in \mathbb{V} \setminus S$, where for simplicity each node $v$ only has a univariate feature $c(v) \in \mathbb{R}$. This statistic is popularly used for detecting signals among node-level numerical features on graphs [Qian et al., 2014; Arias-Castro et al., 2011] and is formulated as $F(S) = (\sum_{v \in S} c(v))^2 / |S|$. Let the vector form of $S$ be $x \in \{0, 1\}^n$, such that $\text{supp}(x) = S$. The connected subgraph detection problem can then be reformulated as

$$\min_{x \in \{0, 1\}^n} -\frac{(c^T x)^2}{\mathbf{1}^T x} \quad \text{s.t.} \quad \text{supp}(x) \in \mathbb{M}(k, g = 1), \qquad (44)$$

where $c = [c(1), \cdots, c(n)]^T$. To apply GRAPH-MP, we relax the input domain of $x$ to $x \in [0, 1]^n$, and the connected subset of nodes can be found as $S = \text{supp}(x^\star)$, the support set of the estimate $x^\star$ that minimizes the strongly convex function [Bach, 2011]:

$$\min_{x \in \mathbb{R}^n} f(x) = -\frac{(c^T x)^2}{\mathbf{1}^T x} + \frac{1}{2} x^T x \quad \text{s.t.} \quad \text{supp}(x) \in \mathbb{M}(k, 1).$$

Assume that $c$ is normalized, and hence $0 \le c_i < 1, \forall i$. Let $\hat{c} = \max_i \{c_i\}$. The Hessian matrix of the above objective function satisfies $\nabla^2 f(x) \succ 0$ and the inequalities:

$$(1 - \hat{c}^2) \cdot I \preceq I - \left(c - \frac{c^T x}{\mathbf{1}^T x}\mathbf{1}\right)\left(c - \frac{c^T x}{\mathbf{1}^T x}\mathbf{1}\right)^T \preceq 1 \cdot I. \qquad (45)$$

According to Lemma 1 (b) in [Yuan et al., 2013], the objective function $f(x)$ satisfies the condition $(\xi, \delta, \mathbb{M}(8k, g))$-WRSC with $\delta = \sqrt{1 - 2\xi(1 - \hat{c}^2) + \xi^2}$, for any $\xi$ such that $\xi < 2(1 - \hat{c}^2)$. Hence, the geometric convergence of GRAPH-MP shown in Theorem 6.3 is guaranteed. We note that not all graph scan statistic functions satisfy the WRSC condition; but, as shown in our experiments, GRAPH-MP works empirically well for all the scan statistic functions tested, and the maximum number of iterations to convergence for optimizing each of these scan statistic functions was less than 10.

We note that our proposed method GRAPH-MP is also applicable to general sparse learning problems (e.g., sparse logistic regression, sparse principal component analysis) subject to graph-structured constraints, and to a variety of subgraph detection problems, such as the detection of anomalous subgraphs, bursty subgraphs, heaviest subgraphs, frequent subgraphs or communication motifs, predictive subgraphs, and compression subgraphs.
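For reference, the relaxed EMS objective above and its gradient take only a few lines of Python; the function names are illustrative, and both assume $\mathbf{1}^T x > 0$:

```python
import numpy as np

# Relaxed EMS objective f(x) = -(c^T x)^2 / (1^T x) + 0.5 * x^T x and its
# gradient; both assume sum(x) > 0 to avoid division by zero.
def ems_objective(x: np.ndarray, c: np.ndarray) -> float:
    s, t = np.sum(x), c @ x
    return -(t ** 2) / s + 0.5 * (x @ x)

def ems_gradient(x: np.ndarray, c: np.ndarray) -> np.ndarray:
    s, t = np.sum(x), c @ x
    return -2.0 * t * c / s + (t ** 2) / (s ** 2) * np.ones_like(x) + x
```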
8 Experiments

This section evaluates the effectiveness and efficiency of the proposed GRAPH-MP approach for connected subgraph detection. The implementation of GRAPH-MP is available at https://github.com/baojianzhou/Graph-MP.

Datasets: 1) Water Pollution Dataset. The Battle of the Water Sensor Networks (BWSN) [Ostfeld et al., 2008] provides a real-world network of 12,527 nodes and 14,831 edges, and 4 nodes with chemical contaminant plumes that are distributed in four different areas. The spreads of the contaminant plumes were simulated using the water network simulator EPANET for 8 hours. For each hour, each node has a sensor that reports 1 if it is polluted and 0 otherwise. We randomly selected $K$ percent of the nodes and flipped their binary sensor values in order to test the robustness of the methods to noise, for several noise levels $K$ (a sketch of this perturbation appears after the dataset descriptions). The objective is to detect the set of polluted nodes.
2) High-Energy Physics Citation Network. The CitHepPh (high energy physics phenomenology) citation graph is from the e-print arXiv and covers all the citations within a dataset of 34,546 papers with 421,578 edges during the period from January 1993 to April 2002. Each paper is considered a node, each citation is considered an edge (direction is not considered), and each node has one attribute denoting the number of citations in a specific year ($t = 1993, \cdots, t = 2002$) and another attribute denoting the average number of citations in that year. The objective is to detect a connected subgraph of nodes (papers) whose citations are abnormally high in comparison with the citations of nodes outside the subgraph. Such a subgraph is considered an indicator of a potential emerging research area. The data before 1999 is considered training data, and the data from 1999 to 2002 is considered testing data.
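The noise injection described for the BWSN data amounts to flipping a random subset of binary sensor readings; a sketch, where the array b of 0/1 sensor values is a hypothetical input:

```python
import numpy as np

# Flip the binary sensor values of a randomly chosen K percent of the nodes,
# as in the robustness test on the BWSN data (b is a 0/1 array; K in percent).
def flip_noise(b: np.ndarray, K: float, rng: np.random.Generator) -> np.ndarray:
    noisy = b.copy()
    idx = rng.choice(len(b), size=int(len(b) * K / 100.0), replace=False)
    noisy[idx] = 1 - noisy[idx]
    return noisy
```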
Graph Scan Statistics: Three graph scan statistics were considered: Kulldorff's scan statistic [Neill, 2012], the expectation-based Poisson scan statistic (EBP) [Neill, 2012], and the elevated mean scan statistic (EMS, Equation (44)) [Qian et al., 2014]. The first two require that each node has an observed count of events and an expected count. For the water network dataset, the report of the sensor (0 or 1) at each node is considered the observed count, and the noise ratio is considered the expected count. For the CitHepPh dataset, the number of citations is considered the observed count, and the average number of citations is considered the expected count. For the EMS statistic, we consider the ratio of the observed and expected counts as the feature.

Comparison Methods:
Seven state-of-the-art baseline methods are considered: EdgeLasso [Sharpnack et al., 2012b], GraphLaplacian [Sharpnack et al., 2012a], LinearTimeSubsetScan (LTSS) [Neill, 2012], EventTree [Rozenshtein et al., 2014], AdditiveGraphScan [Speakman et al., 2013], DepthFirstGraphScan [Speakman et al., 2016], and NPHGS [Chen and Neill, 2014].
Table 1: Comparison of the scores of the three graph scan statistics (Kulldorff, EMS, and EBP) and of the running times (sec.) on the WaterNetwork and CitHepPh datasets, based on the connected subgraphs returned by Our Method and by the baselines GenFusedLasso, EdgeLasso, GraphLaplacian, LTSS, EventTree, AdditiveGraphScan, DepthFirstGraphScan, and NPHGS. EMS and EBP refer to the Elevated Mean Scan statistic and the Expectation-Based Poisson statistic, respectively.
Figure 1: Evolving curves of graph scan statistic scores for our method and GenFusedLasso: (a) WaterNetwork (Kulldorff); (b) CitHepPh (EMS).

We followed the strategies recommended by the authors in their original papers to tune the related model parameters. Specifically, for EventTree and GraphLaplacian, we tested a grid of $\lambda$ values and returned the best result. DepthFirstGraphScan is an exact search algorithm and has an exponential time cost in the worst-case scenario; we hence set a constraint of 10 on the depth of its search in order to reduce its time complexity.
We also implemented the generalized fused lasso model (GenFusedLasso) for these three graph scan statistics using the framework of the alternating direction method of multipliers (ADMM). GenFusedLasso is formalized as

$$\min_{x \in \mathbb{R}^n} -f(x) + \lambda \sum_{(i,j) \in \mathbb{E}} |x_i - x_j|, \qquad (46)$$

where $f(x)$ is a predefined graph scan statistic and the trade-off parameter $\lambda$ controls the degree of smoothness between neighboring entries of $x$. We applied the heuristic rounding step proposed in [Qian et al., 2014] to the numerical vector $x$ to identify the connected subgraph. We tested a grid of $\lambda$ values and returned the best result.
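For clarity, the objective in (46) can be evaluated as follows (edges as an integer array of node-index pairs; names illustrative):

```python
import numpy as np

# Generalized fused lasso objective of (46): -f(x) plus lambda times the sum
# of |x_i - x_j| over the edges of the graph (f_value is the scan statistic).
def gen_fused_lasso_objective(x: np.ndarray, edges: np.ndarray,
                              f_value: float, lam: float) -> float:
    penalty = np.sum(np.abs(x[edges[:, 0]] - x[edges[:, 1]]))
    return -f_value + lam * penalty
```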
Our Proposed Method GRAPH-MP: Our proposed GRAPH-MP has a single parameter $k$, an upper bound on the subgraph size. We set $k = 1000$ by default, as the sizes of the subgraphs of interest are often small; otherwise, the detection problem would be less challenging. We note that, to obtain the best performance of GRAPH-MP, one should try a set of different $k$ values and return the best result.

Performance Metrics:
The overall scores of the three graph scan statistics for the connected subgraphs returned by the competing methods were compared and analyzed. The objective is to identify methods that can find connected subgraphs with the largest scores. The running times of the different methods are also compared.
Figure 1 presents the comparison between our method and GenFusedLasso on the scores of the best connected subgraphs identified at different iterations, based on Kulldorff's scan statistic and the EMS statistic. Note that the heuristic rounding process proposed in [Qian et al., 2014] was applied to the numerical vector $x^i$ estimated by GenFusedLasso in order to identify the best connected subgraph at each iteration $i$. As the setting of the parameter $\lambda$ influences the quality of the detected connected subgraph, the results based on different $\lambda$ values are also shown in Figure 1. We observe that our proposed method GRAPH-MP converged in fewer than 5 steps, and that the qualities (scan statistic scores) of the connected subgraphs identified by GRAPH-MP at different iterations were consistently higher than those returned by GenFusedLasso.

The comparison between our method and the other eight baseline methods is shown in Table 1, which reports the scores of the three graph scan statistics for the connected subgraphs returned by these methods. The results indicate that our method outperformed all the baseline methods on the scores, except that AdditiveGraphScan achieved the highest EMS score (761.08) on the water network dataset. Although AdditiveGraphScan performed reasonably well overall, it is a heuristic algorithm and does not have theoretical guarantees.
Table 1 also shows the time costs of all competing methods on the two benchmark datasets. The results indicate that our method was the second fastest algorithm among all the comparison methods. In particular, the running times of our method were more than 10 times faster than those of the majority of the methods.
9 Conclusion and Future Work

This paper presents GRAPH-MP, an efficient algorithm for minimizing a general nonlinear function subject to graph-structured sparsity constraints. For future work, we plan to explore graph-structured constraints other than connected subgraphs, and to analyze theoretical properties of GRAPH-MP for cost functions that do not satisfy the WRSC condition.
10 Acknowledgements
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract D12PC00337. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the US government.
References

[Arias-Castro et al., 2011] Ery Arias-Castro, Emmanuel J. Candès, and Arnaud Durand. Detection of an anomalous cluster in a network. The Annals of Statistics, pp. 278–304, 2011.
[Asteris et al., 2015] Megasthenis Asteris, Anastasios Kyrillidis, Alexandros G. Dimakis, Han-Gyol Yi, et al. Stay on path: PCA along graph paths. In ICML, 2015.
[Bach et al., 2012] Francis Bach, Rodolphe Jenatton, et al. Structured sparsity through convex optimization. Statistical Science, 27(4):450–468, 2012.
[Bach, 2011] Francis Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2011.
[Bahmani et al., 2013] Sohail Bahmani, Bhiksha Raj, and Petros T. Boufounos. Greedy sparsity-constrained optimization. JMLR, 14(1):807–841, 2013.
[Bahmani et al., 2016] Sohail Bahmani, Petros T. Boufounos, and Bhiksha Raj. Learning model-based sparsity via projected gradient descent. IEEE Transactions on Information Theory, 62(4):2092–2099, 2016.
[Blumensath, 2013] Thomas Blumensath. Compressed sensing with nonlinear observations and related nonlinear optimization problems. IEEE Transactions on Information Theory, 59(6):3466–3474, 2013.
[Chen and Neill, 2014] Feng Chen and Daniel B. Neill. Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs. In KDD, pp. 1166–1175, 2014.
[Duarte and Eldar, 2011] Marco F. Duarte and Yonina C. Eldar. Structured compressed sensing: From theory to applications. IEEE Transactions on Signal Processing, 59(9):4053–4085, 2011.
[Hegde et al., 2014a] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Approximation-tolerant model-based compressive sensing. In SODA, pp. 1544–1561, 2014.
[Hegde et al., 2014b] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. A fast approximation algorithm for tree-sparse recovery. In ISIT, pp. 1842–1846, 2014.
[Hegde et al., 2015a] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Fast algorithms for structured sparsity. Bulletin of EATCS, 3(117), 2015.
[Hegde et al., 2015b] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. A nearly-linear time framework for graph-structured sparsity. In ICML, pp. 928–937, 2015.
[Hegde et al., 2015] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Approximation algorithms for model-based compressive sensing. IEEE Transactions on Information Theory, pp. 5129–5147, 2015.
[Huang et al., 2011] Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning with structured sparsity. The Journal of Machine Learning Research, 12:3371–3412, 2011.
[Jacob et al., 2009] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In ICML, pp. 433–440, 2009.
[Jain et al., 2014] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In NIPS, 2014.
[Neill, 2012] Daniel B. Neill. Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society: Series B, 74(2):337–360, 2012.
[Nesterov, 2013] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013.
[Ostfeld et al., 2008] Avi Ostfeld, James G. Uber, et al. The battle of the water sensor networks (BWSN): A design challenge for engineers and algorithms. Journal of Water Resources Planning and Management, 134(6):556–568, 2008.
[Qian et al., 2014] J. Qian, V. Saligrama, and Y. Chen. Connected sub-graph detection. In AISTATS, 2014.
[Rozenshtein et al., 2014] Polina Rozenshtein, Aris Anagnostopoulos, Aristides Gionis, and Nikolaj Tatti. Event detection in activity networks. In KDD, 2014.
[Sharpnack et al., 2012a] James Sharpnack, Alessandro Rinaldo, and Aarti Singh. Changepoint detection over graphs with the spectral scan statistic. arXiv preprint arXiv:1206.0773, 2012.
[Sharpnack et al., 2012b] James Sharpnack, Alessandro Rinaldo, and Aarti Singh. Sparsistency of the edge lasso over graphs. 2012.
[Speakman et al., 2013] S. Speakman, Y. Zhang, and D. B. Neill. Dynamic pattern detection with temporal consistency and connectivity constraints. In ICDM, 2013.
[Speakman et al., 2016] Skyler Speakman, Edward McFowland III, and Daniel B. Neill. Scalable detection of anomalous patterns with connectivity constraints. Journal of Computational and Graphical Statistics, 2016.
[Tewari et al., 2011] A. Tewari, P. K. Ravikumar, and I. S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In NIPS, pp. 882–890, 2011.
[Xin et al., 2014] Bo Xin, Yoshinobu Kawahara, Yizhou Wang, and Wen Gao. Efficient generalized fused lasso and its application to the diagnosis of Alzheimer's disease. In AAAI, pp. 2163–2169, 2014.
[Yuan and Liu, 2014] Xiao-Tong Yuan and Qingshan Liu. Newton greedy pursuit: A quadratic approximation method for sparsity-constrained optimization. In CVPR, pp. 4122–4129, 2014.
[Yuan et al., 2013] Xiao-Tong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. arXiv preprint arXiv:1311.5750, 2013.
[Zhang, 2009] Tong Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In NIPS, pp. 1921–1928, 2009.