Detection and Isolation of Adversaries in Decentralized Optimization for Non-Strongly Convex Objectives
Nikhil Ravi, Anna Scaglione
School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281 USA (e-mail: {Nikhil.Ravi, Anna.Scaglione}@asu.edu).

Abstract:
Decentralized optimization has found significant utility in recent years as a promising technique for overcoming the curse of dimensionality in large-scale inference and decision problems in big data. While these algorithms are resilient to node and link failures, they are not inherently Byzantine fault-tolerant against insider data-injection attacks. This paper proposes a decentralized robust subgradient push (RSGP) algorithm for the detection and isolation of malicious nodes in the network when optimizing non-strongly convex objectives. In the attack considered in this work, the malicious nodes follow the algorithmic protocols but can alter their local functions arbitrarily. We show that, in sufficiently structured problems, the proposed method is effective in revealing their presence. The algorithm isolates detected nodes from the regular nodes, thereby mitigating the ill effects of malicious nodes. We also provide performance measures for the proposed method.
Keywords: decentralized optimization, multi-agent systems, Byzantine fault-tolerance, adversarial optimization, gradient-based metric

⋆ This research was supported in part by the Cybersecurity for Energy Delivery Systems program of the U.S. Department of Energy, under contract DOE0000780, and the National Science Foundation CCF-BSF 1714672 grant. Any opinions and findings expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

1. INTRODUCTION

A number of machine learning and big data problems can be generalized to take the following form:

  min_{x ∈ ℝ^d} F(x) = min_{x ∈ ℝ^d} (1/N) ∑_{i=1}^{N} f_i(x),    (1)

where F(·) : ℝ^d → ℝ is the global objective which the nodes are trying to collectively minimize. However, the nodes only have access to their own private cost functions, f_i(·) : ℝ^d → ℝ. There is a large class of decentralized peer-to-peer algorithms for solving such decomposable optimization problems via peer-to-peer consensus protocols; see Tsitsiklis et al. (1986); Nedić and Ozdaglar (2009) for early contributions and Boyd et al. (2011); Nedić et al. (2018) for extensive literature surveys.

While these algorithms are resilient to node or edge failures, they are not, however, inherently resilient to insider-based data injection attacks (see e.g. Gentz et al. (2015); Blanchard et al. (2017); Gentz et al. (2016); Wu et al. (2018); Sundaram and Gharesifard (2018)). This has spurred significant interest in robust decentralized optimization algorithms. Broadly speaking, the literature offers two schools of solutions. In the first set of works, the authors propose methods to regularize the objective by relaxing the hard consensus constraints, typically by using a
TV-norm regularization scheme; see Ben-Ameur et al. (2016); Koppel et al. (2017); Xu et al. (2018). Even though the adversarial environment can be studied similarly to the non-adversarial case, such approaches do not admit the same solutions as the original problem even when no attackers are present at all. Moreover, their performance is highly dependent on the attack scenario, the choice of the regularizing function, and the regularization parameter.

These issues are not shared by the second set of works, where each node builds a score about its neighbors' updates and uses this score to detect their potential malicious intent; see Vuković and Dán (2014); Su and Vaidya (2015a,b); Su and Shahrampour (2018); Sundaram and Gharesifard (2018); Ravi et al. (2019). These scores can then be used by regular nodes to dynamically sever ties with suspicious peers and continue to run the algorithm, see Ravi et al. (2019). As long as the network remains strongly connected and the set of isolated nodes includes all the malicious nodes, the network converges to the optimum solution of the objective restricted to the nodes that are not isolated. This is possible for appropriate peer-to-peer optimization algorithms because they can restore normal operations by isolating the attackers while the algorithm is running.

The literature considers a number of attack models that differ based on the structure of the data injected by the malicious nodes. This paper focuses on attacks that involve coordinated malicious agents that follow the algorithmic protocols while manipulating their private cost functions arbitrarily. We consider a network of nodes that are collectively solving problem (1) by implementing the Push-Sum algorithm proposed in Kempe et al. (2003). Concurrently, the regular nodes maintain a maliciousness score about their neighbors as a function of the local neighborhood data.
The idea is to sever ties with those deemed to be malicious and then continue the updates according to Push-Sum. This approach leverages the self-healing property of the gradient-based Push-Sum algorithm, where nodes dynamically sever ties with those neighbors that are more likely to be malicious.

The paper is organized as follows. In Section 2, the system model and the attack characteristics are described. In Section 3, we introduce the Robust Subgradient-Push algorithm and its consequences. In Section 4, we present some simulation results that demonstrate the effectiveness of the proposed approach. We conclude in Section 5.

Notation: Boldfaced lower-case (resp. upper-case) letters denote vectors (resp. matrices), and x_i (X_ij) denotes the i-th element of a vector x (the ij-th entry of a matrix X). Calligraphic letters are sets, |A| denotes the cardinality of a set A, and the difference between two sets A and B is denoted by A \ B.

2. SYSTEM MODEL

The nodes in the system are modeled by a directed graph G(V, E), where V is the set of nodes and E is the set of edges that represent the communication links between the nodes. To describe the adversarial environment, the set V is partitioned into two subsets: the set V_r of regular nodes (RNs) with N_r := |V_r|, and the set V_m of malicious nodes (MNs) with N_m := |V_m|, so that V = V_r ∪ V_m. Similarly, the edge set E is partitioned as E = E_r ∪ E_m, where E_r is the set of links emanating from RNs, while E_m is the set of links emanating from the MNs. The set E_m is further partitioned as E_m = E_mr ∪ E_mm, where E_mr is the set of links from MNs to RNs and E_mm is the set of links emanating from and ending at MNs. The goal of the RNs is to identify and sever all the edges belonging to E_mr. Let N_i^− be the set of all nodes that can send information to node i and N_i^+ be the set of all nodes that can receive information from node i.
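To make the edge-set bookkeeping concrete, the partition E = E_r ∪ E_m with E_m = E_mr ∪ E_mm can be sketched on a toy directed graph as follows (a minimal illustration; the function and variable names are ours, not the paper's):

```python
# Sketch of the Section 2 system-model bookkeeping on a toy directed graph.
def partition_edges(edges, regular, malicious):
    """Split directed edges (i, j) according to the type of the sending node i."""
    E_r = {(i, j) for (i, j) in edges if i in regular}        # emanating from RNs
    E_m = {(i, j) for (i, j) in edges if i in malicious}      # emanating from MNs
    E_mr = {(i, j) for (i, j) in E_m if j in regular}         # MN -> RN: edges the RNs aim to sever
    E_mm = E_m - E_mr                                         # MN -> MN
    return E_r, E_m, E_mr, E_mm

regular, malicious = {1, 2, 3}, {4}
edges = {(1, 2), (2, 3), (3, 1), (4, 1), (4, 4)}
E_r, E_m, E_mr, E_mm = partition_edges(edges, regular, malicious)
```

Here the single edge (4, 1) lands in E_mr, which is exactly the set of links the regular nodes must identify and cut.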
Problem Statement: In the presence of adversaries in a network, the benchmark for performance is the analogue of problem (1) in which the graph is replaced by the graph formed by the regular nodes only, i.e.,

  minimize_{x_i, i ∈ V_r}  (1/|V_r|) ∑_{i ∈ V_r} f_i(x_i)   s.t.  x_i = x_j, ∀ (i, j) ∈ E_rr,    (2)

where f_i : ℝ^d → ℝ is the private cost/loss function only available at node i and the constraint enforces consensus in the network's neighborhoods. Decentralized optimization algorithms designed for arbitrary network topologies require doubly stochastic weight matrices. Techniques such as Subgradient-Push (SGP) relax this requirement (Nedić et al. (2018)). Further improvements in the convergence rates of algorithms for directed graphs are presented in DEXTRA (Xi and Khan (2017)) and ADD-OPT/Push-DIGing (Xi et al. (2018)). However, these algorithms require each node to know its out-degree (the number of its out-neighbors), which is not a realistic assumption in general, and is even more problematic in an adversarial environment. Before moving on to the characterization of malicious nodes, we introduce a few assumptions.

Assumption 1. (Strong Connectivity) The sequence of graphs {G(t)}, induced by the dynamic in- and out-neighborhoods over time, is strongly connected.

Assumption 2. (Convexity) Each of the functions f_i, ∀ i ∈ V, is convex over ℝ^d.

Assumption 3. (Bounded Subgradients) All the subgradients of each of the functions f_i, ∀ i ∈ V, are bounded. That is, ∃ l_i < ∞ such that ‖g_i‖ ≤ l_i, ∀ i ∈ V, for all subgradients g_i of f_i(x) and for all x ∈ ℝ^d.

Assumption 4. (Malicious nodes) Suppose that the sequence of graphs {G(t)} is (2κ + 1)-strongly connected. Then there exist at most κ adversaries in the whole network, i.e., |V_m| ≤ κ.
This also implies that there exist at most κ adversaries in any node's in-neighborhood, i.e., |N_i^− ∩ V_m| ≤ κ, ∀ i ∈ V.

If Assumptions 1–3 hold and under no interference from malicious nodes, the convergence of the original Gradient-Push was proven in Nedić and Olshevsky (2015), i.e.,

  x_i^∞ ≜ lim_{t→∞} x_i(t) = x* = arg min_{x ∈ ℝ^d} F(x), ∀ i ∈ V.

Let F_r(x) be the cost and x* the solution of (2), i.e.:

  F_r(x) ≜ (1/|V_r|) ∑_{i ∈ V_r} f_i(x),   x* ≜ arg min_x ∑_{i ∈ V_r} f_i(x).    (3)

The performance of any algorithm that is resilient to an attack can be measured using several metrics that capture the deviation of the limiting state from the ideal asymptotic condition in (3), for example:

• Average solution difference:
  ε_p ≜ (1/|V_r|) ∑_{i ∈ V_r} ‖x_i^∞ − x*‖_p    (4)

• Average cost increase:
  ϱ ≜ (1/|V_r|) ∑_{i ∈ V_r} f_i(x_i^∞) − F_r(x*)    (5)

• Average deviation from consensus:
  γ ≜ (1/|V_r|) ∑_{i ∈ V_r} ‖x_i^∞ − x̄^∞‖,   x̄^∞ ≜ (1/|V_r|) ∑_{ℓ ∈ V_r} x_ℓ^∞    (6)

It is well established that the presence of even a single malicious node can lead the convergence of the algorithm astray (see e.g. Figure 1). The type of attack considered in our paper represents the case where the regular nodes' functions f_i(x), i ∈ V_r, have common statistical characteristics. The attacker follows the algorithm as well; however, its intent is to alter the solution, which requires altering the statistics of its function and gradient. For example, this situation arises when the distributed optimization objective is to solve a regression problem where, in normal conditions, the sensors are cooperatively trying to estimate a vector x^o based on the noisy observation model:

  s_i = h_i(x^o) + w_i ∈ ℝ    (7)

for all i ∈ V, where the w_i are independent identically distributed noise samples.
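The resilience metrics (4)–(6) are straightforward to evaluate numerically once the regular nodes' limit points are known. Here is a minimal sketch (our own code, not the paper's), which also checks that all three metrics vanish when every node reaches x* exactly:

```python
import numpy as np

def resilience_metrics(x_inf, x_star, losses, p=2):
    """Metrics (4)-(6). x_inf: (N_r, d) array of limit points x_i^inf;
    x_star: (d,) ideal solution of (2); losses: list of N_r callables f_i."""
    eps = np.mean([np.linalg.norm(x - x_star, ord=p) for x in x_inf])        # (4)
    F_r_star = np.mean([f(x_star) for f in losses])
    rho = np.mean([f(x) for f, x in zip(losses, x_inf)]) - F_r_star          # (5)
    x_bar = x_inf.mean(axis=0)                                               # consensus average
    gamma = np.mean([np.linalg.norm(x - x_bar) for x in x_inf])              # (6)
    return eps, rho, gamma

# Sanity check: with perfect convergence all three metrics are zero.
x_star = np.array([1.0, -1.0])
x_inf = np.tile(x_star, (5, 1))
losses = [lambda x: float(np.sum((x - x_star) ** 2))] * 5
eps, rho, gamma = resilience_metrics(x_inf, x_star, losses)
```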
The regression can be formulated − t ) x i ( t ) Regular NodesMalicious Node(s)OptimalAttacker’s target
Fig. 1. Consensus without attacker detection, with adversaries following the algorithmic protocols but modifying their local cost functions dynamically to reach their intended target (in black). Here, the regular nodes in blue converge to the attacker's (in red) target. In the example, a set of nodes (1–10) are trying to solve problem (1) using standard Subgradient-Push (SGP) in a static network. However, a malicious node drives the network towards its target, x^a.

The regression can be formulated as the minimization of the sum of the f_i(x) = ‖s_i − h_i(x)‖², which amounts to seeking the maximum-likelihood estimate. To be concrete, we will focus on the Gaussian independent identically distributed (i.i.d.) noise case, w_i ∼ N(0, σ²), in which case:

  f_i(x) = ‖s_i − h_i(x)‖²    (8)

and the gradient is:

  g_i(x) = 2∇h_i(x)(s_i − h_i(x)).    (9)

The attack can be caused by a spoofing attack on a set of sensors:

  s_m^a = s_m + δs_m,   m ∈ V_m.    (10)

Under this attack, the gradient of the function f_m^a(x) = ‖s_m^a − h_m(x)‖² is perturbed accordingly:

  g_m^a(x) = 2∇h_m(x)(s_m^a − h_m(x)) = g_m(x) + δg_m(x).    (11)

Under such an attack, a strongly connected network will converge to consensus on the following solution:

  x^a = arg min_x ( (|V_r|/|V|) F_r(x) + (1/|V|) ∑_{i ∈ V_m} f_i^a(x) )    (12)

with average deviation from consensus γ = 0, and average solution difference and cost deviation, respectively:

  ε_p = ‖x^a − x*‖_p,   ϱ = F_r(x^a) − F_r(x*).    (13)

In the context of this type of attack, the increase of the average error relative to the latent parameter x^o also measures the damage caused by the malicious agents, that is, for instance:

  ξ_p ≜ ( (1/|V|) ∑_{i ∈ V} ‖x_i^∞ − x^o‖_p ) / ‖x* − x^o‖_p,    (14)

where significant damage corresponds to a large value of ξ_p in the presence of an attack; since, when the attack is fully successful, there is consensus on x^a, the value of ξ_p is:

  ξ_p = ‖x^a − x^o‖_p / ‖x* − x^o‖_p.    (15)

Of course, the quantity that we consider as a benchmark, ‖x* − x^o‖_p, itself represents a loss relative to the case where ¹

¹ Note that in this case, given that the noise terms are not bounded, the
condition stated as Assumption 3 (bounded subgradients) can only be guaranteed with high probability. To avoid complications in the discussion, we will discuss convergence as if the condition were met.

no sensor is attacked, as more observations improve the estimation error performance.

In our previous work (Ravi et al. (2019)) we noted that staging a sufficiently strong attack in terms of ε_p requires either sufficiently large gradients for the malicious nodes or a special choice for the target point x^a. In fact, since x^a is the stationary point of the problem in (12), we have

  ∑_{i ∈ V_r} g_i(x^a) + ∑_{i ∈ V_m} g_i^a(x^a) = 0.

Let H_i(x) be the Hessian of the function f_i(x) and H_r(x) the Hessian of the function F_r(x); then:

  ∑_{i ∈ V_m} g_i^a(x^a) ≈ −∑_{i ∈ V_r} [ g_i(x*) + H_i(x*)(x^a − x*) ]    (16)
  = −H_r(x*)(x^a − x*).    (17)

The last equation implies that the most effective attacks can be staged when the malicious nodes can be selected so that the Hessian H_r(x*) is singular, compromising the convergence of the algorithm even if the malicious nodes were isolated and the network of regular nodes remained strongly connected. This requires us to state the following assumption:

Assumption 5.
The Hessian H_r(x*) of F_r(x) is positive definite.

Proposition 1. (Ravi et al. (2019)). Under Assumption 5, the nodes achieve consensus (i.e., γ = 0), and the average solution difference is such that:

  ‖ ∑_{i ∈ V_m} g_i(x^a) ‖ ≳ ε λ_min(H_r(x*)).    (18)

3. ROBUST SUBGRADIENT PUSH

The basic idea of this paper is to leverage the structure of the problem, as well as the need of the malicious agents to use large gradients to influence the result, as made apparent by equation (18). We wish to use this fact to design a score S_ij(t) that each node i updates ∀ j ∈ N_i and uses to detect, and then reject updates from, a specific node. In this paper we propose a modified version of Subgradient-Push (Nedić and Olshevsky (2015)) that is based on the original Push-Sum algorithm (Kempe et al. (2003)); see Algorithm 1. This algorithm differs from the original Subgradient-Push in that at all times t >
0, the out-degree of every node is 2, i.e., instead of a broadcast communication model, each node chooses one of its out-neighbors uniformly at random and sends its information to that node and to itself. This eliminates the necessity of each node knowing its out-degree, the trade-off being a slower convergence rate. The algorithm converges even when the functions are not strongly convex, which is the typical scenario of interest in a network of sensors that each collect a scalar measurement while the goal is to estimate a d-dimensional vector parameter x^o. We name it Robust Subgradient-Push (RSGP). In RSGP, each node i maintains a set of variables, z_i(t), v_i(t) ∈ ℝ^d and y_i(t) ∈ ℝ, at all times t ≥
0. The algorithm starts with an arbitrarily initialized z_i(0) ∈ ℝ^d and with y_i(0) = 1 at each node i, and is described in Algorithm 1. Here, g_i(t) = ∇f_i(x_i(t)) is a subgradient of f_i(·) calculated at x_i(t), and the step sizes satisfy

  η_t > 0,   ∑_t η_t = ∞,   ∑_t η_t² < ∞.    (23)

Algorithm 1 Robust Subgradient-Push (RSGP)
  Initialize: z_i(0) ∈ ℝ^d arbitrarily, y_i(0) = 1, S̃_ij(0) = 0, ∀ j ∈ N_i^−
  Ensure: {η_t} satisfies condition (23) and α < 1
  for t ≥ 0 do
    for i ∈ V do
      Let {z_j(t), y_j(t)}, ∀ j ∈ N_i^−(t) ⊆ N_i^−, be the pairs of data sent to node i at time t
      Send {z_i(t), y_i(t)} to i itself and to a node k ∈ N_i^+ chosen uniformly at random; N_i^+(t) = {i, k}
      Update v_i, y_i, x_i and z_i as follows:
        v_i(t+1) = (1/2) ∑_{j ∈ N_i^−(t)} z_j(t)    (19)
        y_i(t+1) = (1/2) ∑_{j ∈ N_i^−(t)} y_j(t)    (20)
        x_i(t+1) = v_i(t+1) / y_i(t+1)    (21)
        z_i(t+1) = v_i(t+1) − η_{t+1} g_i(t+1)    (22)
      if i ∈ V_r then
        for all j ∈ N_i^− do
          Calculate the score S̃_ij(t) from equation (31)
        Calculate χ_i(t) from equation (32)
        for all j ∈ N_i^− do
          if S̃_ij(t) > χ_i(t) then
            N_i^− ← N_i^− \ {j} and N_i^+ ← N_i^+ \ {j}

Crucial to RSGP is the definition of the score that shall reveal the malicious nodes. We note that each node receives its neighbors' z_i(t) and y_i(t) state variables when they communicate. The variable y_i(t) of node i is a function of the number of instantaneous in-neighbors of i at any particular time instant. Importantly, it does not depend on node i's gradient, which is what the sensor spoofing attack manipulates. The iterative update of the variable z_i(t), however, includes a gradient step. Thus, we define a maliciousness score S_ij(t) that can be maintained by each regular node i about its neighbors N_i^−. As the nodes approach consensus, their direction of descent is typically dominated by the malicious nodes' gradients.
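One synchronous round of the updates (19)–(22) can be sketched as follows. This is an illustrative reimplementation under our own naming (not the paper's code); each node halves its state because it sends one copy to itself and one to a single random out-neighbor:

```python
import numpy as np

def rsgp_round(z, y, in_neighbors, grads, eta):
    """One round of updates (19)-(22).
    z: dict node -> state in R^d;  y: dict node -> push-sum weight;
    in_neighbors: dict node -> senders at this round (each node includes itself);
    grads: dict node -> callable returning a subgradient at the mixed iterate;
    eta: step size from a sequence satisfying (23)."""
    v_new, y_new, x_new, z_new = {}, {}, {}, {}
    for i in z:
        v_new[i] = 0.5 * sum(z[j] for j in in_neighbors[i])     # (19)
        y_new[i] = 0.5 * sum(y[j] for j in in_neighbors[i])     # (20)
        x_new[i] = v_new[i] / y_new[i]                          # (21) de-biased iterate
        z_new[i] = v_new[i] - eta * grads[i](x_new[i])          # (22) subgradient step
    return z_new, y_new, x_new

# Two nodes that happen to pick each other (plus the mandatory self-loop),
# with zero gradients: one round already averages their states.
z = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 2.0])}
y = {0: 1.0, 1: 1.0}
nbrs = {0: [0, 1], 1: [0, 1]}
zero = {i: (lambda x: np.zeros_like(x)) for i in z}
z2, y2, x2 = rsgp_round(z, y, nbrs, zero, eta=0.1)
```

Note how the y-updates track only the mixing, never the gradients, which is why the score below is built from the z/y ratios.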
The intuition is that node i can use

  x̂_j(t) := z_j(t) / y_j(t)    (24)

to track neighbor j's instantaneous gradient. From the bound in equation (18), we hypothesize that the malicious nodes will appear as outliers relative to the rest of the nodes in the neighborhood, as long as they are sufficiently outnumbered. The instantaneous score we propose is:

  S_ij(t) ≜ (1/η_t) ‖ ∑_{ℓ ∈ N_i^−(t) \ j} ( x̂_j(t) − x̂_ℓ(t) ) ‖.    (25)

To gain some intuition, suppose the nodes are approaching consensus. At this point the values of y_i(t) are likely to have converged as well, and x_j(t) = v_j(t)/y_j(t) ≈ x^a. In this case:

  x̂_j(t+1) − x̂_ℓ(t+1) = v_j(t)/y_j(t) − v_ℓ(t)/y_ℓ(t) − η_t ( g_j(t) − g_ℓ(t) ) ≈ η_t ( g_ℓ(t) − g_j(t) ),    (26)

which indicates that the score is essentially comparing the disparity in the gradients:

  S_ij(t) ≈ ‖ ∑_{ℓ ∈ N_i^−(t) \ j} ( g_ℓ(t) − g_j(t) ) ‖    (27)
  = ‖ ∑_{ℓ ∈ N_i^−(t) \ j} g_ℓ(t) − ( d_i(t) − 1 ) g_j(t) ‖.    (28)

Let us consider for simplicity the presence of only one malicious agent in the i-th node's neighborhood. In the following, we denote node i's degree by d_i(t) = |N_i(t)|. The last equation, together with the observation in Proposition 1, implies that the score for a malicious neighbor m, for t sufficiently large, is approximately

  S_im(t) ≈ ( d_i(t) − 1 ) ‖ g_m(t) ‖.
(29)

On the other hand, expanding the norm squared in (28), it is not difficult to show that:

  (1/d_i(t)) ∑_{j ∈ N_i(t)} S_ij²(t) = d_i(t) ∑_{j ∈ N_i(t)} ‖g_j(t)‖² − ( 1 − 2/d_i(t) ) ‖ ∑_{j ∈ N_i(t)} g_j(t) ‖² ⪅ d_i(t) ‖g_m(t)‖²,    (30)

which suggests that, as long as (d_i(t) − 1)² > d_i(t) (which happens as long as d_i(t) ≥
4), the malicious agent's score will tend to dominate the average of the scores in the neighborhood. Considering the randomness present in the protocol iterations, the score used to detect the agents is averaged over time. In fact, RSGP uses a cumulative score given by:

  S̃_ij(t) = ∑_{τ=1}^{t} α^{t−τ} S_ij(τ) = α S̃_ij(t−1) + S_ij(t),    (31)

where 0 < α <
1. The weighted average is designed in such a manner that older instantaneous scores are given lower weights than newer ones, for which (26) is more accurate. Regular nodes can then dynamically isolate those neighbors whose score crosses a certain threshold, χ_i(t). Let S̃_i(t) be the vector of scores at node i. In our implementation we choose χ_i(t) as follows:

  χ_i(t) = avg( S̃_i(t) ) + β · std( S̃_i(t) ),    (32)

where avg(a) and std(a) are the sample average and sample standard deviation of the entries of a vector a. The parameter β controls the aggressiveness of the edge-severing strategy. It is important to note that, as a result of edge severing, and depending on the aggressiveness of the isolation strategy, some regular nodes might also get isolated from the rest of the regular nodes. This may happen as a consequence of a regular node being slow in isolating neighboring malicious node(s). In another scenario, the algorithm might give rise to a splintering of the network into multiple connected subgraphs G_ℓ, thereby leading to polarities in the final convergence points x_ℓ^∞ of such subgraphs. Ideally, RSGP at the optimum value of β separates G into G_r, made up solely of the regular nodes, and, if the malicious nodes are coordinating (which they are in this paper), into G_m, made up solely of the malicious nodes. If the malicious nodes are not cooperating, then we may see further splintering in G_m. The various scenarios and their error plots are discussed in Section 4.

4. NUMERICAL RESULTS WITH A CASE STUDY

To illustrate our approach, let us consider a parameter estimation problem where the observation model is given by s_i = h_i^T x^o + w_i, where h_i ∈ ℝ^d and w_i ∈ ℝ represents i.i.d. noise samples drawn from a normal distribution with mean zero and variance σ_i², for all i ∈ V. Then the local linear least-squares loss function may be written as:

  f_i(x) = ( h_i^T x − s_i )².
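The score-and-threshold machinery of Section 3, i.e., equations (25), (31) and (32), can be sketched as follows. This is an illustrative implementation under our own naming, not the paper's code; it shows a neighbor with an outlying gradient footprint being flagged:

```python
import numpy as np

def instantaneous_score(x_hat, j, eta):
    """Score (25) for neighbor j, given x_hat: dict of neighbor estimates z_l/y_l."""
    diffs = sum(x_hat[j] - x_hat[l] for l in x_hat if l != j)
    return np.linalg.norm(diffs) / eta

def update_cumulative(S_tilde, S_inst, alpha):
    """Discounted running score (31): older scores get geometrically lower weight."""
    return {j: alpha * S_tilde.get(j, 0.0) + S_inst[j] for j in S_inst}

def flagged(S_tilde, beta):
    """Neighbors whose score exceeds the threshold chi = avg + beta*std of (32)."""
    vals = np.array(list(S_tilde.values()))
    chi = vals.mean() + beta * vals.std()
    return {j for j, s in S_tilde.items() if s > chi}

# Three honest neighbors cluster near the consensus point; neighbor 4 is an outlier.
x_hat = {1: np.array([0.0]), 2: np.array([0.1]), 3: np.array([-0.1]), 4: np.array([5.0])}
S = {j: instantaneous_score(x_hat, j, eta=1.0) for j in x_hat}
S_tilde = update_cumulative({}, S, alpha=0.5)
suspects = flagged(S_tilde, beta=1.0)
```

As in the paper, β tunes how aggressively edges are severed: a small β would also flag some honest neighbors, while a very large β flags none.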
The global cost function is F(x) = (1/|V|) ∑_{i ∈ V} f_i(x) when all the nodes are behaving regularly. However, when certain nodes are attacked or are malicious, the regular nodes estimate x* by solving the following problem:

  min_x F_r(x) = min_x (1/|V_r|) ∑_{i ∈ V_r} ( h_i^T x − s_i )².

Let us assume without loss of generality that V_r = {1, . . . , N_r} and define the matrix A_r^T = [h_1, . . . , h_{N_r}]. Denoting by A_r† := ( A_r^T A_r )^{−1} A_r^T the pseudoinverse of the matrix A_r, the minimizer of this problem is known in closed form and given by

  x* = A_r† s_r = x^o + ( A_r^T A_r )^{−1} A_r^T w_r,

where A_r^T A_r is the Hessian, which under Assumption 5 is full rank. The cost is:

  F_r(x*) = ‖ ( I − A_r A_r† ) s_r ‖² = ‖ ( I − A_r A_r† ) w_r ‖².    (33)

Now, let x_i^∞, ∀ i ∈ V_r, be the points at which the regular nodes converge at the end of RSGP. The average solution error with respect to x* is given by ε_p from equation (4). The average cost increase of RSGP is given by ϱ from equation (5). We can also define a regret in the errors as ξ_p from equation (14), which compares the error under attack with the error without an attack. We restate the terms for ease of reading:

  ε_p ≜ (1/N_r) ∑_{i ∈ V_r} ‖x_i^∞ − x*‖_p,   ϱ ≜ (1/N_r) ∑_{i ∈ V_r} ( f_i(x_i^∞) − f_i(x*) ),
  γ_p ≜ (1/N_r) ∑_{i ∈ V_r} ‖x_i^∞ − x̄^∞‖_p,   ξ_p ≜ ( (1/N_r) ∑_{i ∈ V_r} ‖x_i^∞ − x^o‖_p ) / ‖x* − x^o‖_p,

where x̄^∞ ≜ (1/|V_r|) ∑_{ℓ ∈ V_r} x_ℓ^∞ and p = 2.

Figure 2 shows the performance of RSGP and compares it with the methods proposed in Ben-Ameur et al. (2016) and Sundaram and Gharesifard (2018). The system setup is as follows. We consider multiple realizations of Erdős–Rényi networks with N = 20 nodes and p = 3 log(N)/N.
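The closed form above is easy to verify numerically. The sketch below (synthetic data of our own making, not the paper's experiment code) checks that the pseudoinverse solution agrees with the normal equations and that the residual cost (33) depends only on the noise:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N_r = 2, 17
x_o = rng.normal(size=d)                 # latent parameter
A_r = rng.normal(size=(N_r, d))          # rows are the regular nodes' h_i^T
w_r = rng.normal(scale=0.1, size=N_r)    # i.i.d. Gaussian noise
s_r = A_r @ x_o + w_r                    # observation model

# Least-squares benchmark x* = A_r^dagger s_r ...
x_star = np.linalg.pinv(A_r) @ s_r
# ... which agrees with the explicit normal-equations form (A^T A)^{-1} A^T s.
x_ne = np.linalg.solve(A_r.T @ A_r, A_r.T @ s_r)

# Residual s_r - A_r x* = (I - A_r A_r^dagger) w_r: the cost (33) is noise-only.
resid = s_r - A_r @ x_star
```

With full column rank (which Assumption 5 guarantees), x* differs from x^o only by the filtered noise term (A_r^T A_r)^{−1} A_r^T w_r.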
Without loss of generality, the node sets V_r and V_m (the latter containing three malicious nodes) are kept the same for every Monte Carlo trial, but the edge set E is changed. The vector z_i ∼ N(0, I) ∈ ℝ^d (d = 2) is drawn randomly at the start of each Monte Carlo trial for all i ∈ V.

Fig. 2. (a) Erdős–Rényi network under consideration. (b) Convergence to the optimal consensus point x*. (c) Average residual as a function of β; ε_s is the solution difference for the algorithm in Sundaram and Gharesifard (2018). (d) Average residual as a function of λ for the method in Ben-Ameur et al. (2016).

The following parameters/system variables remain the same for all the Monte Carlo trials: x^o ∼ N(0, I) ∈ ℝ^d, h_i ∼ N(0, σ² I) ∈ ℝ^d, s_i = h_i^T x^o + w_i, w_i ∼ N(0, σ²), ∀ i ∈ V_r. Note that x^o was initialized once to a fixed value, and the malicious nodes (i ∈ V_m) alter their h_i based on the average (1/N_r) 1^T A_r and their s_i as (1/N_r) 1^T s_r + 5. One such realization is shown in Figure 2a. In this network, the nodes in red are attacking the network by altering the parameters of their loss functions f_i, ∀ i ∈ V_m. The regular nodes keep track of the metric in equation (31) over time. Each node isolates those neighbors that exceed the threshold in equation (32) with β = 1.
5. Figure 2b shows the convergence of RSGP. At time t = 401 (indicated in magenta), all the nodes with malicious node(s) in their neighborhood have successfully isolated the malicious agents. We then see the self-healing property of the distributed consensus algorithm come into play, and the regular nodes converge to the optimal consensus point in blue.

In Figure 2c, we plot ε, ϱ, ξ, and γ_p as functions of β, averaged over multiple Monte Carlo trials for realizations of G. We see that at lower values of β, where nodes sever ties rather aggressively, the network is broken into multiple strongly connected subgraphs or into a completely disconnected network. This results in errors greater than zero. At intermediate values of β, where the regular nodes sever ties in a relatively passive manner, RSGP isolates almost all, if not all, of the malicious nodes in the network, and thus we see a decrease in the errors. At the optimum choice for β (an intermediate interval in this setup), the RSGP algorithm almost certainly disconnects the malicious nodes from the regular nodes and thus the errors reduce to zero. However, as β increases past this region, the nodes become highly conservative in their cutting, leaving almost all the malicious nodes still connected to the network. Thus, we see an increase in the errors and, after a certain value of β, when no edge cutting takes place, the errors saturate. The solution difference for the method in Sundaram and Gharesifard (2018) is plotted as ε_s. In this algorithm, nodes communicate with their neighbors at all time steps, but only use those neighbors' data which are not among the extremes in the sorted set of neighborhood data points. Notice here that the malicious nodes might persist in the filtered neighborhood of regular nodes over time. Thus, this algorithm can only guarantee convergence in the convex hull of the minimizers of the regular nodes' private functions.

In Figure 2d, we plot the errors produced by the algorithm proposed in (Ben-Ameur et al., 2016, Algorithm 1), where the authors replace the consensus constraint in problem (1) with a TV-norm regularizer added to the objective, thereby penalizing nodes for not being in consensus with their neighbors. The problem is given by:

  min_{x_i ∈ ℝ^d} (1/|V|) ∑_{i ∈ V} f_i(x_i) + λ ∑_{(i,j) ∈ E} ‖x_i − x_j‖,    (34)

where λ is the regularization parameter, which controls the magnitude of the penalty. If λ = 0, consensus among the nodes is not enforced, and each node i converges to the minimizer of its own f_i, ∀ i ∈ V. As λ increases past zero, the consensus constraint is enforced, indicated by γ reaching zero. While this method is successful in imposing consensus among the regular nodes, it cannot, unlike RSGP, reduce ε_p to zero.

5. CONCLUSION

A robust decentralized optimization algorithm (RSGP), resilient to malicious Byzantine agents, has been proposed. This algorithm forgoes the need for knowledge of out-degrees and works even for non-strongly convex loss functions. The algorithm dynamically isolates malicious nodes in the system, thereby leading the system to convergence at the optimum consensus point. The isolation strategy is local to each node, is independent of the network topology and the attack strategy, and leverages the structure of the regular nodes in the system.

REFERENCES

Ben-Ameur, W., Bianchi, P., and Jakubowicz, J. (2016). Robust distributed consensus using total variation. IEEE Transactions on Automatic Control. doi:10.1109/TAC.2015.2471755.
Blanchard, P., Guerraoui, R., Stainer, J., et al. (2017). Machine learning with adversaries: Byzantine tolerant gradient descent. In
Advances in Neural Information Processing Systems, 119–129.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122. doi:10.1561/2200000016.
Gentz, R., Wu, S.X., Wai, H.T., Scaglione, A., and Leshem, A. (2016). Data injection attacks in randomized gossiping. IEEE Transactions on Signal and Information Processing over Networks, PP(99), 1. doi:10.1109/TSIPN.2016.2614898.
Gentz, R., Wai, H.T., Scaglione, A., and Leshem, A. (2015). Detection of data injection attacks in decentralized learning, 350–354. IEEE.
Kempe, D., Dobra, A., and Gehrke, J. (2003). Gossip-based computation of aggregate information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, 482–491.
Koppel, A., Sadler, B.M., and Ribeiro, A. (2017). Proximity without consensus in online multiagent optimization. IEEE Transactions on Signal Processing, 65(12), 3062–3077. doi:10.1109/TSP.2017.2686368.
Nedić, A., Olshevsky, A., and Rabbat, M. (2018). Network topology and communication-computation trade-offs in decentralized optimization. Proceedings of the IEEE, 106, 953–976.
Nedić, A. and Olshevsky, A. (2015). Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3), 601–615.
Nedić, A. and Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1), 48–61.
Ravi, N., Scaglione, A., and Nedić, A. (2019). A case of distributed optimization in adversarial environment. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5252–5256. doi:10.1109/ICASSP.2019.8683442.
Su, L. and Shahrampour, S. (2018). Finite-time guarantees for Byzantine-resilient distributed state estimation with noisy measurements. CoRR, abs/1810.10086.
Su, L. and Vaidya, N. (2015a). Byzantine multi-agent optimization: Part I. CoRR, abs/1506.04681.
Su, L. and Vaidya, N.H. (2015b). Fault-tolerant distributed optimization (part IV): Constrained optimization with arbitrary directed networks. CoRR, abs/1511.01821.
Sundaram, S. and Gharesifard, B. (2018). Distributed optimization under adversarial nodes. IEEE Transactions on Automatic Control. doi:10.1109/TAC.2018.2836919.
Tsitsiklis, J., Bertsekas, D., and Athans, M. (1986). Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9), 803–812.
Vuković, O. and Dán, G. (2014). Security of fully distributed power system state estimation: Detection and mitigation of data integrity attacks. IEEE Journal on Selected Areas in Communications, 32(7), 1500–1508.
Wu, X., Wai, H.T., Scaglione, A., Leshem, A., and Nedić, A. (2018). Data injection attack on decentralized optimization. In International Conference on Acoustics, Speech and Signal Processing. IEEE.
Xi, C. and Khan, U.A. (2017). DEXTRA: A fast algorithm for optimization over directed graphs. IEEE Transactions on Automatic Control, 62(10), 4980–4993.
Xi, C., Xin, R., and Khan, U.A. (2018). ADD-OPT: Accelerated distributed directed optimization. IEEE Transactions on Automatic Control, 63(5), 1329–1339.
Xu, W., Li, Z., and Ling, Q. (2018). Robust decentralized dynamic optimization at presence of malfunctioning agents.