Approximate Byzantine Fault-Tolerance in Distributed Optimization
Shuo Liu$^\star$, Nirupam Gupta$^\dagger$, Nitin H. Vaidya$^\dagger$
{$^\star$sl1539, $^\dagger$firstname.lastname}@georgetown.edu
Department of Computer Science, Georgetown University, Washington DC, USA

Abstract
This paper studies the problem of Byzantine fault-tolerance in distributed multi-agent optimization. In this problem, each agent has a local cost function, and, in the fault-free case, the objective is to design a distributed algorithm that allows all the agents to find a minimum point of all the agents' aggregate cost function. We consider a scenario where a certain number of agents might be Byzantine faulty, i.e., these agents may not follow a prescribed algorithm and may share arbitrary information regarding their local cost functions. In the presence of such faulty agents, it is generally impossible to find a minimum point of all the agents' aggregate cost function. A more reasonable goal, however, is to design an algorithm that allows all the non-faulty agents to compute, either exactly or approximately, the minimum point of only the non-faulty agents' aggregate cost function.

From prior work we know that in the presence of up to $f$ (out of $n$) Byzantine faulty agents, a deterministic algorithm can compute a minimum point of the non-faulty agents' aggregate cost exactly if and only if the non-faulty agents' cost functions satisfy a certain redundancy property named $2f$-redundancy [20]. However, the $2f$-redundancy property can only be guaranteed in ideal systems free from noises (or uncertainties), and therefore the objective of exact fault-tolerance is unsuitable for many practical settings that inevitably suffer from noises. In this paper, we consider the problem of approximate fault-tolerance, a generalization of exact fault-tolerance where the goal is to only compute an approximation of a minimum point of the non-faulty agents' aggregate cost function. Upon defining approximate fault-tolerance as $(f, \epsilon)$-resilience, where $\epsilon$ is the approximation error, we show that it can be achieved under a weaker redundancy condition than $2f$-redundancy. We present necessary and sufficient conditions for achieving $(f, \epsilon)$-resilience in a synchronous distributed system with server-based architecture. Then, we consider a special case when the agents' cost functions are differentiable. Here, we analyse the approximate fault-tolerance of the distributed gradient-descent method, a prominent distributed optimization algorithm in this particular case, when equipped with a gradient-filter (or robust gradient aggregation), such as comparative gradient elimination (CGE) or coordinate-wise trimmed mean (CWTM).

1 Introduction
The problem of distributed optimization in multi-agent systems has gained significant attention in recent years [7, 25, 16]. In this problem, each agent has a local cost function and, when the agents are fault-free, the goal is to design algorithms that allow the agents to collectively minimize the aggregate of their cost functions. To be precise, suppose that there are $n$ agents in the system and let $Q_i(x)$ denote the local cost function of agent $i$, where $x$ is a $d$-dimensional vector of real values, i.e., $x \in \mathbb{R}^d$. A traditional distributed optimization algorithm outputs a global minimum $x^*$ such that

$$x^* \in \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} Q_i(x). \tag{1}$$

As a simple example, $Q_i(x)$ may denote the cost for agent $i$ (which may be a robot or a person) to travel to location $x$ from their current location, and $x^*$ is then a location that minimizes the total cost of meeting for all the agents. Such multi-agent optimization is of interest in many practical applications, including distributed machine learning [7], swarm robotics [29], and distributed sensing [28].

We consider the distributed optimization problem in the presence of up to $f$ Byzantine faulty agents, originally introduced by Su and Vaidya [32]. The Byzantine faulty agents may behave arbitrarily [22]. In particular, the faulty agents may share arbitrary incorrect and inconsistent information in order to bias the output of a distributed optimization algorithm. For example, consider an application of multi-agent optimization in the case of distributed sensing where the agents (or sensors) observe a common object in order to collectively identify the object. The faulty agents may send arbitrary observations concocted to prevent the non-faulty agents from making the correct identification [11, 13, 26, 33]. Similarly, in the case of distributed learning, which is another application of distributed optimization, the faulty agents may send incorrect information based on mislabelled or arbitrarily concocted data points to prevent the non-faulty agents from learning a good classifier [1, 2, 4, 9, 10, 12, 19, 35].

1.1 Exact Fault-Tolerance
In the exact fault-tolerance problem, the goal is to design a distributed algorithm that allows all the non-faulty agents to compute a minimum point of the aggregate cost of only the non-faulty agents [20]. Specifically, suppose that in a given execution, $\mathcal{B}$ with $|\mathcal{B}| \leq f$ is the set of Byzantine agents, where $|\cdot|$ denotes set cardinality, and $\mathcal{H} = \{1, \ldots, n\} \setminus \mathcal{B}$ denotes the set of non-faulty (i.e., honest) agents. Then, a distributed optimization algorithm has exact fault-tolerance if it outputs a point $x^*_{\mathcal{H}}$ such that

$$x^*_{\mathcal{H}} \in \arg\min_{x \in \mathbb{R}^d} \sum_{i \in \mathcal{H}} Q_i(x). \tag{2}$$

However, since the identity of the Byzantine agents is a priori unknown, exact fault-tolerance is in general unachievable [32]. Specifically, it is shown in [20, 21] that exact fault-tolerance can be achieved if and only if the agents' cost functions satisfy the $2f$-redundancy property defined below.

Definition 1 ($2f$-redundancy). The agents' cost functions are said to have the $2f$-redundancy property if and only if for every pair of subsets $S, \hat{S} \subseteq \{1, \ldots, n\}$ with $|S| = n - f$, $|\hat{S}| \geq n - 2f$ and $\hat{S} \subseteq S$,

$$\arg\min_{x \in \mathbb{R}^d} \sum_{i \in \hat{S}} Q_i(x) = \arg\min_{x \in \mathbb{R}^d} \sum_{i \in S} Q_i(x).$$

In principle, the $2f$-redundancy property can be realized by design for many applications of multi-agent distributed optimization, including distributed sensing and distributed learning (see [18, 20]). However, practical realization of $2f$-redundancy can be difficult due to the presence of noise in real-world systems. This motivates us to consider a generalization of exact fault-tolerance, namely the problem of approximate fault-tolerance, which is described and defined as follows.

1.2 Approximate Fault-Tolerance

Unlike exact fault-tolerance, in approximate fault-tolerance it is acceptable for an algorithm to output an approximate minimum point of the non-faulty aggregate cost function. As stated below, we formally define approximate fault-tolerance by $(f, \epsilon)$-resilience, where $\epsilon \in \mathbb{R}_{\geq 0}$ is the measure of approximation. Recall that the Euclidean distance between a point $x$ and a non-empty set $X$ in the space $\mathbb{R}^d$, denoted by ${\rm dist}(x, X)$, is defined to be

$${\rm dist}(x, X) = \inf_{y \in X} \| x - y \| \tag{3}$$

where $\|\cdot\|$ denotes the Euclidean norm.

Definition 2 ($(f, \epsilon)$-resilience). A distributed optimization algorithm is said to be $(f, \epsilon)$-resilient if it outputs a point $\hat{x} \in \mathbb{R}^d$ such that for every subset $S$ of non-faulty agents with $|S| = n - f$,

$${\rm dist}\left(\hat{x}, \; \arg\min_{x \in \mathbb{R}^d} \sum_{i \in S} Q_i(x)\right) \leq \epsilon,$$

despite the presence of up to $f$ Byzantine agents.
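For intuition, Definition 2 can be checked numerically when the aggregate minimum points are easy to compute. The following Python sketch is a minimal, hypothetical illustration (the scalar quadratic costs and all names are ours, not from the paper): it verifies whether a candidate output $\hat{x}$ is within $\epsilon$ of the minimum point of every $(n-f)$-sized subset of non-faulty agents.

import numpy as np
from itertools import combinations

# Hypothetical example: scalar quadratic costs Q_i(x) = (x - a[i])^2, so the
# aggregate over a set S is uniquely minimized at the mean of {a[i], i in S}.
n, f = 6, 1
a = np.array([0.0, 0.2, 0.9, 1.1, 1.3, 1.6])
honest = range(n)  # suppose, in this execution, every agent is non-faulty

def satisfies_definition_2(x_hat, eps):
    # Definition 2: dist(x_hat, argmin sum_{i in S} Q_i) <= eps for every
    # subset S of non-faulty agents with |S| = n - f.
    for S in combinations(honest, n - f):
        x_S = a[list(S)].mean()  # the unique minimum point for this S
        if abs(x_hat - x_S) > eps:
            return False
    return True

print(satisfies_definition_2(x_hat=0.85, eps=0.5))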
Alternately, a distributed optimization algorithm is $(f, \epsilon)$-resilient if it outputs a point within $\epsilon$ distance of a minimum point of the aggregate cost function of at least $n - f$ non-faulty agents. As there can be at most $f$ Byzantine faulty agents whose identity remains unknown, the two scenarios where (1) there are exactly $f$ Byzantine agents, and (2) there are fewer than $f$ Byzantine agents, are indistinguishable. Thus, estimating the minimum point of the aggregate cost function of $n - f$ non-faulty agents is indeed a reasonable goal [32].

In this paper, we consider deterministic algorithms which, given a fixed set of inputs from the server and the agents, always output the same point in $\mathbb{R}^d$. Thus, a deterministic $(f, \epsilon)$-resilient algorithm produces a unique output point in all of its executions with identical inputs from the server and all the agents (including the faulty agents). Note that in the deterministic framework, exact fault-tolerance is equivalent to $(f, 0)$-resilience (see the Appendix). We show that $(f, \epsilon)$-resilience requires a weaker redundancy condition, in comparison to $2f$-redundancy, named $(2f, \epsilon)$-redundancy, which is defined as follows. Recall that the Euclidean Hausdorff distance between two sets $X$ and $Y$ in $\mathbb{R}^d$, which we denote by ${\rm dist}(X, Y)$, is defined to be [24]

$${\rm dist}(X, Y) \triangleq \max\left\{ \sup_{x \in X} {\rm dist}(x, Y), \; \sup_{y \in Y} {\rm dist}(y, X) \right\}. \tag{4}$$

Definition 3 ($(2f, \epsilon)$-redundancy). The agents' cost functions are said to have the $(2f, \epsilon)$-redundancy property if and only if for every pair of subsets $S, \hat{S} \subseteq \{1, \ldots, n\}$ with $|S| = n - f$, $|\hat{S}| \geq n - 2f$ and $\hat{S} \subseteq S$,

$${\rm dist}\left(\arg\min_{x \in \mathbb{R}^d} \sum_{i \in S} Q_i(x), \; \arg\min_{x \in \mathbb{R}^d} \sum_{i \in \hat{S}} Q_i(x)\right) \leq \epsilon. \tag{5}$$
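When every aggregate in (5) has a unique minimum point, the Hausdorff distance reduces to an ordinary distance between points, and the smallest $\epsilon$ for which Definition 3 holds can be found by enumeration. A minimal Python sketch, again assuming hypothetical scalar quadratic costs (not data from the paper):

import numpy as np
from itertools import combinations

# Hypothetical costs Q_i(x) = (x - a[i])^2: the argmin over any set S is the
# singleton {mean(a_S)}, so the Hausdorff distance in (5) is a point distance.
n, f = 6, 1
a = np.array([0.0, 0.2, 0.9, 1.1, 1.3, 1.6])

def smallest_redundancy_eps():
    eps = 0.0
    for S in combinations(range(n), n - f):
        x_S = a[list(S)].mean()
        # Definition 3 ranges over subsets S_hat of S with |S_hat| >= n - 2f;
        # |S_hat| = n - f forces S_hat = S (distance 0), so skip that size.
        for size in range(n - 2 * f, n - f):
            for S_hat in combinations(S, size):
                eps = max(eps, abs(x_S - a[list(S_hat)].mean()))
    return eps

print(smallest_redundancy_eps())  # the costs satisfy (2f, eps)-redundancy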
Upon comparing Definitions 1 and 3, it is easy to see that $2f$-redundancy is equivalent to $(2f, 0)$-redundancy. Also, note that $2f$-redundancy implies $(2f, \epsilon)$-redundancy for all $\epsilon \geq 0$; however, the converse need not be true. Thus, the $(2f, \epsilon)$-redundancy property with $\epsilon > 0$ is weaker than $2f$-redundancy.

In the first part of the paper, i.e., Section 3, we show two key results:

• $(f, \epsilon)$-resilience is feasible only if the $(2f, \epsilon)$-redundancy property holds true.

• If the $(2f, \epsilon)$-redundancy property holds true then $(f, 2\epsilon)$-resilience is achievable.

As $(f, 0)$-resilience is equivalent to exact fault-tolerance, and $(2f, 0)$-redundancy to $2f$-redundancy, the above results generalize the known result that exact fault-tolerance is feasible if and only if the $2f$-redundancy condition holds true (see [20, 21]).

In the second part, i.e., Sections 4 and 5, we consider the case when the agents' cost functions are assumed differentiable, as is common in machine learning [6] and in state estimation or regression [18, 31, 33]. Here, we study the problem of approximate fault-tolerance of the distributed gradient-descent (DGD) method, an iterative distributed optimization algorithm commonly used in this particular case.

• In Section 4, we present a generic convergence result for the DGD method when equipped with a gradient-filter (or robust gradient aggregation), which is a common fault-tolerance mechanism [4, 12, 20]. Then, upon assuming the $(2f, \epsilon)$-redundancy property, we present approximate fault-tolerance guarantees for two specific gradient-filters:

  – the comparative gradient elimination (CGE) gradient-filter [17], and
  – the coordinate-wise trimmed mean (CWTM) gradient-filter [36].

  These two filters are among the computationally cheapest of the existing gradient-filters, and have wide applicability; see, e.g., [17, 18, 31, 36].

• In Section 5, we present empirical comparisons between the approximate fault-tolerance achieved by the two aforementioned gradient-filters.
System architecture: The results in this paper apply to the two different system architectures illustrated in Figure 1. The system is assumed to be synchronous. In the server-based architecture, the server is assumed to be trustworthy, but up to $f$ agents may be Byzantine faulty. The trusted server helps solve the distributed optimization problem in coordination with the agents. In the peer-to-peer architecture, the agents are connected to each other by a complete network, and up to $f$ of these agents may be Byzantine faulty. Provided that $f < n/3$, an algorithm for the server-based architecture can be simulated in the peer-to-peer system using the well-known Byzantine broadcast primitive [23]. For simplicity of presentation, the rest of this paper considers the server-based architecture.

Figure 1: System architecture.

Before we present our results in detail, we review below other related work on approximate fault-tolerance and gradient-filters.
2 Related Work

In the past, different definitions of approximate fault-tolerance, besides the $(f, \epsilon)$-resilience that we use, have been formulated to analyze the Byzantine fault-tolerance of different distributed optimization algorithms [15, 32]. As we discuss below in Section 2.1, the difference between these other definitions and our definition of $(f, \epsilon)$-resilience arises mainly from the applicability of the distributed optimization problem. Later, in Section 2.2, we discuss prior work on gradient-filters for conferring Byzantine fault-tolerance to the distributed gradient-descent method, an iterative distributed optimization algorithm used when the agents' cost functions are differentiable [6].

2.1 Alternate Definitions of Approximate Fault-Tolerance
As proposed by Su and Vaidya, 2016 [32], an alternate way of measuring approximation (or accuracy) in fault-tolerance is by the use of scalar coefficients. Specifically, instead of a minimum point of the uniformly weighted aggregate of non-faulty agents' cost functions, a distributed optimization algorithm may output a minimum point of a non-uniformly weighted aggregate of non-faulty costs, i.e., of $\sum_{i \in \mathcal{H}} \alpha_i Q_i(x)$, where $\mathcal{H}$ denotes the set of at least $n - f$ non-faulty agents, and $\alpha_i \geq 0$ for all $i \in \mathcal{H}$. As suggested in [32], upon re-scaling the coefficients such that $\sum_{i \in \mathcal{H}} \alpha_i = 1$, we can measure the approximation in fault-tolerance using two metrics: 1) the number of coefficients in $\{\alpha_i, \, i \in \mathcal{H}\}$ that are positive, and 2) the minimum positive value amongst the coefficients, $\min\{\alpha_i \, ; \, \alpha_i > 0, \, i \in \mathcal{H}\}$. Results on the achievability of such approximation in the case of scalar optimization problems, i.e., when $d = 1$, can be found in [32, 34]. However, we are not aware of any such results for the case of the higher-dimensional optimization problem, i.e., when $d > 1$.

Alternately, a resilient algorithm $\Pi$ may aim to output a point $x_\Pi \in \mathbb{R}^d$ such that each element of the aggregate non-faulty gradient $\sum_{i \in \mathcal{H}} \nabla Q_i(x_\Pi)$ has an absolute value upper bounded by $\epsilon$. As yet another alternative, a resilient algorithm $\Pi$ may aim to output a point $x_\Pi$ such that the non-faulty aggregate cost $\sum_{i \in \mathcal{H}} Q_i(x_\Pi)$ is within $\epsilon$ of the true minimum cost $\min_x \sum_{i \in \mathcal{H}} Q_i(x)$. However, these definitions of approximate resilience are sensitive to scaling of the cost functions. In particular, if the elements of $\sum_{i \in \mathcal{H}} \nabla Q_i(x_\Pi)$ are bounded by $\epsilon$ then the elements of $\sum_{i \in \mathcal{H}} \alpha \nabla Q_i(x_\Pi)$ are bounded by $\alpha \epsilon$, where $\alpha$ is a positive scalar value. On the other hand, both $\sum_{i \in \mathcal{H}} Q_i(x)$ and $\sum_{i \in \mathcal{H}} \alpha Q_i(x)$ have identical minimum points regardless of the value of $\alpha$. Therefore, when the objective is to approximate a minimum point of the non-faulty aggregate cost $\arg\min_x \sum_{i \in \mathcal{H}} Q_i(x)$, which is indeed the case in this paper, the use of function (or gradient) values to measure approximation is not a suitable choice.

2.2 Gradient-Filters

In the past, several gradient-filters have been proposed to robustify the distributed gradient-descent (DGD) method against Byzantine faulty agents; for example, see [1, 4, 14, 15, 17, 27, 32, 36] and references therein. The DGD method is an iterative distributed optimization algorithm wherein the server maintains an estimate of the solution (a valid minimum point), and updates it iteratively using the gradients computed by the agents for their respective cost functions. In the traditional DGD method, the server updates its estimate using the average (or sum) of the agents' gradients. However, gradient averaging is rendered ineffective when Byzantine faulty agents send arbitrary incorrect gradients [32].

A gradient-filter refers to a technique for robust aggregation of agents' gradients that mitigates the detrimental impact of incorrect gradients. To name a few gradient-filters that are provably effective against Byzantine faulty agents, we have comparative gradient elimination (CGE) [18, 17], coordinate-wise trimmed mean (CWTM) [32, 36], geometric median-of-means (GMoM) [12], KRUM [4], and spectral gradient-filters [15]. However, different gradient-filters guarantee Byzantine fault-tolerance under different assumptions on the non-faulty agents' cost functions.

In this paper, we first propose a generic result (Theorem 3) to establish convergence of the DGD method when coupled with a gradient-filter. The result holds true regardless of the gradient-filter used. Then, we use this convergence result to derive the approximate fault-tolerance guarantees (as per Definition 2) of two specific gradient-filters, CGE and CWTM, under the $(2f, \epsilon)$-redundancy property. The reason for choosing these two filters is that they are among the computationally cheapest, with $O(n(\log n + d))$ per-iteration time complexity [17, 31].

It is worth noting that both the CGE and CWTM gradient-filters can guarantee exact fault-tolerance under certain assumptions, besides $2f$-redundancy, as shown in [18, 17, 31]. However, unlike CGE, the fault-tolerance property of CWTM is only known for the special distributed optimization problems of distributed linear regression [31] and distributed machine learning [36]. In Section 4.2, we present approximate fault-tolerance properties of the CGE and CWTM gradient-filters that are applicable to the more general distributed optimization problem, and that also encapsulate the special case of exact fault-tolerance.

3 $(f, \epsilon)$-Resilience

In this section, we present formal details on necessary and sufficient conditions for $(f, \epsilon)$-resilience. Throughout this paper we assume, as stated below, that the non-faulty agents' cost functions and their aggregates have well-defined minimum points; otherwise, the problem of optimization is rendered vacuous.
Assumption 1. We assume that for any non-empty set $S$ of non-faulty agents, the set $\arg\min_{x \in \mathbb{R}^d} \sum_{i \in S} Q_i(x)$ is closed and non-empty.

Moreover, we also assume that $f < n/2$. Lemma 1, stated below, shows that $(f, \epsilon)$-resilience is impossible in general when $f \geq n/2$.

Lemma 1. If $f \geq n/2$ then there cannot exist a deterministic $(f, \epsilon)$-resilient algorithm for any $\epsilon \geq 0$.

Proof. We prove the lemma by contradiction. We consider a case when $n = 2$ and $d = 1$, i.e., $Q_i : \mathbb{R} \to \mathbb{R}$ for all $i$, and all the cost functions have unique minimum points. Suppose that $f = 1$, and that there exists a deterministic $(f, \epsilon)$-resilient algorithm $\Pi$ for some $\epsilon \geq 0$. Without loss of generality, we suppose that agent 2 is Byzantine faulty. We denote $x_1 = \arg\min_{x \in \mathbb{R}} Q_1(x)$.

The Byzantine agent 2 can choose to behave as a non-faulty agent with cost function $\widetilde{Q}(x) = Q_1(x - 2\epsilon - \delta)$, where $\delta$ is some positive real value. Now, note that the minimum point of $\widetilde{Q}(x)$, which we denote by $x_2$, is unique and equal to $x_1 + 2\epsilon + \delta$. Therefore, $|x_2 - x_1| = 2\epsilon + \delta > 2\epsilon$. As the identity of the Byzantine agent is a priori unknown, the server cannot distinguish between the scenarios: (i) agent 1 is non-faulty, and (ii) agent 2 is non-faulty. Now, being a deterministic algorithm, $\Pi$ should produce the same output in both scenarios. In scenario (i), as $\Pi$ is assumed $(f, \epsilon)$-resilient, its output must lie in the interval $[x_1 - \epsilon, \, x_1 + \epsilon]$. Similarly, in scenario (ii), the output of $\Pi$ must lie in the interval $[x_2 - \epsilon, \, x_2 + \epsilon]$. However, as $|x_2 - x_1| > 2\epsilon$, the two intervals $[x_1 - \epsilon, \, x_1 + \epsilon]$ and $[x_2 - \epsilon, \, x_2 + \epsilon]$ do not overlap. Therefore, $\Pi$ cannot be $(f, \epsilon)$-resilient in both scenarios simultaneously, which is a contradiction to the assumption that $\Pi$ is $(f, \epsilon)$-resilient. The above argument extends easily to the case when $n > 2$ and $f \geq n/2$.
Next, we show the necessity of $(2f, \epsilon)$-redundancy for $(f, \epsilon)$-resilience.

Theorem 1 (Necessity). Suppose that Assumption 1 holds true. There exists a deterministic $(f, \epsilon)$-resilient distributed optimization algorithm, where $\epsilon \geq 0$, only if the agents' cost functions satisfy the $(2f, \epsilon)$-redundancy property.

Proof. To prove the theorem we present a scenario in which the agents' cost functions (if non-faulty) are scalar functions, i.e., $d = 1$ and $Q_i : \mathbb{R} \to \mathbb{R}$ for all $i$, and the minimum point of an aggregate of one or more agents' cost functions is uniquely defined. Obviously, if a condition is necessary in this case then it is so in the more general case of vector functions with non-unique minimum points.

To prove the necessity condition, we also assume that the server has full knowledge of all the agents' cost functions. This may not hold true in practice, where instead the server may only have partial information about the agents' cost functions. Indeed, this assumption forces the Byzantine faulty agents to a priori fix their cost functions, whereas in reality the Byzantine agents may send arbitrary information over time to the server that need not be consistent with any fixed cost function. Thus, necessity of $(2f, \epsilon)$-redundancy under this assumption implies its necessity in general.

The proof is by contradiction. Specifically, we show the following: if the non-faulty cost functions do not satisfy the $(2f, \epsilon)$-redundancy property then there cannot exist a deterministic $(f, \epsilon)$-resilient distributed optimization algorithm.

Recall that we have assumed that for a non-empty set of agents $T$ the aggregate cost function $\sum_{i \in T} Q_i(x)$ has a unique minimum point. To be precise, for each non-empty subset of agents $T$, we define

$$x_T = \arg\min_x \sum_{i \in T} Q_i(x).$$

Suppose that the agents' cost functions do not satisfy the $(2f, \epsilon)$-redundancy property stated in Definition 3. Then, there exist a real number $\delta > 0$ and sets $S, \hat{S}$ with $\hat{S} \subset S$, $|S| = n - f$, and $n - 2f \leq |\hat{S}| < n - f$ such that

$$\left\| x_{\hat{S}} - x_S \right\| \geq \epsilon + \delta. \tag{6}$$

Now, suppose that $n - f - |\hat{S}|$ agents in the remainder set $\{1, \ldots, n\} \setminus S$ are Byzantine faulty. Let us denote the set of faulty agents by $\mathcal{B}$. Note that $\mathcal{B}$ is non-empty with $|\mathcal{B}| = n - f - |\hat{S}| \leq f$. Similar to the non-faulty agents, the faulty agents send to the server cost functions that are scalar, and the aggregate of one or more agents' cost functions in the set $S \cup \mathcal{B}$ has a unique minimum point. However, the faulty agents choose their cost functions such that the aggregate cost function of the agents in the set $\mathcal{B} \cup \hat{S}$ minimizes at a unique point $x_{\mathcal{B} \cup \hat{S}}$ which is $\| x_{\hat{S}} - x_S \|$ distance away from $x_{\hat{S}}$, similar to $x_S$, but lies on the other side of $x_{\hat{S}}$, as illustrated in the figure below. Note that it is always possible to pick such functions for the faulty agents.

[Figure: the points $x_S$, $x_{\hat{S}}$ and $x_{\mathcal{B} \cup \hat{S}}$ on the real line, with $x_{\hat{S}}$ in the middle at distance $\epsilon + \delta$ from each of $x_S$ and $x_{\mathcal{B} \cup \hat{S}}$, which lie on opposite sides of $x_{\hat{S}}$; the $\epsilon$-intervals around $x_S$ and $x_{\mathcal{B} \cup \hat{S}}$ do not intersect.]

Note that the distance between the two points $x_S$ and $x_{\mathcal{B} \cup \hat{S}}$ is $2 \| x_{\hat{S}} - x_S \|$, i.e.,

$$\left\| x_S - x_{\mathcal{B} \cup \hat{S}} \right\| \geq 2\epsilon + 2\delta. \tag{7}$$

We now show, by contradiction, that there cannot exist a deterministic $(f, \epsilon)$-resilient distributed optimization algorithm. Suppose, toward a contradiction, that there exists an $(f, \epsilon)$-resilient deterministic optimization algorithm named $\Pi$. As the identity of the Byzantine faulty agents is a priori unknown to the server, and the cost functions sent by the Byzantine faulty agents have similar properties as the non-faulty agents', the server cannot distinguish between the following two possible scenarios: (i) $S$ is the set of non-faulty agents, and (ii) $\mathcal{B} \cup \hat{S}$ is the set of non-faulty agents. Note that both the sets $S$ and $\mathcal{B} \cup \hat{S}$ contain $n - f$ agents.

As the cost functions received by the server are identical in both of the above scenarios, being a deterministic algorithm, $\Pi$ should have identical output in both cases. We let $\hat{x}$ denote the output of $\Pi$. In scenario (i), when the set of honest agents is given by $S$ with $|S| = n - f$, as $\Pi$ is assumed $(f, \epsilon)$-resilient, by Definition 2 the output

$$\hat{x} \in [x_S - \epsilon, \, x_S + \epsilon], \tag{8}$$

as shown in the figure above. By the same logic, in scenario (ii), when the set of honest agents is $\mathcal{B} \cup \hat{S}$ with $|\mathcal{B} \cup \hat{S}| = n - f$, the output

$$\hat{x} \in [x_{\mathcal{B} \cup \hat{S}} - \epsilon, \, x_{\mathcal{B} \cup \hat{S}} + \epsilon]. \tag{9}$$

However, (7) implies that $\hat{x}$ cannot satisfy both (8) and (9) simultaneously. Therefore, if algorithm $\Pi$ is $(f, \epsilon)$-resilient in scenario (i) then it cannot be so in scenario (ii), and vice-versa. This is a contradiction to the assumption that $\Pi$ is $(f, \epsilon)$-resilient, and therefore proves the impossibility of $(f, \epsilon)$-resilience when the $(2f, \epsilon)$-redundancy property is violated.

Next, we show that $(2f, \epsilon)$-redundancy suffices for $(f, 2\epsilon)$-resilience.

Theorem 2 (Sufficiency). Suppose that Assumption 1 holds true. For a real value $\epsilon \geq 0$, if the agents' cost functions satisfy the $(2f, \epsilon)$-redundancy property then $(f, 2\epsilon)$-resilience is achievable.

Proof. The proof is constructive: we assume that all the agents send their individual cost functions to the server, and exhibit an algorithm whose output satisfies Definition 2 with approximation error $2\epsilon$. We assume that $f > 0$; the case $f = 0$ is trivial. Throughout the proof we write the notation $\arg\min_{x \in \mathbb{R}^d}$ simply as $\arg\min$, unless otherwise stated. Consider the algorithm presented below, comprising three steps.

Step 1:
Each agent sends its cost function to the server. An honest agent sends its actual cost function, while a faulty agent may send an arbitrary function.
Step 2:
For each set $T$ of received functions with $|T| = n - f$, the server computes a point

$$x_T \in \arg\min \sum_{i \in T} Q_i(x).$$

For each subset $\hat{T} \subset T$ with $|\hat{T}| = n - 2f$, the server computes

$$r_{T\hat{T}} \triangleq {\rm dist}\left(x_T, \; \arg\min \sum_{i \in \hat{T}} Q_i(x)\right), \tag{10}$$

and

$$r_T = \max_{\hat{T} \subset T, \, |\hat{T}| = n - 2f} r_{T\hat{T}}. \tag{11}$$
Step 3:
The server outputs $x_S$ such that

$$S = \arg\min_{T \subset \{1, \ldots, n\}, \, |T| = n - f} r_T. \tag{12}$$

We show below that the above algorithm is $(f, 2\epsilon)$-resilient under $(2f, \epsilon)$-redundancy. For a non-empty set of agents $T$, we denote

$$X_T = \arg\min \sum_{i \in T} Q_i(x).$$

Consider an arbitrary set of non-faulty agents $G$ with $|G| = n - f$. Such a set is guaranteed to exist as there are at most $f$ faulty agents, and therefore at least $n - f$ non-faulty agents in the system. Consider an arbitrary set $\hat{T}$ such that $\hat{T} \subset G$ and $|\hat{T}| = n - 2f$. By Definition 3 of $(2f, \epsilon)$-redundancy,

$${\rm dist}\left(X_G, X_{\hat{T}}\right) \leq \epsilon. \tag{13}$$

Recall from (10) that $r_{G\hat{T}} = {\rm dist}(x_G, X_{\hat{T}})$. As $x_G \in X_G$, by definition (4) of the Hausdorff set distance, ${\rm dist}(x_G, X_{\hat{T}}) \leq {\rm dist}(X_G, X_{\hat{T}})$. Therefore,

$$r_{G\hat{T}} \leq \epsilon. \tag{14}$$

Now, recall from (11) that $r_G = \max_{\hat{T} \subset G, \, |\hat{T}| = n - 2f} r_{G\hat{T}}$. As $\hat{T}$ in (14) is an arbitrary subset of $G$ with $|\hat{T}| = n - 2f$,

$$r_G = \max_{\hat{T} \subset G, \, |\hat{T}| = n - 2f} r_{G\hat{T}} \leq \epsilon. \tag{15}$$

From (12) and (15) we obtain that

$$r_S \leq r_G \leq \epsilon. \tag{16}$$

As $|G| = n - f$, for every set of agents $T$ with $|T| = n - f$, $|T \cap G| \geq n - 2f$. Therefore, for the set $S$ defined in (12), there exists a subset $\hat{G}$ of $G$ such that $\hat{G} \subset S$ and $|\hat{G}| = n - 2f$. For such a set $\hat{G}$, by the definition of $r_S$ in (11), we obtain that $r_{S\hat{G}} \triangleq {\rm dist}(x_S, X_{\hat{G}}) \leq r_S$. Substituting from (16) above, we obtain that

$${\rm dist}\left(x_S, X_{\hat{G}}\right) \leq \epsilon. \tag{17}$$

As $\hat{G}$ is a subset of $G$, all the agents in $\hat{G}$ are non-faulty. Therefore, by Assumption 1, $X_{\hat{G}}$ is a closed set. Recall that ${\rm dist}(x_S, X_{\hat{G}}) = \inf_{x \in X_{\hat{G}}} \| x_S - x \|$. The closedness of $X_{\hat{G}}$ implies that there exists a point $z \in X_{\hat{G}}$ such that

$$\| x_S - z \| = \inf_{x \in X_{\hat{G}}} \| x_S - x \| = {\rm dist}\left(x_S, X_{\hat{G}}\right).$$

The above, in conjunction with (17), implies that

$$\| x_S - z \| \leq \epsilon. \tag{18}$$

Moreover, as $z \in X_{\hat{G}}$ where $\hat{G} \subset G$ with $|\hat{G}| = n - 2f$ and $|G| = n - f$, the $(2f, \epsilon)$-redundancy condition stated in Definition 3 implies that ${\rm dist}(z, X_G) \leq \epsilon$. Similar to the argument made above, under Assumption 1, $X_G$ is a closed set, and therefore there exists $x^* \in X_G$ such that

$$\| z - x^* \| = {\rm dist}(z, X_G) \leq \epsilon. \tag{19}$$

By the triangle inequality, (18) and (19) imply that

$$\| x_S - x^* \| \leq \| x_S - z \| + \| z - x^* \| \leq 2\epsilon. \tag{20}$$

Finally, recall that the set $G$ above is an arbitrary set of $n - f$ non-faulty agents, and $x^* \in X_G$. Therefore, (20) implies ${\rm dist}(x_S, X_G) \leq 2\epsilon$ for every such $G$, which proves the theorem.
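For concreteness, the three-step algorithm above admits a direct brute-force implementation when every $\arg\min$ in (10)–(12) is a singleton. The Python sketch below is a minimal illustration under assumed scalar quadratic costs (the data and names are hypothetical; the enumeration over subsets is exponential in general):

import numpy as np
from itertools import combinations

def resilient_output(a, n, f):
    # Hypothetical reports: agent i's cost is Q_i(x) = (x - a[i])^2, so the
    # aggregate over a set T is uniquely minimized at mean(a_T).
    best_S, best_r = None, np.inf
    for T in combinations(range(n), n - f):          # Step 2
        x_T = a[list(T)].mean()
        r_T = max(abs(x_T - a[list(Th)].mean())      # r_T as in (10)-(11)
                  for Th in combinations(T, n - 2 * f))
        if r_T < best_r:                             # Step 3: minimize r_T
            best_S, best_r = T, r_T
    return a[list(best_S)].mean()                    # the output x_S of (12)

# Usage: n = 6 agents, f = 1; the faulty agent's report is arbitrary.
a = np.array([0.0, 0.2, 0.9, 1.1, 1.3, 100.0])
print(resilient_output(a, n=6, f=1))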
In the next part of the paper, i.e., Sections 4 and 5, we consider the case when the (non-faulty) agents' cost functions are assumed differentiable. Specifically, we present and study a generic fault-tolerance mechanism, gradient-filtering, for conferring approximate fault-tolerance to a commonly used distributed optimization algorithm: the distributed gradient-descent method.

4 Distributed Gradient-Descent with Gradient-Filters

In this section, we consider a setting wherein the non-faulty agents' cost functions are differentiable. For this particular setting, we study the approximate fault-tolerance of the distributed gradient-descent method when coupled with a gradient-filter, described below. We consider the server-based system architecture shown in Fig. 1, assuming a synchronous system.

The distributed gradient-descent method is an iterative algorithm wherein the server maintains an estimate of a minimum point and updates it iteratively using gradients sent by the agents. Specifically, in each iteration $t \in \{0, 1, \ldots\}$, the server starts with an estimate $x^t$ and broadcasts it to all the agents. Each non-faulty agent $i$ sends back to the server the gradient of its cost function at $x^t$, i.e., $\nabla Q_i(x^t)$. However, Byzantine faulty agents may send arbitrary incorrect vectors as their gradients to the server. The initial estimate, named $x^0$, is chosen arbitrarily by the server.

A gradient-filter is a vector function, denoted by ${\sf GradFilter}$, that maps the $n$ gradients received by the server from the $n$ agents to a $d$-dimensional vector, i.e., ${\sf GradFilter} : \mathbb{R}^{d \times n} \to \mathbb{R}^d$. For example, the average of all the gradients, as used in the traditional distributed gradient-descent method, is technically a gradient-filter. However, averaging is not robust against Byzantine faulty agents [4, 32]. The real purpose of a gradient-filter is to mitigate the detrimental impact of incorrect gradients sent by the Byzantine faulty agents; in other words, a gradient-filter robustifies the traditional gradient-descent method against Byzantine faults. We show that if a gradient-filter satisfies a certain property then it can confer fault-tolerance to the distributed gradient-descent method.

4.1 The Iterative Algorithm

We first formally describe below the steps in each iteration of the distributed gradient-descent method implemented on a synchronous server-based system. Note that we constrain the estimates computed by the server to a compact convex set $\mathcal{W} \subset \mathbb{R}^d$. The set $\mathcal{W}$ can be arbitrarily large. For a vector $x \in \mathbb{R}^d$, its projection onto $\mathcal{W}$, denoted by $[x]_{\mathcal{W}}$, is defined to be

$$[x]_{\mathcal{W}} = \arg\min_{y \in \mathcal{W}} \| x - y \|. \tag{21}$$

As $\mathcal{W}$ is a compact and convex set, $[x]_{\mathcal{W}}$ is unique for each $x$ (see [8]).

The $t$-th iteration: In each iteration $t \in \{0, 1, \ldots\}$ the server updates its current estimate $x^t$ to $x^{t+1}$ using Steps S1 and S2 described as follows.

S1: The server requests from each agent the gradient of its local cost function at the current estimate $x^t$. Each non-faulty agent $i$ will then send to the server the gradient $\nabla Q_i(x^t)$, whereas a faulty agent may send an incorrect arbitrary value for the gradient. The gradient received by the server from agent $i$ is denoted by $g^t_i$. If no gradient is received from some agent $i$, agent $i$ must be faulty (because the system is assumed to be synchronous); in this case, the server eliminates agent $i$ from the system, updates the values of $n$ and $f$, and re-assigns the agents indices from $1$ to $n$.

S2 (Gradient-filtering): The server applies a gradient-filter ${\sf GradFilter}$ to the $n$ received gradients and computes ${\sf GradFilter}(g^t_1, \ldots, g^t_n) \in \mathbb{R}^d$. Then, the server updates its estimate to

$$x^{t+1} = \left[ x^t - \eta_t \, {\sf GradFilter}\left(g^t_1, \ldots, g^t_n\right) \right]_{\mathcal{W}} \tag{22}$$

where $\eta_t$ is a positive step-size for iteration $t$.

We present below, in Theorem 3, a generic convergence result for the above algorithm. The proof of Theorem 3 is deferred to Appendix C.
Theorem 3. Consider the iterative algorithm described above, with the update law (22) and diminishing step-sizes $\{\eta_t, \, t = 0, 1, \ldots\}$ satisfying $\sum_{t=0}^{\infty} \eta_t = \infty$ and $\sum_{t=0}^{\infty} \eta_t^2 < \infty$. Suppose that $\left\| {\sf GradFilter}(g^t_1, \ldots, g^t_n) \right\| < \infty$ for all $t$. For some point $x^* \in \mathcal{W}$, if there exist real-valued constants ${\sf D}^* \in \left[0, \, \max_{x \in \mathcal{W}} \| x - x^* \| \right)$ and $\xi > 0$ such that for each iteration $t$,

$$\phi_t \triangleq \left\langle x^t - x^*, \; {\sf GradFilter}\left(g^t_1, \ldots, g^t_n\right) \right\rangle \geq \xi \quad \text{when} \quad \left\| x^t - x^* \right\| \geq {\sf D}^*, \tag{23}$$

then $\lim_{t \to \infty} \left\| x^t - x^* \right\| \leq {\sf D}^*$.

Note that the values ${\sf D}^*$ and $\xi$ in the statement of Theorem 3 need not be independent of each other. As shown in the subsequent section, the generic convergence result of Theorem 3 helps us obtain approximate fault-tolerance properties of different gradient-filters, under $(2f, \epsilon)$-redundancy and certain standard assumptions. We consider two particular gradient-filters, namely comparative gradient elimination (CGE) and coordinate-wise trimmed mean (CWTM).
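To make the algorithm of Section 4.1 concrete, the following Python sketch implements update (22) with a pluggable gradient-filter. This is a hedged illustration with hypothetical names and data; the filter shown is plain averaging, the non-robust baseline, while robust filters such as CGE and CWTM are defined in Section 4.2. The step-size schedule matches the one used in Section 5.

import numpy as np

def project(x, box):
    # Projection (21) onto the hypercube W = [-box, box]^d.
    return np.clip(x, -box, box)

def dgd(grad_oracles, grad_filter, d, box=1000.0, num_iters=500):
    # grad_oracles[i](x) returns agent i's reported gradient at x; a faulty
    # agent's oracle may return an arbitrary vector.
    x = np.zeros(d)                                  # initial estimate x^0
    for t in range(num_iters):
        g = np.stack([oracle(x) for oracle in grad_oracles])
        eta = 1.5 / (t + 1)                          # diminishing step-sizes
        x = project(x - eta * grad_filter(g), box)   # update (22)
    return x

average = lambda g: g.mean(axis=0)                   # the non-robust baseline

# Example: three fault-free agents with Q_i(x) = ||x - c_i||^2.
c = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
oracles = [lambda x, ci=ci: 2 * (x - ci) for ci in c]
print(dgd(oracles, average, d=2))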
4.2 Fault-Tolerance Properties of CGE and CWTM

In this subsection, we present precise approximate fault-tolerance guarantees for two specific gradient-filters: comparative gradient elimination (CGE) [17, 18] and coordinate-wise trimmed mean (CWTM) [31, 36]. Note that the differentiability of the non-faulty agents' cost functions, which is already assumed for the gradient-descent method, implies Assumption 1 (see [8]). We additionally make Assumptions 2, 3 and 4 about the non-faulty agents' cost functions. Similar assumptions are made in prior work on fault-free distributed optimization [3, 7, 25].

Assumption 2 (Lipschitz smoothness). For each non-faulty agent $i$, we assume that the gradient of its cost function, $\nabla Q_i(x)$, is Lipschitz continuous, i.e., there exists a finite real value $\mu > 0$ such that

$$\left\| \nabla Q_i(x) - \nabla Q_i(x') \right\| \leq \mu \left\| x - x' \right\|, \quad \forall x, x' \in \mathcal{W}.$$

Assumption 3 (Strong convexity). For a non-empty set of non-faulty agents $\mathcal{H}$, let $Q_{\mathcal{H}}(x)$ denote the average cost function of the agents in $\mathcal{H}$, i.e.,

$$Q_{\mathcal{H}}(x) = \frac{1}{|\mathcal{H}|} \sum_{i \in \mathcal{H}} Q_i(x).$$

For each such set $\mathcal{H}$ with $|\mathcal{H}| = n - f$, we assume that $Q_{\mathcal{H}}(x)$ is strongly convex, i.e., there exists a finite real value $\gamma > 0$ such that

$$\left\langle \nabla Q_{\mathcal{H}}(x) - \nabla Q_{\mathcal{H}}(x'), \; x - x' \right\rangle \geq \gamma \left\| x - x' \right\|^2, \quad \forall x, x' \in \mathcal{W}.$$

Note that, under Assumptions 2 and 3, $\gamma \leq \mu$. This inequality is proved in Appendix B. Now, recall that the iterative estimates of the algorithm in Section 4.1 are constrained to a compact convex set $\mathcal{W} \subset \mathbb{R}^d$.

Assumption 4 (Existence). For each set of non-faulty agents $\mathcal{H}$ with $|\mathcal{H}| = n - f$, we assume that there exists a point $x_{\mathcal{H}} \in \arg\min_{x \in \mathbb{R}^d} \sum_{i \in \mathcal{H}} Q_i(x)$ such that $x_{\mathcal{H}} \in \mathcal{W}$.

We now present the fault-tolerance properties of the CGE and CWTM gradient-filters in Sections 4.2.1 and 4.2.2 below, respectively.
4.2.1 The CGE Gradient-Filter

To apply the CGE gradient-filter in Step S2, the server sorts the $n$ gradients received from the $n$ agents at the completion of Step S1 as per their Euclidean norms (ties broken arbitrarily):

$$\left\| g^t_{i_1} \right\| \leq \ldots \leq \left\| g^t_{i_{n-f}} \right\| \leq \left\| g^t_{i_{n-f+1}} \right\| \leq \ldots \leq \left\| g^t_{i_n} \right\|.$$

That is, the gradient with the smallest norm, $g^t_{i_1}$, is received from agent $i_1$, and the gradient with the largest norm, $g^t_{i_n}$, is received from agent $i_n$. Then, the output of the CGE gradient-filter is the vector sum of the $n - f$ gradients with the smallest Euclidean norms. Specifically,

$${\sf GradFilter}\left(g^t_1, \ldots, g^t_n\right) = \sum_{j=1}^{n-f} g^t_{i_j}. \tag{24}$$

We show below, in Theorem 4, that when the fraction of Byzantine faulty agents $f/n$ is bounded, the algorithm in Section 4.1 with the CGE gradient-filter in Step S2 is $(f, O(\epsilon))$-resilient under $(2f, \epsilon)$-redundancy and the aforementioned assumptions. To present the formal result we define below some parameters.

• We define a fault-tolerance margin

$$\alpha = 1 - \frac{f}{n}\left(1 + \frac{2\mu}{\gamma}\right) \tag{25}$$

that determines the maximum fraction of Byzantine faulty agents that can be tolerated in an execution of the algorithm.

• We define a coefficient

$${\sf D} = \frac{4\mu f}{\alpha \gamma} = \frac{4\mu f}{\gamma - \frac{f}{n}\left(\gamma + 2\mu\right)} \tag{26}$$

that measures the resilience of the algorithm.

Both $\alpha$ and ${\sf D}$ depend upon $f$, the maximum number of Byzantine faulty agents in any given execution of the algorithm. Note that, under Assumptions 3 and 4, for each non-empty set of non-faulty agents $\mathcal{H}$ with $|\mathcal{H}| = n - f$, the aggregate cost function $\sum_{i \in \mathcal{H}} Q_i(x)$ has a unique minimum point, denoted by $x_{\mathcal{H}}$, in the set $\mathcal{W}$. Specifically,

$$x_{\mathcal{H}} = \arg\min_{x \in \mathbb{R}^d} \sum_{i \in \mathcal{H}} Q_i(x) \, \cap \, \mathcal{W}. \tag{27}$$

Theorem 4. Suppose that the non-faulty agents' cost functions satisfy the $(2f, \epsilon)$-redundancy property and Assumptions 2, 3 and 4. Consider the iterative algorithm in Section 4.1 with the CGE gradient-filter defined in (24), and diminishing step-sizes $\{\eta_0, \eta_1, \ldots\}$ in (22) satisfying $\sum_{t=0}^{\infty} \eta_t = \infty$ and $\sum_{t=0}^{\infty} \eta_t^2 < \infty$. If $\alpha > 0$ then for each set of $n - f$ non-faulty agents $\mathcal{H}$,

$$\lim_{t \to \infty} \left\| x^t - x_{\mathcal{H}} \right\| \leq {\sf D}\epsilon.$$

Thus, by Definition 2, the algorithm is asymptotically $(f, {\sf D}\epsilon)$-resilient.

The proof of Theorem 4 relies on Theorem 3, and is deferred to Appendix D. Essentially, to prove Theorem 4 we show that the CGE gradient-filter defined in (24) satisfies the conditions specified in Theorem 3 for ${\sf D}^* = {\sf D}\epsilon$ and $x^* = x_{\mathcal{H}}$, for every set of non-faulty agents $\mathcal{H}$ with $|\mathcal{H}| = n - f$.
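A minimal Python sketch of the CGE rule (24); the variable names are ours:

import numpy as np

def cge_filter(gradients, f):
    # gradients: an (n, d) array of the vectors received in Step S1.
    norms = np.linalg.norm(gradients, axis=1)
    keep = np.argsort(norms)[: len(gradients) - f]  # n - f smallest norms
    return gradients[keep].sum(axis=0)              # vector sum, as in (24)

Plugged into the loop of Section 4.1, this filter costs one sort per iteration, consistent with the $O(n(\log n + d))$ complexity noted in Section 2.2.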
According to Theorem 4, if $\alpha > 0$, or equivalently, if the fraction of Byzantine faulty agents $f/n$ is below the threshold value of $1/(1 + 2(\mu/\gamma))$, then, under the $(2f, \epsilon)$-redundancy property and other standard assumptions, the distributed gradient-descent method with the CGE gradient-filter is $(f, {\sf D}\epsilon)$-resilient. As $\gamma \leq \mu$ under Assumptions 2 and 3 (see Appendix B), the fault-tolerance property of the CGE gradient-filter stated in Theorem 4 requires $f/n < 1/3$, i.e., $f < n/3$. Note that as $f$ decreases, the value of ${\sf D}$ decreases, i.e., the resilience of the algorithm improves. Also, note that ${\sf D} = 0$ when $f = 0$, and therefore the algorithm converges to an actual minimum point of all the agents' aggregate cost function in the fault-free case.

4.2.2 The CWTM Gradient-Filter

To apply the CWTM gradient-filter in Step S2, the server sorts the $n$ gradients received from the $n$ agents at the completion of Step S1 as per their individual elements. For a vector $v \in \mathbb{R}^d$, we let $v[k]$ denote its $k$-th element. Specifically, for each $k \in \{1, \ldots, d\}$, the server sorts the $k$-th elements of the gradients (ties broken arbitrarily):

$$g^t_{i_1[k]}[k] \leq \ldots \leq g^t_{i_{f+1}[k]}[k] \leq \ldots \leq g^t_{i_{n-f}[k]}[k] \leq \ldots \leq g^t_{i_n[k]}[k].$$

That is, the smallest $k$-th element, $g^t_{i_1[k]}[k]$, is received from agent $i_1[k]$, and the largest $k$-th element, $g^t_{i_n[k]}[k]$, is received from agent $i_n[k]$. For each $k$, the server eliminates the largest $f$ and the smallest $f$ of the received elements. The output of the CWTM gradient-filter is then the vector whose $k$-th element equals the average of the remaining $n - 2f$ elements. That is, for each $k \in \{1, \ldots, d\}$,

$${\sf GradFilter}\left(g^t_1, \ldots, g^t_n\right)[k] = \frac{1}{n - 2f} \sum_{j=f+1}^{n-f} g^t_{i_j[k]}[k]. \tag{28}$$

We show below, in Theorem 5, that when the separation between the gradients of the non-faulty agents' cost functions is small enough, the CWTM gradient-filter guarantees approximate fault-tolerance under $(2f, \epsilon)$-redundancy. To formally present our result we make the following additional assumption about the distance between the gradients of two non-faulty agents' cost functions.
Assumption 5. For every pair of non-faulty agents $i$ and $j$, we assume that there exists $\lambda > 0$ such that for all $x \in \mathcal{W}$,

$$\left\| \nabla Q_i(x) - \nabla Q_j(x) \right\| \leq \lambda \, \max\left\{ \left\| \nabla Q_i(x) \right\|, \; \left\| \nabla Q_j(x) \right\| \right\}.$$

Obviously, owing to the triangle inequality, Assumption 5 always holds for $\lambda = 2$. However, as shown below in Theorem 5, we can presently guarantee fault-tolerance of the CWTM gradient-filter only when $\lambda < \gamma/(\mu \sqrt{d})$, where $\mu$ and $\gamma$ are the Lipschitz smoothness and strong convexity coefficients defined in Assumptions 2 and 3, respectively. Recall the definition of $x_{\mathcal{H}}$ from (27).

Theorem 5. Suppose that the non-faulty agents' cost functions satisfy the $(2f, \epsilon)$-redundancy property and Assumptions 2, 3, 4 and 5. Consider the iterative algorithm in Section 4.1 with the CWTM gradient-filter defined in (28), and diminishing step-sizes $\{\eta_0, \eta_1, \ldots\}$ in (22) satisfying $\sum_{t=0}^{\infty} \eta_t = \infty$ and $\sum_{t=0}^{\infty} \eta_t^2 < \infty$. If $\lambda < \gamma/(\mu \sqrt{d})$ then for each set of $n - f$ non-faulty agents $\mathcal{H}$,

$$\lim_{t \to \infty} \left\| x^t - x_{\mathcal{H}} \right\| \leq \left( \frac{n \lambda}{\left(\gamma / \mu \sqrt{d}\right) - \lambda} \right) \epsilon.$$

The proof of Theorem 5 is similar to that of Theorem 4, and is deferred to Appendix E. By Definition 2 of approximate fault-tolerance, Theorem 5 implies that the algorithm with the CWTM gradient-filter is asymptotically $(f, \sigma)$-resilient, where

$$\sigma = \left( \frac{n \lambda}{\left(\gamma / \mu \sqrt{d}\right) - \lambda} \right) \epsilon.$$

Note that the smaller the value of $\lambda$, the smaller the value of $\sigma$, and therefore the better the resilience guarantee of the CWTM gradient-filter.

Unlike that of the CGE gradient-filter, the resilience obtained for CWTM in Theorem 5 is independent of $f$, as long as the separation between the gradients of the non-faulty agents' cost functions is sufficiently small, i.e., $\lambda < \gamma/(\mu \sqrt{d})$. However, unlike the sufficient condition for the resilience of the CGE gradient-filter, the condition on $\lambda$ guaranteeing the resilience of the CWTM gradient-filter depends upon the dimension $d$ of the optimization problem: a larger dimension results in a tighter bound on $\lambda$.
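A minimal Python sketch of the CWTM rule (28), complementing the CGE sketch above (again with our own variable names):

import numpy as np

def cwtm_filter(gradients, f):
    # Coordinate-wise: sort each coordinate independently, drop its f
    # smallest and f largest entries, then average the remaining n - 2f
    # entries, as in (28).
    n = len(gradients)
    s = np.sort(gradients, axis=0)  # sorts every coordinate independently
    return s[f : n - f].mean(axis=0)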
5 Numerical Experiments

In this section, we present simulation results to empirically compare the approximate fault-tolerance achieved by the two aforementioned gradient-filters, CGE and CWTM. For the simulation, we consider the problem of distributed linear regression, which is a special distributed optimization problem with quadratic cost functions [18].
We consider a synchronous server-based system, as shown in Figure 1, wherein $n = 6$, $d = 2$, and $f = 1$. Each agent $i \in \{1, \ldots, n\}$ has a data point represented by a triplet $(A_i, B_i, N_i)$, where $A_i$ is a $d$-dimensional row vector, $B_i \in \mathbb{R}$ is the response, and $N_i \in \mathbb{R}$ is a noise value. Specifically, for all $i \in \{1, \ldots, n\}$,

$$B_i = A_i x^* + N_i \tag{29}$$

for a fixed parameter vector $x^* \in \mathbb{R}^2$. The collective data is represented by a triplet of matrices $(A, B, N)$, where the $i$-th rows of $A$, $B$, and $N$ are equal to $A_i$, $B_i$ and $N_i$, respectively; the specific numerical values of $x^*$, $A$, $B$ and $N$ used in the simulation are given in (30). It should be noted that

$$B = A x^* + N. \tag{31}$$

We let $A_S$, $B_S$ and $N_S$ represent matrices of dimensions $|S| \times d$, $|S| \times 1$ and $|S| \times 1$ whose rows are $\{A_i, \, i \in S\}$, $\{B_i, \, i \in S\}$ and $\{N_i, \, i \in S\}$, respectively, in increasing order of the agent indices. From (31), observe that for every non-empty set $S$,

$$B_S = A_S x^* + N_S. \tag{32}$$

Recall from basic linear algebra that if $A_S$ has full column rank, i.e., ${\rm rank}(A_S) = d = 2$, then $x^*$ is the unique solution of the set of equations (32). For our data, $A_S$ is full rank for every set $S$ with $|S| \geq n - 2f = 4$. Specifically,

$${\rm rank}(A_S) = d = 2, \quad \forall S \subseteq \{1, \ldots, 6\}, \; |S| \geq 4. \tag{33}$$

In this particular distributed optimization problem, each agent $i$ has a quadratic cost function defined to be

$$Q_i(x) = \left( B_i - A_i x \right)^2, \quad \forall x \in \mathbb{R}^2.$$

For an arbitrary non-empty set of agents $S$, we define

$$Q_S(x) = \sum_{i \in S} Q_i(x) = \sum_{i \in S} \left( B_i - A_i x \right)^2 = \left\| B_S - A_S x \right\|^2. \tag{34}$$

As the matrix $A_S$ is full rank for every $S$ with $|S| \geq 4$,

$$\arg\min_{x \in \mathbb{R}^2} Q_S(x) = \arg\min_{x \in \mathbb{R}^2} \left\| B_S - A_S x \right\|^2 = \left\{ x \; \middle| \; A_S^T A_S \, x = A_S^T B_S \right\}, \tag{35}$$

i.e., the solution set of the normal equations of the least-squares problem. Therefore, $Q_S(x)$ has a unique minimum point when $|S| \geq 4$. Henceforth, we write the notation $\arg\min_{x \in \mathbb{R}^2}$ simply as $\arg\min$, unless otherwise stated.
Due to the rank condition (33), the agents' cost functions satisfy the $(2f, \epsilon)$-redundancy property, stated in Definition 3, for a finite $\epsilon$ whose value we compute numerically. The steps used to compute $\epsilon$ are described below.

1. For each set $S \subset \{1, \ldots, 6\}$ with $|S| = n - f = 5$, compute the point $x_S \in \mathbb{R}^2$ that solves $A_S^T A_S \, x_S = A_S^T B_S$. Note that, due to (35), $x_S = \arg\min Q_S(x)$.

2. For each set $S \subset \{1, \ldots, 6\}$ with $|S| = n - f = 5$, do the following:

(a) For each set $\hat{S} \subseteq S$ with $|\hat{S}| \geq n - 2f = 4$, compute $x_{\hat{S}}$ such that $A_{\hat{S}}^T A_{\hat{S}} \, x_{\hat{S}} = A_{\hat{S}}^T B_{\hat{S}}$. Note that, due to (35), $x_{\hat{S}} = \arg\min Q_{\hat{S}}(x)$.

(b) Compute $\epsilon_S = \max_{\hat{S} \subseteq S, \, |\hat{S}| \geq 4} \left\| x_S - x_{\hat{S}} \right\|$. In this particular case, both the sets of minimum points $\arg\min Q_S(x)$ and $\arg\min Q_{\hat{S}}(x)$ are singletons, with points $x_S$ and $x_{\hat{S}}$, respectively. Therefore, $\left\| x_S - x_{\hat{S}} \right\| = {\rm dist}\left( \arg\min Q_S(x), \, \arg\min Q_{\hat{S}}(x) \right)$.

3. In the final step, compute $\epsilon = \max_{|S| = n - f} \epsilon_S$.
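This three-step procedure reduces to a short enumeration. The Python sketch below mirrors it (the data here are hypothetical stand-ins; the paper's actual values are those in (30)):

import numpy as np
from itertools import combinations

def lsq(A, B):
    # Unique minimum point of ||B_S - A_S x||^2 when A_S has full column rank.
    return np.linalg.lstsq(A, B, rcond=None)[0]

def compute_eps(A, B, n, f):
    eps = 0.0
    for S in combinations(range(n), n - f):                  # Steps 1-2
        x_S = lsq(A[list(S)], B[list(S)])
        for size in range(n - 2 * f, n - f + 1):
            for Sh in combinations(S, size):
                x_Sh = lsq(A[list(Sh)], B[list(Sh)])
                eps = max(eps, float(np.linalg.norm(x_S - x_Sh)))  # 2(b), 3
    return eps

# Hypothetical data with n = 6, d = 2 and small noise:
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))
B = A @ np.array([1.0, 1.0]) + 0.1 * rng.standard_normal(6)
print(compute_eps(A, B, n=6, f=1))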
For each agent $i$, the cost function $Q_i(x)$ has Lipschitz continuous gradients, i.e., satisfies Assumption 2, with Lipschitz coefficient

$$\mu = v_i \tag{36}$$

where $v_i$ denotes the largest eigenvalue of $A_i^T A_i$. Also, for every set of agents $S$ with $|S| = n - f = 5$, the average cost function $(1/|S|) \, Q_S(x)$ is strongly convex, i.e., satisfies Assumption 3, with strong convexity coefficient

$$\gamma = \frac{1}{|S|} v_S \tag{37}$$

where $v_S$ is the smallest eigenvalue of $A_S^T A_S$. Derivations of (36) and (37) can be found in [18, Section 10].

Simulation: In our experiments, we simulate the following fault behaviors for the faulty agent:

• gradient-reverse: the faulty agent reverses its true gradient. If the correct gradient of a faulty agent $i$ at iteration $t$ is $s^t_i$, agent $i$ sends the incorrect gradient $g^t_i = -s^t_i$ to the server.

• random: the faulty agent sends a randomly chosen vector in $\mathbb{R}^d$. In our experiments, the faulty agent in each iteration chooses an i.i.d. Gaussian random vector with mean $0$ and an isotropic covariance matrix with standard deviation $200$.

We simulate the distributed gradient-descent algorithm described in Section 4.1, assuming agent 1 to be Byzantine faulty. It should be noted that the identity of the faulty agent is not used in any way during the simulations. Here, the set of non-faulty agents is $\mathcal{H} = \{2, \ldots, 6\}$ with $|\mathcal{H}| = n - f = 5$; therefore, in this particular case, $\mathcal{H}$ is the only set of $n - f$ non-faulty agents. From (35), we obtain that the minimum point of the aggregate cost function $\sum_{i \in \mathcal{H}} Q_i(x)$, denoted by $x_{\mathcal{H}}$, is the unique solution of the normal equations $A_{\mathcal{H}}^T A_{\mathcal{H}} \, x_{\mathcal{H}} = A_{\mathcal{H}}^T B_{\mathcal{H}}$. Also, note from our earlier deductions in (36) and (37) that the non-faulty agents' cost functions satisfy Assumptions 2 and 3, with $\mu$ and $\gamma$ taking the corresponding computed values.
Parameters: We use the following parameters for implementing the algorithm. In the update rule (22), we use the step-size $\eta_t = 1.5/(t+1)$ for iteration $t = 0, 1, \ldots$. Note that this particular step-size is diminishing and satisfies the conditions $\sum_{t=0}^{\infty} \eta_t = \infty$ and $\sum_{t=0}^{\infty} \eta_t^2 = 3\pi^2/8 < \infty$ (see [30]). We assume the compact convex set $\mathcal{W} \subset \mathbb{R}^d$ to be a 2-dimensional hypercube centered at the origin, large enough that $x_{\mathcal{H}} \in \mathcal{W}$, i.e., Assumption 4 holds true. In all the simulation results presented below, the same fixed initial estimate $x^0 \in \mathcal{W}$ is used, the algorithm is run for $500$ iterations, and its output is $x_{\rm out} = x^{500}$. The outputs for the two gradient-filters, under the different fault behaviors, are shown in Table 1. Note that ${\rm dist}(x_{\mathcal{H}}, x_{\rm out}) = \| x_{\mathcal{H}} - x_{\rm out} \|$. The results for the case when the faulty agent sends random faulty gradients are shown for a randomly chosen execution.
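One simulated execution, tying together the parameters above with the filter sketches from Section 4.2, might look as follows (hypothetical data; cge_filter and cwtm_filter are the sketches given earlier):

import numpy as np

def run(A, B, grad_filter, fault="gradient-reverse", faulty=0, T=500):
    n, d = A.shape
    x = np.array([-0.5, -0.5])            # a fixed initial estimate x^0
    rng = np.random.default_rng(0)
    for t in range(T):
        g = 2 * A * (A @ x - B)[:, None]  # true gradients of (B_i - A_i x)^2
        if fault == "gradient-reverse":
            g[faulty] = -g[faulty]        # the faulty agent reverses its gradient
        elif fault == "random":
            g[faulty] = rng.normal(0.0, 200.0, size=d)
        x = np.clip(x - (1.5 / (t + 1)) * grad_filter(g), -1000.0, 1000.0)
    return x                               # x_out = x^500

# e.g., x_out = run(A, B, lambda g: cge_filter(g, f=1), fault="random")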
Conclusion: As shown in Table 1, in all executions the distance between $x_{\mathcal{H}}$ and the output of the algorithm, $x_{\rm out}$, for both the CGE and CWTM gradient-filters, is smaller than the computed redundancy parameter $\epsilon$. Figure 2 plots the aggregate non-faulty cost $\sum_{i \in \mathcal{H}} Q_i(x^t)$ (referred to as loss) and the approximation error $\| x^t - x_{\mathcal{H}} \|$ (referred to as distance) for iterations $t$ ranging from $0$ to $500$. We also show the plots of the fault-free distributed gradient-descent (DGD) method when all the agents are free from faults, and of the DGD method without any gradient-filter when agent 1 is Byzantine faulty. The details for iterations $t$ ranging from $0$ to $80$ are highlighted in Figure 3.

Table 1: Algorithm's outputs with gradient-filters CGE and CWTM, and the approximation errors, corresponding to executions when the faulty agent exhibits two different types of Byzantine faults, gradient-reverse and random. Recall that $x_{\rm out} = x^{500}$ and ${\rm dist}(x_{\mathcal{H}}, x_{\rm out}) = \| x_{\mathcal{H}} - x_{\rm out} \|$. For each filter (rows CGE and CWTM) and each fault type (column groups gradient-reverse and random), the table reports $x_{\rm out}$ and ${\rm dist}(x_{\mathcal{H}}, x_{\rm out})$.

Figure 2: The losses, i.e., $\sum_{i \in \mathcal{H}} Q_i(x^t)$, and distances, i.e., $\| x^t - x_{\mathcal{H}} \|$, versus the number of iterations in the algorithm. The final approximation errors, i.e., $\| x^{500} - x_{\mathcal{H}} \|$, are annotated in the same colors as their corresponding plots. For the executions shown, agent 1 is assumed to be Byzantine faulty. The two columns show the results when the faulty agent exhibits the different types of faults: (a) gradient-reverse, and (b) random. Apart from the plots with the CGE (in green) and CWTM (in yellow) gradient-filters, we also plot the fault-free distributed gradient-descent (DGD) method when all agents are free from faults (in blue), and the DGD method without any gradient-filter when agent 1 is Byzantine faulty (in red).

Figure 3: The losses, i.e., $\sum_{i \in \mathcal{H}} Q_i(x^t)$, and distances, i.e., $\| x^t - x_{\mathcal{H}} \|$, versus the number of iterations in the algorithm, magnified for the initial 80 iterations. The meaning of the plots is the same as in Figure 2.

6 Summary
In this paper, we have studied the problem of approximate Byzantine fault-tolerance, which is a generalization of the exact fault-tolerance problem studied in prior work [20]. Unlike exact fault-tolerance, the goal in approximate fault-tolerance is to design a distributed optimization algorithm that produces only an approximation of a minimum point of the aggregate cost function of at least $n - f$ non-faulty agents, in the presence of up to $f$ (out of $n$) Byzantine faulty agents.

We have defined approximate fault-tolerance formally as $(f, \epsilon)$-resilience, where $\epsilon \in \mathbb{R}_{\geq 0}$ denotes the approximation error. In the first part of the paper, i.e., Section 3, we have obtained necessary and sufficient conditions for the achievability of $(f, \epsilon)$-resilience. These results generalize the prior result which states that exact fault-tolerance is achievable if and only if the $2f$-redundancy property is satisfied [20, 21]. In the second part of the paper, i.e., Sections 4 and 5, we have considered the case when the agents' cost functions are differentiable. For this particular case, we have first derived a generic approximate fault-tolerance property of the distributed gradient-descent method when equipped with Byzantine robust gradient aggregation, or a gradient-filter. Then, we have obtained specific approximate fault-tolerance guarantees for two well-known gradient-filters: comparative gradient elimination (CGE) and coordinate-wise trimmed mean (CWTM). Finally, in Section 5, we have presented empirical results comparing the approximate fault-tolerance achieved by the two aforementioned gradient-filters.

Acknowledgments

Research reported in this paper was sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196, and by the National Science Foundation award 1842198. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory, the National Science Foundation, or the U.S. Government.
References

[1] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, 2018.
[2] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD with majority vote is communication efficient and Byzantine fault tolerant. arXiv preprint arXiv:1810.05291, 2018.
[3] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.
[4] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 119–129, 2017.
[5] Léon Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.
[6] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
[7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1), 2011.
[8] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] Xinyang Cao and Lifeng Lai. Distributed gradient descent algorithm robust to an arbitrary number of Byzantine attackers. IEEE Transactions on Signal Processing, 67(22):5850–5864, 2019.
[10] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47–60, 2017.
[11] Yuan Chen, Soummya Kar, and José M. F. Moura. Resilient distributed estimation through adversary detection. IEEE Transactions on Signal Processing, 66(9), 2018.
[12] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
[13] Michelle S. Chong, Masashi Wakaiki, and João P. Hespanha. Observability of linear systems under adversarial attacks. In American Control Conference, pages 2439–2444. IEEE, 2015.
[14] Georgios Damaskinos, Rachid Guerraoui, Rhicheek Patra, Mahsa Taziki, et al. Asynchronous Byzantine machine learning (the case of SGD). In International Conference on Machine Learning, pages 1153–1162, 2018.
[15] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
[16] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2011.
[17] Nirupam Gupta, Shuo Liu, and Nitin H. Vaidya. Byzantine fault-tolerant distributed machine learning using stochastic gradient descent (SGD) and norm-based comparative gradient elimination (CGE). arXiv preprint arXiv:2008.04699, 2020.
[18] Nirupam Gupta and Nitin H. Vaidya. Byzantine fault tolerant distributed linear regression. arXiv preprint arXiv:1903.08752, 2019.
[19] Nirupam Gupta and Nitin H. Vaidya. Byzantine fault-tolerant parallelized stochastic gradient descent for linear regression. Pages 415–420. IEEE, 2019.
[20] Nirupam Gupta and Nitin H. Vaidya. Fault-tolerance in distributed optimization: The case of redundancy. In Proceedings of the 39th Symposium on Principles of Distributed Computing, pages 365–374, 2020.
[21] Nirupam Gupta and Nitin H. Vaidya. Resilience in collaborative optimization: redundant and independent cost functions. arXiv preprint arXiv:2003.09675, 2020.
[22] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3), 1982.
[23] Nancy A. Lynch. Distributed Algorithms. Elsevier, 1996.
[24] James R. Munkres. Topology, 2000.
[25] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1), 2009.
[26] Miroslav Pajic, James Weimer, Nicola Bezzo, Paulo Tabuada, Oleg Sokolsky, Insup Lee, and George J. Pappas. Robustness of attack-resilient state estimators. In ICCPS'14: ACM/IEEE 5th International Conference on Cyber-Physical Systems (with CPS Week 2014), pages 163–174. IEEE Computer Society, 2014.
[27] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
[28] Michael Rabbat and Robert Nowak. Distributed optimization in sensor networks. In Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, pages 20–27, 2004.
[29] Robin L. Raffard, Claire J. Tomlin, and Stephen P. Boyd. Distributed optimization for cooperative agents: Application to formation flight. Volume 3, pages 2453–2459. IEEE, 2004.
[30] Walter Rudin. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1964.
[31] Lili Su and Shahin Shahrampour. Finite-time guarantees for Byzantine-resilient distributed state estimation with noisy measurements. arXiv preprint arXiv:1810.10086, 2018.
[32] Lili Su and Nitin H. Vaidya. Fault-tolerant multi-agent optimization: optimal iterative distributed algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, pages 425–434, 2016.
[33] Lili Su and Nitin H. Vaidya. Non-Bayesian learning in the presence of Byzantine agents. In International Symposium on Distributed Computing. Springer, 2016.
[34] Lili Su and Nitin H. Vaidya. Byzantine-resilient multi-agent optimization. IEEE Transactions on Automatic Control, 2020.
[35] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Generalized Byzantine-tolerant SGD. arXiv preprint arXiv:1802.10116, 2018.
[36] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5636–5645, 2018.
A Appendix: The Special Case of $(f, 0)$-Resilience

We show that $(f, 0)$-resilience, stated in Definition 2, and exact fault-tolerance, defined in Section 1.1, are equivalent in the deterministic framework. Specifically, we show that a deterministic $(f, 0)$-resilient algorithm has exact fault-tolerance, and vice versa. Throughout this appendix, $\arg\min_{x \in \mathbb{R}^d}$ is simply written as $\arg\min$, unless otherwise stated.

First, we show that $(f, 0)$-resilience implies exact fault-tolerance. Suppose that a deterministic algorithm $\Pi$ is $(f, 0)$-resilient. Consider an execution $E_{\mathcal{H}}$ of $\Pi$ wherein $\mathcal{H} \subseteq \{1, \ldots, n\}$ denotes the set of all the non-faulty agents, and let $\widehat{x}$ denote the output. Recall that, as there are at most $f$ faulty agents, $|\mathcal{H}| \geq n - f$. To prove that $\Pi$ has exact fault-tolerance, it suffices to show that, in execution $E_{\mathcal{H}}$, $\widehat{x}$ is a minimum point of the aggregate cost function of all non-faulty agents, $\sum_{i \in \mathcal{H}} Q_i(x)$.

By Definition 2 of $(f, 0)$-resilience, for every set $S \subseteq \mathcal{H}$ with $|S| = n - f$,
$$\widehat{x} \in \arg\min \sum_{i \in S} Q_i(x).$$
Therefore, for every set $S$ with $S \subseteq \mathcal{H}$ and $|S| = n - f$,
$$\sum_{i \in S} Q_i(\widehat{x}) \leq \sum_{i \in S} Q_i(x), \quad \forall x \in \mathbb{R}^d. \tag{38}$$
Now, note that there are $\binom{|\mathcal{H}|}{n - f}$ subsets of $\mathcal{H}$ of size $n - f$, and each agent $i \in \mathcal{H}$ is contained in $\binom{|\mathcal{H}| - 1}{n - f - 1}$ of those subsets. Therefore,
$$\sum_{S \subseteq \mathcal{H},\, |S| = n - f} \; \sum_{i \in S} Q_i(x) = \binom{|\mathcal{H}| - 1}{n - f - 1} \sum_{i \in \mathcal{H}} Q_i(x). \tag{39}$$
Substituting from (38) in (39) we obtain that
$$\sum_{i \in \mathcal{H}} Q_i(\widehat{x}) \leq \sum_{i \in \mathcal{H}} Q_i(x), \quad \forall x \in \mathbb{R}^d.$$
The above implies that
$$\widehat{x} \in \arg\min \sum_{i \in \mathcal{H}} Q_i(x).$$
This proves that $\Pi$ has exact fault-tolerance in execution $E_{\mathcal{H}}$.
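The double-counting identity (39) is straightforward to verify numerically. The following is a minimal sketch (our own illustration, not part of the formal argument) that fixes hypothetical values $n = 7$, $f = 2$, draws arbitrary cost values $Q_i(x)$ at a fixed point $x$, and checks the identity:

```python
# Sanity check of (39): over all (n-f)-subsets S of H, each agent i in H
# is counted exactly C(|H|-1, n-f-1) times. All parameter values are
# hypothetical, chosen only for illustration.
from itertools import combinations
from math import comb

import numpy as np

rng = np.random.default_rng(0)
n, f = 7, 2
H = range(6)                      # a non-faulty set with |H| = 6 >= n - f
q = rng.normal(size=len(H))       # Q_i(x) at some fixed x, one value per agent

lhs = sum(sum(q[i] for i in S) for S in combinations(H, n - f))
rhs = comb(len(H) - 1, n - f - 1) * q.sum()
assert np.isclose(lhs, rhs)
```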
Now, we show that exact fault-tolerance implies $(f, 0)$-resilience. Suppose that a deterministic algorithm $\Pi$ has exact fault-tolerance. Consider an execution $E_{\mathcal{H}}$ of $\Pi$ wherein set $\mathcal{H}$ comprises all the non-faulty agents, and $\widehat{x}$ is its output. Therefore,
$$\widehat{x} \in \arg\min \sum_{i \in \mathcal{H}} Q_i(x).$$
To prove that $\Pi$ is $(f, 0)$-resilient, it suffices to show that, in execution $E_{\mathcal{H}}$, for every set $S \subseteq \mathcal{H}$ with $|S| = n - f$, $\widehat{x}$ is a minimum point of the aggregate cost function $\sum_{i \in S} Q_i(x)$. This is trivially true when $|\mathcal{H}| = n - f$. We assume below that $|\mathcal{H}| > n - f$.

Consider an arbitrary subset $S$ of $\mathcal{H}$ with $|S| = n - f$. Consider an execution $E_S$ wherein $S$ is the set of all non-faulty agents, with the remaining agents in $\{1, \ldots, n\} \setminus S$ being Byzantine faulty. Suppose that the inputs from all the agents to the server in $E_S$ are identical to their inputs in $E_{\mathcal{H}}$. Therefore, as $\Pi$ is a deterministic algorithm, its output in execution $E_S$ is the same as that in execution $E_{\mathcal{H}}$, i.e., $\widehat{x}$. Moreover, as $\Pi$ is assumed to have exact fault-tolerance,
$$\widehat{x} \in \arg\min \sum_{i \in S} Q_i(x).$$
As $S$ is an arbitrary subset of $\mathcal{H}$ with $|S| = n - f$, the above proves that $\Pi$ is $(f, 0)$-resilient in execution $E_{\mathcal{H}}$.

B Appendix: Proof of $\gamma \leq \mu$

We show below that if Assumptions 2 and 3 hold true simultaneously then $\gamma \leq \mu$. Consider an arbitrary set of $n - f$ non-faulty agents $\mathcal{H}$, and two arbitrary non-identical points $x, y \in \mathbb{R}^d$, i.e., $x \neq y$. If Assumption 2 holds true then
$$\|\nabla Q_i(x) - \nabla Q_i(y)\| \leq \mu \|x - y\|, \quad \forall i \in \mathcal{H}.$$
Therefore, owing to the Cauchy-Schwartz inequality, for all $i \in \mathcal{H}$,
$$\langle x - y, \, \nabla Q_i(x) - \nabla Q_i(y) \rangle \leq \|x - y\| \, \|\nabla Q_i(x) - \nabla Q_i(y)\| \leq \mu \|x - y\|^2. \tag{40}$$
From (40) we obtain that
$$\sum_{i \in \mathcal{H}} \langle x - y, \, \nabla Q_i(x) - \nabla Q_i(y) \rangle \leq \mu \, |\mathcal{H}| \, \|x - y\|^2. \tag{41}$$
If Assumption 3 holds true then
$$\sum_{i \in \mathcal{H}} \langle x - y, \, \nabla Q_i(x) - \nabla Q_i(y) \rangle \geq \gamma \, |\mathcal{H}| \, \|x - y\|^2. \tag{42}$$
As $x, y$ are arbitrary non-identical points, (41) and (42) together imply that $\gamma \leq \mu$.
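For intuition, the inequality $\gamma \leq \mu$ is easy to check numerically on simple quadratic costs, where the two coefficients reduce to extreme eigenvalues of Hessian matrices. A small sketch (ours; the quadratic form and all constants are assumptions for illustration):

```python
# For quadratics Q_i(x) = 0.5 * x^T A_i x, each gradient A_i x is
# lambda_max(A_i)-Lipschitz, and the average cost is
# lambda_min(mean(A_i))-strongly convex; the proof above says the
# former coefficient dominates the latter.
import numpy as np

rng = np.random.default_rng(1)
d, m = 4, 6
A = []
for _ in range(m):
    B = rng.normal(size=(d, d))
    A.append(B.T @ B + 0.1 * np.eye(d))    # random positive definite Hessian

mu = max(np.linalg.eigvalsh(a)[-1] for a in A)      # Assumption 2 coefficient
gamma = np.linalg.eigvalsh(np.mean(A, axis=0))[0]   # Assumption 3 coefficient
assert gamma <= mu
```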
C Appendix: Proof of Theorem 3

The proof of Theorem 3 relies on the following sufficient criterion for the convergence of non-negative sequences.

Lemma 2 (Bottou, 1998 [5]). Consider a sequence of real values $\{u_t, \, t = 0, 1, \ldots\}$. If $u_t \geq 0$ for all $t$, then
$$\sum_{t=0}^{\infty} (u_{t+1} - u_t)_+ < \infty \implies \begin{cases} u_t \xrightarrow{t \to \infty} u_\infty < \infty, \\ \sum_{t=0}^{\infty} (u_{t+1} - u_t)_- > -\infty, \end{cases} \tag{43}$$
where the operators $(\cdot)_+$ and $(\cdot)_-$ are defined as follows for a real scalar $x$:
$$(x)_+ = \begin{cases} x, & x > 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (x)_- = \begin{cases} 0, & x > 0, \\ x, & \text{otherwise}. \end{cases}$$

Recall from the statement of Theorem 3 that $x^* \in \mathcal{W}$ where $\mathcal{W}$ is a compact convex set. We define, for all $t \in \{0, 1, \ldots\}$,
$$e_t = \left\|x^t - x^*\right\|. \tag{44}$$
Next, we define a univariate real-valued function $\psi: \mathbb{R} \to \mathbb{R}$:
$$\psi(x) = \begin{cases} 0, & x < (D^*)^2, \\ \left(x - (D^*)^2\right)^2, & x \geq (D^*)^2. \end{cases} \tag{45}$$
Let $\psi'(x)$ denote the derivative of $\psi$ at $x$. Specifically,
$$\psi'(x) = 2 \max\left\{0, \; x - (D^*)^2\right\}. \tag{46}$$
We show below that $\psi'(x)$ is a Lipschitz continuous function with Lipschitz coefficient of 2. From (46), we obtain that
$$\left|\psi'(x) - \psi'(y)\right| = \begin{cases} 2\,|x - y|, & \text{both } x, y \geq (D^*)^2, \\ 2\left|x - (D^*)^2\right|, & x \geq (D^*)^2, \; y < (D^*)^2, \\ 0, & \text{both } x, y < (D^*)^2. \end{cases} \tag{47}$$
Note from (47) that for the case when $x \geq (D^*)^2$ and $y < (D^*)^2$, $\left|\psi'(x) - \psi'(y)\right| = 2\left|x - (D^*)^2\right| < 2\,|x - y|$. Similarly, due to symmetry, when $x < (D^*)^2$ and $y \geq (D^*)^2$, then $\left|\psi'(x) - \psi'(y)\right| = 2\left|y - (D^*)^2\right| < 2\,|x - y|$. Therefore, from (47) we obtain that
$$\left|\psi'(x) - \psi'(y)\right| \leq 2\,|x - y|, \quad \forall x, y \in \mathbb{R}. \tag{48}$$
The Lipschitz continuity of $\psi'(x)$, shown in (48), implies that [6, Section 4.1]
$$\psi(y) - \psi(x) \leq (y - x)\,\psi'(x) + (y - x)^2, \quad \forall x, y \in \mathbb{R}. \tag{49}$$
Now, for each $t \in \{0, 1, \ldots\}$, we define
$$h_t = \psi\left(e_t^2\right). \tag{50}$$
From (49) and (50), for all $t$, we obtain that
$$h_{t+1} - h_t = \psi\left(e_{t+1}^2\right) - \psi\left(e_t^2\right) \leq \left(e_{t+1}^2 - e_t^2\right)\psi'\left(e_t^2\right) + \left(e_{t+1}^2 - e_t^2\right)^2.$$
We use $\psi'_t$ as a shorthand for $\psi'\left(e_t^2\right)$. From above we have
$$h_{t+1} - h_t \leq \left(e_{t+1}^2 - e_t^2\right)\psi'_t + \left(e_{t+1}^2 - e_t^2\right)^2. \tag{51}$$
Now, recall from (22) that for all $t \in \{0, 1, \ldots\}$,
$$x^{t+1} = \left[x^t - \eta_t \, \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right]_{\mathcal{W}}. \tag{52}$$
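In code, the update rule (52) is one step of projected descent on a filtered aggregate of the reported gradients. A minimal sketch (ours), assuming for concreteness that $\mathcal{W}$ is a box so the Euclidean projection is coordinate-wise clipping, and leaving the filter as a pluggable function:

```python
# One iteration of (52): x^{t+1} = [ x^t - eta_t * GradFilter(g_1,...,g_n) ]_W.
# The box-shaped W and the generic grad_filter argument are our assumptions,
# made only so the sketch is concrete and runnable.
import numpy as np

def project_box(x, lo, hi):
    # Euclidean projection onto W = [lo, hi]^d is coordinate-wise clipping.
    return np.clip(x, lo, hi)

def filtered_descent_step(x, grads, eta, grad_filter, lo=-10.0, hi=10.0):
    g = grad_filter(grads)          # robust aggregate of the n reported gradients
    return project_box(x - eta * g, lo, hi)
```

Any of the gradient-filters analysed in Appendices D and E below (CGE, CWTM) can be passed in as `grad_filter`.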
By the non-expansion property of Euclidean projection onto a closed convex set,
$$\left\|x^{t+1} - x^*\right\| \leq \left\|x^t - x^* - \eta_t \, \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|.$$
Recall from (44) that $e_t$ denotes $\left\|x^t - x^*\right\|$ for all $t$. Upon squaring both sides of the above inequality, we obtain that
$$e_{t+1}^2 \leq e_t^2 - 2\eta_t \left\langle x^t - x^*, \, \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\rangle + \eta_t^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2. \tag{53}$$
Recall, from (23) in the statement of Theorem 3, that
$$\phi_t = \left\langle x^t - x^*, \, \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\rangle, \quad \forall t.$$
Substituting from the above in (53), we obtain that
$$e_{t+1}^2 \leq e_t^2 - 2\eta_t \phi_t + \eta_t^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2. \tag{54}$$
As $\psi'_t \geq 0$ for all $t$, substituting from (54) in (51) we get
$$h_{t+1} - h_t \leq \left(-2\eta_t \phi_t + \eta_t^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2\right)\psi'_t + \left(e_{t+1}^2 - e_t^2\right)^2. \tag{55}$$
Note that for an arbitrary $t$,
$$\left|e_{t+1}^2 - e_t^2\right| = \left(e_{t+1} + e_t\right)\left|e_{t+1} - e_t\right|. \tag{56}$$
As $\mathcal{W}$ is assumed compact, there exists $\Gamma = \max_{x \in \mathcal{W}} \|x - x^*\| < \infty$. We assume $\Gamma > 0$; otherwise $\mathcal{W} = \{x^*\}$ and the theorem is trivial.
Recall from the update rule (22), restated above in (52), that $x^t \in \mathcal{W}$ for all $t$, and that $x^* \in \mathcal{W}$. Therefore,
$$e_t = \left\|x^t - x^*\right\| \leq \max_{x \in \mathcal{W}} \|x - x^*\| = \Gamma, \quad \forall t. \tag{57}$$
From (57), for all $t$, we obtain that
$$e_{t+1} + e_t \leq 2\Gamma.$$
Substituting from above in (56) implies that
$$\left|e_{t+1}^2 - e_t^2\right| \leq 2\Gamma \left|e_{t+1} - e_t\right|, \quad \forall t. \tag{58}$$
From the triangle inequality, $\left|e_{t+1} - e_t\right| = \left|\left\|x^{t+1} - x^*\right\| - \left\|x^t - x^*\right\|\right| \leq \left\|x^{t+1} - x^t\right\|$. Thus,
$$\left|e_{t+1}^2 - e_t^2\right| \leq 2\Gamma \left\|x^{t+1} - x^t\right\|. \tag{59}$$
Due to the non-expansion property of Euclidean projection onto a closed convex set, from (52) we obtain that
$$\left\|x^{t+1} - x^t\right\| = \left\|\left[x^t - \eta_t\,\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right]_{\mathcal{W}} - x^t\right\| \leq \eta_t \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|.$$
Substituting from above in (59) we obtain that
$$\left|e_{t+1}^2 - e_t^2\right| \leq 2\eta_t \Gamma \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|.$$
Thus,
$$\left(e_{t+1}^2 - e_t^2\right)^2 \leq 4\eta_t^2 \Gamma^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2. \tag{60}$$
Substituting from (60) in (55) we obtain that, for all $t$,
$$h_{t+1} - h_t \leq \left(-2\eta_t\phi_t + \eta_t^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2\right)\psi'_t + 4\eta_t^2\Gamma^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2 = -2\eta_t\phi_t\psi'_t + \eta_t^2\left(\psi'_t + 4\Gamma^2\right)\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2. \tag{61}$$
Recall from (57) that $e_t \leq \Gamma$. Also, by assumption, $D^* < \max_{x \in \mathcal{W}} \|x - x^*\| = \Gamma$. Recall that $\psi'_t$ is short for $\psi'\left(e_t^2\right)$. Therefore, from (46) we obtain that
$$0 \leq \psi'_t \leq 2\left(\Gamma^2 - (D^*)^2\right) \leq 2\Gamma^2, \quad \forall t. \tag{62}$$
As the statement of Theorem 3 assumes that $\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| < \infty$ for all $t$, there exists a real value $M < \infty$ such that
$$\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| \leq M, \quad \forall t. \tag{63}$$
Substituting from (62) and (63) in (61) we obtain that
$$h_{t+1} - h_t \leq -2\eta_t\phi_t\psi'_t + 6\eta_t^2\Gamma^2 M^2. \tag{64}$$
We now use Lemma 2 to prove that $h_\infty = 0$, as follows. For an iteration $t$, we consider below the two possible cases: (i) $e_t < D^*$, and (ii) $e_t = D^* + \delta$ for some $\delta \geq 0$.

Case i) In this case, $\psi'_t = 0$. Therefore, due to the Cauchy-Schwartz inequality,
$$|\phi_t| = \left|\left\langle x^t - x^*, \, \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\rangle\right| \leq e_t \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|.$$
Substituting from (63) above, we obtain that $|\phi_t| \leq \Gamma M < \infty$. Therefore,
$$\phi_t\psi'_t = 0. \tag{65}$$
Case ii) In this particular case, from (46), we obtain that
$$\psi'_t = 2\left((D^* + \delta)^2 - (D^*)^2\right) = 2\delta\left(2D^* + \delta\right).$$
Now, by assumption, $\phi_t \geq \xi$ when $e_t \geq D^*$, where $\xi > 0$.
Therefore,
$$\phi_t\psi'_t \geq 2\xi\delta\left(2D^* + \delta\right) \geq 0. \tag{66}$$
From (65) and (66) above, we obtain that
$$\phi_t\psi'_t \geq 0, \quad \forall t. \tag{67}$$
Substituting the above in (64) implies that
$$h_{t+1} - h_t \leq 6\eta_t^2\Gamma^2 M^2, \quad \forall t.$$
Recall the notation $(\cdot)_+$ from Lemma 2. The above inequality implies that
$$\left(h_{t+1} - h_t\right)_+ \leq 6\eta_t^2\Gamma^2 M^2.$$
As $\sum_{t=0}^{\infty}\eta_t^2 < \infty$ by assumption, and the constants $\Gamma, M < \infty$, the above implies that
$$\sum_{t=0}^{\infty}\left(h_{t+1} - h_t\right)_+ \leq 6\Gamma^2 M^2 \sum_{t=0}^{\infty}\eta_t^2 < \infty.$$
As $h_t \geq 0$ for all $t$, the above in conjunction with Lemma 2 implies that
$$h_t \xrightarrow{t\to\infty} h_\infty < \infty, \quad \text{and} \quad \sum_{t=0}^{\infty}\left(h_{t+1} - h_t\right)_- > -\infty. \tag{68}$$
Note that $h_\infty - h_0 = \sum_{t=0}^{\infty}\left(h_{t+1} - h_t\right)$. Thus, from (64) we obtain that
$$h_\infty - h_0 \leq -2\sum_{t=0}^{\infty}\eta_t\phi_t\psi'_t + 6\Gamma^2 M^2 \sum_{t=0}^{\infty}\eta_t^2. \tag{69}$$
By definition (50), $h_t \geq 0$ for all $t$. Therefore, from (69) above we obtain that
$$2\left|\sum_{t=0}^{\infty}\eta_t\phi_t\psi'_t\right| \leq h_0 + h_\infty + 6\Gamma^2 M^2 \sum_{t=0}^{\infty}\eta_t^2. \tag{70}$$
By assumption, $\sum_{t=0}^{\infty}\eta_t^2 < \infty$. From (68), $h_\infty < \infty$. Substituting from (57) that $e_t < \infty$ for all $t$ in the definition of $h_t$ (50), we obtain that $h_0 = \psi\left(e_0^2\right) < \infty$. Therefore, (70) implies that
$$2\left|\sum_{t=0}^{\infty}\eta_t\phi_t\psi'_t\right| < \infty.$$
Recall from (67) that $\phi_t\psi'_t \geq 0$ for all $t$. Thus, from above we obtain that
$$\sum_{t=0}^{\infty}\eta_t\phi_t\psi'_t < \infty. \tag{71}$$
Finally, we reason below by contradiction that $h_\infty = 0$.
Note that for any $\zeta > 0$, there exists a unique positive value $\beta$ such that $\zeta = 2\beta\left(2D^* + \sqrt{\beta}\right)^2$. Suppose that $h_\infty = 2\beta\left(2D^* + \sqrt{\beta}\right)^2$ for some positive value $\beta$. As the sequence $\{h_t\}_{t=0}^{\infty}$ converges to $h_\infty$ (see (68)), there exists some finite $\tau \in \mathbb{Z}_{\geq 0}$ such that for all $t \geq \tau$,
$$\left|h_t - h_\infty\right| \leq \beta\left(2D^* + \sqrt{\beta}\right)^2 \implies h_t \geq h_\infty - \beta\left(2D^* + \sqrt{\beta}\right)^2.$$
As $h_\infty = 2\beta\left(2D^* + \sqrt{\beta}\right)^2$, the above implies that
$$h_t \geq \beta\left(2D^* + \sqrt{\beta}\right)^2, \quad \forall t \geq \tau. \tag{72}$$
Therefore (cf. (45) and (50)), for all $t \geq \tau$,
$$\left(e_t^2 - (D^*)^2\right)^2 \geq \beta\left(2D^* + \sqrt{\beta}\right)^2, \quad \text{or} \quad \left|e_t^2 - (D^*)^2\right| \geq \sqrt{\beta}\left(2D^* + \sqrt{\beta}\right).$$
Thus, for each $t \geq \tau$, either
$$e_t^2 \geq (D^*)^2 + \sqrt{\beta}\left(2D^* + \sqrt{\beta}\right) = \left(D^* + \sqrt{\beta}\right)^2, \tag{73}$$
or
$$e_t^2 \leq (D^*)^2 - \sqrt{\beta}\left(2D^* + \sqrt{\beta}\right) < (D^*)^2. \tag{74}$$
If the latter, i.e., (74), holds true for some $t' \geq \tau$, then $h_{t'} = \psi\left(e_{t'}^2\right) = 0$, which contradicts (72). Therefore, (72) implies (73) for all $t \geq \tau$.

From the above we obtain that if $h_\infty = 2\beta\left(2D^* + \sqrt{\beta}\right)^2$, then there exists $\tau < \infty$ such that for all $t \geq \tau$,
$$e_t \geq D^* + \sqrt{\beta}.$$
Thus, from (66) we obtain that
$$\phi_t\psi'_t \geq 2\xi\sqrt{\beta}\left(2D^* + \sqrt{\beta}\right), \quad \forall t \geq \tau.$$
Therefore, as $\sum_{t=0}^{\infty}\eta_t = \infty$ by assumption,
$$\sum_{t=\tau}^{\infty}\eta_t\phi_t\psi'_t \geq 2\xi\sqrt{\beta}\left(2D^* + \sqrt{\beta}\right)\sum_{t=\tau}^{\infty}\eta_t = \infty.$$
This is a contradiction to (71). Therefore, $h_\infty = 0$, and by the definition of $h_t$ in (50),
$$h_\infty = \lim_{t\to\infty}\psi\left(e_t^2\right) = 0.$$
Hence, by the definition of $\psi(\cdot)$ in (45),
$$\lim_{t\to\infty}\left\|x^t - x^*\right\| \leq D^*.$$
D Appendix: Proof of Theorem 4

In this section we present the proof of Theorem 4. Throughout this section we assume $f > 0$; the case $f = 0$ is trivial.

Consider an arbitrary set $\mathcal{H}$ of non-faulty agents with $|\mathcal{H}| = n - f$. Recall that under Assumptions 3 and 4, the aggregate cost function $\sum_{i \in \mathcal{H}} Q_i(x)$ has a unique minimum point in set $\mathcal{W}$, which we denote by $x_{\mathcal{H}}$. Specifically,
$$x_{\mathcal{H}} = \left(\arg\min_{x \in \mathbb{R}^d} \sum_{i \in \mathcal{H}} Q_i(x)\right) \cap \mathcal{W}. \tag{75}$$
To prove the theorem we make use of the result stated in Theorem 3. Specifically, we show that the CGE gradient-filter satisfies the conditions of Theorem 3 for $D^* = D\epsilon$ and $x^* = x_{\mathcal{H}}$. The rest follows easily from the convergence result stated in Theorem 3. Recall from (24) that for the CGE gradient-filter, in update rule (22),
$$\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) = \sum_{j=1}^{n-f} g_{i_j}^t, \quad \forall t. \tag{76}$$
First, we show that $\left\|\sum_{j=1}^{n-f} g_{i_j}^t\right\|$ is finite for all $t$.

Consider a subset $S \subset \mathcal{H}$ with $|S| = n - 2f$. The triangle inequality implies that
$$\left\|\sum_{j \in S} \nabla Q_j(x) - \sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq \sum_{j \in S} \left\|\nabla Q_j(x) - \nabla Q_j(x_{\mathcal{H}})\right\|, \quad \forall x \in \mathbb{R}^d.$$
Under Assumption 2, i.e., Lipschitz continuity of non-faulty gradients, for each non-faulty agent $j$, $\left\|\nabla Q_j(x) - \nabla Q_j(x_{\mathcal{H}})\right\| \leq \mu\|x - x_{\mathcal{H}}\|$. Substituting this above implies that
$$\left\|\sum_{j \in S} \nabla Q_j(x) - \sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq |S|\,\mu\,\|x - x_{\mathcal{H}}\|. \tag{77}$$
As $|S| = n - 2f$, the $(2f, \epsilon)$-redundancy property stated in Definition 3 implies that, for all $x \in \arg\min_x \sum_{j \in S} Q_j(x)$,
$$\|x - x_{\mathcal{H}}\| \leq \epsilon.$$
Substituting from above in (77) implies that, for all $x \in \arg\min_x \sum_{j \in S} Q_j(x)$,
$$\left\|\sum_{j \in S} \nabla Q_j(x) - \sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq |S|\,\mu\,\|x - x_{\mathcal{H}}\| \leq |S|\,\mu\epsilon. \tag{78}$$
For all $x \in \arg\min_x \sum_{j \in S} Q_j(x)$, $\sum_{j \in S} \nabla Q_j(x) = 0$. Thus, (78) implies that
$$\left\|\sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq |S|\,\mu\epsilon. \tag{79}$$
Now, consider an arbitrary non-faulty agent $i \in \mathcal{H} \setminus S$. Let $S_1 = S \cup \{i\}$. Using similar arguments as above, we obtain that under the $(2f, \epsilon)$-redundancy property and Assumption 2, for all $x \in \arg\min_x \sum_{j \in S_1} Q_j(x)$,
$$\left\|\sum_{j \in S_1} \nabla Q_j(x_{\mathcal{H}})\right\| = \left\|\sum_{j \in S_1} \nabla Q_j(x) - \sum_{j \in S_1} \nabla Q_j(x_{\mathcal{H}})\right\| \leq |S_1|\,\mu\epsilon. \tag{80}$$
Note that $\sum_{j \in S_1} \nabla Q_j(x) = \sum_{j \in S} \nabla Q_j(x) + \nabla Q_i(x)$.
From the triangle inequality,
$$\left\|\nabla Q_i(x_{\mathcal{H}})\right\| - \left\|\sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq \left\|\sum_{j \in S} \nabla Q_j(x_{\mathcal{H}}) + \nabla Q_i(x_{\mathcal{H}})\right\|. \tag{81}$$
Therefore, for each non-faulty agent $i \in \mathcal{H}$,
$$\left\|\nabla Q_i(x_{\mathcal{H}})\right\| \leq \left\|\sum_{j \in S} \nabla Q_j(x_{\mathcal{H}}) + \nabla Q_i(x_{\mathcal{H}})\right\| + \left\|\sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq |S_1|\,\mu\epsilon + |S|\,\mu\epsilon = (n - 2f + 1)\,\mu\epsilon + (n - 2f)\,\mu\epsilon = (2n - 4f + 1)\,\mu\epsilon. \tag{82}$$
Now, for all $x$ and $i \in \mathcal{H}$, by Assumption 2, $\left\|\nabla Q_i(x) - \nabla Q_i(x_{\mathcal{H}})\right\| \leq \mu\|x - x_{\mathcal{H}}\|$. By the triangle inequality, $\left\|\nabla Q_i(x)\right\| \leq \left\|\nabla Q_i(x_{\mathcal{H}})\right\| + \mu\|x - x_{\mathcal{H}}\|$. Substituting from (82) above, and recalling that $f \geq 1$, we obtain that
$$\left\|\nabla Q_i(x)\right\| \leq (2n - 4f + 1)\,\mu\epsilon + \mu\|x - x_{\mathcal{H}}\| \leq 2n\mu\epsilon + \mu\|x - x_{\mathcal{H}}\|. \tag{83}$$
We use the above inequality (83) to show below that $\left\|\sum_{j=1}^{n-f} g_{i_j}^t\right\|$ is bounded for all $t$. Recall that for each iteration $t$,
$$\left\|g_{i_1}^t\right\| \leq \ldots \leq \left\|g_{i_{n-f}}^t\right\| \leq \left\|g_{i_{n-f+1}}^t\right\| \leq \ldots \leq \left\|g_{i_n}^t\right\|.$$
As there are at most $f$ Byzantine agents, for each $t$ there exists $\sigma_t \in \mathcal{H}$ such that
$$\left\|g_{i_{n-f}}^t\right\| \leq \left\|g_{\sigma_t}^t\right\|. \tag{84}$$
As $g_j^t = \nabla Q_j(x^t)$ for all $j \in \mathcal{H}$, from (84) we obtain that
$$\left\|g_{i_j}^t\right\| \leq \left\|\nabla Q_{\sigma_t}(x^t)\right\|, \quad \forall j \in \{1, \ldots, n - f\}, \; \forall t.$$
Substituting from (83) above, we obtain that for every $j \in \{1, \ldots, n - f\}$,
$$\left\|g_{i_j}^t\right\| \leq \left\|g_{i_{n-f}}^t\right\| \leq 2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|.$$
Therefore,
$$\left\|\sum_{j=1}^{n-f} g_{i_j}^t\right\| \leq \sum_{j=1}^{n-f} \left\|g_{i_j}^t\right\| \leq (n - f)\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right). \tag{85}$$
Recall from (75) that $x_{\mathcal{H}} \in \mathcal{W}$. Let $\Gamma = \max_{x \in \mathcal{W}} \|x - x_{\mathcal{H}}\|$. As $\mathcal{W}$ is a compact set, $\Gamma < \infty$. Recall from the update rule (22) that $x^t \in \mathcal{W}$ for all $t$. Thus, $\left\|x^t - x_{\mathcal{H}}\right\| \leq \max_{x \in \mathcal{W}} \|x - x_{\mathcal{H}}\| = \Gamma < \infty$. Substituting this in (85) implies that
$$\left\|\sum_{j=1}^{n-f} g_{i_j}^t\right\| \leq (n - f)\left(2n\mu\epsilon + \mu\Gamma\right) < \infty. \tag{86}$$
Recall that in this particular case, $\sum_{j=1}^{n-f} g_{i_j}^t = \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)$ (see (76)). Therefore, from above we obtain that
$$\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| < \infty, \quad \forall t. \tag{87}$$
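Concretely, the CGE rule (76) sorts the $n$ reported gradients by Euclidean norm and sums the $n - f$ smallest. A minimal sketch (ours, written only to make the rule above concrete):

```python
# CGE gradient-filter per (76): realize the ordering ||g_{i_1}|| <= ... <= ||g_{i_n}||
# by sorting on norms, discard the f gradients of largest norm, and sum the rest.
import numpy as np

def cge_filter(grads, f):
    order = np.argsort([np.linalg.norm(g) for g in grads])
    return np.sum([grads[j] for j in order[: len(grads) - f]], axis=0)
```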
Next, we show that for an arbitrary $\delta > 0$ there exists $\xi > 0$ such that
$$\phi_t \triangleq \left\langle x^t - x_{\mathcal{H}}, \; \sum_{j=1}^{n-f} g_{i_j}^t \right\rangle \geq \xi \quad \text{when} \quad \left\|x^t - x_{\mathcal{H}}\right\| \geq D\epsilon + \delta.$$
Consider an arbitrary iteration $t$. Note that, as $|\mathcal{H}| = n - f$, there are at least $n - 2f$ agents that are common to both sets $\mathcal{H}$ and $\{i_1, \ldots, i_{n-f}\}$. We let $\mathcal{H}_t = \{i_1, \ldots, i_{n-f}\} \cap \mathcal{H}$. The remaining set of agents $\mathcal{B}_t = \{i_1, \ldots, i_{n-f}\} \setminus \mathcal{H}_t$ comprises only faulty agents. Note that $\left|\mathcal{H}_t\right| \geq n - 2f$ and $\left|\mathcal{B}_t\right| \leq f$. Therefore,
$$\phi_t = \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}_t} g_j^t \right\rangle + \left\langle x^t - x_{\mathcal{H}}, \sum_{k \in \mathcal{B}_t} g_k^t \right\rangle. \tag{88}$$
Consider the first term on the right-hand side of (88). Note that
$$\left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}_t} g_j^t \right\rangle = \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}_t} g_j^t + \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} g_j^t - \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} g_j^t \right\rangle = \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}} g_j^t \right\rangle - \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} g_j^t \right\rangle.$$
Recall that $g_j^t = \nabla Q_j(x^t)$ for all $j \in \mathcal{H}$. Therefore,
$$\left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}_t} g_j^t \right\rangle = \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}} \nabla Q_j(x^t) \right\rangle - \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \nabla Q_j(x^t) \right\rangle. \tag{89}$$
Due to the strong convexity assumption (i.e., Assumption 3), for all $x, y \in \mathbb{R}^d$,
$$\left\langle x - y, \; \nabla \sum_{j \in \mathcal{H}} Q_j(x) - \nabla \sum_{j \in \mathcal{H}} Q_j(y) \right\rangle \geq |\mathcal{H}|\,\gamma\,\|x - y\|^2.$$
As $x_{\mathcal{H}}$ is a minimum point of $\sum_{j \in \mathcal{H}} Q_j(x)$, $\nabla \sum_{j \in \mathcal{H}} Q_j(x_{\mathcal{H}}) = 0$. Thus,
$$\left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}} \nabla Q_j(x^t) \right\rangle = \left\langle x^t - x_{\mathcal{H}}, \; \nabla \sum_{j \in \mathcal{H}} Q_j(x^t) - \nabla \sum_{j \in \mathcal{H}} Q_j(x_{\mathcal{H}}) \right\rangle \geq |\mathcal{H}|\,\gamma\left\|x^t - x_{\mathcal{H}}\right\|^2. \tag{90}$$
Now, due to the Cauchy-Schwartz inequality,
$$\left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \nabla Q_j(x^t) \right\rangle = \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\langle x^t - x_{\mathcal{H}}, \nabla Q_j(x^t) \right\rangle \leq \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\|x^t - x_{\mathcal{H}}\right\| \left\|\nabla Q_j(x^t)\right\|. \tag{91}$$
Substituting from (90) and (91) in (89) we obtain that
$$\left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}_t} g_j^t \right\rangle \geq \gamma\,|\mathcal{H}|\left\|x^t - x_{\mathcal{H}}\right\|^2 - \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\|x^t - x_{\mathcal{H}}\right\| \left\|\nabla Q_j(x^t)\right\|. \tag{92}$$
Next, we consider the second term on the right-hand side of (88). From the Cauchy-Schwartz inequality,
$$\left\langle x^t - x_{\mathcal{H}}, g_k^t \right\rangle \geq -\left\|x^t - x_{\mathcal{H}}\right\| \left\|g_k^t\right\|.$$
Substituting from (92) and the above in (88) we obtain that
$$\phi_t \geq \gamma\,|\mathcal{H}|\left\|x^t - x_{\mathcal{H}}\right\|^2 - \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\|x^t - x_{\mathcal{H}}\right\| \left\|\nabla Q_j(x^t)\right\| - \sum_{k \in \mathcal{B}_t} \left\|x^t - x_{\mathcal{H}}\right\| \left\|g_k^t\right\|. \tag{93}$$
Recall that, due to the sorting of the gradients, for an arbitrary $k \in \mathcal{B}_t$ and an arbitrary $j \in \mathcal{H} \setminus \mathcal{H}_t$,
$$\left\|g_k^t\right\| \leq \left\|g_j^t\right\| = \left\|\nabla Q_j(x^t)\right\|. \tag{94}$$
Recall that $\mathcal{B}_t = \{i_1, \ldots, i_{n-f}\} \setminus \mathcal{H}_t$. Thus, $\left|\mathcal{B}_t\right| = n - f - \left|\mathcal{H}_t\right|$. Also, as $|\mathcal{H}| = n - f$, $\left|\mathcal{H} \setminus \mathcal{H}_t\right| = n - f - \left|\mathcal{H}_t\right|$. That is, $\left|\mathcal{B}_t\right| = \left|\mathcal{H} \setminus \mathcal{H}_t\right|$. Therefore, (94) implies that
$$\sum_{k \in \mathcal{B}_t} \left\|g_k^t\right\| \leq \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\|\nabla Q_j(x^t)\right\|.$$
Substituting from above in (93), we obtain that
$$\phi_t \geq \gamma\,|\mathcal{H}|\left\|x^t - x_{\mathcal{H}}\right\|^2 - 2\sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\|x^t - x_{\mathcal{H}}\right\| \left\|\nabla Q_j(x^t)\right\|.$$
Substituting from (83), i.e., $\left\|\nabla Q_j(x)\right\| \leq 2n\mu\epsilon + \mu\|x - x_{\mathcal{H}}\|$, above we obtain that
$$\phi_t \geq \gamma\,|\mathcal{H}|\left\|x^t - x_{\mathcal{H}}\right\|^2 - 2\left|\mathcal{H} \setminus \mathcal{H}_t\right|\left\|x^t - x_{\mathcal{H}}\right\|\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right) \geq \left(\gamma\,|\mathcal{H}| - 2\mu\left|\mathcal{H} \setminus \mathcal{H}_t\right|\right)\left\|x^t - x_{\mathcal{H}}\right\|^2 - 4n\mu\epsilon\left|\mathcal{H} \setminus \mathcal{H}_t\right|\left\|x^t - x_{\mathcal{H}}\right\|.$$
As $|\mathcal{H}| = n - f$ and $\left|\mathcal{H} \setminus \mathcal{H}_t\right| \leq f$, the above implies that
$$\phi_t \geq \left(\gamma(n - f) - 2\mu f\right)\left\|x^t - x_{\mathcal{H}}\right\|^2 - 4n\mu\epsilon f \left\|x^t - x_{\mathcal{H}}\right\| = \left(\gamma(n - f) - 2\mu f\right)\left\|x^t - x_{\mathcal{H}}\right\|\left(\left\|x^t - x_{\mathcal{H}}\right\| - \frac{4n\mu\epsilon f}{\gamma(n - f) - 2\mu f}\right) = n\gamma\left(1 - \frac{f}{n}\left(1 + \frac{2\mu}{\gamma}\right)\right)\left\|x^t - x_{\mathcal{H}}\right\|\left(\left\|x^t - x_{\mathcal{H}}\right\| - \frac{4\mu f\epsilon}{\gamma\left(1 - \frac{f}{n}\left(1 + \frac{2\mu}{\gamma}\right)\right)}\right). \tag{95}$$
Recall from (25) and (26), respectively, that
$$\alpha = 1 - \frac{f}{n}\left(1 + \frac{2\mu}{\gamma}\right) \quad \text{and} \quad D = \frac{4\mu f}{\alpha\gamma}.$$
Substituting from the above in (95) we obtain that
$$\phi_t \geq \alpha\, n\gamma\left\|x^t - x_{\mathcal{H}}\right\|\left(\left\|x^t - x_{\mathcal{H}}\right\| - D\epsilon\right). \tag{96}$$
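To make the constants concrete, the resilience margin $\alpha$ and the radius $D\epsilon$ follow directly from $n$, $f$, $\mu$, $\gamma$, and $\epsilon$ as reconstructed in (25)-(26) above. A small helper (ours; the numeric values below are hypothetical):

```python
# Constants from (25)-(26): alpha = 1 - (f/n)(1 + 2*mu/gamma) and
# D = 4*mu*f/(alpha*gamma), so the CGE iterates settle within D*eps of x_H.
def cge_constants(n, f, mu, gamma, eps):
    alpha = 1.0 - (f / n) * (1.0 + 2.0 * mu / gamma)
    assert alpha > 0, "the CGE guarantee of Theorem 4 requires alpha > 0"
    return alpha, 4.0 * mu * f * eps / (alpha * gamma)

alpha, D_eps = cge_constants(n=10, f=1, mu=2.0, gamma=1.0, eps=0.05)
print(alpha, D_eps)    # 0.5 and 0.8 for these hypothetical parameters
```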
As it is assumed that $\alpha > 0$, (96) implies that for an arbitrary $\delta > 0$,
$$\phi_t \geq \alpha\, n\gamma\,\delta\left(D\epsilon + \delta\right) \quad \text{when} \quad \left\|x^t - x_{\mathcal{H}}\right\| \geq D\epsilon + \delta.$$
The above satisfies condition (23) in Theorem 3 with $D^* = D\epsilon + \delta$ and $\xi = \alpha\, n\gamma\,\delta\left(D\epsilon + \delta\right)$. Recall from (87) that $\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| < \infty$ for all $t$. Therefore, upon using Theorem 3 we obtain that
$$\lim_{t\to\infty}\left\|x^t - x_{\mathcal{H}}\right\| \leq D\epsilon + \delta.$$
Note that the above inequality holds true for an arbitrary $\delta > 0$. Therefore,
$$\lim_{t\to\infty}\left\|x^t - x_{\mathcal{H}}\right\| \leq D\epsilon + \delta, \quad \forall \delta > 0. \tag{97}$$
Reasoning by contradiction, (97) implies that
$$\lim_{t\to\infty}\left\|x^t - x_{\mathcal{H}}\right\| \leq D\epsilon.$$
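As an end-to-end illustration of the guarantee above (ours, and not the paper's Section 5 experiments), the snippet below runs CGE-filtered projected gradient descent on hypothetical quadratic costs with $f$ Byzantine agents reporting large arbitrary gradients; the iterates approach the non-faulty minimizer $x_{\mathcal{H}}$:

```python
# Toy check of Theorem 4: with step sizes eta_t = 0.1/t (so sum eta_t = inf and
# sum eta_t^2 < inf), CGE-filtered projected descent settles near x_H despite
# f faulty reports. All problem parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n, f, d = 10, 2, 3
targets = rng.normal(size=(n - f, d))       # non-faulty Q_i(x) = 0.5*||x - a_i||^2
x_H = targets.mean(axis=0)                  # minimizer of the honest aggregate

def cge(grads, f):
    order = np.argsort([np.linalg.norm(g) for g in grads])
    return np.sum([grads[j] for j in order[: len(grads) - f]], axis=0)

x = np.zeros(d)
for t in range(1, 2001):
    honest = [x - a for a in targets]                      # true gradients
    byz = [100.0 * rng.normal(size=d) for _ in range(f)]   # arbitrary faulty reports
    x = np.clip(x - (0.1 / t) * cge(honest + byz, f), -10.0, 10.0)

print(np.linalg.norm(x - x_H))              # small: x^t has settled near x_H
```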
E Appendix: Proof of Theorem 5

In this section we present the proof of Theorem 5. Throughout this section we assume $f > 0$; the case $f = 0$ is trivial. The proof closely follows that of Theorem 4, and we borrow some notation and results directly from Appendix D.

Consider an arbitrary set $\mathcal{H}$ of non-faulty agents with $|\mathcal{H}| = n - f$. Recall that under Assumption 3 the minimum point of the aggregate cost function $\sum_{i \in \mathcal{H}} Q_i(x)$, denoted by $x_{\mathcal{H}}$, is unique. Specifically,
$$x_{\mathcal{H}} = \arg\min_{x \in \mathbb{R}^d} \sum_{i \in \mathcal{H}} Q_i(x).$$
To prove the theorem we make use of the result stated in Theorem 3. Specifically, we show that the CWTM gradient-filter satisfies the conditions of Theorem 3 for $D^* = \frac{2\sqrt{d}\,n\mu\lambda}{\gamma - \sqrt{d}\,\mu\lambda}\,\epsilon$ and $x^* = x_{\mathcal{H}}$. The rest follows easily from the convergence result stated in Theorem 3. Recall from (28) that for the CWTM gradient-filter, for all $l \in \{1, \ldots, d\}$,
$$\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)[l] = \frac{1}{n - 2f} \sum_{j=f+1}^{n-f} g_{i_j[l]}^t[l], \quad \forall t. \tag{98}$$
First, we show that $\sum_{j=f+1}^{n-f} g_{i_j[l]}^t[l]$ is finite for all $l$ and $t$.

From (83) in Appendix D, we know that under the $(2f, \epsilon)$-redundancy property and Assumption 2, for each non-faulty agent $i \in \mathcal{H}$,
$$\left\|\nabla Q_i(x)\right\| \leq 2n\mu\epsilon + \mu\|x - x_{\mathcal{H}}\|. \tag{99}$$
The above implies that for all $i \in \mathcal{H}$, $l \in \{1, \ldots, d\}$ and $x$,
$$\left|\nabla Q_i(x)[l]\right| \leq 2n\mu\epsilon + \mu\|x - x_{\mathcal{H}}\|. \tag{100}$$
Recall that for all $l$ and $t$,
$$g_{i_1[l]}^t[l] \leq \ldots \leq g_{i_{f+1}[l]}^t[l] \leq \ldots \leq g_{i_{n-f}[l]}^t[l] \leq \ldots \leq g_{i_n[l]}^t[l].$$
As there are at most $f$ Byzantine agents and $|\mathcal{H}| = n - f$, for all $l$ and $t$ there exists a pair of non-faulty agents $\sigma^1_t[l], \sigma^2_t[l] \in \mathcal{H}$ such that
$$g_{i_{n-f}[l]}^t[l] \leq g_{\sigma^1_t[l]}^t[l], \quad \text{and} \quad g_{i_{f+1}[l]}^t[l] \geq g_{\sigma^2_t[l]}^t[l]. \tag{101}$$
As $g_j^t = \nabla Q_j(x^t)$ for all $j \in \mathcal{H}$, from (101) we obtain that for all $j \in \{f + 1, \ldots, n - f\}$, $l$ and $t$,
$$\left|g_{i_j[l]}^t[l]\right| \leq \max\left\{\left|\nabla Q_{\sigma^1_t[l]}(x^t)[l]\right|, \; \left|\nabla Q_{\sigma^2_t[l]}(x^t)[l]\right|\right\}.$$
Substituting from (100) above, we obtain that for all $j \in \{f + 1, \ldots, n - f\}$, $l$ and $t$,
$$\left|g_{i_j[l]}^t[l]\right| \leq 2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|.$$
Therefore, owing to the triangle inequality,
$$\left|\sum_{j=f+1}^{n-f} g_{i_j[l]}^t[l]\right| \leq \sum_{j=f+1}^{n-f} \left|g_{i_j[l]}^t[l]\right| \leq (n - 2f)\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right). \tag{102}$$
Let $\Gamma = \max_{x \in \mathcal{W}} \|x - x_{\mathcal{H}}\|$. As $\mathcal{W}$ is a compact set, $\Gamma < \infty$. Recall from the update rule (22) that $x^t \in \mathcal{W}$ for all $t$. Thus, $\left\|x^t - x_{\mathcal{H}}\right\| \leq \max_{x \in \mathcal{W}} \|x - x_{\mathcal{H}}\| = \Gamma < \infty$. Substituting this in (102) implies that for all $l \in \{1, \ldots, d\}$,
$$\left|\sum_{j=f+1}^{n-f} g_{i_j[l]}^t[l]\right| \leq (n - 2f)\left(2n\mu\epsilon + \mu\Gamma\right) < \infty.$$
Therefore,
$$\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| < \infty, \quad \forall t. \tag{103}$$
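Concretely, the CWTM rule (98) is a coordinate-wise trimmed mean: for each coordinate, drop the $f$ smallest and $f$ largest reported values and average the remaining $n - 2f$. A minimal sketch (ours):

```python
# CWTM gradient-filter per (98): sort each coordinate independently, trim the
# f extremes on both sides, and average the middle n - 2f values.
import numpy as np

def cwtm_filter(grads, f):
    g = np.sort(np.asarray(grads), axis=0)      # realizes g_{i_1[l]}[l] <= ... per coordinate
    return g[f : len(grads) - f].mean(axis=0)   # mean of the middle n - 2f values
```

For instance, with $n = 5$ and $f = 1$, each output coordinate is the mean of the middle three reported values, which by (104) below always lies between the smallest and largest non-faulty values of that coordinate.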
Now, consider an arbitrary iteration $t$ and $l \in \{1, \ldots, d\}$. From prior work on the CWTM gradient-filter for the scalar case [32], i.e., when $d = 1$, we know that the trimmed mean of the $l$-th elements of the gradients lies in the convex hull of the $l$-th elements of the non-faulty agents' gradients in set $\mathcal{H}$. Specifically,
$$\min_{i \in \mathcal{H}} g_i^t[l] \leq \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)[l] \leq \max_{i \in \mathcal{H}} g_i^t[l]. \tag{104}$$
Obviously,
$$\min_{i \in \mathcal{H}} g_i^t[l] \leq \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} g_i^t[l] \leq \max_{i \in \mathcal{H}} g_i^t[l]. \tag{105}$$
Therefore, from (104) and (105) we obtain that
$$\left|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)[l] - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} g_i^t[l]\right| \leq \max_{i \in \mathcal{H}} g_i^t[l] - \min_{i \in \mathcal{H}} g_i^t[l].$$
As $\max_{i \in \mathcal{H}} g_i^t[l] - \min_{i \in \mathcal{H}} g_i^t[l] = \max_{i, j \in \mathcal{H}}\left|g_i^t[l] - g_j^t[l]\right|$, and $g_i^t = \nabla Q_i(x^t)$ for all $i \in \mathcal{H}$, the above can be re-written as follows:
$$\left|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)[l] - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)[l]\right| \leq \max_{i, j \in \mathcal{H}}\left|\nabla Q_i(x^t)[l] - \nabla Q_j(x^t)[l]\right|. \tag{106}$$
Note that for any two $i, j \in \mathcal{H}$,
$$\left|\nabla Q_i(x^t)[l] - \nabla Q_j(x^t)[l]\right| \leq \left\|\nabla Q_i(x^t) - \nabla Q_j(x^t)\right\|. \tag{107}$$
Substituting from Assumption 5, i.e., $\left\|\nabla Q_i(x) - \nabla Q_j(x)\right\| \leq \lambda \max\left\{\left\|\nabla Q_i(x)\right\|, \left\|\nabla Q_j(x)\right\|\right\}$, in (107) we obtain that
$$\left|\nabla Q_i(x^t)[l] - \nabla Q_j(x^t)[l]\right| \leq \lambda \max\left\{\left\|\nabla Q_i(x^t)\right\|, \left\|\nabla Q_j(x^t)\right\|\right\}. \tag{108}$$
Substituting from (99) above we obtain that
$$\left|\nabla Q_i(x^t)[l] - \nabla Q_j(x^t)[l]\right| \leq \lambda\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right). \tag{109}$$
Finally, substituting from (109) in (106) we obtain that, for all $l$,
$$\left|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)[l] - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)[l]\right| \leq \lambda\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right).$$
As $\|x\| = \sqrt{\sum_{l=1}^{d}\left|x[l]\right|^2}$ for $x \in \mathbb{R}^d$, the above implies that
$$\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\| \leq \sqrt{d}\,\lambda\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right). \tag{110}$$
Now, note that
$$\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) = \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t) + \left(\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right). \tag{111}$$
Recall from Theorem 3 that $\phi_t$, for each $t$, is defined to be
$$\phi_t = \left\langle x^t - x_{\mathcal{H}}, \; \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\rangle.$$
Substituting from (111) above we obtain that
$$\phi_t = \left\langle x^t - x_{\mathcal{H}}, \; \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\rangle + \left\langle x^t - x_{\mathcal{H}}, \; \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\rangle. \tag{112}$$
Recall from Assumption 3 that $Q_{\mathcal{H}}(x) = (1/|\mathcal{H}|)\sum_{i \in \mathcal{H}} Q_i(x)$. Thus, the first term on the right-hand side of (112) equals $\left\langle x^t - x_{\mathcal{H}}, \nabla Q_{\mathcal{H}}(x^t)\right\rangle$. Substituting from Assumption 3, and recalling that $\nabla Q_{\mathcal{H}}(x_{\mathcal{H}}) = 0$, we obtain that
$$\left\langle x^t - x_{\mathcal{H}}, \; \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\rangle \geq \gamma\left\|x^t - x_{\mathcal{H}}\right\|^2. \tag{113}$$
Next, we consider the second term on the right-hand side of (112). From the Cauchy-Schwartz inequality,
$$\left\langle x^t - x_{\mathcal{H}}, \; \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\rangle \geq -\left\|x^t - x_{\mathcal{H}}\right\|\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\|.$$
Substituting from (110) above we obtain that
$$\left\langle x^t - x_{\mathcal{H}}, \; \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\rangle \geq -\sqrt{d}\,\lambda\left\|x^t - x_{\mathcal{H}}\right\|\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right). \tag{114}$$
Substituting from (113) and (114) in (112) we obtain that
$$\phi_t \geq \gamma\left\|x^t - x_{\mathcal{H}}\right\|^2 - \sqrt{d}\,\lambda\left\|x^t - x_{\mathcal{H}}\right\|\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right) = \left(\gamma - \sqrt{d}\,\lambda\mu\right)\left\|x^t - x_{\mathcal{H}}\right\|\left(\left\|x^t - x_{\mathcal{H}}\right\| - \frac{2\sqrt{d}\,n\mu\lambda}{\gamma - \sqrt{d}\,\mu\lambda}\,\epsilon\right). \tag{115}$$
The above inequality is similar to (96) in the proof of Theorem 4 in Appendix D, where by assumption $\gamma - \sqrt{d}\,\lambda\mu > 0$.
Also, recall from (103) that in this particular case of the CWTM gradient-filter,
$\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| < \infty$ for all $t$. Therefore, using similar arguments as in Appendix D, we obtain that
$$\lim_{t\to\infty}\left\|x^t - x_{\mathcal{H}}\right\| \leq \frac{2\sqrt{d}\,n\mu\lambda}{\gamma - \sqrt{d}\,\mu\lambda}\,\epsilon = \left(\frac{2n\lambda}{\left(\gamma/\mu\sqrt{d}\right) - \lambda}\right)\epsilon.$$