Approximate Byzantine Fault-Tolerance in Distributed Optimization
Shuo Liu$^\star$, Nirupam Gupta$^\dagger$, Nitin H. Vaidya$^\dagger$
{$^\star$sl1539, $^\dagger$firstname.lastname}@georgetown.edu
Department of Computer Science, Georgetown University, Washington DC, USA

Abstract
This paper studies the problem of Byzantine fault-tolerance in distributed multi-agent optimization. In this problem, each agent has a local cost function, and, in the fault-free case, the objective is to design a distributed algorithm that allows all the agents to find a minimum point of all the agents' aggregate cost function. We consider a scenario where a certain number of agents might be Byzantine faulty, i.e., these agents may not follow a prescribed algorithm and may share arbitrary information regarding their local cost functions. In the presence of such faulty agents, it is generally impossible to find a minimum point of all the agents' aggregate cost function. A more reasonable goal, however, is to design an algorithm that allows all the non-faulty agents to compute, either exactly or approximately, the minimum point of only the non-faulty agents' aggregate cost function.

From prior work we know that in the presence of up to $f$ (out of $n$) Byzantine faulty agents, a deterministic algorithm can compute a minimum point of the non-faulty agents' aggregate cost exactly if and only if the non-faulty agents' cost functions satisfy a certain redundancy property named $2f$-redundancy [20]. However, the $2f$-redundancy property can only be guaranteed in ideal systems free from noises (or uncertainties), and therefore the objective of exact fault-tolerance is unsuitable for many practical settings that inevitably suffer from noises. In this paper, we consider the problem of approximate fault-tolerance, a generalization of exact fault-tolerance where the goal is to only compute an approximation of a minimum point of the non-faulty agents' aggregate cost function. Upon defining approximate fault-tolerance as $(f, \epsilon)$-resilience, where $\epsilon$ is the approximation error, we show that it can be achieved under a weaker redundancy condition than $2f$-redundancy. We present necessary and sufficient conditions for achieving $(f, \epsilon)$-resilience in a synchronous distributed system with server-based architecture. Then, we consider a special case when the agents' cost functions are differentiable. Here, we analyse the approximate fault-tolerance of the distributed gradient-descent method, a prominent distributed optimization algorithm in this particular case, when equipped with a gradient-filter (or robust gradient aggregation), such as comparative gradient elimination (CGE) or coordinate-wise trimmed mean (CWTM).

1 Introduction
The problem of distributed optimization in multi-agent systems has gained significant attention in recent years [7, 25, 16]. In this problem, each agent has a local cost function and, when the agents are fault-free, the goal is to design algorithms that allow the agents to collectively minimize the aggregate of their cost functions. To be precise, suppose that there are $n$ agents in the system and let $Q_i(x)$ denote the local cost function of agent $i$, where $x$ is a $d$-dimensional vector of real values, i.e., $x \in \mathbb{R}^d$. A traditional distributed optimization algorithm outputs a global minimum $x^*$ such that

$$x^* \in \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} Q_i(x). \tag{1}$$

As a simple example, $Q_i(x)$ may denote the cost for agent $i$ (which may be a robot or a person) to travel to location $x$ from their current location, and $x^*$ is then a location that minimizes the total cost of meeting for all the agents. Such multi-agent optimization is of interest in many practical applications, including distributed machine learning [7], swarm robotics [29], and distributed sensing [28].

We consider the distributed optimization problem in the presence of up to $f$ Byzantine faulty agents, originally introduced by Su and Vaidya [32]. The Byzantine faulty agents may behave arbitrarily [22]. In particular, the faulty agents may share arbitrary incorrect and inconsistent information in order to bias the output of a distributed optimization algorithm. For example, consider an application of multi-agent optimization in the case of distributed sensing where the agents (or sensors) observe a common object in order to collectively identify the object. The faulty agents may send arbitrary observations concocted to prevent the non-faulty agents from making the correct identification [11, 13, 26, 33]. Similarly, in the case of distributed learning, which is another application of distributed optimization, the faulty agents may send incorrect information based on mislabelled or arbitrarily concocted data points to prevent the non-faulty agents from learning a good classifier [1, 2, 4, 9, 10, 12, 19, 35].

1.1 Exact Fault-Tolerance
In the exact fault-tolerance problem, the goal is to design a distributed algorithm that allows all the non-faulty agents to compute a minimum point of the aggregate cost of only the non-faulty agents [20]. Specifically, suppose that in a given execution, $\mathcal{B}$ with $|\mathcal{B}| \leq f$ is the set of Byzantine agents, where $|\cdot|$ denotes set cardinality, and $\mathcal{H} = \{1, \ldots, n\} \setminus \mathcal{B}$ denotes the set of non-faulty (i.e., honest) agents. Then, a distributed optimization algorithm has exact fault-tolerance if it outputs a point $x^*_{\mathcal{H}}$ such that

$$x^*_{\mathcal{H}} \in \arg\min_{x \in \mathbb{R}^d} \sum_{i \in \mathcal{H}} Q_i(x). \tag{2}$$

However, since the identity of the Byzantine agents is a priori unknown, exact fault-tolerance is in general unachievable [32]. Specifically, it is shown in [20, 21] that exact fault-tolerance can be achieved if and only if the agents' cost functions satisfy the $2f$-redundancy property defined below.

Definition 1 ($2f$-redundancy). The agents' cost functions are said to have the $2f$-redundancy property if and only if for every pair of subsets $S, \hat{S} \subseteq \{1, \ldots, n\}$ with $|S| = n - f$, $|\hat{S}| \geq n - 2f$ and $\hat{S} \subseteq S$,

$$\arg\min_{x \in \mathbb{R}^d} \sum_{i \in \hat{S}} Q_i(x) = \arg\min_{x \in \mathbb{R}^d} \sum_{i \in S} Q_i(x).$$

In principle, the $2f$-redundancy property can be realized by design for many applications of multi-agent distributed optimization, including distributed sensing and distributed learning (see [18, 20]). However, practical realization of $2f$-redundancy can be difficult due to the presence of noise in real-world systems. This motivates us to consider a generalization of exact fault-tolerance, namely the problem of approximate fault-tolerance, which is described and defined as follows.

1.2 Approximate Fault-Tolerance

Unlike exact fault-tolerance, in approximate fault-tolerance it is acceptable for an algorithm to output an approximate minimum point of the non-faulty aggregate cost function. As stated below, we formally define approximate fault-tolerance by $(f, \epsilon)$-resilience, where $\epsilon \in \mathbb{R}_{\geq 0}$ is the measure of approximation. Recall that the Euclidean distance between a point $x$ and a non-empty set $X$ in the space $\mathbb{R}^d$, denoted by ${\rm dist}(x, X)$, is defined to be

$${\rm dist}(x, X) = \inf_{y \in X} \| x - y \| \tag{3}$$

where $\|\cdot\|$ denotes the Euclidean norm.

Definition 2 ($(f, \epsilon)$-resilience). A distributed optimization algorithm is said to be $(f, \epsilon)$-resilient if it outputs a point $\hat{x} \in \mathbb{R}^d$ such that for every subset $S$ of non-faulty agents with $|S| = n - f$,

$${\rm dist}\left(\hat{x}, \; \arg\min_{x \in \mathbb{R}^d} \sum_{i \in S} Q_i(x)\right) \leq \epsilon,$$

despite the presence of up to $f$ Byzantine agents.
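For intuition, Definition 2 can be checked numerically when the aggregate minimum points are easy to compute. The following Python sketch is a minimal, hypothetical illustration (the scalar quadratic costs and all names are ours, not from the paper): it verifies whether a candidate output $\hat{x}$ is within $\epsilon$ of the minimum point of every $(n-f)$-sized subset of non-faulty agents.

import numpy as np
from itertools import combinations

# Hypothetical example: scalar quadratic costs Q_i(x) = (x - a[i])^2, so the
# aggregate over a set S is uniquely minimized at the mean of {a[i], i in S}.
n, f = 6, 1
a = np.array([0.0, 0.2, 0.9, 1.1, 1.3, 1.6])
honest = range(n)  # suppose, in this execution, every agent is non-faulty

def satisfies_definition_2(x_hat, eps):
    # Definition 2: dist(x_hat, argmin sum_{i in S} Q_i) <= eps for every
    # subset S of non-faulty agents with |S| = n - f.
    for S in combinations(honest, n - f):
        x_S = a[list(S)].mean()  # the unique minimum point for this S
        if abs(x_hat - x_S) > eps:
            return False
    return True

print(satisfies_definition_2(x_hat=0.85, eps=0.5))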
Alternately, a distributed optimization algorithm is $(f, \epsilon)$-resilient if it outputs a point within $\epsilon$ distance of a minimum point of the aggregate cost function of at least $n - f$ non-faulty agents. As there can be at most $f$ Byzantine faulty agents whose identity remains unknown, the two scenarios where (1) there are exactly $f$ Byzantine agents, and (2) there are fewer than $f$ Byzantine agents, are indistinguishable. Thus, estimating the minimum point of the aggregate cost function of $n - f$ non-faulty agents is indeed a reasonable goal [32].

In this paper, we consider deterministic algorithms which, given a fixed set of inputs from the server and the agents, always output the same point in $\mathbb{R}^d$. Thus, a deterministic $(f, \epsilon)$-resilient algorithm produces a unique output point in all of its executions with identical inputs from the server and all the agents (including the faulty agents). Note that in the deterministic framework, exact fault-tolerance is equivalent to $(f, 0)$-resilience (see the Appendix). We show that $(f, \epsilon)$-resilience requires a weaker redundancy condition, in comparison to $2f$-redundancy, named $(2f, \epsilon)$-redundancy, which is defined as follows. Recall that the Euclidean Hausdorff distance between two sets $X$ and $Y$ in $\mathbb{R}^d$, which we denote by ${\rm dist}(X, Y)$, is defined to be [24]

$${\rm dist}(X, Y) \triangleq \max\left\{ \sup_{x \in X} {\rm dist}(x, Y), \; \sup_{y \in Y} {\rm dist}(y, X) \right\}. \tag{4}$$

Definition 3 ($(2f, \epsilon)$-redundancy). The agents' cost functions are said to have the $(2f, \epsilon)$-redundancy property if and only if for every pair of subsets $S, \hat{S} \subseteq \{1, \ldots, n\}$ with $|S| = n - f$, $|\hat{S}| \geq n - 2f$ and $\hat{S} \subseteq S$,

$${\rm dist}\left(\arg\min_{x \in \mathbb{R}^d} \sum_{i \in S} Q_i(x), \; \arg\min_{x \in \mathbb{R}^d} \sum_{i \in \hat{S}} Q_i(x)\right) \leq \epsilon. \tag{5}$$
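When every aggregate in (5) has a unique minimum point, the Hausdorff distance reduces to an ordinary distance between points, and the smallest $\epsilon$ for which Definition 3 holds can be found by enumeration. A minimal Python sketch, again assuming hypothetical scalar quadratic costs (not data from the paper):

import numpy as np
from itertools import combinations

# Hypothetical costs Q_i(x) = (x - a[i])^2: the argmin over any set S is the
# singleton {mean(a_S)}, so the Hausdorff distance in (5) is a point distance.
n, f = 6, 1
a = np.array([0.0, 0.2, 0.9, 1.1, 1.3, 1.6])

def smallest_redundancy_eps():
    eps = 0.0
    for S in combinations(range(n), n - f):
        x_S = a[list(S)].mean()
        # Definition 3 ranges over subsets S_hat of S with |S_hat| >= n - 2f;
        # |S_hat| = n - f forces S_hat = S (distance 0), so skip that size.
        for size in range(n - 2 * f, n - f):
            for S_hat in combinations(S, size):
                eps = max(eps, abs(x_S - a[list(S_hat)].mean()))
    return eps

print(smallest_redundancy_eps())  # the costs satisfy (2f, eps)-redundancy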
Upon comparing Definitions 1 and 3, it is easy to see that $2f$-redundancy is equivalent to $(2f, 0)$-redundancy. Also, note that $2f$-redundancy implies $(2f, \epsilon)$-redundancy for all $\epsilon \geq 0$; however, the converse need not be true. Thus, the $(2f, \epsilon)$-redundancy property with $\epsilon > 0$ is weaker than $2f$-redundancy.

In the first part of the paper, i.e., Section 3, we show two key results:

• $(f, \epsilon)$-resilience is feasible only if the $(2f, \epsilon)$-redundancy property holds true.

• If the $(2f, \epsilon)$-redundancy property holds true then $(f, 2\epsilon)$-resilience is achievable.

As $(f, 0)$-resilience is equivalent to exact fault-tolerance, and $(2f, 0)$-redundancy to $2f$-redundancy, the above results generalize the known result that exact fault-tolerance is feasible if and only if the $2f$-redundancy condition holds true (see [20, 21]).

In the second part, i.e., Sections 4 and 5, we consider the case when the agents' cost functions are assumed differentiable, as is common in machine learning [6] and in state estimation or regression [18, 31, 33]. Here, we study the problem of approximate fault-tolerance of the distributed gradient-descent (DGD) method, an iterative distributed optimization algorithm commonly used in this particular case.

• In Section 4, we present a generic convergence result for the DGD method when equipped with a gradient-filter (or robust gradient aggregation), which is a common fault-tolerance mechanism [4, 12, 20]. Then, upon assuming the $(2f, \epsilon)$-redundancy property, we present approximate fault-tolerance guarantees for two specific gradient-filters:

  – the comparative gradient elimination (CGE) gradient-filter [17], and
  – the coordinate-wise trimmed mean (CWTM) gradient-filter [36].

  These two filters are among the computationally cheapest of the existing gradient-filters, and have wide applicability; see, e.g., [17, 18, 31, 36].

• In Section 5, we present empirical comparisons between the approximate fault-tolerance achieved by the two aforementioned gradient-filters.
System architecture: The results in this paper apply to the two different system architectures illustrated in Figure 1. The system is assumed to be synchronous. In the server-based architecture, the server is assumed to be trustworthy, but up to $f$ agents may be Byzantine faulty. The trusted server helps solve the distributed optimization problem in coordination with the agents. In the peer-to-peer architecture, the agents are connected to each other by a complete network, and up to $f$ of these agents may be Byzantine faulty. Provided that $f < n/3$, an algorithm for the server-based architecture can be simulated in the peer-to-peer system using the well-known Byzantine broadcast primitive [23]. For simplicity of presentation, the rest of this paper considers the server-based architecture.

Figure 1: System architecture.

Before we present our results in detail, we review below other related work on approximate fault-tolerance and gradient-filters.
2 Related Work

In the past, different definitions of approximate fault-tolerance, besides the $(f, \epsilon)$-resilience that we use, have been formulated to analyze the Byzantine fault-tolerance of different distributed optimization algorithms [15, 32]. As we discuss below in Section 2.1, the difference between these other definitions and our definition of $(f, \epsilon)$-resilience arises mainly from the applicability of the distributed optimization problem. Later, in Section 2.2, we discuss prior work on gradient-filters for conferring Byzantine fault-tolerance to the distributed gradient-descent method, an iterative distributed optimization algorithm used when the agents' cost functions are differentiable [6].

2.1 Alternate Definitions of Approximate Fault-Tolerance
As proposed by Su and Vaidya, 2016 [32], an alternate way of measuring approximation (or accuracy) in fault-tolerance is by the use of scalar coefficients. Specifically, instead of a minimum point of the uniformly weighted aggregate of non-faulty agents' cost functions, a distributed optimization algorithm may output a minimum point of a non-uniformly weighted aggregate of non-faulty costs, i.e., of $\sum_{i \in \mathcal{H}} \alpha_i Q_i(x)$, where $\mathcal{H}$ denotes the set of at least $n - f$ non-faulty agents, and $\alpha_i \geq 0$ for all $i \in \mathcal{H}$. As suggested in [32], upon re-scaling the coefficients such that $\sum_{i \in \mathcal{H}} \alpha_i = 1$, we can measure the approximation in fault-tolerance using two metrics: 1) the number of coefficients in $\{\alpha_i, \, i \in \mathcal{H}\}$ that are positive, and 2) the minimum positive value amongst the coefficients, $\min\{\alpha_i \, ; \, \alpha_i > 0, \, i \in \mathcal{H}\}$. Results on the achievability of such approximation in the case of scalar optimization problems, i.e., when $d = 1$, can be found in [32, 34]. However, we are not aware of any such results for the case of the higher-dimensional optimization problem, i.e., when $d > 1$.

Alternately, a resilient algorithm $\Pi$ may aim to output a point $x_\Pi \in \mathbb{R}^d$ such that each element of the aggregate non-faulty gradient $\sum_{i \in \mathcal{H}} \nabla Q_i(x_\Pi)$ has an absolute value upper bounded by $\epsilon$. As yet another alternative, a resilient algorithm $\Pi$ may aim to output a point $x_\Pi$ such that the non-faulty aggregate cost $\sum_{i \in \mathcal{H}} Q_i(x_\Pi)$ is within $\epsilon$ of the true minimum cost $\min_x \sum_{i \in \mathcal{H}} Q_i(x)$. However, these definitions of approximate resilience are sensitive to scaling of the cost functions. In particular, if the elements of $\sum_{i \in \mathcal{H}} \nabla Q_i(x_\Pi)$ are bounded by $\epsilon$ then the elements of $\sum_{i \in \mathcal{H}} \alpha \nabla Q_i(x_\Pi)$ are bounded by $\alpha \epsilon$, where $\alpha$ is a positive scalar value. On the other hand, both $\sum_{i \in \mathcal{H}} Q_i(x)$ and $\sum_{i \in \mathcal{H}} \alpha Q_i(x)$ have identical minimum points regardless of the value of $\alpha$. Therefore, when the objective is to approximate a minimum point of the non-faulty aggregate cost $\arg\min_x \sum_{i \in \mathcal{H}} Q_i(x)$, which is indeed the case in this paper, the use of function (or gradient) values to measure approximation is not a suitable choice.

2.2 Gradient-Filters

In the past, several gradient-filters have been proposed to robustify the distributed gradient-descent (DGD) method against Byzantine faulty agents; for example, see [1, 4, 14, 15, 17, 27, 32, 36] and references therein. The DGD method is an iterative distributed optimization algorithm wherein the server maintains an estimate of the solution (a valid minimum point), and updates it iteratively using the gradients computed by the agents for their respective cost functions. In the traditional DGD method, the server updates its estimate using the average (or sum) of the agents' gradients. However, gradient averaging is rendered ineffective when Byzantine faulty agents send arbitrary incorrect gradients [32].

A gradient-filter refers to a technique for robust aggregation of agents' gradients that mitigates the detrimental impact of incorrect gradients. To name a few gradient-filters that are provably effective against Byzantine faulty agents, we have comparative gradient elimination (CGE) [18, 17], coordinate-wise trimmed mean (CWTM) [32, 36], geometric median-of-means (GMoM) [12], KRUM [4], and spectral gradient-filters [15]. However, different gradient-filters guarantee Byzantine fault-tolerance under different assumptions on the non-faulty agents' cost functions.

In this paper, we first propose a generic result (Theorem 3) to establish convergence of the DGD method when coupled with a gradient-filter. The result holds true regardless of the gradient-filter used. Then, we use this convergence result to derive the approximate fault-tolerance guarantees (as per Definition 2) of two specific gradient-filters, CGE and CWTM, under the $(2f, \epsilon)$-redundancy property. The reason for choosing these two filters is that they are among the computationally cheapest, with $O(n(\log n + d))$ per-iteration time complexity [17, 31].

It is worth noting that both the CGE and CWTM gradient-filters can guarantee exact fault-tolerance under certain assumptions, besides $2f$-redundancy, as shown in [18, 17, 31]. However, unlike CGE, the fault-tolerance property of CWTM is only known for the special distributed optimization problems of distributed linear regression [31] and distributed machine learning [36]. In Section 4.2, we present approximate fault-tolerance properties of the CGE and CWTM gradient-filters that are applicable to the more general distributed optimization problem, and that also encapsulate the special case of exact fault-tolerance.

3 $(f, \epsilon)$-Resilience

In this section, we present formal details on necessary and sufficient conditions for $(f, \epsilon)$-resilience. Throughout this paper we assume, as stated below, that the non-faulty agents' cost functions and their aggregates have well-defined minimum points; otherwise, the problem of optimization is rendered vacuous.
Assumption 1. We assume that for any non-empty set $S$ of non-faulty agents, the set $\arg\min_{x \in \mathbb{R}^d} \sum_{i \in S} Q_i(x)$ is closed and non-empty.

Moreover, we also assume that $f < n/2$. Lemma 1, stated below, shows that $(f, \epsilon)$-resilience is impossible in general when $f \geq n/2$.

Lemma 1. If $f \geq n/2$ then there cannot exist a deterministic $(f, \epsilon)$-resilient algorithm for any $\epsilon \geq 0$.

Proof. We prove the lemma by contradiction. We consider a case when $n = 2$ and $d = 1$, i.e., $Q_i : \mathbb{R} \to \mathbb{R}$ for all $i$, and all the cost functions have unique minimum points. Suppose that $f = 1$, and that there exists a deterministic $(f, \epsilon)$-resilient algorithm $\Pi$ for some $\epsilon \geq 0$. Without loss of generality, we suppose that agent 2 is Byzantine faulty. We denote $x_1 = \arg\min_{x \in \mathbb{R}} Q_1(x)$.

The Byzantine agent 2 can choose to behave as a non-faulty agent with cost function $\widetilde{Q}(x) = Q_1(x - 2\epsilon - \delta)$, where $\delta$ is some positive real value. Now, note that the minimum point of $\widetilde{Q}(x)$, which we denote by $x_2$, is unique and equal to $x_1 + 2\epsilon + \delta$. Therefore, $|x_2 - x_1| = 2\epsilon + \delta > 2\epsilon$. As the identity of the Byzantine agent is a priori unknown, the server cannot distinguish between the scenarios: (i) agent 1 is non-faulty, and (ii) agent 2 is non-faulty. Now, being a deterministic algorithm, $\Pi$ should produce the same output in both scenarios. In scenario (i), as $\Pi$ is assumed $(f, \epsilon)$-resilient, its output must lie in the interval $[x_1 - \epsilon, \, x_1 + \epsilon]$. Similarly, in scenario (ii), the output of $\Pi$ must lie in the interval $[x_2 - \epsilon, \, x_2 + \epsilon]$. However, as $|x_2 - x_1| > 2\epsilon$, the two intervals $[x_1 - \epsilon, \, x_1 + \epsilon]$ and $[x_2 - \epsilon, \, x_2 + \epsilon]$ do not overlap. Therefore, $\Pi$ cannot be $(f, \epsilon)$-resilient in both scenarios simultaneously, which is a contradiction to the assumption that $\Pi$ is $(f, \epsilon)$-resilient. The above argument extends easily to the case when $n > 2$ and $f \geq n/2$.
Next, we show the necessity of $(2f, \epsilon)$-redundancy for $(f, \epsilon)$-resilience.

Theorem 1 (Necessity). Suppose that Assumption 1 holds true. There exists a deterministic $(f, \epsilon)$-resilient distributed optimization algorithm, where $\epsilon \geq 0$, only if the agents' cost functions satisfy the $(2f, \epsilon)$-redundancy property.

Proof. To prove the theorem we present a scenario in which the agents' cost functions (if non-faulty) are scalar functions, i.e., $d = 1$ and $Q_i : \mathbb{R} \to \mathbb{R}$ for all $i$, and the minimum point of an aggregate of one or more agents' cost functions is uniquely defined. Obviously, if a condition is necessary in this case then it is so in the more general case of vector functions with non-unique minimum points.

To prove the necessity condition, we also assume that the server has full knowledge of all the agents' cost functions. This may not hold true in practice, where instead the server may only have partial information about the agents' cost functions. Indeed, this assumption forces the Byzantine faulty agents to a priori fix their cost functions, whereas in reality the Byzantine agents may send arbitrary information over time to the server that need not be consistent with any fixed cost function. Thus, necessity of $(2f, \epsilon)$-redundancy under this assumption implies its necessity in general.

The proof is by contradiction. Specifically, we show the following: if the non-faulty cost functions do not satisfy the $(2f, \epsilon)$-redundancy property then there cannot exist a deterministic $(f, \epsilon)$-resilient distributed optimization algorithm.

Recall that we have assumed that for a non-empty set of agents $T$ the aggregate cost function $\sum_{i \in T} Q_i(x)$ has a unique minimum point. To be precise, for each non-empty subset of agents $T$, we define

$$x_T = \arg\min_x \sum_{i \in T} Q_i(x).$$

Suppose that the agents' cost functions do not satisfy the $(2f, \epsilon)$-redundancy property stated in Definition 3. Then, there exist a real number $\delta > 0$ and sets $S, \hat{S}$ with $\hat{S} \subset S$, $|S| = n - f$, and $n - 2f \leq |\hat{S}| < n - f$ such that

$$\left\| x_{\hat{S}} - x_S \right\| \geq \epsilon + \delta. \tag{6}$$

Now, suppose that $n - f - |\hat{S}|$ agents in the remainder set $\{1, \ldots, n\} \setminus S$ are Byzantine faulty. Let us denote the set of faulty agents by $\mathcal{B}$. Note that $\mathcal{B}$ is non-empty with $|\mathcal{B}| = n - f - |\hat{S}| \leq f$. Similar to the non-faulty agents, the faulty agents send to the server cost functions that are scalar, and the aggregate of one or more agents' cost functions in the set $S \cup \mathcal{B}$ has a unique minimum point. However, the faulty agents choose their cost functions such that the aggregate cost function of the agents in the set $\mathcal{B} \cup \hat{S}$ minimizes at a unique point $x_{\mathcal{B} \cup \hat{S}}$ which is $\| x_{\hat{S}} - x_S \|$ distance away from $x_{\hat{S}}$, similar to $x_S$, but lies on the other side of $x_{\hat{S}}$, as illustrated in the figure below. Note that it is always possible to pick such functions for the faulty agents.

[Figure: the points $x_S$, $x_{\hat{S}}$ and $x_{\mathcal{B} \cup \hat{S}}$ on the real line, with $x_{\hat{S}}$ in the middle at distance $\epsilon + \delta$ from each of $x_S$ and $x_{\mathcal{B} \cup \hat{S}}$, which lie on opposite sides of $x_{\hat{S}}$; the $\epsilon$-intervals around $x_S$ and $x_{\mathcal{B} \cup \hat{S}}$ do not intersect.]

Note that the distance between the two points $x_S$ and $x_{\mathcal{B} \cup \hat{S}}$ is $2 \| x_{\hat{S}} - x_S \|$, i.e.,

$$\left\| x_S - x_{\mathcal{B} \cup \hat{S}} \right\| \geq 2\epsilon + 2\delta. \tag{7}$$

We now show, by contradiction, that there cannot exist a deterministic $(f, \epsilon)$-resilient distributed optimization algorithm. Suppose, toward a contradiction, that there exists an $(f, \epsilon)$-resilient deterministic optimization algorithm named $\Pi$. As the identity of the Byzantine faulty agents is a priori unknown to the server, and the cost functions sent by the Byzantine faulty agents have similar properties as the non-faulty agents', the server cannot distinguish between the following two possible scenarios: (i) $S$ is the set of non-faulty agents, and (ii) $\mathcal{B} \cup \hat{S}$ is the set of non-faulty agents. Note that both the sets $S$ and $\mathcal{B} \cup \hat{S}$ contain $n - f$ agents.

As the cost functions received by the server are identical in both of the above scenarios, being a deterministic algorithm, $\Pi$ should have identical output in both cases. We let $\hat{x}$ denote the output of $\Pi$. In scenario (i), when the set of honest agents is given by $S$ with $|S| = n - f$, as $\Pi$ is assumed $(f, \epsilon)$-resilient, by Definition 2 the output

$$\hat{x} \in [x_S - \epsilon, \, x_S + \epsilon], \tag{8}$$

as shown in the figure above. By the same logic, in scenario (ii), when the set of honest agents is $\mathcal{B} \cup \hat{S}$ with $|\mathcal{B} \cup \hat{S}| = n - f$, the output

$$\hat{x} \in [x_{\mathcal{B} \cup \hat{S}} - \epsilon, \, x_{\mathcal{B} \cup \hat{S}} + \epsilon]. \tag{9}$$

However, (7) implies that $\hat{x}$ cannot satisfy both (8) and (9) simultaneously. Therefore, if algorithm $\Pi$ is $(f, \epsilon)$-resilient in scenario (i) then it cannot be so in scenario (ii), and vice-versa. This is a contradiction to the assumption that $\Pi$ is $(f, \epsilon)$-resilient, and therefore proves the impossibility of $(f, \epsilon)$-resilience when the $(2f, \epsilon)$-redundancy property is violated.

Next, we show that $(2f, \epsilon)$-redundancy suffices for $(f, 2\epsilon)$-resilience.

Theorem 2 (Sufficiency). Suppose that Assumption 1 holds true. For a real value $\epsilon \geq 0$, if the agents' cost functions satisfy the $(2f, \epsilon)$-redundancy property then $(f, 2\epsilon)$-resilience is achievable.

Proof. The proof is constructive: we assume that all the agents send their individual cost functions to the server, and exhibit an algorithm whose output satisfies Definition 2 with approximation error $2\epsilon$. We assume that $f > 0$; the case $f = 0$ is trivial. Throughout the proof we write the notation $\arg\min_{x \in \mathbb{R}^d}$ simply as $\arg\min$, unless otherwise stated. Consider the algorithm presented below, comprising three steps.

Step 1:
Each agent sends its cost function to the server. An honest agent sends its actual cost function, while a faulty agent may send an arbitrary function.
Step 2:
For each set $T$ of received functions with $|T| = n - f$, the server computes a point

$$x_T \in \arg\min \sum_{i \in T} Q_i(x).$$

For each subset $\hat{T} \subset T$ with $|\hat{T}| = n - 2f$, the server computes

$$r_{T\hat{T}} \triangleq {\rm dist}\left(x_T, \; \arg\min \sum_{i \in \hat{T}} Q_i(x)\right), \tag{10}$$

and

$$r_T = \max_{\hat{T} \subset T, \, |\hat{T}| = n - 2f} r_{T\hat{T}}. \tag{11}$$
Step 3:
The server outputs $x_S$ such that

$$S = \arg\min_{T \subset \{1, \ldots, n\}, \, |T| = n - f} r_T. \tag{12}$$

We show below that the above algorithm is $(f, 2\epsilon)$-resilient under $(2f, \epsilon)$-redundancy. For a non-empty set of agents $T$, we denote

$$X_T = \arg\min \sum_{i \in T} Q_i(x).$$

Consider an arbitrary set of non-faulty agents $G$ with $|G| = n - f$. Such a set is guaranteed to exist as there are at most $f$ faulty agents, and therefore at least $n - f$ non-faulty agents in the system. Consider an arbitrary set $\hat{T}$ such that $\hat{T} \subset G$ and $|\hat{T}| = n - 2f$. By Definition 3 of $(2f, \epsilon)$-redundancy,

$${\rm dist}\left(X_G, X_{\hat{T}}\right) \leq \epsilon. \tag{13}$$

Recall from (10) that $r_{G\hat{T}} = {\rm dist}(x_G, X_{\hat{T}})$. As $x_G \in X_G$, by definition (4) of the Hausdorff set distance, ${\rm dist}(x_G, X_{\hat{T}}) \leq {\rm dist}(X_G, X_{\hat{T}})$. Therefore,

$$r_{G\hat{T}} \leq \epsilon. \tag{14}$$

Now, recall from (11) that $r_G = \max_{\hat{T} \subset G, \, |\hat{T}| = n - 2f} r_{G\hat{T}}$. As $\hat{T}$ in (14) is an arbitrary subset of $G$ with $|\hat{T}| = n - 2f$,

$$r_G = \max_{\hat{T} \subset G, \, |\hat{T}| = n - 2f} r_{G\hat{T}} \leq \epsilon. \tag{15}$$

From (12) and (15) we obtain that

$$r_S \leq r_G \leq \epsilon. \tag{16}$$

As $|G| = n - f$, for every set of agents $T$ with $|T| = n - f$, $|T \cap G| \geq n - 2f$. Therefore, for the set $S$ defined in (12), there exists a subset $\hat{G}$ of $G$ such that $\hat{G} \subset S$ and $|\hat{G}| = n - 2f$. For such a set $\hat{G}$, by the definition of $r_S$ in (11), we obtain that $r_{S\hat{G}} \triangleq {\rm dist}(x_S, X_{\hat{G}}) \leq r_S$. Substituting from (16) above, we obtain that

$${\rm dist}\left(x_S, X_{\hat{G}}\right) \leq \epsilon. \tag{17}$$

As $\hat{G}$ is a subset of $G$, all the agents in $\hat{G}$ are non-faulty. Therefore, by Assumption 1, $X_{\hat{G}}$ is a closed set. Recall that ${\rm dist}(x_S, X_{\hat{G}}) = \inf_{x \in X_{\hat{G}}} \| x_S - x \|$. The closedness of $X_{\hat{G}}$ implies that there exists a point $z \in X_{\hat{G}}$ such that

$$\| x_S - z \| = \inf_{x \in X_{\hat{G}}} \| x_S - x \| = {\rm dist}\left(x_S, X_{\hat{G}}\right).$$

The above, in conjunction with (17), implies that

$$\| x_S - z \| \leq \epsilon. \tag{18}$$

Moreover, as $z \in X_{\hat{G}}$ where $\hat{G} \subset G$ with $|\hat{G}| = n - 2f$ and $|G| = n - f$, the $(2f, \epsilon)$-redundancy condition stated in Definition 3 implies that ${\rm dist}(z, X_G) \leq \epsilon$. Similar to the argument made above, under Assumption 1, $X_G$ is a closed set, and therefore there exists $x^* \in X_G$ such that

$$\| z - x^* \| = {\rm dist}(z, X_G) \leq \epsilon. \tag{19}$$

By the triangle inequality, (18) and (19) imply that

$$\| x_S - x^* \| \leq \| x_S - z \| + \| z - x^* \| \leq 2\epsilon. \tag{20}$$

Finally, recall that the set $G$ above is an arbitrary set of $n - f$ non-faulty agents, and $x^* \in X_G$. Therefore, (20) implies ${\rm dist}(x_S, X_G) \leq 2\epsilon$ for every such $G$, which proves the theorem.
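For concreteness, the three-step algorithm above admits a direct brute-force implementation when every $\arg\min$ in (10)–(12) is a singleton. The Python sketch below is a minimal illustration under assumed scalar quadratic costs (the data and names are hypothetical; the enumeration over subsets is exponential in general):

import numpy as np
from itertools import combinations

def resilient_output(a, n, f):
    # Hypothetical reports: agent i's cost is Q_i(x) = (x - a[i])^2, so the
    # aggregate over a set T is uniquely minimized at mean(a_T).
    best_S, best_r = None, np.inf
    for T in combinations(range(n), n - f):          # Step 2
        x_T = a[list(T)].mean()
        r_T = max(abs(x_T - a[list(Th)].mean())      # r_T as in (10)-(11)
                  for Th in combinations(T, n - 2 * f))
        if r_T < best_r:                             # Step 3: minimize r_T
            best_S, best_r = T, r_T
    return a[list(best_S)].mean()                    # the output x_S of (12)

# Usage: n = 6 agents, f = 1; the faulty agent's report is arbitrary.
a = np.array([0.0, 0.2, 0.9, 1.1, 1.3, 100.0])
print(resilient_output(a, n=6, f=1))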
In the next part of the paper, i.e., Sections 4 and 5, we consider the case when the (non-faulty) agents' cost functions are assumed differentiable. Specifically, we present and study a generic fault-tolerance mechanism, gradient-filtering, for conferring approximate fault-tolerance to a commonly used distributed optimization algorithm: the distributed gradient-descent method.

4 Distributed Gradient-Descent with Gradient-Filters

In this section, we consider a setting wherein the non-faulty agents' cost functions are differentiable. For this particular setting, we study the approximate fault-tolerance of the distributed gradient-descent method when coupled with a gradient-filter, described below. We consider the server-based system architecture shown in Fig. 1, assuming a synchronous system.

The distributed gradient-descent method is an iterative algorithm wherein the server maintains an estimate of a minimum point and updates it iteratively using gradients sent by the agents. Specifically, in each iteration $t \in \{0, 1, \ldots\}$, the server starts with an estimate $x^t$ and broadcasts it to all the agents. Each non-faulty agent $i$ sends back to the server the gradient of its cost function at $x^t$, i.e., $\nabla Q_i(x^t)$. However, Byzantine faulty agents may send arbitrary incorrect vectors as their gradients to the server. The initial estimate, named $x^0$, is chosen arbitrarily by the server.

A gradient-filter is a vector function, denoted by ${\sf GradFilter}$, that maps the $n$ gradients received by the server from the $n$ agents to a $d$-dimensional vector, i.e., ${\sf GradFilter} : \mathbb{R}^{d \times n} \to \mathbb{R}^d$. For example, the average of all the gradients, as used in the traditional distributed gradient-descent method, is technically a gradient-filter. However, averaging is not robust against Byzantine faulty agents [4, 32]. The real purpose of a gradient-filter is to mitigate the detrimental impact of incorrect gradients sent by the Byzantine faulty agents; in other words, a gradient-filter robustifies the traditional gradient-descent method against Byzantine faults. We show that if a gradient-filter satisfies a certain property then it can confer fault-tolerance to the distributed gradient-descent method.

4.1 The Iterative Algorithm

We first formally describe below the steps in each iteration of the distributed gradient-descent method implemented on a synchronous server-based system. Note that we constrain the estimates computed by the server to a compact convex set $\mathcal{W} \subset \mathbb{R}^d$. The set $\mathcal{W}$ can be arbitrarily large. For a vector $x \in \mathbb{R}^d$, its projection onto $\mathcal{W}$, denoted by $[x]_{\mathcal{W}}$, is defined to be

$$[x]_{\mathcal{W}} = \arg\min_{y \in \mathcal{W}} \| x - y \|. \tag{21}$$

As $\mathcal{W}$ is a compact and convex set, $[x]_{\mathcal{W}}$ is unique for each $x$ (see [8]).

The $t$-th iteration: In each iteration $t \in \{0, 1, \ldots\}$ the server updates its current estimate $x^t$ to $x^{t+1}$ using Steps S1 and S2 described as follows.

S1: The server requests from each agent the gradient of its local cost function at the current estimate $x^t$. Each non-faulty agent $i$ will then send to the server the gradient $\nabla Q_i(x^t)$, whereas a faulty agent may send an incorrect arbitrary value for the gradient. The gradient received by the server from agent $i$ is denoted by $g^t_i$. If no gradient is received from some agent $i$, agent $i$ must be faulty (because the system is assumed to be synchronous); in this case, the server eliminates agent $i$ from the system, updates the values of $n$ and $f$, and re-assigns the agents indices from $1$ to $n$.

S2 (Gradient-filtering): The server applies a gradient-filter ${\sf GradFilter}$ to the $n$ received gradients and computes ${\sf GradFilter}(g^t_1, \ldots, g^t_n) \in \mathbb{R}^d$. Then, the server updates its estimate to

$$x^{t+1} = \left[ x^t - \eta_t \, {\sf GradFilter}\left(g^t_1, \ldots, g^t_n\right) \right]_{\mathcal{W}} \tag{22}$$

where $\eta_t$ is a positive step-size for iteration $t$.

We present below, in Theorem 3, a generic convergence result for the above algorithm. The proof of Theorem 3 is deferred to Appendix C.
Theorem 3. Consider the iterative algorithm described above, with the update law (22) and diminishing step-sizes $\{\eta_t, \, t = 0, 1, \ldots\}$ satisfying $\sum_{t=0}^{\infty} \eta_t = \infty$ and $\sum_{t=0}^{\infty} \eta_t^2 < \infty$. Suppose that $\left\| {\sf GradFilter}(g^t_1, \ldots, g^t_n) \right\| < \infty$ for all $t$. For some point $x^* \in \mathcal{W}$, if there exist real-valued constants ${\sf D}^* \in \left[0, \, \max_{x \in \mathcal{W}} \| x - x^* \| \right)$ and $\xi > 0$ such that for each iteration $t$,

$$\phi_t \triangleq \left\langle x^t - x^*, \; {\sf GradFilter}\left(g^t_1, \ldots, g^t_n\right) \right\rangle \geq \xi \quad \text{when} \quad \left\| x^t - x^* \right\| \geq {\sf D}^*, \tag{23}$$

then $\lim_{t \to \infty} \left\| x^t - x^* \right\| \leq {\sf D}^*$.

Note that the values ${\sf D}^*$ and $\xi$ in the statement of Theorem 3 need not be independent of each other. As shown in the subsequent section, the generic convergence result of Theorem 3 helps us obtain approximate fault-tolerance properties of different gradient-filters, under $(2f, \epsilon)$-redundancy and certain standard assumptions. We consider two particular gradient-filters, namely comparative gradient elimination (CGE) and coordinate-wise trimmed mean (CWTM).
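To make the algorithm of Section 4.1 concrete, the following Python sketch implements update (22) with a pluggable gradient-filter. This is a hedged illustration with hypothetical names and data; the filter shown is plain averaging, the non-robust baseline, while robust filters such as CGE and CWTM are defined in Section 4.2. The step-size schedule matches the one used in Section 5.

import numpy as np

def project(x, box):
    # Projection (21) onto the hypercube W = [-box, box]^d.
    return np.clip(x, -box, box)

def dgd(grad_oracles, grad_filter, d, box=1000.0, num_iters=500):
    # grad_oracles[i](x) returns agent i's reported gradient at x; a faulty
    # agent's oracle may return an arbitrary vector.
    x = np.zeros(d)                                  # initial estimate x^0
    for t in range(num_iters):
        g = np.stack([oracle(x) for oracle in grad_oracles])
        eta = 1.5 / (t + 1)                          # diminishing step-sizes
        x = project(x - eta * grad_filter(g), box)   # update (22)
    return x

average = lambda g: g.mean(axis=0)                   # the non-robust baseline

# Example: three fault-free agents with Q_i(x) = ||x - c_i||^2.
c = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
oracles = [lambda x, ci=ci: 2 * (x - ci) for ci in c]
print(dgd(oracles, average, d=2))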
4.2 Fault-Tolerance Properties of CGE and CWTM

In this subsection, we present precise approximate fault-tolerance guarantees for two specific gradient-filters: comparative gradient elimination (CGE) [17, 18] and coordinate-wise trimmed mean (CWTM) [31, 36]. Note that the differentiability of the non-faulty agents' cost functions, which is already assumed for the gradient-descent method, implies Assumption 1 (see [8]). We additionally make Assumptions 2, 3 and 4 about the non-faulty agents' cost functions. Similar assumptions are made in prior work on fault-free distributed optimization [3, 7, 25].

Assumption 2 (Lipschitz smoothness). For each non-faulty agent $i$, we assume that the gradient of its cost function, $\nabla Q_i(x)$, is Lipschitz continuous, i.e., there exists a finite real value $\mu > 0$ such that

$$\left\| \nabla Q_i(x) - \nabla Q_i(x') \right\| \leq \mu \left\| x - x' \right\|, \quad \forall x, x' \in \mathcal{W}.$$

Assumption 3 (Strong convexity). For a non-empty set of non-faulty agents $\mathcal{H}$, let $Q_{\mathcal{H}}(x)$ denote the average cost function of the agents in $\mathcal{H}$, i.e.,

$$Q_{\mathcal{H}}(x) = \frac{1}{|\mathcal{H}|} \sum_{i \in \mathcal{H}} Q_i(x).$$

For each such set $\mathcal{H}$ with $|\mathcal{H}| = n - f$, we assume that $Q_{\mathcal{H}}(x)$ is strongly convex, i.e., there exists a finite real value $\gamma > 0$ such that

$$\left\langle \nabla Q_{\mathcal{H}}(x) - \nabla Q_{\mathcal{H}}(x'), \; x - x' \right\rangle \geq \gamma \left\| x - x' \right\|^2, \quad \forall x, x' \in \mathcal{W}.$$

Note that, under Assumptions 2 and 3, $\gamma \leq \mu$. This inequality is proved in Appendix B. Now, recall that the iterative estimates of the algorithm in Section 4.1 are constrained to a compact convex set $\mathcal{W} \subset \mathbb{R}^d$.

Assumption 4 (Existence). For each set of non-faulty agents $\mathcal{H}$ with $|\mathcal{H}| = n - f$, we assume that there exists a point $x_{\mathcal{H}} \in \arg\min_{x \in \mathbb{R}^d} \sum_{i \in \mathcal{H}} Q_i(x)$ such that $x_{\mathcal{H}} \in \mathcal{W}$.

We now present the fault-tolerance properties of the CGE and CWTM gradient-filters in Sections 4.2.1 and 4.2.2 below, respectively.
4.2.1 The CGE Gradient-Filter

To apply the CGE gradient-filter in Step S2, the server sorts the $n$ gradients received from the $n$ agents at the completion of Step S1 as per their Euclidean norms (ties broken arbitrarily):

$$\left\| g^t_{i_1} \right\| \leq \ldots \leq \left\| g^t_{i_{n-f}} \right\| \leq \left\| g^t_{i_{n-f+1}} \right\| \leq \ldots \leq \left\| g^t_{i_n} \right\|.$$

That is, the gradient with the smallest norm, $g^t_{i_1}$, is received from agent $i_1$, and the gradient with the largest norm, $g^t_{i_n}$, is received from agent $i_n$. Then, the output of the CGE gradient-filter is the vector sum of the $n - f$ gradients with the smallest Euclidean norms. Specifically,

$${\sf GradFilter}\left(g^t_1, \ldots, g^t_n\right) = \sum_{j=1}^{n-f} g^t_{i_j}. \tag{24}$$

We show below, in Theorem 4, that when the fraction of Byzantine faulty agents $f/n$ is bounded, the algorithm in Section 4.1 with the CGE gradient-filter in Step S2 is $(f, O(\epsilon))$-resilient under $(2f, \epsilon)$-redundancy and the aforementioned assumptions. To present the formal result we define below some parameters.

• We define a fault-tolerance margin

$$\alpha = 1 - \frac{f}{n}\left(1 + \frac{2\mu}{\gamma}\right) \tag{25}$$

that determines the maximum fraction of Byzantine faulty agents that can be tolerated in an execution of the algorithm.

• We define a coefficient

$${\sf D} = \frac{4\mu f}{\alpha \gamma} = \frac{4\mu f}{\gamma - \frac{f}{n}\left(\gamma + 2\mu\right)} \tag{26}$$

that measures the resilience of the algorithm.

Both $\alpha$ and ${\sf D}$ depend upon $f$, the maximum number of Byzantine faulty agents in any given execution of the algorithm. Note that, under Assumptions 3 and 4, for each non-empty set of non-faulty agents $\mathcal{H}$ with $|\mathcal{H}| = n - f$, the aggregate cost function $\sum_{i \in \mathcal{H}} Q_i(x)$ has a unique minimum point, denoted by $x_{\mathcal{H}}$, in the set $\mathcal{W}$. Specifically,

$$x_{\mathcal{H}} = \arg\min_{x \in \mathbb{R}^d} \sum_{i \in \mathcal{H}} Q_i(x) \, \cap \, \mathcal{W}. \tag{27}$$

Theorem 4. Suppose that the non-faulty agents' cost functions satisfy the $(2f, \epsilon)$-redundancy property and Assumptions 2, 3 and 4. Consider the iterative algorithm in Section 4.1 with the CGE gradient-filter defined in (24), and diminishing step-sizes $\{\eta_0, \eta_1, \ldots\}$ in (22) satisfying $\sum_{t=0}^{\infty} \eta_t = \infty$ and $\sum_{t=0}^{\infty} \eta_t^2 < \infty$. If $\alpha > 0$ then for each set of $n - f$ non-faulty agents $\mathcal{H}$,

$$\lim_{t \to \infty} \left\| x^t - x_{\mathcal{H}} \right\| \leq {\sf D}\epsilon.$$

Thus, by Definition 2, the algorithm is asymptotically $(f, {\sf D}\epsilon)$-resilient.

The proof of Theorem 4 relies on Theorem 3, and is deferred to Appendix D. Essentially, to prove Theorem 4 we show that the CGE gradient-filter defined in (24) satisfies the conditions specified in Theorem 3 for ${\sf D}^* = {\sf D}\epsilon$ and $x^* = x_{\mathcal{H}}$, for every set of non-faulty agents $\mathcal{H}$ with $|\mathcal{H}| = n - f$.
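A minimal Python sketch of the CGE rule (24); the variable names are ours:

import numpy as np

def cge_filter(gradients, f):
    # gradients: an (n, d) array of the vectors received in Step S1.
    norms = np.linalg.norm(gradients, axis=1)
    keep = np.argsort(norms)[: len(gradients) - f]  # n - f smallest norms
    return gradients[keep].sum(axis=0)              # vector sum, as in (24)

Plugged into the loop of Section 4.1, this filter costs one sort per iteration, consistent with the $O(n(\log n + d))$ complexity noted in Section 2.2.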
According to Theorem 4, if $\alpha > 0$, or equivalently, if the fraction of Byzantine faulty agents $f/n$ is below the threshold value of $1/(1 + 2(\mu/\gamma))$, then, under the $(2f, \epsilon)$-redundancy property and other standard assumptions, the distributed gradient-descent method with the CGE gradient-filter is $(f, {\sf D}\epsilon)$-resilient. As $\gamma \leq \mu$ under Assumptions 2 and 3 (see Appendix B), the fault-tolerance property of the CGE gradient-filter stated in Theorem 4 requires $f/n < 1/3$, i.e., $f < n/3$. Note that as $f$ decreases, the value of ${\sf D}$ decreases, i.e., the resilience of the algorithm improves. Also, note that ${\sf D} = 0$ when $f = 0$, and therefore the algorithm converges to an actual minimum point of all the agents' aggregate cost function in the fault-free case.

4.2.2 The CWTM Gradient-Filter

To apply the CWTM gradient-filter in Step S2, the server sorts the $n$ gradients received from the $n$ agents at the completion of Step S1 as per their individual elements. For a vector $v \in \mathbb{R}^d$, we let $v[k]$ denote its $k$-th element. Specifically, for each $k \in \{1, \ldots, d\}$, the server sorts the $k$-th elements of the gradients (ties broken arbitrarily):

$$g^t_{i_1[k]}[k] \leq \ldots \leq g^t_{i_{f+1}[k]}[k] \leq \ldots \leq g^t_{i_{n-f}[k]}[k] \leq \ldots \leq g^t_{i_n[k]}[k].$$

That is, the smallest $k$-th element, $g^t_{i_1[k]}[k]$, is received from agent $i_1[k]$, and the largest $k$-th element, $g^t_{i_n[k]}[k]$, is received from agent $i_n[k]$. For each $k$, the server eliminates the largest $f$ and the smallest $f$ of the received elements. The output of the CWTM gradient-filter is then the vector whose $k$-th element equals the average of the remaining $n - 2f$ elements. That is, for each $k \in \{1, \ldots, d\}$,

$${\sf GradFilter}\left(g^t_1, \ldots, g^t_n\right)[k] = \frac{1}{n - 2f} \sum_{j=f+1}^{n-f} g^t_{i_j[k]}[k]. \tag{28}$$

We show below, in Theorem 5, that when the separation between the gradients of the non-faulty agents' cost functions is small enough, the CWTM gradient-filter guarantees approximate fault-tolerance under $(2f, \epsilon)$-redundancy. To formally present our result we make the following additional assumption about the distance between the gradients of two non-faulty agents' cost functions.
Assumption 5. For every pair of non-faulty agents $i$ and $j$, we assume that there exists $\lambda > 0$ such that for all $x \in \mathcal{W}$,

$$\left\| \nabla Q_i(x) - \nabla Q_j(x) \right\| \leq \lambda \, \max\left\{ \left\| \nabla Q_i(x) \right\|, \; \left\| \nabla Q_j(x) \right\| \right\}.$$

Obviously, owing to the triangle inequality, Assumption 5 always holds for $\lambda = 2$. However, as shown below in Theorem 5, we can presently guarantee fault-tolerance of the CWTM gradient-filter only when $\lambda < \gamma/(\mu \sqrt{d})$, where $\mu$ and $\gamma$ are the Lipschitz smoothness and strong convexity coefficients defined in Assumptions 2 and 3, respectively. Recall the definition of $x_{\mathcal{H}}$ from (27).

Theorem 5. Suppose that the non-faulty agents' cost functions satisfy the $(2f, \epsilon)$-redundancy property and Assumptions 2, 3, 4 and 5. Consider the iterative algorithm in Section 4.1 with the CWTM gradient-filter defined in (28), and diminishing step-sizes $\{\eta_0, \eta_1, \ldots\}$ in (22) satisfying $\sum_{t=0}^{\infty} \eta_t = \infty$ and $\sum_{t=0}^{\infty} \eta_t^2 < \infty$. If $\lambda < \gamma/(\mu \sqrt{d})$ then for each set of $n - f$ non-faulty agents $\mathcal{H}$,

$$\lim_{t \to \infty} \left\| x^t - x_{\mathcal{H}} \right\| \leq \left( \frac{n \lambda}{\left(\gamma / \mu \sqrt{d}\right) - \lambda} \right) \epsilon.$$

The proof of Theorem 5 is similar to that of Theorem 4, and is deferred to Appendix E. By Definition 2 of approximate fault-tolerance, Theorem 5 implies that the algorithm with the CWTM gradient-filter is asymptotically $(f, \sigma)$-resilient, where

$$\sigma = \left( \frac{n \lambda}{\left(\gamma / \mu \sqrt{d}\right) - \lambda} \right) \epsilon.$$

Note that the smaller the value of $\lambda$, the smaller the value of $\sigma$, and therefore the better the resilience guarantee of the CWTM gradient-filter.

Unlike that of the CGE gradient-filter, the resilience obtained for CWTM in Theorem 5 is independent of $f$, as long as the separation between the gradients of the non-faulty agents' cost functions is sufficiently small, i.e., $\lambda < \gamma/(\mu \sqrt{d})$. However, unlike the sufficient condition for the resilience of the CGE gradient-filter, the condition on $\lambda$ guaranteeing the resilience of the CWTM gradient-filter depends upon the dimension $d$ of the optimization problem: a larger dimension results in a tighter bound on $\lambda$.
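A minimal Python sketch of the CWTM rule (28), complementing the CGE sketch above (again with our own variable names):

import numpy as np

def cwtm_filter(gradients, f):
    # Coordinate-wise: sort each coordinate independently, drop its f
    # smallest and f largest entries, then average the remaining n - 2f
    # entries, as in (28).
    n = len(gradients)
    s = np.sort(gradients, axis=0)  # sorts every coordinate independently
    return s[f : n - f].mean(axis=0)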
5 Numerical Experiments

In this section, we present simulation results to empirically compare the approximate fault-tolerance achieved by the two aforementioned gradient-filters, CGE and CWTM. For the simulation, we consider the problem of distributed linear regression, which is a special distributed optimization problem with quadratic cost functions [18].
We consider a synchronous server-based system, as shown in Figure 1, wherein $n = 6$, $d = 2$, and $f = 1$. Each agent $i \in \{1, \ldots, n\}$ has a data point represented by a triplet $(A_i, B_i, N_i)$, where $A_i$ is a $d$-dimensional row vector, $B_i \in \mathbb{R}$ is the response, and $N_i \in \mathbb{R}$ is a noise value. Specifically, for all $i \in \{1, \ldots, n\}$,

$$B_i = A_i x^* + N_i \tag{29}$$

for a fixed parameter vector $x^* \in \mathbb{R}^2$. The collective data is represented by a triplet of matrices $(A, B, N)$, where the $i$-th rows of $A$, $B$, and $N$ are equal to $A_i$, $B_i$ and $N_i$, respectively; the specific numerical values of $x^*$, $A$, $B$ and $N$ used in the simulation are given in (30). It should be noted that

$$B = A x^* + N. \tag{31}$$

We let $A_S$, $B_S$ and $N_S$ represent matrices of dimensions $|S| \times d$, $|S| \times 1$ and $|S| \times 1$ whose rows are $\{A_i, \, i \in S\}$, $\{B_i, \, i \in S\}$ and $\{N_i, \, i \in S\}$, respectively, in increasing order of the agent indices. From (31), observe that for every non-empty set $S$,

$$B_S = A_S x^* + N_S. \tag{32}$$

Recall from basic linear algebra that if $A_S$ has full column rank, i.e., ${\rm rank}(A_S) = d = 2$, then $x^*$ is the unique solution of the set of equations (32). For our data, $A_S$ is full rank for every set $S$ with $|S| \geq n - 2f = 4$. Specifically,

$${\rm rank}(A_S) = d = 2, \quad \forall S \subseteq \{1, \ldots, 6\}, \; |S| \geq 4. \tag{33}$$

In this particular distributed optimization problem, each agent $i$ has a quadratic cost function defined to be

$$Q_i(x) = \left( B_i - A_i x \right)^2, \quad \forall x \in \mathbb{R}^2.$$

For an arbitrary non-empty set of agents $S$, we define

$$Q_S(x) = \sum_{i \in S} Q_i(x) = \sum_{i \in S} \left( B_i - A_i x \right)^2 = \left\| B_S - A_S x \right\|^2. \tag{34}$$

As the matrix $A_S$ is full rank for every $S$ with $|S| \geq 4$,

$$\arg\min_{x \in \mathbb{R}^2} Q_S(x) = \arg\min_{x \in \mathbb{R}^2} \left\| B_S - A_S x \right\|^2 = \left\{ x \; \middle| \; A_S^T A_S \, x = A_S^T B_S \right\}, \tag{35}$$

i.e., the solution set of the normal equations of the least-squares problem. Therefore, $Q_S(x)$ has a unique minimum point when $|S| \geq 4$. Henceforth, we write the notation $\arg\min_{x \in \mathbb{R}^2}$ simply as $\arg\min$, unless otherwise stated.
Due to the rank condition (33), the agents' cost functions satisfy the $(2f, \epsilon)$-redundancy property, stated in Definition 3, for a finite $\epsilon$ whose value we compute numerically. The steps used to compute $\epsilon$ are described below.

1. For each set $S \subset \{1, \ldots, 6\}$ with $|S| = n - f = 5$, compute the point $x_S \in \mathbb{R}^2$ that solves $A_S^T A_S \, x_S = A_S^T B_S$. Note that, due to (35), $x_S = \arg\min Q_S(x)$.

2. For each set $S \subset \{1, \ldots, 6\}$ with $|S| = n - f = 5$, do the following:

(a) For each set $\hat{S} \subseteq S$ with $|\hat{S}| \geq n - 2f = 4$, compute $x_{\hat{S}}$ such that $A_{\hat{S}}^T A_{\hat{S}} \, x_{\hat{S}} = A_{\hat{S}}^T B_{\hat{S}}$. Note that, due to (35), $x_{\hat{S}} = \arg\min Q_{\hat{S}}(x)$.

(b) Compute $\epsilon_S = \max_{\hat{S} \subseteq S, \, |\hat{S}| \geq 4} \left\| x_S - x_{\hat{S}} \right\|$. In this particular case, both the sets of minimum points $\arg\min Q_S(x)$ and $\arg\min Q_{\hat{S}}(x)$ are singletons, with points $x_S$ and $x_{\hat{S}}$, respectively. Therefore, $\left\| x_S - x_{\hat{S}} \right\| = {\rm dist}\left( \arg\min Q_S(x), \, \arg\min Q_{\hat{S}}(x) \right)$.

3. In the final step, compute $\epsilon = \max_{|S| = n - f} \epsilon_S$.
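This three-step procedure reduces to a short enumeration. The Python sketch below mirrors it (the data here are hypothetical stand-ins; the paper's actual values are those in (30)):

import numpy as np
from itertools import combinations

def lsq(A, B):
    # Unique minimum point of ||B_S - A_S x||^2 when A_S has full column rank.
    return np.linalg.lstsq(A, B, rcond=None)[0]

def compute_eps(A, B, n, f):
    eps = 0.0
    for S in combinations(range(n), n - f):                  # Steps 1-2
        x_S = lsq(A[list(S)], B[list(S)])
        for size in range(n - 2 * f, n - f + 1):
            for Sh in combinations(S, size):
                x_Sh = lsq(A[list(Sh)], B[list(Sh)])
                eps = max(eps, float(np.linalg.norm(x_S - x_Sh)))  # 2(b), 3
    return eps

# Hypothetical data with n = 6, d = 2 and small noise:
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))
B = A @ np.array([1.0, 1.0]) + 0.1 * rng.standard_normal(6)
print(compute_eps(A, B, n=6, f=1))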
For each agent $i$, the cost function $Q_i(x)$ has Lipschitz continuous gradients, i.e., satisfies Assumption 2, with Lipschitz coefficient

$$\mu = v_i \tag{36}$$

where $v_i$ denotes the largest eigenvalue of $A_i^T A_i$. Also, for every set of agents $S$ with $|S| = n - f = 5$, the average cost function $(1/|S|) \, Q_S(x)$ is strongly convex, i.e., satisfies Assumption 3, with strong convexity coefficient

$$\gamma = \frac{1}{|S|} v_S \tag{37}$$

where $v_S$ is the smallest eigenvalue of $A_S^T A_S$. Derivations of (36) and (37) can be found in [18, Section 10].

Simulation: In our experiments, we simulate the following fault behaviors for the faulty agent:

• gradient-reverse: the faulty agent reverses its true gradient. If the correct gradient of a faulty agent $i$ at iteration $t$ is $s^t_i$, agent $i$ sends the incorrect gradient $g^t_i = -s^t_i$ to the server.

• random: the faulty agent sends a randomly chosen vector in $\mathbb{R}^d$. In our experiments, the faulty agent in each iteration chooses an i.i.d. Gaussian random vector with mean $0$ and an isotropic covariance matrix with standard deviation $200$.

We simulate the distributed gradient-descent algorithm described in Section 4.1, assuming agent 1 to be Byzantine faulty. It should be noted that the identity of the faulty agent is not used in any way during the simulations. Here, the set of non-faulty agents is $\mathcal{H} = \{2, \ldots, 6\}$ with $|\mathcal{H}| = n - f = 5$; therefore, in this particular case, $\mathcal{H}$ is the only set of $n - f$ non-faulty agents. From (35), we obtain that the minimum point of the aggregate cost function $\sum_{i \in \mathcal{H}} Q_i(x)$, denoted by $x_{\mathcal{H}}$, is the unique solution of the normal equations $A_{\mathcal{H}}^T A_{\mathcal{H}} \, x_{\mathcal{H}} = A_{\mathcal{H}}^T B_{\mathcal{H}}$. Also, note from our earlier deductions in (36) and (37) that the non-faulty agents' cost functions satisfy Assumptions 2 and 3, with $\mu$ and $\gamma$ taking the corresponding computed values.
Parameters: We use the following parameters for implementing the algorithm. In the update rule (22), we use the step-size $\eta_t = 1.5/(t+1)$ for iteration $t = 0, 1, \ldots$. Note that this particular step-size is diminishing and satisfies the conditions $\sum_{t=0}^{\infty} \eta_t = \infty$ and $\sum_{t=0}^{\infty} \eta_t^2 = 3\pi^2/8 < \infty$ (see [30]). We assume the compact convex set $\mathcal{W} \subset \mathbb{R}^d$ to be a 2-dimensional hypercube centered at the origin, large enough that $x_{\mathcal{H}} \in \mathcal{W}$, i.e., Assumption 4 holds true. In all the simulation results presented below, the same fixed initial estimate $x^0 \in \mathcal{W}$ is used, the algorithm is run for $500$ iterations, and its output is $x_{\rm out} = x^{500}$. The outputs for the two gradient-filters, under the different fault behaviors, are shown in Table 1. Note that ${\rm dist}(x_{\mathcal{H}}, x_{\rm out}) = \| x_{\mathcal{H}} - x_{\rm out} \|$. The results for the case when the faulty agent sends random faulty gradients are shown for a randomly chosen execution.
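One simulated execution, tying together the parameters above with the filter sketches from Section 4.2, might look as follows (hypothetical data; cge_filter and cwtm_filter are the sketches given earlier):

import numpy as np

def run(A, B, grad_filter, fault="gradient-reverse", faulty=0, T=500):
    n, d = A.shape
    x = np.array([-0.5, -0.5])            # a fixed initial estimate x^0
    rng = np.random.default_rng(0)
    for t in range(T):
        g = 2 * A * (A @ x - B)[:, None]  # true gradients of (B_i - A_i x)^2
        if fault == "gradient-reverse":
            g[faulty] = -g[faulty]        # the faulty agent reverses its gradient
        elif fault == "random":
            g[faulty] = rng.normal(0.0, 200.0, size=d)
        x = np.clip(x - (1.5 / (t + 1)) * grad_filter(g), -1000.0, 1000.0)
    return x                               # x_out = x^500

# e.g., x_out = run(A, B, lambda g: cge_filter(g, f=1), fault="random")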
Conclusion: As shown in Table 1, in all executions the distance between $x_{\mathcal{H}}$ and the output of the algorithm, $x_{\rm out}$, for both the CGE and CWTM gradient-filters, is smaller than the computed redundancy parameter $\epsilon$. Figure 2 plots the aggregate non-faulty cost $\sum_{i \in \mathcal{H}} Q_i(x^t)$ (referred to as loss) and the approximation error $\| x^t - x_{\mathcal{H}} \|$ (referred to as distance) for iterations $t$ ranging from $0$ to $500$. We also show the plots of the fault-free distributed gradient-descent (DGD) method when all the agents are free from faults, and of the DGD method without any gradient-filter when agent 1 is Byzantine faulty. The details for iterations $t$ ranging from $0$ to $80$ are highlighted in Figure 3.

Table 1: Algorithm's outputs with gradient-filters CGE and CWTM, and the approximation errors, corresponding to executions when the faulty agent exhibits two different types of Byzantine faults, gradient-reverse and random. Recall that $x_{\rm out} = x^{500}$ and ${\rm dist}(x_{\mathcal{H}}, x_{\rm out}) = \| x_{\mathcal{H}} - x_{\rm out} \|$. For each filter (rows CGE and CWTM) and each fault type (column groups gradient-reverse and random), the table reports $x_{\rm out}$ and ${\rm dist}(x_{\mathcal{H}}, x_{\rm out})$.

Figure 2: The losses, i.e., $\sum_{i \in \mathcal{H}} Q_i(x^t)$, and distances, i.e., $\| x^t - x_{\mathcal{H}} \|$, versus the number of iterations in the algorithm. The final approximation errors, i.e., $\| x^{500} - x_{\mathcal{H}} \|$, are annotated in the same colors as their corresponding plots. For the executions shown, agent 1 is assumed to be Byzantine faulty. The two columns show the results when the faulty agent exhibits the different types of faults: (a) gradient-reverse, and (b) random. Apart from the plots with the CGE (in green) and CWTM (in yellow) gradient-filters, we also plot the fault-free distributed gradient-descent (DGD) method when all agents are free from faults (in blue), and the DGD method without any gradient-filter when agent 1 is Byzantine faulty (in red).

Figure 3: The losses, i.e., $\sum_{i \in \mathcal{H}} Q_i(x^t)$, and distances, i.e., $\| x^t - x_{\mathcal{H}} \|$, versus the number of iterations in the algorithm, magnified for the initial 80 iterations. The meaning of the plots is the same as in Figure 2.

6 Summary
In this paper, we have studied the problem of approximate Byzantine fault-tolerance, which is a generalization of the exact fault-tolerance problem studied in prior work [20]. Unlike exact fault-tolerance, the goal in approximate fault-tolerance is to design a distributed optimization algorithm that produces only an approximation of a minimum point of the aggregate cost function of at least $n - f$ non-faulty agents, in the presence of up to $f$ (out of $n$) Byzantine faulty agents.

We have defined approximate fault-tolerance formally as $(f, \epsilon)$-resilience, where $\epsilon \in \mathbb{R}_{\geq 0}$ denotes the approximation error. In the first part of the paper, i.e., Section 3, we have obtained necessary and sufficient conditions for the achievability of $(f, \epsilon)$-resilience. These results generalize the prior result which states that exact fault-tolerance is achievable if and only if the $2f$-redundancy property is satisfied [20, 21]. In the second part of the paper, i.e., Sections 4 and 5, we have considered the case when the agents' cost functions are differentiable. For this particular case, we have first derived a generic approximate fault-tolerance property of the distributed gradient-descent method when equipped with Byzantine robust gradient aggregation, or a gradient-filter. Then, we have obtained specific approximate fault-tolerance guarantees for two well-known gradient-filters: comparative gradient elimination (CGE) and coordinate-wise trimmed mean (CWTM). Finally, in Section 5, we have presented empirical results comparing the approximate fault-tolerance achieved by the two aforementioned gradient-filters.

Acknowledgments

Research reported in this paper was sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196, and by the National Science Foundation award 1842198. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory, the National Science Foundation, or the U.S. Government.
References

[1] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, 2018.
[2] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD with majority vote is communication efficient and Byzantine fault tolerant. arXiv preprint arXiv:1810.05291, 2018.
[3] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.
[4] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 119–129, 2017.
[5] Léon Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.
[6] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
[7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1), 2011.
[8] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] Xinyang Cao and Lifeng Lai. Distributed gradient descent algorithm robust to an arbitrary number of Byzantine attackers. IEEE Transactions on Signal Processing, 67(22):5850–5864, 2019.
[10] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47–60, 2017.
[11] Yuan Chen, Soummya Kar, and José M. F. Moura. Resilient distributed estimation through adversary detection. IEEE Transactions on Signal Processing, 66(9), 2018.
[12] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
[13] Michelle S. Chong, Masashi Wakaiki, and João P. Hespanha. Observability of linear systems under adversarial attacks. In American Control Conference, pages 2439–2444. IEEE, 2015.
[14] Georgios Damaskinos, Rachid Guerraoui, Rhicheek Patra, Mahsa Taziki, et al. Asynchronous Byzantine machine learning (the case of SGD). In International Conference on Machine Learning, pages 1153–1162, 2018.
[15] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
[16] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2011.
[17] Nirupam Gupta, Shuo Liu, and Nitin H. Vaidya. Byzantine fault-tolerant distributed machine learning using stochastic gradient descent (SGD) and norm-based comparative gradient elimination (CGE). arXiv preprint arXiv:2008.04699, 2020.
[18] Nirupam Gupta and Nitin H. Vaidya. Byzantine fault tolerant distributed linear regression. arXiv preprint arXiv:1903.08752, 2019.
[19] Nirupam Gupta and Nitin H. Vaidya. Byzantine fault-tolerant parallelized stochastic gradient descent for linear regression. Pages 415–420. IEEE, 2019.
[20] Nirupam Gupta and Nitin H. Vaidya. Fault-tolerance in distributed optimization: The case of redundancy. In Proceedings of the 39th Symposium on Principles of Distributed Computing, pages 365–374, 2020.
[21] Nirupam Gupta and Nitin H. Vaidya. Resilience in collaborative optimization: redundant and independent cost functions. arXiv preprint arXiv:2003.09675, 2020.
[22] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3), 1982.
[23] Nancy A. Lynch. Distributed Algorithms. Elsevier, 1996.
[24] James R. Munkres. Topology, 2000.
[25] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1), 2009.
[26] Miroslav Pajic, James Weimer, Nicola Bezzo, Paulo Tabuada, Oleg Sokolsky, Insup Lee, and George J. Pappas. Robustness of attack-resilient state estimators. In ICCPS'14: ACM/IEEE 5th International Conference on Cyber-Physical Systems (with CPS Week 2014), pages 163–174. IEEE Computer Society, 2014.
[27] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
[28] Michael Rabbat and Robert Nowak. Distributed optimization in sensor networks. In Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, pages 20–27, 2004.
[29] Robin L. Raffard, Claire J. Tomlin, and Stephen P. Boyd. Distributed optimization for cooperative agents: Application to formation flight. Volume 3, pages 2453–2459. IEEE, 2004.
[30] Walter Rudin. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1964.
[31] Lili Su and Shahin Shahrampour. Finite-time guarantees for Byzantine-resilient distributed state estimation with noisy measurements. arXiv preprint arXiv:1810.10086, 2018.
[32] Lili Su and Nitin H. Vaidya. Fault-tolerant multi-agent optimization: optimal iterative distributed algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, pages 425–434, 2016.
[33] Lili Su and Nitin H. Vaidya. Non-Bayesian learning in the presence of Byzantine agents. In International Symposium on Distributed Computing. Springer, 2016.
[34] Lili Su and Nitin H. Vaidya. Byzantine-resilient multi-agent optimization. IEEE Transactions on Automatic Control, 2020.
[35] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Generalized Byzantine-tolerant SGD. arXiv preprint arXiv:1802.10116, 2018.
[36] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5636–5645, 2018.
A Appendix: The Special Case of $(f, 0)$-Resilience

We show that $(f, 0)$-resilience, stated in Definition 2, and exact fault-tolerance, defined in Section 1.1, are equivalent in the deterministic framework. Specifically, we show that a deterministic $(f, 0)$-resilient algorithm has exact fault-tolerance, and vice versa. Throughout this appendix, $\arg\min_{x \in \mathbb{R}^d}$ is simply written as $\arg\min$, unless otherwise stated.

First, we show that $(f, 0)$-resilience implies exact fault-tolerance. Suppose that a deterministic algorithm $\Pi$ is $(f, 0)$-resilient. Consider an execution $E_{\mathcal{H}}$ of $\Pi$ wherein $\mathcal{H} \subseteq \{1, \ldots, n\}$ denotes the set of all the non-faulty agents, and let $\widehat{x}$ denote the output. Recall that, as there are at most $f$ faulty agents, $|\mathcal{H}| \geq n - f$. To prove that $\Pi$ has exact fault-tolerance, it suffices to show that, in execution $E_{\mathcal{H}}$, $\widehat{x}$ is a minimum point of the aggregate cost function of all non-faulty agents, $\sum_{i \in \mathcal{H}} Q_i(x)$.

By Definition 2 of $(f, 0)$-resilience, for every set $S \subseteq \mathcal{H}$ with $|S| = n - f$,
$$\widehat{x} \in \arg\min \sum_{i \in S} Q_i(x).$$
Therefore, for every set $S$ with $S \subseteq \mathcal{H}$ and $|S| = n - f$,
$$\sum_{i \in S} Q_i(\widehat{x}) \leq \sum_{i \in S} Q_i(x), \quad \forall x \in \mathbb{R}^d. \tag{38}$$
Now, note that there are $\binom{|\mathcal{H}|}{n - f}$ subsets of $\mathcal{H}$ of size $n - f$, and each agent $i \in \mathcal{H}$ is contained in $\binom{|\mathcal{H}| - 1}{n - f - 1}$ of those subsets. Therefore,
$$\sum_{S \subseteq \mathcal{H},\, |S| = n - f} \; \sum_{i \in S} Q_i(x) = \binom{|\mathcal{H}| - 1}{n - f - 1} \sum_{i \in \mathcal{H}} Q_i(x). \tag{39}$$
Substituting from (38) in (39) we obtain that
$$\sum_{i \in \mathcal{H}} Q_i(\widehat{x}) \leq \sum_{i \in \mathcal{H}} Q_i(x), \quad \forall x \in \mathbb{R}^d.$$
The above implies that
$$\widehat{x} \in \arg\min \sum_{i \in \mathcal{H}} Q_i(x).$$
This proves that $\Pi$ has exact fault-tolerance in execution $E_{\mathcal{H}}$.
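The double-counting identity (39) is straightforward to verify numerically. The following is a minimal sketch (our own illustration, not part of the formal argument) that fixes hypothetical values $n = 7$, $f = 2$, draws arbitrary cost values $Q_i(x)$ at a fixed point $x$, and checks the identity:

```python
# Sanity check of (39): over all (n-f)-subsets S of H, each agent i in H
# is counted exactly C(|H|-1, n-f-1) times. All parameter values are
# hypothetical, chosen only for illustration.
from itertools import combinations
from math import comb

import numpy as np

rng = np.random.default_rng(0)
n, f = 7, 2
H = range(6)                      # a non-faulty set with |H| = 6 >= n - f
q = rng.normal(size=len(H))       # Q_i(x) at some fixed x, one value per agent

lhs = sum(sum(q[i] for i in S) for S in combinations(H, n - f))
rhs = comb(len(H) - 1, n - f - 1) * q.sum()
assert np.isclose(lhs, rhs)
```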
Now, we show that exact fault-tolerance implies $(f, 0)$-resilience. Suppose that a deterministic algorithm $\Pi$ has exact fault-tolerance. Consider an execution $E_{\mathcal{H}}$ of $\Pi$ wherein set $\mathcal{H}$ comprises all the non-faulty agents, and $\widehat{x}$ is its output. Therefore,
$$\widehat{x} \in \arg\min \sum_{i \in \mathcal{H}} Q_i(x).$$
To prove that $\Pi$ is $(f, 0)$-resilient, it suffices to show that, in execution $E_{\mathcal{H}}$, for every set $S \subseteq \mathcal{H}$ with $|S| = n - f$, $\widehat{x}$ is a minimum point of the aggregate cost function $\sum_{i \in S} Q_i(x)$. This is trivially true when $|\mathcal{H}| = n - f$. We assume below that $|\mathcal{H}| > n - f$.

Consider an arbitrary subset $S$ of $\mathcal{H}$ with $|S| = n - f$. Consider an execution $E_S$ wherein $S$ is the set of all non-faulty agents, with the remaining agents in $\{1, \ldots, n\} \setminus S$ being Byzantine faulty. Suppose that the inputs from all the agents to the server in $E_S$ are identical to their inputs in $E_{\mathcal{H}}$. Therefore, as $\Pi$ is a deterministic algorithm, its output in execution $E_S$ is the same as that in execution $E_{\mathcal{H}}$, i.e., $\widehat{x}$. Moreover, as $\Pi$ is assumed to have exact fault-tolerance,
$$\widehat{x} \in \arg\min \sum_{i \in S} Q_i(x).$$
As $S$ is an arbitrary subset of $\mathcal{H}$ with $|S| = n - f$, the above proves that $\Pi$ is $(f, 0)$-resilient in execution $E_{\mathcal{H}}$.

B Appendix: Proof of $\gamma \leq \mu$

We show below that if Assumptions 2 and 3 hold true simultaneously then $\gamma \leq \mu$. Consider an arbitrary set of $n - f$ non-faulty agents $\mathcal{H}$, and two arbitrary non-identical points $x, y \in \mathbb{R}^d$, i.e., $x \neq y$. If Assumption 2 holds true then
$$\|\nabla Q_i(x) - \nabla Q_i(y)\| \leq \mu \|x - y\|, \quad \forall i \in \mathcal{H}.$$
Therefore, owing to the Cauchy-Schwartz inequality, for all $i \in \mathcal{H}$,
$$\langle x - y, \, \nabla Q_i(x) - \nabla Q_i(y) \rangle \leq \|x - y\| \, \|\nabla Q_i(x) - \nabla Q_i(y)\| \leq \mu \|x - y\|^2. \tag{40}$$
From (40) we obtain that
$$\sum_{i \in \mathcal{H}} \langle x - y, \, \nabla Q_i(x) - \nabla Q_i(y) \rangle \leq \mu \, |\mathcal{H}| \, \|x - y\|^2. \tag{41}$$
If Assumption 3 holds true then
$$\sum_{i \in \mathcal{H}} \langle x - y, \, \nabla Q_i(x) - \nabla Q_i(y) \rangle \geq \gamma \, |\mathcal{H}| \, \|x - y\|^2. \tag{42}$$
As $x, y$ are arbitrary non-identical points, (41) and (42) together imply that $\gamma \leq \mu$.
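For intuition, the inequality $\gamma \leq \mu$ is easy to check numerically on simple quadratic costs, where the two coefficients reduce to extreme eigenvalues of Hessian matrices. A small sketch (ours; the quadratic form and all constants are assumptions for illustration):

```python
# For quadratics Q_i(x) = 0.5 * x^T A_i x, each gradient A_i x is
# lambda_max(A_i)-Lipschitz, and the average cost is
# lambda_min(mean(A_i))-strongly convex; the proof above says the
# former coefficient dominates the latter.
import numpy as np

rng = np.random.default_rng(1)
d, m = 4, 6
A = []
for _ in range(m):
    B = rng.normal(size=(d, d))
    A.append(B.T @ B + 0.1 * np.eye(d))    # random positive definite Hessian

mu = max(np.linalg.eigvalsh(a)[-1] for a in A)      # Assumption 2 coefficient
gamma = np.linalg.eigvalsh(np.mean(A, axis=0))[0]   # Assumption 3 coefficient
assert gamma <= mu
```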
C Appendix: Proof of Theorem 3

The proof of Theorem 3 relies on the following sufficient criterion for the convergence of non-negative sequences.

Lemma 2 (Bottou, 1998 [5]). Consider a sequence of real values $\{u_t, \, t = 0, 1, \ldots\}$. If $u_t \geq 0$ for all $t$, then
$$\sum_{t=0}^{\infty} (u_{t+1} - u_t)_+ < \infty \implies \begin{cases} u_t \xrightarrow{t \to \infty} u_\infty < \infty, \\ \sum_{t=0}^{\infty} (u_{t+1} - u_t)_- > -\infty, \end{cases} \tag{43}$$
where the operators $(\cdot)_+$ and $(\cdot)_-$ are defined as follows for a real scalar $x$:
$$(x)_+ = \begin{cases} x, & x > 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (x)_- = \begin{cases} 0, & x > 0, \\ x, & \text{otherwise}. \end{cases}$$

Recall from the statement of Theorem 3 that $x^* \in \mathcal{W}$ where $\mathcal{W}$ is a compact convex set. We define, for all $t \in \{0, 1, \ldots\}$,
$$e_t = \left\|x^t - x^*\right\|. \tag{44}$$
Next, we define a univariate real-valued function $\psi: \mathbb{R} \to \mathbb{R}$:
$$\psi(x) = \begin{cases} 0, & x < (D^*)^2, \\ \left(x - (D^*)^2\right)^2, & x \geq (D^*)^2. \end{cases} \tag{45}$$
Let $\psi'(x)$ denote the derivative of $\psi$ at $x$. Specifically,
$$\psi'(x) = 2 \max\left\{0, \; x - (D^*)^2\right\}. \tag{46}$$
We show below that $\psi'(x)$ is a Lipschitz continuous function with Lipschitz coefficient of 2. From (46), we obtain that
$$\left|\psi'(x) - \psi'(y)\right| = \begin{cases} 2\,|x - y|, & \text{both } x, y \geq (D^*)^2, \\ 2\left|x - (D^*)^2\right|, & x \geq (D^*)^2, \; y < (D^*)^2, \\ 0, & \text{both } x, y < (D^*)^2. \end{cases} \tag{47}$$
Note from (47) that for the case when $x \geq (D^*)^2$ and $y < (D^*)^2$, $\left|\psi'(x) - \psi'(y)\right| = 2\left|x - (D^*)^2\right| < 2\,|x - y|$. Similarly, due to symmetry, when $x < (D^*)^2$ and $y \geq (D^*)^2$, then $\left|\psi'(x) - \psi'(y)\right| = 2\left|y - (D^*)^2\right| < 2\,|x - y|$. Therefore, from (47) we obtain that
$$\left|\psi'(x) - \psi'(y)\right| \leq 2\,|x - y|, \quad \forall x, y \in \mathbb{R}. \tag{48}$$
The Lipschitz continuity of $\psi'(x)$, shown in (48), implies that [6, Section 4.1]
$$\psi(y) - \psi(x) \leq (y - x)\,\psi'(x) + (y - x)^2, \quad \forall x, y \in \mathbb{R}. \tag{49}$$
Now, for each $t \in \{0, 1, \ldots\}$, we define
$$h_t = \psi\left(e_t^2\right). \tag{50}$$
From (49) and (50), for all $t$, we obtain that
$$h_{t+1} - h_t = \psi\left(e_{t+1}^2\right) - \psi\left(e_t^2\right) \leq \left(e_{t+1}^2 - e_t^2\right)\psi'\left(e_t^2\right) + \left(e_{t+1}^2 - e_t^2\right)^2.$$
We use $\psi'_t$ as a shorthand for $\psi'\left(e_t^2\right)$. From above we have
$$h_{t+1} - h_t \leq \left(e_{t+1}^2 - e_t^2\right)\psi'_t + \left(e_{t+1}^2 - e_t^2\right)^2. \tag{51}$$
Now, recall from (22) that for all $t \in \{0, 1, \ldots\}$,
$$x^{t+1} = \left[x^t - \eta_t \, \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right]_{\mathcal{W}}. \tag{52}$$
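In code, the update rule (52) is one step of projected descent on a filtered aggregate of the reported gradients. A minimal sketch (ours), assuming for concreteness that $\mathcal{W}$ is a box so the Euclidean projection is coordinate-wise clipping, and leaving the filter as a pluggable function:

```python
# One iteration of (52): x^{t+1} = [ x^t - eta_t * GradFilter(g_1,...,g_n) ]_W.
# The box-shaped W and the generic grad_filter argument are our assumptions,
# made only so the sketch is concrete and runnable.
import numpy as np

def project_box(x, lo, hi):
    # Euclidean projection onto W = [lo, hi]^d is coordinate-wise clipping.
    return np.clip(x, lo, hi)

def filtered_descent_step(x, grads, eta, grad_filter, lo=-10.0, hi=10.0):
    g = grad_filter(grads)          # robust aggregate of the n reported gradients
    return project_box(x - eta * g, lo, hi)
```

Any of the gradient-filters analysed in Appendices D and E below (CGE, CWTM) can be passed in as `grad_filter`.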
By the non-expansion property of Euclidean projection onto a closed convex set,
$$\left\|x^{t+1} - x^*\right\| \leq \left\|x^t - x^* - \eta_t \, \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|.$$
Recall from (44) that $e_t$ denotes $\left\|x^t - x^*\right\|$ for all $t$. Upon squaring both sides of the above inequality, we obtain that
$$e_{t+1}^2 \leq e_t^2 - 2\eta_t \left\langle x^t - x^*, \, \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\rangle + \eta_t^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2. \tag{53}$$
Recall, from (23) in the statement of Theorem 3, that
$$\phi_t = \left\langle x^t - x^*, \, \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\rangle, \quad \forall t.$$
Substituting from the above in (53), we obtain that
$$e_{t+1}^2 \leq e_t^2 - 2\eta_t \phi_t + \eta_t^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2. \tag{54}$$
As $\psi'_t \geq 0$ for all $t$, substituting from (54) in (51) we get
$$h_{t+1} - h_t \leq \left(-2\eta_t \phi_t + \eta_t^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2\right)\psi'_t + \left(e_{t+1}^2 - e_t^2\right)^2. \tag{55}$$
Note that for an arbitrary $t$,
$$\left|e_{t+1}^2 - e_t^2\right| = \left(e_{t+1} + e_t\right)\left|e_{t+1} - e_t\right|. \tag{56}$$
As $\mathcal{W}$ is assumed compact, there exists $\Gamma = \max_{x \in \mathcal{W}} \|x - x^*\| < \infty$. We assume $\Gamma > 0$; otherwise $\mathcal{W} = \{x^*\}$ and the theorem is trivial.
Recall from the update rule (22), restated above in (52), that $x^t \in \mathcal{W}$ for all $t$, and that $x^* \in \mathcal{W}$. Therefore,
$$e_t = \left\|x^t - x^*\right\| \leq \max_{x \in \mathcal{W}} \|x - x^*\| = \Gamma, \quad \forall t. \tag{57}$$
From (57), for all $t$, we obtain that
$$e_{t+1} + e_t \leq 2\Gamma.$$
Substituting from above in (56) implies that
$$\left|e_{t+1}^2 - e_t^2\right| \leq 2\Gamma \left|e_{t+1} - e_t\right|, \quad \forall t. \tag{58}$$
From the triangle inequality, $\left|e_{t+1} - e_t\right| = \left|\left\|x^{t+1} - x^*\right\| - \left\|x^t - x^*\right\|\right| \leq \left\|x^{t+1} - x^t\right\|$. Thus,
$$\left|e_{t+1}^2 - e_t^2\right| \leq 2\Gamma \left\|x^{t+1} - x^t\right\|. \tag{59}$$
Due to the non-expansion property of Euclidean projection onto a closed convex set, from (52) we obtain that
$$\left\|x^{t+1} - x^t\right\| = \left\|\left[x^t - \eta_t\,\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right]_{\mathcal{W}} - x^t\right\| \leq \eta_t \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|.$$
Substituting from above in (59) we obtain that
$$\left|e_{t+1}^2 - e_t^2\right| \leq 2\eta_t \Gamma \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|.$$
Thus,
$$\left(e_{t+1}^2 - e_t^2\right)^2 \leq 4\eta_t^2 \Gamma^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2. \tag{60}$$
Substituting from (60) in (55) we obtain that, for all $t$,
$$h_{t+1} - h_t \leq \left(-2\eta_t\phi_t + \eta_t^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2\right)\psi'_t + 4\eta_t^2\Gamma^2 \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2 = -2\eta_t\phi_t\psi'_t + \eta_t^2\left(\psi'_t + 4\Gamma^2\right)\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|^2. \tag{61}$$
Recall from (57) that $e_t \leq \Gamma$. Also, by assumption, $D^* < \max_{x \in \mathcal{W}} \|x - x^*\| = \Gamma$. Recall that $\psi'_t$ is short for $\psi'\left(e_t^2\right)$. Therefore, from (46) we obtain that
$$0 \leq \psi'_t \leq 2\left(\Gamma^2 - (D^*)^2\right) \leq 2\Gamma^2, \quad \forall t. \tag{62}$$
As the statement of Theorem 3 assumes that $\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| < \infty$ for all $t$, there exists a real value $M < \infty$ such that
$$\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| \leq M, \quad \forall t. \tag{63}$$
Substituting from (62) and (63) in (61) we obtain that
$$h_{t+1} - h_t \leq -2\eta_t\phi_t\psi'_t + 6\eta_t^2\Gamma^2 M^2. \tag{64}$$
We now use Lemma 2 to prove that $h_\infty = 0$, as follows. For an iteration $t$, we consider below the two possible cases: (i) $e_t < D^*$, and (ii) $e_t = D^* + \delta$ for some $\delta \geq 0$.

Case i) In this case, $\psi'_t = 0$. Therefore, due to the Cauchy-Schwartz inequality,
$$|\phi_t| = \left|\left\langle x^t - x^*, \, \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\rangle\right| \leq e_t \left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\|.$$
Substituting from (63) above, we obtain that $|\phi_t| \leq \Gamma M < \infty$. Therefore,
$$\phi_t\psi'_t = 0. \tag{65}$$
Case ii) In this particular case, from (46), we obtain that
$$\psi'_t = 2\left((D^* + \delta)^2 - (D^*)^2\right) = 2\delta\left(2D^* + \delta\right).$$
Now, by assumption, $\phi_t \geq \xi$ when $e_t \geq D^*$, where $\xi > 0$.
Therefore,
$$\phi_t\psi'_t \geq 2\xi\delta\left(2D^* + \delta\right) \geq 0. \tag{66}$$
From (65) and (66) above, we obtain that
$$\phi_t\psi'_t \geq 0, \quad \forall t. \tag{67}$$
Substituting the above in (64) implies that
$$h_{t+1} - h_t \leq 6\eta_t^2\Gamma^2 M^2, \quad \forall t.$$
Recall the notation $(\cdot)_+$ from Lemma 2. The above inequality implies that
$$\left(h_{t+1} - h_t\right)_+ \leq 6\eta_t^2\Gamma^2 M^2.$$
As $\sum_{t=0}^{\infty}\eta_t^2 < \infty$ by assumption, and the constants $\Gamma, M < \infty$, the above implies that
$$\sum_{t=0}^{\infty}\left(h_{t+1} - h_t\right)_+ \leq 6\Gamma^2 M^2 \sum_{t=0}^{\infty}\eta_t^2 < \infty.$$
As $h_t \geq 0$ for all $t$, the above in conjunction with Lemma 2 implies that
$$h_t \xrightarrow{t\to\infty} h_\infty < \infty, \quad \text{and} \quad \sum_{t=0}^{\infty}\left(h_{t+1} - h_t\right)_- > -\infty. \tag{68}$$
Note that $h_\infty - h_0 = \sum_{t=0}^{\infty}\left(h_{t+1} - h_t\right)$. Thus, from (64) we obtain that
$$h_\infty - h_0 \leq -2\sum_{t=0}^{\infty}\eta_t\phi_t\psi'_t + 6\Gamma^2 M^2 \sum_{t=0}^{\infty}\eta_t^2. \tag{69}$$
By definition (50), $h_t \geq 0$ for all $t$. Therefore, from (69) above we obtain that
$$2\left|\sum_{t=0}^{\infty}\eta_t\phi_t\psi'_t\right| \leq h_0 + h_\infty + 6\Gamma^2 M^2 \sum_{t=0}^{\infty}\eta_t^2. \tag{70}$$
By assumption, $\sum_{t=0}^{\infty}\eta_t^2 < \infty$. From (68), $h_\infty < \infty$. Substituting from (57) that $e_t < \infty$ for all $t$ in the definition of $h_t$ (50), we obtain that $h_0 = \psi\left(e_0^2\right) < \infty$. Therefore, (70) implies that
$$2\left|\sum_{t=0}^{\infty}\eta_t\phi_t\psi'_t\right| < \infty.$$
Recall from (67) that $\phi_t\psi'_t \geq 0$ for all $t$. Thus, from above we obtain that
$$\sum_{t=0}^{\infty}\eta_t\phi_t\psi'_t < \infty. \tag{71}$$
Finally, we reason below by contradiction that $h_\infty = 0$.
Note that for any $\zeta > 0$, there exists a unique positive value $\beta$ such that $\zeta = 2\beta\left(2D^* + \sqrt{\beta}\right)^2$. Suppose that $h_\infty = 2\beta\left(2D^* + \sqrt{\beta}\right)^2$ for some positive value $\beta$. As the sequence $\{h_t\}_{t=0}^{\infty}$ converges to $h_\infty$ (see (68)), there exists some finite $\tau \in \mathbb{Z}_{\geq 0}$ such that for all $t \geq \tau$,
$$\left|h_t - h_\infty\right| \leq \beta\left(2D^* + \sqrt{\beta}\right)^2 \implies h_t \geq h_\infty - \beta\left(2D^* + \sqrt{\beta}\right)^2.$$
As $h_\infty = 2\beta\left(2D^* + \sqrt{\beta}\right)^2$, the above implies that
$$h_t \geq \beta\left(2D^* + \sqrt{\beta}\right)^2, \quad \forall t \geq \tau. \tag{72}$$
Therefore (cf. (45) and (50)), for all $t \geq \tau$,
$$\left(e_t^2 - (D^*)^2\right)^2 \geq \beta\left(2D^* + \sqrt{\beta}\right)^2, \quad \text{or} \quad \left|e_t^2 - (D^*)^2\right| \geq \sqrt{\beta}\left(2D^* + \sqrt{\beta}\right).$$
Thus, for each $t \geq \tau$, either
$$e_t^2 \geq (D^*)^2 + \sqrt{\beta}\left(2D^* + \sqrt{\beta}\right) = \left(D^* + \sqrt{\beta}\right)^2, \tag{73}$$
or
$$e_t^2 \leq (D^*)^2 - \sqrt{\beta}\left(2D^* + \sqrt{\beta}\right) < (D^*)^2. \tag{74}$$
If the latter, i.e., (74), holds true for some $t' \geq \tau$, then $h_{t'} = \psi\left(e_{t'}^2\right) = 0$, which contradicts (72). Therefore, (72) implies (73) for all $t \geq \tau$.

From the above we obtain that if $h_\infty = 2\beta\left(2D^* + \sqrt{\beta}\right)^2$, then there exists $\tau < \infty$ such that for all $t \geq \tau$,
$$e_t \geq D^* + \sqrt{\beta}.$$
Thus, from (66) we obtain that
$$\phi_t\psi'_t \geq 2\xi\sqrt{\beta}\left(2D^* + \sqrt{\beta}\right), \quad \forall t \geq \tau.$$
Therefore, as $\sum_{t=0}^{\infty}\eta_t = \infty$ by assumption,
$$\sum_{t=\tau}^{\infty}\eta_t\phi_t\psi'_t \geq 2\xi\sqrt{\beta}\left(2D^* + \sqrt{\beta}\right)\sum_{t=\tau}^{\infty}\eta_t = \infty.$$
This is a contradiction to (71). Therefore, $h_\infty = 0$, and by the definition of $h_t$ in (50),
$$h_\infty = \lim_{t\to\infty}\psi\left(e_t^2\right) = 0.$$
Hence, by the definition of $\psi(\cdot)$ in (45),
$$\lim_{t\to\infty}\left\|x^t - x^*\right\| \leq D^*.$$
D Appendix: Proof of Theorem 4

In this section we present the proof of Theorem 4. Throughout this section we assume $f > 0$; the case $f = 0$ is trivial.

Consider an arbitrary set $\mathcal{H}$ of non-faulty agents with $|\mathcal{H}| = n - f$. Recall that under Assumptions 3 and 4, the aggregate cost function $\sum_{i \in \mathcal{H}} Q_i(x)$ has a unique minimum point in set $\mathcal{W}$, which we denote by $x_{\mathcal{H}}$. Specifically,
$$x_{\mathcal{H}} = \left(\arg\min_{x \in \mathbb{R}^d} \sum_{i \in \mathcal{H}} Q_i(x)\right) \cap \mathcal{W}. \tag{75}$$
To prove the theorem we make use of the result stated in Theorem 3. Specifically, we show that the CGE gradient-filter satisfies the conditions of Theorem 3 for $D^* = D\epsilon$ and $x^* = x_{\mathcal{H}}$. The rest follows easily from the convergence result stated in Theorem 3. Recall from (24) that for the CGE gradient-filter, in update rule (22),
$$\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) = \sum_{j=1}^{n-f} g_{i_j}^t, \quad \forall t. \tag{76}$$
First, we show that $\left\|\sum_{j=1}^{n-f} g_{i_j}^t\right\|$ is finite for all $t$.

Consider a subset $S \subset \mathcal{H}$ with $|S| = n - 2f$. The triangle inequality implies that
$$\left\|\sum_{j \in S} \nabla Q_j(x) - \sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq \sum_{j \in S} \left\|\nabla Q_j(x) - \nabla Q_j(x_{\mathcal{H}})\right\|, \quad \forall x \in \mathbb{R}^d.$$
Under Assumption 2, i.e., Lipschitz continuity of non-faulty gradients, for each non-faulty agent $j$, $\left\|\nabla Q_j(x) - \nabla Q_j(x_{\mathcal{H}})\right\| \leq \mu\|x - x_{\mathcal{H}}\|$. Substituting this above implies that
$$\left\|\sum_{j \in S} \nabla Q_j(x) - \sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq |S|\,\mu\,\|x - x_{\mathcal{H}}\|. \tag{77}$$
As $|S| = n - 2f$, the $(2f, \epsilon)$-redundancy property stated in Definition 3 implies that, for all $x \in \arg\min_x \sum_{j \in S} Q_j(x)$,
$$\|x - x_{\mathcal{H}}\| \leq \epsilon.$$
Substituting from above in (77) implies that, for all $x \in \arg\min_x \sum_{j \in S} Q_j(x)$,
$$\left\|\sum_{j \in S} \nabla Q_j(x) - \sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq |S|\,\mu\,\|x - x_{\mathcal{H}}\| \leq |S|\,\mu\epsilon. \tag{78}$$
For all $x \in \arg\min_x \sum_{j \in S} Q_j(x)$, $\sum_{j \in S} \nabla Q_j(x) = 0$. Thus, (78) implies that
$$\left\|\sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq |S|\,\mu\epsilon. \tag{79}$$
Now, consider an arbitrary non-faulty agent $i \in \mathcal{H} \setminus S$. Let $S_1 = S \cup \{i\}$. Using similar arguments as above, we obtain that under the $(2f, \epsilon)$-redundancy property and Assumption 2, for all $x \in \arg\min_x \sum_{j \in S_1} Q_j(x)$,
$$\left\|\sum_{j \in S_1} \nabla Q_j(x_{\mathcal{H}})\right\| = \left\|\sum_{j \in S_1} \nabla Q_j(x) - \sum_{j \in S_1} \nabla Q_j(x_{\mathcal{H}})\right\| \leq |S_1|\,\mu\epsilon. \tag{80}$$
Note that $\sum_{j \in S_1} \nabla Q_j(x) = \sum_{j \in S} \nabla Q_j(x) + \nabla Q_i(x)$.
From the triangle inequality,
$$\left\|\nabla Q_i(x_{\mathcal{H}})\right\| - \left\|\sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq \left\|\sum_{j \in S} \nabla Q_j(x_{\mathcal{H}}) + \nabla Q_i(x_{\mathcal{H}})\right\|. \tag{81}$$
Therefore, for each non-faulty agent $i \in \mathcal{H}$,
$$\left\|\nabla Q_i(x_{\mathcal{H}})\right\| \leq \left\|\sum_{j \in S} \nabla Q_j(x_{\mathcal{H}}) + \nabla Q_i(x_{\mathcal{H}})\right\| + \left\|\sum_{j \in S} \nabla Q_j(x_{\mathcal{H}})\right\| \leq |S_1|\,\mu\epsilon + |S|\,\mu\epsilon = (n - 2f + 1)\,\mu\epsilon + (n - 2f)\,\mu\epsilon = (2n - 4f + 1)\,\mu\epsilon. \tag{82}$$
Now, for all $x$ and $i \in \mathcal{H}$, by Assumption 2, $\left\|\nabla Q_i(x) - \nabla Q_i(x_{\mathcal{H}})\right\| \leq \mu\|x - x_{\mathcal{H}}\|$. By the triangle inequality, $\left\|\nabla Q_i(x)\right\| \leq \left\|\nabla Q_i(x_{\mathcal{H}})\right\| + \mu\|x - x_{\mathcal{H}}\|$. Substituting from (82) above, and recalling that $f \geq 1$, we obtain that
$$\left\|\nabla Q_i(x)\right\| \leq (2n - 4f + 1)\,\mu\epsilon + \mu\|x - x_{\mathcal{H}}\| \leq 2n\mu\epsilon + \mu\|x - x_{\mathcal{H}}\|. \tag{83}$$
We use the above inequality (83) to show below that $\left\|\sum_{j=1}^{n-f} g_{i_j}^t\right\|$ is bounded for all $t$. Recall that for each iteration $t$,
$$\left\|g_{i_1}^t\right\| \leq \ldots \leq \left\|g_{i_{n-f}}^t\right\| \leq \left\|g_{i_{n-f+1}}^t\right\| \leq \ldots \leq \left\|g_{i_n}^t\right\|.$$
As there are at most $f$ Byzantine agents, for each $t$ there exists $\sigma_t \in \mathcal{H}$ such that
$$\left\|g_{i_{n-f}}^t\right\| \leq \left\|g_{\sigma_t}^t\right\|. \tag{84}$$
As $g_j^t = \nabla Q_j(x^t)$ for all $j \in \mathcal{H}$, from (84) we obtain that
$$\left\|g_{i_j}^t\right\| \leq \left\|\nabla Q_{\sigma_t}(x^t)\right\|, \quad \forall j \in \{1, \ldots, n - f\}, \; \forall t.$$
Substituting from (83) above, we obtain that for every $j \in \{1, \ldots, n - f\}$,
$$\left\|g_{i_j}^t\right\| \leq \left\|g_{i_{n-f}}^t\right\| \leq 2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|.$$
Therefore,
$$\left\|\sum_{j=1}^{n-f} g_{i_j}^t\right\| \leq \sum_{j=1}^{n-f} \left\|g_{i_j}^t\right\| \leq (n - f)\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right). \tag{85}$$
Recall from (75) that $x_{\mathcal{H}} \in \mathcal{W}$. Let $\Gamma = \max_{x \in \mathcal{W}} \|x - x_{\mathcal{H}}\|$. As $\mathcal{W}$ is a compact set, $\Gamma < \infty$. Recall from the update rule (22) that $x^t \in \mathcal{W}$ for all $t$. Thus, $\left\|x^t - x_{\mathcal{H}}\right\| \leq \max_{x \in \mathcal{W}} \|x - x_{\mathcal{H}}\| = \Gamma < \infty$. Substituting this in (85) implies that
$$\left\|\sum_{j=1}^{n-f} g_{i_j}^t\right\| \leq (n - f)\left(2n\mu\epsilon + \mu\Gamma\right) < \infty. \tag{86}$$
Recall that in this particular case, $\sum_{j=1}^{n-f} g_{i_j}^t = \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)$ (see (76)). Therefore, from above we obtain that
$$\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| < \infty, \quad \forall t. \tag{87}$$
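Concretely, the CGE rule (76) sorts the $n$ reported gradients by Euclidean norm and sums the $n - f$ smallest. A minimal sketch (ours, written only to make the rule above concrete):

```python
# CGE gradient-filter per (76): realize the ordering ||g_{i_1}|| <= ... <= ||g_{i_n}||
# by sorting on norms, discard the f gradients of largest norm, and sum the rest.
import numpy as np

def cge_filter(grads, f):
    order = np.argsort([np.linalg.norm(g) for g in grads])
    return np.sum([grads[j] for j in order[: len(grads) - f]], axis=0)
```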
Next, we show that for an arbitrary $\delta > 0$ there exists $\xi > 0$ such that
$$\phi_t \triangleq \left\langle x^t - x_{\mathcal{H}}, \; \sum_{j=1}^{n-f} g_{i_j}^t \right\rangle \geq \xi \quad \text{when} \quad \left\|x^t - x_{\mathcal{H}}\right\| \geq D\epsilon + \delta.$$
Consider an arbitrary iteration $t$. Note that, as $|\mathcal{H}| = n - f$, there are at least $n - 2f$ agents that are common to both sets $\mathcal{H}$ and $\{i_1, \ldots, i_{n-f}\}$. We let $\mathcal{H}_t = \{i_1, \ldots, i_{n-f}\} \cap \mathcal{H}$. The remaining set of agents $\mathcal{B}_t = \{i_1, \ldots, i_{n-f}\} \setminus \mathcal{H}_t$ comprises only faulty agents. Note that $\left|\mathcal{H}_t\right| \geq n - 2f$ and $\left|\mathcal{B}_t\right| \leq f$. Therefore,
$$\phi_t = \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}_t} g_j^t \right\rangle + \left\langle x^t - x_{\mathcal{H}}, \sum_{k \in \mathcal{B}_t} g_k^t \right\rangle. \tag{88}$$
Consider the first term on the right-hand side of (88). Note that
$$\left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}_t} g_j^t \right\rangle = \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}_t} g_j^t + \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} g_j^t - \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} g_j^t \right\rangle = \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}} g_j^t \right\rangle - \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} g_j^t \right\rangle.$$
Recall that $g_j^t = \nabla Q_j(x^t)$ for all $j \in \mathcal{H}$. Therefore,
$$\left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}_t} g_j^t \right\rangle = \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}} \nabla Q_j(x^t) \right\rangle - \left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \nabla Q_j(x^t) \right\rangle. \tag{89}$$
Due to the strong convexity assumption (i.e., Assumption 3), for all $x, y \in \mathbb{R}^d$,
$$\left\langle x - y, \; \nabla \sum_{j \in \mathcal{H}} Q_j(x) - \nabla \sum_{j \in \mathcal{H}} Q_j(y) \right\rangle \geq |\mathcal{H}|\,\gamma\,\|x - y\|^2.$$
As $x_{\mathcal{H}}$ is a minimum point of $\sum_{j \in \mathcal{H}} Q_j(x)$, $\nabla \sum_{j \in \mathcal{H}} Q_j(x_{\mathcal{H}}) = 0$. Thus,
$$\left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}} \nabla Q_j(x^t) \right\rangle = \left\langle x^t - x_{\mathcal{H}}, \; \nabla \sum_{j \in \mathcal{H}} Q_j(x^t) - \nabla \sum_{j \in \mathcal{H}} Q_j(x_{\mathcal{H}}) \right\rangle \geq |\mathcal{H}|\,\gamma\left\|x^t - x_{\mathcal{H}}\right\|^2. \tag{90}$$
Now, due to the Cauchy-Schwartz inequality,
$$\left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \nabla Q_j(x^t) \right\rangle = \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\langle x^t - x_{\mathcal{H}}, \nabla Q_j(x^t) \right\rangle \leq \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\|x^t - x_{\mathcal{H}}\right\| \left\|\nabla Q_j(x^t)\right\|. \tag{91}$$
Substituting from (90) and (91) in (89) we obtain that
$$\left\langle x^t - x_{\mathcal{H}}, \sum_{j \in \mathcal{H}_t} g_j^t \right\rangle \geq \gamma\,|\mathcal{H}|\left\|x^t - x_{\mathcal{H}}\right\|^2 - \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\|x^t - x_{\mathcal{H}}\right\| \left\|\nabla Q_j(x^t)\right\|. \tag{92}$$
Next, we consider the second term on the right-hand side of (88). From the Cauchy-Schwartz inequality,
$$\left\langle x^t - x_{\mathcal{H}}, g_k^t \right\rangle \geq -\left\|x^t - x_{\mathcal{H}}\right\| \left\|g_k^t\right\|.$$
Substituting from (92) and the above in (88) we obtain that
$$\phi_t \geq \gamma\,|\mathcal{H}|\left\|x^t - x_{\mathcal{H}}\right\|^2 - \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\|x^t - x_{\mathcal{H}}\right\| \left\|\nabla Q_j(x^t)\right\| - \sum_{k \in \mathcal{B}_t} \left\|x^t - x_{\mathcal{H}}\right\| \left\|g_k^t\right\|. \tag{93}$$
Recall that, due to the sorting of the gradients, for an arbitrary $k \in \mathcal{B}_t$ and an arbitrary $j \in \mathcal{H} \setminus \mathcal{H}_t$,
$$\left\|g_k^t\right\| \leq \left\|g_j^t\right\| = \left\|\nabla Q_j(x^t)\right\|. \tag{94}$$
Recall that $\mathcal{B}_t = \{i_1, \ldots, i_{n-f}\} \setminus \mathcal{H}_t$. Thus, $\left|\mathcal{B}_t\right| = n - f - \left|\mathcal{H}_t\right|$. Also, as $|\mathcal{H}| = n - f$, $\left|\mathcal{H} \setminus \mathcal{H}_t\right| = n - f - \left|\mathcal{H}_t\right|$. That is, $\left|\mathcal{B}_t\right| = \left|\mathcal{H} \setminus \mathcal{H}_t\right|$. Therefore, (94) implies that
$$\sum_{k \in \mathcal{B}_t} \left\|g_k^t\right\| \leq \sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\|\nabla Q_j(x^t)\right\|.$$
Substituting from above in (93), we obtain that
$$\phi_t \geq \gamma\,|\mathcal{H}|\left\|x^t - x_{\mathcal{H}}\right\|^2 - 2\sum_{j \in \mathcal{H} \setminus \mathcal{H}_t} \left\|x^t - x_{\mathcal{H}}\right\| \left\|\nabla Q_j(x^t)\right\|.$$
Substituting from (83), i.e., $\left\|\nabla Q_j(x)\right\| \leq 2n\mu\epsilon + \mu\|x - x_{\mathcal{H}}\|$, above we obtain that
$$\phi_t \geq \gamma\,|\mathcal{H}|\left\|x^t - x_{\mathcal{H}}\right\|^2 - 2\left|\mathcal{H} \setminus \mathcal{H}_t\right|\left\|x^t - x_{\mathcal{H}}\right\|\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right) \geq \left(\gamma\,|\mathcal{H}| - 2\mu\left|\mathcal{H} \setminus \mathcal{H}_t\right|\right)\left\|x^t - x_{\mathcal{H}}\right\|^2 - 4n\mu\epsilon\left|\mathcal{H} \setminus \mathcal{H}_t\right|\left\|x^t - x_{\mathcal{H}}\right\|.$$
As $|\mathcal{H}| = n - f$ and $\left|\mathcal{H} \setminus \mathcal{H}_t\right| \leq f$, the above implies that
$$\phi_t \geq \left(\gamma(n - f) - 2\mu f\right)\left\|x^t - x_{\mathcal{H}}\right\|^2 - 4n\mu\epsilon f \left\|x^t - x_{\mathcal{H}}\right\| = \left(\gamma(n - f) - 2\mu f\right)\left\|x^t - x_{\mathcal{H}}\right\|\left(\left\|x^t - x_{\mathcal{H}}\right\| - \frac{4n\mu\epsilon f}{\gamma(n - f) - 2\mu f}\right) = n\gamma\left(1 - \frac{f}{n}\left(1 + \frac{2\mu}{\gamma}\right)\right)\left\|x^t - x_{\mathcal{H}}\right\|\left(\left\|x^t - x_{\mathcal{H}}\right\| - \frac{4\mu f\epsilon}{\gamma\left(1 - \frac{f}{n}\left(1 + \frac{2\mu}{\gamma}\right)\right)}\right). \tag{95}$$
Recall from (25) and (26), respectively, that
$$\alpha = 1 - \frac{f}{n}\left(1 + \frac{2\mu}{\gamma}\right) \quad \text{and} \quad D = \frac{4\mu f}{\alpha\gamma}.$$
Substituting from the above in (95) we obtain that
$$\phi_t \geq \alpha\, n\gamma\left\|x^t - x_{\mathcal{H}}\right\|\left(\left\|x^t - x_{\mathcal{H}}\right\| - D\epsilon\right). \tag{96}$$
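To make the constants concrete, the resilience margin $\alpha$ and the radius $D\epsilon$ follow directly from $n$, $f$, $\mu$, $\gamma$, and $\epsilon$ as reconstructed in (25)-(26) above. A small helper (ours; the numeric values below are hypothetical):

```python
# Constants from (25)-(26): alpha = 1 - (f/n)(1 + 2*mu/gamma) and
# D = 4*mu*f/(alpha*gamma), so the CGE iterates settle within D*eps of x_H.
def cge_constants(n, f, mu, gamma, eps):
    alpha = 1.0 - (f / n) * (1.0 + 2.0 * mu / gamma)
    assert alpha > 0, "the CGE guarantee of Theorem 4 requires alpha > 0"
    return alpha, 4.0 * mu * f * eps / (alpha * gamma)

alpha, D_eps = cge_constants(n=10, f=1, mu=2.0, gamma=1.0, eps=0.05)
print(alpha, D_eps)    # 0.5 and 0.8 for these hypothetical parameters
```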
As it is assumed that $\alpha > 0$, (96) implies that for an arbitrary $\delta > 0$,
$$\phi_t \geq \alpha\, n\gamma\,\delta\left(D\epsilon + \delta\right) \quad \text{when} \quad \left\|x^t - x_{\mathcal{H}}\right\| \geq D\epsilon + \delta.$$
The above satisfies condition (23) in Theorem 3 with $D^* = D\epsilon + \delta$ and $\xi = \alpha\, n\gamma\,\delta\left(D\epsilon + \delta\right)$. Recall from (87) that $\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| < \infty$ for all $t$. Therefore, upon using Theorem 3 we obtain that
$$\lim_{t\to\infty}\left\|x^t - x_{\mathcal{H}}\right\| \leq D\epsilon + \delta.$$
Note that the above inequality holds true for an arbitrary $\delta > 0$. Therefore,
$$\lim_{t\to\infty}\left\|x^t - x_{\mathcal{H}}\right\| \leq D\epsilon + \delta, \quad \forall \delta > 0. \tag{97}$$
Reasoning by contradiction, (97) implies that
$$\lim_{t\to\infty}\left\|x^t - x_{\mathcal{H}}\right\| \leq D\epsilon.$$
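As an end-to-end illustration of the guarantee above (ours, and not the paper's Section 5 experiments), the snippet below runs CGE-filtered projected gradient descent on hypothetical quadratic costs with $f$ Byzantine agents reporting large arbitrary gradients; the iterates approach the non-faulty minimizer $x_{\mathcal{H}}$:

```python
# Toy check of Theorem 4: with step sizes eta_t = 0.1/t (so sum eta_t = inf and
# sum eta_t^2 < inf), CGE-filtered projected descent settles near x_H despite
# f faulty reports. All problem parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n, f, d = 10, 2, 3
targets = rng.normal(size=(n - f, d))       # non-faulty Q_i(x) = 0.5*||x - a_i||^2
x_H = targets.mean(axis=0)                  # minimizer of the honest aggregate

def cge(grads, f):
    order = np.argsort([np.linalg.norm(g) for g in grads])
    return np.sum([grads[j] for j in order[: len(grads) - f]], axis=0)

x = np.zeros(d)
for t in range(1, 2001):
    honest = [x - a for a in targets]                      # true gradients
    byz = [100.0 * rng.normal(size=d) for _ in range(f)]   # arbitrary faulty reports
    x = np.clip(x - (0.1 / t) * cge(honest + byz, f), -10.0, 10.0)

print(np.linalg.norm(x - x_H))              # small: x^t has settled near x_H
```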
E Appendix: Proof of Theorem 5

In this section we present the proof of Theorem 5. Throughout this section we assume $f > 0$; the case $f = 0$ is trivial. The proof closely follows that of Theorem 4, and we borrow some notation and results directly from Appendix D.

Consider an arbitrary set $\mathcal{H}$ of non-faulty agents with $|\mathcal{H}| = n - f$. Recall that under Assumption 3 the minimum point of the aggregate cost function $\sum_{i \in \mathcal{H}} Q_i(x)$, denoted by $x_{\mathcal{H}}$, is unique. Specifically,
$$x_{\mathcal{H}} = \arg\min_{x \in \mathbb{R}^d} \sum_{i \in \mathcal{H}} Q_i(x).$$
To prove the theorem we make use of the result stated in Theorem 3. Specifically, we show that the CWTM gradient-filter satisfies the conditions of Theorem 3 for $D^* = \frac{2\sqrt{d}\,n\mu\lambda}{\gamma - \sqrt{d}\,\mu\lambda}\,\epsilon$ and $x^* = x_{\mathcal{H}}$. The rest follows easily from the convergence result stated in Theorem 3. Recall from (28) that for the CWTM gradient-filter, for all $l \in \{1, \ldots, d\}$,
$$\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)[l] = \frac{1}{n - 2f} \sum_{j=f+1}^{n-f} g_{i_j[l]}^t[l], \quad \forall t. \tag{98}$$
First, we show that $\sum_{j=f+1}^{n-f} g_{i_j[l]}^t[l]$ is finite for all $l$ and $t$.

From (83) in Appendix D, we know that under the $(2f, \epsilon)$-redundancy property and Assumption 2, for each non-faulty agent $i \in \mathcal{H}$,
$$\left\|\nabla Q_i(x)\right\| \leq 2n\mu\epsilon + \mu\|x - x_{\mathcal{H}}\|. \tag{99}$$
The above implies that for all $i \in \mathcal{H}$, $l \in \{1, \ldots, d\}$ and $x$,
$$\left|\nabla Q_i(x)[l]\right| \leq 2n\mu\epsilon + \mu\|x - x_{\mathcal{H}}\|. \tag{100}$$
Recall that for all $l$ and $t$,
$$g_{i_1[l]}^t[l] \leq \ldots \leq g_{i_{f+1}[l]}^t[l] \leq \ldots \leq g_{i_{n-f}[l]}^t[l] \leq \ldots \leq g_{i_n[l]}^t[l].$$
As there are at most $f$ Byzantine agents and $|\mathcal{H}| = n - f$, for all $l$ and $t$ there exists a pair of non-faulty agents $\sigma^1_t[l], \sigma^2_t[l] \in \mathcal{H}$ such that
$$g_{i_{n-f}[l]}^t[l] \leq g_{\sigma^1_t[l]}^t[l], \quad \text{and} \quad g_{i_{f+1}[l]}^t[l] \geq g_{\sigma^2_t[l]}^t[l]. \tag{101}$$
As $g_j^t = \nabla Q_j(x^t)$ for all $j \in \mathcal{H}$, from (101) we obtain that for all $j \in \{f + 1, \ldots, n - f\}$, $l$ and $t$,
$$\left|g_{i_j[l]}^t[l]\right| \leq \max\left\{\left|\nabla Q_{\sigma^1_t[l]}(x^t)[l]\right|, \; \left|\nabla Q_{\sigma^2_t[l]}(x^t)[l]\right|\right\}.$$
Substituting from (100) above, we obtain that for all $j \in \{f + 1, \ldots, n - f\}$, $l$ and $t$,
$$\left|g_{i_j[l]}^t[l]\right| \leq 2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|.$$
Therefore, owing to the triangle inequality,
$$\left|\sum_{j=f+1}^{n-f} g_{i_j[l]}^t[l]\right| \leq \sum_{j=f+1}^{n-f} \left|g_{i_j[l]}^t[l]\right| \leq (n - 2f)\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right). \tag{102}$$
Let $\Gamma = \max_{x \in \mathcal{W}} \|x - x_{\mathcal{H}}\|$. As $\mathcal{W}$ is a compact set, $\Gamma < \infty$. Recall from the update rule (22) that $x^t \in \mathcal{W}$ for all $t$. Thus, $\left\|x^t - x_{\mathcal{H}}\right\| \leq \max_{x \in \mathcal{W}} \|x - x_{\mathcal{H}}\| = \Gamma < \infty$. Substituting this in (102) implies that for all $l \in \{1, \ldots, d\}$,
$$\left|\sum_{j=f+1}^{n-f} g_{i_j[l]}^t[l]\right| \leq (n - 2f)\left(2n\mu\epsilon + \mu\Gamma\right) < \infty.$$
Therefore,
$$\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| < \infty, \quad \forall t. \tag{103}$$
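Concretely, the CWTM rule (98) is a coordinate-wise trimmed mean: for each coordinate, drop the $f$ smallest and $f$ largest reported values and average the remaining $n - 2f$. A minimal sketch (ours):

```python
# CWTM gradient-filter per (98): sort each coordinate independently, trim the
# f extremes on both sides, and average the middle n - 2f values.
import numpy as np

def cwtm_filter(grads, f):
    g = np.sort(np.asarray(grads), axis=0)      # realizes g_{i_1[l]}[l] <= ... per coordinate
    return g[f : len(grads) - f].mean(axis=0)   # mean of the middle n - 2f values
```

For instance, with $n = 5$ and $f = 1$, each output coordinate is the mean of the middle three reported values, which by (104) below always lies between the smallest and largest non-faulty values of that coordinate.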
Now, consider an arbitrary iteration $t$ and $l \in \{1, \ldots, d\}$. From prior work on the CWTM gradient-filter for the scalar case [32], i.e., when $d = 1$, we know that the trimmed mean of the $l$-th elements of the gradients lies in the convex hull of the $l$-th elements of the non-faulty agents' gradients in set $\mathcal{H}$. Specifically,
$$\min_{i \in \mathcal{H}} g_i^t[l] \leq \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)[l] \leq \max_{i \in \mathcal{H}} g_i^t[l]. \tag{104}$$
Obviously,
$$\min_{i \in \mathcal{H}} g_i^t[l] \leq \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} g_i^t[l] \leq \max_{i \in \mathcal{H}} g_i^t[l]. \tag{105}$$
Therefore, from (104) and (105) we obtain that
$$\left|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)[l] - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} g_i^t[l]\right| \leq \max_{i \in \mathcal{H}} g_i^t[l] - \min_{i \in \mathcal{H}} g_i^t[l].$$
As $\max_{i \in \mathcal{H}} g_i^t[l] - \min_{i \in \mathcal{H}} g_i^t[l] = \max_{i, j \in \mathcal{H}}\left|g_i^t[l] - g_j^t[l]\right|$, and $g_i^t = \nabla Q_i(x^t)$ for all $i \in \mathcal{H}$, the above can be re-written as follows:
$$\left|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)[l] - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)[l]\right| \leq \max_{i, j \in \mathcal{H}}\left|\nabla Q_i(x^t)[l] - \nabla Q_j(x^t)[l]\right|. \tag{106}$$
Note that for any two $i, j \in \mathcal{H}$,
$$\left|\nabla Q_i(x^t)[l] - \nabla Q_j(x^t)[l]\right| \leq \left\|\nabla Q_i(x^t) - \nabla Q_j(x^t)\right\|. \tag{107}$$
Substituting from Assumption 5, i.e., $\left\|\nabla Q_i(x) - \nabla Q_j(x)\right\| \leq \lambda \max\left\{\left\|\nabla Q_i(x)\right\|, \left\|\nabla Q_j(x)\right\|\right\}$, in (107) we obtain that
$$\left|\nabla Q_i(x^t)[l] - \nabla Q_j(x^t)[l]\right| \leq \lambda \max\left\{\left\|\nabla Q_i(x^t)\right\|, \left\|\nabla Q_j(x^t)\right\|\right\}. \tag{108}$$
Substituting from (99) above we obtain that
$$\left|\nabla Q_i(x^t)[l] - \nabla Q_j(x^t)[l]\right| \leq \lambda\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right). \tag{109}$$
Finally, substituting from (109) in (106) we obtain that, for all $l$,
$$\left|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)[l] - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)[l]\right| \leq \lambda\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right).$$
As $\|x\| = \sqrt{\sum_{l=1}^{d}\left|x[l]\right|^2}$ for $x \in \mathbb{R}^d$, the above implies that
$$\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\| \leq \sqrt{d}\,\lambda\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right). \tag{110}$$
Now, note that
$$\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) = \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t) + \left(\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right). \tag{111}$$
Recall from Theorem 3 that $\phi_t$, for each $t$, is defined to be
$$\phi_t = \left\langle x^t - x_{\mathcal{H}}, \; \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\rangle.$$
Substituting from (111) above we obtain that
$$\phi_t = \left\langle x^t - x_{\mathcal{H}}, \; \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\rangle + \left\langle x^t - x_{\mathcal{H}}, \; \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\rangle. \tag{112}$$
Recall from Assumption 3 that $Q_{\mathcal{H}}(x) = (1/|\mathcal{H}|)\sum_{i \in \mathcal{H}} Q_i(x)$. Thus, the first term on the right-hand side of (112) equals $\left\langle x^t - x_{\mathcal{H}}, \nabla Q_{\mathcal{H}}(x^t)\right\rangle$. Substituting from Assumption 3, and recalling that $\nabla Q_{\mathcal{H}}(x_{\mathcal{H}}) = 0$, we obtain that
$$\left\langle x^t - x_{\mathcal{H}}, \; \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\rangle \geq \gamma\left\|x^t - x_{\mathcal{H}}\right\|^2. \tag{113}$$
Next, we consider the second term on the right-hand side of (112). From the Cauchy-Schwartz inequality,
$$\left\langle x^t - x_{\mathcal{H}}, \; \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\rangle \geq -\left\|x^t - x_{\mathcal{H}}\right\|\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\|.$$
Substituting from (110) above we obtain that
$$\left\langle x^t - x_{\mathcal{H}}, \; \mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right) - \frac{1}{|\mathcal{H}|}\sum_{i \in \mathcal{H}} \nabla Q_i(x^t)\right\rangle \geq -\sqrt{d}\,\lambda\left\|x^t - x_{\mathcal{H}}\right\|\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right). \tag{114}$$
Substituting from (113) and (114) in (112) we obtain that
$$\phi_t \geq \gamma\left\|x^t - x_{\mathcal{H}}\right\|^2 - \sqrt{d}\,\lambda\left\|x^t - x_{\mathcal{H}}\right\|\left(2n\mu\epsilon + \mu\left\|x^t - x_{\mathcal{H}}\right\|\right) = \left(\gamma - \sqrt{d}\,\lambda\mu\right)\left\|x^t - x_{\mathcal{H}}\right\|\left(\left\|x^t - x_{\mathcal{H}}\right\| - \frac{2\sqrt{d}\,n\mu\lambda}{\gamma - \sqrt{d}\,\mu\lambda}\,\epsilon\right). \tag{115}$$
The above inequality is similar to (96) in the proof of Theorem 4 in Appendix D, where by assumption $\gamma - \sqrt{d}\,\lambda\mu > 0$.
Also, recall from (103) that in this particular case of the CWTM gradient-filter,
$\left\|\mathsf{GradFilter}\left(g_1^t, \ldots, g_n^t\right)\right\| < \infty$ for all $t$. Therefore, using similar arguments as in Appendix D, we obtain that
$$\lim_{t\to\infty}\left\|x^t - x_{\mathcal{H}}\right\| \leq \frac{2\sqrt{d}\,n\mu\lambda}{\gamma - \sqrt{d}\,\mu\lambda}\,\epsilon = \left(\frac{2n\lambda}{\left(\gamma/\mu\sqrt{d}\right) - \lambda}\right)\epsilon.$$