A Little Is Enough: Circumventing Defenses For Distributed Learning
Moran Baruch, Gilad Baruch, Yoav Goldberg
Dept. of Computer Science, Bar Ilan University, Israel. Correspondence to: Moran Baruch <[email protected]>, Gilad Baruch <[email protected]>.

Abstract
Distributed learning is central for large-scale training of deep-learning models. However, it is exposed to a security threat in which Byzantine participants can interrupt or control the learning process. Previous attack models and their corresponding defenses assume that the rogue participants are (a) omniscient (know the data of all other participants), and (b) introduce large changes to the parameters. We show that small but well-crafted changes are sufficient, leading to a novel non-omniscient attack on distributed learning that goes undetected by all existing defenses. We demonstrate that our attack method works not only for preventing convergence but also for repurposing the model's behavior ("backdooring"). We show that 20% of corrupt workers are sufficient to degrade a CIFAR10 model's accuracy by 50%, as well as to introduce backdoors into MNIST and CIFAR10 models without hurting their accuracy.
1. Introduction
Distributed learning has become a widespread framework for large-scale model training (Dean et al., 2012; Li et al., 2014a;b; Baydin et al., 2017; Zhang et al., 2017; Agarwal et al., 2010; Recht et al., 2011), in which a server leverages the compute power of many devices by aggregating local models trained on each of the devices.

A popular class of distributed learning algorithms is Synchronous Stochastic Gradient Descent (sync-SGD), using a single server (called Parameter Server, PS) and n workers, also called nodes (Li et al., 2014a;b). In each round, each worker trains a local model on its device with a different chunk of the dataset, and shares the final parameters with the PS. The PS then aggregates the parameters of the different workers, and starts another round by sharing the resulting combined parameters with the workers. The structure of the network (number of layers, types, sizes, etc.) is agreed upon by all workers beforehand.

While effective in a sterile environment, a major risk emerges with regard to the correctness of the learned model when facing even a single Byzantine worker (Blanchard et al., 2017). Such participants do not rigorously follow the protocol, either innocently, for example due to faulty communication, numerical errors or crashed devices, or adversarially, in which case the Byzantine output is well crafted to maximize its effect on the network. We consider malicious Byzantine workers, where an attacker controls either the devices themselves, or even only the communication between the participants and the PS, for example by a Man-In-The-Middle attack. Both attacks and defenses have been explored in the literature (Blanchard et al., 2017; Xie et al., 2018; Yin et al., 2018; El Mhamdi et al., 2018; Shen et al., 2016).

At the very heart of distributed learning lies the assumption that the parameters of the trained network across the workers are independent and identically distributed (i.i.d.) (Chen et al., 2017b; Blanchard et al., 2017; Yin et al., 2018). This assumption allows the averaging of different models to yield a good estimator for the desired parameters, and is also the basis for the different defense mechanisms, which try to recover the original mean after clearing away the Byzantine values. Existing defenses claim to be resilient even when the attacker is omniscient (Blanchard et al., 2017; El Mhamdi et al., 2018; Xie et al., 2018) and can observe the data of all the workers. Lastly, all existing attacks and defenses (Blanchard et al., 2017; El Mhamdi et al., 2018; Xie et al., 2018; Yin et al., 2018) work under the assumption that achieving a malicious objective requires large changes to one or more parameters. This assumption is advocated by the fact that SGD converges better with a little random noise (Neelakantan et al., 2016; Shirish Keskar et al., 2017; Kleinberg et al., 2018).

We show that this assumption is incorrect: directed small changes to many parameters of a few workers are capable of defeating all existing defenses and interfering with or gaining control over the training process. Moreover, while most previous attacks focused on preventing the convergence of the training process, we demonstrate a wider range of attacks and also support introducing backdoors into the resulting model, which are samples that will produce the attacker's desired output regardless of their true label. Lastly, by exploiting the i.i.d. assumption we introduce a non-omniscient attack in which the attacker only has access to the data of the corrupted workers.
Our Contributions
We present a new approach for attacking distributed learning with the following properties:
1. We overcome all existing defense mechanisms.
2. We compute a perturbation range in which the attacker can change the parameters without being detected, even in i.i.d. settings.
3. Changes within this range are sufficient both for interfering with the learning process and for backdooring the system.
4. We propose the first non-omniscient attack applicable to distributed learning, making the attack stronger and more practical.
2. Background
Convergence Prevention

This is the attack on which most of the existing attacks and defenses literature for distributed learning focuses (Blanchard et al., 2017; El Mhamdi et al., 2018; Xie et al., 2018). In this case, the attacker interferes with the process with the mere desire of obstructing the server from reaching good accuracy. This type of attack is not very interesting because the attacker does not gain any future benefit from the intervention. Furthermore, the server is aware of the attack and, in a real-world scenario, is likely to take actions to mitigate it, for example by actively blocking subsets of the workers and observing the effect on the training process.
Backdooring, also known as Data Poisoning (Biggio et al., 2012; Chen et al., 2017a; Liu et al., 2018), is an attack in which the attacker manipulates the model at training time so that it will produce the attacker-chosen target at inference time. The backdoor can be either a single sample, e.g. falsely classifying a specific person as another, or it can be a class of samples, e.g. setting a specific pattern of pixels in an image will cause it to be classified maliciously. An illustration of these objectives is given in Figure 1.
Distributed training uses the Synchronous SGD protocol, presented in Algorithm 1. The attacker interferes with the process at the time that maximizes its effect, that is, between lines 6 and 7 in Algorithm 1.

Figure 1: Possible malicious objectives. 1. A normal scenario in which a benign image is classified correctly. 2. The malicious opponent damaged the network functionality, which now mis-classifies legitimate inputs. 3. A backdoor appears in the model, classifying this specific image as the attacker desires. 4. The model produces the label whenever a specific pattern (e.g. a square in the top left) is applied.

Algorithm 1 Synchronous SGD
1: P^0 <- Randomly initiate the parameters in the server.
2: for round t in [T] do
3:   The server sends P^t to all n workers.
4:   for each worker i in [n] do
5:     Set P^t as initial parameters in the local model.
6:     Train locally using its own data chunk.
       > Malicious intervention
7:     Return final parameters p_i^{t+1} to the server.
8:   P^{t+1} <- AggregationRule({p_i^{t+1} : i in [n]})
9:   The server evaluates P^{t+1} on the test set.
10: return the P^t that maximized accuracy on the test set.

During this time, the attacker can use the corrupted workers' parameters, expressed in p_i^{t+1}, and replace them with whatever values it desires to send to the server. Attack methods differ in the way in which they set the parameter values, and defense methods attempt to identify corrupted parameters and discard them. Algorithm 1 aggregates the workers' values using averaging (AggregationRule() in line 8). Some defense methods change this aggregation rule, as explained below.
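To make the protocol concrete, the following minimal Python sketch (assumed code, not the authors' implementation) runs one round of Algorithm 1; the train_locally callables and the aggregation_rule argument stand in for the workers' local training and for the aggregation rules and defenses discussed below.

import torch

def sync_sgd_round(server_params, workers, aggregation_rule):
    """One round of Algorithm 1.
    server_params: flattened parameter tensor P^t held by the parameter server.
    workers: list of callables; each returns its locally trained parameters p_i^{t+1}.
    aggregation_rule: maps the list of received parameter tensors to P^{t+1}."""
    received = []
    for train_locally in workers:
        p_i = train_locally(server_params.clone())  # worker starts from P^t
        received.append(p_i)            # a Byzantine worker may report arbitrary values here
    return aggregation_rule(received)   # line 8 of Algorithm 1

def average(params_list):
    # the "No Defense" rule: plain averaging of all workers' parameters
    return torch.stack(params_list).mean(dim=0)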
Backdooring attacks

Bagdasaryan et al. (2018) demonstrated a backdooring attack on federated learning by making the attacker optimize for a model with the backdoor while adding a term to the loss that keeps the new parameters close to the original ones. Their attack has the benefits of requiring only a few corrupted workers, as well as being non-omniscient. However, it does not work for distributed training: in federated learning each worker uses its own private data, coming from a different distribution, negating the i.i.d. assumption (McMahan et al., 2016; Konečný et al., 2016) and making the attack easier, as it drops the ground from under the fundamental assumption of all existing defenses for distributed learning. Fung et al. (2018) proposed a defense against backdoors in federated learning, but like the attack above it heavily relies on the non-i.i.d. property of the data, which does not hold for distributed training.

A few defenses aimed at detecting backdoors were proposed (Steinhardt et al., 2017; Qiao and Valiant, 2017; Chen et al., 2018; Tran et al., 2018), but those defenses assume a single-server training in which the backdoor is injected into the training set to which the server has access, so that by clustering or other techniques the backdoors can be found and removed. In contrast, in our setting the server has no control over the samples which the workers adversely decide to train with, rendering those defenses inoperable. Finally, Shen et al. (2016) demonstrate a method for circumventing backdooring attacks on distributed training. As discussed below, the method is a variant of the Trimmed Mean defense, which we successfully evade.
All existing defenses work on each round separately, so for the sake of readability we will discard the notation of the round (t). For the rest of the paper we will use the following notations: n is the total number of workers, m is the number of corrupted workers, and d is the number of dimensions (parameters) of the model. p_i is the vector of parameters trained by worker i, (p_i)_j is its j-th dimension, and P is {p_i : i in [n]}.

The state-of-the-art defense for distributed learning is Bulyan. Bulyan utilizes a combination of two earlier methods, Krum and Trimmed Mean, which we explain first.
Trimmed Mean
This family of defenses, called Mean-Around-Median (Xie et al., 2018) or Trimmed Mean (Yin et al., 2018), changes the aggregation rule of Algorithm 1 to a trimmed average, handling each dimension separately:

TrimmedMean(P) = v, where v_j = (1 / |U_j|) * Σ_{i in U_j} (p_i)_j for j in [d]   (1)

Three variants exist, differing in the definition of U_j:
1. U_j is the indices of the top-(n − m) values in {(p_1)_j, ..., (p_n)_j} nearest to the median µ_j (Xie et al., 2018).
2. Same as the first variant, only taking the top-(n − 2m) values (El Mhamdi et al., 2018).
3. U_j is the indices of the elements in the same vector {(p_1)_j, ..., (p_n)_j} where the largest and smallest m elements are removed, regardless of their distance from the median (Yin et al., 2018).

The defense method of Shen et al. (2016) clusters each parameter into two clusters using 1-dimensional k-means, and if the distance between the clusters' centers exceeds a threshold, the values composing the smaller cluster are discarded. This can be seen as a variant of the Trimmed Mean defense, because only the values of the larger cluster, which must include the median, are averaged, while the rest of the values are discarded.

All variants are designed to defend against up to ⌈n/2⌉ − 1 corrupted workers, as these defenses depend on the assumption that the median is taken from the range of benign values. The circumvention analysis and experiments are similar for all variants when facing our attack, so below we consider only the second variant, which is used in Bulyan.
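For illustration, a minimal sketch (not the authors' implementation) of the second variant of the coordinate-wise Trimmed Mean, which keeps, in every dimension, the n − 2m values closest to the median and averages them:

import torch

def trimmed_mean(params_list, m):
    """Variant 2 of Equation 1: per-dimension average of the values closest to the median."""
    P = torch.stack(params_list)               # shape (n, d)
    n = P.shape[0]
    median = P.median(dim=0).values            # per-dimension median, shape (d,)
    dist = (P - median).abs()                  # distance of every reported value from the median
    keep = dist.argsort(dim=0)[: n - 2 * m]    # indices of the n - 2m closest values per dimension
    return torch.gather(P, 0, keep).mean(dim=0)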
Krum

Suggested by Blanchard et al. (2017), Krum strives to find a single honest participant which is probably a good choice for the next round, discarding the data from the rest of the workers. The chosen worker is the one whose parameters are closest to those of another n − m − 2 workers, mathematically expressed by:

Krum(P) = p_i,  i = argmin_{i in [n]} Σ_{i→j} ‖p_i − p_j‖²   (2)

where i → j denotes the n − m − 2 nearest neighbors of p_i in P, measured by Euclidean distance.

Like Trimmed Mean, Krum is designed to defend against up to ⌈n/2⌉ − 1 corrupted workers (m). The intuition behind this method is that in a normal distribution, the vector with average parameters in each dimension will be the closest to all the parameter vectors drawn from the same distribution. By considering only the distance to the closest n − m − 2 workers, sets of parameters which differ significantly from the average vector are outliers and will be ignored. The malicious parameters, assumed to be far from the original parameters, will suffer from the high distance to at least one non-corrupted worker, which is expected to prevent them from being selected.

While Krum was proven to converge, in (El Mhamdi et al., 2018) the authors already argue against the proof that Krum is (α, f)-Byzantine Resilient (a term coined by Krum's authors), by showing that convergence alone should not be the target, because the parameters may converge to an ineffectual model. Secondly, as already noted in (El Mhamdi et al., 2018), due to the high dimensionality of the parameters, a malicious attacker can notably introduce a large change to a single parameter without a considerable impact on the L_p norm (Euclidean distance), making the model ineffective.
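A corresponding sketch of Krum (again an illustration under the notation above, not the authors' code): the selected worker is the one whose summed squared distance to its n − m − 2 nearest neighbors is minimal.

import torch

def krum(params_list, m):
    P = torch.stack(params_list)                           # shape (n, d)
    n = P.shape[0]
    sq_dists = torch.cdist(P, P) ** 2                      # pairwise squared Euclidean distances
    # for each candidate, sum the distances to its n - m - 2 closest other workers
    nearest = sq_dists.sort(dim=1).values[:, 1:n - m - 1]  # drop the zero self-distance
    scores = nearest.sum(dim=1)
    return P[scores.argmin()]                              # parameters of the single chosen worker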
Bulyan

El Mhamdi et al. (2018), who suggested the above-mentioned attack on Krum, proposed a new defense that successfully opposes such an attack. They present a "meta"-aggregation rule, where another aggregation rule A is used as part of it. In the first part, Bulyan uses A iteratively to create a SelectionSet of probably benign candidates, and then aggregates this set by the second variant of Trimmed Mean. Bulyan combines methods working with the L_p norm that were proven to converge with the advantages of methods working on each dimension separately, such as Trimmed Mean, overcoming Krum's disadvantage described above, because Trimmed Mean will not let the single malicious dimension slip.

Algorithm 2 describes the defense. It should be noted that on line 6, n − 4m values are being averaged, which is n′ − 2m for n′ = |SelectionSet| = n − 2m.

Algorithm 2 Bulyan Algorithm
1: Input: A, P, n, m
2: SelectionSet <- ∅
3: while |SelectionSet| < n − 2m do
4:   p <- A(P \ SelectionSet)
5:   SelectionSet <- SelectionSet ∪ {p}
6: return TrimmedMean(2)(SelectionSet)

Unlike previous methods, Bulyan is designed to defend against only up to (n − 3)/4 corrupted workers. Such a number of corrupted workers (m) ensures that the input for each run of A has enough workers as required, and that there is also a majority of non-corrupted workers in the input to TrimmedMean. We will follow the authors of this method and use A = Krum in the rest of the paper, including our experiments.
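Combining the two sketches above gives a compact (assumed) rendering of Algorithm 2 with A = Krum; it presumes the usual requirement n ≥ 4m + 3.

import torch

def bulyan(params_list, m):
    remaining = list(params_list)
    selection = []
    while len(selection) < len(params_list) - 2 * m:       # |SelectionSet| < n - 2m
        chosen = krum(remaining, m)                        # krum() from the sketch above
        selection.append(chosen)
        idx = next(i for i, p in enumerate(remaining) if torch.equal(p, chosen))
        remaining.pop(idx)                                 # remove the chosen worker from the pool
    return trimmed_mean(selection, m)                      # line 6: second-variant Trimmed Mean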
No Defense

In the experiments section we will use the name No Defense for the basic method of averaging the parameters of all the workers, due to its lack of an outlier-rejection mechanism.
3. Our Attack
In previous papers (Blanchard et al., 2017; Xie et al., 2018; El Mhamdi et al., 2018), the authors assume that the attacker will choose parameters that are far away from the mean in order to hurt the accuracy of the model, for example by choosing parameters that are in the opposite direction of the gradient. Our attack shows that by consistently applying small changes to many parameters, a malicious opponent can perturb the model's convergence or backdoor the system. In addition, those defenses claimed to protect against an attacker which is omniscient, i.e., knows the data of all of the workers. We show that due to the normal distribution of the data, in case the attacker controls a representative portion of the workers, it is sufficient to have only the corrupted workers' data in order to estimate the distribution's mean and standard deviation and manipulate the results accordingly. This observation enables our attack to work also for a non-omniscient attacker, by estimating the properties of the entire population through the corrupted participants alone.

As mentioned above, the research in the field of distributed learning, including all defenses, assumes that the different parameters of all of the workers are i.i.d. and therefore expressed by a normal distribution. We follow this assumption, hence in the rest of the paper the "units" for attacking distributed learning models, which we would like to maximize without being noticed, are standard deviations (σ).

The outline of this section is as follows: We first analyze the range in which changes to the parameters will not be detected by Trimmed Mean, and show that upon choosing the maxima of this range, convergence is averted. Then we point out a weakness in Krum to be exploited by the attacker, claiming that the same parameters generated for the attack on Trimmed Mean can also overcome Krum. This leads to the conclusion that the same set of parameters will circumvent Bulyan as well, because both of its building blocks were deceived. Finally, we show how the attacker can find the set of parameters within the same range that will introduce a backdoor into the system with only a minimal impact on accuracy for the original task.

The fact that the same set of parameters is used against all defenses is a strong advantage for this method: the attack will go unnoticed no matter which defense the server decides to choose, again rendering our attack more practical.
Circumventing Trimmed Mean

In the Trimmed Mean defense each parameter is calculated independently, so the problem can be considered as a one-dimensional array, where each entry is the value given by a different worker. Clearly, if we set the vicious value too far away from the mean, the malicious value will easily be discarded. We thus seek a range in which we can deviate from the mean without being detected. Since the normal distribution is symmetric, the same value z^max sets the lower and upper bounds for the applicable changes around the mean.

What is the maximal change that can be applied by an attacker without being detected? In order to change the value produced by Trimmed Mean, the attacker should control the median. This requires a majority of the workers, which can be attained by finding the minimal number s of non-corrupted workers that need to be "seduced". The attacker then uses the properties of the normal distribution, specifically the cumulative standard normal function φ(z), and looks for a value z such that s non-corrupted workers will probably reside further away from the mean. By setting all corrupted workers to values in the range (µ − zσ, µ + zσ), the attacker guarantees with high probability that those values will be the median, and the many workers reporting the same value will cause it to withstand the averaging around the median in the second part of Trimmed Mean. The exact steps for finding such a range are shown in Algorithm 3 as part of the convergence prevention attack.
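The resulting bound can be computed directly from the inverse of the cumulative standard normal function; a small sketch (using Python's statistics.NormalDist instead of a z-table):

from statistics import NormalDist

def z_max(n, m):
    """Largest z for which the corrupted values at mu + z*sigma are still expected
    to capture the median (Algorithm 3, lines 2-3)."""
    s = n // 2 + 1 - m                        # benign workers that need to be "seduced"
    return NormalDist().inv_cdf((n - s) / n)  # z with phi(z) = (n - s) / n

print(round(z_max(50, 24), 2))                # ~1.75, as in the example below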
Circumventing Krum and Bulyan

The output of Krum's process is only one chosen worker, and all of its parameters are used while the other workers are discarded. It is assumed that there exists such a worker for which all of the parameters are close to the desired mean in each dimension. In practice, however, where the parameters live in a very high-dimensional space, even the best worker will have at least a few parameters which reside far from the mean. To exploit this shortcoming, one can generate a set of parameters which differs from the mean of each parameter by only a small amount. Those small changes will decrease the Euclidean distance calculated by Krum, hence causing the malicious set to be selected. Experimentally, the attack on Trimmed Mean was able to fool Krum as well.

An advantage when attacking Krum rather than Trimmed Mean is that only a few corrupted workers are required for the estimation of µ_j and σ_j, and only one worker needs to report the malicious parameters, because Krum eventually picks the set of parameters originating from only a single worker.

Since Bulyan is a combination of Krum and Trimmed Mean, and since our attack circumvents both, it is reasonable to expect that it will circumvent Bulyan as well. Nevertheless, Bulyan claims to defend against only up to 25% of corrupted workers, and not 50% like Krum and Trimmed Mean. At first glance it seems that the z^max derived for m = 25% might not be sufficient, but it should be noted that the perturbation range calculated above is the possible input to Trimmed Mean, for which m can reach up to 50% of the workers in the SelectionSet being aggregated in the second phase of Bulyan. Indeed, our approach is effective also against the Bulyan defense.
Preventing Convergence

With the objective of forestalling convergence, the attacker will use the maximal value z that will circumvent the defense. The attack flow is detailed in Algorithm 3.

Algorithm 3 Preventing Convergence Attack
1: Input: {p_i : i in CorruptedWorkers}, n, m
2: Set the number of required workers for a majority: s = ⌊n/2 + 1⌋ − m
3: Set (using a z-table): z^max = max_z ( φ(z) < (n − s)/n )
4: for j in [d] do
5:   calculate mean (µ_j) and standard deviation (σ_j)
6:   (p_mal)_j <- µ_j + z^max · σ_j
7: for i in CorruptedWorkers do
8:   p_i <- p_mal

Example: If the number of malicious workers is 24 out of a total of 50 workers, the attacker needs to "seduce" 2 workers (⌊50/2 + 1⌋ − 24 = 2) in order to have a majority and set the median. (50 − 2)/50 = 0.96, and by looking at the z-table for the maximal z for which φ(z) < 0.96 we get z^max = 1.75. Finally, the attacker sets the value of all the malicious workers to v = µ + 1.75 · σ for each of the parameters independently, with that parameter's µ_j and σ_j. With high probability there will be enough workers with values higher than v, which will set v as the median. In the experiments section we show that even a minor change of 1σ can at times give the attacker control over the process.
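A sketch of Algorithm 3 in code (an illustration under the assumption that each worker's parameters are available as a flat tensor; it reuses z_max() from the sketch above):

import torch

def craft_malicious_params(corrupted_params, n, m, z=None):
    """corrupted_params: list of the m corrupted workers' parameter tensors."""
    P = torch.stack(corrupted_params)        # (m, d): only the corrupted workers' data is needed
    mu = P.mean(dim=0)                       # estimated benign mean per parameter
    sigma = P.std(dim=0)                     # estimated benign standard deviation per parameter
    z = z_max(n, m) if z is None else z
    p_mal = mu + z * sigma                   # stay inside the undetected range
    return [p_mal.clone() for _ in corrupted_params]  # every corrupted worker reports the same vector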
Preventing Convergence Attack Input: { p i : i ∈ CorruptedW orkers } , n, m Set the number of required workers for a majority by: s = (cid:98) n + 1 (cid:99) − m Set (using z-table ): z max = max z (cid:18) φ ( z ) < n − sn (cid:19) for j ∈ [ d ] do calculate mean ( µ j ) and standard deviation ( σ j ). ( p mal ) j ← µ j + z max · σ j for i ∈ CorruptedWorkers do p i ← p mal the malicious workers to v = µ + 1 . · σ for each of theparameters independently with the parameters’ µ j and σ j .With high probability there will be enough workers withvalue higher than v , which will set v as the median.In the experiments section we show that even a minor changeof 1 σ can give the attacker control over the process at times. In section 3.1, we found a range for each parameter j inwhich the attacker can perturb the parameter without be-ing detected, and in order to obstruct the convergence, theattacker maximized the change inside this range. For back-dooring attack on the other hand, the attacker seeks the set ofparameters within this range which will produce the desiredlabel for the backdoor, while minimizing the impact on thefunctionality for benign inputs. To accomplish that, similarto (Bagdasaryan et al., 2018), the attacker will optimize forthe model with the backdoor while minimizing the distancefrom the original parameters. This is achieved through theloss function, weighted by parameter α as follows: Loss = α(cid:96) backdoor + (1 − α ) (cid:96) ∆ (3)where (cid:96) backdoor is the same as the regular loss but trainedon the backdoors with the attacker’s targets instead of thereal ones, and (cid:96) ∆ to be detailed below is keeping the newparameters close to the original parameters.For α too large, the parameters will significantly differ fromthe original parameters, thus being discarded by the defensemechanisms. Hence, the attacker should use the minimal α which successfully introduce the backdoor in the model.Furthermore, the attacker can leverage the knowledge of σ j for each parameter, and instead of using any L p distancedirectly for (cid:96) ∆ , the difference between the parameters canbe normalized in order to accelerate the learning: (cid:96) ∆ = d (cid:88) j =1 (cid:18) N ewP aram j − OldP aram j max ( z max σ j , e − (cid:19) (4) Little Is Enough: Circumventing Defenses For Distributed Learning if N ewP aram j − OldP aram j is smaller than z max σ j ,the new parameter is inside the valid range, so the ratiobetween them will be less than 1 and squaring it will reducethe value, which implies lower penalty. On the other hand,if N ewP aram j − OldP aram j is greater than z max σ , theratio is greater than 1 and the penalty increase quickly. Some σ j can happen to be very small, so values below − arebeing clamped in order to avoid division by very smallnumbers. This attack is detailed in Algorithm 4. Algorithm 4
This attack is detailed in Algorithm 4.

Algorithm 4 Backdoor Attack
1: Input: {p_i : i in CorruptedWorkers}, n, m
2: Calculate z^max, µ_j and σ_j as in Algorithm 3, lines 2-5.
3: Train the model with the backdoor, with initial parameters {µ_j : j in [d]} and the loss function described in Equations 3 and 4.
4: V <- final model parameters
5: for j in [d] do
6:   Clamp v_j in V to the range µ_j ± z^max · σ_j using:
     (p_mal)_j = max(µ_j − z^max · σ_j, min(v_j, µ_j + z^max · σ_j))
7: for i in CorruptedWorkers do
8:   p_i <- p_mal
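The clamping step at the end of Algorithm 4 amounts to an element-wise projection onto the allowed range, for example:

import torch

def clamp_to_range(v, mu, sigma, z_max):
    # project the backdoored parameters back into mu_j +/- z_max * sigma_j
    return torch.max(mu - z_max * sigma, torch.min(v, mu + z_max * sigma))

The resulting p_mal is then reported by every corrupted worker, as in Algorithm 3.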
4. Experiments and Results
For our experiments we used PyTorch's (Baydin et al., 2017) built-in distributed package. In this section we describe the attacked models, and examine the impact on the models in the presence of different defenses, for different values of m and different numbers of σ (z).

Datasets
Following previous work (Xie et al., 2018; Yin et al., 2018; El Mhamdi et al., 2018), we consider two datasets: MNIST (LeCun, 1998), a hand-written digit identification dataset, and CIFAR10 (Krizhevsky and Hinton, 2009), a classification task for 32 × 32 color images drawn from 10 different classes.

Models
For both datasets, we follow the model architecture of the paper introducing the state-of-the-art Bulyan defense (El Mhamdi et al., 2018). For MNIST, we use a multi-layer perceptron with 1 hidden layer: a 784-dimensional input (flattened 28 × 28 pixel images), a 100-dimensional hidden layer with ReLU activation, and a 10-dimensional softmax output, trained with a cross-entropy objective. With this structure, d equals almost 80k. We trained the model for 150 epochs with batch size 83. When neither attack nor defense is applied, the model reaches its baseline accuracy on the test set.

For CIFAR10 we use a 7-layer CNN with the following layers: an input of size 3072 (32 × 32 × 3); a convolutional layer with 16 maps and stride 1; a max-pooling layer; a convolutional layer with 64 maps and stride 1; a max-pooling layer; two fully connected layers of sizes 384 and 192 respectively; and an output layer of size 10. We use ReLU activations on the hidden layers and softmax on the output, training the network for 400 epochs with a cross-entropy objective. In this setting d is roughly 1M. The maximal accuracy reached by this model with no corrupted workers is similar to the result obtained in (El Mhamdi et al., 2018) for the same structure.

In both models we set the learning rate and the momentum to 0.1 and 0.9 respectively, and added L2 regularization for both models. The training data was split between n = 51 = 4·m + 3 workers, with m = 12 corrupted workers.

In Section 3.1 we analyzed the maximal number of σ away from µ that can be applied by our method, z^max. We showed in the example that when the total number of workers is 50, the value of z can be set to 1.75, and all the corrupted workers will update each of their parameter values to v = µ + 1.75 · σ. Furthermore, when the total number of workers is greater than 50, s may still equal 2 as before, but (n − s)/n increases, causing an increase in the value of z^max and allowing a further possible distance from the original mean. This can be intuitively explained by the fact that when n increases, the chance of having outliers in the far tails of the normal distribution increases, and those tails are the ones to be seduced. In the following experiments, we changed the parameters by up to 1.5σ, to leave room for inaccuracies in the estimation of µ_j and σ_j.

Required z

In order to learn how many standard deviations are required for impacting the network with the convergence attack, we trained the MNIST and CIFAR10 models in distributed-learning settings four times, each time changing the parameters by a different number of standard deviations z (with m = n), on all parameters, with no defense in the server.

Table 1: The maximal accuracy of MNIST and CIFAR10 models when changing all the parameters for all the workers.

It is enough to go even 1σ away from the real average to substantially degrade the results. The table shows that degrading the accuracy of CIFAR10 is much simpler than degrading MNIST, which is expected given the different nature of the tasks: MNIST is a much simpler task, so fewer samples are required and the different workers quickly agree on the correct gradient direction, limiting the change that can be applied; for the harder, more realistic classification task of CIFAR10, the disagreement between the workers is higher, which can be leveraged by the malicious opponent.
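The growth of z^max with n discussed above can be checked numerically (keeping s = 2 seduced workers fixed, as in the example of Section 3):

from statistics import NormalDist

for n in (50, 100, 1000):
    print(n, round(NormalDist().inv_cdf((n - 2) / n), 2))
# 50 -> ~1.75, 100 -> ~2.05, 1000 -> ~2.88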
Figure 2: Model accuracy on MNIST, with m = 24% and z = 1.5. No Attack is plotted for reference.
Comparing defenses
We applied our attack against all defenses, and examined their resilience on both models. Figure 2 presents the accuracy of the MNIST classification model with the different defenses when the parameters were changed by 1.5σ, over m = 12 corrupted workers, which is almost 24%. We also plotted the results when no attack is applied, so the effect of the attack can clearly be seen. The attack is effective in all scenarios. The Krum defense performed worst, since our malicious set of parameters was selected even with only 24% corrupted workers. Bulyan was affected more than TrimmedMean, because even though the malicious proportion was 24%, it can reach up to 48% of the SelectionSet, which is the proportion faced by TrimmedMean in the second stage of Bulyan. TrimmedMean performed better than the previous two, because the malicious parameters were diluted by the averaging with many parameter sets coming from non-corrupted workers. Ironically but expectedly, the best defense strategy against this attack was the simplest aggregation rule of averaging without outlier rejection, No Defense. This is because the 1.5 standard deviations were averaged across all n workers, 76% of which are not corrupted, so the overall shift in each iteration was 1.5 · 0.24 = 0.36σ, which has only a minor impact on the accuracy. It is clear, however, that the server cannot choose this aggregation rule because of the serious vulnerabilities it provokes. In case circumventing No Defense is desired, the attacker can compose a hybrid attack, in which one worker is dedicated to attacking No Defense with the attacks detailed in earlier papers (Blanchard et al., 2017; Xie et al., 2018), and the rest are used for the attack proposed here.
Figure 3: Model accuracy on CIFAR10, with m = 20% and z = 1. No Attack is plotted for reference.

Experiment results on CIFAR10 are shown in Figure 3. Since fewer standard deviations can cause a significant impact on CIFAR10 (see Table 1), we choose m = 20% corrupt workers, and change the parameters by only 1σ. Again, the best accuracy was achieved with the simplest aggregation rule, i.e. averaging the workers' parameters, but still the accuracy dropped by 28%. Krum again performed worst, for the same reason, with a drop of 66%,
Bulyan dropped by 52%, and TrimmedMean performed slightly better but still dropped by 45%.
Proportion of malicious workers
Figure 4 shows the proportion of corrupted workers required to attack the training of the CIFAR10 model. Since Bulyan is designed to protect against up to 25% malicious workers, we tried to train the model with different values of m up to that value, and tested how it affected the accuracy when the attacker changes all the parameters by 1σ. One can see that Krum is sensitive even to a small number of corrupted workers; even with m = 5% the accuracy drops by 33%. The graph shows that, as expected, as the proportion of corrupted workers grows, the model's accuracy decreases, but even 10% can cause a major degradation with all existing defenses other than not defending at all, which is not a realistic option.

Figure 4: Model accuracy with different proportions of corrupted workers (m) on CIFAR10, z = 1.
Backdooring Experiments

As before, we set n = 51 and m = 12 (24%). As a result of the attacker's desire not to interrupt convergence for benign inputs, low α and z (both 0.2) were chosen. After each round the attacker trained the network with the backdoor for 5 rounds. We set ℓ_Δ according to Equation 4 and set ℓ_backdoor to cross-entropy, like the one used for the original classification.

Sample Backdooring
For the backdoor sample task, we chose each time one of the first 3 images from each training set (MNIST and CIFAR10) and took their desired backdoored targets to be (y + 1) mod |Y|, where y is the original label and |Y| is the number of classes. Results are presented in Table 2. Throughout the process, the network produced the malicious target for the backdoor sample more than 95% of the time, including specifically the rounds where the maximal overall accuracy was achieved. As can be seen, for a simple task such as MNIST, where the network has enough capacity, the network succeeded in incorporating the backdoor with less than a 1% drop in the overall accuracy. The results are similar across the different defenses because of the low z being used. For CIFAR10, however, where convergence is difficult even without the backdoor for the given simple architecture, the impact is more visible and reaches up to 9% degradation.

Table 2: Backdoor Sample Results. The maximal accuracy of MNIST and CIFAR10 models with a backdoor sample; n = 51, m = 24%, z = α = 0.2. The results with no backdoor introduction are also presented for comparison. (Rows: No Attack, No Defense, Trimmed Mean, Krum, Bulyan; columns: MNIST, CIFAR10.)
Pattern Backdooring
For the backdoor pattern attack, the attacker randomly samples 1000 images from the datasets on each round, and sets their upper-left 5x5 pixels to the maximal intensity (see Figure 1 for examples). All those samples were trained with target = 0. For testing, the same pattern was applied to a different subset of images. Table 3 lists the results. Similar to the backdoor sample case, MNIST perfectly learned the backdoor pattern with a minimal impact on the accuracy for benign inputs under all defenses except for No Defense, where the attack was again diluted by the averaging with many non-corrupted workers, and yet the malicious label was still selected for a non-negligible 36.9% of the samples. For CIFAR10 the accuracy is worse than with the backdoor sample, with a 7% (TrimmedMean), 12% (Krum) and 15% (Bulyan) degradation, but the accuracy drop for benign inputs is still reasonable and probably unsuspicious for an innocent server training for a new task without knowing the expected accuracy. For each of the three defenses, more than 80% of the samples with the backdoor pattern were classified maliciously.

It is interesting to see that No Defense was completely resilient to this attack, with only a minimal degradation of 1% and without mis-classifying samples with the backdoor pattern. However, in a different experiment on MNIST with higher z and α (1 and 0.5 respectively), the opposite occurred: No Defense reached 95.6% for benign inputs and 100% on the backdoor, while the other defenses did not perform as well on the benign inputs. Another option for circumventing No Defense is dedicating one corrupted worker for the case that No Defense is being used by the server, and using the rest of the corrupted workers for the defense-evading attack.

Table 3: Backdoor Pattern Results. The maximal accuracy of MNIST and CIFAR10 models under the backdoor pattern attack; n = 51, m = 24%, z = α = 0.2. The results with no backdoor introduction are also presented for comparison. Results are presented for legitimate inputs (benign) and images with the backdoor pattern.
(Table 3 rows: No Attack, No Defense, TrimmedMean, Krum, Bulyan; columns: Benign and Backdoor accuracy for each of MNIST and CIFAR10.)
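For completeness, a sketch (assumed code, not the authors' implementation) of the pattern trigger described above: stamping a maximal-intensity 5x5 square in the upper-left corner of a batch of images and relabeling them with the attacker's target class 0.

import torch

def apply_backdoor_pattern(images, target_class=0, patch=5):
    """images: float tensor of shape (batch, channels, height, width), values in [0, 1]."""
    poisoned = images.clone()
    poisoned[:, :, :patch, :patch] = 1.0                  # maximal intensity in the top-left corner
    targets = torch.full((images.shape[0],), target_class, dtype=torch.long)
    return poisoned, targets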
5. Conclusions
We present a new attack paradigm in which, by applying limited changes to many parameters, a malicious opponent may interfere with or backdoor the process of distributed learning. Unlike previous attacks, the attacker does not need to know the exact data of the non-corrupted workers (being non-omniscient), and the attack works even in i.i.d. settings, where the data is known to come from a specific distribution. The attack evades all existing defenses. Based on our experiments, a variant of TrimmedMean is the one to choose among existing defenses, producing the best results against the convergence attack, excluding the choice of naïve averaging, which is obviously vulnerable to other, simpler attacks.
References
Agarwal, A., Wainwright, M. J., and Duchi, J. C. (2010). Distributed dual averaging in networks. In Advances in Neural Information Processing Systems (NIPS), pages 550-558.

Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. (2018). How to backdoor federated learning. arXiv preprint arXiv:1807.00459.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1-153.

Biggio, B., Nelson, B., and Laskov, P. (2012). Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1467-1474. Omnipress.

Blanchard, P., Guerraoui, R., Stainer, J., et al. (2017). Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems (NIPS).

Chen, B., Carvalho, W., Baracaldo, N., Ludwig, H., Edwards, B., Lee, T., Molloy, I., and Srivastava, B. (2018). Detecting backdoor attacks on deep neural networks by activation clustering. In Advances in Neural Information Processing Systems (NIPS).

Chen, X., Liu, C., Li, B., Lu, K., and Song, D. (2017a). Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint.

Chen, Y., Su, L., and Xu, J. (2017b). Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. (2012). Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NIPS), pages 1223-1231.

El Mhamdi, E. M., Guerraoui, R., and Rouault, S. (2018). The hidden vulnerability of distributed learning in Byzantium. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 3521-3530.

Fung, C., Yoon, C. J., and Beschastnikh, I. (2018). Mitigating sybils in federated learning poisoning. arXiv preprint arXiv:1808.04866.

Kleinberg, R. D., Li, Y., and Yuan, Y. (2018). An alternative view: When does SGD escape local minima? In the International Conference on Machine Learning (ICML).

Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. (2016). Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.

LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. (2014a). Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583-598.

Li, M., Andersen, D. G., Smola, A. J., and Yu, K. (2014b). Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems (NIPS), pages 19-27.

Liu, Y., Ma, S., Aafer, Y., Lee, W.-C., Zhai, J., Wang, W., and Zhang, X. (2018). Trojaning attack on neural networks. In NDSS.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. (2016). Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.

Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., and Martens, J. (2016). Adding gradient noise improves learning for very deep networks. International Conference on Learning Representations Workshop (ICLR Workshop).

Qiao, M. and Valiant, G. (2017). Learning discrete distributions from untrusted batches. arXiv preprint arXiv:1711.08113.

Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 693-701.

Shen, S., Tople, S., and Saxena, P. (2016). Auror: defending against poisoning attacks in collaborative deep learning systems. In Proceedings of the 32nd Annual Conference on Computer Security Applications, pages 508-519. ACM.

Shirish Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. (2017). On large-batch training for deep learning: Generalization gap and sharp minima. International Conference on Learning Representations (ICLR) Workshop.

Steinhardt, J., Koh, P. W. W., and Liang, P. S. (2017). Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30 (NIPS).

Tran, B., Li, J., and Madry, A. (2018). Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems 31 (NIPS).

Xie, C., Koyejo, O., and Gupta, I. (2018). Generalized Byzantine-tolerant SGD. arXiv preprint arXiv:1802.10116.

Yin, D., Chen, Y., Ramchandran, K., and Bartlett, P. (2018). Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the International Conference on Machine Learning (ICML).

Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., Hu, Z., Wei, J., Xie, P., and Xing, E. P. (2017). Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. arXiv preprint.