Consensus Control for Decentralized Deep Learning
Lingjing Kong, Tao Lin, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich
Abstract
Decentralized training of deep learning models enables on-device learning over networks, as well as efficient scaling to large compute clusters. Experiments in earlier works reveal that, even in a data-center setup, decentralized training often suffers from degradation in the quality of the model: the training and test performance of models trained in a decentralized fashion is in general worse than that of models trained in a centralized fashion, and this performance drop is impacted by parameters such as network size, communication topology and data partitioning. We identify the changing consensus distance between devices as a key parameter to explain the gap between centralized and decentralized training. We show in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as the centralized counterpart. We empirically validate that the relation between generalization performance and consensus distance is consistent with this theoretical observation. Our empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop. To this end, we propose practical training guidelines for the data-center setup as the important first step.
1. Introduction
The impressive successes of machine learning, witnessed in the last decade, have been accompanied by a steady increase in the size, complexity, and computational requirements of training systems. In response to these challenges, distributed training algorithms (i.e. data-parallel large mini-batch SGD) have been developed for use in data-centers (Goyal et al., 2017; You et al., 2018; Shallue et al., 2018). These state-of-the-art (SOTA) training systems rely on the All-Reduce communication primitive to perform exact averaging on the local gradients of all workers.

*Equal contribution. EPFL, Lausanne, Switzerland. Correspondence to: Tao Lin.
Table 1: Significant generalization gap for decentralized training on a sparse topology (ResNet-20 on CIFAR-10 with n ∈ {16, 32, 64} workers). Test top-1 accuracies, averaged over three seeds with fine-tuned learning rates, are reported for Complete (All-Reduce) and Ring (Gossip Averaging) topologies for n = 16, 32, and 64.
In this work, we investigate the trade-off between the train/test performance and the exactness of the averaging, measured in terms of consensus distance, i.e. the average discrepancy between each node and the mean of the model parameters over all machines. We identify this consensus distance as the key parameter that captures the joint effect of decentralization.

While one might suspect that a smaller consensus distance would improve performance in any case, we identify several interesting phenomena. (i) We identify a diminishing return phenomenon: if the consensus distance stays below a critical value (critical consensus distance), decreasing the consensus distance further does not yield any additional performance gains. For deep learning training, we (ii) identify the pivotal initial training phase where the critical consensus distance matters and the training consensus distance heavily influences the final training and generalization performance, and (iii) find that a large consensus distance in later training phases can even be beneficial.

Our findings have far-reaching consequences for practice: by (iv) using consensus control as a principled tool to find, adaptively during training, the appropriate trade-off between the targeted generalization performance and the affordable communication resources, it is possible to exploit the efficiency benefits of decentralized methods without sacrificing quality. While our numerical study, on Computer Vision (CV) tasks (CIFAR-10 and ImageNet-32) as well as Natural Language Processing (NLP) tasks (transformer models for machine translation), mainly focuses on the data-center setting with homogeneous nodes, our findings also apply to decentralized training over time-varying topologies and the more difficult heterogeneous setting alike.
2. Related Work
For general decentralized optimization, common algorithms are either gradient-based methods with gossip averaging steps (Kempe et al., 2003; Xiao & Boyd, 2004; Boyd et al., 2006), or problem-structure dependent methods, such as primal-dual methods (Hong et al., 2017; Sun & Hong, 2019). In this work, we focus on non-convex decentralized deep learning problems and only consider gradient-based methods with gossip averaging; methods that do not support stochastic gradients (and are thus not suitable for deep learning) are omitted from the discussion.

The convergence rate of gossip averaging towards the consensus among the nodes can be expressed in terms of the (expected) spectral gap of the mixing matrix. Lian et al. (2017) combine SGD with gossip averaging for deep learning and show that the leading term in the convergence rate, O(1/(nε²)), is consistent with the convergence of centralized mini-batch SGD (Dekel et al., 2012), and the spectral gap only affects the asymptotically smaller terms. Similar results have been observed very recently for related schemes (Scaman et al., 2017; 2018; Koloskova et al., 2019; 2020a;b; Vogels et al., 2020). To reduce the communication overhead (number of peer-to-peer communications), sparse topologies have been studied recently (Assran et al., 2019; Wang et al., 2019; 2020a; Nadiradze et al., 2020). Whilst a few recent works focus on the impact of the topology on the optimization performance (Luo et al., 2019; Neglia et al., 2020), we here identify the consensus distance as a more canonical parameter that can characterize the overall effect of decentralized learning, beyond only the topology. Through this, we are able to provide a deeper understanding of the more fine-grained impact of the evolution of the actual consensus distance on the optimization/generalization performance of deep learning.

Prior works propose to either perform a constant number of gossip steps every round (Tsianos & Rabbat, 2016; Scaman et al., 2017; Sharma et al., 2019) to increase the averaging quality, or choose carefully tuned learning rates (Tsitsiklis, 1984; Nedić & Ozdaglar, 2009; Duchi et al., 2012; Yuan et al., 2016) to improve the convergence. However, these works do not analyze the effect of the consensus distance explicitly. In contrast, we identify the existence of a critical consensus distance, adapt the number of gossip steps to the target distance on the fly, and provide insights into how the consensus distance at different training phases impacts decentralized deep learning. Appendix B.1 further details the relationship between the consensus distance and other training metrics influential to the final performance (e.g. gradient diversity in Yin et al. (2018); Johnson et al. (2020)). Besides, we connect the insights into better generalization (Lin et al., 2020b) with other interpretations in Izmailov et al. (2018); Gupta et al. (2020).

The connection between optimization and generalization of deep learning training is not fully understood. A line of work on understanding the early phase of training has recently emerged as a promising avenue for studying this connection. For instance, Keskar et al. (2017); Sagun et al. (2018); Achille et al. (2019); Golatkar et al. (2019); Frankle et al. (2020) point out the existence of a "critical phase" for regularizing deep networks, which is decisive for the final generalization ability. Achille et al. (2019); Jastrzebski et al. (2019); Fort & Ganguli (2019); Jastrzebski et al. (2020) empirically demonstrate the rapid change in the local shape of the loss surface in the initial training phase. In this work, we reach a similar conclusion for decentralized deep learning: we identify the importance of the initial training phase through the lens of consensus distance.
3. Theoretical Understanding
In this section, we study the trade-off between training performance and the exactness of parameter averaging, and establish the notion of critical consensus distance. For the sake of simplicity, we consider decentralized stochastic gradient descent (D-SGD) without momentum in this section, and focus on the optimization difficulty in our theoretical analysis. Theoretically analyzing the generalization performance for deep learning is an open problem and not intended in this work. Instead we provide an extensive empirical evaluation, addressing generalization for D-SGD both with and without momentum, in Section 4. All proofs are deferred to Appendix C.
The agents are tasked to solve a sum-structured optimization problem f : R^d → R of the form

f^⋆ := min_{x ∈ R^d} [ f(x) := (1/n) Σ_{i=1}^n f_i(x) ],   (1)

where the components f_i : R^d → R are distributed among the n nodes and are given in stochastic form: f_i(x) := E_{ξ∼D_i}[F_i(x, ξ)], where D_i denotes the local data distribution on node i ∈ [n]. For data-center settings, where data is re-shuffled periodically among nodes, these distributions are identical, but in other scenarios there can be differences between nodes. In D-SGD, each agent i ∈ [n] maintains local parameters x_i^{(t)} ∈ R^d and updates them as

x_i^{(t+1)} = Σ_{j=1}^n w_ij ( x_j^{(t)} − η ∇F_j(x_j^{(t)}, ξ_j^{(t)}) ),   (D-SGD)

that is, by a stochastic gradient step based on a sample ξ_i^{(t)} ∼ D_i, followed by gossip averaging with the neighboring nodes in the network, encoded by the mixing weights w_ij. As parameters can differ across nodes, we define x̄ := (1/n) Σ_{i=1}^n x_i and X := [x_1, ..., x_n] ∈ R^{d×n}, and X̄ := [x̄, ..., x̄] ≡ X (1/n) 1 1^⊤.

Assumption 1 (Mixing matrix). Every sample of the (possibly randomized) mixing matrix W = {w_ij} ∈ R^{n×n} is doubly stochastic and there exists a parameter p > 0 s.t.

E_W ‖XW − X̄‖_F² ≤ (1 − p) ‖X − X̄‖_F²,  ∀ X ∈ R^{d×n}.   (2)

This assumption covers a broad variety of settings (see e.g. Koloskova et al., 2020b), such as D-SGD with a fixed (constant) mixing matrix with spectral gap ρ, with parameter p = 1 − (1 − ρ)² = Θ(ρ), but also randomly chosen mixing matrices, for instance random matchings.

Assumption 2 (L-smoothness). Each function f_i(x) : R^d → R, i ∈ [n], is differentiable and there exists a constant L ≥ 0 such that for each x, y ∈ R^d: ‖∇f_i(x) − ∇f_i(y)‖ ≤ L ‖x − y‖.

Assumption 3 (Bounded noise σ and diversity ζ). There exist constants σ², ζ² s.t. for all x_1, ..., x_n ∈ R^d

(1/n) Σ_{i=1}^n E_{ξ_i} ‖∇F_i(x_i, ξ_i) − ∇f_i(x_i)‖² ≤ σ²,   (1/n) Σ_{i=1}^n ‖∇f_i(x_i) − ∇f(x_i)‖² ≤ ζ².   (3)
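To make the (D-SGD) update rule concrete, the following minimal numpy sketch performs the local-SGD-plus-gossip step on a toy quadratic objective. The ring mixing matrix, the objective, and all hyperparameter values are illustrative assumptions and not the experimental setup of this paper.

import numpy as np

def ring_mixing_matrix(n):
    # Doubly stochastic mixing matrix of a ring: each node averages itself
    # and its two neighbours with weight 1/3 (Assumption 1 holds with some p > 0).
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def dsgd_step(X, W, eta, stoch_grad):
    # One (D-SGD) step. X has shape (d, n); column i holds the local parameters x_i.
    # Each node takes a local stochastic gradient step, then gossips with its neighbours.
    G = np.column_stack([stoch_grad(X[:, i], i) for i in range(X.shape[1])])
    return (X - eta * G) @ W  # column i becomes sum_j w_ij (x_j - eta * g_j)

# Toy example (illustrative only): f_i(x) = 0.5 * ||x - b_i||^2 with additive gradient noise.
rng = np.random.default_rng(0)
d, n = 10, 8
B = rng.normal(size=(d, n))
stoch_grad = lambda x, i: (x - B[:, i]) + 0.1 * rng.normal(size=d)
X = rng.normal(size=(d, n))
W = ring_mixing_matrix(n)
for _ in range(100):
    X = dsgd_step(X, W, eta=0.1, stoch_grad=stoch_grad)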
Under the above standard assumptions in decentralized optimization, the convergence rate of (D-SGD) has been shown as follows:

Theorem 3.1 (Koloskova et al. (2020b)). Let f_i be L-smooth and the stepsize γ ≤ γ_max = O(p/L). Then there exists an optimal stepsize γ ≤ γ_max such that (1/T) Σ_{t=0}^{T−1} E ‖∇f(x̄^{(t)})‖² ≤ ε for

T = O( σ²/(nε²) + (√p σ + ζ)/(p ε^{3/2}) + 1/(pε) ) · L (f(x^{(0)}) − f^⋆).

In comparison, for centralized mini-batch SGD (C-SGD) we are allowed to choose a potentially much larger stepsize γ'_max = O(1/L), and can bound the number of iterations by O(σ²/(nε²) + 1/ε). While asymptotically both rates are equivalent, they differ in the low accuracy setting when ε is not too small, that is, especially in the first phase of optimization where the lower order terms matter.

As our first theoretical contribution, we show that if the individual iterates of the agents stay sufficiently close, then D-SGD can converge as fast as C-SGD. To measure this difference between agents, we use the consensus distance Ξ_t, defined by Ξ_t² := (1/n) Σ_{i=1}^n ‖x̄^{(t)} − x_i^{(t)}‖².

Proposition 3.2 (Critical Consensus Distance (CCD)). If the consensus distance is bounded by

Ξ_t² ≤ γσ²/(18Ln) + ‖∇f(x̄^{(t)})‖²/(18L²) =: Γ_t²   (4)

for all t, then in D-SGD we may choose the larger stepsizes γ ≤ γ'_max = O(1/L) and recover the convergence rate of C-SGD, that is O(σ²/(nε²) + 1/ε) (Dekel et al., 2012; Bottou et al., 2018). We refer to Γ_t as the critical consensus distance (CCD).

Note that the CCD does not depend on the graph topology and that Γ_t > 0, which means that we do not need perfect consensus between agents to recover the C-SGD rate, but may allow a consensus distance Ξ_t ≥ 0 (i.e. Ξ_t = 0 for all t, as we have in centralized optimization, is sufficient but not necessary). In Section 4, we empirically examine the existence of the critical consensus distance in decentralized deep learning, as we cannot compute it in closed form (through L and σ). We now estimate the magnitude of the consensus distance in D-SGD and compare it to the CCD.

Proposition 3.3 (Typical consensus distance). Let φ_t² := (1/n) Σ_{i=1}^n ‖∇f_i(x_i^{(t)})‖². Then under the assumption that γ and p are constant, and that φ_t does not change too fast between iterations, i.e. is not decreasing faster than exponentially, φ_t² ≤ (1 + p/2) φ_{t+1}², the consensus distance in D-SGD satisfies

Ξ_t² = (1 − p) γ² · O( φ_t²/p² + σ²/p ).   (5)

While these assumptions do not hold in epochs with learning rate decay, we observe in practice that during epochs with a constant learning rate the gradients indeed do not change too fast (see Figure 6(b)). Thus these assumptions are reasonable approximations to capture the practical behavior.

We now investigate scenarios where the typical consensus distance derived in Proposition 3.3 can be smaller than the critical value (CCD). This reveals two orthogonal strategies to control the consensus distance in D-SGD. We here assume diversity ζ = 0, as with i.i.d. training data, and that the stepsize γ ≤ O(1/L) as for C-SGD, and give a more refined discussion in Appendix C.3.

Learning rate decay (changing γ). We observe that when γ = O(p/(nL)) then Ξ_t ≤ Γ_t (if the noise σ is small, and especially for σ = 0, the weaker condition γ = O(p/L) is sufficient).
However, choosing too small stepsizes can impact performance in practice. In C-SGD the constraint on the stepsize is loose (γ ≤ 1/L). Yet, after sufficient learning rate decay, the desired CCD can be reached.

More gossip iterations (changing p). We observe that when 1 − p = O(1/(1 + γLn)), then Ξ_t ≤ Γ_t (again, when the noise σ is small, and especially when σ = 0, the weaker condition 1 − p = O(1/(1 + γL)) is sufficient). Whilst designing new mixing topologies to control p might not be possible due to practical constraints (fixed network, denser graphs increase latency, etc.), a simple and commonly used strategy is to use repeated gossip steps in every round.

Lemma 3.4 (Repeated gossip (Xiao & Boyd, 2004; Boyd et al., 2006)). Suppose W = W_k ⋯ W_1 for k (possibly randomized) mixing matrices with parameter p each. Then the mixing parameter for W is at least p_W ≥ 1 − (1 − p)^k.

From this, we see that the mixing parameter can be improved exponentially by applying more gossip steps. To ensure p_W ≥ 1 − 1/(1 + γLn), at most k ≤ ln(1 + γLn)/p = Õ(1/p) repetitions are required.
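A minimal sketch of the repeated-gossip strategy, assuming the single-round mixing parameter p is known: it computes the number of repetitions suggested by the bound above and applies them to the stacked parameter matrix. All numerical values are placeholders.

import math
import numpy as np

def gossip_rounds_needed(p, gamma, L, n):
    # Number of repetitions k <= ln(1 + gamma*L*n) / p suggested by Lemma 3.4,
    # so that the effective mixing parameter 1 - (1 - p)**k is large enough.
    return max(1, math.ceil(math.log(1.0 + gamma * L * n) / p))

def repeated_gossip(X, W, k):
    # Apply k gossip rounds with mixing matrix W to the parameter matrix X (d x n).
    for _ in range(k):
        X = X @ W
    return X

# Illustrative values (placeholders); the spectral quantities of a real topology differ.
p, gamma, L, n = 0.05, 0.1, 10.0, 32
k = gossip_rounds_needed(p, gamma, L, n)
effective_p = 1.0 - (1.0 - p) ** k   # Lemma 3.4: p_W >= 1 - (1 - p)**k
print(k, effective_p)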
4. Inspecting Consensus Distance for Decentralized Training
Our analysis in Section 3 shows that we can—at least in theory—recover the convergence behavior of C-SGD by controlling the consensus distance. Now, we direct our focus to generalization in decentralized deep learning training. We show empirically (not theoretically, see also Appendix B.2) that the critical consensus distance is an important metric to capture the connection between optimization and generalization in deep learning: e.g., Figure 2 in Section 4.3 showcases that by addressing the optimization difficulty in the critical initial training phase (Figure 2(a) and Figure 2(b)), the final generalization gap can be perfectly closed (Figure 2(c), Table 2 and Table 3). First we introduce and justify our experimental design in Section 4.1. We describe the implementation in Section 4.2. In Section 4.3, we present our findings on image classification benchmarks with the standard SGD optimizer, which is the main focus of this work; a preliminary study on a Transformer with the Adam optimizer and an inverse square root learning rate schedule can be found in Section 4.4.
Figure 1: Evolution of the consensus distance Ξ over epochs for ResNet-20 on CIFAR-10 (n = 32) with ring topology.

Since the consensus distance evolves throughout training, identifying its impact at every training step is infeasible. However, as the consensus distance and the critical consensus distance (CCD) both significantly depend on the learning rate (Propositions 3.2 and 3.3), we expect rather consistent observations during phases in which the learning rate is kept fixed, and more drastic changes between such phases. On CV tasks, a stage-wise learning rate schedule is the common practice for SOTA distributed training, as described in Section 4.2: thus the training can be naturally divided into phases through the learning rate decay, in each of which the training dynamics are significantly different from the others, such as Ξ_t (Figure 1), φ_t (Figure 6(b)) and the L-smoothness (Figure 6(c)). The Transformer (NLP task) has no well-defined training phases due to the conventional inverse square root learning rate; thus, for the sake of simplicity, we consider the entire Transformer training as one phase as a preliminary study.

Individual phase investigation.
In order to eliminate the coupling of effects from other phases, in each experiment we place only one phase under consensus distance control (the control refers to performing multiple gossip steps, as in Section 3.3, to reach certain distance targets), while performing exact averaging (All-Reduce over all nodes) on the model parameters for the other, unstudied phases. We demonstrate in Table 5 of Section 4.3 that the decentralization impacts on different phases are rather orthogonal, which justifies our design of examining one phase at a time.

For ease of presentation, the term "phase-x" refers to a training phase between the (x−1)-th and the x-th learning rate decay. (The learning rate warmup only covers a very small fraction of the training epochs, e.g. 5 out of 300 epochs on CIFAR-10; to simplify the analysis, we do not consider it as a separate phase.) The notation "dec-phase-x" indicates that only in "phase-x" the model is trained with a decentralized communication topology, while for the other phases we perform All-Reduce on the model parameters. We compare the result of each individually decentralized phase with that of All-Reduce centralized training (on all training phases), so as to identify when (which phase) and how decentralized training influences the final generalization performance.
Datasets and models.  We empirically study the decentralized training behavior on the following two tasks, on convolutional neural networks and transformer architectures: (1) Image Classification on CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet-32 (i.e. ImageNet downsampled to a resolution of 32 × 32) (Chrabaszcz et al., 2017), with the standard data augmentation and preprocessing scheme (He et al., 2016); and (2) Neural Machine Translation on the Multi30k dataset (Elliott et al., 2016). For Image Classification, ResNet-20 (He et al., 2016) with different widths is used on CIFAR (default width) and ImageNet-32 (width factor of 3). For Neural Machine Translation, a down-scaled transformer architecture (w.r.t. the base model in Vaswani et al. (2017)) is used. Weight initialization schemes follow Goyal et al. (2017); He et al. (2015) and Vaswani et al. (2017), respectively. Unless mentioned otherwise, our experiments are repeated over three random seeds.

Training schemes.
We use mini-batch SGD with Nesterov momentum (without dampening) for the image classification task (we confirm our findings in Section 4.3 for SGD without momentum), and Adam is used for the neural machine translation task. Unless mentioned otherwise, we use the optimal learning rate (lr) from centralized training for our decentralized experiments, in order to observe the impact of decentralization on normal centralized training.
• For the image classification experiments, unless mentioned otherwise, the models are trained for 300 epochs on CIFAR-10, with a fixed local mini-batch size per worker. By default, all experiments follow the SOTA learning rate scheme from the distributed deep learning literature (Goyal et al., 2017; He et al., 2019), with learning rate scaling and a warmup scheme (sketched below): the learning rate is gradually warmed up from a relatively small value over the first 5 epochs, and is divided by 10 when the model has accessed specified fractions of the total number of training samples (He et al., 2016); we use two decay milestones for CIFAR and three for ImageNet-32. All results in tables are test top-1 accuracies.
• For the experiments on neural machine translation, we use the standard inverse square root learning rate schedule (Vaswani et al., 2017). The number of warm-up steps is set for a reference mini-batch size and is linearly scaled down by the global mini-batch size.
It takes several hours to finish one round of standard ImageNet-32 training with n = 16 V100 GPUs on a ring, and the cost increases further for our consensus-distance-controlled experiments; it is infeasible to perform sufficient experiments on datasets of larger scale with our computation budget. We find that fine-tuning the learning rate for the decentralized experiments does not change our conclusions: e.g., no significant difference can be found between the phase-1 curves of "ring (fine-tuned lr)" and "dec-phase-1 (Ξ_max)" in Figures 2(a) and 2(b), and we have similar observations in Table 14 after sufficient learning rate tuning on phase-1.
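The following sketch outlines the stage-wise learning rate schedule with linear warmup described above; the warmup start value and the milestone fractions are placeholders, not the exact values used in the paper.

def stagewise_lr(epoch, total_epochs, base_lr,
                 warmup_epochs=5, warmup_init=0.01,
                 milestones=(0.5, 0.75), decay=10.0):
    # Stage-wise schedule with linear warmup: ramp up from warmup_init to base_lr
    # over the first warmup_epochs, then divide the rate by `decay` whenever a
    # milestone fraction of the total training budget has been reached.
    # warmup_init and the milestone fractions are placeholders, not the paper's values.
    if epoch < warmup_epochs:
        return warmup_init + (base_lr - warmup_init) * epoch / warmup_epochs
    lr = base_lr
    for frac in milestones:
        if epoch >= frac * total_epochs:
            lr /= decay
    return lr

# Example: the learning rate trajectory over a 300-epoch CIFAR-10 run.
schedule = [stagewise_lr(e, total_epochs=300, base_lr=0.1) for e in range(300)]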
Consensus distance control.  For consensus control, we adopt the "more gossip iterations" strategy introduced in Section 3.3. That is, we perform multiple gossip steps (if needed) until the desired target consensus distance value is reached. Two metrics are considered to set the consensus distance target during the specified training phase:
• constant target distance (main approach): the target consensus distance Ξ for a phase is the maximum consensus distance Ξ_max of the current phase in normal (uncontrolled) decentralized training, multiplied by a factor. For a given topology, the smaller the factor, the tighter the consensus. We use this approach primarily since we can directly regulate the magnitude of the consensus distance; in experiments, the target Ξ = Ξ_max refers to normal (i.e. uncontrolled) decentralized training.
• adaptive target distance (in Appendix E.3.1): the target consensus distance Ξ for the current step is the averaged local gradient norm φ_t^avg scaled by a factor. For stability, we use the exponential moving average φ_t^ema of φ_t^avg (accumulated during the corresponding phase).
We use a ring as the main decentralized communication topology, as it is a particularly hard instance with a small spectral gap (cf. Table 10), which allows us to study a wide range of target consensus distances by modifying the number of gossip steps (in the appendix we show consistent findings on a time-varying exponential topology in Tables 18 and 19).

In this section we present our empirical findings and provide insights into how the consensus distance at different phases impacts the training and generalization for the CV tasks (i.e. CIFAR-10 and ImageNet-32).
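Before turning to the findings, the sketch below summarizes the constant-target control loop used in these experiments: after each D-SGD update, gossip steps are repeated until the consensus distance drops below factor · Ξ_max of the current phase. For clarity, the consensus distance is computed exactly here (which requires global communication); Section 5 and Appendix A describe the cheap estimator Θ_t used in practice.

import numpy as np

def consensus_distance(X):
    # Xi = sqrt( (1/n) * sum_i ||x_bar - x_i||^2 ), computed from all parameters.
    x_bar = X.mean(axis=1, keepdims=True)
    return float(np.sqrt(np.mean(np.sum((X - x_bar) ** 2, axis=0))))

def controlled_gossip(X, W, xi_max, factor=0.5, max_steps=50):
    # Constant-target strategy: after the D-SGD update, keep gossiping until the
    # consensus distance drops below the target (factor * Xi_max of the current phase).
    target = factor * xi_max
    steps = 0
    while consensus_distance(X) > target and steps < max_steps:
        X = X @ W
        steps += 1
    return X, steps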
Critical consensus distance exists in the initial trainingphase—consensus distance below this critical thresholdensures good optimization and generalization.
In the initial training phase, both the training and generalization performance are heavily impacted by the consensus distance ("dec-phase-1" in Figure 2 and Table 2). A smaller consensus distance in the early phase results in considerably faster optimization (training loss) and higher generalization performance (test accuracy), and these advantages persist over the entire training. When the consensus distance is larger (e.g. 1/2 Ξ_max for CIFAR-10), the optimization (training performance) can eventually catch up with the centralized convergence (cf. Figures 2(a) and 2(b)), but a considerable generalization gap still remains for the setup in Figure 2, as shown in Table 2.
Figure 2: Learning curves for ResNet-20 on CIFAR-10 (n = 32): (a) training loss, (b) training top-1 accuracy, (c) test top-1 accuracy. We compare fine-tuned normal (w/o control) decentralized training (i.e. "ring") and All-Reduce training (both with fine-tuned lr) with dec-phase-1 under different target consensus distances (1/4 Ξ_max, 1/2 Ξ_max, Ξ_max).

Table 2: The impact of the consensus distance of different phases (dec-phase-1, dec-phase-2, dec-phase-3) on the generalization performance (test top-1 accuracy) when training ResNet-20 on CIFAR-10 on a ring with n = 32 and n = 64 workers; All-Reduce training and fine-tuned normal (w/o control) decentralized training serve as the reference baselines.
Table 3: The impact of different consensus distances on generalization for different phases (dec-phase-1 to dec-phase-4) of training ResNet-20-3 on ImageNet-32 on a ring with n = 16 and n = 32 workers; centralized training and decentralized training on a fixed ring serve as the reference baselines. The reported test top-1 accuracies are averaged over two seeds.
A consistent pattern can be found in the ImageNet-32 experiments, as shown in Table 3. (For ImageNet-32 with n = 32, 1/2 Ξ_max is already tight enough to recover the centralized performance, while a significant performance drop can be observed between Ξ_max and 1/2 Ξ_max.) These observations are to some extent consistent with the insights on the critical learning phase described in Golatkar et al. (2019); Jastrzebski et al. (2020); Frankle et al. (2020) for centralized training, where it is argued that the initial learning phase is crucial for the final generalization.

Notably, a perfect consensus is not required to recover the centralized training performance. For instance, 1/4 Ξ_max is sufficient in the CIFAR-10 experiments to approach the optimal centralized training performance in both optimization and generalization at the end. Smaller distances (e.g. 1/8 Ξ_max, 1/16 Ξ_max) do not bring significant gains (Table 12): the performance saturates around that of 1/4 Ξ_max while the communication overhead increases significantly (e.g. Figure 10 of Appendix E.1). This confirms that our analysis and discovery in Section 3 are sensible in the initial training phase: there exists a critical consensus distance for the training, below which the impact of decentralization is negligible.

A non-negligible consensus distance at middle phases can improve generalization over centralized training.
Surprisingly, it is not always the case that the generalization performance improves with a shrinking consensus distance. We observe that at the phase right after the initial training plateaus (e.g. phase-2 for CIFAR-10, phase-3 for ImageNet-32), a non-negligible consensus distance actually boosts the generalization performance over centralized training, which had been deemed optimal. In the CIFAR-10 dec-phase-2 experiments (Table 2), the generalization performance increases monotonically with the evaluated consensus distance and is consistently superior to that of centralized training for n = 32. An analogous observation can be made in the ImageNet-32 dec-phase-3 experiments (Table 3).

This coincides with the observations first introduced in post-local SGD (Lin et al., 2020b), where, for better generalization, consensus distance is created among local machines by less frequent model parameter synchronization (All-Reduce) in the late training phases (e.g. phase-2, phase-3 for CIFAR). Thus a non-negligible consensus distance at middle phases can be viewed as a means of injecting proper noise as argued in Lin et al. (2020b), which reduces the communication cost and at the same time benefits generalization. (Table 19 of Appendix E.3.1 shows that there exists an optimal consensus distance at middle phases, beyond which the gain in generalization brought by the noise injection starts to diminish.)

At the last phase of training, the consensus distance only marginally impacts the generalization performance.
Similar to the initial training phase, the final convergence phase seems to favor small consensus distances in the CIFAR-10 experiments. However, its impact is less prominent in comparison: for dec-phase-3, the performance with a smaller consensus distance (1/4 Ξ_max) is only marginally higher than that of Ξ_max for n = 32 (Table 2). In the ImageNet-32 experiments, the dec-phase-3 performance is not even affected by changes in consensus.
Table 4: The impact of consensus distance on the generalization performance (test top-1 accuracy) with vanilla SGD (without momentum), for training ResNet-20 on CIFAR-10 on a ring; dec-phase-1 and dec-phase-2 are reported under different target consensus distances, with All-Reduce training and fine-tuned normal (w/o control) decentralized training as the reference baselines. We repeat the experiments over 3 seeds for n = 32 and 2 seeds for n = 64.
Figure 3: Learning curves (validation loss) for training the Transformer on Multi30k (n = 32): (a) different target consensus distances Ξ (complete topology, and ring with Ξ_max, 1/2 Ξ_max, 1/4 Ξ_max, 1/8 Ξ_max, 1/16 Ξ_max); (b) decentralized baselines (complete, ring, exponential graph); (c) the consensus distance Ξ over iterations for ring and exponential graph.

Quality propagation across phases.
Our previous experiments only consider a single phase of decentralized training. We now evaluate the lasting impact of consensus across a sequence of multiple phases. In Table 5, we control the consensus distance for both phase-1 and phase-2 when training on CIFAR-10. Our previous findings hold when we view each controlled phase separately. For instance, when we apply 1/2 Ξ_max consensus control to phase-2 (the middle column in Table 5), we can still observe that a smaller consensus distance in phase-1 results in higher performance, as in our previous finding. Hence our previous findings remain valid in more general cases of decentralized training.
Table 5: Quality propagation across training phases with different consensus distances, for ResNet-20 on CIFAR-10 (ring with n = 32). In phase-1 and phase-2, the model parameters reach inexact consensus with different target consensus distances Ξ, while phase-3 performs All-Reduce on the model parameters.
Table 6: The impact of different numbers of training epochs (at phase-1) on generalization, for training ResNet-20 on CIFAR-10 (dec-phase-1 with n = 32). The number of epochs at phase-1 is chosen from {150, 200, 250}, while the other training settings are identical to those of dec-phase-1 in Table 2.

Longer training cannot close the generalization gap caused by large consensus distances in the initial training phase.
As discussed above, large consensus distances in the initial phase can result in a significant generalization loss. Table 6 investigates whether prolonged training of the initial phase can address this difficulty: we prolong phase-1 for CIFAR-10 with a range of consensus distances and leave the other training phases centralized. We observe that although longer training is beneficial for each consensus distance, it cannot recover the generalization gap resulting from a large consensus distance. For instance, the maximum gain (among all evaluated cases) of increasing the epoch number from 150 to 250 is obtained at 1/2 Ξ_max, and it is still lower than the average gain of merely reducing the consensus distance from Ξ_max to 1/2 Ξ_max. Table 15 in Appendix E.2 evaluates cases where dec-phase-2 and dec-phase-3 are prolonged; we find that longer training in these two phases brings about negligible performance gains.

Consistent findings on decentralized SGD without momentum.
To validate the coherence between our theory and experiments, we perform similar consensus distance control experiments with the vanilla SGD optimizer (i.e. without momentum) for dec-phase-1 and dec-phase-2 on CIFAR-10. The patterns illustrated in Table 4 are consistent with our previous observations in Table 2 and Table 3, supporting the claim on the relation between consensus distance and generalization performance (which holds regardless of the use of momentum).
Figure 3(a) demonstrates that a 1/4 Ξ_max target control on a ring is sufficient to recover the centralized training performance. Besides, the target consensus distance in this case can already be reached by the exponential graph, and thus so can the target test performance, as shown in Figures 3(b) and 3(c).
Table 7: The importance of phase-1 for training ResNet-20 on CIFAR-10 (n = 32), in terms of (1) the target consensus distance and (2) the number of training epochs. In phase-2 and phase-3, we perform decentralized training (w/o consensus distance control).
These results justify the importance of designing an efficient communication topology/scheme in practice, so as to effectively reach the critical consensus distance.
5. Impact on Practice
Practical guidelines: prioritizing the initial training phase.  Apart from effectiveness (generalization/test performance), efficiency (time) stands as the other crucial goal in deep learning, and thus how to allocate communication resources over the training becomes a relevant question.

Figure 4: Consensus distance evolution against the number of gossip steps on different topologies (n = 32): ring, exponential graph, random matching, and bipartite exponential graph. The initial x_i's are sampled uniformly from [0, 1). Results on different topology scales are deferred to Appendix E.1.

As indicated by our first empirical finding (and the theory in Section 3), the initial training phase bears the greatest importance among all training phases; therefore the communication expenditure should be concentrated on the initial phase so as to maintain a consensus distance lower than the CCD. We suggest a list of communication topologies with superior spectral properties, i.e. the exponential graph (Assran et al., 2019) and random matching (Nadiradze et al., 2020) in Figure 4 (the definitions of the topologies are detailed in Appendix E.1), which can be utilized to achieve fast convergence in gossip averaging.

The late training phases should be less prioritized for communication resources, due to the generalization benefits from a reasonable consensus distance in the middle phases. Providing a rigorous way to quantify the optimal consensus distance is non-trivial, and we leave it as future work.

In Table 7 we show that the above-mentioned guideline is practically feasible: as long as the quality of the initial phase is ensured, we can afford to slacken the consensus control for the later phases, in particular the middle phase. For instance, when the number of epochs is 150, a consensus control of 1/4 Ξ_max in the initial phase with uncontrolled middle and final phases is adequate to recover the centralized training performance. Note that here the noise injection from the uncontrolled middle phase also contributes positively to the performance. Table 18 in Appendix E.3.1 additionally justifies the applicability of this guideline on exponential graphs.
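As an illustration of Figure 4, the sketch below builds doubly stochastic mixing matrices for a ring and a static, undirected exponential graph, and tracks how fast repeated gossip shrinks the consensus distance; the exponential-graph construction used here is one common variant and serves only as an assumption for this example.

import numpy as np

def ring(n):
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def exponential_graph(n):
    # Static undirected exponential graph: node i is linked to i +/- 2^k (mod n),
    # with uniform weights over the closed neighbourhood. One common construction,
    # used here only for illustration.
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = {i}
        for k in range(int(np.log2(n))):
            nbrs |= {(i + 2 ** k) % n, (i - 2 ** k) % n}
        for j in nbrs:
            W[i, j] = 1.0 / len(nbrs)
    return W

def consensus_curve(W, steps=30, d=100, seed=0):
    # Consensus distance after each gossip step, with x_i sampled uniformly from [0, 1)^d.
    rng = np.random.default_rng(seed)
    X = rng.random((d, W.shape[0]))
    curve = []
    for _ in range(steps):
        X = X @ W
        x_bar = X.mean(axis=1, keepdims=True)
        curve.append(float(np.sqrt(np.mean(np.sum((X - x_bar) ** 2, axis=0)))))
    return curve

# The exponential graph drives the consensus distance down in far fewer gossip steps.
ring_curve = consensus_curve(ring(32))
expo_curve = consensus_curve(exponential_graph(32))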
Practical implementation of Consensus Control.  Computing the exact consensus distance requires the average of all model parameters in R^d, which is prohibitively expensive (All-Reduce). We therefore propose to use the quantity

Θ_t² := (1/n) Σ_{i=1}^n (θ_i^{(t)})²,   with θ_i^{(t)} := ‖ Σ_{j=1}^n w_ij x_j^{(t)} − x_i^{(t)} ‖,

instead, for practically efficient consensus control. The values θ_i^{(t)} ∈ R can be computed locally on each node when updating the parameters, and computing Θ_t requires only an averaging of scalars (the cost of this global computation is negligible compared to averaging the parameter vectors in the gossip steps).

In Appendix A, we prove that (2/p) Θ_t is an upper bound on the consensus distance, i.e. Ξ_t ≤ (2/p) Θ_t (Lemma A.1), and thus a valid control parameter. We illustrate empirically in Section A.1 that Θ_t is indeed an effective indicator of the consensus distance (up to constants). Tables 8 and 9 in Appendix A.1 show the feasibility of integrating the control of Θ_t with our practical guidelines for efficient training in data-centers, which serves as a strong starting point for designing decentralized training algorithms with a desired balance between communication cost and training performance.
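A minimal sketch of the estimator: each node computes its scalar θ_i^{(t)} from the gossip buffer it already holds, and only these scalars are averaged. Here the per-node state is emulated with numpy arrays and the all-reduce of scalars is a simple mean.

import numpy as np

def local_theta(x_i, neighbour_avg):
    # theta_i = || sum_j w_ij x_j - x_i ||, computable on node i right after it has
    # formed the weighted average of its neighbours' parameters for the gossip step.
    return float(np.linalg.norm(neighbour_avg - x_i))

def theta_estimate(X, W):
    # Theta_t = sqrt( (1/n) sum_i theta_i^2 ). In a real deployment only the scalars
    # theta_i^2 are all-reduced, which is negligible next to averaging full vectors.
    n = X.shape[1]
    thetas = [local_theta(X[:, i], X @ W[i, :]) for i in range(n)]
    return float(np.sqrt(np.mean(np.square(thetas))))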
6. Conclusion
In this work, we theoretically identify the consensus distance as an essential factor for decentralized training. We show the existence of a critical consensus distance, below which the consensus distance does not hinder optimization. Our deep learning experiments validate our theoretical findings and extend them to the generalization performance. Based on these insights, we propose practical guidelines, with an efficient implementation, for favorable generalization performance with low communication expenses, on arbitrary communication networks.
References
Achille, A., Rovere, M., and Soatto, S. Critical learning periods in deep networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BkeStsCcKQ.
Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pp. 344–353. PMLR, 2019.
Bellet, A., Guerraoui, R., Taziki, M., and Tommasi, M. Personalized and private peer-to-peer machine learning. In Proceedings of Machine Learning Research, volume 84, pp. 473–481, Playa Blanca, Lanzarote, Canary Islands, 2018. PMLR. URL http://proceedings.mlr.press/v84/bellet18a.html.
Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Chrabaszcz, P., Loshchilov, I., and Hutter, F. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv preprint arXiv:1707.08819, 2017.
Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13:165–202, 2012.
Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012. doi: 10.1109/TAC.2011.2161027.
Elliott, D., Frank, S., Sima'an, K., and Specia, L. Multi30k: Multilingual English-German image descriptions. arXiv preprint arXiv:1605.00459, 2016.
Fort, S. and Ganguli, S. Emergent properties of the local geometry of neural loss landscapes. arXiv preprint arXiv:1910.05929, 2019.
Frankle, J., Schwab, D. J., and Morcos, A. S. The early phase of neural network training. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hkl1iRNFwS.
Golatkar, A. S., Achille, A., and Soatto, S. Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence. In Advances in Neural Information Processing Systems, pp. 10678–10688, 2019.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
Gupta, V., Serrano, S. A., and DeCoste, D. Stochastic weight averaging in parallel: Large-batch training that generalizes well. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygFWAEFwS.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., and Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567, 2019.
Hong, M., Hajinezhad, D., and Zhao, M.-M. Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In International Conference on Machine Learning, pp. 1529–1538, 2017.
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
Jastrzebski, S., Kenton, Z., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. On the relation between the sharpest directions of DNN loss and the SGD step length. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkgEaj05t7.
Jastrzebski, S., Szymczak, M., Fort, S., Arpit, D., Tabor, J., Cho, K., and Geras, K. The break-even point on optimization trajectories of deep neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1g87C4KwB.
Johnson, T., Agrawal, P., Gu, H., and Guestrin, C. AdaScale SGD: A user-friendly algorithm for distributed training. In International Conference on Machine Learning, pp. 4911–4920. PMLR, 2020.
Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D'Oliveira, R. G. L., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konečný, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Özgür, A., Pagh, R., Raykova, M., Qi, H., Ramage, D., Raskar, R., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tramèr, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
Kempe, D., Dobra, A., and Gehrke, J. Gossip-based computation of aggregate information. In 44th Annual IEEE Symposium on Foundations of Computer Science, pp. 482–491. IEEE, 2003.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
Koloskova, A., Stich, S. U., and Jaggi, M. Decentralized stochastic optimization and gossip algorithms with compressed communication. In ICML 2019 - Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 3479–3487. PMLR, 2019. URL http://proceedings.mlr.press/v97/koloskova19a.html.
Koloskova, A., Lin, T., Stich, S. U., and Jaggi, M. Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum?id=SkgGCkrKvH.
Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. U. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, 2020b.
Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.
Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330–5340, 2017.
Lian, X., Zhang, W., Zhang, C., and Liu, J. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, pp. 3043–3052. PMLR, 2018.
Lin, T., Kong, L., Stich, S., and Jaggi, M. Extrapolation for large-batch training in deep learning. In International Conference on Machine Learning, 2020a.
Lin, T., Stich, S. U., Patel, K. K., and Jaggi, M. Don't use large mini-batches, use local SGD. In ICLR - International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=B1eyO1BFPr.
Luo, Q., Lin, J., Zhuo, Y., and Qian, X. Hop: Heterogeneity-aware decentralized training. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 893–907, 2019.
Nadiradze, G., Sabour, A., Alistarh, D., Sharma, A., Markov, I., and Aksenov, V. SwarmSGD: Scalable decentralized SGD with local updates. arXiv preprint arXiv:1910.12308, 2020.
Nedić, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
Neglia, G., Xu, C., Towsley, D., and Calbi, G. Decentralized gradient methods: does topology matter? In AISTATS, 2020.
Neyshabur, B. Implicit regularization in deep learning. arXiv preprint arXiv:1709.01953, 2017.
Sagun, L., Evci, U., Guney, V. U., Dauphin, Y., and Bottou, L. Empirical analysis of the Hessian of over-parametrized neural networks. ICLR Workshop, 2018.
Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483–2493, 2018.
Scaman, K., Bach, F., Bubeck, S., Lee, Y. T., and Massoulié, L. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, 2017.
Scaman, K., Bach, F., Bubeck, S., Massoulié, L., and Lee, Y. T. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pp. 2740–2749, 2018.
Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G. E. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.
Sharma, C., Narayanan, V., and Balamurugan, P. A simple and fast distributed accelerated gradient method. In OPT2019: 11th Annual Workshop on Optimization for Machine Learning, 2019.
Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. CoRR, abs/1909.05350, 2019. URL http://arxiv.org/abs/1909.05350.
Sun, H. and Hong, M. Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms. IEEE Transactions on Signal Processing, 67(22):5912–5928, 2019.
Tsianos, K. I. and Rabbat, M. G. Efficient distributed online prediction and stochastic optimization with approximate distributed averaging. IEEE Transactions on Signal and Information Processing over Networks, 2(4):489–506, 2016.
Tsitsiklis, J. N. Problems in decentralized decision making and computation. PhD thesis, Massachusetts Institute of Technology, 1984.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Vogels, T., Karimireddy, S. P., and Jaggi, M. PowerGossip: Practical low-rank communication compression in decentralized deep learning. In NeurIPS 2020 - Thirty-fourth Conference on Neural Information Processing Systems, 2020.
Wang, J., Sahu, A. K., Yang, Z., Joshi, G., and Kar, S. MATCHA: Speeding up decentralized SGD via matching decomposition sampling. arXiv preprint arXiv:1905.09435, 2019. URL http://arxiv.org/abs/1905.09435.
Wang, J., Sahu, A. K., Joshi, G., and Kar, S. Exploring the error-runtime trade-off in decentralized optimization. 2020a.
Wang, J., Tantia, V., Ballas, N., and Rabbat, M. SlowMo: Improving communication-efficient distributed SGD with slow momentum. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=SkxJ8REYPH.
Xiao, L. and Boyd, S. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.
Yin, D., Pananjady, A., Lam, M., Papailiopoulos, D., Ramchandran, K., and Bartlett, P. Gradient diversity: a key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, pp. 1998–2007. PMLR, 2018.
You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., and Keutzer, K. ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, pp. 1–10, 2018.
Yuan, K., Ling, Q., and Yin, W. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016. doi: 10.1137/130943170. URL https://doi.org/10.1137/130943170.

A. Efficient Implementation of Consensus Control for Data-Center Training
In our theoretical and experimental investigations in Sections 3 and 4, in order to understand the effect of decentralization on the final performance, we focused on controlling the consensus distance Ξ_t. This quantity is inspired by the theoretical analysis and naturally measures the decentralization level. In practice, in order to control the consensus distance, one needs to know its exact value at every iteration. Computing the exact value of Ξ_t requires all-to-all communication of the parameter vectors x_i, which is costly and would cancel all the practical benefits of using decentralized algorithms.

In this section we give a more practical way to control the consensus distance without compromising the final test performance. We mainly focus on the data-center training scenario, the most common use case of large-scale deep learning training in both academia and industry. Though prior art uses All-Reduce to compute the exactly averaged model parameters, recent trends show promising faster training results by using decentralized training with gossip averaging (Assran et al., 2019; Koloskova et al., 2020a), especially for the highly over-parameterized SOTA neural networks with a large number of model parameters.

We upper bound the consensus distance Ξ_t with a quantity that is efficiently computable in our scenario and control only this quantity. This quantity additionally requires a centralized all-reduce applied only to one-dimensional numbers, which is fast to perform, and it utilizes the information available after the decentralized communication step on the parameters x_i performed by the (D-SGD) algorithm.

Lemma A.1 (Upper bound on the consensus distance). Let Θ_t² := (1/n) Σ_{i=1}^n ‖ Σ_{j=1}^n w_ij x_j^{(t)} − x_i^{(t)} ‖² = (1/n) Σ_{i=1}^n (θ_i^{(t)})², where w_ij are the weights of the (fixed) mixing matrix W. We can upper bound the consensus distance as

Ξ_t ≤ (2/p) Θ_t,   ∀ x_1^{(t)}, ..., x_n^{(t)} ∈ R^d,

where p is the consensus rate of the matrix W (Assumption 1).

To ensure a small consensus distance Ξ_t it is sufficient to make the quantity Θ_t small. In particular, by ensuring that Θ_t ≤ (p/2) Γ_t we automatically get that the CCD condition holds: Ξ_t ≤ Γ_t (Proposition 3.2).

Practical way to compute Θ_t. Recall that Θ_t² = (1/n) Σ_{i=1}^n ‖ Σ_{j=1}^n w_ij x_j^{(t)} − x_i^{(t)} ‖². Each term i, i ∈ {1, ..., n}, of this sum is locally available to node i after one round of decentralized communication with mixing matrix W, because w_ij ≠ 0 only for the neighbours j of node i. So each node i can locally compute the norm ‖ Σ_{j=1}^n w_ij x_j^{(t)} − x_i^{(t)} ‖ and then obtain the average Θ_t using a centralized all-reduce on only one-dimensional numbers, which is much faster than averaging full vectors from R^d.

Proof of Lemma A.1
Proof.
Using matrix notation we can rewrite Ξ_t = (1/√n) ‖X^{(t)} − X̄^{(t)}‖_F and Θ_t = (1/√n) ‖X^{(t)}W − X^{(t)}‖_F. Since X̄^{(t)}W = X̄^{(t)} and X^{(t)} (1/n)11^⊤ = X̄^{(t)} (1/n)11^⊤ = X̄^{(t)}, we have that

X^{(t)}W − X^{(t)} = ( X^{(t)} − X̄^{(t)} ) ( W − (1/n)11^⊤ − I ).

Using the Frobenius-norm property (6) and Lemma A.3 (7),

‖X^{(t)}W − X^{(t)}‖_F ≥ λ_min( W − (1/n)11^⊤ − I ) ‖X^{(t)} − X̄^{(t)}‖_F ≥ (p/2) ‖X^{(t)} − X̄^{(t)}‖_F,

which proves the claim.

Useful Inequalities
Lemma A.2. For A ∈ R^{d×n} and B ∈ R^{n×n},

‖AB‖_F ≥ |λ_min(B)| ‖A‖_F,   (6)

where λ_min(B) denotes the smallest eigenvalue of B by absolute value.
Lemma A.3. Let W be a fixed mixing matrix satisfying Assumption 1. Then

λ_min( W − (1/n)11^⊤ − I ) ≥ p/2.   (7)
Proof. Since W is doubly stochastic and Assumption 1 holds, the largest eigenvalue of W is 1 and the second largest (by absolute value) is smaller than √(1 − p). Moreover, all eigenvalues of W lie between −1 and 1. The matrix W − (1/n)11^⊤ therefore has its largest eigenvalue not bigger than √(1 − p), and W − (1/n)11^⊤ − I has all eigenvalues negative, not smaller than −1 − √(1 − p). Using that √(1 − p) ≤ 1 − p/2 for all p ∈ [0, 1],

λ_min( W − (1/n)11^⊤ − I ) ≥ 1 − √(1 − p) ≥ p/2.
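A quick numerical sanity check of the bound Ξ_t ≤ (2/p) Θ_t from Lemma A.1 on a randomly generated instance, assuming a fixed symmetric ring mixing matrix with mixing parameter p = 1 − λ₂²; the problem sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
n, d = 16, 50

# Ring mixing matrix and its mixing parameter p = 1 - lambda_2^2 (cf. Assumption 1).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
lam2 = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]   # second largest |eigenvalue|
p = 1.0 - lam2 ** 2

X = rng.normal(size=(d, n))
x_bar = X.mean(axis=1, keepdims=True)
xi = np.sqrt(np.mean(np.sum((X - x_bar) ** 2, axis=0)))                                   # Xi_t
theta = np.sqrt(np.mean([np.linalg.norm(X @ W[i, :] - X[:, i]) ** 2 for i in range(n)]))  # Theta_t
assert xi <= 2.0 * theta / p + 1e-9   # Lemma A.1: Xi_t <= (2/p) * Theta_t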
A.1. Experiments

We implement the efficient consensus control scheme to train ResNet-20 on CIFAR-10 with a ring topology (n = 32). We compute Θ after each gossip step as an indicator of the exact consensus distance Ξ. The gossip continues until Θ < q φ_ema, where q is the control factor and φ_ema is the exponential moving average estimator of the average norm of the local gradients φ. Please refer to Section 4.2 for other training details.

We validate Lemma A.1 with Figures 5(a) and 5(b). In Figure 5(a), we observe a high correlation between Ξ and Θ over gossip steps during an arbitrary interval of the control phase. In Figure 5(b), we observe that this correlated behavior also manifests over a large span of iterations. These observations justify our claim that Θ can act as a decent and much less expensive estimator of Ξ. We also plot φ over iterations in Figure 5(c) to demonstrate that the critical consensus distance Γ stays relatively constant within each training phase.

In Table 8, we show the test performance of dec-phase-1 under the control of this efficient implementation. The pattern is consistent with the discovery in the main text. Moreover, in Table 9, we follow the "prioritizing the initial training phase" guideline from Section 5. Specifically, we control only the initial phase (phase-1) with the local estimate, while leaving the other phases uncontrolled (normal decentralized training). We observe that with our guideline, we can recover and even surpass the centralized training baseline with only the control on the initial phase. Therefore, combining the insights into the effect of consensus distance with this efficient implementation, we open up opportunities for practical decentralized training schemes with a desired balance between communication cost and training performance. We leave more sophisticated designs for future work.
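A sketch of the control loop described above, assuming a helper that supplies the per-node gradient norms; the EMA decay and the cap on gossip steps are illustrative choices, not the paper's settings.

import numpy as np

class EfficientConsensusControl:
    # Gossip until the cheap estimator Theta_t drops below q * phi_ema, where phi_ema
    # is an exponential moving average of the average local gradient norm.
    # The EMA decay and the cap on gossip steps are illustrative choices.

    def __init__(self, q, ema_decay=0.9, max_gossip_steps=20):
        self.q = q
        self.ema_decay = ema_decay
        self.max_gossip_steps = max_gossip_steps
        self.phi_ema = None

    def update_phi(self, grad_norms):
        # Called once per optimization step with the per-node gradient norms.
        phi_avg = float(np.mean(grad_norms))
        if self.phi_ema is None:
            self.phi_ema = phi_avg
        else:
            self.phi_ema = self.ema_decay * self.phi_ema + (1 - self.ema_decay) * phi_avg

    def gossip(self, X, W):
        # Repeat gossip steps until Theta < q * phi_ema (or the step cap is reached).
        for _ in range(self.max_gossip_steps):
            X = X @ W
            thetas = [np.linalg.norm(X @ W[i, :] - X[:, i]) for i in range(X.shape[1])]
            theta = float(np.sqrt(np.mean(np.square(thetas))))
            if theta < self.q * self.phi_ema:
                break
        return X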
Table 8: Efficient consensus control of dec-phase-1 with local estimates, for training ResNet-20 on CIFAR-10 ($n = 32$). Test top-1 accuracies of dec-phase-1 are reported for the centralized baseline, for three values of the control factor $q$, and for decentralized training without control.

B. Connections
B.1. Connection with Prior Work

Connection with gradient diversity.
The connection between the consensus distance and gradient diversity measures is not obvious and is an interesting direction for future work. On the one hand, low gradient diversity could cause generalization degradation for decentralized methods just as in the centralized case; on the other hand, high gradient diversity increases the difficulty of reaching a low consensus distance.

To illustrate the evolution of the consensus distance, the corresponding plots are made over gossip steps; note that under consensus control several gossip steps typically correspond to one training iteration.

Figure 5: ResNet-20 on CIFAR-10 with a ring topology ($n = 32$), under the control of the efficient implementation with a fixed control factor $q$. (a) Consensus distance $\Xi$ and the local estimator $\Theta$ over gossip steps; (b) consensus distance $\Xi$ and the local estimator $\Theta$ over all training iterations; (c) the average norm of the local gradients $\phi$.

Table 9: Efficient consensus control for data-center training—combining the practical guideline with local estimates—for training ResNet-20 on CIFAR-10 ($n = 32$). Based on our practical guideline (Section 5), we control only the initial phase (phase-1) with the local estimate, while leaving the other phases uncontrolled (normal decentralized training). Test top-1 accuracies are reported for the centralized baseline, three values of the control factor $q$, and training without control.
27 93 . ± .
13 92 . ± .
17 92 . ± .
14 91 . ± . Connection with other methods like SWA/SWAP.
Our empirical insights bear similarity to those in SWA (Izmailov et al., 2018), SWAP (Gupta et al., 2020), and Post-local SGD (Lin et al., 2020b), but none of these works considers decentralized deep learning.

In SWA, models are sampled from the later stages of an SGD training run: averaging these model parameters results in a model with much higher generalization performance. SWAP extends SWA to a parallel setting: it uses a large batch size to train the model until close to convergence and then switches to several individual runs with a small mini-batch size. These individual runs serve as a way of sampling from a posterior distribution, and the sampled models are averaged for better generalization performance (i.e. the idea of SWA).

Post-local SGD, SWA, SWAP, as well as the empirical insights presented in our paper, are closely related: we first need a sufficiently small consensus distance to guarantee the optimization quality (in Post-local SGD, SWA, and SWAP, the consensus distance equals 0), and then different model averaging choices can be utilized in the later training phase for better generalization. Considering the later training phase, our empirical observations in decentralized learning suggest that we can improve generalization through simultaneous SGD with gossip averaging. This is analogous to SWA and SWAP, which sample models independently (i.e., perform SGD) starting from a well-trained model and average over the sampled models, and close to Post-local SGD, which performs simultaneous SGD steps with infrequent averaging.
B.2. Discussion on "Convergence Analysis vs. Generalization Performance"

From convergence analysis to a better understanding of generalization.
A line of recent research reveals the interplay between the initial training (optimization) phase (Jastrzebski et al., 2020; Golatkar et al., 2019; Achille et al., 2019) and the local minima reached later (generalization) (Neyshabur, 2017; Lin et al., 2020b;a; Gupta et al., 2020; Izmailov et al., 2018; Keskar et al., 2017): the generalization of deep nets cannot be studied via vacuous generalization bounds alone, and the practical performance is contingent on the critical initial learning (optimization) phase, which can be characterized by conventional convergence analysis (Achille et al., 2019; Izmailov et al., 2018; Golatkar et al., 2019; Lin et al., 2020b; Gupta et al., 2020; Jastrzebski et al., 2020).

This motivates us to derive the metric (i.e. the critical consensus distance) from the convergence analysis, for the examination of the consensus distance (in different phases) of decentralized deep learning training. For example, (1) we identify the impact of different consensus distances during the critical learning phase on the quality of the initial optimization and the final generalization (Jastrzebski et al., 2020; Golatkar et al., 2019; Achille et al., 2019; Lin et al., 2020b) (i.e. our studied case of dec-phase-1), and (2) we make observations similar to Lin et al. (2020b;a); Gupta et al. (2020); Izmailov et al. (2018) when optimization is no longer the bottleneck (our studied case of dec-phase-2), where a non-zero consensus distance can act as a form of noise injection (Lin et al., 2020b) or of sampling models from a posterior distribution (Gupta et al., 2020; Izmailov et al., 2018), as discussed above.
C. Theory
In this section, we prove the claims from Section 3.
C.1. Proof of Proposition 3.2, Critical Consensus Distance
The proof of this claim follows from the following lemma:
Lemma C.1 (Koloskova et al. (2020b), descent lemma for the non-convex case). Under the given assumptions, and for any stepsize $\gamma \le \frac{1}{4L}$, the iterates of D-SGD satisfy
$$\mathbb{E}_{t+1} f(\bar x^{(t+1)}) \le f(\bar x^{(t)}) - \frac{\gamma}{4}\big\|\nabla f(\bar x^{(t)})\big\|^2 + \gamma L^2\,\Xi_t + \frac{L\gamma^2}{n}\hat\sigma^2.$$

Proof.
By replacing $\Xi_t$ in the above inequality with the critical consensus distance from (4), we obtain
$$\mathbb{E}_{t+1} f(\bar x^{(t+1)}) \le f(\bar x^{(t)}) - \frac{\gamma}{8}\big\|\nabla f(\bar x^{(t)})\big\|^2 + \frac{2L\gamma^2}{n}\hat\sigma^2.$$
This inequality now matches (up to differences in the constants) the standard recursion that one can derive for C-SGD (Dekel et al., 2012; Bottou et al., 2018; Stich & Karimireddy, 2019). $\square$
C.2. Proof of Proposition 3.3, typical consensus distance
We need an auxiliary (but standard) lemma, to estimate the change of the consensus distance between iterations.
Lemma C.2 (Consensus distance). It holds (in expectation) that
$$\Xi_{t+1} \le \Big(1 - \frac{p}{2}\Big)\Xi_t + \frac{3(1-p)\gamma^2}{p}\big(\phi_t^2 + p\,\hat\sigma^2\big).$$
We give the proof of this statement shortly below. First, let us consider how this lemma allows us to prove the claim. Consider the special case where $\phi_t \le \phi$ for a constant $\phi$. In this case, we can easily verify by unrolling the recursion that
$$\Xi_t \le \sum_{i=0}^{t-1}\Big(1 - \frac{p}{2}\Big)^i \frac{3(1-p)\gamma^2(\phi^2 + p\hat\sigma^2)}{p} \le 6(1-p)\gamma^2\Big(\frac{\phi^2}{p^2} + \frac{\hat\sigma^2}{p}\Big).$$
Now, for the claim in the main text, we use the assumption that the $\phi_t$ are changing slowly, that is, not decreasing faster than exponentially: $\phi_t^2 \le (1 + \tfrac{p}{4})\phi_{t+1}^2$. With this assumption, and observing that $(1 - \tfrac{p}{2})^i(1 + \tfrac{p}{4})^i \le (1 - \tfrac{p}{4})^i$, we can unroll as before:
$$\Xi_t \le \sum_{i=0}^{t-1}\Big(1 - \frac{p}{2}\Big)^i \frac{3(1-p)\gamma^2(\phi_{t-i-1}^2 + p\hat\sigma^2)}{p} \le \sum_{i=0}^{t-1}\Big(1 - \frac{p}{4}\Big)^i \frac{3(1-p)\gamma^2(\phi_{t-1}^2 + p\hat\sigma^2)}{p} \le 12(1-p)\gamma^2\Big(\frac{\phi_{t-1}^2}{p^2} + \frac{\hat\sigma^2}{p}\Big).$$

Proof of Lemma C.2.
We use the following matrix notation:
$$X^{(t)} := \big[x_1^{(t)}, \dots, x_n^{(t)}\big] \in \mathbb{R}^{d\times n}, \qquad \bar X^{(t)} := \big[\bar x^{(t)}, \dots, \bar x^{(t)}\big] = X^{(t)}\tfrac{\mathbf{1}\mathbf{1}^\top}{n},$$
$$\nabla F(X^{(t)}, \xi^{(t)}) := \big[\nabla F_1(x_1^{(t)}, \xi_1^{(t)}), \dots, \nabla F_n(x_n^{(t)}, \xi_n^{(t)})\big], \qquad \nabla f(X^{(t)}) := \big[\nabla f_1(x_1^{(t)}), \dots, \nabla f_n(x_n^{(t)})\big].$$
As a reminder, $\Xi_t := \frac{1}{n}\sum_{i=1}^n \|\bar x^{(t)} - x_i^{(t)}\|^2$ and $\phi_t^2 := \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_i^{(t)})\|^2$. Then
$$n\,\Xi_{t+1} = \big\|\bar X^{(t+1)} - X^{(t+1)}\big\|_F^2 = \Big\|\big(X^{(t)} - \gamma\nabla F(X^{(t)},\xi^{(t)})\big)\Big(\tfrac{\mathbf{1}\mathbf{1}^\top}{n} - W\Big)\Big\|_F^2 = \Big\|\big(X^{(t)} - \gamma\nabla F(X^{(t)},\xi^{(t)})\big)\Big(\tfrac{\mathbf{1}\mathbf{1}^\top}{n} - I\Big)\Big(\tfrac{\mathbf{1}\mathbf{1}^\top}{n} - W\Big)\Big\|_F^2$$
$$\le (1-p)\,\Big\|\big(X^{(t)} - \gamma\nabla F(X^{(t)},\xi^{(t)})\big)\Big(\tfrac{\mathbf{1}\mathbf{1}^\top}{n} - I\Big)\Big\|_F^2$$
$$\le (1-p)\,\Big\|\big(X^{(t)} - \gamma\nabla f(X^{(t)})\big)\Big(\tfrac{\mathbf{1}\mathbf{1}^\top}{n} - I\Big)\Big\|_F^2 + (1-p)\gamma^2\big\|\nabla f(X^{(t)}) - \nabla F(X^{(t)},\xi^{(t)})\big\|_F^2$$
$$\le (1-p)(1+\alpha)\,\Big\|X^{(t)}\Big(\tfrac{\mathbf{1}\mathbf{1}^\top}{n} - I\Big)\Big\|_F^2 + (1-p)(1+\alpha^{-1})\gamma^2\big\|\nabla f(X^{(t)})\big\|_F^2 + (1-p)\gamma^2\hat\sigma^2 n$$
$$\overset{\alpha = p/2}{\le} \Big(1 - \frac{p}{2}\Big)\,n\,\Xi_t + \frac{3(1-p)}{p}\gamma^2\big\|\nabla f(X^{(t)})\big\|_F^2 + (1-p)\gamma^2\hat\sigma^2 n,$$
where the first inequality uses Assumption 1, the next step holds in expectation over $\xi^{(t)}$ (the cross term vanishes and $\mathbb{E}\|\nabla f(X^{(t)}) - \nabla F(X^{(t)},\xi^{(t)})\|_F^2 \le n\hat\sigma^2$), and the last step uses Young's inequality with $\alpha = \tfrac{p}{2}$ together with $(1-p)(1+\tfrac{p}{2}) \le 1 - \tfrac{p}{2}$ and $(1-p)(1+\tfrac{2}{p}) \le \tfrac{3(1-p)}{p}$. Dividing by $n$ and using $(1-p) \le \tfrac{3(1-p)}{p}\,p$ yields the claim. $\square$

C.3. Sufficient bounds to meet the critical consensus distance
In this section, we show that the bounds claimed in Section 3.3 are sufficient conditions to reach the CCD. According to Proposition 3.3, there exists an absolute constant $C$ (w.l.o.g. $C \ge 2$) such that
$$\Xi_t \le C(1-p)\gamma^2\Big(\frac{\phi_t^2}{p^2} + \frac{\hat\sigma^2}{p}\Big).$$
By smoothness,
$$\phi_t^2 = \frac{1}{n}\sum_{i=1}^n\big\|\nabla f_i(x_i^{(t)})\big\|^2 \le \frac{3}{n}\sum_{i=1}^n\big\|\nabla f_i(x_i^{(t)}) - \nabla f(x_i^{(t)})\big\|^2 + \frac{3}{n}\sum_{i=1}^n\big\|\nabla f(x_i^{(t)}) - \nabla f(\bar x^{(t)})\big\|^2 + 3\big\|\nabla f(\bar x^{(t)})\big\|^2 \le 3\zeta^2 + 3L^2\Xi_t + 3\big\|\nabla f(\bar x^{(t)})\big\|^2.$$
Supposing $(1-p)\gamma^2 \le \frac{p^2}{6CL^2}$, we can therefore estimate
$$\Xi_t \le C(1-p)\gamma^2\bigg(\frac{3\|\nabla f(\bar x^{(t)})\|^2 + 3L^2\Xi_t + 3\zeta^2}{p^2} + \frac{\hat\sigma^2}{p}\bigg) \le C(1-p)\gamma^2\bigg(\frac{3\|\nabla f(\bar x^{(t)})\|^2 + 3\zeta^2}{p^2} + \frac{\hat\sigma^2}{p}\bigg) + \frac{\Xi_t}{2},$$
and hence
$$\Xi_t \le 2C(1-p)\gamma^2\bigg(\frac{3\|\nabla f(\bar x^{(t)})\|^2}{p^2} + \frac{3\zeta^2}{p^2} + \frac{\hat\sigma^2}{p}\bigg). \qquad (8)$$
The claimed bounds can now be verified by plugging the provided values into (8). For simplicity, in the main text we assume $\zeta = 0$ (the data-center training scenario).

Small $\gamma$. Choosing $\gamma \le \frac{p}{c\,nLC}$ for a sufficiently large absolute constant $c$, the condition $(1-p)\gamma^2 \le \frac{p^2}{6CL^2}$ above is satisfied, and plugging this choice into (8) shows that its right-hand side is at most the critical consensus distance in (4).

Small $p$. Choosing $1-p \le \frac{1}{c\,C(1+\gamma L n)}$ for a sufficiently large absolute constant $c$, we note that $p \ge \frac12$ (as $C \ge 2$), and the condition $(1-p)\gamma^2 \le \frac{p^2}{6CL^2}$ is again satisfied (recall that $\gamma \le \frac{1}{4L}$ throughout). Plugging this choice into (8) and using $\gamma \le \frac{1}{4L}$ once more shows that its right-hand side is again at most the critical consensus distance in (4).

In the above calculations we assumed $\zeta = 0$ for simplicity. For the general non-iid data case with $\zeta^2 > 0$, one can derive similar bounds on $\gamma$ and $p$; these bounds have a similar dependence on the parameters but are stricter. Indeed, the typical consensus distance is then also influenced by the non-iidness $\zeta^2$ of the data, and it is therefore harder to satisfy the CCD condition.

C.4. Proof of Lemma 3.4, repeated gossip
By the assumption stated in the lemma, it holds for each component $W_i$ of the product $W = W_k \cdots W_1$, $i \in [1, k]$, that
$$\mathbb{E}_{W_i}\big\|XW_i - \bar X\big\|_F^2 \le (1-p)\big\|X - \bar X\big\|_F^2, \qquad \forall X \in \mathbb{R}^{d\times n}.$$
Let us now estimate the parameter $p_W$. Using that the $W_i$ are independent,
$$\mathbb{E}_W\big\|XW - \bar X\big\|_F^2 = \mathbb{E}_{W_1,\dots,W_k}\big\|XW_k\cdots W_1 - \bar X\big\|_F^2 = \mathbb{E}_{W_2,\dots,W_k}\,\mathbb{E}_{W_1}\big\|YW_1 - \bar Y\big\|_F^2,$$
where we defined $Y = XW_k\cdots W_2$ and used that $W_i\frac{\mathbf{1}\mathbf{1}^\top}{n} = \frac{\mathbf{1}\mathbf{1}^\top}{n}$, so that $\bar Y = XW_k\cdots W_2\frac{\mathbf{1}\mathbf{1}^\top}{n} = X\frac{\mathbf{1}\mathbf{1}^\top}{n} = \bar X$. Therefore,
$$\mathbb{E}_W\big\|XW - \bar X\big\|_F^2 \le (1-p)\,\mathbb{E}_{W_2,\dots,W_k}\big\|XW_k\cdots W_2 - \bar X\big\|_F^2.$$
Applying the same argument to the remaining factors, we conclude that $1 - p_W = (1-p)^k$. $\square$
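As an illustration of this lemma (our addition), repeated gossip with a fixed ring mixing matrix exhibits the same geometric improvement, since $W^k$ has second-largest eigenvalue magnitude $|\lambda_2|^k$:

    import numpy as np

    n = 32
    W = np.zeros((n, n))                       # ring mixing matrix: 1/3 on self and both neighbours
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

    def one_minus_p(M: np.ndarray) -> float:
        # 1 - consensus rate = squared second-largest eigenvalue magnitude of M
        return float(np.sort(np.abs(np.linalg.eigvalsh(M)))[-2] ** 2)

    base = one_minus_p(W)
    for k in (1, 2, 5, 10):
        repeated = one_minus_p(np.linalg.matrix_power(W, k))   # k repeated gossip steps
        print(f"k = {k:2d}:  1 - p_W = {repeated:.6f}   (1 - p)^k = {base ** k:.6f}")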
D. Detailed Experimental Setup

Comments on large-batch training.
Coupling the quality loss issue of decentralized training with the difficulty of large-batch training is non-trivial and out of the scope of this paper. Instead, we use reasonable local mini-batch sizes (together with the number of workers, denoted by $n$), as well as well-developed large-batch training techniques (Goyal et al., 2017), to avoid the difficulties of extreme large-batch training.

Multi-phase experiment justification.
The averaged local gradient norm $\phi_t$ as well as the $L$-smoothness of ResNet-20 on CIFAR-10 for a ring and a complete graph ($n = 32$) are shown in Figure 6 and Figure 7, respectively. The estimation procedure is analogous to that in (Santurkar et al., 2018; Lin et al., 2020a): we take 8 additional steps along the direction of the current update, each with a fraction of the normal step size; this is computed every 8 training steps. The smoothness is evaluated as the maximum value of $L$ satisfying Assumption 2. A sketch of this estimation procedure is given below.

Figure 6: Justification for our multi-phase experimental design choice (ring graph). We run ResNet-20 on CIFAR-10 ($n = 32$) with the ring topology. (a) The consensus distance, (b) the averaged norm of the local gradients, and (c) the gradient Lipschitz ($L$-smoothness) curve for decentralized training. The three quantities most relevant to optimization all naturally form three phases, dictated by the learning rate schedule.

Figure 7: Justification for our multi-phase experimental design choice (complete graph). We run ResNet-20 on CIFAR-10 ($n = 32$) with the complete topology. (a) The averaged norm of the local gradients and (b) the gradient Lipschitz ($L$-smoothness) curve for centralized training. We again observe that the quantities most relevant to optimization naturally form three phases, dictated by the learning rate schedule.
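The probing procedure can be sketched as follows (our reading of the description above, with hypothetical names; the step fraction and the number of probe steps are assumptions):

    import numpy as np

    def estimate_local_smoothness(grad_fn, x, update_dir, step_size, n_probe=8, frac=0.1):
        """Estimate a local smoothness constant L along the current update direction:
        the largest observed ratio ||grad(x + delta) - grad(x)|| / ||delta|| over a few
        probe steps, each a small fraction of the normal step size.
        grad_fn maps a flat parameter vector to its (stochastic) gradient."""
        direction = update_dir / (np.linalg.norm(update_dir) + 1e-12)
        g0 = grad_fn(x)
        L_est = 0.0
        for k in range(1, n_probe + 1):
            delta = k * frac * step_size * direction
            L_est = max(L_est, np.linalg.norm(grad_fn(x + delta) - g0) / np.linalg.norm(delta))
        return L_est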
E. Additional Results

E.1. Understanding the Consensus Averaging Problem
We study a host of communication topologies: (1) deterministic topologies (ring and complete graph) and (2) undirected time-varying topologies (described below).
• Random matching (Boyd et al., 2006). At each communication step, all nodes are randomly divided into non-overlapping pairs; each node is paired with any other node with equal probability.
• Exponential graph (Assran et al., 2019). Each node is assigned a rank from $0$ to $n-1$. Each node $i$ periodically communicates with the nodes of rank $i + 2^0, i + 2^1, \dots, i + 2^{\lfloor \log_2(n-1) \rfloor}$ (mod $n$). In the one-peer-per-node experiments, each node only communicates with one node at a time by cycling through its list. The formed graph is undirected, i.e., both transmission and reception take place in each communication.

• Bipartite exponential graph (Lian et al., 2018; Assran et al., 2019). In order to avert deadlocks (Lian et al., 2018), a node with an odd rank $i$ cycles through the nodes with even ranks $i + 2^0 - 1, i + 2^1 - 1, \dots, i + 2^{\lfloor \log_2(n-1) \rfloor} - 1$ by transmitting a message and waiting for a response, while the nodes with even ranks only await messages and reply upon reception.

A sketch of these mixing-matrix constructions is given below. Table 10 displays the spectral gap and node degree of the studied topologies, and Figure 8 provides the convergence curves for the different communication topologies and graph scales. Figure 9 in addition visualizes the spectral gap (in expectation) of the different communication topologies.
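For concreteness, the sketch below (our illustration; the exact weights and pairing schedules used in the experiments may differ) builds gossip matrices for a fixed ring, a simple one-peer exponential schedule, and random matchings, and reproduces the flavour of the consensus-averaging comparison in Figure 8:

    import numpy as np

    def ring(n):
        """Fixed ring: weight 1/3 on the node itself and on both neighbours."""
        W = np.zeros((n, n))
        for i in range(n):
            W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
        return W

    def one_peer_exponential(n, step):
        """One simple time-varying variant: at round `step`, nodes are paired at distance
        2^(step mod log2(n)) and each pair averages (weight 1/2). Assumes n is a power of two."""
        offset = 2 ** (step % int(np.log2(n)))
        W = np.zeros((n, n))
        for i in range(n):
            if (i // offset) % 2 == 0:
                j = (i + offset) % n
                W[i, i] = W[j, j] = 0.5
                W[i, j] = W[j, i] = 0.5
        return W

    def random_matching(n, rng):
        """Random matching: nodes are split into random non-overlapping pairs that average."""
        perm = rng.permutation(n)
        W = np.eye(n)
        for a, b in zip(perm[0::2], perm[1::2]):
            W[a, a] = W[b, b] = 0.5
            W[a, b] = W[b, a] = 0.5
        return W

    rng = np.random.default_rng(0)
    n, d, T = 32, 10, 60
    X0 = rng.normal(size=(d, n))
    schedules = [("fixed ring", lambda t: ring(n)),
                 ("one-peer exponential", lambda t: one_peer_exponential(n, t)),
                 ("random matching", lambda t: random_matching(n, rng))]
    for name, sampler in schedules:
        X = X0.copy()
        for t in range(T):
            X = X @ sampler(t)                              # one gossip step
        dist = float(np.sum((X - X.mean(axis=1, keepdims=True)) ** 2) / n)
        print(f"{name:22s} consensus distance after {T} gossip steps: {dist:.3e}")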
Table 10: Spectral gap and node degree of the studied topologies.

    Topology                      Spectral gap (in expectation)   Node degree (n nodes)
    Complete                      1                               n
    Fixed ring                    O(1/n^2)                        2
    Exponential graph             O(1)                            2
    Bipartite exponential graph   O(1)                            1
    Random matching               O(1)                            1

Figure 8: Convergence curves for the consensus averaging problem on different communication topologies (ring, exponential graph, random matching, bipartite exponential graph) and different graph scales ((a) $n = 16$, (b) $n = 32$, (c) $n = 64$); the consensus distance is plotted over gossip steps. This figure complements Figure 4 in the main text.

Figure 9: The spectral gap (in expectation) of the different communication topologies on different graph scales ((a) $n = 16$, (b) $n = 32$).
Table 11 examines these topologies on a standard deep learning benchmark with different graph scales, while Figure 10 visualizes the required communication rounds (per gradient update step) for a range of consensus distance targets.
E.2. Understanding Decentralized Deep Learning Training for CV Tasks

We use a ring as the underlying decentralized communication topology in this subsection.
Elaborated results on consensus distance control.
Table 12 is the elaborated version of Table 2 with more evaluated consensus distances.
Table 11: The effect of communication topologies and scales (ResNet-20 on CIFAR-10 with $n \in \{16, 32\}$). Test top-1 accuracies (averaged over three seeds, with fine-tuned learning rates) are reported for the complete graph, fixed ring, exponential graph, bipartite exponential graph, and random matching.

Figure 10:
Target consensus distance vs. the required number of communication rounds (per gradient update step), for training ResNet-20 on CIFAR-10 with different communication topologies. We focus on the setup of dec-phase-1 and vary the target consensus distance for the different communication topologies. Since the consensus distance changes over training (within the phase-1 of interest), we report the averaged consensus distance. The exponential graph and random matching topologies enable fast convergence of gossip averaging, and thus only a few communication rounds are required to reach the target consensus distance.
Table 12: The impact of the consensus distance in different phases on the generalization performance (test top-1 accuracy) of training ResNet-20 on CIFAR-10. This is the elaborated version of Table 2 with more evaluated consensus distance targets (fractions of $\Xi_{\max}$) for dec-phase-1, dec-phase-2, dec-phase-3, and dec-phase-2 + dec-phase-3, for $n = 32$ and $n = 64$; the centralized baseline and fully decentralized training (all phases on a fixed ring, without consensus distance control) are included for reference.

SlowMo cannot fully address the decentralized optimization/generalization difficulty.
Table 13 studies the effectiveness of using SlowMo to improve decentralized training. We observe that even though SlowMo boosts the performance of decentralized training to some extent, it cannot fully address the quality loss caused by decentralization.
Table 13:
The effect of SlowMo for decentralized learning, for training ResNet-20 on CIFAR-10 ($n = 32$). The results (over three random seeds) use the tuned SlowMo hyper-parameters from the original paper (Wang et al., 2020b). Test top-1 accuracies with and without SlowMo are reported for the exponential graph and ring topologies, together with the centralized baseline.

On the ineffectiveness of tuning the learning rate.
Table 14 displays the results of training ResNet-20 on CIFAR-10 (32 nodes) with a fine-tuned learning rate for phase-1; learning rate tuning cannot address the test quality loss caused by a large consensus distance (i.e. above the CCD).
Prolonged training for dec-phase-2 and dec-phase-3.
Table 15 shows the results of prolonged dec-phase-2 and dec-phase-3 training on CIFAR-10 with ResNet-20. We observe that although a longer training duration increases the performance, the improvement is rather small.
Table 14:
Phase-1 consensus distance control performance with fine-tuned learning rates, for training ResNet-20 on CIFAR-10 ($n = 32$). The setup is identical to that of Table 2, except that we fine-tune the learning rate for each case from a grid of linear scaling-up factors. The results (over three seeds) compare the tuned learning rate from the search grid against the default learning rate, for several consensus distance targets (fractions of $\Xi_{\max}$).

Table 15:
The impact of different numbers of training epochs (at phase-2 and phase-3) on generalization, for training ResNet-20 on CIFAR-10 (ring topology with $n = 32$). The number of epochs at phase-1 is chosen from three values, while the rest of the training reuses our default setup. Experiments are run over 2 seeds. Test top-1 accuracies are reported for dec-phase-2 and dec-phase-3 under several consensus distance targets (fractions of $\Xi_{\max}$).

The impact of the half cosine learning rate schedule.
Table 16 examines the existence of the critical consensus distance with the half cosine learning rate schedule (this scheme is studied in He et al. (2019) as a new paradigm for CNN training). We can see from Table 16 that the effect of the critical consensus distance generalizes to this learning rate schedule: there exists a critical consensus distance in the initial training phase (as revealed in the inline figure of Table 16) that ensures good optimization and generalization. A sketch of the half cosine schedule is given below.
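For reference, the half cosine schedule can be written as follows (a minimal sketch; the peak learning rate, warmup length, and step counts are placeholders, not the values used in our experiments):

    import math

    def half_cosine_lr(step, total_steps, lr_max, warmup_steps=0):
        """Optional linear warmup, then lr_max * 0.5 * (1 + cos(pi * progress))."""
        if warmup_steps > 0 and step < warmup_steps:
            return lr_max * (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return lr_max * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

    # example usage with hypothetical numbers
    lrs = [half_cosine_lr(s, total_steps=1000, lr_max=0.1) for s in range(1000)]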
Table 16:
The impact of the half cosine learning rate schedule on generalization, for training ResNet-20 on CIFAR-10 (ring topology with $n = 32$). The inline figure depicts the uncontrolled consensus distance over the whole training procedure under the half cosine learning rate schedule. Only one training phase is considered for the consensus distance control, and the numerical results in the table are averaged over 3 seeds. Test top-1 accuracies are reported for the ring with target consensus distances $\Xi_{\max}$ and fractions thereof, and for the complete graph.
E.2.1. Adaptive Consensus Distance Control
In Table 17, we apply adaptive consensus distance control. The observations are consistent with those of the constant consensus distance control experiments.
Table 17:
The impact of different consensus distances on optimization and/or generalization, for different phases of training ResNet-20 on CIFAR-10 ($n = 32$). The table is almost identical to Table 2, except that the consensus distance is controlled by the (runtime) averaged norm of the local gradients (i.e. adaptive consensus distance). Test top-1 accuracies are reported for Phase 1, Phase 2, and Phase 3 under the target $\Xi_{\max}$ and several multiples of $\phi_t^{\text{ema}}$.

E.3. Consensus control with other topologies
In Table 18, we exert consensus control with an exponential graph as the base communication topology; the local update step corresponds to the number of local model update steps per communication round, and we use it as a way to increase the discrepancy (consensus distance) among nodes (see the sketch below). We observe that our findings from the main experiments with a ring base topology remain valid.
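The local update step mechanism can be sketched as follows (our illustration; `grad_fn` is a hypothetical stand-in for a node's stochastic gradient computation): each node performs several SGD updates before a single gossip round, which increases the discrepancy among nodes.

    import numpy as np

    def local_steps_then_gossip(X, W, grad_fn, lr, local_steps):
        """X: d x n matrix of local models (columns); grad_fn(x_i, i) returns node i's
        stochastic gradient. Perform `local_steps` local updates per node, then gossip once."""
        d, n = X.shape
        for _ in range(local_steps):
            for i in range(n):
                X[:, i] -= lr * grad_fn(X[:, i], i)   # local model update, no communication
        return X @ W                                   # one decentralized communication round

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n, d = 8, 5
        W = np.full((n, n), 1.0 / n)                   # complete-graph averaging, for the demo only
        X = rng.normal(size=(d, n))
        X = local_steps_then_gossip(X, W, lambda x, i: x, lr=0.1, local_steps=2)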
Table 18: The impact of quality propagation across phases (in both phase 1 and phase 2) on an undirected time-varying exponential graph ($n = 32$), similar to Table 5. Test top-1 accuracies are reported for combinations of phase-1 and phase-2 consensus distance targets ($\Xi_{\max}$ and multiples of $\phi_t^{\text{ema}}$) and local update steps in $\{1, 2, 4\}$ for phase 2.
E.3.1. The Existence of the Optimal Consensus Distance for Noise Injection

Table 19 uses a different communication topology (i.e. a time-varying exponential graph) for decentralized optimization. Here an exponential graph with a large spectral gap is applied to CIFAR-10 dec-phase-2 training, and we apply adaptive consensus distance control in this set of experiments. We observe that increasing the consensus distance further by taking local steps improves generalization; however, too many local steps diminish the performance. For instance, for ratio = 2, the performance peaks at 2 local update steps and drops at 4 local update steps. This indicates that an optimal consensus distance is required to inject the proper amount of stochastic noise for better generalization.
Table 19:
The impact of different consensus distances at phase 2, for training ResNet-20 on CIFAR-10 with a time-varying exponential graph ($n = 32$). The reported test top-1 accuracies are averaged over three seeds and cover local update steps in $\{1, 2, 4\}$ with consensus distance targets $\Xi_{\max}$ and multiples of $\phi_t^{\text{ema}}$; the baseline uses the exponential graph for the entire decentralized training without control.

E.4. Results for Training Transformer on Multi30k
We additionally report decentralized training results for a downscaled transformer model (relative to the base model in Vaswani et al. (2017)) on Multi30k (Elliott et al., 2016). Figure 11 shows that the straightforward application of Adam in a decentralized manner does encounter generalization problems, which we attribute to the fact that the different local moment buffers (in addition to the weights) become too diverse. Tuning the learning rate schedule cannot address the issue of decentralized Adam, as shown in Figure 11(b).

Figure 11: Learning curves (validation loss over epochs) for training the transformer model on the Multi30k dataset ($n = 32$). (a) The limitation of decentralized learning with Adam, caused by the diverging local moment buffers, shown for All-Reduce and for decentralized training on the complete, ring, and exponential graph topologies. (b) Tuning the learning rate (warmup steps in $\{125, 250, 500, 1000\}$) cannot alleviate the issue of decentralized Adam.