TornadoAggregate: Accurate and Scalable Federated Learning via the Ring-Based Architecture
Jin-woo Lee, Jaehoon Oh, Sungsu Lim, Se-Young Yun, Jae-Gil Lee
Korea Advanced Institute of Science and Technology; Chungnam National University
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
Federated learning has emerged as a new paradigm of collaborative machine learning; however, many prior studies have used global aggregation along a star topology without much consideration of communication scalability or of the diurnal property arising from the variety of clients' local times. In contrast, a ring architecture can resolve the scalability issue and even satisfy the diurnal property by iterating over nodes without an aggregation. Nevertheless, such ring-based algorithms inherently suffer from a high-variance problem. To this end, we propose a novel algorithm called
TornadoAggregate that improves both accuracy and scalability by facilitating the ring architecture. In particular, to improve the accuracy, we reformulate the loss minimization into a variance reduction problem and establish three principles to reduce variance: Ring-Aware Grouping, Small Ring, and Ring Chaining. Experimental results show that TornadoAggregate improved the test accuracy by up to . and achieved near-linear scalability.

Introduction

Federated learning (Konečný et al. 2016a; McMahan et al. 2017b) enables mobile devices to collaboratively learn a shared model while keeping all training data on the devices, thus avoiding data transfer to the cloud or a central server. One of the main reasons for the recent boom in federated learning is that it does not compromise user privacy. In this framework, the star architecture (Figure 1(a)), in which a central parameter server aggregates and broadcasts locally learned models, has been most widely adopted in favor of its simple distributed parallelism. However, the star architecture can easily become a communication bottleneck and cannot take into account the diurnal property of federated learning (McMahan et al. 2017a; Eichner et al. 2019), in which the global data distribution over clients varies significantly owing to differences in the clients' local times.

The ring architecture (Figure 1(b)), in contrast, can resolve the scalability issue and even satisfy the diurnal property by iterating over nodes without a central coordinator. In addition, it has the potential to improve accuracy through an unbiased estimation of conventional centralized learning at the expense of communication cost. Notably, Duan et al. (2020) proposed a star architecture with ring-based groups, while Ding et al. (2020) proposed a ring architecture with star-based groups. Ghosh et al. (2020) and Eichner et al. (2019) proposed star-based and ring-based groups, respectively, without global communication. Despite the importance of the problem, little attention has been paid to how a ring-based architecture should be developed from the perspective of both accuracy and scalability.

Figure 1: Representative architectures and the proposed ring-based algorithm TornadoAggregate. (a) STAR; (b) RING; (c) TornadoAggregate with a global ring; (d) TornadoAggregate with group rings. Each panel depicts local, group, and global models.

In this paper, we propose a novel
TornadoAggregate algorithm that improves both accuracy and scalability by facilitating the ring architecture. In particular, to improve accuracy, TornadoAggregate aims at reducing the variance inherent in a ring iteration by considering three principles: ring-aware grouping, small ring, and ring chaining. Based on the ring-aware grouping principle, for TornadoAggregate with a global ring (Figure 1(c)) and with group rings (Figure 1(d)), nodes are grouped so as to reduce the inter-group variance in the global ring and the inter-node variance in each group ring, respectively; the number of groups is adjusted to satisfy the small ring principle, thus keeping the variance small; and we introduce a ring chaining technique that increases the batch size with high node utilization in a ring, leading to reduced variance. We confirmed that TornadoAggregate achieved a higher accuracy by up to . and near-linear scalability.

Architecture | Convergence Bound | Communication Scalability
STAR | O(Dh(τ₁)) | O(|N|)
RING | 0 (Approximate) | O(1)
STAR-stars | O(Δh(τ₁τ₂) + δh(τ₁)) | O(|N|)
STAR-rings | O(Dh(τ₁τ₂)) (Approximate) | O(|G|)
RING-stars | O(Dh(τ₁)) (Approximate) | O(|N|/|G|)
RING-rings | 0 (Approximate) | O(1)
stars | O(δh(τ₁)) | O(|N|)
rings | 0 (Approximate) | O(|G|)

Table 1: Comparison of architectures. |N| and |G| denote the number of nodes and groups, respectively. D, δ, and Δ denote the local-to-global, local-to-group, and group-to-global divergence, respectively.

We first briefly describe federated learning and then survey the relevant architectures in terms of accuracy and scalability.

Basics of Federated Learning
The objective of federated learning is to find an approximate solution of Eq. (1) (McMahan et al. 2017b). Here, F(w) is the loss of predictions with a model w over the set of all data examples D ≜ ∪_{i∈N} D_i across all nodes, where N is the set of node indices, and F_i(w) ≜ (1/|D_i|) Σ_{(x,y)∈D_i} ℓ(w, x, y) is the loss of predictions with a loss function ℓ parameterized by w over the set of data examples (x, y) ∈ D_i on node i.

$$\min_{\mathbf{w} \in \mathbb{R}^d} F(\mathbf{w}) \quad \text{where} \quad F(\mathbf{w}) \triangleq \sum_{i \in \mathcal{N}} \frac{|\mathcal{D}_i|}{|\mathcal{D}|} F_i(\mathbf{w}) \qquad (1)$$

To solve Eq. (1), a huge number of architectures are being actively proposed, and based on their hierarchical composition, they can be classified into three main categories: flat, consensus group, and pluralistic group. Table 1 compares them in terms of convergence bound and scalability, which are analyzed in Appendices A and B, respectively.

Flat Architecture
Flat represents an architecture without hierarchical composition; it includes the STAR and RING architectures. STAR is the same as the canonical FedAvg (McMahan et al. 2017b) without node sampling, as defined by Definition 1.

Definition 1. STAR involves a local update, which learns each local model w^i with learning rate η by performing gradient descent steps, and a global aggregation, which learns the global model w by aggregating all w^i along a star topology and synchronizes each w^i with w every τ₁ epochs, as in Eq. (2).

$$\mathbf{w}^i_t \triangleq \begin{cases} \mathbf{w}^i_{t-1} - \eta \nabla F_i(\mathbf{w}^i_{t-1}) & \text{if } t \bmod \tau_1 \neq 0 \\ \mathbf{w}_t & \text{if } t \bmod \tau_1 = 0 \end{cases} \quad \text{where} \quad \mathbf{w}_t \triangleq \sum_{i \in \mathcal{N}} \frac{|\mathcal{D}_i|}{|\mathcal{D}|} \left[ \mathbf{w}^i_{t-1} - \eta \nabla F_i(\mathbf{w}^i_{t-1}) \right] \qquad (2)$$

STAR exhibits the simplest distributed parallelism, which at the same time leads to the low scalability of O(|N|) due to the communication bottleneck in the global aggregation. In contrast,
RING (Li et al. 2018; Eichner et al. 2019) resolves the aforementioned scalability issue by removing the global aggregation, as defined by Definition 2.

Definition 2. RING extends STAR by replacing the global aggregation with a global inter-node transfer that synchronizes a new model w^{i_j} on node i_j with the previously learned model w^{i_{j-1}} on node i_{j-1} every τ₁ epochs along a certain ring topology [i_j ∈ N | j = ⌊t/τ₁⌋, i_{j+|N|} = i_j] with period |N| to satisfy the diurnal property, as in Eq. (3).

$$\mathbf{w}^{i_j}_t \triangleq \begin{cases} \mathbf{w}^{i_j}_{t-1} - \eta \nabla F_{i_j}(\mathbf{w}^{i_j}_{t-1}) & \text{if } t \bmod \tau_1 \neq 0 \\ \mathbf{w}^{i_{j-1}}_t & \text{if } t \bmod \tau_1 = 0 \end{cases} \qquad (3)$$

RING exhibits a low convergence bound, attributed to Theorem 1, and benefits from the high scalability of O(1). However, it is considered impractical in federated learning, where a large |N| is assumed, because it takes |N| times as many communication rounds as STAR to iterate over a global epoch.

Theorem 1. RING is an unbiased estimator of centralized learning, which learns a centralized model by assuming the federated datasets to be located at a centralized storage.

Owing to the lack of space, we defer all proofs to Appendix C. A minimal simulation sketch of the STAR and RING update rules is given below.
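To make the two flat update rules concrete, the following minimal simulation contrasts a STAR round (Eq. (2)) with a single RING pass (Eq. (3)). It is an illustrative sketch only: the quadratic per-node losses, node count, and step sizes are assumptions made for the example, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim, eta, tau1 = 8, 5, 0.1, 5           # nodes, model dimension, learning rate, local epochs
centers = rng.normal(size=(N, dim))         # per-node optima, so local losses disagree (non-IID)
sizes = rng.integers(50, 150, size=N)       # |D_i|, used as aggregation weights
weights = sizes / sizes.sum()

def local_grad(i, w):
    # Gradient of the toy loss F_i(w) = 0.5 * ||w - centers[i]||^2.
    return w - centers[i]

def star_round(models):
    # Eq. (2): tau1 local steps on every node, then a weighted global aggregation and broadcast.
    for i in range(N):
        for _ in range(tau1):
            models[i] = models[i] - eta * local_grad(i, models[i])
    w_global = sum(weights[i] * models[i] for i in range(N))
    return [w_global.copy() for _ in range(N)]

def ring_pass(w, order):
    # Eq. (3): a single model is handed from node to node along the ring, tau1 steps per node.
    for i in order:
        for _ in range(tau1):
            w = w - eta * local_grad(i, w)
    return w

models = [np.zeros(dim) for _ in range(N)]
for _ in range(20):
    models = star_round(models)
w_ring = ring_pass(np.zeros(dim), order=list(rng.permutation(N)))
print("STAR:", np.round(models[0], 3))
print("RING:", np.round(w_ring, 3))
```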
Consensus Group Architecture
Consensus group represents an architecture with a group hierarchy and global communication to reach a global consensus among groups; it includes four architectural combinations: STAR-stars, STAR-rings, RING-stars, and RING-rings. First, STAR-stars (Lin et al. 2018; Bonawitz et al. 2019; Liu et al. 2020; Luo et al. 2020; Abad et al. 2020) mitigates the non-IID issue via group-based learning (Zhao et al. 2018), which leads to improved accuracy, as defined by Definition 3.

Definition 3. STAR-stars extends STAR by additionally allowing multiple intermediate group star aggregations, and is thus postfixed by stars. In particular, the set of all node indices N is partitioned into sets of node indices for |G| node groups {N_k}_{k=1...|G|}, where ∪_{k∈G} N_k = N, ∀k ≠ l, N_k ∩ N_l = ∅, and D_k ≜ ∪_{i∈N_k} D_{k,i}. Then, for each group, it learns the group model w^k by aggregating all local models w^{k,i} along a group star topology and synchronizes each w^{k,i} with w^k every τ₁ epochs, as shown in Eq. (4). Similar to Eq. (2), a global aggregation is performed every τ₁τ₂ steps.

$$\mathbf{w}^{k,i}_t \triangleq \begin{cases} \mathbf{w}^{k,i}_{t-1} - \eta \nabla F_{k,i}(\mathbf{w}^{k,i}_{t-1}) & \text{if } t \bmod \tau_1 \neq 0 \\ \mathbf{w}^k_t & \text{if } t \bmod \tau_1 = 0,\; t \bmod \tau_1\tau_2 \neq 0 \\ \mathbf{w}_t & \text{if } t \bmod \tau_1\tau_2 = 0 \end{cases}$$
$$\text{where} \quad \mathbf{w}^k_t \triangleq \sum_{i \in \mathcal{N}_k} \frac{|\mathcal{D}_{k,i}|}{|\mathcal{D}_k|} \left[ \mathbf{w}^{k,i}_{t-1} - \eta \nabla F_{k,i}(\mathbf{w}^{k,i}_{t-1}) \right] \quad \text{and} \quad \mathbf{w}_t \triangleq \sum_{k \in \mathcal{G}} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \mathbf{w}^k_t \qquad (4)$$

(As long as the conditions are satisfied, a ring can be defined in any way, e.g., a random permutation in Algorithm 1.) A toy sketch of this two-level aggregation follows.
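The two levels of weighted averaging in Eq. (4) can be written out in a few lines. The following toy sketch only illustrates the aggregation step (group models from local models, then the global model from group models); the group assignment and data sizes are placeholders, not the paper's.

```python
import numpy as np

def weighted_average(models, data_sizes):
    # Weighted average with weights |D_i| / sum_j |D_j|, used at both levels of Eq. (4).
    sizes = np.asarray(data_sizes, dtype=float)
    weights = sizes / sizes.sum()
    return sum(w * m for w, m in zip(weights, models))

# Toy setup: two groups of three nodes each; "models" are short vectors for readability.
local_models = {
    0: [np.array([1.0, 0.0]), np.array([2.0, 0.0]), np.array([3.0, 0.0])],
    1: [np.array([0.0, 10.0]), np.array([0.0, 11.0]), np.array([0.0, 12.0])],
}
local_sizes = {0: [100, 100, 200], 1: [50, 50, 100]}

# Group aggregation (every tau1 steps): w^k = sum_i |D_{k,i}|/|D_k| * w^{k,i}.
group_models = {k: weighted_average(local_models[k], local_sizes[k]) for k in local_models}
# Global aggregation (every tau1*tau2 steps): w = sum_k |D_k|/|D| * w^k.
global_model = weighted_average(
    [group_models[k] for k in sorted(group_models)],
    [sum(local_sizes[k]) for k in sorted(local_sizes)],
)
print(group_models, global_model)
```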
STAR-stars is known to improve upon STAR under certain parameter settings in favor of its non-IID mitigation (Liu et al. 2020), but it exhibits the low scalability of O(|N|).

Next, analogous to the development of RING, STAR-rings (Duan et al. 2020), RING-stars (So, Guler, and Avestimehr 2020; Ding et al. 2020), and RING-rings (Eichner et al. 2019) also aim at improving both the convergence bound and scalability while sacrificing communication cost, as summarized in Table 1. Formal definitions are as follows.
Definition 4. Similar to Definition 2, STAR-rings extends STAR-stars by replacing the group aggregation with a group inter-node transfer that, for each group k ∈ G, synchronizes a new model w^{k,i_j} on node i_j ∈ N_k with the previously learned model w^{k,i_{j-1}} on node i_{j-1} every τ₁ epochs along a certain ring topology [i_j ∈ N_k | j = ⌊t/τ₁⌋, i_{j+|N_k|} = i_j] with period |N_k| to satisfy the diurnal property within each group. In short, w^{k,i}_t ≜ w^k_t (group aggregation) of Eq. (4) is replaced with w^{k,i_j}_t ≜ w^{k,i_{j-1}}_t (group inter-node transfer).

Definition 5. Similar to Definition 4, RING-stars extends STAR-stars by replacing the global aggregation with a global inter-group transfer that synchronizes a new local model w^{k_l,i} on node i ∈ N_{k_l} with the previously learned group model w^{k_{l-1}} of group k_{l-1} every τ₁τ₂ steps along a certain ring topology [k_l ∈ G | l = ⌊t/(τ₁τ₂)⌋, k_{l+|G|} = k_l] with period |G| to satisfy the diurnal property across all groups. In short, w^{k,i}_t ≜ w_t (global aggregation) of Eq. (4) is replaced with w^{k_l,i}_t ≜ w^{k_{l-1}}_t (global inter-group transfer).

Definition 6. RING-rings extends STAR-stars by replacing the group and global aggregations with the group inter-node and global inter-group transfers, respectively.

Similar to RING, RING-rings is considered impractical in federated learning due to the large number of nodes. The ring bookkeeping shared by these definitions is illustrated by the small helper below.
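The inter-node and inter-group transfers of Definitions 2, 4, and 5 share the same bookkeeping: at step t, the ring position is ⌊t/period⌋ and wraps around with the ring length. The helper below is a hypothetical illustration of that scheduling; the ring permutation and the step values are arbitrary.

```python
def active_element(t, period_steps, ring):
    """Return the ring element (node or group index) that is active at step t.

    period_steps is tau1 for node-level rings (Definitions 2 and 4) and
    tau1 * tau2 for the group-level ring of RING-stars (Definition 5).
    Wrapping with len(ring) matches the periodicity i_{j+|ring|} = i_j.
    """
    j = t // period_steps
    return ring[j % len(ring)]

ring = [3, 0, 2, 1]     # an arbitrary permutation of node (or group) indices
tau1 = 5
for t in (0, 4, 5, 14, 20, 39):
    print(t, "->", active_element(t, tau1, ring))
```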
Pluralistic Group Architecture
Pluralistic group represents an architecture with a group hierarchy but without global communication, so as to develop group models that are more independent and specialized than the aforementioned consensus model, which leads to decreased non-IIDness and, consequently, improved accuracy. Representative pluralistic group architectures include stars and rings. Recently, stars (Ghosh et al. 2019, 2020; Xie et al. 2020; Briggs et al. 2020; Sattler, Müller, and Samek 2020) has received great attention; it is defined by Definition 7.

Definition 7. stars is defined as STAR-stars without the global aggregation. Unlike Eq. (4), w^{k,i}_t is not synchronized with w_t, and thus the number of group communication rounds τ₂ is not defined.

Definition 8. Similar to Definition 7, rings is defined as STAR-rings without the global aggregation.

It is important to note that the growing popularity of pluralistic group architectures may be hype. According to Theorem 2, stars may achieve lower accuracy than STAR.

Theorem 2. The convergence bound O(δh(τ₁)) of stars is not necessarily better than O(Dh(τ₁)) of STAR.

Lastly, as shown in Table 1, rings can benefit from a low convergence bound as well as the high scalability of O(|G|).

As previously noted, ring-based architectures such as RING, STAR-rings, RING-stars, RING-rings, and rings have great potential to improve both accuracy and scalability. However, the convergence analysis framework introduced in this study is mostly based on the unbiasedness property so as to easily compare all of the architectures. To better understand the architectures from the perspective of accuracy, the variance should also be considered. To this end, based on Theorem 3, we reformulate the problem of Eq. (1) into the variance reduction of ring-based architectures.
Theorem 3. RING exhibits higher variance than the centralized learning (the unbiased estimator of RING from Theorem 1). The variance term in question is illustrated numerically below.
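The variance referred to in Theorem 3 is the term E_i‖∇F_i(w)‖² − ‖E_i∇F_i(w)‖² that appears in Eq. (31) of Appendix C: the weighted scatter of per-node gradients around the centralized gradient. The snippet below merely evaluates this quantity for hand-made gradients to show that it vanishes when nodes agree and grows when they do not; the gradients and weights are invented for the example.

```python
import numpy as np

def ring_learning_variance(per_node_grads, data_sizes):
    # E_i ||g_i||^2 - ||E_i g_i||^2, with E_i weighted by |D_i| / |D| (cf. Eq. (31)).
    grads = np.asarray(per_node_grads, dtype=float)
    probs = np.asarray(data_sizes, dtype=float)
    probs = probs / probs.sum()
    second_moment = float(np.sum(probs * np.sum(grads ** 2, axis=1)))
    mean_grad = (probs[:, None] * grads).sum(axis=0)
    return second_moment - float(mean_grad @ mean_grad)

iid_grads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]        # identical gradients -> zero variance
skewed_grads = [[1.0, 0.0], [-1.0, 0.5], [0.0, -2.0]]   # non-IID gradients -> large variance
print(ring_learning_variance(iid_grads, [1, 1, 1]))
print(ring_learning_variance(skewed_grads, [1, 1, 1]))
```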
Ring-based federated learning under the high-variance issue looks similar to continual learning under catastrophic forgetting (Parisi et al. 2019), but the former additionally involves partitioned data groups as well as data iteration along a ring. Considering these differences, we establish three principles to reduce the variance.

• Principle 1 (Ring-Aware Grouping): For architectures with group rings, nodes should be clustered so that the inter-node variance becomes low within a group. On the other hand, for architectures with a global ring, nodes should be IID-grouped so that the inter-group variance becomes low.

• Principle 2 (Small Ring): It is straightforward that the smaller a ring, the lower its iteration variance.

• Principle 3 (Ring Chaining): A ring can have multiple iteration chains (Ding et al. 2020), each of which iterates the same ring from a different starting node and thus learns an unbiased model different from the others. Multiple chains can reduce the learning variance, which is attributed to the reduced variance from the increased batch size.

Based on the above principles, we propose a novel ring-based algorithm called
TornadoAggregate and derive two heuristics according to the architecture type. We refer to TornadoAggregate with
RING-stars and
STAR-rings as Tornado and Tornadoes, respectively. In particular, for the ring-aware grouping principle, Tornado and Tornadoes require nodes to be IID-grouped and clustered, respectively; for the small ring principle, Tornado and Tornadoes require a small and a large number of groups, respectively; and for the ring chaining principle, both require a large number of chains.

As shown in Algorithm 1, Tornado takes the node set N and the number of groups |G|, chains C, epochs τ₁, and communication rounds τ₂ as input and returns the final model w_T as output. It begins by initializing all local models w^{k,i}, a randomly permuted inter-group ring [k_l], and the group indices {N_k} via the IID node grouping (Lines 1–3). Then, for each chain and each node, the local updates are performed (Lines 5–8); each group model is learned by aggregating all local models every τ₁ epochs (Lines 9–10) and is transferred to all nodes in the next group k_next every τ₁τ₂ steps (Lines 11–13). Overall, Lines 4–13 repeat for T steps. A condensed executable sketch of this loop is given after the algorithm. Because Tornado and Tornadoes are inherently correlated, we defer the description of Tornadoes to Appendix D.

Algorithm 1: Tornado (RING-stars)
Input: N, |G|, C, τ₁, τ₂    Output: w_T
1:  Initialize {w^{k,i}}_{i∈N} to a random model w
2:  Initialize a random ring [k_l ∈ G | l ∈ ℕ, k_{l+|G|} = k_l]
3:  {N_k}_{k∈G} ← GroupByIID(N)    // Algorithm 3
4:  for t ← 0, ..., T−1 do
5:      for each chain c ← 0, ..., C−1 in parallel do
6:          l ← ⌊t/(τ₁τ₂)⌋,  k ← (k_l + c) mod |G|
7:          for each node i ∈ N_k in parallel do
8:              w^{k,i}_{t+1} ← w^{k,i}_t − η∇F_{k,i}(w^{k,i}_t)
9:          if t mod τ₁ = 0 and t mod τ₁τ₂ ≠ 0 then
10:             {w^{k,i}_t}_{i∈N_k} ← Σ_{i∈N_k} (|D_{k,i}|/|D_k|) w^{k,i}_t
11:         if t mod τ₁τ₂ = 0 then
12:             k_next ← (k_{l+1} + c) mod |G|
13:             {w^{k_next,i}_t}_{i∈N_{k_next}} ← Σ_{i∈N_k} (|D_{k,i}|/|D_k|) w^{k,i}_t
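The following condensed sketch mirrors the control flow of Algorithm 1 so that the interplay of the group ring, chains, group aggregation, and inter-group transfer can be run end to end. Everything numeric is an assumption for illustration: the losses are toy quadratics, the aggregation uses uniform weights instead of |D_{k,i}|/|D_k|, and the even node split stands in for GroupByIID (Algorithm 3).

```python
import numpy as np

rng = np.random.default_rng(1)
num_groups, chains, tau1, tau2, T, eta, dim = 4, 2, 5, 3, 60, 0.1, 3
nodes_per_group = 5

# Even split as a stand-in for GroupByIID (Algorithm 3, Line 3).
groups = [list(range(g * nodes_per_group, (g + 1) * nodes_per_group)) for g in range(num_groups)]
centers = rng.normal(size=(num_groups * nodes_per_group, dim))   # per-node optima for toy losses
ring = list(rng.permutation(num_groups))                          # random inter-group ring [k_l]
models = {i: np.zeros(dim) for members in groups for i in members}

def grad(i, w):
    # Gradient of the toy loss F_{k,i}(w) = 0.5 * ||w - centers[i]||^2.
    return w - centers[i]

for t in range(1, T + 1):
    for c in range(chains):                                       # each chain shifts the active group
        l = t // (tau1 * tau2)
        k = (ring[l % num_groups] + c) % num_groups
        for i in groups[k]:                                        # local updates (Lines 5-8)
            models[i] = models[i] - eta * grad(i, models[i])
        if t % tau1 == 0 and t % (tau1 * tau2) != 0:               # group aggregation (Lines 9-10)
            w_k = np.mean([models[i] for i in groups[k]], axis=0)
            for i in groups[k]:
                models[i] = w_k.copy()
        if t % (tau1 * tau2) == 0:                                 # inter-group transfer (Lines 11-13)
            k_next = (ring[(l + 1) % num_groups] + c) % num_groups
            w_k = np.mean([models[i] for i in groups[k]], axis=0)
            for i in groups[k_next]:
                models[i] = w_k.copy()

print("model on node 0 after", T, "steps:", np.round(models[0], 3))
```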
Algorithm | Architecture | Group Type | Chains
FedAvg (McMahan et al.) | STAR | – | –
IFCA (Ghosh et al.) | stars | Cluster | –
HierFAVG (Liu et al.) | STAR-stars | Random | –
Astraea (Duan et al.) | STAR-rings | IID | 1
MM-PSGD (Ding et al.) | RING-stars | Cluster | 1
Tornado (Proposed) | RING-stars | IID | |G|
Tornadoes (Proposed) | STAR-rings | Cluster | |G|

Table 2: Comparison of algorithms.

Experimental Setting
Benchmark Datasets and Models
We used two official benchmark datasets and models provided by FedML; a generic sketch of building such a non-IID node split is given after this list for reference.

• FedShakespeare on RNN consists of 715 nodes with 16068 train and 2356 test examples. The RNN is the same as the one proposed by McMahan et al. (2017b).

• MNIST on logistic regression consists of 1000 nodes with 10 classes of 61664 train and 7371 test examples.
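The federated versions of these benchmarks distribute data unevenly across nodes. The snippet below is a generic illustration of constructing a label-skewed node split from any labeled dataset; it is not FedML's partitioner, and the split parameters are made up for the example.

```python
import numpy as np

def label_skewed_partition(labels, num_nodes, labels_per_node=2, seed=0):
    """Assign example indices to nodes so that each node sees only a few labels.

    This is a generic illustration of a non-IID split, not FedML's own partitioner.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    by_class = {c: list(rng.permutation(np.where(labels == c)[0])) for c in classes}
    partition = {}
    for node in range(num_nodes):
        chosen = rng.choice(classes, size=labels_per_node, replace=False)
        idx = []
        for c in chosen:
            take = max(1, len(by_class[c]) // num_nodes)
            idx.extend(by_class[c][:take])
            by_class[c] = by_class[c][take:] + by_class[c][:take]   # rotate so classes get reused
        partition[node] = idx
    return partition

fake_labels = np.repeat(np.arange(10), 100)            # 10 classes, 100 examples each
parts = label_skewed_partition(fake_labels, num_nodes=20)
print(len(parts), sorted(set(fake_labels[parts[0]])))   # node 0 sees only a couple of labels
```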
Algorithms
In Table 2, the proposed Tornado and Tornadoes are compared with five state-of-the-art algorithms in terms of group type and number of chains. The parameters used are summarized in Appendix E. We evaluate each algorithm five times and report the average with the standard deviation.
Results
Figure 2 shows the test accuracy for the FedShakespeare dataset. Tornadoes outperformed the state-of-the-art algorithms by up to . and the next best group (Tornado, FedAvg, and HierFAVG) by . on average. The low performance of Tornado relative to Tornadoes is attributed to the communication interval: Tornado (RING-stars) takes τ₁τ₂ steps for a global inter-group transfer in its RING, which is larger than the τ₁ steps that Tornadoes (STAR-rings) takes for a group inter-node transfer in each ring, thus causing higher divergence.

Figure 2: Test accuracy for FedShakespeare.

The poor performances of Astraea and MM-PSGD come from the high variance caused by their inappropriate node grouping and low chain utilization. Lastly, IFCA achieved the worst accuracy owing to the difficulty of clustering the FedShakespeare dataset, as explained by its small clustering cost reduction of only 4.3% (Table 3); in this case, the relationship between FedAvg and IFCA is consistent with Theorem 2.

Figure 3 shows the communication scalability with respect to the number of nodes, ranging from 50 to 450, for the MNIST dataset. For each case, we measured the communication data size in bytes required to reach the converged train accuracy of the case with nodes ( for FedAvg and for Tornadoes) and show the improvement relative to that case. Tornadoes achieved near-linear scalability, which is attributed to the superior communication scalability of STAR-rings.

Figure 3: Communication scalability with respect to node size (number of nodes vs. communication improvement, for Tornadoes and FedAvg).

In this paper, we provided a comprehensive survey of learning architectures in terms of accuracy and scalability. Our formal analysis led to the necessity of a ring-based architecture and its inherent variance reduction problem. To this end, we proposed a novel ring-based algorithm
TornadoAggregate that improves both scalability and accuracy by reducing the variance in a ring iteration. Experimental results show that, compared with the state-of-the-art algorithms, TornadoAggregate improved the test accuracy by up to . and achieved near-linear scalability. Overall, we believe that our novel ring-based algorithm has made important steps towards accurate and scalable federated learning.

References

Abad, M. S. H.; Ozfatura, E.; Gunduz, D.; and Ercetin, O. 2020. Hierarchical federated learning across heterogeneous cellular networks. In ICASSP 2020 - 2020 IEEE Int'l Conf. Acoustics, Speech and Signal Processing (ICASSP), 8866–8870. IEEE.
Bonawitz, K.; Eichner, H.; Grieskamp, W.; Huba, D.; Ingerman, A.; Ivanov, V.; Kiddon, C.; Konečný, J.; Mazzocchi, S.; McMahan, H. B.; et al. 2019. Towards federated learning at scale: System design. arXiv:1902.01046.
Briggs, C.; Fan, Z.; and Andras, P. 2020. Federated learning with hierarchical clustering of local updates to improve training on non-IID data. arXiv:2004.11791.
Caldas, S.; Konečný, J.; McMahan, H. B.; and Talwalkar, A. 2018. Expanding the reach of federated learning by reducing client resource requirements. arXiv:1812.07210.
Ding, Y.; Niu, C.; Yan, Y.; Zheng, Z.; Wu, F.; Chen, G.; Tang, S.; and Jia, R. 2020. Distributed optimization over block-cyclic data. arXiv:2002.07454.
Duan, M.; Liu, D.; Chen, X.; Liu, R.; Tan, Y.; and Liang, L. 2020. Self-balancing federated learning with global imbalanced data in mobile systems. IEEE Transactions on Parallel and Distributed Systems. arXiv:1904.10120.
Eichner, H.; Koren, T.; McMahan, H. B.; Srebro, N.; and Talwar, K. 2019. Semi-cyclic stochastic gradient descent. In Int'l Conf. Machine Learning. PMLR.
Ghosh, A.; Chung, J.; Yin, D.; and Ramchandran, K. 2020. An efficient framework for clustered federated learning. arXiv:2006.04088.
Ghosh, A.; Hong, J.; Yin, D.; and Ramchandran, K. 2019. Robust federated learning in a heterogeneous environment. arXiv:1906.06629.
He, C.; Avestimehr, S.; and Annavaram, M. 2020. Group knowledge transfer: Collaborative training of large CNNs on the edge. arXiv:2007.14513.
He, C.; Li, S.; So, J.; Zhang, M.; Wang, H.; Wang, X.; Vepakomma, P.; Singh, A.; Qiu, H.; Shen, L.; Zhao, P.; Kang, Y.; Liu, Y.; Raskar, R.; Yang, Q.; Annavaram, M.; and Avestimehr, S. 2020. FedML: A research library and benchmark for federated machine learning. arXiv:2007.13518.
Hegedűs, I.; Danner, G.; and Jelasity, M. 2019. Gossip learning as a decentralized alternative to federated learning. In IFIP Int'l Conf. Distributed Applications and Interoperable Systems, 74–90. Springer.
Jeong, E.; Oh, S.; Kim, H.; Park, J.; Bennis, M.; and Kim, S.-L. 2018. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data. arXiv:1811.11479.
Konečný, J.; McMahan, H. B.; Ramage, D.; and Richtárik, P. 2016a. Federated optimization: Distributed machine learning for on-device intelligence. arXiv:1610.02527.
Konečný, J.; McMahan, H. B.; Yu, F. X.; Richtárik, P.; Suresh, A. T.; and Bacon, D. 2016b. Federated learning: Strategies for improving communication efficiency. In NIPS 2016 Workshop on Private Multi-Party Machine Learning.
Li, D.; and Wang, J. 2019. FedMD: Heterogenous federated learning via model distillation. arXiv:1910.03581.
Li, X.; Huang, K.; Yang, W.; Wang, S.; and Zhang, Z. 2020. On the convergence of FedAvg on non-IID data. In Int'l Conf. Learning Representations.
Li, Y.; Yu, M.; Li, S.; Avestimehr, S.; Kim, N. S.; and Schwing, A. 2018. Pipe-SGD: A decentralized pipelined SGD framework for distributed deep net training. In Advances in Neural Information Processing Systems, 8045–8056.
Lin, T.; Stich, S. U.; Patel, K. K.; and Jaggi, M. 2018. Don't use large mini-batches, use local SGD. arXiv:1808.07217.
Liu, L.; Zhang, J.; Song, S.; and Letaief, K. B. 2020. Client-edge-cloud hierarchical federated learning. In ICC 2020 - 2020 IEEE Int'l Conf. Communications (ICC), 1–6. IEEE.
Luo, S.; Chen, X.; Wu, Q.; Zhou, Z.; and Yu, S. 2020. HFEL: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning. arXiv:2002.11343.
McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017a. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.
McMahan, H. B.; Moore, E.; Ramage, D.; Hampson, S.; et al. 2017b. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282.
Nishio, T.; and Yonetani, R. 2019. Client selection for federated learning with heterogeneous resources in mobile edge. In IEEE Int'l Conf. on Communications, 1–7.
Parisi, G. I.; Kemker, R.; Part, J. L.; Kanan, C.; and Wermter, S. 2019. Continual lifelong learning with neural networks: A review. Neural Networks 113: 54–71.
Sattler, F.; Müller, K.-R.; and Samek, W. 2020. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Transactions on Neural Networks and Learning Systems.
Sattler, F.; Wiedemann, S.; Müller, K.-R.; and Samek, W. 2019. Robust and communication-efficient federated learning from non-IID data. arXiv:1903.02891.
Shoham, N.; Avidor, T.; Keren, A.; Israel, N.; Benditkis, D.; Mor-Yosef, L.; and Zeitak, I. 2019. Overcoming forgetting in federated learning on non-IID data. arXiv:1910.07796.
Smith, V.; Chiang, C.-K.; Sanjabi, M.; and Talwalkar, A. S. 2017. Federated multi-task learning. In Advances in Neural Information Processing Systems (NeurIPS), 4424–4434.
So, J.; Guler, B.; and Avestimehr, A. S. 2020. Turbo-Aggregate: Breaking the quadratic aggregation barrier in secure federated learning. arXiv:2002.04156.
Wang, J.; Sahu, A. K.; Yang, Z.; Joshi, G.; and Kar, S. 2019a. MATCHA: Speeding up decentralized SGD via matching decomposition sampling. arXiv:1905.09435.
Wang, S.; Tuor, T.; Salonidis, T.; Leung, K. K.; Makaya, C.; He, T.; and Chan, K. 2019b. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications.
Yoon, J.; Jeong, W.; Lee, G.; Yang, E.; and Hwang, S. J. 2020. Federated continual learning with weighted inter-client transfer. In ICML 2020 Workshop on Lifelong Learning.
Yoshida, N.; Nishio, T.; Morikura, M.; Yamamoto, K.; and Yonetani, R. 2019. Hybrid-FL: Cooperative learning mechanism using non-IID data in wireless networks. arXiv:1905.07210.
Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; and Chandra, V. 2018. Federated learning with non-IID data. arXiv:1806.00582.
Zhu, H.; and Jin, Y. 2019. Multi-objective evolutionary federated learning. IEEE Trans. on Neural Networks and Learning Systems.

A Convergence Analysis
In this section, we analyze the convergence of STAR-stars and then extend the analysis to the rest of the architectures.
Convergence Analysis for STAR-stars
First of all, we make the following assumption on the loss function F_{k,i}, as in many other relevant studies (Liu et al. 2020; Wang et al. 2019b).

Assumption 1. For every i and k: (1) F_{k,i} is convex; (2) F_{k,i} is ρ-Lipschitz, i.e., ‖F_{k,i}(w) − F_{k,i}(w′)‖ ≤ ρ‖w − w′‖ for any w and w′; and (3) F_{k,i} is β-smooth, i.e., ‖∇F_{k,i}(w) − ∇F_{k,i}(w′)‖ ≤ β‖w − w′‖ for any w and w′.

Under this assumption, Lemma 1 holds for the group and global loss functions. F_k, the loss function of a node group, is additionally considered here, unlike in Wang et al. (2019b).

Lemma 1. F and F_k are convex, ρ-Lipschitz, and β-smooth.

Proof. It is straightforward from Assumption 1 and the definitions of F and F_k in Definition 3.

Figure 4: Illustration of the loss divergence and synchronization between w^k and v^k_{[r]} and between w and v_{[l]}.

We introduce two types of intervals depending on the learning level: a group interval, [r] ≜ [(r−1)τ₁, rτ₁], indicates an interval between two successive group aggregations, and a global interval, [l] ≜ [(l−1)τ₁τ₂, lτ₁τ₂], indicates an interval between two successive global aggregations.

Next, we introduce the notion of group-based virtual learning in Definition 9, where the training data is assumed to exist in a virtual central repository for each model. This notion is used to bridge the local-to-group divergence (i.e., the divergence between a local model and a group model) in a group interval and the group-to-global divergence (i.e., the divergence between a group model and the global model) in a global interval.

Definition 9 (Group-Based Virtual Learning). Given a certain group membership z, for any k, [r], and [l], the virtual group model v^k_{[r]} and the virtual global model v_{[l]} are updated by performing gradient descent steps on the centralized data examples of N_k and N, respectively, and are synchronized with the federated group model w^k and the global model w at the beginning of each interval, as in Eq. (5).

$$\mathbf{v}^k_{[r],t} \triangleq \begin{cases} \mathbf{w}^k_t & \text{if } t = (r-1)\tau_1 \\ \mathbf{v}^k_{[r],t-1} - \eta \nabla F_k(\mathbf{v}^k_{[r],t-1}) & \text{otherwise} \end{cases}
\qquad
\mathbf{v}_{[l],t} \triangleq \begin{cases} \mathbf{w}_t & \text{if } t = (l-1)\tau_1\tau_2 \\ \mathbf{v}_{[l],t-1} - \eta \nabla F(\mathbf{v}_{[l],t-1}) & \text{otherwise} \end{cases} \qquad (5)$$

To facilitate the interpretation, Figure 4 shows how a virtual model v is updated, following Definition 9. For example, v_{[l]} starts diverging from w after (l−1)τ₁τ₂ and is synchronized with w again at lτ₁τ₂.

Then, we formalize the group-based gradient divergence in Definition 10, which models the impact of the difference in data distributions across nodes on federated learning.

Definition 10 (Group-Based Gradient Divergence). Given a certain group membership z, for any i and k, δ_{k,i} is defined as the gradient difference between the i-th local loss and the k-th group loss, and Δ_k is defined as the gradient difference between the k-th group loss and the global loss, as expressed in Eq. (6).

$$\delta_{k,i} \triangleq \max_{\mathbf{w}} \|\nabla F_{k,i}(\mathbf{w}) - \nabla F_k(\mathbf{w})\|, \qquad \Delta_k \triangleq \max_{\mathbf{w}} \|\nabla F_k(\mathbf{w}) - \nabla F(\mathbf{w})\| \qquad (6)$$

Then, the local-to-group divergence δ and the group-to-global divergence Δ are formulated as Eq. (7).

$$\delta \triangleq \sum_{k \in \mathcal{G}} \sum_{i \in \mathcal{N}_k} \frac{|\mathcal{D}_{k,i}|}{|\mathcal{D}|} \delta_{k,i}, \qquad \Delta \triangleq \sum_{k \in \mathcal{G}} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \Delta_k \qquad (7)$$

Based on Definitions 9 and 10, we introduce an auxiliary lemma (Lemma 2).

Lemma 2. For any [r], [l], and t ∈ [(r−1)τ₁, rτ₁] ⊂ [(l−1)τ₁τ₂, lτ₁τ₂], an upper bound on the norm of the difference between a local model and the virtual global model can be expressed as Eq. (8).

$$\|\mathbf{w}^{k,i}_t - \mathbf{v}_{[l],t}\| \le \frac{\delta_{k,i}}{\beta}\left((\eta\beta + 1)^{t-(r-1)\tau_1} - 1\right) + \frac{\Delta_k}{\beta}\left((\eta\beta + 1)^{t-(l-1)\tau_1\tau_2} - 1\right) \qquad (8)$$

Proof. From the triangle inequality, one can simply derive Eq. (9).

$$\|\mathbf{w}^{k,i}_t - \mathbf{v}_{[l],t}\| = \|\mathbf{w}^{k,i}_t - \mathbf{v}^k_{[r],t} + \mathbf{v}^k_{[r],t} - \mathbf{v}_{[l],t}\| \le \|\mathbf{w}^{k,i}_t - \mathbf{v}^k_{[r],t}\| + \|\mathbf{v}^k_{[r],t} - \mathbf{v}_{[l],t}\| \qquad (9)$$

To conclude the proof, it thus suffices to show Eq. (10) and Eq. (11).

$$\|\mathbf{w}^{k,i}_t - \mathbf{v}^k_{[r],t}\| \le \frac{\delta_{k,i}}{\beta}\left((\eta\beta + 1)^{t-(r-1)\tau_1} - 1\right) \qquad (10)$$

$$\|\mathbf{v}^k_{[r],t} - \mathbf{v}_{[l],t}\| \le \frac{\Delta_k}{\beta}\left((\eta\beta + 1)^{t-(l-1)\tau_1\tau_2} - 1\right) \qquad (11)$$

Then, by putting Eq. (10) and Eq. (11) into Eq. (9), we confirm Lemma 2. Both Eq. (10) and Eq. (11) can be easily drawn from the β-smoothness of F_{k,i} and F_k. From Eq. (4) and Eq. (6), we can derive Eq. (12).

$$\begin{aligned} \|\mathbf{w}^{k,i}_t - \mathbf{v}^k_{[r],t}\| &= \|\mathbf{w}^{k,i}_{t-1} - \eta\nabla F_{k,i}(\mathbf{w}^{k,i}_{t-1}) - \mathbf{v}^k_{[r],t-1} + \eta\nabla F_k(\mathbf{v}^k_{[r],t-1})\| \\ &\le \|\mathbf{w}^{k,i}_{t-1} - \mathbf{v}^k_{[r],t-1}\| + \eta\|\nabla F_{k,i}(\mathbf{w}^{k,i}_{t-1}) - \nabla F_{k,i}(\mathbf{v}^k_{[r],t-1})\| + \eta\|\nabla F_{k,i}(\mathbf{v}^k_{[r],t-1}) - \nabla F_k(\mathbf{v}^k_{[r],t-1})\| \\ &\le (\eta\beta + 1)\|\mathbf{w}^{k,i}_{t-1} - \mathbf{v}^k_{[r],t-1}\| + \eta\delta_{k,i} \end{aligned} \qquad (12)$$

The last inequality stems from the β-smoothness of F_{k,i} and Definition 10. Then, since w^{k,i}_t = w^k_t = v^k_{[r],t} at every group aggregation by Eq. (4) and Eq. (5), Eq. (12) can be unrolled into Eq. (13).

$$\|\mathbf{w}^{k,i}_t - \mathbf{v}^k_{[r],t}\| \le \eta\delta_{k,i}\sum_{y=1}^{t-(r-1)\tau_1}(\eta\beta + 1)^{y-1} = \frac{\delta_{k,i}}{\beta}\left((\eta\beta + 1)^{t-(r-1)\tau_1} - 1\right) \qquad (13)$$

Analogously, one can derive Eq. (11). This concludes the proof of Lemma 2.

From Lemma 2 and Jensen's inequality, for all t, we have Eq. (14).

$$\|\mathbf{w}_t - \mathbf{v}_{[l],t}\| \le \sum_{k \in \mathcal{G}}\sum_{i \in \mathcal{N}_k}\frac{|\mathcal{D}_{k,i}|}{|\mathcal{D}|}\|\mathbf{w}^{k,i}_t - \mathbf{v}_{[l],t}\| \le \frac{\delta}{\beta}\left((\eta\beta + 1)^{\tau_1} - 1\right) + \frac{\Delta}{\beta}\left((\eta\beta + 1)^{\tau_1\tau_2} - 1\right) \qquad (14)$$

Finally, for STAR-stars, since F is ρ-Lipschitz by Lemma 1, Eq. (14) yields the convergence bound between the federated global model and the virtual global model, stated as Theorem 4.
Theorem 4 (Convergence Bound of STAR-stars). For any global interval [l] and t ∈ [l], if F_{k,i} is β-smooth for every i and k in Eq. (6), then Eq. (15) holds.

$$F(\mathbf{w}_t) - F(\mathbf{v}_{[l],t}) \le \frac{\rho}{\beta}\left(\delta h(\tau_1) + \Delta h(\tau_1\tau_2)\right) \quad \text{where} \quad h(t) \triangleq (\eta\beta + 1)^t - 1 \qquad (15)$$

Convergence Analysis for the Other Architectures
In this section, we analyze the convergence bounds of the flat architectures (STAR and RING), the consensus group architectures (STAR-rings, RING-stars, and RING-rings), and the pluralistic group architectures (stars and rings). First, the convergence bound of STAR is the same as that of STAR-stars with no groups (|G| = 1) and no group communication (τ₂ = 1), as in Eq. (16). By excluding the notion of groups, the group-to-global divergence Δ becomes 0, and the local-to-group divergence δ becomes the local-to-global divergence D, which extends Eq. (6) and Eq. (7).

$$F(\mathbf{w}_t) - F(\mathbf{v}_{[l],t}) \le \frac{\rho}{\beta} D h(\tau_1) \quad \text{where} \quad D \triangleq \sum_{i \in \mathcal{N}} \frac{|\mathcal{D}_i|}{|\mathcal{D}|} D_i \;\; \text{and} \;\; D_i \triangleq \max_{\mathbf{w}} \|\nabla F_i(\mathbf{w}) - \nabla F(\mathbf{w})\| \qquad (16)$$

Next, the convergence bound of stars is the same as that of multiple independent STAR groups, where the total node size of stars is the sum of the nodes of each STAR group. Thus, by regarding the local-to-global divergence D from Eq. (16) of each STAR group as the local-to-group divergence δ_k of each group in stars, we have Eq. (17) for stars.

$$F(\mathbf{w}_t) - F(\mathbf{v}_{[l],t}) \le \frac{\rho}{\beta} \delta h(\tau_1) \qquad (17)$$

Next, for STAR-rings, similar to Theorem 1, each ring-based learning in a group is an unbiased estimator of the centralized learning within the group because of Eq. (18).

$$\mathbb{E}[\nabla F_{k,i}(\mathbf{w}_t)] = \sum_{i \in \mathcal{N}_k} \frac{|\mathcal{D}_{k,i}|}{|\mathcal{D}_k|} \nabla F_{k,i}(\mathbf{w}_t) = \nabla F_k(\mathbf{w}_t) \qquad (18)$$

Then, by approximating ∇F_{k,i}(w_t) of Eq. (6) by E[∇F_{k,i}(w_t)] of Eq. (18), the local-to-group divergence δ becomes 0, and thus the group-to-global divergence Δ becomes the local-to-global divergence D. Hence, Eq. (15) extends to Eq. (19) for STAR-rings.

$$F(\mathbf{w}_t) - F(\mathbf{v}_{[l],t}) \le \frac{\rho}{\beta} D h(\tau_1\tau_2) \qquad (19)$$

Next, for RING-stars, similar to Theorem 1, the global RING-based learning is an unbiased estimator of the globally centralized learning in consideration of each stars-based learning in a group because of Eq. (20).

$$\mathbb{E}[\nabla F_k(\mathbf{w}_t)] = \sum_{k \in \mathcal{G}} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} \nabla F_k(\mathbf{w}_t) = \nabla F(\mathbf{w}_t) \qquad (20)$$

Then, by approximating ∇F_k(w_t) of Eq. (6) by E[∇F_k(w_t)] of Eq. (20), the group-to-global divergence Δ becomes 0, and thus the local-to-group divergence δ becomes the local-to-global divergence D. Hence, Eq. (15) extends to Eq. (21) for RING-stars.

$$F(\mathbf{w}_t) - F(\mathbf{v}_{[l],t}) \le \frac{\rho}{\beta} D h(\tau_1) \qquad (21)$$

Analogously, from Theorem 1, RING, RING-rings, and rings can easily be shown to have an approximate convergence bound of 0, since both divergences vanish under the approximation.

B Communication Scalability
In this section, we provide an approach to measure the communication scalability. For the analysis, M denotes the model size, and given the total number of learning steps T, τ_f ≜ T/τ₁ is the number of global communications of the flat architectures, τ_c ≜ T/(τ₁τ₂) is the number of global communications of the consensus group architectures, and τ_p ≜ T/τ₁ is the number of group communications of the pluralistic group architectures.

STAR sends M|N|τ_f of total global aggregation data. RING sends Mτ_f of total global inter-node transfer data because only one node is active owing to the diurnal property. STAR-stars sends M|N|(τ₂ − 1)τ_c of total group aggregation data and M|N|τ_c of total global aggregation data, thus M|N|τ₂τ_c of total communication data, which is the same cost as STAR in the case of τ_f = τ₂τ_c, as suggested by Liu et al. (2020). stars sends M|N|τ_p of total group aggregation data, which is the same cost as STAR because τ_p = τ_f by definition. Analogously, one can derive the total communication data size for the rest of the architectures, as summarized in Table 1.

C Deferred Proofs
Theorem 1.
RING is an unbiased estimator of the centralized learning that learns a centralized model by assuming the federated datasets to be located at a centralized storage.
Proof.
The centralized learning is defined as Eq. (22).

$$\mathbf{w}_t = \mathbf{w}_{t-1} - \eta \nabla F(\mathbf{w}_{t-1}) \qquad (22)$$

Next, for the RING update, we regard the model and data communication relationship as the opposite; that is, instead of RING transferring a local model from one node to another while the data stay in place, RING is redefined as switching the data from one node to another while the local model stays on one fixed node. Thus, at the time t of the data transfer from node i to that node, Eq. (3) changes to Eq. (23). Note that the index of the fixed node is omitted because it does not need to be distinguished from the others.

$$\mathbf{w}_t = \mathbf{w}_{t-1} - \eta \nabla F_i(\mathbf{w}_{t-1}) \qquad (23)$$

Thus, Eq. (23) equals the centralized learning in expectation because of Eq. (24).

$$\mathbb{E}[\nabla F_i(\mathbf{w}_{t-1})] = \sum_{i \in \mathcal{N}} \frac{|\mathcal{D}_i|}{|\mathcal{D}|} \nabla F_i(\mathbf{w}_{t-1}) = \nabla F(\mathbf{w}_{t-1}) \qquad (24)$$

Theorem 2. The convergence bound O(δh(τ₁)) of stars is not necessarily better than O(Dh(τ₁)) of STAR.

Proof. From Eq. (6) and the triangle inequality, we have Eq. (25).

$$\begin{aligned} \delta_{k,i} &= \max_{\mathbf{w}} \|\nabla F_{k,i}(\mathbf{w}) - \nabla F_k(\mathbf{w})\| = \max_{\mathbf{w}} \|\nabla F_{k,i}(\mathbf{w}) - \nabla F(\mathbf{w}) + \nabla F(\mathbf{w}) - \nabla F_k(\mathbf{w})\| \\ &\le \max_{\mathbf{w}} \left[\|\nabla F_{k,i}(\mathbf{w}) - \nabla F(\mathbf{w})\| + \|\nabla F(\mathbf{w}) - \nabla F_k(\mathbf{w})\|\right] \end{aligned} \qquad (25)$$

By summing Eq. (25) over all i and k and considering Eq. (7) and Eq. (16), we have Eq. (26).

$$\delta \le \Delta + D \qquad (26)$$

Similarly, from Eq. (16) and the triangle inequality, we have Eq. (27).

$$\begin{aligned} D_i &= \max_{\mathbf{w}} \|\nabla F_i(\mathbf{w}) - \nabla F(\mathbf{w})\| = \max_{\mathbf{w}} \|\nabla F_i(\mathbf{w}) - \nabla F_k(\mathbf{w}) + \nabla F_k(\mathbf{w}) - \nabla F(\mathbf{w})\| \\ &\le \max_{\mathbf{w}} \left[\|\nabla F_i(\mathbf{w}) - \nabla F_k(\mathbf{w})\| + \|\nabla F_k(\mathbf{w}) - \nabla F(\mathbf{w})\|\right] \end{aligned} \qquad (27)$$

By summing Eq. (27) over all i and considering Eq. (7), we have Eq. (28).

$$D \le \delta + \Delta \qquad (28)$$

Lastly, based on Eq. (26) and Eq. (28), we can infer that in the worst case δ equals Δ + D, which is larger than D; in this case, STAR achieves a lower convergence bound than stars.

Theorem 3.
RING exhibits higher variance than the centralized learning (the unbiased estimator of RING from Theorem 1).

Algorithm 2: Tornadoes (STAR-rings)
Input: N, |G|, C, τ₁, τ₂    Output: w_T
1:  Initialize {w^{k,i}}_{i∈N} to a random model w
2:  Initialize a random ring [i_j ∈ N | j ∈ ℕ, i_{j+|N|} = i_j]
3:  {N_k}_{k∈G} ← Cluster(N)    // Algorithm 3
4:  for t ← 0, ..., T−1 do
5:      for each k ∈ G in parallel do
6:          for each chain c ← 0, ..., C−1 in parallel do
7:              j ← ⌊t/τ₁⌋,  i ← (i_j + c) mod |N_k|
8:              w^{k,i}_{t+1} ← w^{k,i}_t − η∇F_{k,i}(w^{k,i}_t)
9:              if t mod τ₁ = 0 and t mod τ₁τ₂ ≠ 0 then
10:                 i_next ← (i_{j+1} + c) mod |N_k|
11:                 w^{k,i_next}_t ← w^{k,i}_t
12:     if t mod τ₁τ₂ = 0 then
13:         {w^{k,i}_t}_{k∈G, i∈N_k} ← Σ_{k∈G} Σ_{i∈N_k} (|D_{k,i}|/|D|) w^{k,i}_t

Proof.
First, from the β-smoothness of F and Eq. (22), we have Eq. (29) for the centralized learning.

$$F(\mathbf{w}_t) = F(\mathbf{w}_{t-1} - \eta\nabla F(\mathbf{w}_{t-1})) \le F(\mathbf{w}_{t-1}) - \eta\left(1 - \frac{\eta\beta}{2}\right)\|\nabla F(\mathbf{w}_{t-1})\|^2 \qquad (29)$$

Next, from the β-smoothness of F and the definition of the RING update in Eq. (23), we have Eq. (30) for RING.

$$F(\mathbf{w}_t) = F(\mathbf{w}_{t-1} - \eta\nabla F_i(\mathbf{w}_{t-1})) \le F(\mathbf{w}_{t-1}) - \eta\nabla F(\mathbf{w}_{t-1})^\top \nabla F_i(\mathbf{w}_{t-1}) + \frac{\eta^2\beta}{2}\|\nabla F_i(\mathbf{w}_{t-1})\|^2 \qquad (30)$$

In expectation with regard to i, we have Eq. (31).

$$\begin{aligned} \mathbb{E}_i\,F(\mathbf{w}_t) &\le F(\mathbf{w}_{t-1}) - \eta\|\nabla F(\mathbf{w}_{t-1})\|^2 + \frac{\eta^2\beta}{2}\,\mathbb{E}_i\|\nabla F_i(\mathbf{w}_{t-1})\|^2 \\ &\le F(\mathbf{w}_{t-1}) - \eta\left(1 - \frac{\eta\beta}{2}\right)\|\nabla F(\mathbf{w}_{t-1})\|^2 + \frac{\eta^2\beta}{2}\left[\mathbb{E}_i\|\nabla F_i(\mathbf{w}_{t-1})\|^2 - \|\mathbb{E}_i\nabla F_i(\mathbf{w}_{t-1})\|^2\right] \end{aligned} \qquad (31)$$

Here, E_i‖∇F_i(w_{t−1})‖² − ‖E_i∇F_i(w_{t−1})‖² is the learning variance of RING, which is the term added relative to Eq. (29). Analogously, one can show similar variances for STAR-rings, RING-stars, RING-rings, and rings.

D TornadoAggregate Details
Algorithm 2 shows the overall procedure of Tornadoes, which takes the node set N, the number of groups |G|, chains C, epochs τ₁, and communication rounds τ₂ as input and returns the final model w_T as output. It begins by initializing all local models w^{k,i}, a randomly permuted inter-node ring [i_j], and the group indices {N_k} by clustering the nodes (Lines 1–3). Then, for each group and each chain of the group, the local updates are performed at node i (Lines 5–8); every τ₁ epochs, each local model is transferred to the next node i_next within the same group k (Lines 9–11); every τ₁τ₂ steps, the global model is learned by aggregating all local models and is then broadcast back to all nodes (Lines 12–13). Overall, Lines 4–13 repeat for T steps.

We derive another heuristic of TornadoAggregate with the rings architecture, called Tornado-rings, which is the same as Tornadoes without the global aggregation so as to develop an independent and specialized model for each group; i.e., Lines 12–13 of Algorithm 2 are not executed for Tornado-rings. We note that, for stars and rings, the test performance is measured with the group model of each independent group.

Algorithm 3 shows the two grouping schemes, GroupByIID and Cluster. Both functions define their own association cost Cost_A and update cost Cost_U and, in turn, call the Group function with the defined costs. The costs are based on the EMD (earth mover's distance), which approximately models the learning divergences, as proposed by Zhao et al. (2018), and can be expressed as Eq. (32). In the GroupByIID function (Lines 1–4), a group data distribution D_k is compared with the global dataset D to improve the group-to-global divergence Δ of Eq. (6), whereas in the Cluster function (Lines 6–9), a local data distribution D_{k,i} is compared with a group data distribution D_k to improve the local-to-group divergence δ of Eq. (6). It should be noted that, for the Cost_U of the GroupByIID function, we had no choice but to use D_{k,i} instead of D_k, because a cost related to a single node must be returned to determine a new medoid node.

$$\mathrm{EMD}(\mathcal{D}, \mathcal{D}') \triangleq \sum_{\forall \text{class}} \left| P(y_j = \text{class} \mid j \in \mathcal{D}) - P(y_j = \text{class} \mid j \in \mathcal{D}') \right| \qquad (32)$$

Algorithm 3: Grouping Scheme
1:  function GroupByIID(N):
2:      Cost_A(i, k) ≜ EMD(D_k, D)
3:      Cost_U(i, k) ≜ EMD(D_{k,i}, D)
4:      return Group(N, Cost_A, Cost_U)
5:
6:  function Cluster(N):
7:      Cost_A(i, k) ≜ EMD(D_{k,i}, D_k)
8:      Cost_U(i, k) ≜ EMD(D_{k,i}, D_k)
9:      return Group(N, Cost_A, Cost_U)
10:
11: function Group(N, Cost_A, Cost_U):
12:     Select random medoid nodes N_m of size |G|
13:     z ← [argmin_{k∈G} Cost_A(i, k) | ∀i ∈ N]
14:     while the last Cost_A is not steady do
15:         N_m ← [argmin_{i∈N_k} Cost_U(i, k) | ∀k ∈ G]
16:         z ← [argmin_{k∈G} Cost_A(i, k) | ∀i ∈ N]
17:     return {{i | (i, k) ∈ z, k = k′} | k′ ∈ G}
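Algorithm 3's Group routine is, in effect, a k-medoids-style loop driven by the EMD of Eq. (32) over per-node label distributions. The code below is one possible rendering of the Cluster-style grouping under simplifying assumptions (label histograms as input, group means in place of exact group distributions); it is not the authors' implementation.

```python
import numpy as np

def emd(p, q):
    # EMD of Eq. (32) between two label distributions (L1 distance of class probabilities).
    return float(np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)).sum())

def cluster_nodes(node_dists, num_groups, max_iter=20, seed=0):
    # Cluster-style grouping: assign each node to the medoid with the closest label distribution.
    rng = np.random.default_rng(seed)
    node_dists = np.asarray(node_dists, dtype=float)
    medoids = list(rng.choice(len(node_dists), size=num_groups, replace=False))
    assign = None
    for _ in range(max_iter):
        assign = [int(np.argmin([emd(d, node_dists[m]) for m in medoids])) for d in node_dists]
        new_medoids = []
        for k in range(num_groups):
            members = [i for i, a in enumerate(assign) if a == k]
            if not members:                        # keep the old medoid if a group empties out
                new_medoids.append(medoids[k])
                continue
            group_mean = node_dists[members].mean(axis=0)
            new_medoids.append(members[int(np.argmin([emd(node_dists[i], group_mean) for i in members]))])
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign

# Toy label distributions over 3 classes for 6 nodes: two obvious clusters.
dists = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05],
         [0.1, 0.1, 0.8], [0.05, 0.15, 0.8], [0.2, 0.1, 0.7]]
print(cluster_nodes(dists, num_groups=2))
```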
Dataset | Initial Cost | Final Cost
FedShakespeare | 0.391 | 0.375 (4.3% reduced)
MNIST | 0.728 | 0.474 (53.6% reduced)

Table 3: Reduction of the clustering cost in Algorithm 3.

The Group function aims at finding the subsets of node indices for all groups, {N_k}_{k=1...|G|}, such that the defined costs are reduced to the extent possible. For this purpose, it begins by selecting random medoid nodes N_m of size |G| (Line 12). Then, it iteratively updates z by minimizing Cost_A for all nodes and Cost_U for all groups until the cost is steady (Lines 14–16).

E Supplementary Evaluation
Experimental Setting
Configuration
We used FedML (He et al. 2020), one of the most widely used simulation frameworks for federated learning, on PyTorch 1.6.0 to extensively evaluate the performance of the various datasets, models, and algorithms.
Parameters
The parameters for both the FedShakespeare-on-RNN and MNIST-on-logistic-regression benchmarks followed those suggested by FedML. The benchmarks used the SGD (stochastic gradient descent) optimizer with a learning rate of 0.03. In addition, we randomly sampled 100 nodes for both the train and test phases, out of 715 nodes for FedShakespeare and 1000 nodes for MNIST.

Table 4 shows the parameters used for each algorithm. In particular, for the group size, we applied the aforementioned small ring principle to all algorithms such that the group size of an algorithm with IID node grouping, random grouping, and node clustering is set to 2, 5, and 10, respectively, where 10 is considered a reasonably large value for the group size. For the number of chains, we applied the ring chaining principle to the proposed TornadoAggregate heuristics such that the number of chains is set to the number of groups, which is the maximum value by definition. For the communication interval, we first determined the product of τ₁ and τ₂ of HierFAVG to be equal to τ₁ of FedAvg so that HierFAVG can improve accuracy while sacrificing little communication cost, as suggested by Liu et al. (2020), and then we set the same parameters as HierFAVG for the rest of the algorithms.

Additional Results
Figure 5 shows the train loss and accuracy of the nine algorithms on the FedShakespeare dataset. Even though HierFAVG seemingly outran the others, comparing with the test accuracy of HierFAVG in Figure 2, we can infer that it overfit the training dataset. Similar to the aforementioned results of IFCA, Tornado-rings performed poorly because of the small reduction of the clustering cost, defined in Algorithm 3, for the FedShakespeare dataset, as shown in Table 3. The low accuracy of the SemiCyclic algorithm can be attributed to its low data utilization with a small number of active nodes, which is also pointed out by Ding et al. (2020).

Figure 6 shows the train loss and accuracy of all algorithms on the MNIST dataset. Interestingly, in contrast to the results for the FedShakespeare dataset, Tornado-rings significantly outperformed the others except IFCA, which it closely follows. The reason why the worst performers became the best performers can also be attributed to the large reduction of the clustering cost, as shown in Table 3. To strike a balance between the two extremes, we leave Tornado-rings as future work. Aside from Tornado-rings and IFCA, Tornadoes outperformed the others, and the rest of the algorithms exhibited a performance trend similar to that on FedShakespeare.
F Future Directions
We consider the following lines of work orthogonal to ours; they can thus be easily combined with TornadoAggregate.

• Communication Reduction Techniques: This category includes quantization (Konečný et al. 2016a,b), compression (Sattler et al. 2019), and dropout (Caldas et al. 2018).

• Communication-Aware Learning: This category includes adaptive communication intervals (Wang et al. 2019b), communication-constrained learning (Nishio and Yonetani 2019; Yoshida et al. 2019), and multi-objective optimization of learning error and communication (Zhu and Jin 2019).

• Global-Information Sharing: This category includes sharing a subset of globally IID data samples (Zhao et al. 2018; Yoshida et al. 2019), sharing a subset of global data features to scale up the feature-related parameters of a local optimizer (Konečný et al. 2016a), and sharing a generative model that can produce an augmented IID dataset (Smith et al. 2017; Jeong et al. 2018).

On the other hand, we aim to improve TornadoAggregate in the following directions.

• Client Sampling: Client sampling techniques (McMahan et al. 2017b; Sahu et al. 2018; Li et al. 2020) introduce a variance aspect different from those handled in this study.

• Peer-to-Peer Learning: Other than the star and ring architectures, peer-to-peer (P2P) federated learning (Wang et al. 2019a; Hegedűs, Danner, and Jelasity 2019) should also be considered.

• Transfer Learning: In all synchronizations, model parameters are set to previously learned model parameters, but we can also consider transferring parameters between different types of models (Jeong et al. 2018; Li and Wang 2019; He, Avestimehr, and Annavaram 2020).

• Continual Learning: We can extend TornadoAggregate with techniques from the field of continual learning, such as weight decomposition (Yoon et al. 2020) or loss regularization (Shoham et al. 2019).

• Algorithm Optimization: Other than the EMD of Eq. (32), IIDness can also be quantified by the loss divergence (Li et al. 2020), the gradient divergence (Wang et al. 2019b), or the weight divergence (Zhao et al. 2018).
Hierarchy | Algorithm | Architecture | Grouping Scheme | Group Size | Chains | Communication Interval
Flat | FedAvg (McMahan et al. 2017b) | STAR | – | 1 | – | τ₁ = 100
Consensus Group | HierFAVG (Liu et al. 2020) | STAR-stars | Random | 5 | – | τ₁ = 10, τ₂ = 10
Consensus Group | Astraea (Duan et al. 2020) | STAR-rings | IID | 2 | 1 | τ₁ = 10, τ₂ = 10
Consensus Group | MM-PSGD (Ding et al. 2020) | RING-stars | Cluster | 10 | 1 | τ₁ = 10, τ₂ = 10
Consensus Group | Tornado (Proposed) | RING-stars | IID | 2 | 2 | τ₁ = 10, τ₂ = 10
Consensus Group | Tornadoes (Proposed) | STAR-rings | Cluster | 10 | 10 | τ₁ = 10, τ₂ = 10
Pluralistic Group | IFCA (Ghosh et al. 2020) | stars | Cluster | 10 | – | τ₁ = 100
Pluralistic Group | SemiCyclic (Eichner et al. 2019) | rings | Random | 5 | 1 | τ₁ = 100
Pluralistic Group | Tornado-rings (Proposed) | rings | Cluster | 10 | 10 | τ₁ = 100

Table 4: Algorithm parameters.

Figure 5: FedShakespeare. (a) Train Loss. (b) Train Accuracy.

Figure 6: