Accurate and Fast Federated Learning via Combinatorial Multi-Armed Bandits
Taehyeon Kim* KAIST [email protected]
Sangmin Bae* KAIST [email protected]
Jin-woo Lee KAIST [email protected]
Seyoung Yun KAIST [email protected]
Abstract
Federated learning has emerged as an innovative paradigm of collaborative machine learning. Unlike conventional machine learning, a global model is collaboratively learned while data remains distributed over a tremendous number of client devices, thus not compromising user privacy. However, several challenges still remain despite its growing popularity; above all, the global aggregation in federated learning involves the challenges of biased model averaging and lack of prior knowledge in client sampling, which, in turn, lead to high generalization error and slow convergence rate, respectively. In this work, we propose a novel algorithm called
FedCM that addresses the two challenges by utilizing prior knowledge with multi-armed bandit based client sampling and filtering biased models with combinatorial model averaging. Based on extensive evaluations using various algorithms and representative heterogeneous datasets, we showed that FedCM significantly outperformed the state-of-the-art algorithms by up to . % and . times, respectively, in terms of generalization accuracy and convergence rate.

1 Introduction

Federated learning (FL) [1, 2] enables mobile devices to collaboratively learn a shared model while keeping all training data on the devices, thus avoiding transferring data to the cloud or central server. One of the main reasons for the recent boom in FL is that it does not compromise user privacy. In this framework, a local model is updated via its private data on the corresponding local device; all local updates are aggregated into the global model; after which the procedure is repeated until convergence. In particular, the canonical global aggregation involves sampling clients as well as averaging the models of the sampled clients [1]. Even though new client sampling and model averaging schemes are frequently proposed [3, 4, 5], less attention has been paid to the inherent dynamics of how they influence the global aggregation. To this end, we identified two challenges that arise when the conventional algorithms are used, as follows:
• Biased Model Averaging: For the conventional model averaging schemes [1, 3, 4], we identified that the generalization error in the non-IID (i.e., not independent and identically distributed) setting was not only higher, but also more variable than in the IID setting, because the existing schemes did not filter out biased models.
• Lack of Prior Knowledge in Client Sampling: For the conventional client sampling schemes [1, 3, 4], we observed that the existing schemes led to bad local optima [6, 7], and even the convergence speed was inevitably slow, since they did not take prior knowledge into account in the client sampling process at all.

* Equally contributed. Preprint. Under review.

Table 1: Comparison of algorithms. The + and * signs denote sampling schemes with and without replacement, respectively. Note that the proposed FedCA and FedCM can be easily extended to any other client sampling and model averaging schemes.

Algorithm      Model Filtering  Prior Knowledge  Sampling Scheme for S^t  Averaging Scheme for w^t
FedAvg [1]     X                X                Uniform*                 Σ_{k∉S^t} p_k w^t + Σ_{k∈S^t} p_k w_k^t
FedProx [3]    X                X                p_k+                     (1/|S^t|) Σ_{k∈S^t} w_k^t
FedPdp [4]     X                X                Uniform*                 Σ_{k∈S^t} p_k (|S|/|S^t|) w_k^t
FedCA (ours)   O                X                Uniform*                 Σ_{k∈S^t_opt} p_k (|S|/|S^t_opt|) w_k^t
FedCM (ours)   O                O                Bandit*                  Σ_{k∈S^t_opt} p_k (|S|/|S^t_opt|) w_k^t

To the best of our knowledge, no existing work has addressed both of the above challenges simultaneously, as shown by Table 1, which compares the algorithms from the perspective of model filtering and prior knowledge. To this end, we propose a novel algorithm called FedCM (Federated learning with
Combinatorial model averaging and
Multi-armed bandit (MAB) based client sampling) that resolves both challenges. With combinatorial model averaging, we aim to filter biased models in consideration of the model combination that maximizes a validation score, consequently reducing generalization error. In addition, with
MAB based client sampling, we utilize prior knowledge that models previous client sampling behavior by using a MAB based sampling scheme. The increased information can, in turn, lead to improved convergence performance. Overall, the key contributions are summarized as follows:
• Problem Formulation (Section 2): We formulate the problem as a novel system-level framework of FL with knowledgeable sampling and filtered averaging that serves as a baseline template for any extension with custom prior knowledge or a custom model filter.
• Combinatorial Model Averaging (Section 3): We design a novel algorithm called FedCA to resolve the challenge of biased model averaging. We confirmed that FedCA outperformed the state-of-the-art algorithms by up to . % in terms of generalization accuracy.
• MAB based Client Sampling (Section 4): We finally propose
FedCM to resolve both challenges. Then, we extensively compared FedCM with various client sampling algorithms on representative heterogeneous datasets. FedCM reached a higher test accuracy by up to . % as well as a faster convergence rate by up to . times.

2 Problem Formulation

The objective of federated learning [1] is to solve the stochastic convex optimization problem:

    min_w f(w) ≜ Σ_{k∈S} p_k F_k(w)    (1)

where S is the set of total clients and p_k is the weight of client k, such that p_k ≥ 0 and Σ_k p_k = 1. The local objective of client k is to minimize F_k(w) = E_{(x_k, y_k)∼D_k}[ℓ_k(x_k, y_k; w)], parameterized by w, on the local data (x_k, y_k) from the local data distribution D_k.

FederatedAveraging (FedAvg) [1], the canonical algorithm for FL, involves local update, which learns a local model w_k^t (Eq. (2)) with learning rate η and synchronizes w_k^t with w^t every E steps,

    w_k^t ≜ { w_k^{t−1} − η ∇F_k(w_k^{t−1})   if t mod E ≠ 0
            { w^t                             if t mod E = 0    (2)

The algorithm proposed by Li et al. [4] is referred to as FedPdp (FedAvg with partial device participation).

Algorithm 1: Generic framework of FL with knowledgeable sampling and filtered averaging
  INPUT: S, SamplingRatio, PriorKnowledge, ModelFilter, η
  OUTPUT: w^T
  Initialize w^0 randomly
  for t ← 0, ..., T−1 do
      S^t ← SampleClients(S, SamplingRatio, PriorKnowledge)
      for each client k ∈ S^t in parallel do
          w_k^{t,0} ← w^t
          for e ← 0, ..., E−1 do
              w_k^{t,e+1} ← w_k^{t,e} − η ∇F_k(w_k^{t,e})   // Eq. (2)
      w^{t+1} ← AverageModels(S^t, ModelFilter)

and global aggregation, which learns the global model w^t (Eq. (3)) by averaging all w_k^t over the clients k ∈ S^t uniformly sampled at random, subject to |S^t| = SamplingRatio × |S|.
    w^t ≜ Σ_{k∉S^t} p_k w^t + Σ_{k∈S^t} p_k w_k^t    (3)

In a similar vein, recent studies [3, 4] proposed different client sampling and model averaging schemes. Table 1 summarizes them and shows that prior knowledge and a model filter may resolve the challenges. Therefore, by augmenting the client sampling and model averaging schemes with prior knowledge and a model filter, respectively, we can derive a generic system-level framework of FL with knowledgeable sampling and filtered model averaging, as shown in Algorithm 1.
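As a concrete illustration, one round of the generic framework in Algorithm 1 can be sketched in Python. This is a minimal sketch with hypothetical helper names (`sample_clients`, `local_update`, `average_models`); the paper leaves SampleClients and AverageModels abstract, so uniform sampling and plain weighted averaging stand in as placeholders here, and models are represented as lists of floats.

```python
import random

def sample_clients(clients, sampling_ratio, prior_knowledge=None):
    """Uniform sampling without replacement; PriorKnowledge is unused here."""
    m = max(1, int(sampling_ratio * len(clients)))
    return random.sample(clients, m)

def local_update(w, grad_fn, eta, E):
    """E local SGD steps starting from the current global model w (Eq. (2))."""
    w_k = list(w)
    for _ in range(E):
        g = grad_fn(w_k)
        w_k = [wi - eta * gi for wi, gi in zip(w_k, g)]
    return w_k

def average_models(local_models, weights, model_filter=None):
    """Weighted average of the (optionally filtered) local models."""
    if model_filter is not None:
        local_models, weights = model_filter(local_models, weights)
    total = sum(weights)
    dim = len(local_models[0])
    return [sum(p * w[d] for p, w in zip(weights, local_models)) / total
            for d in range(dim)]

def federated_round(w, clients, grad_fns, p, sampling_ratio, eta, E):
    """One round of Algorithm 1: sample, local update in parallel, aggregate."""
    sampled = sample_clients(clients, sampling_ratio)
    local = [local_update(w, grad_fns[k], eta, E) for k in sampled]
    return average_models(local, [p[k] for k in sampled])
```

Plugging a bandit-based sampler into `sample_clients` or a combinatorial filter into `average_models` recovers the extensions discussed in Sections 3 and 4.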
3 Combinatorial Model Averaging

In this section, we propose a novel algorithm called FedCA (Federated learning with Combinatorial Averaging) and systematically evaluate it against various algorithms on representative heterogeneous datasets.
Proposed Algorithm: FedCA. To resolve the aforementioned challenge of biased model averaging, we propose FedCA by extending ModelFilter and AverageModels of Algorithm 1 as follows. First, for ModelFilter, we propose a combinatorial model filter that filters out biased models by considering the model combination that maximizes a validation score, which can be expressed as

    S^t_opt ≜ argmax_{S' ⊂ S^t} u(X_val, g̃(x; S'))   where   g̃(x; S) = (1/|S|) Σ_{s∈S} g(x; w_s)    (4)

where X_val is a validation dataset, g(x; w) is the logit from a network parameterized by w, and u(X, g) is a score function of g on X. In addition, we design two score functions for u(X, g):

• Dirac delta function: It is defined as (1/|X|) Σ_{(x,y)∈X} 1[y = argmax_c g_c(x; w)], where 1[·] is the indicator function; it is also known as the accuracy.

• Classification loss: It is defined as −(1/|X|) Σ_{(x,y)∈X} log g_y(x; w); it is also known as the cross-entropy loss.

Next, for AverageModels, we modify the state-of-the-art model averaging scheme of FedPdp by replacing S^t with S^t_opt from Eq. (4), as shown in Eq. (5). In conclusion, FedCA extends ModelFilter to the combinatorial model filter of Eq. (4) and AverageModels to Eq. (5), and is thus called combinatorial averaging. It is apparent that FedCA can be easily extended to any other existing schemes. Please refer to Appendix A for a detailed illustration of the FedCA process.

    w^t ≜ Σ_{k∈S^t_opt} p_k (|S|/|S^t_opt|) w_k^t    (5)

Experimental Setting. We compared FedCA with three algorithms on the CIFAR-10 task [8], following the same state-of-the-art configuration and parameter values as suggested by Simonyan and Zisserman [9]. We employed an 11-layer VGG [10], SGD with momentum, weight decay, and standard data augmentation. To simulate a wide range of non-IIDness, we designed representative heterogeneity settings based on widely used techniques [11, 12] as follows (Figure 1):

• Client Heterogeneity [11]: A dataset is partitioned by drawing p_c ∼ Dir_K(α · 1), which allocates a p_{k,c} proportion of the data examples of class c to client k.

• Class Heterogeneity [12]: Training examples on every client are drawn independently with class labels following a categorical distribution over N classes. Each categorical distribution is drawn as q ∼ Dir(α · 1) from a Dirichlet distribution, where α > 0 is the concentration parameter controlling IIDness among clients.

Figure 1: Datasets across 20 clients according to IIDness (α) and heterogeneity (top: client, bottom: class). The x-axis denotes each client and the y-axis denotes the number of data examples; panels vary α over 0., 1., and 5. for each heterogeneity type.

Table 2: Top-1 accuracy of FedPdp and FedCA according to class heterogeneity, score function, and SamplingRatio (= |S^t|/|S|).
  non-IID (α = 0.): FedPdp 54.12 ± …; FedCA (Dirac delta) 52.51 ± …; FedCA (classification loss) … ± …
  IID (α = 5.): FedPdp 82.26 ± …; FedCA (Dirac delta) 82.26 ± …; FedCA (classification loss) … ± …

Results. Above all, to observe the effects of biased model averaging, we compared FedCA (FedPdp + CA) with FedPdp. Table 2 shows that FedCA consistently outperformed FedPdp, while the variances were comparable. Thus, we can infer that the generalization error mostly comes from the biased model averaging of FedPdp, and that FedCA helps resolve the challenge.

Furthermore, we compared FedCA with the state-of-the-art algorithms under different environments such as heterogeneity, score function, and
SamplingRatio. First, as shown in Table 2, FedCA consistently outperformed FedPdp in both IID and non-IID settings. In particular, Figure 2 shows that, in both client (top) and class (bottom) heterogeneity settings, FedProx and FedPdp with CA outperform those without CA over the whole training process. Next, Figure 2 also shows that the classification loss score (right) commonly exhibited less generalization error across various settings than the Dirac delta score (left). Lastly, in the non-IID case of Table 2, FedCA facilitated generalization across all sampling ratios and achieved higher accuracy with a higher sampling ratio. In particular, when the sampling ratio is ., FedCA with Dirac delta reached a higher test accuracy by up to +. %. Detailed values for Figure 2 are provided in Appendix D.

Figure 2: Effects of Combinatorial Averaging (CA) according to heterogeneity (vertical) and score function (horizontal). Panels: (a) client heterogeneity, Dirac delta; (b) client heterogeneity, classification loss; (c) class heterogeneity, Dirac delta; (d) class heterogeneity, classification loss. All models are trained with E = 5 and SamplingRatio = 0. on the non-IID (α = 0.) dataset.

4 MAB based Client Sampling

In this section, we present our novel algorithm, coined FedCM (Federated learning with Combinatorial averaging and MAB based client sampling). For the experiments, we used the same recipe as in Section 3.
Proposed Algorithm: FedCM. To resolve the challenge of the lack of prior knowledge in client sampling, we introduce a MAB based client sampling scheme that reflects prior knowledge by modeling previous client sampling behavior. Unlike the conventional schemes compared in Table 1, MAB based client sampling can incorporate prior knowledge by prioritizing the clients that were subsampled in the last iteration. By integrating the combinatorial averaging of FedCA with MAB based client sampling, we can simultaneously resolve both aforementioned challenges; we call this integrated extension FedCM, which extends SampleClients and PriorKnowledge of Algorithm 1 as follows. In Algorithm 1, the function SampleClients takes the total client set S, SamplingRatio, and PriorKnowledge as input, and returns the sampled client set S^t as output. Based on what PriorKnowledge is provided, as well as how SampleClients handles it, FedCM is derived into two heuristic algorithms built on representative MAB algorithms, namely UCB (Upper Confidence Bound) [13] and TS (Thompson Sampling) [14]: FedCM-UCB and FedCM-TS.
Algorithm 2: FedCM-UCB (Upper Confidence Bound)
  INPUT: S, SamplingRatio, P = (P_1, ..., P_n) where P_k = (μ̂_k, t, a_k), F^{t−1}
  OUTPUT: S^t
  1: for each client k ∈ S^{t−1} do
  2:     Update (μ̂_k, a_k) ← ((a_k μ̂_k + r_k) / (a_k + 1), a_k + 1) where r_k = 1[k ∈ S^{t−1}_opt]
  3:     Set μ̄_k ← μ̂_k + sqrt(2 ln t / a_k)
  4: S^t ← { n ∈ S : |{ m ∈ S : μ̄_m ≥ μ̄_n }| / |S| ≤ SamplingRatio }

Figure 3: Effects of FedCM according to heterogeneity (vertical) and score function (horizontal). Panels: (a) client heterogeneity, Dirac delta; (b) client heterogeneity, classification loss; (c) class heterogeneity, Dirac delta; (d) class heterogeneity, classification loss. All models are trained with E = 5 and SamplingRatio = 0. on the non-IID (α = 0.) dataset.

First, Algorithm 2 shows how FedCM extends SampleClients to UCB. PriorKnowledge of FedCM-UCB involves the prior P and the σ-field F^{t−1} generated by the previous observations S^0, S^0_opt, ..., S^{t−1}, S^{t−1}_opt. FedCM-UCB iteratively updates μ̂_k of each client with reward r_k = 1[k ∈ S^{t−1}_opt] (Lines 1–3) and samples clients based on μ̄_k (Line 4).

Algorithm 3:
FedCM-TS (Thompson Sampling)
  INPUT: S, SamplingRatio, P = (P_1, ..., P_n) where P_k ∼ Beta(α_k, β_k), F^{t−1}
  OUTPUT: S^t
  1: for each client k ∈ S^{t−1} do
  2:     Update (α_k, β_k) ← (α_k + r_k, β_k + 1 − r_k) where r_k = 1[k ∈ S^{t−1}_opt]
  3:     Draw a sample θ̂_k according to P_k
  4: S^t ← { n ∈ S : |{ m ∈ S : θ̂_m ≥ θ̂_n }| / |S| ≤ SamplingRatio }

Next, Algorithm 3 shows how FedCM extends SampleClients to TS, one of the most promising algorithms for bandit problems. PriorKnowledge of FedCM-TS involves a beta distribution and the same σ-field F^{t−1} as that of FedCM-UCB. FedCM-TS iteratively updates α_k, β_k of each client with reward r_k = 1[k ∈ S^{t−1}_opt] (Lines 1–3) and samples clients based on θ̂_k (Line 4). Detailed settings are further illustrated in Appendix C.

Results. Despite the improvement through FedCA, we observed that convergence becomes significantly slower in non-IID settings than in IID settings (see Figure 6 in Appendix D). We systematically compared FedCM with the state-of-the-art algorithms according to class heterogeneity and score function. As described in Figure 3, both FedCM-UCB and FedCM-TS outperformed FedAvg and FedProx in all cases, and FedPdp in the case of class heterogeneity, in terms of both generalization error and convergence speed. These improvements occurred throughout the whole training process, even in the early stages of training. Detailed values of the accuracies and convergence speeds are in Appendix D.
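The two SampleClients variants above can be sketched as follows, with the reward r_k = 1[k ∈ S^{t−1}_opt] as in the paper. The helper names are hypothetical, and the UCB exploration bonus is assumed to be the standard UCB1 form sqrt(2 ln t / a_k); the paper's exact bonus term is taken as this standard choice.

```python
import math
import random

def top_fraction(scores, sampling_ratio):
    """Clients whose score ranks within the top SamplingRatio fraction (Line 4)."""
    m = max(1, int(sampling_ratio * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:m]

def ucb_sample(state, prev_sampled, prev_opt, t, sampling_ratio):
    """FedCM-UCB sketch: state maps client -> (empirical mean, pull count)."""
    for k in prev_sampled:
        mu, a = state[k]
        r = 1.0 if k in prev_opt else 0.0       # reward from last round's filter
        state[k] = ((a * mu + r) / (a + 1), a + 1)
    # assumed UCB1 bonus; ranks exploit (mean) plus explore (bonus)
    scores = {k: mu + math.sqrt(2 * math.log(max(t, 2)) / a)
              for k, (mu, a) in state.items()}
    return top_fraction(scores, sampling_ratio)

def ts_sample(state, prev_sampled, prev_opt, sampling_ratio):
    """FedCM-TS sketch: state maps client -> Beta(alpha, beta) parameters."""
    for k in prev_sampled:
        alpha, beta = state[k]
        r = 1 if k in prev_opt else 0           # reward from last round's filter
        state[k] = (alpha + r, beta + 1 - r)
    # posterior sampling: rank clients by a draw from each Beta posterior
    scores = {k: random.betavariate(*state[k]) for k in state}
    return top_fraction(scores, sampling_ratio)
```

Both samplers return the top-ranked fraction of clients, matching the set-builder condition of Line 4 in Algorithms 2 and 3.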
5 Related Work

Recent studies [3, 4] emphasize in-depth investigation of client sampling and model averaging. FedCS [5] aims at maximizing the number of clients while minimizing the overall communication delay for a set of sampled learners by considering a round-trip time constraint. Mohri et al. [6] optimize the degree of client participation via a fairness objective function that enables the model to be agnostic to any mixture of client data distributions. In Cho et al. [7], an optimal set is subsampled from the sampled clients based on the loss on each client's local data. Their concept is similar to ours, but it differs in that the criterion for optimality is the training loss rather than the validation loss, and they did not consider any prior knowledge. In addition, it may be truly suboptimal because, in the course of sampling, the local clients that penalized training with adverse effects in the past are not considered at all. On the other hand, for the purpose of communication reduction, reinforcement learning [15, 16, 17] and MAB [18, 19] algorithms are being widely investigated.
6 Conclusion

In this paper, we formulated a novel system-level framework of FL with knowledgeable sampling and filtered averaging to address the challenges of biased model averaging and lack of prior knowledge in client sampling. To this end, we presented our novel algorithm FedCM, which resolves the two challenges by filtering biased models with combinatorial averaging and utilizing prior knowledge with multi-armed bandit based client sampling. Interestingly, combinatorial averaging by itself significantly improved the performance of conventional algorithms, and the application of both techniques led to greater synergy. Experimental results show that, compared with the state-of-the-art algorithms, FedCM improved test accuracy by up to . % and convergence rate by up to . times.

References

[1] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In
Artificial Intelligence and Statistics, pages 1273–1282, 2017.
[2] Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
[3] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.
[4] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-IID data. arXiv preprint arXiv:1907.02189, 2019.
[5] Takayuki Nishio and Ryo Yonetani. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pages 1–7. IEEE, 2019.
[6] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. arXiv preprint arXiv:1902.00146, 2019.
[7] Yae Jee Cho, Jianyu Wang, and Gauri Joshi. Client selection in federated learning: Convergence analysis and power-of-choice selection strategies, 2020.
[8] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[9] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[10] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[11] Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Trong Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. arXiv preprint arXiv:1905.12022, 2019.
[12] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
[13] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159, 2013.
[14] Siwei Wang and Wei Chen. Thompson sampling for combinatorial semi-bandits. arXiv preprint arXiv:1803.04623, 2018.
[15] Chetan Nadiger, Anil Kumar, and Sherine Abdelhak. Federated reinforcement learning for fast personalization. In , pages 123–127. IEEE, 2019.
[16] H. Wang, Z. Kaplan, D. Niu, and B. Li. Optimizing federated learning on non-IID data with reinforcement learning. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 1698–1707, 2020.
[17] Hankz Hankui Zhuo, Wenfeng Feng, Qian Xu, Qiang Yang, and Yufeng Lin. Federated reinforcement learning. arXiv preprint arXiv:1901.08277, 2019.
[18] W. Xia, T. Q. S. Quek, K. Guo, W. Wen, H. H. Yang, and H. Zhu. Multi-armed bandit based client scheduling for federated learning. IEEE Transactions on Wireless Communications, pages 1–1, 2020.
[19] Wenchao Xia, Tony Q. S. Quek, Kun Guo, Wanli Wen, Howard H. Yang, and Hongbo Zhu. Multi-armed bandit based client scheduling for federated learning. IEEE Transactions on Wireless Communications, 2020.
A Overview of FedCA

An overview of FedCA from Section 3 is summarized in Figure 4. An apparent difference between the existing FL frameworks and ours is the existence of a validation dataset on the global server. Similar to FedPdp [4], after partial device participation, each participant transfers its locally updated weights to the global model. In our framework, however, the global server subsamples the optimal combination of clients, as described in Eq. (4). After this combinatorial optimization, the subsampled updates are aggregated and distributed to each corresponding local client.

As noted above, CA does not influence the sampling scheme, so it has a plug-and-play nature, i.e., CA can be incorporated with uniform sampling [1, 3, 4] or multi-armed bandit based sampling.
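To make the combinatorial optimization of Eq. (4) concrete, the filter with the Dirac-delta (accuracy) score can be sketched as an exhaustive subset search. The interface is hypothetical: each sampled client k is assumed to expose a logit function `logits[k](x)` returning a list of class scores.

```python
from itertools import combinations

def ensemble_logits(x, subset, logits):
    """Average the logits of all models in the subset (the g-tilde of Eq. (4))."""
    outs = [logits[k](x) for k in subset]
    n_cls = len(outs[0])
    return [sum(o[c] for o in outs) / len(outs) for c in range(n_cls)]

def accuracy_score(val_set, subset, logits):
    """Dirac-delta score: fraction of validation examples classified correctly."""
    correct = 0
    for x, y in val_set:
        g = ensemble_logits(x, subset, logits)
        correct += int(max(range(len(g)), key=g.__getitem__) == y)
    return correct / len(val_set)

def combinatorial_filter(sampled, logits, val_set):
    """Exhaustive search for S_opt: the non-empty subset of sampled clients
    whose averaged logits maximize the validation score."""
    best, best_score = tuple(sampled), -1.0
    for r in range(1, len(sampled) + 1):
        for subset in combinations(sampled, r):
            s = accuracy_score(val_set, subset, logits)
            if s > best_score:
                best, best_score = subset, s
    return list(best)
```

Note that the exhaustive search evaluates 2^|S^t| − 1 subsets, which is feasible only because the sampled set per round is small (e.g., 8 clients).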
Figure 4: An overview of FedCA. The global server holds the global model and a validation dataset; each local client holds a local model and its training dataset. The dashed lines between the global server and the local clients represent communication cost (model distribution and weight transfer for combinatorial averaging), and the solid lines within the server and clients represent computing cost (local updates).
B Client Variance

As shown in Figure 5, top-1 per-class accuracy varies significantly over clients in the non-IID setting, while it does not in the IID setting.

Figure 5: Top-1 per-class accuracy of FedPdp [4] according to α and E on each client's data. The x-axis indicates each client. In all results, 8 out of 20 clients are sampled. The top and bottom rows show the evaluations under client and class heterogeneity, respectively; panels vary α over 0. and 5., and E over 1 and 5.
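The per-client variance visible in Figure 5 stems from the Dirichlet partitions described in Section 3. A minimal sketch of the client-heterogeneity partition (p_c ∼ Dir_K(α · 1)) is below; the function names are illustrative, and the Dirichlet sample is built from stdlib Gamma draws since `random` has no Dirichlet sampler.

```python
import random

def dirichlet(alpha, k, rng):
    """One Dirichlet(alpha * 1_k) sample via normalized Gamma draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def partition_by_class(examples, num_clients, alpha, seed=0):
    """Split (x, y) examples over clients; smaller alpha -> more heterogeneity."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    by_class = {}
    for x, y in examples:
        by_class.setdefault(y, []).append((x, y))
    for y, items in sorted(by_class.items()):
        rng.shuffle(items)
        props = dirichlet(alpha, num_clients, rng)
        # cumulative cut points over this class's examples
        cuts, acc = [], 0.0
        for p in props[:-1]:
            acc += p
            cuts.append(int(acc * len(items)))
        pieces = [items[a:b] for a, b in zip([0] + cuts, cuts + [len(items)])]
        for c, piece in enumerate(pieces):
            clients[c].extend(piece)
    return clients
```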
C Bandit Initialization
FedCM-UCB. For the initialization, a_k is set to 1, and μ̂_k is sampled from a binomial distribution.

FedCM-TS. All (α_k, β_k) are initialized to (1, 1).

D Detailed Experiment Results
Construction of validation set.
In all experiments, we constructed the validation set by sampling 500 instances per class, balanced across classes.
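The balanced construction described above can be sketched as follows (the dataset interface, a list of (x, y) pairs, is a placeholder):

```python
import random
from collections import defaultdict

def balanced_validation_split(examples, per_class=500, seed=0):
    """Return a class-balanced validation subset: per_class examples per label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append((x, y))
    val = []
    for y, items in sorted(by_class.items()):
        rng.shuffle(items)          # sample uniformly within each class
        val.extend(items[:per_class])
    return val
```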
Hyperparameter μ of the proximal term in FedProx [3]. We set the weight of the proximal term to 0.1.
Training acceleration according to α. As shown in Figure 6, models trained on IID data performed better than those trained on non-IID data under the conventional sampling schemes.

Figure 6: Accuracy versus number of rounds under (a) client heterogeneity and (b) class heterogeneity, across 20 clients according to IIDness (α).

Further experimental results. Table 3 and Table 4 show further comparisons of FedCA and FedCM against the conventional FL algorithms. Table 5 shows the convergence speed of the algorithms mentioned in Table 4.

Table 3:
Top-1 accuracy of FedCA for FedProx [3] and FedPdp [4]. With 8 out of 20 clients sampled, all models are trained with E = 5 in the non-IID (α = 0.) setting. The standard deviations are computed over 2 different seeds.

Heterogeneity          FedCA   Score function       FedProx [3]   FedPdp [4]
Client Heterogeneity   W/O     -                    62.75 ± …     …
                       W/      Dirac delta          62.83 ± …     …
                       W/      Classification loss  …             …
Class Heterogeneity    W/O     -                    44.34 ± …     …
                       W/      Dirac delta          …             …
                       W/      Classification loss  48.33 ± …     …

Table 4: Detailed results of FedCM compared to FedProx [3] and FedPdp [4]. With 8 out of 20 clients sampled, all models are trained with E = 5 in the non-IID (α = 0.) setting. The standard deviations are computed over 2 different seeds.

Heterogeneity          FedCA   Score function       FedAvg [1]   FedProx [3]   FedPdp [4]   FedCM-UCB   FedCM-TS
Client Heterogeneity   W/O     -                    63.97 ± …    …             …            -           -
                       W/      Dirac delta          -            62.83 ± …     …            …           …
                       W/      Classification loss  -            64.54 ± …     …            …           …
Class Heterogeneity    W/O     -                    40.24 ± …    …             …            -           -
                       W/      Dirac delta          -            50.73 ± …     …            …           …
                       W/      Classification loss  -            48.33 ± …     …            …           …

Table 5: Communication rounds of various algorithms. Here, we evaluate convergence speed via the communication round at which performance reaches the accuracy of FedAvg [1] (e.g., client heterogeneity: . , class heterogeneity: . ). All training settings are the same as those of Table 4.

Heterogeneity          FedCA   Score function       FedAvg [1]   FedProx [3]   FedPdp [4]    FedCM-UCB    FedCM-TS
Client Heterogeneity   W/O     -                    100 (1×)     -             52 (1.92×)    -            -
                       W/      Dirac delta          -            -             57 (1.75×)    35 (2.86×)   31 (3.23×)
                       W/      Classification loss  -            84 (1.19×)    49 (2.04×)    53 (1.87×)   51 (1.96×)
Class Heterogeneity    W/O     -                    100 (1×)     71 (1.41×)    46 (2.17×)    -            -
                       W/      Dirac delta          -            52 (1.92×)    35 (2.86×)    27 (3.70×)   24 (4.17×)
                       W/      Classification loss  -            59 (1.69×)    34 (2.94×)    33 (3.03×)   25 (4.00×)