Active Federated Learning

Jack Goetz, University of Michigan
Kshitiz Malik, Facebook Assistant
Duc Bui, University of Michigan
Seungwhan Moon, Facebook Assistant
Honglei Liu, Facebook Assistant
Anuj Kumar, Facebook Assistant
Abstract
Federated Learning allows for population level models to be trained without centralizing client data by transmitting the global model to clients, calculating gradients locally, then averaging the gradients. Downloading models and uploading gradients uses the client's bandwidth, so minimizing these transmission costs is important. The data on each client is highly variable, so the benefit of training on different clients may differ dramatically. To exploit this we propose Active Federated Learning, where in each round clients are selected not uniformly at random, but with a probability conditioned on the current model and the data on the client to maximize efficiency. We propose a cheap, simple and intuitive sampling scheme which reduces the number of required training iterations by 20-70% while maintaining the same model accuracy, and which mimics well known resampling techniques under certain conditions.
As machine learning models are deployed in the real world, the assumptions under which they were developed are often shown to be incompatible with user requirements. One assumption is unrestricted access to the training data, either on a single machine or distributed over many researcher-controlled machines. Due to privacy concerns users may not want to transmit data from their personal devices, making such centralized training impossible. Federated Learning enables the training of models on this data, but transmission costs between the server and the client are high, and reducing these costs is important. In this paper we introduce Active Federated Learning (AFL) to preferentially train on users which are more beneficial to the model during that training iteration. Motivated by ideas from Active Learning, we propose using a value function which can be evaluated on the user's device and returns a valuation to the server indicating the likely utility of training on that user. The server collects these valuations and converts them to probabilities with which the next cohort of users is selected for training. By using a simple value function related to the loss the user's data suffers under the current model, we can reduce the number of training rounds required for the model to achieve a specified level of accuracy by 20-70%.
Since its introduction [11, 14], reducing the communication costs of Federated Learning has been an important goal [8, 4]. However, as discussed in Li et al. [10], there are few existing techniques which change the method of selecting users. In Hartmann [6] the author suggests stratification based on contextual information about the users, and in Nishio and Yonetani [12] the authors group users based on hardware characteristics. In contrast our work is closer to Active Learning (AL) [13], where the selection policy is dependent on the current state of the model and the data on each user.
Preprint. Under review.

In both paradigms training data must be selected under imperfect information; in AL the covariates are fully known but the label of candidate data points is unknown, whereas in AFL both labels and covariates are fully known on each client, but only a summary is returned to the server. Additionally, in standard AL individual data points may be selected in an unconstrained manner, whereas in AFL we train on all data points on each selected user, creating predetermined subsets of data.

Assume we have labelled data $(x, y)$ and a model for predicting $y \in \mathcal{Y}$ given $x \in \mathcal{X}$, which we denote by $\hat{y} = f(x; w)$, where $w \in \mathbb{R}^d$ are our model parameters. These model parameters will be learned by minimizing some loss function $l(x, y; w)$. Assume our training data is distributed over multiple clients (or users) $U = \{U_1, \dots, U_K\}$, where we denote the data of client $U_k$ by $(x_k, y_k) \in \mathcal{X}^{n_k} \times \mathcal{Y}^{n_k}$. Our model parameters will be learned during training iterations, so we will let $w^{(t)}$ denote the value of our parameters at training iteration $t$. During each training iteration we select a subset of users $S^{(t)} \subset U$, $|S^{(t)}| = m$, and send $w^{(t)}$ to each user in the set. Each user then performs some training $T$ using their local data and produces updated model parameter values $w_k^{(t+1)} = T(x_k, y_k; w^{(t)})$. In its most simple form this training could be a single step of gradient descent, though in practice it is often more complicated, such as multiple passes of SGD. These updated model parameter values are then returned to the server and aggregated to produce the next model parameters using Federated ADAM [9]. In traditional Federated Learning the subsets $S^{(t)}$ are selected uniformly at random and independently at each iteration. Our goal in AFL is to select our subsets $S^{(t)}$ such that fewer training iterations are required to obtain a good model.
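As a concrete illustration, the training loop just described can be simulated on toy data. This is a minimal sketch, not the paper's setup: the 1-D linear model, the synthetic clients, and plain averaging of client parameters (standing in for Federated ADAM [9]) are all simplifying assumptions.

```python
import random

# Toy simulation of one federated training run under uniform (non-active)
# client selection. The model y_hat = w * x and FedAvg-style parameter
# averaging are assumptions made to keep the sketch self-contained.

def make_client(true_w, n, rng):
    """Generate one client's private dataset (x_k, y_k)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + rng.gauss(0.0, 0.01) for x in xs]
    return xs, ys

def local_update(xs, ys, w, lr=0.1, passes=2):
    """The local training T: a few full-batch gradient steps on the
    client's squared-error loss."""
    for _ in range(passes):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

rng = random.Random(0)
clients = [make_client(3.0, 20, rng) for _ in range(50)]  # U_1, ..., U_K
w = 0.0                                                   # w^(0)
m = 10                                                    # cohort size |S^(t)|
for t in range(30):
    cohort = rng.sample(range(len(clients)), m)  # uniform selection of S^(t)
    updates = [local_update(*clients[k], w) for k in cohort]
    w = sum(updates) / m                         # server aggregation -> w^(t+1)

print(w)  # converges toward the true parameter 3.0
```

AFL keeps this loop intact and changes only the `rng.sample` line, replacing uniform selection with sampling from a valuation-driven distribution.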
Inspired by the structure of classical AL methods, we propose the AFL framework, which aims to select an optimized subset of users based on a value function that reflects how useful the data on that user is during each training round. Formally, we define a function $V : \mathcal{X}^{n_k} \times \mathcal{Y}^{n_k} \times \mathbb{R}^d \to \mathbb{R}$ which is evaluated on each user. Once evaluated, each user $U_k$ returns a corresponding valuation $v_k \in \mathbb{R}$ to the server, which is used to calculate the sampling distribution for the next training iteration. The valuations are a function of $w^{(t)}$, but since transmitting the model is expensive we only get fresh valuations of users during an iteration in which we train on them, meaning that
$$v_k^{(t+1)} = \begin{cases} V(x_k, y_k; w^{(t)}) & \text{if } U_k \in S^{(t)} \\ v_k^{(t)} & \text{otherwise.} \end{cases}$$
Ideally the computation of the value function should require minimal additional computation, since the computations are done using the client's hardware, and should not reveal too much about the data on each client. Once the server has all valuations it converts them into a sampling distribution.
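The stale-valuation update rule above amounts to simple bookkeeping on the server. The sketch below illustrates it; `mean_loss` is a stand-in value function for a toy predictor $\hat{y} = wx$, an assumption of this example rather than anything from the paper.

```python
def refresh_valuations(valuations, cohort, value_fn, clients, w):
    """Refresh cached valuations only for clients in the current cohort
    S^(t); every other client keeps its stale v_k from an earlier round."""
    for k in cohort:
        xs, ys = clients[k]
        valuations[k] = value_fn(xs, ys, w)
    return valuations

# Stand-in value function (an assumption for this sketch): the mean squared
# loss of the client's data under the toy model y_hat = w * x.
def mean_loss(xs, ys, w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

clients = {0: ([1.0], [2.0]), 1: ([1.0], [0.0])}
vals = {0: 5.0, 1: 5.0}  # stale valuations from earlier rounds
vals = refresh_valuations(vals, cohort=[0], value_fn=mean_loss,
                          clients=clients, w=1.0)
print(vals)  # → {0: 1.0, 1: 5.0}: client 0 refreshed, client 1 still stale
```

Because only the trained-on cohort recomputes $V$, the selection distribution is always built from a mixture of fresh and stale valuations.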
Figure 1: Active Federated Learning framework for a binary classification problem. The red and blue dots on each client show the private data on the client. At each training step: (1) clients send their valuations to the server; (2) the server converts individual client valuations into the probability of each client being selected in the next batch, then selects the next training batch randomly using these client probabilities.

Algorithm 1: Sampling algorithm
Input: client valuations $\{v_1, \dots, v_K\}$, tuning parameters $\alpha_1, \alpha_2, \alpha_3$, number of clients per round $m$
Output: client indices $\{k_1, \dots, k_m\}$
1. Sort users by $v_k$.
2. For the $\alpha_1 K$ users with smallest $v_k$, set $v_k = -\infty$.
3. For $k$ from $1$ to $K$: $p_k \propto e^{\alpha_2 v_k}$.
4. Sample $(1 - \alpha_3) m$ users according to their $p_k$, producing set $S'$.
5. Sample $\alpha_3 m$ users from the remaining users uniformly at random, producing set $S''$.
6. Return $S = S' \cup S''$.

One very natural value function is to use the loss of the user's data, $v_k = \sqrt{n_k}\, l(x_k, y_k; w)$. It is already calculated during model training and is increasing with how poorly the model performs on the client's data. Additionally it mimics common resampling techniques when the required structure is present in the data. If there is extreme class imbalance and weak separation of the classes, data points of the minority class will have significantly higher loss than majority class data points. Therefore we will prefer users with more minority data, mimicking resampling the minority class data. Similarly, if the noise depends on the distance from the classification boundary, such as in [3], using the loss replicates margin based resampling techniques. Finally, if all data points are equally valuable then users with more data will be given higher valuations. Most importantly, these adaptations to the data do not require the practitioner to know the specific structure being exploited. This is particularly important in the Federated setting, where information about the data is limited. Even summarizing the client data with a single float may reveal too much information. To properly protect users the value function should be reported using a Differentially Private mechanism [5]. The noise introduced to maintain Differential Privacy may mislead the server into selecting suboptimal clients. However, there is structure which might be exploited to reduce the corruption while still maintaining privacy. One is that many value functions, such as the loss, are not expected to change dramatically within a small number of training rounds.
Thus we may be able to query whether a valuation has changed dramatically before querying the new value, similar to the Sparse Vector technique, to reduce the number of queries. We may also be able to adapt our value function to be more amenable to Differential Privacy. For example, the loss value function has unbounded sensitivity and requires clipping to provide Differential Privacy. However, returning a count of high loss data points has sensitivity 1 and may be less affected by the privacy providing noise. Adding privacy guarantees is an important challenge in AFL and is the subject of much future work.

We compared AFL to the standard uniform selection on two datasets: one on the Reddit dataset, the other on the Sticker Intent dataset. The Reddit dataset is a publicly available [2] dataset consisting of comments from users on reddit.com. The authors were not involved in collecting this dataset. For the Reddit dataset we predicted the binary label 'controversiality' based on the comment text, and selected 8K users at random from the November 2017 data set, similar to Bagdasaryan et al. [1] but only excluding users with more than 100K messages. We removed comments being responded to from the messages, and empty messages. The Reddit dataset has many users who post few comments, but a long tail of power users. The Sticker Intent dataset has randomly selected, anonymized messages from a popular messaging app. The task was binary classification: predict whether a message was replied to using a sticker. Messages in this set were collected, de-identified, and annotated automatically; the messages were not read or labeled by human annotators.

Algorithm 1 for converting the valuations into a sampling distribution has 3 tuning parameters. The $\alpha_1$ proportion of users with the smallest valuations will have their valuations set to $-\infty$.
They can still be selected by random sampling. $\alpha_2$ is our softmax temperature. $\alpha_3$ is the proportion of users which are selected uniformly at random. In our experiments we used $\alpha_1 = 0.$ , $\alpha_2 = 0.$ , $\alpha_3 = 0.$ . We chose $\alpha_2$ to ensure that the softmax did not produce $p_k = 0$ from underflow errors, and $\alpha_1, \alpha_3$ were both chosen based on initial experiments on the Sticker Intent dataset.

Table 1: Reddit dataset statistics

        messages   users   % label   mean messages/user   median messages/user
Train   124638     7527    0.021     16.6                 3
Test    15568      3440    0.021     4.5                  2

The underlying model trained with Federated Learning used a 64 dimensional character level embedding, a 32 dimensional BLSTM, and an MLP with one 64 dimensional hidden layer. The number of users in each Federated round was 200, and on each user 2 passes of SGD were performed with a batch size of 128. The learning rates for both local SGD and Federated ADAM were tuned separately for Random Sampling and AFL, and the optimal learning rates were used for each.

Figure 2 shows the AUC after each Epoch under uniform random selection of users and with AFL selection, showing means and standard errors from 10 repetitions on test data. AFL trains models of the same performance using 20-70% fewer Epochs (where one Epoch is enough training rounds to train on each client once in expectation under random sampling).
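Algorithm 1, with $\alpha_1$ masking out the lowest-valued users, $\alpha_2$ acting as the softmax temperature, and $\alpha_3$ setting the uniformly sampled fraction, can be sketched as follows. The default parameter values below are illustrative placeholders, not the paper's tuned settings.

```python
import math
import random

def sample_clients(valuations, m, a1=0.75, a2=0.01, a3=0.1, rng=random):
    """Sketch of Algorithm 1. a1, a2, a3 play the roles of alpha_1, alpha_2
    and alpha_3; their default values here are placeholder assumptions."""
    K = len(valuations)
    order = sorted(range(K), key=lambda k: valuations[k])
    masked = set(order[: int(a1 * K)])  # v_k <- -infinity for these users

    # Softmax over the remaining users: p_k proportional to exp(a2 * v_k),
    # shifted by the max logit for numerical stability.
    logits = {k: a2 * valuations[k] for k in range(K) if k not in masked}
    zmax = max(logits.values())
    weights = {k: math.exp(z - zmax) for k, z in logits.items()}

    # Draw (1 - a3) * m distinct users according to their p_k (set S').
    pool = list(weights)
    wts = [weights[k] for k in pool]
    chosen = set()
    while len(chosen) < m - int(a3 * m):
        chosen.add(rng.choices(pool, weights=wts)[0])

    # Fill the rest of the cohort uniformly at random from everyone not yet
    # chosen (set S''); masked users can re-enter here, as in Algorithm 1.
    rest = [k for k in range(K) if k not in chosen]
    chosen.update(rng.sample(rest, m - len(chosen)))
    return chosen

rng = random.Random(0)
vals = [float(k) for k in range(100)]  # higher index -> higher valuation
cohort = sample_clients(vals, m=20, rng=rng)
print(len(cohort))  # → 20
```

With these settings the active portion of the cohort is drawn almost entirely from the highest-valued quarter of users, while the $\alpha_3$ fraction keeps every user, including masked ones, reachable in each round.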
Figure 2: Comparison of AUC increase on the Reddit and Sticker Intent datasets.

One difference between AFL and server-side resampling techniques is that AFL selects data points by user, whereas server-side resampling can select arbitrary subsets. To explore the significance of this restriction we compared the gains from oversampling of label data [7] and server-side learning against AFL using the value function $v_k = \sum_i \mathbb{1}\{y_{i,k} = 1\}$ and Federated training, using the Reddit dataset. The level of resampling and learning rates were tuned for server training, as were the temperature $\alpha_2$ and the learning rates for Federated training, and all other tuning parameters were kept the same. Our results suggest that there is significant loss from selecting users, as the difference between Random Sampling and Active Sampling is much larger for server-side learning.

                                   Random Sampling   Active Sampling
Server selection of data points    0.559             0.615
Federated selection of clients     0.552             0.578

In this paper we proposed Active Federated Learning (AFL), the first user cohort selection technique for FL which actively adapts to the state of the model and the data on each client. This adaptation allows us to train models with 20-70% fewer iterations for the same performance. Giving formal privacy guarantees is vital future work, but there are many other interesting extensions as well. These experiments were done under simplifying conditions which do not take into account many problems Federated Learning faces in practice, and which AFL may be able to help alleviate. For example, clients may have different rates of availability for training. This availability may be correlated with the data on the client, resulting in bias in our model if not corrected. AFL which also takes reliability into account may be used to reduce this bias by increasing the rate at which we try to train on unreliable users.
Another challenge is that clients are constantly gathering (and potentially forgetting) data, and in many cases the distribution may be non-stationary. Maintaining the benefits of AFL may require a principled way of ensuring no user goes too long without having their valuation refreshed. Finally, our experiments and analyses focused on the classification setting, but the loss value function can be used for any supervised problem, and understanding AFL with more complex models would be an interesting research direction.
References

[1] Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. (2018). How to backdoor federated learning. arXiv preprint arXiv:1807.00459.
[2] Baumgartner, J. (2019). Reddit Comments Dumps. https://files.pushshift.io/reddit/comments/. Accessed: 2019-09-03.
[3] Blaschzyk, I., Steinwart, I., et al. (2018). Improved classification rates under refined margin conditions. Electronic Journal of Statistics, 12(1):793–823.
[4] Caldas, S., Konečný, J., McMahan, H. B., and Talwalkar, A. (2018). Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210.
[5] Dwork, C., Roth, A., et al. (2014). The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407.
[6] Hartmann, F. (2018). Federated learning.
[7] He, H. and Garcia, E. A. (2008). Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering, (9):1263–1284.
[8] Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. (2016). Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
[9] Leroy, D., Coucke, A., Lavril, T., Gisselbrecht, T., and Dureau, J. (2019). Federated learning for keyword spotting. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6341–6345. IEEE.
[10] Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. (2019). Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873.
[11] McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. (2016). Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
[12] Nishio, T. and Yonetani, R. (2019). Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pages 1–7. IEEE.
[13] Settles, B. (2009). Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
[14] Yang, Q., Liu, Y., Chen, T., and Tong, Y. (2019). Federated machine learning: Concept and applications.