Consensus Based Multi-Layer Perceptrons for Edge Computing
Haimonti Dutta, Nitin Nataraj*, and Saurabh Amarnath Mahindre

Department of Management Science and Systems, The State University of New York, Buffalo, NY 14260. [email protected]
o9 Solutions, Inc., 1501 Lyndon B. Johnson Freeway, Dallas, Texas 75234. nitin,[email protected]
Institute for Computational and Data Sciences, The State University of New York, Buffalo, NY 14260. [email protected]
Abstract.
In recent years, storing large volumes of data on distributed devices has become commonplace. Applications involving sensors, for example, capture data in different modalities including image, video, audio, GPS and others. Novel algorithms are required to learn from this rich distributed data. In this paper, we present consensus based multi-layer perceptrons for resource-constrained devices. Assuming nodes (devices) in the distributed system are arranged in a graph and contain vertically partitioned data, the goal is to learn a global function that minimizes the loss. Each node learns a feed-forward multi-layer perceptron and obtains a loss on data stored locally. It then gossips with a neighbor, chosen uniformly at random, and exchanges information about the loss. The updated loss is used to run a back propagation algorithm and adjust weights appropriately. This method enables nodes to learn the global function without exchange of data in the network. Empirical results reveal that the consensus algorithm converges to the centralized model and has performance comparable to centralized multi-layer perceptrons and tree-based algorithms including random forests and gradient boosted decision trees.
Keywords: multi-layer perceptron · gossip · consensus · distributed learning.

1 Introduction

Emerging architectures designed to store and analyze large volumes of data make use of large scale, distributed processing paradigms [2]. These include the mega-scale cloud data-centers and resource constrained devices, such as the Internet of Things (IoT) and mobile devices.
* This work was done when the author was a student at the State University of New York at Buffalo.

While the cloud can be used for executing large scale machine learning algorithms on large volumes of data, such algorithms exert severe demands in terms of energy, memory and computing resources, limiting their adoption for resource constrained, network edge devices. The new breed of intelligent devices and high-stake applications (drones, augmented/virtual reality, autonomous systems, etc.) requires distributed, low-latency and reliable machine learning at the wireless network edge. Thus computing services have now started to move from the cloud to the edge.

Deep learning-based intelligent services [24] and applications have become prevalent. However, their use in edge computing devices has been somewhat limited due to the following reasons: (a) Cost: Training and inference of deep learning models in distributed infrastructures requires consumption of a large amount of network bandwidth. (b) Latency: Access to data and services is generally not guaranteed and the delay is not short enough for time-critical applications. (c) Reliability: Most distributed computing applications rely on wireless communications and backbone networks for connecting users to services, but intelligent services must be highly reliable, even when network connections are lost. (d) Privacy: The data required for deep learning may involve private information, and privacy protocols need to be adhered to. The current state of distributed deep learning systems on edge devices leaves much to be desired.

In this paper, we address this shortcoming by developing multi-layer perceptrons for resource constrained edge devices. When compute power is abundant and devices are not resource-constrained, deep neural networks can be trained using the DistBelief framework [8] with model parallelism within (via multi-threading) and across machines (via message passing). Aside from the fact that a parallel architecture has a single point of failure and is therefore often unsuitable for adoption in resource constrained leaderless environments, synchronization requirements render these algorithms even more unsuitable for use on edge devices. Our algorithm operates on peer-to-peer computing environments and as such interweaves local learning and label propagation [28]. Specifically, it optimizes a trade-off between smoothness of the model parameters over the network on the one hand and the model's local learning on the other. It has similarities to collaborative learning of personalized (peer-to-peer) models over networks [3]; however, unlike that work, the approach presented here learns the global function in the network, instead of solitary, local models.

Finally, it must be pointed out that this work explicitly considers vertically partitioned data, or the setting in which features are distributed across nodes. Recent work on large scale distributed deep networks has primarily focused on horizontal partitions (e.g., cross-data silo Federated Learning [15]), wherein all features are observed at the nodes [4,18] and a centralized parameter server updates models. Our work is closely related to the cross-silo Federated Learning model [26,15], except that the single point of failure parameter server in those models is replaced with a peer-to-peer architecture.
This seemingly minor change has far reaching implications: it removes the need for synchronization with the parameter server at every iteration of the algorithm.

This paper is organized as follows: Section 2 describes use cases for consensus based multi-layer perceptrons; Section 3 describes related work; Section 4 provides details of the algorithm; empirical results are presented in Section 5; and Section 6 concludes the paper.

2 Motivating Applications

We motivate the need to develop consensus based multi-layer perceptrons by describing the following applications:
– Medical Diagnosis: Collaborations amongst health entities on mobile devices [10] require examination of different modalities of patient data such as Electronic Health Records (EHR), imaging, pathology results, and genetic markers of a disease.
– Drug Discovery: The pharmaceutical industry requires platforms that enable drug discovery using private and competitive drug discovery related data and hundreds of TBs of image data (see https://featurecloud.eu/about/our-vision/ and https://cordis.europa.eu/project/id/831472).
– Autonomous Vehicles: Google, Uber, Tesla, and many automotive companies have developed autonomous driving systems. Applications (such as forward collision warning, blind spot and lane change warnings, and adaptive cruise control) are time critical and require real time learning and updates from individual vehicles [20].
– Home Sensing: In home monitoring and sensing applications [13], non-intrusive load monitoring systems are used to study fluctuations in signals.
– Manufacturing Operations: These require industrial data that is inter-operable and scalable (https://musketeer.eu/project/). In applications of this genre, the sensors and IoT devices collect data at different time points, often from different locations, and these are then subjected to analysis.

3 Related Work

Scalable algorithms for deep learning have been explored in several papers in recent years. We discuss related work which makes use of two different architectures: (a) Parallel, which ensures the presence of a master to control slave workers, and (b) Distributed, which is a fully decentralized, peer-to-peer architecture without the need for a master.
Parallel DNN Algorithms:
A large proportion of the research in this domain has focused on data parallelism and the ability to exploit the compute power of multiple slave workers, with a single master controlling the execution of the slaves. McDonald et al. [17] present two different strategies for parallel training of structured perceptrons and use them for named entity recognition and dependency parsing. TernGrad [25] uses ternary levels {−1, 0, +1} to reduce the overhead of gradient synchronization and communication. DoReFa-Net [29] trains convolutional neural networks that have low bit width weights, activations and gradients. Seide et al. [21] show that it is possible to quantize gradients aggressively during training of deep neural networks using SGD, making it feasible to use data parallel fast processors such as GPUs. Quantized SGD (QSGD) [1] explores the trade-off between accuracy and gradient precision. A slightly different line of work [27] explores the utility of asynchronous Stochastic Gradient Descent algorithms, suggesting that if the learning rate is modulated according to the gradient staleness, better theoretical guarantees for convergence can be established than for the synchronous counterpart.

Distributed DNN Algorithms:
In the fully decentralized setting, [14] present a consensus-based distributed SGD (CDSGD) algorithm for collaborative deep learning over fixed topology networks that enables data parallelization as well as decentralized computation. Sutton et al. [22] explore neural network architectures in which the structures of the models are partitioned prior to training. Partitioning of deep neural networks has also been studied in the context of distributed computing hierarchies such as the cloud, end and edge devices [23]. Gupta et al. [10] present an algorithm for training DNNs over multiple data sources. The research described above fundamentally differs from the material presented in this paper in that our consensus algorithm relies on both model and data partitioning to construct local multi-layer perceptron models which can independently learn global information.
4 Consensus Based Multi-Layer Perceptrons

We present the consensus based multi-layer perceptron algorithm in this section. In the distributed setting, let M denote an N × n matrix with real-valued entries. This matrix represents a dataset of N tuples of the form x_i ∈ R^n, 1 ≤ i ≤ N. Each tuple has an associated label y_i ∈ {+1, −1}. Assume this dataset has been vertically distributed over m nodes S_1, S_2, ..., S_m such that node S_i has a data set M_i ⊂ M, M_i : N × n_i, and each x_j ∈ M_i is in R^{n_i}, n_i ≤ n. This implies that all the nodes have access to all N tuples but have a limited number of features, i.e., n_i ≤ n. Thus, M = M_1 ∪ M_2 ∪ ··· ∪ M_m denotes the concatenation of the local data sets. The labels are shared across all the nodes. The goal is to learn a deep neural network on the global data set M by learning local models at the nodes, allowing exchange of information among them using a gossip based protocol [16,9], and updating the local models with new information obtained from neighbors. This ensures that there is no actual data transfer amongst nodes. We assume that the local models have the same structure, i.e., the same number of input, hidden and output layers and connections.

Model of Distributed Computation: The distributed algorithm evolves over discrete time with respect to a "global" clock (the existence of this clock is of interest only for theoretical analysis). Each node has access to a local clock or to no clock at all. Furthermore, each node has its own memory and can perform local computation (such as estimating the local weight vector). It stores f_i, which is the estimated local function. Besides its own computation, nodes may receive messages from their neighbors which help in the evaluation of the next estimate for the local function.
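To make the vertically partitioned setting concrete, the following minimal sketch (hypothetical helper names, not taken from the paper) splits a feature matrix column-wise across m nodes: every node keeps all N examples and the shared label vector, but only its own slice of the features.

```python
import numpy as np

def vertical_partition(M, y, m, seed=0):
    """Split the N x n matrix M column-wise into m local datasets M_1 .. M_m.

    Every node receives all N examples but only a disjoint subset of the
    n features; the label vector y is shared by all nodes."""
    rng = np.random.default_rng(seed)
    cols = rng.permutation(M.shape[1])          # shuffle feature indices
    splits = np.array_split(cols, m)            # m roughly equal column groups
    return [(M[:, idx], y) for idx in splits]   # (local features, shared labels)

# Example: N = 6 examples, n = 8 features, m = 4 nodes.
M = np.arange(48, dtype=float).reshape(6, 8)
y = np.array([+1, -1, +1, -1, +1, -1])
local_data = vertical_partition(M, y, m=4)
print([part.shape for part, _ in local_data])   # [(6, 2), (6, 2), (6, 2), (6, 2)]
```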
Communication Protocols: Nodes S_i are connected to one another via an underlying communication framework represented by a graph G(V, E), such that each node S_i ∈ {S_1, S_2, ..., S_m} is a vertex and an edge e_ij ∈ E connects nodes S_i and S_j. Communication delays on the edges are assumed to be zero.
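A small illustrative sketch of the communication graph G(V, E) as adjacency lists, together with the uniform random choice of a gossip partner that the protocol below relies on (variable names are ours, not the paper's):

```python
import random

# Adjacency-list view of G(V, E); a fully connected graph over 4 nodes here.
neighbors = {i: [j for j in range(4) if j != i] for i in range(4)}

def pick_gossip_partner(node, adjacency, rng=random):
    """Return a neighbor of `node` chosen uniformly at random from G(V, E)."""
    return rng.choice(adjacency[node])

print(pick_gossip_partner(0, neighbors))
```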
Distributed Multi-Layer Perceptron (DMLP): Assume that each node S_t has a simple model of a fully connected multi-layer perceptron with sigmoid activations for the hidden layers and a sigmoid activation for the output layer. The network is called N_t. It has L layers: the 0th is the input layer, followed by (L − 1) hidden layers, and the Lth layer is the output layer. Let r_i denote the number of units in the ith layer (note that r_0 = n_i and r_L = 1).
Feed-forward Learning: Let ω^k_{ij} denote the weight from the ith node of the (k − 1)th layer to the jth node of the kth layer, a^k_j the weighted sum of inputs from the previous layer to the jth node of the kth layer, o^k_j the output of the jth node of the kth layer, and b^k_j the bias of the jth node of the kth layer. The feed-forward step for the first node of the first hidden layer can then be written as a^1_1 = b^1_1 + x_1 ω^1_{11} + x_2 ω^1_{21} + ... + x_{n_i} ω^1_{n_i 1}, and its output is o^1_1 = σ(a^1_1). In general, the output of the jth node of the kth layer is o^k_j = σ(a^k_j), where a^k_j = b^k_j + Σ_{i=1}^{r_{k−1}} o^{k−1}_i ω^k_{ij}. The output of the network N_t is given by ŷ^{N_t}_i = σ(a^L_1). The local loss at node S_t is then given by L_t = Σ_{i=1}^{N} (y_i − ŷ^{N_t}_i)², assuming squared loss.
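A minimal numpy sketch of this local feed-forward pass and squared loss (one hidden layer, sigmoid activations; the weight shapes, parameter names and 0/1 labels are illustrative, not the paper's settings):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, params):
    """Feed-forward pass of a one-hidden-layer perceptron at a node.

    X: (N, n_i) local features; params holds weights W1, b1, W2, b2.
    Returns the hidden activations and the predictions y_hat in (0, 1)."""
    a1 = X @ params["W1"] + params["b1"]      # a^1_j = b^1_j + sum_i x_i * w^1_ij
    o1 = sigmoid(a1)                          # o^1_j = sigma(a^1_j)
    a2 = o1 @ params["W2"] + params["b2"]     # weighted sum into the output unit
    return o1, sigmoid(a2).ravel()            # hidden outputs, y_hat^{N_t}

def local_loss(y, y_hat):
    """Squared loss L_t = sum_i (y_i - y_hat_i)^2 on the local data."""
    return np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, 0, 1, 0, 1], dtype=float)    # binary labels, shown as 0/1 here
params = {"W1": rng.normal(size=(3, 4)), "b1": np.zeros(4),
          "W2": rng.normal(size=(4, 1)), "b2": np.zeros(1)}
o1, y_hat = forward(X, params)
print(local_loss(y, y_hat))
```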
Gossip: Node S_t selects, uniformly at random, a neighbor S_u with whom it wishes to gossip. Both S_t and S_u have computed their local losses. When gossiping, each node updates its current local loss to L_gossip = (L_t + L_u)/2. This new loss is used for back propagation at both nodes S_t and S_u.
Back propagation: The back propagation algorithm learns the weights for a multi-layer network, given a network with a fixed set of units and interconnections. It employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs. We use the new loss (L_gossip) obtained after gossiping with a neighbor, in place of the local loss (L_t), for our back propagation phase. This modification helps the local node S_t to incorporate information about the loss from its neighbor S_u into its back propagation learning phase, thereby helping to minimize the global loss instead of the local loss. This is a crucial step in our algorithm. The loss at node S_t after gossip can be written as L_gossip = Σ_{i=1}^{N} (y_i − (ŷ^{N_t}_i + ŷ^{N_u}_i)/2)² = (y − y_gossip)², where y_gossip = (ŷ^{N_t} + ŷ^{N_u})/2 and the bold fonts represent vectors.
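Continuing the sketch above (and reusing its forward and params), the following shows how a node could form the gossiped loss from its own and its neighbor's predictions and take one SGD back-propagation step on it. The gradient derivation assumes the averaged-prediction form of L_gossip given in the text; only the local half of y_gossip depends on this node's parameters.

```python
import numpy as np

def gossip_loss(y, y_hat_t, y_hat_u):
    """L_gossip based on the averaged prediction of the two gossiping nodes."""
    y_gossip = 0.5 * (y_hat_t + y_hat_u)      # average of the two nodes' outputs
    return np.sum((y - y_gossip) ** 2), y_gossip

def backprop_step(X, y, params, y_hat_u, lr=0.01):
    """One SGD step at node S_t using the gossiped loss instead of the local loss."""
    o1, y_hat_t = forward(X, params)                  # from the sketch above
    _, y_gossip = gossip_loss(y, y_hat_t, y_hat_u)
    # dL_gossip/da^L: the factor 2 from the squared loss and the 1/2 from the
    # prediction averaging cancel, leaving (y_gossip - y) * sigmoid'(a^L).
    delta2 = (y_gossip - y) * y_hat_t * (1 - y_hat_t)
    grad_W2 = o1.T @ delta2[:, None]
    grad_b2 = delta2.sum()
    delta1 = (delta2[:, None] @ params["W2"].T) * o1 * (1 - o1)
    grad_W1 = X.T @ delta1
    grad_b1 = delta1.sum(axis=0)
    for k, g in zip(["W1", "b1", "W2", "b2"],
                    [grad_W1, grad_b1, grad_W2, grad_b2]):
        params[k] -= lr * g                    # in-place SGD update
    return params
```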
Algorithm 1 presents the steps of the DMLP algorithm.

Algorithm 1: Distributed Multi-layer Perceptron Learning (DMLP)
Input: N × n_i matrix at each node S_i; G(V, E), which encapsulates the underlying communication framework; T: number of iterations.
Output: Each node S_i has a multi-layer perceptron network N_i.
for t = 1 to T do
  (a) Node S_i uses the network N_i for feed-forward learning and locally estimates the loss on the N instances;
  (b) Node S_i gossips with a neighbor S_j and obtains the loss from the neighbor;
  (c) Gossip: node S_i averages the loss between S_i and S_j and sets this as the new loss;
  (d) Perform back propagation on the current node and the neighbor node using the gossiped loss; update the weight vectors in each layer using Stochastic Gradient Descent (SGD);
  (e) If there is no significant change in the local weight vectors, STOP;
end

Discussion: Some interesting aspects of our algorithm are: (a) The algorithms presented in the above section are anytime algorithms [30]. Anytime algorithms are those whose quality of results changes gradually as computation time increases; at a given time a node may be interrupted to obtain an estimate of the performance. (b) Algorithm 1 can be extended to other kinds of loss functions (such as cross-entropy, softmax) and activations (such as linear, tanh). (c) The number of hidden layers of the multi-layer perceptron can be incremented as required by a node, without the need for any algorithmic changes.
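Pulling the pieces together, here is a compact single-process simulation of Algorithm 1. It is a sketch that builds on the forward and backprop_step helpers above, not the authors' PeerSim implementation, and the threshold tol is an illustrative stand-in for "no significant change in the local weight vectors".

```python
import random
import numpy as np

def dmlp_train(nodes, neighbors, T=100, lr=0.01, tol=1e-6):
    """Sketch of the DMLP loop. `nodes` is a list of dicts with keys 'X', 'y',
    'params'; `neighbors` maps a node index to its list of neighbors in G(V, E)."""
    for t in range(T):
        # (a) local feed-forward pass at every node
        preds = [forward(nd["X"], nd["params"])[1] for nd in nodes]
        max_change = 0.0
        for i, nd in enumerate(nodes):
            # (b)-(c) gossip with a uniformly random neighbor and average
            j = random.choice(neighbors[i])
            old = {k: v.copy() for k, v in nd["params"].items()}
            # (d) back propagation with the gossiped loss
            backprop_step(nd["X"], nd["y"], nd["params"], preds[j], lr=lr)
            max_change = max(max_change,
                             max(np.max(np.abs(nd["params"][k] - old[k]))
                                 for k in old))
        # (e) stop when the weight vectors no longer change significantly
        if max_change < tol:
            break
    return nodes
```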
5 Empirical Results

The empirical results demonstrate the utility of the DMLP algorithm. We examine the following questions: (a) Is there empirical support for the conjecture that the performance of the distributed model is comparable to that of the centralized model? (b) Does the distributed model empirically converge to the centralized one? (c) How does the performance of the proposed method compare to feature subspace learning methods such as Random Forests [6] and tree boosting algorithms (such as XGBoost [7])? The answers to the above questions are explored using the data sets shown in Table 1 [11]. (For MNIST, a balanced binary classification data set was produced using digits 0 and 9; for CIFAR, to convert to a binary classification problem, we assign classes 0-4 the label 0 and classes 5-9 the label 1.)

The experimental process is as follows: (a) The Peersim simulator [19] is used to construct a fully connected graph of 10 nodes. Each node can independently store vertically partitioned data. (b) The total number of features in the train data is split into 10 roughly equal parts. Each node is assigned the data with the corresponding split, containing all the examples but only those features it has been assigned. (c) Each node builds a local neural network model. The local loss vector is generated. (d) Each node selects a neighbor uniformly at random according to the underlying distributed graph, and exchanges the local loss vector with its neighbor. The new loss vector is computed as the average of its own loss vector and that of the neighbor's. (e) Each node participates in back propagation using the new loss generated after gossiping with a neighbor. (f) The above process is repeated for several iterations until the nodes converge to a solution.

Table 1: Characteristics of the datasets used for empirical analysis.

Dataset         No. Train   No. Test   No. Features
Arcene                100        100          10000
Dexter                300        300          20000
Dorothea Bal.         156         68         100000
Gisette              6000       1000           5000
Madelon              2000        600            500
MNIST Bal.          11702       1948            784
HT Sensor           14560       3640             10
CIFAR-10            50000      10000           3072
Testing the model:
Each node is provided with the test set having only those features that the node used to construct the local model. Therefore, each node can test its own performance. For the experiments presented here, we construct the following hypothetical scenario: for each test sample, an average predicted probability is obtained across all nodes, and the distributed test AUC is then estimated. This is not a requirement of the algorithm, but it enables us to compare performance against benchmarks.
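Assuming scikit-learn is available and reusing the forward helper from the sketches above, one way to realize this averaged-probability evaluation is:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def distributed_test_auc(nodes, test_splits, y_test):
    """Average each node's predicted probability on its own feature slice of the
    test set and score the averaged prediction with AUC (benchmarking only)."""
    probs = [forward(X_te, nd["params"])[1]
             for nd, X_te in zip(nodes, test_splits)]
    avg_prob = np.mean(probs, axis=0)        # consensus prediction per test sample
    return roc_auc_score(y_test, avg_prob)
```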
We measure the performance of the model by the area under the Receiver Operating Characteristic (ROC) curve [5], denoted by θ. The centralized algorithm is executed by assuming that the entire dataset is available at a single node. In the distributed setting, a neural network is employed at each node and is fed partial data, partitioned in the feature space. The number of total hidden neurons is kept the same for both the centralized and distributed experiments; this implies that each distributed node has roughly (No. of hidden neurons in the centralized model) / (No. of nodes) hidden neurons in its model. We tune the model(s) in each experiment by selecting different parameters for the learning rate, the number of hidden neurons, the number of hidden layers and the activation functions.

Table 2: Performance of the centralized (C) and distributed (D) algorithms on each dataset, reporting the number of hidden neurons (C and D), the learning rate, the centralized and distributed test AUC, the 95% confidence interval, and the number of iterations to convergence (I_C, I_D). The consensus multi-layer perceptron uses the cross-entropy loss function, ReLU activation for the hidden layer, and sigmoid activation for the output layer. The results are averaged over three trials.
Table 3: Comparison of the performance of the consensus algorithm to the tree-based algorithms Random Forest (RF) and XGBoost, reporting the RF, XGBoost, distributed, and centralized test AUC for each dataset.

Fig. 1: AUC on the test sets for both the centralized and distributed settings on the eight datasets discussed above. For the distributed algorithm, test AUC results averaged over three random vertical feature splits without overlap are presented.
The steps outlined for the distributed algorithm above were repeated for three random feature splits and the test AUC averaged over the trials. We also compute the symmetric 95% confidence interval for the distributed test AUC (θ_D) and observe the centralized test AUC (θ_C) in relation to this interval. Figure 1 shows the AUC curves for all the datasets used in this study. The Standard Error (SE) for the estimated area under the ROC curve, in relation to the sample size n and θ_D, can be computed as described in [12]:

SE = sqrt( [ θ_D(1 − θ_D) + (n − 1)(Q_1 − θ_D²) + (n − 1)(Q_2 − θ_D²) ] / n² ), where Q_1 = θ_D / (2 − θ_D) and Q_2 = 2θ_D² / (1 + θ_D).

Given SE, the symmetric 95% confidence interval (CI) is given by θ_D ± 1.96 SE. The centralized and distributed algorithms can be deemed approximately comparable if θ_C lies within these bounds, i.e., if θ_D − 1.96 SE ≤ θ_C ≤ θ_D + 1.96 SE. In empirical studies (Table 2), it was found that the distributed algorithm obtains test AUC scores comparable to those of the centralized algorithm for all the datasets.

Table 4: Performance of the centralized (C) and distributed (D) algorithms with 20% overlap of features, reporting the centralized AUC, the distributed AUC with overlap, and the 95% confidence interval for each dataset. The consensus neural network uses the cross-entropy loss function, ReLU activation for the hidden layer, and sigmoid activation for the output layer. The results are averaged over three trials.

Fig. 2: AUC on the test sets for both the centralized and distributed settings on the eight datasets discussed above with varying degrees of feature overlap. For the distributed algorithm, test accuracy results averaged over three random vertical feature splits with and without overlap are presented.
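For completeness, a small sketch of the comparability check based on the SE expression above (written, as in the text, with a single sample-size parameter n following [12]):

```python
import math

def auc_confidence_interval(theta_d, n):
    """Symmetric 95% CI for the distributed test AUC from the SE formula above."""
    q1 = theta_d / (2.0 - theta_d)
    q2 = 2.0 * theta_d ** 2 / (1.0 + theta_d)
    se = math.sqrt((theta_d * (1 - theta_d)
                    + (n - 1) * (q1 - theta_d ** 2)
                    + (n - 1) * (q2 - theta_d ** 2)) / (n * n))
    return theta_d - 1.96 * se, theta_d + 1.96 * se

def comparable(theta_c, theta_d, n):
    """True if the centralized AUC lies inside the distributed AUC's 95% CI."""
    lo, hi = auc_confidence_interval(theta_d, n)
    return lo <= theta_c <= hi
```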
Empirical Convergence of DMLP Algorithm: To test the convergence of the DMLP algorithm, we measure the difference in the loss L between the distributed and the centralized models as training progresses.

Fig. 3: Verification of the empirical convergence of the distributed algorithm.
Effect of Overlap of Features: We study the impact of the overlap of features at each node on the performance of the consensus algorithm using the overlap ratio parameter. An overlap ratio of 0 indicates that the features present at one node are not present at any other node, i.e., the feature space is partitioned with mutual exclusivity. On the other hand, an overlap ratio greater than 0 indicates that a subset of the feature space is shared among all nodes. When the overlap ratio is 1, all data is available at all nodes but the model partition still exists. Table 4 presents the results of the experiments with the overlap ratio set to 0.2, and Figure 2 illustrates the effect of the overlap ratio parameter on the performance of the algorithm. In general, it is observed that when the overlap ratio is incremented by a factor of 0.2, the AUC on the test set gradually improves. Our results reveal that, in general, the overlap of features amongst nodes is beneficial and boosts the performance of the consensus algorithm. However, this behavior is not consistent for highly nonlinear datasets (such as Madelon) and those which have a very large number of features (such as Dorothea), wherein the performance decreases as overlap increases and overfitting sets in.
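One plausible way to generate vertical splits with a given overlap ratio (an illustrative construction; the paper does not spell out the exact splitting procedure) is to replicate a shared fraction of the features at every node and partition the remaining features disjointly:

```python
import numpy as np

def overlap_split(n_features, m, overlap_ratio, seed=0):
    """Return m feature-index sets: a fraction `overlap_ratio` of the features is
    shared by all nodes, the rest is partitioned disjointly across the m nodes."""
    rng = np.random.default_rng(seed)
    cols = rng.permutation(n_features)
    n_shared = int(round(overlap_ratio * n_features))
    shared, private = cols[:n_shared], cols[n_shared:]
    blocks = np.array_split(private, m)
    return [np.concatenate([shared, b]) for b in blocks]

splits = overlap_split(n_features=100, m=10, overlap_ratio=0.2)
print(len(splits), len(splits[0]))   # 10 nodes, 28 features each (20 shared + 8 private)
```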
Comparison with feature sub-space learning algorithms
Given that the data partition at each node involves exploring a subset of the feature space, we compare the consensus algorithm to state-of-the-art tree-based algorithms which learn on feature subspaces (such as Random Forests and XGBoost). The results are presented in Table 3. We observe that the consensus algorithm has comparable performance to RF and XGBoost on all the datasets, except Dexter and Madelon, two particularly difficult datasets with no informative features [11].
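The tree-based baselines can be reproduced along these lines (a sketch assuming the scikit-learn and xgboost packages; the hyperparameters are illustrative, not the paper's settings):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def tree_baselines(X_train, y_train, X_test, y_test, seed=0):
    """Test AUC of centralized tree-based baselines; labels assumed in {0, 1}."""
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(X_train, y_train)
    xgb = XGBClassifier(n_estimators=200, random_state=seed)
    xgb.fit(X_train, y_train)
    rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    xgb_auc = roc_auc_score(y_test, xgb.predict_proba(X_test)[:, 1])
    return rf_auc, xgb_auc
```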
6 Conclusion

This paper presents an algorithm for learning consensus based multi-layer perceptrons on resource constrained edge devices. The devices (nodes) are arranged in a network and contain vertically partitioned data. Each node constructs a local model by feed forward learning, exchanges losses with a randomly chosen neighbor, averages the losses, and uses this new loss for back propagation. Empirical results on several real world datasets reveal that the consensus algorithm has performance comparable to its centralized counterpart and to tree-based learning algorithms. Future work involves exploring the possibility of extending the algorithm to other kinds of neural networks and developing distributed perceptrons for multi-class classification.
References
1. D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
2. R. Bekkerman, M. Bilenko, and J. Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011.
3. A. Bellet, R. Guerraoui, M. Taziki, and M. Tommasi. Personalized and private peer-to-peer machine learning. In International Conference on Artificial Intelligence and Statistics, AISTATS, volume 84, pages 473–481, 2018.
4. M. Blot, D. Picard, N. Thome, and M. Cord. Distributed optimization for deep learning with gossip exchange. Neurocomputing, 330:287–296, 2019.
5. A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, July 1997.
6. L. Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001.
7. T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
8. J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, pages 1223–1231, 2012.
9. A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In ACM Symposium on Principles of Distributed Computing, pages 1–12, 1987.
10. O. Gupta and R. Raskar. Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications, 116:1–8, 2018.
11. I. Guyon, S. Gunn, A. B. Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Proceedings of the 17th International Conference on Neural Information Processing Systems, NIPS'04, pages 545–552, 2004.
12. J. A. Hanley and B. McNeil. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148:839–843, 1983.
13. R. Huerta, T. Mosqueiro, J. Fonollosa, N. F. Rulkov, and I. Rodríguez-Luján. Online decorrelation of humidity and temperature in chemical sensors for continuous monitoring. Chemometrics and Intelligent Laboratory Systems, 157:169–176, 2016.
14. Z. Jiang, A. Balu, C. Hegde, and S. Sarkar. Collaborative deep learning in fixed topology networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5906–5916, 2017.
15. P. Kairouz, H. B. McMahan, et al. Advances and open problems in federated learning. ArXiv, abs/1912.04977, 2019.
16. D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In IEEE Symposium on Foundations of Computer Science, pages 482–491, 2003.
17. R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 456–464, 2010.
18. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 1273–1282, 2017.
19. A. Montresor and M. Jelasity. PeerSim: A scalable P2P simulator. In Proceedings of the 9th International Conference on Peer-to-Peer Computing (P2P'09), pages 99–100, September 2009.
20. A. Provodin, L. Torabi, B. Flepp, Y. LeCun, M. Sergio, L. D. Jackel, U. Muller, and J. Zbontar. Fast incremental learning for off-road robot navigation. CoRR, abs/1606.08057, 2016.
21. F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014, September 2014.
22. D. P. Sutton, M. C. Carlisle, T. A. Sarmiento, and L. C. Baird. Partitioned neural networks. In Proceedings of the 2009 International Joint Conference on Neural Networks, IJCNN'09, pages 2870–2875, 2009.
23. S. Teerapittayanon, B. McDanel, and H. T. Kung. Distributed deep neural networks over the cloud, the edge and end devices. In IEEE International Conference on Distributed Computing Systems (ICDCS), pages 328–339, 2017.
24. X. Wang, Y. Han, V. C. M. Leung, D. Niyato, X. Yan, and X. Chen. Convergence of edge computing and deep learning: A comprehensive survey. IEEE Communications Surveys & Tutorials, 22(2):869–904, 2020.
25. W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems 30, pages 1509–1519, 2017.
26. Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning. ACM Transactions on Intelligent Systems and Technology (TIST), 10:1–19, 2019.
27. W. Zhang, S. Gupta, X. Lian, and J. Liu. Staleness-aware async-SGD for distributed deep learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 2350–2356, 2016.
28. D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS'03, pages 321–328, 2003.
29. S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
30. S. Zilberstein. Using anytime algorithms in intelligent systems. AI Magazine, 17(3):73–83, 1996.