Self-organizing Democratized Learning: Towards Large-scale Distributed Learning Systems
Minh N. H. Nguyen, Shashi Raj Pandey, Tri Nguyen Dang, Eui-Nam Huh, Choong Seon Hong, Nguyen H. Tran, Walid Saad
Minh N. H. Nguyen, Shashi Raj Pandey, Tri Nguyen Dang, Eui-Nam Huh, and Choong Seon Hong
Department of Computer Science and Engineering, Kyung Hee University, Yongin-si 17104, South Korea
{minhnhn, shashiraj, trind, johnhuh, cshong}@khu.ac.kr
Nguyen H. Tran
School of Computer Science, The University of Sydney, Sydney, NSW 2006, Australia
[email protected]
Walid Saad
Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA 24060, USA
[email protected]
Abstract
Emerging cross-device artificial intelligence (AI) applications require a transition from conventional centralized learning systems towards large-scale distributed AI systems that can collaboratively perform complex learning tasks. In this regard, democratized learning (Dem-AI) (Nguyen et al., 2020) lays out a holistic philosophy with underlying principles for building large-scale distributed and democratized machine learning systems. The outlined principles are meant to provide a generalization of distributed learning that goes beyond existing mechanisms such as federated learning. Inspired by this philosophy, a novel distributed learning approach is proposed in this paper. The approach consists of a self-organizing hierarchical structuring mechanism based on agglomerative clustering, a hierarchical generalization mechanism, and a corresponding learning mechanism. Subsequently, a hierarchical generalized learning problem in recursive form is formulated and shown to be approximately solved using the solutions of distributed personalized learning problems and a hierarchical generalized averaging mechanism. To that end, a distributed learning algorithm, namely DemLearn, and its variant, DemLearn-P, are proposed. Extensive experiments on the benchmark MNIST and Fashion-MNIST datasets show that the proposed algorithms achieve better generalization performance of the learning models at agents compared to conventional FL algorithms. Detailed analysis provides useful configurations to further tune both the generalization and specialization performance of the learning models in Dem-AI systems.

1 Introduction

Nowadays, AI has grown to be successful in solving complex real-life problems such as decision support in healthcare systems, advanced control in automation systems, robotics, and telecommunications. Numerous existing mobile applications incorporate AI modules that leverage users' data for personalized services, such as the Gboard mobile keyboard on Android, the QuickType keyboard, and the vocal classifier for Siri on iOS [1]. By exploiting the unique features and personalized characteristics of users, these applications not only improve the personal experience of the users but also provide better control over their devices. Moreover, rising concerns about data privacy in existing machine learning frameworks have fueled a growing interest in developing distributed machine learning paradigms such as federated learning (FL) frameworks [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11].

FL was first introduced in [2], where the learning agents coordinate via a central server to train a global learning model in a distributed manner. These agents receive the global learning model from the central server and perform local learning based on their available datasets. Then, they send back the updated learning models to the server without revealing their personal data.
Accordingly, the global model at the central server is updated based on the aggregation of the local learning models. In practice, the local dataset collected at each agent is unbalanced, highly personalized for some applications such as handwriting and voice recognition, and exhibits non-i.i.d. (non-independent and non-identically distributed) characteristics. Therefore, the iterative process of updating the global model improves the generalization of the model but hurts the personalized performance at the agents [1]. Hence, existing FL algorithms cannot efficiently handle the underlying cohesive relation between the generalization and personalization (or specialization) abilities of the trained learning model [1]. To the best of our knowledge, the work in [8] was the first attempt to study and improve the personalized performance of FL using a personalized federated averaging (Per-FedAvg) algorithm based on a meta-learning framework (MLF). Furthermore, in a recent work [9], the authors propose an adaptive personalized federated learning framework in which a mixture of the local and global models is adopted to reduce the generalization error. However, similar to [8], the cohesive relation between generalization and personalization was not adequately analyzed.

To better analyze the learning performance and scale up the FL framework, the Dem-AI philosophy discussed in [12] introduces a holistic approach and general guidelines for developing distributed and democratized learning systems. The approach draws on observations about the generalization and specialization capabilities of biological intelligence, and on the hierarchical structure of society and swarm intelligence systems, for solving complex tasks in large-scale learning systems. Fig. 1 illustrates the analogy between a Dem-AI system and the hierarchical structure of an organization. The products or outputs of the groups in a Dem-AI system are the specialized learning models created by the group members. In this paper, inspired by the Dem-AI guidelines, we develop a novel distributed learning framework that directly extends the conventional FL scheme with a common learning task for all learning agents. Different from existing FL algorithms that build a single generalized model (a.k.a. a global model), we maintain self-organizing hierarchical group models. Accordingly, we adopt agglomerative hierarchical clustering [13] and periodically update the hierarchical structure based on the similarity in the learning characteristics of users. In particular, we propose hierarchical generalization and learning problems for each generalized level in a recursive form. To solve the formulated problem, which is complex due to its recursive structure, we develop a distributed learning algorithm, DemLearn, and its variant, DemLearn-P. We adopt a bottom-up scheme that iteratively performs local learning by solving personalized learning problems and hierarchical averaging to update the generalized models of groups at higher levels. With extensive experiments, we validate both the specialization and generalization performance of all learning models using DemLearn and DemLearn-P on the benchmark MNIST and Fashion-MNIST datasets.

To that end, we discuss the preliminaries of democratized learning in Section 2. Based on the Dem-AI guidelines, we formulate hierarchical generalized and personalized learning problems and propose a novel distributed learning algorithm in Section 3.
We validate the efficacy of our proposed algorithm in terms of both the specialization and generalization performance of the client, group, and global models, compared to conventional FL algorithms, in Section 4. Finally, Section 5 concludes the paper.

Figure 1: Analogy of a hierarchical distributed learning system.

2 Preliminaries of Democratized Learning
Different from FL, the Dem-AI framework [12] introduces a self-organizing hierarchical structure for solving common single or multiple complex learning tasks by mediating contributions from a large number of learning agents in collaborative learning. Moreover, it unlocks the following features of democracy in future distributed learning systems. According to the differences in their characteristics, learning agents form appropriate groups that can be specialized for the learning tasks. These specialized groups are self-organized in a hierarchical structure and collectively construct shared generalized learning knowledge to improve their learning performance by reducing individual biases due to unbalanced, highly personalized local data. In particular, the learning system allows new group members to: a) speed up their learning process with the existing group knowledge, and b) incorporate their new learning knowledge to expand the generalization capability of the whole group. We include a brief summary of Dem-AI concepts and principles [12] in the following discussion.
Democratized Learning (Dem-AI in short) studies dual (coupled and working together) specialized-generalized processes in a self-organizing hierarchical structure of large-scale distributed learning systems. The specialized and generalized processes must operate jointly towards an ultimate learning goal, identified as performing collective learning from biased learning agents who are committed to learning from their own data using their limited learning capabilities. As such, the ultimate learning goal of the Dem-AI system is to establish a mechanism for collectively solving common (single or multiple) complex learning tasks from a large number of learning agents.
Dem-AI Meta-Law: A meta-law is defined as a mechanism that can be used to manipulate the transition between the dual specialized-generalized processes of the Dem-AI system and that provides the necessary information to regulate the self-organizing hierarchical structuring mechanism.

Specialized Process:
This process is used to leverage the specialized learning capabilities of the learning agents and specialized groups by exploiting their collected data. This process also drives the hierarchical structure of specialized groups, with many levels of relevant generalized knowledge, to become stable and well-separated. By incorporating the generalized knowledge of higher-level groups created by the generalization mechanism, the learning agents can update their model parameters so as to reduce biases in their personalized learning. Thus, the personalized learning objective has two goals: 1) to perform specialized learning, and 2) to reuse the available hierarchical generalized knowledge. Notably, throughout the training process, the generalized knowledge becomes less important compared to the specialized learning goal, and a more stable hierarchical structure is formed.
Generalized Process:
This process is used to regulate the generalization mechanism for all existing specialized groups as well as the plasticity level of all groups. Here, group plasticity pertains to the ease with which learning agents can change their groups. The generalization mechanism encourages group members to share knowledge when performing learning tasks with similar characteristics and to construct hierarchical levels of generalized knowledge. The generalized knowledge helps the Dem-AI system maintain the generalization ability needed to reduce the biases of learning agents and to efficiently deal with environmental changes or perform new learning tasks. Thus, the hierarchical generalized knowledge can be constructed based on the contribution of the group members, which is driven by the plasticity force. This force is characterized by creative attributes, knowledge exploration, multi-task capability, and survival in uncertainty.

Figure 2: The illustration of the transition in the Dem-AI principle.
Self-organizing Hierarchical Structure:
The hierarchical structure of specialized groups and the relevant generalized knowledge are constructed and regulated following a self-organization principle based on the similarity of learning agents. In particular, this principle governs the union of small groups to form a bigger group, which eventually enhances the generalization capabilities of all members. Thus, specialized groups at higher levels in the hierarchical structure have more members and can construct more generalized (less biased) knowledge.
Transition in the dual specialized-generalized process:
The specialized process becomes increasingly important compared to the generalized process during training. As a result, the learning system evolves to gain specialization capabilities for the training tasks, but it also loses some capability to deal with environmental changes such as unseen data, new learning agents, and new learning tasks. Meanwhile, the hierarchical structure of the Dem-AI system is self-organized and evolves from a high level of plasticity to a high level of stability, i.e., from unstable specialized groups to well-organized specialized groups. The transition of the Dem-AI learning system is illustrated in Fig. 2 with three iterative sub-mechanisms: the generalization, specialized learning, and hierarchical structuring mechanisms. Accordingly, the hierarchical group structure is self-organized based on the similarities in the learning characteristics of agents. The generalization mechanism helps construct the hierarchical generalized knowledge from learning agents towards the top generalized level. Meanwhile, the specialized learning mechanism instantiates the personalized and specialized group learning to exploit each agent's local data and incorporate the knowledge from higher-level groups.

In the next section, we develop a democratized learning design that results in a hierarchical generalized learning problem. To that end, we propose a novel democratized learning algorithm, DemLearn, as an initial realization of the Dem-AI philosophy.

3 Democratized Learning Design

The Dem-AI philosophy and guidelines in [12] envision different designs for a variety of applications and learning tasks. As an initial implementation, we focus on developing a novel distributed learning algorithm that consists of the following hierarchical clustering, hierarchical generalization, and learning mechanisms, with a common learning task for all learning agents.

3.1 Hierarchical Clustering Mechanism
To construct the hierarchical structure of the Dem-AI system with relevant specialized learning groups, we adopt the commonly used agglomerative hierarchical clustering algorithm (i.e., the dendrogram implementation from scikit-learn [13, 14]), based on the similarity or dissimilarity of all learning agents. The dendrogram method examines the similarity relationships among individuals and is often used for cluster analysis in many fields of research. During implementation, the dendrogram tree topology is built up by merging the pairs of agents or clusters having the smallest distance between them, following a bottom-up scheme. Accordingly, the measured distance captures the differences in the characteristics of learning agents (e.g., local model parameters or gradients of the learning objective function). Since we obtain similar performance when clustering based on model parameters or gradients, in what follows, we only present the clustering mechanism using the local model parameters. Additional discussion of gradient-based clustering is provided in the supplementary material.

Given the local model parameters $w_n = (w_{n,1}, \ldots, w_{n,M})$ of learning agent $n$, where $M$ is the number of learning parameters, the measured distance $\phi_{n,l}$ between two agents is derived from the Euclidean distance as $\phi_{n,l} = \|w_n - w_l\|$. In addition, we consider the average-linkage method [15] for the distance calculation between an agent and a cluster, using the Euclidean distance between the model parameters of the agent and the average model parameters of the cluster members. Accordingly, the hierarchical tree structure takes the form of a binary tree with many levels. In consequence, it would require unnecessarily high storage and computational cost to maintain, and it is inefficient to keep a large number of low-level generalized models for small groups, whose collective results are minor. As a result, we keep only the top $K$ levels of the tree structure and discard the lower-level structure, as can be defined in the Dem-AI meta-law. Therefore, at level 1, the system could have several big groups with a large number of learning agents. A minimal sketch of this clustering step is given below.
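For concreteness, the following Python sketch illustrates this clustering step. The helper name build_hierarchy and the cut rule that requests $2^{K-k}$ groups at level $k$ are our own illustrative assumptions; the paper only specifies agglomerative clustering on the flattened model parameters with Euclidean distance and average linkage, keeping the top $K$ levels.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def build_hierarchy(client_weights, num_levels):
    """Agglomerative clustering of clients from flattened model parameters.

    client_weights: (N, M) array, one row of M parameters per client.
    num_levels:     number of generalized levels K kept at the top of the tree.
    Returns a list of label arrays, one per level k = 1..K-1 (level K is the
    single root group containing every client, so it needs no labels).
    """
    # Average linkage with Euclidean distance, as in the paper.
    Z = linkage(client_weights, method="average", metric="euclidean")
    levels = []
    for k in range(1, num_levels):
        # Cut the binary dendrogram into a fixed number of groups per level;
        # the 2**(K - k) rule is an assumption standing in for the meta-law.
        labels = fcluster(Z, t=2 ** (num_levels - k), criterion="maxclust")
        levels.append(labels)
    return levels

# Example: 50 clients with 1000 parameters each, a K = 3 level hierarchy.
rng = np.random.default_rng(0)
membership = build_hierarchy(rng.normal(size=(50, 1000)), num_levels=3)
```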
3.2 Hierarchical Generalization and Learning Mechanisms

The $K$-level hierarchical structure emerges via agglomerative clustering. Accordingly, the system constructs $K$ levels of generalization, as in Fig. 2. As such, we propose hierarchical generalized learning problems (HGLP) to build the generalized models for the specialized groups in a recursive form, starting from the construction of the global model $w^{(K)}$ at the top level $K$ as follows:

HGLP problem at level $K$:
$$\min_{w^{(K)},\, w^{(K-1)}_1, \ldots,\, w^{(K-1)}_{|\mathcal{S}_K|}} \; L^{(K)} = \sum_{i \in \mathcal{S}_K} \frac{N^{(K-1)}_{g,i}}{N^{(K)}_g} \, L^{(K-1)}_i\big(w^{(K-1)}_i \,\big|\, \mathcal{D}^{(K-1)}_i\big) \qquad (1)$$
$$\text{s.t.} \quad w^{(K)} = w^{(K-1)}_i, \;\; \forall i \in \mathcal{S}_K; \qquad (2)$$

where $\mathcal{S}_K$ is the set of subgroups of the top-level group and $L^{(K-1)}_i$ is the loss function of subgroup $i$ given its collective dataset $\mathcal{D}^{(K-1)}_i$. The objective function weights each subgroup by the fraction of its number of learning agents $N^{(K-1)}_{g,i}$ over the total number of learning agents $N^{(K)}_g$ in the system. Hence, subgroups with more learning agents have a higher impact on the generalized model at level $K$. The hard constraints in (2) enforce these subgroups to share a common learning model (i.e., a global variable $w^{(K)}$). To preserve the specialization capabilities of each subgroup, the constraints (2) can be relaxed by using additional proximal terms in the objective. In this way, the problem encourages the subgroup learning models to become close to the global model but not necessarily equal. Thus, the relaxed problem HGLP' is defined as follows:
HGLP' problem at level $K$:
$$\min_{w^{(K)},\, w^{(K-1)}_1, \ldots,\, w^{(K-1)}_{|\mathcal{S}_K|}} \; \sum_{i \in \mathcal{S}_K} \frac{N^{(K-1)}_{g,i}}{N^{(K)}_g} \Big( L^{(K-1)}_i\big(w^{(K-1)}_i \,\big|\, \mathcal{D}^{(K-1)}_i\big) + \mu \big\| w^{(K-1)}_i - w^{(K)} \big\|^2 \Big). \qquad (3)$$

Since the dataset is distributed and only available at the learning agents, the problem (3) at the top level $K$ must be solved starting from the problems of its members. Accordingly, the hierarchical generalized structure emerges naturally following the bottom-up scheme, where the learning models at lower levels are updated before the higher-level generalized problems of their super-group are solved. Specifically, problem (3) can be decentralized and iteratively solved via the following problem for each subgroup $i$ at level $K-1$, after which the global model $w^{(K)}$ is updated as an average of the subgroup models.

HGLP problem at level $K-1$:
$$\min_{w^{(K-1)}_i,\, w^{(K-2)}_1, \ldots,\, w^{(K-2)}_{|\mathcal{S}_{i,K-1}|}} \; \sum_{j \in \mathcal{S}_{i,K-1}} \bigg[ \frac{N^{(K-2)}_{g,j}}{N^{(K-1)}_{g,i}} \Big( L^{(K-2)}_j\big(w^{(K-2)}_j \,\big|\, \mathcal{D}^{(K-2)}_j\big) + \mu \big\| w^{(K-2)}_j - w^{(K-1)}_i \big\|^2 \Big) + \mu \frac{N^{(K-2)}_{g,j}}{N^{(K)}_{g}} \big\| w^{(K-2)}_j - w^{(K)} \big\|^2 \bigg].$$

Therefore, we obtain a general approximate form of the generalized learning problem for group $i$ at level $k$, given the prior higher-level generalized models $w^{(k+1)}, \ldots, w^{(K)}$, as follows:

HGLP problem at level $k$:
$$\min_{w^{(k)}_i,\, w^{(k-1)}_1, \ldots,\, w^{(k-1)}_{|\mathcal{S}_{i,k}|}} \; \sum_{j \in \mathcal{S}_{i,k}} \bigg( \frac{N^{(k-1)}_{g,j}}{N^{(k)}_{g,i}} L^{(k-1)}_j\big(w^{(k-1)}_j \,\big|\, \mathcal{D}^{(k-1)}_j\big) + \mu \sum_{h=k}^{K} \frac{N^{(k-1)}_{g,j}}{N^{(h)}_{g,i}} \big\| w^{(k-1)}_j - w^{(h)}_i \big\|^2 \bigg);$$

where $N^{(h)}_{g,i}$ and $w^{(h)}_i$ are, respectively, the number of learning agents and the learning model of the level-$h$ group to which subgroup $i$ belongs. Arguably, higher-level groups, which have more learning agents, have a lower influence on the generalized model construction at level $k$. Here, the common parameter $\mu$ is used at each level. Since the learning problems are solved at the learning agents, the proximal terms of the higher-level groups are equally distributed to all of the learning agents belonging to them. The hierarchical recursive problem at each level $k$ can be approximately solved by alternately updating the lower-level models $w^{(k-1)}_j$ and then averaging them to obtain the updated learning model as follows:
$$w^{(k)}_i = \sum_{j \in \mathcal{S}_{i,k}} \frac{N^{(k-1)}_{g,j}}{N^{(k)}_{g,i}} \, w^{(k-1)}_j. \qquad (4)$$

At the lowest level, each learning agent $n$ learns its model to fit its local data by solving the personalized learning problem, given the latest hierarchical generalized models, as follows:

Personalized Learning Problem (PLP) at level 0:
$$w^{(0)}_n = \arg\min_{w} \; L^{(0)}_n\big(w \,\big|\, \mathcal{D}^{(0)}_n\big) + \mu \sum_{k=1}^{K} \frac{1}{N^{(k)}_{n,g}} \big\| w - w^{(k)}_n \big\|^2; \qquad (5)$$

where $L^{(0)}_n$ is the personalized loss function for the learning task (e.g., classification) given the personalized dataset $\mathcal{D}^{(0)}_n$, and $N^{(k)}_{n,g}$ is the number of learning agents of the level-$k$ group to which agent $n$ belongs. At this personalized level, the number of group members is one. To relate this formulation to conventional FL, a worked special case follows.
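As a sanity check, consider the degenerate case $K = 1$ with a single root group of all $N$ agents. This worked example is ours; it is grounded only in the paper's remark that $\mu = 0$ recovers FedAvg and that the algorithm is inspired by FedProx.

```latex
% Worked special case K = 1: one root group with model w^{(1)} and N agents,
% so N^{(1)}_{n,g} = N for every agent n, and the PLP in (5) reduces to
\min_{w}\; L^{(0)}_n\!\left(w \,\middle|\, \mathcal{D}^{(0)}_n\right)
          + \frac{\mu}{N}\,\bigl\lVert w - w^{(1)} \bigr\rVert^{2},
% a FedProx-style local objective with proximal weight \mu / N, while the
% averaging step (4) becomes w^{(1)} = \tfrac{1}{N} \sum_{n=1}^{N} w^{(0)}_n,
% i.e., plain federated averaging; setting \mu = 0 recovers FedAvg.
```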
Inspired by the FedAvg [16] and FedProx [17] algorithms, we adopt our aforementioned recursive analysis and the hierarchical clustering mechanism to develop a novel democratized learning algorithm, namely DemLearn. The details of our proposed algorithm are presented in Alg. 1. Specifically, the DemLearn algorithm first initializes the local learning models by leveraging the latest hierarchical generalized group models and the prior local learning model, controlled by the decaying parameter $\beta_t$ in equation (6). A normalization term $B$ keeps the generalized models in the range of the local model parameters. During training, the prior local knowledge becomes increasingly more important than the reused generalized group models, so as to improve the specialized performance. Using this initial model, each agent iteratively solves the PLP problem in equation (7) based on a gradient method. Here, given $\mu > 0$, we define the proximal variant of DemLearn, namely DemLearn-P. Thereafter, these updates are sent to the central server to perform hierarchical aggregation from generalized level 1 up to level $K$. The generalized learning models of the groups are updated via equation (8). After every $\tau$ global rounds, the hierarchical structure is reconstructed according to the changes in the personalized learning models of the agents.

Algorithm 1 Democratized Learning (DemLearn)
Input: $K$, $T$, $\tau$.
for $t = 0, \ldots, T-1$ do
    for learning agent $n = 1, \ldots, N$ do
        Agent $n$ receives the higher-level generalized models $w^{(1)}_{n,t}, \ldots, w^{(K)}_{n,t}$ of its super-groups and updates its local model as
        $$w^{(0)}_{n,t+1} = (1 - \beta_t)\, w^{(0)}_{n,t} + \frac{\beta_t}{B} \sum_{k=1}^{K} \frac{1}{N^{(k)}_{g,n}}\, w^{(k)}_{n,t}, \quad \text{where } B = \sum_{k=1}^{K} \frac{1}{N^{(k)}_{g,n}}; \qquad (6)$$
        Agent $n$ iteratively updates its personalized learning model $w^{(0)}_n$ as an inexact (i.e., gradient-based) minimizer of the following problem:
        $$w^{(0)}_{n,t+1} \approx \arg\min_{w} \; L^{(0)}_n\big(w \,\big|\, \mathcal{D}_n\big) + \mu \sum_{k=1}^{K} \frac{1}{N^{(k)}_{g,n}} \big\| w - w^{(k)}_{n,t} \big\|^2; \qquad (7)$$
        Agent $n$ sends the updated learning model to the server;
    end for
    if ($t \bmod \tau = 0$) then
        Server reconstructs the hierarchical structure by the clustering algorithm;
    end if
    Each group $i$ at each generalized level $k$ updates its learning model as
    $$w^{(k)}_{i,t+1} = \sum_{j \in \mathcal{S}_{i,k}} \frac{N^{(k-1)}_{g,j}}{N^{(k)}_{g,i}} \, w^{(k-1)}_{j,t+1}; \qquad (8)$$
end for
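To make the algorithm concrete, here is a minimal Python sketch of one DemLearn global round under simplifying assumptions: the names (demlearn_round, membership, local_grad) are ours, clients are simulated in one process, group models are recomputed on the fly from the client models rather than stored server-side, and the inexact PLP solver is a few plain gradient steps.

```python
import numpy as np

def demlearn_round(W, membership, mu, beta_t, local_grad, lr=0.01, steps=5):
    """One DemLearn global round (a sketch; data layout and names are ours).

    W          : (N, M) array of client models w^(0)_n.
    membership : (N, K) int array; membership[n, k-1] is the id of the
                 level-k group containing client n (level K: one root group).
    local_grad : callable(n, w) -> gradient of client n's local loss L^(0)_n.
    mu         : proximal weight (mu = 0: DemLearn, mu > 0: DemLearn-P).
    """
    N, K = membership.shape

    def group_model(k, gid):
        # Eq. (8) unrolled: since subgroups nest, the size-weighted average of
        # level-(k-1) subgroup models equals the mean over member clients.
        members = membership[:, k - 1] == gid
        return W[members].mean(axis=0), int(members.sum())

    W_new = W.copy()
    for n in range(N):
        gens, inv_sizes = [], []
        for k in range(1, K + 1):
            w_k, n_k = group_model(k, membership[n, k - 1])
            gens.append(w_k)
            inv_sizes.append(1.0 / n_k)   # bigger (higher) groups weigh less
        B = sum(inv_sizes)
        # Eq. (6): mix the prior local model with the generalized models.
        w = (1 - beta_t) * W[n] + (beta_t / B) * sum(
            a * g for a, g in zip(inv_sizes, gens))
        # Eq. (7): inexact proximal minimization by a few gradient steps.
        for _ in range(steps):
            prox = 2 * mu * sum(a * (w - g) for a, g in zip(inv_sizes, gens))
            w = w - lr * (local_grad(n, w) + prox)
        W_new[n] = w
    return W_new
```

Here membership could be, for example, the level labels from the earlier build_hierarchy sketch stacked column-wise with a constant column appended for the root group; in the paper's setting the server, not the clients, performs the averaging, and the structure is rebuilt every $\tau$ rounds.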
4 Experiments

In this section, we validate the efficacy of our DemLearn algorithm on the MNIST [18] and Fashion-MNIST [19] datasets for handwritten digit and fashion image recognition, respectively. We conduct the experiments with 50 clients, where each client holds only a small median number of data samples for the MNIST and Fashion-MNIST datasets, respectively; a portion of the data samples is held out for model testing. We divide the total dataset such that each client has a small amount of data from two specific labels out of the ten in both datasets. In doing so, we replicate a scenario of biased personal datasets, i.e., highly unbalanced data and a small number of training samples per user. The learning models consist of two convolution layers followed by two pooling layers (a sketch of this architecture is given below). We set the update period $\tau = 2$ and validate the performance of the proposed algorithm with 4 generalized levels. Our implementation is developed based on the available code of FedProx [17]. The Python implementation of our proposal and the data used are available at https://github.com/nhatminh/Dem-AI.
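The paper only states that the client model has two convolution layers followed by two pooling layers; the following PyTorch sketch fills in channel widths, kernel sizes, and the classifier head with our own assumptions for 28x28 grayscale inputs and ten classes.

```python
import torch.nn as nn

class ClientCNN(nn.Module):
    """Two conv layers, each followed by max pooling, then a linear head.
    Channel widths (16, 32), 5x5 kernels, and the hidden size are our
    assumptions; only the two-conv/two-pool structure is from the paper."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                      # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                      # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```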
Existing FL approaches such as FedProx and FedAvg focus more on the learning performance of the global model than on the learning performance at the clients. Therefore, targeting forthcoming personalized applications, we implement DemLearn to measure the learning performance of all client and group models. In particular, we evaluate the average specialization (C-SPE) and generalization (C-GEN) of the learning models at agents, defined as the performance on their local test data only and on the collective test data from all agents in the region, respectively. Accordingly, we denote by Global the global model performance, and by G-GEN and G-SPE the average generalization and specialization performance of the group models, respectively.

In Fig. 3, we compare the performance of our proposed methods, DemLearn and DemLearn-P, with the two conventional FL methods, FedAvg [16] and FedProx [17], as baselines on the two benchmark datasets, MNIST and Fashion-MNIST.

Figure 3: Performance comparison of DemLearn and DemLearn-P versus FedAvg and FedProx (testing accuracy of the Global, G-GEN, G-SPE, C-SPE, and C-GEN models): (a) MNIST; (b) Fashion-MNIST.
Figure 4: Comparison of algorithms for a fixed and a self-organizing hierarchical structure (testing accuracy of Global, G-GEN, G-SPE, C-SPE, and C-GEN for DemLearn, DemLearn: Fixed, DemLearn-P, and DemLearn-P: Fixed).

In doing so, we set the associated proximal term $\mu = 0$ in the objective function of problem (7) to define DemLearn, which reduces the local problem to that of FedAvg, while $\mu > 0$ gives the proximal variant, namely DemLearn-P. Fig. 3a depicts the performance comparison of DemLearn and DemLearn-P with FedProx and FedAvg on the MNIST dataset. The experimental evaluations show that the proposed approach outperforms the two baselines in terms of convergence speed, and especially obtains better generalization performance. We observe that the Global model requires only a few global rounds to reach a given performance level using our proposed algorithms, while FL algorithms such as FedProx and FedAvg take considerably more global rounds to achieve a competitive level of performance. Furthermore, between our algorithms, DemLearn shows better average generalization performance of the client models and of the global model, while DemLearn-P shows better average specialization performance of the client models at the converged points. As observed, the FL algorithms attain superior performance only in the specialization of their client models, whose generalization performance improves slowly. Following similar trends, Fig. 3b shows that the DemLearn and DemLearn-P algorithms perform better than FedProx and FedAvg on the Fashion-MNIST dataset in terms of client generalization and global model performance.

In Fig. 4, we evaluate and compare the performance of the proposed algorithms for a fixed and for a self-organizing hierarchical structure with periodic reconstruction (i.e., $\tau = 2$). We observe that both DemLearn and DemLearn-P benefit from the self-organizing mechanism, which provides better generalization capability of the client models. We note that the learning performance for a fixed hierarchical structure is slightly unstable compared to the self-organizing one.

5 Conclusion

The novel Dem-AI philosophy has provided general guidelines for the specialized, generalized, and self-organizing hierarchical structuring mechanisms of large-scale distributed machine learning systems. Inspired by these guidelines, we have formulated hierarchical generalized learning problems and developed a novel distributed learning algorithm, DemLearn, and its variant, DemLearn-P. In this work, based on the similarity in learning characteristics, agglomerative clustering enables the self-organization of learning agents in a hierarchical structure, which is updated periodically. Detailed analysis of the experimental evaluations has shown the advantages and disadvantages of the proposed algorithms. Compared to conventional FL, we show that DemLearn significantly improves the generalization performance of client models. Meanwhile, DemLearn-P shows a fair improvement in generalization without largely compromising the specialization performance of client models. These observations support a better understanding and improvement of the specialization and generalization performance of the learning models in Dem-AI systems. To that end, our initial learning design could be further studied with personalized datasets, extended with multi-task learning capabilities, and validated for actual generalization capabilities in practice with new users and environmental changes.
Broader Impact
Democratized learning provides unique ingredients for developing future flexible distributed intelligent systems. We envision the prevalence of personalized learning agents and a massive amount of user data that need not be revealed or collected as in conventional machine learning paradigms. Through this implementation, we believe that democracy in learning is possible, and that it requires a better understanding in order to realize a sustainable, and also more powerful and flexible, learning paradigm. We acknowledge that our current implementation does not yet cope with multiple learning tasks [3, 20] or with the adaptability of the system to environmental changes, as required of general intelligent systems. To turn general intelligent networks into reality, we need to profoundly analyze Dem-AI from a variety of perspectives, such as the robustness and diversity of the learning models [21] and novel knowledge transfer and distillation mechanisms [22, 23]. In this regard, the developed self-organizing mechanisms could be used to scale up existing ML frameworks such as meta-learning [24] and decentralized generative adversarial networks [25]. These are our next steps towards a learning design that empowers the full capability of the democratized learning philosophy. On the other hand, security and privacy issues are also crucial for distributed systems. In such cases, understanding group behaviours can help fill the gap towards practical applications of distributed learning in large-scale systems.
References

[1] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv:1912.04977, 2019.
[2] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv:1602.05629, 2016.
[3] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S. Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems 30, pages 4424–4434, Dec. 2017.
[4] Nguyen H. Tran, Wei Bao, Albert Zomaya, Minh N. H. Nguyen, and Choong Seon Hong. Federated learning over wireless networks: Optimization model design and analysis. In IEEE Conference on Computer Communications (INFOCOM), pages 1387–1395, Paris, France, April 29–May 2, 2019.
[5] Shashi Raj Pandey, Nguyen H. Tran, Mehdi Bennis, Yan Kyaw Tun, Aunas Manzoor, and Choong Seon Hong. A crowdsourcing framework for on-device federated learning. IEEE Transactions on Wireless Communications, 2020.
[6] Canh Dinh, Nguyen H. Tran, Minh N. H. Nguyen, Choong Seon Hong, Wei Bao, Albert Y. Zomaya, and Vincent Gramoli. Federated learning over wireless networks: Convergence analysis and resource allocation. arXiv:1910.13067, 2019.
[7] Mingzhe Chen, Zhaohui Yang, Walid Saad, Changchuan Yin, H. Vincent Poor, and Shuguang Cui. A joint learning and communications framework for federated learning over wireless networks. arXiv:1909.07972, 2019.
[8] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning: A meta-learning approach. arXiv:2002.07948, 2020.
[9] Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. arXiv:2003.13461, 2020.
[10] Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. Federated learning in mobile edge networks: A comprehensive survey. arXiv:1909.11875, 2019.
[11] Shashi Raj Pandey, Nguyen H. Tran, Mehdi Bennis, Yan Kyaw Tun, Zhu Han, and Choong Seon Hong. Incentivize to build: A crowdsourcing framework for federated learning. Pages 1–6. IEEE, 2019.
[12] Minh N. H. Nguyen, Shashi Raj Pandey, Kyi Thar, Nguyen H. Tran, Mingzhe Chen, Walid Saad, and Choong Seon Hong. Distributed and democratized learning: Philosophy and research challenges. arXiv:2003.09301, 2020.
[13] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, 1999.
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[15] Sanjoy Dasgupta and Philip M. Long. Performance guarantees for hierarchical clustering. Journal of Computer and System Sciences, 70(4):555–569, 2005.
[16] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, April 2017.
[17] Anit Kumar Sahu, Tian Li, Maziar Sanjabi, Manzil Zaheer, Ameet Talwalkar, and Virginia Smith. Federated optimization for heterogeneous networks. arXiv:1812.06127, 2018.
[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[19] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.
[20] Luca Corinzia and Joachim M. Buhmann. Variational federated multi-task learning. arXiv:1906.06268, 2019.
[21] Norman L. Johnson. Diversity in decentralized systems: Enabling self-organizing solutions. In Decentralization II Conference, UCLA, 1999.
[22] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In Lecture Notes in Computer Science, volume 11141, pages 270–279. Springer, 2018.
[23] Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data. arXiv:1811.11479, 2018.
[24] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126–1135, Sydney, NSW, Australia, Aug. 2017.
[25] C. Hardy, E. Le Merrer, and B. Sericola. MD-GAN: Multi-discriminator generative adversarial networks for distributed datasets. Pages 866–877, Brazil, May 2019.

Supplemental
In this subsection, we show the impact of the $\mu$ parameter on the testing accuracy for both datasets. As can be seen in the results, when $\mu$ is large, we obtain low performance of the global and group models, as well as low generalization performance of the client models. Decreasing the value of $\mu$ significantly enhances generalization but also slightly reduces the specialization performance of the client models. Therefore, when $\mu$ is too small, the DemLearn-P algorithm behaves similarly to the DemLearn algorithm and provides the best generalization performance of the client models.
Figure 5: Performance of DemLearn-P when varying $\mu \in \{0.005, 0.002, 0.001, 0.0005\}$ (testing accuracy of Global, G-GEN, G-SPE, C-SPE, and C-GEN): (a) MNIST; (b) Fashion-MNIST.

In addition to the measured distance between learning parameters in the hierarchical clustering algorithm, we can also use the gradient information to reflect the learning directions of each agent. Accordingly, the measured distance $\phi^{(g)}_{n,l}$ between two learning gradients is defined based on the cosine similarity as follows:
$$\phi^{(g)}_{n,l} = \cos(g_n, g_l) = \frac{\sum_{m=1}^{M} g_{n,m}\, g_{l,m}}{\sqrt{\sum_{m=1}^{M} g_{n,m}^2}\,\sqrt{\sum_{m=1}^{M} g_{l,m}^2}}. \qquad (9)$$

Our experimental results demonstrate that the client and group models show almost similar learning performance (see Fig. 6) and a similar trend in cluster topology (see Fig. 7) when using the DemLearn algorithm. At the beginning, the topology shows highly different clusters, reflected by the higher values of the measured distance between them (i.e., the y-axis in Fig. 7). Throughout the training, the topology shrinks due to the reduction in the measured distance among groups; this is notably observed together with a reduction in the specialization performance of client models as a consequence. After 40 global rounds, small groups and clients are united to form big groups. Thus, all clients and groups have similar learning model parameters and gradients. This also explains the convergence behavior of the learning models in the DemLearn algorithm. On the other hand, different from DemLearn, the DemLearn-P algorithm preserves the diversity in learning models or gradients by inducing heterogeneity in the characteristics of groups.
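The following small sketch vectorizes the gradient-based measure in eq. (9); converting the similarity into a dissimilarity $1 - \cos$ before feeding it to the linkage step is our own choice, since hierarchical clustering expects a distance.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

def gradient_dissimilarity(G):
    """Pairwise cosine similarity of flattened client gradients, eq. (9),
    turned into a dissimilarity matrix usable by hierarchical clustering.
    G: (N, M) array whose rows are the clients' gradients g_n."""
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    sims = (G @ G.T) / (norms * norms.T)   # cos(g_n, g_l)
    return 1.0 - sims

# Example: feed the condensed distance matrix to average linkage.
G = np.random.default_rng(0).normal(size=(50, 1000))
D = gradient_dissimilarity(G)
np.fill_diagonal(D, 0.0)                   # remove float fuzz on the diagonal
Z = linkage(squareform(D, checks=False), method="average")
```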
DemLearn: G-Clustering
Client-GEN: W-Clustering
Client-GEN: G-Clustering
Global G-GEN G-SPE C-SPE C-GEN
Figure 6: Comparison of different metrics using in hierarchical clustering. (2)(13)(2)(5)(2)(4)(3)(5)363738(3)(8) (2)(2)(5)(8)3637(4)(8)16(2)(4)(4)6 7(3)(3)
Figure 7: Topology changes via hierarchical clustering in the DemLearn algorithm: (a) hierarchical clustering based on the learning weight parameters of the client models; (b) hierarchical clustering based on the local gradient parameters of the client models.

Evaluation with Fashion-MNIST Dataset
We conduct similar experiments with the Fashion-MNIST dataset. Fig. 8, Fig. 9, and Fig. 10 depict behaviors and trends similar to those discussed for the MNIST dataset in the previous sections.
Figure 8: Comparison of algorithms for a fixed and a self-organizing hierarchical structure (Fashion-MNIST; testing accuracy of Global, G-GEN, G-SPE, C-SPE, and C-GEN).
Figure 9: Comparison of the different metrics used in hierarchical clustering (Fashion-MNIST).
Figure 10: Topology changes via hierarchical clustering in the DemLearn algorithm (Fashion-MNIST): (a) clustering based on the learning weight parameters of the client models; (b) clustering based on the local gradient parameters of the client models.
Figure 11: Hierarchical clustering based on the learning weight parameters of client models using DemLearn.