Network Support for High-performance Distributed Machine Learning
Francesco Malandrino, Senior Member, IEEE, Carla Fabiana Chiasserini, Fellow, IEEE, Nuria Molner, Student Member, IEEE, and Antonio de la Oliva, Member, IEEE

F. Malandrino and C. F. Chiasserini are with CNR-IEIIT and CNIT, Italy. C. F. Chiasserini is with Politecnico di Torino, Italy. N. Molner is with IMDEA Networks, Spain. N. Molner and A. de la Oliva are with Universidad Carlos III de Madrid, Spain. This work was supported through the EU 5Growth project (Grant No. 856709).
Abstract—The traditional approach to distributed machine learning is to adapt learning algorithms to the network, e.g., reducing updates to curb overhead. Networks based on intelligent edge, instead, make it possible to follow the opposite approach, i.e., to define the logical network topology around the learning task to perform, so as to meet the desired learning performance. In this paper, we propose a system model that captures such aspects in the context of supervised machine learning, accounting for both learning nodes (that perform computations) and information nodes (that provide data). We then formulate the problem of selecting (i) which learning and information nodes should cooperate to complete the learning task, and (ii) the number of iterations to perform, in order to minimize the learning cost while meeting the target prediction error and execution time. After proving important properties of the above problem, we devise an algorithm, named DoubleClimb, that can find a 1+1/|I|-competitive solution (with I being the set of information nodes), with cubic worst-case complexity. Our performance evaluation, leveraging a real-world network topology and considering both classification and regression tasks, also shows that DoubleClimb closely matches the optimum, outperforming state-of-the-art alternatives.

Index Terms—Network orchestration, machine learning, edge computing.
I. INTRODUCTION
Owing to the ever-increasing scale and complexity of the learning tasks to perform, machine learning (ML) algorithms have swiftly been extended to work in a distributed fashion, with the purpose of leveraging the computational capability of multiple nodes, possibly across multiple datacenters [1]–[4], and/or allowing nodes belonging to different parties to cooperate in a learning task without sharing sensitive data [5]–[7].

More recently, distributed ML has emerged also as an excellent match for new-generation (5G-and-beyond) networks. It can be used for the management of the network (as envisioned by such initiatives as ETSI ZSM [8], ENI [9], and O-RAN [10]), as well as to enable user services within the so-called intelligent edge [11]. In general, new-generation networks can (a) integrate a wide number of heterogeneous nodes, including those that can provide the data used for ML tasks, (b) provide the distributed computational infrastructure needed to run the ML algorithms (see, e.g., [12]), and (c) be dynamically reconfigured so as to perform the ML task at hand with the required performance.

However, implementing an ML task in a 5G-and-beyond network also poses important challenges. Specifically, it requires defining the logical topology of the nodes that cooperate towards the ML task, i.e., making decisions on:
• which computing nodes in the different locations of the network edge should interact during the learning process;
• how many (and which) data sources to exploit, and which computing nodes should receive their data.
The above decisions influence each other, often in counterintuitive ways: as an example, seeking information from too many nodes may result in longer learning times, due to the additional waiting. Furthermore, a given target learning error (e.g., classification accuracy) may be reached through alternative, completely different approaches, e.g., collecting a significant quantity of information or performing more iterations to process a smaller set of data.

In spite of the wide usage of ML in mobile networks and the considerable attention devoted to it, most of the works aim at exploiting the network more efficiently, e.g., reducing the overhead [1], [13] or dealing with straggling nodes [14]. Just a small number of recent works [5], [15] have characterized the impact of the network topology on the performance of distributed ML, providing interesting insights on, e.g., the optimal network connectivity. However, none of such works tackles the problem of defining the logical network topology around the ML task to perform.

In this work, we focus on distributed, supervised learning, and aim at filling this gap by making the following main contributions:
• we develop a system model that can represent several relevant supervised ML tasks and account for the specific features of a 5G-and-beyond environment, most notably, the interaction between learning nodes and information nodes;
• we formulate the problem of choosing the computing nodes and data sources, as well as the links connecting them, with the goal of minimizing the (monetary or energy) cost of the learning process, subject to prediction quality and learning time requirements;
• we prove that the problem is NP hard, but also, and most importantly, that it is submodular. In particular, although its constraints are not monotonically increasing, we show that it can be solved via an iterative algorithm with excellent competitive ratio guarantees;
• we propose an iterative algorithm, called DoubleClimb, which has cubic worst-case time complexity and attains a 1+1/|I| competitive ratio, with I being the set of information nodes. We evaluate DoubleClimb over a real-world topology, showing that it closely matches optimal decisions and substantially outperforms state-of-the-art alternatives.

The rest of the paper is organized as follows. After reviewing related work in Sec. II, we describe our system model and how it can represent different supervised ML tasks in Sec. III. In Sec. IV, we formulate the problem we tackle and discuss its complexity. Sec. V characterizes the learning performance, while important properties of our problem are proven in Sec. VI.
We then present the DoubleClimb algorithm and analyze its complexity in Sec. VII, before evaluating its performance in Sec. VIII. We conclude the paper in Sec. IX.

II. RELATED WORK
A first body of works related to ours concerns distributed learning. In the simplest cases [16], all training data is known before the training itself starts, and the purpose of performing distributed learning is simply to leverage more computational power. A more complex variation is represented by active learning, where new information arrives during the learning process and is combined with the offline training set [17], [18]. Applications include drone planning [2] and network management [19], [20].
Federated learning is a more recent trend, tackling scenarios where participating devices are not required to share potentially sensitive data [7], [21]. Depending upon the specific scenario, new data may or may not arrive during the training process.

Several works propose generic methodologies to mitigate common hurdles of distributed ML, including scaling the parameter servers [1], dealing with slower nodes [14], and trading learning efficiency for convergence speed [13]. All these works propose novel algorithms and/or approaches to adapt to the existing network structure, e.g., by limiting the overhead, to perform the learning task at hand as efficiently as possible. Importantly, none of them envisions to do the opposite, i.e., adapting the nodes' interaction to the learning task.

Some works seek to theoretically characterize the convergence of supervised ML and how it is influenced by the cooperation among learning nodes. The study in [4] characterizes the convergence of a wide class of multi-agent algorithms. Using tools from spectral graph analysis, it establishes a relation between the topology formed by pairs of cooperating nodes and the convergence of the algorithm they run. [15] focuses on distributed ML over regular topologies, and seeks to establish the graph degree associated with the shortest convergence time – as opposed to the lowest number of iterations –, finding that such a degree depends on the distribution of the nodes' computing time. Through similar steps and targeting a resource-constrained edge-computing scenario, [5] searches for the optimal trade-off between local computation and global parameter exchange in federated learning scenarios. With respect to [4], [5], [15], we (i) seek to adapt the logical network topology to the learning task, and (ii) consider not only learning nodes (in charge of processing information), but also information nodes, where data comes from. The latter is especially critical, as it allows us to characterize and study the trade-off between gathering information and extracting knowledge from it.
TABLE I
MAIN NOTATION

Symbol | Meaning
L, I | sets of L-nodes and I-nodes (resp.)
ρ_i(t) | pdf of the sample generation time at I-node i ∈ I
r_i | average no. of samples per iteration provided by I-node i
X_l^k | amount of samples at the beginning of iteration k at L-node l
c_l, c_i | operational cost of L-node l and I-node i (resp.)
c_{l,l'} | communication cost between L-nodes l and l'
c_{i,l} | communication cost between I-node i and L-node l
ε_max | maximum learning error
T_max | maximum duration of the learning process
p(l,l') | binary variable determining if L-nodes l and l' cooperate (matrix P)
q(i,l) | binary variable determining if L-node l obtains samples from I-node i (matrix Q)
K | number of iterations to run
τ_l^k(t) | pdf of the computation time at L-node l and iteration k
ε_l^k(P,Q) | local error at L-node l and iteration k
ε^K(P,Q) | global error at the end of the whole learning process
T^K(P,Q) | expected time to complete the whole learning process
C^K(P,Q) | global cost for running the whole learning process

III. SYSTEM MODEL
Our system model addresses a generic distributed, supervised ML task where multiple nodes cooperatively seek to minimize a loss function, via gradient descent approaches such as the stochastic gradient descent (SGD) algorithm [3], [5], [15], [22]. In the following, we discuss how the behavior of individual nodes and their interactions are described by our system model, with reference to different real-world ML approaches.
Nodes' interactions. A unique feature of our model is its ability to capture the presence of two different types of nodes:
• learning nodes, or L-nodes for short, that, having computational capabilities, run the ML algorithm and can exchange gradient data during learning; we denote their set by L;
• information nodes, or I-nodes for short, which can provide information to the L-nodes; we denote their set by I.
Real-world counterparts of L-nodes include physical servers and virtual machines running at the intelligent network edge [11] or in the cloud. I-nodes, on the other hand, represent such entities as monitoring platforms, network nodes, and sensors.

Fig. 1. Scheme of the interactions between L- and I-nodes in a general case.

In our system model, L-nodes behave in a similar way to their equivalents in [5], [15]. Their high-level goal is to cooperatively train an ML model, and they do so by minimizing a loss function via distributed optimization. The computation time at each iteration of the learning process at a generic node l ∈ L follows an arbitrary distribution with probability density function (pdf) τ_l^k(t). Note that, in the most general case, such a pdf depends on the current iteration (k) of the learning process, since the amount of samples used for learning may vary from one iteration to the next. This reflects the need to exploit all the available data as soon as it becomes available [18], [23], as opposed to training on a fixed number of samples as in more static scenarios. L-nodes are logically connected to form an arbitrary logical topology, i.e., a graph where vertices represent L-nodes and edges, hereinafter referred to as L-L edges, represent the logical links connecting them. As exemplified in Fig. 1 (steps 3–4), after every iteration, each L-node sends its gradient data to its neighboring L-nodes on the logical topology, and waits for them to do the same before moving on. The logical topology, i.e., which pairs of L-nodes are neighbors and exchange gradient data, is one of our main decision variables.

Each L-node can be logically connected to one or more I-nodes, through the so-called I-L edges. Only I-nodes that are connected to at least one L-node are added to the logical topology. After each iteration of the learning process, an L-node receives data from the I-nodes it is connected to (steps 5 and 7 in Fig. 1). Each I-node i may provide new samples after a sample generation time since the end of the previous iteration, with r_i being the average number of provided samples and ρ_i(t) the pdf describing the sample generation time. The received samples are used by an L-node l to perform the next learning iteration, in addition to the data it received in the previous iterations and the number X_l of (offline) samples initially available at l. Note that this behavior is compatible with current, widely deployed applications (e.g., IoT) using publish/subscribe mechanisms, such as MQTT [24] or Zenoh [25], or even the notification mechanisms included in the 3GPP Service Based Architecture [26] of Release 15 and above.

Both L-nodes and I-nodes have per-iteration operational costs, denoted by c_l and c_i, respectively. Moreover, communication between nodes that are neighbors in the logical topology involves additional costs, denoted by c_{l,l'} or c_{i,l} depending on the type of nodes.

Modeling real-world supervised ML tasks.
As mentioned, our model can describe a wide range of real-world ML tasks falling in the category of supervised learning, for which a ground truth is available. The most prominent examples of supervised learning tasks are classification (where the quantity to predict is discrete, e.g., whether or not a given transaction is fraudulent) and regression (where the quantity to predict is continuous).

In a distributed setting, supervised learning can be performed in two main modes:
• distributed learning with static data, where no new data arrive during the learning process. In this case there are no I-nodes (hence, no such steps as 5 and 7 in Fig. 1), and each L-node learns from its X_l initial samples, as well as the gradient data from the other L-nodes;
• active learning [17], where new samples can be collected from data sources (e.g., sensors) during the learning process so as to improve the learning quality. In this case, the network topology includes both L- and I-nodes.
Importantly, our model can also capture federated learning [5], [6], [27], an emerging paradigm whereby different devices (e.g., smartphones) cooperatively train a model without sharing (potentially sensitive) data. In this case, each device is modeled as an L-node; if, in the specific scenario at hand, devices collect or generate additional information while learning, an I-node per device is added, only connected to the corresponding L-node.

For all tasks and approaches, our model can capture the cases where the communication between nodes happens in a peer-to-peer fashion [4], [15], as well as those where it is mediated by a parameter server, also known as broker [5], [13], [27]. In the latter case, the logical topology created by the L-nodes is fully connected.

IV. PROBLEM FORMULATION AND APPROACH
Our decisions concern which nodes' interactions should be enabled, and the number of iterations to execute during the learning process. We thus define the following decision variables:
• the set of binary variables p(l,l') ∈ {0,1}, expressing whether L-nodes l and l' cooperate during learning;
• the set of binary variables q(i,l) ∈ {0,1}, expressing whether L-node l ∈ L obtains samples from I-node i ∈ I;
• the total number of iterations, K, to perform so that the learning task meets the desired learning quality and execution time.
For compactness of notation, we will collect the p- and q-variables in matrices P = {p(l,l')} and Q = {q(i,l)}, respectively. Given the decisions P, Q, and K, we can compute the following system performance metrics:
• the expected time required by the system to complete the learning process, denoted by T^K(P,Q);
Fig. 2. Classification task using the MNIST dataset [28]. Left to right: evolution of the learning error for different values of X_l when there are no I-nodes; values of ε_l^K when there are no I-nodes and obtained fit; values of ε_l^K when I-nodes are present and obtained fit; duration of single iterations (each dot corresponds to one iteration) and linear fit.

Fig. 3. Regression task using the ITU challenge dataset [29]. Left to right: evolution of the learning error for different values of X_l when there are no I-nodes; values of ε_l^K and obtained fit, when there are no I-nodes; ε_l^K values and obtained fit, in the presence of I-nodes; duration of single iterations (each dot corresponds to one iteration) and linear fit.

• the total cost C^K(P,Q), incurred by the system to complete the learning process;
• the (system-wide) learning error ε^K(P,Q) at the end of the learning process (i.e., after K iterations).
It is important to point out that, in general, the concrete definition of the error ε depends on the type of learning task being performed, e.g.,
• for classification tasks, ε ≜ 1 − α, where α is the classification accuracy (i.e., the rate of correctly labeled items);
• for regression tasks, ε ≜ 1 − R², where R² is the coefficient of determination [30].
In both cases, ε = 0 corresponds to perfect learning, while larger ε values identify worse learning quality, i.e., higher error. In the remainder of the paper, we use learning error or learning quality when referring to generic machine learning, and more precise terms (e.g., accuracy for classification) when discussing specific learning tasks.

Our objective is to minimize the total cost, while ensuring that the final learning error does not exceed the limit ε_max, i.e., ε^K(P,Q) ≤ ε_max, and the learning is completed within the target time, i.e., T^K(P,Q) ≤ T_max. The problem can then be synthetically formulated as:

min_{P,Q,K} C^K(P,Q),  (1)

s.t. min{ ε_max / ε^K(P,Q), T_max / T^K(P,Q) } ≥ 1.  (2)

The problem is combinatorial in nature and includes a large number of binary variables (the elements of matrices P and Q). This makes it very hard to solve, even without considering the complexity of computing the quantities C^K(P,Q), ε^K(P,Q), and T^K(P,Q). Specifically, we prove in Sec. VI that the problem is NP hard.

Remarkably, in spite of the problem complexity, we can design an efficient and provably effective solution strategy. We do so by first characterizing the system performance as functions of the problem decision variables (Sec. V), and then showing that the problem in (1) and (2) is submodular (Sec. VI). Leveraging this result, we can devise the DoubleClimb algorithm (Sec. VII), which has cubic worst-case time complexity and proves to be 1+1/|I| competitive.
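Before moving on, note that the feasibility check in (2), and the search for the cheapest feasible K at fixed P and Q, translate directly into code. A minimal Python sketch follows; eps_of_K, T_of_K, and C_of_K are hypothetical stand-ins for the performance models derived in Sec. V, not part of any actual implementation.

```python
def is_feasible(eps_K, T_K, eps_max, T_max):
    """Constraint (2): min{eps_max/eps^K, T_max/T^K} >= 1, i.e., both the
    target error and the time budget are met by the chosen (P, Q, K)."""
    return min(eps_max / eps_K, T_max / T_K) >= 1.0

def cheapest_feasible_K(eps_of_K, T_of_K, C_of_K, eps_max, T_max, K_max=1000):
    """For fixed P and Q, scan K and return the cheapest feasible number
    of iterations (or None). All three models are callables of K."""
    feasible = [K for K in range(1, K_max + 1)
                if is_feasible(eps_of_K(K), T_of_K(K), eps_max, T_max)]
    return min(feasible, key=C_of_K) if feasible else None
```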
V. CHARACTERIZING THE PERFORMANCE OF THE LEARNING PROCESS
We now characterize the learning error, execution time, cost, and number of iterations of the learning task at hand. We denote by X_l^k the number of samples available at L-node l at iteration k. We remark that such data is obtained by enhancing the amount of samples initially available at l, X_l, with the samples that l receives at each iteration from the I-nodes it is connected to.

To perform the characterization, we blend together results from the literature and our own experiments. Specifically, we performed and profiled the following learning tasks:
• a classification task on the famous MNIST digit database [28];
• a regression task on the dataset used for the ITU AI Challenge [29], with the goal of predicting the throughput of a set of Wi-Fi nodes leveraging their position and settings.
Through these two datasets, we can show how our methodology works for the two most common and relevant types of supervised learning. While the numerical results we obtain (e.g., the coefficient values) are specific to the concrete learning algorithm at hand, our approach is general and can be effortlessly extended to any supervised learning task.

Experiments have been performed using the Python language and the sklearn library, specifically, the MLPClassifier and MLPRegressor objects. The sklearn library does not support GPU; hence, only the CPU is used for their training – which makes them easier to profile. All tests were run on a server based on a twenty-core Intel Xeon E5-2630V4 processor with 64 GByte of RAM.
A. Learning error
In general, the learning error at each iteration depends on (i) the number of already performed iterations, and (ii) the number of available samples [5], [15]. To characterize such a dependence, we proceed in four steps:
1) we first focus on a single L-node, l, in a scenario where there are no I-nodes, and characterize the relation between the per-iteration error ε_l^k(P,Q) and the iteration k;
2) for the same scenario, we establish a relationship between the quantity X_l of offline training data available at l and its final error ε_l^K(P,Q);
3) we extend such a relation to account for I-nodes, i.e., the case where new samples arrive at each iteration;
4) we generalize the error to the case of multiple L-nodes.

With regard to step 1), the first plots of Fig. 2 and Fig. 3 show how the error (defined in terms of classification accuracy in Fig. 2 and of coefficient of determination in the regression experiments in Fig. 3) evolves across iterations, when only the X_l offline data are used. The evolution of the error as a function of the iteration k is well captured by the following square-root relationship:

ε_l^k(P,Q) = ε_l^K + 1/√k.  (3)

The relation in (3) matches experimental data very well, with a low RMSE (Root Mean Square Error) for both the classification and the regression tasks, in addition to conforming to the theoretical findings in [5], [15].

For step 2), we focus on the final value taken by the error at the end of the process, and on how this depends on the quantity X_l of offline samples. As shown in the second plots of Fig. 2 and Fig. 3, the relationship between ε_l^K(P,Q) and X_l follows a logarithmic law:

ε_l^K(P,Q) = 1 − β log(X_l − ξ).  (4)

In our experiments, the best fit is obtained with β = 0.112, ξ = 28 for the classification task and β = 0.072, ξ = 4780 for the regression task. Importantly, similar logarithmic laws can also be found in the literature [3], [17], [31].

In step 3), we move to a generic scenario where new data arrive at every iteration, i.e., X_l^k ≥ X_l^{k−1} ≥ X_l. We then update the above expression for ε_l^K(P,Q) to account for the average number of samples available at the generic iteration at L-node l, X̄_l = (X_l + Σ_{k=1}^{K} X_l^k)/(K+1), as:

ε_l^K(P,Q) = 1 − β log(X̄_l − ξ).  (5)

As summarized in the third plots of Fig. 2 and Fig. 3, using (5) in lieu of (4) also results in a very good fit for both the classification and the regression tasks.

For step 4), we have to characterize the aggregate learning error of the whole set of L-nodes, each of which can have a different value of ε_l^K(P,Q). To this end, we leverage existing works [4], [15], linking the effectiveness of the distributed learning process with the graph formed by cooperating L-nodes, and more precisely with its spectral gap γ (the spectral gap of a graph is the difference between the moduli of the two largest eigenvalues of its adjacency matrix). According to [4], [15], we can then write:

ε^K(P,Q) = (1/(γ|L|)) Σ_{l∈L} ε_l^K(P,Q).
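The fitted laws (3)–(5) are straightforward to evaluate numerically. The sketch below encodes them in Python using the β and ξ values fitted for the MNIST classification task; the function names are ours, the natural logarithm is an assumption, and the snippet is an illustration rather than the profiling code behind Fig. 2 and Fig. 3.

```python
import numpy as np

def eps_final(X_bar, beta=0.112, xi=28):
    """Final error, Eq. (5): logarithmic law in the average number of
    samples X_bar available per iteration (MNIST fit: beta=0.112, xi=28)."""
    return 1.0 - beta * np.log(X_bar - xi)

def eps_at_iteration(k, eps_K):
    """Per-iteration error, Eq. (3): final error plus a 1/sqrt(k) transient."""
    return eps_K + 1.0 / np.sqrt(k)

def eps_global(eps_locals, gamma):
    """Aggregate error over all L-nodes, scaled by the spectral gap gamma."""
    return np.mean(eps_locals) / gamma

# Example: one L-node averaging 500 samples per iteration, after 100 iterations.
print(eps_at_iteration(100, eps_final(500.0)))
```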
B. Learning time

We now consider that the total number of iterations K, the pdfs ρ_i(t) of the sample generation time at each I-node i, and the pdfs τ_l^k(t) of the computation time of each L-node l at iteration k are given. Recall that τ_l^k(t) depends on k, as the presence of I-nodes in our system model implies that the computation time distribution must account for the quantity X_l^k of data available at L-node l at iteration k. Backed by our experiments, reported in the fourth plots of Fig. 2 and Fig. 3, as well as by the high efficiency and scalability of modern supervised ML algorithms, we consider the following relationship: τ_l^k(t) = (X_l^k/X_l) τ_l(t). Also, we define the sets I_l = {i ∈ I : q(i,l) = 1} and L_l = {l' ∈ L : p(l,l') = 1} of I-nodes and L-nodes (resp.) each L-node is connected with. Our goal is to compute T^K(P,Q), i.e., the total time required to complete the whole learning process.

As highlighted in Fig. 1, at every iteration each L-node must perform the following steps:
• wait for the information coming from the I-nodes i ∈ I_l;
• perform its own gradient computation;
• wait for the gradient data coming from the other L-nodes l' ∈ L_l it is cooperating with.
The first step is complete when all nodes in I_l have sent their samples. Recalling that each I-node has a sample generation time distributed with pdf ρ_i(t), we can derive the cumulative distribution function (CDF) of the maximum of a set of independent random variables as the product of the individual CDFs R_i(t), i.e., ∏_{i∈I_l} R_i(t). Once all data arrive, l can perform its own gradient computation, whose duration is distributed according to pdf τ_l^k(t). Recalling that the pdf of the sum of two independent random variables is the convolution of the individual pdfs, we can write: h_l^k(t) = τ_l^k(t) ∗ d(∏_{i∈I_l} R_i(t))/dt.

For the system as a whole to move to the next iteration, all L-nodes must have received the gradient data they need. This, in turn, requires the slowest L-node to have obtained its information and have performed the computation. Working again with CDFs, the time taken by such a node is distributed according to: H^k(t) = ∏_{l∈L} H_l^k(t), where H_l^k(t) denotes the CDF of the time to complete iteration k at L-node l. By letting h^k(t) = dH^k(t)/dt, the expected duration of the learning process is then given by:

T^K(P,Q) = Σ_{k=1}^{K} ∫_0^∞ t h^k(t) dt.

Fig. 4. Toy scenario with |L| = 10 and |I| = 5 where both the I-node sample generation times and the L-node computation times are uniformly distributed. Left: pdfs of the I-node generation time ρ_i(t) (blue), of the time required by the slowest I-node (red), and of the compute time τ_l^k(t) (yellow). Right: pdfs of the time taken by local (green) and global (gray) iterations.

A numerical example.
Fig. 4 exemplifies our methodology in a case where both the I-node sample generation times and the L-node computation times are uniformly distributed. Furthermore, there are |L| = 10 L-nodes, each connected to |I| = 5 I-nodes.

We begin from the blue line in the plot, representing ρ_i(t). To obtain the pdf of the sample generation time of the slowest I-node, we have to integrate ρ_i(t) (obtaining R_i(t), a ramp-like function), then raise it to the |I|-th power (obtaining a 5th-degree polynomial), and finally differentiate it, obtaining the fourth-degree polynomial shown by the red line in Fig. 4.

We next perform the convolution between the latter pdf and τ_l^k(t), represented by the yellow line in the plot. The result is h_l^k(t), represented by the green line in Fig. 4. The last step consists in computing the distribution of the time taken by the whole learning iteration, hence, by the slowest L-node. Integrating h_l^k(t), we obtain H_l^k(t), which we raise to the |L| = 10-th power, and then differentiate, obtaining the pdf h^k(t) shown by the gray curve in Fig. 4.
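The integrate–exponentiate–differentiate–convolve procedure above is easy to reproduce on a discretized time axis. Below is a minimal numpy sketch mirroring Fig. 4 (|I| = 5, |L| = 10); the uniform-distribution endpoints and the grid resolution are illustrative choices of ours, as the exact values used in the figure are not essential to the method.

```python
import numpy as np

dt = 1e-3
t = np.arange(0.0, 10.0, dt)

def uniform_pdf(a, b):
    return np.where((t >= a) & (t <= b), 1.0 / (b - a), 0.0)

def pdf_of_max(pdf, n):
    """pdf of the max of n i.i.d. variables: differentiate CDF(t)**n."""
    cdf = np.cumsum(pdf) * dt
    return np.gradient(cdf ** n, dt)

rho = uniform_pdf(0.5, 1.5)   # I-node sample generation time (illustrative)
tau = uniform_pdf(1.0, 3.0)   # L-node computation time (illustrative)

slowest_inode = pdf_of_max(rho, 5)                        # red curve in Fig. 4
h_local = np.convolve(slowest_inode, tau)[:len(t)] * dt   # green curve
h_global = pdf_of_max(h_local, 10)                        # gray curve

expected_iteration_time = np.sum(t * h_global) * dt       # one term of T^K
print(f"expected per-iteration time: {expected_iteration_time:.3f}")
```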
Closed-form expressions for special cases. The methodology outlined above does not require any assumption on the τ_l^k(t) and ρ_i(t) distributions, nor on the logical links between nodes, and the computations it requires can always be performed numerically. However, closed-form expressions are available in relevant special cases. As an example, when each L-node receives information from all I-nodes, and the computation and the sample generation times are i.i.d. and exponentially distributed with parameters λ_L^k and λ_I, respectively, we get:

T^K = − Σ_{k=1}^{K} Σ_{A} C(|L|; A) [ ∏_{w=1}^{|I|+2} (A^k(A,w))^{a_w} ] / [ λ_I Σ_{w=1}^{|I|} w a_w + λ_L^k a_{|I|+2} ].

In the above expression, the sum over k accounts for all iterations, k = 1, ..., K. The inner sum comes from the multinomial expansion [32] of a sum of |I|+2 terms (one for each I-node, one for the L-node connected to them, and one representing the coefficient) raised to the |L|-th power, where each term is a polynomial (see also the expression of h_l^k(t)). Therefore, the inner summation is over all sets A of natural numbers such that their size is |I|+2 and their sum is |L|, and C(|L|; A) = |L|!/∏_{a∈A} a! is the multinomial coefficient. The term A^k(A,w) associated with the w-th element of each set A is:

A^k(A,w) =
  Σ_{z=1}^{|I|} C(|I|, z) (−1)^{z+1},   if w = |I|+1;
  Σ_{z=1}^{|I|} C(|I|, z) (−1)^{z+1} z λ_I / (λ_L^k − w λ_I),   if w = |A|;
  C(|I|, w) (−1)^{w+1} λ_L^k / (w λ_I − λ_L^k),   otherwise.

A closed-form expression for the expected duration of the learning process can also be obtained when each L-node receives information from all I-nodes, and the I-nodes' sample generation times and the L-nodes' computation times are i.i.d. and uniformly distributed over (a_I, b_I) and (a_L^k, b_L^k), respectively. For simplicity and without loss of generality, let us assume a_L^k ≤ a_I ≤ b_I ≤ b_L^k, ∀k; then, letting S = Σ_{w=1}^{|I|+1} w a_w, we have:

T^K = Σ_{k=1}^{K} Σ_{A} C(|L|; A) · S/(S+1) × [ ∏_{w=1}^{|I|+2} (A_1^k(A,w))^{a_w} (Z_1^{S+1} − Z_2^{S+1}) + ∏_{w=1}^{|I|+2} (A_2^k(A,w))^{a_w} (Z_3^{S+1} − Z_4^{S+1}) + ∏_{w=1}^{|I|+2} (A_3^k(A,w))^{a_w} (Z_5^{S+1} − Z_6^{S+1}) ],

where Z_1 = a_L^k + b_I, Z_2 = a_L^k + a_I, Z_3 = b_L^k + a_I, Z_4 = a_L^k + b_I, Z_5 = b_L^k + b_I, Z_6 = b_L^k + a_I, and the inner sum runs over the same sets A as in the previous case.

As in the previous case, the above expression comes from the multinomial expansion [32] and, after some algebra, one can obtain the terms A_1^k(A,w), A_2^k(A,w), and A_3^k(A,w) associated with the w-th element of each set A, as:

A_1^k(A,w) =
  [(−a_I)^{|I|} − (a_L^k − a_I)^{|I|}] / [(b_I − a_I)^{|I|} (b_L^k − a_L^k)],   if w = 1;
  [(a_L^k − a_I)^{|I|} (|I|(a_L^k + a_I) + 2a_I) + (−a_I)^{|I|+1}] / [(|I|+1)(b_I − a_I)^{|I|} (b_L^k − a_L^k)],   if w = |A|;
  C(|I|+1, w) (−a_I)^{|I|+1−w} / [(|I|+1)(b_I − a_I)^{|I|} (b_L^k − a_L^k)],   otherwise.

A_2^k(A,w) =
  A_1^k(A,|A|) + Σ_{z=1}^{|I|+1} A_1^k(A,z) (a_L^k + b_I)^z + [(a_L^k − a_I)^{|I|+1} − (a_L^k + b_I − a_I)^{|I|+1}] / [(|I|+1)(b_I − a_I)^{|I|} (b_L^k − a_L^k)] + [(b_I + a_I)^{|I|+1} − (2a_I)^{|I|+1}] / [(|I|+1)(b_I − a_I)^{|I|} (b_L^k − a_L^k)],   if w = |A|;
  −C(|I|+1, w) [(−b_I − a_I)^{|I|+1−w} − (−a_I)^{|I|+1−w}] / [(|I|+1)(b_I − a_I)^{|I|} (b_L^k − a_L^k)],   otherwise.

A_3^k(A,w) =
  [(b_L^k − a_I)^{|I|} − (−1)^{|I|} (b_I + a_I)^{|I|}] / [(b_I − a_I)^{|I|} (b_L^k − a_L^k)],   if w = 1;
  A_1^k(A,|A|) + Σ_{z=1}^{|I|+1} A_1^k(A,z) (b_L^k + a_I)^z − (|I|+1)(b_L^k − a_I)^{|I|} (b_L^k + a_I) / [(|I|+1)(b_I − a_I)^{|I|} (b_L^k − a_L^k)] + [(−b_L^k − a_I)^{|I|+1} − (b_L^k − b_I)^{|I|+1}] / [(|I|+1)(b_I − a_I)^{|I|} (b_L^k − a_L^k)],   if w = |A|;
  C(|I|+1, w) (−b_I − a_I)^{|I|+1−w} / [(|I|+1)(b_I − a_I)^{|I|} (b_L^k − a_L^k)],   otherwise.

Intuitively, the three different terms A_*^k(A,w) are due to the convolution of the pdfs, which results in a piecewise function (see also the expression of h_l^k(t)). The supports of the different pieces of the function are as follows: [a_L^k + a_I, a_L^k + b_I) for the first piece, where only one pdf is active; [a_L^k + b_I, b_L^k + a_I] for the second piece, where both pdfs are active and overlap; and (b_L^k + a_I, b_L^k + b_I] for the third piece, where only the other pdf is active.

C. Learning cost
We define the per-iteration cost as the sum of the operational and communication costs of the L- and I-nodes contributing to each iteration, i.e.,

C(P,Q) = Σ_{l∈L} ( c_l + Σ_{l'∈L} c_{l,l'} p(l,l') + Σ_{i∈I} c_{i,l} q(i,l) ) + Σ_{i∈I} c_i 𝟙{∃l ∈ L: q(i,l) = 1}.  (6)

Then, we can write the total learning cost over the K iterations as C^K(P,Q) = K · C(P,Q).
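Eq. (6) maps directly onto a few matrix operations. A minimal sketch (the variable names are ours), with P and Q as boolean matrices and the cost terms of Table I as numpy arrays:

```python
import numpy as np

def per_iteration_cost(P, Q, c_L, c_I, c_LL, c_IL):
    """Per-iteration cost, Eq. (6). P: |L|x|L| L-L edges; Q: |I|x|L| I-L
    edges; c_L, c_I: per-node operational costs; c_LL, c_IL: per-edge costs."""
    operational_L = c_L.sum()                 # every L-node pays c_l
    comm_LL = (c_LL * P).sum()                # L-L communication costs
    comm_IL = (c_IL * Q).sum()                # I-L communication costs
    operational_I = c_I[Q.any(axis=1)].sum()  # I-nodes with >= 1 active edge
    return operational_L + comm_LL + comm_IL + operational_I

def total_cost(K, P, Q, c_L, c_I, c_LL, c_IL):
    """Total learning cost C^K(P,Q) = K * C(P,Q)."""
    return K * per_iteration_cost(P, Q, c_L, c_I, c_LL, c_IL)
```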
D. Number of iterations

The number K of iterations needed to reach the target error ε_max depends on two factors. The first is the quantity of available training data: the more data is available, the more the learning quality improves at each iteration. The second is the level of cooperation between L-nodes: the more nodes cooperate, the higher the quality achieved at each iteration. As shown in [15, Eq. (7)], we have K ∝ 1/γ, where γ is the spectral gap of the graph formed by the L-nodes. Combining the two factors and denoting by X̄ the average number of samples available at the generic iteration and L-node, as per (5), we get:

K ∝ 1/(γ log X̄).  (7)

As already noted in [15], on the one hand, a high degree of the L-nodes makes the learning process faster, as convergence requires fewer iterations; on the other hand, each iteration takes longer to complete, as there are more nodes to wait for.
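Both γ and the resulting scaling of K are easy to compute numerically. The sketch below follows the spectral-gap definition recalled in Sec. V (difference between the moduli of the two largest adjacency-matrix eigenvalues); normalizing by the largest eigenvalue is our own choice, made to keep γ in [0, 1] as in Fig. 5(right), and networkx is assumed to be available.

```python
import numpy as np
import networkx as nx

def spectral_gap(A):
    """Spectral gap of a graph: difference between the moduli of the two
    largest eigenvalues of its adjacency matrix (normalized by the largest)."""
    ev = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return (ev[0] - ev[1]) / ev[0]

# Random regular graph with 100 vertices, as in the experiments of Fig. 5 (right).
G = nx.random_regular_graph(d=20, n=100, seed=0)
gamma = spectral_gap(nx.to_numpy_array(G))

# Relative number of iterations as per Eq. (7): K scales as 1/(gamma*log(X_bar)).
X_bar = 500.0
print(f"gamma = {gamma:.3f}, relative K = {1.0 / (gamma * np.log(X_bar)):.3f}")
```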
VI. PROBLEM ANALYSIS
We first prove that the problem at hand, formulated in Sec. IV, is NP hard. On the positive side, we also show that the problem objective function is submodular and non-decreasing, while the constraint is submodular and exhibits only one maximum (we prove the latter part separately for I-L and L-L edges).
Lemma 1.
The problem of optimally configuring the system for an ML task, expressed in (1) and (2), is NP hard.

Proof:
The proof can be obtained via a reduction from the knapsack problem [33], a combinatorial optimization problem where a set of N numbered items is given, each of them associated with a weight ω_s and a value ν_s. The goal is to select a subset of items with maximum total value and total weight less than or equal to a maximum given capacity, Ω. Our reduction maps any given instance of the knapsack problem to a simpler, special-case instance of our own, as set forth next.

The sets of L-nodes and I-nodes are respectively L = {l_1, ..., l_N} and I = {i_1, ..., i_N}, i.e., there are as many L-nodes as there are items in the knapsack problem, and as many I-nodes as there are L-nodes. Further, we connect all L-nodes in a logical full mesh, and impose that each I-node i_s ∈ I can only be connected to the corresponding L-node l_s ∈ L. We also set the number of iterations to an arbitrary number K̂ > 0, and the number of samples generated by each I-node to an arbitrary number r > 0.

Given the above, matrix P is fixed and the decisions concern only matrix Q, which is now a diagonal matrix with elements q(i_s, l_s), mapping into the x_s variables in the knapsack problem. Specifically, we activate edge (i_s, l_s) in our problem if and only if x_s = 1, i.e., q(i_s, l_s) ← x_s. Furthermore, we map edge costs in our problem into item values in the knapsack problem. In particular, let ν_s correspond to the opposite of the link cost c_{i_s,l_s}; then we have a perfect correspondence between the objective of the knapsack problem and that in (1).

Next, we need to map the capacity constraint in the knapsack problem to constraint (2). To this end, we first set T_max ← ∞. Then, given that P is fixed, γ is also known and fixed, and each L-node can receive data from one I-node only, so the amount of data received by L-node l_s in each iteration is r or 0, depending on the value of x_s. A correspondence between the constraint in the knapsack problem and that in our problem is then established by fixing ε_max ← Ω, and setting β and ξ in the expression of the learning error at a single L-node in such a way that:

ε_{l_s}^K(P,Q) / (γ|L|) = [1 − β log(X̄_{l_s} − ξ)] / (γ|L|) = ω_s,

with X̄_{l_s} determined by r and K̂ as per (5).

Last, we need the reduction to take (at most) polynomial time. In our case, it is straightforward to see that the mapping takes linear time, namely O(|L| + |I|); hence, the condition is fulfilled.
Fig. 5. Left: qualitative example of the constraint in (2) and its components, g_1 = ε_max/ε^K and g_2 = T_max/T^K, with g(Y) = min{g_1, g_2} and crossing points A and B. Right: experiments on the relation between the degree of a random graph with 100 vertices and uniform degree, and its spectral gap γ.

In summary, any instance of an NP-hard problem can be transformed into a special-case instance of our own, which proves the thesis.

In spite of its complexity, the problem of minimizing (1) subject to constraint (2) presents several features that can be exploited to solve it efficiently and effectively. Specifically, both the objective in (1) and the constraint in (2) are submodular (intuitively, the set-wise equivalent of convex [34]). Submodular optimization problems can often be solved with polynomial- or even linear-time greedy algorithms, with very good, even constant, competitive ratios [35].

Let us indicate with f(Y) the objective function in (1), and with g(Y) the constraint in (2). In our case, the set X of elements to choose from is given by X = L × L ∪ L × I, i.e., the set of possible I-L and L-L edges we can create, and Y is the subset of actually selected edges. The objective f(Y) and constraint g(Y) of our problem have several interesting and useful properties. Concerning the former, it is possible to prove the following result.

Property 1.
The objective function in (1) is submodular and non-decreasing.

Proof:
Let j = (a, b) be an edge in our logical topology graph, with a ∈ L and b ∈ L ∪ I; let S ⊂ X be the set of currently selected edges. By adding j, we incur the per-edge communication cost c_{a,b}; also, we may incur the per-node operational costs c_a or c_b, depending on whether or not there are already edges in S with a or b as endpoints. Similar arguments hold for the cost of adding j to T ⊃ S. Thus,

f(S ∪ {j}) − f(S) = c_{a,b} + c_a 𝟙{a ∉ S} + c_b 𝟙{b ∉ S},
f(T ∪ {j}) − f(T) = c_{a,b} + c_a 𝟙{a ∉ T} + c_b 𝟙{b ∉ T}.

Since S is a subset of T, it also holds that 𝟙{a ∉ S} ≥ 𝟙{a ∉ T} and 𝟙{b ∉ S} ≥ 𝟙{b ∉ T}, from which it follows that f(S ∪ {j}) − f(S) ≥ f(T ∪ {j}) − f(T), i.e., the very definition of submodularity [34]. The fact that (1) is non-decreasing trivially comes from the observation that, as more I-L or L-L edges are added, the cost always increases.

As for the constraint, the analysis is a little more complex, and we perform it separately for I-L and L-L edges. For simplicity of notation, we drop the dependency on P and Q while presenting our derivations.
When the choices are limited to I-L edges, i.e., X = L × I, then the constraint in (2) is submodular and has exactly one maximum.
Let us study the two parts of the constraint (2) separately, writing g_1 = ε_max/ε^K, g_2 = T_max/T^K, and g(Y) = min{g_1, g_2}, as exemplified in Fig. 5(left). From (5), g_1 = ε_max/(1 − β log X̄_l); also, adding an I-L edge increases X̄_l for at least one L-node l. Recalling that the logarithm is a concave function, the denominator of g_1 is convex, and g_1 itself is concave, which implies submodularity [34]. For analogous reasons, g_1 is also monotonically increasing.

The behavior of g_2 is more complex: we know from (7) that the number of iterations decreases as X̄ (hence, X_l^k) increases, according to an inverse-log law. Also, as shown in Sec. V, τ_l^k(t) and dH^k(t)/dt are proportional to X_l^k and ∏_{i∈I_l} X_l^k, respectively. Thus, T^K is proportional to K and ∏_{i∈I_l} X_l^k. Replacing K with (7), we get that T^K behaves like ∏_{i∈I_l} X_l^k / log X̄, i.e., it can be shown that it decreases until it reaches a minimum, and then increases. It follows that g_2 = T_max/T^K is concave, hence, submodular.

Looking now at g(Y), the minimum of two submodular functions is not guaranteed to be submodular in general; however, since g_1 is not only submodular but also monotonically increasing, the submodularity of g_2 also implies that g(Y) as a whole is submodular [34]. Next, consider the maximum of g(Y), with the latter being equal to min{g_1, g_2}. As exemplified in Fig. 5(left), we know that g_1 starts from a value close to ε_max and then monotonically increases towards infinity, while g_2 starts with a small value, increases until it has a global maximum, and then decreases again. If g_2 is always smaller than g_1, then g(Y) = g_2 has exactly one global maximum, consistently with the hypothesis. If they cross (as in Fig. 5(left)), they do so in exactly two points, say A and B, such that the maximum of g_2 is between A and B. Then, the following holds: (i) before A, g(Y) = g_2, which is increasing before its maximum; (ii) between A and B, g(Y) = g_1, which is always increasing; (iii) after B, g(Y) = g_2 and, since we are after its maximum, g(Y) is decreasing – hence, B is g(Y)'s only maximum. Therefore, in all cases g(Y) is submodular and has exactly one maximum, and, until such a maximum is reached, g(Y) is also monotonically non-decreasing.

As for L-L edges, their influence on the learning process can be quantified by studying the graph they form. Specifically, [4], [15] have shown that both the learning error and the learning time are inversely proportional to the spectral gap of such a graph, indicated by γ. Following the lead of [15] and restricting our attention to regular graphs, we can state the following result:
When the choices are limited to sets of L-L edges such that the graph created by the L-nodes is uniform, then the constraint (2) is submodular and has exactly one maximum.
The arguments in support of Proposition 1 can be summarized as follows: 1) the error reached after a given number K of iterations is proportional to 1/γ [15, Eq. (7)]; 2) the learning time is proportional to 1/γ [15, Eq. (18)]; 3) based on our own experiments, summarized in Fig. 5(right), the link between the graph degree and the spectral gap γ is expressed by a concave function. Recalling that concavity is the continuous equivalent of submodularity, the first part of the proposition follows. The second part follows from the fact that, as exemplified in Fig. 5, (2) is the minimum between a monotonic function (as we add more L-L edges, the error decreases) and a function with at most one maximum (the inverse of the learning time, which decreases until an optimal degree is reached and then increases, as shown in [15]).

VII. THE DOUBLECLIMB ALGORITHM
We now seek to solve the problem stated in Sec. IV, i.e., determining the P, Q, and K resulting in the lowest cost (1) subject to the constraint in (2), in a practical and efficient way. To this end, we first extend existing results on the performance of greedy algorithms when optimizing submodular problems, in Sec. VII-A. Based on such results, we present our own DoubleClimb algorithm in Sec. VII-B, and analyze its properties in Sec. VII-C.

A. Greedy solutions to submodular problems
Let us consider Alg. 1, which solves submodular problems with non-decreasing objective and constraints. More formally, it selects a subset S ⊆ X of elements subject to a submodular non-decreasing constraint g(S) ≥ 1, while minimizing a submodular non-decreasing cost function f(S). At every iteration, Line 3 selects the element minimizing the cost-to-benefit ratio [f(S ∪ {j}) − f(S)] / [g(S ∪ {j}) − g(S)]; such an element is then added to S (Line 4).

Algorithm 1 Greedy algorithm for submodular problems
1: S ← ∅
2: while g(S) < 1 do
3:   j* ← argmin_{j ∈ X\S} c_j / [g(S ∪ {j}) − g(S)]
4:   S ← S ∪ {j*}
5: return S

As shown in [36, Thm. 4.7], Alg. 1 is 1+1/|X|-competitive. However, the original proof requires both the objective and the constraint to be submodular and non-decreasing. In our case, Property 2 and Proposition 1 prove weaker properties, in that our constraint is not guaranteed to be non-decreasing, as in Fig. 5; therefore, the result in [36] cannot immediately be applied to our problem.

Nonetheless, it is possible to prove that a less restrictive condition than being non-decreasing, namely, having only one maximum, is sufficient for the result to hold:

Property 3. If f(Y) is submodular non-decreasing and g(Y) is submodular and has only one maximum, then the above algorithm minimizes f(Y) s.t. g(Y) ≥ 1, with a competitive ratio of 1+1/|X|.

Proof: The property generalizes the results in [36, Thm. 4.7]. The proof therein follows from analyzing the steps of the above algorithm until its convergence, and leveraging the fact that the sequences of marginal cost increases and constraint improvements are (resp.) monotonically non-decreasing and monotonically non-increasing. This is of course true if, as in the original hypotheses, g(Y) is monotonically non-decreasing. However, this also holds if g(Y) has only
one maximum, as per the hypothesis of our property. This is because, if the algorithm cannot find a feasible solution before the maximum of g(Y), i.e., as constraints become closer to being satisfied, it will also be impossible to find a feasible solution after the maximum, i.e., when constraints get farther from being met. Thus, the sequences of marginal costs and improvements of the selected elements of X have the required behavior. Indeed, the behavior of g(Y) for the non-selected items of X has no impact on the validity of [36, Thm. 4.7], nor of this property.

Algorithm 2 The DoubleClimb algorithm
1:  d_L ← 0
2:  best_sol ← ∅
3:  while d_L < |L| do
4:    d_L ← d_L + 1
5:    ll ← cheapest_uniform(d_L)
6:    il ← ∅
7:    while (2) is not verified ∧ il ≢ I × L do
8:      (i*, l*) ← argmin_{(i,l)} c_{i,l} / [g(il ∪ {(i,l)}) − g(il)]
9:      il ← il ∪ {(i*, l*)}
10:   if C_curr < C_best then
11:     best_sol ← ll ∪ il
12:   else if C_curr^LL > C_best^LL ∧ C_curr^IL > C_best^IL then
13:     break
14: return best_sol

B. Algorithm description
Property 3 implies that the algorithm in Sec. VII-A could efficiently select P and Q, if such decisions could be made independently. However, they are clearly interlinked; thus, we propose a more complex solution strategy, called DoubleClimb, which operates as follows.
• First, based on the node capabilities defined in Sec. III, DoubleClimb determines P and Q. It does so by selecting I-L and L-L edges in two nested loops, with the L-L edges resulting in a uniform graph [15]. It also selects the most appropriate value of K for each set of selected edges.
• Given such decisions, it computes the system performance characterized in Sec. V, thus yielding the error ε^K(P,Q), the learning time T^K(P,Q), as well as the cost C^K(P,Q).
• It then compares the obtained values for the learning time and error against the limits ε_max and T_max, and evaluates whether a sufficiently low cost has been achieved. If so, DoubleClimb returns the problem solution; otherwise, it tries to improve the decisions until the system constraints are met and the cost is further reduced.

The DoubleClimb algorithm is presented in Alg. 2 and detailed below. It begins (Line 1) by setting to zero the degree d_L of the subgraph made of L-L edges, and to the empty set the best solution best_sol. Then, while d_L < |L|, i.e., while such a subgraph is not a clique, d_L is first incremented by one (Line 4), and then the cheapest L-L uniform subgraph of degree d_L is chosen in Line 5.

Given such a choice of L-L edges, the algorithm selects the I-L edges essentially in the same way as described in Sec. VII-A: for all possible edges, the cost/benefit ratio – i.e., the ratio between the cost of adding the edge and how much closer to feasibility the problem becomes by doing so – is computed in Line 8, and the edge associated with the lowest ratio is chosen. The loop continues until either all I-L edges are exhausted, or a feasible solution, satisfying constraint (2), is found (Line 7). In the latter case, the cost of the current solution C_curr, computed as per (6), is compared to the one of the best solution found so far (C_best); note that, by convention, the cost of the empty set is equal to ∞. If warranted, the best solution is updated (Line 11); otherwise, we perform the check in Line 12 to assess whether other solutions should be explored. Indeed, as proven in Proposition 2 below, the submodularity of costs implies that trying higher values of d_L does not lead to cheaper solutions. If neither happens, the next value of d_L is tried. After all values of d_L are exhausted, the best solution best_sol is returned in Line 14. If no feasible solution has been found, the problem instance is infeasible and the algorithm returns ∅.
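A compact Python skeleton of Alg. 2 is given below. It is a sketch rather than the authors' implementation: cheapest_uniform, g, edge_cost, total_cost, and feasible are abstract callables supplied by the caller, mirroring the roles they play in Alg. 1 and Alg. 2, and the early-stopping test of Line 12 is omitted for brevity.

```python
import itertools
import math

def double_climb(L, I, cheapest_uniform, g, edge_cost, total_cost, feasible):
    """Skeleton of Alg. 2 (assumed interfaces):
      cheapest_uniform(d) -> set of L-L edges forming a uniform graph of degree d
      g(ll, il)           -> value of constraint (2) for the selected edges
      edge_cost(i, l)     -> c_{i,l}
      total_cost(ll, il)  -> C^K for the best K, as per Sec. V
      feasible(ll, il)    -> True iff constraint (2) holds
    """
    best_sol, best_cost = None, math.inf   # cost of the empty set is infinite
    all_il = set(itertools.product(I, L))
    for d_L in range(1, len(L)):           # outer climb over the L-L degree
        ll, il = cheapest_uniform(d_L), set()
        while not feasible(ll, il) and il != all_il:
            # inner climb: add the I-L edge with the best cost/benefit ratio
            # (assumes a strictly positive marginal benefit for some edge)
            i, l = min(all_il - il,
                       key=lambda e: edge_cost(*e) / (g(ll, il | {e}) - g(ll, il)))
            il.add((i, l))
        if feasible(ll, il) and total_cost(ll, il) < best_cost:
            best_sol, best_cost = ll | il, total_cost(ll, il)
    return best_sol
```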
C. Algorithm analysis

We now prove that Alg. 2 has an excellent competitive ratio as well as low complexity. As a first step, we show that the stopping condition in Line 12 is valid, i.e., no solution better than best_sol is ignored by halting the algorithm when the condition is met.
Proposition 2.
If the condition specified in Line 12 of Alg. 2 is met, then no solution cheaper than best_sol will be found for higher values of d_L.

Proof: Let d_L^best be the value of d_L for which the current best solution was found, and C_LL^best and C_IL^best the corresponding costs for L-L and I-L edges (resp.). At the current iteration, we have d_L = d_L^curr > d_L^best, and the corresponding costs are C_LL^curr > C_LL^best and C_IL^curr > C_IL^best. Let us now consider a future iteration where the value of d_L is d_L^next > d_L^curr > d_L^best. C_LL^next depends on two effects: if we increase the number of L-L edges, the cost due to L-L edges will increase. However, more L-L edges also imply fewer iterations; thus, they may lead to a reduced cost. Since similar observations hold for C_IL^next, which effect prevails depends on how strong the benefit of increasing d_L is. However, as per the submodularity property (Proposition 1), the benefit of adding L-L edges and I-L edges decreases as d_L increases: if moving from d_L^best to d_L^curr actually increased the cost of L-L and I-L edges, it is not possible that moving to d_L^next will provide a better solution.

Thanks to Proposition 2 and Property 3, we can now prove our main result about Alg. 2.

Theorem 1.
Alg. 2 has a 1+1/|I| competitive ratio.

Proof: There are two possible sources of suboptimality, namely, the choice of d_L and that of the I-L edges to select. By Proposition 2, and considering that, if the condition in Line 12 is never triggered, all possible values of d_L are tried out, the choice of d_L is optimal. As for the I-L edges, Line 7–Line 9 of Alg. 2 reflect exactly the same algorithmic steps reported in Sec. VII-A which, as per Property 3, lead to a 1+1/|I| competitive ratio in our case.

Finally, we can prove that Alg. 2 has a very low, namely, cubic, worst-case computational complexity.

Property 4.
Alg. 2 has a worst-case computational complexity of O(|L|²|I|).

Proof: From inspection of the nested loops in Alg. 2, one can see that the outer one is run at most once for each value of d_L, i.e., at most |L| times. The inner one is run at most once for each possible I-L edge, i.e., at most |L||I| times. As for the sets of edges to activate for each value of d_L (function cheapest_uniform in Line 5), they can be pre-computed and thus do not influence the overall complexity.

It is also worth stressing that Property 4 concerns the worst-case complexity, but the actual one is often much lower. Indeed, in Line 7–Line 9 we are likely to compute the same costs in different iterations; if such costs are cached, à la dynamic programming, the run time can be dramatically decreased, to be slightly more than linear in |L| + |I|.

VIII. NUMERICAL RESULTS
In the following, we describe the reference scenario and benchmark solutions we consider (Sec. VIII-A), before studying the performance of DoubleClimb (Sec. VIII-B).
A. Reference scenario
We consider an Internet-of-Things (IoT) environment similar to the one referred to in [5], whereby:
• individual sensors produce samples, either periodically or as a reaction to an external event;
• aggregators, also known as gateways, collect and summarize the samples, before forwarding them in uplink;
• distributed ML algorithms, running at the edge of the network, leverage the samples to gather insights on the changes in external conditions.
In terms of our system model, aggregators correspond to I-nodes, and edge nodes running the ML algorithms correspond to L-nodes. New samples arrive every few seconds, and updating the gradient computations takes a comparable time. Note that similar approaches have been proposed for such applications as smart-city monitoring [37], support of connected vehicles [38], and attack/anomaly detection [39]. With reference to the taxonomy in Sec. I, we fall in the active learning case, as the data arrival and gradient computation are interleaved but not synchronized, e.g., new data can arrive both before and after a gradient computation is complete.

We refer to the real-world urban topology presented in [40] and shown in Fig. 6, depicting the network of a major operator. Specifically, the network nodes represented in brown act as aggregators, hence, as I-nodes, while those in blue are edge nodes acting as L-nodes. As shown in Fig. 6, all L-nodes can be connected with one another, while each I-node can only be connected to one L-node.

Fig. 6. Our reference topology, depicting the network of a major operator (source: [40]).

Normalized sample generation and gradient computation times are distributed exponentially with mean 1, while the I-L and L-L edges are randomly assigned a normalized cost between 0 and 1 units. I- and L-nodes have no operational cost, reflecting the fact that, in our reference environment, they cannot be switched off without discontinuing the service. In the basic version of the scenario, at every iteration each I-node generates between 10 and 100 samples; such a value is proportional to the traffic served by each node in the real-world topology [40]. In the rich scenario, representing applications where data is more plentiful, such a value is multiplied by five.

Benchmark solutions.
We compare DoubleClimb against two benchmark solutions. The first, called Opt-Unif, follows the approach used (among others) by [15], and returns the cheapest solution among the feasible ones such that both the graphs formed by L-L and I-L edges have uniform degree. The second benchmark, labeled as "Optimum/GA" in the plots, performs the selection of the I-L edges (i.e., the inner loop in Alg. 2) leveraging a genetic algorithm (GA) approach with the following parameters (see also the sketch after this list):
• number of generations: 50;
• solutions per population: 100;
• parents mating: 4;
• mutation probability: 15%;
• crossover type: single point;
• gene space: {0,1};
• number of genes: |I||L|.
Each solution corresponds to a string of binary values whose length equals the number of possible I-L edges: having a 1 in a given position means that the corresponding I-L edge is activated. The relatively large mutation probability reflects the importance of exploring multiple different solutions (i.e., exploration), given the combinatorial nature of the problem at hand and the fact that similar strings do not necessarily yield similar performance. When the size of the problem made it possible (i.e., for small values of d_L), we have compared the performance of the genetic algorithm against the optimum obtained through brute force, and found that the two closely match.
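For concreteness, a GA with the parameters above can be realized in a few lines of numpy; the sketch below is ours, not the evaluation code, and the fitness function is a placeholder standing for the cost-plus-feasibility evaluation of Sec. V.

```python
import numpy as np

rng = np.random.default_rng(0)
N_GENES = 5 * 10   # |I|*|L|: one gene per possible I-L edge (illustrative sizes)
POP, GENERATIONS, PARENTS, P_MUT = 100, 50, 4, 0.15

def fitness(genome):
    # Placeholder: should return -cost if the I-L edge set encoded by
    # `genome` satisfies constraint (2), and -inf otherwise.
    return -float(genome.sum())

pop = rng.integers(0, 2, size=(POP, N_GENES))
for _ in range(GENERATIONS):
    scores = np.array([fitness(g) for g in pop])
    parents = pop[np.argsort(scores)[-PARENTS:]]       # mating pool (elitist)
    children = []
    for _ in range(POP - PARENTS):
        a, b = parents[rng.integers(PARENTS, size=2)]
        cut = rng.integers(1, N_GENES)                 # single-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        child[rng.random(N_GENES) < P_MUT] ^= 1        # 15% per-gene mutation
        children.append(child)
    pop = np.vstack([parents] + children)

best = pop[np.argmax([fitness(g) for g in pop])]
```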
B. Performance comparison

The first plot in Fig. 7 shows the cost of DoubleClimb and its benchmarks, for different numbers of L-nodes. As expected, the cost increases with |L| and decreases in the rich scenario, where the higher quantity of data results in faster convergence. Also, it is clear that DoubleClimb outperforms Opt-Unif and matches the performance of Optimum/GA. GA approaches are not, in general, guaranteed to yield optimal performance; therefore, we cannot conclude that DoubleClimb makes optimal decisions other than for the small instances where the comparison against brute force was possible. However, GA approaches have long been known to be remarkably good at finding optimal or near-optimal solutions for combinatorial problems such as the one at hand, at the price of long run times, as shown next in Fig. 9 and Fig. 10. Observing that DoubleClimb matches Optimum/GA in all scenarios and for all values of d_L therefore boosts our confidence in the algorithm's effectiveness.

We now look deeper into the decisions made by each strategy. The second plot in Fig. 7 depicts the selected value of d_L, normalized to |L|. Interestingly, such a value is lower in the rich scenario, confirming our intuition that a tighter cooperation between L-nodes and more data coming from I-nodes are, to an extent, alternative ways to achieve faster learning. DoubleClimb and Opt-Unif make exactly the same decisions in all cases, which suggests that the difference in cost shown in the first plot only comes from the choice of I-L edges. Accordingly, the third plot in Fig. 7, depicting the fraction of I-L edges selected by each strategy, highlights how DoubleClimb uses substantially fewer edges than Opt-Unif. This greater flexibility in the choice of I-L edges is an important asset of our approach, allowing us to beat state-of-the-art alternatives.

The fourth plot in Fig. 7 shows how DoubleClimb not only uses fewer I-L edges, but also chooses the right ones. The plot depicts the number of new samples arriving at each iteration and highlights how, in spite of the substantially smaller number of selected I-L edges, DoubleClimb obtains a similar number of samples as Opt-Unif. Such an effect is especially evident in the basic scenario, where the number of samples provided by each I-node is smaller.

Comparing the DoubleClimb and Optimum/GA curves, we can observe that in some cases Optimum/GA activates slightly fewer I-L edges than DoubleClimb in the basic scenario, e.g., for d_L = 8. This corresponds to solutions that DoubleClimb is unable to reach due to its hill-climbing nature; however, the impact on the overall cost (see the first plot in Fig. 7) is negligible. Interestingly, DoubleClimb and Optimum/GA make the very same decisions in the rich scenario, confirming the somewhat counterintuitive notion stated in Property 3, i.e., that in richer scenarios the solutions yielded by DoubleClimb tend to be closer to the optimum.

In Fig. 8, we seek to better understand how DoubleClimb and Opt-Unif operate. Every marker in the plots corresponds to one solution examined by the algorithms; feasible solutions are denoted by a silver circle, and the cheapest of such solutions is denoted by a black star. Note that Opt-Unif explores fewer solutions than DoubleClimb, as it is restricted to creating uniform logical topologies.
Also, under the rich scenario it is easier for DoubleClimb to reach a high-quality solution; hence, the algorithm terminates earlier.
Fig. 7. Comparison between DoubleClimb, Opt-Unif, and the optimum (obtained via brute force) in the basic and rich scenarios, for different values of |L|. From left to right: total cost; selected value of d_L, normalized to the maximum; fraction of selected I-L edges; total number of extra samples per iteration.
Fig. 8. Cost of the solutions examined at each iteration by DoubleClimb (first two plots) and Opt-Unif (last two plots), in the basic (first and third plot) and rich (second and fourth plot) scenarios.
Fig. 9. Normalized time and error of the solutions examined at each iteration by DoubleClimb (left), Opt-Unif (center), and GA (right), in the basic scenario. Different colors correspond to different values of d_L, as in Fig. 8.
Fig. 10. Normalized time and error of the solutions examined at each iteration by DoubleClimb (left), Opt-Unif (center), and GA (right), in the rich scenario. Different colors correspond to different values of d_L, as in Fig. 8.

The first two plots in Fig. 8, representing DoubleClimb in the basic and rich scenarios, show how the algorithm starts with a low value of d_L and no I-L edges, hence, with a low cost. Then, new edges are added until either a feasible solution is found, or all I-L edges are exhausted (as happens in the first plot, representing the basic scenario). The double vertical lines in the first two plots correspond to the triggering of the condition in Line 12 of Alg. 2; the plots confirm that enforcing such a condition does not result in ignoring cheaper feasible solutions.

The last two plots in Fig. 8 represent Opt-Unif (again in the basic and rich scenarios, respectively), and clearly highlight its differences from DoubleClimb. As mentioned, Opt-Unif tries fewer solutions; also, multiple feasible solutions are tried out for the same value of d_L, since there is no stopping criterion analogous to the one in Line 13 of Alg. 2. Importantly, the feasible solutions explored by Opt-Unif are more costly than those explored by DoubleClimb for the same value of d_L, a further confirmation of the importance of a flexible choice of I-L edges.

Last, in Fig. 9 and Fig. 10, we examine the error and learning time associated with each of the solutions examined by DoubleClimb and its benchmarks, in the basic and rich scenarios respectively. Both quantities are normalized to their respective limits; thus, neither line exceeds 1 if the corresponding solution is feasible. It is interesting to note how adding I-L edges (moving from one solution to the next) affects error and time. The former (dotted lines) steadily decreases until its limit is reached, and then stays constant; recall that the learning process is interrupted upon reaching ε_max, so the normalized error never drops substantially below 1. The time (solid lines) increases at first, owing to the need to wait for more I-nodes; then, it decreases, as learning can be completed with fewer iterations. Importantly, both behaviors exactly match those described in Property 2. The third plots of both Fig. 9 and Fig. 10 highlight the behavior of GA approaches, which try multiple different solutions of varying quality and, in the interest of exploration, tend not to abandon low-quality solutions, on the grounds that they may mutate into high-quality solutions at some later stage.

Finally, the x-axis in those plots reminds us of the very high efficiency of DoubleClimb, whose number of examined solutions is orders of magnitude lower than that of Optimum/GA. Recalling that GA algorithms themselves examine orders of magnitude fewer solutions than exact algorithms, the plots further highlight the efficiency of DoubleClimb, coupled with the effectiveness shown in Fig. 7.
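As an illustration of the exploration behavior just described, the sketch below mimics a greedy inner loop that activates I-L edges in order of increasing cost. It is our reading of Alg. 2 (whose pseudocode is given earlier in the paper), and the early-stop condition, the function name, and the arguments are all assumptions for illustration.

```python
import numpy as np

def greedy_edge_selection(cost_IL, is_feasible, best_cost=float("inf")):
    """Sketch of our reading of the inner loop of Alg. 2: starting from no
    I-L edges, activate the cheapest remaining edge until the configuration
    becomes feasible or the edges are exhausted; stop early (our guess at
    the role of the conditions in Lines 12-13) once the running cost can no
    longer beat the best feasible solution found so far."""
    num_I, num_L = cost_IL.shape
    active = np.zeros((num_I, num_L), dtype=bool)
    cost = 0.0
    for k in np.argsort(cost_IL, axis=None):  # candidate edges, cheapest first
        i, l = divmod(int(k), num_L)
        active[i, l] = True
        cost += cost_IL[i, l]
        if cost >= best_cost:                 # early stop: cannot improve
            return None, cost
        if is_feasible(active):               # error and time targets met
            return active, cost
    return None, cost                         # infeasible even with all edges

# Hypothetical usage, reusing cost_IL and is_feasible from the GA sketch above:
# edges, cost = greedy_edge_selection(cost_IL, is_feasible)
```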
IX. CONCLUSION

We addressed the problem of defining an optimal level of cooperation among network nodes performing a supervised learning task. We first developed a system model accounting for the presence of both learning nodes and information nodes interacting with each other. Then we formulated the problem of choosing which learning nodes should cooperate to complete the learning task, which information nodes should provide them with data, and the number of iterations to perform. Although the problem is NP-hard, we proved several important properties of it, most notably its submodularity, which allowed us to define a solution algorithm that has cubic worst-case time complexity and is 1/|I|-competitive, with I being the set of information nodes. Numerical results also show that our approach closely matches the optimum and outperforms state-of-the-art solutions, for both classification and regression tasks.

REFERENCES

[1] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, "Scaling distributed machine learning with the parameter server," in USENIX OSDI, 2014.
[2] H. X. Pham, H. M. La, D. Feil-Seifer, and A. Nefian, "Cooperative and distributed reinforcement learning of drones for field coverage," arXiv preprint arXiv:1803.07250, 2018.
[3] H. Y. Ong, K. Chavez, and A. Hong, "Distributed deep Q-learning," CoRR, 2015.
[4] A. Nedić, A. Olshevsky, and M. G. Rabbat, "Network topology and communication-computation tradeoffs in decentralized optimization," Proceedings of the IEEE, 2018.
[5] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE Journal on Selected Areas in Communications, 2019.
[6] H. H. Zhuo, W. Feng, Y. Lin, Q. Xu, and Q. Yang, "Federated deep reinforcement learning," arXiv preprint arXiv:1901.08277, 2019.
[7] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[11] …, IEEE Communications Magazine, 2020.
[12] ETSI, "MEC Working Item 36, MEC in resource constrained terminals, fixed or mobile," https://portal.etsi.org/webapp/WorkProgram/, online; accessed July 2020.
[13] A. Kadav and E. Kruus, "ASAP: Asynchronous approximate data-parallel computation," arXiv preprint arXiv:1612.08608, 2016.
[14] S. Li, S. M. M. Kalan, A. S. Avestimehr, and M. Soltanolkotabi, "Near-optimal straggler mitigation for distributed gradient methods," in IEEE IPDPSW, 2018.
[15] G. Neglia, G. Calbi, D. Towsley, and G. Vardoyan, "The role of network topology for distributed machine learning," in IEEE INFOCOM, 2019.
[16] S. Levine, A. Kumar, G. Tucker, and J. Fu, "Offline reinforcement learning: Tutorial, review, and perspectives on open problems," arXiv preprint arXiv:2005.01643, 2020.
[17] A. A. Abdellatif, C. F. Chiasserini, and F. Malandrino, "Active learning-based classification in automated connected vehicles," in IEEE INFOCOM PERSIST-IoT Workshop, 2020.
[18] K. Yang, J. Ren, Y. Zhu, and W. Zhang, "Active learning for wireless IoT intrusion detection," IEEE Wireless Communications, 2018.
[19] T. Chen, K. Zhang, G. B. Giannakis, and T. Başar, "Communication-efficient distributed reinforcement learning," arXiv preprint arXiv:1812.03239, 2018.
[20] Y. Li, I.-J. Liu, Y. Yuan, D. Chen, A. Schwing, and J. Huang, "Accelerating distributed reinforcement learning with in-switch computing," in ISCA, 2019.
[21] J. Konečný, B. McMahan, and D. Ramage, "Federated optimization: Distributed optimization beyond the datacenter," arXiv preprint arXiv:1511.03575, 2015.
[22] O. Shamir and T. Zhang, "Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes," in International Conference on Machine Learning, 2013.
[23] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage, "Federated learning for mobile keyboard prediction," arXiv preprint arXiv:1811.03604, 2018.
[24] OASIS Standard, "MQTT Version 5.0, Mar. 2019," https://docs.oasis-open.org/mqtt/mqtt/v5.0/mqtt-v5.0.html, online; accessed July 2020.
[25] "zenoh: Zero Overhead Pub/sub, Store/Query and Compute," http://zenoh.io, online; accessed July 2020.
[26] 3GPP, "TS23.501, System architecture for the 5G System (5GS), Rel. 15," https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=3144, online; accessed July 2020.
[27] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications," ACM Transactions on Intelligent Systems and Technology, 2019.
[28] L. Deng, "The MNIST database of handwritten digit images for machine learning research," IEEE Signal Processing Magazine, 2012.
[30] N. J. D. Nagelkerke, "A note on a general definition of the coefficient of determination," Biometrika, 1991.
[31] C. Perlich, F. Provost, and J. S. Simonoff, "Tree induction vs. logistic regression: A learning-curve analysis," Journal of Machine Learning Research, 2003.
[32] D. Bolton, "The multinomial theorem," The Mathematical Gazette, pp. 336–342, 1968.
[33] G. J. Woeginger, "Exact algorithms for NP-hard problems: A survey," in Combinatorial Optimization—Eureka, You Shrink! Springer, 2003.
[34] L. Lovász, "Submodular functions and convexity," in Mathematical Programming: The State of the Art. Springer, 1983.
[35] M. Conforti and G. Cornuéjols, "Submodular set functions, matroids and the greedy algorithm: Tight worst-case bounds and some generalizations of the Rado-Edmonds theorem," Discrete Applied Mathematics, 1984.
[36] R. K. Iyer and J. A. Bilmes, "Submodular optimization with submodular cover and submodular knapsack constraints," in Advances in Neural Information Processing Systems, 2013.
[37] L. Valerio, M. Conti, and A. Passarella, "Energy efficient distributed analytics at the edge of the network for IoT environments," Elsevier Pervasive and Mobile Computing, 2018.
[38] H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu, "Machine learning for vehicular networks: Recent advances and application examples," IEEE Vehicular Technology Magazine, 2018.
[39] A. A. Diro and N. Chilamkurti, "Distributed attack detection scheme using deep learning approach for Internet of Things," Future Generation Computer Systems, 2018.
[40] 5G-Crosshaul, "D1.2: Final 5G-Crosshaul system design and economic analysis," December 2017.