On the Convergence Analysis of Asynchronous Distributed Quadratic Programming via Dual Decomposition
KOOKTAE LEE∗ AND RAKTIM BHATTACHARYA∗

Abstract.
In this paper, we analyze the convergence as well as the rate of convergence of asynchronous distributed quadratic programming (QP) with the dual decomposition technique. In general, distributed optimization requires synchronization of data at each iteration step due to the interdependency of data. This synchronization latency may incur a large amount of waiting time caused by an idle process during computation. We aim to attack this synchronization penalty in distributed QP problems by implementing an asynchronous update of the dual variable. The price to pay for adopting asynchronous computing algorithms is unpredictability of the solution, resulting in a tradeoff between speedup and accuracy. Thus, the convergence to an optimal solution is not guaranteed owing to the stochastic behavior of asynchrony. In this paper, we employ the switched system framework as an analysis tool to investigate the convergence of asynchronous distributed QP. This switched system framework facilitates analysis of asynchronous distributed QP with dual decomposition, providing necessary and sufficient conditions for mean square convergence. Also, we provide an analytic expression for the rate of convergence through the switched system, which enables performance analysis of asynchronous algorithms as compared with the synchronous case. To verify the validity of the proposed methods, numerical examples are presented with an implementation of asynchronous parallel QP using OpenMP.

Key words. Distributed Optimization, Parallel Quadratic Programming, Asynchronous Algorithm, Dual Decomposition, Switched System
AMS subject classifications.
1. Introduction.
Recent advancement of distributed and parallel computing technologies has brought massive processing capabilities for solving large-scale optimization problems. Distributed and parallel computing may reduce the computation time to find an optimal solution by leveraging parallel processing. Particularly, distributed optimization will likely be considered a key element for large-scale statistics and machine learning problems, currently represented by the word “big data”. One of the reasons for the preference for distributed optimization in big data is that the data set is so huge that it is desirably stored in a distributed manner. Thus, a global objective is achieved in conjunction with local objective functions assigned to each distributed node, which requires communication between distributed nodes in order to attain an optimal solution.

For several decades, there have been remarkable studies that have enabled finding an optimal solution in a decentralized fashion, for example, dual decomposition [9], [2], [12], [19], [3], augmented Lagrangian methods for constrained optimization [21], [29], [16], [1], the alternating direction method of multipliers (ADMM) [20], [18], [17], Spingarn's method [30], Bregman iterative algorithms for ℓ1 problems [6], [8], [11], Douglas-Rachford splitting [10], [27], and proximal methods [31]. More details about the history of developments on the methods listed above can be found in the literature [5].

In this study, we mainly focus on the analysis of asynchronous distributed optimization problems. In particular, we aim to investigate the behavior of asynchrony in the Lagrangian dual decomposition method for distributed quadratic programming (QP) problems, where QP problems refer to optimization problems with a quadratic objective function associated with linear constraints.
∗Department of Aerospace Engineering, Texas A&M University, College Station, TX 77843-3141 USA ([email protected], [email protected]).

This type of QP problem has broad applications including least squares with linear constraints, regression analysis and statistics, SVMs, lasso, portfolio optimization problems, etc. With an implementation of Lagrangian dual decomposition, the original QP problems that are separable can be solved in a distributed sense. For this dual decomposition technique, we will study how the asynchronous computing algorithms affect the convergence as well as the rate of convergence for the dual variable.

Typically, distributed optimization requires synchronization of the data set at each iteration step due to the interdependency of data. For massive parallelism, this synchronization may result in a large amount of waiting time, as load imbalance between distributed computing resources would take place at each iteration step. In this case, some nodes that have completed their tasks should wait for others to finish their assigned jobs, which causes an idle process of computing resources, incurring a waste of computation time. In this paper, we attack this restriction on the synchronization penalty necessarily required in distributed and parallel computing, through the implementation of asynchronous computing algorithms. The asynchronous computing algorithms that do not suffer from synchronization latency thus have the potential to break through the paradigm of distributed and parallel optimization. Unfortunately, it is not completely revealed yet what the effect of asynchrony is on the convergence as well as the rate of convergence in distributed optimization. Due to the stochastic behavior of asynchrony, the solution for the asynchronous distributed QP may diverge even if it is guaranteed that the synchronous scheme provides convergence to an optimal solution.
Although Bertsekas [4] introduced a sufficient condition for the convergence of general asynchronous fixed-point iterations (see chapter 6.2), which is equivalent to a diagonal dominance condition for QP problems, this condition is known to be very strong and thus conservative, according to the literature [28]. Therefore, the primary emphasis of this research is placed on: 1) convergence analysis; 2) analytic estimation of the rate of convergence, by employing a new framework for the analysis of distributed QP problems with an asynchronous update of the dual variable.

For this purpose, we will adopt the switched system [15], [14], [13], [25], [22], [26], [24], [23] framework as an analysis tool. In general, a switched system is defined as a dynamical system that consists of a set of subsystem dynamics and a certain switching logic that governs the switching between subsystems. For asynchronous algorithms whose dynamics are modeled by the switched system, the subsystem dynamics denote all possible asynchronous computations due to differences in data processing time in each distributed computing device. Then, a certain switching logic can be implemented to stand for random switching between subsystem dynamics. Thus, the switched system framework can properly model the dynamics of asynchronous computing algorithms. Lee et al. [24], for example, introduced the switched system to represent the behavior of asynchrony in massively parallel numerical algorithms. In this literature, the authors applied the switched dynamical system framework to analyze the convergence, rate of convergence, and error probability for asynchronous parallel numerical algorithms. Based on this switched system framework, this paper provides a new approach for convergence analysis of asynchronous distributed QP problems with the dual decomposition technique. The proposed methods will guarantee the convergence to the optimal solution in the mean square sense.
In addition, we will study how fast each scheme (e.g., the synchronous and asynchronous schemes) converges to an optimal solution by studying the rate of convergence in analytic form. Therefore, this paper presents fundamental yet important analysis of asynchronous distributed QP problems through the switched system framework, which facilitates investigation of the stochastic behavior of asynchrony.

The rest of this paper is organized as follows. In section 2, preliminaries are presented in connection with problem formulations for asynchronous distributed QP problems using dual decomposition. Section 3 introduces the switched system to model the asynchrony in asynchronous distributed QP problems. The results for the convergence and the rate of convergence obtained by employing the switched system framework are derived in sections 4 and 5, respectively. A numerical example with a real implementation of distributed and parallel QP is provided in section 6 to verify the validity of the proposed methods. Finally, section 7 concludes the paper.
2. Preliminaries and Problem Formulation. Notation:
The real numbers, the positive integers, and the non-negative integers are denoted by the symbols R, N, and N₀, respectively. The symbol ⊤ represents the transpose operator. For any real matrices A, B ∈ R^{n×n}, the inequality A < B is interpreted in the quadratic sense (i.e., v⊤Av < v⊤Bv for any real vector v ∈ R^n). In addition, the symbol ⊗ stands for the Kronecker product.

Consider the following QP problem with a linear inequality constraint:

  minimize f(x)    (2.1)
  subject to Ax ≤ b,    (2.2)

where f(x) is given by a quadratic form, f(x) = (1/2) x⊤Qx + c⊤x, the matrix Q ∈ R^{n×n} is symmetric, positive definite, and c ∈ R^n is a vector. Further, in the inequality constraint (2.2), A ∈ R^{m×n} and b ∈ R^m. If we define the Lagrangian as L(x, y) ≜ f(x) + y⊤(Ax − b), where y ∈ R^m is the dual variable or Lagrange multiplier, then the dual problem for the above QP is formulated as follows:

  maximize inf_x L(x, y)    (2.3)
  subject to y ≥ 0.    (2.4)

The primal optimal point x⋆ is obtained from a dual optimal point y⋆ as x⋆ = argmin_x L(x, y⋆). By implementing gradient ascent, one can solve the dual problem, provided that inf_x L(x, y) is differentiable. In this case, the iteration to find x⋆ is constructed as follows:

  x^{k+1} := argmin_x L(x, y^k),    (2.5)
  y^{k+1} := y^k + α^k (A x^{k+1} − b),    (2.6)

where α^k is a step size and the superscript denotes the discrete-time iteration index.

For the quadratic objective function f(x), the value argmin_x L(x, y^k) can alternatively be obtained from ∇_x L(x, y^k) = 0, which leads to

  ∇_x ( (1/2) x⊤Qx + c⊤x + y^{k⊤}(Ax − b) ) = Qx + c + A⊤y^k = 0.

From (2.5), we have

  x^{k+1} = −Q^{−1}(A⊤y^k + c).    (2.7)

Plugging (2.7) into (2.6) results in

  y^{k+1} = y^k + α^k ( A(−Q^{−1}(A⊤y^k + c)) − b )
          = (I − α^k A Q^{−1} A⊤) y^k − α^k (A Q^{−1} c + b).
(2.8)

With the assumption that y^k ≥ 0 ∀k, the above equation provides the solution for y⋆ and hence x⋆, if ρ(I − α^k A Q^{−1} A⊤) < 1:

  y⋆ = (I − α^k A Q^{−1} A⊤) y⋆ − α^k (A Q^{−1} c + b)
  ⇒ y⋆ = −(A Q^{−1} A⊤)^{−1} (A Q^{−1} c + b), (if A Q^{−1} A⊤ is non-singular),
    x⋆ = −Q^{−1}(A⊤ y⋆ + c).    (2.9)

In this subsection, we consider that f(x) = (1/2) x⊤Qx + c⊤x is separable, which means

  f(x) = Σ_{i=1}^N f_i(x_i) = Σ_{i=1}^N ( (1/2) x_i⊤ Q_i x_i + c_i⊤ x_i ),

where x = [x_1⊤, x_2⊤, ..., x_N⊤]⊤ and the variables x_i ∈ R^{n_i}, i = 1, 2, ..., N, are subvectors of x. Also, the matrix A in (2.2) satisfies Ax = Σ_{i=1}^N A_i x_i, where A_i is such that A = [A_1, A_2, ..., A_N].

Then, the equations (2.5) and (2.6) are updated by

  x_i^{k+1} := argmin_{x_i} L(x_i, y^k) = −Q_i^{−1}(A_i⊤ y^k + c_i),    (2.10)
  y^{k+1} := y^k + α^k (A x^{k+1} − b).    (2.11)

Note that when updating x_i^{k+1}, i = 1, 2, ..., N, each value is computed by a distributed node. Hence, the computation for x_i^{k+1} can be processed in parallel and then each value of x_i^{k+1} is transmitted to the master node to compute y^{k+1} in the gathering stage. Therefore, as in (2.11), updating y^{k+1} requires synchronization of x_i^{k+1} across all spatial indices i at time k+1, because x^{k+1} is obtained by stacking x_i^{k+1} from i = 1 to N. In Fig. 1, we depict the conceptual schematic of the synchronous update for the dual variable y.

Fig. 1. The schematic of update timing for the variable y^k; the upper one shows the synchronous algorithm, where q is the length of the maximum delay, i.e., all delays are bounded by q; the bottom one shows the asynchronous algorithm. The time to compute y^k is given by CPU time.

If a computing delay occurs for one of the indices i due to the difference in processing time among distributed nodes, the process to update y^{k+1} has to be paused until all data is received from the distributed nodes.
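As a concrete illustration, the synchronous distributed update (2.10)–(2.11) can be sketched numerically. The problem data (Q_i, c_i, A_i, b), the step size, and all variable names below are hypothetical placeholders, chosen so that the constraint y ≥ 0 remains inactive and the unprojected recursion (2.8) applies:

```python
import numpy as np

# Hypothetical separable test problem: f(x) = sum_i (1/2) x_i' Q_i x_i + c_i' x_i,
# with A = [A_1, ..., A_N]; the data is illustrative, not from the paper.
Q_blocks = [np.array([[2.0]]), np.array([[4.0]])]
c_blocks = [np.array([-2.0]), np.array([-2.0])]
A_blocks = [np.array([[1.0]]), np.array([[1.0]])]
b = np.array([0.5])

alpha = 0.2          # step size alpha^k, held constant
y = np.zeros(1)      # dual variable

for k in range(200):
    # x_i-updates (2.10): independent of one another, hence computable in parallel
    x_blocks = [-np.linalg.inv(Qi) @ (Ai.T @ y + ci)
                for Qi, Ai, ci in zip(Q_blocks, A_blocks, c_blocks)]
    # gathering stage: the master stacks A_i x_i and applies the y-update (2.11)
    Ax = sum(Ai @ xi for Ai, xi in zip(A_blocks, x_blocks))
    y = y + alpha * (Ax - b)

# closed-form dual optimum (2.9) for the stacked problem, as a cross-check
Q = np.diag([2.0, 4.0]); c = np.array([-2.0, -2.0]); A = np.array([[1.0, 1.0]])
y_star = -np.linalg.inv(A @ np.linalg.inv(Q) @ A.T) @ (A @ np.linalg.inv(Q) @ c + b)
```

Since each x_i-update depends only on y, the list comprehension above is a serial stand-in for the parallel computation followed by the gathering stage described in the text.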
This implies that the more parallel computing we have, the more delays may take place, resulting in a large amount of idle time. Consequently, this idle time for synchronization becomes dominant compared to the pure computation time to solve the QP problem in parallel. In massively parallel computing, it has been reported that the synchronization latency may be up to 50% of the total computation time, according to the literature [7]. In order to mitigate or avoid this type of restriction, which severely affects the performance in obtaining an optimal solution, we introduce the asynchronous computing algorithm in the following subsection.
In order to alleviate this synchronization penalty, we consider an asynchronous update of the dual variable y. In this case, the master node computing y^{k+1} does not wait until all x_i^{k+1} are gathered. Rather, it proceeds with the values for x_i saved in the buffer memory. Thus, the y value is updated asynchronously. To model the asynchronous dynamics of dual decomposition, we consider the new state vector as follows.

• The state for the asynchronous model:

  x̃^k := [x_1^{k*_1 ⊤}, x_2^{k*_2 ⊤}, ..., x_N^{k*_N ⊤}]⊤,

where k*_i ∈ {k, k−1, ..., k−q+1}, i = 1, 2, ..., N, denotes the delay term that may take place due to the load imbalance in distributed nodes, and the term q ∈ N represents the maximum possible delay.

For this asynchronous case, the y-update is given by

  y^{k+1} := y^k + α^k (A x̃^{k+1} − b)    (2.12)
           = y^k + Σ_{i=1}^N ( α_i^k A_i x̃_i^{k+1} − (1/N) α_i^k b ),

where α_i^k is the step size for the index i.

Fig. 2. The schematic of the stochastic asynchronous algorithm in distributed quadratic programming. In this figure, the maximum delay is bounded by k − q + 1 ≤ k*_i ≤ k, ∀i. Each node has the probability Π_i to represent random delays. The notation is as follows: q is the maximum possible delay; x_i^k is the value of x_i at time k; x_i^{k*_i} is the random variable such that x_i^{k*_i} ∈ {x_i^k, x_i^{k−1}, ..., x_i^{k−q+1}}; Π_i := [(π_1)_i, (π_2)_i, ..., (π_q)_i], where (π_j)_i, j = 1, 2, ..., q, stands for the modal probability for x_i^{k*_i}; and x̃^k := [x_1^{k*_1 ⊤}, x_2^{k*_2 ⊤}, ..., x_N^{k*_N ⊤}]⊤.

Although α_i^k may vary at each time step, we let α_i^k be a constant value, denoted by α_i, for simplicity. Hence, it satisfies α := Σ_{i=1}^N α_i, which is a fixed value.

There are two different ways to update the dual variable y. Throughout the paper, we denote these two cases as the deterministic asynchronous algorithm and the stochastic asynchronous algorithm, respectively, in order to clarify and differentiate them. The deterministic asynchronous algorithm stands for the case where the variable k*_i is considered a constant value given by k*_i := k − q + 1, ∀i. Thus, it leads to x̃^k := [x_1^{k−q+1 ⊤}, x_2^{k−q+1 ⊤}, ..., x_N^{k−q+1 ⊤}]⊤. In this case, it is assumed that the value x_i^{k−q+1}, which is the q-step prior value of x_i^k, is always available to the master node.
In other words, all delays are assumed to be bounded by the finite value q. Therefore, one can proceed with the y-update, given in (2.12), without synchronization when applying the deterministic asynchronous algorithm. Note that there is no randomness in the deterministic asynchronous algorithm. Although this deterministic case obviates the unnecessary idle time by avoiding synchronization, it always utilizes q-step prior values saved in the buffer memory. In the real implementation of distributed optimization, however, k*_i varies across distributed nodes and also changes over each iteration. Thus, we consider another case by letting x_i^{k*_i} be a random vector, where k*_i becomes one of the values in the given set {k, k−1, ..., k−q+1}. To distinguish this case from the deterministic asynchronous algorithm, it is referred to as the stochastic asynchronous algorithm.

Fig. 2 describes the conceptual schematic of the stochastic asynchronous algorithm using dual decomposition in the QP problem. Depending on the processing capability and load balance in distributed nodes, the value for x_i^k may or may not be available in the master node at each iteration step. We assume that this delay is bounded by the finite value q. To describe the randomness of such delays, we adopt a probability Π_i := [(π_1)_i, (π_2)_i, ..., (π_q)_i] ∈ R^{1×q} that predicts which value for x_i^k will be used to update y^k, as shown in Fig. 2.

Starting from (2.12), with the definition of the index set S_j := {i | x̃_i^{k+1} = x_i^j} and the symbol Φ_i := α_i A_i Q_i^{−1} A_i⊤, the state dynamics of the stochastic asynchronous algorithm is then given by

  y^{k+1} = y^k + Σ_{i=1}^N ( α_i A_i x̃_i^{k+1} − (1/N) α_i b )
          = y^k + Σ_{i∈S_{k+1}} α_i A_i x_i^{k+1} + Σ_{i∈S_k} α_i A_i x_i^k + ··· + Σ_{i∈S_{k−q+2}} α_i A_i x_i^{k−q+2} + Σ_{i=1}^N ( −(1/N) α_i b )
          = ( I − Σ_{i∈S_{k+1}} Φ_i ) y^k − Σ_{i∈S_k} Φ_i y^{k−1} − ···   (by (2.10))
            − Σ_{i∈S_{k−q+2}} Φ_i y^{k−q+1} + Σ_{i=1}^N ( −α_i A_i Q_i^{−1} c_i − (1/N) α_i b ).
(2.13)

The above equation is simplified by the following definitions:

  R_i(k) := Σ_{j∈S_{k−i+2}} Φ_j,    (2.14)
  B := Σ_{i=1}^N ( −α_i A_i Q_i^{−1} c_i − (1/N) α_i b ),    (2.15)

resulting in

  y^{k+1} = (I − R_1(k)) y^k − R_2(k) y^{k−1} − ··· − R_q(k) y^{k−q+1} + B,    (2.16)

where the time-varying matrix R_i(k) completely depends on the value k*_i, which is a random event.

As described in [4], it is a very challenging task to analyze the stochastic asynchronous algorithm (see page 101, chapter 1). The primary goal of this paper is, therefore, to analyze not only the convergence but also the rate of convergence for the stochastic asynchronous algorithm, which brings a stochastic process for the state y^k. For this purpose, we adopt a switched linear system (or jump linear system, interchangeably) framework that will be introduced in the next section in more detail.
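To make the stochastic asynchronous update concrete, the following sketch simulates the y-update (2.12) with the master drawing each x_i from a buffer of the last q iterates according to Π_i. All problem data, step sizes, delay probabilities, and the random seed are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

# hypothetical separable problem data (illustrative, not from the paper)
Q_blocks = [np.array([[2.0]]), np.array([[4.0]])]
c_blocks = [np.array([-2.0]), np.array([-2.0])]
A_blocks = [np.array([[1.0]]), np.array([[1.0]])]
b = np.array([0.5])

N = 2
q = 3                              # maximum possible delay
alpha_i = [0.2, 0.2]               # constant per-node step sizes
Pi = [[0.6, 0.3, 0.1]] * N         # Pi_i: probabilities of delay 0, 1, ..., q-1

y = np.zeros(1)
buffers = [deque(maxlen=q) for _ in range(N)]   # last q values of x_i, newest first

for k in range(2000):
    # every node computes x_i^{k+1} from the current y (2.10) and stores it
    for i in range(N):
        xi = -np.linalg.inv(Q_blocks[i]) @ (A_blocks[i].T @ y + c_blocks[i])
        buffers[i].appendleft(xi)
    # the master proceeds with randomly delayed buffered values, as in (2.12)
    upd = np.zeros_like(y)
    for i in range(N):
        d = min(int(rng.choice(q, p=Pi[i])), len(buffers[i]) - 1)
        upd += alpha_i[i] * (A_blocks[i] @ buffers[i][d]) - alpha_i[i] * b / N
    y = y + upd
```

With equal per-node step sizes, the stale-value iteration above still settles at the fixed point of (2.16), despite the randomly varying delays.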
3. A Switched System Approach for Asynchronous Computing Algorithms.
In order to solve the dual decomposition problem with random delays in distributed nodes, we define a new augmented state Y^k := [y^{k ⊤}, y^{k−1 ⊤}, ..., y^{k−q+1 ⊤}]⊤. Then, one can define the following recursive dynamics:

  Y^{k+1} = W(k) Y^k + C,    (3.1)

where Y^{k+1} = [y^{k+1}; y^k; y^{k−1}; ...; y^{k−q+2}], Y^k = [y^k; y^{k−1}; ...; y^{k−q+1}], C = [B; 0; ...; 0], and

  W(k) =
  [ I − R_1(k)   −R_2(k)   ···   −R_q(k) ]
  [ I            0         ···   0       ]
  [ 0            I         ···   0       ]
  [ ⋮                      ⋱     ⋮       ]
  [ 0            ···       I     0       ],

with I and 0 denoting identity and zero matrices of proper dimensions, respectively. Consequently, the above recursive equation ends up with the simple form

  Y^{k+1} = W(k) Y^k + C.

In fact, the structure of the time-varying matrix W(k) is not arbitrary; it has a finite number of forms, given by q^N, which counts all possible scenarios to distribute the N matrices Φ_i, i = 1, 2, ..., N, into the finite number q of delay slots. In the switched system, this number is referred to as the “switching mode number”, and we particularly denote it by the symbol m. For instance, when q = 2 and N = 2, the switching mode number is given by m = 2² = 4. Thus, at each time k, the matrix W(k) has one of the following forms:

  W_1 = [ I − Φ_1 − Φ_2   0 ; I   0 ],  W_2 = [ I − Φ_1   −Φ_2 ; I   0 ],
  W_3 = [ I − Φ_2   −Φ_1 ; I   0 ],  W_4 = [ I   −Φ_1 − Φ_2 ; I   0 ].

Then, only one matrix out of the set {W_r}_{r=1}^m will be used at each time k to update the system state Y^k, which results in the following switched linear system structure.

Consider the switched system:

  Y^{k+1} = W_{σ_k} Y^k + C,   σ_k ∈ {1, 2, ..., m},   k ∈ N,    (3.2)

where {σ_k}_{k=0}^∞ denotes the switching sequence that describes how the asynchrony takes place. Then, the switching probability Π(k) := Π_1(k) ⊗ Π_2(k) ⊗ ··· ⊗ Π_N(k) = [π_1(k), π_2(k), ..., π_m(k)], where Π_i(k) represents the probability for x_i^{k*_i} as depicted in Fig. 2, determines which mode σ_k will be utilized at each time step.
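For intuition, the mode matrices {W_r} can be enumerated programmatically. The sketch below does so for q = 2, N = 2 with hypothetical scalar blocks Φ_1, Φ_2 (so each block of (3.1) is 1 × 1); the names and values are illustrative assumptions:

```python
import numpy as np
from itertools import product

Phi = [0.1, 0.05]      # hypothetical scalar Phi_i = alpha_i A_i Q_i^{-1} A_i^T
q, N = 2, 2
m = q ** N             # switching mode number

modes = []
# a mode assigns each node i a delay slot j in {1, ..., q};
# R_j collects the Phi_i of all nodes assigned to slot j (cf. (2.14))
for assignment in product(range(q), repeat=N):
    R = [sum(Phi[i] for i in range(N) if assignment[i] == j) for j in range(q)]
    W = np.array([[1.0 - R[0], -R[1]],    # first block row of (3.1)
                  [1.0,         0.0]])    # shift structure for Y^k = [y^k, y^{k-1}]
    modes.append(W)
```

The enumeration reproduces the four forms W_1, ..., W_4 listed above, and makes explicit why the mode count grows as q^N.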
Note that Π_i(k), and hence Π(k), are not necessarily stationary. In this case, the switched linear system is called a “stochastic switched linear system” or “stochastic jump linear system” [26] because the switching is a stochastic process. The benefit of applying this stochastic switched linear system structure is that the delay in the asynchronous algorithm is naturally taken into account by the switched system framework. Hence, the randomness of the asynchronous algorithm is represented by a certain switching logic.

Remark 3.1. (Computational complexity due to an extremely large number of switching modes)
Although the stochastic switched linear system framework is suitable for modeling the dynamics of the stochastic asynchronous algorithm in distributed QP problems, it results in an extremely large number of switching modes, causing computational complexity. For instance, even if q = 2 and N = 20, we have m = q^N = 2^20, and it is impractical to store such a large number of matrices in a real implementation. Therefore, it is necessary to develop proper methods to analyze the stochastic asynchronous algorithm using the switched linear system without any concerns for such computational complexity issues.

To avoid the computational complexity problems stated above, we first make the following assumption for the analysis of both the convergence and the rate of convergence of the stochastic asynchronous algorithm:

• Assumption 3.1.
We consider the random delays that occur during the computation of x_i^k at each node. In this case, the probability Π_i(k) = [(π_1(k))_i, (π_2(k))_i, ..., (π_q(k))_i] describes which value for x_i^{k*_i} will be used among the given set {x_i^k, x_i^{k−1}, ..., x_i^{k−q+1}}. We assume that each modal probability (π_j(k))_i is stationary, and hence Π_i(k) is also stationary in time.

Under Assumption 3.1, the switching probability Π(k) := Π_1 ⊗ Π_2 ⊗ ··· ⊗ Π_N becomes stationary. For this case, the jump linear system with the dynamics given in (3.2) is termed an independent, identically distributed (i.i.d.) jump linear system. Since each modal switching probability π_r is a probability, it satisfies 0 ≤ π_r ≤ 1 ∀r and Σ_{r=1}^m π_r = 1. This stationary occupation probability rules which system matrix W_r will be used at each instance. The implementation of the switching sequence {σ_k}, governed by Π, describes the randomness of the stochastic asynchronous algorithm in an average sense.
4. Convergence Analysis.
In this section, the convergence of the state Y^k for the stochastic asynchronous model will be studied under the switched system framework. For several decades, the stability results for switched systems with stochastic jumping parameters have been well established, for example, in the literature [25], [26], [15]. However, these methods are inapplicable to the asynchronous computing algorithm with massive parallelism because it results in an extremely large number of switching modes, leading to computational complexity as explained in Remark 3.1. Therefore, we aim to investigate the convergence and the rate of convergence for the asynchronous algorithm without any concerns for such computational complexity issues. In particular, this section will provide a convergence condition for the stochastic asynchronous algorithm in distributed QP problems.

Before proceeding further to investigate the asynchronous model, we analyze the convergence of the synchronous case without delays as a reference. Since in the synchronous algorithm all values are synchronized after each iteration, no delays occur when updating the state dynamics. Then, the state Y^k_sync. for the synchronous case is governed by the following recursive equation:

  Y^{k+1}_sync. = W_sync. Y^k_sync. + C,    (4.1)

where W_sync. has the structure of W(k) in (3.1) with R_1(k) = R and R_i(k) = 0 for i ≥ 2, the matrix R := Σ_{i=1}^q R_i(k) = Σ_{j=1}^N Φ_j is time-invariant, and hence the matrix W_sync. is also constant. Then, the steady-state value Y⋆_sync. := lim_{k→∞} Y^k_sync. is obtained by

  Y⋆_sync. = W_sync. Y⋆_sync. + C    (4.2)
  ⇒ Y⋆_sync. = (I − W_sync.)^{−1} C,   (if (I − W_sync.) is non-singular),

provided that the condition ρ(W_sync.) < 1 holds. In the asynchronous algorithm, on the other hand, the matrix W_{σ_k} is determined by the switching probability Π. Thus, the state of the asynchronous model becomes a random vector, obstructing the convergence analysis of the stochastic asynchronous model.
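The synchronous fixed point (4.2) can be checked with a small numerical sketch; the scalar R and B below are hypothetical placeholders for Σ_j Φ_j and (2.15), with q = 2:

```python
import numpy as np

R, B = 0.15, 0.2     # hypothetical scalar R = sum_j Phi_j and constant B from (2.15)
q = 2

# W_sync has R_1 = R in the first block row and no delayed terms (R_i = 0, i >= 2)
W_sync = np.array([[1.0 - R, 0.0],
                   [1.0,     0.0]])
C = np.array([B, 0.0])

rho = np.max(np.abs(np.linalg.eigvals(W_sync)))    # must satisfy rho(W_sync) < 1
Y_star = np.linalg.solve(np.eye(2) - W_sync, C)    # fixed point (4.2)
```

Both components of Y_star coincide, as expected from the stacked structure Y⋆ = [y⋆; y⋆].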
For stochastic switched systems, various convergence (stability) notions have been developed [15] to guarantee system stability. Among the different convergence notions, we will focus on mean square convergence, defined below.

Definition 4.1. (Definition 1.1, [13]) The switched system is said to be mean square stable (convergent) if for any initial condition x_0 and arbitrary initial probability distribution Π(0), lim_{k→∞} E[ ||x(k, x_0) − x⋆||² ] = 0, where x⋆ is the fixed-point value of x^k, i.e., lim_{k→∞} x^k = x⋆.

The necessary and sufficient condition for the mean square convergence of i.i.d. jump linear systems is described as follows:
Proposition 4.2. (Corollary 2.7, [14]) Consider an i.i.d. jump linear system, where
Π(k) is a stationary probability vector [π_1, π_2, ..., π_m] for all k. Then, the i.i.d. jump linear system is mean square stable (convergent) if and only if the matrix Σ_{j=1}^m π_j (W_j ⊗ W_j) is Schur stable, i.e.,

  ρ( Σ_{j=1}^m π_j (W_j ⊗ W_j) ) < 1.    (4.3)

Once again, massive parallelism results in a large m, causing computational intractability. Thus, Proposition 4.2 is infeasible for the analysis of asynchronous distributed and parallel QP problems with massively parallel computing algorithms, because the condition in (4.3) requires the summation over the index j from 1 up to m. In order to avoid this problem, we provide Algorithm 1.

Algorithm 1
  draw k*_i ← one of the values in {k, k−1, ..., k−q+1} with probability Π_i, for i = 1, ..., N
  ξ ← k
  for i = 1 to N do
    if k*_i ≤ ξ then
      ξ ← k*_i
    end if
  end for
  x̃^k ← [(x_1^ξ)⊤, (x_2^ξ)⊤, ..., (x_N^ξ)⊤]⊤

By executing Algorithm 1 at every time step in the master node, the random vector x̃^k has the form x̃^k = [(x_1^ξ)⊤, (x_2^ξ)⊤, ..., (x_N^ξ)⊤]⊤, where ξ denotes the oldest time among the recently updated values across the indices i = 1, 2, ..., N. For example, if k*_1 = k − 1 is the oldest value over all k*_i, i = 1, 2, ..., N, then we have x̃^k = [(x_1^{k−1})⊤, (x_2^{k−1})⊤, ..., (x_N^{k−1})⊤]⊤. In this case, the modal matrix W_r has the same structure as W(k), given in (3.1), where R_i(k) satisfies

  R_i(k) = R, if i = k − ξ + 1;  0, otherwise.

The utilization of Algorithm 1 then drastically reduces the switching mode number to q regardless of the value N, due to the fact that at each iteration step we intentionally use the oldest updated value saved in buffer memory. For example, when q = 2, the matrix W_{σ_k} becomes one of the following forms:

  W_1 = [ I − R   0 ; I   0 ],  W_2 = [ I   −R ; I   0 ].

Since Algorithm 1 works as if it aggregates some subsets of the given switching modes, we need to redefine the switching probability Π accordingly.
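A minimal sketch of Algorithm 1 together with the mean square test of Proposition 4.2 on the reduced mode set; the scalar R, the aggregated probabilities, and the per-node distributions Π_i below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

def algorithm1_xi(k, Pi_list, q, rng):
    """Algorithm 1: sample k*_i for each node and return the oldest time xi."""
    xi = k
    for Pi_i in Pi_list:
        k_star = k - int(rng.choice(q, p=Pi_i))   # k*_i in {k, ..., k-q+1}
        xi = min(xi, k_star)
    return xi

# reduced switched system for q = 2 (scalar dual variable, hypothetical R)
R = 0.15
W1 = np.array([[1.0 - R, 0.0], [1.0, 0.0]])   # oldest used value is current
W2 = np.array([[1.0, -R], [1.0, 0.0]])        # oldest used value is one step old
pi = [0.7, 0.3]                               # aggregated switching probabilities

# Proposition 4.2: mean square convergence iff sum_r pi_r (W_r kron W_r) is Schur
T = sum(p * np.kron(W, W) for p, W in zip(pi, (W1, W2)))
ms_stable = bool(np.max(np.abs(np.linalg.eigvals(T))) < 1)
```

Note that the test matrix T is q·(dim Y)² squared in size, but only q modes enter the sum, which is exactly the reduction Algorithm 1 provides.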
Then, Π is obtained by the following theorem.
Theorem 4.3.
Consider the i.i.d. switched linear system given in (3.2) with the switching probability
Π = Π_1 ⊗ Π_2 ⊗ ··· ⊗ Π_N ∈ R^{1×q^N}. After the implementation of Algorithm 1, the switching probability is redefined as Π := [π_1, π_2, ..., π_q] ∈ R^{1×q}, of which the modal probability π_r has the following form:

  π_r := ∏_{i=1}^N ( Σ_{j=1}^r (π_j)_i ) − Σ_{j=1}^{r−1} π_j,   r = 1, 2, ..., q,    (4.4)

where the term (π_j)_i denotes the j-th modal probability for Π_i (i.e., Π_i = [(π_1)_i, (π_2)_i, ..., (π_q)_i]).

Proof. For simplicity of the proof, we assume that N = 2. The most general case is then proved similarly by induction. In this case, the master node takes the values for each x_i^{k*_i} according to the probability Π_i, i = 1,
2, which are given by

  Π_1 = [(π_1)_1, (π_2)_1, ..., (π_q)_1],  Π_2 = [(π_1)_2, (π_2)_2, ..., (π_q)_2].

We let ξ be the value explained in Algorithm 1, and let r index the corresponding delay slot, so that r = 1 corresponds to ξ = k, r = 2 to ξ = k − 1, and so on.

When r = 1, the modal switching probability π_1 is obtained by

  π_1 = Pr( k*_1 = k, k*_2 = k )
      = Pr( k*_1 = k ) × Pr( k*_2 = k )   (since k*_1 and k*_2 are independent)
      = (π_1)_1 × (π_1)_2.

Similarly, when r = 2, we have

  π_2 = Pr( k*_1 ∈ {k, k−1}, k*_2 ∈ {k, k−1} ) − π_1
      = Σ_{j=1}^2 Pr( k*_1 = k − j + 1 ) × Σ_{j=1}^2 Pr( k*_2 = k − j + 1 ) − π_1
      = Σ_{j=1}^2 (π_j)_1 × Σ_{j=1}^2 (π_j)_2 − π_1.

In the first line of the above equation, we have to subtract π_1 because it corresponds to the case r = 1.

For an arbitrary value r, the switching probability is therefore obtained by induction as follows:

  π_r = Pr( k*_1 ∈ {k, k−1, ..., k−r+1}, k*_2 ∈ {k, k−1, ..., k−r+1} ) − Σ_{j=1}^{r−1} π_j
      = Σ_{j=1}^r (π_j)_1 × Σ_{j=1}^r (π_j)_2 − Σ_{j=1}^{r−1} π_j.

Thus, the most general case with q, N ∈ N follows:

  π_r = ∏_{i=1}^N ( Σ_{j=1}^r (π_j)_i ) − Σ_{j=1}^{r−1} π_j,   r = 1, 2, ..., q.   □

For comparison, the switching mode number without the proposed algorithm is given by m = q^N, whose growth is exponential with respect to N, whereas with the proposed Algorithm 1, it is given by m = q, a constant value irrespective of N. Thus, by leveraging the proposed algorithm, one can apply the mean square convergence condition given in Proposition 4.2 to test the stability of the stochastic asynchronous algorithm. Note that the implementation of Proposition 4.2 was computationally intractable without Algorithm 1 due to the large m. Consequently, the proposed algorithm enables the convergence analysis of the stochastic asynchronous parallel computing algorithm in QP problems.

Once the condition (4.3) is guaranteed with a given i.i.d.
switching probability Π by implementing Algorithm 1, the steady-state (fixed-point) value Y⋆ := lim_{k→∞} Y^k, where Y^k is the state for the stochastic asynchronous algorithm whose dynamics is given in (3.2), can be obtained according to Definition 4.1 and is given by

  Y⋆ = W_{σ_k} Y⋆ + C    (4.5)
  ⇒ Y⋆ = (I − W_{σ_k})^{−1} C.

Y⋆ becomes a unique vector, regardless of σ_k that changes over time, due to the inherent structure of the matrices W_{σ_k} and C, which results in Y⋆ = Y⋆_sync., where Y⋆_sync. is defined in (4.2). Therefore, the state for the stochastic asynchronous algorithm, denoted by Y^k, converges to the unique, identical fixed-point value Y⋆ if the condition (4.3) holds.
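The aggregation rule (4.4) of Theorem 4.3 is easy to check numerically; the sketch below compares it against brute-force enumeration of all q^N joint delay outcomes (the per-node distributions are hypothetical examples):

```python
from itertools import product

def aggregate(Pi_list):
    """pi_r of (4.4): pi_r = prod_i sum_{j<=r} (pi_j)_i - sum_{j<r} pi_j."""
    q = len(Pi_list[0])
    pi = []
    for r in range(1, q + 1):
        prod = 1.0
        for Pi_i in Pi_list:
            prod *= sum(Pi_i[:r])    # P(node i's value is among the r most recent)
        pi.append(prod - sum(pi))    # subtract pi_1 + ... + pi_{r-1}
    return pi

Pi_list = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]   # hypothetical Pi_i, q = 3, N = 2
pi = aggregate(Pi_list)

# brute force: pi_r should equal P(the oldest delay over all nodes is r - 1)
brute = [0.0] * 3
for delays in product(range(3), repeat=2):
    brute[max(delays)] += Pi_list[0][delays[0]] * Pi_list[1][delays[1]]
```

The agreement with the brute-force result reflects the fact that Algorithm 1 selects the oldest value ξ, so π_r is the probability that the maximum delay over all nodes equals r − 1.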
5. Rate of Convergence Analysis.
Since the rate of convergence provides information regarding how fast each scheme converges to the fixed-point value, it works as a guideline that suggests which method will solve the given QP problem faster than other schemes. Therefore, the comparison of the rate of convergence between different schemes is advantageous in terms of estimating the time to obtain an optimal solution for the QP problem. Although asynchronous algorithms are considered to be more time-efficient for obtaining an optimal solution, their rate of convergence has not yet been proved analytically. Therefore, in this section we investigate the rate of convergence for three different algorithms (i.e., the synchronous, deterministic asynchronous, and stochastic asynchronous algorithms) in analytic form.

i) Synchronous algorithm with delays:
For the synchronous scheme, Y^k is updated after a certain amount of time due to the idle time for synchronization. As described in Fig. 1, we assume that all data from distributed nodes arrive at the master node within a bounded time q. In this case, the idle process time for synchronization is given by q, and Y^k can be updated at every t(q+1) time step, where t ∈ N. Consequently, the Y^k-update is given by

  at time t = 1:  Y^{q+1} = W_sync. Y^0 + C
  at time t = 2:  Y^{2(q+1)} = W_sync. Y^{q+1} + C
  at time t = 3:  Y^{3(q+1)} = W_sync. Y^{2(q+1)} + C
    ⋮
  at arbitrary time t + 1:  Y^{(t+1)(q+1)} = W_sync. Y^{t(q+1)} + C,   t ∈ N.

Now, we consider the term ||Y^k − Y⋆||_∞ in order to investigate the rate of convergence for the synchronous algorithm. From the dynamics for the synchronous case, given by Y^k = W_sync. Y^{k−1} + C, we have

  ||Y^k − Y⋆||_∞ = ||W_sync. Y^{k−1} + C − Y⋆||_∞
               = ||W_sync. Y^{k−1} − W_sync. Y⋆||_∞   (by (4.2))
               = ||W_sync. (W_sync. Y^{k−2} + C) − W_sync. Y⋆||_∞
               = ||(W_sync.)² Y^{k−2} + W_sync. (C − Y⋆)||_∞
               = ||(W_sync.)² (Y^{k−2} − Y⋆)||_∞   (by (4.2))
                 ⋮
               = ||(W_sync.)^k (Y^0 − Y⋆)||_∞
               ≤ ||(W_sync.)^k||_∞ · ||Y^0 − Y⋆||_∞,

where k = t(q+1), t ∈ N. Thus, we have the upper bound on the rate of convergence for the synchronous algorithm as follows:

  ||Y^k − Y⋆||_∞ ≤ ||(W_sync.)^k||_∞ · ||Y^0 − Y⋆||_∞,   k = t(q+1), t ∈ N.    (5.1)

ii) Deterministic Asynchronous algorithm:

As described in section 2.3, the deterministic asynchronous algorithm takes advantage of the q-step prior value instead of waiting for all x_i values to be gathered in the master node for synchronization. In this case, the system dynamics for the deterministic asynchronous scheme is given by

  Y^{k+1} = W_det.async. Y^k + C,

where the matrix W_det.async. is defined as W_det.async.
:= I · · · − RI · · · I · · · I because in this case we have ∀ i ∈ S k − q +2 in (2.13) for the deterministic asynchronousalgorithm, leading to above system dynamics.Similarly to the process in obtaining (5.1), the upper bound of the rate of con-vergence for the deterministic asynchronous algorithm is derived by || Y k − Y ⋆ || ∞ = || ( W det.async. ) k (cid:0) Y − Y ⋆ (cid:1) || ∞ (5.2) ≤ || ( W det.async. ) k || ∞ · || Y − Y ⋆ || ∞ , k ∈ N . iii) Stochastic Asynchronous algorithm: Since the state Y k becomes a random vector in the stochastic asynchronous case,the rate of convergence for || Y k − Y ⋆ || ∞ forms a distribution rather than a determinis-tic value, and is difficult to analyze such a distribution. Thus, we take the expectationfor Y k with respect to the i.i.d. switching probability Π, and investigate the rate ofconvergence for || E [ Y k ] − Y ⋆ || ∞ .Under the assumption that the mean square convergence condition in Proposition4.2 holds, the fixed-point value for Y k is deterministically given by Y ⋆ . Therefore,it satisfies E [ Y ⋆ ] = Y ⋆ . Taking the expectation in (4.5) results in E [ Y ⋆ ] = Y ⋆ = E [ W σ k Y ⋆ + C ] = E [ W σ k ] Y ⋆ + C = Pr ( P qr =1 π r W r ) Y ⋆ + C . By defining a newmatrix Λ := P qr =1 π r W r , we end up with E [ Y ⋆ ] = Y ⋆ = Λ Y ⋆ + C. (5.3)Then, the term || E [ Y k ] − Y ⋆ || ∞ becomes5 || E [ Y k ] − Y ⋆ || ∞ = || E [ W σ k − Y k − + C ] − Y ⋆ || ∞ = || q X r =1 Pr (cid:0) W σ k − Y k − + C | σ k − = r (cid:1) Pr ( σ k − = r ) | {z } = π r − Y ⋆ || ∞ = || q X r =1 π r W r Pr (cid:0) Y k − | σ k − = r (cid:1) + C − Y ⋆ || ∞ = || q X r =1 π r W r Pr (cid:0) Y k − | σ k − = r (cid:1) − Λ Y ⋆ || ∞ (by (5.3))= || q X r =1 π r W r !| {z } =Λ q X s =1 Pr (cid:0) W σ k − Y k − + C | σ k − = s (cid:1) π s − Λ Y ⋆ || ∞ = || Λ q X s =1 π s W s Pr (cid:0) Y k − | σ k − = s (cid:1) + C ! − Λ Y ⋆ || ∞ = || Λ q X s =1 π s W s Pr (cid:0) Y k − | σ k − = s (cid:1)! 
+ Λ C − Λ (Λ Y ⋆ + C ) || ∞ (by (5.3))= || Λ q X s =1 π s W s Pr (cid:0) Y k − | σ k − = s (cid:1)! − (Λ) Y ⋆ || ∞ ...= || (Λ) k − q X t =1 π t W t Pr (cid:0) Y | σ = t (cid:1) + C ! − (Λ) k Y ⋆ || ∞ = || (Λ) k − q X t =1 π t W t !| {z } =Λ Y − (Λ) k Y ⋆ || ∞ = || (Λ) k (cid:0) Y − Y ⋆ (cid:1) || ∞ ≤ || (Λ) k || ∞ · || Y − Y ⋆ || ∞ , where we used the law of total probability in above equations.Therefore, the rate of convergence for the asynchronous scheme is given by: || E [ Y k ] − Y ⋆ || ∞ ≤ || (Λ) k || ∞ · || Y − Y ⋆ || ∞ , (5.4)where with implementation of Algorithm 1 the matrix Λ := P qr =1 π r W r has thefollowing form:Λ = I − π R − π R − π R · · · − π q RI · · · I · · · · · · I , R := N X i =1 R i ( k ) = N X j =1 Φ j . (5.5)6 Fig. 3 . Convergence results for distributed quadratic programming with stochastic asynchronousalgorithm. The (green) solid lines represent the state trajectory for y with total 100 Monte Carlosimulations (initial value was deterministically given by y (0) = 2 for all cases). The (red) solid-crossline denotes the mean and the standard deviation of multiple trajectories, respectively.
6. Numerical Example.
In this section, we test the proposed asynchronous algorithms on distributed QP problems with the dual decomposition technique. The test bed is an Intel(R) Core(TM) i7-4710HQ CPU, which has 4 cores with 8 threads (via Hyper-Threading Technology). Although the number of threads for this test bed is not very large, the system is sufficient to demonstrate the performance of the proposed asynchronous computing algorithms for distributed QP with dual decomposition. We implemented parallel processing through OpenMP, an API (Application Program Interface) developed for explicit multi-threaded, shared-memory parallelism.

Let us consider the following distributed QP problem:

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{N} \left( x_i^\top Q_i x_i + c_i^\top x_i \right)\\
\text{subject to} \quad & A_i x_i \le b_i, \quad i = 1, 2, \ldots, N.
\end{aligned}
\]

The positive definite matrices $Q_i$, the matrices $A_i$, and the vectors $c_i$ and $b_i$ were generated by a pseudo-random number generator in C++. The dimensions of the matrices and vectors are set to be $Q_i \in \mathbb{R}^{n \times n}$, $A_i \in \mathbb{R}^{m \times n}$, $c_i \in \mathbb{R}^{n \times 1}$, and $b_i \in \mathbb{R}^{m \times 1}$, $i = 1, 2, \ldots, N$, where $n = 10$ and $N = 20000$. Thus, the computational burden for solving each distributed QP is low, whereas the total number of distributed QPs is extremely high. We let the buffer length be $q = 8$ and used a common step size $\alpha_i$ for all $i$.

For this type of massively distributed QP problem, the time for synchronization may become dominant in the total amount of time needed to solve the QP. In this case, asynchronous computing algorithms may lead to speedup by avoiding synchronization. We solved the above distributed QP problem with an implementation of the proposed stochastic asynchronous algorithm. In Fig. 3, a total of 100 state trajectories for the dual variable $y$ are shown as (green) solid lines. Since the $y$-update is a stochastic process in the asynchronous algorithm, the trajectories differ from each other, resulting in a spread of the trajectories during the transient time. The i.i.d. switching probability $\Pi_i$ that describes asynchronous computing for each distributed node is given by

\[
(\pi_j)_i = \frac{e^{-qj}}{\sum_{j=1}^{q} e^{-qj}}, \quad j = 1, 2, \ldots, q, \;\; \forall i.
\]

Fig. 4. The rate of convergence results for distributed quadratic programming with three different schemes: synchronous (cross symbol); deterministic asynchronous (green dotted line); stochastic asynchronous (red solid line) algorithms.

Then, by Theorem 4.3, the switching probability for the switched system in (3.2), denoted by $\Pi$, is computed from the $(\pi_j)_i$ above. For this i.i.d.
switching probability, we calculated the spectral radius given in (4.3), $\rho\big( \sum_{j=1}^{m} \pi_j ( W_j \otimes W_j ) \big)$, which is strictly less than 1; hence the mean square convergence condition is satisfied. Convergence was declared when $|y_k - y_{k-1}|$ fell below a prescribed tolerance. As shown in Fig. 5, the proposed stochastic algorithm achieves the fastest convergence in solving the distributed QP problem. This result coincides with the result on the rate of convergence, which indicates which scheme is best suited to solve a given QP problem even before the optimization problem is solved.

For the three different schemes, Table 1 and Fig. 6 present the computation time and speedup, respectively, as the number of threads in the test bed increases. We also plotted the speedup of the three schemes, based on Table 1, for an increasing total number of threads. As the number of threads increases, performance degradation occurred in the synchronous case, whereas the deterministic and stochastic asynchronous algorithms achieved continuous speedup. When the number of threads is 8, the stochastic asynchronous algorithm led to a 5.81 times speedup compared to sequential computing, which is also 2.26 times faster than the synchronous algorithm.

Fig. 5. The convergence time comparison between sequential computing and the three different schemes when the number of threads is given by 8 (the maximum possible parallelization for the test bed).

Table 1. Comparison of the total computation time for the dual variable to converge to the optimal value.

No. of Threads | Synchronous (Time, Speedup) | Det-Asynchronous (Time, Speedup) | Sto-Asynchronous (Time, Speedup)

As described in Remark 3.1, computational complexity was the major concern when adopting the switched system framework for the analysis of the stochastic asynchronous algorithm. To circumvent this complexity issue, we applied Algorithm 1. Thus, the number of switching modes was drastically reduced from $q^N = 8^{20000}$ to $q = 8$, owing to Algorithm 1. Consequently, the analysis of the stochastic asynchronous computing algorithm was carried out in a computationally efficient manner.
7. Conclusion.
In this paper, we studied the convergence of asynchronous distributed QP problems via the dual decomposition technique. To analyze the behavior of asynchrony in distributed and parallel computing, the switched system framework was introduced. Since the number of switching modes becomes large for massively asynchronous computing algorithms, we developed a new algorithm that drastically decreases the number of modes. With the proposed method, the convergence condition in the mean square sense can be checked without computational complexity issues. We also derived the rate of convergence for three different schemes (i.e., synchronous, deterministic asynchronous, and stochastic asynchronous algorithms), which analytically shows how fast the dual variable converges to the optimal solution. The numerical example with an implementation of asynchronous distributed QP using OpenMP supports the validity of the proposed methods.

Fig. 6. The speedup vs. the number of threads.
Acknowledgement.
This research material is based upon work supported by the National Science Foundation under grant number
REFERENCES

[1] Pedro Aguiar, Eric P. Xing, Mário Figueiredo, Noah A. Smith, and André Martins. An augmented Lagrangian approach to constrained MAP inference. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 169–176, 2011.
[2] Jacques F. Benders. Partitioning procedures for solving mixed-variables programming problems. Numerische Mathematik, 4(1):238–252, 1962.
[3] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.
[4] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
[5] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[6] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
[7] Peter Buchholz, Markus Fischer, and Peter Kemper. Distributed steady state analysis using Kronecker algebra. Numerical Solutions of Markov Chains (NSMC'99), Prensas Universitarias de Zaragoza, Zaragoza, Spain, pages 76–95, 1999.
[8] Yair Censor and Stavros Andrea Zenios. Proximal minimization algorithm with D-functions. Journal of Optimization Theory and Applications, 73(3):451–464, 1992.
[9] George B. Dantzig and Philip Wolfe. Decomposition principle for linear programs. Operations Research, 8(1):101–111, 1960.
[10] Jim Douglas and Henry H. Rachford. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American Mathematical Society, pages 421–439, 1956.
[11] Jonathan Eckstein. Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming. Mathematics of Operations Research, 18(1):202–226, 1993.
[12] Hugh Everett III. Generalized Lagrange multiplier method for solving problems of optimum allocation of resources. Operations Research, 11(3):399–417, 1963.
[13] Yuguang Fang and Kenneth A. Loparo. Stabilization of continuous-time jump linear systems. IEEE Transactions on Automatic Control, 47(10):1590–1603, 2002.
[14] Yuguang Fang and Kenneth A. Loparo. Stochastic stability of jump linear systems. IEEE Transactions on Automatic Control, 47(7):1204–1208, 2002.
[15] Xiangbo Feng, Kenneth A. Loparo, Yuandong Ji, and Howard Jay Chizeck. Stochastic stability properties of jump linear systems. IEEE Transactions on Automatic Control, 37(1):38–53, 1992.
[16] Michel Fortin and Roland Glowinski. Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems. Elsevier, 2000.
[17] Masao Fukushima. Application of the alternating direction method of multipliers to separable convex programming problems. Computational Optimization and Applications, 1(1):93–111, 1992.
[18] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
[19] Arthur M. Geoffrion. Generalized Benders decomposition. Journal of Optimization Theory and Applications, 10(4):237–260, 1972.
[20] Roland Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis, 9(R2):41–76, 1975.
[21] Magnus R. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303–320, 1969.
[22] Kooktae Lee and Raktim Bhattacharya. Optimal switching synthesis for jump linear systems with Gaussian initial state uncertainty. In ASME 2014 Dynamic Systems and Control Conference, pages V002T24A003. American Society of Mechanical Engineers, 2014.
[23] Kooktae Lee and Raktim Bhattacharya. Stability analysis of large-scale distributed networked control systems with random communication delays: A switched system approach. arXiv preprint arXiv:1503.03047, 2015.
[24] Kooktae Lee, Raktim Bhattacharya, and Vijay Gupta. A switched dynamical system framework for analysis of massively parallel asynchronous numerical algorithms. In American Control Conference (ACC). IEEE, 2015.
[25] Kooktae Lee, Abhishek Halder, and Raktim Bhattacharya. Probabilistic robustness analysis for stochastic jump linear systems. In American Control Conference (ACC), pages 2638–2643. IEEE, 2014.
[26] Kooktae Lee, Abhishek Halder, and Raktim Bhattacharya. Performance and robustness analysis of stochastic jump linear systems using Wasserstein metric. Automatica, 51:341–347, 2015.
[27] Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
[28] Ji Liu, Stephen J. Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. arXiv preprint arXiv:1311.1873, 2013.
[29] Michael J. D. Powell. A method for non-linear constraints in minimization problems. UKAEA, 1967.
[30] Jonathan E. Spingarn. Applications of the method of partial inverses to convex programming: decomposition. Mathematical Programming, 32(2):199–223, 1985.
[31] Paul Tseng. Alternating projection-proximal methods for convex programming and variational inequalities.