Distributed Majorization-Minimization for Laplacian Regularized Problems
Jonathan Tuck, David Hallac, Stephen Boyd

April 2, 2018
Abstract
We consider the problem of minimizing a block separable convex function (possibly nondifferentiable, and including constraints) plus Laplacian regularization, a problem that arises in applications including model fitting, regularizing stratified models, and multi-period portfolio optimization. We develop a distributed majorization-minimization method for this general problem, and derive a complete, self-contained, general, and simple proof of convergence. Our method is able to scale to very large problems, and we illustrate our approach on two applications, demonstrating its scalability and accuracy.
1 Introduction

Many applications, ranging from multi-period portfolio optimization [BBD+17] to joint covariance estimation [HPBL17], can be modeled as convex optimization problems with two objective terms, one that is block separable and the other a Laplacian regularization term [YGL16]. The block separable term can be nondifferentiable and may include constraints. The Laplacian regularization term is quadratic, and penalizes differences between individual variable components. These types of problems arise in several domains, including signal processing [PC17], machine learning [ST17], and statistical estimation or data fitting problems with an underlying graph prior [AZ06, MB11]. As such, there is a need for scalable algorithms to efficiently solve these problems.

In this paper we develop a distributed method for minimizing a block separable convex objective with Laplacian regularization. Our method is iterative; in each iteration a convex problem is solved for each block, and the variables are then shared with each block's neighbors in the graph associated with the Laplacian term. Our method is an instance of a standard and well known general method, majorization-minimization (MM) [Lan16], which recovers a wide variety of existing methods depending on the choice of majorization [SBP17]. In this paper, we derive a diagonal quadratic majorizer of the given Laplacian objective term, which has the benefit of separability. This separability allows for the minimization step in our MM algorithm to be carried out in parallel on a block-by-block basis. We develop a completely self-contained proof of convergence of our method, which relies on no further assumption than the existence of a solution. Finally, we apply our method to two separate applications, multi-period portfolio optimization and joint covariance estimation, demonstrating the scalable performance of our algorithm.
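To make the setup concrete before the formal development, the following self-contained sketch runs the majorization-minimization iteration developed in this paper on a tiny synthetic LRMP whose blocks are scalars with quadratic objectives f_i(x_i) = (x_i − a_i)^2. All sizes and data here are illustrative, not from the paper's experiments:

```python
import numpy as np

# Toy LRMP on a chain graph: minimize sum_i (x_i - a_i)^2 + (1/2) x^T L x.
# Blocks are scalars; the targets a_i are synthetic.
n = 5
rng = np.random.default_rng(0)
a = rng.standard_normal(n)

# Weighted chain Laplacian with unit edge weights.
L = np.zeros((n, n))
for i in range(n - 1):
    L[i, i] += 1.0; L[i + 1, i + 1] += 1.0
    L[i, i + 1] -= 1.0; L[i + 1, i] -= 1.0

alpha = 2 * np.diag(L) + 1e-3          # diagonal majorizer: alpha_i > 2 L_ii
F = lambda v: np.sum((v - a) ** 2) + 0.5 * v @ L @ v

x = np.zeros(n)
vals = []
for k in range(200):
    h = L @ x                          # linear term of the separable majorizer
    # Separable update: minimize (xi - a_i)^2 + (alpha_i/2)(xi - x_i^k)^2 + h_i*xi
    x = (2 * a + alpha * x - h) / (2 + alpha)
    vals.append(F(x))

# MM descent: the objective is nonincreasing, and the limit satisfies
# the optimality condition 2(x - a) + Lx = 0.
assert all(vals[i + 1] <= vals[i] + 1e-12 for i in range(len(vals) - 1))
assert np.allclose(2 * (x - a) + L @ x, 0, atol=1e-6)
```

The per-block update here is the diagonally scaled proximal step of the distributed algorithm specialized to scalar quadratic blocks; for general f_i the per-block minimization is carried out by a convex solver.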
Related work. There has been extensive research on graph Laplacians and Laplacian regularization [GR01, WSZS07, RHL13], and on developing solvers specifically for use in optimization over graphs [HWD+17].

Outline. In §2 we formally state the problem and our assumptions. In §3 we construct a separable quadratic majorizer of the Laplacian regularizer, and in §4 we give the resulting distributed MM algorithm. We prove convergence of the algorithm in §5, and in §6 we present two numerical examples.

2 The Laplacian regularized minimization problem

We consider the problem of minimizing a convex function plus Laplacian regularization,

minimize F(x) = f(x) + L(x),  (1)

with variable x ∈ R^n. Here f : R^n → R ∪ {∞} is a proper closed convex function [Roc70, BL00], and L : R^n → R is the Laplacian regularizer (or Dirichlet energy [Eva10])

L(x) = (1/2) x^T L x,

where L is a weighted Laplacian matrix, i.e., L = L^T, L_ij ≤ 0 for i ≠ j, and L1 = 0, where 1 is the vector with all entries one [GR01]. Associating with L the graph with vertices 1, ..., n, edges E indexed by pairs (i, j) with i < j and L_ij ≠ 0, and (nonnegative) edge weights w_ij = −L_ij, the Laplacian regularizer can be expressed as

L(z) = (1/2) Σ_{(i,j)∈E} w_ij (z_i − z_j)^2.

We refer to the problem (1) as the
Laplacian regularized minimization problem (LRMP). LRMPs are convex optimization problems, which can be solved by a variety of methods, depending on the specific form of f [BV04, NW06]. We will let F* denote the optimal value of the LRMP. Convex constraints can be incorporated into the LRMP by defining f to take the value +∞ when the constraints are violated. Note in particular that we specifically do not assume that the function f is finite, or differentiable (let alone with Lipschitz gradient), or even that its domain has affine dimension n. In this paper we will make only one additional analytical assumption about the LRMP (1): its sublevel sets are bounded. This assumption implies that the LRMP is solvable, i.e., there exists at least one optimal point x*, and therefore that its optimal value F* is finite.

A point x is optimal for the LRMP (1) if and only if there exists g ∈ R^n such that [Roc70, BL00]

g ∈ ∂f(x),  g + ∇L(x) = g + Lx = 0,  (2)

where ∂f(x) is the subdifferential of f at x [Roc70, Cla90]. For g ∈ ∂f(x), we refer to r = g + Lx as the optimality residual for the LRMP (1). Our goal is to compute an x (and g ∈ ∂f(x)) for which the residual r is small.

We are interested in the case where f is block separable. We partition the variable x as x = (x_1, ..., x_p), with x_i ∈ R^{n_i}, n_1 + ··· + n_p = n, and assume f has the form

f(x) = Σ_{i=1}^p f_i(x_i),

where f_i : R^{n_i} → R ∪ {∞} are closed convex proper functions.

The main contribution of this paper is a scalable and distributed method for solving the LRMP in which each of the functions f_i is handled separately. More specifically, we will see that each iteration of our algorithm requires the evaluation of a diagonally scaled proximal operator [PB14] associated with each block function f_i, which can be done in parallel.
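These definitions are easy to verify numerically. The sketch below (synthetic edge data, chosen only for illustration) builds a weighted Laplacian from an edge list and checks that L1 = 0, that L is positive semidefinite, and that (1/2) x^T L x equals the weighted sum of squared edge differences:

```python
import numpy as np

def laplacian(n, edges):
    """Weighted Laplacian from a list of (i, j, w_ij) with i < j and w_ij >= 0."""
    L = np.zeros((n, n))
    for i, j, w in edges:
        L[i, j] -= w
        L[j, i] -= w
        L[i, i] += w
        L[j, j] += w
    return L

edges = [(0, 1, 2.0), (1, 2, 0.5), (0, 3, 1.0), (2, 3, 3.0)]  # synthetic graph
L = laplacian(4, edges)

x = np.random.default_rng(1).standard_normal(4)
quad = 0.5 * x @ L @ x
edge_sum = 0.5 * sum(w * (x[i] - x[j]) ** 2 for i, j, w in edges)

assert np.allclose(L @ np.ones(4), 0)             # L1 = 0
assert np.isclose(quad, edge_sum)                 # (1/2) x^T L x = (1/2) sum w_ij (x_i - x_j)^2
assert np.all(np.linalg.eigvalsh(L) >= -1e-12)    # L is positive semidefinite
```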
3 Majorization

Recall that a function L̂ : R^n × R^n → R is a majorizer of L : R^n → R if for all x and z we have L̂(z; z) = L(z) and L̂(x; z) ≥ L(x) [Lan16, SBP17]. In other words, the difference L̂(x; z) − L(x) is nonnegative, and zero when x = z.

We now show how to construct a quadratic majorizer of the Laplacian regularizer L. This construction is known [SBP17], but we give the proof for completeness. Suppose L̂ = L̂^T satisfies L̂ ⪰ L, i.e., L̂ − L is positive semidefinite. The function

L̂(x; z) = (1/2) z^T L z + z^T L (x − z) + (1/2)(x − z)^T L̂ (x − z),  (3)

which is quadratic in x, is a majorizer of L. To see this, we note that

L̂(x; z) − L(x) = (1/2) z^T L z + z^T L (x − z) + (1/2)(x − z)^T L̂ (x − z) − (1/2) x^T L x = (1/2)(x − z)^T (L̂ − L)(x − z),

which is always nonnegative, and zero when x = z.

In fact, every quadratic majorizer of L arises from this construction, for some L̂ ⪰ L. To see this we note that the difference L̂(x; z) − L(x) is a quadratic function of x that is nonnegative and zero when x = z, so it must have the form (1/2)(x − z)^T P (x − z) for some P = P^T ⪰
0. It follows that L̂(x; z) has the form (3), with L̂ = P + L ⪰ L.

We now give a simple scheme for choosing L̂ to obtain a diagonal quadratic majorizer. Suppose L̂ is diagonal,

L̂ = diag(α) = diag(α_1, ..., α_n),

where α ∈ R^n. A simple sufficient condition for L̂ ⪰ L is α_i ≥ 2L_ii, i = 1, ..., n. This follows from standard results for Laplacians [Bol98], but it is simple to show directly. We note that for any z ∈ R^n, we have

z^T (L̂ − L) z = Σ_{i=1}^n (α_i − L_ii) z_i^2 + Σ_{i=1}^n Σ_{j≠i} (−L_ij) z_i z_j ≥ Σ_{i=1}^n L_ii z_i^2 + Σ_{i=1}^n Σ_{j≠i} L_ij |z_i| |z_j| = |z|^T L |z| ≥ 0,

where the absolute value is elementwise. In the inequality we use α_i − L_ii ≥ L_ii and, for j ≠ i, −L_ij z_i z_j ≥ L_ij |z_i| |z_j|, which follow since L_ij ≤ 0 for i ≠ j.

For our algorithm we will need the strict condition L̂ ≻ L, i.e., L̂ − L positive definite. This can be accomplished by choosing

α_i > 2L_ii,  i = 1, ..., n.  (4)

There are many other methods for selecting α, some of which have additional properties. For example, we can choose α_i = 2λ_max(L), where λ_max(L) denotes the maximum eigenvalue of L. With this choice we have L̂ = 2λ_max(L) I. This diagonal majorization has all diagonal entries equal, i.e., it is a multiple of the identity.

Another choice (that we will encounter later in §6.2) is to take L̂ to be a block diagonal matrix, conformal with the partition of x, with each block component a (possibly different) multiple of the identity,

L̂ = diag(α_1 I_{n_1}, ..., α_p I_{n_p}),  (5)

where we can take

α_i > 2 max_{j ∈ N_i} L_jj,  i = 1, ..., p,

where N_i is the index range for block i.

4 Distributed MM algorithm

The majorization-minimization (MM) algorithm is an iterative algorithm that at each step minimizes a majorizer of the original function at the current iterate [SBP17]. Since L̂(x; z), as constructed in §
3, using (4), majorizes L, it follows that F̂ = f + L̂ majorizes F = f + L. The MM algorithm for minimizing F is then

x^{k+1} = argmin_x ( f(x) + L̂(x; x^k) ),  (6)

where the superscripts k and k + 1 denote the iteration counter. Note that since the matrix L̂ is positive definite, L̂(x; x^k) is strictly convex in x, so the argmin is unique.

Stopping criterion.
The optimality condition for the update (6) is the existence of g^{k+1} ∈ R^n with

g^{k+1} ∈ ∂f(x^{k+1}),  g^{k+1} + ∇L̂(x^{k+1}; x^k) = 0.  (7)

From L̂(x; z) − L(x) = (1/2)(x − z)^T (L̂ − L)(x − z), we have

∇L̂(x^{k+1}; x^k) − ∇L(x^{k+1}) = (L̂ − L)(x^{k+1} − x^k).

Substituting this into (7) we get

g^{k+1} + ∇L(x^{k+1}) = (L̂ − L)(x^k − x^{k+1}).  (8)

Thus we see that

r^{k+1} = (L̂ − L)(x^k − x^{k+1})

is the optimality residual for x^{k+1}, i.e., the right-hand side of (2). We will use ‖r^{k+1}‖ ≤ ε, where ε > 0 is a given tolerance, as our stopping criterion: when it holds, x^{k+1} satisfies the optimality condition (2) within ε.

Absolute and relative tolerance.
When the algorithm is used to solve problems in which x* or L vary widely in size, the tolerance ε is typically chosen as a combination of an absolute tolerance ε_abs and a relative tolerance ε_rel, for example,

ε = ε_abs + ε_rel (‖L̂ − L‖_F + ‖x‖),

where ‖·‖_F denotes the Frobenius norm.

Distributed implementation. The update (6) can be broken into two steps. The first step requires multiplying by L; in the other step, we carry out p independent minimizations in parallel. We partition the Laplacian matrix L into blocks L_ij, i, j = 1, ..., p, conformal with the partition x = (x_1, ..., x_p). (In a few places above, we used L_ij to denote the i, j entry of L, whereas here we use it to denote the i, j submatrix. This slight abuse of notation should not cause any confusion, since the index ranges, and the dimensions, make it clear whether the entry or the submatrix is meant.) We then observe that our majorizer (3) has the form

L̂(x; z) = Σ_{i=1}^p L̂_i(x_i; z) + c,

where c does not depend on x, and

L̂_i(x_i; z) = (1/2)(x_i − z_i)^T L̂_ii (x_i − z_i) + h_i^T x_i,

where z_i refers to the ith subvector of z, and h_i is the ith subvector of Lz,

h_i = L_ii z_i + Σ_{j≠i} L_ij z_j.

It follows that

F̂(x; x^k) = Σ_{i=1}^p ( f_i(x_i) + L̂_i(x_i; x^k) ) + c

is block separable.

Algorithm 4.1
Distributed majorization-minimization.

given Laplacian matrix L, and initial point x^1 in the feasible set of the problem, with f(x^1) < ∞.

Form majorizer matrix. Form diagonal L̂ with L̂ ≻ L (using (4)).

for k = 1, 2, ...
  1. Compute linear term. Compute h^k = Lx^k and residual r^k = (L̂ − L)(x^{k−1} − x^k).
  2. Update in parallel. For i = 1, ..., p, update each x_i (in parallel) as
       x_i^{k+1} = argmin_{x_i} ( f_i(x_i) + (1/2)(x_i − x_i^k)^T L̂_ii (x_i − x_i^k) + (h_i^k)^T x_i ).
  3. Test stopping criterion. Quit if k ≥ 2 and ‖r^k‖ ≤ ε.

Step 1 couples the subvectors x_i^k; step 2 (the subproblem updates) is carried out in parallel for each i. We observe that the updates in step 2 are (diagonally scaled) proximal operator evaluations, i.e., they involve minimizing f_i plus a norm squared term, with diagonal quadratic norm; see, e.g., [PB14]. Our algorithm can thus be considered a distributed proximal-based method. We also mention that as the algorithm converges (discussed in detail below), x_i^{k+1} − x_i^k →
0, which implies that the quadratic terms (1/2)(x_i − x_i^k)^T L̂_ii (x_i − x_i^k) and their gradients in the update asymptotically vanish; roughly speaking, they 'go away' as the algorithm converges. We will see below, however, that these quadratic terms are critical to convergence of the algorithm.

Warm start. Our algorithm supports warm starting by choosing the initial point x^1 as an estimate of the solution, for example, the solution of a closely related problem. Warm starting can decrease the number of iterations required to converge [YW02, WB10]; we will see an example in §6.2.

5 Convergence

There are many general convergence results for MM methods, but all of them require varying additional assumptions about the objective function [Lan16, SBP17]. In this section we give a complete self-contained proof of convergence for our algorithm that requires no additional assumptions. We will show that F(x^k) − F* →
0 as k → ∞, and also that the stopping criterion eventually holds, i.e., (L̂ − L)(x^k − x^{k+1}) → 0.

We first observe that

F(x^{k+1}) ≤ F̂(x^{k+1}; x^k) ≤ F̂(x^k; x^k) = F(x^k),

where the first inequality holds since F̂ majorizes F, and the second since x^{k+1} minimizes F̂(x; x^k) over x. The sequence F(x^k) is thus nonincreasing and bounded below by F*, so F(x^k) converges, and therefore F(x^k) − F(x^{k+1}) →
0. It also follows that the iterates x^k are bounded, since every iterate satisfies F(x^k) ≤ F(x^1), and we assume that the sublevel sets of F are bounded.

Since F is convex and g^{k+1} + ∇L(x^{k+1}) ∈ ∂F(x^{k+1}), we have (from the definition of subgradient)

F(x^k) ≥ F(x^{k+1}) + (g^{k+1} + ∇L(x^{k+1}))^T (x^k − x^{k+1}).

Using this and (8), we have

F(x^k) − F(x^{k+1}) ≥ (x^k − x^{k+1})^T (L̂ − L)(x^k − x^{k+1}).

Since F(x^k) − F(x^{k+1}) → 0 as k → ∞, and L̂ − L ≻
0, we conclude that x^{k+1} − x^k → 0 as k → ∞. This implies that our stopping criterion will eventually hold.

Now we show that F(x^k) → F*. Let x* be any optimal point. Then,

F* = F(x*) ≥ F(x^{k+1}) + ((L̂ − L)x^k − L̂x^{k+1} + Lx^{k+1})^T (x* − x^{k+1}) = F(x^{k+1}) + (x^k − x^{k+1})^T (L̂ − L)(x* − x^{k+1}).

So we have

F(x^{k+1}) − F* ≤ −(x^k − x^{k+1})^T (L̂ − L)(x* − x^{k+1}).

Since x^k − x^{k+1} → 0 as k → ∞, and x^{k+1} is bounded, the right-hand side converges to zero as k → ∞, and so we conclude F(x^{k+1}) − F* → 0 as k → ∞.

5.2 Variations

Arbitrary convex quadratic regularization.
While our interest is in the case when L is Laplacian regularization, the algorithm (and convergence proof) work when L is any convex quadratic, i.e., L ⪰
0, with the choice

α_i > Σ_{j=1}^n |L_ij|,  i = 1, ..., n,

replacing the condition (4). In fact, the condition (4) is a special case of this condition for a Laplacian matrix, since in that case Σ_{j=1}^n |L_ij| = 2L_ii.

Nonconvex f. If the objective function in the LRMP is nonconvex, i.e., f is nonconvex, then the method proposed in this paper can be used as a heuristic for solving (1). We emphasize that in this case the most the algorithm can be expected to find is a local optimum, rather than a global optimum [BV04].

6 Numerical examples

In this section we describe two applications of our distributed method for solving the LRMP, and report numerical results demonstrating its convergence and performance. We run all numerical examples on a 32-core AMD machine with 64 hyperthreads, using the Pathos multiprocessing package to carry out computations in parallel [McK17]. Our code is available online at https://github.com/cvxgrp/mm_dist_lapl.

6.1 Multi-period portfolio optimization

We consider the problem of multi-period trading with quadratic transaction costs; see [BMOW14, BBD+
17] for more detail. We are to choose a portfolio of n holdings x_t ∈ R^n, for periods t = 1, ..., T. We assume the nth holding is a riskless holding (i.e., cash). We choose the portfolios by solving the problem

minimize Σ_{t=1}^T ( f_t(x_t) + (1/2)(x_t − x_{t−1})^T D_t (x_t − x_{t−1}) ),  (9)

where f_t is the convex objective function (and constraints) for the portfolio in period t, and the D_t's are diagonal positive semidefinite matrices. The initial portfolio x_0 is given and constant; x_1, ..., x_T are the variables. The quadratic term (1/2)(x_t − x_{t−1})^T D_t (x_t − x_{t−1}) is the transaction cost, i.e., the additional cost of trading to move from the previous portfolio x_{t−1} to the current one x_t. We will assume that there is no transaction cost associated with cash, i.e., (D_t)_{nn} = 0.

The objective function f_t typically includes negative expected return, one or more risk constraints or risk avoidance terms, shorting or borrow costs, and possibly other terms. It also can include constraints, such as the normalization 1^T x_t = 1 (in which case the x_t are referred to as the portfolio weights), limits on the holdings or the leverage of the portfolio, or a specified final portfolio; see [BMOW14, BBD+
17] for more detail.

We can express the transaction cost as Laplacian regularization on x = (x_1, ..., x_T) ∈ R^{Tn}, plus a quadratic term involving x_1:

Σ_{t=1}^T (1/2)(x_t − x_{t−1})^T D_t (x_t − x_{t−1}) = (1/2) x^T L x + (1/2) x_1^T D_1 x_1 − (D_1 x_0)^T x_1 + (1/2) x_0^T D_1 x_0.

(Recall that the initial portfolio x_0 is given.) The Laplacian matrix L has block tridiagonal form, with diagonal blocks

L_11 = D_2,  L_tt = D_t + D_{t+1} for t = 2, ..., T−1,  L_TT = D_T,

off-diagonal blocks L_{t,t+1} = L_{t+1,t} = −D_{t+1} for t = 1, ..., T−1, and all other blocks zero. If we assume that the initial portfolio is cash, i.e., x_0 is zero except in the last component, then two of the three extra terms, (D_1 x_0)^T x_1 and (1/2) x_0^T D_1 x_0, both vanish. If we lump the remaining extra term, which depends on x_1, into f_1, the multi-period portfolio optimization problem (9) has the LRMP form, with p = T and n_i = n. The total number of (scalar) variables is Tn. The graph associated with the Laplacian is the simple chain graph; roughly speaking, each portfolio x_t is linked to its predecessor x_{t−1} and its successor x_{t+1} by the transaction cost.

We can give a simple interpretation of the subproblem update in our method. The quadratic term of the subproblem update (which asymptotically goes away as we approach convergence) adds diagonal risk; the linear term h_t contributes an expected return to each asset. These additional risk and return terms come from both the preceding and the successor portfolios; they 'encourage' the portfolios to move towards each other from one time period to the next, so as to reduce transaction cost. Each subproblem update minimizes negative risk-adjusted return, with the given return vector modified to encourage less trading.

We consider a problem with n = 1000 assets and T = 30 periods, so the total number of (scalar) variables is 30000. The objective functions f_t include a negative expected return, a quadratic risk given by a factor (diagonal plus low rank) model with 50 factors [CR83, BBD+17], a shorting cost, and the normalization constraint 1^T x_t = 1, so the portfolios x_t represent weights. The objective functions f_t have the form

f_t(x) = −μ_t^T x + γ x^T Σ_t x + s_t^T (x)_−,  t = 1, ..., T − 1.  (10)

Here, γ > 0 is the risk aversion parameter, μ_t is the expected return, Σ_t is the return covariance, and s_t is the (positive) shorting cost coefficient vector. The covariance matrices Σ_t are diagonal plus a rank 50 (factor) term, with zero entries in the last row and column (which correspond to the cash asset).
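The block tridiagonal structure of L described above is straightforward to form and check numerically. The sketch below uses small synthetic sizes and random diagonal D_t (illustrative only); D_1 enters the problem only through the extra terms involving x_0 and x_1, which are handled separately as described above:

```python
import numpy as np

n, T = 3, 4                        # synthetic sizes, for illustration only
rng = np.random.default_rng(2)
# Diagonal transaction cost matrices D_1, ..., D_T (cash entry zero).
D = [np.diag(np.append(rng.uniform(0.1, 1.0, n - 1), 0.0)) for _ in range(T)]

# Block tridiagonal Laplacian on x = (x_1, ..., x_T): edge t couples
# x_{t-1} and x_t with weight matrix D_t, for t = 2, ..., T.
# (0-indexed below: D[t] plays the role of the paper's D_{t+1}.)
L = np.zeros((T * n, T * n))
for t in range(1, T):
    a, b = slice((t - 1) * n, t * n), slice(t * n, (t + 1) * n)
    L[a, a] += D[t]; L[b, b] += D[t]
    L[a, b] -= D[t]; L[b, a] -= D[t]

x = [rng.standard_normal(n) for _ in range(T)]
xs = np.concatenate(x)
# (1/2) x^T L x equals the transaction cost between consecutive portfolios
# x_1, ..., x_T (the terms involving x_0 are excluded from L).
cost = sum(0.5 * (x[t] - x[t - 1]) @ D[t] @ (x[t] - x[t - 1]) for t in range(1, T))
assert np.isclose(0.5 * xs @ L @ xs, cost)
# Block Laplacian property: constant-over-time vectors are in the nullspace.
assert np.allclose(L @ np.tile(np.ones(n), T), 0)
```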
We choose all these coefficients and the diagonal transaction cost matrices D_t randomly, but with realistic values. In our problem instance, we choose all of these parameters independent of t, i.e., constant.

We take f_T to be the indicator function of the constraint x = e_n (i.e., f_T(x) = 0 if x = e_n, and ∞ otherwise), and the initial portfolio is all cash, x_0 = e_n. So in our multi-period portfolio optimization problem we are planning a sequence of portfolios that begin and end in cash.

We can see the interpretation of the subproblem updates concretely in this example. Choosing the diagonal entries of L̂ to be 3L_ii, we can rewrite the subproblem objective function (at time periods t = 2, ..., T − 1) as

s_t^T (x_t)_− + x_t^T ( γΣ_t + (3/2)(D_t + D_{t+1}) ) x_t − ( μ_t + D_t(2x_t^k + x_{t−1}^k) + D_{t+1}(2x_t^k + x_{t+1}^k) )^T x_t + c,

where c is some constant that does not depend on x_t. We see that a diagonal risk term is added, and the mean return μ_t is offset by terms that depend on the past, current, and future portfolios x_{t−1}^k, x_t^k, and x_{t+1}^k.

We first solve the problem instance using CVXPY [DB16] and the solver OSQP [SBG+17]. We then solve it using our distributed method; we initialize all portfolios as e_n, i.e., all cash, and use a fixed stopping criterion tolerance ε. Our algorithm took 8 iterations and 19 seconds to converge, and produced a solution that agreed very closely with the CVXPY/OSQP solution. Figure 1 shows a plot of the residual norm ‖r^k‖ versus iteration k. This plot shows nearly linear convergence, with a reduction in residual norm by around a factor of 5 each iteration.

6.2 Joint covariance estimation

We consider estimation of parameters in a statistical model. We have a graph, with some data associated with each node; the goal is to fit a model to the data at each node, with Laplacian regularization used to make neighboring models similar. The model parameter at node i is θ_i ∈ R^{n_i}. The vector of all node parameters is θ = (θ_1, ..., θ_p) ∈ R^n, with n = n_1 + ··· + n_p.
We choose θ by minimizing a local loss function and regularizer at each node, plus Laplacian regularization:

minimize Σ_{i=1}^p f_i(θ_i) + L(θ),

where f_i(θ_i) = ℓ_i(θ_i) + r_i(θ_i). Here ℓ_i : R^{n_i} → R is the loss function (for example, the negative log-likelihood of θ_i) for the data at node i, and r_i : R^{n_i} → R is a regularizer on the parameter θ_i. Without the Laplacian term, the problem is separable, and corresponds to fitting each parameter separately by minimizing the local loss plus regularizer. The Laplacian term is an additional regularizer that encourages various entries in the parameter vectors to be close to each other.

Figure 1: Residual norm versus iteration for the multi-period portfolio optimization problem.

Laplacian regularized covariance estimation.
We now focus on a more specific case of this general problem, Laplacian regularized covariance estimation. At each node, we have some number of samples from a zero-mean Gaussian distribution on R^d, with covariance matrix Σ_i, assumed positive definite. We will estimate the natural parameters (as an exponential family), the inverse covariance matrices θ_i = Σ_i^{−1}. So here we take the node parameters θ_i to be symmetric positive definite d × d matrices, with n_i = d(d + 1)/
2. (In the discussion of the general case above, θ_i is a vector in R^{n_i}; in the rest of this section, θ_i will denote a symmetric d × d matrix.)

The data samples at node i have empirical covariance S_i (which is not positive definite if there are fewer than d samples). The negative log-likelihood for node i is (up to a constant and a positive scale factor)

ℓ_i(θ_i) = Tr(S_i θ_i) − log det θ_i.

We use trace regularization on the parameter,

r_i(θ_i) = κ Tr(θ_i),

where κ > 0. Without the Laplacian regularization term we can minimize f_i(θ_i) = ℓ_i(θ_i) + r_i(θ_i) analytically; the minimizer is θ_i = (S_i + κI)^{−1}. (See, e.g., [BEGd08].)

The Laplacian regularization is used to encourage neighboring inverse covariance matrices in the given graph to be near each other. It has the specific form

L(θ_1, ..., θ_p) = (λ/2) Σ_{(i,j)∈E} ‖θ_i − θ_j‖_F^2 = (1/2) Tr(θ^T L θ),

where the norm is the Frobenius norm, L is the associated weighted Laplacian matrix (with edge weights λ) for the graph with vertices 1, ..., p and edges E, and λ ≥ 0 scales the regularization. For λ = 0, the estimation problem is separable, with analytical solution

θ_i = (S_i + κI)^{−1},  i = 1, ..., p.

For λ → ∞, assuming the graph is connected, the estimation problem reduces to finding a single inverse covariance matrix for all the data, with analytical solution

θ_i = (S + pκI)^{−1},  i = 1, ..., p,

where S = Σ_{j=1}^p S_j is the empirical covariance of all the data together.

We choose the majorizer L̂ to be block diagonal with each block a multiple of the identity, as in (5). The update at each node in our algorithm can be expressed as minimizing over θ_i the function

Tr((S_i + H_i^k) θ_i) − log det θ_i + κ Tr(θ_i) + (α_i/2) ‖θ_i − θ_i^k‖_F^2,

where H^k = Lθ^k. This minimization can be carried out analytically. Taking the gradient of the subproblem objective with respect to θ_i and setting it to zero, we see that

S_i + H_i^k − θ_i^{−1} + κI + α_i (θ_i − θ_i^k) = 0,

or

θ_i^{−1} − α_i θ_i = S_i + H_i^k + κI − α_i θ_i^k.
This implies that θ_i and S_i + H_i^k + κI − α_i θ_i^k share the same eigenvectors [WT09, DWW14, HPBL17]. Let Q_i Λ_i Q_i^T be the eigendecomposition of S_i + H_i^k + κI − α_i θ_i^k. We find that the eigenvalues v_{ij}, j = 1, ..., d, of θ_i are

v_{ij} = (1/(2α_i)) ( −(Λ_i)_{jj} + ((Λ_i)_{jj}^2 + 4α_i)^{1/2} ).

We have θ_i^{k+1} = Q_i V_i Q_i^T, where V_i = diag(v_{i1}, ..., v_{id}). The computational cost per iteration is primarily in computing the eigendecomposition of S_i + H_i^k + κI − α_i θ_i^k, which has order d^3 flops.

The graph is a 15 ×
15 grid, with 420 edges, so p = 225. The dimension of the data is d = 30, so each θ_i is a symmetric 30 × 30 matrix. The total number of (scalar) variables in our problem instance is 225 × 465 = 104625. (Each sample is a vector in R^30, and the number of samples at each node is smaller than d = 30, so the empirical covariance matrices are singular.) In our problem instance we used hyperparameter values λ = 0.053 and κ = 0.08, which were chosen to give good estimation performance.
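The analytic node update derived above reduces to a single eigendecomposition. Below is a sketch (synthetic S_i, H_i^k, θ_i^k, and parameter values, for illustration only; not the authors' code) that solves θ^{-1} − αθ = S + H + κI − αθ^k and verifies the subproblem optimality condition:

```python
import numpy as np

def node_update(S, H, theta_k, alpha, kappa):
    """One analytic subproblem update for Laplacian regularized covariance
    estimation: solve theta^{-1} - alpha*theta = S + H + kappa*I - alpha*theta_k."""
    d = S.shape[0]
    M = S + H + kappa * np.eye(d) - alpha * theta_k
    lam, Q = np.linalg.eigh(M)                  # M = Q diag(lam) Q^T
    v = (-lam + np.sqrt(lam ** 2 + 4 * alpha)) / (2 * alpha)  # always positive
    return Q @ np.diag(v) @ Q.T

# Synthetic data for one node.
rng = np.random.default_rng(3)
d, alpha, kappa = 5, 2.0, 0.1
A = rng.standard_normal((d, 20)); S = A @ A.T / 20            # empirical covariance
H = rng.standard_normal((d, d)); H = (H + H.T) / 2            # symmetric, like (L theta^k)_i
theta_k = np.eye(d)

theta = node_update(S, H, theta_k, alpha, kappa)
# Optimality condition: S + H - theta^{-1} + kappa*I + alpha*(theta - theta_k) = 0.
resid = S + H - np.linalg.inv(theta) + kappa * np.eye(d) + alpha * (theta - theta_k)
assert np.allclose(resid, 0, atol=1e-8)
assert np.all(np.linalg.eigvalsh(theta) > 0)   # the update is positive definite
```

The cost of this update is dominated by the eigendecomposition, i.e., order d^3 flops, independent of the rest of the graph.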
The problem instance is too large to reliably solve using CVXPY and the solver SCS [OCPB16], which stops after two hours with the status message that the computed solution may be inaccurate. We solved the problem using our distributed method, with given absolute tolerance ε_abs and relative tolerance ε_rel. The method took 54 iterations and 13 seconds to converge. Figure 2 is a plot of the residual norm ‖r^k‖_F versus iteration k.

Figure 2: Residual norm versus iteration for the Laplacian regularized covariance estimation problem.

Figure 3: Root-mean-square error of the optimal estimates versus λ.

Regularization path via warm-start.
To illustrate the advantage of warm-starting our algorithm, we compute the entire regularization path, i.e., the solutions of the problem for 100 values of λ, spaced logarithmically over several orders of magnitude.

Computing these 100 estimates by running the algorithm for each value of λ sequentially, without warm-start, requires 26000 total iterations (an average of 260 iterations per value of λ) and 81 minutes. Computing these 100 estimates using warm-start, starting from the smallest value of λ, requires only 2000 total iterations (an average of 20 iterations per value of λ) and 7.1 minutes. For the specific instance solved above, the algorithm converges in only 2.5 seconds and 10 iterations using warm-start, compared to 13 seconds and 54 iterations using cold-start.

While the point of this example is the algorithm that computes the estimates, we also explore the statistical performance of the method. For each of the 100 values of λ we compute the root-mean-square error between our estimate of the inverse covariance and the true inverse covariance, which we know, since we generated the data. Figure 3 shows a plot of the root-mean-square error of our estimate versus the value of λ. This plot shows that the method works, i.e., produces better estimates of the inverse covariance matrices than handling them separately (small λ) or fitting one inverse covariance matrix for all nodes (large λ).

Acknowledgements

The authors would like to thank Peter Stoica for his insights and comments on early drafts of this paper. We would also like to thank the Air Force Research Laboratory, and in particular Muralidhar Rangaswamy, for discussions of covariance estimation arising in radar signal processing.
References

[AC00] R. Almgren and N. Chriss. Optimal execution of portfolio transactions. Journal of Risk, pages 5–39, 2000.

[AZ06] R. K. Ando and T. Zhang. Learning on graph with Laplacian regularization. Conference on Neural Information Processing Systems, 2006.

[BBD+17] S. Boyd, E. Busseti, S. Diamond, R. N. Kahn, K. Koh, P. Nystrup, and J. Speth. Multi-period trading via convex optimization. Foundations and Trends in Optimization, 3(1):1–76, April 2017.

[BEGd08] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.

[BL00] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization, Theory and Examples. Springer, 2000.

[BMOW14] S. Boyd, M. T. Mueller, B. O'Donoghue, and Y. Wang. Performance bounds and suboptimal policies for multi-period investment. Foundations and Trends in Optimization, 1(1):1–72, January 2014.

[Bol98] B. Bollobás. Modern Graph Theory. Graduate Texts in Mathematics. Springer, Heidelberg, 1998.

[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[Cla90] F. H. Clarke. Optimization and Nonsmooth Analysis. Society for Industrial and Applied Mathematics, 1990.

[CR83] G. Chamberlain and M. Rothschild. Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica, 51(5):1281–1304, 1983.

[DB16] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

[DWW14] P. Danaher, P. Wang, and D. M. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, 76(2):373–397, 2014.

[Eva10] L. C. Evans. Partial Differential Equations. American Mathematical Society, Providence, RI, 2010.

[FHT08] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[GR01] C. Godsil and G. Royle. The Laplacian of a Graph. Springer, 2001.

[HPBL17] D. Hallac, Y. Park, S. Boyd, and J. Leskovec. Network inference via the time-varying graphical lasso. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 205–213, 2017.

[HWD+17] D. Hallac, C. Wong, S. Diamond, R. Sosic, S. Boyd, and J. Leskovec. SnapVX: A network-based convex optimization solver. Journal of Machine Learning Research, 18, 2017.

[Lan16] K. Lange. MM Optimization Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2016.

[MB11] S. Melacci and M. Belkin. Laplacian support vector machines trained in the primal. Journal of Machine Learning Research, 12:1149–1184, July 2011.

[McK17] M. McKerns. Pathos multiprocessing, July 2017. Available at https://pypi.python.org/pypi/pathos.

[NW06] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.

[OCPB16] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, June 2016.

[PB14] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, January 2014.

[PC17] J. Pang and G. Cheung. Graph Laplacian regularization for image denoising: Analysis in the continuous domain. IEEE Transactions on Image Processing, 26(4):1770–1785, April 2017.

[RHL13] M. Razaviyayn, M. Hong, and Z.-Q. Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.

[Roc70] R. T. Rockafellar. Convex Analysis. Princeton Mathematical Series. Princeton University Press, 1970.

[SB09] J. Skaf and S. Boyd. Multi-period portfolio optimization with constraints and transaction costs, 2009.

[SBG+17] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An operator splitting solver for quadratic programs. arXiv e-prints, November 2017.

[SBP17] Y. Sun, P. Babu, and D. Palomar. Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Transactions on Signal Processing, 65(3):794–816, 2017.

[ST17] D. Slepcev and M. Thorpe. Analysis of p-Laplacian regularization in semi-supervised learning. arXiv preprint, October 2017.

[SW17] I. Soloveychik and A. Wiesel. Joint estimation of inverse covariance matrices lying in an unknown subspace. IEEE Transactions on Signal Processing, 65(9):2379–2388, 2017.

[WB10] Y. Wang and S. Boyd. Fast model predictive control using online optimization. IEEE Transactions on Control Systems Technology, 18(3):267–278, March 2010.

[WL10] T. T. Wu and K. Lange. The MM alternative to EM. Statistical Science, 25(4):492–505, 2010.

[WSZS07] K. Q. Weinberger, F. Sha, Q. Zhu, and L. K. Saul. Graph Laplacian regularization for large-scale semidefinite programming. In Advances in Neural Information Processing Systems 19, pages 1489–1496. MIT Press, 2007.

[WT09] D. M. Witten and R. Tibshirani. Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society, 71(3):615–636, 2009.

[YGL16] M. Yin, J. Gao, and Z. Lin. Laplacian regularized low-rank representation and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):504–517, March 2016.

[YR03] A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, April 2003.

[YW02] E. Yildirim and S. Wright. Warm-start strategies in interior-point methods for linear programming.