Generalized Geometric Programming for Rate Allocation in Consensus
Ryan Pilgrim*, Junan Zhu*, Dror Baron*, and Waheed U. Bajwa†
*Department of Electrical and Computer Engineering, NC State University, Raleigh, NC
†Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ
Email: {rzpilgri, jzhu9, barondror}@ncsu.edu, [email protected]

Abstract—Distributed averaging, or distributed average consensus, is a common method for computing the sample mean of the data dispersed among the nodes of a network in a decentralized manner. By iteratively exchanging messages with neighbors, the nodes of the network can converge to an agreement on the sample mean of their initial states. In real-world scenarios, these messages are subject to bandwidth and power constraints, which motivates the design of a lossy compression strategy. Few prior works consider the rate allocation problem from the perspective of constrained optimization, which provides a principled method for the design of lossy compression schemes, allows for the relaxation of certain assumptions, and offers performance guarantees. We show for Gaussian-distributed initial states with entropy-coded scalar quantization and vector quantization that the coding rates for distributed averaging can be optimized through generalized geometric programming. In the absence of side information from past states, this approach finds a rate allocation over nodes and iterations that minimizes the aggregate coding rate required to achieve a target mean square error within a finite run time. Our rate allocation is compared to some of the prior art through numerical simulations. The results motivate the incorporation of side information through differential or predictive coding to improve rate-distortion performance.
Index Terms—Compression, consensus, geometric programming, optimization, source coding.
RP, JZ, and DB were supported by the National Science Foundation (NSF) under grants CCF-1217749 and ECCS-1611112. WB was supported by the NSF under grant CCF-1453073. RP is currently with Vadum, Inc., Raleigh, NC, and JZ is currently with J.P. Morgan Chase, New York, NY.

I. INTRODUCTION
The proliferation of wireless sensors and large distributed data sets in recent years has provided significant motivation for the development of distributed computing methods. In many distributed computing settings, it is necessary to compute a function of data that may be dispersed among a number of computing nodes. Examples include wireless sensor networks (WSNs), where each agent observes a different measurement of a physical process, and large-scale server farms, where the size of the data set requires distributed storage [1]. The class of distributed algorithms considered in this paper computes these functions using only interactions among local subsets of the network nodes. One popular approach to distributed function computation, which has many variants, is consensus [2, 3]. Consensus has found applications in a wide variety of settings, including distributed swarm control, sensor fusion, optimization, filtering, environmental monitoring, and distributed learning [2, 4, 5].

As a motivating example for our lossy compression framework, we consider the application of distributed averaging in representation learning for the internet of things (IoT). Distributed average consensus helps different nodes share intermediate results within a distributed version of the power method [6] (this distributed power method helps find the dominant singular vector of a matrix when batches of its columns are stored among different nodes of a network), which can be used in distributed dictionary learning schemes such as cloud K-SVD [5]. Take face recognition as an example. Powered by IoT, a camera network can gather a large number of face images and train a face recognition model using dictionary learning [7]. If these cameras can share the data they gathered and re-train their face recognition model, then they can improve their performance over time. In fully connected networks, the cameras would easily share newly gathered images and re-train their face recognition model efficiently. However, in IoT, the network is often ad hoc and may have sparse connectivity. Furthermore, in dictionary learning for images, large quantities of data must be communicated among nodes, which strains the energy constraints of IoT devices [8].

In this work, we propose a scheme that offers the potential to balance the competing metrics of communication load, estimation error, and execution time for such IoT applications. After running for a finite number of iterations $T < \infty$, average consensus will produce an estimate of the mean that has some associated mean square error (MSE), $\mathrm{MSE}(T) > 0$. If the internode messages are quantized, this process will require the transmission of a total or aggregate coding rate, $R_{\mathrm{agg}} > 0$. Our goal is to minimize the communication load, measured by $R_{\mathrm{agg}}$, subject to a given number of iterations to be carried out ($T$) and a desired final accuracy, $\mathrm{MSE}(T)$. To perform this minimization, we use the theory of generalized geometric programming (GGP) [9, 10] to formulate a convex program that can find the global optimum solution for certain lossy compression schemes.

A. Prior art
Many early works on consensus assumed that the nodes could communicate real-valued data to one another [11]. In realistic scenarios, the nodes must communicate within bandwidth and energy constraints, which can have a significant impact on the convergence of distributed averaging algorithms. Although many papers have been published on quantized consensus in recent years (e.g., [1, 4, 11-17]), a large portion considers trade-offs among run time, communication load, and final accuracy without formulating the problem as constrained optimization. (For a more thorough literature review, see Pilgrim [18].) Early publications on quantized consensus (e.g., [11, 19]) show that introducing perturbations of constant variance (such as quantization error) into the traditional consensus state update prevents convergence due to the limited precision of the quantizer. Due to the difficulties associated with quantization error, many works (e.g., [12-15, 20, 21]) address the incorporation of dynamic coding strategies into consensus protocols. However, few prior schemes explicitly consider the rate-distortion (RD) trade-off, and they instead offer heuristics to optimize their respective performance metrics.

Several papers (e.g., [12, 20, 21]) consider the possibility of using differential coding (also called difference quantization) [22] with a shrinking quantization range during the transmission of messages between nodes. Of these, Thanou et al. [13] demonstrated lower MSE than previous works in this area (e.g., [12, 20]) with equal communication load. Although these papers [12, 13, 20] exploit side information in their coding strategies, they study the simple fixed-rate uniform quantizer and do not make effective use of lossy compression to balance the trade-offs among $T$, $\mathrm{MSE}(T)$, and $R_{\mathrm{agg}}$.

Yildiz and Scaglione [14, 15], unlike other authors, explicitly considered the RD trade-off to achieve an asymptotic MSE value in consensus for the case of Gaussian initial states. They proposed schemes based on differential [15], predictive, and Wyner-Ziv coding [14]. Despite their sophisticated coding approaches, the state update step they used can only provide bounded steady-state error in the limit of many consensus iterations, so that $\lim_{T\to\infty} \mathrm{MSE}(T) > 0$ [14].

In addition to these dynamic coding strategies [12-14, 20], a few works [4, 17] consider optimization strategies for energy consumption in wireless sensor networks. However, Nokleby et al. [4] require a specific topological evolution of the network. Huang and Hua [17] keep the coding rates for their fixed-rate uniform quantizers constant across both nodes and iterations, which does not fully explore the more general space of node- and time-varying rates.

A handful of works (e.g., [23-26]) analyze consensus from the viewpoint of information theory. Yang et al. [26] considered RD bounds for data aggregation, in which data is routed through a tree network to a fusion center, and consensus, in which each node forms an estimate of the desired quantity. Although Yang et al. [26] provided bounds on the RD relationship for consensus in trees and proved the achievability of the derived bounds, their analysis is limited to the setting where the network is tree structured. Often it is beneficial to consider more flexible topologies, such as random geometric graphs, which have been used to model WSNs [27]. In general, random geometric graphs and their real-world WSN counterparts have loops.
B. Our contributions
This paper presents a framework for attaining an estimate of the network sample mean at each node, within a desired average level of accuracy, with finite run time and minimal total communication cost (measured by $R_{\mathrm{agg}}$) using either deterministic or dithered quantization. Our framework is informed by the results of RD theory [28] and convex optimization [9]. In the plethora of literature we surveyed, a crucial problem that has not been addressed is the optimization of quantization schemes for finite run time without a certain set of limiting assumptions. Prior works restrict the topology to trees [26], assume a certain form for the rate/distortion sequences [14, 15], restrict the rates to be the same at each node and/or iteration [13, 15, 17], or use bounds on the MSE or asymptotic MSE values, rather than exact MSE quantities [15, 17], in their analyses. The key contribution of this work is the use of GGP [10] to minimize the communication load subject to an accuracy constraint, which avoids the previously discussed restrictions. The advantages of our approach are (i) ignorance about the parametric form of the allocated rates, which avoids the introduction of unjustified or unnecessary assumptions; (ii) support for different rates at each node and iteration of distributed average consensus; and (iii) optimization with respect to exact MSE constraints (rather than bounds) for finite iteration count.

C. Notation
We denote the positive-valued subset of a set $S$ by $S_{>0}$ and the nonnegative-valued subset by $S_{\geq 0}$. The integers are denoted by $\mathbb{Z}$, and the real numbers by $\mathbb{R}$. Vectors are written in boldface lowercase letters (e.g., $\mathbf{x}$). Matrices are written in boldface capital letters (e.g., $\mathbf{A}$). The $(i,j)$th element of matrix $\mathbf{A}$ is written $[\mathbf{A}]_{ij}$. Quantities that vary with time are written as functions of time (e.g., $\mathbf{x}(t)$). The $\ell_p$-norm is written $\|\cdot\|_p$, matrix transpose is written $\{\cdot\}^{\mathsf T}$, and matrix inverse is written $\{\cdot\}^{-1}$. The expectation operator is written as $E[\cdot]$, the mean of a time-dependent random variable $x(t)$ is written $\mu_x(t)$, and its covariance is denoted by $\Sigma_x(t)$. The Kronecker product of two matrices $\mathbf{A}$, $\mathbf{B}$ is written $\mathbf{A} \otimes \mathbf{B}$. A diagonal matrix is compactly specified as $\operatorname{diag}[x_1, \ldots, x_n]^{\mathsf T}$, denoting the matrix with $x_1, \ldots, x_n$ on its main diagonal and zeros elsewhere.

II. PROBLEM FORMULATION
A. System model
In this paper, communication links are bidirectional, and we model the network as an undirected graph, $G = \{V, E\}$, comprised of a set of $m$ vertices (nodes) $V$ and a set of edges $E$ between pairs of vertices [29]. Because the communication links are bidirectional, each edge $(i,j) \in E$ is represented as an unordered pair of vertices, $i, j \in V$ [30].

In the simplest case of the consensus problem, each node $i \in \{1, \ldots, m\}$ has an initial scalar quantity $z_i(0) \in \mathbb{R}$, and the goal is to have all nodes of the network agree upon the sample mean of these quantities by iteratively exchanging messages with their neighbors [30]. The quantities $z_i(t)$, $t \geq 0$, will be referred to as "states," which in this paper are assumed to be real-valued scalar random variables (RVs) with joint Gaussian distribution for $t = 0$. More formally, let the (discrete) iteration index be a nonnegative integer, $t \in \mathbb{Z}_{\geq 0}$. At $t = 0$, the states $\{z_i(0)\}_{i=1}^m$ are the initial values to be averaged by the consensus algorithm. For $t \geq 1$, the state $z_i(t)$ represents the estimate of the sample average $\bar{z} := \frac{1}{m}\sum_{i=1}^m z_i(0)$ at node $i$. The objective of consensus is for the state $z_i(t)$ to eventually equal the sample mean of the initial states: $\lim_{t\to\infty} z_i(t) = \bar{z}$, $\forall i \in \{1, \ldots, m\}$ [30]. In this paper, we restrict our attention to deterministic, synchronous-update consensus algorithms. We assume the following: (i) the communication link topology of the network is fixed and does not change with time; (ii) at each iteration, every node exchanges messages with only its neighbors; (iii) communication channels are noiseless; (iv) initial states of all nodes are Gaussian with a known joint distribution; (v) each node can use a different rate at each consensus iteration; (vi) internode messages are broadcast to all neighbors at once; and (vii) states are stored with infinite precision, but communicated with limited precision. The last point is well motivated for nodes with 32- or 64-bit floating-point support.

Given the above assumptions on communication, one popular algorithm for consensus relies on linear updates [2, 30]. Each node updates its state by forming a weighted sum of its own state with those of its neighbors [30],

$$z_i(t+1) = w_{ii} z_i(t) + \sum_{j \in N_i} w_{ij} z_j(t), \qquad (1)$$

where $w_{ij} > 0$, $\forall i, j$, $\sum_{k=1}^m w_{ik} = \sum_{k=1}^m w_{kj} = 1$, and $N_i = \{j \,|\, (i,j) \in E\}$ denotes the neighborhood of node $i$. The weights $w_{ij}$ are designed such that $\lim_{t\to\infty} z_i(t) = \bar{z}$ [30]. The above update equation (1) can be written as [30]

$$\mathbf{z}(t+1) = \mathbf{W}\mathbf{z}(t). \qquad (2)$$

The asymptotic convergence condition is then $\lim_{t\to\infty} \mathbf{z}(t) = m^{-1}\mathbf{1}\mathbf{1}^{\mathsf T}\mathbf{z}(0) = \bar{z}\,\mathbf{1}$. The interested reader is referred to Xiao and Boyd [30] for a study of weight matrix design.

When quantization error is present within the internode messages, the simple linear iteration (2) is not guaranteed to converge [19]. Instead, we use the modified iteration used by Frasca et al. [11], which allows the sample average to be preserved in the presence of quantization errors.

Let $Q: \mathbb{R}^m \to X^m$ represent quantization to a finite set of representation levels $X^m \subset \mathbb{R}^m$ (i.e., $Q(\mathbf{z}(t)) = [Q_1(z_1(t)), \ldots, Q_m(z_m(t))]^{\mathsf T}$). The associated quantization error is given by $\boldsymbol{\epsilon}(t) := Q(\mathbf{z}(t)) - \mathbf{z}(t)$. We define the distortion at node $i$ and iteration $t$ as $D_i(t) := E[\epsilon_i^2(t)]$. The subscripts on $Q$ indicate that each node can use a different quantizer in general. To allow the algorithm to converge to zero steady-state estimation error, we use the update proposed by Frasca et al. [11], which is

$$\mathbf{z}(t+1) = \mathbf{z}(t) + (\mathbf{W} - \mathbf{I})\,Q(\mathbf{z}(t)), \qquad (3)$$

where $\mathbf{I}$ is the identity matrix. The key advantage of this update is that the average $\frac{1}{m}\sum_{i=1}^m z_i(t)$ of the states $z_i(t)$ is preserved at each step $t$, despite quantization error [11]. Because the average is preserved at each iteration, the estimation error from the average consensus state $\bar{z}$ is [11]

$$\mathbf{e}(t) = \left(\mathbf{I} - m^{-1}\mathbf{1}\mathbf{1}^{\mathsf T}\right)\mathbf{z}(t). \qquad (4)$$

The MSE at node $i$ corresponding to an estimation error $e_i(t)$ at iteration $t$ is given by $\mathrm{MSE}_i(\mathbf{d}, t) := E[e_i^2(t)]$, where $\mathbf{d}$ is a vector of all distortions introduced by all nodes throughout the consensus process. The average MSE across the network at the end of iteration $t$ is given by $\mathrm{MSE}(t) := \frac{1}{m}\sum_{i=1}^m \mathrm{MSE}_i(t)$.

The communication cost of consensus becomes substantial when nodes exchange vector states $\mathbf{z}_i(t) \in \mathbb{R}^n$, rather than scalars. In this case, all entries can be collected into a vector $\boldsymbol{\zeta}(t) := [\mathbf{z}_1^{\mathsf T}(t), \cdots, \mathbf{z}_m^{\mathsf T}(t)]^{\mathsf T} \in \mathbb{R}^{mn}$, and a matrix is defined $\boldsymbol{\Omega} := \mathbf{W} \otimes \mathbf{I}_n \in \mathbb{R}^{mn \times mn}$ [17], so that (3) becomes

$$\boldsymbol{\zeta}(t+1) = \boldsymbol{\zeta}(t) + (\boldsymbol{\Omega} - \mathbf{I}_{mn})\,Q(\boldsymbol{\zeta}(t)). \qquad (5)$$

This work assumes that $\boldsymbol{\zeta}(t)$ has a known joint Gaussian distribution, and that the $\mathbf{z}_i(t)$ are distributed such that $\boldsymbol{\mu}_{z_i}(t) = \mu_{z_i}(t)\mathbf{1}$, and the diagonal of $\boldsymbol{\Sigma}_{z_i}(t)$ is $\sigma_i^2(t)\mathbf{1}$. This means that the marginal mean and variance can be extracted from any of the elements corresponding to $\mathbf{z}_i(t)$ from the joint mean $\boldsymbol{\mu}_{\zeta}(t)$ and joint covariance $\boldsymbol{\Sigma}_{\zeta}(t)$, respectively. The marginal means and variances can thus be derived in our case from (3), without considering the higher-dimensional (5). The following statistical analysis will focus on this scalar case.
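To make the update (3) concrete, the following minimal NumPy sketch (not code from the paper) runs quantized consensus with a generic per-node quantizer; the uniform quantizer and its step size are illustrative assumptions, not the schemes optimized later.

```python
import numpy as np

def quantized_consensus(W, z0, quantize, T):
    """Run the average-preserving update (3): z(t+1) = z(t) + (W - I) Q(z(t))."""
    m = len(z0)
    z = z0.astype(float).copy()
    for t in range(T):
        q = quantize(z, t)               # Q(z(t)), one quantized value per node
        z = z + (W - np.eye(m)) @ q      # update (3)
    return z

def uniform_quantizer(z, t, delta=0.1):
    # Illustrative fixed-step uniform quantizer (step size is an assumption).
    return delta * np.round(z / delta)

# Usage: z_final = quantized_consensus(W, z0, uniform_quantizer, T=50).
# Because 1^T (W - I) = 0 for a doubly stochastic W, the sample mean of
# z(t) is preserved at every iteration, regardless of quantization error.
```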
B. Main objective

At every iteration $t \in \{0, \ldots, T-1\}$ (where $T$ is the total number of iterations) of the consensus process, each node $i \in \{1, \ldots, m\}$ uses a rate $R_i(t)$ to encode its state for transmission to neighboring nodes. This rate is the average number of bits used per symbol. That is, if node $i$ sends a length-$l_i(t)$ binary encoding of its scalar state $z_i(t)$ to node $j$, then the corresponding rate is given by $R_i(t) = E[l_i(t)]$, where the expectation is taken over the distribution of $z_i(t)$. In general, $R_i(t)$ can vary across both nodes and iterations, so that it is not necessary that $R_i(t) = R_j(s)$ for any $i \neq j$, $t \neq s$. We simplify notation by defining the rate vector

$$\mathbf{r} := [R_1(0), \ldots, R_m(0), \ldots, R_1(T-1), \ldots, R_m(T-1)]^{\mathsf T}.$$

Denote the distortions per entry, $D_i(t) := E[\epsilon_i^2(t)]$, incurred using rates $\mathbf{r}$ by the distortion vector

$$\mathbf{d} := [D_1(0), \ldots, D_m(0), \ldots, D_1(T-1), \ldots, D_m(T-1)]^{\mathsf T}. \qquad (6)$$

One key quantity we use to determine the cost of running the consensus process is the aggregate coding rate [31, 32]:

$$R_{\mathrm{agg}} := \sum_{t=0}^{T-1} \sum_{i=1}^m R_i(t), \qquad (7)$$

which represents the total rate used over the $T$ iterations of the consensus algorithm by all $m$ nodes of the network. Our main objective is to derive minimization strategies for

$$C(\mathbf{r}, T) := R_{\mathrm{agg}} \qquad (8)$$

for fixed- and variable-length codes for Gaussian-distributed sources using a variety of quantizers.

To efficiently encode the data stored across the network, it is necessary to know the distribution of $z_i(t)$ for all $i \in \{1, \ldots, m\}$ and $t \geq 0$. To determine these distributions from (3) and the distributions of the initial states $\{z_i(0)\}_{i=1}^m$ is difficult in general. Instead, we propose an optimization scheme for entropy-coded uniform scalar quantization (ECSQ) [22] of stationary Gaussian states and RD-optimal vector quantization (VQ) [28] of memoryless Gaussian-distributed states.* Because (3) consists of a linear combination of jointly Gaussian RVs and independent quantization errors, it can be proven that the states will remain Gaussian for all $t \geq 0$ [18], and thus the mean and covariance of $\mathbf{z}(t)$ are sufficient to describe its distribution.

It can be shown that, for additive quantization noise and symmetric weight matrices [18]:

$$\boldsymbol{\mu}_z(t+1) = \mathbf{W}\boldsymbol{\mu}_z(t), \qquad (9)$$
$$\boldsymbol{\mu}_e(t+1) = \left(\mathbf{I} - m^{-1}\mathbf{1}\mathbf{1}^{\mathsf T}\right)\mathbf{W}\boldsymbol{\mu}_e(t), \qquad (10)$$
$$\boldsymbol{\Sigma}_z(t+1) = \mathbf{W}\boldsymbol{\Sigma}_z(t)\mathbf{W} + (\mathbf{W} - \mathbf{I})\boldsymbol{\Sigma}_\epsilon(t)(\mathbf{W} - \mathbf{I}), \qquad (11)$$
$$\boldsymbol{\Sigma}_e(t) = \left(\mathbf{I} - m^{-1}\mathbf{1}\mathbf{1}^{\mathsf T}\right)\boldsymbol{\Sigma}_z(t)\left(\mathbf{I} - m^{-1}\mathbf{1}\mathbf{1}^{\mathsf T}\right), \qquad (12)$$

where $\mathbf{e}(t)$ is the estimation error (4) and the covariance $\boldsymbol{\Sigma}_\epsilon(t)$ is diagonal, $\boldsymbol{\Sigma}_\epsilon(t) = \operatorname{diag}[D_1(t), \cdots, D_m(t)]^{\mathsf T}$. Using these definitions, we present the following mathematical relationships, which we term the state evolution equations. These equations allow us to perform the optimization of the rate vector $\mathbf{r}$ using the cost function (8). The marginal source variance $\nu_i(\mathbf{d}, t)$ at node $i$ and iteration $t$ is given by $\nu_i(\mathbf{d}, t) = [\boldsymbol{\Sigma}_z(t)]_{ii}$, the MSE at node $i$ and iteration $t$ is given by

$$\mathrm{MSE}_i(\mathbf{d}, t) = \left[\boldsymbol{\Sigma}_e(t) + \boldsymbol{\mu}_e(t)\boldsymbol{\mu}_e^{\mathsf T}(t)\right]_{ii}, \qquad (13)$$

and the average MSE across the network at iteration $t$ is

$$\mathrm{MSE}(\mathbf{d}, t) = \frac{1}{m}\operatorname{tr}\left(\boldsymbol{\Sigma}_e(t) + \boldsymbol{\mu}_e(t)\boldsymbol{\mu}_e^{\mathsf T}(t)\right), \qquad (14)$$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix.

*For the scalar quantization schemes, we assume that the state of each node is a sample from a stationary, ergodic Gaussian random process. We expect the performance of ECSQ to be the same as in the memoryless case.
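The state evolution equations (9)-(12) can be propagated numerically. The sketch below is a hedged illustration (the function name and interface are our own choices); it returns the marginal variances $\nu_i(\mathbf{d}, t)$ and the network MSE (14) for a given distortion schedule.

```python
import numpy as np

def state_evolution(W, mu0, Sigma0, D):
    """Propagate (9)-(12); D[t] is the length-m distortion vector at iteration t."""
    m = W.shape[0]
    P = np.eye(m) - np.ones((m, m)) / m          # I - (1/m) 1 1^T, as in (4)
    mu_z, Sigma_z = mu0.copy(), Sigma0.copy()
    nu, mse = [], []
    for t in range(len(D) + 1):
        nu.append(np.diag(Sigma_z).copy())       # nu_i(d, t) = [Sigma_z(t)]_ii
        mu_e = P @ mu_z                          # mean of e(t), from (4)
        Sigma_e = P @ Sigma_z @ P                # (12)
        mse.append(np.trace(Sigma_e + np.outer(mu_e, mu_e)) / m)   # (14)
        if t < len(D):
            Sigma_eps = np.diag(D[t])            # diagonal Sigma_eps(t)
            Sigma_z = (W @ Sigma_z @ W
                       + (W - np.eye(m)) @ Sigma_eps @ (W - np.eye(m)))  # (11)
            mu_z = W @ mu_z                      # (9)
    return np.array(nu), np.array(mse)
```

Computing $\boldsymbol{\mu}_e(t)$ as $\mathbf{P}\boldsymbol{\mu}_z(t)$ at each step is equivalent to iterating (10), since $\mathbf{e}(t) = \mathbf{P}\mathbf{z}(t)$ by (4).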
C. Rate-distortion theory
For encoders operating on real-valued sources, the quantization process necessarily introduces a certain expected distortion $D$ into their representation of the input signal [22]. This distortion can be quantified using a number of metrics, but for the purpose of this paper, we use the square error $\delta(\mathbf{z}(t), \hat{\mathbf{z}}(t)) = \|\mathbf{z}(t) - \hat{\mathbf{z}}(t)\|_2^2$, so that the expected distortion per node per dimension is given by $D = \frac{1}{m} E[\|\mathbf{z}(t) - \hat{\mathbf{z}}(t)\|_2^2]$ [22]. In general, using a higher coding rate $R$ results in a lower distortion $D$, with the drawback of greater communication load. RD theory [28] quantifies the best possible trade-off between coding rate and distortion. The minimum coding rate $R$ required for any compression scheme to produce an expected distortion less than or equal to a particular value $D$ is given by the RD function $R(D)$ [28].

III. RATE ALLOCATION VIA GGP

The key insight of this work is the ability to pose the optimization of the rate vector $\mathbf{r}$ as a GGP, for which the global optimum can be found [10]. The resulting scheme finds an efficient rate vector $\mathbf{r}$ that achieves a target value of $\mathrm{MSE}(\mathbf{d}, T)$, given by (14).

When a particular quantizer is used in our problem, it will often have an RD performance trade-off that differs from $R(D)$, which is a bound on the best possible performance [22]. In this paper, we term such a trade-off curve for a particular practical quantizer an operational RD relationship. For ECSQ and uniform quantization followed by fixed-rate coding in the case of Gaussian sources, the operational RD relationship in the high-rate regime is

$$R(D) \approx \begin{cases} \frac{1}{2}\log_2\left(\frac{\sigma^2}{D}\right) + R_c, & D \in (0, \sigma^2 D_{\max}], \\ 0, & \text{otherwise}, \end{cases} \qquad (15)$$

where $\sigma^2$ represents the variance of the data to be encoded, and $R_c$ and $D_{\max}$ are constants [33]. In some cases, such as infinite-dimensional VQ with memoryless Gaussian sources and dithered [34] scalar uniform quantization [35], the relationship (15) holds for all rates.

The source variance $\nu_i(\mathbf{d}, t)$ is a function of the initial state vector covariance $\boldsymbol{\Sigma}_z(0)$ (see (11)) and the distortion vector $\mathbf{d}$ in (6); it evolves as described by (11). The operational RD relationship at all nodes $i \in \{1, \ldots, m\}$ and iterations $t \in \{0, \ldots, T-1\}$ can be expressed as

$$R_i(\mathbf{d}, t) = \frac{1}{2}\log_2\left(\max\left\{\frac{\nu_i(\mathbf{d}, t)}{D_i(t)},\, 2^{-2R_c}\right\}\right) + R_c. \qquad (16)$$

The max in (16) encapsulates the saturation of the RD relationship at $R = 0$ in (15). Given $T$ iterations, the goal is to minimize the aggregate coding rate (7), subject to a constraint on the final MSE, $\mathrm{MSE}(\mathbf{d}, T) \leq \mathrm{MSE}^*$ (14).* For a target MSE of $\mathrm{MSE}^*$, the minimum number of iterations required to achieve that MSE, $T_{\min} = \arg\min_T \{T \,|\, \mathrm{MSE}(\mathbf{d}, T) < \mathrm{MSE}^*,\, \mathbf{d} = \mathbf{0}\}$, can be readily obtained using the state evolution equations (9)-(12). More formally, using the operational RD relationship (16), the optimization problem is

$$\underset{\mathbf{d}}{\text{minimize}} \quad \sum_{t=0}^{T-1}\sum_{i=1}^m \frac{1}{2}\log_2\left(\max\left\{\frac{\nu_i(\mathbf{d}, t)}{D_i(t)},\, 2^{-2R_c}\right\}\right) + R_c,$$

subject to the constraints

$$\mathrm{MSE}(\mathbf{d}, T) \leq \mathrm{MSE}^*, \qquad (17)$$
$$D_i(t) > 0, \quad \forall i, t. \qquad (18)$$

Note that the above optimization is equivalent to

$$\underset{\mathbf{d}}{\text{minimize}} \quad \ln\left(\prod_{t=0}^{T-1}\prod_{i=1}^m \max\left\{\frac{\nu_i(\mathbf{d}, t)}{D_i(t)},\, 2^{-2R_c}\right\}\right), \quad \text{subject to (17) and (18)}. \qquad (19)$$

We will now introduce the concept of GGP and show that the optimization (19) reduces to such a problem.

*Note that the final network MSE corresponds to iteration index $T$, and not $T-1$. This is because $\mathrm{MSE}(\mathbf{d}, t)$ is the network MSE after the end of $t$ iterations, or before the execution of the $(t+1)$th iteration.
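As a quick illustration of (16) (a small helper of our own, not code from the paper), the operational rate saturates at zero once $\nu_i(\mathbf{d}, t)/D_i(t)$ drops below $2^{-2R_c}$:

```python
import numpy as np

def operational_rate(nu, D, Rc):
    """Operational RD relationship (16); returns 0 bits once nu/D <= 2**(-2*Rc)."""
    return 0.5 * np.log2(np.maximum(nu / D, 2.0 ** (-2 * Rc))) + Rc
```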
A. Basics of GGP

The following information can be found in Boyd and Vandenberghe [9]. For this subsection, we stay close to the authors' original notation. In the language of geometric programming, a function of the form

$$f(\mathbf{x}) = c\,x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}, \quad c > 0, \; x_i > 0, \; a_i \in \mathbb{R}, \; \forall i,$$

is called a monomial [9, 10]. Similarly, a function of the form

$$f(\mathbf{x}) = \sum_{i=1}^k g_i(x_1, \ldots, x_n)$$

is a posynomial [9, 10], where the $g_i(x_1, \ldots, x_n)$ are monomials. Generalized posynomials are functions formed from posynomials by operations including addition, multiplication, and maximum [10].

A standard inequality-constrained GGP has the form

$$\underset{x_1, \ldots, x_n}{\text{minimize}} \quad C(x_1, \ldots, x_n),$$
$$\text{subject to} \quad f_i(x_1, \ldots, x_n) \leq 1, \quad \forall i \in \{1, \ldots, n_f\},$$
$$\qquad\qquad\; g_i(x_1, \ldots, x_n) = 1, \quad \forall i \in \{1, \ldots, n_g\},$$
$$\qquad\qquad\; x_i > 0, \quad \forall i \in \{1, \ldots, n\},$$

where the cost $C(x_1, \ldots, x_n)$ and all the inequality constraints $f_i(x_1, \ldots, x_n)$ are generalized posynomials, and all the equality constraints $g_i(x_1, \ldots, x_n)$ are monomials [10]. Applying some function transformations, GGPs can be cast in convex form and efficiently solved numerically [10].
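For readers who want to experiment, here is a toy GGP in cvxpy, a Python counterpart of the CVX toolbox used later in the paper; the specific objective and constraints are invented for illustration only.

```python
import cvxpy as cp

# Toy GGP: minimize a posynomial subject to a generalized-posynomial
# (max of posynomials) inequality and a monomial equality.
x = cp.Variable(pos=True)
y = cp.Variable(pos=True)

objective = x * y + 2.0 / (x * y**2)        # posynomial
constraints = [
    cp.maximum(x + y, 2 * x) <= 3.0,        # generalized posynomial <= constant
    x * y == 1.0,                           # monomial equality
]
prob = cp.Problem(cp.Minimize(objective), constraints)
prob.solve(gp=True)                         # solved via the log-log convex form
print(x.value, y.value, prob.value)
```

The `gp=True` flag invokes exactly the transformation described above: variables are replaced by their logarithms, turning monomials into affine functions and posynomials into log-sum-exp expressions.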
B. Generalized posynomial form of cost function

Applying a monotone increasing function to the cost function (19) results in an equivalent problem to (19) [9], so we apply the exponential function to obtain

$$\underset{\mathbf{d}}{\text{minimize}} \quad \prod_{t=0}^{T-1}\prod_{i=1}^m \max\left\{\frac{\nu_i(\mathbf{d}, t)}{D_i(t)},\, 2^{-2R_c}\right\}, \quad \text{subject to (17) and (18)}. \qquad (20)$$

The above optimization problem (20) is a GGP, which is formally shown in Pilgrim [18].

Two aspects of the above optimization (20) should be highlighted. Because the constraints are allowed to be generalized posynomials, one could also optimize with respect to a constraint on the maximum node MSE, for example, $\max_i \{\mathrm{MSE}_i(\mathbf{d}, T)\}_{i=1}^m \leq \mathrm{MSE}^*$, or constraints on each of the node MSE values, $\mathrm{MSE}_i(\mathbf{d}, T) \leq \mathrm{MSE}_i^*$, $\forall i \in \{1, \ldots, m\}$, where $\mathrm{MSE}_i(\mathbf{d}, t)$ is defined in (13). Also, in its most general form, the optimization allows each node to use a different rate or distortion. In the interest of designing a distributed protocol (or for computational efficiency), one may wish to constrain the rates or distortions at each node to be the same. The constraint that all distortions be the same is a straightforward modification of (20) and is also a GGP.
C. Constant distortion simplification

Solving the exact optimization problem (20) naively requires explicit representation of all $mT$ distortions $D_i(t)$ and all the coefficients of the log-sum-exp (LSE) model* required to compute $\mathrm{MSE}_i(\mathbf{d}, t)$ and $\nu_i(\mathbf{d}, t)$ from $\mathbf{d}$. The result of this explicit representation is large computational time and memory complexity. In this section, we explore a simplification of (20) to combat these issues.

In practice, as the network size $m$ and iteration count $T$ grow, the memory and time requirements of the optimization (20) seem to grow quickly. If explicit representation of LSE parameters can be avoided, it is possible to apply other convex optimization methods without these scaling issues. To provide a program that is more easily solvable in practice, we constrain the distortions to be equal at each node, which is equivalent to redefining $\mathbf{d} := [D(0), \ldots, D(T-1)]^{\mathsf T}$. The optimization is then

$$\underset{\mathbf{d}}{\text{minimize}} \quad \prod_{t=0}^{T-1}\prod_{i=1}^m \max\left\{\frac{\nu_i(\mathbf{d}, t)}{D(t)},\, 2^{-2R_c}\right\}, \quad \text{subject to (17) and (18)}. \qquad (21)$$

In the following section, the results of solving the simplified problem (21) are compared to the solutions of the exact program (20) and the prior art [13, 14]. Surprisingly, the above simplified optimization provides competitive results for random geometric networks [38], with significant reduction in memory and run-time requirements. We conclude by noting that in practical implementation, it is anticipated that the optimization (20) will be run offline with a priori knowledge of the network topology and initial state statistics.

*GGPs are converted to convex LSE form for solution [36, 37].
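A hedged end-to-end sketch of the constant-distortion program (21) is given below. It relies on the fact that unrolling (11) makes each $\nu_i(\mathbf{d}, t)$, and the MSE (14), affine in the distortions with nonnegative coefficients (hence posynomials, consistent with the GGP claim proved in [18]). Zero-mean states are assumed so the $\boldsymbol{\mu}_e$ terms vanish; the function name and structure are our own, and whether a given cvxpy version accepts this exact composition under `gp=True` is not guaranteed.

```python
import numpy as np
import cvxpy as cp

def allocate_distortions(W, Sigma0, T, mse_target, Rc):
    """Sketch of the constant-distortion GGP (21), assuming symmetric W and
    zero-mean states (so only (11)-(12) enter the MSE)."""
    m = W.shape[0]
    P = np.eye(m) - np.ones((m, m)) / m       # I - (1/m) 1 1^T, as in (4)

    # Unrolling (11): D(s) enters Sigma_z(t) through M[k] = W^k (W - I) with
    # k = t - 1 - s, so nu_i(d, t) is affine in d with nonnegative weights.
    M = [W - np.eye(m)]
    for _ in range(1, T):
        M.append(W @ M[-1])
    Wp = [np.linalg.matrix_power(W, t) for t in range(T + 1)]

    D = cp.Variable(T, pos=True)              # one distortion per iteration, (18)
    factors = []
    for t in range(T):
        base = np.diag(Wp[t] @ Sigma0 @ Wp[t])        # quantization-free variance
        for i in range(m):
            nu_it = base[i]                           # posynomial nu_i(d, t)
            for s in range(t):
                c = np.sum(M[t - 1 - s][i] ** 2)
                if c > 0:                             # keep coefficients positive
                    nu_it = nu_it + c * D[s]
            # one factor of the product objective (21), saturated at R = 0
            factors.append(cp.maximum(nu_it / D[t], 2.0 ** (-2 * Rc)))

    # MSE(d, T) from (12)/(14) is likewise affine in d with nonnegative weights.
    mse = np.trace(P @ Wp[T] @ Sigma0 @ Wp[T] @ P) / m
    for s in range(T):
        mse = mse + (np.linalg.norm(P @ M[T - 1 - s]) ** 2 / m) * D[s]

    prob = cp.Problem(cp.Minimize(cp.prod(cp.hstack(factors))),  # objective (21)
                      [mse <= mse_target])                       # constraint (17)
    prob.solve(gp=True)
    return D.value
```

The allocated rates then follow from the solved distortions via (16), e.g., using the `operational_rate()` helper sketched earlier.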
Fig. 1. Optimal rates and MSE sequences from the solution of (20) (fixed $T$, $\rho_c$, $\sigma_x^2$, $\sigma_n^2$, and $m$). Left: optimal rate sequences for (20). Right: optimal rate sequences for (21). The rates are plotted against iteration indices, and each line represents the rates used by a different sensor. Note that because variable-length coding is used [22], the rates can be non-integer-valued. (Axes: iteration index $t$ versus rate in bits/symbol.)

IV. NUMERICAL EXPERIMENTS
In this section, we present numerical results that provide insight into the optimal rate sequences that result from solving (20). We further compare the performance of the proposed GGP optimizations (20) and (21) to the prior art [13, 14]. To test the effectiveness of the proposed approach, we used the CVX toolbox [36, 37] to solve (20) and (21). Due to scaling issues, only the fixed-distortion problem (21) was solved for prior art comparison. The bin size of all fixed-rate uniform quantizers was set to 12 times the standard deviation of the data to prevent clipping.

In the results presented, the networks were generated by random geometric graph (RGG) models [38], and each node state was initialized with the same independent and identically distributed (i.i.d.) variance-$\sigma_x^2$, zero-mean Gaussian vector $\mathbf{x} \in \mathbb{R}^L$, corrupted by a different variance-$\sigma_n^2$, zero-mean Gaussian noise $\mathbf{n}_i \in \mathbb{R}^L$, $i \in \{1, \ldots, m\}$. Incorporating noise is important so that the nodes have different estimates of the signal $\mathbf{x}$ to average together. The consensus process averages the states elementwise, so this is the same as running $L$ trials of consensus on scalar states $z_i(0)$ at once. The random geometric graphs (RGGs) [38] were generated on the unit torus (i.e., edge effects were neglected by "wrapping" the edges of $[0,1]^2$) [39]. The RGGs provide a model of networks where location determines topology, such as WSNs [40]. An RGG is one for which each node $V_i \in V$ is associated with a coordinate $\mathbf{v}_i$. For a given connectivity radius $\rho_c$, two nodes $V_i, V_j$ are connected if $\|\mathbf{v}_i - \mathbf{v}_j\| \leq \rho_c$ [27].

To better understand the structure of the solutions to the optimization problems (20) and (21), we present some simulation results. Each of these results is taken from single instantiations of the optimization problem (i.e., they are not averaged over multiple trials). The optimal rate sequences, $\{R_i(t)\}_{t=0}^{T-1}$, $i \in \{1, \ldots, m\}$, for both the variable-distortion (20) and constant-distortion (21) problems typically exhibit monotonically nondecreasing structure, with an increasing rate of change toward the final iterations. In the constant-distortion case, the rates $R_i(t) \approx \frac{1}{2}\log_2\left(\max\left\{\frac{\nu_i(\mathbf{d},t)}{D_i(t)}, 2^{-2R_c}\right\}\right) + R_c$ are similar because the ratios $\frac{\nu_i(\mathbf{d},t)}{D_i(t)}$ in the Gaussian operational RD relationship (15) are similar across the network. Examples of optimal rate sequences are provided for both variants of the optimization problem (20) and (21) in Fig. 1.

The pattern of these rate sequences is intuitive, and it mirrors the results of Zhu and coauthors' study of multiprocessor approximate message passing [31, 32, 41]. As the estimate of the sample mean at each node increases in precision, higher-resolution messages must be exchanged among nodes to achieve increasing estimation quality. In the case of coding without side information, this improving precision requires using larger coding rates in the later iterations.

To compare our work to the prior art [13, 14], we generated 32 RGGs [38] with two different connectivity radii $\rho_c$ on a two-dimensional unit torus. For each of these networks, consensus was run on 1,000 realizations of the initial states, which were length-10,000 i.i.d. Gaussian vectors $\mathbf{z}_i(0) = \mathbf{x} + \mathbf{n}_i$, $\forall i \in \{1, \ldots, m\}$, $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\mathbf{n}_i \sim \mathcal{N}(\mathbf{0}, 0.5\,\mathbf{I})$. This corresponds to $\mathrm{SNR} := \frac{\sigma_x^2}{\sigma_n^2} = 2$, which is 3.01 dB. We simulated ProgQ [13] and order-one predictive coding [14], using a range of initial rates $R_i(0)$, $\forall i$, for each scheme.
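Before turning to the measured results, we note that the torus-wrapped RGG construction described above can be sketched as follows (a minimal NumPy illustration; the function name and default seed are our own choices):

```python
import numpy as np

def random_geometric_graph_torus(m, rho_c, rng=None):
    """Sample m node positions on the unit torus [0, 1]^2 and connect pairs
    within rho_c under the wrap-around (toroidal) distance."""
    rng = rng or np.random.default_rng(0)
    pos = rng.uniform(size=(m, 2))
    diff = np.abs(pos[:, None, :] - pos[None, :, :])
    diff = np.minimum(diff, 1.0 - diff)           # "wrap" the edges of [0, 1]^2
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    A = (dist <= rho_c) & ~np.eye(m, dtype=bool)  # adjacency matrix, no self-loops
    return A
```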
The measured final MSE values (13) for these schemes were set as the target values for the GGP. For all schemes, $\mathrm{MSE}(\mathbf{d}, T)$ and $R_{\mathrm{agg}}$ were computed. These values were averaged over all 32 realizations of each ($\rho_c$, $R_i(0)$, $T$) setting, and the resulting averages were plotted against each other. Looking at (11), it seems that the MSE for quantized consensus, where $D_i(t) \geq 0$, $\forall i, t$, is greater than in unquantized consensus, where $D_i(t) = 0$, $\forall i, t$.

We therefore introduce two terms to define the MSE performance relative to the ideal, unquantized algorithm. To compensate for the effect of network topology on the MSE, we define the lossless MSE,

$$\mathrm{MSE}_{\mathrm{lossless}}(t) := \mathrm{MSE}(\mathbf{d}, t)\big|_{\mathbf{d} = \mathbf{0}}.$$

Next, define the final excess MSE (EMSE) as

$$\mathrm{EMSE}(T) := 10\log_{10}\frac{\mathrm{MSE}(\mathbf{d}, T)}{\mathrm{MSE}_{\mathrm{lossless}}(T)},$$

which represents the increase in MSE over lossless consensus resulting from distortion. The EMSE was used for the generation of RD trade-off curves.

In the case of very low rates using ECSQ on zero-mean Gaussian sources, all elements decoded at a receiving node would be zeros. To prevent this behavior, the maximum normalized distortion $D_{\max}$ allowed was set such that the received elements were nonzero at least 1% of the time.

Because of stability issues with the GGP solvers used, Yildiz and Scaglione's order-one predictive coding implementation [14], which was provided by the authors, was modified to use fixed-rate uniform quantization but allow for the rate to vary with iteration and node indices. This capability was implemented by running two rate update recursions: one to keep track of the ideal (real-valued) rates given by the quantization noise variance recursion [14], and another to perform the predictive coding using rates that were rounded to the nearest integral value.

Fig. 2. Left: RD trade-off curves for the proposed GGP-optimized ECSQ versus ProgQ [13] and order-one predictive coding [14] (fixed $\rho_c$ and $T$); curves shown for ECSQ (simulated), ECSQ (predicted), ProgQ, and predictive coding. Right: comparison of the proposed GGP optimization (constant distortion) versus order-one predictive coding for a single realization of the RGG (fixed $\rho_c$ and $T$), where both schemes use RD-optimal VQ. The corresponding quantization error was simulated by adding white Gaussian noise. (Axes: $\mathrm{EMSE}(T)$ in dB versus $R_{\mathrm{agg}}$ in bits/symbol/sensor.)
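For completeness, the EMSE used in these curves can be computed by reusing the `state_evolution()` sketch given earlier, comparing a distortion schedule against the lossless baseline $\mathbf{d} = \mathbf{0}$ (a hypothetical helper of our own):

```python
import numpy as np

def emse_db(W, mu0, Sigma0, D):
    """EMSE(T) = 10 log10( MSE(d, T) / MSE_lossless(T) ), via state_evolution()."""
    _, mse = state_evolution(W, mu0, Sigma0, D)                  # quantized run
    _, mse0 = state_evolution(W, mu0, Sigma0, np.zeros_like(D))  # lossless, d = 0
    return 10 * np.log10(mse[-1] / mse0[-1])
```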
V. DISCUSSION

To adequately discuss the RD results, we first comment on some properties of each prior art scheme presented. The ProgQ algorithm [13] uses a time- and node-invariant fixed-rate uniform quantizer (i.e., $R_i(t) = R$, $\forall i \in \{1, \ldots, m\}$, $t \in \{0, \ldots, T-1\}$), whereas Yildiz and Scaglione [14] allow the use of different rates at each node and iteration. Thanou et al. [13] use the same state update as ours (3), but Yildiz and Scaglione [14] use a different update that is incapable of truly converging in the presence of quantization error. The final asymptotic MSE for the predictive scheme depends on the sum of the distortions, $D_i(t)$, $t \in \mathbb{Z}_{\geq 0}$. If these distortions are chosen to form a convergent series, then the MSE will converge to a nonzero, but bounded, value. Because of this limitation, the predictive scheme [14] is heavily dependent on the starting rates, $R_i(0)$.

In some cases, such as the bottom right of the ECSQ curve in Fig. 2, the predicted performance and measured performance of ECSQ do not match. Because the ECSQ used in the simulations is not dithered [34], the additive quantization model only holds approximately. As $R_{\mathrm{agg}}$ increases, the performance improves, and better adherence to predicted performance can be accomplished using dithering [34].

The RD performance of the proposed optimization scheme (21) for ECSQ is compared to the predictive coding scheme of Yildiz and Scaglione [14] and the ProgQ algorithm of Thanou et al. [13] in Fig. 2. For our scheme, the predicted RD performance (as computed by the state evolution equations (9)-(12)) is compared to the actual performance. The distortion (measured by the EMSE) is given by the horizontal axis, and the aggregate rate $R_{\mathrm{agg}}$ by the vertical axis. The predicted performance is denoted by a dashed line, while the simulated performance is represented as a solid line. Alongside our approaches, we plot the RD performance of both of the comparators. A curve closer to the bottom left corner of these figures indicates better performance, meaning lower aggregate rate $R_{\mathrm{agg}}$ to achieve the same EMSE, or lower EMSE for a particular $R_{\mathrm{agg}}$.

The numerical results in the left panel of Fig. 2 suggest that our GGP approach outperforms those of Yildiz and Scaglione and Thanou et al. However, a closer look reveals that much if not all of the gain is due to our using variable-rate coding, whereas the implementations of the comparators use fixed-rate coding. When we evaluated our approach with fixed-rate coding using a heuristic proposed in Pilgrim [18], our results were typically somewhat weaker than the comparators. We attribute the performance advantage of ProgQ [13] and predictive coding [14] to their use of side information from previous iterations.

For a fairer comparison to the predictive approach of Yildiz and Scaglione [14], we also simulated their approach against ours (21) for RD-optimal vector quantizers, which use variable coding rates. The simulations were run on a single instance of the RGG by varying the initial coding rate of the predictive scheme [14] and setting the resulting $\mathrm{EMSE}(T)$ as the target for the optimization (21). Because the quantization error of infinite-dimensional lattice quantizers approaches an additive white Gaussian noise process [35], these experiments simulated quantization by adding independent noise of variance $D(t)$ to the states. This plot demonstrates the advantage of predictive quantization on an even playing field, by allowing both our approach and the predictive scheme [14] to use variable-rate coding.

The ProgQ scheme [13], unlike predictive coding [14], is capable of converging in the limit due to its different state update strategy. Both ProgQ [13] and predictive coding [14] are capable of using constant or even shrinking coding rates to achieve good performance. It is clear from Fig. 1 that our rates grow with $t$. Therefore, as $T \to \infty$, we expect that ProgQ will outdo our proposed schemes in all settings, despite our constrained optimization, because it can converge in the limit of large $T$ with constant rates.
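Regarding the dithering remark above, a subtractively dithered uniform quantizer (the standard construction surveyed in [34]; a sketch, not the paper's code) makes the additive model exact:

```python
import numpy as np

def dithered_uniform_quantize(z, delta, rng):
    """Subtractively dithered uniform quantizer: the error is uniform on
    [-delta/2, delta/2] and independent of z, so the additive-noise model
    behind (11) holds exactly rather than approximately."""
    u = rng.uniform(-delta / 2, delta / 2, size=z.shape)  # dither shared with decoder
    q = delta * np.round((z + u) / delta) - u             # quantize, then subtract
    return q                                              # E[(q - z)^2] = delta**2 / 12
```

With the dither known at the decoder, the effective error $q - z$ has variance $\Delta^2/12$ regardless of the input distribution, which is exactly the independence assumption used in the state evolution equations.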
VI. CONCLUSION

In conclusion, this paper presented a framework for optimizing the source coding performance of distributed average consensus. The key insight of our approach is the formulation of the problem as a GGP [10]. Our framework allows the problem to be transformed to a convex program [9] and solved for the global optimum. Although we do not incorporate knowledge from past iterations, our numerical results are competitive with prior art that uses more sophisticated side information strategies, which motivates the study of optimization for predictive coding schemes.

In light of the performance gain from predictive coding strategies, we feel that future work should focus on variable-rate strategies. Moreover, we aim to optimize the predictive approach of Yildiz and Scaglione using our GGP formulation and the state update (5).

ACKNOWLEDGMENTS
Thanks to Yanting Ma for her inputs on extending the GGP model to variable distortion, to Mehmet Ercan Yildiz and Anna Scaglione for graciously providing their code for comparison, and to Yaoqing Yang and Pulkit Grover for discussing their work with Soummya Kar.

REFERENCES
[1] N. Noorshams and M. J. Wainwright, "Non-asymptotic analysis of an optimal algorithm for network-constrained averaging with noisy links," IEEE J. Sel. Topics Signal Process., vol. 5, no. 4, pp. 833-844, Aug. 2011.
[2] R. Olfati-Saber, J. A. Fax, and R. M. Murray, "Consensus and cooperation in networked multi-agent systems," Proc. IEEE, vol. 95, no. 1, pp. 215-233, Jan. 2007.
[3] A. G. Dimakis, S. Kar, J. M. F. Moura, M. G. Rabbat, and A. Scaglione, "Gossip algorithms for distributed signal processing," Proc. IEEE, vol. 98, no. 11, pp. 1847-1864, Nov. 2010.
[4] M. Nokleby, W. U. Bajwa, R. Calderbank, and B. Aazhang, "Toward resource-optimal consensus over the wireless medium," IEEE J. Sel. Topics Signal Process., vol. 7, no. 2, pp. 284-295, Apr. 2013.
[5] H. Raja and W. U. Bajwa, "Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data," IEEE Trans. Signal Process., vol. 64, no. 1, pp. 173-188, Jan. 2016.
[6] A. Scaglione, R. Pagliari, and H. Krim, "The decentralized estimation of the sample covariance," in Proc. IEEE 42nd Asilomar Conf. Signals, Syst., Comput., Oct. 2008, pp. 1722-1726.
[7] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210-227, Feb. 2009.
[8] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Commun. ACM, vol. 43, no. 5, pp. 51-58, May 2000.
[9] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge, UK: Cambridge University Press, 2004.
[10] S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi, "A tutorial on geometric programming," Optimization Eng., vol. 8, no. 1, pp. 67-127, Mar. 2007.
[11] P. Frasca, R. Carli, F. Fagnani, and S. Zampieri, "Average consensus on networks with quantized communication," Int. J. Robust Nonlinear Control, vol. 19, no. 16, pp. 1787-1816, Nov. 2008.
[12] R. Carli, F. Bullo, and S. Zampieri, "Quantized average consensus via dynamic coding/decoding schemes," Int. J. Robust Nonlinear Control, vol. 20, no. 2, pp. 156-175, May 2009.
[13] D. Thanou, E. Kokiopoulou, Y. Pu, and P. Frossard, "Distributed average consensus with quantization refinement," IEEE Trans. Signal Process., vol. 61, no. 1, pp. 194-205, Jan. 2013.
[14] M. E. Yildiz and A. Scaglione, "Coding with side information for rate-constrained consensus," IEEE Trans. Signal Process., vol. 56, no. 8, pp. 3753-3764, Aug. 2008.
[15] M. E. Yildiz and A. Scaglione, "Limiting rate behavior and rate allocation strategies for average consensus problems with bounded convergence," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2008, pp. 2717-2720.
[16] R. Rajagopal and M. J. Wainwright, "Network-based consensus averaging with general noisy channels," IEEE Trans. Signal Process., vol. 59, no. 1, pp. 373-385, Jan. 2011.
[17] Y. Huang and Y. Hua, "On energy for progressive and consensus estimation in multihop sensor networks," IEEE Trans. Signal Process., vol. 59, no. 8, pp. 3863-3875, Aug. 2011.
[18] R. Z. Pilgrim, "Coding rate optimization for distributed average consensus," M.S. thesis, NC State University, Raleigh, NC, July 2017. Available: http://arxiv.org/abs/1710.01816.
[19] L. Xiao, S. Boyd, and S.-J. Kim, "Distributed average consensus with least-mean-square deviation," J. Parallel Distributed Comput., vol. 67, no. 1, pp. 33-46, Jan. 2007.
[20] T. Li, M. Fu, L. Xie, and J.-F. Zhang, "Distributed consensus with limited communication data rate," IEEE Trans. Autom. Control, vol. 56, no. 2, pp. 279-292, Feb. 2011.
[21] F. F. C. Rego, Y. Pu, A. Alessandretti, A. P. Aguiar, and C. N. Jones, "A consensus algorithm for networks with process noise and quantization error," in Proc. 53rd Allerton Conf. Commun., Control, and Comput., Oct. 2015, pp. 488-495.
[22] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Kluwer, 1993.
[23] O. Ayaso, D. Shah, and M. A. Dahleh, "Information theoretic bounds for distributed computation over networks of point-to-point channels," IEEE Trans. Inf. Theory, vol. 56, no. 12, pp. 6020-6039, Dec. 2010.
[24] A. Xu and M. Raginsky, "A new information-theoretic lower bound for distributed function computation," in Proc. IEEE Int. Symp. Inform. Theory (ISIT), June 2014, pp. 2227-2231.
[25] H.-I. Su and A. El Gamal, "Distributed lossy averaging," IEEE Trans. Inf. Theory, vol. 56, no. 7, pp. 3422-3437, July 2010.
[26] Y. Yang, P. Grover, and S. Kar, "Rate distortion for lossy in-network linear function computation and consensus: Distortion accumulation and sequential reverse water-filling," IEEE Trans. Inf. Theory, vol. 63, no. 8, pp. 5179-5206, Aug. 2017.
[27] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, "Mixing times for random walks on geometric random graphs," in Meeting Algorithm Eng. Experiments/Analytic Algorithmics and Combinatorics, Jan. 2005, pp. 240-249.
[28] T. M. Cover and J. A. Thomas, Elements of Information Theory, New York, NY: Wiley-Interscience, 1991.
[29] A. Brandstädt, T. Nishizeki, K. Thulasiraman, and S. Arumugam, Eds., Handbook of Graph Theory, Combinatorial Optimization, and Algorithms, vol. 34 of Chapman and Hall/CRC Computer and Information Science Series, Chapman and Hall/CRC, 2016.
[30] L. Xiao and S. Boyd, "Fast linear iterations for distributed averaging," Syst. Control Lett., vol. 53, no. 1, pp. 65-78, Sept. 2004.
[31] J. Zhu, D. Baron, and A. Beirami, "Optimal trade-offs in multi-processor approximate message passing," arXiv preprint arXiv:1601.03790, Nov. 2016.
[32] J. Zhu, R. Pilgrim, and D. Baron, "An overview of multi-processor approximate message passing," in Proc. 51st IEEE Conf. Inform. Sci. Syst., Mar. 2017.
[33] H. Gish and J. Pierce, "Asymptotically efficient quantizing," IEEE Trans. Inf. Theory, vol. 14, no. 5, pp. 676-683, Sept. 1968.
[34] S. P. Lipshitz, R. A. Wannamaker, and J. Vanderkooy, "Quantization and dither: A theoretical survey," J. Audio Eng. Soc., vol. 40, no. 5, pp. 355-375, May 1992.
[35] R. Zamir and M. Feder, "On lattice quantization noise," IEEE Trans. Inf. Theory, vol. 42, no. 4, pp. 1152-1159, July 1996.
[36] M. Grant and S. Boyd, "CVX: MATLAB software for disciplined convex programming, version 2.1," http://cvxr.com/cvx, Mar. 2014.
[37] M. Grant and S. Boyd, "Graph implementations for nonsmooth convex programs," in Recent Advances in Learning and Control, V. Blondel, S. Boyd, and H. Kimura, Eds., Lecture Notes in Control and Information Sciences, pp. 95-110, Springer-Verlag Limited, 2008.
[38] M. Penrose, Random Geometric Graphs, Oxford University Press, 2003.
[39] E. W. Weisstein, "Square torus," 2017. From MathWorld, a Wolfram Web Resource.
[40] P. Gupta and P. R. Kumar, "The capacity of wireless networks," IEEE Trans. Inf. Theory, vol. 46, no. 2, pp. 388-404, 2000.
[41] J. Zhu, A. Beirami, and D. Baron, "Performance trade-offs in multi-processor approximate message passing," in