Optimizing QoS for Erasure-Coded Wireless Data Centers
Srujan Teja Thomdapu, Ketan Rajawat
Department of Electrical Engineering, Indian Institute of Technology Kanpur, India 208016
Email: {srujant, ketan}@iitk.ac.in

Abstract—Cloud computing facilitates access to applications and data from any location through a distributed storage system. Erasure codes offer a data replication technique with reduced storage costs and improved reliability. This paper considers an erasure-coded data center with multiple servers in a wireless network, where each server is equipped with a base-station. Latency in the file retrieval process is caused mainly by queuing delays at each server. This work puts forth a stochastic optimization framework for obtaining the optimal scheduling policy that maximizes users' quality of service (QoS) while adhering to the latency requirements. We further show that the problem contains non-linear functions of expectations in its objective and constraints, and therefore cannot be solved with traditional SGD-like algorithms. We propose a new algorithm that addresses the compositional structure of the problem, and show that it achieves a faster convergence rate than the best-known results. Finally, we test the efficacy of the proposed method in a simulated environment.
I. INTRODUCTION
The increasing requirements of cloud computing services are making companies look at distributed storage systems, where efficient data replication techniques are implemented for improved reliability. Data redundancy provides more alternatives to clients in case of node failures. Most cloud-based companies, such as Facebook [1], Microsoft [2], and Google [3], have found erasure coding to be the most prominent solution for reducing storage cost compared to other techniques [4, 5]. Despite being a promising idea, erasure-coded storage systems face a troublesome issue: the delay users experience while downloading files from the data center. Relatively few works have carried out a quantitative analysis of the resulting queuing delays. Most existing works [6, 7] have instead focused on designing efficient data systems without analyzing the queues that appear at each data server. Hence, many researchers have recently turned their attention towards latency analysis, making it an active area [8]. There have been many recent attempts at obtaining latency bounds in erasure-coded storage systems by proposing various scheduling policies, including 'block-one-scheduling' [9], the 'fork-join queue' [10], and 'probabilistic scheduling' [11]. It has been shown in [11] that probabilistic scheduling provides an upper bound on the average latency of erasure-coded storage systems for arbitrary erasure codes with M/G/1 queues at the data servers; the policy entertains scheduling file requests to all possible servers. The analysis of erasure-coded storage systems has been extended to the video streaming case [12], where an optimized service that maximizes the users' quality of experience (QoE) has been proposed.
A more precise latency analysis is pursued in [12] by assuming an exponential service time distribution. In this paper, we consider distributed erasure-coded storage systems in wireless networks where each server is equipped with a multi-antenna base-station capable of wireless transmissions. Specifically, we formulate a stochastic optimization problem to find an optimal scheduling policy that maximizes users' quality of service (QoS) while adhering to queuing delay and other deterministic constraints. Since the file transfer medium is wireless, we can no longer assume exponential service time distributions, owing to the random fading channels between the users and the data center. The classical approach is also not suited here, due to the difficulty of evaluating closed-form expressions for the first- and second-order moments of the service times in the presence of exogenous variables [13]. Recently, the authors in [14] considered the design of queuing systems with general service time distributions from a stochastic optimization perspective. Applying the ideas of [14], we show that at least one of the objective and constraint functions of the formulated problem is a non-linear function of expectations. Hence, the problem is not solvable with existing SGD-like first-order methods, as they require unbiased estimates of (sub)gradients (see [15]). Recent work in [16] presented a first-order method that handles non-linear functions of expectations via the stochastic compositional gradient descent (SCGD) algorithm. Since finding true (sub)gradients is not possible due to the compositional structure, the SCGD algorithm adopts a quasi-gradient approach that estimates approximate (sub)gradients. An accelerated version of SCGD was later proposed in [17].
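The obstacle created by non-linear functions of expectations can be illustrated numerically: for a compositional objective $F(x) = f(\mathbb{E}[g(x,\xi)])$ with $f(u) = u^2$ and $g(x,\xi) = \xi x$, the per-sample gradient used by plain SGD is biased whenever $\xi$ has non-zero variance. A minimal sketch (the distribution and all numbers are illustrative only):

```python
import numpy as np

# Compositional objective F(x) = f(E[g(x, xi)]) with f(u) = u^2 and
# g(x, xi) = xi * x, so F(x) = (E[xi] x)^2 and F'(x) = 2 (E[xi])^2 x.
# Plain SGD instead averages d/dx f(g(x, xi)) = 2 xi^2 x, which converges
# to 2 E[xi^2] x -- a biased estimate whenever Var(xi) > 0.
rng = np.random.default_rng(2)
xi = rng.normal(loc=1.0, scale=2.0, size=1_000_000)  # E[xi]=1, E[xi^2]=5
x = 3.0

true_grad = 2.0 * np.mean(xi) ** 2 * x       # estimates F'(x) = 6
naive_grad = np.mean(2.0 * xi**2 * x)        # estimates 2 E[xi^2] x = 30
gap = naive_grad - true_grad                 # bias that does not vanish
```

The gap does not shrink with more samples, which is why SCGD-type methods track the inner expectation with an auxiliary variable instead of plugging raw samples into $\nabla f$.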
The problem considered in [16, 17] is, however, unconstrained. Recently, the CSCGD algorithm was proposed in [14] to solve constrained stochastic compositional problems. Exploiting ideas from [14, 17], we propose a new algorithm that solves constrained stochastic compositional problems with a faster convergence rate.

The rest of the paper is organized as follows. Sec. II details the system model and problem formulation. The proposed algorithm and its theoretical guarantees are provided in Sec. III. We evaluate the performance of the proposed method in Sec. IV, and conclude the paper in Sec. V.

Fig. 1: System model. D = data server, u = mobile user. (a) Data center. (b) Entities in a data server.

II. SYSTEM MODEL AND PROBLEM FORMULATION
We consider a data center, as shown in Fig. 1, with $M$ servers, denoted by the set $\mathcal{M}$, placed in an area populated by users with mobile devices. Each server contains a multi-antenna base-station that serves users through wireless transmissions, and is also capable of storing data (see Fig. 1b). We do not consider any interference management schemes; instead, we assume that all concurrent data transmissions by the servers occur on orthogonal channels. There are $N$ trending files of current interest, picked out of millions, that may be delivered to the users. These selected files are the most popular ones and have therefore earned placement at the servers. The data objects are denoted $Z_i$ for $i \in \{1, \ldots, N\}$. Users can request any of these popular $N$ files from the data center.

A. Coding
Each file $Z_i$ is divided into $k_i$ fixed-size chunks and then encoded using an $(n_i, k_i)$ Maximum Distance Separable (MDS) erasure code to generate $n_i$ distinct coded chunks, denoted $C_i^{(1)}, \ldots, C_i^{(n_i)}$. The encoded chunks are stored on $n_i$ distinct servers for delivery. The servers that store the chunks corresponding to file $i$ are denoted by a set $\mathcal{S}_i$ such that $\mathcal{S}_i \subset \mathcal{M}$ and $|\mathcal{S}_i| = n_i$. An $(n_i, k_i)$ MDS erasure code has $n_i > k_i$, with redundancy factor $n_i/k_i$, and any subset of $k_i$-out-of-$n_i$ coded chunks can reconstruct the original file. For example, simple replication of a file at $n$ servers is nothing but an $(n, 1)$ erasure code. We assume a centralized system that knows the file placement information and schedules all the user requests to the different servers. Hence, when file $i$ is requested, the request goes to a set $\mathcal{A}_i$ of storage nodes, where $\mathcal{A}_i \subset \mathcal{S}_i$ and $|\mathcal{A}_i| = k_i$. Each server maintains a FIFO queue to serve the users, as shown in Fig. 1b; when file $i$ is requested, its chunk requests wait behind the chunk requests of other files that have not yet been served.

B. Policy
The central system distributes the users' requests based on the file availabilities. To reduce latency, it has to pick the optimal choice out of the many options for scheduling a request of file $i$ to $k_i$ servers. We use the probabilistic scheduling policy proposed in [11], which allows the choice of every possible subset of $k_i$ nodes with a certain probability. Once a request for file $i$ arrives, the central system randomly distributes the $k_i$ chunk requests to a set of nodes $\mathcal{A}_i$ with predetermined probabilities $P(\mathcal{A}_i)$. Each server then maintains a local queue containing all the chunk requests to be transmitted by its base-station. According to the probabilistic scheduling policy, feasible probabilities $P(\mathcal{A}_i)$ exist when the following conditions are satisfied:
$$\sum_{j=1}^{M} \pi_{ij} = k_i \;\; \forall i, \qquad \pi_{ij} = 0 \;\text{ if } j \notin \mathcal{S}_i, \tag{1}$$
where $\pi_{ij}$ is the conditional probability of selecting node $j$ for request $i$. In the simple case where only a single file is placed in the system, the servers with which more users have good channels should ideally be scheduled; in other words, the probabilities of choosing such servers should be high. The current setting, however, is much more complicated: although the placed files are popular, the demographic preferences are unknown at this stage. For example, users residing at various geographical locations may have different priorities over these files. Hence, to reduce the latency, we incorporate all of these scenarios while formulating the problem.

C. Queuing Model
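Before turning to the queues, note that the feasibility conditions (1) above translate directly into code. The sketch below checks a hypothetical set of marginals $\pi_{ij}$ and draws one request assignment $\mathcal{A}_i$; the naive weighted draw is only an illustration, not the exact sampling scheme of [11]:

```python
import numpy as np

def check_policy(pi, k, S):
    """Verify the probabilistic-scheduling conditions (1): for each file i,
    the marginals pi[i, j] sum to k_i over servers, vanish outside the
    placement set S[i], and lie in [0, 1]."""
    N, M = pi.shape
    for i in range(N):
        assert np.isclose(pi[i].sum(), k[i])        # sum_j pi_ij = k_i
        assert all(pi[i, j] == 0 for j in range(M) if j not in S[i])
        assert np.all((pi[i] >= 0) & (pi[i] <= 1))
    return True

def sample_servers(pi_i, k_i, rng):
    """Draw a set A_i of k_i distinct servers (naive weighted draw,
    for illustration only)."""
    p = pi_i / pi_i.sum()
    return rng.choice(len(pi_i), size=k_i, replace=False, p=p)

rng = np.random.default_rng(0)
S = {0: {0, 1, 2, 3}}                               # file 0 placed on S_0
pi = np.array([[0.75, 0.75, 0.25, 0.25, 0.0]])      # marginals, M = 5
check_policy(pi, k=[2], S=S)                        # k_0 = 2
A0 = sample_servers(pi[0], k_i=2, rng=rng)          # one realization of A_0
```

Servers outside $\mathcal{S}_0$ have zero marginal probability and are never drawn.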
We assume that requests for file $i$ follow a Poisson process with a known rate $\lambda_i$; the rates $\lambda_i$ can be thought of as file popularities. The arrivals of chunk requests at node $j$ then follow a Poisson process with rate $\Lambda_j = \sum_i \lambda_i \pi_{ij}$. The chunk service time distribution is unknown and is subject to the random behavior of the fading channels, thus making the queuing models M/G/1. Let $\xi_{ij}$ represent the channel constant corresponding to the user who requested file $i$, and let $p_j$ be the power allocated to the transmissions from node $j$. If the length of a coded chunk is $L$, then the random service time $X_j$ of the packets that leave node $j$ can be written as
$$X_j = \frac{L}{b_j(p_j, \xi_{ij})} \quad \text{with probability } \frac{\pi_{ij}\lambda_i}{\Lambda_j} \;\; \forall i,$$
where $b_j(p_j, \xi_{ij}) = B_j \log(1 + p_j \xi_{ij})$ is Shannon's capacity and $B_j$ is the bandwidth of the channel. The first-order moment of the service time is
$$\mathbb{E}[X_j] = \mathbb{E}[\mathbb{E}[X_j \mid \xi]] = L\, \mathbb{E}_\xi\!\left[ \sum_{i=1}^{N} \frac{\pi_{ij}\lambda_i}{\Lambda_j} \frac{1}{b_j(p_j, \xi_{ij})} \right] = L \sum_{i=1}^{N} \frac{\pi_{ij}\lambda_i}{\Lambda_j} \mathbb{E}_\xi\!\left[ \frac{1}{b_j(p_j, \xi_{ij})} \right],$$
where $\mathbb{E}_\xi[\cdot]$ denotes expectation with respect to the only random variable, $\xi$. Similarly, the second-order moment is
$$\mathbb{E}[X_j^2] = L^2 \sum_{i=1}^{N} \frac{\pi_{ij}\lambda_i}{\Lambda_j} \mathbb{E}_\xi\!\left[ \frac{1}{b_j^2(p_j, \xi_{ij})} \right].$$
The average queuing delay can now be calculated using the Pollaczek–Khinchin (P-K) formula as
$$W_j(p_j, \pi_{ij}) = \frac{\Lambda_j\, \mathbb{E}[X_j^2]}{2\left(1 - \Lambda_j\, \mathbb{E}[X_j]\right)} = \frac{L^2 \sum_{i=1}^{N} \pi_{ij}\lambda_i\, \mathbb{E}_\xi\!\left[ 1/b_j^2(p_j, \xi_{ij}) \right]}{2\left(1 - L \sum_{i=1}^{N} \pi_{ij}\lambda_i\, \mathbb{E}_\xi\!\left[ 1/b_j(p_j, \xi_{ij}) \right]\right)}. \tag{2}$$
The complicated-looking expression in (2) prevents us from finding a closed form for any known distribution of the channel constants $\{\xi_{ij}\}$.

D. Problem Formulation
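Although (2) above admits no closed form, both moments are easy to estimate by Monte Carlo over the channel constants, which is what an online method can exploit. A sketch, assuming Rayleigh-type fading with a small minimum channel gain added to keep the moments finite under deep fades; every numerical value is an arbitrary placeholder:

```python
import numpy as np

# Monte-Carlo estimate of the P-K delay (2) at one server j.
# Channel gains xi are modeled as 0.1 + Exponential(1); the offset is an
# assumption that keeps E[1/b_j] and E[1/b_j^2] finite for deep fades.
rng = np.random.default_rng(1)
L, B, p = 1e3, 1e6, 2.0                  # chunk size (bits), bandwidth, power
lam = np.array([0.5, 0.3])               # file request rates lambda_i
pi_j = np.array([0.6, 0.4])              # scheduling marginals pi_ij
Lam = np.sum(lam * pi_j)                 # chunk arrival rate Lambda_j

xi = 0.1 + rng.exponential(1.0, size=(len(lam), 100_000))
b = B * np.log(1.0 + p * xi)             # Shannon rate b_j(p_j, xi_ij)
w = (lam * pi_j / Lam)[:, None]          # mixture weights pi_ij lam_i / Lam_j
EX = L * np.sum(w / b) / xi.shape[1]     # estimate of E[X_j]
EX2 = L**2 * np.sum(w / b**2) / xi.shape[1]   # estimate of E[X_j^2]

rho = Lam * EX                           # utilization; must stay below 1
W = Lam * EX2 / (2.0 * (1.0 - rho))      # P-K mean queuing delay (seconds)
```

The estimate is only valid while the utilization $\Lambda_j \mathbb{E}[X_j]$ stays below one, i.e., while the queue is stable.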
Assuming the file placement information at the servers is known, we formulate the problem of obtaining an optimal policy that enhances the QoS of the users via the utility function
$$U(\mathbf{p}) = \sum_{j=1}^{M} \psi\!\left( \mathbb{E}\!\left[ \sum_i b_j(p_j, \xi_{ij}) \right] \right), \tag{3}$$
where $\psi(\cdot)$ is a concave utility function, such as a linear or logarithmic utility. A simple observation tells us that the function in (3) is concave. By imposing constraints on the queuing delay, the problem can finally be written as
$$\max_{\mathbf{p}, \Pi} \;\; U(\mathbf{p}) \tag{4a}$$
$$\text{s.t.} \;\; W_j(p_j, \pi_{ij}) \le D_j \;\; \forall j, \tag{4b}$$
$$\sum_{j=1}^{M} p_j \le P, \;\; (1). \tag{4c}$$
The objective in (4a) and the constraint functions in (4b) are stochastic in nature, while the constraints in (4c) are deterministic and simple to project onto. The collection of all the random variables $\{\xi_{ij}\}$ is denoted $\xi$. The goal is to solve the above problem in an online fashion using independent realizations $\xi_1, \xi_2, \ldots$ that are revealed sequentially. The constraints in (4b) are non-linear functions of expectations; hence, existing first-order methods do not apply, as they require unbiased gradient estimates of the objective and constraint functions. The constraints in (4b) are convex (see [14, Appendix C]), making the whole problem in (4) convex. To the best of our knowledge, optimizing QoS in erasure-coded wireless data centers has not been considered in the existing literature. Algorithmic details for solving (4), and the corresponding theoretical convergence guarantees, are provided in the subsequent section.

III. ACCELERATED STOCHASTIC COMPOSITIONAL GRADIENT DESCENT FOR CONSTRAINED PROBLEMS
Consider the more general constrained stochastic optimization problem
$$\mathbf{x}^\star = \arg\min_{\mathbf{x} \in \mathcal{X}} \; f(\mathbb{E}[g(\mathbf{x}, \xi)]) + R(\mathbf{x}) \quad \text{s.t.} \quad q(\mathbb{E}[h(\mathbf{x}, \xi)]) \le 0, \tag{P}$$
where the expectation is taken with respect to $\xi$. Here, $f: \mathbb{R}^m \to \mathbb{R}$, $g: \mathbb{R}^n \times \mathbb{R}^k \to \mathbb{R}^m$, $h: \mathbb{R}^n \times \mathbb{R}^k \to \mathbb{R}^d$, and $q: \mathbb{R}^d \to \mathbb{R}^J$ are continuous functions. The penalty function $R(\mathbf{x}): \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is an extended real-valued closed convex function that is allowed to be non-smooth. The problem can have simple deterministic constraints, as in (4c), which can be absorbed into $R(\mathbf{x})$. It can easily be verified that the problem formulated in (4) is a special case of (P). Since the distribution of $\xi$ is unknown, the expectations appearing in (P) cannot be evaluated in closed form. Motivated by classical stochastic approximation methods, the goal is to solve (P) in an online fashion using only independent realizations $\xi_1, \xi_2, \ldots$ that are revealed sequentially. This section details the proposed algorithm for solving (P) and provides the corresponding convergence rates. For brevity, we define $F(\mathbf{x}) := f(\mathbb{E}[g(\mathbf{x}, \xi)])$ and $Q(\mathbf{x}) := q(\mathbb{E}[h(\mathbf{x}, \xi)])$.

A. Assumptions
We begin by discussing the necessary assumptions on the functions $f$, $g$, $q$, and $h$. All of $f$, $q$, $g$, $h$ are continuously differentiable; consequently, the gradients of the objective and constraint functions are well-defined. The problem (P) is a convex optimization problem, and the set $\mathcal{X}$ is closed and compact, i.e., $\sup_{\mathbf{x}, \mathbf{x}' \in \mathcal{X}} \|\mathbf{x} - \mathbf{x}'\| \le D_x < \infty$. The random variables $\xi_1, \xi_2, \ldots$ are independent and identically distributed. The functions $g$, $h$ are Lipschitz continuous in expectation and have bounded second-order moments. The functions $f$, $q$ are smooth and have bounded gradients. The functions $F$, $Q$, and the inner functions $g$, $h$ are smooth.

B. Proposed Algorithm
Algorithm 1 Accelerated Constrained Stochastic Compositional Proximal Gradient (ACSCPG)

Input: $\mathbf{x}_1 \in \mathbb{R}^n$, step sizes $\{\alpha_t\}, \{\beta_t\}, \{\delta_t\} \subset (0, 1)$. Initialize $\mathbf{y}_0 = \mathbf{z}_0 = 0$, $\mathbf{w}_1 = \mathbf{x}_1$.
for $t = 1, 2, \ldots$
  Observe the random variable $\xi_t$, and update
  $$\mathbf{x}_{t+1} = \operatorname{prox}_{\alpha_t R(\cdot)}\left\{ \mathbf{x}_t - \alpha_t \nabla g(\mathbf{x}_t, \xi_t) \nabla f(\mathbf{y}_t) - \delta_t \nabla h(\mathbf{x}_t, \xi_t) \nabla q(\mathbf{z}_t) \nabla \ell(q(\mathbf{z}_t)) \right\} \tag{5}$$
  Observe the random variable $\xi_{t+1}$, and update the auxiliary iterates as
  $$\mathbf{w}_{t+1} = \left(1 - \tfrac{1}{\beta_t}\right) \mathbf{x}_t + \tfrac{1}{\beta_t} \mathbf{x}_{t+1}, \tag{6}$$
  $$\mathbf{y}_{t+1} = (1 - \beta_t)\mathbf{y}_t + \beta_t\, g(\mathbf{w}_{t+1}, \xi_{t+1}), \qquad \mathbf{z}_{t+1} = (1 - \beta_t)\mathbf{z}_t + \beta_t\, h(\mathbf{w}_{t+1}, \xi_{t+1}) \tag{7}$$
end
Output: $\hat{\mathbf{x}} = \frac{2}{T} \sum_{t=T/2}^{T} \mathbf{x}_{t+1}$.

Similar to the analysis in [14], we define a smooth and convex function $\ell(\mathbf{w}) = \sum_{j=1}^{J} \ell_j(w_j)$, where $\ell_j$ is defined as
$$\ell_j(x) := \begin{cases} x^2 & 0 \le x \le C_\ell, \\ 2C_\ell x - C_\ell^2 & x > C_\ell, \\ 0 & x < 0. \end{cases} \tag{8}$$
The gradient of this function is bounded: $\nabla \ell_j(x) = 2\min\{x, C_\ell\}$ for $x > 0$ and zero otherwise. The parameter $C_\ell$ in our case is an upper bound on $\max_j q_j(\mathbb{E}[h(\mathbf{x}, \xi)])$ over $\mathbf{x} \in \mathcal{X}$. The penalty function defined in (8) helps move the iterate along a descent direction of the objective $F$ as well as towards the feasible region $\{\mathbf{x}: Q(\mathbf{x}) \le 0\}$. As proposed in [14], the CSCGD algorithm carries out updates in the negative direction of the approximated gradients of both the objective $F(\mathbf{x})$ and the penalty $\ell(Q(\mathbf{x}))$. In the present case, the iterates are similar, except for the steps that track $\mathbb{E}[g(\mathbf{x}^\star, \xi)]$ and $\mathbb{E}[h(\mathbf{x}^\star, \xi)]$. We introduce a new step in (6) that tracks a running average of the iterates and is used in the auxiliary variable updates in (7); this extrapolation-smoothing scheme is the main reason behind the acceleration of convergence. The complete procedure is summarized in Algorithm 1. Compared to CSCGD, ACSCPG estimates the unknown quantities $\mathbb{E}[g(\mathbf{x}, \xi)]$, $\mathbb{E}[h(\mathbf{x}, \xi)]$ at a faster rate.
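A toy run helps make updates (5)-(7) concrete. The sketch below applies them to an unconstrained scalar instance, $f(u) = u^2$, $g(x, \xi) = x + \xi$ with $\mathbb{E}[\xi] = 0$ (so $F(x) = x^2$), with $R \equiv 0$ and the constraint terms of (5) switched off; the step-size schedules are arbitrary placeholders rather than the tuned choices of the analysis:

```python
import numpy as np

# ACSCPG-style updates (5)-(7) on a toy compositional problem:
# minimize f(E[g(x, xi)]) with f(u) = u^2, g(x, xi) = x + xi, E[xi] = 0,
# i.e. F(x) = x^2 with minimizer 0. R = 0 (the prox is the identity) and
# the constraint/penalty terms of (5) are dropped. Step sizes illustrative.
rng = np.random.default_rng(3)
x, y = 3.0, 0.0                  # main iterate and tracking variable y_t
history = []
for t in range(1, 5001):
    alpha, beta = 0.5 * t ** -0.75, t ** -0.5
    # (5): quasi-gradient step; grad g = 1, grad f(y) = 2y
    x_new = x - alpha * 1.0 * (2.0 * y)
    # (6): extrapolation-smoothing auxiliary point w_{t+1}
    w = (1.0 - 1.0 / beta) * x + (1.0 / beta) * x_new
    # (7): track E[g(., xi)] at w using a fresh sample
    y = (1.0 - beta) * y + beta * (w + rng.normal())
    x = x_new
    history.append(x)
xhat = np.mean(history[len(history) // 2:])   # averaged output x-hat
```

Despite the noisy samples, the averaged iterate approaches the minimizer at 0; replacing $\mathbf{y}_t$ with the raw sample $g(\mathbf{x}_t, \xi_t)$ would reintroduce the bias discussed in Sec. I.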
The updates in (6), (7) are carried out in such a way that $\mathbf{y}_t$, $\mathbf{z}_t$ are approximately unbiased estimates of $\mathbb{E}[g(\mathbf{x}_t, \xi)]$, $\mathbb{E}[h(\mathbf{x}_t, \xi)]$. To see this explicitly, define the weights
$$\zeta_k^{(t)} = \begin{cases} \beta_k \prod_{i=k+1}^{t} (1 - \beta_i) & t > k \ge 0, \\ \beta_t & t = k \ge 0; \end{cases} \tag{9}$$
then we have the relations
$$\mathbf{x}_{t+1} = \sum_{k=0}^{t} \zeta_k^{(t)} \mathbf{w}_{k+1}, \qquad \mathbf{y}_{t+1} = \sum_{k=0}^{t} \zeta_k^{(t)} g(\mathbf{w}_{k+1}, \xi_{k+1}), \qquad \mathbf{z}_{t+1} = \sum_{k=0}^{t} \zeta_k^{(t)} h(\mathbf{w}_{k+1}, \xi_{k+1}).$$
In other words, $\mathbf{x}_{t+1}$ is a weighted average of $\{\mathbf{w}_k\}_{k=1}^{t+1}$, and $\mathbf{y}_{t+1}$, $\mathbf{z}_{t+1}$ are weighted averages of $\{g(\mathbf{w}_{k+1}, \xi_{k+1})\}_k$ and $\{h(\mathbf{w}_{k+1}, \xi_{k+1})\}_k$ with the same weights. Hence, as $t$ progresses, the estimates $\mathbf{y}_t$, $\mathbf{z}_t$ come ever closer to unbiased estimates of the inner expectations.

C. Performance Analysis
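As a quick sanity check before the analysis, the weight identity behind the tracking variables, $\zeta_k^{(t)} = \beta_k \prod_{i=k+1}^t (1-\beta_i)$ from (9) above, can be verified numerically against the recursion (7) (the $\beta_t$ schedule and samples are arbitrary):

```python
import numpy as np

# The recursive average y_{t+1} = (1 - beta_t) y_t + beta_t g_t from (7)
# unrolls into the explicit weighted sum sum_k zeta_k^{(t)} g_k with the
# weights zeta_k^{(t)} = beta_k * prod_{i=k+1}^{t} (1 - beta_i) of (9).
rng = np.random.default_rng(4)
T = 50
beta = np.arange(1, T + 1) ** -0.5          # beta_t in (0, 1]; beta_1 = 1
g = rng.normal(size=T)                      # samples g(w_{k+1}, xi_{k+1})

y = 0.0
for t in range(T):                          # run the recursion (7)
    y = (1.0 - beta[t]) * y + beta[t] * g[t]

zeta = np.array([beta[k] * np.prod(1.0 - beta[k + 1:]) for k in range(T)])
y_explicit = float(np.sum(zeta * g))        # explicit weighted average
```

Since the first step size equals one here, the weights sum to one, so $\mathbf{y}_{t+1}$ is a genuine weighted average of the samples.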
This section provides the major theoretical findings for Algorithm 1. We begin the analysis by defining a new objective function that penalizes the constraint:
$$\tilde{H}(\mathbf{x}, \alpha, \delta) = \tilde{f}(\mathbb{E}[\tilde{g}(\mathbf{x}, \xi)], \alpha, \delta) + R(\mathbf{x}), \tag{10}$$
where
$$\tilde{f}(\tilde{\mathbf{y}}, \alpha, \delta) = f(\mathbf{y}) + \frac{\delta}{\alpha} p(\mathbf{z}), \quad p(\mathbf{z}) = \ell(q(\mathbf{z})), \quad \tilde{g}(\mathbf{x}, \xi) = [g(\mathbf{x}, \xi), h(\mathbf{x}, \xi)], \quad \tilde{\mathbf{y}} = [\mathbf{y}, \mathbf{z}]. \tag{11}$$
Since the inner function $h$ is Lipschitz and has bounded gradients, it is simple to prove (see [14, Appendix A, Lemma 3]) that the function $p(\mathbf{z})$ is smooth and has bounded gradients. Note that $\tilde{H}(\mathbf{x}, \alpha, \delta)$ is convex in $\mathbf{x}$. The following theorem establishes the convergence results for Algorithm 1.

Theorem 1. Under the assumptions in Sec. III-A, for constants selected as $\alpha_t = Ct^{-a}$, $\beta_t = C_b t^{-b}$, $\delta_t = Ct^{-c}$, $\gamma_t = Ct^{-d}$, $\eta_t = Ct^{-e}$, where $C \ge 1$, $C_b > 0$, $a \ge c$, and $0 \le a, b, c \le 1$, the following results hold:
$$\frac{2}{T} \sum_{t=T/2}^{T} \mathbb{E}[H(\mathbf{x}_{t+1}) - H(\mathbf{x}^\star)] \le O\Big( T^{a-1} + T^{d} + T^{4b-2c-d} + T^{-b-d} + T^{2a+4b-4c-d} + T^{2a-b-2c-d} + T^{-a} + T^{-c} + T^{-e} + T^{2a-2c-e} + T^{e-2a} \Big),$$
$$\frac{2}{T} \sum_{t=T/2}^{T} \max_j \mathbb{E}[Q_j(\hat{\mathbf{x}})] \le O\Big( T^{(c-1)/2} + T^{(d+c-a)/2} + T^{(4b-c-a-d)/2} + T^{(c-a-b-d)/2} + T^{(a+4b-3c-d)/2} + T^{(c-2a)/2} + T^{(a-b-c-d)/2} + T^{-a/2} + T^{(c-a-e)/2} + T^{(a-c-e)/2} + T^{(e-3a+c)/2} + T^{(c-a)/2} \Big),$$
where $H(\mathbf{x}) = F(\mathbf{x}) + R(\mathbf{x})$.

The proof follows [17, Theorem 3], but the presence of stochastic constraints needs to be handled separately from [14]. Specifically, we first bound the expected squared successive-iterate difference as $\mathbb{E}[\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2] \le O(\alpha_t^2 + \delta_t^2)$. We then bound the gap between the tracking variables and the inner functions as $\mathbb{E}[\|\mathbf{y}_t - \bar{g}(\mathbf{x}_t)\|^2] \le O(t^{-2c+4b} + t^{-b})$ and $\mathbb{E}[\|\mathbf{z}_t - \bar{h}(\mathbf{x}_t)\|^2] \le O(t^{-2c+4b} + t^{-b})$. Next, by deriving a bound on the difference between the algorithmic and optimal solutions, we prove the statement of Theorem 1. The complete proof is deferred to Appendix A. By carefully choosing the constants, we obtain the rates shown in Table I. The rates provided by Theorem 1 improve upon the best-known results of [14]; the improvement is due to the additional smoothness assumption on the inner functions.

TABLE I: Summary of bounds on optimality gap and constraint violation

Choice of constants $a, b, c, d, e$ | Optimality gap | Constraint violation
$a = 0.\cdot$, $b = 0.\cdot$, $c = 0.\cdot$, $d = -0.\cdot$, $e = 1.\cdot$ | $O(T^{-\cdot})$ | $O(T^{-\cdot})$
$a = 0.\cdot$, $b = 0.\cdot$, $c = 0.\cdot$, $d = -0.\cdot$, $e = 1.\cdot$ | $O(T^{-\cdot})$ | $O(T^{-\cdot})$
$a = 0.\cdot$, $b = 0.\cdot$, $c = 0.\cdot$, $d = -0.\cdot$, $e = 0.\cdot$ | $O(T^{-\cdot})$ | $O(1)$

The bounds in Table I are expressed in terms of the number of iterations.
The analysis excludes the per-iteration complexity, which is fixed. Although the number of iterations needed to reach an optimal solution is reduced, the number of oracle calls may be larger, depending on the number of gradient queries required per iteration; for example, adding stochastic constraints requires additional gradient queries. The current approach cannot improve the per-iteration complexity, which is the main drawback of Algorithm 1.

Fig. 2: Convergence results for ACSCPG and CSCGD [14]: QoS (kbps) and constraint violation (ms) versus the number of iterations.

IV. SIMULATIONS
We consider a data center with $M = 10$ servers, each equipped with a single-antenna base-station capable of wireless file transmissions. The $N = 100$ most trending files are considered for placement, and their popularities follow a Zipf distribution. Specifically, the probability $p_k$ that the $k$-th most popular file is requested at a given time adheres to $p_k \propto k^{-s}$, where $s$ is the parameter characterizing the skewness of the distribution and is held fixed in the simulation. All the files are encoded using an $(8, k_i)$ MDS erasure code and stored on the servers, as explained in Sec. II. The bandwidth of every channel is allocated as $B_j = 1$ Mbps for $j = 1, \ldots, M$. A latency constraint is imposed so that each server's queuing delay does not exceed its threshold $D_j$ (in ms). Experiments are conducted in a simulated environment of a unit-radius circle in which all the users and servers are present. File requests arrive from locations distributed uniformly at random within the unit-radius circle. The wireless channels between the users and the data center are Rayleigh. The attenuation due to path loss is modeled as $k(d/d_0)^{-\nu}$ with parameters $k = d_0 = 1$. Eight servers are placed on the circumference of a 0.5-unit-radius circle, separated by 45° angular differences; the remaining two servers are placed at coordinates far away from the users. In the first part of the experiments, Algorithm 1 (ACSCPG) is run in the simulated environment to learn the policy variables $\{\pi_{ij}\}$ and power allocations $\{p_j\}$ by solving problem (4) in an online manner. For comparison, CSCGD [14] is also run, and the results are shown in Fig. 2. The two plots show the evolution of the objective $U(\mathbf{p})$ (the QoS metric) in (4a) and of the constraint violation
TABLE II: Comparison results for Equiprobable and theproposed policies max j ( W j ( p j , π i j ) − D j ) respectively. As shown in Fig.2,the objective is maximized, while constraint violationsare decreased with the number of iterations as desired.However, the proposed algorithm outperforms CSCGD interms of convergence rate, as supported by theoreticalarguments.To evaluate the proposed method’s performance in wire-less networks, we implement a heuristic technique calledequiprobable policy that belongs to the category of prob-abilistic scheduling [11] for the comparison. As the namesuggests, the policy variables adhere to constraints in (1)but have equal values. We consider average throughput thatis obtained from the data center as a performance metricfor the comparison and is calculated as T = 1 M M X j =1 W j ( p j , π ij ) + P i E [ b j ( p j ,ξ ij )] . (12)A good policy should obtain higher throughput. Resultsare presented in Table.II, and the policy obtained from theproposed method outperforms the equiprobable policy byachieving higher throughput.V. C ONCLUSION AND F UTURE S COPE
This paper considers an erasure-coded data center in a wireless network. We propose a new scheduling policy that optimizes QoS while respecting strict queuing delay constraints. To solve the resulting problem, we propose a new algorithm, ACSCPG, inspired by [14, 17]. We show that the proposed algorithm beats CSCGD [14] both theoretically and in a simulated data-center environment, and that the policy obtained from the proposed approach outperforms the heuristic equiprobable policy. Apart from the benefit of the rate improvement, the proposed algorithm incurs a higher cost per iteration in the current application: it requires approximately $M + NJ$ gradient queries in each iteration, and $M + NJ$ may be very large in practice. In future work, we will look at distributed versions of the current algorithm to reduce the per-iteration complexity. Another interesting direction is to analyze the algorithm under more practical wireless-network scenarios, such as mobility of the server nodes.
APPENDIX A
PROOF OF THEOREM 1
Lemma 1. The update (5) yields
$$\mathbb{E}[\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2] \le O(\alpha_t^2 + \delta_t^2) \tag{13}$$
for all $t \ge 1$.

Proof:
From Algorithm 1,
$$\mathbf{x}_{t+1} = \operatorname{prox}_{\alpha_t R(\cdot)}\left\{ \mathbf{x}_t - \alpha_t \nabla g(\mathbf{x}_t, \xi_t)\nabla f(\mathbf{y}_t) - \delta_t \nabla h(\mathbf{x}_t, \xi_t)\nabla p(\mathbf{z}_t) \right\}.$$
By the definition of the proximal operator, we can write
$$\mathbf{x}_{t+1} = \arg\min_{\mathbf{x}} \tfrac{1}{2}\left\| \mathbf{x} - \mathbf{x}_t + \alpha_t \nabla g(\mathbf{x}_t, \xi_t)\nabla f(\mathbf{y}_t) + \delta_t \nabla h(\mathbf{x}_t, \xi_t)\nabla p(\mathbf{z}_t) \right\|^2 + \alpha_t R(\mathbf{x}).$$
The optimality condition yields
$$\mathbf{x}_{t+1} - \mathbf{x}_t = -\alpha_t \nabla g(\mathbf{x}_t, \xi_t)\nabla f(\mathbf{y}_t) - \delta_t \nabla h(\mathbf{x}_t, \xi_t)\nabla p(\mathbf{z}_t) - \alpha_t \mathbf{s}_{t+1},$$
where $\mathbf{s}_{t+1} \in \partial R(\mathbf{x}_{t+1})$, whose norm is bounded. Hence,
$$\|\mathbf{x}_{t+1} - \mathbf{x}_t\| \le \alpha_t \|\nabla g(\mathbf{x}_t, \xi_t)\nabla f(\mathbf{y}_t)\| + \delta_t \|\nabla h(\mathbf{x}_t, \xi_t)\nabla p(\mathbf{z}_t)\| + \alpha_t \|\mathbf{s}_{t+1}\| \le O(\alpha_t + \delta_t).$$

Lemma 2. Given two sequences of positive scalars $\{s_t\}_{t=1}^\infty$ and $\{\phi_t\}_{t=1}^\infty$ satisfying
$$s_{t+1} \le \left(1 - \phi_t + C_1 \phi_t^2\right) s_t + C_2 t^{-a} + C_3 t^{-c},$$
where $C_1, C_2, C_3 \ge 0$ and $a \ge c \ge 0$, and with $\phi_t = C_\phi t^{-b}$, where $b \in (0, 1]$ and $C_\phi > 1$, the sequence is bounded for any $t$ as $s_t \le D t^{-d}$, where
$$D = \max\left\{ \max_{t \le (C_1 C_\phi)^{1/b} + 1} s_t\, t^{c}, \; \frac{C_2 + C_3}{C_\phi - 1} \right\}, \qquad d = \min(a - b, c - b).$$

Proof:
This result can be proved by induction. From the definitions, it is clear that the bound holds for any $t \le (C_1 C_\phi)^{1/b}$. Now assume that $s_t \le D t^{-d}$ for some $t > (C_1 C_\phi)^{1/b}$; we must show $s_{t+1} \le D(t+1)^{-d}$. We have
$$s_{t+1} \le \left(1 - \phi_t + C_1\phi_t^2\right)s_t + C_2 t^{-a} + C_3 t^{-c} \le D t^{-d} - D C_\phi t^{-b-d} + D C_1 C_\phi^2 t^{-2b-d} + C_2 t^{-a} + C_3 t^{-c}.$$
From the convexity of the function $f(t) = t^{-d}$, we can write $(t+1)^{-d} - t^{-d} \ge -d t^{-d-1}$. To prove the result, it suffices to establish the following two claims:
$$\Delta := (t+1)^{-d} - t^{-d} + C_\phi t^{-b-d} - C_1 C_\phi^2 t^{-2b-d} > 0, \qquad D \ge \frac{C_2 t^{-a} + C_3 t^{-c}}{\Delta}.$$
For the first claim,
$$\Delta \ge -d\, t^{-d-1} + C_\phi t^{-b-d} - C_1 C_\phi^2 t^{-2b-d} \ge (C_\phi - 1)\, t^{-b-d} > 0,$$
where the second inequality follows from $t > (C_1 C_\phi)^{1/b}$ and $b \le 1$. For the second claim,
$$\frac{C_2 t^{-a} + C_3 t^{-c}}{\Delta} \le \frac{C_2}{C_\phi - 1}\, t^{-a+b+d} + \frac{C_3}{C_\phi - 1}\, t^{-c+b+d} \le \frac{C_2 + C_3}{C_\phi - 1} \le D,$$
where the second inequality follows from the condition $d = \min(a - b, c - b)$.

Lemma 3. Choose $\beta_t = C_b t^{-b}$ with $C_b > 0$ and $b \in (0, 1]$, and $\alpha_t = C_a t^{-a}$, $\delta_t = C_c t^{-c}$ with $C_a, C_c \ge 0$ and $a \ge c$. Under the assumptions, we have
$$\mathbb{E}[\|\mathbf{y}_t - \bar{g}(\mathbf{x}_t)\|^2] \le O\left(t^{-2c+4b} + t^{-b}\right), \qquad \mathbb{E}[\|\mathbf{z}_t - \bar{h}(\mathbf{x}_t)\|^2] \le O\left(t^{-2c+4b} + t^{-b}\right).$$

Proof:
Define
$$m_{t+1} = \sum_{k=0}^{t} \zeta_k^{(t)} \|\mathbf{x}_{t+1} - \mathbf{w}_{k+1}\|, \qquad n_{t+1} = \left\| \sum_{k=0}^{t} \zeta_k^{(t)} \left( g(\mathbf{w}_{k+1}, \xi_{k+1}) - \bar{g}(\mathbf{w}_{k+1}) \right) \right\|.$$
From [17, Lemma 7], we have $\|\mathbf{y}_t - \bar{g}(\mathbf{x}_t)\| \le L_g m_t + 2 n_t$, and
$$q_{t+1} \le (1 - \beta_t)\, q_t + \frac{4}{\beta_t}\|\mathbf{x}_{t+1} - \mathbf{x}_t\| \le (1 - \beta_t)\, q_t + O\left(t^{-a+b} + t^{-c+b}\right),$$
so that, from Lemma 2, $q_t \le O(t^{-c+2b})$. Similarly, from [17, Lemma 7],
$$m_{t+1} \le (1 - \beta_t)\, m_t + \beta_t q_t + \frac{2}{\beta_t}\|\mathbf{x}_{t+1} - \mathbf{x}_t\| \le (1 - \beta_t)\, m_t + O\left(t^{-c+b} + t^{-a+b}\right),$$
and hence $m_t \le O(t^{-c+2b})$. Again from [17, Lemma 7],
$$\mathbb{E}[n_{t+1}^2] \le \left(1 - \beta_t + \beta_t^2\right) \mathbb{E}[n_t^2] + O(1)\, \beta_t^2,$$
so that, from Lemma 2, $\mathbb{E}[n_t^2] \le O(t^{-b})$. We conclude that
$$\mathbb{E}[\|\mathbf{y}_t - \bar{g}(\mathbf{x}_t)\|^2] \le O\left(t^{-2c+4b} + t^{-b}\right),$$
and the other result follows similarly.

With all the preliminary results in place, we now prove a crucial intermediate result before turning to the convergence analysis.
Lemma 4. Under the assumptions, for any scalars $\eta_t$ and $\gamma_t$, the algorithmic updates yield
$$2\alpha_t \mathbb{E}[F(\mathbf{x}_{t+1}) - F(\mathbf{x}^\star)] + 2\delta_t \mathbb{E}[P(\mathbf{x}_{t+1}) - P(\mathbf{x}^\star)] + 2\alpha_t \mathbb{E}[R(\mathbf{x}_{t+1}) - R(\mathbf{x}^\star)] + \mathbb{E}[\|\mathbf{x}_{t+1} - \mathbf{x}^\star\|^2]$$
$$\le \left(1 + \frac{\alpha_t}{\gamma_t}\right) \mathbb{E}[\|\mathbf{x}_t - \mathbf{x}^\star\|^2] + O\left(\alpha_t^2 + \alpha_t\delta_t\right) + O(L_f \alpha_t \gamma_t)\, \mathbb{E}[\|\mathbf{y}_t - \bar{g}(\mathbf{x}_t)\|^2] + O\!\left(\alpha_t\eta_t + \frac{\eta_t\delta_t^2}{\alpha_t}\right) + O\!\left(\frac{\alpha_t^3 + \alpha_t\delta_t^2}{\eta_t}\right) + O\!\left(\frac{L_p \gamma_t \delta_t^2}{\alpha_t}\right)\mathbb{E}[\|\mathbf{z}_t - \bar{h}(\mathbf{x}_t)\|^2].$$

Proof:
Consider
$$\|\mathbf{x}_{t+1} - \mathbf{x}^\star\|^2 = \|\mathbf{x}_{t+1} - \mathbf{x}_t + \mathbf{x}_t - \mathbf{x}^\star\|^2 = \|\mathbf{x}_t - \mathbf{x}^\star\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2 + 2\langle \mathbf{x}_{t+1} - \mathbf{x}_t,\, \mathbf{x}_{t+1} - \mathbf{x}^\star\rangle$$
$$= \|\mathbf{x}_t - \mathbf{x}^\star\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2 - 2\alpha_t \left\langle \nabla\tilde{g}_t(\mathbf{x}_t)\nabla\tilde{f}(\tilde{\mathbf{y}}_t, \alpha_t, \delta_t) + \mathbf{s}_{t+1},\; \mathbf{x}_{t+1} - \mathbf{x}^\star \right\rangle$$
$$= \|\mathbf{x}_t - \mathbf{x}^\star\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2 + 2\alpha_t \langle \mathbf{s}_{t+1},\, \mathbf{x}^\star - \mathbf{x}_{t+1}\rangle + 2\alpha_t\left\langle \nabla\tilde{g}_t(\mathbf{x}_t)\nabla\tilde{f}(\tilde{\mathbf{y}}_t, \alpha_t, \delta_t),\; \mathbf{x}^\star - \mathbf{x}_{t+1}\right\rangle$$
$$\le \|\mathbf{x}_t - \mathbf{x}^\star\|^2 + 2\alpha_t\left( R(\mathbf{x}^\star) - R(\mathbf{x}_{t+1}) \right) + 2\alpha_t\left\langle \nabla\tilde{g}_t(\mathbf{x}_t)\nabla\tilde{f}(\tilde{\mathbf{y}}_t, \alpha_t, \delta_t),\; \mathbf{x}^\star - \mathbf{x}_{t+1}\right\rangle$$
$$= \|\mathbf{x}_t - \mathbf{x}^\star\|^2 + 2\alpha_t (T_1 + T_2) + 2\alpha_t\left( R(\mathbf{x}^\star) - R(\mathbf{x}_{t+1}) \right),$$
where
$$T_1 = \left\langle \nabla\tilde{F}(\mathbf{x}_t, \alpha_t, \delta_t),\; \mathbf{x}^\star - \mathbf{x}_{t+1} \right\rangle, \qquad T_2 = \left\langle \nabla\tilde{g}_t(\mathbf{x}_t)\nabla\tilde{f}(\tilde{\mathbf{y}}_t, \alpha_t, \delta_t) - \nabla\tilde{F}(\mathbf{x}_t, \alpha_t, \delta_t),\; \mathbf{x}^\star - \mathbf{x}_{t+1} \right\rangle,$$
and the inequality follows from the convexity of $R(\mathbf{x})$. Now consider
$$T_1 = \left\langle \nabla\tilde{F}(\mathbf{x}_t, \alpha_t, \delta_t),\, \mathbf{x}_t - \mathbf{x}_{t+1} \right\rangle + \left\langle \nabla\tilde{F}(\mathbf{x}_t, \alpha_t, \delta_t),\, \mathbf{x}^\star - \mathbf{x}_t \right\rangle$$
$$= \langle \nabla F(\mathbf{x}_t), \mathbf{x}_t - \mathbf{x}_{t+1}\rangle + \frac{\delta_t}{\alpha_t}\langle \nabla P(\mathbf{x}_t), \mathbf{x}_t - \mathbf{x}_{t+1}\rangle + \langle \nabla F(\mathbf{x}_t), \mathbf{x}^\star - \mathbf{x}_t\rangle + \frac{\delta_t}{\alpha_t}\langle \nabla P(\mathbf{x}_t), \mathbf{x}^\star - \mathbf{x}_t\rangle$$
$$\le F(\mathbf{x}_t) - F(\mathbf{x}_{t+1}) + \frac{\delta_t}{\alpha_t}\left( P(\mathbf{x}_t) - P(\mathbf{x}_{t+1}) \right) + \frac{1}{2}\left( L_F + \frac{\delta_t}{\alpha_t} L_P \right)\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2 + F(\mathbf{x}^\star) - F(\mathbf{x}_t) + \frac{\delta_t}{\alpha_t}\left( P(\mathbf{x}^\star) - P(\mathbf{x}_t) \right)$$
$$\le \tilde{F}(\mathbf{x}^\star, \alpha_t, \delta_t) - \tilde{F}(\mathbf{x}_{t+1}, \alpha_t, \delta_t) + O(\alpha_t + \delta_t).$$
Next, consider
$$T_2 \le T_3 + T_4 + \eta_t T_5 + \frac{1}{2\eta_t}\|\mathbf{x}_t - \mathbf{x}_{t+1}\|^2,$$
where
$$T_3 = \left\langle \nabla\tilde{F}(\mathbf{x}_t, \alpha_t, \delta_t) - \nabla\tilde{g}_t(\mathbf{x}_t)\nabla\tilde{f}\big(\bar{\tilde{g}}(\mathbf{x}_t), \alpha_t, \delta_t\big),\; \mathbf{x}_t - \mathbf{x}^\star \right\rangle,$$
$$T_4 = \left\langle \nabla\tilde{g}_t(\mathbf{x}_t)\nabla\tilde{f}\big(\bar{\tilde{g}}(\mathbf{x}_t), \alpha_t, \delta_t\big) - \nabla\tilde{g}_t(\mathbf{x}_t)\nabla\tilde{f}(\tilde{\mathbf{y}}_t, \alpha_t, \delta_t),\; \mathbf{x}_t - \mathbf{x}^\star \right\rangle,$$
$$T_5 = \left\| \nabla\tilde{F}(\mathbf{x}_t, \alpha_t, \delta_t) - \nabla\tilde{g}_t(\mathbf{x}_t)\nabla\tilde{f}(\tilde{\mathbf{y}}_t, \alpha_t, \delta_t) \right\|^2.$$
We can eliminate $T_3$, since $\mathbb{E}[T_3] = 0$.
Hence we consider
$$T_4 \le \frac{\gamma_t}{2}\left\| \nabla\tilde{g}_t(\mathbf{x}_t)\nabla\tilde{f}\big(\bar{\tilde{g}}(\mathbf{x}_t), \alpha_t, \delta_t\big) - \nabla\tilde{g}_t(\mathbf{x}_t)\nabla\tilde{f}(\tilde{\mathbf{y}}_t, \alpha_t, \delta_t) \right\|^2 + \frac{1}{2\gamma_t}\|\mathbf{x}_t - \mathbf{x}^\star\|^2$$
$$\le \gamma_t \left\| \nabla g_t(\mathbf{x}_t)\nabla f(\bar{g}(\mathbf{x}_t)) - \nabla g_t(\mathbf{x}_t)\nabla f(\mathbf{y}_t) \right\|^2 + \gamma_t\frac{\delta_t^2}{\alpha_t^2}\left\| \nabla h_t(\mathbf{x}_t)\nabla p\big(\bar{h}(\mathbf{x}_t)\big) - \nabla h_t(\mathbf{x}_t)\nabla p(\mathbf{z}_t) \right\|^2 + \frac{1}{2\gamma_t}\|\mathbf{x}_t - \mathbf{x}^\star\|^2$$
$$\le O(L_f \gamma_t)\, \|\mathbf{y}_t - \bar{g}(\mathbf{x}_t)\|^2 + \frac{1}{2\gamma_t}\|\mathbf{x}_t - \mathbf{x}^\star\|^2 + O\!\left( \frac{L_p \gamma_t \delta_t^2}{\alpha_t^2} \right) \|\mathbf{z}_t - \bar{h}(\mathbf{x}_t)\|^2.$$
Finally, consider
$$T_5 = \left\| \nabla F(\mathbf{x}_t) + \frac{\delta_t}{\alpha_t}\nabla P(\mathbf{x}_t) - \nabla g_t(\mathbf{x}_t)\nabla f(\mathbf{y}_t) - \frac{\delta_t}{\alpha_t}\nabla h_t(\mathbf{x}_t)\nabla p(\mathbf{z}_t) \right\|^2 \le O\!\left(1 + \frac{\delta_t^2}{\alpha_t^2}\right).$$
Combining all the intermediate results, we conclude that
$$\mathbb{E}[\|\mathbf{x}_{t+1} - \mathbf{x}^\star\|^2] \le \left(1 + \frac{\alpha_t}{\gamma_t}\right)\mathbb{E}[\|\mathbf{x}_t - \mathbf{x}^\star\|^2] + 2\alpha_t \tilde{F}(\mathbf{x}^\star, \alpha_t, \delta_t) - 2\alpha_t\, \mathbb{E}[\tilde{F}(\mathbf{x}_{t+1}, \alpha_t, \delta_t)] + O\!\left(\frac{\alpha_t^3 + \alpha_t\delta_t^2}{\eta_t}\right) + O(L_f \alpha_t\gamma_t)\,\mathbb{E}[\|\mathbf{y}_t - \bar{g}(\mathbf{x}_t)\|^2]$$
$$+ O\!\left( \alpha_t\eta_t + \frac{\eta_t\delta_t^2}{\alpha_t} \right) + O\!\left( \frac{L_p\gamma_t\delta_t^2}{\alpha_t} \right)\mathbb{E}[\|\mathbf{z}_t - \bar{h}(\mathbf{x}_t)\|^2] + O\!\left(\alpha_t^2 + \alpha_t\delta_t\right) + 2\alpha_t\,\mathbb{E}[R(\mathbf{x}^\star) - R(\mathbf{x}_{t+1})].$$
Rearranging the terms yields the required result.

With all the derived results in hand, we proceed to prove the theorem. Denote $H(\mathbf{x}) = F(\mathbf{x}) + R(\mathbf{x})$ and recall that $P(\mathbf{x}^\star) = 0$. Summing the expression of Lemma
4, we get
$$\begin{aligned}
\sum_{t=T/2}^{T}\left(\mathbb{E}\left[H(x_{t+1})-H(x^\star)\right] + \frac{2\delta_t}{\alpha_t}\mathbb{E}\left[P(x_{t+1})\right]\right)
&\le \frac{1}{\alpha_{T/2}}\mathbb{E}\left[\|x_{T/2}-x^\star\|^2\right] + \sum_{t=T/2}^{T}\frac{1}{\gamma_t}\mathbb{E}\left[\|x_t-x^\star\|^2\right] \\
&\quad + \sum_{t=T/2}^{T} O\!\left(\gamma_t\frac{\delta_t^2}{\alpha_t^2}\right)\mathbb{E}\left[\left\|z_t-\bar{h}(x_t)\right\|^2\right] + \sum_{t=T/2}^{T} O(\gamma_t)\,\mathbb{E}\left[\left\|y_t-\bar{g}(x_t)\right\|^2\right] \\
&\quad + \sum_{t=T/2}^{T} O\!\left(\alpha_t^2 + \alpha_t\delta_t + \frac{\eta_t\delta_t^2}{\alpha_t^2} + \frac{\alpha_t^2}{\eta_t}\right) \\
&\le \sum_{t=T/2}^{T} O\!\left(\gamma_t\frac{\delta_t^2}{\alpha_t^2}\right)\mathbb{E}\left[\left\|z_t-\bar{h}(x_t)\right\|^2\right] + O\!\left(\frac{1}{\alpha_{T/2}}\right) + O\!\left(\sum_{t=T/2}^{T}\frac{1}{\gamma_t}\right) \\
&\quad + \sum_{t=T/2}^{T} O(\gamma_t)\,\mathbb{E}\left[\left\|y_t-\bar{g}(x_t)\right\|^2\right] + \sum_{t=T/2}^{T} O\!\left(\alpha_t^2 + \alpha_t\delta_t + \frac{\eta_t\delta_t^2}{\alpha_t^2} + \frac{\alpha_t^2}{\eta_t}\right).
\end{aligned}$$
The second inequality follows from the fact that $\|x_t-x^\star\|^2 \le O(1)$. From the results of Lemma 3, we can write
$$\begin{aligned}
\sum_{t=T/2}^{T}\left(\mathbb{E}\left[H(x_{t+1})-H(x^\star)\right] + \frac{2\delta_t}{\alpha_t}\mathbb{E}\left[P(x_{t+1})\right]\right)
&\le \sum_{t=T/2}^{T}\left(O(\gamma_t) + O\!\left(\gamma_t\frac{\delta_t^2}{\alpha_t^2}\right)\right)\left(O\!\left(\frac{\delta_t^2}{\beta_t}\right) + \beta_t\right) + \sum_{t=T/2}^{T} O\!\left(\alpha_t^2 + \alpha_t\delta_t + \frac{\eta_t\delta_t^2}{\alpha_t^2} + \frac{\alpha_t^2}{\eta_t}\right) \\
&\quad + O\!\left(\frac{1}{\alpha_{T/2}}\right) + O\!\left(\sum_{t=T/2}^{T}\frac{1}{\gamma_t}\right) \\
&\le \sum_{t=T/2}^{T} O\!\left(\gamma_t\frac{\delta_t^2}{\beta_t} + \beta_t\gamma_t + \gamma_t\frac{\delta_t^4}{\alpha_t^2\beta_t} + \gamma_t\frac{\delta_t^2\beta_t}{\alpha_t^2}\right) + \sum_{t=T/2}^{T} O\!\left(\alpha_t^2 + \alpha_t\delta_t + \frac{\eta_t\delta_t^2}{\alpha_t^2} + \frac{\alpha_t^2}{\eta_t}\right) \\
&\quad + O\!\left(\frac{1}{\alpha_{T/2}}\right) + O\!\left(\sum_{t=T/2}^{T}\frac{1}{\gamma_t}\right). \qquad (14)
\end{aligned}$$
Now, for the first result, since $P(x_{t+1}) \ge 0$, we can write
$$\begin{aligned}
\sum_{t=T/2}^{T}\mathbb{E}\left[H(x_{t+1})-H(x^\star)\right]
&\le \sum_{t=T/2}^{T} O\!\left(\gamma_t\frac{\delta_t^2}{\beta_t} + \beta_t\gamma_t + \gamma_t\frac{\delta_t^4}{\alpha_t^2\beta_t} + \gamma_t\frac{\delta_t^2\beta_t}{\alpha_t^2}\right) + \sum_{t=T/2}^{T} O\!\left(\alpha_t^2 + \alpha_t\delta_t + \frac{\eta_t\delta_t^2}{\alpha_t^2} + \frac{\alpha_t^2}{\eta_t}\right) \\
&\quad + O\!\left(\frac{1}{\alpha_{T/2}}\right) + O\!\left(\sum_{t=T/2}^{T}\frac{1}{\gamma_t}\right).
\end{aligned}$$
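The $O(1/\alpha_{T/2})$ term in the summed bound comes from an Abel-summation (telescoping) argument: with a non-increasing step size $\alpha_t$ and bounded nonnegative distances, the sum $\sum_t \frac{1}{2\alpha_t}\big(\|x_t-x^\star\|^2 - \|x_{t+1}-x^\star\|^2\big)$ collapses to the boundary term plus positive increments. A small numeric sketch with synthetic sequences (the exponent $1/2$ below is illustrative, not the paper's step-size choice):

```python
import numpy as np

# Abel summation: for nonnegative v_t and non-increasing step size a_t,
#   sum_{t=m}^{n} (v_t - v_{t+1}) / (2 a_t)
#     <= v_m / (2 a_m) + sum_{t=m+1}^{n} v_t * (1/(2 a_t) - 1/(2 a_{t-1})).
rng = np.random.default_rng(2)
T = 1000
v = rng.uniform(0.0, 1.0, size=T + 2)                  # stand-in for ||x_t - x*||^2 = O(1)
a = np.array([(t + 1) ** -0.5 for t in range(T + 2)])  # illustrative a_t ~ t^{-1/2}

lhs = sum((v[t] - v[t + 1]) / (2 * a[t]) for t in range(T // 2, T + 1))
rhs = v[T // 2] / (2 * a[T // 2]) + sum(
    v[t + 1] * (1 / (2 * a[t + 1]) - 1 / (2 * a[t])) for t in range(T // 2, T)
)
assert lhs <= rhs + 1e-9
```

With $\|x_t - x^\star\|^2 \le O(1)$, the right-hand side is $O(1/\alpha_T)$ overall, which is the bound used above.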
After substituting the chosen step-size parameters,
$$\begin{aligned}
\sum_{t=T/2}^{T}\mathbb{E}\left[H(x_{t+1})-H(x^\star)\right]
&\le O\!\left(T^{a}\right) + O\!\left(\sum_{t=T/2}^{T} t^{d}\right) + \sum_{t=T/2}^{T} O\!\left(t^{4b-c-d} + t^{a+4b-c-d} + t^{a-b-c-d} + t^{-b-d} + t^{-a} + t^{-a-c} + t^{-e} + t^{a-c-e} + t^{e-a}\right) \\
&\le O\!\left(T^{a} + T^{d+1} + T^{4b-c-d+1} + T^{1-b-d} + T^{a+4b-c-d+1} + T^{a-b-c-d+1} + T^{1-a} + T^{1-a-c} + T^{1-e} + T^{a-c-e+1} + T^{e-a+1}\right).
\end{aligned}$$
Proceeding further,
$$\frac{2}{T}\sum_{t=T/2}^{T}\mathbb{E}\left[H(x_{t+1})-H(x^\star)\right] \le O\!\left(T^{a-1} + T^{d} + T^{4b-c-d} + T^{-b-d} + T^{a+4b-c-d} + T^{a-b-c-d} + T^{-a} + T^{-a-c} + T^{-e} + T^{a-c-e} + T^{e-a}\right).$$
Hence, from the convexity of $H$, the first result of Theorem 1 is proved.

Now for the second result, we know that $\mathbb{E}\left[H(x_{t+1})-H(x^\star)\right] \ge -O(1)$. Since $\alpha_t \le \delta_t$ and the function $P$ is convex, we have
$$\frac{T}{2}\frac{\delta_{T/2}}{\alpha_{T/2}}\mathbb{E}\left[P(\hat{x})\right] \le \frac{\delta_{T/2}}{\alpha_{T/2}}\sum_{t=T/2}^{T}\mathbb{E}\left[P(x_{t+1})\right] \le \sum_{t=T/2}^{T}\frac{\delta_t}{\alpha_t}\mathbb{E}\left[P(x_{t+1})\right]. \qquad (15)$$
From the definition of the function $P$ provided in (11) and the penalty function in (8), we know that $P(\hat{x}) = \sum_{j}\big([Q_j(\hat{x})]_+\big)^2$, where $[\cdot]_+$ denotes the projection onto the positive orthant. Therefore the expression in (15) becomes
$$\frac{T\delta_{T/2}}{2J\alpha_{T/2}}\left(\sum_{j=1}^{J}\left[Q_j(\hat{x})\right]_+\right)^2 \le \frac{T\delta_{T/2}}{2\alpha_{T/2}}\sum_{j=1}^{J}\big([Q_j(\hat{x})]_+\big)^2 \le \sum_{t=T/2}^{T}\frac{\delta_t}{\alpha_t}\mathbb{E}\left[P(x_{t+1})\right]. \qquad (16)$$
From the expression in (14), we can write
$$\begin{aligned}
\frac{\delta_{T/2}}{\alpha_{T/2}}\sum_{t=T/2}^{T}\mathbb{E}\left[P(x_{t+1})\right]
&\le \sum_{t=T/2}^{T} O\!\left(\gamma_t\frac{\delta_t^2}{\beta_t} + \beta_t\gamma_t + \gamma_t\frac{\delta_t^4}{\alpha_t^2\beta_t} + \gamma_t\frac{\delta_t^2\beta_t}{\alpha_t^2}\right) + \sum_{t=T/2}^{T} O\!\left(\alpha_t^2 + \alpha_t\delta_t + \frac{\eta_t\delta_t^2}{\alpha_t^2} + \frac{\alpha_t^2}{\eta_t}\right) \\
&\quad + O\!\left(\frac{1}{\alpha_{T/2}}\right) + O\!\left(\sum_{t=T/2}^{T}\frac{1}{\gamma_t}\right) + O(T) \\
&\le O\!\left(T^{a} + T^{d+1} + T^{4b-c-d+1} + T^{a+4b-c-d+1} + T^{1-b-d} + T^{a-b-c-d+1} + T^{1-a} + T^{1-a-c} + T^{1-e} + T^{a-c-e+1} + T^{e-a+1} + T\right).
\end{aligned}$$
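The substitutions above repeatedly convert sums of step-size powers into powers of $T$ via $\sum_{t=T/2}^{T} t^{x} = \Theta(T^{x+1})$. A quick numerical confirmation of that asymptotic (the exponents chosen here are for illustration, not the paper's parameter choices):

```python
# Sums of powers over t = T/2, ..., T behave like T^{x+1}:
#   sum_{t=T/2}^{T} t^x ~ ((1 - 2^{-(x+1)}) / (x+1)) * T^{x+1}   for x != -1,
# by comparison with the integral of t^x from T/2 to T.
def partial_sum(T, x):
    return sum(t ** x for t in range(T // 2, T + 1))

for x in (-0.5, 0.0, 0.5):  # illustrative exponents
    T = 100_000
    ratio = partial_sum(T, x) / T ** (x + 1)
    expected = (1 - 0.5 ** (x + 1)) / (x + 1)  # integral-approximation constant
    assert abs(ratio - expected) < 1e-2
```

This is why each per-iteration term $t^{x}$ contributes $T^{x+1}$ to the summed bound, and dividing the sum by $T$ simply shifts every exponent down by one.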
By taking $\delta_{T/2}/\alpha_{T/2}$ to the R.H.S. and dividing by $T/2$, we can write
$$\frac{2}{T}\sum_{t=T/2}^{T}\mathbb{E}\left[P(x_{t+1})\right] \le O\!\left(T^{c-1} + T^{d+c-a} + T^{4b-c-d-a} + T^{c-a-b-d} + T^{a+4b-c-d} + T^{a-b-c-d} + T^{-a+c} + T^{-a} + T^{c-a-e} + T^{a-c-e} + T^{e-a+c} + T^{c-a}\right).$$
From the expression in (16), we conclude the proof by noting that
$$\sum_{j=1}^{J}\left[Q_j(\hat{x})\right]_+ \le O\!\left(T^{(c-1)/2} + T^{(d+c-a)/2} + T^{(4b-c-d-a)/2} + T^{(c-a-b-d)/2} + T^{(a+4b-c-d)/2} + T^{(a-b-c-d)/2} + T^{(-a+c)/2} + T^{-a/2} + T^{(c-a-e)/2} + T^{(a-c-e)/2} + T^{(e-a+c)/2} + T^{(c-a)/2}\right).$$

REFERENCES

[1] M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, "XORing elephants: Novel erasure codes for big data," in
Proc. Int. Conf. Very Large Data Bases, pp. 325–336.
[2] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, "Erasure coding in Windows Azure storage," in Proc. USENIX Annu. Tech. Conf., 2012, pp. 1–16.
[3] A. Fikes, "Storage architecture and challenges," Talk at the Google Faculty Summit, vol. 535, 2010.
[4] H. Weatherspoon and J. D. Kubiatowicz, "Erasure coding vs. replication: A quantitative comparison," in Proc. Int. Workshop Peer-to-Peer Syst. New York, NY, USA: Springer-Verlag, 2002.
[5] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4539–4551, 2010.
[6] M. K. Aguilera, R. Janakiraman, and L. Xu, "Using erasure codes efficiently for storage in a distributed system," in Proc. Int. Conf. on DSN. IEEE, 2005, pp. 336–345.
[7] J. Li, "Adaptive erasure resilient coding in distributed storage," in Proc. IEEE ICME, 2006, pp. 561–564.
[8] E. Schurman and J. Brutlag, "The user and business impact of server delays, additional bytes, and HTTP chunking in web search," in Proc. O'Reilly Velocity Web Perform. Oper. Conf., Jun. 2009.
[9] L. Huang, S. Pawar, H. Zhang, and K. Ramchandran, "Codes can reduce queueing delay in data centers," in Proc. IEEE ISIT, 2012, pp. 2766–2770.
[10] G. Joshi, Y. Liu, and E. Soljanin, "On the delay-storage trade-off in content download from coded distributed storage systems," vol. 32, no. 5, pp. 989–997, 2014.
[11] Y. Xiang, T. Lan, V. Aggarwal, and Y.-F. R. Chen, "Joint latency and cost optimization for erasure-coded data center storage," IEEE/ACM Trans. Netw., vol. 24, no. 4, pp. 2443–2457, 2016.
[12] A. O. Al-Abbasi and V. Aggarwal, "Video streaming in distributed erasure-coded storage systems: Stall duration analysis," IEEE/ACM Trans. Netw., vol. 26, no. 4, pp. 1921–1932, 2018.
[13] S. Lee, M. Han, and D. Hong, "Average SNR and ergodic capacity analysis for opportunistic DF relaying with outage over Rayleigh fading channels," IEEE Trans. Wireless Commun., vol. 8, no. 6, pp. 2807–2812, 2009.
[14] S. T. Thomdapu and K. Rajawat, "Optimal design of queuing systems via compositional stochastic programming," IEEE Trans. Commun., vol. 67, no. 12, pp. 8460–8474, 2019.
[15] A. Benveniste, M. Métivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations. Springer Science & Business Media, 2012, vol. 22.
[16] M. Wang, E. X. Fang, and H. Liu, "Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions," Mathematical Programming, vol. 161, no. 1–2, pp. 419–449, 2017.
[17] M. Wang, J. Liu, and E. X. Fang, "Accelerating stochastic composition optimization,"