Stochastic Modified Equations for Continuous Limit of Stochastic ADMM
Xiang Zhou, Huizhuo Yuan, Chris Junchi Li, Qingyun Sun

(School of Data Science and Department of Mathematics, City University of Hong Kong, Hong Kong, China; Peking University, China; Department of EECS, UC Berkeley, USA; Department of Mathematics, Stanford University. Correspondence to: Xiang Zhou.)

Abstract
The stochastic version of the alternating direction method of multipliers (ADMM) and its variants (linearized ADMM, gradient-based ADMM) play a key role in modern large-scale machine learning problems; one example is the regularized empirical risk minimization problem. In this work, we put different variants of stochastic ADMM into a unified form, which includes standard, linearized and gradient-based ADMM with relaxation, and study their dynamics via a continuous-time model approach. We adapt the mathematical framework of the stochastic modified equation (SME) and show that the dynamics of stochastic ADMM is approximated, in the sense of weak approximation, by a class of stochastic differential equations with a small noise parameter. The continuous-time analysis uncovers important analytical insights into the behavior of the discrete-time algorithm that are non-trivial to gain otherwise. For example, we can characterize the fluctuation of the solution paths precisely and decide the optimal stopping time that minimizes the variance of the solution paths.
1. Introduction
For modern industrial-scale machine learning problems with massive amounts of data, stochastic first-order methods have almost become the default choice. Additionally, the datasets are not only extremely large, but are often stored or even collected in a distributed manner. Stochastic versions of the alternating direction method of multipliers (ADMM) are popular approaches to handle this distributed setting, especially for regularized empirical risk minimization problems.
Consider the following stochastic optimization problem:
$$\min_{x \in \mathbb{R}^d} \; V(x) := f(x) + g(Ax), \qquad (1)$$
where $f(x) = \mathbb{E}_\xi \, \ell(x, \xi)$ with $\ell$ the loss incurred on a sample $\xi$, $f: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$, $g: \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$, $A \in \mathbb{R}^{m \times d}$, and both $f$ and $g$ are convex and differentiable. The stochastic version of the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) rewrites (1) as a constrained optimization problem
$$\min_{x \in \mathbb{R}^d, z \in \mathbb{R}^m} \; \mathbb{E}_\xi f(x, \xi) + g(z) \quad \text{subject to} \quad Ax - z = 0. \qquad (2)$$
Here and throughout the rest of the paper, we use the same symbol $f$ for both the stochastic instance and its expectation to ease the notation. In the batch learning setting, $f(x)$ is approximated by the empirical risk function $f_{\mathrm{emp}} = \frac{1}{N} \sum_{i=1}^N f(x, \xi_i)$. However, minimizing $f_{\mathrm{emp}}$ over a large number of samples is inefficient under time and resource constraints. In the stochastic setting, in each iteration $x$ is updated based on one noisy sample $\xi$ instead of the full training set. Note that the classical setting of a linear constraint $Ax + Bz = c$ can be reformulated as $z = Ax$ by a simple linear transformation when $B$ is invertible.

One of the main ideas in stochastic ADMM parallels stochastic gradient descent (SGD). At iteration $k$, an i.i.d. sample $\xi_{k+1}$ is drawn from the distribution of $\xi$. A straightforward application of this SGD idea to the ADMM for solving (2) leads to the following stochastic ADMM (sADMM):
$$x_{k+1} = \arg\min_x \Big\{ f(x, \xi_{k+1}) + \frac{\rho}{2} \| Ax - z_k + u_k \|^2 \Big\}, \qquad (3a)$$
$$z_{k+1} = \arg\min_z \Big\{ g(z) + \frac{\rho}{2} \| \alpha A x_{k+1} + (1-\alpha) z_k - z + u_k \|^2 \Big\}, \qquad (3b)$$
$$u_{k+1} = u_k + \big( \alpha A x_{k+1} + (1-\alpha) z_k - z_{k+1} \big). \qquad (3c)$$
Here $\alpha \in (0, 2)$ is introduced as a relaxation parameter (Eckstein & Bertsekas, 1992; Boyd et al., 2011). When $\alpha = 1$, the relaxation scheme reduces to the standard ADMM. The over-relaxation case is $\alpha > 1$, which can accelerate the convergence toward the optimal solution (Yuan et al., 2019). (A concrete code sketch of the scheme (3) is given at the end of this subsection.)

Many variants of the classical ADMM have been developed recently. Two types of modifications are common among these variants, in order to cater to the requirements of different applications.

1. In the linearized ADMM (Goldfarb et al., 2013), the augmented Lagrangian function is approximated by linearizing the quadratic term in $x$ in (3a) and adding a proximal term $\frac{\tau}{2}\|x - x_k\|^2$:
$$x_{k+1} := \arg\min_x \Big\{ f(x, \xi_{k+1}) + \frac{\tau}{2} \Big\| x - \Big( x_k - \frac{\rho}{\tau} A^\top (A x_k - z_k + u_k) \Big) \Big\|^2 \Big\}. \qquad (4)$$

2. The gradient-based ADMM solves (3a) inexactly by applying only one gradient-descent step to all $x$-nonlinear terms in $L_\rho$ with the step size $1/\tau$:
$$x_{k+1} := x_k - \frac{1}{\tau} \big( f'(x_k, \xi_{k+1}) + \rho A^\top (A x_k - z_k + u_k) \big).$$
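To make the iteration (3a)-(3c) concrete, the following is a minimal sketch (ours, not from the paper) for the special case of a sampled quadratic loss and the ridge penalty $g(z) = (\beta/2)\|z\|^2$, so that both subproblems have closed forms; the function name `sadmm_step` and the particular sample model in the usage loop are assumptions for illustration only.

```python
import numpy as np

def sadmm_step(x, z, u, Q, b, A, rho, alpha, beta=1.0):
    """One sADMM iteration (3a)-(3c), assuming a sampled quadratic loss
    f(x, xi) = 0.5 x^T Q x - b^T x (Q, b depend on the drawn sample xi)
    and the ridge penalty g(z) = (beta/2) ||z||^2, so both subproblems
    have closed-form solutions."""
    # (3a): (Q + rho A^T A) x_{k+1} = b + rho A^T (z_k - u_k)
    x_new = np.linalg.solve(Q + rho * A.T @ A, b + rho * A.T @ (z - u))
    # relaxed combination used in (3b) and (3c)
    v = alpha * A @ x_new + (1.0 - alpha) * z
    # (3b): argmin_z (beta/2)||z||^2 + (rho/2)||v - z + u||^2
    z_new = rho * (v + u) / (beta + rho)
    # (3c): scaled dual update
    u_new = u + v - z_new
    return x_new, z_new, u_new

# usage sketch: one pass over random samples with rho = 1/epsilon
rng = np.random.default_rng(0)
d = 3
A = np.eye(d)
x, z, u = np.ones(d), np.ones(d), np.zeros(d)
rho, alpha = 2.0 ** 8, 1.0
for k in range(int(0.5 * rho)):          # roughly T/epsilon iterations with T = 0.5
    xi = rng.normal(size=d)              # the random sample
    Q = np.eye(d) + np.outer(xi, xi)     # sample-dependent curvature
    b = xi                               # sample-dependent linear term
    x, z, u = sadmm_step(x, z, u, Q, b, A, rho, alpha)
```

The closed forms above simply instantiate the two subproblem minimizations for this particular loss and penalty; for general $f$ and $g$ the subproblems must be solved by an inner routine.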
To accommodate all these variants within one stochastic setting, we formulate a very general scheme that unifies all the above cases as a stochastic version of ADMM.

General stochastic ADMM (G-sADMM)
$$x_{k+1} := \arg\min_x \hat{L}_{k+1}(x, z_k, u_k), \qquad (5a)$$
$$z_{k+1} = \arg\min_z \Big\{ g(z) + \frac{\rho}{2} \| \alpha A x_{k+1} + (1-\alpha) z_k - z + u_k \|^2 \Big\}, \qquad (5b)$$
$$u_{k+1} = u_k + \big( \alpha A x_{k+1} + (1-\alpha) z_k - z_{k+1} \big), \qquad (5c)$$
where the approximate objective function for the $x$-subproblem is
$$\hat{L}_{k+1} = (1 - \omega_1) f(x, \xi_{k+1}) + \omega_1 \langle f'(x_k, \xi_{k+1}), x - x_k \rangle + (1 - \omega_2) \frac{\rho}{2} \| Ax - z_k + u_k \|^2 + \omega_2 \rho \langle A^\top (A x_k - z_k + u_k), x - x_k \rangle + \frac{\tau}{2} \| x - x_k \|^2. \qquad (6)$$
The explicitness parameters satisfy $\omega_1, \omega_2 \in [0, 1]$ and the proximal parameter $\tau \geq 0$. This scheme (5) is very general and includes the existing variants as follows (a code sketch of the $x$-subproblem appears at the end of this subsection):

1. $f(x, \xi) \equiv f(x)$: the deterministic version of ADMM;
2. $\omega_1 = \omega_2 = \tau = 0$: the standard stochastic ADMM (sADMM);
3. $\omega_1 = 0$ and $\omega_2 = 1$: the stochastic version of the linearized ADMM;
4. $\omega_1 = 1$ and $\omega_2 = 1$: the stochastic version of the gradient-based ADMM;
5. $\alpha = 1$, $\omega_1 = 1$, $\omega_2 = 0$ and $\tau = \tau_k \propto \sqrt{k}$: the stochastic ADMM considered in (Ouyang et al., 2013).

Our main result can be summarized informally as follows. Define $V(x) = f(x) + g(Ax)$. Let $\alpha \in (0, 2)$, $\omega_1, \omega_2 \in \{0, 1\}$ and $c = \tau/\rho \geq 0$. Let $\epsilon = \rho^{-1} \in (0, 1)$ and let $\{x_k\}$ denote the sequence generated by the stochastic ADMM (5) with the initial choice $z_0 = A x_0$. Define $X_t$ as a stochastic process satisfying the SDE
$$\widehat{M} \, dX_t = -\nabla V(X_t) \, dt + \sqrt{\epsilon} \, \sigma(X_t) \, dW_t,$$
where the matrix
$$\widehat{M} := c I + \Big( \frac{1}{\alpha} - \omega_2 \Big) A^\top A,$$
and $\sigma$ satisfies
$$\sigma(x) \sigma(x)^\top = \mathbb{E}_\xi \big[ (f'(x, \xi) - f'(x)) (f'(x, \xi) - f'(x))^\top \big].$$
Then $x_k \to X_{k\epsilon}$ with weak convergence of order one.
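Before turning to related work, here is a minimal sketch (ours; the name `x_update` and the use of a generic optimizer are assumptions, not the paper's implementation) of the unified $x$-subproblem (5a) with the objective (6). It is meant only to show how the parameters $\omega_1, \omega_2, \tau$ switch between the variants listed above, not to be an efficient solver.

```python
import numpy as np
from scipy.optimize import minimize

def x_update(x_k, z_k, u_k, xi, A, rho, tau, omega1, omega2, f, grad_f):
    """Solve the x-subproblem (5a) with the unified objective (6).

    f(x, xi) and grad_f(x, xi) are the sampled loss and its gradient.
    omega1 = omega2 = tau = 0 recovers the standard sADMM step (3a);
    omega2 = 1 linearizes the quadratic penalty (linearized ADMM);
    omega1 = 1 additionally linearizes the loss (gradient-based ADMM),
    in which case tau > 0 is required so the objective stays bounded below.
    """
    lin_pen = rho * A.T @ (A @ x_k - z_k + u_k)   # gradient of the penalty at x_k
    g_k = grad_f(x_k, xi)                          # sampled loss gradient at x_k

    def L_hat(x):
        dx = x - x_k
        val = (1 - omega1) * f(x, xi) + omega1 * g_k @ dx
        val += (1 - omega2) * 0.5 * rho * np.sum((A @ x - z_k + u_k) ** 2)
        val += omega2 * lin_pen @ dx
        val += 0.5 * tau * np.sum(dx ** 2)
        return val

    return minimize(L_hat, x_k, method="BFGS").x
```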
Stochastic and online ADMM

The use of stochastic and online techniques for ADMM has recently drawn a lot of interest. (Wang & Banerjee, 2012) first proposed the online ADMM in the standard form, which learns from only one sample (or a small mini-batch) at a time. (Ouyang et al., 2013; Suzuki, 2013) proposed variants of stochastic ADMM that attack the difficult nonlinear optimization problem inherent in $f(x, \xi)$ by linearization. Very recently, further accelerated algorithms for stochastic ADMM have been developed in (Zhong & Kwok, 2014; Huang et al., 2019).

Continuous models for optimization algorithms
In our work, we focus on the limit of the stochastic sequence $\{x_k\}$ defined by (3) and (5) as $\rho \to \infty$. Define $\epsilon = \rho^{-1}$, and assume the proximal parameter $\tau$ is linked to $\rho$ by $\tau = c\rho$ with a constant $c \geq 0$. Our interest here is not the numerical convergence of $x_k$ from the ADMM towards the optimal point $x^*$ of the objective function as $k \to \infty$ for a fixed $\rho$, but the proposal of an appropriate continuous model whose (continuous-time) solution $X_t$ is a good approximation to the sequence $x_k$ as $\rho \to \infty$.

The work (Su et al., 2016) is seminal from this perspective of using continuous-time dynamical system tools to analyze existing discrete algorithms for optimization problems; it models Nesterov's accelerated gradient method. For applications to the ADMM, the recent work (França et al., 2018) establishes the first deterministic continuous-time models, in the form of an ordinary differential equation (ODE), for the smooth ADMM, and (Yuan et al., 2019) extends this to the non-smooth case via a differential inclusion model.

In this setting of continuous limit theory, a time duration $T > 0$ is fixed first, so that the continuous-time model is mainly considered on the time interval $[0, T]$. Usually a small parameter (such as the step size) $\epsilon$ is identified with the correct scaling from the discrete algorithm, and used to partition the interval into $K = \lfloor T/\epsilon \rfloor$ windows. The iteration index $k$ in the discrete algorithm is labelled from $0$ to $K$. The convergence of the discrete scheme to the continuous model means that, with the same initial value $X_0 = x_0$, for any $T > 0$, as $\epsilon \to 0$, the error between $x_k$ and $X_{k\epsilon}$, measured in a certain sense, converges to zero for any $0 \leq k \leq K$.

This continuous viewpoint and formulation has been successful for both deterministic and stochastic optimization algorithms in machine learning (E et al., 2019). The works (Li et al., 2017; 2019) rigorously establish the mathematical connection between the Itô stochastic differential equation (SDE) and stochastic gradient descent (SGD) with a step size $\eta$. More precisely, for any small but finite $\eta > 0$, the corresponding stochastic differential equation carries a small parameter $\sqrt{\eta}$ in its diffusion term and is called the stochastic modified equation (SME), for historical reasons in the numerical analysis of differential equations. The convergence between $x_k$ and $X_t$ is then formulated in the weak sense. This SME technique, originally arising from the numerical analysis of SDEs (Kloeden & Platen, 2011), is the major mathematical tool for most stochastic or online algorithms.

Our contributions are as follows.

• We demonstrate how to use mathematical tools such as the stochastic modified equation (SME) and asymptotic expansion to study the dynamics of stochastic ADMM in the small step-size regime (the step size for ADMM is $\epsilon = 1/\rho$).

• We present a unified framework for the variants of stochastic ADMM (standard, linearized and gradient-based ADMM) and present a unified stochastic differential equation as their continuous-time limit under weak convergence.

• We are the first to show that the drift term of the stochastic differential equation is the same as in the previous ordinary differential equation models.

• We are the first to show that the standard deviation of the solution paths has the scaling $\sqrt{\epsilon}$. Moreover, we can accurately compute the continuous limit of the time evolution of $\epsilon^{-1/2}\,\mathrm{std}(x_k)$, $\epsilon^{-1/2}\,\mathrm{std}(z_k)$ and the suitably rescaled $\mathrm{std}(r_k)$ for the residual $r_k = A x_k - z_k$.
The joint fluctuation of $x$, $z$ and $r$ is a new phenomenon that has not been studied in previous works on continuous-time analysis of stochastic gradient descent type algorithms.

• From our stochastic differential equation analysis, we can derive useful insights for practical improvements that are not apparent without the continuous-time model. For example, we are able to precisely compute the diffusion-fluctuation trade-off, which enables us to decide when to decrease the step size and increase the batch size to accelerate the convergence of stochastic ADMM.
Notation and assumptions

We use $\|\cdot\|$ to denote the Euclidean norm if the subscript is not specified, and all vectors are column vectors. $f'(x, \xi)$, $g'(z)$ and $f''(x, \xi)$, $g''(z)$ refer to the first (gradient) and second (Hessian) derivatives with respect to $x$ (respectively $z$).

The first assumption is Assumption I: $f(x)$, $g$ and, for each $\xi$, $f(x, \xi)$ are closed proper convex functions, and $A$ has full column rank.

Let $F$ be the set of functions of at most polynomial growth: $\varphi \in F$ if there exist constants $C, \kappa > 0$ such that
$$|\varphi(x)| < C (1 + \|x\|^\kappa). \qquad (7)$$
To apply the SME theory, we need the following assumptions (Li et al., 2017; 2019), Assumptions II:

(i) $f(x)$, $f(x, \xi)$ and $g(z)$ are differentiable, and the second-order derivatives $f''$, $g''$ are uniformly bounded in $x$, and almost surely in $\xi$ for $f(x, \xi)$. $\mathbb{E}\|f'(x, \xi)\|$ is uniformly bounded in $x$.

(ii) $f(x)$, $f(x, \xi)$, $g(x)$ and their partial derivatives up to a sufficient order belong to $F$; for $f(x, \xi)$ this holds almost surely in $\xi$, i.e., the constants $C, \kappa$ in (7) do not depend on $\xi$.

(iii) $f'(x)$ and $f'(x, \xi)$ satisfy the uniform growth condition $\|f'(x)\| + \|f'(x, \xi)\| \leq C(1 + \|x\|)$ for a constant $C$ independent of $\xi$.

Conditions (ii) and (iii) are inherited from (Li et al., 2017; Milstein, 1986) and might be relaxed in certain cases; refer to the remarks in Appendix C of (Li et al., 2017).
2. Weak Approximation to Stochastic ADMM
In this section, we show the weak approximation to the stochastic ADMM (3) and to the general family of stochastic ADMM variants (5). Appendix A summarizes the background on weak approximation and the stochastic modified equation for interested readers.

Given the noisy gradient $f'(x, \xi)$ and its expectation $f'(x) = \mathbb{E} f'(x, \xi)$, we define the matrix $\Sigma(x) \in \mathbb{R}^{d \times d}$ by
$$\Sigma(x) = \sigma(x) \sigma(x)^\top = \mathbb{E}_\xi \big[ (f'(x, \xi) - f'(x)) (f'(x, \xi) - f'(x))^\top \big]. \qquad (8)$$

Theorem 1 (SME for sADMM). Consider the standard stochastic ADMM without relaxation, i.e., (3) with $\alpha = 1$. Let $\epsilon = \rho^{-1} \in (0, 1)$ and let $\{x_k\}$ denote the sequence of stochastic ADMM with the initial choice $z_0 = A x_0$. Define $X_t$ as a stochastic process satisfying the SDE
$$(A^\top A) \, dX_t = -\nabla V(X_t) \, dt + \sqrt{\epsilon} \, \sigma(X_t) \, dW_t, \qquad (9)$$
where $V(x) = \mathbb{E}_\xi V(x, \xi) = \mathbb{E}_\xi f(x, \xi) + g(Ax)$ and the diffusion matrix $\sigma$ is defined by (8). Then $x_k \to X_{k\epsilon}$ with weak convergence of order 1.

Sketch of proof. The ADMM scheme is an iteration of the triplet $(x, z, \lambda)$ where $\lambda = \epsilon u$. By the first-order optimality conditions for the $z$-subproblem and the $u$-subproblem, we have $\lambda_{k+1} = g'(z_{k+1})$ for whatever input triplet $(x_k, z_k, \lambda_k)$. Thus, the variable $\lambda$ can be faithfully replaced by $g'(z)$. The remaining goal is to further replace the $z$ variable by the $x$ variable, so that the ADMM iteration is approximately reduced to an iteration in the $x$ variable only. This is indeed true because of the critical observation (Proposition 7) that the residual $r_k = A x_k - z_k$ has a second-order smallness, belonging to $O(\epsilon^2)$, provided $r_0 = A x_0 - z_0 = 0$. Thus, the ADMM is transformed into the one-step iteration form (20) in the $x$ variable only, with $\mathcal{A}(\epsilon, x, \xi) = f'(x, \xi) + A^\top g'(Ax) + O(\epsilon)$. The conclusion then follows by directly checking the conditions (23) in Theorem 5.

Our main theorem is for the G-sADMM scheme, which contains the relaxation parameter $\alpha$, the proximal parameter $c$ and the parameters $\omega_1, \omega_2$.

Theorem 2 (SME for G-sADMM). Let $\alpha \in (0, 2)$, $\omega_1, \omega_2 \in \{0, 1\}$ and $c = \tau/\rho \geq 0$. Let $\epsilon = \rho^{-1} \in (0, 1)$ and let $\{x_k\}$ denote the sequence of stochastic ADMM (5) with the initial choice $z_0 = A x_0$. Define $X_t$ as a stochastic process satisfying the SDE
$$\widehat{M} \, dX_t = -\nabla V(X_t) \, dt + \sqrt{\epsilon} \, \sigma(X_t) \, dW_t, \qquad (10)$$
where the matrix
$$\widehat{M} := c I + \Big( \frac{1}{\alpha} - \omega_2 \Big) A^\top A. \qquad (11)$$
Then $x_k \to X_{k\epsilon}$ in weak convergence of order 1, with the following precise meaning. For any time horizon $T > 0$ and for any test function $\varphi$ such that $\varphi$ and its partial derivatives up to a sufficiently high order belong to $F$, there exists a constant $C$ such that
$$|\mathbb{E} \varphi(X_{k\epsilon}) - \mathbb{E} \varphi(x_k)| \leq C \epsilon, \qquad k \leq \lfloor T/\epsilon \rfloor. \qquad (12)$$
Sketch of proof.
The idea of this proof is similar to that of Theorem 1, even with the introduction of the $c, \omega_1, \omega_2$ parameters. But for the relaxation parameter, when $\alpha \neq 1$ we need to overcome a substantial challenge: if $\alpha \neq 1$, the residual $r_k = A x_k - z_k$ is only of order $O(\epsilon)$, not $O(\epsilon^2)$. In the proof, we propose a new $\alpha$-residual $\widehat{r}^\alpha_{k+1} := \alpha r_k + (\alpha - 1)(z_{k+1} - z_k)$ and show that it is indeed as small as $O(\epsilon^2)$ (Proposition 9), which resolves this challenge. The difference between $r_k$ and the $\alpha$-residual then induces the extra $\alpha$-term in the new coefficient matrix $\widehat{M}$ in (11). The rigorous proof is in Appendix B.

Remark 1.
We do not present a simple form of the SME as a second-order weak approximation, as was done for the SGD scheme, due to the complicated issue of the residuals. In addition, the proof requires a regularity condition on the functions $f$ and $g$; at least $g$ needs to have third-order derivatives. So, our theoretical results cannot cover non-smooth functions $g$. Our numerical tests suggest that the conclusion also holds for the $\ell_1$ regularization function $g(z) = \|z\|_1$.

Remark 2.
In general applications, it is very difficult to obtain the expression of the variance matrix $\Sigma(x)$ as a function of $x$, except in a few simplified cases. In applications to empirical risk minimization, the function $f$ is the empirical average of the loss on each sample $f_i$: $f(x) = \frac{1}{N} \sum_{i=1}^N f_i(x)$. The diffusion matrix $\Sigma(x)$ in (8) then takes the form
$$\Sigma_N(x) = \frac{1}{N} \sum_{i=1}^N \big( f'(x) - f'_i(x) \big) \big( f'(x) - f'_i(x) \big)^\top. \qquad (13)$$
It is clear that if $f_i(x) = f(x, \xi_i)$ with $N$ i.i.d. samples $\xi_i$, then $\Sigma_N(x) \to \Sigma(x)$ as $N \to \infty$.
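As an illustration of (13), the following short sketch (ours; the function name and the array layout are assumptions) estimates $\Sigma_N(x)$ from the $N$ per-sample gradients stacked as rows of a matrix.

```python
import numpy as np

def empirical_diffusion(grads):
    """Sigma_N(x) from (13), given the per-sample gradients f_i'(x)
    stacked as the rows of `grads` (shape N x d)."""
    centered = grads - grads.mean(axis=0)      # rows: f_i'(x) - f'(x)
    return centered.T @ centered / grads.shape[0]
```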
Remark 3. The stochastic scheme (5) is the simplest form, using only one instance of the gradient $f'(x, \xi_{k+1})$ in each iteration. If a batch size larger than one is used, then the one-instance gradient $f'(x, \xi_{k+1})$ is replaced by the average $\frac{1}{B_{k+1}} \sum_{i=1}^{B_{k+1}} f'(x, \xi^i_{k+1})$, where $B_{k+1} > 1$ is the batch size and $(\xi^i_{k+1})$ are $B_{k+1}$ i.i.d. samples. Under these settings, $\Sigma$ should be multiplied by the factor $1/B_t$, where the continuous-time function $B_t$ is the linear interpolation of $B_k$ at times $t_k = k\epsilon$. The stochastic modified equation (10) then takes the form
$$\widehat{M} \, dX_t = -\nabla V(X_t) \, dt + \sqrt{\epsilon / B_t} \, \sigma(X_t) \, dW_t.$$

Based on the SME above, we can find the stochastic asymptotic expansion of $X^\epsilon_t$:
$$X^\epsilon_t \approx X_t + \sqrt{\epsilon}\, X^{(1/2)}_t + \epsilon\, X^{(1)}_t + \ldots \qquad (14)$$
See Chapter 2 in (Freidlin & Wentzell, 2012) for a rigorous justification. Here $X_t$ is deterministic, being the gradient flow of the deterministic problem $\dot{X}_t = -V'(X_t)$; $X^{(1/2)}_t$ and $X^{(1)}_t$ are stochastic and satisfy certain SDEs independent of $\epsilon$. The useful conclusion is that the standard deviation of $X^\epsilon_t$, coming mainly from the term $\sqrt{\epsilon} X^{(1/2)}_t$, is $O(\sqrt{\epsilon})$. Hence, the standard deviation of the stochastic ADMM iterate $x_k$ is $O(\sqrt{\epsilon})$ and, more importantly, the two rescaled standard deviations $\epsilon^{-1/2}\,\mathrm{std}(x_k)$ and $\epsilon^{-1/2}\,\mathrm{std}(X_{k\epsilon})$ are close as functions of the time $t_k = k\epsilon$.

We can also investigate the fluctuation of the $z_k$ sequence generated by the stochastic ADMM. The approach is to first study the modified equation of its continuous version $Z_t$. Since the residual $r = Ax - z$ is of order $O(\epsilon)$, as shown in the appendix (Propositions 6 and 7), we have the following result.

Theorem 3.
(i) There exists a deterministic function $h(x, z)$ such that
$$\dot{Z}^\epsilon_t = A \dot{X}^\epsilon_t + \epsilon\, h(X^\epsilon_t, Z^\epsilon_t), \qquad (15)$$
where $X^\epsilon_t$ is the solution to the SME in Theorem 2, and $\{z_k\}$ is a weak approximation to $\{Z^\epsilon_t\}$ of order 1.
(ii) In addition, we have the following asymptotic expansion for $Z^\epsilon_t$:
$$Z^\epsilon_t \approx A X_t + \sqrt{\epsilon}\, A X^{(1/2)}_t + \epsilon\, Z^{(1)}_t, \qquad (16)$$
where $Z^{(1)}_t$ satisfies $\dot{Z}^{(1)}_t = h(X_t, A X_t)$.
(iii) The standard deviation of $z_k$ is of order $\sqrt{\epsilon}$.

[Figure 1: The expectation of $x_k - x^*$ for $\alpha = 0.5, 1, 1.5$ and for $g(z) = z^2$ and $g(z) = |z|$; $x^*$ is the true minimizer. The result is based on the average of 10000 runs.]

Recall the residual $r_k = A x_k - z_k$. In view of Corollary 10 in the appendix, there exists a function $\tilde{h}$ such that
$$\alpha R^\epsilon_t = (1 - \alpha)\big( Z^\epsilon_t - Z^\epsilon_{t - \epsilon} \big) + \epsilon^2\, \tilde{h}(X^\epsilon_t, Z^\epsilon_t), \qquad (17)$$
and the residual $\{r_k\}$ is a weak approximation to $\{R^\epsilon_t\}$ of order 1. If $\alpha \neq 1$ in the G-sADMM (5), then the expectation and standard deviation of $R_t$ and $r_k$ are both of order $O(\epsilon)$. If $\alpha = 1$ in the G-sADMM (5), then the expectation and standard deviation of $R_t$ and $r_k$ are of order $O(\epsilon^2)$.
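The following sketch (ours, with assumed function and parameter names) simulates the SME (10) by the Euler-Maruyama method, with one step of size $\epsilon$ per ADMM iteration, and records the mean and standard deviation across sample paths; by the expansion (14), the rescaled standard deviation $\epsilon^{-1/2}\,\mathrm{std}(X_{k\epsilon})$ computed this way should be nearly independent of $\epsilon$.

```python
import numpy as np

def simulate_sme(x0, grad_V, sigma, M_hat, eps, T, n_paths=2000, rng=None):
    """Euler-Maruyama simulation of the SME (10),
        M_hat dX_t = -grad_V(X_t) dt + sqrt(eps) sigma(X_t) dW_t,
    with one time step of size eps per ADMM iteration, so the n-th Euler
    state approximates X_{t_n}, t_n = n * eps.  grad_V(x) returns the
    gradient of V(x) and sigma(x) the d x d diffusion matrix from (8)."""
    if rng is None:
        rng = np.random.default_rng()
    x0 = np.asarray(x0, dtype=float)
    d, K, dt = x0.size, int(T / eps), eps
    M_inv = np.linalg.inv(M_hat)
    X = np.tile(x0, (n_paths, 1))
    means, stds = [], []
    for _ in range(K):
        drift = np.array([-M_inv @ grad_V(x) for x in X])
        noise = np.array([M_inv @ (sigma(x) @ rng.normal(size=d)) for x in X])
        X = X + dt * drift + np.sqrt(dt) * np.sqrt(eps) * noise
        means.append(X.mean(axis=0))
        stds.append(X.std(axis=0))
    return np.array(means), np.array(stds)

# Per (14), eps**-0.5 * stds is expected to be nearly independent of eps,
# and means/stds can be compared directly with the sADMM iterates at t_k = k*eps.
```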
3. Numerical Examples
Example 1: a one-dimensional example
In this simple example, the dimension is $d = 1$. Consider $f(x, \xi) = (\xi + 1) x^4 + (2 + \xi) x^2 - (1 + \xi) x$, where $\xi$ is a Bernoulli random variable taking values $-1$ or $+1$ with equal probability. We test $g(z) = z^2$ and $g(z) = |z|$. The matrix is $A = I$. These settings satisfy the assumptions in our main theorem. We choose $c = \omega_2$ so that $\widehat{M} = \alpha^{-1}$ (a scalar here). The SME when $g(z) = z^2$ is
$$\alpha^{-1} \, dX_t = -\big( 4 X_t^3 + 6 X_t - 1 \big) \, dt + \sqrt{\epsilon} \, \big| 4 X_t^3 + 2 X_t - 1 \big| \, dW_t.$$
The initial guess is $x_0 = z_0 = 1$ and $\lambda_0 = g'(z_0)$. The terminal time $T$ is fixed.

Figure 2 shows the match of the expectation and the standard deviation between the sequence $x_k$ of stochastic ADMM and $X_{t_k}$ of the SME with $t_k = k\epsilon$. Furthermore, we plot random trajectories from both models in Figure 3, which shows that the fluctuation in the sADMM is well captured by the SME model.

The acceleration effect of $\alpha$ for the deterministic ADMM has been shown in (Yuan et al., 2019). Figure 1 confirms the same effect, both for smooth and non-smooth $g$, for the expectation of the solution sequence $x_k$.

The SME provides not only the expectation of the solution but also the fluctuation of the numerical solution $x_k$ for any given $\epsilon$. Figure 2 compares the mean and standard deviation ("std") between $x_k$ and $X_{k\epsilon}$ for a fixed $\epsilon$; the right vertical axis gives the value of the standard deviation, and the two std curves are very close. In addition, with the same setting, a few hundred trajectory samples of $x$ are shown together in Figure 3, which illustrate the match, both in mean and in std, between the stochastic ADMM and the SME.

[Figure 2: The expectation (left axis) and standard deviation (right axis) of $x_k$ (from stochastic ADMM) and $X_t$ (from the stochastic modified equation). The results are based on the average of independent runs, with an over-relaxation parameter $\alpha > 1$.]

[Figure 3: Sample trajectories from the stochastic ADMM (left) and the SME (right).]

To verify our theorem on the convergence order, a test function $\varphi(x) = x + x^2$ is used for the weak convergence error:
$$\mathrm{err} := \max_{0 \leq k \leq \lfloor T/\epsilon \rfloor} \big| \mathbb{E} \varphi(x_k) - \mathbb{E} \varphi(X_{k\epsilon}) \big|.$$
For each $m = 4, 5, \ldots$, set $\rho = 2^m / T$, so $\epsilon = T \cdot 2^{-m}$ and $k = 1, \ldots, 2^m$. Figure 4 shows the error $\mathrm{err}_m$ versus $m$ in a semi-log plot for three values of the relaxation parameter $\alpha$. The first-order convergence rate $\mathrm{err}_m \propto \epsilon$ is verified.

[Figure 4: (Verification of the first-order approximation.) The weak convergence error $\mathrm{err}_m$ versus $m$ for various $\alpha$ and for the $\ell_2$ and $\ell_1$ regularizations $g$. The step size is $\epsilon = 1/\rho = 2^{-m} T$; the reference line is $2^{-m}$. The result is based on the average of independent runs.]

We also numerically investigated the convergence rate for the non-smooth penalty $g(z) = |z|$, even though this $\ell_1$ regularization function does not satisfy our assumptions. The diffusion term $\Sigma(x)$ is still the same as in the $\ell_2$ case since $g(z)$ is deterministic. For the corresponding SDE, at least formally, we can write
$$\alpha^{-1} \, dX_t = -\big( 4 X_t^3 + 4 X_t - 1 + \mathrm{sign}(X_t) \big) \, dt + \sqrt{\epsilon} \, \big| 4 X_t^3 + 2 X_t - 1 \big| \, dW_t,$$
by using the sign function as $g'(z)$. The rigorous meaning requires the concept of stochastic differential inclusion, which is out of the scope of this work.
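The weak-error quantity err above can be estimated by plain Monte Carlo. The sketch below (ours) assumes two path generators, `run_sadmm` and `run_sme`, such as the ones sketched earlier; their signatures are assumptions for illustration.

```python
import numpy as np

def weak_error(run_sadmm, run_sme, phi, eps, T, n_runs=10000, rng=None):
    """Monte Carlo estimate of err = max_k |E phi(x_k) - E phi(X_{k*eps})|,
    k <= floor(T/eps).  run_sadmm(eps, T, rng) and run_sme(eps, T, rng) must
    each return the array of iterates x_0, ..., x_K (resp. X_0, ..., X_{K*eps})."""
    if rng is None:
        rng = np.random.default_rng(0)
    e_admm = np.mean([phi(run_sadmm(eps, T, rng)) for _ in range(n_runs)], axis=0)
    e_sme = np.mean([phi(run_sme(eps, T, rng)) for _ in range(n_runs)], axis=0)
    return np.max(np.abs(e_admm - e_sme))

# Theorem 2 predicts that halving eps (increasing m by one in eps = T * 2**-m)
# roughly halves the returned error, i.e. first-order weak convergence.
```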
The numerical results in Figure 4 show that the first-order weak convergence also holds in this $\ell_1$ case.

Finally, we test the orders of the standard deviations of $x_k$ and $z_k$. The consistency of $\mathrm{std}(x_k)$ with the SME's $\mathrm{std}(X_{k\epsilon})$ has been shown in Figure 2. The theoretical prediction is that both are of order $\sqrt{\epsilon}$. We plot the sequences $\epsilon^{-1/2}\,\mathrm{std}(x_k)$ and $\epsilon^{-1/2}\,\mathrm{std}(z_k)$ for various $\epsilon$. These two quantities should be the same regardless of $\epsilon$ and depend only on $\alpha$, which is confirmed by Figure 5. For the residual, the theoretical prediction is that both $\mathbb{E}\, r_k$ and $\mathrm{std}(r_k)$ are of order $\epsilon$ if $\alpha \neq 1$. We plot $\epsilon^{-1} \mathbb{E}(r_k)$ and $\epsilon^{-1}\,\mathrm{std}(r_k)$ against the time $t_k = k\epsilon$ in Figure 6 and Figure 7, respectively. For the stochastic ADMM scheme with $\alpha = 1$, the numerical test shows that $\mathbb{E}\, r_k$ and $\mathrm{std}(r_k)$ are of order $\epsilon^2$ (Figure 8).

Example 2: generalized ridge and lasso regression
We perform experiments on the generalized ridge and lasso regression:
$$\min_{x \in \mathbb{R}^d, z \in \mathbb{R}^m} \; \mathbb{E}_\xi \big( \xi_{\mathrm{in}}^\top x - \xi_{\mathrm{obs}} \big)^2 + g(z) \quad \text{subject to} \quad Ax - z = 0, \qquad (18)$$
where $g(z) = \beta \|z\|_2^2$ (ridge regression) or $g(z) = \beta \|z\|_1$ (lasso regression), with a constant $\beta > 0$; $A$ is a penalty matrix specifying the desired structured pattern of $x$. Among the random variables $\xi = (\xi_{\mathrm{in}}, \xi_{\mathrm{obs}}) \in \mathbb{R}^{n+1}$, $\xi_{\mathrm{in}}$ is a zero-mean random (column) vector uniformly distributed in the hypercube $(-0.5, 0.5)^d$ with independent components. The labelled data are $\xi_{\mathrm{obs}} := \xi_{\mathrm{in}}^\top v + \zeta$, where $v \in \mathbb{R}^n$ is a given vector and $\zeta \sim N(0, \sigma_\zeta^2)$ is zero-mean measurement noise, independent of $\xi_{\mathrm{in}}$. The analytic expression of the matrix-valued function $\Sigma(x)$ is available based on the fourth-order moments of $\xi_{\mathrm{in}}$.

[Figure 5: The rescaled std of $x_k$ and $z_k$ for various step sizes and values of $\alpha$.]

[Figure 6: The verification of the mean residual $\mathbb{E}\, r_k = O(\epsilon)$ for $\alpha \neq 1$; $g(z) = z^2$ (top) and $g(z) = |z|$ (bottom).]

[Figure 7: The verification of the std of the residual $r_k \sim \epsilon$ for $\alpha \neq 1$; $g(z) = z^2$ (top) and $g(z) = |z|$ (bottom).]

[Figure 8: The mean (top) and std (bottom) of the residual $r_k \sim \epsilon^2$ for the scheme without relaxation ($\alpha = 1$); $g(z) = z^2$.]

[Figure 9: The mean of $\varphi(x_k)$ from sADMM and $\varphi(X_{k\epsilon})$ from the SME; top: $g(z) = \beta \|z\|_2^2$, bottom: $g(z) = \beta \|z\|_1$. The results are based on 100 independent runs.]

We use a batch size $B$ for the stochastic ADMM ($B = 9$ is used in the experiments). The corresponding SME for the ridge regression problem is
$$\widehat{M} \, dX_t = -\Omega (X_t - v) \, dt - \beta A^\top A X_t \, dt + \sqrt{\epsilon / B} \, \Sigma^{1/2}(X_t) \, dW_t,$$
where $\Omega$ is the Hessian of the expected quadratic loss in (18), so that $\Omega(x - v)$ is its gradient. The SME for the lasso regression is (formally)
$$\widehat{M} \, dX_t \in -\Omega (X_t - v) \, dt - \beta A^\top \mathrm{sign}(A X_t) \, dt + \sqrt{\epsilon / B} \, \Sigma^{1/2}(X_t) \, dW_t.$$
The direct simulation of these stochastic equations has a high computational burden because of the matrix square root of $\Sigma(x)$, so our tests are restricted to the dimension $d = 3$.

The matrix $A$ is a scaled Hilbert matrix; $\sigma_\zeta$ and $\beta$ are fixed positive constants, and $v$ is an equally spaced (linspace) vector of length $d$. The initial value $X_0 = x_0$ is the zero vector and $z_0 = A x_0$. In the algorithms, we set $c = 1$. We choose the test function $\varphi(x) = \sum_{i=1}^d x_{(i)}$. Denote $\varphi_k = \varphi(x_k)$, where $x_k$ is the sequence computed from the (unified) stochastic ADMM with batch size $B$, and $\Phi_{k\epsilon} = \varphi(X_{k\epsilon})$, where $X_t$ is the solution of the SME. Let $\alpha = 1$,
$\omega_1 = 1$, $\omega_2 = 1$, and $T = 40$. We first show in Figure 9 the mean of $\varphi_k$ and $\Phi_{k\epsilon}$ versus the time $t_k = k\epsilon$ for a fixed $\epsilon$. To test the match of the fluctuation, we plot in Figure 10 the sequences $\epsilon^{-1/2}\,\mathrm{std}(\varphi_k)$ and $\epsilon^{-1/2}\,\mathrm{std}(\Phi_k)$ for three different values of $\epsilon = 2^{-m} T$ with $m = 6, 7, 8$.

[Figure 10: The rescaled std of $\varphi(x_k)$ from sADMM and $\varphi(X_{k\epsilon})$ from the SME, for $g(z) = \beta \|z\|_2^2$ and $\epsilon = 2^{-6}, 2^{-7}, 2^{-8}$. The results are based on 400 independent runs.]
4. Conclusion
In this paper, we have used the stochastic modified equation (SME) to analyze the dynamics of stochastic ADMM in the large-$\rho$ limit (i.e., the small step-size limit $\epsilon \to 0$). The SME is a first-order weak approximation to a general family of stochastic ADMM algorithms, including the standard, linearized and gradient-based ADMM with relaxation.

Our new continuous-time analysis is the first such analysis of the stochastic versions of ADMM. It faithfully captures the fluctuation of the stochastic ADMM solution and provides a mathematically clear and insightful way to understand the dynamics of stochastic ADMM algorithms. It substantially complements the existing ODE-based continuous-time analysis (França et al., 2018; Yuan et al., 2019) for the deterministic ADMM. It is also an important milestone for understanding the continuous-time limits of stochastic algorithms other than stochastic gradient descent (SGD), as we observed new phenomena such as the joint fluctuation of $x$, $z$ and $r$. We provide solid numerical experiments verifying our theory on several examples, including smooth functions such as quadratic functions and non-smooth functions such as the $\ell_1$ norm.
5. Future Work
There are a few natural directions to explore further in the future.

First, on the theoretical side, for simplicity of the analysis we derived our mathematical proofs based on the smoothness of $f$ and $g$. As we observed empirically, for non-smooth functions such as the $\ell_1$ norm, our continuous-time limit framework would lead to a stochastic differential inclusion. A natural follow-up of this work would be to develop formal mathematical tools for stochastic differential inclusions to extend our proofs to non-smooth functions.

Second, from our stochastic differential equation we could develop practical rules to choose an adaptive step size $\epsilon$ and batch size by precisely computing the optimal diffusion-fluctuation trade-off, so as to accelerate the convergence of stochastic ADMM.

References
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

E, W., Ma, C., and Wu, L. Machine learning from a continuous viewpoint. 2019.

Eckstein, J. and Bertsekas, D. P. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293-318, 1992.

França, G., Robinson, D. P., and Vidal, R. ADMM and accelerated ADMM as continuous dynamical systems. In Proceedings of the 35th International Conference on Machine Learning, pp. 1559-1567, 2018.

Freidlin, M. I. and Wentzell, A. D. Random Perturbations of Dynamical Systems. Grundlehren der mathematischen Wissenschaften. Springer-Verlag, New York, 3rd edition, 2012.

Goldfarb, D., Ma, S., and Scheinberg, K. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, 141(1-2):349-382, 2013.

Huang, F., Chen, S., and Huang, H. Faster stochastic alternating direction method of multipliers for nonconvex optimization. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2839-2848, Long Beach, California, USA, 2019. PMLR. URL http://proceedings.mlr.press/v97/huang19a.html.

Kloeden, P. and Platen, E. Numerical Solution of Stochastic Differential Equations. Stochastic Modelling and Applied Probability. Springer, New York, corrected edition, 2011. ISBN 9783662126165. URL https://books.google.com.hk/books?id=r9r6CAAAQBAJ.

Li, Q., Tai, C., and E, W. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pp. 3306-3340. International Machine Learning Society (IMLS), 2017.

Li, Q., Tai, C., and E, W. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations. Journal of Machine Learning Research, 20(40):1-47, 2019.

Milstein, G. N. Numerical Integration of Stochastic Differential Equations, volume 313 of Mathematics and Its Applications. Springer, 1995. ISBN 9780792332138. URL https://books.google.com.hk/books?id=o2y8Or_a4W0C.

Milstein, G. N. Weak approximation of solutions of systems of stochastic differential equations. Theory of Probability & Its Applications, 30(4):750-766, 1986. doi: 10.1137/1130095. URL https://doi.org/10.1137/1130095.

Ouyang, H., He, N., Tran, L., and Gray, A. Stochastic alternating direction method of multipliers. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 80-88, Atlanta, Georgia, USA, 2013. PMLR. URL http://proceedings.mlr.press/v28/ouyang13.html.

Su, W., Boyd, S., and Candès, E. J. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1-43, 2016.

Suzuki, T. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In International Conference on Machine Learning, pp. 392-400, 2013.

Wang, H. and Banerjee, A. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, pp. 1699-1706, 2012. ISBN 9781450312851.

Yuan, H., Zhou, Y., Li, C. J., and Sun, Q. Differential inclusions for modeling nonsmooth ADMM variants: A continuous limit theory. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7232-7241, Long Beach, California, USA, 2019. PMLR. URL http://proceedings.mlr.press/v97/yuan19c.html.

Zhong, W. and Kwok, J. Fast stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pp. 46-54, 2014.
Appendix: Stochastic Modified Equations for Continuous Limit of Stochastic ADMM
A. Weak Approximation and Stochastic Modified Equations
We introduce and review the concepts for the weak approximation and the stochastic modified equation.
Definition 4 (weak convergence). We say that the family (parametrized by $\epsilon$) of stochastic sequences $\{x^\epsilon_k : k \geq 0\}$, $\epsilon > 0$, weakly converges to (or is a weak approximation to) a family of continuous-time Itô processes $\{X^\epsilon_t : t \in \mathbb{R}_+\}$ with order $p$ if the following condition holds: for any time horizon $T > 0$ and for any test function $\varphi$ such that $\varphi$ and its partial derivatives up to order $p + 2$ belong to $F$, there exist a constant $C > 0$ and $\epsilon_0 > 0$ such that for any $\epsilon < \epsilon_0$,
$$\max_{0 \leq k \leq \lfloor T/\epsilon \rfloor} \big| \mathbb{E} \varphi(X^\epsilon_{k\epsilon}) - \mathbb{E} \varphi(x^\epsilon_k) \big| \leq C \epsilon^p. \qquad (19)$$
The constants $C$ and $\epsilon_0$ in the above inequality are independent of $\epsilon$, but may depend on $T$ and $\varphi$. In the conventional applications to numerical methods for SDEs (Milstein, 1995), $X^\epsilon$ may not depend on $\epsilon$; for the stochastic modified equation in our problem, $X^\epsilon$ does depend on $\epsilon$. We drop the superscript $\epsilon$ in $x^\epsilon_k$ and $X^\epsilon_t$ for notational ease whenever there is no ambiguity.

The idea of using the weak approximation and the stochastic modified equation was originally proposed by (Li et al., 2017), and is based on an important theorem due to (Milstein, 1986). In brief, Milstein's theorem links the one-step difference, detailed below, to the global approximation in the weak sense, by checking three conditions on the moments of the one-step difference. Since we only consider the first-order weak approximation, Milstein's theorem is introduced below in a simplified form for $p = 1$ only. More general statements can be found in Theorem 5 of (Milstein, 1986), Theorem 9.1 of (Milstein, 1995) and Theorem 14.5.2 of (Kloeden & Platen, 2011).

Let the stochastic sequence $\{x_k\}$ be recursively defined by an iteration written in the form associated with a function $\mathcal{A}(\cdot, \cdot, \cdot)$:
$$x_{k+1} = x_k - \epsilon \, \mathcal{A}(\epsilon, x_k, \xi_{k+1}), \quad k \geq 0, \qquad (20)$$
where $\{\xi_k : k \geq 1\}$ are i.i.d. random variables and $x_0 = x \in \mathbb{R}^d$. Define the one-step difference $\bar{\Delta} = x_1 - x_0$. We use the parenthetical subscript to denote the components of a vector, e.g. $\bar{\Delta} = (\bar{\Delta}_{(i)}, 1 \leq i \leq d)$.

Assume that there exists a function $K_1(x) \in F$ such that $\bar{\Delta}$ satisfies the fourth-moment bound
$$\big| \mathbb{E}\big( \bar{\Delta}_{(i)} \bar{\Delta}_{(j)} \bar{\Delta}_{(m)} \bar{\Delta}_{(l)} \big) \big| \leq K_1(x) \, \epsilon^2 \qquad (21)$$
for any component indices $i, j, m, l \in \{1, 2, \ldots, d\}$ and any $x \in \mathbb{R}^d$.

For an arbitrary $\epsilon > 0$, consider the family of Itô processes $X^\epsilon_t$ defined by a stochastic differential equation whose noise depends on the parameter $\epsilon$,
$$dX_t = b(X_t) \, dt + \sqrt{\epsilon} \, \sigma(X_t) \, dW_t, \qquad (22)$$
where $W_t$ is the standard Wiener process in $\mathbb{R}^d$, the initial value is $X_0 = x_0 = x$, and the coefficient functions $b$ and $\sigma$ satisfy certain standard conditions; see (Milstein, 1995). Define the one-step difference $\Delta = X^\epsilon_\epsilon - x_0$ for the SDE (22).

Theorem 5 (Milstein's weak convergence theorem). If there exist a constant $K_0$ and a function $K_2(x) \in F$ such that the following conditions on the first three moments of the error $\Delta - \bar{\Delta}$ hold:
$$\big| \mathbb{E}(\Delta_{(i)}) - \mathbb{E}(\bar{\Delta}_{(i)}) \big| \leq K_0 \, \epsilon^2, \qquad (23a)$$
$$\big| \mathbb{E}(\Delta_{(i)} \Delta_{(j)}) - \mathbb{E}(\bar{\Delta}_{(i)} \bar{\Delta}_{(j)}) \big| \leq K_2(x) \, \epsilon^2, \qquad (23b)$$
$$\big| \mathbb{E}(\Delta_{(i)} \Delta_{(j)} \Delta_{(l)}) - \mathbb{E}(\bar{\Delta}_{(i)} \bar{\Delta}_{(j)} \bar{\Delta}_{(l)}) \big| \leq K_2(x) \, \epsilon^2, \qquad (23c)$$
for any $i, j, l \in \{1, \ldots, d\}$ and any $x \in \mathbb{R}^d$, then $\{x_k\}$ weakly converges to $\{X_t\}$ with order 1.

In light of the above theorem, we will call equation (22) the stochastic modified equation (SME) of the iterative scheme (20).
For the SDE (22) at the small noise level $\epsilon$, by the Itô-Taylor expansion it is well known that $\mathbb{E}\Delta = b(x)\epsilon + O(\epsilon^2)$, $\mathbb{E}[\Delta\Delta^\top] = \big( b(x) b(x)^\top + \sigma(x)\sigma(x)^\top \big)\epsilon^2 + O(\epsilon^3)$ and $\mathbb{E}\big( \prod_{m=1}^s \Delta_{(i_m)} \big) = O(\epsilon^3)$ for all integers $s \geq 3$ and component indices $i_m = 1, \ldots, d$. Refer to (Kloeden & Platen, 2011) and Lemma 1 in (Li et al., 2017). So, the main recipe for applying Milstein's theorem is to examine the moment conditions for the discrete one-step difference $\bar{\Delta} = x_1 - x_0$.

One prominent work (Li et al., 2017) uses the SME as a weak approximation to understand the dynamical behaviour of stochastic gradient descent (SGD). The prominent advantage of this technique is that the fluctuation in the SGD iteration can be well captured by the fluctuation in the SME. Here is the result in brief. For the minimization problem $\min_{x \in \mathbb{R}^d} f(x) = \mathbb{E}_\xi f(x, \xi)$, the SGD iteration is $x_{k+1} = x_k - \epsilon f'(x_k, \xi_{k+1})$ with step size $\epsilon$; then by Theorem 5, the corresponding SME of first-order approximation is
$$dX_t = -f'(X_t) \, dt + \sqrt{\epsilon} \, \sigma(X_t) \, dW_t, \qquad (24)$$
with $\sigma(x) = \mathrm{std}_\xi(f'(x, \xi)) = \big( \mathbb{E}\big[ (f'(x) - f'(x, \xi))^2 \big] \big)^{1/2}$. Details can be found in (Li et al., 2017). The SGD here is analogous to the forward-time Euler-Maruyama approximation, since $\mathcal{A}(\epsilon, x, \xi) = f'(x, \xi)$.
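To make (24) concrete, here is a minimal sketch (ours) for the one-dimensional quadratic loss $f(x, \xi) = \frac{1}{2}(x - \xi)^2$ with $\xi \sim N(0, 1)$, for which $f'(x) = x$ and $\sigma(x) \equiv 1$, so the SME (24) becomes the Ornstein-Uhlenbeck equation $dX_t = -X_t \, dt + \sqrt{\epsilon} \, dW_t$; this toy example is an assumption for illustration, not one used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
eps, T, n_paths = 0.01, 1.0, 20000
K, dt = int(T / eps), 0.01          # one Euler-Maruyama step of size dt = eps per SGD step

# SGD for f(x, xi) = 0.5 (x - xi)^2, xi ~ N(0, 1):  x_{k+1} = x_k - eps (x_k - xi)
x = np.ones(n_paths)
for _ in range(K):
    x -= eps * (x - rng.normal(size=n_paths))

# Euler-Maruyama for the SME (24): dX = -X dt + sqrt(eps) dW
X = np.ones(n_paths)
for _ in range(K):
    X += -X * dt + np.sqrt(eps) * np.sqrt(dt) * rng.normal(size=n_paths)

# Weak approximation: the means (close to exp(-T)) and the standard deviations
# (of order sqrt(eps)) of the two ensembles agree up to O(eps).
print(x.mean(), X.mean())
print(x.std(), X.std())
```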
B. Proof of main theorems

The one-step difference is important for the weak convergence of the discrete scheme (5). The question is: for one single iteration, from step $k$ to step $k+1$, what is the order of the change of the states $(x, z, u)$? For notational ease, we drop the random variable $\xi_{k+1}$ in the scheme (5); the reader should bear in mind that $f$ and its derivatives involve $\xi$.

We work on the general ADMM scheme (5). The optimality conditions for the scheme (5) are
$$\omega_1 \epsilon f'(x_k) + (1 - \omega_1) \epsilon f'(x_{k+1}) + \epsilon A^\top \lambda_k + A^\top \big( \omega_2 A x_k + (1 - \omega_2) A x_{k+1} - z_k \big) + c (x_{k+1} - x_k) = 0, \qquad (25a)$$
$$\epsilon g'(z_{k+1}) = \epsilon \lambda_k + \alpha A x_{k+1} + (1 - \alpha) z_k - z_{k+1}, \qquad (25b)$$
$$\epsilon \lambda_{k+1} = \epsilon \lambda_k + \alpha A x_{k+1} + (1 - \alpha) z_k - z_{k+1}. \qquad (25c)$$
Note that due to (25b) and (25c), the last condition (25c) can be replaced by $\lambda_{k+1} = g'(z_{k+1})$. So, without loss of generality, one can assume that
$$\lambda_{k'} \equiv g'(z_{k'}) \qquad (26)$$
for any integer $k' \geq 1$. The optimality conditions (25) can now be written only in the variables $(x, z)$:
$$\omega_1 \epsilon f'(x_k) + (1 - \omega_1) \epsilon f'(x_{k+1}) + \epsilon A^\top g'(z_k) + A^\top \big( \omega_2 A x_k + (1 - \omega_2) A x_{k+1} - z_k \big) + c (x_{k+1} - x_k) = 0, \qquad (27a)$$
$$\epsilon g'(z_{k+1}) - \epsilon g'(z_k) = \alpha A x_{k+1} + (1 - \alpha) z_k - z_{k+1}. \qquad (27b)$$
As $\epsilon \to 0$, we seek the asymptotic expansion of $x_{k+1} - x_k$ from (27a) and the asymptotic expansion of $z_{k+1} - z_k$ from (27b). The first result is that
$$x_{k+1} - x_k = -M^{-1} A^\top r_k + c_k \epsilon, \qquad (28a)$$
$$z_{k+1} - z_k = \alpha \big( I - A M^{-1} A^\top \big) r_k + c'_k \epsilon, \qquad (28b)$$
where $r_k$ is the residual
$$r_k := A x_k - z_k \qquad (29)$$
and the matrix $M$ is
$$M = M_{c, \omega_2} := c I + (1 - \omega_2) A^\top A. \qquad (30)$$
The constants $c_k$ and $c'_k$ are independent of $\epsilon$ but depend on $f'$, $g'$ and the other parameters $\alpha, \omega_1, \omega_2$. Throughout the rest of the paper, we use the notation $O(\epsilon^p)$ to denote terms of the form $c_k \epsilon^p$, for $p = 1, 2, \ldots$. Given any input $(x_k, z_k)$, since $r_k = A x_k - z_k$ may not be zero, (28a) and (28b) show that $(x_{k+1}, z_{k+1})$ does not converge to $(x_k, z_k)$ as the step size $\epsilon \to 0$. However, we can show that the residual after one iteration, $r_{k+1}$, is always a small quantity of order $O(\epsilon)$, so that the consistency condition, that $(x_{k+1}, z_{k+1})$ tends to $(x_k, z_k)$ as $\epsilon \to 0$, holds.

Proposition 6.
We have the following property for the propagation of the residual:
$$r_{k+1} = (1 - \alpha)\big( I - A M^{-1} A^\top \big) r_k + O(\epsilon). \qquad (31)$$
Proof. By using (27b) and (28b),
$$r_{k+1} = A x_{k+1} - z_{k+1} = \Big( \frac{1}{\alpha} - 1 \Big)(z_{k+1} - z_k) + \frac{\epsilon}{\alpha} \big( g'(z_{k+1}) - g'(z_k) \big) = (1 - \alpha)\big( I - A M^{-1} A^\top \big) r_k + O(\epsilon).$$

Remark 4. If $\alpha = 1$, the leading term $(1 - \alpha)(I - A M^{-1} A^\top)$ vanishes. There are also some special cases where the matrix $I - A M^{-1} A^\top$ itself is zero: (1) $A$ is an invertible square matrix and $M = M_{0, 0} = A^\top A$; (2) $A$ is an orthogonal matrix ($A A^\top = A^\top A = I$) and the constants satisfy $\omega_2 = c$, so that $M = I$.

The above proposition is for an arbitrary residual $r_k$ as the input of one step of the iteration. If we choose $r_0 = 0$ at the initial step by setting $z_0 = A x_0$, then Proposition 6 shows that $r_1 = A x_1 - z_1$ becomes $O(\epsilon)$ after one iteration. In fact, with the assumption $\alpha = 1$, we can show by mathematical induction that $r_{k'}$, for all $k' \geq 1$, can be reduced to the order $\epsilon^2$.

Proposition 7. If $r_k = O(\epsilon)$, then
$$r_{k+1} = \big( 1 - \alpha + \epsilon \alpha g''(z_k) \big)\big( r_k + A(x_{k+1} - x_k) \big) + O(\epsilon^2). \qquad (32)$$
If $\alpha = 1$, equation (32) reduces to the second-order smallness
$$r_{k+1} = \epsilon g''(z_k)\big( r_k + A(x_{k+1} - x_k) \big) + O(\epsilon^2) = O(\epsilon^2). \qquad (33)$$

Proof.
Since $r_k = A x_k - z_k = O(\epsilon)$, the one-step differences $x_{k+1} - x_k$ and $z_{k+1} - z_k$ are both of order $O(\epsilon)$ because of (28a) and (28b). We solve $\delta z := z_{k+1} - z_k$ from (27b) by linearizing the implicit term $g'(z_{k+1})$, under the assumption that the third-order derivative of $g$ exists:
$$\epsilon g''(z_k) \, \delta z + \epsilon \, O((\delta z)^2) + \delta z = \alpha \big( r_k + A \delta x \big),$$
where $\delta x := x_{k+1} - x_k$. Then, since $\epsilon \, O((\delta z)^2) = O(\epsilon^3)$, the expansion of $\delta z = z_{k+1} - z_k$ in $\epsilon$ is
$$z_{k+1} - z_k = \delta z = \alpha \big( 1 - \epsilon g''(z_k) \big)\big( r_k + A \delta x \big) + O(\epsilon^3). \qquad (34)$$
Then
$$r_{k+1} = r_k + A(x_{k+1} - x_k) - (z_{k+1} - z_k) = \big( 1 - \alpha + \epsilon \alpha g''(z_k) \big)\big( r_k + A(x_{k+1} - x_k) \big) + O(\epsilon^3) = (1 - \alpha)\big( r_k + A(x_{k+1} - x_k) \big) + \epsilon \alpha g''(z_k)\big( r_k + A(x_{k+1} - x_k) \big) + O(\epsilon^3).$$

Remark 5. Equation (32) suggests that $r_{k+1} = (1 - \alpha) r_k + O(\epsilon)$. So the condition for the convergence $r_k \to 0$ as $k \to \infty$ is $|1 - \alpha| < 1$, which matches the range $\alpha \in (0, 2)$ used in the relaxation scheme.

Now, with the assumption $z_0 = A x_0$ at the initial time, the above analysis shows that $r_k$ is $O(\epsilon)$ and the one-step differences $x_{k+1} - x_k$ and $z_{k+1} - z_k$ are of order $O(\epsilon)$ by (28). We shall pursue a more accurate expansion of the one-step difference $x_{k+1} - x_k$ than (28). Write $f'(x_{k+1}) = f'(x_k) + f''(x_k)(x_{k+1} - x_k) + O((x_{k+1} - x_k)^2)$ in equations (27). The asymptotic analysis gives the result below.

Proposition 8. As $\epsilon \to 0$, the expansion of the one-step difference $x_{k+1} - x_k$ is
$$M(x_{k+1} - x_k) = -A^\top r_k - \epsilon \big( f'(x_k) + A^\top g'(z_k) \big) + \epsilon^2 (1 - \omega_1) f''(x_k) M^{-1} \Big( f'(x_k) + A^\top g'(z_k) + \frac{1}{\epsilon} A^\top r_k \Big) + O(\epsilon^3). \qquad (35)$$

This expression does not contain the parameter $\alpha$ explicitly, but the residual $r_k = A x_k - z_k$ significantly depends on $\alpha$ (see Proposition 7). If $\alpha = 1$, then $r_k$ is of order $\epsilon^2$, which hints that there is no contribution from $r_k$ toward the weak approximation of $x_k$ at order 1. But for the relaxation case where $\alpha \neq 1$, $r_k$ contains a first-order term coming from $z_{k+1} - z_k$.

To obtain a second-order smallness for some "residual" in the relaxed scheme where $\alpha \neq 1$, we need a new definition, the $\alpha$-residual, to account for the gap induced by $\alpha$. Motivated by (25b), we first define
$$r^\alpha_{k+1} := \alpha A x_{k+1} + (1 - \alpha) z_k - z_{k+1}. \qquad (36)$$
It is connected to the original residuals $r_{k+1}$ and $r_k$, since it is easy to check that
$$r^\alpha_{k+1} = \alpha r_{k+1} + (\alpha - 1)(z_{k+1} - z_k) = \alpha r_k + \alpha A(x_{k+1} - x_k) - (z_{k+1} - z_k). \qquad (37)$$
But $r^\alpha_{k+1}$ in fact involves information at two successive steps. Obviously, when $\alpha = 1$, this $\alpha$-residual $r^\alpha$ is the original residual $r = Ax - z$. In our proof, we need a modified $\alpha$-residual, denoted by
$$\widehat{r}^\alpha_{k+1} := \alpha r_k + (\alpha - 1)(z_{k+1} - z_k). \qquad (38)$$
We can show that both $r^\alpha_{k+1}$ and $\widehat{r}^\alpha_{k+1}$ are as small as $O(\epsilon^2)$ as $\epsilon$ tends to zero.

Proposition 9. $r^\alpha_{k+1} = O(\epsilon^2)$ and $\widehat{r}^\alpha_{k+1} = O(\epsilon^2)$.

Proof. In fact, (34) reads $z_{k+1} - z_k = \alpha (1 - \epsilon g''(z_k))( r_k + A(x_{k+1} - x_k)) + O(\epsilon^3)$.
By the second equality of (37), (34) becomes
$$z_{k+1} - z_k = \big( 1 - \epsilon g''(z_k) \big)\big( r^\alpha_{k+1} + z_{k+1} - z_k \big) + O(\epsilon^3),$$
i.e.,
$$r^\alpha_{k+1} = \epsilon \big( 1 + \epsilon g''(z_k) \big) g''(z_k)(z_{k+1} - z_k) + O(\epsilon^3) = \epsilon g''(z_k)(z_{k+1} - z_k) + O(\epsilon^3),$$
which is $O(\epsilon^2)$ since $z_{k+1} - z_k = O(\epsilon)$.

The difference between $(z_{k+1} - z_k)$ and $(z_{k+2} - z_{k+1})$ is of order $\epsilon^2$, due to the truncation error of the central difference scheme. Then we have the conclusion $\alpha r_{k+1} + (\alpha - 1)(z_{k+2} - z_{k+1}) = O(\epsilon^2)$, i.e.,
$$\widehat{r}^\alpha_{k+1} = \alpha r_k + (\alpha - 1)(z_{k+1} - z_k) = O(\epsilon^2), \qquad (39)$$
by shifting the subscript $k$ by one.

Corollary 10.
$$r_k = \Big( \frac{1}{\alpha} - 1 \Big)(z_{k+1} - z_k) + O(\epsilon^2) = \Big( \frac{1}{\alpha} - 1 \Big) A(x_{k+1} - x_k) + O(\epsilon^2), \qquad (40)$$
and it follows that $z_{k+1} - z_k = A(x_{k+1} - x_k) + O(\epsilon^2)$.

Proof. By (38) and the above proposition, we have $r_k = \big( \frac{1}{\alpha} - 1 \big)(z_{k+1} - z_k) + O(\epsilon^2)$. Furthermore, due to (34),
$$r_k = \Big( \frac{1}{\alpha} - 1 \Big)(z_{k+1} - z_k) + O(\epsilon^2) = (1 - \alpha)\big( r_k + A(x_{k+1} - x_k) \big) + O(\epsilon^2),$$
which gives
$$r_k = \Big( \frac{1}{\alpha} - 1 \Big) A(x_{k+1} - x_k) + O(\epsilon^2).$$

Proof of Theorem 2.
Combining Proposition 8 and Corollary 10, and noting the Taylor expansion of $g'(z_k)$,
$$g'(z_k) = g'(A x_k - r_k) = g'(A x_k) + O(\epsilon)$$
since $r_k = O(\epsilon)$, and putting the random variable $\xi$ back into $f'$, we have
$$M(x_{k+1} - x_k) = -\epsilon \big( f'(x_k, \xi_{k+1}) + A^\top g'(A x_k) \big) - \Big( \frac{1}{\alpha} - 1 \Big) A^\top A (x_{k+1} - x_k) + O(\epsilon^2). \qquad (41)$$
For convenience, introduce the matrix
$$\widehat{M} := M + \frac{1 - \alpha}{\alpha} A^\top A = c I + \Big( \frac{1}{\alpha} - \omega_2 \Big) A^\top A, \qquad (42)$$
and let $\widehat{x}_k := \widehat{M} x_k$ and $\delta \widehat{x}_{k+1} := \widehat{M}(x_{k+1} - x_k)$. Then
$$\delta \widehat{x} = -\epsilon V'(x, \xi) + \epsilon^2 \big( (1 - \omega_1) f'' M^{-1} V'(x) - A^\top \theta \big) + O(\epsilon^3),$$
where $\theta$ denotes a bounded term. The final step is to compute the moments in Milstein's theorem (Theorem 5) as follows.

(i)
$$\mathbb{E}[\delta \widehat{x}] = -\epsilon \, \mathbb{E} V'(x, \xi) + O(\epsilon^2) = -\epsilon V'(x) + O(\epsilon^2). \qquad (43)$$

(ii)
$$\mathbb{E}[\delta \widehat{x} \, \delta \widehat{x}^\top] = \epsilon^2 \, \mathbb{E}\Big( \big[ f'(x, \xi) + A^\top g'(Ax) \big] \big[ f'(x, \xi) + A^\top g'(Ax) \big]^\top \Big) + O(\epsilon^3) = \epsilon^2 \, V'(x) V'(x)^\top + \epsilon^2 \, \mathbb{E}\big[ (f'(x, \xi) - f'(x)) (f'(x, \xi) - f'(x))^\top \big] + O(\epsilon^3).$$

(iii) It is trivial that $\mathbb{E}\big[ \prod_{j=1}^s \delta \widehat{x}_{(i_j)} \big] = O(\epsilon^3)$ for $s \geq 3$ and $i_j = 1, \ldots, d$.

So Theorem 2 is proved.

Proof of Theorem 1.