Compressive Conjugate Directions: Linear Theory
Musa Maharramov and Stewart A. Levin

(∗ This work was supported by the Stanford Exploration Project. † Department of Geophysics, Stanford University, 397 Panama Mall, Stanford, CA 94305 ([email protected], [email protected]).)

Key words. L1-regularization, total-variation regularization, regularized inversion, ADMM, method of multipliers, Conjugate Gradients, compressive conjugate directions

AMS subject classifications.
Abstract.
We present a powerful and easy-to-implement iterative algorithm for solving large-scale optimization problems that involve L1/total-variation (TV) regularization. The method is based on combining the Alternating Directions Method of Multipliers (ADMM) with a Conjugate Directions technique in a way that allows reusing conjugate search directions constructed by the algorithm across multiple iterations of the ADMM. The new method achieves fast convergence by trading off multiple applications of the modeling operator for the increased memory requirement of storing previous conjugate directions. We illustrate the new method with a series of imaging and inversion applications.
1. Introduction.
We address a class of regularized least-squares fitting problems of the form

(1.1)   \|Bu\|_1 + \frac{\alpha}{2}\|Au - d\|_2^2 \to \min, \quad u \in \mathbb{R}^N, \; d \in \mathbb{R}^M, \; A: \mathbb{R}^N \to \mathbb{R}^M, \; B: \mathbb{R}^N \to \mathbb{R}^K, \; K \le N,

where d is a known vector (data), u a vector of unknowns (sometimes referred to as the "model"), and A, B are linear operators. If B is the identity map, then problem (1.1) is a least-squares fitting with L1 regularization,

(1.2)   \|u\|_1 + \frac{\alpha}{2}\|Au - d\|_2^2 \to \min.

If the unknown vector u is the discretization of a function, and B is the first-order finite-difference operator (Bu)_i = u_{i+1} - u_i, i = 1, 2, \ldots, N-1, then problem (1.1) turns into a least-squares fitting with a total-variation (TV) regularization,

(1.3)   \|\nabla u\|_1 + \frac{\alpha}{2}\|Au - d\|_2^2 \to \min.

On the one hand, in (1.2) we seek a model vector u such that forward-modeled data Au match observed data d in the least-squares sense, while imposing sparsity-promoting L1 regularization. In (1.3), on the other hand, we impose blockiness-promoting total-variation (TV) regularization. Note that rather than using a regularization parameter as a coefficient of the regularization term, we use a data-fitting weight α. TV regularization (also known as the Rudin-Osher-Fatemi, or ROF, model [36]) acts as a form of "model styling" that helps to preserve sharp contrasts and boundaries in the model even when the spectral content of the input data has limited resolution.

L1-TV regularized least-squares fitting, a key tool in imaging and de-noising applications (see, e.g., [36, 10, 42, 26]), is beginning to play an increasingly important role in applications where the modeling operator A in (1.1) is computationally challenging to apply. In particular, in seismic imaging problems of exploration geophysics, such as full-waveform inversion [39, 16], modeling of seismic wave propagation in a three-dimensional medium from multiple seismic sources is by far the greatest contributor to the computational cost of inversion, and reduction of the number of applications of the operator A is key to success in practical applications.

L1-regularized least-squares problems can be reduced to inequality-constrained quadratic programs and solved using interior-point methods based on, e.g., Newton [7] or nonlinear Conjugate Gradients [26] methods. Alternatively, the resulting bound-constrained quadratic programs can be solved using gradient projection [17] or projected Conjugate Gradients [33]. A conceptually different class of techniques for solving L1-regularized least-squares problems is based on homotopy methods [23, 15, 31].

Another class of methods for solving (1.1) that merits a special mention applies splitting schemes for the sum of two operators. For example, the iterative shrinking-thresholding algorithm (ISTA) is based on applying forward-backward splitting [8, 32] to solving the L1-regularized problem (1.2) by gradient descent [4, 11, 12]:

(1.4)   y_{k+1} = u_k - \gamma \alpha A^T (A u_k - d), \qquad u_{k+1} = \mathrm{shrink}\{ y_{k+1}, \gamma \},

where γ > 0, and the soft thresholding or shrinkage operator is the Moreau resolvent (see, e.g., [1]) of \partial \gamma \|u\|_1,

(1.5)   \mathrm{shrink}\{ y, \gamma \} = \left( 1 + \partial \gamma \|\cdot\|_1 \right)^{-1} y = \mathrm{argmin}_x \left\{ \gamma \|x\|_1 + \frac{1}{2}\|y - x\|_2^2 \right\} = \frac{y}{|y|} \max\left( |y| - \gamma, 0 \right),

where ∂ = ∂_u denotes the subgradient [34, 1], and the absolute value of a vector is computed component-wise.
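The shrinkage operator (1.5) and the ISTA update (1.4) are straightforward to implement. The following is a minimal sketch (not the authors' code), assuming a dense matrix A and a step γ not exceeding 1/(α‖A‖₂²); all names are illustrative.

```python
# Soft-thresholding (1.5) and one ISTA iteration (1.4); illustrative sketch only.
import numpy as np

def shrink(y, gamma):
    """Component-wise soft thresholding: argmin_x gamma*||x||_1 + 0.5*||y - x||_2^2."""
    return np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)

def ista_step(u, A, d, alpha, gamma):
    """One forward-backward (ISTA) iteration for ||u||_1 + (alpha/2)*||A u - d||_2^2."""
    y = u - gamma * alpha * A.T @ (A @ u - d)   # gradient step on the smooth term
    return shrink(y, gamma)                      # proximal (shrinkage) step

# toy usage
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
d = rng.standard_normal(50)
u = np.zeros(100)
gamma = 1.0 / (np.linalg.norm(A, 2) ** 2)        # safe step for alpha = 1
for _ in range(200):
    u = ista_step(u, A, d, alpha=1.0, gamma=gamma)
```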
The typically slow convergence of the first-order method (1.4) can be accelerated by an over-relaxation step [29], resulting in the Fast ISTA algorithm (FISTA) [3]:

(1.6)   y_{k+1} = u_k - \gamma \alpha A^T (A u_k - d), \quad z_{k+1} = \mathrm{shrink}\{ y_{k+1}, \gamma \}, \quad \zeta_{k+1} = \frac{1 + \sqrt{1 + 4\zeta_k^2}}{2}, \quad u_{k+1} = z_{k+1} + \frac{\zeta_k - 1}{\zeta_{k+1}} \left( z_{k+1} - z_k \right),

where ζ_1 = 1 and γ is sufficiently small.

It is important to note that algorithm (1.6) is applied to the L1-regularized problem (1.2), not the TV-regularized problem (1.3). An accelerated algorithm for solving a TV-regularized denoising problem (with A = I in (1.3)) was proposed in [2] and applied the Nesterov relaxation [29] to solving the dual of the TV-regularized denoising problem [9]. However, using a similar approach to solving (1.3) with a non-trivial operator A results in accelerated schemes that still require inversion of A [2, 21] (in [2] inversion of A is replaced by a single gradient descent; however, over-relaxation is applied to the dual variable) and thus lack the primary appeal of the accelerated gradient-descent methods—i.e., a single application of A and its transpose per iteration. The advantage of (1.6) compared with simple gradient descent is that Nesterov's over-relaxation step requires storing two previous solution vectors and provides an improved search direction for minimization. Note, however, that the step length γ is inversely proportional to the Lipschitz constant of \alpha A^T (Au - d) [3] and may be small in practice.

A very general approach to solving problems (1.1) involving either L1 or TV regularization is provided by primal-dual methods. For example, in the TV-regularized least-squares problem (1.3), by substituting

(1.7)   z = Bu

and adding (1.7) as a constraint, we obtain an equivalent equality-constrained optimization problem

(1.8)   \|z\|_1 + \frac{\alpha}{2}\|Au - d\|_2^2 \to \min, \quad z = Bu.

The optimal solution of (1.8) corresponds to the saddle point of its Lagrangian

(1.9)   L_0(u, z, \mu) = \|z\|_1 + \frac{\alpha}{2}\|Au - d\|_2^2 + \mu^T (z - Bu),

that can be found by the Uzawa method [41]. The Uzawa method finds the saddle point by alternating a minimization with respect to the primal variables u, z and ascent over the dual variable μ for the objective function equal to the standard Lagrangian (1.9), L = L_0,

(1.10)   (u_{k+1}, z_{k+1}) = \mathrm{argmin}\; L(u, z, \mu_k), \qquad \mu_{k+1} = \mu_k + \lambda \left[ z_{k+1} - B u_{k+1} \right]

for some positive step size λ. Approach (1.10), when applied to the Augmented Lagrangian [35], L = L_+,

(1.11)   L_+(u, z, \mu) = \|z\|_1 + \frac{\alpha}{2}\|Au - d\|_2^2 + \mu^T (z - Bu) + \frac{\lambda}{2}\|z - Bu\|_2^2,

results in the method of multipliers [25]. For problems (1.1) all these methods still require joint minimization with respect to u and z of some objective function that includes both \|z\|_1 and a smooth function of u. Splitting the joint minimization into separate steps of minimization with respect to u, followed by minimization with respect to z, results in the Alternating-Directions Method of Multipliers (ADMM) [20, 18, 19, 14, 6]. To establish a connection to the splitting techniques applied to the sum of two operators, we note that the ADMM is equivalent to applying the Douglas-Rachford splitting [13] to the problem

(1.12)   \partial \left[ \|Bu\|_1 + \frac{\alpha}{2}\|Au - d\|_2^2 \right] \ni 0,

where ∂ is the subgradient, and problem (1.12) is equivalent to (1.1).
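For later comparison, the FISTA recursion (1.6) can be sketched as follows (a minimal illustration, not the authors' implementation; γ is assumed to satisfy the usual step-size restriction γ ≤ 1/(α‖A‖₂²)).

```python
# Sketch of FISTA (1.6) for problem (1.2); illustrative names, dense A for clarity.
import numpy as np

def shrink(y, gamma):
    return np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)

def fista(A, d, alpha, gamma, niter=200):
    n = A.shape[1]
    u = np.zeros(n)          # extrapolated point fed into the gradient step
    z_prev = np.zeros(n)     # previous shrinkage result
    zeta = 1.0
    for _ in range(niter):
        y = u - gamma * alpha * A.T @ (A @ u - d)        # gradient step
        z = shrink(y, gamma)                             # shrinkage step
        zeta_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * zeta ** 2))
        u = z + (zeta - 1.0) / zeta_next * (z - z_prev)  # Nesterov over-relaxation
        z_prev, zeta = z, zeta_next
    return z_prev

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 100)); d = rng.standard_normal(50)
u_hat = fista(A, d, alpha=1.0, gamma=1.0 / np.linalg.norm(A, 2) ** 2)
```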
The ADMM is a particular case of a primal-dual iterative solution framework with splitting [43], where the minimization in (1.10) is split into two steps,

(1.13)   u_{k+1} = \mathrm{argmin}_u\; L(u, z_k, \mu_k), \qquad z_{k+1} = \mathrm{argmin}_z\; L(u_{k+1}, z, \mu_k), \qquad \mu_{k+1} = \mu_k + \lambda \left[ z_{k+1} - B u_{k+1} \right].

For the ADMM, we substitute L = L_+ in (1.13), but other choices of a modified Lagrange function L are possible that may produce convergent primal-dual algorithms [43]. Making the substitution L = L_+ from (1.11) into (1.13), and introducing a scaled vector of multipliers,

(1.14)   b_k = \mu_k / \lambda, \quad k = 0, 1, 2, \ldots,

we obtain

(1.15)   u_{k+1} = \mathrm{argmin}_u\; \frac{\alpha}{2}\|Au - d\|_2^2 + \frac{\lambda}{2}\|z_k - Bu + b_k\|_2^2, \qquad z_{k+1} = \mathrm{argmin}_z\; \|z\|_1 + \frac{\lambda}{2}\|z - B u_{k+1} + b_k\|_2^2, \qquad b_{k+1} = b_k + z_{k+1} - B u_{k+1}, \quad k = 0, 1, 2, \ldots,

where we used the fact that adding a constant term \frac{\lambda}{2}\|b_k\|_2^2 to the objective function does not alter the solution. In the iterative process (1.15), we apply splitting, minimizing

(1.16)   \|z\|_1 + \frac{\alpha}{2}\|Au - d\|_2^2 + \frac{\lambda}{2}\|z - Bu + b_k\|_2^2

alternately with respect to u and z. Further we note that the minimization of (1.16) with respect to z (in a splitting step with u fixed) is given trivially by the shrinkage operator (1.5),

(1.17)   z_{k+1} = \mathrm{shrink}\{ B u_{k+1} - b_k, 1/\lambda \}.

Combining (1.15) and (1.17) we obtain Algorithm 1.

Algorithm 1 Alternating Direction Method of Multipliers (ADMM) for (1.1)
1: u_0 ← 0_N, z_0 ← 0_K
2: b_0 ← 0_K
3: for k = 0, 1, 2, 3, ... do
4:   u_{k+1} ← argmin_u { (λ/2)‖z_k − Bu + b_k‖_2^2 + (α/2)‖Au − d‖_2^2 }
5:   z_{k+1} ← shrink{B u_{k+1} − b_k, 1/λ}
6:   b_{k+1} ← b_k + z_{k+1} − B u_{k+1}
7:   Exit loop if ‖u_{k+1} − u_k‖ / ‖u_k‖ ≤ target accuracy
8: end for

Minimization on the first line of (1.15) at each step of the ADMM requires inversion of the operator A. In first-order gradient-descent methods like (1.6) a similar requirement is obviated by replacing the minimization with respect to the variable u by gradient descent. However, for ill-conditioned problems the gradient may be a poor approximation to the optimal search direction. One interpretation of Nesterov's over-relaxation step in (1.6) is that it provides a better search direction by perturbing the current solution update with a fraction of the previous update on the last line of (1.6). The intermediate least-squares problem in (1.15) can be solved approximately using, for example, a few iterations of conjugate gradients. However, repeating multiple iterations of Conjugate Gradients at each step of the ADMM may be unnecessary. Indeed, as we demonstrate in the following sections, conjugate directions constructed at earlier steps of the ADMM can be reused because the matrix of the system of normal equations associated with the minimization on the first line of (1.15) does not change between ADMM steps (only the right-hand sides of the system are updated as a result of thresholding). Therefore, we can trade the computational cost of applying the operator A and its transpose against the cost of storing a few solution- and data-size vectors. As this approach is applied to the most general problem (1.1) with a non-trivial operator B, in addition to the potential speed-up, this method has the advantage of working equally well for L1- and TV-regularized problems.

We stress that our new approach does not improve the theoretical convergence properties of the classic ADMM method under the assumption of exact minimization in step 4 of Algorithm 1. The asymptotic convergence rate is still O(1/k), as with exact minimization [24]. The new approach provides a numerically feasible way of implementing the ADMM for problems where a computationally expensive operator A precludes accurate minimization in step 4. However, the rate of convergence in the general method of multipliers (1.10) is sensitive to the choice of the parameter λ, and an improved convergence rate for some values of λ can be accompanied by more ill-conditioned minimization problems at each step of (1.15) [19]. By employing an increasingly more accurate conjugate-directions solution of the minimization problem at each iteration of (1.15), the new method offsets the deteriorating condition of the intermediate least-squares problems, and achieves a faster practical convergence at early iterations.

Practical utility of the ADMM in applications that involve sparsity-promoting (1.2) or edge-preserving (1.3) inversion is often determined by how quickly we can resolve sparse or blocky model components. These features can often be qualitatively resolved within relatively few initial iterations of the ADMM (see discussion in the Appendix of [22]). In Section 4, fast recovery of such local features will be one of the key indicators for judging the efficiency of the proposed method.

In the next section we describe two new algorithms, Steered and Compressive Conjugate Gradients, based on the principle of reusing conjugate directions for multiple right-hand sides. In Section 3 we prove convergence and demonstrate that the new algorithm coincides with the exact ADMM in a finite number of iterations. Section 4 contains a practical implementation of the Compressive Conjugate Gradients method. We test the method on a series of problems from imaging and mechanics, and compare its performance against FISTA and ADMM with gradient descent and restarted conjugate gradients.
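For reference, the ADMM loop of Algorithm 1 can be sketched in a few lines. The sketch below (not the authors' code) uses a dense normal-equations solve in step 4 purely for illustration; in the applications of interest this exact solve is precisely what becomes infeasible.

```python
# Sketch of Algorithm 1 (ADMM) for (1.1) with an exact inner solve; illustrative only.
import numpy as np

def shrink(y, gamma):
    return np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)

def admm(A, B, d, alpha, lam, niter=100, tol=1e-6):
    N, K = A.shape[1], B.shape[0]
    u = np.zeros(N); z = np.zeros(K); b = np.zeros(K)
    # The normal-equations matrix of step 4 is fixed across ADMM iterations;
    # in practice it would be factored once (assuming A has full column rank).
    H = alpha * A.T @ A + lam * B.T @ B
    for _ in range(niter):
        rhs = alpha * A.T @ d + lam * B.T @ (z + b)
        u_new = np.linalg.solve(H, rhs)              # step 4 (exact minimization)
        z = shrink(B @ u_new - b, 1.0 / lam)         # step 5
        b = b + z - B @ u_new                        # step 6
        if np.linalg.norm(u_new - u) <= tol * max(np.linalg.norm(u), 1e-12):
            return u_new
        u = u_new
    return u
```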
2. Steered and Compressive Conjugate Directions.
Step 4 of Algorithm 1 is itself a least-squares optimization problem of the form

(2.1)   \|F u - v_k\|_2^2 \to \min,

where

(2.2)   F = \begin{bmatrix} \sqrt{\alpha}\, A \\ \sqrt{\lambda}\, B \end{bmatrix}

and

(2.3)   v_k = \begin{bmatrix} \sqrt{\alpha}\, d \\ \sqrt{\lambda}\, (z_k + b_k) \end{bmatrix}.

Solving optimization problem (2.1) is mathematically equivalent to solving the following system of normal equations [40],

(2.4)   F^T F u = F^T v_k,

as operator (2.2) has maximum rank. Solving (2.4) has the disadvantage of squaring the condition number of operator (2.2) [40]. When the operator A is available in matrix form, and a factorization of operator F is numerically feasible, solving the normal equations (2.4) should be avoided and a technique based on a matrix factorization should be applied directly to solving (2.1) [5, 37]. However, when matrix A is not known explicitly or its size exceeds the practical limitations of direct methods, as is the case in applications of greatest interest for us, an iterative algorithm, such as Conjugate Gradients for Normal Equations (CGNE) [5, 37], can be used to solve (2.4). Solving (2.1) exactly may be unnecessary, and we can expect that for large-scale problems only a few steps of an iterative method need be carried out. However, every iteration typically requires the application of operator A and its adjoint, and in large-scale optimization problems we are interested in minimizing the number of applications of these operations. For large-scale optimization problems we need an alternative to restarting an iterative solver for each intermediate problem (2.1). We propose to minimize restarting iterations (avoiding restarting altogether in the theoretical limit of infinite computer storage) by devising a conjugate-directions technique for solving (2.1) with a non-stationary right-hand side. At each iteration of the proposed algorithm we find a search direction that is conjugate to previous directions with respect to the operator F^T F. In the existing conjugate-direction techniques, iteratively constructed conjugate directions span the Krylov subspaces [40],

(2.5)   K_k = \mathrm{span}\left\{ F^T v_0, \left( F^T F \right) F^T v_0, \ldots, \left( F^T F \right)^k F^T v_0 \right\}, \quad k = 0, 1, \ldots.

However, in our approach we construct a sequence of vectors (search directions) that are conjugate with respect to operator F^T F at the k-th step but may not span the Krylov subspace K_k. This complicates the convergence analysis of our technique, but allows "steering" search directions by iteration-dependent right-hand sides. Since the right-hand side in (2.1) is the result of the shrinkage (1.17) at previous iterations that steer or compress the solution, we call our approach "steered" or "compressive" conjugate directions.

For the least-squares problem (2.1), we construct two sets of vectors for k = 0, 1, 2, ...,

(2.6)   \{ p_0, p_1, p_2, \ldots, p_k \}, \quad \{ q_0, q_1, q_2, \ldots, q_k \}, \quad q_i = F p_i, \; i = 0, 1, 2, \ldots, k,

such that

(2.7)   q_i^T q_j = p_i^T F^T F p_j = 0 \quad \text{if } i \ne j.

Equations (2.6) and (2.7) mean that the vectors p_i form conjugate directions [40, 37]. At each iteration we find an approximation u_{k+1} to the solution of (2.1) as a linear combination of vectors p_i, i = 0, 1, ..., k, for which the residual

(2.8)   r_{k+1} = v_{k+1} - F u_{k+1}

is orthogonal to the vectors q_i,

(2.9)   q_i^T r_{k+1} = q_i^T \left( v_{k+1} - F u_{k+1} \right) = 0, \quad i = 0, 1, \ldots, k.

Vector p_{k+1} is constructed as a linear combination of all previous vectors p_i, i = 0, 1, ..., k, and F^T r_{k+1}, so that the conjugacy condition (2.7) is satisfied.
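In practice the block operator F of (2.2) and its transpose never need to be formed explicitly; they can be applied using only products with A, Aᵀ, B and Bᵀ. A minimal sketch with user-supplied callables follows (the helper names are illustrative, not from the paper).

```python
# Matrix-free application of F (2.2) and the right-hand side (2.3); illustrative sketch.
import numpy as np

def make_F(A_mv, AT_mv, B_mv, BT_mv, alpha, lam, M):
    sa, sl = np.sqrt(alpha), np.sqrt(lam)
    def F(u):                    # F u = [ sqrt(alpha) A u ; sqrt(lam) B u ]
        return np.concatenate([sa * A_mv(u), sl * B_mv(u)])
    def FT(r):                   # F^T r = sqrt(alpha) A^T r_top + sqrt(lam) B^T r_bottom
        return sa * AT_mv(r[:M]) + sl * BT_mv(r[M:])
    return F, FT

def make_rhs(d, z, b, alpha, lam):
    """Right-hand side (2.3): v_k = [ sqrt(alpha) d ; sqrt(lam) (z_k + b_k) ]."""
    return np.concatenate([np.sqrt(alpha) * d, np.sqrt(lam) * (z + b)])
```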
The resulting algorithm for an arbitrary sequence of right-hand sides v_k depending on k is given by Algorithm 2. Note that the above algorithm is not specific to a particular sequence of right-hand-side vectors v_k, and its applicability goes beyond solving the constrained optimization problems (1.8). The algorithm requires storing 2k + 2 vectors (2.6), as well as one vector each for the current solution iterate u_k, the variable right-hand side v_k, and the intermediate vectors w_k and s_k. The requirement of storing a growing number of vectors makes the algorithm resemble the GMRES method [37] for solving linear systems with non-self-adjoint operators. However, in our case, this is a consequence of having a variable right-hand side, requiring re-computation of solution iterates as linear combinations of all of the previous search directions (2.6). This requirement can be relaxed in applications where the vector v_k is updated, for example, by the modified Lagrangian technique for solving a constrained optimization problem, and converges to a limit. In Section 4 we describe practical applications of the algorithm achieving fast convergence while storing only a subset of vectors (2.6). The algorithm requires one application of F and its transpose at each iteration and 2k + 3 dot-products of large vectors.

Combining Algorithms 1 and 2 we obtain the Compressive Conjugate Directions Algorithm 3.

Algorithm 2 Steered Conjugate Directions for solving (2.1)
1: u_0 ← 0_N
2: p_0 ← F^T v_0, q_0 ← F p_0, δ_0 ← q_0^T q_0
3: for k = 0, 1, 2, 3, ... do
4:   for i = 0, 1, ..., k do
5:     τ_i ← q_i^T v_k / δ_i
6:   end for
7:   u_{k+1} ← Σ_{i=0}^{k} τ_i p_i
8:   r_{k+1} ← v_{k+1} − Σ_{i=0}^{k} τ_i q_i
9:   w_{k+1} ← F^T r_{k+1}
10:  s_{k+1} ← F w_{k+1}
11:  for i = 0, 1, ..., k do
12:    β_i ← −q_i^T s_{k+1} / δ_i
13:  end for
14:  p_{k+1} ← Σ_{i=0}^{k} β_i p_i + w_{k+1}
15:  q_{k+1} ← Σ_{i=0}^{k} β_i q_i + s_{k+1}
16:  δ_{k+1} ← q_{k+1}^T q_{k+1}
17:  if δ_{k+1} = 0 then    ▷ Use condition "δ_{k+1} < tolerance" in practice
18:    δ_{k+1} ← 1, p_{k+1} ← 0_N, q_{k+1} ← 0_{M+K}
19:  end if
20:  Exit loop if ‖u_{k+1} − u_k‖ / ‖u_k‖ ≤ target accuracy
21: end for
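A direct transcription of Algorithm 2 into Python is given below as a sketch (not the authors' implementation). F is a dense matrix here for clarity, and the caller supplies the sequence of right-hand sides v_0, v_1, ... through the function next_v.

```python
# Sketch of Algorithm 2 (Steered Conjugate Directions) for changing right-hand sides.
import numpy as np

def steered_cd(F, next_v, niter, tol=1e-8):
    N = F.shape[1]
    v = next_v(0)
    p = [F.T @ v]; q = [F @ p[0]]; delta = [q[0] @ q[0]]
    u = np.zeros(N)
    for k in range(niter):
        tau = [q[i] @ v / delta[i] for i in range(k + 1)]
        u_new = sum(tau[i] * p[i] for i in range(k + 1))    # steps 4-7
        v = next_v(k + 1)                                   # iteration-dependent right-hand side
        r = v - sum(tau[i] * q[i] for i in range(k + 1))    # step 8
        w = F.T @ r; s = F @ w                              # steps 9-10
        beta = [-(q[i] @ s) / delta[i] for i in range(k + 1)]
        p.append(sum(beta[i] * p[i] for i in range(k + 1)) + w)
        q.append(sum(beta[i] * q[i] for i in range(k + 1)) + s)
        dlt = q[-1] @ q[-1]
        if dlt < 1e-14:                                     # degenerate direction: park it
            dlt, p[-1], q[-1] = 1.0, np.zeros(N), np.zeros(F.shape[0])
        delta.append(dlt)
        if np.linalg.norm(u_new - u) <= tol * max(np.linalg.norm(u), 1e-12):
            return u_new
        u = u_new
    return u
```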
3. Convergence Analysis.
Convergence properties of the ADMM were studied in many publications and are well known. However, here we provide a self-contained proof of convergence for Algorithm 1 that mostly follows the presentation of [6]. Later, we use this result to study the convergence of Algorithm 3.
Theorem 3.1.
Assume that M ≥ N, operators A, B are maximum rank, and

(3.1)   u = u^*, \quad z = z^* = B u^*

is the unique solution of problem (1.8). Assume that a vector b^* is defined as

(3.2)   b^* = \mu^* / \lambda,

where μ^* is the vector of Lagrange multipliers for the equality constraint in (1.8). Algorithm 1 then converges to this solution if λ > 0, that is,

(3.3)   u_k \to u^*, \quad z_k \to z^*, \quad b_k \to b^*, \quad k \to \infty.
Algorithm 3 Compressive Conjugate Directions for (1.1)
1: u_0 ← 0_N, z_0 ← 0_K; b_0 ← 0_K, v_0 ← [ √α d ; √λ (z_0 + b_0) ]
2: p_0 ← F^T v_0, q_0 ← F p_0, δ_0 ← q_0^T q_0
3: for k = 0, 1, 2, 3, ... do
4:   for i = 0, 1, ..., k do
5:     τ_i ← q_i^T v_k / δ_i
6:   end for
7:   u_{k+1} ← Σ_{i=0}^{k} τ_i p_i
8:   z_{k+1} ← shrink{B u_{k+1} − b_k, 1/λ}
9:   b_{k+1} ← b_k + z_{k+1} − B u_{k+1}
10:  v_{k+1} ← [ √α d ; √λ (z_{k+1} + b_{k+1}) ]
11:  r_{k+1} ← v_{k+1} − Σ_{i=0}^{k} τ_i q_i
12:  w_{k+1} ← F^T r_{k+1}
13:  s_{k+1} ← F w_{k+1}
14:  for i = 0, 1, ..., k do
15:    β_i ← −q_i^T s_{k+1} / δ_i
16:  end for
17:  p_{k+1} ← Σ_{i=0}^{k} β_i p_i + w_{k+1}
18:  q_{k+1} ← Σ_{i=0}^{k} β_i q_i + s_{k+1}
19:  δ_{k+1} ← q_{k+1}^T q_{k+1}
20:  if δ_{k+1} = 0 then    ▷ Use condition "δ_{k+1} < tolerance" in practice
21:    δ_{k+1} ← 1, p_{k+1} ← 0_N, q_{k+1} ← 0_{M+K}
22:  end if
23:  Exit loop if ‖u_{k+1} − u_k‖ / ‖u_k‖ ≤ target accuracy
24: end for

Proof. Problem (1.8) has a convex objective function and equality constraints, hence (3.1), (3.2) is a saddle point of its Lagrangian (1.9) [7]. Substituting z_{k+1}, u_{k+1} from Algorithm 1, we have

(3.4)   L_0(z^*, u^*, \mu^*) \le L_0(z_{k+1}, u_{k+1}, \mu^*) \iff p^* = \|B u^*\|_1 + \frac{\alpha}{2}\|A u^* - d\|_2^2 = \|z^*\|_1 + \frac{\alpha}{2}\|A u^* - d\|_2^2 \le \|z_{k+1}\|_1 + \frac{\alpha}{2}\|A u_{k+1} - d\|_2^2 + \mu^{*T}(z_{k+1} - B u_{k+1}) = p_{k+1} + \mu^{*T}(z_{k+1} - B u_{k+1}) = p_{k+1} + \lambda b^{*T}(z_{k+1} - B u_{k+1}),

where p^* is the optimal value of the objective function and p_{k+1} is its approximation at iteration k of the algorithm. Inequality (3.4) provides a lower bound for the objective function estimate p_{k+1}. Step 4 of the algorithm is equivalent to

(3.5)   \alpha A^T A u_{k+1} + \lambda B^T B u_{k+1} = \alpha A^T d + \lambda B^T (z_k + b_k).

Substituting the expression for b_k from step 6 into (3.5), we obtain

(3.6)   \alpha A^T A u_{k+1} = \alpha A^T d + \lambda B^T (z_k - z_{k+1} + b_{k+1}).

Equality (3.6) is equivalent to

(3.7)   u_{k+1} = \mathrm{argmin}_u\; \frac{\alpha}{2}\|A u - d\|_2^2 - \lambda (z_k - z_{k+1} + b_{k+1})^T B u.

Substituting u_{k+1} and u^* into the right-hand side of (3.7), we obtain

(3.8)   \frac{\alpha}{2}\|A u_{k+1} - d\|_2^2 \le \frac{\alpha}{2}\|A u^* - d\|_2^2 + \lambda (z_k - z_{k+1} + b_{k+1})^T B (u_{k+1} - u^*).

Step 5 is equivalent to

(3.9)   0 \in \partial_z \|z_{k+1}\|_1 + \lambda (z_{k+1} - B u_{k+1} + b_k) = \partial_z \|z_{k+1}\|_1 + \lambda b_{k+1}, \qquad z_{k+1} = \mathrm{argmin}_z \left\{ \|z\|_1 + \lambda b_{k+1}^T z \right\},

where we used the expression for b_k from step 6. Substituting z = z_{k+1} and z = z^* into the right-hand side of the second relation in (3.9), we obtain

(3.10)   \|z_{k+1}\|_1 \le \|z^*\|_1 + \lambda b_{k+1}^T (z^* - z_{k+1}).

Adding (3.8) and (3.10), we get

(3.11)   p_{k+1} \le p^* + \lambda b_{k+1}^T (z^* - z_{k+1}) + \lambda (z_k - z_{k+1} + b_{k+1})^T B (u_{k+1} - u^*),

an upper bound for p_{k+1}. Adding (3.4) and (3.11), we get

(3.12)   0 \le \lambda b^{*T}(z_{k+1} - B u_{k+1}) + \lambda b_{k+1}^T (z^* - z_{k+1}) + \lambda (z_k - z_{k+1} + b_{k+1})^T B (u_{k+1} - u^*),

or, after rearranging,

(3.13)   0 \le \lambda (b^* - b_{k+1})^T (z_{k+1} - B u_{k+1}) - \lambda (z_k - z_{k+1})^T (z_{k+1} - B u_{k+1}) + \lambda (z_k - z_{k+1})^T (z_{k+1} - z^*).

We will now use (3.13) to derive an upper estimate for \|b_k - b^*\|_2^2 + \|z_k - z^*\|_2^2.

Using step 6 of Algorithm 1 for the first term in (3.13) and introducing ρ_{k+1} = z_{k+1} − B u_{k+1}, we get

(3.14)   \lambda (b^* - b_{k+1})^T \rho_{k+1}
 = \lambda (b^* - b_k - \rho_{k+1})^T \rho_{k+1}
 = \lambda (b^* - b_k)^T \rho_{k+1} - \lambda \|\rho_{k+1}\|_2^2
 = \lambda (b^* - b_k)^T (b_{k+1} - b_k) - \frac{\lambda}{2}\|\rho_{k+1}\|_2^2 - \frac{\lambda}{2}(b_{k+1} - b_k)^T (b_{k+1} - b_k)
 = -\lambda (b_k - b^*)^T \left[ (b_{k+1} - b^*) - (b_k - b^*) \right] - \frac{\lambda}{2}\|\rho_{k+1}\|_2^2 - \frac{\lambda}{2}\left[ (b_{k+1} - b^*) - (b_k - b^*) \right]^T \left[ (b_{k+1} - b^*) - (b_k - b^*) \right]
 = \frac{\lambda}{2}\|b_k - b^*\|_2^2 - \frac{\lambda}{2}\|b_{k+1} - b^*\|_2^2 - \frac{\lambda}{2}\|\rho_{k+1}\|_2^2.
Substituting (3.14) into (3.13) and completing the square in z_k − z_{k+1}, we obtain

(3.15)   0 \le \frac{\lambda}{2}\|b_k - b^*\|_2^2 - \frac{\lambda}{2}\|b_{k+1} - b^*\|_2^2 - \frac{\lambda}{2}\|\rho_{k+1}\|_2^2 - \lambda (z_k - z_{k+1})^T \rho_{k+1} + \lambda (z_k - z_{k+1})^T (z_{k+1} - z^*)
 = \frac{\lambda}{2}\|b_k - b^*\|_2^2 - \frac{\lambda}{2}\|b_{k+1} - b^*\|_2^2 - \frac{\lambda}{2}\|z_k - z_{k+1} + \rho_{k+1}\|_2^2 - \frac{\lambda}{2}\|z_{k+1} - z^*\|_2^2 + \frac{\lambda}{2}\|z_k - z^*\|_2^2,

yielding

(3.16)   \frac{\lambda}{2}\|z_k - z_{k+1} + \rho_{k+1}\|_2^2 \le \frac{\lambda}{2}\left( \|z_k - z^*\|_2^2 + \|b_k - b^*\|_2^2 \right) - \frac{\lambda}{2}\left( \|z_{k+1} - z^*\|_2^2 + \|b_{k+1} - b^*\|_2^2 \right).

Expanding the left-hand side of (3.16), we obtain

(3.17)   \frac{\lambda}{2}\left( \|z_k - z_{k+1}\|_2^2 + 2 (z_k - z_{k+1})^T \rho_{k+1} + \|\rho_{k+1}\|_2^2 \right) \le \frac{\lambda}{2}\left( \|z_k - z^*\|_2^2 + \|b_k - b^*\|_2^2 \right) - \frac{\lambda}{2}\left( \|z_{k+1} - z^*\|_2^2 + \|b_{k+1} - b^*\|_2^2 \right).

Let us prove that the middle term in the left-hand side of (3.17) is non-negative,

0 \le (z_k - z_{k+1})^T \rho_{k+1} = (z_k - z_{k+1})^T (b_{k+1} - b_k),

where we used step 6 of Algorithm 1. Indeed, since z_{k+1} minimizes (1.16) with u = u_{k+1}, using the convexity of the L1 norm, we have for z = z_{k+1},

(3.18)   \partial_z \frac{\lambda}{2}\|z - B u_{k+1} + b_k\|_2^2 = \lambda (z - B u_{k+1} + b_k) \in -\partial \|z\|_1 \;\Rightarrow\; \|z_{k+1}\|_1 - \|z_k\|_1 \le \lambda (z_k - z_{k+1})^T (z_{k+1} - B u_{k+1} + b_k) = \lambda (z_k - z_{k+1})^T b_{k+1}.

Similarly, since z_k minimizes (1.16) for u = u_k and b = b_{k-1}, for z = z_k we have

(3.19)   \partial_z \frac{\lambda}{2}\|z - B u_k + b_{k-1}\|_2^2 = \lambda (z - B u_k + b_{k-1}) \in -\partial \|z\|_1 \;\Rightarrow\; \|z_k\|_1 - \|z_{k+1}\|_1 \le \lambda (z_{k+1} - z_k)^T (z_k - B u_k + b_{k-1}) = \lambda (z_{k+1} - z_k)^T b_k.

In both (3.18) and (3.19) we used step 6 of Algorithm 1 and the fact that for any convex function f(x),

f(x_0) + \xi^T (x - x_0) \le f(x) \;\Leftrightarrow\; f(x_0) - f(x) \le -\xi^T (x - x_0), \quad \text{if } \xi \in \partial f(x_0),

where ∂ is the subgradient [34]. Summing (3.18) and (3.19) we get

(3.20)   0 \le (z_k - z_{k+1})^T (b_{k+1} - b_k).

From (3.20) and (3.17), we have

(3.21)   \|z_k - z_{k+1}\|_2^2 + \|\rho_{k+1}\|_2^2 \le \left( \|z_k - z^*\|_2^2 + \|b_k - b^*\|_2^2 \right) - \left( \|z_{k+1} - z^*\|_2^2 + \|b_{k+1} - b^*\|_2^2 \right),

or

(3.22)   \|z_{k+1} - z^*\|_2^2 + \|b_{k+1} - b^*\|_2^2 \le \|z_k - z^*\|_2^2 + \|b_k - b^*\|_2^2 - \|z_{k+1} - z_k\|_2^2 - \|\rho_{k+1}\|_2^2.

From (3.22) we can see that the sequence \|z_k - z^*\|_2^2 + \|b_k - b^*\|_2^2, and consequently z_k and b_k, are bounded. Summing (3.21) for k = 0, 1, ..., ∞, we obtain convergence of the series

(3.23)   \sum_{k=0}^{\infty} \left\{ \|z_k - z_{k+1}\|_2^2 + \|\rho_{k+1}\|_2^2 \right\} \le \|z_0 - z^*\|_2^2 + \|b_0 - b^*\|_2^2.

From (3.23) it follows that

(3.24)   z_k - z_{k+1} \to 0, \quad z_k - B u_k \to 0, \quad k \to \infty.
Now using (3.11) we obtain

(3.25)   p_{k+1} - p^* \le \lambda b_{k+1}^T (z^* - z_{k+1}) + \lambda (z_k - z_{k+1} + b_{k+1})^T B (u_{k+1} - u^*)
 = \lambda b_{k+1}^T (z_k - z_{k+1}) + \lambda (z_k - z_{k+1})^T B (u_{k+1} - u^*) + \lambda b_{k+1}^T (z^* - z_k) + \lambda b_{k+1}^T B (u_{k+1} - u^*)
 = \lambda b_{k+1}^T (z_k - z_{k+1}) + \lambda (z_k - z_{k+1})^T B (u_{k+1} - u^*) + \lambda b_{k+1}^T \left( B u_{k+1} - z_{k+1} + z_{k+1} - z_k + z^* - B u^* \right) \to 0, \quad k \to \infty,

where the right-hand side of (3.25) converges to zero because of (3.24), the boundedness of z_k and b_k, and z^* = B u^*. Likewise, from (3.4) we have

(3.26)   p^* - p_{k+1} \le \lambda b^{*T} (z_{k+1} - B u_{k+1}) \to 0, \quad k \to \infty.

Combining (3.25) and (3.26) we obtain p_k → p^*—i.e., the value of the objective function estimate at iteration k converges to the true minimum as k → ∞. From the bounded sequence u_k ∈ ℝ^N we can extract a convergent subsequence

(3.27)   u_{k_i} \to u^{**}.

Because our objective function is continuous, u^{**} is a solution of (1.1) and (1.8). However, if A is maximum rank the objective function of (1.1) is strictly convex, hence u^* = u^{**}. The sequence u_k must converge to u^* because otherwise we would be able to extract a subsequence convergent to a different limit and repeat the above analysis.

And finally, to prove that b_k → b^*, we see that from the Karush-Kuhn-Tucker (KKT) conditions [7] for (1.8) we have

(3.28)   \alpha A^T A u^* = \alpha A^T d + \lambda B^T b^*.

Passing (3.6) to the limit as k → ∞, using (3.24) and replacing b_{k+1} with a convergent subsequence as necessary, we get

(3.29)   \alpha A^T A u^* = \alpha A^T d + \lambda B^T \lim b_k.

Since B is maximum rank, rank B = K ≤ N, (3.29) means that lim b_k = b^*.

Note that our proof does not depend on the selection of starting values for u_0, z_0 and b_0, and this fact will be used later on in proving the convergence of Algorithm 3. Before we study the convergence properties of Algorithm 3, we prove one auxiliary result.

Theorem 3.2.
Algorithm 3 constructs a sequence of subspaces of ℝ^N spanning expanding sets of conjugate directions,

(3.30)   S_k = \mathrm{span}\{ p_0, p_1, \ldots, p_k \}, \quad k = 0, 1, 2, \ldots, \qquad S_0 \subseteq S_1 \subseteq S_2 \subseteq \ldots \subseteq S_k \subseteq \ldots,

such that

(3.31)   \lim_{k \to \infty} S_k = S \subseteq \mathbb{R}^N.

Under the assumptions of Theorem 3.1, the solution of the constrained optimization problem

(3.32)   \|z\|_1 + \frac{\alpha}{2}\|A u - d\|_2^2 \to \min, \quad z = B u, \quad u \in S

matches the solution of (1.8).

Proof. If S = ℝ^N the statement of the theorem is trivial, so we assume that dim S < N. Since our problem is finite-dimensional, the limit (3.31) is achieved at a finite iteration,

(3.33)   \exists k_0 \;\; \forall k \ge k_0: \; S_k \equiv S.

Steps 4–7 of Algorithm 3 are equivalent to projecting the solution of the system of normal equations (2.4) onto the space S_k. If p_{k+1} = 0 in steps 20–22, then the right-hand side of (2.4) for any k ≥ k_0 can be represented as a linear combination of vectors from S_k ≡ S. Steps 8 and 9 of Algorithm 3 are equivalent to steps 5 and 6 of Algorithm 1. Step 10 prepares the right-hand side of (2.4) for the minimization in step 4 of Algorithm 1 for iteration k + 1. However, since the right-hand side of (2.4) is a linear combination of vectors p_0, p_1, ..., p_k that span S_k ≡ S, steps 4–7 of Algorithm 3 are equivalent to the exact solution of the unconstrained minimization problem in step 4 of Algorithm 1. Hence, starting from iteration k_0 the two algorithms become equivalent. From Theorem 3.1 and ∀ k ≥ k_0: u_{k+1} ∈ S it follows that the solution of (3.32) coincides with that of (1.8).

Convergence of Algorithm 3 now becomes a trivial corollary of Theorems 3.1 and 3.2.

Theorem 3.3.
Under the assumptions of Theorem 3.1, Algorithm 3 converges to the unique solution (3.1) of problem (1.8), and (3.3) holds.

Proof. In the proof of Theorem 3.2 we have demonstrated that, starting from k = k_0 defined in (3.33), Algorithm 3 is mathematically equivalent to Algorithm 1 starting from an initial approximation u_{k_0-1}, z_{k_0-1} and b_{k_0-1}. Convergence of Algorithm 1 does not depend on these starting values, hence Algorithm 3 converges to the same unique solution as Algorithm 1 and (3.3) holds.

The result of Theorem 3.3 indicates that our Compressive Conjugate Directions method matches the ADMM in exact arithmetic after a finite number of iterations, while avoiding direct inversion of operator A. This obviously means that the (worst-case) asymptotic convergence rate of Algorithm 3 matches that of the ADMM and is O(1/k) [24].
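Before turning to the limited-memory variant, the unlimited-memory Algorithm 3 can be sketched compactly as follows (an illustration under the notation of (2.2)–(2.3), with dense A and B for clarity; this is not the authors' implementation).

```python
# Sketch of Algorithm 3 (Compressive Conjugate Directions) for problem (1.1).
import numpy as np

def shrink(y, gamma):
    return np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)

def ccd(A, B, d, alpha, lam, niter=100, tol=1e-8):
    (M, N), K = A.shape, B.shape[0]
    F = np.vstack([np.sqrt(alpha) * A, np.sqrt(lam) * B])
    u = np.zeros(N); z = np.zeros(K); b = np.zeros(K)
    v = np.concatenate([np.sqrt(alpha) * d, np.sqrt(lam) * (z + b)])
    p = [F.T @ v]; q = [F @ p[0]]; delta = [q[0] @ q[0]]
    for k in range(niter):
        tau = [q[i] @ v / delta[i] for i in range(k + 1)]
        u_new = sum(tau[i] * p[i] for i in range(k + 1))                  # steps 4-7
        z = shrink(B @ u_new - b, 1.0 / lam)                              # step 8
        b = b + z - B @ u_new                                             # step 9
        v = np.concatenate([np.sqrt(alpha) * d, np.sqrt(lam) * (z + b)])  # step 10
        r = v - sum(tau[i] * q[i] for i in range(k + 1))                  # step 11
        w = F.T @ r; s = F @ w                                            # steps 12-13
        beta = [-(q[i] @ s) / delta[i] for i in range(k + 1)]
        p.append(sum(beta[i] * p[i] for i in range(k + 1)) + w)           # step 17
        q.append(sum(beta[i] * q[i] for i in range(k + 1)) + s)           # step 18
        dlt = q[-1] @ q[-1]
        if dlt < 1e-14:                                                   # steps 20-22
            dlt, p[-1], q[-1] = 1.0, np.zeros(N), np.zeros(M + K)
        delta.append(dlt)
        if np.linalg.norm(u_new - u) <= tol * max(np.linalg.norm(u), 1e-12):
            return u_new
        u = u_new
    return u
```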
4. Limited-memory Compressive Conjugate Directions Method.
Algorithm 3 (which we call the "unlimited-memory" Compressive Conjugate Directions Method) requires storing all of the previous conjugate directions (2.6) because in step 7 the algorithm computes the expansion

(4.1)   u_{k+1} = \sum_{i=0}^{k} \tau_i p_i

of the solution approximation with respect to all conjugate direction vectors (2.6) at each iteration. It is a consequence of the changing right-hand sides of problem (2.1) that all of the coefficients of expansion (4.1) may require updating. However, in a practical implementation we may expect that only the last m + 1 expansion coefficients (4.1) significantly change, and freeze the coefficients τ_i, i < k − m, at and after iteration k. This approach requires storing up to 2m + 2 latest vectors

(4.2)   p_k, p_{k-1}, \ldots, p_{k-m}, \qquad q_k, q_{k-1}, \ldots, q_{k-m}.

A "limited-memory" variant of the method is implemented in Algorithm 4, which stores vectors (4.2) in a circular first-in-first-out buffer. An index variable j points to the latest updated element within the buffer. Once j exceeds the buffer size for the first time and is reset to point to the head of the buffer, a flag variable cycle is set, indicating that a search direction is overwritten at each subsequent iteration of the algorithm. The projection of the current solution iterate onto the old vector τ_j p_j (now to be overwritten in the buffer) is then accumulated in a vector ũ; the corresponding contribution to the predicted data equals τ_j q_j and is accumulated in a vector ṽ,

(4.3)   \tilde{u} = \sum_{i=0}^{k-m-1} \tau_i p_i, \qquad \tilde{v} = \sum_{i=0}^{k-m-1} \tau_i q_i.

Contributions (4.3) to the solution and predicted data from the discarded vectors (2.6) are then added back to the approximate solution and residual in steps 8 and 12 of Algorithm 4.
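The "freezing" step behind (4.3) amounts to two accumulations performed just before a buffer slot is overwritten. A minimal sketch (illustrative only, not the paper's code) is:

```python
# Freezing the contribution of an evicted direction (cf. (4.3)); illustrative sketch.
import numpy as np

def freeze_evicted(utilde, vtilde, p_evicted, q_evicted, tau_evicted):
    """Accumulate the evicted direction's contribution before its buffer slot is reused."""
    utilde = utilde + tau_evicted * p_evicted   # frozen part of u_{k+1} = utilde + sum tau_i p_i (step 8)
    vtilde = vtilde + tau_evicted * q_evicted   # frozen part subtracted in the residual of step 12
    return utilde, vtilde
```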
In practical implementations of the ADMM, when the operator A does not lend itself to direct solution methods, an iterative method can be used to solve the minimization problem in step 4 of Algorithm 1 [22]. Algorithm 5, representing such an approach, uses a fixed number of iterations N_c of CGNE in step 4. At each iteration of the ADMM the conjugate gradients are hot-restarted from the previous solution approximation u_k. For comparison purposes we will refer to this method as restarted Conjugate Gradients, or RCG.

Algorithm 4 Limited-Memory Compressive Conjugate Directions Method for (1.1)
1: m ← memory size, ũ ← 0_N, ṽ ← 0_{M+K}, j ← 0, cycle ← .false.
2: u_0 ← 0_N, z_0 ← 0_K; b_0 ← 0_K, v_0 ← [ √α d ; √λ (z_0 + b_0) ]
3: p_0 ← F^T v_0, q_0 ← F p_0, δ_0 ← q_0^T q_0
4: for k = 0, 1, 2, 3, ... do
5:   for i = 0, 1, ..., min(k, m) do
6:     τ_i ← q_i^T (v_k − ṽ) / δ_i
7:   end for
8:   u_{k+1} ← ũ + Σ_{i=0}^{min(k,m)} τ_i p_i
9:   z_{k+1} ← shrink{B u_{k+1} − b_k, 1/λ}
10:  b_{k+1} ← b_k + z_{k+1} − B u_{k+1}
11:  v_{k+1} ← [ √α d ; √λ (z_{k+1} + b_{k+1}) ]
12:  r_{k+1} ← v_{k+1} − Σ_{i=0}^{min(k,m)} τ_i q_i − ṽ
13:  w_{k+1} ← F^T r_{k+1}
14:  s_{k+1} ← F w_{k+1}
15:  for i = 0, 1, ..., min(k, m) do
16:    β_i ← −q_i^T s_{k+1} / δ_i
17:  end for
18:  j ← j + 1
19:  if j = m + 1 then
20:    j ← 0, cycle ← .true.
21:  end if
22:  if cycle then
23:    ũ ← ũ + τ_j p_j
24:    ṽ ← ṽ + τ_j q_j
25:  end if
26:  p_j ← Σ_{i=0}^{min(k,m)} β_i p_i + w_{k+1}
27:  q_j ← Σ_{i=0}^{min(k,m)} β_i q_i + s_{k+1}
28:  δ_j ← q_j^T q_j
29:  if δ_j = 0 then    ▷ Use condition "δ_j < tolerance" in practice
30:    δ_j ← 1, p_j ← 0_N, q_j ← 0_{M+K}
31:  end if
32:  Exit loop if ‖u_{k+1} − u_k‖ / ‖u_k‖ ≤ target accuracy
33: end for

Algorithm 5 ADMM and hot-restarted CG (RCG)
1: u_0 ← 0_N, z_0 ← 0_K, b_0 ← 0_K, N_c ← prescribed number of CG iterations
2: p_0 ← F^T v_0, q_0 ← F p_0
3: for k = 0, 1, 2, 3, ... do
4:   Solve u_{k+1} ← argmin_u { (λ/2)‖z_k − Bu + b_k‖_2^2 + (α/2)‖Au − d‖_2^2 }, starting from u_k and using N_c iterations of CGNE
5:   z_{k+1} ← shrink{B u_{k+1} − b_k, 1/λ}
6:   b_{k+1} ← b_k + z_{k+1} − B u_{k+1}
7:   Exit loop if ‖u_{k+1} − u_k‖ / ‖u_k‖ ≤ target accuracy
8: end for

Note that Algorithm 5 with N_c = 1 performs a single step of gradient descent when solving the following intermediate least-squares minimization problem in step 4,

(4.4)   u_{k+1} = \mathrm{argmin}_u\; \frac{\alpha}{2}\|A u - d\|_2^2 + \frac{\lambda}{2}\|z_k - B u + b_k\|_2^2.

The performance of Algorithm 5 depends on the condition number of the least-squares problem (4.4) [40]: for well-conditioned problems, only a small number of conjugate gradients iterations N_c may achieve a sufficiently accurate approximation to u_{k+1}. The condition number of (4.4) depends on the properties of operators A and B, as well as the value of λ. In applications with a simple modeling operator A, such as is the case in denoising with A = I, a value of λ may be experimentally selected so as to reduce the condition number of (4.4). However, a trade-off may exist between the condition number of (4.4) and the number of ADMM iterations in the outer loop (step 3) of Algorithm 1: well-conditioned interim least-squares problems may result in a significantly higher number of ADMM iterations. Such a trade-off is a well-known phenomenon in applications of the Augmented Lagrangian Method of Multipliers for smooth objective functions, see, e.g., [19]. For example, large values of λ in (1.15) more strongly penalize violations of the equality constraint, as in the Quadratic Penalty Function Method [30], with a larger penalty and a more ill-conditioned quadratic minimization. Of course, in the case of the ADMM applied to (1.1), a non-smooth objective function, an arbitrary and potentially ill-conditioned operator A, and (most importantly) alternating splitting minimization of the modified Augmented Lagrangian (1.15) ("modified" because of the added constant term \frac{\lambda}{2}\|b_k\|_2^2) complicate the picture. In fact, for an arbitrary A, the condition number of (4.4) is not always an increasing function of λ. Some of the numerical examples described in the following subsections exhibit this trade-off between the condition number of the intermediate least-squares problem (4.4) and the number of ADMM iterations: the better the condition number of (4.4), the more ADMM iterations are typically required. The main advantage of our Compressive Conjugate Directions approach implemented in Algorithms 3 and 4 is that information on the geometry of the objective function (4.4) accumulates through external ADMM iterations, thus potentially reducing the amount of effort required to perform minimization of (4.4) at each step. Since our objective is a practical implementation of the ADMM for (1.1) with computationally expensive operators A, the overall number of operator A and A^T applications required to achieve a given accuracy will be the principal benchmark for measuring the performance of various algorithms.
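The reference method of Algorithm 5 can be sketched as follows (illustrative only; dense F for clarity, and the CGNE routine is a standard conjugate-gradients-on-normal-equations loop warm-started from the previous ADMM iterate).

```python
# Sketch of Algorithm 5 (RCG): ADMM with N_c hot-restarted CGNE iterations in step 4.
import numpy as np

def shrink(y, gamma):
    return np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)

def cgne(F, v, u0, n_iter):
    """A few Conjugate Gradient iterations on the normal equations F^T F u = F^T v."""
    u = u0.copy()
    r = F.T @ (v - F @ u)          # residual of the normal equations
    p = r.copy()
    rs = r @ r
    for _ in range(n_iter):
        if rs < 1e-30:
            break
        Fp = F @ p
        a = rs / (Fp @ Fp)
        u += a * p
        r -= a * (F.T @ Fp)
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return u

def rcg(A, B, d, alpha, lam, n_c=5, niter=100):
    N, K = A.shape[1], B.shape[0]
    F = np.vstack([np.sqrt(alpha) * A, np.sqrt(lam) * B])
    u = np.zeros(N); z = np.zeros(K); b = np.zeros(K)
    for _ in range(niter):
        v = np.concatenate([np.sqrt(alpha) * d, np.sqrt(lam) * (z + b)])
        u = cgne(F, v, u, n_c)                 # step 4, hot-restarted from previous u
        z = shrink(B @ u - b, 1.0 / lam)
        b = b + z - B @ u
    return u
```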
5. Applications.
In this section we apply the method of Compressive Conjugate Directions to solving L1- and TV-regularized inversion problems for several practical examples.

A popular image denoising technique for removing short-wavelength random Gaussian noise from an image is based on solving (1.3) with A = I. Vector d is populated with a noisy image, a denoised image is returned in u, u = u_{i,j}, i = 1, ..., N_y, j = 1, ..., N_x, with an anisotropic TV norm in (1.3) defined by the linear gradient operator

(5.1)   \nabla u = \begin{bmatrix} \nabla_x u \\ \nabla_y u \end{bmatrix}, \qquad (\nabla_x u)_{i,j} = u_{i,j+1} - u_{i,j}, \quad (\nabla_y u)_{i,j} = u_{i+1,j} - u_{i,j}, \quad i = 1, \ldots, N_y, \; j = 1, \ldots, N_x,

with the differences taken up to j = N_x − 1 and i = N_y − 1, respectively. Here, the dimension of the model space is N = N_x × N_y, with M = N and K = 2N − N_x − N_y. Since the operator A = I is trivial, minimization of the number of operator applications in this problem carries no practical advantage. The only reason for providing this example is to demonstrate the stability of the proposed Compressive Conjugate Directions method with respect to choosing a value of λ.
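The anisotropic gradient (5.1) is conveniently assembled as a sparse matrix B, so that the denoising problem fits the generic form (1.1). A minimal sketch (assuming scipy is available; names are illustrative):

```python
# Sparse 2D forward-difference operator for the anisotropic TV norm (5.1); illustrative sketch.
import numpy as np
import scipy.sparse as sp

def gradient_operator(ny, nx):
    """Stack forward differences along x and y into one sparse operator of shape (K, N)."""
    ix = sp.identity(nx, format="csr")
    iy = sp.identity(ny, format="csr")
    dx = sp.diags([-np.ones(nx - 1), np.ones(nx - 1)], [0, 1], shape=(nx - 1, nx))
    dy = sp.diags([-np.ones(ny - 1), np.ones(ny - 1)], [0, 1], shape=(ny - 1, ny))
    grad_x = sp.kron(iy, dx)       # differences along rows (x direction)
    grad_y = sp.kron(dy, ix)       # differences along columns (y direction)
    return sp.vstack([grad_x, grad_y]).tocsr()

B = gradient_operator(382, 382)    # K = 2N - N_x - N_y rows for an N_y-by-N_x image
```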
Figure 1(a) shows the true, noise-free 382 × 382 image used in this experiment. Random Gaussian noise with a standard deviation σ of 15% of the maximum signal amplitude was added to the true image to produce the noisy image of Figure 1(b). All low-wavenumber or "blocky" components of the noise below a quarter of the Nyquist wavenumber were filtered out, leaving only high-wavenumber "salt-and-pepper" noise. Parameter α = 10 was chosen experimentally based on the desired trade-off between fidelity and "blockiness" of the resulting denoised image. The result of solving (1.3) using Algorithm 5 with λ = 1, one hundred combined applications of A and A^T, and N_c = 1 is shown in Figure 1(d). The result of applying our limited-memory Conjugate Directions Algorithm 4 for m = 50 is shown in Figure 1(c). Note that N_c = 1 means that only a single step of Conjugate Gradients, or a single gradient descent, is made in step 4 of Algorithm 5. For this choice of λ, problem (4.4) is very well conditioned, with a condition number of κ = 1.8.
A single iteration of gradient descent achieves sufficient accuracy of minimization (4.4), and for λ = 1 there is no practical advantage in using our method, as both methods perform equally well, see Figure 2(a). In fact, the overhead of storing and using conjugate directions from previous iterations may exceed the cost of operator A and its adjoint applications if the latter are computationally cheap. The approximation errors of applying the limited-memory Compressive Conjugate Directions Algorithm 4 with m = 50 (here, this matches the results for any memory size m > …) versus Algorithm 5 with N_c = 1, …, 10 for λ = 1, 100, 1000, and 10000 are shown in Figures 2(a), 2(b), 2(c), 2(d).
Figure 1: (a) Clean image; (b) Noisy image contaminated with Gaussian noise with σ = 15% of maximum amplitude; (c) Image denoised using Algorithm 4 with α = 10, λ = 1 and memory size m = 50; (d) Image denoised using Algorithm 5 with α = 10, λ = 1, N_c = 1.

Figure 2: Performance of Algorithm 4 with m = 20 versus Algorithm 5 with varying N_c for (a) λ = 1; (b) λ = 100; (c) λ = 1000; (d) λ = 10000.

Note that larger values of λ result in increasingly larger condition numbers of (4.4), shown on top of the plots. The performance of Algorithm 5 here depends on the choice of N_c: increasing N_c as required to achieve a sufficiently accurate approximate solution of (4.4) results in fewer available ADMM iterations for a fixed "budget" of operator A and adjoint applications. However, Algorithm 4 accumulates conjugate directions (2.6) computed at earlier iterations and requires only one application of the operator and its adjoint per ADMM iteration. Note that at iteration counts less than N_c, Algorithm 5 may still outperform Algorithm 4, as it conducts more Conjugate Gradient iterations per solution of each problem (4.4). However, once the ADMM iteration count exceeds the largest N_c, and sufficient information is accumulated by Algorithm 4 about the geometry of the objective function, the Compressive Conjugate Directions outperforms Algorithm 5. Note that this example does not demonstrate the trade-off between the condition number of (4.4) and the number of ADMM iterations. The reason for this is that for large λ convergence is achieved relatively quickly, within a number of iterations comparable to the number of Conjugate Gradients steps required to solve (4.4). However, this example demonstrates another feature of the proposed Compressive Conjugate Directions Method: compared with a technique based on a restarted iterative solution of (4.4), the method may be less sensitive to a suboptimal choice of λ.

In our second example, we demonstrate our method on a geomechanical inversion problem with a non-trivial forward-modeling operator A. Here, we are interested in inverting subsurface sources of deformation from noisy measurements of surface displacements, such as GPS, tilt-meter and InSAR observations. The forward modeling operator simulates vertical surface displacements in response to distributed dilatational (e.g., pressure change) sources [38]. Our modeling operator is defined as

(5.2)   A u = d(x), \qquad d(x) = c \int_0^A \frac{D\, u(\xi)\, d\xi}{\left( D^2 + (x - \xi)^2 \right)^{3/2}},

where we assume that u = u(ξ), ξ ∈ [0, A], is a relative pore pressure change along a horizontal segment [0, A] of a reservoir at a constant depth D, d = d(x), x ∈ [0, A], is the induced vertical displacement on the surface, and a factor c is determined by the poroelastic medium properties and reservoir dimensions. In this example, for demonstration purposes, we consider a two-dimensional model, but a three-dimensional model is studied in subsection 5.3. Operator (5.2) is a smoothing integral operator that, after discretization and application of a simple quadrature, can be represented by a dense matrix.
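A simple midpoint quadrature suffices to discretize an integral kernel of this type into a dense matrix. The sketch below assumes the kernel as written in (5.2); the grid size and the constants length, depth, and c are placeholders, not the values used in the paper's experiment.

```python
# Discretizing a smoothing integral operator of the form (5.2) into a dense matrix; sketch only.
import numpy as np

def displacement_matrix(n=500, length=2.0, depth=0.5, c=1.0):
    """Rows: surface locations x_i; columns: reservoir locations xi_j; entries ~ kernel * dxi."""
    x = (np.arange(n) + 0.5) * length / n          # surface sample points
    xi = (np.arange(n) + 0.5) * length / n         # reservoir sample points
    dxi = length / n
    X, XI = np.meshgrid(x, xi, indexing="ij")
    return c * depth * dxi / (depth ** 2 + (X - XI) ** 2) ** 1.5

A = displacement_matrix()          # dense matrix; the product A @ u models d = Au
```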
Analytical representation of the surface displacement modeling operator (5.2) is possible for simple homogeneous media; however, modeling surface displacements in highly heterogeneous media will involve computationally expensive numerical methods such as Finite Elements [27].

In this experiment we seek to recover a spiky model of subsurface sources shown in Figure 3(a) from noisy observations of the induced surface displacements shown in Figure 3(b). Such sparse dilatational pseudo-sources are mathematically equivalent to concentrated reservoir pressure changes in hydrogeology and exploration geophysics, as well as expanding spherical lava chambers (the "Mogi model") in volcanology [38]. We forward-modeled surface displacements due to the sources of Figure 3(a) using operator (5.2), and, as in our denoising tests, added random Gaussian noise with σ = 15% of the maximum data amplitude. Prior to adding the noise, all low-wavenumber noise components below a fifth of the Nyquist wavenumber were muted, leaving only the high-wavenumber noise shown in Figure 3(b).

Figure 3: (a) A spiky true pseudo-source model; (b) the resulting true (black) and noisy (red) surface displacements.

We set D = …, A = 2 km, c = 10^{−…} in (5.2), and discretized both the model and data space using a 500-point uniform grid, N = M = 500. We solve problem (1.2) with α = 10000, and our objective is to accurately identify locations of the spikes in Figure 3(a) and their relative magnitudes, carrying out as few applications of operator (5.2) as possible. Inversion results of using the limited-memory Compressive Conjugate Directions Algorithm 4 with m = 100, ADMM with restarted Conjugate Gradients Algorithm 5, and FISTA of (1.6) are shown in Figures 4(a), 4(b), 4(c), 4(d) for λ = 0.05, 0.1, 1, 100, after 100 products of A and A^T with vectors were computed. We used the maximum FISTA step size γ = 10^{−…} in (1.6) computed for operator (5.2). These results indicate that the Compressive Conjugate Directions method achieves qualitative recovery of the spiky model at early iterations. Superiority of the new method is especially pronounced when the intermediate least-squares minimization problem (4.4) is ill-conditioned (see plot tops). The method retains its advantage after 1000 operator and adjoint applications, as shown in Figures 5(a), 5(b), 5(c), 5(d). Note that the error plots of the CCD in Figures 6(a), 6(b), 6(c), 6(d) exhibit the trade-off between the convergence rate and the condition number of problem (4.4) discussed earlier: a more ill-conditioned (4.4) is associated with a faster convergence rate of the new method.

Figures 7(a), 7(b), 7(c), 7(d) show error plots for the CCD, ADMM with exact minimization of (4.4), and FISTA. The said trade-off between the convergence rate and condition number of (4.4) is exhibited by the ADMM. The CCD curves approach the convergence rates of the ADMM once Algorithm 4 has accumulated enough information about the geometry of the objective function in vectors (4.2). Note that the advantage of a faster asymptotic convergence rate of FISTA kicks in only when the ADMM-based methods use values of λ that are not optimal for their convergence—see Figures 6(d) and 7(d).
In this case (4.4) is very well conditioned, and its adequate solution requires only a single step of gradient descent at each iteration of the ADMM, depriving conjugate-gradients-based methods of their advantage. FISTA, being based on accelerating a gradient-descent method, now asymptotically beats the convergence rates of the other techniques, but this happens too late through the iterations to be of practical significance. In other words, in this particular example FISTA can beat the ADMM (and CCD) only if the latter use badly selected values of λ. Generalizing this observation about FISTA and ADMM for problem (1.2) with a general operator A goes beyond the scope of our work.

Figure 4: Inversion results for CCD (red), RCG (blue), FISTA (green) after 100 operator and adjoint applications for (a) λ = 0.05; (b) λ = 0.1; (c) λ = 1; (d) λ = 100. Note that FISTA does not use λ and the same FISTA results are shown in all plots but using different vertical scales. Improving the condition number of (4.4) is accompanied by slower convergence. The Compressive Conjugate Directions method most accurately resolves the spiky model at early iterations, and performs well when (4.4) is ill-conditioned.

Figure 5: Inversion results for CCD (red), RCG (blue), FISTA (green) after 1000 operator and adjoint applications for (a) λ = 0.05; (b) λ = 0.1; (c) λ = 1; (d) λ = 100. Note that FISTA does not use λ and the same FISTA results are shown in all plots but using different vertical scales. The Compressive Conjugate Directions method still retains its advantage in resolving the spiky model at earlier iterations. The asymptotically faster convergence of FISTA kicks in when λ = 100 with a well-conditioned (4.4), when the ADMM convergence is slowed—compare with Figure 7(d).

Figure 6: Convergence curves for CCD (solid red), RCG (dashed), FISTA (solid green) for (a) λ = 0.05; (b) λ = 0.1; (c) λ = 1; (d) λ = 100—compare with Figures 5(a), 5(b), 5(c), 5(d).

Figure 7: Convergence curves for CCD (solid red), ADMM with exact solver (blue), FISTA (green) for (a) λ = 0.05; (b) λ = 0.1; (c) λ = 1; (d) λ = 100. Limited-memory Compressive Conjugate Directions with m = 100 achieves a convergence rate comparable to ADMM with exact minimization of (4.4).

In this section we apply the Compressive Conjugate Gradients method to identify sharp subsurface pressure contrasts in a reservoir from observations of induced surface displacements. We use a 3-dimensional geomechanical poro-elastostatic model of pressure-induced deformation based on Biot's theory [38]. We solve a TV-regularized inversion problem (1.3) with operator B given by (5.1), and operator A given by the extension of (5.2)

(5.3)   A u = d(x, y), \qquad d(x, y) = c \int_{-A}^{A} \int_{-A}^{A} \frac{D\, u(\xi, \eta)\, d\xi\, d\eta}{\left( D^2 + (x - \xi)^2 + (y - \eta)^2 \right)^{3/2}},

where we assume that u = u(ξ, η), (ξ, η) ∈ [−A, A] × [−A, A], is a relative pore pressure change at a point (ξ, η) of the reservoir at a constant depth D, 2A is the reservoir length and breadth, d = d(x, y), (x, y) ∈ [−A, A] × [−A, A], is the induced vertical displacement at a point (x, y) on the surface, and a constant factor c is determined by the poroelastic medium properties and reservoir thickness.

In this experiment, we discretize the pressure and displacement using a 50 × 50 grid, with A = 1.…, D = 0.455 km and c = 5.… × 10^…, based on a poroelastic model of a real-world unconventional hydrocarbon reservoir [28]. We use a least-squares fitting weight α = …. Random Gaussian noise with σ = 0.15% of maximum data amplitude, muted below a quarter of the Nyquist wavenumber, was added to the clean data to produce the noisy displacement measurements of Figure 8(b).

Figure 8: (a) A blocky true pressure model (MPa); (b) the resulting surface displacements (mm) with added random Gaussian noise with σ = 15% of data amplitude.

Figure 9(a) shows the result of the limited-memory Compressive Conjugate Directions Algorithm 4 with m = 100, after a total of 100 combined applications of operator A and its adjoint. For the same number of operator applications, Figure 9(b) shows the best result of the ADMM with restarted Conjugate Gradients Algorithm 5. The corresponding results after 1000 applications of A and A^T are shown in Figures 9(c) and 9(d), respectively.

The Compressive Conjugate Directions method resolves key model features faster than the ADMM using an iterative solution of (4.4) restarted at each ADMM iteration. This advantage of our method is particularly pronounced when the intermediate least-squares problem (4.4) is ill-conditioned—compare Figures 10(a), 10(b) with Figures 10(c), 10(d). To accurately resolve the blocky pressure model of Figure 8(a), the Compressive Conjugate Directions technique requires about a tenth of the operator A and adjoint applications compared with Algorithm 5 when (4.4) is poorly conditioned. And again, as in the previous example, there is a trade-off between the convergence rate of the Compressive Conjugate Directions and the condition number of (4.4): values of λ that result in a more poorly-conditioned (4.4) yield the fastest convergence.

Figure 9: Inversion results after (a) 100 iterations (operator and adjoint applications) of CCD with λ = 10; (b) 100 iterations of RCG with λ = 10; (c) 1000 iterations of CCD with λ = 10; (d) 1000 iterations of RCG with λ = 10. In all tests, CCD is the limited-memory Compressive Conjugate Directions method of Algorithm 4; RCG is ADMM with restarted Conjugate Gradients of Algorithm 5, showing the most accurate model reconstruction among the outputs for different N_c—see Figures 10(a), 10(b), 10(c), 10(d).

Figure 10: Convergence rates for CCD and RCG with various N_c for (a) λ = 5; (b) λ = 10; (c) λ = 50; (d) λ = 100.
6. Discussion.
Compressive Conjugate Directions provides an efficient implementation of the Alternating Direction Method of Multipliers in L1-TV regularized inversion problems (1.1) with computationally expensive operators A. By accumulating and reusing information on the geometry of the intermediate quadratic objective function (4.4), the method requires only one application of the operator A and its adjoint per ADMM iteration, while achieving accuracy comparable to that of the ADMM with exact minimization of (4.4). The method does not improve the worst-case asymptotic convergence rate of the ADMM. However, it can be used for fast recovery of spiky or blocky solution components. The method trades the computational cost of applying operator A and its adjoint for the extra memory required to store previous conjugate direction vectors (4.2).

Our numerical experiments involving problems of geomechanical inversion demonstrated a trade-off between the number of ADMM iterations required to achieve a sufficiently accurate solution approximation, and the condition number of the intermediate least-squares problem (4.4). Understanding the extent to which this phenomenon applies to solving (1.1) with other classes of modeling operators A requires further analysis.

The primary focus of this work is L1-TV regularized inversion problems (1.1). However, the Steered Conjugate Directions Algorithm 2 can be combined with the Method of Multipliers to solve more general problems of large-scale equality-constrained optimization.
For example, consider the problem

(6.1)   \|A u - d\|_2^2 \to \min, \quad B u - c = 0, \quad u \in \mathbb{R}^N, \; d \in \mathbb{R}^M, \; A: \mathbb{R}^N \to \mathbb{R}^M, \; B: \mathbb{R}^N \to \mathbb{R}^K,

where A is a computationally expensive operator. Many "coupled" systems governing two or more physical parameters can be described mathematically as a constrained problem (6.1). Of special interest are cases when K ≪ min{N, M}—e.g., large-scale optimization problems with a localized constraint. Applying the Augmented Lagrangian Method of Multipliers to (6.1), after re-scaling the multiplier vector, we get

(6.2)   u_{k+1} = \mathrm{argmin}_u\; \frac{1}{2}\|A u - d\|_2^2 + \frac{\lambda}{2}\|c - B u + b_k\|_2^2, \qquad b_{k+1} = b_k + c - B u_{k+1}.

As before, the minimization on the first line of (6.2) is equivalent to solving a system of normal equations with a fixed left-hand side and changing right-hand sides. Combining the dual-variable updates from (6.2) with Algorithm 2, we get Algorithm 6. Operator F in Algorithm 6 is given by (2.2) with α = 1. A limited-memory version of Algorithm 6 is obtained trivially by adapting Algorithm 4.
Algorithm 6 Steered Conjugate Directions + Method of Multipliers for solving (6.1)
1: u_0 ← 0_N, b_0 ← 0_K, v_0 ← [ d ; √λ (c + b_0) ]
2: p_0 ← F^T v_0, q_0 ← F p_0, δ_0 ← q_0^T q_0
3: for k = 0, 1, 2, 3, ... do
4:   for i = 0, 1, ..., k do
5:     τ_i ← q_i^T v_k / δ_i
6:   end for
7:   u_{k+1} ← Σ_{i=0}^{k} τ_i p_i
8:   b_{k+1} ← b_k + c − B u_{k+1}
9:   v_{k+1} ← [ d ; √λ (c + b_{k+1}) ]
10:  r_{k+1} ← v_{k+1} − Σ_{i=0}^{k} τ_i q_i
11:  w_{k+1} ← F^T r_{k+1}
12:  s_{k+1} ← F w_{k+1}
13:  for i = 0, 1, ..., k do
14:    β_i ← −q_i^T s_{k+1} / δ_i
15:  end for
16:  p_{k+1} ← Σ_{i=0}^{k} β_i p_i + w_{k+1}
17:  q_{k+1} ← Σ_{i=0}^{k} β_i q_i + s_{k+1}
18:  δ_{k+1} ← q_{k+1}^T q_{k+1}
19:  if δ_{k+1} = 0 then    ▷ Use condition "δ_{k+1} < tolerance" in practice
20:    δ_{k+1} ← 1, p_{k+1} ← 0_N, q_{k+1} ← 0_{M+K}
21:  end if
22:  Exit loop if ‖u_{k+1} − u_k‖ / ‖u_k‖ ≤ target accuracy
23: end for

We envisage potential utility of Algorithm 6 in applications where storing a set of previous conjugate direction vectors (4.2) is computationally more efficient than iteratively solving the quadratic minimization problem in (6.2) from scratch at each iteration of the method of multipliers.

The Compressive Conjugate Directions Algorithm 4 can be extended for solving non-linear inversion problems with L1 and isotropic total-variation regularization. Likewise, the Steered Conjugate Directions Algorithm 6 can be adapted to solving general equality-constrained non-linear optimization problems. A nonlinear theory and further applications of these techniques will be the subject of our next work.

REFERENCES

[1]
H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, 2011. [2]
A. Beck and M. Teboulle, Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems, Image Processing, IEEE Transactions on, 18 (2009), pp. 2419–2434. [3] ,
A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on
Imaging Sciences, 2 (2009), pp. 183–202.[4]
J. M. Bioucas-Dias and M. A. T. Figueiredo, A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration, Trans. Img. Proc., 16 (2007), pp. 2992–3004. [5]
Å. Björck, Numerical Methods for Least Squares Problems, SIAM, 1996. [6]
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3 (2010), pp. 1–122. [7]
S. P. Boyd and L. Vandenberghe , Convex Optimization , Cambridge University Press, 2004.[8]
R. E. Bruck Jr., On the weak convergence of an ergodic iteration for the solution of variational inequalities for monotone operators in Hilbert space, Journal of Mathematical Analysis and Applications, 61 (1977), pp. 159–164. [9]
A. Chambolle, An algorithm for total variation minimization and applications, Journal of Mathematical Imaging and Vision, 20 (2004), pp. 89–97. [10]
A. Chambolle and P. L. Lions , Image recovery via total variational minimization and related problems ,Numerische Mathematik, 76 (1997), pp. 167–188.[11]
P. L. Combettes and V. R. Wajs , Signal recovery by proximal forward-backward splitting , MultiscaleModeling & Simulation, 4 (2005), pp. 1168–1200.[12]
I. Daubechies, M. Defrise, and C. De Mol , An iterative thresholding algorithm for linear inverseproblems with a sparsity constraint , Communications on Pure and Applied Mathematics, 57 (2004),pp. 1413–1457.[13]
J. Douglas and H. H. Rachford , On the numerical solution of heat conduction problems in two andthree space variables , Transactions of the American mathematical Society, 82 (1956), pp. 421–439.[14]
J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Math. Program., 55 (1992), pp. 293–318. [15]
B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, et al., Least angle regression, The Annals of Statistics, 32 (2004), pp. 407–499. [16]
A. Fichtner , Full Seismic Modeling and Inversion , Springer, 2011.[17]
M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright , Gradient projection for sparse reconstruction:Application to compressed sensing and other inverse problems , Selected Topics in Signal Processing,IEEE Journal of, 1 (2007), pp. 586–597.[18]
D. Gabay and B. Mercier , A dual algorithm for the solution of nonlinear variational problems viafinite element approximation , Computers & Mathematics with Applications, 2 (1976), pp. 17 – 40.[19]
R. Glowinski and P. Le Tallec , Augmented Lagrangian and Operator-Splitting Methods in NonlinearMechanics , Society for Industrial and Applied Mathematics, 1989.[20]
R. Glowinski and A. Marroco, Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires, ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique, 9 (1975), pp. 41–76. [21]
T. Goldstein, B. O’Donoghue, S. Setzer, and R. Baraniuk , Fast alternating direction optimizationmethods , SIAM Journal on Imaging Sciences, 7 (2014), pp. 1588–1623.[22]
T. Goldstein and S. Osher , The split Bregman method for L1-regularized problems , SIAM Journal onImaging Sciences, 2 (2009), pp. 323–343.[23]
T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu , The entire regularization path for the support vectormachine , J. Mach. Learn. Res., 5 (2004), pp. 1391–1415.[24]
B. He and X. Yuan , On the O (1 /n ) convergence rate of the Douglas-Rachford alternating directionmethod , SIAM Journal on Numerical Analysis, 50 (2012), pp. 700–709.[25] M. R. Hestenes , Multiplier and gradient methods , Journal of Optimization Theory and Applications, 4(1969), pp. 303–320.[26]
S. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, An interior-point method for large-scale ℓ1-regularized least squares, Selected Topics in Signal Processing, IEEE Journal of, 1 (2007), pp. 606–617. [27] D. Kosloff, R. F. Scott, and J. Scranton, Finite element simulation of Wilmington oil field subsidence: I. Linear modelling, Tectonophysics, 65 (1980), pp. 339–368. [28]
M. Maharramov and M. Zoback , Monitoring of cyclic steam stimulation by inversion of surface tiltmeasurements , AGU Fall Meeting, Session H23A-0859, (2014). [29]
Y. E. Nesterov, A method for solving the convex programming problem with rate of convergence O(1/k²), Dokl. Akad. Nauk SSSR, 269 (1983), pp. 543–547. [30] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, 2006. [31]
M. R. Osborne, B Presnell, and B. A. Turlach , A new approach to variable selection in leastsquares problems , IMA Journal of Numerical Analysis, 20 (2000), pp. 389–403.[32]
G. B. Passty , Ergodic convergence to a zero of the sum of monotone operators in Hilbert space , Journalof Mathematical Analysis and Applications, 72 (1979), pp. 383 – 390.[33]
Y. Qiu, W. Xue, and G. Yu , Intelligent Science and Intelligent Data Engineering: Third Sino-foreign-interchange Workshop, IScIDE 2012, Nanjing, China, October 15-17, 2012. Revised Selected Papers ,Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, ch. A Projected Conjugate Gradient Methodfor Compressive Sensing, pp. 398–406.[34]
R. T. Rockafellar , Convex Analysis , Princeton University Press, 1971.[35] ,
Augmented lagrangians and applications of the proximal point algorithm in convex programming ,Mathematics of Operations Research, 1 (1976), pp. 97–116.[36]
L. I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: Nonlinear Phenomena, 60 (1992), pp. 259–268. [37]
Y. Saad , Iterative Methods for Sparse Linear Systems, second edition , SIAM, 2003.[38]
P. Segall , Earth and Volcano Deformation , Princeton University Press, 2010.[39]
A. Tarantola , Inversion of seismic reflection data in the acoustic approximation , Geophysics, 49 (1984),pp. 1259–1266.[40]
L. N. Trefethen and David Bau III , Numerical Linear Algebra , SIAM, 1997.[41]
H. Uzawa , Studies in Linear and Non-Linear Programming , Stanford University Press, 1958, ch. Iterativemethods for concave programming.[42]
C. R. Vogel and M. E. Oman , Iterative methods for total variation denoising , SIAM J. Sci. Comput.,17 (1996), pp. 227–238.[43]