Preasymptotic Convergence of Randomized Kaczmarz Method
Yuling Jiao ∗ Bangti Jin † Xiliang Lu ‡ October 23, 2018
Abstract
Kaczmarz method is one popular iterative method for solving inverse problems, especially in computed tomography. Recently, it was established that a randomized version of the method enjoys an exponential convergence for well-posed problems, and the convergence rate is determined by a variant of the condition number. In this work, we analyze the preasymptotic convergence behavior of the randomized Kaczmarz method, and show that the low-frequency error (with respect to the right singular vectors) decays faster during first iterations than the high-frequency error. Under the assumption that the initial error is smooth (e.g., sourcewise representation), the results allow explaining the fast empirical convergence behavior, thereby shedding new insights into the excellent performance of the randomized Kaczmarz method in practice. Further, we propose a simple strategy to stabilize the asymptotic convergence of the iteration by means of variance reduction. We provide extensive numerical experiments to confirm the analysis and to elucidate the behavior of the algorithms.
Keywords: randomized Kaczmarz method; preasymptotic convergence; smoothness; error estimates; variance reduction
∗ School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, 430063, P.R. China. ([email protected])
† Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK. ([email protected], [email protected])
‡ Corresponding author. School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China. ([email protected])

Introduction

Kaczmarz method [20], named after Polish mathematician Stefan Kaczmarz, is one popular iterative method for solving linear systems. It is a special form of the general alternating projection method. In the computed tomography (CT) community, it was rediscovered in 1970 by Gordon, Bender and Herman [9], under the name algebraic reconstruction techniques. It was implemented in the very first medical CT scanner, and since then it has been widely employed in CT reconstructions [14, 15, 28].

The convergence of Kaczmarz method for consistent linear systems is not hard to show. However, the theoretically very important issue of convergence rates of Kaczmarz method (or the alternating projection method for linear subspaces) is very challenging. There are several known convergence rates results, all relying on (spectral) quantities of the matrix A that are difficult to compute or verify in practice (see [7] and the references therein). This challenge is well reflected by the fact that the convergence rate of the method depends strongly on the ordering of the equations.

It was numerically discovered several times independently in the literature that using the rows of the matrix A in Kaczmarz method in a random order, called randomized Kaczmarz method (RKM) below, rather than the given order, can often substantially improve the convergence [15, 28]. Thus RKM is quite appealing for practical applications. However, the convergence rate analysis was given only very recently. In an influential paper [34], in 2009, Strohmer and Vershynin established the exponential convergence of RKM for consistent linear systems, and the convergence rate depends on (a variant of) the condition number. This result was then extended and refined in various directions [29, 3, 26, 1, 35, 10, 33], including inconsistent or underdetermined linear systems. Recently, Schöpfer and Lorenz [33] showed the exponential convergence of RKM for sparse recovery with elastic net. We recall the result of Strohmer and Vershynin and its counterpart for noisy data in Theorem 2.1 below. It is worth noting that all these estimates involve the condition number, and for noisy data, the estimate contains a term inversely proportional to the smallest singular value of the matrix A.

These important and interesting existing results do not fully explain the excellent empirical performance of RKM for solving linear inverse problems, especially in the case of noisy data, where the term due to noise is amplified by a factor of the condition number. In practice, one usually observes that the iterates first converge quickly to a good approximation to the true solution, and then start to diverge slowly. That is, it exhibits the typical "semiconvergence" phenomenon for iterative regularization methods, e.g., Landweber method and conjugate gradient methods [13, 21]. This behavior is not well reflected in the known estimates given in Theorem 2.1; see Section 2 for further comments.

The purpose of this work is to study the preasymptotic convergence behavior of RKM.
This is achieved by analyzing carefully the evolution of the low- and high-frequency errors during the randomized Kaczmarz iteration, where the frequency is divided according to the right singular vectors of the matrix A. The results indicate that during initial iterations, the low-frequency error decays much faster than the high-frequency one, cf. Theorems 3.1 and 3.2. Since the inverse solution (relative to the initial guess x_0) is often smooth in the sense that it consists mostly of low-frequency components [13], it explains the good convergence behavior of RKM, thereby shedding new insights into its excellent practical performance. This condition on the inverse solution is akin to the sourcewise representation condition in classical regularization theory [5, 16]. Further, based on the fact that RKM is a special case of the stochastic gradient method [31], we propose a simple modified version using the idea of variance reduction by hybridizing it with the Landweber method, inspired by [19]. This variant enjoys both good preasymptotic and asymptotic convergence behavior, as indicated by the numerical experiments.

Last, we note that in the context of inverse problems, Kaczmarz method has received much recent attention, and has demonstrated very encouraging results in a number of applications. The regularizing property and convergence rates in various settings have been analyzed for both linear and nonlinear inverse problems (see [23, 2, 12, 18, 4, 22, 24, 17] for an incomplete list). However, these interesting works all focus on a fixed ordering of the linear system, instead of the randomized variant under consideration here, and thus they do not cover RKM.

The rest of the paper is organized as follows. In Section 2 we describe RKM and recall the basic tool for our analysis, i.e., singular value decomposition, and a few useful notations. Then in Section 3 we derive the preasymptotic convergence rates for exact and noisy data. Some practical issues are discussed in Section 4. Last, in Section 5, we present extensive numerical experiments to confirm the analysis and shed further insights.

Randomized Kaczmarz method

Now we describe the problem setting and RKM, and also recall known convergence rates results for both consistent and inconsistent data. The linear inverse problem with exact data can be cast into

Ax = b, (2.1)

where the matrix A ∈ R^{n×m}, b ∈ R^n and b ∈ range(A). We denote the i-th row of the matrix A by a_i^t, with a_i ∈ R^m being a column vector, where the superscript t denotes the vector/matrix transpose. The linear system (2.1) can be formally determined or under-determined.

The classical Kaczmarz method [20] proceeds as follows. Given the initial guess x_0, we iterate

x_{k+1} = x_k + ((b_i − ⟨a_i, x_k⟩)/‖a_i‖²) a_i, i = (k mod n) + 1, (2.2)

where ⟨·,·⟩ and ‖·‖ denote the Euclidean inner product and norm, respectively. Thus, Kaczmarz method sweeps through the equations in a cyclic manner, and n iterations constitute one complete cycle.

In contrast to the cyclic choice of the index i in Kaczmarz method, RKM randomly selects i. There are several different variants, depending on the specific random choice of the index i. The variant analyzed by Strohmer and Vershynin [34] is as follows. Given an initial guess x_0, we iterate

x_{k+1} = x_k + ((b_i − ⟨a_i, x_k⟩)/‖a_i‖²) a_i, (2.3)

where i is drawn independent and identically distributed (i.i.d.) from the index set {1, . . . , n}
with the probability p_i for the i-th row given by

p_i = ‖a_i‖²/‖A‖_F², i = 1, . . . , n, (2.4)

where ‖·‖_F denotes the matrix Frobenius norm. This choice of the probability distribution p_i lends itself to a convenient convergence analysis [34]. In this work, we shall focus on the variant (2.3)–(2.4).

Similarly, the noisy data b^δ is given by

b^δ_i = ⟨a_i, x^*⟩ + η_i, i = 1, . . . , n, with ‖η‖ ≤ δ, (2.5)

where δ is the noise level. RKM then reads: given the initial guess x_0, we iterate

x_{k+1} = x_k + ((b^δ_i − ⟨a_i, x_k⟩)/‖a_i‖²) a_i,

where the index i is drawn i.i.d. according to (2.4).

The following theorem summarizes typical convergence results of RKM for consistent and inconsistent linear systems [34, 29, 36] (see [26] for in-depth discussions), under the condition that the matrix A is of full column-rank. For a rectangular matrix A ∈ R^{n×m}, we denote by A† ∈ R^{m×n} the pseudoinverse of A, by ‖A‖ the matrix spectral norm, and by σ_min(A) the smallest singular value of A. The error ‖x_k − x^*‖ of the RKM iterate x_k (with respect to the exact solution x^*) is stochastic due to the random choice of the index i. Below E[·] denotes expectation with respect to the random row index selection. Note that κ_A differs from the usual condition number [8]. Theorem 2.1.
Let x_k be the solution generated by RKM (2.3)–(2.4) at iteration k, and κ_A = ‖A‖_F ‖A†‖ be a (generalized) condition number. Then the following statements hold.

(i) For exact data, there holds E[‖x_k − x^*‖²] ≤ (1 − κ_A^{-2})^k ‖x_0 − x^*‖².

(ii) For noisy data, there holds E[‖x_k − x^*‖²] ≤ (1 − κ_A^{-2})^k ‖x_0 − x^*‖² + δ²/σ_min²(A).

Theorem 2.1 gives error estimates (in expectation) for any iterate x_k, k ≥ 1:
the convergence rate is determined by κ_A. For ill-posed linear inverse problems (e.g., CT), bad conditioning is characteristic and the condition number κ_A can be huge, and thus the theorem predicts a very slow convergence. However, in practice, RKM converges rapidly during the initial iterations. The estimate is also deceptive for noisy data: due to the presence of the term δ²/σ_min²(A), it implies blowup at the very first iteration, which is however not the case in practice. Hence, these results do not fully explain the excellent empirical convergence of RKM for inverse problems.

The next example compares the convergence rates of Kaczmarz method and RKM.

Example 2.1. Given n ≥ 2, let θ = π/n. Consider the linear system with A ∈ R^{n×2}, a_i = (cos((i−1)θ), sin((i−1)θ))^t, and the exact solution x^* = 0, i.e., b = 0. Then we solve it by Kaczmarz method and RKM. For any initial error e_1, the first Kaczmarz iteration projects e_1 onto the line orthogonal to a_1, and generally, after k further iterations, ‖e_{k+1}‖ = |cos θ|^k ‖e_1‖, since consecutive rows differ by the angle θ. For large n, the decreasing factor |cos θ| can be very close to one, and thus each Kaczmarz iteration can only decrease the error slowly. Thus, the convergence rate of Kaczmarz method depends strongly on n: the larger is n, the slower is the convergence. Similarly, for RKM, there holds

E[‖e_{k+1}‖² | e_k] = (1/n) Σ_{i=1}^n sin²((i−1)θ − α_k) ‖e_k‖² = (1/(2n)) Σ_{i=1}^n (1 − cos(2((i−1)θ − α_k))) ‖e_k‖² = (1/2) ‖e_k‖²,

where α_k denotes the angle of e_k and the last identity uses the fact that the angles 2(i−1)θ cover a full period, and consequently E[‖e_{k+1}‖²] = 2^{−(k+1)} ‖e_1‖². For RKM, the convergence rate is independent of n. Further, for any n > 4, we have 0 < θ < π/4 and cos²θ > 1/2, so the per-step contraction of the cyclic method is worse than the factor 1/2 of RKM. This shows the superiority of RKM over the cyclic one.

Last we recall singular value decomposition (SVD) of the matrix A [8], which is the basic tool for the convergence analysis in Section 3. We denote SVD of A ∈ R^{n×m} by

A = U Σ V^t,

where U ∈ R^{n×n} and V ∈ R^{m×m} are column orthonormal matrices, whose column vectors are known as the left and right singular vectors, respectively, and Σ ∈ R^{n×m} is diagonal with the diagonal elements ordered nonincreasingly, i.e., σ_1 ≥ . . . ≥ σ_r >
0, with r = min(m, n). The right singular vectors v_i span the solution space, i.e., x ∈ span{v_i}. We shall write U = (u_1^t; . . . ; u_n^t), so that u_i^t denotes the i-th row of U, and V^t = (v_1^t; . . . ; v_m^t), i.e., V = (v_1 . . . v_m). Note that for inverse problems, empirically, as the index i increases, the right singular vectors v_i are increasingly more oscillatory, capturing more high-frequency components [13]. The behavior is analogous to the inverse of Sturm-Liouville operators. For a general class of convolution integral equations, such oscillating behavior was established in [6]. For many practical applications, the linear system (2.1) can be regarded as a discrete approximation to the underlying continuous problem, and thus inherits the corresponding spectral properties.

Given a frequency cutoff number 1 ≤ L ≤ m, we define two (orthogonal) subspaces of R^m by

L = span{v_1, . . . , v_L} and H = span{v_{L+1}, . . . , v_m},

which denote the low- and high-frequency solution spaces, respectively. This is motivated by the observation that in practice one only looks for smooth solutions that are spanned/well captured by the first few right singular vectors [13]. This condition is akin to the concept of sourcewise representation in regularization theory, e.g., x = A^t w for some w ∈ R^n, or its variants [5, 16], which is needed for deriving convergence rates for the regularized solution. Throughout, we always assume that the truncation level L is fixed. Then for any vector z ∈ R^m, there exists a unique decomposition z = P_L z + P_H z, where P_L and P_H are orthogonal projection operators onto L and H, respectively, defined by

P_L z = Σ_{i=1}^L ⟨v_i, z⟩ v_i and P_H z = Σ_{i=L+1}^m ⟨v_i, z⟩ v_i.

These projection operators will be used below to analyze the preasymptotic behavior of RKM.
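To make the setting concrete, the following minimal Python sketch (not part of the original paper; the function name and array names such as A, b, x0 and the cutoff L are illustrative) implements the RKM update (2.3) with the sampling probabilities (2.4) and tracks the low- and high-frequency error components via the projections P_L and P_H.

```python
import numpy as np

def rkm_frequency_split(A, b, x_true, x0, L, num_iters, seed=0):
    """Run RKM (2.3)-(2.4) and record ||P_L e_k||^2 and ||P_H e_k||^2 along the iteration."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    row_norms_sq = np.sum(A**2, axis=1)
    p = row_norms_sq / row_norms_sq.sum()            # sampling probabilities (2.4)
    _, _, Vt = np.linalg.svd(A, full_matrices=True)  # rows of Vt are the right singular vectors v_i
    x = x0.copy()
    errs_low, errs_high = [], []
    for _ in range(num_iters):
        e = x - x_true
        coeffs = Vt @ e                              # coefficients <v_i, e_k>
        errs_low.append(np.sum(coeffs[:L]**2))       # ||P_L e_k||^2
        errs_high.append(np.sum(coeffs[L:]**2))      # ||P_H e_k||^2
        i = rng.choice(n, p=p)                       # draw a row index i ~ p_i
        a_i = A[i]
        x = x + (b[i] - a_i @ x) / row_norms_sq[i] * a_i   # RKM update (2.3)
    return np.array(errs_low), np.array(errs_high)
```

For exact data one passes b = A @ x_true; for noisy data the perturbed right-hand side b^δ is used instead.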
Preasymptotic convergence analysis
In this section, we present a preasymptotic convergence analysis of RKM. Let x^* be one solution of the linear system (2.1). Our analysis relies on decomposing the error e_k = x_k − x^* of the k-th iterate x_k into low- and high-frequency components (according to the right singular vectors). We aim at bounding the conditional error E[‖e_{k+1}‖² | e_k] (conditioned on e_k, where the expectation E[·] is with respect to the random choice of the index i, cf. (2.4)) by analyzing separately E[‖P_L e_{k+1}‖² | e_k] and E[‖P_H e_{k+1}‖² | e_k]. This is inspired by the fact that the inverse solution consists mainly of the low-frequency components, which is akin to the concept of the source condition in regularization theory [5, 16]. Our error estimates allow explaining the excellent empirical performance of RKM in the context of inverse problems. We shall discuss the preasymptotic convergence for exact and noisy data separately.

First, we analyze the case of noise free data. Let x^* be one solution to the linear system (2.1), and e_k = x_k − x^* be the error at iteration k. Upon substituting the identity b = Ax^* into the RKM iterate, we deduce that for some i ∈ {1, . . . , n}, there holds

e_{k+1} = (I − a_i a_i^t/‖a_i‖²) e_k. (3.1)

Note that I − a_i a_i^t/‖a_i‖² is an orthogonal projection operator. We first give two useful lemmas. Lemma 3.1.
For any e_L ∈ L and e_H ∈ H, there hold

σ_L ‖e_L‖ ≤ ‖A e_L‖ ≤ σ_1 ‖e_L‖, ‖A e_H‖ ≤ σ_{L+1} ‖e_H‖, and ⟨A e_L, A e_H⟩ = 0.

Proof.
The assertions follow directly from simple algebra, and hence the proof is omitted.
Lemma 3.2.
For i = 1, . . . , n, there holds

‖P_H a_i‖ ≤ σ_{L+1} and Σ_{i=1}^n ‖P_H a_i‖² ≤ Σ_{i=L+1}^r σ_i².

Proof.
By definition, P_H a_i = Σ_{j=L+1}^m ⟨a_i, v_j⟩ v_j. Since a_i^t = u_i^t Σ V^t, there holds ⟨a_i, v_j⟩ = u_i^t Σ V^t v_j = ⟨u_i, σ_j e_j⟩ = σ_j (u_i)_j. Hence, ‖P_H a_i‖² = Σ_{j=L+1}^m ⟨a_i, v_j⟩² = Σ_{j=L+1}^m σ_j² |(u_i)_j|² ≤ σ_{L+1}². The second estimate follows similarly.

The next result gives a preasymptotic recursive estimate on E[‖P_L e_{k+1}‖² | e_k] and E[‖P_H e_{k+1}‖² | e_k] for exact data b ∈ range(A). This represents our first main theoretical result. Theorem 3.1.
Let c_1 = σ_L²/‖A‖_F² and c_2 = Σ_{i=L+1}^r σ_i²/‖A‖_F². Then there hold

E[‖P_L e_{k+1}‖² | e_k] ≤ (1 − c_1) ‖P_L e_k‖² + c_2 ‖P_H e_k‖²,
E[‖P_H e_{k+1}‖² | e_k] ≤ c_2 ‖P_L e_k‖² + (1 + c_2) ‖P_H e_k‖².

Proof.
Let e_L and e_H be the low- and high-frequency components of the error e_k, respectively, i.e., e_L = P_L e_k and e_H = P_H e_k. Then by the identities P_L e_{k+1} = e_L − (⟨a_i, e_k⟩/‖a_i‖²) P_L a_i and ⟨P_L a_i, e_L⟩ = ⟨a_i, e_L⟩, we have

‖P_L e_{k+1}‖² = ‖e_L‖² − (2/‖a_i‖²) ⟨P_L a_i, e_L⟩⟨a_i, e_k⟩ + ⟨a_i, e_k⟩² ‖P_L a_i‖²/‖a_i‖⁴
 ≤ ‖e_L‖² − (2/‖a_i‖²) ⟨a_i, e_L⟩⟨a_i, e_k⟩ + ⟨a_i, e_k⟩²/‖a_i‖²
 = ‖e_L‖² − (2/‖a_i‖²) ⟨a_i, e_L⟩⟨a_i, e_k⟩ + (⟨a_i, e_L⟩² + 2⟨a_i, e_L⟩⟨a_i, e_H⟩ + ⟨a_i, e_H⟩²)/‖a_i‖².

Upon noting the identity Σ_{i=1}^n a_i a_i^t = A^t A, taking expectation on both sides yields

E[‖P_L e_{k+1}‖² | e_k] ≤ ‖e_L‖² − (2/‖A‖_F²) ⟨e_k, A^t A e_L⟩ + (‖A e_L‖² + 2⟨e_H, A^t A e_L⟩ + ‖A e_H‖²)/‖A‖_F².

Now substituting the splitting e_k = e_L + e_H and rearranging the terms give

E[‖P_L e_{k+1}‖² | e_k] ≤ ‖e_L‖² − (2/‖A‖_F²) ⟨e_L, A^t A e_L⟩ − (2/‖A‖_F²) ⟨e_H, A^t A e_L⟩ + (‖A e_L‖² + 2⟨e_H, A^t A e_L⟩ + ‖A e_H‖²)/‖A‖_F²
 ≤ ‖e_L‖² − ‖A e_L‖²/‖A‖_F² + ‖A e_H‖²/‖A‖_F².

Thus the first assertion follows from Lemma 3.1. The high-frequency component P_H e_{k+1} satisfies

P_H e_{k+1} = e_H − (⟨a_i, e_k⟩/‖a_i‖²) P_H a_i.

We appeal to the inequality ⟨a_i, e_k⟩² ≤ ‖a_i‖² ‖e_k‖² = ‖a_i‖² (‖e_L‖² + ‖e_H‖²) to get

‖P_H e_{k+1}‖² = ‖e_H‖² − (2/‖a_i‖²) ⟨a_i, e_H⟩⟨a_i, e_k⟩ + ⟨a_i, e_k⟩² ‖P_H a_i‖²/‖a_i‖⁴
 ≤ ‖e_H‖² − (2/‖a_i‖²) ⟨a_i, e_H⟩⟨a_i, e_k⟩ + (‖P_H a_i‖²/‖a_i‖²) (‖e_L‖² + ‖e_H‖²).

Taking expectation and using ⟨A e_H, A e_L⟩ = 0 yields

E[‖P_H e_{k+1}‖² | e_k] ≤ ‖e_H‖² − (2/‖A‖_F²) ‖A e_H‖² + ((‖e_L‖² + ‖e_H‖²)/‖A‖_F²) Σ_{i=1}^n ‖P_H a_i‖²
 ≤ (1 + (Σ_{i=L+1}^r σ_i²)/‖A‖_F²) ‖e_H‖² + ((Σ_{i=L+1}^r σ_i²)/‖A‖_F²) ‖e_L‖².

Thus we obtain the second assertion and complete the proof.
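As a sanity check (not from the paper; the function name and parameters are illustrative), the following sketch evaluates the conditional expectations in Theorem 3.1 exactly, by averaging over all rows with the weights p_i, and compares them with the bounds (1 − c_1)‖P_L e_k‖² + c_2‖P_H e_k‖² and c_2‖P_L e_k‖² + (1 + c_2)‖P_H e_k‖².

```python
import numpy as np

def check_theorem_3_1(A, e_k, L):
    """Compare E[||P_L e_{k+1}||^2 | e_k] and E[||P_H e_{k+1}||^2 | e_k] with the bounds of Theorem 3.1."""
    n, m = A.shape
    _, s, Vt = np.linalg.svd(A)
    fro2 = np.sum(s**2)                        # ||A||_F^2
    c1, c2 = s[L-1]**2 / fro2, np.sum(s[L:]**2) / fro2
    P_L = Vt[:L].T @ Vt[:L]                    # projection onto span{v_1,...,v_L}
    P_H = np.eye(m) - P_L
    row_norms_sq = np.sum(A**2, axis=1)
    p = row_norms_sq / row_norms_sq.sum()      # sampling probabilities (2.4)
    exp_low = exp_high = 0.0
    for i in range(n):                         # exact conditional expectation over the row choice
        e_next = e_k - (A[i] @ e_k) / row_norms_sq[i] * A[i]
        exp_low += p[i] * np.sum((P_L @ e_next)**2)
        exp_high += p[i] * np.sum((P_H @ e_next)**2)
    eL2, eH2 = np.sum((P_L @ e_k)**2), np.sum((P_H @ e_k)**2)
    bound_low = (1 - c1) * eL2 + c2 * eH2
    bound_high = c2 * eL2 + (1 + c2) * eH2
    return (exp_low, bound_low), (exp_high, bound_high)
```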
Remark 3.1.
By Theorem 3.1, the decay of the error E[‖P_L e_{k+1}‖² | e_k] is largely determined by the factor 1 − c_1 and only mildly affected by ‖P_H e_k‖², through the factor c_2. The factor c_2 is very small in the presence of a gap in the singular value spectrum at σ_L, i.e., σ_L ≫ σ_{L+1}, showing clearly the role of the gap. Remark 3.2.
Theorem 3.1 also covers the rank-deficient case, i.e., σ_{L+1} = 0, and it yields

E[‖P_L e_{k+1}‖² | e_k] ≤ (1 − c_1) ‖P_L e_k‖² and E[‖P_H e_{k+1}‖² | e_k] ≤ ‖P_H e_k‖².

If L = m, it recovers Theorem 2.1(i) for exact data. The rank-deficient case was analyzed in [11]. Remark 3.3.
By taking expectation of both sides of the estimates in Theorem 3.1, we obtain

E[‖P_L e_{k+1}‖²] ≤ (1 − c_1) E[‖P_L e_k‖²] + c_2 E[‖P_H e_k‖²],
E[‖P_H e_{k+1}‖²] ≤ c_2 E[‖P_L e_k‖²] + (1 + c_2) E[‖P_H e_k‖²].

Then the error propagation is given (componentwise) by

(E[‖P_L e_k‖²], E[‖P_H e_k‖²])^t ≤ D^k (‖P_L e_0‖², ‖P_H e_0‖²)^t with D = [1 − c_1, c_2; c_2, 1 + c_2].

The pairs of eigenvalues λ_± and eigenvectors v_± of D are given by

λ_± = (2 − c_1 + c_2 ± ((c_1 + c_2)² + 4c_2²)^{1/2})/2,

with v_± proportional to (2c_2, (c_1 + c_2) ± ((c_1 + c_2)² + 4c_2²)^{1/2})^t (normalized to unit length). For the case c_2 ≪ c_1 < 1, i.e., α = c_2/c_1 ≪ 1, we have

λ_+ = 1 + c_1(α + O(α²)) and λ_− = 1 − c_1(1 + O(α²)), and v_+ ≈ (α, 1)^t and v_− ≈ (1, −α)^t.

With V = [v_+ v_−], we have the approximate eigendecomposition, for k = O(1),

D^k ≈ V diag(1 + kαc_1, (1 − c_1)^k) V^t.

Thus, for c_1 ≫ c_2, we have the following approximate error propagation for k = O(1):

E[‖P_L e_k‖²] ≈ (1 − c_1)^k ‖P_L e_0‖² + α(1 − (1 − c_1)^k) ‖P_H e_0‖²,
E[‖P_H e_k‖²] ≈ α(1 − (1 − c_1)^k) ‖P_L e_0‖² + (1 + kαc_1) ‖P_H e_0‖².

Next we turn to the case of noisy data b^δ, cf. (2.5); we use the superscript δ to indicate the noisy case. Since b^δ_i = b_i + η_i, the RKM iteration reads

x_{k+1} − x^* = x_k − x^* + (⟨a_i, x^* − x_k⟩/‖a_i‖²) a_i + (η_i/‖a_i‖²) a_i,

and thus the random error e_{k+1} = x_{k+1} − x^* satisfies

e_{k+1} = (I − a_i a_i^t/‖a_i‖²) e_k + (η_i/‖a_i‖²) a_i. (3.2)

Now we give our second main result, i.e., bounds on the errors E[‖P_L e_{k+1}‖² | e_k] and E[‖P_H e_{k+1}‖² | e_k]. Theorem 3.2.
Let c_1 = σ_L²/‖A‖_F² and c_2 = Σ_{i=L+1}^r σ_i²/‖A‖_F². Then there hold

E[‖P_L e_{k+1}‖² | e_k] ≤ (1 − c_1) ‖P_L e_k‖² + c_2 ‖P_H e_k‖² + δ²/‖A‖_F² + (2δ√c_2/‖A‖_F) ‖e_k‖,
E[‖P_H e_{k+1}‖² | e_k] ≤ c_2 ‖P_L e_k‖² + (1 + c_2) ‖P_H e_k‖² + δ²/‖A‖_F² + (2δ√c_2/‖A‖_F) ‖e_k‖.

Proof. By the recursive relation (3.2), we have the splitting E[‖P_L e_{k+1}‖² | e_k] = I_1 + I_2 + I_3, where the terms are given by (with e_L = P_L e_k and e_H = P_H e_k)

I_1 = Σ_{i=1}^n (‖a_i‖²/‖A‖_F²) ‖e_L − (⟨a_i, e_k⟩/‖a_i‖²) P_L a_i‖²,
I_2 = Σ_{i=1}^n (‖a_i‖²/‖A‖_F²) η_i² ‖P_L a_i‖²/‖a_i‖⁴,
I_3 = Σ_{i=1}^n (‖a_i‖²/‖A‖_F²) (2η_i/‖a_i‖²) [⟨P_L a_i, e_L⟩ − (‖P_L a_i‖²/‖a_i‖²) ⟨a_i, e_k⟩].

The first term I_1 can be bounded directly by Theorem 3.1. Clearly, I_2 ≤ δ²/‖A‖_F². For the third term I_3, we note the splitting

⟨P_L a_i, e_L⟩ − (‖P_L a_i‖²/‖a_i‖²) ⟨a_i, e_k⟩
 = ((‖P_L a_i‖² + ‖P_H a_i‖²)/‖a_i‖²) ⟨P_L a_i, e_L⟩ − (‖P_L a_i‖²/‖a_i‖²) (⟨P_L a_i, e_L⟩ + ⟨P_H a_i, e_H⟩)
 = (‖P_H a_i‖² ⟨P_L a_i, e_L⟩ − ‖P_L a_i‖² ⟨P_H a_i, e_H⟩)/‖a_i‖² =: I_{3,i}.

By the Cauchy–Schwarz inequality, we have

|I_3| ≤ (2/‖A‖_F²) ‖η‖ (Σ_{i=1}^n I_{3,i}²)^{1/2}.

Direct computation, again using the Cauchy–Schwarz inequality, yields

I_{3,i}² ≤ (‖P_H a_i‖² ‖P_L a_i‖²/‖a_i‖⁴) (‖P_H a_i‖ ‖e_L‖ + ‖P_L a_i‖ ‖e_H‖)²
 ≤ (‖P_L a_i‖²/‖a_i‖²) (‖e_L‖² + ‖e_H‖²) ‖P_H a_i‖² ≤ ‖P_H a_i‖² ‖e_k‖².

Consequently, by Lemma 3.2, we obtain

|I_3| ≤ (2δ/‖A‖_F²) (Σ_{i=L+1}^r σ_i²)^{1/2} ‖e_k‖ = (2δ√c_2/‖A‖_F) ‖e_k‖.

These estimates together show the first assertion. For the high-frequency component P_H e_{k+1}, we have E[‖P_H e_{k+1}‖² | e_k] = I_1 + I_2 + I_3, where the terms are now given by

I_1 = Σ_{i=1}^n (‖a_i‖²/‖A‖_F²) ‖e_H − (⟨a_i, e_k⟩/‖a_i‖²) P_H a_i‖²,
I_2 = Σ_{i=1}^n (‖a_i‖²/‖A‖_F²) η_i² ‖P_H a_i‖²/‖a_i‖⁴,
I_3 = Σ_{i=1}^n (‖a_i‖²/‖A‖_F²) (2η_i/‖a_i‖²) [⟨P_H a_i, e_H⟩ − (‖P_H a_i‖²/‖a_i‖²) ⟨a_i, e_k⟩].
The term I_1 can be bounded by Theorem 3.1. Clearly, I_2 ≤ δ²/‖A‖_F². For the term I_3, note the splitting

⟨P_H a_i, e_H⟩ − (‖P_H a_i‖²/‖a_i‖²) ⟨a_i, e_k⟩ = (1/‖a_i‖²) (‖P_L a_i‖² ⟨P_H a_i, e_H⟩ − ‖P_H a_i‖² ⟨P_L a_i, e_L⟩),

and thus the summands coincide with −I_{3,i}, so the same bound applies. This shows the second assertion, and completes the proof of the theorem.

Remark 3.4. Recall the following estimate for RKM [36, Theorem 3.7]:

E[‖e_{k+1}‖² | e_k] ≤ (1 − κ_A^{-2}) ‖e_k‖² + δ²/‖A‖_F².

In comparison, the estimate in Theorem 3.2 is free from κ_A, but introduces an additional term (2δ√c_2/‖A‖_F)‖e_k‖. Since c_2 is generally very small, this extra term is comparable with δ²/‖A‖_F². Theorem 3.2 extends Theorem 3.1 to the noisy case: if δ = 0, it recovers Theorem 3.1. It indicates that if the initial error e_0 = x_0 − x^* concentrates mostly on low frequencies, the iteration will first decrease the error. The smoothness assumption on the initial error e_0 is realistic for inverse problems, notably under the standard source type conditions (for deriving convergence rates) [5, 16]. Nonetheless, the deleterious noise influence will eventually kick in as the iteration proceeds. Remark 3.5.
One can discuss the evolution of the iterates for noisy data, similar to Remark 3.3. By Young's inequality 2ab ≤ εa² + ε^{-1}b², the error satisfies (with c̄_1 = c_1 − εc_2 and c̄_2 = (1 + ε)c_2)

E[‖P_L e_{k+1}‖² | e_k] ≤ (1 − c̄_1) ‖P_L e_k‖² + c̄_2 ‖P_H e_k‖² + (1 + ε^{-1}) δ²/‖A‖_F²,
E[‖P_H e_{k+1}‖² | e_k] ≤ c̄_2 ‖P_L e_k‖² + (1 + c̄_2) ‖P_H e_k‖² + (1 + ε^{-1}) δ²/‖A‖_F².

Then it follows that

(E[‖P_L e_k‖²], E[‖P_H e_k‖²])^t ≤ D̄^k (‖P_L e_0‖², ‖P_H e_0‖²)^t + ((1 + ε^{-1}) δ²/‖A‖_F²) (I − D̄)^{-1} (I − D̄^k) (1, 1)^t, with D̄ = [1 − c̄_1, c̄_2; c̄_2, 1 + c̄_2].

In the case c̄_2 ≪ c̄_1 < 1, i.e., ᾱ = c̄_2/c̄_1 ≪ 1 (by choosing a sufficiently small ε), for k = O(1), repeating the analysis in Remark 3.3 yields

E[‖P_L e_k‖²] ≈ (1 − c̄_1)^k ‖P_L e_0‖² + ᾱ(1 − (1 − c̄_1)^k) ‖P_H e_0‖² + k(1 + ε^{-1}) δ²/‖A‖_F²,
E[‖P_H e_k‖²] ≈ ᾱ(1 − (1 − c̄_1)^k) ‖P_L e_0‖² + (1 + kᾱc̄_1) ‖P_H e_0‖² + k(1 + ε^{-1}) δ²/‖A‖_F².

Thus, the presence of data noise only influences the error of the RKM iterates mildly, by an additive factor of order kδ², during the initial iterations.

Practical issues

When equipped with a proper stopping criterion, Kaczmarz method is a regularization method [23, 21]. Naturally, one would expect that this assertion holds also for RKM (2.3)–(2.4). This however remains to be proven, due to the lack of a proper stopping criterion. To see the delicacy, consider one natural choice, i.e., Morozov's discrepancy principle [27]: choose the smallest integer k such that

‖Ax_k − b^δ‖ ≤ τδ, (4.1)

where τ > 1 is fixed and δ > 0 is the noise level.
In practice, computing the residual ‖Ax_k − b^δ‖ at each iteration is undesirable, since its cost is of the order of evaluating the full gradient, whereas avoiding the latter is the very motivation for RKM! Below we propose one simple remedy by drawing on its connection with stochastic gradient methods [31] and the vast related developments.

First we note that solving (2.1) is equivalent to minimizing the least-squares functional

min_{x ∈ R^m} { f(x) := (1/(2n)) Σ_{i=1}^n |⟨a_i, x⟩ − b_i|² }. (4.2)

Next we recast RKM as a stochastic gradient method for problem (4.2), as noted earlier in [30]. We include a short proof for completeness.

Proposition 4.1. The RKM iteration (2.3)–(2.4) is a (weighted) stochastic gradient update with a constant stepsize n/‖A‖_F².

Proof. With the weight w_i = ‖a_i‖², we rewrite problem (4.2) into

(1/(2n)) Σ_{i=1}^n (⟨a_i, x⟩ − b_i)² = (1/(2n)) Σ_{i=1}^n (w_i/‖A‖_F²)(‖A‖_F²/w_i) (⟨a_i, x⟩ − b_i)² = Σ_{i=1}^n (w_i/‖A‖_F²) f_i(x), with f_i(x) = (‖A‖_F²/(2n w_i)) (⟨a_i, x⟩ − b_i)².

Since Σ_{i=1}^n w_i = ‖A‖_F², we may interpret p_i = w_i/‖A‖_F² as a probability distribution on the set {1, . . . , n}, i.e., (2.4). Next we apply the stochastic gradient method. Since g_i(x) := ∇f_i(x) = (‖A‖_F²/(n w_i)) (⟨a_i, x⟩ − b_i) a_i, with a fixed step length η = n‖A‖_F^{-2}, we get

x_{k+1} = x_k − w_i^{-1} (⟨a_i, x_k⟩ − b_i) a_i,

where i ∈ {1, . . . , n} is drawn i.i.d. according to (2.4). Clearly, it is equivalent to RKM (2.3)–(2.4).

Now we give the mean and variance of the stochastic gradient g_i(x). Proposition 4.2.
Let g(x) = ∇f(x). Then the stochastic gradient g_i(x) satisfies

E[g_i(x)] = g(x),
Cov[g_i(x)] = (‖A‖_F²/n²) Σ_{i=1}^n (⟨a_i, x⟩ − b_i)² (a_i a_i^t/‖a_i‖²) − (1/n²) A^t (Ax − b)(Ax − b)^t A.

Proof.
The full gradient g(x) := ∇f(x) at x is given by g(x) = (1/n) A^t(Ax − b). The mean E[g_i(x)] of the (partial) gradient g_i(x) is given by

E[g_i(x)] = Σ_{i=1}^n (‖a_i‖²/‖A‖_F²) (‖A‖_F²/(n‖a_i‖²)) (⟨a_i, x⟩ − b_i) a_i = (1/n) A^t(Ax − b).

Next, by the bias–variance decomposition, the covariance Cov[g_i(x)] of the gradient g_i(x) is given by

Cov[g_i(x)] = E[g_i(x) g_i(x)^t] − E[g_i(x)] E[g_i(x)]^t
 = Σ_{i=1}^n (‖a_i‖²/‖A‖_F²) (‖A‖_F⁴/(n²‖a_i‖⁴)) (⟨a_i, x⟩ − b_i)² a_i a_i^t − (1/n²) A^t(Ax − b)(Ax − b)^t A
 = (‖A‖_F²/n²) Σ_{i=1}^n (⟨a_i, x⟩ − b_i)² (a_i a_i^t/‖a_i‖²) − (1/n²) A^t(Ax − b)(Ax − b)^t A.

This completes the proof of the proposition.

Thus, the single gradient g_i(x) is an unbiased estimate of the full gradient g(x). For consistent linear systems, the covariance Cov[g_i(x)] is asymptotically vanishing: as x_k → x^*, both terms in the covariance expression tend to zero. However, for inconsistent linear systems, the covariance Cov[g_i(x)] generally does not vanish at the optimal solution x^*:

Cov[g_i(x^*)] ≈ (‖A‖_F²/n²) Σ_{i=1}^n (⟨a_i, x^*⟩ − b^δ_i)² (a_i a_i^t/‖a_i‖²),

since one might expect A^t(Ax^* − b^δ) ≈ 0.
Further, Cov[g_i(x^*)] is of the order δ² in a neighborhood of x^*. One may predict the (asymptotic) dynamics of RKM via a stochastic modified equation from the covariance [25]. The RKM iteration eventually deteriorates due to the nonvanishing covariance, so that its asymptotic convergence slows down.

These discussions motivate the use of variance reduction techniques developed for stochastic gradient methods to reduce the variance of the gradient estimate. There are several possible strategies, e.g., stepsize reduction, stochastic variance reduced gradient (SVRG), averaging and mini-batching (see e.g., [32, 19]). We only adapt SVRG [19] to RKM, termed RKM with variance reduction (RKMVR), cf. Algorithm 1 for details. It hybridizes the stochastic gradient with the (occasional) full gradient to achieve variance reduction. Here, s is the length of the epoch, which determines the frequency of full gradient evaluation and was suggested to be n [19], and K is the maximum number of iterations. In view of Step 2, within the first epoch, it performs only the standard RKM, and at the end of the epoch, it evaluates the full gradient. In RKMVR, the residual ‖Ax_k − b^δ‖ is a direct by-product of the full gradient evaluation and occurs only at the end of each epoch, and thus it does not invoke additional computational effort. The update at Step 8 of Algorithm 1 can be rewritten as (for k ≥ s)

x_{k+1} = x_k + (⟨a_i, x̃ − x_k⟩/‖a_i‖²) a_i − (n/‖A‖_F²) g̃,

and thus the extra terms vanish as x̃ − x_k → 0 and g̃ → 0.
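To complement the pseudocode in Algorithm 1 below, the following minimal Python sketch (not from the paper; the function and variable names such as rkmvr, x0 and tau are illustrative) implements the same SVRG-style update, with the discrepancy-principle check (4.1) performed at the end of each epoch.

```python
import numpy as np

def rkmvr(A, b, x0, K, s, tau, delta, seed=0):
    """RKM with variance reduction (Algorithm 1): SVRG-style hybrid of RKM and the full gradient."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    row_norms_sq = np.sum(A**2, axis=1)
    fro2 = row_norms_sq.sum()                    # ||A||_F^2
    p = row_norms_sq / fro2                      # sampling probabilities (2.4)
    x = x0.copy()
    x_snap = None                                # snapshot x~; absent during the first epoch
    g_snap = np.zeros_like(x0)                   # full gradient g~ at the snapshot (zero initially)
    for k in range(1, K + 1):
        if k % s == 0:
            residual = A @ x - b
            x_snap, g_snap = x.copy(), A.T @ residual / n      # x~ = x_k, g~ = g(x_k)
            if np.linalg.norm(residual) <= tau * delta:        # discrepancy principle (4.1)
                return x
        i = rng.choice(n, p=p)
        scale = fro2 / (n * row_norms_sq[i])
        g_i = scale * (A[i] @ x - b[i]) * A[i]                 # g_i(x_k)
        g_i_snap = scale * (A[i] @ x_snap - b[i]) * A[i] if x_snap is not None else 0.0
        x = x - n / fro2 * (g_i - g_i_snap + g_snap)           # Step 8 of Algorithm 1
    return x
```

During the first epoch the correction terms are zero, so the update reduces to the standard RKM step, mirroring the initialization in Algorithm 1.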
Algorithm 1 Randomized Kaczmarz method with variance reduction (RKMVR).
1: Specify A, b^δ, x_0, K, and s.
2: Initialize g_i(x̃) = 0 and g̃ = 0.
3: for k = 1, . . . , K do
4:   if k mod s = 0 then set x̃ = x_k and g̃ = g(x_k),
5:     and check the discrepancy principle (4.1).
6:   end if
7:   Pick an index i according to (2.4).
8:   Update x_k by x_{k+1} = x_k − (n/‖A‖_F²) (g_i(x_k) − g_i(x̃) + g̃).
9: end for

Numerical experiments and discussions

Now we present numerical results for RKM and RKMVR to illustrate their distinct features. All the numerical examples, i.e., phillips, gravity and shaw, are taken from the public domain MATLAB package
Regutools (available from , last accessed on June 21, 2017). They are Fredholm integral equations of the first kind, with the first example being mildly ill-posed, and the last two severely ill-posed, respectively. Unless otherwise stated, the examples are discretized with a dimension n = m = 1000. The noisy data b^δ is generated from the exact data b as

b^δ_i = b_i + δ max_j(|b_j|) ξ_i, i = 1, . . . , n,

where δ is the relative noise level, and the random variables ξ_i follow an i.i.d. standard Gaussian distribution. The initial guess x_0 for the iterative methods is x_0 = 0. We present the squared error e_k and/or the squared residual r_k, i.e.,

e_k = E[‖x^* − x_k‖²] and r_k = E[‖Ax_k − b^δ‖²]. (5.1)

The expectation E[·] with respect to the random choice of the rows is approximated by the average of 100 independent runs. All the computations were carried out on a personal laptop with 2.50 GHz CPU and 8.00G RAM by MATLAB.
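The following short sketch (illustrative only; the helpers noisy_data and averaged_metrics are not from the paper, and any matrix can stand in for the Regutools examples) shows how the noisy data and the metrics (5.1) can be produced for such an experiment.

```python
import numpy as np

def noisy_data(b, delta, rng):
    """Perturb exact data: b_i^delta = b_i + delta * max_j|b_j| * xi_i with standard Gaussian xi_i."""
    return b + delta * np.max(np.abs(b)) * rng.standard_normal(b.shape)

def averaged_metrics(A, b_delta, x_true, x0, num_iters, num_runs=100):
    """Approximate e_k = E[||x*-x_k||^2] and r_k = E[||A x_k - b^delta||^2] by averaging runs of RKM."""
    e, r = np.zeros(num_iters), np.zeros(num_iters)
    row_norms_sq = np.sum(A**2, axis=1)
    p = row_norms_sq / row_norms_sq.sum()
    for run in range(num_runs):
        rng = np.random.default_rng(run)
        x = x0.copy()
        for k in range(num_iters):
            i = rng.choice(A.shape[0], p=p)
            x = x + (b_delta[i] - A[i] @ x) / row_norms_sq[i] * A[i]   # RKM step (2.3)
            e[k] += np.sum((x_true - x)**2)
            r[k] += np.sum((A @ x - b_delta)**2)
    return e / num_runs, r / num_runs
```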
First we compare the performance of RKM with the cyclic Kaczmarz method (KM) to illustrate the benefit of randomization. Overall, the random reshuffling can substantially improve the convergence of KM, cf. the results in Figs. 1–3 for the examples with different noise levels.

Figure 1: Numerical results (e_k and r_k) for phillips by KM and RKM, for (a) δ = 0, (b) δ = 10^{-3}, (c) δ = 10^{-2} and (d) δ = 5 × 10^{-2}.

Next we examine the convergence more closely. The (squared) error e_k of the Kaczmarz iterate x_k undergoes a sudden drop at the end of each cycle, whereas within the cycle, the drop after each Kaczmarz iteration is small. Intuitively, this can be attributed to the fact that the neighboring rows of the matrix A are highly correlated with each other, and thus each single Kaczmarz iteration reduces the (squared) error e_k only very little, since roughly it repeats the previous projection. The strong correlation between the neighboring rows is the culprit of the slow convergence of the cyclic KM. The randomization ensures that any two rows chosen by two consecutive RKM iterations are less correlated, and thus the iterations are far more effective for reducing the error e_k, leading to a much faster empirical convergence. These observations hold for both exact and noisy data. For noisy data, the error e_k first decreases and then increases for both KM and RKM, and the larger is the noise level δ, the earlier does the divergence occur. That is, both exhibit a "semiconvergence" phenomenon typical for iterative regularization methods. Thus a suitable stopping criterion is needed. Meanwhile, the residual r_k tends to decrease, but for both methods, it oscillates wildly for noisy data and the oscillation magnitude increases with δ. This is due to the nonvanishing variance, cf. the discussions in Section 4. One surprising observation is that a fairly reasonable inverse solution can be obtained by RKM within one cycle of iterations. That is, by ignoring all other cost, RKM can solve the inverse problems reasonably well at a cost less than one full gradient evaluation!

Figure 2: Numerical results (e_k and r_k) for gravity by KM and RKM, for (a) δ = 0, (b) δ = 10^{-3}, (c) δ = 10^{-2} and (d) δ = 5 × 10^{-2}.

Figure 3: Numerical results (e_k and r_k) for shaw by KM and RKM, for (a) δ = 0, (b) δ = 10^{-3}, (c) δ = 10^{-2} and (d) δ = 5 × 10^{-2}.

Now we examine the convergence of RKM. Theorems 3.1 and 3.2 predict that during the first iterations, the low-frequency error e_L = E[‖P_L e_k‖²] decreases rapidly, but the high-frequency error e_H = E[‖P_H e_k‖²] can at best decay mildly. For all examples, the first five singular vectors can capture the majority of the energy of the initial error x^* − x_0. Thus, we choose a truncation level L = 5, and plot the evolution of the low-frequency and high-frequency errors e_L and e_H, and the total error e = E[‖e_k‖²], in Fig. 4.

Numerically, the low-frequency error e_L decays much more rapidly during the initial iterations, and since the low-frequency modes are dominant, the total error e also enjoys a very fast initial decay. Intuitively, this behavior may be explained as follows.
The rows of the matrix A mainly contain low-frequency modes, and thus each RKM iteration tends to mostly decrease the low-frequency error e_L of the initial error x^* − x_0. The high-frequency error e_H experiences a similar but slower decay during the iteration, and then levels off. These observations fully confirm the preasymptotic analysis in Section 3. For noisy data, the error e_k can be highly oscillating, and so is the residual r_k. The larger is the noise level δ, the larger is the oscillation magnitude. However, the degree of ill-posedness of the problem seems not to affect the convergence of RKM, so long as x^* is mainly composed of low-frequency modes.

Figure 4: The error decay (e, e_L and e_H) for the examples (a) phillips, (b) gravity and (c) shaw with two noise levels: δ = 10^{-3} (top) and δ = 5 × 10^{-2} (bottom), with a truncation level L = 5.

To shed further insights, we present in Fig. 5 the decay behavior of the low- and high-frequency errors for the example phillips with a random solution whose entries follow the i.i.d. standard normal distribution. Then the source type condition is not verified for the initial error. Now with a truncation level L = 5, the low-frequency error e_L only composes a small fraction of the initial error e_0. The low-frequency error e_L decays rapidly, exhibiting a fast preasymptotic convergence as predicted by Theorem 3.2, but the high-frequency error e_H stagnates during the iteration. Thus, in the absence of the smoothness condition on e_0, RKM is ineffective, thereby supporting Theorems 3.1 and 3.2.

Naturally, one may divide the total error e into more than two frequency bands. The empirical behavior is similar to the case of two frequency bands; see Fig. 6 for an illustration on the example phillips, with four frequency bands. The lowest-frequency error e_1 decreases fastest, and then the next band e_2 slightly slower, etc. These observations clearly indicate that even though RKM does not employ the full gradient, the iterates are still mainly concerned with the low-frequency modes during the first iterations, like the Landweber method, in the sense that the low-frequency modes are much easier to recover than the high-frequency ones. However, the cost of each RKM iteration is only one n-th of that for the Landweber method, and thus it is computationally much more efficient.
Figure 5: The error decay (e, e_L and e_H) for phillips with a random solution, with a truncation level L = 5, for (a) δ = 10^{-3} and (b) δ = 5 × 10^{-2}.

Figure 6: The error decay for phillips, with the total error e divided into four frequency bands, 1-3, 4-6, 7-9, and the remaining, denoted by e_i, i = 1, . . . , 4, for (a) δ = 10^{-3} and (b) δ = 5 × 10^{-2}.

The nonvanishing variance of the gradient g_i(x) slows down the asymptotic convergence of RKM, and the iterates eventually tend to oscillate wildly in the presence of data noise, cf. the discussion in Section 4. This is expected: the iterate converges to the least-squares solution, which is known to be highly oscillatory for ill-posed inverse problems. Variance reduction is one natural strategy to decrease the variance of the gradient estimate, thereby stabilizing the evolution of the iterates. To illustrate this, we compare the evolution of RKM with RKMVR in Fig. 7. We also include the results by the Landweber method (LM). To compare the iteration complexity only, we count one Landweber iteration as n RKM iterates. The epoch of RKMVR is set to n, the total number of data points, as suggested in [19]. Thus n RKMVR iterates include one full gradient evaluation, and it amounts to 2n RKM iterates. The full gradient evaluation is indicated by flat segments in the plots.

With the increase of the noise level δ, RKM first decreases the error e_k, and then increases it, which is especially pronounced at δ = 5 × 10^{-2}. This is well reflected by the large oscillations of the iterates. RKMVR tends to stabilize the iteration greatly by removing the large oscillations, and thus its asymptotic behavior resembles closely that of LM. That is, RKMVR inherits the good stability of LM, while retaining the fast initial convergence of RKM. Thus, the stopping criterion, though still needed, is less critical for RKMVR, which is very beneficial from the practical point of view. In summary, the simple variance reduction scheme in Algorithm 1 can combine the strengths of both worlds.

Last, we numerically examine the regularizing property of RKMVR with the discrepancy principle (4.1). In Fig. 8, we present the number of iterations for several noise levels for RKMVR (one realization) and LM. For both methods, the number of iterations determined by the discrepancy principle (4.1) appears to decrease with the noise level δ, and RKMVR consistently terminates much earlier than LM, indicating the efficiency of RKMVR. The reconstructions in Fig. 8(d) show that the error increases with the noise level δ, indicating a regularizing property. In contrast, in the absence of the discrepancy principle, the RKMVR iterates eventually diverge as the iteration proceeds, cf. Fig. 7.

Figure 7: Numerical results (e_k and r_k) for the examples phillips, gravity and shaw by RKM, RKMVR and LM, for δ = 10^{-3} and δ = 5 × 10^{-2}.
Concluding remarks

We have presented an analysis of the preasymptotic convergence behavior of the randomized Kaczmarz method. Our analysis indicates that the low-frequency error decays much faster than the high-frequency one during the initial randomized Kaczmarz iterations. Thus, when the low-frequency modes are dominating in the initial error, as typically occurs for inverse problems, the method enjoys very fast initial convergence.

Figure 8: The residual r_k and the recoveries for phillips (top), gravity (middle) and shaw (bottom) by RKMVR and LM with the discrepancy principle (4.1), for (a) δ = 10^{-3}, (b) δ = 10^{-2}, (c) δ = 5 × 10^{-2}, and (d) the reconstructed solutions.

Acknowledgements
The authors are grateful for the constructive comments of the anonymous referees, which have helped improve the quality of the paper. In particular, the remark by one of the referees has led to much improved results as well as more concise proofs. The research of Y. Jiao is partially supported by National Science Foundation of China (NSFC) No. 11501579 and National Science Foundation of Hubei Province No. 2016CFB486, B. Jin by EPSRC grant EP/M025160/1 and UCL Global Engagement grant (2016–2017), and X. Lu by NSFC Nos. 11471253 and 91630313.
References

[1] A. Agaskar, C. Wang, and Y. M. Lu. Randomized Kaczmarz algorithms: Exact MSE analysis and optimal sampling probabilities. In , pages 389-393, Atlanta, GA, 2014.
[2] M. Burger and B. Kaltenbacher. Regularizing Newton-Kaczmarz methods for nonlinear ill-posed problems. SIAM J. Numer. Anal., 44(1):153-182, 2006.
[3] Y. C. Eldar and D. Needell. Acceleration of randomized Kaczmarz method via the Johnson-Lindenstrauss lemma. Numer. Algorithms, 58(2):163-177, 2011.
[4] T. Elfving, P. C. Hansen, and T. Nikazad. Semi-convergence properties of Kaczmarz's method. Inverse Problems, 30(5):055007, 16 pp., 2014.
[5] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer, Dordrecht, 1996.
[6] V. Faber, T. A. Manteuffel, A. B. White, Jr., and G. M. Wing. Asymptotic behavior of singular values and singular functions of certain convolution operators. Comput. Math. Appl. Ser. A, 12(6):733-747, 1986.
[7] A. Galántai. On the rate of convergence of the alternating projection method in finite dimensional spaces. J. Math. Anal. Appl., 310(1):30-44, 2005.
[8] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, fourth edition, 2013.
[9] R. Gordon, R. Bender, and G. Herman. Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and x-ray photography. J. Theor. Biology, 29(3):471-481, 1970.
[10] R. M. Gower and P. Richtárik. Randomized iterative methods for linear systems. SIAM J. Matrix Anal. Appl., 36(4):1660-1690, 2015.
[11] R. M. Gower and P. Richtárik. Stochastic dual ascent for solving linear systems. Preprint, arXiv:1512.06890, 2015.
[12] M. Haltmeier, A. Leitão, and O. Scherzer. Kaczmarz methods for regularizing nonlinear ill-posed equations. I. Convergence analysis. Inverse Probl. Imaging, 1(2):289-298, 2007.
[13] P. C. Hansen. Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. SIAM, Philadelphia, PA, 1998.
[14] G. T. Herman, A. Lent, and P. H. Lutz. Relaxation method for image reconstruction. Comm. ACM, 21(2):152-158, 1978.
[15] G. T. Herman and L. B. Meyer. Algebraic reconstruction techniques can be made computationally efficient. IEEE Trans. Medical Imaging, 12(3):600-609, 1993.
[16] K. Ito and B. Jin. Inverse Problems: Tikhonov Theory and Algorithms. World Scientific, Hackensack, NJ, 2015.
[17] Q. Jin. Landweber-Kaczmarz method in Banach spaces with inexact inner solvers. Inverse Problems, 32(10):104005, 26 pp., 2016.
[18] Q. Jin and W. Wang. Landweber iteration of Kaczmarz type with general non-smooth convex penalty functionals. Inverse Problems, 29(8):085011, 22 pp., 2013.
[19] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.
[20] S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Int. Acad. Polon. Sci. Lett. A, 35:335-357, 1937.
[21] B. Kaltenbacher, A. Neubauer, and O. Scherzer. Iterative Regularization Methods for Nonlinear Ill-posed Problems. Walter de Gruyter GmbH & Co. KG, Berlin, 2008.
[22] S. Kindermann and A. Leitão. Convergence rates for Kaczmarz-type regularization methods. Inverse Probl. Imaging, 8(1):149-172, 2014.
[23] R. Kowar and O. Scherzer. Convergence analysis of a Landweber-Kaczmarz method for solving nonlinear ill-posed problems. In Ill-posed and Inverse Problems, pages 253-270. VSP, Zeist, 2002.
[24] A. Leitão and B. F. Svaiter. On projective Landweber-Kaczmarz methods for solving systems of nonlinear ill-posed equations. Inverse Problems, 32(2):025004, 20 pp., 2016.
[25] Q. Li, C. Tai, and W. E. Dynamics of stochastic gradient algorithms. Preprint, arXiv:1511.06251, 2015.
[26] A. Ma, D. Needell, and A. Ramdas. Convergence properties of the randomized extended Gauss-Seidel and Kaczmarz methods. SIAM J. Matrix Anal. Appl., 36(4):1590-1604, 2015.
[27] V. A. Morozov. On the solution of functional equations by the method of regularization. Soviet Math. Dokl., 7:414-417, 1966.
[28] F. Natterer. The Mathematics of Computerized Tomography. John Wiley & Sons, Ltd., Chichester, 1986.
[29] D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT, 50(2):395-403, 2010.
[30] D. Needell, N. Srebro, and R. Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Math. Program., Ser. A, 155(1-2):549-573, 2016.
[31] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat., 22:400-407, 1951.
[32] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1):83-112, 2017.
[33] F. Schöpfer and D. A. Lorenz. Linear convergence of the randomized sparse Kaczmarz method. Preprint, arXiv:1610.02889, 2016.
[34] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15(2):262-278, 2009.
[35] C. Wang, A. Agaskar, and Y. M. Lu. Randomized Kaczmarz algorithm for inconsistent linear systems: An exact MSE analysis. In Sampling Theory and Applications (SampTA), 2015 International Conference on, pages 498-502, Washington, DC, 2015.
[36] A. Zouzias and N. M. Freris. Randomized extended Kaczmarz for solving least squares. SIAM J. Matrix Anal. Appl., 34(2):773-793, 2013.