On the Regularization Effect of Stochastic Gradient Descent applied to Least Squares
STEFAN STEINERBERGER
Abstract.
We study the behavior of stochastic gradient descent applied to $\|Ax - b\|^2 \to \min$ for invertible $A \in \mathbb{R}^{n \times n}$. We show that there is an explicit constant $c_A$ depending (mildly) on $A$ such that
$$ \mathbb{E}\,\|Ax_{k+1} - b\|^2 \le \left(1 + \frac{c_A^2}{\|A\|_F^2}\right)\|Ax_k - b\|^2 - \frac{2}{\|A\|_F^2}\,\big\|A^T A (x_k - x)\big\|^2. $$
This is a curious inequality: the last term has one more matrix applied to the residual than the remaining terms: if $x_k - x$ is mainly comprised of large singular vectors, stochastic gradient descent leads to a quick regularization. For symmetric matrices, this inequality has an extension to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values smoothes.

Key words and phrases: Stochastic Gradient Descent, Kaczmarz method, Least Squares. S.S. is supported by the NSF (DMS-1763179) and the Alfred P. Sloan Foundation.

1. Introduction
1.1. Stochastic Gradient Descent.
In this paper, we consider the finite-dimensional linear inverse problem
$$ Ax = b, $$
where $A \in \mathbb{R}^{n \times n}$ is an invertible matrix, $x \in \mathbb{R}^n$ is the (unknown) signal of interest and $b$ is a given right-hand side. Throughout this paper, we will use $a_1, \dots, a_n$ to denote the rows of $A$. Equivalently, we will try to solve the problem
$$ \|Ax - b\|^2 = \sum_{i=1}^{n} \left( \langle a_i, x \rangle - b_i \right)^2 \to \min. $$
Following Needell, Srebro & Ward [30], we can interpret this problem as
$$ \sum_{i=1}^{n} f_i(x) \to \min \qquad \text{where} \qquad f_i(x) = \left( \langle a_i, x \rangle - b_i \right)^2. $$
The Lipschitz constant of $\nabla f_i$ is $\|a_i\|_{\ell^2}^2$, which motivates the following basic form of stochastic gradient descent: pick one of the $n$ functions with likelihood proportional to the Lipschitz constant and then do a gradient descent step on this much simpler function, resulting in the update rule
$$ x_{k+1} = x_k + \frac{b_i - \langle a_i, x_k \rangle}{\|a_i\|^2}\, a_i. $$
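For concreteness, one step of this iteration can be written down in a few lines. The following is a minimal NumPy sketch (the function name and interface are ours, not from the paper), sampling row $i$ with probability $\|a_i\|^2 / \|A\|_F^2$.

    import numpy as np

    def kaczmarz_sgd(A, b, x0, steps, rng=None):
        # One pass of x_{k+1} = x_k + (b_i - <a_i, x_k>) / ||a_i||^2 * a_i,
        # where row i is drawn with probability ||a_i||^2 / ||A||_F^2.
        rng = np.random.default_rng() if rng is None else rng
        row_sq = np.sum(A**2, axis=1)            # ||a_i||^2
        probs = row_sq / row_sq.sum()            # ||a_i||^2 / ||A||_F^2
        x = np.array(x0, dtype=float)
        for _ in range(steps):
            i = rng.choice(len(b), p=probs)
            x += (b[i] - A[i] @ x) / row_sq[i] * A[i]
        return x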
This update rule is also known as the Algebraic Reconstruction Technique (ART) in computer tomography [10, 14, 15, 26], the Projection onto Convex Sets Method [3, 5, 6, 9, 38], and the Randomized Kaczmarz method [2, 7, 8, 11, 12, 13, 20, 21, 22, 24, 25, 27, 28, 29, 30, 31, 33, 36, 39, 40, 41, 42, 43, 44]. Strohmer & Vershynin [41] showed that
$$ \mathbb{E}\,\|x_k - x\|^2 \le \left(1 - \frac{1}{\|A^{-1}\|^2\,\|A\|_F^2}\right)^{k} \|x_0 - x\|^2, $$
where $\|A\|_F$ is the Frobenius norm. In practice, the algorithm often converges a lot faster initially; this was studied in [17, 18, 39]. In particular, [39] obtains an identity describing the behavior with respect to the singular values, showing that components along singular vectors associated to large singular values are expected to undergo a more rapid decay. Motivated by this insight, we provide rigorous bounds that quantify this energy cascade from large singular values to small singular values by identifying an interesting inequality for SGD when applied to Least Squares.
1.2. A Motivating Example. We discuss a simple example that exemplifies the phenomenon we are interested in. Let us take a random matrix $A$ by picking each entry independently at random from $\mathcal{N}(0,1)$ and then normalizing the rows to $\|a_i\| = 1$. The right-hand side is $b = (1, 1, \dots, 1)$ and we initialize with $x_0 = 0 \in \mathbb{R}^n$.

Figure 1. The size of $\|Ax_k - b\|_{\ell^2}$ over the first iterations: rapid initial decay which then slows down.

The picture tells a very interesting story: the error $\|Ax_k - b\|$ decays quite rapidly at first before stabilizing in a certain regime. Moreover, for the example shown in Figure 1, $\|x_0\| = 0$ and $\|x_k\| \sim 28$, which is not even close to the true solution with $\|x\| \sim 128$; nonetheless, the approximation of $Ax_k$ to $b$ is quite good. This leaves us with a curious conundrum: we have a good approximation $x_k$ of the true solution in the sense that $Ax_k \sim b$ even though $x_k$ is not very close to $x$. One way this can be achieved is if $x_k - x$ is mainly a linear combination of singular vectors of $A$ associated to small singular values. This is related to the following result recently obtained by the author.

Theorem ([39]). Let $v_\ell$ be a (right) singular vector of $A$ associated to the singular value $\sigma_\ell$. Then, for the sequence $(x_k)_{k=0}^{\infty}$ obtained in this randomized manner,
$$ \mathbb{E}\,\langle x_k - x, v_\ell \rangle = \left(1 - \frac{\sigma_\ell^2}{\|A\|_F^2}\right)^{k} \langle x_0 - x, v_\ell \rangle. $$

Here, $\|A\|_F$ denotes the Frobenius norm. This shows that we expect $x_k - x$ to indeed be mainly a linear combination of singular vectors associated to small singular values, since those are the ones undergoing the slowest decay. It also mirrors the bound obtained by Strohmer & Vershynin [41] since $\sigma_\ell \ge \sigma_n = \|A^{-1}\|^{-1}$. While interesting in itself, this identity does not by itself fully explain the behavior shown above: it holds only in expectation, with no control of the variance. Moreover, the inner product does initially undergo some fluctuations. Taking the same type of matrix as above, we see an example of such fluctuations in Fig. 2.
Figure 2. The evolution of the normalized residual $\left\langle \frac{x_k - x}{\|x_k - x\|}, v_1 \right\rangle$ against the leading singular vector $v_1$: fluctuations around the mean.
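The experiment behind Figures 1 and 2 is easy to reproduce in this spirit; the following sketch uses illustrative sizes and a random seed of our own choosing (not the exact setup used for the figures).

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200                                        # illustrative size
    A = rng.standard_normal((n, n))
    A /= np.linalg.norm(A, axis=1, keepdims=True)  # normalize rows: ||a_i|| = 1
    b = np.ones(n)
    x_true = np.linalg.solve(A, b)
    v1 = np.linalg.svd(A)[2][0]                    # leading right singular vector

    x = np.zeros(n)
    residual, overlap = [], []
    for k in range(5 * n):
        i = rng.integers(n)                        # unit rows: sampling is uniform
        x += (b[i] - A[i] @ x) * A[i]
        residual.append(np.linalg.norm(A @ x - b))
        d = x - x_true
        overlap.append(d @ v1 / np.linalg.norm(d))
    # residual decays quickly at first and then stagnates; overlap fluctuates
    # around a geometrically decaying mean, as in the identity from [39].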
1.3. Related results. This type of question is well studied; we refer to Ali, Dobriban & Tibshirani [1], Defossez & Bach [4], Jain, Kakade, Kidambi, Netrapalli, Pillutla & Sidford [16], Neu & Rosasco [32], Oymak & Soltanolkotabi [34], Schmidt, Le Roux & Bach [37] and references therein. The connection between Stochastic Gradient Descent applied to Least Squares problems and the Randomized Kaczmarz Method has been pointed out by Needell, Srebro & Ward [30]. We also mention the papers by Jiao, Jin & Lu [17] and Jin & Lu [18], who studied a similar question and noted that there is energy transfer from large singular values to small singular values.
2. Results
2.1. Main Result.
The main goal of this note is to provide a simple explanation for the rapid initial regularization: the expected decay of $\|A(x_k - x)\|^2$ under SGD can be bounded from above by a term involving $\|A^T A (x_k - x)\|^2$: this is the same term except that the matrix has been applied to the existing quantity one more time. This increases the norm of the underlying vector except when $A(x_k - x)$ is mainly a linear combination of singular vectors with small singular values. As long as this is not the case, we inherit strong decay properties, and this leads to the rapid initial regularization. We now make this precise.

Theorem 1.
Let $A \in \mathbb{R}^{n \times n}$ be invertible and consider $\|Ax - b\|^2 \to \min$ via the stochastic gradient descent method introduced above. Abbreviating
$$ \alpha = \max_{1 \le i \le n} \frac{\|A a_i\|}{\|a_i\|}, $$
we have
$$ \mathbb{E}\,\|Ax_{k+1} - b\|^2 \le \left(1 + \frac{\alpha^2}{\|A\|_F^2}\right)\|Ax_k - b\|^2 - \frac{2}{\|A\|_F^2}\,\big\|A^T(Ax_k - b)\big\|^2. $$
The inequality also holds for $A \in \mathbb{R}^{m \times n}$ with $m \ge n$ as long as $Ax = b$ has a unique solution. We note that $\alpha$ is typically not large: for random matrices with normalized rows (as in the examples below), one has $\alpha^2 \sim 2$, and this is also true for matrices discretizing PDEs via finite elements.

The main point of the inequality is that the last term has an additional factor of $A^T$: we can rewrite it as
$$ \big\|A^T(Ax_k - b)\big\|^2 = \|A^T A (x_k - x)\|^2. $$
This shows that the presence of large singular vectors in $x_k - x$ forces large decay of $\|A(x_k - x)\|^2$. Conversely, once the algorithm has reached the plateau phase (see Fig. 1), the terms
$$ \alpha^2\,\|A(x_k - x)\|^2 \qquad \text{and} \qquad 2\,\|A^T A(x_k - x)\|^2 $$
are nearly comparable. Thus, this forces $x_k - x$ to be mostly orthogonal to most singular vectors corresponding to large singular values: that, however, shows that it is mainly comprised of small singular vectors and thus explains why $\|A(x_k - x)\| \ll \|x_k - x\|$ is possible in cases where $x_k$ is far away from $x$. In particular, this suggests why the method could be effective for the problem of finding a vector $x$ so that $Ax \approx b$. One way is to initialize stochastic gradient descent with $x_0 = 0$ and run it for a while; due to the difference in scales (second-order norms regularizing first-order norms), we observe that $Ax_k$ converges quite rapidly; whether it converges to something sufficiently close to $b$ for the purpose at hand is a different question.
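Since the left-hand side of Theorem 1 is a finite sum over the $n$ rows, the inequality can be sanity-checked numerically for a fixed iterate by evaluating the one-step expectation exactly; a sketch with random test data of our own choosing (not from the paper):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    A = rng.standard_normal((n, n))
    x_true = rng.standard_normal(n)
    b = A @ x_true
    x_k = rng.standard_normal(n)                  # an arbitrary current iterate
    r = x_k - x_true

    F = np.sum(A**2)                              # ||A||_F^2
    row_sq = np.sum(A**2, axis=1)                 # ||a_i||^2
    alpha = max(np.linalg.norm(A @ A[i]) / np.linalg.norm(A[i]) for i in range(n))

    # exact one-step expectation of ||A x_{k+1} - b||^2
    lhs = sum(row_sq[i] / F *
              np.linalg.norm(A @ r - (A[i] @ r) / row_sq[i] * (A @ A[i]))**2
              for i in range(n))
    # right-hand side of Theorem 1
    rhs = (1 + alpha**2 / F) * np.linalg.norm(A @ r)**2 \
          - 2 / F * np.linalg.norm(A.T @ (A @ r))**2
    assert lhs <= rhs + 1e-9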
2.2. A Sobolev Space Explanation. An interesting way to illustrate the result is in terms of partial differential equations. Suppose we try to solve $-\Delta u = f$ on some domain $\Omega \subset \mathbb{R}^n$. After a suitable discretization, this results in a discrete linear system $Lu = f$, where $L \in \mathbb{R}^{n \times n}$ is a discretization of the Laplacian $-\Delta$. By an abuse of notation, $u$ denotes a decent approximation of the continuous solution and $f$ a discretization of the continuous right-hand side. However, we also have more information: since $L$ discretizes the Laplacian, we expect that
$$ \langle Lu, u \rangle \sim \int_{\Omega} |\nabla u|^2\, dx \qquad \text{and} \qquad \langle Lu, Lu \rangle \sim \int_{\Omega} |\Delta u|^2\, dx. $$
Here, the first term corresponds to the size of $u$ in the Sobolev space $\dot H^1$ while the second term is the size of $u$ in the Sobolev space $\dot H^2$. In fact, this is a common way to define discretized approximations of Sobolev spaces, also known as the spectral definition since they are defined in terms of the spectrum of $L$. Suppose now we compute a sequence of approximations $u_k$ via the method outlined above. Then Theorem 1 can be rephrased as
$$ \mathbb{E}\,\|u_{k+1} - u\|_{\dot H^1}^2 \le \left(1 + \frac{\alpha^2}{\|L\|_F^2}\right)\|u_k - u\|_{\dot H^1}^2 - \frac{2}{\|L\|_F^2}\,\|u_k - u\|_{\dot H^2}^2. $$
What is of great interest here is that the decay of the error in $\dot H^1$ is driven by the decay of the error in $\dot H^2$ (which is usually larger).
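To make this heuristic concrete, one can take the standard second-order finite-difference discretization of $-d^2/dx^2$ in one dimension (our choice of discretization, purely for illustration) and evaluate the two quadratic forms playing the roles of $\dot H^1$ and $\dot H^2$:

    import numpy as np

    m = 100                                        # interior grid points
    h = 1.0 / (m + 1)
    # standard 1D finite-difference Laplacian with Dirichlet boundary conditions
    L = (2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h**2

    x_grid = np.linspace(h, 1 - h, m)
    u = np.sin(np.pi * x_grid)                     # a smooth test function

    h1_sq = u @ (L @ u)               # discrete stand-in for int |u'|^2  (up to scaling)
    h2_sq = np.linalg.norm(L @ u)**2  # discrete stand-in for int |u''|^2 (up to scaling)

    # the constant alpha from Theorem 1 for this particular matrix
    alpha = max(np.linalg.norm(L @ L[i]) / np.linalg.norm(L[i]) for i in range(m))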
2.3. An Example. We illustrate this with an example. Choosing a random matrix $A$ (and then, for convenience, normalizing the rows to $\|a_i\| = 1$), we solve $Ax = (1, 1, \dots, 1)$ starting with a random initial vector $x_0$ where each entry is chosen independently from a standardized $\mathcal{N}(0,1)$ distribution. We consider both the evolution of $\|Ax_k - b\|^2$ across multiple runs as well as the size of
$$ \frac{\alpha^2}{\|A\|_F^2}\,\|Ax_k - b\|^2 - \frac{2}{\|A\|_F^2}\,\big\|A^T(Ax_k - b)\big\|^2, $$
which is the term from our Theorem quantifying the expected decay at each step.
Figure 3. Left: $\|Ax_k - b\|_{\ell^2}^2$ for $k = 1, \dots, 3000$, decaying from roughly $\sim 800$ (with little variation across multiple runs). Right: the quantity $\frac{\alpha^2}{\|A\|_F^2}\|Ax_k - b\|^2 - \frac{2}{\|A\|_F^2}\big\|A^T(Ax_k - b)\big\|^2$. The bound in the Theorem implies an expected decay of roughly $-0.23$ per time-step which, across 3000 time steps, leads to a total of roughly $\sim 696$ in decay.
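The quantity shown in the right panel of Figure 3 can be tracked alongside the iteration with a few extra lines; a sketch with illustrative sizes of our own choosing (not the ones used for the figure):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100
    A = rng.standard_normal((n, n))
    A /= np.linalg.norm(A, axis=1, keepdims=True)  # ||a_i|| = 1
    b = np.ones(n)
    x = rng.standard_normal(n)                     # random N(0,1) initialization

    F = np.sum(A**2)                               # ||A||_F^2 (= n for unit rows)
    alpha_sq = max(np.linalg.norm(A @ A[i])**2 for i in range(n))

    decay_bound = []
    for k in range(3000):
        res = A @ x - b
        decay_bound.append(alpha_sq / F * (res @ res)
                           - 2 / F * np.linalg.norm(A.T @ res)**2)
        i = rng.integers(n)
        x += (b[i] - A[i] @ x) * A[i]
    # decay_bound[k] is an upper bound on E||A x_{k+1} - b||^2 - ||A x_k - b||^2;
    # it is strongly negative early on and creeps toward zero as the run plateaus.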
2.4. Higher Powers.
If the matrix $A \in \mathbb{R}^{n \times n}$ is symmetric, we can extend the result to higher powers of the matrix.

Theorem 2.
Let $A \in \mathbb{R}^{n \times n}$ be symmetric and invertible. When solving $\|Ax - b\|^2 \to \min$ via the stochastic gradient descent method outlined above, we have, for any $\ell \in \mathbb{N}$ and with
$$ \alpha_\ell = \max_{1 \le i \le n} \frac{\|A^\ell a_i\|}{\|a_i\|}, $$
the estimate
$$ \mathbb{E}\,\|A^\ell(x_{k+1} - x)\|^2 \le \left(1 + \frac{\alpha_\ell^2}{\|A\|_F^2}\right)\|A^\ell(x_k - x)\|^2 - \frac{2}{\|A\|_F^2}\,\big\|A^{\ell+1}(x_k - x)\big\|^2. $$

This shows that the same phenomenon does indeed happen at all scales of `smoothness'. The applicability of the result naturally depends on the growth of $\alpha_\ell$ in $\ell$, though, generically, one would not expect this to be badly behaved: a priori, there is no good reason to expect that the rows of a matrix happen to be linear combinations of singular vectors associated to large singular values; though, naturally, this can happen (for example, if $A$ has one very large entry on the diagonal).
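The remark about the growth of $\alpha_\ell$ is easy to explore numerically; the sketch below (random test matrices of our own choosing, not from the paper) compares a generic symmetric matrix with one that has a single very large diagonal entry.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    B = rng.standard_normal((n, n)) / np.sqrt(n)
    A = (B + B.T) / 2                          # a generic symmetric matrix
    A_bad = A.copy()
    A_bad[0, 0] += 50.0                        # one very large diagonal entry

    def alpha(M, ell):
        # alpha_ell = max_i ||M^ell a_i|| / ||a_i||, where a_i are the rows of M
        Ml = np.linalg.matrix_power(M, ell)
        return max(np.linalg.norm(Ml @ M[i]) / np.linalg.norm(M[i])
                   for i in range(len(M)))

    for ell in range(1, 5):
        print(ell, alpha(A, ell), alpha(A_bad, ell))
    # alpha_ell stays modest for the generic matrix but grows rapidly once a row
    # is dominated by a direction associated with a very large singular value.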
3. Proofs

3.1. Proof of Theorem 1.
Proof.
To simplify exposition, we introduce the residual
$$ r_k = x_k - x. $$
Plugging in, we obtain that if the $i$-th equation is chosen, then
$$ x + r_{k+1} = x_{k+1} = x_k + \frac{b_i - \langle a_i, x_k\rangle}{\|a_i\|^2}\, a_i = x + r_k + \frac{b_i - \langle a_i, x + r_k\rangle}{\|a_i\|^2}\, a_i = x + r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, a_i + \frac{b_i - \langle a_i, x\rangle}{\|a_i\|^2}\, a_i. $$
Since $x$ is the exact solution, we have $b_i - \langle a_i, x\rangle = 0$ and
$$ r_{k+1} = r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, a_i. $$
Recalling that the $i$-th row is chosen with probability proportional to $\|a_i\|^2$,
$$ \mathbb{E}\,\|Ar_{k+1}\|^2 = \mathbb{E}_i \left\| A\left(r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, a_i\right)\right\|^2 = \sum_{i=1}^{n} \frac{\|a_i\|^2}{\|A\|_F^2} \left\| Ar_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, Aa_i \right\|^2. $$
This norm can be explicitly squared out as
$$ \left\| Ar_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, Aa_i \right\|^2 = \|Ar_k\|^2 - 2\,\frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\,\langle Ar_k, Aa_i\rangle + \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^4}\,\|Aa_i\|^2. $$
This allows us to rewrite the summation as
$$ \mathbb{E}\,\|Ar_{k+1}\|^2 = \sum_{i=1}^{n} \frac{\|a_i\|^2}{\|A\|_F^2}\left( \|Ar_k\|^2 - 2\,\frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\,\langle Ar_k, Aa_i\rangle + \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^4}\,\|Aa_i\|^2 \right) $$
$$ = \|Ar_k\|^2 - \frac{2}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle Ar_k, Aa_i\rangle + \frac{1}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle^2\,\frac{\|Aa_i\|^2}{\|a_i\|^2}. $$
We have
$$ \frac{2}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle Ar_k, Aa_i\rangle = \frac{2}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A^T A r_k, a_i\rangle = \frac{2}{\|A\|_F^2}\,\langle Ar_k, A A^T A r_k\rangle = \frac{2}{\|A\|_F^2}\,\|A^T A r_k\|^2. $$
The last sum we bound from above, using $\sum_{i=1}^{n} \langle a_i, r_k\rangle^2 = \|Ar_k\|^2$, via
$$ \frac{1}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle^2\,\frac{\|Aa_i\|^2}{\|a_i\|^2} \le \frac{\max_i \|Aa_i\|^2/\|a_i\|^2}{\|A\|_F^2}\,\|Ar_k\|^2. $$
This results in the desired estimate. $\Box$
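The two summation identities used above, $\sum_{i} \langle a_i, r_k\rangle \langle Ar_k, Aa_i\rangle = \|A^T A r_k\|^2$ and $\sum_{i} \langle a_i, r_k\rangle^2 = \|Ar_k\|^2$, are easily confirmed numerically; a small sketch with random data of our own choosing:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 30
    A = rng.standard_normal((n, n))
    r = rng.standard_normal(n)

    s1 = sum((A[i] @ r) * ((A @ r) @ (A @ A[i])) for i in range(n))
    s2 = sum((A[i] @ r)**2 for i in range(n))

    assert np.isclose(s1, np.linalg.norm(A.T @ A @ r)**2)  # sum <a_i,r><Ar,Aa_i> = ||A^T A r||^2
    assert np.isclose(s2, np.linalg.norm(A @ r)**2)        # sum <a_i,r>^2       = ||Ar||^2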
3.2. Proof of Theorem 2.
Proof.
We again reduce the problem to the study of the residual
$$ r_{k+1} = r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, a_i. $$
When looking at integer powers, we observe that, by the same reasoning,
$$ \mathbb{E}\,\|A^\ell r_{k+1}\|^2 = \mathbb{E}_i \left\| A^\ell\left( r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, a_i \right)\right\|^2 = \sum_{i=1}^{n} \frac{\|a_i\|^2}{\|A\|_F^2}\left\| A^\ell r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, A^\ell a_i \right\|^2 $$
$$ = \sum_{i=1}^{n} \frac{\|a_i\|^2}{\|A\|_F^2}\left( \|A^\ell r_k\|^2 - 2\,\frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\,\langle A^\ell r_k, A^\ell a_i\rangle + \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^4}\,\|A^\ell a_i\|^2 \right) $$
$$ = \|A^\ell r_k\|^2 - \frac{2}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A^\ell r_k, A^\ell a_i\rangle + \frac{1}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle^2\,\frac{\|A^\ell a_i\|^2}{\|a_i\|^2}. $$
The first term is easy and the third term can, as before, be bounded by
$$ \frac{1}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle^2\,\frac{\|A^\ell a_i\|^2}{\|a_i\|^2} \le \frac{\alpha_\ell^2}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle^2 = \frac{\alpha_\ell^2}{\|A\|_F^2}\,\|Ar_k\|^2. $$
It remains to understand the second term: here, we can use the symmetry of the matrix to write
$$ \frac{2}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A^\ell r_k, A^\ell a_i\rangle = \frac{2}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A^{2\ell} r_k, a_i\rangle = \frac{2}{\|A\|_F^2}\,\langle A^{2\ell} r_k, A^2 r_k\rangle = \frac{2}{\|A\|_F^2}\,\langle A^{\ell+1} r_k, A^{\ell+1} r_k\rangle = \frac{2}{\|A\|_F^2}\,\|A^{\ell+1} r_k\|^2. $$
Combining the three terms gives the result. $\Box$

References

[1] A. Ali, E. Dobriban and R. Tibshirani, The Implicit Regularization of Stochastic Gradient Flow for Least Squares, arXiv:2003.07802.
[2] Z.-Z. Bai and W.-T. Wu, On convergence rate of the randomized Kaczmarz method, Linear Algebra and its Applications (2018), p. 252–269.
[3] C. Cenker, H. G. Feichtinger, M. Mayer, H. Steier and T. Strohmer, New variants of the POCS method using affine subspaces of finite codimension, with applications to irregular sampling, Proc. SPIE: Visual Communications and Image Processing, p. 299–310, 1992.
[4] A. Defossez and F. Bach, Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions, AISTATS, 2015.
[5] F. Deutsch, Rate of convergence of the method of alternating projections, Internat. Schriftenreihe Numer. Math. (1985), p. 96–107.
[6] F. Deutsch and H. Hundal, The rate of convergence for the method of alternating projections, II, J. Math. Anal. Appl. (1997), p. 381–405.
[7] Y. C. Eldar and D. Needell, Acceleration of randomized Kaczmarz method via the Johnson-Lindenstrauss lemma, Numer. Algorithms (2011), p. 163–177.
[8] T. Elfving, P.-C. Hansen and T. Nikazad, Semi-convergence properties of Kaczmarz's method, Inverse Problems (2014), 055007.
[9] A. Galantai, On the rate of convergence of the alternating projection method in finite dimensional spaces, J. Math. Anal. Appl. (2005), p. 30–44.
[10] R. Gordon, R. Bender and G. Herman, Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and x-ray photography, Journal of Theoretical Biology (1970), p. 471–481.
[11] D. Gordon, A derandomization approach to recovering bandlimited signals across a wide range of random sampling rates, Numer. Algor. (2018), p. 1141–1157.
[12] R. M. Gower, D. Molitor, J. Moorman and D. Needell, Adaptive Sketch-and-Project Methods for Solving Linear Systems, arXiv:1909.03604.
[13] R. M. Gower and P. Richtarik, Randomized iterative methods for linear systems, SIAM J. Matrix Anal. Appl. (2015), p. 1660–1690.
[14] G. T. Herman, Image Reconstruction from Projections. The Fundamentals of Computerized Tomography, Computer Science and Applied Mathematics, Academic Press, New York, 1980.
[15] G. T. Herman and L. B. Meyer, Algebraic reconstruction techniques can be made computationally efficient, IEEE Trans. Medical Imaging (1993), p. 600–609.
[16] P. Jain, S. Kakade, R. Kidambi, P. Netrapalli, V. Pillutla and A. Sidford, A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares), 37th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (2018).
[17] Y. Jiao, B. Jin and X. Lu, Preasymptotic Convergence of Randomized Kaczmarz Method, Inverse Problems (2017), article 125012.
[18] B. Jin and X. Lu, On the regularizing property of stochastic gradient descent, Inverse Problems (2018), 015004.
[19] S. Kaczmarz, Angenäherte Auflösung von Systemen linearer Gleichungen, Bulletin International de l'Academie Polonaise des Sciences et des Lettres. Classe des Sciences Mathematiques et Naturelles. Serie A, Sciences Mathematiques (1937), p. 355–357.
[20] Y.-T. Lee and A. Sidford, Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems, FOCS 2013.
[21] D. Leventhal and A. S. Lewis, Randomized Methods for Linear Constraints: Convergence Rates and Conditioning, Mathematics of Operations Research (2010), p. 641–654.
[22] J. Liu and S. Wright, An accelerated randomized Kaczmarz algorithm, Math. Comp. (2016), p. 153–178.
[23] A. Ma and D. Needell, Stochastic gradient descent for linear systems with missing data, Numer. Math. Theory Methods Appl. (2019), p. 1–20.
[24] A. Ma, D. Needell and A. Ramdas, Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods, SIAM J. Matrix Anal. Appl. (2015), p. 1590–1604.
[25] J. Moorman, T. Tu, D. Molitor and D. Needell, Randomized Kaczmarz with Averaging, arXiv:2002.04126.
[26] F. Natterer, The Mathematics of Computerized Tomography, Wiley, New York, 1986.
[27] D. Needell, Randomized Kaczmarz solver for noisy linear systems, BIT Numerical Mathematics (2010), p. 395–403.
[28] D. Needell and J. Tropp, Paved with good intentions: Analysis of a randomized block Kaczmarz method, Linear Algebra and its Applications (2014), p. 199–221.
[29] D. Needell and R. Ward, Two-Subspace Projection Method for Coherent Overdetermined Systems, J. Fourier Anal. Appl. (2013), p. 256–269.
[30] D. Needell, R. Ward and N. Srebro, Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, Advances in Neural Information Processing Systems, p. 1017–1025.
[31] D. Needell, R. Zhao and A. Zouzias, Randomized block Kaczmarz method with projection for solving least squares, Linear Algebra and its Applications (2015), p. 322–343.
[32] G. Neu and L. Rosasco, Iterate Averaging as Regularization for Stochastic Gradient Descent, Proceedings of Machine Learning Research (2018), p. 1–21.
[33] J. Nutini, B. Sepehry, I. Laradji, M. Schmidt, H. Koepke and A. Virani, Convergence Rates for Greedy Kaczmarz Algorithms, and Faster Randomized Kaczmarz Rules Using the Orthogonality Graph, 32nd Conference on Uncertainty in Artificial Intelligence, 2016.
[34] S. Oymak and M. Soltanolkotabi, Overparameterized nonlinear learning: Gradient descent takes the shortest path?, International Conference on Machine Learning (2019), p. 4951–4960.
[35] E. L. Piccolomini and F. Zama, The conjugate gradient regularization method in computed tomography problems, Applied Mathematics and Computation (1999), p. 87–99.
[36] C. Popa, Convergence rates for Kaczmarz-type algorithms, Numer. Algor. (2018), p. 1–17.
[37] M. Schmidt, N. Le Roux and F. Bach, Minimizing finite sums with the stochastic average gradient, Math. Program., Ser. A (2017), p. 83–112.
[38] K. M. Sezan and H. Stark, Applications of convex projection theory to image recovery in tomography and related areas, in H. Stark (ed.), Image Recovery: Theory and Application, p. 415–462, Academic Press, 1987.
[39] S. Steinerberger, Randomized Kaczmarz converges along small singular vectors, arXiv:2006.16978.
[40] S. Steinerberger, A Weighted Randomized Kaczmarz Method for Solving Linear Systems, arXiv:2007.02910.
[41] T. Strohmer and R. Vershynin, A randomized Kaczmarz algorithm for linear systems with exponential convergence, Journal of Fourier Analysis and Applications (2009), p. 262–278.
[42] Y. Tan and R. Vershynin, Phase retrieval via randomized Kaczmarz: theoretical guarantees, Information and Inference: A Journal of the IMA (2019), p. 97–123.
[43] J.-J. Zhang, A new greedy Kaczmarz algorithm for the solution of very large linear systems, Applied Mathematics Letters (2019), p. 207–212.
[44] A. Zouzias and N. M. Freris, Randomized extended Kaczmarz for solving least squares, SIAM J. Matrix Anal. Appl. (2013), p. 773–793.

Department of Mathematics, University of Washington, Seattle