Randomized Block Kaczmarz Method with Projection for Solving Least Squares
Deanna Needell^a,*, Ran Zhao^b, Anastasios Zouzias^c
^a Dept. of Mathematical Sciences, Claremont McKenna College, Claremont, CA 91711
^b Dept. of Mathematics, Claremont Graduate Univ., Claremont, CA 91711
^c IBM Research Lab, Zurich
* Corresponding author. Email addresses: [email protected] (Deanna Needell), [email protected] (Ran Zhao), [email protected] (Anastasios Zouzias)
Abstract
The Kaczmarz method is an iterative method for solving overcomplete linear systems of equations Ax = b. The randomized version of the Kaczmarz method put forth by Strohmer and Vershynin iteratively projects onto a randomly chosen solution space given by a single row of the matrix A and converges linearly in expectation to the solution of a consistent system. In this paper we analyze two block versions of the method, each with a randomized projection, designed to converge in expectation to the least squares solution, often faster than the standard variants. Our approach utilizes both a row and column paving of the matrix A to guarantee linear convergence when the matrix has consistent row norms (called nearly standardized), and a single column paving when the row norms are unbounded. The proposed methods are an extension of the block Kaczmarz method analyzed by Needell and Tropp and the Randomized Extended Kaczmarz method of Zouzias and Freris. The contribution is thus two-fold; unlike the standard Kaczmarz method, our results demonstrate convergence to the least squares solution of inconsistent systems (both methods in the nearly standardized case and the second method in other cases). By using appropriate blocks of the matrix this convergence can be significantly accelerated, as is demonstrated by numerical experiments.

Keywords: block Kaczmarz, randomized extended Kaczmarz, projections onto convex sets, algebraic reconstruction technique, matrix paving
1. Introduction
The Kaczmarz method [Kac37] is a popular iterative solver of overdetermined systems of linear equations Ax = b. Because of its simplicity and performance, the method and its derivatives are used in a range of applications from image reconstruction to digital signal processing [CFM+92, Nat01, SS87].
The method performs a series of orthogonal projections and iteratively converges to the solution of the system of equations. It is therefore computationally feasible even for very large and overdetermined systems.

Given a vector $b$ and an $n \times d$ full rank (real or complex) matrix $A$ with rows $a_1, a_2, \ldots, a_n$, the algorithm begins with an initial estimate $x_0$ and cyclically projects the estimation onto each of the solution spaces. This process can be described as follows:
$$x_j = x_{j-1} + \frac{b[i] - \langle a_i, x_{j-1}\rangle}{\|a_i\|_2^2}\,a_i,$$
where $b[i]$ denotes the $i$th coordinate of $b$, $x_j$ is the estimation in the $j$th iteration, $\|\cdot\|_2$ denotes the usual $\ell_2$ vector norm, $\langle\cdot,\cdot\rangle$ the standard $\ell_2$ inner product, and $i = (j \bmod n) + 1$ cycles through the rows of $A$.

Since the method cycles through the rows of $A$, the performance of the algorithm may depend heavily on the ordering of these rows. A poor ordering may lead to very slow convergence. To overcome this obstacle, one can select the row $a_i$ at random to improve the convergence rate [HM93, Nat01]. Strohmer and Vershynin proposed and analyzed a method which selects a given row with probability proportional to the square of its $\ell_2$ norm [SV09, SV06]. They show that with this selection strategy, the randomized Kaczmarz method has an expected linear convergence rate (mathematicians often refer to this type of convergence as exponential) to the unique solution $x_\star$:
$$\mathbb{E}\|x_j - x_\star\|_2^2 \le \left(1 - \frac{1}{R}\right)^j\|x_0 - x_\star\|_2^2, \qquad (1)$$
where $R$ is the scaled condition number, $R = \|A^{-1}\|^2\|A\|_F^2$, $\|\cdot\|_F$ denotes the Frobenius norm, and $\|A^{-1}\| \stackrel{\text{def}}{=} \inf\{M : M\|Ax\|_2 \ge \|x\|_2 \text{ for all } x\}$ is well-defined since $A$ has full column rank. (If the full-rank assumption is removed, the method converges to the solution set at the same linear rate with $\|A^{-1}\|$ replaced by the reciprocal of the smallest non-zero singular value; see e.g. (3) of [LWS14].) This convergence rate (1) is essentially independent of the number of rows of $A$ and shows that for well-conditioned matrices, the randomized Kaczmarz method converges to the solution in just $O(d)$ iterations [SV09]. This yields an overall runtime of $O(d^2)$, which is much superior to others such as $O(nd^2)$ for Gaussian elimination. There are also cases where randomized Kaczmarz even outperforms the conjugate gradient method; see the discussion in [SV09] for details.

When the system is perturbed by noise or is no longer consistent, $Ax_\star + e = b$, the randomized Kaczmarz method still provides expected linear convergence down to an error threshold [Nee10],
$$\mathbb{E}\|x_j - x_\star\|_2 \le \left(1 - \frac{1}{R}\right)^{j/2}\|x_0 - x_\star\|_2 + \sqrt{R}\cdot\max_i\frac{|e[i]|}{\|a_i\|_2}, \qquad (2)$$
where $|e[i]|$ denotes the $i$th entry of $e$. This result is sharp, and shows that the randomized Kaczmarz method converges with a radius proportional to the magnitude of the largest entry of the noise in the system. Since the iterates of the Kaczmarz method always lie in a single solution space, the method clearly will not converge to the least squares solution of an inconsistent system.

The bound (2) demonstrates that the randomized Kaczmarz method performs well when the noise in inconsistent systems is small. Zouzias and Freris introduced a variant of the method which utilizes a random projection to iteratively reduce the norm of the error [ZF12]. They show that the estimate of this
Randomized Extended Kaczmarz (REK) method converges linearly in expectation to the least squares solution of the system, breaking the radius barrier of the standard method. The algorithm maintains not only an estimate $x_j$ of the solution but also an approximation $z_j$ of the projection of $b$ onto the range of $A$:
$$x_j = x_{j-1} + \frac{b[i] - z_{j-1}[i] - \langle a_i, x_{j-1}\rangle}{\|a_i\|_2^2}\,a_i, \qquad z_j = z_{j-1} - \frac{\langle a^{(k)}, z_{j-1}\rangle}{\|a^{(k)}\|_2^2}\,a^{(k)}, \qquad (3)$$
where in iteration $j$, $a_i$ and $a^{(k)}$ are a row and a column of $A$, respectively, each chosen randomly with probability proportional to its squared Euclidean norm. In this setting, we no longer require that the matrix $A$ be full rank, and ask for the least squares solution,
$$x_{LS} \stackrel{\text{def}}{=} \operatorname*{argmin}_x \|b - Ax\|_2 = A^\dagger b,$$
where $A^\dagger$ denotes the Moore-Penrose pseudoinverse of $A$. Zouzias and Freris showed that the REK method converges linearly in expectation to the least squares solution [ZF12],
$$\mathbb{E}\|x_j - x_{LS}\|_2^2 \le \left(1 - \frac{1}{K^2(A)}\right)^{j/2}\left(\|x_{LS}\|_2^2 + \frac{\|b\|_2^2}{\sigma_{\min}^2(A)}\right), \qquad (4)$$
where $\sigma_{\min}(A)$ is the smallest non-zero singular value of $A$ and $K(A) = \frac{\|A\|_F}{\sigma_{\min}(A)}$ denotes its scaled condition number.

Recently, Needell and Tropp analyzed a block version of the simple randomized Kaczmarz method [NT13]. Like the traditional method, this version iteratively projects the current estimation onto solution spaces. However, rather than using the solution space of a single equation, the block method projects onto the solution space of many equations simultaneously by selecting a block of rows rather than a single row. For a subset $\tau \subset \{1, 2, \ldots, n\}$, denote by $A_\tau$ the submatrix of $A$ whose rows are indexed by $\tau$. We again begin with an arbitrary guess $x_0$ for the solution of the system. Then for each iteration $j \ge 1$, select a block $\tau = \tau_j$ of rows. To obtain the next iterate, we project the current estimation onto the solution space of the equations listed in $\tau$ [NT13]:
$$x_j = x_{j-1} + (A_\tau)^\dagger(b_\tau - A_\tau x_{j-1}). \qquad (5)$$
Here, the conditioning of the blocks $A_\tau$ plays a crucial role in the behavior of the method. Indeed, if each block is well-conditioned, its pseudoinverse can be applied efficiently using an iterative method such as CGLS [Bjö96]. To guarantee such properties, Needell and Tropp utilize a paving of the matrix $A$.

Definition 1 (Row Paving). A $(p, \alpha, \beta)$ row paving of an $n \times d$ matrix $A$ is a partition $\mathcal{T} = \{\tau_1, \ldots, \tau_p\}$ of the rows such that
$$\alpha \le \lambda_{\min}(A_\tau A^*_\tau) \quad\text{and}\quad \lambda_{\max}(A_\tau A^*_\tau) \le \beta \quad\text{for each } \tau \in \mathcal{T},$$
where again we denote by $A_\tau$ the $|\tau| \times d$ submatrix of $A$. We refer to the number $p$ as the size of the paving, and the numbers $\alpha$ and $\beta$ are called the lower and upper paving bounds.

We refer to a row paving of $A^*$ as a column paving of $A$. We thus seek pavings of $A$ with a small number of blocks $p$ and a small upper paving constant $\beta$. In the following, we will assume one has access to such a paving, and discuss in Section 3 how to construct such pavings for various types of matrices.
When the matrix has unit-norm rows, equipped with such a row paving of $A$, the main result of [NT13] shows that the randomized block Kaczmarz algorithm (5) exhibits linear convergence in expectation:
$$\mathbb{E}\|x_j - x_\star\|_2^2 \le \left(1 - \frac{C'}{\kappa^2(A)\log(1+n)}\right)^j\|x_0 - x_\star\|_2^2 + \frac{\|e\|_2^2}{\sigma_{\min}^2(A)}, \qquad (6)$$
where $C'$ is an absolute constant and $\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$ is the condition number of $A$. Since each iteration of the block method utilizes multiple rows, one can compare the rates of (6) and (1) by considering convergence per epoch (one cycle through the rows of $A$). From this analysis, one finds the bounds to be comparable. However, the block method can utilize fast matrix multiplies and efficient implementation, yielding dramatic improvements in computational time. See [NT13] for details and empirical results.

The REK method breaks the so-called convergence horizon of the standard Kaczmarz method, allowing convergence to the least squares solution of inconsistent systems. The block Kaczmarz method, on the other hand, allows for significant computational speedup and accelerated convergence to within a fixed radius of the least squares solution. The main contribution of this paper analyzes a randomized block Kaczmarz method which also incorporates a blocked projection step, and which provides accelerated convergence to the least squares solution. In this case we need a column partition for the projection step and a row partition for the Kaczmarz step. Our results show that this method offers both linear convergence to the least squares solution $x_{LS}$ when a row paving can be obtained, and improved convergence speed due to the blocking of both the rows and columns. In addition, we present a block coordinate descent variant which utilizes only a column paving and also yields linear convergence. We will see that the desired column paving may be obtained easily for arbitrary matrices, so this variant is especially useful when a good row paving cannot be obtained.

In Section 2 we begin by presenting the double block randomized Kaczmarz method, which utilizes both a row and column paving and provides the strongest theoretical results overall. We discuss methods for obtaining the desired matrix pavings in Section 3. Section 4 introduces the variant of the block and extended Kaczmarz methods that requires only a column paving of the matrix, readily accessible for arbitrary matrices. In Section 5 we summarize the practical use of these approaches, and in Section 6 we present some experimental results for the various algorithms. We conclude with a discussion of related work and open directions in Section 7. The appendix includes proofs of intermediate results used along the way.
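Since the algorithms below build on the REK iteration (3), the following is a minimal sketch of one REK step in Python/NumPy for reference. The function and variable names are ours (not from the paper), and the sampling follows the squared-norm probabilities described above.

```python
import numpy as np

def rek_step(A, b, x, z, rng):
    """One iteration of the REK updates (3): a column step on z and a row step on x."""
    n, d = A.shape
    sq = A ** 2
    row_probs = sq.sum(axis=1) / sq.sum()      # P(row i)  proportional to ||a_i||^2
    col_probs = sq.sum(axis=0) / sq.sum()      # P(col k)  proportional to ||a^(k)||^2
    k = rng.choice(d, p=col_probs)
    col = A[:, k]
    z = z - (col @ z) / (col @ col) * col      # push z toward the orthogonal complement of range(A)
    i = rng.choice(n, p=row_probs)
    row = A[i, :]
    x = x + (b[i] - z[i] - row @ x) / (row @ row) * row   # Kaczmarz step with the corrected right hand side
    return x, z

# usage sketch: start from x = np.zeros(d), z = b.copy() and call rek_step repeatedly
```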
2. The Randomized Double Block Kaczmarz Method
It is natural to ask whether one can consider both a row partition and a column partition in the Kaczmarz method, blocking both in the Kaczmarz update step and the projection step. Indeed, utilizing blocking in both steps yields Algorithm 1 below. We thus propose the following randomized block extended Kaczmarz method using double partitioning. We will see later that since this method requires a row paving, it works very well when the row norms of the matrix are relatively similar (see Sections 3 and 5).
Algorithm 1
Randomized Double Block Kaczmarz Least Squares Solver
procedure $(A, b, T, \mathcal{T}, \mathcal{S})$  ▷ $A \in \mathbb{R}^{n\times d}$, $b \in \mathbb{R}^n$, $T \in \mathbb{N}$, column partition $\mathcal{T}$ of $[d]$, row partition $\mathcal{S}$ of $[n]$
  Initialize $x_0 = 0$ and $z_0 = b$
  for $k = 1, 2, \ldots, T$ do
    Pick $\tau_k \in \mathcal{T}$ and $\upsilon_k \in \mathcal{S}$ uniformly at random
    Set $z_k = z_{k-1} - A_{\tau_k}(A_{\tau_k})^\dagger z_{k-1}$  ▷ $A_{\tau_k}$: $n \times |\tau_k|$ submatrix of $A$
    Update $x_k = x_{k-1} + (A_{\upsilon_k})^\dagger\bigl(b_{\upsilon_k} - (z_k)_{\upsilon_k} - A_{\upsilon_k}x_{k-1}\bigr)$  ▷ $A_{\upsilon_k}$: $|\upsilon_k| \times d$ submatrix of $A$
  end for
  Output $x_T$
end procedure
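To make the double block updates concrete, here is a minimal sketch of Algorithm 1 in Python/NumPy under our own simplifying choices (blocks sampled uniformly from the given partitions, pseudoinverses applied densely rather than via CGLS); the helper names are ours and not part of the paper.

```python
import numpy as np

def double_block_kaczmarz(A, b, T, col_partition, row_partition, rng=None):
    """Sketch of Algorithm 1: a block projection step on z followed by a block Kaczmarz step on x."""
    rng = rng or np.random.default_rng()
    x, z = np.zeros(A.shape[1]), b.astype(float).copy()
    for _ in range(T):
        tau = col_partition[rng.integers(len(col_partition))]   # column block, chosen uniformly
        ups = row_partition[rng.integers(len(row_partition))]   # row block, chosen uniformly
        A_tau = A[:, tau]                                       # n x |tau| submatrix
        z = z - A_tau @ (np.linalg.pinv(A_tau) @ z)             # remove the range(A_tau) component of z
        A_ups = A[ups, :]                                       # |ups| x d submatrix
        x = x + np.linalg.pinv(A_ups) @ (b[ups] - z[ups] - A_ups @ x)
    return x
```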
Combining the theoretical approaches in [NT13, ZF12], we will prove the following result about the convergence of Algorithm 1. This result utilizes both a column paving and a row paving.

Theorem 1. Algorithm 1 with input $A$, $b$, $T \in \mathbb{N}$, a $(p, \alpha, \beta)$ column paving $\mathcal{T}$ of $A$, and a $(\hat p, \hat\alpha, \hat\beta)$ row paving $\mathcal{S}$ of $A$, outputs an estimate vector $x_T$ that satisfies
$$\mathbb{E}\|x_T - x_{LS}\|_2^2 \le \gamma_1^T\|x_0 - x_{LS}\|_2^2 + \left(\gamma_1^{\lfloor T/2\rfloor} + \gamma_2^{\lfloor T/2\rfloor}\right)\frac{\|b_{R(A)}\|_2^2}{\hat\alpha\,(1-\gamma_2)},$$
where $\gamma_1 = 1 - \frac{\sigma_{\min}^2(A)}{\hat p\,\hat\beta}$ and $\gamma_2 = 1 - \frac{\sigma_{\min}^2(A)}{p\,\beta}$.

Proof. We begin with the following lemma, which is motivated by a lemma of [NT13] and shows that the iterates $z_k$ converge linearly to the projection of $b$ onto the kernel of $A^*$.
Lemma 2. Let $b$ be a fixed vector and $\mathcal{T}$ be a $(p, \alpha, \beta)$ column paving of $A$. Assuming the notation of Theorem 1, for every $k > 0$ it holds that
$$\mathbb{E}\|z_k - b_{R(A)^\perp}\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{p\beta}\right)^k\|b_{R(A)}\|_2^2, \qquad (7)$$
where $b_{R(A)^\perp} := (I - AA^\dagger)b$.

Proof. Let $P_{\tau_k} = A_{\tau_k}(A_{\tau_k})^\dagger$ and notice $z_k = (I - P_{\tau_k})z_{k-1}$. Define $e_k = z_k - b_{R(A)^\perp}$ for $k \ge 0$. Then,
$$e_k = (I - P_{\tau_k})z_{k-1} - b_{R(A)^\perp} = (I - P_{\tau_k})z_{k-1} - (I - P_{\tau_k})b_{R(A)^\perp} = (I - P_{\tau_k})e_{k-1},$$
where the first equality follows by the definition of $z_k$, the second by orthogonality between the range of $P_{\tau_k}$ and $b_{R(A)^\perp}$, and the final equality by the definition of $e_{k-1}$. Next, we prove that
$$\mathbb{E}_{k-1}\|e_k\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{p\beta}\right)\|e_{k-1}\|_2^2,$$
where $\mathbb{E}_{k-1}$ is the expectation conditioned on the first $k-1$ iterations of the algorithm. By orthogonality of the projector $P_{\tau_k}$ and the Pythagorean theorem, one has $\|(I - P_{\tau_k})e_{k-1}\|_2^2 = \|e_{k-1}\|_2^2 - \|P_{\tau_k}e_{k-1}\|_2^2$, hence it suffices to lower bound $\mathbb{E}_{k-1}\|P_{\tau_k}e_{k-1}\|_2^2$. Let $A_{\tau_k} := U_{\tau_k}\Sigma_{\tau_k}V^*_{\tau_k}$ be the truncated SVD decomposition of $A_{\tau_k}$, where $\Sigma_{\tau_k}$ is a $\operatorname{rank}(A_{\tau_k}) \times \operatorname{rank}(A_{\tau_k})$ diagonal matrix containing the non-zero singular values of $A_{\tau_k}$. Then,
$$\begin{aligned}\mathbb{E}_{k-1}\|P_{\tau_k}e_{k-1}\|_2^2 &= \mathbb{E}_{k-1}\|U_{\tau_k}U^*_{\tau_k}e_{k-1}\|_2^2 = \mathbb{E}_{k-1}\|U^*_{\tau_k}e_{k-1}\|_2^2 = \mathbb{E}_{k-1}\|\Sigma^{-1}_{\tau_k}V^*_{\tau_k}A^*_{\tau_k}e_{k-1}\|_2^2\\ &\ge \mathbb{E}_{k-1}\,\sigma_{\min}^2(\Sigma^{-1}_{\tau_k}V^*_{\tau_k})\,\|A^*_{\tau_k}e_{k-1}\|_2^2 = \mathbb{E}_{k-1}\,\frac{\|A^*_{\tau_k}e_{k-1}\|_2^2}{\sigma_{\max}^2(\Sigma_{\tau_k})}\\ &\ge \frac{1}{p\beta}\sum_{\tau_k\in\mathcal{T}}\|A^*_{\tau_k}e_{k-1}\|_2^2 = \frac{1}{p\beta}\|A^*e_{k-1}\|_2^2 \ge \frac{\sigma_{\min}^2(A)}{p\beta}\|e_{k-1}\|_2^2,\end{aligned}$$
where the first equality follows since $P_{\tau_k} = U_{\tau_k}U^*_{\tau_k}$, the second equality from the unitary invariance property of the Euclidean norm, the third equality by replacing $U^*_{\tau_k}$ with $\Sigma^{-1}_{\tau_k}V^*_{\tau_k}A^*_{\tau_k}$, the next three lines follow since $\sigma_{\min}^2(\Sigma^{-1}_{\tau_k}V^*_{\tau_k}) = \sigma_{\max}^{-2}(\Sigma_{\tau_k}) = \sigma_{\max}^{-2}(A_{\tau_k})$ and by the paving assumption, and the final inequality follows since $e_k \in R(A)$ for all $k \ge 0$ (indeed, $e_0 = b_{R(A)} \in R(A)$ and it follows that $e_k \in R(A)$ for every $k \ge 1$ by the recursive definition of $e_k$). It follows that
$$\mathbb{E}_{k-1}\|e_k\|_2^2 = \|e_{k-1}\|_2^2 - \mathbb{E}_{k-1}\|P_{\tau_k}e_{k-1}\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{p\beta}\right)\|e_{k-1}\|_2^2. \qquad (8)$$
Repeat the above inequality $k$ times and notice that $e_0 = b - b_{R(A)^\perp} = b_{R(A)}$ to conclude. $\Box$

Next, Lemma 2.2 of [NT13] shows that for any vector $u$,
$$\mathbb{E}\left\|\left(I - (A_{\upsilon})^\dagger A_{\upsilon}\right)u\right\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{\hat p\,\hat\beta}\right)\|u\|_2^2. \qquad (9)$$
Since the ranges of $I - (A_{\upsilon})^\dagger A_{\upsilon}$ and $(A_{\upsilon})^\dagger$ are orthogonal, we have
$$\|x_k - x_{LS}\|_2^2 = \left\|\left(I - (A_{\upsilon})^\dagger A_{\upsilon}\right)(x_{k-1} - x_{LS})\right\|_2^2 + \left\|(A_{\upsilon})^\dagger\bigl((z_k)_{\upsilon} - b^\perp_{\upsilon}\bigr)\right\|_2^2, \qquad (10)$$
where for shorthand we write $b^\perp_{\upsilon}$ to mean $(b_{R(A)^\perp})_{\upsilon}$. Combining (9) with $u = x_{k-1} - x_{LS}$,
$$\begin{aligned}\mathbb{E}\|x_k - x_{LS}\|_2^2 &\le \left(1 - \frac{\sigma_{\min}^2(A)}{\hat p\,\hat\beta}\right)\mathbb{E}\|x_{k-1} - x_{LS}\|_2^2 + \mathbb{E}\left\|(A_{\upsilon})^\dagger\bigl((z_k)_{\upsilon} - b^\perp_{\upsilon}\bigr)\right\|_2^2\\ &\le \left(1 - \frac{\sigma_{\min}^2(A)}{\hat p\,\hat\beta}\right)\mathbb{E}\|x_{k-1} - x_{LS}\|_2^2 + \frac{1}{\sigma_{\min}^2(A_{\upsilon})}\,\mathbb{E}\bigl\|(z_k)_{\upsilon} - b^\perp_{\upsilon}\bigr\|_2^2\\ &\le \left(1 - \frac{\sigma_{\min}^2(A)}{\hat p\,\hat\beta}\right)\mathbb{E}\|x_{k-1} - x_{LS}\|_2^2 + \frac{1}{\hat\alpha}\,\mathbb{E}\bigl\|z_k - b_{R(A)^\perp}\bigr\|_2^2. \end{aligned}\qquad (11)$$
To apply this bound recursively, we will utilize an elementary lemma. It is essentially proved in [ZF12, Theorem 8], but for completeness we recall its proof in the appendix.
Lemma 3. Suppose that for some $\gamma_1, \gamma_2 < 1$, the following bounds hold for all $k^* \ge 1$:
$$\mathbb{E}\|x_{k^*} - x_{LS}\|_2^2 \le \gamma_1\,\mathbb{E}\|x_{k^*-1} - x_{LS}\|_2^2 + r_{k^*} \quad\text{and}\quad r_{k^*} \le \gamma_2^{k^*}B. \qquad (12)$$
Then for any $T > 0$,
$$\mathbb{E}\|x_T - x_{LS}\|_2^2 \le \gamma_1^T\|x_0 - x_{LS}\|_2^2 + \left(\gamma_1^{\lfloor T/2\rfloor} + \gamma_2^{\lfloor T/2\rfloor}\right)\frac{B}{1-\gamma_2}.$$

Using $r_k = \frac{1}{\hat\alpha}\,\mathbb{E}\|z_k - b_{R(A)^\perp}\|_2^2$, $B = \frac{\|b_{R(A)}\|_2^2}{\hat\alpha}$, $\gamma_2 = 1 - \frac{\sigma_{\min}^2(A)}{p\beta}$, and $\gamma_1 = 1 - \frac{\sigma_{\min}^2(A)}{\hat p\,\hat\beta}$, we see by Lemma 2 and (11) that the bounds (12) hold. Applying Lemma 3 completes the proof. $\Box$

In Section 3 we will see that matrix row pavings for standardized matrices, those whose rows have unit norm, can be obtained readily. One can thus use the column-normalized version of the matrix $A$ in the block projection step (the $z_k$ update) of Algorithm 1, together with the corresponding column paving. Note that one need not have access to the complete column-standardized matrix; instead the columns of the submatrix used in that step could be normalized on the fly. See Section 3 for more on obtaining pavings and further details.

The bound of Theorem 1 improves upon that of the randomized block Kaczmarz method because it demonstrates linear convergence to the least squares solution $x_{LS}$, whereas (6) shows convergence only to within a radius proportional to $\|e\|_2^2$, which we call the convergence horizon. Algorithm 1 is able to break this barrier because it iteratively removes the component of $b$ which is orthogonal to the range of $A$. This of course is also true of the randomized Extended Kaczmarz method (3), as it also breaks this horizon barrier. To compare the rate of (4) to that of Theorem 1, we consider two important scenarios.

First, consider the case when $A$ is nearly square, and each submatrix can be applied efficiently via a fast multiply. In this case, each iteration of Algorithm 1 incurs approximately the same computational cost as an iteration of the REK method. Thus, we may directly compare the convergence rates of Theorem 1 and (4) to find that Algorithm 1 is about $n/(\hat p\hat\beta)$ times faster than REK in this setting. Thus when $n$ is much larger than $\hat p\hat\beta$, this can result in a significant speedup.

Alternatively, if the matrix $A$ does not admit a fast multiply, it is fair to only compare the convergence rate per epoch, since each iteration of Algorithm 1 may require more computational cost than those of REK. Since an epoch of Algorithm 1 and REK consists of $\hat p$ and $n$ iterations, respectively, we see that the rate of the former is proportional to $\sigma_{\min}^2(A)/\hat\beta$ whereas that of REK is proportional to $\sigma_{\min}^2(A)$. We see in this case that these bounds suggest REK exhibits faster convergence (assuming $\hat\beta > 1$). However, as observed with the randomized block Kaczmarz method, the block methods still display faster convergence in practice than their single-row counterparts because of implicit computational issues in the linear algebraic subroutines. See the discussion in [NT13] and the experimental results below for further details.
3. Obtaining matrix pavings
We devote this section to a brief discussion about matrix pavings and how they may be obtained to utilize the results of Theorem 1. The results discussed here on matrix pavings stem from subset selection, the problem of selecting a large submatrix with desired geometric properties, whose origins come from the well-known Restricted Invertibility Principle of Bourgain and Tzafriri [BT87]. The literature now contains several results on subset selection, see e.g. [BT91, KT94, Ver01, Ver06, Tro09b] and [Sri10, Nao11, SS12, You12b] for recent advancements. We summarize here a few results useful for our purposes.

The simplest type of matrix to first consider is one which has unit-norm rows, which we call row-standardized (and a matrix with unit-norm columns is column-standardized). A surprising result shows that every row-standardized matrix admits a row paving with well-controlled paving parameters. Tropp proves the following result in [Tro09a, Thm. 1.2], whose origins are due to Bourgain and Tzafriri [BT87, BT91] and Vershynin [Ver06].
Proposition 4 (Existence of Good Pavings).
Fix a number $\delta \in (0,1)$ and a row-standardized matrix $A$ with $n$ rows. Then $A$ admits a $(p, \alpha, \beta)$ row paving with
$$p \le C\cdot\delta^{-2}\,\|A\|^2\log(1+n) \quad\text{and}\quad 1-\delta \le \alpha \le \beta \le 1+\delta, \qquad (13)$$
where $C$ denotes an absolute constant.

Proposition 4 shows the existence of such a paving, but the literature provides various efficient mechanisms for the construction of good pavings as well. In many cases one constructs such a paving simply by choosing a partition of an appropriate size at random; see [Tro09a] for an efficient method to compute a paving satisfying (13). See also [NT13] and the references therein for a thorough discussion of these types of results.

If the matrix $A$ is row-standardized, one can thus construct a row paving satisfying (13). If the matrix also naturally admits a column paving (for example if it is symmetric or positive semi-definite), then both pavings will have such bounded parameters. If the latter does not hold, one can instead utilize the column-standardized version of $A$, which we denote $\bar A$, in the block projection step of Algorithm 1. The standardization can either be done on the fly during the algorithm, or ahead of time. Either way, Proposition 4 can be combined with Theorem 1 to yield the following corollary.
Corollary 5. Suppose Algorithm 1 is run on a row-standardized matrix $A$ with input $b$, $T \in \mathbb{N}$, a $(p, \alpha, \beta)$ column paving $\mathcal{T}$ of the column-standardized version $\bar A$, and a $(\hat p, \hat\alpha, \hat\beta)$ row paving $\mathcal{S}$ of $A$, both guaranteed by Proposition 4. Then the estimate vector $x_T$ satisfies
$$\mathbb{E}\|x_T - x_{LS}\|_2^2 \le \gamma_1^T\|x_0 - x_{LS}\|_2^2 + \left(\gamma_1^{\lfloor T/2\rfloor} + \gamma_2^{\lfloor T/2\rfloor}\right)\frac{C\|b_{R(A)}\|_2^2}{1-\gamma_2},$$
where $\gamma_1 = 1 - \frac{C}{\kappa^2(A)\log(1+n)}$, $\gamma_2 = 1 - \frac{C}{\kappa^2(\bar A)\log(1+d)}$, and $\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$ and $\kappa(\bar A) = \frac{\sigma_{\max}(\bar A)}{\sigma_{\min}(\bar A)}$ denote the condition numbers of $A$ and $\bar A$, respectively.

If the matrix $A$ is not row-standardized, one can run Algorithm 1 on the standardized system, along with a paving guaranteed by Proposition 4. Clearly, the method will converge to the new least squares solution, which does not necessarily coincide with the original least squares solution in the inconsistent case.

Alternatively, if the matrix $A$ is not row-standardized and one wishes to ensure convergence to the true least squares solution $x_{LS}$, Proposition 4 can still be used with suboptimal paving bounds. Indeed, write $\tilde D$ for the diagonal matrix whose entries correspond to the reciprocals of the row norms $\|a_i\|_2$ of $A$, so that $\tilde A = \tilde D A$ has unit-norm rows. Set $a_{\max} = \max_i\|a_i\|_2$ and $a_{\min} = \min_i\|a_i\|_2$. Utilize Proposition 4 on $\tilde A$ to obtain a row paving with parameters $\tilde p$, $\tilde\alpha$ and $\tilde\beta$ so that
$$\tilde p \le C\cdot\delta^{-2}\,\|\tilde A\|^2\log(1+n) \quad\text{and}\quad 1-\delta \le \tilde\alpha \le \tilde\beta \le 1+\delta.$$
If one uses this same paving for $A$, one has (quite pessimistically) that the corresponding paving parameters $\hat p$, $\hat\alpha$, and $\hat\beta$ for $A$ satisfy
$$\hat p \le C\cdot\delta^{-2}\,a_{\min}^{-2}\,\|A\|^2\log(1+n) \quad\text{and}\quad a_{\min}^2(1-\delta) \le \hat\alpha \le \hat\beta \le a_{\max}^2(1+\delta).$$
One can then directly apply Theorem 1 with this paving to obtain an analogous version of Corollary 5 for arbitrary matrices, which shows that
$$\mathbb{E}\|x_T - x_{LS}\|_2^2 \le \gamma_1^T\|x_0 - x_{LS}\|_2^2 + \left(\gamma_1^{\lfloor T/2\rfloor} + \gamma_2^{\lfloor T/2\rfloor}\right)\frac{C\|b_{R(A)}\|_2^2}{a_{\min}^2(1-\gamma_2)}, \qquad (14)$$
where now
$$\gamma_1 = 1 - \frac{C\,a_{\min}^2}{a_{\max}^2\,\kappa^2(A)\log(1+n)} = 1 - \frac{C}{a_r^2\,\kappa^2(A)\log(1+n)},$$
where we have written $a_r = a_{\max}/a_{\min}$ to denote the dynamic range of the row norms of $A$ (and $\gamma_2$ remains the same as in the corollary). This demonstrates that if the dynamic range is bounded, the convergence rate is the same as in the standardized case, up to constants. In either case, these results demonstrate linear convergence to the true least squares solution.

An alternative to these types of bounds can be obtained by simply using only a column paving for the matrix. As we elaborate below, the least squares solution can still be obtained from the column-normalized system. Since column pavings can be easily attained in this case via Proposition 4, such a method offers a nice alternative in situations where the dynamic range of the row norms is unknown or unbounded. We propose such a method in the next section.
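For concreteness, here is a minimal sketch of the random-partition construction mentioned above, in Python/NumPy. The helper name and the empirical bound check are ours; Proposition 4 guarantees good paving bounds only for suitable (standardized) matrices, so in practice one should inspect the reported bounds rather than rely on them.

```python
import numpy as np

def random_row_paving(A, num_blocks, rng=None):
    """Randomly partition the row indices of A into num_blocks blocks and report the
    empirical paving bounds (extreme eigenvalues of A_tau A_tau^*) over the blocks."""
    rng = rng or np.random.default_rng()
    blocks = np.array_split(rng.permutation(A.shape[0]), num_blocks)
    alpha, beta = np.inf, 0.0
    for tau in blocks:
        s = np.linalg.svd(A[tau, :], compute_uv=False)   # singular values of the block
        alpha, beta = min(alpha, s.min() ** 2), max(beta, s.max() ** 2)
    return blocks, alpha, beta

# a column paving of A is a row paving of A^*, e.g. random_row_paving(A.conj().T, num_blocks)
```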
4. A Randomized Block Coordinate Descent Method
We next present a simple variant of the extended Kaczmarz method which utilizes only a column paving of the matrix. For that reason, one need not worry about whether the matrix $A$ is row-standardized. Moreover, the column-standardized version can be used within the algorithm, which guarantees bounded paving parameters, while still finding the true least squares solution. Utilizing the benefits of both the block variant and the randomized extension, we propose the following randomized block coordinate descent method for the inconsistent case.

Algorithm 2
Randomized Block Least Squares Solver
procedure $(A, b, T, \mathcal{T})$  ▷ $A \in \mathbb{R}^{n\times d}$, $b \in \mathbb{R}^n$, $T \in \mathbb{N}$, column partition $\mathcal{T}$ of $[d]$
  Initialize $x_0 = 0$ and $z_0 = b$
  for $k = 1, 2, \ldots, T$ do
    Pick $\tau_k \in \mathcal{T}$ uniformly at random
    Compute $w_k = (A_{\tau_k})^\dagger z_{k-1}$  ▷ $A_{\tau_k}$: $n \times |\tau_k|$ submatrix of $A$
    Update $(x_k)_{\tau_k} = (x_{k-1})_{\tau_k} + w_k$
    Set $z_k = z_{k-1} - A_{\tau_k}w_k$
  end for
  Output $x_T$
end procedure

We may utilize some of the previous analysis to prove convergence of Algorithm 2. Observe that the $z_k$ update of Algorithm 1 is identical to the combination of the $w_k$ and $z_k$ updates of Algorithm 2, therefore Lemma 2 implies that
$$\mathbb{E}\|z_k - b_{R(A)^\perp}\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{p\beta}\right)^k\|b_{R(A)}\|_2^2. \qquad (15)$$
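Before continuing the analysis, here is a minimal sketch of Algorithm 2 in Python/NumPy, under the same caveats as before (uniform block sampling, dense pseudoinverses, helper names ours).

```python
import numpy as np

def block_coordinate_descent(A, b, T, col_partition, rng=None):
    """Sketch of Algorithm 2: block coordinate descent driven by the residual-like iterate z."""
    rng = rng or np.random.default_rng()
    x, z = np.zeros(A.shape[1]), b.astype(float).copy()
    for _ in range(T):
        tau = col_partition[rng.integers(len(col_partition))]   # column block, chosen uniformly
        A_tau = A[:, tau]                                       # n x |tau| submatrix
        w = np.linalg.pinv(A_tau) @ z                           # least squares fit of z on the block
        x[tau] += w                                             # coordinate update on the block
        z = z - A_tau @ w                                       # maintains z = b - A x (Lemma 6)
    return x
```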
To utilize this result, we aim to relate the iterates $z_k$ to the estimation $x_k$. The following claim quantifies precisely this relation.

Lemma 6. For every $k \ge 0$, at the end of the $(k+1)$-st iteration it holds that $z_{k+1} = b - Ax_{k+1}$.

Proof. We prove by induction on $k$. Instate the notation of Algorithm 2, and for two sets $S_1$ and $S_2$ write $S_1 \setminus S_2 = S_1 \cap S_2^c$ to denote set subtraction. For the base case of $k = 0$, we have $(x_1)_{\tau_1} = w_1$ and $(x_1)_{[d]\setminus\tau_1} = 0$, and moreover $z_1 = z_0 - A_{\tau_1}w_1 = b - Ax_1$. Assume that $z_\ell = b - Ax_\ell$ is true for some $\ell > 0$; we will show that it holds for $\ell + 1$. For ease of notation, denote $P = A_{\tau_{\ell+1}}(A_{\tau_{\ell+1}})^\dagger$. Then
$$z_{\ell+1} = z_\ell - Pz_\ell = b - Ax_\ell - Pz_\ell, \qquad (16)$$
where the first equality follows by the definition of $z_{\ell+1}$ and the second by the induction hypothesis. Now, it follows that
$$\begin{aligned}Ax_{\ell+1} &= A_{\tau_{\ell+1}}(x_{\ell+1})_{\tau_{\ell+1}} + A_{[d]\setminus\tau_{\ell+1}}(x_{\ell+1})_{[d]\setminus\tau_{\ell+1}}\\ &= A_{\tau_{\ell+1}}(x_\ell)_{\tau_{\ell+1}} + A_{\tau_{\ell+1}}w_{\ell+1} + A_{[d]\setminus\tau_{\ell+1}}(x_{\ell+1})_{[d]\setminus\tau_{\ell+1}}\\ &= A_{\tau_{\ell+1}}(x_\ell)_{\tau_{\ell+1}} + A_{\tau_{\ell+1}}w_{\ell+1} + A_{[d]\setminus\tau_{\ell+1}}(x_\ell)_{[d]\setminus\tau_{\ell+1}}\\ &= Ax_\ell + A_{\tau_{\ell+1}}w_{\ell+1},\end{aligned}$$
where the second equality follows by the update step on $x$ in the algorithm, and the third because $(x_{\ell+1})_{[d]\setminus\tau_{\ell+1}} = (x_\ell)_{[d]\setminus\tau_{\ell+1}}$. Hence, $Ax_\ell = Ax_{\ell+1} - A_{\tau_{\ell+1}}w_{\ell+1}$. Now, the right hand side of (16) can be rewritten as
$$b - Ax_\ell - Pz_\ell = b - Ax_{\ell+1} + A_{\tau_{\ell+1}}w_{\ell+1} - Pz_\ell = b - Ax_{\ell+1},$$
where the last equality follows since $w_{\ell+1} = (A_{\tau_{\ell+1}})^\dagger z_\ell$. Therefore, we conclude that $z_{\ell+1} = b - Ax_{\ell+1}$, which completes the proof. $\Box$

Combining this lemma with (15) yields the following result which shows convergence of the estimation to the least squares solution under the map $A$.
Theorem 7. Algorithm 2 with input $A$, $b$, $T \in \mathbb{N}$, and $(p, \alpha, \beta)$ column paving $\mathcal{T}$, outputs an estimate vector $x_T$ that satisfies
$$\mathbb{E}\|A(x_{LS} - x_T)\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{p\beta}\right)^T\|b_{R(A)}\|_2^2.$$

Proof. We observe that
$$A(x_{LS} - x_k) = b_{R(A)} - Ax_k = b - Ax_k - b_{R(A)^\perp} = z_k - b_{R(A)^\perp},$$
where the first equality follows by $b_{R(A)} = AA^\dagger b = Ax_{LS}$, the second by the orthogonal decomposition $b = b_{R(A)^\perp} + b_{R(A)}$, and the last equality from Lemma 6. Combined with inequality (15) this yields the desired result. $\Box$

When $A$ has full column rank, we may bound the estimation error $\|x_{LS} - x_T\|_2^2$ by $\sigma_{\min}^{-2}(A)\|A(x_{LS} - x_T)\|_2^2$, which combined with the fact that $\|b_{R(A)}\|_2 \le \sigma_{\max}(A)\|x_{LS}\|_2$ implies the following corollary.

Corollary 8. Algorithm 2 with full-rank $A$, $b$, $T \in \mathbb{N}$, and $(p, \alpha, \beta)$ column paving $\mathcal{T}$, outputs an estimate vector $x_T$ that satisfies
$$\mathbb{E}\|x_{LS} - x_T\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{p\beta}\right)^T\kappa^2(A)\|x_{LS}\|_2^2.$$

The advantage to a single paving approach as in Algorithm 2 is that one can utilize the column-standardized version of $A$ while maintaining the same convergence to the (scaled version of the) least squares solution. Utilizing Proposition 4, one is guaranteed a column paving satisfying (13), so that the paving parameters of Theorem 7 and Corollary 8 are bounded. Since re-normalizing the columns of the matrix only re-scales the entries of $x_{LS}$, Theorem 7 and Corollary 8 grant one access to the original least squares solution $x_{LS}$. To be precise, now let $D$ be the diagonal matrix whose entries correspond to the reciprocals of the column norms of $A$, so that $\bar A = AD$ has unit-norm columns. Let $\bar x_{LS} = \bar A^\dagger b$ denote the least squares solution of the re-normalized system, so that one has $\bar x_{LS} = D^{-1}x_{LS}$. Then
$$\|\bar A(\bar x_{LS} - x_T)\|_2 = \|AD(D^{-1}x_{LS} - x_T)\|_2 = \|A(x_{LS} - Dx_T)\|_2.$$
Thus applying Theorem 7 for $\bar A$, and utilizing the fact that the range of $\bar A$ is the same as that of $A$, one has
$$\mathbb{E}\|A(x_{LS} - Dx_T)\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(\bar A)}{p\beta}\right)^T\|b_{R(A)}\|_2^2.$$
This then implies that when the matrix is full rank,
$$\mathbb{E}\|x_{LS} - Dx_T\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(\bar A)}{p\beta}\right)^T\kappa^2(A)\|x_{LS}\|_2^2. \qquad (17)$$
Since $p$ and $\beta$ are the paving parameters of $\bar A$, one has by utilizing Proposition 4 and substituting the bounds of (13) into (17) that
$$\mathbb{E}\|x_{LS} - Dx_T\|_2^2 \le \left(1 - \frac{C'}{\kappa^2(\bar A)\log(1+d)}\right)^T\kappa^2(A)\|x_{LS}\|_2^2. \qquad (18)$$
Although this bound does depend on the conditioning of both $A$ and $\bar A$ (note the rate itself only depends on the conditioning of the latter), it is the first to guarantee linear convergence to the true least squares solution utilizing a paving which is guaranteed by Proposition 4 while not placing any restrictions on the matrix $A$ itself (such as standardization).

Remark 1. In considering the improvements offered by both the REK method and the block Kaczmarz method, one may ask whether it is advantageous to run a traditional REK projection step as in (3) along with a traditional block Kaczmarz update step as in (5). However, empirically we have observed that such a combination actually leads to a degradation in performance and requires far more epochs to converge than the algorithms discussed above.
We conjecture that it is important to run both the projection update and the Kaczmarz update "at the same speed"; if the Kaczmarz update utilizes many rows at once, so should the projection update, and vice versa.
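To illustrate the rescaling argument above, here is a minimal sketch (ours, not from the paper) that runs the Algorithm 2 sketch from Section 4 on the column-standardized matrix $\bar A = AD$ and maps the output back through $D$, so that the result estimates the least squares solution of the original system.

```python
import numpy as np

def least_squares_via_standardized_columns(A, b, T, col_partition, rng=None):
    """Run the Algorithm 2 sketch on the column-standardized matrix and undo the scaling."""
    D = 1.0 / np.linalg.norm(A, axis=0)       # diagonal of D (reciprocal column norms)
    A_bar = A * D                             # A_bar = A D has unit-norm columns
    x_bar = block_coordinate_descent(A_bar, b, T, col_partition, rng)
    return D * x_bar                          # since x_LS = D * x_bar_LS, this estimates x_LS of A x = b
```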
5. Summary of Approaches
Here we detail the various approaches proposed, and summarize the practical implementation in several frameworks. Since the block variants of the methods require the blocks to be well conditioned, it is natural to rely on matrix pavings for the analysis. Unfortunately, such pavings are only readily available when the matrix itself is properly normalized. For that reason, we have proposed several practical alternatives for the important setting in which normalization is not feasible. Both of our proposed methods still offer computational advantages over the standard approaches (see the next section). We summarize these approaches here, which cover all possible settings.
Consistent systems:
When the system is consistent, one does not lose any convergence properties by re-scaling the system to be standardized, since the solution remains the same. In this simple setting, one can utilize the standardized system (either standardizing a priori or on the fly), and benefit from the convergence guaranteed by Corollary 5.
Inconsistent standardized systems:
When the matrix is already standardized (as naturally occurs for example in Vandermonde matrices used in trigonometric approximation), as in the above case one can immediately utilize Corollary 5 to guarantee convergence to the least squares solution.
Inconsistent systems with bounded dynamic range:
If the system is not standardized, but the ratio of the largest to smallest row norm is bounded (as is the case in random matrices for example, which have tightly concentrated row norms), one can still utilize this result. Indeed, by utilizing the same paving one would use if the matrix were actually standardized (but not actually standardizing the system), Corollary 5 can be used to obtain the convergence rate given in (14). One sees that if the dynamic range $a_r$ is bounded (by say, a constant or even $\log n$), the method converges to the original least squares solution with approximately the same convergence rate.

Inconsistent systems with unbounded dynamic range:
If the matrix has a large variety of row norms, it may clearly be challenging to guarantee the desired row paving. For that reason, it will be advantageous to use a column paving. To that end, we propose Algorithm 2, which is designed for systems that cannot be separated into well-conditioned row blocks. As discussed in Section 4, Theorem 7 can be used via the column-standardized paving to obtain the convergence rate given in (18). This bound guarantees linear convergence to the true least squares solution, even for matrices which are far from standardized. The disadvantage of course is that the rate depends on the conditioning of the column-standardized version, which may be hard to explicitly bound (as is true for general large matrices anyway).

In any of these cases, our results may be used to guarantee linear convergence in expectation to the true least squares solution of the system. Moreover, because matrix blocks can be utilized, we often see a significant speedup in runtime due to practical considerations. See the next section for examples of such behavior.
6. Experimental Results
Here we present some experiments using simple examples to illustrate the benefits of block methods. We do not claim optimized implementations of the method, and only run on small problem sizes; our purpose is only to demonstrate that even in these simple examples, the block method offers advantages to the standard method. We refer the reader to [ZF12, NT13] for more empirical results for both REK and block methods.

In all experiments, one matrix is created and trials of each method are run. In our first experiment, the matrix is a × matrix with standard normal entries, whose rows are then normalized to each have norm one, yielding a condition number of . We use blocks, selected by a random partition. The vector x is created to have independent standard normal entries, and the right hand side b is set to Ax. We track the $\ell_2$-error $\|x_{LS} - x_k\|_2$ across each epoch as well as the CPU time (measured in Matlab using the cputime command). (We refer to an epoch as the number of iterations that is equivalent to one cycle through $n$ rows, even though rows and blocks are selected with replacement. Thus for REK, an epoch is $n$ iterations, and for a block version with $b$ blocks, one epoch is $b$ iterations. For an attempt at a fair comparison with Algorithm 2, which only uses a column paving, we measure an "epoch" to be $n/b$ where $b$ is the number of blocks in the column paving.) In all experiments we considered a trial successful when the error reached − . The results for this case are presented in Figures 1 and 2. In all figures, a heavy line represents median performance, and the shaded region spans the minimum to the maximum value across all trials. As is demonstrated, even when the matrix does not have any natural block structure, the proposed algorithms outperform standard REK in terms of runtime.

Figure 1: $\ell_2$-norm error for REK (blue dashed) and Algorithm 1 (red) across epochs (left) and CPU time (right). Matrix is × Gaussian, system is consistent.

Figure 2: $\ell_2$-norm error for REK (blue dashed) and Algorithm 2 (red) across epochs (left) and CPU time (right). Matrix is × Gaussian, system is consistent.

Figure 3 shows similar plots, but in this case the system is no longer consistent. For these experiments, we used the same type and size of the matrix A, but the right hand side vector b was generated as a Gaussian vector as well. We created b so that the residual norm $\|b - Ax_{LS}\|_2 = $ . We then track the $\ell_2$-error between the iterate $x_k$ and the least squares solution $x_{LS}$, which we computed by $A^\dagger b$. We repeat the experiment also for a matrix whose dynamic range is not well bounded. For that experiment, we generate a Gaussian matrix and then scale the row norms so that the $i$th row has norm equal to $i$.

Figure 3: Matrix is × Gaussian, system is inconsistent. Left: $\ell_2$-norm error for REK (blue dashed) and Algorithm 1 (red) versus CPU time. Right: $\ell_2$-norm error for REK (blue dashed) and Algorithm 2 (red) versus CPU time.

Figure 4: Matrix is × Gaussian with dynamic row norms, system is inconsistent. Plot shows $\ell_2$-norm error for REK (blue dashed) and Algorithm 2 (red) versus epochs (left) and CPU time (right).

Our final experiment uses a test problem generated by the Regularization Tools package [Han07]. In particular we present a 2D tomography problem Ax = b for an $n \times d$ matrix with $n = fN^2$ and $d = N^2$. Here A corresponds to the absorption along a random line through an $N \times N$ grid. In our experiments we set N = and the oversampling factor f = . This yielded a matrix A with condition number κ(A) = . Since for this matrix it may be difficult to obtain a row paving, we instead use Algorithm 2 and obtain a random column paving from the standardized version and use that for the matrix A. The results for various choices of paving size are displayed in Figure 5, which are in line with previous experiments.

Figure 5: System with × tomography matrix. Left: median $\ell_2$-norm error versus "epoch" (see the note on epochs above). Right: median $\ell_2$-norm error versus CPU time.
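The exact problem sizes above are not recoverable from this copy of the text. Purely for illustration, here is a minimal sketch of the first (consistent, row-normalized Gaussian) experiment with hypothetical dimensions and block counts of our choosing, reusing the double_block_kaczmarz sketch from Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_row_blocks, num_col_blocks = 2000, 100, 20, 5      # hypothetical sizes, ours only
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)                # row-standardize the matrix
x_true = rng.standard_normal(d)
b = A @ x_true                                               # consistent right hand side

row_partition = np.array_split(rng.permutation(n), num_row_blocks)
col_partition = np.array_split(rng.permutation(d), num_col_blocks)
x_hat = double_block_kaczmarz(A, b, 500, col_partition, row_partition, rng)
print(np.linalg.norm(x_hat - np.linalg.pinv(A) @ b))         # error to the least squares solution
```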
7. Related Work and Discussion
The Kaczmarz method was first introduced in the 1937 work of Kaczmarz himself [Kac37]. Since then, the method has been revitalized by researchers in computer tomography, under the name
Algebraic Reconstruction Technique (ART) [GBH70, Byr08, Nat01, Her09]. Deterministic convergence results for the method often depend on properties of the matrix that are difficult to compute or analyze [Deu85, DH97, XZ02, Gal05]. Moreover, it has been well observed that random choice of row selection often speeds up the convergence [HS78, HM93, CFM+92, Nat01].
Recently, Strohmer and Vershynin [SV09] derived the first provable convergence rate of the Kaczmarz method, showing that when each row is selected with probability proportional to the square of its norm the method exhibits the expected linear convergence of (1). This work was extended to the inconsistent case in [Nee10], which shows linear convergence to within some fixed radius of the least squares solution. The almost-sure guarantees were recently derived by Chen and Powell [CP12]. To break the convergence barrier, relaxation parameters can be introduced, so that each iterate is over or under projected onto each solution space. Whitney and Meany prove that if the relaxation parameters tend to zero then the iterates converge to the least squares solution [WM67]. Further results using relaxation have also been obtained, see for example [CEG83, Tan71, HN90, ZF12]. An alternative to relaxation parameters was recently proposed by Zouzias and Freris [ZF12] as the REK method described by (3). Rather than alter the projection step, motivated by ideas of Popa [Pop98] they introduce a secondary step which aims to reduce the residual.

The Kaczmarz method has been extended beyond linear systems as well. For example, Leventhal and Lewis [LL10] analyze the method for systems with polyhedral constraints and inequalities, which was also extended to the block case [BN], and Richtárik and Takáč [RT11] build on these results for general optimization problems.

Another important aspect of research in this area focuses on accelerating the convergence of the methods. Geometric brute force methods can be used [EN11], additional row directions may be added [PPKR12], or instead one can select blocks of rows rather than a single row in each iteration. The block version of the Kaczmarz method is originally due to work of Elfving [Elf80] and Eggermont et al. [EHL81]. Its convergence rates were recently studied in [NW13] and analyzed via pavings by Needell and Tropp [NT13]. The block Kaczmarz method is of course a special instance in a broader class of block projection algorithms, see for example [XZ02] for a more general analysis and [Byr08] for a presentation of other block variants.

To use block methods effectively, one needs to obtain a suitable partition of the rows (and/or columns). Popa constructs such partitions by creating orthogonal blocks [Pop99, Pop01, Pop04], whereas Needell and Tropp promote the use of row pavings to construct the partition [NT13].

Construction of pavings has been studied for quite some time now, and most early results rely on random selection. The guarantee of lower and upper paving bounds has been derived by Bourgain and Tzafriri [BT87] and Kashin and Tzafriri [KT94], respectively. Simultaneous guarantees were later derived by Bourgain and Tzafriri [BT91] with suboptimal dependence on the matrix norm. Recently, Spielman and Srivastava [SS12] and Youssef [You12b] provided simple proofs of the results from [BT87] and [KT94], respectively. Vershynin [Ver01] and Srivastava [Sri10] extend the paving results to general matrices with arbitrary row norms; see also [You12b, You12a].
Proposition 4 follows from the work of Vershynin [Ver06] and Tropp [Tro09a], and is attributed to the seminal work of Bourgain and Tzafriri [BT87, BT91]. For particular classes of matrices, the paving can even be obtained from a random partition of the rows with high probability. This is proved by Tropp [Tro08a] using ideas from [BT91, Tro08b], and is refined in [CD12].

Acknowledgments
D.N. is thankful to the Simons Foundation Collaboration grant and the Alfred P. Sloan Fellowship. A.Z. has received funding from the European Research Council under the European Union's Seventh Framework Program (FP7/2007-2013) / ERC grant agreement no.

Appendix A. Proof of intermediate results
Here we include the proof of Lemma 3.
Proof. [Proof of Lemma 3] Assume the bounds (12) hold. Applying the first bound in (12) recursively yields
$$\mathbb{E}\|x_{k^*} - x_{LS}\|_2^2 \le \gamma_1^{k^*}\|x_0 - x_{LS}\|_2^2 + \sum_{j=0}^{k^*-1}\gamma_1^{k^*-1-j}r_j \le \gamma_1^{k^*}\|x_0 - x_{LS}\|_2^2 + \sum_{j=0}^{\infty}\gamma_2^jB \le \gamma_1^{k^*}\|x_0 - x_{LS}\|_2^2 + \frac{B}{1-\gamma_2},$$
where the second inequality holds by the assumption that $r_k \le \gamma_2^kB \le B$, and the last by the properties of the geometric summation. Similarly, observe that for any $k$ and $k^*$ we have
$$\mathbb{E}\|x_{k+k^*} - x_{LS}\|_2^2 \le \gamma_1^k\,\mathbb{E}\|x_{k^*} - x_{LS}\|_2^2 + \sum_{j=0}^{k-1}\gamma_1^{k-1-j}r_{j+k^*} \le \gamma_1^k\,\mathbb{E}\|x_{k^*} - x_{LS}\|_2^2 + \gamma_2^{k^*}\sum_{j=0}^{\infty}\gamma_2^jB \le \gamma_1^k\,\mathbb{E}\|x_{k^*} - x_{LS}\|_2^2 + \gamma_2^{k^*}\frac{B}{1-\gamma_2}.$$
Now we choose $k$ and $k^*$ such that $T = k + k^*$ and $k = k^*$ if $T$ is even, or $k = k^* + 1$ if $T$ is odd. Combining the two inequalities above, we have
$$\begin{aligned}\mathbb{E}\|x_T - x_{LS}\|_2^2 = \mathbb{E}\|x_{k+k^*} - x_{LS}\|_2^2 &\le \gamma_1^k\,\mathbb{E}\|x_{k^*} - x_{LS}\|_2^2 + \gamma_2^{k^*}\frac{B}{1-\gamma_2}\\ &\le \gamma_1^k\left(\gamma_1^{k^*}\|x_0 - x_{LS}\|_2^2 + \frac{B}{1-\gamma_2}\right) + \gamma_2^{k^*}\frac{B}{1-\gamma_2}\\ &= \gamma_1^{k+k^*}\|x_0 - x_{LS}\|_2^2 + \left(\gamma_1^k + \gamma_2^{k^*}\right)\frac{B}{1-\gamma_2}\\ &\le \gamma_1^T\|x_0 - x_{LS}\|_2^2 + \left(\gamma_1^{\lfloor T/2\rfloor} + \gamma_2^{\lfloor T/2\rfloor}\right)\frac{B}{1-\gamma_2}.\end{aligned}$$
This completes the proof. $\Box$
References

[Bjö96] Å. Björck.
Numerical Methods for Least Squares Problems . SIAM, Philadelphia,1996.[BN] J. Briskman and D. Needell. Block kaczmarz method with inequalities.
J. Math.Imaging Vis.
To appear.[BT87] J. Bourgain and L. Tzafriri. Invertibility of “large” submatrices with applica-tions to the geometry of Banach spaces and harmonic analysis.
Israel J. Math. ,57(2):137–224, 1987.[BT91] J. Bourgain and L. Tzafriri. On a problem of Kadison and Singer.
J. Reine Angew. Math., 420:1–43, 1991.
[Byr08] C. L. Byrne.
Applied iterative methods . A K Peters Ltd., Wellesley, MA, 2008.[CD12] S. Chrétien and S. Darses. Invertibility of random submatrices via tail decou-pling and a matrix Chernoff inequality.
Statist. Probab. Lett. , 82(7):1479–1487,2012.[CEG83] Y. Censor, P. P. B. Eggermont, and D. Gordon. Strong underrelaxation in kacz-marz’s method for inconsistent systems.
Numer. Math. , 41(1):83–92, 1983.[CFM +
92] C. Cenker, H. G. Feichtinger, M. Mayer, H. Steier, and T. Strohmer. New vari-ants of the POCS method using affine subspaces of finite codimension, withapplications to irregular sampling. In
Proc. SPIE: Visual Communications andImage Processing , pages 299–310, 1992.[CP12] X. Chen and A. Powell. Almost sure convergence of the Kaczmarz algo-rithm with random measurements.
J. Fourier Anal. Appl. , pages 1–20, 2012.10.1007/s00041-012-9237-2.[Deu85] F. Deutsch. Rate of convergence of the method of alternating projections.
Para-metric optimization and approximation , 76:96–107, 1985.[DH97] F. Deutsch and H. Hundal. The rate of convergence for the method of alter-nating projections, ii.
J. Math. Anal. Appl. , 205(2):381–405, 1997.[EHL81] P. P. B. Eggermont, G. T. Herman, and A. Lent. Iterative algorithms for largepartitioned linear systems, with applications to image reconstruction.
LinearAlgebra Appl. , 40:37–67, 1981.[Elf80] T. Elfving. Block-iterative methods for consistent and inconsistent linear equa-tions.
Numer. Math. , 35(1):1–12, 1980.[EN11] Y. C. Eldar and D. Needell. Acceleration of randomized Kaczmarz method viathe Johnson-Lindenstrauss lemma.
Numer. Algorithms , 58(2):163–177, 2011.[Gal05] A. Galántai. On the rate of convergence of the alternating projection methodin finite dimensional spaces.
J. Math. Anal. Appl. , 310(1):30–44, 2005.[GBH70] R. Gordon, R. Bender, and G. T. Herman. Algebraic reconstruction techniques(ART) for three-dimensional electron microscopy and X-ray photography.
J.Theoret. Biol. , 29:471–481, 1970.[Han07] P. C. Hansen. Regularization tools version 4.0 for matlab 7.3.
Numer. Algo-rithms , 46(2):189–194, 2007.[Her09] G. T. Herman.
Fundamentals of computerized tomography: image reconstruction from projections. Springer, 2009.
[HM93] G. Herman and L. Meyer. Algebraic reconstruction techniques can be made computationally efficient.
IEEE Trans. Medical Imaging , 12(3):600–609, 1993.[HN90] M. Hanke and W. Niethammer. On the acceleration of kaczmarz’s method forinconsistent linear systems.
Linear Algebra Appl. , 130:83–98, 1990.[HS78] C. Hamaker and D. C. Solmon. The angles between the null spaces of X-rays.
J. Math. Anal. Appl. , 62(1):1–23, 1978.[Kac37] S. Kaczmarz. Angenäherte auflösung von systemen linearer gleichungen.
Bull.Internat. Acad. Polon.Sci. Lettres A , pages 335–357, 1937.[KT94] B. Kashin and L. Tzafriri. Some remarks on coordinate restriction of operatorsto coordinate subspaces. Insitute of Mathematics Preprint 12, Hebrew Univer-sity, Jerusalem, 1993–1994.[LL10] D. Leventhal and A. S. Lewis. Randomized methods for linear constraints:convergence rates and conditioning.
Math. Oper. Res., 35(3):641–654, 2010.
[LWS14] J. Liu, S. J. Wright, and S. Sridhar. An asynchronous parallel randomized Kaczmarz algorithm. arXiv preprint arXiv:1401.4780, 2014.
[Nao11] A. Naor. Sparse quadratic forms and their geometric applications. Technical Report No. 1033, Séminaire Bourbaki, Jan. 2011.
[Nat01] F. Natterer.
The mathematics of computerized tomography , volume 32 of
Classics inApplied Mathematics . Society for Industrial and Applied Mathematics (SIAM),Philadelphia, PA, 2001. Reprint of the 1986 original.[Nee10] D. Needell. Randomized Kaczmarz solver for noisy linear systems.
BIT, 50(2):395–403, 2010.
[NT13] D. Needell and J. A. Tropp. Paved with good intentions: Analysis of a randomized block Kaczmarz method.
Linear Algebra Appl. , pages 199–221, 2013.[NW13] D. Needell and R. Ward. Two-subspace projection method for coherentoverdetermined linear systems.
J. Fourier Anal. Appl. , 19(2):256–269, 2013.[Pop98] C. Popa. Extensions of block-projections methods with relaxation parametersto inconsistent and rank-deficient least-squares problems.
BIT , 38(1):151–176,1998.[Pop99] C. Popa. Block-projections algorithms with blocks containing mutually or-thogonal rows and columns.
BIT , 39(2):323–338, 1999.[Pop01] C. Popa. A fast Kaczmarz-Kovarik algorithm for consistent least-squares prob-lems.
Korean J. Comput. Appl. Math., 8(1):9–26, 2001.
[Pop04] C. Popa. A Kaczmarz-Kovarik algorithm for symmetric ill-conditioned matrices.
An. Ştiinţ. Univ. Ovidius Constanţa Ser. Mat., 12(2):135–146, 2004.
[PPKR12] C. Popa, T. Preclik, H. Köstler, and U. Rüde. On Kaczmarz's projection iteration as a direct solver for linear least squares problems.
Linear Algebra Appl., 436(2):389–404, 2012.
[RT11] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Available at arXiv:1107.2848, Apr. 2011.
[Sri10] N. Srivastava.
Spectral sparsification and restricted invertibility . Phd dissertation,Yale University, New Haven, CT, 2010.[SS87] K. M. Sezan and H. Stark. Applications of convex projection theory to imagerecovery in tomography and related areas. In H. Stark, editor,
Image Recovery: Theory and Application, pages 415–462. Acad. Press, 1987.
[SS12] D. A. Spielman and N. Srivastava. An elementary proof of the restricted invertibility theorem.
Israel J. Math. , 190:83–91, 2012.[SV06] T. Strohmer and R. Vershynin. A randomized solver for linear systems withexponential convergence. In
RANDOM 2006 (10th International Workshop onRandomization and Computation) , number 4110 in Lecture Notes in ComputerScience, pages 499–507. Springer, 2006.[SV09] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with expo-nential convergence.
J. Fourier Anal. Appl. , 15(2):262–278, 2009.[Tan71] K. Tanabe. Projection method for solving a singular system of linear equationsand its applications.
Numer. Math. , 17(3):203–214, 1971.[Tro08a] J. A. Tropp. Norms of random submatrices and sparse approximation.
C. R.Math. Acad. Sci. Paris , 346(23-24):1271–1274, 2008.[Tro08b] J. A. Tropp. The random paving property for uniformly bounded matrices.
Studia Math. , 185(1):67–82, 2008.[Tro09a] J. A. Tropp. Column subset selection, matrix factorization, and eigenvalueoptimization. In
Proceedings of the Twentieth Annual ACM-SIAM Symposium onDiscrete Algorithms , pages 978–986, Philadelphia, PA, 2009. SIAM.[Tro09b] J. Tropp. Column subset selection, matrix factorization, and eigenvalue op-timization. In
Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 978–986. Society for Industrial and Applied Mathematics, 2009.
[Ver01] R. Vershynin. John's decompositions: selecting a large part.
Israel J. Math. ,122:253–277, 2001.[Ver06] R. Vershynin. Random sets of isomorphism of linear operators on Hilbertspace. In
High dimensional probability , volume 51 of
IMS Lecture Notes Monogr.Ser. , pages 148–154. Inst. Math. Statist., Beachwood, OH, 2006.[WM67] T. M. Whitney and R. K. Meany. Two algorithms related to the method ofsteepest descent.
SIAM J. Numer. Anal. , 4(1):109–118, 1967.[XZ02] J. Xu and L. Zikatanov. The method of alternating projections and the methodof subspace corrections in Hilbert space.
J. Amer. Math. Soc., 15(3):573–597, 2002.
[You12a] P. Youssef. A note on column subset selection. Available at arXiv:1212.0976, Dec. 2012.
[You12b] P. Youssef. Restricted invertibility and the Banach–Mazur distance to the cube. Available at arXiv:1206.0654, June 2012.
[ZF12] A. Zouzias and N. M. Freris. Randomized extended Kaczmarz for solving least-squares.