Quantile-based Iterative Methods for Corrupted Systems of Linear Equations
Jamie Haddock, Deanna Needell, Elizaveta Rebrova, William Swartworth
September 18, 2020
Abstract
Often in applications ranging from medical imaging and sensor networks to error correction and data science (and beyond), one needs to solve large-scale linear systems in which a fraction of the measurements have been corrupted. We consider solving such large-scale systems of linear equations Ax = b that are inconsistent due to corruptions in the measurement vector b. We develop several variants of iterative methods that converge to the solution of the uncorrupted system of equations, even in the presence of large corruptions. These methods make use of a quantile of the absolute values of the residual vector in determining the iterate update. We present both theoretical and empirical results that demonstrate the promise of these iterative approaches.

* The authors were partially supported by NSF CAREER.
† Department of Mathematics, University of California, Los Angeles, Los Angeles, CA.

1 Introduction

One of the most ubiquitous problems arising across the sciences is that of solving large-scale systems of linear equations. These problems arise in many areas of data science including machine learning, as subroutines of several optimization methods [BV04], medical imaging [GBH70b, HM93], sensor networks [SHS01], statistical analysis, and many more. A practical challenge in all of these settings is that there is almost always corruption present in any such large-scale data, either due to data collection, transmission, adversarial components, or modern storage systems that can introduce corruptions into otherwise consistent systems of equations. For such applications, we seek methods that are robust to such corruption but scalable to big data.

In this work, we develop scalable methods for solving corrupted systems of linear equations. Here, we consider the problem of solving large-scale systems of equations Ax = b~ where a subset of equations have been contaminated with arbitrarily large corruptions in the measurement vector, thereby constructing an inconsistent system of equations defined by measurement matrix A and observed measurement vector b = b~ + b_C (b~ being unobserved but corresponding to the desired system of equations, and b_C being an arbitrary corruption vector of the same dimension). Our work is motivated by the setting where the uncorrupted system of equations Ax = b~ is highly overdetermined and the number of measurements is very large.

We focus on variants of the popular iterative methods, stochastic gradient descent (SGD) and randomized Kaczmarz (RK), which have gained popularity recently due to their small memory footprint and good theoretical guarantees [SV09, Bot10, NSW16]. We propose variants of both RK and SGD based upon the use of quantile statistics. We focus on proving theoretical convergence guarantees for these variants, but additionally discuss their implementation and present numerical experiments evidencing their promise.

The SGD method is a widely used first-order iterative method for convex optimization [RM51]. The classical method seeks to minimize a separable objective function f(x) = sum_{i=1}^m f_i(x) by accessing (stochastically) selected components of the objective and using a gradient step for this component. That is, SGD constructs iterates x_k given by

    x_{k+1} = x_k - gamma_k * grad f_i(x_k),    (1)

where gamma_k is the learning rate (or step-size) and i is the selected component for the k-th iteration.
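To make the update (1) concrete, here is a minimal NumPy sketch of SGD on a separable least-squares objective; the function name, constant learning rate, and setup are our own illustration (not code from the paper).

    import numpy as np

    def sgd_least_squares(A, b, n_iters=10000, lr=0.5, seed=0):
        # Minimal SGD sketch for f(x) = sum_i (1/2)(<a_i, x> - b_i)^2.
        # Each step samples one component f_i and applies update (1).
        rng = np.random.default_rng(seed)
        m, n = A.shape
        x = np.zeros(n)
        for _ in range(n_iters):
            i = rng.integers(m)                  # stochastically selected component
            grad_i = (A[i] @ x - b[i]) * A[i]    # gradient of f_i at the iterate
            x -= lr * grad_i                     # x_{k+1} = x_k - gamma_k grad f_i(x_k)
        return x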
When the objective function f(x) represents the error in the solution of a system of equations, SGD generally updates in the direction of the sampled row, i.e., x_{k+1} - x_k = alpha_k * a_i for some alpha_k which depends upon the iterate x_k. Our variants apply SGD to the least absolute deviations (LAD) error and least squares (LS) error,

    f(x) = ||Ax - b||_1 = sum_{i=1}^m |<a_i, x> - b_i|   and   f(x) = (1/2)||Ax - b||^2 = (1/2) sum_{i=1}^m (<a_i, x> - b_i)^2,

respectively. For these objectives, the SGD updates (1) take the form

    x_{k+1} = x_k - gamma_k * sign(<a_i, x_k> - b_i) a_i   and   x_{k+1} = x_k - gamma_k (<a_i, x_k> - b_i) a_i,

respectively, where sign(.) denotes the function that returns 1 if its argument is positive and -1 otherwise. The RK method is the instance of SGD on the LS objective with the iterate-dependent step-size gamma_k = 1/||a_i||^2 [NSW16]; that is,

    x_{k+1} = x_k + ((b_i - <a_i, x_k>) / ||a_i||^2) a_i.    (2)

In [SV09], the authors showed that when applied to a consistent system of equations with a unique solution x* and with a specific sampling distribution, RK converges at least linearly in expectation. Indeed, denoting e_k := x_k - x* as the difference between the k-th iterate of the method and the exact solution of the system, the method guarantees

    E ||e_k||^2 <= (1 - sigma_min^2(A)/||A||_F^2)^k ||e_0||^2,    (3)

where ||.||_F denotes the Frobenius norm and sigma_min(A) the smallest (nonzero) singular value of A. Standard SGD results (e.g., [SZ13]) provide similar convergence rates for SGD on these objectives when the step-sizes are chosen according to an appropriately decreasing schedule. See Section 1.3 below for more details and a discussion of related work.

Here, we consider variants of the SGD and RK methods that converge to the solution of the uncorrupted system even in the presence of large corruptions in the measurement vector b. We prove convergence rates of the same form as (3). It is worth noting that both our experimental and theoretical results illustrate that the size of the corruptions does not negatively impact the convergence of the proposed methods. Our methods will make use of SGD and RK steps but will use a quantile of the residual entries in order to determine the step-size.
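The projection step (2) is equally simple to write down; the following sketch is our own illustration and uses uniform row sampling for simplicity, rather than the squared-row-norm sampling distribution analyzed in [SV09].

    import numpy as np

    def randomized_kaczmarz(A, b, n_iters=10000, seed=0):
        # Sketch of the RK update (2): project the iterate onto the hyperplane
        # {x : <a_i, x> = b_i} of a sampled row.
        rng = np.random.default_rng(seed)
        m, n = A.shape
        x = np.zeros(n)
        norms_sq = np.einsum('ij,ij->i', A, A)   # squared row norms
        for _ in range(n_iters):
            i = rng.integers(m)                  # uniform sampling for simplicity
            x += (b[i] - A[i] @ x) / norms_sq[i] * A[i]
        return x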
The rest of our paper is organized as follows. In the remainder of the introduction, we present our main contributions in Section 1.2, discuss related works in Section 1.3, and briefly describe our notation and give required definitions in Section 1.4. We then provide the detailed pseudocode of our proposed methods, QuantileRK(q) and QuantileSGD(q), in Section 2. We state and prove our theoretical results in Section 3. Within this section, we highlight some new results for random matrices as useful tools in Subsection 3.1 and then include the proofs of our main convergence results in Subsections 3.2 and 3.3. In Section 4, we discuss several implementation considerations that affect the efficiency and convergence of our proposed methods. In Section 5, we empirically demonstrate the promise of our methods with experiments on synthetic and real data. Finally, we conclude and offer some future directions in Section 6.

1.2 Main contributions

In this section, we provide summaries of the foundational results we prove in high-dimensional probability, then state our main convergence results for the proposed methods. Our main convergence results rely on the following assumptions about the linear system Ax = b. Let A in R^{m x n} be a random matrix with m >= n. We direct the reader to [Ver18] for the random matrix theory definitions involved; we also provide summaries in Section 1.4.

Assumption 1. All the rows a_i of the matrix A have unit norm and are independent. Additionally, for all i in [m], sqrt(n) a_i is mean zero, isotropic, and has uniformly bounded sub-Gaussian norm ||sqrt(n) a_i||_{psi_2} <= K.

Assumption 2.
Each entry a_ij of A has a probability density function phi_ij which satisfies phi_ij(t) <= D sqrt(n) for all t in R. (The quantity D is a constant which we will use throughout when referring to this assumption.)

The prototypical example of a matrix satisfying both assumptions is a normalized Gaussian matrix, i.e., a matrix whose rows are sampled uniformly over S^{n-1}. Assumptions 1 and 2 extract the properties of Gaussian matrices that are required for our theory. As such, our work applies to more general distributions, whenever there is enough independence between the entries of the matrix and their distributions are regular enough. By Assumptions 1 and 2, the matrices we consider will be full rank almost surely, so the uncorrupted system Ax = b~ will always have a unique solution x*.

Our main convergence guarantees build upon several useful results related to the non-asymptotic properties of random matrices that appear to be new and that may be of independent interest. In particular, Proposition 1 shows that for a class of random matrices, all large enough submatrices uniformly have smallest singular values that are at least on the order of sqrt(m/n). For matrices which satisfy Assumptions 1 and 2, this generalizes standard bounds on the smallest singular value, but does not follow directly from these bounds.

Proposition 2 is more specialized, but may also be of independent interest. For a random matrix A, we show that the q-quantile of {|<a_i, x>|} is well concentrated uniformly in x. Perhaps surprisingly, A does not need to be very tall for this result to hold; a constant aspect ratio suffices.

We first introduce two new methods for iteratively solving linear systems with corruptions and give the formal statements of our main results.
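For intuition, the prototypical model above is easy to simulate; this snippet (our own illustration) draws rows uniformly from the unit sphere, which satisfies Assumptions 1 and 2.

    import numpy as np

    def sphere_rows(m, n, seed=0):
        # Rows uniform on S^{n-1}: the prototypical matrix satisfying
        # Assumptions 1 and 2 (independent unit-norm rows, sub-Gaussian and
        # with bounded coordinate densities after sqrt(n) rescaling).
        rng = np.random.default_rng(seed)
        G = rng.standard_normal((m, n))
        return G / np.linalg.norm(G, axis=1, keepdims=True)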
The first method we introduce is QuantileRK, which builds upon the RK method. Recall that the iteration of RK given by (2) implies that the method proceeds by sampling rows of the matrix A and projecting onto the corresponding hyperplane given by the linear constraint. When some of the entries in b are corrupted by a large amount, RK periodically projects onto the associated corrupted hyperplanes and therefore does not converge. Our solution is to avoid making projections that result in ||x_{k+1} - x_k|| being abnormally large. Specifically, for each iterate x_k we consider the set of distances from x_k to a set of hyperplane constraints. We assign a threshold value to be the q-quantile of these distances, where q is a parameter of the method. If the distance from x_k to the sampled hyperplane is greater than this threshold, then the method avoids projecting during that iteration. Otherwise, it projects in the same manner as RK.

Theorem 1 states that the QuantileRK method converges for random matrices satisfying Assumptions 1 and 2 above, as long as the fraction of corrupted entries is a sufficiently small constant (which does not depend on the dimensions of the matrix). Here and throughout, c, C, c_0, C_0, ... denote absolute constants that may denote different values from one use to the next. Variable subscripts on constants will indicate quantities that a given constant may depend on.

Theorem 1.
Let the system be defined by a random matrix A in R^{m x n} satisfying Assumptions 1 and 2, with the constant parameters D and K.(2) Then with probability at least 1 - c e^{-c_q m}, the iterates produced by the QuantileRK(q) Method 1 with q in (0, 1), where in each iteration the quantile is computed using the full corrupted residual (instead of subsampling, we use t = m),(1) and initialized with an arbitrary point x_0 in R^n, satisfy

    E(||e_k||^2) <= (1 - C_q/n)^k ||e_0||^2

as long as the fraction of corrupted entries beta = |supp(b_C)|/m < min(c q^2, 1 - q) and m >= C n. (Recall that e_k denotes the vector x_k - x*.)

(1) In order to allow more efficient implementations, we empirically show that considering a small subset of hyperplanes is sufficient. One could extend the theory to this setting as well, with a slightly more complicated analysis.
(2) In other words, we do not track the dependencies on D and K.

The second method we introduce is QuantileSGD, a variant of SGD in which the step-size used in each iteration is chosen to avoid abnormally large steps. Specifically, for each iterate x_k, we consider the set of distances from x_k to the hyperplane constraints specified by our linear system. We choose the step-size as the q-quantile of these distances, where q is a parameter of the method. This prevents projections that are on the order of distances associated to corrupted equations.

Under nearly the same assumptions on the system and slightly more restrictive assumptions on the quantile parameter, we also guarantee an RK-type convergence rate for QuantileSGD(q). Our second main result is Theorem 2, which shows that QuantileSGD converges, again when the fraction of corruptions is sufficiently small.

Theorem 2.
Let the system be defined by a random matrix A in R^{m x n} satisfying Assumptions 1 and 2 with the constant parameters D and K. Then with probability at least 1 - c e^{-c_q m}, the iterates produced by the QuantileSGD(q - beta) Method 2 with q in (0, 1/2), where in each iteration the quantile is computed using the full corrupted residual (instead of subsampling, we use t = m), and initialized with an arbitrary point x_0 in R^n, satisfy

    E(||e_k||^2) <= (1 - C_q/n)^k ||e_0||^2

as long as the fraction of corrupted entries beta = |supp(b_C)|/m is a sufficiently small constant and m >= C n log n. (Recall that e_k denotes the vector x_k - x*.)

In order to prove this result, we first introduce a method that we call OptSGD, which adaptively chooses an optimal step size at each iteration. This method cannot be run in practice, as it requires knowledge of x*. However, we are able to show that QuantileSGD approximates OptSGD and therefore performs similarly well. OptSGD may also serve as a useful benchmark when considering other SGD-type solvers for linear systems.

Finally, we consider a simpler setting that we call the streaming setting, which may be viewed as the limiting case when the number of rows of A tends to infinity. In this situation we do not rely on the non-asymptotic properties of random matrices and are able to give an analysis with better constants for the case when the matrix has Gaussian rows. In particular, Theorem 4 shows that in this setting our methods can handle an explicit, substantially larger constant fraction of corruptions.

Remark 1.
We obtain the same standard convergence rate for both methods; however, for QuantileSGD(q) we have a slightly stronger requirement on the aspect ratio of the matrix A, and an additional restriction on the quantile, q < 1/2 (whereas QuantileRK is proved for any quantile q in (0, 1)). In practice, QuantileSGD indeed diverges for values of the quantile too close to one (see Figure 1(b)); however, one could safely use a much wider range of quantiles. We note that for a normalized Gaussian model (when the rows of A are sampled from the uniform distribution on the unit sphere) one can use the QuantileSGD(q) method for quantiles well beyond 1/2 (see Remark 8).

1.3 Related work

There are many extensions and analyses of the SGD and RK methods; we review some of the results most relevant to our contributions. The first two sections deal with consistent or noisy systems, while the last section deals with methods for the problem of corrupted systems. We distinguish between corruption, in which there are few but relatively large errors in the measurement vector, and noise, in which there are many but relatively small errors in the measurement vector; the latter is more commonly considered within the SGD and RK literature.
Randomized Kaczmarz variants.
The Kaczmarz method was proposed in the late 1930s by Stefan Kaczmarz [Kac37]. The iterative method for solving consistent systems of equations was rediscovered and popularized for computed tomography as the algebraic reconstruction technique (ART) [GBH70a, HM93]. While it has enjoyed research focus since that time [CEG83, Nat86, SS87, FCM+92, HN90, FS95], the elegant analysis of the randomized Kaczmarz method of [SV09] has spurred a surge of research into variants of the Kaczmarz method. In [SV09], the authors proved the first exponential convergence rate in expectation (3) in the case of full-rank and consistent systems of equations. This result was generalized to the case when A is not full-rank in [ZF13]. Block methods which make use of several rows in each iteration have also become popular [EHL81, Elf80, Pop99, Pop01, NT14, RN20].

One relevant and well-studied variant of the Kaczmarz method is that in which the row selection is performed greedily rather than randomly. This greedy variant goes by the name of Motzkin's relaxation method for linear inequalities (MM) in the linear programming literature [MS54, Agm54], where convergence analyses coinciding with (3) have been shown [Agm54]. MM has been rediscovered in the Kaczmarz literature under the names "most violated constraint control" and "maximal-residual control" [Cen81, NSV+16, PP15]. Several greedy extensions and hybrid randomized-greedy methods have been proposed and analyzed [BW18a, BW18b, DLHN17, MINEA19, LR19, MI+20, HM19]. Like our methods, these greedy approaches require that sufficiently large entries of the residual be identified; however, these methods differ from ours in how these residual entries are used.

Another relevant direction in the Kaczmarz literature is convergence analyses for systems in which the measurement matrices A have entries sampled according to a given probability distribution [CP12, HN19, HM19, RN20]. Our main results will make mild assumptions on the distribution of the entries of the measurement matrix.

The convergence of many of the previously mentioned methods has been analyzed in the case that there is a small amount of noise in the system. Generally, these analyses provide a convergence horizon around the solution that depends upon the size of the entries of the noise. In [Nee10], the author proves that RK converges on inconsistent linear systems to a horizon which depends upon the size of the largest entry of the noise; a similar result is shown in [HN19] for MM. In [ZF13, DSS20], the authors develop methods that converge to the least-squares solution of a noisy system. Meanwhile, in this work, our focus will be developing methods for systems in which there is corruption rather than noise. We will exploit the fact that the overdetermined system of equations has few corruptions in order to solve the uncorrupted system of equations.

Stochastic gradient descent variants.
There has been an abundance of work developing and analyzing variants of SGD (e.g., step-size schedules, variants for specific and non-smooth objectives, etc.). This is not meant to be a thorough survey of the literature in this area; we direct the reader to [BCN18] and the references therein for a survey of recent advances, and outline here those most relevant to our approach. In [RM51], the authors provide a convergence analysis for SGD in the case that the objective is smooth and strongly convex and the step-size schedule diminishes at the appropriate rate. Such convergence results hold for fixed step-size schedules, but include a constant error term akin to the convergence horizon of RK for inconsistent systems [NSW16]. Similar convergence rates can be proved in the case of non-smooth and non-strongly convex objectives [SZ13]; this result assumes an appropriately decreasing step-size schedule and proves bounds on the objective value optimality gap. Our results, unlike these, will use an iterate-dependent step-size and will provide bounds on the distance between iterates and the solution of the uncorrupted system.

Recently, batch variants that use several samples in each iteration have become popular and enjoy similar rates [DGBSX12]. In [KL20], the authors propose and analyze a greedy variant of SGD known as ordered SGD that selects batches of the gradient according to the value of the associated objective components. An important branch of advances in the analysis of SGD deals with robustness to corruption and outliers in the objective-defining data and sampled gradients; see, e.g., [CLZL19, LCZL20]. Similar to our work, the aforementioned papers use quantile statistics, namely a median-truncated SGD. Our methods differ from these in how we use the quantile statistic to achieve robustness to corruption and in our specification to linear systems.

Here, we focus on the SGD variants developed for the LAD error; this problem is often known as LAD regression. It has been previously noted that the l1 objective is more robust to outliers than the l2 objective [WGZ06]; for this reason, there have been many algorithmic approaches to LAD regression. These approaches have been motivated by maximum likelihood approaches [LA04], rescaling techniques for low-dimensional problems [BS80], iteratively re-weighted least-squares [Sch73], descent approaches [Wes81], dimensionality reduction [KS18], and linear programming approaches [BR73]; see [GSN88] and references therein.

Corrupted linear systems approaches.
The corrupted linear system problem has been studied within the error-correction literature and has been formulated in the compressed sensing framework. Many recovery results build upon and resemble those within the compressed sensing literature [CT05]. In particular, the optimization problem min_x ||Ax - b||_0 is a special case of the NP-hard MAX-FS problem [AK95]. However, if the measurement matrix A and the support of the corruption vector b_C satisfy appropriate properties, then the minimizers of the l0 problem and the l1 problem coincide and the problem can be solved efficiently using, e.g., linear programming methods [CT05, CRTV05, WM10].

Previous work has developed and analyzed iterative methods for corrupted systems of equations. As mentioned previously, much of the focus on this problem has been in the error correction and compressed sensing literature [FR13, EK12]. However, there has been work focused on iterative row-action methods; previous work in this direction includes [HN18a, JCC15, ABH05].

Our work was inspired by [HN18b, HN18a], in which the authors propose and analyze randomized Kaczmarz variants that detect and remove corrupted equations in the system. These methods differ from ours in that they exploit the ability of the standard RK method to detect and avoid few corruptions. Meanwhile, our work develops variants of RK and SGD that use quantiles of the residual to converge even in the presence of corruptions. In [HNRS20], we present several methods related to those here; our results significantly improve and generalize those in [HNRS20].

1.4 Notation and definitions

We consider a system with measurement matrix A in R^{m x n} and corrupted measurement vector b in R^m, where m >> n. We denote the i-th row of A by a_i. If A is an m x n matrix and S is a subset of [m], then A_S denotes the matrix obtained by restricting A to the rows indexed by S. The corrupted measurement vector b is the sum of the ideal (uncorrupted) measurement vector b~ and the corruptions b_C.
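A corrupted system in exactly this notation is easy to generate; the sketch below is our own hypothetical setup (names and parameters are ours) building A with unit-norm rows, a planted solution x*, and a beta fraction of arbitrarily large corruptions b_C.

    import numpy as np

    def corrupted_system(m, n, beta=0.1, corruption_scale=10.0, seed=0):
        # Build A (unit-norm rows), x*, and b = b~ + b_C with |supp(b_C)| = beta*m.
        rng = np.random.default_rng(seed)
        A = rng.standard_normal((m, n))
        A /= np.linalg.norm(A, axis=1, keepdims=True)   # Assumption 1: unit-norm rows
        x_star = rng.standard_normal(n)
        b = A @ x_star                                  # ideal measurements b~
        corrupt = rng.choice(m, size=int(beta * m), replace=False)
        b[corrupt] += corruption_scale * rng.standard_normal(corrupt.size)  # b_C
        return A, b, x_star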
The number of corruptions is a fraction beta in (0, 1) of the total number of measurements, |supp(b_C)| = beta*m. Here supp(x) denotes the set of indices of nonzero entries of x. The ideal measurement vector b~ defines a consistent system of equations with ideal solution x*. We denote the k-th error as e_k := x_k - x*, where x_k denotes the k-th iterate of a method.

The notation ||v|| denotes the Euclidean norm of a vector v. We denote the unit sphere in R^n as S^{n-1}, so S^{n-1} = {x in R^n : ||x|| = 1}. For a matrix A, we denote its operator (L2 -> L2) norm by ||A|| = sup_{x in S^{n-1}} ||Ax|| and its Frobenius (or Hilbert-Schmidt) norm by ||A||_F = sqrt(trace(A^T A)). Throughout, we denote by sigma_min(A) and sigma_max(A) the smallest and largest singular values of the matrix A (that is, the smallest and largest eigenvalues of the matrix sqrt(A^T A)). Moreover, we always assume that the matrix A has full column rank, so that sigma_min(A) > 0, and we write kappa(A) = ||A||_F / sigma_min(A).

Additionally, our work relies on several concepts that arise in high-dimensional probability. We list all relevant definitions here; a proper review of the concepts and their properties can be found, e.g., in [Ver18]. If X is a real-valued random variable, then the sub-Gaussian norm of X is defined to be ||X||_{psi_2} = inf{t > 0 : E exp(X^2/t^2) <= 2}. If v is a random vector in R^n, then the sub-Gaussian norm of v is defined to be ||v||_{psi_2} = sup_{x in S^{n-1}} ||<v, x>||_{psi_2}. A random variable is said to be sub-Gaussian if it has finite sub-Gaussian norm. If v is a random vector in R^n, then v is said to be isotropic if E(v v^T) = I_n, where I_n is the identity on R^n.

Our convergence analyses take expectation with respect to the random sample drawn in each iteration. We denote expectation taken with respect to all iterative samples as E. We denote by E_k the expectation with respect to the random sample selected in the k-th iteration, conditioned on the results of the first k - 1 iterations. We let Q_q(x) denote the empirical q-quantile of the corrupted residual,

    Q_q(x) := q-quantile{ |<a_i, x> - b_i| : i in [m] }.    (4)

We let Q~_q(x) denote the empirical q-quantile of the uncorrupted residual,

    Q~_q(x) := q-quantile{ |<x - x*, a_i>| : i in [m] }.    (5)

We additionally define notation for the quantile statistics of sampled portions of the corrupted and uncorrupted residuals,

    Q_q(x, S) := q-quantile{ |<a_i, x> - b_i| : i in S }    (6)

and

    Q~_q(x, S) := q-quantile{ |<x - x*, a_i>| : i in S },    (7)

where S, a subset of [m], is the set of sampled indices. Note that only Q_q is available to us at run time, since it makes use of the corrupted measurement vector b; Q~_q is not available due to the use of the unknown x*. We employ Q~_q in our theoretical results, as it allows us to naturally relate Q_q to random matrix parameters. Finally, we let M(x) denote the average magnitude of the entries of Ax,

    M(x) := (1/m) sum_{i=1}^m |<x, a_i>|.    (8)

2 The QuantileRK and QuantileSGD methods

In this section, we give formal descriptions of the proposed
QuantileRK(q) and
QuantileSGD(q) methods. Our methods use the q-quantile entry of the residual |Ax - b| as a proxy to avoid large steps in the direction of corrupted equations. Namely, in both methods, in each iteration we sample not only an index for the RK update (which we will call the RK-index), but also t additional indices. We then access the entries of the residual associated to these indices and compute their empirical q-quantile, Q_q(x, {i_l : l in [t]}).

The QuantileRK(q) method below takes the step (associated to the RK-index and governed by the standard RK projection (2)) only if the entry of the residual associated to this index is less than or equal to Q_q(x_{j-1}, {i_l : l in [t]}). We assume that the rows of our system are normalized. If this is not the case, one could normalize the rows as they are sampled.
Method 1 QuantileRK(q)

    procedure QuantileRK(A, b, q, t, N)
        x_0 = 0
        for j = 1, ..., N do
            sample i_1, ..., i_t ~ Uniform(1, ..., m)
            sample k ~ Uniform(1, ..., m)
            if |<a_k, x_{j-1}> - b_k| <= Q_q(x_{j-1}, {i_l : l in [t]}) then
                x_j = x_{j-1} - (<x_{j-1}, a_k> - b_k) a_k
            else
                x_j = x_{j-1}
            end if
        end for
        return x_N
    end procedure

The QuantileSGD(q) method, Method 2, uses the same quantile of the sampled residual, Q_q(x_{j-1}, {i_l : l in [t]}), to define the step size. The method steps along the direction defined by the RK update (2) based on the RK-index, with step size gamma equal to Q_q(x_{j-1}, {i_l : l in [t]}).

Method 2 QuantileSGD(q)

    procedure QuantileSGD(A, b, q, t, N)
        x_0 = 0
        for j = 1, ..., N do
            sample i_1, ..., i_t ~ Uniform(1, ..., m)
            sample k ~ Uniform(1, ..., m)
            gamma = Q_q(x_{j-1}, {i_l : l in [t]})
            x_j = x_{j-1} - gamma * sign(<x_{j-1}, a_k> - b_k) a_k
        end for
        return x_N
    end procedure

Note that this pseudocode uses only the maximum number of iterations N as a stopping criterion, but one could also run these methods for a specific amount of time, or implement any other stopping criterion. Finally, we note that the behavior of both QuantileRK and QuantileSGD depends heavily upon the input parameters. We clarify the required constraints on these parameters in the theoretical results in Section 3. Additionally, we discuss the effect of these parameter choices on computation and other implementation considerations in Section 4.
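As a concrete companion to the pseudocode, here is a minimal NumPy sketch of both methods. The function names and defaults are our own choices rather than the authors' reference code; passing t=None computes the quantile over the full corrupted residual, i.e., the t = m setting analyzed in Theorems 1 and 2 (at O(mn) cost per iteration, which subsampling avoids).

    import numpy as np

    def quantile_rk(A, b, q=0.7, t=None, N=50000, seed=0):
        # Method 1 (QuantileRK): project as in RK only when the sampled
        # residual entry is at most the q-quantile of a (sub)sampled residual.
        rng = np.random.default_rng(seed)
        m, n = A.shape
        x = np.zeros(n)
        for _ in range(N):
            S = np.arange(m) if t is None else rng.integers(m, size=t)
            k = rng.integers(m)
            threshold = np.quantile(np.abs(A[S] @ x - b[S]), q)
            r_k = A[k] @ x - b[k]
            if abs(r_k) <= threshold:      # accept: standard RK projection (2)
                x -= r_k * A[k]            # rows are assumed unit norm
        return x

    def quantile_sgd(A, b, q=0.3, t=None, N=50000, seed=0):
        # Method 2 (QuantileSGD): always step along the sampled row, with step
        # size equal to the q-quantile of the (sub)sampled residual.
        rng = np.random.default_rng(seed)
        m, n = A.shape
        x = np.zeros(n)
        for _ in range(N):
            S = np.arange(m) if t is None else rng.integers(m, size=t)
            k = rng.integers(m)
            gamma = np.quantile(np.abs(A[S] @ x - b[S]), q)
            x -= gamma * np.sign(A[k] @ x - b[k]) * A[k]
        return x

Paired with a synthetic system such as the corrupted_system sketch from Section 1.4, e.g. A, b, x_star = corrupted_system(5000, 100, beta=0.1), both routines should drive ||x - x*|| toward zero despite the corruptions, whereas plain RK stalls at the corrupted hyperplanes.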
3 Theoretical results

Here we state and prove our theoretical results. We begin with foundational results in high-dimensional probability in Subsection 3.1 and then prove our main convergence results, Theorems 1 and 2, in Subsections 3.2 and 3.3. In our proof of convergence of QuantileSGD(q), Theorem 2, we propose an ideal method, OptSGD, and demonstrate that it is well approximated by QuantileSGD(q). We additionally prove convergence of QuantileSGD(q) in the simpler streaming setting in Subsection 3.3.3.

3.1 Foundational results in high-dimensional probability

In this subsection, we prove several fundamental results which we apply in our convergence analyses for QuantileRK and QuantileSGD in Sections 3.2 and 3.3.

3.1.1 Preliminaries

For the largest singular value of a random matrix with independent isotropic rows, we will use the following standard bound (the proof can be found in, e.g., [Ver18, Theorem 4.6.1]).
Theorem 3.
Let A in R^{m x n} be a matrix whose rows are independent, mean zero, sub-Gaussian, and isotropic, with sub-Gaussian norm bounded by K. Then the largest singular value (operator norm) of A is bounded by sqrt(m) + C K^2 (sqrt(n) + t) with probability at least 1 - 2 e^{-t^2}.

The smallest singular value of random matrices is sometimes called a "hard edge," as it is typically harder to quantify. This is the case in our application as well; we will prove Proposition 1, giving a uniform lower bound on the singular values of the submatrices of A.

The first ingredient that we need for this (and it will be used in other places later in the text as well) is an epsilon-net for the unit sphere. We say that N is an epsilon-net of a set S, a subset of R^n, if N is a subset of S and each point in S is within a Euclidean distance epsilon of some point in N. The epsilon-covering number of S is the cardinality of the smallest epsilon-net for S. We will use the fact that the epsilon-covering number of S^{n-1} is bounded by (3/epsilon)^n [Ver18, Corollary 4.2.13].

We will also use the following direct corollary of Hoeffding's inequality (see, e.g., [Ver18, Theorem 2.6.2]), which states that sub-Gaussian random variables concentrate as well as Gaussians under taking means.

Lemma 1.
Let X_1, ..., X_m be i.i.d. sub-Gaussian random variables with sub-Gaussian norm K. Then the sub-Gaussian norm of the mean satisfies

    || (1/m) sum_{i=1}^m X_i ||_{psi_2} <= C K / sqrt(m).

Next, the following anti-concentration lemma for random vectors with bounded density is a direct corollary of [RV15, Theorem 1.2].
Lemma 2.
Let x be a random vector in R^n such that the density function of each coordinate x_i is bounded by D sqrt(n), where D > 0 is an absolute constant. Then for any fixed u in S^{n-1} we have

    Pr( |<x, u>| <= sqrt(t)/sqrt(n) ) <= 2 sqrt(2) D sqrt(t).
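As a quick sanity check (our own illustration, not part of the paper), one can estimate this small-ball probability for rows uniform on the sphere and observe the sqrt(t) decay.

    import numpy as np

    # Monte Carlo illustration of Lemma 2 for x uniform on S^{n-1}: the
    # small-ball probability Pr(|<x, u>| <= sqrt(t)/sqrt(n)) decays like sqrt(t).
    rng = np.random.default_rng(0)
    n, trials = 50, 50000
    u = np.zeros(n); u[0] = 1.0                    # any fixed unit vector
    x = rng.standard_normal((trials, n))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # x uniform on the sphere
    for t in [0.04, 0.01, 0.0025]:
        p = np.mean(np.abs(x @ u) <= np.sqrt(t) / np.sqrt(n))
        print(f"t = {t}: empirical prob {p:.4f}, sqrt(t) = {np.sqrt(t):.3f}")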
3.1.2 A uniform lower bound for singular values of submatrices

We will use this anti-concentration result to prove a uniform lower bound for the smallest singular value over all alpha*m x n submatrices of a tall random matrix of size m x n. It is well known that for a single fixed (row-)submatrix A_T of that size, sigma_min(A_T) is at least of order sqrt(m)/sqrt(n) (see, e.g., [Ver18, Theorem 4.6.1]). However, naively taking a union bound over all binom(m, alpha*m) row submatrices with alpha*m rows results in a trivial probability bound. In Proposition 1, we provide a more delicate row-wise analysis, employing Chernoff's bound to provide a good uniform lower bound with probability exponentially close to one.

Proposition 1.
Let alpha in (0, 1) and let the random matrix A in R^{m x n} satisfy Assumptions 1 and 2. Then there exist absolute constants C_0, c_0 > 0 so that if the matrix A is tall enough, namely,

    m/n > (C_0/alpha) log(DK/alpha),    (9)

then the following uniform lower bound holds for the smallest singular values of all its row submatrices that have at least alpha*m rows:

    Pr( inf_{T subset of [m], |T| >= alpha*m} sigma_min(A_T) >= (alpha^{3/2}/(C_0 D)) sqrt(m/n) ) >= 1 - e^{-c_0 alpha m}.

Proof.
Let epsilon in (0, 1] be a constant (chosen below in (11)). Recall that there is an epsilon-net N of S^{n-1} with cardinality |N| <= (3/epsilon)^n. That is, for any y in S^{n-1} there exists x in N such that ||y - x|| <= epsilon. Taking the infimum over all unit-norm vectors, we get that for any T, a subset of [m],

    sigma_min(A_T) = inf_{y in S^{n-1}} ||A_T y|| >= ( inf_{x in N} ||A_T x|| ) - epsilon ||A_T||.    (10)

We will bound the two terms on the right-hand side of (10) separately. First, for any subset T of [m] we can bound ||A_T|| <= ||A||, and so by Theorem 3,

    Pr( ||A_T|| <= (1 + C K^2) sqrt(m/n) ) >= 1 - e^{-cm}

for some absolute constants C, c > 0. Let us choose

    epsilon = alpha^{3/2} / (2 C_0 D (1 + C K^2)).    (11)

To bound the first term on the right-hand side of (10), first consider a fixed x in N. For i in [m], let E_i^x be the event

    E_i^x := { |<a_i, x>| < c_1 alpha / (D sqrt(n)) },

where c_1 is an appropriately small absolute constant. By Lemma 2, Pr(E_i^x) <= alpha/4 for any x in S^{n-1} and i in [m]. A Chernoff bound then gives

    Pr( the events E_i^x occur for at least alpha*m/2 of the indices i in [m] ) <= e^{-alpha m / 12}.

Now, for any fixed x, provided that E_i^x occurs for at most alpha*m/2 of the indices i in [m], for all T with |T| >= alpha*m we have

    ||A_T x|| >= sqrt( (alpha m / 2) * (c_1 alpha / (D sqrt(n)))^2 ) >= (c_2 alpha^{3/2} / D) sqrt(m/n).

Finally, taking a union bound over x in N, we have

    Pr( inf_{x in N} ||A_T x|| <= (c_2 alpha^{3/2} / D) sqrt(m/n) for some admissible T ) <= exp( n log(3/epsilon) - alpha m / 12 ) <= exp( -alpha m / 24 ),

where the last inequality holds due to the submatrix size assumption (9) and our choice of epsilon in (11). Returning to the estimate (10), we can now conclude that, with probability

    1 - e^{-cm} - exp(-alpha m / 24) >= 1 - 2 e^{-c_0 alpha m},

for all T with |T| >= alpha*m,

    sigma_min(A_T) >= (c_2 alpha^{3/2} / D) sqrt(m/n) - epsilon (1 + C K^2) sqrt(m/n) >= (alpha^{3/2} / (C_0 D)) sqrt(m/n),

due to our choice of epsilon in (11). This concludes the proof of Proposition 1.
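Exhaustively checking all binom(m, alpha*m) submatrices is infeasible, but one can probe adversarial-looking ones numerically; this sketch (our own illustration) takes, for a fixed direction u, the alpha*m rows most nearly orthogonal to u, a natural worst case, and checks that sigma_min(A_T) is still on the order of sqrt(m/n).

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, alpha = 4000, 50, 0.2
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=1, keepdims=True)     # rows on the unit sphere
    u = rng.standard_normal(n); u /= np.linalg.norm(u)
    T = np.argsort(np.abs(A @ u))[: int(alpha * m)]   # rows nearly orthogonal to u
    sigma_min = np.linalg.svd(A[T], compute_uv=False)[-1]
    print(f"sigma_min(A_T) = {sigma_min:.3f}, sqrt(m/n) = {np.sqrt(m / n):.3f}")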
Remark 2. Note that the bounded density assumption is crucial for Proposition 1. For instance, the rows of a normalized Bernoulli matrix violate the hypotheses of Lemma 2, and Proposition 1 does not apply. Unfortunately, this cannot be overcome. Indeed, consider taking x to be the vector (1, -1, 0, ..., 0). Then <a_i, x> = 0 with probability 1/2. So if alpha < 1/2 in Proposition 1, then x will lie in the kernel of some alpha*m x n submatrix of A with high probability, violating the uniform lower bound on the smallest singular value of the submatrices.

3.1.3 Upper bounds for empirical quantiles

Recall that a_i denotes a (normalized) row of the matrix A. We recall the notation for the statistics of the corrupted and uncorrupted residuals: we denote the q-quantile of the corrupted residual by Q_q(x), and the q-quantile of the uncorrupted residual by Q~_q(x). We additionally recall that the average magnitude of the entries of Ax is denoted M(x).

The key observation is that for all uncorrupted indices i we have

    <x_k - x*, a_i> = <x_k, a_i> - <x*, a_i> = <x_k, a_i> - b_i.

Each of x_k, a_i, and b_i is available at runtime (unlike the exact solution x*), so this quantity may be computed directly. Then, due to the robustness of order statistics to noise, we can use the quantiles of the corrupted residual, Q_q(x_k), to estimate quantiles of the uncorrupted residual, Q~_q(x_k). In particular, the following straightforward implication of the definition of quantiles is used in the proof of QuantileSGD convergence. We omit the proof.

Lemma 3.
With at most a beta fraction of samples corrupted by an adversary, we have

    Q~_{q - beta}(x_k) <= Q_q(x_k) <= Q~_{q + beta}(x_k).

We will estimate the empirical uncorrupted quantiles Q~_q(x) instead of Q_q(x) first. The rest of this section consists of two parts: upper bounds for Q~_q(x), and lower bounds for Q~_q(x). As in the previous subsection, the main challenge is to get uniform high-probability estimates over the unit sphere.
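Lemma 3 is easy to see numerically; in this sketch (our own illustration), an adversary moves a beta fraction of the entries arbitrarily (some to zero, some very large), and the corrupted q-quantile stays sandwiched between the uncorrupted (q - beta)- and (q + beta)-quantiles.

    import numpy as np

    rng = np.random.default_rng(0)
    m, beta, q = 10000, 0.1, 0.5
    clean = np.abs(rng.standard_normal(m))           # uncorrupted |residual| entries
    corrupted = clean.copy()
    idx = rng.choice(m, size=int(beta * m), replace=False)
    corrupted[idx[::2]] = 0.0                        # adversary shrinks some entries
    corrupted[idx[1::2]] = 1e6                       # and blows up the others
    lo = np.quantile(clean, q - beta)
    hi = np.quantile(clean, q + beta)
    mid = np.quantile(corrupted, q)
    assert lo <= mid <= hi
    print(f"{lo:.3f} <= {mid:.3f} <= {hi:.3f}")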
The next lemma shows that any fairly large collection of rows is reasonably incoherent. We will need this result in order to handle situations in which the locations of corruptions are chosen adversarially.

Lemma 4.
Let the random matrix A in R^{m x n} satisfy Assumption 1. With probability at least 1 - 2 exp(-cm) we have that for all unit vectors x in R^n and every T, a subset of [m],

    sum_{i in T} |<x, a_i>| <= C_K sqrt(m |T| / n).

Proof.
Consider the vector s = (s_i) in {-1, 0, 1}^m defined by

    s_i = sign(<x, a_i>) if i in T, and s_i = 0 otherwise,

for i in [m]. Note that ||s|| <= sqrt(|T|). The left-hand side of the desired inequality can be written as

    sum_{i in T} |<x, a_i>| = sum_{i=1}^m <x, s_i a_i> = < x, sum_{i in [m]} s_i a_i > <= || sum_{i in [m]} s_i a_i || = ||A^T s||.

Now the last norm can be estimated using the bound from Theorem 3 (since the sqrt(n)-rescaled rows of A are isotropic with bounded sub-Gaussian norm) to get

    ||A^T s|| <= ||A^T|| ||s|| = ||A|| ||s|| <= C_K sqrt(m |T| / n).

This concludes the proof of Lemma 4.

The next proposition allows us to bound the quantiles computed by QuantileRK and QuantileSGD. We assume that the alpha*m "bad" indices from the next result are those that will be excluded by the quantile statistic.

Proposition 2.
Let alpha in (0, 1) and let the random matrix A in R^{m x n} with m >= n satisfy Assumption 1. Then for t >= 1, with probability at least 1 - 2 exp(-t^2 m), for every x in S^{n-1} we have the bound

    M(x) <= 1/sqrt(n) + K ( c_1/sqrt(m) + c_2 t/sqrt(n) ).    (12)

As a consequence, with probability at least 1 - 2 exp(-m), for every x in S^{n-1} the bound

    |<a_i, x>| <= CK / (alpha sqrt(n))

holds for all but at most alpha*m indices i.

Proof.
We will use a chaining argument to show that the averages M(u) (defined by (8)) concentrate uniformly over the sphere. For u, v in S^{n-1}, we have

    |M(u) - M(v)| <= (1/m) sum_{i=1}^m |<a_i, u - v>|.

The terms in this sum are independent sub-Gaussian random variables with sub-Gaussian norm no larger than K ||u - v|| / sqrt(n). Therefore, by Lemma 1,

    || M(u) - M(v) ||_{psi_2} <= C K ||u - v|| / (sqrt(m) sqrt(n)).

By the tail-bound version of Dudley's inequality ([Ver18, Theorem 8.1.6]) and the bound (3/epsilon)^n for the epsilon-covering number of the unit sphere, we then have, with probability at least 1 - 2 exp(-t^2 m),

    sup_{u, v in S^{n-1}} |M(u) - M(v)| <= (C K / (sqrt(m) sqrt(n))) ( sqrt(n) + diam(S^{n-1}) t sqrt(m) ) = K ( C/sqrt(m) + 2C t/sqrt(n) ).    (13)

Next, we claim that with probability one, M(u) is bounded by n^{-1/2} for some u. Indeed, for u sampled uniformly over the unit sphere,

    (E M(u))^2 = (E |<u, a_i>|)^2 <= E |<u, a_i>|^2 = n^{-1},    (14)

since a_i has unit norm (note that the expectation above is taken over u). For some u, M(u) is at most its expectation, and hence M(u) <= n^{-1/2} for some u in S^{n-1}.

The estimates (14) and (13) together imply that (12) holds with probability at least 1 - 2 exp(-t^2 m). Set t = 1 and use that m >= n to get the bound M(x) <= CK/sqrt(n) with probability at least 1 - 2 exp(-m). Now, note that if for some u in S^{n-1} more than alpha*m members of the sum defining M(u) were bigger than CK/(alpha sqrt(n)), then

    |M(u)| > (alpha m) (1/m) CK/(alpha sqrt(n)) = CK/sqrt(n),

contradicting (12). This concludes the proof of Proposition 2.

We will use Proposition 2 in a slightly more general, although equivalent, form.

Corollary 1.
Let alpha in (0, 1), let the random matrix A in R^{m x n} satisfy Assumption 1, and suppose that Ax* = b for some x* in R^n (that is, the linear system is uncorrupted). Assuming that m >= n, there exists a constant C_K > 0 so that, with probability at least 1 - e^{-m}, for every x in R^n the bound

    |<a_i, x> - b_i| <= (C_K / (alpha sqrt(n))) ||x - x*||

holds for all but at most alpha*m indices i.

Proof.
To prove this, one simply writes b_i = <a_i, x*> and applies Proposition 2 to the unit vector (x - x*)/||x - x*||.

Remark 3.
Note that for a system corrupted by beta*m corruptions, analogously to Corollary 1, we have that with probability at least 1 - e^{-m}, for every x in R^n the bound

    |<a_i, x> - b_i| <= (C_K / (alpha sqrt(n))) ||x - x*||

holds for all but at most (alpha + beta) m indices i.

Remark 4.
The proof of Proposition 2 gives a bit more. Parallel to the analysis following (14), M(u) for some u in S^{n-1} is at least the expectation E M(u) = E|X_n|, where X_n := <a_1, u> and u is uniform over S^{n-1}. By the Projective Central Limit Theorem, sqrt(n) X_n converges in distribution to a standard normal as n -> infinity. Moreover, it is not hard to show that the random variables sqrt(n) X_n are uniformly integrable, and so E(|sqrt(n) X_n|) -> mu, where mu = sqrt(2/pi) ~ 0.80 is the mean of a standard half-normal random variable (i.e., the absolute value of a standard normal random variable). In particular, E(|sqrt(n) X_n|) is bounded below by a constant uniformly in n. By the argument in Proposition 2 we then have that, for all x in S^{n-1},

    M(x) >= mu/sqrt(n) - K ( c_1/sqrt(m) + c_2 t/sqrt(n) )

with probability at least 1 - 2 exp(-t^2 m). Provided that m >= C_K n, this bound simplifies to M(x) >= c/sqrt(n) with probability at least 1 - 2 exp(-c_K m). If, in addition, one assumes that n is larger than a sufficiently large constant C_eps so that E(|sqrt(n) X_n|) is near mu, then one has

    M(x) >= (mu - eps)/sqrt(n)

with probability at least 1 - 2 exp(-c_{K,eps} m), provided that m >= C_{K,eps} n. The latter bound allows us to extend the guarantees for the QuantileSGD algorithm to a wider range of quantiles, under additional restrictions on the model (we will not carry out this analysis in detail; however, see Remark 8). Of course, each of these facts applies equally well to the uncentered and rescaled setting of Corollary 1.

3.1.4 Lower bound for empirical quantiles

We also use the following lower-bound variant of the above result when analyzing QuantileSGD.
Lemma 5.
Let A in R^{m x n} be a random matrix satisfying Assumption 2, and let x* and b be such that Ax* = b (that is, the linear system is consistent). If

    m >= (C/q) n log( c n D / q ),

then

    Pr{ Q_q(x) >= (cq / (D sqrt(n))) ||x - x*|| for all x in R^n } >= 1 - e^{-cm}.

Proof.
We assume without loss of generality that x* = 0 and b = 0; the general result follows in the same way as in Corollary 1 above. By scaling, it suffices to prove the result for x in S^{n-1}. First consider a fixed x in S^{n-1}. By Lemma 2, we can choose c_q = cq/D so that

    Pr( |<x, a_i>| <= c_q/sqrt(n) ) <= q/2    for each i.

By a Chernoff bound,

    Q_q(x) >= c_q/sqrt(n)    (15)

with probability at least 1 - exp(-cqm). Let N be a (c_q/sqrt(n))-net of S^{n-1}, which we can take to have size

    |N| = ( 3 / (c_q/sqrt(n)) )^n = e^{n log(3 sqrt(n)/c_q)}.

By a union bound, there are constants so that if m >= (C/q) n log(c n D / q), then the quantile bound (15) holds for all x in N with probability at least 1 - exp(-cm).

In order to upgrade our bound from N to all of S^{n-1}, it remains to show that Q_q(x) is stable under small perturbations of x. Suppose that x and y in R^n are arbitrary. Then for all i, we have the bound

    | |<x, a_i>| - |<y, a_i>| | <= |<x, a_i> - <y, a_i>| = |<x - y, a_i>| <= ||x - y||.

Therefore

    |<x, a_i>| - ||x - y|| <= |<y, a_i>| <= |<x, a_i>| + ||x - y||.

By taking q-quantiles over i and using the monotonicity of quantiles, it follows that

    |Q_q(x) - Q_q(y)| <= ||x - y||    (16)

for all x, y. Each point in S^{n-1} is within c_q/sqrt(n) of some point in N. Lemma 5 follows by combining (16) with our bound on Q_q(x) over N.

Remark 5.
We require the aspect ratio of A to be at least of order log(n). It is plausible that Lemma 5 can be improved to hold for constant aspect ratios, as was the case for Proposition 2. We will not attempt to do so, and as a result we require QuantileSGD to have a slightly stronger condition on the aspect ratio of A than QuantileRK.

3.2 Analysis of QuantileRK method

In this section we provide a proof that the QuantileRK method converges.
Roadmap.
The proof will proceed as follows. We condition on the sampling of a row that will be accepted by the QuantileRK iteration. We then show that the uncorrupted rows help substantially, while the corrupted rows do not overly affect the convergence. Conditioned on the current row being uncorrupted, we argue that an iteration of the QuantileRK method brings us closer in expectation to x*. To accomplish this, we show that the restriction of A to the acceptable uncorrupted rows is well-conditioned via Proposition 1. In that case, the current iteration of QuantileRK is equivalent to an iteration of the standard RK method on the restricted matrix. This allows us to apply a known per-iteration guarantee for RK.

To argue that corrupted rows do not significantly harm convergence, we consider a subset J of [m] of row indices with |J|/m >= c, with J containing all corrupted indices as a subset. By making J sufficiently large, we ensure that the subset of the rows of A indexed by J inherits incoherence properties from the full matrix (uniformly over all such subsets, due to Lemma 4). Incoherence will ensure that the average projection of x onto a corrupted hyperplane moves the point in a direction nearly orthogonal to x - x*. The length of such a step is bounded by c/sqrt(n) by Proposition 2, so a "bad" step is unlikely to move x much further from x*. In particular, a constant number of "good" steps will suffice to "cancel out" a bad step. If the fraction of bad rows is sufficiently small, then the QuantileRK method will enjoy linear convergence to x*.

Proof of Theorem 1.
We will start by introducing some useful notation. Let E_Accept(k) denote the event that we sample an acceptable row at the k-th step of the method; that is, that the if-statement in line 6 of the QuantileRK Method 1 evaluates to true for that row. Recall that the i-th row of A is acceptable at iteration k if |<x_k, a_i> - b_i| <= Q_q(x_k), where Q_q is defined as in (4). Clearly, Pr(E_Accept(k)) = q for any integer k >= 1.

Next, we define the index sets J, I_1 and I_2. Let J denote a collection of indices of size 2 beta m which contains all corrupted indices and at least beta m acceptable indices.(1) We assume that 2 beta < q, so that there exist that many acceptable indices (as there are exactly floor(qm) acceptable indices in total). Then all acceptable indices are split into two types: those inside the set J (we denote them I_1; by construction, |I_1| >= beta m) and those outside of J (we denote them I_2). Finally, let E_L^k denote the event that the k-th iteration of the QuantileRK method samples an index from the index subset L of [m].

We first observe that

    E_k(||e_{k+1}||^2) = q E_k(||e_{k+1}||^2 | E_Accept(k+1)) + (1 - q) ||e_k||^2,    (17)

since QuantileRK(q) does not update x_k if the sampled row index was not acceptable. Conditioned on choosing an acceptable row, we either pick an index from I_1 or from I_2, and the conditional probability p_J of choosing an index in I_1 satisfies p_J <= 2 beta m / (qm) = 2 beta / q (the upper bound refers to the case when I_1 = J).

Now, given E_{I_2}^{k+1}, the iterate x_{k+1} is obtained by applying an iteration of the standard RK method to the matrix A_{I_2}. Note that I_2 has size at least (q - 2 beta) m. Then, applying Proposition 1 with alpha = q - 2 beta > 0, we have

    sigma_min(A_{I_2}) >= C_{alpha,D} sqrt(m/n)

with probability at least 1 - 2 exp(-c_alpha m), provided that

    m/n >= C_{q,D} := (C_0/alpha) log(DK/alpha).

Since all the rows of A are normalized to have unit norm, we also know that ||A_{I_2}||_F <= sqrt(m). Therefore, we may bound the condition number of A_{I_2} as

    kappa(A_{I_2}) = ||A_{I_2}||_F / sigma_min(A_{I_2}) <= C_{q,D} sqrt(n).    (18)

Note that Proposition 1 gives a uniform bound for all index subsets of size at least alpha m, so in all iterations of the method the new sets I_2 will have condition number bounded as in (18), with probability at least 1 - 2 exp(-c_{q,D} m). Then, by the analysis of the standard RK method [SV09] given in (3), we have

    E_k(||e_{k+1}||^2 | E_{I_2}^{k+1}) <= (1 - c_1/n) ||e_k||^2.

Now, we consider two cases. In the no-corruptions case, when beta = 0, the set I_1 is empty and p_J = 0 by definition. So,

    E_k(||e_{k+1}||^2) <= q (1 - c_1/n) ||e_k||^2 + (1 - q) ||e_k||^2 <= (1 - q c_1/n) ||e_k||^2.    (19)

(1) We assume without loss of generality that beta m is an integer. If this is not the case, consider beta' such that beta' m = ceil(beta m) instead of beta throughout the proof.
In the other case, when beta > 0, we need to consider the second possibility, that the next index comes from I_1. Conditioned on taking an acceptable row i, we can write the update as x_{k+1} = x_k - h_i a_i with |h_i| <= Q_q(x_k), so that

    E_k(||e_{k+1}||^2 | E_{I_1}^{k+1}) = E_k(||e_k - h_i a_i||^2 | E_{I_1}^{k+1})
        = ||e_k||^2 + E_k(h_i^2 | E_{I_1}^{k+1}) - 2 E_k(h_i <e_k, a_i> | E_{I_1}^{k+1})
        <= ||e_k||^2 + Q_q(x_k)^2 + 2 Q_q(x_k) E_k(|<e_k, a_i>| : i ~ Unif(I_1)).

We would like to bound these last two terms. By Remark 3, for any alpha <= 1 - q - beta,

    Pr( Q_q(x_k) <= (C_K / (alpha sqrt(n))) ||e_k|| ) >= 1 - e^{-m}.

Also, we apply Lemma 4 to the set I_1 (recall that |I_1| >= beta m) to get that, with probability 1 - 2 exp(-cm),

    E_k(|<e_k, a_i>| : i ~ Unif(I_1)) = (1/|I_1|) sum_{i in I_1} |<e_k, a_i>| <= C ||e_k|| (1/|I_1|) sqrt(m |I_1| / n) <= C ||e_k|| / sqrt(beta n).

Thus,

    E_k(||e_{k+1}||^2 | E_{I_1}^{k+1}) <= ( 1 + (c_2 + c_3/sqrt(beta)) / n ) ||e_k||^2.    (20)

So in this case the norm of the error can increase, but not by too much (as we will see below). By the law of total expectation, we have

    E_k(||e_{k+1}||^2 | E_Accept(k+1)) = p_J E_k(||e_{k+1}||^2 | E_{I_1}^{k+1}) + (1 - p_J) E_k(||e_{k+1}||^2 | E_{I_2}^{k+1})
        <= [ p_J ( 1 + (c_2 + c_3/sqrt(beta)) / n ) + (1 - p_J)(1 - c_1/n) ] ||e_k||^2
        = [ 1 - c_1/n + p_J (c_1 + c_2 + c_3/sqrt(beta)) / n ] ||e_k||^2
        <= [ 1 - c_1/n + (2 beta / q) (c_1 + c_2 + c_3) / (sqrt(beta) n) ] ||e_k||^2
        <= [ 1 - 0.5 c_1/n ] ||e_k||^2,

where the last step holds if beta is a sufficiently small constant (we need sqrt(beta) <= cq). Finally, from (17) we obtain the per-iteration guarantee

    E_k(||e_{k+1}||^2) <= q (1 - 0.5 c_1/n) ||e_k||^2 + (1 - q) ||e_k||^2 <= (1 - 0.5 q c_1/n) ||e_k||^2.    (21)

Theorem 1 now follows from (19) or (21) by induction.

Remark 6 (Condition on beta). We need the fraction of corruptions beta to be sufficiently small. Specifically, our proof of Theorem 1 requires sqrt(beta) < cq, where c is some small positive constant. Intuitively, this is required since the quantile bound (admissibility) is the only way to bound the potential loss when a step is made using one of the corrupted equations (as we do not impose any restrictions on the size of the corruptions). Moreover, the expected loss of progress, given a projection onto an admissible corrupted equation, must be small enough that it is compensated by the expected exponential convergence rate when one of the equations from the uncorrupted part is selected.

Remark 7.
Although the bounded density assumption is crucial for Proposition 1 to hold (see Remark 2), one should not expect the failure of Proposition 1 to result in the QuantileRK method not converging. In the Bernoulli case, the per-iteration guarantee given above likely no longer holds; however, one expects it to fail only for a very small set of vectors x_k. Provided that the QuantileRK method does not attract iterates to this set of bad vectors, one should still expect convergence from a randomly chosen x_0. We leave such an analysis to future work. (However, we empirically demonstrate convergence in Figure 3(a).)

3.3 Analysis of QuantileSGD method

In this section, we provide a proof that the QuantileSGD method converges. To do so, we first introduce an optimal SGD method in Section 3.3.1 and then prove that QuantileSGD approximates this optimal method in Section 3.3.2. We then give an improved analysis in the streaming setting in Section 3.3.3.
3.3.1 The OptSGD method

As a first step towards the analysis of the quantile-based SGD method, we introduce the OptSGD method, which takes steps of the optimal size towards the solution. Note that SGD iterates can be written in the form

    x_{k+1} = x_k - eta_k s_i(x_k) a_i,  where  s_i(x_k) := sign(<a_i, x_k> - b_i);    (22)

that is, the vector s_i(x_k) a_i is directed from the hyperplane defined by the i-th equation towards the half-space that x_k lies in. We assume that SGD samples rows uniformly, so i ~ Unif([m]). The constant eta_k > 0 is the step size (recall that ||a_i|| = 1). OptSGD chooses the step size eta*_k so that the expected distance to the solution, E ||e_{k+1}||^2 = E ||x_{k+1} - x*||^2, is minimized. Namely, we have

    E(||e_{k+1}||^2) = E(||e_k - s_i(x_k) eta_k a_i||^2)
        = E( ||e_k||^2 - 2 s_i(x_k) <e_k, a_i> eta_k + s_i(x_k)^2 ||a_i||^2 eta_k^2 )
        = ||e_k||^2 - 2 E(s_i(x_k) <e_k, a_i>) eta_k + eta_k^2
        = ( eta_k - E(s_i(x_k) <e_k, a_i>) )^2 - ( E(s_i(x_k) <e_k, a_i>) )^2 + ||e_k||^2,    (23)

which is minimized by setting

    eta*(x_k) = E(s_i(x_k) <e_k, a_i>) = (1/m) sum_{i=1}^m s_i(x_k) <e_k, a_i>.    (24)

3.3.2 QuantileSGD approximates OptSGD

In the previous section, we derived a theoretically optimal step size for l1 stochastic gradient descent. The formula (24) for the step size relies on e_k, which is unknown at runtime. Actually, since <e_k, a_i> = <x_k, a_i> - <x*, a_i> = <x_k, a_i> - b_i for any uncorrupted equation, it is the presence of corruptions that makes eta*(x_k) unavailable at runtime. Here we show that order statistics can be applied to give an approximation to the optimal step size.

First, let us show that eta*(x_k) is well approximated by the empirical mean of the projection length, M(x_k - x*). We notice that the sums defining eta*(x_k) and M(x_k - x*) differ only in the terms corresponding to the indices of the corrupted equations. So, given that the fraction of corruptions is small enough, we can efficiently bound this difference.
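The optimal step (24) and its proxies are easy to compare numerically; in this sketch (our own illustration) we compute eta*(x) using the planted solution, the oracle mean M(x - x*), and the quantile of the corrupted residual that is actually available at runtime.

    import numpy as np

    def step_sizes(A, b, x, x_star, q=0.4):
        # Compare the (oracle) optimal SGD step (24) with its proxies.
        s = np.sign(A @ x - b)                 # s_i(x), computed from corrupted b
        e = x - x_star
        eta_opt = np.mean(s * (A @ e))         # eta*(x); needs x*, oracle only
        M = np.mean(np.abs(A @ e))             # M(x - x*); also needs x*
        Q = np.quantile(np.abs(A @ x - b), q)  # available at runtime
        return eta_opt, M, Q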
Proposition 3. Fix any delta in (0, 1). Let the system be defined by a random matrix A in R^{m x n} satisfying Assumptions 1 and 2 with m >= C_K n, and beta = |supp(b_C)|/m a small enough positive constant. Let eta*(x) be the optimal step size for the SGD method defined as in (24). Then, with probability at least 1 - c exp(-c_K m), we have for any x in R^n that

    (1 - delta) eta*(x) <= M(x - x*) <= (1 + delta) eta*(x).    (25)

Proof.
Let S denote the set of indices corresponding to negative terms in the sum (24). Note that for all uncorrupted equations i, we have s_i(x_k) = sign(<e_k, a_i>), so the i-th term in eta*(x_k) is non-negative, and hence |S| <= beta m. We then have

    | eta*(x) - M(x - x*) | <= (2/m) sum_{i in S} |<x - x*, a_i>|.

Normalizing x - x* and applying Lemma 4 allows us to further bound

    | eta*(x) - M(x - x*) | <= (2/m) C_K sqrt(m |S| / n) ||x - x*|| <= (C_K sqrt(beta) / sqrt(n)) ||x - x*||

uniformly for all x, with probability at least 1 - 2 exp(-cm). Moreover, by Remark 4,

    M(x - x*) >= (c / sqrt(n)) ||x - x*||

for all x, with probability at least 1 - 2 exp(-c_K m). Thus, by taking beta a sufficiently small constant (so that the difference between eta*(x) and M(x - x*) is negligible compared to the size of M(x - x*)), we conclude the proof of Proposition 3.

Although the empirical mean M(x - x*), as well as eta*_k, is not available at runtime, the above proposition allows us to show that, in order to obtain a near-optimal convergence guarantee, it suffices to approximate eta*_k to within a constant factor.

Proposition 4.
Let the system be defined by a random matrix A in R^{m x n} satisfying Assumptions 1 and 2 with m >= C_K n. Suppose we run an SGD method (22) with step size eta_k satisfying 0 < c_1 <= eta_k / eta*(x_k) <= c_2 < 2 at each iteration k = 1, 2, 3, ..., where eta*(x_k) is the optimal step size given by (24). Then, for any beta = |supp(b_C)|/m in (0, 1), there exists a constant c = c(c_1, c_2) > 0 such that

    E(||e_{k+1}||^2) <= ( 1 - c (eta*(x_k) / ||e_k||)^2 ) ||e_k||^2.    (26)

Moreover, if the fraction of corrupted equations beta is small enough, then with probability at least 1 - c exp(-c_K m), A is sampled such that the rate of convergence is linear; namely, there exists a constant C = C(c_1, c_2) > 0 such that

    E(||e_{k+1}||^2) <= (1 - C/n) ||e_k||^2.    (27)

Proof.
Throughout the proof, we adopt the shorthand notation $\eta_k^* = \eta^*(x_k)$. By the condition on $\eta_k$, we have that $|\eta_k - \eta_k^*| \le \eta_k^* \max\{|c_1 - 1|, |c_2 - 1|\}$, and $c := 1 - (\max\{|c_1 - 1|, |c_2 - 1|\})^2 > 0$. So, by equation (23) and the definition of $\eta_k^*$ (in (24)), we have that

$$\mathbb{E}\big(\|e_{k+1}\|^2\big) = (\eta_k - \eta_k^*)^2 - (\eta_k^*)^2 + \|e_k\|^2 \le \|e_k\|^2 - c\,(\eta_k^*)^2,$$

and so

$$\mathbb{E}\big(\|e_{k+1}\|^2\big) \le \left(1 - c \left(\frac{\eta_k^*}{\|e_k\|}\right)^2\right) \|e_k\|^2.$$

To show that the convergence rate is linear, note that by Proposition 3 and Remark 4 we have the bound

$$\eta^*(x_k) \ge \frac{M(x_k - x^*)}{1 + \delta} \gtrsim \frac{1}{\sqrt{n}}\, \|x_k - x^*\|.$$

This concludes the proof of Proposition 4.
Roadmap.
We are now set to give a proof of Theorem 2. The general plan is as follows: we know that the quantiles of the residual, $Q_q(x_k)$, are well-approximated by the empirical uncorrupted quantiles $\tilde{Q}_q(x_k)$ (Lemma 3); we then show that the empirical uncorrupted quantiles concentrate near the empirical mean $M(x - x^*)$, which is in turn close enough to the optimal step size $\eta^*(x)$ (Proposition 3). Finally, we invoke Proposition 4 to conclude the linear convergence rate of the QuantileSGD($q$) method.

Proof of Theorem 2. We upper bound the $(q - \beta)$-quantile of the corrupted residual:

$$Q_{q-\beta}(x_k) \le \tilde{Q}_q(x_k) \le \frac{1}{1 - q}\, M(x_k - x^*) \le \frac{1 + \delta}{1 - q}\, \eta^*(x_k) < 2\, \eta^*(x_k), \qquad (28)$$

where the first inequality follows from Lemma 3, the second from Markov's inequality, the third from Proposition 3, and the fourth follows, with probability at least $1 - \exp(-cm)$, from Proposition 3 applied with $\delta \in (0, 1 - 2q)$.

For the lower bound, we have that $Q_{q-\beta}(x_k) \ge \tilde{Q}_{q-2\beta}(x_k)$ by Lemma 3. Then,

$$\tilde{Q}_{q-2\beta}(x) \ge \frac{c}{\sqrt{n}}\, \|x - x^*\|$$

for all $x$, with probability at least $1 - \exp(-cm)$, by Lemma 5. Now, $M(x - x^*) \le \frac{C}{\sqrt{n}}\, \|x - x^*\|$, so we may upper bound $\eta^*(x)$:

$$\eta^*(x) \le \frac{M(x - x^*)}{1 - \delta} \le \frac{C}{(1 - \delta)\sqrt{n}}\, \|x - x^*\| \le \frac{C}{(1 - \delta)\, c}\, \tilde{Q}_{q-2\beta}(x_k) \le c'\, Q_{q-\beta}(x_k)$$

for some positive constant $c'$. Combining these upper and lower bounds on $Q_{q-\beta}(x)$, we find that for all $x \in \mathbb{R}^n$,

$$0 < \frac{1}{c'} \le \frac{Q_{q-\beta}(x)}{\eta^*(x)} \le \frac{1 + \delta}{1 - q} < 2.$$

We have shown that the hypothesis of Proposition 4 holds; Theorem 2 follows by induction.
Remark 8.
In some cases, for example when the $a_i$ are independent vectors sampled uniformly from $S^{n-1}$, we can show that a bigger range of quantiles for the SGD step guarantees exponential convergence of the QuantileSGD method. In particular, using Gaussian concentration instead of Markov's inequality in (28), the statement of Theorem 2 holds for QuantileSGD($q - \beta$) for all $q \in (0, 1)$. Note that this justifies the optimal values of the quantile $q$ obtained experimentally (see Figure 1(b)).

The Streaming Setting

In the matrix setting we only prove convergence for a sufficiently small fraction of corruptions $\beta$. While one could in principle unwind the constants from the random matrix theorems that we have applied, doing so would be unlikely to yield new insights. Instead, we note that the key complication in the matrix setting was handling "asymmetries" in the matrix $A$: while the rows were sampled over $S^{n-1}$ in a close-to-uniform way, there was no guarantee that the rows of $A$ (representing only a sample from this distribution) were uniformly spread over the sphere.

Here we present a more optimized analysis in the streaming setting, which may be viewed as a model for extremely tall matrices where each row is likely to be sampled only once in the course of the method. In particular, it allows us to justify the QuantileSGD method when up to a $0.35$ fraction of all equations are corrupted (note that in both Theorem 1 and Theorem 2 we formally asked for the fraction of corruptions $\beta$ to be "small enough").

Instead of a matrix, let us consider some distribution $\mathcal{D}$ over $\mathbb{R}^n$ and a corruption rate $\beta \in [0, 1]$. At each step we receive a pair $(v, r)$ (in the non-streaming setting this pair was a row of the matrix and the corresponding entry of the vector $b$, respectively). The vector $v$ is always sampled from $\mathcal{D}$. With probability $1 - \beta$, $r$ is selected so that $r = \langle v, x^* \rangle$, and with probability $\beta$, $r$ is chosen arbitrarily, and possibly adversarially. Our goal is to approximate $x^*$.

For simplicity, we allow ourselves an arbitrary number of samples to estimate the quantiles of the residual $Q_q(x_k)$, where the $a_i$ from Definition 4 are the random vectors $v$ and the respective $b_i$ are given by the samples.
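The streaming data model is easy to simulate. The sketch below (hypothetical names; the particular corrupted response is just one arbitrary choice standing in for an adversary) draws a single pair $(v, r)$ with Gaussian $v$, matching the setting of Theorem 4 below.

```python
import numpy as np

def sample_pair(x_star, beta, rng):
    """Draw one streaming sample (v, r): v ~ N(0, I_n); with probability
    1 - beta the response is exact, r = <v, x*>, and with probability
    beta it is corrupted (here by an arbitrary adversarial value)."""
    v = rng.standard_normal(x_star.shape[0])
    if rng.random() < beta:
        r = 100.0 * rng.standard_normal()   # arbitrary corrupted response
    else:
        r = v @ x_star                      # consistent response
    return v, r
```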
Theorem 4. In the streaming setting with adversarial corruptions and Gaussian samples (namely, the $v = a_i$ are standard $n$-variate Gaussian random vectors), QuantileSGD($q$) converges to $x^*$ with $\beta = 0.35$, as long as the quantile $q$ is chosen sufficiently small.

Remark 9. The model for the left-hand side of the system is different from our earlier convention; in particular, the $a_i$'s do not have exactly unit norm. However, this distinction is unimportant, since (i) our methods are invariant under rescaling of the $a_i$'s, and (ii) the one-dimensional projections of the uniform distribution over $\sqrt{n}\, S^{n-1}$ converge in distribution to a Gaussian as $n \to \infty$.

Proof. Recall that for QuantileSGD we chose the step size $\eta_k = Q_q(x_k)$. In the streaming setting with Gaussian samples, the value of $Q_q(x)$ only depends on $\|x - x^*\|$. This follows directly from the definition of $\tilde{Q}_q$, along with the rotation-invariance of Gaussian vectors. Furthermore, $Q_q$ respects dilations about $x^*$ in the sense that

$$Q_q(x^* + \lambda (x - x^*)) = \lambda\, Q_q(x)$$

for $\lambda \ge 0$. Again, this is a simple check from the definition of $\tilde{Q}$. The same properties hold for the optimal SGD step size $\eta^*$ as per (24), for the same reasons.

These properties imply that $Q_q(x)/\eta^*(x)$ is constant over $\mathbb{R}^n \setminus \{x^*\}$. We are going to show that for small $q$ this quantity lies strictly between $0$ and $2$; in other words,

$$0 < c < Q_q(x_k)/\eta^*(x_k) < C < 2 \qquad (29)$$

for all iterates $x_k$. (Of course this bound holds for all $x \in \mathbb{R}^n$, but we emphasize that we apply it to the iterates.) This will allow us to apply Proposition 4 (in the form of (26)) to conclude that QuantileSGD($q$) converges for $q$ small enough.

The lower bound of (29) clearly holds for $q$ positive, since $Q_q(x_k)/\eta^*(x_k)$ is nonzero as long as $x_k \ne x^*$ (and of course $\eta^*(x_k) < \infty$).

Also, recall that each uncorrupted equation contributes $\mathbb{E}\,|\langle x_k - x^*, a_i \rangle|$ to $\eta^*(x_k)$. So we can lower bound

$$\eta^*(x_k) \ge (1 - \beta)\, \mathbb{E}(|\langle e_k, a_i \rangle|) + \beta\, \mathbb{E}(-|\langle e_k, a_i \rangle|) = (1 - 2\beta)\, \mathbb{E}(|\langle e_k, a_i \rangle|) = (1 - 2\beta) \sqrt{\frac{2}{\pi}}\, \|e_k\|,$$

where the last constant is the expectation of a standard half-normal random variable.

By Lemma 3, we also have $Q_q(x_k) \le \tilde{Q}_{q+\beta}(x_k) = \|e_k\|\, \Phi_{q+\beta}$, where $\Phi_q$ denotes the $q$-quantile of the standard half-normal distribution $|\mathcal{N}(0,1)|$. The upper bound in equation (29) is equivalent to the inequality

$$\|e_k\|\, \Phi_{q+\beta} < C\, (1 - 2\beta) \sqrt{\frac{2}{\pi}}\, \|e_k\|, \qquad (30)$$

where $C$ is allowed to be any constant smaller than $2$ (e.g., $1.9$). Such a choice of $C$ and a sufficiently small $q$ exist as long as

$$\tilde{Q}_\beta(|\mathcal{N}(0,1)|) < 2 \sqrt{\frac{2}{\pi}}\, (1 - 2\beta).$$

One can verify numerically that this inequality holds for $\beta = 0.35$, and indeed for slightly larger values. This concludes the proof of Theorem 4.
Remark 10. One can find explicit pairs $(q, \beta)$ that work by solving the inequality (30) numerically; smaller quantiles tolerate larger corruption rates, with the admissible rate approaching $0.35$ as $q \to 0$.
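The numerical verifications mentioned in the proof of Theorem 4 and in Remark 10 take only a few lines. The sketch below assumes the reconstruction of inequality (30) given above (so the computed pairs are illustrative, not quoted from the paper), and uses SciPy's half-normal quantile function:

```python
import numpy as np
from scipy.stats import halfnorm
from scipy.optimize import brentq

c = np.sqrt(2 / np.pi)   # E|N(0,1)|, the half-normal mean

# Limiting condition of (30) as q -> 0: Phi_beta < 2 * c * (1 - 2 * beta).
beta = 0.35
print(halfnorm.ppf(beta) < 2 * c * (1 - 2 * beta))  # True (0.454 < 0.479)

# Largest admissible beta for a given quantile q, found as the root of
# Phi_{q + beta} = 2 * c * (1 - 2 * beta).
def max_beta(q):
    return brentq(lambda b: halfnorm.ppf(q + b) - 2 * c * (1 - 2 * b), 0.0, 0.49)

for q in (0.05, 0.1, 0.2):
    print(q, round(max_beta(q), 3))
```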
Remark 11. An adversary generating corruptions at runtime can make the bounds in the proof of Theorem 4 as tight as desired. Thus one cannot expect convergence in general if $\beta$ is much larger than $0.35$.
Remark 12. While the above analysis gives results that are on the same order of magnitude as experiments show, this setting is far more adversarial than what one would encounter in practice. Our experiments demonstrate that one can tolerate higher levels of corruption than what our theory predicts in this setting. Extending the analysis to the setting of our experiments would require fixing a particular model for the corruptions; by considering adversarial corruptions generated at runtime, we handle any such model.

Implementation Considerations
In this section, we discuss several important considerations regarding the implementation of QuantileRKand QuantileSGD. In particular, we touch on the streaming setting in which the rows of the measurementmatrix are sampled from a distribution and provided in an online manner. We additionally discuss variousconsiderations for constructing the sample of the residual, and the choice of quantile to apply in each method.
First, we note that the streaming setting described in Section 3.3.3 provides a good model for many of ourexperiments. For example, in several of the experiments below, we sample 2000 rows (2000 iterations) froma 50000 row Gaussian matrix. We expect most rows to be sampled only once, which places us within thecontext of the streaming setting. For this reason, we expect that our methods can in practice handle a largerfraction of corruptions than is reflected in Theorems 1 and 2.
Next, we mention several approaches for decreasing the computational burden of computing the residual in each iteration of QuantileRK and QuantileSGD. Note that both QuantileRK and QuantileSGD, as written in Methods 1 and 2, use a sample of the residual of size $t$. This is much more efficient than constructing the entire residual in each iteration, with the cost scaling with $tn$ instead of $mn$.

The optimal sample size depends upon the quantile chosen, the fraction of corruptions, and the number of iterations employed. Given the fraction of corruptions, one should choose the sample size and quantile so that the number of corruptions in the sample is at most $(1 - q)t$ with high probability (this could be calculated with a Chernoff bound). In particular, more aggressive methods with a higher choice of quantile demand a larger sample size to ensure that corruptions may be avoided by the quantile calculation.

For QuantileRK, a larger quantile corresponds to a more aggressive method which is more likely to make the sampled projection. The quantile can be chosen quite close to one if very few corruptions are expected. Meanwhile, for QuantileSGD, the OptSGD theory demonstrates that the optimal step size to select is the mean of the uncorrupted residual. In the case of Gaussian rows with no corruptions, this mean happens to coincide with the $0.58$ quantile, so for QuantileSGD the quantile should be chosen near $0.58$.

Now, as mentioned previously, constructing the sample of the residual requires $O(tn)$ computation. We can decrease this per-iteration cost by reusing residual entries between iterations. This suggests a "sliding window" approach, where the sample from which we compute the quantile consists of residual entries collected over multiple iterations. We implement this approach in the experiments below, using on the order of several hundred of the most recently computed residual entries. One might expect this to cause a significant loss in performance due to the varying scale of the residuals from iteration to iteration, but empirically we see nearly identical performance for moderately sized windows (on the order of 100-500 iterations).

The sliding window approach raises the question of what to do in the initial iterations, before the iteration number has reached the window size. One could populate the entire window in the first iteration by sampling as many residual entries as the window size, and then replace residual entries as new ones are sampled in subsequent iterations. Alternatively, one could simply use a partial window until the iteration number reaches the window size. However, this could significantly slow convergence if there are corruptions that are large relative to the initial error $\|x_0 - x^*\|$ and that get sampled in these initial iterations.
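A minimal sketch of the sliding-window residual sample described above (hypothetical class and parameter names; the bookkeeping is an assumption consistent with the description, not the authors' exact implementation):

```python
from collections import deque
import numpy as np

class SlidingWindowQuantile:
    """Keep a window of recently sampled residual magnitudes and report
    their q-quantile, for use as the step/threshold Q_q(x_k)."""

    def __init__(self, q, window_size=400):
        self.q = q
        self.window = deque(maxlen=window_size)   # old entries fall out

    def update(self, x, A, b, t, rng):
        """Sample t fresh residual entries at the current iterate x."""
        rows = rng.integers(A.shape[0], size=t)
        self.window.extend(np.abs(A[rows] @ x - b[rows]))

    def threshold(self):
        """Quantile of the window (assumes update() has been called)."""
        return np.quantile(list(self.window), self.q)
```

Each iteration calls `update` with a small `t` and uses `threshold()` in place of the exact quantile, so the per-iteration cost stays $O(tn)$ rather than $O(mn)$.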
[Figure 1: $\log(\|x_k - x^*\| / \|x_0 - x^*\|)$ for (a) QuantileRK and (b) QuantileSGD run on a $50000 \times 100$ Gaussian system, with various corruption rates $\beta$ and quantile choices. Each experiment is run using Python version 3.6.9 on a single 24-core machine.]

Our theoretical analysis does not provide specific guidance for the choice of quantile (besides rough relationships between $q$ and $\beta$), so we investigate the problem of choosing quantiles empirically. Figure 1 shows the behavior of QuantileRK and QuantileSGD for various corruption rates $\beta$ and choices of quantile $q$. For each $\beta$, we plot the log relative error after 2000 iterations as a function of $q$ (this is the quantity $\log(\|x_k - x^*\| / \|x_0 - x^*\|)$). In order to de-noise the plots, each plotted point is the median over 10 trials. On each trial we generate a new $50000 \times 100$ Gaussian system with a $\beta$ fraction of corrupted entries; a corrupted entry of $b$ is modified by adding a uniformly random value from a fixed symmetric interval.

It is interesting to point out the optimal quantiles for the various corruption rates. In the case of QuantileRK, we see that the optimal quantile tends to be just shy of $1 - \beta$. This aligns with the intuition that QuantileRK should be as aggressive as possible while avoiding projections onto badly corrupted hyperplanes. It is clear that QuantileRK cannot choose a quantile larger than $1 - \beta$: otherwise we are likely to sample within the $\beta$ fraction of corrupted rows, resulting in a threshold which is too large. In practice it is often best to choose a quantile somewhat smaller than what the graph suggests; as the quantile approaches $1 - \beta$, the risk of performing a bad projection becomes large enough that we observe bad projections within a few thousand iterations.

We see that QuantileSGD is much more robust to the choice of quantile: for each corruption rate there is an optimal quantile, but we observe near-optimal convergence behavior over a wide range of quantiles around it.

In Figures 2 and 3 we show the convergence behavior of our methods on a $50000 \times 100$ system with a fixed fraction $\beta$ of corrupted entries, generated as above.
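For reference, the random systems used in these experiments can be generated along the following lines (a sketch; the corruption rate `beta` and magnitude `c` are left as free parameters, since the specific values are tied to each figure):

```python
import numpy as np

def corrupted_gaussian_system(m, n, beta, c, seed=0):
    """Build a row-normalized Gaussian system A x = b, then corrupt a
    beta fraction of the entries of b by adding uniform noise from [-c, c]."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit-norm rows
    x_star = rng.standard_normal(n)
    b = A @ x_star                                  # consistent system
    idx = rng.choice(m, size=int(beta * m), replace=False)
    b[idx] += rng.uniform(-c, c, size=idx.size)     # sparse corruptions
    return A, b, x_star
```

For example, `corrupted_gaussian_system(50000, 100, 0.2, 5.0)` produces a system of the shape used throughout this section (the values 0.2 and 5.0 being placeholders).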
The label "RK" signifies the standard Randomized Kaczmarz method without thresholding. The methods marked QuantileRK-SW and QuantileSGD-SW are the "sliding window" versions of QuantileRK and QuantileSGD, while the methods marked QuantileRK and QuantileSGD are the sampled variants. We set our window size and sample size to 400 for these experiments. Finally, we include OptSGD only in Figure 2(a).

In Figure 2(a) we show a normalized Gaussian system (i.e., a system with rows sampled uniformly over $S^{n-1}$). We observe that all four of our quantile methods exhibit similar convergence behavior. Notably, these methods perform comparably to OptSGD, which chooses an optimal step size at each iteration. (Of course, OptSGD cannot be run in practical settings, as it requires knowledge of $x^*$.)

In Figure 2(b) we consider a system with "coherent rows". This matrix is created by generating each entry i.i.d. uniformly in $[0, 1]$, and then normalizing the rows of the resulting matrix. We call the system coherent because pairs of rows typically have large inner product with one another. Such a matrix does not have isotropic rows, and is therefore not covered by our theoretical analysis. Nonetheless, we do observe convergence, albeit at a slower rate than for the Gaussian model.

[Figure 2: Relative error as a function of iteration count plotted for the $50000 \times 100$ (a) Gaussian and (b) coherent models.]

In Figure 3(a) we show a Bernoulli system. Here each entry of our matrix is sampled uniformly in $\{-1, 1\}$ and the rows are normalized. This matrix violates the "bounded density" assumption of our theoretical analysis. However, we still see convergence behavior comparable to the Gaussian case.

Figure 3(b) shows a Gaussian system which is corrupted adversarially. In this model, we choose a random collection of indices to corrupt; the corruptions are then chosen so that the corrupted subsystem is consistent. This model is adversarial in the sense that it tries to "trick" the method into believing that there is another solution in addition to $x^*$. (Note that this is different from the truly adversarial corruptions discussed in the context of the streaming setting.) Our theory does address this case, and here we see convergence comparable to the randomly corrupted Gaussian case.

[Figure 3: Relative error as a function of iteration count plotted for the $50000 \times 100$ (a) Bernoulli and (b) adversarial models; in the adversarial model the corruptions form a consistent subsystem.]

Each of our experiments so far dealt with extremely tall $50000 \times 100$ matrices. Since we ran at most 10000 iterations, we were unlikely to sample a given row many times; thus our experiments have effectively been run in the streaming setting. A strength of our theory was providing convergence guarantees even for matrices which are not too tall. In Figure 4(a) we show the convergence behavior of QuantileSGD and QuantileRK as a function of the aspect ratio. In this plot we consider random Gaussian matrices of size $100a \times 100$ with a fixed corruption rate $\beta$, where $a = m/n$ is the aspect ratio. Each data point is the median error taken over 100 separate trials.

In Figure 4(b) we illustrate the behavior of QuantileSGD and QuantileRK as the corruption sizes are varied. For each value $x$ on the $x$-axis, we corrupt the vector $b$ by adding values sampled uniformly from $[-x, x]$ to a fixed $\beta$ fraction of its entries.

[Figure 4: (a) Log relative error for QuantileSGD and QuantileRK after 1000 iterations on $100a \times 100$ Gaussian systems, where $a = m/n$ is the aspect ratio of the matrix. (b) Log relative error for QuantileSGD and QuantileRK after 2000 iterations, as a function of corruption size; we use a $50000 \times 100$ Gaussian system and corrupt it by adding a uniform value in $[-x, x]$.]

In particular, QuantileRK seems to perform better when the corruptions are very large. The reason for this is that when the corruptions are very small relative to $\|x_k - x^*\|$, the system behaves as though it is consistent. For a consistent system, QuantileRK behaves too conservatively by rejecting 30 percent of the rows. When the size of the corruptions becomes comparable to or larger than $\|x_0 - x^*\|$, this behavior disappears. QuantileSGD, on the other hand, behaves better for consistent systems, as rows are never rejected: the more consistent the system, the more likely a given step is to move the iterate closer to $x^*$. We see this behavior for QuantileRK and QuantileSGD in Figure 1 as well. (This type of behavior was noted in [HN18b], although for different reasons.)

[Figure 5: (a) Relative error for each method run on a $1200 \times 400$ system designed for tomography; corruptions were added to 100 uniformly random entries of $b$. (b) Relative error for each method run on a $699 \times 10$ matrix obtained from the Wisconsin Breast Cancer dataset; corruptions were added to 100 uniformly random entries of $b$.]
Finally, in Figure 5 we illustrate our methods on two real-world data sets. In Figure 5(a), we experiment on a tomography problem generated using the Matlab Regularization Toolbox by P. C. Hansen [Han07]. We present a 2D tomography problem $Ax = b$ for an $m \times n$ matrix with $m = fN^2$ and $n = N^2$; here $A$ corresponds to the absorption along a random line through an $N \times N$ grid. In this experiment, we set $N = 20$ and the oversampling factor $f = 3$, which yields a matrix $A \in \mathbb{R}^{1200 \times 400}$. As the resulting system was consistent, we randomly sampled 100 indices uniformly from among the rows of $A$ and corrupted the right-hand side vector $b$ in these entries by adding a uniformly random value. In Figure 5(b), using a matrix $A \in \mathbb{R}^{699 \times 10}$ obtained from the Wisconsin Breast Cancer dataset [Lic13], we construct $b$ to form a consistent system, and then corrupt a random selection of 100 entries of the right-hand side, again by adding a uniformly random value.
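For completeness, the two update rules compare as follows in code (a sketch based on the descriptions above: QuantileRK projects only when the sampled residual passes the quantile test, while QuantileSGD always steps by the quantile-sized amount; this is a paraphrase of Methods 1 and 2, not their verbatim pseudocode):

```python
import numpy as np

def quantile_rk_step(x, a_i, b_i, threshold):
    """Project onto the sampled hyperplane only if its residual is small
    relative to the current quantile threshold; otherwise reject the row."""
    r = a_i @ x - b_i
    if abs(r) <= threshold:
        return x - r * a_i            # Kaczmarz projection (unit-norm a_i)
    return x                          # row rejected this iteration

def quantile_sgd_step(x, a_i, b_i, threshold):
    """Always step by the quantile-sized amount in the sign direction,
    i.e. take eta_k = Q_q(x_k) in the update (22)."""
    return x - threshold * np.sign(a_i @ x - b_i) * a_i
```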
Conclusion

In this work, we propose two new methods, QuantileRK and QuantileSGD, for solving large-scale systems of equations which are inconsistent due to sparse, arbitrarily large corruptions in the measurement vector. Such corrupted systems of equations arise in practice in many applications, but are especially abundant and challenging in areas such as distributed computing, the Internet of Things, and other network problems facing potentially adversarial corruption.

The QuantileRK and QuantileSGD methods make use of a quantile statistic of a sample of the residual in each iteration. We prove that each method enjoys exponential convergence under mild assumptions on the distribution of the entries of the measurement matrix $A$, the quantile parameter of the method $q$, and the fraction of corruptions $\beta$.

Our experiments support these theoretical results, and also illustrate that the methods converge in many scenarios not captured by our theoretically required assumptions. In particular, these methods are able to handle fractions of corruption larger than those predicted theoretically, and converge for systems defined by structured and real measurement matrices which are far from the random matrices for which our theoretical results hold. We note that, both theoretically and experimentally, the magnitude of the corruptions does not negatively impact convergence.

In future work, we plan to extend our theoretical results to systems in which there are a few large-scale corrupted equations as well as small noise throughout the measurement vector $b$, to the case where the measurement matrix has Bernoulli entries (where experimentally we see convergence), to systems of inequalities, and to partially-greedy row sampling schemes. We additionally plan to generalize our theoretical convergence results to include matrix parameters in the rate statements, which would allow us to extend to the case of matrices with coherent rows. Finally, we are interested in models in which the corruption affects not only the measurement vector $b$ but also the measurement matrix $A$; we plan to develop methods for these models of corruption as well.

References

[ABH05] E. Amaldi, P. Belotti, and R. Hauser. Randomized relaxation methods for the maximum feasible subsystem problem. In
Integer programming and combinatorial optimization , volume 3509 of
Lecture Notes in Comput. Sci. , pages 249–264. Springer, Berlin, 2005.[Agm54] S. Agmon. The relaxation method for linear inequalities.
Canadian J. Math. , pages 382–392,1954.[AK95] E. Amaldi and V. Kann. The complexity and approximability of finding maximum feasiblesubsystems of linear relations.
Theor. Comput. Sci. , 147(1-2):181–210, 1995.[BCN18] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning.
SIAM Rev. , 60(2):223–311, 2018.[Bot10] L. Bottou. Large-scale machine learning with stochastic gradient descent. In
Proc. of COMPSTAT, pages 177–186. Springer, 2010. [BR73] I. Barrodale and F. D. Roberts. An improved algorithm for discrete $\ell_1$ linear approximation. SIAM J. Numer. Anal., 10(5):839–848, 1973. [BS80] P. Bloomfield and W. Steiger. Least absolute deviations curve-fitting.
SIAM J. Sci. Stat. Comp. ,1(2):290–301, 1980.[BV04] S. Boyd and L. Vandenberghe.
Convex optimization . Cambridge university press, 2004.[BW18a] Z.-Z. Bai and W.-T. Wu. On greedy randomized Kaczmarz method for solving large sparselinear systems.
SIAM J. Sci. Comput. , 40(1):A592–A606, 2018.[BW18b] Z.-Z. Bai and W.-T. Wu. On relaxed greedy randomized Kaczmarz methods for solving largesparse linear systems.
Appl. Math. Lett., 83:21–26, 2018. [CEG83] Y. Censor, P. P. B. Eggermont, and D. Gordon. Strong underrelaxation in Kaczmarz's method for inconsistent systems.
Numer. Math. , 41:83–92, 1983.[Cen81] Y. Censor. Row-action methods for huge and sparse systems and their applications.
SIAMRev. , 23:444–466, 1981.[CLZL19] Y. Chi, Y. Li, H. Zhang, and Y. Liang. Median-truncated gradient descent: A robust andscalable nonconvex approach for signal estimation. In
Appl. Numer. Harmon. An. , pages 237–261. Springer, 2019.[CP12] X. Chen and A. Powell. Almost sure convergence of the Kaczmarz algorithm with randommeasurements.
J.Fourier Anal. Appl. , pages 1–20, 2012.[CRTV05] E. Candes, M. Rudelson, T. Tao, and R. Vershynin. Error correction via linear programming.In
FOCS , pages 668–681. IEEE, 2005.[CT05] E. Candes and T. Tao. Decoding by linear programming. arXiv preprint math/0502327 , 2005.[DGBSX12] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online predictionusing mini-batches.
J. Mach. Learn. Res. , 13:165–202, 2012.[DLHN17] J. A. De Loera, J. Haddock, and D. Needell. A sampling Kaczmarz-Motzkin algorithm for linearfeasibility.
SIAM J. Sci. Comput. , 39(5):S66–S87, 2017.[DSS20] K. Du, W. Si, and X. Sun. Pseudoinverse-free randomized extended block Kaczmarz for solvingleast squares. arXiv preprint arXiv:2001.04179 , 2020.[EHL81] P. P. B. Eggermont, G. T. Herman, and A. Lent. Iterative algorithms for large partitioned linearsystems, with applications to image reconstruction.
Linear Algebra Appl. , 40:37–67, 1981.[EK12] Y. C. Eldar and G. Kutyniok.
Compressed sensing: theory and applications . Cambridge Uni-versity Press, 2012.[Elf80] T. Elfving. Block-iterative methods for consistent and inconsistent linear equations.
Numer. Math., 35(1):1–12, 1980. [FCM+92] H. G. Feichtinger, C. Cenker, M. Mayer, H. Steier, and T. Strohmer. New variants of the POCS method using affine subspaces of finite codimension with applications to irregular sampling. In
P. Soc. Photo-Opt. Ins. , volume 1818, pages 299–311. International Society for Optics andPhotonics, 1992.[FR13] S. Foucart and H. Rauhut.
A Mathematical Introduction to Compressive Sensing . Springer,2013. In press.[FS95] H. G. Feichtinger and T. Strohmer. A Kaczmarz-based approach to nonperiodic samplingon unions of rectangular lattices. In
SampTA95: 1995 Workshop on Sampling Theory andApplications , pages 32–37, 1995.[GBH70a] R. Gordon, R. Bender, and G. T. Herman. Algebraic reconstruction techniques (ART) forthree-dimensional electron microscopy and X-ray photography.
J. Theoret. Biol. , 29:471–481,1970.[GBH70b] R. Gordon, R. Bender, and G. T. Herman. Algebraic Reconstruction Techniques (ART) forthree-dimensional electron microscopy and X-ray photography.
J. Theoret. Biol., 29, 1970. [GSN88] J. E. Gentle, V. Sposito, and S. C. Narula. Algorithms for unconstrained $\ell_1$ simple linear regression. J. Comp. Stat. Data Anal., 6(4):335–339, 1988. [Han07] P. C. Hansen. Regularization tools version 4.0 for Matlab 7.3.
Numer. Algorithms, 46(2):189–194, 2007. [HM93] G. T. Herman and L. B. Meyer. Algebraic reconstruction techniques can be made computationally efficient (positron emission tomography application).
IEEE T. Med. Imaging , 12(3):600–609, 1993.[HM19] J. Haddock and A. Ma. Greed works: An improved analysis of Sampling Kaczmarz-Motzkin. arXiv preprint arXiv:1912.03544 , 2019.[HN90] M. Hanke and W. Niethammer. On the acceleration of Kaczmarz’s method for inconsistentlinear systems.
Linear Algebra Appl. , 130:83–98, 1990.[HN18a] J. Haddock and D. Needell. Randomized projection methods for linear systems with arbitrarilylarge sparse corruptions.
SIAM J. Sci. Comput. , 41(5):S19–S36, 2018.[HN18b] J. Haddock and D. Needell. Randomized projections for corrupted linear systems. In
AIP Conf.Proc. , number 1 in 1978, page 470071. AIP Publishing, 2018.[HN19] J. Haddock and D. Needell. On Motzkin’s method for inconsistent linear systems.
BIT ,59(2):387–401, 2019.[HNRS20] J. Haddock, D. Needell, E. Rebrova, and W. Swartworth. Stochastic gradient descent methodsfor corrupted systems of linear equations. In
Proc. Conf. on Information Sciences and Systems ,2020.[JCC15] N. Jamil, X. Chen, and A. Cloninger. Hildreth’s algorithm with applications to soft constraintsfor user interface layout.
J. Comput. Appl. Math., 288:193–202, 2015. [Kac37] S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen.
Bull. Internat. Acad.Polon. Sci. Lettres A , pages 335–357, 1937.[KL20] K. Kawaguchi and H. Lu. Ordered SGD: A new stochastic optimization framework for empiricalrisk minimization. In
Int. Conf. on Artificial Intelligence and Statistics, pages 669–679, 2020. [KS18] A. S. Kržić and D. Seršić. L1 minimization using recursive reduction of dimensionality.
SignalProcess. , 151:119–129, 2018.[LA04] Y. Li and G. R. Arce. A maximum likelihood approach to least absolute deviation regression.
Eurasip. J. Adv. Sig. Pr. , 2004(12):948982, 2004.[LCZL20] Y. Li, Y. Chi, H. Zhang, and Y. Liang. Non-convex low-rank matrix recovery with arbitraryoutliers via median-truncated gradient descent.
Information and Inference: A Journal of the IMA, 9(2):289–325, 2020. [Lic13] M. Lichman. UCI Machine Learning Repository, 2013. [LR19] N. Loizou and P. Richtárik. Revisiting randomized gossip algorithms: General framework, convergence rates and novel block and accelerated protocols. arXiv preprint arXiv:1905.08645, 2019. [MI+20] M. S. Morshed, M. S. Islam, et al. On generalization and acceleration of randomized projection methods for linear feasibility problems. arXiv preprint arXiv:2002.07321, 2020. [MINEA19] M. S. Morshed, M. S. Islam, and M. Noor-E-Alam. Accelerated sampling Kaczmarz Motzkin algorithm for the linear feasibility problem.
J. Global Optim. , pages 1–22, 2019.[MS54] T. S. Motzkin and I. J. Schoenberg. The relaxation method for linear inequalities.
CanadianJ. Math. , 6:393–404, 1954.[Nat86] F. Natterer.
The mathematics of computerized tomography . B. G. Teubner, Stuttgart; JohnWiley & Sons, Ltd., Chichester, 1986.[Nee10] D. Needell. Randomized Kaczmarz solver for noisy linear systems.
BIT, 50(2):395–403, 2010. [NSV+16] J. Nutini, B. Sepehry, A. Virani, I. Laradji, M. Schmidt, and H. Koepke. Convergence rates for greedy Kaczmarz algorithms.
UAI , 2016.[NSW16] D. Needell, N. Srebro, and R. Ward. Stochastic gradient descent, weighted sampling, and therandomized Kaczmarz algorithm.
Math. Program. , 155(1-2):549–573, 2016.[NT14] D. Needell and J. A. Tropp. Paved with good intentions: analysis of a randomized blockKaczmarz method.
Linear Algebra Appl. , 441:199–221, 2014.[Pop99] C. Popa. Block-projections algorithms with blocks containing mutually orthogonal rows andcolumns.
BIT , 39(2):323–338, 1999.[Pop01] C. Popa. A fast Kaczmarz-Kovarik algorithm for consistent least-squares problems.
Korean J.Comp. App. Math. , 8(1):9–26, 2001.[PP15] S. Petra and C. Popa. Single projection Kaczmarz extended algorithms.
Numer. Algorithms ,pages 1–16, 2015.[RM51] H. Robbins and S. Monro. A stochastic approximation method.
Ann. Math. Stat. , pages 400–407, 1951.[RN20] E. Rebrova and D. Needell. On block Gaussian sketching for the Kaczmarz method.
Numer.Algorithms , pages 1–31, 2020.[RV15] M. Rudelson and R. Vershynin. Small ball probabilities for linear images of high-dimensionaldistributions.
Int. Math. Res. Notices , 2015(19):9594–9617, 2015.[Sch73] E. Schlossmacher. An iterative technique for absolute deviations curve fitting.
J. Am. Stat.Assoc. , 68(344):857–859, 1973.[SHS01] A. Savvides, C.-C. Han, and M. B. Strivastava. Dynamic fine-grained localization in ad-hocnetworks of sensors. In
Proc. Int. Conf. on Mobile computing and networking , pages 166–179,2001.[SS87] M. I. Sezan and H. Stark. Incorporation of a priori moment information into signal recoveryand synthesis problems.
J. Math. Anal. Apl. , 122(1):172–186, 1987.[SV09] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential conver-gence.
J Fourier Anal. Appl. , 15(2):262, 2009.[SZ13] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergenceresults and optimal averaging schemes. In
Int. conf. on machine learning , pages 71–79, 2013.[Ver18] R. Vershynin.
High-dimensional probability: An introduction with applications in data science ,volume 47. Cambridge university press, 2018.[Wes81] G. Wesolowsky. A new descent algorithm for the least absolute value regression problem: Anew descent algorithm for the least absolute value.
Commun. Stat. Simulat. , 10(5):479–491,1981.[WGZ06] L. Wang, M. D. Gordon, and J. Zhu. Regularized least absolute deviations regression and anefficient algorithm for parameter tuning. In
Int. Conf. on Data Mining, pages 690–700. IEEE, 2006. [WM10] J. Wright and Y. Ma. Dense error correction via $\ell_1$-minimization. IEEE T. Infor. Theory, 56(7):3540–3560, 2010. [ZF13] A. Zouzias and N. M. Freris. Randomized extended Kaczmarz for solving least squares. SIAM J. Matrix Anal. Appl., 34(2):773–793, 2013.