Stochastic Newton and Quasi-Newton Methods for Large Linear Least-squares Problems
Julianne Chung*, Matthias Chung†, J. Tanner Slagel‡, Luis Tenorio§

February 27, 2017

* Department of Mathematics and Computational Modeling and Data Analytics Division, Academy of Integrated Science, Virginia Tech, Blacksburg, VA ([email protected]).
† Department of Mathematics and Computational Modeling and Data Analytics Division, Academy of Integrated Science, Virginia Tech, Blacksburg, VA ([email protected]).
‡ Department of Mathematics, Virginia Tech, Blacksburg, VA ([email protected]).
§ Department of Applied Mathematics and Statistics, Colorado School of Mines, Golden, CO ([email protected], http://inside.mines.edu/ltenorio/).
Abstract
We describe stochastic Newton and stochastic quasi-Newton approaches to efficiently solve large linear least-squares problems where the very large data sets present a significant computational burden (e.g., the size may exceed computer memory, or the data are collected in real time). In our proposed framework, stochasticity is introduced in two different frameworks as a means to overcome these computational limitations, and probability distributions that can exploit structure and/or sparsity are considered. Theoretical results on consistency of the approximations for both the stochastic Newton and the stochastic quasi-Newton methods are provided. The results show, in particular, that stochastic Newton iterates, in contrast to stochastic quasi-Newton iterates, may not converge to the desired least-squares solution. Numerical examples, including an example from extreme learning machines, demonstrate the potential applications of these methods.
Keywords: stochastic approximation, stochastic Newton, stochastic quasi-Newton, least-squares, extreme learning machine.
1 Introduction

In this paper, we are interested in linear problems of the form

$$b = Ax_{\mathrm{true}} + \epsilon, \qquad (1)$$

where $b \in \mathbb{R}^m$ contains the observed data, $x_{\mathrm{true}} \in \mathbb{R}^n$ is the desired solution, $A \in \mathbb{R}^{m\times n}$, $m \gg n$, has full column rank, and $\epsilon$ represents noise modeled as random. Here, we assume that $\epsilon$ has zero mean and covariance matrix $\sigma^2 I_m$, where $I_m$ is the $m\times m$ identity matrix. The goal is to compute the unique least-squares (LS) solution,

$$\widehat{x} = \arg\min_{x\in\mathbb{R}^n} f(x) = \tfrac12\|Ax - b\|^2, \qquad (2)$$

where $\|\cdot\|$ denotes the two-norm. Typical LS solvers such as iterative methods for large, sparse problems may require many multiplications with $A$ and its transpose [33, 17]. This can be prohibitively expensive for many problems of interest. We overcome these limitations by first reformulating the LS problem as a stochastic optimization problem as follows. Let $W \in \mathbb{R}^{m\times\ell}$, $m > \ell$, be a random matrix with $\mathrm{E}(WW^\top) = \beta I_m$, where $\beta > 0$ and $\mathrm{E}$ denotes the expectation operator. Then

$$\mathrm{E}\,\|W^\top(Ax - b)\|^2 = (Ax - b)^\top\,\mathrm{E}(WW^\top)\,(Ax - b) = \beta\,\|Ax - b\|^2.$$

Hence the minimizer of (2) coincides with the solution of the following stochastic optimization problem,

$$\min_x \mathrm{E}\,f_W(x), \quad\text{where}\quad f_W(x) = \tfrac12\|W^\top(Ax - b)\|^2, \qquad (3)$$

since $\mathrm{E}\,f_W(x) = \beta f(x)$, and the goal of this work is to solve (3).
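As a quick sanity check of this reformulation, the identity $\mathrm{E}\,\|W^\top r\|^2 = \beta\|r\|^2$ can be verified by Monte Carlo sampling. The following sketch is ours (not from the paper) and uses Rademacher matrices $W$ with i.i.d. $\pm 1$ entries, for which $\mathrm{E}(WW^\top) = \ell I_m$, i.e., $\beta = \ell$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, ell = 200, 10
r = rng.standard_normal(m)  # plays the role of the residual Ax - b

# Rademacher sampling: i.i.d. +/-1 entries, so E(W W^T) = ell * I_m
vals = []
for _ in range(20000):
    W = rng.choice([-1.0, 1.0], size=(m, ell))
    vals.append(np.linalg.norm(W.T @ r) ** 2)

print(np.mean(vals))                 # Monte Carlo estimate of E ||W^T r||^2
print(ell * np.linalg.norm(r) ** 2)  # beta * ||r||^2 with beta = ell
```

The two printed values agree to within sampling error, which is exactly the identity used above.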
Two approaches have been used in the literature to approximate (3): Sample Average Approximation (SAA) and Stochastic Approximation (SA) [37, 39]. In SAA methods, the goal is to solve a sampled LS problem, where one major challenge is to find a good sampling matrix $W$ (e.g., using randomization techniques). On the other hand, SA is an iterative approach where different realizations of the sampling matrix $W$ are used at each iteration to update the approximate solution. In this work we consider only the SA approach.

Our main contributions are summarized as follows. We reformulate the problem as a stochastic optimization problem, as described above, and develop stochastic Newton and stochastic quasi-Newton methods for solving large LS problems. By introducing stochasticity and allowing sparse realizations of $W$, we can handle large problems where only pieces of the data are readily available or in memory at a given time. We prove almost sure convergence of the stochastic quasi-Newton estimators to the desired LS solution. Furthermore, we show that for systems that are inconsistent but whose coefficient matrix has full column rank, stochastic Newton iterates may not converge to the unique LS solution $\widehat{x}$, but instead converge to the solution of a different (not necessarily nearby) problem (see Section 3). By exploiting updates for the inverse Hessian approximation, our stochastic quasi-Newton method can produce good approximations and be computationally more efficient than standard LS methods, especially for very large problems.

A similar stochastic reformulation was described in [27, 9, 43], but a fundamental difference with our work is that these approaches use an SAA, rather than an SA, method to approximate $\widehat{x}$. SA methods have been intensively studied, especially in the last few years. Although much of the literature has focused on general objective functions, general convergence theories, and gradient-based methods [26, 4, 3, 31, 41], only recently has the focus shifted to higher-order methods (e.g., stochastic Newton) and efficient update methods that can incorporate curvature information (e.g., stochastic quasi-Newton) [5, 44]. Repeated sampling in a higher-order method was considered in [36, 16]. However, [16] does not consider sparse sampling, and the algorithm described in [36] requires the full gradient of the objective function, which we do not require here, and uses only curvature information from the current sample, whereas our stochastic quasi-Newton approach can integrate all previous samples in an efficient update scheme. Gower and Richtárik proved in [14] that for consistent linear systems (e.g., (1) with $\epsilon = 0$), the stochastic Newton method with unit step size converges to the desired solution. However, they do not consider LS problems. Furthermore, our work is different from recent works on stochastic quasi-Newton methods that focus on computing matrix inverses [15] and solving empirical risk minimization problems [13].

An outline for this paper is as follows. In Section 2, we describe the stochastic Newton and stochastic quasi-Newton methods for solving LS problems, whose consistency results are presented and discussed in Section 3. Numerical results in Section 4 demonstrate the potential of our approaches, and conclusions are provided in Section 5. Consistency proofs are provided in Section 6.

2 Stochastic approximation methods

In this section, we describe SA methods for computing a solution to (3).
Given an initial vector $x_0 \in \mathbb{R}^n$, SA methods define a sequence of iterates

$$x_k = x_{k-1} + \alpha_k s_k, \qquad (4)$$

where $\{\alpha_k\}$ is a sequence of step sizes and $\{s_k\}$ is a sequence of search directions such that $s_k$ depends on the iterate $x_{k-1}$ and the random variables $W_1, \ldots, W_k$. The most common SA approach is the stochastic gradient method, where $s_k = -\nabla f_{W_k}(x_{k-1})$ is the sample gradient,

$$\nabla f_W(x) = (W^\top A)^\top(W^\top A x - W^\top b) = A^\top W W^\top(Ax - b), \qquad (5)$$

evaluated at $x_{k-1}$. The popularity of the stochastic gradient method stems from its proven consistency properties and its easy implementation. However, the stochastic gradient method is known to converge slowly [44]; thus higher-order methods are desired and discussed below.

For the stochastic Newton method, the search direction is typically defined as

$$s_k = -\left(\nabla^2 f_{W_k}\right)^\dagger \nabla f_{W_k}(x_{k-1}), \qquad (6)$$

where the sample Hessian is given by $\nabla^2 f_W = A^\top W W^\top A$, and $\dagger$ denotes the Moore-Penrose pseudoinverse. Using properties of the pseudoinverse, (6) can be reduced to

$$s_k = -(W_k^\top A)^\dagger W_k^\top(A x_{k-1} - b). \qquad (7)$$

For the stochastic Newton method, we use a step size $\alpha_k$ that satisfies the following frequently used conditions:

$$\sum_{k=1}^\infty \alpha_k = \infty \quad\text{and}\quad \sum_{k=1}^\infty \alpha_k^2 < \infty. \qquad (8)$$

For the stochastic quasi-Newton method, we use the search direction given by

$$s_k = -B_k \nabla f_{W_k}(x_{k-1}), \qquad (9)$$

where $\{B_k\}$ is a sequence of random symmetric positive definite (SPD) matrices. For the convergence results proved in Section 3, we require that $\lambda_{\max}(B_k) \le B$ almost surely (a.s.) for some constant $B > 0$ and all $k$, with

$$\sum_{k=1}^\infty \alpha_k \lambda_{\min}(B_k) = \infty \ \text{a.s.} \quad\text{and}\quad \sum_{k=1}^\infty \alpha_k^2 < \infty, \qquad (10)$$

where $\lambda_{\min}(B_k)$ and $\lambda_{\max}(B_k)$ denote the minimal and maximal eigenvalues of $B_k$, respectively. Note that the special case $B_k = I_n$ is nothing more than the stochastic gradient method. The sequence $\{B_k\}$ is assumed to be independent of $W_k$, but may depend on $W_1, \ldots, W_{k-1}$.

These two algorithms provide different ways to estimate $\widehat{x}$. The choice of $\{\alpha_k\}$ and the distribution of $W$ fully determine the stochastic Newton method, and these along with the choice of $\{B_k\}$ fully determine the stochastic quasi-Newton method. Below we briefly summarize choices for $\{\alpha_k\}$, the distribution of $W$, and the sequence $\{B_k\}$, with particular attention on the conditions required to prove consistency of the approximations.

Choice of $\{\alpha_k\}$. Selecting a good step size $\alpha_k$ (or learning rate, as it is referred to in machine learning) is critical. A variety of methods have been proposed to improve convergence rates; see for instance [4, 40, 8]. In this paper we restrict ourselves to step size choices that comply with conditions (8) and (10) for stochastic Newton and stochastic quasi-Newton, respectively. In our numerical experiments in Section 4, we use the harmonic sequence $\alpha_k = 1/k$; see [38].
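To make the iteration (4) concrete, the following sketch (ours; the sampler, problem sizes, and iteration count are illustrative choices, not the authors' implementation) runs the generic SA loop with either the stochastic gradient direction (5) or the stochastic Newton direction (7), using the harmonic step size $\alpha_k = 1/k$ and Rademacher samples $W_k$:

```python
import numpy as np

def sa_solve(A, b, direction="gradient", ell=10, n_iter=500, seed=0):
    """Generic SA iteration x_k = x_{k-1} + alpha_k s_k for the LS problem."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for k in range(1, n_iter + 1):
        W = rng.choice([-1.0, 1.0], size=(m, ell))  # E(W W^T) = ell * I_m
        WtA = W.T @ A                               # sampled matrix W_k^T A
        Wtr = W.T @ (A @ x - b)                     # sampled residual W_k^T (A x - b)
        if direction == "gradient":                 # s_k from (5)
            s = -WtA.T @ Wtr
        else:                                       # stochastic Newton, s_k from (7)
            s = -np.linalg.lstsq(WtA, Wtr, rcond=None)[0]
        x += (1.0 / k) * s                          # harmonic step size alpha_k = 1/k
    return x
```

For example, `sa_solve(A, b, direction="newton")` runs stochastic Newton; the pseudoinverse in (7) is applied through a least-squares solve rather than formed explicitly.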
Distributional choice for $W$. As discussed in the introduction, the solutions to problems (2) and (3) are equivalent if $\mathrm{E}(WW^\top) = \beta I_m$. Extensions to include a general (known) positive definite covariance matrix $\Gamma$ would require adjusting the sampling matrix such that $\mathrm{E}(WW^\top) = \Gamma^{-1}$. Including a positive diagonal covariance matrix would simply require scaling the rows of $A$ (i.e., solving a weighted LS problem). However, for simplicity of presentation, we consider $\Gamma = \sigma^2 I_m$. Three distributional choices for $W$ are considered (a sampling sketch for all three follows the list):

1. Random sparse matrices.
Let $W \in \mathbb{R}^{m\times\ell}$ be a random matrix with i.i.d. random elements $w_{ij}$ where, for a fixed $0 < \psi \le 1$, $w_{ij}$ takes the values $\pm\sqrt{\beta/(\ell\psi)}$ each with probability $\psi/2$, and the value $0$ with probability $1 - \psi$. It is straightforward to verify that $\mathrm{E}(WW^\top) = \beta I_m$. Notice that as $\psi$ gets closer to zero, more sparsity is introduced in $W$. It is worth mentioning that this choice of $W$ is a generalization of the Achlioptas random matrix ($\psi = 1/3$, $\beta = \ell$) and the Rademacher distribution ($\psi = 1$ and $\beta = \ell$); see [1, 16].

2. Generalized Kaczmarz matrices.
For $i = 1, \ldots, p$, let $Q_i \in \mathbb{R}^{m\times\ell_i}$ be such that $Q = [Q_1, \ldots, Q_p] \in \mathbb{R}^{m\times m}$ is an orthogonal matrix. Define the distribution of $W$ to be uniform on $\{Q_1, \ldots, Q_p\}$. Then

$$\mathrm{E}(WW^\top) = \frac{1}{p}\sum_{i=1}^p Q_i Q_i^\top = \frac{1}{p} I_m.$$

Notice that selecting $Q = I_m$, or any permutation of it, together with the stochastic Newton direction (6) leads to the well-known randomized Kaczmarz method if $\ell_i = 1$ for all $i$, and to the randomized block Kaczmarz method otherwise [25, 32, 14]. Choosing the elements of $Q$ in $\{0, \pm 1\}$ (up to normalization) leads to (sparse) randomized Hadamard matrices [18, 6]. Notice that sparsity may be introduced by the particular choice of $Q$ and that the number of columns in the $Q_i$'s can differ.

3. Sparse Rademacher matrices.
Fix $p \le m$. The columns $w_i$ of $W \in \mathbb{R}^{m\times\ell}$ are i.i.d., and each column has exactly $p$ nonzero entries, placed uniformly at random, each taking values in $\{\pm 1\}$ with equal probability. Hence, conditional on a configuration $C$ of $p$ ones and $m - p$ zeros, each column $w_i$ has conditional expectation

$$\mathrm{E}(w_i w_i^\top \mid C) = I_{m,C},$$

where $I_{m,C}$ is the diagonal matrix with the configuration $C$ on the diagonal. It follows that

$$\mathrm{E}(w_i w_i^\top) = \mathrm{E}\,\mathrm{E}(w_i w_i^\top \mid C) = \binom{m-1}{p-1}\binom{m}{p}^{-1} I_m = \frac{p}{m} I_m,$$

and therefore $\mathrm{E}(WW^\top) = (\ell p/m) I_m$. Note that the case $p = m$ generates full Rademacher matrices. The distinction from the other choices of $W$ is that the entries of sparse Rademacher matrices are not i.i.d., as in the random sparse matrices, and do not necessarily come from partitions of orthogonal matrices, as in the generalized Kaczmarz matrices.
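For illustration, the following sketch (ours; function and parameter names are hypothetical) draws one realization of $W$ from each of the three families. By the computations above, $\mathrm{E}(WW^\top) = \beta I_m$ with $\beta$ equal to the `beta` argument for the first sampler, $\ell/m = 1/p$ for the identity-column Kaczmarz sampler, and $\ell p/m$ for the third:

```python
import numpy as np

def sparse_random(rng, m, ell, psi, beta):
    """Entries are 0 w.p. 1-psi and +/- sqrt(beta/(ell*psi)) w.p. psi/2 each."""
    signs = rng.choice([-1.0, 1.0], size=(m, ell))
    mask = rng.random((m, ell)) < psi
    return np.sqrt(beta / (ell * psi)) * signs * mask

def block_kaczmarz(rng, m, ell):
    """Generalized Kaczmarz with Q = I_m: W holds ell randomly chosen columns
    of the identity (the column-sampling 'Kaczmarz' variant; a fixed partition
    of Q into p = m/ell blocks gives the block Kaczmarz variant)."""
    cols = rng.choice(m, size=ell, replace=False)
    return np.eye(m)[:, cols]

def sparse_rademacher(rng, m, ell, p):
    """Each column has exactly p nonzeros, +/-1 with equal probability."""
    W = np.zeros((m, ell))
    for j in range(ell):
        rows = rng.choice(m, size=p, replace=False)
        W[rows, j] = rng.choice([-1.0, 1.0], size=p)
    return W

rng = np.random.default_rng(1)
W = sparse_rademacher(rng, m=100, ell=5, p=10)  # E(W W^T) = (ell*p/m) I_m
```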
Choice of $\{B_k\}$. The matrices $B_k$ should approximate the inverse Hessian $(A^\top A)^{-1}$ and should be SPD. Furthermore, the convergence results require some control of the smallest and largest eigenvalues of $B_k$. Thus, we define a sequence of SPD matrices $B_k$ that converges a.s. to a small perturbation of $(A^\top A)^{-1}$ (up to scaling) and has appropriate behavior of the smallest and largest eigenvalues. Since for any $\lambda_1 > 0$,

$$\frac{\lambda_1}{k} I_n + \frac{1}{k}\sum_{i=1}^k A^\top W_i W_i^\top A \;\overset{\mathrm{a.s.}}{\longrightarrow}\; \beta A^\top A,$$

it follows that for any $\lambda_0 \ge 0$,

$$\widetilde{B}_k = \lambda_0 I_n + k\left(\lambda_1 I_n + \sum_{i=1}^k A^\top W_i W_i^\top A\right)^{-1} \;\overset{\mathrm{a.s.}}{\longrightarrow}\; \lambda_0 I_n + \frac{1}{\beta}(A^\top A)^{-1},$$

and $\lambda_{\max}(\widetilde{B}_k) \overset{\mathrm{a.s.}}{\longrightarrow} \lambda_0 + \lambda_{\max}((A^\top A)^{-1})/\beta$. Hence, for a fixed $B > \lambda_0 + \lambda_{\max}((A^\top A)^{-1})/\beta$, we can define

$$B_k = \begin{cases} \widetilde{B}_k, & \text{if } \lambda_{\max}(\widetilde{B}_k) \le B, \\ B_{k-1}, & \text{else}, \end{cases} \qquad (11)$$

with $B_0 = \widetilde{B}_0$ an arbitrary SPD matrix such that $\lambda_{\max}(B_0) \le B$. The Woodbury formula can be used to efficiently update $B_k$; see Section 4 for details. Notice that $\widetilde{B}_k$ is a sample approximation of $(A^\top A)^{-1}$, which is similar to sample approximations that have been studied for covariance/precision matrix estimation, and that the regularization term $\lambda_1 I_n$ can be replaced by other appropriate SPD matrices [10].

3 Consistency results

In this section we study the consistency of the stochastic quasi-Newton and stochastic Newton methods. We start by briefly reviewing some notation and definitions from probability theory. Let $(\Omega, \mathcal{A}, P)$ be a probability space, and let $\{\mathcal{F}_k\}$ be a sequence of sub-$\sigma$-algebras of $\mathcal{A}$. Then $\{\mathcal{F}_k\}$ is called a filtration if $\mathcal{F}_k \subset \mathcal{F}_{k+1}$ for all $k \in \mathbb{N}$. A sequence of random variables $\{u_k\}$ on $(\Omega, \mathcal{A})$ is said to be adapted to $\{\mathcal{F}_k\}$ if $u_k$ is $\mathcal{F}_k$-measurable for all $k \in \mathbb{N}$. The $\sigma$-algebra generated by the random variables $u_i$, $i < k$, is denoted by $\sigma(u_i : i < k)$. We use $I_A$ to denote the indicator function of the set $A$. Our convergence results make use of the following quasimartingale convergence theorem. For a proof and further details on quasimartingales, we refer the interested reader to [11, 30].

Theorem 3.1 (Quasimartingale convergence theorem). Let $\{X_k\}$ be a sequence of non-negative random variables adapted to a filtration $\{\mathcal{F}_k\}$ and such that $\mathrm{E}\,X_k < \infty$ for all $k$. Let $Z_k = \mathrm{E}(X_{k+1} - X_k \mid \mathcal{F}_k)$. If

$$\sum_{k=0}^\infty \mathrm{E}\left[\,I_{Z_k \ge 0}\,(X_{k+1} - X_k)\,\right] < \infty,$$

then there is an integrable, non-negative random variable $X$ such that $X_k \overset{\mathrm{a.s.}}{\longrightarrow} X$.

Theorem 3.2 (Stochastic quasi-Newton method). Let $A \in \mathbb{R}^{m\times n}$ have rank $n$ and $b \in \mathbb{R}^m$. Let $\{W_k\}$ be a sequence of i.i.d. $m\times\ell$ random matrices with $\mathrm{E}(W_k W_k^\top) = \beta I_m$ and $\mathrm{E}(\|W_k W_k^\top\|^2) < \infty$. Define $\mathcal{F}_k = \sigma(W_i;\ i < k)$. Let $\{\alpha_k\}$ be a positive sequence of scalars, and let $\{B_k\}$ be a sequence of random $n\times n$ SPD matrices adapted to $\{\mathcal{F}_k\}$ such that for some $B > 0$,

$$\|B_k\| = \lambda_{\max}(B_k) \le B \ \text{a.s.} \ \forall k, \quad\text{and}\quad \sum_{k=1}^\infty \alpha_k \lambda_{\min}(B_k) = \infty \ \text{a.s.} \quad\text{and}\quad \sum_{k=1}^\infty \alpha_k^2 < \infty.$$

Set $\widehat{x} = (A^\top A)^{-1}A^\top b$, $\widehat{b} = A\widehat{x}$, and let $x_0 \in \mathbb{R}^n$ be an arbitrary initial vector. Define

$$x_k = x_{k-1} + \alpha_k s_k, \quad s_k = -B_k A^\top W_k W_k^\top (A x_{k-1} - b), \quad\text{and}\quad b_k = A x_{k-1}.$$

Then:
(i) The matrix $C = \mathrm{E}(W_k W_k^\top W_k W_k^\top)$ is defined and SPD.
(ii) If $e_k = \|b_k - \widehat{b}\|^2$, then $\mathrm{E}\,e_k < \infty$ and $\mathrm{E}(\|s_k\|^2) < \infty$ for all $k$.
(iii) $x_k \overset{\mathrm{a.s.}}{\longrightarrow} \widehat{x}$.

For a proof see Section 6.1. The following result shows that the stochastic Newton method does not necessarily converge to the LS solution $\widehat{x}$. For a proof see Section 6.2 in the appendix.

Theorem 3.3 (Stochastic Newton method). Let $A \in \mathbb{R}^{m\times n}$ have rank $n$ and $b \in \mathbb{R}^m$. Let $\{W_k\}$ be a sequence of i.i.d. $m\times\ell$ random matrices with $\mathrm{E}(W_k W_k^\top) = \beta I_m$. Let $H_k = (W_k^\top A)^\dagger W_k^\top$ and assume that $C = \mathrm{E}(H_k^\top H_k)$ is defined and finite. Let $\{\alpha_k\}$ be a positive sequence of scalars such that

$$\sum_{k=1}^\infty \alpha_k = \infty \quad\text{and}\quad \sum_{k=1}^\infty \alpha_k^2 < \infty.$$

Set $P = \mathrm{E}\,H_k$, $\widetilde{x} = (PA)^{-1}Pb$, and let $x_0 \in \mathbb{R}^n$ be an arbitrary initial vector. Define

$$x_k = x_{k-1} + \alpha_k s_k, \qquad s_k = -H_k(A x_{k-1} - b).$$

Then:
(i) The matrices $C$ and $ACA^\top$ are symmetric positive semi-definite, and $PA$ is SPD.
(ii) If $e_k = \|x_k - \widetilde{x}\|^2$, then $\mathrm{E}\,e_k < \infty$ and $\mathrm{E}(\|s_k\|^2) < \infty$ for all $k$.
(iii) $x_k \overset{\mathrm{a.s.}}{\longrightarrow} \widetilde{x}$.

Although the number of iterations cannot be too large in practice, the consistency results above for stochastic Newton and stochastic quasi-Newton are important for analyzing these new methods. In the rest of this section, we consider the feasibility of the convergence criteria for both methods and then provide some insight into the discrepancy between the stochastic Newton estimator $\widetilde{x}$ and the desired LS solution.

In addition to $\mathrm{E}(WW^\top) = \beta I_m$, Theorem 3.2 requires $\mathrm{E}(\|W_k W_k^\top\|^2)$ to be finite, while Theorem 3.3 requires $P$ to be finite. However, these additional assumptions are not too restrictive. In particular, for all of the choices of $W$ described in Section 2, these expectations are trivially finite because the entries of $W$ take only finitely many values. As for the choice of $B_k$ in the stochastic quasi-Newton method, the assumption that $B_k$ and $\alpha_k$ satisfy (10) is not difficult to meet. For example, when $\lambda_0 > 0$, the choice of $B_k$ in Section 2 satisfies these conditions trivially because $\lambda_0 \le \lambda_{\min}(B_k) \le \lambda_{\max}(B_k) \le B$.
On the other hand, even when $\lambda_0 = 0$, the largest singular value of $W$ is bounded for the choices of $W$ in this paper, and therefore there is a $C_0 > 0$ such that $\|A^\top W_i W_i^\top A\| \le C_0$ for all $i$ and

$$\left\|\lambda_1 I_n + \sum_{i=1}^k A^\top W_i W_i^\top A\right\| \le \lambda_1 + kC_0,$$

which implies that when $\lambda_0 = 0$ we have

$$\lambda_{\min}(B_k) \ge \frac{1}{\lambda_1/k + C_0} \ge \frac{1}{\lambda_1 + C_0} > 0,$$

so that (10) is again satisfied with $\alpha_k = 1/k$.

Finally, we examine the difference between the LS solution $\widehat{x}$, to which the stochastic quasi-Newton method converges, and the limit of the stochastic Newton iterates, namely,

$$\widetilde{x} = (PA)^{-1}Pb = \left(\mathrm{E}\!\left[(W^\top A)^\dagger W^\top A\right]\right)^{-1} \mathrm{E}\!\left[(W^\top A)^\dagger W^\top\right] b. \qquad (12)$$

The difference between $\widehat{x}$ and $\widetilde{x}$ depends on $P$, but we can say the following. Assuming that the noise has zero mean and covariance matrix $\mathrm{Var}(\epsilon) = \sigma^2 I_m$, we have

$$\mathrm{E}\,\widehat{x} = x_{\mathrm{true}}, \quad \mathrm{Var}(\widehat{x}) = \sigma^2 (A^\top A)^{-1}, \qquad \mathrm{E}\,\widetilde{x} = x_{\mathrm{true}}, \quad \mathrm{Var}(\widetilde{x}) = \sigma^2 (PA)^{-1} P P^\top (A^\top P^\top)^{-1}.$$

This shows that $\widehat{x}$ and $\widetilde{x}$ are both unbiased estimators of $x_{\mathrm{true}}$, but by the Gauss-Markov theorem, $\widehat{x}$ is expected to have smaller variance. Consider the following simple example.
Example. Consider the LS problem where

$$A = \begin{bmatrix} \mu & 0 \\ 0 & 1 \\ 1 & -1 \end{bmatrix} \quad\text{and}\quad b = \begin{bmatrix} \nu \\ 0 \\ 1 \end{bmatrix},$$

for some fixed $\mu, \nu \in \mathbb{R}$ with $\mu \neq 0$. We compare the LS solution and the solution obtained via stochastic Newton with random Kaczmarz vectors $w \in \mathbb{R}^{m\times 1}$ (see Section 2). It is easy to see that in this case we have $P = \frac{1}{3}A^\top H$ with $H = \mathrm{diag}\{1/\|a_1\|^2,\ 1/\|a_2\|^2,\ 1/\|a_3\|^2\}$, where the $a_i$ are the rows of $A$. It follows that $\widetilde{x}$ minimizes the weighted LS functional $(b - Ax)^\top H (b - Ax)$. We obtain the following solutions:

$$\widehat{x} = \frac{1}{2\mu^2+1}\begin{bmatrix} 2\mu\nu + 1 \\ \mu\nu - \mu^2 \end{bmatrix} \quad\text{and}\quad \widetilde{x} = \frac{1}{4}\begin{bmatrix} 3\nu/\mu + 1 \\ \nu/\mu - 1 \end{bmatrix},$$

respectively. The covariance matrices of $\widehat{x}$ and $\widetilde{x}$ are

$$\mathrm{Var}(\widehat{x}) = \frac{\sigma^2}{2\mu^2+1}\begin{bmatrix} 2 & 1 \\ 1 & \mu^2+1 \end{bmatrix}, \qquad \mathrm{Var}(\widetilde{x}) = \frac{\sigma^2}{16\mu^2}\begin{bmatrix} 2\mu^2+9 & 2\mu^2+3 \\ 2\mu^2+3 & 10\mu^2+1 \end{bmatrix}.$$

It is clear that the variances of the components of $\widetilde{x}$ can be much larger than those of $\widehat{x}$. The solution $\widetilde{x}$ would have smaller variance if the covariance matrix of the noise were proportional to $H^{-1}$ instead of $I_m$.
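The closed forms above are easy to check numerically. The sketch below (ours) recomputes $\widehat{x}$ as the ordinary LS solution and $\widetilde{x}$ as the $H$-weighted LS solution for sample values of $\mu$ and $\nu$, and compares them with the stated expressions:

```python
import numpy as np

mu, nu = 2.0, 3.0
A = np.array([[mu, 0.0], [0.0, 1.0], [1.0, -1.0]])
b = np.array([nu, 0.0, 1.0])

# Ordinary LS solution x_hat
x_hat = np.linalg.lstsq(A, b, rcond=None)[0]

# Weighted LS solution x_tilde, with H = diag(1/||a_i||^2) built from the rows of A
H = np.diag(1.0 / np.sum(A**2, axis=1))
x_tilde = np.linalg.solve(A.T @ H @ A, A.T @ H @ b)

# Closed-form expressions from the example
x_hat_cf = np.array([2*mu*nu + 1, mu*nu - mu**2]) / (2*mu**2 + 1)
x_tilde_cf = np.array([3*nu/mu + 1, nu/mu - 1]) / 4

print(np.allclose(x_hat, x_hat_cf), np.allclose(x_tilde, x_tilde_cf))
print("omega =", np.linalg.norm(x_hat - x_tilde))
```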
Figure 1 shows the error $\omega(\mu,\nu) = \|\widehat{x} - \widetilde{x}\|$ for various choices of $\mu$ and $\nu$. The left panel shows that $\omega \to \infty$ as $\mu \to 0$, which makes sense as the first row of $A$ becomes all zeros. The right panel shows that even for $\mu \neq 0$, a significant error can be incurred by varying $\nu$, and therefore the "observation vector" $b$.

Figure 1: Error $\omega(\mu,\nu)$ of the stochastic Newton solution $\widetilde{x}$ compared to $\widehat{x}$. In the plot on the left, $\nu = 10$ and we vary $\mu$; notice that a pole exists at $\mu = 0$, where the relative error becomes arbitrarily large. The plot on the right illustrates the impact of varying $\nu$ for fixed $\mu = 1$.

Although the difference between $\widetilde{x}$ and $\widehat{x}$ can be significant, there are cases where they are identical. Some previous works have studied the problem of how close $\widetilde{x}$ is to $\widehat{x}$, e.g., [43, 9]. However, their assumptions do not apply to our matrix $P$. For our problem, $\widetilde{x} = \widehat{x}$ when the linear system is consistent, since in this case $PA\widehat{x} = Pb$, or when $(A^\top A)^{-1}A^\top = (PA)^{-1}P$, which is equivalent to $\mathrm{null}(A^\top) \subseteq \mathrm{null}(P)$, since $A\widehat{x} - b \in \mathrm{null}(A^\top)$ implies that $P(A\widehat{x} - b) = 0$. In the example above, this occurs when the system is consistent, i.e., when $\nu = \mu$.

4 Numerical results

In this section we showcase the numerical performance of the discussed methods. A general framework for our proposed methods is summarized in Algorithm 1, where the main distinction among the stochastic gradient, Newton, and quasi-Newton methods is the choice of $B_k$ (see line 5). For the stochastic gradient method, $B_k = I_n$, and for the stochastic Newton method, $B_k = (A^\top W_k W_k^\top A)^\dagger$. For the stochastic quasi-Newton method, we use the choices of $B_k$ described in Section 2 with $\lambda_1 > 0$ and $\lambda_0 = 0$. To avoid large matrix inversions when computing $B_k$, we use the Woodbury formula to iteratively update this choice of $B_k$ as

$$B_k = \frac{k}{k-1}\,B_{k-1}\left(I_n - A^\top W_k\left((k-1)I_\ell + W_k^\top A B_{k-1} A^\top W_k\right)^{-1} W_k^\top A B_{k-1}\right), \qquad (13)$$

and for $k = 1$,

$$B_1 = \frac{1}{\lambda_1}\left(I_n - A^\top W_1\left(\lambda_1 I_\ell + W_1^\top A A^\top W_1\right)^{-1} W_1^\top A\right). \qquad (14)$$

Update formulas (13) and (14) require the inversion of an $\ell\times\ell$ matrix, but $\ell$, the number of columns of $W$, is assumed small.
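A direct transcription of (13) and (14) is straightforward; the sketch below (ours, with illustrative names) updates $B_k$ from the sampled matrix $W_k^\top A$, inverting only an $\ell\times\ell$ system per iteration:

```python
import numpy as np

def update_B(B_prev, WtA, k, lam1):
    """One Woodbury update of the inverse Hessian approximation, eqs. (13)-(14).

    B_prev : B_{k-1} (n x n), ignored when k == 1
    WtA    : the ell x n matrix W_k^T A
    """
    ell, n = WtA.shape
    if k == 1:
        # B_1 = (1/lam1) (I - A^T W (lam1 I + W^T A A^T W)^{-1} W^T A), eq. (14)
        S = lam1 * np.eye(ell) + WtA @ WtA.T
        return (np.eye(n) - WtA.T @ np.linalg.solve(S, WtA)) / lam1
    # Eq. (13): only the ell x ell matrix S is inverted
    BAtW = B_prev @ WtA.T                    # B_{k-1} A^T W_k  (n x ell)
    S = (k - 1) * np.eye(ell) + WtA @ BAtW   # (k-1) I + W^T A B_{k-1} A^T W
    B = B_prev - BAtW @ np.linalg.solve(S, WtA @ B_prev)
    return (k / (k - 1)) * B
```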
Algorithm 1: Stochastic approximation method
1:  choose initial $x_0$, $\lambda_1 > 0$, a suitable sequence $\{\alpha_k\}$, and set $k = 1$
2:  while not converged do
3:    choose realization $W_k$
4:    compute $W_k^\top A$ and $W_k^\top b$
5:    update $B_k$
6:    get search direction $s_k = -B_k \nabla f_{W_k}(x_{k-1})$
7:    update $x_k = x_{k-1} + \alpha_k s_k$
8:    set $k = k+1$
9:  end while
10: return $\widehat{x} = x_k$

Following most stopping criteria developed for stochastic optimization and stochastic learning methods [40, 5, 39], we rely on heuristic monitoring of $x_k$ and $f_k = f_{W_k}(x_k)$ to determine early stopping for our methods. We consider three stopping criteria: a given maximum number of iterations, a tolerance on changes in $x_k$, and a tolerance on the improvement in $f_k$; see [12]. For the last two, we use

$$\|x_{k-1} - x_k\|_\infty < \sqrt{\mathrm{tol}}\,(1 + \|x_k\|_\infty) \quad\text{and}\quad \left|\bar{f}_{k-1} - \bar{f}_k\right| < \mathrm{tol}\,(1 + \bar{f}_{k-1}),$$

where tol is a given tolerance. Here, for some number $s \ge 1$, $\bar{f}_k$ denotes the simple moving average $\bar{f}_k = \mathrm{mean}(f_{k-s}, \ldots, f_k)$. In our experiments we choose $s = 9$.
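Putting Algorithm 1 together with the Woodbury update sketched above (the `update_B` helper), a minimal stochastic quasi-Newton driver might look as follows; the tolerance, iteration cap, and sampler interface are our illustrative choices:

```python
import numpy as np

def sqn_ls(A, b, sample_W, lam1=1e-2, tol=1e-6, max_iter=1000, s=9):
    """Stochastic quasi-Newton for the LS problem: Algorithm 1 with B_k from (13)-(14)."""
    m, n = A.shape
    x, B = np.zeros(n), None
    f_hist = []
    for k in range(1, max_iter + 1):
        W = sample_W()                          # line 3: choose realization W_k
        WtA, Wtb = W.T @ A, W.T @ b             # line 4: sampled quantities
        B = update_B(B, WtA, k, lam1)           # line 5: Woodbury update of B_k
        grad = WtA.T @ (WtA @ x - Wtb)          # sample gradient (5)
        x_new = x - (1.0 / k) * (B @ grad)      # lines 6-7 with alpha_k = 1/k
        f_hist.append(0.5 * np.linalg.norm(WtA @ x_new - Wtb) ** 2)
        # stopping tests on x_k and on the moving average of f_k
        if np.max(np.abs(x_new - x)) < np.sqrt(tol) * (1 + np.max(np.abs(x_new))):
            return x_new
        if k > s + 1:
            fbar_prev = np.mean(f_hist[-s-2:-1])
            fbar = np.mean(f_hist[-s-1:])
            if abs(fbar_prev - fbar) < tol * (1 + fbar_prev):
                return x_new
        x = x_new
    return x
```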
Next, we provide two experiments to illustrate our methods. In Experiment 1, we use a linear regression problem to illustrate various properties of our algorithms and validate the theoretical results. In Experiment 2, we provide some results using a realistic example from machine learning.

Experiment 1
We consider a linear regression problem, where $A$ is a random matrix with elements drawn from a standard normal distribution. We set $b = Ax_{\mathrm{true}} + \epsilon$, where the additive noise $\epsilon$ is also standard normal. We choose $W$ to be a block Kaczmarz matrix with block size $\ell = 625$ and utilize the harmonic sequence $\alpha_k = 1/k$ as the step size strategy. We select the inverse Hessian approximation $B_k$ using the update formulas given in (13) and (14) with a small $\lambda_1$ and start at a random initial guess $x_0$.

First, we compare the performance of the stochastic quasi-Newton and Newton methods. In Figure 2 we provide plots of relative errors. In the top left panel, the relative errors are computed as $\|x_k - \widehat{x}\|/\|\widehat{x}\|$, where $x_k$ are stochastic quasi-Newton (SQN) iterates, and in the top right panel, the relative errors are computed as $\|x_k - \widetilde{x}\|/\|\widetilde{x}\|$, where $x_k$ are stochastic Newton (SN) iterates. These plots illustrate convergence of the SQN and SN iterates to $\widehat{x}$ and $\widetilde{x}$, respectively, as proved in Section 3. Notice that stochastic Newton exhibits much slower convergence to $\widetilde{x}$ than stochastic quasi-Newton does to $\widehat{x}$: SN requires 20,000 iterations to reach the relative error level that the SQN iterates attain after 175 iterations, and SQN reaches a good relative error after only 22 iterations.

In the bottom panel of Figure 2, we provide reconstruction errors relative to the true solution, $\|x_k - x_{\mathrm{true}}\|/\|x_{\mathrm{true}}\|$, for both methods, which demonstrates that SQN is faster than SN at providing a good approximation of the true solution. For this experiment the relative difference $\|\widetilde{x} - \widehat{x}\|/\|\widehat{x}\|$ is small. We omit the results for the stochastic gradient method due to poor performance. It is worth noting that the moderate size of this problem still allows one to use a QR solver (e.g., Matlab's "backslash") to solve the LS problem, which takes about 6 seconds, whereas SQN requires about 12 seconds to run $k = 200$ iterations.

Figure 2: The top left panel contains relative errors for stochastic quasi-Newton iterates, computed as $\|x_k - \widehat{x}\|/\|\widehat{x}\|$, where $\widehat{x}$ is the LS solution. The top right panel contains relative errors for stochastic Newton iterates, computed as $\|x_k - \widetilde{x}\|/\|\widetilde{x}\|$, where $\widetilde{x}$ is defined in Theorem 3.3. Notice that we display 20,000 iterations for SN and only 200 iterations for SQN. The bottom panel contains relative errors, $\|x_k - x_{\mathrm{true}}\|/\|x_{\mathrm{true}}\|$, for both SQN and SN.

Last, we investigate the performance of four different choices of $W$: (i) Block Kaczmarz, the generalized Kaczmarz method with $Q = I_m$ partitioned uniformly into blocks of $\ell_i = \ell$ columns; (ii) Kaczmarz, which is similar to the generalized Kaczmarz method with $Q = I_m$, but instead of a preset partitioning of the matrix $Q$ into $Q_i$ matrices, this method samples $\ell$ columns of $Q$ randomly at each iteration to generate $W$; (iii) Sparse Rademacher with $p = \ell$; and (iv) Sparse Random with a small value of $\psi$. These choices of $W$ are defined in Section 2.
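A small-scale analogue of this experiment can be scripted from the earlier sketches (`block_kaczmarz` and `sqn_ls`); the dimensions below are shrunk for illustration and are not those of the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, ell = 2000, 50, 100
A = rng.standard_normal((m, n))
x_true = np.ones(n)
b = A @ x_true + rng.standard_normal(m)

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]      # reference LS solution
sampler = lambda: block_kaczmarz(rng, m, ell)    # fresh W_k each iteration
x_sqn = sqn_ls(A, b, sampler, lam1=1e-2, max_iter=200)

print("rel. error:", np.linalg.norm(x_sqn - x_ls) / np.linalg.norm(x_ls))
```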
In Figure 3, we provide the function values $f(x_k)$ with respect to the number of row accesses of $A$, since row accesses of $A$ may be the computational bottleneck for large problems. We observe that all choices of $W$ perform similarly, with Kaczmarz and block Kaczmarz having a slight advantage. The performance of the stochastic quasi-Newton method is highly dependent on the underlying problem (e.g., structure in the matrix $A$ and vector $b$) and the realizations of $W$. Empirically, using regression data sets from the UCI Machine Learning Repository [28], we observe the best performance with block Kaczmarz and sparse Rademacher (data not shown).

Figure 3: Comparison of function values for different choices of $W$, as a function of the number of row accesses of $A$. The black dotted line corresponds to the LS error $f(\widehat{x})$.

Experiment 2
Next, we investigate the use of our stochastic quasi-Newton method for solving the large linear LS problems that arise in extreme learning machines (ELMs). ELM is a machine learning technique that uses random hidden nodes or neurons in a feedforward network to mimic biological learning techniques. The literature on ELM in the machine learning community is vast, with cited benefits that include higher scalability, less computational complexity, no requirement of tuning, and smaller training errors than generic machine learning techniques. ELM is commonly used for clustering, regression, and classification. Full details and comparisons are beyond the scope of this paper, and we refer the interested reader to papers such as [21, 19, 22, 24, 23, 20] and references therein.

At the core of ELM is a very large and potentially dynamically growing linear regression problem. In this experiment, we investigate the use of stochastic algorithms for efficiently solving these LS problems. In particular, we consider the problem of handwritten digit classification using the MNIST database [7], which contains 60,000 training images and 10,000 testing images of handwritten digits ranging from 0 to 9.
Each image is 28 × 28 pixels and is converted into a vector $\xi \in \mathbb{R}^{784}$ (i.e., 784 features).

We begin with a brief description of the classification problem for the MNIST dataset. Suppose we are given a set of $m$ examples in the form of a training set $S = \{(\xi_1, c_1), \ldots, (\xi_m, c_m)\}$, where $\xi_i \in \mathbb{R}^{784}$ and $c_i$ takes values from the set of classes $\mathcal{C} = \{0, 1, \ldots, 9\}$. Consider an ELM with a hidden layer of $n$ nodes; then the goal is to solve a LS problem of the form

$$\min_X \|HX - Y\|_F^2, \qquad (15)$$

where the hidden-layer output matrix is defined as

$$H = \begin{bmatrix} h(\xi_1) \\ \vdots \\ h(\xi_m) \end{bmatrix} \in \mathbb{R}^{m\times n}, \quad\text{with}\quad h(\xi) = \begin{bmatrix} h_1(\xi) & \cdots & h_n(\xi) \end{bmatrix}$$

being the output (row) vector of the hidden layer with respect to the input $\xi$; $X = [x_1 \cdots x_{10}]$, where $x_j \in \mathbb{R}^n$ contains the desired output weights for class $j$; and the training data target matrix $Y \in \mathbb{R}^{m\times 10}$ takes entries

$$y_{ij} = \begin{cases} 1, & \text{if } c_i = j - 1, \\ -1, & \text{else.} \end{cases}$$

For this example, $h(\xi)$ can be interpreted as a map from the image pixel space to the $n$-dimensional hidden-layer feature space. Although various activation functions could be used, we employ a standard choice of sigmoid additive hidden nodes with

$$h_j(\xi) = G(d_j, \delta_j, \xi) = 1/\!\left(1 + \exp\!\left(-(d_j^\top \xi + \delta_j)\right)\right),$$

where all of the hidden-node parameters $(d_j, \delta_j)_{j=1}^n$ are randomly generated based on a uniform distribution [22]. For our experiments, we set the number of hidden neurons to be $n = 300$.
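As an illustration of this setup (ours; the helper names, the uniform range of the hidden-node parameters, and the dense solver are assumptions for the sketch), the hidden-layer matrix $H$, the target matrix $Y$, and the classification rule used in the experiments can be assembled as follows:

```python
import numpy as np

def elm_features(Xi, D, delta):
    """Hidden-layer output matrix H: sigmoid of random affine maps, one row per example."""
    return 1.0 / (1.0 + np.exp(-(Xi @ D + delta)))  # Xi: m x 784, D: 784 x n

def elm_train(Xi, labels, n_hidden=300, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.uniform(-1.0, 1.0, size=(Xi.shape[1], n_hidden))  # hidden weights d_j
    delta = rng.uniform(-1.0, 1.0, size=n_hidden)             # hidden biases delta_j
    H = elm_features(Xi, D, delta)
    Y = np.where(labels[:, None] == np.arange(10)[None, :], 1.0, -1.0)  # m x 10 targets
    X = np.linalg.lstsq(H, Y, rcond=None)[0]  # solve (15); SQN could be used instead
    return D, delta, X

def elm_predict(Xi, D, delta, X):
    """Predicted class: argmax over j of h(xi) x_j."""
    return np.argmax(elm_features(Xi, D, delta) @ X, axis=1)
```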
The main computational work of ELM is to solve (15). Regularized or constrained solutions have been investigated (e.g., [22, 29, 2]). However, our focus is on solving the unconstrained LS problem efficiently and for very large sets of training data. In order to generate larger datasets, we performed multiple random rotations of the original 60,000 training images. More specifically, each image was rotated by $20(\eta - 0.5)$ degrees, where $\eta$ is a random number drawn from a beta distribution with shape parameters equal to 2. In our experiments, we consider up to 15 random rotations per image, resulting in up to 900,000 training images. Notice that as the number of training images increases, the number of rows of $H$ increases accordingly, while the number of columns remains the same.

We consider three approaches to solve (15) and compare CPU timings. In the original implementation of ELM [24, 23], the LS solution was computed as $\widehat{X} = H^\dagger Y$, where $H^\dagger$ is the Moore-Penrose pseudoinverse of $H$. We denote this approach "PINV". Another approach to solve large (often sparse) LS problems is to use an iterative method such as LSQR [35, 34], but since the LS problem needs to be solved for multiple right-hand sides (here, 10 solves), we use a global least squares method (Gl-LSQR) [42] with a maximum of 50 iterations and a tight residual tolerance. It was experimentally shown in [42] that Gl-LSQR is more effective and less expensive than LSQR applied to each right-hand side. We use our stochastic quasi-Newton (SQN) method where $W$ corresponds to the sparse Rademacher matrix with $\ell = 50$ and a small $\lambda_1$. We use a maximum of 1,000 iterations, a small stopping tolerance tol, and an initial guess of $0$. Since the $B_k$ matrices only depend on $H$ and $W$, SQN can be applied to multiple right-hand sides simultaneously.

Each LS solver is repeated 20 times in Matlab R2015b on a MacBook Pro with a 2.9 GHz Intel Core i7 and 8 GB of memory, and in Figure 4, we provide the median and 5th-95th percentiles of the CPU times vs. the number of training images (i.e., the number of rows of $H$). It is evident that for smaller training sets, all three methods perform similarly, but as the number of training images increases, SQN quickly surpasses PINV and Gl-LSQR in terms of faster CPU time. For various numbers of training data, we provide in Table 1 the mean and standard deviation of the relative reconstruction error for the SQN estimate, $\mathrm{rel} = \|X_{\mathrm{SQN}} - \widehat{X}\|_F / \|\widehat{X}\|_F$, and of the number of SQN iterations, $k$. Our results demonstrate that SQN does not necessarily provide the most accurate solutions; however, it can be used to achieve sufficiently good solutions in an efficient manner.

Figure 4: CPU times (median and 5th-95th percentiles) for solving LS problem (15) using stochastic quasi-Newton (SQN), global LSQR (Gl-LSQR), and the Moore-Penrose pseudoinverse (PINV) for various numbers of training images $m$.

Table 1: For various numbers of training images ($m = 60{,}000$, $120{,}000$, $300{,}000$, $600{,}000$, and $900{,}000$), the mean and standard deviation of the relative reconstruction errors rel and the iteration counts $k$ for SQN.
Next we test the performance of these estimates for classification of the MNIST testing dataset. That is, once computed, the output weights $X$ can be used to classify images in the following way. For each test image, the predicted class is given by

$$\text{Class of } \xi = \arg\max_j\ h(\xi)\,x_j.$$

In Figure 5 we provide a visualization of the computed classifications for the 10,000 testing images, where the accuracy values in the titles are calculated as $1 - r/10{,}000$, where $r$ is the number of misclassified images. An accuracy value close to 1 corresponds to a good performance of the classifier. These results correspond to training on 60,000 images, and the testing set was sorted by class for easier visualization. Notice that in Figure 5 the misclassified images are almost identical for all three methods, and the classification accuracy for SQN is only slightly smaller than that of PINV and Gl-LSQR. Thus, we have shown that our SQN method can achieve classification performance comparable to PINV and Gl-LSQR with much faster learning speed.

It is worth noting that the matrices considered here, though large, can still be loaded into memory. For problems where this is not the case (e.g., data too large or being dynamically generated [45]), PINV and Gl-LSQR would not be feasible, while SQN could still be used.

Figure 5: Classification (with corresponding accuracy) for the MNIST test images after training on 60,000 images, using different LS solvers.

5 Conclusions

In this paper we introduced a stochastic Newton and a stochastic quasi-Newton method for solving very large linear least-squares problems. New theoretical results show that under mild regularity conditions on the distributions, the stochastic quasi-Newton iterates converge to the unique LS solution $\widehat{x}$, while the stochastic Newton iterates converge to a different estimator of $x_{\mathrm{true}}$, which is still unbiased but is likely to have larger variance. We provide implementation details for choices of the random variables $W_k$, step length parameters $\alpha_k$, and stochastic inverse Hessian approximations $B_k$. Our numerical experiments validate our theory and also show the potential benefits of these methods for machine learning applications where the data sets of interest are typically extremely large. Furthermore, by allowing sparse distributions, our approaches can handle scenarios where the entire matrix $A$ is not available or full matrix-vector multiplications with $A$ are not feasible.

This work opens the doors to a plethora of new methods, advancements, and implementations for stochastic Newton and quasi-Newton methods. Current work includes the investigation of alternative $B_k$ matrices, adaptive choices of $W$ that can exploit matrix properties, efficient parallel implementations, and extensions to nonlinear problems.
Acknowledgement: The authors are grateful to Qi Long for initial discussions and Arvind Saibaba for helpful comments that have improved the manuscript. We are also grateful to Eldad Haber for pointing us to the ELM problem and to the Institute for Mathematics and its Applications for nurturing this research collaboration.

6 Consistency proofs

6.1 Proof of Theorem 3.2
(i) Let $v \in \mathbb{R}^m$ and assume that $W$ has the same distribution as $W_k$. Then $\|WW^\top v\| \le \|WW^\top\|\,\|v\|$, and therefore $\mathrm{E}(v^\top WW^\top WW^\top v) < \infty$. Furthermore,

$$v^\top C v = \mathrm{E}(v^\top WW^\top WW^\top v) = \mathrm{E}(\|WW^\top v\|^2) \ge 0,$$

and $v^\top C v = 0$ iff $\|WW^\top v\| = 0$ a.s., which happens iff $WW^\top v = 0$ a.s., in which case $\mathrm{E}(WW^\top)v = \beta I_m v = 0$. This implies $v = 0$, and thus $C$ is defined and SPD.

(ii) Note that

$$s_k = -B_k A^\top W_k W_k^\top A(x_{k-1} - \widehat{x}) - B_k A^\top W_k W_k^\top r, \qquad (16)$$

where $r = \widehat{b} - b$. Therefore,

$$b_{k+1} - \widehat{b} = b_k - \widehat{b} - \alpha_k A B_k A^\top W_k W_k^\top(b_k - \widehat{b}) - \alpha_k A B_k A^\top W_k W_k^\top r,$$

and there are positive constants $a$, $b_k$ and $c_k$ such that

$$e_{k+1} \le a\,e_k + b_k B^2 \|W_k W_k^\top\|^2 e_k + c_k B^2 \|W_k W_k^\top\|^2$$

for all $k$. Since $W_k$ and $e_k$ are independent and $\mathrm{E}\,e_1 < \infty$, it follows that $\mathrm{E}\,e_k < \infty$ for all $k$. This together with (16) implies that $\mathrm{E}(\|s_k\|^2) < \infty$ for all $k$.

(iii) By definition,

$$e_{k+1} - e_k = 2\alpha_k (b_k - \widehat{b})^\top A s_k + \alpha_k^2 s_k^\top A^\top A s_k,$$

and therefore

$$\mathrm{E}(e_{k+1} - e_k \mid \mathcal{F}_k) = 2\alpha_k (b_k - \widehat{b})^\top A\,\mathrm{E}(s_k \mid \mathcal{F}_k) + \alpha_k^2\,\mathrm{E}(s_k^\top A^\top A s_k \mid \mathcal{F}_k).$$

Since $A^\top r = 0$, (16) leads to

$$\mathrm{E}(s_k \mid \mathcal{F}_k) = -B_k A^\top\,\mathrm{E}(W_k W_k^\top)(A x_{k-1} - A\widehat{x}) = -\beta B_k A^\top(b_k - \widehat{b}). \qquad (17)$$

Let $\lambda_A := \lambda_{\max}(AA^\top)$; then

$$s_k^\top A^\top A s_k = (b_k - b)^\top W_k W_k^\top A B_k A^\top A B_k A^\top W_k W_k^\top(b_k - b) \le \lambda_A^2 B^2 (b_k - b)^\top W_k W_k^\top W_k W_k^\top(b_k - b),$$

and therefore, with $\lambda_C := \lambda_{\max}(C)$ and $\|v\|_C^2 := v^\top C v$,

$$\mathrm{E}(s_k^\top A^\top A s_k \mid \mathcal{F}_k) \le \lambda_A^2 B^2 \|b_k - b\|_C^2 \le 2\lambda_A^2 B^2 \lambda_C\,e_k + 2\lambda_A^2 B^2 \|r\|_C^2. \qquad (18)$$

Equations (17) and (18) lead to

$$\mathrm{E}(e_{k+1} - e_k \mid \mathcal{F}_k) \le -2\beta\alpha_k (b_k - \widehat{b})^\top A B_k A^\top(b_k - \widehat{b}) + 2\alpha_k^2\lambda_A^2 B^2\lambda_C\,e_k + 2\alpha_k^2\lambda_A^2 B^2 \|r\|_C^2,$$

and subtracting $2\alpha_k^2\lambda_A^2 B^2\lambda_C\,e_k$ from both sides yields

$$\mathrm{E}\big(e_{k+1} - (1 + 2\lambda_A^2 B^2\lambda_C\,\alpha_k^2)\,e_k \mid \mathcal{F}_k\big) \le -2\beta\alpha_k (b_k - \widehat{b})^\top A B_k A^\top(b_k - \widehat{b}) + 2\lambda_A^2 B^2\alpha_k^2 \|r\|_C^2.$$

Define $\gamma_k = \prod_{i=1}^{k-1}\left(1 + 2\lambda_A^2 B^2\lambda_C\,\alpha_i^2\right)^{-1} \le 1$. Since

$$0 \le -\log(\gamma_k) \le 2\lambda_A^2 B^2\lambda_C \sum_{i=1}^\infty \alpha_i^2 < \infty$$

for any $k$, it follows that $\{\gamma_k\}$ is a decreasing sequence converging to some $\gamma > 0$. Let $\widetilde{e}_k = \gamma_k e_k$ and $Z_k = \mathrm{E}(\widetilde{e}_{k+1} - \widetilde{e}_k \mid \mathcal{F}_k)$; then

$$\mathrm{E}(\widetilde{e}_{k+1} - \widetilde{e}_k \mid \mathcal{F}_k) \le -2\beta\alpha_k\gamma_{k+1}(b_k - \widehat{b})^\top A B_k A^\top(b_k - \widehat{b}) + 2\lambda_A^2 B^2\gamma_{k+1}\alpha_k^2\|r\|_C^2 \le 2\lambda_A^2 B^2\alpha_k^2\|r\|_C^2, \qquad (19)$$

and also

$$\sum_{k=1}^\infty \mathrm{E}\left[I_{Z_k \ge 0}(\widetilde{e}_{k+1} - \widetilde{e}_k)\right] \le 2\lambda_A^2 B^2\|r\|_C^2 \sum_{k=1}^\infty \alpha_k^2 < \infty.$$

Theorem 3.1 now implies that $\{\widetilde{e}_k\}$ converges a.s., and since $\gamma_k$ converges to a nonzero value, it follows that $\{e_k\}$ also converges a.s. The final step is to show that $e_k \overset{\mathrm{a.s.}}{\longrightarrow} 0$.
It follows from (19) that

$$2\beta\sum_{k=1}^n \alpha_k\gamma_{k+1}\,\mathrm{E}\!\left(\lambda_{\min}(B_k)\|A^\top(b_k - \widehat{b})\|^2\right) \le 2\beta\sum_{k=1}^n \alpha_k\gamma_{k+1}\,\mathrm{E}\!\left((b_k - \widehat{b})^\top A B_k A^\top(b_k - \widehat{b})\right) \le 2\lambda_A^2 B^2\|r\|_C^2\sum_{k=1}^n \alpha_k^2 + \mathrm{E}\,\widetilde{e}_1 < \infty$$

uniformly in $n$. Therefore,

$$\gamma\sum_{k=1}^\infty \alpha_k\lambda_{\min}(B_k)\|A^\top(b_k - \widehat{b})\|^2 \le \sum_{k=1}^\infty \alpha_k\gamma_{k+1}\lambda_{\min}(B_k)\|A^\top(b_k - \widehat{b})\|^2 < \infty \quad\text{a.s.}$$

Since $\sum_{k=1}^\infty \alpha_k\lambda_{\min}(B_k) = \infty$ a.s., it follows that for almost all $\omega \in \Omega$ there is a subsequence $n_k(\omega)$ such that $\|A^\top(b_{n_k(\omega)}(\omega) - \widehat{b})\| \to 0$. Since $A^\top(b_k - \widehat{b}) = A^\top A(x_{k-1} - \widehat{x})$ and $A^\top A$ is SPD, this implies $\|x_{n_k(\omega)-1}(\omega) - \widehat{x}\| \to 0$, which in turn implies $e_{n_k(\omega)}(\omega) \to 0$. Therefore, since $\{e_k\}$ converges a.s., we also have $e_k \overset{\mathrm{a.s.}}{\longrightarrow} 0$.
Hence, since $A$ has full column rank, $x_k \overset{\mathrm{a.s.}}{\longrightarrow} \widehat{x}$. ∎

6.2 Proof of Theorem 3.3

The proof is a slight modification of that of Theorem 3.2. Define $\mathcal{F}_k = \sigma(W_i;\ i < k)$.

(i) Let $v \in \mathbb{R}^m$. Since $v^\top C v = \mathrm{E}(\|H_k v\|^2) \ge 0$, it follows that $C$ is positive semi-definite, and therefore so is $ACA^\top$. Since $(W_k^\top A)^\dagger W_k^\top A$ is symmetric, $PA$ is symmetric. Let $v \in \mathbb{R}^n$. Using properties of the pseudoinverse gives

$$v^\top P A v = v^\top\,\mathrm{E}(H_k A)\,v = \mathrm{E}(\|H_k A v\|^2) \ge 0,$$

where equality holds iff $H_k A v = 0$ a.s., and since $\mathrm{E}(W_k W_k^\top) = \beta I_m$ and $A$ has full column rank, it follows that $v = 0$. Hence $PA$ is SPD.

(ii) Note that

$$s_k = -H_k A(x_{k-1} - \widetilde{x}) - H_k r, \qquad (20)$$

where $r = A\widetilde{x} - b$. Therefore,

$$x_k - \widetilde{x} = x_{k-1} - \widetilde{x} - \alpha_k H_k A(x_{k-1} - \widetilde{x}) - \alpha_k H_k r,$$

and since $H_k A$ is an orthogonal projection matrix, it follows that

$$e_k \le 2(1 + \alpha_k)^2 e_{k-1} + 4\alpha_k^2\|H_k r\|^2.$$

This then leads to

$$\mathrm{E}\,e_k \le 2(1 + \alpha_k)^2\,\mathrm{E}\,e_{k-1} + 4\alpha_k^2\,r^\top C r,$$

which implies that $\mathrm{E}\,e_k < \infty$ and $\mathrm{E}(\|s_k\|^2) < \infty$ for all $k$.

(iii) The idea is again to use the quasimartingale convergence theorem. Since $W_k$ and $\mathcal{F}_k$ are independent, and $Pr = P(A\widetilde{x} - b) = PA\widetilde{x} - Pb = 0$,

$$\mathrm{E}(e_k - e_{k-1} \mid \mathcal{F}_k) = 2\alpha_k(x_{k-1} - \widetilde{x})^\top\,\mathrm{E}(s_k \mid \mathcal{F}_k) + \alpha_k^2\,\mathrm{E}(\|s_k\|^2 \mid \mathcal{F}_k) = -2\alpha_k\|x_{k-1} - \widetilde{x}\|_{PA}^2 + \alpha_k^2\,\mathrm{E}(\|s_k\|^2 \mid \mathcal{F}_k). \qquad (21)$$

To bound the second term of (21) we use again the fact that $H_k A$ is a projection matrix to obtain

$$\mathrm{E}(\|s_k\|^2 \mid \mathcal{F}_k) \le 2(A\widetilde{x} - b)^\top C(A\widetilde{x} - b) + 2\|x_{k-1} - \widetilde{x}\|^2 \le 2c_0 + 2\|x_{k-1} - \widetilde{x}\|^2, \qquad (22)$$

where $c_0 = \lambda_{\max}(C)\|A\widetilde{x} - b\|^2$. Therefore, (21) can be bounded as

$$\mathrm{E}(e_k - e_{k-1} \mid \mathcal{F}_k) \le -2\alpha_k\|x_{k-1} - \widetilde{x}\|_{PA}^2 + 2c_0\alpha_k^2 + 2\alpha_k^2 e_{k-1},$$

which yields

$$\mathrm{E}\big(e_k - (1 + 2\alpha_k^2)e_{k-1} \mid \mathcal{F}_k\big) \le -2\alpha_k\|x_{k-1} - \widetilde{x}\|_{PA}^2 + 2c_0\alpha_k^2 \le 2c_0\alpha_k^2. \qquad (23)$$

Let $\nu_k = \prod_{i=1}^k(1 + 2\alpha_i^2)^{-1} < 1$. The sequence $\{\nu_k\}$ converges to some $\nu > 0$ because $\sum_{k=1}^\infty \alpha_k^2 < \infty$. Define $\widetilde{e}_k = \nu_k e_k$, and multiply both sides of (23) by $\nu_k$. We obtain

$$\mathrm{E}(\widetilde{e}_k - \widetilde{e}_{k-1} \mid \mathcal{F}_k) \le -2\alpha_k\nu_k\|x_{k-1} - \widetilde{x}\|_{PA}^2 + 2c_0\alpha_k^2\nu_k \le 2c_0\alpha_k^2. \qquad (24)$$

Define $Z_k = \mathrm{E}(\widetilde{e}_{k+1} - \widetilde{e}_k \mid \mathcal{F}_{k+1})$. Then

$$\mathrm{E}\left[I_{Z_k \ge 0}(\widetilde{e}_{k+1} - \widetilde{e}_k)\right] = \mathrm{E}\left(I_{Z_k \ge 0}\,\mathrm{E}[\widetilde{e}_{k+1} - \widetilde{e}_k \mid \mathcal{F}_{k+1}]\right) \le 2c_0\alpha_{k+1}^2.$$

Since $\sum_{k=1}^\infty \alpha_k^2 < \infty$, the series $\sum_{k=1}^\infty \mathrm{E}\left(I_{Z_k \ge 0}(\widetilde{e}_{k+1} - \widetilde{e}_k)\right)$ converges, and therefore $\{\widetilde{e}_k\}$ converges a.s. by Theorem 3.1. But since $\nu_k$ converges to a nonzero value, it also follows that $\{e_k\}$ converges a.s. The final step is to show that $\{e_k\}$ in fact converges to zero. Rearranging the terms and taking the expected value of both sides of (24) yields

$$2\sum_{k=1}^\infty \alpha_k\nu_k\,\mathrm{E}\left(\|x_{k-1} - \widetilde{x}\|_{PA}^2\right) \le \mathrm{E}\,\widetilde{e}_0 + 2c_0\sum_{k=1}^\infty \alpha_k^2 < \infty.$$

Since $\sum_{k=1}^\infty \alpha_k = \infty$ and $\nu_k \to \nu > 0$,
it follows that $\mathrm{E}(\|x_{n_k} - \widetilde{x}\|_{PA}^2) \to 0$ for some subsequence $\{n_k\}$, and since $PA$ is SPD, we also have $\mathrm{E}(\|x_{n_k} - \widetilde{x}\|^2) = \mathrm{E}\,e_{n_k} \to 0$.
0. By Fatou’slemma: 0 ≤ E lim inf e n k = E lim e n k ≤ lim inf E e n k = 0 . It follows that lim e n k = 0 a.s. and since { e k } converges a.s., this implies e k a . s . −→ x k a . s . −→ (cid:101) x . References [1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with bi-nary coins.
[2] Z. Bai, G.-B. Huang, D. Wang, H. Wang, and M. B. Westover. Sparse extreme learning machine for classification. IEEE Transactions on Cybernetics, 44(10):1858–1870, 2014.
[3] A. Benveniste, S. Wilson, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Stochastic Modelling and Applied Probability. Springer, New York, 2012.
[4] L. Bottou. Online learning and stochastic approximations. In Online Learning in Neural Networks, chapter 2, pages 9–42. Cambridge University Press, 1998.
[5] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
[6] C. Boutsidis and A. Gittens. Improved matrix algorithms via the subsampled randomized Hadamard transform. arXiv preprint arXiv:1204.0062, 2012.
[7] L. Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[8] P. S. R. Diniz. Adaptive Filtering. Springer, New York, 1997.
[9] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
[10] J. Fan, Y. Liao, and H. Liu. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1–C32, 2016.
[11] D. L. R. Fisk. Quasi-Martingales and Stochastic Integrals. Department of Mathematics Research Monograph. Kent State University, Department of Mathematics, 1963.
[12] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Emerald, Bingley, 2007.
[13] R. M. Gower, D. Goldfarb, and P. Richtárik. Stochastic block BFGS: Squeezing more curvature out of data. arXiv preprint arXiv:1603.09649, 2016.
[14] R. M. Gower and P. Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015.
[15] R. M. Gower and P. Richtárik. Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. arXiv preprint arXiv:1602.01768, 2016.
[16] E. Haber, M. Chung, and F. Herrmann. An effective method for parameter estimation with PDE constraints with multiple right-hand sides. SIAM Journal on Optimization, 22(3):739–757, 2012.
[17] P. C. Hansen, V. Pereyra, and G. Scherer. Least Squares Data Fitting with Applications. Johns Hopkins University Press, Baltimore, 2012.
[18] A. Hedayat and W. D. Wallis. Hadamard matrices and their applications. The Annals of Statistics, 6(6):1184–1238, 1978.
[19] G. Huang, G.-B. Huang, S. Song, and K. You. Trends in extreme learning machines: A review. Neural Networks, 61:32–48, 2015.
[20] G.-B. Huang, Z. Bai, L. L. C. Kasun, and C. M. Vong. Local receptive fields based extreme learning machine. IEEE Computational Intelligence Magazine, 10(2):18–29, 2015.
[21] G.-B. Huang, D. H. Wang, and Y. Lan. Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics, 2(2):107–122, 2011.
[22] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang. Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(2):513–529, 2012.
[23] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. Extreme learning machine: a new learning scheme of feedforward neural networks. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, volume 2, pages 985–990. IEEE, 2004.
[24] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.
[25] S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres. Classe des Sciences Mathématiques et Naturelles. Série A, Sciences Mathématiques, 35:355–357, 1937.
[26] H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications. Springer, New York, 1997.
[27] E. B. Le, A. Myers, and T. Bui-Thanh. A randomized misfit approach for data reduction in large-scale inverse problems. arXiv preprint arXiv:1603.01562, 2016.
[28] M. Lichman. UCI Machine Learning Repository, Center for Machine Learning and Intelligent Systems, University of California, Irvine, 2017.
[29] J. Luo, C.-M. Vong, and P.-K. Wong. Sparse Bayesian extreme learning machine for multi-classification. IEEE Transactions on Neural Networks and Learning Systems, 25(4):836–843, 2014.
[30] M. Métivier. Semimartingales: A Course on Stochastic Processes. De Gruyter Studies in Mathematics. De Gruyter, Berlin, 1982.
[31] D. Needell, N. Srebro, and R. Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 155(1-2):549–573, 2016.
[32] D. Needell and J. A. Tropp. Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra and its Applications, 441:199–221, 2014.
[33] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, second edition, 2006.
[34] C. C. Paige and M. A. Saunders. Algorithm 583, LSQR: Sparse linear equations and least-squares problems. ACM Transactions on Mathematical Software, 8(2):195–209, 1982.
[35] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software, 8(1):43–71, 1982.
[36] M. Pilanci and M. J. Wainwright. Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. Journal of Machine Learning Research, 17(53):1–38, 2016.
[37] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[38] H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer, 1985.
[39] A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory, volume 16. SIAM, Philadelphia, 2014.
[40] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, volume 65. John Wiley & Sons, Hoboken, 2005.
[41] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262–278, 2009.
[42] F. Toutounian and S. Karimi. Global least squares method (Gl-LSQR) for solving general linear systems with several right-hand sides. Applied Mathematics and Computation, 178(2):452–460, 2006.
[43] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.