Stochastic Newton and Quasi-Newton Methods for Large Linear Least-squares Problems
Julianne Chung*, Matthias Chung†, J. Tanner Slagel‡, Luis Tenorio§

February 27, 2017

* Department of Mathematics and Computational Modeling and Data Analytics Division, Academy of Integrated Science, Virginia Tech, Blacksburg, VA ([email protected]).
† Department of Mathematics and Computational Modeling and Data Analytics Division, Academy of Integrated Science, Virginia Tech, Blacksburg, VA ([email protected]).
‡ Department of Mathematics, Virginia Tech, Blacksburg, VA ([email protected]).
§ Department of Applied Mathematics and Statistics, Colorado School of Mines, Golden, CO ([email protected], http://inside.mines.edu/ltenorio/).
Abstract
We describe stochastic Newton and stochastic quasi-Newton approaches to efficiently solve large linear least-squares problems where the very large data sets present a significant computational burden (e.g., the size may exceed computer memory, or the data are collected in real time). In our proposed framework, stochasticity is introduced in two different frameworks as a means to overcome these computational limitations, and probability distributions that can exploit structure and/or sparsity are considered. Theoretical results on consistency of the approximations for both the stochastic Newton and the stochastic quasi-Newton methods are provided. The results show, in particular, that stochastic Newton iterates, in contrast to stochastic quasi-Newton iterates, may not converge to the desired least-squares solution. Numerical examples, including an example from extreme learning machines, demonstrate the potential applications of these methods.
Keywords: stochastic approximation, stochastic Newton, stochastic quasi-Newton, least-squares, extreme learning machine.
1 Introduction

In this paper, we are interested in linear problems of the form

$$b = Ax_{\mathrm{true}} + \epsilon, \qquad (1)$$

where $b \in \mathbb{R}^m$ contains the observed data, $x_{\mathrm{true}} \in \mathbb{R}^n$ is the desired solution, $A \in \mathbb{R}^{m\times n}$, $m \gg n$, has full column rank, and $\epsilon$ represents noise modeled as random. Here, we assume that $\epsilon$ has zero mean and covariance matrix $\sigma^2 I_m$, where $I_m$ is the $m\times m$ identity matrix. The goal is to compute the unique least-squares (LS) solution,

$$\widehat{x} = \arg\min_{x\in\mathbb{R}^n} f(x) = \tfrac12\|Ax - b\|^2, \qquad (2)$$

where $\|\cdot\|$ denotes the two-norm. Typical LS solvers such as iterative methods for large, sparse problems may require many multiplications with $A$ and its transpose [33, 17]. This can be prohibitively expensive for many problems of interest. We overcome these limitations by first reformulating the LS problem as a stochastic optimization problem as follows. Let $W \in \mathbb{R}^{m\times\ell}$, $m > \ell$, be a random matrix with $\mathrm{E}(WW^\top) = \beta I_m$, where $\beta > 0$ and $\mathrm{E}$ denotes the expectation operator. Then

$$\mathrm{E}\,\|W^\top(Ax - b)\|^2 = (Ax - b)^\top\,\mathrm{E}(WW^\top)\,(Ax - b) = \beta\,\|Ax - b\|^2.$$

Hence the minimizer of (2) coincides with the solution of the following stochastic optimization problem,

$$\min_x \mathrm{E}\,f_W(x), \quad\text{where}\quad f_W(x) = \tfrac12\|W^\top(Ax - b)\|^2, \qquad (3)$$

since $\mathrm{E}\,f_W(x) = \beta f(x)$, and the goal of this work is to solve (3).
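As a quick sanity check of this reformulation, the identity $\mathrm{E}\,\|W^\top r\|^2 = \beta\|r\|^2$ can be verified by Monte Carlo sampling. The following sketch is ours (not from the paper) and uses Rademacher matrices $W$ with i.i.d. $\pm 1$ entries, for which $\mathrm{E}(WW^\top) = \ell I_m$, i.e., $\beta = \ell$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, ell = 200, 10
r = rng.standard_normal(m)  # plays the role of the residual Ax - b

# Rademacher sampling: i.i.d. +/-1 entries, so E(W W^T) = ell * I_m
vals = []
for _ in range(20000):
    W = rng.choice([-1.0, 1.0], size=(m, ell))
    vals.append(np.linalg.norm(W.T @ r) ** 2)

print(np.mean(vals))                 # Monte Carlo estimate of E ||W^T r||^2
print(ell * np.linalg.norm(r) ** 2)  # beta * ||r||^2 with beta = ell
```

The two printed values agree to within sampling error, which is exactly the identity used above.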
Two approaches have been used in the literature to approximate (3): Sample Average Approximation (SAA) and Stochastic Approximation (SA) [37, 39]. In SAA methods, the goal is to solve a sampled LS problem, where one major challenge is to find a good sampling matrix $W$ (e.g., using randomization techniques). On the other hand, SA is an iterative approach where different realizations of the sampling matrix $W$ are used at each iteration to update the approximate solution. In this work we consider only the SA approach.

Our main contributions are summarized as follows. We reformulate the problem as a stochastic optimization problem, as described above, and develop stochastic Newton and stochastic quasi-Newton methods for solving large LS problems. By introducing stochasticity and allowing sparse realizations of $W$, we can handle large problems where only pieces of the data are readily available or in memory at a given time. We prove almost sure convergence of the stochastic quasi-Newton estimators to the desired LS solution. Furthermore, we show that for systems that are inconsistent but whose coefficient matrix has full column rank, stochastic Newton iterates may not converge to the unique LS solution $\widehat{x}$, but instead converge to the solution of a different (not necessarily nearby) problem (see Section 3). By exploiting updates for the inverse Hessian approximation, our stochastic quasi-Newton method can produce good approximations and be computationally more efficient than standard LS methods, especially for very large problems.

A similar stochastic reformulation was described in [27, 9, 43], but a fundamental difference with our work is that these approaches use an SAA, rather than an SA, method to approximate $\widehat{x}$. SA methods have been intensively studied, especially in the last few years. Although much of the literature has focused on general objective functions, general convergence theories, and gradient-based methods [26, 4, 3, 31, 41], only recently has the focus shifted to higher-order methods (e.g., stochastic Newton) and efficient update methods that can incorporate curvature information (e.g., stochastic quasi-Newton) [5, 44]. Repeated sampling in a higher-order method was considered in [36, 16]. However, [16] does not consider sparse sampling, and the algorithm described in [36] requires the full gradient of the objective function, which we do not require here, and uses only curvature information from the current sample, whereas our stochastic quasi-Newton approach can integrate all previous samples in an efficient update scheme. Gower and Richtárik proved in [14] that for consistent linear systems (e.g., (1) with $\epsilon = 0$), the stochastic Newton method with unit step size converges to the desired solution. However, they do not consider LS problems. Furthermore, our work is different from recent works on stochastic quasi-Newton methods that focus on computing matrix inverses [15] and solving empirical risk minimization problems [13].

An outline for this paper is as follows. In Section 2, we describe the stochastic Newton and stochastic quasi-Newton methods for solving LS problems, whose consistency results are presented and discussed in Section 3. Numerical results in Section 4 demonstrate the potential of our approaches, and conclusions are provided in Section 5. Consistency proofs are provided in Section 6.

2 Stochastic approximation methods

In this section, we describe SA methods for computing a solution to (3).
Given an initial vector $x_0 \in \mathbb{R}^n$, SA methods define a sequence of iterates

$$x_k = x_{k-1} + \alpha_k s_k, \qquad (4)$$

where $\{\alpha_k\}$ is a sequence of step sizes and $\{s_k\}$ is a sequence of search directions such that $s_k$ depends on the iterate $x_{k-1}$ and the random variables $W_1, \ldots, W_k$. The most common SA approach is the stochastic gradient method, where $s_k = -\nabla f_{W_k}(x_{k-1})$ is the sample gradient,

$$\nabla f_W(x) = (W^\top A)^\top(W^\top A x - W^\top b) = A^\top W W^\top(Ax - b), \qquad (5)$$

evaluated at $x_{k-1}$. The popularity of the stochastic gradient method stems from its proven consistency properties and its easy implementation. However, the stochastic gradient method is known to converge slowly [44]; thus higher-order methods are desired and discussed below.

For the stochastic Newton method, the search direction is typically defined as

$$s_k = -\left(\nabla^2 f_{W_k}\right)^\dagger \nabla f_{W_k}(x_{k-1}), \qquad (6)$$

where the sample Hessian is given by $\nabla^2 f_W = A^\top W W^\top A$, and $\dagger$ denotes the Moore-Penrose pseudoinverse. Using properties of the pseudoinverse, (6) can be reduced to

$$s_k = -(W_k^\top A)^\dagger W_k^\top(A x_{k-1} - b). \qquad (7)$$

For the stochastic Newton method, we use a step size $\alpha_k$ that satisfies the following frequently used conditions:

$$\sum_{k=1}^\infty \alpha_k = \infty \quad\text{and}\quad \sum_{k=1}^\infty \alpha_k^2 < \infty. \qquad (8)$$

For the stochastic quasi-Newton method, we use the search direction given by

$$s_k = -B_k \nabla f_{W_k}(x_{k-1}), \qquad (9)$$

where $\{B_k\}$ is a sequence of random symmetric positive definite (SPD) matrices. For the convergence results proved in Section 3, we require that $\lambda_{\max}(B_k) \le B$ almost surely (a.s.) for some constant $B > 0$ and all $k$, with

$$\sum_{k=1}^\infty \alpha_k \lambda_{\min}(B_k) = \infty \ \text{a.s.} \quad\text{and}\quad \sum_{k=1}^\infty \alpha_k^2 < \infty, \qquad (10)$$

where $\lambda_{\min}(B_k)$ and $\lambda_{\max}(B_k)$ denote the minimal and maximal eigenvalues of $B_k$, respectively. Note that the special case $B_k = I_n$ is nothing more than the stochastic gradient method. The sequence $\{B_k\}$ is assumed to be independent of $W_k$, but may depend on $W_1, \ldots, W_{k-1}$.

These two algorithms provide different ways to estimate $\widehat{x}$. The choice of $\{\alpha_k\}$ and the distribution of $W$ fully determine the stochastic Newton method, and these along with the choice of $\{B_k\}$ fully determine the stochastic quasi-Newton method. Below we briefly summarize choices for $\{\alpha_k\}$, the distribution of $W$, and the sequence $\{B_k\}$, with particular attention on the conditions required to prove consistency of the approximations.

Choice of $\{\alpha_k\}$. Selecting a good step size $\alpha_k$ (or learning rate, as it is referred to in machine learning) is critical. A variety of methods have been proposed to improve convergence rates; see for instance [4, 40, 8]. In this paper we restrict ourselves to step size choices that comply with conditions (8) and (10) for stochastic Newton and stochastic quasi-Newton, respectively. In our numerical experiments in Section 4, we use the harmonic sequence $\alpha_k = 1/k$; see [38].
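To make the iteration (4) concrete, the following sketch (ours; the sampler, problem sizes, and iteration count are illustrative choices, not the authors' implementation) runs the generic SA loop with either the stochastic gradient direction (5) or the stochastic Newton direction (7), using the harmonic step size $\alpha_k = 1/k$ and Rademacher samples $W_k$:

```python
import numpy as np

def sa_solve(A, b, direction="gradient", ell=10, n_iter=500, seed=0):
    """Generic SA iteration x_k = x_{k-1} + alpha_k s_k for the LS problem."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for k in range(1, n_iter + 1):
        W = rng.choice([-1.0, 1.0], size=(m, ell))  # E(W W^T) = ell * I_m
        WtA = W.T @ A                               # sampled matrix W_k^T A
        Wtr = W.T @ (A @ x - b)                     # sampled residual W_k^T (A x - b)
        if direction == "gradient":                 # s_k from (5)
            s = -WtA.T @ Wtr
        else:                                       # stochastic Newton, s_k from (7)
            s = -np.linalg.lstsq(WtA, Wtr, rcond=None)[0]
        x += (1.0 / k) * s                          # harmonic step size alpha_k = 1/k
    return x
```

For example, `sa_solve(A, b, direction="newton")` runs stochastic Newton; the pseudoinverse in (7) is applied through a least-squares solve rather than formed explicitly.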
Distributional choice for $W$. As discussed in the introduction, the solutions to problems (2) and (3) are equivalent if $\mathrm{E}(WW^\top) = \beta I_m$. Extensions to include a general (known) positive definite covariance matrix $\Gamma$ would require adjusting the sampling matrix such that $\mathrm{E}(WW^\top) = \Gamma^{-1}$. Including a positive diagonal covariance matrix would simply require scaling the rows of $A$ (i.e., solving a weighted LS problem). However, for simplicity of presentation, we consider $\Gamma = \sigma^2 I_m$. Three distributional choices for $W$ are considered (a sampling sketch for all three follows the list):

1. Random sparse matrices.
Let $W \in \mathbb{R}^{m\times\ell}$ be a random matrix with i.i.d. random elements $w_{ij}$ where, for a fixed $0 < \psi \le 1$, $w_{ij}$ takes the values $\pm\sqrt{\beta/(\ell\psi)}$ each with probability $\psi/2$, and the value $0$ with probability $1 - \psi$. It is straightforward to verify that $\mathrm{E}(WW^\top) = \beta I_m$. Notice that as $\psi$ gets closer to zero, more sparsity is introduced in $W$. It is worth mentioning that this choice of $W$ is a generalization of the Achlioptas random matrix ($\psi = 1/3$, $\beta = \ell$) and the Rademacher distribution ($\psi = 1$ and $\beta = \ell$); see [1, 16].

2. Generalized Kaczmarz matrices.
For $i = 1, \ldots, p$, let $Q_i \in \mathbb{R}^{m\times\ell_i}$ be such that $Q = [Q_1, \ldots, Q_p] \in \mathbb{R}^{m\times m}$ is an orthogonal matrix. Define the distribution of $W$ to be uniform on $\{Q_1, \ldots, Q_p\}$. Then

$$\mathrm{E}(WW^\top) = \frac{1}{p}\sum_{i=1}^p Q_i Q_i^\top = \frac{1}{p} I_m.$$

Notice that selecting $Q = I_m$, or any permutation of it, together with the stochastic Newton direction (6) leads to the well-known randomized Kaczmarz method if $\ell_i = 1$ for all $i$, and to the randomized block Kaczmarz method otherwise [25, 32, 14]. Choosing the elements of $Q$ in $\{0, \pm 1\}$ (up to normalization) leads to (sparse) randomized Hadamard matrices [18, 6]. Notice that sparsity may be introduced by the particular choice of $Q$ and that the number of columns in the $Q_i$'s can differ.

3. Sparse Rademacher matrices.
Fix $p \le m$. The columns $w_i$ of $W \in \mathbb{R}^{m\times\ell}$ are i.i.d., and each column has exactly $p$ nonzero entries, placed uniformly at random, each taking values in $\{\pm 1\}$ with equal probability. Hence, conditional on a configuration $C$ of $p$ ones and $m - p$ zeros, each column $w_i$ has conditional expectation

$$\mathrm{E}(w_i w_i^\top \mid C) = I_{m,C},$$

where $I_{m,C}$ is the diagonal matrix with the configuration $C$ on the diagonal. It follows that

$$\mathrm{E}(w_i w_i^\top) = \mathrm{E}\,\mathrm{E}(w_i w_i^\top \mid C) = \binom{m-1}{p-1}\binom{m}{p}^{-1} I_m = \frac{p}{m} I_m,$$

and therefore $\mathrm{E}(WW^\top) = (\ell p/m) I_m$. Note that the case $p = m$ generates full Rademacher matrices. The distinction from the other choices of $W$ is that the entries of sparse Rademacher matrices are not i.i.d., as in the random sparse matrices, and do not necessarily come from partitions of orthogonal matrices, as in the generalized Kaczmarz matrices.
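For illustration, the following sketch (ours; function and parameter names are hypothetical) draws one realization of $W$ from each of the three families. By the computations above, $\mathrm{E}(WW^\top) = \beta I_m$ with $\beta$ equal to the `beta` argument for the first sampler, $\ell/m = 1/p$ for the identity-column Kaczmarz sampler, and $\ell p/m$ for the third:

```python
import numpy as np

def sparse_random(rng, m, ell, psi, beta):
    """Entries are 0 w.p. 1-psi and +/- sqrt(beta/(ell*psi)) w.p. psi/2 each."""
    signs = rng.choice([-1.0, 1.0], size=(m, ell))
    mask = rng.random((m, ell)) < psi
    return np.sqrt(beta / (ell * psi)) * signs * mask

def block_kaczmarz(rng, m, ell):
    """Generalized Kaczmarz with Q = I_m: W holds ell randomly chosen columns
    of the identity (the column-sampling 'Kaczmarz' variant; a fixed partition
    of Q into p = m/ell blocks gives the block Kaczmarz variant)."""
    cols = rng.choice(m, size=ell, replace=False)
    return np.eye(m)[:, cols]

def sparse_rademacher(rng, m, ell, p):
    """Each column has exactly p nonzeros, +/-1 with equal probability."""
    W = np.zeros((m, ell))
    for j in range(ell):
        rows = rng.choice(m, size=p, replace=False)
        W[rows, j] = rng.choice([-1.0, 1.0], size=p)
    return W

rng = np.random.default_rng(1)
W = sparse_rademacher(rng, m=100, ell=5, p=10)  # E(W W^T) = (ell*p/m) I_m
```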
Choice of $\{B_k\}$. The matrices $B_k$ should approximate the inverse Hessian $(A^\top A)^{-1}$ and should be SPD. Furthermore, the convergence results require some control of the smallest and largest eigenvalues of $B_k$. Thus, we define a sequence of SPD matrices $B_k$ that converges a.s. to a small perturbation of $(A^\top A)^{-1}$ (up to scaling) and has appropriate behavior of the smallest and largest eigenvalues. Since for any $\lambda_1 > 0$,

$$\frac{\lambda_1}{k} I_n + \frac{1}{k}\sum_{i=1}^k A^\top W_i W_i^\top A \;\overset{\mathrm{a.s.}}{\longrightarrow}\; \beta A^\top A,$$

it follows that for any $\lambda_0 \ge 0$,

$$\widetilde{B}_k = \lambda_0 I_n + k\left(\lambda_1 I_n + \sum_{i=1}^k A^\top W_i W_i^\top A\right)^{-1} \;\overset{\mathrm{a.s.}}{\longrightarrow}\; \lambda_0 I_n + \frac{1}{\beta}(A^\top A)^{-1},$$

and $\lambda_{\max}(\widetilde{B}_k) \overset{\mathrm{a.s.}}{\longrightarrow} \lambda_0 + \lambda_{\max}((A^\top A)^{-1})/\beta$. Hence, for a fixed $B > \lambda_0 + \lambda_{\max}((A^\top A)^{-1})/\beta$, we can define

$$B_k = \begin{cases} \widetilde{B}_k, & \text{if } \lambda_{\max}(\widetilde{B}_k) \le B, \\ B_{k-1}, & \text{else}, \end{cases} \qquad (11)$$

with $B_0 = \widetilde{B}_0$ an arbitrary SPD matrix such that $\lambda_{\max}(B_0) \le B$. The Woodbury formula can be used to efficiently update $B_k$; see Section 4 for details. Notice that $\widetilde{B}_k$ is a sample approximation of $(A^\top A)^{-1}$, which is similar to sample approximations that have been studied for covariance/precision matrix estimation, and that the regularization term $\lambda_1 I_n$ can be replaced by other appropriate SPD matrices [10].

3 Consistency results

In this section we study the consistency of the stochastic quasi-Newton and stochastic Newton methods. We start by briefly reviewing some notation and definitions from probability theory. Let $(\Omega, \mathcal{A}, P)$ be a probability space, and let $\{\mathcal{F}_k\}$ be a sequence of sub-$\sigma$-algebras of $\mathcal{A}$. Then $\{\mathcal{F}_k\}$ is called a filtration if $\mathcal{F}_k \subset \mathcal{F}_{k+1}$ for all $k \in \mathbb{N}$. A sequence of random variables $\{u_k\}$ on $(\Omega, \mathcal{A})$ is said to be adapted to $\{\mathcal{F}_k\}$ if $u_k$ is $\mathcal{F}_k$-measurable for all $k \in \mathbb{N}$. The $\sigma$-algebra generated by the random variables $u_i$, $i < k$, is denoted by $\sigma(u_i : i < k)$. We use $I_A$ to denote the indicator function of the set $A$. Our convergence results make use of the following quasimartingale convergence theorem. For a proof and further details on quasimartingales, we refer the interested reader to [11, 30].

Theorem 3.1 (Quasimartingale convergence theorem). Let $\{X_k\}$ be a sequence of non-negative random variables adapted to a filtration $\{\mathcal{F}_k\}$ and such that $\mathrm{E}\,X_k < \infty$ for all $k$. Let $Z_k = \mathrm{E}(X_{k+1} - X_k \mid \mathcal{F}_k)$. If

$$\sum_{k=0}^\infty \mathrm{E}\left[\,I_{Z_k \ge 0}\,(X_{k+1} - X_k)\,\right] < \infty,$$

then there is an integrable, non-negative random variable $X$ such that $X_k \overset{\mathrm{a.s.}}{\longrightarrow} X$.

Theorem 3.2 (Stochastic quasi-Newton method). Let $A \in \mathbb{R}^{m\times n}$ have rank $n$ and $b \in \mathbb{R}^m$. Let $\{W_k\}$ be a sequence of i.i.d. $m\times\ell$ random matrices with $\mathrm{E}(W_k W_k^\top) = \beta I_m$ and $\mathrm{E}(\|W_k W_k^\top\|^2) < \infty$. Define $\mathcal{F}_k = \sigma(W_i;\ i < k)$. Let $\{\alpha_k\}$ be a positive sequence of scalars, and let $\{B_k\}$ be a sequence of random $n\times n$ SPD matrices adapted to $\{\mathcal{F}_k\}$ such that for some $B > 0$,

$$\|B_k\| = \lambda_{\max}(B_k) \le B \ \text{a.s.} \ \forall k, \quad\text{and}\quad \sum_{k=1}^\infty \alpha_k \lambda_{\min}(B_k) = \infty \ \text{a.s.} \quad\text{and}\quad \sum_{k=1}^\infty \alpha_k^2 < \infty.$$

Set $\widehat{x} = (A^\top A)^{-1}A^\top b$, $\widehat{b} = A\widehat{x}$, and let $x_0 \in \mathbb{R}^n$ be an arbitrary initial vector. Define

$$x_k = x_{k-1} + \alpha_k s_k, \quad s_k = -B_k A^\top W_k W_k^\top (A x_{k-1} - b), \quad\text{and}\quad b_k = A x_{k-1}.$$

Then:
(i) The matrix $C = \mathrm{E}(W_k W_k^\top W_k W_k^\top)$ is defined and SPD.
(ii) If $e_k = \|b_k - \widehat{b}\|^2$, then $\mathrm{E}\,e_k < \infty$ and $\mathrm{E}(\|s_k\|^2) < \infty$ for all $k$.
(iii) $x_k \overset{\mathrm{a.s.}}{\longrightarrow} \widehat{x}$.

For a proof see Section 6.1. The following result shows that the stochastic Newton method does not necessarily converge to the LS solution $\widehat{x}$. For a proof see Section 6.2 in the appendix.

Theorem 3.3 (Stochastic Newton method). Let $A \in \mathbb{R}^{m\times n}$ have rank $n$ and $b \in \mathbb{R}^m$. Let $\{W_k\}$ be a sequence of i.i.d. $m\times\ell$ random matrices with $\mathrm{E}(W_k W_k^\top) = \beta I_m$. Let $H_k = (W_k^\top A)^\dagger W_k^\top$ and assume that $C = \mathrm{E}(H_k^\top H_k)$ is defined and finite. Let $\{\alpha_k\}$ be a positive sequence of scalars such that

$$\sum_{k=1}^\infty \alpha_k = \infty \quad\text{and}\quad \sum_{k=1}^\infty \alpha_k^2 < \infty.$$

Set $P = \mathrm{E}\,H_k$, $\widetilde{x} = (PA)^{-1}Pb$, and let $x_0 \in \mathbb{R}^n$ be an arbitrary initial vector. Define

$$x_k = x_{k-1} + \alpha_k s_k, \qquad s_k = -H_k(A x_{k-1} - b).$$

Then:
(i) The matrices $C$ and $ACA^\top$ are symmetric positive semi-definite, and $PA$ is SPD.
(ii) If $e_k = \|x_k - \widetilde{x}\|^2$, then $\mathrm{E}\,e_k < \infty$ and $\mathrm{E}(\|s_k\|^2) < \infty$ for all $k$.
(iii) $x_k \overset{\mathrm{a.s.}}{\longrightarrow} \widetilde{x}$.

Although the number of iterations cannot be too large in practice, the consistency results above for stochastic Newton and stochastic quasi-Newton are important for analyzing these new methods. In the rest of this section, we consider the feasibility of the convergence criteria for both methods and then provide some insight into the discrepancy between the stochastic Newton estimator $\widetilde{x}$ and the desired LS solution.

In addition to $\mathrm{E}(WW^\top) = \beta I_m$, Theorem 3.2 requires $\mathrm{E}(\|W_k W_k^\top\|^2)$ to be finite, while Theorem 3.3 requires $P$ to be finite. However, these additional assumptions are not too restrictive. In particular, for all of the choices of $W$ described in Section 2, these expectations are trivially finite because the entries of $W$ take only finitely many values. As for the choice of $B_k$ in the stochastic quasi-Newton method, the assumption that $B_k$ and $\alpha_k$ satisfy (10) is not difficult to meet. For example, when $\lambda_0 > 0$, the choice of $B_k$ in Section 2 satisfies these conditions trivially because $\lambda_0 \le \lambda_{\min}(B_k) \le \lambda_{\max}(B_k) \le B$.
On the other hand, even when $\lambda_0 = 0$, the largest singular value of $W$ is bounded for the choices of $W$ in this paper, and therefore there is a $C_0 > 0$ such that $\|A^\top W_i W_i^\top A\| \le C_0$ for all $i$ and

$$\left\|\lambda_1 I_n + \sum_{i=1}^k A^\top W_i W_i^\top A\right\| \le \lambda_1 + kC_0,$$

which implies that when $\lambda_0 = 0$ we have

$$\lambda_{\min}(B_k) \ge \frac{1}{\lambda_1/k + C_0} \ge \frac{1}{\lambda_1 + C_0} > 0,$$

so that (10) is again satisfied with $\alpha_k = 1/k$.

Finally, we examine the difference between the LS solution $\widehat{x}$, to which the stochastic quasi-Newton method converges, and the limit of the stochastic Newton iterates, namely,

$$\widetilde{x} = (PA)^{-1}Pb = \left(\mathrm{E}\!\left[(W^\top A)^\dagger W^\top A\right]\right)^{-1} \mathrm{E}\!\left[(W^\top A)^\dagger W^\top\right] b. \qquad (12)$$

The difference between $\widehat{x}$ and $\widetilde{x}$ depends on $P$, but we can say the following. Assuming that the noise has zero mean and covariance matrix $\mathrm{Var}(\epsilon) = \sigma^2 I_m$, we have

$$\mathrm{E}\,\widehat{x} = x_{\mathrm{true}}, \quad \mathrm{Var}(\widehat{x}) = \sigma^2 (A^\top A)^{-1}, \qquad \mathrm{E}\,\widetilde{x} = x_{\mathrm{true}}, \quad \mathrm{Var}(\widetilde{x}) = \sigma^2 (PA)^{-1} P P^\top (A^\top P^\top)^{-1}.$$

This shows that $\widehat{x}$ and $\widetilde{x}$ are both unbiased estimators of $x_{\mathrm{true}}$, but by the Gauss-Markov theorem, $\widehat{x}$ is expected to have smaller variance. Consider the following simple example.
Example. Consider the LS problem where

$$A = \begin{bmatrix} \mu & 0 \\ 0 & 1 \\ 1 & -1 \end{bmatrix} \quad\text{and}\quad b = \begin{bmatrix} \nu \\ 0 \\ 1 \end{bmatrix},$$

for some fixed $\mu, \nu \in \mathbb{R}$ with $\mu \neq 0$. We compare the LS solution and the solution obtained via stochastic Newton with random Kaczmarz vectors $w \in \mathbb{R}^{m\times 1}$ (see Section 2). It is easy to see that in this case we have $P = \frac{1}{3}A^\top H$ with $H = \mathrm{diag}\{1/\|a_1\|^2,\ 1/\|a_2\|^2,\ 1/\|a_3\|^2\}$, where the $a_i$ are the rows of $A$. It follows that $\widetilde{x}$ minimizes the weighted LS functional $(b - Ax)^\top H (b - Ax)$. We obtain the following solutions:

$$\widehat{x} = \frac{1}{2\mu^2+1}\begin{bmatrix} 2\mu\nu + 1 \\ \mu\nu - \mu^2 \end{bmatrix} \quad\text{and}\quad \widetilde{x} = \frac{1}{4}\begin{bmatrix} 3\nu/\mu + 1 \\ \nu/\mu - 1 \end{bmatrix},$$

respectively. The covariance matrices of $\widehat{x}$ and $\widetilde{x}$ are

$$\mathrm{Var}(\widehat{x}) = \frac{\sigma^2}{2\mu^2+1}\begin{bmatrix} 2 & 1 \\ 1 & \mu^2+1 \end{bmatrix}, \qquad \mathrm{Var}(\widetilde{x}) = \frac{\sigma^2}{16\mu^2}\begin{bmatrix} 2\mu^2+9 & 2\mu^2+3 \\ 2\mu^2+3 & 10\mu^2+1 \end{bmatrix}.$$

It is clear that the variances of the components of $\widetilde{x}$ can be much larger than those of $\widehat{x}$. The solution $\widetilde{x}$ would have smaller variance if the covariance matrix of the noise were proportional to $H^{-1}$ instead of $I_m$.
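The closed forms above are easy to check numerically. The sketch below (ours) recomputes $\widehat{x}$ as the ordinary LS solution and $\widetilde{x}$ as the $H$-weighted LS solution for sample values of $\mu$ and $\nu$, and compares them with the stated expressions:

```python
import numpy as np

mu, nu = 2.0, 3.0
A = np.array([[mu, 0.0], [0.0, 1.0], [1.0, -1.0]])
b = np.array([nu, 0.0, 1.0])

# Ordinary LS solution x_hat
x_hat = np.linalg.lstsq(A, b, rcond=None)[0]

# Weighted LS solution x_tilde, with H = diag(1/||a_i||^2) built from the rows of A
H = np.diag(1.0 / np.sum(A**2, axis=1))
x_tilde = np.linalg.solve(A.T @ H @ A, A.T @ H @ b)

# Closed-form expressions from the example
x_hat_cf = np.array([2*mu*nu + 1, mu*nu - mu**2]) / (2*mu**2 + 1)
x_tilde_cf = np.array([3*nu/mu + 1, nu/mu - 1]) / 4

print(np.allclose(x_hat, x_hat_cf), np.allclose(x_tilde, x_tilde_cf))
print("omega =", np.linalg.norm(x_hat - x_tilde))
```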
Figure 1 shows the error $\omega(\mu,\nu) = \|\widehat{x} - \widetilde{x}\|$ for various choices of $\mu$ and $\nu$. The left panel shows that $\omega \to \infty$ as $\mu \to 0$, which makes sense as the first row of $A$ becomes all zeros. The right panel shows that even for $\mu \neq 0$, a significant error can be incurred by varying $\nu$, and therefore the "observation vector" $b$.

Figure 1: Error $\omega(\mu,\nu)$ of the stochastic Newton solution $\widetilde{x}$ compared to $\widehat{x}$. In the plot on the left, $\nu = 10$ and we vary $\mu$; notice that a pole exists at $\mu = 0$, where the relative error becomes arbitrarily large. The plot on the right illustrates the impact of varying $\nu$ for fixed $\mu = 1$.

Although the difference between $\widetilde{x}$ and $\widehat{x}$ can be significant, there are cases where they are identical. Some previous works have studied the problem of how close $\widetilde{x}$ is to $\widehat{x}$, e.g., [43, 9]. However, their assumptions do not apply to our matrix $P$. For our problem, $\widetilde{x} = \widehat{x}$ when the linear system is consistent, since in this case $PA\widehat{x} = Pb$, or when $(A^\top A)^{-1}A^\top = (PA)^{-1}P$, which is equivalent to $\mathrm{null}(A^\top) \subseteq \mathrm{null}(P)$, since $A\widehat{x} - b \in \mathrm{null}(A^\top)$ implies that $P(A\widehat{x} - b) = 0$. In the example above, this occurs when the system is consistent, i.e., when $\nu = \mu$.

4 Numerical results

In this section we showcase the numerical performance of the discussed methods. A general framework for our proposed methods is summarized in Algorithm 1, where the main distinction among the stochastic gradient, Newton, and quasi-Newton methods is the choice of $B_k$ (see line 5). For the stochastic gradient method, $B_k = I_n$, and for the stochastic Newton method, $B_k = (A^\top W_k W_k^\top A)^\dagger$. For the stochastic quasi-Newton method, we use the choices of $B_k$ described in Section 2 with $\lambda_1 > 0$ and $\lambda_0 = 0$. To avoid large matrix inversions when computing $B_k$, we use the Woodbury formula to iteratively update this choice of $B_k$ as

$$B_k = \frac{k}{k-1}\,B_{k-1}\left(I_n - A^\top W_k\left((k-1)I_\ell + W_k^\top A B_{k-1} A^\top W_k\right)^{-1} W_k^\top A B_{k-1}\right), \qquad (13)$$

and for $k = 1$,

$$B_1 = \frac{1}{\lambda_1}\left(I_n - A^\top W_1\left(\lambda_1 I_\ell + W_1^\top A A^\top W_1\right)^{-1} W_1^\top A\right). \qquad (14)$$

Update formulas (13) and (14) require the inversion of an $\ell\times\ell$ matrix, but $\ell$, the number of columns of $W$, is assumed small.
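A direct transcription of (13) and (14) is straightforward; the sketch below (ours, with illustrative names) updates $B_k$ from the sampled matrix $W_k^\top A$, inverting only an $\ell\times\ell$ system per iteration:

```python
import numpy as np

def update_B(B_prev, WtA, k, lam1):
    """One Woodbury update of the inverse Hessian approximation, eqs. (13)-(14).

    B_prev : B_{k-1} (n x n), ignored when k == 1
    WtA    : the ell x n matrix W_k^T A
    """
    ell, n = WtA.shape
    if k == 1:
        # B_1 = (1/lam1) (I - A^T W (lam1 I + W^T A A^T W)^{-1} W^T A), eq. (14)
        S = lam1 * np.eye(ell) + WtA @ WtA.T
        return (np.eye(n) - WtA.T @ np.linalg.solve(S, WtA)) / lam1
    # Eq. (13): only the ell x ell matrix S is inverted
    BAtW = B_prev @ WtA.T                    # B_{k-1} A^T W_k  (n x ell)
    S = (k - 1) * np.eye(ell) + WtA @ BAtW   # (k-1) I + W^T A B_{k-1} A^T W
    B = B_prev - BAtW @ np.linalg.solve(S, WtA @ B_prev)
    return (k / (k - 1)) * B
```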
Algorithm 1: Stochastic approximation method
1:  choose initial $x_0$, $\lambda_1 > 0$, a suitable sequence $\{\alpha_k\}$, and set $k = 1$
2:  while not converged do
3:    choose realization $W_k$
4:    compute $W_k^\top A$ and $W_k^\top b$
5:    update $B_k$
6:    get search direction $s_k = -B_k \nabla f_{W_k}(x_{k-1})$
7:    update $x_k = x_{k-1} + \alpha_k s_k$
8:    set $k = k+1$
9:  end while
10: return $\widehat{x} = x_k$

Following most stopping criteria developed for stochastic optimization and stochastic learning methods [40, 5, 39], we rely on heuristic monitoring of $x_k$ and $f_k = f_{W_k}(x_k)$ to determine early stopping for our methods. We consider three stopping criteria: a given maximum number of iterations, a tolerance on changes in $x_k$, and a tolerance on the improvement in $f_k$; see [12]. For the last two, we use

$$\|x_{k-1} - x_k\|_\infty < \sqrt{\mathrm{tol}}\,(1 + \|x_k\|_\infty) \quad\text{and}\quad \left|\bar{f}_{k-1} - \bar{f}_k\right| < \mathrm{tol}\,(1 + \bar{f}_{k-1}),$$

where tol is a given tolerance. Here, for some number $s \ge 1$, $\bar{f}_k$ denotes the simple moving average $\bar{f}_k = \mathrm{mean}(f_{k-s}, \ldots, f_k)$. In our experiments we choose $s = 9$.
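Putting Algorithm 1 together with the Woodbury update sketched above (the `update_B` helper), a minimal stochastic quasi-Newton driver might look as follows; the tolerance, iteration cap, and sampler interface are our illustrative choices:

```python
import numpy as np

def sqn_ls(A, b, sample_W, lam1=1e-2, tol=1e-6, max_iter=1000, s=9):
    """Stochastic quasi-Newton for the LS problem: Algorithm 1 with B_k from (13)-(14)."""
    m, n = A.shape
    x, B = np.zeros(n), None
    f_hist = []
    for k in range(1, max_iter + 1):
        W = sample_W()                          # line 3: choose realization W_k
        WtA, Wtb = W.T @ A, W.T @ b             # line 4: sampled quantities
        B = update_B(B, WtA, k, lam1)           # line 5: Woodbury update of B_k
        grad = WtA.T @ (WtA @ x - Wtb)          # sample gradient (5)
        x_new = x - (1.0 / k) * (B @ grad)      # lines 6-7 with alpha_k = 1/k
        f_hist.append(0.5 * np.linalg.norm(WtA @ x_new - Wtb) ** 2)
        # stopping tests on x_k and on the moving average of f_k
        if np.max(np.abs(x_new - x)) < np.sqrt(tol) * (1 + np.max(np.abs(x_new))):
            return x_new
        if k > s + 1:
            fbar_prev = np.mean(f_hist[-s-2:-1])
            fbar = np.mean(f_hist[-s-1:])
            if abs(fbar_prev - fbar) < tol * (1 + fbar_prev):
                return x_new
        x = x_new
    return x
```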
Next, we provide two experiments to illustrate our methods. In Experiment 1, we use a linear regression problem to illustrate various properties of our algorithms and validate the theoretical results. In Experiment 2, we provide some results using a realistic example from machine learning.

Experiment 1
We consider a linear regression problem, where $A$ is a random matrix with elements drawn from a standard normal distribution. We set $b = Ax_{\mathrm{true}} + \epsilon$, where the additive noise $\epsilon$ is also standard normal. We choose $W$ to be a block Kaczmarz matrix with block size $\ell = 625$ and utilize the harmonic sequence $\alpha_k = 1/k$ as the step size strategy. We select the inverse Hessian approximation $B_k$ using the update formulas given in (13) and (14) with a small $\lambda_1$ and start at a random initial guess $x_0$.

First, we compare the performance of the stochastic quasi-Newton and Newton methods. In Figure 2 we provide plots of relative errors. In the top left panel, the relative errors are computed as $\|x_k - \widehat{x}\|/\|\widehat{x}\|$, where $x_k$ are stochastic quasi-Newton (SQN) iterates, and in the top right panel, the relative errors are computed as $\|x_k - \widetilde{x}\|/\|\widetilde{x}\|$, where $x_k$ are stochastic Newton (SN) iterates. These plots illustrate convergence of the SQN and SN iterates to $\widehat{x}$ and $\widetilde{x}$, respectively, as proved in Section 3. Notice that stochastic Newton exhibits much slower convergence to $\widetilde{x}$ than stochastic quasi-Newton does to $\widehat{x}$: SN requires 20,000 iterations to reach the relative error level that the SQN iterates attain after 175 iterations, and SQN reaches a good relative error after only 22 iterations.

In the bottom panel of Figure 2, we provide reconstruction errors relative to the true solution, $\|x_k - x_{\mathrm{true}}\|/\|x_{\mathrm{true}}\|$, for both methods, which demonstrates that SQN is faster than SN at providing a good approximation of the true solution. For this experiment the relative difference $\|\widetilde{x} - \widehat{x}\|/\|\widehat{x}\|$ is small. We omit the results for the stochastic gradient method due to poor performance. It is worth noting that the moderate size of this problem still allows one to use a QR solver (e.g., Matlab's "backslash") to solve the LS problem, which takes about 6 seconds, whereas SQN requires about 12 seconds to run $k = 200$ iterations.

Figure 2: The top left panel contains relative errors for stochastic quasi-Newton iterates, computed as $\|x_k - \widehat{x}\|/\|\widehat{x}\|$, where $\widehat{x}$ is the LS solution. The top right panel contains relative errors for stochastic Newton iterates, computed as $\|x_k - \widetilde{x}\|/\|\widetilde{x}\|$, where $\widetilde{x}$ is defined in Theorem 3.3. Notice that we display 20,000 iterations for SN and only 200 iterations for SQN. The bottom panel contains relative errors, $\|x_k - x_{\mathrm{true}}\|/\|x_{\mathrm{true}}\|$, for both SQN and SN.

Last, we investigate the performance of four different choices of $W$: (i) Block Kaczmarz, the generalized Kaczmarz method with $Q = I_m$ partitioned uniformly into blocks of $\ell_i = \ell$ columns; (ii) Kaczmarz, which is similar to the generalized Kaczmarz method with $Q = I_m$, but instead of a preset partitioning of the matrix $Q$ into $Q_i$ matrices, this method samples $\ell$ columns of $Q$ randomly at each iteration to generate $W$; (iii) Sparse Rademacher with $p = \ell$; and (iv) Sparse Random with a small value of $\psi$. These choices of $W$ are defined in Section 2.
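A small-scale analogue of this experiment can be scripted from the earlier sketches (`block_kaczmarz` and `sqn_ls`); the dimensions below are shrunk for illustration and are not those of the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, ell = 2000, 50, 100
A = rng.standard_normal((m, n))
x_true = np.ones(n)
b = A @ x_true + rng.standard_normal(m)

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]      # reference LS solution
sampler = lambda: block_kaczmarz(rng, m, ell)    # fresh W_k each iteration
x_sqn = sqn_ls(A, b, sampler, lam1=1e-2, max_iter=200)

print("rel. error:", np.linalg.norm(x_sqn - x_ls) / np.linalg.norm(x_ls))
```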
In Figure 3, we provide the function values $f(x_k)$ with respect to the number of row accesses of $A$, since row accesses of $A$ may be the computational bottleneck for large problems. We observe that all choices of $W$ perform similarly, with Kaczmarz and block Kaczmarz having a slight advantage. The performance of the stochastic quasi-Newton method is highly dependent on the underlying problem (e.g., structure in the matrix $A$ and vector $b$) and the realizations of $W$. Empirically, using regression data sets from the UCI Machine Learning Repository [28], we observe the best performance with block Kaczmarz and sparse Rademacher (data not shown).

Figure 3: Comparison of function values for different choices of $W$, as a function of the number of row accesses of $A$. The black dotted line corresponds to the LS error $f(\widehat{x})$.

Experiment 2
Next, we investigate the use of our stochastic quasi-Newton method for solving the large linear LS problems that arise in extreme learning machines (ELMs). ELM is a machine learning technique that uses random hidden nodes or neurons in a feedforward network to mimic biological learning techniques. The literature on ELM in the machine learning community is vast, with cited benefits that include higher scalability, less computational complexity, no requirement of tuning, and smaller training errors than generic machine learning techniques. ELM is commonly used for clustering, regression, and classification. Full details and comparisons are beyond the scope of this paper, and we refer the interested reader to papers such as [21, 19, 22, 24, 23, 20] and references therein.

At the core of ELM is a very large and potentially dynamically growing linear regression problem. In this experiment, we investigate the use of stochastic algorithms for efficiently solving these LS problems. In particular, we consider the problem of handwritten digit classification using the MNIST database [7], which contains 60,000 training images and 10,000 testing images of handwritten digits ranging from 0 to 9.
Each image is 28 × 28 pixels and is converted into a vector $\xi \in \mathbb{R}^{784}$ (i.e., 784 features).

We begin with a brief description of the classification problem for the MNIST dataset. Suppose we are given a set of $m$ examples in the form of a training set $S = \{(\xi_1, c_1), \ldots, (\xi_m, c_m)\}$, where $\xi_i \in \mathbb{R}^{784}$ and $c_i$ takes values from the set of classes $\mathcal{C} = \{0, 1, \ldots, 9\}$. Consider an ELM with a hidden layer of $n$ nodes; then the goal is to solve a LS problem of the form

$$\min_X \|HX - Y\|_F^2, \qquad (15)$$

where the hidden-layer output matrix is defined as

$$H = \begin{bmatrix} h(\xi_1) \\ \vdots \\ h(\xi_m) \end{bmatrix} \in \mathbb{R}^{m\times n}, \quad\text{with}\quad h(\xi) = \begin{bmatrix} h_1(\xi) & \cdots & h_n(\xi) \end{bmatrix}$$

being the output (row) vector of the hidden layer with respect to the input $\xi$; $X = [x_1 \cdots x_{10}]$, where $x_j \in \mathbb{R}^n$ contains the desired output weights for class $j$; and the training data target matrix $Y \in \mathbb{R}^{m\times 10}$ takes entries

$$y_{ij} = \begin{cases} 1, & \text{if } c_i = j - 1, \\ -1, & \text{else.} \end{cases}$$

For this example, $h(\xi)$ can be interpreted as a map from the image pixel space to the $n$-dimensional hidden-layer feature space. Although various activation functions could be used, we employ a standard choice of sigmoid additive hidden nodes with

$$h_j(\xi) = G(d_j, \delta_j, \xi) = 1/\!\left(1 + \exp\!\left(-(d_j^\top \xi + \delta_j)\right)\right),$$

where all of the hidden-node parameters $(d_j, \delta_j)_{j=1}^n$ are randomly generated based on a uniform distribution [22]. For our experiments, we set the number of hidden neurons to be $n = 300$.
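As an illustration of this setup (ours; the helper names, the uniform range of the hidden-node parameters, and the dense solver are assumptions for the sketch), the hidden-layer matrix $H$, the target matrix $Y$, and the classification rule used in the experiments can be assembled as follows:

```python
import numpy as np

def elm_features(Xi, D, delta):
    """Hidden-layer output matrix H: sigmoid of random affine maps, one row per example."""
    return 1.0 / (1.0 + np.exp(-(Xi @ D + delta)))  # Xi: m x 784, D: 784 x n

def elm_train(Xi, labels, n_hidden=300, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.uniform(-1.0, 1.0, size=(Xi.shape[1], n_hidden))  # hidden weights d_j
    delta = rng.uniform(-1.0, 1.0, size=n_hidden)             # hidden biases delta_j
    H = elm_features(Xi, D, delta)
    Y = np.where(labels[:, None] == np.arange(10)[None, :], 1.0, -1.0)  # m x 10 targets
    X = np.linalg.lstsq(H, Y, rcond=None)[0]  # solve (15); SQN could be used instead
    return D, delta, X

def elm_predict(Xi, D, delta, X):
    """Predicted class: argmax over j of h(xi) x_j."""
    return np.argmax(elm_features(Xi, D, delta) @ X, axis=1)
```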
The main computational work of ELM is to solve (15). Regularized or constrained solutions have been investigated (e.g., [22, 29, 2]). However, our focus is on solving the unconstrained LS problem efficiently and for very large sets of training data. In order to generate larger datasets, we performed multiple random rotations of the original 60,000 training images. More specifically, each image was rotated by $20(\eta - 0.5)$ degrees, where $\eta$ is a random number drawn from a beta distribution with shape parameters equal to 2. In our experiments, we consider up to 15 random rotations per image, resulting in up to 900,000 training images. Notice that as the number of training images increases, the number of rows of $H$ increases accordingly, while the number of columns remains the same.

We consider three approaches to solve (15) and compare CPU timings. In the original implementation of ELM [24, 23], the LS solution was computed as $\widehat{X} = H^\dagger Y$, where $H^\dagger$ is the Moore-Penrose pseudoinverse of $H$. We denote this approach "PINV". Another approach to solve large (often sparse) LS problems is to use an iterative method such as LSQR [35, 34], but since the LS problem needs to be solved for multiple right-hand sides (here, 10 solves), we use a global least squares method (Gl-LSQR) [42] with a maximum of 50 iterations and a tight residual tolerance. It was experimentally shown in [42] that Gl-LSQR is more effective and less expensive than LSQR applied to each right-hand side. We use our stochastic quasi-Newton (SQN) method where $W$ corresponds to the sparse Rademacher matrix with $\ell = 50$ and a small $\lambda_1$. We use a maximum of 1,000 iterations, a small stopping tolerance tol, and an initial guess of $0$. Since the $B_k$ matrices only depend on $H$ and $W$, SQN can be applied to multiple right-hand sides simultaneously.

Each LS solver is repeated 20 times in Matlab R2015b on a MacBook Pro with a 2.9 GHz Intel Core i7 and 8 GB of memory, and in Figure 4, we provide the median and 5th-95th percentiles of the CPU times vs. the number of training images (i.e., the number of rows of $H$). It is evident that for smaller training sets, all three methods perform similarly, but as the number of training images increases, SQN quickly surpasses PINV and Gl-LSQR in terms of faster CPU time. For various numbers of training data, we provide in Table 1 the mean and standard deviation of the relative reconstruction error for the SQN estimate, $\mathrm{rel} = \|X_{\mathrm{SQN}} - \widehat{X}\|_F / \|\widehat{X}\|_F$, and of the number of SQN iterations, $k$. Our results demonstrate that SQN does not necessarily provide the most accurate solutions; however, it can be used to achieve sufficiently good solutions in an efficient manner.

Figure 4: CPU times (median and 5th-95th percentiles) for solving LS problem (15) using stochastic quasi-Newton (SQN), global LSQR (Gl-LSQR), and the Moore-Penrose pseudoinverse (PINV) for various numbers of training images $m$.

Table 1: For various numbers of training images ($m = 60{,}000$, $120{,}000$, $300{,}000$, $600{,}000$, and $900{,}000$), the mean and standard deviation of the relative reconstruction errors rel and the iteration counts $k$ for SQN.
Next we test the performance of these estimates for classification of the MNIST testing dataset. That is, once computed, the output weights $X$ can be used to classify images in the following way. For each test image, the predicted class is given by

$$\text{Class of } \xi = \arg\max_j\ h(\xi)\,x_j.$$

In Figure 5 we provide a visualization of the computed classifications for the 10,000 testing images, where the accuracy values in the titles are calculated as $1 - r/10{,}000$, where $r$ is the number of misclassified images. An accuracy value close to 1 corresponds to a good performance of the classifier. These results correspond to training on 60,000 images, and the testing set was sorted by class for easier visualization. Notice that in Figure 5 the misclassified images are almost identical for all three methods, and the classification accuracy for SQN is only slightly smaller than that of PINV and Gl-LSQR. Thus, we have shown that our SQN method can achieve classification performance comparable to PINV and Gl-LSQR with much faster learning speed.

It is worth noting that the matrices considered here, though large, can still be loaded into memory. For problems where this is not the case (e.g., data too large or being dynamically generated [45]), PINV and Gl-LSQR would not be feasible, while SQN could still be used.

Figure 5: Classification (with corresponding accuracy) for the MNIST test images after training on 60,000 images, using different LS solvers.

5 Conclusions

In this paper we introduced a stochastic Newton and a stochastic quasi-Newton method for solving very large linear least-squares problems. New theoretical results show that under mild regularity conditions on the distributions, the stochastic quasi-Newton iterates converge to the unique LS solution $\widehat{x}$, while the stochastic Newton iterates converge to a different estimator of $x_{\mathrm{true}}$, which is still unbiased but is likely to have larger variance. We provide implementation details for choices of the random variables $W_k$, step length parameters $\alpha_k$, and stochastic inverse Hessian approximations $B_k$. Our numerical experiments validate our theory and also show the potential benefits of these methods for machine learning applications where the data sets of interest are typically extremely large. Furthermore, by allowing sparse distributions, our approaches can handle scenarios where the entire matrix $A$ is not available or full matrix-vector multiplications with $A$ are not feasible.

This work opens the doors to a plethora of new methods, advancements, and implementations for stochastic Newton and quasi-Newton methods. Current work includes the investigation of alternative $B_k$ matrices, adaptive choices of $W$ that can exploit matrix properties, efficient parallel implementations, and extensions to nonlinear problems.
Acknowledgement: The authors are grateful to Qi Long for initial discussions and Arvind Saibaba for helpful comments that have improved the manuscript. We are also grateful to Eldad Haber for pointing us to the ELM problem and to the Institute for Mathematics and its Applications for nurturing this research collaboration.

6 Consistency proofs

6.1 Proof of Theorem 3.2
(i) Let $v \in \mathbb{R}^m$ and assume that $W$ has the same distribution as $W_k$. Then $\|WW^\top v\| \le \|WW^\top\|\,\|v\|$, and therefore $\mathrm{E}(v^\top WW^\top WW^\top v) < \infty$. Furthermore,

$$v^\top C v = \mathrm{E}(v^\top WW^\top WW^\top v) = \mathrm{E}(\|WW^\top v\|^2) \ge 0,$$

and $v^\top C v = 0$ iff $\|WW^\top v\| = 0$ a.s., which happens iff $WW^\top v = 0$ a.s., in which case $\mathrm{E}(WW^\top)v = \beta I_m v = 0$. This implies $v = 0$, and thus $C$ is defined and SPD.

(ii) Note that

$$s_k = -B_k A^\top W_k W_k^\top A(x_{k-1} - \widehat{x}) - B_k A^\top W_k W_k^\top r, \qquad (16)$$

where $r = \widehat{b} - b$. Therefore,

$$b_{k+1} - \widehat{b} = b_k - \widehat{b} - \alpha_k A B_k A^\top W_k W_k^\top(b_k - \widehat{b}) - \alpha_k A B_k A^\top W_k W_k^\top r,$$

and there are positive constants $a$, $b_k$ and $c_k$ such that

$$e_{k+1} \le a\,e_k + b_k B^2 \|W_k W_k^\top\|^2 e_k + c_k B^2 \|W_k W_k^\top\|^2$$

for all $k$. Since $W_k$ and $e_k$ are independent and $\mathrm{E}\,e_1 < \infty$, it follows that $\mathrm{E}\,e_k < \infty$ for all $k$. This together with (16) implies that $\mathrm{E}(\|s_k\|^2) < \infty$ for all $k$.

(iii) By definition,

$$e_{k+1} - e_k = 2\alpha_k (b_k - \widehat{b})^\top A s_k + \alpha_k^2 s_k^\top A^\top A s_k,$$

and therefore

$$\mathrm{E}(e_{k+1} - e_k \mid \mathcal{F}_k) = 2\alpha_k (b_k - \widehat{b})^\top A\,\mathrm{E}(s_k \mid \mathcal{F}_k) + \alpha_k^2\,\mathrm{E}(s_k^\top A^\top A s_k \mid \mathcal{F}_k).$$

Since $A^\top r = 0$, (16) leads to

$$\mathrm{E}(s_k \mid \mathcal{F}_k) = -B_k A^\top\,\mathrm{E}(W_k W_k^\top)(A x_{k-1} - A\widehat{x}) = -\beta B_k A^\top(b_k - \widehat{b}). \qquad (17)$$

Let $\lambda_A := \lambda_{\max}(AA^\top)$; then

$$s_k^\top A^\top A s_k = (b_k - b)^\top W_k W_k^\top A B_k A^\top A B_k A^\top W_k W_k^\top(b_k - b) \le \lambda_A^2 B^2 (b_k - b)^\top W_k W_k^\top W_k W_k^\top(b_k - b),$$

and therefore, with $\lambda_C := \lambda_{\max}(C)$ and $\|v\|_C^2 := v^\top C v$,

$$\mathrm{E}(s_k^\top A^\top A s_k \mid \mathcal{F}_k) \le \lambda_A^2 B^2 \|b_k - b\|_C^2 \le 2\lambda_A^2 B^2 \lambda_C\,e_k + 2\lambda_A^2 B^2 \|r\|_C^2. \qquad (18)$$

Equations (17) and (18) lead to

$$\mathrm{E}(e_{k+1} - e_k \mid \mathcal{F}_k) \le -2\beta\alpha_k (b_k - \widehat{b})^\top A B_k A^\top(b_k - \widehat{b}) + 2\alpha_k^2\lambda_A^2 B^2\lambda_C\,e_k + 2\alpha_k^2\lambda_A^2 B^2 \|r\|_C^2,$$

and subtracting $2\alpha_k^2\lambda_A^2 B^2\lambda_C\,e_k$ from both sides yields

$$\mathrm{E}\big(e_{k+1} - (1 + 2\lambda_A^2 B^2\lambda_C\,\alpha_k^2)\,e_k \mid \mathcal{F}_k\big) \le -2\beta\alpha_k (b_k - \widehat{b})^\top A B_k A^\top(b_k - \widehat{b}) + 2\lambda_A^2 B^2\alpha_k^2 \|r\|_C^2.$$

Define $\gamma_k = \prod_{i=1}^{k-1}\left(1 + 2\lambda_A^2 B^2\lambda_C\,\alpha_i^2\right)^{-1} \le 1$. Since

$$0 \le -\log(\gamma_k) \le 2\lambda_A^2 B^2\lambda_C \sum_{i=1}^\infty \alpha_i^2 < \infty$$

for any $k$, it follows that $\{\gamma_k\}$ is a decreasing sequence converging to some $\gamma > 0$. Let $\widetilde{e}_k = \gamma_k e_k$ and $Z_k = \mathrm{E}(\widetilde{e}_{k+1} - \widetilde{e}_k \mid \mathcal{F}_k)$; then

$$\mathrm{E}(\widetilde{e}_{k+1} - \widetilde{e}_k \mid \mathcal{F}_k) \le -2\beta\alpha_k\gamma_{k+1}(b_k - \widehat{b})^\top A B_k A^\top(b_k - \widehat{b}) + 2\lambda_A^2 B^2\gamma_{k+1}\alpha_k^2\|r\|_C^2 \le 2\lambda_A^2 B^2\alpha_k^2\|r\|_C^2, \qquad (19)$$

and also

$$\sum_{k=1}^\infty \mathrm{E}\left[I_{Z_k \ge 0}(\widetilde{e}_{k+1} - \widetilde{e}_k)\right] \le 2\lambda_A^2 B^2\|r\|_C^2 \sum_{k=1}^\infty \alpha_k^2 < \infty.$$

Theorem 3.1 now implies that $\{\widetilde{e}_k\}$ converges a.s., and since $\gamma_k$ converges to a nonzero value, it follows that $\{e_k\}$ also converges a.s. The final step is to show that $e_k \overset{\mathrm{a.s.}}{\longrightarrow} 0$.
It follows from (19) that

$$2\beta\sum_{k=1}^n \alpha_k\gamma_{k+1}\,\mathrm{E}\!\left(\lambda_{\min}(B_k)\|A^\top(b_k - \widehat{b})\|^2\right) \le 2\beta\sum_{k=1}^n \alpha_k\gamma_{k+1}\,\mathrm{E}\!\left((b_k - \widehat{b})^\top A B_k A^\top(b_k - \widehat{b})\right) \le 2\lambda_A^2 B^2\|r\|_C^2\sum_{k=1}^n \alpha_k^2 + \mathrm{E}\,\widetilde{e}_1 < \infty$$

uniformly in $n$. Therefore,

$$\gamma\sum_{k=1}^\infty \alpha_k\lambda_{\min}(B_k)\|A^\top(b_k - \widehat{b})\|^2 \le \sum_{k=1}^\infty \alpha_k\gamma_{k+1}\lambda_{\min}(B_k)\|A^\top(b_k - \widehat{b})\|^2 < \infty \quad\text{a.s.}$$

Since $\sum_{k=1}^\infty \alpha_k\lambda_{\min}(B_k) = \infty$ a.s., it follows that for almost all $\omega \in \Omega$ there is a subsequence $n_k(\omega)$ such that $\|A^\top(b_{n_k(\omega)}(\omega) - \widehat{b})\| \to 0$. Since $A^\top(b_k - \widehat{b}) = A^\top A(x_{k-1} - \widehat{x})$ and $A^\top A$ is SPD, this implies $\|x_{n_k(\omega)-1}(\omega) - \widehat{x}\| \to 0$, which in turn implies $e_{n_k(\omega)}(\omega) \to 0$. Therefore, since $\{e_k\}$ converges a.s., we also have $e_k \overset{\mathrm{a.s.}}{\longrightarrow} 0$.
Hence, since $A$ has full column rank, $x_k \overset{\mathrm{a.s.}}{\longrightarrow} \widehat{x}$. ∎

6.2 Proof of Theorem 3.3

The proof is a slight modification of that of Theorem 3.2. Define $\mathcal{F}_k = \sigma(W_i;\ i < k)$.

(i) Let $v \in \mathbb{R}^m$. Since $v^\top C v = \mathrm{E}(\|H_k v\|^2) \ge 0$, it follows that $C$ is positive semi-definite, and therefore so is $ACA^\top$. Since $(W_k^\top A)^\dagger W_k^\top A$ is symmetric, $PA$ is symmetric. Let $v \in \mathbb{R}^n$. Using properties of the pseudoinverse gives

$$v^\top P A v = v^\top\,\mathrm{E}(H_k A)\,v = \mathrm{E}(\|H_k A v\|^2) \ge 0,$$

where equality holds iff $H_k A v = 0$ a.s., and since $\mathrm{E}(W_k W_k^\top) = \beta I_m$ and $A$ has full column rank, it follows that $v = 0$. Hence $PA$ is SPD.

(ii) Note that

$$s_k = -H_k A(x_{k-1} - \widetilde{x}) - H_k r, \qquad (20)$$

where $r = A\widetilde{x} - b$. Therefore,

$$x_k - \widetilde{x} = x_{k-1} - \widetilde{x} - \alpha_k H_k A(x_{k-1} - \widetilde{x}) - \alpha_k H_k r,$$

and since $H_k A$ is an orthogonal projection matrix, it follows that

$$e_k \le 2(1 + \alpha_k)^2 e_{k-1} + 4\alpha_k^2\|H_k r\|^2.$$

This then leads to

$$\mathrm{E}\,e_k \le 2(1 + \alpha_k)^2\,\mathrm{E}\,e_{k-1} + 4\alpha_k^2\,r^\top C r,$$

which implies that $\mathrm{E}\,e_k < \infty$ and $\mathrm{E}(\|s_k\|^2) < \infty$ for all $k$.

(iii) The idea is again to use the quasimartingale convergence theorem. Since $W_k$ and $\mathcal{F}_k$ are independent, and $Pr = P(A\widetilde{x} - b) = PA\widetilde{x} - Pb = 0$,

$$\mathrm{E}(e_k - e_{k-1} \mid \mathcal{F}_k) = 2\alpha_k(x_{k-1} - \widetilde{x})^\top\,\mathrm{E}(s_k \mid \mathcal{F}_k) + \alpha_k^2\,\mathrm{E}(\|s_k\|^2 \mid \mathcal{F}_k) = -2\alpha_k\|x_{k-1} - \widetilde{x}\|_{PA}^2 + \alpha_k^2\,\mathrm{E}(\|s_k\|^2 \mid \mathcal{F}_k). \qquad (21)$$

To bound the second term of (21) we use again the fact that $H_k A$ is a projection matrix to obtain

$$\mathrm{E}(\|s_k\|^2 \mid \mathcal{F}_k) \le 2(A\widetilde{x} - b)^\top C(A\widetilde{x} - b) + 2\|x_{k-1} - \widetilde{x}\|^2 \le 2c_0 + 2\|x_{k-1} - \widetilde{x}\|^2, \qquad (22)$$

where $c_0 = \lambda_{\max}(C)\|A\widetilde{x} - b\|^2$. Therefore, (21) can be bounded as

$$\mathrm{E}(e_k - e_{k-1} \mid \mathcal{F}_k) \le -2\alpha_k\|x_{k-1} - \widetilde{x}\|_{PA}^2 + 2c_0\alpha_k^2 + 2\alpha_k^2 e_{k-1},$$

which yields

$$\mathrm{E}\big(e_k - (1 + 2\alpha_k^2)e_{k-1} \mid \mathcal{F}_k\big) \le -2\alpha_k\|x_{k-1} - \widetilde{x}\|_{PA}^2 + 2c_0\alpha_k^2 \le 2c_0\alpha_k^2. \qquad (23)$$

Let $\nu_k = \prod_{i=1}^k(1 + 2\alpha_i^2)^{-1} < 1$. The sequence $\{\nu_k\}$ converges to some $\nu > 0$ because $\sum_{k=1}^\infty \alpha_k^2 < \infty$. Define $\widetilde{e}_k = \nu_k e_k$, and multiply both sides of (23) by $\nu_k$. We obtain

$$\mathrm{E}(\widetilde{e}_k - \widetilde{e}_{k-1} \mid \mathcal{F}_k) \le -2\alpha_k\nu_k\|x_{k-1} - \widetilde{x}\|_{PA}^2 + 2c_0\alpha_k^2\nu_k \le 2c_0\alpha_k^2. \qquad (24)$$

Define $Z_k = \mathrm{E}(\widetilde{e}_{k+1} - \widetilde{e}_k \mid \mathcal{F}_{k+1})$. Then

$$\mathrm{E}\left[I_{Z_k \ge 0}(\widetilde{e}_{k+1} - \widetilde{e}_k)\right] = \mathrm{E}\left(I_{Z_k \ge 0}\,\mathrm{E}[\widetilde{e}_{k+1} - \widetilde{e}_k \mid \mathcal{F}_{k+1}]\right) \le 2c_0\alpha_{k+1}^2.$$

Since $\sum_{k=1}^\infty \alpha_k^2 < \infty$, the series $\sum_{k=1}^\infty \mathrm{E}\left(I_{Z_k \ge 0}(\widetilde{e}_{k+1} - \widetilde{e}_k)\right)$ converges, and therefore $\{\widetilde{e}_k\}$ converges a.s. by Theorem 3.1. But since $\nu_k$ converges to a nonzero value, it also follows that $\{e_k\}$ converges a.s. The final step is to show that $\{e_k\}$ in fact converges to zero. Rearranging the terms and taking the expected value of both sides of (24) yields

$$2\sum_{k=1}^\infty \alpha_k\nu_k\,\mathrm{E}\left(\|x_{k-1} - \widetilde{x}\|_{PA}^2\right) \le \mathrm{E}\,\widetilde{e}_0 + 2c_0\sum_{k=1}^\infty \alpha_k^2 < \infty.$$

Since $\sum_{k=1}^\infty \alpha_k = \infty$ and $\nu_k \to \nu > 0$,
it follows that $\mathrm{E}(\|x_{n_k} - \widetilde{x}\|_{PA}^2) \to 0$ for some subsequence $\{n_k\}$, and since $PA$ is SPD, we also have $\mathrm{E}(\|x_{n_k} - \widetilde{x}\|^2) = \mathrm{E}\,e_{n_k} \to 0$.
0. By Fatou’slemma: 0 ≤ E lim inf e n k = E lim e n k ≤ lim inf E e n k = 0 . It follows that lim e n k = 0 a.s. and since { e k } converges a.s., this implies e k a . s . −→ x k a . s . −→ (cid:101) x . References [1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with bi-nary coins.
[2] Z. Bai, G.-B. Huang, D. Wang, H. Wang, and M. B. Westover. Sparse extreme learning machine for classification. IEEE Transactions on Cybernetics, 44(10):1858–1870, 2014.
[3] A. Benveniste, S. Wilson, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Stochastic Modelling and Applied Probability. Springer, New York, 2012.
[4] L. Bottou. Online learning and stochastic approximations. In Online Learning in Neural Networks, chapter 2, pages 9–42. Cambridge University Press, 1998.
[5] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
[6] C. Boutsidis and A. Gittens. Improved matrix algorithms via the subsampled randomized Hadamard transform. arXiv preprint arXiv:1204.0062, 2012.
[7] L. Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[8] P. S. R. Diniz. Adaptive Filtering. Springer, New York, 1997.
[9] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
[10] J. Fan, Y. Liao, and H. Liu. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1–C32, 2016.
[11] D. L. R. Fisk. Quasi-Martingales and Stochastic Integrals. Department of Mathematics Research Monograph. Kent State University, Department of Mathematics, 1963.
[12] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Emerald, Bingley, 2007.
[13] R. M. Gower, D. Goldfarb, and P. Richtárik. Stochastic block BFGS: Squeezing more curvature out of data. arXiv preprint arXiv:1603.09649, 2016.
[14] R. M. Gower and P. Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015.
[15] R. M. Gower and P. Richtárik. Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. arXiv preprint arXiv:1602.01768, 2016.
[16] E. Haber, M. Chung, and F. Herrmann. An effective method for parameter estimation with PDE constraints with multiple right-hand sides. SIAM Journal on Optimization, 22(3):739–757, 2012.
[17] P. C. Hansen, V. Pereyra, and G. Scherer. Least Squares Data Fitting with Applications. Johns Hopkins University Press, Baltimore, 2012.
[18] A. Hedayat and W. D. Wallis. Hadamard matrices and their applications. The Annals of Statistics, 6(6):1184–1238, 1978.
[19] G. Huang, G.-B. Huang, S. Song, and K. You. Trends in extreme learning machines: A review. Neural Networks, 61:32–48, 2015.
[20] G.-B. Huang, Z. Bai, L. L. C. Kasun, and C. M. Vong. Local receptive fields based extreme learning machine. IEEE Computational Intelligence Magazine, 10(2):18–29, 2015.
[21] G.-B. Huang, D. H. Wang, and Y. Lan. Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics, 2(2):107–122, 2011.
[22] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang. Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(2):513–529, 2012.
[23] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. Extreme learning machine: a new learning scheme of feedforward neural networks. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, volume 2, pages 985–990. IEEE, 2004.
[24] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.
[25] S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres. Classe des Sciences Mathématiques et Naturelles. Série A, Sciences Mathématiques, 35:355–357, 1937.
[26] H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications. Springer, New York, 1997.
[27] E. B. Le, A. Myers, and T. Bui-Thanh. A randomized misfit approach for data reduction in large-scale inverse problems. arXiv preprint arXiv:1603.01562, 2016.
[28] M. Lichman. UCI Machine Learning Repository, Center for Machine Learning and Intelligent Systems, University of California, Irvine, 2017.
[29] J. Luo, C.-M. Vong, and P.-K. Wong. Sparse Bayesian extreme learning machine for multi-classification. IEEE Transactions on Neural Networks and Learning Systems, 25(4):836–843, 2014.
[30] M. Métivier. Semimartingales: A Course on Stochastic Processes. De Gruyter Studies in Mathematics. De Gruyter, Berlin, 1982.
[31] D. Needell, N. Srebro, and R. Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 155(1-2):549–573, 2016.
[32] D. Needell and J. A. Tropp. Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra and its Applications, 441:199–221, 2014.
[33] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, second edition, 2006.
[34] C. C. Paige and M. A. Saunders. Algorithm 583, LSQR: Sparse linear equations and least-squares problems. ACM Transactions on Mathematical Software, 8(2):195–209, 1982.
[35] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software, 8(1):43–71, 1982.
[36] M. Pilanci and M. J. Wainwright. Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. Journal of Machine Learning Research, 17(53):1–38, 2016.
[37] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[38] H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer, 1985.
[39] A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory, volume 16. SIAM, Philadelphia, 2014.
[40] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, volume 65. John Wiley & Sons, Hoboken, 2005.
[41] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262–278, 2009.
[42] F. Toutounian and S. Karimi. Global least squares method (Gl-LSQR) for solving general linear systems with several right-hand sides. Applied Mathematics and Computation, 178(2):452–460, 2006.
[43] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.