An improved quantum-inspired algorithm for linear regression
András Gilyén∗   Zhao Song†   Ewin Tang‡

Abstract
We give a classical algorithm for linear regression analogous to the quantum matrix-inversion algorithm [Harrow, Hassidim, and Lloyd, Physical Review Letters '09] for low-rank matrices [Wossnig et al., Physical Review Letters '18], when the input matrix $A$ is stored in a data structure applicable for QRAM-based state preparation. Namely, given an $A \in \mathbb{C}^{m \times n}$ with minimum singular value $\sigma$ which supports certain efficient $\ell_2$-norm importance sampling queries, along with a $b \in \mathbb{C}^m$, we can output a description of an $x \in \mathbb{C}^n$ such that $\|x - A^+ b\| \le \varepsilon \|A^+ b\|$ in $\widetilde{O}\big(\frac{\|A\|_F^6 \|A\|^2}{\sigma^8 \varepsilon^4}\big)$ time, improving on previous "quantum-inspired" algorithms in this line of research by a factor of $\frac{\|A\|^{14}}{\sigma^{14} \varepsilon^2}$ [Chia et al., STOC '20]. The algorithm is stochastic gradient descent, and the analysis bears similarities to those of optimization algorithms for regression in the usual setting [Gupta and Sidford, NeurIPS '18]. Unlike earlier works, this is a promising avenue that could lead to feasible implementations of classical regression in a quantum-inspired setting, for comparison against future quantum computers.

∗ [email protected]. Caltech. Formerly at QuSoft, CWI and University of Amsterdam.
† [email protected]. Columbia University, Princeton University and Institute for Advanced Study.
‡ [email protected]. University of Washington.

1 Introduction
An important question for the future of quantum computing is whether we can use quantum computers to speed up machine learning [Pre18]. Answering this question is a topic of active research [Chi09, Aar15, BWP+17, CHI+18]. We consider the basic problem of linear regression ($\min_x \|Ax - b\|$) for low-rank $A \in \mathbb{C}^{m \times n}$. Since Harrow, Hassidim, and Lloyd's algorithm (HHL) for sparse $A$ [HHL09, DHM+18], this problem has been central to quantum machine learning. A variant for low-rank $A$ stored in QRAM¹ [GLM08, Pra14], an input model which also allows classical algorithms to perform the same tasks as the quantum algorithms with only a polynomial slowdown [CGL+20], also commonly appears [WZP18, RML14], typically manifesting in algorithms depending on $\|A\|_F$, the Frobenius norm of $A$. In the low-rank scenario, assuming the aforementioned data structure, current state-of-the-art quantum algorithms can produce a state $\varepsilon$-close to $|A^+ b\rangle$² in $\ell_2$-norm given $A$ and $b$ in QRAM in $O^*\big(\frac{\|A\|_F}{\|A\|} \frac{\|A\|}{\sigma}\big)$³ time [CGJ19]. We think about this runtime as depending polynomially on the (square root of the) stable rank $\frac{\|A\|_F^2}{\|A\|^2}$ and the condition number $\frac{\|A\|}{\sigma}$. Note that being able to produce quantum states $|x\rangle = \frac{1}{\|x\|} \sum_i x_i |i\rangle$ corresponding to a desired vector $x$ is akin to a classical sampling problem, and is different from outputting $x$ itself. The prior best analogous algorithm can produce the same result classically in $\widetilde{O}^*\big(\big(\frac{\|A\|_F}{\|A\|}\big)^6 \big(\frac{\|A\|}{\sigma}\big)^{22}\big)$ time [CGL+20].⁴ Our main result tightens this gap, giving an algorithm running in $O^*\big(\big(\frac{\|A\|_F}{\|A\|}\big)^6 \big(\frac{\|A\|}{\sigma}\big)^{8}\big)$ time. Further, this runtime simply follows from an analysis of stochastic gradient descent (SGD), making it potentially more practical to implement and use as a benchmark against future scalable quantum computers. This result suggests that other tasks currently solved with quantum linear algebra may also admit simpler and faster quantum-inspired algorithms.

Our algorithm does not give a speedup in the conventional classical setting: if we allow linear-time preprocessing of the input (in $O(\mathrm{nnz}(A))$ time) and require the output to be $x$ in full, then the time needed to solve regression via our main result Theorem 1.1, $O^*\big(\mathrm{nnz}(A) + \frac{\|A\|_F^6 \|A\|^2}{\sigma^8} + \frac{\|A\|_F^2 \|A\|^2}{\sigma^4} n\big)$, is worse than Gupta and Sidford's $\widetilde{O}^*\big(\mathrm{nnz}(A) + \frac{\|A\|_F}{\sigma}\sqrt{\mathrm{nnz}(A)\, n}\big)$ time (shown here after naively bounding numerical sparsity by $n$).⁵

¹In this paper, we always use QRAM in combination with a data structure used for efficiently preparing states $|x\rangle = \frac{1}{\|x\|} \sum_i x_i |i\rangle$ corresponding to vectors $x \in \mathbb{C}^n$.
²$A^+$ denotes the pseudoinverse of $A$, also called the Moore–Penrose inverse of $A$.
³We define $O^*$ to be big $O$ notation, hiding polynomial dependence on $\varepsilon$ and poly-logarithmic dependence on dimension. Note that the precise exponent on $\log(mn)$ depends on the choice of word size in our RAM models. Since quantum and classical algorithms have different conventions for this choice, we will elide it for simplicity.
⁴The $\varepsilon$ dependence for the quantum algorithm is $\log\frac{1}{\varepsilon}$, compared to the classical algorithm, which gets $\mathrm{poly}(\frac{1}{\varepsilon})$. While this suggests an exponential speedup in $\varepsilon$, it appears to only hold for sampling problems, such as measuring $|A^+ b\rangle$ in the computational basis. Learning information from the output quantum state $|A^+ b\rangle$ generally requires $\mathrm{poly}(\frac{1}{\varepsilon})$ samples, preventing the exponential separation.
⁵This is comparable to the runtime of the quantum algorithm for outputting $x$, since one would need to run it $O^*(n)$ times to perform state tomography, resulting in a total runtime of $\widetilde{O}^*\big(\mathrm{nnz}(A) + \frac{\|A\|_F}{\sigma} n\big)$. The use of state tomography would also bring in polynomial dependence on $\varepsilon^{-1}$ similarly to the quantum-inspired algorithm, while Gupta and Sidford's algorithm has only logarithmic dependence on $\varepsilon^{-1}$.
However, our work demonstrates a more general barrier to quantum speedups: one could imagine that, for sufficiently quantum-like problems, the input satisfies quantum-inspired assumptions (say, we can prepare states corresponding to the input), so quantum linear algebra algorithms are efficient, and yet the input is too large to re-format into a data structure of our choice. This would prevent classical results like Gupta and Sidford's from giving a comparable runtime, while our algorithm would still apply and give a comparable runtime.
Current QML algorithms for low-rank regression require input to be stored in an efficient data structure in QRAM [CGJ19].⁶ Many quantum machine learning (QML) algorithms require such assumptions [WZP18, RML14, Pra14, BWP+17, Pre18, CHI+18].

⁶Although current QRAM proposals suggest that quantum hardware implementing QRAM may be realizable with essentially only logarithmic overhead in the runtime [GLM08], an actual physical implementation would require substantial advances in quantum technology in order to maintain coherence for a long enough time [AGJO+15].
Definition (Vector-based data structure, $\mathrm{SQ}(v)$ and $\mathrm{Q}(v)$). For any vector $v \in \mathbb{C}^n$, let $\mathrm{SQ}(v)$ denote a data structure that supports the following operations:

1. $\mathsf{Sample}()$: outputs the index $i$ with probability $|v_i|^2 / \|v\|^2$.
2. $\mathsf{Query}(i)$: given an $i \in [n]$, outputs $v_i$.
3. $\mathsf{Norm}()$: outputs $\|v\|$.⁷

Let $T(v)$ denote the maximum time it takes for the data structure to respond to any query. If we only allow the Query operation, the data structure is called $\mathrm{Q}(v)$.

Notice that the Sample function is the classical analogue of quantum state preparation, since the distribution being sampled is identical to the one attained by measuring $|v\rangle$ in the computational basis. We now extend our notion of quantum-inspired data structure to matrices.

⁷When we claim that we can call Norm() on output vectors, we will mean that we can output a constant-factor approximation, say a number in $[0.9\|v\|, 1.1\|v\|]$.
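To make this concrete, the following is a minimal sketch (ours, not from the paper) of one standard way to support these operations, a binary tree over the squared magnitudes in the spirit of the data structures of [KP17, Tan19]; all three operations then cost $O(\log n)$ per call.

    import random

    class SQVector:
        """Sketch of SQ(v): Sample() draws i w.p. |v_i|^2/||v||^2,
        Query(i) returns v_i, and Norm() returns ||v||."""

        def __init__(self, v):
            self.v = list(v)
            self.size = 1
            while self.size < len(v):
                self.size *= 2
            # tree[k] holds the sum of |v_i|^2 over the leaves below node k
            self.tree = [0.0] * (2 * self.size)
            for i, x in enumerate(v):
                self.tree[self.size + i] = abs(x) ** 2
            for k in range(self.size - 1, 0, -1):
                self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]

        def norm(self):
            return self.tree[1] ** 0.5

        def query(self, i):
            return self.v[i]

        def sample(self):
            # walk down the tree, branching with probability
            # proportional to the squared mass of each subtree
            k, r = 1, random.random() * self.tree[1]
            while k < self.size:
                k *= 2
                if r > self.tree[k]:
                    r -= self.tree[k]
                    k += 1
            return k - self.size

For example, SQVector([3, 4]).sample() returns 0 with probability 9/25 and 1 with probability 16/25. Updating an entry only requires recomputing the $O(\log n)$ sums on its root-to-leaf path, which is why such structures fit the dynamic QRAM state-preparation setting discussed above.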
Definition (Matrix-based data structure, $\mathrm{SQ}(A)$). For any matrix $A \in \mathbb{C}^{m \times n}$, let $\mathrm{SQ}(A)$ denote a data structure that supports the following operations:

1. $\mathsf{Sample1}()$: outputs $i$ with probability $\|A_{i,*}\|^2 / \|A\|_F^2$.
2. $\mathsf{Sample2}(i)$: given an $i \in [m]$, outputs $j$ with probability $|A_{i,j}|^2 / \|A_{i,*}\|^2$.
3. $\mathsf{Query}(i, j)$: given $i \in [m]$ and $j \in [n]$, outputs $A_{i,j}$.
4. $\mathsf{Norm}(i)$: given $i \in [m]$, outputs $\|A_{i,*}\|$.
5. $\mathsf{Norm}()$: outputs $\|A\|_F$.

Let $T(A)$ denote the maximum time the data structure takes to respond to any query. For the sake of simplicity, we will assume all input SQ data structures respond to queries in $O(1)$ time. There are data structures that can respond to queries in $O(\log(mn))$ time in the word RAM model [CGL+20], so using such versions only increases our runtime by a logarithmic factor.

More formally, assumptions about QRAM and its data structures can be replaced by a specific form of state preparation assumption: the assumption that quantum states corresponding to input data can be prepared in time polylogarithmic in dimension. QML algorithms typically work with any state preparation assumption. However, to the authors' knowledge, all models admitting efficient protocols that take $v$ stored in the standard way as input and output the quantum state $|v\rangle$ also admit corresponding efficient classical sample and query operations that can replace the data structure described above [CGL+20].

We start by defining some common notation. For a vector $v \in \mathbb{C}^n$, $\|v\|$ denotes its $\ell_2$ norm. For a matrix $A \in \mathbb{C}^{m \times n}$, $A^\dagger$, $A^+$, $\|A\|$, and $\|A\|_F$ denote the conjugate transpose, pseudoinverse, operator norm, and Frobenius norm of $A$, respectively. $A_{i,*}$ and $A_{*,j}$ denote the $i$-th row and $j$-th column of $A$. $f \lesssim g$ denotes $f = O(g)$, and respectively with $\gtrsim$ and $\eqsim$. We solve the following problem.

Problem (Regression with regularization). Given $A \in \mathbb{C}^{m \times n}$, $b \in \mathbb{C}^m$, and a regularization parameter $\lambda \ge 0$, we define the function $f : \mathbb{C}^n \to \mathbb{R}$ as
\[ f(x) := \tfrac{1}{2}\big(\|Ax - b\|^2 + \lambda \|x\|^2\big). \]
Let $x^* := \arg\min_{x \in \mathbb{C}^n} f(x) = (A^\dagger A + \lambda I)^+ A^\dagger b$.

We will manipulate vectors by manipulating their sparse descriptions, defined as follows:
Definition. We say we have an $s$-sparse description of $x \in \mathbb{C}^n$ if we have an $s$-sparse $v \in \mathbb{C}^m$ such that $x = A^\dagger v$. We use the convention that any $s$-sparse description is also a $t$-sparse description for all $t \ge s$.

Our main result is that we can solve regression efficiently, assuming $\mathrm{SQ}(A)$.
Theorem 1.1. Suppose we are given $\mathrm{SQ}(A)$ for $A \in \mathbb{C}^{m \times n}$ and $\mathrm{Q}(b)$ for $b \in \mathbb{C}^m$. Denote $\sigma := \|A^+\|^{-1}$ and consider $f(x)$ for $\lambda = O(\|A\|^2)$. Let $\varepsilon \lesssim \big(\log \frac{\|A\|_F \|A\|}{\sigma^2 + \lambda}\big)^{-1}$. There is an algorithm that takes
\[ O\Big(\frac{\|A\|_F^6 \|A\|^2}{(\sigma^2 + \lambda)^4 \varepsilon^4}\Big(\frac{\|b\|^2}{\|AA^+ b\|^2} + \log \frac{1}{\varepsilon}\Big)\log \frac{1}{\varepsilon}\Big) \]
time and outputs an $O\big(\frac{\|A\|_F^2 \|A\|^2}{(\sigma^2 + \lambda)^2 \varepsilon^2}\big(\frac{\|b\|^2}{\|AA^+ b\|^2} + \log \frac{1}{\varepsilon}\big)\big)$-sparse description of an $x$ such that $\|x - x^*\| \le \varepsilon \|x^*\|$ with probability $\ge 0.9$. This description admits $\mathrm{SQ}(x)$ with
\[ T(x) = \widetilde{O}\Big(\frac{\|A\|_F^4 \|A\|^4}{(\sigma^2 + \lambda)^4 \varepsilon^2}\Big(\frac{\|b\|^2}{\|AA^+ b\|^2} + \log \frac{1}{\varepsilon}\Big)\log^3 \frac{1}{\varepsilon}\Big). \]
Norm ;the runtime of
Query is the sparsity of the description, O (cid:16) (cid:107) A (cid:107) (cid:107) A (cid:107) ( σ + λ ) ε ( (cid:107) b (cid:107) (cid:107) AA + b (cid:107) + log ε ) (cid:17) . So,unlike previous quantum-inspired algorithms, it may be the case that the runtime of the SQ query dominates, though it’s conceivable that this runtime bound is an artifact of ouranalysis. The parameter (cid:107) AA + b (cid:107) (cid:107) b (cid:107) corresponds to the fraction of b ’s mass that is in therowspace of A , thinking of AA + as the orthogonal projector on the rowspace of A . Factorsof (cid:107) b (cid:107) (cid:107) AA + b (cid:107) arise because sampling error is additive with respect to (cid:107) b (cid:107) , and need to be rescaledrelative to (cid:107) x ∗ (cid:107) ; the quantum algorithm must also pay such a factor.As mentioned previously, we use stochastic gradient descent to solve this problem, for T := O (cid:16) (cid:107) A (cid:107) (cid:107) A (cid:107) ( σ + λ ) ε log ε (cid:17) iterations. Such optimization algorithms are standard for solvingregression in this setting [GS18]: the idea is that, instead of computing a pseudoinverse toexplicitly compute the minimizer to f , one can start at x (0) = (cid:126) and iteratively find x ( t +1) from x ( t ) such that the sequence { f ( x ( t ) ) } t ∈ N converges to f ( x ∗ ) . Since f is convex, updating x ( t ) by nudging it in the gradient direction produces such a sequence, even when we onlycompute decent (stochastic) estimators of the gradient [Bub15].Though the idea of SGD for regression is simple and well-understood, we note that somestandard analyses fail in our setting, so some care is needed in analysing SGD correctly.First, our stochastic gradient (Eq. (1)) has variance that depends on the norm of the current4terate, and projection onto the ball of vectors of bounded norm is too costly, so our analysescannot assume a uniform bound on the second moment of our stochastic gradient. Second,we want a bound for the final iterate x ( T ) , not merely a bound on averaged iterates as iscommon for SGD analyses, since using an averaging scheme would complicate the analysis inSection 2.3. Third, we want a bound on the output x of our algorithm of the form (cid:107) x − x ∗ (cid:107) ≤ ε (cid:107) x ∗ (cid:107) . We could have chosen to aim for the typical bound in the optimization literature of f ( x ) − f ( x ∗ ) ≤ ε ( f ( x (0) ) − f ( x ∗ )) , but our choice is common in the QML literature, and it onlymakes sense to have SQ( x ) when x is indeed close to x ∗ up to multiplicative error. Finally,note that acceleration via methods like the accelerated proximal point algorithm (APPA)[FGKS15] framework cannot be used in this setting, since we cannot pay the linear time indimension necessary to recenter gradients. The SGD analysis of Moulines and Bach [MB11]meets all of our criteria: the bulk of our result comes from simply combining this analysiswith our choice of stochastic gradient (Eq. (1)).The runtime of SGD is only good for our choice of gradient when b is sparse, but througha sketching matrix we can reduce to the case where b has O (cid:16) (cid:107) A (cid:107) (cid:107) A (cid:107) σ ε (cid:107) b (cid:107) (cid:107) AA + b (cid:107) (cid:17) non-zeroentries. 
Additional analysis is necessary to argue that we have efficient sample and query access to the output, since we need to rule out the possibility that our output $x$ is given as a description $A^\dagger \tilde{x}$ such that $\tilde{x}$ is much larger than $x$, in which case computing, say, $\|x\|$ via rejection sampling would be intractable.

2 Solving regression with SGD

2.1 The stochastic gradient

Because we are using SGD to minimize $f(x)$, we need a stochastic approximation to
\[ \nabla f(x) = A^\dagger A x - A^\dagger b + \lambda x. \]
Our choice of stochastic gradient comes from observing that $A^\dagger A x = \sum_{r=1}^{m} \sum_{c=1}^{n} (A_{r,*})^\dagger A_{r,c} x_c$, so we can estimate this sum by sampling some of its summands. Let $r$ be drawn from the distribution that is $r$ with probability $\frac{\|A_{r,*}\|^2}{\|A\|_F^2}$ and let $c_1, \ldots, c_C$ be drawn i.i.d. from the distribution that is $c$ with probability $\frac{|A_{r,c}|^2}{\|A_{r,*}\|^2}$. Define
\[ \nabla g(x) = \frac{\|A\|_F^2}{\|A_{r,*}\|^2}\Big(\frac{1}{C}\sum_{j=1}^{C} \frac{\|A_{r,*}\|^2}{|A_{r,c_j}|^2} A_{r,c_j} x_{c_j}\Big)(A_{r,*})^\dagger - A^\dagger b + \lambda x, \qquad (1) \]
and this will be our estimator for $\nabla f(x)$. Notice that the first term of this expression is the average of $C$ copies of $\frac{\|A\|_F^2}{|A_{r,c}|^2} A_{r,c} x_c (A_{r,*})^\dagger$, where each pair $(r, c)$ is marginally drawn with probability $\frac{|A_{r,c}|^2}{\|A\|_F^2}$.
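As an illustration, here is a sketch (ours) of one draw of the estimator in Eq. (1). For clarity it computes the sampling distributions directly from a dense $A$ and returns a dense vector; the actual algorithm instead obtains $r$ and the $c_j$ from Sample1/Sample2 queries to $\mathrm{SQ}(A)$ and never forms a dense iterate (see Eq. (3) below).

    import numpy as np

    def grad_estimator(A, b, x, lam, C, rng):
        """One draw of the stochastic gradient in Eq. (1)."""
        row_norms2 = np.sum(np.abs(A) ** 2, axis=1)
        fro2 = row_norms2.sum()                        # ||A||_F^2
        # r ~ ||A_{r,*}||^2 / ||A||_F^2  (Sample1 in SQ(A))
        r = rng.choice(A.shape[0], p=row_norms2 / fro2)
        # c_1, ..., c_C ~ |A_{r,c}|^2 / ||A_{r,*}||^2  (Sample2 in SQ(A))
        cs = rng.choice(A.shape[1], size=C, p=np.abs(A[r]) ** 2 / row_norms2[r])
        # importance-weighted average estimating the inner product A_{r,*} x
        est = np.mean([row_norms2[r] / abs(A[r, c]) ** 2 * A[r, c] * x[c]
                       for c in cs])
        return (fro2 / row_norms2[r]) * est * A[r].conj() - A.conj().T @ b + lam * x

Averaging over the randomness in $r$ and the $c_j$ recovers $A^\dagger A x - A^\dagger b + \lambda x$, which is the content of part 1 of the lemma below.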
Lemma 2.1. Consider $\nabla g(x)$ as in Eq. (1) as an estimator of $\nabla f(x)$.

1. $\mathbb{E}[\nabla g(x)] = \nabla f(x)$.
2. $\mathrm{Var}[\nabla g(x)] := \mathbb{E}[\|\nabla g(x) - \mathbb{E}[\nabla g(x)]\|^2] = \frac{1}{C}\|A\|_F^4 \|x\|^2 + \big(1 - \frac{1}{C}\big)\|A\|_F^2 \|Ax\|^2 - \|A^\dagger A x\|^2$.
3. $\mathbb{E}[\|\nabla g(x) - \nabla g(y)\|^2] = \|(A^\dagger A + \lambda I)(x - y)\|^2 + \mathrm{Var}[\nabla g(x - y)]$.

Proof. Part 1. We have
\[
\mathbb{E}[\nabla g(x)] = \frac{1}{C}\sum_{j=1}^{C} \mathbb{E}\Big[\frac{\|A\|_F^2}{\|A_{r,*}\|^2}\,\frac{\|A_{r,*}\|^2}{|A_{r,c_j}|^2} A_{r,c_j} x_{c_j} (A_{r,*})^\dagger\Big] - A^\dagger b + \lambda x
= \mathbb{E}\Big[\frac{\|A\|_F^2}{|A_{r,c}|^2} A_{r,c} x_c (A_{r,*})^\dagger\Big] - A^\dagger b + \lambda x
\]
\[
= \Big(\sum_{r,c} \frac{|A_{r,c}|^2}{\|A\|_F^2}\,\frac{\|A\|_F^2}{|A_{r,c}|^2} A_{r,c} x_c (A_{r,*})^\dagger\Big) - A^\dagger b + \lambda x
= A^\dagger A x - A^\dagger b + \lambda x = \nabla f(x).
\]
Part 2. We now compute the variance:
\[
\mathrm{Var}[\nabla g(x)] = \mathbb{E}[\|\nabla g(x) - \mathbb{E}[\nabla g(x)]\|^2]
= \mathrm{Var}\Big[\frac{\|A\|_F^2}{\|A_{r,*}\|^2}\Big(\frac{1}{C}\sum_{j=1}^{C} \frac{\|A_{r,*}\|^2}{|A_{r,c_j}|^2} A_{r,c_j} x_{c_j}\Big)(A_{r,*})^\dagger\Big]
\]
\[
= \mathrm{Var}\Big[\Big(\frac{1}{C}\sum_{j=1}^{C} \frac{\|A\|_F^2}{|A_{r,c_j}|^2} A_{r,c_j} x_{c_j}\Big)(A_{r,*})^\dagger\Big]
= \mathbb{E}\Big[\Big\|\Big(\frac{1}{C}\sum_{j=1}^{C} \frac{\|A\|_F^2}{|A_{r,c_j}|^2} A_{r,c_j} x_{c_j}\Big)(A_{r,*})^\dagger\Big\|^2\Big] - \|A^\dagger A x\|^2.
\]
Then, we can bound the first term of the above equation as follows:
\[
\mathbb{E}\Big[\Big\|\Big(\frac{1}{C}\sum_{j=1}^{C} \frac{\|A\|_F^2}{|A_{r,c_j}|^2} A_{r,c_j} x_{c_j}\Big)(A_{r,*})^\dagger\Big\|^2\Big]
= \sum_{i=1}^{m} \frac{\|A_{i,*}\|^2}{\|A\|_F^2}\cdot\frac{\|A\|_F^4}{\|A_{i,*}\|^4}\cdot\|A_{i,*}\|^2\;\mathbb{E}\Big[\Big|\frac{1}{C}\sum_{j=1}^{C} \frac{\|A_{i,*}\|^2}{|A_{i,c_j}|^2} A_{i,c_j} x_{c_j}\Big|^2\Big]
\]
\[
= \|A\|_F^2 \sum_{i=1}^{m} \mathbb{E}\Big[\Big|\frac{1}{C}\sum_{j=1}^{C} \frac{\|A_{i,*}\|^2}{|A_{i,c_j}|^2} A_{i,c_j} x_{c_j}\Big|^2\Big]
= \|A\|_F^2 \sum_{i=1}^{m} \Big(\frac{1}{C}\mathrm{Var}\Big[\frac{\|A_{i,*}\|^2}{|A_{i,c}|^2} A_{i,c} x_c\Big] + |A_{i,*} x|^2\Big)
\]
\[
= \|A\|_F^2 \sum_{i=1}^{m} \Big(\frac{1}{C}\big(\|A_{i,*}\|^2 \|x\|^2 - |A_{i,*} x|^2\big) + |A_{i,*} x|^2\Big)
= \frac{1}{C}\|A\|_F^4 \|x\|^2 + \Big(1 - \frac{1}{C}\Big)\|A\|_F^2 \|Ax\|^2.
\]
Note that if $C \le \frac{\|A\|_F^2 \|x\|^2}{\|Ax\|^2}$ (as is the case for our eventual value of $C$, $\frac{\|A\|_F^2}{\|A\|^2}$), this variance can be bounded by $\frac{2}{C}\|A\|_F^4 \|x\|^2$.
Part 3. Finally, the expression for $\mathbb{E}[\|\nabla g(x) - \nabla g(y)\|^2]$ follows from the observation that, under shared randomness, $\nabla g(x) - \nabla g(y)$ is simply $\nabla g(x - y)$ computed with $b = \vec{0}$, whose mean is $(A^\dagger A + \lambda I)(x - y)$. □

2.2 Analyzing stochastic gradient descent for sparse b

The algorithm we use to solve regression is stochastic gradient descent (SGD): consider a differentiable random function $g : \mathcal{X} \to \mathbb{R}$, and consider the following recursion, starting from $x^{(0)} \in \mathcal{X}$:
\[ x^{(t)} = x^{(t-1)} - \eta_t \nabla g(x^{(t-1)}), \qquad (2) \]
where $(\eta_t)_{t \ge 1}$ is a deterministic sequence of positive scalars. We will take $\nabla g$ to be as in Eq. (1), and $x^{(0)} = \vec{0}$.

We can make some simple observations about the computational complexity of this gradient step when $b$ is sparse (that is, when the number of non-zero entries of $b$, $\|b\|_0$, is small).
Lemma 2.2. Given $\mathrm{SQ}(A)$, we can output $x^{(t)}$ as a $(t + \|b\|_0)$-sparse description in time $O(Ct(t + \|b\|_0))$.

Proof. First, suppose we are given $x^{(t)}$ as an $s$-sparse description, and wish to output a sparse description for the next iterate $x^{(t+1)}$, which, from Eq. (1), satisfies
\[
x^{(t+1)} = x^{(t)} - \eta_{t+1} \nabla g(x^{(t)})
= x^{(t)} - \eta_{t+1}\Big(\frac{\|A\|_F^2}{\|A_{r,*}\|^2}\Big(\frac{1}{C}\sum_{j=1}^{C} \frac{\|A_{r,*}\|^2}{|A_{r,c_j}|^2} A_{r,c_j} x^{(t)}_{c_j}\Big)(A_{r,*})^\dagger - A^\dagger b + \lambda x^{(t)}\Big).
\]
The $r$ and $c_j$'s are drawn from distributions that $\mathrm{SQ}(A)$ can produce with its Sample queries, taking $O(C)$ time.

From inspection of the above equation, if we have $x^{(t)}$ in terms of its description $A^\dagger v^{(t)}$, then we can write $x^{(t+1)}$ as a description $A^\dagger v^{(t+1)}$ where $v^{(t+1)}$ satisfies
\[
v^{(t+1)} = v^{(t)} - \eta_{t+1}\Big(\frac{\|A\|_F^2}{\|A_{r,*}\|^2}\Big(\frac{1}{C}\sum_{j=1}^{C} \frac{\|A_{r,*}\|^2}{|A_{r,c_j}|^2} A_{r,c_j} (A_{*,c_j})^\dagger v^{(t)}\Big)e_r - b + \lambda v^{(t)}\Big). \qquad (3)
\]
Here, $e_r$ is the vector that is one in the $r$-th entry and zero elsewhere. So, if $v^{(t)}$ is $s$-sparse and has a support that includes the support of $b$, then $v^{(t+1)}$ is $(s+1)$-sparse. Furthermore, by exploiting the sparsity of $v^{(t)}$, computing $v^{(t+1)}$ takes $O(Cs)$ time (including the time taken to use $\mathrm{SQ}(A)$ to query $A$ for all of the relevant norms and entries).

So, if we wish to compute $x^{(t)}$, we begin with $x^{(0)}$, which we have trivially as a $\|b\|_0$-sparse description ($v^{(0)} = \vec{0}$). It is in fact sparser, but if we consider $x^{(0)}$ as having the same support as $b$, then by the argument described above we can compute $x^{(1)}$ as a $(\|b\|_0 + 1)$-sparse description in $O(C\|b\|_0)$ time. By iteratively computing $v^{(i+1)}$ from $v^{(i)}$ in $O(C(\|b\|_0 + i))$ time, we can output $x^{(t)}$ as a $(t + \|b\|_0)$-sparse description in
\[ O\Big(\sum_{i=0}^{t} C(\|b\|_0 + i)\Big) = O(Ct(t + \|b\|_0)) \]
time, as desired. □

We will take $C = \|A\|_F^2 / \|A\|^2$, which is the largest value we can set $C$ to before it stops having an effect on the variance of $\nabla g$. It is good to scale $C$ up to be as large as possible, since this means a corresponding linear decrease in the number of iterations, on which our algorithm's runtime depends quadratically. We eventually take $T = O\big(\frac{\|A\|_F^2 \|A\|^2}{(\sigma^2 + \lambda)^2 \varepsilon^2}\log \frac{1}{\varepsilon}\big)$, giving a runtime of
\[ \Theta\Big(\|b\|_0\,\frac{\|A\|_F^4}{(\sigma^2 + \lambda)^2 \varepsilon^2}\log \frac{1}{\varepsilon} + \frac{\|A\|_F^6 \|A\|^2}{(\sigma^2 + \lambda)^4 \varepsilon^4}\log^2 \frac{1}{\varepsilon}\Big). \]
For simplicity we assume knowledge of $\|A\|$ exactly, despite the fact that our $\mathrm{SQ}(A)$ only gives us access to $\|A\|_F$. One can get a constant-factor approximation to $\|A\|$ in $\widetilde{O}\big(\frac{\|A\|_F^4}{\|A\|^4}\big)$ time [CGL+20], which suffices for our purposes. The main result of this subsection is that running SGD for $T$ iterations gives an $x$ sufficiently close to $x^*$.
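The following sketch (ours) implements this loop: it maintains only the coefficient dictionary playing the role of $v^{(t)}$ in Eq. (3), so $x^{(t)} = A^\dagger v^{(t)}$ is never formed. The step-size schedule eta is the one from Proposition 2.3 below; as above, the sampling distributions are computed directly from a dense $A$ for illustration only.

    import numpy as np

    def sgd_sparse_description(A, b_sparse, lam, C, T, eta, rng):
        """Run Eq. (3) for T steps; b_sparse is a dict {i: b_i} of the
        non-zeros of b. Returns v with x^(T) = sum_i v[i] * conj(A[i, :])."""
        row_norms2 = np.sum(np.abs(A) ** 2, axis=1)
        fro2 = row_norms2.sum()
        v = {i: 0.0 for i in b_sparse}       # support always contains supp(b)
        for t in range(1, T + 1):
            r = rng.choice(A.shape[0], p=row_norms2 / fro2)
            cs = rng.choice(A.shape[1], size=C,
                            p=np.abs(A[r]) ** 2 / row_norms2[r])
            # x_{c_j} = (A_{*,c_j})^dagger v, using only the support of v
            est = np.mean([row_norms2[r] / abs(A[r, c]) ** 2 * A[r, c]
                           * sum(np.conj(A[i, c]) * vi for i, vi in v.items())
                           for c in cs])
            step = eta(t)
            for i in v:                       # the lambda * v^(t) term
                v[i] *= 1 - step * lam
            for i, bi in b_sparse.items():    # the -b term
                v[i] += step * bi
            # the e_r term; this is the one entry the support can grow by
            v[r] = v.get(r, 0.0) - step * (fro2 / row_norms2[r]) * est
        return v

Each iteration adds at most one index to the support and costs $O(C \cdot |\mathrm{supp}(v)|)$ arithmetic operations, matching Lemma 2.2.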
Proposition 2.3. Let $A \in \mathbb{C}^{m \times n}$ and $b \in \mathbb{C}^m$. Let $\sigma := \|A^+\|^{-1}$ and $C := \frac{\|A\|_F^2}{\|A\|^2}$. We choose
\[ \varepsilon \lesssim \frac{\|A\|_F \|A\|}{\|A\|_F \|A\| + \lambda}\Big(\log \frac{\|A\|_F \|A\|}{\sigma^2 + \lambda}\Big)^{-1} \quad\text{and}\quad T := \Theta\Big(\frac{\|A\|_F^2 \|A\|^2}{(\sigma^2 + \lambda)^2 \varepsilon^2}\log \frac{1}{\varepsilon}\Big). \]
Let $x^{(T)}$ be defined as in Eq. (2), with $x^{(0)} = \vec{0}$ and $\eta_t := \frac{\varepsilon\sqrt{12\log(1/\varepsilon)/t}}{\|A\|_F \|A\|}$. Then we have $\mathbb{E}[\|x^{(T)} - x^*\|^2] \lesssim \varepsilon^2 \|x^*\|^2$. In particular, by Markov's inequality, with probability $\ge 0.9$, $\|x^{(T)} - x^*\| \lesssim \varepsilon \|x^*\|$.

Stochastic gradient descent is known to require a number of iterations linear in (something like) the second moment of the stochastic gradient. To analyze SGD, we apply the following theorem.

Theorem 2.4 ([MB11, Theorem 1]). Suppose the following assumptions hold:

1. Let $(\mathcal{F}_n)_{n \ge 0}$ be an increasing family of $\sigma$-fields. $x^{(0)}$ is $\mathcal{F}_0$-measurable, and for each $x \in \mathcal{X}$, the random variable $g'(x)$ is square-integrable, $\mathcal{F}_n$-measurable, and $\forall x \in \mathcal{X}$, $\mathbb{E}[g'(x) \mid \mathcal{F}_{n-1}] = f'(x)$, w.p. 1.
2. $g$ is almost surely convex, differentiable, and $\forall x, y \in \mathcal{X}$, $\mathbb{E}[\|g'(x) - g'(y)\|^2 \mid \mathcal{F}_{n-1}] \le L^2 \|x - y\|^2$, w.p. 1.
3. $f$ is $\mu$-strongly convex with respect to the norm $\|\cdot\|$.
4. There exists $G \in \mathbb{R}_+$ such that $\mathbb{E}[\|g'(x^*)\|^2 \mid \mathcal{F}_{n-1}] \le G^2$.

Denote $\delta_t := \mathbb{E}[\|x^{(t)} - x^*\|^2]$, and let $\varphi_\beta(t) := \frac{t^\beta - 1}{\beta}$ for $\beta \neq 0$ and $\varphi_0(t) := \log t$. Let $\eta > 0$ denote a parameter, and for each $t$, set $\eta_t = \eta \cdot t^{-\alpha}$. We have, for $\alpha \in [0, 1)$:
\[ \delta_t \le 2\exp\big(4L^2\eta^2 \varphi_{1-2\alpha}(t)\big)\exp\big(-\tfrac{\mu\eta}{2}\, t^{1-\alpha}\big)\Big(\delta_0 + \frac{G^2}{L^2}\Big) + \frac{4\eta G^2}{\mu t^\alpha}, \]
and for $\alpha = 1$:
\[ \delta_t \le \exp\big(2L^2\eta^2\big)\, t^{-\mu\eta}\Big(\delta_0 + \frac{G^2}{L^2}\Big) + 2G^2\eta^2\, \varphi_{\mu\eta/2 - 1}(t)\, t^{-\mu\eta/2}. \]

We apply this theorem to $\nabla g(x)$ as in Eq. (1), for $C = \|A\|_F^2/\|A\|^2$ as discussed, to show that $\delta_T \lesssim \varepsilon^2 \|x^*\|^2$, thereby proving Proposition 2.3. By Lemma 2.1 we can take
\[ L^2 = (\|A\|^2 + \lambda)^2 + \frac{2}{C}\|A\|_F^4 \le 3(\|A\|_F \|A\| + \lambda)^2, \qquad G^2 = \frac{2}{C}\|A\|_F^4 \|x^*\|^2 = 2\|A\|_F^2 \|A\|^2 \|x^*\|^2, \]
\[ \mu = \sigma^2 + \lambda, \qquad \delta_0 = \|x^*\|^2. \]
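To see how Theorem 2.4 drives the parameter choices in the claim below, here is a small numerical sketch (ours) of the $\alpha = \frac{1}{2}$ bound with the values of $L$, $G$, $\mu$, $\delta_0$ just derived; plugging in the $\eta$ and $T$ of Claim 2.5 should drive it down to $O(\varepsilon^2)\,\delta_0$.

    import numpy as np

    def mb11_bound_alpha_half(L2, G2, mu, delta0, eta, t):
        """Theorem 2.4 with alpha = 1/2 (so phi_{1-2*alpha}(t) = log t):
        an upper bound on delta_t = E||x^(t) - x*||^2."""
        burn_in = (2 * np.exp(4 * L2 * eta**2 * np.log(t))
                     * np.exp(-mu * eta * np.sqrt(t) / 2) * (delta0 + G2 / L2))
        noise = 4 * eta * G2 / (mu * np.sqrt(t))
        return burn_in + noise

The first ("burn-in") term decays like $\varepsilon^{C'}$ once $\sqrt{t} \gtrsim \log(1/\varepsilon)/(\mu\eta)$, while the second ("noise") term is what forces the $\frac{1}{\mu^2\varepsilon^2}$ scaling of $T$.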
Claim 2.5. If we take $\alpha = \frac{1}{2}$, $\eta = \frac{\varepsilon\sqrt{12\log(1/\varepsilon)}}{\|A\|_F \|A\|}$, and assume
\[ \varepsilon \lesssim \frac{\|A\|_F \|A\|}{\|A\|_F \|A\| + \lambda}\Big(\log \frac{\|A\|_F \|A\|}{\sigma^2 + \lambda}\Big)^{-1}, \]
then for $T = \frac{144\log^2(1/\varepsilon)}{\eta^2(\sigma^2 + \lambda)^2} = 12\,\frac{\|A\|_F^2 \|A\|^2}{(\sigma^2 + \lambda)^2 \varepsilon^2}\log\frac{1}{\varepsilon}$, we have $\delta_T \lesssim \varepsilon^2 \|x^*\|^2$.

The particular values in this claim (in particular, $\eta$, $T$, and $\varepsilon$) are chosen to give the eventual bound of $\delta_T \lesssim \varepsilon^2 \|x^*\|^2$.
Proof. It happens that $\alpha = \frac{1}{2}$ gives the best bound on $\delta_t$ (as we expect, since SGD in this setting gives a $\frac{1}{\sqrt{t}}$ rate). For this choice of $\alpha$, substituting in our values for $L$, $G$, $\mu$, and $\delta_0$, we get
\[
\delta_T \le 2\exp\big(4L^2\eta^2\log T\big)\exp\big(-\tfrac{\mu\eta}{2}\sqrt{T}\big)\Big(\delta_0 + \frac{G^2}{L^2}\Big) + \frac{4\eta G^2}{\mu\sqrt{T}}
\le \Big(4\exp\Big(12(\|A\|_F\|A\| + \lambda)^2\eta^2\log T - \frac{(\sigma^2 + \lambda)\eta}{2}\sqrt{T}\Big) + \frac{8\eta\|A\|_F^2\|A\|^2}{(\sigma^2 + \lambda)\sqrt{T}}\Big)\|x^*\|^2.
\]
Dividing by $\|x^*\|^2$ on both sides and substituting in $T$, we get
\[
\frac{\delta_T}{\|x^*\|^2} \le 4\exp\Big(12(\|A\|_F\|A\| + \lambda)^2\eta^2\log T - 6\log\frac{1}{\varepsilon}\Big) + 8\varepsilon^2
= 4\exp\Big(-C'\log\frac{1}{\varepsilon}\Big) + 8\varepsilon^2 = 4\varepsilon^{C'} + 8\varepsilon^2 \lesssim \varepsilon^2.
\]
The first step follows from substituting in $\eta$ and defining
\[ C' := 6 - 144\,\frac{(\|A\|_F\|A\| + \lambda)^2\varepsilon^2}{\|A\|_F^2\|A\|^2}\,\log\Big(12\frac{\|A\|_F^2\|A\|^2}{(\sigma^2 + \lambda)^2\varepsilon^2}\log\frac{1}{\varepsilon}\Big), \qquad (4) \]
and the final inequality follows because the bound on $\varepsilon$ in the statement of the claim implies that $C' \ge 2$. □

2.3 Sample and query access to the output

After performing SGD, we have an $x^{(T)}$ that is close to $x^*$ as desired, and it is given as a sparse description $A^\dagger v^{(T)}$. We want to say that we have $\mathrm{SQ}(x^{(T)})$ from its description. Our goal is to invoke a result originally from [Tan19] about length-square sampling a vector that is a linear combination of length-square accessible vectors.
Lemma 2.6 ([CGL+20, Lemmas 2.9 and 2.10]). Suppose we have $\mathrm{SQ}(M^\dagger)$ for $M \in \mathbb{C}^{n \times d}$ and $\mathrm{Q}(x)$ for $x \in \mathbb{C}^d$. Denote $\Delta^2 := \sum_{i=1}^{d} \|M_{*,i}\|^2 |x_i|^2 / \|y\|^2$. Then we can implement $\mathrm{SQ}(y)$ for $y := Mx \in \mathbb{C}^n$ with $T(y) = O(d\Delta^2 \log(1/\delta) \cdot T(M))$, where queries succeed with probability $\ge 1 - \delta$. Namely, we can:

(a) query for entries with complexity $O(d \cdot T(M))$;
(b) sample from $y$ with runtime $T_{\mathrm{sample}}(y)$ satisfying $\mathbb{E}[T_{\mathrm{sample}}(y)] = O(d\Delta^2 \cdot T(M))$ and
\[ \Pr\big[T_{\mathrm{sample}}(y) = O(d\Delta^2 \log(1/\delta) \cdot T(M))\big] \ge 1 - \delta; \]
(c) estimate $\|y\|$ to $(1 \pm \varepsilon)$ multiplicative error with success probability at least $1 - \delta$ in complexity
\[ O\Big(\frac{d\Delta^2}{\varepsilon^2}\, T(M) \cdot \log(1/\delta)\Big). \]

So we care about the quantity $\Delta^2 = \sum_{i=1}^{m} \|A_{i,*}\|^2 |v^{(t)}_i|^2 / \|A^\dagger v^{(t)}\|^2$, with $t = T$, where $v^{(t)}$ follows the recurrence of Eq. (3) (recalling from before that $r$ and $c_1, \ldots, c_C$ are sampled randomly and independently at each iteration):
\[
v^{(t+1)} = v^{(t)} - \eta_{t+1}\Big(\underbrace{\frac{\|A\|_F^2}{\|A_{r,*}\|^2}\Big(\frac{1}{C}\sum_{j=1}^{C} \frac{\|A_{r,*}\|^2}{|A_{r,c_j}|^2} A_{r,c_j} (A_{*,c_j})^\dagger v^{(t)}\Big)e_r - b + \lambda v^{(t)}}_{\nabla \tilde{g}(v^{(t)})}\Big).
\]
Roughly speaking, $\Delta$ encodes the amount of cancellation that could occur in the product $A^\dagger v^{(t)}$. We will consider it as a norm: $\Delta = \frac{\|v^{(t)}\|_D}{\|x^{(t)}\|}$, for $D$ the diagonal matrix with $D_{ii} = \|A_{i,*}\|$, where $\|v\|_D := \sqrt{v^\dagger D^\dagger D v}$. From now on, we will consider $x^{(t)}$ to be the version of SGD as in Proposition 2.3, i.e. with $C = \frac{\|A\|_F^2}{\|A\|^2}$.

First, notice that we can show moment bounds for $\nabla \tilde{g}(v^{(t)})$ in the $D$ norm similar to those in Lemma 2.1. The mean is what we expect it to be:
\[
\mathbb{E}[\nabla \tilde{g}(v)] = \mathbb{E}\Big[\frac{\|A\|_F^2}{\|A_{r,*}\|^2}\Big(\frac{1}{C}\sum_{j=1}^{C} \frac{\|A_{r,*}\|^2}{|A_{r,c_j}|^2} A_{r,c_j} (A_{*,c_j})^\dagger v\Big)e_r\Big] - b + \lambda v
= \sum_{r=1}^{m}\sum_{c=1}^{n} A_{r,c}\big((A_{*,c})^\dagger v\big)e_r - b + \lambda v
= AA^\dagger v - b + \lambda v.
\]
The variance in the $D$ norm is bounded just as the variance of $\nabla g$ was:
\[
\mathbb{E}[\|\nabla \tilde{g}(v) - \mathbb{E}[\nabla \tilde{g}(v)]\|_D^2]
= \mathbb{E}\Big[\|A_{r,*}\|^2\Big|\frac{1}{C}\sum_{j=1}^{C} \frac{\|A\|_F^2}{|A_{r,c_j}|^2} A_{r,c_j} (A_{*,c_j})^\dagger v\Big|^2\Big] - \|AA^\dagger v\|_D^2.
\]
From here, the computation is identical to the one in Lemma 2.1:
\[
= \frac{1}{C}\|A\|_F^4 \|A^\dagger v\|^2 + \Big(1 - \frac{1}{C}\Big)\|A\|_F^2 \|AA^\dagger v\|^2 - \|AA^\dagger v\|_D^2 \le \frac{2}{C}\|A\|_F^4\|A^\dagger v\|^2 = 2\|A\|_F^2 \|A\|^2 \|A^\dagger v\|^2.
\]
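As an illustration of Lemma 2.6 in the form we use it ($M = A^\dagger$, so $y = A^\dagger v$ and the columns $M_{*,i}$ are the conjugated rows of $A$), here is a sketch (ours) of the rejection-sampling step behind part (b); the expected number of trials is $O(\|v\|_0 \Delta^2)$, and each trial touches only the support of $v$.

    import numpy as np

    def sample_from_description(A, v_dict, rng):
        """Draw j w.p. |y_j|^2 / ||y||^2 for y = A^dagger v by rejection."""
        idx = np.array(list(v_dict))                  # support of v, size k
        vv = np.array([v_dict[i] for i in idx])
        k = len(idx)
        # proposal over rows: w_i = |v_i|^2 ||A_{i,*}||^2
        w = np.abs(vv) ** 2 * np.sum(np.abs(A[idx]) ** 2, axis=1)
        while True:
            i = rng.choice(k, p=w / w.sum())
            row = A[idx[i]]
            # j ~ |A_{i,j}|^2 / ||A_{i,*}||^2  (Sample2 on row i)
            j = rng.choice(A.shape[1], p=np.abs(row) ** 2 / np.sum(np.abs(row) ** 2))
            terms = np.conj(A[idx, j]) * vv           # v_i * conj(A_{i,j})
            y_j = terms.sum()                          # (A^dagger v)_j
            # accept w.p. |y_j|^2 / (k * sum_i |term_i|^2) <= 1 (Cauchy-Schwarz)
            if rng.random() < abs(y_j) ** 2 / (k * np.sum(np.abs(terms) ** 2)):
                return j

The accepted index has density proportional to $\big(\sum_i |v_i A_{i,j}|^2\big)\cdot\frac{|y_j|^2}{k\sum_i |v_i A_{i,j}|^2} \propto |y_j|^2$, as required.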
To bound $\Delta$, we will show a recurrence. Note that
\[
\mathbb{E}[\|v^{(t+1)}\|_D \mid v^{(t)}] \le \|v^{(t)}\|_D(1 - \eta_{t+1}\lambda) + \eta_{t+1}\,\mathbb{E}[\|\nabla \tilde{g}(v^{(t)}) - \lambda v^{(t)}\|_D \mid v^{(t)}] \quad\text{(triangle inequality)}
\]
\[
\le \|v^{(t)}\|_D + \eta_{t+1}\sqrt{\mathbb{E}[\|\nabla \tilde{g}(v^{(t)}) - \lambda v^{(t)}\|_D^2 \mid v^{(t)}]} \quad\text{(by } \mathbb{E}[Z]^2 \le \mathbb{E}[Z^2]\text{)}
\]
\[
\le \|v^{(t)}\|_D + \eta_{t+1}\sqrt{\|AA^\dagger v^{(t)} - b\|_D^2 + 2\|A\|_F^2\|A\|^2\|A^\dagger v^{(t)}\|^2} \quad\text{(by } \mathbb{E}[Z^2] = \mathbb{E}[Z]^2 + \mathrm{Var}[Z]\text{)}
\]
\[
\lesssim \|v^{(t)}\|_D + \eta_{t+1}\big(\|b\|_D + \|A\|_F\|A\|\,\|x^{(t)}\|\big) \quad\text{(by } x^{(t)} = A^\dagger v^{(t)} \text{ and } \|D\| \le \|A\|_F\text{)}.
\]
Taking the expectation of both sides, we have $\mathbb{E}[\|v^{(t+1)}\|_D] \le \mathbb{E}[\|v^{(t)}\|_D] + \eta_{t+1}(\|b\|_D + \|A\|_F\|A\|\,\mathbb{E}[\|x^{(t)}\|])$, so since $v^{(0)} = \vec{0}$, we can solve this recurrence to get
\[ \mathbb{E}[\|v^{(T)}\|_D] \lesssim \sum_{t=1}^{T} \frac{\eta}{\sqrt{t}}\big(\|b\|_D + \|A\|_F\|A\|\,\mathbb{E}[\|x^{(t)}\|]\big). \qquad (5) \]
Further, for $t \le T$,
\[
\mathbb{E}[\|x^{(t)}\|] \le \|x^*\| + \sqrt{\mathbb{E}[\|x^{(t)} - x^*\|^2]} \le \|x^*\| + \sqrt{\delta_t}
\lesssim \|x^*\|\Big(1 + \sqrt{\frac{\varepsilon\|A\|_F\|A\|}{(\sigma^2+\lambda)\sqrt{t}}}\Big(\log\frac{1}{\varepsilon}\Big)^{1/4}\Big),
\]
where the last step uses the analysis from the proof of Claim 2.5:
\[
\delta_t \le \Big(4\exp\Big(12(\|A\|_F\|A\|+\lambda)^2\eta^2\log T - \frac{(\sigma^2+\lambda)\eta}{2}\sqrt{t}\Big) + \frac{8\eta\|A\|_F^2\|A\|^2}{(\sigma^2+\lambda)\sqrt{t}}\Big)\|x^*\|^2
\lesssim \|x^*\|^2\Big(1 + \frac{\varepsilon\|A\|_F\|A\|}{(\sigma^2+\lambda)\sqrt{t}}\sqrt{\log\frac{1}{\varepsilon}}\Big);
\]
for our choices of $\eta$ and $T$, the upper bound on $\varepsilon$ implies that the expression in the first exponential is at most a constant. Returning to Eq. (5), we have
\[
\mathbb{E}[\|v^{(T)}\|_D] \lesssim \eta\sqrt{T}\,\|b\|_D + \eta\sqrt{T}\,\|A\|_F\|A\|\,\|x^*\| + \eta\, T^{1/4}\|x^*\|\sqrt{\frac{\varepsilon\|A\|_F^3\|A\|^3}{\sigma^2+\lambda}}\Big(\log\frac{1}{\varepsilon}\Big)^{1/4}
\]
\[
\lesssim \frac{\|b\|_D}{\sigma^2+\lambda}\log\frac{1}{\varepsilon} + \|x^*\|\frac{\|A\|_F\|A\|}{\sigma^2+\lambda}\log\frac{1}{\varepsilon} + \|x^*\|\,\varepsilon\,\frac{\|A\|_F\|A\|}{\sigma^2+\lambda}\log\frac{1}{\varepsilon}
\lesssim \frac{\|b\|_D}{\sigma^2+\lambda}\log\frac{1}{\varepsilon} + \|x^*\|\frac{\|A\|_F\|A\|}{\sigma^2+\lambda}\log\frac{1}{\varepsilon}. \qquad (6)
\]
So, we can bound our cancellation constant. By union bounding, we have that with probability $\ge 0.8$, $\|x^{(T)} - x^*\| \le \varepsilon\|x^*\|$ and the above bound on $\|v^{(T)}\|_D$ holds up to constant factors. When both of those bounds hold, we have
\[
\Delta = \frac{\|v^{(T)}\|_D}{\|x^{(T)}\|} \le \frac{\|v^{(T)}\|_D}{(1-\varepsilon)\|x^*\|}
\lesssim \frac{\|b\|_D}{\|x^*\|(\sigma^2+\lambda)}\log\frac{1}{\varepsilon} + \frac{\|A\|_F\|A\|}{\sigma^2+\lambda}\log\frac{1}{\varepsilon}. \qquad (7)
\]
To translate the $\frac{\|b\|_D}{\|x^*\|}$ in this bound to a more intuitive parameter, we use that $\|b\|_D = \|Db\| \le \|D\|\|b\| \le \|A\|_F\|b\|$ and that
\[ \|x^*\| = \|(A^\dagger A + \lambda I)^+ A^\dagger (AA^+ b)\| \ge \frac{\|A\|}{\|A\|^2 + \lambda}\|AA^+ b\| = \Theta\Big(\frac{\|AA^+ b\|}{\|A\|}\Big) \]
(using $\lambda = O(\|A\|^2)$) to conclude that
\[ \Delta \lesssim \frac{\|A\|_F\|A\|}{\sigma^2+\lambda}\log\frac{1}{\varepsilon} + \frac{\|A\|_F\|A\|}{\sigma^2+\lambda}\,\frac{\|b\|}{\|AA^+ b\|}\log\frac{1}{\varepsilon}. \qquad (8) \]

2.4 Non-sparse b: Proof of Theorem 1.1

In the previous sections, we have shown how to solve our regularized regression problem for sparse $b$: from Proposition 2.3, performing SGD for $T \eqsim \frac{\|A\|_F^2\|A\|^2}{(\sigma^2+\lambda)^2\varepsilon^2}\log\frac{1}{\varepsilon}$ iterations outputs an $x$ with the desired error bound; from Lemma 2.2, it takes $O\big(\frac{\|A\|_F^2}{\|A\|^2}T(T + \|b\|_0)\big)$ time to output $x$ as a sparse description; and from Section 2.3, we have sample and query access to that output $x$ given its sparse description.

Now, all that remains is to extend this work to the case that $b$ is non-sparse. In this case, we will simply replace $b$ with a sparse $\hat{b}$ that behaves similarly, and show that running SGD with this value of $\hat{b}$ gives all the same results. The sparsity of $\hat{b}$ will be $O\big(\frac{\|A\|_F^2\|A\|^2}{(\sigma^2+\lambda)^2\varepsilon^2}\frac{\|b\|^2}{\|AA^+ b\|^2}\big)$, giving a total runtime of
\[ O\Big(\frac{\|A\|_F^6\|A\|^2}{(\sigma^2+\lambda)^4\varepsilon^4}\Big(\frac{\|b\|^2}{\|AA^+ b\|^2} + \log\frac{1}{\varepsilon}\Big)\log\frac{1}{\varepsilon}\Big). \]
The crucial observation for sparsifying $b$ is that we can use importance sampling to approximate the matrix product $A^\dagger b$, which suffices to approximate the solution $x^*$.

Lemma 2.7 (Matrix multiplication to Frobenius norm error, [DKM06, Lemma 4]).⁸ Consider $X \in \mathbb{C}^{m \times n}$, $Y \in \mathbb{C}^{m \times p}$, and let $S \in \mathbb{C}^{s \times m}$ be an importance sampling matrix for $X$. That is, let each $S_{i,*}$ be independently sampled to be $\frac{\|X\|_F}{\sqrt{s}\,\|X_{i,*}\|}e_i^\dagger$ with probability $\frac{\|X_{i,*}\|^2}{\|X\|_F^2}$. Then
\[ \mathbb{E}\big[\|X^\dagger S^\dagger S Y - X^\dagger Y\|_F^2\big] \le \frac{1}{s}\|X\|_F^2\|Y\|_F^2 \quad\text{and}\quad \mathbb{E}\Big[\sum_{i=1}^{s}\|[SX]_{i,*}\|^2\,\|[SY]_{i,*}\|^2\Big] \le \frac{1}{s}\|X\|_F^2\|Y\|_F^2. \]

⁸The lemma is stated over $\mathbb{R}$; it is straightforward to apply the tool to $\mathbb{C}$ via doubling the size of the dimension.
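For concreteness, here is a sketch (ours) of this sparsification applied to $b$ (i.e., $X = A$, $Y = b$): it returns $\hat{b} = S^\dagger S b$, which has at most $s$ non-zero entries; a real implementation would draw the row indices with Sample1 queries to $\mathrm{SQ}(A)$ and store only the non-zeros.

    import numpy as np

    def sparsify_b(A, b, s, rng):
        """Return b_hat = S^dagger S b for the importance sampler of
        Lemma 2.7, so that A^dagger b_hat approximates A^dagger b."""
        row_norms2 = np.sum(np.abs(A) ** 2, axis=1)
        p = row_norms2 / row_norms2.sum()     # ||A_{i,*}||^2 / ||A||_F^2
        b_hat = np.zeros(len(b), dtype=complex)
        for i in rng.choice(len(b), size=s, p=p):
            # each draw of row i contributes ||A||_F^2/(s ||A_{i,*}||^2) * b_i
            b_hat[i] += b[i] / (s * p[i])
        return b_hat

Note that $\mathbb{E}[\hat{b}] = b$, and Lemma 2.7 controls both the error of $A^\dagger \hat{b}$ and the weighted mass $\|\hat{b}\|_D$ that enters the $\Delta$ bound below.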
We have $\mathrm{SQ}(A)$, so we can use Lemma 2.7 with $s \leftarrow \Theta\big(\frac{\|A\|_F^2\|A\|^2}{(\sigma^2+\lambda)^2\varepsilon^2}\frac{\|b\|^2}{\|AA^+ b\|^2}\big)$, $X \leftarrow A$, and $Y \leftarrow b$, to find an $S$ in $O(s)$ time that satisfies the guarantees
\[ \|A^\dagger S^\dagger S b - A^\dagger b\| \le \frac{(\sigma^2+\lambda)\varepsilon\|AA^+ b\|}{\|A\|_F\|A\|\|b\|}\,\|A\|_F\|b\| \lesssim \varepsilon(\sigma^2+\lambda)\|x^*\| \qquad (9) \]
and
\[ \sum_{i=1}^{s}\|[SA]_{i,*}\|^2\,|[Sb]_i|^2 \le \frac{(\sigma^2+\lambda)^2\varepsilon^2\|AA^+ b\|^2}{\|A\|_F^2\|A\|^2\|b\|^2}\,\|A\|_F^2\|b\|^2 \lesssim \varepsilon^2(\sigma^2+\lambda)^2\|x^*\|^2 \qquad (10) \]
with probability $\ge 0.9$. (Above, we used that, since $x^* = (A^\dagger A + \lambda I)^+ A^\dagger b$, we have $\|x^*\| \ge \frac{\|A\|}{\|A\|^2+\lambda}\|AA^+ b\|$, and that $\lambda \lesssim \|A\|^2$.) Assuming these hold, we can perform SGD as in Proposition 2.3 on $\hat{b} := S^\dagger S b$ (which is $s$-sparse) to find an $x$ such that $\|x - \hat{x}^*\| \le \varepsilon\|\hat{x}^*\|$, where $\hat{x}^*$ is the optimum $(A^\dagger A + \lambda I)^+ A^\dagger \hat{b}$. This implies that $\|x - x^*\| \lesssim \varepsilon\|x^*\|$, since
\[ \|\hat{x}^* - x^*\| = \|(A^\dagger A + \lambda I)^+ A^\dagger(\hat{b} - b)\| \le \frac{1}{\sigma^2+\lambda}\|A^\dagger(\hat{b} - b)\| \lesssim \varepsilon\|x^*\|. \qquad (11) \]
To bound the runtimes of $\mathrm{SQ}(x)$, we modify the analysis of $\Delta$ from Section 2.3: recalling Eq. (7), we have that, with probability $\ge 0.8$,
\[ \Delta \lesssim \frac{\|A\|_F\|A\|}{\sigma^2+\lambda}\log\frac{1}{\varepsilon} + \frac{\|\hat{b}\|_D}{\|\hat{x}^*\|(\sigma^2+\lambda)}\log\frac{1}{\varepsilon}. \]
Then,
\[ \frac{\|\hat{b}\|_D}{\|\hat{x}^*\|} \le \frac{\|\hat{b}\|_D}{(1-\varepsilon)\|x^*\|} = \frac{\sqrt{\sum_{i=1}^{m}\|A_{i,*}\|^2|\hat{b}_i|^2}}{(1-\varepsilon)\|x^*\|} \lesssim \frac{\varepsilon(\sigma^2+\lambda)\|x^*\|}{(1-\varepsilon)\|x^*\|} \lesssim \varepsilon(\sigma^2+\lambda), \]
so
\[ \Delta \lesssim \frac{\|A\|_F\|A\|}{\sigma^2+\lambda}\log\frac{1}{\varepsilon} + \varepsilon\log\frac{1}{\varepsilon} \lesssim \frac{\|A\|_F\|A\|}{\sigma^2+\lambda}\log\frac{1}{\varepsilon}. \]
SQ( x ( T ) ) is ( T + (cid:107) ˆ b (cid:107) ) ∆ , whichgives the runtime in Theorem 1.1. Acknowledgments
E.T. thanks Kevin Tian immensely for discussions integral to these results. A.G. acknowledges funding provided by Samsung Electronics Co., Ltd., for the project "The Computational Power of Sampling on Quantum Computers", and additional support by the Institute for Quantum Information and Matter, an NSF Physics Frontiers Center (NSF Grant PHY-1733907), as well as support by ERC Consolidator Grant QPROGRESS and by QuantERA project QuantAlgo 680-91-034. Z.S. was partially supported by the Schmidt Foundation and the Simons Foundation. E.T. is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1762114.
References

[Aar15] Scott Aaronson. Read the fine print. Nature Physics, 11(4):291, 2015.

[AGJO+15] Srinivasan Arunachalam, Vlad Gheorghiu, Tomas Jochym-O'Connor, Michele Mosca, and Priyaa Varshinee Srinivasan. On the robustness of bucket brigade quantum RAM. New Journal of Physics, 17(12):123010, 2015. arXiv:1502.03450

[Bub15] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4):231–357, 2015. arXiv:1405.4980

[BWP+17] Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549:195–202, 2017. arXiv:1611.09347

[CGJ19] Shantanav Chakraborty, András Gilyén, and Stacey Jeffery. The power of block-encoded matrix powers: improved regression techniques via faster Hamiltonian simulation. In Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP), pages 33:1–33:14, 2019. arXiv:1804.01973

[CGL+20] Nai-Hui Chia, András Gilyén, Tongyang Li, Han-Hsuan Lin, Ewin Tang, and Chunhao Wang. Sampling-based sublinear low-rank matrix arithmetic framework for dequantizing quantum machine learning. In Proceedings of the 52nd ACM Symposium on the Theory of Computing (STOC), pages 387–400, 2020. arXiv:1910.06151

[Chi09] Andrew M. Childs. Equation solving by simulation. Nature Physics, 5(12):861, 2009.

[CHI+18] Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil, Andrea Rocchetto, Simone Severini, and Leonard Wossnig. Quantum machine learning: a classical perspective. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2209):20170551, 2018. arXiv:1707.08561

[DHM+18] Danial Dervovic, Mark Herbster, Peter Mountney, Simone Severini, Naïri Usher, and Leonard Wossnig. Quantum linear systems algorithms: a primer. arXiv:1802.08227, 2018.

[DKM06] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006.

[FGKS15] Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, pages 2540–2548, 2015.

[GLM08] Vittorio Giovannetti, Seth Lloyd, and Lorenzo Maccone. Quantum random access memory. Physical Review Letters, 100(16):160501, 2008. arXiv:0708.1879

[GS18] Neha Gupta and Aaron Sidford. Exploiting numerical sparsity for efficient learning: faster eigenvector computation and regression. In Advances in Neural Information Processing Systems, pages 5269–5278, 2018.

[GSLW19] András Gilyén, Yuan Su, Guang Hao Low, and Nathan Wiebe. Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics. In Proceedings of the 51st ACM Symposium on the Theory of Computing (STOC), pages 193–204, 2019. arXiv:1806.01838

[HHL09] Aram W. Harrow, Avinatan Hassidim, and Seth Lloyd. Quantum algorithm for linear systems of equations. Physical Review Letters, 103(15):150502, 2009. arXiv:0811.3171

[KP17] Iordanis Kerenidis and Anupam Prakash. Quantum recommendation systems. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS), pages 49:1–49:21, 2017. arXiv:1603.08675

[MB11] Eric Moulines and Francis R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

[Pra14] Anupam Prakash. Quantum Algorithms for Linear Algebra and Machine Learning. PhD thesis, University of California at Berkeley, 2014.

[Pre18] John Preskill. Quantum Computing in the NISQ era and beyond. Quantum, 2:79, 2018. arXiv:1801.00862

[RML14] Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Physical Review Letters, 113(13):130503, 2014. arXiv:1307.0471

[Tan19] Ewin Tang. A quantum-inspired classical algorithm for recommendation systems. In Proceedings of the 51st ACM Symposium on the Theory of Computing (STOC), pages 217–228, 2019. arXiv:1807.04271

[WZP18] Leonard Wossnig, Zhikuan Zhao, and Anupam Prakash. Quantum linear system algorithm for dense matrices. Physical Review Letters, 120(5):050502, 2018. arXiv:1704.06174