# On the Hardness of PAC-learning Stabilizer States with Noise

arXiv [quant-ph]

Aravind Gollakota and Daniel Liang

Department of Computer Science, University of Texas at Austin

February 9, 2021

Abstract

We consider the problem of learning stabilizer states with noise in the Probably Approximately Correct (PAC) framework of Aaronson [Aar07] for learning quantum states. In the noiseless setting, an algorithm for this problem was recently given by Rocchetto [Roc18], but the noisy case was left open. Motivated by approaches to noise tolerance from classical learning theory, we introduce the Statistical Query (SQ) model for PAC-learning quantum states, and prove that algorithms in this model are indeed resilient to common forms of noise, including classification and depolarizing noise. We prove an exponential lower bound on learning stabilizer states in the SQ model. Even outside the SQ model, we prove that learning stabilizer states with noise is in general as hard as Learning Parity with Noise (LPN) using classical examples. Our results position the problem of learning stabilizer states as a natural quantum analogue of the classical problem of learning parities: easy in the noiseless setting, but seemingly intractable even with simple forms of noise.

A fundamental task in quantum computing is that of learning a description of an unknown quantum state $\rho$. Traditionally this is formalized as the problem of quantum state tomography, where we are granted the ability to form multiple copies of $\rho$ and take arbitrary measurements, and must learn a state $\sigma$ that is close to $\rho$ in trace distance. In an influential work, Aaronson [Aar07] introduced the “Probably Approximately Correct” (PAC) framework from computational learning theory [Val84] as an alternative perspective on this problem. Here the key innovation is that instead of learning $\rho$ in an absolute metric (such as trace distance), we only wish to learn it with respect to a pre-specified distribution on measurements. This requirement is considerably weaker than that of full tomography.

Concretely, let $\mathcal{E}$ denote the space of two-outcome $n$-qubit measurements $E$, where $E$ (corresponding to the POVM $\{E, I - E\}$) accepts a state $\rho$ with probability $\mathrm{Tr}(E\rho)$ and rejects it otherwise. Let $D$ be a distribution on $\mathcal{E}$. We are given the ability to sample and perform random measurements from $D$, i.e. to form a training set $\{(E_i, Y_i)\}_{i \in [m]}$ where each $E_i \sim D$ and $Y_i$ is the outcome of measuring $\rho$ using $E_i$. Our goal is to learn a state $\sigma$ such that with high probability,

$$\mathbb{E}_{E \sim D}\left[(\mathrm{Tr}(E\sigma) - \mathrm{Tr}(E\rho))^2\right] \le \epsilon.$$

Usually we also have some basic prior information about $\rho$, such as knowledge of a class $\mathcal{F}$ such that $\rho \in \mathcal{F}$. In the terminology of learning theory, this would be called “learning the class $\mathcal{F}$.” It is important to stress that in this framework, while the data arises from a quantum state, it is entirely classical in form and representation.

When $\rho$ is an arbitrary $n$-qubit state, [Aar07] showed that $O(n)$ “training examples”, i.e. measurement-outcome pairs $(E, \mathrm{Tr}(E\rho))$, suffice to statistically determine a state $\sigma$ that approximates $\rho$ in this sense (suppressing the dependence on $\epsilon$ for simplicity).
This is in contrast to full state tomography, which requires an exponential number of measurements or copies of $\rho$ [OW16, HHJ+]. However, $\sigma$ is determined purely in a statistical or information-theoretic sense, and for a long time no efficient algorithms were known for actually computing $\sigma$ in settings of interest.

Recently, [Roc18] gave an efficient algorithm for the case where $\rho$ is known to be a stabilizer state, and $D$ is any distribution on Pauli measurements. The stabilizer states are an important class of states in quantum computing due to their efficient classical simulability [Got98, AG04] and their foundational role in quantum error-correction [Got97]. A major question left open by [Roc18] is: are stabilizer states also efficiently learnable in noisy settings?

Motivated by this question, we introduce to the quantum setting a well-known tool for noise-resilient classical PAC learning, the statistical query (SQ) model, and define the problem of SQ-learning quantum states. In this model, rather than receiving labeled measurement-outcome examples of the form $(E, \mathrm{Tr}(E\rho))$ (or, more realistically, $(E, Y)$, where $Y$ is the random outcome of measuring $\rho$ using $E$), the learner is only allowed to make statistical queries to an oracle, and otherwise its goal remains the same. A statistical query is described by a function $\varphi : \mathcal{E} \times \{-1, 1\} \to [-1, 1]$ and a tolerance $\tau > 0$, and the oracle responds to the query with $\mathbb{E}[\varphi(E, Y)] \pm \tau$, where the expectation is taken over the random draw of $E \sim D$ and $Y$, the random outcome of measuring $\rho$ using $E$. One can think of this as modeling an experimental setup that is unable to report individual measurement outcomes, but is nevertheless able to estimate expectation values to any desired precision. Importantly, an algorithm that is able to work in this restricted setting automatically gains tolerance to several kinds of noise.

The SQ model was originally introduced by Kearns [Kea98] in the setting of Boolean function classes, and has since grown into a highly influential model (see [Fel16, Rey20] for surveys). The model is known to have the following properties:

• It is a natural restriction of the PAC model that nevertheless captures most known PAC algorithms for a wide range of common classes [HS07, Rey20].

• SQ algorithms are naturally resistant to mild forms of noise in the labels, such as “classification noise”, where the label for each training example is flipped with some constant probability [Kea98].

• It is the most realistic learning model for which strong, unconditional lower bounds are known for many basic classes. Indeed, there is a considerable literature on this topic, with lower bounds usually proven using the so-called SQ-dimension and its generalizations [BFJ+94, Fel12, Rey20].

• SQ algorithms are naturally implementable in a way that satisfies differential privacy of the training data, and indeed are the main examples of realistic differentially private learning algorithms [BDMN05, DR14].

Given all of these properties, it is natural to wonder whether the SQ model has something to bring to quantum learnability, with a particular eye towards noise tolerance. In this work we show (among other results) that SQ-learning stabilizer states is exponentially hard, and that learning stabilizer states with noise is as hard as the well-known Learning Parity with Noise (LPN) problem.
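The statistical query oracle described above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not a construction from the paper: it restricts to a single qubit, represents the state by a Bloch vector `r` and each measurement $E = (I + s \cdot \vec{\sigma})/2$ by a direction `s` (so that $\mathrm{Tr}(E\rho) = (1 + s \cdot r)/2$), takes $D$ to be uniform over a finite list of measurements, and uses the hypothetical helper names `accept_prob`, `exact_expectation`, and `sq_oracle`.

```python
from fractions import Fraction

def accept_prob(s, r):
    """Tr(E rho) for E = (I + s.sigma)/2 and a state with Bloch vector r."""
    dot = sum(si * ri for si, ri in zip(s, r))
    return Fraction(1 + dot, 2)

def exact_expectation(phi, measurements, r):
    """E_{E~D, Y~f_rho(E)}[phi(E, Y)] with D uniform over `measurements`."""
    total = Fraction(0)
    for s in measurements:
        p = accept_prob(s, r)                  # probability that Y = +1
        total += p * phi(s, 1) + (1 - p) * phi(s, -1)
    return total / len(measurements)

def sq_oracle(phi, tau, measurements, r, shift=0):
    """Answer a statistical query (phi, tau).

    Any value within tau of the true expectation is a valid answer;
    `shift` (an assumption of this sketch) models that freedom.
    """
    assert tau > 0 and abs(shift) <= tau
    return exact_expectation(phi, measurements, r) + shift

# Example: state |0><0| (Bloch vector +z), D uniform over the X, Y, Z directions.
X_DIR, Y_DIR, Z_DIR = (1, 0, 0), (0, 1, 0), (0, 0, 1)
measurements = [X_DIR, Y_DIR, Z_DIR]
phi = lambda s, y: y if s == Z_DIR else 0      # correlate the label with "E is Z"
answer = sq_oracle(phi, Fraction(1, 10), measurements, (0, 0, 1),
                   shift=Fraction(-1, 20))
```

The learner only ever sees values such as `answer`, never an individual pair $(E, Y)$, which is what distinguishes the SQ setting from ordinary PAC learning.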

Theorem 1.1. Let $D$ denote the uniform distribution on Pauli measurements. Any SQ algorithm for learning $n$-qubit stabilizer states under $D$ up to error $2^{-O(n)}$ (i.e. to significantly outperform the maximally mixed state) requires $2^{\Omega(n)}$ queries even when tolerance is $2^{-O(n)}$.

For the next theorems, define a parity measurement to be a Pauli measurement of the form $E_x = \frac{P_x + I}{2}$ for some $x \in \{0, 1\}^n$, where $P_x = \sum_{y \in \{0,1\}^n} \chi_x(y)\, |y\rangle\langle y|$ and $\chi_x(y) = (-1)^{x \cdot y \bmod 2}$. They are so named since for any computational basis state $|y\rangle\langle y|$, $\mathrm{Tr}(E_x |y\rangle\langle y|) = \frac{1 + \chi_x(y)}{2} = 1 - (x \cdot y \bmod 2)$, so the outcome is determined by the parity $x \cdot y \bmod 2$.

Theorem 1.2. Let $D'$ denote the uniform distribution on parity measurements. Any SQ algorithm for learning $n$-qubit stabilizer states under $D'$ even up to constant error requires $2^{\Omega(n)}$ queries even when tolerance is $2^{-O(n)}$.

Theorem 1.3. Let $D'$ be as above. Learning $n$-qubit stabilizer states under $D'$ with classification noise at rate $\eta$ is at least as hard as the classical problem of Learning Parity with Noise (LPN) at rate $\eta$.

Our results position the problem of learning stabilizer states as a quantum analogue of the important classical problem of learning parities. (This recalls the way stabilizer codes are a quantum analogue of classical parity check codes.) In both cases there are simple “algebraic” learning algorithms for the noiseless setting, and the problem seems to become intractable with even the simplest kinds of noise. The algorithm of [Roc18] thus joins a small class of PAC algorithms that do not fall into the SQ model, and hence do not admit any straightforward algorithms in noisy settings. In our view, this frames learning stabilizer states with noise as one of the more compelling problems on the frontier of learning quantum states with noise.

Another interpretation of our results is that they highlight limitations of the PAC framework of [Aar07]: insofar as this framework reduces the problem of learning quantum states to an essentially classical problem, it also inherits longstanding problems from classical learning theory.

We also hope that our introduction of the SQ model to quantum state learning will be of independent interest and help spur new ideas in this area.

We now detail the rest of our contributions and lay out the organization of this paper:

• In Section 2, we formally define the problem of SQ-learning quantum states and extend the notion of the SQ-dimension to this setting, building on recent work that formally analyzed the SQ-dimension as applicable to the p-concept setting [GGJ+].

• In Section 3, we show that SQ algorithms for learning quantum states are indeed resistant to mild forms of noise, including classical classification noise as well as quantum channels with bounded noise (such as depolarizing noise).
• In Section 4, we give exponential SQ lower bounds on learning stabilizer states. Under the uniform distribution on Pauli measurements, we show (Corollary 4.7) that it requires exponentially many queries in order to improve on the maximally mixed state’s performance. Under a different natural distribution on Pauli measurements, namely the uniform distribution over parity measurements, we show (Corollaries 4.11, 4.12) that learning stabilizer states with noise is as hard as learning parities with noise.

• In Section 5, by way of positive results, we give SQ algorithms for the simple setting of learning product states. We describe an SQ algorithm for learning product states under Haar-random single-qubit measurements, and show that it allows one to perform tomography on the individual qubits.

• We do not obtain any direct connections to gentle measurement, quantum differential privacy or shadow tomography in the sense of [AR19, Aar19]. However, in Section 6 we obtain a different kind of differential privacy for quantum state learners.

The problem of learning quantum states via state tomography has a long history in quantum computing, culminating in the celebrated optimal algorithms of O’Donnell and Wright [OW16, OW17] and Haah et al. [HHJ+] [...] $n$-qubit quantum states. In recent years, this framework has been extended to the online setting [ACH+19] as well as verified in experimental setups [RAS+] [...] Boolean functions using quantum representations [AGY20]. When one is given quantum samples of Boolean or integer-valued functions, there have been important results on learning in the presence of noise, showing that both Learning Parity with Noise (LPN) [CSS15] and Learning With Errors (LWE) [GKZ19] are tractable in this setting.

Notation and terminology.

We use $\rho$ to refer to the density matrix of an $n$-qubit quantum mixed state, representable as a $2^n \times 2^n$ PSD operator of trace 1. (The number of qubits will be $n$ throughout and suppressed from the notation.) A pure state is a quantum state with rank 1. Let $\mathcal{E}$ denote the space of two-outcome $n$-qubit measurements $E$ (corresponding to the POVM $\{E, I - E\}$), which accept a state $\rho$ with probability $\mathrm{Tr}(E\rho)$. We will view the measurement outcomes themselves as $\{-1, 1\}$-valued, so that the outcome of measuring $\rho$ using $E$ is a random variable $Y$ that is $1$ with probability $\mathrm{Tr}(E\rho)$ and $-1$ otherwise. We define $f_\rho : \mathcal{E} \to [-1, 1]$ to be the conditional mean function

$$f_\rho(E) = \mathbb{E}[Y \mid E] = 1 \cdot \mathrm{Tr}(E\rho) + (-1) \cdot (1 - \mathrm{Tr}(E\rho)) = 2\,\mathrm{Tr}(E\rho) - 1.$$

We will often identify a state $\rho$ with its behavior with respect to two-outcome measurements, namely with the function $f_\rho$, and use the notation $Y \sim f_\rho(E)$ to mean that $Y \in \{-1, 1\}$ is the random measurement outcome satisfying $\mathbb{E}[Y \mid E] = f_\rho(E)$. In learning theoretic terms, this means $f_\rho$ describes a probabilistic concept, or p-concept, on the space $\mathcal{E}$. A p-concept on a domain $X$ is a classification rule that assigns random $\{-1, 1\}$-valued labels to each point in $X$ according to a fixed conditional mean function; we always identify the p-concept with its conditional mean function. Given a set $\mathcal{F}$ of quantum states, we use $\mathcal{F}$ to also mean the class of associated p-concepts, with the meaning clear from context.

Given a distribution $D$ over $\mathcal{E}$, we will often regard functions $f_\rho, f_\sigma : \mathcal{E} \to \mathbb{R}$ as members of the $L^2$ space $L^2(D, \mathcal{E})$, with the inner product given by $\langle f_\rho, f_\sigma \rangle_D = \mathbb{E}_{E \sim D}[f_\rho(E) f_\sigma(E)]$, and the norm given by $\|f_\rho\|_D = \sqrt{\langle f_\rho, f_\rho \rangle_D} = \sqrt{\mathbb{E}_{E \sim D}[(f_\rho(E))^2]}$.

We use $[n]$ to refer to the set of indices $\{1, \ldots, n\}$. Given a set $S \subseteq [n]$, we will use $\chi_S$ to refer to the parity on $S$, defined as a function from $\{-1, 1\}^n \to \{-1, 1\}$ by $\chi_S(x) = \prod_{i \in S} x_i$. Given $x, y \in \{0, 1\}^n$, we will sometimes also use $\chi_x(y) = (-1)^{x \cdot y \bmod 2}$.

We begin by formally defining the problem of PAC-learning a quantum state.

Definition 2.1 (PAC-learnability of quantum states, [Aar07]). Let $\mathcal{F}$ be a class of $n$-qubit quantum states. Let $D$ be a distribution over $\mathcal{E}$. We say $\mathcal{F}$ is PAC-learnable up to squared loss $\epsilon$ with respect to $D$ if there exists a learner that, given sample access to labeled examples $(E, Y)$ for $E \sim D$, $Y \sim f_\rho(E)$ for an unknown $\rho \in \mathcal{F}$, is able to output a state $\sigma$ satisfying

$$\mathbb{E}_{E \sim D}\left[(f_\sigma(E) - f_\rho(E))^2\right] \le \epsilon.$$

The number of examples used by the learner is called its sample complexity.

We note that this is a slight modification of the original definition in [Aar07], stated directly in terms of squared loss since this is the view that will be convenient for us. PAC learners are also allowed to fail with some probability $\delta$, but for simplicity we will ignore this in this paper. A learner that only succeeds with some constant probability can easily be amplified to succeed with probability $1 - \delta$ using standard confidence amplification procedures.

We introduce the following natural extension to the SQ setting.

Definition 2.2 (SQ-learnability of quantum states). Let $\mathcal{F}$ be a class of $n$-qubit quantum states. Let $D$ be a distribution over $\mathcal{E}$. An SQ oracle for an unknown state $\rho \in \mathcal{F}$ is an oracle that accepts a query and a tolerance, $(\varphi, \tau)$, where $\varphi : \mathcal{E} \times \{-1, 1\} \to [-1, 1]$ and $\tau > 0$, and responds with $y$ such that

$$\left| y - \mathbb{E}_{E \sim D,\, Y \sim f_\rho(E)}[\varphi(E, Y)] \right| \le \tau.$$

We say $\mathcal{F}$ is SQ-learnable up to squared loss $\epsilon$ if there is a learner that, given only query access to the SQ oracle for an unknown $\rho \in \mathcal{F}$, is able to output a state $\sigma$ satisfying

$$\mathbb{E}_{E \sim D}\left[(f_\sigma(E) - f_\rho(E))^2\right] \le \epsilon.$$

The number of queries used by the learner is called its query complexity.

In both cases, we will operate in the so-called distribution-specific setting, where the learner is assumed to have knowledge of the distribution $D$.

One of the chief features of the classical SQ model is the possibility of proving unconditional lower bounds on learning a class $\mathcal{C}$ in terms of its so-called statistical dimension. The quantum setting that we work in, where we identify a state $\rho$ with the p-concept $f_\rho$, becomes a special case of the SQ model for learning p-concepts. Building on recent work [GGJ+] [...]. Let $X$ denote an arbitrary domain (for us, $X$ will be $\mathcal{E}$, while in the classical setting, $X$ is usually $\mathbb{R}^n$).

Definition 2.3 (Statistical dimension). Let $D$ be a distribution on $X$, and let $\mathcal{C}$ be a class of functions from $X$ to $\mathbb{R}$. The average (un-normalized) correlation of $\mathcal{C}$ is defined to be $\rho_D(\mathcal{C}) = \frac{1}{|\mathcal{C}|^2} \sum_{c, c' \in \mathcal{C}} |\langle c, c' \rangle_D|$. The statistical dimension on average at threshold $\gamma$, $\mathrm{SDA}_D(\mathcal{C}, \gamma)$, is the largest $d$ such that for all $\mathcal{C}' \subseteq \mathcal{C}$ with $|\mathcal{C}'| \ge |\mathcal{C}|/d$, $\rho_D(\mathcal{C}') \le \gamma$.

Theorem 2.4 ([GGJ+]). Let $D$ be a distribution on $X$, and let $\mathcal{C}$ be a p-concept class on $X$. Say our queries are of tolerance $\tau$, the final desired squared loss is $\epsilon$, and that the functions in $\mathcal{C}$ satisfy $\|c\|_D \ge \beta$ for all $c \in \mathcal{C}$. For technical reasons, we require $\tau \le \epsilon$, $\epsilon \le \beta/$[...]. Then learning $\mathcal{C}$ up to squared loss $\epsilon$ (we may pick $\epsilon$ as large as $\sqrt{\beta/\,}$[...]) requires at least $\mathrm{SDA}_D(\mathcal{C}, \tau)$ queries of tolerance $\tau$.
We remark that the way to interpret such a lower bound is as follows: if the SQ learner’s queries have tolerance at least $\tau$, then at least $\mathrm{SDA}_D(\mathcal{C}, \tau)$ queries are required. That is, one must either use small tolerance or many queries.

The following lemma will be convenient in order to bound the SDA when we have bounds on pairwise correlations.

Lemma 2.5 ([GGJ+]). Let $D$ be a distribution on $X$, and let $\mathcal{C}$ be a p-concept class on $X$ such that for all $c, c' \in \mathcal{C}$ with $c \ne c'$, $|\langle c, c' \rangle_D| \le \gamma$, and for all $c \in \mathcal{C}$, $\|c\|_D^2 \le \kappa$. Then for any $\gamma' > 0$, $\mathrm{SDA}(\mathcal{C}, \gamma + \gamma') \ge \frac{|\mathcal{C}|\, \gamma'}{\kappa - \gamma}$.

### 2.3 The problem of learning parities

One of the most basic problems in classical learning theory is that of learning the concept class of parity functions. Let the domain be $X = \{-1, 1\}^n$, and for any subset $S \subseteq [n]$, define $\chi_S(x) = \bigoplus_{i \in S} x_i = \prod_{i \in S} x_i$ to be the parity on $S$. Here $x_i \oplus x_j = x_i x_j$ is simply the XOR when bits are represented by $\{-1, 1\}$. Let $D$ be any distribution on $X$. We say a learner is able to learn parities under $D$ if, given access to labeled examples $(x, \chi_S(x))$ where $x \sim D$ and $S \subseteq [n]$ is unknown (or, in the SQ setting, given access to the corresponding SQ oracle), and for any error parameter $\epsilon$, it is able to output a function $h$ such that $\Pr_{x \sim D}[h(x) \ne \chi_S(x)] \le \epsilon$.

The problem of learning parities displays a striking phase transition in going from the noiseless to the noisy setting. Given noiseless labeled examples, the problem of recovering the right parity is simply a question of solving linear equations over $\mathbb{F}_2$, and can be done using Gaussian elimination by a PAC learner using only $\Theta(n)$ examples. With just a little noise, however, the problem seems to become intractable. Perhaps the simplest noise model one can consider is the classification noise model, where every example has its label flipped with some constant probability $\eta$ (known as the noise rate).
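The noiseless algorithm mentioned above can be sketched in a few lines. This is a minimal illustration rather than code from the paper: it assumes bits are encoded in $\{0, 1\}$ (so a parity is $\langle s, x \rangle \bmod 2$ for a secret bit vector `s`) instead of the $\{-1, 1\}$ encoding used in the text, and the function names `solve_parity` and `sample` are hypothetical.

```python
import random

def solve_parity(examples, n):
    """Gauss-Jordan elimination over F_2.

    examples: list of (x, b), x a list of n bits, b the (claimed) parity label.
    Returns a secret consistent with every example, or None if none exists.
    """
    rows = [list(x) + [b] for x, b in examples]     # augmented matrix [x | b]
    pivots, r = [], 0
    for col in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue                                # free variable
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col]:
                rows[i] = [a ^ c for a, c in zip(rows[i], rows[r])]
        pivots.append(col)
        r += 1
    # A row reading 0 = 1 means no parity fits the data.
    if any(row[n] and not any(row[:n]) for row in rows):
        return None
    secret = [0] * n                                # free variables set to 0
    for i, col in enumerate(pivots):
        secret[col] = rows[i][n]
    return secret

def sample(secret, m, eta=0.0, rng=random):
    """Draw m uniform examples; each label is flipped with probability eta."""
    out = []
    for _ in range(m):
        x = [rng.randrange(2) for _ in secret]
        b = sum(s & xi for s, xi in zip(secret, x)) % 2
        if rng.random() < eta:
            b ^= 1                                  # classification noise
        out.append((x, b))
    return out
```

With `eta = 0`, a modest number of samples beyond `n` typically pins down the secret. With `eta > 0` the linear system is typically inconsistent once there are more than a handful of examples, so `solve_parity` returns `None`; this is one way to see why the algebraic approach does not survive label noise.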
Learning parities under classification noise is the basis of the famous Learning Parity with Noise (LPN) problem. Formally, the search version of LPN with noise rate $\eta$ is precisely the problem of learning parities under the uniform distribution on $X$ and with classification noise at rate $\eta$. Usually one also has the additional knowledge that the true target $\chi_S$ (the “secret”) is picked uniformly at random from the set of all parities. This problem is widely conjectured to be hard, including for quantum algorithms, and is even used as a basis for cryptography (see [Pie12] for a survey). The best-known algorithm in the PAC setting runs in slightly subexponential time [BKW03].

Since SQ learners are naturally tolerant of classification noise, one would expect that there are no SQ learners for parities under the uniform distribution, and indeed, this is one of the foundational results in the SQ literature.

Theorem 2.6 ([Kea98]). Any SQ learner requires $2^{\Omega(n)}$ queries (even using tolerance $2^{-O(n)}$) to learn parities under the uniform distribution on $\{-1, 1\}^n$ even up to constant error.

Thus we see that simple Gaussian elimination is an example of a PAC learner that is not SQ. This establishes a characteristic limitation of SQ algorithms: while they include a wide range of common algorithms, they do not include algorithms that depend entirely on “algebraic” structure.

It is worth emphasizing that this discussion has considered learning parities with a classical representation of the data. When given a quantum representation of the data, as in the quantum “example state” $|\psi\rangle = 2^{-n/2} \sum_{x \in \{0,1\}^n} |x, \chi_S(x)\rangle$ (taking the distribution over the domain to be uniform), the task becomes easy even with noise [CSS15]. This is because we can now use Hadamard gates to implement a Boolean Fourier transform à la the famous Bernstein-Vazirani algorithm [BV97].

## 3 Noise-tolerant SQ learning

One of the prime features of classical SQ learning is its inherent noise tolerance. From an intuitive standpoint, certain common stochastic noise models are systematic enough that their effects in expectation can be predicted in advance, and hence be corrected for. Slightly more precisely, the query expectations of a noisy state are often related in simple ways to the query expectations on a noiseless state, so that the latter can be recovered from the former. We mainly consider three such noise models here: (a) classical classification noise and malicious noise, (b) quantum depolarizing noise, and (c) more general quantum channels with bounded noise.
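To illustrate the "predict and correct" idea in the simplest case, consider classification noise at rate $\eta$: flipping a $\{-1, 1\}$-valued label with probability $\eta$ scales its conditional mean by $(1 - 2\eta)$, so for a query of the form $\varphi(E, Y) = Y \cdot g(E)$ the noisy expectation is $(1 - 2\eta)$ times the clean one and can be rescaled. The sketch below checks this exactly on a toy example; the finite domain, the choice of correlational queries, and the names `query_expectation` and `correct` are assumptions made for illustration and are not taken from the paper.

```python
from fractions import Fraction

def query_expectation(g, f, eta=Fraction(0)):
    """E[Y * g(x)] for x uniform over the keys of f.

    f[x] is the clean conditional mean of the {-1, +1} label Y at x, and
    each label is flipped independently with probability eta.
    """
    total = Fraction(0)
    for x, mean in f.items():
        p_plus = (1 + mean) / 2                       # P[Y = +1] without noise
        p_plus_noisy = (1 - eta) * p_plus + eta * (1 - p_plus)
        noisy_mean = 2 * p_plus_noisy - 1             # = (1 - 2*eta) * mean
        total += noisy_mean * g(x)
    return total / len(f)

def correct(noisy_value, eta):
    """Undo the known shrinkage factor; requires eta != 1/2."""
    return noisy_value / (1 - 2 * eta)

# Toy p-concept on three hypothetical measurements labeled "X", "Y", "Z".
f = {"X": Fraction(1, 2), "Y": Fraction(-1, 4), "Z": Fraction(1)}
g = lambda x: -1 if x == "Y" else 1
eta = Fraction(1, 5)
clean = query_expectation(g, f)
noisy = query_expectation(g, f, eta)
recovered = correct(noisy, eta)
```

Dividing by $(1 - 2\eta)$ also amplifies any error in the noisy estimate by the same factor, which is why the noisy queries need a proportionally smaller tolerance.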

Classification noise [AL88] and malicious noise [Val85] are two classical Boolean noise models that SQ algorithms are able to handle. In the classification noise model, every example’s label is flipped with probability $\eta$ (known as the noise rate). The malicious noise model is a stronger form of noise where for any given example, with probability $1 - \eta$, the label is reported correctly, but with probability $\eta$ both the point and its label may be arbitrary (and adversarially selected based on the learner’s behavior so far). We note that these models are well-defined even in the p-concept setting and hence for quantum states, and simply introduce further randomness into the label. The following results were originally stated for Boolean functions but readily extend to p-concepts.

Theorem 3.1 ([Kea98]). Let $\mathcal{C}$ be a p-concept class learnable under distribution $D$ in the SQ model up to error $\epsilon$ using $q$ queries of tolerance $\tau$. Then for any constant $0 < \eta < 1/2$, even with respect to an SQ oracle with classification noise at rate $\eta$ (i.e., one that computes expectations with classification noise), $\mathcal{C}$ is learnable up to $\epsilon$ using $O(q)$ queries of tolerance $O(\tau(1 - 2\eta))$. If the learner is given noisy training examples as opposed to access to a noisy SQ oracle, then $\tilde{O}\left(\frac{q}{\mathrm{poly}(\tau(1 - 2\eta))}\right)$ noisy examples suffice.

Theorem 3.2 ([AD98]). Let $\mathcal{C}$ be a p-concept class learnable under distribution $D$ in the SQ model up to error $\epsilon$ using $q$ queries of tolerance $\tau$. An SQ oracle with malicious noise at rate $\eta$ is one that computes query expectations with respect to a distribution $(1 - \eta) f(D) + \eta Q$, where $f(D)$ denotes the true labeled distribution $(x, y)$ for $x \sim D$, $y \sim f(x)$ ($f$ being the unknown target p-concept), and $Q$ is an arbitrary and adversarially selected distribution on $X \times \{-1, 1\}$. If $\eta = \tilde{O}(\epsilon)$ and $\eta < \tau$, then even with respect to an SQ oracle with malicious noise at rate $\eta$, $\mathcal{C}$ is learnable up to $\epsilon$ using $O(q)$ queries of tolerance $\tau - \eta$. If the learner is given noisy training examples as opposed to access to a noisy SQ oracle, then $\mathcal{C}$ is learnable (with constant probability) using $\tilde{O}\left(\frac{q}{\mathrm{poly}(\tau - \eta)}\right)$ noisy examples. (More efficient implementations are also available in some special cases.)

The proofs of both theorems are similar: one first relates the noisy query expectations to the true expectations, and then argues that when using a suitably small tolerance (or sufficiently many examples) the effects of the noise can be corrected for (within information theoretic limits).

### 3.2 Depolarizing noise

Depolarizing noise acts on quantum states by shifting them closer to the maximally mixed state. One can consider a setting where it acts on an entire $n$-qubit state at once, as well as one where it acts independently on each individual qubit. We will consider the former.

Definition 3.3 (Depolarizing noise). Let $\rho$ be an arbitrary $n$-qubit state. Then depolarizing noise at rate $\eta$ ($0 < \eta < 1$) acts on this state by transforming it into $\Lambda_\eta(\rho) = (1 - \eta)\rho + \eta (I/2^n)$.

Theorem 3.4. Let $0 < \eta < 1$ be any constant, and let $\Lambda_\eta$ denote the depolarizing channel at noise rate $\eta$. Let $\mathcal{C}$ be a class of $n$-qubit quantum states and $D$ be a distribution on $\mathcal{E}$, the space of two-outcome measurements on such states. Let $L$ be an SQ learner capable of learning $\mathcal{C}$ under $D$ using $q$ queries of tolerance $\tau$. Then there exists a learner $L'$ such that for any $\rho \in \mathcal{C}$, $L'$ is capable of learning $\rho$ under $D$ using $q$ queries of tolerance $\tau(1 - \eta)$ given only SQ access to $\Lambda_\eta(\rho)$ as well as sampling access to $D$.

Proof. For simplicity, we will assume that we know the noise rate $\eta$ exactly. (So long as we have an upper bound on $\eta$, then by a standard “grid search” argument due to [Kea98], we can estimate $\eta$ sufficiently closely simply by trying out many different values. Briefly: if say we try out $\eta = 0, \delta, 2\delta, \ldots, 1$ ($1/\delta$ values in all), then one of these will be within $\delta/2$ of the true $\eta$. The algorithm when run with this guess for $\eta$ will produce a good hypothesis. By taking $\delta = O(\tau(1 - \eta))$ and testing all $1/\delta$ hypotheses produced by our guesses for $\eta$ on a sufficiently large validation set, we can ensure the best one will perform and generalize well.)

Let $\rho \in \mathcal{C}$ be the unknown target. Observe that for any $E \in \mathcal{E}$, by linearity,

$$f_{\Lambda_\eta(\rho)}(E) = 2\,\mathrm{Tr}(E \cdot \Lambda_\eta(\rho)) - 1 = (1 - \eta) f_\rho(E) + \eta f_{I/2^n}(E).$$

Let $\varphi : \mathcal{E} \times \{-1, 1\} \to [-1, 1]$ be any query made by $L$. Let $\varphi[\rho]$ denote the query expectation of $\varphi$ under $\rho$, given by $\mathbb{E}_{x \sim D}\, \mathbb{E}_{y \sim f_\rho(x)}[\varphi(x, y)]$. Similarly let the noisy analogue be $\varphi[\Lambda_\eta(\rho)]$. Again just by linearity,

$$\varphi[\Lambda_\eta(\rho)] = (1 - \eta)\varphi[\rho] + \eta \varphi[I/2^n].$$

The quantity $\varphi[I/2^n]$ is independent of $\rho$ and can be estimated to arbitrary accuracy by sampling from $D$, allowing us to estimate $\varphi[\rho]$ as $\frac{\varphi[\Lambda_\eta(\rho)] - \eta \varphi[I/2^n]}{1 - \eta}$. So long as $\eta$ is bounded away from 1, we can use a query of tolerance $\tau(1 - \eta)$ to estimate $\varphi[\Lambda_\eta(\rho)]$ (as well as $1/\mathrm{poly}(\tau(1 - \eta))$ unlabeled examples from $D$ to compute $\varphi[I/2^n]$), and thereby estimate $\varphi[\rho]$ to within $\tau$. Thus we can simulate $L$ even with depolarizing noise.

It is worth stressing that we are able to handle any constant noise rate $\eta \in (0, 1)$, at the cost of a reduced tolerance $\tau(1 - \eta)$.

We can also consider more general kinds of quantum channels with bounded noise. As long as the queries are bounded, small amounts of noise cannot alter query expectations too much, and so can be “absorbed” into the tolerance. This is similar to classical malicious noise: since classical malicious noise at rate $\eta$ can only change query expectations by at most $\eta$ (recall that the queries are bounded by 1), a noisy query of tolerance $\tau - \eta$ is able to simulate a noiseless query of tolerance $\tau$. Unlike with depolarizing noise, this means we cannot handle arbitrary $\eta$; this is an artifact of the fact that more general kinds of noise do not permit the kind of systematic correction we were able to perform for depolarizing noise.

For concreteness here we consider a noisy quantum channel $\Lambda$ such that $\|\Lambda - \mathbb{1}_n\|_\diamond \le \eta$, where $\mathbb{1}_n$ is the identity map on $n$-qubit states and the norm is the diamond norm. We do not define this norm here, but its chief property for our purposes is that for any $n$-qubit state $\rho$ and 2-outcome measurement $E$, $|\mathrm{Tr}(E(\rho - \Lambda(\rho)))| \le \eta$. Similar theorems can be proven with respect to other distance measures such as fidelity.

Proposition 3.5.

Let $\Lambda$ be a quantum channel such that $\|\Lambda - \mathbb{1}_n\|_\diamond \le \eta$, as above. Let $\mathcal{C}$ be a class of $n$-qubit quantum states learnable under distribution $D$ using $q$ queries of tolerance $\tau$. Then $\mathcal{C}$ is still learnable under noise $\Lambda$ (i.e. when our queries are answered not with respect to $\rho$ but $\Lambda(\rho)$) using $q$ noisy queries of tolerance $\tau - \eta$.

Proof. As noted, for any state $\rho$ and measurement $E$, $|\mathrm{Tr}(E(\rho - \Lambda(\rho)))| \le \eta$. Consider any query $\varphi : \mathcal{E} \times \{-1, 1\} \to [-1, 1]$. If $\varphi[\rho]$ denotes the query expectation on a noiseless state and $\varphi[\Lambda(\rho)]$ denotes the noisy one, then a straightforward calculation shows that

$$|\varphi[\rho] - \varphi[\Lambda(\rho)]| = \left| \mathbb{E}_{E \sim D}\left[ (\varphi(E, 1) - \varphi(E, -1))\, \mathrm{Tr}(E(\rho - \Lambda(\rho))) \right] \right| \le \mathbb{E}_{E \sim D} \left| \mathrm{Tr}(E(\rho - \Lambda(\rho))) \right| \le \eta.$$

Thus just by the triangle inequality, if we calculate $\varphi[\Lambda(\rho)]$ within tolerance $\tau - \eta$, then we also get $\varphi[\rho]$ within $\tau$.

So far, we’ve only considered distribution-specific learning, where the learner is only required to succeed with respect to a pre-specified distribution $D$. In the distribution-free case, where the learner is required to succeed no matter what $D$ is, we now give a simple proof that any SQ algorithm for a concept class can also handle any kind of quantum noise on the state, as long as the noise is known. This is unsurprising, and at a high level, the approach simply boils down to off-loading the noise from the state to the measurement. Learning a noisy set of measurements is thus handled by a distribution-free learning algorithm.

Given a quantum operation $\Lambda$, its adjoint $\Lambda^\dagger$ is such that $\mathrm{Tr}(E \cdot \Lambda(\rho)) = \mathrm{Tr}(\Lambda^\dagger(E) \cdot \rho)$ for all $\rho$, and it always exists (see [RLCK19] for details on how to prove this folklore result). Let $D$ be the distribution under which we are trying to learn the concept class $\mathcal{C}$ using statistical queries, and let $\Lambda$ be the noise applied to the quantum state. We can then define $D^\dagger$ to be the distribution of $\Lambda^\dagger(E)$ where $E$ is drawn from $D$; by definition the traces (and thus the statistical queries) are the same when $D^\dagger$ is applied to $\rho$ and $D$ is applied to $\Lambda(\rho)$, respectively. Also by definition, a distribution-free learner for $\mathcal{C}$ would also be able to learn with distribution $D^\dagger$.

## 4 Lower bounds on learning stabilizer states with noise

The stabilizer states are a popular class of quantum states that are used throughout quantum information for areas like quantum error correction, classical simulation of quantum mechanics, and quantum communication. If we let $\mathcal{P} = \{I, X, Y, Z\}$ be the Pauli matrices, then we can define a generalization of the Pauli matrices to an $n$-qubit system as $\mathcal{P}_n = \{\pm 1\} \cdot \{I, X, Y, Z\}^{\otimes n}$. As an example, $-X \otimes Y \otimes Z \otimes Y \otimes Z \in \mathcal{P}_5$, though in the future we will simply opt to write this Pauli matrix as $-XYZYZ$. It is not hard to show that for every $P \in \mathcal{P}_n$ with $P \ne \pm I^{\otimes n}$, half of the eigenvalues are $1$ and the other half are $-1$. We say that a pure state $\rho = |\psi\rangle\langle\psi|$ with $|\psi\rangle \in \mathbb{C}^{2^n}$ is stabilized by $P \in \mathcal{P}_n$ if $P|\psi\rangle = |\psi\rangle$. In other words, $|\psi\rangle$ must be an eigenvector of $P$ with eigenvalue 1. The set of pure states that are stabilized by the subset $S \subseteq \mathcal{P}_n$ are then the states that lie in the intersection of the eigenvalue 1 subspaces. It is known that if $S$ is an Abelian group that does not contain $-I^{\otimes n}$, then this intersection has dimension $2^n/|S|$ [Ham89]. Due to the nature of the outer product, any vector $e^{i\theta}|\psi\rangle$ drawn from this space results in the same density matrix $\rho = |\psi\rangle\langle\psi|$.

Definition 4.1 (Stabilizer states). Let $S \subset \mathcal{P}_n \setminus \{-I^{\otimes n}\}$ be an Abelian group of order $2^n$ (note that $\mathcal{P}_n$ is not itself a group under matrix multiplication). The unique density matrix $\rho = |\psi\rangle\langle\psi|$ that results from the one-dimensional subspace that is stabilized by all elements of $S$ is then defined to be a stabilizer state. We then say that $S$ stabilizes $\rho$. The stabilizer states are then the set of all such quantum pure states that are stabilized by Abelian groups of order $2^n$ formed from $\mathcal{P}_n \setminus \{-I^{\otimes n}\}$.

Proposition 4.2.

Given two $n$-qubit stabilizer states $|\psi\rangle\langle\psi| \ne |\phi\rangle\langle\phi|$ with stabilizer groups $S$ and $S'$ respectively, then $|S \cap S'| \le 2^{n-1}$.

Proof. Because $|\psi\rangle\langle\psi| \ne |\phi\rangle\langle\phi|$, we have $S \ne S'$. We also know that $S \cap S'$ is an Abelian group without $-I^{\otimes n}$, so $|S \cap S'| < 2^n$. Since $2^n/|S \cap S'|$ is the dimension of the space stabilized by this group [Ham89], it must be an integer. Due to the prime factorization of $2^n$, $|S \cap S'| = 2^m$ for some integer $0 \le m < n$, of which the largest possible $m$ is $n - 1$.

Let $\mathcal{C}$ be the class of all $n$-qubit stabilizer pure states. If $P \in \mathcal{P}_n$ is a Pauli operator, then the two-outcome measurement associated with $P$ is $(I + P)/2$, and is referred to as a Pauli measurement. We will first examine the natural distribution $D$ given by the uniform distribution over Pauli measurements, $\mathcal{E}_{\mathrm{Pauli}} = \{(I + P)/2 \mid P \in \mathcal{P}_n\}$. (Note that $\mathcal{P}_n$ is slightly different from the usual Pauli group $\{\pm 1, \pm i\} \cdot \{I, X, Y, Z\}^{\otimes n}$. We choose to ignore the imaginary phases as they will never be a part of a valid stabilizer measurement: for instance, $(I + iX)/2$ is not Hermitian and so is not a valid measurement.) In doing so, we will show that performing better than the trivial algorithm of always outputting the maximally mixed state $I/2^n$ is difficult.

We use the following folklore lemma (see e.g. [Roc18] Lemma 1 for a simple proof).

Lemma 4.3.

Let $E = (I + P)/2$ be a POVM measurement associated to a Pauli operator $P \in \mathcal{P}_n$ and let $\rho$ be an $n$-qubit stabilizer state. Then $\mathrm{Tr}(E\rho)$ can only take on the values $\{0, 1/2, 1\}$, and:

$\mathrm{Tr}(E\rho) = 1$ iff $P$ is a stabilizer of $\rho$;
$\mathrm{Tr}(E\rho) = 1/2$ iff neither $P$ nor $-P$ is a stabilizer of $\rho$;
$\mathrm{Tr}(E\rho) = 0$ iff $-P$ is a stabilizer of $\rho$.

Simple algebraic manipulations then tell us that $\mathrm{Tr}(P\rho)$ can only take on the values $\{1, 0, -1\}$, with $\mathrm{Tr}(P\rho)$ being $1$ or $-1$ if $P$ or $-P$ is in the stabilizer group of $\rho$ respectively. With this result, we can compute bounds on $|\langle f_\rho, f_{\rho'} \rangle_D|$ for stabilizer states $\rho$ and $\rho'$ by counting how many matrices lie in the intersection of their stabilizer groups or the negations of their stabilizer groups.

Lemma 4.4.

Let $\mathcal{C}$ be the concept class of $n$-qubit stabilizer pure states, and let $D$ denote the uniform distribution on $n$-qubit Pauli measurements. Then for any stabilizer states $\rho, \rho'$ with $\rho \neq \rho'$, $|\langle f_\rho, f_\rho \rangle_D| = \|f_\rho\|_D^2 = \frac{1}{2^n}$, and $|\langle f_\rho, f_{\rho'} \rangle_D| \leq \frac{1}{2^{n+1}}$. Furthermore, this inequality is tight.

Proof. Let $\rho, \rho' \in \mathcal{C}$. Let $S_+$ and $S'_+$ be the stabilizer groups for $\rho$ and $\rho'$ respectively, and let $S_- = \{-P \mid P \in S_+\}$ and $S'_- = \{-P \mid P \in S'_+\}$. For any Pauli measurement $(I + P)/2$, we have $f_\rho\left(\frac{I + P}{2}\right) = 2\,\mathrm{Tr}\left(\frac{I + P}{2}\rho\right) - 1 = \mathrm{Tr}(P\rho)$. Thus the correlation of the two p-concepts becomes

$$|\langle f_\rho, f_{\rho'} \rangle_D| = \frac{1}{|\mathcal{P}_n|} \left| \sum_{P \in \mathcal{P}_n} \mathrm{Tr}(P\rho) \cdot \mathrm{Tr}(P\rho') \right|.$$

Leveraging Lemma 4.3, we find:

$$|\langle f_\rho, f_{\rho'} \rangle_D| = \frac{1}{|\mathcal{P}_n|} \left| \sum_{P \in \mathcal{P}_n} \mathrm{Tr}(P\rho) \cdot \mathrm{Tr}(P\rho') \right|$$
$$= \frac{1}{2 \cdot 4^n} \left[ |S_+ \cap S'_+| + |S_- \cap S'_-| - |S_+ \cap S'_-| - |S_- \cap S'_+| \right]$$
$$= \frac{1}{4^n} \left[ |S_+ \cap S'_+| - |S_+ \cap S'_-| \right].$$

If $\rho = \rho'$ then $S_+ = S'_+$, so that $|S_+ \cap S'_+| = 2^n$ and $|S_+ \cap S'_-| = 0$. Thus for all $\rho$,

$$|\langle f_\rho, f_\rho \rangle_D| = \frac{2^n}{4^n} = \frac{1}{2^n}.$$

If $\rho \neq \rho'$ then by Proposition 4.2, $|S_+ \cap S'_+| \leq 2^{n-1}$. Since $|S_+ \cap S'_-| \geq 0$, we get

$$|\langle f_\rho, f_{\rho'} \rangle_D| \leq \frac{1}{4^n} \left( 2^{n-1} \right) = \frac{1}{2^{n+1}}.$$
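The counting argument in this proof can be checked by brute force for small $n$. The sketch below is only an illustration under stated assumptions: it restricts attention to product stabilizer states (tensor products of the six single-qubit stabilizer states), which form a subset of $\mathcal{C}$, and the helper names `pauli_expectation`, `correlation` and `summarize` are hypothetical. It evaluates $\mathrm{Tr}(P\rho) \in \{1, 0, -1\}$ via Lemma 4.3 and averages over all of $\mathcal{P}_n$.

```python
from fractions import Fraction
from itertools import product

# A single-qubit stabilizer state is stored as (axis, sign): the eigenvector
# of the Pauli `axis` with eigenvalue `sign`, e.g. ("Z", +1) is |0>.
SINGLE_QUBIT_STATES = [(axis, sign) for axis in "XYZ" for sign in (+1, -1)]


def pauli_expectation(sign, letters, state):
    """Tr(P rho) for P = sign * (tensor of letters) and a product stabilizer state."""
    value = sign
    for letter, (axis, axis_sign) in zip(letters, state):
        if letter == "I":
            continue
        if letter != axis:
            return 0  # neither P nor -P stabilizes the state (Lemma 4.3)
        value *= axis_sign
    return value


def correlation(rho, sigma):
    """<f_rho, f_sigma>_D under the uniform distribution over Pauli measurements."""
    n = len(rho)
    paulis = [(s, letters) for s in (+1, -1)
              for letters in product("IXYZ", repeat=n)]
    total = sum(pauli_expectation(s, letters, rho) * pauli_expectation(s, letters, sigma)
                for s, letters in paulis)
    return Fraction(total, len(paulis))


def summarize(n):
    """Return the set of squared norms and the largest cross-correlation magnitude."""
    states = list(product(SINGLE_QUBIT_STATES, repeat=n))
    norms = {correlation(s, s) for s in states}
    max_cross = max(abs(correlation(s, t))
                    for s in states for t in states if s != t)
    return norms, max_cross
```

For $n = 2$, `summarize(2)` returns `({Fraction(1, 4)}, Fraction(1, 8))`, matching $1/2^n$ and $1/2^{n+1}$ on this restricted family.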

We can show that this inequality is tight because the states $\rho = |0\rangle\langle 0|^{\otimes n}$ and $\rho' = |0\rangle\langle 0|^{\otimes n-1} \otimes |+\rangle\langle +|$ saturate it. The generators for the stabilizer group of $\rho$ are (omitting tensor products): $ZIII \cdots I$, $IZII \cdots I$, ..., and $IIII \cdots IZ$. The generators of $\rho'$ are the same, except the last generator is replaced with $IIII \cdots IX$. We see that $|S \cap S'| = 2^{n-1}$ while $|S \cap S'_-| = 0$.

With this result, we can use Lemma 2.5 to compute SDA and by extension prove a lower bound on the number of statistical queries needed to learn this concept class under this distribution.

Proposition 4.5 ([AG04] Proposition 2). The number of $n$-qubit stabilizer states grows as $2^{\Theta(n^2)}$.

Theorem 4.6.

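Let $D$ be the uniform distribution over $n$-qubit Pauli measurements and let $\mathcal{C}$ be the concept class of all $n$-qubit stabilizer pure states. Then $\mathrm{SDA}(\mathcal{C}, 2^{-n}) = 2^{\Theta(n^2)}$.

Proof. By Proposition 4.5, $|\mathcal{C}| = 2^{\Theta(n^2)}$. Using Lemma 2.5 with $\kappa = \frac{1}{2^n}$ and $\gamma = \frac{1}{2^{n+1}}$ as calculated from Lemma 4.4, we find that

$$\mathrm{SDA}\left(\mathcal{C}, \gamma' + \frac{1}{2^{n+1}}\right) = \mathrm{SDA}(\mathcal{C}, \gamma' + \gamma) \geq |\mathcal{C}| \frac{\gamma'}{\kappa - \gamma} = 2^{\Theta(n^2)} \gamma' 2^{n+1} = 2^{\Theta(n^2)} \gamma'.$$

Setting $\gamma' = \frac{1}{2^{n+1}}$ gives the result.

Corollary 4.7.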

Any SQ algorithm needs at least $2^{\Omega(n^2)}$ statistical queries of tolerance $\tau = 2^{-O(n)}$ to learn $\mathcal{C}$ up to error $2^{-O(n)}$ over $D$.

Proof. Simply apply Theorem 2.4, with $\beta = 2^{-n}$.

Since the norms of our p-concepts are exponentially small (i.e. $2^{-n/2}$), we only get hardness for error on the order of $2^{-O(n)}$. But as we now show, the p-concept norm corresponds almost exactly to the squared loss achieved by the maximally mixed state. Our results thus show that doing significantly better than the maximally mixed state requires $2^{\Omega(n^2)}$ statistical queries even when the tolerance is exponentially small.

Proposition 4.8.

Let $D$ be the uniform distribution over $n$-qubit Pauli measurements, $\mathcal{E}_{\mathrm{Pauli}}$. Let $\rho$ be any state, and let $I/2^n$ be the maximally mixed state. Then

$$\|f_\rho\|_D^2 = \|f_\rho - f_{I/2^n}\|_D^2 + 1/4^n.$$

Proof.

In essence, this is simply because the p-concept $f_{I/2^n}$ is almost always zero. Specifically, for all $E \in \mathcal{E}_{\mathrm{Pauli}} \setminus \{0, I\}$, $f_{I/2^n}(E) = 2\,\mathrm{Tr}(E/2^n) - 1 = 0$, since $\mathrm{Tr}(E) = 2^{n-1}$ for all such $E$. (This is because $E = (I + P)/2$ for some $n$-qubit Pauli matrix $P \in \mathcal{P}_n$, and $\mathrm{Tr}(P) = 0$ for all $P \in \mathcal{P}_n \setminus \{\pm I\}$.) As for $E \in \{0, I\}$, we note that $f_\rho(E) = f_{I/2^n}(E)$. Thus

$$\|f_\rho\|_D^2 = \frac{1}{|\mathcal{E}_{\mathrm{Pauli}}|} \sum_{E \in \mathcal{E}_{\mathrm{Pauli}}} f_\rho(E)^2$$
$$= \frac{1}{|\mathcal{E}_{\mathrm{Pauli}}|} \left[ \sum_{E \in \mathcal{E}_{\mathrm{Pauli}} \setminus \{0, I\}} f_\rho(E)^2 + \sum_{E \in \{0, I\}} f_\rho(E)^2 \right]$$
$$= \frac{1}{|\mathcal{E}_{\mathrm{Pauli}}|} \left[ \sum_{E \in \mathcal{E}_{\mathrm{Pauli}} \setminus \{0, I\}} \left( f_\rho(E) - f_{I/2^n}(E) \right)^2 + \sum_{E \in \{0, I\}} f_\rho(E)^2 \right]$$
$$= \frac{1}{|\mathcal{E}_{\mathrm{Pauli}}|} \left[ \sum_{E \in \mathcal{E}_{\mathrm{Pauli}}} \left( f_\rho(E) - f_{I/2^n}(E) \right)^2 + \sum_{E \in \{0, I\}} f_\rho(E)^2 \right]$$
$$= \|f_\rho - f_{I/2^n}\|_D^2 + \frac{2}{|\mathcal{E}_{\mathrm{Pauli}}|} = \|f_\rho - f_{I/2^n}\|_D^2 + 1/4^n.$$

To get around this norm issue, we look at a subset of stabilizer states such that we can produce p-concepts with norm 1. Recall that the Pauli measurements are the set of all projectors onto the eigenvalue-1 space of some Pauli matrix $P$, i.e. $\left\{ \frac{P + I}{2} \mid P \in \mathcal{P}_n \right\}$. We define a subset of the Pauli measurements called the parity measurements, and show the hardness of SQ-learning stabilizer states under the uniform distribution on such measurements. This is via a simple equivalence, holding essentially by construction, with the problem of learning parities under the uniform distribution. We also show that learning stabilizer states with noise is at least as hard as Learning Parity with Noise (LPN).

Definition 4.9 (Parity measurements). For all $x \in \{0, 1\}^n$, let $P_x = \sum_{y \in \{0, 1\}^n} \chi_x(y) |y\rangle\langle y|$. Since the set of $P_x$ is equivalent to $\{I, Z\}^{\otimes n}$, the corresponding measurement $E_x = \frac{P_x + I}{2}$ is by definition a Pauli measurement. We will refer to such measurements as parity measurements.

Proposition 4.10.

For every distribution $D$ on $\{0, 1\}^n$ there is a corresponding distribution $D'$ on parity measurements such that learning computational basis states under $D'$ is equivalent to learning parities under $D$. Furthermore, this equivalence holds even with classification noise: for any $\eta$, learning computational basis states under $D'$ with noise rate $\eta$ is equivalent to learning parities under $D$ with noise rate $\eta$.

In particular, learning stabilizer states under $D'$ is at least as hard as learning parities under $D$.

Proof. If the unknown state $\rho$ is a computational basis state $|y\rangle\langle y|$, then the value

$$\mathrm{Tr}(E_x |y\rangle\langle y|) = x \cdot y \bmod 2$$

is simply the parity of $x$ over the subset specified by $y$ (represented using $\{0, 1\}$ instead of $\{-1, 1\}$). In the PAC setting, this would be equivalent to getting the sample $(E_x, x \cdot y \bmod 2)$. Accordingly, let us define $D'$ simply as the distribution over $E_x$ for $x \sim D$. It is clear that these are different representations of the same problem, such that a learning algorithm for one implies a learning algorithm for the other. We note that this relationship holds even in the presence of classification noise. Finally, note that computational basis states are a subset of the stabilizer states, so any learner for stabilizer states implies a learner for the computational basis states as well. This implies that learning stabilizer states on $D'$ is at least as hard as learning parities on $D$, even in the presence of classification noise.

Corollary 4.11.

SQ-learning stabilizer states under the uniform distribution over parity measurements requires $\Omega(2^n)$ queries even with constant error.

Proof. By Proposition 4.10, SQ-learning stabilizer states under the uniform distribution on parity measurements $E_x$ is at least as hard as learning parities over the uniform distribution. Applying Theorem 2.6, we get the exponential lower bound.

Corollary 4.12.

Learning stabilizer states under the uniform distribution over parity measurements with classification noise rate $\eta$ is at least as hard as LPN with noise rate $\eta$.

Proof. Proposition 4.10 directly implies that learning computational basis states under the uniform distribution on parity measurements and with classification noise is equivalent to LPN.
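To make the correspondence with parities concrete, here is a minimal sketch of the classical side of the reduction. It assumes examples of the form $(x, x \cdot y \bmod 2)$ as in the proof of Proposition 4.10, with each label flipped independently with probability $\eta$; the function names are hypothetical. Without noise, the hidden string can be recovered by Gaussian elimination over GF(2), which is the standard reason noiseless parities are easy, whereas the same approach breaks down once labels are noisy.

```python
import random


def parity_label(x, y):
    """Return x . y mod 2 for two equal-length bit tuples."""
    return sum(a & b for a, b in zip(x, y)) % 2


def noisy_parity_samples(y, m, eta, rng):
    """Draw m examples (x, label): x uniform, label = x.y mod 2 flipped w.p. eta."""
    n = len(y)
    samples = []
    for _ in range(m):
        x = tuple(rng.randrange(2) for _ in range(n))
        label = parity_label(x, y)
        if rng.random() < eta:
            label ^= 1  # classification noise
        samples.append((x, label))
    return samples


def solve_parity(samples, n):
    """Gaussian elimination over GF(2) on noiseless samples.

    Returns the hidden string, or None if the examples do not have full rank.
    """
    rows = [list(x) + [label] for x, label in samples]
    pivot_row = 0
    for col in range(n):
        hit = next((i for i in range(pivot_row, len(rows)) if rows[i][col]), None)
        if hit is None:
            continue
        rows[pivot_row], rows[hit] = rows[hit], rows[pivot_row]
        for i in range(len(rows)):
            if i != pivot_row and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[pivot_row])]
        pivot_row += 1
    if pivot_row < n:
        return None
    # After full reduction, row i is the i-th unit vector with right-hand side y_i.
    return tuple(rows[i][n] for i in range(n))
```

With `eta = 0` and enough random examples, `solve_parity` recovers `y`; with `eta > 0` the linear system is generally inconsistent, which is the regime of LPN.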

Turning to positive results, we now give SQ algorithms for some simple concept classes, namely the computational basis states and, more generally, products of $n$ single-qubit states. The distribution on measurements that we will consider will correspond to a very natural scheme for these classes: pick a qubit at random and measure it using a Haar-random unitary.

Concretely, let $D'$ be the distribution of single qubit measurements formed from the projection onto a Haar-random single qubit state (i.e. $U|0\rangle\langle 0|U^\dagger$ where $U$ is a Haar-random unitary), and let $D$ be the distribution on $n$-qubit measurements that corresponds to picking a qubit at random and measuring it using a measurement drawn from $D'$. That is, $D = \frac{1}{n} \sum_{i=1}^{n} I^{\otimes i-1} \otimes D' \otimes I^{\otimes n-i}$. Let $\mathcal{C}$ be the concept class of product states $\rho = \bigotimes_{i=1}^{n} \rho_i$. Of course, this class includes the computational basis states. The main result of this section will be a simple $O(n)$-query SQ algorithm for learning $\mathcal{C}$ under the distribution $D$.

The following technical lemma will be the backbone of our results.

Lemma 5.1.

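For any single qubit pure state $|\psi\rangle\langle\psi| = \frac{I + P}{2}$ and mixed state $\rho$:

$$\mathbb{E}_{E \sim D'} \left[ \mathrm{sign}\left( \mathrm{Tr}(E|\psi\rangle\langle\psi|) - \frac{1}{2} \right) \left( \mathrm{Tr}(E\rho) - \frac{1}{2} \right) \right] = \frac{1}{4} \mathrm{Tr}(P\rho).$$

Proof.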

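We will decompose $\rho = \lambda |\varphi\rangle\langle\varphi| + (1 - \lambda) |\varphi^\perp\rangle\langle\varphi^\perp|$ such that $|\varphi\rangle = \cos\theta' |\psi\rangle + \sin\theta' |\psi^\perp\rangle$ and $|\varphi^\perp\rangle = e^{i\varphi'} (\sin\theta' |\psi\rangle - \cos\theta' |\psi^\perp\rangle)$. The following identity will be useful at the end:

$$\mathrm{Tr}(P\rho) = 2\,\mathrm{Tr}(|\psi\rangle\langle\psi|\rho) - 1$$
$$= 2 \left[ \lambda\,\mathrm{Tr}(|\psi\rangle\langle\psi||\varphi\rangle\langle\varphi|) + (1 - \lambda)\,\mathrm{Tr}(|\psi\rangle\langle\psi||\varphi^\perp\rangle\langle\varphi^\perp|) \right] - 1$$
$$= 2\lambda \cos^2\theta' + 2(1 - \lambda) \sin^2\theta' - (\sin^2\theta' + \cos^2\theta')$$
$$= (2\lambda - 1) \cos^2\theta' - (2\lambda - 1) \sin^2\theta'$$
$$= (2\lambda - 1) \cos 2\theta'.$$

Let $U$ be the unitary such that $U|0\rangle = |\psi\rangle$ and $U|1\rangle = |\psi^\perp\rangle$. Due to symmetry, we can parameterize a Haar-random single qubit state using spherical coordinates as $E = U \frac{1}{2} (I + \cos\varphi \sin\theta\, X + \sin\varphi \sin\theta\, Y + \cos\theta\, Z) U^\dagger$ for the Pauli matrices $X$, $Y$, and $Z$. Then

$$\mathrm{Tr}(E|\psi\rangle\langle\psi|) = \mathrm{Tr}\left( U \frac{1}{2} (I + \cos\varphi \sin\theta\, X + \sin\varphi \sin\theta\, Y + \cos\theta\, Z) U^\dagger |\psi\rangle\langle\psi| \right)$$
$$= \mathrm{Tr}\left( \frac{1}{2} (I + \cos\varphi \sin\theta\, X + \sin\varphi \sin\theta\, Y + \cos\theta\, Z) |0\rangle\langle 0| \right) = \frac{1 + \cos\theta}{2}.$$

Similarly, for $\rho$:

$$\mathrm{Tr}(E\rho) = \lambda\,\mathrm{Tr}(E|\varphi\rangle\langle\varphi|) + (1 - \lambda)\,\mathrm{Tr}(E|\varphi^\perp\rangle\langle\varphi^\perp|)$$
$$= \lambda\, \frac{1 + \cos\theta \cos 2\theta' + \cos\varphi \sin\theta \sin 2\theta'}{2} + (1 - \lambda)\, \frac{1 - \cos\theta \cos 2\theta' - \cos\varphi \sin\theta \sin 2\theta'}{2}$$
$$= \frac{1 + (2\lambda - 1)(\cos\theta \cos 2\theta' + \cos\varphi \sin\theta \sin 2\theta')}{2}.$$

This allows us to perform a spherical integral over $\theta$ and $\varphi$ to get the expectation:

$$\mathbb{E}_{E \sim D'} \left[ \mathrm{sign}\left( \mathrm{Tr}(E|\psi\rangle\langle\psi|) - \frac{1}{2} \right) \left( \mathrm{Tr}(E\rho) - \frac{1}{2} \right) \right]$$
$$= \frac{1}{4\pi} \int_0^{2\pi} d\varphi \int_0^{\pi} d\theta\, \sin\theta\, \mathrm{sign}(\cos\theta) \left( \frac{(2\lambda - 1)(\cos\theta \cos 2\theta' + \cos\varphi \sin\theta \sin 2\theta')}{2} \right)$$
$$= \frac{2\lambda - 1}{8\pi} \int_0^{2\pi} d\varphi \left[ \int_0^{\pi/2} - \int_{\pi/2}^{\pi} \right] d\theta\, \sin\theta \left( \cos\theta \cos 2\theta' + \cos\varphi \sin\theta \sin 2\theta' \right)$$
$$= (2\lambda - 1)\, \frac{\pi \cos 2\theta' + \pi \cos 2\theta'}{8\pi}$$
$$= \frac{(2\lambda - 1) \cos 2\theta'}{4}$$
$$= \frac{1}{4} \mathrm{Tr}(P\rho).$$

Our algorithm for learning product states will work by learning each qubit in the Pauli basis. This results in a $3n$-query algorithm, corresponding to the $3n$ parameters that it takes to define a product state. We first recall the definition of trace distance, which is the quantum generalization of total variation distance.

Definition 5.2 (Trace distance). Given quantum states $\rho$ and $\sigma$,

$$\|\rho - \sigma\|_{\mathrm{Tr}} = \frac{1}{2} \sum_i |\lambda_i|,$$

where $\{\lambda_i\}$ is the set of eigenvalues of the matrix $\rho - \sigma$.

Proposition 5.3 (folklore). Given two single qubit states $\rho$ and $\sigma$, $\|\rho - \sigma\|_{\mathrm{Tr}}$ is half the Euclidean distance between their points on the Bloch sphere.

The following lemma will then be necessary to relate trace distance of the states to the squared loss in learning under this distribution.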

Lemma 5.4.

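For $n$-qubit product states $\rho = \bigotimes_i \rho_i$ and $\sigma = \bigotimes_i \sigma_i$, let $f_\rho(E) = 2\,\mathrm{Tr}(E\rho) - 1$ and $f_\sigma(E) = 2\,\mathrm{Tr}(E\sigma) - 1$. Let $D$ be the distribution over measurements defined earlier. Then

$$\mathbb{E}_{E \sim D} \left[ (f_\rho(E) - f_\sigma(E))^2 \right] = \frac{4}{3n} \sum_{i=1}^{n} \|\rho_i - \sigma_i\|_{\mathrm{Tr}}^2.$$

Proof.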

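Let $\xi = \rho - \sigma$. Then by linearity

$$f_\rho(E) - f_\sigma(E) = 2(\mathrm{Tr}(E\rho) - \mathrm{Tr}(E\sigma)) = 2\,\mathrm{Tr}(E\xi).$$

We will define $\xi_i = \mathrm{Tr}_i(\xi) = \rho_i - \sigma_i$ to be the reduced density matrix on the $i$th qubit of $\xi$. Noting that each $\xi_i$ is traceless, by diagonalizing we can write $\xi_i = \lambda_i |\lambda_i\rangle\langle\lambda_i| - \lambda_i |\lambda_i^\perp\rangle\langle\lambda_i^\perp|$ for $\lambda_i \in [0, 1]$ such that $\lambda_i = \|\rho_i - \sigma_i\|_{\mathrm{Tr}}$ is the trace distance of the reduced density matrices.

Like in Lemma 5.1, we can parameterize a single-qubit Haar-random projection as $E = U \frac{1}{2} (I + \cos\varphi \sin\theta\, X + \sin\varphi \sin\theta\, Y + \cos\theta\, Z) U^\dagger$, where $U|0\rangle = |\lambda_i\rangle$ and $U|1\rangle = |\lambda_i^\perp\rangle$. This implies that $U^\dagger \xi_i U = \lambda_i Z$, and so

$$\mathrm{Tr}(E\xi_i) = \mathrm{Tr}\left( U \frac{1}{2} \left( I + \cos\varphi \sin\theta\, X + \sin\varphi \sin\theta\, Y + \cos\theta\, Z \right) U^\dagger \xi_i \right)$$
$$= \mathrm{Tr}\left( \frac{1}{2} \left( I + \cos\varphi \sin\theta\, X + \sin\varphi \sin\theta\, Y + \cos\theta\, Z \right) \cdot \lambda_i Z \right)$$
$$= \lambda_i \cos\theta.$$

Using this, we now compute the squared loss as follows.

$$\mathbb{E}_{E \sim D} \left[ (f_\rho(E) - f_\sigma(E))^2 \right] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{E' \sim D'} \left[ (f_{\rho_i}(E') - f_{\sigma_i}(E'))^2 \right]$$
$$= \frac{4}{n} \sum_{i=1}^{n} \mathbb{E}_{E' \sim D'} \left[ \mathrm{Tr}^2(E'\xi_i) \right]$$
$$= \frac{4}{n} \sum_{i=1}^{n} \int_0^{\pi} \frac{\sin\theta}{2} \cdot \lambda_i^2 \cos^2\theta \, d\theta$$
$$= \frac{4}{3n} \sum_{i=1}^{n} \lambda_i^2$$
$$= \frac{4}{3n} \sum_{i=1}^{n} \|\rho_i - \sigma_i\|_{\mathrm{Tr}}^2.$$

We now show how to use Lemma 5.1 to learn each qubit of the product state, allowing us to then apply Lemma 5.4 to get our learning result.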
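As a numerical sanity check on the two single-qubit averages used in this section (Lemma 5.1 and the integral in the proof of Lemma 5.4), the sketch below works in Bloch-vector coordinates. It assumes that a projector $E$ onto Bloch direction $e$ satisfies $\mathrm{Tr}(E\rho) - 1/2 = (e \cdot r)/2$ for a state with Bloch vector $r$, and it replaces the Haar average by a midpoint-rule integral over the sphere. All function names are hypothetical, and the grid is only well aligned with the sign discontinuity when the pure state lies along a coordinate axis.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sphere_average(func, steps=200):
    """Midpoint-rule average of func(e) over unit vectors e on the sphere."""
    d_theta = math.pi / steps
    d_phi = 2.0 * math.pi / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * d_theta
        for j in range(steps):
            phi = (j + 0.5) * d_phi
            e = (math.sin(theta) * math.cos(phi),
                 math.sin(theta) * math.sin(phi),
                 math.cos(theta))
            total += func(e) * math.sin(theta) * d_theta * d_phi
    return total / (4.0 * math.pi)


def lemma_5_1_lhs(p, r):
    """E[ sign(Tr(E psi) - 1/2) (Tr(E rho) - 1/2) ] for Bloch vectors p (pure) and r."""
    return sphere_average(
        lambda e: math.copysign(1.0, dot(e, p)) * dot(e, r) / 2.0)


def lemma_5_4_single_qubit(r, s):
    """E[ Tr(E (rho - sigma))^2 ] for single-qubit Bloch vectors r and s."""
    d = tuple(a - b for a, b in zip(r, s))
    return sphere_average(lambda e: (dot(e, d) / 2.0) ** 2)
```

For a pure state along $Z$ (so $P = Z$) and $r = (0.3, -0.2, 0.5)$, `lemma_5_1_lhs((0, 0, 1), r)` should be close to $r_z / 4 = 0.125$. Likewise `lemma_5_4_single_qubit(r, s)` should be close to $\lambda^2 / 3$, where $\lambda = |r - s| / 2$ is the trace distance by Proposition 5.3.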

Theorem 5.5.

Let $D$ be the distribution on measurements and let $\mathcal{C}$ be the concept class of product states defined earlier. There is an SQ learner that is able to learn $\mathcal{C}$ under $D$ up to squared loss $\epsilon$ using $3n$ queries of tolerance $\sqrt{\epsilon}/(2n)$.

Proof. Let the unknown $\rho \in \mathcal{C}$ be given by $\rho = \bigotimes_i \rho_i$. If we define $P_1 = X$, $P_2 = Y$, and $P_3 = Z$, then our queries will be

$$\phi_{i,j}(E, Y) = \mathrm{sign}\left( \frac{1}{2^{n-1}} \mathrm{Tr}\left( E \cdot \left( I^{\otimes i-1} \otimes \frac{I + P_j}{2} \otimes I^{\otimes n-i} \right) \right) - \frac{1}{2} \right) \cdot Y.$$

The query $\phi_{i,j}$ will correspond to taking the projection of the $i$th qubit along the Pauli $P_j$, as we now show:

$$\mathbb{E}_{E \sim D,\, Y \sim f_\rho(E)} \left[ \phi_{i,j}(E, Y) \right] = \mathbb{E}_{E \sim D} \left[ \phi_{i,j}(E, 1)\, \mathrm{Tr}(E\rho) + \phi_{i,j}(E, -1) \left( 1 - \mathrm{Tr}(E\rho) \right) \right]$$
$$= \mathbb{E}_{E \sim D} \left[ \phi_{i,j}(E, 1) \left( 2\,\mathrm{Tr}(E\rho) - 1 \right) \right]$$
$$= \frac{1}{n} \mathbb{E}_{E' \sim D'} \left[ \mathrm{sign}\left( \mathrm{Tr}\left( E' \frac{I + P_j}{2} \right) - \frac{1}{2} \right) \left( 2\,\mathrm{Tr}(E'\rho_i) - 1 \right) \right]$$
$$= \frac{1}{2n} \mathrm{Tr}(P_j \rho_i).$$

Here the third equality exploits the definition of $D$ as $\frac{1}{n} \sum_{i=1}^{n} I^{\otimes i-1} \otimes D' \otimes I^{\otimes n-i}$ (only the $i$th term yields a nonzero expectation), and the fourth equality is Lemma 5.1.

Any specific qubit $\rho_i$ can be written in Bloch sphere coordinates as $\frac{I + x_i X + y_i Y + z_i Z}{2}$. We can estimate $x_i = \mathrm{Tr}(P_1 \rho_i)$ up to error $\sqrt{\epsilon}$ using a single query of tolerance $\sqrt{\epsilon}/(2n)$. The same holds true for $y_i$ and $z_i$. If we use this to construct our estimate $\hat{\rho}_i = \frac{I + \hat{x}_i X + \hat{y}_i Y + \hat{z}_i Z}{2}$, then by Proposition 5.3,

$$\|\rho_i - \hat{\rho}_i\|_{\mathrm{Tr}}^2 = \frac{1}{4} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (z_i - \hat{z}_i)^2 \right] \leq 3\epsilon/4.$$

Finally, using Lemma 5.4 with $\sigma = \bigotimes_i \hat{\rho}_i$:

$$\mathbb{E}_{E \sim D} \left[ (f_\rho(E) - f_\sigma(E))^2 \right] \leq \frac{4}{3n} \sum_{i=1}^{n} 3\epsilon/4 = \epsilon.$$

We note that if the estimated point is outside of the Bloch sphere, we can simply normalize the point to the surface of the Bloch sphere and this will never increase the error. To quickly sketch the proof of this, take the plane formed by the center of the sphere, the estimated point $\hat{p}$ that is outside of the sphere, and the real point $p$, which is both within the Bloch sphere and within a sphere of radius $\epsilon$ centered at $\hat{p}$. The normalized point $\hat{p}'$ is always located on the line from $\hat{p}$ to the origin, and one can construct a separating plane that bisects the line segment between $\hat{p}$ and $\hat{p}'$ and determines whether a point is closer to $\hat{p}$ or $\hat{p}'$. Since the Bloch sphere will always be on the side closer to $\hat{p}'$ and the real point $p$ is in the Bloch sphere, $p$ will always be closer to $\hat{p}'$ than to $\hat{p}$.

We can simplify this algorithm if we know in advance that $\rho$ is a computational basis state. In that case, we know that each qubit $\rho_i$ is either $(I + Z)/2$ or $(I - Z)/2$, and so we only need to make $n$ queries $\phi_{i,3}$, one for each $i$. Moreover, we only need to identify the coordinate $z_i$ to within a precision of $1$ in order to distinguish the $z_i = 1$ and $z_i = -1$ cases, so a tolerance of $O(1/n)$ suffices to learn $\rho$ perfectly (i.e. with $\epsilon = 0$).

A PAC learning algorithm $L$ can be viewed as a randomized algorithm that takes as input a training dataset (i.e. a set of labeled examples $(x, y)$ sampled from a distribution) and outputs a hypothesis that with high probability has low error over the distribution. That is, if $S$ is a training dataset, then $L(S)$ describes a probability distribution over hypotheses (where the randomness arises from the internal randomness of the learner). Intuitively, differential privacy requires $L$ to satisfy a kind of stability: on any two inputs $S$ and $S'$ that are close, the distributions $L(S)$ and $L(S')$ must be close as well.

Definition 6.1 (Differential privacy, [DR14]). Call two datasets $S = \{(x_i, y_i)\}_{i=1}^{m}$ and $S' = \{(x'_i, y'_i)\}_{i=1}^{m}$ neighbors if they only differ in one entry. A learner $L$ (understood in the sense just discussed) is said to be $\alpha$-differentially private (or $\alpha$-DP for short) if for any $S$ and $S'$ that are neighbors, the distributions $L(S)$ and $L(S')$ are close in the sense that for any hypothesis $h$, $\mathbb{P}[L(S) = h] \leq e^{\alpha}\, \mathbb{P}[L(S') = h]$.

A well-known property of SQ algorithms is that they can readily be made differentially private. Since differential privacy is a notion that is well-defined only in the PAC setting, where the input is a set of training examples (as opposed to access to an SQ oracle), such a statement is necessarily of the form "any SQ learner yields a PAC learner that satisfies differential privacy."

Theorem 6.2 (see e.g. [DR14]). Let $\mathcal{C}$ be a concept class learnable up to error $\epsilon$ by an SQ learner $L$ using $q$ queries of tolerance $\tau$. Then it is also learnable up to error $\epsilon$ in the PAC setting by an $\alpha$-DP learner $L'$ with sample complexity $\tilde{O}\left( \frac{q}{\alpha\tau} + \frac{q}{\tau^2} \right)$ (with constant probability).
The proof is standard and proceeds by simulating each of $L$'s queries using empirical estimates over a sample of size roughly $1/\tau^2$ and then using the Laplace mechanism to add some further noise.

It is natural to ask if our quantum SQ algorithms can also be made differentially private in an interesting sense. In the quantum setting, [AR19] introduced a notion of differential privacy for quantum measurements. Here the randomized algorithm is a measurement that takes as input a quantum state, and two $n$-qubit states are considered neighbors if it is possible to reach one from the other by a quantum operation (sometimes called a superoperator) on a single qubit. In particular, two product states $\rho = \bigotimes_i \rho_i$ and $\sigma = \bigotimes_i \sigma_i$ are neighbors if $\rho_i = \sigma_i$ for all $i$ but one.

Definition 6.3 (Quantum differential privacy for measurements, [AR19]). A measurement $M$ is said to be $\alpha$-DP if for any $n$-qubit neighbor states $\rho, \sigma$, and any outcome $y$, $\mathbb{P}[M(\rho) = y] \leq e^{\alpha}\, \mathbb{P}[M(\sigma) = y]$.

This definition is of particular interest since it can be related to the quantum notion of a "gentle measurement." In fact $\alpha$-DP and $\alpha$-gentleness turn out to be roughly equivalent under some assumptions, and this connection can be carefully exploited to perform shadow tomography [Aar19, AR19]. However, it is not clear how to directly apply this notion of quantum DP to an SQ learner, since such a learner is an algorithm rather than just a single measurement.

We can instead establish a different kind of quantum differential privacy. View a quantum state learner $L$ as an algorithm that takes in multiple copies $\rho^{\otimes m}$ of some unknown state $\rho$, is allowed to sample and perform random measurements from a distribution $D$, and outputs another state $\sigma$ that is required to be close to $\rho$ with respect to $D$ with high probability. If the random measurements are viewed as the internal randomness of the learner, then this is similar to the view we took of a classical learner earlier. We can now define a notion of differential privacy for quantum state learners by requiring that $L(\rho^{\otimes m})$ and $L(\rho^{\otimes m-1} \otimes \rho')$ (where $\rho \neq \rho'$, so that $\rho^{\otimes m}$ and $\rho^{\otimes m-1} \otimes \rho'$ are neighbors) are $\alpha$-close as distributions over states (in the natural way). This can also be seen as a stylized kind of tolerance to noise or corruptions. The following analogue of Theorem 6.2 can then be proven using almost exactly the same proof; essentially, we are only replacing classical examples with copies of quantum states.

Theorem 6.4.

Let $\mathcal{C}$ be a class of quantum states learnable up to error $\epsilon$ by an SQ learner $L$ using $q$ queries of tolerance $\tau$. Then it is also learnable up to error $\epsilon$ in the PAC setting by an $\alpha$-DP learner $L'$ (in the specific sense just described) with copy complexity $\tilde{O}\left( \frac{q}{\alpha\tau} + \frac{q}{\tau^2} \right)$ (with constant probability).
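The simulation behind Theorems 6.2 and 6.4 can be sketched as follows. This is a minimal illustration rather than the construction from [DR14]: it assumes each query takes values in $[-1, 1]$ and is answered on its own fresh batch of examples (in the quantum case, measurement outcomes on fresh copies), so that changing one example moves a single empirical mean by at most $2/m$. The names `laplace_noise` and `private_sq_answer` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variables."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_sq_answer(query, samples, alpha, rng):
    """Answer one statistical query with an alpha-DP Laplace mechanism.

    `query(x, y)` must take values in [-1, 1]. Replacing one of the m samples
    changes the empirical mean by at most 2/m, so Laplace noise of scale
    2 / (m * alpha) suffices for alpha-DP of this single answer.
    """
    m = len(samples)
    mean = sum(query(x, y) for x, y in samples) / m
    return mean + laplace_noise(2.0 / (m * alpha), rng)
```

With a batch of size $m$ on the order of $1/\tau^2 + 1/(\alpha\tau)$ (up to logarithmic factors), both the sampling error and the added noise stay below the tolerance $\tau$ with good probability, which is where the two terms in the stated complexity come from.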

We leave it as an open problem to find interesting applications of our definition of differential privacy for quantum state learners, such as relating it to a notion of gentleness in a way that might have further uses (like shadow tomography).

Statistical vs. query complexity.

Conceptually, the contrast between our SQ model and the original PAC model of [Aar07] is interesting. Apart from the definition of an elegant model, Aaronson's main insight was to characterize learnability in a purely statistical sense, showing bounds on sample complexity via an analysis of the so-called fat-shattering dimension of quantum states. In learning theoretic terms, this took advantage of a separation of concerns that the PAC model encourages: (a) empirical performance, i.e. a learner achieving low error with respect to the training data, and (b) generalization, i.e. this performance actually generalizing to the true distribution. The SQ model, however, does not naturally accommodate such a separation. SQ algorithms are instead primarily characterized by the number of queries required; generalization is "in-built." The closest analogue to a notion of sample complexity is the role played by the tolerance, and the closest thing to studying generalization on its own might have been to show a phase transition in what different regimes of the tolerance are able to accomplish. The formal statements of our SQ lower bounds do have such a flavor: "either use small tolerance or many queries."

Suitable classes and distributions for PAC-learning.

It is notable that the algorithm of [Roc18] for learning stabilizer states is essentially the only known positive result in the framework of [Aar07]. So a longstanding question in this area is: what other interesting classes can be learned, and under what distributions on measurements? And can they also be learned in the SQ setting?

A major issue in picking suitable distributions on measurements is that under many natural distributions, the maximally mixed state actually performs quite well, so that the problem of learning becomes essentially superfluous. Even in this work, we obtained lower bounds for learning stabilizer states under the uniform distribution on Pauli measurements only for learning up to exponentially small squared loss. This was because the norms of the p-concepts are themselves exponentially small, or in other words the maximally mixed state already achieves exponentially small loss. We were able to get around this and obtain an $\Omega(2^n)$ lower bound via a direct reduction from learning parities (by considering parity measurements). Can we do better than just $2^n$? Is there an $\omega(2^n)$-sized (e.g., $4^n$ or $2^{n^2}$) subset of stabilizer states such that there exists a distribution over Pauli measurements inducing norms that are only polynomially small yet have an exponentially small average correlation? That is, is there an $\omega(2^n)$-sized set of stabilizer states and accompanying distribution over Pauli measurements such that the maximally mixed state does not do well?

Noise-tolerant learning beyond SQ.

The best-known PAC algorithm for learning parities with noise is due to [BKW03] and runs in slightly subexponential time. Interestingly, this means it beats the exponential SQ lower bound and is hence essentially the only known example of a noise-tolerant PAC algorithm that is not SQ (although it cannot handle noise arbitrarily close to the information-theoretic limit). Can we similarly hope for a noise-tolerant but non-SQ learner for stabilizer states that runs in subexponential time?

Connections to other areas.

We outlined one connection between SQ-learning and a nonstandard notion of differential privacy. Is there a more meaningful way to connect SQ to the differential privacy definition of [AR19], and obtain interesting applications (such as shadow tomography)?

Acknowledgements

We thank Scott Aaronson, Srinivasan Arunachalam, and Andrea Rocchetto for many helpful discussions. AG was supported by NSF awards AF-1909204, AF-1717896, and the NSF AI Institute for Foundations of Machine Learning (IFML). DL was supported by the Simons It from Qubit Collaboration and Scott Aaronson's Vannevar Bush Faculty Fellowship from the US Department of Defense.

References

[Aar07] Scott Aaronson. The learnability of quantum states. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 463(2088):3089–3114, 2007.

[Aar19] Scott Aaronson. Shadow tomography of quantum states. SIAM Journal on Computing, 49(5):STOC18–368, 2019.

[ACH+19] Scott Aaronson, Xinyi Chen, Elad Hazan, Satyen Kale, and Ashwin Nayak. Online learning of quantum states. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124019, 2019.

[AD98] Javed A Aslam and Scott E Decatur. Specification and simulation of statistical query algorithms for efficiency and noise tolerance. Journal of Computer and System Sciences, 56(2):191–208, 1998.

[AdW17] Srinivasan Arunachalam and Ronald de Wolf. Guest column: A survey of quantum learning theory. ACM SIGACT News, 48(2):41–67, 2017.

[AdW18] Srinivasan Arunachalam and Ronald de Wolf. Optimal quantum sample complexity of learning algorithms. The Journal of Machine Learning Research, 19(1):2879–2878, 2018.

[AG04] Scott Aaronson and Daniel Gottesman. Improved simulation of stabilizer circuits. Physical Review A, 70(5):052328, 2004.

[AGY20] Srinivasan Arunachalam, Alex B Grilo, and Henry Yuen. Quantum statistical query learning. arXiv preprint arXiv:2002.08240, 2020.

[AL88] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.

[AR19] Scott Aaronson and Guy N Rothblum. Gentle measurement of quantum states and differential privacy. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 322–333, 2019.

[BDMN05] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the sulq framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 128–138, 2005.

[BFJ+94] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning dnf and characterizing statistical query learning using fourier analysis. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 253–262, 1994.

[BKW03] Avrim Blum, Adam Kalai, and Hal Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506–519, 2003.

[BV97] Ethan Bernstein and Umesh Vazirani. Quantum complexity theory. SIAM Journal on computing, 26(5):1411–1473, 1997.

[CSS15] Andrew W Cross, Graeme Smith, and John A Smolin. Quantum learning robust against noise. Physical Review A, 92(1):012327, 2015.

[DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

[Fel12] Vitaly Feldman. A complete characterization of statistical query learning with applications to evolvability. Journal of Computer and System Sciences, 78(5):1444–1459, 2012.

[Fel16] Vitaly Feldman. Statistical query learning. In Ming-Yang Kao, editor, Encyclopedia of Algorithms, pages 2090–2095. Springer New York, New York, NY, 2016.

[GGJ+20] Surbhi Goel, Aravind Gollakota, Zhihan Jin, Sushrut Karmalkar, and Adam Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. In International Conference on Machine Learning, 2020.

[GKZ19] Alex B Grilo, Iordanis Kerenidis, and Timo Zijlstra. Learning-with-errors problem is easy with quantum samples. Physical Review A, 99(3):032314, 2019.

[Got97] Daniel Gottesman. Stabilizer codes and quantum error correction. arXiv preprint quant-ph/9705052, 1997.

[Got98] Daniel Gottesman. The heisenberg representation of quantum computers. arXiv preprint quant-ph/9807006, 1998.

[Ham89] M. Hamermesh. Group Theory and Its Application to Physical Problems. Addison Wesley Series in Physics. Dover Publications, 1989.

[HHJ+17] Jeongwan Haah, Aram W. Harrow, Zhengfeng Ji, Xiaodi Wu, and Nengkun Yu. Sample-optimal tomography of quantum states. IEEE Transactions on Information Theory, page 1–1, 2017.

[HS07] Lisa Hellerstein and Rocco A Servedio. On pac learning algorithms for rich boolean function classes. Theoretical Computer Science, 384(1):66–76, 2007.

[Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.

[OW16] Ryan O'Donnell and John Wright. Efficient quantum tomography. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 899–912, 2016.

[OW17] Ryan O'Donnell and John Wright. Efficient quantum tomography ii. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 962–974, 2017.

[Pie12] Krzysztof Pietrzak. Cryptography from learning parity with noise. In International Conference on Current Trends in Theory and Practice of Computer Science, pages 99–114. Springer, 2012.

[RAS+19] Andrea Rocchetto, Scott Aaronson, Simone Severini, Gonzalo Carvacho, Davide Poderini, Iris Agresti, Marco Bentivegna, and Fabio Sciarrino. Experimental learning of quantum states. Science advances, 5(3):eaau1946, 2019.

[Rey20] Lev Reyzin. Statistical queries and statistical algorithms: Foundations and applications. arXiv preprint arXiv:2004.00557, 2020.

[RLCK19] Patrick Rall, Daniel Liang, Jeremy Cook, and William Kretschmer. Simulation of qubit quantum circuits via pauli propagation. Physical Review A, 99(6), Jun 2019.

[Roc18] Andrea Rocchetto. Stabiliser states are efficiently pac-learnable. Quantum Information & Computation, 18(7-8):541–552, 2018.

[Val84] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[Val85] LG Valiant. Learning disjunction of conjunctions. In