Private learning implies quantum stability
Srinivasan Arunachalam*, Yihui Quek†, and John Smolin‡

IBM Quantum, IBM T.J. Watson Research Center, Yorktown Heights, USA
Information Systems Laboratory, Stanford University, USA

February 16, 2021
Abstract
Learning an unknown n-qubit quantum state ρ is a fundamental challenge in quantum computing. Information-theoretically, it is well known that tomography requires exponentially many (in n) copies of an unknown state ρ in order to estimate it up to small trace distance. Motivated by computational learning theory, Aaronson and others introduced several (weaker) learning models: the PAC model of learning quantum states (Proc. of Royal Society A'07), shadow tomography (STOC'18) for learning "shadows" of a quantum state, a learning model that additionally requires learners to be differentially private (STOC'19), and the online model of learning quantum states (NeurIPS'18). In these models it was shown that an unknown quantum state can be learned "approximately well" using linearly many (in n) copies of ρ. But is there any relationship between these learning models? In this paper we prove a sequence of (information-theoretic) implications from differentially-private PAC learning to online learning and then to quantum stability.

Our main result generalizes the recent work of Bun, Livni and Moran (Journal of the ACM, 2021), who showed that finite Littlestone dimension (of Boolean-valued concept classes) implies
PAC learnability in the (approximate) differentially private (DP) setting. We first consider their work in the real-valued setting and further extend their techniques to the setting of learning quantum states. Key to many of our results is our construction of a generic quantum online learner, the Robust Standard Optimal Algorithm (RSOA), which is robust to adversarial imprecision. We then show information-theoretic implications between DP learning of quantum states in the PAC model, learnability of quantum states in the one-way communication model, online learning of quantum states, quantum stability (which is our new conceptual contribution) and various combinatorial parameters. As an application, we also improve gentle shadow tomography (for classes of quantum states) and show connections between noisy quantum state learning and channel capacity, which might be relevant to physically-motivated learning scenarios.

* [email protected]   † [email protected]   ‡ [email protected]

Quantum state tomography is a fundamental task in quantum computing whose goal is to estimate an unknown quantum state ρ, given copies of the state. Tomography is of great practical interest since it helps in tasks such as verifying entanglement and understanding correlations in quantum states, and is useful for calibrating, understanding and controlling noise in quantum devices. In the last few years, questions about the fundamental limits of this task have gained a lot of theoretical attention; in particular, how many copies of an n-qubit quantum state ρ are necessary and sufficient to estimate the density matrix ρ up to small error?
In this direction, recent breakthrough results of [OW16, OW17, HHJ+17] showed that Θ(2^{2n}/ε^2) copies of ρ are necessary and sufficient to learn ρ up to trace distance ε. Unfortunately, this exponential scaling in complexity is reflected in practical applications of tomography; the best-known experimental implementation of full-state quantum tomography has been for a 10-qubit quantum state [SXL+
17]; fully characterizing even this state was a formidable tax on resources. This raises the natural question: is it always necessary for experimental purposes to estimate the full density matrix of ρ? Rather than learning ρ up to trace distance ε, do there exist weaker but still practically useful learning goals, which would enable savings in sample complexity? These questions have turned attention to 'essential' models of learning, which aim to learn only the useful properties of an unknown quantum state. In this direction, a few works have introduced models for learning quantum states, inspired by classical computational learning theory. In this paper, we show (information-theoretic) implications between these seemingly different quantum learning models.

To explain our main results, we start by introducing some learning models of interest. Below we describe the PAC learning model, the online learning model, learning under differential privacy constraints, and one-way communication complexity for learning quantum states. We formally define these models in Section 2.1.
PAC learning.
Probably Approximately Correct (PAC) learning, introduced by [Val84], lays the foundation for computational learning theory. [Aar07] considered the natural analog of learning quantum states in the PAC model. In this model, let ρ ∈ C be an unknown quantum state (picked from a known concept class C of states) and let D : E → [0, 1] be an arbitrary unknown distribution over all possible 2-outcome measurements E. Suppose a quantum learner obtains training examples (E_i, Tr(ρE_i)) where E_i is drawn from D, and the goal is to output σ such that, with probability at least 0.99, σ satisfies Pr_{E∼D}[|Tr(σE) − Tr(ρE)| ≤ ζ] ≥ 1 − α (this second probability is over a fresh example from D). How many training examples suffice for such a (ζ, α)-PAC learner? In answer, [Aar07] showed that the number of examples necessary and sufficient to learn C is captured by the fat-shattering dimension of C.

PAC learning with differential privacy.
A well-studied area of computer science is differential privacy (DP), which requires that an algorithm behave "approximately" the same on any two datasets that differ in one element. This notion can be extended to the quantum realm, where we ask that the quantum PAC learner proposed above is also differentially private: given two datasets S = {(E_i, Tr(ρE_i))}_i and S′ = {(E′_i, Tr(ρE′_i))}_i for which there exists a unique i such that E_i ≠ E′_i, a quantum (γ, δ)-DP PAC learning algorithm A needs to satisfy

Pr[A(S) = σ] ≤ e^γ · Pr[A(S′) = σ] + δ,

where A(S) is the output of A on input S.

Communication complexity.
Consider the standard one-way communication model between Alice and Bob. Suppose Alice has a quantum state ρ (unknown to Bob) and Bob has a measurement E (unknown to Alice). Bob's goal is to output an approximation of Tr(ρE) when only Alice is allowed to communicate to Bob. A trivial strategy for this communication task is for Alice to send a classical description of ρ, but can we do better? If so, how many bits of communication suffice for this task?

Online learning.
Several features of the
PAC quantum learning model and tomography are somewhat artificial: first, the assumption that the measurements (training examples) are drawn from the same unknown distribution D on which the learner will be evaluated, which does not account for adversarial or changing environments; and second, it may be infeasible to possess T-fold tensor copies of the unknown quantum state ρ – rather, we may only be able to obtain sequential copies of it.

(In Boolean DP, one allows the examples to be the same and the labels to be different for neighboring databases. Since we look at real-valued DP, we consider the case when the examples are different. Our notion of DP also differs from the notion of DP proposed by [AR19]: they consider DP measurements with respect to a class of product states, whereas here we require DP with respect to the dataset {(E_i, Tr(E_i ρ))}_i, which naturally quantizes the classical definition of differential privacy.)

In the online model of learning an unknown state ρ, at every round the learner maintains a local hypothesis σ, which is its guess of ρ; it obtains a description of a measurement operator E_i (possibly chosen adversarially) and predicts the value of y_i = Tr(ρE_i). Subsequently it receives as feedback an ε-approximation of y_i. On every round, if the learner's prediction satisfies |Tr(σE_i) − y_i| ≤ ε then it is correct; otherwise it has made a mistake.
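The per-round interaction just described can be sketched in a few lines of Python. The fixed target value 0.7 (standing in for Tr(ρE_i) on a repeated measurement) and the naive averaging learner are purely hypothetical illustrations, not algorithms from this paper; the point is only the shape of the protocol and the role of ε-imprecise feedback.

```python
import random
random.seed(0)

EPS = 0.1  # imprecision: feedback is only an EPS-approximation of y_i = Tr(rho E_i)

def online_round(prediction, true_value, eps=EPS):
    """One round: the learner predicts Tr(sigma E_i); the adversary returns
    eps-approximate feedback; a mistake means |prediction - y_i| > eps."""
    feedback = true_value + random.uniform(-eps, eps)
    made_mistake = abs(prediction - true_value) > eps
    return feedback, made_mistake

# Toy run: the same (hypothetical) measurement is repeated every round, so
# y_i = 0.7 throughout, and the learner simply averages past feedback.
target, history, mistakes = 0.7, [], 0
for _ in range(50):
    guess = sum(history) / len(history) if history else 0.0
    feedback, err = online_round(guess, target)
    history.append(feedback)
    mistakes += err
```

Since every feedback value lies within EPS of the target, the running average is within EPS of it from the second round onward, so this toy learner makes exactly one mistake (its blind first guess). A genuine online learner must of course handle adversarially varying measurements.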
The goal of the learner is the following: minimize m so that after making m mistakes (not necessarily consecutive), it makes a correct prediction (i.e., approximates Tr(·ρ) well enough) on all future rounds.

Importantly, while the goal in the above model is to make real-valued predictions y_i, it departs from the real-valued online learning literature in allowing for ε-imprecision in the feedback. This imprecision is inherent to all learning settings where the feedback is generated by a statistical algorithm or physical measurements (in the quantum learning setting, the feedback arises from processing the outcomes of quantum measurements), and this generalization has non-trivial implications, as we show. Working in this model, [ACH+
18] showed that for learning the class of all quantum states, it suffices to let m be at most the sequential fat-shattering dimension of C (a combinatorial parameter originally introduced in the classical work of [RST10]).

All these learning models can be seen as variants of full-state tomography, and are known to require exponentially fewer resources than tomography. A natural question is:

Is there a relation between these learning models, communication and combinatorial parameters?
Understanding this question classically, in the context of Boolean functions, has received tremendous attention in computational learning theory and theoretical computer science in the last two years. There has been a flurry of papers establishing various connections [BLM20, JKT20, GGKM20, ALMM19, BLM19, Bun20, GHM19, ABMS20, HRS20]. However, whether the results in these papers apply to the quantum framework has remained unexplored.
To condense our (affirmative) answer to the question above, we derive a series of implications going through all these models, starting from differentially-private PAC learning to online learning to quantum stability (our conceptual contribution, which we define and discuss below).

Taking a step back, this is surprising: quantum online learning and DP PAC quantum learning seem very different on the surface. Online learning ensures that eventually, after a certain number of mistakes, we have learned the state up to small trace distance. DP PAC learning is not online – it separates the learning into train (offline) and test (online) phases, and also introduces a distribution D from which measurements are drawn. Ultimately, DP PAC learning says that after seeing T measurements from D, we have (privately) learned the state. We show that in fact
DP PAC learning's sample complexity can be lower bounded by the sequential fat-shattering dimension, which also characterizes the complexity of online learning [RST10]. (Here "learned the state" means: at least well enough to predict its behavior on future measurements from D with high probability.) We give a high-level summary of our results in Figure 1; we say an algorithm is pure DP (resp. approximate DP) when δ = 0 (resp. δ > 0) in our definition. A further feature of our setting is that the learner receives only imprecise feedback. This difference has non-trivial consequences (which we highlight later), one of which is that a technique due to [BLM20], showing that stability implies approximate DP PAC for Boolean functions, is a no-go in our setting, as we show later.

[Figure 1 diagram; nodes: pure DP PAC, representation dimension, one-way CC, sequential fat-shattering, online learning, stability, approximate DP PAC, gentle shadow tomography; one arrow is marked ⋆ and a dotted arrow marks a no-go.]

Figure 1: High-level summary of results relating models of learning. These results apply to the setting of learning real-valued classes and quantum states with imprecise feedback. Except for the ⋆-arrow, an arrow A → B in the figure implies that, if the sample complexity of learning in model A or the combinatorial parameter A is S_A, then the complexity of learning in model B or the combinatorial parameter B is S_B = poly(S_A). For the ⋆-arrow, the overhead is S_B = exp(S_A). The dotted arrow signifies that a technique used to prove the corresponding implication for Boolean function classes is a no-go for our quantum learning setting.
Conceptual contribution.
The main centerpiece in establishing these connections is the concept of quantum stability, which is the new conceptual contribution of this work. Intuitively, we say a quantum learning algorithm is stable if, for an unknown state ρ, given a set of noisy labelled examples drawn i.i.d. from a distribution D, there exists one state σ such that, with "high" probability, the output of the learning algorithm is "close" to σ. More formally, we say a learning algorithm A is (T, ε, η)-stable with respect to distribution D if, given T many labelled examples S consisting of E_i and approximations of Tr(ρE_i), there exists a state σ such that

Pr[A(S) ∈ B(ε, σ)] ≥ η,     (1)

where the probability is taken over the examples in S and B(ε, σ) is the ball of states within trace distance ε of σ. In other words, quantum stability means that, up to an ε-distance, there is some σ that is output by A with probability at least η.

While we will make this precise later, the significance of an algorithm A being stable is that σ, the output state at the 'center of the ball', is a good hypothesis for estimating measurement probabilities (and hence A is a good learner). This is not at all obvious from the definition of stability, which does not inherently require that this σ is a good approximation of ρ. Yet, we show that if A is a stable and consistent learner (i.e., its output does not contradict any of the training examples it has seen), then σ has low loss with respect to D. This means that using hypothesis σ to predict the outcomes of future measurements drawn from distribution D as Tr(Eσ) will yield ε-accurate predictions with high probability.

Classically, stability is conceptually linked to differential privacy. In fact, [DR14] state that "Differential privacy is enabled by stability and ensures stability...we observe a tantalizing moral equivalence between learnability, differential privacy, and stability," and this notion was crucially used in [BLM20, ALMM19, AJL+
19, BLM19]. Such a connection has remained unexplored (and even undefined) in the quantum setting, and in this work we explore this interplay between privacy, stability and quantum learning. Our definition of stability marks a crucial departure from the classical notion of stability used by [BLM20], in the following sense: a classical learning algorithm is stable if a single function is output by the algorithm with high probability. In contrast, we say that a quantum learning algorithm is stable if a collection of quantum states is output with high probability. Given the significance of the notion of stability in classical DP research, we believe our definition could find further applications in quantum computing. We break down the proof of the arrows in Figure 1 into four steps and discuss them below.
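As a toy numerical illustration of Eq. (1), the sketch below represents single-qubit states by Bloch vectors, for which the trace distance is half the Euclidean distance between the vectors, and empirically estimates Pr[A(S) ∈ B(ε, σ)] for a hypothetical randomized "learner". That learner (bounded noise around a fixed state) is a stand-in, not an algorithm from this paper.

```python
import math
import random
random.seed(1)

def trace_dist(r1, r2):
    """Trace distance between single-qubit states given as Bloch vectors:
    T(rho, sigma) = ||r1 - r2||_2 / 2."""
    return 0.5 * math.dist(r1, r2)

def noisy_learner(center, spread):
    """Hypothetical randomized learner A(S): returns the Bloch vector `center`
    perturbed by bounded noise, projected back into the Bloch ball if needed."""
    v = [x + random.uniform(-spread, spread) for x in center]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 1 else v

sigma = [0.0, 0.0, 0.6]   # candidate 'center of the ball'
eps, runs = 0.25, 1000
hits = sum(trace_dist(noisy_learner(sigma, 0.2), sigma) <= eps
           for _ in range(runs))
eta_hat = hits / runs      # empirical estimate of Pr[A(S) in B(eps, sigma)]
```

With per-coordinate noise 0.2, every output lies within trace distance sqrt(3)*0.1 ≈ 0.17 < 0.25 of σ, so η̂ = 1 here; a genuinely stable learner only needs η bounded away from zero, which is the regime Result 1.2 below operates in.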
1. Pure
DP PAC implies finite sequential fat-shattering dimension.
It is well known classically that if there is a DP PAC learning algorithm for a class C, then the representation dimension of the class is small (we define this dimension formally in Definitions 2.8 and 2.9). Representation dimension is in turn known to upper bound classical communication complexity, as well as a combinatorial dimension of the concept class known as the sequential fat-shattering dimension sfat(C). All of the above connections are classical, but we show that they can be ported to learning classes of quantum states. To do so, we make all the implications mentioned above robust to our 'quantum' version of DP PAC learning, that is, for real-valued functions and with adversarial imprecision. One particular contribution in this direction is the following: prior to our work, Feldman and Xiao [FX14] showed that representation dimension is only a lower bound for one-way classical communication complexity, but we show that this dimension even lower bounds quantum communication complexity. For the case of Boolean-valued concept classes, Zhang [Zha11] proved a weaker version of our result relating Littlestone dimension and communication complexity; the proof of our main result easily recovers his result while significantly simplifying and extending his proof.

For the remaining part of the introduction, we slightly abuse notation: for a class of quantum states C, we define sfat(C) as the sequential fat-shattering dimension, not of the class C itself, but of an associated class of functions acting on the domain M of all possible 2-outcome measurements on states in C. To be precise, for every C, we associate the real-valued concept class F_C = {f_ρ : M → [0, 1]}_{ρ∈C}, where f_ρ(E) = Tr(Eρ) for all E ∈ M. The ζ-sequential fat-shattering dimension (denoted sfat_ζ(·)) of the class of states C is defined in terms of the class of real-valued functions F_C, i.e., sfat_ζ(C) := sfat_ζ(F_C) (we define this combinatorial parameter more formally in Section 2).
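To make the association C ↦ F_C concrete, the sketch below builds f_ρ(E) = Tr(Eρ) for three single-qubit states and two projective effects, then computes a brute-force lower bound on the sequential fat-shattering dimension via the standard Littlestone-style recursion. Restricting witness values to a small grid is a simplification for illustration (it yields a lower bound); the formal definition is in Section 2 of the paper.

```python
def expval(E, rho):
    """f_rho(E) = Tr(E rho) for 2x2 matrices given as nested lists."""
    return sum(E[i][j] * rho[j][i] for i in range(2) for j in range(2)).real

# Three states: |0><0|, |1><1|, |+><+|; two effects: |0><0| and |+><+|.
P0 = [[1, 0], [0, 0]]
Pp = [[0.5, 0.5], [0.5, 0.5]]
states = [P0, [[0, 0], [0, 1]], Pp]
effects = [P0, Pp]
F_C = [tuple(expval(E, rho) for E in effects) for rho in states]

def sfat_lb(F, gamma, witnesses):
    """Littlestone-style recursion: sfat(F) >= 1 + min(sfat(F_hi), sfat(F_lo))
    whenever some point x and witness s split F into F_hi = {f : f(x) >= s + gamma}
    and F_lo = {f : f(x) <= s - gamma}, both nonempty. With witnesses restricted
    to a finite grid this is a lower bound on the true dimension."""
    if len(F) <= 1:
        return 0
    best = 0
    for x in range(len(F[0])):
        for s in witnesses:
            hi = [f for f in F if f[x] >= s + gamma]
            lo = [f for f in F if f[x] <= s - gamma]
            if hi and lo:
                best = max(best, 1 + min(sfat_lb(hi, gamma, witnesses),
                                         sfat_lb(lo, gamma, witnesses)))
    return best
```

For this tiny class, F_C = {(1, 0.5), (0, 0.5), (0.5, 1)} and sfat_lb(F_C, 0.25, [0.25, 0.5, 0.75]) returns 1: the effect |0><0| with witness 1/2 separates |0><0| from |1><1|, but three functions cannot support a depth-2 shattered tree (that requires at least four).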
2. Finite sfat(·) implies online learning. In the second step, the goal is to go from a concept class C having finite sfat(C) to designing an online learning algorithm for C that makes at most sfat(C) mistakes. In this direction, one of our technical contributions is to construct a robust standard optimal algorithm (denoted RSOA) which satisfies this mistake bound. This RSOA algorithm, summarized in the result below, will be crucial for the following steps.
Result 1.1 (Informal). Let C be a class of quantum states with sfat_ζ(C) = d. There is an explicit robust standard optimal algorithm RSOA that makes at most d mistakes in online learning C.

We now make a few comments regarding this result. Classically, for the Boolean setting, it is well known that the so-called Standard Optimal Algorithm is an online learner for any concept class C that makes at most Littlestone dimension of C many mistakes [Lit88]. Rakhlin et al. [RST10] later generalized the work of Littlestone to real-valued functions, showing that real-valued concept classes can be learned using their FAT-SOA algorithm with at most sfat(C) many mistakes. Our RSOA algorithm generalizes this, showing that real-valued concept classes can be learned with sfat_ζ(C) many mistakes even in the presence of adversarial imprecision of magnitude ζ (which is the case for quantum learning).

The following basic principle underlies our RSOA algorithm and the previous algorithms: after every round during which the learner obtains an x and a ζ-approximation of c(x), eliminate the concepts in C that are "inconsistent" with the adversary's feedback. In more detail: first, the learner discretizes the function range [0,
1] into 1/ζ many ζ-sized bins. In the first round of learning, upon receiving the domain point x, the learner evaluates all the functions in C at x, 'counts' (using the sfat(·) dimension) the number of functions mapping to each bin, chooses the bin with the highest sfat(·) dimension, and outputs the midpoint of this bin as its guess for c(x). After the learner obtains a ζ-approximation of c(x), it removes those functions in C that were inconsistent with this approximation and then proceeds to the next round. After sfat(C) many mistakes, one can show that the learner has identified the unknown concept. We remark that in prior works it was assumed that the learner obtained c(x) exactly, whereas here it only receives ζ-approximations. This robustness property allows us to use RSOA in the context of learning quantum states, where typically the feedback is generated by measuring E repeatedly on copies of the quantum state ρ, which provides a ζ-approximation of Tr(ρE).

Prior to our work, Aaronson et al. [ACH+
18] showed that the sfat(·) of the class of all n-qubit quantum states is O(n), which implies the existence of a quantum online learning algorithm for the class of all quantum states that makes at most O(n) mistakes. However, their focus was on quantum online learning with regret bounds, and so they never provided an explicit algorithm that achieves the sfat(·) mistake bound; they raised this as an open question. Our Result 1.1 resolves their question by showing that our RSOA algorithm can online learn a class of quantum states C while making sfat(C) many mistakes.
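The bin-and-eliminate principle behind RSOA can be sketched directly. For simplicity this toy version scores bins by the number of surviving concepts rather than by their sfat(·) dimension (the scoring the actual algorithm uses), and concepts are just value tuples on a 3-point domain; everything here is an illustrative stand-in, not the paper's algorithm.

```python
ZETA = 0.25
BINS = [(k * ZETA, (k + 1) * ZETA) for k in range(int(1 / ZETA))]

def predict(survivors, x):
    """Choose the bin holding the most surviving concepts at x (cardinality as
    a crude stand-in for RSOA's sfat-based scoring) and guess its midpoint."""
    def weight(b):
        lo, hi = b
        return sum(lo <= f[x] < hi or (hi == 1.0 and f[x] == 1.0)
                   for f in survivors)
    lo, hi = max(BINS, key=weight)
    return (lo + hi) / 2

def eliminate(survivors, x, feedback, zeta=ZETA):
    """Drop every concept inconsistent with a zeta-approximation of c(x)."""
    return [f for f in survivors if abs(f[x] - feedback) <= zeta]

# Toy run over a 3-point domain; the target is the third concept. Here the
# adversary happens to return c(x) exactly, though any value within ZETA of
# c(x) would do.
concepts = [(0.1, 0.9, 0.5), (0.9, 0.1, 0.5), (0.5, 0.5, 0.9), (0.5, 0.5, 0.1)]
target = concepts[2]
survivors, mistakes = list(concepts), 0
for x in range(3):
    guess = predict(survivors, x)
    feedback = target[x]
    mistakes += abs(guess - feedback) > ZETA
    survivors = eliminate(survivors, x, feedback)
```

After the three rounds this toy learner has made one mistake and only the target concept survives; the content of Result 1.1 is that the mistake count never exceeds sfat_ζ(C), however the adversary orders its queries.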
3. Online learning implies stability.
We now show that if a concept class C satisfies sfat(C) = d (i.e., it can be online-learned with d many mistakes), then C can be learned by a globally stable quantum algorithm with parameters (ε^{-d}, ε, ε^{d}), i.e., there exists a stable learner that, given ε^{-d} many examples (E_i, Tr(ρE_i)), with probability at least ε^{d} outputs a state σ that is ε-close to the unknown target state ρ.

Result 1.2.
Let C be a class of quantum states with sfat_ζ(C) = d. There exists an algorithm G that satisfies the following: for every ρ ∈ C, given T = ζ^{-d}/ε many labelled examples S consisting of E_i drawn from a distribution D over a set of orthonormal 2-outcome measurements and ζ-approximations of Tr(ρE_i), there exists a σ such that Pr_{S∼D^T}[G(S) ∈ B_M(ζ, σ)] ≥ ζ^{d} and Pr_{E∼D}[|Tr(ρE) − Tr(σE)| ≤ ζ] ≥ 1 − ε.

In order to prove this theorem, we borrow the high-level idea of Bun et al. [BLM20] (originally developed for the case of Boolean functions) to construct our stable learner, as did [JKT20] for the case of online multi-class regression: we sample many labelled examples from the distribution D and, instead of feeding these examples directly to the black-box RSOA, we plant amongst them some "mistake examples" before giving the processed sample to
RSOA. A "mistake example" is an example which is correctly labelled, but on which
RSOA would make the wrong prediction. (For simplicity, we have required that the measurements be drawn from an orthonormal set. This is only necessary if we require the algorithm G to be a proper learner, that is, one whose output function f is guaranteed to be such that one can always construct a density matrix σ for which Tr(σM) = f(M) for all M ∈ M.) That is to say, from a large pool of T = ζ^{-d} examples drawn from D, we craft a short sequence of d mistake examples and feed this short sequence into RSOA. This works because
RSOA satisfies the guarantee (Result 1.1) that after making d = sfat_ζ(C) mistakes, it will have completely identified the target concept.

We proceed similarly, but tackle some subtleties related to our learning setting. First, before our work there was no RSOA algorithm that could be used as a black box in order to emulate the proof technique of [BLM20] for the case of quantum states. Second, our technique for creating "mistake examples" differs from that of [BLM20]. In the Boolean case, to insert a mistake it suffices to do the following: suppose c : X → {0, 1} is the unknown target function; they take two candidate sets of examples S_1, S_2, run two parallel runs of RSOA on S_1 and S_2, and obtain two output hypothesis functions f_1, f_2. They then identify a point x in the domain at which f_1(x) ≠ f_2(x); since these are Boolean functions, one of the two must evaluate to c(x) and the other to 1 − c(x). Say f_1(x) = c(x) and f_2(x) ≠ c(x), i.e., the hypothesis f_2 makes a mistake at x. They append a "mistake" example of the form (x, c(x)) to S_2, so that when RSOA is now run on S_2 ∘ (x, c(x)), RSOA is forced to make a new mistake on this set of examples. A subtlety here is that a learner does not know c(x); [BLM20] simply flip a coin b ∈ {0, 1} and let the mistake example be (x, b), so that with probability 1/2, b = c(x). For us this does not work because our target function is real-valued, i.e., c(x) ∈ [0, 1]. Instead, we discretize the range [0, 1] into 1/ζ many ζ-intervals, pick a uniformly random interval, and let b be the center of this interval. Now, with probability ζ, c(x) lies in the ζ-ball around b, yet need not be equal to b. The construction of our quantum stable learner and its analysis are more involved, to overcome this issue and errors in the adversary's feedback.
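The real-valued replacement for the coin flip of [BLM20] can be sketched in a few lines; the target value c(x) = 0.42 below is an arbitrary hypothetical illustration.

```python
import random
random.seed(2)

ZETA = 0.1  # bin width; [0, 1] is split into 1/ZETA intervals

def random_bin_label(zeta=ZETA):
    """Pick one of the 1/zeta intervals of [0, 1] uniformly at random and
    return its midpoint: the real-valued analogue of the coin flip b in {0, 1}
    used by [BLM20] in the Boolean case."""
    k = random.randrange(int(1 / zeta))
    return (k + 0.5) * zeta

# Empirically, an unknown value c(x) falls within zeta/2 of the planted label
# with probability about zeta (here: exactly one of the ten midpoints works).
c_x, trials = 0.42, 20000
hits = sum(abs(random_bin_label() - c_x) <= ZETA / 2 for _ in range(trials))
rate = hits / trials
```

The empirical rate comes out near ζ = 0.1, which is one way to see why the stable learner of Result 1.2 only succeeds with probability decaying like ζ^d: all d planted labels must be simultaneously good.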
4. Stability does not imply approximate
DP PAC (without a domain size dependence).
So far we showed that quantum online learning of C implies the existence of a globally stable learner (with appropriate parameters) for C. For Boolean-valued classes, [BLM20] went one step further and created an approximately differentially-private learner from a stable learner; in this sense, stability can be viewed as an intermediate property between online learnability and differential privacy. A natural question is whether we can extend this result to our setting: we showed earlier that quantum online learning implies stability, but does quantum stability imply quantum differential privacy? If such a result also held in the quantum setting (or for real-valued functions), then Figure 1 would start and end with differential privacy (albeit starting from pure DP and ending at approximate DP) and would answer the question "what can be privately quantum-learned?" (akin to the classical work of [KLN+11]).

However, the stability guarantees of our learner G (Result 1.2) are somewhat unusual: there exists some function ball (around the target concept) such that the collective probability of G outputting its member functions is high, in contrast to the Boolean setting [BLM20], where global stability means that a single function is output with high probability. Again, this difference in our definition of global stability arises because we only require that our real-valued learner output a pointwise approximation of the target function c – namely, one that is in the ball of c. In the Boolean setting, to convert a stable learner to a private learner, [BLM20] used the stable histograms algorithm [BNS19] and the generic private learner (4.12), obtaining a private learner with sample complexity depending on Ldim(C) and the privacy and accuracy parameters of the stable learner, but not on the domain size of the function class. Now, in our quantum setting, since the learner only obtains ε-accurate feedback from the adversary, we allow the learner to output a function in the ε-ball around the target concept.
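For contrast, here is a simplified sketch, loosely modeled on the stable-histogram idea of [BNS19], of the Boolean-case conversion from stability to privacy: tally the hypotheses produced by independent runs of a stable learner, add Laplace noise to the tallies, and release only heavy hitters. The 60%-stable toy learner, the noise scale and the threshold are all illustrative assumptions, and this is not the exact [BNS19] mechanism.

```python
import math
import random
random.seed(3)

def laplace(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def stable_histogram(outputs, scale, threshold):
    """Tally hypotheses from independent runs, privatize each tally with
    Laplace noise, and keep those whose noisy count clears the threshold.
    A globally stable learner places constant mass on one hypothesis, so
    that hypothesis survives while rare ones are suppressed."""
    counts = {}
    for h in outputs:
        counts[h] = counts.get(h, 0) + 1
    return {h for h, c in counts.items() if c + laplace(scale) > threshold}

# Toy stable learner: outputs 'h_star' on 60 of 100 runs, junk otherwise.
runs = ['h_star'] * 60 + ['junk%d' % i for i in range(40)]
released = stable_histogram(runs, scale=2.0, threshold=20)
```

In the real-valued setting of this paper, the stable learner only guarantees a heavy ball of hypotheses rather than a single heavy hypothesis, so no individual tally need clear the threshold; this is exactly where the black-box conversion breaks down, as discussed next.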
We also show that the generic transformation from a stable learner to a private learner does not work in the real-valued setting (and in particular in the quantum setting). The idea is to show that this problem is a general case of the one-way marginal release problem, whose complexity must depend on the domain size. In particular, for learning quantum states on the Pauli observables on n qubits, the sample complexity of the DP PAC learner must depend exponentially on n. The proof of this lower bound uses ideas from classical fingerprinting codes [BUV18] (which were also used earlier by Aaronson and Rothblum [AR19] to give lower bounds on gentle shadow tomography).

Comparison to prior work [JKT20].
After completion of this work, we were made aware by an anonymous referee of the paper by Jung, Kim and Tewari [JKT20], which extends the work of Bun et al. [BLM20] to multi-class functions (i.e., when the concept class C to be learned maps to a discrete set {1, . . . , k}). They claim that their results also apply to real-valued learning by discretizing the range of the functions (we could not find a version of the paper that spells out the proof that online learnability implies a stable real-valued learner, but this seems implicit from their proofs). Despite this similarity, our quantum learning setting and the resulting analysis differ from theirs in several crucial ways, which we now outline.

Firstly, [JKT20]'s notion of stability for learning real-valued functions resembles our definition; however, in order to prove that online learnability implies stability, they crucially rely on a modified Littlestone dimension. In this work, we use the standard notion of sfat(·) – which we also bound in the case of quantum states – and still show this implication. Secondly, for both the PAC learning and online learning settings, [JKT20] assume that the feedback received by the learner is exact, i.e., for online learning, on input x, the adversary produces c(x) ∈ [0,
1] and for
PAC learning, the examples are of the form (x, c(x)). By contrast, in this work we only assume that the feedback in all the learning models we consider (which include both these settings) is an ε-approximation of c(x). This generalizes the previous settings and arises from the fact that, in quantum learning, the feedback comes from some quantum estimation process or quantum measurement. Thus, all implications proven in this work are robust to such adversarial imprecision. This imprecision crucially bars using [BLM20]'s technique, developed for Boolean functions, as a black box to conclude approximate DP PAC learning from stable learning.
We now highlight a few applications of the consequences we established above.

1. Faster shadow tomography for classes of states. Aaronson [Aar18] introduced a learning model called shadow tomography. Here, the goal is to learn the "shadows" of an unknown quantum state ρ: given m measurements E_1, . . . , E_m, how many copies of ρ suffice to estimate Tr(ρE_i) for all i ∈ [m]? Aaronson surprisingly showed that poly(n, log m) copies of ρ suffice for this task, and an important open problem was (and remains): can we get rid of the n-dependence in the complexity (even for a class of interesting quantum states)? Subsequently, in a recent work, Aaronson and Rothblum [AR19] also showed that learnability in the online setting can be translated into algorithms for gentle shadow tomography (in an almost black-box fashion). In this work, we use our results on quantum online learning and the ideas in [AR19] to show that the complexity of shadow tomography (assuming that the unknown state ρ comes from a set U) can be made poly(sfat(U), log m). This argument was communicated to us by Mark Bun [BJKT21]. (We remark that Aaronson's model [Aar18] is not concerned with specific classes of quantum states, and instead considers learnability of an arbitrary quantum state. Nevertheless, it is often reasonable to assume we have some prior information on the state to be learned, which means that it comes from a smaller class.)

2. A better bound on sfat(·). Let U_n be the class of all n-qubit states. As we mentioned earlier, Aaronson et al. [ACH+
18] showed that sfat(U_n) is at most O(n), but clearly, for a subset U ⊆ U_n of quantum states, it is possible that sfat(U) ≪ sfat(U_n). In this direction, using techniques from quantum random access codes (which were also used before in the works of [Aar07, ACH+18]), we exhibit classes of states for which sfat(·) is much smaller than n. Consider the set U of "k-juntas", i.e., n-qubit states that all live in the same unknown k-dimensional subspace. In this case it is not hard to see that the Holevo information of this ensemble is at most log k, which improves upon the trivial upper bound of n on sfat(U). We discuss more such classes of states below.

3. Relations to Shannon theory. Another intriguing connection we develop in this work is between quantum learning theory and Shannon theory. This connection is already implicit in the previous point, since it relates the sequential fat-shattering dimension (a well-studied notion in learning theory) to the Holevo information of an ensemble (which is well-studied in quantum information theory). We now establish the following: let U again be a class of quantum states and let N be a quantum channel. Let U′ = {N(U) : U ∈ U} be the set of states obtained after passing through the quantum channel N. Suppose the goal is to learn U′ (i.e., to learn a class of states that have passed through a noisy channel N, for example if the state-preparation channel is noisy). This connects to the question of experimentally learning quantum states: can we learn states prepared using a noisy quantum device and, as a by-product, learn the unknown noise in the quantum device?

In this case we show that sfat(U′) ≤ C(N), i.e., the sequential fat-shattering dimension is upper bounded by the classical capacity of N. Since we have shown that sfat(·) is an important parameter in many learning models, this immediately implies that the complexity of learning the class of states U′ is at most the channel capacity of N.
We now give a few consequences of this result. Consider states subject to depolarizing and Pauli noise, two commonly-used noise models. For these n-qubit channels, the channel capacity is n − ∆, where ∆ is an error term depending on the channel parameters [Kin03, Siu19, Siu20]. Hence, for extremely noisy channels, for example when ∆ = n − o(n), our new bound on sfat(·) is o(n), which is a significant improvement over n. We also consider quantum learning of Gaussian states. Let S be the set of Gaussian states with finite average energy E. Considering the pure-loss Bosonic channel with transmissivity 1, this enables us to bound sfat(S) ≤ O(log E) [GGL+04, GGPCH14]. Observe that previous bounds would not even yield sfat(S) < ∞, since these states are infinite-dimensional. As far as we are aware, this is the first work to consider learnability of continuous-variable states. These connections give the Shannon-theoretic notions of channel capacity and Holevo information a learning-theoretic interpretation, and our results can be seen as an important interdisciplinary bridge between these fields.

4. Classical contribution. Although our main results above have been stated in terms of learning quantum states, our explicit theorem statements below are in terms of learning real-valued functions with imprecise feedback on the examples. As far as we are aware, even classically, establishing equivalences between online learning, stability and approximate differential privacy for real-valued functions with precise feedback was only recently explored in the work of [JKT20] (which we compare against our work in the previous section), and in our work we look at imprecise feedback. Indeed, learning n-qubit quantum states over an orthogonal basis of n-qubit quantum measurements, M, is equivalent to learning — with imprecise adversarial feedback — an arbitrary real-valued concept class D = {f : X → [0, 1]}, for X = M: there is a one-to-one mapping between the set of all quantum states and real-valued functions on M, i.e., for every σ, one can clearly associate a function f_σ : M → [0, 1] defined as f_σ(M) = Tr(Mσ), and for the converse direction, given an arbitrary c : M → [0, 1], one can find a density matrix σ for which c(M) = Tr(Mσ) for all M ∈ M (and this uses the orthogonality of M crucially). (k-juntas are well-studied in computational learning theory, wherein a Boolean function on n bits is a k-junta if it depends on an unknown subset of k input bits.) Hence, if one can learn D when we fix X to be an arbitrary orthogonal basis of 2-outcome measurements, then one can learn the class of quantum states C, and the converse is also true. All our main theorems are stated for the general case of learning D for arbitrary X.

Open questions.
We now conclude with a few concrete questions.

1. In this work we work in the PAC setting, where there is an unknown concept from the class labelling the training set. Do all these equivalences also hold in the agnostic setting, where there might not be a true concept labelling the training data? The agnostic model of learning is a way to model noise, which is relevant when experimentally learning quantum states.

2. In a very recent work, Ghazi et al. [GGKM20] improved upon the result of Bun et al. [BLM20] by showing that a polynomial blow-up in sample complexity suffices in going from online learning to differential privacy, which is exponentially better than the result of Bun et al. [BLM20]. Can we improve the complexity in this work using techniques from Ghazi et al. [GGKM20]?

3. Bun showed [Bun20] that the equivalence between private learning and online learning cannot be made computationally efficient (even with polynomial sample complexity), assuming the existence of one-way functions. Does this also extend to the quantum setting?

4. Our work shows interesting classes of states for which we can improve the complexity of gentle shadow tomography. Could we make an analogous statement for the recent improved shadow tomography procedure of [BO20]? Furthermore, could we get rid of the sfat(·) dependence in the sample complexity of shadow tomography? Additionally, can we also improve the standard tomography problem?

5. Our RSOA algorithm is time-inefficient since it computes sfat(·) of arbitrary classes of states. Is there a time-efficient quantum online learning algorithm for an interesting class of states?

6. Is there a quantum algorithm that improves the complexity of RSOA? For the case of Boolean functions, Kothari [Kot14] showed how to use quantum techniques to improve the classical halving algorithm (which is the precursor to the Standard Optimal Algorithm). Can a similar technique be applied to our RSOA algorithm?

7. Classically, Kasiviswanathan et al. [KLN+11] established connections between statistical query learning and local differential privacy. Do these connections also extend to the quantum regime, using the recently defined notion of quantum statistical query learning [AGY20]?
Acknowledgements.
We thank Mark Bun for various clarifications and also for providing us with a proof of Claim 4.9. SA was partially supported by the IBM Research Frontiers Institute and acknowledges support from the Army Research Laboratory and the Army Research Office under grant number W911NF-20-1-0014. YQ was supported by the Stanford QFARM fellowship and an NUS Overseas Graduate Scholarship. JS and SA acknowledge support from the IBM Research Frontiers Institute. We remark that in our setting, if C is the class of all quantum states then sfat(C) = n, so we get the same complexity as Aaronson [Aar18, AR19]. Preliminaries
Notation.
Throughout this paper we will use the following notation. We let X be the input domain of real-valued functions (eventually, when instantiating to quantum learning, we will let X be a set of 2-outcome measurements denoted by M). We will let C be a concept class of real-valued functions, i.e., C ⊆ {f : X → [0, 1]}, and let H be a collection of concept classes C. For a distribution D : X → [0, 1], hypotheses h, c : X → [0, 1] and a distance parameter r ∈ [0, 1], define

Loss_D(h, c, r) := Pr_{x∼D}[|h(x) − c(x)| > r]. (2)

Quantum learning setting.
While we are interested in the quantum learning setting — learning n-qubit quantum states in the class U over an orthogonal basis of n-qubit quantum measurements, M — our results apply more generally to learning an arbitrary real-valued function class C = {f : X → [0, 1]} with imprecise adversarial feedback. Therefore, the learning models we introduce, and our theorems in the rest of this paper, will be for the more general real-valued setting. For X = M, these two problems are equivalent: there is a one-to-one mapping between the set of all quantum states and real-valued functions on M, i.e., for every σ, one can clearly associate a function f_σ : M → [0, 1] defined as f_σ(M) = Tr(Mσ), and for the converse direction, given an arbitrary c : M → [0, 1] which is the learner’s hypothesis function, one can find a density matrix σ for which c(M) = Tr(Mσ) for all M ∈ M (and this uses the orthogonality of M crucially). Hence, if one can learn C for X = M then one can learn the class of quantum states U, and the converse is also true. When U is a subset of the set of all n-qubit states, the learner we construct is an improper learner, i.e., it could output a density matrix σ not in U, which is nevertheless useful for prediction. If it is not important that the hypothesis function correspond to an actual density matrix, it is not necessary to restrict the measurements to come from an orthogonal basis.

PAC learning.
We first introduce the
PAC learning model for real-valued concept classes.
Definition 2.1 (PAC learning). Let α, ζ ∈ [0, 1]. An algorithm A (ζ, α)-PAC learns C with sample complexity m if the following holds: for every c ∈ C and distribution D : X → [0, 1], given m labelled examples {(x_i, ĉ(x_i))}_{i=1}^m where each x_i ∼ D and |c(x_i) − ĉ(x_i)| ≤ ζ/2, with probability at least 3/4 (over the random examples and the randomness of A) it outputs a hypothesis h satisfying

Pr_{y∼D}[|c(y) − h(y)| ≥ ζ] ≤ α. (3)

We remark that in the definition above, we assume the success probability of the algorithm is 3/4; by repeating O(log(1/β)) times, we can boost 3/4 to 1 − β using standard techniques, as mentioned in [IW20].

Online learning
Let us now introduce the online learning setting in the form of a game between two players: the learner and the adversary. As always, we shall be concerned with learning real-valued concept classes C := {f : X → [0, 1]}, and we let the target function be c ∈ C. (An alternative definition of the PAC model of learning is the following: a learner obtains (x_i, b) where b ∈ {0, 1} satisfies Pr[b = 1] = c(x_i). Both these models are equivalent up to poly-logarithmic factors.) In the rest of this paper, we consider improper online learning, also known in the literature as online prediction, where the learner’s objective is to make predictions for c(x) given some point x ∈ X, and it may do so using a hypothesis function f(x) not necessarily in C. Importantly, we also depart from the real-valued online learning literature in allowing the adversary to be imprecise; that is, the adversary may respond to the learner with feedback that is ε-away from the true value (this is made more precise below). This generalization allows for the case when the feedback is generated by a randomized algorithm with approximation guarantees, a statistical sample, or a physical measurement.

The following setting, which we also call the strong feedback setting, was introduced by [ACH+18]. It is a T-round procedure: at the t-th round,

1. The adversary provides an input point in the domain: x_t ∈ X.
2. The learner has a local prediction function f_t, which may not necessarily be in C, and predicts ŷ_t = f_t(x_t) ∈ [0, 1].
3. The adversary provides feedback ĉ(x_t) ∈ [0, 1] satisfying |ĉ(x_t) − c(x_t)| < ε.
4. The learner suffers loss |ŷ_t − c(x_t)|.

At the end of T rounds, the learner has computed a function f_{T+1}, which serves as its prediction rule. If the learner is such that f_{T+1} is not guaranteed to be in C, we call the learner an ‘improper learner’. Such a learner can, however, still make predictions f_{T+1}(x) on any given input x ∈ X. Alternatively, we could also require that the learner be ‘proper’, that is, it must output some f_{T+1} ∈ C. Generally, the goal of the learner is either to make as few prediction mistakes as possible within T rounds (where a ‘mistake’ is defined as |f_t(x_t) − c(x_t)| > ε, to be discussed more below), or to minimize regret for a given notion of loss, which is the total loss of its predictions compared to the loss of the best possible prediction function that could be found with perfect foresight. The former, ‘mistake-bound’ setting is the one relevant to quantum states, so we focus on that for the rest of this paper.

Some variants of our strong feedback setting could also be considered, and we now explain how they are related to our setting. Firstly, [RST10] and [JKT20] consider an alternative setting for online prediction of real-valued functions that differs from ours in step (3). There, the adversary’s feedback is c(x_t) itself and is infinitely precise; to recover that setting from ours, we merely set ε = 0 in step (3). Since in our setting we allow arbitrary ε, we accommodate the possibility of a precision-limited adversary, for instance if the adversary’s feedback comes from some estimation process or physical measurement. A second alternative setting is where the adversary only commits to providing weak feedback: ĉ(x_t) = 0 if |ŷ_t − c(x_t)| < ε and ĉ(x_t) = 1 otherwise. Additionally, if the latter is true, the adversary specifies to the learner whether c(x_t) > ŷ_t + ε or c(x_t) < ŷ_t − ε.
We have termed this ‘weak feedback’ because it contains only two bits of information, whereas in the strong feedback setting considered above, the feedback contains O(log(1/ε)) bits of information. (That is to say, a learner that works in the strong feedback setting can also work in the weak feedback setting, by mounting a binary search of the range [0, 1] to obtain for itself an ε-approximation of the strong feedback at every round. Conversely, a learner that works for the weak feedback setting also works in the strong feedback setting, by throwing away some information in the strong feedback.)

Mistake bound for online learning. We now introduce the notion of the ‘mistake bound’ of an online learner. Before defining the model, we first define an ε-mistake at step (3) of the T-step procedure we mentioned above.

Definition 2.2 (ε-mistake). Let the target concept be c. At a given round, let the input point be x_t and the learner’s guess be ŷ_t. The learner has made a mistake if |ŷ_t − c(x_t)| ≥ ε.

We now define the mistake-bound model of online learning.
Definition 2.3 (Mistake bound). Let A be an online learning algorithm for class C. Consider any sequence S = (x_1, ĉ(x_1)), . . . , (x_T, ĉ(x_T)), where T is any integer, c ∈ C and ĉ(x_i) is the feedback given to the online learner on point x_i. Let M_A(S) be the number of mistakes A makes on the sequence S. We define the mistake bound of learner A (for C) as M_A(C) := max_S M_A(S), where S ranges over sequences of the above form. We say that class C is online learnable if there exists an algorithm A for which M_A(C) ≤ B < ∞. We further define the mistake bound of a concept class as M(C) := min_A M_A(C), where the minimization is over all valid online learners A for C.

The mistake bound of class C, M(C), is one way to measure the online learnability of C. For learning Boolean function classes, [Lit88] showed that this bound gives an operational interpretation to the Littlestone dimension of the function class: min_A M_A(C) = Ldim(C). To show that there exists A such that M_A(C) ≤ Ldim(C), Littlestone constructed a generic algorithm — the Standard Optimal Algorithm — to learn any class C while making at most Ldim(C)-many mistakes on any sequence of examples. The mistake-bounded online learning model outlined above, instantiated for quantum states, recovers the ‘online learning of quantum states’ model proposed by [ACH+18]. The work of [ACH+18] focuses on regret bounds for online learning, and here we focus on online learning with bounded mistakes. While this can be viewed as a special case of bounding regret (with an indicator loss function), the mistake-bound viewpoint opens up the connection to other models of learning, as we will see in the rest of this paper.
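The mistake-bound game above can be sketched for a tiny finite class. The halving-style learner below is a minimal illustration of ours, not the paper's RSOA; the domain, class, and elimination thresholds are illustrative choices:

```python
import random

# Sketch of the mistake-bound online game with eps-imprecise adversarial
# feedback, for a small finite real-valued concept class. The learner
# predicts the median value of the surviving functions and eliminates
# functions inconsistent with the (imprecise) feedback.

X = [0, 1, 2, 3]                           # domain
C = [                                      # concept class: X -> [0, 1]
    {0: 0.1, 1: 0.9, 2: 0.1, 3: 0.9},
    {0: 0.9, 1: 0.1, 2: 0.9, 3: 0.1},
    {0: 0.1, 1: 0.1, 2: 0.9, 3: 0.9},
]
eps = 0.05
target = C[2]                              # unknown target concept c
random.seed(0)

V = list(C)                                # surviving functions
mistakes = 0
for t in range(8):                         # T rounds
    x = random.choice(X)                   # 1. adversary picks x_t
    y_hat = sorted(f[x] for f in V)[len(V) // 2]      # 2. learner predicts
    feedback = target[x] + random.uniform(-eps, eps)  # 3. imprecise feedback
    if abs(y_hat - target[x]) > 2 * eps:   # 4. count a mistake
        mistakes += 1
    # eliminate functions inconsistent with the imprecise feedback
    V = [f for f in V if abs(f[x] - feedback) <= 2 * eps]

assert target in V                         # the true concept always survives
```

Because the feedback is ε-close to the true value, the target is never eliminated, which is the same correctness invariant that RSOA maintains (Lemma 3.3 below uses exactly this property).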
The task of designing randomized algorithms with privacy guarantees has attracted much attention classically, with the motivation of preserving user privacy [DR14]. Below we formally introduce differential privacy, one way of formalizing privacy. Let A be a learning algorithm. Let S be a sample set consisting of labelled examples {(x_i, ℓ_i)}_{i∈[n]} where x_i ∈ X and ℓ_i ∈ [0, 1], which is the input to A. We say two sample sets S, S′ are neighboring if there exists i ∈ [n] such that (x_i, ℓ_i) ≠ (x′_i, ℓ′_i) and for all j ≠ i it holds that (x_j, ℓ_j) = (x′_j, ℓ′_j). Additionally, we define (ε, δ)-indistinguishability of probability distributions: for a, b, ε, δ ∈ [0, 1], let a ≈_{ε,δ} b denote the statement a ≤ e^ε b + δ and b ≤ e^ε a + δ. We say that two probability distributions p, q are (ε, δ)-indistinguishable if p(E) ≈_{ε,δ} q(E) for every event E.

Definition 2.4 (Differentially-private learning). A randomized algorithm A : (X × [0, 1])^n → [0, 1]^X is (ε, δ)-differentially-private if for every two neighboring sample sets S, S′ ∈ (X × [0, 1])^n, the output distributions A(S) and A(S′) are (ε, δ)-indistinguishable.

Definition 2.5 (Differentially-private PAC learning). Let
C ⊆ {f : X → [0, 1]} be a concept class. Let ζ, α ∈ [0, 1] be accuracy parameters and ε, δ be privacy parameters. We say C can be learned with sample complexity m(ζ, α, ε, δ) in a private PAC manner if there exists an algorithm A that satisfies the following:

• PAC learner — Algorithm A is a (ζ, α)-PAC learner for C with sample size m (as formulated in Definition 2.1).
• Privacy — Algorithm A is (ε, δ)-differentially private (as formulated in Definition 2.4).

We shall say such a learner is (ζ, α, ε, δ)-PPAC.

In this section, we introduce one-way classical and quantum communication complexity. Different from the usual setting, here we consider communication protocols that compute real-valued and not just Boolean functions. In the one-way classical communication model, there are two parties, Alice and Bob. Let
C ⊆ {f : {0, 1}^n → [0, 1]} be a concept class. We consider the following task, which we call Eval_C: Alice receives a function f ∈ C and Bob receives an x ∈ X. Alice and Bob share random bits, and Alice is allowed to send classical bits to Bob, who needs to output a ζ-approximation of f(x) with probability 1 − ε. We let R^→_{ζ,ε}(c, x) be the minimum number of bits that Alice communicates to Bob so that he can output a ζ-approximation of f(x) with probability at least 1 − ε (where the probability is taken over the randomness of Alice and Bob). Let R^→_{ζ,ε}(C) = max{R^→_{ζ,ε}(c, x) : c ∈ C, x ∈ X}.

We will also be interested in the quantum one-way communication model. The setting here is exactly the same as above, except that now Alice and Bob can apply quantum unitaries locally and Alice is allowed to send qubits instead of classical bits to Bob. Like before, we let Q^→_{ζ,ε}(c, x) be the minimum number of qubits that Alice communicates to Bob so that he can output a ζ-approximation of c(x) with probability at least 1 − ε (where the probability is taken over the randomness of Alice and Bob). Let Q^→_{ζ,ε}(C) = max{Q^→_{ζ,ε}(c, x) : c ∈ C, x ∈ X}.

An important conceptual contribution of this paper is the notion of stability of algorithms. The notion of stability has been used in several previous works [DR14, BLM20, ALMM19, AJL+].

Definition 2.6 (Stability). Let
C ⊆ {f : X → [0, 1]} be a concept class and η, ζ ∈ [0, 1]. Let D : X → [0, 1] be a distribution and c ∈ C be an unknown target concept. We say a learning algorithm A is (T, η, ζ)-stable with respect to D if: given T many labelled examples S = {(x_i, c(x_i))} where x_i ∼ D, there exists a hypothesis f such that

Pr[A(S) ∈ B(ζ, f)] ≥ η,

where the probability is taken over the randomness of the algorithm A and the examples S.
It is worth noting that in the standard notion of global stability (for example the one used in [BLM20]), we say an algorithm A is stable if a single function is output by A with high probability. In the real-valued robust scenario, one cannot hope for similar guarantees, because the adversary is allowed to be ζ-off with its feedback at every round. In particular, the adversary’s feedback could correspond to a different function from the target concept c. However, the intuition is that any adversarially-chosen alternative function cannot be “too” far from c. Inspired by the definition above, we also define quantum stability as follows.

Definition 2.7 (Quantum Stability). Let S be a class of n-qubit quantum states and η, ζ ∈ [0, 1]. Let D : X → [0, 1] be a distribution over orthogonal 2-outcome measurements and ρ ∈ S be an unknown quantum state. We say a learning algorithm A is (T, η, ζ)-stable with respect to D if: given T many labelled examples Q = {(E_i, Tr(ρE_i))} where E_i ∼ D, there exists a quantum state σ such that

Pr[A(Q) ∈ B(ε, σ)] ≥ η, (4)

where the probability is taken over the examples in Q, and B(ε, σ) is the ball of states ε-close to σ with respect to X, i.e., B(ε, σ) = {σ′ : |Tr(Eσ) − Tr(Eσ′)| < ε for every E ∈ X}.
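Membership in the ball B(ε, σ) from Definition 2.7 is straightforward to check numerically; a small sketch follows, where the single-qubit measurement set and the example states are our illustrative choices, not the paper's:

```python
import numpy as np

# Sketch of the ball B(eps, sigma): sigma' lies in the ball iff
# |Tr(E sigma) - Tr(E sigma')| < eps for every measurement E in the set.

I2 = np.eye(2, dtype=complex)
Xp = np.array([[0, 1], [1, 0]], dtype=complex)       # Pauli X
Zp = np.diag([1.0, -1.0]).astype(complex)            # Pauli Z
meas = [I2 / 2, (I2 + Xp) / 2, (I2 + Zp) / 2]        # 2-outcome elements

def in_ball(sigma_prime, sigma, eps, measurements):
    """Check whether sigma' is in B(eps, sigma) w.r.t. the measurements."""
    return all(
        abs(np.trace(E @ sigma).real - np.trace(E @ sigma_prime).real) < eps
        for E in measurements
    )

rho = np.diag([1.0, 0.0]).astype(complex)            # |0><0|
rho_noisy = 0.95 * rho + 0.05 * I2 / 2               # slightly depolarized

assert in_ball(rho_noisy, rho, eps=0.1, measurements=meas)
assert not in_ball(I2 / 2, rho, eps=0.1, measurements=meas)
```

The check is per-measurement rather than in trace distance, which is exactly why stability here is defined relative to the measurement set X.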
PAC learning and online learning real-valuedfunction classes { f : X → [0 , } . These are the fat-shattering (for PAC learning) and sequentialfat-shattering dimension (for online learning). They can be viewed as the real-valued analogs ofthe VC dimension and Littlestone dimension respectively for
PAC learning and online learningBoolean function classes { f : X → { , }} . Below we define the combinatorial parameters forreal-valued functions. Fat-Shattering dimension
The set {x_1, . . . , x_k} ⊆ X is γ-fat-shattered by concept class C if there exist real numbers α_1, . . . , α_k ∈ [0, 1] such that for every k-bit string y = (y_1 · · · y_k) there exists a concept f ∈ C such that if y_i = 0 then f(x_i) ≤ α_i − γ, and if y_i = 1 then f(x_i) ≥ α_i + γ. The fat-shattering dimension of C, denoted fat_γ(C), is the largest k for which there exists a set {x_1, . . . , x_k} ⊆ X that is γ-fat-shattered by C. We remark that if the functions in C have range {0, 1} and γ > 0, then fat_γ(C) is just the standard VC dimension.

Sequential Fat-Shattering dimension
We also define an analog of the fat-shattering dimension for online learning. The presentation of this dimension closely follows [ACH+18]. A complete binary tree T of depth k is an ε-sequential fat-shattering tree for C if it satisfies the following:

1. For every internal vertex w ∈ T, there is some domain point x_w ∈ X and a threshold a_w ∈ [0, 1] associated with w, and
2. For each leaf vertex v ∈ T, there exists f ∈ C that causes us to reach v if we traverse T from the root, such that at any internal node w we traverse the left subtree if f(x_w) ≤ a_w − ε and the right subtree if f(x_w) ≥ a_w + ε. If we view the leaf v as a k-bit string, the function f is such that for all ancestors u of v, we have f(x_u) ≤ a_u − ε if v_i = 0, and f(x_u) ≥ a_u + ε if v_i = 1, when u is at depth i − 1.

The ε-sequential fat-shattering dimension of C, denoted sfat_ε(C), is the largest k such that we can construct a complete depth-k binary tree T that is an ε-sequential fat-shattering tree for C. Again, we remark that if the functions in C have range {0, 1} and γ > 0, then sfat_γ(C) is just the standard Littlestone dimension [Lit88].

Representation dimension.
The representation dimension of concept class C roughly considers the collection of all distributions over sets of hypothesis functions (not necessarily from the class C) that “cover” C. We make this precise below. This dimension is known to capture the sample complexity of various models of differentially private learning of Boolean functions [KLN+11, BKN10]. Because we shall be concerned with learning real-valued concept classes, we define these notions below with an additional ‘tolerance’ parameter ζ.

Definition 2.8 (Deterministic representation dimension
DRdim , real-valued analog of [BKN10]) . Let
C ⊆ {f : X → [0, 1]} be a concept class. A class of functions H deterministically (ζ, ε)-represents C if for every f ∈ C and every distribution D : X → [0, 1], there exists h ∈ H such that

Pr_{x∼D}[|h(x) − f(x)| > ζ] ≤ ε. (5)

The deterministic representation dimension of C (abbreviated DRdim(C)) is

DRdim_{ζ,ε}(C) = min_H log |H|, (6)

where the minimization is over all H that deterministically (ζ, ε)-represent C.

Definition 2.9 (Probabilistic representation dimension
PRdim , real-valued analog of [BNS13]) . Let
C ⊆ {f : X → [0, 1]} be a concept class. Let H be a collection of concept classes of real-valued functions, and P : H → [0, 1]. We say (H, P) is a (ζ, ε, δ)-representation of C if for every f ∈ C and distribution D : X → [0, 1], with probability at least 1 − δ (over the choice of H ∼ P), there exists h ∈ H such that

Pr_{x∼D}[|h(x) − f(x)| > ζ] ≤ ε. (7)

The probabilistic representation dimension of C (abbreviated PRdim(C)) is

PRdim_{ζ,ε,δ}(C) = min_{(H,P)} max_{H ∈ supp(H)} log |H|, (8)

where the outer minimization is over all pairs (H, P) that are valid (ζ, ε, δ)-representations.

In this section, we present an algorithm that improperly online-learns a real-valued function class C, making at most sfat(C) many mistakes (see Definition 2.3). This algorithm is an important tool for results in the rest of the paper. All results in this section are presented for the general case of online-learning arbitrary real-valued function classes, with imprecise adversarial feedback. Ultimately, we will use this algorithm as a subroutine for the specific setting of quantum learning. Our algorithm’s learning setting generalizes that of [RST10] and [JKT20], who also studied online learning of real-valued and multi-class functions (i.e., functions mapping to a finite set), albeit, in the former case, with precise adversarial feedback (ε = 0). [JKT20] defined several extensions of the Littlestone dimension Ldim_τ for τ ∈ Z_+ and showed that for learning a multi-class function class C, the mistake bound M(C) is bounded above and below in terms of Ldim_τ(C). They also showed that for a real-valued function class C, sfat(C) is linked to the Ldim_τ of a discretization of the function class, thus effectively transforming any real-valued learning problem into a multi-class learning problem.
However, their approach does not work for our setting, for the following reason: if c is the target real-valued function, and the true value of c(x) is ε-close to a boundary of some class within the discretized range, our ε-imprecise adversary could choose a value of the feedback ĉ(x) that falls in the neighboring class. Hence the resulting multi-class learner has to deal with the adversary reporting the wrong class, which is beyond the scope of what they considered.

In Section 3.1, we first construct an algorithm, the Robust Standard Optimal Algorithm (RSOA), whose mistake bound satisfies M_RSOA(C) ≤ sfat(C) for online learning with strong feedback. In Section 3.2, we prove some of the properties of this algorithm, which are essential for proving later results in this paper. Moreover, for online learning with weak feedback, we show that an adversary can force at least sfat(C) mistakes. We cannot, however, make the same statement for online learning with strong feedback (this would be the real-valued analog of the relation M(C) = Ldim(C) proved by Littlestone for Boolean function classes). It is an open question whether we can close this gap, but for the rest of this paper we are concerned solely with online learning with strong feedback, and hence the implication M_RSOA(C) ≤ sfat(C) is sufficient.

In this section, we give an algorithm to online-learn real-valued functions with strong feedback. In order to handle subtleties caused by learning functions with output in [0, 1] instead of {0, 1}, we define the notion of a ζ-cover. This was introduced by [RST10], and in order to handle inaccuracies in the output of the adversary, we extend their notion to define an interleaved ζ-cover.

Definition 3.1 (ζ-cover and interleaved ζ-cover). Let 0 < ζ < 1 be such that 1/ζ is an integer. A ζ-cover of the [0, 1] interval is a set of non-overlapping half-open intervals (‘bins’) of width ζ given by {[0, ζ), [ζ, 2ζ), . . . , [1 − ζ, 1]} with the midpoints I_ζ = {ζ/2, 3ζ/2, . . . , 1 − ζ/2}, where |I_ζ| = 1/ζ. Given a ζ-cover I_ζ, the corresponding interleaved ζ-cover Ĩ_ζ is the set of overlapping half-open intervals (‘super-bins’) of width 2ζ (each consisting of two adjacent bins in I_ζ) given by {[0, 2ζ), [ζ, 3ζ), . . . , [1 − 2ζ, 1]} with the midpoints Ĩ_ζ = {ζ, 2ζ, . . . , 1 − ζ}, where |Ĩ_ζ| = |I_ζ| − 1. We denote a super-bin with midpoint r as SB(r).

We will also need the definition of a ζ-ball.

Definition 3.2 (ζ-ball). A ζ-ball around an arbitrary point x ∈ [0, 1] (denoted B(ζ, x)) is the open interval of radius ζ around x, i.e., B(ζ, x) := (x − ζ, x + ζ).

As we mentioned earlier, the FAT-SOA algorithm of [RST10] used α-covers to understand real-valued online learning; however, it does not suffice in the setting of quantum learning, since the output of the adversary could be imprecise. To account for this, we use the interleaved α-covers defined above.
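The cover construction above can be sketched directly; the parameter values below are illustrative, and the containment check encodes the property the algorithm relies on (a ζ-ball fits inside the nearest super-bin when the cover width α is at least 2ζ):

```python
# Sketch of the zeta-cover and interleaved cover of [0, 1], plus the
# containment property: B(zeta, x) sits inside the super-bin SB(r) with
# the nearest midpoint r, provided alpha >= 2 * zeta.

def covers(alpha):
    """Return bins, bin midpoints, super-bins, super-bin midpoints."""
    n = round(1 / alpha)                    # 1/alpha is assumed an integer
    bins = [(i * alpha, (i + 1) * alpha) for i in range(n)]
    mids = [(i + 0.5) * alpha for i in range(n)]
    super_bins = [(i * alpha, (i + 2) * alpha) for i in range(n - 1)]
    super_mids = [(i + 1) * alpha for i in range(n - 1)]
    return bins, mids, super_bins, super_mids

alpha, zeta = 0.25, 0.1                     # illustrative: alpha >= 2*zeta
bins, mids, sbins, smids = covers(alpha)
assert len(smids) == len(mids) - 1          # |I~| = |I| - 1

for k in range(1, 100):                     # check B(zeta, x) inside SB(r)
    x = k / 100
    if zeta < x < 1 - zeta:
        r = min(smids, key=lambda m: abs(x - m))
        lo, hi = sbins[smids.index(r)]
        assert lo <= x - zeta and x + zeta <= hi
```

The point of the interleaving is visible here: with non-overlapping bins alone, a point near a bin boundary has a ball straddling two bins, whereas every point is well inside some super-bin.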
Our learning algorithm will take advantage of the following property enjoyed by the interleaved α-cover: the ζ-ball of any point is guaranteed to be entirely contained inside some super-bin, i.e., for every x ∈ (ζ, 1 − ζ), α ≥ 2ζ and r = argmin_{r∈Ĩ_α}{|x − r|}, we have B(ζ, x) ⊂ SB(r). Finally, we need one more piece of notation: given a set of functions V ⊆ {f : X → [0, 1]}, r ∈ Ĩ_ζ and x ∈ X, define a (possibly empty) subset V(r, x) ⊆ V as V(r, x) = {f ∈ V : f(x) ∈ B(2ζ, r)}, i.e., V(r, x) is the set of functions f ∈ V for which f(x) is within a 2ζ-ball around r, i.e., f(x) ∈ [r − 2ζ, r + 2ζ]. We are now ready to present our mistake-bounded online learning algorithm for learning real-valued functions. Our algorithm is Algorithm 1.

Algorithm 1
Robust Standard Optimal Algorithm, RSOA_ζ

Input: Concept class C ⊆ {f : X → [0, 1]}, target (unknown) concept c ∈ C, and ζ ∈ [0, 1].
Initialize: V_1 ← C.
for t = 1, . . . , T do
  The learner receives x_t and maintains V_t, a set of “surviving functions”. For every super-bin midpoint r ∈ Ĩ_ζ, the learner computes the set of functions V_t(r, x_t).
  The learner finds the super-bins which achieve the maximum sfat(·) dimension:
    R_t(x_t) := {argmax_{r∈Ĩ_ζ} sfat_ζ(V_t(r, x_t))} ⊆ Ĩ_ζ.
  The learner computes the mean of the set R_t(x_t), i.e., let
    ŷ_t := (1/|R_t(x_t)|) Σ_{r∈R_t(x_t)} r.
  The learner outputs ŷ_t and receives feedback ĉ(x_t).
  The learner makes the update V_{t+1} ← {g ∈ V_t : g(x_t) ∈ B(ζ, ĉ(x_t))}.
end for
Outputs: The intermediate predictions ŷ_t for t ∈ [T], and a final prediction function/hypothesis given by f(x) := R_{T+1}(x).

We first provide some intuition about this algorithm. At round t, the set of functions that has ‘survived’ all previous rounds is V_t: in particular, V_t consists of functions which are consistent with the feedback received in the previous t − 1 rounds, i.e., if x_1, . . . , x_{t−1} were presented to the learner previously, then, for every g ∈ V_t, g(x_i) ∈ B(ζ, ĉ(x_i)) for i ∈ [t − 1]. Observe that V_t either stays the same as V_{t−1} or shrinks at every round. At round t, once the learner receives x_t, it always replies with a ŷ_t that is either ζ-close to the true c(x_t) or else aims to reduce V_t as much as possible. In particular, for every super-bin midpoint r ∈ Ĩ_ζ, the learner identifies the subset of surviving functions that map to that super-bin at x_t, i.e., the f ∈ V_t that satisfy f(x_t) ∈ B(2ζ, r). This forms the set V_t(r, x_t).
The learner then computes sfat_ζ of the set of functions V_t(r, x_t), picks out the super-bins r ∈ Ĩ_ζ that maximize this combinatorial quantity, and outputs the mean of their midpoints as the prediction ŷ_t. Intuitively, the parameter sfat(·) serves as a surrogate metric for the number of functions mapping to a certain interval. Using sfat(·) to define the prediction rule thus maximizes the number of eliminated functions for every mistake of the learner. Once it receives the feedback ĉ(x_t), the learner updates V_t to V_{t+1}, and this process repeats for T steps. We now list a few properties of this algorithm.

Lemma 3.3. RSOA_ζ (denoted RSOA) has the following properties:

1. ζ-consistency: at the t-th iteration, every f ∈ V_t satisfies |f(x_i) − ĉ(x_i)| ≤ ζ for i ∈ [t − 1].
2. Correctness: the target function c is never eliminated, i.e., c ∈ V_t for every t ∈ [T].
3. For every t ∈ [T] and x ∈ X, any pair of points r, r′ ∈ Ĩ_ζ for which

sfat_ζ(V_t(r, x)) = sfat_ζ(V_t(r′, x)) = sfat_ζ(V_t) (9)

also satisfies |r − r′| < ζ. Additionally, for all r ∈ Ĩ_ζ, sfat_ζ(V_t(r, x)) ≤ sfat_ζ(V_t).
4. RSOA is deterministic, i.e., for the same sequence of inputs (x_1, ĉ(x_1)), . . . , (x_T, ĉ(x_T)) provided by the adversary to the learner (each of which is followed by a response ŷ_1, . . . , ŷ_T of the learner), the RSOA algorithm produces the same function f.

Proof. The first item follows by construction. At the end of the i-th round, the following update is performed: V_{i+1} ← {g ∈ V_i : g(x_i) ∈ B(ζ, ĉ(x_i))} ⊆ V_i. This eliminates from V_{i+1} all functions g for which g(x_i) ∉ B(ζ, ĉ(x_i)); hence all functions for which |f(x_i) − ĉ(x_i)| > ζ are eliminated.

The second item follows trivially: by assumption, the true value c(x_t) is in the ζ-ball of ĉ(x_t). Thus the target concept c is never eliminated in the update V_{t+1} ← {g ∈ V_t : g(x_t) ∈ B(ζ, ĉ(x_t))}.

We now show the third item. Suppose, towards a contradiction, that there is a pair r, r′ ∈ Ĩ_ζ such that

sfat_ζ(V_t(r, x)) = sfat_ζ(V_t(r′, x)) = sfat_ζ(V_t)

and |r − r′| > ζ. Let sfat_ζ(V_t) = d. Without loss of generality, we assume r > r′. Then let s = (r + r′)/2. Clearly, for every f ∈ V_t(r, x) we have f(x) ≥ s + ζ, and for every g ∈ V_t(r′, x) we have g(x) ≤ s − ζ. This means that, given a sequential fat-shattering tree of depth d for V_t(r, x), and a tree also of depth d for V_t(r′, x), we may join them together by adding a root node with the label x and the threshold s, and this new tree of depth d + 1 is sequentially fat-shattered by V_t(r, x) ∪ V_t(r′, x), and hence by V_t (which is a superset). This contradicts the assumption that sfat_ζ(V_t) = d, because by definition of the sfat(·) dimension, d is the depth of the deepest such tree for the functions in V_t. The “additionally” part follows immediately because V_t(r, x) ⊆ V_t.

The final item of the lemma is clear because the prediction and update steps of the RSOA algorithm are deterministic and involve no randomness from the learner.

Having established these properties, we are now ready to prove our main theorem bounding the maximum number of prediction mistakes that
RSOA makes.
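Before stating the bound, the two ingredients above can be made concrete for a finite toy class. The following Python sketch is ours, not the paper's Algorithm 1: it computes sfat_ζ of a finite function class by the tree recursion implicit in Lemma 3.3 (a demo-grade threshold grid, feasible only for tiny classes), and performs the ζ-ball elimination update; all names and the toy class are hypothetical.

```python
from itertools import product

ZETA = 0.25

def sfat(fs, xs, zeta=ZETA):
    """Sequential fat-shattering dimension of the finite class fs
    (each f is a dict x -> value): sfat >= d+1 iff some pair (x, s)
    splits fs into functions lying >= s+zeta and <= s-zeta, where
    both subclasses themselves have sfat >= d."""
    best = 0
    for x in xs:
        vals = sorted({f[x] for f in fs})
        cands = {(a + b) / 2 for a in vals for b in vals}  # demo-grade threshold grid
        for s in cands:
            hi = [f for f in fs if f[x] >= s + zeta]
            lo = [f for f in fs if f[x] <= s - zeta]
            if hi and lo:
                best = max(best, 1 + min(sfat(hi, xs, zeta), sfat(lo, xs, zeta)))
    return best

def eliminate(fs, x, feedback, zeta=ZETA):
    """RSOA-style update: keep only the functions zeta-close to the feedback."""
    return [f for f in fs if abs(f[x] - feedback) <= zeta]

# Toy class: all functions from {0, 1} to {0.0, 0.5, 1.0}.
xs = [0, 1]
fs = [dict(zip(xs, v)) for v in product([0.0, 0.5, 1.0], repeat=2)]

assert sfat(fs, xs) == 2                 # both points can be zeta-shattered
survivors = eliminate(fs, 0, 0.9)        # noisy feedback for a target with c(0) = 1
assert len(survivors) == 3 and sfat(survivors, xs) == 1
```

Note how one elimination step drops sfat from 2 to 1, which is exactly the per-mistake decrement the proof of Theorem 3.4 argues for.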
Theorem 3.4 ( RSOA mistake bound) . Let
C ⊆ {f : X → [0, 1]} be a concept class and ζ > 0. In the setting of online learning with strong feedback, i.e., at every round t ∈ [T] the feedback ĉ(x_t) is ζ-close to the true value, |c(x_t) − ĉ(x_t)| ≤ ζ, RSOA_ζ (described in Algorithm 1) is such that, for every T, the algorithm makes predictions ŷ_t satisfying

   ∑_{t=1}^T I[ |ŷ_t − c(x_t)| > 5ζ ] ≤ sfat_ζ(C).

Proof. The intuition is that whenever the learner makes a mistake, enough functions are eliminated from the "surviving set" that the sfat(·) of the remaining functions decreases by 1. Since the true function c is never eliminated from V_t, and the sfat(·) dimension of a set consisting of a single function is 0, no more than sfat(·)-many mistakes can be made.

First observe that whenever the algorithm makes a mistake, i.e., |ŷ_t − c(x_t)| > 5ζ, it also follows that |ŷ_t − ĉ(x_t)| > 4ζ, because ĉ(x_t) is a ζ-approximation of c(x_t). Below we show that on every round where |ŷ_t − ĉ(x_t)| > 4ζ, we have sfat_ζ(V_{t+1}) ≤ sfat_ζ(V_t) − 1. Together with property 2 of Lemma 3.3 and the fact that V_1 = C, this already implies that no more than sfat_ζ(C) mistakes are made by RSOA.

Suppose |ŷ_t − ĉ(x_t)| > 4ζ. Fix t and x_t. Observe that by property 3, Eq. (9) (in Lemma 3.3), there are at most three super-bins whose midpoints r satisfy sfat_ζ(V_t(r, x_t)) = sfat_ζ(V_t); i.e., between 0 and 3 super-bins achieve the upper bound on sfat(·) at each round, which we now call UB_t := sfat_ζ(V_t). We now analyze each of the four cases for the number of upper-bound-achieving super-bins.

Case 1: sfat_ζ(V_t(r, x_t)) < UB_t for every r ∈ Ĩ_ζ, i.e., no super-bin achieves UB_t. Every update of V_t restricts it to the functions within some ζ-ball, B_t := B(ζ, ĉ(x_t)). Observe that B_t is entirely contained within some super-bin, call it SB (note that even if ĉ(x_t) is at the boundary of two super-bins, B_t would still be inside the super-bin in between the two, by definition of the interleaved ζ-cover). Hence sfat_ζ(B_t) ≤ sfat_ζ(SB) < UB_t, where the second inequality is by the assumption of the case.

Case 2: There exists exactly one r ∈ Ĩ_ζ such that sfat_ζ(V_t(r, x_t)) = UB_t, i.e., exactly one super-bin (centered at r = 2kζ for some k ∈ Z_+) achieves UB_t; call it SB* = [2(k − 1)ζ, 2(k + 1)ζ). Since the super-bin's midpoint is at some bin boundary, the prediction is ŷ_t = 2kζ. As in the previous case, the update step retains only the functions in some B_t := B(ζ, ĉ(x_t)). However, since |ŷ_t − ĉ(x_t)| > 4ζ, we either have ĉ(x_t) < 2(k − 2)ζ or ĉ(x_t) > 2(k + 2)ζ. B_t is therefore entirely contained within some super-bin SB ≠ SB*. Since there is only one maximizing super-bin SB*, we have sfat_ζ(B_t) ≤ sfat_ζ(SB) < sfat_ζ(SB*) = UB_t.
Case 3: There exist r_1, r_2 ∈ Ĩ_ζ such that sfat_ζ(V_t(r_1, x_t)) = sfat_ζ(V_t(r_2, x_t)) = UB_t, i.e., two super-bins (centered at r_1, r_2 respectively) achieve UB_t; call them SB*_1, SB*_2, and write B_t := B(ζ, ĉ(x_t)) for the ball retained by the update. Using Property 3 of Lemma 3.3, these two super-bins must either be touching at a boundary (hence ŷ_t = 2kζ, where SB*_1 = [2kζ, 2(k + 2)ζ) and SB*_2 = [2(k − 2)ζ, 2kζ)) or intersecting in one bin (hence ŷ_t = (2k + 1)ζ, where SB*_1 = [2kζ, 2(k + 2)ζ) and SB*_2 = [2(k − 1)ζ, 2(k + 1)ζ)). In the former case, ĉ(x_t) < 2(k − 2)ζ or ĉ(x_t) > 2(k + 2)ζ, and thus neither SB*_1 nor SB*_2 entirely contains B_t, though there is some super-bin that does. In the latter case, ĉ(x_t) < (2k − 3)ζ or ĉ(x_t) > (2k + 5)ζ, and again neither SB*_1 nor SB*_2 entirely contains B_t, though there is some super-bin that does. Identical reasoning to the previous two cases shows that the update thus decreases sfat(·) on the remaining functions.

Case 4: There exist r_1, r_2, r_3 ∈ Ĩ_ζ such that sfat_ζ(V_t(r_1, x_t)) = sfat_ζ(V_t(r_2, x_t)) = sfat_ζ(V_t(r_3, x_t)) = UB_t, i.e., three super-bins (centered at r_1, r_2, r_3 respectively) achieve UB_t; call them SB*_1, SB*_2, SB*_3. By Property 3 of Lemma 3.3, there is only one configuration these three super-bins could be in, namely two of them touching at a boundary, with the last super-bin straddling them: SB*_1 = [2kζ, 2(k + 2)ζ), SB*_2 = [2(k − 1)ζ, 2(k + 1)ζ), SB*_3 = [2(k − 2)ζ, 2kζ). Then ŷ_t = 2kζ, and ĉ(x_t) < 2(k − 2)ζ or ĉ(x_t) > 2(k + 2)ζ. None of SB*_1, SB*_2, SB*_3 entirely contains B_t, though there is some super-bin that does, and identical reasoning to the previous three cases shows that the update thus decreases sfat(·) on the remaining functions.

Theorem 3.4 says that the RSOA algorithm for a concept class C in the strong feedback model makes at most sfat_ζ(C) mistakes.
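Every case above leans on one geometric fact: whatever the feedback ĉ(x_t) is, the ball B(ζ, ĉ(x_t)) lies entirely inside some super-bin of the interleaved cover. A quick numeric sanity check of this fact, taking super-bins to be the width-4ζ intervals [2(k − 1)ζ, 2(k + 1)ζ) as in the case analysis (the helper names are ours):

```python
ZETA = 0.05
EPS = 1e-9  # slack for floating-point boundary cases

def superbins(zeta, hi=1.0):
    """Width-4*zeta intervals centered at the midpoints 2*k*zeta."""
    bins, k = [], 0
    while 2 * (k - 1) * zeta < hi:
        bins.append((2 * (k - 1) * zeta, 2 * (k + 1) * zeta))
        k += 1
    return bins

def containing_superbin(c_hat, zeta):
    """Return a super-bin entirely containing the ball B(zeta, c_hat), if any."""
    for a, b in superbins(zeta):
        if a <= c_hat - zeta + EPS and c_hat + zeta <= b + EPS:
            return (a, b)
    return None

# Every feedback value in [zeta, 1 - zeta] has its zeta-ball inside some
# super-bin, even when it sits exactly on a bin boundary.
checks = [0.001 * i for i in range(50, 951)]
assert all(containing_superbin(c, ZETA) is not None for c in checks)
```

Consecutive super-bins overlap in a full bin, which is why a ball centered on a bin boundary is still swallowed by the super-bin in between its two neighbors.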
This is also the setting in the rest of the paper, as well as in most of the real-valued online learning literature. A natural question is: can we make fewer mistakes than the RSOA algorithm does? Below we consider the weak feedback model of online learning and show that no learner can do better than making sfat(·)-many mistakes. An interesting open question is whether the lower bound in the theorem below can be improved in the strong feedback setting.

Theorem 3.5.
Let ζ ∈ [0, 1) and C ⊆ {f : X → [0, 1]}. Every online learner A (in the weak feedback setting) for the class C satisfies M_A(C) ≥ sfat_ζ(C).

Proof. We construct an adversary that can always force at least sfat_ζ(C) mistakes in the weak model of learning (where the adversary only gives two bits of feedback to the learner). To do so, the adversary traverses the ζ-fat-shattering tree starting at the root node, at every round interacting with the learner based on the information at the current node, always claiming the learner made a mistake, and then moving to one of the two daughter nodes. In particular, the interaction at a node v of the tree, which is associated with (x_v, a_v), is as follows: the adversary gives the learner the point x_v. If the learner predicts ŷ_t < a_v, the adversary claims the learner is wrong and moves to the right daughter node, thus committing to the subset of functions f ∈ C such that f(x_v) ≥ a_v + ζ. The adversary moves to the opposite node if the learner predicts ŷ_t ≥ a_v. After sfat_ζ(C) rounds, the adversary will have reached a leaf node. At this point, by the definition of the sfat(·) tree, there is at least one function consistent with all previous commitments of the adversary. This becomes the target function, which we may regard as the function the adversary had committed to in the first place. Since the depth of the tree is by definition sfat_ζ(C), the learner will have made sfat_ζ(C) mistakes by the time the adversary reaches a leaf and commits to a function.

In this section we show that online learnability of a real-valued function class implies that there exists a real-valued
DP PAC learner for the same class. More precisely, we will assume that the sfat(·) dimension of the function class is bounded (which implies its online learnability, as discussed in Section 3); we will then explicitly describe an algorithm that uses this online learner to learn in a globally-stable manner.

This, however, is only half of the implication shown in [BLM20]. There, they go one step further and turn their stable learner into an approximately DP PAC learner, concluding overall that online learning implies approximate DP PAC learning. Supposing we could prove the same for our learning model, then combining this with the implication shown in Section 5 (that pure DP PAC learning implies online learning) would make for an almost complete chain of implications: starting at pure DP PAC learning, implying online learning, and finally implying approximate DP PAC learning. However, in the second half of this section, we use an argument based on fingerprinting codes to show that the transformation in [BLM20] from a stable learner to a DP PAC learner does not work with the stability guarantees we obtain for our real-valued learning setting.

We will use the following notation throughout this section. Let C ⊆ {f : X → [0, 1]} be a concept class and c ∈ C be a target concept. Let D : X → [0, 1] be a distribution. In a slight abuse of notation, we use the notation (x, ĉ(x)) ∼ D to mean that x is drawn from the distribution D and ĉ(x) satisfies |ĉ(x) − c(x)| < ζ. Also, we write B ∼ D^m to mean that a learner receives m such examples {(x_i, ĉ(x_i))}_{i=1}^m. We say that the learner has made a mistake on input x if it has made a 5ζ-mistake (refer to Definition 2.2). Finally, because we are concerned with real-valued learning, functions in the vicinity of the target function are considered "close enough" as hypotheses, and so we will make use of the following notion of a function ball:

Definition 4.1 (Function ball of radius r around c). Given a set of functions H ⊆ {f : X → [0, 1]}, the function ball of radius r around c ∈ H is the set of all functions f ∈ H such that

   |f(x) − c(x)| < r for every x ∈ X,     (10)

and we denote such a function ball by T(r, c). Moreover, for a set of functions E = {f_1, …, f_k}, we let T(r, E) = ∪_{i=1}^k T(r, f_i).

In Section 4.1, we prove that given a mistake-bounded online learner, there exists a stable learner. In Section 4.2, we prove that stability does not, in turn, imply approximate DP learning via the transformation of [BLM20] without a domain-size dependence in the sample complexity. In Section 4.3, we turn our attention to how our results apply to learning quantum states. In this section we prove the following theorem:
Theorem 4.2.
Let α, ζ ∈ [0, 1). Let C ⊆ {f : X → [0, 1]} be a concept class with sfat_ζ(C) = d. Let D : X → [0, 1] be a distribution and let S = {(x_i, ĉ(x_i))} be a set of

   T = O( (4/ζ)^{d+1} · d ln(1/ζ) / α )

examples, where x_i ∼ D and |ĉ(x_i) − c(x_i)| < ζ for an unknown concept c ∈ C. There exists a (T, ζ^{−O(d)}, O(ζ))-stable learning algorithm G that outputs f satisfying Loss_D(f, c, O(ζ)) ≤ α.

The algorithm G is the RSOA run on a carefully tailored input distribution over the examples, with T being the overall sample complexity of our algorithm. Most of the work in the proof lies in explaining how to tailor the set of examples drawn from the original distribution D into a new set S on which RSOA is guaranteed to succeed. In this section, when we write RSOA_ζ(S), where S is a sample, i.e., S = {(x_i, ĉ(x_i))}, we mean that we feed the examples in S into RSOA sequentially, as in the online learning setting. We will prove this theorem in three parts, corresponding to the subsequent three sub-subsections:

• Our Algorithm 2 is a tailoring algorithm that defines distributions ext(D, k) for k ∈ [d] as a function of the distribution D, to which we have black-box access. Just as in [BLM20], the key idea of the tailoring is to inject examples into the sample that force mistakes; we have adapted this idea to the robust, real-valued setting. Unfortunately, this algorithm could potentially use an unbounded number of examples (in the worst case), which we handle next.

• Next, we impose a cutoff on the number of examples drawn in the algorithm above. In Lemma 4.5, we compute the expected number of examples drawn by Algorithm 2, and then use Markov's inequality to compute what the cutoff should be. The final tailoring algorithm is simply Algorithm 2, cut off when the number of examples drawn exceeds this threshold.

• Finally, we state the globally-stable learning algorithm, Algorithm 3, which essentially invokes Algorithm 2 with the cutoff defined above. In Theorem 4.6 we prove the correctness and sample complexity of Algorithm 3.

(Footnotes: The symbol T stands for "tube", since for a member of the function ball, closeness to c must be satisfied at not just a single point but all points in the domain. In RSOA_ζ(S) we omit mentioning the function class, which is usually taken to be R, the set of all functions output by RSOA; because RSOA is an improper learner, R is not the same as C.)

4.1.1 The distributions ext(D, k)

In the following, the symbol S ◦ T between two sets of examples means the concatenation of the two sets S, T. Intuitively, our learning algorithm is going to obtain T examples overall and break these examples into blocks of size m (a parameter which will be fixed later in Theorem 4.6), each block followed by a single mistake example, all of which are fed to an online learner. Additionally, below we can think of k ≤ sfat_ζ(C) as the number of mistakes we want to inject into the examples we feed to the online learner.

Algorithm 2
An algorithm to sample from the distributions ext(D, k).

Input: Distribution D : X → [0, 1], m ≥ 1, k ∈ {0, …, d}.
Output: A sample from the distribution ext(D, k).

For k ≥ 0, the distributions ext(D, k) : (X × [0, 1])^{k(m+1)} → [0, 1] are defined inductively as follows:

1. ext(D, 0): output the empty sample ∅ with probability 1.
2. Sampling from ext(D, k) involves recursively sampling from ext(D, k − 1) as follows:
   (i) Draw S^{(0)}, S^{(1)} ∼ ext(D, k − 1) and two sets of m examples B^{(0)}, B^{(1)} ∼ D^m.
   (ii) Let f_0 = RSOA_ζ(S^{(0)} ◦ B^{(0)}), f_1 = RSOA_ζ(S^{(1)} ◦ B^{(1)}).
   (iii) If |f_0(x) − f_1(x)| ≤ 11ζ for every x ∈ X, then go back to step (i).
   (iv) Else pick x' such that |f_0(x') − f_1(x')| > 11ζ and sample α ∼ I_ζ uniformly.
   (v) Let M_k := (x', α) ∈ X × [0, 1]. If |α − f_0(x')| < |α − f_1(x')|, output S^{(1)} ◦ B^{(1)} ◦ M_k; else output S^{(0)} ◦ B^{(0)} ◦ M_k.

Intuition of the algorithm.
We first explain Algorithm 2 on an intuitive level. Recall the goal: using our RSOA online learning algorithm for C, we would like to design a globally stable PAC learner for C. To this end, let D be the unknown distribution (under which we need the PAC learner to work). Algorithm 2 "tailors" a sample (fed to the online learner) as follows: in the k-th iteration it repeatedly draws pairs of batches, each consisting of a sample of (k − 1)(m + 1) examples from ext(D, k − 1) together with m fresh examples from D^m, and then decides whether to keep or discard each batch based on the outcome of running RSOA on the batches. If some batch is kept, it is appended with a single example which is guaranteed to force a mistake on RSOA, and the resulting sample S is output by the algorithm. This process of outputting S can be regarded as drawing the sample S from the distribution ext(D, k). The structure of S is illustrated in Figure 2: each B_i is a block of m examples drawn i.i.d. from D, and each M_i = (x_i, α_i) forces a mistake when S is fed to RSOA. S has k blocks and k mistake examples in total.

   S = B_1 ∼ D^m | M_1 | B_2 ∼ D^m | M_2 | ··· | B_k ∼ D^m | M_k

Figure 2: Structure of the curated sample S resulting from Algorithm 2. Each B_i is a block of m examples (x, ĉ(x)) where x ∼ D, and M_i = (x_i, α_i) is an example which forces a mistake.

We now focus on explaining steps 2(i) to 2(v), which "force a mistake". In step 2(i) we draw two samples, S^{(0)} ◦ B^{(0)} and S^{(1)} ◦ B^{(1)}. In 2(ii), we feed S^{(0)} ◦ B^{(0)} into RSOA, which returns a function f_0, and do the same for S^{(1)} ◦ B^{(1)}, returning f_1. There are now two possibilities: either f_0, f_1 are "close", or f_0 and f_1 differ significantly at some x ∈ X; step 2(iii) checks which is the case, as follows.

1. f_0, f_1 agree to within 11ζ on every point in X: then draw a new pair S^{(0)} ◦ B^{(0)} and S^{(1)} ◦ B^{(1)} afresh, going back to step 2(i).

2. |f_0(x) − f_1(x)| > 11ζ for some x ∈ X. Note that this x need not be from an example previously given to the learner. Intuitively, in this case the predictions f_0(x) and f_1(x) are so far apart at x that they cannot both be 5ζ-correct, so at least one of them is a mistake. More precisely, let b_c ∈ I_ζ be the midpoint of the bin (of width ζ) in the ζ-cover that contains c(x) (recall I_ζ = {ζ/2, 3ζ/2, …, 1 − ζ/2}). Since |f_0(x) − f_1(x)| > 11ζ, at least one of the predictions f_0(x), f_1(x) is 5ζ-far from b_c (though we do not know which it is, since we do not know c!).

Steps 2(i) to 2(iii) are repeated until we are in the second case. Note that steps 2(i) to 2(iii) could be repeated an unbounded number of times, each repetition drawing fresh examples. For the remainder of this section, we assume that steps 2(i) to 2(iii) terminate eventually, so that we may argue about the final output sample.
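To make the recursion concrete, here is a toy, runnable sketch of the sampling procedure. It is our own simplification, not the paper's construction: the RSOA learner is replaced by a memorizing stand-in, the domain and target concept are hypothetical, and a cutoff on retries already anticipates Section 4.1.2. Only the block-plus-forced-mistake structure of the output is meant to be faithful.

```python
import random

rng = random.Random(0)
ZETA = 0.01
X = [0, 1, 2]                          # toy domain
c = {0: 0.0, 1: 1.0, 2: 0.5}          # hypothetical target concept
I_zeta = [ZETA / 2 + i * ZETA for i in range(int(1 / ZETA))]  # bin midpoints

def learner(examples):
    """Stand-in for RSOA: memorize labels, predict 0.5 on unseen points."""
    table = {x: y for x, y in examples}
    return lambda x: table.get(x, 0.5)

def draw_block(m):
    return [(x, c[x]) for x in (rng.choice(X) for _ in range(m))]

def ext(k, m, max_tries=1000):
    """Sample from ext(D, k): k blocks of m examples, each followed by one
    forced-mistake example (steps 2(i)-(v) of Algorithm 2, with a cutoff)."""
    if k == 0:
        return []
    for _ in range(max_tries):
        s0, s1 = ext(k - 1, m), ext(k - 1, m)          # step (i), recursive part
        if s0 is None or s1 is None:
            return None
        b0, b1 = draw_block(m), draw_block(m)          # step (i), fresh blocks
        f0, f1 = learner(s0 + b0), learner(s1 + b1)    # step (ii)
        bad = [x for x in X if abs(f0(x) - f1(x)) > 11 * ZETA]
        if not bad:                                    # step (iii): retry
            continue
        xp = bad[0]                                    # step (iv)
        alpha = rng.choice(I_zeta)
        mk = (xp, alpha)                               # step (v): keep the batch
        far1 = abs(alpha - f1(xp)) > abs(alpha - f0(xp))   # whose learner is farther
        return (s1 + b1 + [mk]) if far1 else (s0 + b0 + [mk])
    return None  # cutoff reached

sample = ext(2, 3)
assert sample is not None and len(sample) == 2 * (3 + 1)
```

The length check reflects the domain of ext(D, k): a sample has exactly k(m + 1) examples, namely k blocks of m plus k mistake examples.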
In Section 4.1.2, we show that it suffices to impose a cutoff of T examples so that, with high probability, the algorithm (with an appropriate value of k) terminates before drawing T-many examples.

In order to create M_k, we uniformly draw some α ∼ I_ζ (the set of all possible bin midpoints), which means α = b_c with probability ζ. If α = b_c, we are guaranteed that f_i is a mistake for i := arg max_i |α − f_i(x)|. Therefore, we concatenate our mistake example with S^{(i)} ◦ B^{(i)}, eventually outputting S := S^{(i)} ◦ B^{(i)} ◦ (x, α) as the output of Algorithm 2. By the end of these steps, we will have a sample S' ◦ B' ◦ M_k, where S' ∼ ext(D, k − 1), B' ∼ D^m, and M_k is a single "mistake" example with the following two properties: (i) M_k = (x', α) is a valid example (i.e., |α − c(x')| ≤ ζ); (ii) if RSOA is fed S' ◦ B' ◦ M_k, then RSOA will make a mistake upon seeing the example M_k, i.e., at the round corresponding to M_k, RSOA predicts ŷ such that |ŷ − c(x')| > 5ζ. (Note that this step crucially differs from [BLM20]: for them the true value of f_0(x) or f_1(x) is always 0 or 1, so they can flip a coin and force a mistake with probability at least 1/2.)

Key Lemma. We now prove our key lemma on global stability. Let R be the set of all possible functions that could be output by the RSOA algorithm when run for arbitrarily many rounds.

Lemma 4.3 (Some function ball is output by RSOA with high probability). Let sfat_ζ(C) = d. There exists k ≤ d and some f ∈ R such that

   Pr_{S ∼ ext(D,k), B ∼ D^m} [ RSOA_ζ(S ◦ B) ∈ T(5ζ, f) ] ≥ ζ^d.     (11)

Proof.
Towards a contradiction, suppose that for every k ≤ d and f ∈ R we have

   Pr_{S ∼ ext(D,k), B ∼ D^m} [ RSOA_ζ(S ◦ B) ∈ T(5ζ, f) ] < ζ^d.     (12)

In particular, Eq. (12) holds for k = d and f = c, where c is the target concept.

In step 2(iv), Algorithm 2 picks α uniformly from the set of midpoints in I_ζ. Call a mistake example (x, α) "valid" if |α − c(x)| ≤ ζ. Notice that there are actually two midpoints in I_ζ which are less than ζ away from any c(x); hence the probability that a mistake example is valid is 2ζ > ζ, and the probability that all d mistake examples are valid is at least ζ^d. In the event that all mistake examples are valid, S is a valid sample. Since S contains d mistake examples, and Theorem 3.4 guarantees that RSOA_ζ on a valid sample always outputs some hypothesis function in T(5ζ, c) after making d mistakes, we get Pr_{S ∼ ext(D,d), B ∼ D^m}[RSOA_ζ(S ◦ B) ∈ T(5ζ, c)] ≥ ζ^d, contradicting Eq. (12).

Lemma 4.4 (Generalization). Let ext(D, ℓ) be such that ℓ ≥ 0 and there exists f such that

   Pr_{S ∼ ext(D,ℓ), B ∼ D^m} [ RSOA_ζ(S ◦ B) ∈ T(5ζ, f) ] ≥ ζ^d.     (13)

(The above property is the analog of the distribution ext(D, ℓ) being "well-defined" in [BLM20].) Then every f satisfying Eq. (13) also satisfies Loss_D(f, c, 6ζ) ≤ d ln(1/ζ)/m.

Proof. Let S ∼ ext(D, ℓ) and B ∼ D^m. Suppose RSOA_ζ(S ◦ B) outputs a function f' ∈ T(5ζ, f). For f' ∈ R, let E_{f'} be the event that RSOA_ζ(S ◦ B) outputs f'. Then observe that

   Pr_{S ∼ ext(D,ℓ), B ∼ D^m} [ RSOA_ζ(S ◦ B) ∈ T(5ζ, f) ]
     = ∑_{f' ∈ T(5ζ, f)} Pr_{S ∼ ext(D,ℓ), B ∼ D^m} [ E_{f'} ]
     ≤ ∑_{f' ∈ T(5ζ, f)} Pr_{S ∼ ext(D,ℓ), B ∼ D^m} [ B is ζ-consistent with f' ]
     ≤ Pr_{S ∼ ext(D,ℓ), B ∼ D^m} [ B is 6ζ-consistent with f ],     (14)

where the first inequality follows by combining two observations:

1. Since B is a subset of the examples fed to RSOA_ζ, by Property 1 in Lemma 3.3, if RSOA_ζ(S ◦ B) outputs f', then f' is ζ-consistent with all m examples in B;

2. By Property 4 of Lemma 3.3 (for a fixed sample, no two different functions can be output by RSOA), the events {E_{f'}}_{f' ∈ R} are disjoint on the sample space;

and the last inequality uses that f' is in a 5ζ-ball of f, hence f is ζ + 5ζ = 6ζ-consistent with B. Recall that Eq. (13) shows that the LHS of Eq. (14) is lower-bounded by ζ^d. If we define Loss_D(f, c, 6ζ) := α, then by the definition of loss, since B is a sample of m i.i.d. examples drawn from D, the RHS of the inequality above is at most (1 − α)^m. Putting together the lower and upper bounds, ζ^d ≤ (1 − α)^m ≤ e^{−αm}, which proves the lemma statement.

4.1.2 A Monte Carlo version of the tailoring algorithm

Algorithm 2, described in the previous section, could potentially run steps (i) to (iii) forever; a priori it is not clear why this algorithm terminates. In this section, we compute the expected number of examples drawn by Algorithm 2 and use Markov's inequality to define a "stopping criterion" (a sample complexity cutoff), so that the algorithm eventually stops after drawing a certain number of examples. The reason the number of examples drawn is a random variable is that steps 2(i) to 2(iii) of Algorithm 2 must be repeated until there is one round where f_0, f_1 are distance more than 11ζ apart, i.e., there exists x ∈ X satisfying |f_0(x) − f_1(x)| > 11ζ.

Lemma 4.5 (Expected number of examples drawn in steps 2(i) to 2(iii)). Let ζ ∈ [0, 1/2) and let k* be the smallest value (guaranteed to exist by Lemma 4.3) for which

   Pr_{S ∼ ext(D,k*), B ∼ D^m} [ RSOA_ζ(S ◦ B) ∈ T(11ζ, f) ] ≥ ζ^d     (15)

holds for some f ∈ R. Let ℓ ≤ k* and let M_ℓ denote the number of examples drawn from D in order to generate a sample S ∼ ext(D, ℓ).
Then

   E[M_ℓ] ≤ 4^{ℓ+1} · m,

where the expectation is taken over the random sampling process in Algorithm 2.

Proof. Because we have chosen k* to be the smallest value for which Eq. (15) is true, for every ℓ' < k* and f ∈ R we have

   Pr_{S ∼ ext(D,ℓ'), B ∼ D^m} [ RSOA_ζ(S ◦ B) ∈ T(11ζ, f) ] < ζ^d,

which is equivalent to

   Pr_{S ∼ ext(D,ℓ'), B ∼ D^m} [ RSOA_ζ(S ◦ B) ∉ T(11ζ, f) ] ≥ 1 − ζ^d.

Now consider sampling from ext(D, ℓ) with 0 ≤ ℓ ≤ k*. Call each round of 2(i) to 2(iii) "successful" if it results in f_0, f_1 such that |f_0(x) − f_1(x)| > 11ζ for some x; upon success, the algorithm proceeds to step 2(iv). Let the probability of success in the ℓ-th round be θ. Then one can express θ as follows:

   θ = ∑_{f ∈ R} Pr_{S ∼ ext(D,ℓ−1), B ∼ D^m} [ RSOA(S ◦ B) = f ] · Pr_{S ∼ ext(D,ℓ−1), B ∼ D^m} [ RSOA(S ◦ B) = f_1, f_1 ∉ T(11ζ, f) ]
     ≥ (1 − ζ^d) ∑_{f ∈ R} Pr_{S ∼ ext(D,ℓ−1), B ∼ D^m} [ RSOA(S ◦ B) = f ] = 1 − ζ^d,

where the first equality is because "success" is defined as |f_0(x) − f_1(x)| > 11ζ at some x, equivalently f_1 ∉ T(11ζ, f_0), and the inequality uses the display above (valid since ℓ − 1 < k*).

Furthermore, sampling from ext(D, ℓ) involves sampling from ext(D, ℓ − 1), …, ext(D, 0), so the number of examples drawn for ext(D, ℓ), M_ℓ, is a function of M_{ℓ−1}, …, M_0. Let M_ℓ^{(j)} be the number of examples drawn during the j-th attempt at sampling from the distribution ext(D, ℓ), and write M_ℓ = ∑_{j=1}^∞ M_ℓ^{(j)}. While sampling from the distribution ext(D, ℓ), if we succeed prior to the j-th attempt, then M_ℓ^{(j)} = 0; otherwise, if the first j − 1 attempts failed, the j-th attempt draws two samples from ext(D, ℓ − 1) and two sets of m examples from D^m. Therefore, we may define the recursive equation

   E[M_ℓ^{(j)}] = (1 − θ)^{j−1} · (2 E[M_{ℓ−1}] + 2m),     (16)

since each attempt involves drawing two samples from ext(D, ℓ − 1) and two sets of examples from D^m, and the probability that the first j − 1 attempts fail is (1 − θ)^{j−1}. Therefore, we have

   E[M_ℓ] = ∑_j E[M_ℓ^{(j)}] = ∑_{j=1}^∞ (1 − θ)^{j−1} · (2 E[M_{ℓ−1}] + 2m)
          = (1/θ) · (2 E[M_{ℓ−1}] + 2m)
          ≤ (1/(1 − ζ^d)) · (2 E[M_{ℓ−1}] + 2m)
          ≤ 4 · (E[M_{ℓ−1}] + m),     (17)

where we have used the fact that ζ < 1/2 (so ζ^d ≤ 1/2). Noting that E[M_0] = 0 and using induction on Eq. (17) gives us the lemma statement.

Putting together these pieces, we now prove our main theorem.
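(As a quick aside, the induction closing Lemma 4.5 can be checked numerically: iterating Eq. (17) with θ at its worst-case lower bound 1 − ζ^d stays below the claimed bound 4^{ℓ+1}·m. The parameter values below are illustrative choices of ours, not the paper's.)

```python
# Check E[M_l] = (1/theta) * (2*E[M_{l-1}] + 2m) <= 4**(l+1) * m
# for illustrative parameters, with theta at its lower bound 1 - zeta**d.
m, d, zeta = 10, 5, 0.4
theta_lb = 1 - zeta ** d          # worst-case per-round success probability
E = 0.0                           # E[M_0] = 0
for l in range(1, d + 1):
    E = (2 * E + 2 * m) / theta_lb    # Eq. (17) with theta = theta_lb
    assert E <= 4 ** (l + 1) * m      # the bound of Lemma 4.5
```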
Theorem 4.6 (Globally stable learner from online learner). Let α > 0. Let C ⊆ {f : X → [0, 1]} be a concept class with sfat_ζ(C) = d, and let c ∈ C be the target concept. Let

   T = (2 · (4/ζ)^{d+1} + 1) · d ln(1/ζ)/α.

Let D : X → [0, 1] be a distribution. There exists a randomized algorithm G : (X × [0, 1])^T → [0, 1]^X that satisfies the following: given T-many examples S = {(x_i, ĉ(x_i))} where x_i ∼ D, there exists a hypothesis f such that

   Pr[ G(S) ∈ T(11ζ, f) ] ≥ ζ^d / (2(d + 1))   and   Loss_D(f, c, 12ζ) ≤ α.     (18)
Proof. The algorithm G in the theorem statement is exactly the algorithm we defined in the previous two sections, with a cutoff at T examples.

Algorithm 3: Final globally-stable algorithm G to learn a concept class C ⊆ {f : X → [0, 1]}.

1. Draw k ∈ {0, 1, …, d} uniformly at random.
2. Let ext(D, k) be the distribution described in Algorithm 2, but additionally imposing a cutoff T on the sample complexity (i.e., we output "fail" if the number of examples drawn in sampling from ext(D, k) ever exceeds T), where the auxiliary sample size is set to m = d ln(1/ζ)/α and the cutoff to T = 2 · (4/ζ)^{d+1} · m. Let B ∼ D^m and S ∼ ext(D, k), and output h = RSOA_ζ(S ◦ B).

Note that because we have enforced the cutoff at T examples in drawing S ∼ ext(D, k), the sample complexity of G is |S| + |B| ≤ T + m = (2 · (4/ζ)^{d+1} + 1) · d ln(1/ζ)/α, as stated in the theorem statement. Lemma 4.3 guarantees that there exist k ≤ d and f* such that Eq. (13) holds. Let k* be the smallest k such that Lemma 4.3 holds with the constant 5ζ replaced by 11ζ, so that

   Pr_{S ∼ ext(D,k*), B ∼ D^m} [ RSOA_ζ(S ◦ B) ∈ T(11ζ, f*) ] ≥ ζ^d.     (19)

Then Lemma 4.4 (with a simple modification for the new constant) implies that Loss_D(f*, c, 12ζ) ≤ d ln(1/ζ)/m ≤ α.

We now show that the probability that G outputs some function in T(11ζ, f*) is at least ζ^d / (2(d + 1)). Firstly, with probability 1/(d + 1), the k randomly drawn in step 1 is k*. Conditioned on this, we now show that with high probability the loop in steps 2(i) to 2(iii) terminates before drawing T = 2 · (4/ζ)^{d+1} · m examples:

   Pr[ M_{k*} > 2 · (4/ζ)^{d+1} · m ] ≤ Pr[ M_{k*} > 2 · ζ^{−d} · 4^{k*+1} · m ] ≤ ζ^d / 2,     (20)

where the first inequality uses k* ≤ d and the second is by Markov's inequality and Lemma 4.5. Putting together Eqs. (19) and (20), the probability that RSOA(S ◦ B) outputs a function in T(11ζ, f*) and Algorithm 2 terminates before the cutoff T is

   Pr_{S ∼ ext(D,k*), B ∼ D^m} [ RSOA(S ◦ B) ∈ T(11ζ, f*) and M_{k*} ≤ 2 · (4/ζ)^{d+1} · m ] ≥ ζ^d − ζ^d/2 = ζ^d/2.

Multiplying by the probability 1/(d + 1) of drawing k = k* yields our claim.
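Plugging concrete numbers into Theorem 4.6 makes the ζ^{−Θ(d)} costs tangible. The following arithmetic sketch (function and parameter names are ours; no learning is performed) computes the block size m, the cutoff T, the total sample complexity, and the output probability ζ^d/(2(d + 1)) for small illustrative values:

```python
from math import log

def stable_learner_params(d, zeta, alpha):
    """Illustrative arithmetic for the parameters in Theorem 4.6."""
    m = d * log(1 / zeta) / alpha          # auxiliary block size
    T = 2 * (4 / zeta) ** (d + 1) * m      # cutoff on examples drawn
    total = T + m                          # = (2*(4/zeta)**(d+1) + 1) * m
    p_out = zeta ** d / (2 * (d + 1))      # prob. of landing in the fixed ball
    return m, T, total, p_out

m, T, total, p = stable_learner_params(d=3, zeta=0.1, alpha=0.05)
assert total == T + m and 0 < p < 1
```

Even for d = 3 and ζ = 0.1 the cutoff already exceeds 10^8 examples, while the stability probability is only ζ^3/8; this steepness is what the next subsection's negative result compounds with a domain-size factor.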
4.2 Stability does not imply approximate DP (without a domain-size dependence)

In the previous section we showed that if a concept class C can be learned in the quantum online learning framework, then there exists a globally stable learner (with appropriate parameters) for C as well. This implication was first pointed out by [BLM20] for Boolean-valued C's. In fact, they went one step further and created an approximately differentially-private learner from a stable learner. In this sense, stability can be viewed as an intermediate property between online learnability and approximate differential privacy in the Boolean setting. Jung et al. [JKT20] used the same technique to show that stability implies approximate differential privacy in the multiclass learning setting as well (i.e., when the concept class to be learned maps to a discrete set {0, …, k}), but they do not show that an analogous implication holds for real-valued learning, which they mention only briefly. Note that their real-valued learning setting is less general than ours, as they assume that they receive exact feedback on each example (we discuss this at the end of this section).

A natural question is: does this result still hold in the quantum learning setting, i.e., does quantum stability imply quantum differential privacy? In this section, we show that the [BLM20] method for showing this implication for Boolean functions (which held up in the case of learning multiclass functions) fails for learning real-valued functions with imprecise feedback. Unlike in the former two cases, the transformation from a stable learner to an approximate DP learner necessarily incurs a domain-size dependence in the sample complexity. This is undesirable because, when X is a real interval or is unbounded, this quantity could be infinite. (For simplicity of notation, we assume cd/α is an integer; if not, one can set m = ⌈cd/α⌉.)
4.2.1 Sample complexity of the stability-to-privacy transformation

In the Boolean setting, [BLM20] showed that one could use the stable histograms algorithm [BNS19] and the Generic Private Learner of [KLN+11] to transform a globally stable learner into an approximate-DP learner, with a sample complexity that depends only on Ldim(C) and the privacy and accuracy parameters of the stable learner, but not on the domain size of the function class. We now show that this technique cannot possibly yield a domain size-independent sample complexity for quantum learning.

Our stable learner G has the following guarantee (given in Theorem 4.6): there exists some function ball (around the target concept) such that the collective probability of G outputting its member functions is high. Contrast this with the global stability guarantee for learning Boolean functions [BLM20], which says that G outputs some fixed function with high probability. The stability guarantees differ because, in our setting, the learner only obtains ε-accurate feedback from the adversary. Hence the learner cannot uniquely identify the target concept c, since all functions in the ε-ball of c are consistent with the adversary's feedback, and we thus allow the learner to output any function in the ε-ball around the target concept. However, this difference critically prevents us from using the [BLM20] technique to transform a stable learner into a private learner in the quantum case. We sketch this argument below; it relies on ideas from classical fingerprinting codes [BUV18] (which were also used earlier by Aaronson and Rothblum [AR19] to give lower bounds on gentle shadow tomography).

[BLM20]'s transformation from stable learner to private learner, applied to our setting, would be as follows: generate a list of functions in C by running the stable learner G(S) of Theorem 4.6 n many times, each run outputting a single f_i ∈ C. By Theorem 4.6 and a Chernoff bound, one can show that with high probability, an η-fraction of the list lies in T(ζ, f*) for some f*, where η is the stability parameter of Theorem 4.6. Next one would like to privately output some function in T(ζ, f*). We rewrite this task as follows.

Problem 4.7 (Query release for function balls). Given a list of n functions {f_i : X → R}_{i∈[n]}, an η-fraction of which are in T(ζ, f*) for some f* : X → R, output some function g ∈ T(ζ, f*).

We could also consider the following problem of clique identification on a discrete domain.
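Before stating it, note that membership in a function ball T(ζ, f*) over a finite domain is just a sup-norm check on tabulated values. A minimal sketch (the functions, domain size, and names here are invented for illustration):

```python
def in_function_ball(g, f_star, zeta, domain):
    """g is in T(zeta, f_star): |g(x) - f_star(x)| <= zeta on every domain point."""
    return all(abs(g(x) - f_star(x)) <= zeta for x in domain)

# Hypothetical instance of Problem 4.7's promise: four functions on domain
# [d], three clustered around a common centre f_star and one outlier.
d = 4
f_star = lambda x: 0.5
fns = [lambda x, s=s: 0.5 + s for s in (0.0, 0.01, -0.01, 0.4)]
close = [f for f in fns if in_function_ball(f, f_star, zeta=0.05, domain=range(d))]
```

Tabulating each function as the vector [f(1), . . . , f(d)] is exactly how the reduction below converts function-ball instances into clique-identification instances.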
Problem 4.8 (Clique identification on a discrete domain). Clique identification is the following problem: given a symmetric, reflexive relation R ⊆ Y × Y and a dataset D ∈ Y^n under the promise that (x, y) ∈ R for every x, y ∈ D, find any point z ∈ Y such that (x, z) ∈ R for every x ∈ D. Clique identification on a discrete domain is clique identification with Y = [4]^d and R = {(x, y) ∈ Y × Y : ‖x − y‖_∞ ≤ 1}.

Problem 4.8 is a special case of Problem 4.7: choose the functions f to be of the form f : [d] → [4], set η = 1 and ζ = 1/2, and let D consist of the n vectors [f_i(1), . . . , f_i(d)], i ∈ [n]. Hence, any DP algorithm for query release for function balls is also a DP algorithm for clique identification on a discrete domain. However, we claim the following:

Claim 4.9. For δ < 1/(1500n), any (1, δ)-DP algorithm solving Problem 4.8 with probability at least 1499/1500 requires n ≥ Ω̃(√d).

We will prove the claim later, but we first explain why it implies a necessary domain-size dependence in the transformation we hope to achieve. (This argument was communicated to us by Mark Bun [BJKT21]. It is not hard to modify the proof so as to allow an ε privacy parameter.) Noting that d = |X| in the translation from Problem 4.7 to Problem 4.8, Claim 4.9 implies that any (1, δ)-DP algorithm for Problem 4.7 requires n ≥ Ω̃(√|X|). Hence, any algorithm that converts the stable real-valued learner G of Theorem 4.6 into an approximate-DP learner by solving Problem 4.7 must run the stable learner n many times, each run consuming T examples. Hence the total number of examples needed is

Ω̃( √|X| · ((4/ζ)^{d+1} + 1) · d ln(1/ζ) / α ).   (22)

In particular, this lower bound is also optimal for query release up to poly-logarithmic factors, i.e., using Õ(√|X|) examples one can solve Problem 4.7 using the Private Multiplicative Weights method of Hardt and Rothblum [HR10] (as also referenced in the work of Bun et al. [BUV18]).

To prove Claim 4.9, we first need to define weakly-robust fingerprinting codes (first introduced by Boneh and Shaw [BS98], then developed in [BUV18]).

Definition 4.10. An (n, d)-fingerprinting code with security s and robustness r is a pair of random variables (G, T), where G ∈ {0, 1}^{n×d} and T : {0, 1}^d → 2^{[n]}, that satisfy the following. We say that a column j ∈ [d] is marked if there exists b ∈ {0, 1} such that G_{i,j} = b for all i ∈ [n]. Similarly, we say a string w ∈ {0, 1}^d is feasible for G if, for at least a 1 − r fraction of the marked columns j of G, the entry w_j agrees with the common value in that column. The code must satisfy the properties of completeness and soundness, as follows:

Completeness. For every A : {0, 1}^{n×d} → {0, 1}^d,
Pr_{w←A(G)}[w is feasible for G and T(w) = ∅] ≤ s.

Soundness. For every i ∈ [n] and every algorithm A : {0, 1}^{(n−1)×d} → {0, 1}^d, we have
Pr_{w←A(G_{−i})}[T(w) ∋ i] ≤ s.

We also need the following result on explicit constructions of fingerprinting codes.
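Before stating it, the marked-column and feasibility conditions of Definition 4.10 can be made concrete in a few lines (a toy codebook, not a cryptographic construction):

```python
def marked_columns(G):
    """Columns of the 0/1 codebook G on which all rows agree, with their common bit."""
    n, d = len(G), len(G[0])
    return {j: G[0][j] for j in range(d)
            if all(G[i][j] == G[0][j] for i in range(n))}

def is_feasible(w, G, r):
    """w agrees with the common bit on >= a (1 - r) fraction of marked columns."""
    marked = marked_columns(G)
    if not marked:
        return True
    agree = sum(1 for j, b in marked.items() if w[j] == b)
    return agree >= (1 - r) * len(marked)

G = [[0, 1, 1, 0],
     [0, 0, 1, 1]]   # columns 0 and 2 are marked, with common bits 0 and 1
```

An adversary that simply copies any single row is always feasible even for r = 0; soundness and completeness then constrain how often the tracing map T may output nothing on a feasible string or falsely accuse an absent user.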
Theorem 4.11 ([Tar08]). For every s ∈ (0, 1], there exists an (n, d)-fingerprinting code with security s and robustness r = 1/25, with d = Õ(n² log(1/s)).

With this we now prove our main claim.

Proof of Claim 4.9. The idea is to construct, from any (ε = 1, δ)-DP clique identification algorithm with δ < 1/(1500n) and success probability at least 1499/1500, an adversary A : {0, 1}^{n×d} → {0, 1}^d for any (n, d)-fingerprinting code with robustness 1/25, such that the code cannot be 1/(1500n)-secure against the adversary. However, because Theorem 4.11 guarantees the existence of a sound and complete (n, d)-fingerprinting code with (s = 1/(1500n), r = 1/25) as soon as d = Õ(n² log n), the claimed clique identification algorithm M must have n ≥ Ω̃(√d). We now go into more detail about how to construct the adversary.

Let M be the alleged DP algorithm for clique identification, and let G ∈ {0, 1}^{n×d} be the codebook of the fingerprinting code. If we regard each row of G as a point in Y = [4]^d (identifying bit 0 with the symbol 2 and bit 1 with the symbol 3), then taking D to be the set of all rows of G, D fulfils the promise of Problem 4.8. The adversary A is then constructed out of M as follows: on input D, run M(D), producing a string w ∈ [4]^d. Return the string w′ where w′_i = 2 if w_i ∈ {1, 2} and w′_i = 3 if w_i ∈ {3, 4} (equivalently, a string in {0, 1}^d under the same identification). A proof by contradiction, which we omit, shows that the string w′ produced in this manner is feasible for the fingerprinting code with probability at least 2/3. By completeness of the code,

Pr[T(A(D)) ∈ [n]] ≥ 2/3 − s ≥ 1/2.

In particular, there exists some i* ∈ [n] such that Pr[T(A(D)) = i*] ≥ 1/(2n). Now by differential privacy,

Pr[T(A(D_{−i*})) = i*] ≥ e^{−ε} (Pr[T(A(D)) = i*] − δ) ≥ e^{−1} (1/(2n) − 1/(1500n)) ≥ 1/(6n).

This contradicts the soundness of the code, since s = 1/(1500n) < 1/(6n).

4.2.2 A quadratically worse upper bound on the sample complexity of privacy
The previous section showed that going from a stable learner to a private learner of real-valued function classes must incur a sample complexity at least the square root of the domain size. We now show how to obtain a pure-DP learning algorithm for real-valued function classes over a finite domain (with no need for the stability intermediate step) that needs at most linear-in-|X| many examples, which is quadratically worse than the lower bound. This was also pointed out in the Appendix of [JKT20]. The private algorithm that accomplishes this is the Generic Private Learner of [KLN+11]: given a finite hypothesis class containing a hypothesis of loss at most α with respect to some unknown distribution and target concept, by adding Laplace noise one can privately output, with high probability, a hypothesis with loss at most 2α with respect to the unknown target concept and distribution.

Lemma 4.12 (Generic Private Learner [KLN+11, BLM20]). Let H ⊆ {h : X → [0, 1]} be a set of hypotheses. For

m = O( log|H| / (αε) )

there exists an (ε, 0)-differentially private generic learner GL : (X × [0, 1])^m → H such that the following holds. Let D be a distribution over X, let c : X → [0, 1] be a target function, let ζ be a distance parameter, and let h* ∈ H be such that Loss_D(h*, c, ζ) ≤ α. Then on input S ∼ D^m, algorithm GL outputs, with probability at least 2/3, a hypothesis ĥ ∈ H such that Loss_D(ĥ, c, ζ) ≤ 2α.

For every real-valued function class C, one can discretize the range [0, 1] of every h : X → [0, 1] into bins of size ζ. This yields a discretized function class H with at most (1/ζ)^{|X|} functions. Plugging this bound into the lemma above, we obtain a private learner with sample complexity

m = O( |X| log(1/ζ) / (αε) ).   (23)

4.3 Implications for quantum learning

We now turn to the quantum implications of the results in the previous sections. While we have stated all our results for the case of learning real-valued functions with imprecise adversarial feedback, we now expressly translate them to the setting of learning quantum states. Recall that, as stated in Section 2, in quantum learning we are given U, a class of n-qubit quantum states from which the state to be learned is drawn; M, a set of 2-outcome measurements; and a distribution D over M. Our results apply to quantum learning by associating, to every ρ ∈ U, the real-valued function c_ρ : M → [0, 1] defined as c_ρ(M) = Tr(Mρ) ∈ [0, 1] for every M ∈ M, and taking the function class to be C_U = {c_ρ}_{ρ∈U}. Section 4.1 implies that given a C_U with bounded sfat dimension, a stable learner for C_U also exists. To translate this result into the quantum learning setting, we define quantum stability as follows:

Definition 4.13 (Quantum stability). A quantum learning algorithm A : (M × [0, 1])^T → U is (T, ε, η)-stable with respect to a distribution D over M if, given T many labelled examples S = {(E_i, y_i)}_{i∈[T]} where |Tr(ρE_i) − y_i| < ζ, there exists a state σ such that

Pr[A(S) ∈ B_M(ε, σ)] ≥ η,   (24)

where the probability is taken over the examples in S and B_M(ε, σ) := {ρ : |Tr(Eρ) − Tr(Eσ)| ≤ ε for every E ∈ M}, that is to say, the ball of states within distance ε of σ on M. (To be precise, D can be viewed as a distribution over {(E_i, I − E_i)}_i where {E_i}_i is an orthogonal basis for the space of operators on n qubits satisfying ‖E_i‖ ≤ 1.)

In other words, quantum stability means that, up to an ε-distance on the measurements in M, there is some σ that is output by A with "high" (at least η) probability. The quantum version of Theorem 4.6 is then the following:

Theorem 4.14 (Quantum-stable learner from online learner). Let U be a class of quantum states with sfat_ζ(C_U) = d, let M be a set of orthonormal 2-outcome measurements, and let D be a distribution over M. There exists an algorithm G : (M × [0, 1])^T → U that satisfies the following: for every ρ ∈ U, given

T = ((4/ζ)^{d+1} + 1) · d ln(1/ζ) / α

many labelled examples S = {(E_i, y_i)}_{i∈[T]} where |Tr(ρE_i) − y_i| < ζ and E_i ∼ D, there exists a σ such that

Pr_{S∼D^T}[G(S) ∈ B_M(11ζ, σ)] ≥ ζ^d/(d + 1)   and   Pr_{E∼D}[|Tr(ρE) − Tr(σE)| ≤ ζ] ≥ 1 − α.

Namely, G is (T, 11ζ, ζ^d/(d + 1))-stable and, furthermore, the state σ has loss at most α.
Section 4.2.1 now gives a no-go result for going from the above-mentioned quantum-stable learner to an approximate-DP one: it shows that the technique of [BLM20] for converting a stable learner into a private one necessarily incurs a domain-size dependence in the sample complexity.

We say a few words about the implications of this for quantum learning. As explained earlier, it is often of most interest to choose M to be some orthogonal set of measurements. If, say, we choose it to be the orthogonal basis of n-qubit Paulis, then |M| = 4^n, and so Equation (22) implies that one needs sample complexity Ω̃(4^{n/2}) in order to go from stability to approximate differential privacy, whereas Equation (23) implies that even without stability there exists a simple (pure) private learner for C_U whose sample complexity is Õ(4^n), which is quadratically worse.

5 Pure DP PAC learning implies online learning

In this section we prove the converse direction of the implication shown in the previous section, namely that DP PAC learnability of a concept class C implies online learnability of C. More precisely, we show that the sample complexity of pure DP PAC learning C is linearly related to the sfat(·) dimension of C. Combining this with Theorem 3.4 shows that learnability of C in the pure DP PAC setting implies online learnability of C in the strong feedback setting. The implications we will show are summarized in the diagram below:

Figure 3: Pure DP PAC → Representation dimension (Lem 5.1) → One-way CC (Lem 5.2) → Sequential fat-shattering dimension (Lem 5.3) → Online learning (Thm 3.4). The sample complexity of pure DP PAC upper-bounds sfat(·).

This section is organized as follows. In Section 5.1 we show that the sample complexity of pure DP PAC learning is linearly related to the communication complexity of one-way public-coin communication. As shown in Figure 3, the link between these two notions goes through representation dimension. In Section 5.2 we show that one-way communication complexity is, in turn, characterized by sfat(·). Additionally, we know from Theorem 3.4 that this combinatorial dimension upper-bounds the mistake bound of online learning C, and this completes the chain of implications shown in Figure 3.

5.1 DP PAC implies one-way communication

In this section we prove that the sample complexity of pure DP PAC learning upper-bounds the one-way communication complexity of a concept class C.

5.1.1 PPAC learning and PRdim
We start by relating the sample complexity of differentially-private PAC (PPAC) learning (see Definition 2.5) of a concept class C to the probabilistic representation dimension of C. As in the previous section, we use the shorthand S ∼ D^m to mean that the sample S is of the form {(x_i, ĉ(x_i))}_{i=1}^m where each x_i ∼ D and, for all i, ĉ(x_i) satisfies |ĉ(x_i) − c(x_i)| < ζ/2.

Lemma 5.1 (Sample complexity of (ζ, α, ε, 0)-PPAC learning and PRdim). Let α < 1/8. Suppose there exists an algorithm A that (ζ, α, ε, 0)-PPAC learns a real-valued concept class C ⊆ {f : X → [0, 1]} with sample size m. Then there exists a set of concept classes H and a distribution P over their indices such that (H, P) (ζ, 1/4, 1/4)-probabilistically represents C, with size(H) = O(mεα). This implies that the sample complexity of (ζ, α, ε, 0)-PPAC learning C is

Ω( PRdim_{ζ,1/4,1/4}(C) / (αε) ).   (25)

Proof. Our proof extends the work of Beimel et al. [BNS13] to the case of robust real-valued PAC learning. We assume we are given a (ζ, α, ε, 0)-PPAC learner A of C that outputs some function in a hypothesis class F, with sample complexity m. The PAC guarantees hold whenever the feedback is a ζ/2-accurate estimate of c(x_i), so for the rest of this proof we fix the examples (x_i, ĉ(x_i)) to have feedback of the form ĉ(x_i) := ⌊c(x_i)⌋_{ζ/2}, where ⌊·⌋_{ζ/2} denotes rounding to the nearest point in I_{ζ/2}. For every target concept c ∈ C and distribution D on the input space X, define the following subset of F:

G^α_{D,ζ} = {h ∈ F : Loss_D(h, c, ζ) ≤ α},   (26)

where Loss_D(h, c, ζ) := Pr_{x∼D}[|h(x) − c(x)| > ζ], so G^α_{D,ζ} may be interpreted as a set of probably-ζ-consistent hypotheses in F. In [BNS13] it is shown that for every distribution D there exists another distribution D̃ on the input space, defined as

D̃(x) = { 1 − 4α + 4α·D(0) if x = 0;  4α·D(x) if x ≠ 0 }   (27)

(where 0 is some arbitrary point in the domain), which has the property

Pr_{S∼D̃^m, A}[A(S) ∈ G^{1/4}_{D,ζ}] ≥ 3/4,   (28)

where A(S) means that A is fed the sample S. The property in Eq. (28) follows from the fact that Pr_D̃[x] ≥ 4α·Pr_D[x] for all x ∈ X by Eq. (27), which implies G^α_{D̃,ζ} ⊆ G^{1/4}_{D,ζ}, together with the assumption that A is (ζ, α)-PAC, which can be rewritten as Pr_{D̃,A}[A(S) ∈ G^α_{D̃,ζ}] ≥ 3/4.

Call a sample S 'good' if it contains at least (1 − 8α)m occurrences of 0. Eq. (28) may be rewritten as

Pr_{S∼D̃,A}[A(S) ∈ G^{1/4}_{D,ζ}]   (29)
= Pr_{S∼D̃,A}[A(S) ∈ G^{1/4}_{D,ζ} ∧ S is good] + Pr_{S∼D̃,A}[A(S) ∈ G^{1/4}_{D,ζ} ∧ S is not good] ≥ 3/4.   (30)

Letting the random variable X_S denote the number of occurrences of 0 in S, Eq. (27) shows that E[X_S] ≥ (1 − 4α)m. With this we upper-bound the term Pr_{S∼D̃,A}[A(S) ∈ G^{1/4}_{D,ζ} ∧ S is not good] by

Pr_{S∼D̃,A}[S is not good] = Pr[X_S < (1 − 8α)m]   (31)
= Pr[X_S ≤ (1 − δ)(1 − 4α)m] ≤ e^{−δ²(1−4α)m/2} = e^{−8α²m/(1−4α)},   (32)

where the first equality uses δ = 4α/(1 − 4α) and the inequality follows from a Chernoff bound with E[X_S] replaced by the bound (1 − 4α)m on its expectation. Therefore one can bound the first term on the right-hand side of Eq. (29) by

Pr_{S∼D̃,A}[A(S) ∈ G^{1/4}_{D,ζ} ∧ S is good] ≥ 3/4 − e^{−8α²m/(1−4α)} ≥ 1/4,   (33)

using that the exponential term is at most 1/2 for large enough m. Eq. (33) implies that there exists some good sample S_good such that

Pr_A[A(S_good) ∈ G^{1/4}_{D,ζ}] ≥ 1/4.   (34)

Without loss of generality we may write S_good as

S_good := ( (0, ⌊c(0)⌋_{ζ/2}), . . . , (0, ⌊c(0)⌋_{ζ/2})  [k examples],  (x_{k+1}, ⌊c(x_{k+1})⌋_{ζ/2}), . . . , (x_m, ⌊c(x_m)⌋_{ζ/2}) )   (35)

for some k ≥ (1 − 8α)m. Consider an alternative sample S_alt, which takes the form

S_alt = ( (0, ⌊c(0)⌋_{ζ/2}), . . . , (0, ⌊c(0)⌋_{ζ/2})  [m examples] ).

S_alt differs from S_good in exactly m − k ≤ 8αm examples, and so by the ε-DP property of A (applied once per differing example), we have

Pr_A[A(S_alt) ∈ G^{1/4}_{D,ζ}] ≥ exp(−8αεm) · Pr_A[A(S_good) ∈ G^{1/4}_{D,ζ}] ≥ (1/4) exp(−8αεm).   (36)

For the remainder of this proof, we use Eq. (36) to construct the pair (H, P). Define S_z = ((0, z), . . . , (0, z)) (m examples). For every z ∈ I_{ζ/2}, run A(S_z) repeatedly 4 ln(4)·e^{8αεm} times and store all the outputs in a set H, which has size |H| ≤ (5/ζ)·4 ln(4)·e^{8αεm}. It is clear that for z = ⌊c(0)⌋_{ζ/2} we have S_z = S_alt, and Eq. (36) therefore gives us guarantees on the output of A(S_z). We may conclude from Eq. (36) that for a set H generated in the above fashion,

Pr[H ∩ G^{1/4}_{D,ζ} = ∅] ≤ (1 − (1/4)e^{−8αεm})^{4 ln(4) e^{8αεm}} ≤ 1/4.   (37)

We may therefore define H := {G ⊆ F : |G| ≤ (5/ζ)·4 ln(4)·e^{8αεm}} (note that H ∈ H) and further define P to be the distribution that puts all probability mass on H. Comparing Eq. (37) with the definition of PRdim, Definition 2.9, observe that (H, P) makes up a (ζ, 1/4, 1/4)-probabilistic representation for the class C. Hence PRdim_{ζ,1/4,1/4}(C) ≤ ln(20 ln(4)/ζ) + 8αεm; rearranging gives m ≥ (1/(8αε)) (PRdim_{ζ,1/4,1/4}(C) − ln(20 ln(4)/ζ)), which proves Eq. (25).

The following lemma is an immediate corollary of [FX14, Theorem 3.1], who proved it for Boolean functions; the exact same proof carries over to our definition of PRdim and the randomized one-way communication model in the real-valued setting.
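Parenthetically, the pooling step in the proof above – rerun A on the constant sample S_z for every grid point z and collect every output – can be sketched programmatically (the learner, grid granularity, and repetition count below are illustrative stand-ins; in the proof the repetition count is exponential in αεm):

```python
import random

def build_representation(A, m, zeta, reps, rng):
    """Pool the outputs of learner A on every constant sample S_z.

    The grid discretises the label space [0, 1] in steps of zeta/2 (a
    stand-in for I_{zeta/2}); for each grid point z, the sample
    ((0, z), ..., (0, z)) of size m is fed to A `reps` times and every
    returned hypothesis is collected."""
    grid = [i * zeta / 2 for i in range(int(2 / zeta) + 1)]
    H = set()
    for z in grid:
        S_z = [(0, z)] * m
        for _ in range(reps):
            H.add(A(S_z, rng))
    return H

# Toy randomized learner: noisy mean of the sample's labels, rounded to 0.1.
toy_A = lambda S, rng: round(sum(y for _, y in S) / len(S)
                             + rng.choice([-0.1, 0.0, 0.1]), 1)
H = build_representation(toy_A, m=10, zeta=0.5, reps=20, rng=random.Random(3))
```

The resulting pool H plays the role of the single concept class on which P concentrates: with enough repetitions it contains a probably-ζ-consistent hypothesis for whichever z matches the rounded target label.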
4) -probabilisticrepresentation for the class C . Hence PRDim ζ, / , / ≤ ln(5 /ζ · αεm .The following lemma is an immediate corollary of [FX14, Theorem 3.1] who proved it forBoolean functions and the exact same proof carries over for our definition of PRdim and randomizedone-way communication model in the real-valued setting.
Lemma 5.2 ( PRdim (cid:16)
Randomized Communication Complexity for real-valued functions) . Let C be a concept class of real-valued functions. The following relations hold: PRdim ζ,ε,δ ( C ) ≤ R → ,pubζ,εδ ( C ) , R → ,pubζ,ε + δ − εδ ( C ) ≤ PRdim ζ,ε,δ ( C ) . sfat ( · ) We next prove that for every real-valued concept class
C ⊆ { f : X → [0 , } , the sequential fat-shattering dimension lower bounds the randomized communication complexity of C . Namely, weprove the following lemma: Lemma 5.3.
Let
C ⊆ { f : X → [0 , } be a concept class. Then R → ζ,ε ( C ) ≥ (1 − H ( ε )) · sfat ζ ( C ) . With this lemma, we complete our chain of implications, and obtain the conclusion of this sec-tion, that the sample complexity of pure
DP PAC learning upper-bounds the sfat ( · ) dimension. Weremark that the statement above is the real-valued version of the relationship exhibited in [FX14],wherein the Littlestone dimension (Boolean analog of sfat ( · )) lower-bounds the randomized com-munication complexity of Boolean function classes. The proof of Lemma 5.3 proceeds in two steps.First, we define the communication problem AugIndex d and show that R → ζ,ε ( C ) ≥ R → ε ( AugIndex d ) for d the sfat dimension of C . (We refer the reader to Section 2.2.2 for the definitions of the quantities R → ζ,ε ( · ) and R → ε ( · ) which pertain respectively to real- and Boolean-function communication com-plexity.) Next, we use the known relation R → ε ( AugIndex d ) > (1 − H ( ε )) d where H : [0 , → [0 ,
1] isthe binary entropy function H ( x ) := − x log x − (1 − x ) log(1 − x ).To do the first of the two steps, we will relate the one-way classical communication complexitiesof two communication tasks. The first is the task AugIndex d for d ∈ Z + which is defined as follows:Alice gets string x ∈ { , } d , while Bob gets x [ i − for some i ∈ [ d ], which is the length-( i −
1) prefixof x . The task is for Bob to output the bit x i and we say that AugIndex d ( x, i ) = x i . The secondis the task Eval C , defined in Section 2.2.2, for some real-valued function class C ⊆ { f : X → [0 , } .We repeat the definition for convenience: Alice is given a function f ∈ C and Bob a z ∈ X andBob’s goal is to approximately compute f ( z ), i.e., Bob has to compute b ∈ [0 ,
1] satisfyingPr (cid:2) | b − f ( z ) | ≤ ζ (cid:3) ≥ − ε, (38)36here the probability is taken over the local randomness of Alice and Bob respectively. We denotethe one-way randomized communication complexity of Eval C as R → ζ,ε ( C ) for short. Lemma 5.4. If C ⊆ { f : X → [0 , } satisfies sfat ζ ( C ) = d , then R → ζ,ε ( C ) ≥ R → ε ( AugIndex d ) .Proof. The idea of the proof is to show that a a one-way communication protocol for
Eval C can alsobe used to compute AugIndex d for d = sfat ζ ( C ). The protocol for AugIndex d is as follows:1. Alice and Bob agree on the ζ -fat-shattering tree for the concept class C ahead of time.2. Upon being given an instance of the AugIndex d problem, Alice (who has the d -bit string x )identifies some function in C as follows: she follows the ζ -fat-shattering tree down the pathof left-right turns defined by string x . This takes her to a leaf (cid:96) which is associated withsome unique function c Alice ∈ C . Bob (who has the ( i − x [ i − ) identifies some z Bob ∈ X , a
Bob ∈ [0 ,
1] as follows: he follows the ζ -fat-shattering tree down the path of left-right turns defined by x [ i − . This takes him to some node w at level i − z Bob , a
Bob to be the domain point and threshold associated with that node.3. Alice and Bob use their protocol π for Eval C on the inputs c Alice , z
Bob , and following thisprotocol allows Bob to compute a b that satisfiesPr (cid:2) | b − c Alice ( z Bob ) | ≤ ζ (cid:3) ≥ − ε. (39)4. If b > a Bob , Bob outputs 1; else output 0.We now prove the correctness of this protocol. Eq. (39) states that with probability 1 − ε , b isa ζ -approximation of c Alice ( z Bob ). Condition on this. In parallel, observe that the Alice’s leaf (cid:96) associated with the function c Alice is a descendent of Bob’s node w associated with the values( z Bob , a
Bob ), therefore one of the following two statements must be true by definition of ζ -fat-shattering tree and by the procedure outlined in Step 2: • (cid:96) is in the right subtree of w i.e., c Alice ( z Bob ) > a Bob + ζ , and x i = 1. By Eq. (39), this implies b > a Bob . By Step 4, Bob outputs 1, which is also the value of x i = AugIndex d ( x, i ). • (cid:96) is in the left subtree of w i.e., c Alice ( z Bob ) < a Bob − ζ , and x i = 0. By Eq. (39), this implies b < a Bob . By Step 4, Bob outputs 0, which is also the value of x i = AugIndex d ( x, i ).This means that the output of Bob in Step 4, ˜ b , satisfiesPr[˜ b = AugIndex d ( x, i )] ≥ − ε, (40)where again the probability is taken over the randomness of Alice and Bob. Hence, the protocolabove is a valid protocol for computing AugIndex d .Finally we can prove the lemma stated at the beginning of the section. Proof of Lemma 5.3.
Follows from Lemma 5.4 combined with the inequality R^→_ε(AugIndex_d) ≥ (1 − H(ε))d, which was proven in [FX14].

In fact, below we strengthen the above into a bound on the one-way quantum communication complexity of computing real-valued concept classes.

Corollary 5.5. Let C ⊆ {f : X → [0, 1]} be a concept class. Then Q^→_{ζ,ε}(C) ≥ (1 − H(ε)) · sfat_ζ(C).

Proof of Corollary 5.5. In the proof of Lemma 5.4, simply replace the classical one-way randomized protocol computing Eval_C with a quantum one-way randomized protocol. This gives Q^→_{ζ,ε}(C) ≥ Q^→_ε(AugIndex_d). Next, [Nay99, Theorem 2.3] provides a bound on the complexity of quantum serial encoding that amounts to the statement Q^→_ε(AugIndex_d) ≥ (1 − H(1 − ε))d (note that H(1 − ε) = H(ε)). Combining the two yields the claim.

We remark that a similar corollary for Boolean concept classes was proven earlier by Zhang [Zha11] (where the right-hand side of Corollary 5.5 is replaced by the Littlestone dimension). Our proof technique easily generalizes to the Boolean setting and significantly simplifies his proof [Zha11, Appendix A].
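The protocol in the proof of Lemma 5.4 is easy to simulate end-to-end. In the sketch below, a depth-2 ζ-fat-shattering tree is a nested tuple ((z, a), left, right) with functions at the leaves; Alice walks her full string x to a leaf, Bob walks his prefix to the node at depth i − 1, and Bob thresholds the (approximately) communicated value. Every concrete function, threshold, and domain point is invented for illustration:

```python
def walk(tree, bits):
    """Follow left/right turns (bit 0 = left, bit 1 = right) down a
    ((z, a), left_subtree, right_subtree) tree."""
    node = tree
    for b in bits:
        node = node[1 + b]
    return node

def aug_index(tree, x, i, noise=0.0):
    """Bob's output for AugIndex(x, i), given an Eval_C answer b with
    |b - f(z)| <= |noise| (modelled here as an additive shift)."""
    f = walk(tree, x)                      # Alice's leaf: a function in C
    (z, a), _, _ = walk(tree, x[:i - 1])   # Bob's node at depth i - 1
    b = f(z) + noise                       # Alice -> Bob: approximate f(z)
    return 1 if b > a else 0

leaf = lambda v0, v1: (lambda z: (v0, v1)[z])
# Depth-2 tree over domain {0, 1}: all thresholds a = 0.5, value gap 0.3.
tree = ((0, 0.5),
        ((1, 0.5), leaf(0.2, 0.2), leaf(0.2, 0.8)),   # left subtree: f(0) = 0.2
        ((1, 0.5), leaf(0.8, 0.2), leaf(0.8, 0.8)))   # right subtree: f(0) = 0.8
```

For every x ∈ {0, 1}² and i ∈ {1, 2}, and any additive error of magnitude at most 0.25 (playing the role of ζ), Bob's output equals x_i, mirroring the two-case analysis in the correctness proof.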
6 Applications

We now present a few applications of the results established in the previous sections. For the rest of this section, let U be a class of quantum states on n qubits, and let U_n refer to the set of all quantum states on n qubits. So far, we have shown that the complexity of learning quantum states from the class U, in two models of learning (pure DP PAC and online learning in the mistake-bound model), depends on the sequential fat-shattering dimension of the real-valued function class C_U associated with U: here C_U := {f_ρ : X → [0, 1]}_{ρ∈U}, where X is the set of all possible two-outcome measurements and f_ρ is given by f_ρ(E) = Tr(Eρ) for every E ∈ X.

The online learning work of Aaronson et al. [ACH+18] considers the setting where U is the set of all n-qubit states U_n. Let us denote the corresponding function class by C_n. In this case, [ACH+18] showed that sfat_ε(C_n) ≤ O(n/ε²), thus effectively upper-bounding the sfat(·) dimension of the class of all n-qubit quantum states by n. This section asks what happens when we allow U ⊆ U_n – for instance, when U is a special class of states that may be of particular interest or more experimentally feasible to prepare. Are there any meaningful such classes for which we can improve this bound? We first answer this affirmatively for a few classes of quantum states and finally improve the sample complexity of gentle shadow tomography for these classes of states.

6.1 Bounding sfat(·) by the Holevo information

In this section we provide an upper bound on sfat(C_U) in terms of the Holevo information of an ensemble defined on the class of states U. This new upper bound leads to improved upper bounds on sfat(·) for many classes of quantum states U, and hence improved upper bounds on the sample complexity of learning U. Previously, for U = U_n, Aaronson [Aar07, ACH+18] observed that one could use arguments from quantum random access codes, due to Nayak [Nay99], to obtain a combinatorial upper bound on learning. In this section we show that a better upper bound can be achieved by maximizing the Holevo information χ({p_i, ρ_i}_{ρ_i∈U}) over all possible distributions p⃗ on U, where the Holevo information is defined as

χ({p_i, ρ_i}_{ρ_i∈U}) = S(ρ̄) − Σ_{i : ρ_i∈U} p_i S(ρ_i),   ρ̄ = Σ_{i : ρ_i∈U} p_i ρ_i,   (41)

where p⃗ is a distribution and S is the von Neumann entropy S(ρ) := −Tr[ρ log ρ].

6.1.1 Quantum Random Access Codes

We first define random access codes and serial random access codes over the set U, modifying the definition in [Nay99] so that U – the set of states from which the code states may be chosen – is part of the definition of these codes.

Definition 6.1 (Random access codes and serial random access codes). Let U be a class of quantum states over n qubits. A (k, n, p, U)-random access code (RAC) consists of a set of 2^k code states {ρ_s}_{s∈{0,1}^k} ⊆ U such that, for every i ∈ [k] and s ∈ {0, 1}^k, there exists a 2-outcome measurement O_i such that

Pr[O_i(ρ_s) = s_i] ≥ p.   (42)

A (k, n, p, U)-serial random access code (SRAC) consists of 2^k code states {ρ_s}_{s∈{0,1}^k} ⊆ U such that, for every i ∈ [k] and all s ∈ {0, 1}^k, there exists a measurement with outcome 0 or 1, possibly depending on the last k − i bits s_{i+1}, . . . , s_k, such that Eq. (42) holds.

In words, a RAC over U is a way of encoding k classical bits into n-qubit states from U such that, for every i ∈ [k] and x ∈ {0, 1}^k, the probability of 'recovering' the bit x_i by performing the 2-outcome measurement O_i on ρ_x is at least p. A serial RAC (denoted SRAC) is defined similarly, except that one is allowed to use information from decoding the subsequent bits to decode x_i. Nayak [Nay99] showed the following relation between the number of encodable classical bits and the number of qubits in the code states:

every (k, n, p, U_n)-RAC or (k, n, p, U_n)-SRAC satisfies n ≥ (1 − H(p)) k.   (43)

Here, H(·) is the binary entropy function, and note that the statement applies to code states drawn from the entire class of n-qubit states.

Aaronson et al. [ACH+18] in a recent work showed the surprising connection that a p-sequential fat-shattering tree for U of depth k can be used to construct a (k, n, p, U)-SRAC. As a corollary of this observation, we have

sfat_p(C_U) ≤ max{ k : there exists a (k, n, p, U)-SRAC }.   (44)

Combining Eqs. (43) and (44) yields sfat_p(C_U) ≤ n/(1 − H(p)). In this section, we consider the scenario where U ⊆ U_n and show that this bound can be improved to the following.

Theorem 6.2 (Bounding sfat(·) by the Holevo information). Let p ∈ [0, 1] and let U be some class of quantum states over n qubits. Then

sfat_p(C_U) ≤ (1/(1 − H(p))) · max{ χ({(q_i, σ_i)}_{σ_i∈U}) : Σ_i q_i = 1 }.

To prove this, we tighten the argument of Nayak [Nay99], which was originally derived for U = U_n. We make use of the following lemma.

Lemma 6.3 ([Nay99]). Let σ_0, σ_1 be density matrices and σ = (σ_0 + σ_1)/2. If O is a measurement with {0, 1}-outcome such that making the measurement on σ_b yields the bit b with probability at least p, then

S(σ) ≥ (1/2)[S(σ_0) + S(σ_1)] + (1 − H(p)).

(We remark that such a connection between RACs and learnability was established in an earlier work by Aaronson [Aar07] to understand PAC learnability of quantum states.) We now state and prove our main lemma.
Lemma 6.4.
Let U be some class of quantum states over n qubits. Every ( k, n, p, U ) - RAC or ( k, n, p, U ) - SRAC satisfies (1 − H ( p )) k ≤ max (cid:110) χ (cid:0) { ( q i , σ i ) } σ i ∈U (cid:1) : (cid:88) i q i = 1 (cid:111) , (45) where H ( · ) is the binary entropy function and χ is the Holevo information χ (cid:0) { ( q i , σ i ) } σ i ∈U (cid:1) = S ( (cid:80) i p i σ i ) − (cid:80) i p i S ( σ i ) and S ( · ) is the von Neumann entropy function.Proof. Using Definition 6.1, a ( k, n, p, U )- RAC consists of a set of code states { ρ x } x ∈{ , } k ⊆ U andmeasurements {O i } i ∈ [ k ] satisfying Pr [ O i ( ρ x ) = x i ] ≥ p. Proceeding as in [Nay99], we first definethe following states which are derived from the code states: For every 0 ≤ (cid:96) ≤ k and y ∈ { , } (cid:96) , let σ y = 12 k − (cid:96) (cid:88) z ∈{ , } k − (cid:96) ρ zy . In words, for a (cid:96) -bit string y , let σ y be a uniform superposition over all 2 n − (cid:96) code states with thesuffix y . Let ψ = n (cid:80) z ∈{ , } n ρ z be the uniform superposition over all code states. Then we have S ( ψ ) ≥ k (cid:88) z ∈{ , } k S ( ρ z ) + k (1 − H ( p )) . (46)To see this, first one can use Lemma 6.3 to show S ( ψ ) ≥ ( S ( σ ) + S ( σ )) + 1 − H ( p ) andrecursively applying this lemma to each of the S ( · ) quantities, we get the equation above (observethat each application of the lemma is justified because for every y ∈ { , } (cid:96) , we may write σ y = ( S ( σ y ) + S ( σ y )); and by assumption of a ( k, n, p, U )- RAC , O (cid:96) +1 can distinguish σ y , σ y withsuccess probability p and thus is a measurement that meets the conditions of Lemma 6.3.) UsingEq. (46) it now follows that k (1 − H ( p )) ≤ S ( ψ ) − k (cid:88) z ∈{ , } k S ( ρ z ) = χ (cid:16)(cid:8) k , ρ x (cid:9) x ∈{ , } k (cid:17) ≤ max T ⊆U χ (cid:16)(cid:110) | T | , σ i (cid:111) σ i ∈ T (cid:17) . 
(47)
where the last inequality follows because the uniform ensemble of code states {(1/2^k, ρ_x)}_{x∈{0,1}^k} is precisely of the form {(q_i, σ_i)}_{σ_i∈U} with zero weight on the non-code states in U. In Eq. (45), to get a simpler-looking bound, we relax this inequality further by taking the optimization over arbitrary probability distributions on the code states, not just those that are uniform on a subset. Eq. (45) also holds for SRACs: the argument above is unchanged if O_i is allowed to depend on the bits x_{i+1}, ..., x_k.

The proof of Theorem 6.2 follows immediately from combining Lemma 6.4 and Observation (44).

An interesting consequence of our result is the following. As far as we are aware, there is no way of computing sfat(·) directly, but there do exist algorithms to compute our bound in Theorem 6.2. For a set U of states, performing the maximization max{χ({(q_i, σ_i)}) : Σ_i q_i = 1} is a convex optimization problem which can be solved using the Blahut-Arimoto algorithm [Bla72]. For certain special classes of states, however, one can give simple bounds on the maximal Holevo information, which we present next.

6.1.2 Classes of states with bounded sfat(·) dimension

A natural question is: how does the new upper bound on sfat(U) in Theorem 6.2 compare to the previous upper bound sfat(U_n) ≤ O(n/ε²) given in [ACH+18]? The ε dependence comes about from a Taylor expansion of 1 − H((1 − ε)/2), and our new bounds do not change this dependence; hence for the remainder of this section we set ε = 1 for simplicity. We now mention a few classes of states for which our new bound improves the n dependence of the previous bound.

• Suppose our quantum states are "k-juntas", i.e., each n-qubit quantum state lives in the same unknown k-dimensional subspace of the 2^n-dimensional Hilbert space. Then clearly, the right-hand side of Eq. (45) is upper-bounded by log k < n. In particular, for n-juntas the sfat(·) dimension is O(log n), hence the sample complexity of learning scales as O(log n), which is exponentially better than the prior upper bound of n.

• U consists of a small set of states with small pairwise trace distance; [Aud14] and [Shi19] showed that

χ({(p_i, ρ_i)}) ≤ v_m log |U|, (48)

where v_m = sup_{i,j} ‖ρ_i − ρ_j‖_1 is the maximal trace-norm distance between the states in the class U. This bound can be significantly better than the trivial log |U| if v_m is sufficiently small.

• Let U = N(U_n) be the set of all n-qubit states obtained after passing the states in U_n through the channel N. That is, we would like to learn some arbitrary n-qubit state that has been passed through an unknown quantum channel N. This is the case in many experimentally relevant settings and is in fact one way to understand the effect of experimental noise (which can be modelled by a quantum channel acting during state preparation). The Holevo information of the quantum channel N is the quantity

χ(N) := max_{p⃗, ρ_i} S(Σ_i p_i N(ρ_i)) − Σ_i p_i S(N(ρ_i)), (49)

where the maximization is over (arbitrary-sized) ensembles {(p_i, ρ_i)}. Observe that using Eq. (45) one can upper bound the sfat(·) dimension of the set U = N(U_n) in terms of χ(N).
A centerpiece of quantum Shannon theory is the Holevo-Schumacher-Westmoreland (HSW) theorem [SW97] (see, for example, [Wil17] for a pedagogical proof), which states that χ(N) ≤ C(N), where C(N) is the classical capacity of the channel. Putting these two bounds together gives

sfat(N(U_n)) ≤ C(N). (50)

Now, using the connection above, one can upper bound sfat(·) of noisy quantum states using results developed in quantum Shannon theory to bound the classical channel capacity. For a depolarizing channel acting on d-dimensional states with parameter λ, for instance (a common noise model), one can upper bound C(N) in Eq. (50) by a result of King [Kin03] as

log d − S_min(∆_λ), (51)

where S_min(∆_λ) = −(λ + (1−λ)/d) log(λ + (1−λ)/d) − (d − 1) · ((1−λ)/d) log((1−λ)/d), and the subtracted term in the quantity above makes this bound strictly better than that of [ACH+18].

Interestingly, we may now also bound sfat(·) of the class of quantum Gaussian states. Since these states are infinite-dimensional, the previous bound of [ACH+18] is not useful. However, our channel-capacity upper bound on sfat(·) yields a finite bound: it is known from [GGL+04] that the classical capacity of the lossy bosonic channel with transmissivity η ∈ [0, 1], when the input Gaussian states have photon number at most N_p (and hence bounded energy, which is physically realistic), is g(ηN_p), where g(x) ≡ (x + 1) log(x + 1) − x log x. In particular, the case η = 1 corresponds to zero loss, hence g(N_p) bounds sfat(·) for the entire class of Gaussian states with N_p photons.

Alternatively, one might be interested in states prepared through phase-insensitive bosonic channels. These model other kinds of noise, such as thermalizing or amplifying processes. A recent work [GGPCH14] allows one to bound the capacities of these channels, and hence the sfat(·) dimensions of these noisy Gaussian states.

We now discuss how our results can also improve shadow tomography, a learning framework recently introduced by Aaronson [Aar18]. This is a variant of quantum state tomography in which the goal is not to learn ρ completely, but to learn its 'shadows', i.e., the expectation values of ρ on a fixed (known) set of measurements. To be precise, let U be a subset of n-qubit states. We are given T copies of an unknown state ρ ∈ U, and a set of known two-outcome measurements E_1, ..., E_m. The goal is to learn (with probability at least 2/3) the value Tr(E_i ρ) up to additive error ε for every i ∈ [m]. A trivial learning algorithm uses T = O((2^n + m) · ε^{−2}) many copies of ρ to solve the task, and, surprisingly, Aaronson showed how to solve this task using T = poly(n, log m, ε^{−1}) copies of ρ, exponentially better than the trivial algorithm. An intriguing question left open by Aaronson [Aar18] and others is: is the n dependence necessary? There have been follow-up results by Huang et al.
[HKP20], who improved Aaronson's procedure when the goal is to obtain 'classical shadows', and more recently B\u{a}descu and O'Donnell [BO20] gave a procedure with the best known dependence on all parameters for standard shadow tomography.

Subsequently, Aaronson and Rothblum [AR19] considered gentle shadow tomography, a (stronger) variant of shadow tomography (we do not define gentleness here and refer the interested reader to [AR19]). Here, we show that if we perform gentle shadow tomography with the prior knowledge that the unknown state ρ comes from a class of states U, then the n-dependence in the sample complexity can be replaced with sfat(C_U). As we discussed in the previous section, clearly sfat(C_U) ≤ O(n/ε²), but for many classes of states sfat(C_U) can be much smaller than n, giving us a significant improvement over Aaronson's result. We first state our main statement.

Theorem 6.5 (Faster gentle shadow tomography). The complexity of gentle shadow tomography on a class of states U is

O( sfat_ε(C_U) · log m · log(1/δ) / (ε · min{α, ε}) ), (52)

where α, δ are gentleness parameters and the goal is to learn Pr[E_i(ρ) accepts] to within an additive error of ε for every i ∈ [m]. Moreover, there exists an explicit algorithm that achieves this.

Footnote: This channel is a simple model for communication over free space or through a fiber-optic link, where η models how much noise is 'mixed' into the states.
Footnote: Implicitly, in the complexity above we have assumed that the algorithm succeeds with probability at least 2/3.

The appearance of sfat(C_U) in this bound means that, for the classes of states mentioned in Section 6.1.2, the sample complexity of shadow tomography is better than the complexity in [Aar18] (in terms of n). We now prove Theorem 6.5.
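To make the shadow-tomography task itself concrete, the following is a minimal numpy sketch of the trivial strategy (estimate each Tr(E_i ρ) separately on fresh copies); it is only an illustration of the task, not of Aaronson's procedure or of the gentle algorithm below, and the state, measurements, and sample counts are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 2-qubit mixed state rho (random, but fixed by the seed).
d = 4
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = A @ A.conj().T
rho /= np.trace(rho).real

# m random two-outcome measurements E_i (projectors onto rank-2 subspaces).
m = 20
effects = []
for _ in range(m):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    P = Q[:, :2] @ Q[:, :2].conj().T      # 0 <= E_i <= I
    effects.append(P)

# Trivial strategy: estimate each Tr(E_i rho) by sampling its binary
# outcome on T = O(log(m)/eps^2) fresh copies (union bound over i).
eps = 0.1
T = int(np.ceil(2 * np.log(2 * m / 0.01) / eps**2))
estimates, truths = [], []
for E in effects:
    p_true = np.trace(E @ rho).real       # acceptance probability Tr(E rho)
    samples = rng.random(T) < p_true      # simulated Bernoulli outcomes
    estimates.append(samples.mean())
    truths.append(p_true)

max_err = max(abs(a - b) for a, b in zip(estimates, truths))
```

Each measurement consumes its own T copies, so the total copy count grows linearly in m; the point of shadow tomography is to replace this m-dependence with log m.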
The connection comes from the implication in [AR19] that, under certain conditions, an online learner for quantum states can be used as a black box for what they term 'Quantum Private Multiplicative Weights', an algorithm that performs shadow tomography in both an online and a gentle manner. We now state the precise setting in which this black-box online learner must operate. As usual, we are concerned with the function class C_U := {f_ρ}_{ρ∈U}, where the domain X is the set of all possible two-outcome measurements E on the states in U and the functions in the class are defined as f_ρ(E) = Tr(Eρ) for every E. The unknown state ρ defines some target function c ∈ C_U.

1. Adversary provides an input point in the domain: x_t ∈ X.
2. Learner outputs a prediction ŷ_t ∈ [0, 1].
3. If |ŷ_t − c(x_t)| > ε, then the adversary provides strong feedback ĉ(x_t) ∈ [0, 1], where ĉ(x_t) is an ε/10-approximation of c(x_t), i.e., |ĉ(x_t) − c(x_t)| < ε/10, and the learner is allowed to update its hypothesis. Else, the adversary does not provide any feedback, and the learner must use the same hypothesis on the next round.
4. Learner suffers loss |ŷ_t − c(x_t)|.

Observe that this is a close variant of our setting in Section 2.1, the only difference being that the adversary here gives feedback only on rounds where the learner makes a mistake (i.e., when the learner's prediction is grossly wrong). This means that the learner updates her hypothesis if and only if she makes a mistake. Given an online learner A in the above setting that makes at most ℓ updates, [AR19, Theorem 38] shows that there exists a randomized algorithm B for shadow tomography using

n = O( ℓ · log m · log(1/δ) / (ε · min{α, ε}) ) (53)

many copies of the unknown state ρ, such that algorithm B's error is bounded by ε with probability at least 1 − β. Moreover, this algorithm is (α, δ)-gentle. We are now equipped with all we need to prove Theorem 6.5. The proof boils down to the observation that for any concept class C, we can always construct an online learner that is guaranteed to make at most sfat(C) mistakes, and therefore ℓ = sfat(C) in Eq. (53). The online learner we construct is a variant of the proper version of our RSOA, Algorithm 1.
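The mistake-driven protocol above can be simulated for a toy finite class of [0, 1]-valued functions. The sketch below is only illustrative: it uses a simple median predictor (which gives a halving-style mistake bound for a finite class), not the sfat-based super-bin rule of RSOA, and all names and parameters are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy finite class of [0,1]-valued functions on a small domain,
# standing in for C_U = { E -> Tr(E rho) }.  table[f, x] = f(x).
n_funcs, n_points = 64, 200
table = rng.random((n_funcs, n_points))
target = 7                                 # index of the unknown concept c
eps = 0.2

survivors = list(range(n_funcs))           # V_t, the surviving functions
mistakes = 0
for t in range(500):
    x = rng.integers(n_points)             # adversary's query x_t
    y_hat = float(np.median(table[survivors, x]))   # learner's prediction
    c_x = table[target, x]
    if abs(y_hat - c_x) > eps:             # mistake round: feedback arrives
        mistakes += 1
        fb = c_x + rng.uniform(-eps / 10, eps / 10)  # eps/10-accurate feedback
        # keep only functions consistent with the feedback; c always survives
        survivors = [f for f in survivors
                     if abs(table[f, x] - fb) <= eps / 5]
    # on non-mistake rounds V_{t+1} = V_t: no update is made
```

On every mistake round at least half the survivors sit on the far side of the median, at distance more than eps/5 from the feedback, so each mistake at least halves the surviving set; the target is never eliminated, and the number of mistakes is at most log2(64) = 6.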
Proof of Theorem 6.5.
The proof follows from the Quantum Private Multiplicative Weights algorithm in [AR19] and its accompanying Theorem 39, simply by exhibiting an online learner A for U, in the setting described above, that makes at most ℓ = sfat_ε(C_U) mistakes. In the rest of this proof, we exhibit just such an algorithm, which is a variant of the proper version of RSOA. The difference between Algorithm 4 and RSOA is that in RSOA the learner is allowed to update the set V_t on all rounds t ∈ [T], while in Algorithm 4 the update happens only on the rounds on which it made a mistake ('mistake rounds'). Because the learner's current hypothesis for the target concept is computed based on the 'set of surviving functions' V_t, updating V_t amounts to updating the algorithm's hypothesis.

Algorithm 4: Alternative Robust Standard Optimal Algorithm
Input: Concept class C ⊆ {f : X → [0, 1]}, target (unknown) concept c ∈ C, and ε ∈ [0, 1].
Initialize: V_1 ← C.
for t = 1, ..., T do
  The learner receives x_t and maintains the set V_t of 'surviving functions'. For every super-bin midpoint r ∈ Ĩ_{ε/5}, the learner computes the set of functions V_t(r, x_t).
  The learner finds the super-bins which achieve the maximum sfat(·) dimension, R_t(x_t) := argmax_{r ∈ Ĩ_{ε/5}} sfat_{ε/5}(V_t(r, x_t)) ⊆ Ĩ_{ε/5}.
  The learner computes the mean of the set R_t(x_t), i.e., ŷ_t := (1/|R_t(x_t)|) Σ_{r ∈ R_t(x_t)} r.
  The learner outputs ŷ_t and receives feedback ĉ(x_t) if it has made a mistake, i.e., if |ŷ_t − c(x_t)| > ε.
  If the learner received feedback, update V_{t+1} ← {g ∈ V_t : g(x_t) ∈ B_{ε/5}(ĉ(x_t))}; else V_{t+1} ← V_t.
end for
Output: The intermediate predictions ŷ_t for t ∈ [T], and a final prediction function/hypothesis given by f(x) := R_{T+1}(x).

We thus aim to show that Algorithm 4 has no more than sfat(C) mistake rounds. We observe that we may directly import the proof of Theorem 3.4 to do so. This is because that proof is independent of what happens on the non-mistake rounds, which are the only rounds that differ between RSOA and Algorithm 4. Rather, it argues that on the rounds on which RSOA made a mistake, sfat(V_t) decreases by at least 1 due to the update on V_t, and, having initialized V_1 = C, no more than sfat(C) updates may happen in total. Exactly the same argument can be used to bound the mistakes of Algorithm 4, though note that for the constants to work out, the ε of RSOA must be multiplied by 5.
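Returning to the remark in Section 6.1 that the Holevo maximization in Theorem 6.2 can be computed with the Blahut-Arimoto algorithm [Bla72]: for commuting (i.e., effectively classical) ensembles the maximization reduces to the classical capacity max_p I(p, W), and the classical Blahut-Arimoto iteration can be sketched in a few lines. This is a minimal sketch of the classical algorithm only; the function name and test channel are our own choices.

```python
import numpy as np

def blahut_arimoto(W, iters=500, tol=1e-12):
    """Capacity (in bits) of a discrete channel W[x, y] = P(y | x)."""
    n_in = W.shape[0]
    p = np.full(n_in, 1.0 / n_in)           # input distribution, start uniform
    for _ in range(iters):
        q = p[:, None] * W                   # joint p(x) W(y|x)
        q /= q.sum(axis=0, keepdims=True)    # posterior q(x|y)
        # update: p(x) proportional to exp( sum_y W(y|x) log q(x|y) )
        with np.errstate(divide="ignore", invalid="ignore"):
            logq = np.where(W > 0, np.log(q), 0.0)
        r = np.exp((W * logq).sum(axis=1))
        p_new = r / r.sum()
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    # capacity = I(X;Y) at the optimizing input distribution
    q = p[:, None] * W
    py = q.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(q > 0, q / (p[:, None] * py[None, :]), 1.0)
    return float((q * np.log2(ratio)).sum())

# Sanity check on a binary symmetric channel with flip probability 0.1,
# whose capacity is 1 - H(0.1) (about 0.531 bits).
bsc = np.array([[0.9, 0.1], [0.1, 0.9]])
cap = blahut_arimoto(bsc)
```

For the quantum maximization over non-commuting ensembles one would need the quantum generalizations of this alternating optimization; the classical version above nonetheless suffices for the capacity bounds C(N) used in Eq. (50) whenever the channel's Holevo quantity is known to be achieved on a classical ensemble.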
References

[Aar07] Scott Aaronson. The learnability of quantum states. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 463(2088):3089–3114, 2007.
[Aar18] Scott Aaronson. Shadow tomography of quantum states. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 325–338. ACM, 2018.
[ABMS20] Noga Alon, Amos Beimel, Shay Moran, and Uri Stemmer. Closure properties for private classification and online prediction. In Conference on Learning Theory, pages 119–152. PMLR, 2020.
[ACH+18] Scott Aaronson, Xinyi Chen, Elad Hazan, Satyen Kale, and Ashwin Nayak. Online learning of quantum states. In Advances in Neural Information Processing Systems, pages 8962–8972, 2018.
[AGY20] Srinivasan Arunachalam, Alex B. Grilo, and Henry Yuen. Quantum statistical query learning, 2020. arXiv:2002.08240.
[AJL+19] Jacob D. Abernethy, Young Hun Jung, Chansoo Lee, Audra McMillan, and Ambuj Tewari. Online learning via the differential privacy lens. In Advances in Neural Information Processing Systems, pages 8894–8904, 2019.
[ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 852–860, 2019.
[ANTSV02] Andris Ambainis, Ashwin Nayak, Amnon Ta-Shma, and Umesh Vazirani. Dense quantum coding and quantum finite automata. Journal of the ACM (JACM), 49(4):496–511, 2002.
[AR19] Scott Aaronson and Guy N. Rothblum. Gentle measurement of quantum states and differential privacy. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 322–333, 2019.
[Aud14] Koenraad M. R. Audenaert. Quantum skew divergence. Journal of Mathematical Physics, 55(11):112202, 2014.
[BJKT21] Mark Bun, Young Hun Jung, Baekjin Kim, and Ambuj Tewari, 2021. Unpublished manuscript.
[BKN10] Amos Beimel, Shiva Prasad Kasiviswanathan, and Kobbi Nissim. Bounds on the sample complexity for private learning and private data release. In Proceedings of the 7th International Conference on Theory of Cryptography (TCC), pages 437–454, Berlin, Heidelberg, 2010. Springer-Verlag.
[Bla72] R. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.
[BLM19] Olivier Bousquet, Roi Livni, and Shay Moran. Passing tests without memorizing: Two models for fooling discriminators. arXiv preprint arXiv:1902.03468, 2019.
[BLM20] Mark Bun, Roi Livni, and Shay Moran. An equivalence between private classification and online prediction, 2020. To appear in Journal of the ACM. arXiv:2003.00563.
[BNS13] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Characterizing the sample complexity of private learners. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science (ITCS), pages 97–110. Association for Computing Machinery, 2013.
[BNS19] Mark Bun, Kobbi Nissim, and Uri Stemmer. Simultaneous private learning of multiple concepts. Journal of Machine Learning Research, 20(94):1–34, 2019.
[BO20] Costin Bădescu and Ryan O'Donnell. Improved quantum data analysis, 2020. arXiv:2011.10908.
[BS98] Dan Boneh and James Shaw. Collusion-secure fingerprinting for digital data. IEEE Transactions on Information Theory, 44(5):1897–1905, 1998.
[Bun20] Mark Bun. A computational separation between private learning and online learning. Advances in Neural Information Processing Systems, 33, 2020.
[BUV18] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. SIAM Journal on Computing, 47(5):1888–1938, 2018.
[DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[FX14] Vitaly Feldman and David Xiao. Sample complexity bounds on differentially private learning via communication complexity. In Conference on Learning Theory, pages 1000–1019, 2014.
[GGKM20] Badih Ghazi, Noah Golowich, Ravi Kumar, and Pasin Manurangsi. Sample-efficient proper PAC learning with approximate differential privacy, 2020. arXiv:2012.03893.
[GGL+04] V. Giovannetti, S. Guha, S. Lloyd, L. Maccone, J. H. Shapiro, and H. P. Yuen. Classical capacity of the lossy bosonic channel: The exact solution. Physical Review Letters, 92:027902, 2004.
[GGPCH14] V. Giovannetti, R. García-Patrón, N. J. Cerf, and A. S. Holevo. Ultimate classical communication rates of quantum optical channels. Nature Photonics, 8(10):796–800, 2014.
[GHM19] Alon Gonen, Elad Hazan, and Shay Moran. Private learning implies online learning: An efficient reduction. In Advances in Neural Information Processing Systems, pages 8702–8712, 2019.
[HHJ+17] Jeongwan Haah, Aram W. Harrow, Zhengfeng Ji, Xiaodi Wu, and Nengkun Yu. Sample-optimal tomography of quantum states. IEEE Transactions on Information Theory, 63(9):5628–5641, 2017.
[HKP20] Hsin-Yuan Huang, Richard Kueng, and John Preskill. Predicting many properties of a quantum system from very few measurements. Nature Physics, 16:1050–1057, 2020.
[HR10] Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 61–70. IEEE, 2010.
[HRS20] Nika Haghtalab, Tim Roughgarden, and Abhishek Shetty. Smoothed analysis of online and differentially private learning. arXiv preprint arXiv:2006.10129, 2020.
[IW20] Adam Izdebski and Ronald de Wolf. Improved quantum boosting, 2020.
[JKT20] Young Hun Jung, Baekjin Kim, and Ambuj Tewari. On the equivalence between online and private learnability beyond binary classification. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS, virtual, 2020.
[Kin03] C. King. The capacity of the quantum depolarizing channel. IEEE Transactions on Information Theory, 49(1):221–229, 2003.
[KLN+11] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
[Kot14] Robin Kothari. An optimal quantum algorithm for the oracle identification problem. In 31st International Symposium on Theoretical Aspects of Computer Science (STACS), volume 25 of LIPIcs, pages 482–493. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2014.
[Lit88] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.
[Nay99] A. Nayak. Optimal lower bounds for quantum automata and random access codes. In 40th Annual Symposium on Foundations of Computer Science (FOCS), pages 369–376, 1999.
[OW16] Ryan O'Donnell and John Wright. Efficient quantum tomography. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing, pages 899–912, 2016.
[OW17] Ryan O'Donnell and John Wright. Efficient quantum tomography II. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 962–974. ACM, 2017.
[RST10] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems, pages 1984–1992, 2010.
[Shi19] M. E. Shirokov. Upper bounds for the Holevo information quantity and their use. Problems of Information Transmission, 55(3):201–217, 2019.
[Siu19] Katarzyna Siudzińska. Regularized maximal fidelity of the generalized Pauli channels. Physical Review A, 99(1):012340, 2019.
[Siu20] Katarzyna Siudzińska. Classical capacity of generalized Pauli channels. Journal of Physics A: Mathematical and Theoretical, 53(44):445301, 2020.
[SW97] Benjamin Schumacher and Michael D. Westmoreland. Sending classical information via noisy quantum channels. Physical Review A, 56:131–138, 1997.
[SXL+17] Chao Song, Kai Xu, Wuxin Liu, Chui-ping Yang, Shi-Biao Zheng, Hui Deng, Qi-wei Xie, Keqiang Huang, Qiujiang Guo, Libo Zhang, et al. 10-qubit entanglement and parallel logic operations with a superconducting circuit. Physical Review Letters, 119(18):180511, 2017.
[Tar08] Gábor Tardos. Optimal probabilistic fingerprint codes. Journal of the ACM (JACM), 55(2):1–24, 2008.
[Val84] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[Wil17] Mark M. Wilde. Quantum Information Theory. Cambridge University Press, 2nd edition, 2017.
[Zha11] Shengyu Zhang. On the power of lower bound methods for one-way quantum communication complexity. In