A New Minimax Theorem for Randomized Algorithms
Shalev Ben-David
University of Waterloo [email protected]
Eric Blais
University of Waterloo [email protected]
Abstract
The celebrated minimax principle of Yao (1977) says that for any Boolean-valued function f with finite domain, there is a distribution µ over the domain of f such that computing f to error ε against inputs from µ is just as hard as computing f to error ε on worst-case inputs. Notably, however, the distribution µ depends on the target error level ε: the hard distribution which is tight for bounded error might be trivial to solve to small bias, and the hard distribution which is tight for a small bias level might be far from tight for bounded error levels.

In this work, we introduce a new type of minimax theorem which can provide a hard distribution µ that works for all bias levels at once. We show that this works for randomized query complexity, randomized communication complexity, some randomized circuit models, quantum query and communication complexities, approximate polynomial degree, and approximate log-rank. We also prove an improved version of Impagliazzo's hardcore lemma.

Our proofs rely on two innovations over the classical approach of using von Neumann's minimax theorem or linear programming duality. First, we use Sion's minimax theorem to prove a minimax theorem for ratios of bilinear functions representing the cost and score of algorithms. Second, we introduce a new way to analyze low-bias randomized algorithms by viewing them as "forecasting algorithms" evaluated by a certain proper scoring rule. The expected score of the forecasting version of a randomized algorithm appears to be a more fine-grained way of analyzing the bias of the algorithm. We show that such expected scores have many elegant mathematical properties: for example, they can be amplified linearly instead of quadratically. We anticipate forecasting algorithms will find use in future work in which a fine-grained analysis of small-bias algorithms is required.
1 Introduction
Yao's minimax principle [Yao77] is a central tool in the analysis of randomized algorithms in many different models of computation. In its most commonly-used form, it states that for every Boolean-valued function f with a finite domain, if R(c) denotes the set of randomized algorithms with worst-case cost at most c and ∆ denotes the set of distributions over the domain of f, then

min_{R∈R(c)} max_{µ∈∆} Pr[R(x) ≠ f(x)] = max_{µ∈∆} min_{R∈R(c)} Pr[R(x) ≠ f(x)]

with both probabilities being over the choice of x drawn from µ and the internal randomness of R. This identity implies that there exists a distribution µ for which any algorithm that computes f with bounded error over inputs drawn from µ must have cost at least R(f), the cost of computing f to worst-case bounded error. But it does not say anything else about µ itself. Notably,

I. The minimax principle does not guarantee that the resulting distribution µ must be balanced on the sets f⁻¹(0) and f⁻¹(1).

II. More generally, it does not rule out the possibility that f is very easy to compute by randomized algorithms that are only required to output the correct value with probability at least 1/2 + γ for some small bias γ > 0 over inputs drawn from the distribution µ.

A separate application of the minimax principle can be used to show that there is a distribution µ′ for which all randomized algorithms computing f with bias γ over µ′ have cost at least R_{γ̇}(f) (the cost of computing f to worst-case error γ̇ := (1 − γ)/2), but then there is no guarantee that randomized algorithms with bounded error over µ′ must have cost anywhere close to R(f).

Intuitively, it seems reasonable to expect that for every function f, there is a distribution µ for f that addresses issues I and II: a distribution that is balanced on f⁻¹(0) and f⁻¹(1), and which is at least slightly hard even to solve to a small bias level γ.

Question 1.1 (Informal).
Is there a distribution µ which certifies the hardness of f for all bias levels γ > 0 at the same time?

More formally, observe that the cost of computing f to worst-case bias γ cannot be smaller than about γ²R(f). This is because randomized algorithms can be amplified: by repeating an algorithm O(1/γ²) times and outputting the majority vote of the runs, we can increase its bias from γ to Ω(1). Therefore, a natural refinement of Question 1.1 is as follows.
Question 1.2 (Refinement of Question 1.1). Is there a distribution µ such that for all bias levels γ > 0, any algorithm computing f to bias γ against µ must have cost at least Ω(γ²R(f))?

Question 1.2 is the primary focus of this work. We answer it affirmatively in a variety of computational models (we can handle most models in which amplification and Yao's minimax principle both apply). We note that the distribution satisfying the conditions of Question 1.2 is hard for bounded error in Yao's sense, since each algorithm solving f to bounded error against µ must have cost at least Ω(R(f)). In addition to this, such µ must also be perfectly balanced between 0- and 1-inputs of f (by considering the limit as γ → 0), and must remain somewhat hard to solve even to small bias levels.

The study of Question 1.2 has led us to consider randomized forecasting algorithms which output probabilistic confidence predictions about the value of f(x), instead of a Boolean guess for f(x). When evaluated using a certain proper scoring rule, the best possible score of a forecasting algorithm is intimately related to the best possible bias of a randomized algorithm; in fact, the score appears to be a more fine-grained way of measuring the bias. Scores of forecasting algorithms appear to be the "right" way of measuring the success of randomized algorithms, as such scores satisfy elegant mathematical properties. The following question, which we answer affirmatively, turns out to be a strengthening of Question 1.2.

Question 1.3.
Is there a distribution µ such that for all η > 0, any forecasting algorithm which achieves expected score at least η against µ must have cost at least Ω(ηR(f))?

The answers to Question 1.2 and Question 1.3 have a direct impact on the study of composition theorems and joint computation problems in randomized computational models: a natural approach for such problems involves first applying a minimax theorem and then establishing the required inequalities in the deterministic distributional setting. However, as observed by Shaltiel [Sha03], this approach runs into trouble if the hard distribution is easy to solve to small bias. Specifically, Shaltiel considered distributions µ which are hard to solve most of the time, but which give a completely trivial input with small probability γ. Then computing n independent copies from µ is a little easier than n times the cost of computing f, because on average, γn of the copies are trivial; the cost of computing n independent inputs from µ is at most (1 − γ)n times the cost of solving f.

Things get even worse when the inputs have a promised correlation, as can happen when proving composition theorems. For a concrete example, consider the partial function Trivial_n, which is defined on domain {0ⁿ, 1ⁿ} and maps 0ⁿ → 0 and 1ⁿ → 1. Suppose we want to prove a composition lower bound with Trivial_n on the outside: that is, we want to show that for every function f, computing Trivial_n composed with n copies of f requires Ω(R(f)) cost.
In other words, we want to lower bound the cost of an algorithm which outputs 0 when given n 0-inputs to f, outputs 1 when given n 1-inputs to f, and outputs arbitrarily when given some other type of input.

Now, if we try to lower bound this using the hard distribution from Yao's minimax principle, then the distribution might give a trivial input with small probability γ, as Shaltiel observed; but then so long as n = Ω(1/γ), one of the inputs to f will be trivial with high probability, and we can solve this "all-0s vs all-1s" problem simply by searching for the trivial copy – potentially much faster than the worst-case cost of computing a single copy of f!

The hard distributions we give in this work solve this issue by being hard for all bias levels. In our companion manuscript [BB20], we use one of the query versions of our minimax theorem (Theorem 4.6) to prove a new composition theorem for randomized query complexity.

Minimax theorem for cost/score ratios.
The first main result is a new minimax theorem for the ratio of the cost and score of randomized algorithms. A special case of the theorem with a simple formulation is as follows.
Theorem 1.4 (Special case of Theorem 2.18). Let R be a set of randomized algorithms that can be expressed as a convex subset of a real topological vector space. Let S be a nonempty finite set, and let ∆ be the set of all probability distributions over S, viewed as a subset of R^{|S|}. Let cost : R × ∆ → (0, ∞) and score : R × ∆ → [−∞, ∞) be continuous bilinear functions. Then using the convention r/0 = ∞ for all r ∈ (0, ∞) and the notation r⁺ = max{r, 0} for all r ∈ [−∞, ∞], we have

inf_{R∈R} max_{x∈S} cost(R, x)/score(R, x)⁺ = max_{µ∈∆} inf_{R∈R} cost(R, µ)/score(R, µ)⁺.

Further, all of the above maximums are attained.

The general version of the minimax theorem in Theorem 2.18 shows that the same identity holds even when the cost and score functions are semicontinuous and saddle (but not necessarily linear) under some mild additional restrictions. Furthermore, a variant of the theorem also holds when we consider convex and compact subsets of distributions over the finite set S instead of the set ∆ of all distributions over that set.

Minimax theorems for ratios of semicontinuous and saddle functions as in Theorem 2.18 do not seem to have appeared in the literature previously in the precise form we need, but as we show in Section 2, they can be obtained by extending Sion's minimax theorem [Sio58] with standard arguments. We believe that the main contribution of Theorem 2.18 is in its interpretation for randomized algorithms. Various extensions and variations of Yao's minimax theorem have been considered in the computer science literature previously [Yao77; Imp95; Ver98; Bra15; BGK+18; BB19], but all of them appear to consider the cost of an algorithm (with the minimax theorem applied to algorithms with a fixed worst-case score), the score of an algorithm (with the cost being fixed), or a linear combination of the two. None of those variants suffice to answer the questions raised at the beginning of the introduction or to establish the results in the following subsections; what was needed in those cases was a minimax theorem for the ratio of the cost/score of randomized algorithms, and we suspect that this ratio minimax theorem will find further applications in computer science in the future as well.

Forecasting algorithms and linear amplification.
To convert the statements obtained from Theorem 2.18 regarding the cost/score ratios of randomized algorithms under some distribution µ into more familiar lower bounds on the cost of randomized algorithms that achieve some bias on µ, we need a linear amplification theorem. Ideally, we would like to argue that if there exists a randomized algorithm R with bias γ on µ, then by combining O(1/γ) instances of R we can obtain a randomized algorithm R′ with cost(R′, µ) = O(cost(R, µ)/γ) = O(cost(R, µ)/bias_f(R, µ)) and constant bias. Unfortunately, such a linear amplification property does not hold for most models of randomized algorithms, where amplification from bias γ to bounded error requires combining O(1/γ²) instances of the original algorithm. To obtain a linear amplification result, we must turn our attention away from bias and error and consider other score functions instead.

To describe our score function, we first generalize our computational model from randomized algorithms that output 0 or 1 to forecasting algorithms, which are randomized algorithms that output a confidence value in [0, 1] for the value f(x) of the function f on their given input x. A "low" confidence prediction is a value close to 1/2 whereas a "high" confidence prediction would be a value very close to 0 or to 1. There are many natural ways to assign a score to a confidence value for f(x). The study of such scoring rules and their properties has a rich history in the statistics and decision theory communities (see for instance [BSS05; GR07] and references therein); we discuss some fundamental scoring rules and give relations between them in Section 3. Of particular importance to our main purpose is the scoring rule hs : [0, 1] → [−∞, 1] defined by

hs_f(p) = 1 − √((1 − p)/p)   when f(x) = 1,
hs_f(p) = 1 − √(p/(1 − p))   when f(x) = 0.
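For concreteness, the scoring rule above is easy to compute directly. The following sketch (ours, purely illustrative and not from the paper) implements hs_f and numerically checks the bias-to-score conversion made precise in Lemma 1.6 below: a bias-γ algorithm, viewed as a forecaster that reports confidence (1 + γ)/2 in its guess, has expected hs score 1 − √(1 − γ²) ≥ γ²/2.

```python
import math

def hs_score(p, fx):
    """The scoring rule hs_f applied to a forecast p in [0, 1] for f(x).

    hs_f(p) = 1 - sqrt((1-p)/p) when f(x) = 1, and
    hs_f(p) = 1 - sqrt(p/(1-p)) when f(x) = 0.
    A forecast of 1/2 scores 0; a maximally wrong forecast scores -infinity.
    """
    if fx == 1:
        return 1 - math.sqrt((1 - p) / p) if p > 0 else -math.inf
    return 1 - math.sqrt(p / (1 - p)) if p < 1 else -math.inf

# A bias-gamma algorithm guesses correctly w.p. (1 + gamma)/2. Turned into a
# forecaster that reports confidence (1 + gamma)/2 in its guess, its expected
# hs score works out to exactly 1 - sqrt(1 - gamma^2) >= gamma^2 / 2.
for gamma in (0.01, 0.1, 0.5, 0.9):
    p = (1 + gamma) / 2
    # With f(x) = 1: guess 1 (forecast p) w.p. p; guess 0 (forecast 1-p) w.p. 1-p.
    expected = p * hs_score(p, 1) + (1 - p) * hs_score(1 - p, 1)
    assert abs(expected - (1 - math.sqrt(1 - gamma**2))) < 1e-9
    assert expected >= gamma**2 / 2
```

Note how asymmetric the rule is: a confident correct forecast gains at most 1, while a confident wrong forecast loses unboundedly, which is what penalizes overclaiming and makes the rule proper.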
We define the hs score of a forecasting algorithm R on an input x in the domain of f to be score_{hs,f}(R, x) = E[hs_f(R(x))], the expectation of the hs score of the output of R over the internal randomness of R. Then linear amplification does hold for this score function.

(The astute reader may have noticed that we would obtain linear amplification if we simply set the score to be the squared bias of the randomized algorithm. That is true, but this approach does not work in conjunction with the ratio minimax theorem, since this score function no longer satisfies the appropriate saddle property requirements of that theorem; this is why we instead consider forecasting algorithms.)

Lemma 1.5.
For any Boolean-valued function f, any forecasting algorithm R, and any k ≥ 1, there is a forecasting algorithm R′ that combines the outputs of k instances of R and satisfies

score_{hs,f}(R′, x) ≥ 1 − (1 − score_{hs,f}(R, x))^k

for every x in the domain of f. In particular, when k = max_x ⌈2/score_{hs,f}(R, x)⌉, then for each x ∈ Dom(f), score_{hs,f}(R′, x) ≥ 1 − e⁻² > 0.8.

To the best of our knowledge, Lemma 1.5 has not previously appeared in the literature. This lemma is sensitive to the precise definition of hs_f; other scoring rules do not appear to satisfy this amplification property, which is crucial for the proof of our main results. Additionally, the scoring rule hs_f is special because there is a close connection between the hs score of forecasting algorithms and the bias of randomized algorithms.

Lemma 1.6.
For any Boolean-valued function f, any distribution µ on Dom(f), and any parameter γ > 0:

• If there exists a randomized algorithm R with bias_f(R, µ) = 2 Pr[R(x) = f(x)] − 1 ≥ γ, then there is a forecasting algorithm R′ with score_{hs,f}(R′, µ) ≥ 1 − √(1 − γ²) ≥ γ²/2, and

• If there exists a forecasting algorithm R with score_{hs,f}(R, µ) ≥ γ, then there is a randomized algorithm R′ with bias_f(R′, µ) ≥ γ.

Moreover, in both cases R′ can be explicitly constructed from R by modifying its output.

Lemma 1.5 and Lemma 1.6 can be used to reprove the fact that O(1/γ²) instances of a bias-γ randomized algorithm can be combined to obtain a bounded-error algorithm; combining those lemmas (or, more precisely, specific instantiations of these lemmas that account for the explicit constructions of the relevant algorithms and their costs) with the minimax theorem also leads to new results as described in the next section.

Hard distributions for bounded error and small bias.
The minimax theorem for cost/score ratios and linear amplification of forecasting algorithms can be combined to show that for many measures of randomized complexity, for every Boolean-valued function f with finite domain there exists a single distribution µ on which it is hard to compute f with bounded error and with (any) small bias. For example, letting RDT(f) denote the minimum (worst-case) query complexity of a randomized algorithm computing f (or equivalently, the minimum worst-case depth of a decision tree computing f) with error at most 1/3 on every input in Dom(f), and RDT^µ_{γ̇}(f) denote the minimum query complexity of a randomized algorithm that has error probability at most γ̇ := 1/2 − γ/2 when inputs are drawn from µ, we obtain the following result.

Theorem 1.7.
For any non-constant partial function f : {0,1}ⁿ → {0,1}, there exists a distribution µ on Dom(f) such that for every γ ∈ [0, 1],

RDT^µ_{γ̇}(f) = Ω(γ²·RDT(f)).
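The γ² loss in Theorem 1.7 matches the cost of classical amplification discussed above: majority-voting O(1/γ²) independent runs of a bias-γ algorithm yields constant bias. A quick simulation (ours, purely illustrative) confirms this numerically.

```python
import math
import random

def majority_bias(bias, k, trials=2000, seed=0):
    """Empirical bias of the majority vote over k runs of an algorithm
    that is correct with probability (1 + bias) / 2 on each run."""
    rng = random.Random(seed)
    p = (1 + bias) / 2
    correct = 0
    for _ in range(trials):
        votes = sum(rng.random() < p for _ in range(k))
        correct += 2 * votes > k
    return 2 * correct / trials - 1

gamma = 0.05
k = math.ceil(4 / gamma**2) | 1   # O(1/gamma^2) runs; forced odd to avoid ties
print(majority_bias(gamma, k))    # well above 0.9: bias amplified to Omega(1)
```

The constant 4 inside the ceiling is an arbitrary illustrative choice; any c/γ² repetitions with large enough c give constant bias, while o(1/γ²) repetitions do not.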
We establish analogous theorems for multiple other computational models as well:

  Randomized communication complexity:  RCC^µ_{γ̇}(f) = Ω(γ²·RCC(f))                  (Corollary 4.8)
  Quantum query complexity:             QDT^µ_{γ̇}(f) = γ·Ω̃(QDT(f))                   (Theorem 5.2)
  Quantum communication complexity:     QCC^µ_{γ̇}(f) = γ·Ω̃(QCC(f))                   (Theorem 5.9)
  Polynomial degree:                    deg^µ_{γ̇}(f) = γ·Ω̃(adeg(f))                  (Theorem 6.5)
  Log-rank complexity:                  log rank^µ_{γ̇}(f) = γ·Ω̃(log rank_{1/3}(f))   (Theorem 6.8)
  Circuit complexity:                   Rcirc^µ_{γ̇}(f) = γ²·Ω̃(Rcirc(f))              (Theorem 7.8)
  Log-depth circuit complexity:         RNC1^µ_{γ̇}(f) = γ²·Ω̃(RNC1(f))                (Theorem 7.9)
  Threshold circuit complexity:         RTC0^µ_{γ̇}(f) = γ²·Ω̃(RTC0(f))                (Theorem 7.10)

(Note that as in Theorem 1.7, the novel aspect of all these results is that they guarantee that for each of the stated inequalities, there exists a single distribution µ that satisfies the inequality for every value of γ simultaneously.)
The theorems listed above settle Question 1.2 in the affirmative for the specified models. For the models with quadratic dependence on γ (i.e., randomized query complexity, randomized communication complexity, and the various randomized circuit models), we also get hard distributions which lower bound the expected score of a forecasting algorithm, settling Question 1.3 affirmatively.

Distinguishing power of randomized algorithms and protocols.
In the communication complexity setting, we can also analyze how well a randomized communication protocol computes a function f : X × Y → {0,1} via its communication transcripts. Let tran(R, µ) denote the distribution on communication transcripts of the randomized protocol R on inputs drawn from µ. Then one way to measure how well R is able to distinguish 0- and 1-inputs of f is to measure the Hellinger distance between the distributions tran(R, µ₀) and tran(R, µ₁) of transcripts of R on some distributions µ₀ over f⁻¹(0) and µ₁ over f⁻¹(1). We can use the minimax and linear amplification theorems to give a strong upper bound on this Hellinger distance as a measure of the cost of the protocol.

Theorem 1.8.
For any non-constant partial function f : X × Y → {0,1} over finite sets X and Y, there is a pair of distributions µ₀ on f⁻¹(0) and µ₁ on f⁻¹(1) such that for any randomized communication protocol R, the squared Hellinger distance between the distribution of its transcripts on µ₀ and µ₁ is bounded above by

h²(tran(R, µ₀), tran(R, µ₁)) = O( min{cost(R, µ₀), cost(R, µ₁)} / RCC(f) ).

Here cost(R, µ) denotes the expected amount of communication the protocol R transmits when given inputs from µ.

Theorem 4.6 establishes an analogous result for query complexity. In our companion paper [BB20], that theorem is one of the ingredients that enables us to establish a new composition theorem for query complexity.

Hardcore lemma.
Impagliazzo's Hardcore Lemma [Imp95] states that for every ε, δ > 0, if every circuit C of size at most s computes f with error at least δ on the uniform distribution, then there is a δ-regular distribution µ = µ(δ, ε) for which every circuit that computes f with bias at least ε on the distribution µ must have size Ω(ε²s). Informally, the lemma shows that if a function f is mildly hard on average, it is because it is "very" hard to compute on a fairly large subset of its inputs. But, interestingly, this version of the hardcore lemma leaves open the possibility that the hardcore might be different for various levels ε of hardness. Using our main theorems, we can show that this is not the case.

Theorem 1.9.
There exists a universal constant c > 0 such that for any δ > 0 and function f : {0,1}ⁿ → {0,1}, if every circuit C of size at most s satisfies Pr[C(x) = f(x)] ≤ 1 − δ when the probability is taken over the uniform distribution of x in {0,1}ⁿ, then there is a distribution µ with min-entropy at least n − log(1/δ) such that for every ε > 0, any circuit C′ of size at most c·ε²/log(1/δ)·s has success probability bounded by Pr[C′(x) = f(x)] ≤ 1/2 + ε, where the probability is over x drawn from µ.

The proof of Theorem 1.9 follows closely the original argument of Nisan in [Imp95] that established the hardcore lemma via a minimax theorem. Since that original work, many extensions and different proofs of the hardcore lemma have been established (e.g., [Imp95; KS03; BHK09; TTV09]), but to the best of our knowledge Theorem 1.9 represents the first version of the lemma which gives a single distribution µ which is hard for all values of ε > 0 simultaneously.

In independent work concurrent with this one, Bassilakis, Drucker, Göös, Hu, Ma, and Tan [BDG+20] showed the existence of a certain hard distribution for randomized query complexity. They showed every Boolean function f has hard distributions µ₀ and µ₁ (on 0- and 1-inputs respectively) such that given query access to k independent samples from µ_b, it is still necessary to use Ω(R(f)) queries to the bits of the samples in order to decide the value of b ∈ {0,1} to bounded error.

The guarantee on the hard distribution provided by [BDG+20] is formally stronger than the one we provide in Theorem 4.6 (though in our companion manuscript [BB20], we prove a new composition theorem for randomized query complexity, and use it to conclude that the guarantee of [BDG+20] turns out to be equivalent to the guarantee of Theorem 4.6 in our current work).
Thetools used by [BDG+20] are also completely different: they use arguments specific to query com-plexity that construct the hard distribution more explicitly, but their arguments do not generalizeto other models such as communication complexity or circuit complexity. Section 2 is devoted to proving the main minimax theorem for the cost/score ratio of randomizedalgorithms. The main result of that section is Theorem 2.18; the rest of the section is devotedto introducing the mathematical notions and preliminaries required to obtain a proof of thattheorem from Sion’s minimax theorem.
Section 3 introduces the basic definitions and some basic scoring rules for forecasting algorithms. The section establishes some of the core properties of scoring functions, including notably connections between the best score achievable by forecasting algorithms on distributions over inputs and various distance measures on those distributions. The final portions of this section then establish the main linear amplification theorem in general form in Lemma 1.5 and the general form of the conversion between randomized and forecasting algorithms in Lemma 3.15.

Section 4 focuses on the query and communication complexity settings. Conversions between randomized and forecasting algorithms in the query complexity setting are straightforward, but there is one significant challenge in applying the linear amplification theorem to obtain the results in Theorem 1.7 and Theorem 4.6: the cost and score of a randomized algorithm R on an input x can both depend on x itself. This is a problem because to obtain a constant score (and after the final conversion, a bounded-error randomized algorithm), we want to amplify R with a number k of copies that depends on the score of R on x, but since we don't know x, we don't know what score(R, x) is either. We get around this problem with odometer arguments: by empirically estimating the expected number of queries R makes on x, we can obtain effective bounds on the number k of copies of R that we need to obtain successful amplification.

As we show in the section, the communication complexity results Corollary 4.8 and Theorem 1.8 follow immediately from their query complexity analogues.

Section 5 establishes the results in the quantum query and communication complexity settings. Unlike in the classical setting, amplification that is linear in the bias of an algorithm does hold in the quantum query complexity setting.
However, the proof of Theorem 5.2 requires the set of algorithms to be representable as a convex subset of a real topological vector space, and the cost of an algorithm to be a convex function on this set. It is not immediately clear how quantum query algorithms can satisfy this condition, because in the usual definition, the cost of a mixture of two quantum algorithms would be the maximum of the costs of the algorithms rather than the average. To overcome this issue, we instead establish Theorem 5.2 via consideration of what we call probabilistic quantum algorithms, which correspond to probability distributions over quantum algorithms and do easily satisfy the appropriate convexity requirements. Probabilistic quantum algorithms are harder to amplify than regular quantum algorithms (due to their lack of coherence), but we show that a linear amplification theorem still holds.

Another important difference between the quantum and the classical setting is that the communication complexity result, Theorem 5.9, is not implied by the analogous query complexity result. Nonetheless, the same argument used for quantum query algorithms holds for quantum communication protocols as well. We complete the proof of Theorem 5.9 by first providing an abstraction of the query complexity argument in Theorem 5.8 and then showing how communication protocols satisfy the conditions of this abstract theorem.
Section 6 considers the approximate polynomial degree and the log-rank complexity of functions. As with quantum query complexity, approximate polynomial degree satisfies an amplification theorem that is linear in the bias, meaning that we do not need to use forecasting algorithms or scoring rules. However, also as with quantum query complexity, polynomials and their cost do not satisfy the right convexity requirements, as the degree of a mixture of two polynomials is not the average of their degrees. We overcome this by considering probabilistic polynomials. Proving an amplification theorem for probabilistic polynomials turns out to be somewhat tricky, and requires tools from approximation theory such as Jackson's theorem.

Approximate log-rank inherits all of the problems of approximate polynomial degree, and adds a few more. To handle approximate log-rank, we switch over to the nearly-equivalent model of the logarithm of the approximate γ₂ norm, and then use the previous trick of considering the probabilistic approximate γ₂ norm. To prove an amplification theorem for the probabilistic γ₂ norm we apply the same tools as for probabilistic polynomials.

Section 7 establishes the circuit complexity results. There are two main hurdles in establishing Theorem 7.8. The first is that the notion of randomized circuits is not as trivially extendable to forecasting circuits as in other computational models. We show that this conversion can be done efficiently when we discretize the set of confidence values that can be returned by forecasting circuits, and that this discretization does not affect the guaranteed relations between score and bias. The second is that the overhead required to combine the output of multiple instances of a randomized circuit during linear amplification is not trivial.
This second hurdle can be overcome with the use of efficient circuit constructions for elementary arithmetic operations and the iterated addition problem.

The proof of the universal hardcore lemma in Theorem 1.9 is obtained via a slight generalization of the ratio minimax theorem. This variant of the minimax theorem is stated in Lemma 7.12 and the rest of the proof of Theorem 1.9 is presented in Section 7.3.

We make a few remarks regarding other possible generalizations of Yao's original minimax theorem. First, one may wonder why we provide a hard distribution µ satisfying R^µ_{γ̇}(f) = Ω(γ²R(f)) for all γ, rather than the stronger statement R^µ_{γ̇}(f) = Ω(R_{γ̇}(f)) for all γ. In other words, we've stated our lower bounds in terms of the bounded-error randomized cost R(f), which required amplification; why not directly compare the average-case complexity to bias γ, denoted R^µ_{γ̇}(f), to the worst-case complexity to bias γ, denoted R_{γ̇}(f)?

The reason is that this stronger version of the minimax is actually false: that is, there need not be a distribution µ for which R^µ_{γ̇}(f) = Ω(R_{γ̇}(f)) for all γ (even though for every given γ, such a distribution µ that depends on γ does exist, by Yao's minimax theorem). For a counterexample, consider the query complexity model. Let f be the Boolean function on n + m + 1 bits, where if the first bit is 0 the function f evaluates to the parity of the next m bits, whereas if the first bit is 1 the function f evaluates to the majority of the last n bits. Say we take n = m². Then, since parity is hard to compute even to small bias, we have R_{γ̇}(f) ≥ m for all γ. We also have R_{1/3}(f) = Ω(m²), since majority on m² bits requires Ω(m²) queries. Now, consider any distribution µ over the domain of f.
If µ places nonzero probability mass on inputs with first bit 1, then µ can necessarily be solved to some sufficiently small bias using at most 2 queries (one query to the first bit of the input, and one to a random position in the input to majority). In this case, we would have R^µ_{γ̇}(f) = O(1) and R_{γ̇}(f) = Ω(√n) for this sufficiently small γ. Alternatively, if µ places zero probability mass on inputs with first bit 1, then solving f against µ is solving parity on m = O(√n) bits; hence R^µ_{1/3}(f) = O(√n), even though R_{1/3}(f) = Ω(n). Similar counterexamples can be constructed in other computational models.

Another possible generalization of Yao's minimax is to a distribution µ for which R^µ(f) is large even when both the error of the algorithm and the expected cost are measured against µ. That is, in a normal application of Yao's minimax, we either consider randomized algorithms which only ever make at most T queries (against any input) and measure their expected error against µ, or else we consider randomized algorithms which only ever make error at most ε (against any input) and measure their expected cost against µ. One may wonder if it is possible for one distribution to certify the hardness of f in both ways at once, with both the cost and the error measured in expectation against µ.

The answer turns out to be yes, as first observed by Vereshchagin for query complexity [Ver98]. Vereshchagin stated his theorem for bounded error, but in the case of small bias γ, his techniques appear to give a distribution µ (which depends on γ) such that R^µ_{γ̇}(f) = Ω(γ·R_{γ̇}(f)) even when the left-hand side is defined as the expected query complexity against µ to bias at least γ (also against µ).
This is in contrast to Yao-style minimax theorems, which are stronger in that they lack the γ factor on the right-hand side, but weaker in that the left-hand side has either the cost or the error being worst-case (rather than both being average-case against µ).

Our results in this work are "Vereshchagin-like" in that they hold even when R^µ_{γ̇}(f) has both the cost and the bias defined in expectation against µ. We prove such results for randomized query complexity and randomized communication complexity, showing a single µ satisfies R^µ_{γ̇}(f) = Ω(γ²R(f)) for all γ > 0, even when both the error and the cost in the definition of R^µ_{γ̇}(f) are average-case against µ. (For models such as quantum query complexity or circuit complexity, the expected cost of an algorithm does not have an obvious interpretation, since the algorithms generally have the same cost for all inputs; therefore, for those models we do not give a theorem in which the cost is measured in expectation against µ.)

Note that our minimax theorem is not directly comparable to Vereshchagin's, because we state our lower bounds in an "amplified" form – that is, the lower bounds are with respect to R(f) rather than R_{γ̇}(f). As previously mentioned, this is necessary when proving that a single distribution works for all γ, and our theorems appear to be tight in that setting. Moreover, Vereshchagin's theorem is tight in its setting: the factor of γ is necessary, because average-case query complexity can be smaller than worst-case query complexity (for example, consider the parity function on n bits, which has R_{γ̇}(f) = n for all γ; if we design a randomized algorithm which queries all the bits with probability γ and queries no bits with probability 1 − γ, it will use only γn expected queries, and it will solve f to bias γ).
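The algorithm in the parenthetical above (query everything with probability γ, otherwise guess) is easy to simulate. The following sketch (ours, purely illustrative) confirms empirically that it achieves bias close to γ on parity with only about γn expected queries.

```python
import random

def lazy_parity(x, gamma, rng):
    """Query all bits with probability gamma (answering parity exactly),
    otherwise query nothing and guess at random.

    Correct w.p. gamma + (1 - gamma)/2 = (1 + gamma)/2, i.e. bias gamma,
    using gamma * len(x) queries in expectation.
    """
    if rng.random() < gamma:
        return sum(x) % 2, len(x)   # exact answer after n queries
    return rng.randrange(2), 0      # blind guess, zero queries

rng = random.Random(1)
n, gamma, trials = 100, 0.2, 20000
total_queries = total_correct = 0
for _ in range(trials):
    x = [rng.randrange(2) for _ in range(n)]
    answer, queries = lazy_parity(x, gamma, rng)
    total_queries += queries
    total_correct += answer == sum(x) % 2
print(total_queries / trials)          # close to gamma * n = 20 expected queries
print(2 * total_correct / trials - 1)  # close to bias gamma = 0.2
```

Any algorithm for parity that makes fewer than n queries on some run has bias exactly 0 conditioned on that run, which is why the expected query count cannot be improved below γn for bias γ.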
A remaining open problem is as follows: can Vereshchagin's theorem be modified to show

R^µ_{˙γ}(f) = Ω(R̄_{˙γ}(f)), (1)

where both cost and bias on the left are measured in expectation against µ, and where R̄_{˙γ}(f) denotes the worst-case (over the inputs of f) expected (over the internal randomness of the algorithm) query complexity of f to bias γ? Note that in the bounded-error setting, R̄(f) = Θ(R(f)), so for bounded γ this result follows from both Vereshchagin's theorem and from our work here. For small γ, we leave this question as an intriguing open problem.

We also note that we cannot hope that a single distribution µ satisfies (1) for all γ, because one can construct a counterexample via a modification of our earlier function: we let f be defined on 1 + m + n bits, where if x₁ = 0 the function evaluates to the parity of the next m bits, and if x₁ = 1 the function evaluates to the majority of the last n bits, as before; this time we will have n = m^{4/3}. We also add a promise: we require that the input always has Hamming weight either at most n/2 − √n or at least n/2 + √n on the last n bits, turning the majority part of the function into a √n-gap majority function. Now, to compute f to worst-case bias γ requires at least γm expected queries on inputs x with x₁ = 0, and requires at least γ²n expected queries on inputs with x₁ = 1, so at least Ω(max{γm, γ²n}) expected queries in the worst case. This is Ω(n^{1/4}) when γ = n^{−1/2} and Ω(n) when γ is constant. Now fix a distribution µ, and let p be the probability that µ assigns to inputs with x₁ = 1. If p ≤ 1/2, then we can compute f to constant bias simply by querying the first bit, guessing randomly if x₁ = 1, and querying m bits to compute f exactly when x₁ = 0; this uses O(n^{3/4}) queries to achieve constant bias, instead of the Ω(n) which were required in the worst case.
On the other hand, if p ≥ 1/2, then we can compute f against µ by querying the first bit and nothing else when x₁ = 0 (guessing the answer randomly), and otherwise making one additional query to estimate the gap majority function to bias 1/√n. This uses 2 queries and achieves bias Ω(n^{−1/2}) against µ, instead of the Ω(n^{1/4}) queries required in the worst case. We thank an anonymous reviewer for this example.

Minimax theorem for the ratio of saddle functions
Minimax theorems take the form

inf_{x∈X} sup_{y∈Y} α(x, y) = sup_{y∈Y} inf_{x∈X} α(x, y).

For any function α, the left-hand side above is always at least the right-hand side, but equality only holds under certain conditions; when equality does hold, we call it a minimax theorem.

Broadly speaking, the following conditions are required to ensure that a minimax theorem holds. First, X and Y must be convex sets (and they must be subsets of some real vector spaces). Second, α must be saddle – or at least quasisaddle – meaning that it is convex as a function of x and concave as a function of y (or at least quasiconvex and quasiconcave). Third, α must satisfy some continuity conditions. And finally, one of X or Y must be compact (importantly, it is not necessary for both to be compact).

In this section, we show that under certain conditions, minimax theorems also hold for ratios of positive saddle functions. Such a ratio of saddle functions is not necessarily saddle, but the important insight is that it is still quasisaddle.

In order to formally state the conditions in which minimax theorems hold, we will need a few definitions. We assume the reader is familiar with vector spaces and topological spaces, including standard terminology such as compact sets and neighborhoods.
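As a concrete sanity check of the kind of statement proved in this section, the following sketch (our own toy example, not from the paper) takes bilinear cost and score functions over the probability simplex – given by two hypothetical strictly positive 2×2 matrices C and S – and verifies on a grid that the inf-sup and sup-inf of the ratio coincide, as a minimax theorem for quasisaddle ratios predicts.

```python
import numpy as np

# Hypothetical positive matrices: cost(x, mu) = x^T C mu and
# score(x, mu) = x^T S mu are bilinear and strictly positive over the
# simplex, so their ratio alpha = cost/score is quasisaddle.
C = np.array([[1.0, 3.0], [2.0, 1.0]])
S = np.array([[2.0, 1.0], [1.0, 4.0]])

t = np.linspace(0.0, 1.0, 1001)
X = np.stack([t, 1 - t], axis=1)   # mixed strategies x = (t, 1 - t)
M = X                              # distributions mu, same grid

ratio = (X @ C @ M.T) / (X @ S @ M.T)  # ratio[i, j] = alpha(X[i], M[j])

inf_sup = ratio.max(axis=1).min()  # inf_x sup_mu alpha(x, mu)
sup_inf = ratio.min(axis=0).max()  # sup_mu inf_x alpha(x, mu)

# inf-sup >= sup-inf always holds; for this quasisaddle ratio the two
# sides coincide up to discretization error of the grid.
assert inf_sup >= sup_inf - 1e-12
assert abs(inf_sup - sup_inf) < 5e-3
```

For a fixed x the ratio is a Möbius function of µ (and vice versa), so it is quasiconcave in µ and quasiconvex in x even though it is not saddle; this is exactly the structure exploited below.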
Definition 2.1 (Real topological vector space). A real topological vector space is a tuple (V, +, ·, τ), where V is a set, + is a function V × V → V, · is a function V × R → V, and τ ⊆ 2^V, such that • (V, +, ·) is a vector space over R, • (V, τ) is a topological space, • + is continuous under the topology τ, and • · is continuous under the topology τ and the standard topology of R.

We note that any normed real vector space is a real topological vector space, as the norm induces a topology. We will primarily focus on the real topological vector spaces R^n for n ∈ N, which have a standard topology.

Definition 2.2 (Extended reals). The extended reals is the set R̄ := R ∪ {−∞, ∞}. We use the extended interval notation (r, ∞] := (r, ∞) ∪ {∞} for r ∈ R, and similarly for [−∞, r) and [−∞, ∞]. We associate with R̄ the following topology. A set S ⊆ R̄ is a neighborhood of x ∈ R if it contains an open interval (x − ε, x + ε) for some ε ∈ (0, ∞), it is a neighborhood of ∞ if it contains the interval (r, ∞] for some r ∈ R, and it is a neighborhood of −∞ if it contains the interval [−∞, r) for some r ∈ R.

We define addition, subtraction, multiplication, and division of extended reals in the intuitive way, with ∞ − ∞, 0 · ∞, ∞/∞, and x/0 for x ∈ R̄ all undefined. Note also that the extended reals are ordered (for each x, y ∈ R̄, we have either x = y, x < y, or x > y).

Note that while we define the extended reals and will often talk about extended-real-valued functions, our vector spaces will always be over the reals, not over the extended reals. In particular, the extended reals are not a field.

Definition 2.3 (Convexity of sets). We say a subset X of a real vector space V is convex if

∀x, y ∈ X, ∀λ ∈ (0, 1), λx + (1 − λ)y ∈ X.

Definition 2.4 (Convex hull). Let V be a real vector space and let X ⊆ V. The convex hull of X, denoted Conv(X), is the intersection of all convex subsets of V that contain X as a subset.
Note that it is easy to verify that an arbitrary intersection of convex sets is convex, which meansthat the convex hull of any set is always convex.
Definition 2.5 ((Quasi)convexity and (quasi)concavity of functions). Let V be a real vector space, let X ⊆ V be convex, and let φ : X → R̄. We say that φ is convex if for all x, y ∈ X and λ ∈ (0, 1), we have φ(λx + (1 − λ)y) ≤ λφ(x) + (1 − λ)φ(y). We say φ is quasiconvex if for all x, y ∈ X and λ ∈ (0, 1), we have φ(λx + (1 − λ)y) ≤ max{φ(x), φ(y)}. We say that φ is concave if −φ is convex, and we say φ is quasiconcave if −φ is quasiconvex. If φ is both convex and concave, we say it is linear.

Note that if ∞ and −∞ are both in the range of φ, then λφ(x) + (1 − λ)φ(y) may be ∞ − ∞, which is undefined; in this case we say φ is neither convex nor concave. A function with both ∞ and −∞ in its range may still be quasiconcave or quasiconvex.

Definition 2.6 (Saddle and quasisaddle). Let V₁ and V₂ be real vector spaces, let X ⊆ V₁ and Y ⊆ V₂, and let α : X × Y → R̄. We say that α is saddle if for all x ∈ X the function α(x, ·) is concave and for all y ∈ Y the function α(·, y) is convex. We say that α is quasisaddle if for all x ∈ X the function α(x, ·) is quasiconcave and for all y ∈ Y the function α(·, y) is quasiconvex.

Definition 2.7 (Semicontinuity). Let X be a topological space and let φ : X → R̄. We say that φ is upper semicontinuous at x ∈ X if for all y ∈ (φ(x), ∞] there exists some neighborhood U of x on which the value of φ(x′) for x′ ∈ U is less than y. We say that φ is lower semicontinuous at x if −φ is upper semicontinuous at x.

Let Y be another topological space and let α : X × Y → R̄ be a function. We say that α is semicontinuous if for all x ∈ X the function α(x, ·) is upper semicontinuous over all of Y, and for all y ∈ Y the function α(·, y) is lower semicontinuous over all of X.

We note the following two useful lemmas about upper and lower semicontinuous functions. These lemmas are standard, but for completeness we reprove them in Appendix A.
Lemma 2.8 (An upper semicontinuous function on a compact set attains its max). Let X be a nonempty compact topological space, and let φ : X → R̄ be a function. Then if φ is upper semicontinuous, it attains its maximum, meaning there is some x ∈ X such that for all x′ ∈ X, φ(x′) ≤ φ(x). Similarly, if φ is lower semicontinuous, it attains its minimum.

Lemma 2.9 (A pointwise infimum of upper semicontinuous functions is upper semicontinuous). Let X be a topological space, let I be a set, and let {φ_i}_{i∈I} be a collection of functions φ_i : X → R̄. Then if each φ_i is upper semicontinuous, the function φ(x) = inf_{i∈I} φ_i(x) is also upper semicontinuous. Similarly, if each φ_i is lower semicontinuous, the pointwise supremum is lower semicontinuous.

From these lemmas, it follows that if α : X × Y → R̄ is semicontinuous, the expressions

inf_{x∈X} sup_{y∈Y} α(x, y) and sup_{y∈Y} inf_{x∈X} α(x, y)

have all the infimums attained if X is nonempty and compact, and all the supremums attained if Y is nonempty and compact. Hence on compact sets, inf-sup theorems become min-max theorems. The following lemma will also come in useful. We also prove it in Appendix A.

Lemma 2.10 (Quasiconvex functions on convex hulls). Let V be a real vector space, let X ⊆ V, and let φ : Conv(X) → R̄ be a function. If φ is quasiconvex, then

sup_{x∈Conv(X)} φ(x) = sup_{x∈X} φ(x).

Similarly, if φ is quasiconcave, then

inf_{x∈Conv(X)} φ(x) = inf_{x∈X} φ(x).

We are now ready to state Sion's minimax theorem. Actually, we will need a version of Sion's minimax for extended-real-valued functions, while Sion [Sio58] originally only dealt with real-valued functions; luckily, proving this extension is not hard given Sion's original theorem, and we do so in Appendix A.
Theorem 2.11 (Sion's minimax for extended reals). Let V₁ and V₂ be real topological vector spaces, and let X ⊆ V₁ and Y ⊆ V₂ be convex. Let α : X × Y → R̄ be semicontinuous and quasisaddle. If either X or Y is compact, then

inf_{x∈X} sup_{y∈Y} α(x, y) = sup_{y∈Y} inf_{x∈X} α(x, y).

Next, we use Sion's minimax theorem to show a minimax theorem for the ratio of positive saddle functions. To do so, we will need the following lemma.
Lemma 2.12.
Let a, b, c, d ∈ (0, ∞), and let λ ∈ (0, 1). Then

min{a/b, c/d} ≤ (λa + (1 − λ)c)/(λb + (1 − λ)d) ≤ max{a/b, c/d}.

This still holds if any of a, b, c, d are 0, or if a or c are ∞, so long as we interpret x/0 = ∞ for x ∈ [0, ∞].

Proof. When a, c ∈ [0, ∞) and b, d ∈ (0, ∞), it is easy to check that

(λa + (1 − λ)c)/(λb + (1 − λ)d) = (a/b) · 1/(1 + z) + (c/d) · z/(1 + z),

where z = (1 − λ)d/(λb). Since z > 0, this is a convex combination of a/b and c/d, from which the desired result follows. When a = ∞ or c = ∞, both the middle expression and the max expression equal ∞, and the result trivially holds. The same thing happens when b = d = 0. Finally, when a, c ∈ [0, ∞) and exactly one of b and d is 0, the max expression is again infinity, and the inequality on the left-hand side can be easily verified.

The simple lemma above is enough to imply that a convex function divided by a concave function is quasiconvex, and that a concave function divided by a convex function is quasiconcave.

Lemma 2.13. Let V be a real topological vector space, and let X ⊆ V be convex. Let φ : X → [0, ∞] and ψ : X → [0, ∞) be functions, and define ρ : X → [0, ∞] by ρ(x) := φ(x)/ψ(x), with r/0 interpreted as ∞ for r ∈ [0, ∞]. Then:
1. If φ is convex and ψ is concave, ρ is quasiconvex.
2. If φ is concave and ψ is convex, ρ is quasiconcave.
3. If φ is upper semicontinuous and ψ is lower semicontinuous, ρ is upper semicontinuous.
4. If φ is lower semicontinuous and ψ is upper semicontinuous, and if φ is strictly positive on X, then ρ is lower semicontinuous.

Proof. We start with (1). Fix x, y ∈ X and λ ∈ (0, 1). Then

ρ(λx + (1 − λ)y) = φ(λx + (1 − λ)y)/ψ(λx + (1 − λ)y) ≤ (λφ(x) + (1 − λ)φ(y))/(λψ(x) + (1 − λ)ψ(y)) ≤ max{φ(x)/ψ(x), φ(y)/ψ(y)} = max{ρ(x), ρ(y)},

so ρ is quasiconvex, as desired. Here we used the convexity of φ and concavity of ψ in the first inequality, and Lemma 2.12 in the second inequality. (2) works similarly:

ρ(λx + (1 − λ)y) = φ(λx + (1 − λ)y)/ψ(λx + (1 − λ)y) ≥ (λφ(x) + (1 − λ)φ(y))/(λψ(x) + (1 − λ)ψ(y)) ≥ min{φ(x)/ψ(x), φ(y)/ψ(y)} = min{ρ(x), ρ(y)}.

Next, we prove (3). Fix x ∈ X; our goal is to show ρ is upper semicontinuous at x.
If ρ(x) = ∞, then any function ρ is upper semicontinuous at x by definition, so assume ρ(x) < ∞. In particular, this means that φ(x) < ∞ and that ψ(x) > 0. Now, fix y > ρ(x) = φ(x)/ψ(x). By the upper semicontinuity of φ, find a neighborhood U₁ of x on which φ(·) is at most φ(x) + ε (with ε > 0 to be chosen later). By the lower semicontinuity of ψ, find a neighborhood U₂ of x on which ψ(·) is at least ψ(x) − ε. Setting U := U₁ ∩ U₂, we see that on U we have ρ(·) ≤ (φ(x) + ε)/(ψ(x) − ε), assuming we pick ε < ψ(x). We now simply pick ε small enough that this expression is less than y, giving us a neighborhood U of x on which ρ(·) is less than y, as desired.

Finally, we prove (4). As before, we fix x ∈ X. Our goal is to show ρ is lower semicontinuous at x. Let y < ρ(x). We seek a neighborhood U of x on which ρ(·) > y. To start with, the upper semicontinuity of ψ ensures there is a neighborhood U₁ of x on which ψ(·) < ψ(x) + ε, with ε > 0 arbitrarily small. Now, if φ(x) = ∞, then ρ(x) = ∞. In this case, the lower semicontinuity of φ ensures there is a neighborhood U₂ on which φ(·) is at least z, with z ∈ R arbitrarily large. Then in U₁ ∩ U₂, the value of ρ(·) is also arbitrarily large, and can be made to exceed y ∈ R given appropriate choices of z and ε. Alternatively, if φ(x) < ∞, then there is a neighborhood U₂ on which φ(·) > φ(x) − ε. In this case, on U₁ ∩ U₂ we have ρ(·) > (φ(x) − ε)/(ψ(x) + ε). By picking ε sufficiently small, we can again get a neighborhood U₁ ∩ U₂ of x on which ρ(·) > y, meaning that ρ is lower semicontinuous.

We now state the minimax theorem for the ratio of two positive saddle functions. In the statement below, it may help to think of R as a set of randomized algorithms, and to think of ∆ as the set of all probability distributions over a finite input set.
Further, think of cost(R, µ) as measuring the cost of the algorithm R when run on µ (for some models, this will depend only on R and not on µ), and think of score(R, µ) as quantifying the success or bias that the algorithm R achieves against input distribution µ.

Theorem 2.14 (Minimax theorem for the positive ratio of saddle functions). Let V₁ and V₂ be real topological vector spaces. Let R ⊆ V₁ be convex, and let ∆ ⊆ V₂ be nonempty, convex, and compact. Let the function cost : R × ∆ → (0, ∞] be semicontinuous and saddle, and let the function score : R × ∆ → [0, ∞) be such that its negation, −score, is semicontinuous and saddle. Then using x/0 := ∞ for x ∈ (0, ∞], we have

inf_{R∈R} max_{µ∈∆} cost(R, µ)/score(R, µ) = max_{µ∈∆} inf_{R∈R} cost(R, µ)/score(R, µ),

and the maximums are attained.

Proof. Let α : R × ∆ → (0, ∞] be defined by α(R, µ) := cost(R, µ)/score(R, µ), with x/0 interpreted as ∞ for x ∈ (0, ∞]. For any fixed µ ∈ ∆, the function α(·, µ) is quasiconvex and lower semicontinuous by Lemma 2.13. Similarly, for any fixed R ∈ R, the function α(R, ·) is quasiconcave and upper semicontinuous by Lemma 2.13. Hence α is semicontinuous and quasisaddle, and the desired minimax theorem follows from Theorem 2.11. Furthermore, since ∆ is nonempty and compact, the supremums are attained as maximums by Lemma 2.9 and Lemma 2.8.

Finally, we will need two extensions of this theorem. First, we will want to allow the denominator to be a function of the form score(R, µ)⁺, where the ⁺ superscript denotes the maximum of score(R, µ) with 0, and where we only know about saddle properties of score(R, µ), not of score(R, µ)⁺. To do this, we need to show such a maximum with 0 preserves the properties we care about. We have the following lemma, which we prove in Appendix A.

Lemma 2.15.
Let V be a real topological vector space, and let X ⊆ V be convex. For a function ψ : X → R̄, let ψ⁺ denote the function ψ⁺(x) = max{ψ(x), 0}. Then this operation on ψ preserves convexity, quasiconvexity, quasiconcavity, upper semicontinuity, and lower semicontinuity, but not concavity.

This lemma is useful, but doesn't quite give us everything we need, because the operation ψ ↦ ψ⁺ does not preserve concavity. We will need the following additional lemma, which says that Lemma 2.13 also works when dividing by ψ⁺, despite its lack of concavity.

Lemma 2.16.
Let V be a real topological vector space, and let X ⊆ V be convex. Let φ : X → [0, ∞] and ψ : X → [−∞, ∞) be functions, and define ρ : X → [0, ∞] by ρ(x) := φ(x)/ψ(x)⁺, with r/0 interpreted as ∞ for r ∈ [0, ∞]. Then if φ is convex and ψ is concave, ρ is quasiconvex.

Proof. Fix x, y ∈ X and λ ∈ (0, 1). If ψ(x) > 0 and ψ(y) > 0, we have ρ(λx + (1 − λ)y) ≤ max{ρ(x), ρ(y)} using the same argument as in Lemma 2.13. On the other hand, if ψ(x) ≤ 0 or ψ(y) ≤ 0, then we have max{ρ(x), ρ(y)} = ∞, and the inequality ρ(λx + (1 − λ)y) ≤ max{ρ(x), ρ(y)} trivially holds.

The second extension we will need in our final minimax theorem is to the case where the numerator is allowed to be 0. Unfortunately, as we can see from the statement of Lemma 2.13, the ratio does not preserve lower semicontinuity in this setting. We will need to impose some additional conditions on the cost and score functions, particularly with regard to their behavior around 0.

Definition 2.17.
We say that cost : R × ∆ → [0, ∞] and score : R × ∆ → [−∞, ∞) are well-behaved if the following conditions hold:
1. (Finite cost and score can be achieved.) For each µ ∈ ∆, there is some R ∈ R such that cost(R, µ) > 0, cost(R, µ) < ∞, and score(R, µ) > 0.
2. (A zero-cost algorithm has zero cost regardless of the input.) For each R ∈ R, either cost(R, µ) = 0 for all µ ∈ ∆, or else cost(R, µ) > 0 for all µ ∈ ∆.
3. (Mixing a zero-cost algorithm with a nonzero-cost algorithm gives a nonzero-cost algorithm.) For each µ ∈ ∆, if R, R′ ∈ R are such that cost(R, µ) = 0 and cost(R′, µ) > 0, then cost(λR + (1 − λ)R′, µ) > 0 for all λ ∈ (0, 1).

Finally, we are ready for our main workhorse minimax theorem.
Theorem 2.18.
Let V be a real topological vector space, and let R ⊆ V be convex. Let S be a nonempty finite set, and let ∆ be the set of all probability distributions over S, viewed as a subset of R^{|S|}. Let cost : R × ∆ → [0, ∞] be semicontinuous and saddle, and let score : R × ∆ → [−∞, ∞) be such that its negation, −score, is semicontinuous and saddle. Suppose cost and score are well-behaved. Then using the convention r/0 := ∞ for all r ∈ [0, ∞], we have

inf_{R∈R} max_{µ∈∆} cost(R, µ)/score(R, µ)⁺ = max_{µ∈∆} inf_{R∈R} cost(R, µ)/score(R, µ)⁺.

Moreover, if cost(R, ·) and score(R, ·) are both linear in µ for each R ∈ R, then

inf_{R∈R} max_{x∈S} cost(R, x)/score(R, x)⁺ = max_{µ∈∆} inf_{R∈R} cost(R, µ)/score(R, µ)⁺.

Further, all of the above maximums are attained.

Proof. First, note that if S = {x₁, x₂, ..., x_{|S|}}, then we can view ∆ as the convex hull of the set {e₁, e₂, ..., e_{|S|}} ⊆ R^{|S|}, where the e_i are the unit vectors e_i = (0, ..., 0, 1, 0, ..., 0) with the 1 at position i. Hence ∆ is convex. It is also closed and bounded, making it compact. We identify e_i with x_i, so that ∆ = Conv(S).

Note that since each R ∈ R has either cost 0 for all µ or cost greater than 0 for all µ, we can define the set R′ ⊆ R of R with nonzero cost. Now, on R′, the function α(R, µ) = cost(R, µ)/score(R, µ)⁺ is semicontinuous and quasisaddle by Lemma 2.13 together with Lemma 2.16 and Lemma 2.15. Additionally, ∆ is nonempty, convex, and compact. Thus by Theorem 2.11, we know that

inf_{R∈R′} max_{µ∈∆} cost(R, µ)/score(R, µ)⁺ = max_{µ∈∆} inf_{R∈R′} cost(R, µ)/score(R, µ)⁺,

with the maximums attained.

What we want to show is this statement with the infimums over R instead of R′. The inf-sup is always at least the sup-inf for every function, so we need only show that the sup-inf is at least the inf-sup. Moreover, since expanding the domain can only decrease the infimum, we know that

max_{µ∈∆} inf_{R∈R′} cost(R, µ)/score(R, µ)⁺ = inf_{R∈R′} max_{µ∈∆} cost(R, µ)/score(R, µ)⁺ ≥ inf_{R∈R} max_{µ∈∆} cost(R, µ)/score(R, µ)⁺.

Thus we only need to show that the max-inf over R is at least the max-inf over R′, and that the former maximum is attained.

To see this, let µ₀ ∈ ∆ be the maximizing µ for the expression

max_{µ∈∆} inf_{R∈R′} cost(R, µ)/score(R, µ)⁺.

Suppose by contradiction that there was some R̂ ∈ R ∖ R′ such that

cost(R̂, µ₀)/score(R̂, µ₀)⁺ < inf_{R∈R′} cost(R, µ₀)/score(R, µ₀)⁺.

Since R̂ ∈ R ∖ R′, we must have cost(R̂, µ₀) = 0. Since 0/score(R̂, µ₀)⁺ is less than something, and since we are interpreting 0/0 as ∞, we must have score(R̂, µ₀) > 0, so that 0/score(R̂, µ₀)⁺ = 0.

We wish to show that inf_{R∈R′} cost(R, µ₀)/score(R, µ₀)⁺ = 0. To this end, pick ε > 0. We will find R ∈ R′ such that cost(R, µ₀)/score(R, µ₀)⁺ < ε. The idea is to pick some R′ ∈ R′ such that cost(R′, µ₀) < ∞ and score(R′, µ₀) > 0, as guaranteed by the well-behaved condition on cost and score. Then set R := λR′ + (1 − λ)R̂, with λ > 0 extremely small. Now, the well-behaved property of cost says that cost(R, µ₀) > 0, so R ∈ R′. By convexity, we also have cost(R, µ₀) = cost(λR′ + (1 − λ)R̂, µ₀) ≤ λ cost(R′, µ₀) + (1 − λ) cost(R̂, µ₀) = λ cost(R′, µ₀), and by the concavity of score(·, µ₀), we have score(R, µ₀) = score(λR′ + (1 − λ)R̂, µ₀) ≥ λ score(R′, µ₀) + (1 − λ) score(R̂, µ₀) ≥ (1/2) score(R̂, µ₀), assuming λ ≤ 1/2.

This means that score(R, µ₀) and score(R̂, µ₀) are both positive, and cost(R, µ₀)/score(R, µ₀) ≤ 2λ cost(R′, µ₀)/score(R̂, µ₀). Since cost(R′, µ₀) < ∞, setting λ > 0 to be small causes the ratio cost(R, µ₀)/score(R, µ₀)⁺ to be arbitrarily close to 0, as desired. It follows that there exists µ ∈ ∆ such that

inf_{R∈R} cost(R, µ)/score(R, µ)⁺ ≥ inf_{R∈R} max_{µ′∈∆} cost(R, µ′)/score(R, µ′)⁺,

and since the inf-max is always at least the max-inf, there does not exist a µ for which the left-hand infimum is any larger; thus we get the desired result and the maximum is attained.

Finally, suppose that cost(R, ·) and score(R, ·) are linear for each R ∈ R. In that case, cost(R, ·) is convex and score(R, ·) is concave, which means that cost(R, ·)/score(R, ·)⁺ is quasiconvex on ∆ by Lemma 2.16. Then Lemma 2.10 implies that the maximum over µ ∈ Conv(S) is attained at a point in S. Moreover, if R ∈ R ∖ R′, then the maximum over µ ∈ ∆ evaluates to either 0 or ∞. If it is 0, then it is clearly also attained in S. If it is ∞, it means some µ ∈ ∆ has score(R, µ) ≤ 0; the concavity of score(R, ·) then gives us some x ∈ S such that score(R, x) ≤ score(R, µ), meaning there is a point x ∈ S on which score(R, x)⁺ = 0 and cost(R, x)/score(R, x)⁺ = ∞, as desired.

Theorem 2.18 is the main tool we will use to prove minimax theorems for algorithmic models. We will usually apply it in a setting where R is a set of algorithms, S is a finite input set, ∆ is a set of distributions over the inputs, cost(R, µ) is a cost measure for the performance of an algorithm against a distribution, and score(R, µ) is some kind of success measure. We will sometimes choose score(R, µ) = bias_f(R, µ), where bias_f(R, µ) is the bias R achieves against distribution µ in computing f.

We will generally combine Theorem 2.18 with an amplification theorem; such a theorem will turn the left-hand side inf_R max_x cost(R, x)/score(R, x) into something more familiar, such as inf_R max_x cost(R, x) where the infimum is restricted to algorithms R which achieve at least constant bias (i.e. bounded error) on each input. With such an amplification theorem, the minimax result will guarantee the existence of a hard distribution µ against which cost(R, µ)/score(R, µ) is large for all R; this means µ is hard to solve even to small bias.

While the above strategy works for models that can be amplified linearly in the bias (going from bias γ to constant bias using O(1/γ) overhead), such as quantum query complexity, for randomized algorithms the situation is more complicated. For randomized algorithms, we may instinctively want to use something like score(R, µ) = bias_f(R, µ)², but this does not work as it does not satisfy the right saddle properties. Instead, we introduce a new way of evaluating the success of randomized algorithms, called scoring rules. Evaluation via scoring rules ends up being the "correct" way to measure the success of a randomized algorithm, and has more elegant properties than simply the bias. It is also highly intuitive: to evaluate the success of an algorithm, we require it to give a confidence prediction for whether the output is 0 or 1, and then we score the prediction using a scoring rule which incentivizes honesty (that is, a scoring rule that causes a Bayesian agent who wishes to maximize her expected score to output her true subjective probability).

In this section we introduce the notion of forecasting algorithms, which output not just a {0, 1} guess at the function value but also a confidence parameter q ∈ [0, 1] for that prediction. These algorithms will be scored using a scoring rule, which rewards them 1 point for a correct prediction made with perfect confidence, and 0 points for a confidence of 1/2. As we will see, normal algorithms can be converted into forecasting algorithms and vice versa, and the expected score of the forecasting version can often be related to the bias of the algorithm in its regular (discrete outputs) form.

Definition 3.1 (Scoring rule). A scoring rule is a function s : [0, 1] → [−∞, 1] such that s(1) = 1, s(1/2) = 0, and s(·) is increasing over [0, 1]. We say a scoring rule is proper if for each p ∈ (0, 1), the expression ps(q) + (1 − p)s(1 − q) is uniquely maximized at q = p.

Generally, if a forecasting algorithm outputs q ∈ [0, 1], we will interpret it as assigning confidence q to the output 1 and confidence 1 − q to the output 0; we give it score s(q) if the right answer was 1, and score s(1 − q) if the right answer was 0. A proper scoring rule is therefore a scoring rule that incentivizes the algorithm to output q = p in the case where the right answer is sampled from Bernoulli(p). In other words, a proper scoring rule is one that incentivizes a Bayesian agent to output her true subjective probability for the outcome being 1.

Definition 3.2.
We define the following scoring rules:
1. hs(q) := 1 − √((1 − q)/q)
2. Brier(q) := 1 − 4(1 − q)²
3. bias(q) := 2q − 1
4. ls(q) := 1 − log₂(1/q).

We note that Brier(·) and ls(·) are known as the Brier scoring rule and logarithmic scoring rule, respectively, and are well-known in the literature. The Brier scoring rule is useful because it is a proper scoring rule which is bounded (that is, s(q) ∈ [−3, 1] for all q ∈ [0, 1], instead of s(·) diverging to −∞ at 0). The logarithmic scoring rule has an information-theoretic interpretation, with the algorithm essentially starting at score 1 and losing an amount of score depending on its "surprisal" at the correct outcome.

The scoring rule bias(·) is not proper, but as we will see, it is closely related to the bias an algorithm will make. Finally, the scoring rule hs(·) will be the most useful of the bunch for our purposes. Despite not having any intuitive interpretation and not being bounded, it is an incredibly convenient scoring rule due to the fact that it can be amplified, as we will see. hs(·) has been previously studied (for example in [BSS05], where it is called the "boosting loss" due to its relationship with boosting), but we believe its amplification property has not been previously known (we prove this amplification property later on in Lemma 3.10; this ends up being a key ingredient of our minimax theorems).

Lemma 3.3. hs, Brier, and ls are proper scoring rules. bias is a scoring rule which is not proper.

This lemma can be proven using elementary calculus, and we do so in Appendix B.
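To make the definitions concrete, here is a small numerical check of the properness claims (our own sketch; it takes the scoring rules to be hs(q) = 1 − √((1 − q)/q), Brier(q) = 1 − 4(1 − q)², bias(q) = 2q − 1, and ls(q) = 1 − log₂(1/q), the forms used in this paper). For the proper rules, the expected score p·s(q) + (1 − p)·s(1 − q) is maximized on a grid exactly at q = p, while bias incentivizes rounding the report all the way to 0 or 1.

```python
import math

# The four scoring rules; all satisfy s(1) = 1 and s(1/2) = 0
# and are increasing on (0, 1].
def hs(q):    return 1 - math.sqrt((1 - q) / q)
def brier(q): return 1 - 4 * (1 - q) ** 2
def bias(q):  return 2 * q - 1
def ls(q):    return 1 - math.log2(1 / q)

def expected_score(s, p, q):
    """Expected score of reporting q when the outcome is Bernoulli(p)."""
    return p * s(q) + (1 - p) * s(1 - q)

qs = [i / 1000 for i in range(1, 1000)]   # grid over (0, 1)
for s in (hs, brier, ls):
    for p in (0.1, 0.3, 0.5, 0.8):
        best_q = max(qs, key=lambda q: expected_score(s, p, q))
        assert abs(best_q - p) < 1e-9     # proper: maximized at q = p

# bias is not proper: a Bayesian with p = 0.8 gains by reporting q = 1.
assert expected_score(bias, 0.8, 1.0) > expected_score(bias, 0.8, 0.8)
```

For bias the expected score is (2q − 1)(2p − 1), which is linear in q, so the optimum always sits at an endpoint rather than at the honest report q = p.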
Fascinatingly, the above scoring rules all correspond to well-known distance measures between probability distributions. To describe the correspondence, we start by defining the following distance measures.
Definition 3.4.
For probability distributions ν₀ and ν₁ over a finite domain P, define

∆(ν₀, ν₁) := (1/2) Σ_{x∈P} |ν₀[x] − ν₁[x]|  (Total variation)
h(ν₀, ν₁) := (1/2) Σ_{x∈P} (√(ν₀[x]) − √(ν₁[x]))²  (Hellinger)
S(ν₀, ν₁) := (1/2) Σ_{x∈P} (ν₀[x] − ν₁[x])² / (ν₀[x] + ν₁[x])  (Symmetrized χ²)
JS(ν₀, ν₁) := (1/2) Σ_{x∈P} ( ν₀[x] log₂(2ν₀[x]/(ν₀[x] + ν₁[x])) + ν₁[x] log₂(2ν₁[x]/(ν₀[x] + ν₁[x])) )  (Jensen–Shannon)

The above measures give the distance between two probability distributions. We will sometimes want to have an asymmetric distance that is weighted towards one of the two distributions; while these asymmetric distances look strange at first, they show up naturally in the study of scoring rules. We extend the above distance measures as follows.
Definition 3.5.
Given probability distributions ν₀ and ν₁ over a finite domain P, as well as a weight w ∈ [0, 1], set ν = (1 − w)ν₀ + wν₁. Let R be the random variable over x ∈ P defined by R(x) := |(1 − w)ν₀[x] − wν₁[x]|/ν[x] for all x ∈ P. Then define

∆(ν₀, ν₁, w) := E_{x←ν}[R]
h(ν₀, ν₁, w) := E_{x←ν}[1 − √(1 − R²)]
S(ν₀, ν₁, w) := E_{x←ν}[R²]
JS(ν₀, ν₁, w) := E_{x←ν}[1 − H((1 + R)/2)],

where H(α) := α log(1/α) + (1 − α) log(1/(1 − α)) is the binary entropy function.

When w = 1/2, the expressions in Definition 3.5 equal the ones in Definition 3.4. Perhaps surprisingly, the distance measures h, S, and JS are all related to each other by a constant factor.

Lemma 3.6 (Relations between distance measures). When applied to fixed ν₀, ν₁, and w, the distance measures satisfy

S/2 ≤ 1 − √(1 − S) ≤ h ≤ JS ≤ S

as well as ∆² ≤ S ≤ ∆. We also have JS ≤ h/ln 2 and S ≤ (ln 4) JS.

While these relationships are certainly known in the literature, it is hard to chase down good citations (though see [Tøp00; MCAL17] for parts of this result); in any case, we prove Lemma 3.6 in Appendix B.
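The relations in Lemma 3.6 can be spot-checked numerically. The sketch below is our own (it uses the w = 1/2 forms, base-2 logarithms, and the chain of inequalities as stated here): it computes ∆, h, S, and JS for random pairs of distributions and asserts the claimed inequalities.

```python
import math, random

def distances(nu0, nu1):
    """w = 1/2 distances of Definition 3.4, computed in the form of
    Definition 3.5 with R(x) = |nu0[x] - nu1[x]| / (nu0[x] + nu1[x])."""
    tv = h = S = js = 0.0
    for p0, p1 in zip(nu0, nu1):
        nu = (p0 + p1) / 2          # the mixture (nu0 + nu1)/2
        if nu == 0:
            continue
        R = abs(p0 - p1) / (2 * nu)
        q = (1 + R) / 2
        ent = 0.0 if R >= 1 else -q * math.log2(q) - (1 - q) * math.log2(1 - q)
        tv += nu * R
        h  += nu * (1 - math.sqrt(max(0.0, 1 - R * R)))
        S  += nu * R * R
        js += nu * (1 - ent)
    return tv, h, S, js

rng = random.Random(7)
eps = 1e-12
for _ in range(200):
    raw0 = [rng.random() for _ in range(6)]
    raw1 = [rng.random() for _ in range(6)]
    nu0 = [x / sum(raw0) for x in raw0]
    nu1 = [x / sum(raw1) for x in raw1]
    tv, h, S, js = distances(nu0, nu1)
    # S/2 <= 1 - sqrt(1 - S) <= h <= JS <= S, and Delta^2 <= S <= Delta,
    # plus the reverse-direction constants JS <= h/ln 2, S <= (ln 4) JS.
    assert S / 2 <= 1 - math.sqrt(1 - S) + eps <= h + 2 * eps
    assert h <= js + eps <= S + 2 * eps
    assert tv ** 2 <= S + eps and S <= tv + eps
    assert js <= h / math.log(2) + eps and S <= math.log(4) * js + eps
```

The pointwise inequalities behind the chain are the standard entropy bounds 4p(1 − p) ≤ H(p) ≤ 2√(p(1 − p)) applied at p = (1 + R)/2, plus Jensen's inequality for the comparisons involving the expectation of R versus R².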
Consider the following problem: suppose distributions ν₀ and ν₁ are known (for example, perhaps they are the distributions of the transcript of a fixed randomized algorithm when run on a known 0-distribution and a known 1-distribution, respectively). Further, suppose a Bernoulli(w) process generates a bit b ∈ {0, 1}, and then a sample x ← ν_b is provided. We assume the parameter w is known. What is the best algorithm for predicting b given x, assuming you wish to maximize the expected score according to one of the scoring rules hs(·), Brier(·), ls(·), bias(·)? It turns out that the best attainable expected score is exactly the distance between ν₀ and ν₁ according to the distance measures h, S, JS, ∆, respectively. To prove this, we introduce the following definitions.

Definition 3.7.
For a scoring rule s : [0, 1] → [−∞, 1], we define s₁(p) := s(p) and s₀(p) := s(1 − p). This way, if a forecasting algorithm outputs p and the real outcome is b, the score of this prediction will be s_b(p).

Definition 3.8 (Expected score notation). Let S be a finite set, and let φ : S → [0, 1] be a function representing predictions. Let ν be a distribution over S, let P(x) be a Boolean-valued random variable for each x ∈ S representing the correct outcome, and let s : [0, 1] → [−∞, 1] be a scoring rule. The expected score of φ, denoted score_s(φ, ν, P), is defined as

score_s(φ, ν, P) := E_{x←ν} E_{b←P(x)} [s_b(φ(x))].

In these expectations, if a value of ∞ or −∞ occurs with probability 0, we set 0 · ∞ := 0.

We can also extend the score notation to the case where φ(x) outputs a probability distribution over [0, 1] instead of always outputting a deterministic prediction given the observation x. We won't worry about this case for now.

Equipped with these definitions, we are now ready to prove the correspondence between scoring rules and distance measures. This correspondence appears to be known in the literature (indeed, variants of it seem to have been rediscovered many times); see [RW11] for an overview. However, the form we need here is somewhat different from the usual form in the literature, which usually discusses divergences instead of distances. We therefore include the proof for completeness.

Lemma 3.9. Let ν₀ and ν₁ be probability distributions over a finite set S, and let w ∈ [0, 1]. Let M_s(ν₀, ν₁, w) be the maximum possible score for predicting b ← Bernoulli(w) given x ← ν_b, where ν₀, ν₁, and w are known. That is, M_s(ν₀, ν₁, w) is the maximum over choice of φ : S → [0, 1] of the expression score_s(φ, ν, P), where ν = (1 − w)ν₀ + wν₁ and P(x) is the posterior probability distribution of b given prior Bernoulli(w) and observation x ← ν_b.
Then

M_bias(ν_0, ν_1, w) = ∆(ν_0, ν_1, w),   M_hs(ν_0, ν_1, w) = h²(ν_0, ν_1, w),
M_Brier(ν_0, ν_1, w) = S(ν_0, ν_1, w),   M_ls(ν_0, ν_1, w) = JS(ν_0, ν_1, w).

Proof.
Consider a fixed x ∈ S. The contribution of x to the expected score of φ (with respect to scoring rule s) is simply

(1 − w)ν_0[x] s_0(φ(x)) + wν_1[x] s_1(φ(x)) = (1 − w)ν_0[x] s(1 − φ(x)) + wν_1[x] s(φ(x)).

The total expected score of φ is therefore the sum over x ∈ S of the above expression. The function φ which maximizes the expected score is simply the one where φ(x) = q, where q maximizes the expression (1 − w)ν_0[x] s(1 − q) + wν_1[x] s(q). Now, the expression we wish to maximize has the form ν[x] · ((1 − p) s(1 − q) + p s(q)), where p = wν_1[x]/ν[x]. Hence, if s is proper, the unique maximum occurs at q = p = wν_1[x]/ν[x]. This means that for the maximizing φ, the contribution of each x to the expected score is

(1 − w)ν_0[x] s((1 − w)ν_0[x]/ν[x]) + wν_1[x] s(wν_1[x]/ν[x]),

assuming s is proper. For s ∈ {hs, ls, Brier}, the scoring rule s is indeed proper, meaning that we have a closed expression for the maximum possible expected score. Setting R[x] := |wν_1[x] − (1 − w)ν_0[x]|/ν[x], it's not hard to check that for hs, the contribution of each x is ν[x](1 − √(1 − R[x]²)); for ls, the contribution of each x is ν[x](1 − H((1 + R[x])/2)); and for Brier, the contribution of each x is ν[x]R[x]², as desired.

It remains to deal with s = bias. The contribution of each x is the maximum possible value of (1 − w)ν_0[x] bias(1 − q) + wν_1[x] bias(q) for q ∈ [0, 1]. Since bias(q) = 2q − 1, it's not hard to see that the maximizing value of q is q = 0 when (1 − w)ν_0[x] > wν_1[x], q = 1 when wν_1[x] > (1 − w)ν_0[x], and when (1 − w)ν_0[x] = wν_1[x], the contribution of x to the score is 0 regardless of the value of q. The contribution of x to the maximum score is therefore ν[x]R[x], as desired.

We note that in the statement of Lemma 3.9, we are implicitly assuming that the predictive algorithms are deterministic: that given x, one is only allowed to output a deterministic prediction φ(x) ∈ [0, 1] instead of a random choice of prediction. However, it is not hard to see that randomized algorithms won't help in this setting, since we are maximizing the expected score, which is a linear function of the probabilities inside the randomized choice. That is to say, if the randomized algorithm chooses (on input x) to output a with probability p and b with probability 1 − p, then the final score of this algorithm will be a linear function of p, and hence the optimal choice of p will be either 0 or 1. Hence Lemma 3.9 also characterizes the best possible score of a randomized prediction algorithm with respect to those four scoring rules.

From here on out, we consider only the hs(·) scoring rule (and occasionally bias(·), which will correspond to the bias of a randomized algorithm). We will sometimes omit the subscript in the expression score_s(φ, ν, P) when s = hs.

We now proceed to show a few nice properties of the hs scoring rule. First among them is the amplification property. We believe this property (which is crucial for our purposes) has not previously appeared in the literature.

Lemma 3.10 (Amplification of hs). Let S be a finite set, and let φ : S → [0, 1] represent a prediction function.
Then for each k ∈ N, there is a function φ^(k) : S^k → [0, 1] such that for any distribution ν over S, we have

score_hs(φ^(k), ν^⊗k, 0) ≥ 1 − (1 − score_hs(φ, ν, 0))^k,
score_hs(φ^(k), ν^⊗k, 1) ≥ 1 − (1 − score_hs(φ, ν, 1))^k.

Furthermore, equality holds except when score_hs(φ, ν, 0) = score_hs(φ, ν, 1) = −∞. Here 0 and 1 are interpreted as the constant functions P(x) = 0 and P(x) = 1.

Informally, this lemma is saying the following. Consider a randomized forecasting algorithm R, which takes input x and outputs a confidence q ∈ [0, 1] representing its belief that f(x) = 1. Evaluate this algorithm according to its worst-case expected score with respect to the hs(·) scoring rule. That is to say, for each input x ∈ f^{-1}(1), consider the expectation E[hs(R(x))] of the expected score R gets when run on x, and for each x ∈ f^{-1}(0), consider the analogous expectation E[hs(1 − R(x))]. Then take the minimum η of all these expected scores, minimizing over all x ∈ Dom(f). This is the worst-case expected score of R. The lemma then says that we can run R on x several times, say k times independently, and combine the confidence outputs q_1, q_2, ..., q_k in such a way that the new algorithm has worst-case expected score equal to 1 − (1 − η)^k.

Proof.
We define φ^(k)(x_1 ... x_k) as follows. First, if it holds that some pair (x_i, x_j) in the input satisfies φ(x_i) = 0 and φ(x_j) = 1, we define φ^(k)(x_1 ... x_k) := 1/2. Otherwise, we set

φ^(k)(x_1 ... x_k) := (1 + ∏_{i=1}^k (1 − φ(x_i))/φ(x_i))^{-1},

where we interpret 1/∞ := 0 if it occurs (we need not interpret ∞ · 0, since that would only occur if φ(x_i) = 0 and φ(x_j) = 1 for some i and j). Note that if φ(x) = 0 and φ(x′) = 1 for x, x′ ∈ S that have nonzero weight in ν, then we have score_hs(φ, ν, 0) = score_hs(φ, ν, 1) = −∞, so the desired inequalities trivially hold. Otherwise, for b ∈ {0, 1} we write

score_hs(φ^(k), ν^⊗k, b) = E_{x_1...x_k ← ν^⊗k} [1 − √( ((1 − φ^(k)(x_1 ... x_k))/φ^(k)(x_1 ... x_k))^{(−1)^{1−b}} )]
 = 1 − E_{x_1...x_k ← ν^⊗k} [ √( ∏_{i=1}^k ((1 − φ(x_i))/φ(x_i))^{(−1)^{1−b}} ) ]
 = 1 − ∏_{i=1}^k E_{x_i ← ν} [ √( ((1 − φ(x_i))/φ(x_i))^{(−1)^{1−b}} ) ]
 = 1 − (E_{x←ν} [1 − hs_b(φ(x))])^k
 = 1 − (1 − score_hs(φ, ν, b))^k.

Note that equality holds except in the case where score_hs(φ, ν, 0) = score_hs(φ, ν, 1) = −∞.

The following lemma will be convenient when using this amplification theorem. We prove it in Appendix B.

Lemma 3.11. If x ∈ [0, 1] and k ∈ [1, ∞), we have

(1/2) min{kx, 1} ≤ 1 − (1 − x)^k ≤ min{kx, 1}.

3.5 Bias and hs score

Another nice property of hs is that it is at most bias.

Lemma 3.12.
For all q ∈ [0, 1], we have hs(q) ≤ bias(q).

Proof. Recall that hs(q) = 1 − √((1 − q)/q) and bias(q) = 1 − 2(1 − q). The desired inequality clearly holds at q = 0 and q = 1. For q ∈ (0, 1), it suffices to show that 2(1 − q) ≤ √((1 − q)/q), or equivalently

4q(1 − q) ≤ 1 ⇔ 1 − 4q + 4q² ≥ 0 ⇔ (1 − 2q)² ≥ 0,

which also clearly holds.

Finally, the last main property of hs that we exploit is that hs scores and biases are quadratically related. To explain what we mean, start with the following definition of a general algorithm, where we take care not to put any restriction on the structure of the algorithm but want it to take inputs and return outputs while incurring some cost.

Definition 3.13.
Let S be a finite set, and let ∆ be the set of probability distributions over S. A general algorithm, which we denote by R, is a pair of functions. The first function is from ∆ to [0, ∞], and we denote it by cost(R, ·) : ∆ → [0, ∞], so that cost(R, µ) returns a value in [0, ∞] for µ ∈ ∆. The second function takes inputs from S and returns a random variable supported on {0, 1}, and we denote it by output(R, ·), so that output(R, x) is a random variable on {0, 1} for each x ∈ S. The bias of a general algorithm R on input x ∈ S with respect to a function f : S → {0, 1} is

bias_f(R, x) := 2 Pr[output(R, x) = f(x)] − 1.

We note that if output(R, x) has distribution Bernoulli(q), then bias_f(R, x) = bias_{f(x)}(q), where the function bias_{f(x)}(q) is defined according to Definition 3.2 and Definition 3.7.

Just like we defined general algorithms, we also define forecasting algorithms, which output confidences in [0, 1] instead of values in {0, 1}.

Definition 3.14. Let S be a finite set and let ∆ be the set of all probability distributions over S. A forecasting algorithm, which we also denote by R, is a pair of functions. The first function is cost(R, ·) : ∆ → [0, ∞], just like for a general algorithm. The second function takes inputs from S and returns a random variable supported on [0, 1], and we denote it by pred(R, ·), so that pred(R, x) is a random variable on [0, 1] for each x ∈ S. The score of a forecasting algorithm R on input x ∈ S with respect to a function f : S → {0, 1} and scoring rule s is

score_{s,f}(R, x) := E[s_{f(x)}(pred(R, x))].

When the function f is clear from the context, for notational simplicity we often omit it and write score_s(R, x). Additionally, when s = hs, we sometimes omit it and write simply score(R, x).

The following lemma is key. It says that we can convert any algorithm which achieves bias γ into a forecasting algorithm which achieves expected score at least γ²/2 under the hs scoring rule; further, this conversion only manipulates the output of the algorithm, meaning it can be applied without changing the cost. That is, to turn R into a forecasting algorithm, we only need to run R, get an output 0 or 1, and then erase the output and write (1 − γ)/2 or (1 + γ)/2, respectively. Moreover, it is possible to convert backward as well! To turn a forecasting algorithm R into a normal randomized algorithm, run R, take the output q ∈ [0, 1], erase it and write down a sample from Bernoulli(q) instead. If the original forecasting algorithm achieved expected score η, the new algorithm will achieve bias at least η. In particular, this lemma tells us that the best expected score and the best bias that an algorithm can make (under any cost restriction) are always quadratically related.

Lemma 3.15 (Conversion between regular and forecasting algorithms).
A general algorithm R achieving worst-case bias γ > 0 for a function f can be converted into a forecasting algorithm R′ with worst-case score at least 1 − √(1 − γ²) ≥ γ²/2 for f. This conversion is pointwise: it depends only on changing a sample from the random variable output(R, x) after receiving it, as well as on the value of the worst-case bias γ.

Conversely, a forecasting algorithm R with worst-case score η can be converted into a general algorithm R′ with worst-case bias at least η. This conversion is pointwise: it depends only on changing a sample from pred(R, x) after receiving it (and not even on the value of η).

Proof. Start with a general algorithm R with worst-case bias γ > 0. On input x, run R to receive a sample b ∈ {0, 1} from output(R, x). Then output pred(R′, x) = (1 − γ)/2 if b = 0 and output pred(R′, x) = (1 + γ)/2 if b = 1. It is clear that this R′ was constructed in a pointwise fashion out of R, depending only on a sample from output(R, x). Now, fix x ∈ S, and let p ∈ [0, 1] be the probability that output(R, x) gives the right answer. Since R has worst-case bias γ, it has bias at least γ on x, so p ≥ (1 + γ)/2. The expected score of R′ on x is then

score(R′, x) = p hs((1 + γ)/2) + (1 − p) hs((1 − γ)/2)
 = p(1 − √((1 − γ)/(1 + γ))) + (1 − p)(1 − √((1 + γ)/(1 − γ)))
 = 1 − √((1 + γ)/(1 − γ)) + p(√((1 + γ)/(1 − γ)) − √((1 − γ)/(1 + γ)))
 ≥ 1 − √((1 + γ)/(1 − γ)) + ((1 + γ)/2)(√((1 + γ)/(1 − γ)) − √((1 − γ)/(1 + γ)))
 = 1 − ((1 − γ)/2)√((1 + γ)/(1 − γ)) − ((1 + γ)/2)√((1 − γ)/(1 + γ))
 = 1 − √(1 − γ²).

For the other direction, let R be a forecasting algorithm with worst-case score η > 0. On input x, run R to receive a sample q ∈ [0, 1] from pred(R, x). Then output 1 with probability q and 0 with probability 1 − q, i.e. output(R′, x) ∼ Bernoulli(q). It is clear that this R′ is constructed in a pointwise fashion out of R (without even a dependence on η). Now, fix x ∈ S. We know that η ≤ score(R, x) = E[hs_{f(x)}(pred(R, x))]. Now, we note that hs_{f(x)}(p) ≤ bias_{f(x)}(p) by Lemma 3.12. Thus we get η ≤ E[bias_{f(x)}(pred(R, x))] = bias_f(R′, x), as desired.

To demonstrate the power of these lemmas, observe that they imply a well-known amplification theorem for randomized algorithms, as we show in the lemma below. Note that this lemma does not refer to scoring rules or forecasting algorithms at all; those only appear as proof techniques.

Lemma 3.16 (informal). A randomized algorithm with bias γ can be amplified to bias 1/2 by repeating it 2/γ² times.

Proof. Start with an algorithm achieving bias γ. Using Lemma 3.15, get a forecasting algorithm with expected score at least 1 − √(1 − γ²) ≥ γ²/2. Using Lemma 3.10, repeating the algorithm k times increases the expected score on each input x to at least 1 − (1 − γ²/2)^k. Using Lemma 3.15, we get an algorithm with worst-case bias at least 1 − (1 − γ²/2)^k. Using Lemma 3.11, this is at least min{kγ²/4, 1/2}. Picking k ≥ 2/γ², we get an algorithm with worst-case bias at least 1/2 using only k repetitions of the original algorithm, as desired.

4 Randomized query and communication complexity
To prove a strong minimax theorem for randomized query complexity, we start by formally definingforecasting algorithms in the query complexity setting. We will need these forecasting algorithmsas a tool, despite our final statement not referring to them.
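Before the formal definitions, the objects involved are easy to prototype. Below is a minimal Python sketch (the dict-based representation and all names are ours, purely illustrative, not the paper's formalism) of a deterministic forecasting decision tree: internal nodes query a position of the input, leaves hold a prediction in [0, 1], the cost on x is the depth of the leaf reached, and the expected hs score against a distribution µ is computed as in Definition 3.8.

```python
import math

# A tree node is either a leaf {"pred": q} with q in [0, 1], or an internal
# node {"query": i, "children": {symbol: subtree}} querying position i of x.

def hs(q):
    """The hs scoring rule: hs(q) = 1 - sqrt((1 - q)/q), with hs(0) = -inf."""
    return 1.0 - math.sqrt((1.0 - q) / q) if q > 0 else float("-inf")

def run(tree, x):
    """Run the tree on input x; return (depth of leaf reached, prediction)."""
    depth = 0
    while "pred" not in tree:
        tree = tree["children"][x[tree["query"]]]
        depth += 1
    return depth, tree["pred"]

def cost_and_score(tree, mu, f):
    """Expected depth and expected hs score against distribution mu, where
    f(x) in {0, 1} is the correct outcome (recall hs_0(q) = hs(1 - q))."""
    cost = score = 0.0
    for x, w in mu.items():
        d, q = run(tree, x)
        cost += w * d
        score += w * (hs(q) if f(x) == 1 else hs(1.0 - q))
    return cost, score

# Example: query bit 0; query bit 1 only if bit 0 is 1.
tree = {"query": 0, "children": {
    "0": {"pred": 0.2},                      # fairly confident that f = 0
    "1": {"query": 1, "children": {
        "0": {"pred": 0.5}, "1": {"pred": 0.9}}},
}}

def f(x):
    return 1 if x == "11" else 0             # target: AND of the two bits

mu = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}
cost, score = cost_and_score(tree, mu, f)
```

For this example tree and the uniform distribution on {0, 1}², the expected cost works out to 1.5 and the expected hs score to 5/12; Theorem 4.2 below is a minimax statement about exactly this cost-to-score ratio, optimized over trees and over input distributions.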
Definition 4.1. A deterministic forecasting decision tree (on n ∈ N bits, with finite alphabet Σ) is a rooted tree on n bits whose internal vertices are labeled by [n], where each internal vertex has |Σ| children labeled by Σ, and where the leaves are labeled by [0, 1].

A randomized forecasting decision tree (on n ∈ N bits, with finite alphabet Σ) is a probability distribution over finitely many deterministic forecasting decision trees.

We interpret a randomized forecasting decision tree as a forecasting algorithm in the intuitive way, where cost(R, x) is the expected height of R on x (the expected height of the leaf of x in a deterministic forecasting tree sampled from the distribution R), and where pred(R, x) is the random variable which samples from the leaf label when a random deterministic tree from R is run on x. Note that since we restrict to distributions with finite support, we do not need to invoke measure theory or integrals in interpreting these probabilities and expectations, even though there are uncountably many deterministic forecasting decision trees.

We extend cost(R, ·) to the set ∆ of probability distributions over S by writing cost(R, µ) = E_{x←µ}[cost(R, x)], and similarly for score(R, µ) = E_{x←µ}[score(R, x)]. We now show a minimax theorem for the ratio of cost to score⁺ for randomized forecasting algorithms. This minimax theorem will form the base of our final result: we will convert the left-hand side to R(f), and convert the right-hand side to some desirable properties of a hard distribution µ.

Theorem 4.2.
Let n ∈ N, let Σ be a finite alphabet, let S ⊆ Σ^n, and let f : S → {0, 1}. Let R be the set of all randomized forecasting decision trees on n bits with alphabet Σ. Let ∆ be the set of probability distributions over S. Then

inf_{R∈R} max_{x∈S} cost(R, x)/score(R, x)⁺ = max_{µ∈∆} inf_{R∈R} cost(R, µ)/score(R, µ)⁺,

and the maximums are attained.

Proof. We use Theorem 2.18. All we need to do is verify that the conditions of the theorem hold. Our first task will be to deal with the strange set R; we wish to turn it into a convex subset of a real topological vector space. To do so, we define the vector v_R ∈ R^{2|S|} for each R ∈ R by v_R[x, 1] = cost(R, x) and v_R[x, 2] = score(R, x), and consider the set V = {v_R : R ∈ R}. For a vector v ∈ V, we define cost(v, x) = v[x, 1] and score(v, x) = v[x, 2], and we extend these definitions to cost(v, µ) and score(v, µ) by taking expectations over µ. Then it is clear that optimizing some function of cost(R, µ) and score(R, µ) over R is the same as optimizing the corresponding function of cost(v, µ) and score(v, µ) over V. Hence it suffices to show that

inf_{v∈V} max_{x∈S} cost(v, x)/score(v, x)⁺ = max_{µ∈∆} inf_{v∈V} cost(v, µ)/score(v, µ)⁺,

with the maximums attained.

To do so, we first note that V ⊆ R^{2|S|} is convex. This is because if v_1, v_2 ∈ V and λ ∈ (0, 1), we know there are algorithms R_1, R_2 ∈ R such that v_1 = v_{R_1} and v_2 = v_{R_2}, and then the algorithm λR_1 + (1 − λ)R_2 (which mixes the distributions R_1 and R_2 over deterministic forecasting decision trees) is a valid member of R. Then we have v_{λR_1+(1−λ)R_2}[x, 1] = cost(λR_1 + (1 − λ)R_2, x) = λ cost(R_1, x) + (1 − λ) cost(R_2, x) = λv_{R_1}[x, 1] + (1 − λ)v_{R_2}[x, 1], and similarly v_{λR_1+(1−λ)R_2}[x, 2] = λv_{R_1}[x, 2] + (1 − λ)v_{R_2}[x, 2], so v_{λR_1+(1−λ)R_2} = λv_{R_1} + (1 − λ)v_{R_2}.

Next, we note that cost(v, ·) and score(v, ·) are linear functions of µ; this is because they are defined as expectations over µ. Further, observe that cost(·, µ) and score(·, µ) are linear in v. It is also clear that cost(v, µ) and score(v, µ) are continuous in both v and µ.

It remains to check that cost and score are well-behaved. First, note that there is always an algorithm which queries all the bits and outputs the right answer f(x) with perfect confidence. Such an algorithm R has cost(v_R, µ) = n and score(v_R, µ) = 1 for all µ, so finite costs and scores are attainable. Next, note that if R is such that cost(v_R, µ) = 0 for any µ, then R must make no queries when run on µ. This means R makes no queries when run on any input, so cost(v_R, µ′) = 0 for all µ′ ∈ ∆. Finally, note that cost(·, µ) is linear for each µ, so if cost(v, µ) = 0 and cost(v′, µ) > 0, we necessarily have cost(λv + (1 − λ)v′, µ) > 0 for λ ∈ (0, 1). Hence all the conditions of Theorem 2.18 are satisfied, and the desired result follows.

Our next task is to relate the left-hand side of the equation in the last theorem to R(f).

Theorem 4.3.
Using the notation of Theorem 4.2, we have

inf_{R∈R} max_{x∈S} cost(R, x)/score(R, x)⁺ ≥ R(f)/240.

To prove this theorem, the idea is to take R from the left-hand side, amplify the score of R up to a constant (using the fact that score amplifies linearly), and then convert the constant score to constant bias (and hence constant error), getting an upper bound on R(f). This is slightly tricky, because the amount we need to amplify by may depend on the input x; for some x, both cost(R, x) and score(R, x) may be small, while for other x they are both large. Unfortunately, we do not have access to score(R, x) when we receive input x. Instead, in order to amplify by approximately the correct amount, we estimate cost(R, x) (by repeatedly running R on x and observing the number of queries), and we use this cost estimate to decide the amount of amplification needed.

Proof.
Let Y* be the optimal value of the left-hand side, and let R be an algorithm such that max_{x∈S} cost(R, x)/score(R, x)⁺ = Y, where Y is arbitrarily close to Y* (and Y ≥ Y*). Then in particular, score(R, x) > 0 for all x ∈ S, and for each x ∈ S we have cost(R, x)/score(R, x) ≤ Y.

Let R′ be a modification of R where we cut off each decision tree in the support of R after 2Y queries, and return 1/2 in case of a cutoff (ensuring we get a score of 0 for that branch). Note that by Markov's inequality, the probability of encountering a cutoff branch on input x to R′ is at most cost(R, x)/2Y ≤ Y score(R, x)/2Y = score(R, x)/2. Since each non-cut-off leaf can contribute at most 1 to the score (as the maximum of hs(·) is 1), and since the score at a cutoff is 0, the decrease in score when going from R to R′ is at most the probability of encountering a cutoff. It follows that score(R′, x) ≥ score(R, x) − score(R, x)/2 = score(R, x)/2 for all x ∈ S.

Next, we describe a randomized forecasting algorithm R′′. The algorithm R′′ runs R′ on x until the number of queries made reaches 10Y. Let L be the number of runs of R′ on x it takes to reach 10Y queries. Then R′′ runs R′ on x an additional L times, and uses those new runs to amplify the score, achieving score 1 − (1 − score(R′, x))^L. We wish to prove this score is at least a constant and that the total number of queries is only O(Y).

First, we bound the expectation of L, the random variable for the number of runs of R′ on x it takes to reach 10Y queries. Let X_i be i.i.d. random variables each representing the number of queries in a single run of R′ on x (so each X_i is supported on {0, 1, ..., 2Y}). Consider the total number of queries made until the cutoff is reached; this is ∑_{i=1}^L X_i. Let I_i be the Boolean random variable which is 0 if L < i and 1 if L ≥ i. Then ∑_{i=1}^L X_i = ∑_{i=1}^∞ X_i I_i. Note that the value of ∑_{i=1}^L X_i is always at most 10Y + 2Y, because after the threshold 10Y is reached, less than one full run of R′ on x will happen (using at most 2Y queries). Hence

12Y > E[∑_{i=1}^L X_i] = E[∑_{i=1}^∞ X_i I_i] = ∑_{i=1}^∞ E[X_i I_i]
 = ∑_{i=1}^∞ (Pr[I_i = 0] E[X_i I_i | I_i = 0] + Pr[I_i = 1] E[X_i I_i | I_i = 1])
 = ∑_{i=1}^∞ Pr[L ≥ i] E[X_i]
 = cost(R′, x) E[L].

(The equality E[∑_{i=1}^L X_i] = E[X_1] E[L], which we rederive here, is known as Wald's equation.) It follows that E[L] < 12Y/cost(R′, x). This means the total expected number of queries R′′ makes is at most 12Y for getting the estimate L, plus cost(R′, x) · E[L] < 12Y for amplifying the score, for a total of fewer than 24Y expected queries.

To bound the expected score, we start by ensuring L is not too small except with small probability. Note that for a constant T, we have Pr[L ≤ T] = Pr[∑_{i=1}^T X_i ≥ 10Y]. The sum ∑_{i=1}^T X_i has expected value T cost(R′, x) and has variance T times the variance of one X_i. Since X_i is non-negative and bounded above by 2Y, its variance is bounded above by Var[X_i] ≤ E[X_i²] ≤ 2Y E[X_i] = 2Y cost(R′, x). Hence, the variance of the sum is at most 2TY cost(R′, x). We use Chebyshev's inequality, writing

Pr[L ≤ T] = Pr[∑_{i=1}^T X_i ≥ 10Y] = Pr[∑_{i=1}^T X_i − T cost(R′, x) ≥ 10Y − T cost(R′, x)]
 ≤ 2TY cost(R′, x)/(10Y − T cost(R′, x))²,

which holds assuming T ≤ 10Y/cost(R′, x). In particular, if T = 2Y/cost(R′, x), then Pr[L ≤ T] ≤ 1/16.

Now, note that conditioned on L = ℓ, the expected score in the second round of R′′ is at least 1 − (1 − score(R′, x))^ℓ. This is increasing in ℓ; hence, conditioned on L > T, the expected score of R′′ on x is greater than 1 − (1 − score(R′, x))^T. Conditioned on L ≤ T, we still have the expected score be at least 0, since it is at least 0 for every fixed ℓ. Hence the final expected score of R′′ on x is greater than

(1 − (1 − score(R′, x))^T)(1 − Pr[L ≤ T]) ≥ 1 − (1 − score(R′, x))^T − Pr[L ≤ T].

Using T = 2Y/cost(R′, x), we get

score(R′′, x) > 1 − (1 − score(R′, x))^{2Y/cost(R′, x)} − 1/16
 ≥ (1/2) min{1, 2Y score(R′, x)/cost(R′, x)} − 1/16
 ≥ (1/2) min{1, Y score(R, x)/cost(R, x)} − 1/16
 ≥ 1/2 − 1/16 = 7/16.

This algorithm R′′ makes fewer than 24Y expected queries. We cut it off after 240Y queries, outputting prediction 1/2 (getting score 0) in case of a cutoff; this gives an algorithm R′′′ whose worst-case number of queries is 240Y, and whose expected score on each x ∈ S is at least 7/16 − 1/10 ≥ 1/3. Using Lemma 3.15, we can view R′′′ as a randomized algorithm computing f(x) with worst-case bias at least 1/3, and hence worst-case error at most 1/3. This means that R(f) ≤ 240Y. Since we can pick Y arbitrarily close to Y*, we also get that R(f) is at most the infimum of Y over feasible choices of Y, which is Y*, and the desired result follows.

Our next task is to show that the max-inf side of Theorem 4.2 gives us a distribution µ against which it is hard to tell apart 0-inputs from 1-inputs, in terms of the achievable squared-Hellinger distance between the distributions of the transcript on the 0- and 1-inputs. The following lemma will come in useful. We prove it in Appendix B.

Lemma 4.4 (Hellinger distance of disjoint mixtures). Let µ be a distribution over a finite support A, and for each a ∈ A, let ν_0^a and ν_1^a be two distributions over a finite support S_a. Let ν_0^µ and ν_1^µ denote the mixture distributions where a ← µ is sampled, and then a sample is produced from ν_0^a or ν_1^a respectively. Assume the sets S_a are disjoint for all a ∈ A. Then

h²(ν_0^µ, ν_1^µ) = E_{a←µ}[h²(ν_0^a, ν_1^a)].

Theorem 4.5.
Let n ∈ N, let Σ be a finite alphabet, let S ⊆ Σ^n, and let f : S → {0, 1} be a non-constant function. Then there exist distributions µ_0 on f^{-1}(0) and µ_1 on f^{-1}(1) such that for all randomized query algorithms R,

cost(R, µ)/h²(tran(R, µ_0), tran(R, µ_1)) ≥ R(f)/240.

Here µ = (µ_0 + µ_1)/2, and we interpret r/0 := ∞ for r ∈ [0, ∞).

Proof. Using Theorem 4.2 and Theorem 4.3, we get a distribution µ on S such that for all randomized forecasting algorithms R, we have cost(R, µ)/score(R, µ)⁺ ≥ R(f)/240. Note that it must be the case that an algorithm R which makes no queries must have score(R, µ) ≤ 0; this is because we have R(f) ≥ 1 (since f is non-constant), and if there was an algorithm with cost 0 achieving positive score, we'd have cost(R, µ)/score(R, µ)⁺ = 0, giving a contradiction. Therefore, it must be the case that µ places equal weight on 0- and 1-inputs, because otherwise a 0-cost algorithm could indeed predict f(x) with positive bias (and hence positive score by Lemma 3.15) against µ. We set µ_0 to be the conditional distribution of µ on the 0-inputs of f, and set µ_1 to be the conditional distribution of µ on the 1-inputs of f.

Next, we simplify the expression inf_R cost(R, µ)/h²(tran(R, µ_0), tran(R, µ_1)). We view a randomized query algorithm R as an element of the convex hull of the set of all deterministic decision trees with no leaf labels. Now, note that cost(R, µ) and h²(tran(R, µ_0), tran(R, µ_1)) are both linear functions of (the probability vector of) R; for the latter, this is due to Lemma 4.4. Then by Lemma 2.12, the ratio is quasiconcave in R, and by Lemma 2.10, the infimum of this ratio over randomized query algorithms R is equal to the minimum over deterministic query algorithms A. Therefore, it suffices to show that for each deterministic query algorithm A making a non-zero number of queries, we have cost(A, µ)/h²(tran(A, µ_0), tran(A, µ_1)) ≥ R(f)/240.

Fix such A. We assume its leaves are not labeled. By Lemma 3.9, we can label the leaves of A such that score(A, µ) = h²(tran(A, µ_0), tran(A, µ_1)). This labeling does not affect the cost. Then

cost(A, µ)/h²(tran(A, µ_0), tran(A, µ_1)) = cost(A, µ)/score(A, µ)⁺ ≥ R(f)/240,

as desired.

Finally, we strengthen this to a lower bound for the minimum of cost(R, µ_0) and cost(R, µ_1), instead of for their average cost(R, µ).

Theorem 4.6.
Let n ∈ N, let Σ be a finite alphabet, let S ⊆ Σ^n, and let f : S → {0, 1} be a non-constant function. Then there exist distributions µ_0 on f^{-1}(0) and µ_1 on f^{-1}(1) such that for all randomized query algorithms R,

min{cost(R, µ_0), cost(R, µ_1)}/h²(tran(R, µ_0), tran(R, µ_1)) ≥ R(f)/3000,

where we interpret r/0 := ∞ for r ∈ [0, ∞).

Proof. We use µ_0 and µ_1 from Theorem 4.5. Note that

inf_R min{cost(R, µ_0), cost(R, µ_1)}/h²(tran(R, µ_0), tran(R, µ_1))
 = inf_{R, b∈{0,1}} cost(R, µ_b)/h²(tran(R, µ_0), tran(R, µ_1))
 = min_{b∈{0,1}} inf_R cost(R, µ_b)/h²(tran(R, µ_0), tran(R, µ_1)).

By the same argument as in the proof of Theorem 4.5, this last infimum over R is equal to the infimum over deterministic unlabeled decision trees D with height at least 1.

Let D be such an algorithm. By Theorem 4.5, it suffices to show that

min{cost(D, µ_0), cost(D, µ_1)}/h²(tran(D, µ_0), tran(D, µ_1)) ≥ (1/c) min_{D′} cost(D′, µ)/h²(tran(D′, µ_0), tran(D′, µ_1)),

where µ = (µ_0 + µ_1)/2 and c is a constant satisfying 240c ≤ 3000. By Lemma 3.9, we can label the leaves of D so that we have the property h²(tran(D, µ_0), tran(D, µ_1)) = score(D, µ), and similarly for D′. The desired inequality is trivial when score(D, µ) = 0 (since the ratio is then ∞), so suppose score(D, µ) > 0. We wish to show

min{cost(D, µ_0), cost(D, µ_1)}/score(D, µ) ≥ (1/c) min_{D′} cost(D′, µ)/score(D′, µ).

In other words, we wish to show that there exists a deterministic forecasting algorithm D′ such that cost(D′, µ)/score(D′, µ) ≤ c cost(D, µ_b)/score(D, µ), regardless of whether b = 0 or b = 1.

We construct such a D′. The idea is to start with D, and then cut off the branches that are much more likely under µ_{1−b} than under µ_b. That is, for a vertex v of D, let µ_b[v] denote the probability that v is reached when D is run on an input from µ_b, and define µ_{1−b}[v] similarly. Recall that the leaves of D are labeled according to the strategy that achieves score(D, µ) = h²(tran(D, µ_0), tran(D, µ_1)), which, by Lemma 3.9, is such that at a leaf v, the algorithm D outputs µ_1[v]/2µ[v].

Pick a constant a ∈ (1/2, 1), and let D′ be the algorithm which cuts off D the first time it enters a vertex v for which µ_{1−b}[v]/2µ[v] ≥ a, and outputs a (if b = 0) or 1 − a (if b = 1) instead of continuing to run D. Let V be the set of all vertices which cause such a cutoff; note that no vertex in V is a descendant of another vertex in V. For v ∈ V, let µ^v be the distribution µ conditioned on reaching v, and similarly define µ_0^v and µ_1^v. Let µ* be the distribution µ conditioned on reaching none of the vertices in V, and similarly define µ_0* and µ_1*. Since we are dealing with a deterministic decision tree, the distributions µ^v have disjoint supports for all the different v ∈ V, and they're also disjoint from µ*; indeed, µ is a disjoint mixture of all these distributions. It follows that score(D, µ) is a mixture of the terms score(D, µ^v) and of score(D, µ*). The score score(D′, µ) of the algorithm D′ is also such a mixture.

Now, note that score(D, µ^v) ≤ 1, and that score(D′, µ^v) = E_{x←µ^v}[hs_{f(x)}(a)] if b = 0 and score(D′, µ^v) = E_{x←µ^v}[hs_{f(x)}(1 − a)] if b = 1. This means

score(D′, µ^v) = (µ_b[v]/2µ[v]) hs(1 − a) + (µ_{1−b}[v]/2µ[v]) hs(a)
 = (1 − p) hs(1 − a) + p hs(a)
 = 1 − (1 − p)√(a/(1 − a)) − p√((1 − a)/a),

where p = µ_{1−b}[v]/2µ[v] ≥ a. Since a > 1/2, this is increasing in p, so we have score(D′, µ^v) ≥ 1 − 2√(a(1 − a)), and hence score(D′, µ^v) ≥ (1 − 2√(a(1 − a))) score(D, µ^v). It also holds that score(D′, µ*) = score(D, µ*) ≥ (1 − 2√(a(1 − a))) score(D, µ*). Since score(D, µ) and score(D′, µ) are matching mixtures of score(D, µ^v) and score(D′, µ^v) respectively, it follows that score(D′, µ) ≥ (1 − 2√(a(1 − a))) score(D, µ).

We now analyze the cost of D′. Note that cost(D′, µ) = (1/2) cost(D′, µ_b) + (1/2) cost(D′, µ_{1−b}); we clearly have cost(D′, µ_b) ≤ cost(D, µ_b), so it suffices to upper bound cost(D′, µ_{1−b}). This is the expected height of a leaf D′ reaches when run on µ_{1−b}, which is a mixture of cost(D′, µ*_{1−b}) and the terms cost(D′, µ^v_{1−b}). Now, note that a leaf u reached by D′ under µ*_{1−b} must have µ_{1−b}[u]/2µ[u] < a, or µ_b[u] > (1 − a)/a · µ_{1−b}[u]. It follows that cost(D′, µ*_{1−b}) ≤ a/(1 − a) · cost(D′, µ*_b) = a/(1 − a) · cost(D, µ*_b). Similarly, for each v ∈ V, the parent u of v satisfies µ_{1−b}[u]/2µ[u] < a, meaning that µ_b[u] > (1 − a)/a · µ_{1−b}[u]; note that since this parent u of v is not a leaf, conditioned on reaching u the height of the path will always be at least the height of v (one more than the height of u); since cost(D′, µ^v_{1−b}) is exactly the height of v, we necessarily have cost(D, µ^v_b) ≥ cost(D′, µ^v_b) ≥ (1 − a)/a · cost(D′, µ^v_{1−b}). We conclude that cost(D′, µ_{1−b}) ≤ a/(1 − a) · cost(D, µ_b), and hence

cost(D′, µ) ≤ (1/2 + a/(2(1 − a))) cost(D, µ_b) = cost(D, µ_b)/(2(1 − a)).

We therefore have

cost(D′, µ)/score(D′, µ) ≤ 1/(2(1 − a)(1 − 2√(a(1 − a)))) · cost(D, µ_b)/score(D, µ).

Optimizing over a, we pick a = (2 + √2)/4 to get

cost(D′, µ)/score(D′, µ) ≤ (6 + 4√2) cost(D, µ_b)/score(D, µ),

from which the desired result follows (note that 240(6 + 4√2) ≤ 3000).

Corollary 4.7.
Let n ∈ N, let Σ be a finite alphabet, let S ⊆ Σ^n, and let f : S → {0, 1} be a function. Then there exists a distribution µ on S such that for all γ ∈ [0, 1],

R^µ_γ̇(f) ≥ γ² R(f)/500.

Here γ̇ = (1 − γ)/2 and R^µ_ǫ(f) denotes the average cost (against µ) of a randomized algorithm achieving error at most ǫ (against µ) for solving f.

Proof. If f is constant, then R(f) = 0 and the desired bound trivially follows. Therefore, assume f is not constant. We use the distribution µ from Theorem 4.5. Let R be a randomized algorithm which achieves bias γ against µ. Then using Lemma 3.15, we can convert R into a forecasting algorithm R′ which achieves expected score 1 − √(1 − γ²) ≥ γ²/2 against µ, and has the same distribution over query trees (that is, only the leaves changed). Now, by the property of µ, we know that

cost(R′, µ)/score(R′, µ) ≥ R(f)/240,

where we used Lemma 3.9 to get a result for score instead of Hellinger distance in the denominator, and where we used the fact that R achieves non-zero bias against µ (despite µ being balanced between 0- and 1-inputs) to conclude that R does not make 0 queries. Using score(R′, µ) ≥ γ²/2 and cost(R, µ) = cost(R′, µ), we get 2 cost(R, µ)/γ² ≥ R(f)/240, or cost(R, µ) ≥ γ² R(f)/480 ≥ γ² R(f)/500, as desired.

Theorem 1.8 (Restated). For any non-constant partial function F : X × Y → {0, 1} over finite sets X and Y, there is a pair of distributions µ_0 on F^{-1}(0) and µ_1 on F^{-1}(1) such that for any public-randomness communication protocol Π, the squared Hellinger distance between the distribution of its transcripts on µ_0 and µ_1 is bounded above by

h²(tran(Π, µ_0), tran(Π, µ_1)) = O(min{cost(Π, µ_0), cost(Π, µ_1)}/RCC(F)).

Proof.
This theorem follows directly from Theorem 4.6 once we realize that a communication function can be interpreted as a query function. That is, we take F and convert it into a query function f as follows. The input to f will contain one bit for each possible function of X (that Alice might send to Bob), and one bit for each possible function of Y (that Bob might send to Alice), for a total input length of n = 2^|X| + 2^|Y|. The inputs to f will be the strings in {0, 1}^n which are generated by a pair (x, y) ∈ S; that is, the strings z ∈ {0, 1}^n for which there exists a pair (x, y) ∈ S such that z_k is the result of applying the k-th possible function to x (if k ≤ 2^|X|) or the (k − 2^|X|)-th possible function to y (if k > 2^|X|). Then f is a Boolean function with a domain of size |S|, with each string in its domain corresponding to a string in S.

We note that RDT(f) = RCC(F). This is clear from the definition of RCC(F): the public-coin randomness essentially means that Alice and Bob agree on a randomized decision tree in advance, including on who speaks when (as a function of the transcript), which is equivalent to agreeing on a decision tree for f in advance. The transcript of f on an input is precisely the transcript of F on the corresponding input, with the catch that in query complexity we defined the transcript to include the deterministic decision tree used by the protocol; hence, the query version of a transcript of f actually corresponds to (R, Π) for F, where R is the public randomness and Π is the usual communication complexity transcript. The desired result then follows immediately from applying Theorem 4.6 to f.

Corollary 4.8.
Let X and Y be finite sets, let S ⊆ X × Y, and let F : S → {0, 1} be a function. Then there exists a distribution µ on S such that for all γ ∈ [0, 1],

RCC^µ_γ̇(F) = Ω(γ² RCC(F)).

In contrast to the classical case, it is well-known that quantum algorithms can be amplified linearly in 1/γ, where γ is the bias. Formally, we have the following theorem.

Theorem 5.1 (Amplitude estimation). Suppose we have access to a unitary U (representing a quantum algorithm) which maps |0⟩ to |ψ⟩, as well as access to a projective measurement Π, and we wish to estimate p := ‖Π|ψ⟩‖² (representing the probability that the quantum algorithm accepts). Fix ǫ, δ ∈ (0, 1/2). Then using at most (100/ǫ) · ln(1/δ) controlled applications of U or U† and at most that many applications of I − 2Π, we can output p̃ ∈ [0, 1] such that |p̃ − p| ≤ ǫ with probability at least 1 − δ.

This theorem follows from [BHMT02], as well as from the arguably simpler techniques in [AR20]. (In fact, these authors show something slightly stronger: amplitude estimation can be done with overhead O((1/√ǫ + √p · (1/ǫ)) · log(1/δ)). We refer the interested reader to Appendix C to see how this follows from [BHMT02].)

Given that quantum algorithms can be amplified linearly in the bias, it would seem that the desired minimax theorem follows easily from Theorem 2.18: simply apply a minimax to cost(Q, µ)/bias_f(Q, µ)^+, where Q is a quantum algorithm and µ is a distribution over the inputs. Then use the linear amplification result to argue that min_Q max_µ cost(Q, µ)/bias_f(Q, µ)^+ is Θ(Q(f)). Sounds simple! (This works better than for randomized algorithms, because bias_f(·, ·) is saddle while bias_f(·, ·)² is not.) Unfortunately, there is an annoying hole in this argument: the function cost(Q, µ) is not convex in Q.
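Before turning to the convexity issue, it is worth quantifying the classical/quantum contrast mentioned above with a quick calculation that is not from the paper: classically, detecting a bias γ by majority vote over independent runs requires on the order of 1/γ² runs, while amplitude estimation pays only about 1/ǫ. The function below computes the exact success probability of a classical majority vote; the trial counts 21 and 401 are illustrative choices near 1/γ and 1/γ² for γ = 0.05.

```python
from math import comb

def majority_success(gamma: float, n: int) -> float:
    """Probability that the majority of n independent votes is correct,
    when each vote is correct with probability (1 + gamma) / 2 (bias gamma).
    n is assumed odd so there are no ties."""
    p = (1 + gamma) / 2
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n // 2) + 1, n + 1))

gamma = 0.05
few = majority_success(gamma, 21)    # ~1/gamma trials: barely better than 1/2
many = majority_success(gamma, 401)  # ~1/gamma**2 trials: constant success
```

This is the quadratic overhead that the quadratic-vs-linear distinction in the text refers to; the quantum amplification results avoid it.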
While it is not immediately clear what a convex combination of two quantum algorithms Q_0 and Q_1 should be, most intuitive definitions will have the convex combination use a number of unitaries that is equal to the maximum of the number used in Q_0 and Q_1, rather than the average. To get around this, we switch the computational model from quantum algorithms to probability distributions over quantum algorithms. These probabilistic quantum algorithms have outputs and biases defined in the intuitive way, but their cost is defined as the expected cost of the underlying quantum algorithms, rather than the maximum cost. This ensures the function cost(·, ·) will be saddle, and Theorem 2.18 can be applied. The trick then becomes showing that these probabilistic quantum algorithms can still be amplified linearly. This turns out to be true, up to logarithmic factors. Once amplified, constant-error probabilistic quantum algorithms can be converted into ordinary quantum algorithms, giving us a minimax theorem that can be applied to ordinary quantum algorithms as well.

5.1 Quantum query complexity

Our goal in this section will be to prove the following theorem.
Theorem 5.2.
For any Boolean-valued function f, there exists a distribution µ over Dom(f) such that for any γ ∈ [0, 1], we have Q^µ_γ̇(f) ≥ γ · Ω̃(Q(f)). Here Q^µ_γ̇(f) denotes the minimum number of queries required by a quantum algorithm which achieves bias γ against µ for computing f. The constants in the Ω̃ notation are universal.

In fact, we will prove a stronger (and tighter) version in terms of probabilistic quantum algorithms. These are simply probability distributions over quantum algorithms of possibly different query costs; we define the cost of a probabilistic quantum algorithm as the expected cost of a quantum algorithm sampled from the probability distribution.
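The convexity fix can be made concrete with a toy cost model (a sketch with hypothetical names, not code from the paper): representing a probabilistic quantum algorithm by its support over (probability, query cost) pairs, the cost of a convex combination is the average of the costs, not the maximum, which is exactly what makes cost(·, ·) linear in the algorithm.

```python
from dataclasses import dataclass

@dataclass
class ProbQuantumAlg:
    """Toy model of a probabilistic quantum algorithm: a distribution
    over underlying algorithms, recorded as (probability, query cost) pairs."""
    support: list

    def expected_cost(self) -> float:
        return sum(p * c for p, c in self.support)

def mix(P1: ProbQuantumAlg, P2: ProbQuantumAlg, a: float) -> ProbQuantumAlg:
    """Convex combination a*P1 + (1-a)*P2 of two probabilistic algorithms."""
    return ProbQuantumAlg([(a * p, c) for p, c in P1.support] +
                          [((1 - a) * p, c) for p, c in P2.support])

P1 = ProbQuantumAlg([(1.0, 100)])  # always runs a 100-query algorithm
P2 = ProbQuantumAlg([(1.0, 0)])    # always guesses for free
M = mix(P1, P2, 0.5)
# Expected cost of the mixture is the average (50), not the maximum (100).
```

A worst-case (maximum) cost for the mixture would be 100, which is not linear in the mixing weight; the expected cost is, and this is the property Theorem 2.18 needs.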
Definition 5.3. A probabilistic quantum algorithm is a probability distribution P over quantum algorithms. For an input string x, we let P(x) be the random variable that outputs a sample from Q(x), where Q is a quantum algorithm sampled from P. The cost of P, denoted |P|, is the expected cost of a quantum algorithm sampled from P. The error of P on input x to a Boolean function f is defined as Pr_{Q∼P}[Q(x) ≠ f(x)].

Definition 5.4.
Let f be a Boolean-valued function with Dom(f) ⊆ Σ^n. We define PQ_γ̇(f) to be the minimum cost |P| of a probabilistic quantum algorithm P which computes f to worst-case bias γ.

Theorem 5.5.
For any Boolean function f and any γ ∈ (0, 1/2], we have PQ_γ̇(f) = Θ̃(γ Q(f)). More explicitly,

PQ_γ̇(f) = O(γ Q(f)),
PQ_γ̇(f) = Ω( γ Q(f) / (log(1/γ) log log(1/γ)) ).

Proof.
For the upper bound, let Q be a quantum algorithm computing f to error 1/3 using Q(f) queries. Let Q′ be the probabilistic quantum algorithm which runs Q with probability 3γ and otherwise uses no queries and guesses the output at random (with probability 1/2 for outputting each of 0 and 1). The probability of error of Q′ is at most 3γ(1/3) + (1 − 3γ)(1/2) = (1 − γ)/2, which means its bias is at least γ on every input. The expected number of queries Q′ uses is 3γ Q(f). Hence we have PQ_γ̇(f) ≤ 3γ Q(f).

For the lower bound, we start with a probabilistic quantum algorithm P which achieves worst-case bias γ and has cost |P| = PQ_γ̇(f), and make several modifications to it. First, we remove from the support of P all quantum algorithms which use more than 2|P|/γ queries, and we replace them with a 0-query quantum algorithm that guesses the answer at random (with probability 1/2 on each of the outputs 0 and 1). This gives us a probabilistic quantum algorithm P_1 which uses at most 2|P|/γ queries even in the worst case, which has |P_1| ≤ |P|, and whose worst-case bias is at least γ/2 (since by Markov's inequality, the probability mass over the removed quantum algorithms was at most γ/2, and they could have had bias at most 1, which turned into bias 0, decreasing the overall bias by at most γ/2).

Next, we modify P_1 to get a probabilistic algorithm P_2 which always uses a number of queries which is a power of 2. This can be done simply by increasing the number of queries each algorithm in the support of P_1 makes (and ignoring the extra queries). This way, we have |P_1| ≤ |P_2| ≤ 2|P_1|, the largest number of queries P_2 can make is at most 4|P|/γ, and the bias of P_2 is at least γ/2 on every input.

Further, we modify P_2 to get a probabilistic quantum algorithm P_3 which always uses at least 8|P| queries (but still only uses a number of queries which is a power of 2). This can be done by again increasing the number of queries a quantum algorithm in the support of P_2 makes, when necessary.
This adds at most an additive 16|P| queries (since the smallest power of 2 which is at least 8|P| is smaller than 16|P|). Hence |P_3| < |P_2| + 16|P| ≤ 18|P|. Note that P_3 achieves bias at least γ/2 on every input, and that P_3 always uses a number of queries which is a power of 2 in the range [8|P|, 4|P|/γ).

Finally, we modify P_3 to get P_4, which collapses together all quantum algorithms in the support of P_3 that use the same number of queries. That is, instead of placing support on two different quantum algorithms which both use (say) 64 queries, P_4 will place support on a single quantum algorithm which implements the mixture of both. This does not affect the number of queries or the bias of the algorithm. Hence we have |P_4| < 18|P|, and P_4 achieves bias at least γ/2 on each input. Further, P_4 has support on fewer than log(1/γ) quantum algorithms.

Next we introduce some notation for talking about P_4. Let L = ⌊log(1/γ)⌋ and let k be the smallest power of 2 which is at least 8|P|. Let the quantum algorithms in the support of P_4 be Q_1, Q_2, ..., Q_L, with Q_i using 2^{i−1} k queries for each i. Let p_i be the probability P_4 assigns to algorithm Q_i. Then p_i ≥ 0 for all i, and ∑_{i=1}^L p_i = 1. We also have ∑_{i=1}^L p_i 2^{i−1} k = |P_4| < 18|P|, which means ∑_{i=1}^L p_i 2^i < 5. On input x, let α_i(x) be the probability that Q_i outputs 1 when run on x, and let β_i(x) := 1 − 2α_i(x). This way, (−1)^{f(x)} β_i(x) is the bias of Q_i when run on x. Then ∑_{i=1}^L p_i β_i(x) is (−1)^{f(x)} times the bias of P_4 on x, which means that it is negative if f(x) = 1, positive if f(x) = 0, and satisfies |∑_{i=1}^L p_i β_i(x)| ≥ γ/2.

We now wish to amplify P_4 from bias γ/2 to constant bias. To do so, it suffices to estimate ∑_{i=1}^L p_i β_i(x) to additive error less than γ/2, and output the sign of this estimate. Our query budget for this task will be roughly |P|/γ.
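The role of the constraint ∑_i p_i 2^i < 5 just derived can be checked numerically: it makes a per-algorithm error budget proportional to 2^i affordable, so the combined estimate of ∑_i p_i β_i(x) stays well within the γ/2 target. The weights below are hypothetical, chosen only to satisfy the constraint; this is an illustrative sketch, not the paper's code.

```python
# Error accounting for combining per-algorithm bias estimates.
# Weight p_i goes to an algorithm whose bias beta_i(x) is estimated to within
# 2**i * gamma / 20; the constraint sum_i p_i * 2**i < 5 keeps the total small.
gamma = 0.05
weights = [0.7, 0.15, 0.08, 0.04, 0.03]  # hypothetical p_i for i = 1..5
budget = sum(p * 2 ** (i + 1) for i, p in enumerate(weights))
assert budget < 5  # the constraint from the proof
worst_total_error = sum(p * 2 ** (i + 1) * gamma / 20
                        for i, p in enumerate(weights))
# The sign of the estimated sum is then correct, since the true sum has
# absolute value at least gamma / 2, more than twice the total error.
assert worst_total_error < gamma / 4
```

The point of the geometric error budget is that the expensive algorithms (large i, cost 2^{i−1}k) are exactly the ones allowed coarse estimates, which is what keeps the total query cost near-linear in 1/γ.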
We know the values p_i, and seek to generate estimates β̃_i(x) for β_i(x). We will say an estimate β̃_i(x) is good if |β̃_i(x) − β_i(x)| ≤ 2^i γ/20. This way, if all β̃_i(x) are good, our final estimate for the sum will satisfy

| ∑_{i=1}^L p_i β̃_i(x) − ∑_{i=1}^L p_i β_i(x) | = | ∑_{i=1}^L p_i (β̃_i(x) − β_i(x)) | ≤ ∑_{i=1}^L p_i |β̃_i(x) − β_i(x)| ≤ ∑_{i=1}^L p_i 2^i γ/20 < γ/4,

where we used ∑_i p_i 2^i < 5.

To generate β̃_i(x), we use Theorem 5.1 on algorithm Q_i with ǫ = 2^i γ/20 and δ = 1/3L. Since the query cost of Q_i is 2^{i−1} k, this uses at most 1000 · (k/γ) · ln(3L) queries. Since k < 16|P| and L ≤ log(1/γ), this costs O(|P|/γ · log log(1/γ)). The query cost of generating all L estimates this way is therefore O(|P|/γ · log(1/γ) log log(1/γ)). The probability that any one estimate is not good is at most 1/3L by our choice of δ, so by the union bound, all are good except with probability 1/3; hence we've given a quantum algorithm which achieves worst-case bounded error for computing f, and whose query cost is O(PQ_γ̇(f)/γ · log(1/γ) log log(1/γ)), as desired.

Using this theorem, we now proceed to prove a strong minimax theorem for PQ_γ̇(f), showing that a single hard distribution µ works to lower bound this measure for all values of γ at once.

Theorem 5.6.
Fix a finite alphabet Σ as well as n ∈ ℕ. Let f be a Boolean-valued function with Dom(f) ⊆ Σ^n. Then there exists a distribution µ over Dom(f) such that for any γ ∈ [0, 1], we have PQ^µ_γ̇(f) ≥ γ · Ω̃(Q(f)), where the constants in the Ω̃ notation are universal.
As usual, the notation PQ^µ_γ̇(f) denotes the expected cost of a probabilistic quantum algorithm which is required to achieve bias at least γ against µ (rather than in the worst case); that is, the algorithm and the bias level γ are both allowed to depend on the distribution µ. Note that since PQ^µ_γ̇(f) is always smaller than Q^µ_γ̇(f) for any given bias level, this implies Theorem 5.2.

Proof.
Fix Σ, n, and f. Let R be the set of all probabilistic quantum algorithms for computing f. For each P ∈ R and each distribution µ over Dom(f), define cost(P, µ) := |P| and define score(P, µ) to be the bias P achieves against distribution µ for computing f (this will be in the range [−1, 1]).

We will use Theorem 2.18. It is clear that R is convex, and that Dom(f) is a nonempty finite set. Let ∆ denote the set of all probability distributions over Dom(f). Then cost and score are continuous functions R × ∆ → ℝ, with cost(·, ·) always non-negative, and both functions are linear in both variables. These functions are well-behaved, since finite cost and score can be achieved (some quantum algorithm computes f with positive bias), the cost is independent of the input, and mixing a zero-cost algorithm with a nonzero-cost algorithm gives a nonzero-cost algorithm. Hence Theorem 2.18 gives us

inf_{P∈R} max_{x∈Dom(f)} |P|/score(P, x)^+ = max_{µ∈∆} inf_{P∈R} |P|/score(P, µ)^+,

where we use the convention r/∞ = 0 for all r ∈ ℝ.

We simplify the left-hand side. For a probabilistic quantum algorithm P, we use bias_f(P) to denote its worst-case bias, that is, bias_f(P) := min_{x∈Dom(f)} score(P, x). Then the left-hand side is the infimum over P of |P|/bias_f(P)^+. Since a probabilistic algorithm P with bias_f(P) ≤ 0 will never be selected in this infimum, the left-hand side is equal to

inf_{γ∈(0,1]} inf_{P∈R_γ} |P|/γ,

where R_γ denotes the set of all probabilistic quantum algorithms which achieve worst-case bias at least γ. The inner infimum is the definition of (1/γ) · PQ_γ̇(f), so the left-hand side equals inf_{γ∈(0,1]} PQ_γ̇(f)/γ.

Note that this is at most 3Q(f), by picking γ = 1/3 and using PQ_γ̇(f) ≤ Q(f) for γ = 1/3. We claim there is no reason to use any γ ∈ (0, 1/(6Q(f))) in the infimum.
The reason is that if P is a probabilistic quantum algorithm achieving worst-case bias at least γ such that |P|/γ < 3Q(f), and if γ < 1/(6Q(f)), then |P| < 1/2, which means that P has nonzero support on zero-cost quantum algorithms. Without loss of generality, we can assume P = aP_0 + bP_1 + (1 − a − b)P′, where P_0 is a zero-cost algorithm that always outputs 0, P_1 is a zero-cost algorithm that always outputs 1, and P′ is a probabilistic algorithm with no support on zero-cost algorithms. Let c = min{a, b}, and write P = 2cZ + (1 − 2c)P′′, where Z is the 0-cost algorithm which is an even mixture of P_0 and P_1. Then it is not hard to see that |P| = (1 − 2c)|P′′| and score(P, µ) = (1 − 2c) score(P′′, µ) for all µ. This means that P′′ has the same cost-to-score ratio as P for all distributions µ. Hence we can always use P′′ in place of P in the infimum. Further, supposing without loss of generality that b ≥ a, we have P′′ = λP_1 + (1 − λ)P′ with λ = (b − a)/(1 − 2a). Since f is not constant, let x be an input on which f(x) disagrees with P_1(x) (that is, a 0-input). Then note that if λ ≥ 1/2, the algorithm P′′ cannot output 0 on x with probability above 1/2, so score(P′′, x) ≤ 0 and P′′ will not be used in the infimum. On the other hand, if λ < 1/2, we have |P′′| = (1 − λ)|P′| > (1/2) · 1 = 1/2, as P′ does not place weight on algorithms which make 0 queries. Now, unless P′′ achieves worst-case bias at least 1/(6Q(f)), its ratio of cost to score would be greater than 3Q(f), which we already know is achievable.

This means we only need to use γ ≥ 1/(6Q(f)) in the infimum. Thus the left-hand side equals

inf_{γ ∈ [1/(6Q(f)), 1]} PQ_γ̇(f)/γ.

Using Theorem 5.5, this is at least

inf_{γ ∈ [1/(6Q(f)), 1]} Q(f)/(C log(1/γ) log log(1/γ))

for some universal constant C. The above is clearly optimized at γ = 1/(6Q(f)), which means the left-hand side is at least Ω( Q(f)/(log Q(f) log log Q(f)) ).

Looking at the right-hand side, we see that there exists a distribution µ such that every probabilistic quantum algorithm P satisfies |P|/score(P, µ)^+ ≥ Ω̃(Q(f)), from which the desired statement follows.

We note that the argument we used to prove the existence of the hard distribution for quantum query complexity only used a few properties of quantum algorithms. Since we will want to apply the same argument to quantum communication, polynomial degree, and logrank, it makes sense to step back and provide an abstraction of this argument to more general models.

In general, we will consider Boolean-valued functions f with a finite input set Dom(f). We will have a set A of algorithms that may attempt to compute f. Formally, we will need A to be a subset of a real vector space. Each A ∈ A will have an associated cost, denoted |A|, with |·| : A → [0, ∞). We write A_T to denote the set {A ∈ A : |A| ≤ T}.

For an algorithm A ∈ A and an input x ∈ Dom(f), we let bias_f(A, x) denote the bias of algorithm A on input x. For now, the only property we need of the bias is that it is a function bias_f : A × Dom(f) → [−1, 1]. The worst-case bias of an algorithm A will be denoted bias_f(A) := min_{x∈Dom(f)} bias_f(A, x). If µ is a distribution over Dom(f), we will further write bias_f(A, µ) := E_{x∼µ}[bias_f(A, x)]. Similarly, if P is a probability distribution over A with finite support, we denote bias_f(P, µ) := E_{A∼P} E_{x∼µ}[bias_f(A, x)] and bias_f(P) := min_{x∈Dom(f)} bias_f(P, x). We also set |P| := E_{A∼P} |A|. Finally, we define M(f) := inf_{A∈A : bias_f(A) ≥ 1/3} |A|.

So far, this setting is extremely general, capturing many computational models. For the quantum-style strong minimax to work, we will need the following properties to also hold for a given function f.

1. A_T is convex for each T ∈ [0, ∞), and bias_f(·, x) is linear over A ∈ A_T for each x ∈ Dom(f).

2. There exists some A ∈ A such that bias_f(A) ≥ 1/3. (Equivalently, M(f) < ∞.)

3. All A ∈ A with |A| < 1 have |A| = 0, and A_0 is the convex hull of exactly two algorithms, Z_0 and Z_1. For each x ∈ Dom(f), we also have bias_f(Z_0, x) = −bias_f(Z_1, x) = ±1, and if f is not constant, bias_f(Z_0, x) attains both values 1 and −1 for x ∈ Dom(f).

4. Suppose P is a probability distribution over A that has support {A_1, A_2, ..., A_k}, with probability p_i for A_i, such that (a) |A_i| ≤ 2^i T for some T ∈ [1/2, ∞), (b) ∑_i 2^i p_i ≤ 5, and (c) bias_f(P) ≥ 2^{−k−2}. Then there is some A ∈ A with bias_f(A) ≥ 1/3 and |A| ≤ 2^k T · poly(k) (with the constants in the poly being universal).

We note that (1) essentially requires the computational model to be randomized (or, in communication complexity, to have public randomness). (2) only says that each function can be computed by some finite-cost algorithm. (3) says that algorithms with cost less than 1 cannot look at the input, and therefore have cost 0 and must either always output 0 or always output 1 (or some convex combination of the two).

The main important point is (4).
This point amplifies a certain restricted type of low-bias probability distribution over algorithms into a full-blown constant-bias algorithm, and the cost of amplification is nearly linear in one over the bias.

We now prove that these points together suffice to guarantee the existence of a strongly-hard distribution. To start, we establish the following lemma, which says that if (4) holds (meaning we can amplify the restricted type of probabilistic algorithms), then we can amplify all probabilistic algorithms.

Lemma 5.7.
Suppose f and A satisfy the above conditions. Let P be any finite-support probability distribution over A with bias_f(P) > 0. Then

M(f) ≤ (|P|/bias_f(P)) · polylog(1/bias_f(P)).

Proof.
The proof of this is directly analogous to the quantum query case. We convert P into the restricted form of (4), being careful to lose only a constant factor in the bias and in the cost. Let γ := bias_f(P) > 0. We first use Markov's inequality to argue that the total probability mass P places on algorithms A of cost |A| ≥ 2|P|/γ is at most γ/2, and hence discarding all such algorithms from the support of P decreases its bias by at most γ/2 (while not increasing its cost). Next, we group the remaining algorithms in the support of P into log(1/γ) bins: one bin for algorithms of cost 0 to 2T (with T equal to something like |P|), and one additional bin for algorithms of cost 2^i T to 2^{i+1} T for each i between 1 and log(1/γ). Within each bin, we use the convexity of A_{2^{i+1} T} to replace the entire bin with a single algorithm (whose cost is at most the upper boundary of that bin). For the first bin, this increases the cost |P| by up to an additive O(T), while for the other bins, this increases the cost by up to a factor of 2. Altogether, we have only log(1/γ) algorithms remaining in the support, and setting k = log(1/γ), it is not hard to check that the conditions in (4) are satisfied.
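The truncation-and-binning step above can be sketched concretely. The snippet below is a simplified illustration, not the paper's construction verbatim: bin boundaries are taken to be plain powers of two, and the discarded heavy algorithms become a zero-cost random guess.

```python
import math

def bin_by_cost(support, gamma):
    """Simplified sketch of the truncation-and-binning step in Lemma 5.7.
    support: list of (probability, cost) pairs with probabilities summing to 1.
    Step 1 (Markov): algorithms costing more than 2*E[cost]/gamma are replaced
    by a zero-cost random guess; the removed mass is at most gamma/2.
    Step 2 (binning): remaining costs are rounded up to powers of two, so only
    O(log(1/gamma)) distinct cost levels survive."""
    avg = sum(p * c for p, c in support)
    cutoff = 2 * avg / gamma
    binned = {}
    for p, c in support:
        if c > cutoff:
            key = 0.0  # replaced by a free random guess
        else:
            key = 0.0 if c == 0 else 2.0 ** math.ceil(math.log2(c))
        binned[key] = binned.get(key, 0.0) + p
    return sorted(binned.items())

support = [(0.4, 3), (0.3, 5), (0.29, 17), (0.01, 1_000_000)]
result = bin_by_cost(support, gamma=0.1)
# The cheap algorithms survive at power-of-two costs; the expensive one
# (mass 0.01 <= gamma/2) collapses to the zero-cost guess.
```

Rounding each cost up to a power of two at most doubles the expected cost, while the Markov truncation costs at most γ/2 of bias, matching the constant-factor losses claimed in the proof.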
Suppose f and A satisfy the above conditions. Then there exists a distribution µ over Dom(f) such that for any finite-support probability distribution P over A, we have

bias_f(P, µ) ≤ O( (|P|/M(f)) · polylog M(f) ).

In particular, if M^µ_γ̇(f) denotes the infimum cost |A| over algorithms A ∈ A with bias_f(A, µ) ≥ γ, then for all γ ∈ (0, 1/2] we have M^µ_γ̇(f) ≥ γ · Ω̃(M(f)).

Proof.
The proof will be exactly the same as in the quantum query setting. In the special case where f is constant, the result trivially follows as M(f) = 0, so assume f is not constant.

First, we let R be the set of all finite-support probability distributions over A, and let ∆ be the set of probability distributions over Dom(f). Then we define cost : R × ∆ → [0, ∞) by cost(P, µ) := |P|, and score : R × ∆ → [−1, 1] by score(P, µ) := bias_f(P, µ). Note that cost and score are both continuous and linear in each variable. They are also well-behaved, because M(f) < ∞ ensures finite cost and score can be achieved, cost does not depend on µ, and cost is linear in P. Hence Theorem 2.18 gives

inf_{P∈R} max_{x∈Dom(f)} |P|/bias_f(P, x)^+ = max_{µ∈∆} inf_{P∈R} |P|/bias_f(P, µ)^+.
We examine the left-hand side. It equals inf_{P∈R} |P|/bias_f(P)^+. We note that this infimum is at most M(f) by the definition of M(f). We now claim that there is no need to use any P in the infimum with bias_f(P) < 1/(6M(f)). To show this, it suffices to show that there is no need to use any P in the infimum with |P| < 1/2, because we know that M(f) is attainable using only algorithms in A with cost at least 1.

Now, suppose that |P| < 1/2 and bias_f(P) > 0. We can write P = aZ_0 + bZ_1 + (1 − a − b)P′, where P′ has support only on A ∈ A with |A| ≥ 1. Define P′′ := ((a − c)Z_0 + (b − c)Z_1 + (1 − a − b)P′)/(1 − 2c), where c = min{a, b}; that is, P′′ is P with the even zero-cost mixture removed and renormalized. Then, as we showed in the quantum query case, P′′ has the same cost-to-score ratio as P, so |P′′|/bias_f(P′′)^+ ≤ |P|/bias_f(P)^+. Moreover, since f is not constant, there is some input x ∈ Dom(f) such that bias_f(Z_0, x) = −1, and some input y ∈ Dom(f) such that bias_f(Z_1, y) = −1. Since bias_f(P′′) ≥ bias_f(P) > 0, and since bias_f(P′) ≤ 1, the total weight P′′ places on P′ must exceed 1/2, meaning that |P′′| > 1/2, as desired.

Hence the left-hand side equals inf_{P∈R′} |P|/bias_f(P), where R′ is the set of all P ∈ R with bias_f(P) ≥ 1/(6M(f)). Using Lemma 5.7, we know that for each P ∈ R′, we have M(f) ≤ |P|/bias_f(P) · polylog(1/bias_f(P)) ≤ |P|/bias_f(P) · polylog M(f). Hence the left-hand side is at least M(f)/polylog M(f).

Finally, examining the right-hand side, we see that there is a distribution µ over Dom(f) such that for all P ∈ R, we have |P| ≥ bias_f(P, µ) · M(f)/polylog M(f), and the desired result follows.

To prove an analogous minimax for quantum communication complexity, all we need is to show that quantum communication complexity satisfies the four conditions from Section 5.2. It is easy to see that as long as there is public randomness (whether or not there is also shared entanglement), the first three conditions are satisfied.
It remains to deal with the fourth condition. Let P be a probability distribution over protocols Π_1, Π_2, ..., Π_k, which assigns probability p_i to Π_i and satisfies |Π_i| ≤ 2^i T and ∑_i 2^i p_i ≤ 5, and suppose P achieves bias at least 2^{−k−2} for computing the communication function F on any input (x, y) ∈ Dom(F). Our goal is to construct a communication protocol which uses T · Õ(2^k) communication to compute F to bounded error.

As in the quantum query case, all we need to do is create a protocol Π in which Alice and Bob estimate the biases Π_i(x, y) of the protocols Π_i when run on their inputs. Each estimate for protocol i needs to be within 2^{−(k−i)}/20 of the correct bias, and it must satisfy this property with probability at least 1 − 1/3k (see the query complexity section for a formal analysis). To achieve this, it suffices for Alice and Bob to use amplitude estimation from Theorem 5.1 to generate an estimate of the probability that Π_i(x, y) outputs 1. Hence the only remaining difficulty is running amplitude estimation of a communication protocol in the communication complexity setting.

This turns out to be possible in both the shared-entanglement and the non-shared-entanglement settings (though note that we've already assumed shared randomness, so we cannot handle the non-shared-randomness, non-shared-entanglement quantum communication complexity model). The idea is to have one of the players, say Alice, take charge. We will assume that Alice is the one who outputs the final answer in Π_i. Then from Alice's point of view, Π_i(x, y) can be viewed as a unitary U and a measurement M such that Alice needs Bob's help to apply U, and after applying U to a shared state |0⟩_A |0⟩_B, Alice can apply the measurement M on her side alone to get the output Π_i(x, y). Now, to apply amplitude estimation, Alice only needs the ability to apply controlled U, U†, and I − 2M operations. She can do the latter alone.
For controlled U and U† applications, she needs Bob's help, but that's fine: she will just send him a qubit each time, alerting him to whether they are about to apply U or U† to their shared state (Bob will return that qubit afterwards to ensure coherence of Alice's controlled applications of U and U†).

We conclude the following theorem.

Theorem 5.9.
Let F : X × Y → {0, 1} be a (possibly partial) communication function. Then there exists a probability distribution µ over Dom(F) such that for all γ ∈ (0, 1/2], we have QCC^µ_γ̇(F) ≥ γ · Ω̃(QCC(F)).

Here QCC^µ_γ̇(F) denotes the minimum amount of communication required by a quantum communication protocol which achieves bias at least γ against µ. This theorem works both in the shared-entanglement setting and in the shared-randomness, non-shared-entanglement setting.

As in the quantum case, polynomials can be amplified linearly in the bias. However, also as in the quantum case, the degree of polynomials is not convex: the degree of a convex combination of p_0 and p_1 is the maximum of the degrees of p_0 and p_1, not the average degree.

The same ideas that worked for quantum query and communication complexities will allow us to get a strong hard distribution for approximate polynomial degree and approximate logrank. The main difference will be how we do the estimation of success probabilities: instead of amplitude estimation, we will need a polynomial variant of this, which turns out to be a little tricky.

The approximate degree of a (possibly partial) Boolean function f : {0, 1}^n → {0, 1} is the minimum degree of an n-variate polynomial p which satisfies |p(x) − f(x)| ≤ ǫ for all x ∈ Dom(f), where ǫ is a parameter representing the allowed error. When f is a partial function, there are actually two different notions of polynomial degree: one where p is required to be bounded on the entire Boolean hypercube (that is, p(x) ∈ [0, 1] for all x ∈ {0, 1}^n, even when x ∉ Dom(f)), and one where p is not restricted outside the domain of f. Our results will apply to both versions of polynomial degree, but for conciseness, we restrict our attention to the bounded version.

With polynomials, it is often convenient to switch from talking about functions f : {0, 1}^n → {0, 1} to talking about functions f : {+1, −1}^n → {+1, −1}. Note that by doing a simple variable substitution, we can convert between {0, 1} variables and {+1, −1} variables without changing the degree of the polynomial.
That is, we can substitute 1 − 2x_i in place of the variable x_i inside p to make it take {0, 1} inputs instead of {+1, −1} inputs, and we can substitute (1 − x_i)/2 to go the other way. We can similarly change the output of p from being in the range [0, 1] to the range [−1, 1] and vice versa (the error changes by a factor of 2 when switching between these bases). Another well-known observation is that to approximate a Boolean function f, we only need multilinear polynomials, and their degree only needs to be at most n.

To get our hard distribution, we will use Theorem 5.8. We need to check the four conditions, but using polynomials as our "algorithms". More explicitly, the set A will be the set of all real n-variate multilinear bounded polynomials, viewed in the {+1, −1} basis (bounded means that p(x) ∈ [−1, 1] for all x ∈ {+1, −1}^n). For a polynomial p ∈ A, we define bias_f(p, x) to be f(x)p(x). Then (1) holds, as the set of polynomials of a given degree is convex and bias_f(·, x) is linear over that set. (2) holds because every Boolean function can be computed exactly by a polynomial of degree n. Next, (3) holds because polynomials of degree less than 1 have degree 0, and since we're dealing with bounded polynomials, these are a convex combination of the two constant polynomials −1 and 1.

It remains to show (4). To this end, let P be a probability distribution over k polynomials q_1, q_2, ..., q_k, with deg(q_i) ≤ 2^i T. Let p_i be the probability P assigns to q_i, and suppose ∑_{i=1}^k 2^i p_i ≤ 5. Finally, suppose that bias_f(P) ≥ 2^{−k−2}. Our goal is to find a polynomial q of degree at most 2^k T · poly(k) that computes f to constant error. To do so, we'll need a polynomial version of the amplitude estimation algorithm we used in the quantum case. That is, we'd like to estimate the output that polynomial q_i(x) returns, and do arithmetic computations on it.
Crucially, one ofthe arithmetic computations we’d like to do is comparison, for example, to see if q i ( x ) > . Sucha comparison is not a polynomial operation, so we cannot use the polynomial q i ( x ) itself. Whatwe’ll do instead is to create polynomials that compute the bits of the binary expansion of q i ( x ) , toa certain precision. We will then do arithmetic operations using those bits, and we’ll be able toimplement those operations using polynomials.To do so, we’ll need some approximation theory. The following theorem, known as Jackson’stheorem, will be useful. It traces back to Jackson (1911) [Jac11], but see also [MMR94] (page 750,Theorem 3.1.1) for some discussion and a more thorough list of references. Theorem 6.1 (Jackson’s theorem) . Let α : [ − , → R be a continuous function, and let n ∈ N .Then there is a real polynomial p of degree n such that for all x ∈ [ − , , we have | p ( x ) − α ( x ) | ≤ · sup | y − z |≤ /n | α ( y ) − α ( z ) | . In particular, if α has Lipschitz constant K , then for each n ∈ N there is a polynomial p n of degreeat most n which approximates α to within an additive K/n at each point in [ − , . Jackson’stheorem can be used to prove the well-known result that polynomials can be amplified with a lineardependence in the bias. For completeness, we reprove this here (see also e.g. [GKKT17]). Corollary 6.2 (Polynomial amplification (small bias to constant bias)) . For each γ ∈ (0 , , thereis a real polynomial p of degree at most /γ such that p maps [ − , to [ − , , p maps [ − , − γ ] to [ − , − / , and p maps [ γ, to [1 / , .Proof. Let α : [ − , → R be the function with α ( x ) = − / for x ∈ [ − , − γ ] , α ( x ) = 2 / for x ∈ [ γ, , and α ( x ) = 2 x/ γ for x ∈ ( − γ, γ ) . Then α is continuous and has Lipschitz constant / γ . By Theorem 6.1, for every n ∈ N , there exists a polynomial p n of degree at most n whichapproximates α to additive error /γn . 
Picking n = ⌈12/γ⌉ ≤ 13/γ, we get a polynomial which approximates α to error 1/3, which means it has the desired properties.

We will also need an amplification polynomial that goes from constant bias to small error. We reprove the following well-known lemma here for completeness (it also appears in [BNRW07], and another version appears in [She13]).

Lemma 6.3 (Polynomial amplification (constant error to small error)). For each ǫ ∈ (0, 1/3], there is a real polynomial p of degree at most
17 log(1/ǫ) such that p maps [−1,1] to [−1,1], p maps [−1,−1/3] to [−1,−(1−ǫ)], and p maps [1/3,1] to [1−ǫ,1].

Proof. We set

q(x) = ∑_{i=0}^{k} (2k+1 choose i) ((1+x)/2)^i ((1−x)/2)^{2k+1−i},

and set p(x) = 1 − 2q(x). Note that for x ∈ [−1,1], the value q(x) is exactly the probability that, when flipping a coin 2k+1 times, fewer than half of the coin flips will come out heads, assuming the probability of heads is (1+x)/2. Because of this interpretation, we know that q maps [−1,1] to [0,1] and is decreasing in x, so p maps [−1,1] to [−1,1] and is increasing in x. We also have q(x) = 1 − q(−x), which means that p(−x) = −p(x), i.e. p is odd. Given these properties, the lemma will follow if we show that p(1/3) ≥ 1 − ǫ, or equivalently, that q(1/3) ≤ ǫ/2. We have q(1/
3) = ∑_{i=0}^{k} (2k+1 choose i) (2/3)^i (1/3)^{2k+1−i} = 3^{−(2k+1)} ∑_{i=0}^{k} (2k+1 choose i) 2^i ≤ 3^{−(2k+1)} 2^k ∑_{i=0}^{k} (2k+1 choose i) = 3^{−(2k+1)} 2^k 2^{2k} = (1/3)(8/9)^k.

To get this to be smaller than ǫ/2, it suffices to pick k large enough so that (8/9)^k ≤ ǫ, or equivalently, k ≥ 6 log(1/ǫ). Hence we can pick k = ⌈6 log(1/ǫ)⌉ ≤ 6 log(1/ǫ) + 1. The degree of p will be 2k + 1 ≤ 12 log(1/ǫ) + 3. Note that ǫ ≤ 1/3, so log(1/ǫ) ≥ log 3, and hence 12 log(1/ǫ) + 3 ≤ (12 + 3/log 3) log(1/ǫ) ≤
17 log(1/ǫ).

Equipped with these approximation-theoretic tools, we will now tackle (4), showing that probability distributions over polynomials (which achieve a small amount of worst-case bias γ for computing f) can be amplified to polynomials which compute f to constant error, using only a nearly-linear dependence on 1/γ.

Lemma 6.4.
As in (4), let P be a probability distribution over k bounded multilinear polynomials q_1, q_2, ..., q_k, which assigns them probabilities p_1, p_2, ..., p_k respectively. Suppose that ∑_{i=1}^k 2^i p_i ≤ 2, that deg(q_i) ≤ 2^i T for some real number T, and that f(x) ∑_{i=1}^k p_i q_i(x) ≥ 2^{−k−1} for all x ∈ Dom(f). Then there is a bounded multilinear polynomial q which approximates f with bias at least 1/3 and which satisfies deg(q) ≤ 2^k T · poly(k).

Proof. Recall that in the quantum case, we estimated the bias of the i-th algorithm to within roughly 2^{−(k−i)}, with success probability at least 1 − 1/poly(k). We will do a polynomial version of this. What does estimating q_i(x) mean, for polynomials? It means we will construct polynomials which approximately compute the bits in the binary expansion of the number q_i(x). We will have one polynomial for the sign, and an additional k − i + 4 polynomials for the first k − i + 4 digits in the binary expansion of q_i(x).

In order to do so, we compose univariate polynomials with q_i. This way, the task reduces to creating univariate polynomials which output the bits in the binary expansion of their input (assuming they all receive the same input). More explicitly, the correctness condition is as follows. We say the binary expansion of a real number β ∈ [−1,1] is 2^{−ℓ}-robust to t bits if the first t bits of the binary expansion of β + ǫ are the same as those of β for all ǫ ∈ [−2^{−ℓ}, 2^{−ℓ}]. Then we require univariate polynomials d^ℓ_0, d^ℓ_1, ..., d^ℓ_k such that if β ∈ [−1,1] is 2^{−ℓ}-robust to at least t bits, then d^ℓ_t(β) is within O(1/k) of the t-th bit in the binary expansion of β. The polynomial d^ℓ_0 needs to output the sign of β if β is 2^{−ℓ}-robust to at least 0 bits (that is, if the sign of β does not change upon adding or subtracting 2^{−ℓ}). We will also require all these polynomials to be bounded, i.e. they must map [−1,1] to [−1,1].

To implement these polynomials, we use Theorem 6.1.
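The robustness condition above is easy to make concrete. The following sketch (our own illustration, restricted to nonnegative β for simplicity) checks 2^{−ℓ}-robustness to t bits by comparing the first t bits at the two endpoints of the perturbation interval, which suffices because the floor function is monotone:

```python
def first_bits(beta, t):
    # First t bits of the binary expansion of beta, for beta in [0, 1).
    return int(beta * 2**t)  # floor, since beta * 2**t >= 0 here

def is_robust(beta, ell, t):
    # beta is 2^-ell-robust to t bits if its first t bits are unchanged
    # under any perturbation eps in [-2^-ell, 2^-ell].  Since first_bits
    # is monotone in beta, checking the two endpoints suffices.
    eps = 2.0**-ell
    return first_bits(beta - eps, t) == first_bits(beta + eps, t)

assert is_robust(0.5 + 2**-4, ell=6, t=1)   # well inside the [1/2, 1) plateau
assert not is_robust(0.5, ell=6, t=1)       # sits exactly on a bit boundary
```

Handling the sign bit for negative β works the same way, with the sign in place of the floor.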
For simplicity, let's represent the bits in the binary expansion using +1 and −1 instead of 0 and 1 (converting back is easy). Consider the function α_i which outputs the i-th bit of the binary expansion of its input (or the sign if i = 0). This α_i is a step function: for i = 0, α_0(β) jumps from −1 to 1 at β = 0; for i = 1, α_1(β) similarly jumps from −1 to 1 and back at β = −1/2, 0, 1/2. More generally, α_i has 2^{i+1} different plateaus of 1 or −1 on its domain [−1,1]. Now, since we only care about getting the i-th bit correct if the i-th bit is robust to β changing by 2^{−ℓ}, consider the continuous functions α^ℓ_i which make the jumps from −1 to 1 continuous by starting 2^{−ℓ} before the jump point, ending 2^{−ℓ} after the jump point, and drawing a continuous line in between (the slope of the line will be ±2^ℓ). This is well-defined as long as ℓ is sufficiently larger than i, say ℓ ≥ i + 2.

Note that α^ℓ_i has Lipschitz constant 2^ℓ. This means we can use Theorem 6.1 to estimate α^ℓ_i by a polynomial of degree O(2^ℓ) which achieves constant additive error (say, 1/8). We can scale down these polynomials slightly to ensure they remain bounded in [−1,1]. We then plug them into a univariate bounded polynomial of degree O(log k) that we get from Lemma 6.3, in order to amplify the error down to O(1/k). The result is polynomials d^ℓ_t (for ℓ ≥ t + 2) that have degree O(2^ℓ log k) and, on input β which is 2^{−ℓ}-robust to at least t bits, correctly output the t-th bit of β except with additive error O(1/k).

Now, to get an estimate of q_i(x) to k − i + 5 bits, we set ℓ = k − i + O(log k) and compose d^ℓ_t(q_i(x)) for t = 0, 1, 2, ..., k − i + 5. Actually, we scale down q_i(x) and add an extra variable y_i representing a noise term for q_i(x); the final estimating polynomials will be the (n+1)-variate polynomials r_{i,t}(x, y_i) := d^ℓ_t((9/10) q_i(x) + y_i).
Note that the degree of r_{i,t} is O(2^{k−i+O(log k)} log k · deg(q_i)) = O(2^k T poly(k)).

Next, consider the function which takes binary representations (to k + 5 bits each) of numbers λ_1, ..., λ_k ∈ [−1,1], and outputs the sign of ∑_{i=1}^k p_i λ_i, where the p_i are known non-negative constants which sum to 1. This is a Boolean function of O(k^2) variables, so it can be computed exactly by a multilinear polynomial of degree O(k^2). Call this polynomial s. Next, plug the polynomials r_{i,t} into the inputs of s, so that s calculates the sign of the sum ∑_{i=1}^k p_i β̃_i, where each β̃_i is the estimate of (9/10) q_i(x) + y_i that is computed by the polynomials d^{k−i+O(log k)}_t. Call this composed polynomial u(x, y).

Observe that u(x, y) is a polynomial in n + k variables (n variables from x and k variables y_i), and has degree O(2^k T poly(k)). This polynomial attempts to compute the sign of (9/10) ∑_{i=1}^k p_i q_i(x) + ∑_{i=1}^k p_i y_i. Since we know that ∑_{i=1}^k p_i q_i(x) · f(x) ≥ 2^{−k−1}, the sign computed by u(x, y) will equal f(x) so long as |∑_{i=1}^k p_i y_i| ≤ 2^{−k−2}. Recall that ∑_{i=1}^k 2^i p_i ≤ 2. Hence to guarantee that |∑_{i=1}^k p_i y_i| ≤ 2^{−k−2}, it suffices to choose each y_i such that |y_i| ≤ 2^{−(k−i+5)}.

Now, let's call q_i(x) + y_i good if it is 2^{−(k−i+O(log k))}-robust to k − i + 5 bits. If q_i(x) + y_i is good for all i, then the r_{i,t} correctly compute the bits to additive error O(1/k), and then a multilinear polynomial of degree O(k^2) in O(k^2) variables will still correctly compute its output to small error, certainly O(1/k). Hence if q_i(x) + y_i is good for all i and |y_i| ≤ 2^{−(k−i+5)} for all i, u(x, y) outputs f(x) to error O(1/k).

To ensure that the q_i(x) + y_i are good, we pick y_i at random.
That is, we have an allowed range [−2^{−(k−i+5)}, 2^{−(k−i+5)}] for y_i; we fit poly(k) evenly spaced points into this range, so that the gap between the points is 2^{−(k−i+O(log k))}. Note that for all but a constant number of choices of y_i among these poly(k) options, the resulting number q_i(x) + y_i will be 2^{−(k−i+O(log k))}-robust to k − i + 5 bits. Hence by randomly selecting y_i, the probability that q_i(x) + y_i is not good is at most O(1/poly(k)). By the union bound, this choice means that all q_i(x) + y_i are good except with constant probability. Hence u(x, y) computes f(x) to O(1/k) error with high probability when y is chosen at random according to the above procedure.

Finally, we let q(x) be the average of the polynomials u(x, y) over all possible choices of y in the above procedure. Since u(x, y) outputs a number very close to f(x) when y is good, and since it is always bounded in [−1,1], and since y is good with high probability, we conclude that q(x) computes f(x) to bounded error. It is also bounded outside the promise of f. The degree of q(x) is O(2^k T poly(k)). We note that q(x) as we constructed it here can actually be viewed as a polynomial ρ in k variables composed with the polynomials q_1, q_2, ..., q_k.

The above amplification lemma allows us to conclude the following theorem.

Theorem 6.5. Let f : {+1,−1}^n → {+1,−1} be a (possibly partial) Boolean function. Then there is a vector ψ ∈ [−1,1]^{Dom(f)} such that ‖ψ‖_1 = 1, ⟨ψ, f⟩ = 1, and for any polynomial p which is bounded (i.e. |p(x)| ≤ 1 for x ∈ {+1,−1}^n), we have

⟨ψ, p⟩ ≤ deg(p) / ˜Ω(adeg(f)).

Here adeg(f) denotes the minimum degree of a bounded polynomial p which computes f to bounded error. The constants in the ˜Ω notation are universal.

Proof. This follows immediately by taking ψ to be defined by ψ(x) = f(x)µ[x], where µ is the hard distribution we get from Theorem 5.8.
Instead of tackling approximate logrank directly, we use the approximate γ_2 norm. This measure deserves some introduction. First, we note that the γ_2 norm is a well-known norm of a matrix. One way to define it is to say that γ_2(A) is the minimum, over factorizations A = BC of A into a product of matrices B and C, of the maximum 2-norm of a row of B times the maximum 2-norm of a column of C. The γ_2 norm has several useful properties known in the literature [She12; LSŠ08]:

1. γ_2 is a norm, so γ_2(A) ≥ 0 (with equality if and only if A is the all-zeros matrix) and γ_2(A + λB) ≤ γ_2(A) + |λ|γ_2(B).
2. γ_2(A ⊗ B) = γ_2(A)γ_2(B), where ⊗ denotes the tensor (Kronecker) product.
3. γ_2(A ◦ B) ≤ γ_2(A)γ_2(B), where ◦ denotes the Hadamard (entry-wise) product.
4. γ_2(J) = 1, where J is the all-ones matrix.
5. ‖A‖_∞ ≤ γ_2(A) ≤ ‖A‖_∞ √rank(A).

In the above, A and B are matrices of the same dimensions, and λ is a scalar. γ_2(A) can be thought of as a smoother version of rank.

Let F : X × Y → {+1,−1} be a (possibly partial) communication function. We identify F with its communication matrix, which is a matrix with rows indexed by X and columns indexed by Y, with the (x,y) entry being F(x,y) ∈ {+1,−1} if (x,y) ∈ Dom(F) and being ∗ if (x,y) ∉ Dom(F). This way, F is a {+1,−1,∗}-valued matrix.

For such a matrix F, we say that a real-valued matrix A approximates F (to bias 1/3) if |A[x,y]| ≤ 1 for all (x,y) ∈ X × Y and F(x,y)A[x,y] ≥ 1/3 for all (x,y) ∈ Dom(F). The approximate γ_2 norm of F, denoted ˜γ_2(F), is defined as the minimum value of γ_2(A) over all matrices which approximate F to bias 1/3.
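The factorization definition makes upper bounds on γ_2 easy to certify: any explicit factorization A = BC yields one. The following sketch (the helper `gamma2_ub` is ours; it certifies only an upper bound, not the exact norm) checks the certificate for the all-ones matrix J and for a small Hadamard-type matrix via its SVD:

```python
import numpy as np

def gamma2_ub(B, C):
    # Any factorization A = B @ C certifies
    #   gamma_2(A) <= (max row 2-norm of B) * (max column 2-norm of C).
    return np.linalg.norm(B, axis=1).max() * np.linalg.norm(C, axis=0).max()

# Property 4: J = ones(n,1) @ ones(1,n) certifies gamma_2(J) <= 1.
assert np.isclose(gamma2_ub(np.ones((4, 1)), np.ones((1, 4))), 1.0)

# Property 5 (upper half): the SVD A = U diag(s) V^T gives the factorization
# (U sqrt(s)) (sqrt(s) V^T), certifying gamma_2(A) <= ||A||_inf * sqrt(rank(A)).
A = np.array([[1., -1.], [-1., -1.]])
U, s, Vt = np.linalg.svd(A)
B, C = U * np.sqrt(s), np.sqrt(s)[:, None] * Vt
assert np.allclose(B @ C, A)
assert gamma2_ub(B, C) <= np.sqrt(2) * np.abs(A).max() + 1e-9
```

Computing γ_2 exactly requires semidefinite programming; the certificate above is all the proofs below need.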
It is not hard to see that this minimum is attained, as the set of such matrices is compact.

We will actually care about the logarithm of the approximate γ_2 norm, that is, about log ˜γ_2(F). We note that the constant 1/3 in the definition of this measure is arbitrary, as approximations to F can be amplified with only a constant factor overhead in the log-approximate-γ_2-norm (see, e.g., [BBGK18]). An annoying detail, however, is that such amplification can in general lose not just a multiplicative constant but also an additive constant, since ˜γ_2(F) may in general be less than 1 (meaning its logarithm will be less than 0). To avoid such complications, we will define our measure of interest as M(F) := max{1, log ˜γ_2(F)} if F is not constant and M(F) = 0 if F is constant, and we will write M_{˙γ}(F) for the bias-γ version of M(F) instead of the default bias-1/3 version.

In order to get a minimax theorem analogous to Theorem 6.5, we will again use Theorem 5.8. Our set of algorithms A will be the set of bounded real matrices A (that is, real matrices A of the same dimensions as F which satisfy |A[x,y]| ≤ 1 for all (x,y) ∈ X × Y). The cost of a matrix A will be cost(A) := max{1, log γ_2(A)} if A is not a multiple of the all-ones matrix J, and otherwise cost(A) = 0 if A = λJ. We define bias_F(A, (x,y)) = F(x,y)A[x,y] for (x,y) ∈ Dom(F).

We show that A_T is convex for each T ∈ [0, ∞). For T < 1, the set A_T is the set of all matrices of the form λJ for λ ∈ [−1,1], which is clearly convex. For T ≥ 1, suppose A, B ∈ A_T and let λ ∈ (0,1). Then cost(λA + (1−λ)B) is either 0, 1, or log γ_2(λA + (1−λ)B). In the former two cases, we clearly have λA + (1−λ)B ∈ A_T, so consider the latter case. We have log γ_2(λA + (1−λ)B) ≤ log(λγ_2(A) + (1−λ)γ_2(B)) ≤ log max{γ_2(A), γ_2(B)} = max{log γ_2(A), log γ_2(B)} ≤ max{cost(A), cost(B)} ≤ T. Hence A_T is convex.
It is also clear that bias_F(·, (x,y)) is linear, so (1) is satisfied. By taking A to equal F inside Dom(F) and to be 0 elsewhere, we get bias_F(A) = 1, so (2) is satisfied. By our definition of cost(A), we have cost(A) ≥ 1 or cost(A) = 0, with the latter happening only if A is a convex combination of J and −J, so (3) is satisfied. As usual, it remains to handle (4). We do so in the following lemma.

Lemma 6.6.
Let P be a probability distribution over matrices A_1, A_2, ..., A_k with probability p_i for A_i. Suppose that ∑_{i=1}^k 2^i p_i ≤ 2, and that for all i, we have cost(A_i) ≤ 2^i T for some real number T ≥ 1/2. Suppose further that bias_F(P) ≥ 2^{−k−1}. Then there is some bounded matrix A which approximates F to bias 1/3 and satisfies cost(A) ≤ 2^k T · poly(k) (with the constants in the poly being universal).

Proof. Let ρ be the polynomial from the proof of Theorem 6.5 with respect to the probabilities p_1, p_2, ..., p_k. This is a polynomial in k variables with the property that if values β_1, β_2, ..., β_k are plugged in and |∑_i p_i β_i| ≥ 2^{−k−1}, then ρ(β_1, β_2, ..., β_k) returns the sign of ∑_i p_i β_i to bounded error. The polynomial ρ further has the property that it is bounded (i.e. it returns values in [−1,1] when given inputs in [−1,1]^k), and that if you plug in any polynomials q_i in place of β_i, with deg(q_i) ≤ 2^i, then the degree of the composed polynomial is at most 2^k poly(k).

This latter property means that the weighted degree of ρ with weights (2^1, 2^2, ..., 2^k) is at most O(2^k poly(k)). Here the term weighted degree means that we count the degree of each monomial of ρ differently depending on the variables in that monomial: the i-th variable gets weight 2^i, so a monomial of the form β_1^{c_1} β_2^{c_2} ... β_k^{c_k} will have weighted degree 2^1 c_1 + 2^2 c_2 + ... + 2^k c_k. We know that the weighted degree of ρ, meaning the maximum weighted degree of one of its monomials, is at most O(2^k poly(k)).

We will now use this polynomial ρ to construct a matrix A which approximates F and has γ_2 norm that is not too large. The idea is to simply apply ρ to the matrices A_1, A_2, ..., A_k, using the Hadamard product for multiplication and the usual matrix addition and scalar multiplication. Since γ_2 is a norm, we know that γ_2(ρ(A_1, A_2, ...
, A_k)) is at most the sum, over all monomials of ρ, of the absolute value of the coefficient of that monomial multiplied by the γ_2-norm of the Hadamard product defined by that monomial. This is upper bounded by the sum of absolute coefficients of ρ (which we'll denote C) multiplied by the γ_2 norm of the largest monomial.

The γ_2 norm of a single monomial β_1^{c_1} ... β_k^{c_k} composed with matrices A_1, ..., A_k is at most γ_2(A_1)^{c_1} ... γ_2(A_k)^{c_k}, since the γ_2 norm is sub-multiplicative under the Hadamard product. Hence log γ_2(ρ(A_1, ..., A_k)) is at most log C plus the maximum value of c_1 log γ_2(A_1) + ... + c_k log γ_2(A_k) over monomials (c_1, c_2, ..., c_k) of ρ. Since log γ_2(A) ≤ cost(A) for all bounded matrices A, and since cost(A_i) ≤ 2^i T, this maximum is at most the maximum of T · (2^1 c_1 + ... + 2^k c_k) over monomials of ρ, which is at most O(2^k T poly(k)).

We now upper bound C, the sum of absolute coefficients of ρ. Recall that ρ was constructed as an average of different polynomials with different values of the constants y_i. Let ρ′ be the polynomial within that set we averaged over which has the largest sum of absolute coefficients. Then to upper bound C it suffices to upper bound the sum of absolute coefficients of ρ′. To do so, we essentially want to replace all coefficients of ρ′ with their absolute values, and then plug in all ones for the variables. We note that (9/
10) + y_i will be at most 1 for the values of y_i used in ρ′, which means that if we replace the terms (9/10) q_i + y_i with simply q_i, we would only increase the sum of absolute coefficients (here we treat the q_i as variables).

Let the resulting polynomial be ρ′′. Then ρ′′ is simply the result of composing the polynomial s with the polynomials r_{i,t}. Since s is a bounded multilinear polynomial of degree O(k^2), its sum of absolute coefficients is at most 2^{O(k^2)}, and it is not hard to see that the sum of absolute coefficients of ρ′′ will be at most 2^{O(k^2)} times D^{O(k^2)}, where D is the maximum sum of absolute coefficients over the polynomials d^ℓ_t with ℓ = k − i + O(log k). In other words, log C ≤ O(k^2) + O(k^2 log D), where D is the sum of absolute coefficients of some such polynomial d^ℓ_t.

The polynomial d^ℓ_t is a univariate bounded polynomial of degree at most O(2^ℓ log k), which, using ℓ ≤ k + O(log k), is at most 2^k poly(k). A bounded univariate polynomial of this degree must have sum of absolute coefficients at most 2^{2^k poly(k)} by [She13] (Lemma 4.1). Hence log D ≤ 2^k poly(k), so log C ≤ 2^k poly(k).

We conclude that if A = ρ(A_1, A_2, ..., A_k), then log γ_2(A) ≤ 2^k (T + 1) poly(k), and hence cost(A) ≤ 2^k (T + 1) poly(k). This is at most O(2^k T poly(k)) since we have T ≥ 1/2. Further, each entry A[x,y] is equal to ρ(A_1[x,y], A_2[x,y], ..., A_k[x,y]), which means that A is bounded (since ρ is bounded and the matrices A_i are bounded), and for (x,y) ∈ Dom(F), we have F(x,y)A[x,y] ≥ 1/3 by the guarantees on the A_i and on ρ.

Using Theorem 5.8, we can now conclude the following theorem.

Theorem 6.7.
Let F : X × Y → {+1,−1} be a (possibly partial) communication function. Then there is a distribution µ over Dom(F) such that for any bounded real matrix A (meaning |A[x,y]| ≤ 1 for all (x,y) ∈ X × Y), we have

E_{(x,y)∼µ}[F(x,y)A[x,y]] ≤ log γ_2(A) / ˜Ω(log ˜γ_2(F)).

Note that for bounded matrices, log γ_2(A) ≤ log rank(A). We also have, from [LS09], log ˜rank(F) ≤ O(log ˜γ_2(F)) + O(log log |X × Y|). This means we can write a minimax theorem for logrank as well.
Theorem 6.8.
Let F : X × Y → {+1,−1} be a (possibly partial) communication function, and suppose that log ˜rank(F) ≥ C log log |X × Y|, where C is a universal constant. Then there is a distribution µ over Dom(F) such that for any bounded real matrix A (meaning |A[x,y]| ≤ 1 for all (x,y) ∈ X × Y), we have

E_{(x,y)∼µ}[F(x,y)A[x,y]] ≤ log rank(A) / ˜Ω(log ˜rank(F)).

In other words, µ is such that if A has low rank compared to F, then A cannot correlate well with F under µ, and hence A does not approximate F very well against µ.

7 Circuit complexity
A Boolean circuit C is a collection of gates connected to each other and to bits of its input x by wires, with a single output wire representing the value of C(x). The size of a circuit is the number of gates in the circuit, and the depth of a circuit is the length of the longest path between an input bit and an output wire. A randomized Boolean circuit is a probability distribution over Boolean circuits, and the size of a randomized Boolean circuit is defined to be the expected size of a Boolean circuit drawn from that distribution.

In Section 7.1, we examine the randomized circuit complexity of partial Boolean functions when they are computed by circuits of unbounded fan-in and unlimited depth. In Section 7.2, we show that the main result also holds in the NC setting of logarithmic-depth circuits whose gates each have fan-in at most 2. Finally, in Section 7.3 we establish the strengthening of the hardcore lemma.

In this section, let R(f) denote the minimum size of a randomized Boolean circuit of unbounded fan-in and unlimited depth that computes the partial Boolean function f with error at most 1/3 on every input x ∈ Dom(f). Similarly, let R^µ_{˙γ}(f) denote the minimum size of randomized Boolean circuits that compute f with error at most ˙γ = 1/2 − γ when the input is drawn from µ. We establish a relation between those two complexity measures via the study of forecasting circuits.

Definition 7.1. A forecasting circuit is a randomized Boolean circuit with one modification: instead of having a single output wire, the forecasting circuit has k + 1 output wires that represent the binary encoding of a value in the range {0, 1/2^k, 2/2^k, ..., (2^k − 1)/2^k, 1}.

The resolution of a forecasting circuit is k when it has k + 1 output wires. (Or, equivalently, when it outputs values that are multiples of 2^{−k}.) The score of a forecasting circuit is computed in the same way as we did for forecasting algorithms in previous sections.
The size of a randomized forecasting circuit is, as in the case of randomized Boolean circuits, the expected number of gates in a circuit drawn from the distribution. Forecasting circuits can be defined for each model of randomized Boolean circuits; in this section, we consider forecasting circuits with unbounded fan-in and unlimited depth.

We begin by showing that if there is a Boolean circuit that computes a function with non-negligible advantage over random guessing, then there is also a forecasting circuit with non-trivial score.

Proposition 7.2.
For any partial function f : {0,1}^n → {0,1}, if there are a size s ≥ 1 and a parameter γ ≥ 2/R(f) for which there is a randomized Boolean circuit R of average size s that satisfies Pr_{C∼R}[C(x) ≠ f(x)] ≤ ˙γ for every x ∈ Dom(f), then there is also a randomized forecasting circuit R′ with resolution ⌈log R(f)⌉, average size at most s + 1, and hs-score

score(R′, x) = E_{C′∼R′}[score(C′(x), f(x))] ≥ γ^2/8

for each x ∈ Dom(f).

Proof. For each circuit C in the support of R, define C′ to be the forecasting circuit of resolution k = ⌈log R(f)⌉ and size size(C) + 1 which outputs the value (1 + (2C(x) − 1)γ′)/2
on input x ∈ Dom(f), where γ′ = m/2^k for the largest integer m such that γ′ ≤ γ. The definition of γ′ guarantees that γ − 2^{−k} ≤ γ′ ≤ γ. The value of k and the lower bound on γ in the proposition statement imply that γ − 2^{−k} ≥ γ − 1/R(f) ≥ γ/2, so γ/2 ≤ γ′ ≤ γ.

This circuit C′ can be constructed by adding a single extra ¬ gate to the output wire of C: the output of C and its negation can then be combined with constant-value wires to generate the two required output values of the forecasting circuit. (Namely, if the two output values (1 ± γ′)/2 of C′ are denoted by z(0) and z(1), then the i-th output bit of C′ is a hardcoded constant value 0 or 1 when the i-th bits of z(0) and z(1) agree, and otherwise it is either C(x) or ¬C(x).)

The randomized forecasting circuit R′ is then defined to be the distribution on circuits obtained by drawing C ∼ R and outputting the modified circuit C′ as described above. Following the same argument as in Lemma 3.15, the score of this randomized forecasting circuit satisfies score(R′, x) ≥ γ′^2/2 ≥ γ^2/8.

In the second step in the proof of Theorem 7.8, we show that the minimax theorem applies in this setting.
Lemma 7.3.
Fix any k ≥ 1 and let R_k denote the set of all randomized forecasting circuits with resolution k. Then for any partial function f : {0,1}^n → {0,1}, if we let ∆ denote the set of distributions over Dom(f), we have

inf_{R∈R_k} max_{x∈Dom(f)} size(R) / score(R, x)^+ = max_{µ∈∆} inf_{R∈R_k} size(R) / score(R, µ)^+.

Proof.
The lemma follows from Theorem 2.18, and the argument showing that the conditions of that theorem are satisfied closely follows the analogous argument of Theorem 4.2.

First, we want to show that R_k can be viewed as a convex subset of a real topological vector space V. We can do so with the same construction as in Theorem 4.2, though here we can also use a slightly simpler construction: fix V = R^{|Dom(f)|+1}, and for each randomized forecasting circuit R ∈ R_k define v_R(x) = score(R, x) for each x ∈ Dom(f) and define the (|Dom(f)| + 1)-th coordinate of v_R to be size(R). That the resulting set is convex follows directly from the fact that a vector v′ = λv_{R_1} + (1 − λ)v_{R_2} for any R_1, R_2 ∈ R_k corresponds to the vector of the randomized forecasting circuit R′ = λR_1 + (1 − λ)R_2.

The linearity of the cost and score measures in both R and µ follows from their definitions.

Lastly, the notions of cost and score satisfy the well-behaved condition of Theorem 2.18. First, because there is a circuit of finite size that computes f exactly, which implies the existence of a finite-cost and non-zero-score randomized forecasting circuit for any distribution µ on Dom(f). Second, because the cost of circuits does not depend on the input. And third, because the definition of cost immediately implies that the mixture of a zero-cost and a nonzero-cost randomized circuit gives a nonzero-cost randomized circuit.

The next step is the main one in the proof of the theorem: we want to show that the score of forecasting circuits can be amplified efficiently.

Lemma 7.4.
For every partial Boolean function f : {0,1}^n → {0,1}, when we set k = ⌈log R(f)⌉ then

inf_{R∈R_k} max_{x∈Dom(f)} size(R) / score(R, x)^+ = ˜Ω(R(f)),

where the ˜Ω hides terms that are polylogarithmic in R(f).

The proof uses the following facts about the circuit complexity of arithmetic operations.

Proposition 7.5 ([BCH86; Alt88]). For any two numbers a, b represented to accuracy 2^{−k} in binary, the values ab, a − b, ln(a), e^a, and
1/(1 + a) can all be computed to additive accuracy 2^{−k} by circuits of size polynomial in k and depth O(log k).

We also need another result regarding the circuit complexity of iterated multiplication up to fixed accuracy.
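The idea behind the next proposition, computing an iterated product ratio through a sum of logarithms followed by a single exponential, can be sketched numerically. The floating-point sketch below is ours and only illustrates the arithmetic identity, not the circuit construction:

```python
import math

def product_ratio(a, b):
    # Compute (a_1 * ... * a_m) / (b_1 * ... * b_m) as
    # exp(sum ln a_i - sum ln b_i): the dominant cost becomes an iterated
    # *addition* of the log terms, plus unary ln/exp evaluations, each of
    # which is handled by small-depth circuits as in Proposition 7.5.
    return math.exp(sum(math.log(x) for x in a) - sum(math.log(y) for y in b))

a, b = [3, 5, 7, 11], [2, 4, 6, 8]
direct = (3 * 5 * 7 * 11) / (2 * 4 * 6 * 8)
assert abs(product_ratio(a, b) / direct - 1) < 1e-12  # small multiplicative error
```

An additive error of 2^{−k} in the logarithm translates into a multiplicative error of e^{±2^{−k}} ≈ 1 ± 2^{−k} in the ratio, which is the accuracy guarantee stated below.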
Proposition 7.6.
When a_1, ..., a_m and b_1, ..., b_m are k-bit integers, there is a circuit of size O(m log m + mk + k^c) for some constant c ≥ 1 and depth O(log k + log m) that computes the ratio (a_1 · · · a_m)/(b_1 · · · b_m) up to multiplicative accuracy 1 ± 2^{−k}.

Proof. This result can be obtained by computing ln((a_1 · · · a_m)/(b_1 · · · b_m)) = ∑_{i=1}^m (ln a_i − ln b_i) to additive accuracy 2^{−k}. The computation of each of the values ln a_i and ln b_i for 1 ≤ i ≤ m up to additive accuracy 2^{−k}/2m can be done with a circuit of size polynomial in n := k + log m + 1 and depth O(log n). The sum of the 2m terms can be done with a circuit for iterated addition of size O(mn) = O(m log m + mk) and depth O(log m + log n) = O(log m + log k), to compute the natural log of the ratio up to additive error 2^{−k} [Ofm62] (see also [Pip87; Weg87] and the references therein). Finally, a circuit of size polynomial in k and depth logarithmic in k can be used to compute the exponential of the final ratio.

Using these propositions, we can complete the proof of the lemma.

Proof of Lemma 7.4.
Let R be a randomized forecasting circuit which comes arbitrarily close to the infimum on the left-hand side. Consider the randomized forecasting circuit R′ obtained by drawing m forecasting circuits C_1, ..., C_m independently at random from R and combining their output values using the formula

C′(x) = 1 / (1 + ∏_{i≤m} (1 − C_i(x)) / C_i(x)).

Fixing m = max_x 1/score(R, x)^+, we obtain a randomized circuit R′ with score(R′, x)^+ = Ω(1) for each x ∈ Dom(f) and average size size(R′) = size(R) · m + O(m log m + mk + k^c) for some universal constant c ≥ 1. Then the proof is completed by noting that m = O(R(f)/size(R)).

Finally, we show that a forecasting circuit with non-trivial score can be converted back into a Boolean circuit with correspondingly small error.

Proposition 7.7.
For any partial function f : {0,1}^n → {0,1}, if there are a size s ≥ 1 and a parameter γ for which there is a randomized forecasting circuit R with k + 1 output wires, size s, depth d, and score(R, x) ≥ γ^2/2 for each x ∈ Dom(f), then there is also a randomized Boolean circuit R′ of size s + O(k) and depth d + O(1) that satisfies Pr_{C∼R′}[C(x) = f(x)] ≥ 1/2 + γ/2 for every x ∈ Dom(f).

Proof. Given a forecasting circuit C that outputs the value p on input x, we want to design a randomized Boolean circuit R_C that outputs the value 1 with probability p and 0 with probability 1 − p on input x.

We can do this by adding k random inputs r_1, ..., r_k that are used to generate a uniformly random value r ∈ {1/2^k, 2/2^k, ..., 1}. Then if the value p in the output of the circuit is 0, we output zero; otherwise we use a comparator circuit to return 1 if and only if r ≤ p. This value has the desired bias p, and using standard constructions (see, e.g., [Weg87; Vol99]) we can implement the comparator circuit with O(k) gates in a circuit of constant depth (in the unbounded fan-in model; or O(log k) depth in the bounded fan-in model).

The final randomized Boolean circuit R′ is defined by drawing a forecasting circuit C from R and outputting R_C. The bound on the error of R′ is then obtained as in the argument of Lemma 3.15.

Putting the above lemmas and propositions together completes the proof of the following theorem.

Theorem 7.8.
Fix n ∈ N. For every partial function f : {0,1}^n → {0,1}, there is a distribution µ on Dom(f) such that for all γ ∈ (0, 1/2),

R^µ_{˙γ}(f) = ˜Ω(γ^2 R(f)).

Define
RNC1(f) to be the minimum size of a randomized Boolean circuit of fan-in two and logarithmic depth that computes the partial Boolean function f with error at most 1/3 on every input x ∈ Dom(f). Similarly, let RNC1^µ_{˙γ}(f) denote the minimum size of a randomized Boolean circuit with the same fan-in and depth restrictions that computes f with error at most ˙γ = 1/2 − γ when the input is drawn from µ.

The constructions of Proposition 7.2, Lemma 7.4, and Proposition 7.7 can all be achieved with circuits of fan-in 2 that add only logarithmic depth overhead to the base circuits, so the analogue of Theorem 7.8 also holds for the class of circuits of fan-in two and logarithmic depth.

Theorem 7.9.
Fix n ∈ N. For every partial function f : {0,1}^n → {0,1}, there is a distribution µ on Dom(f) such that for all γ ∈ (0, 1/2),

RNC1^µ_{˙γ}(f) = ˜Ω(γ^2 RNC1(f)).

In fact, we can say even more about the efficiency of the transformations in each construction: all three of them can be accomplished with constant-depth and polynomial-size overhead when the circuits have threshold gates. For Proposition 7.2, this is because only a single additional gate is required. For Lemma 7.4, this is because the functions in Proposition 7.5 can all be computed to the required accuracy with threshold circuits of polynomial size and constant depth [RT92], and the iterated addition problem can be solved by a threshold circuit of constant depth and size O(m log m (k + log m)) [CSV84]. And for Proposition 7.7, this is because comparison can also be completed with polynomial-size and constant-depth circuits. Therefore, letting RTC0(f) denote the minimum size of a randomized constant-depth threshold circuit with unbounded fan-in that computes f with error probability at most 1/3 on every input, and RTC0^µ_{˙γ}(f) denote the minimum size of the same type of circuit that computes f(x) correctly with probability at least 1/2 + γ when x is drawn from µ, we obtain the following result.

Theorem 7.10. Fix n ∈ N. For every partial function f : {0,1}^n → {0,1}, there is a distribution µ on Dom(f) such that for all γ ∈ (0, 1/2),

RTC0^µ_{˙γ}(f) = ˜Ω(γ^2 RTC0(f)).

In order to complete the proof of the hardcore lemma as stated in Theorem 1.9, we need the following variant of the ratio minimax theorem, which applies to the setting where we consider a compact convex set of distributions, not just the set of all distributions over the function's domain.
Theorem 7.11.
Let V be a real topological vector space, and let R ⊆ V be convex. Let S be a nonempty finite set, and let ∆ be a compact and convex set of probability distributions over S, viewed as a subset of R^|S|. Let cost : R × ∆ → [0, ∞] be semicontinuous and saddle, and let score : R × ∆ → [−∞, ∞) be such that its negation, −score, is semicontinuous and saddle. Suppose cost and score are well-behaved. Then using the convention r/0 = ∞ for all r ∈ [0, ∞], we have

inf_{R ∈ R} max_{µ ∈ ∆} cost(R, µ)/score(R, µ)⁺ = max_{µ ∈ ∆} inf_{R ∈ R} cost(R, µ)/score(R, µ)⁺.

Proof.
The proof is identical to the one for (the first part of) Theorem 2.18, since that argument only uses the fact that the set of all distributions over S is convex and compact.

From this theorem we obtain the following variant of Lemma 7.3 for distributions with min-entropy δ.

Lemma 7.12.
Fix any k ≥ 1 and let R_k denote the set of all randomized forecasting circuits with resolution k. Then for every δ > 0 and function f : {0,1}^n → {0,1}, if we let ∆_δ denote the set of distributions over {0,1}^n with min-entropy δ, we have

inf_{R ∈ R_k} max_{µ ∈ ∆_δ} size(R)/score(R, µ)⁺ = max_{µ ∈ ∆_δ} inf_{R ∈ R_k} size(R)/score(R, µ)⁺.

We are now ready to complete the proof of Theorem 1.9.
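One direction of Lemma 7.12 needs no convexity at all: for any finite tables of costs and scores, the min-max of the ratio is at least the max-min (weak duality); the convexity of R_k and ∆_δ is what upgrades this to equality. A small numerical sketch with arbitrary (hypothetical) tables, using the paper's convention that a nonpositive score makes the ratio infinite:

```python
import random

random.seed(0)

def minmax_gap(cost, score):
    """For finite tables cost[r][m] and score[r][m], compare
    min_r max_m cost/score+ with max_m min_r cost/score+
    (ratio = infinity when the clamped score is 0)."""
    def ratio(r, m):
        s = max(score[r][m], 0.0)
        return cost[r][m] / s if s > 0 else float('inf')
    n_r, n_m = len(cost), len(cost[0])
    min_max = min(max(ratio(r, m) for m in range(n_m)) for r in range(n_r))
    max_min = max(min(ratio(r, m) for r in range(n_r)) for m in range(n_m))
    return min_max, max_min

for _ in range(100):
    cost = [[random.uniform(0, 1) for _ in range(4)] for _ in range(3)]
    score = [[random.uniform(-0.5, 1) for _ in range(4)] for _ in range(3)]
    mm, Mm = minmax_gap(cost, score)
    assert mm >= Mm  # weak duality always holds; equality needs convexification
```

Over these bare finite sets the gap is typically strict; allowing mixtures on both sides (as the lemma does) is what closes it.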
Proof of Theorem 1.9.
Fix s′ = c · s / log(1/δ) for some constant c to be fixed later. By Lemma 7.12, the two cases to consider are the following.

Case 1: max_{µ ∈ ∆_δ} inf_{R ∈ R_k} size(R)/score(R, µ)⁺ ≥ s′.

Fix a distribution µ with min-entropy δ for which the maximum is attained. Then every randomized forecasting circuit R has score score(R, µ) ≤ size(R)/s′. By Proposition 7.2, this implies that every randomized circuit R′ with size(R′) ≤ ǫ² s′ has success probability

Pr_{C ∼ R′, x ∼ µ}[C(x) = f(x)] ≤ 1/2 + √(size(R′)/s′) ≤ 1/2 + ǫ,

and the theorem holds in this case.

Case 2: inf_{R ∈ R_k} max_{µ ∈ ∆_δ} size(R)/score(R, µ)⁺ < s′.

Fix a randomized forecasting circuit R that satisfies size(R)/score(R, µ)⁺ < s′ for each distribution µ over {0,1}^n with min-entropy δ. Set α = size(R)/2s′ and define T ⊆ {0,1}^n to be the set of inputs x for which score(R, x) < α. Then |T| ≤ (1 − α) δ 2^n, since otherwise the score of R on the distribution µ′ that is uniform over any set T′ ⊇ T of size |T′| = δ 2^n (and thus has min-entropy δ) would be bounded above by

score(R, µ′) < (1 − α) · α + α < 2α,

contradicting the definition of R. By Lemma 7.4, there is a forecasting circuit R′ which satisfies size(R′) = O(s′) and score(R′, x) = Ω(1) for each x ∈ {0,1}^n \ T. Then by Proposition 7.7 there is a randomized Boolean circuit of size O(s′) that errs with probability at most 1/3 on each x ∈ {0,1}^n \ T, and by standard success amplification it also means that there is a circuit C of size s′′ = O(s′ log(1/δ)) with error less than δ. Choosing the value c in the definition of s′ appropriately, we then get that this circuit has size at most s, contradicting the premise of the theorem and therefore showing that Case 2 cannot occur.

Acknowledgements
We thank Justin Thaler for discussions and references related to approximate polynomial degree and its amplification. We also thank Andrew Drucker, Mika Göös, and Li-Yang Tan for correspondence about their ongoing work [BDG+20]. We thank anonymous reviewers for many helpful comments.
A Proofs related to the minimax theorem
Lemma 2.8 (An upper semicontinuous function on a compact set attains its max). Let X be a nonempty compact topological space, and let φ : X → [−∞, ∞] be a function. Then if φ is upper semicontinuous, it attains its maximum, meaning there is some x ∈ X such that for all x′ ∈ X, φ(x′) ≤ φ(x). Similarly, if φ is lower semicontinuous, it attains its minimum.

Proof. The lower semicontinuous case follows from the upper semicontinuous case simply by negating φ, so we focus on the upper semicontinuous case. Let z = sup_{x ∈ X} φ(x), where z ∈ [−∞, ∞]. Let x be any element of X. If φ(x) = z, we are done, so assume φ(x) < z; in particular, z > −∞. We define a sequence x₁, x₂, . . . as follows. If z < ∞, define x_i to be any element of X such that φ(x_i) > z − 1/i. If z = ∞, define x_i to be any element of X such that φ(x_i) > i. Moreover, for each i ∈ N, let U_i = {x ∈ X : φ(x) < φ(x_i)}. Note that any x ∈ X for which φ(x) < z must be in U_i for some i ∈ N; hence if the supremum z is not attained, the sets U_i form a cover of X (meaning ∪_{i ∈ N} U_i = X).

The key claim is that the U_i sets are all open if φ is upper semicontinuous. This is because if x ∈ U_i, then φ(x) < φ(x_i), and by the definition of upper semicontinuity, there is a neighborhood U of x on which φ(·) is still less than φ(x_i); thus there is a neighborhood U of x contained in U_i, so that U_i is open. In this case, if the supremum z is not attained, the collection {U_i}_{i ∈ N} is an open cover of X, and by the definition of compactness, it has a finite subcover. Let i be the largest index of some U_i in this subcover. Then it follows that φ(x) < φ(x_i) for all x ∈ X, which is a contradiction. Hence the supremum z must be attained as a maximum, as desired.

Lemma 2.9 (A pointwise infimum of upper semicontinuous functions is upper semicontinuous).
Let X be a topological space, let I be a set, and let {φ_i}_{i ∈ I} be a collection of functions φ_i : X → [−∞, ∞]. Then if each φ_i is upper semicontinuous, the function φ(x) = inf_{i ∈ I} φ_i(x) is also upper semicontinuous. Similarly, if each φ_i is lower semicontinuous, the pointwise supremum is lower semicontinuous.

Proof. Note that the case where the φ_i are all lower semicontinuous follows from the case where they are all upper semicontinuous simply by negating the functions, since negation flips upper and lower semicontinuity and flips infimums and supremums. We focus on the case where the φ_i are all upper semicontinuous.

Fix x ∈ X. If φ(x) = ∞, φ is upper semicontinuous at x by definition. If φ(x) < ∞, fix any y > φ(x). By the definition of φ(x) as an infimum, there is some i ∈ I such that φ_i(x) < y. By the upper semicontinuity of φ_i(·), there is a neighborhood U of x such that for all x′ ∈ U, we have φ_i(x′) < y. Then for all x′ ∈ U, we clearly have φ(x′) = inf_{i ∈ I} φ_i(x′) < y. Thus φ is upper semicontinuous at x, as desired.

Lemma A.1.
Let V be a real vector space, and let X ⊆ V. The convex hull of X is the set of all v ∈ V which can be written as a convex combination of vectors in X; that is, v for which there exist k ∈ N, x₁, x₂, . . . , x_k ∈ X, and λ₁, λ₂, . . . , λ_k ∈ [0, 1] with λ₁ + λ₂ + · · · + λ_k = 1 such that v = λ₁x₁ + λ₂x₂ + · · · + λ_k x_k.

Proof. This is a well-known characterization of the convex hull, which can be shown as follows: let Y be the set of all finite convex combinations of points in X; that is, Y contains all points in V of the form λ₁x₁ + λ₂x₂ + · · · + λ_k x_k, where k ∈ N, x₁, x₂, . . . , x_k ∈ X, and λ₁, λ₂, . . . , λ_k ∈ [0, 1] with λ₁ + λ₂ + · · · + λ_k = 1. Then Y is clearly convex, since for all y₁, y₂ ∈ Y and λ ∈ (0, 1), we know that y₁ and y₂ are finite convex combinations of points in X, meaning that λy₁ + (1 − λ)y₂ is also a finite convex combination of points in X. Furthermore, if Z is any other convex set containing X, then it’s easy to show by induction that Z contains all convex combinations of k points in X for all k ∈ N; hence Z must be a superset of Y. It follows that Conv(X), the intersection of all convex sets containing X, must exactly equal Y.

Lemma 2.10 (Quasiconvex functions on convex hulls). Let V be a real vector space, let X ⊆ V, and let φ : Conv(X) → [−∞, ∞] be a function. If φ is quasiconvex, then

sup_{x ∈ Conv(X)} φ(x) = sup_{x ∈ X} φ(x).

Similarly, if φ is quasiconcave, then inf_{x ∈ Conv(X)} φ(x) = inf_{x ∈ X} φ(x).

Proof.
The quasiconcave case follows from the quasiconvex case by negating φ; hence it suffices to prove the quasiconvex case. It is clear that sup_{x ∈ Conv(X)} φ(x) is at least sup_{x ∈ X} φ(x), so we only need to show the latter is at least the former. To this end, let y* := sup_{x ∈ Conv(X)} φ(x), and let x̂ ∈ Conv(X) be such that φ(x̂) is arbitrarily close to y*. We must show that sup_{x ∈ X} φ(x) ≥ φ(x̂), or equivalently, that there is some x ∈ X with φ(x) ≥ φ(x̂).

Using Lemma A.1, we can write x̂ ∈ Conv(X) as x̂ = λ₁x₁ + λ₂x₂ + · · · + λ_k x_k, with k ∈ N, x₁, x₂, . . . , x_k ∈ X, and λ₁, λ₂, . . . , λ_k ∈ [0, 1] with λ₁ + λ₂ + · · · + λ_k = 1. Furthermore, assume that λ_i > 0 for each i ∈ [k] (we can remove the terms with λ_i = 0 from the combination otherwise). Now, note that by quasiconvexity, we have φ(λx₁ + (1 − λ)x₂) ≤ max{φ(x₁), φ(x₂)}. It is not hard to show by induction that φ(λ₁x₁ + λ₂x₂ + · · · + λ_k x_k) ≤ max{φ(x₁), φ(x₂), . . . , φ(x_k)}. Hence there is some x ∈ X such that φ(x) ≥ φ(x̂), as desired.

Lemma 2.15.
Let V be a real topological vector space, and let X ⊆ V be convex. For a function ψ : X → [−∞, ∞], let ψ⁺ denote the function ψ⁺(x) = max{ψ(x), 0}. Then this operation on ψ preserves convexity, quasiconvexity, quasiconcavity, upper semicontinuity, and lower semicontinuity, but not concavity.

We actually prove a stronger statement, where the maximum is taken with an arbitrary constant.
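As a quick numerical illustration of this clamping operation (the specific functions below are illustrative choices, not from the paper), one can spot-check that ψ′ = max{ψ, c} remains quasiconvex along segments whenever ψ is:

```python
import math
import random

random.seed(1)

def is_quasiconvex_on_segment(f, x, y, steps=50):
    """Check f(lx + (1-l)y) <= max(f(x), f(y)) at grid points of the segment,
    which is the defining inequality of quasiconvexity along [x, y]."""
    bound = max(f(x), f(y))
    return all(f(x + (y - x) * t / steps) <= bound + 1e-12 for t in range(steps + 1))

psi = lambda x: math.sqrt(abs(x))      # quasiconvex (interval sublevel sets) but not convex
clamped = lambda x: max(psi(x), 0.5)   # psi' = max{psi, c} with c = 0.5

for _ in range(200):
    a, b = random.uniform(-3, 3), random.uniform(-3, 3)
    assert is_quasiconvex_on_segment(psi, a, b)
    assert is_quasiconvex_on_segment(clamped, a, b)
```

The same check with a concave ψ (say −x²) and its clamp shows why concavity is not preserved: the clamp flattens the peak but leaves the tails, creating a non-concave shape while keeping quasiconcavity.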
Lemma A.2.
Let V be a real topological vector space, and let X ⊆ V be convex. Let ψ : X → [−∞, ∞] be a function, let c ∈ R be a constant, and let ψ′ : X → [−∞, ∞] be the function ψ′(x) = max{ψ(x), c}. Then if ψ is convex, ψ′ is convex; if ψ is quasiconvex, ψ′ is quasiconvex; if ψ is quasiconcave, ψ′ is quasiconcave; if ψ is upper semicontinuous, ψ′ is upper semicontinuous; and if ψ is lower semicontinuous, ψ′ is lower semicontinuous.

Proof. Let x, y ∈ X, and let λ ∈ (0, 1). Then ψ′(λx + (1 − λ)y) = max{ψ(λx + (1 − λ)y), c}. If this maximum equals c, it is certainly at most λ max{ψ(x), c} + (1 − λ) max{ψ(y), c}, since these two latter maximums are each at least c. Hence the inequalities for convexity and quasiconvexity always hold when the original maximum equals c. Alternatively, if max{ψ(λx + (1 − λ)y), c} = ψ(λx + (1 − λ)y), then using ψ(x) ≤ ψ′(x) and ψ(y) ≤ ψ′(y), we see that convexity of ψ gives the inequality for convexity of ψ′, and quasiconvexity of ψ gives the inequality for quasiconvexity of ψ′.

Next, suppose ψ is quasiconcave. Without loss of generality, say that ψ(x) ≤ ψ(y). Then

ψ′(λx + (1 − λ)y) = max{ψ(λx + (1 − λ)y), c} ≥ max{ψ(x), c} = ψ′(x) ≥ min{ψ′(x), ψ′(y)},

and ψ′ is quasiconcave.

Preservation of lower semicontinuity follows from Lemma 2.9, where we note that the constant c is continuous as a function from X to R. It remains to show upper semicontinuity is preserved. Suppose ψ is upper semicontinuous, and let x ∈ X. If ψ′(x) = ∞, upper semicontinuity at x vacuously holds. Fix y > ψ′(x). Since ψ′(x) ≥ c, we have y > c, and also y > ψ′(x) ≥ ψ(x), so upper semicontinuity gives us a neighborhood U of x on which ψ(·) is less than y. Since y > c, we have ψ′(·) = max{c, ψ(·)} < y on U. Hence ψ′ is upper semicontinuous.

Theorem A.3 (Sion’s minimax [Sio58]).
Let V₁ and V₂ be real topological vector spaces, and let X ⊆ V₁ and Y ⊆ V₂ be convex. Let α : X × Y → R be semicontinuous and quasisaddle. If either X or Y is compact, then

inf_{x ∈ X} sup_{y ∈ Y} α(x, y) = sup_{y ∈ Y} inf_{x ∈ X} α(x, y).

Theorem 2.11 (Sion’s minimax for extended reals). Let V₁ and V₂ be real topological vector spaces, and let X ⊆ V₁ and Y ⊆ V₂ be convex. Let α : X × Y → [−∞, ∞] be semicontinuous and quasisaddle. If either X or Y is compact, then

inf_{x ∈ X} sup_{y ∈ Y} α(x, y) = sup_{y ∈ Y} inf_{x ∈ X} α(x, y).

Proof.
First, note that the inf-sup is always at least the sup-inf. This is because these expressions can be thought of as a game between two players, one choosing x and trying to minimize α(x, y), and the other choosing y and trying to maximize α(x, y); in the inf-sup, the sup player chooses y after already knowing x, and therefore has more information and is better positioned to maximize α(x, y) than in the sup-inf, where the inf player goes second.

Now, let

a := sup_{y ∈ Y} inf_{x ∈ X} α(x, y),  b := inf_{x ∈ X} sup_{y ∈ Y} α(x, y).

We have a, b ∈ [−∞, ∞], and a ≤ b. We wish to show a = b. Suppose by contradiction that a < b. Then we can pick a′, b′ ∈ R such that a < a′ < b′ < b. We then define α′ : X × Y → R by α′(x, y) := a′ if α(x, y) ≤ a′, α′(x, y) := b′ if α(x, y) ≥ b′, and α′(x, y) := α(x, y) if α(x, y) ∈ [a′, b′]. Note that α′(x, y) = max{a′, min{b′, α(x, y)}}. By Lemma A.2, we know that taking a maximum with a constant preserves quasiconvexity, quasiconcavity, and upper and lower semicontinuity. By negating the function, it also follows that taking a minimum with a constant preserves these properties. From this it follows that α′ is quasisaddle and semicontinuous, since α has these properties.

Now, since a = sup_{y ∈ Y} inf_{x ∈ X} α(x, y) and since a′ > a, we know that for all y ∈ Y, there exists some x ∈ X for which α(x, y) < a′. This means that for all y ∈ Y, there exists x ∈ X for which α′(x, y) = a′. Hence

sup_{y ∈ Y} inf_{x ∈ X} α′(x, y) = a′.

Similarly, since b = inf_{x ∈ X} sup_{y ∈ Y} α(x, y) and since b′ < b, we know that for all x ∈ X, there exists some y ∈ Y for which α(x, y) > b′. This means that for all x ∈ X, there exists y ∈ Y for which α′(x, y) = b′. Hence inf_{x ∈ X} sup_{y ∈ Y} α′(x, y) = b′. By Theorem A.3, we then have

b′ = inf_{x ∈ X} sup_{y ∈ Y} α′(x, y) = sup_{y ∈ Y} inf_{x ∈ X} α′(x, y) = a′.

But this is a contradiction, since we picked a′ < b′.
We conclude that we must have had a = b to begin with, as desired.

B Distance measures
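The properness claims proved below (Lemma 3.3) can also be probed numerically. The sketch below uses our reading of Definition 3.2, with the normalization s(1/2) = 0 and s(1) = 1 and with bias(q) = 2q − 1 (these exact formulas are assumptions); it checks that the expected score p·s(q) + (1 − p)·s(1 − q) is maximized near q = p for hs, ls, and Brier, but not for bias:

```python
import math

# Scoring rules, normalized so s(1/2) = 0 and s(1) = 1 (our reading of Def. 3.2).
hs    = lambda q: 1 - math.sqrt((1 - q) / q)
ls    = lambda q: 1 - math.log2(1 / q)
brier = lambda q: 1 - 4 * (1 - q) ** 2
bias  = lambda q: 2 * q - 1

def best_forecast(s, p, grid=2000):
    """Forecast q in (0,1) maximizing the expected score p*s(q) + (1-p)*s(1-q)."""
    qs = [i / grid for i in range(1, grid)]
    return max(qs, key=lambda q: p * s(q) + (1 - p) * s(1 - q))

p = 0.7
for s in (hs, ls, brier):
    assert abs(best_forecast(s, p) - p) < 1e-2   # proper: honest q = p is optimal
assert abs(best_forecast(bias, p) - p) > 0.25    # bias: optimum pushed to an extreme
```

For the linear rule bias, the expected score is (2q − 1)(2p − 1), so the optimal forecast is always 0 or 1 regardless of p, which is exactly why it fails to be proper.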
Lemma 3.3. hs, Brier, and ls are proper scoring rules. bias is a scoring rule which is not proper.

Proof. It is clear that all of the functions from Definition 3.2 are smooth on (0, 1] and increasing on [0, 1], where we interpret hs(0) = ls(0) = −∞. It is also clear that all these functions evaluate to 1 at 1 and to 0 at 1/2. It remains to show that Brier, ls, and hs are proper. To do so, we need to show that p s(q) + (1 − p) s(1 − q) is uniquely optimized at q = p when s is one of these functions and p ∈ (0, 1). Fix such p ∈ (0, 1), and observe that the critical points of the expression we wish to maximize are the points q such that p s′(q) = (1 − p) s′(1 − q).

For ls(q) = 1 − log(1/q) = 1 + (log e) ln q, the critical points q satisfy (log e) p/q = (log e)(1 − p)/(1 − q), or p/(1 − p) = q/(1 − q). Noting that the function x/(1 − x) is increasing on (0, 1), and hence injective on (0, 1), we conclude that the only critical point is q = p. Moreover, at the boundaries q = 0 and q = 1, we clearly have p ls(q) + (1 − p) ls(1 − q) = −∞, whereas in the interior the expression is finite. Hence the unique maximum must occur at q = p.

For hs(q) = 1 − √((1 − q)/q) = 1 − √(1/q − 1), we have hs′(q) = 1/(2√(q³(1 − q))), so the critical points q satisfy p/√(q³(1 − q)) = (1 − p)/√((1 − q)³ q), or p/(1 − p) = q/(1 − q), which once again only occurs at q = p. At the boundaries, we once again have p hs(q) + (1 − p) hs(1 − q) = −∞ for q = 0 or q = 1, so the unique maximum occurs at q = p.

Finally, for Brier(q) = 1 − 4(1 − q)² = −4q² + 8q − 3, we have Brier′(q) = 8(1 − q), so the critical points q satisfy 8p(1 − q) = 8(1 − p)q, which again implies q = p. This time, the boundary points are finite, but we can use the second-order condition: the second derivative of p Brier(q) + (1 − p) Brier(1 − q) is p Brier″(q) + (1 − p) Brier″(1 − q). Noting that Brier″(q) = −8, this is −8p − 8(1 − p) = −8 < 0. Hence the critical point is a maximum, and since it is unique (with the boundaries 0 and 1 not being critical even if we extend the domain of the function), we conclude it is the unique maximum.

Lemma B.1.
For any x ∈ [0, 1], we have

x²/2 ≤ 1 − √(1 − x²) ≤ 1 − H((1 + x)/2) ≤ x² ≤ x.

Additionally, x² and −√(1 − x²) are convex functions on [0, 1].

Proof. x² ≤ x is clearly true for x ∈ [0, 1]. To see that x²/2 ≤ 1 − √(1 − x²), note that this is equivalent to y/2 ≤ 1 − √(1 − y) for y ∈ [0, 1] (by setting y = x²); the latter is clearly true at y = 0, so it suffices to show the right-hand side grows faster. Taking derivatives, it suffices to show 1/2 ≤ 1/(2√(1 − y)), which is clearly true for y ∈ [0, 1].

Next, write

1 − H((1 + x)/2) = 1 − ((1 + x)/2) log(2/(1 + x)) − ((1 − x)/2) log(2/(1 − x))
= 1 − (1 + x)/2 − (1 − x)/2 + ((1 + x)/2) log(1 + x) + ((1 − x)/2) log(1 − x)
= (1/ln 4) ((1 + x) ln(1 + x) + (1 − x) ln(1 − x)).

Let α(x) = (1 + x) ln(1 + x) + (1 − x) ln(1 − x). We show that α(x)/x² is increasing and that α(x)/(1 − √(1 − x²)) is decreasing; this suffices to show the desired inequalities, since it means we only need to check x = 1, where the inequalities 1 − √(1 − x²) ≤ 1 − H((1 + x)/2) ≤ x² hold with equality.

The derivative of α(x) is ln(1 + x) − ln(1 − x). The derivative of α(x)/x² is therefore x² ln(1 + x) − x² ln(1 − x) − 2x(1 + x) ln(1 + x) − 2x(1 − x) ln(1 − x) divided by x⁴ > 0 (for x ∈ (0, 1)). This simplifies to −2 ln(1 − x²) − x ln((1 + x)/(1 − x)). This is positive if and only if 2 ln(1 − x²) + x ln((1 + x)/(1 − x)) is negative. This expression equals 0 at x = 0, so it suffices to show it is decreasing on (0, 1). The derivative is −2x/(1 − x²) + ln((1 + x)/(1 − x)), which is again 0 at x = 0, so it again suffices to show the derivative is negative on (0, 1). The derivative of this expression is −4x²/(1 − x²)², which is finally a quantity that is clearly negative, completing the argument; hence α(x)/x² is increasing on [0, 1].

The derivative of α(x)/(1 − √(1 − x²)) is

(1 − x − √(1 − x²)) ln(1 − x) − (1 + x − √(1 − x²)) ln(1 + x)

divided by some denominator which is positive on (0, 1). This equals −x ln(1 − x²) − (1 − √(1 − x²)) ln((1 + x)/(1 − x)). Note that ln(1 − x²) = −x² − x⁴/2 − · · · − x^{2i}/i − · · · and that ln((1 + x)/(1 − x)) = ln(1 + x) − ln(1 − x) = 2x + 2x³/3 + · · · + 2x^{2i−1}/(2i − 1) + · · ·, so the expression equals

x ∑_{i=1}^∞ x^{2i}/i − (1 − √(1 − x²)) ∑_{i=1}^∞ x^{2i−1}/(i − 1/2) = (√(1 − x²) − (1 − x²)) ∑_{i=1}^∞ −x^{2i−1}/(i(2i − 1)) < 0.

Hence α(x)/(1 − √(1 − x²)) is decreasing on [0, 1], as desired.

It is clear that x² and −√(1 − x²) are convex functions on [0, 1], as their second derivatives are 2 > 0 and (1 − x²)^{−3/2} > 0 (for x ∈ (0, 1)) respectively.

Lemma 3.6 (Relations between distance measures). When applied to fixed ν₀, ν₁, and w, the distance measures satisfy

S²/2 ≤ 1 − √(1 − S²) ≤ h² ≤ JS ≤ S²

as well as ∆² ≤ S² ≤ ∆. We also have JS ≤ h²/ln 2 and S² ≤ (ln 4) JS.

Proof. We use Lemma B.1. The chain h² ≤ JS ≤ S² ≤ ∆ follows from the inequalities there, while the inequalities ∆² ≤ S² and 1 − √(1 − S²) ≤ h² follow from Jensen’s inequality combined with the convexity of x² and −√(1 − x²).

Finally, to show the inequality JS ≤ h²/ln 2, we only need to compute the limit of α(x)/(1 − √(1 − x²)) as x → 0, since this ratio is decreasing in x (where α(x) is defined as in the proof of Lemma B.1). To do that it suffices to use α(x) = x² + O(x⁴) and 1 − √(1 − x²) = x²/2 + O(x⁴), so the limit is 2. Hence the limit of (1 − H((1 + x)/2))/(1 − √(1 − x²)) as x → 0 is 1/ln 2, meaning this ratio is always at most 1/ln 2. Similarly, to show the inequality S² ≤ (ln 4) JS, we only need to compute the limit of α(x)/x² as x → 0. Again using α(x) = x² + O(x⁴), the limit is 1, so the ratio (1 − H((1 + x)/2))/x² is always at least 1/ln 4.

Lemma 3.11. If x ∈ [0, 1] and k ∈ [1, ∞), we have

(1/2) min{kx, 1} ≤ 1 − (1 − x)^k ≤ min{kx, 1}.

Proof. Set f(x) := 1 − (1 − x)^k. Clearly, when x ∈ [0, 1], we have f(x) ∈ [0, 1], so f : [0, 1] → [0, 1]. Note f(0) = 0, f(1) = 1, and that f(x) is increasing on [0, 1]. If k = 1, we have f(x) = x, and the inequalities trivially hold; therefore, assume k > 1. Then f′(x) = k(1 − x)^{k−1} and f″(x) = −k(k − 1)(1 − x)^{k−2}, meaning that f(x) is concave on [0, 1]; we also have f′(0) = k and f″(0) = −k(k − 1). From this we conclude that f(x) ≤ kx, proving the upper bound (as f(x) ≤ 1 is clear).

For the lower bound, note that f‴(x) = k(k − 1)(k − 2)(1 − x)^{k−3}, which is non-negative on [0, 1]. This means that f″(x) ≥ −k(k − 1) on [0, 1], that f′(x) ≥ k − k(k − 1)x on [0, 1], and that f(x) ≥ kx − (k(k − 1)/2)x² = kx(1 − (k − 1)x/2) on [0, 1]. If (k − 1)x ≤ 1, we get f(x) ≥ kx/2. If (k − 1)x ≥ 1, we have f(x) ≥ 1 − e^{−kx} ≥ 1 − 1/e ≥ 1/2. This completes the proof.

Lemma 4.4 (Hellinger distance of disjoint mixtures). Let µ be a distribution over a finite support A, and for each a ∈ A, let ν⁰_a and ν¹_a be two distributions over a finite support S_a. Let ν⁰_µ and ν¹_µ denote the mixture distributions where a ← µ is sampled, and then a sample is produced from ν⁰_a or ν¹_a respectively. Assume the sets S_a are disjoint for all a ∈ A. Then

h²(ν⁰_µ, ν¹_µ) = E_{a←µ}[h²(ν⁰_a, ν¹_a)].

Proof.
Note that the squared Hellinger distance is one minus the fidelity, that is, h²(µ₀, µ₁) = 1 − F(µ₀, µ₁) where F(µ₀, µ₁) = ∑_x √(µ₀[x] µ₁[x]) (this is easy to check from the definition of h²). Now write

h²(ν⁰_µ, ν¹_µ) = 1 − ∑_{x ∈ ∪_a S_a} √(ν⁰_µ[x] ν¹_µ[x])
= 1 − ∑_{a ∈ A} ∑_{x ∈ S_a} √(µ[a] ν⁰_a[x] µ[a] ν¹_a[x])
= 1 − E_{a←µ}[ ∑_{x ∈ S_a} √(ν⁰_a[x] ν¹_a[x]) ]
= E_{a←µ}[ 1 − ∑_{x ∈ S_a} √(ν⁰_a[x] ν¹_a[x]) ]
= E_{a←µ}[ h²(ν⁰_a, ν¹_a) ].

C Quantum amplitude estimation
We show the following strengthening of Theorem 5.1, which follows from [BHMT02].
Theorem C.1 (Amplitude estimation). Suppose we have access to a unitary U (representing a quantum algorithm) which maps |0⟩ to |ψ⟩, as well as access to a projective measurement Π, and we wish to estimate p := ‖Π|ψ⟩‖² (representing the probability that the quantum algorithm accepts). Fix ǫ, δ ∈ (0, 1/2). Then using at most (100/ǫ) · ln(1/δ) controlled applications of U or U† and at most that many applications of I − 2Π, we can output ˜p ∈ [0, 1] such that |˜p − p| ≤ ǫ with probability at least 1 − δ.

Further, this can be tightened to a bound that depends on p, as follows. For any positive real number T, there is an algorithm which depends on ǫ, δ, and T (but not on p) which uses at most T applications of the unitaries (as above) and outputs ˜p ∈ [0, 1] with the following guarantee: if T is at least ⌊(100/ǫ)√(max{p, ǫ}) · ln(1/δ)⌋, then |˜p − p| ≤ ǫ with probability at least 1 − δ.

Proof. [BHMT02] showed that an algorithm which makes M controlled calls to the unitary U(I − 2|0⟩⟨0|)U⁻¹(I − 2Π) and one additional call to U can output ˜p such that

|˜p − p| ≤ 2π √(p(1 − p))/M + π²/M²

with probability at least 8/π² ≥ 4/5. If we pick M such that M ≥ 2π/√ǫ and M ≥ 4π√p/ǫ, then this is at most ǫ/2 + ǫ/4 ≤ ǫ. Note that M must be an integer, and that the number of applications of U or U⁻¹ is M + 1. Hence to get this success probability, it suffices to have T ≥ (19/ǫ)√(max{p, ǫ}).

To generalize to other success probabilities, we amplify this algorithm by repeating it 2k + 1 times and returning the median estimate. Writing q ≤ 1/5 for the failure probability of a single run and C(n, j) for the binomial coefficient, the probability that the median is still wrong is the probability that at least k + 1 out of 2k + 1 of the estimates were wrong, which is

∑_{i=1}^{k+1} C(2k+1, k+1−i) q^{k+i} (1 − q)^{k+1−i} ≤ q^{k+1}(1 − q)^k ∑_{i=1}^{k+1} C(2k+1, k+1−i) = q^{k+1}(1 − q)^k 2^{2k} = q(1 − (1 − 2q)²)^k ≤ q e^{−k(1−2q)²}.

Hence to get this below δ, we just need k ≥ (1/(1 − 2q)²) ln(q/δ); since q ≤ 1/5, k ≥ 2.8 ln(1/δ) suffices.
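The tail bound for the median trick derived above can be spot-checked numerically (a sketch; the exact binomial computation below is ours, not from [BHMT02]):

```python
from math import comb, exp

def median_failure(q, k):
    """Exact probability that at least k+1 of 2k+1 independent estimates,
    each failing with probability q, fail (so the median estimate fails)."""
    n = 2 * k + 1
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(k + 1, n + 1))

# Spot-check the derived bound: P[median wrong] <= q * exp(-k (1-2q)^2).
for q in [0.05, 0.1, 0.2]:
    for k in range(0, 30):
        assert median_failure(q, k) <= q * exp(-k * (1 - 2 * q) ** 2) + 1e-15
```

The exponential decay in k is what lets the overall query count pick up only a ln(1/δ) factor.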
Since k must be an integer, we choose the smallest such k; then 2k + 1 = O(ln(1/δ)). Multiplying this by the bound from before, we get that it suffices for T to be at least (100/ǫ)√(max{p, ǫ}) · ln(1/δ), as desired.

References

[Alt88] Helmut Alt. Comparing the combinational complexities of arithmetic functions. Journal of the ACM (1988) (p. 49).

[AR20] Scott Aaronson and Patrick Rall. Quantum Approximate Counting, Simplified. Proceedings of the 3rd Symposium on Simplicity in Algorithms (SOSA). 2020 (p. 33).

[BB19] Eric Blais and Joshua Brody. Optimal Separation and Strong Direct Sum for Randomized Query Complexity. Proceedings of the 34th Conference on Computational Complexity (CCC). 2019 (p. 5).

[BB20] Shalev Ben-David and Eric Blais. A tight composition theorem for the randomized query complexity of partial functions. Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS). 2020 (pp. 4, 7, 8).

[BBGK18] Shalev Ben-David, Adam Bouland, Ankit Garg, and Robin Kothari. Classical Lower Bounds from Quantum Upper Bounds.
Proceedings of the 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS). 2018 (p. 44).

[BCH86] Paul W. Beame, Stephen A. Cook, and H. James Hoover. Log Depth Circuits for Division and Related Problems. SIAM Journal on Computing (1986). Previous version in FOCS 1984 (p. 49).

[BDG+20] Andrew Bassilakis, Andrew Drucker, Mika Göös, Lunjia Hu, Weiyun Ma, and Li-Yang Tan. The Power of Many Samples in Query Complexity. Proceedings of the 47th International Colloquium on Automata, Languages, and Programming (ICALP). 2020 (pp. 8, 52).

[BGK+18] Mark Braverman, Ankit Garg, Young Kun Ko, Jieming Mao, and Dave Touchette. Near-Optimal Bounds on the Bounded-Round Quantum Communication Complexity of Disjointness. SIAM Journal on Computing (2018). Previous version in FOCS 2015 (p. 5).

[BHK09] Boaz Barak, Moritz Hardt, and Satyen Kale. The Uniform Hardcore Lemma via Approximate Bregman Projections. Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms. 2009 (p. 8).

[BHMT02] Gilles Brassard, Peter Høyer, Michele Mosca, and Alain Tapp. Quantum amplitude amplification and estimation. Proceedings of an AMS Special Session on Quantum Computation and Information (CONM). 2002. arXiv: quant-ph/0005055 (pp. 33, 58, 59).

[BNRW07] Harry Buhrman, Ilan Newman, Hein Röhrig, and Ronald de Wolf. Robust Polynomials and Quantum Algorithms. Theory of Computing Systems (2007). Previous version in STACS 2005. arXiv: quant-ph/0309220 (p. 41).

[Bra15] Mark Braverman. Interactive Information Complexity. SIAM Journal on Computing (2015). Previous version in STOC 2012 (p. 5).

[BSS05] Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Preprint, 2005. url: pdfs.semanticscholar.org/d670/6b6e626c15680688b0774419662f2341caee.pdf (pp. 5, 20).

[CSV84] Ashok K. Chandra, Larry Stockmeyer, and Uzi Vishkin. Constant Depth Reducibility. SIAM Journal on Computing (1984) (p. 50).

[GKKT17] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably Learning the ReLU in Polynomial Time. Proceedings of the 30th Annual Conference on Learning Theory (COLT). 2017 (p. 41).

[GR07] Tilmann Gneiting and Adrian E. Raftery. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association (2007) (p. 5).

[Imp95] R. Impagliazzo. Hard-core distributions for somewhat hard problems. Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science (FOCS). 1995 (pp. 5, 8).

[Jac11] Dunham Jackson. “Über die Genauigkeit der Annäherung stetiger Funktionen durch ganze rationale Funktionen gegebenen Grades und trigonometrische Summen gegebener Ordnung”. PhD thesis. University of Göttingen, 1911. url: gdz.sub.uni-goettingen.de/id/PPN30230648X (p. 41).

[KS03] Adam R. Klivans and Rocco A. Servedio. Boosting and Hard-Core Set Construction. Machine Learning (2003). Previous version in FOCS 1999 (p. 8).

[LS09] Troy Lee and Adi Shraibman. An Approximation Algorithm for Approximation Rank. Proceedings of the 24th Conference on Computational Complexity (CCC). 2009 (p. 46).

[LSŠ08] Troy Lee, Adi Shraibman, and Robert Špalek. A Direct Product Theorem for Discrepancy. Proceedings of the 23rd Conference on Computational Complexity (CCC). 2008 (p. 44).

[MCAL17] Marianthi Markatou, Yang Chen, Georgios Afendras, and Bruce G. Lindsay. Statistical Distances and Their Role in Robustness. New Advances in Statistics and Data Science (2017) (p. 21).

[MMR94] G. V. Milovanović, D. S. Mitrinović, and Th. M. Rassias. Topics in Polynomials: Extremal Problems, Inequalities, Zeros. World Scientific, 1994. isbn: 978-981-02-0499-0 (p. 41).

[Ofm62] Yuri P. Ofman. On the algorithmic complexity of discrete functions. Doklady Akademii Nauk (1962) (p. 49).

[Pip87] Nicholas Pippenger. The complexity of computations by networks. IBM Journal of Research and Development (1987) (p. 49).

[RT92] John H. Reif and Stephen R. Tate. On Threshold Circuits and Polynomial Computation. SIAM Journal on Computing (1992) (p. 50).

[RW11] Mark D. Reid and Robert C. Williamson. Information, Divergence and Risk for Binary Experiments. Journal of Machine Learning Research (2011). url: http://jmlr.org/papers/v12/reid11a.html (p. 21).

[Sha03] Ronen Shaltiel. Towards proving strong direct product theorems. Computational Complexity (2003). Previous version in CCC 2001 (p. 4).

[She12] Alexander A. Sherstov. Strong Direct Product Theorems for Quantum Communication and Query Complexity. SIAM Journal on Computing (2012). Previous version in STOC 2011 (p. 44).

[She13] Alexander A. Sherstov. Making Polynomials Robust to Noise. Theory of Computing (2013). Previous version in STOC 2012 (pp. 41, 46).

[Sio58] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics (1958) (pp. 5, 14, 55).

[Tøp00] Flemming Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory (2000) (p. 21).

[TTV09] Luca Trevisan, Madhur Tulsiani, and Salil Vadhan. Regularity, Boosting, and Efficiently Simulating Every High-Entropy Distribution. Proceedings of the 24th Conference on Computational Complexity (CCC). 2009 (p. 8).

[Ver98] Nikolai K. Vereshchagin. Randomized Boolean decision trees: Several remarks. Theoretical Computer Science (1998) (pp. 5, 10).

[Vol99] Heribert Vollmer. Introduction to Circuit Complexity: A Uniform Approach. Springer Berlin Heidelberg, 1999. isbn: 978-3-642-08398-3 (p. 50).

[Weg87] Ingo Wegener. The Complexity of Boolean Functions. Wiley, 1987. isbn: 3-519-02107-2. url: eccc.weizmann.ac.il/static/books/The_Complexity_of_Boolean_Functions/ (pp. 49, 50).

[Yao77] Andrew Yao. Probabilistic computations: toward a unified measure of complexity. Proceedings of the 18th Annual IEEE Symposium on Foundations of Computer Science (FOCS) (1977). doi: 10.1109/SFCS.1977.24