A New Minimax Theorem for Randomized Algorithms
Shalev Ben-David
University of Waterloo [email protected]
Eric Blais
University of Waterloo [email protected]
Abstract
The celebrated minimax principle of Yao (1977) says that for any Boolean-valued function f with finite domain, there is a distribution µ over the domain of f such that computing f to error ε against inputs from µ is just as hard as computing f to error ε on worst-case inputs. Notably, however, the distribution µ depends on the target error level ε: the hard distribution which is tight for bounded error might be trivial to solve to small bias, and the hard distribution which is tight for a small bias level might be far from tight for bounded error levels.

In this work, we introduce a new type of minimax theorem which can provide a hard distribution µ that works for all bias levels at once. We show that this works for randomized query complexity, randomized communication complexity, some randomized circuit models, quantum query and communication complexities, approximate polynomial degree, and approximate log-rank. We also prove an improved version of Impagliazzo's hardcore lemma.

Our proofs rely on two innovations over the classical approach of using von Neumann's minimax theorem or linear programming duality. First, we use Sion's minimax theorem to prove a minimax theorem for ratios of bilinear functions representing the cost and score of algorithms. Second, we introduce a new way to analyze low-bias randomized algorithms by viewing them as "forecasting algorithms" evaluated by a certain proper scoring rule. The expected score of the forecasting version of a randomized algorithm appears to be a more fine-grained way of analyzing the bias of the algorithm. We show that such expected scores have many elegant mathematical properties: for example, they can be amplified linearly instead of quadratically. We anticipate forecasting algorithms will find use in future work in which a fine-grained analysis of small-bias algorithms is required.
1 Introduction
Yao's minimax principle [Yao77] is a central tool in the analysis of randomized algorithms in many different models of computation. In its most commonly-used form, it states that for every Boolean-valued function f with a finite domain, if R(c) denotes the set of randomized algorithms with worst-case cost at most c and ∆ denotes the set of distributions over the domain of f, then

min_{R∈R(c)} max_{µ∈∆} Pr[R(x) ≠ f(x)] = max_{µ∈∆} min_{R∈R(c)} Pr[R(x) ≠ f(x)]

with both probabilities being over the choice of x drawn from µ and the internal randomness of R. This identity implies that there exists a distribution µ for which any algorithm that computes f with bounded error over inputs drawn from µ must have cost at least R(f), the cost of computing f to worst-case bounded error. But it does not say anything else about µ itself. Notably,

I. The minimax principle does not guarantee that the resulting distribution µ must be balanced on the sets f⁻¹(0) and f⁻¹(1).

II. More generally, it does not rule out the possibility that f is very easy to compute by randomized algorithms that are only required to output the correct value with probability at least 1/2 + γ for some small bias γ > 0 over inputs drawn from the distribution µ.

A separate application of the minimax principle can be used to show that there is a distribution µ′ for which all randomized algorithms computing f with bias γ over µ′ have cost at least R_{γ̇}(f) (the cost of computing f to worst-case error γ̇ := (1 − γ)/2), but then there is no guarantee that randomized algorithms with bounded error over µ′ must have cost anywhere close to R(f).

Intuitively, it seems reasonable to expect that for every function f, there is a distribution µ for f that addresses issues I and II: a distribution that is balanced on f⁻¹(0) and f⁻¹(1), and which is at least slightly hard even to solve to a small bias level γ.

Question 1.1 (Informal).
Is there a distribution µ which certifies the hardness of f for all bias levels γ > 0 at the same time?

More formally, observe that the cost of computing f to worst-case bias γ cannot be smaller than about γ²R(f). This is because randomized algorithms can be amplified: by repeating an algorithm O(1/γ²) times and outputting the majority vote of the runs, we can increase its bias from γ to Ω(1). Therefore, a natural refinement of Question 1.1 is as follows.
Question 1.2 (Refinement of Question 1.1). Is there a distribution µ such that for all bias levels γ > 0, any algorithm computing f to bias γ against µ must have cost at least Ω(γ²R(f))?

Question 1.2 is the primary focus of this work. We answer it affirmatively in a variety of computational models (we can handle most models in which amplification and Yao's minimax principle both apply). We note that the distribution satisfying the conditions of Question 1.2 is hard for bounded error in Yao's sense, since each algorithm solving f to bounded error against µ must have cost at least Ω(R(f)). In addition to this, such µ must also be perfectly balanced between 0- and 1-inputs of f (by considering the limit as γ → 0), and must remain somewhat hard to solve even to small bias levels.

The study of Question 1.2 has led us to consider randomized forecasting algorithms which output probabilistic confidence predictions about the value of f(x), instead of a Boolean guess for f(x). When evaluated using a certain proper scoring rule, the best possible score of a forecasting algorithm is intimately related to the best possible bias of a randomized algorithm; in fact, the score appears to be a more fine-grained way of measuring the bias. Scores of forecasting algorithms appear to be the "right" way of measuring the success of randomized algorithms, as such scores satisfy elegant mathematical properties. The following question, which we answer affirmatively, turns out to be a strengthening of Question 1.2.

Question 1.3.
Is there a distribution µ such that for all η > 0, any forecasting algorithm which achieves expected score at least η against µ must have cost at least Ω(ηR(f))?

The answers to Question 1.2 and Question 1.3 have a direct impact on the study of composition theorems and joint computation problems in randomized computational models: a natural approach for such problems involves first applying a minimax theorem and then establishing the required inequalities in the deterministic distributional setting. However, as observed by Shaltiel [Sha03], this approach runs into trouble if the hard distribution is easy to solve to small bias. Specifically, Shaltiel considered distributions µ which are hard to solve most of the time, but which give a completely trivial input with small probability γ. Then computing n independent copies from µ is a little easier than n times the cost of computing f, because on average, γn of the copies are trivial; the cost of computing n independent inputs from µ is at most (1 − γ)n times the cost of solving f.

Things get even worse when the inputs have a promised correlation, as can happen when proving composition theorems. For a concrete example, consider the partial function Trivial_n, which is defined on domain {0ⁿ, 1ⁿ} and maps 0ⁿ → 0 and 1ⁿ → 1. Suppose we want to prove a composition lower bound with Trivial_n on the outside: that is, we want to show that for every function f, computing Trivial_n composed with n copies of f requires Ω(R(f)) cost.
In other words, we want to lower bound the cost of an algorithm which outputs 0 when given n 0-inputs to f, outputs 1 when given n 1-inputs to f, and outputs arbitrarily when given some other type of input.

Now, if we try to lower bound this using the hard distribution from Yao's minimax principle, then the distribution might give a trivial input with small probability γ, as Shaltiel observed; but then so long as n = Ω(1/γ), one of the inputs to f will be trivial with high probability, and we can solve this "all-0s vs all-1s" problem simply by searching for the trivial copy – potentially much faster than the worst-case cost of computing a single copy of f!

The hard distributions we give in this work solve this issue by being hard for all bias levels. In our companion manuscript [BB20], we use one of the query versions of our minimax theorem (Theorem 4.6) to prove a new composition theorem for randomized query complexity.

Minimax theorem for cost/score ratios.
The first main result is a new minimax theorem for the ratio of the cost and score of randomized algorithms. A special case of the theorem with a simple formulation is as follows.
Theorem 1.4 (Special case of Theorem 2.18). Let R be a set of randomized algorithms that can be expressed as a convex subset of a real topological vector space. Let S be a nonempty finite set, and let ∆ be the set of all probability distributions over S, viewed as a subset of R^{|S|}. Let cost : R × ∆ → (0, ∞) and score : R × ∆ → [−∞, ∞) be continuous bilinear functions. Then using the convention r/0 = ∞ for all r ∈ (0, ∞) and the notation r⁺ = max{r, 0} for all r ∈ [−∞, ∞], we have

inf_{R∈R} max_{x∈S} cost(R, x)/score(R, x)⁺ = max_{µ∈∆} inf_{R∈R} cost(R, µ)/score(R, µ)⁺.

Further, all of the above maximums are attained.

The general version of the minimax theorem in Theorem 2.18 shows that the same identity holds even when the cost and score functions are semicontinuous and saddle (but not necessarily linear) under some mild additional restrictions. Furthermore, a variant of the theorem also holds when we consider convex and compact subsets of distributions over the finite set S instead of the set ∆ of all distributions over that set.

Minimax theorems for ratios of semicontinuous and saddle functions as in Theorem 2.18 do not seem to have appeared in the literature previously in the precise form we need, but as we show in Section 2, they can be obtained by extending Sion's minimax theorem [Sio58] with standard arguments. We believe that the main contribution of Theorem 2.18 is in its interpretation for randomized algorithms. Various extensions and variations of Yao's minimax theorem have been considered in the computer science literature previously [Yao77; Imp95; Ver98; Bra15; BGK+18; BB19], but all of them appear to consider the cost of an algorithm (with the minimax theorem applied to algorithms with a fixed worst-case score), the score of an algorithm (with the cost being fixed), or a linear combination of the two. None of those variants suffice to answer the questions raised at the beginning of the introduction or to establish the results in the following subsections; what was needed in those cases was a minimax theorem for the ratio of the cost/score of randomized algorithms, and we suspect that this ratio minimax theorem will find further applications in computer science in the future as well.

Forecasting algorithms and linear amplification.
To convert the statements obtained from Theorem 2.18 regarding the cost/score ratios of randomized algorithms under some distribution µ into more familiar lower bounds on the cost of randomized algorithms that achieve some bias on µ, we need a linear amplification theorem. Ideally, we would like to argue that if there exists a randomized algorithm R with bias γ on µ, then by combining O(1/γ) instances of R we can obtain a randomized algorithm R′ with cost(R′, µ) = O(cost(R, µ)/γ) = O(cost(R, µ)/bias_f(R, µ)) and constant bias. Unfortunately, such a linear amplification property does not hold for most models of randomized algorithms, where amplification from bias γ to bounded error requires combining O(1/γ²) instances of the original algorithm. To obtain a linear amplification result, we must turn our attention away from bias and error and consider other score functions instead.

To describe our score function, we first generalize our computational model from randomized algorithms that output 0 or 1 to forecasting algorithms, which are randomized algorithms that output a confidence value in [0, 1] for the value f(x) of the function f on their given input x. A "low" confidence prediction is a value close to 1/2 whereas a "high" confidence prediction would be a value very close to 0 or to 1. There are many natural ways to assign a score to a confidence value for f(x). The study of such scoring rules and their properties has a rich history in the statistics and decision theory communities (see for instance [BSS05; GR07] and references therein); we discuss some fundamental scoring rules and give relations between them in Section 3. Of particular importance to our main purpose is the scoring rule hs : [0, 1] → [−∞, 1] defined by

hs_f(p) = 1 − √((1 − p)/p)   when f(x) = 1,
hs_f(p) = 1 − √(p/(1 − p))   when f(x) = 0.
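For concreteness, the scoring rule above is easy to compute directly. The following sketch (ours, purely illustrative and not from the paper) implements hs_f and numerically checks the bias-to-score conversion made precise in Lemma 1.6 below: a bias-γ algorithm, viewed as a forecaster that reports confidence (1 + γ)/2 in its guess, has expected hs score 1 − √(1 − γ²) ≥ γ²/2.

```python
import math

def hs_score(p, fx):
    """The scoring rule hs_f applied to a forecast p in [0, 1] for f(x).

    hs_f(p) = 1 - sqrt((1-p)/p) when f(x) = 1, and
    hs_f(p) = 1 - sqrt(p/(1-p)) when f(x) = 0.
    A forecast of 1/2 scores 0; a maximally wrong forecast scores -infinity.
    """
    if fx == 1:
        return 1 - math.sqrt((1 - p) / p) if p > 0 else -math.inf
    return 1 - math.sqrt(p / (1 - p)) if p < 1 else -math.inf

# A bias-gamma algorithm guesses correctly w.p. (1 + gamma)/2. Turned into a
# forecaster that reports confidence (1 + gamma)/2 in its guess, its expected
# hs score works out to exactly 1 - sqrt(1 - gamma^2) >= gamma^2 / 2.
for gamma in (0.01, 0.1, 0.5, 0.9):
    p = (1 + gamma) / 2
    # With f(x) = 1: guess 1 (forecast p) w.p. p; guess 0 (forecast 1-p) w.p. 1-p.
    expected = p * hs_score(p, 1) + (1 - p) * hs_score(1 - p, 1)
    assert abs(expected - (1 - math.sqrt(1 - gamma**2))) < 1e-9
    assert expected >= gamma**2 / 2
```

Note how asymmetric the rule is: a confident correct forecast gains at most 1, while a confident wrong forecast loses unboundedly, which is what penalizes overclaiming and makes the rule proper.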
We define the hs score of a forecasting algorithm R on an input x in the domain of f to be score_{hs,f}(R, x) = E[hs_f(R(x))], the expectation of the hs score of the output of R over the internal randomness of R. Then linear amplification does hold for this score function.

(The astute reader may have noticed that we would obtain linear amplification if we simply set the score to be the squared bias of the randomized algorithm. That is true, but this approach does not work in conjunction with the ratio minimax theorem, since this score function no longer satisfies the appropriate saddle property requirements of that theorem; this is why we instead consider forecasting algorithms.)

Lemma 1.5.
For any Boolean-valued function f, any forecasting algorithm R, and any k ≥ 1, there is a forecasting algorithm R′ that combines the outputs of k instances of R and satisfies

score_{hs,f}(R′, x) ≥ 1 − (1 − score_{hs,f}(R, x))^k

for every x in the domain of f. In particular, when k = max_x ⌈2/score_{hs,f}(R, x)⌉, then for each x ∈ Dom(f), score_{hs,f}(R′, x) ≥ 1 − e⁻² > 0.8.

To the best of our knowledge, Lemma 1.5 has not previously appeared in the literature. This lemma is sensitive to the precise definition of hs_f; other scoring rules do not appear to satisfy this amplification property, which is crucial for the proof of our main results. Additionally, the scoring rule hs_f is special because there is a close connection between the hs score of forecasting algorithms and the bias of randomized algorithms.

Lemma 1.6.
For any Boolean-valued function f, any distribution µ on Dom(f), and any parameter γ > 0:

• If there exists a randomized algorithm R with bias_f(R, µ) = 2 Pr[R(x) = f(x)] − 1 ≥ γ, then there is a forecasting algorithm R′ with score_{hs,f}(R′, µ) ≥ 1 − √(1 − γ²) ≥ γ²/2, and

• If there exists a forecasting algorithm R with score_{hs,f}(R, µ) ≥ γ, then there is a randomized algorithm R′ with bias_f(R′, µ) ≥ γ.

Moreover, in both cases R′ can be explicitly constructed from R by modifying its output.

Lemma 1.5 and Lemma 1.6 can be used to reprove the fact that O(1/γ²) instances of a bias-γ randomized algorithm can be combined to obtain a bounded-error algorithm; combining those lemmas (or, more precisely, specific instantiations of these lemmas that account for the explicit constructions of the relevant algorithms and their costs) with the minimax theorem also leads to new results as described in the next section.

Hard distributions for bounded error and small bias.
The minimax theorem for cost/score ratios and linear amplification of forecasting algorithms can be combined to show that for many measures of randomized complexity, for every Boolean-valued function f with finite domain there exists a single distribution µ on which it is hard to compute f with bounded error and with (any) small bias. For example, letting RDT(f) denote the minimum (worst-case) query complexity of a randomized algorithm computing f (or equivalently, the minimum worst-case depth of a decision tree computing f) with error at most 1/3 on every input in Dom(f), and RDT^µ_{γ̇}(f) denote the minimum query complexity of a randomized algorithm that has error probability at most γ̇ := 1/2 − γ/2 when inputs are drawn from µ, we obtain the following result.

Theorem 1.7.
For any non-constant partial function f : {0,1}ⁿ → {0,1}, there exists a distribution µ on Dom(f) such that for every γ ∈ [0, 1],

RDT^µ_{γ̇}(f) = Ω(γ²·RDT(f)).
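The γ² loss in Theorem 1.7 matches the cost of classical amplification discussed above: majority-voting O(1/γ²) independent runs of a bias-γ algorithm yields constant bias. A quick simulation (ours, purely illustrative) confirms this numerically.

```python
import math
import random

def majority_bias(bias, k, trials=2000, seed=0):
    """Empirical bias of the majority vote over k runs of an algorithm
    that is correct with probability (1 + bias) / 2 on each run."""
    rng = random.Random(seed)
    p = (1 + bias) / 2
    correct = 0
    for _ in range(trials):
        votes = sum(rng.random() < p for _ in range(k))
        correct += 2 * votes > k
    return 2 * correct / trials - 1

gamma = 0.05
k = math.ceil(4 / gamma**2) | 1   # O(1/gamma^2) runs; forced odd to avoid ties
print(majority_bias(gamma, k))    # well above 0.9: bias amplified to Omega(1)
```

The constant 4 inside the ceiling is an arbitrary illustrative choice; any c/γ² repetitions with large enough c give constant bias, while o(1/γ²) repetitions do not.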
We establish analogous theorems for multiple other computational models as well:

  Randomized communication complexity:  RCC^µ_{γ̇}(f) = Ω(γ²·RCC(f))                  (Corollary 4.8)
  Quantum query complexity:             QDT^µ_{γ̇}(f) = γ·Ω̃(QDT(f))                   (Theorem 5.2)
  Quantum communication complexity:     QCC^µ_{γ̇}(f) = γ·Ω̃(QCC(f))                   (Theorem 5.9)
  Polynomial degree:                    deg^µ_{γ̇}(f) = γ·Ω̃(adeg(f))                  (Theorem 6.5)
  Log-rank complexity:                  log rank^µ_{γ̇}(f) = γ·Ω̃(log rank_{1/3}(f))   (Theorem 6.8)
  Circuit complexity:                   Rcirc^µ_{γ̇}(f) = γ²·Ω̃(Rcirc(f))              (Theorem 7.8)
  Log-depth circuit complexity:         RNC1^µ_{γ̇}(f) = γ²·Ω̃(RNC1(f))                (Theorem 7.9)
  Threshold circuit complexity:         RTC0^µ_{γ̇}(f) = γ²·Ω̃(RTC0(f))                (Theorem 7.10)

(Note that as in Theorem 1.7, the novel aspect of all these results is that they guarantee that for each of the stated inequalities, there exists a single distribution µ that satisfies the inequality for every value of γ simultaneously.)
The theorems listed above settle Question 1.2 in the affirmative for the specified models. For the models with quadratic dependence on γ (i.e., randomized query complexity, randomized communication complexity, and the various randomized circuit models), we also get hard distributions which lower bound the expected score of a forecasting algorithm, settling Question 1.3 affirmatively.

Distinguishing power of randomized algorithms and protocols.
In the communication complexity setting, we can also analyze how well a randomized communication protocol computes a function f : X × Y → {0,1} via its communication transcripts. Let tran(R, µ) denote the distribution on communication transcripts of the randomized protocol R on inputs drawn from µ. Then one way to measure how well R is able to distinguish 0- and 1-inputs of f is to measure the Hellinger distance between the distributions tran(R, µ₀) and tran(R, µ₁) of transcripts of R on some distributions µ₀ over f⁻¹(0) and µ₁ over f⁻¹(1). We can use the minimax and linear amplification theorems to give a strong upper bound on this Hellinger distance as a measure of the cost of the protocol.

Theorem 1.8.
For any non-constant partial function f : X × Y → {0,1} over finite sets X and Y, there is a pair of distributions µ₀ on f⁻¹(0) and µ₁ on f⁻¹(1) such that for any randomized communication protocol R, the squared Hellinger distance between the distribution of its transcripts on µ₀ and µ₁ is bounded above by

h²(tran(R, µ₀), tran(R, µ₁)) = O( min{cost(R, µ₀), cost(R, µ₁)} / RCC(f) ).

Here cost(R, µ) denotes the expected amount of communication the protocol R transmits when given inputs from µ.

Theorem 4.6 establishes an analogous result for query complexity. In our companion paper [BB20], that theorem is one of the ingredients that enables us to establish a new composition theorem for query complexity.

Hardcore lemma.
Impagliazzo's Hardcore Lemma [Imp95] states that for every ε, δ > 0, if every circuit C of size at most s computes f with error at least δ on the uniform distribution, then there is a δ-regular distribution µ = µ(δ, ε) for which every circuit that computes f with bias at least ε on the distribution µ must have size Ω(ε²s). Informally, the lemma shows that if a function f is mildly hard on average, it is because it is "very" hard to compute on a fairly large subset of its inputs. But, interestingly, this version of the hardcore lemma leaves open the possibility that the hardcore might be different for various levels ε of hardness. Using our main theorems, we can show that this is not the case.

Theorem 1.9.
There exists a universal constant c > 0 such that for any δ > 0 and function f : {0,1}ⁿ → {0,1}, if every circuit C of size at most s satisfies Pr[C(x) = f(x)] ≤ 1 − δ when the probability is taken over the uniform distribution of x in {0,1}ⁿ, then there is a distribution µ with min-entropy at least n − log(1/δ) such that for every ε > 0, any circuit C′ of size at most c·ε²/log(1/δ)·s has success probability bounded by Pr[C′(x) = f(x)] ≤ 1/2 + ε, where the probability is over x drawn from µ.

The proof of Theorem 1.9 follows closely the original argument of Nisan in [Imp95] that established the hardcore lemma via a minimax theorem. Since that original work, many extensions and different proofs of the hardcore lemma have been established (e.g., [Imp95; KS03; BHK09; TTV09]), but to the best of our knowledge Theorem 1.9 represents the first version of the lemma which gives a single distribution µ which is hard for all values of ε > 0 simultaneously.

In independent work concurrent with this one, Bassilakis, Drucker, Göös, Hu, Ma, and Tan [BDG+20] showed the existence of a certain hard distribution for randomized query complexity. They showed every Boolean function f has hard distributions µ₀ and µ₁ (on 0- and 1-inputs respectively) such that given query access to k independent samples from µ_b, it is still necessary to use Ω(R(f)) queries to the bits of the samples in order to decide the value of b ∈ {0,1} to bounded error.

The guarantee on the hard distribution provided by [BDG+20] is formally stronger than the one we provide in Theorem 4.6 (though in our companion manuscript [BB20], we prove a new composition theorem for randomized query complexity, and use it to conclude that the guarantee of [BDG+20] turns out to be equivalent to the guarantee of Theorem 4.6 in our current work).
Thetools used by [BDG+20] are also completely different: they use arguments specific to query com-plexity that construct the hard distribution more explicitly, but their arguments do not generalizeto other models such as communication complexity or circuit complexity. Section 2 is devoted to proving the main minimax theorem for the cost/score ratio of randomizedalgorithms. The main result of that section is Theorem 2.18; the rest of the section is devotedto introducing the mathematical notions and preliminaries required to obtain a proof of thattheorem from Sion’s minimax theorem.
Section 3 introduces the basic definitions and some basic scoring rules for forecasting algorithms. The section establishes some of the core properties of scoring functions, including notably connections between the best score achievable by forecasting algorithms on distributions over inputs and various distance measures on those distributions. The final portions of this section then establish the main linear amplification theorem in general form in Lemma 1.5 and the general form of the conversion between randomized and forecasting algorithms in Lemma 3.15.

Section 4 focuses on the query and communication complexity settings. Conversions between randomized and forecasting algorithms in the query complexity setting are straightforward, but there is one significant challenge in applying the linear amplification theorem to obtain the results in Theorem 1.7 and Theorem 4.6: the cost and score of a randomized algorithm R on an input x can both depend on x itself. This is a problem because to obtain a constant score (and after the final conversion, a bounded-error randomized algorithm), we want to amplify R with a number k of copies that depends on the score of R on x, but since we don't know x, we don't know what score(R, x) is either. We get around this problem with odometer arguments: by empirically estimating the expected number of queries R makes on x, we can obtain effective bounds on the number k of copies of R that we need to obtain successful amplification.

As we show in the section, the communication complexity results Corollary 4.8 and Theorem 1.8 follow immediately from their query complexity analogues.

Section 5 establishes the results in the quantum query and communication complexity settings. Unlike in the classical setting, amplification that is linear in the bias of an algorithm does hold in the quantum query complexity setting.
However, the proof of Theorem 5.2 requires the set of algorithms to be representable as a convex subset of a real topological vector space, and the cost of an algorithm to be a convex function on this set. It is not immediately clear how quantum query algorithms can satisfy this condition, because in the usual definition, the cost of a mixture of two quantum algorithms would be the maximum of the costs of the algorithms rather than the average. To overcome this issue, we instead establish Theorem 5.2 via consideration of what we call probabilistic quantum algorithms, which correspond to probability distributions over quantum algorithms and do easily satisfy the appropriate convexity requirements. Probabilistic quantum algorithms are harder to amplify than regular quantum algorithms (due to their lack of coherence), but we show that a linear amplification theorem still holds.

Another important difference between the quantum and the classical setting is that the communication complexity result, Theorem 5.9, is not implied by the analogous query complexity result. Nonetheless, the same argument used for quantum query algorithms holds for quantum communication protocols as well. We complete the proof of Theorem 5.9 by first providing an abstraction of the query complexity argument in Theorem 5.8 and then showing how communication protocols satisfy the conditions of this abstract theorem.
Section 6 considers the approximate polynomial degree and the log-rank complexity of functions. As with quantum query complexity, approximate polynomial degree satisfies an amplification theorem that is linear in the bias, meaning that we do not need to use forecasting algorithms or scoring rules. However, also as with quantum query complexity, polynomials and their cost do not satisfy the right convexity requirements, as the degree of a mixture of two polynomials is not the average of their degrees. We overcome this by considering probabilistic polynomials. Proving an amplification theorem for probabilistic polynomials turns out to be somewhat tricky, and requires tools from approximation theory such as Jackson's theorem.

Approximate log-rank inherits all of the problems of approximate polynomial degree, and adds a few more. To handle approximate log-rank, we switch over to the nearly-equivalent model of the logarithm of the approximate γ₂ norm, and then use the previous trick of considering the probabilistic approximate γ₂ norm. To prove an amplification theorem for the probabilistic γ₂ norm we apply the same tools as for probabilistic polynomials.

Section 7 establishes the circuit complexity results. There are two main hurdles in establishing Theorem 7.8. The first is that the notion of randomized circuits is not as trivially extendable to forecasting circuits as in other computational models. We show that this conversion can be done efficiently when we discretize the set of confidence values that can be returned by forecasting circuits, and that this discretization does not affect the guaranteed relations between score and bias. The second is that the overhead required to combine the output of multiple instances of a randomized circuit during linear amplification is not trivial.
This second hurdle can be overcome with the use of efficient circuit constructions for elementary arithmetic operations and the iterated addition problem.

The proof of the universal hardcore lemma in Theorem 1.9 is obtained via a slight generalization of the ratio minimax theorem. This variant of the minimax theorem is stated in Lemma 7.12 and the rest of the proof of Theorem 1.9 is presented in Section 7.3.

We make a few remarks regarding other possible generalizations of Yao's original minimax theorem. First, one may wonder why we provide a hard distribution µ satisfying R^µ_{γ̇}(f) = Ω(γ²R(f)) for all γ, rather than the stronger statement R^µ_{γ̇}(f) = Ω(R_{γ̇}(f)) for all γ. In other words, we've stated our lower bounds in terms of the bounded-error randomized cost R(f), which required amplification; why not directly compare the average-case complexity to bias γ, denoted R^µ_{γ̇}(f), to the worst-case complexity to bias γ, denoted R_{γ̇}(f)?

The reason is that this stronger version of the minimax is actually false: that is, there need not be a distribution µ for which R^µ_{γ̇}(f) = Ω(R_{γ̇}(f)) for all γ (even though for every given γ, such a distribution µ that depends on γ does exist, by Yao's minimax theorem). For a counterexample, consider the query complexity model. Let f be the Boolean function on n + m + 1 bits, where if the first bit is 0 the function f evaluates to the parity of the next m bits, whereas if the first bit is 1 the function f evaluates to the majority of the last n bits. Say we take n = m². Then, since parity is hard to compute even to small bias, we have R_{γ̇}(f) ≥ m for all γ. We also have R_{1/3}(f) = Ω(m²), since majority on m² bits requires Ω(m²) queries. Now, consider any distribution µ over the domain of f.
If µ places nonzero probability mass on inputs with first bit 1, then µ can necessarily be solved to some sufficiently small bias using at most 2 queries (one query to the first bit of the input, and one to a random position in the input to majority). In this case, we would have R^µ_{γ̇}(f) = O(1) and R_{γ̇}(f) = Ω(√n) for this sufficiently small γ. Alternatively, if µ places zero probability mass on inputs with first bit 1, then solving f against µ is solving parity on m = O(√n) bits; hence R^µ_{1/3}(f) = O(√n), even though R_{1/3}(f) = Ω(n). Similar counterexamples can be constructed in other computational models.

Another possible generalization of Yao's minimax is to a distribution µ for which R^µ(f) is large even when both the error of the algorithm and the expected cost are measured against µ. That is, in a normal application of Yao's minimax, we either consider randomized algorithms which only ever make at most T queries (against any input) and measure their expected error against µ, or else we consider randomized algorithms which only ever make error at most ε (against any input) and measure their expected cost against µ. One may wonder if it is possible for one distribution to certify the hardness of f in both ways at once, with both the cost and the error measured in expectation against µ.

The answer turns out to be yes, as first observed by Vereshchagin for query complexity [Ver98]. Vereshchagin stated his theorem for bounded error, but in the case of small bias γ, his techniques appear to give a distribution µ (which depends on γ) such that R^µ_{γ̇}(f) = Ω(γ·R_{γ̇}(f)) even when the left-hand side is defined as the expected query complexity against µ to bias at least γ (also against µ).
This is in contrast to Yao-style minimax theorems, which are stronger in that they lack the γ factor on the right-hand side, but weaker in that the left-hand side has either the cost or the error being worst-case (rather than both being average-case against µ).

Our results in this work are "Vereshchagin-like" in that they hold even when R^µ_{γ̇}(f) has both the cost and the bias defined in expectation against µ. We prove such results for randomized query complexity and randomized communication complexity, showing a single µ satisfies R^µ_{γ̇}(f) = Ω(γ²R(f)) for all γ > 0, even when both the error and the cost in the definition of R^µ_{γ̇}(f) are average-case against µ. (For models such as quantum query complexity or circuit complexity, the expected cost of an algorithm does not have an obvious interpretation, since the algorithms generally have the same cost for all inputs; therefore, for those models we do not give a theorem in which the cost is measured in expectation against µ.)

Note that our minimax theorem is not directly comparable to Vereshchagin's, because we state our lower bounds in an "amplified" form – that is, the lower bounds are with respect to R(f) rather than R_{γ̇}(f). As previously mentioned, this is necessary when proving that a single distribution works for all γ, and our theorems appear to be tight in that setting. Moreover, Vereshchagin's theorem is tight in its setting: the factor of γ is necessary, because average-case query complexity can be smaller than worst-case query complexity (for example, consider the parity function on n bits, which has R_{γ̇}(f) = n for all γ; if we design a randomized algorithm which queries all the bits with probability γ and queries no bits with probability 1 − γ, it will use only γn expected queries, and it will solve f to bias γ).
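The algorithm in the parenthetical above (query everything with probability γ, otherwise guess) is easy to simulate. The following sketch (ours, purely illustrative) confirms empirically that it achieves bias close to γ on parity with only about γn expected queries.

```python
import random

def lazy_parity(x, gamma, rng):
    """Query all bits with probability gamma (answering parity exactly),
    otherwise query nothing and guess at random.

    Correct w.p. gamma + (1 - gamma)/2 = (1 + gamma)/2, i.e. bias gamma,
    using gamma * len(x) queries in expectation.
    """
    if rng.random() < gamma:
        return sum(x) % 2, len(x)   # exact answer after n queries
    return rng.randrange(2), 0      # blind guess, zero queries

rng = random.Random(1)
n, gamma, trials = 100, 0.2, 20000
total_queries = total_correct = 0
for _ in range(trials):
    x = [rng.randrange(2) for _ in range(n)]
    answer, queries = lazy_parity(x, gamma, rng)
    total_queries += queries
    total_correct += answer == sum(x) % 2
print(total_queries / trials)          # close to gamma * n = 20 expected queries
print(2 * total_correct / trials - 1)  # close to bias gamma = 0.2
```

Any algorithm for parity that makes fewer than n queries on some run has bias exactly 0 conditioned on that run, which is why the expected query count cannot be improved below γn for bias γ.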
A remaining open problem is as follows: can Vereshchagin's theorem be modified to show

R^µ_{˙γ}(f) = Ω(R̄_{˙γ}(f)), (1)

where both cost and bias on the left are measured in expectation against µ, and where R̄_{˙γ}(f) denotes the worst-case (over the inputs of f) expected (over the internal randomness of the algorithm) query complexity of f to bias γ? Note that in the bounded-error setting, R̄(f) = Θ(R(f)), so for bounded γ this result follows from both Vereshchagin's theorem and from our work here. For small γ, we leave this question as an intriguing open problem.

We also note that we cannot hope that a single distribution µ satisfies (1) for all γ, because one can construct a counterexample via a modification of our earlier function: we let f be defined on 1 + m + n bits, where if x₁ = 0 the function evaluates to the parity of the next m bits, and if x₁ = 1 the function evaluates to the majority of the last n bits, as before; this time we will have n = m^{4/3}. We also add a promise: we require that the input always has Hamming weight either at most n/2 − √n or at least n/2 + √n on the last n bits, turning the majority part of the function into a √n-gap majority function. Now, to compute f to worst-case bias γ requires at least γm expected queries on inputs x with x₁ = 0, and requires at least γ²n expected queries on inputs with x₁ = 1, so at least Ω(max{γm, γ²n}) expected queries in the worst case. This is Ω(n^{1/4}) when γ = n^{−1/2} and Ω(n) when γ is constant. Now fix a distribution µ, and let p be the probability that µ assigns to inputs with x₁ = 1. If p ≤ 1/2, then we can compute f to constant bias simply by querying the first bit, guessing randomly if x₁ = 1, and querying m bits to compute f exactly when x₁ = 0; this uses O(n^{3/4}) queries to achieve constant bias, instead of the Ω(n) which were required in the worst case.
On the other hand, if p ≥ 1/2, then we can compute f against µ by querying the first bit and nothing else when x₁ = 0 (guessing the answer randomly), and otherwise making one additional query to estimate the gap majority function to bias 1/√n. This uses 2 queries and achieves bias Ω(n^{−1/2}) against µ, instead of the Ω(n^{1/4}) queries required in the worst case. We thank an anonymous reviewer for this example.

Minimax theorem for the ratio of saddle functions
Minimax theorems take the form

inf_{x∈X} sup_{y∈Y} α(x, y) = sup_{y∈Y} inf_{x∈X} α(x, y).

For any function α, the left-hand side above is always at least the right-hand side, but equality only holds under certain conditions; when equality does hold, we call it a minimax theorem.

Broadly speaking, the following conditions are required to ensure that a minimax theorem holds. First, X and Y must be convex sets (and they must be subsets of some real vector spaces). Second, α must be saddle – or at least quasisaddle – meaning that it is convex as a function of x and concave as a function of y (or at least quasiconvex and quasiconcave). Third, α must satisfy some continuity conditions. And finally, one of X or Y must be compact (importantly, it is not necessary for both to be compact).

In this section, we show that under certain conditions, minimax theorems also hold for ratios of positive saddle functions. Such a ratio of saddle functions is not necessarily saddle, but the important insight is that it is still quasisaddle.

In order to formally state the conditions in which minimax theorems hold, we will need a few definitions. We assume the reader is familiar with vector spaces and topological spaces, including standard terminology such as compact sets and neighborhoods.
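As a concrete sanity check of the kind of statement proved in this section, the following sketch (our own toy example, not from the paper) takes bilinear cost and score functions over the probability simplex – given by two hypothetical strictly positive 2×2 matrices C and S – and verifies on a grid that the inf-sup and sup-inf of the ratio coincide, as a minimax theorem for quasisaddle ratios predicts.

```python
import numpy as np

# Hypothetical positive matrices: cost(x, mu) = x^T C mu and
# score(x, mu) = x^T S mu are bilinear and strictly positive over the
# simplex, so their ratio alpha = cost/score is quasisaddle.
C = np.array([[1.0, 3.0], [2.0, 1.0]])
S = np.array([[2.0, 1.0], [1.0, 4.0]])

t = np.linspace(0.0, 1.0, 1001)
X = np.stack([t, 1 - t], axis=1)   # mixed strategies x = (t, 1 - t)
M = X                              # distributions mu, same grid

ratio = (X @ C @ M.T) / (X @ S @ M.T)  # ratio[i, j] = alpha(X[i], M[j])

inf_sup = ratio.max(axis=1).min()  # inf_x sup_mu alpha(x, mu)
sup_inf = ratio.min(axis=0).max()  # sup_mu inf_x alpha(x, mu)

# inf-sup >= sup-inf always holds; for this quasisaddle ratio the two
# sides coincide up to discretization error of the grid.
assert inf_sup >= sup_inf - 1e-12
assert abs(inf_sup - sup_inf) < 5e-3
```

For a fixed x the ratio is a Möbius function of µ (and vice versa), so it is quasiconcave in µ and quasiconvex in x even though it is not saddle; this is exactly the structure exploited below.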
Definition 2.1 (Real topological vector space). A real topological vector space is a tuple (V, +, ·, τ), where V is a set, + is a function V × V → V, · is a function V × R → V, and τ ⊆ 2^V, such that • (V, +, ·) is a vector space over R, • (V, τ) is a topological space, • + is continuous under the topology τ, and • · is continuous under the topology τ and the standard topology of R.

We note that any normed real vector space is a real topological vector space, as the norm induces a topology. We will primarily focus on the real topological vector spaces R^n for n ∈ N, which have a standard topology.

Definition 2.2 (Extended reals). The extended reals is the set R̄ := R ∪ {−∞, ∞}. We use the extended interval notation (r, ∞] := (r, ∞) ∪ {∞} for r ∈ R, and similarly for [−∞, r) and [−∞, ∞]. We associate with R̄ the following topology. A set S ⊆ R̄ is a neighborhood of x ∈ R if it contains an open interval (x − ε, x + ε) for some ε ∈ (0, ∞), it is a neighborhood of ∞ if it contains the interval (r, ∞] for some r ∈ R, and it is a neighborhood of −∞ if it contains the interval [−∞, r) for some r ∈ R.

We define addition, subtraction, multiplication, and division of extended reals in the intuitive way, with ∞ − ∞, 0 · ∞, ∞/∞, and x/0 for x ∈ R̄ all undefined. Note also that the extended reals are ordered (for each x, y ∈ R̄, we have either x = y, x < y, or x > y).

Note that while we define the extended reals and will often talk about extended-real-valued functions, our vector spaces will always be over the reals, not over the extended reals. In particular, the extended reals are not a field.

Definition 2.3 (Convexity of sets). We say a subset X of a real vector space V is convex if

∀x, y ∈ X, ∀λ ∈ (0, 1), λx + (1 − λ)y ∈ X.

Definition 2.4 (Convex hull). Let V be a real vector space and let X ⊆ V. The convex hull of X, denoted Conv(X), is the intersection of all convex subsets of V that contain X as a subset.
Note that it is easy to verify that an arbitrary intersection of convex sets is convex, which meansthat the convex hull of any set is always convex.
Definition 2.5 ((Quasi)convexity and (quasi)concavity of functions). Let V be a real vector space, let X ⊆ V be convex, and let φ : X → R̄. We say that φ is convex if for all x, y ∈ X and λ ∈ (0, 1), we have φ(λx + (1 − λ)y) ≤ λφ(x) + (1 − λ)φ(y). We say φ is quasiconvex if for all x, y ∈ X and λ ∈ (0, 1), we have φ(λx + (1 − λ)y) ≤ max{φ(x), φ(y)}. We say that φ is concave if −φ is convex, and we say φ is quasiconcave if −φ is quasiconvex. If φ is both convex and concave, we say it is linear.

Note that if ∞ and −∞ are both in the range of φ, then λφ(x) + (1 − λ)φ(y) may be ∞ − ∞, which is undefined; in this case we say φ is neither convex nor concave. A function with both ∞ and −∞ in its range may still be quasiconcave or quasiconvex.

Definition 2.6 (Saddle and quasisaddle). Let V₁ and V₂ be real vector spaces, let X ⊆ V₁ and Y ⊆ V₂, and let α : X × Y → R̄. We say that α is saddle if for all x ∈ X the function α(x, ·) is concave and for all y ∈ Y the function α(·, y) is convex. We say that α is quasisaddle if for all x ∈ X the function α(x, ·) is quasiconcave and for all y ∈ Y the function α(·, y) is quasiconvex.

Definition 2.7 (Semicontinuity). Let X be a topological space and let φ : X → R̄. We say that φ is upper semicontinuous at x ∈ X if for all y ∈ (φ(x), ∞] there exists some neighborhood U of x on which the value of φ(x′) for x′ ∈ U is less than y. We say that φ is lower semicontinuous at x if −φ is upper semicontinuous at x.

Let Y be another topological space and let α : X × Y → R̄ be a function. We say that α is semicontinuous if for all x ∈ X the function α(x, ·) is upper semicontinuous over all of Y, and for all y ∈ Y the function α(·, y) is lower semicontinuous over all of X.

We note the following two useful lemmas about upper and lower semicontinuous functions. These lemmas are standard, but for completeness we reprove them in Appendix A.
Lemma 2.8 (An upper semicontinuous function on a compact set attains its max). Let X be a nonempty compact topological space, and let φ : X → R̄ be a function. Then if φ is upper semicontinuous, it attains its maximum, meaning there is some x ∈ X such that for all x′ ∈ X, φ(x′) ≤ φ(x). Similarly, if φ is lower semicontinuous, it attains its minimum.

Lemma 2.9 (A pointwise infimum of upper semicontinuous functions is upper semicontinuous). Let X be a topological space, let I be a set, and let {φ_i}_{i∈I} be a collection of functions φ_i : X → R̄. Then if each φ_i is upper semicontinuous, the function φ(x) = inf_{i∈I} φ_i(x) is also upper semicontinuous. Similarly, if each φ_i is lower semicontinuous, the pointwise supremum is lower semicontinuous.

From these lemmas, it follows that if α : X × Y → R̄ is semicontinuous, the expressions

inf_{x∈X} sup_{y∈Y} α(x, y) and sup_{y∈Y} inf_{x∈X} α(x, y)

have all the infimums attained if X is nonempty and compact, and all the supremums attained if Y is nonempty and compact. Hence on compact sets, inf-sup theorems become min-max theorems. The following lemma will also come in useful. We also prove it in Appendix A.

Lemma 2.10 (Quasiconvex functions on convex hulls). Let V be a real vector space, let X ⊆ V, and let φ : Conv(X) → R̄ be a function. If φ is quasiconvex, then

sup_{x∈Conv(X)} φ(x) = sup_{x∈X} φ(x).

Similarly, if φ is quasiconcave, then

inf_{x∈Conv(X)} φ(x) = inf_{x∈X} φ(x).

We are now ready to state Sion's minimax theorem. Actually, we will need a version of Sion's minimax for extended-real-valued functions, while Sion [Sio58] originally only dealt with real-valued functions; luckily, proving this extension is not hard given Sion's original theorem, and we do so in Appendix A.
Theorem 2.11 (Sion's minimax for extended reals). Let V₁ and V₂ be real topological vector spaces, and let X ⊆ V₁ and Y ⊆ V₂ be convex. Let α : X × Y → R̄ be semicontinuous and quasisaddle. If either X or Y is compact, then

inf_{x∈X} sup_{y∈Y} α(x, y) = sup_{y∈Y} inf_{x∈X} α(x, y).

Next, we use Sion's minimax theorem to show a minimax theorem for the ratio of positive saddle functions. To do so, we will need the following lemma.
Lemma 2.12.
Let a, b, c, d ∈ (0, ∞), and let λ ∈ (0, 1). Then

min{a/b, c/d} ≤ (λa + (1 − λ)c)/(λb + (1 − λ)d) ≤ max{a/b, c/d}.

This still holds if any of a, b, c, d are 0, or if a or c are ∞, so long as we interpret x/0 = ∞ for x ∈ [0, ∞].

Proof. When a, c ∈ [0, ∞) and b, d ∈ (0, ∞), it is easy to check that

(λa + (1 − λ)c)/(λb + (1 − λ)d) = (a/b) · 1/(1 + z) + (c/d) · z/(1 + z),

where z = (1 − λ)d/(λb). Since z > 0, this is a convex combination of a/b and c/d, from which the desired result follows. When a = ∞ or c = ∞, both the middle expression and the max expression equal ∞, and the result trivially holds. The same thing happens when b = d = 0. Finally, when a, c ∈ [0, ∞) and exactly one of b and d is 0, the max expression is again infinity, and the inequality on the left-hand side can be easily verified.

The simple lemma above is enough to imply that a convex function divided by a concave function is quasiconvex, and that a concave function divided by a convex function is quasiconcave.

Lemma 2.13. Let V be a real topological vector space, and let X ⊆ V be convex. Let φ : X → [0, ∞] and ψ : X → [0, ∞) be functions, and define ρ : X → [0, ∞] by ρ(x) := φ(x)/ψ(x), with r/0 interpreted as ∞ for r ∈ [0, ∞]. Then:
1. If φ is convex and ψ is concave, ρ is quasiconvex.
2. If φ is concave and ψ is convex, ρ is quasiconcave.
3. If φ is upper semicontinuous and ψ is lower semicontinuous, ρ is upper semicontinuous.
4. If φ is lower semicontinuous and ψ is upper semicontinuous, and if φ is strictly positive on X, then ρ is lower semicontinuous.

Proof. We start with (1). Fix x, y ∈ X and λ ∈ (0, 1). Then

ρ(λx + (1 − λ)y) = φ(λx + (1 − λ)y)/ψ(λx + (1 − λ)y) ≤ (λφ(x) + (1 − λ)φ(y))/(λψ(x) + (1 − λ)ψ(y)) ≤ max{φ(x)/ψ(x), φ(y)/ψ(y)} = max{ρ(x), ρ(y)},

so ρ is quasiconvex, as desired. Here we used the convexity of φ and concavity of ψ in the first inequality, and Lemma 2.12 in the second inequality. (2) works similarly:

ρ(λx + (1 − λ)y) = φ(λx + (1 − λ)y)/ψ(λx + (1 − λ)y) ≥ (λφ(x) + (1 − λ)φ(y))/(λψ(x) + (1 − λ)ψ(y)) ≥ min{φ(x)/ψ(x), φ(y)/ψ(y)} = min{ρ(x), ρ(y)}.

Next, we prove (3). Fix x ∈ X; our goal is to show ρ is upper semicontinuous at x.
If ρ(x) = ∞, then any function ρ is upper semicontinuous at x by definition, so assume ρ(x) < ∞. In particular, this means that φ(x) < ∞ and that ψ(x) > 0. Now, fix y > ρ(x) = φ(x)/ψ(x). By the upper semicontinuity of φ, find a neighborhood U₁ of x on which φ(·) is at most φ(x) + ε (with ε > 0 to be chosen later). By the lower semicontinuity of ψ, find a neighborhood U₂ of x on which ψ(·) is at least ψ(x) − ε. Setting U := U₁ ∩ U₂, we see that on U we have ρ(·) ≤ (φ(x) + ε)/(ψ(x) − ε), assuming we pick ε < ψ(x). We now simply pick ε small enough that this expression is less than y, giving us a neighborhood U of x on which ρ(·) is less than y, as desired.

Finally, we prove (4). As before, we fix x ∈ X. Our goal is to show ρ is lower semicontinuous at x. Let y < ρ(x). We seek a neighborhood U of x on which ρ(·) > y. To start with, the upper semicontinuity of ψ ensures there is a neighborhood U₁ of x on which ψ(·) < ψ(x) + ε, with ε > 0 arbitrarily small. Now, if φ(x) = ∞, then ρ(x) = ∞. In this case, the lower semicontinuity of φ ensures there is a neighborhood U₂ on which φ(·) is at least z, with z ∈ R arbitrarily large. Then in U₁ ∩ U₂, the value of ρ(·) is also arbitrarily large, and can be made to exceed y ∈ R given appropriate choices of z and ε. Alternatively, if φ(x) < ∞, then there is a neighborhood U₂ on which φ(·) > φ(x) − ε. In this case, on U₁ ∩ U₂ we have ρ(·) > (φ(x) − ε)/(ψ(x) + ε). By picking ε sufficiently small, we can again get a neighborhood U₁ ∩ U₂ of x on which ρ(·) > y, meaning that ρ is lower semicontinuous.

We now state the minimax theorem for the ratio of two positive saddle functions. In the statement below, it may help to think of R as a set of randomized algorithms, and to think of ∆ as the set of all probability distributions over a finite input set.
Further, think of cost(R, µ) as measuring the cost of the algorithm R when run on µ (for some models, this will depend only on R and not on µ), and think of score(R, µ) as quantifying the success or bias that the algorithm R achieves against input distribution µ.

Theorem 2.14 (Minimax theorem for the positive ratio of saddle functions). Let V₁ and V₂ be real topological vector spaces. Let R ⊆ V₁ be convex, and let ∆ ⊆ V₂ be nonempty, convex, and compact. Let the function cost : R × ∆ → (0, ∞] be semicontinuous and saddle, and let the function score : R × ∆ → [0, ∞) be such that its negation, −score, is semicontinuous and saddle. Then using x/0 := ∞ for x ∈ (0, ∞], we have

inf_{R∈R} max_{µ∈∆} cost(R, µ)/score(R, µ) = max_{µ∈∆} inf_{R∈R} cost(R, µ)/score(R, µ),

and the maximums are attained.

Proof. Let α : R × ∆ → (0, ∞] be defined by α(R, µ) := cost(R, µ)/score(R, µ), with x/0 interpreted as ∞ for x ∈ (0, ∞]. For any fixed µ ∈ ∆, the function α(·, µ) is quasiconvex and lower semicontinuous by Lemma 2.13. Similarly, for any fixed R ∈ R, the function α(R, ·) is quasiconcave and upper semicontinuous by Lemma 2.13. Hence α is semicontinuous and quasisaddle, and the desired minimax theorem follows from Theorem 2.11. Furthermore, since ∆ is nonempty and compact, the supremums are attained as maximums by Lemma 2.9 and Lemma 2.8.

Finally, we will need two extensions of this theorem. First, we will want to allow the denominator to be a function of the form score(R, µ)⁺, where the ⁺ superscript denotes the maximum of score(R, µ) with 0, and where we only know about saddle properties of score(R, µ), not of score(R, µ)⁺. To do this, we need to show such a maximum with 0 preserves the properties we care about. We have the following lemma, which we prove in Appendix A.

Lemma 2.15.
Let V be a real topological vector space, and let X ⊆ V be convex. For a function ψ : X → R̄, let ψ⁺ denote the function ψ⁺(x) = max{ψ(x), 0}. Then this operation on ψ preserves convexity, quasiconvexity, quasiconcavity, upper semicontinuity, and lower semicontinuity, but not concavity.

This lemma is useful, but doesn't quite give us everything we need, because the operation ψ ↦ ψ⁺ does not preserve concavity. We will need the following additional lemma, which says that Lemma 2.13 also works when dividing by ψ⁺, despite its lack of concavity.

Lemma 2.16.
Let V be a real topological vector space, and let X ⊆ V be convex. Let φ : X → [0, ∞] and ψ : X → [−∞, ∞) be functions, and define ρ : X → [0, ∞] by ρ(x) := φ(x)/ψ(x)⁺, with r/0 interpreted as ∞ for r ∈ [0, ∞]. Then if φ is convex and ψ is concave, ρ is quasiconvex.

Proof. Fix x, y ∈ X and λ ∈ (0, 1). If ψ(x) > 0 and ψ(y) > 0, we have ρ(λx + (1 − λ)y) ≤ max{ρ(x), ρ(y)} using the same argument as in Lemma 2.13. On the other hand, if ψ(x) ≤ 0 or ψ(y) ≤ 0, then we have max{ρ(x), ρ(y)} = ∞, and the inequality ρ(λx + (1 − λ)y) ≤ max{ρ(x), ρ(y)} trivially holds.

The second extension we will need in our final minimax theorem is to the case where the numerator is allowed to be 0. Unfortunately, as we can see from the statement of Lemma 2.13, the ratio does not preserve lower semicontinuity in this setting. We will need to impose some additional conditions on the cost and score functions, particularly with regard to their behavior around 0.

Definition 2.17.
We say that cost : R × ∆ → [0, ∞] and score : R × ∆ → [−∞, ∞) are well-behaved if the following conditions hold:
1. (Finite cost and score can be achieved.) For each µ ∈ ∆, there is some R ∈ R such that cost(R, µ) > 0, cost(R, µ) < ∞, and score(R, µ) > 0.
2. (A zero-cost algorithm has zero cost regardless of the input.) For each R ∈ R, either cost(R, µ) = 0 for all µ ∈ ∆, or else cost(R, µ) > 0 for all µ ∈ ∆.
3. (Mixing a zero-cost algorithm with a nonzero-cost algorithm gives a nonzero-cost algorithm.) For each µ ∈ ∆, if R, R′ ∈ R are such that cost(R, µ) = 0 and cost(R′, µ) > 0, then cost(λR + (1 − λ)R′, µ) > 0 for all λ ∈ (0, 1).

Finally, we are ready for our main workhorse minimax theorem.
Theorem 2.18.
Let V be a real topological vector space, and let R ⊆ V be convex. Let S be a nonempty finite set, and let ∆ be the set of all probability distributions over S, viewed as a subset of R^{|S|}. Let cost : R × ∆ → [0, ∞] be semicontinuous and saddle, and let score : R × ∆ → [−∞, ∞) be such that its negation, −score, is semicontinuous and saddle. Suppose cost and score are well-behaved. Then using the convention r/0 := ∞ for all r ∈ [0, ∞], we have

inf_{R∈R} max_{µ∈∆} cost(R, µ)/score(R, µ)⁺ = max_{µ∈∆} inf_{R∈R} cost(R, µ)/score(R, µ)⁺.

Moreover, if cost(R, ·) and score(R, ·) are both linear in µ for each R ∈ R, then

inf_{R∈R} max_{x∈S} cost(R, x)/score(R, x)⁺ = max_{µ∈∆} inf_{R∈R} cost(R, µ)/score(R, µ)⁺.

Further, all of the above maximums are attained.

Proof. First, note that if S = {x₁, x₂, ..., x_{|S|}}, then we can view ∆ as the convex hull of the set {e₁, e₂, ..., e_{|S|}} ⊆ R^{|S|}, where the e_i are the unit vectors e_i = (0, ..., 0, 1, 0, ..., 0) with the 1 at position i. Hence ∆ is convex. It is also closed and bounded, making it compact. We identify e_i with x_i, so that ∆ = Conv(S).

Note that since each R ∈ R has either cost 0 for all µ or cost greater than 0 for all µ, we can define the set R′ ⊆ R of R with nonzero cost. Now, on R′, the function α(R, µ) = cost(R, µ)/score(R, µ)⁺ is semicontinuous and quasisaddle by Lemma 2.13 together with Lemma 2.16 and Lemma 2.15. Additionally, ∆ is nonempty, convex, and compact. Thus by Theorem 2.11, we know that

inf_{R∈R′} max_{µ∈∆} cost(R, µ)/score(R, µ)⁺ = max_{µ∈∆} inf_{R∈R′} cost(R, µ)/score(R, µ)⁺,

with the maximums attained.

What we want to show is this statement with the infimums over R instead of R′. The inf-sup is always at least the sup-inf for every function, so we need only show that the sup-inf is at least the inf-sup. Moreover, since expanding the domain can only decrease the infimum, we know that

max_{µ∈∆} inf_{R∈R′} cost(R, µ)/score(R, µ)⁺ = inf_{R∈R′} max_{µ∈∆} cost(R, µ)/score(R, µ)⁺ ≥ inf_{R∈R} max_{µ∈∆} cost(R, µ)/score(R, µ)⁺.

Thus we only need to show that the max-inf over R is at least the max-inf over R′, and that the former maximum is attained.

To see this, let µ₀ ∈ ∆ be the maximizing µ for the expression

max_{µ∈∆} inf_{R∈R′} cost(R, µ)/score(R, µ)⁺.

Suppose by contradiction that there was some R̂ ∈ R ∖ R′ such that

cost(R̂, µ₀)/score(R̂, µ₀)⁺ < inf_{R∈R′} cost(R, µ₀)/score(R, µ₀)⁺.

Since R̂ ∈ R ∖ R′, we must have cost(R̂, µ₀) = 0. Since 0/score(R̂, µ₀)⁺ is less than something, and since we are interpreting 0/0 as ∞, we must have score(R̂, µ₀) > 0, so that 0/score(R̂, µ₀)⁺ = 0.

We wish to show that inf_{R∈R′} cost(R, µ₀)/score(R, µ₀)⁺ = 0. To this end, pick ε > 0. We will find R ∈ R′ such that cost(R, µ₀)/score(R, µ₀)⁺ < ε. The idea is to pick some R′ ∈ R′ such that cost(R′, µ₀) < ∞ and score(R′, µ₀) > 0, as guaranteed by the well-behaved condition on cost and score. Then set R := λR′ + (1 − λ)R̂, with λ > 0 extremely small. Now, the well-behaved property of cost says that cost(R, µ₀) > 0, so R ∈ R′. By convexity, we also have cost(R, µ₀) = cost(λR′ + (1 − λ)R̂, µ₀) ≤ λ cost(R′, µ₀) + (1 − λ) cost(R̂, µ₀) = λ cost(R′, µ₀), and by the concavity of score(·, µ₀), we have score(R, µ₀) = score(λR′ + (1 − λ)R̂, µ₀) ≥ λ score(R′, µ₀) + (1 − λ) score(R̂, µ₀) ≥ (1/2) score(R̂, µ₀), assuming λ ≤ 1/2.

This means that score(R, µ₀) and score(R̂, µ₀) are both positive, and cost(R, µ₀)/score(R, µ₀) ≤ 2λ cost(R′, µ₀)/score(R̂, µ₀). Since cost(R′, µ₀) < ∞, setting λ > 0 to be small causes the ratio cost(R, µ₀)/score(R, µ₀)⁺ to be arbitrarily close to 0, as desired. It follows that there exists µ ∈ ∆ such that

inf_{R∈R} cost(R, µ)/score(R, µ)⁺ ≥ inf_{R∈R} max_{µ′∈∆} cost(R, µ′)/score(R, µ′)⁺,

and since the inf-max is always at least the max-inf, there does not exist a µ for which the left-hand infimum is any larger; thus we get the desired result and the maximum is attained.

Finally, suppose that cost(R, ·) and score(R, ·) are linear for each R ∈ R. In that case, cost(R, ·) is convex and score(R, ·) is concave, which means that cost(R, ·)/score(R, ·)⁺ is quasiconvex on ∆ by Lemma 2.16. Then Lemma 2.10 implies that the maximum over µ ∈ Conv(S) is attained at a point in S. Moreover, if R ∈ R ∖ R′, then the maximum over µ ∈ ∆ evaluates to either 0 or ∞. If it is 0, then it is clearly also attained in S. If it is ∞, it means some µ ∈ ∆ has score(R, µ) ≤ 0; the concavity of score(R, ·) then gives us some x ∈ S such that score(R, x) ≤ score(R, µ), meaning there is a point x ∈ S on which score(R, x)⁺ = 0 and cost(R, x)/score(R, x)⁺ = ∞, as desired.

Theorem 2.18 is the main tool we will use to prove minimax theorems for algorithmic models. We will usually apply it in a setting where R is a set of algorithms, S is a finite input set, ∆ is a set of distributions over the inputs, cost(R, µ) is a cost measure for the performance of an algorithm against a distribution, and score(R, µ) is some kind of success measure. We will sometimes choose score(R, µ) = bias_f(R, µ), where bias_f(R, µ) is the bias R achieves against distribution µ in computing f.

We will generally combine Theorem 2.18 with an amplification theorem; such a theorem will turn the left-hand side inf_R max_x cost(R, x)/score(R, x) into something more familiar, such as inf_R max_x cost(R, x) where the infimum is restricted to algorithms R which achieve at least constant bias (i.e. bounded error) on each input. With such an amplification theorem, the minimax result will guarantee the existence of a hard distribution µ against which cost(R, µ)/score(R, µ) is large for all R; this means µ is hard to solve even to small bias.

While the above strategy works for models that can be amplified linearly in the bias (going from bias γ to constant bias using O(1/γ) overhead), such as quantum query complexity, for randomized algorithms the situation is more complicated. For randomized algorithms, we may instinctively want to use something like score(R, µ) = bias_f(R, µ)², but this does not work as it does not satisfy the right saddle properties. Instead, we introduce a new way of evaluating the success of randomized algorithms, called scoring rules. Evaluation via scoring rules ends up being the "correct" way to measure the success of a randomized algorithm, and has more elegant properties than simply the bias. It is also highly intuitive: to evaluate the success of an algorithm, we require it to give a confidence prediction for whether the output is 0 or 1, and then we score the prediction using a scoring rule which incentivizes honesty (that is, a scoring rule that causes a Bayesian agent who wishes to maximize her expected score to output her true subjective probability).

In this section we introduce the notion of forecasting algorithms, which output not just a {0, 1} guess at the function value but also a confidence parameter q ∈ [0, 1] for that prediction. These algorithms will be scored using a scoring rule, which rewards them 1 point for a correct prediction made with perfect confidence, and 0 points for a confidence of 1/2. As we will see, normal algorithms can be converted into forecasting algorithms and vice versa, and the expected score of the forecasting version can often be related to the bias of the algorithm in its regular (discrete outputs) form.

Definition 3.1 (Scoring rule). A scoring rule is a function s : [0, 1] → [−∞, 1] such that s(1) = 1, s(1/2) = 0, and s(·) is increasing over [0, 1]. We say a scoring rule is proper if for each p ∈ (0, 1), the expression ps(q) + (1 − p)s(1 − q) is uniquely maximized at q = p.

Generally, if a forecasting algorithm outputs q ∈ [0, 1], we will interpret it as assigning confidence q to the output 1 and confidence 1 − q to the output 0; we give it score s(q) if the right answer was 1, and score s(1 − q) if the right answer was 0. A proper scoring rule is therefore a scoring rule that incentivizes the algorithm to output q = p in the case where the right answer is sampled from Bernoulli(p). In other words, a proper scoring rule is one that incentivizes a Bayesian agent to output her true subjective probability for the outcome being 1.

Definition 3.2.
We define the following scoring rules:
1. hs(q) := 1 − √((1 − q)/q)
2. Brier(q) := 1 − 4(1 − q)²
3. bias(q) := 2q − 1
4. ls(q) := 1 − log₂(1/q).

We note that Brier(·) and ls(·) are known as the Brier scoring rule and logarithmic scoring rule, respectively, and are well-known in the literature. The Brier scoring rule is useful because it is a proper scoring rule which is bounded (that is, s(q) ∈ [−3, 1] for all q ∈ [0, 1], instead of s(·) diverging to −∞ at 0). The logarithmic scoring rule has an information-theoretic interpretation, with the algorithm essentially starting at score 1 and losing an amount of score depending on its "surprisal" at the correct outcome.

The scoring rule bias(·) is not proper, but as we will see, it is closely related to the bias an algorithm will make. Finally, the scoring rule hs(·) will be the most useful of the bunch for our purposes. Despite not having any intuitive interpretation and not being bounded, it is an incredibly convenient scoring rule due to the fact that it can be amplified, as we will see. hs(·) has been previously studied (for example in [BSS05], where it is called the "boosting loss" due to its relationship with boosting), but we believe its amplification property has not been previously known (we prove this amplification property later on in Lemma 3.10; this ends up being a key ingredient of our minimax theorems).

Lemma 3.3. hs, Brier, and ls are proper scoring rules. bias is a scoring rule which is not proper.

This lemma can be proven using elementary calculus, and we do so in Appendix B.
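To make the definitions concrete, here is a small numerical check of the properness claims (our own sketch; it takes the scoring rules to be hs(q) = 1 − √((1 − q)/q), Brier(q) = 1 − 4(1 − q)², bias(q) = 2q − 1, and ls(q) = 1 − log₂(1/q), the forms used in this paper). For the proper rules, the expected score p·s(q) + (1 − p)·s(1 − q) is maximized on a grid exactly at q = p, while bias incentivizes rounding the report all the way to 0 or 1.

```python
import math

# The four scoring rules; all satisfy s(1) = 1 and s(1/2) = 0
# and are increasing on (0, 1].
def hs(q):    return 1 - math.sqrt((1 - q) / q)
def brier(q): return 1 - 4 * (1 - q) ** 2
def bias(q):  return 2 * q - 1
def ls(q):    return 1 - math.log2(1 / q)

def expected_score(s, p, q):
    """Expected score of reporting q when the outcome is Bernoulli(p)."""
    return p * s(q) + (1 - p) * s(1 - q)

qs = [i / 1000 for i in range(1, 1000)]   # grid over (0, 1)
for s in (hs, brier, ls):
    for p in (0.1, 0.3, 0.5, 0.8):
        best_q = max(qs, key=lambda q: expected_score(s, p, q))
        assert abs(best_q - p) < 1e-9     # proper: maximized at q = p

# bias is not proper: a Bayesian with p = 0.8 gains by reporting q = 1.
assert expected_score(bias, 0.8, 1.0) > expected_score(bias, 0.8, 0.8)
```

For bias the expected score is (2q − 1)(2p − 1), which is linear in q, so the optimum always sits at an endpoint rather than at the honest report q = p.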
Fascinatingly, the above scoring rules all correspond to well-known distance measures between probability distributions. To describe the correspondence, we start by defining the following distance measures.
Definition 3.4.
For probability distributions ν₀ and ν₁ over a finite domain P, define

∆(ν₀, ν₁) := (1/2) Σ_{x∈P} |ν₀[x] − ν₁[x]|  (Total variation)
h(ν₀, ν₁) := (1/2) Σ_{x∈P} (√(ν₀[x]) − √(ν₁[x]))²  (Hellinger)
S(ν₀, ν₁) := (1/2) Σ_{x∈P} (ν₀[x] − ν₁[x])² / (ν₀[x] + ν₁[x])  (Symmetrized χ²)
JS(ν₀, ν₁) := (1/2) Σ_{x∈P} ( ν₀[x] log₂(2ν₀[x]/(ν₀[x] + ν₁[x])) + ν₁[x] log₂(2ν₁[x]/(ν₀[x] + ν₁[x])) )  (Jensen–Shannon)

The above measures give the distance between two probability distributions. We will sometimes want to have an asymmetric distance that is weighted towards one of the two distributions; while these asymmetric distances look strange at first, they show up naturally in the study of scoring rules. We extend the above distance measures as follows.
Definition 3.5.
Given probability distributions ν₀ and ν₁ over a finite domain P, as well as a weight w ∈ [0, 1], set ν = (1 − w)ν₀ + wν₁. Let R be the random variable over x ∈ P defined by R(x) := |(1 − w)ν₀[x] − wν₁[x]|/ν[x] for all x ∈ P. Then define

∆(ν₀, ν₁, w) := E_{x←ν}[R]
h(ν₀, ν₁, w) := E_{x←ν}[1 − √(1 − R²)]
S(ν₀, ν₁, w) := E_{x←ν}[R²]
JS(ν₀, ν₁, w) := E_{x←ν}[1 − H((1 + R)/2)],

where H(α) := α log(1/α) + (1 − α) log(1/(1 − α)) is the binary entropy function.

When w = 1/2, the expressions in Definition 3.5 equal the ones in Definition 3.4. Perhaps surprisingly, the distance measures h, S, and JS are all related to each other by a constant factor.

Lemma 3.6 (Relations between distance measures). When applied to fixed ν₀, ν₁, and w, the distance measures satisfy

S/2 ≤ 1 − √(1 − S) ≤ h ≤ JS ≤ S

as well as ∆² ≤ S ≤ ∆. We also have JS ≤ h/ln 2 and S ≤ (ln 4) JS.

While these relationships are certainly known in the literature, it is hard to chase down good citations (though see [Tøp00; MCAL17] for parts of this result); in any case, we prove Lemma 3.6 in Appendix B.
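The relations in Lemma 3.6 can be spot-checked numerically. The sketch below is our own (it uses the w = 1/2 forms, base-2 logarithms, and the chain of inequalities as stated here): it computes ∆, h, S, and JS for random pairs of distributions and asserts the claimed inequalities.

```python
import math, random

def distances(nu0, nu1):
    """w = 1/2 distances of Definition 3.4, computed in the form of
    Definition 3.5 with R(x) = |nu0[x] - nu1[x]| / (nu0[x] + nu1[x])."""
    tv = h = S = js = 0.0
    for p0, p1 in zip(nu0, nu1):
        nu = (p0 + p1) / 2          # the mixture (nu0 + nu1)/2
        if nu == 0:
            continue
        R = abs(p0 - p1) / (2 * nu)
        q = (1 + R) / 2
        ent = 0.0 if R >= 1 else -q * math.log2(q) - (1 - q) * math.log2(1 - q)
        tv += nu * R
        h  += nu * (1 - math.sqrt(max(0.0, 1 - R * R)))
        S  += nu * R * R
        js += nu * (1 - ent)
    return tv, h, S, js

rng = random.Random(7)
eps = 1e-12
for _ in range(200):
    raw0 = [rng.random() for _ in range(6)]
    raw1 = [rng.random() for _ in range(6)]
    nu0 = [x / sum(raw0) for x in raw0]
    nu1 = [x / sum(raw1) for x in raw1]
    tv, h, S, js = distances(nu0, nu1)
    # S/2 <= 1 - sqrt(1 - S) <= h <= JS <= S, and Delta^2 <= S <= Delta,
    # plus the reverse-direction constants JS <= h/ln 2, S <= (ln 4) JS.
    assert S / 2 <= 1 - math.sqrt(1 - S) + eps <= h + 2 * eps
    assert h <= js + eps <= S + 2 * eps
    assert tv ** 2 <= S + eps and S <= tv + eps
    assert js <= h / math.log(2) + eps and S <= math.log(4) * js + eps
```

The pointwise inequalities behind the chain are the standard entropy bounds 4p(1 − p) ≤ H(p) ≤ 2√(p(1 − p)) applied at p = (1 + R)/2, plus Jensen's inequality for the comparisons involving the expectation of R versus R².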
Consider the following problem: suppose distributions ν₀ and ν₁ are known (for example, perhaps they are the distributions of the transcript of a fixed randomized algorithm when run on a known 0-distribution and a known 1-distribution, respectively). Further, suppose a Bernoulli(w) process generates a bit b ∈ {0, 1}, and then a sample x ← ν_b is provided. We assume the parameter w is known. What is the best algorithm for predicting b given x, assuming you wish to maximize the expected score according to one of the scoring rules hs(·), Brier(·), ls(·), bias(·)? It turns out that the best attainable expected score is exactly the distance between ν₀ and ν₁ according to the distance measures h, S, JS, ∆, respectively. To prove this, we introduce the following definitions.

Definition 3.7.
For a scoring rule s : [0, 1] → [−∞, 1], we define s₁(p) := s(p) and s₀(p) := s(1 − p). This way, if a forecasting algorithm outputs p and the real outcome is b, the score of this prediction will be s_b(p).

Definition 3.8 (Expected score notation). Let S be a finite set, and let φ : S → [0, 1] be a function representing predictions. Let ν be a distribution over S, let P(x) be a Boolean-valued random variable for each x ∈ S representing the correct outcome, and let s : [0, 1] → [−∞, 1] be a scoring rule. The expected score of φ, denoted score_s(φ, ν, P), is defined as

score_s(φ, ν, P) := E_{x←ν} E_{b←P(x)} [s_b(φ(x))].

In these expectations, if a value of ∞ or −∞ occurs with probability 0, we set 0 · ∞ := 0.

We can also extend the score notation to the case where φ(x) outputs a probability distribution over [0, 1] instead of always outputting a deterministic prediction given the observation x. We won't worry about this case for now.

Equipped with these definitions, we are now ready to prove the correspondence between scoring rules and distance measures. This correspondence appears to be known in the literature (indeed, variants of it seem to have been rediscovered many times); see [RW11] for an overview. However, the form we need here is somewhat different from the usual form in the literature, which usually discusses divergences instead of distances. We therefore include the proof for completeness.

Lemma 3.9. Let ν₀ and ν₁ be probability distributions over a finite set S, and let w ∈ [0, 1]. Let M_s(ν₀, ν₁, w) be the maximum possible score for predicting b ← Bernoulli(w) given x ← ν_b, where ν₀, ν₁, and w are known. That is, M_s(ν₀, ν₁, w) is the maximum over choice of φ : S → [0, 1] of the expression score_s(φ, ν, P), where ν = (1 − w)ν₀ + wν₁ and P(x) is the posterior probability distribution of b given prior Bernoulli(w) and observation x ← ν_b.
Then

M_bias(ν_0, ν_1, w) = ∆(ν_0, ν_1, w),   M_hs(ν_0, ν_1, w) = h²(ν_0, ν_1, w),
M_Brier(ν_0, ν_1, w) = S(ν_0, ν_1, w),   M_ls(ν_0, ν_1, w) = JS(ν_0, ν_1, w).

Proof.
Consider a fixed x ∈ S. The contribution of x to the expected score of φ (with respect to scoring rule s) is simply

(1 − w)ν_0[x] s_0(φ(x)) + wν_1[x] s_1(φ(x)) = (1 − w)ν_0[x] s(1 − φ(x)) + wν_1[x] s(φ(x)).

The total expected score of φ is therefore the sum over x ∈ S of the above expression. The function φ which maximizes the expected score is simply the one where φ(x) = q, where q maximizes the expression (1 − w)ν_0[x] s(1 − q) + wν_1[x] s(q). Now, the expression we wish to maximize has the form ν[x] · ((1 − p) s(1 − q) + p s(q)), where p = wν_1[x]/ν[x]. Hence, if s is proper, the unique maximum occurs at q = p = wν_1[x]/ν[x]. This means that for the maximizing φ, the contribution of each x to the expected score is

(1 − w)ν_0[x] s((1 − w)ν_0[x]/ν[x]) + wν_1[x] s(wν_1[x]/ν[x]),

assuming s is proper. For s ∈ {hs, ls, Brier}, the scoring rule s is indeed proper, meaning that we have a closed expression for the maximum possible expected score. Setting R[x] := |wν_1[x] − (1 − w)ν_0[x]|/ν[x], it's not hard to check that for hs, the contribution of each x is ν[x](1 − √(1 − R[x]²)); for ls, the contribution of each x is ν[x](1 − H((1 + R[x])/2)); and for Brier, the contribution of each x is ν[x]R[x]², as desired.

It remains to deal with s = bias. The contribution of each x is the maximum possible value of (1 − w)ν_0[x] bias(1 − q) + wν_1[x] bias(q) for q ∈ [0, 1]. Since bias(q) = 2q − 1, it's not hard to see that the maximizing value of q is q = 0 when (1 − w)ν_0[x] > wν_1[x], q = 1 when wν_1[x] > (1 − w)ν_0[x], and when (1 − w)ν_0[x] = wν_1[x], the contribution of x to the score is 0 regardless of the value of q. The contribution of x to the maximum score is therefore ν[x]R[x], as desired.

We note that in the statement of Lemma 3.9, we are implicitly assuming that the predictive algorithms are deterministic: that given x, one is only allowed to output a deterministic prediction φ(x) ∈ [0, 1] instead of a random choice of prediction. However, it is not hard to see that randomized algorithms won't help in this setting, since we are maximizing the expected score, which is a linear function of the probabilities inside the randomized choice. That is to say, if the randomized algorithm chooses (on input x) to output a with probability p and b with probability 1 − p, then the final score of this algorithm will be a linear function of p, and hence the optimal choice of p will be either 0 or 1. Hence Lemma 3.9 also characterizes the best possible score of a randomized prediction algorithm with respect to those four scoring rules.

From here on out, we consider only the hs(·) scoring rule (and occasionally bias(·), which will correspond to the bias of a randomized algorithm). We will sometimes omit the subscript in the expression score_s(φ, ν, P) when s = hs.

We now proceed to show a few nice properties of the hs scoring rule. First among them is the amplification property. We believe this property (which is crucial for our purposes) has not previously appeared in the literature.

Lemma 3.10 (Amplification of hs). Let S be a finite set, and let φ : S → [0, 1] represent a prediction function.
Then for each k ∈ N, there is a function φ^(k) : S^k → [0, 1] such that for any distribution ν over S, we have

score_hs(φ^(k), ν^⊗k, 0) ≥ 1 − (1 − score_hs(φ, ν, 0))^k,
score_hs(φ^(k), ν^⊗k, 1) ≥ 1 − (1 − score_hs(φ, ν, 1))^k.

Furthermore, equality holds except when score_hs(φ, ν, 0) = score_hs(φ, ν, 1) = −∞. Here 0 and 1 are interpreted as the constant functions P(x) = 0 and P(x) = 1.

Informally, this lemma is saying the following. Consider a randomized forecasting algorithm R, which takes input x and outputs a confidence q ∈ [0, 1] representing its belief that f(x) = 1. Evaluate this algorithm according to its worst-case expected score with respect to the hs(·) scoring rule. That is to say, for each input x ∈ f^{-1}(1), consider the expectation E[hs(R(x))] of the expected score R gets when run on x, and for each x ∈ f^{-1}(0), consider the analogous expectation E[hs(1 − R(x))]. Then take the minimum η of all these expected scores, minimizing over all x ∈ Dom(f). This is the worst-case expected score of R. The lemma then says that we can run R on x several times, say k times independently, and combine the confidence outputs q_1, q_2, ..., q_k in such a way that the new algorithm has worst-case expected score equal to 1 − (1 − η)^k.

Proof.
We define φ^(k)(x_1 ... x_k) as follows. First, if it holds that some pair (x_i, x_j) in the input satisfies φ(x_i) = 0 and φ(x_j) = 1, we define φ^(k)(x_1 ... x_k) := 1/2. Otherwise, we set

φ^(k)(x_1 ... x_k) := (1 + ∏_{i=1}^k (1 − φ(x_i))/φ(x_i))^{-1},

where we interpret 1/∞ := 0 if it occurs (we need not interpret ∞ · 0, since that would only occur if φ(x_i) = 0 and φ(x_j) = 1 for some i and j). Note that if φ(x) = 0 and φ(x′) = 1 for x, x′ ∈ S that have nonzero weight in ν, then we have score_hs(φ, ν, 0) = score_hs(φ, ν, 1) = −∞, so the desired inequalities trivially hold. Otherwise, for b ∈ {0, 1} we write

score_hs(φ^(k), ν^⊗k, b) = E_{x_1...x_k ← ν^⊗k} [1 − √( ((1 − φ^(k)(x_1 ... x_k))/φ^(k)(x_1 ... x_k))^{(−1)^{1−b}} )]
 = 1 − E_{x_1...x_k ← ν^⊗k} [ √( ∏_{i=1}^k ((1 − φ(x_i))/φ(x_i))^{(−1)^{1−b}} ) ]
 = 1 − ∏_{i=1}^k E_{x_i ← ν} [ √( ((1 − φ(x_i))/φ(x_i))^{(−1)^{1−b}} ) ]
 = 1 − (E_{x←ν} [1 − hs_b(φ(x))])^k
 = 1 − (1 − score_hs(φ, ν, b))^k.

Note that equality holds except in the case where score_hs(φ, ν, 0) = score_hs(φ, ν, 1) = −∞.

The following lemma will be convenient when using this amplification theorem. We prove it in Appendix B.

Lemma 3.11. If x ∈ [0, 1] and k ∈ [1, ∞), we have

(1/2) min{kx, 1} ≤ 1 − (1 − x)^k ≤ min{kx, 1}.

3.5 Bias and hs score

Another nice property of hs is that it is at most bias.

Lemma 3.12.
For all q ∈ [0, 1], we have hs(q) ≤ bias(q).

Proof. Recall that hs(q) = 1 − √((1 − q)/q) and bias(q) = 1 − 2(1 − q). The desired inequality clearly holds at q = 0 and q = 1. For q ∈ (0, 1), it suffices to show that 2(1 − q) ≤ √((1 − q)/q), or equivalently

4q(1 − q) ≤ 1 ⇔ 1 − 4q + 4q² ≥ 0 ⇔ (1 − 2q)² ≥ 0,

which also clearly holds.

Finally, the last main property of hs that we exploit is that hs scores and biases are quadratically related. To explain what we mean, start with the following definition of a general algorithm, where we take care not to put any restriction on the structure of the algorithm but want it to take inputs and return outputs while incurring some cost.

Definition 3.13.
Let S be a finite set, and let ∆ be the set of probability distributions over S. A general algorithm, which we denote by R, is a pair of functions. The first function is from ∆ to [0, ∞], and we denote it by cost(R, ·) : ∆ → [0, ∞], so that cost(R, µ) returns a value in [0, ∞] for µ ∈ ∆. The second function takes inputs from S and returns a random variable supported on {0, 1}, and we denote it by output(R, ·), so that output(R, x) is a random variable on {0, 1} for each x ∈ S. The bias of a general algorithm R on input x ∈ S with respect to a function f : S → {0, 1} is

bias_f(R, x) := 2 Pr[output(R, x) = f(x)] − 1.

We note that if output(R, x) has distribution Bernoulli(q), then bias_f(R, x) = bias_{f(x)}(q), where the function bias_{f(x)}(q) is defined according to Definition 3.2 and Definition 3.7.

Just like we defined general algorithms, we also define forecasting algorithms, which output confidences in [0, 1] instead of values in {0, 1}.

Definition 3.14. Let S be a finite set and let ∆ be the set of all probability distributions over S. A forecasting algorithm, which we also denote by R, is a pair of functions. The first function is cost(R, ·) : ∆ → [0, ∞], just like for a general algorithm. The second function takes inputs from S and returns a random variable supported on [0, 1], and we denote it by pred(R, ·), so that pred(R, x) is a random variable on [0, 1] for each x ∈ S. The score of a forecasting algorithm R on input x ∈ S with respect to a function f : S → {0, 1} and scoring rule s is

score_{s,f}(R, x) := E[s_{f(x)}(pred(R, x))].

When the function f is clear from the context, for notational simplicity we often omit it and write score_s(R, x). Additionally, when s = hs, we sometimes omit it and write simply score(R, x).

The following lemma is key. It says that we can convert any algorithm which achieves bias γ into a forecasting algorithm which achieves expected score at least γ²/2 under the hs scoring rule; further, this conversion only manipulates the output of the algorithm, meaning it can be applied without changing the cost. That is, to turn R into a forecasting algorithm, we only need to run R, get an output 0 or 1, and then erase the output and write (1 − γ)/2 or (1 + γ)/2, respectively. Moreover, it is possible to convert backward as well! To turn a forecasting algorithm R into a normal randomized algorithm, run R, take the output q ∈ [0, 1], erase it and write down a sample from Bernoulli(q) instead. If the original forecasting algorithm achieved expected score η, the new algorithm will achieve bias at least η. In particular, this lemma tells us that the best expected score and the best bias that an algorithm can make (under any cost restriction) are always quadratically related.

Lemma 3.15 (Conversion between regular and forecasting algorithms).
A general algorithm R achieving worst-case bias γ > 0 for a function f can be converted into a forecasting algorithm R′ with worst-case score at least 1 − √(1 − γ²) ≥ γ²/2 for f. This conversion is pointwise: it depends only on changing a sample from the random variable output(R, x) after receiving it, as well as on the value of the worst-case bias γ.

Conversely, a forecasting algorithm R with worst-case score η can be converted into a general algorithm R′ with worst-case bias at least η. This conversion is pointwise: it depends only on changing a sample from pred(R, x) after receiving it (and not even on the value of η).

Proof. Start with a general algorithm R with worst-case bias γ > 0. On input x, run R to receive a sample b ∈ {0, 1} from output(R, x). Then output pred(R′, x) = (1 − γ)/2 if b = 0 and output pred(R′, x) = (1 + γ)/2 if b = 1. It is clear that this R′ was constructed in a pointwise fashion out of R, depending only on a sample from output(R, x). Now, fix x ∈ S, and let p ∈ [0, 1] be the probability that output(R, x) gives the right answer. Since R has worst-case bias γ, it has bias at least γ on x, so p ≥ (1 + γ)/2. The expected score of R′ on x is then

score(R′, x) = p hs((1 + γ)/2) + (1 − p) hs((1 − γ)/2)
 = p(1 − √((1 − γ)/(1 + γ))) + (1 − p)(1 − √((1 + γ)/(1 − γ)))
 = 1 − √((1 + γ)/(1 − γ)) + p(√((1 + γ)/(1 − γ)) − √((1 − γ)/(1 + γ)))
 ≥ 1 − √((1 + γ)/(1 − γ)) + ((1 + γ)/2)(√((1 + γ)/(1 − γ)) − √((1 − γ)/(1 + γ)))
 = 1 − ((1 − γ)/2)√((1 + γ)/(1 − γ)) − ((1 + γ)/2)√((1 − γ)/(1 + γ))
 = 1 − √(1 − γ²).

For the other direction, let R be a forecasting algorithm with worst-case score η > 0. On input x, run R to receive a sample q ∈ [0, 1] from pred(R, x). Then output 1 with probability q and 0 with probability 1 − q, i.e. output(R′, x) ∼ Bernoulli(q). It is clear that this R′ is constructed in a pointwise fashion out of R (without even a dependence on η). Now, fix x ∈ S. We know that η ≤ score(R, x) = E[hs_{f(x)}(pred(R, x))]. Now, we note that hs_{f(x)}(p) ≤ bias_{f(x)}(p) by Lemma 3.12. Thus we get η ≤ E[bias_{f(x)}(pred(R, x))] = bias_f(R′, x), as desired.

To demonstrate the power of these lemmas, observe that they imply a well-known amplification theorem for randomized algorithms, as we show in the lemma below. Note that this lemma does not refer to scoring rules or forecasting algorithms at all; those only appear as proof techniques.

Lemma 3.16 (informal). A randomized algorithm with bias γ can be amplified to bias 1/2 by repeating it 2/γ² times.

Proof. Start with an algorithm achieving bias γ. Using Lemma 3.15, get a forecasting algorithm with expected score at least 1 − √(1 − γ²) ≥ γ²/2. Using Lemma 3.10, repeating the algorithm k times increases the expected score on each input x to at least 1 − (1 − γ²/2)^k. Using Lemma 3.15, we get an algorithm with worst-case bias at least 1 − (1 − γ²/2)^k. Using Lemma 3.11, this is at least min{kγ²/4, 1/2}. Picking k ≥ 2/γ², we get an algorithm with worst-case bias at least 1/2 using only k repetitions of the original algorithm, as desired.

4 Randomized query and communication complexity
To prove a strong minimax theorem for randomized query complexity, we start by formally definingforecasting algorithms in the query complexity setting. We will need these forecasting algorithmsas a tool, despite our final statement not referring to them.
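Before the formal definitions, the objects involved are easy to prototype. Below is a minimal Python sketch (the dict-based representation and all names are ours, purely illustrative, not the paper's formalism) of a deterministic forecasting decision tree: internal nodes query a position of the input, leaves hold a prediction in [0, 1], the cost on x is the depth of the leaf reached, and the expected hs score against a distribution µ is computed as in Definition 3.8.

```python
import math

# A tree node is either a leaf {"pred": q} with q in [0, 1], or an internal
# node {"query": i, "children": {symbol: subtree}} querying position i of x.

def hs(q):
    """The hs scoring rule: hs(q) = 1 - sqrt((1 - q)/q), with hs(0) = -inf."""
    return 1.0 - math.sqrt((1.0 - q) / q) if q > 0 else float("-inf")

def run(tree, x):
    """Run the tree on input x; return (depth of leaf reached, prediction)."""
    depth = 0
    while "pred" not in tree:
        tree = tree["children"][x[tree["query"]]]
        depth += 1
    return depth, tree["pred"]

def cost_and_score(tree, mu, f):
    """Expected depth and expected hs score against distribution mu, where
    f(x) in {0, 1} is the correct outcome (recall hs_0(q) = hs(1 - q))."""
    cost = score = 0.0
    for x, w in mu.items():
        d, q = run(tree, x)
        cost += w * d
        score += w * (hs(q) if f(x) == 1 else hs(1.0 - q))
    return cost, score

# Example: query bit 0; query bit 1 only if bit 0 is 1.
tree = {"query": 0, "children": {
    "0": {"pred": 0.2},                      # fairly confident that f = 0
    "1": {"query": 1, "children": {
        "0": {"pred": 0.5}, "1": {"pred": 0.9}}},
}}

def f(x):
    return 1 if x == "11" else 0             # target: AND of the two bits

mu = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}
cost, score = cost_and_score(tree, mu, f)
```

For this example tree and the uniform distribution on {0, 1}², the expected cost works out to 1.5 and the expected hs score to 5/12; Theorem 4.2 below is a minimax statement about exactly this cost-to-score ratio, optimized over trees and over input distributions.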
Definition 4.1. A deterministic forecasting decision tree (on n ∈ N bits, with finite alphabet Σ) is a rooted tree on n bits whose internal vertices are labeled by [n], where each internal vertex has |Σ| children labeled by Σ, and where the leaves are labeled by [0, 1].

A randomized forecasting decision tree (on n ∈ N bits, with finite alphabet Σ) is a probability distribution over finitely many deterministic forecasting decision trees.

We interpret a randomized forecasting decision tree as a forecasting algorithm in the intuitive way, where cost(R, x) is the expected height of R on x (the expected height of the leaf of x in a deterministic forecasting tree sampled from the distribution R), and where pred(R, x) is the random variable which samples from the leaf label when a random deterministic tree from R is run on x. Note that since we restrict to distributions with finite support, we do not need to invoke measure theory or integrals in interpreting these probabilities and expectations, even though there are uncountably many deterministic forecasting decision trees.

We extend cost(R, ·) to the set ∆ of probability distributions over S by writing cost(R, µ) = E_{x←µ}[cost(R, x)], and similarly for score(R, µ) = E_{x←µ}[score(R, x)]. We now show a minimax theorem for the ratio of cost to score⁺ for randomized forecasting algorithms. This minimax theorem will form the base of our final result: we will convert the left-hand side to R(f), and convert the right-hand side to some desirable properties of a hard distribution µ.

Theorem 4.2.
Let n ∈ N, let Σ be a finite alphabet, let S ⊆ Σ^n, and let f : S → {0, 1}. Let R be the set of all randomized forecasting decision trees on n bits with alphabet Σ. Let ∆ be the set of probability distributions over S. Then

inf_{R∈R} max_{x∈S} cost(R, x)/score(R, x)⁺ = max_{µ∈∆} inf_{R∈R} cost(R, µ)/score(R, µ)⁺,

and the maximums are attained.

Proof. We use Theorem 2.18. All we need to do is verify that the conditions of the theorem hold. Our first task will be to deal with the strange set R; we wish to turn it into a convex subset of a real topological vector space. To do so, we define the vector v_R ∈ R^{2|S|} for each R ∈ R by v_R[x, 1] = cost(R, x) and v_R[x, 2] = score(R, x), and consider the set V = {v_R : R ∈ R}. For a vector v ∈ V, we define cost(v, x) = v[x, 1] and score(v, x) = v[x, 2], and we extend these definitions to cost(v, µ) and score(v, µ) by taking expectations over µ. Then it is clear that optimizing some function of cost(R, µ) and score(R, µ) over R is the same as optimizing the corresponding function of cost(v, µ) and score(v, µ) over V. Hence it suffices to show that

inf_{v∈V} max_{x∈S} cost(v, x)/score(v, x)⁺ = max_{µ∈∆} inf_{v∈V} cost(v, µ)/score(v, µ)⁺,

with the maximums attained.

To do so, we first note that V ⊆ R^{2|S|} is convex. This is because if v_1, v_2 ∈ V and λ ∈ (0, 1), we know there are algorithms R_1, R_2 ∈ R such that v_1 = v_{R_1} and v_2 = v_{R_2}, and then the algorithm λR_1 + (1 − λ)R_2 (which mixes the distributions R_1 and R_2 over deterministic forecasting decision trees) is a valid member of R. Then we have v_{λR_1+(1−λ)R_2}[x, 1] = cost(λR_1 + (1 − λ)R_2, x) = λ cost(R_1, x) + (1 − λ) cost(R_2, x) = λv_{R_1}[x, 1] + (1 − λ)v_{R_2}[x, 1], and similarly v_{λR_1+(1−λ)R_2}[x, 2] = λv_{R_1}[x, 2] + (1 − λ)v_{R_2}[x, 2], so v_{λR_1+(1−λ)R_2} = λv_{R_1} + (1 − λ)v_{R_2}.

Next, we note that cost(v, ·) and score(v, ·) are linear functions of µ; this is because they are defined as expectations over µ. Further, observe that cost(·, µ) and score(·, µ) are linear in v. It is also clear that cost(v, µ) and score(v, µ) are continuous in both v and µ.

It remains to check that cost and score are well-behaved. First, note that there is always an algorithm which queries all the bits and outputs the right answer f(x) with perfect confidence. Such an algorithm R has cost(v_R, µ) = n and score(v_R, µ) = 1 for all µ, so finite costs and scores are attainable. Next, note that if R is such that cost(v_R, µ) = 0 for any µ, then R must make no queries when run on µ. This means R makes no queries when run on any input, so cost(v_R, µ′) = 0 for all µ′ ∈ ∆. Finally, note that cost(·, µ) is linear for each µ, so if cost(v, µ) = 0 and cost(v′, µ) > 0, we necessarily have cost(λv + (1 − λ)v′, µ) > 0 for λ ∈ (0, 1). Hence all the conditions of Theorem 2.18 are satisfied, and the desired result follows.

Our next task is to relate the left-hand side of the equation in the last theorem to R(f).

Theorem 4.3.
Using the notation of Theorem 4.2, we have

inf_{R∈R} max_{x∈S} cost(R, x)/score(R, x)⁺ ≥ R(f)/240.

To prove this theorem, the idea is to take R from the left-hand side, amplify the score of R up to a constant (using the fact that score amplifies linearly), and then convert the constant score to constant bias (and hence constant error), getting an upper bound on R(f). This is slightly tricky, because the amount we need to amplify by may depend on the input x; for some x, both cost(R, x) and score(R, x) may be small, while for other x they are both large. Unfortunately, we do not have access to score(R, x) when we receive input x. Instead, in order to amplify by approximately the correct amount, we estimate cost(R, x) (by repeatedly running R on x and observing the number of queries), and we use this cost estimate to decide the amount of amplification needed.

Proof.
Let Y* be the optimal value of the left-hand side, and let R be an algorithm such that max_{x∈S} cost(R, x)/score(R, x)⁺ = Y, where Y is arbitrarily close to Y* (and Y ≥ Y*). Then in particular, score(R, x) > 0 for all x ∈ S, and for each x ∈ S we have cost(R, x)/score(R, x) ≤ Y.

Let R′ be a modification of R where we cut off each decision tree in the support of R after 2Y queries, and return 1/2 in case of a cutoff (ensuring we get a score of 0 for that branch). Note that by Markov's inequality, the probability of encountering a cutoff branch on input x to R′ is at most cost(R, x)/2Y ≤ Y score(R, x)/2Y = score(R, x)/2. Since each non-cut-off leaf can contribute at most 1 to the score (as the maximum of hs(·) is 1), and since the score at a cutoff is 0, the decrease in score when going from R to R′ is at most the probability of encountering a cutoff. It follows that score(R′, x) ≥ score(R, x) − score(R, x)/2 = score(R, x)/2 for all x ∈ S.

Next, we describe a randomized forecasting algorithm R′′. The algorithm R′′ runs R′ on x until the number of queries made reaches 10Y. Let L be the number of runs of R′ on x it takes to reach 10Y queries. Then R′′ runs R′ on x an additional L times, and uses those new runs to amplify the score, achieving score 1 − (1 − score(R′, x))^L. We wish to prove this score is at least a constant and that the total number of queries is only O(Y).

First, we bound the expectation of L, the random variable for the number of runs of R′ on x it takes to reach 10Y queries. Let X_i be i.i.d. random variables each representing the number of queries in a single run of R′ on x (so each X_i is supported on {0, 1, ..., 2Y}). Consider the total number of queries made until the cutoff is reached; this is ∑_{i=1}^L X_i. Let I_i be the Boolean random variable which is 0 if L < i and 1 if L ≥ i. Then ∑_{i=1}^L X_i = ∑_{i=1}^∞ X_i I_i. Note that the value of ∑_{i=1}^L X_i is always at most 10Y + 2Y, because after the threshold 10Y is reached, less than one full run of R′ on x will happen (using at most 2Y queries). Hence

12Y > E[∑_{i=1}^L X_i] = E[∑_{i=1}^∞ X_i I_i] = ∑_{i=1}^∞ E[X_i I_i]
 = ∑_{i=1}^∞ (Pr[I_i = 0] E[X_i I_i | I_i = 0] + Pr[I_i = 1] E[X_i I_i | I_i = 1])
 = ∑_{i=1}^∞ Pr[L ≥ i] E[X_i]
 = cost(R′, x) E[L].

(The equality E[∑_{i=1}^L X_i] = E[X_1] E[L], which we rederive here, is known as Wald's equation.) It follows that E[L] < 12Y/cost(R′, x). This means the total expected number of queries R′′ makes is at most 12Y for getting the estimate L, plus cost(R′, x) · E[L] < 12Y for amplifying the score, for a total of fewer than 24Y expected queries.

To bound the expected score, we start by ensuring L is not too small except with small probability. Note that for a constant T, we have Pr[L ≤ T] = Pr[∑_{i=1}^T X_i ≥ 10Y]. The sum ∑_{i=1}^T X_i has expected value T cost(R′, x) and has variance T times the variance of one X_i. Since X_i is non-negative and bounded above by 2Y, its variance is bounded above by Var[X_i] ≤ E[X_i²] ≤ 2Y E[X_i] = 2Y cost(R′, x). Hence, the variance of the sum is at most 2TY cost(R′, x). We use Chebyshev's inequality, writing

Pr[L ≤ T] = Pr[∑_{i=1}^T X_i ≥ 10Y] = Pr[∑_{i=1}^T X_i − T cost(R′, x) ≥ 10Y − T cost(R′, x)]
 ≤ 2TY cost(R′, x)/(10Y − T cost(R′, x))²,

which holds assuming T ≤ 10Y/cost(R′, x). In particular, if T = 2Y/cost(R′, x), then Pr[L ≤ T] ≤ 1/16.

Now, note that conditioned on L = ℓ, the expected score in the second round of R′′ is at least 1 − (1 − score(R′, x))^ℓ. This is increasing in ℓ; hence, conditioned on L > T, the expected score of R′′ on x is greater than 1 − (1 − score(R′, x))^T. Conditioned on L ≤ T, we still have the expected score be at least 0, since it is at least 0 for every fixed ℓ. Hence the final expected score of R′′ on x is greater than

(1 − (1 − score(R′, x))^T)(1 − Pr[L ≤ T]) ≥ 1 − (1 − score(R′, x))^T − Pr[L ≤ T].

Using T = 2Y/cost(R′, x), we get

score(R′′, x) > 1 − (1 − score(R′, x))^{2Y/cost(R′, x)} − 1/16
 ≥ (1/2) min{1, 2Y score(R′, x)/cost(R′, x)} − 1/16
 ≥ (1/2) min{1, Y score(R, x)/cost(R, x)} − 1/16
 ≥ 1/2 − 1/16 = 7/16.

This algorithm R′′ makes fewer than 24Y expected queries. We cut it off after 240Y queries, outputting prediction 1/2 (getting score 0) in case of a cutoff; this gives an algorithm R′′′ whose worst-case number of queries is 240Y, and whose expected score on each x ∈ S is at least 7/16 − 1/10 ≥ 1/3. Using Lemma 3.15, we can view R′′′ as a randomized algorithm computing f(x) with worst-case bias at least 1/3, and hence worst-case error at most 1/3. This means that R(f) ≤ 240Y. Since we can pick Y arbitrarily close to Y*, we also get that R(f) is at most the infimum of Y over feasible choices of Y, which is Y*, and the desired result follows.

Our next task is to show that the max-inf side of Theorem 4.2 gives us a distribution µ against which it is hard to tell apart 0-inputs from 1-inputs, in terms of the achievable squared-Hellinger distance between the distributions of the transcript on the 0- and 1-inputs. The following lemma will come in useful. We prove it in Appendix B.

Lemma 4.4 (Hellinger distance of disjoint mixtures). Let µ be a distribution over a finite support A, and for each a ∈ A, let ν_0^a and ν_1^a be two distributions over a finite support S_a. Let ν_0^µ and ν_1^µ denote the mixture distributions where a ← µ is sampled, and then a sample is produced from ν_0^a or ν_1^a respectively. Assume the sets S_a are disjoint for all a ∈ A. Then

h²(ν_0^µ, ν_1^µ) = E_{a←µ}[h²(ν_0^a, ν_1^a)].

Theorem 4.5.
Let n ∈ N, let Σ be a finite alphabet, let S ⊆ Σ^n, and let f : S → {0, 1} be a non-constant function. Then there exist distributions µ_0 on f^{-1}(0) and µ_1 on f^{-1}(1) such that for all randomized query algorithms R,

cost(R, µ)/h²(tran(R, µ_0), tran(R, µ_1)) ≥ R(f)/240.

Here µ = (µ_0 + µ_1)/2, and we interpret r/0 := ∞ for r ∈ [0, ∞).

Proof. Using Theorem 4.2 and Theorem 4.3, we get a distribution µ on S such that for all randomized forecasting algorithms R, we have cost(R, µ)/score(R, µ)⁺ ≥ R(f)/240. Note that it must be the case that an algorithm R which makes no queries must have score(R, µ) ≤ 0; this is because we have R(f) ≥ 1 (since f is non-constant), and if there was an algorithm with cost 0 achieving positive score, we'd have cost(R, µ)/score(R, µ)⁺ = 0, giving a contradiction. Therefore, it must be the case that µ places equal weight on 0- and 1-inputs, because otherwise a 0-cost algorithm could indeed predict f(x) with positive bias (and hence positive score by Lemma 3.15) against µ. We set µ_0 to be the conditional distribution of µ on the 0-inputs of f, and set µ_1 to be the conditional distribution of µ on the 1-inputs of f.

Next, we simplify the expression inf_R cost(R, µ)/h²(tran(R, µ_0), tran(R, µ_1)). We view a randomized query algorithm R as an element of the convex hull of the set of all deterministic decision trees with no leaf labels. Now, note that cost(R, µ) and h²(tran(R, µ_0), tran(R, µ_1)) are both linear functions of (the probability vector of) R; for the latter, this is due to Lemma 4.4. Then by Lemma 2.12, the ratio is quasiconcave in R, and by Lemma 2.10, the infimum of this ratio over randomized query algorithms R is equal to the minimum over deterministic query algorithms A. Therefore, it suffices to show that for each deterministic query algorithm A making a non-zero number of queries, we have cost(A, µ)/h²(tran(A, µ_0), tran(A, µ_1)) ≥ R(f)/240.

Fix such A. We assume its leaves are not labeled. By Lemma 3.9, we can label the leaves of A such that score(A, µ) = h²(tran(A, µ_0), tran(A, µ_1)). This labeling does not affect the cost. Then

cost(A, µ)/h²(tran(A, µ_0), tran(A, µ_1)) = cost(A, µ)/score(A, µ)⁺ ≥ R(f)/240,

as desired.

Finally, we strengthen this to a lower bound for the minimum of cost(R, µ_0) and cost(R, µ_1), instead of for their average cost(R, µ).

Theorem 4.6.
Let n ∈ N, let Σ be a finite alphabet, let S ⊆ Σ^n, and let f : S → {0, 1} be a non-constant function. Then there exist distributions µ_0 on f^{-1}(0) and µ_1 on f^{-1}(1) such that for all randomized query algorithms R,

min{cost(R, µ_0), cost(R, µ_1)}/h²(tran(R, µ_0), tran(R, µ_1)) ≥ R(f)/3000,

where we interpret r/0 := ∞ for r ∈ [0, ∞).

Proof. We use µ_0 and µ_1 from Theorem 4.5. Note that

inf_R min{cost(R, µ_0), cost(R, µ_1)}/h²(tran(R, µ_0), tran(R, µ_1))
 = inf_{R, b∈{0,1}} cost(R, µ_b)/h²(tran(R, µ_0), tran(R, µ_1))
 = min_{b∈{0,1}} inf_R cost(R, µ_b)/h²(tran(R, µ_0), tran(R, µ_1)).

By the same argument as in the proof of Theorem 4.5, this last infimum over R is equal to the infimum over deterministic unlabeled decision trees D with height at least 1.

Let D be such an algorithm. By Theorem 4.5, it suffices to show that

min{cost(D, µ_0), cost(D, µ_1)}/h²(tran(D, µ_0), tran(D, µ_1)) ≥ (1/c) min_{D′} cost(D′, µ)/h²(tran(D′, µ_0), tran(D′, µ_1)),

where µ = (µ_0 + µ_1)/2 and c is a constant satisfying 240c ≤ 3000. By Lemma 3.9, we can label the leaves of D so that we have the property h²(tran(D, µ_0), tran(D, µ_1)) = score(D, µ), and similarly for D′. The desired inequality is trivial when score(D, µ) = 0 (since the ratio is then ∞), so suppose score(D, µ) > 0. We wish to show

min{cost(D, µ_0), cost(D, µ_1)}/score(D, µ) ≥ (1/c) min_{D′} cost(D′, µ)/score(D′, µ).

In other words, we wish to show that there exists a deterministic forecasting algorithm D′ such that cost(D′, µ)/score(D′, µ) ≤ c cost(D, µ_b)/score(D, µ), regardless of whether b = 0 or b = 1.

We construct such a D′. The idea is to start with D, and then cut off the branches that are much more likely under µ_{1−b} than under µ_b. That is, for a vertex v of D, let µ_b[v] denote the probability that v is reached when D is run on an input from µ_b, and define µ_{1−b}[v] similarly. Recall that the leaves of D are labeled according to the strategy that achieves score(D, µ) = h²(tran(D, µ_0), tran(D, µ_1)), which, by Lemma 3.9, is such that at a leaf v, the algorithm D outputs µ_1[v]/2µ[v].

Pick a constant a ∈ (1/2, 1), and let D′ be the algorithm which cuts off D the first time it enters a vertex v for which µ_{1−b}[v]/2µ[v] ≥ a, and outputs a (if b = 0) or 1 − a (if b = 1) instead of continuing to run D. Let V be the set of all vertices which cause such a cutoff; note that no vertex in V is a descendant of another vertex in V. For v ∈ V, let µ^v be the distribution µ conditioned on reaching v, and similarly define µ_0^v and µ_1^v. Let µ* be the distribution µ conditioned on reaching none of the vertices in V, and similarly define µ_0* and µ_1*. Since we are dealing with a deterministic decision tree, the distributions µ^v have disjoint supports for all the different v ∈ V, and they're also disjoint from µ*; indeed, µ is a disjoint mixture of all these distributions. It follows that score(D, µ) is a mixture of the terms score(D, µ^v) and of score(D, µ*). The score score(D′, µ) of the algorithm D′ is also such a mixture.

Now, note that score(D, µ^v) ≤ 1, and that score(D′, µ^v) = E_{x←µ^v}[hs_{f(x)}(a)] if b = 0 and score(D′, µ^v) = E_{x←µ^v}[hs_{f(x)}(1 − a)] if b = 1. This means

score(D′, µ^v) = (µ_b[v]/2µ[v]) hs(1 − a) + (µ_{1−b}[v]/2µ[v]) hs(a)
 = (1 − p) hs(1 − a) + p hs(a)
 = 1 − (1 − p)√(a/(1 − a)) − p√((1 − a)/a),

where p = µ_{1−b}[v]/2µ[v] ≥ a. Since a > 1/2, this is increasing in p, so we have score(D′, µ^v) ≥ 1 − 2√(a(1 − a)), and hence score(D′, µ^v) ≥ (1 − 2√(a(1 − a))) score(D, µ^v). It also holds that score(D′, µ*) = score(D, µ*) ≥ (1 − 2√(a(1 − a))) score(D, µ*). Since score(D, µ) and score(D′, µ) are matching mixtures of score(D, µ^v) and score(D′, µ^v) respectively, it follows that score(D′, µ) ≥ (1 − 2√(a(1 − a))) score(D, µ).

We now analyze the cost of D′. Note that cost(D′, µ) = (1/2) cost(D′, µ_b) + (1/2) cost(D′, µ_{1−b}); we clearly have cost(D′, µ_b) ≤ cost(D, µ_b), so it suffices to upper bound cost(D′, µ_{1−b}). This is the expected height of a leaf D′ reaches when run on µ_{1−b}, which is a mixture of cost(D′, µ*_{1−b}) and the terms cost(D′, µ^v_{1−b}). Now, note that a leaf u reached by D′ under µ*_{1−b} must have µ_{1−b}[u]/2µ[u] < a, or µ_b[u] > (1 − a)/a · µ_{1−b}[u]. It follows that cost(D′, µ*_{1−b}) ≤ a/(1 − a) · cost(D′, µ*_b) = a/(1 − a) · cost(D, µ*_b). Similarly, for each v ∈ V, the parent u of v satisfies µ_{1−b}[u]/2µ[u] < a, meaning that µ_b[u] > (1 − a)/a · µ_{1−b}[u]; note that since this parent u of v is not a leaf, conditioned on reaching u the height of the path will always be at least the height of v (one more than the height of u); since cost(D′, µ^v_{1−b}) is exactly the height of v, we necessarily have cost(D, µ^v_b) ≥ cost(D′, µ^v_b) ≥ (1 − a)/a · cost(D′, µ^v_{1−b}). We conclude that cost(D′, µ_{1−b}) ≤ a/(1 − a) · cost(D, µ_b), and hence

cost(D′, µ) ≤ (1/2 + a/(2(1 − a))) cost(D, µ_b) = cost(D, µ_b)/(2(1 − a)).

We therefore have

cost(D′, µ)/score(D′, µ) ≤ 1/(2(1 − a)(1 − 2√(a(1 − a)))) · cost(D, µ_b)/score(D, µ).

Optimizing over a, we pick a = (2 + √2)/4 to get

cost(D′, µ)/score(D′, µ) ≤ (6 + 4√2) cost(D, µ_b)/score(D, µ),

from which the desired result follows (note that 240(6 + 4√2) ≤ 3000).

Corollary 4.7.
Let n ∈ N, let Σ be a finite alphabet, let S ⊆ Σ^n, and let f : S → {0, 1} be a function. Then there exists a distribution µ on S such that for all γ ∈ [0, 1],

R^µ_γ̇(f) ≥ γ² R(f)/500.

Here γ̇ = (1 − γ)/2 and R^µ_ǫ(f) denotes the average cost (against µ) of a randomized algorithm achieving error at most ǫ (against µ) for solving f.

Proof. If f is constant, then R(f) = 0 and the desired bound trivially follows. Therefore, assume f is not constant. We use the distribution µ from Theorem 4.5. Let R be a randomized algorithm which achieves bias γ against µ. Then using Lemma 3.15, we can convert R into a forecasting algorithm R′ which achieves expected score 1 − √(1 − γ²) ≥ γ²/2 against µ, and has the same distribution over query trees (that is, only the leaves changed). Now, by the property of µ, we know that

cost(R′, µ)/score(R′, µ) ≥ R(f)/240,

where we used Lemma 3.9 to get a result for score instead of Hellinger distance in the denominator, and where we used the fact that R achieves non-zero bias against µ (despite µ being balanced between 0- and 1-inputs) to conclude that R does not make 0 queries. Using score(R′, µ) ≥ γ²/2 and cost(R, µ) = cost(R′, µ), we get 2 cost(R, µ)/γ² ≥ R(f)/240, or cost(R, µ) ≥ γ² R(f)/480 ≥ γ² R(f)/500, as desired.

Theorem 1.8 (Restated). For any non-constant partial function F : X × Y → {0, 1} over finite sets X and Y, there is a pair of distributions µ_0 on F^{-1}(0) and µ_1 on F^{-1}(1) such that for any public-randomness communication protocol Π, the squared Hellinger distance between the distribution of its transcripts on µ_0 and µ_1 is bounded above by

h²(tran(Π, µ_0), tran(Π, µ_1)) = O(min{cost(Π, µ_0), cost(Π, µ_1)}/RCC(F)).

Proof.
This theorem follows directly from Theorem 4.6 once we realize that a communication function can be interpreted as a query function. That is, we take F and convert it into a query function f as follows. The input to f will contain one bit for each possible function of X (that Alice might send to Bob), and one bit for each possible function of Y (that Bob might send to Alice), for a total input length of n = 2^|X| + 2^|Y|. The inputs to f will be the strings in {0, 1}^n which are generated by a pair (x, y) ∈ S; that is, the strings z ∈ {0, 1}^n for which there exists a pair (x, y) ∈ S such that z_k is the result of applying the k-th possible function to x (if k ≤ 2^|X|) or the (k − 2^|X|)-th possible function to y (if k > 2^|X|). Then f is a Boolean function with a domain of size |S|, with each string in its domain corresponding to a string in S.

We note that RDT(f) = RCC(F). This is clear from the definition of RCC(F): the public-coin randomness essentially means that Alice and Bob agree on a randomized decision tree in advance, including on who speaks when (as a function of the transcript), which is equivalent to agreeing on a decision tree for f in advance. The transcript of f on an input is precisely the transcript of F on the corresponding input, with the catch that in query complexity we defined the transcript to include the deterministic decision tree used by the protocol; hence, the query version of a transcript of f actually corresponds to (R, Π) for F, where R is the public randomness and Π is the usual communication complexity transcript. The desired result then follows immediately from applying Theorem 4.6 to f.

Corollary 4.8.
Let X and Y be finite sets, let S ⊆ X × Y, and let F : S → {0, 1} be a function. Then there exists a distribution µ on S such that for all γ ∈ [0, 1],

RCC^µ_γ̇(F) = Ω(γ² RCC(F)).

In contrast to the classical case, it is well-known that quantum algorithms can be amplified linearly in 1/γ, where γ is the bias. Formally, we have the following theorem.

Theorem 5.1 (Amplitude estimation). Suppose we have access to a unitary U (representing a quantum algorithm) which maps |0⟩ to |ψ⟩, as well as access to a projective measurement Π, and we wish to estimate p := ‖Π|ψ⟩‖² (representing the probability that the quantum algorithm accepts). Fix ǫ, δ ∈ (0, 1/2). Then using at most (100/ǫ) · ln(1/δ) controlled applications of U or U† and at most that many applications of I − 2Π, we can output p̃ ∈ [0, 1] such that |p̃ − p| ≤ ǫ with probability at least 1 − δ.

This theorem follows from [BHMT02], as well as from the arguably simpler techniques in [AR20]. (In fact, these authors show something slightly stronger: amplitude estimation can be done with overhead O((1/√ǫ + √p · (1/ǫ)) · log(1/δ)). We refer the interested reader to Appendix C to see how this follows from [BHMT02].)

Given that quantum algorithms can be amplified linearly in the bias, it would seem that the desired minimax theorem follows easily from Theorem 2.18: simply apply a minimax to cost(Q, µ)/bias_f(Q, µ)^+, where Q is a quantum algorithm and µ is a distribution over the inputs. Then use the linear amplification result to argue that min_Q max_µ cost(Q, µ)/bias_f(Q, µ)^+ is Θ(Q(f)). Sounds simple! (This works better than for randomized algorithms, because bias_f(·, ·) is saddle while bias_f(·, ·)² is not.) Unfortunately, there is an annoying hole in this argument: the function cost(Q, µ) is not convex in Q.
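Before turning to the convexity issue, it is worth quantifying the classical/quantum contrast mentioned above with a quick calculation that is not from the paper: classically, detecting a bias γ by majority vote over independent runs requires on the order of 1/γ² runs, while amplitude estimation pays only about 1/ǫ. The function below computes the exact success probability of a classical majority vote; the trial counts 21 and 401 are illustrative choices near 1/γ and 1/γ² for γ = 0.05.

```python
from math import comb

def majority_success(gamma: float, n: int) -> float:
    """Probability that the majority of n independent votes is correct,
    when each vote is correct with probability (1 + gamma) / 2 (bias gamma).
    n is assumed odd so there are no ties."""
    p = (1 + gamma) / 2
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n // 2) + 1, n + 1))

gamma = 0.05
few = majority_success(gamma, 21)    # ~1/gamma trials: barely better than 1/2
many = majority_success(gamma, 401)  # ~1/gamma**2 trials: constant success
```

This is the quadratic overhead that the quadratic-vs-linear distinction in the text refers to; the quantum amplification results avoid it.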
While it is not immediately clear what a convex combination of two quantum algorithms Q_0 and Q_1 should be, most intuitive definitions will have the convex combination use a number of unitaries that is equal to the maximum of the number used in Q_0 and Q_1, rather than the average. To get around this, we switch the computational model from quantum algorithms to probability distributions over quantum algorithms. These probabilistic quantum algorithms have outputs and biases defined in the intuitive way, but their cost is defined as the expected cost of the underlying quantum algorithms, rather than the maximum cost. This ensures the function cost(·, ·) will be saddle, and Theorem 2.18 can be applied. The trick then becomes showing that these probabilistic quantum algorithms can still be amplified linearly. This turns out to be true, up to logarithmic factors. Once amplified, constant-error probabilistic quantum algorithms can be converted into ordinary quantum algorithms, giving us a minimax theorem that can be applied to ordinary quantum algorithms as well.

5.1 Quantum query complexity

Our goal in this section will be to prove the following theorem.
Theorem 5.2.
For any Boolean-valued function f, there exists a distribution µ over Dom(f) such that for any γ ∈ [0, 1], we have Q^µ_γ̇(f) ≥ γ · Ω̃(Q(f)). Here Q^µ_γ̇(f) denotes the minimum number of queries required by a quantum algorithm which achieves bias γ against µ for computing f. The constants in the Ω̃ notation are universal.

In fact, we will prove a stronger (and tighter) version in terms of probabilistic quantum algorithms. These are simply probability distributions over quantum algorithms of possibly different query costs; we define the cost of a probabilistic quantum algorithm as the expected cost of a quantum algorithm sampled from the probability distribution.
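The convexity fix can be made concrete with a toy cost model (a sketch with hypothetical names, not code from the paper): representing a probabilistic quantum algorithm by its support over (probability, query cost) pairs, the cost of a convex combination is the average of the costs, not the maximum, which is exactly what makes cost(·, ·) linear in the algorithm.

```python
from dataclasses import dataclass

@dataclass
class ProbQuantumAlg:
    """Toy model of a probabilistic quantum algorithm: a distribution
    over underlying algorithms, recorded as (probability, query cost) pairs."""
    support: list

    def expected_cost(self) -> float:
        return sum(p * c for p, c in self.support)

def mix(P1: ProbQuantumAlg, P2: ProbQuantumAlg, a: float) -> ProbQuantumAlg:
    """Convex combination a*P1 + (1-a)*P2 of two probabilistic algorithms."""
    return ProbQuantumAlg([(a * p, c) for p, c in P1.support] +
                          [((1 - a) * p, c) for p, c in P2.support])

P1 = ProbQuantumAlg([(1.0, 100)])  # always runs a 100-query algorithm
P2 = ProbQuantumAlg([(1.0, 0)])    # always guesses for free
M = mix(P1, P2, 0.5)
# Expected cost of the mixture is the average (50), not the maximum (100).
```

A worst-case (maximum) cost for the mixture would be 100, which is not linear in the mixing weight; the expected cost is, and this is the property Theorem 2.18 needs.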
Definition 5.3. A probabilistic quantum algorithm is a probability distribution P over quantum algorithms. For an input string x, we let P(x) be the random variable that outputs a sample from Q(x), where Q is a quantum algorithm sampled from P. The cost of P, denoted |P|, is the expected cost of a quantum algorithm sampled from P. The error of P on input x to a Boolean function f is defined as Pr_{Q∼P}[Q(x) ≠ f(x)].

Definition 5.4.
Let f be a Boolean-valued function with Dom(f) ⊆ Σ^n. We define PQ_γ̇(f) to be the minimum cost |P| of a probabilistic quantum algorithm P which computes f to worst-case bias γ.

Theorem 5.5.
For any Boolean function f and any γ ∈ (0, 1/2], we have PQ_γ̇(f) = Θ̃(γ Q(f)). More explicitly,

PQ_γ̇(f) = O(γ Q(f)),
PQ_γ̇(f) = Ω( γ Q(f) / (log(1/γ) log log(1/γ)) ).

Proof.
For the upper bound, let Q be a quantum algorithm computing f to error 1/3 using Q(f) queries. Let Q′ be the probabilistic quantum algorithm which runs Q with probability 3γ and otherwise uses no queries and guesses the output at random (with probability 1/2 for outputting each of 0 and 1). The probability of error of Q′ is at most 3γ(1/3) + (1 − 3γ)(1/2) = (1 − γ)/2, which means its bias is at least γ on every input. The expected number of queries Q′ uses is 3γ Q(f). Hence we have PQ_γ̇(f) ≤ 3γ Q(f).

For the lower bound, we start with a probabilistic quantum algorithm P which achieves worst-case bias γ and has cost |P| = PQ_γ̇(f), and make several modifications to it. First, we remove from the support of P all quantum algorithms which use more than 2|P|/γ queries, and we replace them with a 0-query quantum algorithm that guesses the answer at random (with probability 1/2 on each of the outputs 0 and 1). This gives us a probabilistic quantum algorithm P_1 which uses at most 2|P|/γ queries even in the worst case, which has |P_1| ≤ |P|, and whose worst-case bias is at least γ/2 (since by Markov's inequality, the probability mass over the removed quantum algorithms was at most γ/2, and they could have had bias at most 1, which turned into bias 0, decreasing the overall bias by at most γ/2).

Next, we modify P_1 to get a probabilistic algorithm P_2 which always uses a number of queries which is a power of 2. This can be done simply by increasing the number of queries each algorithm in the support of P_1 makes (and ignoring the extra queries). This way, we have |P_1| ≤ |P_2| ≤ 2|P_1|, the largest number of queries P_2 can make is at most 4|P|/γ, and the bias of P_2 is at least γ/2 on every input.

Further, we modify P_2 to get a probabilistic quantum algorithm P_3 which always uses at least 8|P| queries (but still only uses a number of queries which is a power of 2). This can be done by again increasing the number of queries a quantum algorithm in the support of P_2 makes, when necessary.
This adds at most an additive 16|P| queries (since the smallest power of 2 which is at least 8|P| is smaller than 16|P|). Hence |P_3| < |P_2| + 16|P| ≤ 18|P|. Note that P_3 achieves bias at least γ/2 on every input, and that P_3 always uses a number of queries which is a power of 2 in the range [8|P|, 4|P|/γ).

Finally, we modify P_3 to get P_4, which collapses together all quantum algorithms in the support of P_3 that use the same number of queries. That is, instead of placing support on two different quantum algorithms which both use (say) 64 queries, P_4 will place support on a single quantum algorithm which implements the mixture of both. This does not affect the number of queries or the bias of the algorithm. Hence we have |P_4| < 18|P|, and P_4 achieves bias at least γ/2 on each input. Further, P_4 has support on fewer than log(1/γ) quantum algorithms.

Next we introduce some notation for talking about P_4. Let L = ⌊log(1/γ)⌋ and let k be the smallest power of 2 which is at least 8|P|. Let the quantum algorithms in the support of P_4 be Q_1, Q_2, ..., Q_L, with Q_i using 2^{i−1} k queries for each i. Let p_i be the probability P_4 assigns to algorithm Q_i. Then p_i ≥ 0 for all i, and ∑_{i=1}^L p_i = 1. We also have ∑_{i=1}^L p_i 2^{i−1} k = |P_4| < 18|P|, which means ∑_{i=1}^L p_i 2^i < 5. On input x, let α_i(x) be the probability that Q_i outputs 1 when run on x, and let β_i(x) := 1 − 2α_i(x). This way, (−1)^{f(x)} β_i(x) is the bias of Q_i when run on x. Then ∑_{i=1}^L p_i β_i(x) is (−1)^{f(x)} times the bias of P_4 on x, which means that it is negative if f(x) = 1, positive if f(x) = 0, and satisfies |∑_{i=1}^L p_i β_i(x)| ≥ γ/2.

We now wish to amplify P_4 from bias γ/2 to constant bias. To do so, it suffices to estimate ∑_{i=1}^L p_i β_i(x) to additive error less than γ/2, and output the sign of this estimate. Our query budget for this task will be roughly |P|/γ.
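The role of the constraint ∑_i p_i 2^i < 5 just derived can be checked numerically: it makes a per-algorithm error budget proportional to 2^i affordable, so the combined estimate of ∑_i p_i β_i(x) stays well within the γ/2 target. The weights below are hypothetical, chosen only to satisfy the constraint; this is an illustrative sketch, not the paper's code.

```python
# Error accounting for combining per-algorithm bias estimates.
# Weight p_i goes to an algorithm whose bias beta_i(x) is estimated to within
# 2**i * gamma / 20; the constraint sum_i p_i * 2**i < 5 keeps the total small.
gamma = 0.05
weights = [0.7, 0.15, 0.08, 0.04, 0.03]  # hypothetical p_i for i = 1..5
budget = sum(p * 2 ** (i + 1) for i, p in enumerate(weights))
assert budget < 5  # the constraint from the proof
worst_total_error = sum(p * 2 ** (i + 1) * gamma / 20
                        for i, p in enumerate(weights))
# The sign of the estimated sum is then correct, since the true sum has
# absolute value at least gamma / 2, more than twice the total error.
assert worst_total_error < gamma / 4
```

The point of the geometric error budget is that the expensive algorithms (large i, cost 2^{i−1}k) are exactly the ones allowed coarse estimates, which is what keeps the total query cost near-linear in 1/γ.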
We know the values p_i, and seek to generate estimates β̃_i(x) for β_i(x). We will say an estimate β̃_i(x) is good if |β̃_i(x) − β_i(x)| ≤ 2^i γ/20. This way, if all β̃_i(x) are good, our final estimate for the sum will satisfy

| ∑_{i=1}^L p_i β̃_i(x) − ∑_{i=1}^L p_i β_i(x) | = | ∑_{i=1}^L p_i (β̃_i(x) − β_i(x)) | ≤ ∑_{i=1}^L p_i |β̃_i(x) − β_i(x)| ≤ ∑_{i=1}^L p_i 2^i γ/20 < γ/4,

where we used ∑_i p_i 2^i < 5.

To generate β̃_i(x), we use Theorem 5.1 on algorithm Q_i with ǫ = 2^i γ/20 and δ = 1/3L. Since the query cost of Q_i is 2^{i−1} k, this uses at most 1000 · (k/γ) · ln(3L) queries. Since k < 16|P| and L ≤ log(1/γ), this costs O(|P|/γ · log log(1/γ)). The query cost of generating all L estimates this way is therefore O(|P|/γ · log(1/γ) log log(1/γ)). The probability that any one estimate is not good is at most 1/3L by our choice of δ, so by the union bound, all are good except with probability 1/3; hence we've given a quantum algorithm which achieves worst-case bounded error for computing f, and whose query cost is O(PQ_γ̇(f)/γ · log(1/γ) log log(1/γ)), as desired.

Using this theorem, we now proceed to prove a strong minimax theorem for PQ_γ̇(f), showing that a single hard distribution µ works to lower bound this measure for all values of γ at once.

Theorem 5.6.
Fix a finite alphabet Σ as well as n ∈ ℕ. Let f be a Boolean-valued function with Dom(f) ⊆ Σ^n. Then there exists a distribution µ over Dom(f) such that for any γ ∈ [0, 1], we have PQ^µ_γ̇(f) ≥ γ · Ω̃(Q(f)), where the constants in the Ω̃ notation are universal.
As usual, the notation PQ^µ_γ̇(f) denotes the expected cost of a probabilistic quantum algorithm which is required to achieve bias at least γ against µ (rather than in the worst case); that is, the algorithm and the bias level γ are both allowed to depend on the distribution µ. Note that since PQ^µ_γ̇(f) is always smaller than Q^µ_γ̇(f) for any given bias level, this implies Theorem 5.2.

Proof.
Fix Σ, n, and f. Let R be the set of all probabilistic quantum algorithms for computing f. For each P ∈ R and each distribution µ over Dom(f), define cost(P, µ) := |P| and define score(P, µ) to be the bias P achieves against distribution µ for computing f (this will be in the range [−1, 1]).

We will use Theorem 2.18. It is clear that R is convex, and that Dom(f) is a nonempty finite set. Let ∆ denote the set of all probability distributions over Dom(f). Then cost and score are continuous functions R × ∆ → ℝ, with cost(·, ·) always non-negative, and both functions are linear in both variables. These functions are well-behaved, since finite cost and score can be achieved (some quantum algorithm computes f with positive bias), the cost is independent of the input, and mixing a zero-cost algorithm with a nonzero-cost algorithm gives a nonzero-cost algorithm. Hence Theorem 2.18 gives us

inf_{P∈R} max_{x∈Dom(f)} |P|/score(P, x)^+ = max_{µ∈∆} inf_{P∈R} |P|/score(P, µ)^+,

where we use the convention r/∞ = 0 for all r ∈ ℝ.

We simplify the left-hand side. For a probabilistic quantum algorithm P, we use bias_f(P) to denote its worst-case bias, that is, bias_f(P) := min_{x∈Dom(f)} score(P, x). Then the left-hand side is the infimum over P of |P|/bias_f(P)^+. Since a probabilistic algorithm P with bias_f(P) ≤ 0 will never be selected in this infimum, the left-hand side is equal to

inf_{γ∈(0,1]} inf_{P∈R_γ} |P|/γ,

where R_γ denotes the set of all probabilistic quantum algorithms which achieve worst-case bias at least γ. The inner infimum is the definition of (1/γ) · PQ_γ̇(f), so the left-hand side equals inf_{γ∈(0,1]} PQ_γ̇(f)/γ.

Note that this is at most 3Q(f), by picking γ = 1/3 and using PQ_γ̇(f) ≤ Q(f) for γ = 1/3. We claim there is no reason to use any γ ∈ (0, 1/(6Q(f))) in the infimum.
The reason is that if P is a probabilistic quantum algorithm achieving worst-case bias at least γ such that |P|/γ < 3Q(f), and if γ < 1/(6Q(f)), then |P| < 1/2, which means that P has nonzero support on zero-cost quantum algorithms. Without loss of generality, we can assume P = aP_0 + bP_1 + (1 − a − b)P′, where P_0 is a zero-cost algorithm that always outputs 0, P_1 is a zero-cost algorithm that always outputs 1, and P′ is a probabilistic algorithm with no support on zero-cost algorithms. Let c = min{a, b}, and write P = 2cZ + (1 − 2c)P′′, where Z is the 0-cost algorithm which is an even mixture of P_0 and P_1. Then it is not hard to see that |P| = (1 − 2c)|P′′| and score(P, µ) = (1 − 2c) score(P′′, µ) for all µ. This means that P′′ has the same cost-to-score ratio as P for all distributions µ. Hence we can always use P′′ in place of P in the infimum. Further, supposing without loss of generality that b ≥ a, we have P′′ = λP_1 + (1 − λ)P′ with λ = (b − a)/(1 − 2a). Since f is not constant, let x be an input on which f(x) disagrees with P_1(x) (that is, a 0-input). Then note that if λ ≥ 1/2, the algorithm P′′ cannot output 0 on x with probability above 1/2, so score(P′′, x) ≤ 0 and P′′ will not be used in the infimum. On the other hand, if λ < 1/2, we have |P′′| = (1 − λ)|P′| > (1/2) · 1 = 1/2, as P′ does not place weight on algorithms which make 0 queries. Now, unless P′′ achieves worst-case bias at least 1/(6Q(f)), its ratio of cost to score would be greater than 3Q(f), which we already know is achievable.

This means we only need to use γ ≥ 1/(6Q(f)) in the infimum. Thus the left-hand side equals

inf_{γ ∈ [1/(6Q(f)), 1]} PQ_γ̇(f)/γ.

Using Theorem 5.5, this is at least

inf_{γ ∈ [1/(6Q(f)), 1]} Q(f)/(C log(1/γ) log log(1/γ))

for some universal constant C. The above is clearly optimized at γ = 1/(6Q(f)), which means the left-hand side is at least Ω( Q(f)/(log Q(f) log log Q(f)) ).

Looking at the right-hand side, we see that there exists a distribution µ such that every probabilistic quantum algorithm P satisfies |P|/score(P, µ)^+ ≥ Ω̃(Q(f)), from which the desired statement follows.

We note that the argument we used to prove the existence of the hard distribution for quantum query complexity only used a few properties of quantum algorithms. Since we will want to apply the same argument to quantum communication, polynomial degree, and logrank, it makes sense to step back and provide an abstraction of this argument to more general models.

In general, we will consider Boolean-valued functions f with a finite input set Dom(f). We will have a set A of algorithms that may attempt to compute f. Formally, we will need A to be a subset of a real vector space. Each A ∈ A will have an associated cost, denoted |A|, with |·| : A → [0, ∞). We write A_T to denote the set {A ∈ A : |A| ≤ T}.

For an algorithm A ∈ A and an input x ∈ Dom(f), we let bias_f(A, x) denote the bias of algorithm A on input x. For now, the only property we need of the bias is that it is a function bias_f : A × Dom(f) → [−1, 1]. The worst-case bias of an algorithm A will be denoted bias_f(A) := min_{x∈Dom(f)} bias_f(A, x). If µ is a distribution over Dom(f), we will further write bias_f(A, µ) := E_{x∼µ}[bias_f(A, x)]. Similarly, if P is a probability distribution over A with finite support, we denote bias_f(P, µ) := E_{A∼P} E_{x∼µ}[bias_f(A, x)] and bias_f(P) := min_{x∈Dom(f)} bias_f(P, x). We also set |P| := E_{A∼P} |A|. Finally, we define M(f) := inf_{A∈A : bias_f(A) ≥ 1/3} |A|.

So far, this setting is extremely general, capturing many computational models. For the quantum-style strong minimax to work, we will need the following properties to also hold for a given function f.

1. A_T is convex for each T ∈ [0, ∞), and bias_f(·, x) is linear over A ∈ A_T for each x ∈ Dom(f).

2. There exists some A ∈ A such that bias_f(A) ≥ 1/3. (Equivalently, M(f) < ∞.)

3. All A ∈ A with |A| < 1 have |A| = 0, and A_0 is the convex hull of exactly two algorithms, Z_0 and Z_1. For each x ∈ Dom(f), we also have bias_f(Z_0, x) = −bias_f(Z_1, x) = ±1, and if f is not constant, bias_f(Z_0, x) attains both values 1 and −1 for x ∈ Dom(f).

4. Suppose P is a probability distribution over A that has support {A_1, A_2, ..., A_k}, with probability p_i for A_i, such that (a) |A_i| ≤ 2^i T for some T ∈ [1/2, ∞), (b) ∑_i 2^i p_i ≤ 5, and (c) bias_f(P) ≥ 2^{−k−2}. Then there is some A ∈ A with bias_f(A) ≥ 1/3 and |A| ≤ 2^k T · poly(k) (with the constants in the poly being universal).

We note that (1) essentially requires the computational model to be randomized (or, in communication complexity, to have public randomness). (2) only says that each function can be computed by some finite-cost algorithm. (3) says that algorithms with cost less than 1 cannot look at the input, and therefore have cost 0 and must either always output 0 or always output 1 (or some convex combination of the two).

The main important point is (4).
This point amplifies a certain restricted type of low-bias probability distribution over algorithms into a full-blown constant-bias algorithm, and the cost of amplification is nearly linear in one over the bias.

We now prove that these points together suffice to guarantee the existence of a strongly-hard distribution. To start, we establish the following lemma, which says that if (4) holds (meaning we can amplify the restricted type of probabilistic algorithms), then we can amplify all probabilistic algorithms.

Lemma 5.7.
Suppose f and A satisfy the above conditions. Let P be any finite-support probability distribution over A with bias_f(P) > 0. Then

M(f) ≤ (|P|/bias_f(P)) · polylog(1/bias_f(P)).

Proof.
The proof of this is directly analogous to the quantum query case. We convert P into the restricted form of (4), being careful to lose only a constant factor in the bias and in the cost. Let γ := bias_f(P) > 0. We first use Markov's inequality to argue that the total probability mass P places on algorithms A of cost |A| ≥ 2|P|/γ is at most γ/2, and hence discarding all such algorithms from the support of P decreases its bias by at most γ/2 (while not increasing its cost). Next, we group the remaining algorithms in the support of P into log(1/γ) bins: one bin for algorithms of cost 0 to 2T (with T equal to something like |P|), and one additional bin for algorithms of cost 2^i T to 2^{i+1} T for each i between 1 and log(1/γ). Within each bin, we use the convexity of A_{2^{i+1} T} to replace the entire bin with a single algorithm (whose cost is at most the upper boundary of that bin). For the first bin, this increases the cost |P| by up to an additive O(T), while for the other bins, this increases the cost by up to a factor of 2. Altogether, we have only log(1/γ) algorithms remaining in the support, and setting k = log(1/γ), it is not hard to check that the conditions in (4) are satisfied.
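The truncation-and-binning step above can be sketched concretely. The snippet below is a simplified illustration, not the paper's construction verbatim: bin boundaries are taken to be plain powers of two, and the discarded heavy algorithms become a zero-cost random guess.

```python
import math

def bin_by_cost(support, gamma):
    """Simplified sketch of the truncation-and-binning step in Lemma 5.7.
    support: list of (probability, cost) pairs with probabilities summing to 1.
    Step 1 (Markov): algorithms costing more than 2*E[cost]/gamma are replaced
    by a zero-cost random guess; the removed mass is at most gamma/2.
    Step 2 (binning): remaining costs are rounded up to powers of two, so only
    O(log(1/gamma)) distinct cost levels survive."""
    avg = sum(p * c for p, c in support)
    cutoff = 2 * avg / gamma
    binned = {}
    for p, c in support:
        if c > cutoff:
            key = 0.0  # replaced by a free random guess
        else:
            key = 0.0 if c == 0 else 2.0 ** math.ceil(math.log2(c))
        binned[key] = binned.get(key, 0.0) + p
    return sorted(binned.items())

support = [(0.4, 3), (0.3, 5), (0.29, 17), (0.01, 1_000_000)]
result = bin_by_cost(support, gamma=0.1)
# The cheap algorithms survive at power-of-two costs; the expensive one
# (mass 0.01 <= gamma/2) collapses to the zero-cost guess.
```

Rounding each cost up to a power of two at most doubles the expected cost, while the Markov truncation costs at most γ/2 of bias, matching the constant-factor losses claimed in the proof.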
Suppose f and A satisfy the above conditions. Then there exists a distribution µ over Dom(f) such that for any finite-support probability distribution P over A, we have

bias_f(P, µ) ≤ O( (|P|/M(f)) · polylog M(f) ).

In particular, if M^µ_γ̇(f) denotes the infimum cost |A| over algorithms A ∈ A with bias_f(A, µ) ≥ γ, then for all γ ∈ (0, 1/2] we have M^µ_γ̇(f) ≥ γ · Ω̃(M(f)).

Proof.
The proof will be exactly the same as in the quantum query setting. In the special case where f is constant, the result trivially follows as M(f) = 0, so assume f is not constant.

First, we let R be the set of all finite-support probability distributions over A, and let ∆ be the set of probability distributions over Dom(f). Then we define cost : R × ∆ → [0, ∞) by cost(P, µ) := |P|, and score : R × ∆ → [−1, 1] by score(P, µ) := bias_f(P, µ). Note that cost and score are both continuous and linear in each variable. They are also well-behaved, because M(f) < ∞ ensures finite cost and score can be achieved, cost does not depend on µ, and cost is linear in P. Hence Theorem 2.18 gives

inf_{P∈R} max_{x∈Dom(f)} |P|/bias_f(P, x)^+ = max_{µ∈∆} inf_{P∈R} |P|/bias_f(P, µ)^+.
We examine the left-hand side. It equals inf_{P∈R} |P|/bias_f(P)^+. We note that this infimum is at most M(f) by the definition of M(f). We now claim that there is no need to use any P in the infimum with bias_f(P) < 1/(6M(f)). To show this, it suffices to show that there is no need to use any P in the infimum with |P| < 1/2, because we know that M(f) is attainable using only algorithms in A with cost at least 1.

Now, suppose that |P| < 1/2 and bias_f(P) > 0. We can write P = aZ_0 + bZ_1 + (1 − a − b)P′, where P′ has support only on A ∈ A with |A| ≥ 1. Define P′′ := ((a − c)Z_0 + (b − c)Z_1 + (1 − a − b)P′)/(1 − 2c), where c = min{a, b}; that is, P′′ is P with the even zero-cost mixture removed and renormalized. Then, as we showed in the quantum query case, P′′ has the same cost-to-score ratio as P, so |P′′|/bias_f(P′′)^+ ≤ |P|/bias_f(P)^+. Moreover, since f is not constant, there is some input x ∈ Dom(f) such that bias_f(Z_0, x) = −1, and some input y ∈ Dom(f) such that bias_f(Z_1, y) = −1. Since bias_f(P′′) ≥ bias_f(P) > 0, and since bias_f(P′) ≤ 1, the total weight P′′ places on P′ must exceed 1/2, meaning that |P′′| > 1/2, as desired.

Hence the left-hand side equals inf_{P∈R′} |P|/bias_f(P), where R′ is the set of all P ∈ R with bias_f(P) ≥ 1/(6M(f)). Using Lemma 5.7, we know that for each P ∈ R′, we have M(f) ≤ |P|/bias_f(P) · polylog(1/bias_f(P)) ≤ |P|/bias_f(P) · polylog M(f). Hence the left-hand side is at least M(f)/polylog M(f).

Finally, examining the right-hand side, we see that there is a distribution µ over Dom(f) such that for all P ∈ R, we have |P| ≥ bias_f(P, µ) · M(f)/polylog M(f), and the desired result follows.

To prove an analogous minimax for quantum communication complexity, all we need is to show that quantum communication complexity satisfies the four conditions from Section 5.2. It is easy to see that as long as there is public randomness (whether or not there is also shared entanglement), the first three conditions are satisfied.
It remains to deal with the fourth condition. Let P be a probability distribution over protocols Π_1, Π_2, ..., Π_k, which assigns probability p_i to Π_i and satisfies |Π_i| ≤ 2^i T and ∑_i 2^i p_i ≤ 5, and suppose P achieves bias at least 2^{−k−2} for computing the communication function F on any input (x, y) ∈ Dom(F). Our goal is to construct a communication protocol which uses T · Õ(2^k) communication to compute F to bounded error.

As in the quantum query case, all we need to do is create a protocol Π in which Alice and Bob estimate the biases Π_i(x, y) of the protocols Π_i when run on their inputs. Each estimate for protocol i needs to be within 2^{−(k−i)}/20 of the correct bias, and it must satisfy this property with probability at least 1 − 1/3k (see the query complexity section for a formal analysis). To achieve this, it suffices for Alice and Bob to use amplitude estimation from Theorem 5.1 to generate an estimate of the probability that Π_i(x, y) outputs 1. Hence the only remaining difficulty is running amplitude estimation of a communication protocol in the communication complexity setting.

This turns out to be possible in both the shared-entanglement and the non-shared-entanglement settings (though note that we've already assumed shared randomness, so we cannot handle the non-shared-randomness, non-shared-entanglement quantum communication complexity model). The idea is to have one of the players, say Alice, take charge. We will assume that Alice is the one who outputs the final answer in Π_i. Then from Alice's point of view, Π_i(x, y) can be viewed as a unitary U and a measurement M such that Alice needs Bob's help to apply U, and after applying U to a shared state |0⟩_A |0⟩_B, Alice can apply the measurement M on her side alone to get the output Π_i(x, y). Now, to apply amplitude estimation, Alice only needs the ability to apply controlled U, U†, and I − 2M operations. She can do the latter alone.
For controlled U and U† applications, she needs Bob's help, but that's fine: she will just send him a qubit each time, alerting him to whether they are about to apply U or U† to their shared state (Bob will return that qubit afterwards to ensure coherence of Alice's controlled applications of U and U†).

We conclude the following theorem.

Theorem 5.9.
Let F : X × Y → {0, 1} be a (possibly partial) communication function. Then there exists a probability distribution µ over Dom(F) such that for all γ ∈ (0, 1/2], we have QCC^µ_γ̇(F) ≥ γ · Ω̃(QCC(F)).

Here QCC^µ_γ̇(F) denotes the minimum amount of communication required by a quantum communication protocol which achieves bias at least γ against µ. This theorem works both in the shared-entanglement setting and in the shared-randomness, non-shared-entanglement setting.

As in the quantum case, polynomials can be amplified linearly in the bias. However, also as in the quantum case, the degree of polynomials is not convex: the degree of a convex combination of p_0 and p_1 is the maximum of the degrees of p_0 and p_1, not the average degree.

The same ideas that worked for quantum query and communication complexities will allow us to get a strong hard distribution for approximate polynomial degree and approximate logrank. The main difference will be how we do the estimation of success probabilities: instead of amplitude estimation, we will need a polynomial variant of this, which turns out to be a little tricky.

The approximate degree of a (possibly partial) Boolean function f : {0, 1}^n → {0, 1} is the minimum degree of an n-variate polynomial p which satisfies |p(x) − f(x)| ≤ ǫ for all x ∈ Dom(f), where ǫ is a parameter representing the allowed error. When f is a partial function, there are actually two different notions of polynomial degree: one where p is required to be bounded on the entire Boolean hypercube (that is, p(x) ∈ [0, 1] for all x ∈ {0, 1}^n, even when x ∉ Dom(f)), and one where p is not restricted outside the domain of f. Our results will apply to both versions of polynomial degree, but for conciseness, we restrict our attention to the bounded version.

With polynomials, it is often convenient to switch from talking about functions f : {0, 1}^n → {0, 1} to talking about functions f : {+1, −1}^n → {+1, −1}. Note that by doing a simple variable substitution, we can convert between {0, 1} variables and {+1, −1} variables without changing the degree of the polynomial.
That is, we can substitute 1 − 2x_i in place of the variable x_i inside p to make it take {0, 1} inputs instead of {+1, −1} inputs, and we can substitute (1 − x_i)/2 to go the other way. We can similarly change the output of p from being in the range [0, 1] to the range [−1, 1] and vice versa (the error changes by a factor of 2 when switching between these bases). Another well-known observation is that to approximate a Boolean function f, we only need multilinear polynomials, and their degree only needs to be at most n.

To get our hard distribution, we will use Theorem 5.8. We need to check the four conditions, but using polynomials as our "algorithms". More explicitly, the set A will be the set of all real n-variate multilinear bounded polynomials, viewed in the {+1, −1} basis (bounded means that p(x) ∈ [−1, 1] for all x ∈ {+1, −1}^n). For a polynomial p ∈ A, we define bias_f(p, x) to be f(x)p(x). Then (1) holds, as the set of polynomials of a given degree is convex and bias_f(·, x) is linear over that set. (2) holds because every Boolean function can be computed exactly by a polynomial of degree n. Next, (3) holds because polynomials of degree less than 1 have degree 0, and since we're dealing with bounded polynomials, these are a convex combination of the two constant polynomials −1 and 1.

It remains to show (4). To this end, let P be a probability distribution over k polynomials q_1, q_2, ..., q_k, with deg(q_i) ≤ 2^i T. Let p_i be the probability P assigns to q_i, and suppose ∑_{i=1}^k 2^i p_i ≤ 5. Finally, suppose that bias_f(P) ≥ 2^{−k−2}. Our goal is to find a polynomial q of degree at most 2^k T · poly(k) that computes f to constant error. To do so, we'll need a polynomial version of the amplitude estimation algorithm we used in the quantum case. That is, we'd like to estimate the output that polynomial q_i(x) returns, and do arithmetic computations on it.
Crucially, one ofthe arithmetic computations we’d like to do is comparison, for example, to see if q i ( x ) > . Sucha comparison is not a polynomial operation, so we cannot use the polynomial q i ( x ) itself. Whatwe’ll do instead is to create polynomials that compute the bits of the binary expansion of q i ( x ) , toa certain precision. We will then do arithmetic operations using those bits, and we’ll be able toimplement those operations using polynomials.To do so, we’ll need some approximation theory. The following theorem, known as Jackson’stheorem, will be useful. It traces back to Jackson (1911) [Jac11], but see also [MMR94] (page 750,Theorem 3.1.1) for some discussion and a more thorough list of references. Theorem 6.1 (Jackson’s theorem) . Let α : [ − , → R be a continuous function, and let n ∈ N .Then there is a real polynomial p of degree n such that for all x ∈ [ − , , we have | p ( x ) − α ( x ) | ≤ · sup | y − z |≤ /n | α ( y ) − α ( z ) | . In particular, if α has Lipschitz constant K , then for each n ∈ N there is a polynomial p n of degreeat most n which approximates α to within an additive K/n at each point in [ − , . Jackson’stheorem can be used to prove the well-known result that polynomials can be amplified with a lineardependence in the bias. For completeness, we reprove this here (see also e.g. [GKKT17]). Corollary 6.2 (Polynomial amplification (small bias to constant bias)) . For each γ ∈ (0 , , thereis a real polynomial p of degree at most /γ such that p maps [ − , to [ − , , p maps [ − , − γ ] to [ − , − / , and p maps [ γ, to [1 / , .Proof. Let α : [ − , → R be the function with α ( x ) = − / for x ∈ [ − , − γ ] , α ( x ) = 2 / for x ∈ [ γ, , and α ( x ) = 2 x/ γ for x ∈ ( − γ, γ ) . Then α is continuous and has Lipschitz constant / γ . By Theorem 6.1, for every n ∈ N , there exists a polynomial p n of degree at most n whichapproximates α to additive error /γn . 
Picking n = ⌈12/γ⌉ ≤ 13/γ, we get a polynomial which approximates α to error 1/3, which means it has the desired properties.

We will also need an amplification polynomial that goes from constant bias to small error. We reprove the following well-known lemma here for completeness (it also appears in [BNRW07], and another version appears in [She13]).

Lemma 6.3 (Polynomial amplification (constant error to small error)). For each ǫ ∈ (0, 1/3], there is a real polynomial p of degree at most
17 log(1/ǫ) such that p maps [−1,1] to [−1,1], p maps [−1,−1/3] to [−1,−(1−ǫ)], and p maps [1/3,1] to [1−ǫ,1].

Proof. We set

q(x) = ∑_{i=0}^{k} (2k+1 choose i) ((1+x)/2)^i ((1−x)/2)^{2k+1−i},

and set p(x) = 1 − 2q(x). Note that for x ∈ [−1,1], the value q(x) is exactly the probability that, when flipping a coin 2k+1 times, fewer than half of the coin flips will come out heads, assuming the probability of heads is (1+x)/2. Because of this interpretation, we know that q maps [−1,1] to [0,1] and is decreasing in x, so p maps [−1,1] to [−1,1] and is increasing in x. We also have q(x) = 1 − q(−x), which means that p(−x) = −p(x), i.e. p is odd. Given these properties, the lemma will follow if we show that p(1/3) ≥ 1 − ǫ, or equivalently, that q(1/3) ≤ ǫ/2. We have q(1/
3) = ∑_{i=0}^{k} (2k+1 choose i) (2/3)^i (1/3)^{2k+1−i} = 3^{−(2k+1)} ∑_{i=0}^{k} (2k+1 choose i) 2^i ≤ 3^{−(2k+1)} 2^k ∑_{i=0}^{k} (2k+1 choose i) = 3^{−(2k+1)} 2^k 2^{2k} = (1/3)(8/9)^k.

To get this to be smaller than ǫ/2, it suffices to pick k large enough so that (8/9)^k ≤ ǫ, or equivalently, k ≥ 6 log(1/ǫ). Hence we can pick k = ⌈6 log(1/ǫ)⌉ ≤ 6 log(1/ǫ) + 1. The degree of p will be 2k + 1 ≤ 12 log(1/ǫ) + 3. Note that ǫ ≤ 1/3, so log(1/ǫ) ≥ log 3, and hence 12 log(1/ǫ) + 3 ≤ (12 + 3/log 3) log(1/ǫ) ≤
17 log(1/ǫ).

Equipped with these approximation-theoretic tools, we will now tackle (4), showing that probability distributions over polynomials (which achieve a small amount of worst-case bias γ for computing f) can be amplified to polynomials which compute f to constant error, using only a nearly-linear dependence on 1/γ.

Lemma 6.4.
As in (4), let P be a probability distribution over k bounded multilinear polynomials q_1, q_2, ..., q_k, which assigns them probabilities p_1, p_2, ..., p_k respectively. Suppose that ∑_{i=1}^k 2^i p_i ≤ 2, that deg(q_i) ≤ 2^i T for some real number T, and that f(x) ∑_{i=1}^k p_i q_i(x) ≥ 2^{−k−1} for all x ∈ Dom(f). Then there is a bounded multilinear polynomial q which approximates f with bias at least 1/3 and which satisfies deg(q) ≤ 2^k T · poly(k).

Proof. Recall that in the quantum case, we estimated the bias of the i-th algorithm to within roughly 2^{−(k−i)}, with success probability at least 1 − 1/poly(k). We will do a polynomial version of this. What does estimating q_i(x) mean, for polynomials? It means we will construct polynomials which approximately compute the bits in the binary expansion of the number q_i(x). We will have one polynomial for the sign, and an additional k − i + 4 polynomials for the first k − i + 4 digits in the binary expansion of q_i(x).

In order to do so, we compose univariate polynomials with q_i. This way, the task reduces to creating univariate polynomials which output the bits in the binary expansion of their input (assuming they all receive the same input). More explicitly, the correctness condition is as follows. We say the binary expansion of a real number β ∈ [−1,1] is 2^{−ℓ}-robust to t bits if the first t bits of the binary expansion of β + ǫ are the same as those of β for all ǫ ∈ [−2^{−ℓ}, 2^{−ℓ}]. Then we require univariate polynomials d^ℓ_0, d^ℓ_1, ..., d^ℓ_k such that if β ∈ [−1,1] is 2^{−ℓ}-robust to at least t bits, then d^ℓ_t(β) is within O(1/k) of the t-th bit in the binary expansion of β. The polynomial d^ℓ_0 needs to output the sign of β if β is 2^{−ℓ}-robust to at least 0 bits (that is, if the sign of β does not change upon adding or subtracting 2^{−ℓ}). We will also require all these polynomials to be bounded, i.e. they must map [−1,1] to [−1,1].

To implement these polynomials, we use Theorem 6.1.
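The robustness condition above is easy to make concrete. The following sketch (our own illustration, restricted to nonnegative β for simplicity) checks 2^{−ℓ}-robustness to t bits by comparing the first t bits at the two endpoints of the perturbation interval, which suffices because the floor function is monotone:

```python
def first_bits(beta, t):
    # First t bits of the binary expansion of beta, for beta in [0, 1).
    return int(beta * 2**t)  # floor, since beta * 2**t >= 0 here

def is_robust(beta, ell, t):
    # beta is 2^-ell-robust to t bits if its first t bits are unchanged
    # under any perturbation eps in [-2^-ell, 2^-ell].  Since first_bits
    # is monotone in beta, checking the two endpoints suffices.
    eps = 2.0**-ell
    return first_bits(beta - eps, t) == first_bits(beta + eps, t)

assert is_robust(0.5 + 2**-4, ell=6, t=1)   # well inside the [1/2, 1) plateau
assert not is_robust(0.5, ell=6, t=1)       # sits exactly on a bit boundary
```

Handling the sign bit for negative β works the same way, with the sign in place of the floor.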
For simplicity, let's represent the bits in the binary expansion using +1 and −1 instead of 0 and 1 (converting back is easy). Consider the function α_i which outputs the i-th bit of the binary expansion of its input (or the sign if i = 0). This α_i is a step function: for i = 0, α_0(β) jumps from −1 to 1 at β = 0; for i = 1, α_1(β) similarly jumps from −1 to 1 and back at β = −1/2, 0, 1/2. More generally, α_i has 2^{i+1} different plateaus of 1 or −1 on its domain [−1,1]. Now, since we only care about getting the i-th bit correct if the i-th bit is robust to β changing by 2^{−ℓ}, consider the continuous functions α^ℓ_i which make the jumps from −1 to 1 continuous by starting 2^{−ℓ} before the jump point, ending 2^{−ℓ} after the jump point, and drawing a continuous line in between (the slope of the line will be ±2^ℓ). This is well-defined as long as ℓ is sufficiently larger than i, say ℓ ≥ i + 2.

Note that α^ℓ_i has Lipschitz constant 2^ℓ. This means we can use Theorem 6.1 to estimate α^ℓ_i by a polynomial of degree O(2^ℓ) which achieves constant additive error (say, 1/8). We can scale down these polynomials slightly to ensure they remain bounded in [−1,1]. We then plug them into a univariate bounded polynomial of degree O(log k) that we get from Lemma 6.3, in order to amplify the error down to O(1/k). The result is polynomials d^ℓ_t (for ℓ ≥ t + 2) that have degree O(2^ℓ log k) and, on input β which is 2^{−ℓ}-robust to at least t bits, correctly output the t-th bit of β except with additive error O(1/k).

Now, to get an estimate of q_i(x) to k − i + 5 bits, we set ℓ = k − i + O(log k) and compose d^ℓ_t(q_i(x)) for t = 0, 1, 2, ..., k − i + 5. Actually, we scale down q_i(x) and add an extra variable y_i representing a noise term for q_i(x); the final estimating polynomials will be the (n+1)-variate polynomials r_{i,t}(x, y_i) := d^ℓ_t((9/10) q_i(x) + y_i).
Note that the degree of r_{i,t} is O(2^{k−i+O(log k)} log k · deg(q_i)) = O(2^k T poly(k)).

Next, consider the function which takes binary representations (to k + 5 bits each) of numbers λ_1, ..., λ_k ∈ [−1,1], and outputs the sign of ∑_{i=1}^k p_i λ_i, where the p_i are known non-negative constants which sum to 1. This is a Boolean function of O(k^2) variables, so it can be computed exactly by a multilinear polynomial of degree O(k^2). Call this polynomial s. Next, plug the polynomials r_{i,t} into the inputs of s, so that s calculates the sign of the sum ∑_{i=1}^k p_i β̃_i, where each β̃_i is the estimate of (9/10) q_i(x) + y_i that is computed by the polynomials d^{k−i+O(log k)}_t. Call this composed polynomial u(x, y).

Observe that u(x, y) is a polynomial in n + k variables (n variables from x and k variables y_i), and has degree O(2^k T poly(k)). This polynomial attempts to compute the sign of (9/10) ∑_{i=1}^k p_i q_i(x) + ∑_{i=1}^k p_i y_i. Since we know that ∑_{i=1}^k p_i q_i(x) · f(x) ≥ 2^{−k−1}, the sign computed by u(x, y) will equal f(x) so long as |∑_{i=1}^k p_i y_i| ≤ 2^{−k−2}. Recall that ∑_{i=1}^k 2^i p_i ≤ 2. Hence to guarantee that |∑_{i=1}^k p_i y_i| ≤ 2^{−k−2}, it suffices to choose each y_i such that |y_i| ≤ 2^{−(k−i+5)}.

Now, let's call q_i(x) + y_i good if it is 2^{−(k−i+O(log k))}-robust to k − i + 5 bits. If q_i(x) + y_i is good for all i, then the r_{i,t} correctly compute the bits to additive error O(1/k), and then a multilinear polynomial of degree O(k^2) in O(k^2) variables will still correctly compute its output to small error, certainly O(1/k). Hence if q_i(x) + y_i is good for all i and |y_i| ≤ 2^{−(k−i+5)} for all i, u(x, y) outputs f(x) to error O(1/k).

To ensure that the q_i(x) + y_i are good, we pick y_i at random.
That is, we have an allowed range [−2^{−(k−i+5)}, 2^{−(k−i+5)}] for y_i; we fit poly(k) evenly spaced points into this range, so that the gap between the points is 2^{−(k−i+O(log k))}. Note that for all but a constant number of choices of y_i among these poly(k) options, the resulting number q_i(x) + y_i will be 2^{−(k−i+O(log k))}-robust to k − i + 5 bits. Hence by randomly selecting y_i, the probability that q_i(x) + y_i is not good is at most O(1/poly(k)). By the union bound, this choice means that all q_i(x) + y_i are good except with constant probability. Hence u(x, y) computes f(x) to O(1/k) error with high probability when y is chosen at random according to the above procedure.

Finally, we let q(x) be the average of the polynomials u(x, y) over all possible choices of y in the above procedure. Since u(x, y) outputs a number very close to f(x) when y is good, and since it is always bounded in [−1,1], and since y is good with high probability, we conclude that q(x) computes f(x) to bounded error. It is also bounded outside the promise of f. The degree of q(x) is O(2^k T poly(k)). We note that q(x) as we constructed it here can actually be viewed as a polynomial ρ in k variables composed with the polynomials q_1, q_2, ..., q_k.

The above amplification lemma allows us to conclude the following theorem.

Theorem 6.5. Let f : {+1,−1}^n → {+1,−1} be a (possibly partial) Boolean function. Then there is a vector ψ ∈ [−1,1]^{Dom(f)} such that ‖ψ‖_1 = 1, ⟨ψ, f⟩ = 1, and for any polynomial p which is bounded (i.e. |p(x)| ≤ 1 for x ∈ {+1,−1}^n), we have

⟨ψ, p⟩ ≤ deg(p) / ˜Ω(adeg(f)).

Here adeg(f) denotes the minimum degree of a bounded polynomial p which computes f to bounded error. The constants in the ˜Ω notation are universal.

Proof. This follows immediately by taking ψ to be defined by ψ(x) = f(x)µ[x], where µ is the hard distribution we get from Theorem 5.8.
Instead of tackling approximate logrank directly, we use the approximate γ_2 norm. This measure deserves some introduction. First, we note that the γ_2 norm is a well-known norm of a matrix. One way to define it is to say that γ_2(A) is the minimum, over factorizations A = BC of A into a product of matrices B and C, of the maximum 2-norm of a row of B times the maximum 2-norm of a column of C. The γ_2 norm has several useful properties known in the literature [She12; LSŠ08]:

1. γ_2 is a norm, so γ_2(A) ≥ 0 (with equality if and only if A is the all-zeros matrix) and γ_2(A + λB) ≤ γ_2(A) + |λ|γ_2(B).
2. γ_2(A ⊗ B) = γ_2(A)γ_2(B), where ⊗ denotes the tensor (Kronecker) product.
3. γ_2(A ◦ B) ≤ γ_2(A)γ_2(B), where ◦ denotes the Hadamard (entry-wise) product.
4. γ_2(J) = 1, where J is the all-ones matrix.
5. ‖A‖_∞ ≤ γ_2(A) ≤ ‖A‖_∞ √rank(A).

In the above, A and B are matrices of the same dimensions, and λ is a scalar. γ_2(A) can be thought of as a smoother version of rank.

Let F : X × Y → {+1,−1} be a (possibly partial) communication function. We identify F with its communication matrix, which is a matrix with rows indexed by X and columns indexed by Y, with the (x,y) entry being F(x,y) ∈ {+1,−1} if (x,y) ∈ Dom(F) and being ∗ if (x,y) ∉ Dom(F). This way, F is a {+1,−1,∗}-valued matrix.

For such a matrix F, we say that a real-valued matrix A approximates F (to bias 1/3) if |A[x,y]| ≤ 1 for all (x,y) ∈ X × Y and F(x,y)A[x,y] ≥ 1/3 for all (x,y) ∈ Dom(F). The approximate γ_2 norm of F, denoted ˜γ_2(F), is defined as the minimum value of γ_2(A) over all matrices which approximate F to bias 1/3.
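The factorization definition makes upper bounds on γ_2 easy to certify: any explicit factorization A = BC yields one. The following sketch (the helper `gamma2_ub` is ours; it certifies only an upper bound, not the exact norm) checks the certificate for the all-ones matrix J and for a small Hadamard-type matrix via its SVD:

```python
import numpy as np

def gamma2_ub(B, C):
    # Any factorization A = B @ C certifies
    #   gamma_2(A) <= (max row 2-norm of B) * (max column 2-norm of C).
    return np.linalg.norm(B, axis=1).max() * np.linalg.norm(C, axis=0).max()

# Property 4: J = ones(n,1) @ ones(1,n) certifies gamma_2(J) <= 1.
assert np.isclose(gamma2_ub(np.ones((4, 1)), np.ones((1, 4))), 1.0)

# Property 5 (upper half): the SVD A = U diag(s) V^T gives the factorization
# (U sqrt(s)) (sqrt(s) V^T), certifying gamma_2(A) <= ||A||_inf * sqrt(rank(A)).
A = np.array([[1., -1.], [-1., -1.]])
U, s, Vt = np.linalg.svd(A)
B, C = U * np.sqrt(s), np.sqrt(s)[:, None] * Vt
assert np.allclose(B @ C, A)
assert gamma2_ub(B, C) <= np.sqrt(2) * np.abs(A).max() + 1e-9
```

Computing γ_2 exactly requires semidefinite programming; the certificate above is all the proofs below need.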
It is not hard to see that this minimum is attained, as the set of such matrices is compact.

We will actually care about the logarithm of the approximate γ_2 norm, that is, about log ˜γ_2(F). We note that the constant 1/3 in the definition of this measure is arbitrary, as approximations to F can be amplified with only a constant factor overhead in the log-approximate-γ_2-norm (see, e.g., [BBGK18]). An annoying detail, however, is that such amplification can in general lose not just a multiplicative constant but also an additive constant, since ˜γ_2(F) may in general be less than 1 (meaning its logarithm will be less than 0). To avoid such complications, we will define our measure of interest as M(F) := max{1, log ˜γ_2(F)} if F is not constant and M(F) = 0 if F is constant, and we will write M_{˙γ}(F) for the bias-γ version of M(F) instead of the default bias-1/3 version.

In order to get a minimax theorem analogous to Theorem 6.5, we will again use Theorem 5.8. Our set of algorithms A will be the set of bounded real matrices A (that is, real matrices A of the same dimensions as F which satisfy |A[x,y]| ≤ 1 for all (x,y) ∈ X × Y). The cost of a matrix A will be cost(A) := max{1, log γ_2(A)} if A is not a multiple of the all-ones matrix J, and otherwise cost(A) = 0 if A = λJ. We define bias_F(A, (x,y)) = F(x,y)A[x,y] for (x,y) ∈ Dom(F).

We show that A_T is convex for each T ∈ [0, ∞). For T < 1, the set A_T is the set of all matrices of the form λJ for λ ∈ [−1,1], which is clearly convex. For T ≥ 1, suppose A, B ∈ A_T and let λ ∈ (0,1). Then cost(λA + (1−λ)B) is either 0, 1, or log γ_2(λA + (1−λ)B). In the former two cases, we clearly have λA + (1−λ)B ∈ A_T, so consider the latter case. We have log γ_2(λA + (1−λ)B) ≤ log(λγ_2(A) + (1−λ)γ_2(B)) ≤ log max{γ_2(A), γ_2(B)} = max{log γ_2(A), log γ_2(B)} ≤ max{cost(A), cost(B)} ≤ T. Hence A_T is convex.
It is also clear that bias_F(·, (x,y)) is linear, so (1) is satisfied. By taking A to equal F inside Dom(F) and to be 0 elsewhere, we get bias_F(A) = 1, so (2) is satisfied. By our definition of cost(A), we have cost(A) ≥ 1 or cost(A) = 0, with the latter happening only if A is a convex combination of J and −J, so (3) is satisfied. As usual, it remains to handle (4). We do so in the following lemma.

Lemma 6.6.
Let P be a probability distribution over matrices A_1, A_2, ..., A_k with probability p_i for A_i. Suppose that ∑_{i=1}^k 2^i p_i ≤ 2, and that for all i, we have cost(A_i) ≤ 2^i T for some real number T ≥ 1/2. Suppose further that bias_F(P) ≥ 2^{−k−1}. Then there is some bounded matrix A which approximates F to bias 1/3 and satisfies cost(A) ≤ 2^k T · poly(k) (with the constants in the poly being universal).

Proof. Let ρ be the polynomial from the proof of Theorem 6.5 with respect to the probabilities p_1, p_2, ..., p_k. This is a polynomial in k variables with the property that if values β_1, β_2, ..., β_k are plugged in and |∑_i p_i β_i| ≥ 2^{−k−1}, then ρ(β_1, β_2, ..., β_k) returns the sign of ∑_i p_i β_i to bounded error. The polynomial ρ further has the property that it is bounded (i.e. it returns values in [−1,1] when given inputs in [−1,1]^k), and that if you plug in any polynomials q_i in place of β_i, with deg(q_i) ≤ 2^i, then the degree of the composed polynomial is at most 2^k poly(k).

This latter property means that the weighted degree of ρ with weights (2^1, 2^2, ..., 2^k) is at most O(2^k poly(k)). Here the term weighted degree means that we count the degree of each monomial of ρ differently depending on the variables in that monomial: the i-th variable gets weight 2^i, so a monomial of the form β_1^{c_1} β_2^{c_2} ... β_k^{c_k} will have weighted degree 2^1 c_1 + 2^2 c_2 + ... + 2^k c_k. We know that the weighted degree of ρ, meaning the maximum weighted degree of one of its monomials, is at most O(2^k poly(k)).

We will now use this polynomial ρ to construct a matrix A which approximates F and has γ_2 norm that is not too large. The idea is to simply apply ρ to the matrices A_1, A_2, ..., A_k, using the Hadamard product for multiplication and the usual matrix addition and scalar multiplication. Since γ_2 is a norm, we know that γ_2(ρ(A_1, A_2, ...
, A_k)) is at most the sum, over all monomials of ρ, of the absolute value of the coefficient of that monomial multiplied by the γ_2-norm of the Hadamard product defined by that monomial. This is upper bounded by the sum of absolute coefficients of ρ (which we'll denote C) multiplied by the γ_2 norm of the largest monomial.

The γ_2 norm of a single monomial β_1^{c_1} ... β_k^{c_k} composed with matrices A_1, ..., A_k is at most γ_2(A_1)^{c_1} ... γ_2(A_k)^{c_k}, since the γ_2 norm is sub-multiplicative under the Hadamard product. Hence log γ_2(ρ(A_1, ..., A_k)) is at most log C plus the maximum value of c_1 log γ_2(A_1) + ... + c_k log γ_2(A_k) over monomials (c_1, c_2, ..., c_k) of ρ. Since log γ_2(A) ≤ cost(A) for all bounded matrices A, and since cost(A_i) ≤ 2^i T, this maximum is at most the maximum of T · (2^1 c_1 + ... + 2^k c_k) over monomials of ρ, which is at most O(2^k T poly(k)).

We now upper bound C, the sum of absolute coefficients of ρ. Recall that ρ was constructed as an average of different polynomials with different values of the constants y_i. Let ρ′ be the polynomial within that set we averaged over which has the largest sum of absolute coefficients. Then to upper bound C it suffices to upper bound the sum of absolute coefficients of ρ′. To do so, we essentially want to replace all coefficients of ρ′ with their absolute values, and then plug in all ones for the variables. We note that (9/
10) + y_i will be at most 1 for the values of y_i used in ρ′, which means that if we replace the terms (9/10) q_i + y_i with simply q_i, we would only increase the sum of absolute coefficients (here we treat the q_i as variables).

Let the resulting polynomial be ρ′′. Then ρ′′ is simply the result of composing the polynomial s with the polynomials r_{i,t}. Since s is a bounded multilinear polynomial of degree O(k^2), its sum of absolute coefficients is at most 2^{O(k^2)}, and it is not hard to see that the sum of absolute coefficients of ρ′′ will be at most 2^{O(k^2)} times D^{O(k^2)}, where D is the maximum sum of absolute coefficients over the polynomials d^ℓ_t with ℓ = k − i + O(log k). In other words, log C ≤ O(k^2) + O(k^2 log D), where D is the sum of absolute coefficients of some such polynomial d^ℓ_t.

The polynomial d^ℓ_t is a univariate bounded polynomial of degree at most O(2^ℓ log k), which, using ℓ ≤ k + O(log k), is at most 2^k poly(k). A bounded univariate polynomial of this degree must have sum of absolute coefficients at most 2^{2^k poly(k)} by [She13] (Lemma 4.1). Hence log D ≤ 2^k poly(k), so log C ≤ 2^k poly(k).

We conclude that if A = ρ(A_1, A_2, ..., A_k), then log γ_2(A) ≤ 2^k (T + 1) poly(k), and hence cost(A) ≤ 2^k (T + 1) poly(k). This is at most O(2^k T poly(k)) since we have T ≥ 1/2. Further, each entry A[x,y] is equal to ρ(A_1[x,y], A_2[x,y], ..., A_k[x,y]), which means that A is bounded (since ρ is bounded and the matrices A_i are bounded), and for (x,y) ∈ Dom(F), we have F(x,y)A[x,y] ≥ 1/3 by the guarantees on the A_i and on ρ.

Using Theorem 5.8, we can now conclude the following theorem.

Theorem 6.7.
Let F : X × Y → {+1,−1} be a (possibly partial) communication function. Then there is a distribution µ over Dom(F) such that for any bounded real matrix A (meaning |A[x,y]| ≤ 1 for all (x,y) ∈ X × Y), we have

E_{(x,y)∼µ}[F(x,y)A[x,y]] ≤ log γ_2(A) / ˜Ω(log ˜γ_2(F)).

Note that for bounded matrices, log γ_2(A) ≤ log rank(A). We also have, from [LS09], log ˜rank(F) ≤ O(log ˜γ_2(F)) + O(log log |X × Y|). This means we can write a minimax theorem for logrank as well.
Theorem 6.8.
Let F : X × Y → {+1,−1} be a (possibly partial) communication function, and suppose that log ˜rank(F) ≥ C log log |X × Y|, where C is a universal constant. Then there is a distribution µ over Dom(F) such that for any bounded real matrix A (meaning |A[x,y]| ≤ 1 for all (x,y) ∈ X × Y), we have

E_{(x,y)∼µ}[F(x,y)A[x,y]] ≤ log rank(A) / ˜Ω(log ˜rank(F)).

In other words, µ is such that if A has low rank compared to F, then A cannot correlate well with F under µ, and hence A does not approximate F very well against µ.

7 Circuit complexity
A Boolean circuit C is a collection of gates connected to each other and to bits of its input x by wires, with a single output wire representing the value of C(x). The size of a circuit is the number of gates in the circuit, and the depth of a circuit is the length of the longest path between an input bit and an output wire. A randomized Boolean circuit is a probability distribution over Boolean circuits, and the size of a randomized Boolean circuit is defined to be the expected size of a Boolean circuit drawn from that distribution.

In Section 7.1, we examine the randomized circuit complexity of partial Boolean functions when they are computed by circuits of unbounded fan-in and unlimited depth. In Section 7.2, we show that the main result also holds in the NC setting of logarithmic-depth circuits whose gates each have fan-in at most 2. Finally, in Section 7.3 we establish the strengthening of the hardcore lemma.

In this section, let R(f) denote the minimum size of a randomized Boolean circuit of unbounded fan-in and unlimited depth that computes the partial Boolean function f with error at most 1/3 on every input x ∈ Dom(f). Similarly, let R^µ_{˙γ}(f) denote the minimum size of randomized Boolean circuits that compute f with error at most ˙γ = 1/2 − γ when the input is drawn from µ. We establish a relation between those two complexity measures via the study of forecasting circuits.

Definition 7.1. A forecasting circuit is a randomized Boolean circuit with one modification: instead of having a single output wire, the forecasting circuit has k + 1 output wires that represent the binary encoding of a value in the range {0, 1/2^k, 2/2^k, ..., (2^k − 1)/2^k, 1}.

The resolution of a forecasting circuit is k when it has k + 1 output wires. (Or, equivalently, when it outputs values that are multiples of 2^{−k}.) The score of a forecasting circuit is computed in the same way as we did for forecasting algorithms in previous sections.
The size of a randomized forecasting circuit is, as in the case of randomized Boolean circuits, the expected number of gates in a circuit drawn from the distribution. Forecasting circuits can be defined for each model of randomized Boolean circuits; in this section, we consider forecasting circuits with unbounded fan-in and unlimited depth.

We begin by showing that if there is a Boolean circuit that computes a function with non-negligible advantage over random guessing, then there is also a forecasting circuit with non-trivial score.

Proposition 7.2.
For any partial function f : {0,1}^n → {0,1}, if there are a size s ≥ 1 and a parameter γ ≥ 2/R(f) for which there is a randomized Boolean circuit R of average size s that satisfies Pr_{C∼R}[C(x) ≠ f(x)] ≤ ˙γ for every x ∈ Dom(f), then there is also a randomized forecasting circuit R′ with resolution ⌈log R(f)⌉, average size at most s + 1, and hs-score

score(R′, x) = E_{C′∼R′}[score(C′(x), f(x))] ≥ γ^2/8

for each x ∈ Dom(f).

Proof. For each circuit C in the support of R, define C′ to be the forecasting circuit of resolution k = ⌈log R(f)⌉ and size size(C) + 1 which outputs the value (1 + (2C(x) − 1)γ′)/2
on input x ∈ Dom(f), where γ′ = m/2^k for the largest integer m such that γ′ ≤ γ. The definition of γ′ guarantees that γ − 2^{−k} ≤ γ′ ≤ γ. The value of k and the lower bound on γ in the proposition statement imply that γ − 2^{−k} ≥ γ − 1/R(f) ≥ γ/2, so γ/2 ≤ γ′ ≤ γ.

This circuit C′ can be constructed by adding a single extra ¬ gate to the output wire of C: the output of C and its negation can then be combined with constant-value wires to generate the two required output values of the forecasting circuit. (Namely, if the two output values (1 ± γ′)/2 of C′ are denoted by z(0) and z(1), then the i-th output bit of C′ is a hardcoded constant value 0 or 1 when the i-th bits of z(0) and z(1) agree, and otherwise it is either C(x) or ¬C(x).)

The randomized forecasting circuit R′ is then defined to be the distribution on circuits obtained by drawing C ∼ R and outputting the modified circuit C′ as described above. Following the same argument as in Lemma 3.15, the score of this randomized forecasting circuit satisfies score(R′, x) ≥ γ′^2/2 ≥ γ^2/8.

In the second step in the proof of Theorem 7.8, we show that the minimax theorem applies in this setting.
Lemma 7.3.
Fix any k ≥ 1 and let R_k denote the set of all randomized forecasting circuits with resolution k. Then for any partial function f : {0,1}^n → {0,1}, if we let ∆ denote the set of distributions over Dom(f), we have

inf_{R∈R_k} max_{x∈Dom(f)} size(R) / score(R, x)^+ = max_{µ∈∆} inf_{R∈R_k} size(R) / score(R, µ)^+.

Proof.
The lemma follows from Theorem 2.18, and the argument showing that the conditions of that theorem are satisfied closely follows the analogous argument of Theorem 4.2.

First, we want to show that R_k can be viewed as a convex subset of a real topological vector space V. We can do so with the same construction as in Theorem 4.2, though here we can also use a slightly simpler construction: fix V = R^{|Dom(f)|+1}, and for each randomized forecasting circuit R ∈ R_k define v_R(x) = score(R, x) for each x ∈ Dom(f) and define the (|Dom(f)| + 1)-th coordinate of v_R to be size(R). That the resulting set is convex follows directly from the fact that a vector v′ = λv_{R_1} + (1 − λ)v_{R_2} for any R_1, R_2 ∈ R_k corresponds to the vector of the randomized forecasting circuit R′ = λR_1 + (1 − λ)R_2.

The linearity of the cost and score measures in both R and µ follows from their definitions.

Lastly, the notions of cost and score satisfy the well-behaved condition of Theorem 2.18. First, because there is a circuit of finite size that computes f exactly, which implies the existence of a finite-cost and non-zero-score randomized forecasting circuit for any distribution µ on Dom(f). Second, because the cost of circuits does not depend on the input. And third, because the definition of cost immediately implies that the mixture of a zero-cost and a nonzero-cost randomized circuit gives a nonzero-cost randomized circuit.

The next step is the main one in the proof of the theorem: we want to show that the score of forecasting circuits can be amplified efficiently.

Lemma 7.4.
For every partial Boolean function f : {0,1}^n → {0,1}, when we set k = ⌈log R(f)⌉ then

inf_{R∈R_k} max_{x∈Dom(f)} size(R) / score(R, x)^+ = ˜Ω(R(f)),

where the ˜Ω hides terms that are polylogarithmic in R(f).

The proof uses the following facts about the circuit complexity of arithmetic operations.

Proposition 7.5 ([BCH86; Alt88]). For any two numbers a, b represented to accuracy 2^{−k} in binary, the values ab, a − b, ln(a), e^a, and
1/(1 + a) can all be computed to additive accuracy 2^{−k} by circuits of size polynomial in k and depth O(log k).

We also need another result regarding the circuit complexity of iterated multiplication up to fixed accuracy.
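The idea behind the next proposition, computing an iterated product ratio through a sum of logarithms followed by a single exponential, can be sketched numerically. The floating-point sketch below is ours and only illustrates the arithmetic identity, not the circuit construction:

```python
import math

def product_ratio(a, b):
    # Compute (a_1 * ... * a_m) / (b_1 * ... * b_m) as
    # exp(sum ln a_i - sum ln b_i): the dominant cost becomes an iterated
    # *addition* of the log terms, plus unary ln/exp evaluations, each of
    # which is handled by small-depth circuits as in Proposition 7.5.
    return math.exp(sum(math.log(x) for x in a) - sum(math.log(y) for y in b))

a, b = [3, 5, 7, 11], [2, 4, 6, 8]
direct = (3 * 5 * 7 * 11) / (2 * 4 * 6 * 8)
assert abs(product_ratio(a, b) / direct - 1) < 1e-12  # small multiplicative error
```

An additive error of 2^{−k} in the logarithm translates into a multiplicative error of e^{±2^{−k}} ≈ 1 ± 2^{−k} in the ratio, which is the accuracy guarantee stated below.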
Proposition 7.6.
When a_1, ..., a_m and b_1, ..., b_m are k-bit integers, there is a circuit of size O(m log m + mk + k^c) for some constant c ≥ 1 and depth O(log k + log m) that computes the ratio (a_1 · · · a_m)/(b_1 · · · b_m) up to multiplicative accuracy 1 ± 2^{−k}.

Proof. This result can be obtained by computing ln((a_1 · · · a_m)/(b_1 · · · b_m)) = ∑_{i=1}^m (ln a_i − ln b_i) to additive accuracy 2^{−k}. The computation of each of the values ln a_i and ln b_i for 1 ≤ i ≤ m up to additive accuracy 2^{−k}/2m can be done with a circuit of size polynomial in n := k + log m + 1 and depth O(log n). The sum of the 2m terms can be done with a circuit for iterated addition of size O(mn) = O(m log m + mk) and depth O(log m + log n) = O(log m + log k), to compute the natural log of the ratio up to additive error 2^{−k} [Ofm62] (see also [Pip87; Weg87] and the references therein). Finally, a circuit of size polynomial in k and depth logarithmic in k can be used to compute the exponential of the final ratio.

Using these propositions, we can complete the proof of the lemma.

Proof of Lemma 7.4.
Let R be a randomized forecasting circuit which comes arbitrarily close to the infimum on the left-hand side. Consider the randomized forecasting circuit R′ obtained by drawing m forecasting circuits C_1, ..., C_m independently at random from R and combining their output values using the formula

C′(x) = 1 / (1 + ∏_{i≤m} (1 − C_i(x)) / C_i(x)).

Fixing m = max_x 1/score(R, x)^+, we obtain a randomized circuit R′ with score(R′, x)^+ = Ω(1) for each x ∈ Dom(f) and average size size(R′) = size(R) · m + O(m log m + mk + k^c) for some universal constant c ≥ 1. Then the proof is completed by noting that m = O(R(f)/size(R)).

Finally, we show that a forecasting circuit with non-trivial score can be converted back into a Boolean circuit with correspondingly small error.

Proposition 7.7.
For any partial function f : {0,1}^n → {0,1}, if there are a size s ≥ 1 and a parameter γ for which there is a randomized forecasting circuit R with k + 1 output wires, size s, depth d, and score(R, x) ≥ γ^2/2 for each x ∈ Dom(f), then there is also a randomized Boolean circuit R′ of size s + O(k) and depth d + O(1) that satisfies Pr_{C∼R′}[C(x) = f(x)] ≥ 1/2 + γ/2 for every x ∈ Dom(f).

Proof. Given a forecasting circuit C that outputs the value p on input x, we want to design a randomized Boolean circuit R_C that outputs the value 1 with probability p and 0 with probability 1 − p on input x.

We can do this by adding k random inputs r_1, ..., r_k that are used to generate a uniformly random value r ∈ {1/2^k, 2/2^k, ..., 1}. Then if the value p in the output of the circuit is 0, we output zero; otherwise we use a comparator circuit to return 1 if and only if r ≤ p. This value has the desired bias p, and using standard constructions (see, e.g., [Weg87; Vol99]) we can implement the comparator circuit with O(k) gates in a circuit of constant depth (in the unbounded fan-in model; or O(log k) depth in the bounded fan-in model).

The final randomized Boolean circuit R′ is defined by drawing a forecasting circuit C from R and outputting R_C. The bound on the error of R′ is then obtained as in the argument of Lemma 3.15.

Putting the above lemmas and propositions together completes the proof of the following theorem.

Theorem 7.8.
Fix n ∈ N. For every partial function f : {0,1}^n → {0,1}, there is a distribution µ on Dom(f) such that for all γ ∈ (0, 1/2),

R^µ_{˙γ}(f) = ˜Ω(γ^2 R(f)).

Define
RNC1(f) to be the minimum size of a randomized Boolean circuit of fan-in two and logarithmic depth that computes the partial Boolean function f with error at most 1/3 on every input x ∈ Dom(f). Similarly, let RNC1^µ_{˙γ}(f) denote the minimum size of a randomized Boolean circuit with the same fan-in and depth restrictions that computes f with error at most ˙γ = 1/2 − γ when the input is drawn from µ.

The constructions of Proposition 7.2, Lemma 7.4, and Proposition 7.7 can all be achieved with circuits of fan-in 2 that add only logarithmic depth overhead to the base circuits, so the analogue of Theorem 7.8 also holds for the class of circuits of fan-in two and logarithmic depth.

Theorem 7.9.
Fix n ∈ N. For every partial function f : {0,1}^n → {0,1}, there is a distribution µ on Dom(f) such that for all γ ∈ (0, 1/2),

RNC1^µ_{˙γ}(f) = ˜Ω(γ^2 RNC1(f)).

In fact, we can say even more about the efficiency of the transformations in each construction: all three of them can be accomplished with constant-depth and polynomial-size overhead when the circuits have threshold gates. For Proposition 7.2, this is because only a single additional gate is required. For Lemma 7.4, this is because the functions in Proposition 7.5 can all be computed to the required accuracy with threshold circuits of polynomial size and constant depth [RT92], and the iterated addition problem can be solved by a threshold circuit of constant depth and size O(m log m (k + log m)) [CSV84]. And for Proposition 7.7, this is because comparison can also be completed with polynomial-size and constant-depth circuits. Therefore, letting RTC0(f) denote the minimum size of a randomized constant-depth threshold circuit with unbounded fan-in that computes f with error probability at most 1/3 on every input, and RTC0^µ_{˙γ}(f) denote the minimum size of the same type of circuit that computes f(x) correctly with probability at least 1/2 + γ when x is drawn from µ, we obtain the following result.

Theorem 7.10. Fix n ∈ N. For every partial function f : {0,1}^n → {0,1}, there is a distribution µ on Dom(f) such that for all γ ∈ (0, 1/2),

RTC0^µ_{˙γ}(f) = ˜Ω(γ^2 RTC0(f)).

In order to complete the proof of the hardcore lemma as stated in Theorem 1.9, we need the following variant of the ratio minimax theorem, which applies to the setting where we consider a compact convex set of distributions, not just the set of all distributions over the function's domain.
Theorem 7.11.
Let V be a real topological vector space, and let R ⊆ V be convex. Let S be a nonempty finite set, and let ∆ be a compact and convex set of probability distributions over S, viewed as a subset of R^|S|. Let cost : R × ∆ → [0, ∞] be semicontinuous and saddle, and let score : R × ∆ → [−∞, ∞) be such that its negation, −score, is semicontinuous and saddle. Suppose cost and score are well-behaved. Then using the convention r/0 = ∞ for all r ∈ [0, ∞], we have

inf_{R ∈ R} max_{µ ∈ ∆} cost(R, µ)/score(R, µ)⁺ = max_{µ ∈ ∆} inf_{R ∈ R} cost(R, µ)/score(R, µ)⁺.

Proof.
The proof is identical to the one for (the first part of) Theorem 2.18, since that argument only uses the fact that the set of all distributions over S is convex and compact.

From this theorem we obtain the following variant of Lemma 7.3 for distributions with min-entropy δ.

Lemma 7.12.
Fix any k ≥ 1 and let R_k denote the set of all randomized forecasting circuits with resolution k. Then for every δ > 0 and function f : {0,1}^n → {0,1}, if we let ∆_δ denote the set of distributions over {0,1}^n with min-entropy δ, we have

inf_{R ∈ R_k} max_{µ ∈ ∆_δ} size(R)/score(R, µ)⁺ = max_{µ ∈ ∆_δ} inf_{R ∈ R_k} size(R)/score(R, µ)⁺.

We are now ready to complete the proof of Theorem 1.9.
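One direction of Lemma 7.12 needs no convexity at all: for any finite tables of costs and scores, the min-max of the ratio is at least the max-min (weak duality); the convexity of R_k and ∆_δ is what upgrades this to equality. A small numerical sketch with arbitrary (hypothetical) tables, using the paper's convention that a nonpositive score makes the ratio infinite:

```python
import random

random.seed(0)

def minmax_gap(cost, score):
    """For finite tables cost[r][m] and score[r][m], compare
    min_r max_m cost/score+ with max_m min_r cost/score+
    (ratio = infinity when the clamped score is 0)."""
    def ratio(r, m):
        s = max(score[r][m], 0.0)
        return cost[r][m] / s if s > 0 else float('inf')
    n_r, n_m = len(cost), len(cost[0])
    min_max = min(max(ratio(r, m) for m in range(n_m)) for r in range(n_r))
    max_min = max(min(ratio(r, m) for r in range(n_r)) for m in range(n_m))
    return min_max, max_min

for _ in range(100):
    cost = [[random.uniform(0, 1) for _ in range(4)] for _ in range(3)]
    score = [[random.uniform(-0.5, 1) for _ in range(4)] for _ in range(3)]
    mm, Mm = minmax_gap(cost, score)
    assert mm >= Mm  # weak duality always holds; equality needs convexification
```

Over these bare finite sets the gap is typically strict; allowing mixtures on both sides (as the lemma does) is what closes it.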
Proof of Theorem 1.9.
Fix s′ = c · s / log(1/δ) for some constant c to be fixed later. By Lemma 7.12, the two cases to consider are the following.

Case 1: max_{µ ∈ ∆_δ} inf_{R ∈ R_k} size(R)/score(R, µ)⁺ ≥ s′.

Fix a distribution µ with min-entropy δ for which the maximum is attained. Then every randomized forecasting circuit R has score score(R, µ) ≤ size(R)/s′. By Proposition 7.2, this implies that every randomized circuit R′ with size(R′) ≤ ǫ² s′ has success probability

Pr_{C ∼ R′, x ∼ µ}[C(x) = f(x)] ≤ 1/2 + √(size(R′)/s′) ≤ 1/2 + ǫ,

and the theorem holds in this case.

Case 2: inf_{R ∈ R_k} max_{µ ∈ ∆_δ} size(R)/score(R, µ)⁺ < s′.

Fix a randomized forecasting circuit R that satisfies size(R)/score(R, µ)⁺ < s′ for each distribution µ over {0,1}^n with min-entropy δ. Set α = size(R)/2s′ and define T ⊆ {0,1}^n to be the set of inputs x for which score(R, x) < α. Then |T| ≤ (1 − α) δ 2^n, since otherwise the score of R on the distribution µ′ that is uniform over any set T′ ⊇ T of size |T′| = δ 2^n (and thus has min-entropy δ) would be bounded above by

score(R, µ′) < (1 − α) · α + α < 2α,

contradicting the definition of R. By Lemma 7.4, there is a forecasting circuit R′ which satisfies size(R′) = O(s′) and score(R′, x) = Ω(1) for each x ∈ {0,1}^n \ T. Then by Proposition 7.7 there is a randomized Boolean circuit of size O(s′) that errs with probability at most 1/3 on each x ∈ {0,1}^n \ T, and by standard success amplification it also means that there is a circuit C of size s′′ = O(s′ log(1/δ)) with error less than δ. Choosing the value c in the definition of s′ appropriately, we then get that this circuit has size at most s, contradicting the premise of the theorem and therefore showing that Case 2 cannot occur.

Acknowledgements
We thank Justin Thaler for discussions and references related to approximate polynomial degree and its amplification. We also thank Andrew Drucker, Mika Göös, and Li-Yang Tan for correspondence about their ongoing work [BDG+20]. We thank anonymous reviewers for many helpful comments.
A Proofs related to the minimax theorem
Lemma 2.8 (An upper semicontinuous function on a compact set attains its max). Let X be a nonempty compact topological space, and let φ : X → [−∞, ∞] be a function. Then if φ is upper semicontinuous, it attains its maximum, meaning there is some x ∈ X such that for all x′ ∈ X, φ(x′) ≤ φ(x). Similarly, if φ is lower semicontinuous, it attains its minimum.

Proof. The lower semicontinuous case follows from the upper semicontinuous case simply by negating φ, so we focus on the upper semicontinuous case. Let z = sup_{x ∈ X} φ(x), where z ∈ [−∞, ∞]. Let x be any element of X. If φ(x) = z, we are done, so assume φ(x) < z; in particular, z > −∞. We define a sequence x₁, x₂, . . . as follows. If z < ∞, define x_i to be any element of X such that φ(x_i) > z − 1/i. If z = ∞, define x_i to be any element of X such that φ(x_i) > i. Moreover, for each i ∈ N, let U_i = {x ∈ X : φ(x) < φ(x_i)}. Note that any x ∈ X for which φ(x) < z must be in U_i for some i ∈ N; hence if the supremum z is not attained, the sets U_i form a cover of X (meaning ∪_{i ∈ N} U_i = X).

The key claim is that the U_i sets are all open if φ is upper semicontinuous. This is because if x ∈ U_i, then φ(x) < φ(x_i), and by the definition of upper semicontinuity, there is a neighborhood U of x on which φ(·) is still less than φ(x_i); thus there is a neighborhood U of x contained in U_i, so that U_i is open. In this case, if the supremum z is not attained, the collection {U_i}_{i ∈ N} is an open cover of X, and by the definition of compactness, it has a finite subcover. Let i be the largest index of some U_i in this subcover. Then it follows that φ(x) < φ(x_i) for all x ∈ X, which is a contradiction. Hence the supremum z must be attained as a maximum, as desired.

Lemma 2.9 (A pointwise infimum of upper semicontinuous functions is upper semicontinuous).
Let X be a topological space, let I be a set, and let {φ_i}_{i ∈ I} be a collection of functions φ_i : X → [−∞, ∞]. Then if each φ_i is upper semicontinuous, the function φ(x) = inf_{i ∈ I} φ_i(x) is also upper semicontinuous. Similarly, if each φ_i is lower semicontinuous, the pointwise supremum is lower semicontinuous.

Proof. Note that the case where the φ_i are all lower semicontinuous follows from the case where they are all upper semicontinuous simply by negating the functions, since negation flips upper and lower semicontinuity and flips infimums and supremums. We focus on the case where the φ_i are all upper semicontinuous.

Fix x ∈ X. If φ(x) = ∞, φ is upper semicontinuous at x by definition. If φ(x) < ∞, fix any y > φ(x). By the definition of φ(x) as an infimum, there is some i ∈ I such that φ_i(x) < y. By the upper semicontinuity of φ_i(·), there is a neighborhood U of x such that for all x′ ∈ U, we have φ_i(x′) < y. Then for all x′ ∈ U, we clearly have φ(x′) = inf_{i ∈ I} φ_i(x′) < y. Thus φ is upper semicontinuous at x, as desired.

Lemma A.1.
Let V be a real vector space, and let X ⊆ V. The convex hull of X is the set of all v ∈ V which can be written as a convex combination of vectors in X; that is, v for which there exist k ∈ N, x₁, x₂, . . . , x_k ∈ X, and λ₁, λ₂, . . . , λ_k ∈ [0, 1] with λ₁ + λ₂ + · · · + λ_k = 1 such that v = λ₁x₁ + λ₂x₂ + · · · + λ_k x_k.

Proof. This is a well-known characterization of the convex hull, which can be shown as follows: let Y be the set of all finite convex combinations of points in X; that is, Y contains all points in V of the form λ₁x₁ + λ₂x₂ + · · · + λ_k x_k, where k ∈ N, x₁, x₂, . . . , x_k ∈ X, and λ₁, λ₂, . . . , λ_k ∈ [0, 1] with λ₁ + λ₂ + · · · + λ_k = 1. Then Y is clearly convex, since for all y₁, y₂ ∈ Y and λ ∈ (0, 1), we know that y₁ and y₂ are finite convex combinations of points in X, meaning that λy₁ + (1 − λ)y₂ is also a finite convex combination of points in X. Furthermore, if Z is any other convex set containing X, then it’s easy to show by induction that Z contains all convex combinations of k points in X for all k ∈ N; hence Z must be a superset of Y. It follows that Conv(X), the intersection of all convex sets containing X, must exactly equal Y.

Lemma 2.10 (Quasiconvex functions on convex hulls). Let V be a real vector space, let X ⊆ V, and let φ : Conv(X) → [−∞, ∞] be a function. If φ is quasiconvex, then

sup_{x ∈ Conv(X)} φ(x) = sup_{x ∈ X} φ(x).

Similarly, if φ is quasiconcave, then inf_{x ∈ Conv(X)} φ(x) = inf_{x ∈ X} φ(x).

Proof.
The quasiconcave case follows from the quasiconvex case by negating φ; hence it suffices to prove the quasiconvex case. It is clear that sup_{x ∈ Conv(X)} φ(x) is at least sup_{x ∈ X} φ(x), so we only need to show the latter is at least the former. To this end, let y* := sup_{x ∈ Conv(X)} φ(x), and let x̂ ∈ Conv(X) be such that φ(x̂) is arbitrarily close to y*. We must show that sup_{x ∈ X} φ(x) ≥ φ(x̂), or equivalently, that there is some x ∈ X with φ(x) ≥ φ(x̂).

Using Lemma A.1, we can write x̂ ∈ Conv(X) as x̂ = λ₁x₁ + λ₂x₂ + · · · + λ_k x_k, with k ∈ N, x₁, x₂, . . . , x_k ∈ X, and λ₁, λ₂, . . . , λ_k ∈ [0, 1] with λ₁ + λ₂ + · · · + λ_k = 1. Furthermore, assume that λ_i > 0 for each i ∈ [k] (we can remove the terms with λ_i = 0 from the combination otherwise). Now, note that by quasiconvexity, we have φ(λx₁ + (1 − λ)x₂) ≤ max{φ(x₁), φ(x₂)}. It is not hard to show by induction that φ(λ₁x₁ + λ₂x₂ + · · · + λ_k x_k) ≤ max{φ(x₁), φ(x₂), . . . , φ(x_k)}. Hence there is some x ∈ X such that φ(x) ≥ φ(x̂), as desired.

Lemma 2.15.
Let V be a real topological vector space, and let X ⊆ V be convex. For a function ψ : X → [−∞, ∞], let ψ⁺ denote the function ψ⁺(x) = max{ψ(x), 0}. Then this operation on ψ preserves convexity, quasiconvexity, quasiconcavity, upper semicontinuity, and lower semicontinuity, but not concavity.

We actually prove a stronger statement, where the maximum is taken with an arbitrary constant.
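As a quick numerical illustration of this clamping operation (the specific functions below are illustrative choices, not from the paper), one can spot-check that ψ′ = max{ψ, c} remains quasiconvex along segments whenever ψ is:

```python
import math
import random

random.seed(1)

def is_quasiconvex_on_segment(f, x, y, steps=50):
    """Check f(lx + (1-l)y) <= max(f(x), f(y)) at grid points of the segment,
    which is the defining inequality of quasiconvexity along [x, y]."""
    bound = max(f(x), f(y))
    return all(f(x + (y - x) * t / steps) <= bound + 1e-12 for t in range(steps + 1))

psi = lambda x: math.sqrt(abs(x))      # quasiconvex (interval sublevel sets) but not convex
clamped = lambda x: max(psi(x), 0.5)   # psi' = max{psi, c} with c = 0.5

for _ in range(200):
    a, b = random.uniform(-3, 3), random.uniform(-3, 3)
    assert is_quasiconvex_on_segment(psi, a, b)
    assert is_quasiconvex_on_segment(clamped, a, b)
```

The same check with a concave ψ (say −x²) and its clamp shows why concavity is not preserved: the clamp flattens the peak but leaves the tails, creating a non-concave shape while keeping quasiconcavity.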
Lemma A.2.
Let V be a real topological vector space, and let X ⊆ V be convex. Let ψ : X → [−∞, ∞] be a function, let c ∈ R be a constant, and let ψ′ : X → [−∞, ∞] be the function ψ′(x) = max{ψ(x), c}. Then if ψ is convex, ψ′ is convex; if ψ is quasiconvex, ψ′ is quasiconvex; if ψ is quasiconcave, ψ′ is quasiconcave; if ψ is upper semicontinuous, ψ′ is upper semicontinuous; and if ψ is lower semicontinuous, ψ′ is lower semicontinuous.

Proof. Let x, y ∈ X, and let λ ∈ (0, 1). Then ψ′(λx + (1 − λ)y) = max{ψ(λx + (1 − λ)y), c}. If this maximum equals c, it is certainly at most λ max{ψ(x), c} + (1 − λ) max{ψ(y), c}, since these two latter maximums are each at least c. Hence the inequalities for convexity and quasiconvexity always hold when the original maximum equals c. Alternatively, if max{ψ(λx + (1 − λ)y), c} = ψ(λx + (1 − λ)y), then using ψ(x) ≤ ψ′(x) and ψ(y) ≤ ψ′(y), we see that convexity of ψ gives the inequality for convexity of ψ′, and quasiconvexity of ψ gives the inequality for quasiconvexity of ψ′.

Next, suppose ψ is quasiconcave. Without loss of generality, say that ψ(x) ≤ ψ(y). Then

ψ′(λx + (1 − λ)y) = max{ψ(λx + (1 − λ)y), c} ≥ max{ψ(x), c} = ψ′(x) ≥ min{ψ′(x), ψ′(y)},

and ψ′ is quasiconcave.

Preservation of lower semicontinuity follows from Lemma 2.9, where we note that the constant c is continuous as a function from X to R. It remains to show upper semicontinuity is preserved. Suppose ψ is upper semicontinuous, and let x ∈ X. If ψ′(x) = ∞, upper semicontinuity at x vacuously holds. Fix y > ψ′(x). Since ψ′(x) ≥ c, we have y > c, and also y > ψ′(x) ≥ ψ(x), so upper semicontinuity gives us a neighborhood U of x on which ψ(·) is less than y. Since y > c, we have ψ′(·) = max{c, ψ(·)} < y on U. Hence ψ′ is upper semicontinuous.

Theorem A.3 (Sion’s minimax [Sio58]).
Let V₁ and V₂ be real topological vector spaces, and let X ⊆ V₁ and Y ⊆ V₂ be convex. Let α : X × Y → R be semicontinuous and quasisaddle. If either X or Y is compact, then

inf_{x ∈ X} sup_{y ∈ Y} α(x, y) = sup_{y ∈ Y} inf_{x ∈ X} α(x, y).

Theorem 2.11 (Sion’s minimax for extended reals). Let V₁ and V₂ be real topological vector spaces, and let X ⊆ V₁ and Y ⊆ V₂ be convex. Let α : X × Y → [−∞, ∞] be semicontinuous and quasisaddle. If either X or Y is compact, then

inf_{x ∈ X} sup_{y ∈ Y} α(x, y) = sup_{y ∈ Y} inf_{x ∈ X} α(x, y).

Proof.
First, note that the inf-sup is always at least the sup-inf. This is because these expressions can be thought of as a game between two players, one choosing x and trying to minimize α(x, y), and the other choosing y and trying to maximize α(x, y); in the inf-sup, the sup player chooses y after already knowing x, and therefore has more information and is better positioned to maximize α(x, y) than in the sup-inf, where the inf player goes second.

Now, let

a := sup_{y ∈ Y} inf_{x ∈ X} α(x, y),  b := inf_{x ∈ X} sup_{y ∈ Y} α(x, y).

We have a, b ∈ [−∞, ∞], and a ≤ b. We wish to show a = b. Suppose by contradiction that a < b. Then we can pick a′, b′ ∈ R such that a < a′ < b′ < b. We then define α′ : X × Y → R by α′(x, y) := a′ if α(x, y) ≤ a′, α′(x, y) := b′ if α(x, y) ≥ b′, and α′(x, y) := α(x, y) if α(x, y) ∈ [a′, b′]. Note that α′(x, y) = max{a′, min{b′, α(x, y)}}. By Lemma A.2, we know that taking a maximum with a constant preserves quasiconvexity, quasiconcavity, and upper and lower semicontinuity. By negating the function, it also follows that taking a minimum with a constant preserves these properties. From this it follows that α′ is quasisaddle and semicontinuous, since α has these properties.

Now, since a = sup_{y ∈ Y} inf_{x ∈ X} α(x, y) and since a′ > a, we know that for all y ∈ Y, there exists some x ∈ X for which α(x, y) < a′. This means that for all y ∈ Y, there exists x ∈ X for which α′(x, y) = a′. Hence

sup_{y ∈ Y} inf_{x ∈ X} α′(x, y) = a′.

Similarly, since b = inf_{x ∈ X} sup_{y ∈ Y} α(x, y) and since b′ < b, we know that for all x ∈ X, there exists some y ∈ Y for which α(x, y) > b′. This means that for all x ∈ X, there exists y ∈ Y for which α′(x, y) = b′. Hence inf_{x ∈ X} sup_{y ∈ Y} α′(x, y) = b′. By Theorem A.3, we then have

b′ = inf_{x ∈ X} sup_{y ∈ Y} α′(x, y) = sup_{y ∈ Y} inf_{x ∈ X} α′(x, y) = a′.

But this is a contradiction, since we picked a′ < b′.
We conclude that we must have had a = b to begin with, as desired.

B Distance measures
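The properness claims proved below (Lemma 3.3) can also be probed numerically. The sketch below uses our reading of Definition 3.2, with the normalization s(1/2) = 0 and s(1) = 1 and with bias(q) = 2q − 1 (these exact formulas are assumptions); it checks that the expected score p·s(q) + (1 − p)·s(1 − q) is maximized near q = p for hs, ls, and Brier, but not for bias:

```python
import math

# Scoring rules, normalized so s(1/2) = 0 and s(1) = 1 (our reading of Def. 3.2).
hs    = lambda q: 1 - math.sqrt((1 - q) / q)
ls    = lambda q: 1 - math.log2(1 / q)
brier = lambda q: 1 - 4 * (1 - q) ** 2
bias  = lambda q: 2 * q - 1

def best_forecast(s, p, grid=2000):
    """Forecast q in (0,1) maximizing the expected score p*s(q) + (1-p)*s(1-q)."""
    qs = [i / grid for i in range(1, grid)]
    return max(qs, key=lambda q: p * s(q) + (1 - p) * s(1 - q))

p = 0.7
for s in (hs, ls, brier):
    assert abs(best_forecast(s, p) - p) < 1e-2   # proper: honest q = p is optimal
assert abs(best_forecast(bias, p) - p) > 0.25    # bias: optimum pushed to an extreme
```

For the linear rule bias, the expected score is (2q − 1)(2p − 1), so the optimal forecast is always 0 or 1 regardless of p, which is exactly why it fails to be proper.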
Lemma 3.3. hs, Brier, and ls are proper scoring rules. bias is a scoring rule which is not proper.

Proof. It is clear that all of the functions from Definition 3.2 are smooth on (0, 1] and increasing on [0, 1], where we interpret hs(0) = ls(0) = −∞. It is also clear that all these functions evaluate to 1 at 1 and to 0 at 1/2. It remains to show that Brier, ls, and hs are proper. To do so, we need to show that p s(q) + (1 − p) s(1 − q) is uniquely optimized at q = p when s is one of these functions and p ∈ (0, 1). Fix such p ∈ (0, 1), and observe that the critical points of the expression we wish to maximize are the points q such that p s′(q) = (1 − p) s′(1 − q).

For ls(q) = 1 − log(1/q) = 1 + (log e) ln q, the critical points q satisfy (log e) p/q = (log e)(1 − p)/(1 − q), or p/(1 − p) = q/(1 − q). Noting that the function x/(1 − x) is increasing on (0, 1), and hence injective on (0, 1), we conclude that the only critical point is q = p. Moreover, at the boundaries q = 0 and q = 1, we clearly have p ls(q) + (1 − p) ls(1 − q) = −∞, whereas in the interior the expression is finite. Hence the unique maximum must occur at q = p.

For hs(q) = 1 − √((1 − q)/q) = 1 − √(1/q − 1), we have hs′(q) = 1/(2√(q³(1 − q))), so the critical points q satisfy p/√(q³(1 − q)) = (1 − p)/√((1 − q)³ q), or p/(1 − p) = q/(1 − q), which once again only occurs at q = p. At the boundaries, we once again have p hs(q) + (1 − p) hs(1 − q) = −∞ for q = 0 or q = 1, so the unique maximum occurs at q = p.

Finally, for Brier(q) = 1 − 4(1 − q)² = −4q² + 8q − 3, we have Brier′(q) = 8(1 − q), so the critical points q satisfy 8p(1 − q) = 8(1 − p)q, which again implies q = p. This time, the boundary points are finite, but we can use the second-order condition: the second derivative of p Brier(q) + (1 − p) Brier(1 − q) is p Brier″(q) + (1 − p) Brier″(1 − q). Noting that Brier″(q) = −8, this is −8p − 8(1 − p) = −8 < 0. Hence the critical point is a maximum, and since it is unique (with the boundaries 0 and 1 not being critical even if we extend the domain of the function), we conclude it is the unique maximum.

Lemma B.1.
For any x ∈ [0, 1], we have

x²/2 ≤ 1 − √(1 − x²) ≤ 1 − H((1 + x)/2) ≤ x² ≤ x.

Additionally, x² and −√(1 − x²) are convex functions on [0, 1].

Proof. x² ≤ x is clearly true for x ∈ [0, 1]. To see that x²/2 ≤ 1 − √(1 − x²), note that this is equivalent to y/2 ≤ 1 − √(1 − y) for y ∈ [0, 1] (by setting y = x²); the latter is clearly true at y = 0, so it suffices to show the right-hand side grows faster. Taking derivatives, it suffices to show 1/2 ≤ 1/(2√(1 − y)), which is clearly true for y ∈ [0, 1].

Next, write

1 − H((1 + x)/2) = 1 − ((1 + x)/2) log(2/(1 + x)) − ((1 − x)/2) log(2/(1 − x))
= 1 − (1 + x)/2 − (1 − x)/2 + ((1 + x)/2) log(1 + x) + ((1 − x)/2) log(1 − x)
= (1/ln 4) ((1 + x) ln(1 + x) + (1 − x) ln(1 − x)).

Let α(x) = (1 + x) ln(1 + x) + (1 − x) ln(1 − x). We show that α(x)/x² is increasing and that α(x)/(1 − √(1 − x²)) is decreasing; this suffices to show the desired inequalities, since it means we only need to check x = 1, where the inequalities 1 − √(1 − x²) ≤ 1 − H((1 + x)/2) ≤ x² hold with equality.

The derivative of α(x) is ln(1 + x) − ln(1 − x). The derivative of α(x)/x² is therefore x² ln(1 + x) − x² ln(1 − x) − 2x(1 + x) ln(1 + x) − 2x(1 − x) ln(1 − x) divided by x⁴ > 0 (for x ∈ (0, 1)). This simplifies to −2 ln(1 − x²) − x ln((1 + x)/(1 − x)). This is positive if and only if 2 ln(1 − x²) + x ln((1 + x)/(1 − x)) is negative. This expression equals 0 at x = 0, so it suffices to show it is decreasing on (0, 1). The derivative is −2x/(1 − x²) + ln((1 + x)/(1 − x)), which is again 0 at x = 0, so it again suffices to show the derivative is negative on (0, 1). The derivative of this expression is −4x²/(1 − x²)², which is finally a quantity that is clearly negative, completing the argument; hence α(x)/x² is increasing on [0, 1].

The derivative of α(x)/(1 − √(1 − x²)) is

(1 − x − √(1 − x²)) ln(1 − x) − (1 + x − √(1 − x²)) ln(1 + x)

divided by some denominator which is positive on (0, 1). This equals −x ln(1 − x²) − (1 − √(1 − x²)) ln((1 + x)/(1 − x)). Note that ln(1 − x²) = −x² − x⁴/2 − · · · − x^{2i}/i − · · · and that ln((1 + x)/(1 − x)) = ln(1 + x) − ln(1 − x) = 2x + 2x³/3 + · · · + 2x^{2i−1}/(2i − 1) + · · ·, so the expression equals

x ∑_{i=1}^∞ x^{2i}/i − (1 − √(1 − x²)) ∑_{i=1}^∞ x^{2i−1}/(i − 1/2) = (√(1 − x²) − (1 − x²)) ∑_{i=1}^∞ −x^{2i−1}/(i(2i − 1)) < 0.

Hence α(x)/(1 − √(1 − x²)) is decreasing on [0, 1], as desired.

It is clear that x² and −√(1 − x²) are convex functions on [0, 1], as their second derivatives are 2 > 0 and (1 − x²)^{−3/2} > 0 (for x ∈ (0, 1)) respectively.

Lemma 3.6 (Relations between distance measures). When applied to fixed ν₀, ν₁, and w, the distance measures satisfy

S²/2 ≤ 1 − √(1 − S²) ≤ h² ≤ JS ≤ S²

as well as ∆² ≤ S² ≤ ∆. We also have JS ≤ h²/ln 2 and S² ≤ (ln 4) JS.

Proof. We use Lemma B.1. The chain h² ≤ JS ≤ S² ≤ ∆ follows from the inequalities there, while the inequalities ∆² ≤ S² and 1 − √(1 − S²) ≤ h² follow from Jensen’s inequality combined with the convexity of x² and −√(1 − x²).

Finally, to show the inequality JS ≤ h²/ln 2, we only need to compute the limit of α(x)/(1 − √(1 − x²)) as x → 0, since this ratio is decreasing in x (where α(x) is defined as in the proof of Lemma B.1). To do that it suffices to use α(x) = x² + O(x⁴) and 1 − √(1 − x²) = x²/2 + O(x⁴), so the limit is 2. Hence the limit of (1 − H((1 + x)/2))/(1 − √(1 − x²)) as x → 0 is 1/ln 2, meaning this ratio is always at most 1/ln 2. Similarly, to show the inequality S² ≤ (ln 4) JS, we only need to compute the limit of α(x)/x² as x → 0. Again using α(x) = x² + O(x⁴), the limit is 1, so the ratio (1 − H((1 + x)/2))/x² is always at least 1/ln 4.

Lemma 3.11. If x ∈ [0, 1] and k ∈ [1, ∞), we have

(1/2) min{kx, 1} ≤ 1 − (1 − x)^k ≤ min{kx, 1}.

Proof. Set f(x) := 1 − (1 − x)^k. Clearly, when x ∈ [0, 1], we have f(x) ∈ [0, 1], so f : [0, 1] → [0, 1]. Note f(0) = 0, f(1) = 1, and that f(x) is increasing on [0, 1]. If k = 1, we have f(x) = x, and the inequalities trivially hold; therefore, assume k > 1. Then f′(x) = k(1 − x)^{k−1} and f″(x) = −k(k − 1)(1 − x)^{k−2}, meaning that f(x) is concave on [0, 1]; we also have f′(0) = k and f″(0) = −k(k − 1). From this we conclude that f(x) ≤ kx, proving the upper bound (as f(x) ≤ 1 is clear).

For the lower bound, note that f‴(x) = k(k − 1)(k − 2)(1 − x)^{k−3}, which is non-negative on [0, 1]. This means that f″(x) ≥ −k(k − 1) on [0, 1], that f′(x) ≥ k − k(k − 1)x on [0, 1], and that f(x) ≥ kx − (k(k − 1)/2)x² = kx(1 − (k − 1)x/2) on [0, 1]. If (k − 1)x ≤ 1, we get f(x) ≥ kx/2. If (k − 1)x ≥ 1, we have f(x) ≥ 1 − e^{−kx} ≥ 1 − 1/e ≥ 1/2. This completes the proof.

Lemma 4.4 (Hellinger distance of disjoint mixtures). Let µ be a distribution over a finite support A, and for each a ∈ A, let ν⁰_a and ν¹_a be two distributions over a finite support S_a. Let ν⁰_µ and ν¹_µ denote the mixture distributions where a ← µ is sampled, and then a sample is produced from ν⁰_a or ν¹_a respectively. Assume the sets S_a are disjoint for all a ∈ A. Then

h²(ν⁰_µ, ν¹_µ) = E_{a←µ}[h²(ν⁰_a, ν¹_a)].

Proof.
Note that the squared Hellinger distance is one minus the fidelity, that is, h²(µ₀, µ₁) = 1 − F(µ₀, µ₁) where F(µ₀, µ₁) = ∑_x √(µ₀[x] µ₁[x]) (this is easy to check from the definition of h²). Now write

h²(ν⁰_µ, ν¹_µ) = 1 − ∑_{x ∈ ∪_a S_a} √(ν⁰_µ[x] ν¹_µ[x])
= 1 − ∑_{a ∈ A} ∑_{x ∈ S_a} √(µ[a] ν⁰_a[x] µ[a] ν¹_a[x])
= 1 − E_{a←µ}[ ∑_{x ∈ S_a} √(ν⁰_a[x] ν¹_a[x]) ]
= E_{a←µ}[ 1 − ∑_{x ∈ S_a} √(ν⁰_a[x] ν¹_a[x]) ]
= E_{a←µ}[ h²(ν⁰_a, ν¹_a) ].

C Quantum amplitude estimation
We show the following strengthening of Theorem 5.1, which follows from [BHMT02].
Theorem C.1 (Amplitude estimation). Suppose we have access to a unitary U (representing a quantum algorithm) which maps |0⟩ to |ψ⟩, as well as access to a projective measurement Π, and we wish to estimate p := ‖Π|ψ⟩‖² (representing the probability that the quantum algorithm accepts). Fix ǫ, δ ∈ (0, 1/2). Then using at most (100/ǫ) · ln(1/δ) controlled applications of U or U† and at most that many applications of I − 2Π, we can output ˜p ∈ [0, 1] such that |˜p − p| ≤ ǫ with probability at least 1 − δ.

Further, this can be tightened to a bound that depends on p, as follows. For any positive real number T, there is an algorithm which depends on ǫ, δ, and T (but not on p) which uses at most T applications of the unitaries (as above) and outputs ˜p ∈ [0, 1] with the following guarantee: if T is at least ⌊(100/ǫ)√(max{p, ǫ}) · ln(1/δ)⌋, then |˜p − p| ≤ ǫ with probability at least 1 − δ.

Proof. [BHMT02] showed that an algorithm which makes M controlled calls to the unitary U(I − 2|0⟩⟨0|)U⁻¹(I − 2Π) and one additional call to U can output ˜p such that

|˜p − p| ≤ 2π √(p(1 − p))/M + π²/M²

with probability at least 8/π² ≥ 4/5. If we pick M such that M ≥ 2π/√ǫ and M ≥ 4π√p/ǫ, then this is at most ǫ/2 + ǫ/4 ≤ ǫ. Note that M must be an integer, and that the number of applications of U or U⁻¹ is M + 1. Hence to get this success probability, it suffices to have T ≥ (19/ǫ)√(max{p, ǫ}).

To generalize to other success probabilities, we amplify this algorithm by repeating it 2k + 1 times and returning the median estimate. Writing q ≤ 1/5 for the failure probability of a single run and C(n, j) for the binomial coefficient, the probability that the median is still wrong is the probability that at least k + 1 out of 2k + 1 of the estimates were wrong, which is

∑_{i=1}^{k+1} C(2k+1, k+1−i) q^{k+i} (1 − q)^{k+1−i} ≤ q^{k+1}(1 − q)^k ∑_{i=1}^{k+1} C(2k+1, k+1−i) = q^{k+1}(1 − q)^k 2^{2k} = q(1 − (1 − 2q)²)^k ≤ q e^{−k(1−2q)²}.

Hence to get this below δ, we just need k ≥ (1/(1 − 2q)²) ln(q/δ); since q ≤ 1/5, k ≥ 2.8 ln(1/δ) suffices.
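The tail bound for the median trick derived above can be spot-checked numerically (a sketch; the exact binomial computation below is ours, not from [BHMT02]):

```python
from math import comb, exp

def median_failure(q, k):
    """Exact probability that at least k+1 of 2k+1 independent estimates,
    each failing with probability q, fail (so the median estimate fails)."""
    n = 2 * k + 1
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(k + 1, n + 1))

# Spot-check the derived bound: P[median wrong] <= q * exp(-k (1-2q)^2).
for q in [0.05, 0.1, 0.2]:
    for k in range(0, 30):
        assert median_failure(q, k) <= q * exp(-k * (1 - 2 * q) ** 2) + 1e-15
```

The exponential decay in k is what lets the overall query count pick up only a ln(1/δ) factor.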
Since k must be an integer, we choose the smallest such k; then 2k + 1 = O(ln(1/δ)). Multiplying this by the bound from before, we get that it suffices for T to be at least (100/ǫ)√(max{p, ǫ}) · ln(1/δ), as desired.

References

[Alt88] Helmut Alt. Comparing the combinational complexities of arithmetic functions. Journal of the ACM (1988) (p. 49).

[AR20] Scott Aaronson and Patrick Rall. Quantum Approximate Counting, Simplified. Proceedings of the 3rd Symposium on Simplicity in Algorithms (SOSA). 2020 (p. 33).

[BB19] Eric Blais and Joshua Brody. Optimal Separation and Strong Direct Sum for Randomized Query Complexity. Proceedings of the 34th Conference on Computational Complexity (CCC). 2019 (p. 5).

[BB20] Shalev Ben-David and Eric Blais. A tight composition theorem for the randomized query complexity of partial functions. Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS). 2020 (pp. 4, 7, 8).

[BBGK18] Shalev Ben-David, Adam Bouland, Ankit Garg, and Robin Kothari. Classical Lower Bounds from Quantum Upper Bounds.
Proceedings of the 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS). 2018 (p. 44).

[BCH86] Paul W. Beame, Stephen A. Cook, and H. James Hoover. Log Depth Circuits for Division and Related Problems. SIAM Journal on Computing (1986). Previous version in FOCS 1984 (p. 49).

[BDG+20] Andrew Bassilakis, Andrew Drucker, Mika Göös, Lunjia Hu, Weiyun Ma, and Li-Yang Tan. The Power of Many Samples in Query Complexity. Proceedings of the 47th International Colloquium on Automata, Languages, and Programming (ICALP). 2020 (pp. 8, 52).

[BGK+18] Mark Braverman, Ankit Garg, Young Kun Ko, Jieming Mao, and Dave Touchette. Near-Optimal Bounds on the Bounded-Round Quantum Communication Complexity of Disjointness. SIAM Journal on Computing (2018). Previous version in FOCS 2015 (p. 5).

[BHK09] Boaz Barak, Moritz Hardt, and Satyen Kale. The Uniform Hardcore Lemma via Approximate Bregman Projections. Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms. 2009 (p. 8).

[BHMT02] Gilles Brassard, Peter Høyer, Michele Mosca, and Alain Tapp. Quantum amplitude amplification and estimation. Proceedings of an AMS Special Session on Quantum Computation and Information (CONM). 2002. arXiv: quant-ph/0005055 (pp. 33, 58, 59).

[BNRW07] Harry Buhrman, Ilan Newman, Hein Röhrig, and Ronald de Wolf. Robust Polynomials and Quantum Algorithms. Theory of Computing Systems (2007). Previous version in STACS 2005. arXiv: quant-ph/0309220 (p. 41).

[Bra15] Mark Braverman. Interactive Information Complexity. SIAM Journal on Computing (2015). Previous version in STOC 2012 (p. 5).

[BSS05] Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Preprint, 2005. url: pdfs.semanticscholar.org/d670/6b6e626c15680688b0774419662f2341caee.pdf (pp. 5, 20).

[CSV84] Ashok K. Chandra, Larry Stockmeyer, and Uzi Vishkin. Constant Depth Reducibility. SIAM Journal on Computing (1984) (p. 50).

[GKKT17] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably Learning the ReLU in Polynomial Time. Proceedings of the 30th Annual Conference on Learning Theory (COLT). 2017 (p. 41).

[GR07] Tilmann Gneiting and Adrian E. Raftery. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association (2007) (p. 5).

[Imp95] R. Impagliazzo. Hard-core distributions for somewhat hard problems. Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science (FOCS). 1995 (pp. 5, 8).

[Jac11] Dunham Jackson. “Über die Genauigkeit der Annäherung stetiger Funktionen durch ganze rationale Funktionen gegebenen Grades und trigonometrische Summen gegebener Ordnung”. PhD thesis. University of Göttingen, 1911. url: gdz.sub.uni-goettingen.de/id/PPN30230648X (p. 41).

[KS03] Adam R. Klivans and Rocco A. Servedio. Boosting and Hard-Core Set Construction. Machine Learning (2003). Previous version in FOCS 1999 (p. 8).

[LS09] Troy Lee and Adi Shraibman. An Approximation Algorithm for Approximation Rank. Proceedings of the 24th Conference on Computational Complexity (CCC). 2009 (p. 46).

[LSŠ08] Troy Lee, Adi Shraibman, and Robert Špalek. A Direct Product Theorem for Discrepancy. Proceedings of the 23rd Conference on Computational Complexity (CCC). 2008 (p. 44).

[MCAL17] Marianthi Markatou, Yang Chen, Georgios Afendras, and Bruce G. Lindsay. Statistical Distances and Their Role in Robustness. New Advances in Statistics and Data Science (2017) (p. 21).

[MMR94] G. V. Milovanović, D. S. Mitrinović, and Th. M. Rassias. Topics in Polynomials: Extremal Problems, Inequalities, Zeros. World Scientific, 1994. isbn: 978-981-02-0499-0 (p. 41).

[Ofm62] Yuri P. Ofman. On the algorithmic complexity of discrete functions. Doklady Akademii Nauk (1962) (p. 49).

[Pip87] Nicholas Pippenger. The complexity of computations by networks. IBM Journal of Research and Development (1987) (p. 49).

[RT92] John H. Reif and Stephen R. Tate. On Threshold Circuits and Polynomial Computation. SIAM Journal on Computing (1992) (p. 50).

[RW11] Mark D. Reid and Robert C. Williamson. Information, Divergence and Risk for Binary Experiments. Journal of Machine Learning Research (2011). url: http://jmlr.org/papers/v12/reid11a.html (p. 21).

[Sha03] Ronen Shaltiel. Towards proving strong direct product theorems. Computational Complexity (2003). Previous version in CCC 2001 (p. 4).

[She12] Alexander A. Sherstov. Strong Direct Product Theorems for Quantum Communication and Query Complexity. SIAM Journal on Computing (2012). Previous version in STOC 2011 (p. 44).

[She13] Alexander A. Sherstov. Making Polynomials Robust to Noise. Theory of Computing (2013). Previous version in STOC 2012 (pp. 41, 46).

[Sio58] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics (1958) (pp. 5, 14, 55).

[Tøp00] Flemming Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory (2000) (p. 21).

[TTV09] Luca Trevisan, Madhur Tulsiani, and Salil Vadhan. Regularity, Boosting, and Efficiently Simulating Every High-Entropy Distribution. Proceedings of the 24th Conference on Computational Complexity (CCC). 2009 (p. 8).

[Ver98] Nikolai K. Vereshchagin. Randomized Boolean decision trees: Several remarks. Theoretical Computer Science (1998) (pp. 5, 10).

[Vol99] Heribert Vollmer. Introduction to Circuit Complexity: A Uniform Approach. Springer Berlin Heidelberg, 1999. isbn: 978-3-642-08398-3 (p. 50).

[Weg87] Ingo Wegener. The Complexity of Boolean Functions. Wiley, 1987. isbn: 3-519-02107-2. url: eccc.weizmann.ac.il/static/books/The_Complexity_of_Boolean_Functions/ (pp. 49, 50).

[Yao77] Andrew Yao. Probabilistic computations: toward a unified measure of complexity. Proceedings of the 18th Annual IEEE Symposium on Foundations of Computer Science (FOCS) (1977). doi: 10.1109/SFCS.1977.24