Learning algorithms from circuit lower bounds
Ján Pich
University of Oxford
November 2020
Abstract
We revisit known constructions of efficient learning algorithms from various notions of constructive circuit lower bounds such as distinguishers breaking pseudorandom generators or efficient witnessing algorithms which find errors of small circuits attempting to compute hard functions. As our main result we prove that if it is possible to find efficiently, in a particular interactive way, errors of many p-size circuits attempting to solve hard problems, then p-size circuits can be PAC learned over the uniform distribution with membership queries by circuits of subexponential size. The opposite implication holds as well. This provides a new characterisation of learning algorithms and extends the natural proofs barrier of Razborov and Rudich. The proof is based on a method of exploiting Nisan-Wigderson generators introduced by Krajíček (2010) and used to analyze complexity of circuit lower bounds in bounded arithmetic.

An interesting consequence of known constructions of learning algorithms from circuit lower bounds is a learning speedup of Oliveira and Santhanam (2016). We present an alternative proof of this phenomenon and discuss its potential to advance the program of hardness magnification.
While the central conjectures in complexity theory such as P ≠ NP have the form of impossibility results, we hope that a better understanding of the impossibility phenomena will also shed light on the question of constructing new useful algorithms. A successful formalization of such hopes can be found in cryptography, where impossibility results in the form of average-case lower bounds are turned into cryptographic primitives. In the present paper we are interested in turning complexity lower bounds into efficient learning algorithms.

Results of this form can be traced back to cryptography as well. The 'pseudorandomness from unpredictability' paradigm was used by Blum, Furst, Kearns and Lipton [3] to show that efficient distinguishers breaking pseudorandom generators imply efficient learning of p-size circuits on average. The distinguishers from [3] can be interpreted as constructive circuit lower bounds distinguishing partial truth-tables of easy Boolean functions from partial truth-tables of hard functions, cf. Section 4. The existing methods for proving circuit lower bounds have also been applied in constructions of new learning algorithms for restricted circuit classes; e.g. Linial, Mansour and Nisan [23] used AC⁰ lower bounds to get learning algorithms for AC⁰. More recently, in a landmark work, Carmosino, Impagliazzo, Kabanets and Kolokolova [5] gave a generic construction of learning algorithms from natural proofs of circuit lower bounds. Oliveira and Santhanam [32] extended their result to a dichotomy between the non-existence of non-uniform pseudorandom function families and the existence of efficient learning of small circuits. These results led Oliveira and Santhanam [32] also to a discovery of a surprising learning speedup. For example, learning p-size circuits over the uniform distribution with membership queries by circuits of weakly subexponential size 2^n/n^{ω(1)} implies that for each constant k and ε > 0, circuits of size n^k can be learned over the uniform distribution with membership queries by circuits of strongly subexponential size 2^{n^ε}.

In the present paper we revisit these connections. We start by considering a simple instance-specific model of learning in which proving a single circuit lower bound implies a reliable prediction of the value of a target function on a single input. The model underlies the construction of learning algorithms from [3, 5] and differs from the standard PAC learning model mainly in that it does not ask learners to construct a circuit which computes the target function on a big fraction of inputs, cf. Section 3.
Learning from witnessing lower bounds.
Our main result is a construction of efficient PAC learning of p-size circuits from a constructive circuit lower bound for an arbitrary Boolean function H. More precisely, we obtain subexponential-size circuits learning p-size circuits over the uniform distribution with membership queries. The assumption of a constructive circuit lower bound we need is defined as the existence of 2^{O(n)}-size 'witnessing' circuits W which, given oracle access to a p-size circuit D with n inputs, find a not-yet-queried input on which D fails to compute H. The circuits W are allowed to fail on a 1/poly(n) fraction of circuits D. Moreover, even if circuits W succeed on a circuit D, they are allowed to output an incorrect answer log n times (receiving a correction in each round) before generating the right answer, cf. Theorem 1. The implication can also be interpreted as a construction of PAC learning algorithms from a frequent interactive instance-specific learning (we use the adjective 'instance-specific' only informally in this paper; the instance-specific model discussed earlier actually differs slightly from the concept in Theorem 1): if we are given an algorithm which is able to predict the value of a big fraction of p-size circuits (after a small number of queries and ≤ log n mistakes) even
on a single input, this already implies learnability of p-size circuits on almost all inputs. The opposite implication, producing efficient witnessing of lower bounds from learning algorithms, holds as well, which yields a new characterisation of PAC learning of small circuits, cf. Lemma 1.
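The converse direction mentioned here (Lemma 1 later in the paper) can be illustrated with a toy sketch in which truth tables stand in for circuits and the learner is assumed to be perfect; the concrete sizes and the random tables below are illustrative assumptions, not the paper's parameters.

```python
import random

# Toy version of the direction "witnessing from learning": if a learner,
# given oracle access to a small circuit D, outputs a hypothesis agreeing
# with D, then any not-yet-queried input on which the hypothesis disagrees
# with the hard function H witnesses D(x) != H(x).
random.seed(2)
n = 8
N = 2 ** n
H = [random.randint(0, 1) for _ in range(N)]   # "hard" function (random table)
D = [random.randint(0, 1) for _ in range(N)]   # circuit attempting to compute H

queried = set(random.sample(range(N), 32))     # inputs the learner queried
hyp = list(D)                                  # perfect hypothesis for D

# candidate witnesses: fresh inputs where the hypothesis disagrees with H
candidates = [x for x in range(N) if x not in queried and hyp[x] != H[x]]
witness = random.choice(candidates)
assert D[witness] != H[witness]                # the witness refutes D on H
```

Since the hypothesis equals D exactly, every candidate is a genuine error of D against H; Lemma 1 works with an approximating hypothesis and correspondingly loses a small probability.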
Relation to proof complexity, natural proofs and witnessing theorems.
The notion of interactive witnessing of circuit lower bounds from Theorem 1 is motivated by witnessing theorems from bounded arithmetic. One of the most prominent theories of bounded arithmetic is Cook's theory PV, which formalizes p-time reasoning. Theories of bounded arithmetic satisfy many so-called witnessing theorems, which allow us to show, for example, that if we can prove a p-size circuit lower bound for a function H ∈ NP in PV then there exists a witnessing analogous to the one from Theorem 1 except that the witnessing circuits W have white-box access to D (i.e. access to a full description of D), see Section 3.1 for a more detailed comparison. The witnessing from Theorem 1 is also closely related to algorithms finding hard instances of NP problems by Gutfreund, Shaltiel, Ta-Shma [12] and Atserias [2]. The main difference is that the algorithms from [12] have white-box access to the algorithm whose error they search for. While Atserias [2] made [12] work with black-box (oracle) access, his algorithm achieves much smaller probability of success than the one required in Theorem 1, cf. Section 3.1.

The proof of Theorem 1 is an adaptation of a method of exploiting Nisan-Wigderson generators introduced by Krajíček [17] in order to give a model-theoretic evidence for Razborov's conjecture in proof complexity. Razborov's conjecture [39] states a conditional hardness of deriving tautologies expressing the existence of an element outside of the range of a suitable NW-generator in strong proof systems. Krajíček's result significantly strengthens a similar but much simpler proof of the validity of Razborov's conjecture for proof systems with feasible interpolation [34].
The method has also been used to show a conditional hardness of generating hard tautologies [19], a conditional unprovability of p-size circuit lower bounds for SAT in theories of bounded arithmetic below Cook's theory PV [35] and an unconditional unprovability of strong nondeterministic lower bounds in Jeřábek's theory of approximate counting APC [37]. We take advantage of its unique way of exploiting the NW-generator: it gives us a reconstruction algorithm which, after breaking the NW-generator in a particular interactive fashion, allows us to approximately compute the function on which the generator is based. There are, however, technical issues with adapting this method in our context; e.g., unlike in bounded arithmetic, our witnessing circuits can fail with a significant probability. Our main contribution is in finding the right notions which allow the arguments to go through (in both directions).

A competing notion of constructive circuit lower bounds has been developed in the influential theory of natural proofs of Razborov and Rudich [40], which explains why many of the existing lower bound methods cannot yield separations such as P ≠ NP. Natural proofs are known to be equivalent to the existence of efficient learning algorithms, cf. [5]. For example, P/poly-natural proofs useful against P/poly are equivalent to subexponential-size circuits learning p-size circuits over the uniform distribution with membership queries. Furthermore, natural proofs have been used to derive unprovability results in proof complexity as well, specifically, to derive unprovability of circuit lower bounds in proof systems with the feasible interpolation property, cf. [38, 16]. Despite similar applications and motivations for defining these concepts, the relation between natural proofs and the witnessing method has not been clear.
In fact, a priori the 'static' definition of natural proofs appears to be quite orthogonal to the witnessing from Theorem 1. Theorem 1 thus not only extends the scope of the natural proofs barrier by providing another equivalent characterisation which incorporates interactivity but also helps to clarify its relation to the witnessing method.

Learning speedup.
Our second contribution is a simple proof of a generalized learning speedup of Oliveira and Santhanam [32]. Specifically, we show that for each superpolynomial function s, if for each constant k, circuits of size n^k are learnable by circuits of size s over the uniform distribution with random examples, then for each constant k and ε > 0, circuits of size n^k are learnable over the uniform distribution with membership queries by circuits of size O(s(n^ε)), cf. Theorem 6. We obtain the speedup by a more direct exploitation of a slightly modified NW-generator. In comparison to the proof from [32], this sidesteps the need to construct natural proofs and invoke the construction of Carmosino et al. [5]. A disadvantage of the method is that we need to assume learning with random examples instead of membership queries. Nevertheless, we present one more alternative proof of the learning speedup based on (a simple case of) Theorem 1, which allows us to start with membership queries, cf. Theorem 7. We emphasize, however, that behind all proofs of the learning speedup is essentially the same general idea of reconstructing, in this or that way, the base function of some form of the NW-generator.

Relation to hardness magnification and locality.
The generalized learning speedup can be interpreted as a nonlocalizable hardness magnification theorem reducing a complexity lower bound to a seemingly weaker one. In general, hardness magnification refers to an approach to strong complexity lower bounds developed in a series of recent papers, cf. Section 5. Unfortunately, while the approach avoids (in certain cases provably [6]) the natural proofs barrier, it suffers from a 'locality barrier': magnification theorems typically yield unconditional upper bounds for specific problems if the computational model in question is allowed to use oracles with small fan-in, but the existing lower bounds actually work even against the presence of local oracles. In fact, a better understanding of nonlocalizable lower bounds is essential for further progress on strong complexity lower bounds in general, see Section 5 for more details. A promising aspect of the learning

Footnote: P/poly-natural proofs useful against P/poly are defined as 2^{O(n)}-size circuits with 2^n inputs accepting a 1/2^{O(n)}-fraction of inputs and rejecting all inputs which represent truth-tables of Boolean functions on n inputs computable by p-size circuits, cf. Definition 1.

Learning from breaking cryptographic pseudorandom generators.
In Section 4 we survey known constructions of learning algorithms from distinguishers breaking pseudorandom generators (PRGs) or natural proofs. While several such constructions are known, the question of extracting efficient learning of p-size circuits from the non-existence of cryptographic PRGs remains open. A positive answer to this question would establish an interesting win-win situation: either safe cryptography or efficient learning is possible. In the already mentioned approach, Oliveira and Santhanam [32] showed that efficient learning of p-size circuits with membership queries follows from the non-existence of nonuniform pseudorandom function families. By a straightforward adaptation of the proof method behind their result we show that efficient learning of p-size circuits with random examples follows from the non-existence of succinct nonuniform pseudorandom function families, cf. Theorem 5. Finally, we point out that the desired construction of learning algorithms from the non-existence of cryptographic PRGs is closely related to a question of Rudich about turning demibits into superbits, cf. Section 4.4.

[n] denotes {1, ..., n}. Circuit[s] denotes fan-in two Boolean circuits of size at most s. The size of a circuit is the number of gates. A function f : {0,1}^n → {0,1} is γ-approximated by a circuit C if Pr_x[C(x) = f(x)] ≥ γ.

Definition 1 (Natural property [40]). Let m = 2^n and s, d : N → N. A sequence of circuits {C_m}_{m=1}^∞ is a Circuit[s(m)]-natural property useful against Circuit[d(n)] if

1. Constructivity. C_m has m inputs and size s(m),

2. Largeness. Pr_x[C_m(x) = 1] ≥ 1/m^{O(1)},

3. Usefulness.
For each sufficiently big m, C_m(x) = 1 implies that x is a truth-table of a function on n inputs which is not computable by circuits of size d(n).

Definition 2 (Pseudorandom generator). A function g : {0,1}^n → {0,1}^{n+1} computable by p-size circuits is a pseudorandom generator safe against circuits of size s(n) if for each circuit D of size s(n),

|Pr_{y ∈ {0,1}^{n+1}}[D(y) = 1] − Pr_{x ∈ {0,1}^n}[D(g(x)) = 1]| < 1/s(n).

Definition 3 (PAC learning). A circuit class C is learnable over the uniform distribution by a circuit class D up to error ε with confidence δ if there are randomized oracle circuits L^f from D such that for every Boolean function f : {0,1}^n → {0,1} computable by a circuit from C, when given oracle access to f, input 1^n and the internal randomness w ∈ {0,1}^*, L^f outputs the description of a circuit satisfying

Pr_w[L^f(1^n, w) (1 − ε)-approximates f] ≥ δ.

L^f uses non-adaptive membership queries if the set of queries which L^f makes to the oracle does not depend on the answers to previous queries. L^f uses random examples if the set of queries which L^f makes to the oracle is chosen uniformly at random.

In this paper, PAC learning always refers to learning over the uniform distribution.
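Definition 3 can be made concrete with a deliberately tiny instance: the concept class below consists of the 2n 'dictator' functions x_i and 1 − x_i (a stand-in for Circuit[s]), and the learner identifies the target exactly with n + 1 membership queries, so the output (1 − ε)-approximates f with ε = 0 and confidence δ = 1. The class and the learner are illustrative choices, not constructions from the paper.

```python
import random

def learn_dictator(oracle, n):
    """Learn a dictator function x_i or 1-x_i with membership queries."""
    zero = tuple([0] * n)
    base = oracle(zero)                    # membership query on 0^n
    for i in range(n):
        flipped = list(zero)
        flipped[i] = 1
        if oracle(tuple(flipped)) != base: # query each unit vector
            # hypothesis: x_i if base == 0, else 1 - x_i
            return lambda x, i=i, base=base: x[i] ^ base
    return lambda x, base=base: base       # constant-function fallback

n = 8
target = lambda x: x[3]                    # the hidden concept
h = learn_dictator(target, n)
for _ in range(100):                       # hypothesis agrees everywhere
    x = tuple(random.randint(0, 1) for _ in range(n))
    assert h(x) == target(x)
```

The queries here are non-adaptive in the sense of Definition 3: the set of queried points (0^n and the unit vectors) does not depend on the oracle's answers.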
Boosting confidence and reducing error.
The confidence of the learner can be efficiently boosted in a standard way. Suppose an s-size circuit L^f learns f up to error ε with confidence δ. We can then run L^f k times, test the output of L^f from every run with m new random queries and output the most accurate one. By Hoeffding's inequality, m random queries fail to estimate the error ε of an output of L^f up to γ with probability at most 2/e^{2γ²m}. Therefore the resulting circuit of size poly(s, m, k) learns f up to error ε + γ with confidence at least 1 − k/e^{2γ²m} − (1 − δ)^k ≥ 1 − k/e^{2γ²m} − e^{−kδ}. If we are trying to learn small circuits we can get even confidence 1 by fixing the internal randomness of the learner nonuniformly without losing much on the running time or the error of the output. It is also possible to reduce the error up to which L^f learns f without a significant blowup in the running time and confidence. If we want to learn f with a better error, we first learn an amplified version of f, Amp(f). Employing direct product theorems and the Goldreich-Levin reconstruction algorithm, Carmosino et al. [5, Lemma 3.5] showed that for each 0 < ε, γ < 1/2 it is possible to transform a Boolean function f with n inputs to a Boolean function Amp(f) with poly(n, 1/ε, log(1/γ)) inputs so that Amp(f) ∈ P/poly^f and there is a probabilistic poly(|C|, n, 1/ε, 1/γ)-time machine which, given a circuit C (1/2 + γ)-approximating Amp(f) and an oracle access to f, outputs with high probability a circuit (1 − ε)-approximating f. We thus typically ignore the optimisation of the confidence and error parameter in the rest of the paper.

The most direct way of turning circuit lower bounds into a certain type of learning can be described as follows.

Footnote: The simple observation from box A appeared in [27, Section 4.5] and [36]. I am not aware of a more systematic treatment of this concept.
Footnote: There are related models of learning such as the 'knows what it knows' model by Li-Littman-Walsh [22] and 'reliable learning' by Rivest-Sloan [41] which prohibit incorrect predictions in various ways. These models, however, follow the formalization of PAC learning in that the goal of the learner is to learn the target concept by accessing it. In box A we do not assume that the target concept f is determined on all inputs or prior to the given samples.

Box A. Prediction from lower bound. Suppose we are given bits f(y_1), ..., f(y_k) for n-bit strings y_1, ..., y_k defining a partial Boolean function f. We want to predict the value of f on a new input y_{k+1} ∈ {0,1}^n. A priori f(y_{k+1}) is not defined but we will interpret the minimal-size circuit C_f coinciding with f on y_1, ..., y_k as 'the right' prediction of f(y_{k+1}). That is, we want to find C_f(y_{k+1}). Here, we assume that the minimal circuit C_f determines the value f(y_{k+1}). Otherwise, there are two circuits C_1, C_2 of minimal size such that C_1(y_{k+1}) ≠ C_2(y_{k+1}), and therefore any prediction is equally good. Say that the size of the minimal circuit C_f is s. Then the task to predict the value C_f(y_{k+1}) can be formulated as the task to prove an s-size circuit lower bound of the form

∀ circuit C of size s:  ⋁_{i=1,...,k} C(y_i) ≠ f(y_i)  ∨  C(y_{k+1}) ≠ ε

for ε = 0 or ε = 1.

An interesting aspect of the prediction method described in box A is that by proving even a single circuit lower bound we can learn something about the function f (if we know the value s). More precisely, we predict C_f on a single input but do not necessarily gain knowledge of the values of C_f on other inputs. This 'instance-specific' learning should be contrasted with PAC learning, Definition 3, where one is required to generate a circuit predicting the target function f on most inputs.
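The prediction rule of box A can be sketched with a small explicit hypothesis class ordered by size in place of s-size circuits; the class of constants (size 0) and dictators (size 1) below is an illustrative assumption. Given samples (y, f(y)), we predict on a new point with the minimal consistent hypothesis, and report None when two minimal hypotheses disagree, the case where any prediction is equally good.

```python
def build_hypotheses(n):
    """(size, function) pairs: constants of size 0, dictators of size 1."""
    hs = [(0, lambda x: 0), (0, lambda x: 1)]
    for i in range(n):
        hs.append((1, lambda x, i=i: x[i]))
        hs.append((1, lambda x, i=i: 1 - x[i]))
    return hs

def predict(samples, y_new, n):
    """Box-A style prediction: value of a minimal consistent hypothesis."""
    consistent = [(sz, h) for sz, h in build_hypotheses(n)
                  if all(h(y) == v for y, v in samples)]
    if not consistent:
        return None
    min_sz = min(sz for sz, _ in consistent)
    vals = {h(y_new) for sz, h in consistent if sz == min_sz}
    return vals.pop() if len(vals) == 1 else None

samples = [((0, 1, 0), 1), ((1, 1, 1), 1), ((0, 0, 1), 0)]
# the unique minimal consistent hypothesis is x -> x[1]
print(predict(samples, (1, 0, 1), 3))   # prints 0
```

As in box A, the prediction on (1, 0, 1) is obtained without learning the hypothesis' values anywhere else; only the minimality and consistency with the samples are used.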
This, however, does not mean that it is easier to learn in the sense of box A: in Definition 3 we do not need to recognize when the prediction errs, while the prediction from box A is zero-error in the sense that it guarantees to output the right value of C_f(y_{k+1}).

Determining minimal circuit size.
A drawback of the observation in box A is that it requires knowledge of the size s of the minimal circuit C_f, which might be hard for the learner to determine. The size s could be determined by deciding t-size circuit lower bounds for t ∈ [s]. Perhaps a more practical way of addressing the issue is to take a sufficiently big approximate value s′ of s, choose a random t ∈ [s′] and prove t-size lower bounds (as in box A with t instead of s). If s′ ≤ n^{O(1)}, the probability that we have the right t is 1/n^{O(1)}. Then, by solving polynomially many t-size lower bounds (in order to predict C_f(y) on polynomially many y's), we can approximate the accuracy of our predictions. If the accuracy is not high, we can repeat the process with a new random

Footnote: Provability vs truth.
The definition of 'the right' prediction in terms of minimal circuits used in box A can be interpreted as an implicit (alternative) definition of truth. Consider, for example, that strings y_j encode statements in set theory ZFC and the value f(y_j) is 1 if and only if the statement encoded by y_j is provable in ZFC. It would be interesting to find out whether the minimal circuit coinciding with a sufficiently rich list of such samples (y_j, f(y_j)) determines a truth value of the Continuum Hypothesis or of the consistency of ZFC, statements which are independent of ZFC. Unfortunately, in general, such questions seem to be out of reach of contemporary mathematics.

t ∈ [s′]. The advantage of this method is that it does not rely on deciding correctly whether some particular t-size circuit lower bounds hold; we are actually allowed to err on some fraction of lower bounds. However, its predictions are no longer zero-error. A closely related argument is formalized in Section 4.

Proof complexity.
The prediction method from box A relies on the proof complexity of circuit lower bounds, cf. [20]. It would be interesting to find out if proving circuit lower bounds in standard proof systems suffices to construct learning circuits.
Question 1 (Learning interpolation). Is there a p-time function which, given an Extended Frege proof of a formula

⋁_{y ∈ A} C(y) ≠ f(y) ∨ C(x) ≠ ε,  for ε = 0 or ε = 1,

with free variables representing s-size circuits C with n inputs, a fixed set A of n-bit inputs of a sufficiently big size |A| = poly(s, n), a fixed n-bit string x ∉ A and values of f ∈ Circuit[s] on A, outputs a circuit (1/2 + 1/n)-approximating f?

We now give a construction of PAC learning algorithms from an interactive witnessing of circuit lower bounds. As discussed in the introduction, the implication can also be interpreted as a construction of PAC learning algorithms from a frequent interactive instance-specific learning.
Theorem 1 (Learning from interactive witnessing of lower bounds). Let d ≥ 2, k ≥ 1, K ≥ 1 and let H be a Boolean function with n inputs. Assume there are 2^{Kn}-size circuits W_1, ..., W_{b log n} with b = 2^{Kn} such that for each distribution R on n^{dk}-size circuits with n inputs there exists j ∈ [b] such that circuits W_{j,1}, ..., W_{j,log n} witness errors of n^{dk}-size circuits attempting to compute H in the following way.

Given an oracle access to a random n^{dk}-size circuit D(x) with n inputs, with probability at least 1 − 1/n over R, the following interactive protocol succeeds: after querying values of circuit D, W_{j,1} outputs a not-yet-queried x_1 ∈ {0,1}^n s.t. D(x_1) ≠ H(x_1), or W_{j,1} receives a correction in the form of bits D(x_1), H(x_1) s.t. D(x_1) = H(x_1). Having D(x_1), H(x_1) and the samples queried by W_{j,1}, W_{j,2} makes further queries to D and generates the second not-yet-queried candidate x_2 ∈ {0,1}^n for the claim D(x_2) ≠ H(x_2). If D(x_2) = H(x_2), W_{j,2} receives a correction and the protocol continues in this way until some W_{j,t}, for t ≤ log n, with access to all

Footnote: Notably, Razborov [39] established that weak proof systems such as Resolution operating with k-DNFs for small k do not have polynomial-size proofs of any superpolynomial circuit lower bound whatsoever, and he conjectured that this holds under a hardness assumption even for stronger systems such as Frege. The issue is, however, delicate because proof systems like Extended Frege are already capable of formalizing a lot of complexity theory, see e.g. [27], and it is perfectly plausible that if a circuit lower bound is provable at all, then it is efficiently provable in Extended Frege.
previous corrections and samples finds the right x_t which has not been queried by W_{j,1}, ..., W_{j,t} and witnesses D(x_t) ≠ H(x_t).

Then, circuits of size n^{dk} with n^d inputs can be learned by circuits of size 2^{K′n} over the uniform distribution with non-adaptive membership queries, with confidence 1/2^{K′n}, up to error 1/2 − 1/2^{K′n}, where K′ is a constant depending only on K.

Note that the witnessing circuits from Theorem 1 can work for an arbitrary function H and, for the circuits D on which the witnessing succeeds, the number of queries in each round is implicitly bounded by 2^n (since after querying D on all inputs it would be impossible to output a not-yet-queried input).

Proof.
The proof follows the main construction from [35, 17] in the context of learning. The main technical complication is caused by the fact that the witnessing circuits W_1, ..., W_{b log n} are allowed to fail on a significant fraction of inputs.

In order to derive the conclusion of the theorem it suffices to assume that the witnessing circuits work for distributions R induced by specific Nisan-Wigderson generators. Consider a Nisan-Wigderson generator based on a circuit C which we aim to learn. Specifically, for d ≥ 2 and n^d ≤ m ≤ n^{2d}, let A = {a_{i,j}}, i ∈ [2^n], j ∈ [m], be a 2^n × m matrix with n^d ones per row and J_i(A) := {j ∈ [m]; a_{i,j} = 1}. Then define an NW-generator NW_C : {0,1}^m → {0,1}^{2^n} as

(NW_C(w))_i = C(w|J_i(A)),

where w|J_i(A) are the bits w_j such that j ∈ J_i(A). For any d ≥
2, Nisan and Wigderson [29] constructed a 2^n × m matrix A with n^d ones per row and n^d ≤ m ≤ n^{2d} which is also an (n, n^d)-design, meaning that for each i ≠ j, |J_i(A) ∩ J_j(A)| ≤ n and |J_i(A)| = n^d. Moreover, there are n^d-size circuits which given i ∈ {0,1}^n and w ∈ {0,1}^m output w|J_i(A), cf. [5]. Therefore, if C has n^d inputs and size n^{dk}, then for each w ∈ {0,1}^m, (NW_C(w))_x is a function on n inputs x computable by circuits of size n^{dk}. We want to learn C by a circuit of size 2^{O(n)}.

Let R be the distribution on n^{dk}-size circuits defined so that a random circuit over R is (NW_C(w))_x for w ∈ {0,1}^m chosen uniformly at random. By the assumption of the theorem, we have 2^{Kn}-size circuits W_1, ..., W_{b log n} with b = 2^{Kn} such that for some j ∈ [b], for 1 − 1/n of all w ∈ {0,1}^m, circuits W_{j,1}, ..., W_{j,log n} find an error of the n^{dk}-size circuit (NW_C(w))_x attempting to compute H. We will use them in order to break, in a certain sense, the generator NW_C and reconstruct the circuit C.

For each w define a trace tr(C, w) = x_1, ..., x_t as the sequence of t ≤ log n strings generated by W_{j,1}, ..., W_{j,t} on (NW_C(w))_x such that W_{j,t} is the first circuit which succeeds in witnessing the error, i.e. H(x_t) ≠ (NW_C(w))_{x_t}. If circuits W_{j,1}, ..., W_{j,log n} do not find an error, x_t = x_{log n}. The trace is defined w.r.t. a fixed 'helpful' oracle Y providing corrections in the form of bits (NW_C(w))_x, H(x).

For u ∈ {0,1}^{n^d} and v ∈ {0,1}^{m−n^d} define r_x(u, v) ∈ {0,1}^m by putting the bits of u into positions J_x(A) and filling the remaining bits by v (in the natural order). We say that w ∈ {0,1}^m is good if the trace tr(C, w) ends with a string witnessing an error of circuit
(NW_C(w))_x, and bad otherwise. Similarly, given v ∈ {0,1}^{m−n^d} and x′ ∈ {0,1}^n, we say that u ∈ {0,1}^{n^d} is good if r_{x′}(u, v) is. The core claim of the proof is the existence of a frequent trace on which circuits W_{j,1}, ..., W_{j,log n} succeed in witnessing the error with significant advantage.

Claim 3.1.
There is a trace Tr = X_1, ..., X_t, t ≤ log n, such that for a fraction s ≥ 1/(6 · 2^{2n(t−1)} · 2^n · n) of all a ∈ {0,1}^{m−n^d} and a fraction s′ ≥ s of all u ∈ {0,1}^{n^d}, tr(C, r_{X_t}(u, a)) starts with Tr, and at least (2/3 − 2^t/n − 2/n) s′ 2^{n^d} of the u's are good and satisfy tr(C, r_{X_t}(u, a)) = Tr.
The trace Tr is constructed inductively: in step i we want to have X_1, ..., X_{i−1} such that for ≥ 1/2^{2n(i−1)} of all w's tr(C, w) strictly extends X_1, ..., X_{i−1}, and the fraction of good w's among these is ≥ 1 − 2^i/n. For i = 1 this holds by the assumption. Assume we have such X_1, ..., X_{i−1}; we want to extend them to X_1, ..., X_i. Since there are at most 2^n strings X_i, there is X_i such that for s′′ ≥ 1/(2 · 2^n · 2^{2n(i−1)}) of the w's tr(C, w) starts with X_1, ..., X_i and ≤ 2^i/n of these w's are bad. Otherwise, the fraction of good w's for which tr(C, w) strictly extends X_1, ..., X_{i−1} would be ≤ 1/2^n + 1 − 2^i/n < 1 − 2^{i−1}/n, if 2n ≤ 2^n. Now, either for ≥ (2/3)s′′ of the w's tr(C, w) stops at X_i (hence, for ≤ (1/3)s′′ of the w's the trace continues, and for ≤ 2^i s′′/n bad w's tr(C, w) starts with X_1, ..., X_i), or for ≥ (1/3)s′′ of the w's the trace strictly extends X_1, ..., X_i. In the latter case, for ≤ 2^i s′′/n bad w's tr(C, w) starts with X_1, ..., X_i, which means that the fraction of bad w's such that tr(C, w) strictly extends X_1, ..., X_i is ≤ 2 · 2^i/n.

Since for all w the length of tr(C, w) is bounded by log n, the process of extending X_1, ..., X_{i−1} has to stop at some step 1 ≤ i ≤ log n. That is, there is Tr = X_1, ..., X_t, t ≤ log n, such that for ≥ (2/3)s of the w's tr(C, w) = Tr, for ≤ (1/3)s of the w's tr(C, w) strictly extends Tr, and ≤ 2^t s/n of the w's such that tr(C, w) is consistent with Tr are bad, where s ≥ 1/(6 · 2^{2n(t−1)} · 2^n). The number of good w's such that tr(C, w) = Tr is at least (2/3 − 2^t/n) s 2^m. Therefore, ≥ s/n of the a's can be completed by s′ ≥ s/n of the u's to a string w = r_{X_t}(u, a) such that tr(C, w) starts with Tr and at least (2/3 − 2^t/n − 2/n) s′ 2^{n^d} of the u's are good and satisfy tr(C, r_{X_t}(u, a)) = Tr. This proves the claim.

For X ∈ {0,1}^n and a′ ∈ {0,1}^{m−n^d} let r_X(·, a′) be the bits of a′ in the positions of [m] \ J_X(A). Since A is an (n, n^d)-design, for any row x ≠ X at most n bits of r_X(·, a′)|J_x(A) are not set. For x ≠ X, let Y^{X,a′}_{x,C} be the set of all corrections provided by Y on x, C and r_X(u, a′)|J_x(A) for all u ∈ {0,1}^{n^d}. This includes queries to C on inputs r_X(u, a′)|J_x(A). The size of each set Y^{X,a′}_{x,C} is 2^{O(n)}.

We are ready to describe a circuit D′ that approximates C. First, choose uniformly at random a′ ∈ {0,1}^{m−n^d}, a trace X_1, ..., X_t with t ≤ log n, a bit maj ∈ {0,1} and j′ ∈ [b]. Query C so that all queries to C from the sets Y^{X_t,a′}_{x,C}, for x ≠ X_t, are obtained. In order to get access to all corrections from Y^{X_t,a′}_{X_1,C}, ..., Y^{X_t,a′}_{X_{t−1},C} we provide also the full truth-table of H as a nonuniform advice of D′. The truth-table of H is a single nonuniform advice of the learner which works for every C. Then D′ computes as follows. For each u ∈ {0,1}^{n^d} produce r_{X_t}(u, a′). Next, use W_{j′,1} to produce x_1. If a query of W_{j′,1} cannot be answered by Y^{X_t,a′}_{x,C} with x ≠ X_t, or x_1 ≠ X_1, output maj. Otherwise, use the advice from Y^{X_t,a′}_{X_1,C} to find out if H(X_1) = NW_C(r_{X_t}(u, a′))_{X_1}. If the equality does not hold, output maj. Otherwise, use W_{j′,2} to generate x_2 and continue in the same manner until W_{j′,t} produces x_t. If a query of W_{j′,t} cannot be answered by Y^{X_t,a′}_{x,C} with x ≠ X_t, or x_t ≠ X_t, output maj. Otherwise, output 0 iff H(X_t) = 1.
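The combinatorial design behind NW_C can be seen concretely with toy parameters: seed positions form a p × p grid and the set J indexed by a univariate polynomial over GF(p) is its graph, so two distinct sets intersect in fewer than d positions. The paper's (n, n^d)-design uses different parameters, but the mechanism is the same; the base function C below is an illustrative choice.

```python
import random

p, d = 5, 2   # toy field size and degree bound

def poly_eval(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs)) % p

def J(coeffs):
    """Seed positions read by the output indexed by this polynomial:
    the graph {(x, f(x))} laid out inside a p x p grid of positions."""
    return {x * p + poly_eval(coeffs, x) for x in range(p)}

polys = [(a, b) for a in range(p) for b in range(p)]   # all degree-<2 polys

# design property: distinct polynomials of degree < d agree on < d points,
# so the sets J_i pairwise intersect in fewer than d positions
for i in range(len(polys)):
    for j in range(i + 1, len(polys)):
        assert len(J(polys[i]) & J(polys[j])) < d

def nw_output_bit(C, w, coeffs):
    """One output bit of NW_C(w): C applied to the seed restricted to J."""
    return C(tuple(w[pos] for pos in sorted(J(coeffs))))

random.seed(0)
w = [random.randint(0, 1) for _ in range(p * p)]   # the seed
C = lambda bits: sum(bits) % 2                     # base function to hide
outputs = [nw_output_bit(C, w, f) for f in polys]  # p^2 output bits
assert len(outputs) == p ** 2
```

Each output bit depends on only p seed positions, and the small pairwise overlaps are exactly what makes the hybrid/reconstruction arguments over NW_C possible.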
The resulting circuit D′ has n^d inputs and size 2^{O(n)}, if m ≤ 2^n (which holds w.l.o.g.). By Claim 3.1, with probability at least 1/(6 · 2^{2n log n} · 2^{O(n log n)}) the learner guessed j′ = j, trace Tr and assignment a such that for at least (2/3 − 2^t/n − 2/n)s′ of all u ∈ {0,1}^{n^d}, D′ will successfully predict C(u). Moreover, for at most (1/3 + 2^t/n + 2/n)s′ of all u's, the trace extends Tr or starts with Tr but does not end with a string witnessing an error. Since with probability 1/2 the majority value of C on the remaining u's is maj,

Pr_u[D′(u) = C(u)] ≥ 1/2 + (2/3 − 2^t/n − 2/n)s/2.

The assumption from Theorem 1 is justified by the following lemma which establishes the converse.

Lemma 1 (Witnessing from learning). Let k ≥ 1, 0 < ε < 1 and 2^n/n² ≥ 2^{εn} ≥ n^k, and let H be a Boolean function with n inputs hard to (1 − 2/n)-approximate by circuits of size 2^{εn}. Assume Circuit[n^k] can be learned by Circuit[2^{εn}] over the uniform distribution with confidence 1, up to error ε′.

Then, there are 2^{O(n)}-size circuits W_1, ..., W_b with b = 2^n/n such that for each distribution R on n^k-size circuits with n inputs there exists j ∈ [b] such that given an oracle access to a random n^k-size circuit D(x) with n inputs, with probability at least 1 − ε′n over R, after ≤ 2^{εn} queries to circuit D, W_j outputs a not-yet-queried x ∈ {0,1}^n s.t. D(x) ≠ H(x).

Proof. By the assumption, there exists a 2^{εn}-size circuit W which for each n^k-size circuit D, given an oracle access to D, outputs a circuit C (1 − ε′)-approximating D. Since H is hard to (1 − 2/n)-approximate by circuits of size 2^{εn} ≤ 2^n/n², there are at least 2^n/n inputs which have not been queried by W and on which C fails to compute H. Therefore, a random input which has not been queried by W and on which C fails to compute H witnesses D(x) ≠ H(x) with probability ≥ 1 − ε′n. Let W_1, ..., W_b, b = 2^n/n, be circuits such that W_i simulates W and outputs the i-th input on which C fails to compute H, ignoring inputs which have been queried by W. The size of each W_i is 2^{O(n)} because it uses the whole truth table of H as a nonuniform advice. Let R be an arbitrary distribution on circuits of size n^k.
Since for each D at least a (1 − ε′n)-fraction of the W_i's succeed, there is a W_j which succeeds on a random D with probability ≥ 1 − ε′n over R.

Note that Theorem 1 together with Lemma 1 imply that for a suitable H it is possible to collapse the number of rounds in the interactive witnessing from Theorem 1 at the expense of witnessing errors of slightly smaller circuits (and a small increase in the running time of the witnessing).

Learning from witnessing lower bounds with white-box access.
Theorem 1 holds also under the stronger assumption that circuits W_1, ..., W_{b log n} witness errors of n^{dk}-size nondeterministic circuits D with n inputs (and ≤ n^{dk} nondeterministic bits), where D computes a function in Circuit[n^{dk}], i.e. D is a nondeterministic circuit computing a function in P/poly. Then it makes sense to allow W_1, ..., W_{b log n} to access the full description of a given nondeterministic circuit D. The conclusion of the resulting theorem remains valid, with the only difference that the learning algorithm is given the full description of an n^{dk}-size nondeterministic circuit with n^d inputs representing the target function (which is computable by an n^{dk}-size deterministic circuit with n^d inputs).

Comparison to witnessing in bounded arithmetic.
The existence of witnessing analogous to the one from Theorem 1 follows from the provability of circuit lower bounds in bounded arithmetic. If H: {0,1}^n → {0,1} is an NP function and n_0, k are constants, we can write down a ∀Σ^b_1 formula LB(H, n^k) stating that H is hard for circuits of size n^k:

∀n, n > n_0 ∀ circuit D of size ≤ n^k ∃y, |y| = n, D(y) ≠ H(y),

where D(y) ≠ H(y) is a Σ^b_1 formula stating that the circuit D on input y outputs the opposite of the value H(y). Here, Σ^b_1 is the class of formulas in the language of Cook's theory PV which define precisely the predicates from the Σ^p_1 level of the polynomial hierarchy, cf. [20]. By the KPT theorem [21], if PV proves LB(H, n^k), then there are finitely many poly(n)-time functions W_1, ..., W_l which witness the existential quantifiers of LB(H, n^k) (including the existential quantifier from the subformula D(y) ≠ H(y)) in the same interactive way as in Theorem 1, except that the corrections include strings standing for the innermost universal quantifier of LB(H, n^k) (which allow one to verify in p-time that D(y) ≠ H(y) has not been witnessed by the most recent candidates). Moreover, W_1, ..., W_l have access to the full description of a given circuit D and do not make queries to D but directly generate potential errors, cf. [35]. It is possible to change the formula LB(H, n^k) by introducing a parameter m satisfying 2^n = |m| so that the witnessing from the PV-provability of the new formula is given by circuits W_1, ..., W_l of size 2^{O(n)}. In such a case, H is allowed to be in NE.
We could allow H to be an arbitrary Boolean function if we formulated the lower bound in QBF proof systems instead of bounded arithmetic.

A crucial difference between the black-box witnessing from Theorem 1 and the white-box witnessing in bounded arithmetic is that, under standard hardness assumptions, the white-box witnessing of p-size circuit lower bounds for functions H such as SAT exists, cf. [27].
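To make the interactive shape of such witnessing concrete, here is a toy Student-Teacher simulation. Everything in it is a hypothetical stand-in: H is a random truth table, the "circuit" D is a function erring on a small random set, and the student simply samples candidates at random in place of the actual witnessing functions W_1, ..., W_l.

```python
import random

random.seed(0)
n = 10
N = 1 << n

# Hypothetical hard function H: a random truth table (a random function
# is hard for small circuits with high probability).
H = [random.randrange(2) for _ in range(N)]

def make_small_circuit(err_fraction=0.05):
    """Stand-in for a small circuit D trying to compute H: it agrees
    with H except on a small random set of inputs."""
    errors = set(random.sample(range(N), int(err_fraction * N)))
    return lambda y: H[y] ^ (y in errors)

def interactive_witnessing(D, rounds=2000):
    """Toy Student-Teacher loop: the student proposes a candidate error y;
    the teacher either confirms D(y) != H(y) or returns the correction
    (y, H(y)).  Here the student just samples candidates at random --
    a placeholder for the witnessing functions W_1, ..., W_l."""
    corrections = []                      # unused by this naive student
    for _ in range(rounds):
        y = random.randrange(N)           # student's candidate error
        if D(y) != H[y]:
            return y                      # an error of D is witnessed
        corrections.append((y, H[y]))     # teacher's correction
    return None

D = make_small_circuit()
found = interactive_witnessing(D)
print(found is not None and D(found) != H[found])  # prints True
```

With a 5% error rate, a few thousand random candidates witness an error with overwhelming probability; the actual KPT witnessing replaces the random student by finitely many p-time functions which exploit the teacher's corrections.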
Comparison to other witnessing theorems.
Lipton and Young [24] showed that for each Boolean function H hard for circuits of size O(n^{k+1}) there is a multiset of inputs A of size O(n^k), the so-called anticheckers, such that each n^k-size circuit fails to compute H on ≥ 1/3 of A. Therefore, for each distribution R on n^k-size circuits, some input from the set of anticheckers will witness an error of a random n^k-size circuit D (without a single query to D) with probability ≥ 1/3 over R. Using t rounds, the probability of witnessing an error can be increased to 1 − (2/3)^t. This can be done with ≤ n^{O(kt)} witnessing circuits W_{ij}. More precisely, we can let W_{i1}, ..., W_{it} be the i-th possible t-tuple of inputs from the set of anticheckers, for i < n^{O(kt)}. Theorem 1 shows that it is not possible to increase this probability further to 1 − 1/n using log n rounds unless p-size circuits can be learned efficiently.

Gutfreund, Shaltiel and Ta-Shma [12] showed that if P ≠ NP there is a p-time algorithm which, given a description of an n^k-time machine D, generates a set of ≤ 3 formulas such that D fails to solve SAT on one of them. Atserias [2] extended this by showing that if NP ⊄ BPP there is a probabilistic p-time algorithm which, given oracle access to an n^k-time machine D, outputs with probability ≥ 1/poly(n) a set of formulas such that D fails to solve SAT on one of them. These algorithms differ from the witnessing in Theorem 1 in several ways: they find errors of uniform algorithms, are allowed to generate errors of different lengths, generate errors with a significantly smaller probability than the probability required in Theorem 1, and the set of formulas generated by the algorithm of Atserias includes formulas on which the algorithm queried D.

Circuit lower bounds can be used to construct PAC learning algorithms also if we assume that they break pseudorandom generators.
The construction goes back to a relation between predictability and pseudorandomness which can be interpreted in terms of learning algorithms, as shown by Blum, Furst, Kearns and Lipton [3] and later extended by several other works. In this section we survey some of these connections, derive a construction of learning algorithms from the non-existence of succinct nonuniform pseudorandom function families, and show how these connections relate to a question of Rudich about turning demibits into superbits.

We start by recalling the construction from [3], which underlies all results in this section. For an n^c-size circuit C with n inputs, define a generator G_C: {0,1}^{mn} → {0,1}^{mn+m} which maps m n-bit strings x_1, ..., x_m to x_1, C(x_1), ..., x_m, C(x_m).

Lemma 2 (from [3]). There is a randomized p-time function L such that for every n^c-size circuit C, if an s-size circuit D satisfies

Pr[D(x) = 1] − Pr[D(G_C(x)) = 1] ≥ 1/s,

then the circuit C is learnable by L(D) over the uniform distribution with random examples, confidence 1/(2m²s), up to error 1/2 − 1/(2ms).

Proof. Given D, L(D) chooses a random i ∈ [m], random bits r_i, ..., r_m, and random n-bit strings x_1, ..., x_m except x_i, and queries the bits C(x_1), ..., C(x_{i−1}). For x_i ∈ {0,1}^n, let p_i := D(x_1, C(x_1), ..., x_{i−1}, C(x_{i−1}), x_i, r_i, ..., x_m, r_m). Then L(D) on x_i predicts the value C(x_i) by outputting ¬r_i if p_i = 1 and r_i otherwise. By the triangle inequality, a random i ∈ [m] satisfies Pr[p_i = 1] − Pr[p_{i+1} = 1] ≥ 1/(ms) with probability 1/m. Since the probability over r_i, ..., r_m, x_1, ..., x_m that L(D) predicts C(x_i) correctly is

(1/2) Pr[p_i = 1 | r_i ≠ C(x_i)] + (1/2)(1 − Pr[p_i = 1 | r_i = C(x_i)]),

and

Pr[p_i = 1] = (1/2) Pr[p_i = 1 | r_i = C(x_i)] + (1/2) Pr[p_i = 1 | r_i ≠ C(x_i)],

it follows that Pr_{x_i}[L(D)(x_i) = C(x_i)] ≥ 1/2 + 1/(2ms) with probability 1/(2m²s) over the internal randomness of L(D).

The proof of Lemma 2 implies that learning on average follows from breaking pseudorandom generators. Specifically, let R be a p-size circuit which, given r bits, outputs an n^c-size circuit C, and consider a generator G: {0,1}^{mn+r} → {0,1}^{mn+m} which applies R on its first r input bits in order to obtain a circuit C and then computes as the generator G_C on the remaining mn inputs. Breaking G implies that we can break G_C with significant probability over C drawn from the distribution induced by R. Consequently, breaking G means that we can learn a big fraction of n^c-size circuits w.r.t. R. Can we improve this average-case learning into a worst-case learning which works for all n^c-size circuits? Since efficient learning algorithms for p-size circuits yield natural properties useful against p-size circuits, which by [40] break pseudorandom generators, a positive answer would present an important dichotomy: cryptographic pseudorandom generators do not exist if and only if there are efficient learning algorithms for small circuits (with suitable parameters). This possibility has been explored by Oliveira-Santhanam [32] and Santhanam [43], cf. Section 4.3.

Question 2 (Dichotomy). Assume that for each ε < 1 there is no pseudorandom generator g: {0,1}^n → {0,1}^{n+1} computable in P/poly and safe against circuits of size 2^{n^ε} for infinitely many n. Does it follow that p-size circuits are learnable by circuits of size 2^{O(n^δ)}, for some δ < 1, with confidence 1/n, up to error 1/2 − 1/2^{O(n^δ)}?
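The predictor in the proof of Lemma 2 can be simulated directly. In the sketch below the target C (a parity function) and the distinguisher D are hypothetical stand-ins chosen so that D has a large distinguishing gap against G_C; the predictor itself follows the proof: pick a random hybrid position i, fill earlier blocks with queried labels and later blocks with random bits, and flip the guess when D fires.

```python
import random

random.seed(1)
n, m = 8, 2          # n-bit inputs, m blocks in the generator G_C

def C(x):
    """Toy target 'circuit': parity of x (a stand-in for an n^c-size circuit)."""
    return bin(x).count("1") & 1

def D(pairs):
    """Toy distinguisher: fires iff some labelled bit disagrees with C,
    so it accepts random strings far more often than outputs of G_C."""
    return 1 if any(b != C(x) for x, b in pairs) else 0

def L_of_D(x_target):
    """The predictor from Lemma 2: random hybrid position i, real labels
    (membership queries) before i, random bits from i on; predict the
    negation of r_i if D fires, r_i otherwise."""
    i = random.randrange(m)
    xs = [random.randrange(1 << n) for _ in range(m)]
    rs = [random.randrange(2) for _ in range(m)]
    pairs = [(xs[j], C(xs[j])) for j in range(i)]       # queried labels
    pairs.append((x_target, rs[i]))                     # the bit to predict
    pairs += [(xs[j], rs[j]) for j in range(i + 1, m)]  # random padding
    return 1 - rs[i] if D(pairs) == 1 else rs[i]

trials = 5000
acc = sum(L_of_D(x) == C(x)
          for x in (random.randrange(1 << n) for _ in range(trials))) / trials
print(acc)   # noticeably above 1/2 (about 7/8 for this toy D)
```

For this particular D one can compute the expected accuracy exactly (7/8), far above the Ω(1/ms) advantage guaranteed in general.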
The proof of Lemma 2 also shows that we can construct a worst-case learning algorithm assuming that, given oracle access to a pseudorandom generator, we can efficiently produce its distinguisher. In particular, a single method breaking all pseudorandom generators would suffice.
Definition 4. The circuit size problem GCSP[s, k] is the problem to decide whether, for a given list of k samples (y_i, b_i), y_i ∈ {0,1}^n, b_i ∈ {0,1}, there exists a circuit C of size s computing the partial function defined by the samples (y_i, b_i), i.e. C(y_i) = b_i for the given k samples (y_i, b_i). The parameterized minimum circuit size problem MCSP[s] stands for GCSP[s, 2^n], where the list of 2^n samples defines the whole truth-table of a Boolean function.

If we were extraordinarily good at proving circuit lower bounds, we could solve GCSP efficiently. Note that MCSP[n^{O(1)}] ∈ P/poly is a stronger assumption than the existence of a P/poly-natural property useful against P/poly, which breaks pseudorandom generators. The following theorem appeared (in different terminology) in Vadhan [45], see also [15].

Theorem 2 (Learning from succinct natural proofs). Assume GCSP[n^c, n^d] ∈ P/poly for constants d > c + 1. Then, Circuit[n^c] is learnable by P/poly over the uniform distribution with random examples, confidence 1/poly(n), up to error 1/2 − 1/poly(n).

Proof. As the number of partial Boolean functions on a given set of m inputs is 2^m and the number of n^c-size circuits is bounded by 2^{n^{c+1}}, GCSP[n^c, n^d] ∈ P/poly implies that for m = n^d there are p-size circuits D such that for each n^c-size circuit C,

Pr[D(x) = 1] − Pr[D(G_C(x)) = 1] ≥ 1/2.

Now, it suffices to apply Lemma 2.

4.2 Worst-case learning from natural proofs
In Theorem 2, we can learn f ∈ Circuit[n^c] even if the algorithm for GCSP works just for a significant fraction of partial truth-tables (y_1, b_1), ..., (y_{n^d}, b_{n^d}), with zero error on easy partial truth-tables. Carmosino, Impagliazzo, Kabanets and Kolokolova [5] proved that the assumption of Theorem 2 can be weakened to the existence of a standard natural property. The price for this is that the resulting learning uses membership queries instead of random examples. The crucial idea is similar to the proof of Theorem 1: apply the natural property (as an algorithm for a suitable GCSP) on a Nisan-Wigderson generator NW_f based on the function f which we want to learn.

Theorem 3 (Learning from natural proofs [5]). Let R be a P/poly-natural property useful against Circuit[n^d] for some d ≥ 1. Then, for each γ ∈ (0, 1), Circuit[n^k] is learnable by Circuit[2^{O(n^γ)}] over the uniform distribution with non-adaptive membership queries, confidence 1, up to error 1/n^k, where k = dγ/a and a is an absolute constant.

Oliveira and Santhanam [32] showed that the assumption of the existence of natural proofs from Theorem 3 can be further weakened to the existence of a distinguisher breaking non-uniform pseudorandom function families. Their result follows from a combination of Theorem 3 and the Min-Max Theorem. Using their strategy, but combining the Min-Max Theorem with Theorem 2, learning algorithms with random examples can be obtained from distinguishers breaking succinct non-uniform pseudorandom function families.

A two-player zero-sum game is specified by an r × c matrix M and is played as follows. MIN, the row player, chooses a probability distribution p over the rows. MAX, the column player, chooses a probability distribution q over the columns. A row i and a column j are drawn randomly from p and q, and MIN pays M_{i,j} to MAX. MIN plays to minimize the expected payment, MAX plays to maximize it. The rows and columns are called the pure strategies available to MIN and MAX, respectively, while the possible choices of p and q are called mixed strategies. The Min-Max theorem states that playing first and revealing one's mixed strategy is not a disadvantage:

min_p max_j Σ_i p(i) M_{i,j} = max_q min_i Σ_j q(j) M_{i,j}.

Note that the second player need not play a mixed strategy: once the first player's strategy is fixed, the expected payoff is optimized for the second player by playing some pure strategy. The expected payoff when both players play optimally is called the value of the game.
We denote it v(M). A mixed strategy is k-uniform if it chooses uniformly from a multiset of k pure strategies. Let M_min = min_{i,j} M_{i,j} and M_max = max_{i,j} M_{i,j}. Newman [28], Althöfer [1] and Lipton-Young [24] showed that each player has a near-optimal k-uniform strategy for k proportional to the logarithm of the number of pure strategies available to the opponent.

Theorem 4 ([28, 1, 24]). For each ε > 0 and k ≥ ln(c)/ε²,

min_{p ∈ P_k} max_j Σ_i p(i) M_{i,j} ≤ v(M) + ε(M_max − M_min),

where P_k denotes the k-uniform strategies for MIN. The symmetric result holds for MAX.

Definition 5 (Succinct non-uniform PRF). An (m, m′)-succinct non-uniform pseudorandom function family from a circuit class C safe against circuits of size s is a set S of partial truth-tables ⟨(x_1, b_1), ..., (x_m, b_m)⟩, where each x_i is an n-bit string and b_i ∈ {0,1}, such that each partial truth-table from S is computable by one of m′ circuits from C and for every circuit D of size s,

Pr_x[D(x) = 1] − Pr_{x ∈ S}[D(x) = 1] < 1/s,

where the first probability is taken over x ∈ {0,1}^{m(n+1)} chosen uniformly at random and the second probability over partial truth-tables x chosen uniformly at random from S.

Theorem 5 (Learning or succinct non-uniform PRF). Let c ≥ 1 and s > n, m ≥ 1. There is an (m, s)-succinct non-uniform PRF in Circuit[n^c] safe against Circuit[s], or there are circuits of size poly(s) learning Circuit[n^c] over the uniform distribution with random examples, confidence 1/poly(s), up to error 1/2 − 1/poly(s).

Proof. Consider a two-player zero-sum game specified by a matrix M with rows indexed by n^c-size circuits with n inputs and columns indexed by s-size circuits with m(n+1) inputs.
Define the entry M_{C,D} of M corresponding to a row circuit C and a column circuit D as

M_{C,D} := |Pr_x[D(x) = 1] − Pr_x[D(G_C(x)) = 1]|

for the generator G_C from the proof of Lemma 2. Hence M_max − M_min ≤ 1.

If v(M) ≥ 1/2s, then by Theorem 4 (with ε = 1/4s), there exists a multiset of k ≤ O(n^{c+1}s²) s-size circuits D_1, ..., D_k such that for every n^c-size circuit C, a random D from D_1, ..., D_k satisfies

E[|Pr[D(x) = 1] − Pr[D(G_C(x)) = 1]|] ≥ 1/4s.

By Lemma 2, for every n^c-size circuit C, one of the circuits D_1, ..., D_k (or their negations) can be used to learn C with confidence 1/poly(s), up to error 1/2 − 1/poly(s). A poly(s)-size circuit using a random D_i from D_1, ..., D_k or its negation thus learns Circuit[n^c] with random examples, confidence 1/poly(s), up to error 1/2 − 1/poly(s).

If v(M) < 1/2s, then by Theorem 4 (with ε = 1/4s), there exists a multiset of k ≤ O(s³ log s) n^c-size circuits C_1, ..., C_k such that for every s-size circuit D, a random C from C_1, ..., C_k satisfies

E[|Pr[D(x) = 1] − Pr[D(G_C(x)) = 1]|] ≤ 3/4s.

Since E[|Pr[D(x) = 1] − Pr[D(G_C(x)) = 1]|] ≥ |Pr[D(x) = 1] − E[Pr[D(G_C(x)) = 1]]|, a generator G: {0,1}^{mn + ⌈log k⌉} → {0,1}^{mn + m}, which takes as input a string of length mn + ⌈log k⌉ encoding (an index of) a circuit C from C_1, ..., C_k together with m n-bit strings x_1, ..., x_m and outputs x_1, C(x_1), ..., x_m, C(x_m), is safe against circuits of size s.
The range of G defines an (m, s)-succinct non-uniform PRF in Circuit[n^c] safe against Circuit[s].

Note that the existence of a generator G as in the proof of Theorem 5 follows directly from a counting argument if we do not require that G defines a PRF of small complexity: a random set of poly(s, n) strings (yielding a non-uniform pseudorandom generator mapping {0,1}^{O(log s)} to {0,1}^n) fools circuits of size s.

Rudich [42] proposed a conjecture about the existence of superbits, a version of pseudorandom generators safe against nondeterministic circuits, and showed that it rules out the existence of NP-natural properties against P/poly. He then asked whether the existence of superbits follows from the seemingly weaker assumption of the existence of so-called demibits. We note that an affirmative answer to his question would resolve Question 2 in the nondeterministic setting.

Definition 6 (Superbit). A function g: {0,1}^n → {0,1}^{n+1} computable by p-size circuits is a superbit if there is ε < 1 such that for infinitely many input lengths n, for all nondeterministic circuits C of size |C| ≤ 2^{n^ε},

Pr_{x ∈ {0,1}^{n+1}}[C(x) = 1] − Pr_{x ∈ {0,1}^n}[C(g(x)) = 1] < 1/|C|.

Definition 7 (Demibit). A function g: {0,1}^n → {0,1}^{n+1} computable by p-size circuits is a demibit if there is ε < 1 such that for infinitely many input lengths n, no nondeterministic circuit C of size |C| ≤ 2^{n^ε} satisfies

Pr_{x ∈ {0,1}^{n+1}}[C(x) = 1] ≥ 1/|C| and Pr_{x ∈ {0,1}^n}[C(g(x)) = 1] = 0.

Proposition 1 (Question 2 vs Rudich's problem). Assume the existence of demibits implies the existence of superbits.
Then, either superbits exist or for each c ≥ 1 and each ε < 1, Circuit[n^c] is learnable by Circuit[2^{O(n^ε)}] over the uniform distribution with random examples, confidence 1/2^{O(n^ε)}, up to error 1/2 − 1/2^{O(n^ε)}, where the learner is allowed to generate a nondeterministic or co-nondeterministic circuit approximating the target function.

Proof. Assume superbits do not exist and that their non-existence implies the non-existence of demibits. Consider a generator G: {0,1}^{mn + n^{c+1}} → {0,1}^{mn + m}, with m = n^{c+1} + 1, which interprets the first n^{c+1} bits of its input as a description of an n^c-size circuit C and then computes on the remaining mn inputs as the generator G_C from Lemma 2. Since G is not a demibit, for each ε < 1 there is a nondeterministic circuit D of size 2^{(mn+m−1)^ε} such that for each n^c-size circuit C,

Pr[D(x) = 1] − Pr[D(G_C(x)) = 1] ≥ 1/|D|.

By the proof of Lemma 2, this means that n^c-size circuits are learnable by circuits of size poly(|D|) with confidence 1/poly(|D|), up to error 1/2 − 1/poly(|D|), except that the learner might generate a nondeterministic (if r_i = 0) or co-nondeterministic (if r_i = 1) circuit approximating the target function.

A striking consequence of the relation between natural proofs and learning algorithms is a learning speedup of Oliveira and Santhanam [32]. Suppose P/poly is learnable by circuits of weakly subexponential size 2^n/n^{ω(1)}. The learning circuits can be used to accept truth-tables of all functions in P/poly, while their size guarantees that many hard functions are going to be rejected. This implies the existence of a P/poly-natural property useful against P/poly, which, by Theorem 3, gives us circuits of strongly subexponential size 2^{n^γ}, γ < 1, learning P/poly.

The argument of Oliveira and Santhanam can be generalized to a speedup of learners of arbitrary size s. Here, we show how to derive such a generalized version more directly, without constructing natural proofs and invoking Theorem 3. This is possible thanks to a more direct exploitation of a slightly modified NW-generator. A drawback of the approach is that we need to assume learning with random examples instead of membership queries.

Theorem 6 (Generalized speedup). Let d, k ≥ 1 and n ≤ s(n) ≤ 2^n/n. Assume Circuit[n^{dk}] is learnable by Circuit[s(n)] over the uniform distribution with random examples, confidence 1, up to error 1/2 − 5/n. Then circuits of size m^k with m = n^d inputs are learnable by circuits of size n^{dK}(s(n))² over the uniform distribution with non-adaptive membership queries, confidence 1/n³, up to error 1/2 − 1/n. Here, K is an absolute constant.

For example, if p-size circuits are learnable by circuits of size n^{O(log n)}, then p-size circuits are learnable with membership queries by circuits of size O(n^{ε log n}), for each ε > 0. The speedup is achieved w.r.t. the input length of target functions at the expense of their circuit complexity.
Proof.
Let A be a 2^b × u (b, n^d)-design with |J_i(A)| = n^d, for n^{2d} ≤ u ≤ 4n^{2d}, a constant d and a parameter b such that ns ≤ 2^b ≤ 2ns. The design is constructed in the usual way, by evaluating polynomials of degree ≤ b on n^d points of a field with n^d ≤ p ≤ 2n^d elements. In particular, there are n^{O(d)}-size circuits which, given i ∈ {0,1}^b and w ∈ {0,1}^u, output w|J_i(A). Define the NW_f-generator mapping strings w of length u to strings of length 2^n as

(NW_f(w))_{x_1,...,x_n} = f(w|J_{x_1,...,x_b}(A)).

Then for each m-input function f ∈ Circuit[m^k] and w ∈ {0,1}^u, (NW_f(w))_x is computable as a function of x ∈ {0,1}^n by a circuit of size n^{dk}.

By the assumption of the theorem, every such circuit (NW_f(w))_x is learnable by a circuit L of size s with confidence δ = 1, up to error 1/2 − ε, where ε = 5/n. Consequently, there is a circuit D_f of size O(s) such that

Pr_{w,x,y_1,...,y_t}[D_f(x_1, ..., x_n, w, y_1, ..., y_t) = f(w|J_{x_1,...,x_b}(A))] ≥ (1/2 + ε)δ    (5.1)

where D_f queries the values f(w|J_{y_j}(A)) for t ≤ s random strings y_j ∈ {0,1}^b, j = 1, ..., t. The size of D_f takes into account the need to simulate the circuit described by L. Now, random y_1, ..., y_t satisfy

Pr_{w,x}[D_f(x_1, ..., x_n, w, y_1, ..., y_t) = f(w|J_{x_1,...,x_b}(A))] ≥ 1/2 + ε − 1/n    (5.2)

with probability at least 1/n. Otherwise, the probability in (5.1) would be < 1/n + (1/2 + ε − 1/n). Similarly, given y_1, ..., y_t such that (5.2) holds, a random x ∈ {0,1}^n satisfies

Pr_w[D_f(x_1, ..., x_n, w, y_1, ..., y_t) = f(w|J_{x_1,...,x_b}(A))] ≥ 1/2 + ε − 2/n    (5.3)

with probability at least 2/n. Moreover, since every y_j specifies 2^{n−b} values of (NW_f(w))_x, given y_1, ..., y_t, a random x ∈ {0,1}^n equals some y_j on the first b bits with probability ≤ t/2^b ≤ 1/n. Applying the same averaging one more time, for y_1, ..., y_t and x which differs on the first b bits from each y_j and satisfies (5.3), randomly fixed u − n^d bits of w on the positions of [u] \ J_x(A) preserve the probability (5.3) up to an additional error 1/n with probability at least 1/n.

For each y_1, ..., y_t, each x which differs on the first b bits from every y_j, and each fixation of u − n^d bits of w on the positions of [u] \ J_x(A), the (b, n^d)-design guarantees that the number of all queries f(w|J_{y_j}(A)), j = 1, ..., t, of D_f over all possible w with the u − n^d fixed bits is ≤ t2^b. We can thus learn a circuit D′ approximating f ∈ Circuit[m^k] with m = n^d inputs with advantage 1/2 + ε − 4/n in the following way. Choose random y_1, ..., y_t, x, random u − n^d bits of w corresponding to [u] \ J_x(A), and query the ≤ t2^b values f(w|J_{y_j}(A)) for all possible w with the u − n^d fixed bits. Then the circuit D′, given the n^d bits of w corresponding to J_x(A), generates w and computes as D_f with the provided queries f(w|J_{y_j}(A)). Since w can be constructed from the given n^d bits, x and the u − n^d fixed bits of w by a circuit of size n^{O(d)}, each w|J_{y_j}(A) can be constructed from w and y_j by a circuit of size n^{O(d)}, and for each query to f the right value can be selected by a circuit of size O(n^d t2^b), the size of D′ is O(s + tn^d + n^d t2^b + n^{O(d)}) ≤ n^{O(d)}s². D′ can be described by n^{dK}s² bits, for an absolute constant K, and constructed by a circuit of the same size which just substitutes y_j, x and the u − n^d bits of w in the otherwise fixed description of D′.

Since random y_1, ..., y_t satisfy (5.2) with probability at least 1/n, a random x differs on the first b bits from each y_1, ..., y_t and satisfies (5.3) with probability at least 1/n, and the randomly fixed u − n^d bits of w have the desired property with probability at least 1/n as well, the confidence of the learning algorithm is at least 1/n³.

We give one more proof of the learning speedup which also addresses the issue of membership queries.

Theorem 7 (Alternative speedup). Let d ≥ k ≥ 1 and ε < 1. Assume Circuit[n^{dk}] is learnable by Circuit[2^{εn}] over the uniform distribution (possibly with membership queries) with confidence 1, up to error 1/n². Then, circuits of size n^{dk} with n^d inputs are learnable by circuits of size 2^{Kn} over the uniform distribution with confidence 1/2^{Kn}, up to error 1/2 − 1/2^{Kn}, where K is an absolute constant.

Proof. By a counting argument, there exists an H which is not (1 − 1/n)-approximable by circuits of size 2^{εn}. Here, n is w.l.o.g. sufficiently big. By Lemma 1, learnability of Circuit[n^{dk}] by Circuit[2^{εn}] up to error 1/n² implies the existence of circuits of size 2^{O(n)} witnessing errors of circuits of size n^{dk} with probability ≥ 1 − 1/n. The conclusion thus follows by applying Theorem 1. The improved confidence and approximation parameter is a consequence of the fact that our witnessing circuits succeed in the first round, i.e. t = 1.

Proof-search speedup.
The core trick behind Theorem 6 can be formulated in the context of proof complexity. Assume that an n^{dk}-size lower bound is provable in a proof system P by a proof of size s(n). Then, a substitutional instance of the same P-proof of size s(n) proves an m^k-size lower bound for circuits with m = n^d inputs, on inputs given by the NW-generator from the proof of Theorem 6. Here, the base function of the NW-generator is not specified but represented by free variables encoding a circuit of size m^k.

Nonlocalizable hardness magnification. Theorem 6 and the original speedup of Oliveira and Santhanam can be interpreted as hardness magnification theorems. Hardness magnification is an approach to strong complexity lower bounds by reducing them to seemingly much weaker lower bounds, developed in a series of recent papers [33, 27, 31, 25, 9, 10, 7, 6, 8, 26, 11]; see [6] for a more comprehensive survey. For example, it turns out that in order to prove that functions computable in nondeterministic quasipolynomial time are hard for NC¹ it suffices to show that a parameterized version of the minimum circuit size problem MCSP is hard for AC⁰[2]. However, [6] identified a locality barrier which explains why direct adaptations of many existing lower bounds do not yield strong complexity lower bounds via hardness magnification. Essentially, the reason is that the existing lower bounds for explicit Boolean functions often work even for models which are allowed to use arbitrary oracles with n^{o(1)}-small fan-in. This is easy to see in the case of AC⁰[2] lower bounds: oracles of small fan-in can be simulated by polynomials of low degree. On the other hand, hardness magnification theorems typically yield (unconditional) upper bounds in the form of weak computational models extended with local oracles computing specific problems such as the abovementioned version of MCSP.
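As an aside on the machinery above: the combinatorial designs underlying the NW_f-generator in the proof of Theorem 6 are easy to generate explicitly. The sketch below uses toy parameters (field size p = 11, degree bound b = 3) rather than those of the theorem: the set attached to a polynomial is its graph inside the universe F_p × F_p, and since distinct polynomials of degree < b agree on fewer than b points, pairwise intersections have size < b while each set has size p.

```python
import random
from itertools import product

p = 11     # field size; each set has p elements (toy parameters)
b = 3      # coefficients per polynomial; pairwise intersections are < b

def J(q):
    """Set attached to polynomial q (coefficients low->high): its graph
    {(a, q(a)) : a in F_p} inside the universe F_p x F_p."""
    return {(a, sum(c * pow(a, k, p) for k, c in enumerate(q)) % p)
            for a in range(p)}

designs = [J(q) for q in product(range(p), repeat=b)]  # p^b = 1331 sets

# spot-check the design property on random pairs of distinct sets
random.seed(2)
for _ in range(200):
    i, j = random.sample(range(len(designs)), 2)
    assert len(designs[i]) == p              # each set has p elements
    assert len(designs[i] & designs[j]) < b  # small pairwise intersection
print("design property holds on sampled pairs")
```

The same construction scales to the parameters of Theorem 6 by taking p between n^d and 2n^d and degree bound b, giving 2^b sets of size n^d in a universe of size at most 4n^{2d}.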
In fact, even irrespective of hardness magnification it is important to develop lower bound methods which do not localize: proving the nonexistence of subexponential-size learning algorithms for P/poly would imply the nonexistence of P/poly-natural properties against P/poly, but it is not hard to see that natural properties against P/poly are computable by p-size circuits with local oracles. Overcoming the locality barrier is thus essential for proving strong complexity lower bounds in general.

Theorem 6, if read contrapositively, is a magnification of O(n^{ε log n})-size lower bounds for learning p-size circuits to n^{O(log n)}-size lower bounds. This differs from previous hardness magnification theorems by avoiding localization: the size of the learner plays a crucial role in the reduction and therefore cannot be simply replaced by an arbitrary oracle. The same trick is behind the non-blackbox worst-case to average-case reductions within NP of Hirahara [13]. To the best of my knowledge, the only other hardness magnification theorems with this property appeared in [6] and [14]. [6, Theorem 1], like Hirahara [13], concerns a version of MCSP whose localized version does not hold (as witnessed by other hardness magnification theorems). Theorem 6 does not seem to localize in this sense either: it asks for an n^{ε log n}-size lower bound on learning algorithms, while there seems to be no reason to expect that p-size circuits are learnable by circuits of size O(n^{ε log n}) extended with oracles of fan-in n^{o(1)}. (Such a localization would mean that p-size circuits are learnable in subexponential size.) The magnification theorems of Hirahara [14] face similar complications.

Unfortunately, Theorem 6 does not reduce p-size lower bounds to, say, subquadratic lower bounds: it magnifies n^{O(d)}s-size lower bounds for learning functions with m = n^d inputs (and circuit complexity m^k) to an s-size lower bound for learning functions with n inputs (and circuit complexity n^{dk}). That is, a polynomial speedup w.r.t. the input length of target functions is traded for a polynomial decrease of the circuit size of target functions. Ideally, we would like to magnify, say, n^{1.1}-size formula lower bounds for learning circuits of size n^{1.1} with n inputs to n^{O(1)}-size formula lower bounds for learning circuits of size n^{1.1} with n inputs. If the existing methods for proving the required formula lower bounds were applicable to prove subquadratic formula lower bounds for learning algorithms (note that such lower bounds are allowed to localize and naturalize), such a strengthening of Theorem 6 would lead to explicit NC¹ lower bounds.

Footnote 1: Some known circuit lower bounds above the magnification threshold are provably nonlocalizable, but they do not fit the framework of the so-called Hardness Magnification frontier [6], one reason being that they do not work for explicit and natural problems, cf. [6, 8]. For example, a nonlocalizable lower bound from [6] works for a function in E which is artificial in the sense that it is designed to avoid localization, not for a problem of independent interest such as MCSP. Oliveira [30] showed that near superlinear-size lower bounds for a version of MCSP defined w.r.t. a notion of randomized Kolmogorov complexity imply strong circuit lower bounds, while the same problem is provably hard for probabilistic p-time. The lower bound of Oliveira works, however, only against uniform models of computation. Moreover, the magnification theorem concludes at best a 'weak' lower bound of the form quasipolynomial time QP being hard for P/poly. Similarly, an approach of Chen, Jin and Williams [8] via derandomizations and uniform obstructions appears to avoid the locality barrier but yields at best lower bounds of the form QP ⊄ P/poly.

Footnote 2: There are two more results which could potentially be classified as nonlocalizable hardness magnifications. A theorem of Buresh-Oppenheim and Santhanam [4, Theorem 1] is based on an exploitation of Nisan-Wigderson generators similar to that of [6], but it seems less practical in its current form, as it magnifies only lower bounds for nondeterministic circuits. The other result, of Tal [44], shows that average-case hardness for formulas of size s can be magnified to worst-case hardness for slightly bigger formulas. A problem is that [44] magnifies at best to an s-size lower bound. Moreover, if we wanted to strengthen it further by connecting it with another magnification theorem, it is not clear how to preserve the nonlocalizability: the weak lower bound obtained via [44] would likely localize. Hirahara [14, Theorems 11 and 13] proves two types of magnification theorems. The first type essentially adapts the result from [6] in the context of weaker computational models. The second type extends it by introducing metacomputational circuit lower bound problems (MCLPs) and showing that weak lower bounds for MCLPs can be magnified as well. MCLPs are not solvable by any algorithm whatsoever unless standard hardness assumptions break. This implies that there is no unconditional upper bound for MCLPs and the locality barrier does not apply. Unfortunately, we do not have any interesting lower bound for MCLPs either. The corresponding magnification theorems thus do not establish a Hardness Magnification frontier [6]. Nevertheless, as suggested in [14], developing such methods might be a way to strong lower bounds.

The methods for deriving learning algorithms from circuit lower bounds presented in this paper might be improvable in many ways.

Safe cryptography or efficient learning. Perhaps the most appealing question asks for bridging cryptography and learning theory. Showing that efficient learning follows from breaking pseudorandom generators, i.e. answering Question 2 positively, would establish a remarkable win-win situation. As discussed in Section 4.4, the question is closely related to a problem of Rudich about turning demibits into superbits.
Instance-specific learning vs PAC learning.
Circuit lower bounds correspond to a simple instance-specific learning model described in Section 3. Can we improve our understanding of the model and its relation to PAC learning? In particular, can we determine how much we can learn from a single circuit lower bound? A possible formalization of the problem is given by Question 1.
Connections to proof complexity.
The present paper brings several methods from proof complexity to learning theory. It seems likely that these connections can be strengthened. A particularly relevant part of proof complexity is the theory of proof complexity generators, cf. [18]. An interesting conjecture in the area due to Razborov [39] implies a conditional hardness of circuit lower bounds in strong proof systems. In other words, Razborov's conjecture asks for turning short proofs of circuit lower bounds into upper bounds breaking standard hardness assumptions. Notably, strengthening Theorem 1 by allowing white-box access in the witnessing of lower bounds would lead to a conditional unprovability of p-size lower bounds for
SAT in Cook's theory PV. A complication is that under standard hardness assumptions such a witnessing exists. That is, in order to obtain the conditional unprovability, one might need to exploit the PV-provability in a deeper way. Nevertheless, this suggests a simplified version of Question 2: can we prove a disjunction stating the PV-consistency of the existence of strong pseudorandom generators or the PV-consistency of efficient learning? Since, by witnessing theorems in PV, both the PV-provability of the non-existence of pseudorandom generators and the PV-provability of the impossibility of efficient learning imply uniform efficient algorithms witnessing these facts, it could be possible to combine them with a version of uniform MinMax [46] to get a contradiction.

Nonlocalizable hardness magnification near the existing lower bounds.
Can we push forward the program of hardness magnification by strengthening the magnification from Theorem 6 to a setting in which strong circuit lower bounds follow from lower bounds near the already existing ones? The importance of the question stems from the necessity of developing nonlocalizable magnification theorems or nonlocalizable constructive lower bound methods, as discussed in Section 5.
SAT solving circuit lower bounds.
It would be interesting to investigate practical consequences of the provability of circuit lower bounds. Circuit lower bounds for explicitly given Boolean functions are coNP statements, which means that they are encodable into propositional tautologies, resp. SAT instances. Could SAT solvers be successful in proving interesting instances of circuit lower bounds for some fixed input lengths? If so, this could provide an experimental verification of central results and conjectures from complexity theory such as P ≠ NP up to some finite domain. As discussed in the present paper, efficient algorithms proving circuit lower bounds can also be transformed into learning algorithms, which provides a separate motivation for this line of research.

In particular, SAT solving of circuit lower bounds could lead to an interesting comparison with the research on neural networks. The task of training a neural network is to design a circuit C of size s, typically with a specific architecture, coinciding with some training input samples (y_i, f(y_i)), and apply it to predict the value f(y) on a new input y. As discussed in Section 3, this problem can be addressed by proving a circuit lower bound. Since proving a circuit lower bound can give us a reliable instance-specific prediction, one could try to use SAT solvers to verify outcomes of neural networks. More generally, one could try to simulate neural networks by SAT solving circuit lower bounds. A potential advantage of SAT solvers is that they do not need to construct a circuit coinciding with training data: it is enough to prove its properties (lower bounds). On the other hand, SAT solvers need to prove a universal statement, which might turn out to be even harder.

Acknowledgements
I would like to thank Rahul Santhanam for many inspiring discussions which, in particular, motivated me to prove Theorem 1. I am indebted to Susanna de Rezende and Erfan Khaniki for many illuminating discussions during the development of the project. I would also like to thank V. Kanade for helpful comments on the existing learning models and L. Chen, V. Kabanets, J. Krajíček and I.C. Oliveira for helpful comments on the draft of the paper. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 890220.
References

[1] Althöfer I.; On sparse approximations to randomized strategies and convex combinations; Linear Algebra and its Applications, 199(1):339-355, 1994.
[2] Atserias A.; Distinguishing SAT from polynomial-size circuits, through black-box queries; CCC, 2006.
[3] Blum A., Furst M., Kearns M., Lipton R.; Cryptographic primitives based on hard learning problems; CRYPTO, 1993.
[4] Buresh-Oppenheim J., Santhanam R.; Making hard problems harder; CCC, 2006.
[5] Carmosino M., Impagliazzo R., Kabanets V., Kolokolova A.; Learning algorithms from natural proofs; CCC, 2016.
[6] Chen L., Hirahara S., Oliveira I.C., Pich J., Rajgopal N., Santhanam R.; Beyond natural proofs: hardness magnification and locality; ITCS, 2020.
[7] Chen L., Jin C., Williams R.; Hardness magnification for all sparse NP languages; FOCS, 2019.
[8] Chen L., Jin C., Williams R.; Sharp threshold results for computational complexity; STOC, 2020.
[9] Chen L., McKay D., Murray C., Williams R.; Relations and equivalences between circuit lower bounds and Karp-Lipton theorems; CCC, 2019.
[10] Chen L., Tell R.; Bootstrapping results for threshold circuits "just beyond" known lower bounds; STOC, 2019.
[11] Cheraghchi M., Hirahara S., Myrisiotis D., Yoshida Y.; One-tape Turing machine and read-once branching program lower bounds for MCSP; preprint, 2020.
[12] Gutfreund D., Shaltiel R., Ta-Shma A.; If NP languages are hard in the worst-case then it is easy to find their hard instances; CCC, 2005.
[13] Hirahara S.; Non-black-box worst-case to average-case reductions within NP; FOCS, 2018.
[14] Hirahara S.; Non-disjoint promise problems from meta-computational view of pseudorandom generator constructions; CCC, 2020.
[15] Ilango R., Loff B., Oliveira I.C.; NP-hardness of circuit minimization for multi-output functions; CCC, 2020.
[16] Krajíček J.; Dual weak pigeonhole principle, pseudo-surjective functions and provability of circuit lower bounds; Journal of Symbolic Logic, 69(1):265-286, 2004.
[17] Krajíček J.; On the proof complexity of the Nisan-Wigderson generator based on a hard NP ∩ coNP function; Journal of Mathematical Logic, 11(1):11-27, 2011.
[18] Krajíček J.; Forcing with random variables and proof complexity; Cambridge University Press, 2011.
[19] Krajíček J.; On the computational complexity of finding hard tautologies; Bulletin of the London Mathematical Society, 46(1):111-125, 2014.
[20] Krajíček J.; Proof complexity; Cambridge University Press, 2019.
[21] Krajíček J., Pudlák P., Takeuti G.; Bounded arithmetic and the polynomial hierarchy; Annals of Pure and Applied Logic, 52:143-153, 1991.
[22] Li L., Littman M., Walsh T.; Knows what it knows: a framework for self-aware learning; ICML, 2008.
[23] Linial N., Mansour Y., Nisan N.; Constant depth circuits, Fourier transform, and learnability; Journal of the Association for Computing Machinery, 40(3):607-620, 1993.
[24] Lipton R.J., Young N.E.; Simple strategies for large zero-sum games with applications to complexity theory; STOC, 1994.
[25] McKay D., Murray C., Williams R.; Weak lower bounds on resource-bounded compression imply strong separations of complexity classes; STOC, 2019.
[26] Modanese A.; Lower bounds and hardness magnification for sublinear-time shrinking cellular automata; preprint, 2020.
[27] Müller M., Pich J.; Feasibly constructive proofs of succinct weak circuit lower bounds; Annals of Pure and Applied Logic, 2019.
[28] Newman I.; Private vs. common random bits in communication complexity; Information Processing Letters, 39:67-71, 1991.
[29] Nisan N., Wigderson A.; Hardness vs. randomness; Journal of Computer and System Sciences, 49:149-167, 1994.
[30] Oliveira I.C.; Randomness and intractability in Kolmogorov complexity; ICALP, 2019.
[31] Oliveira I.C., Pich J., Santhanam R.; Hardness magnification near state-of-the-art lower bounds; CCC, 2019.
[32] Oliveira I.C., Santhanam R.; Conspiracies between learning algorithms, circuit lower bounds, and pseudorandomness; CCC, 2017.
[33] Oliveira I.C., Santhanam R.; Hardness magnification for natural problems; FOCS, 2018.
[34] Pich J.; Nisan-Wigderson generators in proof systems with forms of interpolation; Mathematical Logic Quarterly, 57(4), 2011.
[35] Pich J.; Circuit lower bounds in bounded arithmetics; Annals of Pure and Applied Logic, 166(1):29-45, 2015.
[36] Pich J.; Mathesis universalis; Literis, 2016.
[37] Pich J., Santhanam R.; Strong co-nondeterministic lower bounds for NP cannot be proved feasibly; preprint, 2020.
[38] Razborov A.A.; Unprovability of lower bounds on the circuit size in certain fragments of bounded arithmetic; Izvestiya of the Russian Academy of Science, 59:201-224, 1995.
[39] Razborov A.A.; Pseudorandom generators hard for k-DNF Resolution and Polynomial Calculus; Annals of Mathematics, 181(2):415-472, 2015.
[40] Razborov A.A., Rudich S.; Natural proofs; Journal of Computer and System Sciences, 55(1):24-35, 1997.
[41] Rivest R., Sloan R.; Learning complicated concepts reliably and usefully; AAAI, 1988.
[42] Rudich S.; Super-bits, demi-bits, and NP/qpoly-natural proofs; RANDOM, 1997.
[43] Santhanam R.; Pseudorandomness and the Minimum Circuit Size Problem; ITCS, 2020.
[44] Tal A.; Computing requires larger formulas than approximating; STOC, 2017.
[45] Vadhan S.; Learning versus refutation; COLT, 2017.
[46] Vadhan S., Zheng C.J.;