[PDF] Learning sums of powers of low-degree polynomials in the non-degenerate case

Abstract

We develop algorithms for writing a polynomial as sums of powers of low degree polynomials. Consider an n -variate degree- d polynomial f which can be written as f= c 1 Q m 1 +…+ c s Q m s , where each c i ∈ F × , Q i is a homogeneous polynomial of degree t , and tm=d . In this paper, we give a poly((ns ) t ) -time learning algorithm for finding the Q i 's given (black-box access to) f , if the Q ′ i s satisfy certain non-degeneracy conditions and n is larger than d 2 . The set of degenerate Q i 's (i.e., inputs for which the algorithm does not work) form a non-trivial variety and hence if the Q i 's are chosen according to any reasonable (full-dimensional) distribution, then they are non-degenerate with high probability (if s is not too large). Our algorithm is based on a scheme for obtaining a learning algorithm for an arithmetic circuit model from a lower bound for the same model, provided certain non-degeneracy conditions hold. The scheme reduces the learning problem to the problem of decomposing two vector spaces under the action of a set of linear operators, where the spaces and the operators are derived from the input circuit and the complexity measure used in a typical lower bound proof. The non-degeneracy conditions are certain restrictions on how the spaces decompose.

Full PDF

aa r X i v : . [ c s . CC ] J un Learning sums of powers of low-degree polynomials in thenon-degenerate case

Ankit Garg

Microsoft Research India [email protected]

Neeraj Kayal

Microsoft Research India [email protected]

Chandan Saha

Indian Institute of Science [email protected]

June 17, 2020

Abstract

We develop algorithms for writing a polynomial as sums of powers of low degree polyno-mials. Consider an n -variate degree- d polynomial f which can be written as f = c Q m + . . . + c s Q ms ,where each c i ∈ F × , Q i is a homogeneous polynomial of degree t , and tm = d . In this paper,we give a poly (( ns ) t ) -time learning algorithm for ﬁnding the Q i ’s given (black-box access to) f ,if the Q ′ i s satisfy certain non-degeneracy conditions and n is larger than d . The set of degenerate Q i ’s (i.e., inputs for which the algorithm does not work) form a non-trivial variety and hence ifthe Q i ’s are chosen according to any reasonable (full-dimensional) distribution, then they arenon-degenerate with high probability (if s is not too large). This problem generalizes symmetrictensor decomposition, which corresponds to the t = t =

2) allows us to solve the momentproblem for mixtures of zero-mean Gaussians in the non-degenerate case.Our algorithm is based on a scheme for obtaining a learning algorithm for an arithmeticcircuit model from lower bound for the same model, provided certain non-degeneracy con-ditions hold. The scheme reduces the learning problem to the problem of decomposing twovector spaces under the action of a set of linear operators, where the spaces and the operatorsare derived from the input circuit and the complexity measure used in a typical lower boundproof. The non-degeneracy conditions are certain restrictions on how the spaces decompose.Such a scheme is present in a rudimentary form in an earlier work [KS19]. Here, we make itmore general and detailed, and potentially applicable to learning other circuit models.An exponential lower bound for the representation above (also known as homogeneous Σ ∧ ΣΠ [ t ] circuits) is known using the shifted partials measure. However, the number of linearoperators in shifted partials is exponential and also the non-degeneracy condition emergingout of this measure is unlikely to be satisﬁed by a random Σ ∧ ΣΠ [ t ] circuit when the numberof variables is large with respect to the degree. We bypass this hurdle by proving a lowerbound (which is nearly as strong as the previous bound) using a novel variant of the partialderivatives measure, namely afﬁne projections of partials ( APP ). The non-degeneracy conditionsappearing from this new measure are satisﬁed by a random Σ ∧ ΣΠ [ t ] circuit. The APP measurecould be of independent interest for proving other lower bounds. ontents Σ ∧ ΣΠ [ t ] circuit is non-degenerate (proof of Lemma 1.2) . . . . . . . . . . 252.4 Setting of parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 ΣΠΣΠ [ t ] circuits using APP t case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.2 Low t case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314.3 The hard polynomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 A.1 Module homomorphisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43A.2 Module decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44A.3 Uniqueness of decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

B Reducing vector space decomposition to module decomposition 46C Why doesn’t the shifted partials measure work? 49D Proofs from Section 2 50E Proofs from Section 4 54

Introduction

Arithmetic circuits form a natural model for computing polynomials. They compute polynomialsusing basic arithmetic operations such as addition and multiplication. Formally, an arithmeticcircuit is a directed acyclic graph such that the sources are labelled with variables or constantsfrom the underlying ﬁeld, the internal nodes (gates) are labelled with the arithmetic operationsand the sink(s) outputs the polynomial(s) computed by the circuit. Size of a circuit is the numberof edges in the underlying graph, and depth is the length of a longest path from a source to a sinknode. The three main questions of interest regarding arithmetic circuits are the following: • Lower bounds.

Is there an "explicit" polynomial that requires super-polynomial sized arith-metic circuits to compute? This is the famed VP vs VNP question (an arithmetic analogueof the P vs NP question ). • Polynomial Identity Testing (PIT).

Here the question is, given an arithmetic circuit, deter-mine if its output is identically zero. There is an easy randomized algorithm for this problem(plug in random values and check if the output is zero). Finding a deterministic algorithmis a major open question in this ﬁeld. • Reconstruction.

Here the question is, given a polynomial, ﬁnd the smallest (or approxi-mately smallest) arithmetic circuit computing it.For all the above questions, there is very little progress on them for general arithmetic circuits. So,a lot of effort has gone into studying them for restricted classes of arithmetic circuits (like constantdepth, multilinear, set-multilinear, non-commutative circuits etc.). We refer the interested readerto the excellent surveys [SY10, CKW11, Sap15] on this topic.A lot of interconnections are known between the three above-mentioned problems, some of whichwe touch upon below. • Lower bounds and PIT.

There are several connections that go both ways between lowerbounds and PIT. It is known that lower bounds for general arithmetic circuits would im-ply PIT algorithms via the hardness vs randomness tradeoff (in the algebraic setting) [KI04,DSY10]. Furthermore, non-trivial (deterministic) PIT algorithms also imply lower bounds[HS80, KI04, Agr05]. While these concrete connections are not always present for restrictedcircuit models, several PIT algorithms have been inspired by corresponding lower bounds,e.g., [RS05, FS13, OSlV16, For15a]. • Lower bounds and reconstruction.

It is known that worst case reconstruction of a circuitmodel implies lower bound for the same model [FK09, Vol16] (also see the discussion in One can also allow division, but it is a classical result that one can eliminate division operations from an arithmeticcircuit without too much blow up in the circuit size [Str73]. Sometimes, one can let the edges going into addition gates to be labelled with ﬁeld constants to allow scalar linearcombinations. But, all these models can be interconverted to each other without too much blowup in the circuit size. One could also consider various other notions of explicitness while framing the lower bounds question. Or rather Either as a black-box or explicitly. Again, either as a black box or explicitly. • PIT and reconstruction.

In one direction, a deterministic (worst case) reconstruction algo-rithm clearly implies a deterministic PIT algorithm (both in the black-box model) since thereconstruction algorithm would have to output an extremely small circuit when the circuitcomputes the zero polynomial. Randomized (or average-case) reconstruction algorithmsmay not have anything to do with deterministic PIT algorithms, of course. In the other di-rection, as discussed in [SY10], black-box PIT algorithms seemingly can help in designingreconstruction algorithms. This is because a black-box PIT algorithm outputs a list of eval-uation points such that any circuit from the class being considered evaluates to a non-zerovalue on at least one of points, and hence any two circuits in the class computing differ-ent polynomials evaluate to a different value on at least one of the points. So, the list ofevaluation points determines the circuit and now it remains to be seen if one can efﬁcientlyreconstruct the circuit from these evaluations. The reconstruction algorithms for sparsepolynomials and constant top fan-in depth three circuits and read-once algebraic branchingprograms are some examples of reconstruction using PIT ideas [KS01, Shp09, KS09a, FS13].Of course, deterministic PIT algorithms can also be sometimes used to get deterministic re-construction algorithms when randomized ones are known [KS06, FS13].To summarize, the three main problems in arithmetic complexity are richly interrelated and progresson one question spurs progress on the others. Hence, it is imperative to ﬁnd more connections be-tween these problems. This paper continues the line of work in [KS19] on building a new connec-tion between lower bounds and reconstruction. We build on the work of [KS19] to further developa meta framework that yields reconstruction algorithms in the non-degenerate setting from lowerbounds for the corresponding circuit models. In addition to developing this framework further,we implement this framework to learn sums of powers of low degree polynomials in the non-degenerate case (described in Section 1.1). We remark that assuming some kind of non-degeneracyconditions might be essential for designing efﬁcient learning algorithms; otherwise for most cir-cuit models, one will have to assume constant top fan-in to get polynomial time algorithms. Thisis because of various hardness results about reconstruction in the worst case (see Section 1.4). Theusefulness of assuming non-degeneracy conditions is best illustrated by the following example.Consider the model of homogeneous depth three powering circuits. This corresponds to the rep-resentation f ( x ) = s ∑ i = ℓ i ( x ) d ,where ℓ i ’s are linear polynomials. Finding such a decomposition with the minimum possible s is NP-hard even for degree d = Assuming the class is closed under subtractions. Of course, a random set of points forms a hitting set and it seems hard to reconstruct the circuit given its evalua-tions on random points. However, the hitting sets constructed for deterministic PIT algorithms typically have a lot ofspecial structures which could be exploited for reconstruction. In [KS19], this framework is present in a rudimentary form. f ( x ) = ∑ si = ℓ i ( x ) with s ≤ n and ℓ i ’s linearly independent, we can ﬁnd the ℓ i ’s in polynomial time. A couple of things tonotice about the assumptions are: • s ≤ n : The number of summands that the algorithm can handle is (up to a small constant)the best known lower bound we can prove for this model (sums of cubes of linear forms ororder-3 symmetric tensor decomposition). • The set of inputs for which the algorithm does not work, i.e., when ℓ i ’s are linearly depen-dent, form a non-trivial variety (if s ≤ n ). So, the algorithm would work for "random" ℓ i ’swith high probability.We hope to generalize the above kind of non-degenerate case learning algorithms to other circuitmodels. The circuit size which one might be able to handle if one implements the meta frameworkwill depend on the lower bound one can prove for the circuit model. Since tensor decompositionalgorithms (which corresponds to reconstruction for a very simple arithmetic circuit model) arewidely used in machine learning, our meta framework raises the exciting possibility of importingtechniques from arithmetic complexity to machine learning via reconstruction of various circuitmodels in the non-degenerate case. We mention one such possibility in Section 1.3.Let us brieﬂy describe the roadmap for the rest of this section now. In Section 1.1, we describeour main results about learning sums of powers of low degree polynomials in the non-degeneratecase. In Section 1.2, we describe our techniques: the meta framework for turning lower boundsinto reconstruction algorithms, the implementation for sums of powers of low degree polynomialsand the non-degeneracy conditions needed for our algorithm to work. In Section 1.3, we describethe connection to mixtures of Gaussians. Finally, in Sections 1.4 and 1.5, we review hardnessresults about reconstruction and some previous work. We study the learning problem for an interesting subclass of depth four arithmetic circuits whichis a generalization of depth three powering circuits or symmetric tensors. A circuit in this class,computing an n -variate degree- d polynomial f ∈ F [ x ] , is an expression f = c Q m + . . . + c s Q ms , (1)where each c i ∈ F × , Q i is a homogeneous polynomial of degree t , and tm = d . Such a circuit iscalled a homogeneous Σ ∧ ΣΠ [ t ] ( s ) circuit. The parameter t is typically much smaller than d . There are natural extensions of this result for larger values of d that can handle a larger number of components(roughly matching the best lower bounds we can prove for this model), e.g., see [DLCC07, ABG +

14, BCMV14, KS19].Other algorithms include [Kay11, GKP18]. The result in this paper holds even if f = c Q m + . . . + c s Q m s s , where each Q i (not necessarily homogeneous) hasdegree t i ≤ t and t i m i = d for all i ∈ [ s ] . We present the analysis assuming homogeneity and uniform exponent m forsimplicity of exposition. Technically, the expressions of this kind are known as Σ ∧ ΣΠ [ t ] ( s ) formulas . But, there is only a minor distinctionbetween formulas and circuits in the constant depth case. Furthermore, in the random circuit setting, even this minordistinction is not there.

3e show that a homogeneous Σ ∧ ΣΠ [ t ] circuit can be reconstructed efﬁciently if it satisﬁes certain non-degeneracy conditions. We defer stating these conditions precisely to the end of this section,but it is worth mentioning that a random Σ ∧ ΣΠ [ t ] circuit is non-degenerate with high probability.In other words, if the coefﬁcients of the monomials in Q , . . . , Q s , in Equation (1), are chosen uni-formly at random from a sufﬁciently large subset of F then the resulting circuit is non-degeneratewith high probability. In this sense, almost all homogeneous Σ ∧ ΣΠ [ t ] circuits can be reconstructedefﬁciently. The following theorem is proved in Section 2. We will assume that factoring univariatepolynomials over F can be done in randomized polynomial time . Theorem 1 (Learning non-degenerate sums of powers of low degree polynomials) . Let n , d , s , t ∈ N such that n ≥ d , ≤ t ≤ q log d · log log d , | F | ≥ ( ns ) · t , char ( F ) = or > d ands ≤ min ( n d · t , exp ( n · t )) . Then, there is a randomized algorithm which when given black-box accessto an n-variate degree-d non-degenerate polynomial f = c Q m + . . . + c s Q ms , where each c i ∈ F × , Q i is ahomogeneous polynomial of degree t, and tm = d and the total number of monomials in Q i ’s is σ , outputs(with high probability) Q ′ , . . . , Q ′ s such that there exist a permutation π : [ s ] → [ s ] and non-zero constantsc ′ , . . . , c ′ s so that Q ′ i = c ′ i Q π ( i ) for all i ∈ [ s ] . The running time of the algorithm is poly ( n , σ , s t ) . Remarks. Non-degeneracy.

The non-degeneracy conditions are explicitly mentioned in Section 1.2.4.2.

Bounds on t and s.

The upper bounds on the parameters t and s in Theorem 1 originate fromour analysis (especially, the part in Section 2.3 showing that a random Σ ∧ ΣΠ [ t ] circuit isnon-degenerate with high probability). We have not optimized this analysis in an attempt tokeep it relatively simple. Given that the lower bounds (stated in Theorem 2) hold for a largerange of t and s , it may be possible to tighten our analysis signiﬁcantly.3. t > substantially different from t = . While we state the theorem for slightly super-constantvalues of t , one should think of t being a constant as the main setting (in which case, therunning time of our algorithm is polynomial). Even the t = t = Uniqueness of Q i ’s. A corollary of the analysis of our algorithm is that for a non-degenerate f = c Q m + . . . + c s Q ms (which holds if the Q i ’s are chosen randomly), this representationis of the smallest size and also unique. That is, if f = e c e Q m + . . . + e c s e Q m e s with e s ≤ s , then e s = s and there exist a permutation π : [ s ] → [ s ] and non-zero constants d , . . . , d s such that e Q i = d i Q π ( i ) for all i ∈ [ s ] .Non-degeneracy conditions are satisﬁed if we choose the coefﬁcients of the Q i ’s randomly. Thisgives us the following corollary. Corollary 1.1 (Learning random sums of powers of low degree polynomials) . Let n , d , s , t ∈ F be asin Theorem 1. There is a randomized algorithm which when given black-box access to an n-variate degree-d Univariate polynomials over ﬁnite ﬁelds can be factored in randomized polynomial time [Ber70], and over Q , theycan be factored in deterministic polynomial time [LLL82]. Here, exp ( x ) = x . Once we know Q ′ , . . . , Q ′ s , we can determine the non-zero constants d , . . . , d s such that f = d Q ′ m + . . . + d s Q ′ ms in randomized polynomial time. olynomial f = c Q m + . . . + c s Q ms , where each c i ∈ F × , Q i is a homogeneous polynomial of degree t, andtm = d and the coefﬁcients of Q i ’s are chosen uniformly and independently at random from a set S ⊂ F of size | S | ≥ ( ns ) · t , outputs (with high probability) Q ′ , . . . , Q ′ s such that there exist a permutation π : [ s ] → [ s ] and non-zero constants c ′ , . . . , c ′ s so that Q ′ i = c ′ i Q π ( i ) for all i ∈ [ s ] . The running time ofthe algorithm is poly (( ns ) t ) . The novelty of our approach lies in the use of lower bound techniques in the design of learningalgorithms. Such connections are known for certain classes of Boolean circuits, in particular AC and AC [ p ] circuits [LMN93, CIKK16]. The inﬂuence of lower bound techniques on learning isalso apparent in the case of ROABP reconstruction [BBB +

00, KS06]. However, our approach dif-fers substantially from these previous works and also uses lower bounds to design algorithms inthe non-degenerate case. At a high level, our technique can be summarized as a fancy reduction tolinear algebra . In Section 1.2.1, we will see how lower bounds are typically proven in arithmeticcomplexity. Section 1.2.2 describes our meta framework of turning lower bounds into learning al-gorithms. In Section 1.2.3, we discuss how implement the framework for learning sums of powersof low degree polynomials. Finally in Section 1.2.4, we state the non-degeneracy conditions werequire explicitly.

Many of the circuit classes for which good lower bounds are known are of the form T + . . . + T s ,where each polynomial T i is “simple” in some sense . The lower bound problem for such a class C is to ﬁnd an explicit polynomial f such that any representation of the form f = T + T + . . . + T s ,where each T i is a simple polynomial, requires s to be large. A typical lower bound strategy ﬁndssuch an f by constructing a set of linear maps L from the vector space of polynomials to someappropriate vector space such that the following properties hold: • dim ( hL ◦ T i ) is small (say ≤ r ) for every simple polynomial T , • dim ( hL ◦ f i ) is large (say ≥ R ).Then, we have f = T + T + . . . + T s ⇒ hL ◦ f i ⊆ hL ◦ T i + · · · + hL ◦ T s i (2) ⇒ dim ( hL ◦ f i ) ≤ dim ( hL ◦ T i ) + · · · + dim ( hL ◦ T s i ) .This implies that s ≥ R / r . Now, let us see how the set of linear maps L could potentially play akey role in learning class C . For example, in case of Σ ∧ ΣΠ [ t ] ( s ) circuits, T i is a power of a degree- t polynomial. One can also get suchrepresentations from general circuits by various depth reduction theorems [AV08, Koi12, Tav13]. Here, h S i denotes the F -linear span of a set of polynomials S . .2.2 Reduction to vector space decomposition - a recipe for learning The corresponding learning problem for C is the following: Given a polynomial f that can beexpressed as f = T + . . . + T s , where each T i is a simple polynomial, can we efﬁciently recoverthe T i ’s? It turns out that the set of linear maps L (used to prove lower bounds) can now be usedto devise an efﬁcient learning algorithm via the following meta-algorithm, which works if theexpression T + . . . + T s is “non-degenerate”. Let us explain what we mean by non-degeneracy.One might expect that if the T i ’s are chosen randomly, then for some choice of linear maps L , thesubspace condition in Equation (2) becomes an equality and the sums become direct sums, i.e., hL ◦ f i = hL ◦ T i ⊕ · · · ⊕ hL ◦ T s i . (3)Existence of linear maps L satisfying Equation (3) for random T i ’s is the starting point for ourlearning framework. A couple of things are important to state here: • If Equation (3) is satisﬁed for even a single choice of T i ’s, then it is satisﬁed for random T i ’sbecause of the Schwarz-Zippel lemma [Sch80, Zip79]. • Equation (3) implies a tight separation within class C . So, a prerequisite for a lower boundmethod to be useful for our learning framework is that it should be able to prove a tightseparation for that model. In fact, Equation (3) is usually proven by exhibiting an explicitpolynomial for which the linear maps in question yield a tight separation (see Lemma 1.1).We will also need that L = L ◦ L , i.e., L is a combination of two sets of linear maps andEquation (3) holds for both L and L . We say that the expression T + · · · + T s is non-degenerateif Equation (3) holds for both L and L . Now given f = T + · · · + T s , we have access to U : = hL ◦ f i = hL ◦ T i ⊕ . . . ⊕ hL ◦ T s i and V : = hL ◦ L ◦ f i = hL ◦ ( L ◦ T ) i ⊕ . . . ⊕ hL ◦ ( L ◦ T s ) i . (4)If we can recover the hL ◦ T i i for all i , then usually one can recover the T i ’s . Towards this,a crucial property of the linear maps L (from U to V ) is that L maps each component space hL ◦ T i i to the corresponding component space of V . This motivates the following problem. Problem 1 (Vector space decomposition) . Given two vector spaces U and V and a set of linearmaps L from U to V , ﬁnd a decomposition U = U ⊕ . . . ⊕ U s V = V ⊕ . . . ⊕ V s ,such that hL ◦ U i i ⊆ V i for all i ∈ [ s ] (if such a decomposition exists). Moreover, we can ask thateach of the pairs ( U i , V i ) be further indecomposable with respect to L . For example, k th order partial derivatives are a composition of ( k − ) th order partial derivatives and ﬁrst orderpartial derivatives, i.e., ∂ k x = ∂ k − x ◦ ∂ x . This framework should be applicable given only black-box access to f using standard tricks (as we show for ourproblem) and we won’t go into these details in this overview. For example, one can recover a homogeneous polynomial if given all its degree- k partial derivatives. U = V and we require U i = V i . The algorithm was discovered in [CIK97] based on the algo-rithms developed for decomposition of algebras (e.g., see [FR85, Rón90, Ebe91]). The algorithmworks over ﬁnite ﬁelds, C and R (if the input is over Q , then the algorithm outputs a decompo-sition over an extension ﬁeld). We give a simple reduction in Section B that reduces the vectorspace decomposition problem to the symmetric version. However, since we are in a specializedsetting, we can design a simpler algorithm using the ideas in [CIK97] that also works over Q (thisis important for some potential applications like mixtures of Gaussians).Thus, we are capable of doing a decomposition like the one in Equation (4). But, why should weend up with the same decomposition? Certainly there are cases where the decompositions are notunique. For example, if U = V and L just consists of the identity map, then any decompositioninto one-dimensional spaces is a valid one. However, there is a characterization of all decomposi-tions in the symmetric setting (Krull-Schmidt theorem, Theorem 3 in Section A) and it extends tovector space decomposition via our reduction (Corollary B.2). In many settings (including the onein this paper), this characterization helps in proving the uniqueness of decomposition.Finally, the meta algorithm is stated in Algorithm 1, which works under the assumptions:1. The following direct sum structure holds, U : = hL ◦ f i = hL ◦ T i ⊕ . . . ⊕ hL ◦ T s i and V : = hL ◦ L ◦ f i = hL ◦ ( L ◦ T ) i ⊕ . . . ⊕ hL ◦ ( L ◦ T s ) i . (5)2. (5) is the unique indecomposable decomposition for the vector spaces U and V w.r.t. L .3. One can recover T i from hL ◦ T i i efﬁciently.Next, we will discuss how we prove these assumptions for our setting, namely sums of powers oflow degree polynomials. Algorithm 1

Meta algorithm: Learning from lower bounds

Input : f = T + · · · + T s . Output : T ′ , . . . , T ′ s such that there exists a permutation σ : [ s ] → [ s ] so that T ′ i = T σ ( i ) . Take an appropriate set of linear maps L = L ◦ L . Compute U : = hL ◦ f i and V : = hL ◦ L ◦ f i . Obtain a (further indecomposable) vector space decomposition of U and V with respect to L ,namely U = U ⊕ · · · ⊕ U s and V = V ⊕ · · · ⊕ V s . Compute T ′ i from U i (assuming U i = hL ◦ T ′ i i ). In this section, we discuss how we implement the meta algorithm described above for sums ofpowers of low degree polynomials. As discussed, the main ingredient in the learning algorithm The symmetric version is known as module decomposition in the literature.

7s the lower bound. So, at ﬁrst we need to understand how lower bounds are proven for thismodel [Kay12b, GKKS14, KSS14]. Let us ﬁrst consider the setting of sums of powers of linearforms , i.e., t = d homogeneous polynomial f suchthat any expression of the form: f = ℓ d + · · · + ℓ ds ,with ℓ i ’s linear, requires a large value of s . The set of linear maps here will be L = ∂ ⌊ d /2 ⌋ x , i.e.,all partial derivatives of order ⌊ d /2 ⌋ . Then, it is easy to see that dim ( hL ◦ ℓ d i ) ≤ ℓ . Thus, any f has a lower bound of s ≥ dim ( hL ◦ f i ) . One can easily design polyno-mials with large dimension for the partial derivatives, e.g., the elementary symmetric polynomialof degree d in n variables satisﬁes dim ( hL ◦ f i ) = ( n ⌊ d /2 ⌋ ) . For a long time, it was not known how to generalize these super-polynomial lower bounds evento the t = ( hL ◦ Q m i ) is no longer small, for L = ∂ k x , when Q isa degree- t homogeneous polynomial with t ≥

2. For example, dim ( hL ◦ ( x + · · · + x n ) m i ) ≥ ( nk ) for k ≤ m ≤ n . However, one can still say something about the partial derivatives. If we take any α ∈ Z n ≥ with | α | : = ∑ ni = α i = k , then Q m − k divides ∂ α x Q m for k ≤ m . Hence, any ∂ α x Q m is of theform Q m − k R , where R is homogeneous of degree k ( t − ) . Now, the main observation in [Kay12b]was that we can make use of this special property of powers of low degree polynomials by usingthe shifted partial derivatives measure which is deﬁned as follows, SP k , ℓ ( f ) : = dim D x ℓ · ∂ k x f E .That is, we take all k th order partial derivatives of f and then multiply by all degree- ℓ monomialsand then take the dimension of their span. Now, for any α , β such that | α | = k and β = ℓ , x β ∂ α x Q m is of the form Q m − k R , where R is a homogeneous polynomial of degree ℓ + k ( t − ) . Hence SP k , ℓ ( Q m ) ≤ (cid:18) n + ℓ + k ( t − ) − ℓ + k ( t − ) (cid:19) ,where if k , ℓ are not too large, then one can expect SP k , ℓ ( f ) , for an appropriately chosen f , to beclose to the number of operators ( n + ℓ − ℓ )( n + k − k ) . Indeed, with appropriate choices of k and ℓ , thiscan be used to prove exponential lower bounds for the model sums of powers of low degree poly-nomials [Kay12b] and other more general models as well [GKKS14,KSS14,KS14,FLMS15,KLSS17,KS17b]. In all of these lower bounds, the value of ℓ is chosen to be comparable or larger than n .This makes the number of linear maps exponential and hence not suitable for designing an efﬁ-cient algorithm. In fact, even ignoring the large number of linear maps, with such a large valueof ℓ , the shifted partials measure is unlikely to satisfy the direct sum property in Equation (5) (seeSection C) when the number of variables is large with respect to the degree. A natural way todecrease the number of linear maps is to project to a smaller number of variables. To our surprise,if we project down to a smaller number of variables, one does not need shifts at all to prove thelower bound! We call this new measure afﬁne projections of partials, which we deﬁne next. Also known in the literature as symmetric tensor decomposition or the Waring rank problem. If we use this lower bound with the recipe in Section 1.2.2, then one gets a close variant of Jennrich’s algorithm forsymmetric tensor decomposition. The word afﬁne is added to avoid confusion with another kind of projection (namely multilinear projection) whichis usually done in the literature to prove lower bounds for depth four arithmetic circuits. fﬁne projections of partials – a novel adaptation of the partial derivatives measure. Let f bea polynomial in variables x = ( x , . . . , x n ) and L = ( ℓ ( z ) , . . . , ℓ n ( z )) a tuple of linear forms invariables z = ( z , . . . , z n ) . The parameter n would be much smaller than n in this paper. Let π L be the following afﬁne projection map from F [ x ] to F [ z ] , π L ( f ) : = f ( ℓ ( z ) , . . . , ℓ n ( z )) .For a set S ⊆ F [ x ] , the projection π L is naturally deﬁned as, π L ( S ) : = { π L ( f ) : f ∈ S } .Recall that ∂ k x f is the set of all k -th order partial derivatives of f and h S i is the F -linear span of aset of polynomials S . The afﬁne projections of partials ( APP ) measure is deﬁned as,(The measure)

APP k , n ( f ) : = max L dim D π L ( ∂ k x f ) E , (6)where the maximum is taken over all n -tuple L = ( ℓ ( z ) , . . . , ℓ n ( z )) of linear forms in F [ z ] . It iseasy to verify that for any f , g ∈ F [ x ] the following linearity property is satisﬁed,(Linearity of the measure) APP k , n ( f + g ) ≤ APP k , n ( f ) + APP k , n ( g ) . (7)The APP measure can be alternatively deﬁned using a random afﬁne projection π L . The observa-tion below is an easy consequence of the Schwartz-Zippel lemma [Sch80, Zip79]. Observation 1.1. If | F | ≥ · ( d − k ) · ( n + k − k ) and every coefﬁcient of the linear forms in L = ( ℓ ( z ) , . . . , ℓ n ( z )) is chosen from a set S ⊆ F of size · ( d − k ) · ( n + k − k ) then with probability at least , APP k , n ( f ) = dim D π L ( ∂ k x f ) E . Remark 1.

Similarity with the skewed partials measure.

The

APP measure is akin to the skewedpartials (

SkP ) measure introduced in [KNS16] – SkP is a special case of

APP . We show that the lower boundproof works with the

SkP measure, but to ensure that the hard polynomial f n , d is multilinear we need afﬁneprojections (particularly, p-projections). However, the main reason for us to work with general/randomafﬁne projections (instead of SkP or p-projections) is to make the learning algorithm require as weak anon-degeneracy condition as possible.

Let us give some intuition as to why the

APP measure can yield lower bounds for sums of powersof low degree polynomials. As we say above, any ∂ α x Q m , with | α | = k , is of the form Q m − k R ,where R is homogeneous of degree k ( t − ) (recall Q is homogeneous of degree t ). This impliesthat APP ( Q m ) ≤ ( n + k ( t − ) − k ( t − ) ) . However, for an appropriately chosen homogeneous degree- d polynomial f , one can expect that APP ( f ) could be as large as min { ( n + d − k − d − k ) , ( n + k − k ) } . The ﬁrstupper bound is from the fact that after derivatives and projection, we are in the space of degree- ( d − k ) polynomials in n variables and the second is from the fact that we have ( n + k − k ) linearmaps in APP . With appropriate choices of n , k and the polynomial f , we can prove the followinglower bound result which comes close to the best known lower bounds via shifted partials. More generally, ℓ ( z ) , . . . , ℓ n ( z ) are afﬁne forms, but for this work it sufﬁces to take them as linear forms. heorem 2 (Lower bound for homogeneous ΣΠΣΠ [ t ] circuits using APP ) . The

APP measure, de-ﬁned above, can be used to prove the following lower bound for homogeneous

ΣΠΣΠ [ t ] circuits. • High t case: Let n , d , t ∈ N such that n ≥ d and ln nd ≤ t ≤ d · e · ln d . There is a family of n-variatedegree-d multilinear polynomials { f n , d } in VNP such that any homogeneous

ΣΠΣΠ [ t ] ( s ) circuitcomputing f n , d must have s = (cid:16) nd (cid:17) Ω ( dt ln t ) . • Low t case: Let n , d , t ∈ N such that n ≥ d and ≤ t ≤ min n ln n e · ln d , d o . There is a family ofn-variate degree-d multilinear polynomials { f n , d } in VNP such that any homogeneous

ΣΠΣΠ [ t ] ( s ) circuit computing f n , d must have s = n Ω ( dt ) . Remark 2.

More general circuit model.

The above lower bound (proved in Section 4) is for the classof homogeneous

ΣΠΣΠ [ t ] circuits, which contains the class of homogeneous Σ ∧ ΣΠ [ t ] circuits. In fact,the APP measure can be used to give a super-polynomial lower bound for general homogeneous depth fourcircuits – we skip the proof of this fact here.

Of course, an n Ω ( dt ) lower bound is already known for homogeneous ΣΠΣΠ [ t ] circuit using theshifted partials measure [KSS14, FLMS15]. We get nearly the same lower bound by replacing“shifts” by an afﬁne projection. As discussed above, this change is essential to satisfy the directsum property in Equation (5). In terms of lower bounds, this means that we show an explicitpolynomial which can be computed by a homogeneous Σ ∧ ΣΠ [ t ] ( s ) circuit but not by any homo-geneous Σ ∧ ΣΠ [ t ] ( s − ) circuit. Lemma 1.1 (Tight separation for homogeneous Σ ∧ ΣΠ [ t ] circuits) . Suppose n , d , s , t , F satisfy theconditions in Theorem 1. Then, there is a family of explicit n-variate degree-d polynomials computable byhomogeneous Σ ∧ ΣΠ [ t ] ( s ) circuits but not by any homogeneous Σ ∧ ΣΠ [ t ] ( s − ) circuit. The above lemma (whose proof is implicit in Section 2.3) allows us to argue that a random Σ ∧ ΣΠ [ t ] circuit is non-degenerate with high probability (Item 1 in the assumptions for Algorithm 1).Let us now brieﬂy explain how we prove the uniqueness of decomposition and describe our al-gorithm for recovering a term from the corresponding vector space of polynomials (Items 2 and 3). Uniqueness of vector space decomposition.

The adjoint algebra helps us prove uniqueness ofdecomposition in our setting. Let us discuss what that is. Recall that we have vector spaces U , V and a set of linear maps L from U to V . The adjoint algebra of L is deﬁned as follows:adj ( L ) : = { ( D , E ) : D : U → U , E : V → V and K ◦ D = E ◦ K for all K ∈ L } . (8)Suppose U = U ⊕ · · · ⊕ U s , V = V ⊕ · · · ⊕ V s is an indecomposable decomposition with re-spect to L . If it so happens that for every ( D , E ) ∈ adj ( L ) , there exist constants α , . . . , α s suchthat Du i = α i u i for all u i ∈ U i , then the decomposition is unique (this follows from CorollaryA.1 and B.2). We show this is the case in our setting. We deviate from the framework in Section That is, in time poly ( n , s ) , we can output the Σ ∧ ΣΠ [ t ] ( s ) circuit computing the polynomial. V = (cid:10) π L ( Q ) m − k , . . . , π L ( Q s ) m − k (cid:11) and W = W ⊕ . . . ⊕ W s , W i : = (cid:10) π P ( ∂ k z π L ( Q i ) m − k ) (cid:11) , and L = π P ( ∂ k z ) are linear maps from V to W (changing notation to match with Section 2). Here, L isa random projection onto n variables z , and P is a random projection onto m variables w . Recovery of the terms from the corresponding vector spaces.

Due to the above-mentioned de-viation from the framework, the ﬁnal problem we have to solve is the following: Given access tothe random projections π L ( Q ) , . . . , π L ( Q s ) , recover the Q i ’s. First of all, if one is given multipleprojections π L ( Q ) , then it is not hard to recover Q . However, we have multiple polynomials andfor each random projection, we could be given the polynomials in an arbitrary order. This makesthe recovery slightly non-trivial. For details of how we solve this, see the analysis of Steps 7-8 ofAlgorithm 2 in Section 2.2.Next, we explicitly state the non-degeneracy conditions we require. The details of our algorithmand analysis can be found in Section 2. As mentioned, we deviate from the general recipe to sim-plify the analysis, but the recipe provides the intuition and forms the backbone of the algorithm. In this section, we state the non-degeneracy conditions that our algorithm requires. Let C be ahomogeneous Σ ∧ ΣΠ [ t ] ( s ) circuit computing an n -variate degree- d polynomial f = c Q m + . . . + c s Q ms . (9) Notations.

Let z = ( z , . . . , z n ) be a set of n = ⌊ n · t ⌋ variables and w = ( w , . . . , w m ) a setof m = ⌊ n · t ⌋ variables. Let L = ( ℓ ( z ) , . . . , ℓ n ( z )) be a tuple of n linear forms in F [ z ] and P = ( p ( w ) , . . . , p n ( w )) a tuple of n linear forms in F [ w ] . For every such tuples of linear forms L and P , we can deﬁne the following spaces and polynomials by setting k = l · t · log s log n m : • U : = (cid:10) π L ( ∂ k x f ) (cid:11) and U i : = (cid:10) π L ( ∂ k x Q mi ) (cid:11) , • G i : = π L ( Q i ) and g ( z ) : = G e + . . . + G es , where e = m − k , • f U i : = D z k ( t − ) · G ei E , where z k ( t − ) is the set of all z -monomials of degree 2 k ( t − ) , • W : = (cid:10) π P ( ∂ k z g ) (cid:11) and W i : = (cid:10) π P ( ∂ k z G ei ) (cid:11) .Observe that U ⊆ U + . . . + U s , and U i ⊆ D z k ( t − ) · π L ( Q i ) m − k E implying dim U i ≤ (cid:18) n + k ( t − ) − k ( t − ) (cid:19) , W ⊆ W + . . . + W s , and W i ⊆ D w k ( t − ) · π P ( G i ) e − k E implying dim W i ≤ (cid:18) m + k ( t − ) − k ( t − ) (cid:19) . Deﬁnition 1.1 (Non-degeneracy) . The circuit C , given by Equation (9), is non-degenerate if thereexist L and P (as above) such that the following conditions are satisﬁed:11. U = U ⊕ . . . ⊕ U s and dim U i = ( n + k ( t − ) − k ( t − ) ) for all i ∈ [ s ] ,2. W = W ⊕ . . . ⊕ W s and dim W i = ( m + k ( t − ) − k ( t − ) ) for all i ∈ [ s ] ,3. f U + . . . + f U s = f U ⊕ . . . ⊕ f U s for all i ∈ [ s ] ,4. [ G e ] z = , . . . , [ G es ] z = are F -linearly independent.Condition 1 and 2 constitute the main part of non-degeneracy. Condition 3 and 4 have been addedto aid our analysis and keep it relatively simple; it may be possible to dispense with these condi-tions completely perhaps by altering Conditions 1 and 2 slightly.We prove the following lemma in Section 2.3, which says that if we choose the Q i ’s randomly inEquation (9), then the circuit is non-degenerate with high probability. Lemma 1.2 (Random Σ ∧ ΣΠ [ t ] circuits are non-degenerate) . Suppose the coefﬁcients of Q i ’s are chosenuniformly and independently at random from a set S ⊂ F of size | S | ≥ ( ns ) · t . Then, with probability − o ( ) , the non-degeneracy conditions in Deﬁnition 1.1 are satisﬁed. In this section, we discuss an application to learning parameters of mixtures of Gaussians fromthe moments. Symmetric tensor decomposition (and also general tensor decomposition) has alot of applications in both supervised and unsupervised machine learning, e.g., in independentcomponent analysis, learning latent variable models, hidden Markov models, topic models andmixture of Gaussians [MR05, CJ10, HKZ12, AHK12, HK13, AGH +

14, ABG +

14, BCMV14]. In ourlanguage, (symmetric) tensor decomposition corresponds to the t = t polynomials". It is conceivable that learning algorithms which handle more generalcircuit classes as compared to tensor decomposition will handle a richer class of learning modelsthan mentioned above. One example is given by the mixture of Gaussians model which has a richhistory, e.g., see [Pea94, RR95, Das99, PFJ03, VW04, BS10, MV10, ABG +

14, BCMV14, GHK15, RV17].While special cases of this problem can be solved by reduction to (symmetric) tensor decomposi-tion [HK13, ABG +

14, BCMV14], general cases correspond to "sums of powers of quadratics" (andslightly more general models), i.e., the t = n dimensions, D = ∑ si = w i N ( µ i , Σ i ) , where w i , µ i , Σ i denote the weight, meanand covariance matrix of the i th Gaussian respectively. The goal is to recover the parameters ( w i , µ i , Σ i ) si = upto some error. There have been many algorithms developed for the mixture ofGaussians problem and they make varying assumptions about the input parameters. These as-sumptions can be grouped into three broad categories:1. Worst case , e.g., [Pea94, BS10, MV10]. Here, no assumptions are made on the input param-eters. Due to this, the running time is exponential in the number of components s . In theworst case, even information theoretically, one needs exponential (in s ) number of samplesto learn the parameters [MV10, ABG + Separation assumptions , e.g., [Das99, SK01, AM05, KSV05, DS07, BV08, KK10, RV17]. Here,one assumes that the parameters (either the means or the covariance matrices) are well sep-arated. Running times are typically polynomial in the number of components s .3. Smoothed setting , e.g., [HK13,ABG + Running times are typically polynomialin the dimension n , as long as the number of components s ≤ poly ( n ) . Of course, it is best to have algorithms in the worst case but this usually means running time expo-nential time in the number of components. Making reasonable assumptions on the parametershelps in designing more efﬁcient algorithms (when the number of components are growing). Thetwo assumptions here, separation and the smoothed setting, are incomparable. For example, thealgorithms developed in the smoothed setting can also handle instances where the parameters arenot well separated. Our contribution towards the mixture of Gaussians problem should be under-stood in the context of smoothed analysis of the problem. We also remark that there has been a lotof work in designing algorithms for mixtures of Gaussians in the robust setting, i.e., when some ofthe samples are corrupted adversarially [DKK +

19, KS17a, KSS18, HL18]. These algorithms usuallywork by estimating the moments from the corrupted data, so are in some sense complementary toalgorithms that recover the parameters from the moments.In the smoothed setting, the papers [ABG +

14, BCMV14] give a polynomial time algorithm forthe special case of mixtures of Gaussians with diagonal covariance matrices (i.e., the different di-mensions of every component of the mixture are independent) when the number of components s ≤ poly ( n ) . For Gaussians mixtures with general covariance matrices, [GHK15] gave a polyno-mial time algorithm when s ≤ √ n . We make a step towards getting a polynomial time algorithmfor Gaussians mixtures with general covariance matrices when s ≤ poly ( n ) . Our algorithm forlearning sums of powers of low degree polynomials (speciﬁcally sums of powers of quadratics) al-lows us to recover the parameters of a non-degenerate mixture of zero-mean Gaussians given accessto its O ( ) -order exact moments when s ≤ poly ( n ) . Lemma 1.3 (Learning mixtures of zero-mean Gaussians) . There is a randomized polynomial time al-gorithm which when given access to exact O ( ) -order moments of a mixture of non-degenerate zero-meanGaussians, D = ∑ si = w i N ( µ i , Σ i ) , recovers the parameters ( w i , Σ i ) si = with probability − o ( ) . We believe that a few modiﬁcations to our algorithm will allow one to get a polynomial timealgorithm for general mixtures of Gaussians in the smoothed case. Let us remark on the maindifferences between our current algorithmic guarantee and the above goal. This is because we can get more information from lower order moments in the higher dimensional case, which areeasier to estimate. The exponent of the polynomial running time will depend on the exponent of the polynomial controlling therelation between s and n . This should be compared with some of the worst case reconstruction algorithms which run in time exponential inthe top fan-in of the circuit. The non-degeneracy condition is satisﬁed if Σ i = A i A Ti and the entries of A i ’s are choosen uniformly and inde-pendently from a set S ⊂ Q with | S | ≥ poly ( n , s ) , and of course all w i ’s are non-zero. Extension to general means.

Our current statement only holds for zero-mean Gaussiansbecause applying the method of moments to zero-mean mixtures naturally leads to sumsof powers of quadratics. This is just to keep the analysis simple and clean. We believe ouralgorithm should extend to sums of products of low degree polynomials and the learningproblems coming from moments of mixtures of general mean Gaussians lie somewhere be-tween sums of powers of quadratics and sums of products of quadratics. • Exact vs inexact moments.

Our algorithm assumes that the O ( ) -order moments of themixture are given exactly. Using samples, we can only approximate the moments ( O ( ) -order moments can be approximated to 1/poly ( n ) accuracy using poly ( n ) samples). Weleave it open for future work to modify our algorithm to handle 1/poly ( n ) error. • Smoothed vs non-degenerate setting.

Note that we state our result in the non-degeneratesetting. This seems to be the right kind of assumption when given access to exact moments.If one is given access to inexact moments, one will need to control appropriate conditionnumbers whence smoothed setting is the right assumption to make.We hope that our techniques will lead to progress on the smoothed analysis of mixtures of generalGaussians and also inﬂuence algorithms which work under other kinds of assumptions.

In this section, we review some of the hardness results on learning circuits in order to gauge thedifﬁculty of the problem for our circuit model and to place our result in context. Hardness oflearning has been intensely studied for Boolean circuits as compared to arithmetic circuits. Westate a few of these results from the Boolean world with the intent of drawing analogy.

MCSP and approximate MCSP.

Circuit reconstruction is the arithmetic analogue of exact learn-ing [Ang87] Boolean circuits from membership queries. Exact learning is closely related to theminimum circuit size problem (MCSP). MCSP for Boolean circuits is the following: Given the N = n size truth-table of an n -variate Boolean function f and a number s , check if f is computableby a Boolean circuit of size at most s . Analogously, in case of MCSP for arithmetic circuits, we aregiven the N = ( n + dd ) coefﬁcient vector of an n -variate degree- d polynomial f and are required todetermine if there is an arithmetic circuit of size at most s computing f . MCSP for Boolean circuitsis not in P assuming the existence of cryptographically secure one-way functions [KC00]. In fact, N − o ( ) -approximate MCSP for Boolean circuits is not in P under the same assumption [AH17]. Analogous results about MCSP for arithmetic circuits are not known. However, drawing analogywith the Boolean world, it is plausible that N − δ -approximate MCSP for general arithmetic cir-cuits is not in P, for every constant δ >

0. Here, N = ( n + dd ) is the size of the input coefﬁcient vector.Such a hardness result may even be true for d = n O ( ) .MCSP for a circuit class C is deﬁned like MCSP except that we are now interested in checking if theinput f has a C -circuit of size at most s . It is known that the arithmetic analogue of MCSP is NP- Meaning multiplicative factor approximation of the minimum circuit size However, proving MCSP is NP-hard is quite demanding as that would imply

EXP = ZPP [MW17], which is along-standing open problem. δ ≈ ( + δ ) -approximate MCSP is NP-hard for set-multilinear depth three circuits [Swe18, BIJL18, SWZ19].Similar hardness results are known about MCSP for restricted Boolean circuit classes, e.g., MCSPfor DNF is NP-hard [Mas79,Czo99]. In fact, there is a δ > ( log N ) δ -approximate MCSPis NP-hard for DNF [Fel09] . Approximate MCSP is a difﬁcult problem even for AC circuits: Forevery δ , ǫ > h such that N − δ -approximate MCSP for depth- h circuits is not in BPPunless m -bit Blum integer factorization is in BPTIME ( m ǫ ) [AHM + . Learning implies lower bounds.

It was shown in [FK09] that a randomized polynomial-time(worst case, improper) reconstruction algorithm for an arithmetic circuit class C implies the exis-tence of a polynomial that can be computed on Boolean inputs in BPEXP and cannot be computedby circuits in C of polynomial size . Similar results hold for Boolean circuits. Thus, if we are aim-ing for worst-case polynomial-time reconstruction then we must necessarily focus on classes forwhich super-polynomial lower bounds are known. In fact, to our knowledge, all efﬁcient recon-struction algorithms, in the worst or the average case, that are known till date are either for modelsfor which non-trivial lower bounds are known (or for models that are incomplete). Hardness of PAC learning depth three arithmetic circuits. [KS09b] showed that PAC-learningdepth three arithmetic circuit cannot be done in polynomial time unless the length of a shortestnonzero vector of an n -dimensional lattice can be approximated to within a factor of ˜ O ( n ) inpolynomial time by a quantum algorithm. What this means is that it is hard to PAC learn the classof Boolean functions which match the output of polynomial-sized depth three arithmetic circuitson the Boolean hypercube. Membership queries versus random samples.

Membership queries provide an interactive modelof learning as compared to learning from random samples. For some circuit classes, we know ofefﬁcient learning from membership queries but not learning from random samples. For instance,there is a deterministic polynomial-time algorithm for interpolating sparse polynomials usingmembership queries [KS01], but the same is not known using random samples. The best knowncomplexity for learning an s -sparse n -variate degree- d real polynomial with error ǫ from randomsamples from the real cube [ −

1, 1 ] n is poly ( n , s , 2 d , ǫ ) [APVZ14]. If the random samples are re-stricted to the Boolean hypercube {−

1, 1 } n then exact learning of s -sparse real polynomials can bedone in poly ( n , 2 s ) time provided the input polynomial satisﬁes a certain property [KSDK14].Another example, in the Boolean world, is the quasi-polynomial time algorithm for PAC learn-ing AC [ p ] circuits under the uniform distribution from membership queries [CIKK16]. It isnot known if the same can be achieved without membership queries (like in the case of AC - In contrast, a greedy algorithm solves ( log N ) -approximate MCSP for DNF in poly ( N ) time [Joh74,Lov75,Chv79].If the input is a DNF instead of a truth-table then we know of the following result: s − ǫ factor approximation of theminimum DNF size of an input DNF of size s is not in P, for every ǫ ∈ (

0, 1 ) , assuming Σ p * DTIME ( n O ( log n ) ) [Uma99]. An integer of the form pq for primes p and q , where p = q = Similar results have also been shown for NC and TC circuits [AKRR03]. A deterministic analogue of this result was shown in [Vol16]. This property is satisﬁed with high probability if the coefﬁcients of the input polynomial are perturbed slightly bya random noise. Removing this condition on the input polynomial would immediately improve the state-of-the-arts oflearning ω ( ) -juntas. Difﬁculty of learning homogeneous Σ ∧ ΣΠ [ t ] circuits. It turns out though that learning Σ ∧ ΣΠ [ t ] circuits in the worst-case is quite challenging. The reason is, if we can learn homogeneous Σ ∧ ΣΠ [ t ] circuits efﬁciently then we can solve approximate MCSP for polynomial-size ABP efﬁciently. This can be argued as follows: Suppose f is a homogeneous n -variate polynomial ofdegree d = n O ( ) < n that is computable by an ABP of size poly ( n ) . Let t ≪ d be a numberdividing d . Then, f can be computed by a homogeneous Σ ∧ ΣΠ [ t ] circuit of size n O ( dt ) . Nowsuppose we are able to learn homogeneous Σ ∧ ΣΠ [ t ] circuits of size σ in time poly ( σ t ) , for t = ω ( ) , and output poly ( σ ) -size circuits. Then, we would succeed in solving N o ( ) -approximateMCSP for poly ( n ) -size ABP in poly ( N ) -time, where N = ( n + dd ) . This appears to be a difﬁculttask with our current knowledge on MCSP . If such an efﬁcient approximate MCSP for ABP isunattainable then there is no hope of learning homogeneous Σ ∧ ΣΠ [ t ] circuits in poly ( σ t ) time inthe worst-case (where the output is a poly ( σ ) -size circuit), for t = ω ( ) . In view of the fact that learning general arithmetic circuits is probably a hard problem, researchhas focused on learning interesting special classes of circuits. Here, we give a brief account ofsome of these results from the literature.

Low depth circuits.

A deterministic polynomial-time learning algorithm for ΣΠ circuits or sparsepolynomials was given in [KS01]. Learning ΠΣΠ circuits in randomized polynomial-time followsfrom the classical circuit-factorization algorithm of [KT90]. General

ΣΠΣ circuits are much harderto learn: It follows from a depth reduction result [AV08, Koi12, GKKS16, Tav13] that polynomial-time learning for

ΣΠΣ circuits implies sub-exponential time learning for general circuits. In[Shp09], a randomized quasi-poly ( n , d , | F | ) -time proper learning algorithm was given for ΣΠΣ circuits with two product gates over ﬁnite ﬁelds; if the circuit is additionally multilinear then therunning time is poly ( n , | F | ) . The algorithm was derandomized and generalized in [KS09a] to han-dle ΣΠΣ circuits with constant number of product gates. Over ﬁelds of characteristic zero, [Sin16]gave a randomized proper learning algorithm for ΣΠΣ circuits with two product gates. A ran-domized polynomial-time proper learning algorithm is known for multilinear

ΣΠΣΠ circuitswith top fan-in two over any ﬁeld [GKL12]. Recently, a deterministic proper learning algorithmis given in [BSV19] for multilinear

ΣΠΣΠ circuits with constant top fan-in over ﬁnite ﬁelds; therunning time is quasi-polynomial in the size of the circuit and | F | . Read-once formulas and ABPs.

A deterministic polynomial-time proper learning algorithm isknown for read-once formulas [MV18, SV14]. Read-once oblivious algebraic branching programs(ROABPs) form an important subclass of ABPs that captures several other interesting and well-studied circuit models. There is a randomized polynomial-time proper learning algorithm for Algebraic branching programs (ABP) form a powerful circuit class – a circuit can be converted to an ABP withonly a quasi-polynomial blow up in size. This circuit is obtained by homogenizing the ABP and then dividing it into pieces of length t and multiplying outeach piece. This gives a ΣΠΣΠ [ t ] circuit which is converted to a Σ ∧ ΣΠ [ t ] circuit using Fischer’s formula [Fis94]. See the discussion on MCSP at the start of this section. Provided the circuit satisﬁes a certain rank condition +

00] which was derandomized in quasi-polynomial time in [FS13]. The methodused for ROABP reconstruction can be adapted to give learning algorithms for set-multilinearABPs and non-commutative ABPs [FS13].

Reconstruction under non-degeneracy conditions.

Reconstruction in the worst case appears tobe an extremely hard problem even for circuit models for which good lower bounds are known.It is natural to ask – Can we use the techniques used for proving lower bound for a circuit class C to learn almost all C -circuits? The notion of “almost all C -circuits" is formalized as random C -circuits under some natural distribution, or preferably, as C -circuits satisfying a set of clearly statednon-degenerate conditions such that a random C -circuit (under any natural distribution) is non-degenerate with high probability. In [GKL11], a randomized polynomial-time proper learningalgorithm was given for non-degenerate multilinear formulas having fan-in two. A randomizedpolynomial-time proper learning algorithm for non-degenerate regular formulas having fan-intwo was given in [GKQ14]. An efﬁcient randomized reconstrution for non-degenerate homoge-neous ABPs of width at most √ n is presented in [KNS19]. All the above reconstruction algorithmsare implicitly connected to the corresponding lower bounds: a quasi-polynomial lower boundfor multilinear formulas was already shown in [Raz09], a quasi-polynomial lower bound for reg-ular formulas was proven in [KSS14] and a width lower bound of n is also known for homo-geneous ABPs [Kum19]. Recently, [KS19] gave a randomized polynomial-time proper learningalgorithm for non-degenerate homogeneous depth three circuits depending very explicitly on theideas used in proving an exponential lower bound for this model [NW97]. Also, randomizedpolynomial-time proper learning algorithms for non-degenerate depth three powering circuitsare given in [KS19, GKP18, Kay12a] which have implicit connections to the corresponding lowerbound methods. Tensor decomposition.

Tensor decomposition (which is the same as reconstruction of depth threeset-multilinear circuits) has garnered a lot of attention in the machine learning community and alot of algorithms have been developed for it [Har70,LRA93,DLCC07,AGH + + t polynomials for t >

1, except for the work of [GHK15].An algorithm for learning sums of cubes of quadratics in the non-degenerate case, with the num-ber of summands upper bounded by √ n is implicit in [GHK15]. Their approach is to reduce theproblem to tensor decomposition. However, we believe such an approach cannot be made to han-dle larger number of summands (say poly ( n ) ) even in the quadratic case as the lower bounds forsums of powers of quadratics need substantially newer ideas than the linear case as discussed inSection 1.2.3. Lower bounds and PIT for Σ ∧ ΣΠ [ t ] circuits. Homogeneous Σ ∧ ΣΠ [ t ] circuits have been well-studied in the context of lower bound and polynomial identity testing (PIT). Understanding howto prove lower bound for this model played a vital role in the proof of the exponential lower bound The papers [GKL11,GKQ14] state the results for random formulas, but it is not difﬁcult to state the non-degeneracyconditions by taking a closer look at the algorithms. Note that here the lower bound comes later than the average case reconstruction algorithm; in fact, the ideasarising out of the reconstruction algorithm were helpful in proving the lower bound. s O ( t log s ) -time black-box PIT algorithm is known forhomogeneous Σ ∧ ΣΠ [ t ] ( s ) circuits [For15b]. Improper learning for sums of powers of low degree polynomials.

In this paper, we focus onproper learning (i.e., the input and output representations are from the same circuit class). How-ever, to our knowledge, there is no known efﬁcient learning algorithm for sums of powers ofdegree- t polynomials (worst case or average case) even for t =

2, and even in the improper set-ting. For the t = +

00, KS06] as sums of powers of linear forms (depth three poweringcircuits) is a subclass of ROABP. But, reconstruction for ROABP does not give a learning algorithmfor sums of powers of quadratics as there is a power of a quadratic that requires an exponential-size ROABP [For15b].To summarize, we have efﬁcient learning algorithms (even under non-degeneracy conditions)only for some models for which good lower bounds are known (or for models that are incomplete).Moreover, barring a few exceptions like ROABP, read-once formulas and sparse polynomials, thecircuit models for which efﬁcient learning is known do have the fan-in of the sum gates very small(mostly bounded by a constant). In comparison, our strategy for translating techniques from lowerbounds to learning works for a much larger additive fan-in. Such a translation is only possible forlearning under non-degeneracy condition as worst-case learning is arguably much harder thanproving lower bound. In Section 2, we state our algorithm for learning sums of powers of low degree polynomials andits analysis, and also prove that the non-degeneracy conditions are satisﬁed in the random case.In Section 3, we provide an algorithm for recovering the parameters of a non-degenerate mixtureof zero-mean Gaussians given access to its exact O ( ) -order moments. In Section 4, we show howour new lower bound measure APP can be used to give alternate proofs of the lower bounds forhomogeneous

ΣΠΣΠ [ t ] circuits (almost matching the best known lower bounds which use theshifted partials measure). Section 5 contains some of the interesting open problems and directionsfor future work.In the Appendix, Section A mentions some important facts about the adjoint algebra and a proofof the uniqueness of decomposition in our setting. In Section B, we show a reduction from thevector space decomposition problem to the module decomposition problem. Section C containsa discussion about why the shifted partials measure is unlikely to satisfy the non-degeneracyconditions required for our learning framework. Finally, Sections D and E supply the missingproofs from Sections 2 and 4, respectively. See the discussion on the difﬁculty of learning homogeneous Σ ∧ ΣΠ [ t ] circuits in the worst case in Section 1.4. In fact a generalization of it. Learning sums of powers of low degree polynomials

We prove Theorem 1 in this section. Our algorithm is an implementation of the lower bound tolearning strategy (proposed in Section 1.2) for homogeneous Σ ∧ ΣΠ [ t ] circuits . Section 2.1 de-scribes the algorithm. Section 2.2 describes the analysis of the algorithm. In Section 2.3, we provethat a random Σ ∧ ΣΠ [ t ] circuit satisﬁes our non-degeneracy conditions. Finally Section 2.4 listssome relations between various parameters which are needed for the analysis to work.For simplicity of presentation, we will assume that F is a ﬁnite ﬁeld of sufﬁciently large size andcharacteristic. The analysis goes through over any F that satisﬁes the restrictions on size andcharacteristic stated in Theorem 1 – we simply have to work with a sufﬁciently large subset of F and do a few minor changes to the algorithm and its analysis. We are given black-box access to an n -variate degree- d polynomial f that is computed by a homo-geneous Σ ∧ ΣΠ [ t ] ( s ) formula C , i.e., f ( x ) = c Q m + . . . + c s Q ms , (10)where each c i ∈ F × , Q i is a homogeneous polynomial of degree t , and tm = d . Moreover, for-mula C is non-degenerate (see Deﬁnition 1.1). Assume that the algorithm knows d , t and s andthese parameters satisfy the conditions stated in Theorem 1. The task is to output a homogeneous Σ ∧ ΣΠ [ t ] ( s ) formula for f efﬁciently. The parameters k , n and m , in Algorithm 2, are chosenaccording to Proposition 2.6, which is stated in Section 2.4. A random linear form is a linear formwhose coefﬁcients are chosen independently and uniformly at random from F . A tuple of n randomlinear forms is a tuple of n independently chosen random linear forms. We analyze the correctness and efﬁciency of the algorithm in this section. The three main segmentsof the algorithm are Steps 1-5, Step 6 and Steps 7-8 – we examine these one by one. The missingproofs of the technical statements are given in Section D of the appendix.

Steps 1-5: Constructing two sets of linear operators and obtaining the relevant vector spaces

Let σ be the size of the non-degenerate formula C that computes f . In Step 2, the algorithmcomputes black-box access to a basis of U = (cid:10) π L ( ∂ k x f ) (cid:11) , where L is a tuple of n random linearforms in n many z -variables. This can be done in poly ( σ , s t ) time, with success probability 1 − o ( ) , due to the choice of k (see Proposition 2.6) and the following easily veriﬁable fact. We deviate from the strategy slightly, by introducing an intermediate multi-gcd step (see Algorithm 2), in order tomake the analysis simpler. If s is unknown, we can simply go over s incrementally (starting from 1 and going up to the upper bound statedin Theorem 1) and run the algorithm for each s . A randomized identity test at the end of the algorithm determines ifwe have learnt the circuit correctly with high probability. this will be explained in the analysis of this step lgorithm 2 Learning sums of powers of degree- t polynomials Input : Black-box access to an f ∈ F [ x ] that is computed by a non-degenerate homogeneous Σ ∧ ΣΠ [ t ] ( s ) formula C (as in Equation (10)), i.e., f = c Q m + . . . + c s Q ms . Output : A non-degenerate homogeneous Σ ∧ ΣΠ [ t ] ( s ) formula computing f ./* Constructing two sets of linear operators and obtaining the relevant vector spaces */ Pick a tuple of n random linear forms L = ( ℓ ( z ) , . . . , ℓ n ( z )) , where | z | = n . Let L be the setof operators π L ( ∂ k x ) . Compute black-box access to a basis of U = hL ◦ f i = (cid:10) π L ( ∂ k x f ) (cid:11) . ( Multi-gcd step ) Compute black-box access to a basis of V = (cid:10) π L ( Q ) m − k , . . . , π L ( Q s ) m − k (cid:11) using the basis of U . It will follow from Observation 2.2 that V = V ⊕ . . . ⊕ V s , (with high probability)where V i = (cid:10) G ei (cid:11) , G i = π L ( Q i ) and e = m − k . Get black-box access to a random g ( z ) in V . Pick a tuple of n random linear forms P = ( p ( w ) , . . . , p n ( w )) , where | w | = m . Let L bethe set of operators π P ( ∂ k z ) . Compute black-box access to a basis of W = (cid:10) π P ( ∂ k z g ) (cid:11) . It will follow from Proposition 2.2, W = W ⊕ . . . ⊕ W s , (with high probability)where W i : = (cid:10) π P ( ∂ k z G ei ) (cid:11) = hL ◦ V i i ./* Decomposing the vector spaces */ Compute black-box access to bases of V , . . . , V s by decomposing V and W under the actionof L into indecomposable subspaces./* Recovering the terms of the formula */ Run Steps 1-6 “ d times” to compute black-box access to c ′ Q ( x ) e , . . . , c ′ s Q s ( x ) e for someconstants c ′ , . . . , c ′ s ∈ F × . Compute (dense representations of) the polynomials ˆ c Q ( x ) , . . . , ˆ c s Q s ( x ) for some constantsˆ c , . . . , ˆ c s ∈ F × . Output a homogeneous Σ ∧ ΣΠ [ t ] ( s ) formula for f .20 act 1. Given black-box access to an n-variate degree-d polynomial f ( x ) , black-box access to the polyno-mials in ∂ k x f can be computed in determimistic poly (( nd ) k ) time. Given black-box access to n-variatedegree-d polynomials g , . . . , g r , black-box access to a basis of h g , . . . , g r i can be computed in randomized poly ( n , d , r ) time with probability at least − rd | F | . Observation 2.1.

Let U i = (cid:10) π L ( ∂ k x Q mi ) (cid:11) . With probability − o ( ) over the randomness of L,U = U ⊕ . . . ⊕ U s and dim U i = (cid:18) n + k ( t − ) − k ( t − ) (cid:19) for all i ∈ [ s ] .It follows from the above observation that U i = D z k ( t − ) · π L ( Q i ) m − k E . The multi-gcd step.

In Step 3, the algorithm computes a basis of V = h G e , . . . , G es i , where G i = π L ( Q i ) and e = m − k . This step can be executed in poly ( σ , s t ) time as follows: Observation 2.2.

Let f U i : = D z k ( t − ) · G ei E . With probability − o ( ) over the randomness of L, f U + . . . + f U s = f U ⊕ . . . ⊕ f U s .It follows from the above that V = V ⊕ . . . ⊕ V s , where V i = (cid:10) G ei (cid:11) . The observation also helpsprove the next proposition which gives a way to compute a basis of V efﬁciently. Proposition 2.1.

Let r = ( n + k ( t − ) − k ( t − ) ) and f , . . . , f sr be a basis of U. Let z and z be two distinctvariables in z . Then, the following statements hold:1. If g ( z ) ∈ V then there exist a , . . . a sr , b , . . . , b sr ∈ F such thata f + . . . + a sr f sr z k ( t − ) = b f + . . . + b sr f sr z k ( t − ) = g .

2. If there exist a , . . . a sr , b , . . . , b sr ∈ F such thata f + . . . + a sr f sr z k ( t − ) = b f + . . . + b sr f sr z k ( t − ) , then a f + ... + a sr f sr z k ( t − ) is a polynomial g ( z ) in V. Let f , . . . , f sr be the basis of U obtained in Step 2. It follows from the above proposition that abasis of the space S deﬁned by all ( a , . . . a sr , b , . . . , b sr ) ∈ F sr satisfying a · z k ( t − ) f + . . . + a sr · z k ( t − ) f sr − b · z k ( t − ) f − . . . − b sr · z k ( t − ) f sr = z k ( t − ) · V (and a basis of z k ( t − ) · V ). As we have black-box access to f , . . . , f sr ,we can plug in 2 sr random values to the z -variables in Equation (11) and derive a linear system inthe “variables” a , . . . a sr , b , . . . , b sr . A solution to this system gives a basis of S with probability1 − o ( ) , thereby giving black-box access to a basis of z k ( t − ) · V . Now, using black-box polynomial21actorization [KT90], we get black-box access to a basis of V . This completes the multi-gcd step.Let g , . . . , g s be the basis of V obtained in Step 3. This step also chooses black-box access to a ran-dom g ∈ V , i.e., g = a g + . . . + a s g s , where a , . . . , a s are picked independently and uniformly atrandom from F . In Step 5, the algorithm computes black-box access to a basis of W = (cid:10) π P ( ∂ k z g ) (cid:11) ,where P is a tuple of n random linear forms in m many w -variables. By Fact 1, this can be donein poly ( σ ) time with success probability 1 − o ( ) . Proposition 2.2.

Let W i = (cid:10) π P ( ∂ k z G ei ) (cid:11) . With probability − o ( ) over the randomness of L and P,W = W ⊕ . . . ⊕ W s and dim W i = (cid:18) m + k ( t − ) − k ( t − ) (cid:19) for all i ∈ [ s ] . Step 6: Decomposing the vector spaces

In Step 6, the algorithm computes black-box access to bases of V = h G e i , . . . , V s = h G es i bydecomposing the spaces V and W under the action of the set of operators L = π P ( ∂ k z ) . We nowexplain how this is carried out efﬁciently. Deﬁnition 2.1 (Indecomposable decomposition) . Let V and W be two vector spaces and L a set oflinear operators from V to W . A decomposition of the spaces V and W as: V = V ⊕ . . . ⊕ V s W = W ⊕ . . . ⊕ W s is indecomposable under the action of L if the following hold for every i ∈ [ s ] ,(a) hL ◦ V i i ⊆ W i (b) There do not exist spaces V i , V i , W i , W i such that V i = V i ⊕ V i , W = W i ⊕ W i and hL ◦ V i i ⊆ W i , hL ◦ V i i ⊆ W i .In our case W = hL ◦ V i and W i = hL ◦ V i i (by Proposition 2.2). Also, the decomposition V = V ⊕ . . . ⊕ V s and W = W ⊕ . . . ⊕ W s is indecomposable under the action of L as dim V i = i ∈ [ s ] . It remains to show that this indecomposable decomposition is unique and it canbe computed efﬁciently. Towards this, we take inspiration from Section A (particularly, CorollaryA.1) and analyze a suitable adjoint algebra. The adjoint algebra.

Recall, g , . . . , g s is the basis of V computed in Step 3. Let h , . . . , h sq bethe basis of W computed in Step 5, where q = ( m + k ( t − ) − k ( t − ) ) = | w k ( t − ) | . With regard to thebases g , . . . , g s and h , . . . , h sq , every element of L can be naturally identiﬁed with a sq × s matrixby identifying V with F s and W with F sq . We will work with this matrix representation of the In fact, a simpler argument works here as the factorization is special. L which can be computed in poly ( σ ) time from black-box access to g , . . . , g s and h , . . . , h sq (using Fact 1, Proposition 2.6 and solving linear systems). Letadj ( L ) : = (cid:8) ( D , E ) ∈ M s ( F ) × M sq ( F ) : KD = EK for all K ∈ L (cid:9) . (12)Observe that adj ( L ) is an F -subalgebra of M s ( F ) × M sq ( F ) and a basis of adj ( L ) can be com-puted in poly ( σ ) time by solving a system of linear equations arising from the equation KD = EK for all K ∈ L . Deﬁneadj ( L ) : = (cid:8) D ∈ M s ( F ) : there exists an E ∈ M sq ( F ) such that ( D , E ) ∈ adj ( L ) (cid:9) . (13)Clearly, adj ( L ) is an F -subalgebra of M s ( F ) and computing a basis of adj ( L ) from a basis ofadj ( L ) is a simple task. The following proposition shows that adj ( L ) is diagonalizable.Let A ∈ GL s ( F ) be the basis change matrix from ( g , . . . , g s ) to ( G e , . . . , G es ) and D : = { diag ( a , . . . , a s ) : a i ∈ F for all i ∈ [ s ] } ⊂ M s ( F ) . Proposition 2.3. A · adj ( L ) · A − = D .Diagonalizing adj ( L ) . Use the basis of adj ( L ) to pick a random matrix D ∈ r adj ( L ) . Bythe above proposition, and as | F | ≫ s , the eigenvalues of D are distinct with probability 1 − o ( ) . Compute the eigenvalues a , . . . , a s by factorizing the characteristic polynomial of D . Now,compute an ˜ A ∈ GL s ( F ) such that ˜ AD ˜ A − = diag ( a , . . . , a s ) – this can be done by solving a linearsystem. As D has distinct eigenvalues and ADA − is also diagonal, there exist a permutationmatrix P ∈ GL s ( F ) and a diagonal matrix S ∈ GL s ( F ) such that˜ A = PS · A . Computing bases of V , . . . , V s . Observe that ( G e G e . . . G es ) · A = ( g g . . . g s ) (by deﬁnition of A ) ⇒ ( G e G e . . . G es ) · S − P − = ( g g . . . g s ) · ˜ A − .Thus, by computing black-box access to the entries of the vector ( g g . . . g s ) · ˜ A − , we get black-box access to G e , . . . , G es (up to permutation and scaling). By relabeling, we can assume that Step 6computes black-box access to bases of V = h π L ( Q ) e i , . . . , V s = h π L ( Q s ) e i in order. Steps 7-8: Recovering the terms of the formula

At the end of Step 6, we have black-box access to c ′ π L ( Q ) e , . . . , c ′ s π L ( Q s ) e , where c ′ , . . . , c ′ s ∈ F and Q , . . . , Q s ∈ F [ x ] are unknown, but L = ( ℓ ( z ) , . . . , ℓ n ( z )) is known from Step 1. Theidea now is to get black-box access to c ′ Q ( x ) e , . . . , c ′ s Q s ( x ) e by executing Steps 1-6 several times(each time by altering L slightly). Hereafter, the algorithm uses black-box polynomial factoriza-tion [KT90] to get black-box access to ˆ c Q ( x ) , . . . , ˆ c s Q s ( x ) for some constants ˆ c , . . . , ˆ c s ∈ F × .Then, the sparse polynomial interpolation algorithm of [KS01] gives dense representations of the This is where we need the assumption that univariate polynomial factorization over F can be done in randomizedpolynomial time. -sparse polynomials ˆ c Q ( x ) , . . . , ˆ c s Q s ( x ) . Finally, we obtain a homogeneous Σ ∧ ΣΠ [ t ] ( s ) for-mula computing f by solving a linear system. Let us see how this idea is made to work. Fixing the query points . The following remarks imply that the points at which the algorithm needsto query c ′ Q ( x ) e , . . . , c ′ s Q s ( x ) e in order to employ the black-box polynomial factorization algo-rithm and the sparse polynomial interpolation algorithm can be ﬁxed a priori right after Step 6. Remarks.

1. The sparse polynomial interpolation algorithm of [KS01] works with non-adaptive queries,i.e., each subsequent query point does not depend on answers to the previous queries.2. The black-box polynomial factorization algorithm of [KT90] also works with non-adaptivequeries. In other words, once the set of points at which we need to evaluate the irreduciblefactors of an input polynomial f (given as a black-box) is ﬁxed, the algorithm uses onlynon-adaptive queries to f in order to compute these evaluations. Evaluating c ′ Q ( x ) e , . . . , c ′ s Q s ( x ) e at a query point . Let a = ( a , . . . , a n ) ∈ F n be a query point. Wewish to compute c ′ Q ( a ) e , . . . , c ′ s Q s ( a ) e from black-box access to c ′ π L ( Q ) e , . . . , c ′ s π L ( Q s ) e , where L = ( ℓ ( z ) , . . . , ℓ n ( z )) and z = ( z , . . . , z n ) . Let ℓ l ( z ) = r l z + . . . + r ln z n ,where r l , . . . , r ln ∈ F are chosen uniformly and independently at random from F (in Step 1) forall l ∈ [ n ] . Now, pick ( r , . . . , r n ) ∈ r F n . For each y ∈ {

1, . . . , d } , deﬁne˜ ℓ l ( y , z ) : = ( yr l + ( − y ) a l ) z + r l z + . . . + r ln z n ,for every l ∈ [ n ] . Observe that r ′ l : = yr l + ( − y ) a l is uniformly distributed over F as r l is chosenrandomly from F . Moreover, r ′ , r , . . . , r n , . . . , r ′ n , r n , . . . , r nn are independent of each otheras r , . . . , r n are independently chosen. Hence,˜ L ( y ) : = ( ˜ ℓ ( y , z ) , . . . , ˜ ℓ n ( y , z )) is a tuple of random linear forms in the z -variables for every y ∈ {

1, . . . , d } . If we execute Steps1-6 by replacing L by ˜ L ( y ) in Step 1 then we will get black-box access to˜ c ρ ( ) · π ˜ L ( y ) ( Q ρ ( ) ) e , . . . , ˜ c ρ ( s ) · π ˜ L ( y ) ( Q ρ ( s ) ) e , with probability 1 − o ( ) , (14)where ρ is an unknown permutation of [ s ] and ˜ c , . . . , ˜ c s ∈ F × are also unknown. We can ﬁnd ρ efﬁciently as follows: Observe that ˜ L ( y ) z = = L z = . Hence, the ratio c ′ i · π L z = ( Q i ) e ˜ c j · π ˜ L ( y ) z = ( Q j ) e = c ′ i · [ G ei ] z = ˜ c j · [ G ej ] z = = c ′ i ˜ c i if i = j , = a non-constant rational function in z , if i = j .The second equality is because of Condition 4 of the non-degeneracy condition (Deﬁnition 1.1),which states that there is an L such that [ G e ] z = , . . . , [ G es ] z = are F -linearly independent. Thus,24or a random L , [ G e ] z = , . . . , [ G es ] z = are F -linearly independent with probability 1 − o ( ) . Now,we can discover the permutation ρ by evaluating the ratio c ′ i · π L z = ( Q i ) e ˜ c j · π ˜ L ( y ) z = ( Q j ) e at poly ( d ) -many random points in F n − and checking if all the evaluations are the same. Thisprocess succeeds with probability 1 − o ( ) (as | F | is sufﬁciently large) and also gives us c ′ i ˜ c i for all i ∈ [ s ] . From Equation (14) and the knowledge of ρ and c ′ i ˜ c i we obtain black-box access to c ′ · π ˜ L ( y ) ( Q ) e , . . . , c ′ s · π ˜ L ( y ) ( Q s ) e .By setting z = z = . . . = z s = p i ( y ) : = c ′ i · Q i ( yr + ( − y ) a , . . . , yr n + ( − y ) a n ) e ,for every i ∈ [ s ] . As y is arbitrarily ﬁxed in [ d ] , we can compute p i ( ) , . . . , p i ( d ) for all i ∈ [ s ] .Treating p i ( y ) as a univariate polynomial in y and observing that deg y p i < d , we can interpolatethe polynomial p i ( y ) from the above evaluations. Notice that p i ( ) = c ′ i Q i ( a ) e for all i ∈ [ s ] . Outputting a Σ ∧ ΣΠ [ t ] ( s ) formula . As explained before, the black-box polynomial factorizationalgorithm and the sparse polynomial interpolation algorithm together give us dense representa-tions of the polynomials ˆ c Q ( x ) , . . . , ˆ c s Q s ( x ) for some unknown ˆ c , . . . , ˆ c s ∈ F . We know thatthere exist u , . . . , u s ∈ F such that f = u · [ ˆ c Q ( x )] m + . . . + u s · [ ˆ c s Q s ( x )] m .By treating u , . . . , u s as formal variables, we can obtain a linear system in u , . . . , u s by evaluat-ing f and [ ˆ c Q ( x )] m , . . . , [ ˆ c s Q s ( x )] m at s random points in F n . As [ ˆ c Q ( x )] m , . . . , [ ˆ c s Q s ( x )] m are F -linearly independent (which follows from the non-degeneracy condition), the solution to thesystem gives u , . . . , u s that satisfy the above equation with probability 1 − o ( ) . Σ ∧ ΣΠ [ t ] circuit is non-degenerate (proof of Lemma 1.2) We show that a homogeneous Σ ∧ ΣΠ [ t ] ( s ) formula c Q m + . . . + c s Q ms is non-degenerate withprobability 1 − o ( ) if the coefﬁcients of the degree- t polynomials Q , . . . , Q s are chosen indepen-dently and uniformly at random from a set S ⊆ F of size at least ( ns ) · t . For this, it is sufﬁ-cient to show the existence of one non-degenerate homogeneous Σ ∧ ΣΠ [ t ] ( s ) formula, as longas | S | ≫ ds · ( n + kt − kt ) is sufﬁciently large (Proposition 2.6). This is because of Schwarz-Zippellemma and the fact that all the non-degeneracy conditions are about vanishing of some determi-nant. By union bound, the total error probability remains o ( ) as | F | is sufﬁciently large. onstruction of a non-degenerate homogeneous Σ ∧ ΣΠ [ t ] formula We construct a homogeneous Σ ∧ ΣΠ [ t ] ( s ) formula that satisﬁes non-degeneracy Conditions 1, 3and 4 (Deﬁnition 1.1). It would easily follow from this construction that there is a homogeneous Σ ∧ ΣΠ [ t ] ( s ) formula that also satisﬁes Condition 2 of non-degeneracy. Let x = y ⊎ z , where z = { z , . . . , z n } . The value of n is ﬁxed in Proposition 2.6. Let L be an n -tuple of linear forms in z -variables that deﬁnes the following afﬁne projection: All the y -variables map to 0, and z u mapsto z u for all u ∈ [ n ] . Consider a homogeneous Σ ∧ ΣΠ [ t ] ( s ) formula C that computes f = Q m + . . . + Q ms , (15)where Q i = R i ( y , z ) + G i ( z ) such that R i , G i are degree- t homogeneous polynomials, every mono-mial in R i has a y -variable, and G i is y -free for all i ∈ [ s ] . Clearly, π L ( Q i ) = G i . We now construct R , . . . , R s and G , . . . , G s so that C satisﬁes non-degeneracy Conditions 1, 3 and 4. Constructing R , . . . , R s . The number of z -monomials of degree- ( t − ) is b : = ( n + t − t − ) . Let thesemonomials be γ , . . . , γ b . Consider a combinatorial design on the y -variables, i.e., a system of s subsets of y -variables, namely S , . . . , S s , such that for all i , j ∈ [ s ] | S i | = kb and | S i ∩ S j | ≤ k − i = j .Such a set-system (also known as the Nisan-Wigderson design) exists if | y | = n − n ≥ ( kb ) and s ≤ ( kb ) k – these two conditions are satisﬁed by Proposition 2.6. Denote the kb distinct y -variables in S i by { y ijl : j ∈ [ k ] and l ∈ [ b ] } and deﬁne R i : = ∑ j ∈ [ k ] , l ∈ [ b ] y ijl · γ l .Let the spaces U , U , . . . , U s be as in Deﬁnition 1.1. Proposition 2.4. U = U + . . . + U s and U i = D z k ( t − ) · G m − ki E for every i ∈ [ s ] .Constructing G , . . . , G s . We set G , . . . , G s in such a way that U + . . . + U s = U ⊕ . . . ⊕ U s . Let p : = ⌊ √ n ⌋ . Consider a combinatorial design on the z -variables, i.e., a system of s subsets of z -variables, namely z , . . . , z s , such that for all i , j ∈ [ s ] | z i | = p and | z i ∩ z j | ≤ (cid:22) min ( t ( m − k ) , √ n ) (cid:23) ≤ (cid:22) p + (cid:23) , if i = j .Such a set-system exists if s ≤ p (cid:22) min ( t ( m − k ) , √ n ) (cid:23) ,which is satisﬁed by Proposition 2.6. Let G i : = ∑ z ∈ z i z ! t , for all i ∈ [ s ] . (16) in fact, it can be computed efﬁciently roposition 2.5. If m > k and P , . . . , P s ∈ F [ z ] are such that deg z P i ≤ k ( t − ) for all i ∈ [ s ] andP · G m − k + . . . + P s · G m − ks = then P = P = . . . = P s = . The condition m > k is satisﬁed by Proposition 2.6. So, Propositions 2.4 and 2.5 imply that U = U ⊕ . . . ⊕ U s and f U + . . . + f U s = f U ⊕ . . . ⊕ f U s . The proof of Proposition 2.5 (in particularObservation D.1) also implies that [ G m − k ] z = , . . . , [ G m − ks ] z = are F -linearly independent.In order to satisfy Condition 2, we draw an analogy between the polynomials g = G e + . . . + G es and f = Q m + . . . + Q ms (Equation (15)). We mimic the above construction of Q , . . . , Q s and L thatensures U = U ⊕ . . . ⊕ U s and dim U i = ( n + k ( t − ) − k ( t − ) ) to construct G , . . . , G s and P such that W = W ⊕ . . . ⊕ W s and dim W i = ( m + k ( t − ) − k ( t − ) ) . For this, we need to satisfy n − m ≥ ( kc ) , s ≤ ( kc ) k (where c = ( m + t − t − ) ), e > k and s ≤ (cid:18)(cid:22) √ m (cid:23)(cid:19) (cid:22) min ( t ( e − k ) , √ m ) (cid:23) .All these relations are taken care of by Proposition 2.6. From the statement of Theorem 1, we have n ≥ d , t ≤ q log d · log log d , | F | ≥ ( ns ) · t and s ≤ min (cid:16) n d · t , exp ( n · t ) (cid:17) .We leave the proof of the following proposition as an exercise. Proposition 2.6.

Let n = ⌊ n · t ⌋ , m = ⌊ n · t ⌋ , k = l · t · log s log n m , and (borrowing notations from Section2.3) b = ( n + t − t − ) and c = ( m + t − t − ) . Then, the following relations are satisﬁed:1. | F | ≥ ( ns ) · t ≫ d · ( n + k − k ) ,2. | F | ≥ ( ns ) · t ≫ ds · ( n + kt − kt ) ,3. ( nd ) k = poly ( n , s t ) ,4. ( n + kt − kt ) = poly ( n , s t ) ,5. n − n ≥ ( kb ) , The construction of these G , . . . , G s should not be confused with the choice of G i in Equation (16). The choices of Q i ’s and G i ’s till Proposition 2.5 are used to show that Condition 1, 3 and 4 of non-degeneracy are satisﬁed, whereaswe choose the G i ’s afresh to show that Condition 2 of non-degeneracy is satisﬁed. Finally, a union bound ensures thatall the conditions of non-degeneracy are satisﬁed with high probability. . e = m − k > k,7. n − m ≥ ( kc ) ,8. s ≤ ( kc ) k ,9. s ≤ (cid:16)j √ m k(cid:17) (cid:22) min ( t ( e − k ) , √ m ) (cid:23) . Relations 1 and 2 are used for applications of the Schwartz-Zippel lemma at various places toensure that the error probability is bounded by o ( ) . Relations 3 and 4 guarantee that the runningtime of the algorithm is poly ( n , σ , s t ) . Relations 5-9 are used in Section 2.3 to show that a randomhomogeneous Σ ∧ ΣΠ [ t ] formula is non-degenerate with high probability. In this section, we describe an algorithm for learning the parameters of a mixture of Gaussians inthe non-degenerate case, given the moments of the mixture exactly . Since we can only estimate themoments given samples from the mixture, it is an extremely interesting problem to modify ouralgorithm to make it work with inexact moments (in the smoothed analysis setting) and we leaveit open for future work. Our algorithm also extends naturally to the general mean case but theanalysis gets more complicated and we only focus on the zero-mean case for simplicity.Consider a mixture of Gaussians, D = ∑ si = w i N ( Σ i ) , ∑ si = w i = w i > i ∈ [ s ] .Let us deﬁne quadratic polynomials, Q , . . . , Q s , corresponding to the covariance matrices, by Q i ( x ) = x T Σ i x . Also deﬁne the polynomials f m = s ∑ i = w i Q mi , f m − = s ∑ i = w i Q m − i .Also let us set n = ⌊ n ⌋ , m = ⌊ n ⌋ , ℓ = l · log s log n m , m = ℓ , e = ℓ . We will call the mixture D non-degenerate if f m and f m − satisfy the non-degeneracy conditions in Deﬁnition 1.1 with k = ℓ and k = ℓ −

1, respectively. We will need the following elementary proposition about the momentgenerating function of a Gaussian and mixture of Gaussians.

Proposition 3.1.

Suppose Y ∼ N ( µ , Σ ) . Then E h e h x , Y i i = e h µ , x i + x T Σ x . Similary for a mixture ofGaussians, D = ∑ si = w i N ( µ i , Σ i ) , the moment generating function is ∑ si = w i e h µ i , x i + x T Σ i x . The next lemma states an efﬁcient algorithm for computing the parameters of a non-degeneratezero-mean Gaussian mixture given access to its exact moments.

Lemma 3.1.

Let s ≤ exp (cid:16) n (cid:17) . There is a randomized poly ( n , b , s ) time algorithm (b denotes the totalbit complexity of the parameters) that given black-box access to the exact O ( log ( s ) / log ( n )) order momentsof a non-degenerate mixture of zero-mean Gaussians D , recovers its parameters ( w i , Σ i ) si = . emark 3. Black-box access to the moments means given a vector x ∈ R n , access to the moments of therandom variable h x , Y i , where Y ∼ D . In other words, this means black-box access to the polynomialsf m and f m − . Of course, when s ≤ poly ( n ) , our algorithm only needs access to O ( ) order moments ofthe mixture D in which case black-box access to the moments is immediate because we can compute all themoments explicitly. But we state our algorithm in this general form in hope of applicability in settingswhere black-box access to the moments might be available without an explicit access to the moments.Proof. (Of Lemma 3.1) Algorithm 3

Learning mixtures of zero-mean Gaussians

Input : Black-box access to the exact moments of a non-degenerate mixture of zero-mean Gaus-sians D = ∑ si = w i N ( Σ i ) . Output : The parameters of the mixture ( w i , Σ i ) si = . Use black-box access to the moments to get black-box access to the polynomials f m and f m − (will be explained in the analysis). Use Algorithm 2 to obtain representations f m = ∑ si = w ′ i ( Q ′ i ) m and f m − = ∑ si = e w i (cid:16)f Q i (cid:17) m − (with w ′ i , e w i non-zero): we get ( w ′ i , Q ′ i ) si = and (cid:16) e w i , e Q i (cid:17) si = . Find a permutation σ : [ s ] → [ s ] such that c i = Q ′ i / e Q σ ( i ) is a constant (will be explained in theanalysis why σ is a permutation). Let us denote d i = e w σ ( i ) w ′ i c m − i . Output (cid:16) w ′ i d mi , d i Σ ′ i (cid:17) si = , where Σ ′ i is such that Q ′ i ( x ) = x T Σ ′ i x .The algorithm is described in Algorithm 3. Suppose Y ∼ D . Then by Proposition 3.1, the momentgenerating function of D is given by E h e h x , Y i i = s ∑ i = w i e x T Σ i x = s ∑ i = w i e Q i ( x ) .Equating the degree 2 m part on both sides, we get m ! ( m ) ! · E (cid:2) h x , Y i m (cid:3) = s ∑ i = w i Q i ( x ) m .Thus given black-box access to the moments of D , we can get black-box access to the polynomials f m and f m − . This explains Step 1 of the algorithm. Theorem 1 guarantees us that there existpermutations π : [ s ] → [ s ] and π : [ s ] → [ s ] , and constants c ′ , . . . , c ′ s and e c , . . . , e c s (all non-zero)such that Q ′ i = c ′ i Q π ( i ) , w π ( i ) = w ′ i ( c ′ i ) m , e Q i = e c i Q π ( i ) and w π ( i ) = e w i ( e c i ) m − for all i ∈ [ s ] . We also have that (cid:0) Q mi (cid:1) si = are linearly independent (this is implied by the non-degeneracy condition) and hence the Q ′ i s span distinct one-dimensional spaces. Thus, there isexactly one j such that Q ′ i / e Q j is a constant which is given by j = π − ( π ( i )) . This explains Step 3where σ is given by π − ◦ π . Now, c i = Q ′ i e Q σ ( i ) = c ′ i Q π ( i ) e c σ ( i ) Q π ( σ ( i )) = c ′ i e c σ ( i ) .29hen, d i = e w σ ( i ) w ′ i c m − i = w π ( σ ( i )) e c m − σ ( i ) · ( c ′ i ) m w π ( i ) · e c m − σ ( i ) ( c ′ i ) m − = c ′ i .Thus we output (cid:16) w π ( i ) , Σ π ( i ) (cid:17) si = . This completes the proof.Next, we instantiate the above lemma with distributional assumptions on the covariance matriceswhich will satisfy the non-degeneracy condition with high probability. Corollary 3.1.

Let s ≤ exp (cid:16) n (cid:17) . There is a randomized poly ( n , b , s ) time algorithm (b denotes thetotal bit complexity of the parameters) that given black-box access to the exact O ( log ( s ) / log ( n )) ordermoments of a random mixture of zero-mean Gaussians D , recovers its parameters ( w i , Σ i ) si = . Randomhere means that Σ i = A i A Ti the entries of A i ’s are chosen uniformly and indepdently at random from anarbitrary set S ⊂ Q of size | S | ≥ ( ns ) (and of course w i > for all i).Proof. The non-degeneracy condition is given by non-vanishing of a (non-zero) polynomial p ( M , . . . , M s ) in the entries of symmetric matrices M , . . . , M s of degree D at most ( ns ) (seeSection 2.3). First of all note that there exist PSD matrices P , . . . , P s s.t. p ( P , . . . , P s ) =

0. Thiscan be seen by choosing P , . . . , P s to be symmetric diagonally dominant (SDD) and then ap-plying the Schwarz-Zippel lemma. That is choose the diagonal entries of P i ’s from the interval { n D , . . . , n D + D } and the non-diagonal entries from the interval {

0, . . . , D } (uniformly andindependently). Then p ( P , . . . , P s ) = P i ’s are SDD and hence PSD.Now consider the polynomial q ( A , . . . , A s ) = p ( A A T , . . . , A s A Ts ) . As argued above, q is a non-zero polynomial and of degree at most 2 D . Hence if we choose the entries of A i ’s uniformly andindepdently at random from an arbitrary set S ⊂ Q of size | S | ≥ ( ns ) , then q ( A , . . . , A s ) = − o ( ) and hence Σ i ’s are non-degenerate w.p. 1 − o ( ) . Now the corollary follows fromLemma 3.1. ΣΠΣΠ [ t ] circuits using APP

We prove Theorem 2 in this section. The idea is to choose two parameters k and n appropriatelysuch that the measure APP k , n of a term of a homogeneous ΣΠΣΠ [ t ] ( s ) circuit is “small”. Wethen construct an explicit polynomial f n , d such that APP k , n ( f n , d ) is “high” which leads to a lowerbound on s . It is the choice of the measure APP that is novel in this lower bound proof. Themissing proofs of the technical statements can be found in Section E of the appendix. t case Let n , d , t ∈ N such that n ≥ d and ln nd ≤ t ≤ d · e · ln d . We set a few parameters as follows: • (Order of the derivatives) k = ⌊ δ · dt ⌋ , where δ = e , • (Number of variables after afﬁne projection) n = ⌊ c · k ⌋ , where c = · ln nk ln dk . Observation 4.1.

If the parameters k , c and n are chosen as above then k ≥ ⌊ ln d ⌋ , c ≥ and n ≤ d ln ln d . bservation 4.2. Let f ∈ F [ x ] be a homogeneous n-variate degree-d polynomial. Then, APP k , n ( f ) ≤ (cid:18) d − k + n − n − (cid:19) .In Section 4.3, we construct an explicit family of homogeneous, multilinear polynomials { f n , d , t } n , d in VNP such that

APP k , n ( f n , d , t ) equals the above upper bound (see Proposition 4.5). Upper bounding the measure for a homogeneous

ΣΠΣΠ [ t ] circuit. Let C be a polynomial com-puted by a homogeneous ΣΠΣΠ [ t ] ( s ) circuit, i.e., C = Q Q · · · Q m + . . . + Q s Q s · · · Q sm s , (17)where every Q ij is a homogeneous polynomial of degree at most t . By multiplying out factors ifnecessary, we can assume that all but one of the factors of T i = Q i Q i · · · Q im i have degree in [ t , 2 t ] .So, m i ≤ m : = ⌊ dt ⌋ + i ∈ [ s ] . By subadditivity of the measure, we infer the following: Proposition 4.1.

APP k , n ( C ) ≤ s · ( mk ) · ( n + ktn ) . Putting Propositions 4.1 and 4.5 together, we get the desired lower bound.

Proposition 4.2.

Any homogeneous

ΣΠΣΠ [ t ] ( s ) circuit computing f n , d , t must satisfys ≥ ( d − k + n − n − )( mk ) · ( n + ktn ) = (cid:16) nd (cid:17) Ω ( dt ln t ) . Remark.

Although, in our presentation, f n , d , t depends on t , it is easy to get rid of t from the deﬁ-nition of the hard polynomial by using a simple interpolation trick (as in Lemma 14 of [KSS14]). t case Let n , d , t ∈ N such that n ≥ d and 1 ≤ t ≤ min n ln n e · ln d , d o . Set the parameters k , n as follows: • (Order of the derivatives) k = ⌈ δ · dt ⌉ , where δ = e , • (Number of variables after afﬁne projection) n = ⌈ n kd ⌉ .The hard polynomial f n , d , t is deﬁned in Section 4.3. Propositions 4.1 and 4.5 imply the following: Proposition 4.3.

Any homogeneous

ΣΠΣΠ [ t ] ( s ) circuit computing f n , d , t must satisfys ≥ ( d − k + n − n − )( mk ) · ( n + ktn ) = n Ω ( dt ) .31 .3 The hard polynomial Let the parameters n , d , t , k , n be as in either Section 4.1 or Section 4.2. In this section, we describethe construction of the hard polynomial f n , d , t . Let n : = n ( d − k ) and n : = n − n . Polynomial f n , d , t is a homogeneous, multilinear polynomial in two sets of variables y and u such that | y | = n and | u | = n . Further, u = u ⊎ . . . ⊎ u d − k , where each set u i has n variables { u i ,1 , . . . , u i , n } .Consider all degree- ( d − k ) set-multilinear monomials in the u -variables with respect to the parti-tion u = u ⊎ . . . ⊎ u d − k . Such a set-mulilinear monomial β = u j u j · · · u d − k , j d − k can be naturallyidentiﬁed with a function φ β : [ d − k ] → [ n ] i j i .We say φ β is non-decreasing if φ β ( i ) ≤ φ β ( i + ) for all i ∈ [ d − k − ] . Let B : = { β : φ β is non-decreasing } and z = { z , . . . , z n } be a set of n variables. Observe that there is a one-to-one correspondencebetween monomials in B and z -monomials of degree d − k which is given by the projection map π : u → z u i , j z j . (18)Hence, π ( B ) = z d − k and | B | = ( d − k + n − n − ) . Order the monomials in B lexicographically and callthem (cid:18) β , . . . , β ( d − k + n − n − ) (cid:19) . There are ( n k ) multilinear monomials in y -variables of degree k . Proposition 4.4. ( n k ) ≥ ( d − k + n − n − ) . Order the multilinear degree- k y -monomials lexicographically and call the ﬁrst ( d − k + n − n − ) of them (cid:18) µ , . . . , µ ( d − k + n − n − ) (cid:19) . Deﬁne f n , d , t ( y , u ) : = ∑ i ∈ ( d − k + n − n − ) µ i · β i .It is an easy exercise to show that the family of polynomials deﬁned by f n , d , t is in VNP as thecoefﬁcient of any given monomial in f n , d , t can be computed efﬁciently. Proposition 4.5.

APP k , n ( f n , d , t ) = ( d − k + n − n − ) . We develop a meta framework for turning lower bounds for arithmetic circuit classes into learningalgorithms for the circuits classes in the non-degenerate case. A rudimentary form of this frameworkwas ﬁrst used in [KS19] to design learning algorithms for learning homogeneous depth threecircuits in the average case. We use the framework to design learning algorithms for sums ofpowers of low degree polynomials. The problem of learning sums of powers of linear polynomials(aka symmetric tensor decomposition) has been extensively studied in areas across science andmany algorithms have been developed for it (again in the non-degenerate case; in the worst case32t is NP-hard [Hås90, Shi16]). However, even for learning sums of quadratic polynomials, we arenot aware of any algorithm in the literature, except for an algorithm implicit in [GHK15] whichworks in a limited range of parameters. The problem of learning sums of powers of quadraticshas an intimate connection to the well known problem of mixtures of Gaussians (Sections 1.3 and3). We hope that our paper will lead to further algorithms for learning arithmetic circuits andalso new connections between learning arithmetic circuits and machine learning problems, whichis promising since tensor decomposition (aka learning depth three set-multilinear circuits) hasfound so many applications in ML. We list some of the interesting open problems below. • Smoothed analysis of mixtures of (general) Gaussians.

One immediate open problem is tomake our algorithm resilient to noise. This is relevant to mixtures of Gaussians since givensamples from the mixture, we can only estimate its moments (upto 1/poly ( n ) error usingpoly ( n ) samples). We are hopeful that an appropriate modiﬁcation of our algorithm willlead to polynomial time algorithm for mixtures of general Gaussians in the smoothed settingand when the number of components s ≤ poly ( n ) . • Learning other arithmetic circuit classes.

It is natural to implement our framework forother arithmetic circuit classes for which we have lower bounds e.g. set-multilinear circuits,multilinear formulas, regular formulas etc. [Raz09, KSS14]. • More connections between learning arithmetic circuits and ML.

As already mentioned,tensor decomposition ﬁnds multiple applications in ML (e.g. see [AGH + • Combining SoS and our techniques.

One of the algorithmic techniques which is very suc-cessfully used to design algorithms for tensor decomposition is the Sum of Squares (SoS)method [BKS15, GM15, HSSS16, MSS16, RSS18]. Can SoS be also used to design learningalgorithms for sums of powers of low degree polynomials (these algorithms might also bemore robust to noise)? Perhaps combining SoS with our techniques might help? • New lower bounds using

APP . Can the method of afﬁne projections of partials, perhaps alsocombining with shifts, be used to prove new lower bounds? May be for depth-5 circuits?

Acknowledgments

We would like to thank Youming Qiao for insightful discussions on simultaneous block-diagonalizationof rectangular matrices during the workshop on

Algebraic Methods held at the Simons Institute forthe Theory of Computing in December 2018. We thank Youming particularly for his suggestionto analyze the adjoint algebra and for referring us to the paper [CIK97]. We thank Navin Goyalfor multiple helpful discussions on learning mixtures of Gaussians and related problems and forreferring us to the paper [GHK15]. We would also like to thank Ravi Kannan for pointing a bugin the statement and proof of Corollary 3.1 in an earlier version.33 eferences [ABG +

14] Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, and James R. Voss.The more, the merrier: the blessing of dimensionality for learning large gaussian mix-tures. In

Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona,Spain, June 13-15, 2014 , pages 1135–1164, 2014.[AGH +

14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgar-sky. Tensor decompositions for learning latent variable models.

The Journal of MachineLearning Research , 15(1):2773–2832, 2014.[Agr05] Manindra Agrawal. Proving lower bounds via pseudo-random generators. In

Interna-tional Conference on Foundations of Software Technology and Theoretical Computer Science ,pages 92–105. Springer, 2005.[AH17] Eric Allender and Shuichi Hirahara. New Insights on the (Non-)Hardness of CircuitMinimization and Related Problems. In ,pages 54:1–54:14, 2017.[AHK12] Animashree Anandkumar, Daniel J. Hsu, and Sham M. Kakade. A method of mo-ments for mixture models and hidden markov models. In

COLT 2012 - The 25th AnnualConference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland , pages 33.1–33.34,2012.[AHM +

08] Eric Allender, Lisa Hellerstein, Paul McCabe, Toniann Pitassi, and Michael E. Saks.Minimizing Disjunctive Normal Form Formulas and AC0 Circuits Given a Truth Ta-ble.

SIAM J. Comput. , 38(1):63–84, 2008. Conference version appeared in the proceed-ings of CCC 2006.[AKRR03] Eric Allender, Michal Koucký, Detlef Ronneburger, and Sambuddha Roy. Derandom-ization and Distinguishing Complexity. In , pages 209–220,2003.[AM05] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distri-butions. In

International Conference on Computational Learning Theory , pages 458–469.Springer, 2005.[Ang87] Dana Angluin. Queries and Concept Learning.

Machine Learning. , 2(4):319–342, 1987.[APVZ14] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning sparsepolynomial functions. In

Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposiumon Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014 , pages500–510, 2014.[AV08] Manindra Agrawal and V. Vinay. Arithmetic circuits: A chasm at depth four. In , pages 67–75, 2008.34BBB +

00] Amos Beimel, Francesco Bergadano, Nader H. Bshouty, Eyal Kushilevitz, and Ste-fano Varricchio. Learning functions represented as multiplicity automata.

J. ACM ,47(3):506–530, 2000. Conference version appeared in the proceedings of FOCS 1996.[BCMV14] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan.Smoothed analysis of tensor decompositions. In

Symposium on Theory of Computing,STOC 2014, New York, NY, USA, May 31 - June 03, 2014 , pages 594–603, 2014.[Ber70] Elwyn R Berlekamp. Factoring polynomials over large ﬁnite ﬁelds.

Mathematics ofComputation , 24:713–735, 1970.[BIJL18] Markus Bläser, Christian Ikenmeyer, Gorav Jindal, and Vladimir Lysikov. Generalizedmatrix completion and algebraic natural proofs. In

Proceedings of the 50th Annual ACMSIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018 , pages 1193–1206, 2018.[BKS15] Boaz Barak, Jonathan A Kelner, and David Steurer. Dictionary learning and tensor de-composition via the sum-of-squares method. In

Proceedings of the forty-seventh annualACM symposium on Theory of computing , pages 143–151, 2015.[BS10] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In , pages 103–112, 2010.[BSV19] Vishwas Bhargava, Shubhangi Saraf, and Ilya Volkovich. Reconstruction of depth-4multilinear circuits.

Electronic Colloquium on Computational Complexity (ECCC) , 26:104,2019.[BV08] S Charles Brubaker and Santosh S Vempala. Isotropic pca and afﬁne-invariant clus-tering. In

Building Bridges , pages 241–281. Springer, 2008.[Chv79] Vasek Chvátal. A greedy heuristic for the set-covering problem.

Math. Oper. Res. ,4(3):233–235, 1979.[CIK97] Alexander L. Chistov, Gábor Ivanyos, and Marek Karpinski. Polynomial time algo-rithms for modules over ﬁnite dimensional algebras. In

Proceedings of the 1997 Interna-tional Symposium on Symbolic and Algebraic Computation, ISSAC ’97, Maui, Hawaii, USA,July 21-23, 1997 , pages 68–74, 1997.[CIKK16] Marco L. Carmosino, Russell Impagliazzo, Valentine Kabanets, and AntoninaKolokolova. Learning Algorithms from Natural Proofs. In , pages 10:1–10:24,2016.[CJ10] Pierre Comon and Christian Jutten.

Handbook of Blind Source Separation: Independentcomponent analysis and applications . Academic press, 2010.[CKW11] Xi Chen, Neeraj Kayal, and Avi Wigderson. Partial derivatives in arithmetic complex-ity and beyond.

Foundations and Trends in Theoretical Computer Science , 6(1-2):1–138,2011. 35Czo99] Sabastian Czort. The complexity of minimizing disjunctive normal form formulas.Master’s thesis, University of Aarhus, 1999.[Das99] Sanjoy Dasgupta. Learning mixtures of gaussians. In , pages634–644, 1999.[DKK +

19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and AlistairStewart. Robust estimators in high-dimensions without the computational intractabil-ity.

SIAM Journal on Computing , 48(2):742–864, 2019.[DLCC07] Lieven De Lathauwer, Josphine Castaing, and Jean-Franois Cardoso. Fourth-ordercumulant-based blind identiﬁcation of underdetermined mixtures.

IEEE Transactionson Signal Processing , 55(6):2965–2973, 2007.[DS07] Sanjoy Dasgupta and Leonard Schulman. A probabilistic analysis of em for mixturesof separated, spherical gaussians.

Journal of Machine Learning Research , 8(Feb):203–226,2007.[DSY10] Zeev Dvir, Amir Shpilka, and Amir Yehudayoff. Hardness-randomness tradeoffs forbounded depth arithmetic circuits.

SIAM Journal on Computing , 39(4):1279–1293, 2010.[Ebe91] Wayne Eberly. Decompositions of algebras over R and C.

Computational Complexity ,1:211–234, 1991.[Fel09] Vitaly Feldman. Hardness of approximate two-level logic minimization and PAClearning with membership queries.

J. Comput. Syst. Sci. , 75(1):13–26, 2009. Confer-ence version appeared in the proceedings of STOC 2006.[Fis94] Ismor Fischer. Sums of like powers of multivariate linear forms.

Mathematics Magazine ,67(1):59–61, 1994.[FK09] Lance Fortnow and Adam R. Klivans. Efﬁcient learning algorithms yield circuit lowerbounds.

J. Comput. Syst. Sci. , 75(1):27–36, 2009. Conference version appeared in theproceedings of COLT 2006.[FLMS15] Hervé Fournier, Nutan Limaye, Guillaume Malod, and Srikanth Srinivasan. Lowerbounds for depth-4 formulas computing iterated matrix multiplication.

SIAM J. Com-put. , 44(5):1173–1201, 2015. Conference version appeared in the proceedings of STOC2014.[For15a] Michael A Forbes. Deterministic divisibility testing via shifted partial derivatives. In , pages 451–465.IEEE, 2015.[For15b] Michael A. Forbes. Deterministic divisibility testing via shifted partial derivatives. In

IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley,CA, USA, 17-20 October, 2015 , pages 451–465, 2015.36FR85] Katalin Friedl and Lajos Rónyai. Polynomial time solutions of some problems in com-putational algebra. In

Proceedings of the 17th Annual ACM Symposium on Theory ofComputing, May 6-8, 1985, Providence, Rhode Island, USA , pages 153–162, 1985.[FS13] Michael A. Forbes and Amir Shpilka. Quasipolynomial-Time Identity Testing of Non-commutative and Read-Once Oblivious Algebraic Branching Programs. In , pages 243–252, 2013.[GHK15] Rong Ge, Qingqing Huang, and Sham M. Kakade. Learning mixtures of gaussiansin high dimensions. In

Proceedings of the Forty-Seventh Annual ACM on Symposium onTheory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015 , pages 761–770,2015.[GKKS14] Ankit Gupta, Pritish Kamath, Neeraj Kayal, and Ramprasad Saptharishi. Approach-ing the Chasm at Depth Four.

J. ACM , 61(6):33:1–33:16, 2014. Conference versionappeared in the proceedings of CCC 2013.[GKKS16] Ankit Gupta, Pritish Kamath, Neeraj Kayal, and Ramprasad Saptharishi. Arithmeticcircuits: A chasm at depth 3.

SIAM J. Comput. , 45(3):1064–1079, 2016. Conferenceversion appeared in the proceedings of FOCS 2013.[GKL11] Ankit Gupta, Neeraj Kayal, and Satyanarayana V. Lokam. Efﬁcient Reconstructionof Random Multilinear Formulas. In

IEEE 52nd Annual Symposium on Foundations ofComputer Science, FOCS 2011, Palm Springs, CA, USA, October 22-25, 2011 , pages 778–787, 2011.[GKL12] Ankit Gupta, Neeraj Kayal, and Satyanarayana V. Lokam. Reconstruction of depth-4multilinear circuits with top fan-in 2. In

Proceedings of the 44th Symposium on Theoryof Computing Conference, STOC 2012, New York, NY, USA, May 19 - 22, 2012 , pages625–642, 2012.[GKP18] Ignacio García-Marco, Pascal Koiran, and Timothée Pecatte. Polynomial equivalenceproblems for sum of afﬁne powers. In

Proceedings of the 2018 ACM on InternationalSymposium on Symbolic and Algebraic Computation, ISSAC 2018, New York, NY, USA,July 16-19, 2018 , pages 303–310, 2018.[GKQ14] Ankit Gupta, Neeraj Kayal, and Youming Qiao. Random arithmetic formulas can bereconstructed efﬁciently.

Computational Complexity , 23(2):207–303, 2014. Conferenceversion appeared in the proceedings of CCC 2013.[GM15] Rong Ge and Tengyu Ma. Decomposing overcomplete 3rd order tensors using sum-of-squares algorithms. arXiv preprint arXiv:1504.05287 , 2015.[Har70] R Harshman. Foundations of the parafac procedure: Model and conditions for anexplanatory factor analysis.

Technical Report UCLA Working Papers in Phonetics 16,University of California, Los Angeles, Los Angeles, CA , 1970.[Hås90] Johan Håstad. Tensor Rank is NP-Complete.

J. Algorithms , 11(4):644–654, 1990. Con-ference version appeared in the proceedings of ICALP 1989.37HK13] Daniel J. Hsu and Sham M. Kakade. Learning mixtures of spherical gaussians: mo-ment methods and spectral decompositions. In

Innovations in Theoretical ComputerScience, ITCS ’13, Berkeley, CA, USA, January 9-12, 2013 , pages 11–20, 2013.[HKZ12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learninghidden markov models.

Journal of Computer and System Sciences , 78(5):1460–1480, 2012.[HL18] Samuel B Hopkins and Jerry Li. Mixture models, robustness, and sum of squaresproofs. In

Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Com-puting , pages 1021–1034, 2018.[HS80] Joos Heintz and Claus-Peter Schnorr. Testing polynomials which are easy to compute.In

Proceedings of the twelfth annual ACM symposium on Theory of computing , pages 262–272, 1980.[HSSS16] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral al-gorithms from sum-of-squares proofs: tensor decomposition and planted sparse vec-tors. In

Proceedings of the forty-eighth annual ACM symposium on Theory of Computing ,pages 178–191, 2016.[Jac89] Nathan Jacobson.

Basic Algebra 2 (Second Edition) . Dover Books on Mathematics, 1989.[Joh74] David S. Johnson. Approximation algorithms for combinatorial problems.

J. Comput.Syst. Sci. , 9(3):256–278, 1974.[Kay11] Neeraj Kayal. Efﬁcient algorithms for some special cases of the polynomial equiva-lence problem. In

Proceedings of the Twenty-Second Annual ACM-SIAM Symposium onDiscrete Algorithms, SODA 2011, San Francisco, California, USA, January 23-25, 2011 ,pages 1409–1421, 2011.[Kay12a] Neeraj Kayal. Afﬁne projections of polynomials: extended abstract. In

Proceedings ofthe 44th Symposium on Theory of Computing Conference, STOC 2012, New York, NY, USA,May 19 - 22, 2012 , pages 643–662, 2012.[Kay12b] Neeraj Kayal. An exponential lower bound for the sum of powers of bounded degreepolynomials.

Electronic Colloquium on Computational Complexity (ECCC) , 19:81, 2012.[KC00] Valentine Kabanets and Jin-yi Cai. Circuit minimization problem. In

Proceedings ofthe Thirty-Second Annual ACM Symposium on Theory of Computing, May 21-23, 2000,Portland, OR, USA , pages 73–79, 2000.[KI04] Valentine Kabanets and Russell Impagliazzo. Derandomizing polynomial identitytests means proving circuit lower bounds. computational complexity , 13(1-2):1–46, 2004.[KK10] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-meansalgorithm. In ,pages 299–308. IEEE, 2010. 38KLSS17] Neeraj Kayal, Nutan Limaye, Chandan Saha, and Srikanth Srinivasan. An Exponen-tial Lower Bound for Homogeneous Depth Four Arithmetic Formulas.

SIAM J. Com-put. , 46(1):307–335, 2017. Conference version appeared in the proceedings of FOCS2014.[KNS16] Neeraj Kayal, Vineet Nair, and Chandan Saha. Separation between read-once obliv-ious algebraic branching programs (roabps) and multilinear depth three circuits. In , pages 46:1–46:15, 2016.[KNS19] Neeraj Kayal, Vineet Nair, and Chandan Saha. Average-case linear matrix factoriza-tion and reconstruction of low width algebraic branching programs.

ComputationalComplexity , 28(4):749–828, 2019.[KNST17] Neeraj Kayal, Vineet Nair, Chandan Saha, and Sébastien Tavenas. Reconstruction ofFull Rank Algebraic Branching Programs. In , pages 21:1–21:61, 2017.[Koi12] Pascal Koiran. Arithmetic circuits: The chasm at depth four gets wider.

Theor. Comput.Sci. , 448:56–65, 2012.[KS01] Adam R. Klivans and Daniel A. Spielman. Randomness efﬁcient identity testing ofmultivariate polynomials. In

Proceedings on 33rd Annual ACM Symposium on Theory ofComputing, July 6-8, 2001, Heraklion, Crete, Greece , pages 216–223, 2001.[KS06] Adam R. Klivans and Amir Shpilka. Learning restricted models of arithmetic cir-cuits.

Theory of Computing , 2(10):185–206, 2006. Conference version appeared in theproceedings of COLT 2003.[KS09a] Zohar Shay Karnin and Amir Shpilka. Reconstruction of generalized depth-3 arith-metic circuits with bounded top fan-in. In

Proceedings of the 24th Annual IEEE Con-ference on Computational Complexity, CCC 2009, Paris, France, 15-18 July 2009 , pages274–285, 2009.[KS09b] Adam R. Klivans and Alexander A. Sherstov. Cryptographic hardness for learningintersections of halfspaces.

J. Comput. Syst. Sci. , 75(1):2–12, 2009. Conference versionappeared in the proceedings of FOCS 2006.[KS14] Mrinal Kumar and Shubhangi Saraf. The limits of depth reduction for arithmeticformulas: it’s all about the top fan-in. In

Symposium on Theory of Computing, STOC2014, New York, NY, USA, May 31 - June 03, 2014 , pages 136–145, 2014.[KS17a] Pravesh K Kothari and David Steurer. Outlier-robust moment-estimation via sum-of-squares. arXiv preprint arXiv:1711.11581 , 2017.[KS17b] Mrinal Kumar and Shubhangi Saraf. On the Power of Homogeneous Depth 4 Arith-metic Circuits.

SIAM J. Comput. , 46(1):336–387, 2017. Conference version appeared inthe proceedings of FOCS 2014. 39KS19] Neeraj Kayal and Chandan Saha. Reconstruction of non-degenerate homogeneousdepth three circuits. In

Proceedings of the 51st Annual ACM SIGACT Symposium onTheory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019. , pages 413–424,2019.[KSB] Krull-Schmidt Theorem. https://mathstrek.blog/2015/01/17/krull-schmidt-theorem/ .[KSDK14] Murat Kocaoglu, Karthikeyan Shanmugam, Alexandros G. Dimakis, and Adam R.Klivans. Sparse polynomial learning and graph sketching. In

Advances in NeuralInformation Processing Systems 27: Annual Conference on Neural Information ProcessingSystems 2014, December 8-13 2014, Montreal, Quebec, Canada , pages 3122–3130, 2014.[KSS14] Neeraj Kayal, Chandan Saha, and Ramprasad Saptharishi. A super-polynomial lowerbound for regular arithmetic formulas. In

Symposium on Theory of Computing, STOC2014, New York, NY, USA, May 31 - June 03, 2014 , pages 146–153, 2014.[KSS18] Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimationand improved clustering via sum of squares. In

Proceedings of the 50th Annual ACMSIGACT Symposium on Theory of Computing , pages 1035–1046, 2018.[KSV05] Ravindran Kannan, Hadi Salmasian, and Santosh Vempala. The spectral method forgeneral mixture models. In

International Conference on Computational Learning Theory ,pages 444–457. Springer, 2005.[KT90] Erich Kaltofen and Barry M. Trager. Computing with Polynomials Given By BlackBoxes for Their Evaluations: Greatest Common Divisors, Factorization, Separation ofNumerators and Denominators.

J. Symb. Comput. , 9(3):301–320, 1990.[Kum19] Mrinal Kumar. A quadratic lower bound for homogeneous algebraic branching pro-grams.

Computational Complexity , 28(3):409–435, 2019. Conference version appearedin the proceedings of CCC 2017.[LLL82] Arjen K Lenstra, Hendrik W Lenstra, and László Lovász. Factoring polynomials withrational coefﬁcients.

Mathematische Annalen , 261(4):515–534, 1982.[LMN93] Nathan Linial, Yishay Mansour, and Noam Nisan. Constant Depth Circuits, FourierTransform, and Learnability.

J. ACM , 40(3):607–620, 1993. Conference version ap-peared in the proceedings of FOCS 1989.[Lov75] László Lovász. On the ratio of optimal integral and fractional covers.

Discrete Mathe-matics , 13(4):383–390, 1975.[LRA93] Sue E Leurgans, Robert T Ross, and Rebecca B Abel. A decomposition for three-wayarrays.

SIAM Journal on Matrix Analysis and Applications , 14(4):1064–1083, 1993.[Mas79] W. J. Masek. Some NP-complete set covering problems. Unpublished Manuscript,1979.[MR05] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hid-den markov models. In

Proceedings of the 37th Annual ACM Symposium on Theory ofComputing, Baltimore, MD, USA, May 22-24, 2005 , pages 366–375, 2005.40MSS16] Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositionswith sum-of-squares. In , pages 438–446. IEEE, 2016.[MV10] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixturesof gaussians. In , pages 93–102, 2010.[MV18] Daniel Minahan and Ilya Volkovich. Complete derandomization of identity testingand reconstruction of read-once formulas.

TOCT , 10(3):10:1–10:11, 2018. Conferenceversion appeared in the proceedings of CCC 2017.[MW17] Cody D. Murray and R. Ryan Williams. On the (Non) NP-Hardness of ComputingCircuit Complexity.

Theory of Computing , 13(4):1–22, 2017.[Nis91] Noam Nisan. Lower Bounds for Non-Commutative Computation (Extended Ab-stract). In

Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, May5-8, 1991, New Orleans, Louisiana, USA , pages 410–418, 1991.[NW97] Noam Nisan and Avi Wigderson. Lower Bounds on Arithmetic Circuits Via PartialDerivatives.

Computational Complexity , 6(3):217–234, 1997. Conference version ap-peared in the proceedings of FOCS 1995.[OSlV16] Rafael Oliveira, Amir Shpilka, and Ben lee Volk. Subexponential size hitting sets forbounded depth multilinear formulas. computational complexity , 25(2):455–505, 2016.[Pea94] Karl Pearson. Contributions to the mathematical theory of evolution.

PhilosophicalTransactions of the Royal Society of London. A , 185:71–110, 1894.[PFJ03] Haim H. Permuter, Joseph M. Francos, and Ian H. Jermyn. Gaussian mixture modelsof texture and colour for image database retrieval. In ,pages 569–572, 2003.[Raz09] Ran Raz. Multi-linear formulas for permanent and determinant are of super-polynomial size.

J. ACM , 56(2):8:1–8:17, 2009. Conference version appeared in theproceedings of STOC 2004.[Rón90] Lajos Rónyai. Computing the structure of ﬁnite algebras.

J. Symb. Comput. , 9(3):355–373, 1990.[RR95] Douglas A Reynolds and Richard C Rose. Robust text-independent speaker identiﬁ-cation using gaussian mixture speaker models.

IEEE transactions on speech and audioprocessing , 3(1):72–83, 1995.[RS05] Ran Raz and Amir Shpilka. Deterministic polynomial identity testing in non-commutative models.

Computational Complexity , 14(1):1–19, 2005.[RSS18] Prasad Raghavendra, Tselil Schramm, and David Steurer. High-dimensional estima-tion via sum-of-squares proofs. arXiv preprint arXiv:1807.11419 , 6, 2018.41RV17] Oded Regev and Aravindan Vijayaraghavan. On learning mixtures of well-separatedgaussians. In , pages 85–96. IEEE, 2017.[Sap15] Ramprasad Saptharishi. A survey of lower bounds in arithmetic circuit complexity.

Github survey , 2015.[Sch80] Jacob T. Schwartz. Fast Probabilistic Algorithms for Veriﬁcation of Polynomial Iden-tities.

J. ACM , 27(4):701–717, 1980.[Shi16] Yaroslav Shitov. How hard is the tensor rank? arXiv , abs/1611.01559, 2016.[Shp09] Amir Shpilka. Interpolation of depth-3 arithmetic circuits with two multiplicationgates.

SIAM J. Comput. , 38(6):2130–2161, 2009. Conference version appeared in theproceedings of STOC 2007.[Sin16] Gaurav Sinha. Reconstruction of real depth-3 circuits with top fan-in 2. In , pages31:1–31:53, 2016.[SK01] Arora Sanjeev and Ravi Kannan. Learning mixtures of arbitrary gaussians. In

Proceed-ings of the thirty-third annual ACM symposium on Theory of computing , pages 247–257,2001.[Str73] Volker Strassen. Vermeidung von divisionen.

Journal für die reine und angewandteMathematik , 264:184–202, 1973.[SV14] Amir Shpilka and Ilya Volkovich. On reconstruction and testing of read-once for-mulas.

Theory of Computing , 10:465–514, 2014. Conference version appeared in theproceedings of STOC 2008 and APPROX-RANDOM 2009.[Swe18] Joseph Swernofsky. Tensor Rank is Hard to Approximate. In

Approximation, Random-ization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM2018, August 20-22, 2018 - Princeton, NJ, USA , pages 26:1–26:9, 2018.[SWZ19] Zhao Song, David P. Woodruff, and Peilin Zhong. Relative Error Tensor Low RankApproximation. In

Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Dis-crete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019 , pages 2772–2789, 2019.[SY10] Amir Shpilka and Amir Yehudayoff. Arithmetic circuits: A survey of recent resultsand open questions.

Foundations and Trends in Theoretical Computer Science , 5(3-4):207–388, 2010.[Tav13] Sébastien Tavenas. Improved bounds for reduction to depth 4 and depth 3. In

Mathe-matical Foundations of Computer Science 2013 - 38th International Symposium, MFCS 2013,Klosterneuburg, Austria, August 26-30, 2013. Proceedings , pages 813–824, 2013.[Uma99] Christopher Umans. Hardness of Approximating Sigma2p Minimization Problems.In , pages 465–474, 1999.42Vol16] Ilya Volkovich. A Guide to Learning Arithmetic Circuits. In

Proceedings of the 29thConference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016 , pages 1540–1561, 2016.[VW04] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixture models.

Journal of Computer and System Sciences , 68(4):841–860, 2004.[Zip79] Richard Zippel. Probabilistic algorithms for sparse polynomials. In

Symbolic and Alge-braic Computation, EUROSAM ’79, An International Symposiumon Symbolic and AlgebraicComputation, Marseille, France, June 1979, Proceedings , pages 216–226, 1979.

A The adjoint algebra

Let U and W be vector spaces and L a set of linear operators from U to W such that W = hL ◦ U i .Suppose U and W decompose into indecomposable subspaces as: U = U ⊕ . . . ⊕ U s and W = W ⊕ . . . ⊕ W s such that W i = hL ◦ U i i for all i ∈ [ s ] . In this section, we give a brief overview of the adjointalgebra associated with L and show how analyzing the adjoint provides an avenue to showinguniqueness of decomposition of the above spaces. We will explain this by assuming U = W and U i = W i for all i ∈ [ s ] . Let m = dim U . Once a basis of U is ﬁxed, U can be identiﬁed with F m and elements of L are m × m matrices in M m ( F ) . Let R ⊆ M m ( F ) be the F -algebra generated by L ∪ { I m } , where I m is the m × m identity matrix. As U = hL ◦ U i and U i = hL ◦ U i i , we have L u ∈ U and L u i ∈ U i for all L ∈ L , u ∈ U and u i ∈ U i . This gives U an R -module structure and U , . . . , U s are R -submodules of U . We say U i is an indecomposable R -module if there are no proper R -submodules U i and U i of U i such that U i = U i ⊕ U i . A decomposition of an R -module U as U = U ⊕ . . . ⊕ U s ,where U , . . . , U s are indecomposable R -submodules of U , is unique if it is the only possible de-composition of U into indecomposable R -submodules (up to reordering of the U i ’s). A.1 Module homomorphisms

A map φ from an R -module U to another R -module V is an R -module homomorphism from U to V if φ ( R u + S v ) = R φ ( u ) + S φ ( v ) for all R , S ∈ R and u , v ∈ U . Such a φ is an R -module isomorphism from U to V if it is a bijection. An R -module homomorphism from U to U is called an R -module endomorphism of U , and an R -module isomorphism from U to U is called an R -module This assumption is without any loss of generality (see Section B). An F -algebra R has two binary operations + and · deﬁned on its elements such that ( R , +) is a F -vector space, ( R , + , · ) is an associative ring, and for every a , b ∈ F and B , C ∈ R it holds that ( aB ) C = B ( aC ) = a ( BC ) . The F -algebra R ⊆ M m ( F ) generated by L ⊆ M m ( F ) is the set of all ﬁnite F -linear sums of ﬁnite products of elements of L . Let R be an F -algebra with a multiplicative identity I . A vector space U is an R -module if there is a bilinear map ◦ from R × U to U such that I ◦ u = u and ( RS ) ◦ u = R ◦ ( S ◦ u ) for all u ∈ U and R , S ∈ R . In our case, ◦ is simplythe matrix-vector multiplication operation. utomorphism of U . It turns out that the set of R -module endomorphisms of U can be computedefﬁciently as follows: Recall that in our case, U = F m and R ⊆ M m ( F ) . Deﬁne the adjoint of R asadj ( R ) : = { D ∈ M m ( F ) : LD = DL for all L ∈ L} . (19)Observe that adj ( R ) is an F -subalgebra of M m ( F ) . Proposition A.1.

The adjoint adj ( R ) is precisely the set of all R -module endomorphism of U.Proof. Let φ be an R -module endomorphism of U . As R contains the identity matrix I m , F ⊆ R and so φ is a linear transformation from U to U . Let D φ ∈ M m ( F ) be the matrix corresponding to φ . Since φ ( R u ) = R φ ( u ) , we have D φ R u = RD φ u for all R ∈ R and u ∈ U . Hence, D φ R = RD φ for all R ∈ R implying D φ ∈ adj ( R ) . On the other hand, if D ∈ adj ( R ) then the map φ D : U → U deﬁned as φ D ( u ) : = D u satisﬁes φ D ( R u + S v ) = R φ D ( u ) + S φ D ( v ) for all R , S ∈ R and u , v ∈ U .So, φ D is an R -module endomorphism of U .A basis of the adjoint can be computed efﬁciently by solving a system of linear equations arisingfrom the equation LD = DL for all L ∈ L . A.2 Module decomposition

Let U = F m be an R -module, where R ⊆ M m ( F ) . By Proposition A.1, the invertible elementsof adj ( R ) are the R -module automorphisms of U and these can be used to describe all possibledecomposition of U into indecomposable R -modules. Proposition A.2. (a) If U = U ⊕ . . . ⊕ U s is a decomposition of U into indecomposable R -submodulesand D ∈ adj ( R ) is invertible then U = DU ⊕ . . . ⊕ DU s is another decomposition of U into indecomposable R -submodules.(b) If U = U ′ ⊕ . . . ⊕ U ′ l is any other decomposition of U into indecomposable R -submodules then l = sand there is an invertible D ∈ adj ( R ) and a permutation σ of [ s ] such thatU ′ i = DU σ ( i ) for all i ∈ [ s ] . Proof.

The proof of (a) follows from the easy observation that DU i is an R -submodule of U .To prove (b) we will make use of the Krull-Schmidt theorem for modules ( [Jac89] p. 110, [KSB]). Theorem 3 (Krull-Schmidt) . Let R be an F -algebra and U a ﬁnite dimensional vector space that is alsoan R -module. If U = U ⊕ . . . ⊕ U s and U = U ′ ⊕ . . . ⊕ U ′ l are two decomposition of U into indecomposable R -submodules then l = s and there is a permutation σ ofs such that U ′ i and U σ ( i ) are isomorphic as R -modules for all i ∈ [ s ] . l = s and that there is a permutation σ of [ s ] such that U ′ i ∼ = U σ ( i ) as R -modules for all i ∈ [ s ] . Let these ismorphisms be φ , . . . , φ s , i.e., U ′ i = φ i ( U σ ( i ) ) for all i ∈ [ s ] .Deﬁne a map φ from U to U as follows: Let u ∈ U . If u = u + . . . + u s , where u i ∈ U i , then φ ( u ) : = φ ( u σ ( ) ) + . . . + φ s ( u σ ( s ) ) . Observe that φ restricted to U σ ( i ) is just φ i . It is easy to verifythat φ is an R -module automorphism of U . Hence, by Proposition A.1, there is an invertible D ∈ adj ( R ) such that φ ( u ) = D u and so U ′ i = DU σ ( i ) for all i ∈ [ s ] .Proposition A.2 implies that the invertible elements of adj ( R ) exactly capture the various possibledecompositions of U into indecomposable R -modules. So, analyzing the adjoint becomes vital inshowing uniqueness of a module decomposition. A.3 Uniqueness of decomposition

It turns out that showing uniqueness of module decomposition is essentially equivalent to show-ing that the elements of the adjoint are simultaneously block-diagonalizable. As before, let U ⊕ . . . ⊕ U s be a decomposition of the R -module U = F m into indecomposable R -submodules. Forsimplicity, assume dim U i = r for all i ∈ [ s ] . Let ( u i , . . . , u ir ) be a basis of U i , and A ∈ GL m ( F ) bethe basis change matrix from the standard basis of F m to ( u , u , . . . , u r , . . . , u s , u s , . . . , u sr ) . Proposition A.3. (a) If U = U ⊕ . . . ⊕ U s is the unique decomposition of U into indecomposable R -submodules, and U i ≇ U j as R -modules for i = j, then A · adj ( R ) · A − consists of block-diagonalmatrices (with block size r).(b) If A · adj ( R ) · A − consists of block-diagonal matrices (with block size r) then U = U ⊕ . . . ⊕ U s isthe unique decomposition of U into indecomposable R -submodules.Proof. Let D ∈ adj ( R ) be invertible. By Proposition A.2 (a), U = DU ⊕ . . . ⊕ DU s is another decomposition of U into indecomposable R -submodules. If U = U ⊕ . . . ⊕ U s is theunique decomposition and U i ≇ U j as R -modules for i = j , then DU i = U i for all i ∈ [ s ] . Inother words, A · D · A − is block-diagonal for every invertible D ∈ adj ( R ) . Now, a simple appli-cation of the Schwartz-Zippel lemma implies A · adj ( R ) · A − consists of block-diagonal matricesif | F | > sr . This completes the proof of part (a).Suppose U = U ′ ⊕ . . . ⊕ U ′ s be another decomposition of U into indecomposable R -submodules.By Proposition A.2 (b), there is an invertible D ∈ adj ( R ) and a permutation σ of [ s ] such that U ′ i = DU σ ( i ) for all i ∈ [ s ] .If A · adj ( R ) · A − consists of only block-diagonal matrices then DU σ ( i ) ⊆ U σ ( i ) , implying U ′ i ⊆ U σ ( i ) for all i ∈ [ s ] .This further implies U ′ i = U σ ( i ) for all i ∈ [ s ] as ∑ i ∈ [ s ] dim U ′ i = ∑ i ∈ [ s ] dim U σ ( i ) . Thus, the decom-position U = U ⊕ . . . ⊕ U s is unique. 45 he adjoint algebra arising in our case As briefed in Section 1.2, our learning problem is essentially reduced to the following moduledecomposition problem: We are given a basis of an appropriate R -module U that decomposes as U = U ⊕ . . . ⊕ U s , (20)where U i is an R -submodule of U that is not guaranteed to be indecomposable and dim U i = r for all i ∈ [ s ] . We are required to • show that each U i is an indecomposable R -module, • show that the above decompostion is unique, • ﬁnd the decomposition, i.e., compute bases of U , . . . U s .Here, R is the F -algebra generated by a set of linear operators L on U . Guided by PropositionA.3, we analyze the adjoint adj ( R ) . It turns out that the “richness” of the carefully chosen set oflinear operators L implies that A · adj ( R ) · A − = D : = { diag ( a , . . . , a s ) ⊗ I r : a i ∈ F for all i ∈ [ s ] } ,where A is as deﬁned at the beginning of this section. In other words, elements of the adjoint aresimultaneously diagonalizable. The following is an easy corollary of Proposition A.3. Corollary A.1.

If A · adj ( R ) · A − = D then the R -modules U , . . . , U s (in Equation (20) ) are indecom-posable and U = U ⊕ . . . ⊕ U s is the unique decomposition of U into indecomposable R -submodules. Finally, we ﬁnd the decomposition by simultaneously diagonalizing the basis elements of adj ( R ) . B Reducing vector space decomposition to module decomposition

In this section, we reduce the vector space decomposition problem to the module decompositionproblem. In fact, our reduction works for a more general problem, which we call generalized vec-tor space decomposition. We describe this setting below.Suppose we have a directed graph G = ( V , E ) . At each vertex v ∈ V , we have a vector space U v and each edge ( v , w ) ∈ E carriers a set of linear maps L v , w from U v to U w . A vector spacedecomposition of the collection of vector spaces ( U v ) v ∈ V is a collection of decompositions U v = U v ,1 ⊕ · · · ⊕ U v , s such that hL v , w ◦ U v , i i ⊆ U w , i for all i ∈ [ s ] and ( v , w ) ∈ E . The collection of decompositions is inde-composable if there are no ﬁner decompositions, i.e., there are no proper subspaces U ′ v , i , U ′′ v , i of U v , i (and U ′ w , i , U ′′ w , i of U w , i ) such that U v , i = U ′ v , i ⊕ U ′′ v , i (and U w , i = U ′ w , i ⊕ U ′′ w , i ), and hL v , w ◦ U ′ v , i i ⊆ U ′ w , i , hL v , w ◦ U ′′ v , i i ⊆ U ′′ w , i for all i ∈ [ s ] and ( v , w ) ∈ E . The generalized vector space decompositionproblem is the task of computing a collection of indecomposable decompositions of the spaces ( U v ) v ∈ V from the graph G . Note that the module isomorphism problem corresponds to a singleloop on one vertex and the vector space decomposition problem corresponds to two vertices and46 single edge between them.There is a simple reduction from the generalized vector space decomposition problem to the mod-ule decomposition problem. Given the above instance, we consider the vector space U = ⊕ v ∈ V U v .We deﬁne some special linear maps from U to U which will be central to the reduction. We notethat we just need to describe the behaviour of the linear maps on each of the U v ’s, as we can extendthe maps linearly to the whole space U . The ﬁrst set of linear maps are projections onto U v ’s. Π v ( u ) = ( u if u ∈ U v u ∈ U v ′ for v ′ = v .That is, Π v is the projector onto U v . The second set of linear maps are the natural extensions of L v , w ’s to the whole space. Given L ∈ L v , w , we deﬁne the extension of L asext ( L )( u ) = ( L ( u ) if u ∈ U v u ∈ U v ′ for v ′ = v .Then, we can deﬁne e L v , w = { ext ( L ) : L ∈ L v , w } . Let R be the algebra generated by { I m } ∪{ Π v } v ∈ V ∪ { e L v , w } ( v , w ) ∈ E , where m : = dim ( U ) . Observe that U can be naturally treated as an R -module. Now, we have the following elementary proposition which characterizes R -submodulesof U (i.e., subspaces of U that are invariant with respect to R ). Proposition B.1.

A subspace U ′ ⊆ U is an R -submodule of U (i.e., hR ◦ U ′ i ⊆ U ′ ) if and only if it is ofthe form ⊕ v ∈ V U ′ v such that U ′ v ⊆ U v for all v ∈ V and hL v , w ◦ U ′ v i ⊆ U ′ w for all ( v , w ) ∈ E.Proof.

One direction is clear. If U ′ is of the form ⊕ v ∈ V U ′ v such that U ′ v ⊆ U v for all v ∈ V and hL v , w ◦ U ′ v i ⊆ U ′ w for all ( v , w ) ∈ E , then U ′ is an R -submodule of U . In the other direction,suppose U ′ is an R -submodule of U . Let U ′ v = Π v ◦ U ′ . Then, U ′ v ⊆ U ′ for all v ∈ V since U ′ is an R -module. On the other hand, U ′ ⊆ ⊕ v ∈ V U ′ v , hence U ′ = ⊕ v ∈ V U ′ v . Another consequence of U ′ being an R -module is that h e L v , w ◦ U ′ v i = h e L v , w ◦ Π v ◦ U ′ i ⊆ U ′ Since every map in e L v , w maps U to U w , we have that, h e L v , w ◦ U ′ v i ⊆ U ′ ∩ U w = U ′ w which is the same as hL v , w ◦ U ′ v i ⊆ U ′ w . This completes the proof.This yields the following corollary which characterizes decomposition of U into R -submodules. Corollary B.1 (Reduction to module decomposition) . U = U ⊕ · · · ⊕ U s is a decomposition of Uinto R -submodules if and only if each U i is of the form ⊕ v ∈ V U v , i such that U v = U v ,1 ⊕ · · · ⊕ U v , s for allv ∈ V and hL v , w ◦ U v , i i ⊆ U w , i for all i ∈ [ s ] , ( v , w ) ∈ E. roof. Again one direction is clear. In the other direction, suppose U = U ⊕ · · · ⊕ U s is a decom-position of U into R -submodules. This means that each of the U i ’s is an R -submodule of U . ByProposition B.1, each U i is of the form ⊕ v ∈ V U v , i such that U v , i ⊆ U v for all i ∈ [ s ] , v ∈ V and hL v , w ◦ U v , i i ⊆ U w , i for all i ∈ [ s ] , ( v , w ) ∈ E . Now U = U ⊕ · · · ⊕ U s and U v , i ⊆ U v , U i , hence U v ,1 , . . . , U v , s form a direct sum and U v ,1 ⊕ · · · ⊕ U v , s ⊆ U v for all v ∈ V . What remains to prove is that U v = U v ,1 ⊕ · · · ⊕ U v , s for all v ∈ V . Suppose there is some w ∈ V such that U w ,1 ⊕ · · · ⊕ U w , s ( U w .Then U = ⊕ i ∈ [ s ] U i = ⊕ i ∈ [ s ] ⊕ v ∈ V U v , i = ⊕ v ∈ V ⊕ i ∈ [ s ] U v , i ( ⊕ v ∈ V U v = U which is a contradiction. Hence, U v = U v ,1 ⊕ · · · ⊕ U v , s for all v ∈ V . This completes the proof.The Krull-Schmidt theorem for module decomposition and the above reduction allows one toobtain a uniqueness theorem for generalized vector space decomposition. Theorem 4 (Generalized vector space decomposition: uniqueness) . Suppose U v = U v ,1 ⊕ · · · ⊕ U v , s and U v = U ′ v ,1 ⊕ · · · ⊕ U ′ v , s ′ are two collection of decompositions (which are further indecomposable) forthe generalized vector space decomposition problem. Then s = s ′ . Furthermore, there exist linear mapsL v : U v → U v and a permutation σ : [ s ] → [ s ] such that U ′ v , i = L v ◦ U v , σ ( i ) for all i ∈ [ s ] and v ∈ V.Also, L w ◦ L v , w = L v , w ◦ L v for all L v , w ∈ L v , w and v , w ∈ V.Proof.

We look at the reduction to module decomposition, the algebra R discussed above andthe vector space U = ⊕ v ∈ V U v . Let us deﬁne U i = ⊕ v ∈ V U v , i and U ′ j = ⊕ v ∈ V U ′ v , j . Then U = U ⊕ · · · ⊕ U s and U = U ′ ⊕ · · · ⊕ U ′ s ′ are two decompositions of U into indecomposable R -submodules (because of Corollary B.1). Hence by Theorem 3, s = s ′ , and there exist a permutation σ : [ s ] → [ s ] and a linear map L : U → U such that U ′ i = L ◦ U σ ( i ) and L ◦ R = R ◦ L for every R ∈ R .Now for every v ∈ V , L ◦ Π v = Π v ◦ L which implies that L ◦ U v ⊆ U v . We can call the restrictionof L to U v as the map L v : U v → U v . Now take an operator L v , w ∈ L v , w and ext ( M ) ∈ e L v , w . Wehave that L ◦ ext ( L v , w ) = ext ( L v , w ) ◦ L . This implies that L w ◦ L v , w = L v , w ◦ L v for all L v , w ∈ L v , w .Also, U ′ v , i = Π v ◦ U ′ i = Π v ◦ L ◦ U σ ( i ) = L ◦ Π v ◦ U σ ( i ) = L ◦ U v , σ ( i ) .This completes the proof.As a corollary, we get a uniqueness theorem for vector space decomposition.48 orollary B.2 (Vector space decomposition: uniqueness) . Suppose L is a set of linear maps betweenvector spaces U and W. Suppose U = U ⊕ · · · ⊕ U s , W = W ⊕ · · · ⊕ W s and U = U ′ ⊕ · · · ⊕ U ′ s ′ , W = W ′ ⊕ · · · ⊕ W ′ s ′ are two indecomposable decompositions with respect to L . Then s = s ′ . Furthermore, thereexist linear maps D : U → U and E : W → W and a permutation σ : [ s ] → [ s ] such that U ′ i = D ◦ U σ ( i ) ,W ′ i = E ◦ W σ ( i ) for all i ∈ [ s ] and E ◦ L = L ◦ D for all L ∈ L . We also mention that via the above reduction, we get a polynomial time algorithm for generalizedvector space decomposition (over ﬁnite ﬁelds, reals and complex numbers) using the polynomialtime algorithm for module decomposition in [CIK97]. However, we do not use this algorithm forour learning problem since we also want the algorithm to work over rationals which is possible todo in our setting with a simpler specialized algorithm.

C Why doesn’t the shifted partials measure work?

In this section, we explain why the shifted partials measure (as it is) is unlikely to satisfy thebasic non-degeneracy condition given by Equation (3) in Section 1, if n ≥ d . The shifted partialsmeasure ( SP ), introduced in [Kay12b], is deﬁned as follows: Let f ∈ F [ x ] be an n -variate degree- d homogeneous polynomial and k , ℓ ∈ N . Then, SP k , ℓ ( f ) : = dim D x ℓ · ∂ k x f E .Clearly, SP k , ℓ ( f ) is upper bounded by min (cid:16) ( n + k − k ) · ( n + ℓ − ℓ ) , ( n + d − k + ℓ − d − k + ℓ ) (cid:17) . Suppose f = c Q m + . . . + c s Q ms ,where each c i ∈ F × , Q i is a homogeneous polynomial of degree t , and tm = d . Let U ( f ) : = (cid:10) x ℓ · ∂ k x f (cid:11) . We wish to satisfy the main non-degeneracy condition U ( f ) = U ( Q m ) ⊕ . . . ⊕ U ( Q ms ) , (21)for random Q , . . . , Q s . This imposes the restriction ℓ < d , as otherwise U ( Q mi ) ∩ U ( Q mj ) = { } for i = j . On the other hand, SP k , ℓ ( Q mi ) is upper bounded by ( n + k ( t − )+ ℓ − k ( t − )+ ℓ ) . If s · (cid:18) n + k ( t − ) + ℓ − k ( t − ) + ℓ (cid:19) ≤ min (cid:18)(cid:18) n + k − k (cid:19) · (cid:18) n + ℓ − ℓ (cid:19) , (cid:18) n + d − k + ℓ − d − k + ℓ (cid:19)(cid:19) ,then we may be able to satisfy the direct sum given by Equation (21). For this, we need kt ≤ d .But, with both ℓ and kt upper bounded by d , ( n + k ( t − )+ ℓ − k ( t − )+ ℓ ) cannot be less that ( n + k − k ) · ( n + ℓ − ℓ ) with growing t , if n ≥ d .Thus, it seems difﬁcult to satisfy the direct sum condition using the shifted partials measure if n ≥ d . However, if n is much smaller than d then it may be possible to achieve the same. Thisis what spurred us to think in the direction of reducing the number of variables to below d usingafﬁne projections. Indeed, we have shown in this work that such afﬁne projections do work (forboth lower bound and learning) even without shifts by monomials . But, shifts may play a crucialrole if n is much smaller than d to begin with (say, if n is a constant), in which case doing afﬁneprojections does not seem to help. 49 Proofs from Section 2

Proof of Observation 2.1 As C is non-degenerate, Condition 1 of Deﬁnition 1.1 implies, APP k , n ( f ) = s · (cid:18) n + k ( t − ) − k ( t − ) (cid:19) .By Proposition 2.6, | F | ≫ ( d − k ) · ( n + kk ) . Arguing as in Observation 1.1, with probability 1 − o ( ) , APP k , n ( f ) = dim U = s · (cid:18) n + k ( t − ) − k ( t − ) (cid:19) ,which implies U = U ⊕ . . . ⊕ U s and dim U i = ( n + k ( t − ) − k ( t − ) ) for all i ∈ [ s ] . Proof of Observation 2.2 As C is non-degenerate, Condition 3 of Deﬁnition 1.1 implies that there exists an L such that D z k ( t − ) · π L ( Q ) e E + . . . + D z k ( t − ) · π L ( Q s ) e E = D z k ( t − ) · π L ( Q ) e E ⊕ . . . ⊕ D z k ( t − ) · π L ( Q s ) e E .For any tuple of n linear forms L , the degree of a polynomial in z k ( t − ) · π L ( Q i ) e is at most 2 d as kt ≤ d (by Proposition 2.6). If L is a tuple of random linear forms then the polynomials in the set z k ( t − ) · π L ( Q ) e [ . . . [ z k ( t − ) · π L ( Q s ) e are F -linearly independent with probability 1 − o ( ) if | F | ≫ ds · ( n + k ( t − ) − k ( t − ) ) , which is ensuredby Proposition 2.6. Proof of Proposition 2.1

Recall that U i = D z k ( t − ) · G ei E . If g ∈ V = h G e , . . . , G es i then z k ( t − ) · g and z k ( t − ) · g belong to U + . . . + U s = U . Hence, there are a , . . . a sr , b , . . . , b sr ∈ F such that a f + . . . + a sr f sr = z k ( t − ) · g and b f + . . . + b sr f sr = z k ( t − ) · g .On the other hand, suppose that there exist a , . . . a sr , b , . . . , b sr ∈ F such that a f + . . . + a sr f sr z k ( t − ) = b f + . . . + b sr f sr z k ( t − ) . (22)As f , . . . , f sr is a basis of U , there are polynomials P , . . . , P s , P ′ , . . . , P ′ s ∈ D z k ( t − ) E that satisfy a f + . . . + a sr f sr = P G e + . . . + P s G es and b f + . . . + b sr f sr = P ′ G e + . . . + P ′ s G es .50rom Equation (22), we have (cid:16) z k ( t − ) P − z k ( t − ) P ′ (cid:17) · G e + . . . + (cid:16) z k ( t − ) P s − z k ( t − ) P ′ s (cid:17) · G es = (cid:16) z k ( t − ) P i − z k ( t − ) P ′ i (cid:17) ∈ D z k ( t − ) E , by Observation 2.2, z k ( t − ) P i − z k ( t − ) P ′ i = i ∈ [ s ] .Hence, z k ( t − ) divides P i and z k ( t − ) divides P ′ i for all i ∈ [ s ] . But, deg ( P i ) = deg ( P ′ i ) = k ( t − ) .Therefore, there are ˆ a , . . . , ˆ a s ∈ F such that a f + . . . + a sr f sr z k ( t − ) = b f + . . . + b sr f sr z k ( t − ) = ˆ a G e + . . . + ˆ a s G es = : g ( z ) ∈ V . Proof of Proposition 2.2 As C is non-degenerate, Condition 2 of Deﬁnition 1.1 implies that there exist L and P such that D π P ( ∂ k z ( G e + . . . + G es )) E = D π P ( ∂ k z G e ) E ⊕ . . . ⊕ D π P ( ∂ k z G es ) E , anddim D π P ( ∂ k z G ei ) E = (cid:18) m + k ( t − ) − k ( t − ) (cid:19) for all i ∈ [ s ] , (23)where G i = π L ( Q i ) and e = m − k . If L and P are tuples of random linear forms (as in Step 1and 4 of Algorithm 2) then the above equation holds with probability 1 − o ( ) provided | F | ≫ ds · ( m + k ( t − ) − k ( t − ) ) (which is ensured by Proposition 2.6). Let g = G e + . . . + G es . By Equation (23), D π P ( ∂ k z g ) E = W ⊕ . . . ⊕ W s and dim W i = (cid:18) m + k ( t − ) − k ( t − ) (cid:19) for all i ∈ [ s ] , (24)where W i = (cid:10) π P ( ∂ k z G ei ) (cid:11) . Recall that W = (cid:10) π P ( ∂ k z g ) (cid:11) , where g is a random element of V . As G e , . . . , G es is a basis of V , we have g = b G e + . . . + b s G es , where b , . . . , b s ∈ r F . The next claimcompletes the proof of the proposition. Claim D.1.

If g = b G e + . . . + b s G es such that b , . . . , b s ∈ r F × , then (cid:10) π P ( ∂ k z g ) (cid:11) = (cid:10) π P ( ∂ k z g ) (cid:11) withprobability − o ( ) .Proof. With every polynomial ˆ g ∈ V , associate a ( n + k − k ) × ( m + et − k − et − k ) matrix M ( ˆ g ) as follows:The rows of M ( ˆ g ) are indexed by all monomials in z -variables of degree k and the columns areindexed by all monomials in w -variables of degree et − k . If α is a z -monomial of degree k and β isa w -monomial of degree ( et − k ) then the ( α , β ) -th entry of M ( ˆ g ) is the coefﬁcient of β in π P (cid:16) ∂ k ˆ g ∂α (cid:17) .In other words, M ( ˆ g ) is the coefﬁcient matrix consisting of the coefﬁcients of the polynomials in π P ( ∂ k z ˆ g ) . Let q = ( m + k ( t − ) − k ( t − ) ) . Clearly, for every ˆ g ∈ V , D π P ( ∂ k z ˆ g ) E ⊆ W ⊕ . . . ⊕ W s = D π P ( ∂ k z g ) E , by Equation (24) ⇒ rank [ M ( ˆ g )] ≤ rank [ M ( g )] = s · (cid:18) m + k ( t − ) − k ( t − ) (cid:19) = sq .51here exist a sq × ( n + k − k ) matrix R and a ( m + et − k − et − k ) × sq matrix C such thatrank [ R · M ( g ) · C ] = rank [ M ( g )] = sq . (25)For any ˆ g ∈ V , denote the sq × sq matrix R · M ( ˆ g ) · C by N ( ˆ g ) and R · M ( g ) · C by N ( g ) . Letˆ g = y G e + . . . + y s G es be an arbitrary element of V , where y , . . . , y s ∈ F . Then M ( ˆ g ) = y · M ( G e ) + . . . + y s · M ( G es ) ⇒ N ( ˆ g ) = y · N ( G e ) + . . . + y s · N ( G es ) .Treating y , . . . , y s as formal variables, we can infer that det ( N ( ˆ g )) is a non-zero polynomial in y , . . . , y s of degree at most sq . This is because, by setting y = . . . = y s = g = g , and wealready know that det ( N ( g )) = g = b G e + . . . + b s G es such that b , . . . , b s ∈ r F × , then with probability 1 − o ( ) we have det ( N ( g )) = | F | ≫ sq (by Proposition2.6). That is, rank [ N ( g )] = sq = rank [ M ( g )] which implies (cid:10) π P ( ∂ k z g ) (cid:11) = (cid:10) π P ( ∂ k z g ) (cid:11) . Proof of Proposition 2.3

Treat w k ( t − ) as an ordered set and let B ∈ GL sq ( F ) be the basis change matrix from ( h , . . . , h sq ) to ( w k ( t − ) · π P ( G e − k ) , . . . , w k ( t − ) · π P ( G e − ks )) . Let K be an arbitrary element of hL i . As V = V ⊕ . . . ⊕ V s , W = W ⊕ . . . ⊕ W s is an indecomposable decomposition of V and W under theaction of L , the matrix BK A − has the following structure: The columns of BK A − are indexedby ( G e , . . . , G es ) and the rows are indexed by ( w k ( t − ) · π P ( G e − k ) , . . . , w k ( t − ) · π P ( G e − ks )) . The G ej -thcolumn of BK A − has its non-zero entries conﬁned to the q rows indexed by w k ( t − ) · π P ( G e − kj ) .By deﬁnition of the adjoint, ( D , E ) ∈ adj ( L ) if and only if BK A − · ADA − = BEB − · BK A − for all K ∈ hL i . (26)Expressed in the basis ( G e , . . . , G es ) of V , the element g = G e + . . . + G es is the all-one vector ∈ F s .Let β ∈ w k ( t − ) and j ∈ [ s ] be arbitrarily chosen. From the proof of Proposition 2.2, it follows thatthere is a K ∈ hL i such that BK A − · is the unit vector whose ( β · π P ( G e − kj )) -th entry is one andall other entries are zero. In other words, all but the ( G ej , β · π P ( G e − kj )) -th entry of BK A − is zero,and the ( G ej , β · π P ( G e − kj )) -th entry is 1. As ADA − and BEB − satisfy Equation (26) for everysuch K (as we vary β ∈ w k ( t − ) and j ∈ [ s ] ), both ADA − and BEB − are diagonal matrices, i.e., A · adj ( L ) · A − ⊆ D .Using Equation (26), it is an easy exercise to show that D ⊆ A · adj ( L ) · A − . Therefore, A · adj ( L ) · A − = D . Proof of Proposition 2.4

Clearly, U ⊆ U + . . . + U s . We will prove that U i ⊂ U for every i ∈ [ s ] . Observe that U i = D π L ( ∂ k x Q mi ) E ⊆ D z k ( t − ) · π L ( Q i ) m − k E .52e will now show that D z k ( t − ) · π L ( Q i ) m − k E ⊂ U . Let µ be an arbitrary z -monomial of degree k ( t − ) . Then µ = γ k l · γ k l · · · γ k r l r for some distinct l , . . . , l r ∈ [ b ] , where k + . . . + k r = k . Let α i : = ( y i l · y l · · · y ik l ) · ( y i l · y l · · · y ik l ) · · · ( y i l r · y l r · · · y ik r l r ) . Using the combinatorial design of the sets S , . . . S s , we get ∂ k f ∂α i = ∂ k Q mi ∂α i = k ! · (cid:18) mk (cid:19) · µ · Q m − ki .As µ is arbitrary and char ( F ) ∤ k ! · ( mk ) , we have D z k ( t − ) · π L ( Q i ) m − k E ⊂ U . From the aboveequation, it is also easy to notice that U i = D z k ( t − ) · π L ( Q i ) m − k E . Proof of Proposition 2.5

We will show that, for every i ∈ [ s ] , there is a monomial in P i · G m − ki that cannot be generated byany other P j · G m − kj for i = j . The following observation will be useful. Observation D.1.

Consider a product P · ℓ ˆ d , where P is a non-zero polynomial in F [ z ] and ℓ = ∑ z ∈ ˆ z zfor some ˆ z ⊆ z . Let char ( F ) > deg z ( P · ℓ ˆ d ) . Then, for every monomial µ ∈ ˆ z ˆ d , there is a monomial β (with non-zero coefﬁcient) in P · ℓ ˆ d such that µ divides β .Proof. Let µ be a monomial in ˆ z ˆ d . Write the product P · ℓ ˆ d as P ′ · ℓ d ′ , where P ′ is coprime to ℓ and d ′ ≥ ˆ d . For contradiction, suppose that there is no monomial in P ′ · ℓ d ′ that is divisible by µ . Then, ∂ ˆ d d µ ( P ′ · ℓ d ′ ) = ⇒ d ′ ! ( d ′ − ˆ d ) ! · P ′ · ℓ d ′ − ˆ d + g · ℓ d ′ − ˆ d + =

0, for some g ∈ F [ z ] (by chain rule) ⇒ d ′ ! ( d ′ − ˆ d ) ! · P ′ + g · ℓ = ( F ) > deg z ( P · ℓ ˆ d ) and P ′ is non-zero and not divisible by ℓ .Let z i = { z i ,1 , . . . , z i , p } . We do the analysis for the two cases t ( m − k ) ≤ √ n and t ( m − k ) ≥ √ n .Suppose t ( m − k ) ≤ √ n so that | z i ∩ z j | ≤ j t ( m − k ) k for i = j . By Observation D.1, there is amonomial β i in P i · (cid:0) z i ,1 + . . . + z i , p (cid:1) t ( m − k ) that is divisible by z i ,1 · z i ,2 · · · z i , ⌊ t ( m − k ) ⌋ as p = ⌊ √ n ⌋ ≥⌊ t ( m − k ) ⌋ . If β i is generated by some other term P j · (cid:0) z j ,1 + . . . + z j , p (cid:1) t ( m − k ) then there is a monomialin P j that is divisible by at least ⌊ t ( m − k ) ⌋ − ⌊ t ( m − k ) ⌋ distinct z -variables, as | z i ∩ z j | ≤ j t ( m − k ) k . Butthis is not possible as 2 k ( t − ) < ⌊ t ( m − k ) ⌋ − ⌊ t ( m − k ) ⌋ for m > k .53uppose t ( m − k ) ≥ √ n , in which case t ( m − k ) p ≥

2. By Observation D.1, there is a monomial β i in P i · (cid:0) z i ,1 + . . . + z i , p (cid:1) t ( m − k ) that is divisible by z ⌊ t ( m − k ) p ⌋ i ,1 · z ⌊ t ( m − k ) p ⌋ i ,2 · · · z ⌊ t ( m − k ) p ⌋ i , p .If β i is generated by some other term P j · (cid:0) z j ,1 + . . . + z j , p (cid:1) t ( m − k ) then there is a monomial in P j that is divisible by at least p − p + = p − distinct z -variables each with multiplicity ⌊ t ( m − k ) p ⌋ (as | z i ∩ z j | ≤ p + ). But this is not possible as 2 k ( t − ) < p − · ⌊ t ( m − k ) p ⌋ for m > k . E Proofs from Section 4

Proof of Observation 4.1 As k = ⌊ δ · dt ⌋ and t ≤ δ · d ln d , we have k ≥ ⌊ ln d ⌋ . By deﬁnition, c = · ln nk ln dk = · " ln nd ln dk + ≥ · =

32 (as n ≥ d ).By choice, n = ⌊ c · k ⌋ . So, n ≤ ck ≤ · " ln nd ln t δ + · δ dt (as k ≤ δ dt ) ≤ · " ln nd ln ln nd δ + · δ d ln nd (as ln nd ≤ t ) = · " nd + ln δ + nd · δ d ≤ · nd · δ d ≤ d ln ln d . Proof of Observation 4.2

Recall from Equation (6),

APP k , n ( f ) = max L dim D π L ( ∂ k x f ) E ,where L = ( ℓ ( z ) , . . . , ℓ n ( z )) is an n -tuple of linear forms in F [ z ] and | z | = n . An element of π L ( ∂ k x f ) is a homogeneous polynomial of degree d − k in z -variables, such a polynomial can haveat most ( d − k + n − n − ) many z -monomials. Hence, APP k , n ( f ) ≤ ( d − k + n − n − ) .54 roof of Proposition 4.1 For i ∈ [ s ] , let T i = Q i Q i · · · Q im i be a term of the formula C given by Equation (17), where degreeof every Q ij is in [ t , 2 t ] . Observe that for any n -tuple of linear forms L = ( ℓ ( z ) , . . . , ℓ n ( z )) , D π L ( ∂ k x T i ) E ⊆ * z ≤ tk · [ S ∈ ( [ mi ] k )  ∏ j ∈ [ m i ] \ S Q ij + .Hence, APP k , n ( T i ) ≤ ( mk ) · ( n + ktn ) as m i ≤ m . By subadditivity of the APP measure, we have

APP k , n ( T + . . . + T s ) ≤ s · (cid:18) mk (cid:19) · (cid:18) n + ktn (cid:19) . Proof of Proposition 4.2

Suppose C = f n , d , t in Equation (17). Then, by Proposition 4.1, APP k , n ( f n , d , t ) ≤ s · ( mk ) · ( n + ktn ) . Onthe other hand, by Proposition 4.5, APP k , n ( f n , d , t ) = ( d − k + n − n − ) . Therefore, s ≥ ( d − k + n − n − )( mk ) · ( n + ktn ) = n d − k + n · ( d − k + n n )( mk ) · ( n + ktn ) ≥ kd · ( dn )( mk ) · ( n + ktn ) , (as n ≥ k ) ≥ kd · (cid:16) dn (cid:17) n (cid:0) emk (cid:1) k · (cid:16) e · ( n + kt ) n (cid:17) n = kd · (cid:0) emk (cid:1) k · (cid:20) de · ( n + kt ) (cid:21) n ≥ kd · (cid:0) emk (cid:1) k · (cid:20) d ekt (cid:21) n , (as n ≤ kt ) ≥ kd · ( · e ) k · e n , (plugging in the values of k , m and δ ) ≥ kde · e · k · e · · (cid:20) ln nd ln dk + (cid:21) · k (plugging in the values of n , c , putting 4.01 < e ) ≥ kde · e · k · e · (cid:20) ln nd ln dk (cid:21) · k ≥ kde · e · (cid:20) ln nd ln dk (cid:21) · k (as nd ≥ dk ) ≥ e · e · (cid:20) ln nd ln dk (cid:21) · k (as ln nd ln dk · k ≥ ln dk ) = (cid:16) nd (cid:17) Ω ( dt ln t ) (as dk = Θ ( t ) ).55 roof of Proposition 4.3 As in the proof of Proposition 4.2, we have s ≥ ( d − k + n − n − )( mk ) · ( n + ktn ) = n d − k + n · ( d − k + n n )( mk ) · ( n + ktn ) ≥ · ( n + d − kn ) O ( dt ) · ( n + ktn ) , (as n ≥ d , m = Θ ( d / t ) and k = Θ ( d / t ) ) = O ( dt ) · (cid:18) + n d − k (cid:19) · · · (cid:18) + n d − k − ( d − k − kt − ) (cid:19) ≥ O ( dt ) · (cid:16) n d (cid:17) d − k − kt ≥ O ( dt ) · n k d · ( d − k − kt ) (as √ n ≥ n k d ≥ d ) = n Ω ( dt ) (plugging in the value of k ). Proof of Proposition 4.4

High t case. In this case n = ⌊ c · k ⌋ , where c = · ln nk ln dk . On one hand, (cid:18) d − k + n − n − (cid:19) ≤ (cid:18) d + n n (cid:19) ≤ (cid:18) e · (cid:18) dn + (cid:19)(cid:19) n ≤ (cid:18) e · (cid:18) dck + (cid:19)(cid:19) ck (as n ≤ ck ) ≤ (cid:18) e · dk (cid:19) ck (as c ≥

32 ) = e ck · (cid:16) nk (cid:17) · k (plugging in the value of c ) ≤ (cid:16) nk (cid:17) · k (as e ck ≪ (cid:16) nk (cid:17) · k ).On the other hand, (cid:18) n k (cid:19) = (cid:18) n − n ( d − k ) k (cid:19) ≥ (cid:18) n − n ( d − k ) k (cid:19) k ≥ (cid:16) nk − c ( d − k ) (cid:17) k (as n ≤ ck ) ≥ (cid:16) nk (cid:17) · k (as cd ≪ nk ).56 ow t case. In this case n = ⌈ n kd ⌉ . Verify that n ≥ d . (cid:18) d − k + n − n − (cid:19) ≤ (cid:18) d + n n (cid:19) ≤ (cid:18) e · n + dd (cid:19) d ≤ (cid:18) e · n d (cid:19) d (as n ≥ d ) ≤ (cid:16) n k (cid:17) k (verify after putting the values of n and n ) ≤ (cid:18) n k (cid:19) Proof of Proposition 4.5

Observe that ∂ k y f n , d , t = B and π ( B ) = z d − k . So, APP k , n ( f n , d , t ) ≥ ( d − k + n − n − ) . On the other hand,by Observation 4.2, APP k , n ( f n , d , t ))

Observe that ∂ k y f n , d , t = B and π ( B ) = z d − k . So, APP k , n ( f n , d , t ) ≥ ( d − k + n − n − ) . On the other hand,by Observation 4.2, APP k , n ( f n , d , t )) ≤ ( d − k + n − n − ))