Data-Discriminants of Likelihood Equations
Jose Israel Rodriguez†   Xiaoxian Tang‡

June 24, 2018

* This research paper is partly supported by the 2014 NIMS Thematic Program on Applied Algebraic Geometry.
† ACMS Department, University of Notre Dame, Notre Dame, IN 46556; [email protected]. This material is based upon work supported by the National Science Foundation under Award No. DMS-1402545, as well as the hospitality of the Simons Institute.
‡ CAMP, National Institute for Mathematical Sciences, Jeonmin-Dong, Yuseong-gu, Daejeon, Republic of Korea 305-811; [email protected].

Abstract
Maximum likelihood estimation (MLE) is a fundamental computational problem in statistics. The problem is to maximize the likelihood function with respect to given data on a statistical model. An algebraic approach to this problem is to solve a very structured parameterized polynomial system called the likelihood equations. For general choices of data, the number of complex solutions to the likelihood equations is finite and called the ML-degree of the model. The only solutions to the likelihood equations that are statistically meaningful are the real/positive solutions. However, the number of real/positive solutions is not characterized by the ML-degree. We use discriminants to classify data according to the number of real/positive solutions of the likelihood equations. We call these discriminants data-discriminants (DD). We develop a probabilistic algorithm for computing DDs. Experimental results show that, for the benchmarks we have tried, the probabilistic algorithm is more efficient than the standard elimination algorithm. Based on the computational results, we discuss the real root classification problem for the 3 by 3 symmetric matrix model.

1 Introduction
We begin the introduction with an illustrative example. Suppose we have a weighted four-sided die such that the probability p_i of observing side i (i = 0, 1, 2, 3) of the die satisfies the constraint p_0 + 2p_1 + 3p_2 − 4p_3 = 0.
We toss the die 1000 times and record a 4-dimensional data vector (u_0, u_1, u_2, u_3), where u_i is the number of times we observe side i. We want to determine the probability distribution (p_0, p_1, p_2, p_3) ∈ R^4_{>0} that best explains the data subject to the constraint. One approach is maximum likelihood estimation (MLE):
maximize the likelihood function p_0^{u_0} p_1^{u_1} p_2^{u_2} p_3^{u_3} subject to

p_0 + 2p_1 + 3p_2 − 4p_3 = 0,  p_0 + p_1 + p_2 + p_3 = 1,  p_0 > 0, p_1 > 0, p_2 > 0,
and p_3 > 0.

For some statistical models, the MLE problem can be solved by well-known hill-climbing algorithms such as the EM-algorithm. However, the hill-climbing method can fail if there is more than one local maximum. Fortunately, it is known that the MLE problem can be solved by solving the system of likelihood equations [15, 2]:

F_0 = p_0(λ_1 + λ_2) − u_0 = 0
F_1 = p_1(λ_1 + 2λ_2) − u_1 = 0
F_2 = p_2(λ_1 + 3λ_2) − u_2 = 0
F_3 = p_3(λ_1 − 4λ_2) − u_3 = 0
F_4 = p_0 + 2p_1 + 3p_2 − 4p_3 = 0
F_5 = p_0 + p_1 + p_2 + p_3 − 1 = 0

where λ_1 and λ_2 are newly introduced indeterminates (Lagrange multipliers) used to formulate the likelihood equations. More specifically, for given (u_0, u_1, u_2, u_3), if (p_0, p_1, p_2, p_3) is a critical point of the likelihood function, then there exist complex numbers λ_1 and λ_2 such that (p_0, p_1, p_2, p_3, λ_1, λ_2) is a solution of the polynomial system. For randomly chosen data u_i, the likelihood equations have 3 complex solutions. However, only solutions with positive coordinates p_i are statistically meaningful. A solution whose p_i coordinates are all positive is said to be a positive solution. So an important problem is real root classification (RRC): for which u_i does the polynomial system have 1, 2, or 3 real/positive solutions?
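To make the discussion concrete, the system above is easy to experiment with. The following is a minimal Macaulay2 sketch (ours, not code from this paper); the data vector (3, 5, 7, 11) is a hypothetical choice, and any generic positive data should behave the same way.

    -- Likelihood equations of the die model at hypothetical data (3,5,7,11)
    R = QQ[p0,p1,p2,p3,l1,l2];
    (u0,u1,u2,u3) = (3,5,7,11);
    I = ideal(p0*(l1+l2)-u0, p1*(l1+2*l2)-u1, p2*(l1+3*l2)-u2,
              p3*(l1-4*l2)-u3, p0+2*p1+3*p2-4*p3, p0+p1+p2+p3-1);
    dim I, degree I  -- expect (0, 3): finitely many solutions, ML-degree 3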
According to the theory of computational (real) algebraic geometry [26, 20], the number of (real/positive) solutions only changes when the data u_i crosses certain "special" values (see Theorem 2). The set of "special" u_i is a variety in 4-dimensional complex space (in fact a projective variety in 3-dimensional complex projective space; see Lemma 4 in [20]). The number of real/positive solutions is uniform over each open connected component determined by the variety. In other words, the "special" u_i play a role similar to that of the discriminant for univariate polynomials. The first step of RRC is calculating the "special" u_i, which leads to the discriminant problem: how can the "special" u_i be computed effectively?

Geometrically, the set of "special" u_i is a projection of a variety, so in principle it can be computed by elimination (see Chapter 3, pages 115–128 in [6]). For instance, with the command eliminate in Macaulay2 [10], we compute that the "special" u_i in the illustrative example form a hypersurface defined by a homogeneous polynomial in (u_0, u_1, u_2, u_3) (see Example 1). However, for most MLE problems, the elimination computation is too expensive due to the large size of the likelihood equations. In this paper, we discuss this "discriminant" problem for the likelihood equations. The contributions of the paper are as follows.

• For likelihood equations, we show that the "special" u_i form a projective variety. We call the homogeneous polynomial that generates the codimension-1 component of this projective variety the data-discriminant. This name distinguishes it from the weight-discriminant of the likelihood equations (which replaces the condition p_0 + · · · + p_n = 1 with h_0 p_0 + · · · + h_n p_n = 1 for weights h_0, . . . , h_n).

• For algebraic statistical models, we develop a probabilistic algorithm to compute data-discriminants. We implement the algorithm in
Macaulay2. Experimental results show that the probabilistic algorithm is more efficient than the standard elimination algorithm.

• We discuss the real root classification for the 3 × 3 symmetric matrix model.

2 Data-Discriminant

In this section, we discuss how to define the "data-discriminant". We assume the reader is familiar with elimination theory (see Chapter 3 in [6]).
Notation 1.
Let P^n denote the n-dimensional complex projective space. For homogeneous polynomials g_1, . . . , g_s in Q[p_0, . . . , p_n], V(g_1, . . . , g_s) denotes the projective variety in P^n defined by g_1, . . . , g_s. Let Δ_n denote the n-dimensional probability simplex {(p_0, . . . , p_n) ∈ R^{n+1} | p_0 > 0, . . . , p_n > 0, p_0 + · · · + p_n = 1}.

Definition 1. [15] (Algebraic Statistical Model and Model Invariant)
The set X is said to be an algebraic statistical model if X = V(g_1, . . . , g_s) ∩ Δ_n, where g_1, . . . , g_s define an irreducible generically reduced projective variety. Each g_k (1 ≤ k ≤ s) is said to be a model invariant of X.

For a given algebraic statistical model, there are several different ways to formulate the likelihood equations [15]. In this section, we introduce the Lagrange likelihood equations and define the data-discriminant for this formulation. One can define data-discriminants for the other formulations of the likelihood equations similarly.
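As a small illustration (a sketch under our notation, not the paper's code), the hypotheses of Definition 1 can be checked in Macaulay2 for the die model's invariant, whose ideal is prime and hence defines an irreducible, generically reduced variety:

    -- Definition 1 sanity check for the die model invariant
    R = QQ[p0,p1,p2,p3];
    isPrime ideal(p0+2*p1+3*p2-4*p3)  -- true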
Notation 2.
For any f_1, . . . , f_m in the polynomial ring Q[x_1, . . . , x_k], V_a(f_1, . . . , f_m) denotes the affine variety in C^k defined by f_1, . . . , f_m, and ⟨f_1, . . . , f_m⟩ denotes the ideal generated by f_1, . . . , f_m. For an ideal I in Q[x_1, . . . , x_k], V_a(I) denotes the affine variety defined by I.

Definition 2. [13] (Lagrange Likelihood Equations and Correspondence)
Given an algebraic statistical model X, the system of polynomial equations below is said to be the Lagrange likelihood equations of X:

F_0 = p_0(λ_1 + (∂g_1/∂p_0)λ_2 + · · · + (∂g_s/∂p_0)λ_{s+1}) − u_0 = 0
  ⋮
F_n = p_n(λ_1 + (∂g_1/∂p_n)λ_2 + · · · + (∂g_s/∂p_n)λ_{s+1}) − u_n = 0
F_{n+1} = g_1(p_0, . . . , p_n) = 0
  ⋮
F_{n+s} = g_s(p_0, . . . , p_n) = 0
F_{n+s+1} = p_0 + · · · + p_n − 1 = 0

where g_1, . . . , g_s are the model invariants of X and u_0, . . . , u_n, p_0, . . . , p_n, λ_1, . . . , λ_{s+1} are indeterminates (also denoted by u, p, Λ). More specifically,
– p_0, . . . , p_n, λ_1, . . . , λ_{s+1} are unknowns,
– u_0, . . . , u_n are parameters.
V_a(F_0, . . . , F_{n+s+1}), namely the set {(u, p, Λ) ∈ C^{n+1} × C^{n+1} × C^{s+1} | F_0 = 0, . . . , F_{n+s+1} = 0}, is said to be the Lagrange likelihood correspondence of X and is denoted by L_X.
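The construction in Definition 2 is mechanical, so it is convenient to automate. Below is a hedged Macaulay2 helper of our own (the function name lagrangeSystem and the variable names l and u are our choices, not the paper's implementation); it assumes the invariants live in a ring whose variables do not clash with l and u:

    -- Build the Lagrange likelihood equations from invariants G in QQ[p_0..p_n]
    lagrangeSystem = G -> (
        S := ring first G;
        n := numgens S - 1;
        s := #G;
        R := QQ[gens S, l_1..l_(s+1), u_0..u_n];
        g := apply(G, h -> sub(h, R));
        p := apply(n+1, i -> R_i);
        F := apply(n+1, i ->
            p#i * (l_1 + sum(s, k -> diff(p#i, g#k) * l_(k+2))) - u_i);
        F | g | {sum p - 1});
    -- Example: the four-sided die model from Section 1
    S = QQ[p_0..p_3];
    netList lagrangeSystem {p_0 + 2*p_1 + 3*p_2 - 4*p_3}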
Notation 3.
Let π denote the canonical projection from the ambient space of the Lagrange likelihood correspondence to the C^{n+1} associated to the u indeterminates:

π : C^{n+1} × C^{n+s+2} → C^{n+1}.

Given an algebraic statistical model X and a data vector u ∈ R^{n+1}_{>0}, the maximum likelihood estimation (MLE) problem is to maximize the likelihood function p_0^{u_0} · · · p_n^{u_n} subject to X. The MLE problem can be solved by computing π^{−1}(u) ∩ L_X. More specifically, if p is a regular point of V(g_1, . . . , g_s), then p is a critical point of the likelihood function if and only if there exists Λ ∈ C^{s+1} such that (u, p, Λ) ∈ L_X. Theorem 1 states that for a general data vector u, π^{−1}(u) ∩ L_X is a finite set and the cardinality of π^{−1}(u) ∩ L_X is constant over a dense Zariski open set; this inspires the definition of the ML-degree. For details, see [15].

Theorem 1. [15] For an algebraic statistical model X, there exist an affine variety V ⊂ C^{n+1} and a non-negative integer N such that for any u ∈ C^{n+1} \ V, #(π^{−1}(u) ∩ L_X) = N.

Definition 3. [15] (ML-Degree)
For an algebraic statistical model X, the non-negative integer N stated in Theorem 1 is said to be the ML-degree of X.
Notation 4.
For any S in C^{n+1}, I(S) denotes the ideal {D ∈ Q[u] | D(a_0, . . . , a_n) = 0 for all (a_0, . . . , a_n) ∈ S}. S̄ denotes the affine closure of S in C^{n+1}, namely V_a(I(S)).

Definition 4.
For an algebraic statistical model X, suppose F_0, . . . , F_{n+s+1} are defined as in Definition 2. Let J denote the determinant

J = det
| ∂F_0/∂p_0        · · ·  ∂F_0/∂p_n        ∂F_0/∂λ_1        · · ·  ∂F_0/∂λ_{s+1}       |
|     ⋮                       ⋮                 ⋮                        ⋮               |
| ∂F_{n+s+1}/∂p_0  · · ·  ∂F_{n+s+1}/∂p_n  ∂F_{n+s+1}/∂λ_1  · · ·  ∂F_{n+s+1}/∂λ_{s+1} |

Then, we have the following:
• L_X^∞ denotes the set of non-properness of π, i.e., the set of u in the affine closure of π(L_X) such that there does not exist a compact neighborhood U of u for which π^{−1}(U) ∩ L_X is compact;
• L_X^J denotes the affine closure of π(L_X ∩ V_a(J));
• L_X^p denotes the affine closure of π(L_X ∩ V_a(∏_{k=0}^n p_k)).

The geometric meanings of L_X^p and L_X^J are as follows. The first, L_X^p, is the projection of the intersection of the Lagrange likelihood correspondence with the coordinate hyperplanes. The second, L_X^J, is the projection of the intersection of the Lagrange likelihood correspondence with the hypersurface defined by J. Geometrically, L_X^J is the closure of the union of the projection of the singular locus of L_X and the set of critical values of the restriction of π to the regular locus of L_X (see Definition 2 in [20]).

The Lagrange likelihood equations define an affine variety. As we continuously deform the parameters u_i, coordinates of a solution can tend to infinity. Geometrically, L_X^∞ is the set of data u such that the Lagrange likelihood equations have some solution (p, Λ) at infinity; it is the closure of the set of "non-properness" as defined on page 1 of [19] and page 3 of [23]. It is known that the set of non-properness of π is closed and can be computed by Gröbner bases (see Lemma 2 and Theorem 2 in [20]).

The ML-degree captures the geometry of the likelihood equations over the complex numbers. However, statistically meaningful solutions occur over the real numbers. Theorem 2 below states that L_X^∞, L_X^J and L_X^p determine open connected components over each of which the number of real/positive solutions is uniform. Theorem 2 is a corollary of Ehresmann's theorem, for which semi-algebraic statements have existed since 1992 [5].
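Before stating Theorem 2, here is a small computational illustration of Definition 4 on the die model: J and the elimination ideal defining L_X^J are small enough to compute directly. A hedged Macaulay2 sketch (our code; the radical call may take a moment):

    -- J from Definition 4 and <F,J> ∩ QQ[u] for the die model
    needsPackage "Elimination";
    R = QQ[p0,p1,p2,p3,l1,l2,u0,u1,u2,u3];
    F = {p0*(l1+l2)-u0, p1*(l1+2*l2)-u1, p2*(l1+3*l2)-u2, p3*(l1-4*l2)-u3,
         p0+2*p1+3*p2-4*p3, p0+p1+p2+p3-1};
    pl = {p0,p1,p2,p3,l1,l2};
    J = det matrix table(#F, #pl, (i,j) -> diff(pl#j, F#i));
    Gu = eliminate(pl, ideal F + ideal J);
    radical Gu  -- its codimension-1 part is the D_X^J of Example 1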
Theorem 2.
For any algebraic statistical model X,
• if C_1, . . . , C_t are the open connected components of R^{n+1}_{>0} \ (L_X^∞ ∪ L_X^J), then for each k (1 ≤ k ≤ t) and for any u ∈ C_k, #(π^{−1}(u) ∩ L_X ∩ R^{n+s+2}) is constant;
• if C_1, . . . , C_t are the open connected components of R^{n+1}_{>0} \ (L_X^∞ ∪ L_X^J ∪ L_X^p), then for each k (1 ≤ k ≤ t) and for any u ∈ C_k, #(π^{−1}(u) ∩ L_X ∩ (R^{n+1}_{>0} × R^{s+1})) is constant.

We discuss the relationships among L_X^∞, L_X^J, and L_X^p below.
• Proposition 1 shows that the structure of the likelihood equations forces L_X^p to be contained in the union of coordinate hyperplanes defined by ∏_{k=0}^n u_k.
• Proposition 2 shows that the structure of the likelihood equations forces L_X^J \ {0} to be a projective variety.
• Similarly to the proof of Proposition 2, one can show that the structure of the likelihood equations forces L_X^∞ \ {0} to be a projective variety.

Proposition 1.
For any algebraic statistical model X, L_X^p ⊂ V_a(∏_{k=0}^n u_k).

Proof.
By Definition 2, for any k (0 ≤ k ≤ n),

u_k = p_k(λ_1 + (∂g_1/∂p_k)λ_2 + · · · + (∂g_s/∂p_k)λ_{s+1}) − F_k.

Hence u_k ∈ ⟨F_k, p_k⟩ ∩ Q[u_k] ⊂ ⟨F_0, . . . , F_{n+s+1}, p_k⟩ ∩ C[u]. So V_a(⟨F_0, . . . , F_{n+s+1}, p_k⟩ ∩ C[u]) ⊂ V_a(u_k). By the Closure Theorem [6], V_a(⟨F_0, . . . , F_{n+s+1}, p_k⟩ ∩ C[u]) is the affine closure of π(L_X ∩ V_a(p_k)). Therefore,

L_X^p = closure of π(L_X ∩ V_a(∏_{k=0}^n p_k))
      = closure of π(L_X ∩ ∪_{k=0}^n V_a(p_k))
      = ∪_{k=0}^n closure of π(L_X ∩ V_a(p_k))
      ⊂ ∪_{k=0}^n V_a(u_k) = V_a(∏_{k=0}^n u_k).  □

Remark 1.
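On the die model, the containment in Proposition 1 can be observed directly; a quick Macaulay2 check (ours):

    -- The projection of L_X ∩ V_a(p0) lies in V_a(u0)
    needsPackage "Elimination";
    R = QQ[p0,p1,p2,p3,l1,l2,u0,u1,u2,u3];
    F = ideal(p0*(l1+l2)-u0, p1*(l1+2*l2)-u1, p2*(l1+3*l2)-u2,
              p3*(l1-4*l2)-u3, p0+2*p1+3*p2-4*p3, p0+p1+p2+p3-1);
    E = eliminate({p0,p1,p2,p3,l1,l2}, F + ideal p0);
    u0 % E == 0  -- true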
Generally, L_X^p ≠ V_a(∏_{k=0}^n u_k). For example, suppose the algebraic statistical model is V(p_0 − p_1) ∩ Δ_1. Then L_X^p = ∅ ≠ V_a(u_0 u_1).

Notation 5.
D_X^p denotes the product ∏_{k=0}^n u_k.

Proposition 2.
For an algebraic statistical model X, L_X^J \ {0} is a projective variety in P^n, where 0 is the zero vector (0, . . . , 0) in C^{n+1}.

Proof. By the formulation of the Lagrange likelihood equations, we can prove that I(π(L_X ∩ V_a(J))) is a homogeneous ideal using the two basic facts below, which can be proved from Definition 2 and basic algebraic geometry arguments.

C1.
For every u in π(L_X ∩ V_a(J)), each scalar multiple αu is also in π(L_X ∩ V_a(J)).

C2.
For any S ⊂ C^{n+1}, if for every u ∈ S and every scalar α ∈ C we have αu ∈ S, then I(S) is a homogeneous ideal in Q[u].

It follows that the ideal I(π(L_X ∩ V_a(J))) is generated by finitely many homogeneous polynomials D_1, . . . , D_m. Therefore, L_X^J = V_a(I(π(L_X ∩ V_a(J)))) = V_a(D_1, . . . , D_m). So L_X^J \ {0} = V(D_1, . . . , D_m) ⊂ P^n.  □

Notation 6.
For an algebraic statistical model X, we define the notation D_X^J according to the codimension of L_X^J \ {0} in P^n.
• If the codimension is 1, then assume V(D_1), . . . , V(D_K) are the codimension-1 irreducible components in the minimal irreducible decomposition of L_X^J \ {0} in P^n, where ⟨D_1⟩, . . . , ⟨D_K⟩ are radical. D_X^J denotes the homogeneous polynomial ∏_{j=1}^K D_j.
• If the codimension is greater than 1, then our convention is to take D_X^J = 1.
Similarly, we define the notation D_X^∞ for the projective variety L_X^∞ \ {0}. Now we define the "data-discriminant" of the Lagrange likelihood equations.

Definition 5. (Data-Discriminant)
For a given algebraic statistical model X, the homogeneous polynomial D_X^∞ · D_X^J · D_X^p is said to be the data-discriminant (DD) of the Lagrange likelihood equations of X and is denoted by D_X.

Remark 2.
Note that the DD can be viewed as a generalization of the "discriminant" for univariate polynomials, so it is interesting to compare the DD with the border polynomial (BP) [26] and the discriminant variety (DV) [20]. DV and BP are defined for general parametric polynomial systems; DD is defined for the likelihood equations but can be generalized to any square and generically zero-dimensional system. Generally, for any square and generically zero-dimensional system, V_a(DD) ⊂ DV ⊂ V_a(BP). Note that, due to the special structure of the likelihood equations, DD is a homogeneous polynomial even though the system of equations is affine. In general, however, DV is not a projective variety and BP is not homogeneous.

Example 1 (Linear Model).
The algebraic statistical model for the four-sided die story in Section 1 is given by X = V(p_0 + 2p_1 + 3p_2 − 4p_3) ∩ Δ_3. The Lagrange likelihood equations are the system F_0 = 0, . . . , F_5 = 0 displayed in Section 1, and L_X = V_a(F_0, . . . , F_5) ⊂ C^{10}. If we choose generic (u_0, u_1, u_2, u_3) ∈ C^4, then #(π^{−1}(u_0, u_1, u_2, u_3) ∩ L_X) = 3; namely, the ML-degree is 3. The data-discriminant is the product of D_X^∞, D_X^p and D_X^J, where

D_X^∞ = u_0 + u_1 + u_2 + u_3,  D_X^p = u_0 u_1 u_2 u_3,

and D_X^J is a homogeneous polynomial of degree 4 in (u_0, u_1, u_2, u_3) with 35 terms, which we do not reproduce here. By applying the well-known partial cylindrical algebraic decomposition (PCAD) [4] method to the data-discriminant above, we get that for any (u_0, u_1, u_2, u_3) ∈ R^4_{>0}:
• if D_X^J(u_0, u_1, u_2, u_3) > 0, then the system of likelihood equations has 3 distinct real solutions and 1 of them is positive;
• if D_X^J(u_0, u_1, u_2, u_3) < 0, then the system of likelihood equations has exactly 1 real solution and it is positive.
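These counts can be spot-checked numerically. Below is a hedged Macaulay2 sketch using the NumericalAlgebraicGeometry package; the data (3, 5, 7, 11) is hypothetical, and the tolerance 1e-8 is an arbitrary choice of ours:

    -- Count complex, real, and positive solutions for one data vector
    needsPackage "NumericalAlgebraicGeometry";
    R = CC[p0,p1,p2,p3,l1,l2];
    (u0,u1,u2,u3) = (3,5,7,11);
    F = {p0*(l1+l2)-u0, p1*(l1+2*l2)-u1, p2*(l1+3*l2)-u2,
         p3*(l1-4*l2)-u3, p0+2*p1+3*p2-4*p3, p0+p1+p2+p3-1};
    sols = solveSystem F;
    isRealPt = s -> all(coordinates s, c -> abs imaginaryPart c < 1e-8);
    real = select(sols, isRealPt);
    pos = select(real, s -> all(take(coordinates s, 4), c -> realPart c > 0));
    (#sols, #real, #pos)  -- 3 complex solutions; (3,1) or (1,1) real/positive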
The classification above can also be verified with the RealRootClassification [26, 3] command in
Maple 17. In this example, D_X^∞ does not affect the number of real/positive solutions, since it is always positive when each u_i is positive. In general, however, D_X^∞ plays an important role in real root classification. Also remark that real root classification is equivalent to positive root classification for this example, but this is not true in general (see Example 6).

3 Algorithm

In this section, we discuss how to compute D_X. We assume that X is a given statistical model, F_0, . . . , F_{n+s+1} are defined as in Definition 2, and J is defined as in Definition 4. We rename F_0, . . . , F_{n+s+1} as F_0, . . . , F_m (so m = n + s + 1). Subsection 3.1 presents the standard elimination algorithm for reference, and Subsection 3.2 presents our main algorithm (Algorithm 2).

3.1 Standard Elimination Algorithm

Considering the data-discriminant as a projection leads to a natural algorithm to compute it, the standard elimination algorithm in symbolic computation:

• we compute L_X^J by elimination and then get D_X^J by the radical equidimensional decomposition (see Definition 3 in [20]); the algorithm is formally described as Algorithm 1;
Algorithm 1: DX-J
input: F_0, . . . , F_m, J
output: D_X^J
1  G_u ← the generating set of the elimination ideal ⟨F_0, . . . , F_m, J⟩ ∩ Q[u]
2  D_X^J ← the codimension-1 component of the equidimensional radical decomposition of ⟨G_u⟩
3  return D_X^J

• we compute L_X^∞ by the algorithm PROPERNESSDEFECTS presented in [20] and then get D_X^∞ by the radical equidimensional decomposition; we omit the formal description of this step.

The algorithms in this subsection cannot be used to compute the DDs of algebraic statistical models in a reasonable time; see Tables 1–2 in Section 4. This motivates the exploration of the more practical method in the next subsection.

3.2 Probabilistic Algorithm

First we prepare the lemmas; then we present the main algorithm (Algorithm 2).
• Lemma 1 is used to linearly transform the parameter space.
• Corollary 1 and Lemma 2 are used to compute the total degree of D_X^J.
• Corollary 2 is used in the sampling for interpolation.
Lemma 1.
For any G ∈ Q[u], there exists an affine variety V in C^n such that for any (a_1, . . . , a_n) ∈ C^n \ V, the total degree of G equals the degree of B(t_0, t_1, . . . , t_n) w.r.t. t_0, where

B(t_0, t_1, . . . , t_n) = G(t_0, a_1 t_0 + t_1, . . . , a_n t_0 + t_n).

Proof.
Suppose the total degree of G is d and G_d is the homogeneous component of G of total degree d. For any (a_1, . . . , a_n) with G_d(1, a_1, . . . , a_n) ≠ 0, let B(t_0, t_1, . . . , t_n) = G(t_0, a_1 t_0 + t_1, . . . , a_n t_0 + t_n). The coefficient of t_0^d in B is G_d(1, a_1, . . . , a_n), so the degree of B w.r.t. t_0 equals d.  □

Corollary 1.
For any G ∈ Q[u], there exists an affine variety V in C^{2n+2} such that for any (a_0, b_0, . . . , a_n, b_n) ∈ C^{2n+2} \ V, the total degree of G equals the degree of B(t), where

B(t) = G(a_0 t + b_0, . . . , a_n t + b_n).
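In other words, restricting G to a random affine line preserves its total degree. A tiny Macaulay2 check on a hypothetical G (the line parameters are arbitrary picks of ours):

    -- Corollary 1 in action: degree on a random line = total degree
    R = QQ[u0,u1,u2];
    G = u1^2 - 4*u0*u2;  -- hypothetical G
    T = QQ[t];
    B = sub(G, matrix{{T_0 + 11, 3*T_0 + 2, 5*T_0 + 6}});
    first degree B == first degree G  -- true: both equal 2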
Lemma 2.
There exists an affine variety V in C^{2n+2} such that for any (a_0, b_0, . . . , a_n, b_n) ∈ C^{2n+2} \ V, if

⟨A(t)⟩ = ⟨F_0(t), . . . , F_n(t), F_{n+1}, . . . , F_m, J⟩ ∩ Q[t],

where F_i(t) is the polynomial obtained by replacing u_i with a_i t + b_i in F_i (i = 0, . . . , n), and B(t) = D_X^J(a_0 t + b_0, . . . , a_n t + b_n), then ⟨B(t)⟩ = √⟨A(t)⟩.

Proof.
By the definition of D_X^J (Notation 6), there exists an affine variety V_1 such that for any (a_0, b_0, . . . , a_n, b_n) ∈ C^{2n+2} \ V_1, ⟨B(t)⟩ is radical. Thus we only need to show that there exists an affine variety V_2 in C^{2n+2} such that for any (a_0, b_0, . . . , a_n, b_n) ∈ C^{2n+2} \ V_2, V_a(⟨B(t)⟩) = V_a(⟨A(t)⟩).

Suppose π_t is the canonical projection C × C^{n+s+2} → C. For any t* ∈ π_t(V_a(F_0(t), . . . , F_n(t), F_{n+1}, . . . , F_m, J)), let u_i* = a_i t* + b_i (for i = 0, . . . , n); then (u_0*, . . . , u_n*) ∈ π(L_X ∩ V_a(J)). Hence D_X^J(u_0*, . . . , u_n*) = B(t*) = 0. Thus B(t) ∈ I(π_t(V_a(F_0(t), . . . , F_n(t), F_{n+1}, . . . , F_m, J))). Therefore,

V_a(A(t)) = V_a(I(π_t(V_a(F_0(t), . . . , F_n(t), F_{n+1}, . . . , F_m, J)))) ⊂ V_a(B(t)).

For any t* ∈ V_a(⟨B(t)⟩), let u_i* = a_i t* + b_i for i = 0, . . . , n; then (u_0*, . . . , u_n*) ∈ V_a(D_X^J) ⊂ L_X^J. By the Extension Theorem [6], there exists an affine variety V_3 ⊂ C^{2n+2} such that if (a_0, b_0, . . . , a_n, b_n) ∉ V_3, then (u_0*, . . . , u_n*) ∈ π(L_X ∩ V_a(J)), and thus t* ∈ π_t(V_a(F_0(t), . . . , F_n(t), F_{n+1}, . . . , F_m, J)) ⊂ V_a(A(t)).  □

Corollary 2.
There exists an affine variety V in C^n such that for any (a_1, . . . , a_n) ∈ C^n \ V, if

⟨A(u_0)⟩ = ⟨F_0, F_1*, . . . , F_n*, F_{n+1}, . . . , F_m, J⟩ ∩ Q[u_0],

where F_i* is the polynomial obtained by replacing u_i with a_i in F_i (i = 1, . . . , n), and B(u_0) = D_X^J(u_0, a_1, . . . , a_n), then ⟨B(u_0)⟩ = √⟨A(u_0)⟩.

Algorithm 2: (Main Algorithm) InterpolationDX-J
input: F_0, . . . , F_m, J
output: D_X^J
1   a_1, . . . , a_n ← LinearOperator(F_0, . . . , F_m, J)
2   for i from 1 to n do
3     F_i' ← replace u_i in F_i with a_i·u_0 + u_i
4   NewSys ← F_0, F_1', . . . , F_n', F_{n+1}, . . . , F_m, J
5   d, d_0, d_1, . . . , d_n ← Degree(NewSys)
6   for j from 1 to d do
7     rename all the monomials of the set {u_1^{α_1} · · · u_n^{α_n} | α_1 + · · · + α_n = j, 0 ≤ α_i ≤ d_i} as U_{j,1}, . . . , U_{j,N_j}
8   N ← max(N_1, . . . , N_d)
9   for k from 1 to N do
10    b_{k,1}, . . . , b_{k,n} ← random integers
11    A(u_0) ← Intersect(NewSys, b_{k,1}, . . . , b_{k,n})
12    C*_{1,k}, . . . , C*_{d,k} ← the coefficients of A(u_0) w.r.t. u_0^{d−1}, . . . , u_0^0
13  for j from 1 to d do
14    M_j ← the N_j × N_j matrix whose (k, r)-entry is U_{j,r}(b_{k,1}, . . . , b_{k,n})
15    C_j ← (U_{j,1}, . . . , U_{j,N_j}) · M_j^{−1} · (C*_{j,1}, . . . , C*_{j,N_j})^T
16  D_X^J ← replace u_1, . . . , u_n in u_0^d + Σ_{i=0}^{d−1} C_{d−i}·u_0^i with u_1 − a_1·u_0, . . . , u_n − a_n·u_0
17  return D_X^J
We now give an example to explain the basic idea of the probabilistic algorithm and how the lemmas are used in it.
Example 2 (Toy Example for Interpolation Idea).
Suppose the radical of the elimination ideal ⟨F, J⟩ ∩ Q[u_0, u_1, u_2] is generated by D(u_0, u_1, u_2), where F = u_0 p^2 + u_1 p + u_2 and J = 2u_0 p + u_1. We already know that D is homogeneous and equals u_1^2 − 4u_0 u_2. Rather than using the standard elimination algorithm, we compute D by the steps below.

• First, we substitute u_0 = t + 11, u_1 = 3t + 2 and u_2 = 5t + 6 into F and J (the integers 1, 11, 3, 2, 5 and 6 are randomly chosen). We compute the radical of the elimination ideal ⟨F(t, p), J(t, p)⟩ ∩ Q[t] and get ⟨t^2 + (232/11)t + 260/11⟩. By Lemma 2, D(t + 11, 3t + 2, 5t + 6) equals t^2 + (232/11)t + 260/11 up to a nonzero scalar, so the total degree of D is 2 (geometrically, this means the random line u_0 = t + 11, u_1 = 3t + 2, u_2 = 5t + 6 intersects V_a(D) in 2 points). Similarly, we compute the degrees of D w.r.t. u_0, u_1 and u_2 and get 1, 2 and 1, respectively. So all the possible monomials in D are u_1^2, u_0 u_1, u_1 u_2 and u_0 u_2.

• Assume D = u_1^2 + (C_1 u_0 + C_2 u_2) u_1 + C_3 u_0 u_2. We first substitute u_0 = 13 and u_2 = 4 into F and J. We compute the radical of the elimination ideal ⟨F(u_1, p), J(u_1, p)⟩ ∩ Q[u_1] and get ⟨u_1^2 − 208⟩. By Corollary 2, D(13, u_1, 4) equals u_1^2 − 208, so 13C_1 + 4C_2 = 0 and 52C_3 = −208, i.e., C_3 = −4. We still need one more equation relating C_1 and C_2, so we substitute u_0 = 7 and u_2 = 3 into F and J. Similarly, we get 7C_1 + 3C_2 = 0. Hence C_1 = C_2 = 0. Therefore, D = u_1^2 − 4u_0 u_2 (the integers 13, 4, 7 and 3 are randomly chosen).

This example is "nice" because the degree of D w.r.t. u_1 equals the total degree of D. In the general case, if there is no u_i such that the degree of D w.r.t. u_i equals the total degree, then we apply a linear transformation to change the parameter coordinates before interpolating. Lemma 1 guarantees that such a linear transformation makes sense.
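The two elimination steps above can be replayed in Macaulay2; the integers are the example's random choices, and the comments record what the output should be:

    -- Example 2 replayed: degree via a random line, then one sample of D
    needsPackage "Elimination";
    R = QQ[p,u0,u1,u2];
    I = ideal(u0*p^2 + u1*p + u2, 2*u0*p + u1);
    -- (i) restrict to the line u = (t+11, 3t+2, 5t+6) and eliminate p
    T = QQ[p,t];
    A = eliminate({T_0}, sub(I, matrix{{T_0, T_1+11, 3*T_1+2, 5*T_1+6}}));
    A_0  -- a scalar multiple of t^2 + (232/11)t + 260/11, so deg D = 2
    -- (ii) sample D at (u0,u2) = (13,4) by eliminating p
    U = QQ[p,u1];
    B = eliminate({U_0}, sub(I, matrix{{U_0, 13, U_1, 4}}));
    B_0  -- u1^2 - 208, which equals D(13, u1, 4)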
Algorithm 3: Intersect
input: F_0, . . . , F_m, J, and integers b_1, . . . , b_n
output: D_X^J(u_0, b_1, . . . , b_n)
1  for i from 1 to n do
2    F_i* ← replace u_i in F_i with b_i
3  A(u_0) ← the generator of the elimination ideal ⟨F_0, F_1*, . . . , F_n*, F_{n+1}, . . . , F_m, J⟩ ∩ Q[u_0]
4  A(u_0) ← the monic generator of √⟨A(u_0)⟩
5  return A(u_0)

Now we are prepared to introduce the probabilistic algorithm for computing D_X^J. We explain the main algorithm (Algorithm 2) and the sub-algorithms (Algorithms 3–5) below.
Algorithm 5 (Degree).
The probabilistic algorithm terminates correctly by Corollary 1 and Lemma 2.

Algorithm 5: Degree
input: F_0, . . . , F_m, J
output: d, d_0, . . . , d_n, where d is the total degree of D_X^J and d_i is the degree of D_X^J w.r.t. u_i (i = 0, . . . , n)
1   for i from 0 to n do
2     F_0*, . . . , F_n* ← replace u_0, . . . , u_{i−1}, u_{i+1}, . . . , u_n in F_0, . . . , F_n with random integers
3     A(u_i) ← the generator of the elimination ideal ⟨F_0*, . . . , F_n*, F_{n+1}, . . . , F_m, J⟩ ∩ Q[u_i]
4     A(u_i) ← the generator of √⟨A(u_i)⟩
5     d_i ← the degree of A(u_i)
6   a_0, b_0, . . . , a_n, b_n ← random integers
7   F_0(t), . . . , F_n(t) ← replace u_0, . . . , u_n in F_0, . . . , F_n with a_0·t + b_0, . . . , a_n·t + b_n
8   A(t) ← the generator of the elimination ideal ⟨F_0(t), . . . , F_n(t), F_{n+1}, . . . , F_m, J⟩ ∩ Q[t]
9   A(t) ← the generator of √⟨A(t)⟩
10  d ← the degree of A(t)
11  return d, d_0, . . . , d_n
Algorithm 4 (LinearOperator).
The probabilistic algorithm terminates correctly by Lemma 1.

Algorithm 4: LinearOperator
input: F_0, . . . , F_m, J
output: a_1, . . . , a_n such that the total degree of D_X^J equals the degree of D_X^J(u_0, a_1·u_0 + u_1, . . . , a_n·u_0 + u_n) w.r.t. u_0
1   d, d_0, d_1, . . . , d_n ← Degree(F_0, . . . , F_m, J)
2   if d_0 = d then
3     return 0, . . . , 0
4   else
5     repeat
6       for i from 1 to n do
7         a_i ← a random integer
8         F_i' ← replace u_i in F_i with a_i·u_0 + u_i
9       NewSys ← F_0, F_1', . . . , F_n', F_{n+1}, . . . , F_m, J
10      d, d_0, d_1, . . . , d_n ← Degree(NewSys)
11    until d_0 = d
12    return a_1, . . . , a_n
Algorithm 3 (Intersect).
The probabilistic algorithm terminates correctly by Corollary 2.
Algorithm 2 (InterpolationDX-J).

Lines 1–5.
We compute the total degree of D_X^J and the degrees of D_X^J w.r.t. u_0, . . . , u_n, that is, d, d_0, d_1, . . . , d_n, by Algorithm 5. Algorithm 4 guarantees that d_0 = d by applying a proper linear transformation u_1 = a_1·u_0 + u_1, . . . , u_n = a_n·u_0 + u_n.

Lines 6–7.
Suppose D_X^J = u_0^d + C_1 u_0^{d−1} + · · · + C_{d−1} u_0 + C_d, where C_1, . . . , C_d ∈ Q[u_1, . . . , u_n] and the total degree of C_j is j. For j = 1, . . . , d, we estimate all the possible monomials of C_j by computing the set

{u_1^{α_1} · · · u_n^{α_n} | α_1 + · · · + α_n = j, 0 ≤ α_i ≤ d_i}.

Assume the cardinality of this set is N_j, and rename its monomials as U_{j,1}, . . . , U_{j,N_j}. Then we assume C_j = c_{j,1} U_{j,1} + · · · + c_{j,N_j} U_{j,N_j}, where c_{j,1}, . . . , c_{j,N_j} ∈ Q. The rest of the algorithm computes the c_{j,1}, . . . , c_{j,N_j}.

Lines 8–12.
For each j, for k = 1, . . . , N_j, and for a random integer vector b_k = (b_{k,1}, . . . , b_{k,n}), we compute D_X^J(u_0, b_k) by Algorithm 3. That means we compute the value C_j(b_k) without knowing D_X^J itself.

Lines 13–15.
For each j, we solve a square linear system in the unknowns c_{j,1}, . . . , c_{j,N_j}:

c_{j,1}·U_{j,1}(b_k) + · · · + c_{j,N_j}·U_{j,N_j}(b_k) = C_j(b_k)   (k = 1, . . . , N_j).

It is known that we can probabilistically choose nice b_k such that the coefficient matrix of this linear system is non-singular.

Line 16.
We apply the inverse linear transformation in the parameter space to get D_X^J for the original F_0, . . . , F_m.

We can also apply the interpolation idea to the algorithm PROPERNESSDEFECTS [20] to get a probabilistic algorithm that computes D_X^∞. We omit the formal description of that algorithm.
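The linear solve in Lines 13–15 is plain interpolation. A small self-contained Macaulay2 sketch (the monomial basis and the sampled values are hypothetical; here the samples are values of (u_1 + u_2)^2, so the recovered coefficients are 1, 2, 1):

    -- Solve c_{j,1} U_{j,1}(b_k) + ... + c_{j,N_j} U_{j,N_j}(b_k) = C_j(b_k)
    R = QQ[u1,u2];
    U = {u1^2, u1*u2, u2^2};              -- candidate monomials U_{j,r}
    pts = {{1,2},{3,5},{2,7}};            -- random sample points b_k
    M = matrix table(#pts, #U,
            (k,r) -> sub(U#r, matrix{{pts#k#0/1, pts#k#1/1}}));
    v = transpose matrix{{9/1, 64, 81}};  -- sampled values C_j(b_k)
    (inverse M) * v                       -- the coefficients c_{j,r}: 1, 2, 1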
Remark 3.
According to Notation 6, when the codimension of L_X^J \ {0} (respectively, L_X^∞ \ {0}) is greater than 1, we define D_X^J (respectively, D_X^∞) to be 1. In this case, it is no longer true that the number of real/positive solutions remains constant over the regions determined by the data-discriminant. That means that if the output of Algorithm 2 is 1, we should use the standard method (elimination or Gröbner basis computation).

4 Experiments

We have implemented the probabilistic algorithm in
Macaulay2. We have also implemented the standard algorithm in Macaulay2 to do comparisons (Tables 1 and 2). Some of the necessary implementation details are given below.

• In Algorithm 1 (Line 1), Algorithm 3 (Line 3) and Algorithm 5 (Lines 3 and 8), we use the
Macaulay2 command eliminate to compute the elimination ideals.

• The probabilistic algorithm is implemented in two different ways. The first implementation interpolates all at once, exactly as in Algorithm 2. The second implementation interpolates step by step. For example, suppose D_X^J is a polynomial in u_0, u_1, u_2 and u_3. We first compute D_X^J(u_0, u_1, u_2*, u_3*) by interpolation for some chosen integers u_2* and u_3*, and then we compute D_X^J(u_0, u_1, u_2, u_3*) by interpolation. At that point it is easy to recover D_X^J, since D_X^J is homogeneous. The procedure is straightforward to describe, so we omit its formal description.

We ran Algorithms 1 and 2 on many examples to set benchmarks, using a 3.2 GHz Intel Core i5 processor (8GB total memory) under OS X 10.9.3. There are two kinds of benchmarks, random models and literature models.

• We generated 2 groups of random models. The first group was generated as follows: we generate a random homogeneous polynomial in the 3 variables p_0, p_1 and p_2 with total degree 2 and take it as a model invariant; we repeat the process for 10 rounds and get 10 random models. We call this group of 10 models the 2deg-models. Similarly, we generated the group of 3deg-models. Table 1 provides the timings of Algorithm 1 and Algorithm 2 (with the two different implementations) for the 2deg-models and 3deg-models.

• The literature models are examples presented in the literature [15, 7, 13]. Table 2 provides the timings of Algorithm 1 and Algorithm 2 (with the two different implementations) for the literature models.

Table 1: Timings of Computing D_X^J for Random Models (s: seconds; h: hours; S1: Strategy 1; S2: Strategy 2)
For Examples 3–5 in Table 2, the model invariants are listed below. Example 6 is given in Section 5.1.
Example 3 (Random Censoring; Example 2.2.2 in [7]).

2p_0 p_1 p_2 + p_1^2 p_2 + p_1 p_2^2 − p_0^2 p_3 + p_1 p_2 p_3

Example 4 (3 × 3 Zero-Diagonal Matrix [13]).

det | 0    p_0  p_1 |
    | p_2  0    p_3 |
    | p_4  p_5  0   |

Example 5 (Grassmannian of 2-planes in C^4 [15, 13]).

p_0 p_5 − p_1 p_4 + p_2 p_3

In Tables 1–2, the columns "Algorithm 1" give the timings of Algorithm 1, and the columns "Algorithm 2" give the timings of Algorithm 2, where "S1" and "S2" mean the first and the second implementation, respectively. Entries of the form ">16h" mean that the computation had not finished after that much time and produced no output. It is seen from the tables that

• for all the benchmarks we have tried, Algorithm 2 is more efficient than Algorithm 1;

• for the random models and Example 3, the two implementations of Algorithm 2 have almost the same efficiency;

• for Examples 4–6, the second implementation (interpolation step by step) of Algorithm 2 is more efficient than the first implementation (interpolation all at once); in fact, the two implementations take the same time to get sample points, but the first implementation takes more time to invert M_j in Algorithm 2 (Lines 13–15), which is a large matrix with rational entries;

• for Example 6, with the standard elimination algorithm, our computer runs out of memory after 12 days.

Note that for each benchmark, the output of Algorithm 2 is the same as that of Algorithm 1 whenever both algorithms terminate.

Table 2: Timings of Computing D_X^J for Literature Models (s: seconds; h: hours; d: days; S1: Strategy 1; S2: Strategy 2)
Models      Algorithm 1   Algorithm 2, S1   Algorithm 2, S2
Example 3
Example 4
Example 5   >16h          >16h              2768.2s
Example 6   >12d          >30d              30d
5 Conclusion

In order to classify data according to the number of real/positive solutions of the likelihood equations, we study the data-discriminant and develop a probabilistic algorithm to compute it. Experiments show that the probabilistic algorithm is more practical than the standard elimination algorithm. This is our first application of the real root classification method to the MLE/likelihood equations problem. Our future work aims to

• improve Algorithm 2 (Algorithm 2 applies an evaluation/interpolation technique to the standard method; it is not the first time that such an approach has been investigated: in [9, 24], Newton–Hensel lifting has been applied to compute (parametric) geometric resolutions, and it is hopeful that Algorithm 2 will become more powerful if we apply Newton–Hensel lifting techniques to balance the time consumed by the evaluation and lifting steps);

• study the data-discriminants of different formulations of the likelihood equations for the same algebraic statistical model;

• develop algorithms for computing real root classifications for likelihood equations.

More broadly, the ideas in Subsection 3.2 and Algorithm 2 can be applied to compute discriminants whenever the Newton polytope is known.

5.1 The 3 × 3 symmetric matrix model

We end the paper with a discussion of real root classification for the 3 × 3 symmetric matrix model. Suppose a gambler tosses two three-sided dice and records the outcomes in a matrix [u_ij] (i, j = 1, 2, 3), where u_ij is the number of times he gets the sides i and j on the two dice. From the matrix [u_ij], he tries to estimate the probability p_ij of getting the sides i and j on the two dice. It is easy to check that the matrix

| p_11  p_12  p_13 |
| p_12  p_22  p_23 |
| p_13  p_23  p_33 |

is symmetric and has rank at most 2. Folding the off-diagonal entries, set (in place)

p_ij ← p_ij (i = j),  p_ij ← p_ij + p_ji (i < j),
u_ij ← u_ij (i = j),  u_ij ← u_ij + u_ji (i < j).

We then have the algebraic statistical model below.
Example 6 (3 × 3 Symmetric Matrix Model).
The algebraic statistical model for the dice story is given by X = V(g) ∩ Δ, where

g = det | 2p_11  p_12   p_13  |
        | p_12   2p_22  p_23  |
        | p_13   p_23   2p_33 |,

Δ = {(p_11, p_12, p_13, p_22, p_23, p_33) ∈ R^6_{>0} | p_11 + p_12 + p_13 + p_22 + p_23 + p_33 = 1}.

The gambler's problem is equivalent to maximizing the likelihood function ∏_{i≤j} p_ij^{u_ij} / (Σ_{i≤j} p_ij)^{Σ_{i≤j} u_ij} subject to V(g) ∩ Δ. According to Definition 2, we present the Lagrange likelihood equations below:

F_1 = p_11(λ_1 + 2(4p_22 p_33 − p_23^2)λ_2) − u_11 = 0
F_2 = p_12(λ_1 + 2(p_13 p_23 − 2p_12 p_33)λ_2) − u_12 = 0
F_3 = p_13(λ_1 + 2(p_12 p_23 − 2p_13 p_22)λ_2) − u_13 = 0
F_4 = p_22(λ_1 + 2(4p_11 p_33 − p_13^2)λ_2) − u_22 = 0
F_5 = p_23(λ_1 + 2(p_12 p_13 − 2p_11 p_23)λ_2) − u_23 = 0
F_6 = p_33(λ_1 + 2(4p_11 p_22 − p_12^2)λ_2) − u_33 = 0
F_7 = g(p_11, p_12, p_13, p_22, p_23, p_33) = 0
F_8 = p_11 + p_12 + p_13 + p_22 + p_23 + p_33 − 1 = 0

where p_11, p_12, p_13, p_22, p_23, p_33, λ_1 and λ_2 are unknowns and u_11, u_12, u_13, u_22, u_23 and u_33 are parameters. We have 8 equations in 8 unknowns with 6 parameters, and the ML-degree is 6 [15].
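The specialized system is easy to sanity-check in Macaulay2; the data vector below is a hypothetical generic choice of ours, and the check reproduces the ML-degree:

    -- Example 6 at hypothetical data: zero-dimensional of degree 6
    R = QQ[p11,p12,p13,p22,p23,p33,l1,l2];
    g = det matrix{{2*p11,p12,p13},{p12,2*p22,p23},{p13,p23,2*p33}};
    P = {p11,p12,p13,p22,p23,p33};
    u = {2,3,5,7,11,13};
    I = ideal apply(6, i -> P#i*(l1 + diff(P#i,g)*l2) - u#i)
        + ideal(g, sum P - 1);
    dim I, degree I  -- expect (0, 6)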
By Algorithm 2, we have computed D_X^J, which has 1307 terms and total degree 12.¹ By a similar computation, we get D_X^∞, whose last factor is exactly g(u_11, . . . , u_33) and whose other factors are all positive when each u_ij is positive.

For the data-discriminant D_X computed above, we have also computed at least one rational sample point in each open connected component of the complement of D_X = 0, by RAGlib [22, 17, 11].² With these sample points we can solve the real root classification problem on the open cells. By testing all 236 sample points, we see that if g(u_11, . . . , u_33) ≠ 0, then

– if D_X^J(u_11, . . . , u_33) > 0, then the system has 6 distinct real solutions, and there can be 6 positive solutions or 2 positive solutions;

– if D_X^J(u_11, . . . , u_33) < 0, then the system has 2 distinct real solutions, both positive.

With 2 of these sample points, we see that the sign of D_X is not enough to classify the positive solutions: for one of the sample points the system has 6 distinct positive solutions, while for another the system also has 6 real solutions but only 2 of them are positive.³

¹ See D_X^J and D_X^∞ on the second author's website: sites.google.com/site/rootclassificaiton/publications/DD
² The sample points were first successfully computed by one of the anonymous referees. This proves the real version of the RRC conjecture in the previous version of this manuscript.
³ This disproves the positive version of the RRC conjecture in the previous version of this manuscript.

Acknowledgements

We thank Professors Bernd Sturmfels, Hoon Hong, Jonathan Hauenstein and Frank Sottile for their valuable advice. We also thank Professors Mohab Safey El Din and Jean-Charles Faugère for their software advice on
RAGlib and
FGb, respectively. We especially thank the anonymous referees for their insightful suggestions, which greatly improved the paper.
References

[1] M.-L. G. Buot, S. Hoşten, and D. Richards. Counting and locating the solutions of polynomial systems of maximum likelihood equations, II: The Behrens–Fisher problem. Statistica Sinica, 17:1343–1354, 2007.

[2] F. Catanese, S. Hoşten, A. Khetan, and B. Sturmfels. The maximum likelihood degree. Amer. J. Math., 128(3):671–697, 2006.

[3] C. Chen, J. H. Davenport, J. P. May, M. M. Maza, B. Xia, and R. Xiao. Triangular decomposition of semi-algebraic systems. In Proceedings of ISSAC '10, pages 187–194. ACM New York, 2010.

[4] G. E. Collins and H. Hong. Partial Cylindrical Algebraic Decomposition for Quantifier Elimination. Springer, 1998.

[5] M. Coste and M. Shiota. Nash triviality in families of Nash manifolds. Inventiones Mathematicae, 108(1):349–368, 1992.

[6] D. A. Cox, J. Little, and D. O'Shea. Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra. Springer, 2007.

[7] M. Drton, B. Sturmfels, and S. Sullivant. Lectures on Algebraic Statistics. Springer, 2009.

[8] J.-C. Faugère, M. Safey El Din, and P.-J. Spaenlehauer. Gröbner bases and critical points: the unmixed case. In Proceedings of ISSAC '12, pages 162–169. ACM New York, 2012.

[9] M. Giusti, G. Lecerf, and B. Salvy. A Gröbner free alternative for polynomial system solving. Journal of Complexity, 17:154–211, 2001.

[10] D. R. Grayson and M. E. Stillman. Macaulay2, a software system for research in algebraic geometry. Available at http://www.math.uiuc.edu/Macaulay2/.

[11] A. Greuet and M. Safey El Din. Probabilistic algorithm for the global optimization of a polynomial over a real algebraic set. SIAM Journal on Optimization, 24(3):1313–1343, 2014.

[12] E. Gross, M. Drton, and S. Petrović. Maximum likelihood degree of variance component models. Electronic Journal of Statistics, 6:993–1016, 2012.

[13] E. Gross and J. I. Rodriguez. Maximum likelihood geometry in the presence of data zeros. In Proceedings of ISSAC '14, pages 232–239. ACM New York, 2014.

[14] J. Hauenstein, J. I. Rodriguez, and B. Sturmfels. Maximum likelihood for matrices with rank constraints. Journal of Algebraic Statistics, 5:18–38, 2014.

[15] S. Hoşten, A. Khetan, and B. Sturmfels. Solving the likelihood equations. Foundations of Computational Mathematics, 5(4):389–407, 2005.

[16] S. Hoşten and S. Sullivant. The algebraic complexity of maximum likelihood estimation for bivariate missing data. In Algebraic and Geometric Methods in Statistics, pages 123–133. Cambridge University Press, 2009.

[17] H. Hong and M. Safey El Din. Variant quantifier elimination. Journal of Symbolic Computation, 47(7):883–901, 2012.

[18] J. Huh and B. Sturmfels. Likelihood geometry, pages 63–117. Springer International Publishing, 2014.

[19] Z. Jelonek. Testing sets for properness of polynomial mappings. Mathematische Annalen, 315:1–35, 1999.

[20] D. Lazard and F. Rouillier. Solving parametric polynomial systems. Journal of Symbolic Computation, 42(6):636–667, 2007.

[21] J. I. Rodriguez. Maximum likelihood for dual varieties. In Proceedings of the 2014 Symposium on Symbolic-Numeric Computation, SNC '14, pages 43–49, New York, NY, USA, 2014. ACM.

[22] M. Safey El Din and É. Schost. Polar varieties and computation of one point in each connected component of a smooth real algebraic set. In Proceedings of ISSAC '03, pages 224–231. ACM Press, 2003.

[23] M. Safey El Din and É. Schost. Properness defects of projections and computation of at least one point in each connected component of a real algebraic set. Discrete and Computational Geometry, 32(3):417–430, 2004.

[24] É. Schost. Computing parametric geometric resolutions. Applicable Algebra in Engineering, Communication and Computing, 13(5):349–393, 2003.

[25] C. Uhler. Geometry of maximum likelihood estimation in Gaussian graphical models. Annals of Statistics, 40(1):238–261, 2012.

[26] L. Yang, X. Hou, and B. Xia. A complete algorithm for automated discovering of a class of inequality-type theorems. Science in China Series F: Information Sciences, 44(1):33–49, 2001.