Learning Negative Mixture Models by Tensor Decompositions
Guillaume Rabusseau    GUILLAUME.RABUSSEAU@LIF.UNIV-MRS.FR
François Denis    FRANCOIS.DENIS@LIF.UNIV-MRS.FR
Aix Marseille Université, CNRS, LIF, 13288 Marseille Cedex 9, FRANCE
Abstract
This work considers the problem of estimating the parameters of negative mixture models, i.e. mixture models that possibly involve negative weights. The contributions of this paper are as follows. (i) We show that every rational probability distribution on strings, a representation which occurs naturally in spectral learning, can be computed by a negative mixture of at most two probabilistic automata (or HMMs). (ii) We propose a method to estimate the parameters of negative mixture models having a specific tensor structure in their low-order observable moments. Building upon a recent paper on tensor decompositions for learning latent variable models, we extend this work to the broader setting of tensors having a symmetric decomposition with positive and negative weights. We introduce a generalization of the tensor power method for complex-valued tensors, and establish theoretical convergence guarantees. (iii) We show how our approach applies to negative Gaussian mixture models, for which we provide some experiments.
Keywords:
Spectral learning, Tensor decomposition, Mixture models, Rational series.
1. Introduction
Mixture models, such as Gaussian mixture models, are widely used in statistics and machine learning Mclachlan and Peel (2000). Given a parametric family of probability distributions $\mathcal{D}$, a mixture is defined by the number of components $k \geq 1$, probabilities $p_1, \dots, p_k \in [0,1]$ satisfying $p_1 + \dots + p_k = 1$, and distributions $F_1, \dots, F_k$ from $\mathcal{D}$. Given a sample drawn from a target mixture model, the parameters are usually fit by using the EM algorithm. Let $f_1, \dots, f_k$ be the PDFs associated with $F_1, \dots, F_k$. It may happen that the function $p_1 f_1 + \dots + p_k f_k$ remains positive even if some of the weights become negative, and thus still defines a probability density. We call such distributions negative mixtures. Only a few papers investigate negative mixtures Zhang and Zhang (2005); Müller et al. (2012); Jiang et al. (1999); Jevremovic (1991). It can easily be seen that every negative mixture can be written as a negative mixture of 2 positive ones. We show that negative mixtures naturally occur in spectral learning.

Let $\Sigma$ be a finite alphabet and let $\Sigma^*$ denote the set of strings built over $\Sigma$. A probability distribution $p$ defined on $\Sigma^*$ is said to be rational if it admits a linear representation, i.e. if there exist an integer $n \geq 1$, vectors $\iota, \tau \in \mathbb{R}^n$ and matrices $M_x \in \mathbb{R}^{n \times n}$ associated with each letter $x \in \Sigma$ such that $p(u_1 \dots u_l) = \iota^\top M_{u_1} \dots M_{u_l} \tau$ Denis and Esposito (2008). It can easily be shown that any probability distribution defined by a hidden Markov model (HMM), or equivalently by a probabilistic automaton, is rational. However, there exist rational probability distributions that cannot be computed by an HMM. The spectral learning algorithms used to infer probability distributions from a sample of strings generally output rational probability distributions. Positive and negative mixtures of rational distributions are rational, and positive mixtures of distributions computed by HMMs can be computed by HMMs. In this paper, we show that every rational distribution $p$ is a negative mixture $(1+w)\,p_{H_1} - w\,p_{H_2}$ of two distributions computed by HMMs. So negative mixtures occur naturally; how can the parameters of the target model be fit?

In a recent paper, it has been shown that the parameters of a number of latent variable models, including Gaussian mixture models and HMMs, can easily be estimated from a tensor decomposition of low-order moments of the data Anandkumar et al. (2012). Typically, if $x$ is drawn according to the (positive) mixture $p_1 \mathcal{N}(\mu_1, \sigma_1^2) + \dots + p_k \mathcal{N}(\mu_k, \sigma_k^2)$ of spherical Gaussians whose centers $\mu_i$ are linearly independent, the tensors $M_2 = \sum_{i=1}^k p_i\, \mu_i \otimes \mu_i$ and $M_3 = \sum_{i=1}^k p_i\, \mu_i \otimes \mu_i \otimes \mu_i$ can be expressed as functions of the moments $E[x \otimes x]$ and $E[x \otimes x \otimes x]$ Hsu and Kakade (2013). Using the fact that $M_2$ is positive semidefinite, $M_3$ can be reduced to a tensor $\widetilde{M}_3$ admitting an orthonormal decomposition $\widetilde{M}_3 = \sum_{i=1}^k \tilde{p}_i\, \tilde{\mu}_i \otimes \tilde{\mu}_i \otimes \tilde{\mu}_i$, where $\tilde{\mu}_i^\top \tilde{\mu}_j = \delta_{ij}$, and from which the original parameters $p_i$ and $\mu_i$ can be recovered.
Lastly, it is shown that a decomposition of an orthogonally decomposable tensor can quickly and robustly be approximated by means of a tensor power method. These results induce a learning scheme, which appears as a generalization of the spectral learning approach: from a sample $S$, compute estimates of $M_2$ and $M_3$, compute $\widetilde{M}_3$, and use the tensor power method to compute an orthogonal decomposition of $\widetilde{M}_3$ from which the parameters of the target can be estimated.

Each step of the previous scheme strongly uses the facts that the weights $p_i$ are positive and that the $\mu_i$ are linearly independent. We extend it to the case where the weights $p_i$ may be negative. The extension is not straightforward since it needs to use complex square roots of negative real numbers and to introduce non-hermitian quadratic forms. Given the tensors $M_2 = \sum_{i=1}^k p_i\, \mu_i \otimes \mu_i$ and $M_3 = \sum_{i=1}^k p_i\, \mu_i \otimes \mu_i \otimes \mu_i$ where the vectors $\mu_i \in \mathbb{R}^n$ are still linearly independent but where the weights $p_i$ may be negative, we first show how $M_3$ can be reduced to a complex-valued tensor admitting a pseudo-orthonormal decomposition, i.e. of the form $\sum_{i=1}^k \tilde{p}_i\, \tilde{\mu}_i \otimes \tilde{\mu}_i \otimes \tilde{\mu}_i$, where the vectors $\tilde{\mu}_i \in \mathbb{C}^k$ satisfy $\tilde{\mu}_i^\top \tilde{\mu}_j = \delta_{ij}$ (for any vector $\mu \in \mathbb{C}^k$, $\mu^\top \mu \in \mathbb{C}$ since $\mu^\top$ is not the conjugate transpose of $\mu$) and where the weights $\tilde{p}_i$ are non-zero complex numbers. Then, we show how the tensor power method can be adapted to the complex case, with equivalent convergence guarantees. We deduce from these results a learning scheme for negative mixtures. To illustrate this analysis, we test our decomposition algorithm on negative mixtures of spherical Gaussians and show how estimates of a negative mixture target can be inferred from data.

The paper is organized as follows: preliminaries on rational probability distributions and tensor decomposition learning methods are given in Section 2; negative mixtures are introduced in Section 3 and two introductory examples are developed; the adaptation of the tensor decomposition learning scheme to negative mixtures and the main results of the paper are given in Section 4; an application to negative mixtures of spherical Gaussians and some experiments are provided in Sections 5 and 6; a conclusion ends the paper.
2. Preliminaries
Let $\Sigma$ be a finite alphabet and $\Sigma^*$ denote the set of all finite strings built over $\Sigma$. A series is a mapping $r : \Sigma^* \to \mathbb{R}$. A non-negative series $r$ is convergent if the sum $\sum_{w \in \Sigma^*} r(w)$ is bounded; its limit is denoted by $r(\Sigma^*)$. A probability distribution over $\Sigma^*$ is a non-negative series that converges to 1. A series $r$ over $\Sigma$ is rational if there exist an integer $n \geq 1$, two vectors $\iota, \tau \in \mathbb{R}^n$ and a matrix $M_x \in \mathbb{R}^{n \times n}$ for each $x \in \Sigma$ such that for all $u = u_1 \dots u_l \in \Sigma^*$, $r(u) = \iota^\top M_{u_1} \dots M_{u_l} \tau$ Berstel and Reutenauer (1988). The triplet $\langle \iota, (M_x)_{x \in \Sigma}, \tau \rangle$ is called an $n$-dimensional linear representation of $r$. An $n$-state probabilistic automaton (PA) can be defined as an $n$-dimensional linear representation $\langle \iota, (M_x)_{x \in \Sigma}, \tau \rangle$ whose coefficients are all non-negative and satisfy the following syntactical conditions: $\iota^\top \mathbf{1} = 1$, $I - M_\Sigma$ is invertible and $(I - M_\Sigma)^{-1} \tau = \mathbf{1}$, where $\mathbf{1} = (1, \dots, 1)^\top \in \mathbb{R}^n$ and $M_\Sigma = \sum_{x \in \Sigma} M_x$. Hidden Markov Models (HMMs) and PAs define the same probability distributions Dupont et al. (2005). There exist rational probability distributions that cannot be computed by a PA or an HMM (see Appendix A.1).

See Kolda and Bader (2009) for references on tensor decomposition. Let us denote by $\bigotimes^p K^n$ the $p$-th order tensor product of the vector space $K^n$, where $K = \mathbb{R}$ or $\mathbb{C}$. A tensor $T \in \bigotimes^p K^n$ can be described by a $p$-way array of scalars $t_{i_1, \dots, i_p} \in K$ for $i_1, \dots, i_p \in [n]$, where $[n]$ denotes the set of integers between $1$ and $n$. A tensor is symmetric if its multi-way array representation is invariant under permutation of the indices. Given $v^{(1)}, \dots, v^{(p)} \in K^n$, the tensor $v^{(1)} \otimes \dots \otimes v^{(p)} \in \bigotimes^p K^n$ is defined by the $p$-way array $(v^{(1)}_{i_1} v^{(2)}_{i_2} \dots v^{(p)}_{i_p})_{i_1, \dots, i_p \in [n]}$. For a vector $v \in K^n$, let $v^{\otimes p} = v \otimes \dots \otimes v$ denote the $p$-th tensor power of $v$. In particular, $v \otimes v$ can be identified with the matrix $vv^\top$. Let $x$ be an $\mathbb{R}^n$-valued random variable; its moment of order $m$ is defined as the tensor $E[x^{\otimes m}] \in \bigotimes^m \mathbb{R}^n$.

For any integers $m_1, \dots, m_p \geq 1$, every $p$-th order tensor $T \in \bigotimes^p K^n$ induces a multilinear map $T : K^{n \times m_1} \times \dots \times K^{n \times m_p} \to K^{m_1 \times \dots \times m_p}$ defined by $T(A^{(1)}, \dots, A^{(p)})_{i_1, \dots, i_p} = \sum_{j_1, \dots, j_p \in [n]} t_{j_1, \dots, j_p}\, a^{(1)}_{j_1 i_1} \dots a^{(p)}_{j_p i_p}$ where each $i_k \in [m_k]$ for $k \in [p]$. In particular, if
$$T = \sum_{i=1}^k \lambda_i\, v^{(1)}_i \otimes \dots \otimes v^{(p)}_i \quad \text{then} \quad T(A_1, \dots, A_p) = \sum_{i=1}^k \lambda_i\, (A_1^\top v^{(1)}_i) \otimes \dots \otimes (A_p^\top v^{(p)}_i).$$

The rank of a tensor $T \in \bigotimes^p K^n$ is the smallest integer $k$ such that $T$ can be written as $T = \sum_{i=1}^k \lambda_i\, v^{(1)}_i \otimes \dots \otimes v^{(p)}_i$ with $\lambda_i \in K$ and $v^{(1)}_i, \dots, v^{(p)}_i \in K^n$. The symmetric rank of a symmetric tensor $T$ is the smallest integer $k$ such that $T$ can be written as $T = \sum_{i=1}^k \lambda_i v_i^{\otimes p}$ with $\lambda_i \in K$ and $v_i \in K^n$. It has been shown that computing the rank of a tensor is NP-hard, and it is conjectured that computing the symmetric rank is also NP-hard Hillar and Lim (2013). However, if a real-valued third-order tensor $T$ has a symmetric orthonormal decomposition, i.e. $T = \sum_{i=1}^k \lambda_i v_i^{\otimes 3}$ with $\lambda_i \in \mathbb{R}$, $v_i \in \mathbb{R}^n$ and $v_i^\top v_j = \delta_{ij}$ for all $i, j \in [k]$, it has been shown in Anandkumar et al. (2012) that this decomposition can be recovered by several methods, both efficient and robust to noise, such as the tensor power method (see Section 4.2 below). Moreover, they show that any symmetric independent decomposition $T = \sum_{i=1}^k \lambda_i v_i^{\otimes 3}$ (where the $v_i$'s are linearly independent but not necessarily orthonormal) can be recovered if we have access to the second order tensor $M = \sum_{i=1}^k \lambda_i v_i v_i^\top$.
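As an illustration of the multilinear map notation, the contraction $T(A_1, A_2, A_3)$ of a third-order tensor reduces to a single einsum; a minimal NumPy sketch (ours, not from the paper) is:

```python
import numpy as np

def multilinear_map(T, A1, A2, A3):
    """Contract a third-order tensor T (n x n x n) against matrices A1, A2, A3
    (each n x m_k), giving an m_1 x m_2 x m_3 tensor:
    result[i1,i2,i3] = sum_{j1,j2,j3} T[j1,j2,j3] A1[j1,i1] A2[j2,i2] A3[j3,i3]."""
    return np.einsum('abc,ai,bj,ck->ijk', T, A1, A2, A3)

# Sanity check on a rank-1 symmetric tensor v ⊗ v ⊗ v:
# T(A, A, A) should equal (A^T v) ⊗ (A^T v) ⊗ (A^T v).
n, m = 4, 2
rng = np.random.default_rng(0)
v, A = rng.normal(size=n), rng.normal(size=(n, m))
T = np.einsum('a,b,c->abc', v, v, v)
w = A.T @ v
assert np.allclose(multilinear_map(T, A, A, A), np.einsum('a,b,c->abc', w, w, w))
```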
Theorem 1 (Anandkumar et al. (2012)) Let $v_1, \dots, v_k$ be linearly independent vectors of $\mathbb{R}^n$, let $\lambda_1, \dots, \lambda_k$ be positive scalars, $M_2 = \sum_{i=1}^k \lambda_i\, v_i \otimes v_i$ and $M_3 = \sum_{i=1}^k \lambda_i v_i^{\otimes 3}$, let $W \in \mathbb{R}^{n \times k}$ be a matrix such that $M_2(W, W) = I_k$, the $k \times k$ identity matrix, and let $\nu_i = \sqrt{\lambda_i}\, W^\top v_i$ for $i \in [k]$. Then, $M_3(W, W, W) = \sum_{i=1}^k \lambda_i^{-1/2} \nu_i^{\otimes 3}$ is an orthonormal decomposition from which the parameters $\lambda_i$ and $v_i$ can be computed.
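Concretely, $W$ can be taken as $UD^{-1/2}$ where $UDU^\top$ is the rank-$k$ eigendecomposition of $M_2$; the sketch below (our own illustration on synthetic positive weights) builds such a $W$ and checks the orthonormal decomposition of $M_3(W, W, W)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 3
V = rng.normal(size=(n, k))                      # columns v_i, linearly independent
lam = np.array([0.5, 0.3, 0.2])                  # positive weights
M2 = (V * lam) @ V.T
M3 = np.einsum('i,ai,bi,ci->abc', lam, V, V, V)

# Whitening: M2 = U D U^T (rank k), W = U D^{-1/2} satisfies W^T M2 W = I_k.
eigval, eigvec = np.linalg.eigh(M2)
U, D = eigvec[:, -k:], eigval[-k:]
W = U / np.sqrt(D)
assert np.allclose(W.T @ M2 @ W, np.eye(k))

# Whitened third moment: sum_i lambda_i^{-1/2} nu_i^{ox3} with orthonormal nu_i.
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
nu = np.sqrt(lam) * (W.T @ V)                    # columns nu_i = sqrt(lambda_i) W^T v_i
assert np.allclose(nu.T @ nu, np.eye(k), atol=1e-8)
assert np.allclose(T, np.einsum('i,ai,bi,ci->abc', lam ** -0.5, nu, nu, nu))
```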
The spherical Gaussian mixture model is specified as follows: let $k \geq 1$ be the number of components, and for $i \in [k]$, let $p_i > 0$ be the probability of choosing the component $\mathcal{N}(\mu_i, \sigma_i^2 I)$ where $\mu_i \in \mathbb{R}^n$, $\sigma_i^2 > 0$ and $I \in \mathbb{R}^{n \times n}$ is the identity matrix. Assuming that the component mean vectors $\mu_i$ are linearly independent, the following result is proved in Hsu and Kakade (2013).
Theorem 2 The average variance $\bar{\sigma}^2 = \sum_{i=1}^k p_i \sigma_i^2$ is the smallest eigenvalue of the covariance matrix $E[(x - E[x])(x - E[x])^\top]$. Let $v$ be any unit-norm eigenvector corresponding to $\bar{\sigma}^2$ and let $m_1 = E[x\,(v^\top(x - E[x]))^2]$, $M_2 = E[x \otimes x] - \bar{\sigma}^2 I$, and
$$M_3 = E[x \otimes x \otimes x] - \sum_{i=1}^n \big[m_1 \otimes e_i \otimes e_i + e_i \otimes m_1 \otimes e_i + e_i \otimes e_i \otimes m_1\big]$$
where $e_1, \dots, e_n$ is the coordinate basis of $\mathbb{R}^n$. Then,
$$m_1 = \sum_{i=1}^k p_i \sigma_i^2 \mu_i, \quad M_2 = \sum_{i=1}^k p_i\, \mu_i \otimes \mu_i, \quad \text{and} \quad M_3 = \sum_{i=1}^k p_i\, \mu_i \otimes \mu_i \otimes \mu_i.$$

The previous results induce a learning scheme: (i) estimate $m_1$, $M_2$ and $M_3$ from the learning data; (ii) compute an orthonormal decomposition as in Theorem 1; (iii) use the tensor power method to compute the mean vectors $\mu_i$ and the probabilities $p_i$; and (iv) use $m_1$ to recover the variance parameters $\sigma_i^2$.
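These formulas translate directly into plug-in estimators; the following sketch (our own, not from the paper) computes estimates of $m_1$, $M_2$ and $M_3$ from a sample matrix X of shape (N, n):

```python
import numpy as np

def spherical_gaussian_moments(X):
    """Plug-in estimates of m1, M2, M3 from a sample X (N x n), following the
    formulas of Theorem 2 (smallest covariance eigenvalue as the average variance)."""
    N, n = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True)
    eigval, eigvec = np.linalg.eigh(cov)
    sigma2_bar, v = eigval[0], eigvec[:, 0]        # smallest eigenvalue / eigenvector
    centered = X - mean
    m1 = (X * ((centered @ v) ** 2)[:, None]).mean(axis=0)
    M2 = (X.T @ X) / N - sigma2_bar * np.eye(n)
    M3 = np.einsum('ta,tb,tc->abc', X, X, X) / N
    E = np.eye(n)
    M3 -= (np.einsum('a,ib,ic->abc', m1, E, E)
           + np.einsum('ia,b,ic->abc', E, m1, E)
           + np.einsum('ia,ib,c->abc', E, E, m1))
    return m1, M2, M3
```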
3. Negative mixtures
Given a finite set of probability density functions $f_1, \dots, f_k$ and non-negative weights $w_1, \dots, w_k$ satisfying $w_1 + \dots + w_k = 1$, $w_1 f_1 + \dots + w_k f_k$ is a probability density function called a finite mixture. It may happen that $w_1 f_1 + \dots + w_k f_k$ defines a PDF even if some weights are negative. We call such a function a negative or a generalized mixture. For example, if $f$ and $g$ are two PDFs satisfying $g \leq cf$ for some $c > 1$, then $\alpha f - (\alpha - 1)g$ is a negative mixture for any $0 \leq \alpha - 1 \leq (c-1)^{-1}$. It can easily be shown, by grouping the positive and negative weights respectively, that any negative mixture can be written as a negative mixture of two positive mixtures:
$$\sum_{i=1}^k \alpha_i f_i - \sum_{j=1}^h \beta_j g_j = A\left(\sum_{i=1}^k \frac{\alpha_i}{A} f_i\right) - B\left(\sum_{j=1}^h \frac{\beta_j}{B} g_j\right)$$
where $\alpha_i, \beta_j > 0$, $A = \sum_{i=1}^k \alpha_i$, $B = \sum_{j=1}^h \beta_j$ and $A - B = 1$. If $f$, $g$ and $\alpha$ are known, and if we have access to a random generator $D_f$, then Algorithm 1 simulates the distribution $D_{\alpha f - (\alpha-1)g}$ by rejection sampling.
Algorithm 1
Simulating a negative mixture
  drawn ← false
  while not drawn do
    draw x according to D_f
    draw e uniformly in [0, 1]
    if e·α·f(x) ≥ (α − 1)·g(x) then
      drawn ← true
    end if
  end while
  return x

We show below that every rational probability distribution on strings can be generated by the generalized mixture of at most two probabilistic automata. The proof relies on the following lemmas.
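Before turning to the lemmas, here is a minimal runnable sketch of the rejection sampler of Algorithm 1 (our own; the densities, parameters and the sampler for $D_f$ are hypothetical placeholders):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def sample_negative_mixture(alpha, f_pdf, g_pdf, f_sampler, rng):
    """Rejection sampling from alpha*f - (alpha-1)*g (Algorithm 1), assuming this
    difference is a valid density: propose from f, accept x with probability
    1 - (alpha-1)*g(x) / (alpha*f(x))."""
    while True:
        x = f_sampler(rng)
        e = rng.uniform()
        if e * alpha * f_pdf(x) >= (alpha - 1) * g_pdf(x):
            return x

# Hypothetical 1-d example with sigma_f > sigma_g and alpha within the bound of Proposition 6.
rng = np.random.default_rng(0)
f_pdf = lambda x: gauss_pdf(x, 0.0, 2.0)
g_pdf = lambda x: gauss_pdf(x, 0.5, 1.0)
xs = [sample_negative_mixture(1.2, f_pdf, g_pdf, lambda r: r.normal(0.0, 2.0), rng)
      for _ in range(1000)]
```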
Lemma 3
Any rational series is the difference of two rational series with non negative coefficients.
Proof
For any real number $x$, let $x^+ = \max\{x, 0\}$ and $x^- = \max\{-x, 0\}$, so that $x = x^+ - x^-$. These operators are extended to vectors and matrices by applying them to all their coefficients. Let $\langle \iota, (M_x)_{x \in \Sigma}, \tau \rangle$ be an $n$-dimensional representation of a rational series $r$. Let us define
$$\tilde{\iota}_1 = \begin{pmatrix} \iota^+ \\ \iota^- \end{pmatrix}, \quad \tilde{\iota}_2 = \begin{pmatrix} \iota^- \\ \iota^+ \end{pmatrix}, \quad \tilde{\tau} = \begin{pmatrix} \tau^+ \\ \tau^- \end{pmatrix} \quad \text{and} \quad \widetilde{M}_x = \begin{pmatrix} M_x^+ & M_x^- \\ M_x^- & M_x^+ \end{pmatrix} \ \text{for each } x \in \Sigma.$$
Let $r^+$ (resp. $r^-$) be the rational series defined by the linear representation $\langle \tilde{\iota}_1, (\widetilde{M}_x)_{x \in \Sigma}, \tilde{\tau} \rangle$ (resp. $\langle \tilde{\iota}_2, (\widetilde{M}_x)_{x \in \Sigma}, \tilde{\tau} \rangle$). Then $r = r^+ - r^-$. Indeed, it can easily be checked that for any vectors $u_1, u_2, v_1, v_2 \in \mathbb{R}^n$,
$$\widetilde{M}_x \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} \ \Rightarrow \ M_x(u_1 - u_2) = v_1 - v_2.$$
Therefore, for any $w = w_1 \dots w_l \in \Sigma^*$,
$$\widetilde{M}_{w_1} \dots \widetilde{M}_{w_l}\, \tilde{\tau} = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} \ \Rightarrow \ M_{w_1} \dots M_{w_l}\, \tau = v_1 - v_2.$$
Since $(\tilde{\iota}_1^\top - \tilde{\iota}_2^\top)\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \iota^\top(v_1 - v_2)$, it can easily be checked that $r^+(w) - r^-(w) = (\tilde{\iota}_1^\top - \tilde{\iota}_2^\top)\,\widetilde{M}_{w_1} \dots \widetilde{M}_{w_l}\, \tilde{\tau} = r(w)$.

However, even if the non-negative series $r$ is convergent, the series $r^+$ and $r^-$ obtained from the previous construction can be divergent (see an example in Appendix A.1). It has been shown in Bailly and Denis (2011) that if a rational series $r$ is absolutely convergent, then it can always be computed by a linear representation $\langle \iota, (M_x)_{x \in \Sigma}, \tau \rangle$ such that $\langle |\iota|, (|M_x|)_{x \in \Sigma}, |\tau| \rangle$ defines a positive convergent series $s$. In that case, $r^+$ and $r^-$ are bounded by $s$ and are convergent. Let $s^+ = r^+(\Sigma^*)$, $s^- = r^-(\Sigma^*)$ and let $p^+ = r^+/s^+$ and $p^- = r^-/s^-$: $p^+$ and $p^-$ are rational probability distributions and, if $r$ is itself a probability distribution, we have $s^+ - s^- = 1$ and $r$ is equal to the generalized mixture $s^+ p^+ - s^- p^-$. It remains to prove that $p^+$ and $p^-$ can be computed by a probabilistic automaton.
Lemma 4 Let $\langle \iota, (M_x)_{x \in \Sigma}, \tau \rangle$ be an $n$-dimensional minimal non-negative linear representation of a probability distribution $p$. Let $\lambda = (I - M_\Sigma)^{-1}\tau$ and $D = \mathrm{diag}(\lambda)$. Then $\langle D\iota, (D^{-1}M_xD)_{x \in \Sigma}, D^{-1}\tau \rangle$ is a probabilistic automaton that recognizes $p$.

Proof
The minimality of the representation implies that $D$ is invertible. It is clear that the new representation recognizes $p$ since $(D\iota)^\top D^{-1}M_{x_1}D \dots D^{-1}M_{x_l}D\, D^{-1}\tau = \iota^\top M_{x_1} \dots M_{x_l}\tau$. We have $(D\iota)^\top \mathbf{1} = \iota^\top\lambda = 1$. Moreover, $I - D^{-1}M_\Sigma D = D^{-1}(I - M_\Sigma)D$ is invertible and $(I - D^{-1}M_\Sigma D)^{-1}D^{-1}\tau = D^{-1}\lambda = \mathbf{1}$.

Combining the previous lemmas, we obtain the following theorem.
Theorem 5 Every rational probability distribution on strings can be generated by the generalized mixture of at most two probabilistic automata.
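Since the constructions behind Theorem 5 are purely matrix manipulations, they are easy to sketch; the functions below (our own illustration; the Lemma 4 step assumes the non-negative representation is minimal, so that all entries of $\lambda$ are positive) implement the two lemmas:

```python
import numpy as np

def split_representation(iota, Ms, tau):
    """Lemma 3: from <iota, (M_x), tau>, build non-negative representations of
    r+ and r- via the block construction with positive/negative parts."""
    pos, neg = (lambda A: np.maximum(A, 0)), (lambda A: np.maximum(-A, 0))
    iota1 = np.concatenate([pos(iota), neg(iota)])
    iota2 = np.concatenate([neg(iota), pos(iota)])
    tau_t = np.concatenate([pos(tau), neg(tau)])
    Ms_t = {x: np.block([[pos(M), neg(M)], [neg(M), pos(M)]]) for x, M in Ms.items()}
    return (iota1, Ms_t, tau_t), (iota2, Ms_t, tau_t)

def normalize_to_pa(iota, Ms, tau):
    """Lemma 4: conjugate a minimal non-negative representation by
    D = diag((I - M_Sigma)^{-1} tau) to obtain a probabilistic automaton."""
    n = len(tau)
    M_sigma = sum(Ms.values())
    lam = np.linalg.solve(np.eye(n) - M_sigma, tau)   # assumed entrywise positive
    D, D_inv = np.diag(lam), np.diag(1.0 / lam)
    return D @ iota, {x: D_inv @ M @ D for x, M in Ms.items()}, D_inv @ tau
```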
Let $f$ and $g$ be the PDFs of two $k$-dimensional Gaussian distributions $\mathcal{N}(\mu_f, \Sigma_f)$ and $\mathcal{N}(\mu_g, \Sigma_g)$. For any real number $\alpha > 1$,
$$\alpha f(x) - (\alpha - 1)g(x) \geq 0 \qquad (1)$$
if and only if
$$\exp\left\{-\tfrac{1}{2}(x - \mu_f)^\top\Sigma_f^{-1}(x - \mu_f) + \tfrac{1}{2}(x - \mu_g)^\top\Sigma_g^{-1}(x - \mu_g)\right\} \geq \sqrt{\frac{|\Sigma_f|}{|\Sigma_g|}}\,\big[1 - 1/\alpha\big].$$
There exists $\alpha > 1$ such that (1) holds for any $x \in \mathbb{R}^k$ if and only if
$$-(x - \mu_f)^\top\Sigma_f^{-1}(x - \mu_f) + (x - \mu_g)^\top\Sigma_g^{-1}(x - \mu_g) \qquad (2)$$
has a finite lower bound, which holds if and only if $(\Sigma_g^{-1} - \Sigma_f^{-1})$ is positive semi-definite. In that case, the minimum $m$ of (2) is attained for $\mu = \Sigma(\Sigma_g^{-1}\mu_g - \Sigma_f^{-1}\mu_f)$ where $\Sigma = (\Sigma_g^{-1} - \Sigma_f^{-1})^{-1}$, and there exists a constant $\lambda$ such that $\lambda g/f$ defines a Gaussian distribution of parameters $\mu$ and $\Sigma$. It can be checked that $m = -(\mu_f - \mu_g)^\top\Sigma_f^{-1}\Sigma\,\Sigma_g^{-1}(\mu_f - \mu_g)$.

Note that if the two distributions are distinct, then $\left(\frac{|\Sigma_g|}{|\Sigma_f|}\right)^{1/2}e^{m/2} < 1$. Otherwise, any positive $\alpha$ would be suitable and, by dividing (1) by $\alpha$, the density of the first distribution would be everywhere larger than the density of the second, which cannot happen. Hence every
$$\alpha \in \left(1,\ \left(1 - \sqrt{\frac{|\Sigma_g|}{|\Sigma_f|}}\,e^{m/2}\right)^{-1}\right]$$
defines a valid negative mixture of the two distributions. If the Gaussians are spherical, i.e. $\Sigma_f = \sigma_f^2 I$ and $\Sigma_g = \sigma_g^2 I$, we obtain the following result.

Proposition 6 $\alpha f(x) - (\alpha - 1)g(x)$ defines a negative mixture iff $\sigma_f > \sigma_g$ and
$$1 < \alpha \leq \left(1 - \frac{\sigma_g^k}{\sigma_f^k}\exp\left\{-\frac{\|\mu_f - \mu_g\|^2}{2(\sigma_f^2 - \sigma_g^2)}\right\}\right)^{-1}.$$
Example Let $k = 2$, $\mu_f = (11.\ ,\ -.\ )^\top$, $\sigma_f = 8$, $\mu_g = (11.\ ,\ -.\ )^\top$, $\sigma_g = 4$: then $\alpha f(x) - (\alpha - 1)g(x)$ defines a negative mixture for any $1 < \alpha \leq 1.\ $. See Figure 1.
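The bound of Proposition 6 is easy to evaluate numerically; a small sketch (our own, with hypothetical parameter values) is:

```python
import numpy as np

def max_alpha_spherical(mu_f, mu_g, var_f, var_g, k):
    """Upper bound on alpha from Proposition 6 for spherical Gaussians
    N(mu_f, var_f * I) and N(mu_g, var_g * I) in dimension k (requires var_f > var_g)."""
    assert var_f > var_g
    dist2 = float(np.sum((np.asarray(mu_f) - np.asarray(mu_g)) ** 2))
    ratio = (np.sqrt(var_g) / np.sqrt(var_f)) ** k
    return 1.0 / (1.0 - ratio * np.exp(-dist2 / (2.0 * (var_f - var_g))))

# Hypothetical parameters loosely inspired by the running example (k = 2, variances 8 and 4).
print(max_alpha_spherical([11.0, -9.0], [11.0, -9.5], var_f=8.0, var_g=4.0, k=2))
```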
4. Negative Mixtures and the Power Method
We consider systems of the form
$$M_2 = \sum_{i=1}^k w_i\, \mu_i \otimes \mu_i \quad \text{and} \quad M_3 = \sum_{i=1}^k w_i\, \mu_i \otimes \mu_i \otimes \mu_i \qquad (3)$$
where the vectors $\mu_1, \dots, \mu_k \in \mathbb{R}^d$ are linearly independent and $w_1, \dots, w_k \in \mathbb{R}$ are non-zero. In this section, we show how the parameters $w_i$ and $\mu_i$ can be recovered from $M_2$ and $M_3$ using a power method for complex-valued tensors.

A set $\{\nu_1, \dots, \nu_k\} \subset \mathbb{C}^d$ is pseudo-orthonormal iff $\nu_i^\top\nu_j = \delta_{ij}$ for all $i, j \in [k]$. Note that for any $\nu = (\nu_1, \dots, \nu_d) \in \mathbb{C}^d$, $\nu^\top\nu = \nu_1^2 + \dots + \nu_d^2 \in \mathbb{C}$ and in particular $\nu^\top\nu \neq \|\nu\|^2 = |\nu_1|^2 + \dots + |\nu_d|^2$. It can easily be checked that a pseudo-orthonormal set is linearly independent. A tensor decomposition $T = \sum_{i=1}^k z_i\nu_i^{\otimes p}$ of a complex-valued tensor $T \in \bigotimes^p\mathbb{C}^n$ is pseudo-orthonormal if $\{\nu_1, \dots, \nu_k\}$ is a pseudo-orthonormal set.

As in Anandkumar et al. (2012), we build a whitening matrix $W$ from $M_2$, and we use $W$ to obtain a pseudo-orthonormal decomposition of the tensor $M_3$. Identifying $M_2$ with the symmetric rank-$k$ matrix $\sum_{i=1}^k w_i\mu_i\mu_i^\top$, let $UDU^\top$ be the eigendecomposition of $M_2$, where $D$ is the $k \times k$ diagonal matrix whose diagonal elements are the $k$ non-zero eigenvalues of $M_2$ and where $U$ is a $d \times k$ matrix satisfying $U^\top U = I_k$ and $UU^\top\mu_i = \mu_i$ for any $i \in [k]$. Let $W = UD^{-1/2} \in \mathbb{C}^{d \times k}$ and $\tilde{\mu}_i = w_i^{1/2}W^\top\mu_i \in \mathbb{C}^k$ for $i \in [k]$, where we consider complex square roots of the negative components of $D$ and of the $w_i$: $x^{1/2} = i|x|^{1/2}$ and $x^{-1/2} = (x^{1/2})^{-1} = -i|x|^{-1/2}$ if $x < 0$. We have
$$\sum_{i=1}^k \tilde{\mu}_i\tilde{\mu}_i^\top = W^\top\left(\sum_{i=1}^k w_i\mu_i\mu_i^\top\right)W = W^\top M_2 W = I_k,$$
hence $\tilde{\mu}_i^\top\tilde{\mu}_j = \delta_{ij}$ for all $i, j \in [k]$. Now let $\widetilde{M}_3 = M_3(W, W, W) = \sum_{i=1}^k w_i(W^\top\mu_i)^{\otimes 3} = \sum_{i=1}^k w_i^{-1/2}\tilde{\mu}_i^{\otimes 3}$, which is a pseudo-orthonormal decomposition.

The following theorem extends Lemma 5.1 of Anandkumar et al. (2012) to third-order complex-valued tensors having a pseudo-orthonormal decomposition $T = \sum_{i=1}^k z_i\nu_i^{\otimes 3}$. Note that the parameters of such a decomposition are not fully identifiable since $z\,\nu^{\otimes 3} = (-z)(-\nu)^{\otimes 3}$.
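The only change with respect to the positive case is that the square roots in $W = UD^{-1/2}$ may be complex; a minimal NumPy sketch of this pseudo-orthonormalization (our own illustration) is:

```python
import numpy as np

def complex_whiten(M2, M3, k):
    """Pseudo-orthonormalization of Section 4.1: W = U D^{-1/2} built from the
    k eigenvalues of M2 largest in magnitude (possibly negative, hence complex
    square roots), and the whitened tensor T = M3(W, W, W)."""
    eigval, eigvec = np.linalg.eigh(M2)
    idx = np.argsort(np.abs(eigval))[-k:]              # k non-zero eigenvalues
    D, U = eigval[idx].astype(complex), eigvec[:, idx].astype(complex)
    W = U / np.sqrt(D)                                 # complex sqrt for negative entries
    T = np.einsum('abc,ai,bj,ck->ijk', M3.astype(complex), W, W, W)
    return W, T

# Toy system of the form (3) with one negative weight.
rng = np.random.default_rng(2)
d, k = 4, 2
V = rng.normal(size=(d, k))
w = np.array([1.4, -0.4])
M2 = (V * w) @ V.T
M3 = np.einsum('i,ai,bi,ci->abc', w, V, V, V)
W, T = complex_whiten(M2, M3, k)
assert np.allclose(W.T @ M2 @ W, np.eye(k))            # pseudo-orthonormality
```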
Theorem 7 Let $T \in \bigotimes^3\mathbb{C}^n$ have a pseudo-orthonormal decomposition $T = \sum_{i=1}^k z_i\nu_i^{\otimes 3}$, and denote also by $T$ the mapping defined by $T(\theta) = T(I, \theta, \theta)$ for any $\theta \in \mathbb{C}^n$. Let $\theta_0 \in \mathbb{C}^n$ and suppose that $|z_1\cdot\nu_1^\top\theta_0| > |z_2\cdot\nu_2^\top\theta_0| \geq \dots \geq |z_k\cdot\nu_k^\top\theta_0| > 0$. For $t = 1, 2, \dots$, define
$$\theta_t = \frac{T(\theta_{t-1})}{\big[T(\theta_{t-1})^\top T(\theta_{t-1})\big]^{1/2}} \quad\text{and}\quad \lambda_t = T(\theta_t, \theta_t, \theta_t), \qquad (4)$$
where we assume that $\theta_0$ is such that $T(\theta_t)^\top T(\theta_t) \neq 0$ for all $t$. Then, $\theta_t \to \pm\nu_1$ and $\lambda_t \to \pm z_1$. More precisely, let
$$M = \max\left\{1,\ \frac{|z_1|}{|z_i|},\ \frac{|z_1|\,\|\nu_i\|}{|z_i|} : i \in [k]\right\} \quad\text{and}\quad \varepsilon_t = kM\left|\frac{z_2\cdot\nu_2^\top\theta_0}{z_1\cdot\nu_1^\top\theta_0}\right|^{2^t}.$$
Then for all $t \geq 1$ such that $\varepsilon_t < 1/2$, we have $|e_tf_t\lambda_t - z_1| \leq 8|z_1|\varepsilon_t$ and $\|e_tf_t\theta_t - \nu_1\| \leq \varepsilon_t\big(\|\nu_1\| + \sqrt{2}\big)$, where $(e_t)_t$ and $(f_t)_t$ are two sequences defined in the proof and taking their values in $\{-1, 1\}$.

Proof
Let us first define the square root of a complex number $z = re^{i\theta}$, where $-\pi < \theta \leq \pi$ and $r \geq 0$, by $z^{1/2} = r^{1/2}e^{i\theta/2}$, and note that $z/(z^2)^{1/2} = z^{-1}(z^2)^{1/2} = 1$ if $-\pi/2 < \theta \leq \pi/2$ and $-1$ otherwise.

Now, let $c_i = \nu_i^\top\theta_0$ for $i \in [k]$, $\tilde{\theta}_0 = \theta_0$, $\tilde{\theta}_t = T(\tilde{\theta}_{t-1})$, and $\rho_t = (\tilde{\theta}_t^\top\tilde{\theta}_t)^{1/2}$ for all $t \geq 1$. Check by induction on $t$ that, for all $t \geq 1$,
$$\tilde{\theta}_t = \sum_{i=1}^k z_i^{2^t-1}c_i^{2^t}\nu_i. \qquad (5)$$
Let $e_t = \rho_t\rho_{t-1}^{-2}\big/\big(\rho_{t-1}^{-4}\rho_t^2\big)^{1/2}$; note that $e_t = \pm 1$, and check by induction that, for all $t \geq 1$,
$$\theta_t = e_t\,\frac{\tilde{\theta}_t}{\rho_t}. \qquad (6)$$
Let $\alpha_t = \rho_t^{-1}z_1^{2^t-1}c_1^{2^t}$. Using Eq. 5 and Eq. 6, we obtain
$$e_t\lambda_t = \rho_t^{-3}\sum_{i=1}^k z_i\big(z_i^{2^t-1}c_i^{2^t}\big)^3 = \alpha_t^3\, z_1\sum_{i=1}^k \frac{z_1^2}{z_i^2}\left(\frac{z_ic_i}{z_1c_1}\right)^{3\cdot 2^t} = \alpha_t^3\, z_1\left[1 + \sum_{i=2}^k \frac{z_1^2}{z_i^2}\left(\frac{z_ic_i}{z_1c_1}\right)^{3\cdot 2^t}\right],$$
and
$$e_t\theta_t = \rho_t^{-1}\sum_{i=1}^k z_i^{2^t-1}c_i^{2^t}\nu_i = \alpha_t\sum_{i=1}^k \frac{z_1}{z_i}\left(\frac{z_ic_i}{z_1c_1}\right)^{2^t}\nu_i = \alpha_t\left[\nu_1 + \sum_{i=2}^k \frac{z_1}{z_i}\left(\frac{z_ic_i}{z_1c_1}\right)^{2^t}\nu_i\right].$$
It can easily be checked that
$$\left|\sum_{i=2}^k \frac{z_1^2}{z_i^2}\left(\frac{z_ic_i}{z_1c_1}\right)^{3\cdot 2^t}\right| \leq \varepsilon_t \quad\text{and}\quad \left\|\sum_{i=2}^k \frac{z_1}{z_i}\left(\frac{z_ic_i}{z_1c_1}\right)^{2^t}\nu_i\right\| \leq \varepsilon_t.$$
Moreover, it can be checked that
$$\alpha_t^{-1} = \frac{(\tilde{\theta}_t^\top\tilde{\theta}_t)^{1/2}}{z_1^{2^t-1}c_1^{2^t}} = f_t\left(\frac{\tilde{\theta}_t^\top\tilde{\theta}_t}{(z_1^{2^t-1}c_1^{2^t})^2}\right)^{1/2} = f_t\left[1 + \sum_{i=2}^k \frac{z_1^2}{z_i^2}\left(\frac{z_ic_i}{z_1c_1}\right)^{2^{t+1}}\right]^{1/2}$$
where $f_t = (z_1^{2^t-1}c_1^{2^t})^{-1}\big((z_1^{2^t-1}c_1^{2^t})^2\big)^{1/2} = \pm 1$. Using the hypothesis $\varepsilon_t < 1/2$ and making use of Lemma 11 in Appendix A.2, it follows that
$$|\alpha_t| \leq \sqrt{2}, \quad |f_t\alpha_t - 1| \leq \varepsilon_t \quad\text{and}\quad |f_t\alpha_t^3 - 1| \leq 4\varepsilon_t.$$
Finally, combining these inequalities, we obtain $|e_tf_t\lambda_t - z_1| \leq 8|z_1|\varepsilon_t$ and $\|e_tf_t\theta_t - \nu_1\| \leq \varepsilon_t\big(\|\nu_1\| + \sqrt{2}\big)$.

This theorem directly yields an algorithm to recover the decomposition of a decomposable complex-valued tensor using the standard deflation technique. It can be shown that if $\theta_0$ is chosen at random in $\mathbb{C}^n$, the assumptions on $\theta_t$ in the previous theorem are satisfied with probability one. We prove it here for the assumption $T(\theta_t)^\top T(\theta_t) \neq 0$; similar arguments can be used for the assumption $|z_1\cdot\nu_1^\top\theta_0| > |z_2\cdot\nu_2^\top\theta_0| \geq \dots \geq |z_k\cdot\nu_k^\top\theta_0| > 0$.
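The iteration (4) is only a few lines of NumPy; the sketch below (our own) runs it on a pseudo-orthonormal complex tensor and returns the dominant pair, up to the sign indeterminacy discussed above:

```python
import numpy as np

def complex_power_iteration(T, theta0, n_iter=30):
    """Complex tensor power method of Eq. (4): theta_t = T(I, theta, theta)
    normalized by (theta^T theta)^{1/2} (plain transpose, not conjugate),
    and lambda_t = T(theta, theta, theta)."""
    theta = theta0.astype(complex)
    for _ in range(n_iter):
        theta = np.einsum('abc,b,c->a', T, theta, theta)
        theta = theta / np.sqrt(theta @ theta)       # non-hermitian normalization
        lam = np.einsum('abc,a,b,c->', T, theta, theta, theta)
    return lam, theta
```

Deflation (subtracting $\lambda\,\theta^{\otimes 3}$ and repeating) then recovers the remaining pairs, as in Algorithm 2 below.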
Lemma 8 Using the definitions and under the hypotheses of Theorem 7, the set $S = \{\theta_0 \in \mathbb{C}^n \mid \exists t \geq 0,\ T(\theta_t)^\top T(\theta_t) = 0\}$ has Lebesgue measure zero in $\mathbb{C}^n$.

Proof
We use the notations of the previous proof. First note that $T(\theta_{t-1})^\top T(\theta_{t-1}) = (\tilde{\theta}_{t-1}^\top\tilde{\theta}_{t-1})^{-2}\,\tilde{\theta}_t^\top\tilde{\theta}_t$ and $\tilde{\theta}_t^\top\tilde{\theta}_t = \sum_{i=1}^k\big(z_i^{2^t-1}(\nu_i^\top\theta_0)^{2^t}\big)^2$. For a fixed $t$, the set
$$S_t = \left\{\theta_0 \in \mathbb{C}^n : P_t(\theta_0) = \sum_{i=1}^k\big(z_i^{2^t-1}(\nu_i^\top\theta_0)^{2^t}\big)^2 = 0\right\}$$
is the set of zeros of a multivariate polynomial. If $P_t$ is non-trivial (i.e. different from the zero polynomial), it is a proper algebraic subvariety of $\mathbb{C}^n$ of dimension less than $n$, thus of Lebesgue measure 0. Since $S = \cup_{t=0}^\infty S_t$, it is sufficient to show that $P_t$ is non-trivial for any index $t$. Without loss of generality, we assume that there exists at least one $i \in [k]$ such that the first component $\nu_{i,1}$ of the vector $\nu_i$ is not null. Suppose that $P_t$ is null; then all of its monomials are null. In particular, the coefficient associated with $\theta_1^{2^{t+1}-1}\theta_j$, which is proportional to $\sum_{i=1}^k z_i^{2^{t+1}-2}\nu_{i,1}^{2^{t+1}-1}\nu_{i,j}$, is null for all $j \in [n]$. Let $\alpha_i = z_i^{2^{t+1}-2}\nu_{i,1}^{2^{t+1}-1}$ for $i \in [k]$, and note that since $z_i \neq 0$ for all $i \in [k]$, we cannot have all the $\alpha_i$ equal to zero. Thus, we have $\sum_i \alpha_i\nu_{i,j} = 0$ for all $j \in [n]$, i.e. $\sum_{i=1}^k \alpha_i\nu_i = 0$, which is in contradiction with the linear independence of $\{\nu_i\}_{i=1}^k$.

We can now state the following theorem, which summarizes the overall procedure to recover the parameters of a system of the form (3) using pseudo-orthonormalization and the complex tensor power method. Note that this procedure generalizes the one proposed in Anandkumar et al. (2012): if all the weights $w_1, \dots, w_k$ are positive, the method we propose boils down to theirs.
Theorem 9 Let $\mu_1, \dots, \mu_k \in \mathbb{R}^n$ be linearly independent, let $w_1, \dots, w_k \in \mathbb{R}$ be non-zero, $M_2 = \sum_{i=1}^k w_i\,\mu_i\otimes\mu_i$ and $M_3 = \sum_{i=1}^k w_i\,\mu_i\otimes\mu_i\otimes\mu_i$. Let $UDU^\top$ be the eigendecomposition of $M_2$, $W = UD^{-1/2} \in \mathbb{C}^{n\times k}$ (see Section 4.1) and $(W^\top)^+ = UD^{1/2}$. Finally, let $T = M_3(W, W, W)$ and let $\theta_0$ be drawn at random in $\mathbb{C}^k$. Then, using the definitions of $\theta_t$ and $\lambda_t$ in Eq. 4, we have
$$\lim_t \lambda_t^{-2} = w_j \quad\text{and}\quad \lim_t \lambda_t\,(W^\top)^+\theta_t = \mu_j$$
with probability one, where $j = \arg\max_i\big|\mu_i^\top W\theta_0\big|$. The indeterminacy on the sign of the coefficients in the pseudo-orthonormal decomposition $T = \sum_{i=1}^k w_i^{-1/2}\big(w_i^{1/2}W^\top\mu_i\big)^{\otimes 3}$ vanishes when we recover the original parameters $w_i$ and $\mu_i$.
5. Learning Negative Mixtures of Spherical Gaussians
In this section, we extend the method described in Section 2.3 to estimate the parameters of a negative mixture of spherical Gaussians. Let $f(x) = \sum_{i=1}^k w_i\,\mathcal{N}(x; \mu_i, \sigma_i^2 I)$ be the PDF of the random vector $x$, where the $\mu_i \in \mathbb{R}^n$ are the component means, the $\sigma_i^2$ the component variances, and the $w_i \neq 0$ the coefficients ($\sum_{i=1}^k w_i = 1$). Assuming that the component means are linearly independent, we have the following result, which generalizes Theorem 2.
Theorem 10 The average variance $\bar{\sigma}^2 = \sum_{i=1}^k w_i\sigma_i^2$ is an eigenvalue of the covariance matrix $E[(x - E[x])(x - E[x])^\top]$. Let $v$ be any unit-norm eigenvector corresponding to $\bar{\sigma}^2$. We have $m_1 = \sum_{i=1}^k w_i\sigma_i^2\mu_i$, $M_2 = \sum_{i=1}^k w_i\,\mu_i\otimes\mu_i$, and $M_3 = \sum_{i=1}^k w_i\,\mu_i\otimes\mu_i\otimes\mu_i$, where $m_1$, $M_2$ and $M_3$ are defined as in Theorem 2. Moreover, let $r$ be the number of negative eigenvalues of the matrix $M = \sum_{i=1}^k w_i(\mu_i - E[x])\otimes(\mu_i - E[x])$. Then $\bar{\sigma}^2$ is the $(r+1)$-th smallest eigenvalue of the covariance matrix.

The proof of this theorem is given in Appendix A.3, where we also show that $r$ is equal to $l$ or $l - 1$, where $l$ is the number of negative coefficients $w_i$, i.e. $l = |\{w_i : i \in [k],\ w_i < 0\}|$. This theorem, combined with Theorem 9, yields a procedure to estimate the parameters of a negative mixture of spherical Gaussians: (i) compute the sample covariance matrix $S$; (ii) for each candidate eigenvalue of $S$ for $\bar{\sigma}^2$, estimate the tensors $m_1$, $M_2$ and $M_3$ on the data; (iii) compute estimations of the parameters using Algorithm 2; (iv) choose the model that maximizes the likelihood of the learning data.

Figure 1: (left) Density function of a negative mixture of spherical Gaussians with parameters $w_1 = 1.\ $, $\mu_1 = (11.\ ,\ -.\ )^\top$, $\sigma_1 = 8$, $w_2 = -.\ $, $\mu_2 = (11.\ ,\ -.\ )^\top$ and $\sigma_2 = 4$. (right) Convergence rate of the proposed method on the exact tensors $M_2$ and $M_3$ (error on $w_1$ and $\mu_1$ as a function of the number of iterations).
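Steps (i) and (ii) reuse the moment formulas of the earlier sketch, but now loop over all eigenvalues of the sample covariance as candidates for $\bar{\sigma}^2$; a schematic version (our own; `decompose` is a placeholder standing in for Algorithm 2) is:

```python
import numpy as np

def candidate_models(X, k, decompose):
    """Steps (i)-(iii): for each eigenvalue of the sample covariance taken as a
    candidate for sigma_bar^2, build the moment estimates of Theorem 10 and run
    a decomposition routine. `decompose(M2, M3, k)` is a placeholder callable
    returning parameter estimates; step (iv) keeps the most likely model."""
    N, n = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True)
    eigval, eigvec = np.linalg.eigh(cov)
    E = np.eye(n)
    models = []
    for s2, v in zip(eigval, eigvec.T):              # candidate (sigma_bar^2, eigenvector)
        m1 = (X * (((X - mean) @ v) ** 2)[:, None]).mean(axis=0)
        M2 = (X.T @ X) / N - s2 * E
        M3 = np.einsum('ta,tb,tc->abc', X, X, X) / N
        M3 = M3 - (np.einsum('a,ib,ic->abc', m1, E, E)
                   + np.einsum('ia,b,ic->abc', E, m1, E)
                   + np.einsum('ia,ib,c->abc', E, E, m1))
        models.append((s2, decompose(M2, M3, k)))
    return models
```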
6. Experiments
We illustrate the results presented above on the running example defined in Figure 1 (left). The algorithm to estimate the parameters of a system of the form (3) from estimates of the tensors $M_2$ and $M_3$ is summarized in Figure 2 (left).

First, we run Algorithm 2 on the exact tensors $M_2 = \sum_{i=1}^k w_i\,\mu_i\otimes\mu_i$ and $M_3 = \sum_{i=1}^k w_i\,\mu_i\otimes\mu_i\otimes\mu_i$ with various initializations of $\theta_0$ to extract the first eigenvector/eigenvalue pair. The corresponding parameters $w_i$ and $\mu_i$ are always exactly recovered in less than 15 iterations; the average error over 500 initializations for these two parameters as a function of the number of iterations is plotted in Figure 1 (right).

Then, we test our algorithm in a learning setting. For various sizes (ranging from 1,000 to 400,000), we generate 100 datasets (using Algorithm 1) and use the method described in the previous section to estimate the parameters of the negative mixture of Gaussians. The results are plotted in Figure 2 (right), where each point represents the average over the 100 datasets of the $\ell_2$-norm between the true parameters ($U = [\mu_1\ \mu_2]$ and $w = [w_1\ w_2]$) and the estimations.

For some of these datasets, our algorithm returns a decomposition involving complex-valued vectors and weights; for the experiments, we only used the real parts in the error measure. The number of these pathological datasets decreases toward zero as their size increases.
Algorithm 2
Negative Mixture Estimation
Input: $k \in \mathbb{N}$, $\widehat{M}_2 \in \bigotimes^2\mathbb{R}^n$, $\widehat{M}_3 \in \bigotimes^3\mathbb{R}^n$
Output: $w_1, \dots, w_k$, $\mu_1, \dots, \mu_k$
  $UDU^\top \leftarrow \widehat{M}_2$ ($k$-truncated eigendecomposition)
  $W \leftarrow UD^{-1/2}$
  $T \leftarrow \widehat{M}_3(W, W, W)$
  for $i = 1$ to $k$ do
    Draw $\theta$ at random in $\mathbb{C}^k$
    repeat
      $\theta \leftarrow T(I, \theta, \theta)$;  $\theta \leftarrow \theta / (\theta^\top\theta)^{1/2}$
    until stabilization
    $\lambda \leftarrow T(\theta, \theta, \theta)$
    $T \leftarrow T - \lambda\,\theta^{\otimes 3}$
    $w_i \leftarrow 1/\lambda^2$;  $\mu_i \leftarrow \lambda\,(W^\top)^+\theta$
  end for

Figure 2: (left) Algorithm for the estimation of the parameters of a negative mixture model from estimates of the low-order moment tensors. (right) Estimation errors $\|w - w_{\mathrm{est}}\|$ and $\|U - U_{\mathrm{est}}\|$ as a function of the dataset size.
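For completeness, here is a compact NumPy sketch of the whole of Algorithm 2 (our own rendering; the fixed iteration count and the magnitude-based truncation of the eigendecomposition are simplifications):

```python
import numpy as np

def negative_mixture_estimation(M2_hat, M3_hat, k, n_iter=50, rng=None):
    """Algorithm 2 sketch: pseudo-orthonormalize with W = U D^{-1/2} (complex
    square roots), then deflate with the complex tensor power method."""
    rng = rng or np.random.default_rng()
    eigval, eigvec = np.linalg.eigh(M2_hat)
    idx = np.argsort(np.abs(eigval))[-k:]
    D, U = eigval[idx].astype(complex), eigvec[:, idx].astype(complex)
    W = U / np.sqrt(D)
    W_T_pinv = U * np.sqrt(D)                        # (W^T)^+ = U D^{1/2}
    T = np.einsum('abc,ai,bj,ck->ijk', M3_hat.astype(complex), W, W, W)
    ws, mus = [], []
    for _ in range(k):
        theta = rng.normal(size=k) + 1j * rng.normal(size=k)
        for _ in range(n_iter):
            theta = np.einsum('abc,b,c->a', T, theta, theta)
            theta = theta / np.sqrt(theta @ theta)
        lam = np.einsum('abc,a,b,c->', T, theta, theta, theta)
        T = T - lam * np.einsum('a,b,c->abc', theta, theta, theta)
        ws.append(1.0 / lam ** 2)
        mus.append(lam * (W_T_pinv @ theta))
    return np.array(ws), np.array(mus)
```

In a learning setting, only the real parts of the returned estimates would be kept, as done in the experiments.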
7. Conclusion
In this paper, we propose a first introductory study of negative mixture models. We argue that these models may appear naturally in several learning settings — such as spectral learning of probability distributions on strings — when the learning schemes rely on algebraic methods applied without positivity constraints (i.e. on fields, e.g. $\mathbb{R}$, rather than semi-fields, e.g. $\mathbb{R}^+$).

These models may seem difficult to handle, since allowing negative weights excludes the use of probabilistic methods such as EM. However, tensor decomposition techniques can be an appealing alternative. The complex tensor power method we propose, along with its application to negative Gaussian mixture models, is a first step toward a deeper understanding of these models and the elaboration of tools to use them.

This work could be extended in several ways. First, other fields of machine learning where negative mixture models appear, or where their expressiveness can be useful, should be investigated. By extending the power method to complex-valued tensors, we are able to propose an algorithm to estimate the parameters of such models, but the implications of using decomposition techniques on complex tensors need to be studied further. In particular, a thorough robustness analysis of our method would help to understand its behavior in the learning setting.
References
Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. CoRR, abs/1210.7559, 2012.

Raphaël Bailly and François Denis. Absolute convergence of rational series is semi-decidable. Inf. Comput., 209(3):280–295, 2011.

Jean Berstel and Christophe Reutenauer. Rational series and their languages. EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, New York, 1988. ISBN 0-387-18626-3. URL http://opac.inria.fr/record=b1086956. Translation of: Les séries rationnelles et leurs langages.

François Denis and Yann Esposito. On rational stochastic languages. Fundam. Inform., 86(1-2):41–77, 2008.

S. W. Dharmadhikari. Sufficient conditions for a stationary process to be a function of a finite Markov chain. Ann. Math. Statist., pages 1033–1041, 1963.

Pierre Dupont, François Denis, and Yann Esposito. Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms. Pattern Recognition, 38(9):1349–1371, 2005.

Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. J. ACM, 60(6):45:1–45:39, November 2013. ISSN 0004-5411. doi: 10.1145/2512329. URL http://doi.acm.org/10.1145/2512329.

Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, ITCS '13, pages 11–20, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1859-4. doi: 10.1145/2422436.2422439. URL http://doi.acm.org/10.1145/2422436.2422439.

Vesna Jevremovic. A note on mixed exponential distribution with negative weights. Statistics & Probability Letters, 11(3):259–265, March 1991.

R. Jiang, M. J. Zuo, and H.-X. Li. Weibull and inverse Weibull mixture models allowing negative weights. Reliability Engineering & System Safety, 66(3):227–234, 1999. ISSN 0951-8320. doi: 10.1016/S0951-8320(99)00037-X.

Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

Geoffrey Mclachlan and David Peel. Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley-Interscience, 2000.

Philipp Müller, Simo Ali-Löytty, Marzieh Dashti, Henri Nurminen, and Robert Piché. Gaussian mixture filter allowing negative weights and its application to positioning using signal strength measurements. In WPNC, pages 71–76. IEEE, 2012. ISBN 978-1-4673-1437-4. URL http://dblp.uni-trier.de/db/conf/wpnc/wpnc2012.html.

Petra Perner and Atsushi Imiya, editors. Machine Learning and Data Mining in Pattern Recognition, 4th International Conference, MLDM 2005, Leipzig, Germany, July 9-11, 2005, Proceedings, volume 3587 of Lecture Notes in Computer Science. Springer, 2005. ISBN 3-540-26923-1.

Baibo Zhang and Changshui Zhang. Finite mixture models with negative components. In Perner and Imiya (2005), pages 31–41. ISBN 3-540-26923-1.
Acknowledgments
This work has been carried out thanks to the support of the ARCHIMEDE Labex (ANR-11-LABX-0033) and the A*MIDEX project (ANR-11-IDEX-0001-02) funded by the "Investissements d'Avenir" French government program managed by the ANR.
Appendix A. Proofs and Complements
A.1. Rational probability distributions on strings
Probabilistic automata and HMMs define the same family of probability distributions on strings Dupont et al. (2005). All these distributions are rational, but the converse is false Dharmadhikari (1963); Denis and Esposito (2008). The simplest counter-examples can be built on a one-letter alphabet and a dimension equal to 3. Let
$\Sigma = \{a\}$ be a one-letter alphabet. Let us define a parametrized family of linear representations by
$$\iota = (\lambda,\ \lambda,\ \sqrt{2}\lambda)^\top, \quad M = \rho\begin{pmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \tau = (1,\ 0,\ 1)^\top$$
where $\lambda > 0$ and $0 < \rho < 1$. Let $r$ be the associated rational series. It can easily be seen that $r(a^n) = \rho^n\sqrt{2}\lambda\,[\cos(n\alpha - \pi/4) + 1] \geq 0$ for all $n$. We also have
$$r(\Sigma^*) = \lambda\left[\frac{1 - \sqrt{2}\,\rho\cos(\alpha - \pi/4)}{1 - 2\rho\cos\alpha + \rho^2} + \frac{\sqrt{2}}{1 - \rho}\right]$$
and $\lambda$ can always be chosen such that $r(\Sigma^*) = 1$, i.e. such that $r$ is a probability distribution. It can easily be seen that $r$ can be defined by a PA iff $\alpha/\pi \in \mathbb{Q}$. For example, if $\cos\alpha = 3/5$ and $\sin\alpha = 4/5$, the corresponding distributions cannot be computed by a PA.

For a small enough value of $\rho$ (and $\lambda$ chosen so that $r(\Sigma^*) = 1$), the construction described in Section 3.1 applies directly and yields the distributions $p^+$ and $p^-$, respectively defined by two six-state PAs $\langle \iota^+, M^+, \tau^+\rangle$ and $\langle \iota^-, M^-, \tau^-\rangle$, together with the corresponding mixture parameters $s^+$ and $s^-$ satisfying $s^+ - s^- = 1$. For larger values of $\rho$, the series $r^+$ and $r^-$ computed by Lemma 3 do not converge. It is then necessary to first compute a linear representation $(\iota, M, \tau)$ of $r$ such that the series associated with $(|\iota|, |M|, |\tau|)$ is convergent; this can be achieved using techniques described in Bailly and Denis (2011), yielding a 6-dimensional linear representation from which the construction described in Section 3.1 can be applied.

A.2. Proof of Lemma 11

Lemma 11
Let $k > 0$ and $z \in \mathbb{C}$ such that $|z| < 1/2$. Then, $|(1+z)^{-k} - 1| \leq 2|z|(2^k - 1)$. In particular, $|(1+z)^{-1/2} - 1| \leq |z|$ and $|(1+z)^{-3/2} - 1| \leq 4|z|$.

Proof
Let $f(z) = (1+z)^{-k} - 1$ with $k > 0$ and $|z| < 1/2$; $f'(z) = -k(1+z)^{-(k+1)}$. Let $\gamma : [0,1] \to \mathbb{C}$ be such that $\gamma(t) = tz$. We have
$$(1+z)^{-k} - 1 = \int_\gamma f'(y)\,dy = \int_0^1 f'(\gamma(t))\gamma'(t)\,dt = -kz\int_0^1 (1+tz)^{-(k+1)}dt.$$
Therefore,
$$|(1+z)^{-k} - 1| \leq k|z|\int_0^1 (1 - t|z|)^{-(k+1)}dt = \big[(1 - t|z|)^{-k}\big]_0^1 = (1 - |z|)^{-k} - 1 \leq 2|z|(2^k - 1).$$
Indeed, let $g(x) = (1-x)^{-k} - 1 - 2x(2^k - 1)$. It can be checked that $g(0) = g(1/2) = 0$ and that $g$ is convex on $[0, 1/2]$.

A.3. Proof of Theorem 10
We will need the following results. The first one is a corollary of Sylvester's Law of Inertia.

Lemma 12
Let $Q \in \mathbb{R}^{n\times n}$ be a symmetric real matrix. Suppose that there exist a non-singular matrix $P \in \mathbb{R}^{n\times n}$ and $w_1, \dots, w_n \in \mathbb{R}$ such that $Q = P^\top D P$ where $D = \mathrm{diag}(w_1, \dots, w_n)$, the diagonal matrix whose diagonal entries are $w_1, \dots, w_n$. Then, the number of negative eigenvalues of $Q$ is equal to the number of negative coefficients $w_i$.
Lemma 13 (Weyl's Inequality) Let $A$ and $B$ be two $n \times n$ hermitian matrices. We have $\sigma_1(A) + \sigma_i(B) \leq \sigma_i(A+B) \leq \sigma_n(A) + \sigma_i(B)$ for all $i \in [n]$, where $\sigma_i(M)$ denotes the $i$-th smallest eigenvalue of $M$.
Lemma 14 Let $\{v_i\}_{i=1}^k$ be a linearly dependent family of vectors of $\mathbb{R}^n$, where $n \geq k$, such that any of its subsets of size $k-1$ is linearly independent. We consider the rank-$(k-1)$ matrix $M = \sum_{i=1}^k w_i v_i v_i^\top$ where $w_1, \dots, w_k \neq 0$. Let $l$ be the number of negative coefficients $w_i$. Then the first null eigenvalue of $M$ is either the $l$-th or the $(l+1)$-th smallest one.

Proof If $l = 0$, then $M$ is positive semi-definite and $\sigma_1(M) = 0$. If $l = k$, then $M$ is negative semi-definite and $\sigma_k(M) = 0$. We now suppose that $1 \leq l \leq k-1$.

For $1 \leq j \leq k$, let $M_j = \sum_{1 \leq i \neq j \leq k} w_i v_i v_i^\top$ and let $l_j$ be the number of negative coefficients in $\{w_i\}_{1 \leq i \neq j \leq k}$. Let $V_j$ be the vector space spanned by $\{v_1, \dots, \hat{v}_j, \dots, v_k\}$, where the notation $\hat{v}_j$ means that $v_j$ is omitted. Let $\nu_k, \dots, \nu_n$ be a linearly independent family of vectors in $V_j^\perp$ and let $P$ be the non-singular $n \times n$ matrix $[v_1, \dots, \hat{v}_j, \dots, v_k, \nu_k, \dots, \nu_n]^\top$. Clearly, $M_j = P^\top\mathrm{diag}(w_1, \dots, \hat{w}_j, \dots, w_k, 0, \dots, 0)P$ and therefore, from Lemma 12, $l_j$ is the number of negative eigenvalues of $M_j$.

For any $j \in [k]$, we consider the decomposition $M = w_j v_j v_j^\top + M_j$, a sum of two hermitian matrices. The first summand is a rank-one matrix whose only non-null eigenvalue has the same sign as $w_j$, and the second has $k-1$ non-zero eigenvalues, among which $l_j$ are negative. Let $j$ be an index such that $w_j < 0$: from Weyl's inequality, $\sigma_i(M) \leq \sigma_i(M_j)$ for any $i \in [n]$. Since $M_j$ has $l_j = l-1$ negative eigenvalues, $M$ has at least $l-1$ negative eigenvalues. Let $j$ be an index such that $w_j > 0$: Weyl's inequality gives $\sigma_i(M) \geq \sigma_i(M_j)$ for any $i \in [n]$; thus $M$ has at least $k - l - 1$ positive eigenvalues, hence at most $l$ negative ones. Therefore, the first null eigenvalue of $M$ must be either the $l$-th or the $(l+1)$-th smallest one.

Let $f(x) = \sum_{i=1}^k w_i\,\mathcal{N}(x; \mu_i, \sigma_i^2 I)$ be the PDF of the random vector $x$, and let $l$ be the number of negative weights $w_i$. We can now prove Theorem 10, along with the relation between $l$ and the position of the eigenvalue $\bar{\sigma}^2$ in the covariance matrix.
Theorem
The average variance $\bar{\sigma}^2 = \sum_{i=1}^k w_i\sigma_i^2$ is an eigenvalue of the covariance matrix $E[(x - E[x])(x - E[x])^\top]$. Let $v$ be any unit-norm eigenvector corresponding to $\bar{\sigma}^2$. We have $m_1 = \sum_{i=1}^k w_i\sigma_i^2\mu_i$, $M_2 = \sum_{i=1}^k w_i\,\mu_i\otimes\mu_i$, and $M_3 = \sum_{i=1}^k w_i\,\mu_i\otimes\mu_i\otimes\mu_i$, where $m_1$, $M_2$ and $M_3$ are defined as in Theorem 2. Moreover, let $r$ be the number of negative eigenvalues of the matrix $M = \sum_{i=1}^k w_i(\mu_i - E[x])\otimes(\mu_i - E[x])$. Then $\bar{\sigma}^2$ is the $(r+1)$-th smallest eigenvalue of the covariance matrix. Furthermore, $r$ is equal to $l$ or $l - 1$.

Proof
Most of the proof of this theorem for usual Gaussian mixtures in Hsu and Kakade (2013) relies on the introduction of a discrete latent variable $h$: the sampling process is interpreted as first sampling $h$ with $P[h = i] = w_i$, and then sampling $x = \mu_h + z_h$ where $z_h$ is a multivariate Gaussian with mean $0$ and covariance $\sigma_h^2 I$. Allowing negative weights in the mixture, we cannot use the same strategy, but it will be sufficient to note that $E[g(x)] = \sum_{i=1}^k w_i E[g(\mu_i + z_i)]$ for any function $g$, which is a direct consequence of the linearity of the expectation.

First, we need to identify the position of $\bar{\sigma}^2$ in the covariance matrix. Let $\bar{\mu} = E[x] = \sum_{i=1}^k w_i\mu_i$. The covariance matrix of $x$ is
$$E[(x - \bar{\mu})\otimes(x - \bar{\mu})] = \sum_{i=1}^k w_i(\mu_i - \bar{\mu})\otimes(\mu_i - \bar{\mu}) + \bar{\sigma}^2 I.$$
Since the $\mu_i$'s are linearly independent, $F = \{\mu_i - \bar{\mu}\}_{i=1}^k$ is a linearly dependent family of vectors of $\mathbb{R}^n$ such that any of its subsets of size $k-1$ is linearly independent. It follows from Lemma 14 that $0$ is either the $l$-th or the $(l+1)$-th smallest eigenvalue of the matrix $\sum_{i=1}^k w_i(\mu_i - \bar{\mu})\otimes(\mu_i - \bar{\mu})$, which implies that $\bar{\sigma}^2$ is the corresponding eigenvalue in the covariance matrix. Note that the strict separation of $\bar{\sigma}^2$ from the other eigenvalues in the covariance matrix implies that every eigenvector corresponding to $\bar{\sigma}^2$ is in the null space of $\sum_{i=1}^k w_i(\mu_i - \bar{\mu})\otimes(\mu_i - \bar{\mu})$, hence $v^\top(\mu_i - \bar{\mu}) = 0$ for all $i \in [k]$.

We now express $m_1$, $M_2$ and $M_3$ in terms of the parameters $w_i$, $\sigma_i^2$ and $\mu_i$. First,
$$m_1 = E[x\,(v^\top(x - E[x]))^2] = \sum_{i=1}^k w_i E[(\mu_i + z_i)(v^\top(\mu_i - \bar{\mu} + z_i))^2] = \sum_{i=1}^k w_i E[(\mu_i + z_i)(v^\top z_i)^2] = \sum_{i=1}^k w_i\sigma_i^2\mu_i.$$
Next, since $E[z_i \otimes z_i] = \sigma_i^2 I$ for all $i \in [k]$, we have
$$M_2 = E[x \otimes x] - \bar{\sigma}^2 I = \sum_{i=1}^k w_i E[(\mu_i + z_i)\otimes(\mu_i + z_i)] - \bar{\sigma}^2 I = \sum_{i=1}^k w_i\big(\mu_i\otimes\mu_i + E[z_i\otimes z_i]\big) - \bar{\sigma}^2 I = \sum_{i=1}^k w_i\,\mu_i\otimes\mu_i.$$
Finally, writing $z_{ij}$ for the $j$-th component of the vector $z_i$, we have
$$\sum_{i=1}^k w_i E[\mu_i\otimes z_i\otimes z_i] = \sum_{i=1}^k w_i\sum_{p=1}^n\sum_{q=1}^n E[z_{ip}z_{iq}]\,\mu_i\otimes e_p\otimes e_q = \sum_{i=1}^k w_i\sigma_i^2\sum_{j=1}^n \mu_i\otimes e_j\otimes e_j = \sum_{j=1}^n m_1\otimes e_j\otimes e_j,$$
where we used the fact that $E[z_{ip}z_{iq}] = \delta_{pq}\sigma_i^2$ for all $i \in [k]$, $p, q \in [n]$. Using the same derivation, we have $\sum_{i=1}^k w_i E[z_i\otimes\mu_i\otimes z_i] = \sum_{j=1}^n e_j\otimes m_1\otimes e_j$ and $\sum_{i=1}^k w_i E[z_i\otimes z_i\otimes\mu_i] = \sum_{j=1}^n e_j\otimes e_j\otimes m_1$. Hence,
$$E[x^{\otimes 3}] = \sum_{i=1}^k w_i\big(\mu_i^{\otimes 3} + E[\mu_i\otimes z_i\otimes z_i] + E[z_i\otimes\mu_i\otimes z_i] + E[z_i\otimes z_i\otimes\mu_i]\big) = \sum_{i=1}^k w_i\mu_i^{\otimes 3} + \sum_{j=1}^n\big(m_1\otimes e_j\otimes e_j + e_j\otimes m_1\otimes e_j + e_j\otimes e_j\otimes m_1\big)$$
and $M_3 = \sum_{i=1}^k w_i\,\mu_i\otimes\mu_i\otimes\mu_i$.