Learning Diagonal Gaussian Mixture Models and Incomplete Tensor Decompositions
BINGNI GUO, JIAWANG NIE, AND ZI YANG
Abstract.
This paper studies how to learn parameters in diagonal Gaussian mixture models. The problem can be formulated as computing incomplete symmetric tensor decompositions. We use generating polynomials to compute incomplete symmetric tensor decompositions and approximations. Then the tensor approximation method is used to learn diagonal Gaussian mixture models. We also do the stability analysis. When the first and third order moments are sufficiently accurate, we show that the obtained parameters for the Gaussian mixture models are also highly accurate. Numerical experiments are also provided.

Key words and phrases: Gaussian model, tensor, decomposition, generating polynomial, moments.
1. Introduction

A Gaussian mixture model consists of several component Gaussian distributions. For given samples of a Gaussian mixture model, people often need to estimate parameters for each component Gaussian distribution. Consider a Gaussian mixture model with r components. For each i ∈ [r] := {1, . . . , r}, let ω_i be the positive probability for the ith component Gaussian to appear in the mixture model. We have each ω_i > 0 and ω_1 + · · · + ω_r = 1. Suppose the ith Gaussian distribution is N(µ_i, Σ_i), where µ_i ∈ R^d is the expectation (or mean) and Σ_i ∈ R^{d×d} is the covariance matrix. Let y ∈ R^d be the random vector for the Gaussian mixture model and let y_1, · · · , y_N be independent identically distributed (i.i.d.) samples from the mixture model. Each y_j is sampled from one of the r component Gaussian distributions, associated with a label Z_j ∈ [r] indicating the component that it is sampled from. The probability that a sample comes from the ith component is ω_i. When people observe only samples without labels, the Z_j's are called latent variables. The density function for the random variable y is

    f(y) := Σ_{i=1}^r  ω_i / √((2π)^d det Σ_i) · exp{ −(1/2)(y − µ_i)^T Σ_i^{-1} (y − µ_i) },
where µ_i is the mean and Σ_i is the covariance matrix for the ith component. Learning a Gaussian mixture model is to estimate the parameters ω_i, µ_i, Σ_i for each i ∈ [r], from given samples of y. The number of parameters in a covariance matrix grows quadratically with respect to the dimension. Due to the curse of dimensionality, the computation becomes very expensive for large d [31]. Hence, diagonal covariance matrices are preferable in applications. In this paper, we focus
on learning Gaussian mixture models with diagonal covariance matrices, i.e.,

    Σ_i = diag(σ_{i1}², . . . , σ_{id}²),  i = 1, . . . , r.

A natural approach for recovering the unknown parameters ω_i, µ_i, Σ_i is the method of moments. It estimates parameters by solving a system of multivariate polynomial equations, from moments of the random vector y. Directly solving polynomial systems may encounter non-existence or non-uniqueness of statistically meaningful solutions [53]. However, for diagonal Gaussians, the third order moment tensor can help us avoid these troubles.

Let M_3 := E(y ⊗ y ⊗ y) be the third order tensor of moments for y. One can write that y = η(z) + ζ(z), where z is a discrete random variable such that Prob(z = i) = ω_i, η(i) = µ_i ∈ R^d, and ζ(i) is the random variable ζ_i obeying the Gaussian distribution N(0, Σ_i). Assume all Σ_i are diagonal, then

(1.1)  M_3 = Σ_{i=1}^r ω_i E[(η(i) + ζ_i)^{⊗3}]
           = Σ_{i=1}^r ω_i ( µ_i ⊗ µ_i ⊗ µ_i + E[µ_i ⊗ ζ_i ⊗ ζ_i] + E[ζ_i ⊗ µ_i ⊗ ζ_i] + E[ζ_i ⊗ ζ_i ⊗ µ_i] ).

The second equality holds because ζ_i has zero mean and

    E[ζ_i ⊗ ζ_i ⊗ ζ_i] = E[µ_i ⊗ µ_i ⊗ ζ_i] = E[ζ_i ⊗ µ_i ⊗ µ_i] = E[µ_i ⊗ ζ_i ⊗ µ_i] = 0.

The random variable ζ_i has a diagonal covariance matrix, so E[(ζ_i)_j (ζ_i)_l] = 0 for j ≠ l. Therefore,

    Σ_{i=1}^r ω_i E[µ_i ⊗ ζ_i ⊗ ζ_i] = Σ_{i=1}^r Σ_{j=1}^d ω_i σ_{ij}² µ_i ⊗ e_j ⊗ e_j = Σ_{j=1}^d a_j ⊗ e_j ⊗ e_j,

where the vectors a_j are given by

(1.2)  a_j := Σ_{i=1}^r ω_i σ_{ij}² µ_i,  j = 1, . . . , d.

Similarly, we have

    Σ_{i=1}^r ω_i E[ζ_i ⊗ µ_i ⊗ ζ_i] = Σ_{j=1}^d e_j ⊗ a_j ⊗ e_j,
    Σ_{i=1}^r ω_i E[ζ_i ⊗ ζ_i ⊗ µ_i] = Σ_{j=1}^d e_j ⊗ e_j ⊗ a_j.

Therefore, we can express M_3 in terms of ω_i, µ_i, Σ_i as

(1.3)  M_3 = Σ_{i=1}^r ω_i µ_i ⊗ µ_i ⊗ µ_i + Σ_{j=1}^d ( a_j ⊗ e_j ⊗ e_j + e_j ⊗ a_j ⊗ e_j + e_j ⊗ e_j ⊗ a_j ).

We are particularly interested in the following third order symmetric tensor

(1.4)  F := Σ_{i=1}^r ω_i µ_i ⊗ µ_i ⊗ µ_i.

When the labels i_1, i_2, i_3 are distinct from each other, we have

    (M_3)_{i_1 i_2 i_3} = (F)_{i_1 i_2 i_3}  for i_1 ≠ i_2 ≠ i_3 ≠ i_1.

Denote the label set

(1.5)  Ω = { (i_1, i_2, i_3) : i_1 ≠ i_2 ≠ i_3 ≠ i_1, where i_1, i_2, i_3 are labels for M_3 }.
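To make the identities (1.2)-(1.5) concrete, the following MATLAB sketch (ours, not from the paper; all variable names are hypothetical) builds F from randomly chosen parameters, draws samples from the diagonal Gaussian mixture, and checks that the empirical third moment agrees with F on the label set Ω up to sampling error.

    % Monte-Carlo check of (1.3)-(1.5) for a random diagonal Gaussian mixture
    d = 5; r = 2; N = 100000;
    w = [0.4; 0.6];                    % positive weights, summing to 1
    mu = randn(d, r);                  % columns are the means mu_i
    sig2 = rand(d, r);                 % diagonal covariance entries sigma_ij^2
    % the tensor F from (1.4)
    F = zeros(d, d, d);
    for i = 1:r
        ui = mu(:, i);
        F = F + w(i) * reshape(kron(ui, kron(ui, ui)), d, d, d);
    end
    % draw N samples: pick a component, then add diagonal Gaussian noise
    cw = cumsum(w);
    [~, labels] = max(rand(1, N) <= cw, [], 1);
    Y = mu(:, labels) + sqrt(sig2(:, labels)) .* randn(d, N);
    % empirical third order moment tensor
    M3 = zeros(d, d, d);
    for t = 1:N
        y = Y(:, t);
        M3 = M3 + reshape(kron(y, kron(y, y)), d, d, d) / N;
    end
    % on Omega (three distinct labels), M3 and F agree up to sampling error
    err = 0;
    for i1 = 1:d, for i2 = 1:d, for i3 = 1:d
        if i1 ~= i2 && i2 ~= i3 && i1 ~= i3
            err = max(err, abs(M3(i1,i2,i3) - F(i1,i2,i3)));
        end
    end, end, end
    fprintf('max |(M3 - F)_Omega| = %.3e\n', err);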
The tensor M_3 can be estimated from the samplings for y, so the entries F_{i_1 i_2 i_3} with (i_1, i_2, i_3) ∈ Ω can also be obtained from the estimation of M_3. To recover the parameters ω_i, µ_i, we first find the tensor decomposition for F, from the partially given entries F_{i_1 i_2 i_3} with (i_1, i_2, i_3) ∈ Ω. Once the parameters ω_i, µ_i are known, we can determine Σ_i from the expressions of a_j as in (1.2).

The above observation leads to the incomplete tensor decomposition problem. For a third order symmetric tensor F whose partial entries F_{i_1 i_2 i_3} with (i_1, i_2, i_3) ∈ Ω are known, we are looking for vectors p_1, . . . , p_r such that

(1.6)  F_{i_1 i_2 i_3} = ( p_1^{⊗3} + · · · + p_r^{⊗3} )_{i_1 i_2 i_3},  for all (i_1, i_2, i_3) ∈ Ω.

The above is called an incomplete tensor decomposition for F. To find such a tensor decomposition for F, a straightforward approach is to do tensor completion: first find unknown tensor entries F_{i_1 i_2 i_3} with (i_1, i_2, i_3) ∉ Ω such that the completed F has low rank, and then compute the tensor decomposition for F. However, there are serious disadvantages for this approach. The theory for tensor completion or recovery, especially for symmetric tensors, is premature. Low rank tensor completion or recovery is typically not guaranteed by the currently existing methodology. Most methods for tensor completion are based on convex relaxations, e.g., the nuclear norm or trace minimization [19, 32, 37, 50, 54]. These convex relaxations may not produce low rank completions [47].

In this paper, we propose a new method for determining incomplete tensor decompositions. It is based on the generating polynomial method in [36]. The label set Ω consists of (i_1, i_2, i_3) of distinct i_1, i_2, i_3. We can still determine some generating polynomials, from the partially given tensor entries F_{i_1 i_2 i_3} with (i_1, i_2, i_3) ∈ Ω. They can be used to get the incomplete tensor decomposition. We show that this approach works very well when the rank r is roughly not more than half of the dimension d. Consequently, the parameters for the Gaussian mixture model can be recovered from the incomplete tensor decomposition of F.
1.1. Related Work. Gaussian mixture models have broad applications in machine learning problems, e.g., automatic speech recognition [26, 44, 46], hyperspectral unmixing [4, 30], background subtraction [56, 28] and anomaly detection [52]. They also have applications in social and biological sciences [21, 49, 55]. There exist methods for estimating unknown parameters for Gaussian mixture models. A popular method is the expectation-maximization (EM) algorithm that iteratively approximates the maximum likelihood parameter estimation [13]. This approach is widely used in applications, but it typically only finds local optimizers and the convergence may be slow [45]. Dasgupta [10] introduced a method that first projects data to a randomly chosen low-dimensional subspace and then uses the empirical means and covariances of low-dimensional clusters to estimate the parameters. Later, Arora and Kannan [48] extended this idea to arbitrary Gaussians. Vempala and Wang [51] introduced the spectral technique to enhance the separation condition by projecting data to principal components of the sample matrix instead of selecting a random subspace. For other subsequent work, we refer to Dasgupta and Schulman [11], Kannan et al. [25], Achlioptas et al. [1], Chaudhuri and Rao [8], Brubaker and Vempala [5] and Chaudhuri et al. [7].

Another frequently used approach is based on moments, introduced by Pearson [42]. Belkin and Sinha [3] proposed a learning algorithm for identical spherical Gaussians (Σ_i = σ²I) with arbitrarily small separation between mean vectors. It was also shown in [24] that a mixture of two Gaussians can be learned with provably minimal assumptions. Hsu and Kakade [23] provided a learning algorithm for a mixture of spherical Gaussians, i.e., each covariance matrix is a multiple of the identity matrix. This method is based on moments up to order three and only assumes non-degeneracy instead of separations. For general covariance matrices, Ge et al. [20] proposed a learning method when the dimension d is sufficiently high. More moment based methods for general latent variable models can be found in [2].
1.2. Contributions. This paper proposes a new method for learning diagonal Gaussian mixture models, based on samplings for the first and third order moments. Let y_1, · · · , y_N be samples and let {(ω_i, µ_i, Σ_i) : i ∈ [r]} be parameters of the diagonal Gaussian mixture model, where each covariance matrix Σ_i is diagonal. We use the samples y_1, · · · , y_N to estimate the third order moment tensor M_3, as well as the mean vector M_1. We have seen that the tensor M_3 can be expressed as in (1.3). For the tensor F in (1.4), we have F_{i_1 i_2 i_3} = (M_3)_{i_1 i_2 i_3} when the labels i_1, i_2, i_3 are distinct from each other. Other entries of F are not known, since the vectors a_j are not available. The F is an incompletely given tensor. We give a new method for computing the incomplete tensor decomposition of F when the rank r is low (roughly no more than half of the dimension d). The tensor decomposition of F is unique under some genericity conditions, so it can be used to recover the parameters ω_i, µ_i.

To compute the incomplete tensor decomposition of F, we use the generating polynomial method in [36, 38]. We look for a special set of generating polynomials for F, which can be obtained by solving linear least squares. It only requires to use the known entries of F. The common zeros of these generating polynomials can be determined from eigenvalue decompositions. Under some genericity assumptions, these common zeros can be used to get the incomplete tensor decomposition. After this is done, the parameters ω_i, µ_i can be recovered by solving linear systems. The diagonal covariance matrices Σ_i can also be estimated by solving linear least squares. The tensor M_3 is estimated from the samples y_1, . . . , y_N. Typically, the tensor entries (M_3)_{i_1 i_2 i_3} and F_{i_1 i_2 i_3} are not precisely given. We also provide a stability analysis for this case, showing that the estimated parameters are also accurate when the entries (M_3)_{i_1 i_2 i_3} have small errors.

The paper is organized as follows. In Section 2, we review some basic results for symmetric tensor decompositions and generating polynomials. In Section 3, we give a new algorithm for computing an incomplete tensor decomposition for F, when only its subtensor F_Ω is known. Section 4 gives the stability analysis when there are errors for the subtensor F_Ω. Section 5 gives the algorithm for learning Gaussian mixture models. Numerical experiments and applications are given in Section 6. We make some conclusions and discussions in Section 7.
2. Preliminary

2.1. Notation.
Denote by N, C and R the set of nonnegative integers, complex numbers and real numbers respectively. Denote the cardinality of a set L as |L|. Denote by e_i the ith standard unit basis vector, i.e., the ith entry of e_i is one and all others are zeros. For a complex number c, ⁿ√c or c^{1/n} denotes the principal nth root of c. For a complex vector v, Re(v), Im(v) denote the real part and imaginary part of v respectively. A property is said to be generic if it is true in the whole space except a subset of zero Lebesgue measure. The ‖·‖ denotes the Euclidean norm of a vector or the Frobenius norm of a matrix. For a vector or matrix, the superscript T denotes the transpose and H denotes the conjugate transpose. For i, j ∈ N, [i] denotes the set {1, 2, . . . , i} and [i, j] denotes the set {i, i+1, . . . , j} if i ≤ j. For a vector v, v_{i_1:i_2} denotes the subvector (v_{i_1}, v_{i_1+1}, . . . , v_{i_2}). For a matrix A, denote by A_{[i_1:i_2, j_1:j_2]} the submatrix of A whose row labels are i_1, i_1+1, . . . , i_2 and whose column labels are j_1, j_1+1, . . . , j_2. For a tensor F, its subtensor F_{[i_1:i_2, j_1:j_2, k_1:k_2]} is similarly defined.

Let S^m(C^d) (resp., S^m(R^d)) denote the space of mth order symmetric tensors over the vector space C^d (resp., R^d). For convenience of notation, the labels for tensors start with 0. A symmetric tensor A ∈ S^m(C^{n+1}) is labelled as

    A = (A_{i_1...i_m})_{0 ≤ i_1,...,i_m ≤ n},

where the entry A_{i_1...i_m} is invariant for all permutations of (i_1, . . . , i_m). The Hilbert-Schmidt norm ‖A‖ is defined as

(2.1)  ‖A‖ := ( Σ_{0 ≤ i_1,··· ,i_m ≤ n} |A_{i_1···i_m}|² )^{1/2}.

The norm of a subtensor ‖A_Ω‖ is similarly defined. For a vector u := (u_0, u_1, . . . , u_n) ∈ C^{n+1}, the tensor power u^{⊗m} := u ⊗ · · · ⊗ u, where u is repeated m times, is defined such that

    (u^{⊗m})_{i_1...i_m} = u_{i_1} × · · · × u_{i_m}.

For a symmetric tensor F, its symmetric rank is

    rank_S(F) := min{ r | F = Σ_{i=1}^r u_i^{⊗m} }.

There are other types of tensor ranks [27, 29]. In this paper, we only deal with symmetric tensors and symmetric ranks. We refer to [9, 14, 18, 22, 27, 29] for general work about tensors and their ranks. For convenience, if r = rank_S(F), we call F a rank-r tensor and F = Σ_{i=1}^r u_i^{⊗m} is called a rank decomposition.

For a power α := (α_1, α_2, · · · , α_n) ∈ N^n and x := (x_1, x_2, · · · , x_n), denote

    |α| := α_1 + α_2 + · · · + α_n,  x^α := x_1^{α_1} x_2^{α_2} · · · x_n^{α_n},  x_0 := 1.

The monomial power set of degree m is denoted as

    N^n_m := { α = (α_1, α_2, · · · , α_n) ∈ N^n : |α| ≤ m }.

For α ∈ N^n_3, we can write that x^α = x_{i_1} x_{i_2} x_{i_3} for some 0 ≤ i_1, i_2, i_3 ≤ n. Let C[x]_m be the space of all polynomials in x with complex coefficients and whose degrees are no more than m. For a cubic polynomial p ∈ C[x]_3 and F ∈ S³(C^{n+1}), we define the bilinear product (note that x_0 = 1)

(2.2)  ⟨p, F⟩ = Σ_{0 ≤ i_1,i_2,i_3 ≤ n} p_{i_1 i_2 i_3} F_{i_1 i_2 i_3}  for  p = Σ_{0 ≤ i_1,i_2,i_3 ≤ n} p_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3},

where the p_{i_1 i_2 i_3} are coefficients of p. A polynomial g ∈ C[x]_3 is called a generating polynomial for a symmetric tensor F ∈ S³(C^{n+1}) if

(2.3)  ⟨g · x^β, F⟩ = 0  for all β ∈ N^n_{3−deg(g)},

where deg(g) denotes the degree of g in x. When the order is bigger than 3, we refer to [36] for the definition of generating polynomials.
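As a small illustration of the bilinear product (2.2) and the condition (2.3), the following MATLAB sketch (ours, with made-up data) verifies that g = x_1 − u_1, which vanishes at the point u, is a generating polynomial for the rank-1 tensor F = (1, u)^{⊗3}: pairing g·x^β with F gives zero for every monomial x^β of degree at most 2.

    n = 3;                        % labels run over 0, 1, ..., n, with x_0 = 1
    u = [2; -1; 3];               % u = (u_1, ..., u_n)
    v = [1; u];                   % homogenized point (1, u)
    F = reshape(kron(v, kron(v, v)), n+1, n+1, n+1);    % F = (1,u)^{⊗3}
    % g = x_1 - u_1; monomials x^beta of degree <= 2 are x_j*x_k with j,k >= 0
    for j = 0:n
        for k = 0:n
            % <g*x_j*x_k, F> = F_{1jk} - u_1*F_{0jk}; MATLAB indices shift by 1
            pairing = F(2, j+1, k+1) - u(1) * F(1, j+1, k+1);
            assert(abs(pairing) < 1e-12);
        end
    end
    disp('g = x_1 - u_1 is a generating polynomial for F')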
Generating polynomials can be used to compute symmetric tensor decompositions and low rank approximations [36, 38], which are closely related to truncated moment problems and polynomial optimization [17, 33, 34, 35, 39]. There are special versions of symmetric tensors and their decompositions [16, 40, 41].
3. Incomplete Tensor Decomposition
This section discusses how to compute an incomplete tensor decomposition for a symmetric tensor F ∈ S³(C^d) when only its subtensor F_Ω is given, for the label set Ω in (1.5). For convenience of notation, the labels for F begin with zeros while a vector u ∈ C^d is still labelled as u := (u_1, . . . , u_d). We set

    n := d − 1,  x = (x_1, . . . , x_n),  x_0 := 1.

For a given rank r, denote the monomial sets

(3.1)  B_0 := {x_1, · · · , x_r},  B_1 := {x_i x_j : i ∈ [r], j ∈ [r+1, n]}.

For a monomial power α ∈ N^n, by writing α ∈ B_1, we mean that x^α ∈ B_1. For each α ∈ B_1, one can write α = e_i + e_j with i ∈ [r], j ∈ [r+1, n]. Let C^{[r]×B_1} denote the space of matrices labelled by the pair (k, α) ∈ [r] × B_1. For each α = e_i + e_j ∈ B_1 and G ∈ C^{[r]×B_1}, denote the quadratic polynomial in x

(3.2)  ϕ_{ij}[G](x) := Σ_{k=1}^r G(k, e_i + e_j) x_k − x_i x_j.

Suppose r is the symmetric rank of F. A matrix G ∈ C^{[r]×B_1} is called a generating matrix of F if each ϕ_{ij}[G](x), with α = e_i + e_j ∈ B_1, is a generating polynomial of F. Equivalently, G is a generating matrix of F if and only if

(3.3)  ⟨x_t ϕ_{ij}[G](x), F⟩ = Σ_{k=1}^r G(k, e_i + e_j) F_{0kt} − F_{ijt} = 0,  t = 0, 1, . . . , n,

for all i ∈ [r], j ∈ [r+1, n]. The existence and uniqueness of the generating matrix G is shown as follows.
Theorem 3.1. Suppose F has the decomposition

(3.4)  F = λ_1 (1, u_1)^{⊗3} + · · · + λ_r (1, u_r)^{⊗3},

for vectors u_i ∈ C^n and scalars 0 ≠ λ_i ∈ C. If the subvectors (u_1)_{1:r}, . . . , (u_r)_{1:r} are linearly independent, then there exists a unique generating matrix G ∈ C^{[r]×B_1} satisfying (3.3) for the tensor F.

Proof. We first prove the existence. For each i = 1, . . . , r, denote the vectors v_i := (u_i)_{1:r}. Under the given assumption, V := [v_1 . . . v_r] is an invertible matrix. For each l = r+1, . . . , n, let

(3.5)  N_l := V · diag((u_1)_l, . . . , (u_r)_l) · V^{-1}.

Then N_l v_i = (u_i)_l v_i for i = 1, . . . , r, i.e., N_l has eigenvalues (u_1)_l, . . . , (u_r)_l with corresponding eigenvectors (u_1)_{1:r}, . . . , (u_r)_{1:r}. We select G ∈ C^{[r]×B_1} to be the matrix such that

(3.6)  N_l = [ G(1, e_1+e_l)  · · ·  G(r, e_1+e_l)
                     ⋮          ⋱          ⋮
               G(1, e_r+e_l)  · · ·  G(r, e_r+e_l) ],   l = r+1, . . . , n.
For each s = 1, . . . , r and α = e_i + e_j ∈ B_1 with i ∈ [r], j ∈ [r+1, n],

    ϕ_{ij}[G](u_s) = Σ_{k=1}^r G(k, e_i + e_j)(u_s)_k − (u_s)_i (u_s)_j = 0.

For each t = 0, 1, . . . , n, it holds that

    ⟨x_t ϕ_{ij}[G](x), F⟩ = ⟨ Σ_{k=1}^r G(k, e_i+e_j) x_t x_k − x_t x_i x_j, F ⟩
      = ⟨ Σ_{k=1}^r G(k, e_i+e_j) x_t x_k − x_t x_i x_j, Σ_{s=1}^r λ_s (1, u_s)^{⊗3} ⟩
      = Σ_{k=1}^r G(k, e_i+e_j) Σ_{s=1}^r λ_s (u_s)_t (u_s)_k − Σ_{s=1}^r λ_s (u_s)_t (u_s)_i (u_s)_j
      = Σ_{s=1}^r λ_s (u_s)_t ( Σ_{k=1}^r G(k, e_i+e_j)(u_s)_k − (u_s)_i (u_s)_j ) = 0.

Therefore, the matrix G satisfies (3.3) and it is a generating matrix for F.

Second, we prove the uniqueness of such a G. For each α = e_i + e_j ∈ B_1, let

    F_1 := [ F_{011}  · · ·  F_{01r}
               ⋮       ⋱       ⋮
             F_{0n1}  · · ·  F_{0nr} ],   g_{ij} := [ F_{1ij} ; ⋮ ; F_{nij} ].

Since G satisfies (3.3), we have F_1 · G(:, e_i + e_j) = g_{ij}. The decomposition (3.4) implies that

    F_1 = [u_1 · · · u_r] · diag(λ_1, . . . , λ_r) · [v_1 · · · v_r]^T.

The sets {v_1, . . . , v_r} and {u_1, . . . , u_r} are both linearly independent. Since each λ_i ≠ 0, the matrix F_1 has full column rank. Hence, the generating matrix G satisfying F_1 · G(:, e_i + e_j) = g_{ij} for all i ∈ [r], j ∈ [r+1, n] is unique. □

The following is an example of generating matrices.
Example 3.2.
Consider the tensor F ∈ S³(C⁶) that is given as

    F = 0.4 · (1, 1, 1, 1, 1, 1)^{⊗3} + 0.6 · (1, −1, 2, −1, 2, 3)^{⊗3}.

The rank r = 2, B_0 = {x_1, x_2} and B_1 = {x_1x_3, x_1x_4, x_1x_5, x_2x_3, x_2x_4, x_2x_5}. We have the vectors

    u_1 = (1, 1, 1, 1, 1),  u_2 = (−1, 2, −1, 2, 3),  v_1 = (1, 1),  v_2 = (−1, 2).

The matrices N_3, N_4, N_5 as in (3.5) are

    N_3 = [1 −1; 1 2] · [1 0; 0 −1] · [1 −1; 1 2]^{-1} = [1/3  2/3; 4/3  −1/3],
    N_4 = [1 −1; 1 2] · [1 0; 0 2] · [1 −1; 1 2]^{-1} = [4/3  −1/3; −2/3  5/3],
    N_5 = [1 −1; 1 2] · [1 0; 0 3] · [1 −1; 1 2]^{-1} = [5/3  −2/3; −4/3  7/3].
The entries of the generating matrix G are listed as below:

(3.7)
    k \ (i,j) | (1,3)   (1,4)   (1,5)   (2,3)   (2,4)   (2,5)
        1     |  1/3     4/3     5/3     4/3    −2/3    −4/3
        2     |  2/3    −1/3    −2/3    −1/3     5/3     7/3

The generating polynomials in (3.2) are

    ϕ_{13}[G](x) = (1/3)x_1 + (2/3)x_2 − x_1x_3,    ϕ_{23}[G](x) = (4/3)x_1 − (1/3)x_2 − x_2x_3,
    ϕ_{14}[G](x) = (4/3)x_1 − (1/3)x_2 − x_1x_4,    ϕ_{24}[G](x) = −(2/3)x_1 + (5/3)x_2 − x_2x_4,
    ϕ_{15}[G](x) = (5/3)x_1 − (2/3)x_2 − x_1x_5,    ϕ_{25}[G](x) = −(4/3)x_1 + (7/3)x_2 − x_2x_5.

The above generating polynomials can be written in the following form:

    [ ϕ_{1j}[G](x) ; ϕ_{2j}[G](x) ] = N_j [ x_1 ; x_2 ] − x_j [ x_1 ; x_2 ],  for j = 3, 4, 5.

For x to be a common zero of ϕ_{1j}[G](x) and ϕ_{2j}[G](x), it requires that (x_1, x_2) is an eigenvector of N_j with the corresponding eigenvalue x_j.
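The quantities in Example 3.2 can be verified numerically. The following MATLAB sketch (ours, not from the paper) rebuilds N_3, N_4, N_5 from (3.5) and checks that every generating polynomial ϕ_{ij}[G] vanishes at the points u_1 and u_2, as required in the proof of Theorem 3.1.

    V = [1 -1; 1 2];                      % columns v_1 = (1,1), v_2 = (-1,2)
    w1 = [1 1 1];  w2 = [-1 2 3];         % (u_1)_{3:5} and (u_2)_{3:5}
    N = cell(1, 3);
    for j = 1:3                           % N_{j+2} = V*diag(...)*V^{-1}, see (3.5)
        N{j} = V * diag([w1(j), w2(j)]) / V;
    end
    U = {[1 1 1 1 1], [-1 2 -1 2 3]};     % the points u_1, u_2
    for j = 1:3
        for s = 1:2
            u = U{s};
            for i = 1:2                   % phi_{i,j+2}[G](u_s) from (3.2)
                val = N{j}(i, :) * u(1:2)' - u(i) * u(j + 2);
                assert(abs(val) < 1e-12);
            end
        end
    end
    disp([N{1}, N{2}, N{3}])              % matches N_3, N_4, N_5 above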
3.1. Computing the tensor decomposition. We show how to find an incomplete tensor decomposition (3.4) for F when only its subtensor F_Ω is given, where the label set Ω is as in (1.5). Suppose that there exists the decomposition (3.4) for F, for vectors u_i ∈ C^n and nonzero scalars λ_i ∈ C. Assume the subvectors (u_1)_{1:r}, . . . , (u_r)_{1:r} are linearly independent, so there is a unique generating matrix G for F, by Theorem 3.1.

For each α = e_i + e_j ∈ B_1 with i ∈ [r], j ∈ [r+1, n] and for each l = r+1, . . . , j−1, j+1, . . . , n, the generating matrix G satisfies the equations

(3.8)  ⟨x_l ( Σ_{k=1}^r G(k, e_i+e_j) x_k − x_i x_j ), F⟩ = Σ_{k=1}^r G(k, e_i+e_j) F_{0kl} − F_{ijl} = 0.

Let the matrix A_{ij}[F] ∈ C^{(n−r−1)×r} and the vector b_{ij}[F] ∈ C^{n−r−1} be such that

(3.9)  A_{ij}[F] := [ F_{0,1,r+1}  · · ·  F_{0,r,r+1}
                          ⋮          ⋱         ⋮
                      F_{0,1,j−1}  · · ·  F_{0,r,j−1}
                      F_{0,1,j+1}  · · ·  F_{0,r,j+1}
                          ⋮          ⋱         ⋮
                      F_{0,1,n}    · · ·  F_{0,r,n}  ],
       b_{ij}[F] := [ F_{i,j,r+1} ; ⋮ ; F_{i,j,j−1} ; F_{i,j,j+1} ; ⋮ ; F_{i,j,n} ].

The equations in (3.8) can be equivalently written as

(3.10)  A_{ij}[F] · G(:, e_i + e_j) = b_{ij}[F].

If the rank r ≤ d/2 − 1, then n − r − 1 = d − r − 2 ≥ r. Thus, the number of rows is not less than the number of columns for the matrices A_{ij}[F]. If A_{ij}[F] has linearly independent columns, then (3.10) uniquely determines G(:, α). For such a case, the matrix G can be fully determined by the linear system (3.10). Let N_{r+1}(G), . . . , N_n(G) ∈ C^{r×r} be the matrices given as

(3.11)  N_l(G) = [ G(1, e_1+e_l)  · · ·  G(r, e_1+e_l)
                         ⋮          ⋱          ⋮
                   G(1, e_r+e_l)  · · ·  G(r, e_r+e_l) ],   l = r+1, . . . , n.

As in the proof of Theorem 3.1, one can see that

(3.12)  N_l(G) · ( (u_i)_1, . . . , (u_i)_r )^T = (u_i)_l · ( (u_i)_1, . . . , (u_i)_r )^T,  l = r+1, . . . , n.

The above is equivalent to the equations

    N_l(G) v_i = (w_i)_{l−r} · v_i,  l = r+1, . . . , n,

for the vectors (i = 1, . . . , r)

(3.13)  v_i := (u_i)_{1:r},  w_i := (u_i)_{r+1:n}.

Each v_i is a common eigenvector of the matrices N_{r+1}(G), . . . , N_n(G) and (w_i)_{l−r} is the associated eigenvalue of N_l(G). These matrices may or may not have repeated eigenvalues. Therefore, we select a generic vector ξ := (ξ_{r+1}, · · · , ξ_n) and let

(3.14)  N(ξ) := ξ_{r+1} N_{r+1}(G) + · · · + ξ_n N_n(G).

The eigenvalues of N(ξ) are ξ^T w_1, . . . , ξ^T w_r. When w_1, . . . , w_r are distinct from each other and ξ is generic, the matrix N(ξ) does not have a repeated eigenvalue and hence it has unique eigenvectors v_1, . . . , v_r, up to scaling. Let ṽ_1, . . . , ṽ_r be unit length eigenvectors of N(ξ). They are also common eigenvectors of N_{r+1}(G), . . . , N_n(G). For each i = 1, . . . , r, let w̃_i be the vector such that its jth entry (w̃_i)_j is the eigenvalue of N_{j+r}(G), associated to the eigenvector ṽ_i, or equivalently,

(3.15)  w̃_i = ( ṽ_i^H N_{r+1}(G) ṽ_i, · · · , ṽ_i^H N_n(G) ṽ_i ),  i = 1, . . . , r.

Up to a permutation of (ṽ_1, . . . , ṽ_r), there exist scalars γ_i such that

(3.16)  v_i = γ_i ṽ_i,  w_i = w̃_i.

The tensor decomposition of F can also be written as

    F = λ_1 (1, γ_1 ṽ_1, w̃_1)^{⊗3} + · · · + λ_r (1, γ_r ṽ_r, w̃_r)^{⊗3}.

The scalars λ_1, · · · , λ_r and γ_1, · · · , γ_r satisfy the linear equations

    λ_1γ_1 ṽ_1 ⊗ w̃_1 + · · · + λ_rγ_r ṽ_r ⊗ w̃_r = F_{[0, 1:r, r+1:n]},
    λ_1γ_1² ṽ_1 ⊗ ṽ_1 ⊗ w̃_1 + · · · + λ_rγ_r² ṽ_r ⊗ ṽ_r ⊗ w̃_r = F_{[1:r, 1:r, r+1:n]}.

Denote the label sets

(3.17)  J_1 := { (0, i_2, i_3) : i_2 ∈ [r], i_3 ∈ [r+1, n] },
        J_2 := { (i_1, i_2, i_3) : i_1 ≠ i_2, i_1, i_2 ∈ [r], i_3 ∈ [r+1, n] }.

To determine the scalars λ_i, γ_i, we can solve the linear least squares

(3.18)  min_{(β_1,...,β_r)}  ‖ F_{J_1} − Σ_{i=1}^r β_i · ṽ_i ⊗ w̃_i ‖²,

(3.19)  min_{(θ_1,...,θ_r)}  ‖ F_{J_2} − Σ_{k=1}^r θ_k · (ṽ_k ⊗ ṽ_k ⊗ w̃_k)_{J_2} ‖².

Let (β_1*, . . . , β_r*), (θ_1*, . . . , θ_r*) be minimizers of (3.18) and (3.19) respectively. Then, for each i = 1, . . . , r, let

(3.20)  λ_i := (β_i*)²/θ_i*,  γ_i := θ_i*/β_i*.

For the vectors (i = 1, . . . , r)

    p_i := (λ_i)^{1/3} · (1, γ_i ṽ_i, w̃_i),

the sum p_1^{⊗3} + · · · + p_r^{⊗3} is a tensor decomposition for F. This is justified in the following theorem.
Theorem 3.3. Suppose the tensor F has the decomposition as in (3.4). Assume that the vectors v_1, . . . , v_r are linearly independent and the vectors w_1, . . . , w_r are distinct from each other. Let ξ be a generically chosen coefficient vector and let p_1, . . . , p_r be the vectors produced as above. Then we must have the tensor decomposition F = p_1^{⊗3} + · · · + p_r^{⊗3}.

Proof. By Theorem 3.1, there is a unique generating matrix G for F satisfying (3.3). Under the given assumptions, the equation (3.10) uniquely determines G. Note that ξ^T w_1, . . . , ξ^T w_r are the eigenvalues of N(ξ) and v_1, . . . , v_r are the corresponding eigenvectors. When ξ is generically chosen, the values ξ^T w_1, . . . , ξ^T w_r are distinct eigenvalues of N(ξ). So N(ξ) has a unique eigenvalue decomposition, and hence (3.16) must hold, up to a permutation of (v_1, . . . , v_r). Then the conclusion follows from the uniqueness of solutions to linear least squares, when the coefficient matrices have full column ranks. □

The following is the algorithm for computing an incomplete tensor decomposition for F when only its subtensor F_Ω is given.

Algorithm 3.4. (Incomplete symmetric tensor decompositions.)
Input: A third order symmetric subtensor F_Ω and a rank r ≤ d/2 − 1.
1. Determine the matrix G by solving (3.10) for each α = e_i + e_j ∈ B_1.
2. Let N(ξ) be the matrix as in (3.14), for a randomly selected vector ξ. Compute the unit length eigenvectors ṽ_1, . . . , ṽ_r of N(ξ) and choose w̃_i as in (3.15).
3. Solve the linear least squares (3.18) and (3.19) to get the coefficients λ_i, γ_i as in (3.20).
4. For each i = 1, . . . , r, let p_i := (λ_i)^{1/3} · (1, γ_i ṽ_i, w̃_i).
Output: The tensor decomposition F = (p_1)^{⊗3} + · · · + (p_r)^{⊗3}.
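In MATLAB, Algorithm 3.4 can be sketched as follows. This is our code, not the authors' implementation; the function name and variable names are hypothetical. The tensor F is stored with 1-based MATLAB indices, so the paper's label k corresponds to index k+1 here, and only entries with three distinct indices (the set Ω) are ever read. With noisy input the backslash operator solves the corresponding least squares (4.1), so the same code also serves as steps 1-4 of Algorithm 4.1 below.

    % save as incomplete_sym_decomp.m
    function P = incomplete_sym_decomp(F, r)
        d = size(F, 1);  n = d - 1;
        % Step 1: solve (3.10) for each column of G; store the rows of N_l(G).
        Nl = zeros(r, r, n - r);                   % N_{r+1}(G), ..., N_n(G)
        for j = r+1 : n                            % label j in [r+1, n]
            ls = setdiff(r+1 : n, j);              % labels l indexing rows of A_ij[F]
            A = zeros(numel(ls), r);
            for t = 1:numel(ls)
                for k = 1:r
                    A(t, k) = F(1, k+1, ls(t)+1);  % F_{0,k,l}
                end
            end
            for i = 1:r                            % label i in [r]
                b = zeros(numel(ls), 1);
                for t = 1:numel(ls)
                    b(t) = F(i+1, j+1, ls(t)+1);   % F_{i,j,l}
                end
                Nl(i, :, j - r) = (A \ b).';       % row i of N_j(G), see (3.11)
            end
        end
        % Step 2: eigenvectors of a generic combination N(xi), as in (3.14).
        xi = randn(n - r, 1);
        Nxi = zeros(r);
        for l = 1 : n - r
            Nxi = Nxi + xi(l) * Nl(:, :, l);
        end
        [V, ~] = eig(Nxi);                         % unit length v~_1, ..., v~_r
        W = zeros(n - r, r);                       % columns are w~_i from (3.15)
        for i = 1:r
            for l = 1 : n - r
                W(l, i) = V(:, i)' * Nl(:, :, l) * V(:, i);
            end
        end
        % Step 3: the linear least squares (3.18) and (3.19).
        m = n - r;
        A1 = zeros(r * m, r);         b1 = zeros(r * m, 1);
        A2 = zeros(r * (r-1) * m, r); b2 = zeros(r * (r-1) * m, 1);
        s1 = 0;  s2 = 0;
        for i2 = 1:r
            for i3 = 1:m
                s1 = s1 + 1;
                A1(s1, :) = V(i2, :) .* W(i3, :);
                b1(s1) = F(1, i2+1, i3+r+1);       % F_{0,i2,i3+r}
                for i1 = 1:r
                    if i1 ~= i2
                        s2 = s2 + 1;
                        A2(s2, :) = V(i1, :) .* V(i2, :) .* W(i3, :);
                        b2(s2) = F(i1+1, i2+1, i3+r+1);
                    end
                end
            end
        end
        beta = A1 \ b1;   theta = A2 \ b2;
        lambda = beta.^2 ./ theta;  gamma = theta ./ beta;   % (3.20)
        % Step 4: p_i = lambda_i^{1/3} * (1, gamma_i*v~_i, w~_i).
        P = zeros(d, r);
        for i = 1:r
            P(:, i) = lambda(i)^(1/3) * [1; gamma(i) * V(:, i); W(:, i)];
        end
    end

For the tensor in Example 3.2 below, calling this function with r = 2 recovers the vectors 0.4^{1/3}·(1,1,1,1,1,1) and 0.6^{1/3}·(1,−1,2,−1,2,3), up to multiplication by cube roots of unity and permutation.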
The following is an example of applying Algorithm 3.4.

Example 3.5. Consider the same tensor F as in Example 3.2. The monomial sets B_0, B_1 are the same. The matrices A_{ij}[F] and vectors b_{ij}[F] are

    A_{13}[F] = A_{23}[F] = [−0.8  2.8; −1.4  4],    b_{13}[F] = [1.6; 2.2],   b_{23}[F] = [−2; −3.2],
    A_{14}[F] = A_{24}[F] = [1  −0.8; −1.4  4],      b_{14}[F] = [1.6; −3.2],  b_{24}[F] = [−2; 7.6],
    A_{15}[F] = A_{25}[F] = [1  −0.8; −0.8  2.8],    b_{15}[F] = [2.2; −3.2],  b_{25}[F] = [−3.2; 7.6].

Solve (3.10) to obtain G, which is the same as in (3.7). The matrices N_3(G), N_4(G), N_5(G) are

    N_3(G) = [1/3  2/3; 4/3  −1/3],  N_4(G) = [4/3  −1/3; −2/3  5/3],  N_5(G) = [5/3  −2/3; −4/3  7/3].

Choose a generic ξ, say, ξ = (3, 4, 5), then

    N(ξ) = [1/√2  −1/√5; 1/√2  2/√5] · [12  0; 0  20] · [1/√2  −1/√5; 1/√2  2/√5]^{-1}.

The unit length eigenvectors are

    ṽ_1 = (1/√2, 1/√2),  ṽ_2 = (−1/√5, 2/√5).

As in (3.15), we get the vectors

    w̃_1 = (1, 1, 1),  w̃_2 = (−1, 2, 3).

Solving (3.18) and (3.19), we get the scalars

    γ_1 = √2,  γ_2 = √5,  λ_1 = 0.4,  λ_2 = 0.6.

This produces the decomposition F = λ_1 u_1^{⊗3} + λ_2 u_2^{⊗3} for the vectors

    u_1 = (1, γ_1 ṽ_1, w̃_1) = (1, 1, 1, 1, 1, 1),  u_2 = (1, γ_2 ṽ_2, w̃_2) = (1, −1, 2, −1, 2, 3).

Algorithm 3.4 also requires to know the value of r, which can be selected as in the following remark.
Remark 3.6. Let Flat(F) ∈ C^{(n+1)×(n+1)²} be the flattening matrix, labelled by (i, (j,k)) such that

    Flat(F)_{i,(j,k)} = F_{i,j,k}  for all i, j, k = 0, 1, . . . , n.

It was shown that rank(Flat(F)) = rank_S(F) when the vectors p_1, . . . , p_r are generic and r ≤ d (see [38, Lemma 2.1]). The rank of Flat(F) is not available since only the subtensor (F)_Ω is known. However, we can calculate the ranks of submatrices of (F)_Ω whose entries are known. To see this, we consider a tensor F ∈ S³(C⁷) as an example. The slice (F_{0,j,k})_{0≤j,k≤6} of the flattening matrix Flat(F) is

(3.21)
    [ ∗    ∗        ∗        ∗        ∗        ∗        ∗
      ∗    ∗        F_{012}  F_{013}  F_{014}  F_{015}  F_{016}
      ∗    F_{012}  ∗        F_{023}  F_{024}  F_{025}  F_{026}
      ∗    F_{013}  F_{023}  ∗        F_{034}  F_{035}  F_{036}
      ∗    F_{014}  F_{024}  F_{034}  ∗        F_{045}  F_{046}
      ∗    F_{015}  F_{025}  F_{035}  F_{045}  ∗        F_{056}
      ∗    F_{016}  F_{026}  F_{036}  F_{046}  F_{056}  ∗       ],

where each ∗ means that the entry is not given. The largest submatrices with known entries are of size 3 × 3, for instance,

    [ F_{145}  F_{146}  F_{156}          [ F_{412}  F_{413}  F_{423}
      F_{245}  F_{246}  F_{256}   and      F_{512}  F_{513}  F_{523}
      F_{345}  F_{346}  F_{356} ]          F_{612}  F_{613}  F_{623} ].

One can use the biggest rank of these submatrices to estimate r when it is small.
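Remark 3.6 can be implemented as follows (our MATLAB sketch, with hypothetical names): split the labels into two halves and compute the rank of a fully-known submatrix of Flat(F).

    function r = estimate_rank(F, tol)
    % F is d x d x d with 1-based indices; only entries with three distinct
    % indices are assumed known, and the chosen submatrix uses only those.
        d = size(F, 1);  s = floor(d / 2);
        I = 1:s;  J = s+1:d;                 % split of the labels
        pairs = nchoosek(J, 2);              % column pairs (j, k) with j < k
        M = zeros(numel(I), size(pairs, 1));
        for a = 1:numel(I)
            for b = 1:size(pairs, 1)
                M(a, b) = F(I(a), pairs(b,1), pairs(b,2));  % indices distinct
            end
        end
        r = rank(M, tol);                    % estimates r when r is small
    end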
4. Tensor Approximations and Stability Analysis

In some applications, we do not have the subtensor F_Ω exactly but only have an approximation F̂_Ω for it. Algorithm 3.4 can still provide a good rank-r approximation for F when it is applied to F̂_Ω. We define the matrix A_{ij}[F̂] and the vector b_{ij}[F̂] in the same way as in (3.9), for each α = e_i + e_j ∈ B_1. The generating matrix G for F can be approximated by solving the linear least squares

(4.1)  min_{g ∈ C^r}  ‖ A_{ij}[F̂] · g − b_{ij}[F̂] ‖²,

for each α = e_i + e_j ∈ B_1. Let Ĝ(:, e_i + e_j) be the optimizer of the above and Ĝ be the matrix consisting of all such Ĝ(:, e_i + e_j). Then Ĝ is an approximation for G. For each l = r+1, . . . , n, define the matrix N_l(Ĝ) similarly as in (3.11). Choose a generic vector ξ = (ξ_{r+1}, . . . , ξ_n) and let

(4.2)  N̂(ξ) := ξ_{r+1} N_{r+1}(Ĝ) + · · · + ξ_n N_n(Ĝ).

The matrix N̂(ξ) is an approximation for N(ξ). Let v̂_1, . . . , v̂_r be unit length eigenvectors of N̂(ξ). For k = 1, . . . , r, let

(4.3)  ŵ_k := ( v̂_k^H N_{r+1}(Ĝ) v̂_k, . . . , v̂_k^H N_n(Ĝ) v̂_k ).

For the label sets J_1, J_2 as in (3.17), the subtensors F̂_{J_1}, F̂_{J_2} are defined similarly to F_{J_1}, F_{J_2}. Consider the following linear least squares:

(4.4)  min_{(β_1,...,β_r)}  ‖ F̂_{J_1} − Σ_{k=1}^r β_k · v̂_k ⊗ ŵ_k ‖²,

(4.5)  min_{(θ_1,...,θ_r)}  ‖ F̂_{J_2} − Σ_{k=1}^r θ_k · (v̂_k ⊗ v̂_k ⊗ ŵ_k)_{J_2} ‖².

Let (β̂_1, . . . , β̂_r) and (θ̂_1, . . . , θ̂_r) be their optimizers respectively. For each k = 1, . . . , r, let

(4.6)  λ̂_k := (β̂_k)²/θ̂_k,  γ̂_k := θ̂_k/β̂_k.

This results in the tensor approximation
    F ≈ (p̂_1)^{⊗3} + · · · + (p̂_r)^{⊗3},

for the vectors p̂_k := (λ̂_k)^{1/3} · (1, γ̂_k v̂_k, ŵ_k). The above may not give an optimal tensor approximation. To get an improved one, we can use p̂_1, . . . , p̂_r as starting points to solve the following nonlinear optimization:

(4.7)  min_{(q_1,...,q_r)}  ‖ ( Σ_{k=1}^r (q_k)^{⊗3} − F̂ )_Ω ‖².

The minimizer of the optimization (4.7) is denoted as (p_1*, . . . , p_r*). Summarizing the above, we have the following algorithm for computing a tensor approximation.

Algorithm 4.1. (Incomplete symmetric tensor approximations.)
Input: A third order symmetric subtensor F̂_Ω and a rank r ≤ d/2 − 1.
1. Determine the matrix Ĝ by solving (4.1) for each α = e_i + e_j ∈ B_1.
2. Choose a generic vector ξ and let N̂(ξ) be the matrix as in (4.2). Compute unit length eigenvectors v̂_1, . . . , v̂_r for N̂(ξ) and define ŵ_i as in (4.3).
3. Solve the linear least squares (4.4), (4.5) to get the coefficients λ̂_i, γ̂_i.
4. For each i = 1, . . . , r, let p̂_i := (λ̂_i)^{1/3} · (1, γ̂_i v̂_i, ŵ_i). Then (p̂_1)^{⊗3} + · · · + (p̂_r)^{⊗3} is a tensor approximation for F̂.
5. Use p̂_1, . . . , p̂_r as starting points to solve the nonlinear optimization (4.7) for an optimizer (p_1*, . . . , p_r*).
Output: The tensor approximation (p_1*)^{⊗3} + · · · + (p_r*)^{⊗3} for F̂.
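Step 5 of Algorithm 4.1 can be set up with the MATLAB function lsqnonlin mentioned in Section 6. The following sketch (ours; the function name is hypothetical) minimizes the residual of (4.7) over the entries in Ω. It assumes a real starting point Q0 whose columns are p̂_1, . . . , p̂_r, since lsqnonlin does not optimize over complex variables directly.

    function Qstar = refine_on_omega(Fhat, Q0)
        d = size(Fhat, 1);
        [I1, I2, I3] = ndgrid(1:d, 1:d, 1:d);
        mask = (I1 ~= I2) & (I2 ~= I3) & (I1 ~= I3);   % the label set Omega
        Qstar = lsqnonlin(@residual, Q0);              % columns are p*_1, ..., p*_r
        function res = residual(Q)
            T = zeros(d, d, d);
            for k = 1:size(Q, 2)
                q = Q(:, k);
                T = T + reshape(kron(q, kron(q, q)), d, d, d);
            end
            res = T(mask) - Fhat(mask);                % residual of (4.7) on Omega
        end
    end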
When F̂ is close to F, Algorithm 4.1 also produces a good rank-r tensor approximation for F. This is shown in the following theorem.

Theorem 4.2. Suppose the tensor F = (p_1)^{⊗3} + · · · + (p_r)^{⊗3}, with r ≤ d/2 − 1, satisfies the following conditions:
(i) the leading entry of each p_i is nonzero;
(ii) the subvectors (p_1)_{2:r+1}, . . . , (p_r)_{2:r+1} are linearly independent;
(iii) the subvectors (p_1)_{[r+2:j, j+2:d]}, . . . , (p_r)_{[r+2:j, j+2:d]} are linearly independent for each j ∈ [r+1, n];
(iv) the eigenvalues of the matrix N(ξ) in (3.14) are distinct from each other.
Let p̂_i, p_i* be the vectors produced by Algorithm 4.1. If the distance ε := ‖(F̂ − F)_Ω‖ is small enough, then there exist scalars τ̂_i, τ_i* such that

    (τ̂_i)³ = (τ_i*)³ = 1,  ‖τ̂_i p̂_i − p_i‖ = O(ε),  ‖τ_i* p_i* − p_i‖ = O(ε),

up to a permutation of (p_1, . . . , p_r), where the constants inside O(·) only depend on F and the choice of ξ in Algorithm 4.1.

Proof. The conditions (i)-(ii), by Theorem 3.1, imply that there is a unique generating matrix G for F. The matrix G can be approximated by solving the linear least squares problems (4.1). Note that

    ‖A_{ij}[F̂] − A_{ij}[F]‖ ≤ ε,  ‖b_{ij}[F̂] − b_{ij}[F]‖ ≤ ε,

for all α = e_i + e_j ∈ B_1. The matrix A_{ij}[F] can be written as

    A_{ij}[F] = [(p_1)_{[r+2:j, j+2:d]}, . . . , (p_r)_{[r+2:j, j+2:d]}] · diag((p_1)_1, . . . , (p_r)_1) · [(p_1)_{2:r+1}, . . . , (p_r)_{2:r+1}]^T.

By the conditions (i)-(iii), the matrix A_{ij}[F] has full column rank for each j ∈ [r+1, n] and hence the matrix A_{ij}[F̂] has full column rank when ε is small enough. Therefore, the linear least squares problems (4.1) have unique solutions and the solution Ĝ satisfies

    ‖Ĝ − G‖ = O(ε),

where O(ε) depends on F (see [12, Theorem 3.4]). For each j = r+1, . . . , n, N_j(Ĝ) is part of the generating matrix Ĝ, so

    ‖N_j(Ĝ) − N_j(G)‖ ≤ ‖Ĝ − G‖ = O(ε),  j = r+1, . . . , n.

This implies that ‖N̂(ξ) − N(ξ)‖ = O(ε). When ε is small enough, the matrix N̂(ξ) does not have repeated eigenvalues, due to the condition (iv). Thus, the matrix N(ξ) has a set of unit length eigenvectors ṽ_1, . . . , ṽ_r, with the corresponding vectors w̃_1, . . . , w̃_r as in (3.15), such that

    ‖v̂_i − ṽ_i‖ = O(ε),  ‖ŵ_i − w̃_i‖ = O(ε).

This follows from Proposition 4.2.1 in [6]. The constants inside the above O(·) depend only on F and ξ. The w̃_1, . . . , w̃_r are scalar multiples of the linearly independent vectors (p_1)_{r+2:d}, . . . , (p_r)_{r+2:d} respectively, so w̃_1, . . . , w̃_r are linearly independent. When ε is small, ŵ_1, . . . , ŵ_r are linearly independent as well. The scalars λ̂_iγ̂_i and λ̂_i(γ̂_i)² are optimizers for the linear least squares problems (4.4) and (4.5). By Theorem 3.4 in [12], we have

    |λ̂_iγ̂_i − λ_iγ_i| = O(ε),  |λ̂_i(γ̂_i)² − λ_i(γ_i)²| = O(ε).

The vector p_i can be written as p_i = (λ_i)^{1/3}(1, γ_iṽ_i, w̃_i), so we must have λ_i, γ_i ≠ 0 due to the condition (ii). Thus, it holds that

    |λ̂_i − λ_i| = O(ε),  |γ̂_i − γ_i| = O(ε),

where the constants inside O(·) depend only on F and ξ.
For the vectors p̃_i := (λ_i)^{1/3}(1, γ_iṽ_i, w̃_i), we have F = Σ_{i=1}^r p̃_i^{⊗3}, by Theorem 3.3. Since p_1, . . . , p_r are linearly independent by the assumption, the rank decomposition of F is unique up to scaling and permutation. There exist scalars τ̂_i such that (τ̂_i)³ = 1 and τ̂_i p̃_i = p_i, up to a permutation of p_1, . . . , p_r. For p̂_i = (λ̂_i)^{1/3}(1, γ̂_i v̂_i, ŵ_i), we have ‖τ̂_i p̂_i − p_i‖ = O(ε), where the constants in O(·) only depend on F and ξ.

Since ‖τ̂_i p̂_i − p_i‖ = O(ε), we have ‖(Σ_{i=1}^r (p̂_i)^{⊗3} − F)_Ω‖ = O(ε). The (p_1*, . . . , p_r*) is a minimizer of (4.7), so

    ‖( Σ_{i=1}^r (p_i*)^{⊗3} − F̂ )_Ω‖ ≤ ‖( Σ_{i=1}^r (p̂_i)^{⊗3} − F̂ )_Ω‖ = O(ε).

For the tensor F* := Σ_{i=1}^r (p_i*)^{⊗3}, we get

    ‖(F* − F)_Ω‖ ≤ ‖(F* − F̂)_Ω‖ + ‖(F̂ − F)_Ω‖ = O(ε).

When Algorithm 4.1 is applied to (F*)_Ω, Step 4 will give the exact decomposition F* = Σ_{i=1}^r (p_i*)^{⊗3}. By repeating the previous argument, we can similarly show that ‖p_i − τ_i* p_i*‖ = O(ε) for some τ_i* such that (τ_i*)³ = 1, where the constants in O(·) only depend on F and ξ. □

We would like to remark that when ε = 0, Algorithm 4.1 is the same as Algorithm 3.4, which produces the exact rank decomposition for F.
Example 4.3. We consider the same tensor F as in Example 3.2. The subtensor (F)_Ω is perturbed to (F̂)_Ω, where the perturbation is randomly generated from a Gaussian distribution with mean zero and small variance. We do not display the perturbed subtensor (F̂)_Ω here. We use Algorithm 4.1 to compute the incomplete tensor approximation. The matrices A_{ij}[F̂] and vectors b_{ij}[F̂] are defined as in (3.9); they are small perturbations of the matrices A_{ij}[F] and vectors b_{ij}[F] in Example 3.5. The linear least squares problems (4.1) are solved to obtain Ĝ and the matrices N_3(Ĝ), N_4(Ĝ), N_5(Ĝ), which are small perturbations of N_3(G), N_4(G), N_5(G) there. For ξ = (3, 4, 5), the matrix N̂(ξ) in (4.2) has unit length eigenvectors v̂_1, v̂_2 which, up to sign, are close to ṽ_1 = (1/√2, 1/√2) and ṽ_2 = (−1/√5, 2/√5). The vectors ŵ_1, ŵ_2 obtained as in (4.3) are close to w̃_1 = (1, 1, 1) and w̃_2 = (−1, 2, 3). By solving (4.4) and (4.5), we get the scalars γ̂_1, γ̂_2, λ̂_1, λ̂_2, for which

    γ̂_1 v̂_1 ≈ (1, 1),  γ̂_2 v̂_2 ≈ (−1, 2),  λ̂_1 ≈ 0.4,  λ̂_2 ≈ 0.6.

Finally, we get the decomposition λ̂_1 û_1^{⊗3} + λ̂_2 û_2^{⊗3} with

    û_1 = (1, γ̂_1 v̂_1, ŵ_1) ≈ (1, 1, 1, 1, 1, 1),  û_2 = (1, γ̂_2 v̂_2, ŵ_2) ≈ (1, −1, 2, −1, 2, 3).

They are pretty close to the decomposition of F.
5. Learning Diagonal Gaussian Mixture Models

We use the incomplete tensor decomposition or approximation method to learn parameters for Gaussian mixture models. Algorithms 3.4 and 4.1 can be applied to do that. Let y be the random variable of dimension d for a Gaussian mixture model, with r components of Gaussian distribution parameters (ω_i, µ_i, Σ_i), i = 1, . . . , r. We consider the case that r ≤ d/2 − 1.
Let y_1, . . . , y_N be samples drawn from the Gaussian mixture model. The sample average

    M̂_1 := (1/N)(y_1 + · · · + y_N)

is an estimation for the mean M_1 := E[y] = ω_1µ_1 + · · · + ω_rµ_r. The symmetric tensor

    M̂_3 := (1/N)(y_1^{⊗3} + · · · + y_N^{⊗3})

is an estimation for the third order moment tensor M_3 := E[y^{⊗3}]. Recall that F = Σ_{i=1}^r ω_i µ_i^{⊗3}. When all the covariance matrices Σ_i are diagonal, we have shown in (1.3) that

    M_3 = F + Σ_{j=1}^d ( a_j ⊗ e_j ⊗ e_j + e_j ⊗ a_j ⊗ e_j + e_j ⊗ e_j ⊗ a_j ).

If the labels i_1, i_2, i_3 are distinct from each other, then (M_3)_{i_1 i_2 i_3} = (F)_{i_1 i_2 i_3}. Recall the label set Ω in (1.5). It holds that (M_3)_Ω = (F)_Ω. Note that (M̂_3)_Ω is only an approximation for (M_3)_Ω and (F)_Ω, due to sampling errors.
If the rank r ≤ d/2 − 1, we can apply Algorithm 4.1 with the input (M̂_3)_Ω to compute a rank-r tensor approximation for F. Suppose the tensor approximation produced by Algorithm 4.1 is

    F ≈ (p_1*)^{⊗3} + · · · + (p_r*)^{⊗3}.

The computed p_1*, . . . , p_r* may not be real vectors, even if F is real. When the error ε := ‖(F − M̂_3)_Ω‖ is small, by Theorem 4.2, we know ‖τ_i* p_i* − (ω_i)^{1/3} µ_i‖ = O(ε) where (τ_i*)³ = 1. In computation, we can choose τ_i* such that (τ_i*)³ = 1 and the imaginary part vector Im(τ_i* p_i*) has the smallest norm. Then we get the real vector

    q̂_i := Re(τ_i* p_i*).

It is expected that q̂_i ≈ (ω_i)^{1/3} µ_i. Since

    M_1 = ω_1µ_1 + · · · + ω_rµ_r ≈ (ω_1)^{2/3} q̂_1 + · · · + (ω_r)^{2/3} q̂_r,

the scalars (ω_1)^{2/3}, . . . , (ω_r)^{2/3} can be obtained by solving the linear least squares

(5.1)  min_{(β_1,...,β_r)}  ‖ M̂_1 − Σ_{i=1}^r β_i q̂_i ‖².

Let (β_1*, . . . , β_r*) be an optimizer for the above, then ω̂_i := (β_i*)^{3/2} is a good approximation for ω_i and the vector µ̂_i := q̂_i/(ω̂_i)^{1/3} is a good approximation for µ_i. We may use

    µ̂_i,  ( Σ_{j=1}^r ω̂_j )^{-1} ω̂_i,  i = 1, . . . , r,

as starting points to solve the nonlinear optimization

(5.2)  min_{(ω_1,...,ω_r,µ_1,...,µ_r)}  ‖ Σ_{i=1}^r ω_iµ_i − M̂_1 ‖² + ‖ Σ_{i=1}^r ω_i(µ_i^{⊗3})_Ω − (M̂_3)_Ω ‖²
       subject to  ω_1 + · · · + ω_r = 1,  ω_1, . . . , ω_r ≥ 0,

for getting improved approximations. Suppose an optimizer of the above is (ω_1*, . . . , ω_r*, µ_1*, . . . , µ_r*).

Now we discuss how to estimate the diagonal covariance matrices Σ_i. Let

(5.3)  A := M_3 − F,  Â := M̂_3 − (q̂_1)^{⊗3} − · · · − (q̂_r)^{⊗3}.

By (1.3), we know that

(5.4)  A = Σ_{j=1}^d ( a_j ⊗ e_j ⊗ e_j + e_j ⊗ a_j ⊗ e_j + e_j ⊗ e_j ⊗ a_j ),

where a_j = Σ_{i=1}^r ω_i σ_{ij}² µ_i for j = 1, · · · , d. The equation (5.4) implies that

(5.5)  (a_j)_j = (1/3) A_{jjj},  (a_j)_i = A_{jij},

for i, j = 1, · · · , d and i ≠ j. So we choose vectors â_j ∈ R^d such that

(5.6)  (â_j)_j = (1/3) Â_{jjj},  (â_j)_i = Â_{jij}  for i ≠ j.

Since â_j ≈ Σ_{i=1}^r ω_i σ_{ij}² µ_i, the covariance matrices Σ_i = diag(σ_{i1}², . . . , σ_{id}²) can be estimated by solving the nonnegative linear least squares (j = 1, . . . , d)

(5.7)  min_{(β_{1j},...,β_{rj})}  ‖ â_j − Σ_{i=1}^r ω_i* µ_i* β_{ij} ‖²  subject to  β_{1j} ≥ 0, . . . , β_{rj} ≥ 0.

For each j, let (β_{1j}*, . . . , β_{rj}*) be the optimizer for the above. When (M̂_3)_Ω is close to (M_3)_Ω, it is expected that β_{ij}* is close to (σ_{ij})². Therefore, we can estimate the covariance matrices Σ_i as follows:

(5.8)  Σ_i* := diag(β_{i1}*, . . . , β_{id}*),  (σ_{ij}*)² := β_{ij}*.

The following is the algorithm for learning Gaussian mixture models.
Algorithm 5.1. (Learning diagonal Gaussian mixture models.)
Input: Samples {y_1, · · · , y_N} ⊆ R^d drawn from a Gaussian mixture model and the number r of component Gaussian distributions.
Step 1. Compute the sample averages M̂_1 := (1/N) Σ_{i=1}^N y_i and M̂_3 := (1/N) Σ_{i=1}^N y_i^{⊗3}.
Step 2. Apply Algorithm 4.1 to the subtensor (F̂)_Ω := (M̂_3)_Ω. Let (p_1*)^{⊗3} + · · · + (p_r*)^{⊗3} be the obtained rank-r tensor approximation for F̂. For each i = 1, . . . , r, let q̂_i := Re(τ_i p_i*), where τ_i is the cube root of 1 that minimizes the imaginary part vector norm ‖Im(τ_i p_i*)‖.
Step 3. Solve (5.1) to get ω̂_1, . . . , ω̂_r and µ̂_i = q̂_i/(ω̂_i)^{1/3}, i = 1, . . . , r.
Step 4. Use the above ω̂_i, µ̂_i as initial points to solve the nonlinear optimization (5.2) for the optimizers ω_i*, µ_i*, i = 1, . . . , r.
Step 5. Get the vectors â_1, . . . , â_d as in (5.6). Solve the optimization (5.7) to get optimizers β_{ij}* and then choose Σ_i* as in (5.8).
Output: Component Gaussian distribution parameters (µ_i*, Σ_i*, ω_i*), i = 1, . . . , r.

The sample averages M̂_1, M̂_3 can typically be used as good estimates for the true moments M_1, M_3. When the value of r is not known, it can be determined as in Remark 3.6.
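Step 5 of Algorithm 5.1 amounts to d small nonnegative least squares problems, which MATLAB's lsqnonneg can solve. The following sketch (ours; all names are hypothetical) takes the tensor Â from (5.3) and a matrix W whose ith column is ω_i*µ_i*, and returns the estimates β_{ij}* ≈ (σ_{ij})² from (5.7)-(5.8).

    function B = recover_covariances(Ahat, W)
    % Row i of B estimates (sigma_i1^2, ..., sigma_id^2), so Sigma*_i = diag(B(i,:)).
        [d, r] = size(W);
        B = zeros(r, d);
        for j = 1:d
            a = squeeze(Ahat(j, :, j));    % (5.6): (a^_j)_i = Ahat_{jij} for i ~= j
            a = a(:);
            a(j) = Ahat(j, j, j) / 3;      % (5.6): (a^_j)_j = Ahat_{jjj} / 3
            B(:, j) = lsqnonneg(W, a);     % the nonnegative least squares (5.7)
        end
    end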
The performance of Algorithm 5.1 is analyzed as follows.

Theorem 5.2. Consider the d-dimensional diagonal Gaussian mixture model with parameters {(ω_i, µ_i, Σ_i) : i ∈ [r]} and r ≤ d/2 − 1. Let {(ω_i*, µ_i*, Σ_i*) : i ∈ [r]} be produced by Algorithm 5.1. If the distance ε := max(‖M_1 − M̂_1‖, ‖M_3 − M̂_3‖) is small enough and the tensor F = Σ_{i=1}^r ω_iµ_i^{⊗3} satisfies the conditions of Theorem 4.2, then

    ‖µ_i − µ_i*‖ = O(ε),  |ω_i − ω_i*| = O(ε),  ‖Σ_i − Σ_i*‖ = O(ε),

where the above constants inside O(·) only depend on the parameters {(ω_i, µ_i, Σ_i) : i ∈ [r]} and the choice of ξ in Algorithm 5.1.

Proof. For the vectors p_i := (ω_i)^{1/3} µ_i, we have F = Σ_{i=1}^r p_i^{⊗3}. Since

    ‖(F − F̂)_Ω‖ = ‖(M_3 − M̂_3)_Ω‖ ≤ ε

and F satisfies the conditions of Theorem 4.2, we know ‖τ_i* p_i* − p_i‖ = O(ε) for some (τ_i*)³ = 1, by Theorem 4.2. The constants inside O(ε) depend on the parameters of the Gaussian model and ξ. Then, we have ‖Im(τ_i* p_i*)‖ = O(ε) since the vectors p_i are real. When ε is small enough, such a τ_i* is the τ_i in Step 2 of Algorithm 5.1 that minimizes ‖Im(τ_i p_i*)‖, so we have

    ‖q̂_i − p_i‖ ≤ ‖τ_i p_i* − p_i‖ = O(ε),

where q̂_i = Re(τ_i p_i*) is from Step 2. The vectors q̂_1, . . . , q̂_r are linearly independent when ε is small. Thus, the problem (5.1) has a unique solution and the weights ω̂_i can be found by solving (5.1). Since ‖M_1 − M̂_1‖ ≤ ε and ‖q̂_i − p_i‖ = O(ε), we have |ω_i − ω̂_i| = O(ε) (see [12, Theorem 3.4]). The mean vectors µ̂_i are obtained as µ̂_i = q̂_i/(ω̂_i)^{1/3}, so the approximation error is

    ‖µ_i − µ̂_i‖ = ‖p_i/(ω_i)^{1/3} − q̂_i/(ω̂_i)^{1/3}‖ = O(ε).

The constants inside the above O(ε) depend on the parameters of the Gaussian mixture model and ξ.

The problem (5.2) is solved to obtain ω_i* and µ_i*, so

    ‖M̂_1 − Σ_{i=1}^r ω_i*µ_i*‖² + ‖(F̂ − Σ_{i=1}^r ω_i*(µ_i*)^{⊗3})_Ω‖² = O(ε²).

Let F* := Σ_{i=1}^r ω_i*(µ_i*)^{⊗3} = Σ_{i=1}^r ((ω_i*)^{1/3} µ_i*)^{⊗3}, then

    ‖(F − F*)_Ω‖ ≤ ‖(F − F̂)_Ω‖ + ‖(F̂ − F*)_Ω‖ = O(ε).

Theorem 4.2 implies ‖p_i − (ω_i*)^{1/3} µ_i*‖ = O(ε). In addition, we have

    ‖M̂_1 − Σ_{i=1}^r (ω_i*)^{2/3} ((ω_i*)^{1/3} µ_i*)‖ = ‖M̂_1 − Σ_{i=1}^r ω_i*µ_i*‖ = O(ε).

The first order moment is M_1 = Σ_{i=1}^r (ω_i)^{2/3} p_i. Since ‖M_1 − M̂_1‖ = O(ε) and ‖p_i − (ω_i*)^{1/3} µ_i*‖ = O(ε), it holds that |(ω_i)^{2/3} − (ω_i*)^{2/3}| = O(ε) by [12, Theorem 3.4]. This implies that |ω_i − ω_i*| = O(ε), so

    ‖µ_i − µ_i*‖ = ‖p_i/(ω_i)^{1/3} − ((ω_i*)^{1/3} µ_i*)/(ω_i*)^{1/3}‖ = O(ε).
The constants inside the above O(·) only depend on the parameters {(ω_i, µ_i, Σ_i) : i ∈ [r]} and ξ.

The covariance matrices Σ_i are recovered by solving the linear least squares (5.7). In these least squares problems, it holds that ‖ω_iµ_i − ω_i*µ_i*‖ = O(ε) and

    ‖A − Â‖ ≤ ‖M_3 − M̂_3‖ + ‖F − Σ_{i=1}^r q̂_i^{⊗3}‖ = O(ε),

where the tensors A, Â are defined in (5.3). When the error ε is small, the vectors ω_1*µ_1*, . . . , ω_r*µ_r* are linearly independent and hence (5.7) has a unique solution for each j. By [12, Theorem 3.4], we have

    |(σ_{ij})² − (σ_{ij}*)²| = O(ε).

It implies that ‖Σ_i − Σ_i*‖ = O(ε), where the constants inside O(·) only depend on the parameters {(ω_i, µ_i, Σ_i) : i ∈ [r]} and ξ. □
6. Numerical Simulations

This section gives numerical experiments for our proposed methods. The computation is implemented in MATLAB R2019b, on an Alienware personal computer with Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz and 16.0G RAM. The MATLAB function lsqnonlin is used to solve (4.7) in Algorithm 4.1 and the MATLAB function fmincon is used to solve (5.2) in Algorithm 5.1. We compare our method with the classical EM algorithm, which is implemented by the MATLAB function fitgmdist (MaxIter is set to 100 and RegularizationValue is set to a small positive constant).

First, we explore the performance of Algorithm 4.1 for computing incomplete symmetric tensor approximations. For the dimension d and rank r, we get the tensor F = (p_1)^{⊗3} + · · · + (p_r)^{⊗3}, where each p_i is randomly generated according to the Gaussian distribution in MATLAB. Then, we apply the perturbation (F̂)_Ω = (F)_Ω + E_Ω, where E is a randomly generated tensor, also according to the Gaussian distribution in MATLAB, scaled so that the norm ‖E_Ω‖ = ε. After that, Algorithm 4.1 is applied to the subtensor (F̂)_Ω to find the rank-r tensor approximation. The approximation quality is measured by the relative error

    rel-error := ‖(F* − F̂)_Ω‖ / ‖(F − F̂)_Ω‖,

where F* is the output of Algorithm 4.1. For each case of (d, r, ε), we generate 100 random instances. The min, average, and max relative errors for each dimension d and rank r are reported in Table 1. The results show that Algorithm 4.1 performs well for the tensor approximation.
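One run of this experiment can be generated as follows (our MATLAB sketch, reusing the incomplete_sym_decomp and refine_on_omega functions sketched earlier; this is not the authors' test code).

    d = 20; r = 3; epsilon = 0.1;
    P = randn(d, r);
    F = zeros(d, d, d);
    for i = 1:r                            % F = p_1^{⊗3} + ... + p_r^{⊗3}
        F = F + reshape(kron(P(:,i), kron(P(:,i), P(:,i))), d, d, d);
    end
    [I1, I2, I3] = ndgrid(1:d, 1:d, 1:d);
    mask = (I1 ~= I2) & (I2 ~= I3) & (I1 ~= I3);
    E = randn(d, d, d);                    % perturbation (left unsymmetrized here)
    E = epsilon * E / norm(E(mask));       % scaled so that ||E_Omega|| = epsilon
    Fhat = F + E;
    Q0 = incomplete_sym_decomp(Fhat, r);   % steps 1-4 of Algorithm 4.1
    tau = exp(2i*pi*(0:2)/3);
    for i = 1:r                            % make each column real, as in Step 2
        [~, t] = min(vecnorm(imag(Q0(:,i) * tau), 2, 1));   % of Algorithm 5.1
        Q0(:,i) = real(Q0(:,i) * tau(t));
    end
    Qstar = refine_on_omega(Fhat, Q0);     % step 5, the optimization (4.7)
    Fstar = zeros(d, d, d);
    for i = 1:r
        q = Qstar(:, i);
        Fstar = Fstar + reshape(kron(q, kron(q, q)), d, d, d);
    end
    rel_error = norm(Fstar(mask) - Fhat(mask)) / norm(F(mask) - Fhat(mask))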
Table 1. The performance of Algorithm 4.1

                       rel-error        |                     rel-error
  d   r    ε      min    average   max  |  d   r    ε      min    average   max
 20   3   0.1    0.9610  0.9731  0.9835 | 30   4   0.1    0.9816  0.9854  0.9890
      5   0.01   0.9634  0.9700  0.9742 |      8   0.01   0.9634  0.9700  0.9742
      7   0.001  0.9148  0.9373  0.9525 |     11   0.001  0.9501  0.9587  0.9667
 40   6   0.1    0.9853  0.9877  0.9904 | 50   7   0.1    0.9887  0.9911  0.9925
     10   0.01   0.9761  0.9795  0.9820 |     13   0.01   0.9812  0.9831  0.9854
     15   0.001  0.9653  0.9690  0.9734 |     18   0.001  0.9739  0.9767  0.9792
Second, we explore the performance of Algorithm 5.1 for learning diagonal Gaussian mixture models. We compare it with the classical EM algorithm, for which the MATLAB function fitgmdist is used (MaxIter is set to 100 and RegularizationValue is set to a small positive constant). The dimensions d = 20, 30, 40, 50, 60 and the values of r listed in Table 2 are tested for each case of d. We generate 100 random instances of {(ω_i, µ_i, Σ_i) : i = 1, · · · , r} for d ∈ {20, 30, 40}, and 20 random instances for d ∈ {50, 60}, because of the relatively longer computational time for the latter cases. For each instance, 10000 samples are generated. Each sample is generated from one of the r component Gaussian distributions, so the samples are naturally separated into r groups. Algorithm 5.1 and the EM algorithm are applied to fit the Gaussian mixture model to the 10000 samples for each instance. For each sample, we calculate the likelihood of the sample for each component Gaussian distribution in the estimated Gaussian mixture model. A sample is classified to the ith group if its likelihood for the ith component is maximum. The classification accuracy is the rate at which samples are classified to the correct group. In Table 2, for each pair (d, r), we report the accuracy of Algorithm 5.1 in the first row and the accuracy of the EM algorithm in the second row. As one can see, Algorithm 5.1 performs better than the EM algorithm, and its accuracy is not affected when the dimensions and ranks increase. Indeed, as the difference between the dimension d and the rank r increases, Algorithm 5.1 becomes more and more accurate. This is opposite to the EM algorithm. The reason is that the difference between the number of rows and the number of columns of A_{ij}[F] in (3.9) increases as d − r becomes bigger, which makes the computation in Algorithm 5.1 more robust.
Table 2. Comparison between Algorithm 5.1 and EM for simulations

  Accuracy       |       d = 20          |       d = 30          |       d = 40
                 | r=3    r=5    r=7     | r=4    r=8    r=11    | r=6    r=10   r=15
  Algorithm 5.1  | 0.9861 0.9740 0.9659  | 0.9965 0.9923 0.9895  | 0.9990 0.9981 0.9971
  EM             | 0.9763 0.9400 0.9252  | 0.9684 0.9277 0.9219  | 0.9117 0.8931 0.9111

  Accuracy       |       d = 50          |       d = 60
                 | r=7    r=13   r=18    | r=8    r=15   r=22
  Algorithm 5.1  | 0.9997 0.9995 0.9993  | 0.9999 0.9998 0.9995
  EM             | 0.8997 0.9073 0.9038  | 0.8874 0.8632 0.8929
512 pixels from the VisTex database. They are converted intograyscale version. Each image is divided into subimages of 32 ×
32 pixels. Weperform the discrete cosine transformation(DCT) on each block of size 16 × d of the feature vector extracted from each subimage is 13.We use blocks extracted from the first 160 subimages for training and those fromthe rest 96 subimages for testing. We refer to [43] for more details. For each image,we apply Algorithm 5.1 and the EM algorithm to fit a Gaussian mixture model tothe image. We choose the number of components r according to Remark 3.6. Toclassify the test data, we follow the Bayes decision rule that assigns each block tothe texture which maximizes the posteriori probability, where we assume a uniformprior over all classes [15]. The classification accuracy is the rate that a subimageis correctly classified, which is shown in Table 3. Algorithm 5.1 outperforms theclassical EM algorithm for the accuracy rates for six of the images.7. Conclusions and Future Work
Figure 1. Textures from VisTex: Bark.0000, Bark.0009, Flowers.0001, Tile.0000, Paintings.11.0001, Grass.0001, Brick.0004, Fabric.0013. (images omitted)

Table 3. Classification results on 8 textures

  Accuracy           | Algorithm 5.1 | EM
  Bark.0000          | 0.5376        | 0.8413
  Bark.0009          | 0.5107        | 0.7150
  Flowers.0001       | 0.8137        | 0.6315
  Tile.0000          | 0.8219        | 0.7239
  Paintings.11.0001  | 0.8047        | 0.7350
  Grass.0001         | 0.9841        | 0.9068
  Brick.0004         | 0.9406        | 0.8854
  Fabric.0013        | 0.9220        | 0.9048

7. Conclusions and Future Work

This paper gives a new algorithm for learning Gaussian mixture models with diagonal covariance matrices. We first give a method for computing incomplete symmetric tensor decompositions. It is based on the usage of generating polynomials. The method is described in Algorithm 3.4. When the input subtensor has small errors, we can similarly compute the incomplete symmetric tensor approximation, which is given by Algorithm 4.1. We have shown in Theorem 4.2 that if the input subtensor is sufficiently close to a low rank one, the produced tensor approximation is highly accurate. Then the unknown parameters for Gaussian mixture models can be recovered by using the incomplete tensor decomposition method. It is described in Algorithm 5.1. When the estimations of M_1 and M_3 are accurate, the parameters recovered by Algorithm 5.1 are also accurate. The computational simulations demonstrate the good performance of the proposed method.

The proposed method deals with the case that the number of Gaussian components is less than one half of the dimension. How do we compute incomplete symmetric tensor decompositions when the set Ω is not like (1.5)? How can we learn parameters for Gaussian mixture models with more components? How can we do that when the covariance matrices are not diagonal? These are important and interesting topics for future research work.

References
[1] D. Achlioptas and F. McSherry, On spectral learning of mixtures of distributions, International Conference on Computational Learning Theory, pp. 458–469, 2005.
[2] A. Anandkumar, R. Ge, D. Hsu, S. Kakade, and M. Telgarsky, Tensor decompositions for learning latent variable models, Journal of Machine Learning Research, 15, 2773–2832, 2014.
[3] M. Belkin and K. Sinha, Toward learning Gaussian mixtures with arbitrary separation, COLT, 2010.
[4] J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot, Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(2), 354–379, 2012.
[5] C. Brubaker and S. Vempala, Isotropic PCA and affine-invariant clustering, Building Bridges, pp. 241–281, Springer, 2008.
[6] F. Chatelin, Eigenvalues of Matrices: Revised Edition, SIAM, 2012.
[7] K. Chaudhuri, S. Kakade, K. Livescu, and K. Sridharan, Multi-view clustering via canonical correlation analysis, Proceedings of the 26th Annual International Conference on Machine Learning, pp. 129–136, 2009.
[8] K. Chaudhuri and S. Rao, Learning mixtures of product distributions using correlations and independence, COLT, pp. 9–20, 2008.
[9] P. Comon, L.-H. Lim, Y. Qi, and K. Ye, Topology of tensor ranks, Advances in Mathematics, 367, 107128, 2020.
[10] S. Dasgupta, Learning mixtures of Gaussians, pp. 634–644, IEEE, 1999.
[11] S. Dasgupta and L. Schulman, A two-round variant of EM for Gaussian mixtures, arXiv preprint arXiv:1301.3850, 2013.
[12] J. Demmel, Applied Numerical Linear Algebra, SIAM, 1997.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22, 1977.
[14] V. De Silva and L.-H. Lim, Tensor rank and the ill-posedness of the best low-rank approximation problem, SIAM J. Matrix Anal. Appl., 30(3), 1084–1127, 2008.
[15] M. Dixit, N. Rasiwasia, and N. Vasconcelos, Adapted Gaussian models for image classification, CVPR 2011, pp. 937–943, 2011.
[16] M. Dressler, J. Nie, and Z. Yang, Separability of Hermitian tensors and PSD decompositions, Preprint, 2020. arXiv:2011.08132
[17] L. Fialkow and J. Nie, The truncated moment problem via homogenization and flat extensions, Journal of Functional Analysis, 263(6), 1682–1700, 2012.
[18] S. Friedland, Remarks on the symmetric rank of symmetric tensors, SIAM J. Matrix Anal. Appl., 37(1), 320–337, 2016.
[19] S. Friedland and L.-H. Lim, Nuclear norm of higher-order tensors, Mathematics of Computation, 87(311), 1255–1281, 2018.
[20] R. Ge, Q. Huang, and S. M. Kakade, Learning mixtures of Gaussians in high dimensions, Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 761–770, 2015.
[21] M. Haas, S. Mittnik, and M. S. Paolella, Modelling and predicting market risk with Laplace–Gaussian mixture distributions, Applied Financial Economics, 16(15), 1145–1162, 2006.
[22] C. J. Hillar and L.-H. Lim, Most tensor problems are NP-hard, J. ACM, 60(6), Art. 45, 39, 2013.
[23] D. Hsu and S. M. Kakade, Learning mixtures of spherical Gaussians: moment methods and spectral decompositions, Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pp. 11–20, 2013.
[24] A. T. Kalai, A. Moitra, and G. Valiant, Efficiently learning mixtures of two Gaussians, Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pp. 553–562, 2010.
[25] R. Kannan, H. Salmasian, and S. Vempala, The spectral method for general mixture models, International Conference on Computational Learning Theory, pp. 444–457, Springer, 2005.
[26] S. Karpagavalli and E. Chandra, A review on automatic speech recognition architecture and approaches, International Journal of Signal Processing, Image Processing and Pattern Recognition, 9(4), 393–404, 2016.
[27] J. M. Landsberg, Tensors: Geometry and Applications, Graduate Studies in Mathematics, vol. 128, American Mathematical Society, Providence, RI, 2012.
[28] D.-S. Lee, Effective Gaussian mixture learning for video background subtraction, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 827–832, 2005.
[29] L.-H. Lim, Tensors and hypermatrices, in: L. Hogben (Ed.), Handbook of Linear Algebra, 2nd Ed., CRC Press, Boca Raton, FL, 2013.
[30] Y. Ma, Q. Jin, X. Mei, X. Dai, F. Fan, H. Li, and J. Huang, Hyperspectral unmixing with Gaussian mixture model and low-rank representation, Remote Sensing, 11(8), 911, 2019.
[31] M. Magdon-Ismail and J. T. Purnell, Approximating the covariance matrix of GMMs with low-rank perturbations, International Conference on Intelligent Data Engineering and Automated Learning, pp. 300–307, 2010.
[32] C. Mu, B. Huang, J. Wright, and D. Goldfarb, Square deal: lower bounds and improved relaxations for tensor recovery, Proceedings of the International Conference on Machine Learning (PMLR), 32(2), 73–81, 2014.
[33] J. Nie and B. Sturmfels, Matrix cubes parameterized by eigenvalues, SIAM Journal on Matrix Analysis and Applications, 31(2), 755–766, 2009.
[34] J. Nie, The hierarchy of local minimums in polynomial optimization, Mathematical Programming, 151(2), 555–583.
[35] J. Nie, Linear optimization with cones of moments and nonnegative polynomials, Mathematical Programming, 153(1), 247–274, 2013.
[36] J. Nie, Generating polynomials and symmetric tensor decompositions, Foundations of Computational Mathematics, 17(2), 423–465, 2017.
[37] J. Nie, Symmetric tensor nuclear norms, SIAM J. Appl. Algebra Geometry, 1(1), 599–625, 2017.
[38] J. Nie, Low rank symmetric tensor approximations, SIAM Journal on Matrix Analysis and Applications, 38(4), 1517–1540, 2017.
[39] J. Nie, Tight relaxations for polynomial optimization and Lagrange multiplier expressions, Mathematical Programming, 178(1-2), 1–37, 2019.
[40] J. Nie and K. Ye, Hankel tensor decompositions and ranks, SIAM Journal on Matrix Analysis and Applications, 40(2), 486–516, 2019.
[41] J. Nie and Z. Yang, Hermitian tensor decompositions, SIAM Journal on Matrix Analysis and Applications, 41(3), 1115–1144, 2020.
[42] K. Pearson, Contributions to the mathematical theory of evolution, Philosophical Transactions of the Royal Society of London. A, 185, 71–110, 1894.
[43] H. Permuter, J. Francos, and I. Jermyn, A study of Gaussian mixture models of color and texture features for image classification and segmentation, Pattern Recognition, 39(4), 695–706, 2006.
[44] D. Povey, L. Burget, M. Agarwal, P. Akyazi, F. Kai, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, A. Rastrow, et al., The subspace Gaussian mixture model—a structured model for speech recognition, Computer Speech & Language, 25(2), 404–439, 2011.
[45] R. A. Redner and H. F. Walker, Mixture densities, maximum likelihood and the EM algorithm, SIAM Review, 26(2), 195–239, 1984.
[46] D. A. Reynolds, Speaker identification and verification using Gaussian mixture speaker models, Speech Communication, 17(1-2), 91–108, 1995.
[47] B. Romera-Paredes and M. Pontil, A new convex relaxation for tensor completion, Advances in Neural Information Processing Systems 26, pp. 2967–2975, 2013.
[48] A. Sanjeev and R. Kannan, Learning mixtures of arbitrary Gaussians, Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pp. 247–257, 2001.
[49] Y. Shekofteh, S. Jafari, J. C. Sprott, S. M. R. H. Golpayegani, and F. Almasganj, A Gaussian mixture model based cost function for parameter estimation of chaotic biological systems, Communications in Nonlinear Science and Numerical Simulation, 20(2), 469–481, 2015.
[50] G. Tang and P. Shah, Guaranteed tensor decomposition: a moment approach, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1491–1500, 2015. Journal of Machine Learning Research: W&CP volume 37.
[51] S. Vempala and G. Wang, A spectral algorithm for learning mixture models, Journal of Computer and System Sciences, 68(4), 841–860, 2004.
[52] T. Veracini, S. Matteoli, M. Diani, and G. Corsini, Fully unsupervised learning of Gaussian mixtures for anomaly detection in hyperspectral imagery, pp. 596–601, 2009.
[53] Y. Wu and P. Yang, Optimal estimation of Gaussian mixtures via denoised method of moments, Annals of Statistics, 48(4), 1981–2007, 2020.
[54] M. Yuan and C.-H. Zhang, On tensor completion via nuclear norm minimization, Found. Comput. Math., 16(4), 1031–1068, 2016.
[55] H. Zhang, C. L. Giles, H. C. Foley, and J. Yen, Probabilistic community discovery using hierarchical latent Gaussian mixture model, AAAI, 7, 663–668, 2007.
[56] Z. Zivkovic, Improved adaptive Gaussian mixture model for background subtraction, Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), 2, 28–31, 2004.
Bingni Guo, Jiawang Nie, and Zi Yang, Department of Mathematics, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, USA, 92093.
Email address: