Segmentation of Facial Expressions Using Semi-Definite Programming and Generalized Principal Component Analysis
Behnood Gholami
School of Aerospace Engineering
Georgia Institute of Technology
[email protected]

Allen R. Tannenbaum
Schools of Electrical & Computer and Biomedical Engineering
Georgia Institute of Technology
[email protected]

Wassim M. Haddad
School of Aerospace Engineering
Georgia Institute of Technology
[email protected]
Abstract
In this paper, we use semi-definite programming and generalized principal component analysis (GPCA) to distinguish between two or more different facial expressions. In the first step, semi-definite programming is used to reduce the dimension of the image data and "unfold" the manifold on which the data points (corresponding to facial expressions) reside. Next, GPCA is used to fit a series of subspaces to the data points and associate each data point with a subspace. Data points that belong to the same subspace are claimed to belong to the same facial expression category. An example is provided.

1. Mathematical Preliminaries
In this section, we introduce notation, several definitions, and some key results in abstract algebra and algebraic geometry [1, 2, 3, 4] that are necessary for developing the main results of this paper. Specifically, for $A \in \mathbb{R}^{n \times n}$ we write $A \geq 0$ (resp., $A > 0$) to indicate that $A$ is a nonnegative-definite (resp., positive-definite) matrix. In addition, $(\cdot)^{\mathrm{T}}$ denotes transpose and $(\cdot)^{\dagger}$ denotes the Moore-Penrose generalized inverse. In the next paragraphs we give the definitions of an ideal and of the Veronese map.

Definition 1.1 (Ideal). Let $R$ be a commutative ring and let $I$ be an additive subgroup of $R$. $I$ is called an ideal if $r \in R$ and $s \in I$ imply $rs \in I$. Furthermore, an ideal is said to be generated by a set $S$ if every $t \in I$ can be written as $t = \sum_{i=1}^{n} r_i s_i$, $r_i \in R$, $s_i \in S$, $i = 1, 2, \ldots, n$, for some $n \in \mathbb{N}$.

Let $R[x]$ be the set of polynomials in $D$ variables, where $x = [x_1\ x_2\ \cdots\ x_D]^{\mathrm{T}}$, $x_i \in R$, $i = 1, 2, \ldots, D$, and $R$ is a field. Then $R[x]$ with polynomial addition and multiplication is a commutative ring. A product of $n$ of the variables $x_1, x_2, \ldots, x_D$ (counting repeats) is called a monomial of degree $n$. The number of distinct monomials of degree $n$ is given by
$$M_n(D) \triangleq \binom{D + n - 1}{n}. \qquad (1)$$

Definition 1.2 (Veronese Map) [4]. The Veronese map of degree $n$, $\nu_n : \mathbb{R}^D \to \mathbb{R}^{M_n(D)}$, is the mapping that assigns to the $D$ variables $x_1, x_2, \ldots, x_D$ all the possible monomials of degree $n$, namely, $\nu_n([x_1\ x_2\ \cdots\ x_D]^{\mathrm{T}}) = [u_1\ u_2\ \cdots\ u_{M_n(D)}]^{\mathrm{T}}$ such that $u_i = x_1^{n_{i1}} x_2^{n_{i2}} \cdots x_D^{n_{iD}}$, $i = 1, 2, \ldots, M_n(D)$, where $n_{i1} + n_{i2} + \cdots + n_{iD} = n$, $n_{ij} \in \mathbb{N}$, $j = 1, 2, \ldots, D$, and $M_n(D)$ is given by (1).
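To make the Veronese map and the count $M_n(D)$ concrete, the following Python sketch enumerates the degree-$n$ monomials of $D$ variables and checks the count against (1). The helper names `num_monomials` and `veronese_map` are ours and do not appear in the paper.

```python
from itertools import combinations_with_replacement
from math import comb

import numpy as np


def num_monomials(n, D):
    """M_n(D) = C(D + n - 1, n): number of distinct degree-n monomials in D variables."""
    return comb(D + n - 1, n)


def veronese_map(x, n):
    """Degree-n Veronese embedding of x in R^D: all monomials of total degree n."""
    D = len(x)
    monomials = []
    # Each multiset of n indices drawn from {0, ..., D-1} corresponds to one monomial.
    for idx in combinations_with_replacement(range(D), n):
        monomials.append(np.prod([x[i] for i in idx]))
    return np.array(monomials)


x = np.array([2.0, 3.0, 5.0])          # D = 3
v = veronese_map(x, n=2)               # degree-2 embedding
assert v.size == num_monomials(2, 3)   # M_2(3) = 6
print(v)                               # [x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2]
```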
2. Dimension Reduction
In this section, we introduce a method known as Maximum Variance Unfolding (MVU), a dimension reduction technique that uses semi-definite programming. Given a set of points sampled from a low-dimensional manifold in a high-dimensional ambient space, this technique "unfolds" the manifold (and hence the points it contains) while preserving the local geometric properties of the manifold [5]. This method can, in a sense, be regarded as a nonlinear generalization of Principal Component Analysis (PCA). Given a set of points in a high-dimensional ambient space, PCA identifies a low-dimensional subspace such that the variance of the projection of the points onto this subspace is maximized. More specifically, a basis of the subspace on which the projection of the points has maximum variance is given by the eigenvectors corresponding to the non-zero eigenvalues of the covariance matrix [6]. In the case where the data is noisy, the singular vectors corresponding to the dominant singular values of the covariance matrix are selected [4].

Given $N$ input points $\{x_n\}_{n=1}^{N} \subset \mathbb{R}^D$, we would like to find $N$ output points $\{y_n\}_{n=1}^{N} \subset \mathbb{R}^d$ such that $d < D$, there is a one-to-one correspondence between these sets, and points close to each other in the input data set remain close in the output data set. In order to be more precise, we need to introduce the concept of isometry for a set of points [7, 5]. Loosely speaking, an isometry is an invertible smooth mapping defined on a manifold that locally acts as a translation and rotation. The next definition extends the notion of isometry to data sets.

Definition 2.1 [5]. Let $X = \{x_n\}_{n=1}^{N} \subset \mathbb{R}^D$ and $Y = \{y_n\}_{n=1}^{N} \subset \mathbb{R}^d$ be two sets of points that are in one-to-one correspondence. Then $X$ and $Y$ are $k$-locally isometric if there exists a mapping consisting of rotation and translation $T : \mathbb{R}^D \to \mathbb{R}^d$ such that if $T(x_n) = y_n$, then $T(N_{x_n}(k)) = N_{y_n}(k)$, for $n = 1, 2, \ldots, N$, where $N_x(k)$ is the set of $k$-nearest neighbors of $x \in X$.

Before stating the MVU method, we give the problem statement.
Problem 2.1. Given a set of input data points $X = \{x_n\}_{n=1}^{N} \subset \mathbb{R}^D$, find output data points $Y = \{y_n\}_{n=1}^{N} \subset \mathbb{R}^d$, $d \leq D$, such that the sum of pairwise squared distances between outputs, namely,
$$\Phi = \frac{1}{2N} \sum_{i=1}^{N} \sum_{j=1}^{N} \| y_i - y_j \|^2, \qquad (2)$$
is maximized and $X$ and $Y$ are $k$-locally isometric for some $k \in \mathbb{N}$. Without loss of generality, we assume that $\sum_{n=1}^{N} x_n = 0$. Moreover, we require $\sum_{n=1}^{N} y_n = 0$ to remove the translational degree of freedom of the output points $Y$.

Note that the data set can be represented by a weighted graph $G$, where each node represents a point and the $k$-nearest points are connected by edges, with $k$ a given parameter. The weights represent the distances between the nodes. We furthermore assume that the corresponding graph $G$ is connected. In the case of a disconnected graph, each connected component should be analyzed separately. The $k$-local isometry condition in Problem 2.1 requires that the distances and the angles between the $k$-nearest neighbors be preserved. This constraint is equivalent to merely preserving the distances between neighboring points in a modified graph $G'$, where in $G'$, for each node, all the neighboring nodes are also connected by an edge. More precisely, in $G'$, each node and its $k$ neighboring nodes form a clique of size $k + 1$ (see Figure 2.1).

The next theorem gives the solution to Problem 2.1 for the case $d = D$.

Theorem 2.1 [5]. Consider the problem given by Problem 2.1 and assume $d = D$. The output data points $Y = \{y_n\}_{n=1}^{N} \subset \mathbb{R}^D$ are given by the solution to the optimization problem
$$\max \ \Phi, \qquad (3)$$
subject to
$$\sum_{n=1}^{N} y_n = 0, \qquad (4)$$
$$\| y_i - y_j \|^2 = D_{ij}, \quad \text{if } \eta_{ij} = 1, \quad i, j = 1, 2, \ldots, N, \qquad (5)$$
where $\Phi$ is defined in (2), $\eta = [\eta_{ij}] \in \mathbb{R}^{N \times N}$ is the adjacency matrix of the modified graph $G'$, and $D_{ij} = \| x_i - x_j \|^2$, $i, j = 1, 2, \ldots, N$, $x_i, x_j \in X$.

[Figure 2.1: The original and modified graphs for k = 2.]

The optimization problem (3)-(5) is not convex. The following convex optimization problem, however, is equivalent to the optimization problem given in Theorem 2.1. Moreover, this theorem also addresses the case where $d \leq D$.

Theorem 2.2 [5]. Consider the problem given by Problem 2.1 and assume $d = D$. The output data points $Y = \{y_n\}_{n=1}^{N} \subset \mathbb{R}^D$ are given by the solution to the optimization problem
$$\max \ \operatorname{tr}(K), \qquad (6)$$
subject to
$$K \geq 0, \qquad (7)$$
$$\sum_{i=1}^{N} \sum_{j=1}^{N} K_{ij} = 0, \qquad (8)$$
$$K_{ii} - 2K_{ij} + K_{jj} = D_{ij}, \quad \text{if } \eta_{ij} = 1, \quad i, j = 1, 2, \ldots, N, \qquad (9)$$
where $K = [K_{ij}]$ is the inner-product matrix defined by $K_{ij} = y_i^{\mathrm{T}} y_j$, $i, j = 1, 2, \ldots, N$, and $\eta$ and $D_{ij}$ are defined in Theorem 2.1. Moreover,
$$y_{ni} = \sqrt{l_i}\, V_{in}, \quad i = 1, 2, \ldots, D, \quad n = 1, 2, \ldots, N, \qquad (10)$$
where $V_i = [V_{i1}\ V_{i2}\ \cdots\ V_{iN}]^{\mathrm{T}}$ is the $i$th eigenvector of $K$, $l_i$ is its associated eigenvalue, and $y_n = [y_{n1}\ y_{n2}\ \cdots\ y_{nD}]^{\mathrm{T}}$. Furthermore, if $K$ has $d$ non-zero eigenvalues, then the output data points $\{y_n^{\mathrm{reduced}}\}_{n=1}^{N} \subset \mathbb{R}^d$ can be found by removing the zero elements of $y_n$.

Remark 2.1. When the data is noisy, usually all the eigenvalues of $K$ are non-zero, and therefore one has to choose the dominant eigenvalues (see [4] for some techniques for choosing the dominant eigenvalues). If the eigenvalues of $K$ are sorted in descending order, the first $d$ elements of $y_n$, $n = 1, 2, \ldots, N$, form a $d$-dimensional data set that is approximately $k$-locally isometric to $\{x_n\}_{n=1}^{N} \subset \mathbb{R}^D$.
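As a rough illustration of Theorem 2.2, the sketch below sets up the semidefinite program (6)-(9) with cvxpy and recovers coordinates from the eigendecomposition of $K$. This is a minimal implementation of ours, not code from the paper, and it assumes the cvxpy and NumPy packages are available.

```python
import numpy as np
import cvxpy as cp


def mvu(X, k):
    """Maximum Variance Unfolding: a sketch of the SDP (6)-(9) of Theorem 2.2.

    X is an N x D array of centered input points; returns N x D output coordinates.
    """
    N, D = X.shape
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T        # D_ij = ||x_i - x_j||^2
    nbrs = np.argsort(d2, axis=1)[:, 1:k + 1]             # k-nearest neighbors (graph G)

    # Adjacency of the modified graph G': each node and its k neighbors form a clique.
    eta = np.zeros((N, N), dtype=bool)
    for i in range(N):
        clique = np.append(nbrs[i], i)
        for a in clique:
            for b in clique:
                if a != b:
                    eta[a, b] = True

    K = cp.Variable((N, N), PSD=True)                     # Gram matrix, constraint (7)
    cons = [cp.sum(K) == 0]                               # centering, constraint (8)
    for i in range(N):
        for j in range(i + 1, N):
            if eta[i, j]:
                cons.append(K[i, i] - 2 * K[i, j] + K[j, j] == d2[i, j])   # constraint (9)
    cp.Problem(cp.Maximize(cp.trace(K)), cons).solve()    # objective (6)

    l, V = np.linalg.eigh(K.value)                        # eigenvalues in ascending order
    l, V = l[::-1], V[:, ::-1]                            # sort descending
    Y = V * np.sqrt(np.clip(l, 0.0, None))                # y_{ni} = sqrt(l_i) V_{in}, eq. (10)
    return Y[:, :D]


# Toy usage: unfold a curled one-dimensional curve embedded in R^3; the dominant
# coordinates of the output (Remark 2.1) flatten the curve.
t = np.linspace(0.0, 3.0 * np.pi, 30)
X = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])
X -= X.mean(axis=0)                                       # Problem 2.1 assumes centered inputs
Y = mvu(X, k=2)
```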
3. Data Segmentation and Subspace Identification

In this section, we address the problem of data segmentation and subspace identification for a given set of data points. Next, we define the multiple subspace segmentation problem.
Problem 3.1 (Multiple Subspace Segmentation Problem). Given the set $Y = \{y_i\}_{i=1}^{N}$, drawn from a set of distinct subspaces of unknown number and dimension, we would like to (i) find the number of subspaces, (ii) find their dimensions, (iii) find a basis for each subspace, and (iv) associate each point with the subspace it belongs to.

Generalized Principal Component Analysis (GPCA) uses algebraic-geometric concepts to address this problem. First, we present the basic GPCA algorithm, and later we introduce a version of GPCA that is more robust to noise. For a detailed treatment of the subject see [4].

3.1. Basic GPCA

In this section we present the basic GPCA algorithm, where we assume that the data points are noise-free. The GPCA algorithm consists of three main steps: (i) polynomial fitting, (ii) polynomial differentiation, and (iii) polynomial division. Let $A = \{S_1, S_2, \ldots, S_n\}$ be a subspace arrangement and $Z_A = S_1 \cup S_2 \cup \cdots \cup S_n$, where $\dim(S_j) = d_j$, $j = 1, 2, \ldots, n$. Furthermore, let $Y = \{y_i\}_{i=1}^{N}$ be a sufficiently large set of points sampled from $Z_A$. In this paper, we assume that the number of subspaces $n$ is known. The GPCA algorithm, however, also gives the solution for the case where $n$ is unknown (see [4]). In order to algebraically represent $Z_A$, we need to find the vanishing ideal of $Z_A$, namely $I(Z_A)$, the set of polynomials that vanish on $Z_A$. It can be shown that the homogeneous component of $I(Z_A)$ of degree $n$, namely $I_n$, uniquely determines $I(Z_A)$. Therefore, in order to find the vanishing ideal $I(Z_A)$ it suffices to determine the homogeneous component $I_n$.

Now note that if $p_n(x)$ is a polynomial in $I_n$, then $p_n(x) = c_n^{\mathrm{T}} \nu_n(x)$, $c_n \in \mathbb{R}^{M_n(D)}$, where $x = [x_1\ x_2\ \cdots\ x_D]^{\mathrm{T}}$, for some $D \in \mathbb{N}$, and $M_n(D)$ is given by (1). Therefore, each point $y_i$, $i = 1, 2, \ldots, N$, should satisfy $p_n(y_i) = 0$, and hence $V_n(D)\, c_n = 0$, where
$$V_n(D) \triangleq \begin{bmatrix} \nu_n^{\mathrm{T}}(y_1) \\ \nu_n^{\mathrm{T}}(y_2) \\ \vdots \\ \nu_n^{\mathrm{T}}(y_N) \end{bmatrix} \qquad (11)$$
is called the embedded data matrix. A one-to-one correspondence between the null space of $V_n(D)$ and the polynomials in $I_n$ exists if the following condition holds:
$$\dim(\mathcal{N}(V_n(D))) = \dim(I_n) = h_I(n), \qquad (12)$$
or equivalently,
$$\operatorname{rank}(V_n(D)) = M_n(D) - h_I(n), \qquad (13)$$
where $h_I(n)$ is the Hilbert function. The singular vectors of $V_n(D)$, denoted $c_{ni}$, $i = 1, 2, \ldots, h_I(n)$, corresponding to the zero singular values of $V_n(D)$ can be used to compute a basis for $I_n$, namely $I_n = \operatorname{span}\{p_{ni}(x) = c_{ni}^{\mathrm{T}} \nu_n(x),\ i = 1, 2, \ldots, h_I(n)\}$. In the case where the data $Y$ is corrupted by noise, the singular vectors corresponding to the $h_I(n)$ smallest singular values of $V_n(D)$ can be used.
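The polynomial-fitting step can be sketched in a few lines: build the embedded data matrix $V_n(D)$ of (11) with the Veronese map and read the coefficient vectors $c_n$ of the vanishing polynomials off its (numerical) null space. The example below, with two lines in $\mathbb{R}^2$, is ours and not code from the paper.

```python
import numpy as np
from itertools import combinations_with_replacement


def veronese_map(x, n):
    """Degree-n Veronese embedding nu_n(x) of a point x (same helper as in Section 1)."""
    return np.array([np.prod([x[i] for i in idx])
                     for idx in combinations_with_replacement(range(len(x)), n)])


# Noise-free points sampled from two lines (1-dimensional subspaces) in R^2.
t = np.linspace(-1.0, 1.0, 20)
Y = np.vstack([np.outer(t, [1.0, 0.5]), np.outer(t, [-0.2, 1.0])])

n = 2                                                  # number of subspaces (assumed known)
V = np.array([veronese_map(y, n) for y in Y])          # embedded data matrix V_n(D), N x M_n(D)
_, s, Vt = np.linalg.svd(V)
h = int(np.sum(s < 1e-8 * s[0]))                       # dim N(V_n(D)) = h_I(n) in the noise-free case
C = Vt[len(s) - h:]                                    # rows are coefficient vectors c_n of a basis of I_n
print(h, C)                                            # here h = 1: a single degree-2 vanishing polynomial
```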
The following theorem shows how polynomial differentiation can be used to find the dimension of and a basis for each subspace.

Theorem 3.1 [4]. Let $Y = \{y_i\}_{i=1}^{N}$ be a set of points sampled from $Z_A = S_1 \cup S_2 \cup \cdots \cup S_n$, where $S_i$ is a subspace of unknown dimension $d_i$, $i = 1, 2, \ldots, n$. Furthermore, assume that for each subspace $S_j$, $j = 1, 2, \ldots, n$, a point $w_j$ is given such that $w_j \in S_j$, $w_j \notin S_i$, $i \neq j$, $i = 1, 2, \ldots, n$, and that condition (12) holds. Then
$$S_j^{\perp} = \operatorname{span}\left\{ \left. \frac{\partial}{\partial x}\, c_n^{\mathrm{T}} \nu_n(x) \right|_{x = w_j} : c_n \in \mathcal{N}(V_n(D)) \right\}, \qquad (14)$$
where $V_n(D)$ is the embedded data matrix of $Y$. Furthermore, $d_j = D - \operatorname{rank}(\nabla P_n(w_j))$, $j = 1, 2, \ldots, n$, where $P_n(x) = [p_{n1}(x)\ p_{n2}(x)\ \cdots\ p_{n h_I(n)}(x)] \in \mathbb{R}^{1 \times h_I(n)}$ is a row vector of independent polynomials in $I_n$ found using the singular vectors corresponding to the zero singular values of $V_n(D)$, and $\nabla P_n(x) = [\nabla p_{n1}(x)\ \nabla p_{n2}(x)\ \cdots\ \nabla p_{n h_I(n)}(x)] \in \mathbb{R}^{D \times h_I(n)}$.

As a final step, we need a procedure to select a point $w_j$, $j = 1, 2, \ldots, n$, for each subspace. Without loss of generality, let $j = n$. One can show that the first point $w_n$, where $w_n \in S_n$ and $w_n \notin S_i$, $i = 1, 2, \ldots, n - 1$, is given by
$$w_n = \operatorname*{argmin}_{y \in Y :\, \nabla P_n(y) \neq 0} P_n(y) \left( \nabla^{\mathrm{T}} P_n(y)\, \nabla P_n(y) \right)^{\dagger} P_n^{\mathrm{T}}(y). \qquad (15)$$
Furthermore, a basis for $S_n^{\perp}$ can be found by applying PCA to $\nabla P_n(w_n)$. To find the remaining points $w_i \in S_i$, $i = 1, 2, \ldots, n - 1$, we can use polynomial division, as proposed by the next theorem.

Theorem 3.2 [4]. Let $Y = \{y_i\}_{i=1}^{N}$ be a set of points sampled from $Z_A = S_1 \cup S_2 \cup \cdots \cup S_n$, where $S_i$ is a subspace of unknown dimension $d_i$, $i = 1, 2, \ldots, n$, and suppose (12) holds. Furthermore, let a point $w_n \in S_n$ and $S_n^{\perp}$ be given. Then the set $\bigcup_{i=1}^{n-1} S_i$ is characterized by the set of homogeneous polynomials
$$\left\{ c_{n-1}^{\mathrm{T}} \nu_{n-1}(x) : V_n(D)\, R_n(b_n)\, c_{n-1} = 0, \ \forall\, b_n \in S_n^{\perp}, \ c_{n-1} \in \mathbb{R}^{M_{n-1}(D)} \right\},$$
where $R_n(b_n) \in \mathbb{R}^{M_n(D) \times M_{n-1}(D)}$ is the matrix of coefficients of $c_{n-1}$ when $(b_n^{\mathrm{T}} x)(c_{n-1}^{\mathrm{T}} \nu_{n-1}(x)) \equiv c_n^{\mathrm{T}} \nu_n(x)$ is rearranged into the form $R_n(b_n)\, c_{n-1} = c_n$.

Once the homogeneous polynomials $\{c_{n-1}^{\mathrm{T}} \nu_{n-1}(x)\}$ given in the previous theorem are obtained, the same procedure can be repeated to find $w_{n-1}$ and the homogeneous polynomials characterizing $\bigcup_{i=1}^{n-2} S_i$.

3.2. Subspace Estimation Using a Voting Scheme

The basic GPCA framework works well in the absence of noise. In practice, however, noise is always present, and efficient statistical methods need to be used in conjunction with basic GPCA. In this section, we present one such statistical method, in which a voting scheme is combined with basic GPCA. Here we assume that the number of subspaces and their dimensions are known. For a more complete treatment of the subject see [4].

Let $Y = \{y_i\}_{i=1}^{N} \subset \mathbb{R}^D$ be the set of data points sampled from $Z_A = S_1 \cup S_2 \cup \cdots \cup S_n$, where $S_j$, $j = 1, 2, \ldots, n$, is a subspace of dimension $d_j$ and co-dimension $c_j = D - d_j$. From the discussion in the previous section, we know that the homogeneous component of degree $n$ of the vanishing ideal $I(Z_A)$, denoted by $I_n$, uniquely determines $I(Z_A)$. Moreover, we mentioned that $\dim(I_n) = h_I(n)$, where $h_I(n)$ is the Hilbert function. Let $P = \{p_1(x), p_2(x), \ldots, p_{h_I(n)}(x)\}$ be a basis of $I_n$, which can be found from the singular vectors corresponding to the $h_I(n)$ smallest singular values of $V_n(D)$, where $V_n(D)$ is the embedded data matrix. Suppose we choose a point $y \in Y$ and define $\nabla P_B(y) = [\nabla p_1(y)\ \nabla p_2(y)\ \cdots\ \nabla p_{h_I(n)}(y)]$. In the noise-free case, $\operatorname{rank}(\nabla P_B(y)) = c_j$ whenever $y \in S_j$.

However, in the case where the data is corrupted by noise, a more robust method for computing the bases is desired. Suppose the co-dimensions of the subspaces take $m$ distinct values $c'_1, c'_2, \ldots, c'_m$. In the voting scheme, since we do not know which subspace $y$ belongs to and we would like to leave our options open, a basis for the orthogonal complement of a subspace of each possible co-dimension $c'_i$, $i = 1, 2, \ldots, m$, is calculated by choosing the $c'_i$ principal components of $\nabla P_B(y)$. This results in $m$ matrices $B_i \in \mathbb{R}^{D \times c'_i}$, $i = 1, 2, \ldots, m$, each of which is a candidate basis for one of the $S_j^{\perp}$, $j = 1, 2, \ldots, n$.

The idea of the voting scheme is to count the number of repetitions of each candidate basis over all points $y_i$, $i = 1, 2, \ldots, N$, in the data set. At the end, the $n$ bases with the most votes are chosen as the bases of $S_i^{\perp}$, $i = 1, 2, \ldots, n$, and each point is assigned to the closest subspace. In the criterion for counting repetitions, two bases are considered to be the same if the angle between the two subspaces spanned by them is less than a given threshold $\tau > 0$.
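To illustrate the differentiation step, the sketch below fits a single degree-2 vanishing polynomial to points drawn from two noisy lines in $\mathbb{R}^2$, uses its gradient at each point as the estimated normal direction (a basis for the corresponding $S_j^{\perp}$), and then groups points by normal direction. The grouping step is a deliberately simplified stand-in for the full voting scheme of Section 3.2, and all function names are ours.

```python
import numpy as np
from itertools import combinations_with_replacement


def veronese_map(x, n):
    """Degree-n Veronese embedding of a point x in R^D (as in the earlier sketches)."""
    return np.array([np.prod([x[i] for i in idx])
                     for idx in combinations_with_replacement(range(len(x)), n)])


def veronese_jacobian(x, n):
    """M_n(D) x D Jacobian of nu_n at x; its transpose maps c_n to the gradient of c_n^T nu_n."""
    D = len(x)
    rows = []
    for idx in combinations_with_replacement(range(D), n):
        exps = np.bincount(np.array(idx), minlength=D)        # exponents of the monomial
        row = np.zeros(D)
        for j in range(D):
            if exps[j] > 0:
                e = exps.copy()
                e[j] -= 1
                row[j] = exps[j] * np.prod(x ** e)             # partial derivative w.r.t. x_j
        rows.append(row)
    return np.array(rows)


# Two noisy lines through the origin in R^2 (n = 2 subspaces of dimension 1).
rng = np.random.default_rng(0)
t = np.concatenate([np.linspace(0.2, 1.0, 25), np.linspace(-1.0, -0.2, 25)])
Y = np.vstack([np.outer(t, [1.0, 0.3]), np.outer(t, [-0.4, 1.0])])
Y += 0.01 * rng.standard_normal(Y.shape)

n = 2
V = np.array([veronese_map(y, n) for y in Y])                  # embedded data matrix
c = np.linalg.svd(V)[2][-1]                                    # smallest singular vector: p_n(x) = c^T nu_n(x)

# Gradient of p_n at each point approximates the normal of the subspace containing it.
normals = np.array([veronese_jacobian(y, n).T @ c for y in Y])
normals /= np.linalg.norm(normals, axis=1, keepdims=True) + 1e-12

# Simplified grouping: pick two reference normals and assign each point to the closer one.
b1 = normals[0]
b2 = normals[np.argmin(np.abs(normals @ b1))]                  # direction least aligned with b1
labels = (np.abs(normals @ b2) > np.abs(normals @ b1)).astype(int)
print(labels[:50].mean(), labels[50:].mean())                  # approximately 0 and 1, respectively
```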
4. Segmentation of Facial Expressions
In this section, we use the techniques presented in Sections 2 and 3 to segment the facial expressions in a given set of images. More specifically, given a set of images of a person with two different facial expressions (e.g., neutral and happy), we try to segment the images based on their facial expression. We should mention that the author in [8] uses the idea of point clustering and PCA to segment images with different facial expressions. In this paper, however, we would like to show that if the manifold of faces is "unfolded" (e.g., using a Maximum Variance Unfolding technique), different facial expressions reside on different subspaces.

In our experiment, for each human subject about 30 images of the face were taken, where the subject starts with a neutral expression, transitions to a happy expression, and goes back to the neutral expression, with each part containing about 10 images. An example set of images is given in Figure 4.1. The images were taken in a sequence, each 200 × 240 pixels, and in total there were 4 subjects.

[Figure 4.1: A sequence of pictures, where the subject starts with a neutral expression, smiles, and resumes the neutral expression.]

Each image can be considered as a vector of dimension 48000, obtained by stacking up all the columns of the image matrix. In this way, each image is a point in a 48000-dimensional space. In order to segment the images, first, the dimension is reduced to $D = 5$ using the MVU procedure presented in Section 2, with $k = 4$, i.e., when forming the weighted graph $G$, the 4 nearest neighbors are connected by an edge. Next, the resulting points in the $D = 5$ dimensional ambient space are used to identify 2 subspaces of dimension $d = 1, 2, 3, 4$, where in the GPCA voting algorithm two subspaces are considered to be the same if the angle between them is less than the threshold $\tau$. The segmentation results for $D = 5$ are summarized in Table 1, and an example of the segmentation obtained with $D = 2$ and $d = 1$ is given in Figure 4.2. A minimal sketch of this pipeline is given below.

[Table 1: Segmentation Results for D = 5. Columns: Subject, Number of Images, and Segmentation Error for d = 1, 2, 3, 4.]

[Figure 4.2: Facial expression segmentation with D = 2 and d = 1. The categorization error is 6/30. The solid and dashed lines are the subspaces corresponding to the neutral and happy expressions, respectively. The points associated with the solid line and the dashed line are represented by "+" and "×", respectively. The points marked "◦" are those associated with the wrong expression.]
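The following sketch outlines the experimental pipeline under stated assumptions: it loads a hypothetical image sequence (the directory name and file pattern are placeholders, not from the paper), stacks each 200 × 240 frame column-wise into a 48000-vector, and prepares the centered data and the $k = 4$ nearest-neighbor structure that feed the MVU program of Section 2; the subsequent MVU and GPCA voting steps are indicated in comments rather than reimplemented here.

```python
from pathlib import Path

import numpy as np
from PIL import Image            # Pillow is assumed to be available for reading the frames

# Load the image sequence of one subject; the path and file pattern are placeholders.
frames = sorted(Path("subject_01").glob("*.png"))
X = np.array([np.asarray(Image.open(f).convert("L"), dtype=float).flatten(order="F")
              for f in frames])                     # each 200 x 240 frame -> 48000-vector (column stacking)
X -= X.mean(axis=0)                                 # center the data, as assumed in Problem 2.1

# k-nearest-neighbor structure (k = 4) for the graphs G and G' of Section 2.
k = 4
sq = np.sum(X**2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T      # pairwise squared distances D_ij
neighbors = np.argsort(d2, axis=1)[:, 1:k + 1]      # the 4 nearest frames of each image

# Remaining steps (see the sketches in Sections 2 and 3):
#   1. solve the MVU semidefinite program (6)-(9) on (d2, neighbors) and keep the
#      coordinates of the D = 5 dominant eigenvectors of K;
#   2. run the GPCA voting procedure with n = 2 subspaces of dimension d in {1, 2, 3, 4}
#      and assign each frame to the neutral or happy expression.
```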
References

[1] D. Eisenbud, Commutative Algebra. Springer-Verlag, 1995.

[2] A. R. Tannenbaum, Invariance and System Theory: Algebraic and Geometric Aspects. Springer-Verlag, 1981.

[3] J. A. Gallian, Contemporary Abstract Algebra. D. C. Heath and Company, 1990.

[4] R. Vidal, Y. Ma, and S. Sastry, Generalized Principal Component Analysis. Springer, 2008.

[5] K. Q. Weinberger and L. K. Saul, "Unsupervised learning of image manifolds by semidefinite programming," Int. J. Comp. Vis., vol. 70, pp. 77-90, 2006.

[6] I. T. Jolliffe, Principal Component Analysis. Springer, 2002.

[7] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319-2323, 2000.

[8] M. A. Turk, Interactive-Time Vision: Face Recognition as a Visual Behavior. Ph.D. thesis, Massachusetts Institute of Technology, 1991.

[9] A. Y. Yang, S. Rao, A. Wagner, Y. Ma, and R. M. Fossum, "Hilbert functions and applications to the estimation of subspace arrangements," in