Graph-based classification of multiple observation sets
Effrosyni Kokiopoulou                              Pascal Frossard
ETHZ                                               Ecole Polytechnique Fédérale de Lausanne (EPFL)
Seminar for Applied Mathematics                    Signal Processing Laboratory - LTS4
CH-8092 Zürich                                     CH-1015 Lausanne
[email protected]                        [email protected]
Abstract—We consider the problem of classification of an object given multiple observations that possibly include different transformations. The possible transformations of the object generally span a low-dimensional manifold in the original signal space. We propose to take advantage of this manifold structure for the effective classification of the object represented by the observation set. In particular, we design a low complexity solution that is able to exploit the properties of the data manifolds with a graph-based algorithm. Hence, we formulate the computation of the unknown label matrix as a smoothing process on the manifold, under the constraint that all observations represent an object of one single class. This results in a discrete optimization problem, which can be solved by an efficient and low complexity algorithm. We demonstrate the performance of the proposed graph-based algorithm in the classification of sets of multiple images. Moreover, we show its high potential in video-based face recognition, where it outperforms state-of-the-art solutions that fall short of exploiting the manifold structure of the face image data sets.
Index Terms—Graph-based classification, multiple observation sets, video face recognition, multi-view object recognition.
I. INTRODUCTION
Recent years have witnessed a dramatic growth of the amount of digital data that is produced by sensors or computers of all sorts. This creates the need for efficient processing and analysis algorithms in order to extract the relevant information contained in these data sets. In particular, it commonly happens that multiple observations of an object are captured at different time instants or under different geometric transformations. For instance, a moving object may be observed over a time interval by a surveillance camera (see Fig. 1(a)), or under different viewing angles by a network of vision sensors (see Fig. 1(b)). This typically produces a large volume of multimedia content that lends itself as a valuable source of information for effective knowledge discovery and content analysis. In this context, classification methods should be able to exploit the diversity of the multiple observations in order to provide increased classification accuracy [1].

We build on our previous work [2] and focus here on the pattern classification problem with multiple observations. We further assume that the observations are produced from the same object under different transformations, so that they all lie on the same low-dimensional manifold. We propose a novel graph-based algorithm built on label propagation [3]. Label propagation methods typically assume that the data lie on a low-dimensional manifold living in a high-dimensional space. They rely upon the smoothness assumption, which states that if two data samples x_i and x_j are close, then their labels y_i and y_j should be close as well. The main idea of these methods is to build a graph that captures the geometry of this manifold as well as the proximity of the data samples. The labels of the test examples are derived by "propagating" the labels of the labelled data along the manifold, while making use of the smoothness property.
We exploit the specificities of our particular classification problem and constrain the unknown labels to correspond to one single class. This leads to the formulation of a discrete optimization problem that can be optimally solved by a simple and low complexity algorithm. We apply the proposed algorithm to the classification of sets of multiple images in handwritten digit recognition, multi-view object recognition or video-based face recognition. In particular, we show the high potential of our graph-based method for the efficient classification of images that belong to the same data manifold. For example, the proposed solution outperforms state-of-the-art subspace or statistical classification methods in video-based face recognition and object recognition from multiple image sets. Hence, this paper establishes new connections between graph-based algorithms and the problems of classification of multiple image sets or video-based face recognition, where the proposed solutions are certainly very promising.

The paper is organized as follows. We first formulate the problem of classification of multiple observation sets in Section II. We introduce our graph-based algorithm inspired by label propagation in Section III. Then we demonstrate the performance of the proposed classification method for handwritten digit recognition, object recognition and video-based face recognition in Sections IV-A, IV-B and V, respectively.

This work has been mostly performed while the first author was with the Signal Processing Laboratory (LTS4) of EPFL. It has been partly supported by the Swiss National Science Foundation, under grant NCCR IM2.

Fig. 1. Typical scenarios of producing multiple observations x_1, ..., x_m of an object s: (a) video frames of a moving object; (b) network of vision sensors.

II. PROBLEM DEFINITION
We address the problem of the classification of multiple observations of the same object, possibly with some transformations. In particular, the problem is to assign multiple observations of the test pattern/object s to a single class of objects. We assume that we have m transformed observations of s of the following form

x_i = U(η_i) s,  i = 1, ..., m,

where U(η) denotes a (geometric) transformation operator with parameters η, which is applied on s. For instance, in the case of visual objects, U(η) may correspond to a rotation, scaling, translation, or perspective projection of the object. We assume that each observation x_i is obtained by applying a transformation η_i on s, which is different from its peers (i.e., η_i ≠ η_j, for i ≠ j). The problem is to classify s in one of the c classes under consideration, using the multiple observations x_i, i = 1, ..., m.

Assume further that the data set is organized in two parts X = {X^(l), X^(u)}, where X^(l) = {x_1, x_2, ..., x_l} ⊂ R^d and X^(u) = {x_{l+1}, ..., x_n} ⊂ R^d, with n = l + m. Let also L = {1, ..., c} denote the label set. The l examples in X^(l) are labelled {y_1, y_2, ..., y_l}, y_i ∈ L, and the m examples in X^(u) are unlabelled. The classification problem can be formally defined as follows.

Problem 1: Given a set of labelled data X^(l), and a set of unlabelled data X^(u) = {x_j = U(η_j) s, j = 1, ..., m} that correspond to multiple transformed observations of s, the problem is to predict the correct class c* of the original pattern s.

One may view Problem 1 as a special case of semi-supervised learning [4], where the unlabelled data X^(u) represent the multiple observations with the extra constraint that all unlabelled data examples belong to the same (unknown) class. The problem then resides in estimating the single unknown class, while generic semi-supervised learning problems attribute the test examples to different classes.

Fig. 2. Typical structure of the k-NN graph. N_i represents the neighborhood of the sample x_i (labelled and unlabelled examples).

III. GRAPH-BASED CLASSIFICATION
A. Label propagation
We propose in this section a novel method to solve Problem 1, which is inspired by label propagation [3]. The label propagation algorithm is based on a smoothness assumption, which states that if x_i and x_j are close by, then their corresponding labels y_i and y_j should be close as well. Denote by M the set of matrices with nonnegative entries, of size n × c. Notice that any matrix M ∈ M provides a labelling of the data set by applying the following rule: y_i = arg max_{j=1,...,c} M_ij. We denote the initial label matrix as Y ∈ M, where Y_ij = 1 if x_i belongs to class j and 0 otherwise. The label propagation algorithm first forms the k nearest neighbor (k-NN) graph G = (V, E), where the vertices V correspond to the data samples X. An edge e_ij ∈ E is drawn if and only if x_j is among the k nearest neighbors of x_i.

It is common practice to assign weights to the edge set of G. One typical choice is the Gaussian weights

H_ij = exp(−‖x_i − x_j‖² / σ²) when (i, j) ∈ E, and 0 otherwise.   (1)

The similarity matrix S ∈ R^{n×n} is further defined as

S = D^{−1/2} H D^{−1/2},   (2)

where D is a diagonal matrix with entries D_ii = Σ_{j=1}^n H_ij. See also Fig. 2 for a schematic illustration of the k-NN graph and the related notation.

Next, the algorithm computes a real-valued M* ∈ M, based on which the final classification is performed using the rule y_i = arg max_{j=1,...,c} M*_ij. This is done via a regularization framework with a cost function defined as

U(M) = (1/2) ( Σ_{i,j=1}^n H_ij ‖ M_i/√D_ii − M_j/√D_jj ‖² + μ Σ_{i=1}^n ‖ M_i − Y_i ‖² ),   (3)

where M_i denotes the i-th row of M. The computation of M* is done by solving the quadratic optimization problem M* = arg min_{M ∈ M} U(M).

Intuitively, we are seeking an M* that is smooth along the edges of similar pairs (x_i, x_j) and at the same time close to Y when evaluated on the labelled data X^(l).
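The steps above can be sketched compactly in NumPy. The snippet below is our own illustrative code, not the authors': it builds the k-NN graph with the Gaussian weights of Eqs. (1)-(2) (symmetrizing the edge set so that S is symmetric), and then applies the standard closed-form label-propagation solution of [3], M* = β(I − αS)^{−1}Y with α = 1/(1+μ) and β = μ/(1+μ), which is the solution of the regularization problem (3).

```python
import numpy as np

def knn_similarity(X, k=5, sigma=1.0):
    """Sketch of Eqs. (1)-(2): k-NN graph with Gaussian weights H and
    normalized similarity S = D^{-1/2} H D^{-1/2}. X is n x d (one
    sample per row). Names are illustrative."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, np.inf)                 # exclude self-loops
    H = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]             # k nearest neighbors of x_i
        H[i, nbrs] = np.exp(-d2[i, nbrs] / sigma ** 2)
    H = np.maximum(H, H.T)                       # symmetrize the edge set
    D = H.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(D, 1e-12))
    S = H * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return H, S

def label_propagation(S, Y, mu=1.0):
    """Closed-form label propagation of [3]: M* = beta (I - alpha S)^{-1} Y,
    alpha = 1/(1+mu), beta = mu/(1+mu); classify by y_i = arg max_j M*_ij."""
    n = S.shape[0]
    alpha = 1.0 / (1.0 + mu)
    beta = mu / (1.0 + mu)
    M = beta * np.linalg.solve(np.eye(n) - alpha * S, Y)
    return M.argmax(axis=1)
```

On a toy graph of two disconnected pairs, with one labelled node per pair, the labels propagate to the unlabelled neighbors as expected.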
The first term in (3) is the smoothness term and the second is the fitness term. Notice that when two examples x_i and x_j are similar (i.e., the weight H_ij is large), minimizing the smoothness term in (3) results in M being smooth across similar examples. Thus, similar data examples will likely share the same class label. It can be shown [3] that the solution to problem (3) is given by

M* = β (I − αS)^{−1} Y,   (4)

where α = 1/(1 + μ) and β = μ/(1 + μ).

Finally, several other variants of label propagation have been proposed in the past few years. We mention for instance the method of [5] and the variant of label propagation that is inspired from the Jacobi iteration algorithm [4, Ch. 11]. It is also interesting to note that connections have been found to Markov random walks [6] and electric networks [7]. Note finally that label propagation is probably the most representative algorithm among the graph-based methods for semi-supervised learning.

B. Label propagation with multiple observations
We propose now to build on graph-based algorithms to solve the problem of classification of multiple observation sets. In general, label propagation assumes that the unlabelled examples come from different classes. As Problem 1 presents the specific constraint that all unlabelled data belong to the same class, label propagation does not fit exactly the definition of the problem, as it falls short of exploiting its special structure. Therefore, we propose in the sequel a novel graph-based algorithm, which (i) uses the smoothness criterion on the manifold in order to predict the unknown class labels and (ii) at the same time is able to exploit the specificities of Problem 1.

We represent the data labels with a 1-of-c encoding, which permits to form a binary label matrix of size n × c, whose i-th row encodes the class label of the i-th example. The class label is basically encoded in the position of the nonzero element.

Suppose now that the correct class for the unlabelled data is the p-th one. In this case, we denote by Z_p ∈ R^{n×c} the corresponding label matrix. Note that there are c such label matrices, one for each class hypothesis. Each class-conditional label matrix Z_p has the following form

Z_p = [ Y_l ; 1 e_p^T ] ∈ R^{n×c},   (5)

where Y_l ∈ R^{l×c} holds the labels of the labelled examples, e_p ∈ R^c is the p-th canonical basis vector and 1 ∈ R^m is the vector of ones. Fig. 3 shows schematically the structure of matrix Z_p. The upper part corresponds to the labelled examples and the lower part to the unlabelled ones. Z_p holds the labels of all data samples, assuming that all unlabelled examples belong to the p-th class. Observe that the Z_p's share the first part Y_l and differ only in the second part.

Since all unlabelled examples share the same label, the class labels have a special structure that reflects the special structure of Problem 1, as outlined in our previous work [2].

Fig. 3. Structure of the class-conditional label matrix Z_p.
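The class-conditional matrices of Eq. (5) are straightforward to construct; the helper below is a hypothetical sketch with our own naming:

```python
import numpy as np

def class_label_matrix(Y_l, p, m, c):
    """Sketch of Eq. (5): stack the labelled block Y_l (l x c) on top of
    m copies of the canonical basis vector e_p^T, yielding Z_p (n x c)."""
    e_p = np.zeros(c)
    e_p[p] = 1.0
    return np.vstack([Y_l, np.tile(e_p, (m, 1))])
```

All c hypotheses Z_1, ..., Z_c share the upper block Y_l and differ only in which column of the lower block carries the ones.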
We can then express the unknown label matrix M as

M = Σ_{p=1}^c λ_p Z_p,  Z_p ∈ R^{n×c},   (6)

where Z_p is given in (5), and

λ_p ∈ {0, 1}  and  Σ_{p=1}^c λ_p = 1.   (7)

In the above, λ = [λ_1, ..., λ_c] is the vector of linear combination weights, which are discrete and sum to one. Ideally, λ should be sparse, with only one nonzero entry pointing to the correct class. The classification problem now resides in estimating the proper value of λ. We rely on the smoothness assumption and we propose the following objective function

Q̃(M(λ)) = (1/2) Σ_{i,j=1}^n H_ij ‖ M_i/√D_ii − M_j/√D_jj ‖²,   (8)

where the optimization variable now becomes the vector λ. Notice that the fitting term in Eq. (3) is not needed anymore, due to the structure of the Z matrices. Furthermore, we observe that the optimization parameter λ is implicitly represented in the above equation through M, defined in Eq. (6). In the above, M_i (resp. M_j) denotes the i-th (resp. j-th) row of M. In the case of a normalized similarity matrix, the above criterion becomes

Q(M(λ)) = (1/2) Σ_{i,j=1}^n S_ij ‖ M_i − M_j ‖²,   (9)

where S is defined as in (2). It can be seen that the objective function directly relies on the smoothness assumption: when two examples x_i, x_j are nearby (i.e., H_ij or S_ij is large), minimizing Q̃(λ) or Q(λ) results in class labels that are close too. The following proposition now shows the explicit dependence of Q on λ.

Proposition 1: Assume the data set is split into l labelled examples X^(l) and m unlabelled examples X^(u), i.e., X = [X^(l), X^(u)]. Then the objective function (9) can be written in the following form

Q(λ) = C + (1/2) Σ_{i≤l, j>l} S_ij ‖ Y_i − λ ‖² + (1/2) Σ_{i>l, j≤l} S_ij ‖ Y_j − λ ‖²,   (10)

where C = (1/2) Σ_{i≤l, j≤l} S_ij ‖ Y_i − Y_j ‖² does not depend on λ. Proof:
From equation (9), observe that

Q(λ) = (1/2) Σ_{i,j≤l} S_ij ‖M_i − M_j‖²   [= Q_1]
     + (1/2) Σ_{i,j>l} S_ij ‖M_i − M_j‖²   [= Q_2]
     + (1/2) Σ_{i≤l, j>l} S_ij ‖M_i − M_j‖²   [= Q_3]
     + (1/2) Σ_{i>l, j≤l} S_ij ‖M_i − M_j‖²   [= Q_4].

We consider the following cases.
(i) i ≤ l and j ≤ l: both data examples x_i and x_j are labelled. Then M_i = (Σ_{p=1}^c λ_p) Y_i = Y_i, due to the special structure of the Z matrices (see (5)) and also due to the constraint from Eq. (7). Similarly, M_j = Y_j. This results in Q_1 = (1/2) Σ_{i,j≤l} S_ij ‖Y_i − Y_j‖² = C, which is a constant term and does not depend on λ.
(ii) i > l and j > l: both data samples x_i and x_j are unlabelled. In this case, M_i = λ and M_j = λ, again due to (5). Therefore the second term Q_2 is zero.
(iii) i ≤ l and j > l: x_i is labelled and x_j is unlabelled. In this case, M_i = Y_i and M_j = λ. This results in Q_3 = (1/2) Σ_{i≤l, j>l} S_ij ‖Y_i − λ‖².
(iv) i > l and j ≤ l: analogous to case (iii) above, with the roles of x_i and x_j switched. Thus, Q_4 = (1/2) Σ_{i>l, j≤l} S_ij ‖Y_j − λ‖².
Putting the above facts together yields Eq. (10).

The above proposition suggests that only the interface between labelled and unlabelled examples matters in determining the smoothness value of a candidate label matrix M, or equivalently of the solution vector λ. We use this observation in order to design an efficient graph-based classification algorithm that is described below.

Algorithm 1
The MASC algorithm
Input: X ∈ R^{d×n}: data examples; m: number of observations; l: number of labelled data.
Output: p̂: estimated unknown class.
Initialization:
  Form the k-NN graph G = (V, E).
  Compute the weight matrix H ∈ R^{n×n} and the diagonal matrix D, where D_ii = Σ_{j=1}^n H_ij.
  Compute S = D^{−1/2} H D^{−1/2}.
for p = 1 : c do
  M = [ Y_l ; 1 e_p^T ]
  q(p) = Σ_{i≤l, j>l} S_ij ‖M_i − M_j‖² + Σ_{i>l, j≤l} S_ij ‖M_i − M_j‖²
end for
p̂ = arg min_p q(p)

C. The MASC algorithm
We propose in this section a simple, yet effective, graph-based algorithm for the classification of multiple observations from the same class. Based on Proposition 1, and ignoring the constant term, we need to solve the following optimization problem.

Optimization problem OPT:
  min_λ  Σ_{i≤l, j>l} S_ij ‖Y_i − λ‖² + Σ_{i>l, j≤l} S_ij ‖Y_j − λ‖²
  subject to  λ_p ∈ {0, 1}, p = 1, ..., c,  and  Σ_{p=1}^c λ_p = 1.

Intuitively, we seek the class that corresponds to the smoothest label assignment between labelled and unlabelled data. Observe that the above problem is a discrete optimization problem due to the constraints imposed on λ, which can be collected in a set Λ, where

Λ = { λ ∈ R^c : λ_p ∈ {0, 1}, p = 1, ..., c, Σ_{p=1}^c λ_p = 1 }.

Interestingly, the search space Λ is small. In particular, it consists of the c canonical basis vectors

[1, 0, ..., 0]^T, [0, 1, ..., 0]^T, ..., [0, 0, ..., 1]^T.

Thus, one may solve OPT by enumerating all possible solutions and picking the one λ* that minimizes Q(λ). Then, the position of the nonzero entry in λ* yields the estimated unknown class. We call this algorithm MAnifold-based Smoothing under Constraints (MASC) and we show its main steps in Algorithm 1. The MASC algorithm has a complexity that is linear in the number of classes and quadratic in the number of samples. The construction of the k-NN graph (lines 4-6) scales as O(n²). Once the graph has been constructed, the enumeration of all possible solutions scales as O(c). We conclude that the total computational cost is O(n² + c).

IV. CLASSIFICATION OF MULTIPLE IMAGE SETS
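The enumeration in Algorithm 1 can be sketched as follows. This is our own illustrative code, which assumes a symmetric similarity matrix S (in which case the two interface sums of OPT coincide):

```python
import numpy as np

def masc(S, Y_l, c):
    """Sketch of Algorithm 1 (MASC). S: n x n normalized similarity
    matrix (assumed symmetric); Y_l: l x c labels of the labelled data;
    rows l..n-1 of the graph are the unlabelled observations. Returns
    the class hypothesis p minimizing the smoothness cost q(p) of OPT."""
    l = Y_l.shape[0]
    w = S[:l, l:].sum(axis=1)        # total interface weight per labelled i
    q = np.zeros(c)
    for p in range(c):
        lam = np.zeros(c)
        lam[p] = 1.0                 # candidate lambda = e_p
        diff = np.sum((Y_l - lam) ** 2, axis=1)   # ||Y_i - e_p||^2
        q[p] = 2.0 * diff @ w        # both interface sums (S symmetric)
    return int(np.argmin(q))
```

Note that, as Proposition 1 predicts, only the labelled/unlabelled interface block S[:l, l:] of the similarity matrix is touched.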
A. Handwritten digit classification
We evaluate the performance of the proposed MASC algorithm with respect to label propagation, in the context of handwritten digit classification. Multiple transformed images of the same digit class form a set of observations, which we want to assign to the correct class. We use two different data sets for our experimental evaluation: (i) a handwritten digit image collection and (ii) the USPS handwritten digit image collection. The first collection contains 20×16 binary images of "0" through "9", where each class contains 39 examples. The USPS collection contains 16×16 grayscale images of digits and each class contains 1100 examples.

Robustness to pattern transformations is a very important property of the classification of multiple observations. Transformation invariance can be reinforced in classification algorithms by augmenting the labelled examples with so-called virtual samples, denoted hereby as X^(vs) (see [8] for a similar approach). The virtual samples are essentially data samples that are generated artificially, by applying transformations to the original data samples. They are given the class labels of the original examples that they have been generated from, and are treated as labelled data. By including the virtual samples in the data set, any classification algorithm becomes more robust to transformations of the test examples. We therefore adopt this strategy in the proposed methods and we include n_vs virtual samples X^(vs) in our original data set, which is finally written as X = {X^(l), X^(vs), X^(u)}.

We compare the classification performance of the MASC algorithm with the label propagation (LP) method. In LP, the estimated class is computed by majority voting on the estimated class labels computed in Eq. (4). In our experiments, we use the same k-NN graph in combination with the Gaussian weights from Eq. (1) in both the LP and MASC methods. In order to determine the value of the parameter σ in Eq. (1), we adopt the following process: we pick randomly 1000 examples, compute their pairwise distances and then set σ equal to half of their median.

We first split the data sets into training and test sets by including 2 examples per class in the training set; the remaining examples are assigned to the test set. Each training sample is augmented by 4 virtual examples generated by successive rotations of it, where each rotation angle is sampled regularly in [−◦, ◦]. This interval has been chosen to be sufficiently small in order to avoid the confusion of digits '6' and '9'.
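The rotation-based augmentation can be sketched as below. This is a minimal NumPy-only sketch with nearest-neighbor resampling; the paper does not specify its exact interpolation scheme, and all names are ours:

```python
import numpy as np

def rotate_image(img, angle_deg):
    """Rotate a 2-D image about its center by angle_deg degrees,
    using inverse mapping with nearest-neighbor sampling."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # source coordinates for each output pixel (inverse rotation)
    sy = cy + (ys - cy) * np.cos(t) - (xs - cx) * np.sin(t)
    sx = cx + (ys - cy) * np.sin(t) + (xs - cx) * np.cos(t)
    sy = np.clip(np.round(sy).astype(int), 0, h - 1)
    sx = np.clip(np.round(sx).astype(int), 0, w - 1)
    return img[sy, sx]

def virtual_samples(img, angles):
    """Vectorized virtual samples X^(vs) generated from one labelled image."""
    return np.stack([rotate_image(img, a).ravel() for a in angles])
```

Each rotated copy keeps the class label of the original image and is appended to the labelled pool.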
Next, in order to build the unlabelled set X^(u) (i.e., the multiple observations) of a certain class, we randomly choose a sample from the test set of this class and then apply a random rotation on it, by a random (uniformly sampled) angle θ ∈ [−◦, ◦].

The number of nearest neighbors was set to k = 5 for both the binary digit collection and the USPS data set, in both methods. These values of k have been obtained by the best performance of LP on the test set. We try different sizes of the unlabelled set (i.e., multiple observations), namely m = [10 : 20 : 150] (in MATLAB notation). For each value of m, we report the average classification error rate across 100 random realizations of X^(u) generated from each one of the 10 classes. Thus, each point in the plot is an average over 1000 random experiments.

Figures 4(a) and 4(b) show the results over the binary digits and the USPS digits image collections, respectively. Observe first that increasing the number of observations gradually improves the classification error rate of both methods. This is expected, since more observations of a certain pattern give more evidence, which in turn results in higher confidence in the estimated class label. Finally, observe that the proposed MASC algorithm unsurprisingly outperforms LP in both data sets, since it is designed to exploit the particular structure of Problem 1.

∼roweis/data.html

Fig. 4. Classification error rate (%) of LP, MASC and TSVM on (a) the binary digits and (b) the USPS digits collections.
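For reference, the set-level decision of the LP baseline (majority voting over the per-frame labels derived from Eq. (4)) can be sketched as follows; the helper name is ours:

```python
import numpy as np
from collections import Counter

def lp_majority_vote(M_star, l):
    """Classify the unlabelled observations (rows l..n-1 of the real-valued
    label matrix M* of Eq. (4)) individually, then return the majority
    class of the whole observation set."""
    labels = M_star[l:].argmax(axis=1)
    return Counter(labels.tolist()).most_common(1)[0][0]
```

MASC, by contrast, never votes: it selects one shared label for the entire set directly from the smoothness cost.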
B. Object recognition from multi-view image sets
In this section we evaluate our graph-based algorithm in the context of object recognition from multi-view image sets. In this case, the different views are considered as multiple observations of the same object, and the problem is to recognize this object correctly. The proposed MASC method implements the Gaussian weights (1) and sets k = 5 in the construction of the k-NN graph. We compare MASC to well-known methods from the literature, which mostly gather algorithms based on either subspace analysis or density estimation (statistical methods):

• MSM. The Mutual Subspace Method [9], [10], which is the most well known representative of the subspace analysis methods. It represents each image set by a subspace spanned by the principal components, i.e., the eigenvectors of the covariance matrix. The comparison of a test image set with a training one is then achieved by computing the principal angles [11] between the two subspaces. In our experiments, the number of principal components has been set to nine, which has been found to provide the best performance.

• KMSM. MSM has been extended to its nonlinear version, called the Kernel Mutual Subspace Method (KMSM) [12], in order to take into account the nonlinearity of typical image sets. The main difference of KMSM from MSM is that the images are first nonlinearly mapped into a high dimensional feature space, before the modeling by linear subspaces takes place. In other words, KMSM uses kernel PCA instead of PCA in order to capture the nonlinearities in the data. In KMSM, we use the Gaussian kernel k(x, y) = exp(−‖x − y‖²/σ²), where σ is determined in exactly the same way as in the Gaussian weights of our MASC method.

• KLD. The KL-divergence algorithm by Shakhnarovich et al. [13] is the most popular representative of density-based statistical methods. It formulates the classification from multiple images as a statistical hypothesis testing problem.
Under the i.i.d. and Gaussian assumptions on the image sets, the classification problem typically boils down to the computation of the KL divergence between sets, which can be computed in closed form in this case. The energy cut-off, which determines the number of principal components used in the regularization of the covariance matrices, has been set to 0.96.

In our evaluation, we use the ETH-80 image set [14], which contains 80 object classes from 8 categories: apple, car, cow, cup, dog, horse, pear and tomato. Each category has 10 object classes
TABLE I
OBJECT RECOGNITION RATE IN THE MEAN (STD) FORMAT, MEASURED ON THE ETH-80

  MASC          MSM           KMSM         KLD
  88.88 (1.71)  74.88 (5.02)  83.25 (3.4)  52.5 (3.95)
DATABASE.

(see Fig. 5(a)). Each object class then consists of 41 views of the object, spaced evenly over the upper viewing hemisphere. Figure 5(b) shows the 41 views from a sample car object class. We use the cropped-close128 part of the database. All provided images are of size 128×128 and they are cropped, so that they contain only the object without any border area. We downsampled the images to size 32×32 for computational ease. No further preprocessing is done.

The 41 views from each object class are split randomly into 21 training and 20 test samples. In this case, the 20 different views in the test set correspond to the multiple observations of the test object. We perform 10 random experiments where the images are randomly split into training and test sets. Table I shows the average object recognition rate for each method. We also report the standard deviation of each method in parentheses. Notice that the subspace methods are superior to the KLD method, which assumes a Gaussian distribution of the data. Notice also that, as one would expect, KMSM outperforms MSM, which falls short of capturing the nonlinearities in the data. Finally, observe that our graph-based method clearly outperforms its competitors, as it is able to capture not only the nonlinearity but also the manifold structure of the data.

V. VIDEO-BASED FACE RECOGNITION
A. Experimental setup
In this section we evaluate our graph-based algorithm in the context of face recognition from video sequences. In this case, the different video frames are considered as multiple observations of the same person, and the problem consists in the correct classification of this person. We evaluate in this section the behavior of the MASC algorithm in realistic conditions, i.e., under variations in head pose, facial expression and illumination. Note in passing that our algorithm does not assume any temporal order between the frames; hence, it is also applicable to the generic problem of face recognition from image sets.

Fig. 5. Sample images from the ETH-80 database: (a) ETH-80; (b) 41 views of a sample car model.

We use two publicly available databases: the VidTIMIT [15] database and the first subset of the Honda/UCSD [16] database. The VidTIMIT database contains 43 individuals and there are three face sequences obtained from three different sessions per subject. The data set has been recorded in three sessions, with a mean delay of seven days between sessions one and two, and of six days between sessions two and three. In each video sequence, each person performed a head rotation sequence. In particular, the sequence consists of the person moving his/her head to the left, right, back to the center, up, then down, and finally returning to the center.

The Honda/UCSD database contains 59 sequences of 20 subjects. In contrast to the previous database, the individuals move their head freely, with different speeds and facial expressions. In each sequence, the subjects perform free in-plane and out-of-plane head rotations. Each person has between 2 and 5 video sequences and the number of sequences per subject is variable.

For preprocessing, in both databases, we first used P. Viola's face detector [17] in order to automatically extract the facial region from each frame. Note that this typically results in misaligned facial images. Next, we downsampled the facial images to size 32×32 for computational ease. No further preprocessing has been performed, which brings our experimental setup closer to real testing conditions.
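The downsampling step of the preprocessing can be sketched as below (our own nearest-neighbor resize sketch; the paper does not specify the resizing method used):

```python
import numpy as np

def downsample(img, size=32):
    """Nearest-neighbor resize of a 2-D face crop to size x size,
    as in the 32 x 32 preprocessing step (illustrative sketch)."""
    h, w = img.shape
    ys = np.arange(size) * h // size   # row indices to sample
    xs = np.arange(size) * w // size   # column indices to sample
    return img[np.ix_(ys, xs)]
```

The resulting 32×32 crop is then vectorized into a 1024-dimensional sample for the graph construction.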
B. Classification results on VidTIMIT
We first study the performance of the MASC algorithm on the VidTIMIT database. Figure 6 shows a few representative images from a sample face manifold in the VidTIMIT database; observe the presence of large head pose variations. Figure 7 shows the 3D projection of the manifold, obtained using the ONPP method [18], which has been shown to be an effective tool for data visualization. Notice the four clusters corresponding to the four different head poses, i.e., looking left, right, up and down. This indicates that a graph-based method should be able to capture the geometry of the manifold and propagate class labels based on the manifold structure.

Since there are three sessions, we use the following metric for evaluating the classification performance:

e = (1/6) Σ_{i=1}^{3} Σ_{j=1, j≠i}^{3} e(i, j),   (11)

where e(i, j) is the classification error rate when the i-th session is used as training set and the j-th session is used as test set. In other words, e is the average classification error rate calculated over the following six experiments: (1,2), (2,1), (1,3), (3,1), (2,3) and (3,2).

(The VidTIMIT database is available at http://users.rsise.anu.edu.au/∼conrad/vidtimit/ and the Honda/UCSD database at http://vision.ucsd.edu/∼leekc/HondaUCSDVideoDatabase/HondaUCSD.html.)

Fig. 6. Head pose variations (poses 1-8) in the VidTIMIT database.

We evaluate the video face recognition performance of all methods for diverse sizes of the training and test sets. The objective is to assess the robustness of the methods with respect to the size of the training and test sets. For this reason, each image set is re-sampled as

X_{i,r} = X_i(:, 1:r:n),  i = 1, ..., c.

In the above, the image set X_i is re-sampled with step r, i.e., only one image every r images is kept. In our experiments, we use different values of r, ranging from 4 to 16 with step 4. For each value of r, we measure the average classification error rate according to relation (11).

Table II shows the recognition performance for r ranging from 4 to 16 with step 4, and Figure 8 shows the same results graphically. Observe that the KLD method, which relies on density estimation, is sensitive to the amount of available data. Also, notice that MSM is superior to KLD, which is expected, since KLD relies on the imprecise assumption that the data follow a Gaussian distribution. Furthermore, KMSM, the nonlinear variant of MSM, outperforms the latter, which has trouble capturing the nonlinear structures in the data. Finally, we observe that MASC clearly outperforms its competitors in the vast majority of cases. At the same time, it stays robust to significant re-sampling of the data, since its performance remains almost the same for each value of r.

TABLE II
VIDEO FACE RECOGNITION RESULTS ON THE VIDTIMIT DATABASE (RECOGNITION RATE (%) OF MASC, MSM, KMSM AND KLD FOR r = 4, 8, 12, 16).

Fig. 7. A typical face manifold from the VidTIMIT database. Observe the four clusters corresponding to the four different head poses (face looking left, right, up and down).

Fig. 8. Video face recognition results on the VidTIMIT database.

C. Classification results on Honda/UCSD
We further study the video-based face recognition performance on the Honda/UCSD database. Figure 9 shows a few representative images from a sample face manifold in the Honda/UCSD database. Observe the presence of large head pose variations, along with facial expressions. The projection of the manifold onto the 3D space using ONPP again clearly shows the manifold structure of the data (see Figure 10), which implies that a graph-based method is well suited to this kind of data.

Fig. 9. Head pose variations (poses 1-8) in the Honda/UCSD database.

Fig. 10. A typical face manifold from the Honda/UCSD database.
The Honda/UCSD database comes with a default splitting into training and test sets, which contains 20 training and 39 test video sequences. We use this default setup and we report the classification performance of all methods under different data re-sampling rates. Similarly as above, both the training and test image sets are re-sampled with step r, i.e., X_{i,r} = X_i(:, 1:r:n), i = 1, ..., c.

Fig. 11. Video face recognition results on the Honda/UCSD database.
Recognition rate (%) MASC MSM KMSM KLD r = 4
100 84.62 87.18 84.62 r = 6
100 84.62 87.18 79.49 r = 8 r = 10 r = 12 IDEO FACE RECOGNITION RESULTS ON THE H ONDA /UCSD
DATABASE . shows the recognition rates, when r varies from 4 to 12 with step2. Figure 11 shows the same results graphically. Recall that largervalues of r imply sparser image sets. Observe again that KLD ismostly affected by r , by suffering loss in performance. This is notsurprising since it is a density-based method and densities cannot beaccurately estimated (in general) with a few samples. MSM seemsto be more robust, yielding better results than KLD, but as expected,it is inferior to KMSM in the majority of cases. Finally, MASC isagain the best performer and it exhibits very high robustness againstdata re-sampling.Regarding the relative performance of MASC and KMSM, weshould finally stress out that KMSM is a kernel technique thatattempts to capture the nonlinear structure of the data by assuminga linear model after applying a nonlinear mapping of the data into ahigh dimensional space. Although this methodology stays generic andpresents certain advantages, it is still not clear whether it is capableof capturing the individual (e.g., manifold) structure of diverse datasets. On the other hand, the MASC method explicitly relies on agraph model that may fit much better the manifold structure ofthe data. Furthermore, it provides a way to cope with the curseof dimensionality, since the intrinsic dimension of the manifolds istypically very small. We believe that graph methods have a greatpotential in this field. D. Video-based face recognition overview
For the sake of completeness, we briefly review in this last section the state of the art in video-based face recognition. One may typically distinguish between two main families of methods: those based on subspace analysis and those based on density estimation (statistical methods). The most representative methods of these two families are, respectively, the MSM [9], [10] and KMSM [12] methods, and the solution based on KLD [13], which have been used in the experiments above.

Among the methods based on subspace analysis, we should mention the extension of principal angles from subspaces to nonlinear manifolds. A recent article [19] proposed to represent the facial manifold by a collection of linear patches, which are recovered by a non-iterative algorithm that augments the current patch until a linearity criterion is violated. This manifold representation allows the distance between manifolds to be defined as an integration of distances between linear patches. For comparing two linear patches, the authors propose a distance measure that is a mixture of (i) principal angles and (ii) an exemplar-based distance. However, it is not clearly justified why such a mixture is needed and what its relative benefit is over the individual distances. Moreover, their proposed method requires the computation of both geodesic and Euclidean distances, as well as the setting of four parameters. On the contrary, our MASC method needs only one parameter (k) to be set and requires the computation of Euclidean distances only. Note finally that their method achieves results comparable to MASC on the Honda/UCSD database, but at a higher computational cost and at the price of tuning four parameters.

Along the same lines, the authors in [20] propose a similarity measure between manifolds that is a mixture of similarity between subspaces and similarity between local linear patches.
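The principal-angle machinery that underlies MSM and the patch distances discussed above can be sketched as follows, under the standard formulation: the cosines of the principal angles between two linear subspaces are the singular values of U1^T U2, where U1 and U2 are orthonormal bases of the subspaces. This is a generic sketch, not the specific implementation of any of the cited methods.

```python
import numpy as np

def principal_angle_cosines(A, B):
    """Cosines of the principal angles between span(A) and span(B).

    Columns of A and B span the two subspaces; orthonormal bases are
    obtained via reduced QR, and the cosines are the singular values
    of U1^T U2 (all in [0, 1]).
    """
    U1, _ = np.linalg.qr(A)
    U2, _ = np.linalg.qr(B)
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# Sanity check: a subspace compared with itself gives all cosines equal to 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
print(principal_angle_cosines(A, A))  # [1. 1. 1.]
```

MSM-style methods then score a pair of image sets by the largest cosine(s) between the subspaces fitted to the two sets.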
Each individual similarity is based on a weighted combination of principal angles, and those weights are learnt by AdaBoost for improved discriminative performance. In contrast to the previous paper [19], the linear patches are extracted here using mixtures of Probabilistic PCA (PPCA). PPCA mixture fitting is a highly non-trivial task, which requires an estimate of the local principal subspace dimension and also involves model selection. This step is quite computationally intensive, as noted in [19].

The main limitation of statistical methods such as KLD [13] is the inadequacy of the Gaussianity assumption for face image sets; face sequences rather have a manifold structure. Moreover, the test video frames are not independent, so the i.i.d. assumption is unrealistic as well. The authors in [21] therefore extend the KLD work by replacing the Gaussian densities with Gaussian Mixture Models (GMMs), which provide a more flexible method for density estimation. However, the KL divergence in this case cannot be computed in closed form, which forces the authors to resort to Monte Carlo simulations that are quite computationally intensive.

Finally, there have been a few other methods that cannot be directly categorized into the above families. The authors in [22] propose ensemble similarity metrics that are based on probabilistic distance measures, evaluated in Reproducing Kernel Hilbert Spaces. All computations are performed under the Gaussianity assumption, which is unfortunately not realistic for facial manifolds. In [23], the authors provide a probabilistic framework for face recognition from image sets. They model the identity as a discrete or continuous random variable and provide a statistical framework for estimating the identity by marginalizing over face localization, illumination and head pose.
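For reference, the closed-form KL divergence between two multivariate Gaussians, which statistical methods in the KLD family exploit and which is lost once GMMs are substituted, can be sketched as follows. This is the standard textbook formula, not code from any of the cited works.

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for full-rank covariance matrices:
    0.5 * [ tr(S1^-1 S0) + (mu1-mu0)^T S1^-1 (mu1-mu0) - d + ln(det S1 / det S0) ]
    """
    d = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0)
                  + diff @ S1_inv @ diff
                  - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Identical Gaussians have zero divergence.
mu = np.zeros(3)
S = np.eye(3)
print(gaussian_kl(mu, S, mu, S))  # 0.0
```

For mixtures of Gaussians no such closed form exists, which is why [21] must fall back on Monte Carlo estimation.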
Illumination-invariant basis vectors are learnt for each (discretized) pose, and the resulting subspace is used for representing the low-dimensional vector that encodes the subject identity. However, the statistical framework requires the computation of several integrals that are numerically approximated. Also, the proposed method assumes that training images are available for every subject at each possible pose and illumination, which is hard to satisfy in practice.

X. Liu and T. Chen in [24] proposed a methodology based on adaptive hidden Markov models for video-based face recognition. The temporal dynamics of each subject are learnt during training and subsequently used for recognition. However, the proposed approach assumes a temporal order of the frames in the face sequence and is unfortunately not applicable to the more generic problem of recognition from image sets. The study in [25] further investigates how the performance of this approach is affected by the face sequence length and the image quality.

VI. CONCLUSIONS
In this paper, we have addressed the problem of classification of multiple observations of the same object. We have proposed to exploit the specific structure of this problem in a graph-based algorithm inspired by label propagation. The graph-based algorithm relies on the smoothness assumption on the manifold in order to learn the unknown label matrix, under the constraint that all observations correspond to the same class. We have formulated this process as a discrete optimization problem that can be solved efficiently by a low complexity algorithm. We have provided experimental results that illustrate the performance of the proposed solution for the classification of handwritten digits, for object recognition and for video-based face recognition. In the two latter cases, the graph-based solution outperforms state-of-the-art methods on three publicly available data sets. This clearly outlines the potential of the proposed graph-based solution, which is able to advantageously capture the structure of image manifolds.

REFERENCES

[1] C. Stauffer. Minimally-supervised classification using multiple observation sets. IEEE Int. Conf. on Computer Vision (ICCV), 2003.
[2] E. Kokiopoulou, S. Pirillos, and P. Frossard. Graph-based classification for multiple observations of transformed patterns. IEEE Int. Conf. on Pattern Recognition (ICPR), December 2008.
[3] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems (NIPS), 2003.
[4] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2006.
[5] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, Pittsburgh, 2002.
[6] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. Advances in Neural Information Processing Systems (NIPS), 2002.
[7] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions, 2003.
[8] A. Pozdnoukhov and S. Bengio. Graph-based transformation manifolds for invariant pattern recognition with kernel methods. IEEE Int. Conf. on Pattern Recognition (ICPR), 2006.
[9] K. Fukui and O. Yamaguchi. Face recognition using multi-viewpoint patterns for robot vision. Int. Symp. on Robotics Research, 15:192-201, 2005.
[10] O. Yamaguchi, K. Fukui, and K. Maeda. Face recognition using temporal image sequence. IEEE Int. Conf. on Automatic Face and Gesture Recognition, pages 318-323, 1998.
[11] G. H. Golub and C. Van Loan. Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore, 1996.
[12] H. Sakano and N. Mukawa. Kernel mutual subspace method for robust facial image recognition, 2000.
[13] G. Shakhnarovich, J. W. Fisher, and T. Darrell. Face recognition from long-term observations. European Conference on Computer Vision (ECCV), 3:851-868, 2002.
[14] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object categorization. Int. Conf. on Computer Vision and Pattern Recognition (CVPR'03), 2003.
[15] C. Sanderson. Biometric Person Recognition: Face, Speech and Fusion. VDM-Verlag, 2008.
[16] K. C. Lee, J. Ho, M. H. Yang, and D. Kriegman. Video-based face recognition using probabilistic appearance manifolds. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), pages 313-320, 2003.
[17] P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137-154, 2004.
[18] E. Kokiopoulou and Y. Saad. Orthogonal neighborhood preserving projections: A projection-based dimensionality reduction technique. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2143-2156, December 2007.
[19] R. Wang, S. Shan, X. Chen, and W. Gao. Manifold-manifold distance with application to face recognition based on image set. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2008.
[20] T.-K. Kim, O. Arandjelović, and R. Cipolla. Boosted manifold principal angles for image set-based recognition. Pattern Recognition, 40:2475-2484, 2007.
[21] O. Arandjelović, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell. Face recognition with image sets using manifold density divergence. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 1:581-588, 2005.
[22] S. Zhou and R. Chellappa. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):917-929, June 2006.
[23] S. K. Zhou and R. Chellappa. Probabilistic identity characterization for face recognition. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2:805-812, 2004.
[24] X. Liu and T. Chen. Video-based face recognition using adaptive hidden Markov models. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 1:I-340-I-345, 2003.
[25] A. Hadid and M. Pietikainen. From still image to video-based face recognition: an experimental analysis.