A Latent Variable Model for Two-Dimensional Canonical Correlation Analysis and its Variational Inference
Mehran Safayani and Saeid Momenzadeh
Abstract
Describing dimension reduction (DR) techniques by means of probabilistic models has recently received special attention. In addition to offering a better interpretation of DR methods, probabilistic models provide a framework for further extensions of such algorithms. One of the new directions in probabilistic DR is to preserve the internal structure of the data, meaning that the data need not be converted from matrix or tensor format to vector format during dimensionality reduction. In this paper, a latent variable model for matrix-variate data for canonical correlation analysis (CCA) is proposed. Since in general there is no analytical maximum likelihood solution for this model, we present two approaches for learning the parameters. The proposed methods are evaluated on synthetic data in terms of convergence and quality of the mappings. A real data set is also employed to compare the proposed methods with several probabilistic and non-probabilistic CCA-based approaches. The results confirm the superiority of the proposed methods over the competing algorithms. Moreover, this model can be considered as a framework for further extensions.
Index Terms
Canonical Correlation Analysis, Probabilistic dimension reduction, Matrix-variate distribution, Latent variable model.
I. INTRODUCTION
Recently, probabilistic interpretations of statistical dimension reduction techniques in the subspace domain have been applied in different applications [1]–[3]. Probabilistic dimension reduction models offer many benefits, including the handling of missing and outlier data [4], automatic selection of the number of projection vectors [5], and the extension of standard dimension reduction methods to more complex ones such as mixture models [6], [7] or non-linear models [8], [9]. Tipping and Bishop presented a probabilistic model for principal component analysis (PCA), called probabilistic PCA (PPCA), and showed that a relation can be found between the projections extracted by PCA and the maximum likelihood solution of a restricted factor analysis model [10]. Lawrence proposed a dual model for PPCA and extended it to the non-linear case through Gaussian processes [8]. Bach and Jordan presented a probabilistic interpretation of canonical correlation analysis (CCA), called probabilistic CCA (PCCA) [11]. In this method, a latent variable model is used to describe two Gaussian random vectors. Recently, different variations of PCCA have also been proposed [12]–[14].

All the above mentioned methods assume that the input data or features are described as vectors. However, in many applications we encounter data that have an intrinsic structure such as a matrix or tensor. For example, in face recognition the pixels of an image can be considered as features with a matrix structure, and in image processing 2D Gabor functions are frequently used for feature extraction, the outputs of which also have a matrix structure [15]. In traditional dimension reduction approaches these structures are usually broken and the features are concatenated into a long vector. However, this leads to the small sample size problem and increases the computational cost due to the large matrices involved [16]. Therefore, in recent years attention has been paid to algorithms that do not require converting the data into vectors as a preprocessing step. Two-dimensional PCA (2DPCA) [16], generalized low rank approximation of matrices (GLRAM) [17] and two-dimensional CCA (2DCCA) [18], [19] are among the first examples of such algorithms.

The probabilistic interpretation of two-or-more-dimensional subspace feature extraction techniques is also an active research field. Tao et al. presented a decoupled Bayesian tensor analysis model which can reduce the dimensions of tensor data and automatically determine the appropriate dimensions [20]. In [21], by applying matrix-variate distributions [22], a probabilistic higher-order PCA for matrix-variate data was introduced and variational expectation-maximization (EM) was applied for learning the parameters of the model. Another probabilistic model for 2DPCA, called bilinear PPCA (BPPCA), was proposed by Zhao et al.; it extends the previous models by defining three separate noise models called column, row and common noise. BPPCA formulates the model with a two-stage representation and the parameters are computed with both maximum likelihood estimation (MLE) and EM [23]. Safayani et al. introduced probabilistic 2DCCA (P2DCCA), in which two models on the columns and rows of the images, called the left and right probabilistic models, are defined [24]. For learning the parameters, it is assumed that the parameters of the right probabilistic model are known and the observations
are projected to the corresponding latent spaces. Then the parameters of the left probabilistic model are estimated using the EM algorithm. In a similar procedure, and in parallel with the left probabilistic model, the parameters of the right probabilistic model are learned. This procedure is repeated until convergence. In this approach, it is assumed that the columns of the observation matrix are independent, and its probability distribution is constructed as the product of the probabilities of the corresponding columns. Also, because two separate models exist on the rows and the columns, a pair of data cannot be mapped to a single latent space, so the model is not a generative model.

In this paper, a probabilistic model for CCA with a matrix data assumption is presented. In this model, two random matrices are related through a latent variable that has a matrix-variate normal distribution. Since there is no closed-form solution for learning the parameters of this model, two approaches are proposed. The first approach, called unilateral matrix-variate CCA (UMVCCA), assumes that the latent variables are projected from only one side (rows or columns); with this assumption, the model parameters are estimated using the EM algorithm. In the second approach, this simplifying assumption is dropped and the latent variables are projected from both sides (rows and columns). This model is named bilateral matrix-variate CCA (BMVCCA). To learn its parameters, an algorithm based on variational EM [25] is proposed. In the learning algorithm, the posterior distribution is approximated by a matrix-variate normal distribution and a lower bound of the log-likelihood is maximized with respect to the variational parameters, which here are the mean matrix and the two covariance matrices of the matrix-variate normal distribution.

The proposed algorithms are first assessed using synthetic data in terms of convergence and the accuracy of the mapping matrices, and are then evaluated on the "NIR-VIS 2.0" [26] face database, where their results are compared with competing algorithms such as CCA, 2DCCA [18], PCCA [11] and P2DCCA [24]. The rest of this paper is organized as follows: in Section II, the matrix-variate normal distribution as well as CCA, 2DCCA and PCCA are briefly reviewed. The proposed method is presented in Section III. Section IV is devoted to assessing the proposed methods using both synthetic and real data. Finally, the paper is concluded in Section V.

M. Safayani is with the Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran, e-mail: [email protected] (corresponding author). S. Momenzadeh is with the Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran, e-mail: [email protected].
II. RELATED WORKS
A. Matrix-variate normal distribution
Since in two-dimensional probabilistic models we deal with random matrices (instead of random vectors), it is convenient to use matrix-variate distributions to model them. In general, matrix-variate distributions are a 2D generalization of multivariate distributions [22]. The matrix-variate normal distribution $X \sim \mathcal{MN}(M, \Sigma, \Phi)$ is the most famous one and is defined as follows:

$$p(X) = (2\pi)^{-\frac{mn}{2}} |\Sigma|^{-\frac{n}{2}} |\Phi|^{-\frac{m}{2}} \exp\left[-\tfrac{1}{2}\,\mathrm{tr}\left(\Sigma^{-1}(X-M)\Phi^{-1}(X-M)'\right)\right], \quad (1)$$

where $X \in \mathbb{R}^{m \times n}$ is a random matrix, $M \in \mathbb{R}^{m \times n}$ is the mean matrix, $\Sigma \in \mathbb{R}^{m \times m} \succ 0$ and $\Phi \in \mathbb{R}^{n \times n} \succ 0$ are the column and row covariance matrices respectively, and $\mathrm{tr}(A)$ denotes the trace of matrix $A$. Also, it can be shown that if $X \sim \mathcal{MN}(M, \Sigma, \Phi)$ then $\mathrm{vec}(X) \sim \mathcal{N}(\mathrm{vec}(M), \Phi \otimes \Sigma)$, where $\otimes$ is the Kronecker product [22].
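As a quick illustration of the density (1) and of the vec–Kronecker relation above, the following sketch evaluates both forms on random inputs and checks that they agree. The function name matnorm_logpdf, the use of SciPy, and the column-stacking vec convention are choices of this example, not notation from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def matnorm_logpdf(X, M, Sigma, Phi):
    # Log-density of the matrix-variate normal MN(M, Sigma, Phi), Eq. (1).
    m, n = X.shape
    D = X - M
    quad = np.trace(np.linalg.solve(Sigma, D) @ np.linalg.solve(Phi, D.T))
    return -0.5 * (m * n * np.log(2 * np.pi)
                   + n * np.linalg.slogdet(Sigma)[1]
                   + m * np.linalg.slogdet(Phi)[1]
                   + quad)

rng = np.random.default_rng(0)
m, n = 3, 4
M = rng.standard_normal((m, n))
A = rng.standard_normal((m, m)); Sigma = A @ A.T + m * np.eye(m)   # column covariance
B = rng.standard_normal((n, n)); Phi = B @ B.T + n * np.eye(n)     # row covariance
X = rng.standard_normal((m, n))

# Same density through vec(X) ~ N(vec(M), kron(Phi, Sigma)), column-stacking vec.
lp_vec = multivariate_normal.logpdf(X.flatten(order="F"), M.flatten(order="F"),
                                    np.kron(Phi, Sigma))
print(matnorm_logpdf(X, M, Sigma, Phi), lp_vec)   # the two values agree
```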
B. Canonical Correlation Analysis (CCA)

Canonical correlation analysis is a method that seeks relationships between two multivariate sets of variables [27]. CCA maps each set of data to a common subspace in which the correlation between the two data sets is maximized. For example, assume that $x_1 \in \mathbb{R}^{m_1}$ and $x_2 \in \mathbb{R}^{m_2}$ are two random vectors. CCA finds two linear mappings $w_1' x_1$ and $w_2' x_2$ for which the following criterion is maximized:

$$\arg\max_{w_1, w_2}\ \mathrm{cov}\left(w_1' x_1,\, w_2' x_2\right), \quad (2)$$
$$\text{s.t.}\quad \mathrm{var}(w_1' x_1) = 1,\quad \mathrm{var}(w_2' x_2) = 1.$$

It can be shown that the optimal $w_1$ and $w_2$ can be obtained by solving the following eigen-problems:

$$C_{11}^{-1} C_{12} C_{22}^{-1} C_{21}\, w_1 = \lambda\, w_1, \quad (3)$$
$$C_{22}^{-1} C_{21} C_{11}^{-1} C_{12}\, w_2 = \lambda\, w_2, \quad (4)$$

where $C_{11}$ and $C_{22}$ are the autocovariance matrices of the random vectors $x_1$ and $x_2$ respectively, $C_{12}$ is their cross-covariance matrix, and $\lambda$ is the largest eigenvalue, equal to the square of the canonical correlation.
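A small NumPy sketch of classical CCA solved through the eigen-problems (3) and (4) follows; the synthetic data, the variable names and the normalization step are illustrative choices of this example.

```python
import numpy as np

def cca(X1, X2, d):
    # X1: (N, m1), X2: (N, m2), zero-mean data; returns d pairs of canonical directions.
    N = X1.shape[0]
    C11 = X1.T @ X1 / N
    C22 = X2.T @ X2 / N
    C12 = X1.T @ X2 / N
    # Eigen-problem (3): C11^{-1} C12 C22^{-1} C21 w1 = lambda w1
    M1 = np.linalg.solve(C11, C12) @ np.linalg.solve(C22, C12.T)
    vals, vecs = np.linalg.eig(M1)
    order = np.argsort(-vals.real)[:d]
    W1 = vecs[:, order].real
    # Corresponding w2 (up to scale): w2 is proportional to C22^{-1} C21 w1
    W2 = np.linalg.solve(C22, C12.T) @ W1
    # Rescale so that var(w' x) = 1, as required by the constraints in (2)
    W1 /= np.sqrt(np.diag(W1.T @ C11 @ W1))
    W2 /= np.sqrt(np.diag(W2.T @ C22 @ W2))
    return W1, W2

rng = np.random.default_rng(0)
z = rng.standard_normal((500, 2))
X1 = z @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((500, 5))
X2 = z @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((500, 6))
X1 -= X1.mean(0); X2 -= X2.mean(0)
W1, W2 = cca(X1, X2, 2)
print(np.corrcoef((X1 @ W1)[:, 0], (X2 @ W2)[:, 0])[0, 1])   # first canonical correlation
```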
C. Two-dimensional CCA (2DCCA)

Let $\{X_{jn} \in \mathbb{R}^{m_j \times n_j} \mid j = 1, 2,\ n = 1, \ldots, N\}$ be realizations of the random matrix variables $X_j$, $j = 1, 2$. Without loss of generality, we assume here that the random variables are zero mean. 2DCCA tries to obtain projection vectors $l_j$ and $r_j$, $j = 1, 2$, that maximize the following optimization problem:

$$\arg\max_{l_1, r_1, l_2, r_2}\ \mathrm{cov}\left(l_1' X_1 r_1,\, l_2' X_2 r_2\right), \quad (5)$$
$$\text{s.t.}\quad \mathrm{var}(l_1' X_1 r_1) = 1,\quad \mathrm{var}(l_2' X_2 r_2) = 1.$$

Since there is no closed-form solution for this problem, 2DCCA first assumes that $r_j$, $j = 1, 2$, are fixed; after some simplifications the optimization becomes

$$\arg\max_{l_1, l_2}\ l_1' \Sigma^{r}_{12} l_2, \quad (6)$$
$$\text{s.t.}\quad l_1' \Sigma^{r}_{11} l_1 = 1,\quad l_2' \Sigma^{r}_{22} l_2 = 1,$$

where $\Sigma^{r}_{ij} = \frac{1}{N}\sum_{n=1}^{N} X_{in} r_i r_j' X_{jn}'$. Then the $l_j$, $j = 1, 2$, are assumed to be fixed and the following problem is obtained:

$$\arg\max_{r_1, r_2}\ r_1' \Sigma^{l}_{12} r_2, \quad (7)$$
$$\text{s.t.}\quad r_1' \Sigma^{l}_{11} r_1 = 1,\quad r_2' \Sigma^{l}_{22} r_2 = 1,$$

where $\Sigma^{l}_{ij} = \frac{1}{N}\sum_{n=1}^{N} X_{in}' l_i l_j' X_{jn}$. By solving (6) and (7) iteratively until convergence, the optimal transformations $\{l_1, r_1, l_2, r_2\}$ are obtained. It can be shown that the optimizations (6) and (7) convert to the following generalized eigen-problems:

$$\begin{bmatrix} 0 & \Sigma^{r}_{12} \\ \Sigma^{r}_{21} & 0 \end{bmatrix}\begin{bmatrix} l_1 \\ l_2 \end{bmatrix} = \lambda \begin{bmatrix} \Sigma^{r}_{11} & 0 \\ 0 & \Sigma^{r}_{22} \end{bmatrix}\begin{bmatrix} l_1 \\ l_2 \end{bmatrix}, \quad (8)$$

$$\begin{bmatrix} 0 & \Sigma^{l}_{12} \\ \Sigma^{l}_{21} & 0 \end{bmatrix}\begin{bmatrix} r_1 \\ r_2 \end{bmatrix} = \lambda \begin{bmatrix} \Sigma^{l}_{11} & 0 \\ 0 & \Sigma^{l}_{22} \end{bmatrix}\begin{bmatrix} r_1 \\ r_2 \end{bmatrix}. \quad (9)$$

The $d$ largest eigenvectors of (8) generate the columns of the matrices $L_j$, $j = 1, 2$; similarly, the $d$ largest eigenvectors of (9) determine the columns of the matrices $R_j$, $j = 1, 2$.
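The alternating procedure can be summarized in a few lines of NumPy. The sketch below, with an illustrative initialization and a fixed number of iterations rather than a convergence test, follows Eqs. (6)–(9) but is not the reference implementation of [18].

```python
import numpy as np

def top_gen_eigvecs(A, B, d):
    # Largest-d generalized eigenvectors of A v = lambda B v (B assumed invertible).
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(-vals.real)[:d]
    return vecs[:, order].real

def twod_cca(X1, X2, d, n_iter=20):
    # X1, X2: arrays of shape (N, m_j, n_j), assumed zero-mean.
    N, m1, n1 = X1.shape
    _, m2, n2 = X2.shape
    R1, R2 = np.eye(n1, d), np.eye(n2, d)
    for _ in range(n_iter):
        # Left step, Eqs. (6)/(8): Sigma^r_ij computed from the right-projected data.
        Sr11 = sum(Xa @ R1 @ R1.T @ Xa.T for Xa in X1) / N
        Sr22 = sum(Xb @ R2 @ R2.T @ Xb.T for Xb in X2) / N
        Sr12 = sum(Xa @ R1 @ R2.T @ Xb.T for Xa, Xb in zip(X1, X2)) / N
        V = top_gen_eigvecs(
            np.block([[np.zeros((m1, m1)), Sr12], [Sr12.T, np.zeros((m2, m2))]]),
            np.block([[Sr11, np.zeros((m1, m2))], [np.zeros((m2, m1)), Sr22]]), d)
        L1, L2 = V[:m1], V[m1:]
        # Right step, Eqs. (7)/(9): Sigma^l_ij computed from the left-projected data.
        Sl11 = sum(Xa.T @ L1 @ L1.T @ Xa for Xa in X1) / N
        Sl22 = sum(Xb.T @ L2 @ L2.T @ Xb for Xb in X2) / N
        Sl12 = sum(Xa.T @ L1 @ L2.T @ Xb for Xa, Xb in zip(X1, X2)) / N
        V = top_gen_eigvecs(
            np.block([[np.zeros((n1, n1)), Sl12], [Sl12.T, np.zeros((n2, n2))]]),
            np.block([[Sl11, np.zeros((n1, n2))], [np.zeros((n2, n1)), Sl22]]), d)
        R1, R2 = V[:n1], V[n1:]
    return L1, R1, L2, R2
```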
D. Probabilistic CCA (PCCA)

PCCA defines the following latent variable model [11]:

$$x_j = W_j z + \mu_j + \epsilon_j, \quad j \in \{1, 2\}, \quad (10)$$

where $x_j$ is the observation random vector, $z$ is the latent vector with a multivariate normal distribution with zero mean and identity covariance matrix, $\epsilon_j$ is the residual noise vector, which follows a multivariate normal distribution with zero mean and covariance matrix $\Psi_j$, and $\mu_j$ is the mean vector of the random vector $x_j$. In this model, conditioned on the latent variable $z$, $x_1$ and $x_2$ are independent. It can be shown that the following distributions result from (10):

$$p(x_j \mid z) = \mathcal{N}(W_j z + \mu_j,\, \Psi_j), \quad (11)$$
$$p(x \mid z) = \mathcal{N}(W z + \mu,\, \Psi), \quad (12)$$
$$p(x) = \mathcal{N}(\mu,\, \Sigma), \quad (13)$$

where $x = [x_1', x_2']'$, $W = [W_1', W_2']'$, $\mu = [\mu_1', \mu_2']'$,
$$\Sigma = \begin{bmatrix} W_1 W_1' + \Psi_1 & W_1 W_2' \\ W_2 W_1' & W_2 W_2' + \Psi_2 \end{bmatrix},$$
and $\Psi$ is the block-diagonal matrix with blocks $\Psi_1$ and $\Psi_2$. Let $x_{1n}\mid_{n=1}^{N}$ and $x_{2n}\mid_{n=1}^{N}$ denote a set of observation vectors. The maximum likelihood estimates of the parameters $\theta = (W, \Psi)$ can be obtained by maximizing

$$\mathcal{L}(\theta) = -\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}\mathrm{tr}\left(\Sigma^{-1}(x_n - \mu)(x_n - \mu)'\right) + \mathrm{const}, \quad (14)$$

which leads to

$$W_{j,\mathrm{ML}} = \widetilde{\Sigma}_{jj} U_{jd} S_j, \quad j \in \{1, 2\}, \quad (15)$$
$$\Psi_{j,\mathrm{ML}} = \widetilde{\Sigma}_{jj} - W_{j,\mathrm{ML}} W_{j,\mathrm{ML}}', \quad j \in \{1, 2\}, \quad (16)$$

where $S_1$ and $S_2$ are arbitrary matrices such that $S_1 S_2' = C_d$, $C_d$ is the diagonal matrix of the first $d$ canonical correlations, $U_{jd}$ consists of the first $d$ canonical directions, and $\widetilde{\Sigma}_{jj}$ is the sample covariance matrix of $x_j$.

In [11], an iterative algorithm based on EM was also proposed to maximize (14). For this purpose, $z_n\mid_{n=1}^{N}$ are treated as missing values in the optimization and the complete log-likelihood is

$$\mathcal{L}(\theta) = \sum_{n=1}^{N}\{\ln P(x_n \mid z_n) + \ln P(z_n)\}. \quad (17)$$

By substituting (12) into (17) and some algebra, the expected value of the log-likelihood with respect to the posterior distribution is obtained:

$$Q(\theta \mid \theta^{(t)}) = \sum_{n=1}^{N}\Big\{-\tfrac{1}{2}\ln|\Psi| - \tfrac{1}{2}\,x_n'\Psi^{-1}x_n - \tfrac{1}{2}\,\mathrm{tr}(\langle z_n z_n'\rangle) - \tfrac{1}{2}\,\mathrm{tr}(W'\Psi^{-1}W\langle z_n z_n'\rangle) + \langle z_n\rangle' W'\Psi^{-1} x_n\Big\}, \quad (18)$$

where

$$\langle z_n\rangle = M W'\Psi^{-1} x_n, \quad (19)$$
$$\langle z_n z_n'\rangle = M + \langle z_n\rangle\langle z_n\rangle', \quad (20)$$
$$M = (W'\Psi^{-1}W + I)^{-1}. \quad (21)$$

By maximizing (18) with respect to the parameters and substituting (19) and (20) into the solutions, the final update formulas are obtained:

$$W_{t+1} = \widetilde{\Sigma}\Psi_t^{-1}W_t M_t\left(M_t + M_t W_t'\Psi_t^{-1}\widetilde{\Sigma}\Psi_t^{-1}W_t M_t\right)^{-1}, \quad (22)$$
$$\Psi_{t+1} = \widetilde{\Sigma} - \widetilde{\Sigma}\Psi_t^{-1}W_t M_t W_{t+1}'. \quad (23)$$
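For reference, the EM iteration (19)–(23) can be written compactly as below. The stacking of the two views, the random initialization and the fixed number of iterations are choices of this sketch, and the block-diagonal constraint that PCCA places on $\Psi$ is not enforced here.

```python
import numpy as np

def pcca_em(X1, X2, d, n_iter=100, seed=0):
    # X1: (N, m1), X2: (N, m2), both zero-mean; returns the loading matrix W = [W1; W2]
    # and the noise covariance Psi of the stacked model (10)-(13).
    X = np.hstack([X1, X2])
    N, m = X.shape
    Sigma_t = X.T @ X / N                              # sample covariance Sigma_tilde
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))
    Psi = np.eye(m)
    for _ in range(n_iter):
        PsiinvW = np.linalg.solve(Psi, W)
        M = np.linalg.inv(W.T @ PsiinvW + np.eye(d))       # Eq. (21)
        G = Sigma_t @ PsiinvW @ M                          # Sigma_tilde Psi^{-1} W M
        W_new = G @ np.linalg.inv(M + M @ PsiinvW.T @ G)   # Eq. (22)
        Psi = Sigma_t - G @ W_new.T                        # Eq. (23)
        W = W_new
    return W, Psi
```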
III. LATENT VARIABLE MODEL FOR TWO-DIMENSIONAL CCA

Inspired by [11], [21], the proposed model extends PCCA to matrix data as follows:

$$X_j = L_j Z R_j' + M_j + \Xi_j, \quad j \in \{1, 2\}, \quad (24)$$

where $j$ is the index of the observation variable, $X_j \in \mathbb{R}^{m_j \times n_j}$ and $\Xi_j \in \mathbb{R}^{m_j \times n_j}$ are the observed variable and the residual noise matrix respectively, $Z \in \mathbb{R}^{d \times d}$ is the latent matrix variable, $M_j \in \mathbb{R}^{m_j \times n_j}$ indicates the mean of the corresponding observed variable (without loss of generality, from here on we assume that the observed variables are zero mean), and $L_j \in \mathbb{R}^{m_j \times d}$ and $R_j \in \mathbb{R}^{n_j \times d}$ are the left and right projection matrices respectively. The latent and noise matrix variables have the following distributions:

$$p(Z) = \mathcal{MN}(0, I, I), \quad (25)$$
$$p(\Xi_j) = \mathcal{MN}(0, \Psi_{jL}, \Psi_{jR}), \quad \Psi_{jL}, \Psi_{jR} \succeq 0, \quad j \in \{1, 2\}, \quad (26)$$

where $\Psi_{jL} \in \mathbb{R}^{m_j \times m_j}$ and $\Psi_{jR} \in \mathbb{R}^{n_j \times n_j}$ are the positive-semidefinite covariance matrices of the noise matrix variables. From the linear model in equation (24) and the noise distributions in equation (26), the conditional distribution of the observation variables given the latent matrix is obtained as

$$p(X_j \mid Z) = \mathcal{MN}(L_j Z R_j', \Psi_{jL}, \Psi_{jR}), \quad j \in \{1, 2\}. \quad (27)$$

This model is an extension of PCCA [11]; the difference is that here the observed, latent and noise variables are represented as matrices instead of vectors. It can be shown that the vector form of (24) is

$$\mathrm{vec}(X_j) = W_j\,\mathrm{vec}(Z) + \mathrm{vec}(M_j) + \mathrm{vec}(\Xi_j), \quad j \in \{1, 2\}, \quad (28)$$
$$p(\mathrm{vec}(Z)) = \mathcal{N}(0, I), \quad (29)$$
$$p(\mathrm{vec}(\Xi_j)) = \mathcal{N}(0, \Psi_j), \quad \Psi_j \succeq 0, \quad j \in \{1, 2\}, \quad (30)$$

where $\mathrm{vec}(\cdot)$ is the operator that vectorizes the input matrix by concatenating its columns, $W_j = (R_j \otimes L_j)$ and $\Psi_j = \Psi_{jR} \otimes \Psi_{jL}$. As can be observed, the projection matrix $W_j$ is the Kronecker product of the right and left projection matrices, and similarly the noise covariance matrix $\Psi_j$ decomposes into the right and left noise covariance matrices. Therefore, the number of free parameters of model (28) is much smaller than that of model (10).

The joint probability distribution of a data pair and the corresponding subspace representation can be written as

$$P(X_1, X_2, Z) = P(X_1 \mid Z)\,P(X_2 \mid Z)\,P(Z). \quad (31)$$

While in PCCA the likelihood function of the observed data, i.e., $P(x_1, x_2)$, can be obtained by integrating out the latent variable, here, due to the use of matrix-variate distributions, in the general case $P(X_1, X_2)$ and also the posterior distribution $P(Z \mid X_1, X_2)$ do not follow a matrix-variate normal distribution. Therefore, we propose two approaches for learning the parameters. In the first one, we simplify the model by assuming only one projection matrix (left or right), so that the posterior distribution can be derived as a matrix-variate normal distribution and a solution based on the EM algorithm can be provided. We call this approach the unilateral matrix-variate CCA model (UMVCCA). In the second approach, which we call the bilateral matrix-variate CCA model (BMVCCA), we consider the model in the general case (with both left and right projections), estimate the posterior distribution $P(Z \mid X_1, X_2)$ with a parametric matrix-variate normal distribution $q(Z)$, and maximize a lower bound of the log-likelihood using the variational EM algorithm [25]. Each of these approaches is discussed below.

A. Unilateral matrix variate CCA (UMVCCA)
We assume that one of the left or right mapping matrices (here, without loss of generality, the left mapping matrix) in (24) is replaced by the identity matrix, so we have

$$X_j = Z R_j' + \Xi_j, \quad j \in \{1, 2\}, \quad (32)$$
$$p(\Xi_j) = \mathcal{MN}(0, I, \Psi_{jR}), \quad (33)$$

where $Z \in \mathbb{R}^{m_j \times d}$. With the assumption $m_1 = m_2 = m$, the above models can be reformulated as one factor analysis model:

$$X = Z R' + \Xi, \quad (34)$$

where $X = [X_1, X_2] \in \mathbb{R}^{m \times (n_1 + n_2)}$, $R = [R_1', R_2']' \in \mathbb{R}^{(n_1 + n_2) \times d}$ and $\Xi = [\Xi_1, \Xi_2] \in \mathbb{R}^{m \times (n_1 + n_2)}$. The following distributions can be obtained:

$$p(\Xi) = \mathcal{MN}\left(0,\, I,\, \begin{bmatrix} \Psi_{1R} & 0 \\ 0 & \Psi_{2R} \end{bmatrix}\right), \quad (35)$$
$$p(X \mid Z) = \mathcal{MN}(Z R', I, \Psi_R), \quad (36)$$
$$P(Z \mid X) = \mathcal{MN}(X \Psi_R^{-1} R S,\, I,\, S), \quad (37)$$
$$S = (R' \Psi_R^{-1} R + I)^{-1}, \quad (38)$$

where $\Psi_R = \begin{bmatrix} \Psi_{1R} & 0 \\ 0 & \Psi_{2R} \end{bmatrix}$.

For learning the parameters of UMVCCA, we use the well-known EM algorithm. Let $X_{1n}\mid_{n=1}^{N}$ and $X_{2n}\mid_{n=1}^{N}$ be $N$ pairs of training data. $\{X_n, Z_n\}_{n=1}^{N}$ is the complete data and the complete log-likelihood is

$$\mathcal{L}(\theta) = \sum_{n=1}^{N}\ln\{P(X_n, Z_n)\} = \sum_{n=1}^{N}\ln\{P(X_n \mid Z_n)\,P(Z_n)\}. \quad (39)$$

The rest of the derivation is straightforward. The final EM update formulas are

$$R^{*} = \widetilde{\Sigma}\Psi_R^{-1} R S\left[m S + S R'\Psi_R^{-1}\widetilde{\Sigma}\Psi_R^{-1} R S\right]^{-1}, \quad (40)$$
$$\Psi_R^{*} = \frac{1}{m}\widetilde{\Sigma} - \frac{1}{m} R S R'\Psi_R^{-1}\widetilde{\Sigma} + \frac{1}{m} R S R'\Psi_R^{-1}\widetilde{\Sigma}\Psi_R^{-1} R S R', \quad (41)$$

where $\widetilde{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} X_i' X_i$ is the data scatter matrix and $N$ is the number of training data. Algorithm 1 shows the steps of this procedure.
Algorithm 1: UMVCCA algorithm
Input: $X_{1n}\mid_{n=1}^{N}$ and $X_{2n}\mid_{n=1}^{N}$; initialize $R_j$ with random matrices and $\Psi_{jR}$ with identity matrices, for $j = 1, 2$.
repeat
  Update $R$ and $\Psi_R$ using (40) and (41).
until the change of $\mathcal{L}$ is smaller than a threshold
Output: $R_j$ and $\Psi_{jR}$, for $j = 1, 2$.
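A runnable sketch of Algorithm 1 is given below. The shapes, the random initialization and the way the noise update is written (the standard factor-analysis form of the M-step rather than the expanded expression of Eq. (41), with the block structure of Eq. (35) restored by zeroing the off-diagonal blocks) are choices of this example.

```python
import numpy as np

def umvcca(X1, X2, d, n_iter=100, seed=0):
    # X1: (N, m, n1), X2: (N, m, n2), zero-mean, sharing the common row dimension m.
    N, m, n1 = X1.shape
    n2 = X2.shape[2]
    p = n1 + n2
    X = np.concatenate([X1, X2], axis=2)                 # stacked observations, Eq. (34)
    Sigma_t = sum(Xi.T @ Xi for Xi in X) / N             # data scatter matrix (p x p)
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((p, d))
    Psi_R = np.eye(p)
    for _ in range(n_iter):
        PsiinvR = np.linalg.solve(Psi_R, R)
        S = np.linalg.inv(R.T @ PsiinvR + np.eye(d))                 # Eq. (38)
        G = Sigma_t @ PsiinvR @ S                                    # Sigma_tilde Psi_R^{-1} R S
        R_new = G @ np.linalg.inv(m * S + S @ PsiinvR.T @ G)         # Eq. (40)
        # Standard factor-analysis noise update, then restore the block-diagonal
        # structure of Eq. (35) by zeroing the off-diagonal blocks.
        Psi_R = (Sigma_t - R_new @ S @ PsiinvR.T @ Sigma_t) / m
        Psi_R[:n1, n1:] = 0.0
        Psi_R[n1:, :n1] = 0.0
        Psi_R = 0.5 * (Psi_R + Psi_R.T)                              # keep symmetric
        R = R_new
    return R[:n1], R[n1:], Psi_R[:n1, :n1], Psi_R[n1:, n1:]          # R1, R2, Psi_1R, Psi_2R
```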
B. Bilateral matrix variate CCA (BMVCCA)
In this case, we assume that both the left and right projection matrices in equation (24) are present. To maximize the likelihood function, we need the posterior of the latent variable $Z_n$ given the observed variables $X_{1n}$ and $X_{2n}$. However, in general there is no matrix-variate form for this posterior. Therefore we employ the variational EM algorithm [25] to maximize a lower bound of the likelihood function. In variational EM, a parameterised variational distribution $q(Z_n)$ is chosen to approximate the posterior distribution $P(Z \mid X_1, X_2)$, and its parameters are then optimised. Here, we consider the following parametric distribution in the form of a matrix-variate normal for $q(Z_n)$:

$$q(Z_n) \sim \mathcal{MN}(C_n, O, S), \quad (42)$$

where $C_n \in \mathbb{R}^{d \times d}$ is the mean matrix and $O \in \mathbb{R}^{d \times d}$ and $S \in \mathbb{R}^{d \times d}$ are the column and row covariance matrices respectively. The variational parameters are optimised to maximize a lower bound $\mathcal{L}(q)$ of the data log-likelihood, which can be written as

$$\mathcal{L}(q) = \sum_n \int \ln\left[\frac{P(X_{1n}, X_{2n}, Z_n)}{q(Z_n)}\right] q(Z_n)\, dZ_n = \sum_n E_q\left[\ln\frac{P(X_{1n}, X_{2n}, Z_n)}{q(Z_n)}\right], \quad (43)$$

where $E_q$ denotes expectation with respect to the variational distribution. Optimizing the variational parameters $\{C_n, O, S\}$ yields the following update formulas (variational E-step):

$$O^{*} = \left[\frac{1}{d}\sum_{j=1}^{2}\mathrm{tr}\left[R_j'\Psi_{jR}^{-1}R_j S\right] L_j'\Psi_{jL}^{-1}L_j + \frac{1}{d}\,\mathrm{tr}[S]\, I\right]^{-1}, \quad (44)$$

$$S^{*} = \left[\frac{1}{d}\sum_{j=1}^{2}\mathrm{tr}\left[L_j'\Psi_{jL}^{-1}L_j O\right] R_j'\Psi_{jR}^{-1}R_j + \frac{1}{d}\,\mathrm{tr}[O]\, I\right]^{-1}, \quad (45)$$

$$\mathrm{vec}(C_n^{*}) = \left[\sum_{j=1}^{2}\left(R_j'\Psi_{jR}^{-1}R_j \otimes L_j'\Psi_{jL}^{-1}L_j\right) + I\right]^{-1}\mathrm{vec}\left(\sum_{j=1}^{2} L_j'\Psi_{jL}^{-1} X_{jn}\Psi_{jR}^{-1}R_j\right). \quad (46)$$

After estimating the variational distribution $q(Z_n)$ in the E-step, the update formulas for the remaining parameters of the model are obtained as follows (variational M-step):

$$\Psi_{jL}^{*} = \frac{1}{N n_j} P_{jL} + \frac{1}{n_j}\,\mathrm{tr}\left[R_j'\Psi_{jR}^{-1}R_j S\right]\left(L_j O L_j'\right), \quad (47)$$

$$\Psi_{jR}^{*} = \frac{1}{N m_j} P_{jR} + \frac{1}{m_j}\,\mathrm{tr}\left[L_j'\Psi_{jL}^{-1}L_j O\right]\left(R_j S R_j'\right), \quad (48)$$

where $P_{jL} = \sum_{n=1}^{N}(X_{jn} - L_j C_n R_j')\Psi_{jR}^{-1}(X_{jn} - L_j C_n R_j')'$ and $P_{jR} = \sum_{n=1}^{N}(X_{jn} - L_j C_n R_j')'\Psi_{jL}^{-1}(X_{jn} - L_j C_n R_j')$, and

$$L_j^{*} = \left[\sum_{n=1}^{N} X_{jn}\Psi_{jR}^{-1}R_j C_n'\right]\left[N\,\mathrm{tr}\left[R_j'\Psi_{jR}^{-1}R_j S\right] O + \sum_{n=1}^{N} C_n R_j'\Psi_{jR}^{-1}R_j C_n'\right]^{-1}, \quad (49)$$

$$R_j^{*} = \left[\sum_{n=1}^{N} X_{jn}'\Psi_{jL}^{-1}L_j C_n\right]\left[N\,\mathrm{tr}\left[L_j'\Psi_{jL}^{-1}L_j O\right] S + \sum_{n=1}^{N} C_n' L_j'\Psi_{jL}^{-1}L_j C_n\right]^{-1}. \quad (50)$$

The overall procedure of BMVCCA is presented in Algorithm 2.
Algorithm 2: BMVCCA algorithm
Input: $X_{1n}\mid_{n=1}^{N}$ and $X_{2n}\mid_{n=1}^{N}$; initialize $L_j$ and $R_j$ with the 2DCCA algorithm and $\Psi_{jL}$, $\Psi_{jR}$ with identity matrices, for $j = 1, 2$.
repeat
  E-STEP: Update $O$, $S$ and $C_n$ using (44), (45) and (46) respectively.
  M-STEP: Update $\Psi_{jL}$, $\Psi_{jR}$, $L_j$ and $R_j$ using (47), (48), (49) and (50) respectively, for $j = 1, 2$.
until the change of $\mathcal{L}$ is smaller than a threshold
Output: $L_j$, $R_j$, $\Psi_{jL}$ and $\Psi_{jR}$, for $j = 1, 2$.

Fig. 1. Distance between two successive iteration mappings of BMVCCA.
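To make the E-step of Algorithm 2 concrete, the sketch below implements the fixed-point updates (44)–(46) for the variational parameters. The function name, the data layout and the inner alternation count n_inner are assumptions of this illustration, and the M-step (47)–(50) and the outer EM loop are omitted for brevity.

```python
import numpy as np

def bmvcca_e_step(Xs, Ls, Rs, Psi_Ls, Psi_Rs, d, n_inner=10):
    # Xs = [X1, X2] with X_j of shape (N, m_j, n_j); Ls, Rs, Psi_Ls, Psi_Rs hold the
    # current model parameters of the two views. Returns (O, S, C) with C of shape (N, d, d).
    N = Xs[0].shape[0]
    A = [L.T @ np.linalg.solve(PL, L) for L, PL in zip(Ls, Psi_Ls)]   # L_j' Psi_jL^{-1} L_j
    B = [R.T @ np.linalg.solve(PR, R) for R, PR in zip(Rs, Psi_Rs)]   # R_j' Psi_jR^{-1} R_j
    O, S = np.eye(d), np.eye(d)
    # (44) and (45) depend on each other, so they are alternated a few times here.
    for _ in range(n_inner):
        O = np.linalg.inv((sum(np.trace(B_j @ S) * A_j for A_j, B_j in zip(A, B))
                           + np.trace(S) * np.eye(d)) / d)            # Eq. (44)
        S = np.linalg.inv((sum(np.trace(A_j @ O) * B_j for A_j, B_j in zip(A, B))
                           + np.trace(O) * np.eye(d)) / d)            # Eq. (45)
    # Variational mean matrices C_n, Eq. (46), via the vec/Kronecker identity.
    K = np.linalg.inv(sum(np.kron(B_j, A_j) for A_j, B_j in zip(A, B)) + np.eye(d * d))
    C = np.empty((N, d, d))
    for n in range(N):
        rhs = sum(L.T @ np.linalg.solve(PL, Xj[n]) @ np.linalg.solve(PR, R)
                  for Xj, L, R, PL, PR in zip(Xs, Ls, Rs, Psi_Ls, Psi_Rs))
        C[n] = (K @ rhs.flatten(order="F")).reshape(d, d, order="F")
    return O, S, C
```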
C. Dimension Reduction

Observation data matrices can be projected into the low-dimensional space using $\{L_j, R_j\}_{j=1}^{2}$, but it is more natural to use a probabilistic projection. Here, similarly to PPCA, we represent the observed data in the low-dimensional space by the mean of the posterior distribution, i.e., $E(Z \mid X_1, X_2)$. In BMVCCA, $q(Z)$ approximates the posterior distribution, so we use the mean of $q(Z)$ instead. This can be applied whenever we have two observation matrices. In the case of one observation matrix (here, $X_1$), we consider $E[q(Z \mid X_1, M_2)]$ as the low-dimensional representation of $X_1$, where $M_2$ is the average of the $X_2$ random matrix over the training dataset. A similar formula can be obtained for $X_2$.
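Reusing the bmvcca_e_step sketch from the previous subsection (an assumption of this illustration, not a function from the paper), the representation of an observation pair, and of a single observation with the missing view replaced by its training mean, can be written as follows.

```python
import numpy as np

def bmvcca_represent(X1, X2, Ls, Rs, Psi_Ls, Psi_Rs, d):
    # Mean of q(Z) for one observation pair, computed with the bmvcca_e_step sketch above.
    _, _, C = bmvcca_e_step([X1[None], X2[None]], Ls, Rs, Psi_Ls, Psi_Rs, d)
    return C[0]                                    # d x d low-dimensional representation

def bmvcca_represent_single(X1, M2, Ls, Rs, Psi_Ls, Psi_Rs, d):
    # Only the first view is observed; the second is replaced by the training mean M2,
    # i.e. E[q(Z | X1, M2)] as described in the text.
    return bmvcca_represent(X1, M2, Ls, Rs, Psi_Ls, Psi_Rs, d)
```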
IV. EXPERIMENTS

In this section we analyze the performance of the proposed algorithms using both synthetic and real data.
A. Synthetic data

1) Convergence of the algorithm:
In this section, the convergence of the algorithm is analyzed using synthetic data. To generate the synthetic data, we first sample each element of the projection matrices $L_j$ and $R_j$, $j = 1, 2$, from the uniform distribution on $[0, 1]$. Then, for each pair of observed data, the latent matrix $Z$ is sampled from $\mathcal{MN}(0, I, I)$, the residual matrices $\Xi_j$, $j = 1, 2$, are sampled from matrix-variate normal distributions with zero mean and scaled identity covariance matrices, and the observed data $\{X_j\}_{j=1}^{2}$ are generated using equation (24). We generate 1000 samples with this procedure, run the BMVCCA algorithm, and compute the Frobenius norm of each projection matrix in different iterations of the algorithm. Fig. 1 plots the difference of the computed norms in successive iterations; as can be seen, the algorithm converges after some iterations.
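A sketch of this data-generation procedure is shown below. The matrix sizes, the noise scale and the helper names are illustrative, since the exact values used in the experiment are not restated here.

```python
import numpy as np

def sample_matnorm(rng, M, Sigma, Phi):
    # Draw X ~ MN(M, Sigma, Phi) via X = M + A E B' with Sigma = A A', Phi = B B'.
    A = np.linalg.cholesky(Sigma)
    B = np.linalg.cholesky(Phi)
    E = rng.standard_normal(M.shape)
    return M + A @ E @ B.T

def generate_pairs(rng, N, m, n, d, noise=0.1):
    # Uniform projection matrices, matrix-normal latent and noise, observations via Eq. (24)
    # with zero means, as assumed in the text.
    L = [rng.uniform(0, 1, size=(m, d)) for _ in range(2)]
    R = [rng.uniform(0, 1, size=(n, d)) for _ in range(2)]
    X = [np.empty((N, m, n)), np.empty((N, m, n))]
    for i in range(N):
        Z = sample_matnorm(rng, np.zeros((d, d)), np.eye(d), np.eye(d))
        for j in range(2):
            Xi = sample_matnorm(rng, np.zeros((m, n)), noise * np.eye(m), noise * np.eye(n))
            X[j][i] = L[j] @ Z @ R[j].T + Xi
    return X[0], X[1]

rng = np.random.default_rng(0)
X1, X2 = generate_pairs(rng, N=1000, m=10, n=10, d=2)
```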
2) Analyzing the probabilistic subspace:
In this section, we obtain the mean of the distribution of the latent variables conditioned on the observed matrices and compute its distance to the true latent variables. Similarly to the previous section, we generate 1000 pairs of data $\{X_j\}_{j=1}^{2}$ using equation (24). This time the projections $L_j$ and $R_j$, $j = 1, 2$, are column vectors whose elements come from the uniform distribution on $[0, 1]$; for each data pair, $Z \in \mathbb{R}^{1 \times 1}$ is a scalar sampled from the standard normal distribution, and the residual matrices $\Xi_j$, $j = 1, 2$, are sampled from matrix-variate normal distributions with zero mean and scaled identity covariance matrices. We run BMVCCA to obtain the $C_n$, the estimated means of the posterior distributions $P(Z_n \mid X_{1n}, X_{2n})$, in each iteration of the algorithm. We then compute the Euclidean distance between the true latent values $Z_n$ and the corresponding $C_n$ and plot it in Fig. 2 for different iterations. As can be observed, the error is reduced and goes to zero.

Fig. 2. Distance between the estimated and the true latent space over iterations of BMVCCA.

Fig. 3. Distance between the estimated and the true latent space for varying numbers of training samples for BMVCCA.

To investigate the effect of the number of training data, we repeat this experiment with different numbers of training samples ranging from 10 to 1000. Fig. 3 illustrates the result; as can be observed, the accuracy of the estimation improves with increasing training size. It should be noted that this experiment is applicable only for a $1 \times 1$ latent space. This is due to the fact that the learned latent space matches the true latent space only up to a rotation, and in general there is no solution for obtaining the rotation matrices. However, it can be shown that when the subspace is restricted to $1 \times 1$ there is no rotation matrix, and the true and learned latent variables are equal up to a scaling factor, whose magnitude can be removed by normalizing both latent spaces. Its sign, however, cannot be removed by this procedure; to cancel the sign of the scaling factor, we compute the Euclidean distance between the true latent values and both the positive and negative signs of the learned values and choose the minimum.
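The sign- and scale-invariant comparison described above can be written as a small helper; the function name and inputs are illustrative.

```python
import numpy as np

def latent_distance(z_true, z_est):
    # Normalize both latent sequences, then remove the sign ambiguity by taking the
    # smaller of the distances to +z_est and -z_est.
    z_true = z_true / np.linalg.norm(z_true)
    z_est = z_est / np.linalg.norm(z_est)
    return min(np.linalg.norm(z_true - z_est), np.linalg.norm(z_true + z_est))
```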
3) Analyzing the learned projections of UMVCCA:
In this section, we examine the learned projection matrices of UMVCCA using synthetic data. We generate 1000 pairs of observation data $\{X_j\}_{j=1}^{2}$ with a procedure similar to the previous section; however, this time only the right projection vectors $R_j$, $j = 1, 2$, exist. UMVCCA is then run and the learned projection vectors are obtained. Fig. 4 plots the true and learned projection vectors. It can be observed that, disregarding the sign and scaling factor, the true and learned vectors are similar.
We analyze the performance of the proposed algorithms on "NIR-VIS 2.0" [26]. This dataset contains visible and near-infrared face images. The data were collected in four different sessions, and in each session several visible and infrared images were taken of each subject participating in that session. There are 740 unique "person-session" subjects in the dataset (710 persons were involved in only one session, while 15 persons participated in two sessions). There are 1-22 VIS and 5-50 NIR face images per subject. However, some of the subjects have only one image, and some of them lack either the visible or the infrared images. Therefore, we select 728 "person-session" subjects for our experiments. The same label is assigned to subjects who were involved in two sessions. We select 728 VIS and 728 NIR face images as the training data and 4333 VIS face images as the test data. The face images are cropped so that the eyes of all images have the same coordinates, and they are gray-scaled and resized to a common size. Fig. 5 shows several samples from this dataset.
1) Image reconstruction:
In this experiment, we first project pairs of visible and infrared images into the low-dimensional subspace using BMVCCA and then try to reconstruct the original images. Fig. 6 depicts the images from the visible and infrared spectrum of six persons as well as their corresponding low-dimensional subspace and reconstructed images. Here, the compressed representation is obtained by equation (46). As can be observed, the reconstructed images are similar to the original images.
Fig. 4. Illustration of the true and learned projection vectors of UMVCCA. The horizontal axis is the dimension of the vector, ranging from 1 to 32. Left: true projection vectors. Right: learned projection vectors.

Fig. 5. Samples from the NIR-VIS face dataset. First row: visible images. Second row: infrared images.
2) Convergence:
Fig. 7 shows the Euclidean distance between consecutive projection matrices during the learning of BMVCCA. As can be observed, after some iterations the distance goes to zero. Also, Fig. 8 plots the lower bound of the log-likelihood over the iterations; as can be observed, the algorithm converges after some iterations.
Fig. 6. Original visible and infrared images of six persons and their corresponding reconstructed images.
Fig. 7. Euclidean distance between consecutive projection matrices of BMVCCA on the NIR-VIS 2.0 dataset.

Fig. 8. Lower bound of the log-likelihood of the BMVCCA algorithm on the NIR-VIS 2.0 dataset.
3) Face recognition:
The task here is face recognition: we train the model with pairs of visible and infrared images and then project each training or test image into the low-dimensional space. We then compute the Euclidean distance between each projected test image and all the projected training images, and choose the label of the training image with the smallest distance as the label of the test image. Table I gives the corresponding formula for representing images in the low-dimensional space for each method. Moreover, in addition to this criterion based on the Euclidean distance between the projected test and training images, we utilize the following probabilistic criterion:

$$\mathrm{label}(X_j) = \arg\max_{n}\ E_q\left[\ln P(X_j \mid Z_n)\right], \quad n = 1, \ldots, N,\ j \in \{1, 2\}, \quad (51)$$

where $X_j$ is a test image. This equation evaluates the conditional probability of a test image given the latent representation of each training image and selects the label corresponding to the most probable one as the test label. We call this approach the probabilistic test and abbreviate it as "ptest" in the tables.

TABLE I
DIFFERENT METHODS AND THEIR LOW-DIMENSIONAL REPRESENTATION FORMULAS

Method        Subspace representation
CCA           $W_j' x_j$
PCCA [11]     $E[z \mid x_1, x_2]$
2DCCA [18]    $L_j' X_j R_j$
P2DCCA [24]   $E[Z \mid X_1, X_2]$
UMVCCA        $E[Z \mid X_1, X_2]$
BMVCCA        $E[q(Z)]$
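A sketch of the ptest decision rule (51) is given below. Terms of the expected log-likelihood that do not depend on the training index $n$ are dropped, since they do not change the arg max, and all names are illustrative.

```python
import numpy as np

def ptest_label(X, Lj, Rj, Psi_jL, Psi_jR, C_train, labels):
    # X: test image of view j; C_train: (N, d, d) variational mean matrices of the
    # training pairs; labels: the N training labels.
    A = Lj.T @ np.linalg.solve(Psi_jL, Lj)              # L_j' Psi_jL^{-1} L_j
    B = Rj.T @ np.linalg.solve(Psi_jR, Rj)              # R_j' Psi_jR^{-1} R_j
    T = Lj.T @ np.linalg.solve(Psi_jL, X) @ np.linalg.solve(Psi_jR, Rj)
    # E_q[ln P(X | Z_n)] up to n-independent constants: tr(T C') - 0.5 tr(A C B C')
    scores = [np.trace(T @ C.T) - 0.5 * np.trace(A @ C @ B @ C.T) for C in C_train]
    return labels[int(np.argmax(scores))]
```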
Table II compares the error rates of the different algorithms for varying numbers of features in the low-dimensional space.

TABLE II
COMPARISON OF ERROR RATES OF DIFFERENT CCA-BASED METHODS ON NIR-VIS 2.0 FOR SUBSPACE DIMENSIONS 5 × 5, 10 × 10, 15 × 15, 20 × 20, 25 × 25 AND 30 × 30, TOGETHER WITH EACH METHOD'S BEST RESULT (ROWS: PCA+CCA, PCA+PCCA, 2DCCA, P2DCCA, UMVCCA, BMVCCA, BMVCCA(ptest)).
For UMVCCA, the number of features is $m \times d$, where $d$ is the dimension of the reduced feature space; we therefore select $d$ so that the number of obtained features is as close as possible to the number of features in the corresponding column. For example, the first and second columns of Table II correspond to 25 and 100 features respectively, and $d$ is chosen accordingly for these two columns. Since the CCA and PCCA methods suffer from the small sample size problem, which leads to singular matrices, we first project the data to a dimension of 727 (one less than the number of training data) using PCA and then apply the corresponding algorithms. Consequently, we cannot project the data to the largest numbers of features for these algorithms, and the corresponding cells are marked with a dash. As can be observed from Table II, BMVCCA with the ptest criterion significantly outperforms the other algorithms.

V. CONCLUSION
In this paper, a probabilistic model for CCA was proposed which works with matrix-variate data such as image matrices. Two iterative approaches for learning the parameters were presented. In the first approach, called unilateral matrix variate CCA (UMVCCA), the model was restricted by mapping the latent matrix from only one side (row or column) and a learning method based on expectation maximization was introduced. In the other approach, called bilateral matrix variate CCA (BMVCCA), the latent matrix was mapped from both sides, the posterior distribution was approximated by a variational matrix-variate distribution, and its variational parameters were estimated using variational expectation maximization. The proposed algorithms were evaluated using synthetic and real data. The results indicated that the algorithms converge after a few iterations. Also, comparison with other CCA-based algorithms showed that the proposed algorithms achieve better performance in terms of recognition accuracy. This model can be extended to more complex models such as mixture models, non-linear models, or a Bayesian framework.
REFERENCES

[1] F. Ju, Y. Sun, J. Gao, Y. Hu, and B. Yin, "Image outlier detection and feature extraction via L1-norm-based 2D probabilistic PCA," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 4834–4846, Dec. 2015.
[2] F. Buettner, V. Moignard, B. Göttgens, and F. J. Theis, "Probabilistic PCA of censored data: accounting for uncertainties in the visualization of high-throughput single-cell qPCR data," Bioinformatics, vol. 30, no. 13, pp. 1867–1875, 2014. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btu134
[3] R. Sharifi and R. Langari, "Nonlinear sensor fault diagnosis using mixture of probabilistic PCA models," Mechanical Systems and Signal Processing.
[4] Computational Statistics & Data Analysis.
[5] Advances in Neural Information Processing Systems.
[6] M. E. Tipping and C. M. Bishop, "Mixtures of probabilistic principal component analysers," Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.
[7] J. Zhao, "Efficient model selection for mixtures of probabilistic PCA via hierarchical BIC," IEEE Transactions on Cybernetics, vol. 44, no. 10, pp. 1871–1883, Oct. 2014.
[8] N. Lawrence, "Probabilistic non-linear principal component analysis with Gaussian process latent variable models," Journal of Machine Learning Research, vol. 6, pp. 1783–1816, Dec. 2005. [Online]. Available: http://dl.acm.org/citation.cfm?id=1046920.1194904
[9] M. K. Titsias and N. D. Lawrence, "Bayesian Gaussian process latent variable model," in AISTATS, ser. JMLR Proceedings, Y. W. Teh and D. M. Titterington, Eds., vol. 9. JMLR.org, 2010, pp. 844–851.
[10] M. E. Tipping and C. M. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 61, no. 3, pp. 611–622, 1999.
[11] F. R. Bach and M. I. Jordan, "A probabilistic interpretation of canonical correlation analysis," 2005.
[12] A. Klami, S. Virtanen, and S. Kaski, "Bayesian canonical correlation analysis," Journal of Machine Learning Research, vol. 14, no. 1, pp. 965–1003, Apr. 2013.
[13] T. Michaeli, W. Wang, and K. Livescu, "Nonparametric canonical correlation analysis," 2016. [Online]. Available: http://arxiv.org/abs/1511.04839
[14] R. R. Sarvestani and R. Boostani, "FF-SKPCCA: Kernel probabilistic canonical correlation analysis," Applied Intelligence, pp. 1–17, 2016.
[15] J. G. Daugman, "Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters," J. Opt. Soc. Am. A, vol. 2, no. 7, pp. 1160–1169, Jul. 1985. [Online]. Available: http://josaa.osa.org/abstract.cfm?URI=josaa-2-7-1160
[16] J. Yang, D. Zhang, A. F. Frangi, and J.-y. Yang, "Two-dimensional PCA: a new approach to appearance-based face representation and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 1, pp. 131–137, 2004.
[17] J. Ye, "Generalized low rank approximations of matrices," Machine Learning, vol. 61, no. 1, pp. 167–191, Nov. 2005. [Online]. Available: https://doi.org/10.1007/s10994-005-3561-6
[18] S. H. Lee and S. Choi, "Two-dimensional canonical correlation analysis," IEEE Signal Processing Letters, vol. 14, no. 10, p. 735, 2007.
[19] N. Sun, Z.-h. Ji, C.-r. Zou, and L. Zhao, "Two-dimensional canonical correlation analysis and its application in small sample size face recognition," Neural Computing and Applications, vol. 19, no. 3, pp. 377–382, 2010.
[20] D. Tao, M. Song, X. Li, J. Shen, J. Sun, X. Wu, C. Faloutsos, and S. J. Maybank, "Bayesian tensor approach for 3-D face modeling," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 10, pp. 1397–1410, 2008.
[21] S. Yu, J. Bi, and J. Ye, "Matrix-variate and higher-order probabilistic projections," Data Mining and Knowledge Discovery, vol. 22, no. 3, pp. 372–392, 2011.
[22] A. K. Gupta and D. K. Nagar, Matrix Variate Distributions. CRC Press, 1999, vol. 104.
[23] J. Zhao, P. L. Yu, and J. T. Kwok, "Bilinear probabilistic principal component analysis," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 3, pp. 492–503, 2012.
[24] M. Safayani, S. H. Ahmadi, H. Afrabandpey, and A. Mirzaei, "An EM based probabilistic two-dimensional CCA with application to face recognition," Applied Intelligence, Aug. 2017. [Online]. Available: https://doi.org/10.1007/s10489-017-1012-2
[25] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, Nov. 1999. [Online]. Available: http://dx.doi.org/10.1023/A:1007665907178
[26] S. Li, D. Yi, Z. Lei, and S. Liao, "The CASIA NIR-VIS 2.0 face database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 348–353.
[27] H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.