Algorithms for an Efficient Tensor Biclustering
Andriantsiory Dina Faneva, Mustapha Lebbah, Hanane Azzag, Gaël Beck
African Institute for Mathematical Sciences (AIMS), Km2 Route de Joal, Centre IRD, Mbour, BP 1418, Senegal
Computer Science Laboratory of Paris North (LIPN, CNRS UMR 7030), University of Paris 13, F-93430 Villetaneuse, France
Abstract
Consider a data set collected as (individual, feature) pairs at different times. It can be represented as a three-dimensional tensor (individuals, features, times). The tensor biclustering problem computes a subset of individuals and a subset of features whose signal trajectories over time lie in a low-dimensional subspace, modeling similarity among the signal trajectories while allowing different scalings across different individuals or different features. Our approach is based on spectral decomposition to build the desired biclusters. We evaluate the quality of the results of each algorithm on both synthetic and real data sets.
Index terms: Multilinear Algebra, Tensor Decomposition, Principal Component Analysis.
1 Introduction

Clustering analysis has become a fundamental tool in statistics and machine learning. Many clustering algorithms have been developed with the general idea of seeking groups of individuals over the whole space of features. Biclustering consists of simultaneously partitioning a set of observations and a set of their features into subsets, often called biclusters. Consequently, a subset of rows exhibiting significant coherence within a subset of columns of a matrix can be extracted, which corresponds to a specific coherent pattern [2, 8]. Nowadays, there is a new type of data collection, in which we may collect data by individual-feature pair at multiple times. The variation of a couple (individual, feature) at different instants is called a trajectory. Such data can be represented as a three-dimensional object called a tensor T ∈ R^{n_1 × n_2 × m}, where n_1 and n_2 are respectively the number of observations and of features, observed at m different times. Tensor biclustering selects a subset of individual indices and a subset of feature indices whose trajectories are highly correlated. Grouping those trajectories according to the correlation or similarity between them is useful in different areas such as decision making, but it remains a very challenging research topic.

In [7], the authors proposed different methods based on the spectral decomposition of a matrix and on trajectory lengths, although they provide a unique bicluster. Many tools for tensor manipulation already exist in the literature to address this tensor biclustering problem [4, 1, 6, 5, 3]. Our algorithms are based on a spectral decomposition as proposed in [7]. This article is structured as follows. In Section 2, we start with a brief summary of the problem formulation. Section 3 introduces our algorithm extensions. Section 4 is devoted to the experiments. We make some concluding remarks in Section 5.

2 Problem Formulation

We use the common notation where T, X and Z denote respectively the input, signal and noise tensors. For any set J, |J| denotes its cardinality, [n] denotes the set {1, 2, . . . , n}, and J̄ = [n] − J is the complement of J. ‖x‖_2 = (x^t x)^{1/2} is the Euclidean norm of the vector x, and x ⊗ y is the Kronecker product of two vectors x and y. We also use Matlab notation to denote the elements of a tensor: T(:, :, i), T(:, i, :) and T(i, :, :) are respectively the i-th frontal, lateral and horizontal slices, while T(:, i, j), T(i, :, j) and T(i, j, :) denote respectively the mode-1, mode-2 and mode-3 fibers.

Let T ∈ R^{n_1 × n_2 × m} be a third-order tensor, T = X + Z, where X is the signal tensor and Z is the noise tensor. Consider

    T = X + Z = Σ_{r=1}^{q} σ_r (u_r^{J_1^{(r)}} ⊗ w_r^{J_2^{(r)}} ⊗ v_r) + Z,    (1)

where J_1^{(i)} and J_2^{(i)} are respectively the sets of observation indices and feature indices of the i-th bicluster, and u_r ∈ R^{n_1}, w_r ∈ R^{n_2} and v_r ∈ R^m are unit vectors. We assume that u_i^{J_1^{(i)}} and w_i^{J_2^{(i)}} have zero entries outside of J_1^{(i)} and J_2^{(i)} respectively, for i ∈ {1, · · · , q}, and that σ_1 ≥ σ_2 ≥ · · · ≥ σ_q > 0. Let J_1 = ∪_i J_1^{(i)} and J_2 = ∪_i J_2^{(i)}. Under this model, the trajectories X(J_1, J_2, :) form an at most q-dimensional subspace.

Concerning the noise model, if (j_1, j_2) ∉ J_1 × J_2, we assume that the entries of the noise trajectory Z(j_1, j_2, :) are independent and identically distributed (i.i.d.) with a standard normal distribution. If (j_1, j_2) ∈ J_1 × J_2, we assume that the entries of Z(j_1, j_2, :) are i.i.d. Gaussian with zero mean and variance σ_z^2. We analyse the tensor biclustering problem under two variance models for the noise trajectories:

- Noise Model I: in this model, we assume σ_z = 1, i.e., the variance of the noise within and outside of the bicluster is assumed to be the same. Although this model simplifies the analysis, it has the following drawback: under this noise model, for every value of σ_1, the average trajectory length in the bicluster is larger than the average trajectory length outside the bicluster. Indeed, let T_1 ∈ R^{m × k²} be a matrix whose columns are the trajectories T(j_1, j_2, :) for (j_1, j_2) ∈ J_1 × J_2, with |J_1| = |J_2| = k (i.e., T_1 is the unfolding of T(J_1, J_2, :)). We can write T_1 = X_1 + Z_1, where X_1 and Z_1 are the unfoldings of X(J_1, J_2, :) and Z(J_1, J_2, :), respectively. The squared Frobenius norm of X_1 is equal to ‖X_1‖_F^2 = σ_1^2. Moreover, the squared Frobenius norm of Z_1 has a chi-squared distribution with mk² degrees of freedom, i.e. χ²(mk²).
Thus, the average squared Frobenius norm of T_1 is equal to σ_1^2 + σ_z^2 mk². Let T_2 ∈ R^{m × k²} be a matrix whose columns contain only noise trajectories. Using a similar argument, we have E[‖T_2‖_F^2] = mk², which is smaller than σ_1^2 + σ_z^2 mk².

- Noise Model II: in this model, we assume σ_z^2 = max(0, 1 − σ_1^2/(mk²)), i.e., σ_z is chosen to minimize the difference between the average trajectory lengths within and outside the bicluster. Indeed, if σ_1^2 < mk², without noise the average trajectory in the bicluster is smaller than the one outside the bicluster. In this regime, taking σ_z^2 = 1 − σ_1^2/(mk²) makes the average trajectory lengths within and outside the bicluster comparable. This regime is called the low-SNR (signal-to-noise ratio) regime. If σ_1^2 > mk², the average trajectory length in the bicluster is larger than the one outside the bicluster; this regime is called the high-SNR regime. In this regime, adding noise to the signal trajectories increases their lengths and makes solving the tensor biclustering problem easier. Therefore, in this regime we assume σ_z = 0 to minimize the difference between average trajectory lengths within and outside of the bicluster.

2.1 Tensor Folding and Spectral (FS)

The algorithm and the asymptotic behaviour of this method are presented in [7]. Under the assumption q = 1 and n = n_1 = n_2, we drop the superscript (1) from J_1^{(1)} and J_2^{(1)}, and we assume that |J_1| = |J_2| = k. The authors propose to provide only one bicluster. This method separates the selection of the two sets J_1 and J_2, using the lateral slices and the horizontal slices of the tensor respectively:

    T_{(j_1,1)} = T(j_1, :, :) and T_{(j_2,2)} = T(:, j_2, :),    (2)

    C_1 = Σ_{j_2=1}^{n_2} T_{(j_2,2)}^t T_{(j_2,2)} and C_2 = Σ_{j_1=1}^{n_1} T_{(j_1,1)}^t T_{(j_1,1)},    (3)

where each slice in equation (2) is viewed as an m × n_1 (respectively m × n_2) matrix, so that C_1 and C_2 are square matrices of order n_1 and n_2. The aim is to select the row and column indices whose trajectories are highly correlated. The elements of J_1 and J_2 are the indices of the top k elements of the top eigenvector of the matrices C_1 and C_2 respectively (Algorithm 1).
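The accumulation in equation (3) can be written compactly; a minimal numpy sketch, assuming the tensor is stored as an array of shape (n_1, n_2, m):

```python
import numpy as np

def covariance_matrices(T):
    """C1 (n1 x n1) from lateral slices, C2 (n2 x n2) from horizontal
    slices, as in equation (3)."""
    # C1[a, b] = sum over j2, t of T[a, j2, t] * T[b, j2, t]
    C1 = np.einsum('ajt,bjt->ab', T, T)
    # C2[a, b] = sum over j1, t of T[j1, a, t] * T[j1, b, t]
    C2 = np.einsum('jat,jbt->ab', T, T)
    return C1, C2
```

The `einsum` calls are equivalent to summing `S @ S.T` over all slices, but avoid the explicit Python loop.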
We denote by Ĵ_1 and Ĵ_2 the subsets of individuals and of features, respectively, in the bicluster returned by the algorithm.

Algorithm 1:
Tensor folding and spectral
Input: tensor T and the cardinality k of the output
Output: the sets of indices Ĵ_1 and Ĵ_2
Initialize: C_1 ← 0, C_2 ← 0
for i in [n] do
    Compute T_{(i,1)} and T_{(i,2)} according to equation (2)
    Update C_2 and C_1 according to equation (3)
Compute û_1, the top eigenvector of C_1
Compute ŵ_1, the top eigenvector of C_2
Compute Ĵ_1, the set of indices of the k largest values of |û_1|
Compute Ĵ_2, the set of indices of the k largest values of |ŵ_1|
return Ĵ_1 and Ĵ_2

The tensor FS method has the best performance under both noise models compared to the three other methods (tensor unfolding + spectral, thresholding sum of squared trajectory lengths, and individual trajectory lengths) proposed by Soheil Feizi, Hamid Javadi and David Tse [7].
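A compact numpy transcription of Algorithm 1 might look as follows (a sketch; the tensor is assumed stored as an array of shape (n1, n2, m), and `numpy.linalg.eigh` returns eigenvectors in ascending order of eigenvalue, so the top eigenvector is the last column):

```python
import numpy as np

def tensor_folding_spectral(T, k):
    """Algorithm 1 (tensor folding + spectral): return the index
    sets (J1_hat, J2_hat), each of cardinality k."""
    n1, n2, m = T.shape
    C2 = np.zeros((n2, n2))
    C1 = np.zeros((n1, n1))
    for j1 in range(n1):          # horizontal slices update C2
        S = T[j1, :, :]           # (n2, m)
        C2 += S @ S.T
    for j2 in range(n2):          # lateral slices update C1
        S = T[:, j2, :]           # (n1, m)
        C1 += S @ S.T
    u = np.linalg.eigh(C1)[1][:, -1]   # top eigenvector of C1
    w = np.linalg.eigh(C2)[1][:, -1]   # top eigenvector of C2
    J1_hat = set(np.argsort(np.abs(u))[-k:].tolist())
    J2_hat = set(np.argsort(np.abs(w))[-k:].tolist())
    return J1_hat, J2_hat
```

On a strongly planted bicluster, the top eigenvectors concentrate on the planted indices and the k largest absolute entries recover them.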
3 Extension of Tensor Folding and Spectral
In this section, we aim to extract several biclusters from the tensor data and to improve the quality of the result. We propose several methods for this task. Instead of seeking only one bicluster, we assume that in equation (1) q = r ≥ 2, where r is determined by the number of gaps in the eigenvalues of the covariance matrices C_1 and C_2 (equation (3)).

The classical method extracts a single rank of the low-dimensional subspace, which is not very interesting because it neglects the majority of the data set. A direct improvement of this method is to compute recursively, according to the number of gaps shown in the eigenvalues (Algorithm 2). With this method, there is no intersection between two different blocks of the tensor biclustering.
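As a sketch, this zero-out-and-repeat idea can be implemented in a few lines of numpy; the single-bicluster step `fs_step` below is a condensed stand-in for Algorithm 1, and the function names and array layout are our assumptions:

```python
import numpy as np

def fs_step(T, k):
    """One tensor-FS pass: top eigenvectors of C1, C2 -> index arrays."""
    C1 = np.einsum('ajt,bjt->ab', T, T)
    C2 = np.einsum('jat,jbt->ab', T, T)
    u = np.linalg.eigh(C1)[1][:, -1]
    w = np.linalg.eigh(C2)[1][:, -1]
    return np.argsort(np.abs(u))[-k:], np.argsort(np.abs(w))[-k:]

def recursive_biclusters(T, ks):
    """Recursive extension sketch: find one bicluster, zero out its
    entries, and repeat on the modified tensor."""
    T = T.copy()
    out = []
    for k in ks:                     # one FS pass per requested bicluster
        J1, J2 = fs_step(T, k)
        out.append((set(J1.tolist()), set(J2.tolist())))
        T[np.ix_(J1, J2)] = 0.0      # remove the found block from the data
    return out
```

Because each found block is zeroed before the next pass, the extracted blocks cannot intersect, matching the property stated above.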
Algorithm 2:
Recursive extension
Input: tensor T, array k of the cardinalities of each bicluster
Output: the set of all pairs of bicluster index sets
i ← 1
while i < k.length + 1 do
    Compute the i-th bicluster (Ĵ_1^{(i)}, Ĵ_2^{(i)}) using Algorithm 1
    Set the entries of (Ĵ_1^{(i)}, Ĵ_2^{(i)}) in the data set to zero, and use the result as the new data set
    i ← i + 1
return (Ĵ_1^{(1)}, Ĵ_2^{(1)}), (Ĵ_1^{(2)}, Ĵ_2^{(2)}), · · ·

The multiple-bicluster method extracts the r biclusters of our tensor simultaneously, using the idea of top-r principal component analysis (PCA). The orthogonality of the principal components favours the quality of the result (Algorithm 3). The steps of the tensor FS extension are illustrated in Fig. 1. For each fixed individual, we have a horizontal slice of the tensor, represented by an m × n_2 matrix (equation (2)). We then compute the covariance matrix of each horizontal slice; their sum gives a single square matrix of order n_2 (C_2 in equation (3)). We apply a singular value decomposition (SVD) to C_2; the top r eigenvectors of C_2 provide the selection of the elements of the feature index sets (J_2^{(i)})_{i∈[r]} (Algorithm 3). A similar step is applied to each lateral slice of the tensor to find all the elements of the index sets (J_1^{(i)})_{i∈[r]}.

Figure 1: A visualization of the tensor FS extension algorithm computing the bicluster indices (J_2^{(i)})_i. Here there are two biclusters, and the sets J_2^{(1)} and J_2^{(2)} do not intersect.

Since k is a fixed parameter, the multiple-bicluster method allows some trajectories to belong to several blocks of the tensor biclustering. We call them boundaries of the biclusters. These boundaries are very important, as they belong to the intersection of all the biclusters and thus share all their properties.

Algorithm 3:
Multiple biclusters
Input: tensor T and the list k of the bicluster cardinalities
Output: the set of all pairs of bicluster index sets
r ← length of k
Compute the matrices C_1 and C_2 according to equation (3)
Compute the top r eigenvectors of C_1 and C_2
for i ← 1 to r do
    Compute Ĵ_1^{(i)} from eigenvector |u_i|
    Compute Ĵ_2^{(i)} from eigenvector |w_i|
Compute I_1 ← ∩_i Ĵ_1^{(i)} and I_2 ← ∩_i Ĵ_2^{(i)}
return ((Ĵ_1^{(i)}, Ĵ_2^{(i)}))_{i∈[r]} and (I_1, I_2)

4 Experimentation
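For the experiments that follow, the multiple-bicluster step can be sketched in numpy as below (function name and array layout are our assumptions; `eigh` orders eigenpairs ascending, hence the column reversal):

```python
import numpy as np

def multiple_biclusters(T, ks):
    """Multiple-biclusters sketch: the top-r eigenvectors of C1 and C2
    give r biclusters, plus the intersections (I1, I2) of the index
    sets (the "boundaries")."""
    r = len(ks)
    C1 = np.einsum('ajt,bjt->ab', T, T)      # order-n1 covariance, eq. (3)
    C2 = np.einsum('jat,jbt->ab', T, T)      # order-n2 covariance, eq. (3)
    U = np.linalg.eigh(C1)[1][:, ::-1][:, :r]  # top-r eigenvectors of C1
    W = np.linalg.eigh(C2)[1][:, ::-1][:, :r]  # top-r eigenvectors of C2
    biclusters = []
    for i, k in enumerate(ks):
        J1 = set(np.argsort(np.abs(U[:, i]))[-k:].tolist())
        J2 = set(np.argsort(np.abs(W[:, i]))[-k:].tolist())
        biclusters.append((J1, J2))
    I1 = set.intersection(*(J1 for J1, _ in biclusters))
    I2 = set.intersection(*(J2 for _, J2 in biclusters))
    return biclusters, (I1, I2)
```

Unlike the recursive variant, the r index sets are extracted in one shot, so the returned intersections may be non-empty.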
We build synthetic data to evaluate the implementation of our methods. In this data set, we have two biclusters with signal strengths σ_1 and σ_2 such that σ_1 > σ_2. We assume that v_1 and v_2 are fixed unit vectors in R^m with v_1 = v_2. We assume also that J_1^{(1)} ∩ J_1^{(2)} = ∅ and J_2^{(1)} ∩ J_2^{(2)} = ∅. We have n_1 = n_2 = n = 150, m = 40 and k = |J_1^{(1)}| = |J_1^{(2)}| = |J_2^{(1)}| = |J_2^{(2)}| = 30, and we take:

    u_1(j) = 1/√k for j ∈ J_1^{(1)} and 0 otherwise,    w_1(j) = 1/√k for j ∈ J_2^{(1)} and 0 otherwise,
    u_2(j) = 1/√k for j ∈ J_1^{(2)} and 0 otherwise,    w_2(j) = 1/√k for j ∈ J_2^{(2)} and 0 otherwise.

We generate the tensor T with the noise model II defined in Section 2. Let Ĵ_1^{(1)} × Ĵ_2^{(1)} and Ĵ_1^{(2)} × Ĵ_2^{(2)} be the two estimated bicluster index sets for J_1^{(1)} × J_2^{(1)} and J_1^{(2)} × J_2^{(2)} respectively. We fix the signal strength σ_2 as a fixed fraction of σ_1. For σ_1 > 90, the bar plot of the top five eigenvalues of both covariance matrices tells us that there are two blocks of tensor biclustering in the data (see Fig. 2). In this case, we know the value of the parameter k = 30. To evaluate the inference quality of the result given by the algorithm, we compute the recovery rate:

    0 ≤ (1/4) ( |Ĵ_1^{(1)} ∩ J_1^{(1)}|/k + |Ĵ_1^{(2)} ∩ J_1^{(2)}|/k + |Ĵ_2^{(1)} ∩ J_2^{(1)}|/k + |Ĵ_2^{(2)} ∩ J_2^{(2)}|/k ) ≤ 1.

The recovery rate is zero if the algorithm does not find any element of the two biclusters, and one if the algorithm finds all the elements of the two biclusters. We ran the experiment with different values of the signal strength σ_1, repeating each run 10 times, and computed the average recovery rate (Figure 2(c)).

Figure 2: Synthetic data sets, n = 150, m = 40. (a) Matrix C_1 (equation (3)); (b) matrix C_2 (equation (3)); (c) recovery rate for different values of σ_1.

We apply both of the contributed algorithms to an electricity load diagrams data set covering four years (2011-2014) (http://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014). This data set contains the electricity consumption of 370 clients, recorded every 15 minutes over four years. After data preprocessing, we have a tensor T ∈ R^{n_1 × n_2 × m}, where n_1 = 365 is the number of days in one year, n_2 = 161 is the number of clients and m = 4 is the number of years.

As illustrated in Figs. 3(a) and 3(d), the gap in the eigenvalues shows the existence of two tensor biclusters (Section 3) in the data set. The parameter k, the cardinality of each index set, is determined from the multiple-bicluster method; we choose k so that the two blocks have few intersections: |J_1^{(1)}| = |J_1^{(2)}| = 50 and |J_2^{(1)}| = |J_2^{(2)}| = 25. After running the algorithms, we note that the two individual sets J_1^{(1)} and J_1^{(2)} are disjoint for both methods. Besides, the two feature sets have 22 common elements for the multiple-bicluster method and 19 common elements for the recursive method.
So, we have two distinct blocks with two distinct subsets of individuals and one largely shared subset of features. To evaluate the quality of the biclusters given by each algorithm, we compute the total absolute pairwise correlation of the trajectories within each bicluster. With the recursive method, the trajectories in the first bicluster are highly correlated, but the quality of the second bicluster is somewhat lower, as seen in Figs. 3(b) and 3(c). With the multiple-bicluster method, the trajectories of both biclusters are highly correlated, as seen in Figs. 3(e) and 3(f).

Figure 3: Trajectory correlations of each bicluster for each method. (a) Eigenvalues of C_1 (equation (3)); (b) first bicluster of the recursive method; (c) second bicluster of the recursive method; (d) eigenvalues of C_2 (equation (3)); (e) first bicluster of the multiple method; (f) second bicluster of the multiple method.
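The quality measure above, the absolute pairwise correlation of the trajectories within a bicluster, can be sketched as follows (a minimal sketch; the mean over distinct pairs is our normalization choice):

```python
import numpy as np

def bicluster_correlation_score(T, J1, J2):
    """Mean absolute pairwise Pearson correlation between the
    trajectories T(j1, j2, :) of a bicluster J1 x J2."""
    trajs = np.array([T[a, b, :] for a in sorted(J1) for b in sorted(J2)])
    R = np.corrcoef(trajs)               # pairwise correlation matrix
    iu = np.triu_indices_from(R, k=1)    # distinct pairs only
    return np.abs(R[iu]).mean()
```

A bicluster whose trajectories all lie on one direction scores 1; random trajectories score markedly lower.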
5 Conclusion

In this article, we introduced two methods, based on [7], to increase the number of biclusters selected in a tensor data set; the number of biclusters depends on the rank of the low-dimensional subspace. The goal is to extract r subsets of the tensor rows and columns (r ≥ 2) such that each block of trajectories forms a low-dimensional subspace. We proposed two algorithms to solve this problem: the recursive extension and the multiple-bicluster method. The performance of both algorithms depends on the parameter k; one way to choose this parameter is through the multiple-bicluster method. If the chosen value gives many index intersections, decreasing k is a good way to improve the quality of the results.

References

[1] Andrea Montanari, Daniel Reichman and Ofer Zeitouni. On the limitation of spectral methods: From the Gaussian hidden clique problem to rank-one perturbations of Gaussian tensors. In Advances in Neural Information Processing Systems, 2015.
[2] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv preprint arXiv:1402.1267, 2014.
[3] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 2009.
[4] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. In Advances in Neural Information Processing Systems, 2014.
[5] Samuel B. Hopkins, Jonathan Shi and David Steurer. Tensor principal component analysis via sum-of-squares proofs. In COLT, 2015.
[6] Samuel B. Hopkins, Tselil Schramm, Jonathan Shi and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. arXiv preprint, 2015.
[7] Soheil Feizi, Hamid Javadi and David Tse. Tensor biclustering. In Advances in Neural Information Processing Systems, 30, 2017.
[8] T. Tony Cai, Tengyuan Liang and Alexander Rakhlin. Computational and statistical boundaries for submatrix localization in a large noisy matrix. arXiv preprint arXiv:1502.01988, 2015.