Asymmetric latent semantic indexing for gene expression experiments visualization
Javier González, Alberto Muñoz and Gabriel Martos

Sheffield Institute for Translational Neuroscience, Department of Computer Science, University of Sheffield. Glossop Road S10 2HQ, Sheffield, UK. [email protected]

Department of Statistics, University Carlos III of Madrid, Spain. C/ Madrid, 126 - 28903, Getafe (Madrid), Spain. [email protected], [email protected]
ABSTRACT
We propose a new method to visualize gene expression experiments inspired by latent semantic indexing, a technique originally proposed in the textual analysis context. By using the correspondence word-gene and document-experiment, we define an asymmetric similarity measure of association for genes that accounts for potential hierarchies in the data, the key to obtaining meaningful gene mappings. We use the polar decomposition to obtain the sources of asymmetry of the similarity matrix, which are later combined with previous knowledge. Genetic classes of genes are identified by means of a mixture model applied in the genes' latent space. We describe the steps of the procedure and demonstrate its utility on the Human Cancer dataset.
Keywords: Latent semantic indexing, Asymmetric similarities, Gene expression data, Textual data analysis.

Introduction
A gene expression dataset consists of a matrix Y ∈ IR^{n×p}, with each row representing an experiment and each column representing a gene. Typically, the number of genes is several thousand, whereas the number of experiments or samples is in the order of tens. In Figure 1.A we show the heat map of the differentially expressed genes of the Human Cancer dataset, which originally consists of 6830 genes measured in 64 experiments corresponding to 14 different types of cancer patients, available in Hastie et al. (2009). Providing answers to questions like which genes are more similar in terms of their expression profiles, or which genes are involved in certain types of cancer, is the key to extracting useful biological knowledge from experiments of this type.

A common strategy for finding interesting patterns in the data is to define some measure of similarity or dissimilarity for the genes (Priness et al., 2007; Kim et al., 2007), which is later combined with a clustering algorithm (Kohonen et al., 2001; Gat-Viks et al., 2003). The Euclidean distance, the Pearson correlation coefficient and the Mutual Information are the most common measures. Although useful in many scenarios, such measures are unable to capture some complex features that have been discovered to be present in the way genes interact with each other. A particularly interesting case is the hierarchy among the genes, a universal pattern that has been extensively observed in the literature, mainly in the context of network analysis (Reka and Barabási, 2002; Wuchty et al., 2003; Barabasi and Oltvai, 2004).

Inspired by latent semantic indexing (LSI) (Deerwester, 1988; Deerwester et al., 1990), a technique originally proposed for textual data analysis, in this paper we propose a new visualization technique to unravel the structure of gene expression datasets.
Although the idea of using textual data analysis techniques in the biological context has been explored in some recent works (Bicego et al., 2010; Ng et al., 2004; Caldas et al., 2009), these approaches use Latent Dirichlet Allocation (LDA) (Blei et al., 2003) as their fundamental model, which neither provides a Euclidean representation of the genes useful for visualization nor takes into account the hierarchical relationships among the genes. In this work we address both problems by means of a new asymmetric latent semantic indexing approach (aLSI), following the existing literature on methods based on asymmetric similarities (Okada and Imaizumi, 1987; Okada, 1990; Chino, 1978, 1990; Muñoz et al., 2003). The contributions of this paper are therefore twofold:

(i) A proof-of-concept analysis to illustrate the importance of using asymmetric gene similarities in gene expression experiments.

(ii) A new asymmetric latent semantic indexing (aLSI) approach to produce meaningful gene mappings, which can be used in combination with previous biological knowledge such as gene ontologies, pathways, protein-protein interaction networks, etc.

Our approach is inspired by the work of Muñoz and González (2012), in which an asymmetric version of the LSI is defined in the textual data context. There, the authors propose a partition of the data into several hierarchical levels, which aims at accounting for the hierarchical relationships between the words of the database. Within each level, a Gram Mercer kernel matrix is obtained by means of the triangular decomposition, which captures the remaining asymmetries not removed by the partition into the different layers. Finally, a Euclidean representation of the words is produced within each level, and the levels are connected using a measure of inclusion.

In this work, we propose an alternative aLSI which does not require a partition of the dataset into hierarchical levels.
This in itself represents an advantage with respect to the work of Muñoz and González (2012), since the choice of the number of layers is avoided. Nevertheless, the key aspect of our approach is to replace the triangular decomposition of the similarity matrix by the polar decomposition, which produces two complementary gene representations. This allows us to produce a global mapping that does not require any partition of the data, while the information provided by the asymmetries in the gene similarity matrix is still taken into account.

This paper is organized as follows. In Section 2 we detail the connection between asymmetric similarities and hierarchies in genetic experiments, and we illustrate this phenomenon in the Human Cancer data set. In Section 3 we propose a new asymmetric latent semantic indexing (aLSI) procedure. In Section 4 we illustrate the utility of the proposed approach in a real data experiment, and in Section 5 we conclude with a discussion of this work.

In this section we illustrate the idea of "gene hierarchy". To this end, we will use the above mentioned Human Cancer data set. Consider the matrix X such that x_kj = 1 if the gene j is significantly expressed in the experiment k, and x_kj = 0 otherwise (see Section 4.1 for details). This gene-experiment matrix is analogous to the term-document matrix common in textual analysis (Muñoz and González, 2012). In this field, it is common to work with a matrix X where x_kj = 1 if the term j appears in document k and x_kj = 0 otherwise. By using the correspondence genes/words and experiments/documents we can apply techniques from the text mining literature to analyse gene expression datasets. Therefore, in the sequel we will use the terms genes-words and experiments-documents interchangeably.

Figure 1: A) Heat map of the micro-array of the Human Cancer dataset. Originally, there are 6830 genes (columns) whose expression is measured in 64 patients (rows) with 14 different types of cancer. Colour intensity represents the expression level of the genes. B) Heat map of the Human Cancer dataset in which only the expressed genes are highlighted (in white). Each row of this matrix can be interpreted as a document whose words are those genes which are differentially expressed.

For now, consider a textual data set and let |x_i| be the number of documents indexed by the term i and |x_i ∧ x_j| the number of documents indexed by both terms i and j. Consider the following asymmetric similarity measure (s_ij ≠ s_ji):

s_ij = |x_i ∧ x_j| / |x_i| = (Σ_k min(x_ik, x_jk)) / (Σ_k x_ik),   (2.1)

which has been previously studied in a number of works related to Information Retrieval (Muñoz, 1997; Martín de Diego et al., 2010).

Figure 2: Evidence of Zipf's law in gene expression experiments: histogram of the norms of the 2093 differentially expressed genes of the Human Cancer data set.

It turns out that expression (2.1) can be interpreted as the degree to which the topic represented by the term i is a subset of the topic represented by the term j. As a measure of inclusion, it was originally proposed by Kosko (1991) in the context of fuzzy set theory. Regarding its interpretation in a textual data example, consider, for instance, a collection of documents containing the term "statistics". In this case a more specific term like "non parametric" will occur in just a subset of them. The relation between "non parametric" and "statistics" is strongly asymmetric, in the sense that the concept represented by the word "non parametric" is a subset of the concept represented by the word "statistics", but not conversely. In the biological context, where s_ij represents the similarity between two genes, expression (2.1) represents the degree to which a gene i is a subclass of, or is hierarchically dependent on, a gene j.

The matrix X contains information about both the terms and the documents of the database.
In the sequel we will use t_j to refer to the terms (columns of X) and d_i to refer to the documents (rows of X). Using the definition of similarity in expression (2.1), the skew-symmetric term associated to each pair of terms t_i, t_j can be written as

(1/2)(s_ij − s_ji) = (1/2)( |t_i ∧ t_j|/|t_i| − |t_i ∧ t_j|/|t_j| ) = (|t_i ∧ t_j| / (2 |t_i| |t_j|)) (|t_j| − |t_i|) ∝ (|t_j| − |t_i|).

Therefore, a large difference between s_ij and s_ji is directly related to a large difference between the norms |t_i| and |t_j|. Thus, the distribution of term norms in the case of asymmetry/hierarchy is clearly far from uniform.

In Figure 2 we show the histogram of the norms of the differentially expressed genes of the Human Cancer data set. The figure shows that a few genes have very large norms while a large number of genes have small norms. This behaviour, which can be modelled by means of Zipf's law (Martín-Merino and Muñoz, 2005), is evidence of asymmetric/hierarchical associations. Genes with large norms correspond to 'biologically relevant' genes involved in many processes (or high-level concepts), whereas genes with small norms represent rarely expressed genes (or very specific concepts). The hierarchy induced on the gene set by the inclusion measure s_ij is directly related to its asymmetric nature, and caused by the strongly asymmetric gene frequency distribution.

Latent semantic indexing (LSI) (Deerwester, 1988) is a useful technique in natural language processing for analysing relationships between a set of documents and the terms they contain. The idea is to produce a set of concepts or latent semantic classes that summarize the content of the dataset. In this section we propose an asymmetric latent semantic indexing that uses as input the similarity in eq. (2.1). In a biological context, we will talk about 'latent genetic classes' to refer to groups of genes that summarize the main content of the data.
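As a concrete illustration, the measure in eq. (2.1) and its skew-symmetric part can be computed from a binary experiments-by-genes matrix in a few lines. This is our own minimal Python sketch (the paper's accompanying code is in R); the toy matrix is hypothetical:

```python
import numpy as np

def asymmetric_similarity(X):
    """s_ij = |x_i ^ x_j| / |x_i| for a binary experiments-by-genes matrix X."""
    co = X.T @ X                 # |x_i ^ x_j|: for 0/1 data, AND = product
    norms = X.sum(axis=0)        # |x_i|: number of experiments expressing gene i
    return co / norms[:, None]   # row i divided by |x_i|  ->  asymmetric matrix

# toy data: gene 0 is 'general' (3 experiments), gene 2 'specific' (1 experiment)
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 0]])
S = asymmetric_similarity(X)
# gene 2 is fully included in gene 0 (S[2, 0] = 1.0), but S[0, 2] = 1/3
skew = 0.5 * (S - S.T)           # skew-symmetric part, sign follows |x_j| - |x_i|
```

Note how the asymmetry directly encodes the inclusion of the 'specific' gene in the 'general' one, which a symmetric measure would miss.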
Next, we introduce the LSI to later generalize it to its asymmetric version.
Consider the n × p document-by-term matrix X whose entries contain the word counts per document. The matrix X^T X contains the correlations among terms t_j and t_k (measured as t_j^T t_k) and XX^T contains the correlations among documents, measured as d_i^T d_s. Using the singular value decomposition (SVD) of X we obtain the decomposition X = U_x Σ_x V_x^T, where U_x and V_x are orthogonal matrices and Σ_x is diagonal and contains the singular values of X. It is straightforward to see that

XX^T = U_x Σ_x Σ_x^T U_x^T   and   X^T X = V_x Σ_x^T Σ_x V_x^T.   (3.1)

Therefore, the immersion of the term t_j into the semantic class space is given by

t̂_j = Σ_x^{-1} U_x^T t_j.   (3.2)

On the other hand, the immersion of document d_i in the same latent space is given by d̂_i = Σ_x^{-1} V_x^T d_i.

Consider the p × p asymmetric similarity matrix (S)_ij = s_ij in eq. (2.1). By means of the SVD we obtain S = U_s Σ_s V_s^T, which leads to the polar decomposition of S (Horn and Johnson, 1991; Higham, 1986). Define L = U_s V_s^T. Then S = K_1 L = L K_2, where

K_1 = U_s Σ_s U_s^T,   (3.3)
K_2 = V_s Σ_s V_s^T.   (3.4)

Note that ||S||_F = ||K_1||_F = ||K_2||_F, where ||·||_F is the Frobenius norm. Also remark that S does not directly decompose into any combination of K_1 and K_2, but these matrices can be understood as the two sources of asymmetry of S. Geometrically speaking, since S V_s = U_s Σ_s, it is straightforward to check that S v_j = σ_j u_j, where u_j and v_j are the columns of U_s and V_s respectively. Therefore the eigenvectors {v_1, ..., v_p} of K_2 are mapped under the asymmetric matrix S onto the scaled orthogonal coordinate system {σ_1 u_1, ..., σ_p u_p}. Equivalently, one can interpret the symmetric effect with respect to the eigenvectors {u_1, ..., u_p}. The asymmetry in S is therefore reflected in the angle between each pair of left and right singular vectors of S. Therefore span{v_1, ..., v_p} and span{u_1, ..., u_p} produce different but complementary representations of the genes. Note that if S is a symmetric matrix then K_1 = K_2 and both representations are equivalent, since u_j = v_j for all j = 1, ..., p. The polar decomposition has been previously used in the analysis of asymmetric relationships in (Gower, 1977, 1998).

The matrices K_1 and K_2 are symmetric and positive semi-definite. Therefore, they are kernel matrices (Aronszajn, 1950; Wahba, 1990) that admit the decompositions K_1 = Φ_1 Φ_1^T and K_2 = Φ_2 Φ_2^T, where Φ_1 = U_s Σ_s^{1/2} and Φ_2 = V_s Σ_s^{1/2} respectively. The two matrices induce two different distances for the terms, which are a consequence of S being asymmetric. Note that if S is symmetric then Φ_1 = Φ_2. Finding a unifying distance (or kernel) using K_1 and K_2 is therefore the key to obtaining an appropriate Euclidean representation for the terms. In this sense, suppose that we are able to find suitable transformations φ_i, i = 1, 2, such that the induced distance

d_{φ_i}(t_j, t_k) = ||φ_i(t_j) − φ_i(t_k)||

corresponds to the one induced by each kernel matrix K_i. This implies that d_{φ_i}(t_j, t_k)^2 = (K_i)_jj + (K_i)_kk − 2(K_i)_jk, where j, k = 1, ..., p and (K_i)_jk = φ_i(t_j)^T φ_i(t_k).

Following González and Muñoz (2013), it is possible to prove that for each matrix K_i there exists a symmetric, continuous and positive-definite kernel function k_i : T × T → IR, where T is a compact set, such that k_i(t_j, t_k) = φ_i(t_j)^T φ_i(t_k), for t_j, t_k ∈ T, is the implicit kernel corresponding to d_{φ_i}(t_j, t_k). See González and Muñoz (2013) for conditions on the existence of such k_i. Each kernel function k_i has a unique associated Reproducing Kernel Hilbert Space (RKHS), whose feature map, or canonical basis, is given by φ_i (Aronszajn, 1950; Wahba, 1990).

The operation of adding the kernels k_1 and k_2 gives rise to a new RKHS whose feature map is the union of φ_1 and φ_2.
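The polar decomposition step can be sketched in a few lines of Python (our own illustration using numpy's SVD; the toy data matrix is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = (rng.random((10, 6)) < 0.4).astype(float)           # toy binary data
S = (X.T @ X) / np.maximum(X.sum(axis=0), 1)[:, None]   # asymmetric similarity (2.1)

U, sigma, Vt = np.linalg.svd(S)
K1 = U @ np.diag(sigma) @ U.T     # first source of asymmetry, U_s Sigma_s U_s^T
K2 = Vt.T @ np.diag(sigma) @ Vt   # second source of asymmetry, V_s Sigma_s V_s^T
L = U @ Vt                        # orthogonal polar factor L = U_s V_s^T

# S factorises as S = K1 L = L K2, with K1, K2 symmetric PSD kernel matrices
assert np.allclose(S, K1 @ L) and np.allclose(S, L @ K2)
# the Frobenius norms coincide: ||S||_F = ||K1||_F = ||K2||_F
assert np.isclose(np.linalg.norm(S), np.linalg.norm(K1))
```

The assertions verify numerically the two factorisations and the norm identity stated above.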
In particular, let k_1 and k_2 be two positive semi-definite kernel functions and let φ_1 and φ_2 be their underlying feature maps. Then k = λ_1 k_1 + λ_2 k_2, with λ_1, λ_2 ≥ 0, is a positive semi-definite kernel with φ = [√λ_1 φ_1, √λ_2 φ_2] as a valid feature map. This property, which can be easily generalized to multiple kernels, implies that the sum of the kernel functions k_1 and k_2 can be understood as the sum of the associated RKHSs. Therefore, the operation K = λ_1 K_1 + λ_2 K_2, with λ_1 = λ_2 = 1/2, defines a new kernel matrix whose induced distances take equally into account the representation of the terms using both kernels or, equivalently in our case, the representations of the genes given by Φ_1 = U_s Σ_s^{1/2} and Φ_2 = V_s Σ_s^{1/2}. That is, the right and left singular vectors of S have the same weight in the final distance induced by K.

An alternative fusion scheme can be found in Muñoz and González (2012). However, in that work the main step to deal with asymmetry is to split the dataset into layers of words with similar norm. Here, we are able to deal with asymmetry in a single step by means of the polar decomposition of S. In the former approach, hierarchical clusters of words are provided, but a unique representation of the terms is not available as it is here. This represents a problem for the generalization and applicability of the work in Muñoz and González (2012) that is solved in our proposal: since the distance among words of different layers is not available, that technique cannot be used in problems like classification, in which a unique distance for the words is needed.

We say that φ is the feature map of a kernel k : T × T → IR if k(t, t′) = ⟨φ(t), φ(t′)⟩ holds for any t, t′ ∈ T, where ⟨·,·⟩ represents the usual l_2 product.

3.4 Generalizing the combination approach

The goal of this section is to generalize the idea described in the previous section in order to propose an approach to combine K_1, K_2 and a third matrix W with prior information about the problem. Such a matrix might be derived from an initial labeling of the terms or the experiments. In the genetic context this is a natural idea, since prior knowledge about the relationships among the genes is common (Wang et al., 2013). Some examples are gene ontologies, pathways, protein-protein interaction networks, etc.
Note that by imposing K to be positive semi-definite, a Euclidean representation of the terms is always available by means of some matrix decomposition K = ΦΦ^T (Schoenberg, 1935; Young and Householder, 1938).

We combine K_1, K_2 and W to obtain a fused similarity matrix K by minimizing

G_τ(K) = ||K − γ_1 F(K_1, K_2)||_F^2 + τ ||K − γ_2 W||_F^2,   (3.5)

where τ > 0, γ_1, γ_2 > 0, and F(K_1, K_2) is a functional combination of the matrices K_1 and K_2 whose output is a symmetric positive semi-definite matrix. The underlying idea in eq. (3.5) is to merge both sources of asymmetry and to keep a balance with the prior knowledge given by W. The fusion scheme proposed in eq. (3.5) can be derived using a regularization theory approach, similar to the one used in the derivation of SVM classifiers (Martín de Diego et al., 2010). The solution to the problem stated in eq. (3.5) is given in the following proposition.

Proposition 1. The minimizer of G_τ(K) for any F, τ > 0 and γ_1 = γ_2 = τ + 1 is given by

K = F(K_1, K_2) + τ W.   (3.6)

Of course, different choices of F lead to different combinations of K_1 and K_2. In this work, and based on the ideas described in the previous section, we consider the arithmetic mean of the matrices, F(K_1, K_2) = (K_1 + K_2)/2.

In this section we make use of the fused matrix K in eq. (3.6), computed from the asymmetric similarity matrix S, to redefine the LSI. We use the ideas from (Park and Ramamohanarao, 2009; Muñoz and González, 2012), with the special novelty that the term representation is given by the distances induced by our particular choice of K.

Following the ideas described in Section 3.3, let φ be a transformation of the terms such that the induced distance on the terms, given by d_φ(t_j, t_k) = ||φ(t_j) − φ(t_k)||, corresponds to the one induced by the kernel matrix K. Consider the matrix Φ whose rows, the φ(t_j), represent the transformations of the t_j to the latent class/feature space. Following the LSI scheme, we apply the SVD to the transformed p × m term matrix Φ = UΣV^T, and we obtain that K = ΦΦ^T = UΣΣ^T U^T = (UΛ^{1/2})(UΛ^{1/2})^T, where Λ = ΣΣ^T = Σ^2 is the diagonal matrix of eigenvalues of K and Σ is the diagonal matrix of singular values of Φ. In this context the matrix K plays the role of X^T X in the original LSI formulation. Then the immersion of φ(t_i) is given by

φ_s(t_i) = Σ^{-1} U^T φ(t_i) = Λ^{-1/2} U^T φ(t_i).

Therefore, by replacing X^T X by K we 'kernelize' the LSI using the original asymmetric similarity matrix S: we replace the original linear mapping of the LSA by the non-linear one given by φ.

The semantic classes in the latent space can be identified with clusters of the transformed term data. In order to estimate such semantic classes c_1, ..., c_q we apply Gaussian mixture model-based clustering (Fraley and Raftery, 2002).
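To make the fusion and projection steps concrete, the following Python sketch (a minimal illustration of ours, not the authors' R implementation; the function name and toy matrices are hypothetical) fuses the two asymmetry kernels with the prior matrix W as in Proposition 1 with F the arithmetic mean, and embeds the terms via the eigendecomposition of the fused kernel:

```python
import numpy as np

def fuse_and_embed(K1, K2, W, tau=0.5, dim=2):
    """Fuse the asymmetry kernels with the prior W (eq. (3.6)) and
    project the terms onto the leading latent classes."""
    K = 0.5 * (K1 + K2) + tau * W        # minimiser of (3.5), F = (K1 + K2)/2
    lam, U = np.linalg.eigh(K)           # K is symmetric positive semi-definite
    idx = np.argsort(lam)[::-1][:dim]    # keep the leading eigendirections
    lam, U = np.clip(lam[idx], 0.0, None), U[:, idx]
    return U * np.sqrt(lam)              # term coordinates, rows of U Lambda^{1/2}

# toy PSD kernels standing in for the two sources of asymmetry and the prior
rng = np.random.default_rng(1)
A, B = rng.random((5, 5)), rng.random((5, 5))
K1, K2, W = A @ A.T, B @ B.T, np.eye(5)
Z = fuse_and_embed(K1, K2, W, tau=0.5, dim=2)   # 5 terms in a 2-d latent space
```

The rows of `Z` give the Euclidean coordinates of the terms whose inner products reproduce the leading part of the fused kernel K.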
That is, for each term we obtain an estimate of the probability of membership, p(c_i | t_j), in each one of the latent semantic classes c_i. We assume that each cluster is generated by a multivariate Gaussian distribution f_k(t) = N_k(μ_k, Σ_k), where μ_k and Σ_k are the mean vector and the covariance matrix respectively. The final mixture density is therefore given by

f(t) = Σ_{k=1}^q α_k N_k(t | μ_k, Σ_k),

where each α_k represents the prior probability or weight of component k. The main advantage of this approach is that we can obtain a density estimator for each cluster, and a 'soft' classification rule is available: each term may belong to more than one semantic class via the use of the conditional probabilities p(c_i | t_j).

In this section we summarize the steps to apply the proposed asymmetric latent semantic indexing to a data set. As we detailed in Section 2, there exist strong similarities between textual and gene expression data, therefore our proposal can be used in both scenarios. See Table 1 for details.
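The soft-assignment step can be sketched as follows. This is our own numpy illustration assuming the mixture parameters μ_k, Σ_k, α_k have already been fitted (in practice they would be estimated by EM, e.g. with mclust or scikit-learn); the function name and toy data are ours:

```python
import numpy as np

def soft_memberships(T, mus, covs, alphas):
    """Posterior membership probabilities p(c_k | t_j) of each row of T
    under a Gaussian mixture with fixed parameters."""
    n, d = T.shape
    dens = np.empty((n, len(alphas)))
    for k, (mu, cov, a) in enumerate(zip(mus, covs, alphas)):
        diff = T - mu
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        dens[:, k] = a * np.exp(-0.5 * quad) / np.sqrt(
            (2 * np.pi) ** d * np.linalg.det(cov))
    return dens / dens.sum(axis=1, keepdims=True)   # rows sum to one

# two well-separated latent classes with fixed (already fitted) parameters
T = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
mus = [np.zeros(2), np.full(2, 5.0)]
covs = [np.eye(2), np.eye(2)]
P = soft_memberships(T, mus, covs, [0.5, 0.5])
# rows 0 and 1 put almost all mass on class 0, row 2 on class 1
```

Each row of `P` is the 'soft' classification rule described above: a term close to two class centres receives non-negligible probability under both.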
Input: Genes-by-experiments matrix X.
Output: Map of terms (genes), latent semantic classes.

1. Obtain the asymmetric similarity S.
2. Decompose S = U_s Σ_s V_s^T.
3. Obtain the two sources of asymmetry K_1 and K_2.
4. Obtain the matrix of labels of the terms (or genes) W.
5. Fuse the matrices using the scheme proposed in (3.6).
6. Obtain the projections of the terms onto the latent semantic classes.
7. Assign probabilities to the classes using a mixture model.
8. Visualize the genes and the mixture model using MDS.

Table 1: Main steps of the aLSI algorithm.

In this section we analyse the Human Cancer data set, described in the introduction of this work, using the proposed asymmetric latent semantic indexing detailed in Section 3.5. The analysis consists of two main steps. First, we determine the genes which are statistically expressed in each experiment and obtain the matrix X. Second, we use this matrix to obtain latent genetic classes of genes that we will associate with different types of cancer. In order to find the clusters of genes, we also use the Euclidean distance and the correlation matrix to illustrate the benefits of our approach in this context. The R code to replicate all the figures and results of this work is available at https://github.com/javiergonzalezh/aLSI.

The starting point of our analysis is the matrix Y, which consists of the expression levels of 6830 genes in 64 experiments. The first step is to identify which genes are differentially expressed, that is, to statistically decide whether for a given gene its expression is greater than what we would expect due to natural random variation.

The motivation for this gene filtering is that a relatively small number of the genes in the database should be expressed in each experiment. Different methods have been proposed in the literature (Yang et al., 2013). In this work, we follow a simple and straightforward approach which uses the coefficient of variation CV = sd(x)/|x̄| to discriminate between expressed and non-expressed genes.
The reason for using this coefficient is the linear relationship between the mean and the standard deviation of the gene expression across the experiments; see Figure 3.A. In particular, we consider that a gene is differentially expressed in the database if the value of its coefficient of variation is larger than 0.5. Of course, other thresholds are possible if additional information about the experimental noise is available. In Figure 3.B we show the histogram of the coefficients of variation of all the genes of the database. The total number of genes with a CV larger than 0.5 is 2093.

Figure 3: The first step towards the identification of the latent genetic classes of the database is to perform a differential analysis of the genes. A) Mean vs. standard deviation of the 6830 genes of the Human Cancer data set across the 64 available patients. B) Histogram of the CV of all the genes.

Given the set of expressed genes, in order to build the matrix X, we need to decide when a particular gene is expressed in an experiment. To this end, we take the maximum of the expression over the set of non-expressed genes and use it as a threshold in the set of expressed ones. The purpose of this threshold is to capture the random variation in the data. Figure 4 shows the expression values of two genes across the 64 experiments. One of the genes (left) is differentially expressed in those experiments above the selected threshold (horizontal dotted line at 4.46). In particular, this gene is assumed to be significantly expressed in a total of 5 experiments. On the other hand, in Figure 4 (right), we show the expression values of a non-expressed gene. All the values remain below the threshold, reflecting that the variations in expression are random. In Figure 1.B we show the heat map of the 2093 differentially expressed genes of the dataset.
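The CV-based filter can be sketched in a few lines. This is our own Python illustration of the idea (the paper's analysis was done in R, and the toy matrix below is hypothetical):

```python
import numpy as np

def cv_filter(Y, cut=0.5):
    """Flag differentially expressed genes by their coefficient of
    variation across experiments (rows of Y are experiments)."""
    mean = Y.mean(axis=0)
    sd = Y.std(axis=0, ddof=1)
    cv = sd / np.abs(mean)     # coefficient of variation per gene
    return cv > cut

# toy matrix: gene 0 is flat across experiments, gene 1 varies strongly
Y = np.array([[1.0, 1.0],
              [1.1, 5.0],
              [0.9, 1.0],
              [1.0, 1.0]])
expressed = cv_filter(Y)       # gene 1 passes the CV > 0.5 cut, gene 0 does not
```

On the real data this filter retains the 2093 genes mentioned above.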
Figure 4: Illustration of the profiles of two genes. On the left we show a gene differentially expressed in 5 experiments. On the right we show the profile of a non differentially expressed gene.
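The thresholding rule illustrated in Figure 4 can be sketched as follows (our own Python illustration; the toy matrix and the `expressed` mask are hypothetical and match the CV example above only for concreteness):

```python
import numpy as np

def binarize(Y, expressed):
    """Build the binary gene-experiment matrix X: the noise threshold is
    the maximum expression observed among the non-expressed genes."""
    thr = Y[:, ~expressed].max()                 # captures the random variation
    return (Y[:, expressed] > thr).astype(int)   # x_kj = 1 if gene j expressed in k

Y = np.array([[1.0, 1.0],
              [1.1, 5.0],
              [0.9, 1.0],
              [1.0, 1.0]])
expressed = np.array([False, True])
X = binarize(Y, expressed)   # the retained gene is expressed only in experiment 1
```

The resulting 0/1 matrix X is the input to the aLSI procedure of Table 1.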
Next, we apply the asymmetric latent semantic indexing proposed in Section 3.5 to the differentially expressed genes of the Human Cancer dataset. To this end, we calculate the gene similarity following (2.1) and proceed with the steps of the algorithm in Table 1.

The matrix W in expression (3.6) is calculated using the labels of the experiments. First we assign a membership of the genes to each one of the 14 types of cancer: "CNS", "RENAL", "BREAST", "NSCLC", "UNKNOWN", "OVARIAN", "MELANOMA", "PROSTATE", "LEUKEMIA", "K562B-repro", "K562A-repro", "COLON", "MCF7A-repro", and "MCF7D-repro". To this end, we assign the gene i to the type of cancer k if it is expressed in at least one of the experiments of that type. Note that the same gene might belong to more than one class simultaneously. We define the gene similarity matrix Q from these memberships, and W in (3.6) is calculated as W = (Q_1 + Q_2)/2, where Q_1 and Q_2 are the matrices resulting from the polar decomposition of Q. Note that the matrix W plays the role of the labels in the combination, following the idea of kernel combinations in the support vector classification context (Martín de Diego et al., 2010). The parameter τ is fixed to 0.

Each gene is then assigned to its most probable latent class,

class(gene_i) = arg max_{c_i} p(c_i | gene_i).

The conditional probabilities p(c_i | gene_i) can be interpreted in this context as fuzzy membership degrees. In Table 2 we show the 10 genes with the highest probability in each cluster. In Table 3 we show the cross frequencies of the genes in the different types of cancer and clusters. Note that the same gene might belong to different cancer groups simultaneously, therefore the correspondence between clusters and cancer types need not be one to one. Some interesting conclusions show up when Table 3 is interpreted. BREAST, COLON, MELANOMA, NSCLC and RENAL cancers seem to be associated with single clusters.
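A minimal sketch of this construction of W (our own Python illustration; the membership matrix and function name are hypothetical): build an asymmetric similarity Q over the gene-to-cancer-type memberships as in (2.1), extract its polar factors, and average them:

```python
import numpy as np

def prior_matrix(M):
    """W from a binary genes-by-classes membership matrix M
    (M[i, k] = 1 if gene i is expressed in cancer type k)."""
    co = M @ M.T                                    # shared class memberships
    Q = co / np.maximum(M.sum(axis=1), 1)[:, None]  # asymmetric, as in (2.1)
    U, s, Vt = np.linalg.svd(Q)
    Q1 = U @ np.diag(s) @ U.T                       # polar factors of Q
    Q2 = Vt.T @ np.diag(s) @ Vt
    return 0.5 * (Q1 + Q2)                          # W = (Q1 + Q2)/2

M = np.array([[1, 0, 0],   # gene 0: one cancer type
              [1, 1, 0],   # gene 1: two types (overlap is allowed)
              [0, 1, 1]])
W = prior_matrix(M)        # symmetric positive semi-definite prior
```

Being an average of two PSD polar factors, W is itself symmetric and positive semi-definite, so it can enter the fusion (3.6) directly.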
The cancers K562A-repro and K562B-repro appear clearly together in the same group (group 9), as also occurs with the cancers MCF7A-repro and MCF7D-repro. Apart from the interpretability of the groups in terms of types of cancer, Table 3 also helps to identify similarities between types of cancer. Similar patterns between cancers across the clusters (similar rows) can be associated with similar types of cancer. The previously mentioned case of the K562A-repro and K562B-repro types is a clear example. A graphical illustration of these results can be observed in Figure 6, which shows a Sammon mapping of the 14 latent genetic classes (types of cancer) using the results from Table 3.

In this paper we have proposed a new approach to visualize gene expression experiments. The key idea is to use an asymmetric similarity for the genes within the latent semantic indexing context, in order to obtain latent genetic classes, or groups of genes which are similar in their expression patterns. We provide both a Euclidean representation of the genes, able to illustrate the different genetic patterns of expression in the data set, and the probabilities of membership of each gene to those classes.

Figure 5: Multidimensional scaling projections (1st, 2nd, 3rd components) using the similarity matrix produced by the aLSI, the Pearson correlation and the Euclidean distance. The group colouring corresponds to the membership of the genes in the different groups of cancer: G1 (BREAST), G2 (CNS), G3 (COLON), G4 (K562A-repro), G5 (K562B-repro), G6 (LEUKEMIA), G7 (MCF7A-repro), G8 (MCF7D-repro), G9 (MELANOMA), G10 (NSCLC), G11 (OVARIAN), G12 (PROSTATE), G13 (RENAL), G14 (UNKNOWN).

The proposed method has been used to analyse the Human Cancer dataset, obtaining new and valuable information that remains hidden to classical similarity measures like the Pearson correlation and
the Euclidean distance.

Figure 6: Sammon mapping of the 14 types of cancer using the results from Table 3.

This work leads to a wide variety of future analyses. On the more theoretical and methodological side, the study of the geometrical properties of the matrices K_1 and K_2, and of further combination procedures, is of interest. For instance, we aim to explore the geometric and harmonic weighted means given by

F_geometric^t(K_1, K_2) = K_1^{1/2} (K_1^{-1/2} K_2 K_1^{-1/2})^t K_1^{1/2},
F_harmonic^t(K_1, K_2) = (t K_1^{-1} + (1 − t) K_2^{-1})^{-1},

for t ∈ [0, 1], and to study their effects on the final gene representation.

In addition, although we have presented a method in which the sources of asymmetry of the gene similarity are merged into a symmetric matrix, it is our plan to investigate potential combinations of our approach with previously developed asymmetric multidimensional scaling techniques (Chino, 2012). Also, new ways to embed prior knowledge into the matrix W will be the focus of further study, which we envision will have a large impact for practitioners: in this work we have only considered the experiment labels to obtain a measure of association for the genes. In the future, however, it is our aim to consider gene ontologies and other topological measures of biological networks, like protein-protein interaction networks, to improve the final gene mapping and the interpretation of the obtained gene semantic classes.

A Appendix
Proof (Proposition 1). To minimize G_τ(K) we take the partial derivative with respect to each entry (K)_ls. Then

∂G_τ[K] / ∂(K)_ls = 2((K)_ls - γ F(K_1, K_2)_ls) + 2τ((K)_ls - γ (W)_ls),   (A.1)

for s, l = 1, ..., m. Setting these partial derivatives to zero yields a linear system whose unique solution is the matrix K* with entries

(K*)_ls = (γ / (τ + 1)) F(K_1, K_2)_ls + (γτ / (τ + 1)) (W)_ls,   (A.2)

for l, s = 1, ..., m, that is, K* is proportional to F(K_1, K_2) + τW. To check whether K* is a maximum or a minimum we evaluate the Hessian matrix of G_τ[K] at K*. This is the diagonal matrix

H(K*) = 2(τ + 1) I,   (A.3)

which is positive definite for any τ > 0. Hence, (A.2) is a minimum of (3.5) for any τ > 0.

Proposition 2. Let k_1 and k_2 be two positive semi-definite kernel functions and let φ_1 and φ_2 be their underlying feature maps. Then k = λ_1 k_1 + λ_2 k_2, with λ_1, λ_2 ≥ 0, is a positive semi-definite kernel with φ = [√λ_1 φ_1, √λ_2 φ_2] as a valid feature map.

Proof (Proposition 2). We only need to show that k(t, t') = ⟨φ(t), φ(t')⟩ holds for k and φ. In our case we have

⟨φ(t), φ(t')⟩ = ⟨(√λ_1 φ_1(t), √λ_2 φ_2(t)), (√λ_1 φ_1(t'), √λ_2 φ_2(t'))⟩
             = λ_1 ⟨φ_1(t), φ_1(t')⟩ + λ_2 ⟨φ_2(t), φ_2(t')⟩
             = λ_1 k_1(t, t') + λ_2 k_2(t, t')
             = k(t, t'),

which shows that the proposition holds.

Acknowledgments

We thank the support of the Spanish Grant Nos. MEC-2007/04438/00 and DGULM-2008/00059/00. We also thank Georges E. Janssens for his helpful comments on the manuscript.
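The two appendix propositions admit a quick numerical sanity check. The sketch below (all variable names and the γ = 1 simplification are ours, not from the paper) verifies that the closed-form combination of Proposition 1 minimizes the penalized fit, and that the non-negative kernel combination of Proposition 2 remains positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(1)
m, tau = 6, 0.7

# Two PSD matrices standing in for F(K1, K2) and the prior-knowledge matrix W.
A = rng.standard_normal((m, m))
B = rng.standard_normal((m, m))
F = A @ A.T
W = B @ B.T

def G(K):
    """Objective of Proposition 1 (with gamma = 1 for simplicity)."""
    return np.sum((K - F) ** 2) + tau * np.sum((K - W) ** 2)

# Closed-form minimizer (A.2): a convex combination of F and W.
K_star = (F + tau * W) / (1 + tau)

# G is strictly convex, so any perturbation of K_star increases the objective.
worse = all(G(K_star + 0.1 * rng.standard_normal((m, m))) > G(K_star)
            for _ in range(100))
print(worse)  # True

# Proposition 2: a non-negative combination of PSD kernels stays PSD.
lam1, lam2 = 0.3, 1.4
print(np.linalg.eigvalsh(lam1 * F + lam2 * W).min() >= -1e-10)  # True
```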
References
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404.

Barabási, A.-L. and Oltvai, Z. N. (2004). Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5(2):101-113.

Bicego, M., Lovato, P., Oliboni, B., and Perina, A. (2010). Expression microarray classification using topic models. In Proceedings of the 2010 ACM Symposium on Applied Computing, SAC '10, pages 1516-1520, New York, NY, USA. ACM.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.

Caldas, J., Gehlenborg, N., Faisal, A., Brazma, A., and Kaski, S. (2009). Probabilistic retrieval and visualization of biologically relevant microarray experiments.

Chino, N. (1978). A graphical technique for representing the asymmetric relationships between n objects. Behaviormetrika, 5:23-40.

Chino, N. (1990). A generalized inner product model for the analysis of asymmetry. Behaviormetrika, 27:25-46.

Chino, N. (2012). A brief survey of asymmetric MDS and some open problems. Behaviormetrika, 39:127-165.

Deerwester, S. (1988). Improving information retrieval with latent semantic indexing. In Borgman, C. L. and Pai, E. Y. H., editors, Proceedings of the 51st ASIS Annual Meeting (ASIS '88), volume 25, Atlanta, Georgia. American Society for Information Science.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.

Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97:611-631.

Gat-Viks, I., Sharan, R., and Shamir, R. (2003). Scoring clustering solutions by their biological relevance. Bioinformatics, 19(18):2381-2389.

González, J. and Muñoz, A. (2013). Functional analysis techniques to improve similarity matrices in discrimination problems. Journal of Multivariate Analysis, 120:120-134.

Gower, J. (1977). The analysis of asymmetry and orthogonality. In Barra, J. et al., editors, Recent Developments in Statistics, pages 109-123. North Holland Press, Amsterdam.

Gower, J. (1998). Orthogonality and its approximation in the analysis of asymmetry. Linear Algebra and its Applications, 278:183-193.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, second edition.

Higham, N. J. (1986). Computing the polar decomposition with applications. SIAM Journal on Scientific and Statistical Computing, 7:1160-1174.

Horn, R. A. and Johnson, C. R. (1991). Topics in Matrix Analysis. Cambridge University Press.

Kim, K., Zhang, S., Jiang, K., Cai, L., Lee, I.-B., Feldman, L. J., and Huang, H. (2007). Measuring similarities between gene expression profiles through new data transformations. BMC Bioinformatics, 8:29.

Kohonen, T., Schroeder, M. R., and Huang, T. S., editors (2001). Self-Organizing Maps. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 3rd edition.

Kosko, B. (1991). Neural Networks and Fuzzy Systems: A Dynamical Approach to Machine Intelligence. Prentice Hall.

Martín de Diego, I., Muñoz, A., and Moguerza, J. M. (2010). Methods for the combination of kernel matrices within a support vector framework. Machine Learning, 78:137-174.

Martín-Merino, M. and Muñoz, A. (2005). Visualizing asymmetric proximities with SOM and MDS models. Neurocomputing, 63:171-192.

Muñoz, A. (1997). Compound key word generation from document databases using a hierarchical clustering ART model. Intelligent Data Analysis, 1(1-4):25-48.

Muñoz, A. and González, J. (2007). Joint diagonalization of kernels for information fusion. In Proceedings of the Congress on Pattern Recognition, 12th Iberoamerican Conference on Progress in Pattern Recognition, Image Analysis and Applications, CIARP '07, pages 556-563, Berlin, Heidelberg. Springer-Verlag.

Muñoz, A. and González, J. (2008). Functional learning of kernels for information fusion purposes. In Ruiz-Shulcloper, J. and Kropatsch, W. G., editors, CIARP, volume 5197 of Lecture Notes in Computer Science, pages 277-283. Springer.

Muñoz, A. and González, J. (2012). Hierarchical latent semantic class extraction using asymmetric term similarities. Behaviormetrika, 39(1):91-109.

Muñoz, A., González, J., and de Diego, I. M. (2006). Local linear approximation for kernel methods: the Railway kernel. In Trinidad, J. F. M., Carrasco-Ochoa, J. A., and Kittler, J., editors, CIARP, volume 4225 of Lecture Notes in Computer Science, pages 936-944. Springer.

Muñoz, A., de Diego, I. M., and Moguerza, J. M. (2003). Support vector machine classifiers for asymmetric proximities. In Artificial Neural Networks and Neural Information Processing, ICANN/ICONIP 2003, pages 217-224. Springer.

Ng, S.-K., Zhu, Z., and Ong, Y.-S. (2004). Whole-genome functional classification of genes by latent semantic analysis on microarray data. In Proceedings of the Second Conference on Asia-Pacific Bioinformatics, Volume 29, APBC '04, pages 123-129, Darlinghurst, Australia. Australian Computer Society, Inc.

Okada, A. (1990). A generalization of asymmetric multidimensional scaling. In Knowledge, Data and Computer-Assisted Decisions, pages 127-138. Springer.

Okada, A. and Imaizumi, T. (1987). Nonmetric multidimensional scaling of asymmetric proximities. Behaviormetrika, 21:81-96.

Park, L. A. F. and Ramamohanarao, K. (2009). Kernel latent semantic analysis using an information retrieval based kernel. In CIKM, pages 1721-1724.

Priness, I., Maimon, O., and Ben-Gal, I. E. (2007). Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics, 8.

Reka, A. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47-97.

Schoenberg, I. J. (1935). Remarks to Maurice Fréchet's article "Sur la définition axiomatique d'une classe d'espaces distanciés vectoriellement applicable sur l'espace de Hilbert". Annals of Mathematics, 36(3):724-732.

Wahba, G. (1990). Spline Models for Observational Data. Series in Applied Mathematics, volume 59. SIAM, Philadelphia.

Wang, Z., Xu, W., San Lucas, F. A., and Liu, Y. (2013). Incorporating prior knowledge into gene network study. Bioinformatics.

Wuchty, S., Ravasz, E., and Barabási, A.-L. (2003). The architecture of biological networks.

Yang, E.-W., Girke, T., and Jiang, T. (2013). Differential gene expression analysis using coexpression and RNA-Seq data. Bioinformatics, 29(17):2153-2161.

Young, G. and Householder, A. S. (1938). Discussion of a set of points in terms of their mutual distances. Psychometrika, 3(1):19-22.