On the gene expression landscape of cancer
OOn the gene expression landscape of cancer
Augusto Gonzalez (1,2), Yasser Perera (3,4), Rolando Perez (1,5)(1) University of Electronic Science and Technology of China, Chengdu, People Republic of China(2) Institute of Cybernetics Mathematics and Physics, Havana, Cuba(3) China-Cuba Biotechnology Joint Innovation Center, Yongzhou,People Republic of China(4) Center of Genetic Engineering and Biotechnology, Havana, Cuba(5) Center of Molecular Immunology, Havana, Cuba
KEYWORDS:
Gene Expression; Principal Component Analysis; Cancer
ABSTRACT
A principal component analysis of the TCGA data for 15 cancer localizations unveils the followingqualitative facts about tumors: 1) The state of a tissue in gene expression space may be described bya few variables. In particular, there is a single variable describing the progression from a normaltissue to a tumor. 2) Each cancer localization is characterized by a gene expression profile, in whichgenes have specific weights in the definition of the cancer state. There are no less than 2500differentially-expressed genes, which lead to power-like tails in the expression distributionfunctions. 3) Tumors in different localizations share hundreds or even thousands of differentiallyexpressed genes. There are 6 genes common to the 15 studied tumor localizations. 4) The tumorregion is a kind of attractor. Tumors in advanced stages converge to this region independently ofpatient age or genetic variability. 5) There is a landscape of cancer in gene expression space with anapproximate border separating normal tissues from tumors.
INTRODUCTION
Cancer is a very complex phenomenon [1]. From one side, the number of internal variablesparticipating in relevant processes in a tissue is huge. The number of human genes, for example, isaround 60000. Trying to infer the behavior of such a system from observations in a few hundredsamples is very challenging. The general laws and state variables are not completely known orunderstood. From the other side, as in any aspect of life, ties with the environment are very strong.Enormous coordinated efforts aimed at understanding basic aspects of cancer have led to manyimportant results. A lot of information about genes, cells and tissues have been compiled and putinto public databases [2-4]. The analysis of such information allowed the identification of mutationsignatures, immune characteristics, etc [5-12]. From the theoretical point of view, current views to cancer include cell-intrinsic (i.e. genetic &epigenetic) and cell-extrinsic phenomena (i.e. micro-environment), as well as population geneticsapproaches with random drift and directional selection shaping, what has been coined as somaticevolution [13,14]. There is even a plausible hypothesis that cancer is an atavism, that is acooperative state of multi-cellular organisms, prior to modern metazoans [15-19].In spite of all this progress, a simple enough picture, quantifying the evolution of a normal tissuetowards cancer is still lacking. In this paper, we present a contribution to the qualitative perspective by stressing a few qualitativefacts on cancer which come from the analysis of gene expression data. The title of the paper,“cancer gene expression landscape”, indicates the aim at placing all tumors in a single plot anddelineating the border between normal tissues and tumors. he data, provided by The Cancer Genome Atlas (TCGA) project [4], is processed by standardPrincipal Component Analysis (PCA, [20-22]). In the paper, we avoid using models or elaboratedtheoretical constructions that could hide basic facts. Instead of this, we focus on general resultsfollowing straightforwardly from the expression data.
QUALITATIVE RESULTS
We take tissue expression data for 15 cancer localizations from the TCGA project. This is a wellcurated database. Gene expressions are measured by sequencing the mRNA produced in thetranscription process (RNA-seq, [23]). The data is in the number of fragments per kilo-base of genelength per mega-base of reads format (FPKM, [24]). We pick up localizations where there are atleast 10 normal samples, and the number of tumor samples is greater than 160. The studied casesare shown in Table I.(Insert Table I)In the Methods section, the PCA used methodology is briefly explained. We take the meangeometric average over normal samples in order to define the reference expression for each gene,and normalize accordingly to obtain the differential expressions, ē = e/e ref . Finally, we take the base2 logarithm, ê = Log2 (ē), to define the fold variation. Besides reducing the variance, the logarithmallows treating over- and sub-expression in a symmetrical way. The covariance matrix is defined interms of ê. We forced the reference for the PC analysis to be at the center of the cloud of normalsamples. This is what actually happens in a population, where most individuals are healthy andcancer situations are rare. The eigenvectors of the covariance matrix define the PC axes: PC1, PC2,etc, and projection over them define the new state variables. By definition, PC1 captures the highestfraction of the total variance in the sample set.
Statement 1. The state of a tissue in gene expression space may be described by a fewvariables. In particular, there is a single variable describing the progression from a normaltissue to a tumor.
Fig. 1 shows the PCA results for three of the studied tissues: Kidney renal papillary cell carcinoma(KIRP), Lung squamous cell carcinoma (LUSC) and Liver hepatocellular carcinoma (LIHC). In theleft panel, the distribution of normal and tumor samples in the (PC1, PC2) plane suggests that thereare well defined normal and tumor regions, and that the first PC1 variable may discriminatebetween a normal sample and a tumor. PC1 can thus be labeled as the cancer axis. (Insert Fig. 1)The projection along PC1 may be used to quantify the progression from a normal tissue to a tumor.Comparison with other qualifications like the tumor stage is interesting. We shall come back to thispoint below.The right panel, on the other hand, evaluates the fraction of total variance captured by the first n PCvariables. In LUSC, for example, the first three components account for 74% of the variance.Notice that with a few such variables we may describe to a considerable extent the complexity ofLUSC.This reduced number of variables, well below the number of constituent genes, may be taken as theeffective number of degrees of freedom of the complex system represented by a tissue. The fraction of variance depends on the sample set, and may vary if we enlarge or reduce the set.Thus, we should take the results as approximate, semi quantitative ones, that shall improve as thesample size is enlarged.owever, in spite of the statistical origin of the new variables we may use them to describe theactual state of a given sample. Each unitary vector along any of the PC axes defines an expressionprofile, with a meaning. The fact that with 6 - 10 such variables we may account for 80 % or moreof the data dispersion means that the effective dimensionality of the state space is much less than itseems. Genes do not take arbitrary expressions, but act in a concerted way. Groups of genes,metagenes or sub-networks express themselves as profiles.
Statement 2. Each cancer localization is characterized by a gene expression profile, in whichgenes have specific weights in the definition of the cancer state.
Let us call v the eigenvector along PC1 (boldface denotes vectors). We showed above that the PC1axis accounts for the largest fraction of variance and that projection along it may be taken as anindicator of the malignant state. For a given sample with fold expression vector ê , the projection x over the PC1 axis is precisely defined as: x = ê . v = Σ êi v The v vector may be thought to provide a metagene [25] or gene expression profile of cancer inthe tissue, i.e. the set of over- or under-expressed genes (and their relative importance) that definethe cancer state. V is the weight of gene i in the definition of the cancer state. A positive signmeans that the gene is over-expressed in the tumor. Fig. 2 left panel shows the 30 genes with the largest contributions to v in the same tumorsrepresented in Fig. 1. Note the signs, positive and negative indicating over- and under-expressionrespectively. The components of the v vector define the weights of these genes in the definition ofthe cancer state. In principle, because of their large weights, these 30 genes could be used as cancerbiomarkers, however their specific roles in each tissue deserve further study. In LUSC, for example,the gene with the largest weight is SFTPC, a silenced gene with an important role in lunghomeostasis [26,27]. The analogous genes in KIRP and LIHC are Uromodulin (UMOD) andCytochrome P450 family 1 subfamily A member 2 (CYP1A2), respectively. This analysis ispromising and, to the best of our knowledge, have not been sufficiently exploited so far.(Insert Fig. 2)The central and right panels of Fig, 2 show genes, ordered according to their differentialexpressions, for the centers of the tumor clouds (geometric averages over tumor samples). Only thetails with higher over- or under-expressions are shown. Notice that these tails contains a fewthousands of genes, the rest of the 60000 genes are not differentially expressed. The distributionsare not symmetrical. Whereas Kidney Papillary Cell Carcinoma (KIRP) is dominated by silencedgenes, LUSC has nearly equal proportions of under- and over-expressed genes, and in Liver HepaticCell Carcinoma (LIHC) the over-expressed genes are more numerous. Log-log plots stress that thetails exhibit a power-like (Pareto, [28,29]) behavior, i.e. the number of genes with differentialexpression greater than a given value is an inverse power of the expression. Statement 3. Tumors in different localizations share hundreds or even thousands ofdifferentially expressed genes. There are 6 genes common to 15 tumor localizations.
For each localization, we select the most significant 2500 genes with the largest contributions to thevector v along the PC1 direction defining the cancer state. This number, although arbitrary, isdictated by the previous results on the expression distribution function. Let us stress that these aregenes with significant differential expressions and great importance in the definition of the cancerstate.able II shows the number of shared genes for pairs of localizations. Notice that these numbers varyin the interval between 314 and 1889. Large numbers of shared genes are characteristic of tumors inthe same organ but originating in different cells (lung, kidney). However, there are also tumorssharing unexpectedly large numbers of genes. For example, tumors in the uterine corpus (UCEC)and bladder (BLCA) share more than 1300 genes. (Insert Table II)Let us stress that there are 49 genes common to a group of 11 tumors, PRAD-LUSC-LUAD-UCEC-BLCA-ESCA-BRCA-HNSC-COAD-READ-STAD, and six genes common to all of the studiedlocalizations. They are MMP11 (+), C7 (-), ANGPTL1 (-), UBE2C (+), IQGAP3 (+) and ADH1B(-). Their differential expressions are very similar in all of the studied tumors. The signs added inparenthesis mean that the gene is over- or under-expressed in tumors. The six identified pan-cancer genes have been recently pointed out as playing a significant role inmany cancers [30-35]. It is noteworthy that these genes are straightforwardly related to cancerhallmarks [36,37]: i.e. invasion, suppression of the immune response, angiogenesis, proliferationand changes in metabolism. Shared genes among groups of tumors open the question about universal therapies for these groups. Below, we notice that pan cancer genes (the six ones common to all of the tumors) play a role inboth tissue differentiation and in the definition of the border between normal tissues and tumors. Inaddition, the number of shared genes seems to be related to the proximity of tumors in theexpression space. Statement 4. The tumor region is a kind of attractor. Tumors in advanced stages converge tothis region independently of patient age or genetic variability.
As may be seen in Fig. 1, regions corresponding to normal and tumor samples are well defined andpartially disjoint in the expression space. The sample variability comes from genetic differences,patient ages and the evolution history of each individual. The fact that, in spite of variability,regions are well defined in expression space is in favor of the attractor paradigm of cancer [38-40]in which the cancer region should be the region of confluence of all somatic evolution trajectoriesout of the normal area. In a very reductionist view one may think, for example, about normalfunctioning and cancer as two stable solutions of a global gene regulatory network [41,42].We may conduct a more precise test of the attractor hypothesis by studying the dependence of thedistribution functions on patient age. Let us consider, again, LUSC as an example. According toage, in LUSC we may define 4 subgroups of samples: Normal Young (NY), Normal Old (NO),Tumor Young (TY) and Tumor Old (TO). These subgroups are in some sense arbitrarily defined.First, the label “normal” refers to a pathologically normal sample from a patient with a tumor.Second, in the example “young” labels samples where age is lower than 62 years. These conditionsare dictated by the availability of samples, as in any statistical analysis. Nevertheless, the results are very interesting. The (over-) expression distribution function isvisualized in Fig. 3. We compute (mean geometric) averages over the NY, NO, TY and TOsubgroups, and use the NY values as references in order to define normalized (differential)expressions in the remaining subgroups: ē NO , ē TY , and ē TO . These vectors characterize the centers oftheir respective clouds of samples. Genes are sorted with regard to their normalized expressionvalues. (Insert Fig. 3)here are deviations for a reduced number of genes in the NO group with regard to the NYreference. This may be taken as a consequence of aging. In tumors, however, deviations are muchlarger. There are around 1000 over-expressed genes with |ē| > 5. Most striking is the similarity between the distribution functions in the TY and TO subgroups. Thatis, for tumors the distribution function in the final state is nearly independent of the age when tumorinitiates. This is an argument in favor of the attractor hypothesis. Similar results (not shown) areobtained for the sub-expression tail. A slightly different test comes from considering a second “time” or progression variable: theclinically determined tumor stage. It is a qualification given to the tumor at the moment ofdiagnosis, but in some sense it quantifies also the somatic evolution once the portion of the tissueacquires the tumor condition. Fig. 4 shows the distribution of tumors by stages in Clear Cell KidneyCarcinoma (KIRC). Normal tissues are represented by blue points, whereas tumors are drawn inred. The four panels refer to the four stages: I, II, III and IV. Blue points are in the four panels, butonly red points with the corresponding stage are included in each panel.(Insert Fig. 4) We use a contour plot and colors to visualize the total density of points in the state space. Tworegions of maximal density are apparent, corresponding to normal states and an optimal region fortumors. Naively, one expects that tumors move along the transition region from the normal to thetumor region as the stage evolves from I to IV. In the actual measurements, we don’t trackindividual tumors as function of stages, but get pictures of different tumors at different stages. Thus,in the initial stages we should observe a fraction of red points captured in the transition region,whereas in the final stages, most tumors should be concentrated in the optimal region. This is whatactually follows from the figure, again supporting the attractor paradigm. We may speculate that theoptimal region could be related to a region of maximal fitness for the tumor in the given tissue. The intuition induces us to relate the tumor stage to the coordinate along the tumor axis PC1. Thecorrespondence, however, is not exact. Although there seems to be a correlation between stage andmean displacement towards the tumor region, many tumors in the initial stages are already at thecenter of the cloud. This could be related to the fact that the observed distribution of samples isprobably related to the fitness distribution and the transition region should be a low-fitness zone. Statement 5. There is a landscape of cancer in gene expression space with an approximateborder separating normal tissues from tumors.
We consider the central goal of the paper: i.e. to draw a picture in which both normal tissues andtumors in different localizations are represented. In a way, this is a picture involving tissuedifferentiation and cancer. It is not surprising that pan cancer genes will play a role in bothprocesses.We shall use the e normal and e tumor (mean geometric) averages for each localization in order todefine cloud centers. The reference is to be computed by averaging over all normal expressionvectors. Then, the ê magnitudes and the covariance matrix are obtained, and the latter diagonalized. The first aspect to be stressed is that the first two PCs accounts only for 37 % of the total datavariance. The relative importance of these two variables is not so apparent as in the case ofindividual tissues. This is probably due to the big dispersion of the data for normal tissues, relatedto tissue differentiation, sometimes even larger than separations between a normal tissue and therespective tumor.As a consequence of the dispersion of normal tissues, we do not have a “cancer axis” or direction,s in individual tissues. In order to draw a frontier between normal and tumor regions, we shallinclude higher PCs. The next component, PC3 accounts for 12 % of the data variance.We show in Fig. 5 the (PC1, PC3) plane, which indeed suggests that there is a border. Actually, theregions and the border are high dimensional, but the 2D figure captures the essential features. Wemay baptize this figure as the “approximate normal vs cancer” or “tissue differentiation vs cancer”plane. It is apparent from the figure, that the transition from a normal tissue to the correspondingtumor implies crossing the border, and involves simultaneous displacements along the PC1 and PC3axes. (Insert Fig. 5)The unitary vectors along these axes allow the identification of genes with the highest weights. It isvery interesting that pan cancer genes are among the most important genes in these vectors. Forexample, ADH1B and UBE2C are included in the set of 8 most important genes along PC1: PI3 (+),ADH1B (-), MYBL2 (+), UBE2C (+), ALB (-), CEACAM5 (+), CST1 (+) and MMP1 (+). A more detailed analysis of the border between normal tissues and tumors is required. In the presentpaper, we limit ourselves to draw the global picture, and leave this analysis for a future work.We would like to notice that distances between pairs of tumors in gene expression space areinversely correlated with the number of shared genes which define the cancer state. This fact ispartially reflected in Fig. 5 because the distances in this figure are not true distances, butprojections. DISCUSSION
We performed PC analysis of gene expression data for 15 tumors. Our results are approximate andsemi-quantitative, in the sense that they could be modified if a larger data set become available, butat the same time they are simple, general and unbiased, in the sense that no modeling or elaboratedmathematical treatments are used. We try to keep interpretation of results as close as possible to thefacts.Both somatic evolution in a normal tissue, and the transition to a tumor state involves themodification of thousands of genes. However, these genes do not act independently, but in aconcerted way. The number of relevant PC coordinates for a tissue (or a portion of it), which seemsto be around 10, may be interpreted as the effective number of degrees of freedom of the biologicalsystem. These variables, although of statistical origin, can be used to describe the state and evolution of thetissue, in particular one of the variables measures the progression from normal to cancer state (theprojection on PC1, the cancer axis) and allows the definition of a profile of genes involved in thisprogression. The data seem to support the theory of cancer as an attractor, that is once a portion of the tissueescapes from the normal region it is driven to the cancer basin of attraction. The existence of a ranking of genes in the definition of the cancer state, allows us to speculate aboutthe possibility to stop progression if the tumor is diagnosed in the initial stage. Indeed, what wouldhappen if we target a few of the most significant genes in the cancer profile? Could this kind ofintervention induce a rearrangement of the whole profile preventing the tumor to evolve to moreadvanced stages? This is an interesting question with possible implications for therapy. uestions are also raised in relation to the second important conclusion of our paper, related to theoverall landscape in gene expression space: not only individual tissues are separated from theirrespective tumors, but the set of normal tissues are separated from the set of tumors. In a very roughanalysis, we noticed that pan cancer genes (i.e, common to all 15 tumors) are involved both in tissuedifferentiation and in the definition of the border. One possible, very interesting question is thefollowing: what could this relationship tell us about tumors in early childhood, which initiate in thedevelopmental period? On the other hand, one can imagine therapies that make use of the complexlandscape represented in Fig. 5. For example, could we target the main genes connecting a tumorand a different nearby (in GE space) normal tissue? Would this intervention reduce the tumor fitnessin the original tissue leading to a regression? In the paper, we focused on qualitative aspects. However, there is the possibility to quantify and tomodel some aspects of cancer. For example, notice that from figures like Fig. 1 we can estimate thedimensions of the normal and cancer regions in gene expression space and the distance betweentheir centers. From this data, and the statement that progression is described by a single variable onemay possibly devise a one dimensional model for tumorigenesis in a given tissue [43].Work along some of these directions is in progress [44,45].
METHODS
The results of the paper are based on the analysis of TCGA data for gene expression in FPKMformat. The number of genes is 60483. This is the dimension of matrices in the PrincipalComponent analysis.We selected the 15 cancer types shown in Table I on the basis of two conditions: i) The number ofnormal samples is greater than 10, and ii) The number of tumor samples is greater than 160.We show in Fig. S1 expressions from a typical data file (PRAD case). Notice that there are around28000 not transcribed genes (expression exactly zero), and only around 25000 genes withexpression above 0.1.Usually, in order to compute the average expression of a gene the median or the geometric mean areused. We prefer geometric averages, but then the data should be slightly distorted to avoid zeroes.To this end, we added a constant 0.1 to the data. By applying this regularization procedure, genesidentified as relevant could be under question if the differential expression is relatively low andtheir expression in normal tissues is near zero. As we are mainly interested in the strongly over- orunder-expressed genes, they are out of the question.For each cancer localization, we take the mean geometric average over normal samples in order todefine the reference expression for each gene, e ref . Then the normalized or differential expression isdefined as: ē = e/e ref . The fold variation is defined in terms of the base 2 logarithm ê = Log2 (ē).Besides reducing the variance, the logarithm allows treating over- and sub-expression in asymmetrical way. Deviations and variances are measured with respect to ê = 0. That is, with respect to the averageover normal samples. This election is quite natural, because normal samples are the majority in apopulation, individuals with cancer are rare. With these assumptions, the covariance matrix is written: = Σ ê i (s) ê j (s) / (N samples -1)where the sum runs over the samples, s, and N samples is the total number of samples. ê i (s) is the fold variation of gene i in sample s.As mentioned, the dimension of matrix σ is 60483. By diagonalizing it, we get the axes ofmaximal variance: the Principal Components (PCs). They are sorted in descending order of theircontribution to the variance. In LUSC, for example, PC1 accounts for 67% of the variance. This large number is partly due toour choice of the reference, ê = 0, and the fact that most of the samples are tumors. The reward isthat PC1 may be defined as the cancer axis. The projection over PC1 defines whether a sample isclassified as normal or tumor. The next PCs account for a smaller fraction of the variance. PC2 is responsible of 4%, PC3 of 3%,etc. Around 10 PCs are enough for an approximate description of the region of the gene expressionspace occupied by the set of samples.Thus, we need only a small number of the eigenvalues and eigenvectors of σ . To this end, we usea Lanczos routine in Python language, and run it in a node with 2 processors, 12 cores and 64 GBof RAM memory. As a result, we get the first 100 eigenvalues and their corresponding eigenvectors. Conflict of interes
The authors have no conflict of interests.
Acknowledgments
Author contributions
Conceptualization & Investigation, all authors; Software and formal analysis, A.G.; Writing –Original Draft, A.G.; Writing – Review & Editing, all authors.
EFERENCES
Chemometrics and intelligent laboratory systems
Table I. The set of data analyzed in the paper . IRC PRAD LUSC LUAD UCEC BLCA COAD READ ESCA STAD BRCA HNSC LIHC THCAKIRP
KIRC
488 507 551 493 472 537 516 558 455 468 576 462 475
PRAD
561 608 774 789 621 712 573 597 730 581 485 481
LUSC
LUAD
863 883 801 731 778 803 908 725 628 528
UCEC
BLCA
813 935 817 965 1101 785 615 562
COAD
READ
587 951 761 567 450 518
ESCA
977 697 1069 639 502
STAD
700 859 630 433
BRCA
707 555 589
HNSC
612 485
LIHC
Table II. Number of common differentially-expressed genes in pairs of localizations . ig. 1. Principal Component Analysis of the TCGA gene expression data for three of thestudied tumors . The left panel contains the (PC1, PC2) plane, whereas the right panel shows thevariance captured by the first n PCs. ig. 2. Left panel . The 30 genes with highest weights in the definition of the cancer state. The sametumors as in Fig. 1 are used as examples. The numbering of genes is the one used in the TCGA data.To simplify drawing, the contributions of the remaining genes are set to zero. Positive signscorrespond to over-expressed, and negative to sub-expressed genes. Central and right panels .Integrated gene expression distribution functions, over- and under-expression tails are shown. ig. 3 Integrated gene (over) expression distribution functions in LUSC . According to age,samples are grouped into four sets: Normal Young (NY), Normal Old (NO), Tumor Young (TY) andTumor Old (TO). The average over the NY set is used to define reference values to normalize theexpressions. Each set of points represents the average over the respective group. ig. 4. Stages in the evolution of tumors in Clear Cell Kidney Cancer (KIRC) . Blue points arenormal tissues (included in the four panels), whereas red points are tumors in a given stage ofevolution. Contours represent the total density of points. Stage I seems to be “transitional”, thereare many points traveling along the intermediate region. On the other hand, stages II, III and IV are“final”, in the sense that most of the tumors are concentrated in the high density region. This picturereinforces the attractor paradigm of cancer. We may speculate that the attractor is the region of thestate space with maximal fitness for the tumor in the given tissue. ig. 5. The gene expression landscape in the (PC1, PC3) plane . Each point in the diagramrepresents the average of samples in a given localization. For simplicity, normal tissues are labeledwith the corresponding tumor indexes. The approximate border between the normal and tumorregions is drawn. ig. S1 (Supplementary Fig. 1). Range of values in a typical data fileig. S1 (Supplementary Fig. 1). Range of values in a typical data file